Thursday, November 21, 2013

Aggregating Aggregation: "Exception Handling or: How I Learned to Stop Worrying and Let R Handle Imperfections"

The code for parsing Indeed.com by keyword and area in the previous post is somewhat primitive. It looks for exact patterns and, based on the usual syntax of Indeed's HTML, subtracts a fixed number of characters to produce a usable link that I can re-paste.
I think the error I have experienced comes from this fixed-number subtraction either cutting off a relevant character or leaving an irrelevant one. It works for most links, but not all. Ideally, I would just skip the exceptions and let the loop continue.

R does this through tryCatch(), whose documentation is confusing. Reading through this posting from the Working With Data blog helped me write a function that loads as many pages of Indeed as specified.





tryCatch() takes your code and runs it. If an error or warning occurs, control is redirected to the matching "handler" function, which is supplied as an argument to tryCatch(). Because that redirection happens without halting the enclosing loop, my problem seems avoided, although I don't entirely understand what happens internally. I ran through 5 pages of Indeed postings for "analyst", and it spit out 50 instances of the expected output (as shown in the previous post).
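My original function isn't reproduced here, but the pattern described above can be sketched minimally as follows. The helper scrape_posting() is a hypothetical stand-in for the line that actually parses each posting's HTML:

```r
# Sketch: wrap the fragile scraping step in tryCatch() so one bad link
# doesn't kill the whole loop. scrape_posting() is a hypothetical
# stand-in for the real parsing code.
scrape_all <- function(links) {
  results <- list()
  for (link in links) {
    out <- tryCatch(
      scrape_posting(link),            # the step that sometimes fails
      error = function(e) {
        message("Skipping ", link, ": ", conditionMessage(e))
        NULL                           # return NULL so the loop continues
      },
      warning = function(w) {
        message("Warning on ", link, ": ", conditionMessage(w))
        NULL
      }
    )
    if (!is.null(out)) results[[length(results) + 1]] <- out
  }
  results
}
```

The key design point is that the handlers return a value (here NULL) in place of the failed expression, so the loop body always gets something back and simply skips the bad entries.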

Now I need to figure out what kind of output I want this to give, how I can speed up the execution, and how to automate a list of user inputs covering multiple search terms, i.e., "analyst", "physics", ... rather than the single "analyst".
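Handling multiple search terms could be as simple as a loop over a character vector. In this sketch, search_indeed() is a hypothetical name for the scraping function from the previous post:

```r
# Hypothetical sketch: run the same scrape for several keywords.
# search_indeed() stands in for the scraping function described earlier.
search_all <- function(keywords, area, radius, pages) {
  results <- lapply(keywords, function(kw) {
    search_indeed(terms = kw, area = area, radius = radius, pages = pages)
  })
  names(results) <- keywords   # label each result set by its keyword
  results
}
```

A call like search_all(c("analyst", "physics"), "sacramento ca", 100, 5) would then return one named result set per keyword.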

But the legs are operational!

Wednesday, November 20, 2013

Aggregating Aggregation: Finding Relevant Jobs

The job hunting process is like climbing a mountain. Each time you crest a ridge, you assume it is the top, only to be disappointed when there is another ascent. C'est la vie. To that end, I'm trying to implement some robotic legs to do my grunt work up the next ascent.

This is a function that takes arguments specifying a search on Indeed.com, reads the results page, finds the job links, and scrapes the third-party redirect postings for relevant keywords. The non-function implementation works well, but this version bugs out quickly.




Example inputs: "entry level analyst" for the search terms, "sacramento ca" for the geographic area, 100 for the radius, and 1 for the number of pages to search. The bug appears on the fifth third-party scrape, with the values above as the defaults. Debugging the function takes me to the line that scrapes the HTML, but I'm still not sure where the error comes from. Obviously a work in progress, but if I can automate looking through 60 pages of Indeed.com a day, I can save myself a nice chunk of time.
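The function itself isn't shown here, but its skeleton might look something like the following. The query-string fields (q, l, radius, start) are Indeed's real search parameters; the link-matching pattern is a simplified placeholder for the fixed-offset parsing described above:

```r
# Sketch of the search function: build the Indeed results URL for each
# page, read it, and collect lines that look like job links. The
# "clk?jk=" pattern is a simplified stand-in for the real link parsing.
search_indeed <- function(terms, area, radius, pages) {
  base <- "http://www.indeed.com/jobs"
  links <- character(0)
  for (p in seq_len(pages) - 1) {
    url <- paste0(base,
                  "?q=", URLencode(terms),
                  "&l=", URLencode(area),
                  "&radius=", radius,
                  "&start=", p * 10)       # Indeed paginates in steps of 10
    page <- readLines(url, warn = FALSE)   # raw HTML, one line per element
    hits <- grep("clk?jk=", page, value = TRUE, fixed = TRUE)
    links <- c(links, hits)
  }
  links
}
```

From there, each collected link would be cleaned up and passed to the third-party scraping step where the bug currently appears.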