Automatic search

I’m working on making a workflow to perform the following:
Search news websites (NY Times, Washington Post, WSJ) for preset keywords, and then if there is an article no older than, say, 2 days that contains those keywords, I want to open it and copy it to nvALT or Evernote. I have started to practice writing scripts using AppleScript, so I have figured out how to select an entire article and copy and paste it into a notebook. But how do I make the search automatic? If you could at least nudge me in the right direction, I would appreciate it.

Hey There,

If you want help with this, you need to provide a more detailed description of the workflow you want to automate.

For example:

A) What exact URLs do you want to search?

B) What software are you using to view those pages - or are you wanting to scrape them using something like curl or wget?

Scraping web pages for information ranges from fairly simple to grossly difficult, depending upon the complexity and variability of the pages in question.


Best Regards,
Chris

Thank you for your response ccstone. I’d like to search the following URL: http://www.nytimes.com/pages/politics/index.html?action=click&pgtype=Homepage&region=TopBar&module=HPMiniNav&contentCollection=Politics&WT.nav=page

It's the New York Times Politics section. As for your second question, I use Chrome as my browser - I'm not sure if that's what you were asking. I haven't tried using either curl or wget, but I'm willing to learn if they will help me get the job done.

Hey There,

Download and install the freeware text editor TextWrangler.

Then load the URL you posted above in Chrome and run the following AppleScript from the AppleScript Editor:

tell application "Google Chrome"
  tell active tab of front window
    -- Ask the page for its full HTML source via JavaScript
    set _text to execute javascript "window.document.documentElement.outerHTML"
  end tell
end tell
tell application "TextWrangler"
  -- Dump the source into a new TextWrangler document for inspection
  make new document with properties {name:"NY Times Source", text:_text}
  set zoomed of front window to true
  activate
end tell

That’s one way to get the source, which you can then parse using something like the Satimage.osax or Perl, Python, Ruby, etc.
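To give you a feel for the parsing step, here's a minimal sketch in Python using only the built-in html.parser module. It pulls every link's URL and visible text out of a page source string; the actual NY Times markup is messier than this, so treat it as a starting point, not a finished scraper.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in the page source."""
    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> we are currently inside, if any
        self._chunks = []  # text fragments collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.links.append((self._href, text))
            self._href = None

def extract_links(html):
    """Return a list of (href, link text) pairs found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

You would feed it the source you captured in Chrome, then filter the resulting pairs for your keywords.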

In Safari you can get just the text in the front window like this:

tell application "Safari"
  tell front document
    -- The document's text property holds the rendered page text, not the HTML
    set docSrc to its text
  end tell
end tell

Sometimes that text is formatted in a way that’s not too hard to parse, sometimes it’s not.

There may be a way to do this in Chrome using JavaScript, but I don’t know it.

Search Google for: methods of web scraping

If you really want to do this be prepared to spend a lot of time and effort learning how.


Best Regards,
Chris

Awesome! Thanks Chris. I have found that there are many services that will scrape pages for free, such as import.io or Scrapy. I’d still rather learn to do it myself. Since it seems like I will have to learn another language to work with internet objects anyway, could you suggest one? I’m leaning towards Python because of its easy syntax and wide applicability.

Hey There,

I'm quite interested in Python, because of the reasons you mention and because of the way it uses whitespace to delineate code. I find it to be very readable and will probably tackle it next - after I get a little more proficient with Perl.

Therefore I deem it to be a good choice, although I cannot say so unequivocally due to my lack of experience with the language.
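If it helps you get started, here's a rough sketch of the keyword-and-recency filter from your original question. The article fields and the two-day window are assumptions on my part - a real version would need to parse dates out of whatever page or feed you scrape.

```python
from datetime import datetime, timedelta

def recent_matches(articles, keywords, now=None, max_age_days=2):
    """Return articles whose title contains any keyword (case-insensitive)
    and whose publication date is no older than max_age_days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    wanted = [k.lower() for k in keywords]
    hits = []
    for art in articles:  # each art is assumed to be {"title": str, "published": datetime}
        title = art["title"].lower()
        if art["published"] >= cutoff and any(k in title for k in wanted):
            hits.append(art)
    return hits
```

From there, opening each hit and sending it to nvALT or Evernote could stay in AppleScript, driven by the URLs this filter returns.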

--
Best Regards,
Chris