Macro Request: Get table from web page and save as CSV file

eBoof · August 2, 2015, 1:57am

I like the work done in this thread Macro Request: Get webpage elements by class
I tried to apply it to my need which is to take the table of data from this web page and save it to a CSV file.
http://www.asx.com.au/asx/statistics/prevBusDayAnns.do
Eventually I’d like to add it to an MySQL database but that’s for me to do later.
ATM I’m afraid my skills are not up to it. I think I’m at the age where I’m starting to forget more than I ever knew.
Can anyone help to get me started.
Cheers
Rob

JMichaelTX · August 2, 2015, 6:33am

Do you have Excel?

If so, you might try an Excel Web Query.
Excel will read the web page, and pull of all tables on that page.
You could then save as CSV.

For more info see:
Using a Web Query to Load Tables in Excel 2011 for Mac

ccstone · August 2, 2015, 7:59am

Hey JM,

I can't get that to work with those instructions. Can you?

-Chris

JMichaelTX · August 2, 2015, 8:12am

Yeah but it's a bit flaky.

For some dumb reason, Excel will Not read any type of text file for the web query except the old MS-DOS encoding. So you have to use MS Word (I couldn't get TextEdit or TextWrangler to save properly) to save the URL as a MS-DOS text file.

Then it will open in Excel.

Where did it fail with you?

JMichaelTX · August 2, 2015, 8:18am

It was a bit slow, but it finally pulled the table from
http://www.asx.com.au/asx/statistics/prevBusDayAnns.do

korm · August 2, 2015, 3:13pm

The table at the asx.com site linked by the OP pastes really nicely – no drama – no Microsoft weirdness – to Numbers (3.5), and can be exported at once to CSV. Even the PDF links work in the Numbers sheet.

ccstone · August 2, 2015, 4:55pm

Hey JM,

Ah, I missed the MS-DOS text step — so it completely failed to do the query.

Hey Ed,

Oh, nice! I've never liked Numbers, but I do appreciate when things just work.

I did it the hard way:

I ran the source of the page through html2text from Macports to get the table nice and pretty, and then I scarfed the links with AppleScript/JavaScript.

curl -L --user-agent 'Opera/9.70 (Linux ppc64 ; U; en) Presto/2.2.1' \
--url 'http://www.asx.com.au/asx/statistics/prevBusDayAnns.do' \
| html2text -style pretty

------------------------------------------------------------
# Extract Links from front Safari Window
------------------------------------------------------------
set regexStr to "display=pdf"
set linkList to safari_links(regexStr, "*", "href")

------------------------------------------------------------
--» HANDLERS
------------------------------------------------------------
on safari_links(regexStr, tagName, tagType)
  set js to "function in_array (array, item) {
  for (var i=0; i < array.length; i++) {
    if ( array[i] == item ) {
      return true;}}
  return false;}
  var a_tags = document.getElementsByTagName('" & tagName & "');
  var href_array = new Array();
  var reg = new RegExp(/" & regexStr & "/i);
  for (var i=0; i < a_tags.length; i++) {
    var href = a_tags[i]." & tagType & ";
    if ( reg.test(href)) {
      if ( !in_array(href_array, href)) {
        href_array.push(href);}}}
 href_array;"
  
  try
    tell application "Safari" to set linkList to do JavaScript js in document 1
    if linkList = missing value then set linkList to {}
  on error
    set linkList to {}
  end try
  
  return linkList
  
end safari_links
------------------------------------------------------------

All of this is fairly fast, but there's cleanup and assembly work to do.

I like the Excel and Numbers methods better in this case.

-Chris

ComplexPoint · August 2, 2015, 5:24pm

The GUI route to getting a copyable selection can take a little bit of slightly viscous dragging and selection boundary fiddling on a large webpage table.

Doing it with this table prompted me to sketch a simple 'extend selection to whole table' macro for Safari and Chrome:

eBoof · August 3, 2015, 12:43am

Thanks to everyone who contributed to this thread!
The route that CP takes is probably the way I’ll go.
Curiously if I cut and paste that into TextEdit (in a rich textbook) it keeps the links.
CP, I wondered whether there was a way to grab the table in javascript from the table tag. The problem with that would be if there are more than one table.
I can automate the process by locating the window in a certain position and then selecting a cell based on that. Is there a cleaner way to do that?
Extracting the pdf links to put then in a database maybe more difficult but thanks very much for the kickstart.
Cheers
Rob

ComplexPoint · August 3, 2015, 2:13pm

If you, or your office, are happy to do a little JavaScript writing, you could adapt the ‘select whole table’ script:

Use the same process to get a reference to the whole of a selected table, and then
walk through the TR rows of the table, harvesting either the .textContent, or the .innerHTML, of each TD cell, or the .href and .text of each enclosed <A> link element.

ccstone · August 7, 2015, 4:10pm

Hey Rob,

Not so curious. This feature has been around a long time. TextEdit and other Cocoa RTF-aware editors like Jedit X do a very respectable job of converting data copied and pasted from Web Browsers

I'm sure there's a way to find and select a specific cell with JavaScript, but there's a very simple way to manage for the moment:

A) Set the FIND PASTEBOARD with Keyboard Maestro and then type Cmd-G to find next.
• In the case of your trading table a unique item to find is "Headline".

B) Activate CP's select-table macro.

As for getting the links to the PDFs that's really easy to do just by parsing the HTML. It's cleaner to do with JavaScript, but you have to know how.

It would take me about 15-20 minutes to write an AppleScript to parse all the elements out of that page, but I would use the Satimage.osax AppleScript Extension to provide the regular expression support.

-Chris

eBoof · August 7, 2015, 9:51pm

Thanks Chris, Awesome!
This gives me lots to think about. For the moment I’m pottering through using KM macros and mixing them with excel macros step by step. I would say it was going well except for Excel’s unexpected behaviour especially with the clipboard It’s a PITA. If I select a cell and paste, it pastes one something. If I place the cursor in the cell and paste it pastes something else.
ATM I’m using Excel for a UDF on extracting those links and for arranging the file name in the format ASXAnns150807 and then saving as csv. Each of those steps would be easy if it weren’t for the peculiarities of Excel . But I’m getting there.
My problem is that because I don’t do it everyday I have to refamiliarise myself with each of the coding. KM, Excel macros, Javascript, Applescript and Filemaker when I use that. But I’m loving the journey.
When I get this done I’ll start again and use Javascript and Applescript and see if I can get rid of Excel in the process.
Right now I’m going to take a break from it and look at Smile. Thanks to you! Cheers and thanks Rob

JMichaelTX · August 7, 2015, 10:31pm

If you are thinking of learning more about one of these, and you often need to manipulate data from a web page, then JavaScript is probably a good choice.

See

eBoof · August 8, 2015, 3:05am

Thanks for that JM. I’m waiting on Javascript - the Good Parts to come from the Book Depository. I ordered it a few days ago after reading about it on another thread on here.
I’m keen to learn (Ha, I’m in my 60s) but life or work will probably get in the way. And next time I need it I’ll need to famil myself with it all over again.
I will try to keep at it every week because I enjoy it.
We’ll see.
Cheers
Rob

JMichaelTX · August 8, 2015, 3:32am

Rob, consider using Kindle. It is almost always cheaper than the printed edition, and it makes the book instantly available on every device you own.

I'm just getting started reading this, but I am very impressed with the unique approach of this book:
A Smarter Way to Learn JavaScript -- by Mark Myers

ccstone · August 8, 2015, 7:02pm

Hey Rob,

Here's why it's generally better to do things as directly as possible.

I used a fairly brute-force approach here, because I find the table-rows I want and then systemically strip off the junk I don't want.

If I were doing this for myself I would eventually streamline the script, but this way I was able to do the job in about 20 minutes or less.

You open the page in Safari and run the script.

A fraction of a second later you have a tab-delimited, date-stamped report on your Desktop ready to be imported or pasted into Excel (and that could be automated too).

Note that you don't need Smile — just the Satimage.osax — not that I'm discouraging anyone from using Smile — it's a very powerful if difficult to master automation tool.

The Satimage.osax installer installs 1 osax file:

/Library/ScriptingAdditions/Satimage.osax

It's safe. I've been using it since 2003.

-Chris

Edit: 2015/08/09 15:03 — I am highly P.0.'d to find that the forum is rendering text in a code-block and mangling it. Therefore I am posting a text script (.applescript) file for download.

Tab-Delimted-Report.applescript.zip (1.5 KB)

------------------------------------------------------------
-- Create Tab-Delimted Report on Desktop from:
-- http://www.asx.com.au/asx/statistics/prevBusDayAnns.do
-- REQUIRES SATIMAGE.OSAX
-- http://tinyurl.com/dc3soh
-- USE THE DOWNLOADABLE FILE ABOVE AND NOT THIS TEXT!!!!
------------------------------------------------------------
set _text to safariOuterHTML()
set foundList to fnd("(?m)^<tr class=\"(?:altrow)?\">[\\n\\r]<td>(?-i)[A-Z]{3}(?i)</td>.+?</tr>", _text, true, true) of me
set foundList to cng("\\r", "\\n", foundList) of me
set foundList to cng("<tr class=\"(?:altrow)?\">\\n", "", foundList) of me
set foundList to cng("</?td>", "", foundList) of me
set foundList to cng("</tr>", "", foundList) of me
set foundList to cng("<td class=\"pricesens\">.+title=[[:punct:]](price sensitive)[[:punct:]]>", "\\1", foundList) of me
set foundList to cng("<a onclick.+?href=[[:punct:]]", "", foundList) of me
set foundList to cng("[[:punct:]]>PDF</a>", "", foundList) of me
set foundList to cng("(/asx/)", "http://www.asx.com.au\\1", foundList) of me
set foundList to cng("&amp;", "&", foundList) of me
set foundList to cng("^[[:blank:]]*$\\n", "", foundList) of me
set foundList to cng("\\A\\s+|\\s+\\Z", "", foundList) of me
set foundList to cng("\\n", "\\t", foundList) of me
set foundList to join foundList using return
set reportName to "ASX Announcements " & (strftime (current date) into "%Y.%m.%d · %H.%M.%S")
writetext foundList to ("~/Desktop/" & reportName) & ".txt"

------------------------------------------------------------
--» HANDLERS
------------------------------------------------------------
on cng(_find, _replace, _data)
  change _find into _replace in _data with regexp without case sensitive
end cng
------------------------------------------------------------
on fnd(_find, _data, _all, strRslt)
  try
	find text _find in _data all occurrences _all string result strRslt with regexp without case sensitive
  on error
	return false
  end try
end fnd
------------------------------------------------------------
on safariOuterHTML()
  tell application "Safari"
	tell front document to return do JavaScript "window.document.documentElement.outerHTML"
  end tell
end safariOuterHTML
------------------------------------------------------------

eBoof · August 8, 2015, 10:49pm

Thanks Chris.
I’m having trouble running the script. Does it matter which editor I am are using? In both the Apple Script editor and Smile I get syntax errors
"`"
Found unknown token
onclick
Syntax error
expected “,” or “)” but found identifier

…and so on.

Feeling quite dumb that I don’t know this.
Thanks again for your help.

ccstone · August 9, 2015, 8:10pm

Hey Rob,

Don't feel dumb. Apparently the KM Forum software (Discourse) is rendering text in a code-block — which should never happen. I will be posting a bug report to them shortly.

That's what's causing this unknown token error.

To compensate I have posted a downloadable text-script in the original post right above the script.

-Chris

peternlewis · August 10, 2015, 1:31am

Yes, a curious bug, it replaces slosh-slosh-1 with slosh-backquote. Almost certainly it is doing a regex substitution for slosh-1.

I edited your post to use applescript Fenced code blocks to see if that would work, but it suffers the same problem. And I had to change the comments character from # to – or it does not highlight the comments correctly.

I left the change, but it didn’t help, although the syntax coloring is nice.

peternlewis · August 10, 2015, 1:38am

Hmm, the applescript Fenced code blocks do support # comments, just not that particular # comment.

Reported to Discourse meta.

Macro Request: Get table from web page and save as CSV file

Options