Need Tool/Script to Download Text and URL of Multiple Links on Multiple Web Pages

JMichaelTX · December 14, 2016, 8:23pm

I’m sure I can firgure out how to write a script for this, but I thought I’d first just ask if anyone has already done this, or something similar?

I need to download to a CSV file the name and URL of all diseases listed by CDC on this page:

Diseases & Conditions A-Z Index

Your ROI:

I am building a AirTable DB for Medical Records, which I will share with everone when I am done.
Part of this is a table of Diseases, which has the Disease Name, and the CDC URL for more info.

As an example, taking the first disease on the list:
Abdominal Aortic Aneurysm — see Aortic Aneurysm

I need a CSV line like this:

Abdominal Aortic Aneurysm — see Aortic Aneurysm,https://www.cdc.gov/dhdsp/data_statistics/fact_sheets/fs_aortic_aneurysm.htm

The script needs to get each disease on the page, then go to the next page, and repeat until done.

Thanks for any help or shared scripts.

peternlewis · December 15, 2016, 5:29am

This will get you most of the way there. It processes all 27 pages, and creates a csv (well, actually tsv) with the first column being the named of the disease, the second column being the sub-type or see also entry, the third column is the URL, and the fourth is either level2 (sub disease) or level3 (see also). I have not processed the HTML for the Disease names, but that should be just a matter of a few search & replaces.

Diseases.kmmacros (11 KB)

JMichaelTX · December 15, 2016, 8:46pm

Peter, thank you so much! This is totally awesome!

I hope this did not distract you for long from your work on Keyboard Maestro. And, hopefully, other users will find this as a great example of the power of Keyboard Maestro in scraping (extracting) web pages.

JMichaelTX · December 15, 2016, 9:30pm

Peter, thanks again. I just ran your macro as is, and it worked perfectly.
Actually, better than I asked for:

TSV is much better than CSV
- renamed to .txt, and then it opened perfectly in Excel 2011
It captures the sub-diseases (like "Cancer, Lung) , putting the sub-disease in a separate colulmn
Your use of "levels" is very helpful to program for this

Here's an example of how it looks in Excel:

This is EXACTLY what I needed.

JMichaelTX · December 16, 2016, 1:52am

###Results
For anyone interested, here is an Excel file of the complete disease list from:
[CDC Diseases & Conditions A-Z Index]
(https://www.cdc.gov/DiseasesConditions/az/a.html)

CDC Disease List Clean.xlsx.zip (44.4 KB)

Thanks to Peter's great macro, I was able to easily extract all of the data from the CDC disease pages. Then, with just a few RegEx tricks in TextWrangler, I cleaned up the extraction and moved some data to other columns. The cleanup had nothing to do with the extraction -- that was perfect. It was due to the alternate names that CDC embedded in the main name field.

I hope to have my AirTable Database app ready to publish/share within a few days.

Here is what my Diseases table is looking like for now.
AirTable is very powerful, but also veryf easy to use. Here I easily created two formula fields to dynamically generate the URL for search of the disease at CDC and WebMD.

ccstone · December 16, 2016, 2:10am

Not quite as refined as Peter's, but a trifle faster.

-Chris

Download CDC Data → Convert to TSV file → Open in Excel v1.0.kmmacros (5.8 KB)

demirtas1 · October 24, 2017, 11:47am

am I right in assuming that this macro scrapes the web page without even loading it up?

I had asked a similar question in a recent another post

ccstone · October 26, 2017, 5:01pm

Yes.

-Chris

Need Tool/Script to Download Text and URL of Multiple Links on Multiple Web Pages

Options