I’m sure I can firgure out how to write a script for this, but I thought I’d first just ask if anyone has already done this, or something similar?
I need to download to a CSV file the name and URL of all diseases listed by CDC on this page:
Diseases & Conditions A-Z Index
I am building a AirTable DB for Medical Records, which I will share with everone when I am done.
Part of this is a table of Diseases, which has the Disease Name, and the CDC URL for more info.
As an example, taking the first disease on the list:
Abdominal Aortic Aneurysm — see Aortic Aneurysm
I need a CSV line like this:
Abdominal Aortic Aneurysm — see Aortic Aneurysm,https://www.cdc.gov/dhdsp/data_statistics/fact_sheets/fs_aortic_aneurysm.htm
The script needs to get each disease on the page, then go to the next page, and repeat until done.
Thanks for any help or shared scripts.
This will get you most of the way there. It processes all 27 pages, and creates a csv (well, actually tsv) with the first column being the named of the disease, the second column being the sub-type or see also entry, the third column is the URL, and the fourth is either level2 (sub disease) or level3 (see also). I have not processed the HTML for the Disease names, but that should be just a matter of a few search & replaces.
Diseases.kmmacros (11 KB)
Peter, thank you so much! This is totally awesome!
I hope this did not distract you for long from your work on Keyboard Maestro. And, hopefully, other users will find this as a great example of the power of Keyboard Maestro in scraping (extracting) web pages.
Peter, thanks again. I just ran your macro as is, and it worked perfectly.
Actually, better than I asked for:
- TSV is much better than CSV
- renamed to .txt, and then it opened perfectly in Excel 2011
- It captures the sub-diseases (like "Cancer, Lung) , putting the sub-disease in a separate colulmn
- Your use of "levels" is very helpful to program for this
Here's an example of how it looks in Excel:
This is EXACTLY what I needed.
For anyone interested, here is an Excel file of the complete disease list from:
[CDC Diseases & Conditions A-Z Index]
CDC Disease List Clean.xlsx.zip (44.4 KB)
Thanks to Peter's great macro, I was able to easily extract all of the data from the CDC disease pages. Then, with just a few RegEx tricks in TextWrangler, I cleaned up the extraction and moved some data to other columns. The cleanup had nothing to do with the extraction -- that was perfect. It was due to the alternate names that CDC embedded in the main name field.
I hope to have my AirTable Database app ready to publish/share within a few days.
Here is what my Diseases table is looking like for now.
AirTable is very powerful, but also veryf easy to use. Here I easily created two formula fields to dynamically generate the URL for search of the disease at CDC and WebMD.
Not quite as refined as Peter's, but a trifle faster.
Download CDC Data → Convert to TSV file → Open in Excel v1.0.kmmacros (5.8 KB)
am I right in assuming that this macro scrapes the web page without even loading it up?
I had asked a similar question in a recent another post