What Is the Best Way to Automate Download and Processing of Large Amounts of Data from a Web Site?

Hello everyone!

After using Keyboard Maestro for some time to build ever-growing macros for my work, I am getting the feeling that they have become so complex and big that I either need to switch to actual programming or really improve my data handling.

Example: I use KM to access a web interface of a database and download the data in order to do something with it, but it is a CSV with 32 columns and a few hundred lines/entries.

The resulting regex (sheesh) for handling this data line by line is 479 characters long and isn't even finished yet.

So, I wonder if I should use some kind of programming and a database instead of dealing with all that data in KM with 32+ variables.

What would you recommend? Which programming language? Stick to KM instead?
I am really stumped as to how to progress from my status quo, you see...

I hope you don't mind that I have revised your topic title to better reflect the question you have asked.

FROM:
From Macros to Programming - where to go from here?

TO:
What Is the Best Way to Automate Download and Processing of Large Amounts of Data from a Web Site?

This will greatly help you attract more experienced users to help solve your problem, and it will help future readers find both your question and its solution.

To be more helpful, we need more details about your workflow:

  1. Web page URL
  2. Exactly what processing is required?
  3. What is the destination of the data?

Having said that, perhaps I can offer some general suggestions:

  1. Use Excel as the intermediate destination for your data
    • Save CSV data to a file and open/import using Excel
    • If you can download tab-delimited data from your Web site, either to the Clipboard or to a file, you can easily paste it into Excel or open that file.
  2. If you have complex processing, then automate Excel with either Excel VBA or AppleScript.
  3. If you are building your own dataset, then consider storing all data ultimately in a SQL database.
  4. If your KM Macros are long and complex, consider breaking the Macro down into logical segments, with each segment as a separate Macro. Then call these "sub-macros" using the KM Execute Macro action.
    • This may allow you to reuse blocks of KM Actions in multiple Macros
    • Makes it easier to test and maintain
  5. If your Macros are very complex, it might be beneficial to use a scripting language like AppleScript or JavaScript for Automation (JXA); see the sketch after this list.
    • It just depends on what you are doing.
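
To make item 5 a bit more concrete: a scripting language gives you a real CSV parser, so you never need a giant regex just to split the columns. Here is a minimal sketch in Python, purely as an illustration (the same idea works in JXA or a shell script); the file name and the ";" delimiter are assumptions you would adjust to your actual export:

import csv

# Minimal sketch only: "export.csv" and the ";" delimiter are assumptions.
with open("export.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter=";"):
        # Each row is already a list of column values, so fields are picked
        # by index instead of by capture groups in a regex.
        record_id = row[0]
        print(record_id, len(row))

Once the columns arrive as a list like this, the downstream processing (filtering, comparing, writing to a database) becomes much simpler than juggling 32+ variables.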

Thank you, there are some very useful hints in your reply.

The website/database is not publicly accessible, but I can give a sample of the data typically contained.

The file downloaded contains a few hundred blocks of data. One block looks like this:

"GH00012345-00000001";1;"GH00012345";81;"Company Name LLC";;;"220122";"Article Name";"";"101:113";;;"0.00";"kg";;;;276;"GH00012345";1;"";"";"";"";;"";"";"";"";"X";""

The first column is the ID and is unique to the block. The other contents can vary quite a bit; it is possible to group them by the location ID "GH00012345", for example.

My goal ultimately is to:

  1. Download the CSV every day (using KM and the web interface)
  2. Compare the blocks (preferably by using the ID in the first column)
  3. Report if any new data has been added. Data also gets removed occasionally, so simply counting the blocks of data doesn't help.

Ideally, I should put all blocks or IDs into a database and compare the downloaded versions? Like...

  1. I downloaded this set of IDs, store it in database with a datestamp
  2. I downloaded this set today, store it in DB with new datestamp
  3. Report a difference of IDs between the two

I just don't know how to get to this point.
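
Roughly, what I imagine is something like this sketch (Python, using only the built-in csv and sqlite3 modules; the file name, database name, table name, and column names are placeholders I made up):

import csv
import sqlite3
from datetime import date

con = sqlite3.connect("qsdb.sqlite")  # placeholder database file
con.execute("CREATE TABLE IF NOT EXISTS downloads (proben_id TEXT, download_date TEXT)")

# Steps 1 and 2: store today's set of IDs with a datestamp.
today = date.today().isoformat()
with open("export.csv", newline="", encoding="utf-8") as f:  # placeholder file name
    for row in csv.reader(f, delimiter=";"):
        con.execute("INSERT INTO downloads VALUES (?, ?)", (row[0], today))
con.commit()

# Step 3: report IDs that are in today's download but not in an earlier one.
previous = "2024-05-01"  # placeholder datestamp of an earlier download
new_ids = con.execute(
    "SELECT DISTINCT proben_id FROM downloads WHERE download_date = ? "
    "AND proben_id NOT IN (SELECT proben_id FROM downloads WHERE download_date = ?)",
    (today, previous),
).fetchall()
print([r[0] for r in new_ids])
con.close()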

For reference, here is the RegEx I created, which works reliably on each block of data:

"((GH[0-9]{8}|OGK[0-9]{12})-[0-9]{1,9})";([1-5]);"(GH[0-9]{8}|OGK[0-9]{12})";([0-9]{2,4});(".+?"|""|);("4041755973723"|);(".+?"|""|);("[0-9]{1,7}\/[0-9]{1}"|"[0-9]{1,7}"|);(".+?"|""|);(".+?"|"");("[0-9:]+?"|);([0-9]{2}\.[0-9]{2}\.[0-9]{4}|);("[0-9]{2}\:[0-9]{2}"|);"([0-9.]+?)";"(kg|St\.|g|)";([0-9]{3}|);("[0-9]{2}"|);("[A-Za-z-öäü]{1,22}"|);([0-9]{3}|);"(GH[0-9]{8}|OGK[0-9]{12})";(1|2|3|4);"(OGK[0-9]{12}|)";(".+?"|""|);(".+?"|""|);(".+?"|""|);(1|2|3|4|5|6|10|);(".+?"|""|);(.+?|"");(.+?|"");"(X|)";"(X|)";"(X|)"

Here's a small update:
Following this recommendation, I installed SQL Pro Studio and created an SQLite database with two tables, "QSDBold" and "QSDBnew". With a query I found on Stack Overflow, it is possible to filter on the unique identifier (column 1) like this:

Listing the lines which have been removed:

SELECT DISTINCT "Proben-ID"
FROM QSDBold
WHERE "Proben-ID" Not IN 
(SELECT DISTINCT "Proben-ID" FROM QSDBnew)

And by switching new and old in the query, it will list the lines that are new:

SELECT DISTINCT "Proben-ID"
FROM QSDBnew
WHERE "Proben-ID" Not IN 
(SELECT DISTINCT "Proben-ID" FROM QSDBold)

Now, I am just confused about how to get KM and SQL Pro Studio to work together. Can SQL Pro Studio be scripted? Do I need to use interface scripting with KM? Is there a better solution?