Web Scraping Multiple Items With XPath / QuerySelector

christerdk · December 5, 2022, 1:55am

Hi All,

For the sake of anyone trying to do something similar in the future, I decided to do this write-up so the solution is easy to find without scrolling through the whole discussion.

The division of work is as follows:

Create a JavaScript function to traverse all nodes and pick up the URLs I needed (jpg and mp4 links). The result is a linefeed-separated list. Currently I'm running this manually, to keep the two tasks separate and to keep an eye on the function, but it is possible to merge it into the KM macro later. I used QuerySelector to get all the elements needed, but to gather the specifics I had to test for the presence of child elements and so on. I developed the script "raw" in the Chrome console (no IDE on this Mac). There might be a smarter way.
The responsibility of the KM macro is to iterate over the collected links: for each one it opens a Chrome tab, waits for it to download, presses Command-S, waits for the dialog with the Save button to appear, and saves.
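
For illustration, the first part could look roughly like the sketch below. This is not the script from the thread: the function name collectMediaUrls, the selector, and the extension filter are all assumptions, and it leaves out the page-specific checks for child elements mentioned above. It only shows the general shape: query the elements, resolve each link, keep the jpg and mp4 ones, and return them one per line.

```javascript
// Hypothetical helper: the name, the selector, and the extension filter are
// assumptions for illustration, not the script used in the original thread.
function collectMediaUrls(root, baseUrl) {
  const wanted = /\.(jpe?g|mp4)$/i; // file types to keep
  const seen = new Set();           // de-duplicates while keeping order

  // Gather candidate elements in one pass; adjust the selector to the page.
  const nodes = root.querySelectorAll("img[src], video[src], source[src], a[href]");

  for (const node of nodes) {
    const raw = node.getAttribute("src") || node.getAttribute("href");
    if (!raw) continue;

    let url;
    try {
      url = new URL(raw, baseUrl); // resolve relative links against the page URL
    } catch (e) {
      continue; // skip values that are not valid URLs
    }

    // Test the path only, so query strings such as ?size=large do not matter.
    if (wanted.test(url.pathname)) seen.add(url.href);
  }

  // One URL per line, which is the format handed over to the KM macro.
  return Array.from(seen).join("\n");
}
```

In the Chrome console this would be called as collectMediaUrls(document, location.href), and the returned text can be copied into whatever the macro reads from.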

For those new to KM, I recommend a divide and conquer approach:

That is, not everything has to be automated in one go. Viewed generically, the solution has two parts: gathering the data in JavaScript, then processing it in a KM macro. Each part has its own set of challenges, so I recommend figuring out the data contract first (in this case the handover of a list of URLs, but it can be whatever is necessary) and working from there on each part.
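
One way to pin the contract down is a small check that can be run on the handover text before the macro ever sees it. The sketch below is only an example: the function name checkHandover and the rules it enforces (one absolute http or https URL per line, path ending in .jpg, .jpeg or .mp4) are my assumptions about what such a contract might be.

```javascript
// Hypothetical contract check; the name and the rules are assumptions.
// Returns a list of problems; an empty list means the text fits the contract.
function checkHandover(text) {
  const problems = [];

  text.split("\n").forEach((line, index) => {
    const where = `line ${index + 1}`;

    if (line.trim() === "") {
      problems.push(`${where}: empty`);
      return;
    }

    let url;
    try {
      url = new URL(line); // throws for relative or malformed URLs
    } catch (e) {
      problems.push(`${where}: not an absolute URL`);
      return;
    }

    if (url.protocol !== "http:" && url.protocol !== "https:") {
      problems.push(`${where}: unexpected protocol ${url.protocol}`);
    } else if (!/\.(jpe?g|mp4)$/i.test(url.pathname)) {
      problems.push(`${where}: not a jpg or mp4 link`);
    }
  });

  return problems;
}
```

With a check like this, the JavaScript half can be tested on its own, and the macro half can be built against a short hand-written list that passes the same check.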

Noteworthy resources:

Execute JavaScript in Browser Actions

Need Help with Using KM Variables in JavaScript

(This one is an extension of the previous link and shows how a function can be implemented and return a value.)

Lines In collection

(By using linefeeds as the separator, one avoids having to split the string on an arbitrary separator in KM. YMMV, depending on how complex your data is.)

A big thank you goes out to @ccstone and @Nige_S for their help!
