How to Collect All Matching Links on a Browser Page?

MitchellModel · November 7, 2016, 3:03pm

I want to collect all links on the current browser page that match a regular expression (or just all the links — I can handle the regular expression). Ideally I could choose between text, HTML, RTF, and Markup, but would be content with putting RTF on the clipboard.

Really, I just need the URLs of the links themselves, but better would be the title and URL, and any kind of simple text format would be fine.

Working with any one browser is fine (preferably Safari, as the standard), but support for Safari, Chrome, and Firefox would be ideal. (I know that Firefox cannot be driven from KM nor scripted from AS.)

I have read a number of pages related to this, many with very long discussions and proposed macros, and I'm sure there are lots more. The ones I read include:

Copy selected link(s) from browser as markdown? [long discussion, lots of examples, some macros in various languages]
Set Clipboard to RTF Hyperlink & Plain Text MD Link
How do I get a link over which my mouse cursor hovers?
Google Search - Open the First 10 Links in Google in a New Tab

ComplexPoint · November 7, 2016, 3:51pm

I would probably just grab the text of all links and post-process.

(function () {
    'use strict';

    // document.links is array-like; slice.call copies it into a real Array
    return Array.prototype.slice.call(document.links)
        .map(function (x) {
            return x.href;
        })
        .join('\n');
})();

( If JavaScript is a tool that comes readily to hand for you, you could do the regex filtering with a .filter() in lieu of that .map() )

MitchellModel · November 7, 2016, 4:28pm

Oh that was dumb of me. Haven’t done client side JS in quite a while — document.links — of course.

ComplexPoint · November 7, 2016, 7:21pm

These things do seem to declutter from memory as soon as one puts them aside for a moment ...

You could, of course, pass in any regexes as KM variables:

(function () {
    'use strict';

    var rgx = RegExp(document.kmvar.rgx1);

    return Array.prototype.slice.call(document.links)
        .filter(function (x) {
            return rgx.test(x.href);
        })
        .join('\n');
})();

In Sierra onwards, ES6 syntax lets you drop a little noise:

(() => {
    'use strict';

    let rgx = RegExp(document.kmvar.rgx1);

    return Array.prototype.slice.call(document.links)
        .filter(x => rgx.test(x.href))
        .join('\n');
})();
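
For the title-plus-URL output asked about in the opening post, the same pattern can probably be extended along the following lines. This is only a sketch: `linksAsMarkdown` and the sample data are hypothetical names introduced here, and it assumes each link exposes `href` and `textContent`, as anchor elements do in a browser.

```javascript
'use strict';

// Hypothetical helper (not from the thread): formats link-like objects,
// i.e. anything with `href` and `textContent`, as Markdown links.
function linksAsMarkdown(links, pattern) {
    const rgx = new RegExp(pattern);
    return Array.prototype.slice.call(links)
        .filter(a => rgx.test(a.href))
        .map(a => {
            // Collapse runs of whitespace in the link text;
            // fall back to the URL when the link has no text.
            const title = (a.textContent || '').replace(/\s+/g, ' ').trim();
            return '[' + (title || a.href) + '](' + a.href + ')';
        })
        .join('\n');
}

// In a browser this would be called as linksAsMarkdown(document.links, '.').
// The array below is stand-in data so the sketch also runs outside a page.
const sample = [
    { href: 'https://example.com/a.png', textContent: '  Picture\n A ' },
    { href: 'https://example.com/b.html', textContent: '' }
];
console.log(linksAsMarkdown(sample, 'example'));
```

Link text that itself contains square brackets would need escaping before the result is valid Markdown; the sketch does not attempt that.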

MitchellModel · November 8, 2016, 4:21am

So, I tried to switch this to images. No problem without the regexp:

(function () {
    'use strict';

    return Array.prototype.slice.call(document.images)
        .map(function (x) {
            return x.src;
        })
        .join('\n');
})();

gives me the list of .png links I expected. But I am missing something about the filtered version.

(function () {
    'use strict';

    var rgx = RegExp(document.kmvar.rgx1);

    return Array.prototype.slice.call(document.images)
        .filter(function (x) {
            return rgx.test(x.src);
        })
        .join('\n');
})();

gives me a bunch of lines of the form

  [object HTMLImageElement]

And I don’t think any were filtered out by the regexp (same number of lines as without the filtering). I assume I am missing something obvious — help?

ComplexPoint · November 8, 2016, 10:39am

Just a type plumbing issue I think – a filtered array of image objects is a shorter array of image objects (rather than a shorter array of texts).

Filtering a derived array of .src texts should yield something more tractable:

(function () {
    'use strict';

    var rgx = RegExp(document.kmvar.rgx1);

    return Array.prototype.slice.call(document.images)
        .map(function (img) { // array of src lines
            return img.src;
        })
        .filter(function (src) { // filtered
            return rgx.test(src);
        })
        .join('\n');
})();

ComplexPoint · November 8, 2016, 10:50am

(or, of course, reverse the composition of filter and map, which seems likely to reduce space but increase time – a little more traffic in the repeated fetching across img.src)

(function () {
    'use strict';

    var rgx = RegExp(document.kmvar.rgx1);

    return Array.prototype.slice.call(document.images)
        .filter(function (img) { // shorter array of image objects
            return rgx.test(img.src);
        })
        .map(function (img) { // translated to their texts
            return img.src;
        })
        .join('\n');
})();

MitchellModel · November 8, 2016, 4:00pm

Yes, thanks, that works. I have to study this some more, though, especially the slice.call part of it. As it turns out in your original code you can omit the .href, which I don’t understand at all but probably illuminates why my .src doesn’t get me the image’s link.

ComplexPoint · November 8, 2016, 6:59pm

The slice method derives a JS array from the contents of the link element collection object. (The latter doesn’t have its own map or filter methods.)
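
The conversion can be tried outside a browser with a stand-in object. In this sketch `fakeCollection` is a hypothetical substitute for `document.links`: it has indexed entries and a `length`, but no array methods. `Array.from` is shown as an ES6 alternative to `slice.call`.

```javascript
'use strict';

// Stand-in for document.links (an assumption, so this runs outside a page):
// indexed entries plus a length, but no .map or .filter of its own.
const fakeCollection = {
    0: { href: 'https://example.com/one' },
    1: { href: 'https://example.com/two' },
    length: 2
};

console.log(typeof fakeCollection.map);   // prints: undefined

// slice, borrowed from Array.prototype, copies the indexed entries
// into a genuine Array, which does have map and filter.
const viaSlice = Array.prototype.slice.call(fakeCollection);
console.log(Array.isArray(viaSlice));     // prints: true

// Array.from (ES6) performs the same conversion.
const viaFrom = Array.from(fakeCollection);

console.log(viaSlice.map(x => x.href).join('\n'));
```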

Perhaps either .href is a default property of link elements, or JS is simply happy to coerce them to strings?

ComplexPoint · November 9, 2016, 5:50am

It finds it, of course, for the purpose of the filtering decision, but functions which are arguments to .filter are interpreted simply as predicates - whatever they return is just evaluated as a boolean – on which inclusion or exclusion turns.

(and RegExp.prototype.test() is in any case a predicate function - it just returns a boolean expressing the presence or absence of a match)
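
A small sketch of why the filtered links printed as URLs while the filtered images printed as `[object HTMLImageElement]`. The two objects below are hypothetical stand-ins, not real DOM elements: `linkLike` mimics an anchor element, whose `toString()` returns its `href`, and `imageLike` mimics an image element, which has no such conversion.

```javascript
'use strict';

const rgx = /\.png$/;

// Stand-ins for DOM elements (assumptions for illustration only).
const linkLike = {
    href: 'https://example.com/a.png',
    toString() { return this.href; }
};
const imageLike = {
    src: 'https://example.com/b.png',
    toString() { return '[object HTMLImageElement]'; }
};

// filter only consults the predicate's boolean result; the surviving
// elements are still the original objects, not the strings that were tested.
const keptLinks = [linkLike].filter(x => rgx.test(x.href));
const keptImages = [imageLike].filter(x => rgx.test(x.src));

// join converts each element to a string via toString().
console.log(keptLinks.join('\n'));    // prints: https://example.com/a.png
console.log(keptImages.join('\n'));   // prints: [object HTMLImageElement]

// Mapping to the .src string before joining gives the expected text.
console.log(keptImages.map(x => x.src).join('\n'));
```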

ccstone · November 14, 2016, 12:11am

Edit 2019/07/07 11:17 CDT

Fixed a couple minor problems with more modern versions of Safari.

Hey Mitchell,

Rob's code is nice and compact.

Here's what I've used for nearly a decade:

--------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2016/11/13 17:45
# dMod: 2019/07/07 11:16
# Appl: Safari
# Task: Extract Links from the front Safari document with optional RegEx filter.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @Safari, @Extract, @Links, @Front, @Document
--------------------------------------------------------

# PROTOTYPE:
# safari_links(regexStr, tagName, tagAttribute)

# HREF:
set linkList to safari_links(".*", "a", "href")

# SRC:
set linkList to safari_links(".*", "img", "src") of me

# FILTERED SRC:
set linkList to safari_links("\\.(bmp|jpe?g|png|gif)", "img", "src") of me

--------------------------------------------------------
--» HANDLERS
--------------------------------------------------------
--  dMod: 2010/12/30 01:00
--  dMod: 2016/11/13 17:51 – Cleaned up the code just a little.
--  Task: Get Links from Safari Using Javascript and a Regular Expression
--------------------------------------------------------
on safari_links(regexStr, tagName, tagAttribute)
   set javascriptCMD to "

(function () {

   function in_array (array, item) {
      for (var i=0; i < array.length; i++) {
         if ( array[i] == item ) {
            return true;}}
      return false;}
   
      var a_tags = document.getElementsByTagName('" & tagName & "');
      var href_array = new Array();
      var reg = new RegExp(/" & regexStr & "/i);
   
      for (var i=0; i < a_tags.length; i++) {
         var href = a_tags[i]." & tagAttribute & ";
         if ( reg.test(href)) {
            if ( !in_array(href_array, href)) {
               href_array.push(href);}}}
   
      // Filter-out empty items from the link list.
      var jsOutput = href_array.join('\\n');
      jsOutput = jsOutput.replace(/\\s+/g, '\\n').split('\\n');
   
      return jsOutput;

})();

"
   
   try
      tell application "Safari" to set linkList to do JavaScript javascriptCMD in document 1
      if linkList = missing value then set linkList to {}
   on error
      set linkList to {}
   end try
   
   return linkList
   
end safari_links

--------------------------------------------------------

I have the handler in a library, so all I have to do is emplace one line of code with a Typinator abbreviation:

safari_links(regexStr, tagName, tagType)

And modify to suit my use-case.

-Chris
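
For comparison, the filtering and de-duplication that the handler performs with its loop could be written in the compact style used earlier in the thread. This is a sketch under assumptions, not a drop-in replacement: `uniqueMatches` and the sample data are hypothetical, and unlike the handler it does not split values on whitespace.

```javascript
'use strict';

// Hypothetical helper: unique, non-empty attribute values matching a pattern.
function uniqueMatches(elements, attributeName, pattern) {
    // Case-insensitive, like the /i flag in the handler's RegExp.
    const rgx = new RegExp(pattern, 'i');
    const values = Array.prototype.slice.call(elements)
        .map(el => el[attributeName])
        .filter(value => typeof value === 'string' && value !== '' && rgx.test(value));
    // A Set keeps one copy of each value, in first-seen order.
    return Array.from(new Set(values));
}

// In Safari this might be called as:
//   uniqueMatches(document.getElementsByTagName('img'), 'src', '\\.(bmp|jpe?g|png|gif)')
// The array below is stand-in data so the sketch also runs outside a page.
const sampleImages = [
    { src: 'https://example.com/a.PNG' },
    { src: 'https://example.com/a.PNG' },
    { src: '' },
    { src: 'https://example.com/notes.txt' }
];
console.log(uniqueMatches(sampleImages, 'src', '\\.(bmp|jpe?g|png|gif)'));
```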

MitchellModel · November 17, 2016, 2:12pm

It turns out that for Safari only there is an Automator action!

There is also:

MitchellModel · December 9, 2016, 1:41am

This is wonderful — concise, flexible, and already debugged.

One suggestion: puzzling out what tagType means tripped me up briefly. Your IMG examples make it pretty clear what it is, but I made sure by looking at the code. I think a more appropriate name would be tagAttribute, although that leaves a bit of ambiguity: does it mean the attribute’s name or its value? Technically it should be tagAttributeName, or perhaps the slightly shorter attributeName.
