How to Use RegEx to Extract URL and Link Text from HTML Anchor Code?

JMichaelTX · July 21, 2015, 7:16am

Well, this is probably another one of those stupid mistakes, but I've tried many examples from the 'net, and can't seem to get a clean URL, without quotes.

Ideally what I'd like to have is one RegEx that returns matches of both the URL and the Link Text, if possible. If you know of a RegEx that will do this, please point me to it.

Here's an example:
HTML Code:

<a href="http://www.evernote.com/">Visit our HTML tutorial</a>

RegEx:
Ref: http://stackoverflow.com/questions/15926142/regular-expression-for-finding-href-value-of-a-a-link

<a\s+(?:[^>]*?\s+)?href="([^"]*)"

My KM Macro:
The last Action uses the above RegEx, but matches nothing.

TIA for all help and suggestions.

ComplexPoint · July 21, 2015, 7:58am

One of those cases where DOM and XPath (through an Execute Javascript in browser) is probably a lot simpler than string and regex.

If you look at the JavaScript snippet in the post below, you will see that once you have the link node in the DOM you can directly read off the .href and .text properties.

ComplexPoint · July 21, 2015, 8:46am

One approach to reading the link properties straight out of the DOM:

Place a link in a browser and read its .href and .text.kmmacros (3.4 KB)

Execute shell script:

echo "$KMVAR_someLink" > ~/Desktop/tmpTiny.html; open -a "Google Chrome" -g ~/Desktop/tmpTiny.html

Execute JavaScript:

(function () {
	var oNode = document.evaluate(
		'//a[1]', // XPath for first link in the document
		document, null, 0, 0
	).iterateNext();

	return oNode ? 
		"[" + oNode.text + "](" + oNode.href + ")" :
		"";
})();

peternlewis · July 21, 2015, 9:09am

Trying to parse HTML with regex is a bit of a fool's errand. You can get it to work in specific cases, but there is no way to make it work in the general case (especially with things like CDATA).

For example, this sort of HTML:

<a alt="<>" href="http://www.stairways.com">

would generate a failure for that regex.
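
A quick way to see the failure is to run the pattern in plain JavaScript. This is only a sketch: the variable names are illustrative, and JavaScript's regex engine is not the ICU engine Keyboard Maestro uses, although this pattern sticks to constructs the two have in common.

```javascript
// Pattern from the Stack Overflow answer quoted in the first post.
const hrefPattern = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/;

// A plain anchor: the optional group is skipped and the href is captured.
const simple = '<a href="http://www.evernote.com/">Visit our HTML tutorial</a>';

// The counter-example: [^>]*? cannot step over the ">" inside alt="<>",
// so the engine never reaches href and the match fails.
const tricky = '<a alt="<>" href="http://www.stairways.com">';

const simpleMatch = simple.match(hrefPattern);
const trickyMatch = tricky.match(hrefPattern);

console.log(simpleMatch ? simpleMatch[1] : null); // http://www.evernote.com/
console.log(trickyMatch ? trickyMatch[1] : null); // null
```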

But if you're going to try, it's wise to get a regular expression tester. I use RegexMatch (from the Mac App Store), but there are many to choose from.

In this case, it looks to me like the first action has a curly quote in it:

The system will insert curly quotes if it gets half a chance; you can control this in the Edit ➤ Substitutions ➤ Smart Quotes menu.
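
To illustrate why a curly quote in the HTML defeats the pattern, here is a small JavaScript sketch. It assumes the substituted characters are U+201C and U+201D; the normalisation step at the end is just one possible workaround, not something the macro does.

```javascript
const hrefPattern = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/;

// The sample anchor with straight quotes.
const straight = '<a href="http://www.evernote.com/">Visit our HTML tutorial</a>';

// The same anchor after smart-quote substitution (U+201C and U+201D).
const curly = '<a href=\u201Chttp://www.evernote.com/\u201D>Visit our HTML tutorial</a>';

const straightOk = hrefPattern.test(straight);
const curlyOk = hrefPattern.test(curly);

// Possible workaround: map curly double quotes back to straight ones first.
const normalised = curly.replace(/[\u201C\u201D]/g, '"');
const normalisedOk = hrefPattern.test(normalised);

console.log(straightOk, curlyOk, normalisedOk); // true false true
```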

ComplexPoint · July 21, 2015, 9:35am

To just read the link with the browser, and skip any file writing, you can use JavaScript to:

Add a div to whichever page happens to be loaded
set its .innerHTML to the value of a KM variable
read off the .text and .href properties directly

Get the browser to parse the link.kmmacros (2.5 KB)

(function () {
	var oDiv = document.createElement('div');
	
	oDiv.innerHTML = document.kmvar['someLink'];

	var oNode = oDiv.firstChild;

	return oNode ? 
		"[" + oNode.text + "](" + oNode.href + ")" :
		"";
})();

PS on regular expressions – they are great, but sometimes they do create a new problem : -)

peternlewis · July 21, 2015, 11:00am

That’s a very clever solution. It’s a shame that it does not look possible to do in JXA (at least I could not see any way to create the DOM document). Otherwise this method only works if you have a window of some sort open in Safari or Google Chrome (whichever one you use).

ComplexPoint · July 21, 2015, 1:25pm

It's a shame that it does not look possible to do in JXA (at least I could not see any way to create the DOM document)

JXA doesn't have a built-in reference to a browser name-space, but it can, of course, still pass code over to Chrome or Safari, opening a window if there isn't one.

Here's a quick draft of a JXA example:

// passing the link HTML directly here (at bottom), but it could be from the value of a KM VAR

(function(strLinkHTML) {
	
	// This function will be converted to a string and 
	// evaluated in the browser context, with an argument supplied
	// by .apply()
	
	function linkParse(strLinkHTML) {
		var oDiv, oNode;
	
		(oDiv = document.createElement('div')
		).innerHTML = strLinkHTML;
		
		return (
			oNode = oDiv.firstChild
		) ? "[" + oNode.text + "](" + oNode.href + ")" : ""
	}
	
	var appChrome = Application("Google Chrome"),
		lstWins = appChrome.windows(),
		lngWins = lstWins.length,
		
		// If no window is open we make one
		
		oWin = lngWins ? lstWins[0] : appChrome.Window().make(),
		
		// Compose the .js we need ...
		
		strJS = '(' + linkParse.toString() + ').apply(null, [\'' + strLinkHTML + '\'])';
		
	
	// and run it in Chrome - the Safari syntax is slightly different here
	return (
		oWin.activeTab.execute({
			javascript: strJS
		})
	);
	
})(
	'<a href="http://www.google.com/?q=purescript">purescript</a>'
);
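
One caveat when the script string is composed by wrapping the HTML in single quotes by hand: an apostrophe in the link text ends the string literal early. The sketch below shows the problem and one alternative, JSON.stringify. The names composeCall and echo are hypothetical, echo stands in for the function sent to the browser, the example.com link is illustrative, and eval stands in for the browser evaluating the string.

```javascript
// Hypothetical helper: builds "(function ...).apply(null, [<string literal>])".
function composeCall(fn, strArg) {
  // JSON.stringify escapes quotes, backslashes and newlines,
  // so its output can be pasted into the script as a string literal.
  return '(' + fn.toString() + ').apply(null, [' + JSON.stringify(strArg) + '])';
}

// Stand-in for the browser-side function, so the sketch runs without a DOM.
function echo(s) {
  return s;
}

const strLinkHTML = '<a href="http://www.example.com/">Peter\'s page</a>';

// Hand-quoting with '...': the apostrophe in "Peter's" ends the literal early.
const naive = '(' + echo.toString() + ').apply(null, [\'' + strLinkHTML + '\'])';
let naiveFailed = false;
try {
  eval(naive);
} catch (e) {
  naiveFailed = e instanceof SyntaxError;
}

// The JSON.stringify version survives the round trip unchanged.
const roundTripped = eval(composeCall(echo, strLinkHTML));

console.log(naiveFailed);                  // true
console.log(roundTripped === strLinkHTML); // true
```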

ComplexPoint · July 21, 2015, 2:21pm

FWIW here is a JXA function

mdLinkFromHTML(strLinkHTML, strBrowser)

which can use either Chrome or Safari.

(The browser name argument is optional – if you don’t specify “Chrome” or “Google Chrome”, the default is Safari.)

Chrome turns out to do this 30%-40% faster, but neither needs as much as 2 milliseconds to return a markdown version of a link using this technique, so speed is unlikely to be relevant to the choice :- )

// If strBrowser is omitted, the default is Safari,
// though Chrome happens to do this c. 30-40% faster ...
//
// but neither needs as much as 2 milliseconds on this system
// so speed is probably not an issue : -)

function mdLinkFromHTML(strLinkHTML, strBrowser) {
  "use strict";
  strBrowser = ((strBrowser || '').indexOf("Chrome") === -1) ?
    "Safari" : "Google Chrome";

  // This function will be converted to a string and 
  // evaluated in the browser context, with an argument supplied
  // by .apply()

  function linkMD(strLinkHTML) {
    var oDiv, oNode;

    (oDiv = document.createElement('div')).innerHTML = strLinkHTML;

    return (
      oNode = oDiv.firstChild
    ) ? "[" + oNode.text + "](" + oNode.href + ")" : ""
  }

  var appBrowser = Application(strBrowser),
    blnSafari = (strBrowser.indexOf("Safari") === 0),
    lstWins = appBrowser.windows(),
    lngWins = lstWins.length,

    // If no window is open we make one

    oWin = lngWins ? lstWins[0] : blnSafari ?
    appBrowser.Document().make() && appBrowser.windows[0] :
    appBrowser.Window().make(),

    strJS = '(' + linkMD.toString() + ').apply(null, [\'' + strLinkHTML + '\'])';

  return (
    blnSafari ?
    appBrowser.doJavaScript(
      strJS, { "in" : oWin.tabs[0]
      }
    ) :
    oWin.activeTab.execute({
      "javascript": strJS
    })
  );

}


// A diversion: parsing a link 1000 times with one browser or the other
// to get a rough speed comparison
var blnTimeTest = false;

if (blnTimeTest) {

  var tmStart, tmFinish, t = 1000, // add zeros here for a longer test
    varResult;
  tmStart = new Date().getTime();
  while (t--) {

    // varResult = mdLinkFromHTML(
    //  '<a href="http://www.google.com/?q=purescript">purescript</a>',
    //  "Safari"
    // );

    varResult = mdLinkFromHTML(
      '<a href="http://www.google.com/?q=purescript">purescript</a>', "Chrome");
    

  }

  tmFinish = new Date().getTime();

  [tmFinish - tmStart, varResult];

} else {

  mdLinkFromHTML(
    '<a href="http://www.google.com/?q=purescript">purescript</a>',
    "Safari"
  );

}

JMichaelTX · July 21, 2015, 7:24pm

Thanks for the suggestion, Peter.

I can't find an app named exactly "RegexMatch" in the app store.
I did find these:

Any of them look like the one you use?

ComplexPoint · July 21, 2015, 8:17pm

Another approach is to experiment with the Find and Replace highlighting in the excellent Atom editor (which no modern family should be without – https://atom.io/) : -)

( As you edit a Regex in the Find field, all matching instances in the text are selected)

ComplexPoint · July 21, 2015, 9:09pm

Not sure if this is the one that Peter has in mind, but I do like it:

JMichaelTX · July 21, 2015, 9:32pm

Peter, I'm sure you're right.

But I guess I'm being a bit hard-headed about this, so I'm off on the proverbial fool's errand, just to prove you right. It is interesting that a few Google searches turn up a lot of people trying to do this.

Of course ComplexPoint has some great ideas and code, and I'll likely end up using his stuff. But for now . . .

Here's what I'm thinking. All anchor tags must have at least a URL, right?
So, all I need to do is find the code segment that begins with "href" and take it from there.

So, given somewhere in the HTML code there is one of the following:

<a href="http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7" class="title" style="color: rgb(34, 34, 34); [and perhaps more stuff before the ">"]

OR

the same as above without the quotes around the URL
<a href=http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7 class="title" . . .

OR

Your worst case scenario:
<a alt="<>" href="http://www.stairways.com">

So in pseudo code:

<a [AnyText] href [AnyOf: Space = SingleQuote DoubleQuote] [theURL] [AnyOf: Space SingleQuote DoubleQuote] [AnyText] >

I think, but don't know how, one should be able to construct a RegEx with this logic:

FIND "<a" [AnyText] "href"
plus optionally [whitespace, =, whitespace, single quote, double quote]

That should put us at the beginning of the URL

Then match/capture:
any characters until it hits anyof [doubleQuote, SingleQuote, or space]

That marks the end of the URL

I can't think of any case where this would not work.

Now, if I can only find a RegEx guru who knows how to code my logic.

Here's one I found that seems to work well except for your case:

<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]
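
As a rough check, the pattern can be run against the three cases listed above. This is a JavaScript sketch rather than a Keyboard Maestro action; the tags elided in the post are closed with ">" here as an assumption, and the key names are illustrative.

```javascript
// The pattern exactly as posted.
const pattern = /<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/;

const cases = {
  quoted: '<a href="http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7" class="title" style="color: rgb(34, 34, 34);">',
  unquoted: '<a href=http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7 class="title">',
  angleInAttr: '<a alt="<>" href="http://www.stairways.com">'
};

// Collect capture group 1 (the URL) for each case, or null if nothing matched.
const results = {};
for (const [name, html] of Object.entries(cases)) {
  const m = html.match(pattern);
  results[name] = m ? m[1] : null;
}

console.log(results);
```

The first two cases yield the forum URL. The third yields null, because [^>]* stops at the ">" inside the alt attribute before it can reach href.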

JMichaelTX · July 21, 2015, 10:54pm

Thanks, Rob. I had just found RegExRX from searching. Turned up in MacUpdate.com. It does look very good, and is highly rated.

Also found this one that runs online:

peternlewis · July 22, 2015, 1:03am

This is the one I use: Discontinued Apps | MacUpdate

But I suspect they may have gone away, and anyway there are plenty of options, and while this app works fine for me and gives a solid result, there is nothing overly special about it.

One thing though, try to ensure whatever you use uses the standard Mac (ICU core) regular expressions - there are many variants of regular expressions with subtle differences.

ccstone · July 23, 2015, 6:35am

Hey Folks,

A very good online RegEx analyzer is Regular Expressions 101.

My go-to stand-alone analyzers are RegExRX and Patterns.

And of course there’s BBEdit and TextWrangler.

Since BBEdit runs 24/7 on my system I usually write complex regular expressions in it, although I will switch to one of the analyzers if I’m having problems getting it right.

-Chris

JMichaelTX · July 23, 2015, 7:10am

Thanks for the suggestions, Chris.

BTW, do you have any ideas on how to improve this pattern to extract the URL from an HTML anchor tag:

<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]

It seems to work well, but it will fail with this HTML code:

<a alt="<>" href="http://www.stairways.com">

Thanks.

JMichaelTX · July 23, 2015, 8:32am

@ccstone: Thanks to your suggestion for RegEx online tool, I think I have found the BEST RegEx pattern to extract URL from HTML Anchor. But I’d still like your assessment of this pattern.

REF: https://regex101.com/r/rQ8mR1/1

SOURCE:  <a.+?\s*href\s*=\s*["\']?([^"\'\s>]+)["\']?
QUOTED: "<a.+?\\s*href\\s*=\\s*[\"\\']?([^\"\\'\\s>]+)[\"\\']?"

This works with ALL of my test cases, including the hard one from @peternlewis

<a alt="<>" href="http://www.stairways.com">
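
For anyone who wants to try the pattern outside regex101, here is a JavaScript sketch that builds it from the QUOTED form and runs it on the thread's samples. The sample keys are illustrative, and the last sample is an extra case added here to show one behaviour worth knowing: because .+? can match ">", an anchor without an href lets the match run on into a later tag.

```javascript
// The QUOTED form is the same pattern written as a string literal,
// so every backslash is doubled.
const hrefPattern = new RegExp("<a.+?\\s*href\\s*=\\s*[\"\\']?([^\"\\'\\s>]+)[\"\\']?");

const samples = {
  quoted: '<a href="http://www.evernote.com/">Visit our HTML tutorial</a>',
  unquoted: '<a href=http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7 class="title">',
  angleInAttr: '<a alt="<>" href="http://www.stairways.com">',
  // Illustrative extra case: the anchor has no href, so .+? keeps going
  // and the href of the following <link> tag is captured instead.
  overrun: '<a name="top"></a> <link href="style.css">'
};

// Collect capture group 1 (the URL) for each sample, or null if nothing matched.
const extracted = {};
for (const [name, html] of Object.entries(samples)) {
  const m = html.match(hrefPattern);
  extracted[name] = m ? m[1] : null;
}

console.log(extracted);
```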

I’m going to run this for a while in my local test version of:
Macro: Set Clipboard to RTF Hyperlink & Plain Text MD Link – BETA 1.0

If all goes well, I’ll update the macro/script in the above thread.

JimmyHartington · July 23, 2015, 8:35am

John Gruber has also made a very complex regex to find urls:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Maybe that is useful.
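
A note for anyone trying this pattern in JavaScript rather than in an ICU-based tool: JavaScript does not accept the leading (?i), so the sketch below drops it and uses the i flag instead, and escapes the two forward slashes for the regex literal. The sample sentence and variable names are illustrative.

```javascript
// Gruber's pattern with (?i) replaced by the "i" flag and "/" escaped as "\/".
// The "g" flag is added so match() returns every URL in the text.
const urlPattern = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;

const text = 'See http://www.stairways.com/main, and also www.evernote.com.';

// Trailing punctuation (the comma and the full stop) is left out of the matches.
const found = text.match(urlPattern);

console.log(found); // [ 'http://www.stairways.com/main', 'www.evernote.com' ]
```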

Here is the source on Github:

gist.github.com

https://gist.github.com/gruber/249502

Liberal Regex Pattern for All URLs

The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611


JMichaelTX · July 23, 2015, 8:43am

Thanks Jimmy. I can see where that would be very useful in some cases.

In my case, the only URL I want is the HREF URL in an HTML anchor tag. There can sometimes be other URLs not of interest.
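
A sketch of that narrower goal in JavaScript: collect the href and the link text from anchor tags only, ignoring other URLs in the same HTML. The pattern here is a hypothetical variation on the ones above, not a tested general solution. It still fails if a ">" appears inside an attribute before href, and any tags nested in the link text are returned as-is. The example.com URLs are illustrative.

```javascript
// Hypothetical pattern: group 1 is the href value, group 2 is the link text.
const anchorPattern = /<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)["']?[^>]*>([\s\S]*?)<\/a>/gi;

const html =
  '<p>See http://example.com/not-a-link and ' +
  '<img src="http://example.com/logo.png"> ' +
  '<a class="title" href="http://www.evernote.com/">Visit our HTML tutorial</a> or ' +
  "<a href='http://www.stairways.com'>Stairways</a>.</p>";

// matchAll needs the "g" flag; each match carries both capture groups.
const links = Array.from(html.matchAll(anchorPattern), m => ({
  url: m[1],
  text: m[2]
}));

// Format each link as Markdown, as the browser-based scripts in this thread do.
const markdown = links.map(l => '[' + l.text + '](' + l.url + ')');

console.log(markdown);
```

The bare URL in the paragraph text and the img src are skipped because neither sits inside an anchor tag's href.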

ComplexPoint · July 23, 2015, 9:33am

There’s a comparative table of URL regex performance here:

https://mathiasbynens.be/demo/url-regex

When I need something like that, I tend to copy paste the latest update from Diego Perini at:

gist.github.com

https://gist.github.com/dperini/729294

regex-weburl.js

//
// Regular Expression for URL validation
//
// Author: Diego Perini
// Updated: 2010/12/05
// License: MIT
//
// Copyright (c) 2010-2013 Diego Perini (http://www.iport.it)
//
// Permission is hereby granted, free of charge, to any person

but generally, being lazy, I do prefer to let the browser do it for me : -)
