How to Use RegEx to Extract URL and Link Text from HTML Anchor Code?

Thanks for the suggestion, Peter.

I can't find an app named exactly "RegexMatch" in the app store.
I did find these:

Any of them look like the one you use?

and another approach is to experiment with the Find and Replace highlighting in the excellent Atom editor (which no modern family should be without – https://atom.io/) : -)

( As you edit a Regex in the Find field, all matching instances in the text are selected)

Not sure if this is the one that Peter has in mind, but I do like it:


Peter, I'm sure you're right.

But I guess I'm being a bit hard-headed about this, so I'm off on the proverbial fool's errand, just to prove you right. :laughing: Interestingly, a few Google searches turn up a lot of people trying to do this.

Of course ComplexPoint has some great ideas and code, and I'll likely end up using his stuff. But for now . . .

Here's what I'm thinking. All anchor tags must have at least a URL, right?
So, all I need to do is find the code segment that begins with "href" and take it from there.

So, given somewhere in the HTML code there is one of the following:

<a href="http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7" class="title" style="color: rgb(34, 34, 34); [and perhaps more stuff before the ">"]

OR

the same as above without the quotes around the URL
<a href=http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7 class="title" . . .

OR

Your worst case scenario:
<a alt="<>" href="http://www.stairways.com">

So in pseudo code:

<a [AnyText] href [AnyOf: Space = SingleQuote DoubleQuote] [theURL] [AnyOf: Space SingleQuote DoubleQuote] [AnyText] >

I think, but don't know how, one should be able to construct a RegEx with this logic:

FIND "<a" [AnyText] "href"
plus optionally [whitespace, =, whitespace, single quote, double quote]

That should put us at the beginning of the URL

Then match/capture:
any characters until it hits anyof [doubleQuote, SingleQuote, or space]

That marks the end of the URL

I can't think of any case where this would not work.

Now, if I can only find a RegEx guru who knows how to code my logic. :smile:

Here's one I found that seems to work well except for your case:

<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]
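
As a quick check, here is a minimal JavaScript sketch of how this pattern behaves on the three kinds of sample tag above. The helper name `extractHref` is only illustrative, and the first two samples have been shortened and closed with `>` so that they form complete tags.

```javascript
// The pattern quoted above, written as a JavaScript regex literal.
const hrefPattern = /<\s*a\s+[^>]*href\s*=\s*["']?([^"' >]+)["' >]/;

// Hypothetical helper: returns the captured URL, or null when nothing matches.
function extractHref(html) {
    const match = hrefPattern.exec(html);
    return match ? match[1] : null;
}

const quoted = '<a href="http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7" class="title">';
const unquoted = '<a href=http://forum.keyboardmaestro.com/t/combining-rtf-in-clipboards/1556/7 class="title">';
const worstCase = '<a alt="<>" href="http://www.stairways.com">';

console.log(extractHref(quoted));    // the forum URL
console.log(extractHref(unquoted));  // the forum URL
console.log(extractHref(worstCase)); // null: [^>]* cannot cross the ">" inside alt="<>"
```

The third result shows why the worst case fails: `[^>]*` has to reach `href` without passing a `>`, and the `alt` value contains one.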

Thanks, Rob. I had just found RegExRX from searching. Turned up in MacUpdate.com. It does look very good, and is highly rated.

Also found this one that runs online:

This is the one I use: Discontinued Apps | MacUpdate

But I suspect they may have gone away, and anyway there are plenty of options, and while this app works fine for me and gives a solid result, there is nothing overly special about it.

One thing though, try to ensure whatever you use uses the standard Mac (ICU core) regular expressions - there are many variants of regular expressions with subtle differences.

Hey Folks,

A very good online RegEx analyzer is Regular Expressions 101.

My go-to stand-alone analyzers are RegExRX and Patterns.

And of course there’s BBEdit and TextWrangler.

Since BBEdit runs 24/7 on my system I usually write complex regular expressions in it, although I will switch to one of the analyzers if I’m having problems getting it right.

-Chris


Thanks for the suggestions, Chris.

BTW, do you have any ideas on how to improve this pattern to extract the URL from an HTML anchor tag:

<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]

It seems to work well, but it will fail with this HTML code:

<a alt="<>" href="http://www.stairways.com">

Thanks.

@ccstone: Thanks to your suggestion for RegEx online tool, I think I have found the BEST RegEx pattern to extract URL from HTML Anchor. But I’d still like your assessment of this pattern.

REF: https://regex101.com/r/rQ8mR1/1

SOURCE:  <a.+?\s*href\s*=\s*["\']?([^"\'\s>]+)["\']?
QUOTED: "<a.+?\\s*href\\s*=\\s*[\"\\']?([^\"\\'\\s>]+)[\"\\']?"

This works with ALL of my test cases, including the hard one from @peternlewis

<a alt="<>" href="http://www.stairways.com">
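
For anyone who wants to try the SOURCE pattern outside regex101, here is a small JavaScript sketch. The sample tags are shortened stand-ins for the test cases in this thread, not the full test set.

```javascript
// The SOURCE pattern above as a JavaScript regex literal.
const anchorHref = /<a.+?\s*href\s*=\s*["']?([^"'\s>]+)["']?/;

const samples = [
    '<a href="http://www.stairways.com">',            // double-quoted
    "<a href='http://www.stairways.com'>",            // single-quoted
    '<a href=http://www.stairways.com class="title">', // unquoted
    '<a alt="<>" href="http://www.stairways.com">'    // ">" inside another attribute
];

// Collect capture group 1 (the URL) for each sample, or null if no match.
const urls = samples.map(function (html) {
    const m = anchorHref.exec(html);
    return m ? m[1] : null;
});

console.log(urls); // every entry should be "http://www.stairways.com"
```

Two caveats that follow from the pattern itself: `.` does not match line breaks by default, so a tag that wraps onto a new line before `href` would be missed, and `<a.+?` will also accept other tags whose names start with "a" (such as `<area>`).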

I’m going to run this for a while in my local test version of:
Macro: Set Clipboard to RTF Hyperlink & Plain Text MD Link – BETA 1.0

If all goes well, I’ll update the macro/script in the above thread.

John Gruber has also made a very complex regex to find urls:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Maybe that is useful.

Here is the source on Github:


Thanks Jimmy. I can see where that would be very useful in some cases.

In my case, the only URL I want is the HREF URL in an HTML anchor tag. There can sometimes be other URLs not of interest.
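
To illustrate the difference, here is a rough JavaScript sketch that reuses the anchor pattern from earlier in the thread with the global flag, so that only `href` values inside anchor tags are collected and a bare URL in the text is ignored. The helper name `anchorHrefs` and the sample HTML are made up for the example.

```javascript
// Same href pattern as earlier in the thread, plus the global flag so that
// every anchor tag in the input is visited.
const anchorHrefAll = /<a.+?\s*href\s*=\s*["']?([^"'\s>]+)["']?/g;

// Hypothetical helper: list the href value of every anchor tag found.
function anchorHrefs(html) {
    return Array.from(html.matchAll(anchorHrefAll), function (m) {
        return m[1];
    });
}

const sampleHtml =
    '<p>Plain text URL: http://example.com/not-a-link</p>' +
    '<a href="http://www.stairways.com">Stairways</a> ' +
    "<a class='title' href='http://forum.keyboardmaestro.com/'>Forum</a>";

// Only the two href values are returned; the plain text URL is skipped.
console.log(anchorHrefs(sampleHtml));
```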

There’s a comparative table of URL regex performance here:

https://mathiasbynens.be/demo/url-regex

When I need something like that, I tend to copy paste the latest update from Diego Perini at:

but generally, being lazy, I do prefer to let the browser do it for me : -)

Rob, I get your point, and I want to really thank you for all of your help.

I know you prefer a JavaScript solution, but I chose not to go that route for these reasons:

  1. I needed a solution that would work with all browsers, including Firefox, AND with any RTF app/document. JavaScript would not work with RTF documents, and AFAIK I can't use JS from KM with FF.
  2. The RegEx solution works for everything.
  3. The RegEx solution is simpler to me.
  4. I currently don't have a JavaScript environment setup to code, test, and debug JS.
  5. Re-learning Javascript
  6. It's been years since I last coded JS
  7. And that was in Windows and IE
  8. When I look at your JS code I don't have a clue how to use/modify, as in outputting the Page Title and URL to separate KM variables

But my biggest hurdle right now to JS is #4, especially testing and debugging.

So, I'd like to post some JS questions in this thread:
Learning & Using AppleScript & JavaScript for Automation (JXA)

Thanks again for everything.

Of course, and what matters is the job not the tools.

( 2, 3 and 5 all make a huge amount of sense, and polishing regex skills is always good )

( 1 and 4 may not be huge problems, as it happens – as long as your system has Safari somewhere, you can delegate parsing tasks to it while working in any other browser, and textutil can reframe any RTF process as an HTML process. )

(On 4, Safari itself has an excellent JS debugger, and any code that you send to it from KM actions or AS code shows up in it. For app automation through JS, you would clearly have to get Yosemite+, but for browser and general JS, Safari and the command line JSC are already very rich scripting and debugging environments)

Always best, however, to use what already works quickly for you at the time.

Ars longa, vita brevis. Life is short and shavable yaks are many : -)

Always better to learn another human language than another machine language …


Maybe . . .
Seems to me that human languages are far more diverse, have poor rules, and are subject to dialects and idioms. That's why Texan is so hard to learn. :laughing:

Rob, I really like this JXA function. :+1:
IMO, best solution yet for parsing an HTML hyperlink.

But I'd like to return a JavaScript array:

arrLink[0] -- the MD text
arrLink[1] -- the oNode.text
arrLink[2] -- the oNode.href

I know how to make these mods in normal JavaScript, but the code that is sent to the Browser is confusing to me.

How can I change the return to be the above array?

Thanks.

The browser evaluates a JavaScript string built from the brief linkMD() function near the top of the script.

You can modify the return value of that function in any way you like. Here, for example, it has been edited to return an object with two properties (.txt and .ref), which you can then use to assemble the MD yourself:

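function linkMD(strLinkHTML) {
    var oDiv, oNode;

    // Let the browser parse the anchor HTML inside a detached <div>
    (oDiv = document.createElement('div')).innerHTML = strLinkHTML;

    // If parsing produced a first child node, return its link text and URL;
    // otherwise return an empty object
    return (
        oNode = oDiv.firstChild
    ) ? {
        txt: oNode.text,
        ref: oNode.href
    } : {};
}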
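
To get from that object to the three-element array asked for above, something along these lines might do. It is only a sketch: `linkArray` is a hypothetical helper, and `sampleLink` is a hand-written stand-in for whatever linkMD() returns in the browser.

```javascript
// Hypothetical helper: turn a { txt, ref } object into [mdLink, text, href].
// An empty object (no link found) yields an empty array.
function linkArray(oLink) {
    if (!oLink || !oLink.ref) {
        return [];
    }
    return [
        '[' + oLink.txt + '](' + oLink.ref + ')', // the MD text
        oLink.txt,                                // the link text
        oLink.ref                                 // the URL
    ];
}

// Stand-in for a value returned by linkMD().
const sampleLink = { txt: 'Stairways', ref: 'http://www.stairways.com/' };

console.log(linkArray(sampleLink));
console.log(linkArray({})); // []
```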

Thanks, Rob. That really helps. Not only does it give me the solution, but it also teaches me how to deal with similar in the future.

Rob, I'm trying to learn from you.

Was there a specific reason that you used Strict mode in this function?

Thanks.

I should really use it all the time. Sometimes I forget.

Strict mode is a better subset of JS, and allows the compiler to pick up glitches like the use of undefined variable names.
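
A small illustration of the kind of glitch meant here, as a sketch with made-up names (`strictTotal`, `totl`). In strict mode, assigning to a name that was never declared throws a ReferenceError when that line runs, rather than quietly creating a global variable.

```javascript
function strictTotal() {
    'use strict';
    var total = 0;
    try {
        // Typo: "totl" was never declared. Without the 'use strict' directive
        // (in non-strict code) this line would silently create a global.
        totl = total + 3;
        return total;
    } catch (e) {
        // In strict mode the slip surfaces as an error instead.
        return e.name;
    }
}

console.log(strictTotal()); // "ReferenceError"
```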

PS - thanks for bringing that to mind - I’ll add it to my TextExpander snippet for JS modules
