How to Use RegEx to Extract URL and Link Text from HTML Anchor Code?

Rob, I get your point, and I really want to thank you for all of your help.

I know you prefer a JavaScript solution, but I chose not to go that route for these reasons:

  1. I needed a solution that would work with all browsers, including Firefox, AND with any RTF app/document. JavaScript would not work with RTF documents, and AFAIK I can't use JS from KM with FF.
  2. The RegEx solution works for everything.
  3. The RegEx solution is simpler to me.
  4. I currently don't have a JavaScript environment set up to code, test, and debug JS.
  5. I would have to re-learn JavaScript
  6. It's been years since I last coded JS
  7. And that was in Windows and IE
  8. When I look at your JS code I don't have a clue how to use or modify it, for example to output the Page Title and URL to separate KM variables

But my biggest hurdle right now to JS is #4, especially testing and debugging.

So, I'd like to post some JS questions in this thread:
Learning & Using AppleScript & JavaScript for Automation (JXA)

Thanks again for everything.

Of course, and what matters is the job not the tools.

( 2, 3 and 5 all make a huge amount of sense, and polishing regex skills is always good )

( 1 and 4 may not be huge problems, as it happens: as long as your system has Safari somewhere, you can delegate parsing tasks to it while working in any other browser, and textutil can convert RTF to HTML, which reframes any RTF process as an HTML process. )

(On 4, Safari itself has an excellent JS debugger, and any code that you send to it from KM actions or AS code shows up in it. For app automation through JS, you would clearly have to get Yosemite+, but for browser and general JS, Safari and the command line JSC are already very rich scripting and debugging environments)

Always best, however, to use what already works quickly for you at the time.

ars longa vita brevis
Life is short and shavable yaks are many :-)

Always better to learn another human language than another machine language …


Maybe . . .
Seems to me that human languages are far more diverse, have poor rules, and are subject to dialects and idioms. That's why Texan is so hard to learn. :laughing:

Rob, I really like this JXA function. :+1:
IMO, best solution yet for parsing an HTML hyperlink.

But I'd like to return a JavaScript array:

arrLink[0] -- the MD text
arrLink[1] -- the oNode.text
arrLink[2] -- the oNode.href

I know how to make these mods in normal JavaScript, but the code that is sent to the Browser is confusing to me.

How can I change the return to be the above array?

Thanks.

The browser evaluates a JavaScript string built from the brief linkMD() function near the top of the script.

You can modify the return value of that function in any way you like. Here, for example, it has been edited to return an object with two properties (.txt and .ref), which you can then use to assemble the MD yourself:

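function linkMD(strLinkHTML) {
    var oDiv, oNode;

    // Parse the anchor HTML by loading it into a detached <div>
    (oDiv = document.createElement('div')).innerHTML = strLinkHTML;

    // Return the link text and URL, or an empty object if nothing was parsed
    return (
        oNode = oDiv.firstChild
    ) ? {
        txt: oNode.text,
        ref: oNode.href
    } : {};
}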
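
To get the three-element array asked for earlier in the thread, one option is to build it from the object that linkMD() returns. What follows is only a minimal sketch, assuming that object carries the .txt and .ref properties shown in the function; the helper name linkArray and the sample values are made up for illustration and are not part of the original script:

```javascript
// Hypothetical helper (not part of the original script): turns the
// { txt, ref } object returned by linkMD() into [MD link, text, href].
function linkArray(oLink) {
    // linkMD() returns {} when no node was found, so guard for missing fields
    if (!oLink || oLink.txt === undefined || oLink.ref === undefined) {
        return [];
    }
    return [
        '[' + oLink.txt + '](' + oLink.ref + ')', // arrLink[0]: the MD text
        oLink.txt,                                // arrLink[1]: the link text
        oLink.ref                                 // arrLink[2]: the URL
    ];
}

// A hand-written object stands in here for the result of linkMD()
var arrLink = linkArray({ txt: 'Example', ref: 'https://example.com/' });
```

Because this helper never touches document, it can run on the KM/JXA side after the browser has returned the object, or be pasted into the string sent to the browser, whichever is easier to debug.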

Thanks, Rob. That really helps. Not only does it give me the solution, but it also teaches me how to deal with similar problems in the future.

Rob, I'm trying to learn from you.

Was there a specific reason that you used Strict mode in this function?

Thanks.

I should really use it all the time. Sometimes I forget.

Strict mode is a better subset of JS, and allows the engine to pick up glitches like assigning to an undeclared variable name, which throws a ReferenceError instead of silently creating a global.
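
A small sketch of that kind of glitch, for anyone who wants to try it in Safari's console or a command-line JS shell. The function and variable names here are made up for illustration and are not taken from the script in this thread:

```javascript
// Illustrative function (hypothetical name), not from the script in this thread
function assignUndeclared() {
    'use strict';
    try {
        // No var/let/const here, so strict mode rejects the assignment
        undeclaredTotal = 42;
        return 'no error';
    } catch (e) {
        return e.name; // name of the error that was thrown
    }
}

var strictResult = assignUndeclared();
```

With the 'use strict' directive in place, strictResult should come back as 'ReferenceError'. Note that the error appears when the line runs, not when the script is loaded.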

PS - thanks for bringing that to mind - I’ll add it to my TextExpander snippet for JS modules
