Read text and attributes of web items by XPath

ComplexPoint · September 1, 2015, 12:40pm

Read XPath matches in Chrome or Safari.zip (11.4 KB)

Custom Keyboard Maestro Plug-in

NAME

Read (text and HTML attributes of) XPath matches in Chrome or Safari

VERSION

0.1

SYNOPSIS

Returns a JSON list of HTML element properties for the first node (or all nodes) matched by the supplied XPath.
The XPath can be applied to:
- the root of the whole document,
- the start of the current selection,
- or the current hover position of the mouse.
In addition to the properties of each match, the JSON returned also includes:
- The URL and title of the document, and
- the element properties of any anchor node (selected, or under the mouse cursor) to which the XPath has been applied. (See the Applied to option below.)
The text of the matches can be returned as Markdown, raw HTML, or plain unmarked text.
NB In Yosemite, Keyboard Maestro's Execute JavaScript for Automation action returns a default 'may not be scriptable' warning from osascript when a compiled .scpt file is called.
- This action will unfortunately fail (generating an unparseable string instead of JSON) unless you switch off "Include Errors" in the upper right cogwheel dropdown of the action.

OPTIONS

Read
- all matches
- first match only
XPath:
- An XPath expression

Note 1: an XPath applied to a mouse position or selection should start with a dot (./), which refers to the selected or hovered node.

Note 2: In Google Chrome, a simple absolute XPath for a selected element can be obtained with:
- Right-Click, Inspect Element, then
- Right-Click, Copy XPath

(The copied XPath will work in Safari, as well as in Chrome itself – Safari does not have its own Copy XPath feature)

Applied to:
- document
- mouse position
- selection
Text as:
- Markdown
- inner HTML
- outer HTML
- plain text content
Browser:
- Chrome or Safari
- Safari
- Google Chrome
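
Note 1 above follows from how the browser's document.evaluate treats its context node: a path beginning with / or // is resolved from the document root whatever node is passed in, whereas a path beginning with ./ is resolved from that node. A minimal sketch of the call, assuming a hypothetical helper named xpathNodes (not part of the plug-in) that takes the document as a parameter so that it can also be tried outside a browser:

```javascript
// Hypothetical helper, not part of the plug-in: collect the nodes matched
// by an XPath, evaluated relative to a context node.
// doc      : a DOM Document (pass `document` in a browser console)
// strPath  : the XPath; start it with './' to work from oContext
// oContext : the anchor node (selection or hovered element), or doc itself
function xpathNodes(doc, strPath, oContext) {
    // 5 is the numeric value of XPathResult.ORDERED_NODE_ITERATOR_TYPE
    var ORDERED_NODE_ITERATOR_TYPE = 5,
        xr = doc.evaluate(
            strPath, oContext || doc, null, ORDERED_NODE_ITERATOR_TYPE, null
        ),
        lst = [],
        oNode = xr.iterateNext();

    // Drain the iterator into an ordinary array
    while (oNode) {
        lst.push(oNode);
        oNode = xr.iterateNext();
    }
    return lst;
}
```

In a browser console, xpathNodes(document, './ancestor::a', window.getSelection().anchorNode) would list any links enclosing the start of the selection, while the same call with '//a' would list every link in the document regardless of the selection.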

REQUIREMENTS

Yosemite
- The core script readXPathMatches.scpt is written in JavaScript for Automation

INSTALLATION

Drag the .zip file onto the Keyboard Maestro icon in the OS X Dock.
(If updating a previous version of the action, first manually remove the previous copy from the custom actions folder.)
- ~/Library/Application Support/Keyboard Maestro/Keyboard Maestro Actions

CONTACT

Rob Trew - Twitter @ComplexPoint

ComplexPoint · September 1, 2015, 12:42pm

For reference, unminified .js source code:

// Rob Trew, Twitter @ ComplexPoint 2015.
// ( the mdString() function includes code adapted from David Bengoa's https://gist.github.com/YouWoTMA/1762527 )
(function () {
  'use strict';

  function fnAttributes(strPath, strAnchor, strFormat, blnFirstOnly) {

    // PATH STARTS AT DOCUMENT ROOT, OR SELECTION ?
    var nodeAttribs = function (oNode) {
        var varType = (
          oNode ?
          oNode.nodeType :
          null
        ),
        varAttribs = (
          (varType === Node.ELEMENT_NODE) &&
          oNode.hasAttributes()
        ) ? oNode.attributes : null,
        i = varAttribs ? varAttribs.length : 0,
        dct = {};

        dct.name = oNode.nodeName;
        dct.text = (varType !== Node.DOCUMENT_NODE) ?
          (
          strFormat.indexOf('HTML') !== -1 ?
          (
            strFormat.charAt(0) === 'i' ?
            oNode.innerHTML : oNode.outerHTML

          ) : (
            strFormat.charAt(0) === 'M' ?
            mdString(oNode, strHost) :
            oNode.textContent.replace(/\s+/g, " ")
          )
        ) : '';

        while (i--) {
          dct[varAttribs[i].name] = varAttribs[i].value;
        }
        return dct;
      },

      // The mdString() function includes code adapted from https://gist.github.com/YouWoTMA/1762527
      mdString = function (oNode, strHost) {

        function nodeMD(oNode, strContext) {

          function mdEscaped(text) {
            return text ? text.replace(/\s+/g, " ").replace(
              /[\\\-*_>#]/g, "\\$&"
            ) : '';
          }

          function nreps(s, n) {
            var o = '';
            if (n < 1) return o;
            while (n > 1) {
              if (n & 1) o += s;
              n >>= 1;
              s += s;
            }
            return o + s;
          }

          function chilnMD(oNode, strContext) {
            return Array.prototype.slice.call(oNode.childNodes).reduce(
              function (strMD, n) {
                return strMD + nodeMD(n, strContext);
              }, ''
            );
          }

          var nl = "\n\n",
            strHref = '',
            rgxProtocol = /^(ht|f)tp(s?)\:\/\//,
            strTag = oNode.tagName,
            strTagName = strTag ? strTag.toLowerCase() : '',
            lngType = oNode.nodeType;

          if (lngType === Node.TEXT_NODE) {
            return mdEscaped(oNode.nodeValue)

          } else if (lngType === Node.ELEMENT_NODE) {

            if (strContext === "block") {
              switch (strTagName) {
              case "br":
                return nl;
              case "hr":
                return nl + "---" + nl;
                // Block container elements
              case "p":
              case "div":
              case "section":
              case "address":
              case "center":
                return nl + chilnMD(oNode, "block") + nl;
              case "ul":
                return nl + chilnMD(oNode, "u") + nl;
              case "ol":
                return nl + chilnMD(oNode, "o") + nl;
              case "pre":
                return nl + "    " + chilnMD(oNode, "inline") + nl;
              case "code":
                if (oNode.childNodes.length === 1) {
                  break; // use the inline format
                }
                return nl + "    " + chilnMD(oNode, "inline") + nl;
              case "h1":
              case "h2":
              case "h3":
              case "h4":
              case "h5":
              case "h6":
              case "h7":
                return nl + nreps("#", +strTagName[1]) + "  " + chilnMD(
                  oNode,
                  "inline") + nl;
              case "blockquote":
                return nl + "> " + chilnMD(oNode, "inline") + nl;
              }
            }

            // UL | OL
            if (/^[ou]+$/.test(strContext)) {
              if (strTagName === "li") {
                return "\n" + nreps("  ", strContext.length - 1) +
                  (strContext[strContext.length - 1] ===
                    "o" ? "1. " : "- ") + chilnMD(oNode, strContext + "l");
              } else {
                console.log("[toMarkdown] - invalid element at this point " +
                  strTagName);
                return chilnMD(oNode, "inline");
              }
            } else if (/^[ou]+l$/.test(strContext)) {
              return chilnMD(
                oNode,
                strContext.substr(
                  0, strContext.length - 1
                ) + (strTagName === "ul" ? "u" : "o")
              );
            }

            // IN-LINE
            switch (strTagName) {
            case "strong":
            case "b":
              return "**" + chilnMD(oNode, "inline") + "**";
            case "em":
            case "i":
              return "_" + chilnMD(oNode, "inline") + "_";
            case "code": // Inline version of code
              return "`" + chilnMD(oNode, "inline") + "`";
            case "a":
              return "[" + chilnMD(oNode, "inline") + "](" +
                (
                  strHref = oNode.getAttribute("href") || '',
                  rgxProtocol.test(strHref) ? strHref : (
                    (strHref && (strHref.charAt(0) === '#')) ?
                    strPageURL + strHref :
                    strHost + strHref
                  )
                ) + ")";
            case "img":
              return nl + "[_Image_: " + mdEscaped(oNode.getAttribute("alt")) +
                "](" +
                oNode.getAttribute("src") + ")" + nl;
            case "script":
            case "style":
            case "meta":
              return "";
            default:
              console.log("[toMarkdown] - undefined element " + strTagName)
              return chilnMD(oNode, strContext);
            }
          }
        }

        // Translated to Markdown
        // and LF sequences normalised
        function toMarkdown(oNode) {
          var strMD = nodeMD(oNode, "block");
          return strMD ? strMD.replace(/[\n]{2,}/g, "\n\n").replace(
            /^[\n]+/, "").replace(/[\n]+$/, "") : '';
        }

        /*******************/
        return toMarkdown(oNode);
      },


      oAnchor = (strAnchor === 'document') ?
      document : (
        (strAnchor === 'selection') ?
        window.getSelection().anchorNode : null
      ),

      // OR PATH STARTS AT MOUSE ?
      nh = oAnchor ? null : document.querySelectorAll(':hover'),
      iLast = (nh ? nh.length : null),
      nodeHover = iLast ? nh[iLast - 1] : null,

      // IF WE HAVE A STARTING POINT,
      // DOES THE PATH YIELD MATCHES THERE ?
      oRoot = oAnchor ? oAnchor : nodeHover,
      xr = oRoot ? document.evaluate(
        strPath,
        oRoot,
        null,
        blnFirstOnly ?
        XPathResult.FIRST_ORDERED_NODE_TYPE :
        XPathResult.ORDERED_NODE_ITERATOR_TYPE,
        null
      ) : null,

      // XPATHRESULTS --> [match] (list of any matches)
      nodesToRead = xr ? (
        // (an unmatched first-only query gives an empty list, not [null])
        blnFirstOnly ? (xr.singleNodeValue ? [xr.singleNodeValue] : []) :
        (function () {
          var lst = [],
            oNode = xr.iterateNext();

          while (oNode) {
            lst.push(oNode);
            oNode = xr.iterateNext();
          }

          return lst;
        })()
      ) : [],
      oLocn = window.location,
      strHost = oLocn.protocol + "//" + oLocn.host,
      strPageURL = document.URL;

    // HARVEST IN JSON FORMAT
    return JSON.stringify({
      'doc': {
        'URL': strPageURL,
        'title': document.title
      },
      'anchor': oRoot ? nodeAttribs(oRoot) : null,
      'xpath': strPath,
      'firstOnly': blnFirstOnly,
      'matches': nodesToRead.map(nodeAttribs)
    }, null, 2);
  }

  // Evaluate code for a function application to a named browser (Chrome | Safari)
  // fn --> [arg] --> strBrowserName --> a
  function evalJSinBrowser(fnMain, lstArgs, strBrowser) {

    var strFrontApp = strBrowser.indexOf(' or ') !== -1 ?
      Application("System Events").applicationProcesses.where({
        frontmost: true
      })[0].name() : '',
      strTarget = (
        strFrontApp && (
          ['Safari', 'Google Chrome'].indexOf(strFrontApp) !== -1
        )
      ) ? strFrontApp : (strBrowser !== 'Safari' ? 'Google Chrome' : 'Safari'),
      blnSafari = (strTarget === 'Safari'),
      appBrowser = Application(strTarget),
      lstWins = appBrowser.windows(),
      lngWins = lstWins.length,

      // an open window (new if none exists)
      oWin = lngWins && lstWins[0].id() !== -1 ?
      lstWins[0] : blnSafari ?
      appBrowser.Document().make() && appBrowser.windows[0] :
      appBrowser.Window().make(),

      strJS = [
        '(', fnMain.toString(), ').apply(null, ',
        JSON.stringify(lstArgs), ');'
      ].join('');
    return (
      blnSafari ?
      appBrowser.doJavaScript(
        strJS, {
          "in": oWin.currentTab
        }) :
      oWin.activeTab.execute({
        "javascript": strJS
      })
    );
  }

  /***** MAIN ***/
  var a = Application.currentApplication(),
    sysAttr = (
      a.includeStandardAdditions = true, a
    ).systemAttribute;

  return evalJSinBrowser(
    fnAttributes, [
      sysAttr("KMPARAM_XPath"),
      sysAttr("KMPARAM_applied_to"),
      sysAttr("KMPARAM_text_as"),
      sysAttr("KMPARAM_read") === 'first match'
    ],
    sysAttr("KMPARAM_browser")
  );
})();
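
The evalJSinBrowser() function above hands the browser a single string: the source text of fnMain, wrapped in parentheses and applied to a JSON-serialised argument list. A minimal sketch of that step in isolation, with hypothetical names (callAsString, fnDemo) and a local eval standing in for the browser tab:

```javascript
// Hypothetical helper, for illustration only: build the source text of a
// self-contained call, in the same way as strJS in evalJSinBrowser().
function callAsString(fnMain, lstArgs) {
    return [
        '(', fnMain.toString(), ').apply(null, ',
        JSON.stringify(lstArgs), ');'
    ].join('');
}

// A stand-in for fnAttributes: any function that uses only its arguments
function fnDemo(strPath, blnFirstOnly) {
    return JSON.stringify({
        xpath: strPath,
        firstOnly: blnFirstOnly
    });
}

var strJS = callAsString(fnDemo, ['//h1', true]);

// In the plug-in this string is sent to the browser tab;
// here a local eval plays that role.
var strResult = eval(strJS);
```

Because only the function's source text crosses over, fnMain cannot rely on variables from the enclosing script, and its arguments have to survive JSON.stringify.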

JimmyHartington · September 1, 2015, 12:54pm

Hi Rob

Which use cases do you have for this action?

ComplexPoint · September 1, 2015, 12:57pm

It's something I do a lot – just posted one example under 'macros': copying a para from the web, with link to doc, preceding heading, and datestamp.

It's a general scraping scalpel, for things like harvesting links, or watching changing data in a particular position on a particular site.

JimmyHartington · November 6, 2015, 11:41am

Hi Rob

I have used this today.
The output is usable, but it also seems to throw an error.

This is my macro:
http://cl.ly/343R0E1U0X2E/Read%20first%20Pakkeshop%20ID.kmmacros

And this the output:

2015-11-06 12:39:24.845 osascript[6948:46472] warning: failed to get scripting definition from /usr/bin/osascript; it may not be scriptable.
{
  "doc": {
    "URL": "https://customlocation.nokia.com/glsgroup/?lang=da_DK&country0=DK&xdm_e=https:2F%2Fgls-group.eu&xdm_c=default5939&xdm_p=1",
    "title": "GLS PakkeShop og depotsøgning"
  },
  "anchor": {
    "name": "#document",
    "text": ""
  },
  "xpath": "//*[@id=\"mainContent\"]/div/div[2]/div[1]/div[3]/div[2]/ul/li[1]/div[1]/div[1]",
  "firstOnly": true,
  "matches": [
    {
      "name": "DIV",
      "text": "PakkeShop-ID: 2080097171",
      "class": "psid"
    }
  ]
}

I expected to only get the PakkeShop-ID.

Now I can just search for it so that is fine.

But I wondered if this was an error on my system or something with the plugin.

ComplexPoint · November 6, 2015, 12:48pm

osascript is still a bit overhelpful with that warning …

The first thing I would try is to disable 'Include Errors' on that action (click the control at top right).
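
If the warning line still reaches the variable, one fallback is to discard everything before the first brace before parsing. A minimal sketch, assuming that the warning always precedes the JSON and contains no '{' itself (the helper name parseHarvest is hypothetical, and the sample string is abbreviated from the output above):

```javascript
// Hypothetical helper: parse the action's output even when osascript has
// prefixed a warning line to it.
// Assumption: any stray text comes BEFORE the JSON and contains no '{'.
function parseHarvest(strOutput) {
    var iStart = strOutput.indexOf('{');

    if (iStart === -1) return null; // no JSON object found at all
    try {
        return JSON.parse(strOutput.slice(iStart));
    } catch (e) {
        return null; // the text after the first '{' was not valid JSON
    }
}

var strSample = [
    'osascript[6948:46472] warning: failed to get scripting definition',
    '{ "firstOnly": true, "matches": [ { "name": "DIV", "class": "psid" } ] }'
].join('\n');

var dctHarvest = parseHarvest(strSample);
```

Switching off 'Include Errors' remains the cleaner fix; this only guards against the prefix when that is not possible.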

JimmyHartington · March 2, 2016, 2:43pm

I have now used it again and removed the error. But I still get this long list of information:

{
  "doc": {
    "URL": "https://balleskolen.m.skoleintra.dk/parent/23157/David/contacts/students/cards",
    "title": "Elever - ForældreIntra"
  },
  "anchor": {
    "name": "#document",
    "text": ""
  },
  "xpath": "//*[@id=\"sk-contact-card-container\"]/div/div[2]/div[2]/div[1]/div[1]/span[2]",
  "firstOnly": true,
  "matches": [
    {
      "name": "SPAN",
      "text": "Merete Høiby Hartington",
      "class": "sk-labeledtext-value"
    }
  ]
}

In this case I would only like to have the text value. Is there an easy way to do this with your plugin?

ComplexPoint · March 2, 2016, 11:27pm

You've got one more stage to go

The action returns its results (possibly including several matches) as a JSON string. See the synopsis on the readme sheet above, and the end of the main function in the action:

// HARVEST IN JSON FORMAT
return JSON.stringify({
    'doc': {
        'URL': strPageURL,
        'title': document.title
    },
    'anchor': oRoot ? nodeAttribs(oRoot) : null,
    'xpath': strPath,
    'firstOnly': blnFirstOnly,
    'matches': nodesToRead.map(nodeAttribs)
}, null, 2);

and that JSON string is what you are capturing in your variable.

Imagine you are capturing that JSON string in a KM variable called xPathHarvest.

You can now add one or more simple Execute JavaScript for Automation actions which extract the part(s) you want.

The key line in such a script is JSON.parse(strJSON), which reads the JSON string into an active JS object that you can extract things from or do further work on.

For example, you can extract the .text value of the first match with something like:

(function (strVarName) {
    'use strict';

    var strJSON = Application("Keyboard Maestro Engine")
        .variables.byName(strVarName)
        .value(),

        dctResult = JSON.parse(strJSON),

        lstMatches = dctResult.matches;


    return lstMatches.length ? lstMatches[0].text : undefined;

})('xPathHarvest');
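
The same pattern extends to every match, or to an attribute rather than the text. A sketch of the parsing step alone, assuming a hypothetical helper name (matchValues) and a sample string in the shape the action returns:

```javascript
// Hypothetical helper: list one property (by default .text) of every
// match in the JSON string returned by the action.
function matchValues(strJSON, strKey) {
    var dctResult = JSON.parse(strJSON),
        lstMatches = dctResult.matches || [],
        k = strKey || 'text';

    return lstMatches.map(function (dct) {
        return dct[k];
    });
}

// Sample input, abbreviated from the output shown earlier in the thread
var strSample = JSON.stringify({
    firstOnly: true,
    matches: [{
        name: 'SPAN',
        text: 'Merete Høiby Hartington',
        'class': 'sk-labeledtext-value'
    }]
});

var lstTexts = matchValues(strSample);            // the .text of each match
var lstClasses = matchValues(strSample, 'class'); // the class attribute
```

Inside a Keyboard Maestro action, strJSON would come from the KM variable as in the script above, and the list could be returned as one string with .join('\n').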

Save the output of that JavaScript for Automation action in another KM variable, such as PakkeShopID.

So the final stages of your macro might start to look something like:

JimmyHartington · March 3, 2016, 6:49am

Thanks. Now I understand the plugin.
And thanks for the script to extract the information.
It works now as expected.

kreal · December 27, 2018, 5:07pm

Hello @ComplexPoint!

The action works very well, but in some situations I have to use AppleScript instead of the action. Do you have an AppleScript to read XPath contents in Chrome and Safari?

Thanks in advance.

ccstone · December 28, 2018, 9:24am

Hey Kirill,

I thought we'd been over this before????

-Chris

Safari

----------------------------------------------------------------
# Auth: Christopher Stone & Rob Trew
# dCre: 2015/12/04 05:06
# dMod: 2018/12/28 03:14
# Appl: Safari
# Task: Get URLs from Google Search Results Page Using JavaScript & Xpath.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @Safari, @Google, @Links, @URLs, @JavaScript, @Xpath
----------------------------------------------------------------

set xpathStr to "//*[@class=\\'r\\']/a"

set jsCmdStr to "
var xpathResults = document.evaluate('" & xpathStr & "', document, null, 0, null),
  nodeList = [],
  oNode;

while (oNode = xpathResults.iterateNext()) {
  nodeList.push(oNode.href);
}

nodeList;
"

tell application "Safari"
   set linkList to (do JavaScript jsCmdStr in front document)
end tell

----------------------------------------------------------------

Google Chrome

----------------------------------------------------------------
# Auth: Christopher Stone & Rob Trew
# dCre: 2015/12/04 05:06
# dMod: 2018/12/28 03:17
# Appl: Google Chrome
# Task: Get URLs from Google Search Results Page Using JavaScript & Xpath.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @Google_Chrome, @Google, @Links, @URLs, @JavaScript, @Xpath
----------------------------------------------------------------

set xpathStr to "//*[@class=\\'r\\']/a"

set jsCmdStr to "
var xpathResults = document.evaluate('" & xpathStr & "', document, null, 0, null),
  nodeList = [],
  oNode;

while (oNode = xpathResults.iterateNext()) {
  nodeList.push(oNode.href);
}

nodeList;
"

tell application "Google Chrome"
   tell front window's active tab to set linkList to execute javascript jsCmdStr
end tell

----------------------------------------------------------------

kreal · December 28, 2018, 10:29am

Hey Chris!

I am using that script to get URLs from an XPath, not to get other attributes and the text of web items.

I would like to get the same result as I get from the custom action by @ComplexPoint, but using AppleScript, because I am going to check an XPath for specific text in three different browsers simultaneously, which is impossible with the custom action.

The AS result should be like this:

noisneil · April 10, 2023, 10:33am

Would it be difficult to get this to work in Brave too?

ccstone · April 11, 2023, 7:20am

You mean something like this?

** Note – the Google Search XPath has changed since the original script debuted.

--------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2015/12/04 05:06
# dMod: 2023/04/11 02:17
# Appl: Brave Browser
# Task: Get URLs from Google Search Results Page Using JavaScript & Xpath.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @Google_Chrome, @Google, @Links, @URLs, @JavaScript, @Xpath
# NOTE: Working macOS 10.14.6 with Brave Browser 112.1.50.114
--------------------------------------------------------

set xpathStr to "//*[@class=\"yuRUbf\"]/a"

set jsCmdStr to "
var xpathResults = document.evaluate('" & xpathStr & "', document, null, 0, null),
  nodeList = [],
  oNode;

while (oNode = xpathResults.iterateNext()) {
  nodeList.push(oNode.href);
}

nodeList;
"

doJavaScriptInBrave(jsCmdStr)

--------------------------------------------------------
--» HANDLERS
--------------------------------------------------------
on doJavaScriptInBrave(jsCmdStr)
   try
      tell application "Brave Browser" to tell front window's active tab to execute javascript jsCmdStr
   on error e
      error "Error in handler doJavaScriptInBrave() of library NLb!" & return & return & e
   end try
end doJavaScriptInBrave
--------------------------------------------------------

noisneil · April 11, 2023, 8:44am

Sorry, I wasn't very clear. I was wondering whether @ComplexPoint's plugin could be adapted to work with Brave.

ccstone · April 11, 2023, 9:09am

Sure.

Brave is Chrome-based, so executing JavaScript in it via Apple Events uses the same syntax as for Google Chrome.

It'd take a little fiddling to do yourself, but if you ask Rob nicely he might add that functionality for you.
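
For anyone attempting that fiddling: the browser is chosen in evalJSinBrowser(), where the frontmost application is checked against a fixed list of names. A sketch of how that choice might be extended, assuming that a 'Brave Browser' entry is added to the action's Browser menu (an assumption; the plug-in as posted has no such option) and using a hypothetical helper name:

```javascript
// Hypothetical rewrite of the target-selection logic in evalJSinBrowser(),
// with Brave added to the list of recognised browsers.
// strBrowser  : the Browser option chosen in the action
// strFrontApp : name of the frontmost application ('' if not needed)
function targetBrowser(strBrowser, strFrontApp) {
    var lstKnown = ['Safari', 'Google Chrome', 'Brave Browser'];

    // An '... or ...' option means: use the front app if it is a known browser
    if (strBrowser.indexOf(' or ') !== -1 &&
        lstKnown.indexOf(strFrontApp) !== -1) {
        return strFrontApp;
    }
    // Otherwise use the named browser, falling back to Chrome
    // (the original also falls back to Chrome)
    return lstKnown.indexOf(strBrowser) !== -1 ? strBrowser : 'Google Chrome';
}
```

The rest of evalJSinBrowser() would then need to treat Brave like Chrome (the activeTab.execute branch) rather than like Safari; that part is untested here.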