Read text and attributes of web items by XPath

webscrape
xpath

#1


Read XPath matches in Chrome or Safari.zip (11.4 KB)

Custom Keyboard Maestro Plug-in

NAME

  • Read (text and HTML attributes of) XPath matches in Chrome or Safari

VERSION

  • 0.1

SYNOPSIS

  • Returns a JSON list of HTML element properties for the first node (or all nodes) matched by the supplied XPath
  • The XPath can be applied to:
    • the root of the whole document,
    • the start of the current selection,
    • or the current hover position of the mouse.
  • In addition to the properties of each match, the JSON returned also includes:
    • The URL and title of the document, and
    • the element properties of any anchor node (selected, or under the mouse cursor) to which the XPath has been applied. ( See the applied to option below )
  • The text of the matches can be returned as Markdown, raw HTML, or plain unmarked text.
  • NB In Yosemite, Keyboard Maestro’s Execute JavaScript for Automation action returns a default ‘may not be scriptable’ warning from osascript when a compiled .scpt file is called.
    • This action will unfortunately fail (generating an unparseable string instead of JSON) unless you switch off “Include Errors” in the upper right cogwheel dropdown of the action.

OPTIONS

Note 1: an XPath applied to a mouse position or selection should start with a dot ./, referring to the selected or hovered node
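
As a rough illustration of Note 1, the sketch below prefixes a simple path with a dot when it is applied to a selection or hover position. The helper name relativeXPath is hypothetical and not part of the plug-in; the 'document' value is the one the plug-in source compares against. It only handles plain location paths, not parenthesised expressions such as (//a)[1].

```javascript
// Hypothetical helper (not part of the plug-in): make a simple XPath
// relative to its context node when the anchor is not the document root.
function relativeXPath(strPath, strAnchor) {
  'use strict';
  // Paths evaluated from the document root are left untouched.
  if (strAnchor === 'document') return strPath;
  // Already relative to the context node.
  if (strPath.charAt(0) === '.') return strPath;
  // '//a' becomes './/a', '/div' becomes './div', 'span' becomes './span'
  return (strPath.charAt(0) === '/' ? '.' : './') + strPath;
}
```

For example, relativeXPath('//a', 'selection') gives './/a', which matches links beneath the selected node rather than links anywhere in the page.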

Note 2: In Google Chrome, a simple absolute XPath for a selected element can be obtained with:
- Right-Click, Inspect Element, then
- Right-Click, Copy XPath

(The copied XPath will work in Safari, as well as in Chrome itself – Safari does not have its own Copy XPath feature)

REQUIREMENTS

  • Yosemite
    • The core script readXPathMatches.scpt is written in JavaScript for Automation

INSTALLATION

  • Drag the .zip file onto the Keyboard Maestro icon in the OS X toolbar.
  • (if updating a previous version of the action, first manually remove the previous copy from the custom actions folder)
    • ~/Library/Application Support/Keyboard Maestro/Keyboard Maestro Actions

CONTACT


#2

For reference, unminified .js source code:

// Rob Trew, Twitter @ ComplexPoint 2015.
// ( the mdString() function includes code adapted from David Bengoa's https://gist.github.com/YouWoTMA/1762527 )
(function () {
  'use strict';

  function fnAttributes(strPath, strAnchor, strFormat, blnFirstOnly) {

    // PATH STARTS AT DOCUMENT ROOT, OR SELECTION ?
    var nodeAttribs = function (oNode) {
        var varType = (
          oNode ?
          oNode.nodeType :
          null
        ),
        varAttribs = (
          (varType === Node.ELEMENT_NODE) &&
          oNode.hasAttributes()
        ) ? oNode.attributes : null,
        i = varAttribs ? varAttribs.length : 0,
        dct = {};

        dct.name = oNode.nodeName;
        dct.text = (varType !== Node.DOCUMENT_NODE) ?
          (
          strFormat.indexOf('HTML') !== -1 ?
          (
            strFormat.charAt(0) === 'i' ?
            oNode.innerHTML : oNode.outerHTML

          ) : (
            strFormat.charAt(0) === 'M' ?
            mdString(oNode, strHost) :
            oNode.textContent.replace(/\s+/g, " ")
          )
        ) : '';

        while (i--) {
          dct[varAttribs[i].name] = varAttribs[i].value;
        }
        return dct;
      },

      // The mdString() function includes code adapted from https://gist.github.com/YouWoTMA/1762527
      mdString = function (oNode, strHost) {

        function nodeMD(oNode, strContext) {

          function mdEscaped(text) {
            return text ? text.replace(/\s+/g, " ").replace(
              /[\\\-*_>#]/g, "\\$&"
            ) : '';
          }

          function nreps(s, n) {
            var o = '';
            if (n < 1) return o;
            while (n > 1) {
              if (n & 1) o += s;
              n >>= 1;
              s += s;
            }
            return o + s;
          }

          function chilnMD(oNode, strContext) {
            return Array.prototype.slice.call(oNode.childNodes).reduce(
              function (strMD, n) {
                return strMD + nodeMD(n, strContext);
              }, ''
            );
          }

          var nl = "\n\n",
            strHref = '',
            rgxProtocol = /^(ht|f)tp(s?)\:\/\//,
            strTag = oNode.tagName,
            strTagName = strTag ? strTag.toLowerCase() : '',
            lngType = oNode.nodeType;

          if (lngType === Node.TEXT_NODE) {
            return mdEscaped(oNode.nodeValue)

          } else if (lngType === Node.ELEMENT_NODE) {

            if (strContext === "block") {
              switch (strTagName) {
              case "br":
                return nl;
              case "hr":
                return nl + "---" + nl;
                // Block container elements
              case "p":
              case "div":
              case "section":
              case "address":
              case "center":
                return nl + chilnMD(oNode, "block") + nl;
              case "ul":
                return nl + chilnMD(oNode, "u") + nl;
              case "ol":
                return nl + chilnMD(oNode, "o") + nl;
              case "pre":
                return nl + "    " + chilnMD(oNode, "inline") + nl;
              case "code":
                if (oNode.childNodes.length === 1) {
                  break; // use the inline format
                }
                return nl + "    " + chilnMD(oNode, "inline") + nl;
              case "h1":
              case "h2":
              case "h3":
              case "h4":
              case "h5":
              case "h6":
              case "h7":
                return nl + nreps("#", +strTagName[1]) + "  " + chilnMD(
                  oNode,
                  "inline") + nl;
              case "blockquote":
                return nl + "> " + chilnMD(oNode, "inline") + nl;
              }
            }

            // UL | OL
            if (/^[ou]+$/.test(strContext)) {
              if (strTagName === "li") {
                return "\n" + nreps("  ", strContext.length - 1) +
                  (strContext[strContext.length - 1] ===
                    "o" ? "1. " : "- ") + chilnMD(oNode, strContext + "l");
              } else {
                console.log("[toMarkdown] - invalid element at this point " +
                  strTagName);
                return chilnMD(oNode, "inline");
              }
            } else if (/^[ou]+l$/.test(strContext)) {
              return chilnMD(
                oNode,
                strContext.substr(
                  0, strContext.length - 1
                ) + (strTagName === "ul" ? "u" : "o")
              );
            }

            // IN-LINE
            switch (strTagName) {
            case "strong":
            case "b":
              return "**" + chilnMD(oNode, "inline") + "**";
            case "em":
            case "i":
              return "_" + chilnMD(oNode, "inline") + "_";
            case "code": // Inline version of code
              return "`" + chilnMD(oNode, "inline") + "`";
            case "a":
              return "[" + chilnMD(oNode, "inline") + "](" +
                (
                  strHref = oNode.getAttribute("href") || '',
                  rgxProtocol.test(strHref) ? strHref : (
                    (strHref && (strHref.charAt(0) === '#')) ?
                    strPageURL + strHref :
                    strHost + strHref
                  )
                ) + ")";
            case "img":
              return nl + "[_Image_: " + mdEscaped(oNode.getAttribute("alt")) +
                "](" +
                oNode.getAttribute("src") + ")" + nl;
            case "script":
            case "style":
            case "meta":
              return "";
            default:
              console.log("[toMarkdown] - undefined element " + strTagName)
              return chilnMD(oNode, strContext);
            }
          }
        }

        // Translated to Markdown
        // and LF sequences normalised
        function toMarkdown(oNode) {
          var strMD = nodeMD(oNode, "block");
          return strMD ? strMD.replace(/[\n]{2,}/g, "\n\n").replace(
            /^[\n]+/, "").replace(/[\n]+$/, "") : '';
        }

        /*******************/
        return toMarkdown(oNode);
      },


      oAnchor = (strAnchor === 'document') ?
      document : (
        (strAnchor === 'selection') ?
        window.getSelection().anchorNode : null
      ),

      // OR PATH STARTS AT MOUSE ?
      nh = oAnchor ? null : document.querySelectorAll(':hover'),
      iLast = (nh ? nh.length : null),
      nodeHover = iLast ? nh[iLast - 1] : null,

      // IF WE HAVE A STARTING POINT,
      // DOES THE PATH YIELD MATCHES THERE ?
      oRoot = oAnchor ? oAnchor : nodeHover,
      xr = oRoot ? document.evaluate(
        strPath,
        oRoot,
        null,
        blnFirstOnly ?
        XPathResult.FIRST_ORDERED_NODE_TYPE :
        XPathResult.ORDERED_NODE_ITERATOR_TYPE,
        null
      ) : null,

      // XPATHRESULTS --> [match] (list of any matches)
      nodesToRead = xr ? (
        // singleNodeValue is null when nothing matches
        blnFirstOnly ? (xr.singleNodeValue ? [xr.singleNodeValue] : []) :
        (function () {
          var lst = [],
            oNode = xr.iterateNext();

          while (oNode) {
            lst.push(oNode);
            oNode = xr.iterateNext();
          }

          return lst;
        })()
      ) : [],
      oLocn = window.location,
      strHost = oLocn.protocol + "//" + oLocn.host,
      strPageURL = document.URL;

    // HARVEST IN JSON FORMAT
    return JSON.stringify({
      'doc': {
        'URL': strPageURL,
        'title': document.title
      },
      'anchor': oRoot ? nodeAttribs(oRoot) : null,
      'xpath': strPath,
      'firstOnly': blnFirstOnly,
      'matches': nodesToRead.map(nodeAttribs)
    }, null, 2);
  }

  // Evaluate code for a function application to a named browser (Chrome | Safari)
  // fn --> [arg] --> strBrowserName --> a
  function evalJSinBrowser(fnMain, lstArgs, strBrowser) {

    var strFrontApp = strBrowser.indexOf(' or ') !== -1 ?
      Application("System Events").applicationProcesses.where({
        frontmost: true
      })[0].name() : '',
      strTarget = (
        strFrontApp && (
          ['Safari', 'Google Chrome'].indexOf(strFrontApp) !== -1
        )
      ) ? strFrontApp : (strBrowser !== 'Safari' ? 'Google Chrome' : 'Safari'),
      blnSafari = (strTarget === 'Safari'),
      appBrowser = Application(strTarget),
      lstWins = appBrowser.windows(),
      lngWins = lstWins.length,

      // an open window (new if none exists)
      oWin = lngWins && lstWins[0].id() !== -1 ?
      lstWins[0] : blnSafari ?
      appBrowser.Document().make() && appBrowser.windows[0] :
      appBrowser.Window().make(),

      strJS = [
        '(', fnMain.toString(), ').apply(null, ',
        JSON.stringify(lstArgs), ');'
      ].join('');
    return (
      blnSafari ?
      appBrowser.doJavaScript(
        strJS, {
          "in": oWin.currentTab
        }) :
      oWin.activeTab.execute({
        "javascript": strJS
      })
    );
  }

  /***** MAIN ***/
  var a = Application.currentApplication(),
    sysAttr = (
      a.includeStandardAdditions = true, a
    ).systemAttribute;

  return evalJSinBrowser(
    fnAttributes, [
      sysAttr("KMPARAM_XPath"),
      sysAttr("KMPARAM_applied_to"),
      sysAttr("KMPARAM_text_as"),
      sysAttr("KMPARAM_read") === 'first match'
    ],
    sysAttr("KMPARAM_browser")
  );
})();
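
The string-building step in evalJSinBrowser can be tried on its own. The following is a minimal standalone sketch, not the plug-in itself: callSource and the sample function are illustrative names, and the string is evaluated locally here, whereas the plug-in hands it to Safari (doJavaScript) or Chrome (execute).

```javascript
// Standalone sketch of the string-building step in evalJSinBrowser:
// a function and a JSON-safe argument list become one source string.
function callSource(fnMain, lstArgs) {
  'use strict';
  return [
    '(', fnMain.toString(), ').apply(null, ',
    JSON.stringify(lstArgs), ');'
  ].join('');
}

// Illustrative function and arguments (not from the plug-in).
var strJS = callSource(
  function (strPath, blnFirstOnly) {
    return strPath + (blnFirstOnly ? ' (first match)' : ' (all matches)');
  },
  ['//h1', true]
);

// Evaluated locally for demonstration only; in the plug-in the same
// string runs inside the browser page.
var strResult = (0, eval)(strJS);
```

Because only the function's source text is sent, the function cannot rely on variables from the surrounding script, and its arguments have to survive JSON.stringify. That is presumably why fnAttributes receives everything it needs as plain strings and a boolean.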

Copy paragraph from web with heading, link, and date-time
#3

Hi Rob

Which use cases do you have for this action?


#4

It’s something I do a lot – just posted one example under ‘macros’: copying a para from the web, with link to doc, preceding heading, and datestamp.

It’s a general scraping scalpel, for things like harvesting links, or watching changing data in a particular position on a particular site.


#5

Hi Rob

I have used this today.
The output is usable, but it also seems to throw an error.

This is my macro:
http://cl.ly/343R0E1U0X2E/Read%20first%20Pakkeshop%20ID.kmmacros

And this the output:

2015-11-06 12:39:24.845 osascript[6948:46472] warning: failed to get scripting definition from /usr/bin/osascript; it may not be scriptable.
{
  "doc": {
    "URL": "https://customlocation.nokia.com/glsgroup/?lang=da_DK&country0=DK&xdm_e=https:2F%2Fgls-group.eu&xdm_c=default5939&xdm_p=1",
    "title": "GLS PakkeShop og depotsøgning"
  },
  "anchor": {
    "name": "#document",
    "text": ""
  },
  "xpath": "//*[@id=\"mainContent\"]/div/div[2]/div[1]/div[3]/div[2]/ul/li[1]/div[1]/div[1]",
  "firstOnly": true,
  "matches": [
    {
      "name": "DIV",
      "text": "PakkeShop-ID: 2080097171",
      "class": "psid"
    }
  ]
}

I expected to only get the PakkeShop-ID.

Now I can just search for it so that is fine.

But I wondered if this was an error on my system or something with the plugin.


#6

osascript is still a bit overhelpful with that warning …

The first thing I would try is to disable ‘include Errors’ on that action (click the control at top right)
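
If the warning line still ends up in the captured variable, one possible workaround is to discard everything before the JSON and then parse. This is only a sketch: parseHarvest is a hypothetical helper, and it assumes the first { in the output is the start of the JSON and that nothing follows the JSON.

```javascript
// Hypothetical helper: drop anything osascript wrote before the JSON
// (such as the "may not be scriptable" warning line), then parse.
function parseHarvest(strOutput) {
  'use strict';
  var iStart = strOutput.indexOf('{');
  if (iStart === -1) return null; // no JSON object found
  try {
    return JSON.parse(strOutput.slice(iStart));
  } catch (e) {
    return null; // text after the first brace was not valid JSON
  }
}

// Sample input modelled on the output posted above (JSON abbreviated).
var strSample =
  '2015-11-06 12:39:24.845 osascript[6948:46472] warning: failed to get ' +
  'scripting definition from /usr/bin/osascript; it may not be scriptable.\n' +
  '{"matches":[{"name":"DIV","text":"PakkeShop-ID: 2080097171","class":"psid"}]}';

var dctHarvest = parseHarvest(strSample);
```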


#7

I have now used it again and removed the error. But I still get this long list of information:

{
  "doc": {
    "URL": "https://balleskolen.m.skoleintra.dk/parent/23157/David/contacts/students/cards",
    "title": "Elever - ForældreIntra"
  },
  "anchor": {
    "name": "#document",
    "text": ""
  },
  "xpath": "//*[@id=\"sk-contact-card-container\"]/div/div[2]/div[2]/div[1]/div[1]/span[2]",
  "firstOnly": true,
  "matches": [
    {
      "name": "SPAN",
      "text": "Merete Høiby Hartington",
      "class": "sk-labeledtext-value"
    }
  ]
}

In this case I would only like to have the text value. Is there an easy way to do this with your plugin?


#8

You’ve got one more stage to go :slight_smile:

The action returns its results (possibly including several matches) as a JSON string. See the synopsis on the readme sheet above, and the end of the main function in the action:

// HARVEST IN JSON FORMAT
return JSON.stringify({
    'doc': {
        'URL': strPageURL,
        'title': document.title
    },
    'anchor': oRoot ? nodeAttribs(oRoot) : null,
    'xpath': strPath,
    'firstOnly': blnFirstOnly,
    'matches': nodesToRead.map(nodeAttribs)
}, null, 2);

and that JSON string is what you are capturing in your variable.

Imagine you are capturing that JSON string in a KM variable called xPathHarvest :

You can now add one or more simple Execute JavaScript for Automation actions which extract the part(s) you want.

For example:

The key line of this is JSON.parse(strJSON), which reads the JSON string into an active JS object, which you can extract things from or do further work on.

You can extract the .text value of the first match with something like:

(function (strVarName) {
    'use strict';

    var strJSON = Application("Keyboard Maestro Engine")
        .variables.byName(strVarName)
        .value(),

        dctResult = JSON.parse(strJSON),

        lstMatches = dctResult.matches;


    return lstMatches.length ? lstMatches[0].text : undefined;

})('xPathHarvest');

Save the output of that JS for Automation action in another KM variable, such as PakkeShopID.
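
The same idea extends to several matches or to other properties. The sketch below is an assumption-laden example rather than part of the plug-in: matchValues is a hypothetical helper, the sample JSON is abbreviated from the output posted earlier in the thread, and the digit-extracting pattern assumes the label format seen in that sample. Inside Keyboard Maestro, strJSON would be read from the KM variable as in the script above.

```javascript
// Sketch: pull one property out of every match in the harvested JSON.
// strKey may be 'text', 'name', or an HTML attribute name such as 'class'.
function matchValues(strJSON, strKey) {
  'use strict';
  var lstMatches = JSON.parse(strJSON).matches || [];
  return lstMatches
    .filter(function (dct) {
      return Object.prototype.hasOwnProperty.call(dct, strKey);
    })
    .map(function (dct) {
      return dct[strKey];
    });
}

// Sample input, abbreviated from the output posted earlier in the thread.
var strSample = JSON.stringify({
  firstOnly: true,
  matches: [{ name: 'DIV', text: 'PakkeShop-ID: 2080097171', 'class': 'psid' }]
});

var lstTexts = matchValues(strSample, 'text');

// Keep only the trailing digits (label format assumed from the sample).
var lstIDs = lstTexts.map(function (s) {
  var m = /(\d+)\s*$/.exec(s);
  return m ? m[1] : s;
});
```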

So the final stages of your macro might start to look something like:


#9

Thanks. Now I understand the plugin.
And thanks for the script to extract the information.
It works now as expected.


#10

Hello @ComplexPoint!

The action works very well, but in some situations I have to use AppleScript instead of the action. Do you have an AppleScript to read XPath contents in Chrome and Safari?

Thanks in advance.


#11

Hey Kirill,

I thought we'd been over this before????

-Chris


Safari

----------------------------------------------------------------
# Auth: Christopher Stone & Rob Trew
# dCre: 2015/12/04 05:06
# dMod: 2018/12/28 03:14
# Appl: Safari
# Task: Get URLs from Google Search Results Page Using JavaScript & Xpath.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @Safari, @Google, @Links, @URLs, @JavaScript, @Xpath
----------------------------------------------------------------

set xpathStr to "//*[@class=\\'r\\']/a"

set jsCmdStr to "
var xpathResults = document.evaluate('" & xpathStr & "', document, null, 0, null),
  nodeList = [],
  oNode;

while (oNode = xpathResults.iterateNext()) {
  nodeList.push(oNode.href);
}

nodeList;
"

tell application "Safari"
   set linkList to (do JavaScript jsCmdStr in front document)
end tell

----------------------------------------------------------------

Google Chrome

----------------------------------------------------------------
# Auth: Christopher Stone & Rob Trew
# dCre: 2015/12/04 05:06
# dMod: 2018/12/28 03:17
# Appl: Google Chrome
# Task: Get URLs from Google Search Results Page Using JavaScript & Xpath.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @Safari, @Google, @Links, @URLs, @JavaScript, @Xpath
----------------------------------------------------------------

set xpathStr to "//*[@class=\\'r\\']/a"

set jsCmdStr to "
var xpathResults = document.evaluate('" & xpathStr & "', document, null, 0, null),
  nodeList = [],
  oNode;

while (oNode = xpathResults.iterateNext()) {
  nodeList.push(oNode.href);
}

nodeList;
"

tell application "Google Chrome"
   tell front window's active tab to set linkList to execute javascript jsCmdStr
end tell

----------------------------------------------------------------

#12

Hey Chris!

I am using that script to get URLs from an XPath, not to get other attributes and the text of web items.

I would like to get the same result that the custom action by @ComplexPoint gives, but using AppleScript, because I am going to check an XPath for specific text in three different browsers simultaneously, which is impossible with the custom action.

The AS result should be like this: