Script to pull pub date and author(s) from URL in front window of Safari?

I’m a textbook writer.

When writing the footnote for a reference, I use KM to trigger AppleScripts that fetch the title and URL from the front window in Safari. This not only saves wear and tear on me, it eliminates typos and makes the references I use (from web sites like the Wall Street Journal, Bloomberg, New York Times, etc.) accurate. I trigger these KM macros within Pages or Adobe Acrobat DC.

Is there any way to use scripting to fetch the pub date or the author(s) from the front window in Safari?

There doesn’t seem to be a way to do this with AppleScript, which I’m capable of figuring out from others’ examples.

Over the last 10 evenings, I’ve tried to figure this out with JavaScript or Python, but I’m in over my head and can’t work out how to do this.

Is there an easy way to do this with JavaScript or Python?

Thanks! :slight_smile:

I have been wanting to get the same info for a long time. The main issue is that there does NOT appear to be any standard for publishing/identifying/labeling Author and Publication Date.

If the web sites of interest to you do have a standard format or standard HTML codes for these, let us know and maybe we can devise something.

It is actually pretty easy using RegEx to parse this data, as long as it is always in a standard format, like:

Author: John Smith
Date:  2017-04-15
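For what it's worth, here is a minimal JavaScript sketch of that kind of RegEx parsing. It assumes the text really does contain labeled lines exactly like the two above; the function name `parseLabeledFields` and the "UNKNOWN" fallback are made up for illustration.

```javascript
// Hypothetical helper: pull "Author:" and "Date:" values out of plain text.
// Assumes each label starts its own line and the date is in YYYY-MM-DD form.
function parseLabeledFields(text) {
    const authorMatch = text.match(/^Author:[ \t]*(.+?)[ \t]*$/m);
    const dateMatch = text.match(/^Date:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$/m);
    return {
        author: authorMatch ? authorMatch[1] : 'UNKNOWN',
        date: dateMatch ? dateMatch[1] : 'UNKNOWN'
    };
}

const sampleText = 'Author: John Smith\nDate:  2017-04-15\n';
console.log(parseLabeledFields(sampleText));
```

Real pages rarely label things this neatly, so the patterns would need adjusting for each site.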

From the Wall Street Journal:

By R.R. Reno
April 14, 2017 8:45 a.m. ET (I don’t need the timestamp, just the date).

Hopefully this is what you needed? There’s also dateCreated.

Thank you!

meta itemprop="datePublished" content="2017-04-14T12:45:00.000Z"

There’s also this JSON:

script type="application/ld+json"
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://www.wsj.com/articles/the-profound-connection-between-easter-and-passover-1492173908"
  },
  "headline": "The Profound Connection Between Easter and Passover",
  "image": {
    "@type": "ImageObject",
    "url": "https://si.wsj.net/public/resources/images/BN-SY639_EASTER_GR_20170413171020.jpg",
    "width": 1242,
    "height": 810
  },
  "author": {
    "@type": "Person",
    "name": "R.R. Reno"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Wall Street Journal",
    "logo": {
      "@type": "ImageObject",
      "url": "https://s.wsj.net/media/wsj_amp_masthead_lg.png",
      "width": 576,
      "height": 60
    }
  },
  "datePublished": "2017-04-14T12:45:00.000Z",
  "dateModified": "2017-04-14T12:45:00.000Z",
  "url": "https://www.wsj.com/articles/the-profound-connection-between-easter-and-passover-1492173908",
  "thumbnailUrl": "https://si.wsj.net/public/resources/images/BN-SY639_EASTER_GR_20170413171020.jpg",
  "dateCreated": "2017-04-14T12:45:00.000Z",
  "articleSection": "Life",
  "creator": ["R.R. Reno"],
  "keywords": ["christianity","easter","judaism","last supper","passover","passover seder","religion","political","general news","society","community"]
}

Is there a meta tag for author?
It is very easy to retrieve metadata.


script type="application/ld+json"
{

"author": {
"@type": "Person",
"name": "R.R. Reno"
},
"datePublished": "2017-04-14T12:45:00.000Z",
"dateModified": "2017-04-14T12:45:00.000Z",

Obviously both author and pub date are in the JSON.
I would expect there is a way to retrieve this using JavaScript; I just don't know how.

Maybe someone else who does will jump in here.
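One possible way to read those two fields from the JSON, sketched in JavaScript. The helper name `extractFromJsonLd` and the "UNKNOWN" fallback are assumptions for illustration, and the field names (`author.name`, `datePublished`) simply follow the WSJ sample quoted above; other sites may lay out their JSON differently.

```javascript
// Hypothetical helper: read author and publication date from JSON-LD text.
// Field names follow the WSJ sample in this thread, nothing more general.
function extractFromJsonLd(jsonText) {
    const unknown = { author: 'UNKNOWN', datePublished: 'UNKNOWN' };
    let data;
    try {
        data = JSON.parse(jsonText);
    } catch (err) {
        return unknown;   // not valid JSON
    }
    if (!data || typeof data !== 'object') {
        return unknown;
    }
    // "author" may be a single object, a plain string, or a list of either.
    const authors = [].concat(data.author || [])
        .map(a => (typeof a === 'string' ? a : a && a.name))
        .filter(Boolean);
    return {
        author: authors.length ? authors.join(', ') : 'UNKNOWN',
        datePublished: data.datePublished || 'UNKNOWN'
    };
}

// In a browser the JSON text would come from the page itself.
if (typeof document !== 'undefined') {
    const ldNode = document.querySelector('script[type="application/ld+json"]');
    console.log(ldNode ? extractFromJsonLd(ldNode.textContent) : 'No JSON-LD found');
}
```

The `typeof document` check only keeps the sketch from failing outside a browser; inside Safari it reads the first JSON-LD block on the page.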

The architecture of a web page includes standardised and universal representations of title and URL, but there are no universal conventions which a script can use in a search for a publication date or authorship field.

You could certainly write a script to 'scrape' that particular WSJ page, and with luck your script might conceivably work with some useful proportion of other WSJ pages (if they are adopting consistent internal conventions) but it wouldn't work with pages on other websites.

Thanks for the reply. I use a number of regular sources for 95% of what I write, so I’ve already got partial customized macros with AppleScripts for each of these. I get one to work and then duplicate and make adjustments for the others.

FYI, I’d figure out how to scrape the date first (and then authors), creating a separate macro for each. Then, I would trigger the date macro and the author macro as part of the customized multi-step macro I already use on a particular web site, for example to extract the data from Wall Street Journal articles.

Any advice on the easiest way to scrape these data? Between Python, JavaScript and other approaches, I’m not sure where to begin. What would you recommend for someone who (probably like most) is self-taught on this kind of stuff?

Thanks!

JavaScript is the only language which you can use to interrogate a browser about the active page.

It looks as if WSJ pages may tend to have <meta> tags in the <head> section which could contain some of the things you are looking for.

One avenue would be to search for XPath threads here:

https://forum.keyboardmaestro.com/search?q=xpath

and see if you can figure out how to use Keyboard Maestro Execute JavaScript in Safari actions, with XPath expressions to find the content you want.

(Not something I can personally spend any time on at the moment, I'm afraid)
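As a rough illustration of the XPath route, a sketch along these lines might serve as a starting point. The helper names are hypothetical, and the `itemprop="datePublished"` attribute is taken from the WSJ sample earlier in the thread, so treat it as an assumption about any other page.

```javascript
// Hypothetical helper: build an XPath expression that yields the content
// attribute of the first <meta> tag whose attribute equals the given value.
// Assumes attrValue contains no double quotes.
function metaContentXPath(attrName, attrValue) {
    return 'string(//meta[@' + attrName + '="' + attrValue + '"]/@content)';
}

// Evaluate the expression against a document (browser only).
function metaContentByXPath(doc, attrName, attrValue) {
    const result = doc.evaluate(
        metaContentXPath(attrName, attrValue),
        doc, null, XPathResult.STRING_TYPE, null
    );
    return result.stringValue || 'UNKNOWN';
}

console.log(metaContentXPath('itemprop', 'datePublished'));

// Only runs where a DOM is available, such as inside Safari.
if (typeof document !== 'undefined') {
    console.log(metaContentByXPath(document, 'itemprop', 'datePublished'));
}
```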

Do you have a WSJ URL I can use for testing?
It needs to work without a subscription to WSJ.

Thanks ComplexPoint and JMichaelTX!

This should work, How to Be the Best Deputy: When Second Best Is Best, https://www.wsj.com/articles/how-to-be-the-best-deputy-when-second-best-is-best-1492529374.

I logged out and that seemed to be the only page that was fully accessible. Was accessible in Safari, Chrome and Firefox without a sign in.

Yes. Tested using this URL:

Try this script in an Execute JavaScript in Safari action.
Note that there are several HTML meta tags that report publication date. You can choose the one you prefer.

## JavaScript to Extract Author and Pub Date from WSJ



'use strict';
(function run() {      // this will auto-run when script is executed

    var authorMeta = document.querySelector('meta[name="author"]');
    var authorStr = authorMeta ? authorMeta.content : "UNKNOWN";

    var datePubMeta = document.querySelector('meta[itemprop="datePublished"]');
    var datePubStr = datePubMeta ? datePubMeta.content : "UNKNOWN";

    //--- Example of a Bad Meta (not found) ---
    var badMeta = document.querySelector('meta[itemprop="BadMeta"]');
    var badStr = badMeta ? badMeta.content : "UNKNOWN";   // shown for illustration only; not returned

    return authorStr + "\n" + datePubStr;

}  // END of function run()
)();
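The datePublished value is a full ISO 8601 timestamp, while a footnote usually needs only the calendar date. A minimal sketch of trimming it down follows; the helper name `isoToCitationDate` and the "Month D, YYYY" output style are assumptions for illustration.

```javascript
const MONTH_NAMES = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
];

// Hypothetical helper: "2017-04-14T12:45:00.000Z" -> "April 14, 2017".
// Uses the date portion of the string exactly as written, so no
// time-zone conversion is applied.
function isoToCitationDate(isoString) {
    const parts = /^(\d{4})-(\d{2})-(\d{2})/.exec(isoString);
    if (!parts) {
        return 'UNKNOWN';
    }
    const monthName = MONTH_NAMES[Number(parts[2]) - 1];
    if (!monthName) {
        return 'UNKNOWN';
    }
    return monthName + ' ' + Number(parts[3]) + ', ' + parts[1];
}

console.log(isoToCitationDate('2017-04-14T12:45:00.000Z')); // April 14, 2017
```

Because the timestamp is in UTC, an article published late in the evening US Eastern time would carry the next day's date in this field, so it is worth checking against the date shown on the page.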

## Example Results

The querySelector approach is good, and although the META name= scheme for author and publication-date is probably rather WSJ-specific, it's possible that you could widen the hits by loosening the query criteria a little.

Something like this, for example (ES6 version only here, so you will need an up-to-date Safari), uses the *= selector to pick up content from meta tags which have author or published anywhere in the name value. You could, of course, cast nets more widely.


(() => {
    'use strict';

    // show :: a -> String
    const show = x => JSON.stringify(x, null, 2);

    return show(
        Array.from(document.querySelectorAll(
            'meta[name*="author"], meta[name*="published"]'
        ))
        .map(x => x.content)
    );
})();

Thanks again JMichaelTX and ComplexPoint.

Dumb problem on my part. When I test the JavaScript in Script Editor I get the same error (on different lines) with the scripts you wrote: "Can't find variable: document"

I thought that "Selecting Safari Tab 1" as step 1 of the KM macro, followed by either of your scripts would "create" the "document" by pointing the scripts to the front/current tab in Safari.

Then, I tried "do JavaScript"......

tell application "Safari"
tell front window
tell current tab
do JavaScript "JMichaelTX_script here OR ComplexPoint_script here"
end tell
end tell
end tell

No luck. So, I'm obviously not feeding the front tab URL to the script. I assume I'm making some simple, right-in-front-of-my-nose mistake.

Thanks again.

Script Editor JavaScript is not, alas, relevant to browser JavaScript - it doesn’t have any link to the browser’s DOM libraries.

The way to test these snippets is in a Keyboard Maestro Execute JavaScript in Safari action.

JavaScript is an embedded scripting language. The browser embedding is quite separate from and unlinked to the macOS system scripting embedding. Any link would actually be a more or less lethal security breach :slight_smile:


@ComplexPoint nailed it. The script was not intended to work directly in Script Editor.

Try this macro:

## Macro Library   @Meta Extract Author and Pub Date from WSJ Meta @Web @HTML @Example


#### DOWNLOAD:
<a class="attachment" href="/uploads/default/original/2X/4/41ff40a86d27c1adc2d0ca282afbd8307b91b680.kmmacros">@Meta Extract Author and Pub Date from WSJ Meta @Web @HTML @Example.kmmacros</a> (7.6 KB)

---

### Release Notes

Author: @JMichaelTX

**PURPOSE:**

* **Extract the Author and Publication Date from Meta tags in the WSJ**

HOW TO USE:

1. Open WSJ page in either Safari or Chrome
2. Trigger this Macro

**MACRO SETUP**

* **Carefully review the Release Notes and the Macro Actions**
  * Make sure you understand what the Macro will do.  
  * You are responsible for running the Macro, not me.  😉
.
* Assign a Trigger to this macro.  I prefer TBD.
* Move this macro to a Macro Group that is only Active when you need this Macro.
* Enable this Macro (if needed).
.
* **REVIEW/CHANGE THE FOLLOWING MACRO ACTIONS:**
(all shown in the magenta color)
  *


TAGS:  

USER SETTINGS:

* Any Action in _magenta color_ is designed to be changed by end-user

ACTION COLOR CODES

* To facilitate the reading, customizing, and maintenance of this macro,
      key Actions are colored as follows:
* GREEN   -- Key Comments designed to highlight main sections of macro
* MAGENTA -- Actions designed to be customized by user
* YELLOW  -- Primary Actions (usually the main purpose of the macro)
* ORANGE  -- Actions that permanently destroy Variables or Clipboards,
OR IF/THEN and PAUSE Actions

REQUIRES:

1.  Keyboard Maestro Ver 7.3+ (don't even ask me about KM 6 support).
2.  El Capitan 10.11.6+
  * It may work with Yosemite, but I make no guarantees.

**USE AT YOUR OWN RISK**

* While I have given this limited testing, and to the best of my knowledge it will do no harm, I cannot guarantee it.
* If you have any doubts or questions:
  * **Ask first**
  * Turn on the KM Debugger from the KM Status Menu, and step through the macro, making sure you understand what it is doing with each Action.


---

<img src="/uploads/default/original/2X/a/a204d70d1d06f1043dae6696836ee81f0fcc38ae.png" width="496" height="566">

Hmmm, that's what I'm doing. I pasted each of your scripts into Keyboard Maestro Execute JavaScript in Safari action.

Nothing happens. JavaScript is enabled in Safari/Preferences/Security. Perhaps I'm overlooking something else that I need to "toggle" to make this work?

Thanks again for your help, both of you.

I agree. Thanks for sharing.

However, this code:

meta[name*="published"]

will actually return a different meta tag than my code:

var datePubMeta = document.querySelector('meta[itemprop="datePublished"]')
var datePubStr = datePubMeta ? datePubMeta.content : "UNKNOWN"

So, I suppose it depends on which meta tags are acceptable.

Also, my code sets the result to "UNKNOWN" if the meta tag is not found.
Whereas your code returns nothing if the tag is not found.

Having said that, using the "name" attribute will likely be successful in more web sites. But I'm just guessing here. :wink:
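One way to get both behaviours, a wider net plus the "UNKNOWN" fallback, is to try several selectors in order and stop at the first hit. This is only a sketch: the helper name `firstMetaContent` and the selector lists are assumptions based on the tags mentioned in this thread.

```javascript
// Hypothetical helper: return the content of the first matching <meta> tag,
// trying each CSS selector in order, or the fallback if none match.
// "root" is anything with a querySelector method (normally `document`).
function firstMetaContent(root, selectors, fallback) {
    for (const selector of selectors) {
        const node = root.querySelector(selector);
        if (node && node.content) {
            return node.content;
        }
    }
    return fallback;
}

// Most specific selector first, looser match second.
const AUTHOR_SELECTORS = ['meta[name="author"]', 'meta[name*="author"]'];
const DATE_SELECTORS = [
    'meta[itemprop="datePublished"]',
    'meta[name*="published"]'
];

// Only runs where a DOM is available, such as inside Safari.
if (typeof document !== 'undefined') {
    console.log(
        firstMetaContent(document, AUTHOR_SELECTORS, 'UNKNOWN') + '\n' +
        firstMetaContent(document, DATE_SELECTORS, 'UNKNOWN')
    );
}
```

Putting the exact selector first keeps the WSJ behaviour unchanged, while the looser one only comes into play on pages that lack it.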

Checking the obvious: Do you have Safari opened to the WSJ page and is it frontmost?

If you want to run JavaScript in Safari from Script Editor, then you will need to use a script something like this.
Very Important: You must ESCAPE all double quotes in the JavaScript using \"


set jsStr to "
'use strict';

(function run() {    // this will auto-run when script is executed

var authorMeta = document.querySelector('meta[name=\"author\"]')
var authorStr = authorMeta ? authorMeta.content : \"UNKNOWN\"

return authorStr;

}  // END of function run()
)();
"

set scriptResults to my doJSInSafari(jsStr)

return scriptResults

on doJSInSafari(javascriptStr)
  try
    tell application "Safari" to do JavaScript javascriptStr in front document
  on error e
    error "Error in handler doJSInSafari()" & return & return & e
  end try
end doJSInSafari
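Escaping by hand gets error-prone with longer scripts, so a small helper can do it instead. The sketch below is in JavaScript; `toAppleScriptString` is a made-up name, and it assumes the only characters needing attention inside an AppleScript string literal are the backslash and the double quote.

```javascript
// Hypothetical helper: wrap JavaScript source as an AppleScript string literal.
// Backslashes are doubled first, then double quotes are escaped.
function toAppleScriptString(jsSource) {
    const escaped = jsSource
        .replace(/\\/g, '\\\\')
        .replace(/"/g, '\\"');
    return '"' + escaped + '"';
}

const sampleJs = 'document.querySelector(\'meta[name="author"]\').content';
console.log('set jsStr to ' + toAppleScriptString(sampleJs));
```

The printed line can then be pasted into the AppleScript in place of the hand-escaped string.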

Bingo! That's awesome!!!

I can take the rest from here.

Thank you to you both. It's obvious what a tremendous resource you are to the KM discussion forum - and the others to which you contribute as well.

I add/replace a thousand + citations every time I update a textbook, which is an annual process. And since I have several books and am working on other writing projects as well, this has enormous utility to me in terms of accuracy and wear and tear on my hands, wrists, and forearms.

When I get the big "compiled" KM macro working, I'll post the whole thing, including the individual macros that "fold up" into the larger KM macro.

That will be my small contribution to others to repay the assistance that the two of you have kindly shown me.

Thanks!! Good on you both!