Can I get the HTML source of the current page in Safari into Keyboard Maestro?

tjluoma · January 31, 2020, 7:58pm

There are times when I would like to take the page that I am viewing in Safari and "do something" with its HTML source.

I don't want to stick the URL into curl in Terminal because that might not get the same thing that I'm looking at, so I really want the actual HTML from Safari.

It seems to me that there should be a way to use JavaScript or something to say "Hey, take all of the HTML of this entire page and send it to this script / macro / whatever"…

…but I have no idea how one might do that.

Anyone already solve this before I try?

JMichaelTX · February 1, 2020, 12:59am

This should get the job done:

tell application "Safari"
  
  set oTab to current tab of window 1
  set HTMLStr to source of oTab
  
end tell

This may be the same as a curl command, but just in case, you can also consider this KM action:

Get a URL action.

tjluoma · February 1, 2020, 11:51am

Well… that was easy.™

In case anyone else comes across this and wonders, the AppleScript @JMichaelTX suggested does output the HTML source, which means that, in a shell script, I can capture the HTML source with this line:

SOURCE=$(osascript -e 'tell application "Safari"' -e 'set oTab to current tab of window 1' -e 'set HTMLStr to source of oTab' -e 'end tell')

Now the variable $SOURCE contains the HTML source of the current page, which can be utilized in ways that I am only now beginning to imagine.
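For anyone who wants to reuse that capture in several scripts, here is a minimal sketch that wraps it in a function. The name `safari_source` is hypothetical, and the checks assume that `osascript` exits non-zero when the AppleScript fails (for example, when Safari has no open window) and that an empty result means there is nothing useful to work with.

```shell
# Hypothetical helper: print the HTML source of Safari's front tab,
# or return non-zero if nothing usable came back.
safari_source() {
  # Same AppleScript as above, one -e per line of the script.
  html=$(osascript \
    -e 'tell application "Safari"' \
    -e 'set oTab to current tab of window 1' \
    -e 'set HTMLStr to source of oTab' \
    -e 'end tell') || return 1

  # Treat an empty result as a failure (assumption: an empty page is not useful).
  [ -n "$html" ] || return 1

  printf '%s\n' "$html"
}
```

A script could then start with `SOURCE=$(safari_source) || exit 1` and stop early instead of running grep and sed over an empty string.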

Thanks, @JMichaelTX.

tjluoma · February 1, 2020, 12:45pm

So here's the first use I've found for this new knowledge, which I thought I would share with the community.

How many times have you been on an Amazon.com page and wanted to share the name of the product and URL with someone?

For me, that happens a lot.

The first part is easy: the product name is basically the title of the web page (although, as you know, many of these titles are absurdly long and filled with keywords for SEO).

The second part is trickier. While you could just take the URL from your browser, most of the time it would be filled with a LOT of cruft that you neither want nor need. BUT! In the HTML source for every Amazon page is a line like this:

<link rel="canonical" href="https://www.amazon.com/MacBook-Release-Kuzy-Version-Display/dp/B07K8ZC6Y3" />

That <link rel="canonical" means that the URL that Amazon considers to be the “official” URL for this page is https://www.amazon.com/MacBook-Release-Kuzy-Version-Display/dp/B07K8ZC6Y3

Although, as many people know but many others do not, the part between amazon.com and /dp is merely descriptive. You can put anything in there, or nothing at all. For example, this URL will work:

https://www.amazon.com/i-like-bananas-and-puppies-and-sunsets-and-silly-examples/dp/B07K8ZC6Y3

as will this one:

https://www.amazon.com/dp/B07K8ZC6Y3
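Since the descriptive part is optional, it can be stripped mechanically. The following is only a sketch: the function name `short_amazon_url` is hypothetical, and it assumes the product ID is the single path segment that follows `/dp/`, as in the examples above.

```shell
# Hypothetical helper: reduce an Amazon product URL to its short /dp/ form.
# Assumes the product ID is the path segment right after "/dp/".
short_amazon_url() {
  # \1 = scheme and host, \3 = "/dp/" plus the product ID;
  # anything between the host and "/dp/" (and anything after the ID) is dropped.
  printf '%s\n' "$1" \
    | sed -E 's#^(https://[^/]+)(/.*)?(/dp/[^/?]+).*#\1\3#'
}
```

Calling `short_amazon_url "https://www.amazon.com/MacBook-Release-Kuzy-Version-Display/dp/B07K8ZC6Y3"` prints `https://www.amazon.com/dp/B07K8ZC6Y3`. A URL that is already in the short form, or that has no `/dp/` segment at all, comes back unchanged.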

So, I wanted to get the canonical URL. But how?

Previously, I used a shell script: I would press my Keyboard Maestro macro key, and the script would take the URL and send it to curl, which would try to fetch the same page that I was already looking at.

Obviously that's slow and inefficient, since I already have the information in Safari right now, but what was even worse is that (as you might expect) Amazon really, really, really does not want you doing any sort of automated “scraping” of their website (which is, in effect, what I was doing, although not for nefarious reasons). So my script would fail, often.

Now that I can use the HTML that is already in Safari, this is what I can do instead:

#!/usr/bin/env zsh -f

	# this just helps the shell find utilities
PATH="/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin"

	# this is AppleScript inside a shell script using `osascript`
	# each item inside "-e 'single quotes'" is the
	# same as if they were on separate lines in an
	# AppleScript script
SOURCE=$(osascript -e 'tell application "Safari"' \
		-e 'set oTab to current tab of window 1' \
		-e 'set HTMLStr to source of oTab' \
		-e 'end tell')

	# Now the variable '$SOURCE' has all of the HTML from Safari

	# This part says "use the variable '$SOURCE'"
	# narrow it down to just the line that matches '<link rel="canonical" href="'
	# then replace everything up to and including "https://www.amazon" with
	# "https://smile.amazon" (if you don't use "smile.amazon.com", you can
	# just replace the word "smile" with "www")
URL=$(echo "$SOURCE" \
	| fgrep '<link rel="canonical" href="' \
	| sed 's#.*https://www.amazon#https://smile.amazon#g ; s#" />##g')

	# This part says "use the variable '$SOURCE'"
	# narrow it down to just the line that matches '<title>'
	# then do the following:
	# remove the '<title>'
	# remove the '</title>'
	# remove "Amazon.com: "
	# replace any '&amp;' with '&'
TITLE=$(echo "$SOURCE" \
	| fgrep '<title>' \
	| sed 	-e 's#<title>##g' \
			-e 's#</title>##g' \
			-e 's#Amazon.com: ##g' \
			-e 's#&amp;#\&#g')

	# at this point, the variables we have are:

	# $SOURCE = the entire HTML of the page
	#  $TITLE = the full title of the page
	#    $URL = the official / canonical URL

	# Now, what do you want to do with those things?

	# for me, I want a Markdown link, which means that I want the
	# title in [brackets] and the URL in (parentheses)
	# and then I want that copied to the clipboard / pasteboard, so
	# I would use this:

echo -n "[$TITLE]($URL)" | pbcopy

	# the '-n' after 'echo' just says 'do not add a "newline" at the end'

	# if you wanted the script to output the $TITLE on one line
	# and the $URL on another, you could use this:
	# the '\n' says "add a line-break here"

# echo "${TITLE}\n${URL}"

	# technically the {braces} are not required, but I find it makes
	# it easier to read the two variables separated by the '\n'
	# compared to this

# echo "$TITLE\n$URL"

	# but, functionally, they are the same

	# again, we don't technically need this to end the script
	# but I find it makes a nice marker for "this is the end"
exit 0
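One way to check the extraction steps without Safari open is to put them into small functions that read HTML on standard input, so that they can be fed a saved page or a few sample lines. This is only a sketch under a few assumptions: the function names are hypothetical, the canonical link and the title are each assumed to sit on a single line (as in the Amazon example above), and the URL is left as `www` rather than rewritten to `smile`.

```shell
# Hypothetical helper: print the canonical URL found in HTML read from stdin.
canonical_url() {
  grep -F '<link rel="canonical" href="' \
    | head -n 1 \
    | sed -E 's#.*<link rel="canonical" href="([^"]*)".*#\1#'
}

# Hypothetical helper: print the page title found in HTML read from stdin,
# without the "Amazon.com: " prefix and with "&amp;" turned back into "&".
page_title() {
  grep -F '<title>' \
    | head -n 1 \
    | sed -e 's#.*<title>##' \
          -e 's#</title>.*##' \
          -e 's#^Amazon\.com: ##' \
          -e 's#&amp;#\&#g'
}

# Hypothetical helper: build a Markdown link from a title and a URL.
# printf adds no trailing newline here, like 'echo -n' in the script above.
markdown_link() {
  printf '[%s](%s)' "$1" "$2"
}
```

With the page source in `$SOURCE`, the pieces fit together as `markdown_link "$(printf '%s\n' "$SOURCE" | page_title)" "$(printf '%s\n' "$SOURCE" | canonical_url)" | pbcopy`.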

Anyway, I hope someone might find that useful, or at least interesting.

Shoshanna · February 1, 2020, 3:32pm

I'm not sure how I'd use this, but I'm sitting here trying to come up with reasons to, because it's just so cool. Thanks for posting such a well-commented script -- it's really helpful!

JMichaelTX · February 1, 2020, 9:56pm

I know you prefer to use shell scripts since you have a lot of knowledge and experience with them.

I have found for use cases like yours that a single JavaScript in Browser is the easiest and fastest way to get the info you want. In fact, I have a number of JavaScripts that I routinely use on the Amazon web site.

Also, you can use the Execute a JavaScript in Front Browser action so that it works with Safari, Google Chrome, and any of the Chromium-based browsers, like Brave Browser.

tjluoma · February 2, 2020, 2:05am

Well, my actual macro is more involved than just the part I shared here. But it starts with 3 simple JavaScript calls to get

window.location.hostname;
document.title
document.URL

and then those three items are each saved to a variable, which I send to my script and can do different things for different sites. For example, now I can use the JavaScript for the HTML source, but I only need to do that for Amazon pages, since usually just getting the regular URL is fine.

But my script is also easily expandable so if I find myself wanting to make changes to how it works in another site, I can do that based on the website. For example, if I wanted to change DaringFireball URLs to append a ".text" to the end to get the Markdown version, I can do that too.

Because https://daringfireball.net/2020/01/the_ipad_awkwardly_turns_10.text is much easier to read than https://daringfireball.net/2020/01/the_ipad_awkwardly_turns_10 … not to mention that when I save web pages, I generally want to save them as Markdown to get rid of all the other crap that infects most websites these days.
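The per-site idea can be sketched as a shell `case` statement keyed on the hostname. This is not the actual macro, just an illustration: the function name `rewrite_url` is hypothetical, it takes the hostname and URL as its two arguments, and it assumes every DaringFireball URL passed in is an article URL that has a `.text` version.

```shell
# Hypothetical helper: rewrite a URL depending on which site it belongs to.
#   $1 = hostname (e.g. from window.location.hostname)
#   $2 = URL      (e.g. from document.URL)
rewrite_url() {
  case "$1" in
    daringfireball.net|www.daringfireball.net)
      # Append ".text" to get the Markdown version, unless it is already there.
      case "$2" in
        *.text) printf '%s\n' "$2" ;;
        *)      printf '%s\n' "$2.text" ;;
      esac
      ;;
    *)
      # Any other site: the regular URL is fine as-is.
      printf '%s\n' "$2"
      ;;
  esac
}
```

Adding special handling for another site would mean adding another pattern above the catch-all `*)` branch.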

JMichaelTX · February 2, 2020, 2:50am

These are available as KM tokens: FrontBrowser Title & URL tokens

It requires only one line to get the canonical link in JavaScript:

canonicalLink = document.querySelector('link[rel="canonical"]').href;
//-->"https://www.amazon.com/MacBook-Release-Kuzy-Version-Display/dp/B07K8ZC6Y3"

JavaScript also offers powerful Regex and array processing.

Of course, we all tend to use the tool we know best. I'm only providing the JavaScript solution for those that might prefer using JavaScript over shell scripts. One great feature of JavaScript is the ability to easily build (with autocomplete) scripts and test on the live web page using the Chrome or Safari JavaScript consoles.

And since, by some accounts, JavaScript is one of the world's most common languages, it is easy to find documentation, support, and examples on the Internet.

tjluoma · February 2, 2020, 2:44pm

Yeah, JavaScript is definitely the better tool here. And I’ve been meaning to learn it for years and years. It’s probably time to move that closer up the “to do” list.

tjluoma · February 3, 2020, 2:46am

Oh! I just remembered that I have free access to Lynda.com through the NY Public Library (something I learned thanks to someone on the MPU forum -- anyone who lives in NY can get access to the NYPL and anyone with NYPL access can get access to Lynda).

So I've added a couple JavaScript intro classes to my playlist.
