Search for a source code inside HTML Document

Quasar_Pulsar · June 14, 2019, 2:00am

I am trying to locate the HTML Document that contains "presentation".

Here is a link to the code of the page whose HTML document I am trying to get. The source code that I am looking for is shown in the image.

Webpage Code

I made a short version of the target page source:
Short Version

I already tried this AppleScript code:

tell application "Safari"

    set keyword to "presentation"

    set myWindow to current tab of first window
    activate

    set pageContent to do JavaScript ("window.document.documentElement.outerHTML") in myWindow

    if pageContent contains keyword then
        return "found it"
    else
        return "not found"
    end if
end tell

But I can't find the code that I want.

Here is the Code that I am looking for:

<img class="img-responsive img-rounded mx-auto d-block" id="imageZoomSource" src="/images/filestore/2/8/2/3/0_3b239de3b91c562/28230scr_9c23b47500062f0.jpg" alt="97-08 VN1500 CLSS GANGSTER RR FENDER" height="500">
<img role="presentation" alt="" src="https://dealers.partscanada.com/images/filestore/2/8/2/3/0_3b239de3b91c562/28230scr_9c23b47500062f0.jpg" class="zoomImg" style="position: absolute; top: -375.34375px; left: -275.53469397254304px; opacity: 0; width: 832px; height: 800px; border: none; max-width: none; max-height: none;">

I want to get the source code of the image, or the whole document where that code can be found, and store it in a variable.

CJK · June 14, 2019, 3:18pm

This looks like you're wanting to retrieve the HTML source code for the whole page. In many cases, you can just use Safari's source property, which is a property of tab and document objects in its AppleScript dictionary.

For example:

tell application "Safari"
    get the source of the front document
end tell

which is largely equivalent to:

tell application "Safari"
    get the source of the current tab of the front window
end tell

If, for some reason, this version of the source code doesn't contain up-to-date elements (say, because client-side DOM manipulation changed the page without a refresh), or it otherwise differs slightly or significantly from what you see in the developer console, then your JavaScript code is essentially fine.

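While it's very difficult to make good sense of the link to the HTML source you provided (it fails to render anything meaningful), what I do notice is that the img element you're directing us to appears to reside in what I'll loosely call a shadow DOM, indicated by the #document node. (Strictly speaking, a #document node beneath a frame marks a nested document rather than a true shadow DOM, which hangs off a shadow root, but the practical effect here is similar.) This is an encapsulated DOM hierarchy that is distinct from (although still attached to) the main DOM.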

This would explain why your JavaScript call didn't find what you were looking for, as calls made to the main document object are specific to its DOM tree. To illustrate, try running this command, which one would typically expect to return the img element in question:

document.querySelector('#imageZoomSource');

My guess is that it won't, and you'll receive a null return value.

If it does return an element, my second guess is that it will be a different element from the one you want, and that the main DOM happens to have an <img> element that shares that specific id attribute. id attributes are meant to be unique within a DOM tree, but since a shadow DOM has a separate DOM encapsulation, it can use any id values that it itself doesn't already use, even if the main document object does.

To access the elements inside a shadow DOM, you generally want to access its containing element inside the main DOM, then switch DOMs at that node. Here is the block of HTML that's relevant:

<div class="fancybox-content" style="width: 921.3333339691162px; max-width: 900px;
 max-height: 700px; padding: 1em; height: 126.13020896911621px;" 
 class="fancybox-iframe" frameborder="0" vspace="0" hspace="0" 
 webkitallowfullscreen mozallowfullscreen allowfullscreen allowtransparency="true"
 src="/common/inventory_item_images.php?sku=1402-0264" vspace="0" 
 webkitallowfullscreen>
		#document
			<!DOCTYPE html>
			<html lang="en" class="fontawesome-i2svg-active fontawesome-i2svg-complete gr__dealers_partscanada_com">
			<head>...</head>
			<body onload data-gr-c-s...
					.
					.
					.
				<img alt="97-08 VN1500 CLSS GANGSTER RR FENDER" class="img-responsive
				 img-rounded mx-auto d-block" height="500" id="imageZoomSource" 
				 src="/images/filestore/2/8/2/3/0_3b239de3b91c562/28230scr_9c23b47500062f0.jpg">
	
				<img role="presentation" alt src="https://dealers.partscanada.com/images/filestore/2/8/2/3/0_3b239de3b91c562/28230scr_9c23b47500062f0.jpg"
				 class="zoomImg" style="position: absolute; top: -35.34375px; left: -9.936655583424669px; opacity: 0; width: 832px; height: 800px; 
				 border: none; max-width: none; max-height: none;">
							.
							.
							.

I'm wondering if the first <div> tag was accidentally merged with a child <iframe> tag when you copied over some of the code, because the attribute list seems incongruous, and features attributes that are specific to <iframe> tags rather than <div>. I'm going to assume that this is the case, so I'll be referencing the non-apparent iframe element through its class:

document.querySelector('iframe.fancybox-iframe');

which will be the entry point into the shadow DOM. In the simple cases, you can access it through its contentDocument property:

document.querySelector('iframe.fancybox-iframe').contentDocument;

Then you can treat this object pretty much the same as you would the document object. For example, to retrieve the HTML source code for it:

const $document=document.querySelector('iframe.fancybox-iframe').contentDocument;
$document.documentElement.outerHTML

Or to retrieve the img element using the earlier technique applied to this shadow document object:

$document.querySelector('#imageZoomSource');

which is equivalent to:

$document.getElementById('imageZoomSource');

Obtaining its src attribute value, i.e. the URL to the image, is done like so:

$document.getElementById('imageZoomSource').src;

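The src attribute as written in the markup is a relative URL, namely "/images/filestore/2/8/2/3/0_3b239de3b91c562/28230scr_9c23b47500062f0.jpg". The .src property normally returns the URL already resolved to its absolute form, while getAttribute('src') returns the raw attribute value; if you end up with the relative form, just remember to prepend the scheme and domain, which will be "https://dealers.partscanada.com".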
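As a minimal sketch of that last step, the standard URL constructor can resolve a relative path against a base instead of concatenating strings by hand. The helper name resolveImageUrl and the BASE_URL constant are hypothetical; the base value is simply the scheme and domain mentioned above.

```javascript
// Assumed base, taken from the scheme and domain mentioned in the thread.
const BASE_URL = 'https://dealers.partscanada.com';

// Hypothetical helper: turn a raw src value into an absolute URL.
function resolveImageUrl(rawSrc, base = BASE_URL) {
  // The URL constructor resolves relative paths against the base and
  // leaves values that are already absolute unchanged.
  return new URL(rawSrc, base).href;
}
```

In the page, this could be called as resolveImageUrl($document.getElementById('imageZoomSource').getAttribute('src')), assuming $document has been defined as shown earlier.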

The second img element contains exactly the same image retrieved from the same URL, but its src attribute value provides the full (absolute) URL, which saves a job, I suppose. However, it doesn't have an id attribute, and both the class and role attributes sound non-specific enough that other elements might share these values, in which case the following snippet will return multiple elements and be of lesser value:

$document.querySelectorAll('img[role=presentation].zoomImg');
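If that selector does return several elements, one way to narrow them down is to keep the one whose URL path matches the path of the #imageZoomSource image, since both are said to point at the same file. This is only a sketch: findMatchingImage is a hypothetical helper, and the default base is the domain assumed above.

```javascript
// Hypothetical helper: from a list of candidate <img> elements, return the
// first one whose src path equals the path of the known source image.
function findMatchingImage(candidates, sourcePath, base = 'https://dealers.partscanada.com') {
  const wanted = new URL(sourcePath, base).pathname;
  // Array.from accepts a NodeList (as returned by querySelectorAll) or an array.
  const match = Array.from(candidates).find(
    (img) => new URL(img.src, base).pathname === wanted
  );
  return match || null;
}
```

A call might look like findMatchingImage($document.querySelectorAll('img[role=presentation].zoomImg'), $document.getElementById('imageZoomSource').getAttribute('src')).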

In more complicated situations, the shadow DOM can be kept from exposing itself or parts of itself to the parent, making it trickier to access its tree through JavaScript. Also, if the sub-document is loaded from a different domain, then it will be subject to cross-origin restrictions that further obstruct your API calls.
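A rough way to predict whether the cross-origin case applies is to compare the origin of the frame's src with the origin of the page. The sketch below is a heuristic only (it ignores things like the sandbox attribute), and both the helper name isSameOrigin and the page URL in the usage note are assumptions.

```javascript
// Hypothetical helper: does the frame's src resolve to the same origin
// (scheme + host + port) as the page that embeds it?
function isSameOrigin(frameSrc, pageUrl) {
  // Relative src values are resolved against the page URL first.
  const frameOrigin = new URL(frameSrc, pageUrl).origin;
  return frameOrigin === new URL(pageUrl).origin;
}
```

For the frame in question, isSameOrigin('/common/inventory_item_images.php?sku=1402-0264', 'https://dealers.partscanada.com/') evaluates to true, which suggests contentDocument should not be blocked on cross-origin grounds. For a cross-origin frame, contentDocument is null.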

In the middle between "simple" and "complicated", there's "slightly annoying" if the nested document object doesn't necessarily load at the same time as the main document object, meaning a JavaScript call acting upon the shadow document will initially fail (because the document object either doesn't exist or contains no content), in which case you need to attach an event listener to act on it as and when it does load. But we'll address that if it's needed.
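For reference, the event-listener approach could look something like the following sketch. whenFrameReady is a hypothetical helper, and the check for body content is an assumption to guard against a placeholder document that already reports itself as complete.

```javascript
// Hypothetical helper: run `callback` with the frame's document once it has content.
function whenFrameReady(iframe, callback) {
  const doc = iframe.contentDocument;
  // A placeholder about:blank document can also report "complete", so this
  // sketch additionally requires some body content before treating it as loaded.
  const loaded =
    doc !== null &&
    doc !== undefined &&
    doc.readyState === 'complete' &&
    doc.body &&
    doc.body.childElementCount > 0;

  if (loaded) {
    callback(doc);
    return;
  }
  // Otherwise wait for the frame's load event, then hand over its document.
  iframe.addEventListener('load', () => callback(iframe.contentDocument), { once: true });
}
```

Usage would be along the lines of whenFrameReady(document.querySelector('iframe.fancybox-iframe'), (doc) => console.log(doc.getElementById('imageZoomSource').src)). Because the callback may fire later, its result is not the return value of the script that registered it.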

Quasar_Pulsar · June 15, 2019, 4:23am

Hi,

Thank you so much for your reply.

I updated my post, added an image, and copied the code exactly as it is shown on my screen.

I tried all the code you recommended, but it still gives me a blank result.

Here is the code I tried.

    tell application "Safari"
    	
    	
    	set myWindow to current tab of first window
    	activate
    	
    	set pageContent to do JavaScript "$document.getElementById('imageZoomSource').src;" in myWindow
    	
    	
    end tell

Please have a look again. Thank you.

CJK · June 15, 2019, 7:06am

You haven’t defined the variable $document, so it’s not surprising you get no meaningful return value.

Quasar_Pulsar · June 15, 2019, 8:21am

Hi CJK, can you help me make the right code? I know little about JavaScript or AppleScript. I appreciate the time you give in helping to solve this problem I encountered.

Quasar_Pulsar · June 15, 2019, 12:36pm

Can you help me define the variable, please?

CJK · June 15, 2019, 9:35pm

Taking from my original reply:

CJK:

const $document=document.querySelector('iframe.fancybox-iframe').contentDocument;
$document.documentElement.outerHTML

Or to retrieve the img element using the earlier technique applied to this shadow document object:

$document.querySelector('#imageZoomSource');

which is equivalent to:

$document.getElementById('imageZoomSource');

Obtaining its src attribute value, i.e. the URL to the image, is done like so:

$document.getElementById('imageZoomSource').src;
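Put together, the definition and the lookup can live in one piece of script, so that $document exists at the point where it is used. The following is a sketch with added null checks; readZoomSourceUrl is a hypothetical helper name, and the selectors are the ones quoted above.

```javascript
// Hypothetical helper: define $document and read the image URL in one go.
// `doc` is the main document object of the page.
function readZoomSourceUrl(doc) {
  const frame = doc.querySelector('iframe.fancybox-iframe');
  // Bail out if the frame is missing or its document is not accessible.
  if (!frame || !frame.contentDocument) return null;

  const $document = frame.contentDocument;
  const img = $document.getElementById('imageZoomSource');
  return img ? img.src : null;
}
```

In Safari, the function text followed by readZoomSourceUrl(document); would be passed together as the single string given to do JavaScript, which avoids relying on anything defined in an earlier call.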

JMichaelTX · June 16, 2019, 2:02am

Can you please post the actual URL of the target page -- NOT the w3schools try it page.

Unless the image tag is hidden/protected in some way it should be easy to find it.

Quasar_Pulsar · June 16, 2019, 12:20pm

Michael, I will send you the link and login information.

CJK · June 17, 2019, 10:13am

Thank you for also sending the details to me.

The JavaScript code that gets the image source is this:

document.childNodes[1].querySelector('img.zoomImg').src;

Putting this into an AppleScript:

tell app "Safari"
    tell the front document to set imgsrc to do JavaScript ¬
    "document.childNodes[1].querySelector('img.zoomImg').src;"
end tell

This will store the URL to the image in the variable imgsrc.
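Note that src yields only the URL. If the element's own markup (the full img tag) is wanted as well, outerHTML on the same element provides it. A sketch, assuming the element is reachable from the node being queried; describeZoomImage is a hypothetical helper name.

```javascript
// Hypothetical helper: return both the URL and the markup of the zoom image,
// or null if no matching element is found under `root`.
function describeZoomImage(root) {
  const img = root.querySelector('img.zoomImg');
  if (!img) return null;
  return {
    src: img.src,        // the image URL
    html: img.outerHTML  // the element's markup, e.g. <img role="presentation" ...>
  };
}
```

Calling describeZoomImage(document.childNodes[1]).html would then give a string that can be tested for a keyword such as "presentation", whereas the URL on its own would not contain it.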

Quasar_Pulsar · June 17, 2019, 10:30am

Still not working. No result found

    tell application "Safari"
	tell the front document
		set keyword to "presentation"
		
		set imgsrc to do JavaScript
		"document.childNodes[1].querySelector('img.zoomImg').src;"
		
		if imgsrc contains keyword then
			return "I found it"
		else
			return "not found"
		end if
	end tell
end tell

and

    tell application "Safari"
	set keyword to "presentation"
	tell the front document to set imgsrc to do JavaScript ¬
		"document.childNodes[1].querySelector('img.zoomImg').src;"
	
	if imgsrc contains keyword then
		return "I found it"
	else
		return "not found"
	end if
end tell

and

CJK · June 17, 2019, 10:32am

Can you please re-read what I wrote?

Quasar_Pulsar · June 17, 2019, 10:47am

CJK, I don't know much about coding, but I will try to understand what you wrote above.
Is your code supposed to produce a result like this? This is what I want to get, the source code.

<img role="presentation" alt="" src="https://dealers.partscanada.com/images/filestore/1/2/6/7/4_8e0d3b52d1feb61/12674scr_605f6856265df8f.jpg" class="zoomImg" style="position: absolute; top: -182.31937499999998px; left: -230.17261095537197px; opacity: 0; width: 1219px; height: 632px; border: none; max-width: none; max-height: none;">

Thanks.

CJK · June 17, 2019, 12:42pm

I can very much appreciate you are a novice when it comes to scripting, but the two times you didn't manage to get something working weren't because your coding skills are lacking; they were because you didn't read what I wrote carefully when I said:

Then you did:

So while that's not something you can blame on anything inherently to do with scripting, I have to take responsibility and admit I've been equally as careless in a different way.

Because I've recently been experimenting with switching browsers, I did all my JavaScript inside Brave Browser, then just swapped out the name and appropriate AppleScript object names for Safari, thinking that—given the scripting and automation settings in both are set to equivalent settings—there wouldn't be any reason Safari wouldn't yield the same result as Brave.

Now I've physically gone in to run the JavaScript in Safari, I'm genuinely flummoxed that I can't access the shadow DOM at all, meaning I can't retrieve that image. And, right now, I haven't the foggiest idea why it works in one browser and not in the other, and it means I'm going to have to comb through every setting, likely via the defaults preferences .plist files, and discern what Brave is doing or isn't blocking that Safari isn't doing or is blocking.

My apologies for that, both for the careless aspect, but also for failing to get you what you want. As you've also sent the problem to JM to tackle, I wonder if he will have better luck. If so, let me know, and you can end up being the one helping me if you learn what settings (if any) were the culprit.
