Select text in HTML between <div> and </div>

paulsondervan · September 21, 2016, 8:42am

I have HTML-files that I have to clean from unwanted tags.
There are many special DIV tags (< div class=“xyz” >) in the HTML file.
How can I loop through the HTML file, select all the text between the < div class=“xyz” > and < /div > tags (tags included), and then remove the selected text?
Thank you in advance.

Av8tntek · September 21, 2016, 11:55am

Removed

DanThomas · September 21, 2016, 12:30pm

Well, there are HTML tools that work with the DOM that are better suited to this task. But here's something that might get you started.

Here's the regex string used in the second action:

(<div class=".*">)([^<]*)

Assumptions:

You didn't literally mean the class is "xyz". This example assumes any class name. However, if you really meant a literal class name, replace ".*" with what you really want.
No extra spaces in the divs, and they only have the class attribute.
I'm sure there are other assumptions I can't think of right now.

If this doesn't work for you as is, and experience tells me it won't because you probably didn't tell me everything, post more details.

paulsondervan · September 21, 2016, 12:58pm

Thank you Dan.
I’m using Coda as HTML editor.
The DIV tag does have spaces.
Here is an example.
(I added spaces in the tags to prevent them from being interpreted)
< div class=“xyz” >< a href=“http://example.com/img/img.jpg/” target=“blank” >< img border=“0” src="…/…/…/img/another_img.jpg"/ >< /a >< /div >
The text between the DIVs varies.
(“blank” does have an underscore in front of it)

And yes, all the classes to be removed have the same xyz name.

DanThomas · September 21, 2016, 3:24pm

OK, try this:

The regex is:

(?m)(<div class="xyz">)(.*)(<\/div>)

By the way, you can post HTML if you place three backticks on a line before and after the HTML. A backtick is the backwards single quote. Here's a picture of how I did the above regex:

If it doesn't work, please post the HTML it doesn't work on. Thanks.
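For anyone trying the same pattern outside Keyboard Maestro, here is a minimal JavaScript sketch. JavaScript regex literals do not accept the inline `(?m)` prefix, so the `m` flag is passed instead. The sample HTML and the names `sampleHtml`, `divPattern`, and `cleaned` are made up for illustration.

```javascript
// Hypothetical sample: one unwanted div on a single line, with content to keep around it.
const sampleHtml = [
  '<p>keep me</p>',
  '<div class="xyz"><a href="http://example.com/img/img.jpg/" target="_blank"><img border="0" src="img/another_img.jpg"/></a></div>',
  '<p>keep me too</p>',
].join("\n");

// Same pattern as above, with the (?m) prefix expressed as the "m" flag.
// "g" replaces every match; "." does not cross line breaks, so each match stays on one line.
const divPattern = /(<div class="xyz">)(.*)(<\/div>)/gm;

// Replacing each match with an empty string removes the tags and everything between them.
const cleaned = sampleHtml.replace(divPattern, "");
console.log(cleaned);
```

The two paragraphs survive and the line that held the div is left empty.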

paulsondervan · September 22, 2016, 12:07pm

Thank you Dan.
The problem is that the text between the div tags isn’t predictable and varies a lot.
Am I right that all the possible text has to be added to the Set Variable ‘htmlString’ to Text action?
It’s nearly impossible to add all the possible variations.
There will be hundreds of variations.

DanThomas · September 22, 2016, 1:39pm

In a word, no. What I provided doesn’t care what’s between the div tags. However, it does quit at the first closing div tag, so if the enclosed text contains div tags, then this won’t work.

JMichaelTX · September 22, 2016, 6:39pm

I think this might be very simple using JavaScript in Browser.
Can you please post a sample HTML file for testing? You probably need to zip it first.

JMichaelTX · September 22, 2016, 7:53pm

Based on limited testing, the below macro should work.

However, you must first remove all of the spaces from the pseudo divs, using something like TextWrangler.

Replace "< div" with "<div"
Replace "< /div>" with "</div>"

Then, open the revised document in Chrome, and run the below macro.

The macro outputs the revised HTML document to a KM Variable, and displays it in a Text Window. You can, of course, save the variable to a file, or copy the text in the window and paste into a new document.

You may want to change one line of the JavaScript:

divClassList[i].innerHTML = "[NOTHING]";

to

divClassList[i].innerHTML = "";

##Macro Library [WEB] Replace Content of HTML Elements [Example]


####DOWNLOAD:
<a class="attachment" href="/uploads/default/original/2X/2/21e9d5d4a66c9e94f00913fbe834770d51d31af1.kmmacros">[WEB] Replace Content of HTML Elements [Example].kmmacros</a> (2.8 KB)

---



###Script
```javascript

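// Collect every element with class "xyz" (a live HTMLCollection).
var divClassList = document.getElementsByClassName("xyz");

// Replace the content of each non-empty element; the div tags themselves stay in place.
for (var i = 0; i < divClassList.length; i++) {
  if (divClassList[i].innerHTML) {
    divClassList[i].innerHTML = "[NOTHING]";
  }
}

// Serialize the modified document.
var docHTML = document.documentElement.innerHTML;

// The final expression is the script's result, which the macro stores in a KM Variable.
docHTML;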
```
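The script above empties each div but leaves the tags in place. If the tags should go too, as the original question asked, a variant along these lines might work. This is only a sketch: `removeDivsByClass` is a hypothetical helper name, and it has not been tried inside the macro.

```javascript
// Hypothetical helper; "doc" is expected to be a DOM Document such as the page's `document`.
function removeDivsByClass(doc, className) {
  // querySelectorAll returns a static list, so removing nodes while iterating is safe.
  var matches = doc.querySelectorAll("div." + className);
  matches.forEach(function (el) {
    el.remove(); // detaches the element together with everything inside it
  });
  return matches.length; // number of divs that were matched
}

// In a browser: removeDivsByClass(document, "xyz");
// then read document.documentElement.innerHTML as in the script above.
```

Because whole elements are removed, a div nested inside another matching div disappears along with its parent.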

paulsondervan · September 22, 2016, 8:16pm

Thank you JMichaelTX.
I have decided to use the editor Brackets.
This editor can edit files in a folder and I can use regex.
With the regex <div class="xyz">(.*?)</div> I’m able to replace the whole text and modify a lot of files in one sequence.
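To show why the lazy `(.*?)` matters, here is a small JavaScript sketch comparing it with a greedy `(.*)`. The input line and the variable names are invented for the example.

```javascript
// Hypothetical input: two unwanted divs on one line with wanted content between them.
const line = '<div class="xyz">a</div><p>keep</p><div class="xyz">b</div>';

// Lazy quantifier: each match ends at the nearest closing tag.
const lazy = line.replace(/<div class="xyz">(.*?)<\/div>/g, "");

// Greedy quantifier: a single match runs to the last closing tag on the line.
const greedy = line.replace(/<div class="xyz">(.*)<\/div>/g, "");

console.log(JSON.stringify(lazy));   // "<p>keep</p>"
console.log(JSON.stringify(greedy)); // ""
```

The lazy version keeps the paragraph between the two divs; the greedy version swallows it.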

btw
The spaces in the div tags of my replies were placed to prevent interpretation of the tags.
The div tags in the files don’t have these spaces.

DanThomas · September 22, 2016, 8:57pm

Thanks for pointing out this editor. While I love Atom, Brackets appears to have some nice features for front-end web development. I'm going to give it a shot.

JMichaelTX · September 22, 2016, 11:17pm

As Dan pointed out earlier, won't that fail if you have a div inside of a div?

Your regex fails for me with this type of HTML:

<div class="xyz"> do some stuff
	<div class="xyz">this is a sub-div
	</div>
</div>

My JavaScript avoids that issue.
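A quick way to see the failure in JavaScript's regex flavor is sketched below, using the nested sample above. The variable names are made up, and other regex engines may behave differently.

```javascript
// The nested sample from above, joined into one string.
const nested = [
  '<div class="xyz"> do some stuff',
  '\t<div class="xyz">this is a sub-div',
  '\t</div>',
  '</div>',
].join("\n");

// Without the "s" flag, "." cannot cross line breaks, so nothing matches here.
const singleLine = nested.replace(/<div class="xyz">(.*?)<\/div>/g, "");

// With the "s" flag the match stops at the inner closing tag,
// leaving the outer closing tag behind.
const dotAll = nested.replace(/<div class="xyz">(.*?)<\/div>/gs, "");

console.log(singleLine === nested);  // true: the input is unchanged
console.log(JSON.stringify(dotAll)); // "\n</div>"
```

Either way the result is wrong: the text is untouched, or a stray closing tag is left in the file.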

paulsondervan · September 23, 2016, 5:51am

All the files are built the same and there are no div tags in a div tag.
Brackets works fine for me because I can handle many files in one sequence.

Joel_Rendall · September 30, 2017, 10:54am

Can this process be simplified using the new goodies added to 8.0? Trying to capture text from a web page from various parts of the DOM of a page. Thanks in advance!

JMichaelTX · September 30, 2017, 9:06pm

I have found using querySelector to be a very powerful, yet easy to use tool for extracting data from a web page.

If you search the forum for "querySelector", you should find a number of examples. A Google search for "JavaScript querySelector" will also provide lots of info.

If you have a specific question/task, please post in a new topic.
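As a rough illustration of the querySelector approach, the sketch below collects the text of every element that matches a CSS selector. It uses `querySelectorAll`, the multi-match sibling of `querySelector`. The helper name `textOfAll` and the example selector are hypothetical.

```javascript
// Hypothetical helper: "doc" is expected to be a DOM Document such as the page's `document`.
function textOfAll(doc, selector) {
  // Array.from turns the NodeList into an array and maps each element to its trimmed text.
  return Array.from(doc.querySelectorAll(selector), function (el) {
    return el.textContent.trim();
  });
}

// In a browser: textOfAll(document, "div.xyz a");
```

Passing the document in as a parameter keeps the helper easy to try on any page or frame.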
