Find word in html table and copy

Hi guys,

I need to search student profiles on an online database and see whether they're either 'Active in a program' or have 'Completed' it (see pic). If they have done either, I need to copy the description.

I would like to know how I could make KM find the word 'Active' and copy the description to its left, or, if that's not there, find the word 'Completed' and copy the description to its left.

Also it's an HTML table.

Thanks guys!

Hey Mel,

What browser are you using?

What does the text look like if you copy it and paste it into a plain-text editor like BBEdit?

This could be easy, or not so easy – but it’s impossible to say without being able to test with a real data sample.

-Chris

Hey Chris,

Thanks for your reply.

I’m using Chrome and when I copy and paste it to TextWrangler I get normal text… I unfortunately won’t be able to pop any real sample data here :frowning:

I’m guessing this involves finding the specific html tags?

Cheers
Mel

It is very hard to work with text without knowing the exact layout.

I suggest you post something here after suitably anonymising it. See:

for a discussion on how to anonymise data for this sort of purpose.

I believe when you select the rows of the table and copy them to the clipboard, you strip the HTML in Safari, Firefox and Chrome. So you end up with columns separated by tabs.

That's all you need.
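To show the idea outside the macro, here is a minimal JavaScript sketch of working with tab-separated rows. The sample rows, the function name `findDescriptions`, and the assumption that the status sits in the last column are illustrative; none of them come from the attached macro.

```javascript
// Hypothetical sample of what a copied two-column table might look like:
// one row per line, columns separated by tabs.
const clipboardText = [
  "Description\tStatus",
  "Program A\tActive in a program",
  "Program B\tWithdrawn",
  "Program C\tCompleted",
].join("\n");

// Return the first column of every row whose last column starts with
// "Active" or "Completed" (case sensitive).
function findDescriptions(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.split("\t"))
    .filter(cols => cols.length >= 2 &&
      /^(Active|Completed)/.test(cols[cols.length - 1].trim()))
    .map(cols => cols[0].trim());
}

const descriptions = findDescriptions(clipboardText);
console.log(descriptions.join("\n"));
```

Splitting on tabs rather than matching the whole line keeps the sketch usable if the table has extra columns between the description and the status.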

Taking your example data, the attached macro will report which Descriptions are Active or Completed in a window whose text you can copy.

As a proof, I've entered the data in the second action, so just hit the Control-Option-Period key to see it work.

But to actually use it, you should disable the second action and enable the first one, which reads the clipboard into a variable rather than the hard coded text.

You will have to select the rows and Copy them but if this works for you, you can start the macro off with a Copy action to save a step.

Find in Table.kmmacros (3.5 KB)

Hi Mrpasini,

Thank you so much! It worked like a gem! Very simple indeed. And thanks for the other quick responses, really makes life a lot easier!!!

Hey Mel,

Good. Sometimes tabular text can get all distorted when copied from a browser.

See Peter's related post.

Not in this case.

In the worst case I'd try grabbing the source and parsing the html.

But the preference in this case, if it works, is to get the actual page text without any HTML codes and parse that.

With any luck this will allow you to open a page in Chrome, and hit your keyboard shortcut without having to select any portion of the page first.

Scrape Google Chrome Text with a Regular Expression.kmmacros (2.8 KB)

It may require some tweaking to give exactly the output you want, but let's see if this works first.

-Chris

Here's a tested production macro that reads the current system clipboard for the table data rather than the test data.

It includes the Copy command so just select the rows before running the macro on real data.

Tested in Safari, Firefox, Opera Neon -- and Chrome.

Find Active or Completed in Any Table.kmmacros (2.6 KB)

Intrigued, I installed Chrome to try this out. It returned the whole innerText of my test page which included a two-column table whose second column had the “Active,” etc. labels.

But you’re using a different regex than I was. I was capturing the text before a tab followed by either the Active or Completed label (case sensitive):

var matches = pInnerText.match(/(.+?)\t(Active|Completed)/g);

And the captured text would be returned as matches[1], I believe.

Hey Mike,

Not as such, no... It's not quite that easy to return captures with match.

Using this page as an example: The Boom Table

var pInnerText = document.body.parentNode.innerText;
var regEx = /^([^\t\n\r]+)\t([^\t\n\r]+)\t(.+(?:Thermopylae|Roman).+)/igm;
var matches = pInnerText.match(regEx).map(e => e.replace(regEx, '$1\t$2\t$3'));
matches.join('\n');

Returning all 3 captures with tabs between.

This code works with both Safari and Google Chrome.
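As a smaller illustration of why `matches[1]` is not the first capture, here is a sketch that applies the same match-then-replace technique to the Active/Completed case. The sample text and variable names are hypothetical stand-ins for a real page's innerText.

```javascript
// Hypothetical stand-in for the page's innerText.
const sampleText = "Program A\tActive\nProgram B\tWithdrawn\nProgram C\tCompleted";
const rowRegEx = /^(.+?)\t(Active|Completed)/gm;

// With the g flag, match() returns the whole matched strings, not the captures,
// so index 1 is simply the second match.
const wholeMatches = sampleText.match(rowRegEx) || [];

// Re-applying the regex to each match and keeping only $1 pulls out capture 1.
const firstCaptures = wholeMatches.map(m => m.replace(rowRegEx, "$1"));

console.log(wholeMatches[1]);          // the second whole match
console.log(firstCaptures.join("\n")); // the descriptions only
```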

-Chris

Chris, if you are talking about JavaScript here, it is quite easy to return capture groups with matches:

var sourceStr = `
ABCD-123abc
Some text ABCD-abc234 some other text
ABCD-76y65yj90
ABCD-76yABC90 some text ABCD-code6547AA
`
var regEx = /(ABCD\-[^\s]+)/g;
var captureList = [];
var matchResults;

while (matchResults = regEx.exec(sourceStr)) {
   captureList.push(matchResults[1]);
}
captureList

//--- RESULTS ---
/*
["ABCD-123abc", "ABCD-abc234", "ABCD-76y65yj90", "ABCD-76yABC90", "ABCD-code6547AA"]
*/
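Where the JavaScript engine supports `String.prototype.matchAll` (ES2020, so possibly not in older browsers), the same capture-collecting loop can be sketched more compactly. The sample text and names below are hypothetical, shaped like the table rows discussed earlier in the thread.

```javascript
// Hypothetical tab-separated rows.
const tableText = "Program A\tActive\nProgram B\tWithdrawn\nProgram C\tCompleted";

// matchAll() requires the g flag and yields one match object per hit;
// index 1 of each match object is capture group 1.
const captured = Array.from(
  tableText.matchAll(/^(.+?)\t(Active|Completed)/gm),
  m => m[1]
);

console.log(captured);
```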

This one displays the results in an alert and copies them to the System clipboard for pasting.

This also refines the regexp so it works with tables with more than two columns, although the OP showed only two.

Execute a JavaScript in Safari.kmactions (1.1 KB)

Hmm, except nothing is written to the System clipboard. But you can select the results from the alert, so all is not lost.

Hey JM,

I don't agree that that is quite easy.

A one-liner would be easy. That is convoluted – although not ridiculously hard to do or to understand.

In the original context I believe Mike thought that the one line would return capture 1, and as I said it's not quite that easy.

Mike – please correct me if I'm wrong.

This is more simply done in Perl:

#!/usr/bin/env perl -sw

my $string = "
ABCD-123abc
Some text ABCD-abc234 some other text
ABCD-76y65yj90
ABCD-76yABC90 some text ABCD-code6547AA
";
my @strings = $string =~ m/(ABCD\-[^\s]+)/gi;
$, = "\n";
print @strings;

It's easier yet using AppleScript and the Satimage.osax.

set theString to "
ABCD-123abc
Some text ABCD-abc234 some other text
ABCD-76y65yj90
ABCD-76yABC90 some text ABCD-code6547AA
"
set foundStrings to find text "(ABCD\\-[^\\s]+)" in theString using "\\1" with regexp, all occurrences and string result

The downloadable AppleScript has the correct syntax:

An example of using the Satimage.osax to find text.scptd.zip (8.1 KB)

Using Safari instead of Chrome makes this a walk in the park when using the Satimage.osax.

Again using the Boom Table as source.

tell application "Safari" to set pageText to text of front document
set foundList to find text ".*life.*" in pageText using "\\0" with regexp, all occurrences and string result

-Chris

Chris,

I don't know what you mean by "convoluted".
It is standard, normal, JavaScript.

Your Perl and AppleScript (which really requires Satimage.osax) may seem simpler to you because you already understand them very well. You can't even do RegEx in normal AppleScript. You have to use either a scripting addition like Satimage.osax, or resort to the very complex ASObjC.

The choice of tool/language depends on both which does the best job, and which tools the user knows how to use. So I don't put down any particular tool, especially if I don't understand it very well.

I do have to say that JavaScript is probably more widely used and known than any of the other languages (Perl, AppleScript) that are mentioned here.

If anyone wants to know how to do something in JavaScript, then a simple google with "JavaScript" and the keywords (like "regex", "capture", etc) will give you many hits/solutions.

You need to add a line at the end of your JavaScript to return the results to KM.
Probably you could just add:
result

as the last line.

I have NOT tested this.
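A minimal sketch of that shape, assuming the script's last expression is what gets handed back to KM. The helper name `collectDescriptions` and the regex are hypothetical, not the contents of the attached action.

```javascript
// Hypothetical helper: collect the text before a tab that is followed by
// "Active" or "Completed", one description per line.
function collectDescriptions(pageText) {
  var found = [];
  var rowRegEx = /^(.+?)\t(Active|Completed)/gm;
  var m;
  while ((m = rowRegEx.exec(pageText)) !== null) {
    found.push(m[1]);
  }
  return found.join("\n");
}

// In a browser the text comes from the page; outside one, fall back to an
// empty string so the sketch still runs.
var pageText = typeof document !== "undefined" ? document.body.innerText : "";
var result = collectDescriptions(pageText);

// A bare expression on the last line, so its value is the script's result.
result;
```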

Yep, that does the trick (who'd a thunk it). I do find it easier to do this in Perl, as Chris mentioned, but I liked his innerText approach enough to fiddle around with the JavaScript.

Scrape Google Chrome Table.kmactions (1.1 KB)