How do I process the RTF data in a paste buffer?

I'm extracting and processing information using Keyboard Maestro from: http://www.tsunami.gov

If the site shows a tsunami warning, I email the warning to my iPhone. It's a fairly simple idea, but it's not quite as easy as it sounds.

I have two different approaches to this problem.

One is to use the KM JavaScript action to fetch the text I need. It seems to work, but I can't test it properly because I don't know what the page will look like when an actual warning appears. Here's the important action that makes this approach seem to work (finding the element name that you see below was tricky):

The other approach is to use the macOS Copy command to copy everything from the page and then filter out the information I don't want (or filter in the information I do want). This is difficult because (a) KM doesn't seem to have any actions that work on paste buffers containing RTF data, and (b) the macOS shell doesn't have any filters that work on RTF data either.

Am I overlooking a feature of KM or perhaps a feature in the Shell that will help me filter out the data I need from the paste buffer? I often overlook features. (I did notice that the "Write to File" action allows me to save a paste buffer in its RTF format, but that doesn't help me solve the problem.)

If not, I would suggest a new action for KM that lets me do some basic filtering on RTF data. I would like to strip out (or strip in?) either tables, or pictures/objects, or text larger than (or smaller than) a certain size. Obviously there are many other things that an RTF filter could filter on but those are the things I would use.

P.S. I did find three websites that claim to email tsunami warnings for free, but I'm a programmer, so I want to do this myself.

Hi Sleepy,

If you copy something (to the system clipboard) these actions might prove useful:

  • Search System Clipboard Using Regular Expression
  • Filter System Clipboard
  • Search and Replace System Clipboard

Cheers!

The “Filter” action doesn’t have any filters for filtering out graphics, or tables, or font sizes, or anything like that. Or were you suggesting that I save the clipboard as an RTF and then use the regular expression features of KM actions to read the raw RTF formatting commands?

No, there doesn’t seem to be a way to filter out images and such. But you should be able to pull out what you want with a regular expression.
Of course, you will need to know what you want to find in there.
You don’t have to save as RTF. Simply copy and then search the clipboard with the first action above.
Correct me if I’m wrong, but this should be possible.

Yes, that would work if I could get a list of all possible words for that site. But I doubt that’s possible. Each disaster is probably explained with unique words, e.g., “Pacific Tsunami Warning” or “Possible Atlantic Tsunami.” Unpredictable. That’s why I want to filter based on character size. The words that I want to extract are always the largest words on the page. I may be forced to save it as an RTF and parse all those ugly RTF formatting strings. I’ve solved tougher problems than this before.
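For what it's worth, parsing the raw RTF for font size may be less painful than it sounds, because RTF sets the size with a single control word, `\fsN`, measured in half-points. Below is a minimal sketch, not a full RTF parser: it assumes the raw RTF is already in a string (for example from `pbpaste -Prefer rtf` or the "Write to File" action), it drops `\'hh` and `\uN` character escapes, and it only skips a handful of common non-text groups. The function names and the sample string are made up for illustration.

```javascript
// Sketch only: a tiny RTF walker that pairs each text run with the \fsN
// value in effect. \fsN is the RTF font-size control word, in half-points.
const SKIPPED = new Set(['fonttbl', 'colortbl', 'stylesheet', 'info', 'pict']);
const TOKEN = /\\([a-zA-Z]+)(-?\d+)? ?|\\([^a-zA-Z])|([{}])|([^\\{}]+)/g;

function rtfRunsBySize(rtf) {
    const runs = [];
    const stack = [];   // saved state for each open { group
    let size = 24;      // RTF default when no \fs is given (12 pt)
    let skip = false;   // true while inside a non-text destination
    let text = '';

    const flush = () => {
        if (text.trim()) runs.push({ size, text: text.trim() });
        text = '';
    };

    for (const [, word, param, symbol, brace, chunk] of rtf.matchAll(TOKEN)) {
        if (brace === '{') {
            stack.push({ size, skip });
        } else if (brace === '}') {
            flush();                              // a group ends the current run
            const saved = stack.pop();
            if (saved) ({ size, skip } = saved);  // restore the outer state
        } else if (word === 'fs' && param !== undefined) {
            flush();
            size = Number(param);
        } else if (word === 'par' || word === 'line') {
            flush();                              // paragraph or line break
        } else if (SKIPPED.has(word) || symbol === '*') {
            skip = true;                          // font table, pictures, etc.
        } else if (!skip && ['\\', '{', '}'].includes(symbol)) {
            text += symbol;                       // escaped literal: \\ \{ \}
        } else if (!skip && chunk) {
            text += chunk.replace(/[\r\n]/g, ''); // raw newlines are not text
        }
    }
    flush();
    return runs;
}

// Return only the text runs set in the largest font size found.
function largestRuns(rtf) {
    const runs = rtfRunsBySize(rtf);
    const max = Math.max(...runs.map(run => run.size));
    return runs.filter(run => run.size === max).map(run => run.text);
}

// Hypothetical sample, not real data from tsunami.gov.
const sampleRtf = String.raw`{\rtf1\ansi{\fonttbl{\f0 Helvetica;}}\f0\fs24 Small print.\par{\fs48 Tsunami Warning}\par\fs24 More small print.}`;
console.log(largestRuns(sampleRtf));   // → [ 'Tsunami Warning' ]
```

Because the size is restored whenever a group closes, a headline wrapped in its own `{...}` group does not leak its size into the text that follows.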

Yes, that would be difficult indeed. Maybe someone with more expertise can chime in. Maybe you could grab the HTML and search for the HTML tags that set the character size?

Preliminary conversion from RTF to TXT?

The textutil shell command is the first option that comes to mind, but before that it may also be worth checking that there is definitely no plain text available on the clipboard:

That’s a new thought. I was thinking RTF because the paste buffer contains RTF, but yes if I save it as HTML I can choose to parse HTML instead. At least HTML is a standard.
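As a rough illustration of the HTML route: browsers often write inline styles into the HTML they place on the clipboard, so one option is to look for `font-size` declarations and keep the text with the largest value. The sketch below assumes sizes appear as inline `style="... font-size: Npx ..."` attributes, which may not hold for every page or browser, and it uses a regular expression rather than a real HTML parser, so nested tags of the same name will confuse it. The function names and sample markup are hypothetical.

```javascript
// Sketch only: regex-based, so it breaks on nested tags of the same name.
// Matches any element whose inline style declares a font-size in px.
const STYLED = /<(\w+)[^>]*style="[^"]*font-size:\s*([\d.]+)px[^"]*"[^>]*>(.*?)<\/\1>/gis;

// Pair each styled element's visible text with its declared size.
function textBySize(html) {
    return [...html.matchAll(STYLED)].map(([, , px, inner]) => ({
        size: Number(px),
        // Strip inner tags and collapse whitespace.
        text: inner.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim(),
    }));
}

// Return the text of the element(s) with the largest declared size.
function largestText(html) {
    const items = textBySize(html).filter(item => item.text);
    const max = Math.max(...items.map(item => item.size));
    return items.filter(item => item.size === max).map(item => item.text);
}

// Hypothetical sample markup, not copied from tsunami.gov.
const sampleHtml =
    '<p style="font-size: 12px">Routine bulletin</p>' +
    '<h2 style="color:red; font-size: 28px"><b>Possible Atlantic Tsunami</b></h2>';
console.log(largestText(sampleHtml));   // → [ 'Possible Atlantic Tsunami' ]
```

If the sizes turn out to live in a stylesheet rather than inline, this approach would need the class-to-size mapping resolved first.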

That’s a new lead, thanks. I know lots of UNIX commands but textutil is a new one to me. Looks interesting. This must be unique to macOS, as I think I would have noticed it on my work systems. It has lots of options some of which might actually help me here. Now I have homework to do.

Not sure which browser you are using, but I am generally finding that there is some ‘public.utf8-plain-text’ on the clipboard (in addition to the rtf and html) when I copy.

Might be worth trying this JavaScript for Automation fragment, just in case:

(() => {
    'use strict';

    ObjC.import('AppKit');

    return ObjC.unwrap(
        $.NSString.alloc.initWithDataEncoding(
            $.NSPasteboard.generalPasteboard
            .dataForType('public.utf8-plain-text'),
            $.NSUTF8StringEncoding
        )
    );
})();

PS: in an AS/JXA script, the theClipboard method of Standard Additions should, of course, also yield any textual version of the available pasteboard contents:

AppleScript

use scripting additions

the clipboard

JavaScript for Automation

(() => {
    // standardAdditions :: () -> Library Object
    const standardAdditions = () =>
        Object.assign(
            Application.currentApplication(), {
                includeStandardAdditions: true
            }
        );

    return standardAdditions().theClipboard();
})();

Bash command line

#!/bin/bash

osascript -e "the clipboard"

or

#!/bin/bash

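# Print the plain-text flavor directly; wrapping it in echo $(...) would collapse whitespace and line breaks
pbpaste -Prefer txt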

You provided lots of ideas, which I appreciate and will examine. Simply converting from RTF to TXT was already understood and does not help me with my second idea of identifying the required text based on font size. I’m still looking for a way to filter text based on font information.

Yes, I realised that too late, after pasting the plain text thoughts.

Good luck with your HTML parsing!

Rather than build your own warning/alert system, have you looked at using the one already provided? See:
Message Subscriptions at tsunami.gov

In my original post, when I said "I did find three websites that claim to email tsunami warnings for free, but I’m a programmer, so I want to do this myself," that was one of the three sites I had found. Some of the sites charge money for their subscriptions (apparently SMS isn't free). But the main reason I wanted to build my own solution is that I'm a programmer and maybe I can do better than they can.

Hey @Sleepy,

I don’t think I’d use Safari for this sort of job.

The overhead is big. You have refresh issues, etc.

I’d start out by polling the page directly with curl.

curl -Ls --user-agent 'Opera/9.70 (Linux ppc64 ; U; en) Presto/2.2.1' "http://www.tsunami.gov" \
| grep -i "statusHead"

If that didn’t work I’d think about using ASObjC to download and parse a webarchive of the page.

-Chris
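For anyone who would rather do the filtering step in JavaScript (for example inside a KM script action), the curl pipeline above can be approximated as follows. This is only a sketch: `statusHead` is the marker used in the curl example, but the assumptions that the marker and its text sit on the same line, and that stripping tags leaves a readable message, are untested guesses about the page. The function names are made up, and `pollStatus` relies on the global `fetch` available in Node 18 or later.

```javascript
// Sketch only: keep the lines that mention the marker (case-insensitive,
// like `grep -i`) and strip the HTML tags from them.
function statusLines(html, marker = 'statusHead') {
    const needle = marker.toLowerCase();
    return html
        .split(/\r?\n/)
        .filter(line => line.toLowerCase().includes(needle))
        .map(line => line.replace(/<[^>]*>/g, '').trim())
        .filter(Boolean);   // drop lines that were nothing but tags
}

// Fetch the page and return the matching lines (needs Node 18+ for fetch).
async function pollStatus(url = 'http://www.tsunami.gov') {
    const response = await fetch(url);
    return statusLines(await response.text());
}
```

Calling `pollStatus().then(console.log)` on a timer, and emailing only when the result changes, would avoid driving a browser at all.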

Hmm, that seems to work quickly and accurately. Thanks for teaching me.
