Need Help With a RegEx to Cleanup PDF Annotations; and Learning RegEx

thank you

Well, we would need to see the concrete source of the data,

but if, for example, it turned out that:

  • The annotation fields are always separated by tab characters,
  • and that the part you need is always just the last of those fields

then vanilla ( regex-free ) KM actions might let you fairly quickly:

Recast each tab-delimited field as a separate line:

image

and then let a for each action wind through those lines,
just giving each of them, in turn, a variable name like part.

image

so that final value with the name part would be the last segment of the string:

image
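
For anyone who wants to see the same regex-free idea outside Keyboard Maestro, here is a minimal Python sketch. It assumes the fields really are tab-separated; the sample line and the function name `last_segment` are made up for illustration and are not taken from the macro.

```python
def last_segment(line: str) -> str:
    """Return the text after the final tab (the whole line if there is no tab)."""
    part = ""
    # Mirror the KM steps: split into one field per item, then let `part`
    # take each value in turn, so it ends up holding the last one.
    for part in line.split("\t"):
        pass
    return part


# Hypothetical annotation line, assuming tab-delimited fields.
sample = "1\tHighlight\t2020-11-20, 20:20:30\tLorem ipsum..."
print(last_segment(sample))
```

The loop is only there to mirror the for each action; in Python, `line.rsplit("\t", 1)[-1]` would do the same job in one step.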

thanks a lot. I am trying it and will give you a follow-up

Just to keep it simple (I can later create a version with variables), I highlighted the whole text → copy → macro action below → paste

all tabs are replaced with line feed

I think that the text I end up with is basically the equivalent of cliplines.

After that I don't follow. Sorry. thank you

If:

  • the data is consistently tab-delimited
  • and the part you want is always after the last tab,

then this would return what you want in the part variable:

Last tab-delimited segment of a string.kmmacros (20.4 KB)

but it may simply be that that isn't exactly the pattern of the source data.

What application do these annotations come from ?

First, I would like to encourage you to keep using and learning Regular Expressions (RegEx).
IMO, it is one of the most powerful and useful languages, and can be used just about everywhere.

Now, to your request.
Actually, you were pretty close with:
^(?\d+)\s(?Highlight)\s\s(?\d{4}-\d+-\d+),\s(?\d+:\d+:\d+)\s(?.+)$

My solution is:
(?mi)^\d+.+\d{4}-\d{2}-\d{2},\h+\d{1,2}:\d{1,2}:\d{1,2}\h+(.+)

For details, see regex101: build, test, and debug regex

The key to developing a good reliable RegEx is in recognizing the patterns in the data.
In this case, here is what I see:

  1. Line always starts with one or more digits
  2. Some amount of variable text
  3. ISO Date
  4. Comma, then some horizontal space (TABs and/or SPACEs)
  5. Time
  6. Some horizontal space
  7. Finally the text you want to keep

Note that I used the following RegEx patterns:

  • \h+ to match one or more TABs and/or SPACES
  • \d{1,2} to match one or two digits

It might be tempting to include "Highlight" in the pattern, but that word might not apply to all annotations. In any case, it is not needed for a good match.
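
To make the seven pieces above easier to see, here is a hedged sketch that spells the same pattern out in Python's verbose mode, one piece per line. Python's `re` module does not support `\h`, so `[ \t]` is used as a stand-in for horizontal space; that substitution, the name `ANNOTATION`, and the sample line are assumptions for illustration, not part of the KM solution.

```python
import re

# Python's `re` has no \h, so [ \t] stands in for "horizontal space" here.
ANNOTATION = re.compile(
    r"""
    ^\d+                      # 1. line starts with one or more digits
    .+                        # 2. some amount of variable text
    \d{4}-\d{2}-\d{2}         # 3. ISO date
    ,[ \t]+                   # 4. comma, then horizontal space
    \d{1,2}:\d{1,2}:\d{1,2}   # 5. time
    [ \t]+                    # 6. horizontal space
    (.+)                      # 7. the text to keep (capture group 1)
    """,
    re.MULTILINE | re.IGNORECASE | re.VERBOSE,
)

sample = "1 Highlight 2020-11-20, 20:20:30 Lorem ipsum..."
match = ANNOTATION.search(sample)
kept = match.group(1) if match else None
print(kept)
```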

Solution

So, the KM solution is to use Search and Replace action:

Search FOR (using Regex)
(?mi)^\d+.+\d{4}-\d{2}-\d{2},\h+\d{1,2}:\d{1,2}:\d{1,2}\h+(.+)

Replace WITH
\1
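
As a rough cross-check outside KM, the same search and replace can be sketched in Python with `re.sub`. This is only an approximation of the KM action: `[ \t]` replaces `\h` because Python's `re` does not know `\h`, and the two-line test text is made up.

```python
import re

# The pattern from this post, with [ \t] in place of \h (a Python-only assumption).
PATTERN = r"(?mi)^\d+.+\d{4}-\d{2}-\d{2},[ \t]+\d{1,2}:\d{1,2}:\d{1,2}[ \t]+(.+)"

# Hypothetical input: one annotation line and one plain line.
text = "1 Highlight 2020-11-20, 20:20:30 Lorem ipsum...\ntesting regex"

# Replace each whole match with just capture group 1.
cleaned = re.sub(PATTERN, r"\1", text)
print(cleaned)
```

Lines that do not match the pattern, such as the second one here, are left untouched.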

Please feel free to ask any questions.

BTW, the RegEx101.com site is now back up.

You have a great solution. thanks VERY much. Works perfectly. Regex: I am trying but it's hard

thanks very much.

@JMichaelTX @ComplexPoint @tiffle
I don't know what Crystal Meth is like, but 2 days working on a regex must come close.

But hopefully, afterwards, you still remember what you’ve learned :wink:

Short regular expressions for personal use are still legal in most jurisdictions, but distribution of longer regexes really deserves some close attention from the regulatory authorities.

Personal desktop use for string mangling was never what Kleene's regular language formalism (see Regular language - Wikipedia) was intended for :slight_smile:


Sadly, whatever you wrote is seldom even legible the following morning, unless it was extremely short ...

Regular, but sub-economically time-consuming, and write-only.

Regular expressions are addictive, because they are:

  • interesting, and a source of pride if mastered,
  • the absolute peak of most casual scripters' technical aspirations,
  • fiddly, brittle and read-only enough to require a lot of wax on wax off practice.

(Which creates an insatiable hunger for more practice material, and a temptation to encourage others to become aspirant users, seeking help, and supplying material)

Like many such substances, best used in tiny (very short) quantities, and only after other possibilities have been exhausted.

The Jamie Zawinski point is not far from the mark:

Jamie Zawinski - Wikiquote

Two days is no joke ...

very funny !

Thanks to your email I will plunge into regex, starting with the basics and try to emulate your short and sweet approach.

Great! I have no doubt that you can become productive using RegEx in short order.
Not only is RegEx101.com a great place to develop and test RegEx, it also provides detailed explanations of a RegEx pattern. For example:

See regex101: build, test, and debug regex

Benefits of Regular Expressions

In spite of @ComplexPoint's campaign to urge everyone to not use RegEx, using false claims, Regex remains a very powerful language. Like any language, it requires study and practice to master. But there are many simple RegEx patterns that can solve many common problems.

As you know, I also like, and often use, both JavaScript and AppleScript. While string manipulation can easily be done with the powerful built-in functions of JavaScript, there are many string manipulations that can be easily done using one line of RegEx, but would require many, many lines of JavaScript.

Getting Started with Regular Expressions

Regular expressions (RegEx or RegExp) are extremely powerful, but have an initial steep learning curve that is often intimidating. But once you get over that initial hump, and you continue to write new RegExp, it will become much easier.

I do all of my RegEx development at this free website: RegEx101.com

You may also find these sites helpful:

Thanks very much for the detailed info. A source of frustration is the fact that a regex may work on regex101, but not in BBEdit or KM search replace regex.

IME, I have found Regex101.com and KM to be very consistent.
However, it does require that you set up Regex101.com correctly, set the needed Regex options in the Regex pattern, and use the proper KM settings.

1. Optimum Regex101.com settings

  • Set Regex options to "gu"
  • You can use this URL to do this: regex101: build, test, and debug regex
  • Do NOT use any other settings like "m" or "i"
    • These should be set in the Regex pattern itself:
    • (?mi)
  • Use the "PCRE" Flavor, which is very close to what KM uses

KM Settings

  • Be sure to choose "Regular expressions" or "matches" in Actions and Conditions
    • image
  • RegEx Options: Note that I have used "(?mi)" in all of the above examples, but you should only use the options that you need:
    • "m" -- Multiline so that "^" and "$" match on each line
    • "i" -- Case insensitive
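
A small sketch of what the two inline options change, written in Python because its `re` module accepts the same `(?m)` and `(?i)` prefixes. The test text is made up, and other engines may differ in details.

```python
import re

text = "1 first\n2 Second"

# Without (?m), ^ only matches at the very start of the text.
no_flags = re.findall(r"^\d+", text)

# With (?m), ^ also matches at the start of every line.
multiline = re.findall(r"(?m)^\d+", text)

# With (?i), letter case is ignored.
case_insensitive = re.findall(r"(?i)second", text)

print(no_flags, multiline, case_insensitive)
```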

BBEdit

  • I mostly use the "Find" dialog, which should be very straight-forward.
  • I have not observed any differences with Regex101.com
  • If you have specific examples that work differently, please post.

Please let me know if you have other questions.

Strange I posted a detailed reply, and even corrected a typo, and it has disappeared.

Yes - I saw your reply and in between then and now (about 30 minutes) it has gone! (It had a lot of images...)

thank you for your comment

So I hear the reply disappeared. All is fine now. Thank you Chris who went through all the trouble of reviving it.

Thanks again very much. I am very sorry to take so much of your time.

I am going crazy simply because I get different results with BBEdit, KM and Regex101

How can I even start to learn in those conditions?

That's what I did for 2 days, jumping from one to the other.

Please note that the BBEdit playground is useful because you can save the regex to a text factory.
The example below uses your excellent annotation clean regex.
Sorry for the many images

Objective
Convert
1 Highlight 2020-11-20, 20:20:30 Lorem ipsum...
testing regex
To
Lorem ipsum...
testing regex

Note that the testing regex line is there just so you don't think that I am simply deleting all characters before character no: 53

Regex used is your excellent one:
(?mi)^\d+.+\d{4}-\d{2}-\d{2},\h+\d{1,2}:\d{1,2}:\d{1,2}\h+(.+)

The regex101 link for the issue below is:

The KM macro works fine

BBEdit playground selects ALL the text including Lorem ipsum

Regex101 does nothing despite making sure I followed your settings guideline

Hey @ronald,

Well - a big part of the problem is that you don't know what you're doing.  :wink:

That makes the learning process more difficult. Believe me – I almost went bald when I first started trying to learn regex back about '96.

Of course it doesn't help that Keyboard Maestro and BBEdit and RegEx101 can use different syntaxes of regular expressions either.

However – the way you've got RegEx101 configured looks like it should work...

Except that you've got an extra linefeed in the regex pattern. (Good thing you posted a link to the actual saved page!)
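
To illustrate why a stray linefeed matters, here is a minimal Python sketch. Unless a free-spacing option is on, a linefeed in a pattern is a literal character that the text must also contain. Where exactly the extra linefeed sat in the saved page is not shown here, so this example simply assumes a trailing one; the shortened pattern and test string are made up.

```python
import re

text = "20:20:30 Lorem ipsum..."  # made-up one-line test string

clean_pattern = r"\d{1,2}:\d{1,2}:\d{1,2}[ \t]+(.+)"
pasted_pattern = clean_pattern + "\n"  # simulates a stray trailing linefeed

found_clean = re.search(clean_pattern, text) is not None
# The linefeed must now be matched literally, and the text has none.
found_pasted = re.search(pasted_pattern, text) is not None
# Stripping surrounding whitespace from a pasted pattern removes the linefeed.
found_stripped = re.search(pasted_pattern.strip(), text) is not None

print(found_clean, found_pasted, found_stripped)
```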

Your test set with the extra linefeed:

Your test set with the extra linefeed removed:

My test set, before I realized you had one saved already:

BBEdit's playground is selecting the entire text, because that's what you told it to do.

Note how I've added a replacement pattern \1, and it shows what the replacement text will be.

If you click the Next button the next line will be highlighted in bright yellow and will show what will be replaced there.
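
The difference between what gets selected and what gets kept can be sketched in Python, stepping through matches one at a time much like the Next button does. As before, `[ \t]` replaces `\h` for Python's `re`, and the test text is made up.

```python
import re

PATTERN = r"(?mi)^\d+.+\d{4}-\d{2}-\d{2},[ \t]+\d{1,2}:\d{1,2}:\d{1,2}[ \t]+(.+)"
text = "1 Highlight 2020-11-20, 20:20:30 Lorem ipsum...\ntesting regex"

results = []
for m in re.finditer(PATTERN, text):
    # group(0) is everything the pattern matched (what an editor highlights);
    # group(1) is only the captured tail (what \1 puts back).
    results.append((m.group(0), m.group(1)))

for whole, kept in results:
    print(repr(whole), "->", repr(kept))
```

The whole annotation line is matched, including the text to keep, which is why a find operation highlights all of it; only the replacement step narrows it down to group 1.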

With stuff like this you need to NOT knock yourself out. If you can't grok the problem/solution in 15-30-60 minutes then you need to reach out for some help.

I've spent hours banging my head against the wall, when what I really needed was a little advice from someone more advanced on the learning curve than me.

Over the years I've gotten pretty good at limiting the amount of blood spilled from all the banging.

-Chris

It's all crystal clear now, in terms of regex101, BBEdit and KM macro and what my mistakes were.

@JMichaelTX @ccstone

Thank you both for your advice.

In fact I tried learning regex a few times in the past, obviously unsuccessfully.

I think that a fundamental problem with regex is that all the online courses go through a dizzying number of expressions which I would forget as soon as I started the next chapter.

The reality is that the best time to learn regex is when I need it, i.e. when I need it to solve a specific problem. The range of regex expressions that I will end up using is very narrow, and much of what is taught in those courses I will never use.

That being said, I will try again.