Need Help With a RegEx to Clean Up PDF Annotations; and Learning RegEx

A problem with PDF annotations is that each annotation is preceded by a prefix, which makes for very tedious reading, so I want to delete all the prefixes from the text of an RTF file.

The following regex works:

^(?<PageNumber>\d+)\s(?<Highlight>Highlight)\s\s(?<Date>\d{4}-\d+-\d+),\s(?<Time>\d+:\d+:\d+)\s(?<Description>.+)$

With the following prefix:

10 Highlight	 2020-11-20, 21:48:47 Lorem ipsum...

But I would like it to work with 2 variants.
1- variant one:

90	Highlight		2020-11-20, 21:35:17 Lorem ipsum...

I think the difference is that my regex expects a single space after the first number, whereas this variant has a tab there.

2- variant two
Same as above but the name John Smith (ie author's name) is added followed by a tab:

2	Highlight	John Smith	2020-11-20, 17:19:26 Lorem ipsum...

Thank you very much.

Well, I’m not exactly sure what it is you’re trying to match with your complex regex but if it’s the stuff up to but not including “Lorem ipsum” this will do the job:

(\d.+?:\d\d )

So the match for the entire annotation would be:

^(\d.+?:\d\d )(.*)$

I think. (Can’t test it properly as regex101.com is down!) The second capture group is the contents of the annotation, the first is the prefix.
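One way to check this without regex101 is a few lines of Python. This is only a sketch: the sample line is the one from the original question, and Python's `re` module stands in for whatever regex engine your app uses.

```python
import re

# Sample line from the question (a space follows the seconds).
line = "10 Highlight\t 2020-11-20, 21:48:47 Lorem ipsum..."

# Group 1 = prefix up to and including ":dd "; group 2 = annotation text.
pattern = re.compile(r"^(\d.+?:\d\d )(.*)$")

m = pattern.match(line)
prefix, text = m.group(1), m.group(2)
print(repr(prefix))
print(repr(text))
```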

(\d.+?:\d\d )

thank you very much.
I really like your clean and simple solution. Aesthetically beautiful.
regex101.com is down so I tested it with BBedit's regex playground

I identified 3 problems, 2 of which I don't know how to solve:
1- \d\d ) → \d\d) - took out the space before ).
2- the regex does not highlight the seconds (:30 at the end)
3- there is a tab after the seconds which I also would like to highlight/delete

thanks again very much @tiffle. I have been working on this for 2 days !

With every additional minute, the likelihood increases (geometrically) that a non-Regex solution would have been much faster in human time ...

Usually better to just show us:

  1. Some representative (but anonymized) input samples
  2. The corresponding expected outputs.

(with a description of the software contexts – i.e. which apps are being used, both with input and with output)

In the examples you originally provided you didn’t say there was a tab between the seconds and Lorem ipsum... so I assumed it was a space which is why it doesn’t work.

So, as it’s a tab you can use this:

^(\d.+?:\d\d\t)(.*)$

or

^(\d.+?:\d\d\s)(.*)$

The first matches a tab specifically while the second matches any white space character (space, tab, ...)

My tip for regex here is this: there’s no need to match every single bit of the text provided you can uniquely match the beginning pattern and the end pattern; everything in between can be “soaked” up by .+?
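To see why the whitespace after the seconds matters as the end marker, here is a small Python sketch. The three sample lines are reconstructed from the question, assuming a tab after the seconds as described in this thread. The lazy `.+?` stops at the first place the end pattern fits, so without the `\s` it would stop after the minutes.

```python
import re

# Reconstructed samples; the tab after the seconds is an assumption
# based on the description in this thread.
lines = [
    "10 Highlight\t 2020-11-20, 21:48:47\tLorem ipsum...",
    "90\tHighlight\t\t2020-11-20, 21:35:17\tLorem ipsum...",
    "2\tHighlight\tJohn Smith\t2020-11-20, 17:19:26\tLorem ipsum...",
]

with_end = re.compile(r"^(\d.+?:\d\d\s)(.*)$")
without_end = re.compile(r"^(\d.+?:\d\d)(.*)$")

# With the whitespace end marker, group 2 is the annotation text.
texts = [with_end.match(line).group(2) for line in lines]
print(texts)

# Without it, the lazy match stops at the minutes and the seconds leak through.
leftover = without_end.match(lines[0]).group(2)
print(repr(leftover))
```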

Hope that helps.

Looks worth testing whether simply splitting on \t might not have solved the problem and saved two days ...
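To spell out what that means: if the fields really are tab-separated, the annotation text is simply the last field. A minimal Python sketch, assuming a tab precedes the text (the sample line is reconstructed from the question):

```python
# Assumes the annotation text follows the final tab on the line.
line = "2\tHighlight\tJohn Smith\t2020-11-20, 17:19:26\tLorem ipsum..."

fields = line.split("\t")  # one list item per tab-delimited field
text = fields[-1]          # the last field is the annotation text
print(fields)
print(text)
```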

I'm sorry, I don't understand. thank you for your post.

One last question and I shall leave you in peace.
Just for my education, and because I like your short and sweet (and smart) approach to regex: could one create a regex which deletes all text before the 45th character on each line, a “line” being defined as a string of text ending with a line feed?
thanks again very much

Sure - try this:

^(.{44})(.+?)$

Replace the first capture group with nothing; the second capture group contains the remainder of the line. Is that what you meant? BTW - the first capture group matches the first 44 characters of the line. If you wanted 45, then change the 44 to 45.
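In Python terms, a sketch of the same replacement might look like this. The sample text is made up for illustration, and `re.MULTILINE` is what makes `^` and `$` apply to each line rather than to the whole text.

```python
import re

# Made-up sample: 44 filler characters per line, then the text to keep.
text = ("x" * 44 + "keep this\n") + ("y" * 44 + "and this")

# Keep only the second capture group on each line.
result = re.sub(r"^(.{44})(.+?)$", r"\2", text, flags=re.MULTILINE)
print(result)
```

Lines shorter than 45 characters do not match this pattern, so they are left unchanged.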

For efficient use of human time, regexes rarely score very well.

Before diving straight back into the regex whirlpool, it might be worth a quick look at alternatives like Keyboard Maestro Substring actions, which have a range option:

https://wiki.keyboardmaestro.com/action/Substring_of_Variable_or_Clipboard
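The same range idea, sketched in Python rather than Keyboard Maestro: plain string slicing drops the first 44 characters of each line with no regex at all. The sample text is invented for illustration.

```python
# Invented sample: 44 filler characters per line, then the text to keep.
text = ("x" * 44 + "keep this\n") + ("y" * 44 + "and this")

# Slice each line from index 44 onward, i.e. from the 45th character.
result = "\n".join(line[44:] for line in text.splitlines())
print(result)
```

Unlike a regex that requires at least 45 characters, slicing turns shorter lines into empty lines, which may or may not be what you want.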

thanks very much. Great idea !

thank you

Well, we would need to see the concrete source of the data,

but if, for example, it turned out that:

  • The annotation fields are always separated by tab characters,
  • and that the part you need is always just the last of those fields

then vanilla ( regex-free ) KM actions might let you fairly quickly:

Recast each tab-delimited field as a separate line:

image

and then let a for each action wind through those lines,
just giving each of them, in turn, a variable name like part.

image

so that the final value with the name part would be the last segment of the string:

image
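Those steps, written out as a Python sketch rather than as KM actions. The variable name `part` follows the description; the sample line is reconstructed from the question and assumes tab-delimited fields.

```python
line = "90\tHighlight\t\t2020-11-20, 21:35:17\tLorem ipsum..."

# Step 1: recast each tab-delimited field as a separate line.
as_lines = line.replace("\t", "\n")

# Step 2: loop over the lines, reassigning `part` on every pass.
part = ""
for part in as_lines.split("\n"):
    pass

# Step 3: after the loop, `part` holds the last segment.
print(part)
```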

thanks a lot. I am trying it and will give you a follow-up

Just to keep it simple (I can later create a version with variables) I highlighted the whole text → copy → macro action below → paste

all tabs are replaced with line feed

I think that the text I end up with is basically the equivalent of cliplines.

After that I don't follow. Sorry. thank you

If:

  • the data is consistently tab-delimited
  • and the part you want is always after the last tab,

then this would return what you want in the part variable:

Last tab-delimited segment of a string.kmmacros (20.4 KB)

but it may simply be that that isn't exactly the pattern of the source data.

What application do these annotations come from ?

First, I would like to encourage you to keep using and learning Regular Expressions (RegEx).
IMO, it is one of the most powerful and useful languages, and can be used just about everywhere.

Now, to your request.
Actually, you were pretty close with:
^(?<PageNumber>\d+)\s(?<Highlight>Highlight)\s\s(?<Date>\d{4}-\d+-\d+),\s(?<Time>\d+:\d+:\d+)\s(?<Description>.+)$

My solution is:
(?mi)^\d+.+\d{4}-\d{2}-\d{2},\h+\d{1,2}:\d{1,2}:\d{1,2}\h+(.+)

For details, see https://regex101.com/r/fN3eMY/1/

The key to developing a good reliable RegEx is in recognizing the patterns in the data.
In this case, here is what I see:

  1. Line always starts with one or more digits
  2. Some amount of variable text
  3. ISO Date
  4. Comma, then some horizontal space (TABs and/or SPACEs)
  5. Time
  6. Some horizontal space
  7. Finally the text you want to keep

Note that I used the following RegEx patterns:

  • \h+ to match one or more TABs and/or SPACES
  • \d{1,2} to match one or two digits
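The seven observations map onto the pattern piece by piece. Here is an annotated sketch in Python. Note one substitution: Python's `re` module does not support `\h`, so `[ \t]` is swapped in for it here. The sample line is reconstructed from the question.

```python
import re

pattern = re.compile(
    r"""
    ^\d+                     # 1. line starts with one or more digits
    .+                       # 2. some amount of variable text
    \d{4}-\d{2}-\d{2}        # 3. ISO date
    ,[ \t]+                  # 4. comma, then horizontal space
    \d{1,2}:\d{1,2}:\d{1,2}  # 5. time
    [ \t]+                   # 6. horizontal space
    (.+)                     # 7. the text to keep
    """,
    re.VERBOSE | re.MULTILINE | re.IGNORECASE,
)

line = "2\tHighlight\tJohn Smith\t2020-11-20, 17:19:26\tLorem ipsum..."
kept = pattern.search(line).group(1)
print(kept)
```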

It might be tempting to include "Highlight" in the pattern, but that word might not apply to all annotations. In any case, it is not needed for a good match.

Solution

So, the KM solution is to use a Search and Replace action:

Search FOR (using Regex)
(?mi)^\d+.+\d{4}-\d{2}-\d{2},\h+\d{1,2}:\d{1,2}:\d{1,2}\h+(.+)

Replace WITH
\1
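Outside KM, the same search and replace can be sketched in Python. Two assumptions: `[ \t]` replaces `\h`, which Python's `re` module lacks, and the sample lines (with invented annotation text) are modelled on the three variants in the question.

```python
import re

text = "\n".join([
    "10 Highlight\t 2020-11-20, 21:48:47 First note",
    "90\tHighlight\t\t2020-11-20, 21:35:17\tSecond note",
    "2\tHighlight\tJohn Smith\t2020-11-20, 17:19:26\tThird note",
])

search = r"^\d+.+\d{4}-\d{2}-\d{2},[ \t]+\d{1,2}:\d{1,2}:\d{1,2}[ \t]+(.+)"

# Replace each matched line with capture group 1 (the annotation text).
cleaned = re.sub(search, r"\1", text, flags=re.MULTILINE | re.IGNORECASE)
print(cleaned)
```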

Please feel free to ask any questions.

BTW, the RegEx101.com site is now back up.

You have a great solution. thanks VERY much. Works perfectly. Regex: I am trying but it's hard

thanks very much.

@JMichaelTX @ComplexPoint @tiffle
I don't know what Crystal Meth is like, but 2 days working on a regex must come close.

But hopefully, afterwards, you still remember what you’ve learned :wink: