Need Help With a RegEx to Cleanup PDF Annotations; and Learning RegEx

ronald · November 21, 2020, 12:23pm

A problem with PDF annotations is that each annotation is preceded by a prefix, which makes for very tedious reading, so I want to delete all the prefixes in the text of an RTF file.

The following regex works:

^(?<PageNumber>\d+)\s(?<Highlight>Highlight)\s\s(?<Date>\d{4}-\d+-\d+),\s(?<Time>\d+:\d+:\d+)\s(?<Description>.+)$

With the following prefix:

10 Highlight	 2020-11-20, 21:48:47 Lorem ipsum...

But I would like it to work with 2 variants.
1- variant one:

90	Highlight		2020-11-20, 21:35:17 Lorem ipsum...

I think the difference is that my regex works with a single space after the first number, whereas here it should match a tab instead.

2- variant two
Same as above but the name John Smith (ie author's name) is added followed by a tab:

2	Highlight	John Smith	2020-11-20, 17:19:26 Lorem ipsum...

Thank you very much.

tiffle · November 21, 2020, 12:44pm

Well, I’m not exactly sure what it is you’re trying to match with your complex regex but if it’s the stuff up to but not including “Lorem ipsum” this will do the job:

(\d.+?:\d\d )

So the match for the entire annotation would be:

^(\d.+?:\d\d )(.*)$

I think. (Can’t test it properly as regex101.com is down!) The second capture group is the contents of the annotation, the first is the prefix.

ronald · November 21, 2020, 2:40pm

(\d.+?:\d\d )

thank you very much.
I really like your clean and simple solution. Aesthetically beautiful.
regex101.com is down so I tested it with BBedit's regex playground

I identified 3 problems, 2 of which I don't know how to solve:
1- \d\d ) → \d\d) - took out the space before ).
2- the regex does not highlight the seconds (:30 at the end)
3- there is a tab after the seconds which I also would like to highlight/delete

thanks again very much @tiffle. I have been working on this for 2 days !

ComplexPoint · November 21, 2020, 5:19pm

With every additional minute, the likelihood increases (geometrically) that a non-Regex solution would have been much faster in human time ...

Usually better to just show us:

Some representative (but anonymized) input samples
The corresponding expected outputs.

(with a description of the software context, i.e. which apps are being used, both for input and for output)

tiffle · November 21, 2020, 5:25pm

In the examples you originally provided you didn’t say there was a tab between the seconds and Lorem ipsum... so I assumed it was a space, which is why it doesn’t work.

So, as it’s a tab you can use this:

^(\d.+?:\d\d\t)(.*)$

or

^(\d.+?:\d\d\s)(.*)$

The first matches a tab specifically while the second matches any white space character (space, tab, ...)

My tip for regex here is this: there’s no need to match every single bit of the text provided you can uniquely match the beginning pattern and the end pattern; everything in between can be “soaked” up by .+?

Hope that helps.
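
To illustrate the tip, here is a minimal sketch in Python. Python's re module is an assumption on my part; the thread itself uses Keyboard Maestro and BBEdit. The sample lines are made up from the examples earlier in the thread, with a tab assumed after the seconds. It also shows why removing the trailing whitespace from the pattern makes the match stop at the minutes instead of the seconds.

```python
import re

# Hypothetical sample lines modelled on the thread's examples;
# a tab is assumed between the seconds and the annotation text.
samples = [
    "90\tHighlight\t\t2020-11-20, 21:35:17\tLorem ipsum...",
    "2\tHighlight\tJohn Smith\t2020-11-20, 17:19:26\tLorem ipsum...",
]

# Anchor on the leading digit and on ":dd" followed by whitespace;
# the lazy .+? soaks up everything in between.
prefix_re = re.compile(r"^(\d.+?:\d\d\s)(.*)$")

# Same pattern without the trailing \s: the lazy match now stops at
# the first ":dd" it finds, which is the minutes, not the seconds.
too_short_re = re.compile(r"^(\d.+?:\d\d)(.*)$")

# Replace each whole line with its second capture group (the annotation text).
cleaned = [prefix_re.sub(r"\2", line) for line in samples]

# What is left over when the trailing whitespace is dropped from the pattern.
leftover = [too_short_re.match(line).group(2) for line in samples]

for original, text in zip(samples, cleaned):
    print(repr(original), "->", repr(text))
```

The only things the pattern really depends on are the digit at the start of the line and the whitespace right after the seconds, so the author name in the second sample needs no special handling.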

ComplexPoint · November 21, 2020, 7:26pm

Looks worth testing whether simply splitting on \t might not have solved the problem and saved two days ...

ronald · November 21, 2020, 8:42pm

I'm sorry, I don't understand. thank you for your post.

ronald · November 21, 2020, 8:46pm

One last question and I shall leave you in peace.
Just for my education, and because I like your short and sweet (and smart) approach to regex: could one even create a regex which deletes all text before the 45th character on each line, a “line” being defined as a string of text ending with a line feed?
thanks again very much

tiffle · November 21, 2020, 9:35pm

Sure - try this:

^(.{44})(.+?)$

Replace the first capture group with nothing; the second capture group contains the remainder of the line. Is that what you meant? BTW - the first capture group matches the first 44 characters of the line. If you wanted 45, then change the 44 to 45.
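
As a rough sketch of the same idea in Python (an assumption; the thread does not use Python), here is a simplified variant that matches only the first 44 characters of each line and replaces them with nothing. The input strings are invented for the example.

```python
import re

# Hypothetical input: a 44-character prefix followed by the text to keep.
line = "x" * 44 + "Lorem ipsum..."
short = "too short"

# re.MULTILINE makes ^ match at the start of every line,
# and . never matches a line feed, so each line is handled separately.
strip_44 = re.compile(r"^.{44}", re.MULTILINE)

kept = strip_44.sub("", line)                  # prefix removed
untouched = strip_44.sub("", short)            # fewer than 44 characters: no match
both = strip_44.sub("", line + "\n" + short)   # works line by line
```

Lines shorter than 44 characters do not match at all, so they are left unchanged. Plain slicing, line[44:], gives the same result for long lines without any regex, but it would empty the short ones.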

ComplexPoint · November 21, 2020, 9:36pm

For efficient use of human time, regexes rarely score very well.

Before diving straight back into the regex whirlpool, it might be worth a quick look at alternatives like Keyboard Maestro Substring actions, which have a range option:

https://wiki.keyboardmaestro.com/action/Substring_of_Variable_or_Clipboard

ronald · November 21, 2020, 9:40pm

thanks very much. Great idea !

ronald · November 21, 2020, 9:54pm

thank you

ComplexPoint · November 21, 2020, 10:16pm

Well, we would need to see the concrete source of the data,

but if, for example, it turned out that:

The annotation fields are always separated by tab characters,
and that the part you need is always just the last of those fields

then vanilla ( regex-free ) KM actions might let you fairly quickly:

Recast each tab-delimited field as a separate line:

and then let a for each action wind through those lines,
just giving each of them, in turn, a variable name like part.

so that final value with the name part would be the last segment of the string:
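
For readers outside Keyboard Maestro, the same approach can be sketched in Python. This is only an illustration under the assumptions stated above: every field is separated by a tab and the annotation text is the last field. The sample line and the variable names are made up.

```python
# Hypothetical annotation line, assuming every field is separated by a tab
# and the annotation text is the last field.
line = "2\tHighlight\tJohn Smith\t2020-11-20, 17:19:26\tLorem ipsum..."

# Mirror the "for each" idea: keep overwriting `part`,
# so that after the loop it holds the last field.
part = ""
for field in line.split("\t"):
    part = field

# The same result in one step: split once, starting from the right.
last = line.rsplit("\t", 1)[-1]

print(part)
```

If the separator before the annotation text turns out to be a space rather than a tab, as in the first sample in the thread, this returns more than the annotation text, which is why the real source data matters.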

ronald · November 21, 2020, 10:20pm

thanks a lot. I am trying it and will give you a follow-up

ronald · November 21, 2020, 10:38pm

Just to keep it simple (I can later create a version with variables) I highlighted the whole text → copy → macro action below → paste

all tabs are replaced with line feed

I think that the text I end up with is basically the equivalent of cliplines.

After that I don't follow. Sorry. thank you

ComplexPoint · November 21, 2020, 10:58pm

If:

the data is consistently tab-delimited
and the part you want is always after the last tab,

then this would return what you want in the part variable:

Last tab-delimited segment of a string.kmmacros (20.4 KB)

but it may simply be that that isn't exactly the pattern of the source data.

What application do these annotations come from ?

JMichaelTX · November 21, 2020, 11:59pm

First, I would like to encourage you to keep using and learning Regular Expressions (RegEx).
IMO, it is one of the most powerful and useful languages, and can be used just about everywhere.

Now, to your request.
Actually, you were pretty close with:
^(?<PageNumber>\d+)\s(?<Highlight>Highlight)\s\s(?<Date>\d{4}-\d+-\d+),\s(?<Time>\d+:\d+:\d+)\s(?<Description>.+)$

My solution is:
(?mi)^\d+.+\d{4}-\d{2}-\d{2},\h+\d{1,2}:\d{1,2}:\d{1,2}\h+(.+)

For details, see regex101: build, test, and debug regex

The key to developing a good reliable RegEx is in recognizing the patterns in the data.
In this case, here is what I see:

Line always starts with one or more digits
Some amount of variable text
ISO Date
Comma, then some horizontal space (TABs and/or SPACEs)
Time
Some horizontal space
Finally the text you want to keep

Note that I used the following RegEx patterns:

\h+ to match one or more TABs and/or SPACES
\d{1,2} to match one or two digits

It might be tempting to include "Highlight" in the pattern, but that word might not apply to all annotations. IAC, it is not needed for a good match.

Solution

So, the KM solution is to use Search and Replace action:

Search FOR (using Regex)
(?mi)^\d+.+\d{4}-\d{2}-\d{2},\h+\d{1,2}:\d{1,2}:\d{1,2}\h+(.+)

Replace WITH
\1
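
For anyone who wants to try the same search and replace outside KM, here is a minimal sketch in Python (an assumption; it is not part of the macro). Python's re module does not support \h, so [ \t] stands in for it here, which covers tabs and spaces only. The sample text is invented from the three prefix variants in this thread.

```python
import re

# Hypothetical sample text built from the three prefix variants in the thread.
text = (
    "10 Highlight\t 2020-11-20, 21:48:47 Lorem ipsum one\n"
    "90\tHighlight\t\t2020-11-20, 21:35:17\tLorem ipsum two\n"
    "2\tHighlight\tJohn Smith\t2020-11-20, 17:19:26\tLorem ipsum three\n"
)

# [ \t] replaces \h (an approximation: tabs and spaces only).
# re.MULTILINE and re.IGNORECASE correspond to the inline (?mi) flags.
annotation_re = re.compile(
    r"^\d+.+\d{4}-\d{2}-\d{2},[ \t]+\d{1,2}:\d{1,2}:\d{1,2}[ \t]+(.+)",
    re.MULTILINE | re.IGNORECASE,
)

# Replace each matched line with its first capture group, as \1 does in KM.
cleaned = annotation_re.sub(r"\1", text)
print(cleaned)
```

Since . does not match a line feed, each match stays within one line, and lines that do not fit the pattern are left as they are.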

Please feel free to ask any questions.

BTW, the RegEx101.com site is now back up.

ronald · November 22, 2020, 7:22am

You have a great solution. thanks VERY much. Works perfectly. Regex: I am trying but it's hard

ronald · November 22, 2020, 7:23am

thanks very much.

@JMichaelTX @ComplexPoint @tiffle
I don't know what Crystal Meth is like, but 2 days working on a regex must come close.

tiffle · November 22, 2020, 9:20am

But hopefully, afterwards, you still remember what you’ve learned
