Need Help With a RegEx to Cleanup PDF Annotations; and Learning RegEx

Short regular expressions for personal use are still legal in most jurisdictions, but distribution of longer regexes really deserves some close attention from the regulatory authorities.

Personal desktop use for string mangling was never what Kleene's regular language formalism (Regular language - Wikipedia) was intended for :slight_smile:


Sadly, whatever you wrote is seldom even legible the following morning, unless it was extremely short ...

Regular, but sub-economically time-consuming, and write-only.

Regular expressions are addictive, because they are:

  • interesting, and a source of pride if mastered,
  • the absolute peak of most casual scripters' technical aspirations,
  • fiddly, brittle and write-only enough to require a lot of wax-on, wax-off practice.

(Which creates an insatiable hunger for more practice material, and a temptation to encourage others to become aspirant users, seeking help, and supplying material)

Like many such substances, best used in tiny (very short) quantities, and only after other possibilities have been exhausted.

The Jamie Zawinski point is not far from the mark:

Jamie Zawinski - Wikiquote

Two days is no joke ...

Very funny!

Thanks to your email I will plunge into regex, starting with the basics and trying to emulate your short and sweet approach.

Great! I have no doubt that you can become productive using RegEx in short order.
Not only is RegEx101.com a great place to develop and test RegEx, it also provides detailed explanations of a RegEx pattern. For example:

See regex101: build, test, and debug regex

Benefits of Regular Expressions

In spite of @ComplexPoint's campaign to urge everyone to not use RegEx, using false claims, Regex remains a very powerful language. Like any language, it requires study and practice to master. But there are many simple RegEx patterns that can solve many common problems.

As you know, I also like, and often use, both JavaScript and AppleScript. While string manipulation can easily be done with the powerful built-in functions of JavaScript, there are many string manipulations that can be easily done using one line of RegEx, but would require many, many lines of JavaScript.

Getting Started with Regular Expressions

Regular expressions (RegEx or RegExp) are extremely powerful, but have an initial steep learning curve that is often intimidating. But once you get over that initial hump, and you continue to write new RegExp, it will become much easier.

I do all of my RegEx development at this free website:

You may also find these sites helpful:

Thanks very much for the detailed info. A source of frustration is the fact that a regex may work on Regex101, but not in BBEdit or in a KM search-and-replace regex.

IME, I have found Regex101.com and KM to be very consistent.
However, it does require that you set up Regex101.com correctly, set the needed Regex options in the Regex pattern, and use the proper KM settings.

Optimum Regex101.com Settings

  • Set Regex options to "gu"
  • You can use this URL to do this: regex101: build, test, and debug regex
  • Do NOT use any other settings like "m" or "i"
    • These should be set in the Regex pattern itself:
    • (?mi)
  • Use the "PCRE" Flavor, which is very close to what KM uses

KM Settings

  • Be sure to choose "Regular expressions" or "matches" in Actions and Conditions
  • RegEx Options: Note that I have used "(?mi)" in all of the above examples, but you should only use the options that you need:
    • "m" -- Multiline so that "^" and "$" match on each line
    • "i" -- Case insensitive

BBEdit

  • I mostly use the "Find" dialog, which should be very straightforward.
  • I have not observed any differences with Regex101.com
  • If you have specific examples that work differently, please post.

Please let me know if you have other questions.

Strange I posted a detailed reply, and even corrected a typo, and it has disappeared.

Yes - I saw your reply and in between then and now (about 30 minutes) it has gone! (It had a lot of images...)

thank you for your comment

So I hear the reply disappeared. All is fine now. Thank you Chris who went through all the trouble of reviving it.

Thanks again very much. I am very sorry to take so much of your time.

I am going crazy simply because I get different results with BBEdit, KM and Regex101

How can I even start to learn in those conditions?

That's what I did for 2 days, jumping from one to the other.

Please note that the BBEdit playground is useful because you can save the regex to a text factory.
The example below uses your excellent annotation cleanup regex.
Sorry for the many images

Objective
Convert
1 Highlight 2020-11-20, 20:20:30 Lorem ipsum...
testing regex
To
Lorem ipsum...
testing regex

Note that the testing regex line is there just so you don't think that I am simply deleting all characters before character no: 53

Regex used is your excellent one:
(?mi)^\d+.+\d{4}-\d{2}-\d{2},\h+\d{1,2}:\d{1,2}:\d{1,2}\h+(.+)

The regex101 link for the issue below is:

The KM macro works fine

BBEdit playground selects ALL the text including Lorem ipsum

Regex101 does nothing despite making sure I followed your settings guideline

Hey @ronald,

Well - a big part of the problem is that you don't know what you're doing.  :wink:

That makes the learning process more difficult. Believe me – I almost went bald when I first started trying to learn regex back about '96.

Of course it doesn't help that Keyboard Maestro and BBEdit and RegEx101 can use different syntaxes of regular expressions either.

However – the way you've got RegEx101 configured looks like it should work...

Except that you've got an extra linefeed in the regex pattern. (Good thing you posted a link to the actual saved page!)

Your test set with the extra linefeed:

Your test set with the extra linefeed removed:

My test set, before I realized you had one saved already:

BBEdit's playground is selecting the entire text, because that's what you told it to do.

Note how I've added a replacement pattern \1, and it shows what the replacement text will be.

If you click the Next button the next line will be highlighted in bright yellow and will show what will be replaced there.
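
For anyone who wants to see the same "replace the whole match with group 1" idea outside of BBEdit or KM, here is a minimal JavaScript sketch. It is an adaptation rather than the pattern exactly as posted: JavaScript has no \h and no inline (?mi), so the sketch assumes [ \t] as a stand-in for horizontal whitespace and passes m and i as flags. The variable names are invented for illustration; the sample text is the one from the objective above.

```javascript
// Sample text from the objective above.
const input = [
  "1 Highlight 2020-11-20, 20:20:30 Lorem ipsum...",
  "testing regex",
].join("\n");

// Adapted pattern (assumption): [ \t] stands in for \h, and the
// m and i options are passed as flags instead of an inline (?mi).
const annotationPrefix =
  /^\d+.+\d{4}-\d{2}-\d{2},[ \t]+\d{1,2}:\d{1,2}:\d{1,2}[ \t]+(.+)/gim;

// "$1" is JavaScript's spelling of the \1 replacement pattern:
// each matched line is replaced by the text captured in group 1.
const cleaned = input.replace(annotationPrefix, "$1");

console.log(cleaned);
```

The second line has no leading digit, so the pattern does not match it and it passes through unchanged.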

With stuff like this you need to NOT knock yourself out. If you can't grok the problem/solution in 15-30-60 minutes then you need to reach out for some help.

I've spent hours banging my head against the wall, when what I really needed was a little advice from someone more advanced on the learning curve than me.

Over the years I've gotten pretty good at limiting the amount of blood spilled from all the banging.

-Chris

It's all crystal clear now, in terms of regex101, BBEdit and KM macro and what my mistakes were.

@JMichaelTX @ccstone

Thank you both for your advice.

In fact I tried learning regex a few times in the past, obviously unsuccessfully.

I think that a fundamental problem with regex is that all the online courses go through a dizzying number of expressions which I would forget as soon as I started the next chapter.

The reality is that the best time to learn regex is when I need it, i.e. when I need it to solve a specific problem. The range of regex expressions that I will end up using is very narrow, and much of what is taught in those courses I will never use.

That being said, I will try again.

There's a lot to be said, I think, for keeping to one general language (JavaScript or AppleScript for example), and using some simple primitives like:

  • splitOn
  • startsWith or isPrefixOf
  • endsWith or isSuffixOf

That's often enough for real problems, and makes much more productive use of human time : -)
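
As a rough illustration of that approach, here is a sketch that strips the annotation prefix from the sample text earlier in the thread using only split, join, endsWith and includes. The helper name and the field layout (index, type, date with a trailing comma, time, then the text) are assumptions made for this example, not something specified above.

```javascript
// Hypothetical helper: drop a leading "index type date, time " prefix
// from an annotation line, without any regular expression.
const stripAnnotationPrefix = (line) => {
  const fields = line.split(" ");
  // Assumption: annotation lines have four space-separated fields
  // before the text, the third ending in "," and the fourth holding ":".
  const looksLikeAnnotation =
    fields.length > 4 && fields[2].endsWith(",") && fields[3].includes(":");
  return looksLikeAnnotation ? fields.slice(4).join(" ") : line;
};

const input =
  "1 Highlight 2020-11-20, 20:20:30 Lorem ipsum...\ntesting regex";

// Process the text line by line and reassemble it.
const cleaned = input.split("\n").map(stripAnnotationPrefix).join("\n");

console.log(cleaned);
```

The trade-off is that the field layout is hard-coded, so a differently shaped annotation line would need a different test.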

JS has the advantage that if you ever genuinely do need a visit to the medicine cabinet for a short regular expression, then JavaScript has a regular expression engine built in.

From AppleScript you need to juggle an unholy mixture of three different syntaxes:

  1. The ObjC foreign function interface syntax,
  2. AppleScript syntax,
  3. and regular expression syntax.

Would you know of a library of JavaScript regex? Thank you.

I think this is true to a certain extent with learning all languages. The key is to take notes with links to the reference. In the note create your own example of text to process.

It is also useful to have a cheat sheet, just like we often do with apps that have a large number of keyboard shortcuts. Or, even like KM which has so many Actions very few could remember them all. So we have a KM Wiki.

In the case of RegEx, I have found this cheat sheet to be very useful:
Regex Accelerated Course and Cheat Sheet

Well, this is much like learning any language -- it is somewhat of a "chicken and egg" problem.
I do agree we tend to learn and retain knowledge the most when solving a real problem.
OTOH, you do need to become familiar with the tools in your tool box, to at least know of them even if you don't fully remember how to use them.

I have found when learning a new language, that it is very beneficial to reread the documentation several times as I learn more. After I have used the language for a while and developed a reasonable understanding of its terms, I learn a lot more when I read the manual a second/third time.

I have found that the more I use Regex, the more uses I find for it, and of course I am able to construct more complicated patterns. That is one reason that I like to help other users here in the KM forum that have problems dealing with text manipulation. It helps keep me sharp, and even expands my knowledge sometimes.

Finally I'll say this: Regex is a great example of the adage "use it or lose it".

Tell me more ? ( Not quite sure that I have caught up with what a library of Regex might look like ).

A regex engine is part of the standard JS interpreter, and this is a good starting point for documentation:

Regular expressions - JavaScript | MDN

JS strings also have some built-in methods like:

  • haystack.includes(needle) -> true | false
  • haystack.startsWith(needle) -> true | false
  • haystack.endsWith(needle) -> true | false

as well as:

  • haystack.match(regex) -> Array of matches | null
  • haystack.matchAll(regex) -> Iterator of match arrays (the regex needs the g flag)

and, for arrays rather than strings,

  • array.findIndex(test function) -> Zero-based index or -1
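
A quick sketch of what these calls return in practice, using an invented sample string. Worth noting: match returns an array (or null when nothing matches), matchAll returns an iterator and requires the g flag, and findIndex is a method of arrays, so the string is split into words first.

```javascript
const haystack = "2020-11-20, 20:20:30 Lorem ipsum";

// Plain substring tests: each returns true or false.
const hasLorem = haystack.includes("Lorem");
const startsWithYear = haystack.startsWith("2020");
const endsWithIpsum = haystack.endsWith("ipsum");

// match without the g flag: an array holding the full match followed
// by the capture groups, or null when there is no match.
const dateMatch = haystack.match(/(\d{4})-(\d{2})-(\d{2})/);
const year = dateMatch ? dateMatch[1] : null;

// matchAll needs the g flag and returns an iterator, so spread it
// into an array and keep the full match text of each result.
const numbers = [...haystack.matchAll(/\d+/g)].map((m) => m[0]);

// findIndex belongs to arrays: index of the first element that
// passes the test, or -1 if none does.
const words = haystack.split(" ");
const loremIndex = words.findIndex((w) => w === "Lorem");

console.log(hasLorem, startsWithYear, endsWithIpsum);
console.log(year, numbers, loremIndex);
```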

The Regex engine used by JavaScript is significantly different from PCRE and that used by KM.
And, the JavaScript syntax to use Regex is quite different.
I don't see any advantage in using JavaScript just to execute a Regex search or replace function, when you have both of these easy to use as KM Actions.
Going down this path will just make Regex more confusing for you at this stage.
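
Two concrete differences, as a hedged sketch based on my understanding of current JavaScript engines such as Node.js: an inline option group like (?mi) does not compile, and \h is not a whitespace class (without the u flag it is read as a plain letter h). The patterns and test strings below are invented for illustration.

```javascript
// 1. Inline option groups such as (?mi) are PCRE-style syntax.
//    In JavaScript the pattern fails to compile.
let inlineOptionsError = null;
try {
  new RegExp("(?mi)^lorem");
} catch (err) {
  inlineOptionsError = err;
}
console.log(inlineOptionsError instanceof SyntaxError);

// 2. \h is not a character class in JavaScript. Without the u flag
//    it is treated as a literal "h", so it does not match a space.
const backslashH = /\h/;
const matchesLetterH = backslashH.test("h");
const matchesSpace = backslashH.test(" ");
console.log(matchesLetterH, matchesSpace);

// A JavaScript spelling of the same intent: m and i as flags,
// and [ \t] in place of \h.
const jsPattern = /^lorem[ \t]+ipsum/im;
const jsPatternMatches = jsPattern.test("intro\nLorem \tipsum");
console.log(jsPatternMatches);
```

So a pattern copied from KM or Regex101 (PCRE flavor) may need small edits before it runs in JavaScript.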

The point of JS is to avoid Regex (like the plague), (or like an addictive and time-wasting substance) most of the time :slight_smile:

Two days is no joke ...

thank you

In the case of RegEx, I have found this cheat sheet to be very useful:
Regex Accelerated Course and Cheat Sheet

EXCELLENT site !! thank you !

Hey @ronald,

Yes, but...

:sunglasses:

Then you'll bang your head against the wall instead of getting your problem solved -- unless you really decide you're going to learn something from the exercise.

I like the cheat-sheet @JMichaelTX mentions, but I also like this one because of the filter.

https://www.debuggex.com/cheatsheet/regex/pcre

However my most used reference is my own:

I have a local copy of this bound to an AppleScript with a keyboard shortcut in BBEdit's Script-Menu.

--------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2008/01/05 05:34
# dMod: 2018/05/30 19:09
# Appl: TextWrangler
# Task: Open BBEdit/TextWrangler RegEx Cheat Sheet
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @System_Events, @TextWrangler, @RegEx, @Cheat_Sheet
--------------------------------------------------------

set preferredWindowBounds to {351, 45, 1393, 1196}
set bbeditCheatSheetPath to "~/Documents/BBEdit Documents/Documentation/RegEx Cheat Sheet.txt"

# Expand the $HOME-based (tilde) path above.
tell application "System Events" to ¬
   set bbeditCheatSheetPath to POSIX path of disk item bbeditCheatSheetPath

tell application "BBEdit"
   set bbApp to a reference to it
   
   tell document "RegEx Cheat Sheet.txt"
      
      if it exists then
         if index of its window ≠ 1 then
            set index of its window to 1
         end if
      else
         tell bbApp to open bbeditCheatSheetPath opening in new_window
      end if
      
      if bounds of its window ≠ preferredWindowBounds then
         set bounds of its window to preferredWindowBounds
      end if
      
   end tell
end tell

--------------------------------------------------------

Since I always compose regular expressions in BBEdit, my reference is only a keystroke away. Of course nothing is stopping you from making a macro to open one (or more) of the other references in your web browser.

I've also used this site quite extensively over the years:

And don't forget our own wiki page on regular expressions.

Learning things haphazardly on the Internet has its place, but for serious study you need a reference book or two (or ten).

I started my regex odyssey on the Internet back when there wasn't much content, and it was hard to find – and as I stated earlier I pulled my hair out a lot, and my wall got pretty bloody...

When finally I got serious I bought some books.

I have all of these plus several tomes on Perl:

I'm interested in these two, but I haven't had my hands on them yet.

Note -- “Mastering Regular Expressions” by Jeffrey Friedl has its merits but is very technical and not really for beginners.

Reading about regular expressions helps develop vocabulary, but it is only through repeatedly working with them that one develops proficiency.

Learning regular expressions is difficult for most people, but the rewards last a lifetime.

I use them every day either directly or in macros I've built that employ them.

-Chris
