RegEx question: how to search for white space repetitions (1 tab and 1 new line)

cdthomer · September 8, 2021, 1:34pm

Howdy folks, I've been trying to improve some of my RegEx, and while I have been able to simplify a lot of them, I have come across one particular case that has me stumped. I tried searching for a RegEx forum but didn't come up with anything (perhaps somebody here knows of a RegEx-specific forum they could direct me to as well).

Below is a line from which I am trying to extract the words "ERO Headquarters AOR Caller" as well as the RegEx I use to get it. Obviously, this comes between the words "Status" and "CustomerConnected".

Status	
ERO Headquarters AOR Caller	CustomerConnected

(?<=Status\s\s).*(?=\sCustomerConnected)

I had to put in \s twice because there is both a tab and a newline after the word Status. I tried several combinations of \s* and the like to have it search for multiple whitespace characters, but they never worked. What would be the proper way to write this so it looks for more than one instance of whitespace after the word "Status" so as to include both the tab and the newline?

EDIT: I should state that the RegEx I use works just fine...I just want to know how to simplify the \s\s portion of it.
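For anyone following along outside Keyboard Maestro, here is a minimal sketch of the pattern above using Python's `re` module. Python is an assumption (the thread itself only mentions regex101), and the sample string is rebuilt from the post with a tab and a newline after "Status".

```python
import re

# Sample from the post: a tab and a newline follow "Status".
text = "Status\t\nERO Headquarters AOR Caller\tCustomerConnected"

# Fixed-width lookbehind: exactly two whitespace characters after "Status".
pattern = re.compile(r"(?<=Status\s\s).*(?=\sCustomerConnected)")

match = pattern.search(text)
result = match.group(0) if match else None
print(result)
```

Because the lookbehind always consumes exactly two characters after "Status", the whole match is the wanted string with no surrounding whitespace.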

ComplexPoint · September 8, 2021, 1:47pm

Have you tried \s+ (that is, at least one whitespace character)?

cdthomer · September 8, 2021, 1:53pm

Just tried it on regex101.com and it gives me the following error.
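The error itself did not survive in this copy of the thread. As an illustration of the kind of restriction involved, the sketch below shows Python's `re` module rejecting a `\s+` inside a lookbehind. Whether regex101 reported the same message is an assumption, since that depends on the flavor selected there.

```python
import re

# A lookbehind containing an open-ended quantifier (\s+) has no fixed width.
try:
    re.compile(r"(?<=Status\s+).*(?=\sCustomerConnected)")
    error_message = None
except re.error as exc:
    # Python refuses to compile the pattern at all.
    error_message = str(exc)

print(error_message)
```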

ComplexPoint · September 8, 2021, 3:48pm

Can you work with something like this, which captures spaces too?

(?<=Status)\s+.*(?=\sCustomerConnected)

Perhaps, for example, by adding a capturing group that excludes the leading whitespace?

(?<=Status)\s+(.*)(?=\sCustomerConnected)
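A short sketch of the difference between the whole match and the capturing group, again assuming Python's `re` module and the sample string from the first post:

```python
import re

text = "Status\t\nERO Headquarters AOR Caller\tCustomerConnected"

# \s+ sits outside the lookbehind, so it becomes part of the whole match;
# the parentheses capture only the text that follows the whitespace.
pattern = re.compile(r"(?<=Status)\s+(.*)(?=\sCustomerConnected)")
match = pattern.search(text)

whole = match.group(0)  # includes the leading tab and newline
inner = match.group(1)  # the captured text only

print(repr(whole))
print(repr(inner))
```

The whole match still starts with the tab and newline; only group 1 is free of them.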

cdthomer · September 8, 2021, 4:27pm

Those work...but I'm trying to avoid capturing the spaces preceding and following the string because then I would have to filter out those spaces in another action or else it would lead to extra lines in the end result. The way I was doing it before was by including the words "Status" and "CustomerConnected" and then filtering them out in a subsequent action. I just want to simplify things and have one action to get the exact string I'm looking for.

martin · September 8, 2021, 4:34pm

You can just do

Status\s+(.*)\sCustomerConnected

and save the capture group into a variable. Therefore, no need for an extra action.
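The same idea can be sketched in Python, where "save the capture group into a variable" becomes reading `group(1)`. The helper name `extract_participant` and the second test string ("Jane Doe", with extra whitespace) are hypothetical, added only to show that `\s+` absorbs any run of whitespace.

```python
import re

# No lookarounds: match the delimiters and capture only the part in between.
PARTICIPANT = re.compile(r"Status\s+(.*)\sCustomerConnected")

def extract_participant(text):
    """Return the text between 'Status' and 'CustomerConnected', or None."""
    match = PARTICIPANT.search(text)
    return match.group(1) if match else None

# Tab + newline after "Status", as in the original post.
tab_newline = extract_participant(
    "Status\t\nERO Headquarters AOR Caller\tCustomerConnected"
)
# Hypothetical variant with a space, tab, and CRLF after "Status".
extra_space = extract_participant("Status \t\r\nJane Doe\tCustomerConnected")

print(tab_newline)
print(extra_space)
```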

cdthomer · September 8, 2021, 7:19pm

Thanks! I should look more into capture groups, because this is actually just one of four pieces of information I need to extract from a page. Right now I have them set up the following way.

Ideally I would set it up to extract all that info in one shot, with a single action. But at least for now I have greatly streamlined it, going from literally 20 actions to only 4. So I'm making progress. I was just trying to understand why I was having trouble writing in a RegEx for multiple white spaces in the participant search. Thanks again for your help!

ccstone · September 8, 2021, 8:01pm

Hey Chris,

Yes!

Stay away from lookaheads and lookbehinds, unless you really need them. They add complexity and confusion – especially for neophyte users.

As you've discovered, lookbehind assertions in most regex engines must be of fixed (or at least bounded) length, so they cannot contain open-ended quantifiers such as + or *.

Don't get overly fixated on creating the perfect RegEx – use two, three, or more if required for simplicity and readability.

Learn – improve your proficiency – but don't waste too much time.

You'll understand this better after you spend 10 hours working on the “perfect” regular expression only to discover it breaks too easily.

Then you spend 10 more hours trying to fix every possible breakage.

Then you realize you wasted your time, because a single regex can't do everything you need.

After all that you find that 4 regular expressions were all that were required, and the total dev time needed for them was 20 minutes...

After you do that a time or two (or three) you'll pay a little more attention to the Keep it Simple rule.

Have fun though! For me regular expressions are fun puzzles that make my work easier – as long as I don't get too precious about them.

-Chris

cdthomer · September 8, 2021, 8:17pm

You just described what my wife has often accused me of...always wanting to refine something that already works.

But you’re both very right...sometimes it's best to leave something alone once it works well enough. Thanks again for your help with AppleScript and RegEx!

martin · September 8, 2021, 8:39pm

It's possible to use multiple capture groups in one action, provided the text follows the same format. You will need to provide the relevant sample text block for testing.

ccstone · September 8, 2021, 8:54pm

The former NASA engineer @JMichaelTX used to say “Better is the enemy of good enough...”

Since nothing is perfect one can always spend time in pursuit of better.

Since I'm a perfectionist by trade I have to make sure I don't go excessively far down that particular rabbit hole.

-Chris

cdthomer · September 8, 2021, 9:06pm

I think I'll follow Chris' (@ccstone) advice and leave the macro itself alone since it works quite well. BUT, I also want to learn some more so I've included the sample text block for you to take a look at.

 Your Conference Details
Customer Name:	DHS U.S. Citizenship and Immigration Services
Site Name:	Houston (ZHN)
Language:	Spanish
Reference ID:	11111111
Call Duration:	00:00:01
 Participants
Name	Status	
Name changed for security	CustomerConnected

I changed the reference ID and customer's name for security reasons, but they are still in the same format as what the actual text would read.

I need to extract the following info:

DHS U.S. Citizenship and Immigration Services
This name can be just one word or several as you can see.

Houston (ZHN)
This name can also be just one word or several.

11111111
Obviously that's not the real number, but it is always an 8-digit string.

Name changed for security
This is usually two or three words, depending on whether it's the actual person's name or the name of their office.

So just for curiosity's sake at this point (because why screw up a perfectly good macro I managed to reduce from 20 actions to 4?), would there be a way to parse this information with a single action, and set each piece of data to a separate variable?

The page's format is the same every single time...well, 98% of the time. For the other 2% I have 4 separate macros that can extract just one single item in case one of those items is missing (sometimes the participant's name is missing for example).

Thanks!
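The "one small pattern per item" approach described in this post, including the case where an item is missing, could be sketched in Python as follows. This is an illustration under assumptions, not the actual macros: the field labels, the `extract_fields` helper, and the exact tab placement in the sample are all hypothetical.

```python
import re

# Sample text from the post; tab placement is an assumption based on the paste.
SAMPLE = (
    " Your Conference Details\n"
    "Customer Name:\tDHS U.S. Citizenship and Immigration Services\n"
    "Site Name:\tHouston (ZHN)\n"
    "Language:\tSpanish\n"
    "Reference ID:\t11111111\n"
    "Call Duration:\t00:00:01\n"
    " Participants\n"
    "Name\tStatus\t\n"
    "Name changed for security\tCustomerConnected\n"
)

# One simple pattern per field (the dictionary keys are hypothetical labels).
PATTERNS = {
    "customer": re.compile(r"Customer Name:\s+(.+)"),
    "site": re.compile(r"Site Name:\s+(.+)"),
    "reference_id": re.compile(r"Reference ID:\s+(\d{8})"),
    "participant": re.compile(r"Status\s+(.+)\sCustomerConnected"),
}

def extract_fields(text):
    """Return field -> captured string, or None when a field is missing."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields

full = extract_fields(SAMPLE)

# Simulate the occasional page where the participant line is missing.
partial = extract_fields(
    SAMPLE.replace("Name changed for security\tCustomerConnected\n", "")
)

print(full)
print(partial)
```

Since each pattern is independent, a missing participant leaves that one field as `None` while the other three are still extracted.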

martin · September 8, 2021, 9:41pm

This is the action based on your sample text:
Search pattern:

Name:\s(.*?)\sSite Name:\s(.*?)\sLanguage(?:.|\s)+ID:\s(\d{8})\sCall(?:.|\s)+Status\s+(.*)\sCustomerConnected
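To try this search pattern outside Keyboard Maestro, here is a minimal sketch using Python's `re` module. Python and the variable names are assumptions, and the sample is rebuilt from the earlier post with tabs assumed between labels and values. Each capture group is unpacked into its own variable, which mirrors saving the groups to separate variables in one action.

```python
import re

# Sample text from the earlier post; tab placement is an assumption.
SAMPLE = (
    " Your Conference Details\n"
    "Customer Name:\tDHS U.S. Citizenship and Immigration Services\n"
    "Site Name:\tHouston (ZHN)\n"
    "Language:\tSpanish\n"
    "Reference ID:\t11111111\n"
    "Call Duration:\t00:00:01\n"
    " Participants\n"
    "Name\tStatus\t\n"
    "Name changed for security\tCustomerConnected\n"
)

# The search pattern from the post, split over two raw strings for readability.
pattern = re.compile(
    r"Name:\s(.*?)\sSite Name:\s(.*?)\sLanguage(?:.|\s)+ID:\s(\d{8})"
    r"\sCall(?:.|\s)+Status\s+(.*)\sCustomerConnected"
)

match = pattern.search(SAMPLE)
if match:
    # One variable per capture group, in the order the groups appear.
    customer, site, reference_id, participant = match.groups()
else:
    customer = site = reference_id = participant = None

print(customer, site, reference_id, participant, sep="\n")
```

A single pattern like this matches all four items or none of them, so a page with one item missing yields no result at all. Because `.` and `\s` both match spaces and tabs, `(?:.|\s)+` can also backtrack heavily on text that does not match. In Python, `[\s\S]+` is a common alternative that matches the same characters.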

dashard · September 8, 2021, 9:47pm

aka spending 10 hours to save 10 minutes.
Guilty!

ccstone · September 10, 2021, 1:22am

IF

That 10 minutes is repeatedly saved over time.
Keeps me from pulling my hair out.
Improves the likelihood that I'll get work done on time.
Makes my work more accurate.

OR

Helps me to learn things that will facilitate the above.
Is a FUN pastime that takes the place of other leisure activities, and lets me learn something at the same time.

I'll gladly spend the 10 hours.

OTHERWISE

If I'm not getting paid for the work.
- I'd better take a good long look at why I'm doing what I'm doing.

-Chris

cdthomer · September 10, 2021, 2:06am

I must have messed up the formatting when I posted it, because the action you built doesn’t return any results for me. I’ll take a closer look at it tomorrow though and see what I can figure out.

cdthomer · September 10, 2021, 2:06am

A good portion of my time spent building and tweaking macros is for these very reasons, either to learn and/or to just have fun haha.

ccstone · September 10, 2021, 2:08am

Hey Chris,

How are you acquiring the data in the first place?

-ccs

cdthomer · September 10, 2021, 2:13am

I copy the contents of the page to the system clipboard and then filter it to remove styles. But I have several copies saved in text files on my computer to test things with, and I used one of those in a previous post. It must have gotten corrupted somehow; likely I deleted some whitespace character or the like.

ccstone · September 10, 2021, 2:15am

How?

In what browser?

-ccs
