Regex -- Harvest Variables From Multi Line Text

troy · December 31, 2018, 4:21pm

I realize more than a little help is needed to accomplish my goal.
Appreciate any input -
I'd like to 'pull' these variables from the text.
Name_First
Name_Last
Phone_Daytime
Phone_Cell
Phone_Ext_Daytime
Phone_Ext_Cell
Client_Email
Address_Street (the full line after the text 'Print this lead')
Address_City
Address_State
Address_Zip
Detail_Consumer_Comments
Detail_Location
Detail_Cleaning_Type
Detail_Stage
Detail_Completion_Date

The example text will always be in the middle of a long string of text but will always be preceded by:
Expand
Set Status
(I also have attached the complete block of 'master' text if that is of help)
TestDataFor KM.rtf.zip (2.3 KB)

Here is the example text:

Expand
Set Status
Avertta, Tammy

Daytime
(929) 295-4774 ext. 1047

Cell
(929) 295-4774 ext. 1048

abcfany.testla1234@gmail.com

Messages

Rate this lead
Print this lead
213 Maple Ave
New Hampton, NY 10958

Map

DetailsEmailNotes & RemindersHistory
Clean House Interior (Maid Service)

Job #: 131584913
Additional HomeAdvisor Pros Matched: 2
Service Description

Consumer Comments:
In need of kitchen, bathrooms, living room and dining rooms cleaned. Just the upstairs which include 3 bedrooms, 2 baths, living room, dining room, kitchen, steps and landing

What kind of location is this?:
Home/Residence

Cleaning Type Needed:
Recurring Service

Request Stage:
Planning & Budgeting

Desired Completion Date:
1 - 2 weeks

Rather · January 2, 2019, 9:12am

I usually find the best way to approach this kind of task is to avoid the temptation of having one super-RegEx that extracts everything, and simply do it in several bites.

So:
Names:
Set Status\n(\w*), (\w*)

Phones:
Daytime\n(.*)\n\nCell\n(.*)

Email:
\b(?<![%+_$.-])a-z0-9@a-z0-9.[a-z]{2,6}\b
(Note: This is not mine, but from the people at RegExRx)

...and so on.

It may not be super-elegant, but it looks to me as if your example text would allow you to extract all the fields you need in this way.
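
As a rough illustration of the "several bites" idea, here is a minimal sketch in Python's re module (not Keyboard Maestro's own regex engine, so details may differ). It assumes the phone captures are meant to be (.*), that the name line is in "Last, First" order, and that each phone is searched separately so a missing one does not affect the other. The dictionary keys ending in _Raw are hypothetical names for this example only.

```python
import re

# A trimmed copy of the example text from the first post.
sample = """Expand
Set Status
Avertta, Tammy

Daytime
(929) 295-4774 ext. 1047

Cell
(929) 295-4774 ext. 1048
"""

# One small pattern per field, each run independently.
name = re.search(r"Set Status\n(\w*), (\w*)", sample)
daytime = re.search(r"Daytime\n(.*)", sample)
cell = re.search(r"Cell\n(.*)", sample)

fields = {
    # Assumption: the line reads "Last, First".
    "Name_Last": name.group(1) if name else None,
    "Name_First": name.group(2) if name else None,
    # Whole phone line, extension still attached.
    "Phone_Daytime_Raw": daytime.group(1) if daytime else None,
    "Phone_Cell_Raw": cell.group(1) if cell else None,
}
print(fields)
```

Any pattern that does not match simply leaves its field as None instead of making the whole extraction fail.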

troy · January 2, 2019, 8:28pm

thank you @Rather - I did a bit of looking around but could not find an 'easy' answer - go figure.
How do I match the complete next line of text? I'd like to get the line after 'Print this lead', as that text is always the 'header' to the street address.
The following regex works only if 'Print this lead' contains no spaces, but the spaces are there:
printthislead[\r\n]+([^\r\n]+)

ie.
Print this lead
110 waters edge

the goal would be to end up with 110 Waters Edge as the match.
Thank you again.

Rather · January 3, 2019, 8:08am

The space can be put in quite easily with \s.

Your expression would then be:
print\sthis\slead[\r\n]+([^\r\n]+)

I don't know much about US address conventions or postcodes, but perhaps something like this would give you more detail:

nt\sthis\slead[\r\n]+([^\r\n]+)\n(.[\w\s-]*),\s(\w{2})\s(\d{4,6})\n

\d is a digit
{4,6} means you want between 4 and 6 of those digits

Just a starter, and you'll soon be doing much more sophisticated expressions, I'm sure!

Good luck.
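
For anyone who wants to try the address idea outside Keyboard Maestro, here is a minimal sketch in Python's re module. It is a variation on the expression above, not the same pattern: it assumes a capital "P" in the header, a two-letter capitalised state, and a 5-digit ZIP with an optional 4-digit suffix. The variable names are made up for the example.

```python
import re

# The address part of the example text from the first post.
sample = """Rate this lead
Print this lead
213 Maple Ave
New Hampton, NY 10958

Map
"""

ADDRESS = re.compile(
    r"Print\sthis\slead[\r\n]+"   # header line, then its line break(s)
    r"([^\r\n]+)[\r\n]+"          # street: the whole next line
    r"([^,\r\n]+),\s"             # city: everything up to the comma
    r"([A-Z]{2})\s"               # state: two capital letters (assumption)
    r"(\d{5}(?:-\d{4})?)"         # ZIP: 5 digits, optional -4 suffix (assumption)
)

match = ADDRESS.search(sample)
if match:
    street, city, state, zip_code = match.groups()
else:
    street = city = state = zip_code = None

print(street, city, state, zip_code, sep=" | ")
```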

troy · January 5, 2019, 11:02pm

I'd like to get the base 10 Digit number in one variable and the extension in another.
To start with:
Daytime\n(.)
Does not extract the initial 10 digit number on

Daytime
(929) 295-4774 ext. 1069

I can't run your suggested
Daytime\n(.)\n\nCell\n(.)
because sometimes the client does not enter a Cell number and other times may not enter a Daytime number. So I'll have to run them as two separate searches.

Nor does the regex
\b(?<![%+$.-])a-z0-9 @a-z0-9 .[a-z]{2,6}\b
seem to find the email address -
I did find this works:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b courtesy of:
https://www.regular-expressions.info/email.html
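
A minimal sketch of that email pattern in Python's re module, for testing outside Keyboard Maestro. It assumes the pattern is written with the underscore in the first character class and an escaped dot before the top-level domain, and that matching is case-insensitive (the character classes only list capitals). The name client_email is just for this example.

```python
import re

# Case-insensitive, because the classes only list A-Z.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)

# The lines around the email address in the example text.
sample = """Cell
(929) 295-4774 ext. 1048

abcfany.testla1234@gmail.com

Messages
"""

match = EMAIL.search(sample)
client_email = match.group(0) if match else None
print(client_email)
```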

Rather · January 6, 2019, 2:10pm

I'd like to get the base 10 Digit number in one variable and the extension in another.

OK. I don't know if these are being entered in a freeform field. Assuming so, then we will have to deal with:

the length and format of the number; and
the way the person abbreviates extension

It seems the telephone number is made up of digits; hyphens; spaces and brackets.

This would catch a number with 10-14 of those characters:
([\d\s()-]{10,14})
(I'll let you decide how many characters you need.)

For the next part, I have seen people use:

ext 1069
x1069
extension: 1069

and so you would want to catch all of those.

This expression:
[extnsio]{1,9}[.\t\s:]{1,3}(\d{3,5})

should isolate:

any 'word' of between 1 and 9 characters from "extension"
a spacer made up of a tab, a space, a full-stop, a colon or any combination of up to 3 of those things
a sequence of between 3 and 5 digits, which is the only part actually captured.
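
Putting the two pieces together, here is a minimal sketch in Python's re module (which may behave differently from Keyboard Maestro's engine). It makes two changes that are my assumptions rather than part of the expressions above: the hyphen sits at the end of the character class so it is unambiguously literal, and the spacer is allowed to be empty ({0,3}) so that a form like "x1069" is also caught. The function name split_phone is hypothetical.

```python
import re

# Base number: 10-14 digits, spaces, hyphens or brackets.
NUMBER = re.compile(r"[\d\s()-]{10,14}")
# Extension: letters from "extension", an optional spacer, then 3-5 digits.
EXTENSION = re.compile(r"[extnsio]{1,9}[.\t\s:]{0,3}(\d{3,5})")


def split_phone(line):
    """Return (base_number, extension) from one phone line; either may be None."""
    number = NUMBER.search(line)
    ext = EXTENSION.search(line)
    return (
        number.group(0).strip() if number else None,
        ext.group(1) if ext else None,
    )


print(split_phone("(929) 295-4774 ext. 1069"))
print(split_phone("929-295-4774 x1069"))
print(split_phone("(929) 295-4774"))
```

Running the two searches separately means a line with no extension still returns the base number, with None for the extension.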

Nor does the regex
\b(?<![%+ $.-])a-z0-9 @a-z0-9 .[a-z]{2,6}\b
seem to find the email address

Yep, my bad. I let some spaces slip in and it should be:
\b(?<![%+$.-])a-z0-9@a-z0-9.[a-z]{2,6}\b

However, if you've found something that works, all good!

ccstone · January 7, 2019, 12:09am

Hey @troy,

Here's an example that works with the sample data you posted.

-Chris

Regex – Harvest Variables From Multi Line Text v1.00.kmmacros (16 KB)

troy · January 7, 2019, 3:38pm

@ccstone this macro is awesome! I really appreciate your time.
The one thing that does not seem 'clean' is the First Name: it returns more text than just the first name.
I've taken a bit of time to look at the RegEx and googled a bit and have learned a lot by looking at your example.
Cheers.

ccstone · January 7, 2019, 4:05pm

Hey @troy,

Give me a good example, and I'll fix it.

-Chris

troy · January 7, 2019, 5:13pm

For the first name,
Using the macro you posted and the 'embedded' data in the initial action. (initialData)
It returns the first name with a number of additional lines of data.
Just need the first name.
Cheers

ccstone · January 7, 2019, 6:09pm

Hmm – I thought I'd tested that...

Oh, I relied on the view in the editor instead of looking at the variable in the prefs. Oops...

This should fix it.

Search using Regular Expression.kmactions (686 B)

-Chris
