Regular Expression Copy the first 3 words

Bill_Mabey · May 25, 2017, 5:15pm

I have a variable named %BaseCamp% that contains: “Beauceville 173 Road between the intersection of the road from the golf course and the municipal boundary of Notre-Dame-des-Pins”

I would like to copy the first 3 or 4 words. I have tried for 2 hours and I give up.

JMichaelTX · May 25, 2017, 6:30pm

Here's the RegEx for 3 words:
^((?:\S+\s+){2}\S+)

Just change the {2} to {n-1}, where n is the number of words that you want to capture.
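
To make the {n-1} substitution concrete, here is a minimal Python sketch. It is not Keyboard Maestro itself, and the function name `first_n_words` is made up for illustration; it simply builds the same pattern for any n and returns Capture Group 1. Python's `re` module accepts this particular pattern unchanged.

```python
import re
from typing import Optional


def first_n_words(text: str, n: int) -> Optional[str]:
    """Return the first n whitespace-separated words, or None if there are fewer than n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Same pattern as above, with the repeat count set to n - 1.
    pattern = r"^((?:\S+\s+){%d}\S+)" % (n - 1)
    match = re.search(pattern, text)
    return match.group(1) if match else None


base_camp = "Beauceville 173 Road between the intersection of the road from the golf course"
print(first_n_words(base_camp, 3))  # Beauceville 173 Road
print(first_n_words(base_camp, 4))  # Beauceville 173 Road between
```

If the text has fewer than n words, the pattern does not match and the sketch returns `None` rather than a partial result.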

For details, see regex101: build, test, and debug regex


EDIT: See the post below:

[quote="JMichaelTX, post:8, topic:7109"]
@Bill_Mabey, here's my new improved RegEx:
`^((?:\S+\h+){2}\S+?)(?=[\h.,;!?:]|$)`
[/quote]

Tom · May 25, 2017, 9:39pm

If you already tried for 2 hours, why don’t you share your efforts here with us? This would make things easier, since your description is not overly clear (“3 or 4 words”; what is a word for you?; 3 or 4?)

Bill_Mabey · May 26, 2017, 12:47am

Exactly what I was looking for. Thank you very much, much appreciated.

Bill_Mabey · May 26, 2017, 12:58am

I think I was clear enough. One reason I don’t like posting here is that I’m not a genius when it comes to writing expressions, and I like to keep the post simple. While I’m committing to this post here, I have trouble reading posts because there is too much info at one time and the original post gets drawn out into a story. It would be much better to keep posts simple…

Cheers…

ccstone · May 26, 2017, 1:45am

Hey Bill,

Don't bang your head against the wall (without making progress) for more than 30 to 60 minutes at a time. We don't need no dain brammage!

There are many ways to complete this particular task.

RegEx ⇢ Search Variable for First 3 Words.kmmacros (2.9 KB)

I've used a little more complex regular expression than JM's to limit the found-string to word-characters and hyphens.

(This method can be problematic if you have punctuation in words.)

I've also used the horizontal whitespace token and a Positive-Lookahead-Assertion to stop before the last appropriate space (or end-of-line).

Once you wrap your head around AppleScript's text item delimiters this kind of job is easily done with AppleScript.

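set theStr to "Beauceville 173 Road between the intersection of the road from the golf course and the municipal boundary of Notre-Dame-des-Pins"
-- Save the current delimiters so they can be restored afterwards.
set oldTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to space
set breakStrIntoList to text items of theStr
-- Coercing the list to text rejoins the items with the current delimiter (space).
set firstThreeWords to (items 1 thru 3 of breakStrIntoList) as text
set AppleScript's text item delimiters to oldTIDs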

By doing it this way I don't have to be concerned about punctuation or other strange characters in words.

I can do the same thing in awk quite easily, since it looks at lines as records and words as fields with horizontal-whitespace the default field separator.

A couple of ways to print the first three fields of a string with awk:

strVar='Beauceville 173 Road between the intersection of the road from the golf course and the municipal boundary of Notre-Dame-des-Pins'

# Super-simple awk: print fields 1, 2, & 3 with a space in between:
echo "$strVar" | awk '{ print $1" "$2" "$3 }'

# Slightly more complex awk: set the output field separator (OFS) to a space, which is also awk's default, and print with comma notation so awk inserts the OFS.
echo "$strVar" | awk 'BEGIN {OFS=" "} {print $1,$2,$3}'
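
For comparison, the same field-splitting idea can be sketched in Python, where `str.split()` with no arguments also splits on runs of whitespace. This is only an illustration of the technique, not something awk or Keyboard Maestro provides; the helper name `first_fields` is hypothetical.

```python
def first_fields(line: str, count: int = 3, sep: str = " ") -> str:
    """Join the first `count` whitespace-separated fields of `line` with `sep`."""
    # split() with no arguments splits on runs of whitespace and ignores
    # leading/trailing whitespace, much like awk's default field splitting.
    fields = line.split()
    return sep.join(fields[:count])


str_var = "Beauceville 173 Road between the intersection of the road from the golf course"
print(first_fields(str_var))  # Beauceville 173 Road
```

One difference from the awk one-liners: if the line has fewer than `count` fields, this sketch returns only the fields that exist, whereas awk prints the separators for the missing fields too.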

I'll do three more with AppleScript and the Satimage.osax:

This script, like the TIDs method, splits the string on whitespace and then rejoins the first three words.

set theStr to "Beauceville 173 Road between the intersection of the road from the golf course and the municipal boundary of Notre-Dame-des-Pins"
set firstThreeWords to splittext theStr using "[[:blank:]]+" with regexp

if length of firstThreeWords ≥ 3 then
   set firstThreeWords to items 1 thru 3 of firstThreeWords
   set firstThreeWords to join firstThreeWords using space
end if

This method uses the Satimage.osax's regular expression support with an even simpler pattern:

set theStr to "Beauceville 173 Road between the intersection of the road from the golf course and the municipal boundary of Notre-Dame-des-Pins"

try
   set theStr to find text "^\\S+ \\S+ \\S+" in theStr with regexp and string result
on error
   set theStr to false
end try

Since the Satimage.osax's regex flavor is a bit different from Keyboard Maestro's, I have to write my original pattern differently to work with it.

set theStr to "Beauceville 173 Road between the intersection of the road from the golf course and the municipal boundary of Notre-Dame-des-Pins"

try
   set theStr to find text "^(\\S+[[:blank:]]?){3}(?=[[:blank:]]|$)" in theStr with regexp and string result
on error
   set theStr to false
end try

-Chris

Bill_Mabey · May 26, 2017, 10:40pm

Wow, thank you @ccstone, these scripts will most definitely come in useful.
These scripts explain a lot…

Again, thank you @ccstone and @JMichaelTX

JMichaelTX · May 29, 2017, 2:24am

@Bill_Mabey, here's my new improved RegEx:
^((?:\S+\h+){2}\S+?)(?=[\h.,;!?:]|$)

In this case you want to use Capture Group #1 as the result.

This version:

- Like the previous version, allows any non-whitespace character (like -_(){}[].,) as part of each word.
- Allows any of the following characters at the end of the match, but does NOT require them, and does NOT capture them: \h.,;!?: where \h is a horizontal whitespace character such as a SPACE or TAB.

If you want to allow ONLY the RegEx standard "word" characters, you can replace the \S with a \w:
^((?:\w+\h+){2}\w+?)(?=[\h.,;!?:]|$)
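
A quick way to check both variants outside Keyboard Maestro is a short Python sketch. One assumption to flag loudly: Python's `re` module does not support the `\h` token, so `[ \t]` (space or tab) stands in for it here, which is narrower than everything `\h` can match. The function and constant names are made up for illustration.

```python
import re
from typing import Optional

# ASSUMPTION: [ \t] stands in for \h, which Python's re module does not support.
ANY_CHARS = re.compile(r"^((?:\S+[ \t]+){2}\S+?)(?=[ \t.,;!?:]|$)")
WORD_CHARS = re.compile(r"^((?:\w+[ \t]+){2}\w+?)(?=[ \t.,;!?:]|$)")


def first_three_words(text: str, pattern: re.Pattern = ANY_CHARS) -> Optional[str]:
    """Return Capture Group 1 (the first three words), or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Trailing punctuation after the third word is allowed but not captured.
print(first_three_words("Beauceville 173 Road, between the intersection"))  # Beauceville 173 Road
# The \S version accepts hyphens inside a word; the \w version does not match at all.
print(first_three_words("Notre-Dame-des-Pins is nice"))              # Notre-Dame-des-Pins is nice
print(first_three_words("Notre-Dame-des-Pins is nice", WORD_CHARS))  # None
```

Because the last `\S+?` is lazy and the lookahead accepts punctuation, the capture stops at the first listed punctuation mark in the third word, so a third word such as `3.5` would be captured only as `3`.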

For details, see regex101: build, test, and debug regex

My thanks to Chris (@ccstone) for showing me how to make the RegEx Lookahead work.

Please let me know if you have any questions.

Kirby_Krieger · June 1, 2017, 1:37am

That site is brilliant, afaict. What a fantastic learning tool. Thanks for the link.

JMichaelTX · June 1, 2017, 2:08am

Yep. And thanks to recent improvements, it is now easy to create and maintain your own set of RegEx cases, including Title and Tags.

I highly recommend it.

I've looked in earnest several times for a RegEx desktop app, and could not find any I liked nearly as well as RegEx101.com. Of course, it does require a good Internet connection, but I have that with Comcast Cable (200Mbps).
