RegEx: How to Split String into Words

JMichaelTX · December 13, 2015, 1:13am

#### Actually, it finds 4 groups, or perhaps more accurately, it finds the same group 4 times.

If you look at the panel on the right, under "Match Information", it says "No match groups" when I do not use parentheses (as I did above).

Here is an example with NO groups, using the expression:
/"[^"]*"|[^:]+/g

#### When I use parentheses to identify group(s), the tool shows the groups that were found:
/("[^"]*"|[^:]+)/g

peternlewis · December 14, 2015, 3:49am

Done.

Also I've added split and list as matches for the For Each action in Add Action by Name and the action category search.

ComplexPoint · December 14, 2015, 1:01pm

on the right, under "Match Information", it says "No match groups"

Absolutely – illustrates why that edit is a good idea. Their conflation of matches with groups has demonstrably been a source of confusion : - )

JMichaelTX · December 15, 2015, 12:54am

@ComplexPoint, so are we now in agreement that the RegEx /("[^"]*"|[^:]+)/g is returning “groups”, as shown by the online tool in my above post?

peternlewis · December 15, 2015, 4:21am

No. RegEx /("[^"]*"|[^:]+)/g is returning multiple matches, each of which includes one capture group (technically two, since there is an implicit “0th” “All” capture group which is the entire match, which in this case matches the one explicit group).
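A minimal Python sketch of that distinction. The sample string is an assumption (the test string used in the online tool is not shown in this thread), chosen so that the pattern yields four matches, and Python's `re` module stands in for the online tool.

```python
import re

# Hypothetical input: colon-separated fields, one of them quoted.
sample = 'one:"two:three":four:five'

# Same expression as in the thread, without the /.../g wrapper.
pattern = re.compile(r'("[^"]*"|[^:]+)')

# finditer plays the role of the 'g' switch: it yields every match.
matches = list(pattern.finditer(sample))

for m in matches:
    # group(0) is the implicit whole-match group; group(1) is the explicit one.
    print(repr(m.group(0)), repr(m.group(1)))

match_count = len(matches)    # how many matches were found in the string
group_count = pattern.groups  # how many explicit capture groups the pattern has
```

Here `match_count` and `group_count` are different numbers because they count different things: the first depends on the input, the second only on the pattern.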

JMichaelTX · December 15, 2015, 4:38am

@peternlewis, OK, I was using the term "groups" in the generic sense, meaning one or more groups.

The main point is that use of parentheses in the RegEx returns a match group for each set of parentheses. The online tool perhaps goes further by showing all matches in the source string that fit the pattern of the group.

So, as I said above:

Is this an improper use of the RegEx term "group"?

peternlewis · December 15, 2015, 5:51am

It’s borderline. I would say it finds four matches, each with one capture group.

As a general rule though, you can only use one or the other - if you are using something that finds all the matches, it will return all the matches and ignore capture groups. Otherwise, if you are using something that finds the first match (or finds a complete match), then it may have the facility to return capture groups. It’s rare (except with the programming level API) to be able to find multiple matches and the capture groups in each.
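As a rough illustration of the programming-level case, here is a Python sketch. The `key=value` input and the variable names are made up for this example. `search` stops at the first match, while `finditer` walks every match, and each match object still carries its own capture groups.

```python
import re

# Hypothetical input with three key=value pairs.
text = 'a=1, b=2, c=3'
pair = re.compile(r'(\w+)=(\w+)')  # two explicit capture groups

# First match only, with its capture groups.
first = pair.search(text)
first_groups = first.groups()

# Every match, each with its own capture groups.
all_groups = [m.groups() for m in pair.finditer(text)]

print(first_groups)
print(all_groups)
```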

ComplexPoint · December 15, 2015, 8:45am

I was using the term "groups" in the generic sense

Of course, that's quite understandable, and that's what the author of that web-tool was doing too.

Probably worth noticing though, that their usage slightly confused your expectation of the regex behaviour, and is generating this thread.

Words get their meanings from particular fields/discourses/activities rather than from dictionaries. One simply offers more clarity to others, and probably to oneself as well, by trying to avoid unclear boundaries, in a technical field, between generic usage and the particular usage of that field.

In terms of regexes, a group is specifically a single string separately captured by enclosing a pattern in parentheses. Each pair of parentheses captures just one group, and captures it only once.

Each pair of parentheses gets a single and unique index, and assigns it to a single and unique matching string.

To multiply such captures, you have to have multiple pairs of parentheses. You can't get one pair of parentheses to return multiple matches.
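A small Python check of that last point, assuming a made-up comma-separated input. When one pair of parentheses sits inside a repeated construct, Python's `re` module keeps only the final string that pair captured within a single match; other engines may expose more, so treat this as a sketch of one engine's behaviour.

```python
import re

# One pair of capturing parentheses inside a repeated non-capturing group.
m = re.match(r'(?:(\w+),?)+', 'a,b,c')

whole = m.group(0)     # the entire match
captured = m.groups()  # a single slot, holding only the last capture

# Getting every item means asking for multiple matches instead.
items = re.findall(r'\w+', 'a,b,c')

print(whole, captured, items)
```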

ianthekirkland · December 15, 2015, 3:12pm

I could be wrong, but from your screenshots it appears that your confusion might be stemming from regex101’s user interface. I can see how one might be led to believe they are looking at ‘groups’ when they are really just being presented with matches. You may want to try one of these online tools:

Of course, I still use (and love) RegEx101!

In the end, I’d recommend simply putting parentheses around certain parts of your expression and noting the differences in the displayed results. At the end of the day, RegEx groups are similar to Keyboard Maestro’s ‘Group’ action: They’re basically arbitrary containers that wrap several items into a single unit so you can refer to stuff by unit name rather than the names of the individual contents.

ccstone · December 15, 2015, 11:12pm

Hey JM,

Good communication always demands some level of precision, and since regular expressions are very technical the necessary level of precision goes up when describing them.

A match.
A group.
A capture-group.

Are all considered to be separate entities.

(the\w+)   # A capture-group.
(?:the\w+) # A non-capturing group.

So while we might talk about grouping part of the pattern, we should always distinguish that from a capture-group.
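In Python, for instance, the difference shows up in what the match object hands back. The sample sentence below is an assumption for illustration only.

```python
import re

# Hypothetical sample text.
text = 'now and then they meet'

capturing = re.search(r'(the\w+)', text)
non_capturing = re.search(r'(?:the\w+)', text)

# Both patterns match the same text...
same_match = capturing.group(0) == non_capturing.group(0)

# ...but only the capture-group stores a separately retrievable string.
captured = capturing.groups()
not_captured = non_capturing.groups()

print(same_match, captured, not_captured)
```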

A match is the entire match – and as you've seen the 'g' switch may produce multiple matches.

You can easily get tripped up in one of these online analyzers that emulate programming language level /search/replace/ code.

Without understanding how all the separators and switches work it's easy to get confused.

Many people mistakenly believe the forward slashes are part of the pattern, so I don't advise their use except as part of real code snippets.

$myStrVar =~ s/\w/•/g; # Canonical Perl
$myStrVar =~ s!\w!•!g; # My preference for pattern separation characters ('!').

Unless I'm talking to someone I know is versed in the syntax I'm using I find it more effective to explicitly separate search (or find) pattern, replace pattern, and switches.

Search Pattern:

<my search pattern>

Replace Pattern:

<my replace pattern>

Switches:

msixg
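One way to read that three-part layout is as the separate arguments of a substitution call. The sketch below uses Python's `re.sub`; the sample text and variable names are assumptions, and the switch mapping shown is Python's, which may differ in other engines.

```python
import re

search_pattern = r'\w'
replace_pattern = '•'

# m, s, i and x map onto these flags. 'g' has no flag in Python because
# re.sub replaces every occurrence by default (count=0).
switches = re.MULTILINE | re.DOTALL | re.IGNORECASE | re.VERBOSE

my_str = 'Hi there!'  # hypothetical input

# With 'g': every word character is replaced.
result = re.sub(search_pattern, replace_pattern, my_str, flags=switches)

# Without 'g': only the first occurrence is replaced.
first_only = re.sub(search_pattern, replace_pattern, my_str, count=1, flags=switches)

print(result)
print(first_only)
```

Note that no slashes appear anywhere: the delimiters belong to the host language's syntax, not to the pattern.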

And when you're talking about a GUI regex environment it's usually more effective to present switches in-line in the pattern:

(?imx)<more pattern>           # ON
(?-imx)<more pattern>          # OFF
(?s-i)[a-z[:punct:][:blank:]]+ # MIXED
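Engines differ in how they accept in-line switches, so the lines above will not work everywhere as written. As a hedged sketch of one flavour: Python's `re` module expects a whole-pattern switch at the very start of the pattern, offers scoped forms such as `(?i:...)` and `(?-i:...)` for part of a pattern, and does not support POSIX bracket classes like `[:punct:]`. The sample strings are assumptions.

```python
import re

# Switch ON for the whole pattern: placed first.
on_everywhere = re.search(r'(?i)hello world', 'HELLO WORLD')

# Switch ON for one part only, using a scoped group.
on_scoped = re.search(r'(?i:hello) world', 'HELLO world')
outside_scope = re.search(r'(?i:hello) world', 'HELLO WORLD')

# MIXED: ON for the pattern as a whole, OFF for one part.
mixed = re.search(r'(?i)hello (?-i:world)', 'HELLO world')
mixed_fail = re.search(r'(?i)hello (?-i:world)', 'HELLO WORLD')

print(bool(on_everywhere), bool(on_scoped), bool(outside_scope))
print(bool(mixed), bool(mixed_fail))
```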

Another pitfall of regular expression analyzers is that you can get something working without understanding why it works. (Often true of Life in general of course.)

Using good vocabulary even when talking to oneself about RegEx helps us to understand them better. After 20 some-odd years of use and study I continue to learn new stuff, and I've found the need to tighten up my own vocabulary.

Hopefully this is more informative than pedantic.

-Chris

JMichaelTX · December 16, 2015, 12:13am

Excellent write-up Chris, and very helpful, to a lot of folks I suspect.

I think this should be added to the KM wiki, as a separate page, but linked to the main RegEx page. What do you think?

If you agree, I'll be glad to do the yeoman's work of putting it into the wiki, since you did the hard stuff.

ianthekirkland · December 16, 2015, 1:47am

It’s not (directly) regex-driven, but you might be able to get some use out of this plugin I threw together:

https://forum.keyboardmaestro.com/t/split-text-plugin-action/2595

JMichaelTX · December 16, 2015, 1:55am

Sounds good, Ian. Thanks for sharing.

I think this could be very useful. I'm overloaded at the moment, but as soon as I get a chance I'll take it for a test drive.
