RegEx: How to Split String into Words

This is in no way, shape, or form intuitive for newbies. You have to know about the action, and you have to know a bit of RegEx.

But. Once you've used it a time or two it becomes a useful tool in the toolbox.

Open the Keyboard Maestro Engine Log in the Console.app to see the macro work.

~/Library/Logs/Keyboard Maestro/Engine.log

In the RegEx I'm using 2 digits as the substring, and a Positive Lookahead Assertion of colon (:) OR end-of-line as the substring separator.
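For anyone who wants to try the idea outside Keyboard Maestro first, here is a minimal Python sketch of that pattern. The sample string is an assumption (the macro's real test data is not shown in this post), and Python's `re` module is only standing in for Keyboard Maestro's own regex engine, so treat it as an illustration rather than the macro itself.

```python
import re

# Hypothetical sample input; the macro's actual test string is not shown here.
sample = "12:34:56"

# Two digits, but only where they are followed by a colon or the end of the string.
# The lookahead (?=...) checks for the separator without consuming it.
pattern = re.compile(r"\d{2}(?=:|$)")

substrings = pattern.findall(sample)
for s in substrings:
    print(s)  # prints 12, then 34, then 56
```

Each printed value corresponds to one pass through the For Each loop in the macro.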

-Chris


For Each Substring Match.kmmacros (2.6 KB)

I didn't say it was intuitive.

But I will say when using Keyboard Maestro, whenever you hit a problem that deals with a list of things, the For Each action is the place to look. Keyboard Maestro has no other places where it iterates over a collection of items.

As I say, it is not intuitive - what solution to this sort of problem would be intuitive? I can't think of any.

Instead, once learnt, it is knowledge that applies across a range of problems within Keyboard Maestro - this is the best I know how to do…

The For each > Substrings option is excellent.

Perhaps, if users are getting to it via a notion of splitting, it might be useful to make For Each > Substrings appear as a hit for searches on "split", by inserting the word "split" in a relevant wiki paragraph somewhere?

OK, let's figure out how to add this to the wiki so that users who don't know KM or programming will know how to proceed. After all, the wiki is for those who don't know KM fully, right?

I guess what I'm saying is that, in the wiki, we need to connect RegEx with For Each, and provide some examples.

####Also, I want to make it easy for KM users to go from creating/testing a RegEx expression in a tool like RegEx101.com to KM.

So I just got this suggestion for creating multiple groups:

/("[^"]*"|[^:]+)/g

How do we instruct the KM user to implement this in KM?
Is the answer that any time a RegEx tool uses the global flag /g, the KM user needs to use the For Each action?

See https://regex101.com/r/jN0uT2/1
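One way to picture what /g is asking for: "give me every match, not just the first", which is the job the For Each action does in Keyboard Maestro. A rough Python sketch of the same pattern follows. The input string is made up for illustration, and `finditer` is Python's way of walking all matches, not anything Keyboard Maestro exposes.

```python
import re

# Hypothetical input: colon-separated fields, one of them quoted.
sample = 'alpha:"beta:gamma":delta'

# The suggested pattern, without the /.../g wrapper (that part is tool syntax).
pattern = re.compile(r'("[^"]*"|[^:]+)')

# finditer plays the role of the global flag: it visits every match in turn.
fields = [m.group(0) for m in pattern.finditer(sample)]
print(fields)  # ['alpha', '"beta:gamma"', 'delta']
```

Note that the colon inside the quoted field does not split it, because the quoted alternative is tried first.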

Chris, I'm probably doing something wrong, but this is not working for me. It only shows a notification for "12"

2/3 comments:

  1. by \d(2) I would guess that you meant \d{2} or \d\d

    ( \d(2) looks for a digit followed by the number two, which it places in a distinct capture group.
    The latter two look for two digits. \d\d may be quicker to parse and execute,
    as well as quicker to write, and perhaps a bit less error-prone.)

  2. As you found, it is quite easy for confusion to arise about the difference between regex groups and regex matches. If you do want to write learning materials, it would probably be good to make a quick edit to ensure that you are not using the terms interchangeably. (See yellow bubble above - the tool doesn't auto-find 4 groups, it finds four matches).

  3. If simplicity is the goal, and splitting is the first metaphor that comes to mind, perhaps you can try a search and replace with : --> \n, and then loop through the resulting lines
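To make comments 1 and 3 concrete, here is a small Python sketch. The input string is hypothetical, and Python's `re` is only a stand-in for Keyboard Maestro's engine. If the real test data looked anything like this sample, the first result would be consistent with seeing only "12".

```python
import re

sample = "12:34:56"  # hypothetical input

# \d(2): one digit followed by a literal "2" (the "2" lands in capture group 1).
literal_two = [m.group(0) for m in re.finditer(r"\d(2)", sample)]

# \d{2} and \d\d: any two consecutive digits.
two_digits = re.findall(r"\d{2}", sample)
also_two_digits = re.findall(r"\d\d", sample)

# Comment 3: replace ":" with a newline, then loop through the resulting lines.
lines = sample.replace(":", "\n").splitlines()

print(literal_two)      # ['12']
print(two_digits)       # ['12', '34', '56']
print(also_two_digits)  # ['12', '34', '56']
print(lines)            # ['12', '34', '56']
```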

####Actually, it finds 4 groups, or perhaps more accurately, it finds the same group 4 times.

If you notice the panel on the right, under "Match Information", it says "No match groups" when I do not use the parentheses (as I did above).

Here is an example with NO groups, using the expression:
/"[^"]*"|[^:]+/g

####When I use parentheses to identify group(s), the tool shows the groups that were found:
/("[^"]*"|[^:]+)/g

Done.

Also I've added split and list as matches for the For Each action in Add Action by Name and the action category search.

on the right, under "Match Information", it says "No match groups"

Absolutely – illustrates why that edit is a good idea. Their conflation of matches with groups has demonstrably been a source of confusion : - )

@ComplexPoint, so are we now in agreement that the RegEx /("[^"]*"|[^:]+)/g is returning “groups”, as shown by the online tool in my above post?

No. RegEx /("[^"]*"|[^:]+)/g is returning multiple matches, each of which includes one capture group (technically two, since there is an implicit “0th” “All” capture group which is the entire match, which in this case matches the one explicit group).
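The distinction can be seen by counting the two things separately. This is a Python sketch with a made-up input string chosen to give four matches; `pattern.groups` is Python's count of explicit capture groups in the pattern, and `group(0)` is its name for the implicit whole-match group.

```python
import re

sample = 'one:"two:three":four:five'  # hypothetical input
pattern = re.compile(r'("[^"]*"|[^:]+)')

matches = list(pattern.finditer(sample))

print(len(matches))    # 4 -> four matches in the string
print(pattern.groups)  # 1 -> one explicit capture group in the pattern

for m in matches:
    # group(0) is the entire match; group(1) is the one explicit capture group.
    # Here they are identical because the parentheses wrap the whole pattern.
    print(m.group(0), m.group(1))
```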

@peternlewis, OK, I was using the term "groups" in the generic sense, meaning one or more groups.

The main point is that use of parentheses in the RegEx returns a match group for each set of parentheses. The online tool perhaps goes further by showing all matches in the source string that fit the pattern of the group.

So, as I said above:

Is this an improper use of the RegEx term "group"?

It’s borderline. I would say it finds four matches, each with one capture group.

As a general rule though, you can only use one or the other - if you are using something that finds all the matches, it will return all the matches and ignore capture groups. Otherwise, if you are using something that finds the first match (or finds a complete match), then it may have the facility to return capture groups. It’s rare (except with the programming level API) to be able to find multiple matches and the capture groups in each.

I was using the term "groups" in the generic sense

Of course, that's quite understandable, and that's what the author of that web-tool was doing too.

Probably worth noticing though, that their usage slightly confused your expectation of the regex behaviour, and is generating this thread.

Words get their meanings from particular fields/discourses/activities rather than from dictionaries. One simply offers more clarity to others, and probably to oneself as well, by trying to avoid unclear boundaries, in a technical field, between generic usage and the particular usage of that field.

In terms of regexes, a group is specifically a single string separately captured by enclosing a pattern in parentheses. Each pair of parentheses captures just one group, and captures it only once.

Each pair of parentheses gets a single and unique index, and assigns it to a single and unique matching string.

To multiply such captures, you have to have multiple pairs of parentheses. You can't get one pair of parentheses to return multiple matches.
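A quick way to check this for yourself is to repeat a single group and see what it holds afterwards. The sketch below uses Python's `re` and a made-up input; in Python a repeated group keeps only its last capture, and other engines may expose this differently.

```python
import re

sample = "12:34:56"  # hypothetical input

# One pair of parentheses, repeated with +. The group is overwritten on each
# repetition, so after the match it holds a single string.
repeated = re.match(r"(\d\d:?)+", sample)
print(repeated.group(0))  # 12:34:56  (the entire match)
print(repeated.group(1))  # 56        (only the last repetition survives)

# Three pairs of parentheses give three separately indexed captures.
separate = re.match(r"(\d\d):(\d\d):(\d\d)", sample)
print(separate.groups())  # ('12', '34', '56')
```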

I could be wrong, but from your screenshots it appears that your confusion might be stemming from regex101’s user interface. I can see how one might be led to believe they are looking at ‘groups’ when they are really just being presented with matches. You may want to try one of these online tools:

Of course, I still use (and love) RegEx101!

In the end, I’d recommend simply putting parentheses around certain parts of your expression and noting the differences in the displayed results. At the end of the day, RegEx groups are similar to Keyboard Maestro’s ‘Group’ action: They’re basically arbitrary containers that wrap several items into a single unit so you can refer to stuff by unit name rather than the names of the individual contents.

Hey JM,

Good communication always demands some level of precision, and since regular expressions are very technical the necessary level of precision goes up when describing them.

A match.
A group.
A capture-group.

Are all considered to be separate entities.

(the\w+)   # A capture-group.
(?:the\w+) # A non-capturing group.

So while we might talk about grouping part of the pattern, we should always distinguish that from a capture-group.
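Here is a small Python sketch of the difference, using a made-up input string. Both patterns match the same text; only the capturing one reports a capture group.

```python
import re

sample = "over there"  # hypothetical input

capturing = re.search(r"(the\w+)", sample)        # a capture group
non_capturing = re.search(r"(?:the\w+)", sample)  # a non-capturing group

# The overall match is the same in both cases.
print(capturing.group(0))      # there
print(non_capturing.group(0))  # there

# Only the capturing version records a numbered group.
print(capturing.groups())      # ('there',)
print(non_capturing.groups())  # ()
```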

A match is the entire match – and as you've seen the 'g' switch may produce multiple matches.

You can easily get tripped up in one of these online analyzers that emulate programming language level /search/replace/ code.

Without understanding how all the separators and switches work it's easy to get confused.

Many people mistakenly believe the forward slashes are part of the pattern, so I don't advise their use except as part of real code snippets.

$myStrVar =~ s/\w/•/g # Canonical Perl
$myStrVar =~ s!\w!•!g # My preference for pattern separation characters ('!'). 

Unless I'm talking to someone I know is versed in the syntax I'm using I find it more effective to explicitly separate search (or find) pattern, replace pattern, and switches.

Search Pattern:

<my search pattern>

Replace Pattern:

<my replace pattern>

Switches:

msixg
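The same separation falls out naturally in languages where the pieces are plain function arguments. As a rough Python equivalent of the Perl substitution above (the variable names and input are made up for illustration):

```python
import re

my_str = "a b-c"        # hypothetical input
search_pattern = r"\w"  # what to find
replace_pattern = "•"   # what to put in its place

# re.sub replaces every match by default, which corresponds to the 'g' switch.
replaced_all = re.sub(search_pattern, replace_pattern, my_str)
print(replaced_all)    # • •-•

# Limiting the count to 1 corresponds to leaving 'g' off.
replaced_first = re.sub(search_pattern, replace_pattern, my_str, count=1)
print(replaced_first)  # • b-c
```

With no slashes or other delimiters in sight, there is nothing to mistake for part of the pattern.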

And when you're talking about a GUI regex environment it's usually more effective to present switches in-line in the pattern:

(?imx)<more pattern>           # ON
(?-imx)<more pattern>          # OFF
(?s-i)[a-z[:punct:][:blank:]]+ # MIXED
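In-line switches are easy to experiment with. The sketch below uses Python's `re`, which differs from the syntax shown above in a few ways: a global in-line flag such as `(?i)` has to sit at the very start of the pattern, switching a flag on or off part-way through is done with a scoped group like `(?i:...)`, and POSIX classes such as `[:punct:]` are not supported. The input string is made up.

```python
import re

sample = "THE The theory"  # hypothetical input

# No switches: case-sensitive.
plain = re.findall(r"the\w*", sample)

# In-line switch at the start of the pattern: case-insensitive throughout.
inline = re.findall(r"(?i)the\w*", sample)

# The same switch passed separately as a flag argument.
flagged = re.findall(r"the\w*", sample, flags=re.IGNORECASE)

# Scoped switch: only the "t" ignores case, "he" must still be lower case.
scoped = re.findall(r"(?i:t)he\w*", sample)

print(plain)    # ['theory']
print(inline)   # ['THE', 'The', 'theory']
print(flagged)  # ['THE', 'The', 'theory']
print(scoped)   # ['The', 'theory']
```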

Another pitfall of regular expression analyzers is that you can get something working without understanding why it works. (Often true of Life in general of course.)

Using good vocabulary even when talking to oneself about RegEx helps us to understand them better. After 20 some-odd years of use and study I continue to learn new stuff, and I've found the need to tighten up my own vocabulary.

Hopefully this is more informative than pedantic.

-Chris

Excellent write-up Chris, and very helpful, to a lot of folks I suspect.

I think this should be added to the KM wiki, as a separate page, but linked to the main RegEx page. What do you think?

If you agree, I'll be glad to do the yeoman's work of putting it into the wiki, since you did the hard stuff. :wink:

It’s not (directly) regex-driven, but you might be able to get some use out of this plugin I threw together:

https://forum.keyboardmaestro.com/t/split-text-plugin-action/2595

Sounds good, Ian. Thanks for sharing. :thumbsup:

I think this could be very useful. I'm overloaded at the moment, but as soon as I get a chance I'll take it for a test drive.