RegEx for Horizontal Whitespace (\s \h \t blank etc)

JMichaelTX · September 26, 2017, 11:30pm

This is probably the most often recommended, but it is too aggressive for my tastes:

\s
Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab

So I have gotten into the habit of just replacing TABs and SPACEs:
[ \t]+

It just depends on your needs, and the source text.
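
As a rough illustration of the difference, here is a minimal sketch using Python's `re` module as a stand-in engine (an assumption; the thread itself is about ICU and JavaScript). The sample text is invented for the example.

```python
import re

# Invented sample: mixed spaces and tabs, plus a blank line between paragraphs.
text = "alpha \t  beta\n\ngamma\t\tdelta"

# \s+ also swallows the newlines, so the line structure is lost.
aggressive = re.sub(r"\s+", " ", text)

# [ \t]+ only collapses runs of spaces and tabs; the newlines survive.
conservative = re.sub(r"[ \t]+", " ", text)

print(repr(aggressive))    # 'alpha beta gamma delta'
print(repr(conservative))  # 'alpha beta\n\ngamma delta'
```

Which of the two is "right" depends on whether line breaks in the source text carry meaning.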

Tom · September 27, 2017, 1:01am

You can also use \h which stands for 'horizontal whitespace', that is, it doesn't match newlines:

Match a Horizontal White Space character. They are characters with Unicode General Category of Space_Separator plus the ASCII tab (\u0009). [Since ICU 55]

from ICU Regex

JMichaelTX · September 27, 2017, 1:04am

Yep, that will work fine for KM.
IIRC, that is not supported by the JavaScript RegEx engine (yet). So that's why I use the [ \t] everywhere. It's easier (for me) to just remember one thing than to remember exceptions.

Tom · September 27, 2017, 1:06am

Also works in Perl, BBEdit (PCRE), …

JMichaelTX · September 27, 2017, 1:22am

Well, as long as we are providing a more complete reference, here's:
Shorthand Character Classes at regular-expressions.info

Perl 5.10 introduced \h and \v. \h matches horizontal whitespace, which includes the tab and all characters in the "space separator" Unicode category. It is the same as [\t\p{Zs}]. \v matches "vertical whitespace", which includes all characters treated as line breaks in the Unicode standard. It is the same as [\n\cK\f\r\x85\x{2028}\x{2029}].

PCRE also supports \h and \v starting with version 7.2. PHP does as of version 5.2.2, Java as of version 8, and the JGsoft engine as of version 2. Boost supports \h starting with version 1.42. No version of Boost supports \v as a shorthand.

In many other regex flavors, \v matches only the vertical tab character. Perl, PCRE, and PHP never supported this, so they were free to give \v a different meaning. Java 4 to 7 and JGsoft V1 did use \v to match only the vertical tab. Java 8 and JGsoft V2 changed the meaning of this token anyway. The vertical tab is also a vertical whitespace character. To avoid confusion, the above paragraph uses \cK to represent the vertical tab.

Ruby 1.9 and later have their own version of \h. It matches a single hexadecimal digit just like [0-9a-fA-F]. \v is a vertical tab in Ruby.

That's one reason I always test my RegEx patterns at RegEx101.com. They provide a language selector:

Although this says "pcre (php)" I believe pcre stands for Perl Compatible Regular Expressions. The ICU RegEx used by KM is based on perl RegEx, so I use the RegEx101.com "pcre" for KM testing.

Finally, here's Wikipedia's take:
Comparison of regular expression engines

Tom · September 27, 2017, 1:47am

That's an important point. Many people are still thinking in ASCII, where things like Three-Per-Em-Space or Hair Space don't exist.

Yes, I also think this is the best choice to approximate the ICU regex.

peternlewis · September 27, 2017, 3:42am

Note that \h was introduced in ICU 55, at the same time as \R, which, as we all know, works in 10.11 and 10.12 but not in 10.10.

Tom · September 28, 2017, 12:53pm

The problem is that [ \t] is not equivalent to \h:

[ \t] covers exactly two whitespace characters (U+0020 and U+0009), whereas \h covers well over a dozen.

For a nice and readable list of whitespace characters see the Wikipedia article.

Of course, most of them are rarely used, but the following characters you will find in texts:

U+00A0 no-break space

and to lesser degree:

U+2009 thin space
U+202F narrow no-break space
U+200A hair space

Those characters are directly accessible in common programs like Adobe InDesign:

…and of course also in TeX-based typesetting systems (ConTeXt, LaTeX) and others. And they are used (not only by me ;)).

So, to cover those characters as well, there are (at least) two good replacements for \h that should also work with programs/languages that do not (yet) support \h:

[[:blank:]]

That is the POSIX expression for \h. (As you know, I don't have the faintest clue of JavaScript, so you have to test it yourself if it works with JS.)

If the POSIX expression doesn't work, then this should work:

[^\S\r\n]

Neither is 100% equivalent to \h, but they should do the trick with most texts. In any case, they are more comprehensive than [ \t].
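
To make the gap between [ \t] and [^\S\r\n] concrete, here is a small sketch in Python. Python's `re` module is only a stand-in here (it supports neither \h nor [[:blank:]]), and the choice of sample characters is an assumption based on the ones named above.

```python
import re

# Characters mentioned above, plus a line feed as a control case.
samples = {
    "SPACE": "\u0020",
    "TAB": "\u0009",
    "NO-BREAK SPACE": "\u00a0",
    "THIN SPACE": "\u2009",
    "HAIR SPACE": "\u200a",
    "NARROW NO-BREAK SPACE": "\u202f",
    "LINE FEED": "\n",
}

narrow = re.compile(r"[ \t]")
broad = re.compile(r"[^\S\r\n]")

# Collect the names of the sample characters each class matches.
narrow_hits = [name for name, ch in samples.items() if narrow.fullmatch(ch)]
broad_hits = [name for name, ch in samples.items() if broad.fullmatch(ch)]

print("[ \\t] matches:", narrow_hits)
print("[^\\S\\r\\n] matches:", broad_hits)
```

In this engine the first class matches only SPACE and TAB, while the second matches every sample except LINE FEED. Other engines define \s differently, so the result should be re-checked wherever the pattern is actually used.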

This StackOverflow post is worth reading:

gglick · September 28, 2017, 1:10pm

Well damn, I know what I’m using in my regexes from now on.

peternlewis · September 28, 2017, 1:20pm

If you open the Help ➤ ICU Regular Expression Reference entry in Keyboard Maestro, you can see that \h is Space_Separator plus the ASCII tab (\u0009).

If you look up Help ➤ Regular Expression Unicode Properties and search for Space_Separator, you can find that

Space_Separator is defined as \p{Space_Separator} aka \p{Zs} aka \p{General_Category=Space_Separator} aka \p{gc=Zs} and contains these 17 characters:

[32 SPACE][160 NO-BREAK SPACE][5760 OGHAM SPACE MARK][8192 EN QUAD][8193 EM QUAD][8194 EN SPACE][8195 EM SPACE][8196 THREE-PER-EM SPACE][8197 FOUR-PER-EM SPACE][8198 SIX-PER-EM SPACE][8199 FIGURE SPACE][8200 PUNCTUATION SPACE][8201 THIN SPACE][8202 HAIR SPACE][8239 NARROW NO-BREAK SPACE][8287 MEDIUM MATHEMATICAL SPACE][12288 IDEOGRAPHIC SPACE]

I haven’t updated the page in a while, so it is possible the category may have been extended somewhat.

On the same page you can find the value of blank, which is exactly the above set plus Tab, which should be exactly what \h uses.

So [[:blank:]] or \p{blank} should be perfect, or [\p{Zs}\t]. All of which work on 10.10 (I checked ;- ).

\s matches [\t\n\f\r\p{Z}].

So [^\S\r\n] should match [\t\f\p{Z}]

But \f is form feed which should not be included, and \p{Z} is a different set.

[32 SPACE][160 NO-BREAK SPACE][5760 OGHAM SPACE MARK][8192 EN QUAD][8193 EM QUAD][8194 EN SPACE][8195 EM SPACE][8196 THREE-PER-EM SPACE][8197 FOUR-PER-EM SPACE][8198 SIX-PER-EM SPACE][8199 FIGURE SPACE][8200 PUNCTUATION SPACE][8201 THIN SPACE][8202 HAIR SPACE][8232 LINE SEPARATOR][8233 PARAGRAPH SEPARATOR][8239 NARROW NO-BREAK SPACE][8287 MEDIUM MATHEMATICAL SPACE][12288 IDEOGRAPHIC SPACE]

Which includes [8232 LINE SEPARATOR][8233 PARAGRAPH SEPARATOR] which should not be included.

So, after all that: use [[:blank:]] or \p{blank} or [\p{Zs}\t], or take your chances with \h or a less accurate set.
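
The comparison above can be reproduced outside KM. The following is a sketch in Python, which is an assumption of convenience: its `re` module has no \p{Zs}, so the set is rebuilt from `unicodedata`, and Python's own definition of \s differs somewhat from ICU's. The variable names are made up for the example.

```python
import re
import sys
import unicodedata

# All code points in General_Category Zs, per this Python's Unicode tables.
zs = [chr(cp) for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Zs"]

# Rough stand-in for [\p{Zs}\t] in an engine without \p{...}.
# None of these characters is special inside a character class.
horizontal = re.compile("[\t" + "".join(zs) + "]")

# Everything Python's [^\S\r\n] matches beyond that set.
loose = re.compile(r"[^\S\r\n]")
extras = [f"U+{cp:04X}" for cp in range(sys.maxunicode + 1)
          if loose.fullmatch(chr(cp)) and not horizontal.fullmatch(chr(cp))]

print(len(zs), [ord(c) for c in zs])
print(extras)
```

On recent Python versions the first line lists the same 17 code points as above. The extras include form feed (U+000C) and the line and paragraph separators (U+2028, U+2029), which is the same kind of over-matching described above, although the exact extras depend on the engine.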

Tom · September 28, 2017, 1:26pm

Thanks for confirming.

Does it also work in JavaScript (which was @JMichaelTX’s concern)?

JMichaelTX · September 28, 2017, 9:19pm

According to RegEx101.com, the answer is no.

My sincere thanks to @Tom and @peternlewis for pursuing this subject so thoroughly.

Based on their posts, my research, and testing at RegEx101.com, it seems to me this is the best horizontal whitespace pattern that will work in all languages, including JavaScript:
[^\S\r\n\f]

2017-10-30 18:17 CT
If you are just working with the KM RegEx Actions, which all use the ICU RegEx engine, and your macOS is 10.11+, then clearly the best choice is:
\h (requires macOS 10.11+)

You can view my test case at: regex101: build, test, and debug regex

To see how the pattern works with different languages, click on the language in the sidebar.

If anyone sees an issue using [^\S\r\n\f] for matching horizontal whitespace, please advise.
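
For anyone who wants to try the pattern outside regex101, here is a minimal sketch in Python. Python is only one more engine to test against (an assumption, not one of the engines discussed above), and `collapse_horizontal` and the sample string are hypothetical names and data made up for the example.

```python
import re

# The portable pattern from this post, applied to runs of characters.
H_SPACE = re.compile(r"[^\S\r\n\f]+")

def collapse_horizontal(text):
    """Replace each run of horizontal whitespace with one ASCII space."""
    return H_SPACE.sub(" ", text)

# Invented sample with NBSP, thin space, tabs, narrow NBSP, CRLF and LF.
sample = "name:\u00a0\u2009Tom\t\tKM\r\nnext\u202fline\n"
print(repr(collapse_horizontal(sample)))  # 'name: Tom KM\r\nnext line\n'

# Caveat in this engine: \s also covers U+2028 and vertical tab,
# so the negated class still matches them.
print(bool(H_SPACE.fullmatch("\u2028")))  # True
print(bool(H_SPACE.fullmatch("\x0b")))    # True
```

If line separators or vertical tabs can occur in the source text, they can be added to the exclusion list in the same way as \f.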

JMichaelTX · October 13, 2017, 8:29pm

@Tom, the saga continues.

I have 3+ sources that all state that [[:blank:]] is the same as [ \t]:

2017-10-13 16:33 CT

So, unfortunately, it is NOT a replacement for \h.

See:

[[:blank:]] is NOT mentioned in Regular Expressions - ICU User Guide

Updated 2017-10-13 16:47 CT

OK, this is about as authoritative as I can find:

[perlrecharclass - perldoc.perl.org](http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes)

[Image: screenshot of the [[:blank:]] explanation from perlrecharclass]

JMichaelTX · October 13, 2017, 9:30pm

@Tom, after re-reading what I posted, I saw the \h in the image of the [[:blank:]] explanation, so it confused me.

Here is a test of [[:blank:]]

It shows that it does NOT match OPTION-SPACE, which \h does match.

All of the references I found simply state that [[:blank:]] matches "space and tab".
I have taken "space" to mean the standard ASCII space character.

If you or @peternlewis can add any clarity to this, please do so.

ccstone · October 14, 2017, 7:27am

Hey Jim,

Actually non-breaking spaces match just fine in BBEdit and Keyboard Maestro.

RegEx ⇢ String ⇢ Test KM Find & Replace RegEx.kmmacros (2.9 KB)

•••••   Space
•••••   NBS (via Opt-Space)
•••••   U+00A0 non-breaking-space (via Unicode Entry)
•••••   U+2009 thin space
•••••   U+202F narrow no-break space
•••••   U+200A hair space
•••••   Tab

BBEdit uses PCRE, so I'm not sure why regex101 is failing to make the match...

-Chris

Tom · October 14, 2017, 10:48am

This is referring to ASCII (!)

Here is a test for KM:

[test] Whitespace.kmmacros (5.5 KB)

Set your expression in the green action, for example:

[[:blank:]]
[ \t]
\h
\p{Zs}
[\p{Zs}\t]
\s
[^\S]
[^\S\n\r]

On my machine

[[:blank:]]
\h
[\p{Zs}\t]
[^\S\n\r]

deliver the same result: they match all horizontal whitespace characters except the zero-width spaces and the joiners.

Also

\s
[^\S]

match the same whitespace characters as the above expressions, plus, of course, the vertical spaces / newlines (which are excluded from being matched in my test).

Note:

The list of spaces used is not comprehensive (only the more common ones).

gglick · October 14, 2017, 11:04am

I don't have much to contribute here, especially after @Tom's epic test case, but for what it's worth I can confirm that [[:blank:]] in BBEdit matches all the characters in @ccstone's example:

(on a side note, I love that BBEdit added regex capabilities to the live search feature; I've wanted this kind of capability in it for a long time)

I can also confirm that whatever weirdness is happening with regex101's refusal to have [[:blank:]] match the same characters is also happening with Patterns, a regex101-like native Mac app, so perhaps there's some common factor at work here:

Tom · October 14, 2017, 11:22am

On regex101 you have to set the Unicode flag (u):

Without the Unicode flag you will only get ASCII characters (hence the same as [ \t])
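
The same kind of switch exists in other engines. As an analogy only (this is Python's `re` module, not regex101 or PCRE), the sketch below shows how an ASCII-only mode changes what a whitespace class matches. In Python the default for text patterns is Unicode-aware, and `re.ASCII` restricts \s and \S to ASCII.

```python
import re

pattern = r"[^\S\r\n]"

unicode_aware = re.compile(pattern)         # default for str patterns
ascii_only = re.compile(pattern, re.ASCII)  # \s and \S limited to ASCII

nbsp = "\u00a0"  # no-break space (OPTION-SPACE on a Mac keyboard)

print(bool(unicode_aware.fullmatch(nbsp)))  # True
print(bool(ascii_only.fullmatch(nbsp)))     # False
print(bool(ascii_only.fullmatch("\t")))     # True
```

So a pattern that looks identical can match different characters depending on whether the engine is in a Unicode-aware mode, which is worth checking before comparing results between tools.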

Tom · October 14, 2017, 12:05pm

See if the app has a setting to enable Unicode (like with regex101.com).

Otherwise Oyster (macOS) seems to work correctly. Also Mark Alldritt's RegEx Knife (iOS).

gglick · October 14, 2017, 12:21pm

It does not (in fact, it doesn't have any settings at all other than the regex flavor) but I did find out that it too explicitly says [[:blank:]] only matches spaces and tabs:

Strangely enough, it doesn't even list \h in its reference, even though it lists \v
