RegEx for Horizontal Whitespace (\s \h \t blank etc)

Note that \h was introduced in ICU 55, at the same time as \R, which, as we all know, works in macOS 10.11 and 10.12 but not in 10.10.

The problem is that [ \t] is not equivalent to \h:

[ \t] covers exactly two whitespace characters (U+0020 and U+0009), whereas \h covers well over a dozen.

For a nice and readable list of whitespace characters see the Wikipedia article.

Of course, most of them are rarely used, but you will find the following character in real texts:

U+00A0 no-break space

and, to a lesser degree:

U+2009 thin space
U+202F narrow no-break space
U+200A hair space

Those characters are directly accessible in common programs like Adobe InDesign:

…and of course also in TeX-based typesetting systems (ConTeXt, LaTeX) and others. And they are used (not only by me ;)).
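
As a rough illustration of that gap, here is a minimal Python sketch. Python's re module is only a stand-in here (it is not the ICU engine), and the samples dictionary is my own selection of the characters named above. It checks which of them [ \t] fails to match:

```python
import re

# Characters named in this post; the selection is illustrative, not exhaustive.
samples = {
    "SPACE": "\u0020",
    "TAB": "\u0009",
    "NO-BREAK SPACE": "\u00a0",
    "THIN SPACE": "\u2009",
    "NARROW NO-BREAK SPACE": "\u202f",
    "HAIR SPACE": "\u200a",
}

narrow = re.compile(r"[ \t]")

# Collect every sample that the two-character class does not match.
missed = [name for name, ch in samples.items() if not narrow.fullmatch(ch)]
print(missed)
# ['NO-BREAK SPACE', 'THIN SPACE', 'NARROW NO-BREAK SPACE', 'HAIR SPACE']
```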


So, to cover those characters as well, there are (at least) two good replacements for \h that should also work with programs/languages that do not (yet) support \h:

[[:blank:]]

That is the POSIX expression for \h. (As you know, I don't have the faintest clue about JavaScript, so you will have to test for yourself whether it works with JS.)

If the POSIX expression doesn't work, then this should work:

[^\S\r\n]

Neither is 100% equivalent to \h, but they should do the trick with most texts. In any case they are more comprehensive than [ \t].
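
For what it's worth, the second pattern can be tried in any engine whose \s is Unicode-aware. A small Python sketch follows; Python has neither \h nor [[:blank:]], so this only exercises [^\S\r\n], and the sample text is made up:

```python
import re

# "Whitespace that is neither CR nor LF": the double negative keeps everything
# \s would match, minus the two line-break characters.
horizontal = re.compile(r"[^\S\r\n]")

# Made-up sample: no-break space, thin space, tab, plain space, then a newline.
text = "a\u00a0b\u2009c\td e\nf"

found = horizontal.findall(text)
print([f"U+{ord(ch):04X}" for ch in found])
# ['U+00A0', 'U+2009', 'U+0009', 'U+0020']
```

The newline is skipped, while both the ASCII and the non-ASCII spaces are found.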



This StackOverflow post is worth reading:


Well damn, I know what I’m using in my regexes from now on.


If you open the Help ➤ ICU Regular Expression Reference entry in Keyboard Maestro, you can see that \h is Space_Separator plus the ASCII tab (\u0009).

If you look up Help ➤ Regular Expression Unicode Properties and search for Space_Separator, you can find that

Space_Separator is defined as \p{Space_Separator} aka \p{Zs} aka \p{General_Category=Space_Separator} aka \p{gc=Zs} and contains these 17 characters:

[32 SPACE][160 NO-BREAK SPACE][5760 OGHAM SPACE MARK][8192 EN QUAD][8193 EM QUAD][8194 EN SPACE][8195 EM SPACE][8196 THREE-PER-EM SPACE][8197 FOUR-PER-EM SPACE][8198 SIX-PER-EM SPACE][8199 FIGURE SPACE][8200 PUNCTUATION SPACE][8201 THIN SPACE][8202 HAIR SPACE][8239 NARROW NO-BREAK SPACE][8287 MEDIUM MATHEMATICAL SPACE][12288 IDEOGRAPHIC SPACE]

I haven’t updated the page in a while, so it is possible the category may have been extended somewhat.

On the same page you will find the value of blank, which is exactly the above set plus Tab, which should be exactly what \h uses.

So [[:blank:]] or \p{blank} should be perfect, or [\p{Zs}\t]. All of which work on 10.10 (I checked ;-) ).
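
If you want to double-check the list against a different Unicode database, here is a hedged Python sketch. It uses the standard unicodedata module, whose Unicode version depends on your Python build and need not match the ICU version in Keyboard Maestro, so the count could differ on very old or future versions. The names zs and blank are just my labels:

```python
import sys
import unicodedata

# Every code point whose General_Category is Zs (Space_Separator).
zs = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Zs"]

# "blank" as described above: the Zs set plus TAB (U+0009).
blank = sorted(zs + [0x09])

for cp in zs:
    print(cp, unicodedata.name(chr(cp)))
print(len(zs), "Zs characters;", len(blank), "with TAB")
```

With a Unicode database of version 6.3 or later this should list the same 17 entries as above.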

\s matches [\t\n\f\r\p{Z}].

So [^\S\r\n] should match [\t\f\p{Z}]

But \f is form feed which should not be included, and \p{Z} is a different set.

[32 SPACE][160 NO-BREAK SPACE][5760 OGHAM SPACE MARK][8192 EN QUAD][8193 EM QUAD][8194 EN SPACE][8195 EM SPACE][8196 THREE-PER-EM SPACE][8197 FOUR-PER-EM SPACE][8198 SIX-PER-EM SPACE][8199 FIGURE SPACE][8200 PUNCTUATION SPACE][8201 THIN SPACE][8202 HAIR SPACE][8232 LINE SEPARATOR][8233 PARAGRAPH SEPARATOR][8239 NARROW NO-BREAK SPACE][8287 MEDIUM MATHEMATICAL SPACE][12288 IDEOGRAPHIC SPACE]

Which includes [8232 LINE SEPARATOR][8233 PARAGRAPH SEPARATOR] which should not be included.
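
The difference can also be computed rather than read off the tables. Here is a Python sketch, with the caveat that Python's \s is defined differently from ICU's (it follows str.isspace()), so the exact list of extras is engine-specific. The point is only that a negated-\S class inherits whatever the engine's \s contains:

```python
import re
import sys
import unicodedata

# Reference set: Zs plus TAB, i.e. the "blank" set discussed above.
blank = {cp for cp in range(sys.maxunicode + 1)
         if unicodedata.category(chr(cp)) == "Zs"} | {0x09}

candidate = re.compile(r"[^\S\r\n]")

# Every code point the candidate matches in this engine.
matched = {cp for cp in range(sys.maxunicode + 1)
           if candidate.fullmatch(chr(cp))}

extras = sorted(matched - blank)    # matched, but not in blank
missing = sorted(blank - matched)   # in blank, but not matched

print("extra:", [f"U+{cp:04X}" for cp in extras])
print("missing:", [f"U+{cp:04X}" for cp in missing])
```

In Python the extras include form feed (U+000C) as well as LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029), which is the same kind of over-matching described above.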

So, after all that: use [[:blank:]] or \p{blank} or [\p{Zs}\t], or take your chances with \h or a less accurate set.


Thanks for confirming.

Does it also work in JavaScript (which was @JMichaelTX’s concern)?

According to RegEx101.com, the answer is no.

My sincere thanks to @Tom and @peternlewis for pursuing this subject so thoroughly.

Based on their posts, my research, and testing at RegEx101.com, it seems to me this is the best horizontal whitespace pattern that will work in all languages, including JavaScript:
[^\S\r\n\f]

updated 2017-10-30 18:17 CT
If you are just working with the KM RegEx Actions, which all use the ICU RegEx engine, and your macOS is 10.11+, then clearly the best choice is:
\h (requires macOS 10.11+)

You can view my test case at: regex101: build, test, and debug regex

  • To see how the pattern works with different languages, click on the language in the sidebar.

If anyone sees an issue using [^\S\r\n\f] for matching horizontal whitespace, please advise.


@Tom, the saga continues. :wink:

I have 3+ sources that all state that [[:blank:]] is the same as [ \t]:

updated 2017-10-13 16:33 CT

So, unfortunately, it is NOT a replacement for \h.

See:

[[:blank:]] is NOT mentioned in Regular Expressions - ICU User Guide

updated 2017-10-13 16:47 CT

OK, this is about as authoritative as I can find:

[perlrecharclass - perldoc.perl.org](http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes)

(image: screenshot of the [[:blank:]] explanation from perlrecharclass)

@Tom, after re-reading what I posted, I saw the \h in the image of the [[:blank:]] explanation, so it confused me.

Here is a test of [[:blank:]]

It shows that it does NOT match OPTION-SPACE, which \h does match.

All of the references I found simply state that [[:blank:]] matches "space and tab".
I have taken "space" to mean the standard ASCII space character.

If you or @peternlewis can add any clarity to this, please do so.

Hey Jim,

Actually non-breaking spaces match just fine in BBEdit and Keyboard Maestro.

RegEx ⇢ String ⇢ Test KM Find & Replace RegEx.kmmacros (2.9 KB)


•••••   Space
•••••   NBS (via Opt-Space)
•••••   U+00A0 non-breaking-space (via Unicode Entry)
•••••   U+2009 thin space
•••••   U+202F narrow no-break space
•••••   U+200A hair space
•••••   Tab

BBEdit uses PCRE, so I'm not sure why regex101 is failing to make the match...

-Chris


This is referring to ASCII (!)

Here is a test for KM:

[test] Whitespace.kmmacros (5.5 KB)

Set your expression in the green action, for example:

[[:blank:]]
[ \t]
\h
\p{Zs}
[\p{Zs}\t]
\s
[^\S]
[^\S\n\r]

On my machine

[[:blank:]]
\h
[\p{Zs}\t]
[^\S\n\r]

deliver the same result: they match all horizontal whitespace characters except the zero-width spaces and the joiners.

Also

\s
[^\S]

match the same whitespace characters as the above expressions, plus, of course, the vertical spaces / newlines (which are excluded from matching in my test).

Note:

The list of spaces used is not comprehensive (only the more common ones).


I don't have much to contribute here, especially after @Tom's epic test case, but for what it's worth I can confirm that [[:blank:]] in BBEdit matches all the characters in @ccstone's example:

(on a side note, I love that BBEdit added regex capabilities to the live search feature; I've wanted this kind of capability in it for a long time)

I can also confirm that whatever weirdness is happening with regex101's refusal to have [[:blank:]] match the same characters is also happening with Patterns, a regex101-like native Mac app, so perhaps there's some common factor at work here:


On regex101 you have to set the Unicode flag (u):

Without the Unicode flag you will only get ASCII characters (hence the same result as [ \t]).
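
The same kind of switch exists in other engines. As an analogy only (this is Python's re module, not regex101's PCRE flavor, and the direction is reversed: Python is Unicode-aware by default and re.ASCII turns that off), a minimal sketch:

```python
import re

nbsp = "\u00a0"  # NO-BREAK SPACE (Option-Space on a Mac keyboard)

# Default for str patterns in Python 3: \s and \S are Unicode-aware.
unicode_mode = re.compile(r"[^\S\r\n]")
# re.ASCII restricts \s and \S to ASCII whitespace, like a missing "u" flag.
ascii_mode = re.compile(r"[^\S\r\n]", re.ASCII)

unicode_hit = bool(unicode_mode.fullmatch(nbsp))
ascii_hit = bool(ascii_mode.fullmatch(nbsp))
print(unicode_hit, ascii_hit)  # True False
```

So when a tester refuses to match a no-break space, the engine's Unicode mode is the first thing worth checking.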


See if the app has a setting to enable Unicode (like with regex101.com).

Otherwise Oyster (macOS) seems to work correctly. Also Mark Alldritt's RegEx Knife (iOS).

It does not (in fact, it doesn't have any settings at all other than the regex flavor) but I did find out that it too explicitly says [[:blank:]] only matches spaces and tabs:

Strangely enough, it doesn't even list \h in its reference, even though it lists \v :person_shrugging:

Maybe it isn't Unicode-capable. Or the author copied the reference list from the ASCII column on regular-expressions.info or the like.

What happens when you use \h?

Oh it supports \h just fine:

It just doesn't list it in the app's reference sheet.


While we're on this subject, here's another macOS app, Expressions (I remembered this one from my Setapp trial), that both handles [[:blank:]] appropriately and correctly explains what \h does:

Yes, this one looks nice. But — unless I'm missing something — it is impossible to set the flags per expression. (Only globally in the prefs.) Not good.

Hmm. Unless I'm missing something (very possible, since my regex expertise and experience are still quite low), it looks like you can set per-expression flags with the standard (?[FLAG]) and (?-[END FLAG]) syntax:
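
Inline flags exist in other engines too, with small syntax differences. A Python sketch as an analogy only (Expressions itself isn't involved, and the sample strings are made up): Python accepts (?i) at the very start of a pattern and the scoped form (?i:...) for just a part of one.

```python
import re

# (?i) at the start switches case-insensitivity on for the whole pattern.
whole = re.compile(r"(?i)keyboard maestro")

# (?i:...) limits the flag to the group; " Maestro" stays case-sensitive.
scoped = re.compile(r"(?i:keyboard) Maestro")

print(bool(whole.fullmatch("KEYBOARD MAESTRO")))   # True
print(bool(scoped.fullmatch("KEYBOARD Maestro")))  # True
print(bool(scoped.fullmatch("KEYBOARD MAESTRO")))  # False
```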

(both of these tests were done with case sensitivity turned ON in the global preferences)

Ah, ok, great. (I only tried the /pattern/flag syntax.)

Thanks
