How Do I Use GREP RegEx to Remove Newline Characters (like CR, LF)?

project_guru · August 6, 2018, 6:07pm

For the life of me I can't figure out in KM how to use GREP in the Search and Replace action to remove carriage/line returns. Please see example text below where there's a line return in between each line of text:

–––––––––––––––––––––––––––––––––––
This is text line 1

This is text line 2

This is text line 3
–––––––––––––––––––––––––––––––––––

I found another thread where someone advised to use (?m)^ in the Search field to find the beginning of each line. So far so good. HOWEVER, if I leave the Replace field blank, it spits out exactly the same text. I have confirmed this because I am using the Display System Clipboard action to see the output.

What am I doing wrong?

ccstone · August 6, 2018, 6:11pm

Hey @project_guru,

Try this.

-Chris

RegEx ⇢ String ⇢ Test KM Find & Replace RegEx.kmmacros (2.8 KB)

project_guru · August 6, 2018, 6:22pm

Thanks Chris,

I tried modifying my macro to reflect what you are doing, but I also added ^ to it because without it, it removes ALL line returns. However, it's still spitting out exactly the same thing.

[Screenshot: 2018-08-06_13-19-32]

ccstone · August 6, 2018, 6:25pm

I see I misread your initial post slightly...

Try this pattern:

(?ms)^\n

-Chris

project_guru · August 6, 2018, 6:33pm

That worked.

For those who encounter the same problem as I did, (?ms) simply tells the system to process multiple lines. Otherwise it seems it will only process the very first line. Correct me if I'm wrong.

Thanks so much!

JMichaelTX · August 6, 2018, 6:33pm

I would use this to match any newline character (including CRLF), not just linefeed:

SEARCH FOR:
\R+

REPLACE WITH:
\n

It will remove all blank lines.

project_guru · August 6, 2018, 6:35pm

Thanks JMichaelTX. Yours works too but not if there are more than 1 consecutive line returns.

JMichaelTX · August 6, 2018, 6:39pm

It removes ALL blank lines in my testing.
See https://regex101.com/r/kOtsRs/1/

But you would need to be running macOS 10.11+

ccstone · August 6, 2018, 7:59pm

Hey Guys,

Let's make that just a trifle more robust.

Sometimes horizontal whitespace can sneak into what you think are empty blank lines, and you may also want to prevent the last line from having a linefeed.

JM's pattern has the advantage of simplicity and of taking any line endings and ensuring they end up being linefeeds, but you would need a second pass to remove the last linefeed (if desired).

The regular expression below REQUIRES macOS 10.11 or later.

RegEx ⇢ String ⇢ Test KM Find & Replace RegEx.kmmacros (2.8 KB)

-Chris

peternlewis · August 7, 2018, 1:54am

A couple points that may help your understanding:

The Flag Options are described in the reg ex help (Help ➤ ICU Regular Expression Reference), and the two in question are:

(?m) — Control the behavior of "^" and "$" in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, "^" and "$" will also match at the start and end of each line within the input text.
(?s) — If set, a "." in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return / line-feed pair in text behaves as a single line terminator, and will match a single "." in a RE pattern. Line terminators are \u000a, \u000b, \u000c, \u000d, \u0085, \u2028, \u2029 and the sequence \u000d \u000a.

So the “s” in “(?ms)^\n” is redundant, since you are not using “.” in the pattern. Just “(?m)^\n” will be fine.

Next, ^ and $ are zero-width matches. They match at the start or end of the text, or with (?m) at the start or end of the line, but they do not match any characters per se. Thus replacing them with an empty string accomplishes nothing because they are already an empty string.
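To make the two points above concrete, here is a minimal sketch in Python. Python's `re` module is not the ICU engine that Keyboard Maestro uses, so this is only an analogy, but `(?m)` and the zero-width `^` behave the same way in this simple case. The sample text and variable names are made up for the illustration.

```python
import re

text = "This is text line 1\n\nThis is text line 2\n\nThis is text line 3"

# Without (?m), ^ matches only at the very start of the input.
starts_default = [m.start() for m in re.finditer(r"^", text)]

# With (?m), ^ also matches at the start of every line.
starts_multiline = [m.start() for m in re.finditer(r"(?m)^", text)]

# ^ is zero-width, so replacing it with "" removes nothing.
unchanged = re.sub(r"(?m)^", "", text)

# Adding \n gives the pattern a real character to delete,
# so every empty line disappears.
no_blank_lines = re.sub(r"(?m)^\n", "", text)

print(len(starts_default), len(starts_multiline))
print(unchanged == text)
print(no_blank_lines)
```

This mirrors what happened earlier in the thread: `(?m)^` with an empty replacement returns the text untouched, while `(?m)^\n` removes the blank lines.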

The best option for removing blank lines is:

Replace “(?m)^\R” with “” (10.11+)
Replace “(?m)^\n” with “” (10.10)

@JMichaelTX’s solution is cleaner, but fails to remove blank lines at the start of the variable:

Replace “\R+” with “\n” (10.11+)
Replace “\n+” with “\n” (10.10)

The 10.10 solutions do not handle alternative line endings (\r or \r\n), but that is probably not an issue.
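The difference between the two approaches can be sketched in Python. This is an approximation under stated assumptions: Python's `re` has no `\R`, so `NEWLINE` below is a hand-written stand-in that covers only `\r\n`, `\r` and `\n` (ICU's `\R` covers more characters), and the function names are hypothetical.

```python
import re

# Assumption: approximate stand-in for ICU's \R (CRLF, CR or LF only).
NEWLINE = r"(?:\r\n|\r|\n)"

def remove_blank_lines(text):
    """Approximates: replace (?m)^\\R with an empty string."""
    # Note: Python's ^ under (?m) treats only \n as a line boundary,
    # so lone-\r line endings are not handled by this sketch.
    return re.sub(r"(?m)^" + NEWLINE, "", text)

def collapse_newlines(text):
    """Approximates: replace \\R+ with \\n."""
    return re.sub(NEWLINE + "+", "\n", text)

# Two blank lines at the top, mixed LF and CRLF endings.
sample = "\n\nline 1\n\n\nline 2\r\n\r\nline 3\n"

print(repr(remove_blank_lines(sample)))
print(repr(collapse_newlines(sample)))
```

On this sample, the first function drops the leading blank lines but leaves the surviving line endings as they were, while the second normalizes every line ending to a linefeed but keeps one newline at the very start.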

JMichaelTX · August 7, 2018, 2:33am

Well, unless I fouled something up, it looks to me like all of our solutions leave a blank line at the bottom if there were multiple blank lines at the bottom to start with:

Source data for all tests



–––––––––––––––––––––––––––––––––––
This is text line 1


	
  

This is text line 2
a normal line

This is text line 3
–––––––––––––––––––––––––––––––––––
last line

Chris' Solution

Peter's Solution

JMichaelTX's solution

Here's my test macro:

MACRO: TEST Regex Replace Syntax


#### DOWNLOAD:
<a class="attachment" href="/uploads/default/original/3X/0/6/067f3d3230730ea05ffc38775aa21679aaef6e32.kmmacros">TEST Regex Replace Syntax.kmmacros</a> (3.1 KB)
**Note: This Macro was uploaded in a DISABLED state. You must enable before it can be triggered.**

---

### KM to the rescue!
Get rid of all blank lines at the top and bottom of the text!  👍
[Filter action -- Trim Whitespace](https://wiki.keyboardmaestro.com/action/Filter)

peternlewis · August 7, 2018, 2:39am

There is generally meant to be a line ending character at the end of a list of lines, assuming the lines are text. This may or may not be displayed as a blank line at the end of the text.

So a piece of text with three lines a, b, c would be “a\nb\nc\n”.

As you note, if that is not what you want, then there are further tools to solve the problem.

JMichaelTX · August 7, 2018, 2:41am

Easy fix:
Chris:
(?m)(^\h*\R)|(\R+\Z)

JMichaelTX · August 7, 2018, 2:51am

Interesting point/discussion. For a while I called this stuff "end-of-line" characters. But I think that is wrong, as most references call them "newline" or "line break" characters.
That would imply that the last line would not have one.

I guess the point is you never know for sure what you will get. If you don't want a blank line on the bottom, then you probably need to take action to remove it, if it is there.

ccstone · August 7, 2018, 7:14am

Right.

When you assume, you'll get bitten more often.

I nearly always remove vertical head and tail whitespace when I'm massaging text.

\A\s+|\s+\Z

Here's a tweak to my pattern in post #9 above that will remove the EOL character from the last line if it exists.

(?m)(^\h*\R)|(\R+\Z)

-Chris
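For readers who want to experiment outside Keyboard Maestro, here is a rough Python approximation of the two patterns above. It is a sketch, not the ICU behavior itself: Python's `re` lacks `\R` and `\h`, so `NEWLINE` and `HSPACE` are assumed stand-ins, and Python's `\Z` matches only at the absolute end of the string. The function names and sample text are hypothetical.

```python
import re

NEWLINE = r"(?:\r\n|\r|\n)"  # assumed stand-in for ICU \R
HSPACE = r"[ \t]"            # assumed stand-in for ICU \h (space and tab only)

# Approximation of (?m)(^\h*\R)|(\R+\Z)
BLANK_OR_TRAILING = re.compile(
    r"(?m)^" + HSPACE + "*" + NEWLINE + "|" + NEWLINE + r"+\Z"
)

# \A\s+|\s+\Z works unchanged in Python.
HEAD_TAIL = re.compile(r"\A\s+|\s+\Z")

def strip_blank_lines(text):
    """Remove blank or whitespace-only lines, plus trailing newlines."""
    return BLANK_OR_TRAILING.sub("", text)

def trim_head_tail(text):
    """Remove whitespace at the very start and very end only."""
    return HEAD_TAIL.sub("", text)

# Blank lines at top and bottom, whitespace-only lines in the middle.
sample = "\n\nline 1\n\n\t\n  \n\nline 2\nlast line\n\n\n"

print(repr(strip_blank_lines(sample)))
print(repr(trim_head_tail(sample)))
```

The first function removes every blank or whitespace-only line and the trailing newlines; the second only trims the two ends and leaves the interior blank lines alone.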

project_guru · August 7, 2018, 4:37pm

My brain is developing an aneurysm trying to understand this ICU GREP stuff. For years I've only used the type of GREP that TextWrangler uses (don't know exactly which one it uses).

For example: In TW /r will find a line return but in KM it must be /R.

It seems that I now have to start learning the type that KM uses.

Two questions:

1. Might anyone know of a way to change TextWrangler's GREP system to match KM's so I don't go crazy with their differences? If not, can KM's be changed to match TextWrangler's? If not, is there another text editor that supports ICU GREP?
2. There's a lot that I don't understand in the ICU User Guide. Just one example would be:

{n,m} Match between n and m times. Match as many times as possible, but not more than m.

I don't understand exactly what they mean by "Match between n and m times". Before I can understand this, I'd first need to know what "n" and "m" exactly mean. I did a Google search but found nothing.

Can someone recommend a book/resource for beginners that I can study?

Again, thanks to all for your help. This is an amazing forum!

JMichaelTX · August 7, 2018, 7:37pm

Looks familiar.

JMichaelTX · August 7, 2018, 8:55pm

Both TW (now BBEdit Lite) and KM use RegEx based on PCRE.
So, they are very similar, but with some differences. KM uses the RegEx engine provided by the macOS, which is ICU Regular Expressions.

TW has been deprecated, replaced with BBEdit Lite. From the BBEdit User's Manual, Chapter 8, p168:

BBEdit’s grep engine is based on the PCRE library package, which is open source software, written by Philip Hazel, and copyright 1997-2004 by the University of Cambridge, England. For details, see: http://www.pcre.org/

KM uses ICU Regular Expressions, which conform to Unicode Technical Standard #18. This is, in essence, PCRE: "The regular expression patterns and behavior are based on Perl's regular expressions."

However, as best as I can tell (@peternlewis and @ccstone might know more about this), BBEdit does do some things differently:

All text opened/pasted in BBEdit is converted to linefeed as the newline character, if that is set as your default "Line break" in Preferences.
It appears that they have changed the use of \r \n and \R to all simply match "'hard' line break".

[The above items are under review 2018-08-10.]

From the BBEdit User's Manual, Chapter 8, p173:

Personally, while I use BBEdit RegEx a lot to manipulate text, I rarely use it to develop RegEx patterns. For that, I mostly use Regex101.com

Sorry, but that is incorrect.
First of all, you need to use the backslash rather than the forward slash.

Both BBEdit and KM can use both \r and \R, but they have different meanings, at least in ICU compliant apps like macOS and KM:

In ICU, \r matches only a carriage return (\u000D), while \R matches any newline character or the CR LF sequence.

I'll try to post some RegEx references later.

peternlewis · August 8, 2018, 3:28am

Basically, no. BBEdit and TextWrangler (which is now defunct anyway; use BBEdit Lite) use PCRE, whereas Keyboard Maestro uses the system icucore ICU regular expressions.

They are similar in most ways, but they do have their differences - not that I could find a good description of what the differences are.

n and m are numbers you provide. So a{3,5} will match aaaaa or aaaa or aaa (in that order).
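A quick way to see this is to try it. The sketch below uses Python's `re` module rather than ICU, but `{n,m}` behaves the same way here; the variable names are just for illustration.

```python
import re

greedy = re.compile(r"a{3,5}")   # between 3 and 5 times, as many as possible
lazy = re.compile(r"a{3,5}?")    # between 3 and 5 times, as few as possible

# Seven a's available: the greedy form takes the maximum of five.
longest = greedy.match("aaaaaaa").group()

# Only four a's available: it takes all four.
four = greedy.match("aaaa").group()

# Only two a's: fewer than n, so there is no match at all.
too_short = greedy.match("aa")

# The lazy form stops as soon as the minimum of three is reached.
shortest = lazy.match("aaaaaaa").group()

print(longest, four, too_short, shortest)
```

So n is the minimum number of repetitions and m is the maximum; "as many times as possible" means the engine tries five first, then four, then three.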

Keyboard Maestro uses the system ICU Core Regular Expression engine, which is different to PCRE (Perl Compatible Regular Expressions), although it is very similar. And as you note, BBEdit seems to have adjusted PCRE as well.

It would be nice to have a definitive list of what is different between the two, but I can't seem to find even a set of differences between PCRE and ICU, and then BBEdit is its own world again.

JMichaelTX · August 8, 2018, 3:41am

I've always loved those type of statements.

According to ICU Regular Expressions:

"The regular expression patterns and behavior are based on Perl's regular expressions"

To me, it boils down to this:

ICU RegEx is closer to PCRE than BBEdit RegEx
For anything that is critical, always test in the app where the RegEx will be used.
I have found Regex101.com, which provides a PCRE engine, to provide results very very close to that which is provided by KM.
(but be sure to set the "unicode" flag in Regex101.com)
I repeat: For anything that is critical, always test in the app where the RegEx will be used.