How to remove lines between pairs of text markers?

riwoong · June 11, 2024, 4:47am

I have a very long movie script text file that includes annotations for each scene. The annotations are multi-line and are enclosed by ==== at the beginning and ++++ at the end.

I would appreciate it if you could tell me how to delete all the lines between ==== and ++++.

For example, it looks like this:

A certain place
Content of the scene.
====
Annotation 1
Annotation 2
++++
Another place
Another content of the scene.
====
Annotation 3
Annotation 4
++++

I would like these to be changed as follows:

A certain place
Content of the scene.
Another place
Another content of the scene.

Thank you.

Airy · June 11, 2024, 5:42am

I think this will work. Just put your text into the variable specified below.

I must give you one small word of warning. If your final annotation also happens to be the last line of the file, without a newline after it, then this macro could fail to remove the last annotation. But I doubt that any normal script would have no newline after the last annotation. So I don't think you will ever see that happen.

ALYB · June 11, 2024, 5:48am

What does (?s) mean?

Airy · June 11, 2024, 5:49am

It means that a dot will match a newline. For details:

peternlewis · June 11, 2024, 8:04am

But you are not using . anywhere in the regex.

Note that that regex will fail if + is used anywhere in an annotation.

What you want is probably:

====(?s:.*?)\+\+\+\+\R?

That will match:

Four =
?s: means . matches any character
*? means match any number of the previous item, but the minimum it can match
Then four +
Then an optional line ending
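
As a rough sketch of the same pattern inside a JavaScript script action: JavaScript's RegExp has no \R, and the inline (?s:...) group is not available in every engine, so this adaptation (an assumption, not the regex exactly as posted) substitutes [\s\S]*? and an explicit line-ending alternation. The function name stripAnnotations and the sample text are hypothetical.

```javascript
// Adaptation of ====(?s:.*?)\+\+\+\+\R? for JavaScript (assumed equivalent):
//   [\s\S]*?          lazily matches any characters, including line endings
//   (?:\r\n|\r|\n)?   stands in for the optional \R line ending
const ANNOTATION = /====[\s\S]*?\+\+\+\+(?:\r\n|\r|\n)?/g;

// Hypothetical helper: remove every ==== ... ++++ block from the text.
function stripAnnotations(text) {
    return text.replace(ANNOTATION, "");
}

const sample = [
    "A certain place",
    "Content of the scene.",
    "====",
    "Annotation 1",
    "Annotation with a + sign",
    "++++",
    "Another place"
].join("\n");

console.log(stripAnnotations(sample));
```

Because the line ending is optional, a final annotation with no newline after it should be removed as well, and a single + inside an annotation does not end the match early.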

If ==== is allowed in the text anywhere not at the start of the line, then you'd need to add restrictions for that as well.
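
One possible restriction, sketched here as an assumption rather than as a tested fix, is to require each marker to occupy a whole line. In JavaScript that can be written with the m flag, so that ^ and $ match at line boundaries. The name stripFencedBlocks and the sample text are hypothetical.

```javascript
// Markers count only when they fill an entire line (assumption for this sketch).
// The m flag makes ^ and $ match at the start and end of each line.
const FENCED_BLOCK = /^====$[\s\S]*?^\+\+\+\+$(?:\r\n|\r|\n)?/gm;

function stripFencedBlocks(text) {
    return text.replace(FENCED_BLOCK, "");
}

const text = [
    "He typed ==== into the terminal.",
    "====",
    "Annotation",
    "++++",
    "Next scene"
].join("\n");

console.log(stripFencedBlocks(text));
```

The ==== in the middle of the first line is left alone because it does not start a line.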

Airy · June 11, 2024, 8:50am

But [^x] means "any character but x" and that's functionally equivalent to a dot with one exception. So I assumed that the (?s) would be required.

ComplexPoint · June 11, 2024, 8:57am

An alternative instrument is:

a Keyboard Maestro For Each action applied to each line, one by one, with
the value of a local_InFence variable moving 0 ⇄ 1, that is: false ⇄ true

When (and only when) local_InFence is false, we append a line to the accumulating output.

When we see ====, local_InFence becomes true, and
when we see ++++, local_InFence becomes false
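
The same state machine can be sketched as a plain loop in JavaScript, mirroring what the For Each action does line by line. The function name dropFencedLines and the sample text are illustrative assumptions, not taken from the attached macros.

```javascript
// Hypothetical helper: keep only the lines that are outside ==== ... ++++ fences.
function dropFencedLines(source) {
    const kept = [];
    let inFence = false;

    for (const line of source.split("\n")) {
        // Seeing the opening marker switches the flag on.
        if (line.includes("====")) inFence = true;

        // Append only while we are outside a fence.
        if (!inFence) kept.push(line);

        // The closing marker switches the flag off, after its own line is skipped.
        if (line.includes("++++")) inFence = false;
    }
    return kept.join("\n");
}

const script = "A certain place\n====\nAnnotation 1\n++++\nAnother place";
console.log(dropFencedLines(script));
```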

For example:

Lines between fences filtered out.kmmacros (8.3 KB)

Or as a single script action:

Lines between fences filtered out (by JS .reduce).kmmacros (4.0 KB)

return kmvar.local_Source.split("\n")
.reduce(
    // Updated accumulator: [still inside a fence?, lines kept so far]
    ([inFence, outputLines], lineText) => {
        const
            // "====" opens an annotation and "++++" closes it.
            [fenceOpening, fenceClosing] = [
                "====", "++++"
            ]
            .map(x => lineText.includes(x)),

            // Fence lines and everything between them are dropped.
            dropped = (inFence || fenceOpening);

        return [
            dropped && !fenceClosing,
            dropped
                ? outputLines
                : outputLines.concat(lineText)
        ];
    },

    // Initial state of accumulator
    [false, []]
)[1]
.join("\n");

riwoong · June 11, 2024, 4:18pm

Thank you very much! It really helped me.

riwoong · June 11, 2024, 4:19pm

Thank you! For now, Airy's solution works; if anything goes wrong, I'll try this.
Thanks again!

Airy · June 11, 2024, 6:40pm

You are welcome. Peter made a decent point about a minor error in my method, but considering that my method is one short action, it's pretty easy to understand.

Some days I prefer solutions that use only KM actions, but other days I favour solving problems with Execute Shell Script if the solution is quite simple there.

peternlewis · June 12, 2024, 3:46am

Nope. [^x] means every character except x. The s flag makes no difference.

Without the flag, . means any single character except any line terminating characters (\u000a, \u000b, \u000c, \u000d, \u0085, \u2028, \u2029).

With the flag, . means any single character, including the line terminating characters (\u000a, \u000b, \u000c, \u000d, \u0085, \u2028, \u2029) and also including the pair \u000d \u000a (so it could actually match two characters, which is not something I knew!).

This means you can frequently avoid the (?s) flag by using [^x] in place of . where you know x will never be present.
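
A small JavaScript check illustrates the point. Note that this uses JavaScript's RegExp rather than the regex engine Keyboard Maestro itself uses, and JavaScript treats a narrower set of characters as line terminators (\n, \r, \u2028, \u2029), so take it as an analogy only. The sample text is an assumption.

```javascript
const text = "====\nAnnotation\n++++";

// Without the s flag, . stops at the line break, so the match fails.
const dotWithoutFlag = /====.*\+\+\+\+/.test(text);

// With the s flag, . also matches line terminators.
const dotWithFlag = /====.*\+\+\+\+/s.test(text);

// A negated class matches line breaks with or without the flag.
const classWithoutFlag = /====[^+]*\+\+\+\+/.test(text);
const classWithFlag = /====[^+]*\+\+\+\+/s.test(text);

console.log(dotWithoutFlag);   // false
console.log(dotWithFlag);      // true
console.log(classWithoutFlag); // true
console.log(classWithFlag);    // true
```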

Airy · June 12, 2024, 4:46am

To be perfectly honest, I think I originally started my macro with a regex that contained a dot. After switching to [^x], I wasn't sure whether the (?s) was still needed and didn't bother to remove it, so I left it in. I should have conducted the tests that you did, but I was lazy.

ComplexPoint · June 12, 2024, 8:19am

Only a much simpler regular expression:

[+=]{4}

is needed in the natural habitat of regular expressions (splits rather than the slightly dysfunctional search and replace association to which they were yoked by grep in the 70s).

I don't think that Keyboard Maestro provides a very native (non-script) route to splitting on multi-character strings or regexes (perhaps a For Each collection could be defined in those terms ?), but reaching, for the moment, for a script action, we can:

Split on [+=]{4}, and
take the even-indexed fruits.

return kmvar.local_Source
    .split(/[+=]{4}/u)
    .filter((_, i) => i % 2 === 0)
    .join("");
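
To see why the even-indexed pieces are the ones to keep, here is a small worked example in the same language. The sample text and variable names are assumptions for illustration. It also shows that the line breaks next to each marker stay in the kept pieces, so a variant that lets the split pattern absorb one trailing newline (again an assumption, not part of the attached macro) gives tighter output.

```javascript
const sample = "Scene 1\n====\nNote\n++++\nScene 2\n";

// Pieces alternate: script text, annotation, script text, ...
const pieces = sample.split(/[+=]{4}/u);
console.log(JSON.stringify(pieces));
// ["Scene 1\n","\nNote\n","\nScene 2\n"]

const keepEven = (_, i) => i % 2 === 0;

// As in the post: a blank line remains where the annotation was.
const loose = pieces.filter(keepEven).join("");

// Variant: let the pattern also consume the newline after each marker.
const tight = sample.split(/[+=]{4}\n?/u).filter(keepEven).join("");

console.log(JSON.stringify(loose)); // "Scene 1\n\nScene 2\n"
console.log(JSON.stringify(tight)); // "Scene 1\nScene 2\n"
```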

Lines between fences filtered by Splits.kmmacros (2.3 KB)

Alexander · June 12, 2024, 4:39pm

Here's my take on the task at hand:

Text without annotations and fences.kmmacros (18 KB)
