How to add a negative lookahead to my regular expression?

ALYB · July 8, 2024, 6:15am

Yesterday I wasted several hours trying to add a negative lookahead to my regular expression:

\b(a|all|an|other|many|different|several|some|the|this|these)(\s[a-z]+\b){1,3}

When I apply this expression to these lines:

the product manual in
a machine manual for
these product samples of
this material with

I would like to keep these filtered lines:

the product manual
a machine manual
these product samples
this material

How can I achieve this? Is it possible at all, using a regular expression? (The text to clean has about 1,000 lines.)
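
One way a negative lookahead could be fitted into the expression above, sketched in JavaScript. The noise words `in`, `for`, `of`, `with` are an assumption read off the sample lines, and the names `noiseWords`, `determiners`, `phrase` and `keepPhrase` are made up for this illustration. The lookahead sits in front of each following word, so a noise word can never be taken as part of the phrase.

```javascript
// Assumption: the noise words are the trailing words of the sample lines.
const noiseWords = ["in", "for", "of", "with"];

// The leading words from the original expression.
const determiners = [
    "a", "all", "an", "other", "many", "different",
    "several", "some", "the", "this", "these"
];

// (?!(?:in|for|of|with)\b) refuses a following word that is a noise word,
// so the {1,3} repetition stops just before it.
const phrase = new RegExp(
    `\\b(?:${determiners.join("|")})` +
    `(?:\\s(?!(?:${noiseWords.join("|")})\\b)[a-z]+\\b){1,3}`
);

// Returns the matched phrase, or the line unchanged when nothing matches.
const keepPhrase = line => {
    const m = line.match(phrase);

    return m ? m[0] : line;
};

const sample = [
    "the product manual in",
    "a machine manual for",
    "these product samples of",
    "this material with"
];

console.log(sample.map(keepPhrase).join("\n"));
```

With the four sample lines this prints the four filtered lines listed above. Note that a noise word in the middle of a line also ends the match there, which may or may not be wanted for the real 1,000-line text.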

Airy · July 8, 2024, 7:39am

I'm not good with negative lookahead. What's wrong with this solution? I solved it myself, then I asked ChatGPT to solve it, and it came up with the same solution.

ALYB · July 8, 2024, 8:10am

Thank you very much!

Your suggestion works fine. (I just had to add a space before every noise word, to set the word boundary.)

Do you happen to know how long this command line can be? If it is limited, it would be better if it got its input from a file of noise words :).

Happy with your solution!!!
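
The two points in this post (whitespace before each noise word, and a noise list kept outside the command) can be sketched in JavaScript as follows. This is only an illustration under assumptions: `noiseFileText` stands in for the contents of a hypothetical noise-word file with one word per line, and `escapeRegExp`, `trailingNoise` and `stripTrailingNoise` are names invented here. It is not the command discussed above.

```javascript
// Hypothetical contents of a noise-word file, one word per line.
const noiseFileText = "in\nfor\nof\nwith";

// Escape characters that have a special meaning in a regular expression.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

const noiseList = noiseFileText
    .split(/\r\n|\n|\r/)
    .map(w => w.trim())
    .filter(Boolean)
    .map(escapeRegExp);

// The \s+ in front of the alternation does the job of the leading space:
// only a whole word at the end of the line can match.
const trailingNoise = new RegExp(`\\s+(?:${noiseList.join("|")})\\s*$`);

// Removes one trailing noise word (and the whitespace before it), if present.
const stripTrailingNoise = line => line.replace(trailingNoise, "");

console.log(stripTrailingNoise("this material with"));
```

Because the pattern requires whitespace directly before the noise word, a line ending in `login` or `within` is left untouched.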

ComplexPoint · July 8, 2024, 8:27am

A variant approach (possibly more readily scalable):

EOL Noise Words Dropped.kmmacros (3.9 KB)

JS source:

return (() => {
    "use strict";

    const main = () => {
        const noise = lines(kmvar.local_Noise);

        return lines(kmvar.local_Source)
        .map(s => {
            const ws = words(s);

            // Drop the final word only when it is a noise word.
            return unwords(
                noise.includes(last(ws))
                    ? init(ws)
                    : ws
            );
        })
        .join("\n");
    };

    // --------------------- GENERIC ---------------------

    // init :: [a] -> [a]
    const init = xs =>
        // All elements of a list except the last.
        0 < xs.length
            ? xs.slice(0, -1)
            : [];

    // last :: [a] -> a
    const last = xs => {
        // The last item of a list.
        const n = xs.length;

        return 0 < n
            ? xs[n - 1]
            : null;
    };

    // lines :: String -> [String]
    const lines = s =>
        // A list of strings derived from a single string
        // which is delimited by \n or by \r\n or \r.
        0 < s.length
            ? s.split(/\r\n|\n|\r/u)
            : [];

    // unwords :: [String] -> String
    const unwords = xs =>
        // A space-separated string derived
        // from a list of words.
        xs.join(" ");

    // words :: String -> [String]
    const words = s =>
        // List of space-delimited sub-strings.
        // Leading and trailing space ignored.
        s.split(/\s+/u).filter(Boolean);


    // MAIN ---
    return main();
})();
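
To try the same idea outside Keyboard Maestro, here is a condensed sketch in which two plain string parameters take the place of `kmvar.local_Source` and `kmvar.local_Noise`. The function name `dropLastNoiseWord` and the sample noise list are assumptions for illustration, not part of the macro.

```javascript
// source: the text to clean; noise: one noise word per line.
// Both are hypothetical stand-ins for the Keyboard Maestro variables.
const dropLastNoiseWord = (source, noise) => {
    const noiseSet = new Set(
        noise.split(/\r\n|\n|\r/).filter(Boolean)
    );

    return source
        .split(/\r\n|\n|\r/)
        .map(line => {
            const ws = line.split(/\s+/).filter(Boolean);

            // Drop the final word only when it is in the noise set.
            return (
                noiseSet.has(ws[ws.length - 1])
                    ? ws.slice(0, -1)
                    : ws
            ).join(" ");
        })
        .join("\n");
};

console.log(
    dropLastNoiseWord(
        "the product manual in\nthis material with",
        "in\nfor\nof\nwith"
    )
);
```

Since the noise words are ordinary data here rather than part of a pattern, the list can grow without any regular expression getting longer.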

Airy · July 8, 2024, 9:25am

Good point. I made a mistake. You fixed it.

I think the limit is very large (200k?), so you don't need to worry about it.

Even if you put it in a file, the lines of the file would be subject to the same limitation, so I don't see what you would be gaining. You can't break up the command I gave you into separate commands. Well, you could put them in a loop, but that would be MUCH slower, and I don't think you are anywhere near the 200k command line limit.
