Help with regex (or: find unique value pairs for two named CSV columns)

maweirkbm · June 8, 2022, 7:58pm

I am attempting to get one instance of each of the following pieces of information...
.?4201[A-Z]+.\d+.\d+.(\d+).
.?4531[A-Z]+.\d+.\d+.(\d+).
.?4701[A-Z]+.\d+.\d+.(\d+).
.?4702[A-Z]+.\d+.\d+.(\d+).
.?4703[A-Z]+.\d+.\d+.(\d+).

...from this text;
textbox5,Practice,textbox13,textbox14,textbox15,textbox16,textbox37,Svc_Date,textbox19,textbox20,textbox21,textbox22,textbox38,textbox27,textbox42,textbox43,textbox44,textbox45,textbox46,textbox26,EPS_Patients,WC_Patients,Private_Patients,Misc_Patients,textbox39,textbox32,textbox33,textbox34,textbox35,textbox40
Between Service Dates of 6/7/2022 and 6/7/2022,ZCO01,0,0,8,0,8,2022 06/07 TUE,0,0,8,0,8,4531GNDJUN,0,0,8,0,8,COVID TEST,0,0,2,0,2,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZCO01,0,0,8,0,8,2022 06/07 TUE,0,0,8,0,8,4531GNDJUN,0,0,8,0,8,FOLLOW-UP,0,0,2,0,2,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZCO01,0,0,8,0,8,2022 06/07 TUE,0,0,8,0,8,4531GNDJUN,0,0,8,0,8,ILLNESS,0,0,4,0,4,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4701GDNCTY,0,0,8,0,8,COVID TEST,0,0,1,0,1,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4701GDNCTY,0,0,8,0,8,ILLNESS,0,0,4,0,4,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4701GDNCTY,0,0,8,0,8,INJURY,0,0,1,0,1,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4701GDNCTY,0,0,8,0,8,VACC - COVID,0,0,1,0,1,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4701GDNCTY,0,0,8,0,8,VACC - OTHER,0,0,1,0,1,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4702NAMPA,0,0,10,0,10,COVID TEST,0,0,3,0,3,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4702NAMPA,0,0,10,0,10,FOLLOW-UP,0,0,2,0,2,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4702NAMPA,0,0,10,0,10,ILLNESS,0,0,5,0,5,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4703MRDIN,0,0,11,0,11,COVID TEST,0,0,4,0,4,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4703MRDIN,0,0,11,0,11,ILLNESS,0,0,5,0,5,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4703MRDIN,0,0,11,0,11,OTHER,0,0,1,0,1,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZID01,0,0,29,0,29,2022 06/07 TUE,0,0,29,0,29,4703MRDIN,0,0,11,0,11,VACC - COVID,0,0,1,0,1,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZNM01,0,0,32,0,32,2022 06/07 TUE,0,0,32,0,32,4201FARM,0,0,32,0,32,(No Category),0,0,0,0,0,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZNM01,0,0,32,0,32,2022 06/07 TUE,0,0,32,0,32,4201FARM,0,0,32,0,32,COVID TEST,0,0,14,0,14,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZNM01,0,0,32,0,32,2022 06/07 TUE,0,0,32,0,32,4201FARM,0,0,32,0,32,ILLNESS,0,0,16,0,16,0,0,69,0,69
Between Service Dates of 6/7/2022 and 6/7/2022,ZNM01,0,0,32,0,32,2022 06/07 TUE,0,0,32,0,32,4201FARM,0,0,32,0,32,INJURY,0,0,2,0,2,0,0,69,0,69

I am doing it with 5 different actions like this...

Can it be combined into one?

ComplexPoint · June 8, 2022, 9:18pm

Not sure that this is a task to which regular expressions are well suited.

To get to a solution a bit faster, and make it more solid, I think you probably need:

a scripting language (JavaScript, Python ...), and
a description of the problem itself (rather than a description of where regular expressions become hard to manage)
a sample of the kind of output that you need.

XY problem - Wikipedia

(A thread title summarising the problem itself would also help others looking for similar things later on)

If, for example, you want:

the records grouped by the value of (zero-based) column 13,
and values of columns 13 to 17 extracted for the first sample of each group,

then you could:

use the groupby function in the standard itertools module of Python, or
use a definition of groupBy in some other language. JavaScript is available built-in (see the KM Execute a JavaScript for Automation action)

Range of column values for one sample in each group.kmmacros (6.6 KB)

Expand disclosure triangle to view JS Source

(() => {
    "use strict";

    // main :: IO ()
    const main = () =>
        groupBy(
            on(eq)(x => x[13])
        )(
            lines(
                Application("Keyboard Maestro Engine")
                .getvariable("csvSample")
            )
            // Header line ignored.
            .slice(1)
            .map(x => x.split(","))
        )
        .map(
            gp => gp[0]
            .slice(13, 17)
            .join(" ")
        )
        .join("\n");


    // --------------------- GENERIC ---------------------

    // eq :: a -> a -> Bool
    const eq = a =>
        b => a === b;

    // groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
    const groupBy = eqOp =>
        // A list of lists, each containing only elements
        // equal under the given equality operator,
        // such that the concatenation of these lists is xs.
        xs => Boolean(xs.length) ? (() => {
            const [h, ...t] = xs;
            const [groups, g] = t.reduce(
                ([gs, a], x) => eqOp(x)(a[0]) ? (
                    [gs, [...a, x]]
                ) : [
                    [...gs, a],
                    [x]
                ],
                [
                    [],
                    [h]
                ]
            );

            return [...groups, g];
        })() : [];


    // lines :: String -> [String]
    const lines = s =>
        // A list of strings derived from a single string
        // which is delimited by \n or by \r\n or \r.
        Boolean(s.length) ? (
            s.split(/\r\n|\n|\r/u)
        ) : [];


    // on :: (b -> b -> c) -> (a -> b) -> a -> a -> c
    const on = f =>
        // e.g. groupBy(on(eq)(length))
        g => a => b => f(g(a))(g(b));

    // MAIN ---
    return main();
})();

Nige_S · June 8, 2022, 9:34pm

According to the manual a single regex will only return one match -- to get more you'll have to do a "For Each" or similar (here and an example workaround).

But I think we can use this behaviour in your case, because you only want the first match for each of the "codes". So if you "For Each" the codes rather than the lines of text you could do what you want:

drdrang · June 9, 2022, 12:16am

Overall, I'm in agreement with @ComplexPoint. When you have well-structured data, like a CSV file, it's generally easier to take advantage of that structure than to get into regular expressions. Opening the example data you provided in a spreadsheet,

it appears that you want to match on the leading digits in Column "textbox27" and extract the number in Column "textbox44". In the data you gave us, every row with a given value in "textbox27" has the same value in "textbox44". Will that always be the case? Is that why you're asking for just the first row that matches?

In any event, there are lots of ways to pull information out of a CSV file. You could just use the LOOKUP function in Numbers. Or, if you install the csvkit collection of command-line tools (which you can do through Homebrew), you can pull out the number for each location using something like

csvgrep -c textbox27 -m 4201 maweirkbm.csv | csvcut -c textbox44 | sed -n 2p

where maweirkbm.csv is the name of the CSV file. You could use a loop to do this for each of the numeric codes and store the results in variables named after the locations.

The best way to handle this depends very much on what you want to do with the information, which only you know.

ComplexPoint · June 9, 2022, 7:32am

And if the unstated problem is actually to:

find unique value pairs for two named CSV columns

(a useful thread title ?)

then from that CSV snippet you could obtain:

by writing something analogous to:

Unique value pairs for two CSV columns.kmmacros (6.5 KB)

Expand disclosure triangle to view JS Source

(() => {
    "use strict";

    // main :: IO ()
    const main = () => {
        const
            csvRows = lines(
                Application("Keyboard Maestro Engine")
                .getvariable("csvSample")
            )
            .map(x => x.split(",")),
            [header, rows] = [
                csvRows[0],
                csvRows.slice(1)
            ],
            indices = header.reduce(
                (a, k, i) => Object.assign(a, {
                    [k]: i
                }), {}
            ),
            textbox27 = indices.textbox27,
            textbox44 = indices.textbox44;

        return nubBy(
                ([a, b]) => ([x, y]) => a === x && b === y
            )(
                rows.map(
                    row => [row[textbox27], row[textbox44]]
                )
            )
            .map(ab => `${ab[0]}\t${ab[1]}`)
            .join("\n");
    };

    // --------------------- GENERIC ---------------------

    // lines :: String -> [String]
    const lines = s =>
        // A list of strings derived from a single string
        // which is delimited by \n or by \r\n or \r.
        Boolean(s.length) ? (
            s.split(/\r\n|\n|\r/u)
        ) : [];


    // nubBy :: (a -> a -> Bool) -> [a] -> [a]
    const nubBy = p =>
        // A sublist of xs from which all duplicates,
        //  (as defined by the equality predicate p)
        //   are excluded.
        xs => xs.reduce(
            (acc, x) => acc.some(p(x)) ? (
                acc
            ) : [x].concat(acc),
            []
        );

    // MAIN ---
    return main();
})();

Nige_S · June 9, 2022, 10:37am

That's a lovely generalised example.

But if OP's data is representative then, on each run, each "code" has the same value of interest -- the pairing is just repeated over multiple lines. In this particular case, would regexing over the collection be quicker, short-circuiting on first match, rather than processing every line in the array?

Genuinely interested -- I don't know how regex works "under the hood" (I assume finding a match stops further processing, but you know what they say about assuming!), relative speeds of KM-native processing vs shelling out, etc, etc.

ComplexPoint · June 9, 2022, 10:48am

I think all of these approaches inevitably do that:

If we group the data by column textbox27, taking only the first row in each group, then the grouping entails processing each line
If we assume less and look for unique pairs of values, we have to read each line
if we repeatedly apply different custom regexes, we have to do extra work (with something less limited than regular expressions) to avoid an exponential component, in which the earlier lines are repeatedly re-read.

Regular expressions were never conceived for the solution of problems of this generality. They are underpowered, and inevitably entail the construction of Rube Goldberg, or Heath Robinson, or simply Frankenstein assemblies of incoherent purpose-built parts (Regex + something else) if anything more ambitious is attempted.

Both labour-intensive and fragile – not an obviously appealing combination.

Nige_S · June 9, 2022, 11:04am

Ah -- so "process each line just once looking for all terms" vs "process from the start until a match every time for each term"? Gotcha!

Having never been a regular regexer yet often seeing them touted on the interwebz as a text-processing cure-all, I'm slowly getting to grips with when their use is -- and isn't -- appropriate. This has been really useful -- thanks a lot!

And apologies to OP for the hijack...

maweirkbm · June 10, 2022, 6:52pm

Thank you to each of you, very helpful!

Help with regex (or: find unique value pairs for two named CSV columns)

Options