Split Tab Delimited Term Pairs

ALYB · December 5, 2021, 6:19pm

I'm looking for a way to split tab-delimited term pairs where the source terms (on the left-hand side of the tab) are separated by semicolons, so that every source term gets the full set of corresponding target terms (from the right-hand side of the tab).

Example:

bauliche;baulicher;baulichen;baulichem\tstructurele
Bauteiloberflächen;Bauteil-Oberflächen\tcomponentoppervlakken
Befestigungsbolzen;Befestigungs-Bolzen\tbevestigingsbout;bevestigingspen;bevestigingsbouten;bevestigingspennen
Beispielberechnung;Beispiel-Berechnung\tvoorbeeldberekening
Beispielstückliste;Beispiel-Stückliste\tvoorbeeld-stuklijst
benannte;benannter;benannten;benanntem\taangewezen;aangestelde;benoemde

(Where \t represents a tab stop.)

Should become:

bauliche\tstructurele
baulicher\tstructurele
baulichen\tstructurele
baulichem\tstructurele
Bauteiloberflächen\tcomponentoppervlakken
Bauteil-Oberflächen\tcomponentoppervlakken
Befestigungsbolzen\tbevestigingsbout;bevestigingspen;bevestigingsbouten;bevestigingspennen
Befestigungs-Bolzen\tbevestigingsbout;bevestigingspen;bevestigingsbouten;bevestigingspennen
Beispielberechnung\tvoorbeeldberekening
Beispiel-Berechnung\tvoorbeeldberekening
Beispielstückliste\tvoorbeeld-stuklijst
Beispiel-Stückliste\tvoorbeeld-stuklijst
benannte\taangewezen;aangestelde;benoemde
benannter\taangewezen;aangestelde;benoemde
benannten\taangewezen;aangestelde;benoemde
benanntem\taangewezen;aangestelde;benoemde

Sleepy · December 5, 2021, 6:58pm

That seems to be a clear example. How much data are you dealing with? If it's a large quantity, you'd be better off using a command-line utility like awk, which ships with macOS.

drdrang · December 5, 2021, 7:21pm

As @Sleepy says, awk would certainly work. Because I'm more familiar with Perl syntax, this is how I'd do it:

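#!/usr/bin/perl

# Read each line from the files named on the command line (or from standard input).
while (<>){
	chomp;  # remove the trailing newline
	($lhs, $rhs) = split /\t/;  # source terms, target terms
	foreach $s (split /;/, $lhs){  # one output line per source term
		print "$s\t$rhs\n";
	}
}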

How you send the input to this script and what you do with its output is up to you.

gglick · December 5, 2021, 7:24pm

For good measure, an example with KM native actions:

Example Macro.kmmacros (5.0 KB)

Results

bauliche	structurele
baulicher	structurele
baulichen	structurele
baulichem	structurele
Bauteiloberflächen	componentoppervlakken
Bauteil-Oberflächen	componentoppervlakken
Befestigungsbolzen	bevestigingsbout;bevestigingspen;bevestigingsbouten;bevestigingspennen
Befestigungs-Bolzen	bevestigingsbout;bevestigingspen;bevestigingsbouten;bevestigingspennen
Beispielberechnung	voorbeeldberekening
Beispiel-Berechnung	voorbeeldberekening
Beispielstückliste	voorbeeld-stuklijst
Beispiel-Stückliste	voorbeeld-stuklijst
benannte	aangewezen;aangestelde;benoemde
benannter	aangewezen;aangestelde;benoemde
benannten	aangewezen;aangestelde;benoemde
benanntem	aangewezen;aangestelde;benoemde

Sleepy · December 5, 2021, 7:25pm

Amazing work, both of you. I asked the user how much data he has, but he hasn't answered yet. Since KM works at 1.002 kHz, Perl or awk may be the better choice for large amounts of data.

ALYB · December 6, 2021, 4:54am

I have to split about 9000 lines to clean up my glossary. This is a one-time operation.

In the future, I’ll be dealing with smaller files of about 100 lines.

Sleepy · December 6, 2021, 4:56am

That seems like a small quantity, at least for the ongoing work.

ALYB · December 6, 2021, 4:56am

Thank you. Could you please show me how I can send the content of glossary.txt (UTF-8, Unix LF) to the script in the Terminal?

ALYB · December 6, 2021, 4:58am

Thank you. I’ll add this KM solution to my cleaning macro for cleaning up on a regular basis. Much appreciated!

ccstone · December 6, 2021, 5:51am

Hey Hans,

You don't need to use Terminal.app; you can use an Execute a Shell Script action instead.

-Chris

Reformat Delimited Text v1.00.kmmacros (2.8 KB)

drdrang · December 6, 2021, 12:40pm

Put a copy of the glossary.txt file on your Desktop.
Use a plain text editor to create a file called splitterms.pl. Copy the Perl script above into it and save it to your Desktop.

Open Terminal and execute the following two lines:

cd ~/Desktop
perl splitterms.pl glossary.txt > newglossary.txt

The file newglossary.txt will appear on your Desktop in the format you want.

In the future, when you don't have so many lines to split, it might be easier to copy your glossary lines to the Clipboard and run this macro, which will do the conversion and put the results back on the Clipboard.

Split Term Pairs.kmmacros (1.7 KB)

ComplexPoint · December 6, 2021, 1:56pm

and in JavaScript for Automation:

// termPairs :: String -> String
const termPairs = s =>
    lines(s).flatMap(x => {
        const [l, r] = x.split("\t");

        return l.split(";").map(
            k => `${k}\t${r}`
        );
    })
    .join("\n");

So, for example:

Tab-Delimited Term Pair Splits.kmmacros (3.1 KB)

JS source:

(() => {
    "use strict";

    // termPairs :: String -> String
    const termPairs = s =>
        lines(s).flatMap(x => {
            const [l, r] = x.split("\t");

            return l.split(";").map(
                k => `${k}\t${r}`
            );
        })
        .join("\n");


    // ---------------------- TEST -----------------------
    const main = () =>
        termPairs(
            Application("Keyboard Maestro Engine")
            .getvariable("termSample")
        );

    // --------------------- GENERIC ---------------------

    // lines :: String -> [String]
    const lines = s =>
        // A list of strings derived from a single
        // string delimited by newline and/or CR.
        0 < s.length ? (
            s.split(/[\r\n]+/u)
        ) : [];


    return main();
})();
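
To try the same function outside Keyboard Maestro, here is a minimal sketch that assumes Node.js (or any JavaScript engine with Array.prototype.flatMap). Instead of reading the termSample variable, it uses the first two lines of the original example; the name `sample` is only illustrative.

```javascript
// lines :: String -> [String]
// Splits on runs of LF and/or CR; an empty string gives an empty list.
const lines = s =>
    0 < s.length ? s.split(/[\r\n]+/u) : [];

// termPairs :: String -> String
// One output line per source term, each keeping the whole target side.
const termPairs = s =>
    lines(s).flatMap(x => {
        const [l, r] = x.split("\t");

        return l.split(";").map(k => `${k}\t${r}`);
    })
    .join("\n");

// First two lines of the example in the original post, with real tabs.
// (`sample` is an illustrative name, not part of the macro.)
const sample = [
    "bauliche;baulicher;baulichen;baulichem\tstructurele",
    "Bauteiloberflächen;Bauteil-Oberflächen\tcomponentoppervlakken"
].join("\n");

// Prints six tab-delimited lines, one per source term.
console.log(termPairs(sample));
```

One thing to watch: if the input ends with a newline, lines returns an extra empty string, and termPairs then emits a stray line for it. Filtering out empty strings before the flatMap is one way around that.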

ALYB · December 28, 2021, 12:49pm

Wow, the Terminal solution is very fast.

Also thanks for the Clipboard variant!

ALYB · December 30, 2021, 11:58am

To split target-side variants of term pairs, I've amended your script like this:

#!/usr/bin/perl

while (<>){
	chomp;
	($lhs, $rhs) = split /\t/;
	foreach $s (split /;/, $rhs){
		print "$lhs\t$s\n";
	}
}

It works fine.

However, my CAT tool requires the lines with the extracted target terms in reverse order.

Test file:

f\tf;v
F\tF;V
h\th;u;hour;uur
o\to;of;z;zonder

Current order of the target terms:

f\tf
f\tv
F\tF
F\tV
h\th
h\tu
h\thour
h\tuur
o\to
o\tof
o\tz
o\tzonder

Required order of the extracted target terms (= reversed):

f\tv
f\tf
F\tV
F\tF
h\tuur
h\thour
h\tu
h\th
o\tzonder
o\tz
o\tof
o\to

Could you please help me once again with a modification of the script to get the order of the target terms reversed?
Thank you in advance!
test file.txt.zip (586 Bytes)

ComplexPoint · December 30, 2021, 12:38pm

"the target terms reversed"

All of these languages have a list reverse function / operator, which you can apply to the output of the semicolon splitting.

@drdrang will show you where to apply the Perl reverse function.

One approach to the JS equivalent would look like this:

return lines(source).flatMap(x => {
    const [l, r] = x.split(/\t/u);

    return r.split(/;/u)
        .reverse()
        .map(
            k => `${l}\t${k}`
        );
})
.join("\n");

Term splits in reversed order.kmmacros (2.7 KB)
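
For checking the fragment above outside Keyboard Maestro, here is a self-contained sketch that assumes Node.js and runs it against the four-line test file from the question. The names `reversedTargetPairs` and `testFile` are made up for this illustration; `lines` is the same helper used earlier in the thread.

```javascript
// lines :: String -> [String]
const lines = s =>
    0 < s.length ? s.split(/[\r\n]+/u) : [];

// reversedTargetPairs :: String -> String   (hypothetical name)
// One output line per target term; within each input line
// the target terms come out in reverse order.
const reversedTargetPairs = source =>
    lines(source).flatMap(x => {
        const [l, r] = x.split(/\t/u);

        return r.split(/;/u)
            .reverse()
            .map(k => `${l}\t${k}`);
    })
    .join("\n");

// The test file from the question, written with real tabs.
const testFile = [
    "f\tf;v",
    "F\tF;V",
    "h\th;u;hour;uur",
    "o\to;of;z;zonder"
].join("\n");

console.log(reversedTargetPairs(testFile));
```

The reversal is applied to each line's list of target terms, not to the file as a whole, so the source terms keep their original order (f, F, h, o). The sketch assumes every line contains a tab; a line without one would leave `r` undefined and throw.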

A possible Perl variant:

#!/usr/bin/perl

while (<>){
	chomp;
	($lhs, $rhs) = split /\t/;
	foreach $s (reverse split /;/, $rhs){
		print "$lhs\t$s\n";
	}
}

drdrang · December 30, 2021, 2:48pm

We could slip a call to reverse into the foreach line, but I think things are getting complicated enough that it's worth being more explicit. Try this:

#!/usr/bin/perl

while (<>){
	chomp;
	($lhs, $rhs) = split /\t/;
	@targets = split /;/, $rhs;
	@targets = reverse @targets;
	foreach $s (@targets){
		print "$lhs\t$s\n";
	}
}

It does the split and reversal of the right-hand side in two steps. If you ever decide you don't want the reversal, you can comment out that line.
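
The same two-step idea can be sketched in JavaScript, with the reversal controlled by a flag instead of a commented-out line. This is only an illustration that assumes Node.js and assumes every non-empty line contains a tab; the function name `splitTargets` and its `reverseTargets` parameter are invented here.

```javascript
// splitTargets :: Bool -> String -> String   (hypothetical name and signature)
const splitTargets = (reverseTargets, source) =>
    source.split(/[\r\n]+/u)
    // Skip empty strings, e.g. the one left by a trailing newline.
    .filter(line => 0 < line.length)
    .flatMap(line => {
        const [lhs, rhs] = line.split("\t");

        // Step 1: split the right-hand side into target terms.
        const targets = rhs.split(";");

        // Step 2 (optional): reverse a copy of the list.
        const ordered = reverseTargets
            ? [...targets].reverse()
            : targets;

        return ordered.map(t => `${lhs}\t${t}`);
    })
    .join("\n");

// One line from the test file, in both orders.
console.log(splitTargets(true, "h\th;u;hour;uur"));
console.log(splitTargets(false, "h\th;u;hour;uur"));
```

Keeping the split and the reversal as separate named steps makes it easy to switch the reversal off, which is the point of the two-step Perl version.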
