Compare two lists

ALYB · May 17, 2024, 12:38pm

I'm looking for a way to compare two lists. The first list contains 850 German verbs:

schalten
laden
füllen
führen

The second list contains 65K German words ending in '-n':

abachsen
abaendern
abarbeiten
abarbeitungsprozeduren
abbaubaren
abbauen
abbauposition
abbaustollen
abbauunternehmen
abbeizen
abbestellen
abbewegen
abbewegungen
abbiegen
abbiegeunfällen
abbiegungen
abbilddateien
abbilden
abbildungen
abbildungsnummern
abbindebahn
abbinden
abbindestation
abbindestationen
abbindestern
abbindesternen
abbindestrecken
abbindeverhalten
abblasbohrungen
abblasdüsen
abblaseleitungen
abblasen
abblasestation
abblasfunktion
abblasklappen
abblasleitungen
abblasposition
abblaspositionen
abblasungen
abblasventilen
abblocken
abblockkondensatoren
abblättern
abbohren
abbrandgefahren
abbrechen
abbremsen
abbremsungen
abbrennen
abbrennstumpfschweissungen
abbrennstumpfschweißungen
abbrucharbeiten
abbruchbedingungen
abbruchbeginn
abbruchflaschen
abbruchfunktion
ausführen
ausfüllen
ausladen
ausschalten
einführen
einfüllen
einladen
einschalten

The words that should be extracted from this second list (either to a file or just to the clipboard):

einschalten
einladen
einfüllen
einführen
ausschalten
ausladen
ausfüllen
ausführen

If it's any easier to implement: it would also be okay if the non-matching words were simply removed from the long list.

Thank you in advance!

griffman · May 17, 2024, 1:40pm

This is one of those spots where KM's ability to use the shell makes this rather simple—I only know this because I had this exact same issue years ago, with an early version of my Quick Web Search macro. I fought regex and KM's Search command before someone (pre ChatGPT days, so not sure who :)) pointed out grep as a solution.

Assuming your 65K words are in a file named The Words.txt in the /tmp folder, this should work as a shell script action, saving results to a variable:

grep -E '(schalten|laden|füllen|führen)$' "/tmp/The Words.txt"

At least it does here in testing.

-rob.

ComplexPoint · May 17, 2024, 2:20pm

Should the items in the short list appear only as affixes (suffixes, possibly prefixes) ?

Or for laden for example, would you also want to match infix sequences like:

Ausladend
Geladenen

?

ALYB · May 17, 2024, 2:40pm

Good question. Only with prefixes. I assume that there are no suffixes for these verbs.

ComplexPoint · May 17, 2024, 3:04pm

And do both lists come to you as files,
or might the shorter one start its life on the clipboard or in a Keyboard Maestro variable?

Is it an option to share the lists with us (perhaps zipped) for testing?

ALYB · May 17, 2024, 3:06pm

Both lists can either be clipboard variables or files.

ALYB · May 17, 2024, 3:07pm

To answer your second question: I’ll have a look at the content of the second list.

ComplexPoint · May 17, 2024, 3:58pm

Putting aside the issue of files, and assuming, for the moment, the use of
a regular expression (not the only option, at that scale),
I think you will probably need to automate the creation of any regular expression. (850 alternatives is a lot to assemble by hand, with a bit too much room for accident on the way).

Automating the assembly of a regular expression is easier with a scripting language like Python or JavaScript.
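As a rough illustration of that assembly step, here is a minimal sketch in plain JavaScript. The function names `escapeRegExp` and `buildSuffixPattern` and the parameter `verbList` are hypothetical, chosen for this example only. It assumes the verbs arrive as newline-separated text, drops blank lines, and escapes any regex metacharacters before joining the verbs with `|`.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one pattern matching any whole line that ends with one of the verbs.
// `verbList` is newline-separated text (a hypothetical parameter name).
function buildSuffixPattern(verbList) {
    const alternatives = verbList
        .split(/\r?\n/)
        .map(line => line.trim())
        .filter(line => line.length > 0) // a blank line would match every word
        .map(escapeRegExp);
    return new RegExp(`^.*(?:${alternatives.join("|")})$`, "ugm");
}

// Small demonstration with a few words from the thread.
const pattern = buildSuffixPattern("schalten\nladen\nfüllen\nführen\n");
const words = "abbauen\nausladen\neinschalten\ngeladenen\nabbinden";
const matches = words.match(pattern) ?? [];
console.log(matches.join("\n"));
```

With 850 verbs the resulting pattern is long, but it is built once and applied in a single pass over the word list.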

Here's one JS approach (there may also be an argument for an approach which puts aside regular expressions altogether):

Entries in a list of longer words which end with items in a list of verbs.kmmacros (3.5 KB)


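// Join the verbs into one alternation. Blank lines are dropped so that
// an empty alternative cannot match every word.
const verbs = kmvar.local_ShortForms.split("\n").filter(Boolean).join("|");

// Keep each line of the long list that ends with one of the verbs.
// The trailing $ (with the "m" flag) anchors the verb to the end of the line.
return [
    ...kmvar.local_PrefixedForms.matchAll(
        new RegExp(`^.*(?:${verbs})$`, "ugm")
    )
]
.map(match => match[0])
.join("\n");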
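For the approach that puts regular expressions aside, one possibility is to keep the verbs in a `Set` and look up each word's suffixes. The following is only a sketch under stated assumptions: the function name `wordsEndingWithVerb` is hypothetical, both lists are taken to be newline-separated text, and the bare verb itself is deliberately not reported, since only prefixed forms were asked for.

```javascript
// Return the words that end with any verb, without a regular expression.
// Each proper suffix of a word is looked up in a Set of verbs.
function wordsEndingWithVerb(verbList, wordList) {
    const verbs = new Set(
        verbList.split(/\r?\n/).map(s => s.trim()).filter(Boolean)
    );
    return wordList
        .split(/\r?\n/)
        .map(s => s.trim())
        .filter(word => {
            // Start at 1 so the bare verb is skipped (an assumption);
            // start at 0 to include it.
            for (let i = 1; i < word.length; i++) {
                if (verbs.has(word.slice(i))) return true;
            }
            return false;
        });
}

const found = wordsEndingWithVerb(
    "schalten\nladen\nfüllen\nführen",
    "ausführen\nladen\ngeladenen\neinfüllen\nabbauen"
);
console.log(found.join("\n"));
```

The work per word depends on the word's length rather than on the number of verbs, so the cost stays about the same whether the short list holds 4 verbs or 850.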

Nige_S · May 17, 2024, 9:08pm

Here's a bash version, in a KM wrapper so you can easily choose verb, list, and output files:

List from lists.kmmacros (3.9 KB)


Simple enough in that it just works through the "verb" list line by line, grepping each against the "big" list and writing matches to file.

I've no idea how performance will be for 850/65k line files -- if you do try it, let me know!
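The macro's own script is in the attachment and is not reproduced here. Purely as an illustration of the same verb-by-verb idea, a JavaScript sketch might look like this (the function name `matchesGroupedByVerb` is hypothetical, and both lists are assumed to be arrays of strings already):

```javascript
// For each verb in turn, collect the words that end with it.
// The result keeps the matches grouped under their verb.
function matchesGroupedByVerb(verbs, words) {
    const groups = new Map();
    for (const verb of verbs) {
        const hits = words.filter(word => word.endsWith(verb));
        if (hits.length > 0) groups.set(verb, hits);
    }
    return groups;
}

const groups = matchesGroupedByVerb(
    ["schalten", "laden"],
    ["ausladen", "einschalten", "abbauen", "einladen", "ausschalten"]
);
for (const [verb, hits] of groups) {
    console.log(`${verb}: ${hits.join(", ")}`);
}
```

One pass per verb means roughly 850 × 65,000 (about 55 million) suffix checks for the lists in this thread, which is why a single combined pattern or a set lookup tends to scale better. The upside is that the output comes out grouped by verb.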

ALYB · May 18, 2024, 5:49am

It only takes about 20 seconds.

I cannot find any further info about Standardize Path. Does this always mean the Desktop?

Also thanks to you for your help!

ComplexPoint · May 18, 2024, 7:45am

Standardize Path expands any leading ~ to the path of the user's home folder (this applies to any macOS path, not just to the Desktop).
It also tidies up a few other things in the path.
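To make the tilde expansion concrete, here is a rough analogue in Node.js. It is not Keyboard Maestro's implementation, and the function name `standardizePath` is hypothetical. It only assumes that "tidying up" covers collapsing redundant segments such as `//` and `/./`, which is what Node's `path.normalize` does.

```javascript
const os = require("node:os");
const path = require("node:path");

// Expand a leading "~" to the current user's home folder, then let
// path.normalize collapse "//", "/./" and "/../" segments.
function standardizePath(p) {
    const expanded =
        p === "~" || p.startsWith("~/")
            ? path.join(os.homedir(), p.slice(1))
            : p;
    return path.normalize(expanded);
}

console.log(standardizePath("~/Desktop/matches.txt"));
console.log(standardizePath("~/Documents//lists/./verbs.txt"));
```

The first call prints the Desktop path inside the home folder of whoever runs it; the second shows that the same expansion works for any folder.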
