Tagging Terms Via a Glossary

I'm trying to create a macro that reads a variable SourceSegment and extracts all text between tag pairs of the type <xn/> ... <xn+1/> into variables SourceTerm(y). It then reads a tab-delimited glossary that links each SourceTerm(y) to a TargetTerm(y) and inserts the tags around the corresponding target terms in the variable TargetSegment.

In the example, the source segment contains three source terms, but the macro should cover more terms (say, 10).

In the example, the glossary contains 4 term pairs (lines), but the macro should handle more term pairs (say, 100). The content of the glossary is read from a UTF-8 text file.

How to process multiple source terms?

Possible approach: After the first source term has been identified, either remove the surrounding tags or remove the part from the segment start up to the first closing tag, and repeat the regular expression search. Repeat this process until there are no more tags in the source segment.
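A rough sketch of that strip-and-repeat loop in plain JavaScript (illustrative only; it assumes every source term is neatly wrapped in a tag pair):

// Collect all <xN/>term<xM/> triples by matching the first tagged
// term, discarding everything up to and including it, and
// searching again until no further tags remain.
const sourceTerms = sourceSegment => {
    const tagged = /<x(\d+)\/>(.*?)<x(\d+)\/>/;
    const terms = [];
    let rest = sourceSegment;
    let m = rest.match(tagged);

    while (m) {
        terms.push({ openTag: m[1], term: m[2], closeTag: m[3] });
        rest = rest.slice(m.index + m[0].length);
        m = rest.match(tagged);
    }

    return terms;
};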

How to match them to the corresponding target terms from the glossary?

Possible approach: Use the first source term to search all lines of the glossary. If the complete source term (which can be a multi-word term) is found, assign the part from the tab character to the end of the line to the corresponding target term. Repeat this for all identified source terms in the source segment.
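And a matching sketch for the glossary lookup (again illustrative; it assumes one sourceTerm TAB targetTerm pair per line of the UTF-8 file, already read into a string):

// Find the target term for a given source term by scanning the
// tab-delimited glossary lines for an exact match of the complete
// (possibly multi-word) source term.
const targetTermFor = (glossaryText, sourceTerm) => {
    const hit = glossaryText
        .split(/\r?\n/)
        .map(line => line.split("\t"))
        .find(([src]) => src === sourceTerm);

    return hit ? hit[1] : undefined;
};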

All help is greatly appreciated.


Tagging glossary.kmmacros (4.8 KB)

Glossary.txt.zip (596 Bytes)

Do you need to avoid nested tagging in the output ?

for example, if the glossary contains both

one -> ene

and also

this one word -> dit ene woord

Do we need to protect against generating outputs like:

<x7/>dit <x3/>ene<x4/> woord<x8/>

Or would that not matter too much ?

Tagging should match the source segment and always use the longest match/full match of the source term in the glossary. I think this implies that nesting doesn’t and shouldn’t occur.

A rough sketch

(assumes that the glossary is a tab-delimited text file at a given path)

Target terms inherit tags.kmmacros (33.8 KB)

JS Source
(() => {
    "use strict";

    // Wrapping translation target terms in
    // matching (numbered) tags.

    // Rob Trew @2021
    // Ver 0.01

    // main :: IO ()
    const main = () => {
        const
            kme = Application("Keyboard Maestro Engine"),
            kmValue = k => kme.getvariable(k),
            fpGlossary = kmValue("glossaryPath");

        return either(
            msg => alert("Tagging target terms")(msg)
        )(
            x => x
        )(
            bindLR(
                readFileLR(fpGlossary)
            )(fileText => {
                const
                    glossary = glossaryFromTabDelimited(
                        fileText
                    ),
                    termList = Object.keys(glossary),
                    [sourceText, targetText] = [
                        "SourceSegment",
                        "TargetSegment"
                    ].map(kmValue);

                return 0 < termList.length ? (() => {
                    const
                        parseList = parse(
                            glossaryTerms()
                        )(sourceText);

                    return 0 < parseList.length ? (
                        Right(
                            targetTaggedFromGlossary(
                                glossary
                            )(
                                parseList
                            )(
                                targetText
                            )
                        )
                    ) : Left(
                        "No tagged terms seen in source text."
                    );
                })() : Left(
                    `No terms found in ${fpGlossary}`
                );
            })
        );
    };


    // targetTaggedFromGlossary :: Dict ->
    // [((String, Int, Int), String)] ->
    // String -> String
    const targetTaggedFromGlossary = glossary =>
        // The target text with each glossary translation wrapped
        // in the numbered tags of its source term. Longer
        // translations are substituted first, so that shorter
        // terms don't capture parts of longer ones.
        parseList => targetText => {
            const
                termTags = parseList[0][0],
                translationPairs = sortBy(
                    flip(comparing(
                        tpl => tpl[1].length
                    ))
                )(
                    termTags.map(
                        tagContent => [
                            tagContent,
                            glossary[
                                fst(tagContent)
                            ]
                        ]
                    )
                );

            return taggedTarget(translationPairs)(
                targetText
            );
        };


    // taggedTarget :: [((String, Int, Int), String)] ->
    // String -> String
    const taggedTarget = translationPairs =>
        // Each (tagged source term, translation) pair applied to
        // the target text as a first-occurrence replacement,
        // wrapping the translation in its numbered tag pair.
        targetText => translationPairs.reduce(
            (a, tpl) => {
                const
                    gloss = snd(tpl),
                    tags = fst(tpl),
                    n1 = tags[1],
                    n2 = tags[2];

                return a.replace(
                    gloss,
                    `<x${n1}/>${gloss}<x${n2}/>`
                );
            },
            targetText
        );


    // glossaryFromTabDelimited :: String ->
    // { Source::String, Target::String }
    const glossaryFromTabDelimited = text =>
        // A dictionary of source -> target terms, built from the
        // tab-delimited lines of the glossary file.
        lines(text).reduce((a, s) => {
            const parts = s.split("\t");

            return 1 < parts.length ? (
                Object.assign(a, {
                    [parts[0]]: parts[1]
                })
            ) : a;
        }, {});

    // --------------- TAGGED LINE PARSER ----------------

    // glossaryTerms :: () ->
    // Parser [(String, Int, Int, String, String)]
    const glossaryTerms = () =>
        many(
            bindP(
                tagLess()
            )(
                taggedTerm
            )
        );


    // taggedTerm :: () ->
    // Parser (String, Int, Int, String, String)
    const taggedTerm = () => {
        const tag = numberedEndTag();

        return bindP(
            tag
        )(
            ([k, n]) => bindP(
                tagLess()
            )(
                term => bindP(
                    tag
                )(
                    ([k2, n2]) => pureP([
                        term, n, n2, k, k2
                    ])
                )
            )
        );
    };

    // numberedEndTag :: () -> Parser (String, Int)
    const numberedEndTag = () =>
        // A tag of the form <kN/>, e.g. "<x12/>", parsed to
        // its letter prefix and number: ["x", 12].
        bindP(
            char("<")
        )(() => bindP(
            fmapP(concat)(
                some(
                    satisfy(c => !isDigit(c))
                )
            )
        )(k => bindP(
            fmapP(
                ds => parseInt(
                    ds.join(""), 10
                )
            )(
                some(satisfy(isDigit))
            )
        )(n => bindP(
            string("/>")
        )(
            () => pureP([k, n])
        ))));


    // tagLess :: () -> Parser String
    const tagLess = () =>
        fmapP(concat)(
            some(satisfy(c => "<" !== c))
        );

    // --------- GENERIC PARSERS AND COMBINATORS ---------

    // Parser :: (String -> [(a, String)]) -> Parser a
    const Parser = f =>
        // A function lifted into a Parser object.
        ({
            type: "Parser",
            parser: f
        });


    // altP (<|>) :: Parser a -> Parser a -> Parser a
    const altP = p =>
        // p, or q if p doesn't match.
        q => Parser(s => {
            const xs = parse(p)(s);

            return 0 < xs.length ? (
                xs
            ) : parse(q)(s);
        });


    // apP <*> :: Parser (a -> b) -> Parser a -> Parser b
    const apP = pf =>
        // A new parser obtained by the application
        // of a Parser-wrapped function,
        // to a Parser-wrapped value.
        p => Parser(
            s => parse(pf)(s).flatMap(
                vr => parse(
                    fmapP(vr[0])(p)
                )(vr[1])
            )
        );


    // bindP (>>=) :: Parser a ->
    // (a -> Parser b) -> Parser b
    const bindP = p =>
        // A new parser obtained by the application
        // of a function to a Parser-wrapped value.
        // The function must enrich its output,
        // lifting it into a new Parser.
        // Allows for the nesting of parsers.
        f => Parser(
            s => parse(p)(s).flatMap(
                tpl => parse(f(tpl[0]))(tpl[1])
            )
        );

    // char :: Char -> Parser Char
    const char = x =>
        // A particular single character.
        satisfy(c => x === c);


    // fmapP :: (a -> b) -> Parser a -> Parser b
    const fmapP = f =>
        // A new parser derived by the structure-preserving
        // application of f to the value in p.
        p => Parser(
            s => parse(p)(s).flatMap(
                first(f)
            )
        );


    // isDigit :: Char -> Bool
    const isDigit = c => {
        const n = c.codePointAt(0);

        return 48 <= n && 57 >= n;
    };


    // liftA2P :: (a -> b -> c) ->
    // Parser a -> Parser b -> Parser c
    const liftA2P = op =>
        // The binary function op, lifted
        // to a function over two parsers.
        p => apP(fmapP(op)(p));


    // many :: Parser a -> Parser [a]
    const many = p => {
        // Zero or more instances of p.
        // Lifts a parser for a simple type of value
        // to a parser for a list of such values.
        const someP = q =>
            liftA2P(
                x => xs => [x].concat(xs)
            )(q)(many(q));

        return Parser(
            s => parse(
                0 < s.length ? (
                    altP(someP(p))(pureP([]))
                ) : pureP([])
            )(s)
        );
    };

    // parse :: Parser a -> String -> [(a, String)]
    const parse = p =>
        // The result of parsing a string with p.
        p.parser;


    // pureP :: a -> Parser a
    const pureP = x =>
        // The value x lifted, unchanged,
        // into the Parser monad.
        Parser(s => [Tuple(x)(s)]);


    // satisfy :: (Char -> Bool) -> Parser Char
    const satisfy = test =>
        // Any character for which the
        // given predicate returns true.
        Parser(
            s => 0 < s.length ? (
                test(s[0]) ? [
                    Tuple(s[0])(s.slice(1))
                ] : []
            ) : []
        );


    // sequenceP :: [Parser a] -> Parser [a]
    const sequenceP = ps =>
        // A single parser for a list of values, derived
        // from a list of parsers for single values.
        Parser(
            s => ps.reduce(
                (a, q) => a.flatMap(
                    vr => parse(q)(vr[1]).flatMap(
                        first(xs => vr[0].concat(xs))
                    )
                ),
                [Tuple([])(s)]
            )
        );


    // some :: Parser a -> Parser [a]
    const some = p => {
        // One or more instances of p.
        // Lifts a parser for a simple type of value
        // to a parser for a list of such values.
        const manyP = q =>
            altP(some(q))(pureP([]));

        return Parser(
            s => parse(
                liftA2P(
                    x => xs => [x].concat(xs)
                )(p)(manyP(p))
            )(s)
        );
    };


    // string :: String -> Parser String
    const string = s =>
        // A particular string.
        fmapP(cs => cs.join(""))(
            sequenceP([...s].map(char))
        );

    // ----------------------- JXA -----------------------

    // alert :: String -> String -> IO String
    const alert = title =>
        s => {
            const sa = Object.assign(
                Application("System Events"), {
                    includeStandardAdditions: true
                });

            return (
                sa.activate(),
                sa.displayDialog(s, {
                    withTitle: title,
                    buttons: ["OK"],
                    defaultButton: "OK"
                }),
                s
            );
        };


    // readFileLR :: FilePath -> Either String IO String
    const readFileLR = fp => {
        // Either a message or the contents of any
        // text file at the given filepath.
        const
            e = $(),
            ns = $.NSString
            .stringWithContentsOfFileEncodingError(
                $(fp).stringByStandardizingPath,
                $.NSUTF8StringEncoding,
                e
            );

        return ns.isNil() ? (
            Left(ObjC.unwrap(e.localizedDescription))
        ) : Right(ObjC.unwrap(ns));
    };

    // --------------------- GENERIC ---------------------

    // Left :: a -> Either a b
    const Left = x => ({
        type: "Either",
        Left: x
    });


    // Right :: b -> Either a b
    const Right = x => ({
        type: "Either",
        Right: x
    });


    // Tuple (,) :: a -> b -> (a, b)
    const Tuple = a =>
        b => ({
            type: "Tuple",
            "0": a,
            "1": b,
            length: 2
        });


    // bindLR (>>=) :: Either a ->
    // (a -> Either b) -> Either b
    const bindLR = m =>
        mf => m.Left ? (
            m
        ) : mf(m.Right);


    // comparing :: (a -> b) -> (a -> a -> Ordering)
    const comparing = f =>
        x => y => {
            const
                a = f(x),
                b = f(y);

            return a < b ? -1 : (a > b ? 1 : 0);
        };


    // concat :: [[a]] -> [a]
    // concat :: [String] -> String
    const concat = xs =>
        0 < xs.length ? (
            (
                xs.every(x => "string" === typeof x) ? (
                    ""
                ) : []
            ).concat(...xs)
        ) : xs;


    // either :: (a -> c) -> (b -> c) -> Either a b -> c
    const either = fl =>
        // Application of the function fl to the
        // contents of any Left value in e, or
        // the application of fr to its Right value.
        fr => e => e.Left ? (
            fl(e.Left)
        ) : fr(e.Right);


    // flip :: (a -> b -> c) -> b -> a -> c
    const flip = op =>
        // The binary function op with
        // its arguments reversed.
        1 < op.length ? (
            (a, b) => op(b, a)
        ) : (x => y => op(y)(x));


    // fst :: (a, b) -> a
    const fst = tpl =>
        // First member of a pair.
        tpl[0];


    // first :: (a -> b) -> ((a, c) -> (b, c))
    const first = f =>
        // A simple function lifted to one which applies
        // to a tuple, transforming only its first item.
        xy => {
            const tpl = Tuple(f(xy[0]))(xy[1]);

            return Array.isArray(xy) ? (
                Array.from(tpl)
            ) : tpl;
        };


    // lines :: String -> [String]
    const lines = s =>
        // A list of strings derived from a single
        // string delimited by newline and or CR.
        0 < s.length ? (
            s.split(/[\r\n]+/u)
        ) : [];


    // list :: StringOrArrayLike b => b -> [a]
    const list = xs =>
        // xs itself, if it is an Array,
        // or an Array derived from xs.
        Array.isArray(xs) ? (
            xs
        ) : Array.from(xs || []);


    // snd :: (a, b) -> b
    const snd = tpl =>
        // Second member of a pair.
        tpl[1];


    // sortBy :: (a -> a -> Ordering) -> [a] -> [a]
    const sortBy = f =>
        xs => list(xs).slice()
        .sort((a, b) => f(a)(b));

    // MAIN --
    return main();
})();

Great work. Looks very promising!

One glitch (or rather, a flaw in the task description): when a source word is repeated in the segment, the first occurrence of that word is tagged multiple times, while the other occurrences aren't tagged:

New example segments:

This <x1/>one<x2/> word is bold, these <x3/>two words<x4/> are italics and <x5/>these three words<x6/> are underlined<x7/>®<x8/> and in <x9/>two words<x10/>: another <x11/>one<x12/> bites the <x13/>dust<x14/>.

Dit ene woord is vet, deze deze drie woorden zijn cursief en twee woorden zijn onderstreept® en in twee woorden: nog ene valt in het stof.

Yes – this was the nested tagging problem that I was wondering about : -)

I'll think about whether that can be disentangled – at the moment the process is just a kind of search and replace, which we could make global, to reach the repeated occurrences.

Avoiding multiple (nested) tagging, however, may need quite a different approach; I'll try another iteration this evening.

(only matching where a potential target string doesn't have a tag to its left, perhaps ? Let me know if any clear and simple rules do come to mind : -)

Ah, how foresighted of you :).

I'm not a programmer (far from it), but how about temporarily adding a unique sign to or into a source term once it has been identified, so that the other instances can still be found? In the last step the macro would remove this unique sign again:

This <x1/>o§ne<x2/> word is bold, these <x3/>two words<x4/> are italics and <x5/>these three words<x6/> are underlined<x7/>®<x8/> and in <x9/>two words<x10/>: another <x11/>one<x12/> bites the <x13/>dust<x14/>.
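Purely as an illustration of that marker idea (not part of the macro), the tagging step could inject a sentinel character which a final pass strips out again:

// Mark a term as soon as it has been tagged, so that the next
// search for the same term skips it and finds a later instance.
const SENTINEL = "\u00A7"; // the § sign from the example above

const tagNextOccurrence = (text, term, openTag, closeTag) =>
    text.replace(
        term,
        `${openTag}${term[0]}${SENTINEL}${term.slice(1)}${closeTag}`
    );

// Last step: remove every sentinel again.
const stripSentinels = s => s.split(SENTINEL).join("");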


Yes, something like that, I think. It may be that the tags themselves are enough.

I'll experiment later on.

I just realised that adding a unique character to a one-word term, like the trademark symbol (that has to be set in superscript), will be difficult. I think that this symbol will be treated as one word in the search. Not sure.

Another scheme I might test this evening – two separate search and replace passes:

  • stage one: the longest glossary entry match is replaced by a content-free (numbered) tag pair
  • stage two: the numbered tag pairs are expanded again to enclose their content.

i.e.

  1. ene -> <x1/><x2/>
  2. (starting again after all the other matches have been processed) <x1/><x2/> -> <x1/>ene<x2/>
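A minimal sketch of those two passes (plain JavaScript, assuming the glossary pairs are already sorted longest-first; not the macro itself):

// Pass 1: each glossary match collapsed to an empty numbered tag
// pair (longest terms first, so shorter terms can't match inside
// longer ones).
// Pass 2: each empty pair re-expanded with its term inside.
const twoPassTag = (pairs, targetText) => {
    const emptied = pairs.reduce(
        (txt, [term, openTag, closeTag]) =>
            txt.replace(term, `${openTag}${closeTag}`),
        targetText
    );

    return pairs.reduce(
        (txt, [term, openTag, closeTag]) =>
            txt.replace(`${openTag}${closeTag}`, `${openTag}${term}${closeTag}`),
        emptied
    );
};

// e.g. twoPassTag(
//     [["twee woorden", "<x3/>", "<x4/>"], ["ene", "<x1/>", "<x2/>"]],
//     "Dit ene woord is vet en twee woorden zijn cursief."
// )
// -> "Dit <x1/>ene<x2/> woord is vet en <x3/>twee woorden<x4/> zijn cursief."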

In fact that switch is quick, so let's see where it gets us:

Target terms inherit tags (iteration 2).kmmacros (34.8 KB)

(assuming the addition of stof to the tab-delimited ~/Desktop/glossary.txt)

®	®
dust	stof
one	ene
two words	twee woorden
these three words	deze drie woorden

Looks good. Thank you very much!

I'll start testing the macro in my daily work.

I've made a demo here: https://youtu.be/mmCelWfmC-Q



@ComplexPoint I'm a little embarrassed, but I had forgotten about a 'special case', where an opening tag at the start of the segment is suppressed by my editor. The same goes for a closing tag at the end of the segment.

So this is covered by your JS:

SourceSegment:
Das ist ein <x1/>Anfangsbuchstabe<x2/> und ein <x3/>Kleinbuchstabe<x4/> und ein <x5/>Großbuchstabe<x6/> und ein <x7/>Endbuchstabe<x8/>.

TargetSegment:
Das ist ein <x1/>beginletter<x2/> und ein <x3/>kleine letter<x4/> und ein <x5/>hoofdletter<x6/> und ein <x7/>eindletter<x8/>.

Would it be possible to cover this special case too?

SourceSegment:
Anfangsbuchstabe<x1/> und ein <x2/>Kleinbuchstabe<x3/> und ein <x4/>Großbuchstabe<x5/> und ein <x6/>Endbuchstabe

TargetSegment:
Beginletter<x1/> und ein <x2/>kleine letter<x3/> und ein <x4/>hoofdletter<x5/> und ein <x6/>eindletter

Glossary.txt.zip (637 Bytes)

Tell me more about the number codes ?

In previous examples:

  • the opening tag always had an odd number, and
  • the closing tag always had an even number.

Is that a reliable pattern ?

It seems to break in your later examples ?

Should Anfangsbuchstabe and Beginletter there end with <x2/> ?

or is the fact that they end with 1, and that the <x1/> is followed by a space (rather than by printing characters), a clue ?

There is a problem with giving constructed examples: though they are simple, important information can get lost.

I've given examples where source terms are neatly wrapped with tags. In these cases, the opening tag always had an odd number, and the closing tag always had an even number.

The numbering of the tags starts at the beginning of the segment and runs to the segment end.

Additional (unpaired) tags can occur in the segment, causing the numbering to shift:

Source segment:
Bild <x1/> zeigt einen <x2/>Anfangsbuchstaben<x3/> und einen <x4/>Kleinbuchstaben<x5/> und einen <x6/>Großbuchstaben<x7/> und einen <x8/>Endbuchstaben<x9/>.

Target segment:
Figuur <x1/> zeigt einen <x2/>beginletter<x3/> und einen <x4/>kleine letter<x5/> und einen <x6/>hoofdletter<x7/> und einen <x8/>eindletter<x9/>.

SourceSegment:
Anfangsbuchstabe<x1/> und ein <x2/>Kleinbuchstabe<x3/> und ein <x4/>Großbuchstabe<x5/> und ein <x6/>Endbuchstabe

TargetSegment:
Beginletter<x1/> und ein <x2/>kleine letter<x3/> und ein <x4/>hoofdletter<x5/> und ein <x6/>eindletter

Modified glossary:

®	®
©	©
™	™
Anfangsbuchstabe	beginletter
Kleinbuchstabe	kleine letter
Großbuchstabe	hoofdletter
Endbuchstabe	eindletter
Anfangsbuchstaben	beginletter
Kleinbuchstaben	kleine letter
Großbuchstaben	hoofdletter
Endbuchstaben	eindletter

And is the lack of white space between tags and phrase a reliable indicator of a term ?

Something like:

  • <x2/>Anfangsbuchstaben<x3/> (this is definitely a tagged term)
  • <x5/> und ein <x6/> (this is definitely a gap between two tagged terms)

?
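If that does hold, one illustrative way to express the heuristic as a pattern (assuming tags of the form <xN/>) would be:

// A tagged term: an opening tag immediately followed by a
// non-space, up to a closing tag immediately preceded by a
// non-space. Internal spaces (multi-word terms) are allowed.
const taggedTermPattern = /<x(\d+)\/>(\S(?:.*?\S)?)<x(\d+)\/>/g;

// "<x2/>Anfangsbuchstaben<x3/>" matches;
// "<x5/> und ein <x6/>" does not, because of the spaces just
// inside the tags.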

Since writers are sloppy formatters, there are occasions where a trailing (never a leading) space follows a term before the closing tag.

(One of the causes is how MS Word selects two or more words: it includes the trailing space. For underlining, Word applies a correction so that the trailing space isn't underlined; for bold and italics it does not.)

But you have to go for the majority of the cases, I guess. You cannot cover them all.

How about this algorithm:

If a tag directly precedes a word (term), the next tag will be the closing one, regardless of whether that closing tag is separated from the term by a leading space.

I’m not sure about this.

An example of a complex case:

This is a <x1/>brand name<x2/>TM<x3/><x4/>.

Here 1 and 4 indicate bold, and 2 and 3 superscript.

But these are exceptions that don’t need to be covered.

OK, some of this looks feasible.

The part that looks challenging is
Bild <x1/> zeigt einen

→

Figuur <x1/> zeigt einen

We don't seem to have an anchor there for placing the <x1/> tag.

(Nothing in the glossary that would obviously help)

Can we drop the unpaired stragglers ?

For example producing just:

Figuur zeigt einen <x2/>beginletter<x3/> und einen <x4/>kleine letter<x5/> und einen <x6/>hoofdletter<x7/> und einen <x8/>eindletter<x9/>.

Yes, sure you can drop them.

However, I've just browsed through some example XLIFF files and I've found that there is also this tagging pattern:

Source segment:
Bild 123 zeigt einen <x1/>Anfangsbuchstaben<x1/> und einen <x2/>Kleinbuchstaben<x2/> und einen <x3/>GroĂźbuchstaben<x3/> und einen <x4/>Endbuchstaben<x4/>.

Target segment:
Figuur 123 zeigt einen <x1/>beginletter<x1/> und einen <x2/>kleine letter<x2/> und einen <x3/>hoofdletter<x3/> und einen <x4/>eindletter<x4/>.

Ah, so the tag numbers are sometimes the same on both sides ...

Is that the key divergence here ?

I think that should be survivable : -)