Extracting strings from multiple lines in XML

ALYB · March 23, 2025, 9:22am

I'm having a hard time finding a regex for a macro to delete this part from a TMX (XML) file:

<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<header creationtool="CafeTran" creationtoolversion="2.0" datatype="plaintext" segtype="sentence" o-tmf="CafeTran TMX" adminlang="EN-US" srclang="de-DE" creationdate="20250323T101322Z" creationid="" changedate="20250323T101323Z" changeid=""><prop type="x-segments">true</prop><prop type="x-terms">true</prop><prop type="x-processing_tags">false</prop><prop type="x-read_only">false</prop><prop type="x-stop_autoassembling">false</prop><prop type="x-pretranslate_only">false</prop><prop type="x-terms_consistency_check">false</prop><prop type="x-priority">1</prop><prop type="x-integration">0</prop><prop type="x-prefix_matching">false</prop><prop type="x-case_match">true</prop><prop type="x-duplicates">-1</prop><prop type="x-greedy_exact_match">true</prop><note>size=1691</note>
</header>
<body>

This expression doesn't work:

Link to expression and test text: regex101: build, test, and debug regex

Airy · March 23, 2025, 9:43am

I believe that's because you omitted the "s" option, which allows wildcards to match newline characters. The s option is added differently in KM and in this site's regex. Which of the two do you really want to get working? Then I can explain the solution.
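To illustrate the effect of the s option, here is a minimal sketch in plain JavaScript, where it is written as a trailing `s` flag. The sample text is an abbreviated, hypothetical stand-in for the TMX header, and the pattern is only a guess at the kind of expression in question. In ICU-style engines such as the one Keyboard Maestro uses, the same effect is typically obtained with an inline `(?s)` prefix, which is an assumption worth checking against the KM documentation.

```javascript
// Abbreviated, hypothetical stand-in for the start of the TMX file.
const sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<tmx version="1.4">',
    '<header creationtool="CafeTran">',
    '</header>',
    '<body>',
    '<tu tuid="1"></tu>'
].join("\n");

// Without the s flag, "." stops at every line break,
// so the pattern cannot reach <body> on a later line.
const withoutS = /<\?xml.*<body>/.test(sample);

// With the s (dotAll) flag, "." also matches line breaks.
const withS = /<\?xml.*<body>/s.test(sample);

console.log(withoutS, withS); // false true
```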

ComplexPoint · March 23, 2025, 10:23am

Presumably you also want to delete the trailing:

</body>
</tmx>

tags ?

Regular expressions are not formally capable of encoding recursive data patterns (here - nested XML tags), and it may prove easier to use an XML parser.
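A small sketch of the kind of trouble nesting causes, using hypothetical markup rather than the TMX file: a lazy pattern pairs the outer opening tag with the inner closing tag, while a greedy one swallows everything between unrelated siblings.

```javascript
// Hypothetical nested markup, not taken from the TMX file.
const nested = "<div>outer <div>inner</div> tail</div>";

// Lazy quantifier: stops at the FIRST closing tag,
// which belongs to the inner element, not the outer one.
const lazy = nested.match(/<div>[\s\S]*?<\/div>/)[0];

// Greedy quantifier: runs to the LAST closing tag,
// so two sibling elements are merged into one match.
const siblings = "<div>a</div><p>x</p><div>b</div>";
const greedy = siblings.match(/<div>[\s\S]*<\/div>/)[0];

console.log(lazy);   // <div>outer <div>inner</div>
console.log(greedy); // <div>a</div><p>x</p><div>b</div>
```

Neither quantifier can count how deep the nesting goes, which is the job a parser does.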

There is, for example, an XQuery parser built into macOS, which would return the full sequence of <tu> ... </tu> translation unit tags between the <body> </body> pair in exchange for a query as simple as:

//body/node()

What you do with the list of parsed <tu> nodes depends on what your next stage is,
but you can extract strings from them with built-in properties like .description, .xmlString, .stringValue, etc.

You could also slightly elaborate your XQuery expression to separate out any <tuv> translation variant tags.

Broadly, a first rough sketch (I've pencilled in empty <tu> tags, but try with your own data) might start with something like:

CafeTran XML contents.kmmacros (6.2 KB)

Expand disclosure triangle to view JS source

const main = () =>
    either(
        alert("Translation units extracted from body")
    )(
        xs => xs.map(
            x => ObjC.unwrap(
                x.XMLStringWithOptions(
                    $.NSXMLNodePrettyPrint
                )
                // x.description
                // x.stringValue
            )
        )
            .join("\n")
    )(
        valuesFromXQuery(kmvar.local_XQuery)(
            kmvar.local_CafeTranXML
        )
    );


// ----------------------- JXA -----------------------

// alert :: String -> String -> IO String
const alert = title =>
    s => {
        const sa = Object.assign(
            Application("System Events"), {
            includeStandardAdditions: true
        });

        return (
            sa.activate(),
            sa.displayDialog(s, {
                withTitle: title,
                buttons: ["OK"],
                defaultButton: "OK"
            }),
            s
        );
    };


// --------------------- XQUERY ---------------------

// valuesFromXQuery :: XQuery String -> XML String -> Either String [a]
const valuesFromXQuery = xq =>
    xml => {
        const
            uw = ObjC.unwrap,
            eXML = $(),
            docXML = $.NSXMLDocument.alloc
                .initWithXMLStringOptionsError(
                    xml, 0, eXML
                );

        return bindLR(
            docXML.isNil()
                ? Left(uw(eXML.localizedDescription))
                : Right(docXML)
        )(
            doc => {
                const
                    eXQ = $(),
                    xs = doc.objectsForXQueryError(xq, eXQ);

                return xs.isNil()
                    ? Left(uw(eXQ.localizedDescription))
                    : Right(ObjC.unwrap(xs));
            }
        );
    };

// --------------------- GENERIC ---------------------

// Left :: a -> Either a b
const Left = x => ({
    type: "Either",
    Left: x
});


// Right :: b -> Either a b
const Right = x => ({
    type: "Either",
    Right: x
});


// bindLR (>>=) :: Either a ->
// (a -> Either b) -> Either b
const bindLR = lr =>
    // Bind operator for the Either option type.
    // If lr has a Left value then lr unchanged,
    // otherwise the function mf applied to the
    // Right value in lr.
    mf => "Left" in lr
        ? lr
        : mf(lr.Right);


// either :: (a -> c) -> (b -> c) -> Either a b -> c
const either = fl =>
    // Application of the function fl to the
    // contents of any Left value in e, or
    // the application of fr to its Right value.
    fr => e => "Left" in e
        ? fl(e.Left)
        : fr(e.Right);


// MAIN ()
return main();

kevinb · March 23, 2025, 2:18pm

Yes, as @Airy says, you need to take into account that the text is not all on one line. The construct [\s\S]* can help us out here. It matches zero or more characters of any kind, whitespace or non-whitespace, so newlines are included. This seems to fit the bill:

<\?xml[\s\S]*\<body>
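As a rough check of how this pattern behaves, here is a sketch in plain JavaScript with an abbreviated, hypothetical sample. The second pattern is a variant of my own, not something tested in KM: it uses a lazy quantifier so the match stops at the first `<body>`, and a trailing `\s*` so that the line break following `<body>` is removed along with the header.

```javascript
// Abbreviated, hypothetical stand-in for the start of the TMX file.
const tmx = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<tmx version="1.4">',
    '<header creationtool="CafeTran">',
    '</header>',
    '<body>',
    '<tu tuid="1"></tu>'
].join("\n");

// Everything from "<?xml" through "<body>" is deleted;
// the line break after <body> is not part of the match and stays.
const stripped = tmx.replace(/<\?xml[\s\S]*<body>/, "");

// Variant (an assumption): lazy match, plus \s* to absorb
// any whitespace that follows <body>.
const strippedTight = tmx.replace(/<\?xml[\s\S]*?<body>\s*/, "");

console.log(JSON.stringify(stripped));      // "\n<tu tuid=\"1\"></tu>"
console.log(JSON.stringify(strippedTight)); // "<tu tuid=\"1\"></tu>"
```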

ALYB · March 24, 2025, 8:16am

CafeTran Espresso doesn't add tags inside the segments, it stores tag positions instead:

<tu tuid="12" creationdate="20250323T101322Z" creationid="HL"><prop type="target_tags">0,9</prop><tuv xml:lang="de-DE"><seg>Taster </seg></tuv><tuv xml:lang="nl-NL"><seg>Drukknop </seg></tuv></tu>

This has the advantage that the TMX stays clean and minimal, that you can do S&R in it without being hindered by tags. It has the disadvantage that in fuzzy matches of these segments, the stored tag location often cannot be used.

Output of your script:

<tu tuid="1" creationdate="20250323T101322Z" creationid="">
            <tuv xml:lang="de-DE">
                <seg>Administrator</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>Beheerder</seg>
            </tuv>
        </tu>
<tu tuid="2" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>3</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>3</seg>
            </tuv>
        </tu>
<tu tuid="3" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>2</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>2</seg>
            </tuv>
        </tu>
<tu tuid="4" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
        </tu>
<tu tuid="5" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
        </tu>
<tu tuid="6" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>Typenschild</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>Typeplaatje</seg>
            </tuv>
        </tu>

In the end, all I need is a text file that contains only the text of every German segment, one line per segment:

Administrator
3
2
1
1
Typenschild

Of course I can do this via S&R in the clipboard, but I have a hunch that you know a way to instruct the XQuery parser to do this :). BTW: A function that removes all duplicates (keeping only the first instance) of the extracted German segments would also be greatly appreciated.

ALYB · March 24, 2025, 8:23am

Thank you! That's a very useful regular expression that I will surely use often.

I see that the header is replaced with a new line. Is there a way to avoid that (just remove the header)?

ComplexPoint · March 24, 2025, 9:18am

So, for example, we can narrow down from:

//body/node()

to the <tuv> segments contained in those nodes

//body/node()/tuv

and filtering down, with a condition, to only those <tuv> elements in which the value of the xml:lang attribute is 'de-DE'

//body/node()/tuv[@xml:lang='de-DE']

and more specifically to the <seg> elements within de-DE <tuv> elements:

//body/node()/tuv[@xml:lang='de-DE']/seg

XPath expressions can typically be written in more than one way.
With your data here, we could also write:

//body/tu/tuv[@xml:lang='de-DE']/seg

or for those de-DE <seg> elements, just:

//tuv[@xml:lang='de-DE']/seg
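To make the step-by-step narrowing concrete, here is a plain-JavaScript sketch over a small hand-built object tree. The tree and its field names (`name`, `lang`, `children`, `text`) are hypothetical stand-ins for the parsed document; the macro itself queries a real NSXMLDocument, so this only mirrors the logic of each path step.

```javascript
// Hypothetical, hand-built stand-in for the parsed <body> element.
const body = {
    name: "body",
    children: [
        {
            name: "tu",
            children: [
                { name: "tuv", lang: "de-DE", children: [{ name: "seg", text: "Administrator" }] },
                { name: "tuv", lang: "nl-NL", children: [{ name: "seg", text: "Beheerder" }] }
            ]
        },
        {
            name: "tu",
            children: [
                { name: "tuv", lang: "de-DE", children: [{ name: "seg", text: "Typenschild" }] },
                { name: "tuv", lang: "nl-NL", children: [{ name: "seg", text: "Typeplaatje" }] }
            ]
        }
    ]
};

// childrenNamed :: String -> Node -> [Node]
const childrenNamed = name => node =>
    node.children.filter(c => c.name === name);

// //body/node()                 -> every child of body
const nodes = body.children;

// .../tuv                       -> the tuv children of those nodes
const tuvs = nodes.flatMap(childrenNamed("tuv"));

// ...[@xml:lang='de-DE']        -> keep only the German variants
const german = tuvs.filter(t => t.lang === "de-DE");

// .../seg                       -> the seg children of those variants
const segs = german.flatMap(childrenNamed("seg"));

// Text content only, one segment per line.
const texts = segs.map(s => s.text);

console.log(texts.join("\n"));
```

With this sample the output is Administrator and Typenschild, each on its own line.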

And in the JavaScript return expression, we specify that what interests us is not pretty-printed XML, but just the text content of the filtered elements:

// x => uw(x.XMLStringWithOptions($.NSXMLNodePrettyPrint))

x => uw(x.stringValue)

e.g.

CafeTran XML Strings for particular language.kmmacros (7.2 KB)

Expand disclosure triangle to view JS source

const main = () => {
    const uw = ObjC.unwrap;

    return either(
        alert("Translation units extracted from body")
    )(
        xs => xs.map(
            x => uw(x.stringValue)
        )
            .join("\n")
    )(
        valuesFromXQuery(
            kmvar.local_XQuery
        )(
            kmvar.local_CafeTranXML
        )
    );
};


// ----------------------- JXA -----------------------

// alert :: String -> String -> IO String
const alert = title =>
    s => {
        const sa = Object.assign(
            Application("System Events"), {
            includeStandardAdditions: true
        });

        return (
            sa.activate(),
            sa.displayDialog(s, {
                withTitle: title,
                buttons: ["OK"],
                defaultButton: "OK"
            }),
            s
        );
    };


// --------------------- XQUERY ---------------------

// valuesFromXQuery :: XQuery String -> XML String -> Either String [a]
const valuesFromXQuery = xq =>
    xml => {
        const
            uw = ObjC.unwrap,
            eXML = $(),
            docXML = $.NSXMLDocument.alloc
                .initWithXMLStringOptionsError(
                    xml, 0, eXML
                );

        return bindLR(
            docXML.isNil()
                ? Left(uw(eXML.localizedDescription))
                : Right(docXML)
        )(
            doc => {
                const
                    eXQ = $(),
                    xs = doc.objectsForXQueryError(xq, eXQ);

                return xs.isNil()
                    ? Left(uw(eXQ.localizedDescription))
                    : Right(uw(xs));
            }
        );
    };

// --------------------- GENERIC ---------------------

// Left :: a -> Either a b
const Left = x => ({
    type: "Either",
    Left: x
});


// Right :: b -> Either a b
const Right = x => ({
    type: "Either",
    Right: x
});


// bindLR (>>=) :: Either a ->
// (a -> Either b) -> Either b
const bindLR = lr =>
    // Bind operator for the Either option type.
    // If lr has a Left value then lr unchanged,
    // otherwise the function mf applied to the
    // Right value in lr.
    mf => "Left" in lr
        ? lr
        : mf(lr.Right);


// either :: (a -> c) -> (b -> c) -> Either a b -> c
const either = fl =>
    // Application of the function fl to the
    // contents of any Left value in e, or
    // the application of fr to its Right value.
    fr => e => "Left" in e
        ? fl(e.Left)
        : fr(e.Right);


// MAIN ()
return main();

ComplexPoint · March 24, 2025, 9:36am

To return a list of unique values, without duplicates, we can wrap the output list in a nub function, before joining the lines with "\n" characters:

// nub :: Eq a => [a] -> [a]
const nub = xs =>
    [...new Set(xs)];
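A quick check of nub in plain JavaScript, using the German segments listed earlier in the thread. A Set keeps values in insertion order, so the first instance of each duplicate is the one that survives.

```javascript
// nub :: Eq a => [a] -> [a]
const nub = xs =>
    [...new Set(xs)];

// The German segments from the sample output, with "1" repeated.
const segments = ["Administrator", "3", "2", "1", "1", "Typenschild"];

const unique = nub(segments);

console.log(unique.join("\n"));
// Administrator
// 3
// 2
// 1
// Typenschild
```

Comparison is exact, so "Taster " with a trailing space and "Taster" would count as two different values. Trimming each string before applying nub would merge them, if that is what you want.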

So, updating the example:

CafeTran XML Strings for particular language.kmmacros (7.3 KB)

Expand disclosure triangle to view JS source

const main = () => {
    const uw = ObjC.unwrap;

    return either(
        alert("Translation units extracted from body")
    )(
        xs => nub(
            xs.map(
                x => uw(x.stringValue)
            )
        )
            .join("\n")
    )(
        valuesFromXQuery(
            kmvar.local_XQuery
        )(
            kmvar.local_CafeTranXML
        )
    );
};


// ----------------------- JXA -----------------------

// alert :: String -> String -> IO String
const alert = title =>
    s => {
        const sa = Object.assign(
            Application("System Events"), {
            includeStandardAdditions: true
        });

        return (
            sa.activate(),
            sa.displayDialog(s, {
                withTitle: title,
                buttons: ["OK"],
                defaultButton: "OK"
            }),
            s
        );
    };


// --------------------- XQUERY ---------------------

// valuesFromXQuery :: XQuery String -> XML String -> Either String [a]
const valuesFromXQuery = xq =>
    xml => {
        const
            uw = ObjC.unwrap,
            eXML = $(),
            docXML = $.NSXMLDocument.alloc
                .initWithXMLStringOptionsError(
                    xml, 0, eXML
                );

        return bindLR(
            docXML.isNil()
                ? Left(uw(eXML.localizedDescription))
                : Right(docXML)
        )(
            doc => {
                const
                    eXQ = $(),
                    xs = doc.objectsForXQueryError(xq, eXQ);

                return xs.isNil()
                    ? Left(uw(eXQ.localizedDescription))
                    : Right(uw(xs));
            }
        );
    };

// --------------------- GENERIC ---------------------

// Left :: a -> Either a b
const Left = x => ({
    type: "Either",
    Left: x
});


// Right :: b -> Either a b
const Right = x => ({
    type: "Either",
    Right: x
});


// bindLR (>>=) :: Either a ->
// (a -> Either b) -> Either b
const bindLR = lr =>
    // Bind operator for the Either option type.
    // If lr has a Left value then lr unchanged,
    // otherwise the function mf applied to the
    // Right value in lr.
    mf => "Left" in lr
        ? lr
        : mf(lr.Right);


// either :: (a -> c) -> (b -> c) -> Either a b -> c
const either = fl =>
    // Application of the function fl to the
    // contents of any Left value in e, or
    // the application of fr to its Right value.
    fr => e => "Left" in e
        ? fl(e.Left)
        : fr(e.Right);


// nub :: Eq a => [a] -> [a]
const nub = xs =>
    [...new Set(xs)];


// MAIN ()
return main();

kevinb · March 24, 2025, 4:30pm

The regex works in KM as expected. Test again with some text either side of your original text input, e.g.

greeting.kmmacros (3.3 KB)

ALYB · March 25, 2025, 8:30am

Yes, Kevin, I didn't mean to say that the regex doesn't work. I was wondering why, when no replacement string is defined in the action, a new line is added to the output.

ALYB · March 25, 2025, 8:30am

Many thanks, Rob. This is very useful.

kevinb · March 25, 2025, 12:27pm

Did you try the macro? It proves that that is not the case.