SUBROUTINE :: Filtering XML with XQuery

ComplexPoint · March 25, 2025, 9:37pm

Keyboard Maestro users who want to extract elements and strings from XML may struggle to find general regular expression solutions. XML consists of nested tags, and regular expressions can't model arbitrarily nested (recursive) patterns.

XQuery (a W3C query language built into macOS) can make things quite a lot easier.

Here is a subroutine (with an example macro) which applies any XQuery 1.0 expression to a Keyboard Maestro variable containing an XML document, and returns the results.

The XML sample is kindly provided by @ALYB.
The version of XQuery built into macOS is 1.0
To run the sample test macro, you also need to have the subroutine somewhere in an activated group
Some basic examples of XQuery expressions are explained below

UPDATED to version 0.5 (enabling display of date values, and allowing for missing xml or xquery, and allowing processing of XInclude composite XML documents)

XQuery over XML SUBROUTINE.kmmacros.zip (2.6 KB)

XQuery subroutine TEST.kmmacros (7.9 KB)

EXAMPLES

Given the CafeTran XML example provided by @ALYB:

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
    <header
        creationtool="CafeTran Espresso"
        creationtoolversion="10.8"
        segtype="sentence"
        o-tmf="tmx"
        adminlang="en-US"
        srclang="en-US"
        datatype="plaintext"
    />
    <body>
        <tu tuid="1" creationdate="20250323T101322Z" creationid="">
            <tuv xml:lang="de-DE">
                <seg>Administrator</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>Beheerder</seg>
            </tuv>
        </tu>
        <tu tuid="2" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>3</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>3</seg>
            </tuv>
        </tu>
        <tu tuid="3" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>2</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>2</seg>
            </tuv>
        </tu>
        <tu tuid="4" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
        </tu>
        <tu tuid="5" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
        </tu>
        <tu tuid="6" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>Typenschild</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>Typeplaatje</seg>
            </tuv>
        </tu>
    </body>
</tmx>

Plain XPaths

The first thing to be aware of is that the simplest XQuery searches are just plain XPath expressions:

using / (as in file paths) to separate levels, or
// to mean 'at any level'

The XQuery:

/*/*/*/*

will find all XML tags, with any name, that are nested 4 levels deep.

Applied to our example, it returns all the <tuv> elements in the document:

<tuv xml:lang="de-DE">
                <seg>Administrator</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>Beheerder</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>3</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>3</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>2</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>2</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>Typenschild</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>Typeplaatje</seg>
            </tuv>
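
For readers who want to cross-check a level-based path outside Keyboard Maestro, here is a minimal Python sketch using only the standard library. It is not part of the subroutine; the `TMX` string is an abbreviated copy of the sample (three `<tu>` units instead of six), and the variable names are just illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (an assumption made for brevity).
TMX = """<tmx version="1.4">
    <header creationtool="CafeTran Espresso" srclang="en-US"/>
    <body>
        <tu tuid="1">
            <tuv xml:lang="de-DE"><seg>Administrator</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>Beheerder</seg></tuv>
        </tu>
        <tu tuid="2">
            <tuv xml:lang="de-DE"><seg>3</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>3</seg></tuv>
        </tu>
        <tu tuid="6">
            <tuv xml:lang="de-DE"><seg>Typenschild</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>Typeplaatje</seg></tuv>
        </tu>
    </body>
</tmx>"""

root = ET.fromstring(TMX)            # root is the level-1 <tmx> element
level4 = root.findall("./*/*/*")     # three further steps: level 2, 3, then 4
level4_tags = [element.tag for element in level4]
print(level4_tags)
```

Because ElementTree paths start at the root element rather than above it, the XQuery `/*/*/*/*` corresponds to three `*` steps here, not four.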

Searching by tag name, rather than level, we can get all the <seg> elements, at any level, with the XQuery:

//seg

obtaining:

<seg>Administrator</seg>
<seg>Beheerder</seg>
<seg>3</seg>
<seg>3</seg>
<seg>2</seg>
<seg>2</seg>
<seg>1</seg>
<seg>1</seg>
<seg>1</seg>
<seg>1</seg>
<seg>Typenschild</seg>
<seg>Typeplaatje</seg>

and if we only want the text they contain (without the tags) we can write:

//seg/text()

getting just:

Administrator
Beheerder
3
3
2
2
1
1
1
1
Typenschild
Typeplaatje

To exclude the lines that start with digits, we can add a regular expression condition between square brackets:

//seg/text()[ not( matches(., '^\d') ) ]

(where the dot, as in file paths, refers to the current item, here each text node in turn)

Now we just get:

Administrator
Beheerder
Typenschild
Typeplaatje
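
The same two steps (collect the text of every `<seg>`, then drop the strings that start with a digit) can be sketched in Python for comparison. This is only an illustration under the assumption that the abbreviated `TMX` string below stands in for the full sample; it is not how the subroutine works internally.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (an assumption made for brevity).
TMX = """<tmx version="1.4">
    <body>
        <tu tuid="1">
            <tuv xml:lang="de-DE"><seg>Administrator</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>Beheerder</seg></tuv>
        </tu>
        <tu tuid="2">
            <tuv xml:lang="de-DE"><seg>3</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>3</seg></tuv>
        </tu>
        <tu tuid="6">
            <tuv xml:lang="de-DE"><seg>Typenschild</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>Typeplaatje</seg></tuv>
        </tu>
    </body>
</tmx>"""

root = ET.fromstring(TMX)

# //seg/text() : the text of every <seg>, at any depth (empty ones skipped)
seg_texts = [seg.text for seg in root.iter("seg") if seg.text]

# [ not( matches(., '^\d') ) ] : keep only strings that do not start with a digit
non_numeric = [text for text in seg_texts if not re.search(r"^\d", text)]

print("\n".join(non_numeric))
```

`re.search` is used rather than `re.fullmatch` because XQuery's `matches()` also succeeds on a partial match unless the pattern is anchored.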

XPaths inside "FLWOR" expressions

An XQuery FLWOR expression is a sequence of clauses following the pattern

FOR (XPath)
LET (optionally attaching a name to one or more values)
WHERE (optionally filtering down by specifying a condition)
ORDER BY (optionally sorting)
RETURN (some value)

With our sample file, and prefacing any XML attribute names inside tags with @, we can write something like:

for $tuv in //tuv
let $lang := $tuv/@xml:lang
let $text := $tuv/seg/text()

return string-join( ($lang, $text), "\t")

to get:

de-DE	Administrator
nl-NL	Beheerder
de-DE	3
nl-NL	3
de-DE	2
nl-NL	2
de-DE	1
nl-NL	1
de-DE	1
nl-NL	1
de-DE	Typenschild
nl-NL	Typeplaatje

and we can add a WHERE clause to filter it down a bit, excluding the digit lines:

for $tuv in //tuv
let $lang := $tuv/@xml:lang
let $text := $tuv/seg/text()
where not (matches($text, '^\d'))
return string-join( ($lang, $text), "\t")

getting:

de-DE	Administrator
nl-NL	Beheerder
de-DE	Typenschild
nl-NL	Typeplaatje

or perhaps adding an ORDER BY clause to separate out the languages:

for $tuv in //tuv
let $lang := $tuv/@xml:lang
let $text := $tuv/seg/text()
where not (matches($text, '^\d'))
order by $lang
return string-join( ($lang, $text), "\t")

resulting in:

de-DE	Administrator
de-DE	Typenschild
nl-NL	Beheerder
nl-NL	Typeplaatje
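
If it helps to see the FLWOR clauses in more familiar terms, the sketch below maps each clause of the last query onto a line of Python. It assumes an abbreviated copy of the sample (`TMX`) and uses hypothetical variable names. Note that ElementTree exposes `xml:lang` under its full namespace name.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (an assumption made for brevity).
TMX = """<tmx version="1.4">
    <body>
        <tu tuid="1">
            <tuv xml:lang="de-DE"><seg>Administrator</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>Beheerder</seg></tuv>
        </tu>
        <tu tuid="2">
            <tuv xml:lang="de-DE"><seg>3</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>3</seg></tuv>
        </tu>
        <tu tuid="6">
            <tuv xml:lang="de-DE"><seg>Typenschild</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>Typeplaatje</seg></tuv>
        </tu>
    </body>
</tmx>"""

# The predefined xml: prefix expands to this namespace in ElementTree.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(TMX)
rows = []
for tuv in root.iter("tuv"):                  # for $tuv in //tuv
    lang = tuv.get(XML_LANG)                  # let $lang := $tuv/@xml:lang
    text = tuv.findtext("seg", default="")    # let $text := $tuv/seg/text()
    if not re.search(r"^\d", text):           # where not (matches($text, '^\d'))
        rows.append((lang, text))
rows.sort(key=lambda row: row[0])             # order by $lang
lines = ["\t".join(row) for row in rows]      # return string-join(($lang, $text), tab)

print("\n".join(lines))
```

Python's sort is stable, so rows with the same language keep their document order, which matches the result shown above.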

Or if we prefer, we can put corresponding NL and DE terms next to each other, line by line:

for $tu in //tu
let $de := $tu/tuv[@xml:lang="de-DE"]/seg/text()
where not (matches($de, '^\d'))
return string-join(($tu/tuv[@xml:lang="nl-NL"]/seg/text(), $de), '\t')

→


Beheerder	Administrator
Typeplaatje	Typenschild
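
The pairing query iterates over `<tu>` units rather than `<tuv>` elements, so that both languages of one unit are in reach at the same time. A rough Python counterpart, again assuming an abbreviated copy of the sample and hypothetical names, might look like this:

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (an assumption made for brevity).
TMX = """<tmx version="1.4">
    <body>
        <tu tuid="1">
            <tuv xml:lang="de-DE"><seg>Administrator</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>Beheerder</seg></tuv>
        </tu>
        <tu tuid="2">
            <tuv xml:lang="de-DE"><seg>3</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>3</seg></tuv>
        </tu>
        <tu tuid="6">
            <tuv xml:lang="de-DE"><seg>Typenschild</seg></tuv>
            <tuv xml:lang="nl-NL"><seg>Typeplaatje</seg></tuv>
        </tu>
    </body>
</tmx>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(TMX)
pairs = []
for tu in root.iter("tu"):                    # for $tu in //tu
    # Index this unit's segments by language code.
    by_lang = {
        tuv.get(XML_LANG): tuv.findtext("seg", default="")
        for tuv in tu.findall("tuv")
    }
    de = by_lang.get("de-DE", "")
    nl = by_lang.get("nl-NL", "")
    if not re.search(r"^\d", de):             # where not (matches($de, '^\d'))
        pairs.append(f"{nl}\t{de}")           # Dutch first, then German

print("\n".join(pairs))
```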

ComplexPoint · March 25, 2025, 9:55pm

More examples – XML Attributes

Let's look at the header element of our sample file:

//header

→

<header creationtool="CafeTran Espresso" creationtoolversion="10.8" segtype="sentence" o-tmf="tmx" adminlang="en-US" srclang="en-US" datatype="plaintext"></header>

It contains a number of key="value" XML attributes.

Let's list all the attributes in the header:

//header/@*

→

datatype="plaintext"
srclang="en-US"
adminlang="en-US"
o-tmf="tmx"
segtype="sentence"
creationtoolversion="10.8"
creationtool="CafeTran Espresso"

but perhaps we only want to see their value strings:

for $attribute in //header/@*
return string($attribute)

→

plaintext
en-US
en-US
tmx
sentence
10.8
CafeTran Espresso

or, for that matter, perhaps we only need to see their names:

for $attribute in //header/@*
return name($attribute)

→

datatype
srclang
adminlang
o-tmf
segtype
creationtoolversion
creationtool

and when we are interested in just one particular attribute, specifying its name:

//header/@creationtool

→

creationtool="CafeTran Espresso"

or

//header/string(@creationtool)

→

CafeTran Espresso
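
For comparison, the attribute queries above have close counterparts in Python's standard library. This is a sketch only, with the header copied from the sample and hypothetical variable names. ElementTree lists attributes in document order, whereas the macOS results above happen to come back in the opposite order; attribute order carries no meaning in XML, so neither should be relied on.

```python
import xml.etree.ElementTree as ET

# Just the header of the sample document, inside a minimal <tmx> wrapper.
TMX_HEADER = """<tmx version="1.4">
    <header
        creationtool="CafeTran Espresso"
        creationtoolversion="10.8"
        segtype="sentence"
        o-tmf="tmx"
        adminlang="en-US"
        srclang="en-US"
        datatype="plaintext"
    />
    <body/>
</tmx>"""

root = ET.fromstring(TMX_HEADER)
header = root.find("header")                 # //header (a direct child of <tmx>)

names = list(header.attrib)                  # like: return name($attribute)
values = list(header.attrib.values())        # like: return string($attribute)
creationtool = header.get("creationtool")    # like: //header/string(@creationtool)

print(names)
print(values)
print(creationtool)
```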

ComplexPoint · March 26, 2025, 10:08am

XQuery Function library

XQuery 1.0 defines a number of useful functions over:

Strings
Numbers
Dates
Boolean true/false values
Sequences of any of the above

The XQuery 1.0 implementation built into macOS provides most but not all of the functions listed at:

XQuery 1.0 and XPath 2.0 Functions and Operators (Second Edition)

The main gaps that I have noticed are:

Date functions are implemented, but not Duration functions
Some useful Sequence functions are provided (esp. distinct-values(), which prunes out duplicates), but reverse() has not been implemented. (A FLWOR order by clause with descending can stand in for it when the values are already in ascending order.)

for $x in (1,2,3,4,5)
order by $x descending
return $x

→

5
4
3
2
1
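
One caveat worth illustrating: sorting in descending order and reversing are only the same thing when the input is already ascending. The Python sketch below (hypothetical names, not part of the subroutine) shows the difference, and also the idea of ordering by position rather than by value, which reverses any sequence.

```python
seq = [3, 1, 2]

# What `order by $x descending` does: sort by value.
sorted_desc = sorted(seq, reverse=True)

# What reverse() would do: flip the positions, whatever the values are.
reversed_seq = list(reversed(seq))

# Ordering by position instead of by value gives a true reversal.
# A possible XQuery analogue (not tested on macOS) would be:
#   for $i in 1 to count($seq) order by $i descending return $seq[$i]
positions = sorted(range(1, len(seq) + 1), reverse=True)
by_position = [seq[i - 1] for i in positions]

print(sorted_desc, reversed_seq, by_position)
```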

XQuery Types

A typical XQuery returns a Sequence, which this Keyboard Maestro subroutine renders as a series of lines.

distinct-values(//seg/text())

→

Administrator
Beheerder
3
2
1
Typenschild
Typeplaatje

You can construct your own sequence directly, between parentheses, delimited by commas:

(1, 2, 3)

or with an equivalent x to y expression:

1 to 3

→

1
2
3

and sequences can contain any mix of atomic types:

distinct-values( ("Alpha", "Beta", "Alpha", 2, 7, 1, 8, 2, 8) )

→

Alpha
Beta
2
7
1
8
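
The results above keep the first occurrence of each value and preserve the original order. The XQuery specification leaves the order returned by distinct-values() up to the implementation, so treat that as observed behaviour rather than a guarantee. A small Python sketch of the same first-occurrence behaviour, with a hypothetical helper name:

```python
def distinct_values(items):
    """Drop duplicates, keeping the first occurrence of each value in order."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(items))


mixed = distinct_values(["Alpha", "Beta", "Alpha", 2, 7, 1, 8, 2, 8])
segs = distinct_values(
    ["Administrator", "Beheerder", "3", "3", "2", "2",
     "1", "1", "1", "1", "Typenschild", "Typeplaatje"]
)

print(mixed)
print(segs)
```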

The atomic types include boolean (true or false) values:

Numeric 0 and 1 evaluate to false and true respectively, and there is a pair of functions for constructing boolean values: false() and true()

( boolean(0), boolean(1), false(), true() )

→

false
true
false
true

Empty strings evaluate, as booleans, to false,
and any non-empty string (even "0" or "false") evaluates to true

( boolean(""), boolean("0"), boolean("false") )

→

false
true
true
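
The conversion rules for single values can be summarised in a few lines of Python. This is a simplified sketch with a hypothetical function name: it covers only single booleans, numbers and strings, not nodes or longer sequences. The NaN case follows the XQuery specification and has not been checked against the macOS implementation.

```python
import math


def effective_boolean(value):
    """Sketch of XQuery's boolean() for a single boolean, number or string."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Zero and NaN are false; every other number is true.
        is_nan = isinstance(value, float) and math.isnan(value)
        return not (value == 0 or is_nan)
    if isinstance(value, str):
        # Only the empty string is false; "0" and "false" are true.
        return len(value) > 0
    raise TypeError(f"unsupported value: {value!r}")


checks = [effective_boolean(v) for v in (0, 1, "", "0", "false")]
print(checks)
```

Python's own `bool()` agrees with these rules except for NaN, which Python treats as true.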

Dates, Times and DateTimes can be constructed from ISO 8601 strings,
and are returned as ISO 8601 strings too.

xs:date("2025-03-26")

→

2025-03-26T00:00:00.000Z
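
Note that the plain date comes back as a full timestamp at midnight UTC. As an illustration of that rendering only (the subroutine's actual conversion is not shown in this thread, so this is an assumption about what it amounts to), the same string can be produced in Python like this:

```python
from datetime import date, datetime, timezone

# Parse the ISO 8601 date, as xs:date("2025-03-26") does.
day = date.fromisoformat("2025-03-26")

# Assumed rendering: midnight UTC on that day, with millisecond precision.
midnight_utc = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
rendered = midnight_utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")

print(rendered)
```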

Numeric and arithmetic expressions are evaluated as you might expect:

2 + 2

→

4

ComplexPoint · March 26, 2025, 8:07pm

And a fuller example, showing:

(: comments :)
User-defined functions to avoid repeating long patterns
Use of some standard built-in functions
- string-length()
- string()
- exists()
- concat()

and applying an XQuery over the XML source of Keyboard Maestro Macros.plist,
to list all Keyboard Maestro macros by descending size:

XQuery Listing of KM Macros by Decreasing Size.kmmacros (8.6 KB)

(: The <dict> children of every <array> sibling that follows $context :)
declare function local:next-dict($context as node()) as node()* {
    $context/following-sibling::array/dict
};

(: The string that follows the <key>Name</key> child of $context :)
declare function local:name-value($context as node()) as node()* {
    $context/key['Name'=string()]/following-sibling::string[1]/string()
};

(: Each macro group, then each macro within that group :)
for $g in local:next-dict(/plist/dict/key['MacroGroups'=string()])
    for $m in local:next-dict($g//key['Macros'=string()])

        (: Length of the macro's text content, used as a measure of size :)
        let $size := string-length(string($m))
        let $macroname := local:name-value($m)

        (: Skipping untitled macros :)
        where fn:exists($macroname)
        order by $size descending
        return concat(
            $size, ' ',
            local:name-value($g), ' :: ', $macroname
        )