SUBROUTINE :: Filtering XML with XQuery

Keyboard Maestro users wanting to extract parts and strings from XML may struggle to find general Regular Expression solutions. XML consists of nested tags, and regular expressions can't model recursive patterns.

XQuery (a W3C query language built into macOS) can make things quite a lot easier.

XQuery Book 1st Edition (XQuery 1.0)

Apple's introduction: Querying an XML Document

Here is a subroutine (with an example macro) which applies any XQuery 1.0 expression to a Keyboard Maestro variable containing an XML document, and returns the results.

  • The XML sample is kindly provided by @ALYB.
  • The version of XQuery built into macOS is 1.0
  • To run the sample test macro, you also need to have the subroutine somewhere in an activated group
  • Some basic examples of XQuery expressions are explained below

UPDATED to version 0.7

  • Adding a field for an optional JSON dictionary, defining names and values of constants (String, Number, Boolean, or – possibly nested – Arrays of these) which can be referenced in the XQuery by prefixing the key name with $
  • Enabling display of date values
  • Allowing for missing XML or XQuery
  • Allowing processing of XInclude composite XML documents
  • Allowing interpretation of non-canonical (e.g. MS Word) HTML as XML

XQuery over XML SUBROUTINE.kmmacros.zip (3.2 KB)

XQuery subroutine TEST.kmmacros (7.9 KB)


EXAMPLES

Given the CafeTran XML example provided by @ALYB:

XML source:
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
    <header
        creationtool="CafeTran Espresso"
        creationtoolversion="10.8"
        segtype="sentence"
        o-tmf="tmx"
        adminlang="en-US"
        srclang="en-US"
        datatype="plaintext"
    />
    <body>
        <tu tuid="1" creationdate="20250323T101322Z" creationid="">
            <tuv xml:lang="de-DE">
                <seg>Administrator</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>Beheerder</seg>
            </tuv>
        </tu>
        <tu tuid="2" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>3</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>3</seg>
            </tuv>
        </tu>
        <tu tuid="3" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>2</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>2</seg>
            </tuv>
        </tu>
        <tu tuid="4" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
        </tu>
        <tu tuid="5" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
        </tu>
        <tu tuid="6" creationdate="20250323T101322Z" creationid="HL">
            <tuv xml:lang="de-DE">
                <seg>Typenschild</seg>
            </tuv>
            <tuv xml:lang="nl-NL">
                <seg>Typeplaatje</seg>
            </tuv>
        </tu>
    </body>
</tmx>

Plain XPaths

The first thing to be aware of is that the simplest XQuery searches are just plain XPath expressions:

  • using / (as in file paths) to separate levels, or
  • // to mean 'at any level'

The XQuery:

/*/*/*/*

will find all XML tags, with any name, that are nested 4 levels deep.

From our example, we get all the <tuv> elements in the document:

XQuery result for level 4:
<tuv xml:lang="de-DE">
                <seg>Administrator</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>Beheerder</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>3</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>3</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>2</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>2</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>1</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>1</seg>
            </tuv>
<tuv xml:lang="de-DE">
                <seg>Typenschild</seg>
            </tuv>
<tuv xml:lang="nl-NL">
                <seg>Typeplaatje</seg>
            </tuv>
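
The same elements can also be reached by naming each level instead of using wildcards. As a quick check (a sketch against the sample above, using count(), which also appears later in this thread), both paths should report the same twelve elements:

```xquery
(: Wildcard path versus fully named path.
   In the sample both select the twelve <tuv> elements. :)
( count(/*/*/*/*), count(/tmx/body/tu/tuv) )
(: 12, 12 :)
```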

Searching by tag name, rather than level, we can get all the <seg> elements, at any level, with the XQuery:

//seg

obtaining:

<seg>Administrator</seg>
<seg>Beheerder</seg>
<seg>3</seg>
<seg>3</seg>
<seg>2</seg>
<seg>2</seg>
<seg>1</seg>
<seg>1</seg>
<seg>1</seg>
<seg>1</seg>
<seg>Typenschild</seg>
<seg>Typeplaatje</seg>

and if we only want the text they contain (without the tags) we can write:

//seg/text()

getting just:

Administrator
Beheerder
3
3
2
2
1
1
1
1
Typenschild
Typeplaatje

To exclude the lines that start with digits, we can add a regular expression condition between square brackets:

//seg/text()[ not( matches(., '^\d') ) ]

(where dot, as in file paths, refers to the current level)

Now we just get:

Administrator
Beheerder
Typenschild
Typeplaatje
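
Predicates can test attributes as well as text, and more than one predicate can appear in a path. A minimal sketch against the same sample, keeping only the Dutch segments that do not start with a digit, which should leave Beheerder and Typeplaatje:

```xquery
(: First predicate: the xml:lang attribute on each <tuv>.
   Second predicate: the digit filter shown above. :)
//tuv[@xml:lang = "nl-NL"]/seg/text()[ not( matches(., '^\d') ) ]
```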

XPaths inside "FLWOR" expressions

XQuery FLWORs are sequences of the pattern

  • FOR (XPath)
  • LET (optionally attaching a name to one or more values)
  • WHERE (optionally filtering down by specifying a condition)
  • ORDER BY (optionally sorting)
  • RETURN (some value)

With our sample file, and prefacing any XML attribute names inside tags with @, we can write something like:

for $tuv in //tuv
let $lang := $tuv/@xml:lang
let $text := $tuv/seg/text()

return string-join( ($lang, $text), "\t")

to get:

de-DE	Administrator
nl-NL	Beheerder
de-DE	3
nl-NL	3
de-DE	2
nl-NL	2
de-DE	1
nl-NL	1
de-DE	1
nl-NL	1
de-DE	Typenschild
nl-NL	Typeplaatje

and we can add a WHERE clause to filter it down a bit, excluding the digit lines:

for $tuv in //tuv
let $lang := $tuv/@xml:lang
let $text := $tuv/seg/text()
where not (matches($text, '^\d'))
return string-join( ($lang, $text), "\t")

getting:

de-DE	Administrator
nl-NL	Beheerder
de-DE	Typenschild
nl-NL	Typeplaatje

or perhaps adding an ORDER BY clause to separate out the languages:

for $tuv in //tuv
let $lang := $tuv/@xml:lang
let $text := $tuv/seg/text()
where not (matches($text, '^\d'))
order by $lang
return string-join( ($lang, $text), "\t")

resulting in:

de-DE	Administrator
de-DE	Typenschild
nl-NL	Beheerder
nl-NL	Typeplaatje

Or if we prefer, we can put corresponding NL and DE terms next to each other, line by line:

for $tu in //tu
let $de := $tu/tuv[@xml:lang="de-DE"]/seg/text()
where not (matches($de, '^\d'))
return string-join(($tu/tuv[@xml:lang="nl-NL"]/seg/text(), $de), '\t')

→

Beheerder Administrator
Typeplaatje Typenschild
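
A note on the "\t" separator used in the queries above: in standard XQuery 1.0 string literals the backslash is not an escape character, and a tab is written as the character reference &#9;. Whether "\t" is converted to a tab before it reaches the XQuery engine depends on your setup, and is not stated above. If it comes through as a literal backslash and t, a variant along these lines may help (a sketch under that assumption):

```xquery
for $tu in //tu
let $de := $tu/tuv[@xml:lang = "de-DE"]/seg/text()
let $nl := $tu/tuv[@xml:lang = "nl-NL"]/seg/text()
where not( matches($de, '^\d') )
(: "&#9;" is the XQuery character reference for a tab :)
return string-join( ($nl, $de), "&#9;" )
```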


More examples – XML Attributes

Let's look at the header element of our sample file:

//header

→

<header creationtool="CafeTran Espresso" creationtoolversion="10.8" segtype="sentence" o-tmf="tmx" adminlang="en-US" srclang="en-US" datatype="plaintext"></header>

It contains a number of key="value" XML attributes.

Let's list all the attributes in the header:

//header/@*

→

datatype="plaintext"
srclang="en-US"
adminlang="en-US"
o-tmf="tmx"
segtype="sentence"
creationtoolversion="10.8"
creationtool="CafeTran Espresso"

but perhaps we only want to see their value strings:

for $attribute in //header/@*
return string($attribute)

→

plaintext
en-US
en-US
tmx
sentence
10.8
CafeTran Espresso

or, for that matter, perhaps we only need to see their names:

for $attribute in //header/@*
return name($attribute)

→

datatype
srclang
adminlang
o-tmf
segtype
creationtoolversion
creationtool

and when we are interested in just one particular attribute, we can specify its name:

//header/@creationtool

→

creationtool="CafeTran Espresso"

or

//header/string(@creationtool)

→

CafeTran Espresso
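
Names and values can also be combined in one pass. A minimal sketch using concat(), which appears later in this thread. The lines come out in whatever attribute order the XQuery engine reports, which (as the listings above suggest) is not necessarily the order in the source:

```xquery
(: One "name = value" line per attribute of <header> :)
for $attribute in //header/@*
return concat( name($attribute), " = ", string($attribute) )
```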

XQuery Function library

XQuery 1.0 defines a number of useful functions over:

  • Strings
  • Numbers
  • Dates
  • Boolean true/false values
  • Sequences of any of the above

The XQuery 1.0 implementation built into macOS provides most but not all of the functions listed at:

XQuery 1.0 and XPath 2.0 Functions and Operators (Second Edition)

The main gaps that I have noticed are:

  • Date functions are implemented, but not Duration functions
  • Some useful Sequence functions are provided (esp. distinct-values(), which prunes out duplicates) but reverse() has not been implemented. FLWOR ORDER BY clauses can reverse, however:
for $x in (1,2,3,4,5)
order by $x descending
return $x

→

5
4
3
2
1
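
The same ORDER BY approach extends to sequences that are not already sorted, by ordering on position rather than value. A sketch, assuming the positional "at $i" form of FOR is available (it is used later in this thread); the item values are arbitrary:

```xquery
let $items := ("gamma", "alpha", "beta")
for $item at $i in $items
(: Sort on each item's position, last first :)
order by $i descending
return $item
(: beta, alpha, gamma :)
```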

XQuery Types

A typical XQuery returns a Sequence, which this Keyboard Maestro subroutine renders as a series of lines.

distinct-values(//seg/text())

→

Administrator
Beheerder
3
2
1
Typenschild
Typeplaatje
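
count() can be wrapped around a sequence to get its length. As a small sketch against the sample: there are twelve <seg> elements, and the listing above shows seven distinct values:

```xquery
(: Total number of <seg> elements, then number of distinct texts :)
( count(//seg), count(distinct-values(//seg/text())) )
(: 12, 7 :)
```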

You can construct your own sequence directly, between brackets, delimited by commas:

(1, 2, 3)

or with an equivalent x to y expression:

1 to 3

→

1
2
3

and sequences can contain any mix of atomic types:

distinct-values( ("Alpha", "Beta", "Alpha", 2, 7, 1, 8, 2, 8) )

→

Alpha
Beta
2
7
1
8

The atomic types include boolean (true or false) values:

Numeric 0 and 1 evaluate to false and true respectively, and there is a pair of functions for constructing boolean values: false() and true()

( boolean(0), boolean(1), false(), true() )

→

false
true
false
true

Empty strings evaluate, as booleans, to false,
and any non-empty string (even "0" or "false") evaluates to true

( boolean(""), boolean("0"), boolean("false") )

→

false
true
true
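
Sequences have boolean values too. Under the standard XQuery rules the empty sequence is false, and a sequence whose first item is a node is true, which is what lets a bare path act as a condition. A sketch against the sample XML (nosuchtag is a made-up element name that does not occur in it):

```xquery
( boolean( () ), boolean( //seg ), boolean( //nosuchtag ) )
(: false, true, false :)
```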

Dates, Times and DateTimes can be constructed from ISO8601 strings,
and are returned as ISO8601 strings too.

xs:date("2025-03-26")

→

2025-03-26T00:00:00.000Z
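
The creationdate attributes in the sample (e.g. 20250323T101322Z) use the compact ISO 8601 basic format, which standard xs:date() and xs:dateTime() do not accept as written. A sketch of reshaping them first, assuming substring() and concat() are among the functions available in the macOS implementation:

```xquery
for $tu in //tu
let $d := string($tu/@creationdate)
(: Insert the hyphens that the xs:date lexical form requires :)
let $iso := concat( substring($d, 1, 4), "-", substring($d, 5, 2), "-", substring($d, 7, 2) )
return xs:date($iso)
```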

Numeric and arithmetic expressions are evaluated as you might expect:

2 + 2

→

4

And a fuller example, showing:

  • (: comments :)
  • User-defined functions to avoid repeating long patterns
  • Use of some standard built-in functions
    • string-length()
    • string()
    • exists()
    • concat()

and applying an XQuery over the XML source of Keyboard Maestro Macros.plist,
to list all Keyboard Maestro macros by descending size:

XQuery Listing of KM Macros by Decreasing Size.kmmacros (8.6 KB)


declare function local:next-dict($context as node()) as node()* {
    $context/following-sibling::array/dict
};

(: Returns string values rather than nodes, hence xs:string* :)
declare function local:name-value($context as node()) as xs:string* {
    $context/key['Name'=string()]/following-sibling::string[1]/string()
};

for $g in local:next-dict(/plist/dict/key['MacroGroups'=string()])
    for $m in local:next-dict($g//key['Macros'=string()])

        (: Size is measured as the length of the macro's text content :)
        let $size := string-length(string($m))
        let $macroname := local:name-value($m)

        (: Skipping untitled macros :)
        where fn:exists($macroname)
        order by $size descending
        return concat(
            $size, ' ',
            local:name-value($g), ' :: ', $macroname
        )
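
For a smaller illustration of the same technique on the CafeTran sample, here is a sketch with a single user-defined function. The name local:seg-text and its parameters are made up for this example:

```xquery
(: Text of the <seg> for a given language inside one <tu> :)
declare function local:seg-text($tu as node(), $lang as xs:string) as xs:string* {
    $tu/tuv[@xml:lang = $lang]/seg/string()
};

for $tu in //tu
return concat( local:seg-text($tu, "de-DE"), " = ", local:seg-text($tu, "nl-NL") )
```

Declaring the function once keeps the language-specific path out of the FLWOR body, so adding a third language would only need another call.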

Thanks for sharing this subroutine, @ComplexPoint. I've already used it to successfully parse some Keyboard Maestro action XML. This is certainly far easier and more robust than regex.

With that said, I've run into an issue and it appears that Keyboard Maestro tokens are the source of the issue.

For this local_XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<array>
	<dict>
		<key>ActionUID</key>
		<integer>16883798</integer>
		<key>MacroActionType</key>
		<string>SetVariableToText</string>
		<key>Text</key>
		<string>This is %Variable%local_EmbeddedText1% and %Variable%local_EmbeddedText2% in the variable value.</string>
		<key>Variable</key>
		<string>local_SomeTextVariable</string>
	</dict>
</array>
</plist>

My goal is to extract all Keyboard Maestro variables. In the above, it's easy to get local_SomeTextVariable. I used this local_XQuery:

for $var in //key[. = "Variable" or . = "SourceVariable"]/following-sibling::string[1]/text()
return $var

When I attempted to return local_EmbeddedText1 and local_EmbeddedText2 I ran into an interesting issue. With this simple local_XQuery:

for $string in //string
return $string

The following is returned:

<string>SetVariableToText</string>
<string>This is  and  in the variable value.</string>
<string>local_SomeTextVariable</string>

It certainly appears that the tokens are being evaluated before local_XML is processed.

How can the raw text be processed?

Ideally, I'd have a single query that would return:

local_EmbeddedText1
local_EmbeddedText2
local_SomeTextVariable

TIA!


UPDATE: Just after posting this question, I had a palm-slap to the forehead moment. I'll leave the above for the benefit of others that might trip up on this.

Of course, all I needed to do was change the token processing on the local_XML, so that the tokens are left unprocessed.


Good point – if your XML (or XQuery) is coming in through a Text field, then you have the option of processing tokens (pre-xquery) or leaving them intact.

(Not an issue when you bring your XML in through a Read File action)


To extract KM variable names from the Text and Variable keys of KM action XML, perhaps you could experiment with a nested FLWOR along the lines of:

for $k in //key
let $v := $k/following-sibling::string[1]/text()
return if ($k = 'Variable') then
	$v
else if ($k = 'Text') then
	let $tokens := tokenize($v, '%')
	for $i in (1 to count($tokens) - 1)
	where $tokens[$i] = 'Variable'
	return $tokens[$i + 1]
else
	()

or perhaps, if you want to experiment with user-defined functions:

declare function local:kmvarNamesInText($text as xs:string) as xs:string* {
	let $tokens := tokenize($text, '%')
	for $token at $i in $tokens
	where $token = 'Variable'
	return $tokens[$i + 1]
};

for $k in //key
let $v := $k/following-sibling::string[1]/text()
return if ($k = 'Variable') then
	$v
else if ($k = 'Text') then
	local:kmvarNamesInText($v)
else
	()

e.g.

Variable names in action plist XML.kmmacros (4.3 KB)
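
Both versions depend on how tokenize() splits the text at each % character. A sketch with a literal string (the variable names local_A and local_B are made up) lists the pieces with their positions. Every token equal to Variable is followed by a variable name, which is what the queries above pick out. Token processing needs to be off for the query text, or Keyboard Maestro would expand the tokens first:

```xquery
(: Number each token so the Variable / name pairs are easy to see :)
for $token at $i in tokenize("Set %Variable%local_A% and %Variable%local_B% here.", "%")
return concat($i, ": [", $token, "]")
```

The string splits into seven tokens, with Variable at positions 2 and 5 and the names at positions 3 and 6.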


Thanks, @ComplexPoint.

I have something very similar, but it covers a few more keys, since some actions don't use a key of Variable and my previously unstated goal is to extract variables from all actions.

I’m away from my computer until tomorrow, but I’ll share my XQuery when I return.

This is so much better than a brute force RegEx parse. Thanks again.


I can imagine – this seems to find c. 22 distinct keys with string values including %Variable% in my Keyboard Maestro Macros.plist:

distinct-values( //string[contains(., "%Variable%")]/preceding-sibling::key[1] )

KM Plist Keys containing %Variable% tokens ?.kmmacros (8.1 KB)
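
To see which of those keys matter most, the same selection can be grouped and counted. A sketch that reuses only functions already seen in this thread (count(), distinct-values(), concat()); the output depends entirely on your own Macros.plist:

```xquery
let $keys := //string[contains(., "%Variable%")]/preceding-sibling::key[1]
for $name in distinct-values($keys)
(: How many matching strings sit under a key with this name :)
let $n := count($keys[. = $name])
order by $n descending
return concat($n, " ", $name)
```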


Version 0.7 of the Subroutine (updated in the first post, above) now adds an extra field, in which a dictionary of constants can be supplied as a JSON string.

In your XQuery, you can reference the values of any of these constants by preceding the JSON key name with a $ character.

The named constants can be of type String, Number, Boolean, Date, or an Array of these.

Arrays can be nested, and are accessible in the query as XQuery sequences.


The example below queries only the JSON data, and references no XML document (all Keyboard Maestro variables listed by descending size).

KM Variables by Descending Size ( XQuery version ).kmmacros (4.1 KB)
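
As a sketch of how such constants might be combined with the CafeTran sample from the first post: the key names lang and minLength below are hypothetical, and the JSON shown in the comment would go in the subroutine's JSON field rather than in the query itself.

```xquery
(: Assumed JSON dictionary: { "lang": "nl-NL", "minLength": 4 } :)
//tuv[@xml:lang = $lang]/seg/text()[ string-length(.) >= $minLength ]
```

Keeping the language code and threshold in the JSON field means the query text can stay unchanged while the values vary.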


And, of course, you could also make direct ($-prefixed) references to Keyboard Maestro variable names, within an XQuery.

Direct reference to KM Variables in XQuery.kmmacros (4.7 KB)
