Strange Error Whilst Searching with RegEx

Vinho · August 25, 2019, 3:23pm

Visualisation with Graphviz3.kmmacros (14.5 KB)

Hi!

I'm using software similar to nvALT for note-taking and have been busy building a KM macro to visualise links between my notes with Graphviz. The title of each note (= the basename of its file) always consists of a unique ID (12 digits) followed by a name, e.g. "201008192344 On Breadbaking". When I link to a note from within another note, the link always looks like this: "[[123456789012]]" – the number in brackets is the unique ID of the note I want to link to.

The macro I wrote is fed with a list of note-titles I select in my note-taking software and then uses these to build a .dot-file that can then be made into a graph showing individual note-titles as nodes with arrows that point to all the notes they link to. This is an example of a possible outcome:

The KM macro roughly does the following if I select a number of notes in the note list of my software and trigger it:

1. It copies the note titles selected (each note title in a new line)
2. It uses the list of selected notes, their unique IDs (UIDs) and their verbal descriptions to make a list of all the nodes and a list of all the relations between them (links) for the .dot-file
3. It creates the .dot-file and makes a .svg-file out of it.
4. It opens the .svg-file in the browser.
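
For readers without Keyboard Maestro, the node and edge lists described above could be turned into DOT text roughly like this. It is a minimal Python sketch, not the macro itself: the function name `build_dot` and the two dictionaries are hypothetical, and the sample UIDs are borrowed from the note lists posted further down in this thread. Rendering the result to SVG would be left to Graphviz, e.g. `dot -Tsvg notes.dot -o notes.svg`.

```python
def build_dot(titles, links):
    """Return DOT source for a set of notes.

    titles: dict mapping a 12-digit UID to the note's verbal description
    links:  dict mapping a UID to the list of UIDs that note links to
    (Both structures are assumptions made for this sketch.)
    """
    lines = ["digraph notes {"]
    # One node per selected note, labelled with its verbal description.
    for uid, name in titles.items():
        label = name.replace('"', '\\"')
        lines.append(f'    "{uid}" [label="{label}"];')
    # One edge per link; links from a note to itself are skipped.
    for source, targets in links.items():
        for target in targets:
            if target != source:
                lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


titles = {
    "201905051640": "[DEF] Humus",
    "201905051629": "[DEF] Humification",
}
links = {
    "201905051640": ["201905051629"],
    "201905051629": ["201905051640"],
}
dot_source = build_dot(titles, links)
print(dot_source)
```

UIDs that appear only as link targets get a default node showing the bare UID, because Graphviz creates nodes implicitly when they are first mentioned in an edge.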

Step 2 is the most complicated step. The macro will go through each of the selected note titles and amongst other things search the associated file for links to other notes – it searches for the RegEx (?<=\[\[)[0-9]{12}(?=\]\]). For each of the search results, it then determines whether the found UID belongs to
a) the note it was found in (then it is ignored),
b) the list of selected notes (then the associated note will be displayed with its verbal description in the final graph)
c) or other non-selected notes (the user can choose whether to show these in the final graph at all and if they are shown their nodes will only show their UID).
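
The link search and the a)/b)/c) decision can be sketched as follows. This is an illustration in Python, whose `re` module accepts the same link pattern, and not the macro's actual logic: `classify_links` and the sample note text are hypothetical.

```python
import re

# The link pattern from the post: 12 digits enclosed in [[ and ]].
LINK_PATTERN = re.compile(r"(?<=\[\[)[0-9]{12}(?=\]\])")


def classify_links(note_uid, note_text, selected_uids):
    """Sort the UIDs linked from one note into 'selected' and 'other'."""
    selected, other = [], []
    for link_id in LINK_PATTERN.findall(note_text):
        if link_id == note_uid:
            continue  # case a): link to the note itself, ignored
        if link_id in selected_uids:
            selected.append(link_id)  # case b): shown with its description
        else:
            other.append(link_id)  # case c): optionally shown as bare UID
    return selected, other


# Made-up note content, for illustration only.
text = "See [[201905051640]], [[201906200712]], [[201905051629]] and [[201906200735]]."
selected, other = classify_links(
    "201905051629", text, {"201905051640", "201906200712", "201905051629"}
)
print(selected)  # ['201905051640', '201906200712']
print(other)     # ['201906200735']
```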

In the case of b), the macro performs the RegEx search (?<=^%linkID%\ ).+ in the list of selected note titles to associate each note with its verbal description rather than its UID. This step causes a problem I can't explain at all – the macro will often be cancelled and show the error "Search Regular Expression failed to match (?<=^%linkID%\ ).+". Strangely, this error isn't tied to particular notes (sometimes they work, sometimes they don't), but it seems to depend on the note list I select. Does anyone have an idea of what's going wrong here?

To provide more information: This is a list of notes that works:

201903201647 SHYoung [HWT] Improving Critical Thinking
201903201645 [HWT] Improving Critical Thinking
201903090914 [DEF] Life
201903020926 [EXPL] Climate Change
201902231419 [DEF] Spirituality
201901102147 [DEF] Sustainability
201901071530 Basic Income Discussion
201811251200 [DEF] Wisdom

This list of notes (just one more at the beginning) doesn't work. The macro runs through the first two notes just fine and is cancelled when doing the RegEx search (?<=^%linkID%\ ).+ for the UID 201903201647 found in the third note – something that worked without problems in the first list above. The new (first) note in this list doesn't contain any links to other notes:

201904101954 [DEF] Essential nutrients for an organism
201903201647 SHYoung [HWT] Improving Critical Thinking
201903201645 [HWT] Improving Critical Thinking
201903090914 [DEF] Life
201903020926 [EXPL] Climate Change
201902231419 [DEF] Spirituality
201901102147 [DEF] Sustainability
201901071530 Basic Income Discussion
201811251200 [DEF] Wisdom

I've also attached an image of the macro (the problematic step is red).

Tom · August 25, 2019, 8:37pm

Thanks for your detailed macro and context explanation. It is really helpful to understand how the macro works and what it is supposed to do.

Unfortunately I was not able to duplicate the issue. Probably this is because I couldn’t use the same data input. (I used a minimalized version of the macro with the actions I considered core actions on a sort of simulated input built from the data in your description, via a variable.)

So — unless somebody more experienced spots a glaring issue — I think it would be helpful if you provided the following:

The original input files where you are experiencing the issue.
- Put the necessary files into a Zip archive and upload the zip the same way as the macro file.
- If the files or file names contain sensitive information, change it. Just make sure the data still triggers the issue.
Which KM version are you using?

Vinho · August 25, 2019, 9:36pm

Hi Tom! Thanks for trying to help me with this issue! I put four other notes in a Zip archive. If I trigger the macro with the copied list of the following notes, it works:

201905051640 [DEF] Humus
201906200712 [FACT] How humus is formed
201905092213 [DEF] Mineralisation

If I add the fourth to it, it produces the error described:

201905051640 [DEF] Humus
201906200712 [FACT] How humus is formed
201905092213 [DEF] Mineralisation
201905051629 [DEF] Humification

The macro is cancelled whilst dealing with the first of the notes, doing the RegEx search for the linkID 201905051629 of the added fourth note.

I made another macro that produces a log-file for most of the steps – that's how I can tell where exactly the problem occurs. I attached that as well.

Visualisation with Graphviz2.kmmacros (25.7 KB)
Archive.zip (3.8 KB)

Tom · August 25, 2019, 9:49pm

OK, I can reproduce it now. I'll get back to you.

Tom · August 25, 2019, 10:37pm

Is this the correct result?:

Vinho · August 25, 2019, 10:44pm

Hi Tom!

The graph you have posted is not correct. The correct result would show the following arrows:

201905051640 -> 201905051629 (fourth note in the list)
201906200712 -> { 201906200735 201906200720 } (both not in the list of selected notes)
201905092213 -> { }
201905051629 -> { 201905051640 (first note in the list) 201906200712 (second note in the list) }

The two notes that 201906200712 points to should show up as additional nodes in the graph.

Tom · August 25, 2019, 11:26pm

Better?:

Vinho · August 25, 2019, 11:41pm

Definitely better, but still not how it should be. The two arrows from the note “[DEF] Humification” are missing...

Tom · August 25, 2019, 11:59pm

OK, not sure if the missing arrows are a different issue or not.

In any case, the regex (?<=^%linkID%\ ).+ cannot match 201905051629 if it is not at the beginning of the string (as is the case with the list above). And 201905051629 is the only one being processed through the regex, no? (Since it’s not part of the files sample.)

If I’m interpreting the logic of the macro correctly, I have the impression the ^ is rather meant to match the beginning of a line (not of the string). (But maybe I’m missing something, it’s already late at night…)

To make it match the beginning of the line you have to set the Multiline flag:

(?m)(?<=^%linkID%\ ).+

or just not use the ^:

(?<=%linkID%\ ).+

But this probably does not explain the two missing arrows…
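
The three variants can be compared side by side. The sketch below uses Python's `re` module as a stand-in for KM's regex engine, on the assumption that `^` and `(?m)` behave the same way in both for this input. The KM variable %linkID% is replaced by a plain Python string.

```python
import re

note_list = (
    "201905051640 [DEF] Humus\n"
    "201906200712 [FACT] How humus is formed\n"
    "201905092213 [DEF] Mineralisation\n"
    "201905051629 [DEF] Humification"
)
link_id = "201905051629"  # stands in for the KM variable %linkID%

# Without the multiline flag, ^ only matches at the start of the whole string.
no_flag = re.search(r"(?<=^" + link_id + r"\ ).+", note_list)
# With (?m), ^ also matches at the start of every line.
with_flag = re.search(r"(?m)(?<=^" + link_id + r"\ ).+", note_list)
# Without ^, the position of the UID in the string does not matter.
no_anchor = re.search(r"(?<=" + link_id + r"\ ).+", note_list)

print(no_flag)            # None
print(with_flag.group())  # [DEF] Humification
print(no_anchor.group())  # [DEF] Humification
```

With the original pattern, only a UID on the very first line of the list can match. That would be consistent with the error depending on the selected list: 201903201647 was the first line of the list that worked and the second line of the list that failed.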

Vinho · August 26, 2019, 8:20am

Hi Tom!

Thanks a lot for your advice, it's working now. I discovered that the missing arrows were due to the UID of "[DEF] Humification" being different in the list of notes and in the filename, so the macro couldn't find a file to search. I don't understand how that happened, since my note-taking app should make sure they are always identical. It was 201905051628 instead of 201905051629, which corresponds to a one-minute difference, because the UIDs are time-based (YYYYMMDDHHMM)...
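
The one-minute gap can be checked by parsing both UIDs as timestamps. This is a small Python sketch that assumes only the YYYYMMDDHHMM layout stated above; the variable names are made up.

```python
from datetime import datetime

UID_FORMAT = "%Y%m%d%H%M"  # YYYYMMDDHHMM, as described in the post

uid_a = datetime.strptime("201905051628", UID_FORMAT)
uid_b = datetime.strptime("201905051629", UID_FORMAT)
difference = uid_b - uid_a
print(difference)  # 0:01:00
```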

Regarding the RegEx itself: It works with both of your suggestions, but I can't say I understand why. If I enter my RegEx (?<=^%linkID%\ ).+ (with the actual UID instead of the KM-variable) and the above lists on https://regex101.com, it works perfectly.
Aren't the note lists I copied considered as separate strings, one for each line? Otherwise the .+ at the end of the RegEx would also cause the lines underneath my desired line to be matched, wouldn't it? Many online resources also call ^ a "line anchor" because (with most regex engines) it matches after each line break...

Thanks anyway, I would never have gotten it to work without your help!

Tom · August 26, 2019, 10:15am

OK, so this was indeed a separate issue.

It works on regex101 probably because you have the m flag (Multiline) set there:

Click here to open an example of your regex on regex101, then:
- Remove the m flag from the flag popup at the right-hand side: no more match.
- Try any of my variants: it matches again.

The (?m) is just another notation for setting the m flag. But — unless there are also lines with more than one of those 12-digit numbers — you shouldn’t need the ^ anchor at all, and then it will match with or without the m flag.

From the ICU Regex guide:

m
UREGEX_MULTILINE
Control the behavior of "^" and "$" in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, "^" and "$" will also match at the start and end of each line within the input text.

(ICU regex is the flavor used in KM.)

I think it is usually called a “start of string anchor”. A program that sets the m flag by default will probably call it a “line anchor”.

Here is a good explanation on regular-expressions.info.

Vinho · August 26, 2019, 11:49am

Thanks a lot for these explanations, that makes sense now. I indeed had the m flag set on regex101.

So the operator ^ treats an input with multiple lines as one string (unless the m flag is set), but the operator + ends at the end of a line, even if the input has more lines?

Tom · August 26, 2019, 12:10pm

The + means ‘one or more matches of the preceding character class or metacharacter’, in your case the dot (.).

A dot means ‘any character’, but by default that does not include line terminators, which is why .+ stops at the end of the line. To make the “any” include line terminators, you can set the s (Single Line) flag.

So, the m flag affects ^ and $, whereas the s flag affects . (the dot).
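
The effect of the s flag on the dot can be seen in a short experiment. As before, this is a sketch in Python's `re` module, not in KM. It assumes the two engines agree for plain `\n` line breaks, although the exact set of characters treated as line terminators differs between engines.

```python
import re

text = "201905051640 [DEF] Humus\n201906200712 [FACT] How humus is formed"

# Default: the dot does not match the line break, so .+ stops at the end of the line.
default_dot = re.search(r"(?<=201905051640\ ).+", text).group()
# With (?s), the dot also matches the line break, so .+ runs to the end of the string.
dotall = re.search(r"(?s)(?<=201905051640\ ).+", text).group()

print(repr(default_dot))  # '[DEF] Humus'
print(repr(dotall))       # '[DEF] Humus\n201906200712 [FACT] How humus is formed'
```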

Tom · August 26, 2019, 12:20pm

IMO these are the most essential regex help sources:

Regular-Expressions.info for general reference and for in-depth understanding
ICU regex guide as reference specifically for KM regexes (also linked in the KM Help menu)
regex101 as testing and debugging playground

Vinho · August 26, 2019, 1:18pm

Great, Tom! I'm incredibly grateful for your help, thanks a lot!

Tom · August 26, 2019, 1:49pm

Glad to see that it works now!

If one of my posts did solve your issue (or was a major help), feel free to set the “Solved” check box at the bottom of that post.

BTW, welcome to the forum!

Vinho · August 26, 2019, 6:27pm

Thanks! And thanks again for your help!

In case anyone is interested in the finished macro and how exactly I use it, I posted it here.
