Auditing Wiki for Missing Titles

A while back, in another thread, I made this offer:

I've done a comprehensive but somewhat simplistic audit. Here's what I found. I need some help in moving forward from here.

The numbers

Out of 788 Wiki pages audited:

  • 204 pages (26%) have titles that match the standard format
  • 223 pages (28%) are missing any title at all
  • 361 pages (46%) have titles that don't match the standard format and will need manual review

The standard I was working from is that a file like .../action/Comment should have a title like "Comment Action", derived from the filename and its namespace.
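For concreteness, here is a minimal Bash sketch of that derivation (my own illustration, not part of the audit scripts; the exact capitalization rule for the namespace is an assumption on my part):

# Sketch only: derive the "standard" title from a Wiki path.
# Assumes the rule is simply "page name with underscores as spaces,
# followed by the capitalized namespace".
standard_title() {
  local path="$1"
  local namespace page
  namespace="$(basename "$(dirname "$path")")"    # e.g. "action"
  page="$(basename "$path")"                      # e.g. "Clipboard_History"
  page="${page//_/ }"                             # underscores become spaces
  namespace="$(tr '[:lower:]' '[:upper:]' <<< "${namespace:0:1}")${namespace:1}"
  printf '%s %s\n' "$page" "$namespace"
}

standard_title "/action/Comment"                  # -> Comment Action
standard_title "/collection/Clipboard_History"    # -> Clipboard History Collection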

It turns out that the largest group of pages (46%) have titles that exist but don't follow that standard in some way or another. Each of those will need a human decision (perhaps made in bulk): should the title conform to the standard, is the existing title fine as is, or does the difference between the current title and the "standard" show that the title needs some other tweak beyond simply conforming? For now, there is something there, so they are a lower priority.

As for the missing titles, I haven't fixed anything yet. I expected maybe 20 or 30 missing titles and thought I'd just work through them. 223 is a different proposition, and 361 needing review is a project that requires input from others, not a simple afternoon's exercise.

How I did the audit

The audit runs in two scripts. The first does a Wiki search for all pages in a given namespace using curl and, from the resulting page, generates a list of URLs. The second script loads each URL from a given list, one at a time via curl, and greps for the page's top-level heading tag, <H1>. From the filename it generates the standard title and compares that to the text found inside the <H1>.
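In rough terms, the per-page check amounts to something like the following. This is a sketch of the idea, not the actual audit.sh code, and the exact HTML the Wiki emits for its headings is an assumption:

url="https://wiki.keyboardmaestro.com/collection/Variables"   # one line from a urls_*.txt file
expected="Variables Collection"                               # standard title derived from the URL

# Grab the first <h1> heading and strip the tags; the pattern assumes the
# heading sits on a single line, which may not hold for every page.
found="$(curl -s "$url" | grep -oE '<h1[^>]*>[^<]*</h1>' | head -n 1 | sed -E 's/<[^>]+>//g')"

if [[ -z "$found" ]]; then
  echo "MISSING"
elif [[ "$found" == "$expected" ]]; then
  echo "OK"
else
  echo "ERR: found '$found'"
fi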

The output is a Markdown table, one row per URL. The first column is a clickable link to the page. The second column shows the status: OK, MISSING, or ERR (with the actual, found title shown for ERR entries). The third column, for MISSING and ERR rows, contains a correctly formatted standard title, ready to paste directly into the Wiki editor.
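To make the format concrete, a few rows might look like this (hypothetical entries for illustration only, not actual audit results):

| Page | Status | Standard Title |
|------|--------|----------------|
| [Comment](https://wiki.keyboardmaestro.com/action/Comment) | OK | |
| [Some_Page](https://wiki.keyboardmaestro.com/action/Some_Page) | MISSING | Some Page Action |
| [Another_Page](https://wiki.keyboardmaestro.com/action/Another_Page) | ERR ("Another page") | Another Page Action |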

I'll be sharing both scripts in this thread in later posts. They're well commented and would be a reasonable starting point for anyone doing similar audits here or elsewhere.

What I'm asking for

For the MISSING pages, the workflow is straightforward enough that it might be worth building a KM macro to handle the repetitive steps: copy the formatted new title, open the link to the Wiki page, open the Wiki editor, paste in the title, save it. I haven't built that yet. If someone could volunteer to build the macro, and someone (else?) could volunteer to use it to fill in even part of the 223 missing titles, that would help complete this first pass pretty quickly.
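A KM macro would handle this more gracefully, but the loop to be automated is roughly the following sketch. Here missing_titles.txt is a hypothetical input file of URL-and-title pairs, and the ?do=edit suffix is DokuWiki's edit view:

# Sketch of the repetitive loop, not a finished tool.
# Assumes missing_titles.txt (hypothetical) holds "URL<TAB>Standard Title" per line.
while IFS=$'\t' read -r url title; do
  printf '%s' "$title" | pbcopy          # put the formatted title on the clipboard
  open "${url}?do=edit"                  # open DokuWiki's edit view for that page
  read -r -p "Paste the title, save, then press Return for the next page... " _ </dev/tty
done < missing_titles.txt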

The ERR pages are a different matter. Those need someone to look at each one and decide whether the existing title is fine, needs a tweak, or needs to be replaced with the standard format. That's judgment work and I'd welcome thoughts on how to organize that. I expect that there will be some large chunks, possibly whole namespaces, that can be identified as "just fine as is".

I will share the full tables here so that you can see what the ERR pages are in the context of other page titles, but I'd rather share them somewhere that they can be jointly edited. Now that @peternlewis has helped me make Wiki posts here, they will be jointly editable. That way the status can be updated as the MISSING entries are fixed and as the ERR entries are resolved.

Does the Wiki support hosted files that editors can update? Otherwise I'm thinking a shared Google Sheet with edit access to anyone with the link. Suggestions are welcome.

[Update:] As discussed in the next few posts, the Forum provides a feature of converting a post into a Wiki that anyone can edit. This should make it easy to keep track of what's been edited and what is still to do.


Here is the first script, get_urls.sh, listed below (Fig.1) along with sample output (Fig.3). It is basically a five-piece pipeline:

curl | grep | sort | sed | cat

The curl command uses the script's command-line argument as the name of the group/namespace to search for. It prefixes the name with "@" and runs the search on the Wiki home page.

The output of the curl search has all the resulting filenames formatted in a single line, so "normal" grep is only marginally useful. Usually grep returns whole lines from the input that each contain the search pattern. While working on this script, I learned about the magic option, -o. Using -o returns only the matching text, no matter how many times it appears in a line. By constructing a search pattern that specifies the group name plus a filename, I can use grep -o to extract only those substrings from the curl output.
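For example, on a contrived one-line input (not real Wiki HTML), grep -o pulls out every match:

echo '<a href="/collection/Variables">x</a><a href="/collection/Dictionaries">y</a>' \
  | grep -oE "/collection/[A-Za-z0-9_]+"
# /collection/Variables
# /collection/Dictionaries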

That list of filenames goes through sort for de-duplicating and then through sed to paste the BASE_URL onto the front of each (because the curl search produced relative pathnames). Then cat sticks the result into a file that includes the group name in its filename.
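Shown on a couple of sample relative paths, the tail of the pipeline behaves like this:

printf '/collection/Variables\n/collection/Variables\n/collection/Dictionaries\n' \
  | sort -u \
  | sed 's|^|https://wiki.keyboardmaestro.com|'
# https://wiki.keyboardmaestro.com/collection/Dictionaries
# https://wiki.keyboardmaestro.com/collection/Variables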

Fig.1 shows the fully annotated get_urls.sh script. (It uses my own "graphic" annotation style for the complex commands, called "cxc" for "Character by Character". I hope it's helpful.)

Fig.1. The `get_urls.sh` script
#!/usr/bin/env bash
#
# USAGE: ./get_urls.sh GROUP_NAME
#
# Example:
#   ./get_urls.sh token
#   ./get_urls.sh trigger

# Turn on Bash safety features:
# -e β€” exit on error
# -u β€” treat unset variables as errors
# -o pipefail β€” detect failures inside pipelines
set -euo pipefail

########################################################################
# Configuration

# Set GROUP to $1; if it’s missing or empty, abort with this error message.
GROUP="${1:?Usage: $0 group_name}"

# BASE_URL is used both in SEARCH_URL and in reporting the found URLs.
BASE_URL="https://wiki.keyboardmaestro.com"

# SEARCH_URL is used in the curl command to generate the search page.
SEARCH_URL="$BASE_URL/Home_Page?do=search&q=%40${GROUP}"


########################################################################
# Extraction pipeline

#    -s,   :--silent,  Silent or quiet mode.
#    │  "$SEARCH_URL"    :The BASE_URL plus the Search Page details
#    │  │             ∣    :Pipe the output of curl to...
#    │  │             │ \    : ...the next command (grep)
#    │  │             │ │
curl -s "$SEARCH_URL" | \

  # Extract matching URLs 
  #   (-o is required because the HTML search result is one long line)
  #
  #    -o    :Print only matching content, don't output the entire line
  #    |         Matches multiple times in a single line
  #    │ -E    :-E Use Extended-regex
  #    │ │ "/    :Begin the search string with a literal / character
  #    │ │ │ ${GROUP}    :the text in the $GROUP variable
  #    │ │ │ │       /    :Another slash
  #    │ │ │ │       │[A-Za-z0-9_]    :And the filename using these characters
  #    │ │ │ │       ││           +    :one or more times.
  #    │ │ │ │       ││           |"    :End of search string
  #    │ │ │ │       ││           ││ ∣    :Pipe the output of grep to...
  #    │ │ │ │       ││           ││ │ \    : ...the next command (sort)
  #    │ │ │ │       ││           ││ │
  grep -oE "/${GROUP}/[A-Za-z0-9_]+" | \

  # Sort and de-duplicate the resulting list
  sort -u | \

  # Convert to absolute URLs instead of relative
  #
  #    s    :substitute
  #    │∣    :using | as the delimiter instead of the usual "/".
  #    ││^    :Find the beginning of the line and...
  #    │││∣    :replace that with...
  #    ││││'    :(step outside the '-delimited string to
  #    ││││        use the local variable)
  #    │││││$BASE_URL    :the Base URL variable.
  #    ││││││        '    :Return to the single-quoted context
  #    ││││││        │      so the "|" is a character to sed.
  #    ││││││        │∣    :End the sed substitution,
  #    ││││││        ││'    :End the quoted string
  #    ││││││        │││ ∣    :Pipe the output of sed to...
  #    ││││││        │││ │ \    : ...the next command (cat)
  #    ││││││        │││ │ │
  sed 's|^|'$BASE_URL'|' | \

# Write final list to disk
cat > "urls_${GROUP}.txt"


Fig.2 lists the resulting URL files, one per namespace, with a count of how many lines/names each contains.

Fig.2. List of Lists of Wiki URLs
wc -l urls_*
     349 urls_action.txt
       9 urls_actions.txt
       5 urls_application.txt
      17 urls_assistance.txt
      13 urls_collection.txt
      26 urls_condition.txt
     106 urls_function.txt
       3 urls_include.txt
      52 urls_manual.txt
       1 urls_playground.txt
     164 urls_token.txt
      39 urls_trigger.txt
       4 urls_wiki.txt
     788 total

As an example of what the above files contain, Fig.3 shows the contents of urls_collection.txt, which is the output of the script when given the name of the collection group/namespace using this terminal command:

./get_urls.sh collection

The above command needs to be run in the directory where the script lives (and with the script file made executable). It generates the file urls_collection.txt which contains a list of all of the Wiki pages in the collection namespace.

Fig.3. Contents of `urls_collection.txt`
https://wiki.keyboardmaestro.com/collection/Clipboard_History
https://wiki.keyboardmaestro.com/collection/Dictionaries
https://wiki.keyboardmaestro.com/collection/Dictionary_Keys
https://wiki.keyboardmaestro.com/collection/Finders_Selection
https://wiki.keyboardmaestro.com/collection/Folder_Contents
https://wiki.keyboardmaestro.com/collection/Found_Images
https://wiki.keyboardmaestro.com/collection/JSON_Keys
https://wiki.keyboardmaestro.com/collection/Lines_In
https://wiki.keyboardmaestro.com/collection/Mounted_Volumes
https://wiki.keyboardmaestro.com/collection/Number_Range
https://wiki.keyboardmaestro.com/collection/Running_Applications
https://wiki.keyboardmaestro.com/collection/Substrings_In
https://wiki.keyboardmaestro.com/collection/Variables

The various urls_*.txt lists for each of the namespaces become input for the next script, audit.sh, that does the auditing. I'll describe that in another post.

Testing a Forum Wiki Post

Peter made this post into a Forum Wiki post, which allows other people to edit it too.

I (August) had originally intended to post the output tables of audit.sh for each namespace/group. However, it turns out that the Forum software won't allow a single post of more than 32,000 characters, which is about half of the table of Action page title audits. Even just the list of Action page URLs is over 21,000 characters by itself. Peter and I are working on an alternative.

In theory, a Wiki Post like this would allow anyone else with editing permission in the main KBM Wiki to find pages that need updating in the list here, make an update, and change the list here to show that it's been fixed. But organizing that in a way that is easy to use and meets the forum post limitation is currently a challenge.

I think, in theory, the previous post I just made is now a wiki, and anyone can edit it, so August, you could try editing and putting in your document and see if that is workable.

I am happy to do the file translation macro, I have a lot of experience doing that, and it's possible I could do something server-side (although dokuwiki does not seem to have any ability to let you edit the raw file and then have it update the cache, so it probably has to be done the hard way, but that is something I can deal with).

The Forum editor says a post is limited to 32000 characters, and when I try to paste the whole thing into the Wiki Post above, it's over 115000 characters. It originally seemed to accept some of it anyway, but the next morning, that's gone.

I like the idea of Wiki pages right here in this thread. Much cleaner than a linked Google Docs page, but that 32000 character limit is a pain. The audit_action.md file with the table for that namespace is 55806 characters, so it will take two Wiki Post entries all by itself.
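One way to split such a file is sketched below with awk (untested against the real audit_action.md, and it assumes the table's two header rows are the first two lines of the file): start a new chunk file whenever the next row would push the current chunk past the limit, repeating the header at the top of each chunk.

awk -v limit=31000 '
  NR <= 2 { header = header $0 "\n"; next }           # collect the two table header rows
  {
    if (size == 0 || size + length($0) + 1 > limit) {  # start a new chunk when the current one is full
      part++
      file = sprintf("audit_action_part%d.md", part)
      printf "%s", header > file
      size = length(header)
    }
    print $0 > file
    size += length($0) + 1
  }
' audit_action.md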

@peternlewis, can you tell me how to create those pages? While the action namespace will take two, I'm sure I can fit six or more of the smaller namespaces into one Wiki Post.

For the macro, I had in mind just automating opening the Wiki Editor on each page and then pasting by hand, but if you have something better, let me know.

Just post a comment, then click the … at the bottom, click the wrench, and make it a wiki.

Feel free to experiment here or make a new topic for it if you like. The wiki above with all the disclosure triangles would not be very user friendly, so it may require some fiddling around with format to get it useful.

Looking for the wrench...


No wrench. Probably a permissions thing. I'll try in a different section, Outback Lounge maybe...

No wrench. Googling "Discourse how to create a wiki post", I find, "the wrench icon (admin actions)"

  • Permissions: Generally, you need to be a Trust Level 3 (Regular) user or higher to convert your own posts into a wiki. Staff members (moderators/admins) can convert any post.

I bumped you to level 3.


That seems to do it. Now I just need to work out how to organize it in a way that is easy to use and meets the 32,000 char forum post limitation.


Sounds good. I'm leaving it to you and ignoring the random notifications, so let me know when it's time to take the next steps.

I'm thinking of splitting it into three separate posts:

  • Methodology and URLs, this thread
  • The Audit Process, the audit.sh script and an overview of how the output tables are to be used
  • Updating the Wiki, several posts breaking up the page-by-page tables to keep track of updating