This post shows how to calculate the frequencies of words in the English language. Not many people care about this task in itself, but the techniques used could be helpful to learn from, so I'm posting it here.
There's a popular (and simple) new game called "Wordle" that many people are playing. It's so popular that the New York Times just bought it, reportedly for a price in the low seven figures, and eventually they are going to put it behind their paywall. You can find the official website quite easily, but I hear there are some copycat websites out there.
This is the kind of puzzle that I adore, because I'm good with Keyboard Maestro (KM) and I can use KM to help with the puzzle. Wordle is a game where you guess a word, and the web server tells you if you are right, or at least how close you are to being right. If you want to write a KM program to emulate Wordle, you have to decide whether you are going to program the human's side, the computer's side, or both. The human side is trickier because it requires a certain amount of logic and data processing. The computer side isn't that difficult to program because it mostly involves "marking" the word that the human guesses. I've programmed both sides in KM, although my code is a little sloppy right now. I could probably upload the code, but today I want to upload only a sub-task.
The sub-task is simple: obtaining a list of the most common 5-letter words in English, with frequency counts. There are websites that claim to provide this data, but some of them charge money, and the ones that don't have their data in a format that's difficult to read into KM. So I decided to get this data on my own using KM. Why not use KM, since I love it so much?
My idea was as follows: there's a website called Wikipedia that has a link to a random page on its site. The web link is:
If you put that link into a browser and wait for the result, you get a random article from the site. The next step is to get all the words from that page. The Safari browser has a fabulous feature called "Reader" which removes all the junk from the web page and tries to return only the article's main text. In my experience, Reader works accurately about 99% of the time. (It's also a pretty simple, free ad remover, so I love it.) The reason it helps us here is that it removes most of the unwanted "frames/menus" that appear within the body of a web page, so we get only the words that matter. We will copy the words from each page into the System Clipboard, then use that text as our sample of random words.
Once we have the sample page of random words, we're going to extract every word from the page and keep count of the frequency of every word. We will do this by using a KM Dictionary (which is kind of ironic, since we are using a KM "dictionary" to build a frequency "dictionary"). We will call the dictionary "Words", and each word will be a key whose value is its frequency count. For example, the key "about" might hold the value "101".
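For readers who want to see the same bookkeeping outside KM, here is a minimal shell sketch of a word-to-count dictionary. It is not the macro's actual code: the helper name `merge_counts` and the sample numbers are made up for illustration, and it assumes each page's tally arrives as lines of the form "count word".

```shell
# merge_counts is a hypothetical helper, not part of the original macro.
# Input: lines of the form "<count> <word>", possibly from several pages.
# Output: one line per word with the summed count, highest count first.
merge_counts() {
  awk '{ words[$2] += $1 }    # words[word] holds the running total
       END { for (w in words) print words[w], w }' |
    sort -k1,1nr -k2,2
}

# Two made-up page tallies, fed in one after the other.
{
  printf '101 about\n2 links\n'   # tally from a first page
  printf '4 links\n1 about\n'     # tally from a second page
} | merge_counts
```

The awk array plays the role of the "Words" dictionary: looking up a word that isn't there yet starts it at zero, so the same line both creates and updates entries.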
Since the game Wordle is currently limited to five-letter words, we will use a shell command to strip out any words that aren't five letters long. We will also strip out any capitalized words using a shell command. (This will skew the frequency results a little, since it also drops ordinary words at the start of a sentence, but for these purposes it won't hurt much.)
Most of the magic in the macro will be performed by the following statement:
grep -o -E "\w+" | egrep "^\w\w\w\w\w$" | egrep -v "[[:digit:]]" | egrep -v "[A-Z]" | sort | uniq -c
Each stage of the pipeline above does the following, in order:
- Put every word on its own line.
- Keep only the five-letter words.
- Remove any words that contain a digit.
- Remove any words that contain a capital letter.
- Sort the words alphabetically.
- Count how many times each word appears in this text.
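To see the stages in action, here is a small sketch that runs the same pipeline on a made-up sentence. The function name `five_letter_counts` and the sample text are only for illustration, and `egrep` is spelled `grep -E`, which is equivalent and avoids the deprecation warning that newer versions of GNU grep print.

```shell
# five_letter_counts is a hypothetical wrapper around the pipeline above.
# It reads text on standard input and prints "<count> <word>" lines.
five_letter_counts() {
  grep -o -E '\w+' |            # put every word on its own line
    grep -E '^\w\w\w\w\w$' |    # keep only five-character words
    grep -v -E '[[:digit:]]' |  # drop words containing a digit
    grep -v -E '[A-Z]' |        # drop words containing a capital letter
    sort |                      # sort so identical words are adjacent
    uniq -c                     # count each run of identical words
}

# Made-up sample text standing in for a copied Wikipedia page.
printf 'These links point to other links about words. Route 12345 is not a word.\n' |
  five_letter_counts
```

With this input, "These" and "Route" are dropped for their capitals and "12345" for its digits, leaving "links" counted twice and "about", "other", "point" and "words" once each. The amount of leading whitespace that `uniq -c` puts before each count varies between systems. Note also that `\w` matches the underscore, so a five-character token containing one would slip through.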
I've been running this macro for a half hour, and here's the first part of the tally:
You may notice that some odd words are high on this list, like "links", which is not especially common in normal English but appears on almost every Wikipedia page. So this list is not perfectly accurate, but it's adequate for my purposes. Another minor problem is that my code doesn't remove words with accented or other non-English characters, because I'm not sure how to fix that, so there are some words in the list that shouldn't be there because they aren't English words.
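One possible way to handle the accented characters, offered as an untested-in-KM suggestion rather than part of the macro: run the length filter in the C locale and accept only the ASCII letters a to z. The helper name `ascii_five_letter` is hypothetical, and the sketch assumes the text is UTF-8 with one word per line.

```shell
# ascii_five_letter is a hypothetical filter, not part of the original macro.
# In the C locale, [a-z] matches only the 26 unaccented lowercase letters,
# and an accented letter in UTF-8 takes more than one byte, so words such as
# "naïve" fail the exactly-five-characters test and are dropped.
ascii_five_letter() {
  LC_ALL=C grep -E '^[a-z]{5}$'
}

# Made-up sample words, one per line.
printf 'señor\nabout\nnaïve\nLinks\nab1cd\nlinks\n' | ascii_five_letter
```

Because the pattern only admits lowercase a to z, this one filter would also cover the digit and capital-letter checks, so it could stand in for the three middle stages of the pipeline. Here only "about" and "links" survive.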
I used a KM subroutine in the macro, which helped reduce the amount of code.
Wordle1 Macros (v10.0.2)
Wordle1 Macros.kmmacros (28 KB)