Getting Word Frequency Counts Using KM, Dictionaries, and Random Web Pages

This post shows how to calculate the frequencies of words in the English language. Admittedly, not many people care about this particular task, but the techniques used could be helpful to learn from, so I'm posting it here.

There's a popular (and simple) new game called "Wordle" that many people are playing (it's so popular that the New York Times just bought it for $1 million, and they are eventually going to put it behind their paywall). You can find the official website quite easily, but I hear there are some copycat websites popping up out there.

This is the kind of puzzle that I adore, because I'm good with KM and can use it to assist with the puzzle. Wordle is a game where you guess a word, and the web server tells you if you are right, or at least how close you are to being right. If you want to write a KM program to emulate Wordle, you have to decide whether you are going to program the human's side, the computer's side, or both. The human side is the trickier one because it requires a certain amount of logic and data processing. The computer side isn't that difficult to program because it mostly involves "marking" the word that the human guesses. I've programmed both sides in KM, although my code is a little sloppy right now. I could probably upload all of it, but today I want to upload only a sub-task.

The sub-task is simple: obtaining a list of the most common five-letter words in English, with frequency counts. There are websites that claim to provide this data, but some of them charge money, and the ones that don't have their data in a format that's difficult to read into KM. So I decided to collect the data on my own using KM. Why not use KM, since I love it so much?

My idea was as follows: Wikipedia has a special link that takes you to a random page on the site. The link is:

https://en.wikipedia.org/wiki/Special:Random

If you put that link into a browser and wait for the result, you get a random article from the site. The next step is to get all the words from that page. The Safari browser has a fabulous feature called "Reader," which removes all the junk from the web page and tries to return only the article's main text. Reader works accurately about 99% of the time. (In addition, it's a pretty simple, free ad remover. So I love it.) The reason it helps us here is that it removes most of the unwanted frames and menus that appear within the body of any web page, so we get only the words that matter. We will copy the words from each page into the System Clipboard, then use that text as our random word sample.
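(If you want to test the random-page idea outside of Safari, a rough command-line equivalent is the line below. It's only a sketch: curl returns the raw HTML without Reader's cleanup, so markup and navigation words would pollute the counts, which is exactly why I use Reader instead. The -L flag follows the redirect that Special:Random issues to a random article.)

curl -Ls "https://en.wikipedia.org/wiki/Special:Random"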

Once we have a random sample page of words, we're going to extract every word from the page and keep count of how often each word appears. We will do this using a KM Dictionary (which is kind of ironic, since we are using a KM "dictionary" to build a frequency "dictionary"). We will call the dictionary "Words", and each word's entry will hold a frequency count, like this (in this pseudocode example, the word is "about" and the frequency is 101):

Dictionary[Words,about]=101
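Each page we process just adds to whatever count is already stored for a word, so the accumulation step looks roughly like this (same pseudocode style; here the current page contributed 7 more occurrences of "about"):

Dictionary[Words,about] = Dictionary[Words,about] + 7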

Since Wordle is currently limited to five-letter words, we will use a shell command to strip out any words that aren't five letters. We will also strip out any capitalized words using a shell command (this will skew the frequency results a little, but for these purposes it won't hurt much).

Most of the magic in the macro will be performed by the following statement:

grep -o -E "\w+" | egrep "^\w\w\w\w\w$" | egrep -v "[[:digit:]]" | egrep -v "[A-Z]" | sort | uniq -c

Each stage of that pipeline does the following (a standalone version you can test in Terminal appears after the list):

  1. Extract each word onto its own line.
  2. Keep only the five-letter words.
  3. Remove any words that contain a digit.
  4. Remove any words that contain a capital letter.
  5. Sort the words alphabetically.
  6. Count how many times each word appears in the text.
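Here's what running the whole thing in Terminal might look like, assuming the page text is already on the System Clipboard. (This is just an illustration, not the exact Execute Shell Script action from the macro: pbpaste reads the clipboard on macOS, and the extra sort -rn at the end puts the most frequent words first.)

pbpaste | grep -o -E "\w+" | egrep "^\w\w\w\w\w$" | egrep -v "[[:digit:]]" | egrep -v "[A-Z]" | sort | uniq -c | sort -rn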

I've been running this macro for a half hour, and here's the first part of the tally:

which 463
first 357
links 317
their 281
other 215
years 177
about 173
after 170
would 168
known 159
where 148
album 138
under 136
three 128
later 116
there 109
being 107
based 105
these 103
while 92
since 89

You may notice that some odd words are high on this list, like "links", which is not especially common in normal English but is a very common word on any Wikipedia page. So this list is not perfectly accurate, but it's adequate for my purposes. Another minor problem is that my code doesn't remove words with accented characters (or strange fonts), because I'm not sure how to fix that, so there are some words in the list that shouldn't be there because they aren't English words.
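One tweak I haven't tried yet: forcing the C locale and requiring plain lowercase ASCII letters should drop the accented words, which would let a single filter replace the three egrep stages, something like:

LC_ALL=C egrep "^[a-z]{5}$"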

I used a KM subroutine to assist with my code. It helped reduce the amount of code.

Wordle1 Macros (v10.0.2)

Wordle1 Macros.kmmacros (28 KB)

You may be interested in this post. I downloaded lists of five-letter words from Scrabble sites to get letter frequencies. These words are generally legal to play in Wordle, but most of them will not be answers. As described in this New York Times article, the list of possible Wordle answers is only about 2300 words long. There's a very interesting take on using information theory to devise a Wordle guessing program in this video from 3 Blue 1 Brown. Beware that the video shows parts of Wordle's answer list—you may need to shut your eyes during those scenes.

For what it's worth, in the month-plus that I've been playing, I've never seen a plural noun or a third-person singular verb, so words that end in "s," which are very common in text, are not likely to be a Wordle answer. They are perfectly legal guesses, though.

I sometimes use grep after playing Wordle to see other words that I could have used at various points in the game. Here's a particular example.
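(To give a made-up illustration of the kind of grep I mean, rather than the specific example from that post: suppose my guesses had revealed that the answer starts with "s", contains an "a" somewhere, and contains none of "t", "o", or "e". Against the system word list on a Mac, that might look like:)

egrep "^[a-z]{5}$" /usr/share/dict/words | grep "^s" | grep "a" | grep -v "[toe]"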

Yes, letter frequency does affect one's strategy. But the reason I'm measuring word frequency is simply to help determine which word to pick when you don't know the dictionary of words that could be chosen. If two words are equally good as the next guess, I would like to choose the one that's more frequent. (Yes, I know the dictionary is visible in the JavaScript, but I'm not here to cheat, I'm here to solve problems.)

The 3B1B video is interesting, but it's not the only way to write code to solve this game. In fact, the video doesn't even clarify which of the following is "better": an algorithm with a lower average number of required guesses, or an algorithm with a lower maximum number of required guesses. So clearly there are different solutions when there is more than one possible goal.

3 Blue 1 Brown just issued a new video with a couple of corrections to the video you cited. Interestingly, the first correction addressed was a problem that I had raised in the comment section of the original video eight days earlier, though I wasn't given credit for it. So I'm slightly sad, but I'm also happy that I was able to spot an error on that respectable channel, report it in the comments, and see a correction issued a week later.

I also implied in that same comment that there was a bigger problem with the video, and it's remarkable that he spent most of this new video trying to correct that problem.

So clearly I'm not a complete idiot if I can catch errors on a channel as respectable as 3 Blue 1 Brown.

I was glad to see that Grant Sanderson also acknowledged in the followup video that strategies for a computer don’t necessarily work well for people.