Science's Less Accurate Grandmother: Did the Wordlebot Make a Mistake? (Wordle No. 1214, 15 Oct. 2024)

18 October 2024


No one needs me to explain what the Wordle is; I don't remember exactly when I picked it up, but I've been doing it since around the time it took off. (Unfortunately, I lost my stats in summer 2022 when my phone stopped working and I didn't have a New York Times account yet, so I only have a record of my last 833 games.)

And like many Wordle aficionados, I'm sure, I like to check out my scores against the Wordlebot. The Wordlebot, of course, plays with a 99 skill score, meaning it's in the 99th percentile for quality of guess. But when I went through the Wordlebot analysis of Wordle no. 1214, from Tuesday, October 15, 2024, I decided that its analysis of my guesses was wrong.

Here's my claim. Spoilers, of course, for a past Wordle.

So my initial guess was what has largely been my initial guess,* IRATE:

As you can see, the Wordlebot gives this a 94 skill score, though your opening word choice doesn't factor into your overall skill score.

Unless I get a whole lot of information from my first guess, I usually use LOCUS as my second:

It gives me three key consonants (L, C, and S) and the remaining two vowels. It's not always a great guess, but in this case it was; I got a 96 skill score off it and narrowed things down to just five words.

With this information, my third guess was CODER:

Wordlebot gave me a 90 for this; I'm not very sure why COVER is considered the superior pick.

Anyway, it was clear to me that there were three options left, just as you can see in the above screenshot: CORER, COVER, and COWER. Probably, anyway. Sometimes I systematically go through all the possible letters but still miss an option. And I had three guesses left. I could of course just make each guess in turn, but 1) that meant a decent chance of scoring 6 guesses overall, and 2) what if I had indeed missed something and the answer was some other word I hadn't noticed?

So I decided to be a bit strategic and guess VOWED. This would eliminate COVER and COWER if they were the answer, leaving me free to guess CORER on my fifth guess, and thus giving me the space to guess one more word if it somehow wasn't the right answer.
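To double-check that reasoning, here is a minimal sketch of Wordle's coloring rules. The `feedback` function and its `G`/`Y`/`_` notation are my own illustration, not anything taken from the Wordlebot, and the candidate list is just the three words I had in mind.

```python
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Return Wordle-style feedback: G = green, Y = yellow, _ = gray."""
    result = ["_"] * len(guess)
    # Answer letters that were not matched in place remain available for yellows.
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    # First pass: mark exact matches.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
    # Second pass: mark misplaced letters, using each answer letter at most once.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and unmatched[g] > 0:
            result[i] = "Y"
            unmatched[g] -= 1
    return "".join(result)

candidates = ["CORER", "COVER", "COWER"]
patterns = {word: feedback("VOWED", word) for word in candidates}
for word, pattern in patterns.items():
    print(word, pattern)
# CORER _G_G_
# COVER YG_G_
# COWER _GGG_
```

Each candidate produces a different pattern, so VOWED tells all three apart in a single guess, assuming those three really are the only words left.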

As you can see, Wordlebot says, "Nice," but actually gives it a 51 skill score. It would have picked COVER. But had it picked COVER, it would presumably have then picked COWER, and then CORER, meaning that with the information I had, it would have taken the Wordlebot 6 guesses, whereas it took me 5.

Why is this? While the Wordlebot said CORER, COVER, and COWER were all possible, it actually didn't think CORER very probable. See this list:

You'll see it gives COVER a 99 skill score, COWER a 96 skill score... and CORER a 6 skill score! Ouch. And if you scroll down to the "Comparing our guesses" frame, you can see it assigns a very low probability score to CORER:

COVER is 51%, COWER is 48%... and CORER is 1%!

Why is this? The Wordlebot's understanding of what makes for a probable word is derived from the NYT's own corpus. As per their explanation, "The bot's probabilities are estimated based on how common a word is, using the frequency of appearances in The Times since 2000 as a rough and imperfect proxy. This means the bot, just like humans, has to guess whether borderline words will wind up on the new (hidden) Wordle solution list."

I do have to admit that CORER seemed somewhat less likely than COVER or COWER to me, but COVER being fifty times as likely is not a ratio I would have come up with. Then again, the NYT is a newspaper, and I guess newspapers probably talk about people covering and cowering a lot more than they talk about people using apple corers. In day-to-day existence, while "corer" probably isn't used as much as the other two words, I don't think I would assign such a lopsided ratio.†
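For a sense of how raw frequency ratios turn into percentages like these, here is a rough sketch that normalizes relative word counts into probabilities. The NYT doesn't say exactly how the Wordlebot does this, so simple normalization is an assumption on my part, and the weights below are the Google Books Ngram ratios from the footnote, not NYT data.

```python
# Relative frequencies implied by the Ngram footnote (illustrative weights only):
# "cover" is 150x as common as "cower", and "cower" is 10x as common as "corer".
relative_frequency = {"CORER": 1, "COWER": 10, "COVER": 1500}

def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights so they sum to 1 (assumed method, not the bot's documented one)."""
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

ngram_probs = normalize(relative_frequency)
for word, p in ngram_probs.items():
    print(f"{word}: {p:.2%}")
```

Under those weights COVER lands above 99% and CORER below a tenth of a percent, so a pure Ngram weighting would be even harsher on CORER than the Wordlebot's 1%.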

So while I can see why Wordlebot went with COVER, as it basically had a 50% chance of getting the solution in 4 according to its own metrics, it would have actually got it in 6, whereas I guaranteed getting it in 5 with my strategy. So I think I deserve a better skill score than 51 on VOWED, and a better skill score than 84 on the puzzle overall, neener neener.
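The trade-off can be put in numbers with a small sketch. The probabilities are the ones Wordlebot showed me; the strategy tables and the function name `expected_guesses` are my own illustration, and they assume the only possible answers are those three words.

```python
# Probabilities as reported by Wordlebot, plus an "all equally likely" alternative.
bot_probs = {"COVER": 0.51, "COWER": 0.48, "CORER": 0.01}
uniform_probs = {word: 1 / 3 for word in bot_probs}

# The guess number on which each strategy finishes, for each possible answer.
# Sequential: guess COVER (4th), then COWER (5th), then CORER (6th).
sequential = {"COVER": 4, "COWER": 5, "CORER": 6}
# Splitter: spend the 4th guess on VOWED, then solve on the 5th whatever the answer.
splitter = {"COVER": 5, "COWER": 5, "CORER": 5}

def expected_guesses(finish: dict[str, int], probs: dict[str, float]) -> float:
    """Probability-weighted average of the guess on which a strategy finishes."""
    return sum(probs[word] * finish[word] for word in finish)

for name, probs in [("bot", bot_probs), ("uniform", uniform_probs)]:
    print(name,
          "sequential:", round(expected_guesses(sequential, probs), 2),
          "splitter:", round(expected_guesses(splitter, probs), 2))
print("worst case:", max(sequential.values()), "vs", max(splitter.values()))
```

With the bot's probabilities, guessing in sequence averages about 4.5 guesses against a flat 5 for VOWED, which is presumably why it prefers COVER. If the three words are treated as equally likely, both approaches average 5, but only the sequential one can run to 6.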

* Once IRATE was the actual answer, I switched to a strategy I saw on the Wordle subreddit for switching things up, which was to use the previous day's answer as the next day's starting word. This was fun if tricky, and taught me to be a bit more strategic about my guesses. But as I neared exceeding a particularly long streak, I switched back to using IRATE in order to make things easier on myself.

† Interestingly, in the Google Books Ngram corpus for 2022, "cover" appears 150 times as often as "cower," and "cower" appears 10 times as often as "corer."

1 comment:

Unknown, 17 April, 2026 04:44
Yes, this does happen to be a “feature” of the bot. It’s a really gray area as far as what words might be solutions, and what might not be. For example, one of the previous solutions was LORIS. I had no idea what it meant, and honestly, even after having looked it up, I already forget what it means. So should every word that’s at least as common as LORIS be treated with equal weight as all the very common words? I honestly don’t know the answer to that question.

The only way to determine this is to come up with a list of five-letter words, and then sort them all by “commonness” by some measure, whether it be the New York Times bot, or Google Ngrams, or something else. Then perhaps pick 100 words that are ranked around 2000th on the list. That would make up the bottom 5% of the top 2000 words. Then look to see how many of those words have been solutions so far, and see if they make up somewhere close to 5% of them. How close they are to being 5% of all the previous solutions would determine how much they should be discounted.

Of course, there are flaws with this also. For example, if less common words have been underrepresented so far, they could possibly be OVER-represented going forward, as the puzzle uses up more of the common words and doesn't start reusing them all that often.

The idea of discounting less common words is understandable on some level, but I’m not sure it’s implemented properly. In any event, if you just look at the bot’s scoring as for instructional use only, and understand its limitations, you can take occasional cases like this one with a grain of salt.