3blue1brown

patreon


Some Wordle fun (rough cut)

Hey everyone,

I decided to take a digression from the main project I was working on for most of January to make a "quick" little video using Wordle as an excuse to teach some information theory. Here's a preview of the rough cut, which I'll aim to finalize today and publish tomorrow. If you catch any errors (which are inevitably in there, since I have yet to do a proper final pass), do let me know.

(Edit: And now the final video is up, thanks for all the helpful catches!)

-Grant


Comments

This is interesting. What about approaching this through a Neural Network, where grey returns 0 for that letter in all places, amber returns a value of 0 for that letter in that space but .25 for that letter in each of the remaining spaces, and green returns 1 for that letter in that space?

Very cool. My question: if you were going to memorize two words to always be your opener, no matter what the result of the first try is, what would they be? Like your other nails suggestion.

Leo Barlach

Grant @ 7:13 - "Here's a good puzzle for you. What are the three words in the English language that start with a W, end with a Y, and have an R somewhere in the middle?" Me (who hasn't seen the game before, I'm in the control group) - "Hmm, let's see. There's 'worry', and..." Grant - proceeds to not include that. :>

Jesse Thompson

Fantastic! A question:
* 10:00: Why does the entropy add here? As far as I can see, these two options (first letter s & has a t) are not independent, so they shouldn't add exactly, but there should be a term expressing the mutual information as well, right?
* 18:50: Awesome puns :D

Rion Boom Crabhands Keon

Hi, I'm a bit late, but I wanted to say the video is fantastic! However at (25:50) you use (1-0.58) * f(1.44 - 1.27), that is the current entropy and the _expected_ entropy after testing the 'words'. I might be wrong, but 1.27 is the expected entropy of 'words', but the (1-0.58) coefficient adds the assumption that 'words' is not a match... Would not this change the expected remaining entropy?

Actually, it looks like people already computed optimal strategies for Wordle: http://sonorouschocolate.com/notes/index.php?title=The_best_strategies_for_Wordle

Hi Grant. Could you please share the code or just tell me the statistics for other versions like Nerdle and Mathle? Thank you! Also for Wordle, 1) What do we know about worst-case strategies (can you guarantee a win in, say, 5 or 6 steps?) 2) What classes of strategies have you tried? Did you look at depth-first backtracking algorithms and the genetic algorithms used for the Mastermind game?

Hi there. Long time listener, first time commenter. I made my own "Wordle", here: https://madc0w.github.io/wordle/ The dictionary is hand-curated, and can be found here: https://madc0w.github.io/wordle/dict.js

Yes, they'd have to be independent, or you'd have to be using conditional probabilities... like, if "it starts with an s" has 2 bits, and "given it starts with an s, it has a t" has 3 bits. Which matches up with the corresponding probability calculations... P(A and B) = P(A)P(B|A), and if A and B are independent then P(A and B) = P(A)P(B).

Phillip Bradbury
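The conditional-probability point in the comment above can be checked numerically. This is a minimal sketch, not code from the video: the helper name `bits` is made up for illustration, and the probabilities 1/4 and 1/8 are simply the values that correspond to the 2-bit and 3-bit figures quoted in the comment.

```python
import math

def bits(p: float) -> float:
    """Information (in bits) of observing an event of probability p."""
    return -math.log2(p)

# Assumed illustrative values matching the comment's 2 bits and 3 bits.
p_a = 1 / 4            # P(A): "it starts with an s"
p_b_given_a = 1 / 8    # P(B|A): "it has a t", given that it starts with an s

# Product rule: P(A and B) = P(A) * P(B|A)
p_a_and_b = p_a * p_b_given_a

# Bits add exactly when the second term is the conditional information.
total_bits = bits(p_a) + bits(p_b_given_a)
print(bits(p_a_and_b), total_bits)
```

If the unconditional P(B) were used in place of P(B|A), the sum would match only when A and B are independent, which is the caveat raised in the thread.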

Hi Grant, interesting video, enjoyed the content. It would be cool if in the final you could link to your code on github (or other code sharing platform of choice) so people could experiment.

James Matheson

This is awesome. Thank you. I should think about my own game

28:42 you've got "Trickes" above the rightmost panel... Otherwise great video as always. I've just recently gotten interested in information theory, and this video coming out has been a fun coincidence!

I really think the 3^5 should be explained further (and for example you cannot have one yellow and four green). Additionally the elimination of wrong guesses as well as the placement of the yellow and green highlights should be addressed and incorporated in the algorithm.

Gregor Shapiro

I assumed the game would have unlikely words as the answers. Being a word game you wouldn't expect the answers to be the most likely ones. I feel like "more unlikely yet more probable" changes the calculations a bit.

"These are the most common five letter words in the english language. Or rather these is the eighth most common. First is which after which there's their and there. First itself is not first but ninth and it makes sense that these other words could come about more often where those after first are after where and those being just a little bit less common."

This was great! Explaining (stat mech) thermodynamic entropy could also be a really good future video. WRT "tares," it's actually a not-uncommon word, but not the noun you looked up: tare weight is the empty weight of a container, and to tare a scale is to put the empty container on it and adjust its zero point to match, so that when you fill it the scale will read the net (gross minus tare) weight which is what you care about. Most kitchen and food scales, for example, have "tare" buttons on them for just that.

Yonatan Zunger

7:16 "venture even farther to the left": maybe you meant "to the right"?

Kevin Iga

"AEIOU" guess - good point!

Edith Dubiner

it does not have an 'r' or a 'd', which the correct word should have.

Edith Dubiner

that's why the probability shown is 0.00 :-)

Edith Dubiner

Minor copyedit note: you have "useing" at 4:31 (s/be "using"). My personal thoughts on a Wordle bot would be that it'd be a great way to showcase the importance of choosing the right training set for your AI. How good would the results be if you completely ignore the Wordle word list and *just* use common English words? In AI training, there's a Scylla/Charybdis of either creating results that aren't useful because they're too general, or results that are useless because they're hyper-focused on the training set (case in point: try training your AI on tomorrow's word and nothing else - it'll be useless!). 18:51 Love that summary. Point of curiosity: Would the simulation be different if intermediate guesses weren't required to be words? What would you learn from an opening guess of, say, "AEIOU"? (I'm assuming that that doesn't count as a word, of course. If it is, it would mean "the Habsburgs will rule the world".)

Rosuav

Thanks Grant! I found the stair-step shape of the probability graphs intriguing. At first glance, I would have expected it to be more uniform and 1/x-ish. I'm thinking that this is due to the non-uniform distribution of English words. You might start with a completely uniform sample space, like picking 5 balls from a box of 26 different balls, and show how this distribution is shaped. It might make some of the ideas more accessible to the less informed in your audience. The video is nicely long and very comprehensive. It's gratifying that you're bucking the YouTube direction of short TikTok-ish videos for something that truly imparts information (pun intended).

Maybe I'm just not quite getting it, but wouldn't the 2 bits for 'has t' and 3 bits for 'starts with an s' at 10:23 in principle have to be uncorrelated to add to 5 bits?

Martin Manscher

Love this!!! Is there an information theory book anyone would recommend? This vid makes me want to dive deeper

At 25:55 I was confused why the word "wombs" was assigned a zero probability of being the right word. Seems like a perfectly good word to me.

The same spelling error recurs at 28:40.

Loving the impromptu "Who's on first" style routine with the word frequency data.

Arne Tobias Malkenes Ødegaard

"That's what I'm askin ya .. WHICH of these words is the most common?" lol

Shawn Van Ness

It was also super interesting to see the risk/reward tradeoff, where going for a low score brings some risk of failure (taking >6 guesses). Maybe worth spending a minute to analyze the expected distribution of Wordle's "hard mode", where each guess must incorporate the green and yellow letters.

Shawn Van Ness

I've often started with "stare" (anagram of "tares") so that makes me feel slightly smart.. lol. But it was interesting to see that "stare" isn't actually one of the top starting words. Presumably because the positional value of having the 's' on the end is so great. Maybe worth mentioning something about that.

Shawn Van Ness

A partial answer after some thought: In RL terminology, you're estimating a value function via empirical performance, with entropy mapping the state space down to the reals? So adding value iteration could help, but the entropy approximation remains.

I think you're right; that's a complication. I got thrashed by the word "freer" the other day, so I feel the pain.

Don Sanderson

Yes! I agree entirely!

Neal McBurnett

During the animation at 11:40, some patterns have 4 green squares and 1 yellow square. But that doesn't make sense - there's no other place that yellow letter could be at.

Eugene Pakhomov

Suggest using wordle in colorblind mode. Colors of exactly-right-letter and right-letter-wrong-position look the same to colorblind folks.

Steve Muench

Nice! Couple qs:
- Can you pinpoint what's "suboptimal" or heuristic in your algorithm, and/or draw a comparison to the infeasible solving of the enormous finite MDP (e.g., RL with reward -1 and discount 1, or dynamic programming)?
- I think it's worth emphasizing more that the use of sigmoid is an arbitrary way to inject a prior on answers. One could infer that it's more principled.
- Is there an easy relation to Huffman encodings? I can't quite place it.

'Tis the season of Wordle studies! Check out what I published yesterday, focused on what you can actually learn from what other people share. That includes just the colors of their guess results, but it turns out that that sometimes gives meaningful hints, and can even identify the exact solution. I'll update it to point at your video. On Medium: https://medium.com/@nealmcb/fun-word-patterns-exhibited-by-wordle-grid-shares-51cdaaab53f9 Python notebook: https://github.com/nealmcb/wordle_spoilers

Neal McBurnett

Really lovely video Grant, and wonderful example of how to play with math in a real-world problem! It's nice that you've quantified how little difference it makes if you know the set of words they're drawing from. Wordle can go on forever, using the same set of common words.

Norman Margolus

awesome. 4:29 "useing" => "using"

I LOVE your run-thru of the most common 5-letter words at 18:50. Sort of like "Who's on first?"

Neal McBurnett

I think that he meant that after you make your initial guess, there are 243 possible responses from the game.

MBP

At 7:27 you ask what 3 words start with W, end with Y, and have an R, but there's a fourth not mentioned - "worry." I think you don't include it because it has two Rs, but my understanding of wordle is it doesn't tell you how many times a letter occurs, so worry is a potential guess there. Does that affect the algorithm at all?

genericbandname

I'm commenting before seeing the whole thing, at 13:15 where you say something that I find very puzzling. You say, "In our case where there are 3^5 total patterns...". Where does this come from? If the 12,972 five letter words were equally probable, the information when you see one would be 13.7 bits, not 7.92. Did you previously have an example with 243 possibilities that you edited out? Edit: Okay, after watching the rest it's clear you meant 243 to just be a supposition, but using 3^5 is very specific and makes this obscure.

Norman Margolus
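The two figures in the comment above are easy to reproduce. This is a small sketch, not code from the video; the variable names are invented for illustration, and both cases assume all outcomes are equally likely.

```python
import math

# Each of the 5 tiles is grey, yellow, or green, giving 3^5 colour patterns.
n_patterns = 3 ** 5
# Number of allowed five-letter words quoted in the comment.
n_words = 12972

# Information (in bits) of one outcome among n equally likely outcomes.
bits_patterns = math.log2(n_patterns)
bits_words = math.log2(n_words)

print(f"{n_patterns} patterns: {bits_patterns:.2f} bits")
print(f"{n_words} words: {bits_words:.2f} bits")
```

The smaller number is an upper bound on what a single guess can reveal, since a guess can only return one of the 243 patterns, while the larger number is the uncertainty over the whole word list under the uniform assumption.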

Thanks for the comments! I'll add a note about the unit conventions, which I'll admit I was not aware of. I always learned it with the word "bit", e.g. as I look through MacKay now he seems to regularly use "bit" when talking about the units of Shannon information.

3blue1brown

Thanks! Fixed.

3blue1brown

Thanks! Fixed.

3blue1brown

Maybe a segue into dynamic programming?

Imre Polik

I'm not sure I see what you're referring to, are you sure the timestamp is 15:04?

3blue1brown

Good point, I'll look into simplifying that.

3blue1brown

Amazing, as always. It's very cool to see bits of your coding workflow come through in this video. In particular, starting some .py file with a bunch of code to solve an interesting problem, but also pretty seamlessly integrating animations into it which do a fabulous job of expositing your code and what it's doing in an extremely visual way. It's also great that this wordle.py file is up on github for us to peek at and try to unpack exactly what you're doing in the video. I'm especially curious about how much you've been showing what look to be screencaptures in your last few videos. It really gives the impression that you can run bits of wordle.py and create these animations which you are interacting with in real time. Is that actually what's happening here? As somebody who has tried with medium success to use ManimCE, but has not figured out a satisfying workflow to keep doing that, I would be really curious to pick your brain on what your workflow actually looks like. Thanks as always for keeping the bar so high on what quality content can look like.

Eric Severson

In any case, a great video! But I will continue using my own initial words (which I won't tell anybody!)

Daniel Armesto

The strategy stays the same, you just limit the word set it's allowed to search through. I just tried it, and it looks like it averages around 3.56

3blue1brown

Good point on the log bit; I was going to explain the p=0 case originally but guess I never quite got to that in the recording. I'll add an on-screen note. Good to know on the 99.9% bits, I'll see what I can do to add clarity.

3blue1brown

Agreed! This was probably the funniest part of any 3b1b video I've ever seen.

Eric Severson

Also, it would be fun to illustrate how much information greens, oranges and greys contribute.

Daniel Armesto

Thanks! Just fixed it.

3blue1brown

There seems to be a small typo that appears twice: " try not usEing this".

Daniel Armesto

7:00, emphasize that this chart is specific for WEARY. Also, I thought John Tukey came up with “bit”.

Max Goldstein

Someone in the background is yawning loudly after you say "apropos" at 25:04

The wordplay around 19:00 is hilarious.

Imre Polik

(Obsolete by the end of the video) Another way to improve the guesses is to only guess words that are still possible. In the abbas/abyss example the bot would have guessed libri, because that would have given the same 1 bit of information needed to decide which word is the right one, but it didn't have a chance of hitting the right word.

Imre Polik

15:04, you show a word that is already excluded by the prior information. The first word ends with R, and the information you get is that there is an R, but in another position, so words that end in R are not possible guesses (by contradiction, if the word ended in R, the last cell would have been green).

Andrea Bedin

Small suggestion: at 12:55 there is a lot happening. 4 numbers are updating, and the letters are changing colors. This is a bit too fast to look at, so maybe you could e.g. only show 3 frames: low entropy, medium entropy and high entropy. Another idea: keeping the frame rate, but reducing the opacity of everything except the entropy term and the plot.

It is actually additive, but it's true that it should be made precise in which sense. If you sum up the *conditional* entropies of all guesses (from the first to the final solution), each conditional on the preceding ones, you always obtain the same value, which is the entropy of the probability distribution for the solutions. This point is nicely explained with fun examples in MacKay's book.
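One way to write down the additivity described above is the chain rule for entropy. This is a sketch under assumptions not stated in the thread: $X$ denotes the hidden answer, $Y_i$ the colour pattern returned for the $i$-th guess, the guessing strategy is deterministic (so each $Y_i$ is a function of $X$), and the $n$ patterns together determine the answer, padding with trivial patterns once the answer has been found.

```latex
H(X) = H(Y_1, \dots, Y_n) = \sum_{i=1}^{n} H(Y_i \mid Y_1, \dots, Y_{i-1})
```

The first equality holds because, under these assumptions, the answer and the full sequence of patterns determine each other; the second is the standard chain rule.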

Hello. Great video. My suggestion concerns the point where you say that one advantage of measuring entropy in bits is that you can add them together (using an example). That assumes that the distribution is the same both before and after you get some information, essentially that the two pieces of information are independent, which is not necessarily true. For instance, if the information was that the word contains a "Q", then a second piece of information that it contains a "U" would add little or no new information, even though knowing the word contains a "U" would add information as your first datum.

MBP

Would the "hard mode" setting, where you are forced to use the information you're given by each guess (e.g. R is yellow, your next word must contain an R), affect how you would approach the problem? Also, at 28:50 it says "trickes", which I think should be "tricks"?

Olav Valle

13:05 nitpick, but the pattern animation ends on a p=0 pattern so it shows log_2(1/p) = 0. That's fine for doing calculations, but I think a careful viewer seeing entropy for the first time would think you broke math. A tiny sidenote could fix that, or just avoid showing it at all. 19:30, the two 99.9%s confused me at first. They make more sense after watching the next section, but maybe there's something to say about how the sigmoid gets you to a relative number of "counts" for each word. Very fun watch! It was especially nice seeing the two ways of using entropy. Definitely gonna immediately pass it to my math grad student Wordle geek friends.

Nicolas Bolle

Thanks for sharing such an interesting and topical video. It's been a long time since I touched on information theory in undergrad, so I appreciated the refresher. One comment: at 17:40, I believe the actual implementation of Wordle would have colored the "BA" in "ABBAS" gray. Each of these letters has a multiplicity of one in the target word "ABYSS," and they are already correctly located as the "AB" in "ABBAS."

Also, a comment only vaguely related to the video. When you have time check out R.T. Cox's work on entropy, I think you'll love it. In short, Cox defines the notion of "question" as the collection of its possible answer statements. Dually one can also define a "statement" as the collection of questions of which it is a possible answer. It turns out that it's possible to develop a calculus for the "bearing" of a question that's dual to the calculus for the probability of a statement. And the dual calculus has entropy at its core. The reference is "Of inference and inquiry, an essay in inductive logic", pp. 119–167 of Levine & al (eds): "The Maximum Entropy Formalism", MIT Press 1979.

Hi Grant, great new video! Just wanted to point out that the unit of information in base-2 logarithm, according to the International Organization for Standardization (ISO), is the "Shannon", not the "bit". See for example https://en.wikipedia.org/wiki/Shannon_(unit) (I can share the standard if you like). The bit is a unit of storage capacity. There are some good reasons to keep them separate. "Bit" is still widely used among mathematicians, but maybe it's worth pointing out the standard unit too? I liked very much your final reasoning about which candidate word to choose, which is basically a decision-theoretic problem, and indeed you solve it using Decision Theory (maximization of expected utility). Maybe it's useful to point this out too? Finally, there's an interesting property about the sequence of guesses (from the first guess to the final solution) that's always neat to point out: the sum of their entropies (each conditional on the previous guesses) is always constant and equal to the entropy of the probability distribution for the solutions. Sorry for not explaining this well: check out MacKay's book (https://www.inference.org.uk/itila/book.html) around page 71, "The game of submarine" :)

Really great video (although I'm in the Wordle control group (see xkcd)). 28:40: "useing" again. 28:43: "trickes" should be "tricks".

At 17:40. Format: [green], (yellow), grey.
[A] [B] (B) (A) [S] should be [A] [B] B A [S].
[A] [B] (B) (A) [S] implies the final answer must be ABABS, not ABYSS.
Maybe you're checking for the letter's inclusion in the whole word and not only in the remaining unsolved positions?
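The duplicate-letter rule described in the comments on 17:40 can be sketched as a two-pass check. This is an illustrative sketch and not the code from the video: the function name `feedback` and the `G`/`Y`/`-` encoding are assumptions made here, and the rule itself is the behaviour the commenters describe, where a letter is only marked yellow while unmatched copies of it remain in the answer.

```python
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Return a pattern string: 'G' = green, 'Y' = yellow, '-' = grey."""
    result = ["-"] * len(guess)

    # Count answer letters that are NOT already matched in place (not green).
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: exact-position matches are green.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"

    # Second pass: a non-green letter is yellow only if an unmatched copy
    # of it is still available in the answer; otherwise it stays grey.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)

print(feedback("abbas", "abyss"))
```

With this rule the second B and second A of ABBAS come out grey against ABYSS, because the only B and A in the answer are already used by the green tiles.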

At 7:15 you say "farther to the left" but move farther to the right.

Nicolas Bolle

At 4:35, I think it should be 'using' instead of 'useing' (which incidentally makes it a 5 letter word as well). Great watch so far, never heard of this game so super interested.

sploopidoop

