Late acceptance hill-climbing, a snake with a camera
It’s a Saturday morning, and suddenly I wonder: what’s the most compressible list of eight distinct 8-letter English words? Let’s write some Python code.
import random, re, zlib
words = open("/usr/share/dict/words").read().split()
words = [w for w in words if re.match("^[a-z]{8}$", w)]
N = 8
def compress(w):
return zlib.compress(" ".join(w).encode("utf-8"), level=9)
w = random.sample(words, N)
print(len(compress(w))) # How small can this get?
Minimizing the “cost function” len(compress(w)) is an “optimization problem” with a large search space: there are 12,529 candidate words, which means over 6 × 1032 lists-of-eight-words to consider.
We’re going to compare some increasingly clever approaches for exploring this search space:
- Trying random stuff
- Undirected local search
- Hill climbing
- Simulated annealing
- Late-acceptance hill climbing
- Conclusion
Each code blog in this post is a standalone Python script, so you can follow along at home easily. In fact, I encourage you to type out the code examples by hand: the ideas in the post will stick with you better than if you were to simply read along, or clicked “play” on a notebook.
On Arch, I had to yay -S words, but other Linuxes (and Macs, I think) come with /usr/share/dict/words
Trying random stuff
We might try random sets of words and see how well they compress. Whenever we beat our record, we’ll print the result.
import random, re, zlib
words = open("/usr/share/dict/words").read().split()
words = [w for w in words if re.match("^[a-z]{8}$", w)]
N = 8
def compress(w):
return zlib.compress(" ".join(w).encode("utf-8"), level=9)
best = random.sample(words, 8)
best_cost = len(compress(best))
print(f"({best_cost}):", *best)
while True:
attempt = random.sample(words, 8)
cost = len(compress(attempt))
if cost < best_cost:
best = attempt
best_cost = cost
print(f"({best_cost}):", *best)On my computer, this prints something like…
(68): puniness reverted scavenge critical haymaker cowlicks hankered relaunch
(67): stranger platters betiding skincare advanced resuming bourbons seizures
(66): crisping sleekest lovelorn stuffily pirating estrogen gargoyle vagrants
(65): shebangs tarragon doorpost coleslaw prequels japanned dispirit maternal
(64): adducing splodges motorman weddings enticing collages darlings pervades
(61): poisoned suspense tousling elisions discuses sheepish ogresses snippier
(60): blocking trucking booklets trekking billings quipping preciser bombings
(59): caroling scooting spooning surefire clouting flouting cackling wonkiest
and then stalls. I guess the many -ing words are working well for compressibility, but it’s hard to arrive at such a set by pure chance.
Undirected local search
Instead of creating new attempts from scratch every time — picture throwing darts into the search space at random — we can “walk around” the search space, constantly moving from the current state to a “nearby” state in some sense. This idea is called local search.
What might that mean for us? If we have a candidate list of compressible words w, how do we tweak it to find another candidate? Let’s try replacing one word at a time, swapping words between w and the rest of the dictionary. To make this convenient, we can permute the big words list itself, and say that words[:N] is current list-of-eight under consideration, while words[N:] contains all other words. Then a “step” involves simply swapping elements between these two sides of the array:
i = random.randrange(0, N)
j = random.randrange(N, len(words))
words[i], words[j] = words[j], words[i]
Here’s our full code again. Try running it yourself.
import random, re, zlib
words = open("/usr/share/dict/words").read().split()
words = [w for w in words if re.match("^[a-z]{8}$", w)]
N = 8
def compress(w):
return zlib.compress(" ".join(w).encode("utf-8"), level=9)
random.shuffle(words)
best = words[:N]
best_cost = len(compress(best))
print(f"({best_cost}):", *best)
while True:
# Randomly swap out a word
i = random.randrange(0, N)
j = random.randrange(N, len(words))
words[i], words[j] = words[j], words[i]
# Did that improve things?
cost = len(compress(words[:N]))
if cost < best_cost:
best = words[:N]
best_cost = cost
print(f"({best_cost}):", *best)With this change, instead of “throwing darts” at search space, we are stumbling through it randomly. We can see words sticking around between improvements now, especially at the beginning.
(68): tattooed smirches revivify hoarding squarest cowlings unfasten locators
(67): tattooed smirches revivify hoarding squarest cowlings cabochon locators
(66): tattooed mediates trenches hoarding squarest cowlings misusing locators
(65): overhung mediates consoles hoarding probates retained misusing centrism
(63): derriere acrobats marketer balsamic slighted rotation crappers threaten
(62): milepost scantest snugging paranoid parasols plotting spoiling unburden
(61): pointing rambling bleeping eruption sureness nosiness prairies mistreat
(60): singular pureeing choicest squarest grasping chroming testiest mechanic
(59): pleasing grapples gritters relaters toasting resoling paneling shoehorn
(58): defiling tricking swagging heisting sledding haunting mounting dentists
(56): releases tracking shacking stapling snacking stemming demising primness
But can we really expect this approach to do much better than the “fully random” approach? We are printing results that are better than anything we’ve seen, but we’re not letting that comparison guide our search at all yet. We may stumble back out of a promising region of search space just as easily as we stumbled into it. Let’s find some way to improve on this.
Hill climbing
A simple way to guide a local search is called hill climbing: only commit to those random steps that are direct improvements. If we imagine the problem space as a landscape where better solutions (lower cost) are “higher up”, we only take steps uphill.
To implement it, we only need to append this little bit to our code…
else: # If this swap *isn't* an improvement...
words[i], words[j] = words[j], words[i] # undo it
Now the behavior of our program is very different! Here are three different runs:
(66): bendable beginner cesspool northern selloffs showoffs massacre druggies
(65): bendable beginner asphodel northern selloffs showoffs massacre druggies
(64): bendable beginner asphodel northern selloffs fusspots massacre druggies
(63): bendable beginner asphodel northern selloffs fusspots massacre embroils
(62): bendable occupies asphodel northern selloffs fusspots massacre embroils
(66): eremites sciences nuisance spawning polecats lacrosse globulin chancels
(65): eremites sciences nuisance spawning polecats lacrosse trapdoor chancels
(64): eremites sciences nuisance spawning polecats lacrosse trapdoor mealiest
(62): eremites sciences nuisance screened polecats lacrosse trapdoor mealiest
(61): eremites sciences nuisance screened neurosis lacrosse trapdoor mealiest
(60): eremites sciences nuisance screened neurosis lacrosse landmine mealiest
(59): eremites sciences nuisance screened neurosis lacrosse landmine addendum
(58): eremites sciences nuisance screened neurosis lacrosse landmine lecterns
(57): eremites sciences nuisance screened schussed lacrosse landmine lecterns
(67): emerging insignia bronzing diazepam berating discoing shakeups killjoys
(65): emerging diuretic bronzing diazepam berating discoing shakeups killjoys
(63): emerging diuretic bronzing diazepam berating discoing shakeups iambuses
(62): emerging diuretic bronzing diazepam berating discoing gestated iambuses
(61): emerging diuretic bronzing diabetes berating discoing gestated iambuses
(60): emerging diuretic bronzing diabetes berating discoing disorder iambuses
(59): emerging diuretic bronzing diabetes berating discoing disorder bartered
(58): emerging diuretic denuding diabetes berating discoing disorder bartered
(57): emerging guarding denuding diabetes berating discoing disorder bartered
(56): deriding guarding denuding diabetes berating discoing disorder bartered
(55): deriding guarding denuding recorder berating discoing disorder bartered
(54): deriding acceding denuding recorder berating discoing disorder bartered
(53): receding acceding denuding recorder berating discoing disorder bartered
(52): receding acceding denuding recorder berating discoing disorder beriberi
(49): receding acceding aerating recorder berating discoing disorder beriberi
(48): receding acceding aerating recorder berating seceding disorder beriberi
(47): receding acceding aerating suborder berating seceding disorder beriberi
The third time, we beat our fully-random attempts by a wide margin. But the first two times, we were unlucky, and we got stuck at a local optimum. The hill we randomly decided to climb was not the highest hill in the landscape.
Even in the third run, we have the “odd pair” of -order words in the result we land on, breaking the -erating and -ceding pattern. Improving further on the whole set means first making the backwards step of getting rid of one of them. This is the big problem with “naive” hill climbing.
So, we’d like a less headstrong local-search algorithm, that will take steps downhill sometimes: often enough to help it discover higher hills in the landscape, but not so often that it fails to be sufficiently attentive or attracted to them.
Simulated annealing
I really like this animated GIF from the Wikipedia article on simulated annealing:

It shows a one-dimensional “hilly landscape” with plenty of local optima to get stuck on. What is this algorithm doing to find the highest hill? From looking at the GIF, we can already try to form an intuition. A temperature value is ticking down steadily. When it’s high, the red line moves erratically, accepting almost any step towards a new state. As the temperature goes down, the red line chills out, and becomes less likely to jump away.
Indeed, simulated annealing boils down to this:
- Set some initial temperature
T, and slowly tick it down to 0 as you search. - Always accept “uphill” steps with
cost < best_cost. - Accept a step “downhill”, i.e. with
cost > best_cost, randomly with probability .exp(-(cost - best_cost) / T)
Why this function? The graph of f(x) = exp(-x / T) slopes down from 1 at x = 0 to 0 as x → ∞, with T controlling the steepness of the slope. (Play around with it in Desmos!) This gives us what we want: an infinitesimally small step down x is acceptable at any temperature, but as T goes up, we become quicker and quicker to reject a step down by any larger x.
The name annealing comes from metallurgy: it refers to treating a material by heating it up a bunch and then slowly cooling it down at just the right pace.
This box is blue, which is the color of apologies. I apologize for mixing metaphors! People write about optimization problems in two directions: maximization and minimization. There’s the “hill climbing” metaphor, like in the GIF, which makes the most intuitive sense when you’re maximizing a score: “higher” means “better”. But simulated annealing is often described in terms of minimizing a cost, as we have it in our word-compressing problem: going “uphill” in the language of hill-climbing means going down in cost.
Here’s simulated annealing in a Python script:
from math import exp
import random, re, zlib
words = open("/usr/share/dict/words").read().split()
words = [w for w in words if re.match("^[a-z]{8}$", w)]
N = 8
def compress(w):
return zlib.compress(" ".join(w).encode("utf-8"), level=9)
random.shuffle(words)
current_cost = len(compress(words[:N]))
best, best_cost = words[:N], current_cost
T = 2
while T > 0:
# Randomly swap out a word
i = random.randrange(0, N)
j = random.randrange(N, len(words))
words[i], words[j] = words[j], words[i]
# Accept the swap if it's uphill, or the temperature randomly allows it.
new_cost = len(compress(words[:N]))
p = 1 if new_cost < current_cost else exp(-(new_cost - current_cost) / T)
if random.random() < p:
current_cost = new_cost
if current_cost < best_cost:
best, best_cost = words[:N], current_cost
print(f"({best_cost}):", *best)
else:
# Otherwise, undo the swap
words[i], words[j] = words[j], words[i]
T -= 0.000001It’s immensely successful, reliably finding solutions about as good as this one:
...
(37): winching pinching cinching parching munching patching punching bunching
(36): hunching munching mulching matching hatching patching punching mulcting
(35): latching patching bitching watching pitching witching hatching ditching
or this one:
...
(37): clinging flinging slinging bringing fringing tingling bridling ringings
(36): slinging clinging swagging clinking clogging swinging slogging slinking
(34): slinging clinging cringing clinking stinking stinging clonking slinking
The sad news is that to get the program to this point, I had to tweak the initial temperature (2) and step size (0.000001) until the results were any good. This is called hyperparameter optimization and nobody likes it, especially when there are many parameters to finetune before you get any promising results — which is sadly the case for simulated annealing.
Late-acceptance hill climbing
There is a brilliant alternative with a very short Wikipedia article called late-acceptance hill climbing. It’s so simple: we keep track of the last L costs, and accept a step if it’s at least as good as the current state or as what we had L steps ago. We can use a circular buffer (let’s call it history) to implement this:
import random, re, zlib
words = open("/usr/share/dict/words").read().split()
words = [w for w in words if re.match("^[a-z]{8}$", w)]
N = 8
def compress(w):
return zlib.compress(" ".join(w).encode("utf-8"), level=9)
random.shuffle(words)
current_cost = len(compress(words[:N]))
best, best_cost = words[:N], current_cost
L = 40
history = [current_cost] * L # the last L costs; starts all-equal
step = 0
while True:
# Randomly swap out a word
i = random.randrange(0, N)
j = random.randrange(N, len(words))
words[i], words[j] = words[j], words[i]
# Accept if it beats where we are now, or where we were L steps ago.
new_cost = len(compress(words[:N]))
v = step % L
if new_cost <= current_cost or new_cost <= history[v]:
current_cost = new_cost
if current_cost < best_cost:
best, best_cost = words[:N], current_cost
print(f"({best_cost}):", *best)
else:
words[i], words[j] = words[j], words[i] # undo
history[v] = current_cost # remember where we are now
step += 1There is only one hyperparameter to tweak, namely the size of history L. This choice of 40 gives me excellent results:
...
(35): wracking slacking stinging sticking stinking snacking stacking slinging
(34): blacking slacking stinging sticking stinking swinging stacking slinging
(33): blacking slacking stinging sticking slicking slinging stacking stinking
Try running it yourself and tweaking the parameter yourself.
I don’t have a great intuition for why this works well, but I did think of a nice visualization. In the video below, the state space is a blobby image, and a “snake” is wriggling around, looking for the brightest pixel. The yellowest blob contains the global optimum. The white circle is the snake’s head, the white square is a “camera” attached to the tip of its tail.
The snake’s head will move to a new nearby spot if it’s at least as bright as what either its head or the tail camera sees; otherwise it stays in place. (The little red lines are where it’s deciding not to move.) The parameter L is the length of the snake.
Conclusion
I just really like late-acceptance hill climbing. If you have a search problem with a big state space, give it a try!
Oh, and the most compressible word list is probably something like blacking slacking stinging sticking slicking slinging stacking stinking, which compresses to 33 bytes. But who knows? An important takeaway is that none of these approaches promise to find the perfectly optimal result. Often, an approximate optimum is good enough.
Can you find a better list of eight words? Let me know on Bluesky.