What do Andy Weir's new book and debugging legacy apps have in common?
If we set aside the fact that Andy is a former software engineer who became a successful writer (somebody's wet dream came true), and hence my admiration, there's something else at play here.
To be clear, I absolutely recommend reading the book itself, regardless of this article. Or better yet, look for the audio version, as people claim it's much, much better than the written one.
Now, what is the story about at all?
Believe it or not, it’s all about DEBUGGING stuff. Seriously. Debugging stuff that you’ve never ever had to deal with before! And if you ever had the pleasure of debugging somebody else’s undocumented legacy code, you’re surely aware of the thrill.
The best part, at least for me, was the inspiration and enthusiasm about debugging. Like, having a guy sixteen light years away from Earth, trying to debug why all the lights on his ship went off, without anyone to consult, is as close as it gets to how I feel when debugging somebody else's "piece of art" 🙂
Let’s dive deeper into it.
Picture this – a guy wakes up from a coma with absolutely no recollection of where he is or how he got there. All he sees is a robot arm that seems to have been taking care of him. And a white room.
Completely scared, lost and confused, he wonders what the heck happened and where the hell he is. Since he has a hard time getting up, he figures he's probably been lying there for a while. Bummer!
He starts exploring his surroundings and eventually figures out that he's on some kind of a ship. Further exploration leads to the conclusion that it's a … spaceship! And, let me tell you right there – waking up from a coma, seeing a robot arm and nobody else around you, while simultaneously learning that you're on a spaceship traveling through space is … challenging, to say the least!
Through further trial and error he figures out that he's far away from Earth and, eventually, learns that he was sent on a mission. A mission to save the Earth! And he's all alone. And he has no idea what he's saving the Earth from in the first place! Talk about the horror and stress of the situation …
So: a completely lost guy, on a spaceship, with no recollection of why or how he got there, who has to save the Earth; most likely as soon as possible. Sound familiar?
If not – you've surely never had to debug an emergency incident that renders the production system unusable; on a Sunday evening; without ANY clue what the problem is.
Napkin math to the rescue!
Napkin Math and Scientific Method
Napkin math is, roughly speaking, a method of making rough assumptions in order to predict expected outcomes. Think of it as a quick way to validate an idea.
I'll give you a very simple example. Let's assume you want to download a file that is 1 GB in size and that your (advertised) Internet connection speed is 100 Mbps, which is roughly 12.5 MB per second. The back-of-the-napkin calculation: downloading this file takes 1000 MB (1 GB) divided by 12.5 MB per second, or – around 80 seconds.
Next, you go and test this hypothesis by actually downloading this file and measuring how long it takes to do so.
If you end up within ± 10% range, your hypothesis is likely to be correct.
However, if you end up with a result that is way over the estimate (let's say it takes 300 seconds), then it's safe to assume that something is wrong with our assumptions. Next we check our contract, and if it says 100 Mbps download, we look for the next offender in line and make another hypothesis – we assume that our internet connection is OK but that the origin server can't sustain that speed. We test this by downloading another file from a completely different source. Rinse & repeat until we figure out the likely origin of the problem.
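The whole check can be sketched in a few lines of Python. The numbers are the ones from the example above; the helper name and the ±10% tolerance parameter are just for illustration:

```python
# Napkin math: estimate download time from advertised bandwidth,
# then compare the measurement against the estimate to validate the hypothesis.

ADVERTISED_MBPS = 100   # advertised link speed, in megabits per second
FILE_SIZE_MB = 1000     # a 1 GB file, in megabytes

mb_per_second = ADVERTISED_MBPS / 8              # 100 Mbps ~= 12.5 MB/s
estimate_s = FILE_SIZE_MB / mb_per_second        # ~80 seconds

def hypothesis_holds(measured_s: float, estimate_s: float,
                     tolerance: float = 0.10) -> bool:
    """The hypothesis holds if the measurement lands within ±10% of the estimate."""
    return abs(measured_s - estimate_s) <= tolerance * estimate_s

print(f"estimated: {estimate_s:.0f} s")   # estimated: 80 s
print(hypothesis_holds(82, estimate_s))   # within ±10% -> hypothesis confirmed
print(hypothesis_holds(300, estimate_s))  # way over -> something else is wrong
```

The precision is intentionally low – the point isn't the exact number, it's having a threshold that tells you whether to accept the hypothesis or form a new one.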
The reason it's called napkin math is that we are making very rough assumptions and estimates, primarily because we don't really care about specifics. It's more about making hypotheses and quickly validating them until we get to the root of the issue.
In case you're interested, there's an incredibly fun blog by Simon Eskildsen called "Napkin Math". You can find tons of interesting calculations there and some incredibly useful outcomes (e.g. using napkin math, he figured out that if your web page is <= 12 KB, you can probably fit it into the initial TCP congestion window, almost doubling the loading speed of your page!).
The scientific method, on the other hand, is – as the name suggests – napkin math with very rigid rules (yes, that's MY interpretation of it and you won't find it stated officially anywhere). Specifically, the rules say that you have to make EXACT calculations, your findings have to be repeated MULTIPLE times and, before being accepted, they have to be confirmed by multiple DIFFERENT SOURCES (e.g. other scientists).
Let's do an example. I'll assume that the Earth is flat. Next, my assumption is that, given a long and flat surface, I should be able to see my friend running away from me for … well, for as long as he can run, but let's say 42 km (due to Earth's curvature, the horizon for a person standing at ground level is only ~5–6 km / ~3–3.5 mi away). Hence, given that I could find a flat enough surface and a friend willing to run in a straight line for 42 km, my napkin math says that, after he reaches 42 km, I should still see him in full size (minus the distance effect – obviously he'll appear "smaller").
Luckily, I don't need a friend to actually do this, as it's pretty obvious he'll have completely disappeared below the horizon long before covering even half the distance, but feel free to test it.
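You can sanity-check that horizon figure with some napkin math of your own, using the standard horizon-distance approximation d ≈ √(2Rh). The 1.7 m eye height is my assumption for a standing observer:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in meters

def horizon_distance_km(eye_height_m: float) -> float:
    """Distance to the horizon for an observer at the given eye height,
    using the approximation d = sqrt(2 * R * h)."""
    return math.sqrt(2 * EARTH_RADIUS_M * eye_height_m) / 1000

# For a standing observer (~1.7 m eye height) the horizon sits roughly
# 4.7 km away - so on a round Earth, the runner drops out of sight
# long before the 42 km mark.
print(f"{horizon_distance_km(1.7):.1f} km")  # ~4.7 km
```

Again, napkin-math precision: the formula ignores refraction and terrain, but it's more than enough to reject the flat-Earth hypothesis.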
In contrast, the scientific method would require that I set up an exact hypothesis (e.g. after 42 km he should appear to be of size X) and then my hypothesis (assuming it's true, which, luckily – it's not) would have to be tested & confirmed by other scientists as well.
Two really cool things about napkin math and the scientific method prove incredibly useful for debugging legacy apps as well:
- They both revolve around a really simple framework:
- Make an observation (e.g. production system is kaputt)
- Make a hypothesis (e.g. HTTP servers are overloaded)
- Test the idea (e.g. check the load on your Load Balancer)
- Analyze findings
- If the root cause is not discovered, repeat from the first step with your new insights
- Regardless of whether your hypothesis was right or wrong, you ALWAYS end up with a new insight (e.g. it was NOT the HTTP servers being overloaded, hence it has to be something else). This is just beautiful, because you keep learning until you figure out the cause (i.e. there is no wasted effort)
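The loop above can be sketched as code. Everything here – the hypothesis names and their check functions – is a hypothetical placeholder, purely to show the shape of the process:

```python
# A minimal sketch of the observe -> hypothesize -> test -> learn loop.
# Observation: the production system is kaputt. Now walk the hypotheses.

def check_load_balancer() -> bool:
    # Hypothetical check; in reality you'd query your LB's metrics.
    return False  # pretend: HTTP servers are NOT overloaded

def check_database() -> bool:
    # Hypothetical check; in reality you'd inspect slow-query logs.
    return True   # pretend: the database IS the culprit

hypotheses = [
    ("HTTP servers are overloaded", check_load_balancer),
    ("Database is saturated", check_database),
]

insights = []  # every test teaches us something, right or wrong

for name, test in hypotheses:
    if test():
        print(f"Root cause found: {name}")
        break
    insights.append(f"NOT caused by: {name}")  # a wrong guess is still an insight
else:
    print("No hypothesis confirmed - form new ones using:", insights)
```

Note how the rejected hypothesis isn't thrown away: it lands in `insights`, which is exactly the "no wasted effort" property described above.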
Does it really work in practice?
Well, it worked for Dr. Grace in Project Hail Mary, so I see no reason why it wouldn't work for you!
Jokes aside – yes. It works. But this description is missing the element of frustration and pressure, which makes it a bit more complex than "just do steps 1 to 5 and repeat until results are achieved". Take it from a guy who can't make a proper pancake no matter how detailed the recipe is.
Sometimes you just need experience. And patience.
You know what’s the best way to get those? Practice! Ideally – on a daily basis!
I'm not saying you should aim to have your production system down on a daily basis; no. But you can always practice on medium- and high-priority incidents. Make it a challenge to solve them in <= 2 hours by following the simple approach: make a hypothesis, validate it, repeat.
The trick is in small wins (i.e. Assume, Test, Validate, Learn, Repeat) and constant practice. There’s nothing else to it, really.
- Project Hail Mary by Andy Weir — obviously, the most useful resource is the origin of this article – the book itself. Here's my review of it.
- Napkin Math Blog by Simon Hørup Eskildsen — if you ever need inspiration for debugging, just skim over any article on this blog. Seriously.
- Scientific Method on Wikipedia — page dedicated to “Scientific Method”