We had a funny situation. Our Sphinx Search Engine crashed the other day and, once restarted, it reported that it was doing crash recovery of a Real-Time (RT) index using a 76GB binary log. And you don’t need to be a Computer Science major to understand that recovering anything from 76 gigs would take … a while.
Now, here’s what puzzled me. We are (currently, at least) using RT-indexes for keeping daily diffs only, and we rebuild the disk-based indexes during the night. That means that, no matter how many diffs we generated in a day, it simply can’t be 76 gigs. Like, trust me, there’s NOT that much data.
On the other hand, Sphinx was obviously trying to recover itself using a massive 76-gig log file. And you know what? That HUGE log file means only one thing – it is AS OLD AS HELL! But replaying data from days or even weeks ago MAKES NO SENSE. We don’t care how the index looked weeks ago. We only care about the last 24 hours or so!
We further learned that Sphinx seems to erase binary logs ONLY after a clean shutdown, which is something you should keep in mind if you are using the Sphinx Search engine.
The end result of all this was that I got inspired to write a bit about crashes, recovery and what role binary logs play there.
What is a ‘crash’?
I figured that if you want to talk about RECOVERY, you need to understand what precedes it. And what precedes it is a crash.
Crashing means that your app was violently stopped in the middle of whatever it was doing.
One common example would be pressing that reset button on your PC, which would force it to reboot and would cause your OS to start the recovery process.
Problem with crashing is that it could happen that your app was in the middle of DOING SOMETHING, and being stopped violently means that it won’t know where it left off. It will be in a “limbo”.
Let’s assume that the app that we are referring to is a database. Let’s further assume that you asked your DB to insert 1000 rows into table X. Well, assuming that your table X has some indexes and whatnot, your DB would be doing a couple of operations in parallel – writing data to disk, rebuilding indexes, adding something to cache, … I don’t know exactly what happens, but you could assume there are couple of things going on in parallel.
For the sake of this example, let’s assume that it takes 1ms per row to be inserted. This means that it’d take 1s to insert 1000 rows. Well, what would happen if at the 400ms mark your DB server lost its power? Woop and it’s gone.
Once it’s back online, how would it know whether it’s in the right state or not? Since it never managed to process all 1000 rows, you’d be left in a limbo state with half-inserted data, and you could never be sure exactly where it left off. It could have been in the middle of writing the 450th row to disk, with no way of knowing whether that row was actually written or not.
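To make the “limbo” concrete, here’s a tiny Python sketch (the batch, the crash point and the numbers are all made up for illustration): we insert rows one by one, and a simulated power cut stops us mid-batch. Nothing on “disk” records how far we got.

```python
def insert_rows(rows, crash_after=None):
    """Simulate inserting rows one by one.

    If crash_after is set, the 'power' is cut at that row and we return
    whatever made it to 'disk' so far. Crucially, after a restart nothing
    tells us that we stopped there -- that's the limbo state.
    """
    table = []
    for i, row in enumerate(rows):
        if crash_after is not None and i == crash_after:
            return table  # violent stop: partial data, no marker of progress
        table.append(row)
    return table


rows = list(range(1000))
survived = insert_rows(rows, crash_after=400)
print(len(survived))  # 400 rows made it; the other 600 are simply gone
```

A real database is in a worse spot than this sketch suggests: indexes, caches and data pages may each have progressed a different amount, so even the “400 rows” number isn’t knowable after the crash.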
The process of resurrection and returning yourself back to proper state is called, you guessed it – recovery.
Recovery and Binary Logs
Picture this – you are a web designer and you have a client sitting next to you, describing what they want.
You could be building the website along with what they’re saying, sure, but one thing is for certain – they’re talking way faster than you can convert that to HTML code. You could try remembering everything (i.e. keep it in your “RAM”), but what if something urgent pops up and you forget what they said? It’d be lame to call them and ask them to repeat what they wanted, no? Clients want to be sure that what they said will be executed.
So, what do you do? Well, you can do what’s easiest – record their words (e.g. using your phone) and re-listen if and when needed. Easy peasy. You could still keep everything in your memory, but if shit hits the fan and your memory crashes – you could easily RECOVER. The recorded audio is NOT the end product that they need, but it provides enough information so that you can REBUILD what they were saying initially.
This is exactly how crash-recovery processes work. Before doing any work at all, you ensure that all incoming data (e.g. INSERTs, UPDATEs and DELETES) is stored, in its raw form, to some durable device.
Once it’s written, then and only then do you tell the client that their WRITE operation was successful.
This file is referred to as a binary log. Why binary log? Well, for one, because it’s a binary file, and two – because it’s a log file. Append-only log-file (i.e. you can only append, but you can’t change anything), to be more precise. It contains raw, unstructured data, which, in case of crash, you could literally REPLAY and get back to the state where you were before the crash happened.
And that’s really all there is to it. That’s what those binary logs are for. They contain the raw commands that, if needed, can be replayed from the beginning in order to derive the end state of your DB. And yes, they also happen to be used for replication purposes because, guess what – you just send those WRITE commands to another server, have that server replay them and, voila – you end up in the same state as your primary.
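The whole write-then-acknowledge-then-replay cycle fits in a few lines. Here’s a minimal Python sketch of the idea — a hypothetical key-value store, not how Sphinx or MySQL actually format their logs (real binary logs use a compact binary encoding, not JSON lines):

```python
import json
import os

LOG_PATH = "binlog.jsonl"  # hypothetical log file, one raw command per line


def apply_op(op, state):
    """Apply a single raw command to the in-memory state."""
    if op["cmd"] == "INSERT":
        state[op["key"]] = op["value"]
    elif op["cmd"] == "DELETE":
        state.pop(op["key"], None)


def write(op, state):
    """Append the raw command to the durable log BEFORE applying it.

    Only after the log record is safely on disk do we touch the in-memory
    state and acknowledge the write to the client.
    """
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(op) + "\n")
        log.flush()
        os.fsync(log.fileno())  # make sure it actually hit the disk
    apply_op(op, state)
    return "OK"  # safe to acknowledge: the log has it


def recover():
    """Crash recovery: replay the log from the beginning to rebuild state."""
    state = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                apply_op(json.loads(line), state)
    return state
```

If the process dies at any point after `write()` returned “OK”, `recover()` rebuilds the exact same state by replaying the log — which is also why a log that only gets truncated on clean shutdown (as with Sphinx above) can grow huge and make recovery painfully slow.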
Frankly speaking, that’s all there is to it. And I think that’s actually amazing.
In case you’re still wondering why I figured that 76 gigs of Sphinx recovery file didn’t make any sense — well, since we keep only a single day’s worth of data, there was no way we had written 76 gigs in a day. As simple as that 🙂
- Designing Data-Intensive Applications book by Martin Kleppmann — I keep referring to this book in pretty much every DB & data-related article that I write. And that’s for a reason – it’s just a must-read for anyone dealing with data.
- MySQL Binary Log Overview — official documentation from MySQL, describing how Binary Logging works
- Sphinx RT-Index Binary Logging — official documentation from Sphinx, describing how Binary Logging works