Disk Failure: An Extended Analysis

Today, I want to talk about my worst nightmare: data loss.

As a server owner, there is nothing more important than data. Collectively, we’ve spent thousands of hours developing and playing iPwnAge, and all of this work is represented as data. And so, naturally, we make sure it’s safe: data loss is creative loss.

We make backups as often as possible, and we even back up the backups across various physical locations and storage mediums. There are daily checks of hardware health, and always-on log analysis for instantaneous notifications of server errors. Everything runs off batteries to ensure that power loss doesn’t mean data loss. We hear stories of others’ catastrophic losses, and so we learn a lesson and add another safety net. And another. Until we’re so certain data loss can’t happen that we stop worrying. And when you stop worrying, shit happens.


Shit happened.


For the past two weeks, unbeknownst to me, the server’s main solid-state disk had been quietly dying. This disk stored Main, Survival, the server’s databases, and the main operating system. It was a 128GB Samsung 850 Pro. I personally own 11 of them–I just counted. They are solid drives that perform well under lots of small reads and writes, which is exactly the kind of workload a Minecraft server generates. But nothing is perfect, and neither was this specific drive.

“But, aha”, I said. “That’s impossible. I’ve got monitoring tools checking disk health every day. If the drive was really dying, these tools would’ve reported it.” I knew S.M.A.R.T. tests aren’t a good method of predicting drive failure, but if a failure was actively occurring, surely they would know, I thought. Lmao @ myself. Turns out, the tools wouldn’t know disk failure if it was staring them right in the face. I know that because a disk failure was staring them right in the face and they said “no error”.

That “no error” message was critical. In the backup system I’ve created, the result of the disk health test determines the whole flow. Before anything is copied, the system first checks the disk. If the tools say the disk is OK, it deletes the oldest backup and then makes a new one. If the tools say the disk is bad, it stops, notifies me, and freezes everything in place. Do you see my error? I do. Hindsight is 20/20.
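
To make that concrete, here’s a minimal sketch of that gate as I’ve described it. The device path, retention count, and helper functions are made up for illustration; only the smartctl short self-test and self-test log commands are the real thing.

```python
import shutil
import subprocess
import time
from pathlib import Path

DISK = "/dev/sda"                      # hypothetical device path
BACKUP_ROOT = Path("/backups/hourly")  # hypothetical backup location
KEEP = 96                              # hourly backups x 4 days of retention

def notify_admin(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for the real alerting hook

def make_snapshot(dest: Path) -> None:
    dest.mkdir(parents=True, exist_ok=True)  # stand-in for the real copy/compress step

def disk_looks_healthy() -> bool:
    """Kick off the ~2 minute short self-test and trust whatever it reports."""
    subprocess.run(["smartctl", "-t", "short", DISK], check=True)
    time.sleep(130)  # give the short test time to finish
    log = subprocess.run(["smartctl", "-l", "selftest", DISK],
                         capture_output=True, text=True).stdout
    return "Completed without error" in log

def hourly_backup() -> None:
    if not disk_looks_healthy():
        notify_admin("disk health check failed -- backups frozen")
        return  # freeze: no deletions, no new backup
    # The fatal assumption: a "healthy" disk means the data being copied is
    # good, so it must be safe to rotate out the oldest snapshot.
    backups = sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir())
    if len(backups) >= KEEP:
        shutil.rmtree(backups[0])  # delete the oldest snapshot
    make_snapshot(BACKUP_ROOT / time.strftime("%Y-%m-%d_%H00"))
```

The entire rotate-or-freeze decision hangs off that single pass/fail result from the self-test log.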

Even at this moment, smartctl reports no error when the two-minute short test is run.

The major flaw in the backup system was that I assumed too much. I placed too much trust in the presumed accuracy of the health tool, smartctl. I figured that if the disk was “healthy”, then it was safe to delete the oldest backup. That deletion was necessary, as a single day of backups was a whopping 2TB of space. We were capable of storing 4 days of backups, which was fine for the past 7 years: the only reason we needed historical backups was to fix grief. If hardware failed, no backups were deleted and no backups were made. We would just roll back to the most recent snapshot (which was at most ~50 minutes old).

What we never anticipated was a scenario where the disk reported “healthy” but returned garbage data. It’s something I never imagined. Disks have a ton of error-detecting algorithms that let the operating system know if SHTF. Only, my drive never mentioned anything. Not until last week, when things got REALLY bad and it started spitting out “uncorrectable errors”. But by then, it was too late. All the good backups had already been deleted. What’s left is backups of garbage data. Did you know that we changed our server name from iPwnAge to �@�@. @DX`S`z@@�`������� @� `���`�(`�`��@``@,A?
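
As an aside: most drives also track error counters (reallocated sectors, pending sectors, uncorrectable errors) as SMART attributes, separate from the pass/fail verdict of a self-test. Here’s a small, hypothetical sketch of watching those counters with `smartctl -A`. The attribute names vary by vendor, this isn’t something Aegis1 did, and I can’t say it would have caught this particular failure any earlier, but it’s a cheap extra signal.

```python
import re
import subprocess

DISK = "/dev/sda"  # hypothetical device path

# Error-related SMART attributes whose raw value should normally stay at 0.
# Names differ between vendors, so treat this set as an assumption.
SUSPICIOUS = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Uncorrectable_Error_Cnt",
    "CRC_Error_Count",
}

def attribute_red_flags(disk: str = DISK) -> dict:
    """Parse `smartctl -A` output and return any suspicious attribute with a nonzero raw value."""
    out = subprocess.run(["smartctl", "-A", disk],
                         capture_output=True, text=True).stdout
    flags = {}
    for line in out.splitlines():
        parts = line.split()
        # Attribute rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(parts) >= 10 and parts[1] in SUSPICIOUS:
            match = re.match(r"\d+", parts[9])  # raw value is the last column
            if match and int(match.group()) > 0:
                flags[parts[1]] = int(match.group())
    return flags

if __name__ == "__main__":
    bad = attribute_red_flags()
    if bad:
        print(f"Drive is not as fine as it claims: {bad}")
```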


So, what’s the good and bad news? Good news first: FTB and MMC are completely unharmed. They’re stored on a disk separate from the rest because y’all insist on generating 70GB maps. Bad news? Main, Survival, and some important server information are lost (most notably, the economy and previous bank levels). Also, the server will be offline for a bit longer while I rewrite the backup logic to prevent this from happening again. Also also, I’m going to be renting a new server specifically for permanent storage of weekly backups (that’s 110TB a year, so if you know a place that offers storage for cheap, lmk).

I’ll be bringing the disk to a professional data recovery service, so hopefully they can salvage some data. But I don’t have high hopes right now. I’ll keep everyone updated in Discord.


Side notes:

  • I’m not exaggerating: 24 compressed backups of the server add up to 1.97TB of disk usage
  • The hour-long smartctl extended test does show disk errors. But my backup system used the 2 minute short test in the interest of keeping overhead down (hourly backups can’t take the whole hour to complete)
  • I’ve been writing Aegis2, the new backup system, for a while now. It includes localized snapshots that only back up changes made within the past hour (see the sketch after these notes). The goal is to reduce the total amount of storage needed, which lets us keep more history. But it would’ve included the same health-check logic as Aegis1, so this failure still would’ve occurred
  • When the disk went from quietly screwing things up to blatantly killing everything, MySQL resource usage skyrocketed: the database was corrupted, and every time MySQL tried to repair it, it hit a disk read error, crashed, and restarted.
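
On the Aegis2 note above, here’s a minimal sketch (with made-up paths, not Aegis2’s actual code) of what an hour-scoped incremental snapshot can look like: walk the world directory and copy only files whose modification time falls inside the last backup window.

```python
import shutil
import time
from pathlib import Path

WORLD = Path("/srv/minecraft/world")      # hypothetical world directory
SNAPSHOTS = Path("/backups/incremental")  # hypothetical snapshot root

def hourly_incremental_snapshot(window_seconds: int = 3600) -> Path:
    """Copy only files modified within the last hour into a timestamped snapshot directory."""
    cutoff = time.time() - window_seconds
    dest = SNAPSHOTS / time.strftime("%Y-%m-%d_%H00")
    copied = 0
    for src in WORLD.rglob("*"):
        if src.is_file() and src.stat().st_mtime >= cutoff:
            target = dest / src.relative_to(WORLD)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)  # copy2 preserves timestamps
            copied += 1
    print(f"snapshot {dest.name}: {copied} changed files")
    return dest
```

Restoring from something like this needs a full base snapshot plus the chain of incrementals on top of it, which is exactly why the retention math gets so much friendlier.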


Update: Right now, all servers are down since the disk was also the OS disk. It’s a waiting game; Samsung is taking their sweet time getting a replacement disk to me. I can’t give any estimates ATM.
