Data Disaster & Recovery: Oversight

An hour and a half of bated breath and the kind of sweating you’d only expect from Kevin James peeling an orange was the result of a critical error in our live system today.

We don’t exactly have big data, but we still like the ease of use MongoDB brings to the table. Mongo also brings with it a somewhat unfamiliar JavaScript syntax for those most accustomed to the world of SQL. If you had to, say, adjust all your items so tax ranges from 0-100 instead of 0-1, it would look something like this:

db.myItems.find().forEach(function (item) {
  db.myItems.update({ '_id': item._id }, { $set: { 'tax': (item.tax*100) } });
});

This would loop through each of the items in your database and set the tax field to 100 times its current value. No problem, and it’s in fact exactly what we want.

Knowing Enough to be Dangerous

Unfamiliar as it may be at first, write a few of these migrations and it starts to become second nature. One may even assume that the docs don’t have to be looked at. That’s how you end up with something subtly different:

db.myItems.find().forEach(function (item) {
  db.myItems.update({ '_id': item._id }, { 'tax': (item.tax*100) });
});

What does that missing $set amount to? It unfortunately amounts to the entire contents of your item being replaced by just one field: tax in this case.
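
To make the difference concrete, here is a simplified in-memory model of the two update forms, written as plain JavaScript rather than real Mongo shell calls. The applyUpdate helper and the sample item are hypothetical, and the helper handles only $set and whole-document replacement. It is a sketch of the behavior described above, not of how the server implements it.

```javascript
// Simplified, hypothetical model of Mongo's update semantics (not a real driver API).
// Handles only two cases: an update document using $set, and a plain replacement document.
function applyUpdate(doc, update) {
  var usesOperators = Object.keys(update).some(function (key) {
    return key.charAt(0) === '$';
  });

  if (usesOperators) {
    // Operator form: copy the document, then overwrite only the fields named in $set.
    var result = Object.assign({}, doc);
    Object.assign(result, update.$set || {});
    return result;
  }

  // Replacement form: every field except _id is discarded.
  return Object.assign({ _id: doc._id }, update);
}

// Made-up sample item for illustration.
var item = { _id: 1, name: 'Widget', price: 20, tax: 0.5 };

var withSet = applyUpdate(item, { $set: { tax: item.tax * 100 } });
var withoutSet = applyUpdate(item, { tax: item.tax * 100 });

console.log(Object.keys(withSet));    // keys: _id, name, price, tax
console.log(Object.keys(withoutSet)); // keys: _id, tax
```

Both calls produce the same tax value, which is why a check that looks only at tax cannot tell them apart.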

When this happens on a live system, it’s time for production support to break out the well-oiled disaster recovery plan…or create one in a hurry.

Mis-Testing

How can one miss something so obvious? It starts with some oversight that snowballs out of control all the way to production.

Our item entity is quite large. Too large to print to the Mongo console and make sense of. For this reason we tested our migration locally by viewing the distinct values of our pre-migration tax fields like so:

db.myItems.distinct('tax');

This yields a collection of values like ["0.05", "0.07", "0.09"].

Just what we expected — now we’ll run the script and do another distinct select.

It yields ["5", "7", "9"]. Perfect, this migration seems to be done. QA relied on the same verification, so it slipped through the cracks there as well, and into the production queue it went.

Note the two painfully obvious and embarrassing oversights on our part:

  1. We didn’t fetch an entire item after the migration
  2. We didn’t load up the app

Either of these two would have clearly shown we weren’t done.
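
A check along the lines of the first point could be as small as comparing field names on one document before and after the migration. The sketch below is plain JavaScript; missingFields is a hypothetical helper, the sample documents are made up, and the commented lines only assume how the inputs might be captured with findOne in the Mongo shell.

```javascript
// Hypothetical helper: list the field names present before a migration but absent after it.
function missingFields(before, after) {
  return Object.keys(before).filter(function (key) {
    return !Object.prototype.hasOwnProperty.call(after, key);
  });
}

// In the Mongo shell the inputs might be captured like this (assumed usage):
//   var before = db.myItems.findOne();
//   ... run the migration ...
//   var after = db.myItems.findOne({ _id: before._id });

// Made-up documents standing in for a real item.
var before = { _id: 1, name: 'Widget', price: 20, tax: 0.5 };
var afterBad = { _id: 1, tax: 50 };
var afterGood = { _id: 1, name: 'Widget', price: 20, tax: 50 };

console.log(missingFields(before, afterBad));  // name and price are gone
console.log(missingFields(before, afterGood)); // empty: nothing was lost
```

Because it compares only field names, this works even when the item is too large to read comfortably in the console.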

It’s Gone

Our four previous Mongo migrations had gone so smoothly that prod support ran this one on the live system without much concern, until a shocked member of the team walked over.

“You better look at this. The production data is gone.”

I had never been here before. My stomach dropped. I started sweating and couldn’t seem to stop. I walked over and looked at the GUI, which revealed that, yes, all we had left was the tax field. An odd calm overtook the team.

Race Against the Clock

There wasn’t a fresh backup/dump taken before the migration, but we did have a replicated database on an hour delay. We could get all the data back from there and at most lose one hour of work. It seemed like a good option and it was. It would have been better still if we had known the credentials right away. I’ll spare the searching details: with about 25 minutes left before the replica’s data was overwritten by the new tax-only documents, we got a dump of the database.

The tools that come with Mongo make it fairly straightforward to take backups and load from those backups. After getting our dump, the restore went smoothly and the adjusted migration was applied seamlessly.

Never Again

We identified a number of points that all seem quite obvious in retrospect. The perfect storm can happen, though, and we’re committed to never letting this happen to us again.

Our new process:

  1. Take a backup before applying migrations.
  2. Don’t run the migrations on a live system…especially when waiting until 5:00 has no business impact.
  3. Deploy every single migration to the QA environment. Even if it is the most urgent bug fix in the world.

It’d be great to automate the whole process. We hope to get there soon. Until then, these are some of our quick takeaways…any one of which would have made this situation significantly less stressful or prevented it entirely.

Consider these takeaways, if only because you once read about that guy who got burned by the perfect storm.

Team

Perhaps the greatest takeaway one can have from any critical mistake is the quality of the team. I’m proud to be part of a team that didn’t take the opportunity to blame anyone, but instead rallied around solving the problem. In many cases, developers with very little Mongo experience stepped up to search for and provide solutions. Our production support team is calm under pressure and deserves the utmost confidence.


Published Mar 15, 2012