Skedsheet Blog

Where we talk about the product, calendars, organization, and business

Recovering from a disaster


Image: Blue Screen, from http://www.flickr.com/photos/stublag/248808645/

I have been shocked a few times to see how some of our customers take care of their own data when they install JobTracker on their own servers. They usually think they’re fine until their server crashes after running 24×7 for 4 years, and they suddenly realize they don’t know whether they have any way to recover.

I think I have the opposite problem. No matter how much effort we put into designing and implementing our backup process, I’m always worried that it’s not enough. I’m just a glass-half-empty kind of guy.

When designing a backup process, the only thing that matters is how you’re going to recover data. So you have to think backwards and design a recovery process first, then make sure you have a backup process to support it.

We just bought a new set of servers, and much of the decision about what to buy was based on our desired recovery process. Our previous server setup left us with a few scenarios where it would just take too long to recover from a disaster, because we had to move too many GB of data across the network. The new setup spreads the data out more, so there is more redundancy and recovering from any single failure will be much quicker.
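To see why the amount of data per server matters so much, here is a rough back-of-the-envelope sketch. The function name, the link speed, the data sizes, and the 70% efficiency factor are all made-up assumptions for illustration, not measurements from our servers.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to copy data_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal link speed that a real
    restore actually achieves (protocol overhead, disk speed, other traffic).
    """
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need data_gb >= 0, link_mbps > 0, 0 < efficiency <= 1")
    megabits = data_gb * 8 * 1000          # 1 GB = 8,000 megabits (decimal units)
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical comparison: one big server vs. the same data spread over four.
all_on_one = transfer_hours(2000, 1000)    # 2 TB over a 1 Gbps link
spread_out = transfer_hours(500, 1000)     # only one quarter has to move
print(f"single server: {all_on_one:.1f} h, one of four: {spread_out:.1f} h")
```

The point is only that restore time scales with the data behind a single failure, so spreading data across more servers shrinks the worst case.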

The two primary measures of the recovery process are how long the system will be down and how much data will be lost. Ideally both numbers would be zero, but you can always dream up a scenario that causes downtime or data loss, even if it’s extremely rare or violent. So the best you can do is make these numbers small, but the smaller you want them, the more it costs and the more it can hurt performance.

Once you accept that downtime and data loss are possible, you can map out various scenarios and design a process that gets the right trade-off between cost, performance, and risk.

We have done just that, and below is a summary of how much downtime and data loss we expect from various disasters, under our current server configuration.

| Disaster | Recovery Action | Downtime | Max Data Loss |
|---|---|---|---|
| Human error, destroying data | Restore from backup | None | Depends on how long it takes to discover problem. Lose data entered after the backup you’re restoring was taken. |
| Any server hardware failure, leaving drives OK | Swap hardware | 30 min | None |
| Server software problem | Move data to another server, rebuild server later | 30-60 min | None |
| Single drive failure | Swap drive, reboot server, may have slow performance while RAID rebuilds | 10-30 min | None |
| Multiple drives fail on app server | Restore from backup to another server | 30-60 min | 1 hour (only for databases on failed server) |
| All production drives fail on all servers | Rebuild everything from local backup | 3 days | 1 day |
| Nuclear bomb at primary data center | Rebuild everything from remote backup | 3 days for initial recovery, 1-2 weeks to get all historical data transferred | 1 day |
| 2 nuclear bombs at primary and backup data centers | Enlist in the Army | Forever | Everything |
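The human error row is the one where the data loss is not a fixed number, so here is a minimal sketch of how that window could be worked out. The function names, the daily backup schedule, and the timestamps are hypothetical, chosen only to illustrate the idea that you restore the last backup taken before the mistake and lose whatever was entered after it.

```python
from datetime import datetime, timedelta
from typing import Sequence


def pick_restore_point(backups: Sequence[datetime], error_time: datetime) -> datetime:
    """Return the most recent backup taken strictly before the mistake."""
    clean = [b for b in backups if b < error_time]
    if not clean:
        raise ValueError("no backup predates the error")
    return max(clean)


def data_loss_window(backups: Sequence[datetime],
                     error_time: datetime,
                     discovered_time: datetime) -> timedelta:
    """Span of data entry lost: from the restored backup until discovery."""
    if discovered_time < error_time:
        raise ValueError("discovery cannot precede the error")
    return discovered_time - pick_restore_point(backups, error_time)


# Hypothetical schedule: one backup every midnight.
nightly = [datetime(2009, 7, day) for day in (1, 2, 3)]
mistake = datetime(2009, 7, 2, 9, 0)       # data destroyed at 9:00 am
noticed = datetime(2009, 7, 3, 15, 0)      # nobody notices until the next afternoon
print(data_loss_window(nightly, mistake, noticed))
```

Note that the backup taken after the mistake is useless here, which is why the loss grows with the time it takes to discover the problem.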

Some of the reasons we can recover quickly are:

  • Our servers are in a data center that’s staffed with experts 24×7×365, with tons of spare parts on hand to fix any hardware problem within 30 minutes.
  • We have a bunch of identically configured servers. If there’s a problem with one, we can easily move the data to another server that’s ready to go.
  • We use a combination of database full & log backups, file replication, on-site backups, and off-site backups.
  • Every piece of data gets stored on at least 5 servers, across 10 drives. Keeping 30 days of full backups means that once a piece of data is a month old, there are over 90 copies of it! (Luckily our new servers have 18TB of disk space to handle this.)
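The copy count in the last bullet is simple arithmetic. The sketch below shows one way a figure like "over 90" could come about. The split into 5 live copies plus 3 backup copies per day is purely an assumption for illustration, since the exact breakdown isn't given here, and the function name is made up.

```python
def copies_after_retention(live_copies: int,
                           backup_copies_per_day: int,
                           retention_days: int) -> int:
    """Total copies of a piece of data once it is older than the retention window."""
    if min(live_copies, backup_copies_per_day, retention_days) < 0:
        raise ValueError("counts must be non-negative")
    return live_copies + backup_copies_per_day * retention_days


# Assumed split: 5 live copies, and each day's full backup kept in 3 places.
total = copies_after_retention(live_copies=5, backup_copies_per_day=3, retention_days=30)
print(total)
```

Whatever the real split is, the total grows linearly with the retention period, which is why disk space becomes the limiting factor.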

We’re always improving our process, so I expect we’ll get these numbers down over time. But it also depends on how much money we’re willing to spend, which is based on our revenue going forward.

Written by Ted Pitts

July 2, 2009 at 1:55 pm

Posted in Uncategorized
