Foursquare checks out for 11 hours

Foursquare experienced two outages yesterday, the longest lasting about 11 hours, thanks to an unexpected database problem.

Traffic to the site has increased sharply in recent months, and it now has over three million users, putting increased pressure on the company's databases.

The company stores data from users’ check-in histories in segments called ‘shards’, with data spread evenly across multiple shards.

“Starting around 11:00am EST yesterday, we noticed that one of these shards was performing poorly because a disproportionate share of check-ins were being written to it,” says an engineer on the company blog.
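To illustrate how one shard can end up with a disproportionate share of writes, here is a minimal sketch. It assumes users are assigned to shards by contiguous ranges of user ID, which the article does not confirm; the function names and the sample data are hypothetical and are not Foursquare's code.

```python
from collections import Counter


def shard_for(user_id: int, num_shards: int, users_per_shard: int) -> int:
    """Assumed scheme: contiguous ranges of user IDs map to one shard each."""
    return min(user_id // users_per_shard, num_shards - 1)


def write_load(checkins: list[int], num_shards: int, users_per_shard: int) -> Counter:
    """Count how many check-in writes land on each shard."""
    load = Counter({shard: 0 for shard in range(num_shards)})
    for user_id in checkins:
        load[shard_for(user_id, num_shards, users_per_shard)] += 1
    return load


def hottest_share(load: Counter) -> float:
    """Fraction of all writes that hit the busiest shard."""
    total = sum(load.values())
    return max(load.values()) / total if total else 0.0


# Hypothetical data: four users, two per shard, with user 0 far more active.
checkins = [0] * 8 + [1, 2, 3]
load = write_load(checkins, num_shards=2, users_per_shard=2)
share = hottest_share(load)  # shard 0 takes 9 of the 11 writes
```

Even though each shard holds the same number of users, the write load is lopsided because activity per user is uneven. A monitor that tracks a figure like `share` could flag the imbalance before the hot shard runs out of headroom.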

“For the next hour and a half, until about 12:30pm, we tried various measures to ensure a proper load balance. None of these things worked.”

In the end, the company decided to introduce a new shard and move some of the data across to spread the load. Unfortunately, this didn’t go as planned. First, adding the shard caused the entire site to go down.

In addition, moving the data across didn’t free up as much space as anticipated, and five hours of attempts to migrate the data and restart the site each ended in a crash.

“At 6:30pm EST, we determined the most effective course of action was to re-index the shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours,” says Foursquare.

“At 11:30, the site was brought back up. Because of our safeguards and extensive backups, no data was lost.”

Foursquare says it’s working on ways of making sure the problem doesn’t happen again, including having a word with the team behind MongoDB, the database software it uses, to try to improve stability. It’s also changing operational procedures and considering using graceful degradation, so that if problems do arise the whole system won’t go down.
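As a rough idea of what degrading gracefully could look like, the sketch below returns whatever data the healthy shards can provide and flags the rest as unavailable, instead of failing the whole request. Everything in it is an assumption made for illustration: the shards are modelled as plain callables, users are mapped to shards by a simple modulo, and `ShardUnavailable` and `fetch_histories` are hypothetical names, not Foursquare's or MongoDB's API.

```python
class ShardUnavailable(Exception):
    """Raised by a (hypothetical) shard that cannot serve requests."""


def fetch_histories(shards, user_ids):
    """Collect check-in histories, skipping users whose shard is down."""
    histories, unavailable = {}, []
    for user_id in user_ids:
        shard = shards[user_id % len(shards)]  # assumed placement rule
        try:
            histories[user_id] = shard(user_id)
        except ShardUnavailable:
            unavailable.append(user_id)
    return {
        "histories": histories,
        "degraded": bool(unavailable),
        "unavailable_users": unavailable,
    }


def healthy_shard(user_id):
    return [f"checkin-{user_id}"]


def broken_shard(user_id):
    raise ShardUnavailable(f"shard down for user {user_id}")


response = fetch_histories([healthy_shard, broken_shard], [0, 1, 2, 3])
```

Here users 0 and 2 still get their histories, while users 1 and 3 are reported as temporarily unavailable, so a single failing shard affects only part of the user base.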

The company has also set up a new status blog.

Later yesterday, the company experienced another outage, lasting over six hours, but finally got the site up and running by about 12:30 this morning.