Facebook has revealed more details of last night’s 2.5-hour outage, calling it the worst the site has suffered in more than four years.
The site was unavailable for most users round the world between 11.30am and 3.00pm EST.
Software engineering director Robert Johnson says the problem was caused by the mishandling of an error condition. “An automated system for verifying configuration values ended up causing much more damage than it fixed,” he says.
A change was made to the persistent copy of a configuration value that was interpreted as invalid. The unfortunate result was that every single client saw the invalid value and attempted to fix it. Fixing it involved making a query to a cluster of databases, which was, unsurprisingly, quickly overwhelmed by hundreds of thousands of queries a second.
“To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key,” says Johnson.
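In rough terms, the broken client logic looked something like the sketch below. The names here (cache, ConfigDatabase, is_valid) are hypothetical stand-ins for illustration; Facebook has not published its actual client code.

```python
# Minimal sketch of the failure mode Johnson describes. All names are
# hypothetical stand-ins; this is not Facebook's actual code.

class DatabaseError(Exception):
    pass

class ConfigDatabase:
    """Stand-in for the database cluster; fails once it is saturated."""
    def __init__(self):
        self.load = 0

    def query(self, key):
        self.load += 1
        if self.load > 100:          # pretend the cluster is overwhelmed
            raise DatabaseError("cluster overloaded")
        return "good-value"

cache = {}                           # stand-in for the cache tier
db = ConfigDatabase()

def is_valid(value):
    return value == "good-value"

def get_config(key):
    value = cache.get(key)
    if value is None or not is_valid(value):
        # Every client that sees the bad value tries to "fix" it by
        # querying the database cluster directly.
        try:
            value = db.query(key)
        except DatabaseError:
            # The bug: a query error is treated like an invalid value,
            # so the cache key is deleted -- guaranteeing that the next
            # request also misses the cache and hits the database again.
            cache.pop(key, None)
            raise
        cache[key] = value
    return value
```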
This meant that even after the original problem had been fixed, the stream of queries continued. And with the databases failing to service some of the requests, even more requests were generated, creating a feedback loop that didn’t allow the databases to recover.
“The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site,” says Johnson. “Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
Facebook has now turned off the configuration verification system and is looking to redesign it so that it deals ‘more gracefully’ with feedback loops and traffic spikes.
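One common way to make a client like this ‘more graceful’ is to back off on errors and keep serving the last known value rather than deleting it. The sketch below (reusing the hypothetical cache, db, is_valid and DatabaseError stand-ins from the earlier example) shows that general pattern; it is not Facebook’s actual redesign.

```python
import random
import time

def get_config_graceful(key, max_attempts=3):
    """Generic 'graceful' variant: retry with backoff, never delete the cache."""
    value = cache.get(key)
    if value is not None and is_valid(value):
        return value

    for attempt in range(max_attempts):
        try:
            fresh = db.query(key)
            cache[key] = fresh
            return fresh
        except DatabaseError:
            # Exponential backoff with jitter stops every client from
            # hammering the cluster in lockstep.
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)

    # If the database stays unreachable, fall back to whatever we have
    # (even a stale or suspect value) rather than amplifying the load.
    return value
```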