Yesterday the Facebook website went offline without warning for around 2.5 hours. Facebook has now released a detailed explanation of the reason why.
Facebook explains that the major cause of the outage was ‘an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.’
Then, to make matters even worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key.
This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, clients kept deleting cache keys and sending even more requests to those same databases. The Facebook website had entered a feedback loop that didn’t allow the databases to recover.
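The loop can be sketched in a few lines of Python. This is a hypothetical simulation, not Facebook’s actual code: a `get` helper fronts an overloaded database with a cache, and the flawed error handling treats a database error as an invalid cached value, evicting the key so every later request misses the cache and adds load to the database it was meant to shield.

```python
# Hypothetical sketch of the cache-invalidation feedback loop
# described above (names and structure are assumptions, not
# Facebook's implementation).

cache = {"config": "old-value"}
db_queries = 0

def query_database(key):
    """Stand-in for the overloaded database: every query fails."""
    global db_queries
    db_queries += 1
    raise RuntimeError("database overloaded")

def get(key):
    # A cache hit would normally shield the database from load.
    if key in cache:
        return cache[key]
    try:
        value = query_database(key)
    except RuntimeError:
        # The flawed error handling: an error is treated as an
        # invalid cached value, so the key is deleted. The key
        # stays evicted, so every subsequent request misses the
        # cache and queries the database again.
        cache.pop(key, None)
        return None
    cache[key] = value
    return value

get("config")          # cache hit: the database is untouched
cache.pop("config")    # the automated system invalidates the key
for _ in range(5):     # each failed query keeps the key evicted
    get("config")
print(db_queries)
```

Running the sketch prints `5`: five requests after the invalidation become five database queries, because the error path deletes the very cache entry that could have absorbed them.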
To stop the loop, Facebook needed to cut all traffic to the site and database cluster, allowing the databases to return to stability.
Facebook are now developing new systems that deal more gracefully with feedback loops and transient spikes.