Pressure on Google servers caused yesterday's 100-minute Gmail outage, and the company has gone out of its way to grovel about the problem.
Ben Treynor, Google’s “Site Reliability Czar,” said in a posting on the Google blog last night that he wanted to apologize to all Gmail users. “Today’s outage was a Big Deal, and we’re treating it as such,” he said.
Google took what he described as a small fraction of Gmail servers down to perform upgrades but underestimated the load that some changes placed on the request routers.
“At about 12:30 PM Pacific, a few of the request routers became overloaded and in effect told the rest of the system ‘stop sending us traffic, we’re too slow!’”
The load was transferred to the remaining request routers and they became overloaded one by one.
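The knock-on effect described above can be illustrated with a toy model. This is only a sketch, not Google's routing logic: the function name, the capacity figures and the assumption that load is shared evenly among the routers still standing are all invented for illustration.

```python
def simulate_cascade(total_load, capacities):
    """Toy model of routers dropping out one by one under a shared load.

    total_load and capacities are hypothetical numbers in the same
    arbitrary unit. Load is assumed to be split evenly across the
    routers that are still healthy.
    """
    healthy = dict(enumerate(capacities))  # router id -> capacity
    failed = []                            # router ids in the order they drop out
    while healthy:
        share = total_load / len(healthy)
        overloaded = [rid for rid, cap in healthy.items() if share > cap]
        if not overloaded:
            break  # every remaining router can carry its share
        # The weakest overloaded router gives up first; its traffic is
        # then spread over the survivors on the next pass of the loop.
        victim = min(overloaded, key=lambda rid: healthy[rid])
        del healthy[victim]
        failed.append(victim)
    return failed, sorted(healthy)


if __name__ == "__main__":
    # Five routers, one slightly too small for its share: all five fall over.
    print(simulate_cascade(100, [18, 22, 24, 26, 28]))
    # Same load with five extra routers online: nothing fails.
    print(simulate_cascade(100, [18, 22, 24, 26, 28] + [30] * 5))
```

In this model a single undersized router is enough to take the whole pool down, while the same load spread over a larger pool causes no failures at all.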
The Czar claimed that the engineering team found out about the problem within seconds and brought many more request routers online.
He said: “Gmail remains more than 99.9% available to all users, and we’re committed to keeping events like today’s notable for their rarity.”
Systems do break, and when they are 100 percent down, it is little comfort that the outage represents only 0.1 percent of the total: that is still 100 percent down for 100 minutes.
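To put the 99.9 percent figure next to the 100 minutes, here is a rough bit of arithmetic. Google's statement does not say over what period availability is measured, so the one-year window below is an assumption, as are the function names.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day window


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed by an availability target over the period."""
    return (1 - availability) * period_minutes


def availability_after_outage(outage_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability over the period if the only downtime is one outage."""
    return 1 - outage_minutes / period_minutes


if __name__ == "__main__":
    print(round(downtime_budget_minutes(0.999), 1))     # minutes allowed at 99.9%
    print(round(availability_after_outage(100) * 100, 3))  # percent available
```

Measured over a year, a 99.9 percent target leaves room for roughly 526 minutes of downtime, so a single 100-minute outage fits inside it with plenty to spare. That is exactly why the headline percentage says little about how the outage felt to anyone locked out for those 100 minutes.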