Google Gmail 100-minute outage is a big deal

From the ‘Everyone Makes Mistakes’ files:

Google Gmail users were hit with a 100-minute outage yesterday due to an upgrade issue.

Ben Treynor, VP Engineering for Google Gmail, blogged that Google took some of the Gmail servers offline on Tuesday morning for routine upgrades. It was those upgrades that led to the service disruption.

That’s right: due to a miscalculation on Google’s part, an action (the
upgrade) that should have provided better service instead resulted in no
service for tens of millions of Gmail users around the world.

“We had slightly underestimated the load which some recent changes
(ironically, some designed to improve service availability) placed on
the request routers — servers which direct web queries to the
appropriate Gmail server for response,” Treynor blogged.

In my opinion, this is a classic load-balancing newbie error. The problem is, Google isn’t a newbie.

How could they not know the load on their servers? More importantly, why don’t they have some kind of virtual (or physical) pool of on-demand burst bandwidth to deal with issues like this?
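
To make the capacity question concrete, here is a rough Python sketch of the kind of headroom check I have in mind. Every number and function name in it is hypothetical and made up for illustration; none of it comes from Google's post or reflects how Gmail's request routers actually work. It only shows how a small, underestimated increase in load can push the remaining routers over capacity once some servers are offline.

```python
def load_per_router(total_load, routers_online):
    """Spread the total request load evenly across the routers still online."""
    if routers_online <= 0:
        raise ValueError("no routers online")
    return total_load / routers_online


def overloaded_after_upgrade(total_load, routers, offline, capacity, overhead=0.0):
    """Return True if each remaining router would exceed its capacity.

    overhead is the fractional extra load added by recent changes
    (0.10 means 10% more load than before). All inputs are hypothetical.
    """
    remaining = routers - offline
    per_router = load_per_router(total_load * (1 + overhead), remaining)
    return per_router > capacity


# Hypothetical fleet: 10 routers, 2 taken offline, each able to handle 1200 units.
print(overloaded_after_upgrade(9000, 10, 2, 1200))        # estimate without the extra load
print(overloaded_after_upgrade(9000, 10, 2, 1200, 0.10))  # same fleet with 10% more load
```

With these made-up figures, the first check passes and the second does not: a 10% misjudgment is enough to tip the remaining routers over once two of them are out for maintenance.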

