The blackout of Microsoft’s Windows Azure servers last weekend was due to a glitch in an operating system update, the company said this week.
Microsoft’s disclosure came in a post Wednesday on the Windows Azure blog.
The roughly 22-hour service outage began last Friday evening at around 10:30 p.m. Pacific and ran into early Saturday evening.
“During a routine operating system upgrade on Friday (March 13th), the deployment service within Windows Azure began to slow down due to networking issues. This caused a large number of servers to time out and fail,” the Windows Azure team said in their post.
The servers were back up and functioning normally by Saturday evening, later blog posts said. However, finding out what caused the outage so it doesn’t happen again took a while longer.
Microsoft’s engineers found that, as application servers failed, they began notifying a server called the Fabric Controller. Part of the controller’s job is to recover crashed applications by moving them to other servers, but as more and more servers failed, the cascade of failures backed up the Fabric Controller as well.
“We are addressing the network issues and we will be refining and tuning our recovery algorithm to ensure that it can handle malfunctions quickly and gracefully,” Wednesday’s post continued.
Microsoft also suggested that developers run more than one instance of their applications, because applications running multiple instances were less likely to fail.
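The arithmetic behind that advice can be sketched in a few lines of Python. This is only an illustration, not Microsoft’s analysis: the function name and the 20 percent failure rate are hypothetical, and the sketch assumes each instance fails independently of the others, which a cascading outage like this one may well violate.

```python
def unreachable_probability(instance_failure_rate: float, instances: int) -> float:
    """Chance that every instance of an application is down at the same time.

    Assumes instances fail independently (a simplification; correlated
    failures, as in a cascade, would make the real figure higher).
    """
    if not 0.0 <= instance_failure_rate <= 1.0:
        raise ValueError("instance_failure_rate must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The application is unreachable only if all instances fail together.
    return instance_failure_rate ** instances


# Hypothetical 20 percent per-instance failure rate, chosen for illustration.
for count in (1, 2, 3):
    print(count, "instance(s):", unreachable_probability(0.2, count))
```

Under those assumptions, each added instance multiplies the chance of a complete outage by the per-instance failure rate, which is why even a second instance helps considerably.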
Windows Azure, often shortened to simply Azure, is Microsoft’s (NASDAQ: MSFT) cloud computing environment. Azure has been available as a Community Technology Preview since it was introduced at Microsoft’s Professional Developers Conference last October. Microsoft is in the process of building datacenters worldwide to support Azure when it is released.
During last weekend’s blackout, developers who were testing their code on Azure services received error messages informing them that their applications were “unreachable or in ‘stopped’ or ‘initializing’ states for long periods of time,” according to a statement posted during the outage.