At 7:33 PM Pacific Time (PT) on Wednesday, June 10, parts of Amazon’s Elastic Compute Cloud (EC2) lost power after a lightning strike. Technicians were at the data center within the hour and replaced the damaged power units gradually, finishing work at 1:20 AM PT this morning.
“Last night’s issue was limited to one US EC2 Availability Zone when a small percentage of instances in that zone lost power due to a lightning storm. This was not a generalized issue and there was no impact to other AWS services,” said an Amazon representative in an e-mail to InternetNews.com.
The company posts the current condition as well as a historical record of outages on its AWS Service Health Dashboard. The dashboard shows a piece of the Mechanical Turk service also experienced some problems yesterday.
Past issues recorded in the dashboard generally involve rapid growth in the network traffic that cloud workloads generate. For example, on May 28, Amazon’s European cloud exceeded router connection limits as it accessed Amazon’s Simple Storage Service (S3) in the U.S.
Amazon typically resolves these problems quickly. The May 28 problem was fixed in 20 minutes, according to the dashboard.
Yesterday’s issue took longer to solve, but the company is working to ensure that such problems won’t affect users when they happen again.
“We are continually striving to prevent such unforeseen events like lightning strikes from being even a blip on our customers’ radars. This type of incident is exactly why EC2 offers Multiple Availability Zones to our developers,” said Amazon’s representative.
“When a developer chooses to run an application in Multiple Availability Zones, the application will be fault tolerant against exactly this type of event and will keep performing. During the issue last night, any user running in just one zone could also choose to re-launch their degraded instances in another zone,” she added.
She noted that Amazon’s new cloud control services can help. “The new monitoring, auto scaling and load balancing features give users options to set up automatic controls to automatically launch new instances if an instance goes down, and/or to reroute traffic to additional Availability Zones when instances they are running in only one Availability Zone become unhealthy,” she said.
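The recovery pattern the representative describes, replacing failed instances in a different zone and sending traffic only to healthy ones, can be illustrated with a small simulation. This is a minimal sketch and does not call any Amazon API: the `Instance` class, the `launch`, `recover` and `route` functions, and the zone labels `zone-a`, `zone-b` and `zone-c` are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical model of a running instance; not an AWS data structure.
@dataclass
class Instance:
    instance_id: int
    zone: str
    healthy: bool = True

_next_id = count(1)  # simple ID generator for the simulation


def launch(zone):
    """Simulate launching a fresh, healthy instance in the given zone."""
    return Instance(next(_next_id), zone)


def recover(fleet, zones):
    """Replace every unhealthy instance with a new one in an unaffected zone.

    A zone counts as affected if any instance in it is unhealthy.
    Replacements are spread round-robin over the unaffected zones.
    """
    affected = {i.zone for i in fleet if not i.healthy}
    safe_zones = [z for z in zones if z not in affected]
    if not safe_zones:
        raise RuntimeError("no unaffected zone available for re-launch")
    survivors = [i for i in fleet if i.healthy]
    failed = [i for i in fleet if not i.healthy]
    replacements = [
        launch(safe_zones[n % len(safe_zones)]) for n in range(len(failed))
    ]
    return survivors + replacements


def route(fleet, request_number):
    """Pick a healthy instance for a request, round-robin."""
    healthy = [i for i in fleet if i.healthy]
    if not healthy:
        raise RuntimeError("no healthy instance to serve the request")
    return healthy[request_number % len(healthy)]


if __name__ == "__main__":
    zones = ["zone-a", "zone-b", "zone-c"]
    fleet = [launch("zone-a"), launch("zone-a"), launch("zone-b")]

    # Simulate a power loss that hits every instance in zone-a.
    for inst in fleet:
        if inst.zone == "zone-a":
            inst.healthy = False

    # Traffic keeps flowing to the surviving zone-b instance.
    print("serving from:", route(fleet, 0).zone)

    # Failed instances are re-launched outside the affected zone.
    fleet = recover(fleet, zones)
    print("zones after recovery:", sorted(i.zone for i in fleet))
```

In this sketch an application spread over two zones keeps serving requests during the simulated outage, while one running only in the affected zone would have nothing for `route` to return until `recover` re-launched it elsewhere.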