Amazon EC2 Outage: Summary and Lessons Learned

Last Thursday's Amazon EC2 outage was the worst in cloud computing's history. It made the front page of many news sites, including the New York Times, probably because so many people were shocked by how many websites and services rely on EC2. Seeing so much affected at once was a graphic illustration of how pervasive cloud computing has become.

I will try to summarize what happened, what worked and didn't work, and what to learn from it. I'll do my best to add signal to all the noise out there. In that respect I liked a tweet by Beaker (Christofer Hoff): "Happy with my decision NOT to have written a blog about the misfortune of AWS, stating nothing but the obvious & sounding like a muppet."

Executive summary

  • The Amazon cloud proved itself in that sufficient resources were available world-wide for many well-prepared users to continue operating with relatively little downtime. But because Amazon's reliability has been incredible, many users were not well prepared, leading to widespread outages. Additionally, some users got caught by unforeseen failure modes that rendered their failover plans ineffective.
  • Some ripple effects within EC2, and in particular EBS, caused by the initial failure should not have happened. There's important work Amazon needs to do to prevent such occurrences.
  • Amazon's communication, while better than during previous outages, still earns an F. This is probably the #1 threat to AWS's business.
  • The cloud architecture provides ample opportunities to design systems that withstand failures. The material cost of such designs is a fraction of what comparable measures would cost using traditional hosting. However, designing, building, and testing everything is not cheap. Many of our customers who followed our best practices fared well (I'm not claiming we're perfect or that everything is automatic!), and we got numerous calls from other companies that were wholly unprepared.
  • Overall this is just one of many bumps in the cloud computing road. It reminds us that this is still "day one" of the cloud and that we all have much to learn about building and operating robust systems on a large scale. We are receiving a stream of calls from EC2 users who realize they need help in setting up a more robust architecture for their systems.

Outage analysis

At the time of writing Amazon has not yet posted a root cause analysis. I will update this section when they do. Until then, I have to make some educated guesses.

We got the first alerts at 1:01am on Thursday: the proverbial Christmas lights lit up, indicating I/O issues on a large number of our servers. We started failing servers over and opened a ticket with Amazon. They finally posted a status message at 1:41am containing no useful details; sadly, this is a typical sequence of events.

It appears that a major network failure was the initial cause of problems, but the real damage happened when EBS (Elastic Block Store) volume replication was disrupted. We did some extrapolations and concluded that there must have been on the order of 500k EBS storage volumes in the affected availability zone. It appears that a significant fraction of these volumes concluded that their replication mirrors were out of sync and started re-replicating, causing further havoc, including an overload of the EBS control plane. It is also possible that the EBS replication problem was the root cause and that the network issues were a consequence; hopefully Amazon's root cause analysis will shed light on this.

The biggest problem, from my point of view, was that more than one availability zone was affected. We didn't see servers or volumes fail in other zones, but we were unable to create fresh volumes elsewhere, which of course makes it difficult to move services. This is "not supposed to happen" and is an indication that the EBS control plane has dependencies across zones. Amazon did manage to contain the problem to one zone approximately three hours after the onset.

After Amazon managed to contain the problems to one zone, it took a very long time to get the EBS machinery under control and to recover all the volumes. Given the extrapolated number of volumes, it would not be surprising if an event of this scale exceeded the design parameters and was never tested (or able to be tested). I'm not sure there is any system of comparable scale in operation anywhere.

I do want to state that while "something large" clearly failed, namely the EBS system as a whole, the really big failure is that multiple availability zones were affected for roughly three hours. I also want to mention two important things that didn't fail: we didn't see capacity constraints when relaunching servers in other zones after the initial cross-zone issues, and we didn't see other regions affected at all. This is clearly good news!

Amazon communication failure

In my opinion the biggest failure in this event was Amazon's communication, or rather the lack thereof. The status updates were far too vague to be of much use, and there was no background information whatsoever. Neither the official AWS blog nor Werner Vogels' blog had posted anything about it even four days after the outage! Here is a list of improvements for Amazon:

  • Do not wait 40 minutes to post the first status message!
  • Do not talk about "a small percentage of instances/volumes/...", give actual percentages! Those of us with many servers/volumes care whether it's 1% or 25%, we will take different actions.
  • Do not talk about "the impacted availability zone" or "multiple availability zones", give each zone a name and refer to them by name (I know that zone 1a in each account refers to a different physical zone, so give each zone a second name so I can look it up).
  • Provide individualized status information: use email (or other means) to tell us what the status of our instances and volumes is. I don't mean things I can get myself like cpu load or such, but information like "the following volumes (by id) are currently recovering and should be available within the next hour, the following volumes will require manual intervention at a later time, ...". That allows users to plan and choose where to put their efforts.
  • Make predictions! We saw volumes in the "impacted availability zone" getting taken out many hours after the initial event. I'm sure you knew that the problem was still spreading and could have warned everyone. Something like: "we recommend you move all servers and volumes that are still operating in the impacted availability zone [sic] to a different zone or region as the problem is still spreading."
  • Provide an overview! Each status update should list which functions are still affected and which have been repaired; don't make everyone scan back through the messages and try to infer what the status of each function is.
  • Is it so hard to write a blog post with an apology and some background information, even if it's preliminary? AWS tweeters who usually send multiple tweets per day remained silent. I'm sure there's something to talk about 24 hours after the event! Don't you want to tell everyone what they should be thinking instead of having them make it up?

Coverage from around the web

Since Amazon did not communicate much of substance beyond the rather sparse and obscure status updates, everyone else was left to speculate. Most of the blog posts and news articles contained little information. Here's a list of the blog posts I found interesting:

Lessons learned

Our services team handled 4x the incident volume last Thursday compared to a normal Thursday. A large number of callers needed help assessing the situation or bringing their servers back up. A typical request was: "It looks like my db server is down due to the outage, can you help confirm and assist with a migration?" Unfortunately, we also heard from a good number of users who were using a single availability zone or hadn't set up redundancy properly. Hindsight is always 20/20.

A clear lesson for everyone is obviously that backup and replication have to be taken seriously (duh). In EC2 this means live replication across multiple availability zones and backups to S3 (and ideally elsewhere as well). It has also become clear that a minimum number of replicas must be running and that a certain degree of over-provisioning is necessary to handle the load spike after a massive failure. Adrian Cockcroft from Netflix summarized their strategy in a tweet a while ago: "Deploy in three AZ with no extra instances - target autoscale 30-60% util. You have 50% headroom for load spikes. Lose an AZ -> 90% util." (Also see the discussion around the tweet; a back-of-the-envelope version of that arithmetic is sketched below.) Users who relied on launching fresh servers or on creating fresh volumes from snapshots were not able to do so for several hours. The only previous event I remember where multiple availability zones were affected was the July 20th, 2008 S3 outage that took down S3 in the US and EU (multiple regions!).
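
To make the headroom arithmetic in that tweet concrete, here is a minimal sketch (my own illustration, not Netflix or RightScale code): with N equally loaded zones running at a target utilization, losing one zone spreads its traffic over the remaining N-1 zones, so per-zone utilization becomes util * N / (N - 1).

```python
# Back-of-the-envelope sketch of the multi-AZ headroom math quoted above.
# Illustrative only; the function and numbers are not from Netflix or RightScale.

def utilization_after_zone_loss(zones: int, target_util: float) -> float:
    """Per-zone utilization after one of `zones` equally loaded zones fails,
    assuming the lost zone's traffic spreads evenly over the survivors."""
    if zones < 2:
        raise ValueError("need at least two zones to survive losing one")
    return target_util * zones / (zones - 1)

print(utilization_after_zone_loss(3, 0.60))  # 0.9 -> "Lose an AZ -> 90% util"
print(utilization_after_zone_loss(2, 0.60))  # 1.2 -> two zones would be overloaded
```

The same arithmetic also shows why running "hot" defeats the purpose: three zones at 80% utilization would leave the survivors at 120% after losing a zone.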

A number of blogs mention NoSQL databases as a solution to the replication and failure difficulties of traditional relational databases. While we've started to use Cassandra ourselves, it has become pretty clear to me that this is not a silver bullet by a long shot. When a single node fails, the built-in replication and recovery work well, although the extra load on the remaining nodes is high when the failed node is repaired and resynchronizes. But when large numbers of nodes in the cluster lock up one by one over the course of an hour, I'd be hesitant to make a prediction about the outcome, both in terms of the cluster's availability and its consistency. We have two applications that make very different use of Cassandra, and the behavior of the database is very different in each case. My conclusion from what I have observed thus far is that clusters of replicated, eventually-consistent NoSQL stores have pretty complex dynamics that can easily lead to unpleasant surprises. Sometimes it's nice to have a comparatively simple MySQL master-slave setup that experiences some downtime during failover but acts very predictably.

I can't help but feel uncomfortable about the performance of Amazon's RDS "database-as-a-service", in that some databases that were replicated across multiple availability zones did not fail over properly. It evidently took more than 12 hours to recover a number of the multi-AZ databases. The obvious failure here is compounded by the fact that Amazon has made it difficult for users to back up their databases outside of RDS, leaving them no choice but to wait for someone at Amazon to work on their database. This lock-in is one reason many of our customers prefer to use our MySQL master-slave setup or to architect their own.
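
For those who want a backup that lives outside RDS, a logical dump taken from a separate host is one way out of that lock-in. Below is a minimal, illustrative sketch (hostnames and credentials are placeholders; it assumes the database speaks the MySQL protocol and that mysqldump is installed). A real setup would also ship the compressed dump off-site, for example to S3 or another region.

```python
# Illustrative only: take a logical MySQL dump from outside RDS so the backup
# remains usable even if the RDS service itself is unavailable.
# Host, user, and database names are placeholders; mysqldump must be on the PATH.
import datetime
import gzip
import shutil
import subprocess

def dump_database(host: str, user: str, password: str, database: str) -> str:
    """Run mysqldump against a remote MySQL-compatible endpoint and gzip the result."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    sql_path = f"{database}-{stamp}.sql"
    gz_path = sql_path + ".gz"
    with open(sql_path, "wb") as out:
        subprocess.check_call(
            ["mysqldump", "--single-transaction",  # consistent snapshot for InnoDB
             "-h", host, "-u", user, f"-p{password}",  # prefer ~/.my.cnf in practice
             database],
            stdout=out,
        )
    with open(sql_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # compress the dump for cheaper off-site storage
    return gz_path

# Example (placeholder values):
# dump_database("mydb.example.rds.amazonaws.com", "admin", "secret", "production")
```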

The biggest lesson we learned about operating RightScale itself is that we have to continue pushing hard on reducing the load on our central MySQL database and distributing our service. The database has grown too big, and failover consequently takes too long because it takes forever to load the working set (over 30GB) into memory. We have some short-term measures we will be implementing to reduce the failover time, but more is needed. We also need to offer our users a choice of RightScale systems located in different regions and clouds: users operating primarily out of one region need to be able to use RightScale in an independent region or cloud. Ironically, the first thing every public cloud operator and every company with a private cloud asks us is whether we can run RightScale inside their cloud: that seems pretty misguided to me!

We were also confused by Amazon's status messages. In hindsight we should have intentionally failed over our master database, which was operating in the "impacted availability zone", early on, at a time when we could minimize downtime. We were lucky that it didn't get affected until about 12 hours after the start of the outage, but we didn't put two and two together. A clear message from Amazon that more and more volumes were continuing to fail in the zone would have been really helpful.

What's next?

With Amazon's overall stellar operating reliability it is easy to become complacent. This outage was a wake-up call for many of us. What remains to be seen is whether Amazon decides to take the lead and provide more granular descriptions of failure modes and recommended actions, or whether they will leave it to everyone else to guess and figure it out. I see this as one of the main long-term problems of cloud computing, namely that it is extremely difficult for users to enumerate the possible failure modes and even more difficult to actually test any of them.

In the big picture I find Lew Moorman's analogy in the NYT article very appropriate: "The Amazon interruption was the computing equivalent of an airplane crash. It is a major episode with widespread damage. But airline travel is still safer than traveling in a car — analogous to cloud computing being safer than data centers run by individual companies. Every day, inside companies all over the world, there are technology outages, each episode is smaller, but they add up to far more lost time, money and business." Most of the articles that predict a run away from cloud computing fail to explain where to run to. Unless you can hire Superman to run your private datacenters, my experience tells me that you'll be worse off.

Comments

We're on EC2 on the West Coast so we weren't affected. We use an automated script to take a snapshot every 4 hours and dispose of old snapshots. Would we have been able, in the above outage, to copy the snapshot over to a different data center and boot from it? I guess the second question is how fast the link between the two coasts is. It might be too slow to copy things over.
Dror, I'm afraid that snapshots cannot be used outside of a region, so you can't create a volume in us-east from a snapshot in us-west. The strategy we've used is to run a database slave in a different region and use db replication. The remote instance can be smaller as long as it can keep up with replication. Of course, if you run different software your solution will vary. Another option is to use LVM so you can take local snapshots and then rsync the data to a remote region or upload it to S3. The data bandwidth is virtually infinite, but of course the latency isn't, and the depth of your purse may be limited as well. Most likely the tools you use to copy the data will be the bottleneck, plus the impact on the application that needs to continue running.
Thorsten, I understand that the snapshots can't be used across regions. Let's imagine that I was a customer affected by the outage. How practical would it have been to copy my snapshot from us-east to us-west and boot my image in us-west? My images are 8 GB, so if the bandwidth is good it should take less than an hour. I then just boot the new images, change the IPs/DNS, and I'm in business.
The problem here is that you can't access a snapshot other than by creating a volume from it. So, during this particular outage, since the EBS API was unavailable, it is very likely that you wouldn't have been able to create a volume in us-east to serve as the source for your copy. I think that you really have to look at the cost of a few hours of downtime and consider running a (possibly small) server in the other region and regularly rsyncing the data set over. Since it's incremental, the bandwidth cost may be acceptable, and a small instance or even a tiny one may be sufficient. If a few hours of downtime are "cheaper", then sitting tight is the best thing to do. My 2 cents, which matches the Netflix strategy, is that all fail-over systems fail. Unless it's active-active it's bound to fail.
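To make that concrete, here is a minimal, illustrative sketch of the periodic cross-region rsync idea (the standby hostname, key path, and directories are hypothetical; it assumes rsync and ssh are installed on both ends). For a live database you would sync from an LVM snapshot or a paused replica so the copy is consistent, and in practice you would schedule this from cron rather than a loop.
```python
# Illustrative sketch only: periodically rsync a data directory to a small
# standby instance in another region. Host, key, and paths are placeholders.
import subprocess
import time

SOURCE_DIR = "/var/lib/app-data/"                 # trailing slash: sync contents
STANDBY = "backup@standby.us-west-1.example.com"  # hypothetical standby host
DEST_DIR = "/var/lib/app-data/"
SSH_KEY = "/home/app/.ssh/standby_rsync_key"      # hypothetical key path

def sync_once() -> None:
    """Push an incremental copy to the standby; only changed data is sent."""
    subprocess.check_call([
        "rsync", "-az", "--delete",
        "-e", f"ssh -i {SSH_KEY}",
        SOURCE_DIR, f"{STANDBY}:{DEST_DIR}",
    ])

if __name__ == "__main__":
    while True:
        sync_once()
        time.sleep(4 * 3600)  # every 4 hours, matching Dror's snapshot schedule
```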
"The problem here is that you can’t access a snapshot other than by creating a volume from it" That's the part I was missing. For some reason I had the impression that the snapshot simply reside on S3, and I should be able to copy them over. On second thought, I'm clear why that doesn't make sense. So now I'm on board with the idea of doing an application level rsync. At this point in time, it's probably an overkill for our needs, our current strategy is "good enough." When hit the million user mark we'll revisit :-).
How practical would it have been to copy my snapshot from the us-east to us-west and boot my image in us-west. My images are 8 Gig so if the bandwidth is good it should take less than an hour. I then just boot the new images, change the IPs/DNS, and I’m in business.
As Thorsten pointed out, there's no way to copy over an EBS snapshot, so you need to copy things at the OS level using a tool like rsync. It is something that's missing in the AWS toolset, but I can see why it's on the "do later" list. Snapshots are low-level entities, so it's somewhat complicated to expose them to the application.
A good lesson for those with data in the cloud ... better get to work on a plan B, because it seems we cannot avoid outages.
So many came away from Amazon’s outage completely enraged, and understandably so. However, there are some important lessons we can take from this experience. For one thing, we now know that outages are inevitable and anyone with data in the cloud needs to prepare sound disaster recovery and failover strategies.
I created a snapshot and they allowed me to transfer it to the west coast division, because of the outage. It was definitely very disappointing.
For me it works less well because my archives are really big and it takes time to send archives to my clients... Cheers, Bill
This is the second big outage of EC2. The first was an operational mistake, and this one will be blamed on the weather. However, if you put your app in the cloud, then the SLA may be in the cloud too.
