A week after the April 21, 2011, outage, AWS posted a detailed post-mortem explaining what happened. It will be interesting to see how everyone digests the very detailed account. Since AWS did not provide an executive summary, I'll try my hand at one:
The outage was triggered by an operator error during a router upgrade that funneled very high-volume network traffic onto a low-bandwidth control network used by EBS (Elastic Block Store). The resulting flooding of the control network effectively isolated a large number of EBS servers from one another, which broke volume replication and caused these servers to start re-replicating their data to fresh servers. This large-scale re-replication storm had two effects: it failed in many cases, causing the volumes to go offline pending manual intervention, and it flooded the EBS control plane with re-replication events that affected its operation across the entire us-east region.
The steps taken by AWS to regain control started by stopping the re-replication attempts to quiesce the system and prevent new volumes from being drawn into the outage. AWS then isolated the affected availability zone from the EBS control plane to restore normal operation in other zones. Finally, AWS started to recover volumes by adding storage capacity to allow the re-replication to succeed where possible, by restoring data from snapshots on S3, and finally by manually restoring data. Ultimately 0.07% of the volumes could not be restored to a consistent state.
The Relational Database Service (RDS) was also affected by the outage. 45% of single-availability-zone databases in the affected availability zone went down because each database server stripes data across multiple EBS volumes, with the result that one stuck volume halts the entire database. A number of multi-AZ RDS databases whose master server was in the affected zone failed to fail over because of a bug in the fail-over process.
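The availability cost of striping can be made concrete with a back-of-the-envelope calculation. The sketch below uses a made-up per-volume availability figure (99.9% is my illustrative assumption, not an AWS number): when a database stripes across N volumes and any one stuck volume halts it, the database is only up when all N volumes are up simultaneously.

```python
# Back-of-the-envelope: striping across N EBS volumes, where any single
# stuck volume halts the database, multiplies the failure exposure.
# The 99.9% per-volume availability below is an illustrative assumption,
# not an AWS figure.

def striped_availability(per_volume_availability: float, n_volumes: int) -> float:
    """The database is up only if every one of its N volumes is up."""
    return per_volume_availability ** n_volumes

for n in (1, 4, 8):
    print(f"{n} volume(s): {striped_availability(0.999, n):.4%}")
# With the assumed 99.9% per volume, 8 stripes drop availability to ~99.2%.
```

Striping buys throughput at the cost of a larger failure surface, which is exactly the trade-off that bit RDS during this outage.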
The post mortem lists a number of system improvements that AWS is working on. These primarily target improving the resiliency of EBS when replication fails, as well as improving the tools created and used during the outage to recover from the situation. Customer communication improvements, especially regarding the frequency of updates, are also listed, and AWS is crediting affected users a significant fraction of this month's charges, which goes well beyond anything covered in its SLAs.
It is interesting to see how a network configuration error caused such a chain reaction within the EBS system. The outage trigger really is pretty incidental; a similar set of events could probably have been set off by something else as well. The measures taken by AWS to contain and repair the outage highlight the deep technical expertise and full mastery of the entire software and hardware stack at AWS. Clearly, deep code changes were made and sophisticated recovery tools were written 24x7 under the pressure of the outage, without which the situation most likely would have spun completely out of control.
The impact of the outage, the public reaction, and the measures necessary to control it show the scale at which AWS operates. It is pretty clear that this type of outage is part of growing the service to unprecedented scale. I find it amazing that this type of outage, where the sophisticated systems necessary to provide cloud computing at scale fail massively, hasn't happened years earlier. This is a testament to AWS's sophistication.
The outage summary exposes interesting technical details about the architecture of the services that AWS has kept confidential until now. However, more than providing information to competitors, I believe it provides education to cloud customers. All cloud providers who are planning world-wide cloud roll-outs absolutely must understand the power of and the need for availability zones in a region and isolation between regions (or equivalent constructs to "differentiate" from AWS). It has now become crystal clear that without that redundancy and isolation there is no good answer to the question "how can we sell that to customers?"
An aspect of EBS durability which is not often mentioned is the role of snapshots during recovery. The EBS product description states "the durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot." Here's what this means. Suppose there are two copies of the volume (i.e., mirroring) and one fails; then a fresh mirror can fetch data contained in snapshots from S3 (which is itself replicated) but must retrieve all other data from the single remaining copy, which may itself fail or become unreachable. Sadly, the performance impact of taking a snapshot is such that most of our customers with high-volume databases cannot snapshot the master DB volume. Please fix that, AWS!
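The quoted statement can be illustrated with a rough model (my own simplification, not AWS's actual durability math): once one mirror is lost, only the blocks changed since the last snapshot depend on the single surviving copy; everything already captured in the snapshot can be re-fetched from S3.

```python
# Rough illustration of why snapshot freshness matters for EBS durability.
# Assumption (mine, not AWS's documented mechanism): after losing one mirror,
# data already in the last S3 snapshot can be restored from S3, while blocks
# changed since the snapshot exist only on the single surviving copy.

def data_at_risk_gb(volume_size_gb: float, pct_changed_since_snapshot: float) -> float:
    """GB that would be lost if the last remaining copy also failed."""
    return volume_size_gb * pct_changed_since_snapshot / 100.0

# A 1 TB volume with 20% churn since the last snapshot leaves 200 GB
# exposed to the loss of the single surviving replica.
print(data_at_risk_gb(1000, 20))  # → 200.0
```

This is why both volume size and the fraction changed since the last snapshot appear in the durability statement: a bigger volume or a staler snapshot means more data riding on a single copy during re-mirroring.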
An item missing from the remedies list, in my opinion, is EBS performance improvement. Better performance would have helped in the outage. Specifically, I'd like AWS to reduce the impact of snapshots on volume performance so customers can actually snapshot high-volume servers, and to improve the performance of volumes so customers don't have to stripe across multiple volumes, which reduces availability (as it did with RDS).
I am also not satisfied with the communication improvements AWS proposes. I was fine with the frequency of status updates because it was clear that the EBS team was on top of it and didn't have much new to report. I would like to see improved responsiveness, so we don't have to open a ticket before something shows up on the status page. But foremost, I would like better content in the status updates. I'd like to be constructive, so I'll make it concrete. Here is some of what I would have liked to see (I naturally have to make some assumptions about what was concluded when within AWS):
- explicit mention that the initial network event was contained; the status updates kept talking about "increased latencies", which made it unclear whether there was a general ongoing network issue
- clear statement that the outage revolved around EBS and noting the impact on launching servers from EBS images, but also stating that there was no impact on servers not using EBS
- clear statement that certain API calls were disabled instead of vaguely referring to "increased error rates affecting EBS CreateVolume API calls"
- timely reporting, e.g., the post mortem states "by 5:30 AM PDT, error rates and latencies again increased for EBS API calls across the Region" while the status updates only mentioned this at 7am
- the fact that the outage was due to failed EBS volumes, as opposed to mere connectivity or latency issues accessing the volumes, was only reported at 8:54am, yet this is a crucial piece of information
- the status updates never made it clear that EBS volumes continued to fail after the initial event, nor did they mention when this infection was halted
- the isolation of the other availability zones from the "affected one" was reported several hours after it was put in place
- it would have been useful to see some relative numbers, such as % of volumes deemed operational, % being recovered automatically soon, % slated for later manual recovery; best would have been emails to users with specific volume IDs
I'm sure that some of the items above weren't quite as obvious at the time, and in the heat of the moment it's always difficult to determine what to say. But there is no question that the status updates were filled with vague terms, such as "increased latencies", "moderate increase in error rates", "affected availability zone", "a network event", etc. Perhaps most importantly, it was not until 8 hours after the onset of the outage that AWS made it clear that volumes in the affected zone weren't going to return to normal for hours to come. Up to that point it seemed that everything could return to normal any minute. This lack of clarity made it much harder for users to make the right decisions promptly.
On the public-reaction front, while I understand it, I'm still baffled by reporters portraying the loss of 0.07% of volumes as a fundamental problem. This is equivalent to complaining about users losing data because their RAID array failed (which happens all the time, from operator error to a 6-foot drop in an earthquake). Users who lost data and were not aware of the risk they were taking need to seriously reflect on what they're doing (and get help as appropriate).
This episode provides a key lesson to all cloud companies regarding architecting to withstand failure, and communicating with customers when failures do occur. While RightScale got through the outage relatively unscathed, we are working to improve on both those fronts ourselves. And we intend to continue to work with customers to enable AWS as well as other providers with independent, best-practice solutions that are resilient and highly available.