RightScale Blog

New EC2 Instance Types and Coordinated Failures in the Cloud

Amazon announced a new series of instance types today that do not have any local storage, which got me thinking about some of the failure modes we've seen. Specifically, AWS released the "m3" series of instances with 15 or 30 GB of memory, 4 or 8 cores, and no local disk storage, which means that the root volume (from the image) and all additional data storage must use EBS volumes. The move to servers without local disks has been expected for a long time and is no surprise, given that mechanical spindles must be one of the top causes of hardware failure. Putting disks into redundant storage systems, such as EBS, improves durability and manageability, which are good things.
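
To make the EBS-only setup concrete, here is a minimal sketch of launching such an instance so that both the root volume and an extra data volume live on EBS. It uses the boto3 library purely as an illustration (not the tooling discussed in this post), and the AMI ID, device name, and volume size are hypothetical placeholders.

```python
# A hedged sketch: assumes boto3 is installed and AWS credentials are
# configured; "ami-12345678" is a placeholder for a real EBS-backed image.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-12345678",     # placeholder EBS-backed AMI
    InstanceType="m3.xlarge",   # no local (instance-store) disks
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {
            # An additional EBS data volume; the root volume also comes from EBS.
            "DeviceName": "/dev/sdf",
            "Ebs": {"VolumeSize": 200, "DeleteOnTermination": False},
        }
    ],
)
print(response["Instances"][0]["InstanceId"])
```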

What got me thinking is that the move toward EBS-only instance types seems to make it more difficult to run highly available distributed services. The design of distributed systems, such as replicated data stores, assumes that servers fail relatively independently of one another: if a server holding a piece of data fails, it is very unlikely that the other server(s) holding replicas of the same data fail at the same time. To account for failures that take down many servers at once, replication has to span the likely failure clusters, which is what Amazon's availability zone concept is intended to enable.
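
To make the independence assumption concrete, here is a minimal sketch (not from the original post) comparing the chance of losing every replica when servers fail independently versus when they share a failure domain such as one rack or one storage subsystem; the probabilities are hypothetical placeholders chosen only to show the effect.

```python
# Illustrative only: the failure probabilities below are hypothetical placeholders.

def p_all_lost_independent(p_server: float, replicas: int) -> float:
    """Probability of losing every replica when servers fail independently."""
    return p_server ** replicas

def p_all_lost_correlated(p_server: float, p_shared: float, replicas: int) -> float:
    """Adds a shared failure domain (e.g. one rack or one storage subsystem)
    that takes out the whole replica group when it fails."""
    return p_shared + (1 - p_shared) * p_server ** replicas

if __name__ == "__main__":
    p_server, p_shared, replicas = 0.01, 0.001, 3
    print(p_all_lost_independent(p_server, replicas))           # 1e-06
    print(p_all_lost_correlated(p_server, p_shared, replicas))  # ~1e-03, dominated by the shared domain
```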

Instances using EBS storage volumes are tightly coupled to the EBS system, and it's difficult to avoid the fact that EBS was involved in most of the large-scale outages in EC2 in recent times. This means that EBS represents a potential source of coordinated failures that one has to take into account when deploying highly available distributed systems. EBS is probably the most criticized part of EC2, so this realization gave me pause. To sort out my thoughts, I started to list some relevant observations so I could balance them against each other:

  • several memorable outages started within EBS and took down a large number of servers as a result
  • servers within an availability zone can be in the same rack, on the same UPS, or attached to the same router(s), all of which are also sources of coordinated failures
  • EBS is a distributed storage system rather than a monolithic one, so different instances are likely to be connected to different storage servers/subsystems
  • AWS has clearly made progress in isolating availability zones at the EBS level: the EBS issues last week did not propagate across zones (although some other issues did ripple through), but it would be great if AWS provided information on how earlier issues have been resolved
  • any good replication set-up must ensure that data is replicated across zones and not just within a zone; this is true whether one uses EBS or not
  • a typical replication factor is 3 and sometimes higher, which means that multiple replicas have to share a zone when one uses one of the smaller regions that have only two zones (see the sketch after this list)
  • at the end of the day, for critical data/services, DR replication to a different region or cloud provider is required
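
As an illustration of the replication-factor point above, here is a minimal sketch (not from the original post) of round-robin replica placement; the zone names are placeholders, but it shows that with only two zones and a replication factor of 3, two replicas necessarily end up in the same zone.

```python
# Illustrative only: zone names are placeholders.
from itertools import cycle
from typing import List

def place_replicas(zones: List[str], replication_factor: int) -> List[str]:
    """Assign each replica to a zone, spreading them as evenly as possible."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replication_factor)]

if __name__ == "__main__":
    # A region with only two zones and a replication factor of 3:
    print(place_replicas(["zone-a", "zone-b"], 3))
    # -> ['zone-a', 'zone-b', 'zone-a']  two replicas share a zone
```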

After re-reading these observations I felt a bit better. By the time one follows best practices one should end up with a system that is as resilient to coordinated EBS failures as to other coordinated failures in an availability zone, so the use of EBS doesn't seem to introduce significant new factors.

There is another concern: consistent performance, which hasn't been a strength of EBS until the introduction of provisioned IOPS (the ability to reserve a certain minimum rate of I/O operations per second). Coordinated performance degradations are often much more difficult to troubleshoot and remedy than outright failures, where the failover decision is obvious. For this reason, I highly recommend the use of provisioned IOPS in distributed systems whose performance hinges on I/O.
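
For what provisioning IOPS can look like in practice, here is a minimal sketch; it uses the boto3 library purely as an illustration (not the tooling discussed in this post), and the zone, size, and IOPS figures are hypothetical placeholders.

```python
# A hedged sketch: assumes boto3 is installed and AWS credentials are configured.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an EBS volume with provisioned IOPS ("io1") so the data store gets a
# reserved minimum I/O rate rather than best-effort standard EBS performance.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",  # placeholder zone
    Size=100,                       # GiB, placeholder
    VolumeType="io1",               # provisioned-IOPS volume type
    Iops=1000,                      # reserved I/O operations per second, placeholder
)
print(volume["VolumeId"])
```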

An additional angle to consider is that the use of EBS volumes, and of provisioned IOPS in particular, increases the overall operating cost. In contrast, the 4 local disks of an m1.xlarge instance come with the base instance cost. It's perhaps an apples-to-oranges comparison in some cases, but the costs add up when launching clusters of servers.
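
As a back-of-the-envelope sketch of how those charges accumulate across a cluster (not from the original post; all prices are hypothetical placeholders, not actual AWS rates):

```python
# Illustrative only: prices are hypothetical placeholders, NOT actual AWS rates.

def monthly_ebs_cost(nodes: int, gb_per_node: int, piops_per_node: int,
                     price_per_gb: float, price_per_piops: float) -> float:
    """Extra monthly cost of EBS data volumes plus provisioned IOPS for a cluster."""
    per_node = gb_per_node * price_per_gb + piops_per_node * price_per_piops
    return nodes * per_node

if __name__ == "__main__":
    # 9 nodes with 400 GB and 1000 provisioned IOPS each, placeholder prices:
    print(monthly_ebs_cost(nodes=9, gb_per_node=400, piops_per_node=1000,
                           price_per_gb=0.10, price_per_piops=0.10))
```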

After summing everything up I feel better about the new instance types without local disks, but I have to admit that we'll continue to prefer the m1.xlarge instances with 4 local disks for our Cassandra (distributed NoSQL storage) clusters, if only because it's easier to reason about them and troubleshoot when something doesn't perform as expected.

If you have additional observations or thoughts about this topic, please do leave a comment so we all learn more! You can also talk to us about these topics at the AWS re:Invent conference later this month.

Comments

Thorsten: "If you don't deploy across AZs for cost reasons then you have lowered your survival chances long ago before even talking about EBS." With EBS going down every 3 months, that won't be the experience of many. You're saying the last EBS outage didn't affect any other zones, but for me it did (could be pure coincidence of course; I had a slave server that had to be rebooted before it worked again), and there were more people complaining about this on the forums. Traditional hardware is pretty reliable and easily lasts 3-5 years without any downtime. Your average data centre can also last 4-5 years before a significant outage (and then it's usually multi-day; see Sandy, but there are numerous other examples). My point is that I suspect many sites have seen more multi-hour outages with AWS than before, more than they were used to perhaps. We probably should offset that against the ability to auto-scale, so you're usually not going down due to traffic volume. Is AWS worth it? That's the question I'm struggling with. Obviously I never want to be in a single-data-centre hosting environment ever again (I've had my multi-day outages; Sandy is another example). PS: my estimate is that cross-AZ traffic adds 20% to the cost.
This seems similar to SAN booting, which has been utilized for quite some time. The nice thing is the ability to replace a failed server with similar hardware, attach it to the disks, boot, and you're right where you left off.
Posted by Michael Peterson (not verified)   |   November 02, 2012   |   03:58 PM
Michael, you're correct that this is just the same as SAN booting. It has the same fun implications that if your SAN goes down so do all your replicas of a distributed service. SAN aficionados would claim that "the" SAN really is composed of multiple failure-independent parts. AWS says that EBS consists of many independent parts. We now try to sort all this out :-).
Thorsten: "By the time one follows best practices one should end up with a system that is as resilient to coordinated EBS failures as to other coordinated failures in an availability zone, so the use of EBS doesn't seem to introduce significant new factors." You'd think so? Netflix is basically the only one who manages to stay up when there's an AWS failure, and they have now come out saying they don't use EBS for critical storage. I myself have also started to move away from EBS. So I think this is a step backward. Independent failure is much better, because you can handle that with replication. EBS failure is crippling for an entire AZ and often across AZs (ELB goes down, EIP doesn't work). Secondly, cross-AZ traffic is highly expensive. Unless Amazon drops the cross-AZ cost by a factor of 10, cross-AZ sites are not within reach of many. So I would conclude this will increase the number of widely noticed failures, as there are fewer options, and newer folk believe the EBS hype (more reliable than disk blablablah) without noting that EBS availability is a significant problem.
Berend, your points are very valid and close to the initial thoughts I had. When I laid out the observations I did realize, however, that the recent ELB failure did not per se extend across AZs and that booting from EBS doesn't really add a new correlated point of failure. If you don't deploy across AZs for cost reasons then you have lowered your survival chances long ago, before even talking about EBS. That's a business choice--valid in some cases, foolish in others. I'm surprised that you bring up the cross-AZ charges. Sure, I'd love for them to be lower or zero, but at the same time, I've not seen them be a dominating cost factor. By the time you decide to replicate you are accepting a bunch of 2x cost factors, if not more, and generally the cross-AZ traffic cost is low in comparison. Of course I'm sure there must be a counter-example out there. I do disagree with your statement that only Netflix manages to stay up. There are many other sites that do as well; they're just not so vocal about it. In the end it's a cost/benefit analysis: in the case of Netflix the benefit of staying up is very high, so they can spend $$ on engineering that, and they like to talk about it. That doesn't mean others are not doing good work too.
