RightScale Blog

Cloud Management Blog

AWS Outage Lessons Learned: If Netflix Can Suffer, So Can You

On Christmas Eve and continuing into Christmas Day, AWS had a “Service Event” centered on the ELB (Elastic Load Balancing) service in the US-East region. Although only a small percentage of ELBs were functionally disabled and unable to route traffic to their backend servers, all ELBs in the region experienced a time interval in which they could not scale, nor could changes be made to their configuration, such as adding or removing backend servers from the load balancing pool (for the full details, see the AWS post-mortem).

Many AWS customers were affected at some level – a few more negatively than others – as every ELB in the region was involved to some degree. Among these AWS customers was Netflix, whose Cloud Architect Adrian Cockcroft described the Netflix view on the outage in his blog post.

This brings me to an interesting point that many in the IT community (including Netflix) are discussing in regard to this outage: the benefits of a multi-region setup and how such a configuration can help in these situations.

Don’t Get Seduced by Vendor-Specific Solutions

While we here at RightScale are big believers in using multiple regions for disaster recovery purposes, in many situations a multi-region configuration is overkill for day-to-day production operations. Many of our recommended best practices for production deployments can be found in my white paper, Building Scalable Applications In the Cloud: Reference Architecture and Best Practices.

The vast majority of RightScale customers run their production systems in a single region, and we advise them to avoid vendor-specific tools in order to reduce the potential for hidden dependencies that these tools may introduce. A key takeaway from this recent AWS outage is that while it did affect the entire US-East region, it only affected a single vendor-specific service in the region – the Elastic Load Balancing service.

If your website had been using a different instance-based load balancing solution (HAProxy, nginx, etc.) you would have been totally isolated from this failure and seen no service impact. In a previous blog post I provided some tips for fine-tuning your cloud architecture, of which Tip #9, “Be Wary of Cloud Lock-in,” cautions in part:

The use of vendor-specific tools and virtual appliances may make deploying an application easier in the short term, but many times these services are integrated or tied into other services that can result in cascading outages if one of these underlying services suffers a service disruption. The use of vendor-neutral solutions insulates your application tiers from these service integrations, as well as creating a cloud-portable solution…

Using a vendor-specific tool such as ELB makes things easier during the setup phase, but it has two drawbacks. First, it locks you into the vendor, since that tool is not available from other cloud providers. Second, and more importantly in this case, it can result in a service disruption due to a cascading effect. The ELB state data that was inadvertently deleted by a manual mistake (more on this later) was relevant to only a very small percentage of ELBs, yet the entire ELB infrastructure was affected in some way: for a period of time, no ELB in the region could scale or accept configuration changes.
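To make the idea of an instance-based load balancer more concrete, here is a minimal Python sketch of the rotation logic that such a balancer performs. It is not HAProxy or nginx, and it is not how ELB works internally; the class name BackendPool and the backend addresses are hypothetical, chosen only for illustration. The point is that pool membership and health state live in a process you run yourself, so adding or removing a backend does not depend on a provider-side control plane.

```python
from dataclasses import dataclass, field


@dataclass
class BackendPool:
    """Round-robin pool of backend addresses (illustrative sketch only)."""

    backends: list[str]
    unhealthy: set[str] = field(default_factory=set)
    _cursor: int = field(default=0, init=False, repr=False)

    def add(self, backend: str) -> None:
        """Add a backend to the pool if it is not already present."""
        if backend not in self.backends:
            self.backends.append(backend)

    def remove(self, backend: str) -> None:
        """Remove a backend from the pool and forget its health state."""
        if backend in self.backends:
            self.backends.remove(backend)
        self.unhealthy.discard(backend)

    def mark_down(self, backend: str) -> None:
        """Record that a health check failed for this backend."""
        self.unhealthy.add(backend)

    def mark_up(self, backend: str) -> None:
        """Record that this backend is passing health checks again."""
        self.unhealthy.discard(backend)

    def next_backend(self) -> str:
        """Return the next healthy backend in rotation."""
        # Try each backend at most once per call, skipping unhealthy ones.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._cursor % len(self.backends)]
            self._cursor = (self._cursor + 1) % len(self.backends)
            if candidate not in self.unhealthy:
                return candidate
        raise RuntimeError("no healthy backends available")


# Hypothetical backend addresses, for illustration only.
pool = BackendPool(["10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"])
```

Calling pool.next_backend() repeatedly cycles through the three addresses, and after pool.mark_down("10.0.0.2:80") that backend is skipped until it is marked up again. A production load balancer adds real health checks, connection handling, and redundancy of the balancer itself, none of which is shown here.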

The Benefit of Using Loosely Coupled Components

One of the mantras that we preach in the world of cloud architecture best practices is to use loosely coupled components. This outage illustrates that what might appear to be a standalone component may in actuality have tight integrations with other infrastructure behind the scenes, and thus the degradation of one aspect of the system can have a cascading effect on other components. Netflix has done a lot of amazing things with regard to its cloud automation tools (take a look at some of the company’s open-source projects – Simian Army in particular is awesome), but its continued dependence on AWS-specific components has always been a mystery to me.

Regarding the root cause of the outage, GigaOm provided a thoughtful take on what happened and Netflix’s subsequent response. It is noteworthy that the original issue was caused by a developer manually running a process that was “currently being automated,” because we advise our customers to automate all processes that touch production systems. (And we do practice what we preach at RightScale in running our own complex web of interconnected systems.)

Any automated process that requires a manual “kick-off” (such as what I am assuming might have been the case in the AWS ELB situation) should be subject to both access controls (the “who” that can do it) and logging/auditing (the “what” that was done and the “when” it was done). Manual errors are to be expected in any environment because humans are imperfect beings, so we should relinquish as much control to our automated systems as possible. The only mistakes they make are the ones we tell them to make. :)
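As a rough illustration of the “who,” “what,” and “when” described above, the following Python sketch wraps a manually triggered maintenance task in an access check and an audit log. Everything in it is an assumption made for the example: the operator names, the audited decorator, and the prune_stale_state task are hypothetical and do not describe AWS’s or RightScale’s actual tooling.

```python
import functools
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("ops.audit")

# Hypothetical allow-list: the "who" that may kick off production tasks.
AUTHORIZED_OPERATORS = {"alice", "bob"}


class NotAuthorized(PermissionError):
    """Raised when an operator is not allowed to run a task."""


def audited(action):
    """Require an authorized operator and log who ran what, and when."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(operator, *args, **kwargs):
            when = datetime.now(timezone.utc).isoformat()
            if operator not in AUTHORIZED_OPERATORS:
                audit_log.warning(
                    "DENIED who=%s what=%s when=%s", operator, action, when
                )
                raise NotAuthorized(f"{operator} may not run {action}")
            audit_log.info(
                "START who=%s what=%s when=%s", operator, action, when
            )
            try:
                result = func(operator, *args, **kwargs)
            except Exception:
                # Record the failure with its traceback, then re-raise.
                audit_log.exception(
                    "FAILED who=%s what=%s", operator, action
                )
                raise
            audit_log.info("DONE who=%s what=%s", operator, action)
            return result

        return wrapper

    return decorator


@audited("prune-stale-lb-state")
def prune_stale_state(operator, dry_run=True):
    """Hypothetical maintenance task; defaults to a harmless dry run."""
    return "dry-run" if dry_run else "applied"
```

With this in place, prune_stale_state("alice") runs and is logged, while a call from an operator outside the allow-list is logged as denied and raises NotAuthorized. Defaulting to a dry run is a design choice in the sketch: a destructive action then requires an explicit dry_run=False, which is one more deliberate step between a human and production data.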

Highly Available, Resilient Systems Are the Answer

This AWS outage serves as a good illustration of why “best practices” are rightly named. When shortcuts are taken or oversights creep into the mix, what may appear at first blush to be a small, localized issue can potentially cascade into a bigger problem and adversely affect your infrastructure and the applications that depend on it. On the surface, individual components can appear to be distanced from those that are suffering the outage or performance degradation, but in reality they can also be potential points of failure.

The tools are out there to build highly available, resilient systems. How effectively you use these tools will dictate your tolerance for the infrastructure service disruptions that will continue to occur, however infrequently. To try some of these vendor-neutral solutions, user permission features, logging and auditing mechanisms, and more, get a free trial of RightScale Cloud Management.

Comments

Brian, > The use of vendor-neutral solutions insulates your application tiers from these service integrations, as well as creating a cloud-portable solution… How can the "vendor-neutral solutions" be significantly more resilient than those of AWS? They sure are more work to both deploy and manage, plus when they do go wrong, you're on your own. Or am I missing something?
Response by Brian Adler   Ι   January 08, 2013   Ι   09:22 AM
Hi Dmitri, Kyle provides an excellent answer to your question (thanks Kyle!), but I will add just a bit more color as well. The ELB system has dependencies on AWS services (most notably EBS as Kyle mentions and as is described here in the post-mortem of one AWS' "cascading" outages: https://aws.amazon.com/message/680342/ ). If you use a vendor-neutral, instance-based load balancer (like HAProxy, Nginx, etc.) an outage of the EBS service will not affect your load balancing capabilities. And if you ever decide that you want to move your infrastructure to another provider, a software load balancer can move with you, while an ELB cannot.
Brian, Thanks for the answer. I do appreciate that ELB has EBS dependencies, and what it means for its reliability. My original question is a bit more general, rather than ELB-specific, and probably boils down to the number of instances of a particular functionality (LB in our case) you need. If a number of LBs needed is also small (read: manually manageable), then you're better off with a simpler system. However once you start dealing with a larger number, you will likely need to involve/develop a system for automated management of your LB instances, and this is where the benefits of home-grown vs. one provided by your service provider will start to become more questionable. Hope this makes sense.
Not sure why I can't reply to Dmitri directly here (too many levels?), but I disagree with the premise that if your deployment is small, you're justified in using single-point-of-failure solutions. I agree that, given two identical solutions, a simpler solution is better, but in this case, the simplest solution (ELB) is also the solution with a higher rate of failure. If that higher rate of failure is within your business continuity/disaster recovery guidelines, then fine, but simplicity alone cannot justify the business decision to use ELB over a more complicated solution that enables better HA.
I think what he is trying to illustrate is that you ultimately have control over vendor-neutral solutions. Amazon's Elastic Load Balancer utilizes Amazon's Elastic Block Storage system, so by using ELB you necessarily increase the complexity of your traffic management solution. If you can quickly deploy a software load balancer, including state, to an ephemeral node with configuration management, then you can control how long it takes to recover and you no longer have tight coupling with EBS.
"But I would also like to know why you would not consider Rackspace" I meant "Rightscale", sorry for that
Posted by Damian Traverso (not verified)   Ι   January 08, 2013   Ι   07:44 AM
You can apply the same "only use IaaS" logic with a service like RightScale: use RightScale for its cloud-agnostic API and monitoring, and use Chef or Puppet for configuration management. Even use RS for autoscaling--that should be fairly easy to move to a competitor if necessary.
Hi Damian, I think Joe hit the nail on the head -- if vendor lock-in to RightScale is a concern, then just use the RS tools that you could find or develop elsewhere. Everything that RS provides could be done in other environments and in other ways -- we have just done all the hard work and heavy-lifting so you don't have to :) Some of our customers use just the RightScale API or just the monitoring and alerting system, while others are "all-in" and use the full gamut of the RightScale platform. Autoscaling is indeed a very popular and common feature that customers rely on RightScale for, but as Joe mentions, other companies provide autoscaling tools as well as some of the IaaS providers themselves, so there are other options. Granted, I am a bit biased and I think our offerings are the easiest to use and the most encompassing, but other options do exist.
Posted by Brian Adler   Ι   January 08, 2013   Ι   09:36 AM
Response by Damian Traverso (not verified)   Ι   January 08, 2013   Ι   09:56 AM
That makes sense, thank you both for your replies :)
Nice post. I fully agree with you that using only the basic services (IaaS) from any Cloud provider will provide more resilient systems than using IaaS + SaaS (like ElastiCache, RDS, ELB). But I would also like to know why you would not consider Rackspace another type of vendor lock-in. Even though it supports multiple clouds, the customer would still depend on RS for features like autoscaling and new instance provisioning. Just for the records, I've used Rightscale for about a year, and I can say it is an excellent and powerful tool.
Posted by Damian Traverso (not verified)   Ι   January 08, 2013   Ι   07:43 AM
For those of us who don't like HAProxy-based solutions--because they ultimately rely on round-robin DNS and add complexity and more things to worry about--any thoughts about Cedexis? We'll be evaluating them fairly soon, but would be interested in any comments from existing users. (For those unfamiliar with Cedexis, it's like ELB, but not connected to a particular cloud, and with way more features).
