RightScale Blog

Cloud Management Blog

Lessons Learned from Recent Cloud Outages

Posted by Uri Budnik | February 27, 2013

Outages happen, and they happen everywhere. Whether you leverage a public cloud, a hosting provider, or your own data center, infrastructure downtime is inevitable. Equipment breaks or does not function as expected, software bugs slip by, natural disasters occur, and unforeseen situations lead to unexpected consequences. Sometimes services are degraded and sometimes complete data centers go dark.

In the last few months a number of public cloud outages have raised the question of whether the cloud is reliable enough to run business-critical environments. To try to answer that question, we looked at some data points on recent outages.

Cloud Outages Happen

We scoured the Internet for publicly reported data center and cloud outages in 2012 and found 27 notable ones; many more likely went unreported because they affected only a single company. We classified the outages into four categories: public cloud (IaaS), SaaS, third-party hosting provider, and private corporate data center. Of the outages we found, 26 percent were in public clouds and 67 percent were in private data centers or at hosting providers. Power loss, Hurricane Sandy, and other natural disasters were the biggest culprits.

What do we learn from these numbers? First, any data center will eventually fail, whether it is a public cloud, a private corporate data center, or a third-party hosting provider. Any single piece of equipment can fail, and cascading events can sometimes make an entire data center unavailable. According to a report from the Uptime Institute, the average data center experiences one major outage and three partial outages each year.

Second, when outages do occur, the impact can be significant. In the outages we studied, the average downtime was 7.5 hours. When we looked at just the public cloud outages, the average downtime was only marginally higher at 7.7 hours. Either way, the business consequences can be severe, including significant lost revenue.

The key takeaway is the need to architect your applications to stay up even when your cloud or data center isn’t.

Companies have long implemented high availability (HA) features and disaster recovery (DR) processes. However, the advent of cloud computing now demands new approaches and offers new options for outage-proofing your applications.

Cloud Extends the Concept of HA

First let’s take a short look at what it took to set up an HA architecture before the cloud. The basic idea is to have n+1 units of every type of component so that no single point of failure can bring down your environment. In a classic three-tier architecture in a data center, that means dual routers, dual switches, dual load balancers, at least two application servers, and two database machines, each with RAID storage. You would also want redundant power drops into your data center and two network connections.

Now let’s compare this to what you can do with cloud computing. Most public clouds let you launch instances in multiple regions and availability zones (AZs). Each region is in a discrete geographic location, such as the east coast of the US or Europe. Individual AZs live within each region, and each AZ is meant to be completely isolated from all the others (a shared-nothing architecture) so that a failure in one should not affect the rest, while remaining only single-digit milliseconds of network latency away. This structure means you can design HA architectures that are resilient to data center outages without the significant cost and complexity of managing every aspect of the physical infrastructure yourself.
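As a quick illustration, here is a minimal sketch using boto3, the AWS SDK for Python (which postdates this post), that enumerates the AZs available in one region; the region name is an example, and AWS credentials are assumed to be configured:

    # Minimal sketch: enumerate the availability zones in one AWS region.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], zone["State"])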

A modest yet fairly redundant HA environment might run in a single AWS region but span two AZs. Think about what it would take to build a comparable environment with your own infrastructure: two complete and separate data centers with great network performance between them, plus a VPN.
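To make that concrete, here is a minimal sketch (again with boto3; the AMI ID, instance type, and zone names are hypothetical placeholders, not recommendations) of launching one application server into each of two AZs in the same region:

    # Minimal sketch: launch one app server in each of two AZs in one region.
    # The AMI ID, instance type, and zone names are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    for az in ("us-east-1a", "us-east-1b"):
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )

Put a load balancer in front of both zones and the environment can survive the loss of either one.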

Thanks to cloud computing, environments like this are no longer difficult to build. To provision a server in a different availability zone, you simply choose an item from a drop-down menu. Contrast that with ordering equipment, shipping it to a distant data center, and taking a long drive or flight to install it. Many companies take shortcuts in their HA/DR architecture because of the cost and effort involved. Leveraging the cloud can reduce those disaster recovery costs and increase the uptime you can expect from your environment.

Most of us are creatures of habit, and we tend to go back to what we already know. Maybe this is why you keep hearing about companies that were impacted by an AWS outage because they relied on a single AZ. I have talked with dozens of companies in the last few years, and I can confirm this is a common error. I think that is because many companies don’t realize how cloud computing can reduce the previously prohibitive cost of creating truly HA environments with geographic redundancy.

Mind you, there are also some very public success stories from companies that do not go dark when AWS has a problem and an AZ goes out. One of our customers at RightScale, 500friends, recently blogged about how they limited their downtime during a recent AWS outage by following best practices in their cloud architecture. If you are building an environment for the cloud, it’s time to re-examine old assumptions about the cost and effort of HA/DR and take advantage of the capabilities the cloud affords you. One of our senior cloud architects puts it like this: you can have great tools, but if you are not a good carpenter, a stiff wind will knock down your house. In the cloud, you cannot depend on the infrastructure layer alone, because the infrastructure is bound to fail eventually.

The real issue is not whether data centers will go dark or how long they will stay dark, but rather how prepared you are to deal with that eventuality. Cloud computing gives you a radically simpler, lower-cost option that traditional data centers can’t match. Learn how RightScale can help you manage your data and applications in the cloud.

Comments

[...] Despite cloud providers’ best-laid plans, outages do happen, with causes ranging from acts of God to human error and many points in between. Any [...]
WRT SaaS, the providers must own HA; the customers have to manage business continuity. Data restore to a point in time, across providers? It's not possible, from what I understand. What's the customer to do for that? Most answers seem to be 'remove the requirement'.
@Livers This is a complex question, but no, I don't think you need to remove the requirement; you seem to imply that it's not doable. "The customers have to manage business continuity": indeed! I have heard others say it more harshly: blame application architecture, not infrastructure, for failures. BTW, I assume that when you say "the providers" you mean the SaaS companies building on top of IaaS.

The exact answer depends very much on the particulars of your applications, but here is one idea. You can run your primary environment in one cloud and a 'warm' backup in a second cloud. By that I mean you keep a database slave node in the second cloud and replicate to it, so that you always have an up-to-date copy of your data there. If you are using tools like RightScale, you can use ServerTemplates so that the configurations of all your other app servers are ready to launch in the second cloud. This also means you are only running (and paying for) one server on your DR (cloud) site, yet you can launch the rest of your environment quickly if necessary. If you are using log shipping to sync to your remote DB slave node, you can keep track of where things broke down and recover from there between the two sites.

We recently had two of our senior guys speak about many different types of failure scenarios and how to plan for them in a webinar: http://www.rightscale.com/info_center/webinars/outage-proof-cloud-apps.php If you want to discuss your question specifically, we are always open to chat one-on-one. Just email sales@rightscale.com and we can get a sales engineer with multi-cloud experience to explain how we do things.
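To make the failover side of that warm-standby idea concrete, here is a minimal monitoring sketch in Python; the health URL, threshold, and interval are hypothetical, and the promotion step is just a placeholder for whatever your cloud or management APIs would actually do:

    # Minimal warm-standby failover sketch (hypothetical endpoints/thresholds).
    # Polls the primary site's health URL; after several consecutive failures
    # it triggers promotion of the standby environment in the second cloud.
    import time
    import urllib.request

    PRIMARY_HEALTH_URL = "https://primary.example.com/healthz"  # hypothetical
    FAILURE_THRESHOLD = 3      # consecutive failed checks before failing over
    CHECK_INTERVAL_SECONDS = 30

    def primary_is_healthy() -> bool:
        try:
            with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    def promote_standby() -> None:
        # Placeholder: promote the replica database to master and launch the
        # remaining app servers in the second cloud via your provider's API.
        print("Primary down: promoting standby site")

    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            promote_standby()
            break
        time.sleep(CHECK_INTERVAL_SECONDS)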
[...] ), 4. Business success (sharding, dedicated data stores; this takes considerable luck to achieve). http://t.cn/zYTsAR2 Some lessons learned from recent cloud infrastructure failures. Lessons Learned from Recent Cloud [...]
[...] For another perspective on this, Uri Budnik wrote a detailed post on the RightScale Blog titled, “Lessons Learned from Recent Cloud Outages.“ [...]
[...] RightScale this week investigated publicly reported data center or cloud outages in 2012, finding that no platform is immune to failure. In an interview with Nancy Gohring, RightScale CEO Michael Crandell stresses that outages are worst for the end-users:  “They’re magnified by the fact that often recovery time is longer,” he notes. [...]
Thanks for creating and sharing! I agree that there are many factors to consider in HA and DR: some cloud providers offer complete solutions that cover these concerns, while another option is to leverage your own data center combined with a service-provider solution for HA/DR. The beauty of competition and choice... and the importance of reputation and trust in this space. Would it be possible to see which 27 outages were used in the creation of the infographic?
