RightScale Blog

Cloud Management Blog

Setting Up a Fault-Tolerant Site with Amazon's Availability Zones

Amazon's Availability Zones are a fabulous new feature that lets users assign instances to locations that are failure-isolated from one another yet connected to each other with high bandwidth. I wish I could have done something like that as easily when I was responsible for operations at Citrix Online and we had five datacenters worldwide. As I'll explain in this post, what Amazon actually provides us is much better than just putting servers into multiple data centers.

The most confusing thing about Availability Zones is the name. In the cloud, what exactly is an "Availability Zone"? The easiest way to think about it is that a zone equals a data center. If power goes out in one data center and the generators fail to start (naah, that never happens), it doesn't affect the other data center. Or if there's a fire, one data center may burn down or be otherwise incapacitated, but others are unaffected. In reality, zones don't necessarily correspond to data centers. Given careful engineering, it's possible to have multiple "rooms" in a data center that are highly failure-isolated while technically still being part of the same data center; imagine rooms the size of football fields here.

The point of Availability Zones is the following: If I launch a server in zone A and a second server in zone B, then the probability that both go down at the same time due to an external event is extremely small. This simple property allows us to construct highly reliable web services by placing servers into multiple zones such that the failure of one zone doesn't disrupt the service, or at the very least, allows us to rapidly reconstruct the service in the second zone.
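As a rough sketch of what this looks like through the EC2 API, here is how the two launches might be scripted with the Python boto library. The AMI ID and instance IDs are placeholders, and credentials are assumed to come from the environment; the placement parameter is what pins each instance to a zone.

    import boto.ec2

    # Connect to the us-east-1 region (credentials are read from the
    # environment, or can be passed explicitly as keyword arguments).
    conn = boto.ec2.connect_to_region("us-east-1")

    # Launch one server in zone A and a second one in zone B.
    # The AMI ID is a placeholder; substitute one of your own images.
    res_a = conn.run_instances("ami-12345678", instance_type="m1.small",
                               placement="us-east-1a")
    res_b = conn.run_instances("ami-12345678", instance_type="m1.small",
                               placement="us-east-1b")

    print res_a.instances[0].id, res_b.instances[0].id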

There is no free lunch, however, and there are two caveats to consider when using multiple zones. First, there's the speed of light. The zones Amazon is exposing are all on the East Coast (indicated by the names, such as "us-east-1a"). I don't have inside information about the location of their facilities, but I imagine some may be in New York and others may be in Virginia, so the distance between zones may be considerable, which translates into some network latency. And even if the actual facilities used by EC2 today are not that far apart, they may be someday in the future.

The second gotcha is that bandwidth across zone boundaries is not free. Amazon is charging $0.01/GB for what it calls "regional" traffic. This is less than 1/10th the cost of Internet traffic, which seems perfectly reasonable to me. In the days when I was managing multiple data centers, the cost of traffic between them was essentially the same as the cost of random Internet traffic. Actually, it cost twice as much: once to exit one data center and once to enter the other. (Granted, at high volume one can do interesting things to save some money, but it doesn't become free by a long shot.)

An Example

Let's see how a simple redundant website looks with Availability Zones and elastic IPs. At the core we'll have two web servers with Apache and PHP running the web application and accessing the master database. All this occurs in one zone. We'll allocate two elastic IP addresses that we assign to the two web servers and then create a round-robin DNS entry for our website that maps the domain name to the two IP addresses.
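The elastic IP part of that recipe might look as follows with boto; the instance IDs are placeholders, and the round-robin DNS step happens at your DNS provider rather than through EC2.

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")

    # Allocate two elastic IPs; they stay bound to the account until released.
    ip1 = conn.allocate_address()
    ip2 = conn.allocate_address()

    # Attach one address to each web server (instance IDs are placeholders).
    conn.associate_address(instance_id="i-aaaa1111", public_ip=ip1.public_ip)
    conn.associate_address(instance_id="i-bbbb2222", public_ip=ip2.public_ip)

    # For round-robin DNS, create two A records for the site's hostname,
    # one per address, in your DNS provider's zone.
    print ip1.public_ip, ip2.public_ip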

[Figure: the initial setup, with two web servers, elastic IPs, and the master database all in one Availability Zone]

To ensure the survival of the data in the case of a massive failure, we start a slave database in a second Availability Zone and replicate the data in real time. This is how we've set up all our customers to date, except that up until now we haven't been able to specify the placement of the slave with respect to the master. In the RightScale Dashboard the zone of each server is shown and at server launch time one can select the desired zone.
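RightScale's Manager for MySQL automates the slave setup, but under the hood it boils down to standard MySQL replication statements. Here is a rough sketch using the Python MySQLdb module; all host names, credentials, and log coordinates are illustrative placeholders.

    import MySQLdb

    # Connect to the freshly launched slave in the second zone
    # (host and credentials are placeholders).
    slave = MySQLdb.connect(host="slave.internal", user="root", passwd="secret")
    cur = slave.cursor()

    # Point the slave at the master and start replicating. The log file and
    # position come from SHOW MASTER STATUS on the master (values here are
    # illustrative only).
    cur.execute("""CHANGE MASTER TO
                     MASTER_HOST='master.internal',
                     MASTER_USER='repl',
                     MASTER_PASSWORD='replpass',
                     MASTER_LOG_FILE='mysql-bin.000001',
                     MASTER_LOG_POS=4""")
    cur.execute("START SLAVE")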

Now suppose the zone with the web servers and database fails due to a fire. After receiving an alert, we first promote the slave in the second zone to master using the RightScale Manager for MySQL automation. We then launch fresh web/app servers in the same zone as the slave database. Once the promotion completes and the two new servers are up, it is a simple matter of reassigning the elastic IPs to redirect all users to the new servers, and we're up and running again.
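Again, the RightScale automation wraps this up into a click, but conceptually the failover is a slave promotion followed by two elastic IP reassignments. A hedged sketch, with all hosts, addresses, and instance IDs as placeholders:

    import MySQLdb
    import boto.ec2

    # 1. Promote the surviving slave to master.
    db = MySQLdb.connect(host="slave.internal", user="root", passwd="secret")
    cur = db.cursor()
    cur.execute("STOP SLAVE")         # stop applying events from the dead master
    cur.execute("RESET MASTER")       # start a fresh binary log for new slaves
    cur.execute("SET GLOBAL read_only = 0")

    # 2. Remap the two elastic IPs onto the replacement web servers.
    conn = boto.ec2.connect_to_region("us-east-1")
    for ip, instance in [("75.101.0.1", "i-cccc3333"),
                         ("75.101.0.2", "i-dddd4444")]:
        conn.disassociate_address(ip)
        conn.associate_address(instance_id=instance, public_ip=ip)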

[Figure: after the failover, with the promoted master and fresh web servers in the second Availability Zone and the elastic IPs remapped]

The next step is to recreate the redundancy. This is where the third Availability Zone that each account has access to comes into play. We start a fresh database slave in the third zone, again using the automation in the Manager for MySQL. Once that comes up and starts replicating, we are back to having a redundant setup.

[Figure: redundancy restored, with a new slave database replicating in the third Availability Zone]

If you have never tried to set something like this up yourself - renting colo space, purchasing bandwidth, buying and installing servers - you really can't appreciate the amount of capital expense, time, headache, and ongoing expense saved by EC2's features. And best of all, using RightScale, it's just a couple of clicks away :-).

Beyond the Simple Redundant Setup

You probably noticed that the site described above would go down if there were a failure in the primary zone, and bringing it back up would require manually launching new servers. Some of this can be easily remedied by placing one or more web servers into the secondary zone and having them talk to the master DB across the zone boundary. The performance of these servers may be slightly lower due to the inter-zone latency, and there is some cost for the database access traffic. How these trade-offs play out is somewhat application-dependent.

A more sophisticated setup uses load balancers to reduce the impact of the cross-site traffic. The idea is to place one load balancer instance in each zone and route the requests primarily to a set of redundant web/app servers in the primary zone, as shown in the figure below. A third app server can be running in the secondary zone and perhaps get a trickle of traffic from the load balancers just to keep it "warm." Keeping it warm makes it easy to monitor and ensure that it's operating properly.

[Figure: one load balancer per zone, with redundant web/app servers in the primary zone and a warm standby in the secondary zone]
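The post doesn't prescribe a particular load balancer, but with a proxy like HAProxy the "trickle" is simply a server weight. A minimal sketch of such a configuration, with all addresses as placeholders:

    listen website 0.0.0.0:80
        balance roundrobin
        # Full-weight app servers in the primary zone.
        server app1 10.250.1.10:80 check weight 100
        server app2 10.250.1.11:80 check weight 100
        # Warm standby in the secondary zone gets roughly 0.5% of requests,
        # enough to keep it exercised and monitored.
        server warm 10.250.2.10:80 check weight 1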

The good thing about this setup is that the traffic shipped across the zone boundary is exactly the traffic that comes into the second load balancer. This means that only about half the total Internet traffic incurs the $0.01/GB surcharge, so the average surcharge works out to roughly $0.005/GB, which results in less than 5% extra cost overall. (This is not counting the DB replication traffic.) Also, the extra latency from one zone to the other is negligible compared to the Internet latency already incurred.

In the case of a primary zone failure, browsers will fail over to the load balancer in the remaining zone; when one address in a round-robin DNS entry stops responding, browsers retry the other address. The load balancer will then direct all traffic to the third web/app server. At that point the secondary database needs to be promoted to master and the third app server repointed to it, and everything will be back up and running. With automation the DB promotion could be done automatically, but it's better to be conservative; a promotion due to a false alert could cause a lot of harm.

This second setup is a bit more complicated than the previous one, but it requires less intervention and no server launches in the case of a failure. It requires only one extra machine, assuming each load balancer can run on the same instance as a web/app server (typically not a problem). Many more variants on this basic setup are clearly possible and should be considered on a case-by-case basis.

It's mind-boggling how much power Amazon is giving us for designing sophisticated distributed redundant Internet services! In combination, the Availability Zones, the elastic IPs, and the overall programmatic control over all the resources make the cloud a superior environment for deploying sophisticated Internet services. At RightScale we're hard at work incorporating the new features into our standard deployment templates so all of our customers can take advantage of them in their deployments. We're also automating a number of the failure scenarios so that you don't need to have an alert wake you up if there is a fire at Amazon in the middle of the night.

Comments

Amazon improves EC2 (by embracing failure): Amazon just announced two big improvements to EC2. Multiple Locations: Amazon EC2 now provides the ability to place instances in multiple locations. Amazon EC2 locations are composed of regions and Availability Zones. Regions are geographically dispersed...
[...] The guys at RightScale have described Setting up a fault-tolerant site using Amazon's Availability Zones. [...]
Using mysql-proxy http://jan.kneschke.de/projects/mysql/mysql-proxy (in conjunction with rw-splitting) and a configuration like this one http://capttofu.livejournal.com/1752.html, you could just be notified when the web node and/or the db node go down. No manual promotion from slave to master would be necessary. Just rebuild the missing node in the third data center and wait for the next failure.
Frederic: thanks for pointing out the mysql proxy. To be honest, I haven't really dug into all its possibilities. Our goal with the replication is to use a simple, tried and tested setup. The replication is really solid with known gotchas, and that's always valuable. The main difficulty with the failover isn't actually promoting the slave to master and repointing the clients; we have all that automated and it's literally one click in our interface. The issue is deciding when to fail over, and I don't think the proxy makes that easier. If the master stands up and says "I'm dead" then it's easy, but often there are partial failures or lock-ups, and having some automatic decision cause a failover can cause more havoc than good. But in any case I need to read up on the proxy stuff: thanks for the reminder!
[...] has also posted some tutorials on how to use the new features, including how to set up a fault-tolerant site. Written by: Mike | March 27, 2008 | Filed Under Amazon [...]
Great Article! Thank you for laying it all out so clearly. You got a nice plug for these blog entries on the AWS CTO's blog too. :) Kent
[...] Setting up a fault-tolerant site using Amazon's Availability Zones: Amazon's Availability Zones are a fabulous new feature that allows users to assign instances to locations that [...]
Slight correction: The $0.01/GB charge seems to be for bandwidth crossing /region/ boundaries, not /zone/ boundaries. And currently everything's in the us-east region, so you won't actually hit the $0.01/GB charge.
Posted by bd_ (not verified)   |   March 27, 2008   |   09:19 PM
Kent: thanks for the nice words! I hadn't noticed Werner's blog entry, thanks for pointing it out. bd_: Ahhhh, very good point, I had completely misread that. I better update the blog post, although I think I'm still confused about the pricing. Sounds like a good AWS forum topic.
bd_: I just looked at the EC2 pricing at http://aws.amazon.com/ec and I can't follow your argument. It says "Regional Data Transfer -- $0.01 per GB in/out - all data transferred between instances in different Availability Zones in the same region."
[...] in Ruby on Rails) to Amazon's infrastructure has a good blog post on the new changes here: "Setting up a fault tolerant site using amazons availability zones". [...]
Thank you for this blog posting. Very informative.
Posted by noname (not verified)   |   March 31, 2008   |   10:10 AM
[...] Setting up a fault-tolerant site using Amazon’s Availability Zones « RightScale Blog (tags: ec2 amazon aws scaling infrastructure architecture distributed loadbalancing networking scalability) [...]
[...] Setting up a fault-tolerant site using Amazon’s Availability Zones « RightScale Blog (tags: ec2 amazon scalability cloud_computing) [...]
[...] Good article: http://blog.rightscale.com/2008/03/26/setting-up-a-fault-tolerant-site-using-amazons-availability-zo... [...]
As you posit, the power of the cloud is mind-boggling. And your config scenarios are well articulated. I offer only one clarification: these scenarios are not fault tolerant. Restarts, transaction losses, rapidly reconstructing services … these and similar qualifiers happen in failover solutions. High availability, definitely yes. Fault tolerant, definitely no. Starting at five nines and going up from there, true fault tolerance is failure prevention, no downtime, no data loss, no restarts. Call us (Stratus Technologies) sensitive, but after three decades specializing in fault-tolerant technology and preventing downtime of mission-critical applications, we feel a clear distinction is … well, critical.
Denny, I certainly understand the difference and you are correct. As you probably know better than me, there's a big multiplier in cost and complexity for every '9' added. As a result there are lots of solutions that address different levels of fault tolerance and price. Yours tends to score high on both aspects :-). The term "fault tolerance" is actually pretty apt; it's about tolerance: what can your business/application tolerate? If you can't tolerate a single missed database update then you better fork out the really big bucks. The vast majority of business applications seem not to fall into that category. Fault tolerance really is a relative concept...
[...] The guys at RightScale have described Setting up a fault-tolerant site using Amazon’s Availability Zones. [...]
Thorsten, this is a discussion that can take on a life of its own. I hear what you are saying, but would like to explore a bit more. I would suggest that availability is really the relative concept here, and that fault tolerance is widely assumed to provide the best availability. As for relative concepts, how about “really big bucks”? :-) Cost and complexity do rise with every added nine but mostly when the discussion is about the infrastructure and all that goes into managing it for continuous availability. Not so much when the discussion is just about servers. Yes, there is a premium to pay for 5 9s servers compared to lesser solutions. However hardware prices begin around $12K, which can be quite attractive for applications that are not quite up to the mission-critical level, but important nevertheless. Given a choice, most users would like uninterrupted access to services all the time, mission-critical or not. Enterprise-class servers do go up as high as $60-65K, but that’s a far cry from the hundreds of thousands for proprietary systems of old. As for complexity, a fault-tolerant server is no more complex to deploy, manage and maintain than a single x86 server. These aren’t clusters. Applications run out of the box, and the servers are designed to pretty much care for their own health. It is our contention that for true server fault tolerance, three things are necessary … lock-step hardware, failsafe software and bullet-proof service. But, back to cloud-computing … it all eventually gets back to a physical infrastructure. Outsourced or internal, I believe there is an expectation of service from the user(s) accessing the cloud. That’s a pretty critical situation.
Denny, I completely agree with you. But Thorsten also has a point. Web 2.0 apps are not mission-critical in the way that banking, trading, or shipping applications are. Then there is the second aspect: many web 2.0 apps are written by people who don't really understand the complex concepts you describe and just want to be able to restart an application without too much work in a few minutes when an incident happens. To your average PHP developer your comment sounds like Chinese (assuming a non-Chinese-speaking PHP developer :). If you really want 5 9s, EC2 may not be a good starting point to begin with. Just the lack of IP load balancing is probably a reason to stop looking (GoGrid, on the other hand, may be worth looking at in that case, or a combination of GoGrid servers and your own database/back-end machines in the same data center, but I have no idea about their physical location abilities). Nonetheless, we should use the right terminology, since there are some engineers out there who understand it and do more than Facebook apps ;)
Posted by Nitin (not verified)   |   October 27, 2008   |   12:35 AM
[...] of being almost crash proof.  But it wasn’t.  Even companies running dual instances in separate availability zones in different EC2 data centers could have suffered from the outage.  Clearly the Cloud is still not [...]
[...] Most likely Amazon's fault was only that a few nodes went down; the redundancy management evidently has to be done by whoever builds on the service. Quora and FriendFeed didn't manage it (or perhaps [...]
[...] outage, we can see that this was indeed not the case. Because of the nature of the errors, multiple availability zones were out of commission, something that users and Amazon themselves have not encountered before, [...]
