If you travel enough, it's only a matter of time. You're facing a tight connection and when your first leg gets delayed it only gets tighter. Then you're sprinting to your connecting gate, only to watch as they close the doors and push back from the gate without you.
On my way back from OpenStack Summit 2013 in Portland, where I spoke on Techniques for Managing your OpenStack Cloud, I had this exact experience. I knew it instantly — I was stranded. Of course, I'd missed flights in the past, I'd been delayed, you name it. But this time the flight I missed was the last one out for the night, with no other connecting flights to anywhere close to my destination.
My first reaction was "the infrastructure failed me." I had done everything right, from booking my tickets to arriving on time and making it through security. Unfortunately that wasn't enough, since forces beyond my control impeded my ability to travel successfully.
In reality, though, the infrastructure is fairly reliable, and the sorts of delays I encountered are not uncommon. For the most part, the travel infrastructure gets you where you're going pretty close to your scheduled itinerary.
Using cloud resource pools is very much the same. Any given resource pool generally runs on commodity hardware, is generally available, and usually gets the job done with acceptable levels of performance. But sometimes something goes wrong and application performance degrades. Worse yet, your application becomes unavailable!
Planning for Failure
I could have proactively planned around the failure of the travel infrastructure that day in a variety of ways, and they all have parallels for organizations that need to plan for their clouds' availability. Each option had different costs and a different likelihood of success.
Cold Disaster Recovery
By far the least expensive option (and the one I ended up invoking) was to simply assume my itinerary would work out as planned. If something should happen, I could pay for a night in an airport hotel and get the next flight. I would only spend money if something went wrong, but I would get delayed by several hours.
This is like a cold DR (disaster recovery) plan for your cloud application. If you choose cold DR, you must have a plan in place to quickly launch your application on infrastructure that is not impacted by an outage. You incur no cost unless you have to invoke your DR strategy. But your recovery time objective (RTO), the maximum time you want to allow between failure and renewed availability, is high, because it can take a significant amount of time to bring up your application infrastructure and switch over to it. Your recovery point objective (RPO), that is, how much data, measured in time, you can afford to lose, is potentially high as well, because you have to rely on copying data from your production infrastructure or on a backup from some time in the past.
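To make the two objectives concrete, here is a minimal Python sketch that measures a single outage against them. The timeline, the target values, and the function name `recovery_metrics` are all made up for illustration; they do not describe any real outage or any particular tool.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_backup, failure, restored):
    """Return (data_loss_window, downtime) for one outage.

    data_loss_window is compared against the RPO,
    downtime is compared against the RTO.
    """
    data_loss_window = failure - last_backup
    downtime = restored - failure
    return data_loss_window, downtime

# Hypothetical timeline: nightly backup at 02:00, failure at 09:30,
# application back up on other infrastructure at 13:30.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 9, 30)
restored = datetime(2024, 1, 1, 13, 30)

# Hypothetical objectives for a cold DR plan.
rpo = timedelta(hours=24)
rto = timedelta(hours=6)

data_loss_window, downtime = recovery_metrics(last_backup, failure, restored)
met_rpo = data_loss_window <= rpo
met_rto = downtime <= rto
print(f"lost {data_loss_window} of data (RPO met: {met_rpo})")
print(f"down for {downtime} (RTO met: {met_rto})")
```

In this made-up case the plan loses seven and a half hours of data and takes four hours to recover, which is acceptable only because the assumed objectives are generous.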
Warm Disaster Recovery
If I were willing to accept a longer travel time if my main itinerary failed, and I were willing to pay for a backup plan, I could have booked a standby ticket on a train or bus. This would still get me home, though more slowly, and since I'd already booked the ticket I would be able to make use of it the moment I discovered that my air travel wasn't going to work out. It would still be inconvenient, but I would get where I was going far sooner than if I had waited for the next day's flight.
In this case, the cloud application parallel is warm DR, where a replica database is kept running in a different resource pool, ready to take over should the primary resource pool have a problem. With warm DR, RTO is still medium to high, since it still takes time to launch all of the other servers for your application, but your RPO is very low, or potentially zero.
Hot Disaster Recovery
Finally, I could have booked two completely different itineraries on two completely different airlines, through different airports. If my primary flight were delayed I could have instantly chosen to execute the alternate itinerary. Obviously this is the most complex and by far the most expensive option. I might consider using this option if money were no object and I needed to be home that night to have an anniversary dinner with my wife. I couldn't be delayed, and missing it would be inexcusable. Any cost would be acceptable to ensure that I arrived on time.
In the context of your cloud application, this is like the "five nines" infrastructure, where all tiers of the application are running simultaneously in two or more locations with real-time replication between them. The cost can be extraordinary, but for certain workloads the cost of any downtime outweighs the operational cost.
Don't Leave Your Apps Stranded
When I missed my flight I had to stay a night in an airport hotel and catch another flight early the next day. It was an inconvenience, but I survived. The lesson for me was to plan better for failure and give myself a bigger buffer of time for connections.
The lesson you should learn for your cloud application is that failure happens all the time. Have an availability strategy in place that matches the demands of your application. Spend the money you need to meet the RTO and RPO goals that make sense for your application, and be honest with yourself about what sort of outages are acceptable.
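One way to be honest about that trade-off is to write the options down and pick the cheapest one that still meets your goals. The sketch below does this in Python. Every number in it is a placeholder rather than a measurement, and the names `Strategy`, `STRATEGIES`, and `cheapest_strategy` are hypothetical; "multi-site" stands for the run-everything-in-two-places option.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Strategy:
    name: str
    rto_hours: float     # expected time to restore service
    rpo_hours: float     # expected window of lost data
    monthly_cost: float  # standing cost while nothing is wrong

# Illustrative placeholder figures only; substitute your own estimates.
STRATEGIES = [
    Strategy("cold", rto_hours=8.0, rpo_hours=24.0, monthly_cost=0.0),
    Strategy("warm", rto_hours=4.0, rpo_hours=0.1, monthly_cost=500.0),
    Strategy("multi-site", rto_hours=0.05, rpo_hours=0.0, monthly_cost=5000.0),
]

def cheapest_strategy(
    strategies: Sequence[Strategy], rto_target: float, rpo_target: float
) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s for s in strategies
        if s.rto_hours <= rto_target and s.rpo_hours <= rpo_target
    ]
    return min(candidates, key=lambda s: s.monthly_cost, default=None)

# An application that can be down for half a day but can lose
# at most an hour of data.
choice = cheapest_strategy(STRATEGIES, rto_target=12, rpo_target=1)
print(choice.name if choice else "no strategy meets these targets")
```

With these placeholder figures, relaxing the data-loss target to two days brings the choice back to cold DR, while tightening the downtime target to one hour forces the multi-site option.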
Using cloud computing to its fullest potential requires a fundamentally different approach from what you might be used to. The good news is that RightScale can help you with the tools and expertise you need to be successful, and empower you to do things you would have never attempted in a traditional or virtualized model.
Take full advantage of the power of cloud computing, plan for failure, and don't leave your apps stranded!