RightScale Blog

Cloud Management Blog

RightScale Server Orchestration and Amazon SWF launch

The launch of Amazon's SWF (see also Werner's blog) is a good opportunity to talk about some of the exciting new automation features that we have in the works and will make available in coming releases. We've been using SWF as one of our back-end services for a number of months, and it's a pretty awesome service that greatly accelerated the development of our orchestration features. In my mind, automation is the most fundamental innovation in cloud computing. It enables all the business benefits (pay as you go, scale on demand, resiliency, predictability, etc.), and it lets us increase both the scale at which we use computing and the reliability of services. Cloud computing is inconceivable without automation throughout the entire stack.

RightScale has focused on automation from day one. We provided auto-scaling of server arrays early on: automatically launching and terminating servers based on monitoring metrics, such as CPU load. Something most newcomers don't appreciate is that making the call to launch the next server when the CPU load goes up on the running ones is not the difficult part. The difficult part is bringing the new servers into full operation. That involves loading all required software, configuring everything, and connecting the server with other services, such as load balancers and databases. This is why a big piece of the RightScale functionality concerns itself with configuration management and with automating the entire boot process, all the way to the point where the application is in production.

In coming releases we will continue to build on top of this platform and introduce server orchestration. Server orchestration uses a workflow language that lets you automate at the level of RightScale resources, such as servers, deployments, etc. The first functionality we implemented lets you customize the three key pieces of auto-scaling: (1) deciding when to scale up or down and by how many servers, (2) launching new servers, (3) terminating existing servers.

The way this works is that the RightScale system calls a user-defined decision function every minute to find out whether the server array should be scaled up or down and by how much. The decision function simply returns an integer that indicates how many servers to launch (value >0) or to terminate (value <0), or a value of zero to keep the server count the same. The decision function can retrieve monitoring data using our API and do a calculation similar to the built-in one, or it could do something completely different. An interesting example would be to use knowledge of application-specific state and metrics to better predict requirements. You may be able to tell early that a flash event is coming and that you need to launch a large number of servers all at once. That's just one example. The sky is really the limit, and I know some of our customers have pretty cool ideas in this area!
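To make the return-value contract concrete, here is a minimal sketch in Python. It is not the RightScale workflow language or API: the function name, the thresholds, the step size, and the idea of passing CPU samples in as a list are all assumptions made for illustration. Only the sign convention of the result comes from the description above.

```python
from statistics import mean


def scaling_decision(cpu_loads, server_count, *, high=80.0, low=20.0,
                     step=2, min_servers=1, max_servers=20):
    """Return >0 to launch servers, <0 to terminate servers, 0 to hold.

    cpu_loads is a list of recent CPU percentages, one per running server
    (hypothetical input; a real function would fetch monitoring data).
    """
    if not cpu_loads:
        return 0  # no data, so make no change
    average = mean(cpu_loads)
    if average > high:
        # Scale up, but never beyond the configured ceiling.
        return max(0, min(step, max_servers - server_count))
    if average < low:
        # Scale down, but never below the configured floor.
        return -max(0, min(step, server_count - min_servers))
    return 0


print(scaling_decision([90, 95, 85], server_count=4))
print(scaling_decision([10, 5], server_count=4))
```

A predictive variant would keep the same signature and sign convention and only change how the number is computed, for example by looking at an application-level queue depth instead of CPU load.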

When the decision function asks for more servers, RightScale runs a scale-up workflow to actually launch the servers. This puts you in control of how the servers are to be launched and creates an interesting opportunity to carefully manage where the servers are launched. For example, you may want to ensure your servers are equally spread across a number of datacenters for availability reasons. Or you may want to launch where it's the cheapest. Very similarly, the scale-down workflow can be picky about which servers are being terminated. In the built-in auto-scaling we terminate the oldest servers to ensure a continuous refresh of the running stock. But for some applications it's preferable to terminate the youngest servers. In addition, the scale-down workflow can gracefully shut down the application, take a last backup, save away log files, and then terminate the server.
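The two policies mentioned above, spreading launches evenly across datacenters and choosing between oldest-first and youngest-first termination, can be sketched as follows. This is an illustrative Python sketch only: the `Server` record and both function names are hypothetical and do not correspond to RightScale resources or API calls.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    datacenter: str
    launched_at: float  # launch time, seconds since the epoch


def plan_launches(running, datacenters, count):
    """Choose a datacenter for each new server, always the least populated one."""
    load = Counter({dc: 0 for dc in datacenters})
    load.update(s.datacenter for s in running if s.datacenter in load)
    plan = []
    for _ in range(count):
        target = min(datacenters, key=lambda dc: load[dc])
        plan.append(target)
        load[target] += 1
    return plan


def pick_for_termination(running, count, oldest_first=True):
    """Select servers to terminate, oldest first by default."""
    ordered = sorted(running, key=lambda s: s.launched_at,
                     reverse=not oldest_first)
    return ordered[:count]


running = [
    Server("app-1", "dc1", 100.0),
    Server("app-2", "dc1", 200.0),
    Server("app-3", "dc2", 300.0),
]
print(plan_launches(running, ["dc1", "dc2", "dc3"], 3))
print([s.name for s in pick_for_termination(running, 1)])
```

A cost-driven policy would replace the `min` key with a price lookup per datacenter; the rest of the structure stays the same.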

As we designed the orchestration functionality we kept coming back to two key requirements: concurrency and fault tolerance. We need to express concurrent activities with ease because, when one operates on many servers, it's the only way tasks complete in a reasonable amount of time. For example, to perform a rolling upgrade on a number of servers the orchestration ought to grab a set of servers, run them through the upgrade process in parallel, and then move on to the next set.
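The rolling-upgrade pattern described above (take a set of servers, upgrade them in parallel, then move to the next set) looks roughly like this in plain Python. The `upgrade` callable and the server names are placeholders, and a thread pool stands in for whatever concurrency the workflow engine provides.

```python
from concurrent.futures import ThreadPoolExecutor


def rolling_upgrade(servers, upgrade, batch_size):
    """Upgrade servers batch by batch; each batch runs in parallel.

    The next batch starts only after every server in the current batch
    has finished. An exception raised by `upgrade` stops the rollout.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            # pool.map preserves input order and re-raises worker errors.
            results.extend(pool.map(upgrade, batch))
    return results


def fake_upgrade(server_name):
    # Placeholder for the real upgrade steps on one server.
    return f"{server_name}: upgraded"


print(rolling_upgrade(["web-1", "web-2", "web-3", "web-4", "web-5"],
                      fake_upgrade, batch_size=2))
```

The batch size is the knob that trades upgrade speed against how much capacity is out of service at any one time.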

Where orchestration becomes really exciting is when it is used to recover from failures and automatically relaunch failed resources, possibly in a different datacenter or cloud. That immediately raises the question of the resiliency of the orchestration process itself: what if it is affected by the same broader failure and can't perform the recovery? Similar concerns arise when an orchestration process runs for a long time. The array auto-scaling example above could be implemented using a "parent" workflow that runs forever and invokes the decision function and scaling sub-workflows periodically. And again, this execution must be resilient to failures.
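The shape of such a "parent" workflow can be sketched as a simple loop. This Python sketch is an assumption-laden illustration: all names are hypothetical, and catching exceptions inside one process only shows the control flow. It does not provide the durability discussed here, which requires the loop's state to survive the loss of the machine running it.

```python
import time


def run_parent_workflow(decide, scale_up, scale_down, *, iterations=None,
                        interval_seconds=60.0, sleep=time.sleep):
    """Periodically call decide() and dispatch to the scaling sub-workflows.

    iterations=None runs forever; a finite value is handy for testing.
    Returns the list of errors that were caught along the way.
    """
    errors = []
    tick = 0
    while iterations is None or tick < iterations:
        tick += 1
        try:
            delta = decide()
            if delta > 0:
                scale_up(delta)
            elif delta < 0:
                scale_down(-delta)
        except Exception as exc:
            # One failed round must not kill the long-running parent.
            errors.append(exc)
        sleep(interval_seconds)
    return errors


launched, terminated = [], []
decisions = iter([2, 0, -1])
run_parent_workflow(lambda: next(decisions), launched.append,
                    terminated.append, iterations=3, sleep=lambda _s: None)
print(launched, terminated)
```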

In order to provide good support for concurrency and to offer a fault-tolerant execution environment we decided to base our orchestration system on a workflow language that is built around the open source Ruote workflow system. Ruote offers a multitude of very nice structured concurrency constructs. For example, you can express strategies such as "run concurrently and wait for all" or "run concurrently and wait for the first, then cancel the rest". The latter may sound unusual but it's useful when you need a resource and you want to try multiple avenues and pick the first one that succeeds.
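The two strategies quoted above can be illustrated outside of Ruote. The following Python `asyncio` sketch is not Ruote syntax and makes no claim about how Ruote implements these constructs; the function names and the simulated `attempt` coroutine are invented for the example. The second function follows the "pick the first one that succeeds" reading, so a branch that fails early does not win.

```python
import asyncio


async def run_all(*coros):
    """Run concurrently and wait for all; results keep the input order."""
    return await asyncio.gather(*coros)


async def first_success(*coros):
    """Run concurrently, return the first successful result, cancel the rest."""
    pending = {asyncio.create_task(c) for c in coros}
    last_error = None
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                error = task.exception()
                if error is None:
                    return task.result()
                last_error = error  # this avenue failed, keep waiting
        raise RuntimeError("all attempts failed") from last_error
    finally:
        for task in pending:
            task.cancel()
        # Let the cancelled tasks finish unwinding before returning.
        await asyncio.gather(*pending, return_exceptions=True)


async def attempt(delay, value, fail=False):
    # Simulated attempt to acquire a resource (hypothetical helper).
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{value} unavailable")
    return value


print(asyncio.run(first_success(
    attempt(0.01, "cloud-a", fail=True),
    attempt(0.05, "cloud-b"),
    attempt(1.0, "cloud-c"),
)))
```

In this run the first branch fails quickly, the second succeeds, and the slow third branch is cancelled rather than left running.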

Amazon SWF came in very handy to ensure fault-tolerant execution of the Ruote workflows. We retargeted Ruote to leverage Amazon SWF as an execution back-end with the result that workflows are executed by many servers distributed across multiple availability zones. SWF takes care of scheduling the execution of workflow actions, collecting the results, and then atomically handing the results back to Ruote so it can schedule the next wave of actions. The result is a highly resilient orchestration system that can continue the execution of workflows in the face of major failures.

We're obviously very excited about the upcoming features and can't wait to make them available to our customers. Now that Amazon SWF is live we're on the home stretch and hope to be ready for a private beta shortly after the upcoming release. If you're interested in early access, please send me an email.

Comments

[...] management company Rightscale is already using Simple Workflow. The big idea here is to help developers architect cloud applications the way companies like [...]
[...] fault-tolerant execution of their server scaling workflow. Read Thorsten von Eicken's post, RightScale Server Orchestration and Amazon SWF Launch, for more [...]
Any idea on the definition of "a while"? I just want to know whether it's worth waiting, because we are considering starting development of an integration ourselves.
It really depends on the use-case. This would be better for an off-line discussion. Please ping me directly.
The standard auto-scaling will continue to work as-is. WRT the workflow-based scaling the question is a bit more complex to answer. I'll take your question to be "If I'm operating in EC2 region A and there's a major issue in the region that also affects SWF, will I be able to scale up in a different region or will everything be stuck because of SWF?". We are in the process of implementing multiple RightScale clusters around the world and we will let users decide in which cluster their account is hosted. The first one up will be an AP-Tokyo based cluster, which is currently in private beta. Our recommendation will be for users to place their account in a cluster that is in a different cloud (or AWS region) from where they run their primary operations. This ensures that in a situation where that cloud (or region) has a major melt-down the management plane is not directly affected. (This is also why we believe it's a bad idea to put the management plane into a company's private cloud: if it goes down you probably can't even invoke DR...) But to come back to your question, SWF is per-region, so by using a RightScale cluster in an independent region you can ensure that you're not in a both-hands-tied situation if I interpreted your question correctly. Thanks for asking! :-)
Are you going to contribute the SWF integration back to ruote?
Posted by Jordan Curzon (not verified) | February 23, 2012 | 11:25 AM
Yes, that's the current plan, although it may take a while.
[...] using Amazon SWF. Amazon CTO Werner Vogels discusses the new orchestration service and RightScale talks about how they have been testing SWF as a back end service for a number of [...]
[...] Filed under: AWS, Cloud Computing, EC2 Tagged: Auto-scale, AWS, Cloud Computing, Cloud Management, EC2, RightScale RightScale Blog [...]
[...] the SDK see the developers guide. As always The AWS developer blog has additional details. At the Rightscale blog Thorsten von Eicken talks about their use of [...]
[...] direction. Right Scale has been using a Ruby workflow Ruote for their workflow needs and now they orchestrate these workflows using SWS to achieve fault tolerance and concurrency. As you can see, Amazon has opened up a gold mine for [...]
Will you still be able to autoscale if SWF goes down?
Posted by Eric S (not verified) | March 02, 2012 | 09:24 PM