RightScale Blog

Cloud Management Blog

How EC2 Changes the Game in Batch Grid Computing

We've been receiving a number of inquiries about our RightGrid batch processing framework recently. RightGrid was designed to simplify the task of processing large numbers of jobs enqueued in Amazon SQS with the data to be processed residing in S3. Basically it takes care of all the fetching and pushing, and all you need to do is plug in your processing code that takes local files as input and produces local output files.
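To make that concrete, here is a minimal sketch in Python with boto3 (not RightGrid's actual code) of the kind of fetch/process/push loop the framework runs for you; the queue name, message format, and process_file() hook are hypothetical.

    import json
    import boto3

    sqs = boto3.resource("sqs")
    s3 = boto3.client("s3")
    queue = sqs.get_queue_by_name(QueueName="work-queue")    # hypothetical queue name

    def process_file(local_in, local_out):
        """The plug-in step: read local_in, write results to local_out."""
        with open(local_in, "rb") as f, open(local_out, "wb") as g:
            g.write(f.read())                                # placeholder: pass input through

    while True:
        for msg in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=1):
            job = json.loads(msg.body)                       # e.g. {"bucket": ..., "key": ...}
            local_in, local_out = "/tmp/input", "/tmp/output"
            s3.download_file(job["bucket"], job["key"], local_in)          # fetch from S3
            process_file(local_in, local_out)                              # do the work
            s3.upload_file(local_out, job["bucket"], job["key"] + ".out")  # push result
            msg.delete()                                     # only now remove the job from SQS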

People are asking about the complex priority schedulers that come with traditional frameworks such as Condor, Grid Engine, or Platform LSF. And we're kind of confounded by the purpose of these things in the EC2 world. Traditionally, the problem is that you have a cluster with N nodes, and you have users who enqueue enough jobs to keep those nodes busy until the end of the millennium. So you need to divide the cluster up into partitions and have complex rules to prioritize jobs on each partition.

Enter Amazon EC2. If user A enqueues a job needing 500 nodes for 10 hours and user B a job needing 800 nodes for five hours, what do you do? Very simple: you check the balance in their account and then start 500 instances for user A and 800 instances for user B. Done. No priorities, no scheduling, just pure compute fun!
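In code, that "policy" is barely more than a cost estimate followed by a launch call. A back-of-the-envelope sketch, where the balance lookup and hourly rate are made-up placeholders and the launch call is plain boto3:

    import boto3

    HOURLY_RATE = 0.10                       # assumed on-demand price, $/instance-hour

    def account_balance(user):
        """Hypothetical billing lookup -- replace with your real one."""
        return 1000.0                        # placeholder balance in dollars

    def launch_for(user, nodes, hours, ami):
        estimated_cost = nodes * hours * HOURLY_RATE
        if account_balance(user) < estimated_cost:
            raise RuntimeError(f"{user} can't cover ${estimated_cost:.2f}")
        ec2 = boto3.client("ec2")
        ec2.run_instances(ImageId=ami, MinCount=nodes, MaxCount=nodes)

    # launch_for("user_a", nodes=500, hours=10, ami="ami-12345678")
    # launch_for("user_b", nodes=800, hours=5,  ami="ami-12345678")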

The resource that is "allocated" in the finite computer center is the use of hardware, but the resource that is "managed" in a cloud is cost. It is a new mindset: one computer for 100 hours costs the same as 100 computers for one hour. Of course there are details, such as startup costs for large numbers of nodes and ensuring that each billed instance hour is fully used. But those details are a small leap compared to the mental shift of accepting that 1 = 100.

I'm sure there is a role for scheduling software to enable things like running a five-minute job on 500 nodes without having to pay for 500x one hour, or starting the next job on a set of instances where a few laggard computations are still finishing up. But assuming that Amazon can cough up enough instances, the game changes dramatically.
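To give a feel for the arithmetic such a scheduler would be doing, here is a sketch assuming the work can be chopped into five-minute units that run back to back on the same instance:

    import math

    slots_per_billed_hour = 60 // 5                   # 12 five-minute slots per instance-hour
    tasks = 500
    print(math.ceil(tasks / slots_per_billed_hour))   # 42 instance-hours instead of 500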

By way of an example, we have been testing some new features in RightGrid, and we wanted to ensure that everything goes smoothly when launching large numbers of instances. So we set up a queue with many thousands of test work items and an array of instances set to ramp up to a maximum of 500 instances. About 20 minutes later we had just under 500 instances running (we have the array set up to launch about 20 to 25 instances per minute as long as the queue of tasks keeps growing). Everything ran fine, and some 30 minutes later the queue was empty and the instances were sitting idle, waiting either for more work items to appear or for the billing hour to reach its end. There was no warning to Amazon ("hey, we're about to launch 500 instances"), no hiccup, and it cost us all of 50 bucks! We repeated this a few times to try a couple of combinations, all on our schedule.

All this being said, I'm sure there are good reasons to have more sophisticated queueing and scheduling machinery than we have in place today, and perhaps one of the traditional packages can be put to good use. What they seem to lack, as far as I can tell, is the ability to decide to launch more servers; I guess an email to the sysadmin saying "rack 100 more servers" wouldn't go down very well. What's missing is something like "the job queue is sufficiently full, let me launch 10% more servers," and of course the reverse when the job queue empties. The notion of "full" here doesn't have to mean that jobs are queued for hours; it may simply mean that there is enough work to fill additional full machine hours.
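Roughly, the missing logic looks something like the sketch below; the average task length, the 10% growth rule, and the idea of reading the backlog from SQS are assumptions, and actually launching or terminating instances is left to whatever manages the worker pool.

    import math
    import boto3

    MINUTES_PER_TASK = 5                     # assumed average task length

    def desired_workers(queue_url, current_workers):
        """How many workers the backlog justifies, growing at most ~10% per call."""
        sqs = boto3.client("sqs")
        attrs = sqs.get_queue_attributes(
            QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"]
        )["Attributes"]
        backlog = int(attrs["ApproximateNumberOfMessages"])
        # "Full" = enough queued work to keep a worker busy for a whole billed hour.
        sustainable = math.ceil(backlog * MINUTES_PER_TASK / 60)
        if sustainable > current_workers:    # queue is filling up: grow by ~10%, at least one
            return min(sustainable, max(current_workers + 1, math.ceil(current_workers * 1.1)))
        return sustainable                   # queue is draining: let idle workers retire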

It's going to be interesting to see how the cloud will change high-performance computing usage patterns and whether Amazon can actually keep up with demand. If you have comments or suggestions about the thinking above, I'd love to hear them.

Comments

For me, the issue is using as much as possible of the hour, since the whole hour has been bought on Amazon EC2. If you were to fire up lots of instances to finish the queue, and the average instance lifetime was 30 minutes, then I'm paying 100% more to Amazon than I need to. So my point is: can you do some clever scheduler stuff to make sure instances live for 59 minutes, or 1 hour 59 minutes, and so on?
Kevin, that's a good point. Right now what we do is keep instances alive until they're 55 minutes into the hour (to leave a few minutes to actually shut down). This works pretty well in a continuous usage scenario where a queue is getting work pretty much 24x7 and it's only a matter of adjusting the number of workers up and down with load throughout the day. But if you do one-shot batches where you fill the queue and then have it worked down to empty, a lot more planning is needed to avoid having too many instances sitting idle for too long at the end. The key here is to find the right trade-off between complexity (= fragility) and actual cost. Wasting an hour on 5-10% of the instances is probably not an issue, especially if the batch runs for several hours. Adding an hour on 100% of the instances may start to affect the bottom line.
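For the curious, the idle-worker check is about as simple as it sounds; here is a sketch in Python (not the actual worker code):

    import time

    SHUTDOWN_AT_MINUTES = 55                 # leave a few minutes to actually shut down

    def should_shut_down(launch_time, now=None):
        """True once an idle worker is ~55 minutes into its current billed hour."""
        now = now or time.time()
        minutes_into_hour = ((now - launch_time) / 60) % 60
        return minutes_into_hour >= SHUTDOWN_AT_MINUTES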
