We've been receiving a number of inquiries about our RightGrid batch processing framework recently. RightGrid was designed to simplify the task of processing large numbers of jobs enqueued in Amazon SQS with the data to be processed residing in S3. Basically it takes care of all the fetching and pushing, and all you need to do is plug in your processing code that takes local files as input and produces local output files.
People are asking about the complex priority schedulers that come with traditional frameworks such as Condor, Grid Engine, or Platform LSF. And we're kind of confounded by the purpose of these things in the EC2 world. Traditionally, the problem is that you have a cluster with N nodes, and you have users that enqueue enough jobs to keep those nodes busy until the end of the millennium. So you need to divide the cluster up into partitions and have complex rules to prioritize jobs on each partition.
Enter Amazon EC2. If user A enqueues a job needing 500 nodes for 10 hours and user B a job needing 800 nodes for five hours, what do you do? Very simple: you check the balance in their account and then start 500 instances for user A and 800 instances for user B. Done. No priorities, no scheduling, just pure compute fun!
The resource that is "allocated" in the finite computer center is the use of hardware, but the resource that is "managed" in a cloud is cost. It is a new mind set that one computer for 100 hours has the same cost as 100 computers for one hour. Of course there are details such as startup costs for large numbers of nodes and ensuring that each billed instance hour is fully used. But those details are a small leap when compared to the issue of understanding that 1=100.
I'm sure that there is a role for scheduling software to enable things like running a five-minute job on 500 nodes without having to pay for 500x one hour, or starting the next job on the same set of instances that are still finishing up a few laggard computations on a few instances. But assuming that Amazon can cough up enough instances, the game changes dramatically.
By way of an example, we have been testing some new features in RightGrid, and we wanted to ensure that everything goes smoothly when launching large numbers of instances. So we set up a queue with many thousands of test work items and an array of instances set to ramp up to a max of 500 instances. About 20 minutes later we had just under 500 instances running (we have the array set up to launch about 20 to 25 instances per minute as the queue of tasks keeps growing). Everything ran fine, and some 30 minutes later the queue was emptied and the instances were sitting idle waiting for either more work items to appear or the billing hour to reach its end. There was no warning to Amazon ("hey, we're about to launch 500 instances"), no hiccup, and it cost us all of 50 bucks! We repeated this a few times to try a couple of combinations, all on our schedule.
All this being said, I'm sure there are good reasons to have more sophisticated queueing and scheduling machinery than we have in place today, and perhaps one of the traditional packages can be put to good use. What they seem to lack as far as I can tell is the ability to decide to launch more servers; I guess an email to sysadmin "rack 100 more servers" wouldn't go down very well. What's missing is something like "the job queue is sufficiently full, let me launch 10% more servers," and of course the reverse when the job queue empties. The notion of "full" here doesn't have to mean that jobs are queued for hours, it may simply mean that there is enough work to be able to fill additional full machine hours.
It's going to be interesting to see how the cloud will change high-performance computing usage pattern and whether Amazon can actually keep up with demand. If you have comments about the thinking above or suggestions I'd love to hear them.