The Animoto guys hit the jackpot on Facebook this past week. Jeff Barr mentioned a few of the stats on his blog: Animoto ramped from 25,000 users to 250,000 users in three days, signing up 20,000 new users per hour at peak. The system they run using RightScale is quite complicated: there's the www.animoto.com website, a separate site for the Facebook app run by Hungry Machines, both of these feeding into a back-end web services site that orchestrates uploads, and, most importantly, the render farm that creates the cool videos.
The upshot is that there are a lot of moving parts. Each of the subsystems consists of many servers, and everything needs to scale up as the load increases. What Animoto CTO Stevie Clifton did well is to connect all the operations using queues, all of them held in Amazon SQS. One queue contains work items listing photo URLs to fetch from sites such as Facebook and Flickr, and it is processed by one array of worker instances. Another queue holds the list of render jobs, where each work item points to the set of photos sitting at the ready in S3 and to the music files, also on S3. The arrays of worker instances are managed by RightScale, which allows the monitoring part of our service to detect when a queue gets too large and more instances need to be launched. Using queues decouples the various parts of the site, so if the renderers get backlogged the queue simply builds up and users have to wait a little longer for their video to be produced. Waiting is not good, but dropping requests on the floor is much worse!
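The scaling decision the monitoring makes here can be sketched in a few lines. This is a hypothetical illustration, not RightScale's actual algorithm; the function name, thresholds, and jobs-per-worker ratio are all made up:

```python
def desired_workers(queue_depth, jobs_per_worker=10, min_workers=2, max_workers=500):
    """Size a worker array in proportion to the queue backlog.

    queue_depth: number of work items waiting (e.g. SQS's approximate
    message count). The other parameters are illustrative defaults:
    keep a small floor of workers, never exceed a hard ceiling.
    """
    needed = -(-queue_depth // jobs_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))
```

A monitoring loop would poll the queue length every so often and launch or terminate instances until the running array matches the returned count.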
Producing the videos takes eight to nine minutes on average, and at peak Animoto has pumped more than 450 render requests per minute into the queue. Last week we ended up with just under 3,500 instances in the various Animoto deployments. Tonight it was more than 4,000 and it looks like it will not drop under 2,000 instances through the night. At peak RightScale was launching and configuring 40 new instances per minute pretty much sustained to handle the injection of thousands of render jobs that needed special handling.
Lessons learned? First of all, when you scale 10x and then 10x again to run on thousands of servers, every little problem turns into a large one. That insignificant 0.1% error rate, multiplied across 1,000 requests per second, becomes an error every second, and the error rate itself typically climbs as well because of the added load on the system. So suddenly it's not something you can ignore anymore. An example of this was having exponential backoff for uploads to S3 when using curl, but forgetting that the delay before the fifth retry exceeds the S3 connection timeout. Normally this happens only once in a blue moon, but when tens of uploader instances are banging hard on one S3 bucket, the S3 error rate goes up a bit and suddenly uploads are failing left and right. Once we changed this to a constant retry timeout it all went smoothly again.
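The backoff bug is easy to reproduce on paper. Here's a sketch with made-up numbers (the actual timeout and retry counts aren't in this post); the `cap` parameter stands in for the constant-retry fix described above:

```python
def backoff_delays(base=1.0, retries=6, cap=None):
    """Delays (in seconds) before each retry attempt.

    Uncapped, the delay doubles on every attempt. With `cap` set,
    every delay is held to a constant ceiling, which is roughly
    the fix described above. All numbers are illustrative.
    """
    delays = [base * (2 ** i) for i in range(retries)]
    if cap is not None:
        delays = [min(d, cap) for d in delays]
    return delays
```

With a 1-second base the uncapped delays run 1, 2, 4, 8, 16, 32: by the fifth retry you are waiting longer than a typical connection timeout allows, so the retry never gets a chance to succeed. Capping the delay keeps every attempt inside the window.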
Now does this mean that you should fix all the little issues before going live? Of course not - you can't! What I've found to be most effective is to think about every little problem that you come across for a few minutes. Don't just brush it aside as insignificant. It is now, but it will trip you up tomorrow or the day after. So spend five minutes to troubleshoot and hypothesize as far as you can get. You don't have to solve it immediately. Think up a workaround, or how you would troubleshoot further, or perhaps how you'd fix it. Then move on. Come tomorrow, when and if the issue becomes big, you will have an invaluable head start. Instead of being caught off guard you'll be able to kick into action immediately and solve the issue.
Another lesson learned is not to forget the manual overrides. Yup, I know, we have this super smart auto-scaling algorithm. But we also have manual overrides, and when Animoto went from about 50 instances to 4,000 we used them. We wanted to make sure the extra instances didn't overload the database and the queue, and that everything was running smoothly (and, yes, to take a pause and fix some issues before scaling up further). Stevie and the Hungry Machines guys had also put in some overrides to queue up automatically generated videos and let manually requested ones zip through. This was essential in keeping the active users happy when everything first exploded and the system had trouble keeping up with the load. A lot of the queued videos were processed a bit later, when the load went back down. Automation is cool for the daily routine events, but for something like this you want the overrides.
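That override, letting manually requested videos zip past the automatically generated ones, amounts to a two-tier queue. A minimal sketch, with class and field names of my own invention, not Animoto's:

```python
from collections import deque

class RenderQueue:
    """Two-tier work queue: manually requested render jobs are always
    served before automatically generated ones (illustrative only)."""

    def __init__(self):
        self.manual = deque()  # jobs a user explicitly asked for
        self.auto = deque()    # jobs generated automatically

    def put(self, job, manual=False):
        # Route the job to the appropriate tier.
        (self.manual if manual else self.auto).append(job)

    def get(self):
        # Drain the manual tier first; fall back to the auto tier.
        if self.manual:
            return self.manual.popleft()
        if self.auto:
            return self.auto.popleft()
        return None
```

When the system is backlogged, the auto tier simply builds up and gets processed once the load subsides, exactly the behavior described above.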
Animoto is a great example of leveraging the cloud for its strengths of instant availability and virtually limitless scale. Of course, most sites don't need to launch 4,000 servers in one go, but it's nice to know you can if you need to. Whether the number is four or 40 or 4,000, getting the resources you need at the time you need them is a key benefit of "right-scaling" your deployment using the cloud. To see auto-scaling features in action, check out the RightScale free trial.
Looking at our database today I noticed that RightScale has launched, configured, and managed more than 200,000 instances to date! That's an impressive number, but as the Animoto scale-up proves, we're only just beginning.