RightScale Blog

Cloud Management Blog

Animoto's Facebook Scale-Up

The Animoto guys hit the jackpot on Facebook this past week. Jeff Barr mentioned a few of the stats on his blog: Animoto ramped from 25,000 users to 250,000 users in three days, signing up 20,000 new users per hour at peak. The system they run using RightScale is quite complicated: there is the www.animoto.com website, a separate site for the Facebook app run by Hungry Machines, a back-end web services site that both of these feed into and that orchestrates uploads, and, most importantly, the render farm that creates the cool videos.

The upshot is that there are a lot of moving parts. Each of the subsystems consists of many servers and everything needs to scale up as the load increases. What Animoto CTO Stevie Clifton did well was to connect all the operations using queues, many of them in SQS. One queue contains work items that list photo URLs to fetch from sites such as Facebook and Flickr, and it is processed by one array of worker instances. Another queue holds the list of render jobs, and each work item in it points to the set of photos sitting at the ready in S3 and to the music files, also on S3. All of these queues are held in Amazon SQS and the arrays of worker instances are managed by RightScale. This allows the monitoring part of our service to detect when a queue gets too large and more instances need to be launched. Using queues decouples the various parts of the site, so if the renderers get backlogged the queue simply builds up and users have to wait a little longer for their video to be produced. Waiting is not good, but dropping requests on the floor is much worse!
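To make the "queue gets too large, launch more instances" idea concrete, here is a minimal sketch of a scaling decision driven by queue depth. It is not RightScale's actual algorithm; the function name, the per-worker throughput, the drain target, and the min/max bounds are all assumptions made up for illustration.

```python
import math


def desired_workers(queue_depth, jobs_per_worker_per_min, drain_target_min,
                    min_workers=1, max_workers=100):
    """Workers needed to drain the current backlog within the target time.

    Every parameter here is a hypothetical knob, not a RightScale setting.
    """
    if queue_depth <= 0:
        return min_workers
    # How many jobs a single worker can finish inside the drain target.
    capacity_per_worker = jobs_per_worker_per_min * drain_target_min
    needed = math.ceil(queue_depth / capacity_per_worker)
    # Clamp so a runaway backlog cannot request an unbounded number of instances.
    return max(min_workers, min(max_workers, needed))


# A backlog of 100 jobs, 0.5 jobs per worker per minute, drained within 10 minutes.
print(desired_workers(100, 0.5, 10))
```

Because the queue absorbs the backlog, a decision like this can lag the load by a few minutes without losing any requests; users just wait a little longer.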

Producing the videos takes eight to nine minutes on average, and at peak Animoto has pumped more than 450 render requests per minute into the queue. Last week we ended up with just under 3,500 instances in the various Animoto deployments. Tonight it was more than 4,000 and it looks like it will not drop under 2,000 instances through the night. At peak, RightScale was launching and configuring 40 new instances per minute, pretty much sustained, to handle the injection of thousands of render jobs that needed special handling.
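As a rough sanity check on these numbers, Little's law says the average number of jobs in progress equals the arrival rate times the time each job takes. The sketch below applies it to the figures quoted above. It assumes the peak rate were sustained and that each instance renders one video at a time; the post states neither, so treat the result as a ballpark only.

```python
def renders_in_flight(requests_per_min, render_minutes):
    """Little's law: average work in progress = arrival rate x time in service."""
    return requests_per_min * render_minutes


# 450 render requests per minute, eight to nine minutes per render.
low = renders_in_flight(450, 8)
high = renders_in_flight(450, 9)
print(low, high)  # 3600 4050
```

Under those assumptions, keeping up with the peak takes somewhere between 3,600 and 4,050 concurrent renders, which is in the same range as the instance counts reported above.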

Lessons learned? First of all, when you scale 10x and then 10x again to run on thousands of servers, every little problem turns into a large one. That insignificant error rate of 0.1% now applies to 1,000 operations per second and you end up with an error a second, and the error rate itself typically increases too because of the added load on the system. So suddenly it's not something you can ignore anymore. An example of this was having exponential backoff for uploads to S3 when using curl, but forgetting that the fifth retry exceeds the S3 connection timeout. Normally, this happens only once in a blue moon, but when tens of uploader instances are banging hard on one S3 bucket, the S3 error rate goes up a bit and suddenly uploads are failing left and right. Once we changed this to a constant retry timeout it all went smoothly again.
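The backoff trap is easy to see with numbers. The sketch below compares an exponential schedule with a constant one and reports the first retry whose wait exceeds a timeout. The 2-second base delay and 20-second timeout are hypothetical values chosen so that the fifth retry is the one that breaks; the post does not give the real figures, and the sketch assumes the "retry timeout" refers to the wait between attempts.

```python
def exponential_delays(base_s, retries, factor=2):
    """Wait before each retry, doubling every time by default."""
    return [base_s * factor ** i for i in range(retries)]


def constant_delays(delay_s, retries):
    """Same wait before every retry."""
    return [delay_s] * retries


def first_retry_over(delays, timeout_s):
    """1-based index of the first retry whose wait exceeds timeout_s, or None."""
    for attempt, delay in enumerate(delays, start=1):
        if delay > timeout_s:
            return attempt
    return None


TIMEOUT_S = 20  # hypothetical connection timeout, not the real S3 value

print(exponential_delays(2, 5))                               # [2, 4, 8, 16, 32]
print(first_retry_over(exponential_delays(2, 5), TIMEOUT_S))  # 5
print(first_retry_over(constant_delays(2, 5), TIMEOUT_S))     # None
```

With low error rates almost no upload ever reaches the fifth retry, which is why the problem stayed hidden until the load went up.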

Now, does this mean that you should fix all the little issues before going live? Of course not - you can't! What I've found to be most effective is to think about every little problem that you come across for a few minutes. Don't just brush it aside as being insignificant. It is now, but it will trip you up tomorrow or the day after. So spend five minutes to troubleshoot and hypothesize as far as you can get. You don't have to solve it immediately. Think up a workaround or how you would troubleshoot further, or perhaps how you'd fix it. Then move on. Come tomorrow, when and if the issue becomes big, you will have an invaluable head start. Instead of being caught off guard you'll be able to immediately kick into action and solve the issue.

Another lesson learned is not to forget the manual overrides. Yup, I know, we have this super smart auto-scaling algorithm. But we also have manual overrides, and when Animoto went from about 50 instances to 4,000 instances we used them. We wanted to make sure the extra instances didn't overload the database and the queue, and that everything was running smoothly (and, yes, to take a pause and fix some issues before scaling up further). Stevie and the Hungry Machines guys had also put in some overrides to queue up automatically generated videos and let manually requested ones zip through. This was essential in keeping the active users happy when everything first exploded and the system had trouble keeping up with the load. A lot of the queued videos were processed a bit later when the load went back down. Automation is cool for the daily routine events but for something like this you want the overrides.
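Both kinds of override can be sketched in a few lines. The post does not say how either was implemented, so everything below is an assumption: an in-process priority queue stands in for whatever mechanism let manually requested videos jump ahead, and a simple cap stands in for the operator limit on the auto-scaler. The names RenderQueue, MANUAL, AUTOMATIC, and apply_override are hypothetical.

```python
import heapq
import itertools

MANUAL, AUTOMATIC = 0, 1  # lower number is served first


class RenderQueue:
    """Two-lane queue: manual requests are served before automatic ones."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a lane

    def put(self, job_id, lane):
        heapq.heappush(self._heap, (lane, next(self._order), job_id))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def apply_override(auto_target, manual_cap=None):
    """Let an operator cap whatever instance count the auto-scaler asks for."""
    return auto_target if manual_cap is None else min(auto_target, manual_cap)


queue = RenderQueue()
queue.put("auto-1", AUTOMATIC)
queue.put("user-1", MANUAL)
queue.put("auto-2", AUTOMATIC)
print([queue.get() for _ in range(len(queue))])  # ['user-1', 'auto-1', 'auto-2']
print(apply_override(4000, manual_cap=500))      # 500
```

The automatic jobs are delayed rather than dropped, which matches the description above of queued videos being processed later when the load went back down.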

Animoto is a great example of leveraging the cloud for its strengths of instant availability and virtually limitless scope. Of course, most sites don't need to launch 4,000 servers in one go, but it's nice to know you can if you need to. Whether the number is four or 40 or 4,000, getting the resources you need at the time you need them is a key benefit of "right-scaling" your deployment using the cloud. To see auto-scaling features in action, check out the RightScale free trial.

Looking at our database today I noticed that RightScale has launched, configured, and managed more than 200,000 instances to date! That's an impressive number, but as the Animoto scale-up proves, we're only just beginning.

[Figure: Animoto AutoScaling Graphs]

Comments

Iolaire, thanks for the kind words! While we're not a consulting business, we do engage with our customers. In this case it's really tier-2 support. Stevie is really working his own deployments and pushing the buttons, but we're there helping and pulling on strings in the background to make him successful. You're most welcome about the free developer accounts. These are not about to go away, so be our guest and enjoy!
Thank you for the insight on such a large ramp up. I think this post illustrates your place/need/service as a consulting business, but in that light I also wanted to say thank you for continuing to have the free developer accounts. As a tool to monitor my Amazon workflow during EC2/SQS/S3 development RightScale is quite an asset.
Posted by Iolaire McFadden (not verified)   Ι   April 24, 2008   Ι   05:56 AM
[...] out a great post on the RightScale blog discussing the Animoto application’s ability to scale dynamically using Amazon cloud services. I’m impressed by what they’ve accomplished, and the article is worth reading. [...]
Awesome job, guys. I consistently talk you up in my presentations and area user group meetings. This post and their situation are definitely unique and incredible in their own right, but I can't say that I'm in any way surprised by the superior level of service you've provided to them. We love working with you. The interface and features you provide are incredible and well worth the cost. Great job, as always, and I look forward to continuing to work with you.
[...] focused the presentation primarily on illustrating how startups such as Animoto and SmugMug have based their entire infrastructure on Amazon's S3 [...]
[...] they used Ruby on Rails.” After all, these guys managed to sign 75,000 users a day using Rails. Hmm, but it took 500 servers to do it. That can’t be [...]
Just discovered animoto a couple of weeks ago and I got to say I love it.
Animoto is seen with a lot of wedding photographers nowadays. It is a neat form of slideshow display.
How do they make money? 2000 servers an hour cost a lot!
Posted by Me (not verified)   Ι   July 21, 2008   Ι   01:34 AM
When I first came across a friend with an Animoto video, I was really impressed with the quality and how it turned out. What disappointed me, and led me to remove the application, is that the app released a video containing pictures in which I had tagged friends, a video that I did not approve of and was not aware of. Is this how they have ramped up, by literally sending out spam videos to be more viral? The product in itself is great, but I am hesitant to keep anything when I don't have control over it!
Posted by Unwanted Video ... (not verified)   Ι   July 25, 2008   Ι   11:34 PM
[...] except for dips during the holidays and a spike in activity in april of 2008. That spike was due to Animoto’s scaling to several thousands of servers within few days. We’re a little puzzled about this spike, [...]
Animoto is great. Good for these guys, I hope they do well.
[...] anchor the infrastructure and show case its capability when it has to scale. Think Animoto on [...]
animoto is worth their weight in gold! I love that company and I am not surprised by their success. They keep raising the bar.
Yup. Sometimes you are blessed to have a certain customer.
[...] case no-one comes to your party”, and gave plenty of examples including the Facebook app Animoto & Playfish (with 27M users). Livestream has no infrastructure, but on U.S. election night they [...]
[...] Virtual Private Server (VPS) market.  Many look to the stardom success of Web 2.0 startups like Animoto and SmugMug, who clearly derive tremendous value from Amazon Web Services (AWS), as a measure of [...]
[...] They are quoted as saying that RackSpace support will happen also at some point.  RightScale has a great case study overview on their blog about Animoto and also explains how they have launched, configured and managed over [...]
I Love animoto. Great slideshow and so easy to use.
Very impressive! Another example of how the Animoto team, for wedding photography slideshows, takes every aspect of their business and does it right!
