RightScale Blog

Cloud Management Blog

Cloud Server Monitoring and Automation, with Help from a Chimp

How does the RightScale Operations team manage more than 700 servers in the cloud running on five continents? By using RightScale Cloud Management ourselves, and taking advantage of its powerful monitoring and cloud automation tools. Here's an introduction to our performance data collection capabilities and a peek at our not-so-secret weapon for cross-deployment automation.


At first glance the cloud seems to relieve organizations of some of the burden of monitoring their IT resources. When someone else runs the data center, you don't have to worry about problems with HVAC systems or failing drives. But in fact, when you run in the cloud, resource and application-level monitoring is even more important. Instead of your hardware generating information that can warn you of an impending problem, your servers in the cloud simply stop working, sometimes in new and interesting ways. That's why RightScale builds monitoring tools into our multi-cloud platform.

RightScale can monitor, among other things, database server performance

Most of our monitoring is based around collectd, the modular open source system statistics collection daemon. We use RRDtool to graph data over time in our Cloud Management dashboard.

When you launch a new server with RightScale, the RightLink™ component of its ServerTemplate™ automatically registers with RightScale and sends traffic back to us using the collectd native protocol. We track all the basic host-level metrics — CPU, disk space, disk I/O, memory, and network — and do application monitoring for things such as process state and CPU and memory use. You can use any of those metrics to define alerts for unusual conditions.

From each cloud server instance, collectd pushes data over the network to the RightScale server that generates the graphs. We store the data in the widely used RRD format, and you can access it through the dashboard as well as through our API.

For more sophisticated monitoring you can use collectd's Exec plugin, which runs a script written in any language and talks to it over standard input and output. That lets you use the RightScale monitoring infrastructure for your own custom metrics. In a RightScale Compute session, I demonstrated a simple Ruby program that we use to send data through collectd back to RightScale. I also showed how we created a custom HTTP monitor to watch the error codes of RESTful services.
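To give a feel for what an Exec plugin script involves, here is a minimal sketch in Ruby. It is not the program from the session. It assumes the plain-text PUTVAL line format that the Exec plugin reads from standard output and the COLLECTD_HOSTNAME and COLLECTD_INTERVAL environment variables that collectd sets for scripts it launches. The plugin name, the type instances, and the sample status codes are invented for illustration.

```ruby
# Sketch of a collectd Exec plugin helper (names below are illustrative).

# Format one value in the plain-text protocol the Exec plugin reads on stdout:
#   PUTVAL "host/plugin-instance/type-instance" interval=SECONDS TIME:VALUE
def putval_line(host:, plugin:, type:, value:, interval:, time: Time.now.to_i)
  %(PUTVAL "#{host}/#{plugin}/#{type}" interval=#{interval} #{time}:#{value})
end

# Group HTTP status codes into classes ("2xx", "4xx", ...) and count them.
def count_by_class(status_codes)
  status_codes.each_with_object(Hash.new(0)) do |code, counts|
    counts["#{code.to_i / 100}xx"] += 1
  end
end

# collectd sets these two variables for scripts it launches; the fallbacks
# are only there so the sketch also runs by hand.
host     = ENV.fetch("COLLECTD_HOSTNAME", "localhost")
interval = ENV.fetch("COLLECTD_INTERVAL", "20").to_f.round

# Hypothetical sample: status codes a monitor saw during one interval.
observed = [200, 200, 404, 500, 503]

$stdout.sync = true # collectd reads line by line, so avoid buffering
count_by_class(observed).each do |status_class, count|
  puts putval_line(host: host, plugin: "exec-http_monitor",
                   type: "gauge-status_#{status_class}", value: count,
                   interval: interval)
end
```

A real monitor would wrap the last step in a loop that sleeps for the interval between samples and would collect the status codes from actual requests instead of a fixed list.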

Chris Deutsch's session on cloud monitoring and cloud automation at RightScale Compute

For the NoSQL crowd, I also introduced our Apache Cassandra monitoring suite. Not only do we use it, but it is also available to our customers.

Cassandra is a NoSQL key-value server written in Java. It stores its data on a ring of nodes, each node running on a different server. RightScale runs several large Cassandra rings in the cloud, replicating data over 50 distinct instances, and we have found that poor performance on a single node can degrade the whole ring. This makes monitoring the health of each node vital.

We developed a sophisticated monitoring plugin that dynamically monitors every aspect of our Cassandra databases. Since Cassandra is a Java app, it exposes its metrics through Java Management Extensions (JMX). We use a plugin called MX4J to read JMX information over HTTP. From there, our custom Ruby script grabs performance data and sends it to collectd. Once our graphing servers have logged the data, we can present it in RightScale, right alongside where we manage the servers themselves. Here's what the monitoring information from one of our production Cassandra servers looks like:



Monitoring is a vital task, but you also have to be able to manage your servers. RightScale has a RESTful API, which Operations uses extensively for both monitoring and automation, and a Ruby client library that developers can use to automate many system management actions.

Like most systems administrators, the RightScale Operations crew members are command-line geeks. We needed an easy way to run commands across large numbers of servers, so a command-line tool was a natural choice. I wrote a command-line program called chimp that uses the RightScale API and tag service to help admins select which servers to act on and then execute those actions.

Just as you can tag photos on Facebook, on RightScale you can tag servers with important meta information. Tags generally have the format namespace:key=value, so if you wanted to set a tag with the version of Cassandra running on a particular server, it might look like info:cassandra_version=2.0.
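The namespace:key=value shape is easy to handle in a script. The following Ruby sketch splits a tag into its three parts and picks servers out of a small inventory by exact tag match. It does not use the RightScale API or chimp; the Tag struct, the helper names, and the inventory hash are all hypothetical.

```ruby
# A tag split into its three parts (hypothetical helper type).
Tag = Struct.new(:namespace, :key, :value) do
  def to_s
    "#{namespace}:#{key}=#{value}"
  end
end

# Parse "namespace:key=value". Returns nil when the text has another shape.
def parse_tag(text)
  match = /\A([^:=]+):([^:=]+)=(.*)\z/.match(text)
  match && Tag.new(*match.captures)
end

# Return the names of the servers that carry exactly the wanted tag.
def servers_with_tag(inventory, wanted)
  inventory.select { |_name, tags| tags.include?(wanted) }.keys
end

# Hypothetical inventory: server name => tags set on that server.
inventory = {
  "globalring1-1" => ["info:cassandra_version=1.1.9", "service:cassandra=true"],
  "globalring1-2" => ["info:cassandra_version=1.1.9", "service:cassandra=true"],
  "web-1"         => ["service:webserver=true"]
}

tag = parse_tag("info:cassandra_version=1.1.9")
puts "#{tag.namespace} / #{tag.key} / #{tag.value}"
puts servers_with_tag(inventory, tag.to_s).inspect
```

Keeping the value as a plain string avoids guessing at types: a version such as 1.1.9 is not a number, and a value may itself contain colons.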

To leverage the power of the tag service, we run a RightScript™ or Chef recipe on all of our instances when they boot. This script tags the server with important information such as its IP address and the services it runs. It shows up in RightScale like this:

Using the tags with chimp, we can find, for example, all the servers running Cassandra version 1.1.9 with a command like:

chimp --tag="info:cassandra_version=1.1.9"

Chimp returns:

Your command will be executed on the following:
001. globalring1-1.rightscale.com
002. globalring1-2.rightscale.com
003. globalring1-3.rightscale.com
004. globalring1-4.rightscale.com
No actions to perform.

This tells us that we have four servers to work with. If we had a script to execute on all of them, for example to upgrade them to version 2.0, we could build on our previous command and add the --script parameter:

chimp --tag="info:cassandra_version=1.1.9" --script="upgrade"

Chimp finds the servers to upgrade, looks for the script named "upgrade" on each server's ServerTemplate, and executes it.

Automating Releases

Chimp is written with the Unix design philosophy in mind, and in particular with the notion that you should write small programs that each do one thing well. Because it follows that principle, chimp only runs things on servers for you, but you can use it in tandem with other tools to orchestrate complex operations.

To automate our releases, we group chimp commands together using Ruby rake files. Let's say that for a release we need to stop an nginx web server, install a new version of our application, and then start the web server again. The rake task might look like this:

task :upgrade do
  sh "chimp --tag=service:webserver=true --script='stop nginx'"
  sh "chimp --tag=service:webserver=true --script='upgrade'"
  sh "chimp --tag=service:webserver=true --script='start nginx'"
end

To perform the upgrade, we can now type one command instead of three:

$ rake upgrade

The rake files for the last RightScale release included 265 chimp commands, all triggered by a single command. Since rake tasks can call out to other rake tasks, it's easy for sysadmins to organize a number of chores that need to be done.
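One way to organize such chores is to give each step its own task and let a top-level task list them as prerequisites, which Rake invokes in the order given. The sketch below is an assumption about how that could look, not our actual release rake files. The chimp helper method and the task names are hypothetical, and the tag and script names are reused from the example above.

```ruby
require "rake"

# Hypothetical helper: build the chimp command line shared by every step.
def chimp(script)
  "chimp --tag=service:webserver=true --script='#{script}'"
end

# One small task per chore. Nothing runs until a task is invoked.
task :stop_web do
  sh chimp("stop nginx")
end

task :upgrade_app do
  sh chimp("upgrade")
end

task :start_web do
  sh chimp("start nginx")
end

# The release task has no body of its own. Rake invokes its
# prerequisites in the order they are listed.
task release: [:stop_web, :upgrade_app, :start_web]
```

With this in a Rakefile, rake release runs all three steps, while rake stop_web still runs a single step on its own. Because sh raises when a command exits with a non-zero status, a failed step stops the release before the later steps run.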

Chimp also supports controls for concurrency, so it's easy to do things like upgrade 100 servers, but only five at a time. The RightScale API lets us audit our jobs, making it possible for chimp to track each operation and make sure that it ran successfully. Chimp summarizes any failures for the operator and offers to re-execute a command on only failed servers. This is a real lifesaver when you're running a command on a bunch of servers and it fails on just one of them.
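The idea behind bounded concurrency and retrying only the failures can be sketched in a few lines of Ruby. This is not chimp's implementation: the function names are hypothetical, the job is any block that returns true on success, and the retry here is automatic, whereas chimp asks the operator first.

```ruby
# Run the job for every server with at most `concurrency` running at once.
# Returns the servers whose job raised or returned a falsy value.
def run_limited(servers, concurrency:, &job)
  queue = Queue.new
  servers.each { |server| queue << server }
  queue.close # pop returns nil once the closed queue is drained

  failed = []
  lock = Mutex.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (server = queue.pop)
        ok = begin
          job.call(server)
        rescue StandardError
          false
        end
        lock.synchronize { failed << server } unless ok
      end
    end
  end
  workers.each(&:join)
  servers & failed # report failures in the original order
end

# Re-run the job on the failed servers only, up to `retries` more times.
def run_with_retry(servers, concurrency:, retries: 1, &job)
  failed = run_limited(servers, concurrency: concurrency, &job)
  retries.times do
    break if failed.empty?
    failed = run_limited(failed, concurrency: concurrency, &job)
  end
  failed
end

# Hypothetical job: pretend that one server always fails.
servers = %w[web-1 web-2 web-3 web-4 web-5 web-6]
still_failing = run_with_retry(servers, concurrency: 2) do |server|
  server != "web-4"
end
puts "failed: #{still_failing.inspect}"
```

A fixed pool of worker threads pulling from a shared queue keeps the limit simple to reason about: there are never more jobs in flight than there are workers.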

RightScale has released chimp under an open source license. We hope that RightScale customers will find it as useful as we have for automating common tasks.

To learn more about how RightScale can help you improve cloud server monitoring and cloud automation, try RightScale for free.
