RightScale Blog

Cloud Management Blog

Cluster Monitoring with Stacked Graphs and Heat Maps

Just in time to demo at the RightScale User Conference, we started a private beta of two new cluster monitoring features: stacked graphs and heat maps. We've been using them internally for a while, and they've been invaluable for quickly determining the cause of issues. The idea behind these types of graphs is not new, but what we've been able to do is automate the back-end machinery that lets you create these graphs across any cluster of servers with just a few clicks.

The cluster monitoring we've had for a while shows an individual graph per server. Below is an example showing the CPU load on our web front ends. Each graph shows user time in blue and idle time in green over the course of one week. You can see that one of the last two servers has been sitting idle and the other one was launched less than a day ago. These individual graphs are nice for a small number of servers, but once your cluster has more than about a dozen they cease to be practical.

A stacked graph is a great alternative for displaying many servers on one graph when their activity contributes to a total or sum. In the case of the web front ends, they all serve HTTP requests and contribute to the total of HTTP requests served by the cluster. This is what the graph below shows: each color band shows the requests/sec for one server, and the color bands are stacked on top of one another so that you can read the requests/sec served by the whole application at the top. This gives a nice overview of what's happening in aggregate. Also, if one server were serving a lot more requests than the others, you'd be able to spot it because its band would be significantly wider. What is not easy to notice in the stacked graph, however, is that two of the servers are not serving requests. You'd have to start counting color bands to notice.
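The stacking itself is just a running sum across servers. Here is a minimal sketch in Python, not the RRDtool-based code behind the graphs in this post; the function name `stack_series` and the sample numbers are made up for illustration.

```python
def stack_series(series_by_server):
    """Compute stacked band edges for a set of per-server series.

    series_by_server maps a server name to a list of values (for example
    requests/sec), one value per time step. All lists must be equally long.
    Returns (bands, total): bands maps each server to its (lower, upper)
    edges, and total is the top edge of the last band, i.e. the cluster sum.
    """
    if not series_by_server:
        return {}, []
    lengths = {len(values) for values in series_by_server.values()}
    if len(lengths) != 1:
        raise ValueError("all series must have the same number of points")

    baseline = [0.0] * lengths.pop()
    bands = {}
    for name, values in series_by_server.items():
        # Each band sits on top of everything stacked before it.
        top = [low + value for low, value in zip(baseline, values)]
        bands[name] = (baseline, top)
        baseline = top
    return bands, baseline


# Illustrative data only: web2 is idle, so its band has zero height.
bands, total = stack_series({
    "web1": [10.0, 12.0],
    "web2": [0.0, 0.0],
    "web3": [5.0, 7.0],
})
```

In this example the lower and upper edges of `web2` are identical, which is exactly why an idle server is so hard to see in a stacked graph: its band simply disappears.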

A heat map shows a somewhat different view of the same activity. In the heat map below, each color bar represents the activity of one server, and the color of the bar at each point in time shows how "hot" the server is, i.e., the value of the variable being displayed, color-coded from blue to red. Here you can again see that the top two servers are outliers, one being idle the whole time, the other having just launched. On the other hand, it's pretty difficult to make out absolute values for how busy a server is.
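The blue-to-red color coding can be sketched as a simple linear ramp. This is only an illustration, not the RRDtool extension we use; the function names, the RGB ramp and the choice of scale limits are all assumptions.

```python
def heat_color(value, lo, hi):
    """Map a value in [lo, hi] to an (R, G, B) tuple from blue to red.

    Values outside the range are clamped, so the coldest color is pure
    blue and the hottest is pure red.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    t = (value - lo) / (hi - lo)
    t = min(max(t, 0.0), 1.0)  # clamp to [0, 1]
    red = round(255 * t)
    return (red, 0, 255 - red)


def heat_row(series, lo, hi):
    """Turn one server's series into the colors of its bar in the heat map."""
    return [heat_color(value, lo, hi) for value in series]
```

Because every value collapses to a single color, a ramp like this makes outliers stand out while making absolute values hard to read, which matches the trade-off described above.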

The bottom line is that none of the cluster graph types is better than the others. It just depends on what you're looking for, so it's nice to be able to flip back and forth. To illustrate this further, here is a real-life example of an issue we encountered. On a Wednesday morning, alerts went off on a small number of our monitoring servers, indicating unusually high load on those servers. We first looked at the heat map plotting I/O wait time for the cluster:

You'll notice that there are a lot of bars! The heat maps are currently limited to showing 100 servers at a time and we have more than 100 monitoring servers, so you're seeing a sampling (including the longest running, the shortest running, and some of each different ServerTemplate). Also, you'll notice the color coding alternates between blue-red and green-orange every 10 servers to make it easier to count. The red bars make it pretty clear that something started to affect a small number of servers around 8am. After looking at a number of other variables we saw the following stacked graph of the number of servers monitored.
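To make the sampling and the alternating color scheme concrete, here is a rough sketch. The exact selection rules we use are not spelled out in this post, so the round-robin over ServerTemplates below is an assumption, as are the function names and the `(name, template, uptime_seconds)` tuple layout.

```python
from collections import defaultdict
from itertools import zip_longest


def sample_servers(servers, limit=100):
    """Pick at most `limit` servers for a heat map.

    servers is a list of (name, template, uptime_seconds) tuples.
    Keeps the longest-running and the shortest-running server, then fills
    the remaining slots round-robin across templates (assumed strategy).
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    if len(servers) <= limit:
        return list(servers)

    by_uptime = sorted(servers, key=lambda server: server[2])
    picked = [by_uptime[-1], by_uptime[0]]  # longest, then shortest running

    groups = defaultdict(list)
    for server in by_uptime[1:-1]:
        groups[server[1]].append(server)

    # Take one server per template per round until the limit is reached.
    for round_of_servers in zip_longest(*groups.values()):
        for server in round_of_servers:
            if server is not None and len(picked) < limit:
                picked.append(server)
    return picked


def palette_for_row(row_index, block=10):
    """Alternate the color scheme every `block` rows to ease counting."""
    return "blue-red" if (row_index // block) % 2 == 0 else "green-orange"
```

With the default block size, rows 0 to 9 use the blue-red scheme, rows 10 to 19 use green-orange, and so on.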

Here each color bar represents the number of customer servers monitored by one of our monitoring back-ends. This view displays the activity for a whole week and highlights the fact that a slew of additional servers were monitored right when we got the alerts. From there it was easy for us to pinpoint the cause, which turned out to be a limitation in our monitoring back-end assignment algorithm.

I'm really excited about these new monitoring features, in part because I've built similar graphs manually several times at different companies, and being able to get them automatically is amazing. You may have noticed that we produce the graphs using RRDtool, which we've extended quite a bit, in particular to draw the heat maps. To render a graph, a monitoring front-end queries the data series from the appropriate back-end servers and then assembles everything into one graph, which is sent to the browser. The result is that each of the two graphs above displays 60,000 data points. That's a lot of data to be able to see at a glance!

The stacked graphs and heat maps are currently in private beta. If you're running lots of servers using RightScale and would like to try them out, please drop me an email. One of the tasks still ahead of us before we can release this to everyone is improving the parallelism of the data fetching so we can plot more than 100 servers at a time.
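For readers curious what parallel data fetching could look like, here is a minimal sketch using a thread pool. It is not our front-end code: `fetch_all`, the `assignments` mapping and the `fetch_series` callback are hypothetical names, and the real query to a monitoring back-end is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(assignments, fetch_series, max_workers=8):
    """Fetch one data series per server, several at a time.

    assignments maps a server name to the back-end that holds its data.
    fetch_series(backend, server) is supplied by the caller and returns
    that server's list of values (hypothetical signature).
    Returns a dict of server -> series in the same order as `assignments`.
    """
    servers = list(assignments)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the calls concurrently but yields results in
        # input order, so the series stay matched to their servers.
        results = pool.map(
            lambda server: fetch_series(assignments[server], server),
            servers,
        )
        return dict(zip(servers, results))
```

The returned dict can then be handed to whatever assembles the single graph. Threads are a reasonable fit here on the assumption that the work is dominated by waiting on back-end responses rather than by computation.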

Comments

good point I should have addressed. we added some features to rrdtool to facilitate the stacked graphs and we added the heat maps. you can actually find the code if you search, and we're planning to contribute it back if Tobi takes it. The front end is a different situation. it's not code we can easily extract and publish. maybe when we rewrite it we can open source that...
Posted by Thorsten von Eicken (not verified)   Ι   November 03, 2010   Ι   03:45 PM
Huge thanks if this makes it in to RRD - the Heat maps in particular would be a giant leap forward.
the code is actually visible on github. "just" need to create a proper diff without too much debug crap to current trunk and then fix all the x-platform stuff that inevitably comes up. ahh, the fun parts of trying to contribute to OSS...
Excellent points! My response started getting a bit long so I wrote a post that expands upon your key points. You can find it here: http://www.evidentsoftware.com/?p=1927
Yo, Any of this stuff available as FOSS? The stacked graphs is super nice - reminds me of Munin (except with obviously higher-resolution trend data) I just rolled out some auto-collectd+nagios+cuke, finding a nice UI for CollectD RRD's has been hard.
What are the open source equivalents of RightScale, which allow you to monitor graphs of memcached instances?... RightScale's monitoring tools are built on open source components; CollectD (www.collectd.org) resides on the server and sends the metrics back to RightScale's array of monitoring servers and RRDtool (http://www.mrtg.org/rrdtool/) is used to graph th...
