Just in time to demo at the RightScale User Conference, we started a private beta of two new cluster monitoring features: stacked graphs and heat maps. We've been using them for a while internally and they've been invaluable for quickly determining the cause of issues. The idea behind these types of graphs is not new, but what we've been able to do is automate the back-end machinery that lets you create these graphs across any cluster of servers with just a few clicks.
The cluster monitoring we've had for a while shows an individual graph per server. Below is an example showing the CPU load on our web front ends. Each graph shows user time in blue and idle time in green over the course of one week. You can see that one of the last two servers has been sitting idle and the other one was launched less than a day ago. These individual graphs are nice for a small number of servers, but once your cluster has more than about a dozen they cease to be practical.
A stacked graph is a great alternative for displaying many servers on one graph where their activity contributes to a total or sum. In the case of the web front-ends, they all serve HTTP requests and contribute to the total of HTTP requests served by the cluster. This is what the graph below shows: each color band shows the requests/sec for one server, and the color bands are stacked on top of one another so that the top edge of the stack shows the total requests/sec served by the application. This gives a nice overview of what's happening in aggregate. Also, if one server were serving a lot more requests than the others, you'd be able to spot it because its band would be significantly wider. However, something that is not easy to notice in the stacked graph is that two of the servers are not serving requests. You'd have to start counting color bands to notice.
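To make the stacking concrete, here is a minimal Python sketch of how per-server series can be turned into bands. It is an illustration only, not the RRDtool-based code behind our graphs; the function name `stack_series` and the sample server names are made up for the example.

```python
def stack_series(series_by_server):
    """Turn per-server values into stacked bands.

    series_by_server: dict mapping a server name to a list of values
    (e.g. requests/sec), all lists having the same length.
    Returns (bands, total), where bands[name] is a list of
    (lower_edge, upper_edge) pairs, one per time step, and total is
    the top edge of the stack, i.e. the sum across all servers.
    """
    names = list(series_by_server)
    n_points = len(series_by_server[names[0]]) if names else 0
    baseline = [0.0] * n_points
    bands = {}
    for name in names:
        values = series_by_server[name]
        # Each band sits on top of everything stacked so far.
        upper = [b + v for b, v in zip(baseline, values)]
        bands[name] = list(zip(baseline, upper))
        baseline = upper
    return bands, baseline


# Hypothetical sample data: two busy servers and one idle one.
series = {"web1": [10, 12], "web2": [8, 9], "web3": [0, 0]}
bands, total = stack_series(series)
```

In this sample, the idle `web3` ends up with a band whose lower and upper edges coincide, which is exactly why idle servers disappear in a stacked graph.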
A heat map shows a somewhat different view of the same activity. In the heat map below each color bar represents the activity of one server, and the color of the bar at each point in time shows how "hot" the server is, i.e., the value of the variable being displayed, color-coded from blue to red. Here you can again see that the top two servers are outliers, one being idle the whole time, the other having just launched. On the other hand, it's pretty difficult to make out absolute values for how busy a server is.
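The blue-to-red coding can be sketched as a simple mapping from a value to a color. The Python below is a rough illustration that assumes a straight linear blend between pure blue and pure red; the actual color ramp in our graphs may well differ, and `heat_color` and `heat_row` are names invented for this example.

```python
def heat_color(value, lo, hi):
    """Map a value in [lo, hi] to an (r, g, b) tuple.

    lo maps to pure blue (cold), hi to pure red (hot); values outside
    the range are clamped. A linear blend is an assumption made for
    this sketch.
    """
    if hi <= lo:
        return (0, 0, 255)
    t = (value - lo) / (hi - lo)
    t = min(max(t, 0.0), 1.0)
    return (round(255 * t), 0, round(255 * (1 - t)))


def heat_row(values, lo, hi):
    """One heat map bar: the color of one server at each point in time."""
    return [heat_color(v, lo, hi) for v in values]


# Hypothetical CPU percentages for one server over four time steps.
row = heat_row([0, 25, 100, 140], lo=0, hi=100)
```

Because many nearby values blend into similar shades, a mapping like this makes outliers pop out while hiding exact numbers, which matches the trade-off described above.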
The bottom line is that none of the cluster graph types is better than the others; it just depends on what you're looking for, so it's nice to be able to flip back and forth. To illustrate this further, here is a real-life example of an issue we encountered. On a Wednesday morning we had alerts going off on a small number of our monitoring servers, showing unusually high load on those servers. We first looked at the heat map plotting I/O wait time for the cluster:
You'll notice that there are a lot of bars! The heat maps are currently limited to showing 100 servers at a time and we have more than 100 monitoring servers, so you're seeing a sampling (including the longest running, the shortest running, and some of each different ServerTemplate). Also, you'll notice the color coding alternates between blue-red and green-orange every 10 servers to make it easier to count. The red bars make it pretty clear that something started to affect a small number of servers around 8am. After looking at a number of other variables we saw the following stacked graph of the number of servers monitored.
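For the curious, the sampling and the alternating color schemes can be sketched roughly as follows in Python. This is a guess at one reasonable approach, not our actual selection code: the record fields (`name`, `launched_at`, `template`), the round-robin across ServerTemplates, and both function names are assumptions made for the illustration.

```python
def sample_servers(servers, limit=100):
    """Pick at most `limit` servers to plot.

    servers: list of dicts with 'name', 'launched_at' (a sortable
    timestamp) and 'template' keys (hypothetical record layout).
    Always keeps the longest-running and shortest-running servers,
    then fills up by taking turns across templates.
    """
    if len(servers) <= limit:
        return list(servers)
    by_age = sorted(servers, key=lambda s: s["launched_at"])
    picked = [by_age[0], by_age[-1]]  # longest and shortest running
    seen = {s["name"] for s in picked}
    groups = {}
    for s in by_age:
        if s["name"] not in seen:
            groups.setdefault(s["template"], []).append(s)
    queues = list(groups.values())
    # Round-robin so every template gets some representation.
    while len(picked) < limit and any(queues):
        for queue in queues:
            if queue and len(picked) < limit:
                picked.append(queue.pop(0))
    return picked[:limit]


def palette_for_row(row_index, group_size=10):
    """Alternate color schemes every `group_size` rows to ease counting."""
    if (row_index // group_size) % 2 == 0:
        return "blue-red"
    return "green-orange"


# Hypothetical fleet of 150 servers spread over three templates.
fleet = [
    {"name": f"s{i}", "launched_at": i, "template": "ABC"[i % 3]}
    for i in range(150)
]
sample = sample_servers(fleet)
```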
Here each color band represents the number of customer servers monitored by one of our monitoring back-ends. This view displays the activity for a whole week and highlights the fact that a slew of additional servers started being monitored right when we got the alerts. From there it was easy for us to pinpoint the cause, which turned out to be a limitation in our monitoring back-end assignment algorithm.
I'm really excited about these new monitoring features, in part because I've built similar graphs manually several times at different companies, and being able to get them automatically is amazing. You may have noticed that we produce the graphs using RRDtool, which we've extended quite a bit, in particular to draw the heat maps. To render a graph, a monitoring front-end queries the data series from the appropriate back-end servers and then assembles everything into one graph, which is sent to the browser. The result is that each of the two graphs above displays 60,000 data points. That's a lot of data to be able to see at a glance!
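The fetch-and-assemble step can be pictured with a small Python sketch. It is only an illustration of the general pattern and says nothing about how our front-end actually schedules its requests: `fetch_series` is a hypothetical stand-in that returns dummy data instead of calling a real back-end, and the use of a thread pool is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_series(backend, server):
    """Hypothetical stand-in for a network call to a monitoring back-end.

    Returns dummy data so the sketch runs without any real servers.
    """
    return [float(len(server))] * 4


def assemble(assignments, fetch=fetch_series, max_workers=8):
    """Fetch one data series per server and collect them for plotting.

    assignments: dict mapping each server name to the back-end that
    holds its data. Requests are issued concurrently; the result keeps
    the input order so rows or bands are drawn consistently.
    """
    servers = list(assignments)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `servers`.
        results = pool.map(lambda s: fetch(assignments[s], s), servers)
        return dict(zip(servers, results))


# Hypothetical assignment of two servers to two back-ends.
data = assemble({"web1": "backend-a", "web22": "backend-b"})
```

The assembled dictionary is what a plotting step would then turn into a single stacked graph or heat map.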
The stacked graphs and heat maps are currently in private beta. If you're running lots of servers using RightScale and would like to try them out, please drop me an email. One of the tasks still ahead of us before we can release this to everyone is improving the parallelism of the data fetching so we can plot more than 100 servers at a time.