RightScale Blog

Cloud Management Blog

Security Monitoring In Public IaaS: How We Do It at RightScale

In my experience helping RightScale customers at varying points on the cloud adoption spectrum, from investigating IaaS to launching a POC to already running production applications on IaaS, I see quite a bit of confusion about how to actually “do” security in the cloud, particularly in IaaS. The sheer volume of vendor cloud washing and sales FUD makes it even more difficult for them to get a straight answer.

As the Security Guy at RightScale and a cloud user myself, I'm writing a series of “How we do it” articles covering the “security things” that you should be doing as part of your security program. In this blog post I will share some of my thoughts and details on one important aspect of security: security monitoring in public IaaS. Read on for some actionable tips, or, if you want more in-depth expert guidance on security in the cloud, ask to talk with me when you request a RightScale demo.

What Is Security Monitoring?

So what exactly is security monitoring and why is it important? In the context of this blog, I am defining security monitoring as:

The ability to collect, analyze, and alert on security-related system and application events. These events are synonymous with individual log entries for the system or application in question.
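
To make that definition concrete, here is a minimal Python sketch of the collect, analyze, and alert steps applied to individual log entries. It assumes OpenSSH-style "Failed password" lines; the threshold, host names, and sample addresses are made up for illustration, and a real deployment would use a purpose-built tool rather than a script like this.

```python
import re
from collections import Counter

# Matches the "Failed password" entries that OpenSSH's sshd writes to the auth log.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+) port \d+"
)

def analyze(log_lines, threshold=3):
    """Return one alert string per source IP with at least `threshold` failures."""
    failures = Counter()
    for line in log_lines:              # collect: each log line is one event
        match = FAILED_LOGIN.search(line)
        if match:                       # analyze: keep only security-relevant events
            failures[match.group("ip")] += 1
    return [                            # alert: report sources at or over the threshold
        f"ALERT: {count} failed logins from {ip}"
        for ip, count in sorted(failures.items())
        if count >= threshold
    ]

# Illustrative sample data (documentation-range IP addresses).
sample = [
    "Jan 10 12:00:01 db1 sshd[811]: Failed password for root from 203.0.113.5 port 4242 ssh2",
    "Jan 10 12:00:03 db1 sshd[811]: Failed password for invalid user admin from 203.0.113.5 port 4243 ssh2",
    "Jan 10 12:00:05 db1 sshd[811]: Failed password for root from 203.0.113.5 port 4244 ssh2",
    "Jan 10 12:00:09 db1 sshd[902]: Accepted publickey for alice from 198.51.100.7 port 5151 ssh2",
]
print(analyze(sample))
```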

Security monitoring is important because it is a critical part of a holistic information security program. A holistic program includes practices such as governance, testing, patching, configuration management, identity and access management, and monitoring, among others. Although security monitoring is just one piece of that puzzle, like applying patches or configuring systems properly, it is critical because without it your security program is incomplete. If your system is compromised, or someone attempts to compromise it, without security monitoring you would not know it happened until you read about it in the news.

Let's start with some premises:

In order to get it right, you need to look at security monitoring in IaaS in the following light:

  • Cloud, and thus IaaS, is a new way to deliver IT.
  • Security fundamentals in the cloud are similar to any other environment (there is no secret sauce).
  • Monitoring in IaaS is a subset of monitoring in a traditional enterprise. The main difference is that there is no general network-level visibility (for example, there is no SPAN port from which to view all traffic passing through a device). I know you can architect a "choke point" that all traffic passes through, and so regain much of the visibility you have lost. However, if you do this, you are essentially shoehorning in an old solution where a new one is in order (see the first bullet above), and you will introduce problems (a single point of failure, bandwidth limits, latency issues, etc.). I recommend that you embrace the change and figure out how you should do it (for example, compensate for the lack of network visibility with host-based tools) rather than forcing in the approach you used in the past. (Note: One aspect of network-level monitoring focuses on the network itself, which is managed by the provider; it is out of our scope and would not typically be considered anyway.)

Begin by asking "Why?"

You need to begin the whole process by asking "Why implement security monitoring?" rather than "How to do it?" or "What to use?". So why are you implementing security monitoring? To meet compliance requirements? Prevention? Detection? Forensics? Other?

Once you have identified the "Why," you can start looking at the "How to do it." Will you use a host intrusion detection system (HIDS), application logs, system logs, host network traffic, etc. as data sources for your monitoring? After the "How," the "What to use" should follow: open source (syslog, Graylog, OSSEC, etc.) or commercial (Sumo Logic, CloudPassage, Trend Micro, etc.).

The RightScale answer to "Why?"

Our answer to "Why implement security monitoring?" is compliance, burglar alarms, and forensics. We needed to meet compliance requirements, we wanted a system that would notify us if something we knew was not supposed to happen did happen, and we wanted to be able to look at past events if needed for forensics purposes.

Next consider the "How?"

Once you have answered why you need to implement security monitoring, then you can move on to the "how to do it." When we were identifying the "How?" in our security monitoring environment, we identified the following critical items that needed consideration:

  1. Alert latency: How quickly did we need to know that something had happened? We had to consider both the time for the event to be processed and the time to actually receive the alert. Our answer had a direct impact on requirements for storage I/O, network bandwidth, CPU, and memory.
  2. Bandwidth: In a centralized system (I know I have not addressed this yet, but I will), is there enough bandwidth on the processing system to handle all the network load? Also, what will the cost be relative to the amount of data that needs to be transferred off the host?
  3. Reliability: What level of reliability is required for the actual data and alerts that are processed? Most folks would say "total reliability," yet the most prevalent transport for logs is syslog over UDP, which offers no delivery guarantee (per RFC 768: "...delivery and duplicate protection are not guaranteed."). Our decision was driven by our answers to the "Why?" question.
  4. Deployment model: I know there are as many ways to deploy monitoring as there are architects that design them, but I believe that the following three models are best for IaaS cloud use (RightScale uses all three, as I describe later):
    A. Local agent, local alerting, central correlation, central archive
    B. Local agent, central alerting, central correlation, central archive
    C. No agent, central collection, central alerting, central correlation, central archive
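
The reliability question above is easier to reason about with a toy model. The Python sketch below contrasts fire-and-forget delivery (the behavior you get from syslog over UDP) with a sender that resends until the receiver acknowledges each message. It is only an in-memory simulation under assumed conditions: the lossy channel, its drop pattern, and all function names are hypothetical, and this is not the wire protocol of any real log transport.

```python
def send_fire_and_forget(messages, channel):
    """Send each message once; lost messages go unnoticed by the sender."""
    received = []
    for msg in messages:
        if channel(msg):
            received.append(msg)
    return received

def send_with_acks(messages, channel, max_attempts=5):
    """Resend each message until the receiver acknowledges it."""
    received = []
    for msg in messages:
        for _ in range(max_attempts):
            if channel(msg):            # a True return stands in for an ack
                received.append(msg)
                break
        else:
            raise RuntimeError(f"gave up on {msg!r} after {max_attempts} attempts")
    return received

def make_lossy_channel(drop_every=3):
    """Hypothetical channel that drops every `drop_every`-th transmission."""
    count = 0
    def channel(_msg):
        nonlocal count
        count += 1
        return count % drop_every != 0
    return channel

logs = [f"event-{i}" for i in range(6)]
print(len(send_fire_and_forget(logs, make_lossy_channel())), "of", len(logs), "arrived")
print(len(send_with_acks(logs, make_lossy_channel())), "of", len(logs), "arrived")
```

The acknowledged sender pays for its reliability with extra transmissions and sender-side state, which is worth weighing against your bandwidth and alert latency answers.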

The RightScale answer to "How?"

Based on our "Why implement security monitoring?" answers, our "How?" decisions came out like this:

  • Alert latency: We will trigger an alert to fire within 3 minutes of a “burglar alarm” event happening.
  • Bandwidth: We will limit the cost associated with data transfer (we generate hundreds of GB of logs per day, and that volume is growing significantly) by using systems in zones/regions where large bandwidth usage is free (ideally) or has minimal cost. One of the major considerations is that RightScale, as part of our platform design, already has a requirement to centralize logs for troubleshooting purposes, so leveraging that common machinery is highly desirable.
  • Reliability: We will ensure that logs are available in a central store by using a reliable mechanism.
  • Deployment model: We will leverage all three deployment models noted above.
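
The latency and bandwidth decisions above come down to simple arithmetic, sketched here in Python. Only the 3-minute budget comes from the list above; the stage timings, log volume, and per-GB prices are placeholder assumptions that you would replace with your own measurements and your provider's actual pricing.

```python
ALERT_BUDGET_SECONDS = 3 * 60   # the 3-minute "burglar alarm" target

def alert_latency(ship_s, process_s, notify_s):
    """Total seconds from the event occurring to someone receiving the alert."""
    return ship_s + process_s + notify_s

def monthly_transfer_cost(gb_per_day, price_per_gb, days=30):
    """Data-transfer cost of shipping logs off the hosts for one month."""
    return gb_per_day * price_per_gb * days

# Assumed stage timings (seconds): log shipping, rule processing, notification.
latency = alert_latency(ship_s=20, process_s=60, notify_s=30)
print("latency:", latency, "s; within budget:", latency <= ALERT_BUDGET_SECONDS)

# Assumed volume and prices: a paid cross-region path versus a free in-region path.
print("paid path:", monthly_transfer_cost(gb_per_day=300, price_per_gb=0.02))
print("free path:", monthly_transfer_cost(gb_per_day=300, price_per_gb=0.0))
```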

Lastly decide on the "What?"

Once you have identified "Why?" and "How?" with regard to implementing security monitoring, you can look into "What?" you will implement. This is where you get to start looking at specific technology solutions that fit into your "How?" and meet your "Why?". You can start looking at vendor products, open source solutions, developing something in house, or most likely a combination of them all. Another important part of this phase is figuring out the limitations of the technologies that are available to you and, if you cannot find something that meets a specific need, feeding that back into the earlier questions.

The RightScale answer to "What?"

Based on the answers to "Why?" and "How?", we developed the following design for "What?" to do. We decided that using a combination of all three deployment models was going to be the best way to meet the three "Why?" requirements we had:

  1. Critical servers (for example, database servers): We use OSSEC standalone mode, rsyslog and RELP (Reliable Event Logging Protocol) to implement model "A" (local agent, local alerting, central correlation, central archive). This gives us the ability to implement local burglar alarms that can alert quickly if something is amiss. Implementing the local agent and local alerting increased the administrative burden for these systems, but because of their critical nature, we felt that the added burden was justified.
  2. PCI environment: We chose the commercial CloudPassage product Halo to implement deployment model "B" (local agent, central alerting, central correlation, central archive). Halo also provides additional security benefits for our PCI compliance requirements.
  3. For the rest of our systems: We use RELP to send to a central log collector and OSSEC (server mode) for alerting and correlation (model "C"). And we deploy central collectors in locations that meet the "bandwidth cost reduction" requirement. We decided on model "C" because it is the easiest from an administration point of view (no local agents to manage). It allows for generic correlation and alerting for the environment and provides the needed logs for forensics.
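
One way to keep a mapping like this explicit in provisioning code is a small lookup from server role to deployment model, sketched below in Python. The role tags ("critical", "pci") and all names are hypothetical and chosen for illustration; this does not describe how RightScale actually encodes the decision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentModel:
    name: str
    local_agent: bool
    alerting: str   # "local" or "central"; correlation and archive are central in all three

MODEL_A = DeploymentModel("A", local_agent=True, alerting="local")
MODEL_B = DeploymentModel("B", local_agent=True, alerting="central")
MODEL_C = DeploymentModel("C", local_agent=False, alerting="central")

# Hypothetical role tags; any role not listed falls back to the agentless model.
MODEL_BY_ROLE = {"critical": MODEL_A, "pci": MODEL_B}

def model_for(role):
    """Pick the monitoring deployment model for a server role."""
    return MODEL_BY_ROLE.get(role, MODEL_C)

for role in ("critical", "pci", "web"):
    print(role, "->", model_for(role).name)
```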

I cover deployment models in more detail in my on-demand webinar Security Monitoring In the Cloud: How RightScale Does It.

Examples of Alerts We Use

By now you have no doubt figured out that by "monitoring" I effectively mean "log analysis," and one of the foremost questions in most people's minds will be: How does one classify a log entry as "interesting"?

The answer, as you would expect, is "it depends." Arguably the single most difficult part of security monitoring is defining what is worth alerting about — especially with logs. Some examples of events we at RightScale find noteworthy and want to be alerted on are:

  • Interactive login to our database server: In our environment this is a rare event, so I want to know when it happens. Also, I keep an eye on any statistical increase of the occurrences of such events, as they may indicate something amiss. 
  • Database access from an unexpected system: This situation should not occur and would point to a potential firewall problem or indicate some other issue that needs to be addressed ASAP.
  • Former staff user account access attempts: This alert primarily helps with identifying any staff account that may have been "missed" in the employee off-boarding process.
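
Rules like these three can be expressed as simple checks over parsed log events. The Python sketch below assumes each event has already been parsed into a dictionary; the field names (kind, host, user, source), host names, and account names are all hypothetical, and a real rule set would live in your monitoring tool's own rule format.

```python
# Hypothetical inventory data; in practice this would come from your own systems.
DB_HOSTS = {"db1", "db2"}
ALLOWED_DB_CLIENTS = {"app1", "app2"}
FORMER_STAFF = {"jdoe"}

def classify(event):
    """Return the names of the alerts that a parsed log event triggers."""
    alerts = []
    kind = event.get("kind")
    # Interactive logins to database servers are rare, so every one is reported.
    if kind == "interactive_login" and event.get("host") in DB_HOSTS:
        alerts.append("interactive-db-login")
    # Database connections should only come from known application servers.
    if kind == "db_connection" and event.get("source") not in ALLOWED_DB_CLIENTS:
        alerts.append("unexpected-db-client")
    # Any login or failed login by an off-boarded account is suspicious.
    if kind in {"interactive_login", "auth_failure"} and event.get("user") in FORMER_STAFF:
        alerts.append("former-staff-account")
    return alerts

events = [
    {"kind": "interactive_login", "host": "db1", "user": "alice"},
    {"kind": "db_connection", "host": "db1", "source": "app1"},
    {"kind": "db_connection", "host": "db1", "source": "build7"},
    {"kind": "auth_failure", "host": "web3", "user": "jdoe"},
]
for event in events:
    print(event, "->", classify(event))
```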

These are the few items on our "what to watch for" list that I am willing to share: I want to give you examples of alerts, but I don't want everyone to know specifically what I am looking for.

Summing It Up

You must start with the "Why?" question when tackling the implementation of security monitoring. You'll notice that the ultimate decisions for RightScale are directly related to our "Why?" answers, no more and no less. Without knowing those answers (and I would argue that most of you don't), you will go down the path of deploying technology for technology's sake and likely be continually frustrated by your lack of desired results.

At RightScale, we decided that the ease of administration, especially in a highly elastic IaaS environment, was very important. Also, from my past experience deploying global security event monitoring tools, I knew that custom correlation and alerting is the best place to start (this limits false positives). It is always easier to start with success and grow from there.

Although we have put a decent bit of work into our security monitoring, it is an ongoing process. As I said earlier, security monitoring in public IaaS is a new way of doing things and everyone is still learning at this point. We will adjust and change as our needs or solutions change.