Before you congratulate yourself on crossing off those final to-dos for 2012, don’t forget this critical one: fine-tuning your applications and cloud architecture. And while it’s unlikely that you’ll accomplish this task by the end of the year, here are nine tips to help you manage your cloud applications and architecture in 2013.
Image credit: Santanu Vasant
1. Test to Determine Instance Size and Counts
- All clouds provide multiple server sizes with differing CPU, memory, and local disk capabilities. Research these offerings so you are aware of the range of server types provided.
- Do load testing to determine the ideal instance size and instance count for each server tier of your application. If your application is CPU intensive but doesn't consume much memory, use your cloud provider’s multi-core instance size.
- For your databases, it’s generally a good idea to use high-memory instance sizes and try to get your entire working set into memory. Pick a size, run some tests, look at the data. Repeat. A good fit is most likely out there. You just gotta find it.
- PlanForCloud cost forecasting is a free service that enables you to do what-if analyses on different deployments, clouds, and purchase options (such as on-demand vs. reserved instances).
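The pick-a-size, test, repeat loop above can be sketched as a small what-if calculation. This is only a rough sketch and not how PlanForCloud works: the size labels, hourly prices, and requests-per-second figures are made-up placeholders that you would replace with your provider's price list and your own load test results.

```python
import math
from dataclasses import dataclass


@dataclass
class InstanceOption:
    name: str                # hypothetical size label, not a real provider SKU
    hourly_price: float      # assumed on-demand price in dollars per hour
    rps_per_instance: float  # throughput you measured in your own load test


def instances_needed(option, peak_rps, min_count=2):
    """Instances required to serve peak_rps, never fewer than min_count."""
    return max(min_count, math.ceil(peak_rps / option.rps_per_instance))


def cheapest_option(options, peak_rps, hours=730):
    """Return (option, count, monthly_cost) with the lowest monthly cost."""
    best = None
    for option in options:
        count = instances_needed(option, peak_rps)
        cost = count * option.hourly_price * hours
        if best is None or cost < best[2]:
            best = (option, count, cost)
    return best


# Placeholder figures for illustration only.
options = [
    InstanceOption("small", 0.10, 120.0),
    InstanceOption("multi-core", 0.40, 600.0),
]
choice, count, monthly_cost = cheapest_option(options, peak_rps=1000)
print(f"{choice.name}: {count} instances, about ${monthly_cost:.2f} per month")
```

The `min_count=2` default is an assumption that anticipates the advice further down about never running a tier on a single server.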
2. Build for Server Failure
- Set up auto-scaling to handle dynamic load as well as to accommodate server failure. For example, set a lower bound on your server array size to ensure that you always have at least this minimum number of servers running. Arrays scale collections of servers up and down based on completely customizable conditions: monitors, alerts, batch-processing queues, or operational scripts. For more info on server arrays, see RightScale cloud automation features.
- Configure databases to replicate across zones and clouds to protect your application from single zone and/or single cloud failures. Degraded performance is better than no performance.
- Use dynamic DNS for internal components (such as databases) that may change IPs, and use static IPs for user-facing application entry points to eliminate the need to change DNS records (and the resulting aftermath of issues you’ll need to deal with, including browser caching and DNS propagation).
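To make the lower-bound idea concrete, here is a minimal sketch of a scaling decision that is clamped to a minimum and maximum array size. It is not RightScale's API or any provider's; the function name, the CPU-based trigger, and the threshold values are assumptions chosen for illustration.

```python
def desired_server_count(current, avg_cpu_percent, min_servers=2, max_servers=10,
                         scale_up_at=75.0, scale_down_at=25.0):
    """Return the target array size, clamped to [min_servers, max_servers].

    All thresholds are illustrative defaults, not recommendations.
    """
    if avg_cpu_percent >= scale_up_at:
        target = current + 1   # under load: add a server
    elif avg_cpu_percent <= scale_down_at:
        target = current - 1   # idle: remove a server
    else:
        target = current
    # The lower bound is what replaces a failed server even when load is low.
    return max(min_servers, min(max_servers, target))


# One server has died, leaving a single survivor at moderate load.
print(desired_server_count(current=1, avg_cpu_percent=50.0))
```

Because the clamp is applied last, the same rule covers both cases the tip describes: growing under load and refilling the array after a failure.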
3. Build for Zone Failure
- Always have two or more servers of each tier of your application distributed across two or more zones. Better to lose some of your eggs than all of your baskets.
- Replicate all data (for example, not just databases but also shared file systems) across zones.
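One simple way to check the "two or more servers in two or more zones" rule is to place servers round-robin and then ask whether every tier survives the loss of any one zone. This is a sketch under assumed names: the tier names and zone labels are hypothetical, and real placement would go through your cloud provider's launch API.

```python
def assign_zones(tiers, zones):
    """Spread each tier's servers round-robin across zones.

    tiers: {tier_name: server_count}; zones: list of zone labels.
    """
    if len(zones) < 2:
        raise ValueError("need at least two zones")
    placement = {}
    for tier, count in tiers.items():
        if count < 2:
            raise ValueError(f"tier {tier!r} needs at least two servers")
        placement[tier] = [zones[i % len(zones)] for i in range(count)]
    return placement


def survives_zone_loss(placement, failed_zone):
    """True if every tier still has a server outside failed_zone."""
    return all(
        any(zone != failed_zone for zone in tier_zones)
        for tier_zones in placement.values()
    )


# Hypothetical tiers and zone labels.
zones = ["zone-a", "zone-b"]
placement = assign_zones({"web": 3, "app": 2, "db": 2}, zones)
for zone in zones:
    print(zone, "can fail:", survives_zone_loss(placement, zone))
```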
4. Build for Cloud Failure
Back up or replicate data across regions (not just zones) of the same cloud provider — or to a different cloud provider altogether — so that in the unlikely event that an entire region fails, you have a "warm DR" environment in another region/cloud that will allow your application to be relaunched relatively quickly.
5. Automate Everything
- Automate your database/datastore backups, including the replication and/or copying of this data to another region/cloud so that if one region/cloud suffers an outage, your application won’t.
- Monitor all critical aspects of your application and try to identify little problems before they become big problems. Have automated tools to correct and mitigate these issues as they arise.
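As a rough illustration of catching little problems before they become big ones, the sketch below gives each metric a warning threshold below its critical threshold and attaches an automated response to it. The metric names, thresholds, and remediation actions are all hypothetical; in practice the readings would come from your monitoring system and the actions would be real scripts.

```python
def classify(value, warn_at, critical_at):
    """Map a metric reading to 'ok', 'warning', or 'critical'."""
    if value >= critical_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return "ok"


def run_checks(readings, rules):
    """Run each rule's remediation for any metric that is not 'ok'.

    readings: {metric_name: current_value}
    rules: {metric_name: (warn_at, critical_at, remediation_callable)}
    Returns a list of (metric_name, status, remediation_result).
    """
    results = []
    for name, (warn_at, critical_at, remediate) in rules.items():
        status = classify(readings[name], warn_at, critical_at)
        if status != "ok":
            results.append((name, status, remediate(status)))
    return results


# Hypothetical metrics, thresholds, and remediation actions.
rules = {
    "disk_used_percent": (
        80, 95, lambda s: "rotate logs" if s == "warning" else "page on-call"),
    "replication_lag_seconds": (30, 300, lambda s: "alert DBA"),
}
readings = {"disk_used_percent": 83, "replication_lag_seconds": 4}
for name, status, action in run_checks(readings, rules):
    print(f"{name}: {status} -> {action}")
```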
6. Cache, Cache, and More Cache
- Virtually all applications will benefit from caching, be it in front of the web/application tier, in front of the database tier, or both.
- Don't use caches that are co-resident with the application server. Most applications pull some common items from the database, and with a co-resident cache you will be hitting the database for that object once for each application server instead of just once.
- Use a separate multi-node distributed cache. While a single high-memory instance may be capable of handling your entire cache, the loss of this single instance would result in the loss of the entire cache, as opposed to just a small percentage if you have that cache distributed across many servers.
- Don't use standard caching solutions in an auto-scaling configuration, or you will need to continually update the cache configurations on your application servers, which typically requires a restart (and makes for unhappy end users). Also, every time a cache server leaves or joins the cluster there will be a big reshuffling of objects, which results in a flurry of database activity that degrades application performance. Several third-party solutions provide seamless dynamic scaling of the caching tier, so if cache auto-scaling is a must-have, take a look at solutions such as Couchbase.
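The reshuffling problem is easy to see in a small experiment. The sketch below assumes a cache client that picks a server with `hash(key) % number_of_servers`, and counts how many keys change servers when a fifth node joins. For comparison it runs the same test against a consistent-hash ring, which is one well-known way to limit the movement. The ring is shown only as a contrast; it is not a description of how Couchbase or any particular product works, and the node names and key names are made up.

```python
import bisect
import hashlib


def stable_hash(text):
    """Deterministic hash so results are the same on every run."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def modulo_node(key, nodes):
    """Naive placement: hash the key and take it modulo the node count."""
    return nodes[stable_hash(key) % len(nodes)]


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._points = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, stable_hash(key)) % len(self._hashes)
        return self._points[idx][1]


def fraction_moved(keys, before, after):
    """Share of keys whose owning node differs between two placements."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"object-{i}" for i in range(10000)]
old_nodes = ["cache-1", "cache-2", "cache-3", "cache-4"]
new_nodes = old_nodes + ["cache-5"]
old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)

modulo_moved = fraction_moved(
    keys, lambda k: modulo_node(k, old_nodes), lambda k: modulo_node(k, new_nodes))
ring_moved = fraction_moved(keys, old_ring.node_for, new_ring.node_for)
print(f"modulo placement: {modulo_moved:.0%} of keys moved")
print(f"consistent ring:  {ring_moved:.0%} of keys moved")
```

Every key that moves is a cache miss that lands on the database, which is why the modulo case produces the flurry of database activity described above.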
7. Replicate, a Lot
- Replicate data to another zone to insulate your application from single-zone failures. If a zone fails, and your slave is in that zone, there will be no impact to your application, and you can relaunch a new slave behind the scenes. If the zone with your master fails, then you will have a short downtime window as you promote your slave to master.
- Replicate data to another region/cloud to protect your application in the event of a disaster scenario. If something catastrophic happens, your data will persist in a different (and unaffected) environment.
- If possible, have an additional slave with data stored on the ephemeral disk. Several of the past AWS outages affected primarily (or solely) the EBS service, so if your application is not reliant on EBS, you can avoid service impacts from these types of outages.
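The two zone-failure cases described above (losing the slave versus losing the master) can be written down as a small decision routine. This is a sketch of the decision logic only, under assumed names: `DbNode` and `handle_zone_failure` are hypothetical, and real promotion and relaunch steps depend on your database and automation tooling.

```python
from dataclasses import dataclass


@dataclass
class DbNode:
    name: str
    zone: str
    role: str  # "master" or "slave"


def handle_zone_failure(nodes, failed_zone):
    """Return (surviving_nodes, actions) after failed_zone goes down."""
    survivors = [n for n in nodes if n.zone != failed_zone]
    lost = [n for n in nodes if n.zone == failed_zone]
    actions = []
    if any(n.role == "master" for n in lost):
        candidates = [n for n in survivors if n.role == "slave"]
        if not candidates:
            raise RuntimeError("no surviving slave to promote")
        # Short downtime window: writes stall until promotion completes.
        candidates[0].role = "master"
        actions.append(f"promote {candidates[0].name} to master")
    for node in lost:
        actions.append(f"launch a new slave in a healthy zone to replace {node.name}")
    return survivors, actions


# Hypothetical two-zone deployment; the slave's zone fails.
nodes = [DbNode("db-1", "zone-a", "master"), DbNode("db-2", "zone-b", "slave")]
survivors, actions = handle_zone_failure(nodes, "zone-b")
for action in actions:
    print(action)
```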
8. Watch for Single Points of Failure Lurking in Unexpected Places
As one example of a commonly overlooked dependency, if your application code is pulled from a Git or SVN repository, then your application's ability to scale depends directly on the availability of that repository. Your infrastructure design and automation may be spot-on, but if your application servers can't get the code they need, then your seemingly redundant infrastructure actually has a single point of failure. Look for other dependencies at the application layer in particular, as it is easy to focus on the infrastructure and overlook less obvious areas of exposure.
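A low-tech way to hunt for these is to keep an inventory of everything a server launch depends on, with the number of independent copies of each, and flag anything with fewer than two. The sketch below assumes such an inventory exists; the dependency names and counts are invented for illustration.

```python
def single_points_of_failure(dependencies):
    """Return the names of dependencies with fewer than two independent copies.

    dependencies: {name: number_of_independent_instances_or_mirrors}
    """
    return sorted(name for name, copies in dependencies.items() if copies < 2)


# Hypothetical inventory of what a new application server needs at boot.
launch_dependencies = {
    "load balancer": 2,
    "application servers": 4,
    "database": 2,
    "code repository": 1,   # the easily overlooked one
}
print(single_points_of_failure(launch_dependencies))
```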
9. Be Wary of Cloud Lock-In
The use of vendor-specific tools and virtual appliances may make deploying an application easier in the short term, but these services are often tied into other services, and a disruption in one of those underlying services can cascade into an outage of your own. Vendor-neutral solutions insulate your application tiers from these service integrations and give you a cloud-portable deployment that you can relaunch in another cloud if you need to carry out a disaster recovery plan, or if the pricing and/or features of another cloud prove enticing.
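At the code level, one common way to stay portable is to have the application talk to a small interface of your own and keep each provider's SDK behind an adapter. The sketch below is an assumption about how you might structure this, not a prescribed design: `BlobStore` and `save_report` are hypothetical names, and the in-memory class stands in for real provider adapters.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Vendor-neutral interface the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(BlobStore):
    """Stand-in backend; a real adapter would wrap one provider's SDK."""

    def __init__(self, label: str):
        self.label = label
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


def save_report(store: BlobStore, name: str, body: str) -> None:
    """Application code sees only BlobStore, never a provider SDK."""
    store.put(f"reports/{name}", body.encode("utf-8"))


# Writing to two backends mimics keeping a copy in a second cloud.
primary = InMemoryStore("cloud-a")
standby = InMemoryStore("cloud-b")
for store in (primary, standby):
    save_report(store, "q4.txt", "all systems nominal")
print(standby.get("reports/q4.txt"))
```

Swapping clouds then means writing one new adapter rather than touching every call site.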
If you want more details on any of these nine tips I've covered, check out my white paper on Building Scalable Applications in the Cloud. And here’s wishing you a cloudy New Year :)