When we work with a client on an application build in the cloud, we need to build the infrastructure to meet their uptime Service Level Agreement (SLA). This figure is usually expressed as a certain number of 9s, such as 99.99%. At 99.99%, the application can experience unscheduled downtime only 0.01% of the time, which is roughly 53 minutes over a full year. It’s important to know that the cloud providers themselves (e.g. AWS and Azure) offer SLAs for some of their services.
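To make the "number of 9s" concrete, here is a minimal sketch that converts an uptime percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; many SLAs are measured per month or per billing cycle, so the period should be adjusted to match the agreement being evaluated.

```python
# Illustrative helper (hypothetical name): minutes of downtime an uptime SLA allows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_budget_minutes(sla_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime, in minutes, permitted by an uptime SLA over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * period_minutes


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(sla)
        print(f"{sla}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Each additional 9 cuts the downtime budget by a factor of ten, which is why moving from 99.9% to 99.99% is a much bigger commitment than it looks on paper.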

When architecting for uptime, there are two traps that are easy to fall into. The first is expecting that every service offered by a cloud provider carries the same SLA. The second is seeing that a service offers 99.99% and assuming that the application you build on top of it will automatically offer that same level of uptime. Let’s take a deeper look into these two stumbling blocks.

Most cloud offerings don’t come with any kind of SLA. The services that do are often very particular about the type of uptime they’re guaranteeing. A quick glance at the AWS SLA for compute shows a guarantee of 99.99% uptime. Upon closer examination, one will see this is only for the availability of the EC2 service, not for individual instances. Single EC2 servers are only guaranteed to be up 90% of the time!

The second trap is less obvious. When building an application that depends on two services which both offer 99.99% uptime, the effective uptime is less than that of either service, because both services must be available for the application to function: 0.9999 x 0.9999 is roughly 99.98%. Conversely, when an application uses a pool of servers where only one of the group needs to be functional, the effective uptime is greater than that of any individual server, because the application fails only when every server in the pool is down at the same time. Both rules assume that the components fail independently of one another. This chart illustrates the concept:
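The same two rules can be sketched in a few lines of Python. The function names are hypothetical, and the arithmetic assumes component failures are independent, which real outages (a regional failure, a bad deployment) do not always respect.

```python
from math import prod


def serial_uptime(*uptimes: float) -> float:
    """All components are required, so their uptimes multiply."""
    return prod(uptimes)


def pool_uptime(*uptimes: float) -> float:
    """Only one component is required, so the pool is down only when all are down."""
    return 1 - prod(1 - u for u in uptimes)


if __name__ == "__main__":
    # Two dependencies that must both be available: worse than either alone.
    print(f"serial: {serial_uptime(0.9999, 0.9999):.6f}")
    # Three redundant servers where any one suffices: better than any alone.
    print(f"pool:   {pool_uptime(0.9, 0.9, 0.9):.6f}")
```

Uptimes are passed as fractions (0.9999 rather than 99.99) so the results can be multiplied together directly when an architecture mixes both patterns.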

Let’s try an example. Consider an AWS infrastructure consisting of an Elastic Load Balancer (ELB) with three EC2 instances behind it. Assume any one instance can handle all the expected traffic. The AWS documentation says that the ELB service offers 99.99% uptime. We know that a single EC2 instance offers 90% uptime. To calculate the effective uptime one should expect for this architecture, use this equation: (1 - (1 - 0.9)^3) x 0.9999 x 100, which comes out to 99.89% uptime.
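As a rough sketch, the calculation above can be wrapped in a small function to see how the result changes with the number of instances. The function and parameter names are hypothetical, the 90% and 99.99% figures are the ones quoted in this article, and instance failures are assumed to be independent.

```python
def effective_uptime_percent(instance_uptime: float,
                             instance_count: int,
                             load_balancer_uptime: float) -> float:
    """Uptime (in percent) of a load balancer in front of a redundant instance pool."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    # The pool is down only when every instance is down at once.
    pool = 1 - (1 - instance_uptime) ** instance_count
    # The load balancer and the pool are both required, so their uptimes multiply.
    return pool * load_balancer_uptime * 100


if __name__ == "__main__":
    for count in range(1, 5):
        uptime = effective_uptime_percent(0.9, count, 0.9999)
        print(f"{count} instance(s): {uptime:.2f}%")
    # 1 instance(s): 89.99%
    # 2 instance(s): 98.99%
    # 3 instance(s): 99.89%
    # 4 instance(s): 99.98%
```

Note that no number of instances can push the result past the load balancer's own 99.99%, because the load balancer sits in series with the whole pool.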

To summarize, calculating effective uptime of a complex, multi-node solution that uses many cloud services can be very difficult. Summit helps clients navigate the challenge of uptime calculation and constructs cloud buildouts to support a variety of SLAs.

