High availability
One of the primary benefits of the public cloud is its geographical dispersion of resources. This distribution allows you to build highly available solutions at low cost. Availability covers a number of diverse topics and, depending on the customer, can be measured in different ways. Traditionally, system uptime was the primary indicator. In the pre-cloud era, five nines was a good goal to have. This meant that your systems were up 99.999% of the time; downtime could be no more than about five and a quarter minutes per year. As microservices became more prevalent in the cloud era, and systems got distributed across the globe, five nines became unrealistic. This is because complex systems inherently have more potential failure points and are more difficult to implement correctly. Consider a simple example with three components, each having five nines, where all three must be up for the system to work and failures are independent. The formula 99.999% × 99.999% × 99.999% ≈ 99.997% shows that the combined system is less available than any single component, with roughly three times the allowed downtime. This illustrates how traditional measures of uptime start to break down in the cloud.
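The arithmetic above can be checked with a short sketch. It assumes that every component must be up for the system to work and that components fail independently, which is a simplification of real systems. The function names `composite_availability` and `allowed_downtime_minutes` are illustrative and do not come from any library.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def composite_availability(component_availabilities):
    """Availability of a chain of components that must all be up.

    Assumes independent failures, so the individual availabilities
    (expressed as fractions, e.g. 0.99999) simply multiply.
    """
    return prod(component_availabilities)


def allowed_downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Downtime permitted over a period at the given availability."""
    return (1.0 - availability) * period_minutes


five_nines = 0.99999
system = composite_availability([five_nines] * 3)

print(f"Three five-nines components in series: {system:.5%}")
print(f"Downtime per year, one component: {allowed_downtime_minutes(five_nines):.2f} min")
print(f"Downtime per year, whole system:  {allowed_downtime_minutes(system):.2f} min")
```

Adding more components to the list shows how quickly the combined figure drops as a system grows, which is why per-component uptime becomes a weak indicator for distributed architectures.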
Amazon S3 has a service level agreement for uptime of 99.9%. This allows for roughly ten minutes of downtime a week. We will call this ten-minute window your error budget. In Chapter 7, Operation and Maintenance – Keeping Things Running at Peak Performance, we will go into more detail on error budgets as well as service level indicators, objectives, and agreements. For this chapter, we will use three nines as our availability goal and measure product availability, not system uptime. Although our examples primarily focus on instances, the same practices should be applied to improve the availability of your containerized and function-based (serverless) workloads.
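As a rough illustration of the error budget idea, the sketch below converts an availability target into minutes of allowed downtime per week and subtracts the downtime already incurred. The helper names and the four-minute outage are hypothetical values chosen for the example, not figures from any provider.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def error_budget_minutes(target, period_minutes=MINUTES_PER_WEEK):
    """Total downtime allowed in the period for an availability target (a fraction)."""
    return (1.0 - target) * period_minutes


def budget_remaining(target, downtime_minutes, period_minutes=MINUTES_PER_WEEK):
    """Error budget left after the downtime already spent; negative means the target was missed."""
    return error_budget_minutes(target, period_minutes) - downtime_minutes


three_nines = 0.999
weekly_budget = error_budget_minutes(three_nines)

# Hypothetical incident: a single four-minute outage this week.
remaining = budget_remaining(three_nines, downtime_minutes=4.0)

print(f"Weekly error budget at three nines: {weekly_budget:.2f} min")
print(f"Remaining after a 4-minute outage:  {remaining:.2f} min")
```

A negative result from `budget_remaining` would mean the week's availability goal has already been missed.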