Reliability and Performance Are Linked
Safety, testability, quality, maintainability, stability, durability, and availability are all aspects of a workload’s overall reliability. Reliability is the critical design consideration. For example, if a workload crashes periodically but restarts and carries on, persistent end users who retry their requests might eventually get their results. Beyond the obvious reliability issue, there is also a performance issue: the workload’s effective performance is lower because of the required retries.
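The retry penalty described above can be estimated with a little arithmetic. The following is a minimal sketch, assuming each request fails independently with a fixed probability and that a failed attempt takes as long as a successful one; the function names and the sample numbers are illustrative assumptions, not measurements from any real workload.

```python
def expected_attempts(failure_rate: float) -> float:
    """Average number of attempts needed for one success.

    Assumes independent failures (a geometric distribution), which is a
    simplification of how real outages behave.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in the range [0, 1)")
    return 1.0 / (1.0 - failure_rate)


def effective_latency_ms(
    base_latency_ms: float,
    failure_rate: float,
    retry_delay_ms: float = 0.0,
) -> float:
    """Average time a user waits for a successful response, retries included."""
    attempts = expected_attempts(failure_rate)
    # Every attempt costs the base latency; each failed attempt
    # (attempts - 1 on average) also costs the hypothetical retry delay.
    return attempts * base_latency_ms + (attempts - 1.0) * retry_delay_ms


if __name__ == "__main__":
    # Hypothetical workload: 200 ms per request, 20% of requests fail,
    # and users wait 100 ms before retrying.
    print(effective_latency_ms(200.0, 0.2, 100.0))
```

Under these assumptions, even a modest failure rate raises the average response time noticeably, which is why reliability problems show up as performance problems.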
An unresponsive website will eventually cause customer dissatisfaction, which erodes trust and damages the reputation of the application. If it leads prospective or established customers to shop elsewhere, the result is lost business. Maintaining redundant cloud services to improve workload reliability and performance adds operational expense, for example, running multiple EC2 instances and multiple database instances that provide redundant storage.
When designing for reliability, it’s important to realize that not all workload dependencies have the same impact when they fail. An outage in an application stack designed with hard dependencies, such as a single primary database with no alternate database as a backup, will cause problems that cannot be ignored when failure occurs. An outage of a soft dependency, such as an alternate database read replica, should have no short-term impact on regular workload operation. Workload reliability can also positively affect overall performance. By deploying across multiple availability zones, each backed by physically separate data centers miles apart, workloads can achieve reliability and high availability because the web servers and the primary and alternate database servers are hosted in separate physical locations. Database records can be kept up to date using synchronous replication between the primary and alternate database instances or storage locations.
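The difference between a chain of hard dependencies and a redundant deployment can be illustrated numerically. This is a rough sketch, assuming component failures are independent (which real outages often are not); the availability figures are hypothetical and are not published AWS numbers.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a chain of hard dependencies: all must be up."""
    return prod(components)


def redundant_availability(replica_availability: float, count: int) -> float:
    """Availability of `count` redundant replicas: at least one must be up."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 - (1.0 - replica_availability) ** count


if __name__ == "__main__":
    # Hypothetical figures: a web tier and a database, each 99% available.
    single_zone = serial_availability([0.99, 0.99])
    # The same database deployed in two availability zones.
    database_pair = redundant_availability(0.99, 2)
    two_zones = serial_availability([0.99, database_pair])
    print(single_zone, two_zones)
```

The sketch shows why hard dependencies multiply risk while redundant copies reduce it, provided the copies do not fail together.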
Disaster Recovery
In addition to defining your availability objectives, you should consider disaster recovery (DR) objectives. How is each workload recovered when disaster occurs? How much data can you afford to lose, and how quickly must you recover? Your organization must define an acceptable recovery time objective (RTO) and recovery point objective (RPO) for each workload, and then test them to ensure that the application meets, and possibly exceeds, the desired service-level objectives. RTO is the maximum acceptable delay between the interruption of an application and the restoration of service. RPO is the maximum acceptable amount of data loss, measured as the time since the last recovery point.
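The RTO and RPO definitions above translate into two simple comparisons. The sketch below assumes that the worst-case data loss equals the interval between backups and that the restore time comes from an actual recovery drill; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable data loss, expressed as time


def check_recovery_plan(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    measured_restore_time: timedelta,
) -> Dict[str, bool]:
    """Compare a tested recovery plan against the stated objectives."""
    return {
        # Worst case, everything written since the last backup is lost.
        "rpo_met": backup_interval <= objectives.rpo,
        # The restore time should come from a real recovery test.
        "rto_met": measured_restore_time <= objectives.rto,
    }


if __name__ == "__main__":
    objectives = RecoveryObjectives(
        rto=timedelta(hours=1), rpo=timedelta(minutes=15)
    )
    print(
        check_recovery_plan(
            objectives,
            backup_interval=timedelta(minutes=5),
            measured_restore_time=timedelta(minutes=90),
        )
    )
```

In this hypothetical case the backup schedule satisfies the RPO, but the 90-minute restore misses the one-hour RTO, so the recovery procedure needs work even though the backups are frequent enough.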
Placing Cloud Services
It’s critical that you choose where each workload component resides and operates. Some cloud services, such as DNS name resolution, traffic routing, and content delivery networks (CDNs), are globally distributed. But most AWS cloud services are regional by design: they are hosted in one particular geographical location, even though they might be accessible globally. Techniques such as replication, redirection, and load balancing allow you to deploy a workload’s cloud services as a multi-region architecture.
Exam questions will ask you to weigh several factors when deciding where each workload and its associated cloud services should be located to best meet the needs of the question’s scenario: host location, data caching, data replication, load balancing, and the required failover architecture.
Data Residency and Compute Locations
Running workloads in the cloud is essentially leasing time on storage and compute power in a cloud provider’s data centers. Each cloud provider hosts services in regions throughout the world. For example, Amazon, Google, and Microsoft each host their cloud services somewhere in the state of Virginia, in a region near Tokyo, and in dozens of other regions around the globe.
How do you choose a region to host your services in? The first suggestion is to consider data residency. Do you have compliance guidelines, laws, or underwriters that suggest or dictate that you store your data within a certain country, state, or province? That might be your sole consideration. If data residency isn’t strictly mandated, or if more than one AWS region meets your criteria, you could instead place your data close to your customers or to your own facilities. Those locations might not be in the same geography, depending on the nature of your business and the markets you serve.
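The selection order described above (residency first, proximity second) can be sketched as a filter followed by a minimum. The region codes below are real AWS region names, but the latency figures are invented for illustration, and the `Region` class and `choose_region` function are hypothetical helpers rather than part of any AWS SDK.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Region:
    name: str
    country: str        # country code where the region is located
    latency_ms: float   # hypothetical latency measured from your customers


def choose_region(
    regions: Iterable[Region],
    allowed_countries: Optional[Set[str]] = None,
) -> Region:
    """Apply data residency rules first, then prefer the lowest latency."""
    candidates = [
        region
        for region in regions
        if allowed_countries is None or region.country in allowed_countries
    ]
    if not candidates:
        raise ValueError("no region satisfies the data residency requirement")
    return min(candidates, key=lambda region: region.latency_ms)


if __name__ == "__main__":
    regions = [
        Region("ca-central-1", "CA", 35.0),  # latency values are made up
        Region("us-east-1", "US", 20.0),
    ]
    print(choose_region(regions, allowed_countries={"CA"}).name)
    print(choose_region(regions).name)
```

With a Canadian residency requirement the sketch selects the Canadian region even though another region is faster; without the requirement, latency alone decides.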
Let’s use an example of a fictitious business called Terra Firma based in Winnipeg, Manitoba, which is in the middle of Canada. Let’s assume that Terra Firma has deployed its customer portal website in the AWS cloud somewhere in the central Canada region, near its offices and the majority of its users. If customers in the central Canada region have sufficiently fast, low-latency Internet connectivity to this AWS cloud region, their user experience with the hosted website portal could be adequate. Let’s look next at whether caching could benefit the Terra Firma website portal.
Caching Data with CDNs
CDNs serve up temporary copies of data to the client in order to improve effective network performance. Architecturally, the CDN servers are distributed across many service areas. In the case of modern cloud CDNs, the service area is global; AWS hosts a global CDN called Amazon CloudFront. How can a CDN cache benefit the web portal users?
Without a CDN cache, the web browsers of Terra Firma’s central Canadian customers would send requests to the website address hosted at the cloud region hundreds of miles away. The speed of the website hosting, backend storage, and databases for the site is a factor in the end-user experience. But the Internet latency from each user’s Internet connection to the AWS cloud region chosen by Terra Firma can also significantly affect performance and the overall experience.
Let’s assume that Amazon CloudFront, the CDN of Terra Firma’s cloud provider, has a point of presence (POP) located in downtown Winnipeg. If Terra Firma’s cloud architects and operations staff configure their website to use the CDN, their customers could benefit from a faster user experience. Customer Julian’s web browser queries for Terra Firma’s web address and receives a response pointing to the CDN POP location in Winnipeg. Julian’s browser next sends a request to load the website to that local POP in Winnipeg, which is a few miles away through the local Internet gateway, instead of to the web server hundreds of miles and several network hops away.
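The behavior of a POP like the one serving Julian can be approximated with a small time-to-live cache. This is a simplified sketch, not how CloudFront is implemented: the `EdgeCache` class, the `fetch_from_origin` callback, and the 60-second default TTL are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """A toy model of a CDN point of presence with a time-to-live cache."""

    def __init__(
        self,
        fetch_from_origin: Callable[[str], bytes],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[bytes, float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> bytes:
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and entry[1] > now:
            # Cache hit: answered locally, no trip to the distant region.
            self.hits += 1
            return entry[0]
        # Cache miss or expired copy: fetch from the origin and keep a copy.
        self.misses += 1
        body = self._fetch(path)
        self._store[path] = (body, now + self._ttl)
        return body


if __name__ == "__main__":
    def slow_origin(path: str) -> bytes:
        # Stand-in for the web server in the distant cloud region.
        return f"content of {path}".encode()

    pop = EdgeCache(slow_origin)
    pop.get("/index.html")  # first request travels to the origin
    pop.get("/index.html")  # second request is served from the POP
    print(pop.hits, pop.misses)
```

Only the first request for each object pays the long round trip; later requests from nearby users are answered from the local copy until it expires.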