Disaster Recovery Patterns in AWS

Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for ...
AWS Docs

AWS DR strategies fall into four patterns, listed here from least to most costly and complex.

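1-Backup and restore. Admins typically choose this option when the main DR requirement is protecting against data loss or corruption rather than recovering quickly. Workloads are restored hours or days after the DR event occurs, because data must first be retrieved from backups, which may sit in cold storage.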

2-Pilot light. This option replicates data from one AWS Region to another. It also provisions a copy of the underlying application infrastructure, but resources, such as servers, are only turned on for testing or failover. There is some downtime, but workloads are back online quickly, in minutes to hours depending on how much data was replicated.

3-Warm standby. This option maintains a scaled-down version of your production environment in another AWS Region. The downtime is minimal, typically minutes, because the workloads remain functional in the other Region.

4-Multi-site active/active. With this option, users run workloads in multiple AWS Regions simultaneously, ensuring little or no service interruption. While it is the most complex and expensive option, it can bring recovery times to near zero.

When choosing an AWS DR strategy, assess how much data you can afford to lose, how quickly you need to recover and how much that recovery effort will cost.
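
The trade-off above can be sketched as a small lookup that returns the cheapest pattern meeting a workload's recovery targets. This is a minimal illustration, not AWS guidance: the RTO (recovery time) and RPO (tolerable data loss) figures attached to each strategy are rough assumptions loosely based on the descriptions above, and the names `Strategy` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed typical recovery time, not an AWS guarantee
    rpo_minutes: float  # assumed typical data-loss window, not an AWS guarantee


# Ordered from least to most costly and complex, as in the list above.
# All figures are illustrative assumptions.
STRATEGIES = [
    Strategy("backup and restore", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Strategy("pilot light", rto_minutes=120, rpo_minutes=15),
    Strategy("warm standby", rto_minutes=15, rpo_minutes=5),
    Strategy("multi-site active/active", rto_minutes=1, rpo_minutes=1),
]


def cheapest_strategy(required_rto_minutes: float, required_rpo_minutes: float) -> Strategy:
    """Return the least costly strategy whose assumed RTO and RPO fit the targets."""
    for strategy in STRATEGIES:
        if (strategy.rto_minutes <= required_rto_minutes
                and strategy.rpo_minutes <= required_rpo_minutes):
            return strategy
    # Nothing meets the targets on paper; fall back to the fastest option.
    return STRATEGIES[-1]


if __name__ == "__main__":
    # A workload that tolerates 30 minutes of downtime and 10 minutes of data loss.
    print(cheapest_strategy(30, 10).name)
```

Because the list is ordered by cost, the first match is also the cheapest, which mirrors the advice to pay only for the protection a workload needs.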

How Regions and Availability Zones influence DR

Regions and Availability Zones (AZs) are a key part of DR initiatives in AWS. A Region is a geographic area in which AWS data centres reside. An AZ is a logical group of one or more data centres within a particular Region.

Many people think of DR as hardware redundancy and the need to spread workloads across multiple AZs. That is mostly right. An AZ typically consists of multiple data centres that already have built-in redundancy, such as for power supply and networking. However, that doesn’t mean data centres in a certain AZ are mirrored clones of each other. Services and data can move between them, but more likely as a form of failover than as fully synchronized copies. This results in partial, not full, redundancy.

Each AWS Region has multiple AZs; AWS currently documents a minimum of three. If you spread your application across two AZs, the low latency between them keeps replication and failover fast, so downtime is minimal. Each AZ can have multiple redundant data centres for failover, ensuring the zone is protected. A multi-AZ strategy is often used to protect against localized disasters, such as an earthquake or flooding.

If you are concerned about an event that could affect all AZs in a Region, such as a massive power outage on the Eastern Seaboard, you can go a step further and spread the workload across two AWS Regions. However, a multi-Region strategy will create more complexity and costs.

Use automation to reduce costs

An aggressive DR strategy can narrow most redundancy gaps, but at a price. A duplicate of your environment in another location can double your AWS bill. That’s a big expense for something that sits idle, waiting to be used.

This is why infrastructure as code (IaC) is ideal in DR. If the recovery time objective (RTO) can tolerate a short outage, why not build the infrastructure around your replicated data only when you need it? Automation can enable infrastructure on demand: when you need it, rather than in case you need it. This is a much cheaper approach to DR in AWS.
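
As a rough sketch of infrastructure on demand, the function below creates the recovery environment from an AWS CloudFormation template only when a failover is declared. It assumes the template already exists at a reachable URL and that `cfn` behaves like `boto3.client("cloudformation", region_name=...)` pointed at the recovery Region; the function name, stack name and parameters are hypothetical.

```python
def build_recovery_stack(cfn, stack_name, template_url, parameters=None):
    """Create the recovery environment from a template and wait until it is ready.

    `cfn` is assumed to be a CloudFormation client for the recovery Region,
    for example boto3.client("cloudformation", region_name="us-west-2").
    """
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        # CloudFormation expects parameters as a list of key/value dicts.
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in (parameters or {}).items()
        ],
        # Needed only if the template creates named IAM resources (assumption).
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    # Block until the stack is fully created; the waiter raises on failure.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]
```

Passing the client in, rather than creating it inside the function, keeps the Region choice explicit and lets the recovery procedure be rehearsed against a stub before a real event.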

The importance of data replication

Automation and AZs are important elements of AWS disaster planning. That said, these tactics work only if your data is ready, meaning you’ve done the necessary data replication.

For a reasonable level of DR response, data needs to be accessible in a timely manner. You can’t pull it from Amazon S3 Glacier, or other cold storage services where retrieval can take hours, and expect DR automation strategies to work.
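
One way to keep object data readily available in a second Region is Amazon S3 replication. The sketch below builds a replication rule and applies it with the S3 `put_bucket_replication` call. It assumes both buckets already exist with versioning enabled and that the IAM role allows S3 to replicate on your behalf; the bucket names, role ARN, rule ID and function names are hypothetical.

```python
def replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Build a replication configuration that copies every new object."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "dr-replicate-all",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(s3, source_bucket, role_arn, destination_bucket):
    """Apply the rule to the source bucket.

    `s3` is assumed to behave like boto3.client("s3").
    """
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, destination_bucket),
    )
```

Replication applies to objects written after the rule is in place, so existing data would need a separate copy step before this alone can be relied on for recovery.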

You can also use smaller standby environments that always run in a limited active/active scenario. AWS Auto Scaling brings these standby environments to a full production environment without human intervention and with limited downtime. There might be a lag in services during recovery, but the cost savings can be substantial enough to warrant a short performance hit.
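
The scale-up step can be sketched as a single call that raises the standby Auto Scaling group to production size during failover. This is an assumption-laden illustration: `autoscaling` is expected to behave like `boto3.client("autoscaling", region_name=...)` in the standby Region, and the function name, group name and capacities are hypothetical. In practice, scaling policies may also grow the group on their own as traffic shifts.

```python
def scale_standby_to_production(autoscaling, group_name, min_size, desired, max_size):
    """Resize a scaled-down standby Auto Scaling group to production capacity."""
    if not (0 <= min_size <= desired <= max_size):
        raise ValueError("expected 0 <= min_size <= desired <= max_size")
    # One API call updates all three limits for the group.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=min_size,
        DesiredCapacity=desired,
        MaxSize=max_size,
    )
```

A failover runbook or automated health check could call this with the production values, for example `scale_standby_to_production(client, "web-standby", 4, 8, 16)`, while the standby normally runs with a much smaller desired capacity.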

Automation via IaC and AWS Auto Scaling will, however, require staff time and effort to set up and test.

If a workload demands it, pursue a full multi-AZ strategy. A DR plan doesn’t have to be built around a single approach. Applying one DR strategy to all workloads would likely be cost prohibitive and restrictive. Some workloads need a higher level of protection against downtime, and others do not.

As an organization, make DR choices that reflect your priorities and cost preferences, and adjust them over time.
