Abstract:
Deploying to AWS can be complicated. Roughly speaking, cloud-native deployments are guided by three principles: loosely coupled systems, automated infrastructure, and automated testing that includes deliberately bringing the system down. Keep in mind that these are visions and end objectives. You cannot go from an on-premise mess to a cloud-native automated landscape in one leap, nor in a short period of time; it will likely take years to transform a dysfunctional on-premise estate into something resembling a cloud-native architecture.
Principle 1: Loosely Coupled Systems
A very basic design principle, but one that is easier said than done. Ideally, we want to decouple components so that each has a specific boundary and interface and little or no knowledge of other components. The more loosely coupled the system is, the better it will scale. This principle matters most for larger, high-volume systems that process many events and need to scale.
By applying this principle, we isolate components and eliminate internal dependencies, so that a single component failure will not bring down the entire system and will go unnoticed by the other components. This creates a series of agnostic black boxes that do not care whether they serve data from EC2 instance A or B, making the system more resilient when A, B, or another related component fails.
Best Practices in loosely coupled deployments:
- Deploy Standard Templates. The benefit of reusable templates and scripts is fine-grained control over instances at deployment time: if, for example, we need to deploy a security update to our instance configuration, we touch the code once in the Puppet/Chef/Ansible manifest rather than manually patching every instance deployed from a golden template. By eliminating new instances’ dependency on a golden template, you reduce the failure risk of the system and let instances spin up more quickly (see the launch sketch after this list).
- Queuing and Workflow. Examples: Simple Queue Service (SQS) or Simple Workflow Service (SWF). When you use a queue or buffer to connect components, the system can absorb spill-over during load spikes by distributing requests across components. Put SQS between layers so that the number of instances in each layer can scale independently based on the length of the queue. Even if every instance is lost, queued requests persist, and a new instance picks them up when your application recovers (see the SQS sketch after this list).
- Make your applications as stateless as possible. Application developers use a variety of methods to store session data for users, which makes scaling the application problematic, particularly if session state is stored in the database. If you must store state, keeping it on the client reduces database load and eliminates server-side dependencies (see the signed-token sketch after this list).
- Minimize manual interaction with the environment by driving builds and deployments through CI tools such as Jenkins.
- Elastic Load Balancers. Distribute instances across multiple Availability Zones (AZs) in Auto Scaling groups. Elastic Load Balancers (ELBs) distribute traffic among healthy instances based on frequent health checks, whose criteria you control.
- Store static assets on S3. Within AWS, store static assets on S3 instead of serving them from the EC2 nodes themselves. Putting CloudFront in front of S3 lets you deliver static assets without that traffic passing through your application. This not only decreases the likelihood that your EC2 nodes will fail, but also reduces cost by letting you run leaner EC2 instance types that do not have to handle content-delivery load (see the upload sketch after this list).
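A minimal sketch of the template-driven launch from the first bullet, assuming boto3; the AMI ID, bootstrap URL, and role name are placeholders. The idea is to launch a minimal base AMI and hand configuration over to the config-management master via user data, rather than baking everything into a golden image:

```python
# Sketch: launch from a minimal base AMI and bootstrap configuration at boot,
# instead of depending on a fully baked "golden" AMI.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

user_data = """#!/bin/bash
# Hypothetical bootstrap: register with the Chef/Puppet/Ansible master,
# which applies the current manifest for this instance's role.
curl -fsSL https://config.example.com/bootstrap.sh | bash -s -- --role web
"""

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder minimal base AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    UserData=user_data,               # boto3 base64-encodes this for you
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Role", "Value": "web"}],
    }],
)
```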
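For the queuing pattern, a sketch of an SQS producer and worker, again with boto3; the queue name and work function are hypothetical:

```python
# Sketch: decouple two tiers with SQS. The front tier enqueues work; a worker
# tier long-polls the queue. If a worker dies mid-job, the message becomes
# visible again and another worker picks it up.
import boto3

sqs = boto3.resource("sqs", region_name="us-east-1")
queue = sqs.get_queue_by_name(QueueName="image-resize-jobs")  # placeholder queue

def process(body: str) -> None:
    """Hypothetical work function for one queued job."""
    print("processing", body)

# Producer (front tier): hand the request off and return immediately.
queue.send_message(MessageBody='{"image_id": "12345"}')

# Consumer (worker tier): long-poll, process, then delete.
for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
    process(message.body)
    message.delete()  # acknowledge only after successful processing
```

An Auto Scaling policy keyed to the CloudWatch metric for queue depth (ApproximateNumberOfMessagesVisible) then grows and shrinks the worker tier with the backlog.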
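For statelessness, one way to keep session state on the client is an HMAC-signed token that any instance can verify. This is a sketch using only the Python standard library; the secret is a placeholder and would come from a secrets manager in practice:

```python
# Sketch: client-side session state as a signed token, so no server-side
# session store is needed and any instance can validate any request.
import base64
import hashlib
import hmac
import json

SECRET = b"replace-with-managed-secret"  # placeholder

def issue_token(session: dict) -> str:
    payload = base64.urlsafe_b64encode(json.dumps(session).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"

def verify_token(token: str) -> dict:
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered session token")
    return json.loads(base64.urlsafe_b64decode(payload))
```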
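And for static assets, a sketch of publishing a file to S3 with a long cache lifetime so that CloudFront (configured separately, with the bucket as its origin) serves it instead of your EC2 nodes; the bucket name and paths are placeholders:

```python
# Sketch: upload a static asset with cache headers suited to CDN delivery.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "dist/app.css",
    "my-static-assets-bucket",  # placeholder bucket behind CloudFront
    "assets/app.css",
    ExtraArgs={
        "ContentType": "text/css",
        "CacheControl": "public, max-age=31536000",  # cache for a year
    },
)
```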
Principle 2: Automate the Infrastructure
Human intervention is itself a single point of failure. To eliminate it, we create a self-healing, auto-scaling infrastructure that dynamically creates and destroys instances and gives them the appropriate roles and resources with custom scripts. This often requires a significant upfront engineering investment.
However, automating your environment before you build significantly cuts development and maintenance costs later on. An environment fully optimized for automation can mean the difference between hours and weeks when deploying instances in new regions or creating development environments.
Best Practices in automating AWS infrastructure:
- The infrastructure in action. If any instance fails, it is removed from the Auto Scaling group and another instance is spun up to replace it (see the Auto Scaling sketch after this list).
- A CloudWatch alarm triggers the replacement: the new instance is spun up from an AMI stored in S3, whose image is copied to the root volume of the instance being brought up.
- A CloudFormation template lets us automatically set up a VPC, a NAT Gateway, and basic security, and creates the tiers of the application and the relationships between them. The goal of the template is to configure the tiers minimally and then connect them to the Chef or Puppet master (or another configuration-management tool). The template can be held in a repository, from where it can be checked out as needed, by version or branch, making it reproducible and easily deployable as new instances when needed – i.e., when existing applications fail or their performance degrades (see the create_stack sketch after this list).
- This minimal configuration lets the tiers be configured by configuration management. Writing the Puppet/Chef/Ansible manifests and making sure the master knows what each instance it spins up is supposed to do is one of the more time-consuming, custom pieces of the architecture.
- Simple failover with RDS. RDS offers a simple option for multi-Availability-Zone failover during disaster recovery. It can also attach the SQL Server instance to Elastic Block Store volumes with provisioned IOPS for higher performance (see the RDS sketch after this list).
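A sketch of the self-healing behavior described above, assuming boto3 and a pre-existing launch template; the group name, template name, and subnet IDs are placeholders:

```python
# Sketch: an Auto Scaling group spanning two AZs that replaces instances
# failing load-balancer health checks.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier",
    LaunchTemplate={"LaunchTemplateName": "web-tier-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # placeholder subnets in two AZs
    HealthCheckType="ELB",        # replace instances that fail ELB health checks
    HealthCheckGracePeriod=300,   # allow time to bootstrap before checking
)
```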
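A sketch of driving such a CloudFormation template from code; the template URL and the parameter handing off to the configuration-management master are hypothetical, and the real template would define the VPC, NAT Gateway, security groups, and tiers:

```python
# Sketch: deploy a versioned CloudFormation template checked out of a repository.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(
    StackName="app-environment",
    TemplateURL="https://s3.amazonaws.com/my-templates/app-vpc-v1.2.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "ChefServerEndpoint",            # hypothetical parameter
         "ParameterValue": "https://chef.example.com"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the template may create IAM roles
)
```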
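And a sketch of the RDS failover option: a Multi-AZ SQL Server instance on provisioned-IOPS storage. Identifiers and sizes are placeholders, and the credentials would come from a secrets manager in practice:

```python
# Sketch: Multi-AZ RDS with provisioned IOPS; RDS handles failover to a
# synchronous standby in another Availability Zone.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.create_db_instance(
    DBInstanceIdentifier="app-db",              # placeholder
    Engine="sqlserver-se",
    LicenseModel="license-included",
    DBInstanceClass="db.m5.large",
    MasterUsername="admin",
    MasterUserPassword="use-a-managed-secret",  # placeholder; use Secrets Manager
    AllocatedStorage=200,
    StorageType="io1",
    Iops=2000,                                  # provisioned IOPS for steady latency
    MultiAZ=True,                               # standby replica in another AZ
)
```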
Principle 3: Try to Break and Destroy the Platform
Build mechanisms to ensure that the system persists no matter what happens. To create a resilient application, cloud engineers must anticipate what could fail or be destroyed and eliminate those weaknesses.
Netflix is the true innovator in resiliency testing and has created an entire squadron of Chaos Engineers “entirely focused on controlled failure injection.” Implementing best practices and then constantly monitoring and updating the system is only the first step toward a fail-proof environment.
Best Practices:
- Performance testing. In software engineering, as in IaaS, performance testing is often the last and most frequently ignored phase of testing. Subjecting your database or web tier to stress and performance tests from the very beginning of the design phase – and not just from a single location inside a firewall – will show you how your system performs in the real world (see the latency-probe sketch after this list).
- Iterative code testing. Automated unit tests, test cases, pair programming, and constant code scans will reduce the bugs and software discrepancies that impair application performance.
- Unleash the Simian Army. Netflix’s open-source suite of chaos tools is on GitHub. Failures you induce deliberately today prevent failures you cannot control tomorrow (see the sketch after this list).
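A crude latency probe in the spirit of the performance-testing bullet; a real exercise would use a dedicated load-testing tool (JMeter, Locust, and the like) and run from several locations outside the firewall. The endpoint and request counts here are placeholders:

```python
# Sketch: fire concurrent requests at an endpoint and report latency percentiles.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles

URL = "https://app.example.com/health"  # placeholder endpoint

def timed_get(_) -> float:
    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=10) as resp:
        resp.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = list(pool.map(timed_get, range(500)))

cuts = quantiles(latencies, n=100)  # 99 percentile cut points
print(f"p50={cuts[49]*1000:.0f} ms  p95={cuts[94]*1000:.0f} ms")
```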
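And a chaos-monkey-style sketch, in the spirit of the Simian Army rather than Netflix’s actual tooling: terminate a random instance in an Auto Scaling group and confirm the group heals itself. The group name is a placeholder; run this only against an environment built to tolerate it:

```python
# Sketch: controlled failure injection -- kill one instance at random and
# rely on the Auto Scaling group to replace it.
import random
import boto3

GROUP = "web-tier"  # placeholder Auto Scaling group

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[GROUP])["AutoScalingGroups"][0]
victim = random.choice(group["Instances"])["InstanceId"]

print(f"Terminating {victim}; the group should spin up a replacement.")
ec2.terminate_instances(InstanceIds=[victim])
```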
Unfortunately, deploying resilient infrastructure is not just a set of to-dos. It requires constant focus, throughout an AWS deployment, on optimizing for automatic failover and on the precise configuration of various native and third-party tools.