The basics:
- Set up a VPC in AWS with separate networking for the control plane and the data plane. Associated resources such as EC2 instances can be built automatically with Terraform or CloudFormation templates (CFT).
- You will have a control plane and a data plane. The control plane configures and manages the data plane, that is, the ‘worker nodes’ of the K8s cluster.
- A cluster is a collection of one or more nodes which run the application containers
- The control plane in K8s manages these worker nodes and typically runs on separate machines from the data plane (the worker nodes)
- The control plane exposes the K8s API server, the administrative endpoint through which the worker nodes (data plane) are controlled
- Within the control plane, the etcd datastore holds the cluster configuration and state as key-value pairs. The API server reads and writes etcd, and the kubelet on each worker node receives its configuration from the API server rather than from etcd directly
- In production you will achieve high availability by running multiple nodes
- Pods run on the worker nodes. A node might run one to a few hundred pods, and each pod contains one or more containers, so a large cluster can run thousands of containers.
- The smaller and less cluttered the cluster, the easier it is to manage.
- In K8s a ‘namespace’ is a mechanism for isolating groups of resources within a cluster (for example, by line of business or project).
Application View:
For the business, what matters is the application and access to it through the Kubernetes cluster.
An application may, for example, consist of a customer portal plus the database that the portal calls to serve client requests (for example, a bank payment, a loan submission, or an account balance view).
What does this application look like in Kubernetes?
In K8s, the application runs in a specific namespace, with labels denoting the project and environment (e.g. dev/prod).
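As a minimal sketch of that idea, the following Python builds a Namespace manifest carrying project and environment labels. The function name, the label keys `project` and `environment`, and the `<project>-<environment>` naming scheme are illustrative assumptions, not conventions required by Kubernetes.

```python
import json


def namespace_manifest(project: str, environment: str) -> dict:
    """Build a Namespace manifest labelled with project and environment.

    The label keys and the naming scheme are assumptions for illustration.
    """
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": f"{project}-{environment}",
            "labels": {"project": project, "environment": environment},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output could be piped
    # to `kubectl apply -f -`.
    print(json.dumps(namespace_manifest("customer-portal", "dev"), indent=2))
```

Labels like these are what later allow tooling to select everything belonging to one project or environment.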
An application running within a Kubernetes environment consists of native Kubernetes resources (e.g. service accounts, stateful sets, persistent volumes, secrets, etc.) and custom resources that are defined specifically for that application.
Stateful applications keep data that must survive pod restarts. The containers in a pod can share resources such as storage.
A persistent volume (PV) is a piece of storage in your cluster. Your pods might also talk to external services and storage such as Amazon RDS or EBS. External block storage such as EBS is made available inside the cluster via a PV. The pod requests the storage via a PersistentVolumeClaim (PVC). Your application might even have dozens of PVCs.
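To make the PVC request concrete, here is a hedged sketch that builds a PersistentVolumeClaim manifest in Python. The claim name, namespace, size, and the storage class name `gp3` are hypothetical values; the storage classes actually available depend on how your cluster is configured.

```python
import json


def pvc_manifest(name: str, namespace: str, size: str, storage_class: str) -> dict:
    """Build a PersistentVolumeClaim manifest requesting `size` of storage."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # ReadWriteOnce: the volume is mounted read-write by a single node.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # All argument values below are hypothetical.
    claim = pvc_manifest("portal-db-data", "customer-portal-dev", "20Gi", "gp3")
    print(json.dumps(claim, indent=2))
```

A pod then refers to the claim by name in its volume definition, and the cluster binds the claim to a matching PV.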
The components of an application include:
- Kubernetes-native resources (e.g. pods, secrets, configmaps, etc.)
- Custom defined resources (e.g. CRDs)
- Persistent volumes (e.g. provisioned through a CSI driver)
- External data stores (e.g. Amazon RDS, EBS)
Stakeholders
There are three main types of stakeholders in your organization who deal with your Kubernetes environment in the cloud:
The Cloud Administrator: Manages cloud workloads in your organization. Responsible for protecting cloud workloads, defining policies and roles, and managing backup, restore, and retention.
The Kubernetes Administrator: A cloud admin with Kubernetes expertise who manages the Kubernetes clusters in the organization. Responsible for setting up and monitoring those clusters. In some organizations, the cloud admin and the Kubernetes admin are the same person. In many cases they belong to the same central/cloud administration team.
The Application Administrator: Owners and creators of Kubernetes applications. They know what comprises the application. There are typically multiple app owners in an organization.
Since there is no Kubernetes application object, a custom application definition enables the application owner and the central backup team to have a shared understanding of what to protect. At the same time, you cannot expect the application owner to list every single resource that makes up an application. The data protection solution should do the heavy lifting of identifying the applications.
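One way a tool could approximate that identification is to group resources by a shared label. The sketch below assumes resources are available as plain dictionaries (for example, parsed from API listings) and that application owners set the Kubernetes recommended label `app.kubernetes.io/name`; the function name and the sample resources are hypothetical, and real solutions may use other signals as well.

```python
from collections import defaultdict

# Kubernetes recommended label for the application name; relying on it
# being set consistently is an assumption of this sketch.
APP_LABEL = "app.kubernetes.io/name"


def group_by_application(resources: list) -> dict:
    """Map application name -> list of (kind, namespace, name) tuples."""
    apps = defaultdict(list)
    for res in resources:
        meta = res.get("metadata", {})
        app = (meta.get("labels") or {}).get(APP_LABEL)
        if app is not None:
            apps[app].append((res["kind"], meta.get("namespace"), meta["name"]))
    return dict(apps)


if __name__ == "__main__":
    sample = [
        {"kind": "Deployment", "metadata": {"name": "portal", "namespace": "prod",
                                            "labels": {APP_LABEL: "customer-portal"}}},
        {"kind": "Secret", "metadata": {"name": "db-creds", "namespace": "prod",
                                        "labels": {APP_LABEL: "customer-portal"}}},
        {"kind": "ConfigMap", "metadata": {"name": "unrelated", "namespace": "prod"}},
    ]
    for app, members in group_by_application(sample).items():
        print(app, members)
```

Resources without the label are simply left out here, which is exactly the gap a custom application definition is meant to close.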
Issues:
- Unintended modifications
- Accidental deletion
- Malicious internal threats
- Ransomware attacks or other external threats
Though Kubernetes is known for resiliency, it can only bring back the container infrastructure, not the data, so the application state is lost. Moreover, there is no Kubernetes application object, so Kubernetes does not know what your application truly is. Hence, you need an application-centric Kubernetes protection solution. This solution should provide cross-namespace, cross-cluster, and cross-region recovery options. It should also enable your users to do the following.
Application disaster recovery
In the event of a catastrophic application failure, the original application might no longer be running. Users expect to recover their applications, sometimes even in a different region, including both the application resources and the data.
Application rollback
In the event of an unintended change to an application, including configuration and/or data, users expect to revert their application to the point in time at which a backup was created. Users expect that the revert not only re-creates or modifies application resources, but also eliminates resources that did not exist at the point in time of the backup. They will also expect an option to not overwrite existing resources.
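The rollback behavior described above can be sketched as a simple comparison between the backed-up state and the live state. This is an illustrative model only: resources are represented as a dictionary from a `Kind/name` key to a spec, the function name and the `overwrite` flag are hypothetical, and the sketch assumes that disabling overwrite skips updates but still removes resources created after the backup.

```python
def plan_rollback(backup: dict, live: dict, overwrite: bool = True) -> dict:
    """Compute which resources to create, update, and delete for a rollback.

    backup, live: mapping of resource key (e.g. "ConfigMap/portal-config")
    to its spec. With overwrite=False, existing resources are left as-is.
    """
    # In the backup but missing from the cluster: re-create.
    create = sorted(k for k in backup if k not in live)
    # In both but changed since the backup: revert, unless overwrite is off.
    update = sorted(
        k for k in backup if k in live and backup[k] != live[k]
    ) if overwrite else []
    # In the cluster but not in the backup: did not exist at backup time.
    delete = sorted(k for k in live if k not in backup)
    return {"create": create, "update": update, "delete": delete}


if __name__ == "__main__":
    backup = {
        "ConfigMap/portal-config": {"version": 1},
        "Secret/db-creds": {"user": "portal"},
        "Deployment/portal": {"replicas": 2},
    }
    live = {
        "ConfigMap/portal-config": {"version": 2},  # modified after backup
        "Deployment/portal": {"replicas": 2},       # unchanged
        "Job/debug": {},                            # created after backup
    }
    print(plan_rollback(backup, live))
    print(plan_rollback(backup, live, overwrite=False))
```

Producing a plan before acting also gives the user a chance to review what a revert will remove.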
Application migration
Users will want to move applications for multiple reasons, including cost optimization, load balancing, and cluster upgrades. While migration is not strictly a protection use case, many organizations leverage their protection tools for migration.
Depending on how the migration process is managed, the source application may be running concurrently with the migrated application until a “cutover” process occurs to minimize risk and downtime.
Application cloning
Users will want to clone applications for multiple reasons, including training, development, and upgrade testing. While cloning is not strictly a protection use case, many users leverage their protection tools for cloning.
Since the clone will run concurrently with the production instance, the process should ensure that resources do not conflict (e.g. renaming of resources) and that data is copied. Admins may also want to retain provenance information for clones to either track the clone copies or to enable updates to the clones (pushing a new “golden copy”).
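A minimal sketch of conflict-free cloning with provenance might look like the following. It renames the resource with a suffix and records where the clone came from in annotations. The annotation keys under `example.com/`, the function name, and the backup identifier are hypothetical, and copying the underlying data is out of scope here.

```python
import copy


def clone_resource(resource: dict, suffix: str, source_backup_id: str) -> dict:
    """Return a renamed copy of `resource` annotated with its provenance."""
    clone = copy.deepcopy(resource)  # never mutate the production definition
    meta = clone["metadata"]
    meta["name"] = f"{meta['name']}-{suffix}"
    annotations = meta.setdefault("annotations", {})
    # Hypothetical annotation keys used to track the origin of the clone.
    annotations["example.com/cloned-from"] = resource["metadata"]["name"]
    annotations["example.com/source-backup"] = source_backup_id
    return clone


if __name__ == "__main__":
    prod = {"kind": "Deployment", "metadata": {"name": "portal", "namespace": "prod"}}
    print(clone_resource(prod, "training", "backup-0001"))
```

Keeping the source backup identifier on each clone is what would later allow a new "golden copy" to be pushed to the clones that derive from it.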
Application retrieval
Admins will need to retrieve past versions of applications for reasons including legal cases, project retrieval, or regulatory compliance. Traditionally, retrieval was focused on data, but now application retrieval is becoming more important. Application retrieval enables users to recreate the application flow and view the data in context.
Application resource recovery
Admins will need to recover a subset of an application for reasons ranging from legal cases to testing a subset of an application, to rolling back only one part of an application.
Some backup admins might choose to protect just the namespace, and the data protection solution should still support that. The challenge with namespace protection is that it can only use simplistic crash-consistent protection mechanisms, the application owner cannot easily specify a recovery, and there is no clear connection between the backup team and the application team.
Protection is not just for backups, but recovery too
Security in application protection applies to more than backing up data. The application protection solution should have a security posture in place at each layer, as explained in the following:
Installation
The images of the protection solution should be certified so Kubernetes admins can be sure that they are deploying only the intended image in their environment. This prevents malware disguised as the data protection solution from being deployed.
The permissions that the protection solution needs in the environment should be restrictive. For instance, the data protection operator should not have the permissions to delete resources in the cluster.
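As an illustration of checking for restrictive permissions, the sketch below scans an RBAC Role or ClusterRole definition (as a dictionary) for rules that grant deletion. The function name and the sample role are hypothetical; `delete`, `deletecollection`, and the wildcard `*` are real RBAC verbs.

```python
# Verbs that would allow the operator to delete resources.
FORBIDDEN_VERBS = {"delete", "deletecollection", "*"}


def forbidden_rules(role: dict) -> list:
    """Return the rules in `role` that grant any forbidden verb."""
    return [
        rule
        for rule in role.get("rules", [])
        if FORBIDDEN_VERBS & set(rule.get("verbs", []))
    ]


if __name__ == "__main__":
    # Hypothetical role for a data protection operator.
    operator_role = {
        "kind": "ClusterRole",
        "metadata": {"name": "protection-operator"},
        "rules": [
            {"apiGroups": [""], "resources": ["pods", "persistentvolumeclaims"],
             "verbs": ["get", "list", "watch"]},
            {"apiGroups": ["apps"], "resources": ["deployments"],
             "verbs": ["get", "delete"]},
        ],
    }
    for rule in forbidden_rules(operator_role):
        print("too permissive:", rule["resources"], rule["verbs"])
```

A check like this could run before installation, so that an overly broad role is caught before it is ever applied to the cluster.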
Backup
The data protection solution should provide options to encrypt the data. Immutability of backups is critical to prevent ransomware attacks. The data protection solution should restrict access to sensitive Kubernetes objects such as secrets. Moreover, the data protection solution should be able to work with secret management tools used currently in the Kubernetes environment.
Orchestration
The data protection solution should store metadata outside the cluster such that in case of any cluster disaster or breach, the backup metadata is not corrupted. This centralized metadata should be able to support cross-account, cross-region restores in case of disasters.
The communication between the data protection operator component in the cluster and the orchestration layer should be secure from end to end. Authorization is managed between the two protection components by the data protection solution.
Restore
The data protection solution should provide the ability to manage users and groups that have access to restore the backups. Recovery of applications should be scoped for the environment and the cloud admin should be able to define the scope in which the restore is valid. This ensures that no bad actors, internal threats, or leaks can re-create the application beyond the defined scope.
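The scoping idea can be sketched as a small authorization check. Everything here is a hypothetical model: the `RestoreScope` type, its fields, and the group, cluster, and namespace names are assumptions for illustration, not the interface of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreScope:
    """Scope defined by the cloud admin in which a restore is valid."""
    groups: frozenset      # groups allowed to restore
    clusters: frozenset    # target clusters allowed
    namespaces: frozenset  # target namespaces allowed


def restore_allowed(scope: RestoreScope, user_groups, cluster: str, namespace: str) -> bool:
    """Allow a restore only if the user, cluster, and namespace are all in scope."""
    return (
        bool(scope.groups & set(user_groups))
        and cluster in scope.clusters
        and namespace in scope.namespaces
    )


if __name__ == "__main__":
    scope = RestoreScope(
        groups=frozenset({"portal-admins"}),
        clusters=frozenset({"prod-eu", "dr-eu"}),
        namespaces=frozenset({"customer-portal-prod"}),
    )
    print(restore_allowed(scope, ["portal-admins"], "dr-eu", "customer-portal-prod"))
    print(restore_allowed(scope, ["portal-admins"], "dev-sandbox", "customer-portal-prod"))
```

Requiring all three conditions means that even a legitimate user cannot re-create the application in a cluster or namespace outside the scope the cloud admin defined.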
Conclusion
Kubernetes is the most commonly used container orchestration tool for running production applications. Though Kubernetes is resilient, it is a complex environment; it cannot bring back data and it does not have application objects, which makes Kubernetes application protection critical. Multiple users in your organization interact with Kubernetes environments, namely members of the central admin team and the application owners. Users will need the data protection solution to provide disaster recovery, rollback, migration, cloning, retrieval for compliance, and resource recovery for their applications. Moreover, the data protection solution should have a security posture for this modern workload across installation, backup, orchestration, and recovery.