Automating and Utilizing EKS Clusters to Provide Stable, Autoscaling Container Services in AWS
Stephen Hengeli, Cloud Engineer, STG
Introduction
Scale can be difficult to estimate when launching a new application, especially when the app is planned for more than one client. Consider the following scenario: a client comes to you with a set of API services they want to host as part of a larger backend API they have developed. These services come packaged as separate containerized images that must all run concurrently to provide the functionality the client's solution advertises. Through the combined use of EKS, ECR, CodeBuild/CodeCommit, and a strong DevOps policy, the issue of scale became a thing of the past.
Client Issue with Evaluating Scale Needs
This environment’s application is a backend API the client developed to connect themselves and others to a centralized banking core. The services work together as one, but each runs individually, so the hosting situation, including scalability, had to be evaluated. Running each of these services on its own server would work, but when demand for the app goes up, the server could slow down. If the app saw triple its normal traffic, say on a holiday or during COVID-19 lockdowns, the server would not survive. Hardware costs money, so without proper estimates, the client could have ended up spending more than they needed to. We also had to consider Amdahl’s Law, which shows that adding compute power to an existing server yields diminishing returns, so at some point scaling out with virtualization becomes more cost-effective than scaling up a single machine. To add some perspective: for a workload that is 70% parallelizable, 4 CPUs give a speedup of about 2.105, and doubling to 8 CPUs only brings the speedup to about 2.581 (higher being a better figure here). This is a prime example of when virtualization becomes more effective. Considering this, the solution we opted for was AWS Elastic Kubernetes Service (EKS). This service can automatically autoscale based on CPU or memory needs, so there was no longer a need for usage estimates: the number of pods now increases and decreases on its own depending on demand.
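The Amdahl's Law figures above are easy to verify with a few lines of Python. This is a quick sketch, assuming a workload whose parallelizable fraction is 70% (the fraction that reproduces the numbers quoted above):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's Law: speedup with n CPUs for a workload whose
    parallelizable fraction is p (0 <= p <= 1)."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.70  # assumed parallel fraction

print(round(amdahl_speedup(p, 4), 3))  # 4 CPUs -> 2.105
print(round(amdahl_speedup(p, 8), 3))  # 8 CPUs -> 2.581
```

Doubling the CPU count from 4 to 8 improves the speedup by barely 23%, which is why adding hardware to one server loses to scaling out across smaller virtualized instances.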
DevOps Comes into Play
In order to properly manage a project of this scale, we had to employ a slew of
DevOps practices that work hand in hand to keep the environment consistent,
free of issues, well maintained, and scalable.
First up is code control. The client develops the application services themselves,
so we have no input on the code itself. To help things along, however, we
suggested the client use private AWS CodeCommit repos, allowing them to keep
their code tightly controlled and versioned. Code additions and changes are
branched according to their ticket numbers, so everything can be correlated to
issue-tracking or feature tickets. When new code is ready to be deployed, we use
CodeBuild to ensure the code is built correctly and to monitor its status.
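The ticket-based branching described above looks roughly like the following (the repo path and ticket ID here are made up for illustration):

```shell
# Scratch repo to demonstrate one-branch-per-ticket naming
# (path and ticket number are hypothetical).
tmp=$(mktemp -d)
git -C "$tmp" init -q

# Each change lives on a branch named for its issue/feature ticket,
# so commits can always be correlated back to the tracker.
git -C "$tmp" checkout -q -b feature/PROJ-123-new-endpoint

git -C "$tmp" symbolic-ref --short HEAD   # prints the active branch name
```

A convention like this makes pull requests and build artifacts traceable to a single ticket without any extra tooling.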
Second is the service builds. When a component of the API is built, it is packaged
as an image and stored in AWS Elastic Container Registry (ECR). All deployments
to EKS come from ECR, which ensures the containers come from a private repo
accessible to the client's network configuration. ECR also integrates quite well
with CodeBuild and EKS, so it made perfect sense to take the path of least
resistance.
The third point, and one of the most crucial, is configuration management. For this
client, we chose to use CloudFormation to manage the templates needed to create
the environment resources. There is a stack specifically for creation of the VPCs, the
RDS instances, the EKS stacks, the EKS nodes, and other resources needed by the
environment. These templates are all controlled by a master template, which
invokes each nested stack in the order it needs to run. Of course, some services
cannot be created with the main stack, for example the API Gateway, which
requires the ARN of the Network Load Balancer (NLB) created in the original
stack. Actions such as this, as well as creating the Route53 subdomain, are
performed after the initial CloudFormation run.
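The master/nested-stack arrangement can be sketched as a CloudFormation fragment. The template URLs, parameter names, and outputs below are hypothetical; the real master template lists every nested stack the environment needs:

```yaml
# Hypothetical excerpt of a master template that runs nested stacks in order.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  VpcStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/example-templates/vpc.yaml
  EksStack:
    Type: AWS::CloudFormation::Stack
    DependsOn: VpcStack            # ordering: VPCs first, then EKS
    Properties:
      TemplateURL: https://s3.amazonaws.com/example-templates/eks.yaml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId
```

`DependsOn` plus cross-stack outputs is what lets one master template drive the whole environment in the right order.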
Fourth is the deployment process. The API is composed of more than one container,
so running them all together would be a hassle without a solid deployment
process behind it. We met that challenge using AWS CodeBuild pipelines,
automating the code-to-container process for every environment in this project
with a buildspec file, a set of instructions for executing command-line actions.
CodeBuild uses this file to Dockerize the code from the CodeCommit repo, publish
the image to ECR, and then deploy it to the EKS cluster. Code is always built in
Dev first, then promoted to the higher environments in succession using pipelines
identical to the one described above, ensuring that the builds are consistent.
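As a rough illustration, a buildspec for one of these pipelines might look like the following. The account ID, region, repository, cluster, and deployment names are placeholders, not the client's actual values:

```yaml
# Hypothetical buildspec.yml: Dockerize, push to ECR, roll out to EKS.
version: 0.2
phases:
  pre_build:
    commands:
      # Authenticate Docker to the private ECR registry.
      - aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111111111111.dkr.ecr.us-east-1.amazonaws.com
  build:
    commands:
      - docker build -t api-service:latest .
      - docker tag api-service:latest 111111111111.dkr.ecr.us-east-1.amazonaws.com/api-service:latest
  post_build:
    commands:
      - docker push 111111111111.dkr.ecr.us-east-1.amazonaws.com/api-service:latest
      # Point the EKS deployment at the freshly pushed image.
      - aws eks update-kubeconfig --name api-cluster --region us-east-1
      - kubectl set image deployment/api-service api-service=111111111111.dkr.ecr.us-east-1.amazonaws.com/api-service:latest
```

Because the same buildspec runs in every environment's pipeline, a promotion from Dev upward reuses identical steps, which is what keeps the builds consistent.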
The final practice is monitoring. AWS CloudWatch allows us to monitor every part
of the environment so we can react quickly should something happen. We
configured multiple alarms for each EKS pod, including memory and CPU usage.
Alarms also exist for API health checks (CloudWatch Synthetics canaries), EC2
disk usage on the bastions, API Gateway traffic, and RDS storage and CPU usage.
An SNS topic sends email alerts to our ticketing system, so if any alarm fires,
a ticket is created and assigned to an engineer as quickly as the notification
can be received. Dashboards track API traffic for 4XX and 5XX errors as well as
VPN tunnel statuses. These alerts exist for all environments, including Dev, so
if there is an inconsistency anywhere, we know about it and can act on it in a
timely manner.
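The alarm-to-ticket flow can be sketched in CloudFormation. The pod CPU metric here comes from CloudWatch Container Insights; the cluster name, threshold, and intake email address are illustrative placeholders:

```yaml
# Hypothetical alarm: notify the ticketing system when pod CPU runs hot.
Resources:
  AlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Protocol: email
          Endpoint: tickets@example.com   # ticketing-system intake address
  PodCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: EKS pod CPU utilization too high
      Namespace: ContainerInsights
      MetricName: pod_cpu_utilization
      Dimensions:
        - Name: ClusterName
          Value: api-cluster
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertTopic
```

Routing every alarm through one SNS topic means new alerts reach the ticketing system without any per-alarm wiring.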
Conclusion
To sum things up, we were able to create an autoscaling EKS cluster that runs containers from ECR to support the client's API. Using strong DevOps policies, we automated the infrastructure and deployments with CloudFormation and CodeBuild, while CloudWatch and SNS let us keep track of the state of the environment and its resources to ensure everything remains at peak functionality. This in turn allows the client to provide a stable API offering to their own clients. Without the flexibility and availability of AWS and a strong DevOps policy, this project would have been difficult to build and manage, and monitoring would have been a hassle, since a third-party solution would have been needed. In the end, AWS and DevOps allowed us to meet every customer challenge in this environment build-out.