Leveraging AWS DevOps Practices and Terraform to Support Cloud Infrastructure on a Large Scale
Stephen Hengeli, Cloud Engineer, STG
If you work in the technology industry, or are familiar with some of its practices, you have probably heard the term DevOps a time or two. For those who may be unaware, DevOps is a set of practices followed in an effort to streamline the development process of a project, provide Continuous Delivery of software releases, and ensure that all deliverables are of high quality. DevOps combines Software Development and IT Operations and subscribes to a more Agile methodology. This all sounds very meaningful in relation to the technology world, but what does DevOps look like in practice? This article looks at the DevOps practices we employ in each DevOps toolchain to manage the infrastructure for a large-scale project.
Consider the following scenario: A client comes to you and wants you to support an application that they will then sell to individual customers. This application was already planned such that it lives on individual EC2 instances, one for each customer onboarded. To paint the picture, when a customer signs on to use this application, a new EC2 instance is required. With this come network rules, routing rules, changes to existing infrastructure (Nginx configs, Load Balancer rules, etc.), additions to Secrets Manager, prerequisite package installation, and a slew of other tasks. This would be a nightmare to do manually; the labor time alone would be astronomical. However, there is a better, more scalable way to perform these tasks. Cue DevOps.
DevOps Tools Put to Work
The following sections will examine DevOps toolchains that support each procedure during each stage of the SDLC. The toolchains evaluated are Coding, Building, Testing, Packaging, Releasing, Configuring, and Monitoring.
When new code is written, it is committed to a private CodeCommit repository. Several guidelines are implemented to ensure the repos are clean and precise. Peer code reviews are enforced by following a standardized process. For example, when a ticket is received on Jira (let’s say a setting needs to be added to the prerequisites script), a new branch is created from master and named after the ticket. Any changes on that branch relate to that ticket and that ticket alone. This ensures that if there are any issues when the branch is merged with master, the change can be rolled back in small sections. Granularity here goes a long way toward protecting codebase integrity. For code to be merged, the developer must submit a pull request that is reviewed by at least two other developers. Additional sets of eyes are always a good measure when it comes to preventing issues. Once code has been reviewed, it can be merged into master for testing in a UAT environment.
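As a rough illustration of the merge rule described above (not our actual tooling), the check can be sketched in a few lines of Python. The Jira key pattern, the function names, and the sample usernames are all assumptions made for the example.

```python
import re

# Assumption: branches are named exactly like a Jira key, e.g. "OPS-123".
TICKET_BRANCH = re.compile(r"[A-Z][A-Z0-9]+-\d+")
REQUIRED_APPROVALS = 2  # "at least two other developers"


def branch_matches_ticket(branch: str) -> bool:
    """Return True when the branch name looks like a single Jira ticket key."""
    return TICKET_BRANCH.fullmatch(branch) is not None


def can_merge(branch: str, author: str, approvers: list[str]) -> bool:
    """Allow a merge only for ticket branches with enough distinct reviewers."""
    # The author's own approval does not count toward the total.
    distinct_reviewers = {name for name in approvers if name != author}
    return branch_matches_ticket(branch) and len(distinct_reviewers) >= REQUIRED_APPROVALS


if __name__ == "__main__":
    print(can_merge("OPS-123", "alice", ["bob", "carol"]))   # two other reviewers
    print(can_merge("OPS-123", "alice", ["alice", "bob"]))   # only one other reviewer
```

In practice a rule like this is usually enforced by the repository's own approval settings rather than by a script; the sketch only makes the policy explicit.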
Once the code is written, it needs to be built. To build anything in the environment, we use a Jenkins server that sits in one of the VPCs. AWS offers solutions to automate code builds; however, the client already had a custom AMI and preexisting IaC modules, so we felt it was best to build with what they were familiar with. Jenkins connects to servers as nodes and can deploy new code changes to an individual node or to the entire environment. Code requiring special measures, such as Terraform, is built using a Docker container, which is all handled by the Jenkins server. When a build is started, it is monitored via the Blue Ocean view on Jenkins. This gives the development team real-time access to build status and console output. If there are any issues, the output can be analyzed to find the cause. Because the build status is always monitored while execution is underway, this happens in a timely fashion. The use of Jenkins helps us ensure that code is built properly and without any issues. Only successful build statuses are accepted.
Once the code is built, it needs to be tested using real-life scenarios. To emulate that, we use a series of JMeter tests administered from an EC2 instance that we have provisioned to generate requests for our performance tests. This instance is located in a shared VPC along with our bastion hosts. The tests exercise the application, emulating multiple connections with various amounts of traffic. This effectively simulates a real-life usage scenario. Metrics are captured from these tests and evaluated by the QA team. Should any anomalies exist, they are investigated by the development team. If any changes are made, the process starts over again at Coding.
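To make the "capture metrics and look for anomalies" step more concrete, here is a minimal Python sketch that summarizes a JMeter CSV results file. It assumes the results include the `label`, `elapsed`, and `success` columns; the sample data and both thresholds are hypothetical, and this is not the QA team's actual evaluation.

```python
import csv
import io
import math

# Hypothetical sample in the shape of a JMeter CSV results file.
SAMPLE = """timeStamp,elapsed,label,success
1,120,login,true
2,140,login,true
3,950,report,true
4,300,report,false
"""


def summarize(jtl_csv: str) -> dict:
    """Group samples by label and compute count, p95 latency, and error rate."""
    samples = {}
    for row in csv.DictReader(io.StringIO(jtl_csv)):
        stats = samples.setdefault(row["label"], {"elapsed": [], "errors": 0})
        stats["elapsed"].append(int(row["elapsed"]))
        if row["success"].lower() != "true":
            stats["errors"] += 1

    summary = {}
    for label, stats in samples.items():
        ordered = sorted(stats["elapsed"])
        rank = math.ceil(0.95 * len(ordered))  # nearest-rank 95th percentile
        summary[label] = {
            "count": len(ordered),
            "p95_ms": ordered[rank - 1],
            "error_rate": stats["errors"] / len(ordered),
        }
    return summary


def anomalies(summary: dict, max_p95_ms: int = 800, max_error_rate: float = 0.01) -> list:
    """Return the labels that breach either (assumed) threshold."""
    return sorted(
        label
        for label, s in summary.items()
        if s["p95_ms"] > max_p95_ms or s["error_rate"] > max_error_rate
    )


if __name__ == "__main__":
    print(anomalies(summarize(SAMPLE)))
```

With the sample above, only the `report` label is flagged, since it is both slow and partly failing.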
Most of the code written is infrastructure code, which isn’t necessarily packaged, but before a run the code is copied over to a Docker container that runs Terraform. The client also has their application, which is packaged as a standalone set of executables and is deployed in a similar fashion after the infrastructure code finishes building. They store these executables and supporting scripts in an S3 bucket. An additional S3 bucket is used to store all the executables and installers required by our infrastructure code. Both buckets use object versioning to better support updates. Any Docker images are stored in ECR, where they are readily available to the environment when required.
For new releases to production, a change management process is used. Changes are built, deployed, and tested in dev before being deployed into UAT. This two-step process must happen in order for changes to be considered for the higher environments. These changes are kicked off manually from within Jenkins (think click and forget). When a change is submitted, a business justification, impact statement, testing evidence, executive summary, and time estimates are required along with it before it can enter the approval phase. Once these items are satisfied, the change owner approves the change, and it then goes off for approval by the change management board. This board reviews the change in a T-CAB meeting and, if approved, the change moves to a B-CAB meeting later in the same week. If approved there, the change is scheduled for an upcoming maintenance window. No changes are allowed to be deployed to production without this change management process. When the change is completed, validation and confirmation of the change are required to close the RFC. This keeps a record in case there is ever a question about what exactly was deployed and when. Until recently this was a manual process, but with the implementation of the client’s Service Management Automation software (SMAX) it has become fully integrated with the DevOps processes adopted.
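The gate in front of the approval phase amounts to a completeness check followed by a fixed sequence of approvals. A minimal Python sketch of that idea follows; the class, field names, and stage labels are hypothetical and do not reflect how SMAX models a change.

```python
from dataclasses import dataclass, fields

# The approval order described above; the labels themselves are illustrative.
APPROVAL_STAGES = ["change owner", "T-CAB", "B-CAB", "scheduled"]


@dataclass
class ChangeRequest:
    """The items a change must carry before it can enter the approval phase."""
    business_justification: str = ""
    impact_statement: str = ""
    testing_evidence: str = ""
    executive_summary: str = ""
    time_estimates: str = ""


def missing_items(rfc: ChangeRequest) -> list:
    """List every required item that is still empty."""
    return [f.name for f in fields(rfc) if not getattr(rfc, f.name).strip()]


def next_stage(current: str) -> str:
    """Return the stage that follows `current`; the last stage has no successor."""
    index = APPROVAL_STAGES.index(current)
    if index == len(APPROVAL_STAGES) - 1:
        raise ValueError("change is already scheduled")
    return APPROVAL_STAGES[index + 1]


if __name__ == "__main__":
    draft = ChangeRequest(business_justification="Onboard new customer")
    print(missing_items(draft))   # everything except the justification
    print(next_stage("T-CAB"))
```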
The infrastructure of the environments is, as mentioned in a previous section, managed by code. Terraform, to be exact. While AWS provides a fantastic way to build resources via templates using AWS CloudFormation, Terraform is provider agnostic. The code can be pulled, branched, and committed just like any other code, and can be tested locally without Jenkins. Running a plan or a taint on selected resources can be very effective while in a dev environment. Another positive is that if you need to provision a quick resource for testing, you can use the dev tfvars file to build directly in the dev environment from the dev machine. Wait, using a dev file? Right, this is another benefit of using Terraform. We wrote a single codebase for all the resources. The code is the same for dev and production. The difference is the tfvars files: there is one for each environment. While this sounds sort of messy, the benefits are staggering. For example, I can define smaller resource sizes for my dev environment than for production, as there might not be a need to test with large infrastructure. This helps save money in the long run, as you can scale up or down as needed. We also sometimes have different naming conventions between environments and, although minor, they can be difficult to anticipate when switching between environment builds. Using one infrastructure codebase ensures that resources in dev are close to the real thing, and that resources in prod closely match the machines we tested on. If code works in one environment, we can be confident it will work in the others. Another positive is that Terraform has a rich open-source ecosystem that makes it easy to integrate additional tools into our workflow to help us present our plans clearly to the stakeholders. Whenever we present new cloud infrastructure plans to the stakeholders, we use a chart generated with Terraform-Visual to show all the new AWS resources planned and where in AWS they will go.
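The "one codebase, one tfvars file per environment" idea comes down to choosing which variables file to hand to Terraform. Here is a minimal Python sketch of a helper that builds the plan command for a given environment. The `env/` directory, the file names, and the helper itself are assumptions for illustration; `-var-file` is the standard Terraform option for loading a variables file.

```python
from pathlib import Path

# Environment names taken from the article; the file layout is an assumption.
ENVIRONMENTS = ("dev", "uat", "prod")


def plan_command(environment: str, tfvars_dir: str = "env") -> list:
    """Build the `terraform plan` invocation for one environment.

    The Terraform code is identical everywhere; only the variables file differs.
    """
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    var_file = Path(tfvars_dir) / f"{environment}.tfvars"
    return ["terraform", "plan", f"-var-file={var_file.as_posix()}"]


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        # Print the command instead of running it, so the sketch has no side effects.
        print(" ".join(plan_command(env)))
```

A dev tfvars file might then set a small instance size while the production file sets a larger one, without any change to the Terraform code itself.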
Once everything was built, we were required to monitor the resources. For the infrastructure we rely heavily on CloudWatch. Currently we track CPU, memory, disk space, network usage, requests on the EC2 instances, Load Balancer statuses, Nginx requests, secrets access, cloud account actions, and much more. The metrics are represented using dashboards, and the logs are used to create alerts. There are health checks implemented for the various app endpoints, with alerts associated with those as well. Custom logs are streamed to CloudWatch, where we use Insights to keep track of various operations such as backup status, AppPool statuses, DB statuses, and localized alerts. Nothing can happen in the environments without someone knowing about it. Alerts are fired to an SNS topic, so an email is sent to everyone registered to the topic, in this case the operations team. When alarms go off, SNS also fires an alert to the support ticketing system, which in turn opens a ticket with support. Issues can also be reported to the product support team, who can create similar tickets.
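The alarm-to-notification flow above can be modeled in a few lines of Python. This is a deliberately simplified sketch: it treats an alarm as "the last N datapoints all exceed a threshold" and fans the message out to a list of subscribers. It is not the CloudWatch or SNS API, and the threshold, datapoints, and subscriber names are hypothetical.

```python
def alarm_state(datapoints: list, threshold: float, evaluation_periods: int) -> str:
    """Evaluate the most recent datapoints against a threshold.

    Simplified model: ALARM only when every one of the last
    `evaluation_periods` datapoints is above the threshold.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"


def notifications(alarm_name: str, state: str, subscribers: list) -> list:
    """Fan an alarm out to every subscriber, as a topic would; stay quiet otherwise."""
    if state != "ALARM":
        return []
    return [f"to {subscriber}: {alarm_name} is in ALARM" for subscriber in subscribers]


if __name__ == "__main__":
    cpu = [50, 85, 91, 95]                      # hypothetical CPU percentages
    state = alarm_state(cpu, threshold=80, evaluation_periods=3)
    # Hypothetical subscribers: the operations mailing list and the ticketing system.
    for message in notifications("high-cpu", state, ["ops-email", "ticketing-system"]):
        print(message)
```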
Using AWS tools to support these seven phases of the SDLC, we were able to provide our client (and their clients) with a methodical, stable environment that is easily managed. When a new client onboards, a simple deploy process occurs and the server is built, provisioned correctly with the proper prerequisites (thanks, Terraform!), and monitored. Setting all of this up manually would take around five to six hours; with the DevOps process we follow, it takes only around an hour. While DevOps can sometimes be seen as a constraint on a developer, it is situations like this that allow developers to scale their expertise and manage environments orders of magnitude larger than before.