Moving a Business-Critical Monolith to Kubernetes

At Blend we have been pushing for Kubernetes adoption across all services for the last two years. Migrating our monolith from AWS ECS to a self-hosted Kubernetes cluster marked a major milestone. Moving business-critical applications in general requires deliberate planning and in many cases major updates to deployment pipelines, system monitoring, testing, and infrastructure. This post will explore the migration strategies and lessons learned as we got the monolith up and running across deployments with zero downtime.

Before Kubernetes

Blend began as a monolithic multi-layered application in a single codebase. Prior to migrating toward Kubernetes, we had a few external services, but most of the logic remained in the monolith. The monolith covers a wide range of functionality, from loan workflow engines to document generation and single sign-on. Working out of the monolith enabled high developer productivity in the early days. We only had to build one image per release and were able to deliver features iteratively to customers while running an automated release pipeline on Jenkins. On the infrastructure layer, we used AWS extensively, from hosting virtual instances with EC2 to storing container images on ECR. Our monolith ran on clusters managed by ECS. The overall architecture worked well given our engineering capacity at the time.

But as the number of new services grew over time, we started exploring the possibility of switching to Kubernetes for several reasons. The biggest reason was to encapsulate the whole process of running a service into a single system. ECS is great for running containers, but it still requires us to use a multitude of different AWS products for a functioning service. Kubernetes provides all the necessary machinery in one system, which simplifies our architecture. Avoiding vendor lock-in was also important to us, and Kubernetes was designed to be interoperable, with support for different cloud services, network fabrics, and storage options. We also wanted to build new tooling that abstracts away the deployment complexity for other engineering teams in ways that are simply not possible with ECS (we will discuss this in more detail in a follow-up post). To standardize container management across all services, we eventually decided to move the monolith to the Kubernetes-based cluster.

Moving the Monolith

Before we could plan the monolith migration, which had to occur without any degradation of service, it was essential to run experimental Kubernetes services with the requisite security controls and production validations. The monolith relied on a large amount of custom-built infrastructure that other services did not need, which added complexity to the project. In case of unexpected incidents during migration, we had to be able to revert to a stable deployment immediately.

With this in mind, the requirements included:

Dual Deployment

After accounting for the constraints and deployment methods used by the monolith, we decided the safest route forward was to build out dual deployment. We did so by creating two release workflows that supported running the monolith on both ECS and Kubernetes, from building images to hosting containers. The general idea was to create a toggle in our deployment scripts that allowed a simple boolean flag to dictate which orchestration service to release to. We then set up all deployment multi-jobs on Jenkins to run the deploy with both permutations, so that the instances in both orchestrators were always running the same version of the code.
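The toggle described above can be sketched as follows. This is only an illustration under stated assumptions, not our actual deployment scripts: the names `DeployJob`, `target_orchestrator`, and `plan_dual_deploy` are hypothetical, and the real Jenkins jobs are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeployJob:
    """One deploy invocation (hypothetical structure for illustration)."""
    orchestrator: str
    image_tag: str
    environment: str


def target_orchestrator(use_kubernetes: bool) -> str:
    """Map the boolean toggle to the orchestration service to release to."""
    return "kubernetes" if use_kubernetes else "ecs"


def plan_dual_deploy(image_tag: str, environment: str) -> List[DeployJob]:
    """Plan one deploy per toggle value so both orchestrators get the same version."""
    return [
        DeployJob(target_orchestrator(flag), image_tag, environment)
        for flag in (False, True)
    ]


if __name__ == "__main__":
    for job in plan_dual_deploy("v1.2.3", "sandbox"):
        print(job)
```

Planning both permutations from a single image tag is what keeps the two deployments from drifting apart between releases.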

Release pipeline after enabling dual deployment

Changes to the application codebase were also necessary in a few places. We used feature flags to ensure that the monolith would be compatible with both ECS and Kubernetes, which means that deployment-specific application logic was gated behind feature flags. For example, our monolith ran multiple child processes in a container on ECS. But as we moved to Kubernetes, we decided to run just one process per container and let the Horizontal Pod Autoscaler do the heavy lifting, which reduced application-level complexity. We also had to add feature flags to services interacting with the monolith, most often to select the monolith service URL according to the orchestration service.
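A minimal sketch of this kind of flag-gated logic is shown below. The flag name `monolith_on_kubernetes`, the URLs, and both function names are assumptions made for illustration. Tying the ECS process count to the CPU count is also an assumption; the only detail taken from the migration itself is that the monolith ran several child processes on ECS and one process per container on Kubernetes.

```python
import os
from typing import Mapping, Optional

# Hypothetical flag name and URLs, used only for this example.
FLAG = "monolith_on_kubernetes"
MONOLITH_URLS = {
    "ecs": "https://monolith.ecs.internal.example",
    "kubernetes": "https://monolith.k8s.internal.example",
}


def monolith_base_url(flags: Mapping[str, bool]) -> str:
    """Pick the monolith URL a calling service should use, based on the flag."""
    orchestrator = "kubernetes" if flags.get(FLAG, False) else "ecs"
    return MONOLITH_URLS[orchestrator]


def worker_process_count(flags: Mapping[str, bool],
                         cpu_count: Optional[int] = None) -> int:
    """One process per container on Kubernetes; several child processes on ECS."""
    if flags.get(FLAG, False):
        return 1  # scaling is left to the Horizontal Pod Autoscaler
    # Assumption: on ECS, fork one child per CPU.
    return cpu_count or os.cpu_count() or 1


if __name__ == "__main__":
    flags = {FLAG: True}
    print(monolith_base_url(flags), worker_process_count(flags))
```

Defaulting the flag to off means a service that has not been configured yet keeps talking to the ECS deployment.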

Regardless of the selected orchestration service, each release version would go through four phases of deployment environments: sandbox, preprod, beta, and prod. We began our work in sandbox and only proceeded to a higher environment when we had gathered enough signals to conclude that the monolith was running reliably on Kubernetes. Running the migration in a new environment only required minor configuration changes and traffic cutover, which we will discuss in the following section, so most of the development work was completed in sandbox.

Testing and Validation

To safeguard our monolith, we run a large suite of tests on every pull request before it is merged. Each pull request runs unit, frontend, backend, and end-to-end tests. Part of this involves running tasks, and another part involves setting up a full instance of the service, running tests against it, and then tearing it down again. The whole process creates a lot of churn in our Kubernetes clusters, which is not the use case we originally designed for.

The pod churn created many issues around scheduling and scaling of cluster resources. For instance, the sheer number of pods being created caused nodes to occasionally exhaust their IP pools and become unable to schedule new pods even if they had free resources. We originally decided to run each test in its own pod, but this quickly became unwieldy. Because of our network issues, a flake in a single pod could fail the tests for an entire pull request, so the more pods a pull request used, the more likely it was to fail. This strategy also resulted in inefficient use of the cluster, as each pod adds overhead. We trimmed down the number of pods by running multiple tests per pod, which helped to alleviate both problems.
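The effect of pod count on flakiness can be illustrated with a short sketch. It assumes each pod flakes independently with the same probability, which is a simplification, and the numbers are illustrative rather than measured. The helper names `batch_tests` and `pr_flake_probability` are hypothetical.

```python
from typing import List, Sequence


def batch_tests(tests: Sequence[str], tests_per_pod: int) -> List[List[str]]:
    """Group tests so that each pod runs several of them instead of one."""
    if tests_per_pod < 1:
        raise ValueError("tests_per_pod must be at least 1")
    return [
        list(tests[i:i + tests_per_pod])
        for i in range(0, len(tests), tests_per_pod)
    ]


def pr_flake_probability(pod_count: int, pod_flake_rate: float) -> float:
    """Chance that at least one pod flakes, assuming independent flakes."""
    return 1.0 - (1.0 - pod_flake_rate) ** pod_count


if __name__ == "__main__":
    tests = [f"test_{n}" for n in range(100)]
    one_per_pod = batch_tests(tests, 1)
    five_per_pod = batch_tests(tests, 5)
    print(len(one_per_pod), pr_flake_probability(len(one_per_pod), 0.01))
    print(len(five_per_pod), pr_flake_probability(len(five_per_pod), 0.01))
```

With a hypothetical 1% flake rate per pod, 100 pods give a pull request roughly a 63% chance of hitting at least one flake, while 20 pods bring that down to roughly 18%.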

In addition to the sheer number of pods created, we had to deal with ephemeral services that needed to last long enough to test but get cleaned up afterward. In production, we decided to give this service its own Elastic Load Balancer (ELB) instead of using the shared load balancer for cluster ingress. This works well for long-lived services that aren’t created and destroyed often, but it is harder to manage in our testing infrastructure. To keep the tests as close to production as possible, we still use a load balancer for each service, but the number of load balancers caused us to hit the limit on the number of rules per security group. By default, Kubernetes creates a security group for each ELB and adds an entry to the nodes’ security group for that ELB. To mitigate this, we disabled that behavior and instead created a shared security group that we attach to each ELB to allow ingress to the nodes. We also hit the AWS limit on the number of load balancers multiple times. We resolved this by requesting an increase from AWS, but hitting the limit blocked merges in the meantime, and it would be better to avoid the problem entirely. While this setup is working at the moment, we plan to reduce the number of load balancers we use in the future.

Running end-to-end tests against monolith services in sandbox also made us rethink how to deal with the HTTP traffic handler. We provide services on Kubernetes with a lot of easy configuration options through our in-house deployment system. Web services at Blend have a sidecar injected into their pod to handle TLS and HTTP redirection, which allows most service owners to ignore the details of HTTPS entirely. However, our monolith needed to handle TLS itself because it uses SNI to determine the correct certificates to serve. To support general use cases while proceeding with the monolith migration, we built out options in the deployment system to configure traffic handlers with the proper sidecar injection. As a result, HTTP requests sent to the monolith are automatically redirected to HTTPS.

Serving Traffic

Once the application was up and running on both orchestrators, we needed a way to control how much traffic was sent to each deployment. Each of our customers points their custom domain to a DNS entry internally managed by Blend rather than directly to the load balancer. For example, a hypothetical customer, Bailey Bank, that controls the URL mortgage.baileybank.com would point that URL to an internally managed DNS entry, which would then resolve to the IP of the load balancer.

DNS resolution for a customer

Therefore, we only needed to update the internally managed DNS entry to point to the Kubernetes load balancer instead of the ECS load balancer. We built tooling to configure traffic cutover on a per-customer basis and rolled traffic over to the Kubernetes deployment for a subset of customers at a time. We proceeded with the rollout while carefully monitoring application-level metrics (e.g. API requests/sec, 95th percentile request response time, and error logs) and service-level metrics (e.g. CPU/memory utilization and pod count). Having the right alerts and monitors in place before the traffic cutover was also essential for determining whether things were working. In case of an emergency, we would have re-routed the traffic back to the ECS load balancer. This practice worked out well for us, and we were eventually able to move all customers to Kubernetes without encountering any major issues.
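The per-customer cutover with a metrics check can be sketched roughly as below. This is not our actual tooling: the load balancer hostnames, the thresholds, and the names `Metrics`, `is_healthy`, and `cut_over` are all hypothetical, and the DNS records are modeled as a plain dictionary rather than calls to a real DNS provider. Rolling back only the failing batch and then stopping is one possible policy, chosen here for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical load balancer hostnames.
ECS_LB = "ecs-lb.internal.example"
K8S_LB = "k8s-lb.internal.example"


@dataclass
class Metrics:
    error_rate: float        # fraction of failed requests
    p95_latency_ms: float    # 95th percentile response time


def is_healthy(m: Metrics, max_error_rate: float = 0.01,
               max_p95_latency_ms: float = 500.0) -> bool:
    """Illustrative thresholds; real alerts would cover more signals."""
    return m.error_rate <= max_error_rate and m.p95_latency_ms <= max_p95_latency_ms


def cut_over(dns: Dict[str, str], customers: List[str], batch_size: int,
             check_health: Callable[[], Metrics]) -> Dict[str, str]:
    """Move customers to Kubernetes one batch at a time; revert a batch if unhealthy."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    records = dict(dns)  # work on a copy of the internal DNS entries
    for start in range(0, len(customers), batch_size):
        batch = customers[start:start + batch_size]
        for customer in batch:
            records[customer] = K8S_LB
        if not is_healthy(check_health()):
            for customer in batch:
                records[customer] = ECS_LB  # route the batch back to ECS
            break  # stop the rollout until the problem is understood
    return records


if __name__ == "__main__":
    dns = {name: ECS_LB for name in ("bailey-bank", "customer-b", "customer-c")}
    print(cut_over(dns, list(dns), 1, lambda: Metrics(0.0, 120.0)))
```

Because customers point at an internally managed entry, both the cutover and the rollback are a single record change per customer and need no action on the customer's side.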

Conclusion

In general, migrating business-critical applications requires deliberate planning and validation for different outcomes before escalating to higher development environments. We optimized for making two-way door decisions where changes are reversible, then moved forward as we verified the service was operating properly in each phase. We have learned a lot about fine-tuning our deployment system and Kubernetes clusters throughout the monolith migration, and now the production Kubernetes-based cluster is handling 100% of our monolith traffic that processes $2 billion of loans per day. As we continue learning and improving our infrastructure, we aim to improve service reliability and facilitate developer velocity.