Migrating the Kubernetes Network Overlay With Zero Downtime
At Blend, we make extensive use of Kubernetes on AWS to power our infrastructure. Kubernetes has many moving parts, and most of these components are swappable, allowing us to customize clusters to our needs. An important component of any cluster is the Container Network Interface (CNI), which handles the networking for all pods running on the cluster. Choosing the right CNI for each use case is critically important, and changing it once a cluster is serving production traffic can be painful. Blend had several problems with the CNI we initially chose (Weave), leading us to explore alternatives. We eventually decided to switch, and in this post we describe the challenges of migrating without downtime and how we solved them.
We’ll cover the following topics:
- Why Changing Overlays is Hard
- Changing CNIs at Blend: Weave to Calico
- The Migration: Mirroring Clusters
- The Migration: Switching CNIs
- Gotchas: Why Practice is Important
Why Changing Overlays is Hard
The CNI handles networking between pods, establishing an overlay network for the cluster on top of the existing network. Any pod running on Kubernetes that isn’t using host networking is managed by the overlay, which assigns its IP address and handles its traffic on the cluster. Pod addresses are not visible outside the cluster, so the CNI is vital to allowing pods to communicate across the network. If the CNI pod on a host isn’t available, then any pods running on that host won’t be able to reach the network, causing connectivity issues within the cluster. Normally this isn’t a problem, because replicating pods and moving them to healthy nodes is a strength of Kubernetes. But if all of the network overlay pods are unavailable, no service running on the cluster can serve traffic, causing a major outage. This type of failure is rare, but it is exactly what you’ll experience if you need to change the CNI on a running cluster. During the transition, the cluster has no network, which is the crux of the challenge with changing overlays.
Changing CNIs at Blend: Weave to Calico
We’ve been using Kubernetes on AWS for over two years and we manage our clusters with tooling built around kops. Because we manage our own clusters¹, we’ve experimented with a number of configuration options and have arrived at a set that works well for us. By the time we moved to production with Kubernetes we had settled on Weave. As load on our cluster grew, we started seeing problems, but we continued to work through them and alleviate the stress on Weave. We saw low network throughput, dropped connections, timeouts, and some nodes running out of IP space. We tried to solve these with suggestions from the community, upgrades, and resource tuning, but we ended up searching for other options. Calico is another common CNI for clusters, so we put it to the test in one of our higher load environments, our sandbox cluster. Calico performed significantly better for our use case, so we made the decision to switch. We hope to follow up with another post discussing our experiences with different CNIs in more detail.
We didn’t want to accept the downtime to migrate our existing clusters to Calico because we were already running most of our SLA’d services on Kubernetes. We wanted to avoid interrupting service as much as we could, not just for our customers but also for services internal to Blend. Since we run several clusters, standing up new ones has become relatively simple for us. This gave us plenty of room to experiment with the best way to change our CNI.
The general process for switching a cluster to Calico was:
- Change the kops cluster configuration to install Calico and update the cluster
- Remove Weave from the cluster
- Run a kops rolling update to bring new machines into the cluster
This process would leave the cluster with no network from the start of the second step until the end of the third. The more machines in the cluster, the longer this would take, so simply doing it and eating the downtime was out of the question, especially for our larger clusters. Since there was no way to avoid downtime on the cluster itself, we needed to direct traffic elsewhere while we performed the switch. We do not currently have multiple clusters serving production traffic, so we could not fail over to another region. We concluded that we should set up another cluster to mirror the one we wanted to work on.
The Migration: Mirroring Clusters
Because we have used kops extensively, setting up a new cluster is straightforward for us. To avoid downtime, our plan was to set up a second cluster alongside the main one and direct traffic there during maintenance. While this sounds simple, the devil is in the details. First, a caveat: we did not worry about persistent volumes. Our workloads, for the most part, do not use them, and for the few services that did, we could tolerate the downtime. Having mostly stateless services simplified things immensely. The steps were:
- Set up a secondary cluster
The first step was easy for us since we could reuse our existing tooling for setting up a new cluster. The secondary cluster is substantially similar to the main one, and the ways in which a mirror differs from its primary are the same across all our clusters, so we wrote a small utility to transform a cluster’s configuration into an equivalent configuration for its mirror.
- Copy all Kubernetes resources (secrets, services, deployments, jobs, etc.) to the new cluster
The next step was to copy the Kubernetes resources. Because we only cared about stateless services, this was pretty straightforward as well. We wrote a script that iterated through the resources in the namespaces we cared about and copied them to the new cluster. Prior to copying, we disabled new deploys so there would be no difference between the versions running on each cluster.
- Redirect DNS entries to the new cluster
Redirecting DNS entries proved to be the most challenging aspect of this. In addition to Route 53, we store the DNS entries for a service in Kubernetes metadata, so we have a record of what each service needs. This gave us confidence that we weren’t missing any DNS entries and made it easier to switch the entries between clusters. We wrote a job to run on the cluster that iterated over the services, first validating that each service was available on the cluster and then changing its DNS record to point to the load balancer or ingress. The job pointed DNS entries at the same cluster it ran on, making it easy to reason about.
- Perform the CNI migration on the original cluster
With traffic on the secondary cluster, we could do any downtime maintenance we wanted on the original. In this case that meant switching the cluster CNI, which is discussed in more detail below.
- Redirect DNS entries back
Here we reused the same job as in step 3, this time running it on the primary cluster to switch traffic back.
- Take down the secondary cluster
After everything was done, we took down the secondary cluster to avoid the overhead of keeping it up to date and paying for it while it wasn’t being used.
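The DNS job from steps 3 and 5 can be sketched roughly as follows. This is a minimal illustration rather than our actual job: the service dictionaries and their `dns_name` and `target` keys are hypothetical, the availability check is passed in as a function, and plain CNAME records are assumed for simplicity. The change batch follows the shape used by the Route 53 `ChangeResourceRecordSets` API.

```python
from typing import Callable, Dict, List, Tuple


def build_dns_changes(
    services: List[Dict[str, str]],
    is_available: Callable[[Dict[str, str]], bool],
    ttl: int = 60,
) -> Tuple[Dict[str, list], List[str]]:
    """Build a Route 53-style change batch for the services that are healthy.

    Each service dict uses hypothetical keys: "dns_name" (the public record)
    and "target" (the load balancer or ingress hostname on this cluster).
    Returns the change batch and the DNS names that were skipped.
    """
    changes = []
    skipped = []
    for svc in services:
        # Validate first: never repoint DNS at a service that is not serving.
        if not is_available(svc):
            skipped.append(svc["dns_name"])
            continue
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": svc["dns_name"],
                "Type": "CNAME",  # assumption: plain CNAMEs, not alias records
                "TTL": ttl,
                "ResourceRecords": [{"Value": svc["target"]}],
            },
        })
    return {"Changes": changes}, skipped


services = [
    {"dns_name": "api.example.com", "target": "lb-api.mirror.example.com"},
    {"dns_name": "web.example.com", "target": "lb-web.mirror.example.com"},
]
# Stand-in health check for the example: pretend only the API is up.
batch, skipped = build_dns_changes(
    services, lambda svc: svc["dns_name"].startswith("api")
)
```

A batch like this could then be submitted with boto3's `change_resource_record_sets`, given the hosted zone ID. Keeping the skipped names separate makes it easy to fail the job loudly instead of silently leaving some records on the old cluster.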
The Migration: Switching CNIs
With traffic switched over, we are able to perform downtime maintenance without worries. To switch from Weave to Calico we did the following:
- Replaced Weave with Calico in the configuration and used kops to update the cluster
At this point, the Calico daemonset was created, but all the Calico pods were crashing. Weave was still available though, so nothing on the cluster was broken.
- Deleted Weave, breaking the network
We found that we needed to delete Weave before rolling the cluster. Otherwise, Kubernetes wouldn’t pick up the changes and the cluster ended up in a bad state.
- Rolling update on the cluster
Since we didn’t need to keep services up on this cluster, a simple way to perform the rolling update was to double the number of nodes and then scale in the old ones. Normally this would wreak havoc on a cluster, because the new nodes get no time to absorb the load, but we didn’t need to worry about that since the mirror cluster was handling traffic.
- New nodes (should) all have working Calico pods, and the network (should) be available
If there is any instability, rolling the cluster once more should fix it. Sometimes fresh nodes are required if there are any hiccups.
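As a rough sketch, the sequence above maps onto a handful of commands. The helper below only builds and prints them; it is illustrative, not our actual tooling. It assumes the cluster spec has already been edited to select Calico instead of Weave, that kubectl's current context points at the cluster being migrated, and that the Weave daemonset is named `weave-net` in the `kube-system` namespace. It also uses the stock kops rolling update rather than the double-and-scale-in approach described above.

```python
from typing import List


def cni_switch_plan(cluster_name: str) -> List[List[str]]:
    """Return the commands for the CNI switch, in order, as argv lists."""
    return [
        # 1. Apply the edited cluster spec (networking changed to Calico).
        #    Calico pods crash at this point, but Weave still carries traffic.
        ["kops", "update", "cluster", "--name", cluster_name, "--yes"],
        # 2. Remove Weave. The pod network is down from here on.
        #    Assumption: the daemonset is kube-system/weave-net.
        ["kubectl", "--namespace", "kube-system",
         "delete", "daemonset", "weave-net"],
        # 3. Replace every node so that new nodes come up with Calico only.
        ["kops", "rolling-update", "cluster", "--name", cluster_name, "--yes"],
    ]


plan = cni_switch_plan("k8s.example.com")
for argv in plan:
    print(" ".join(argv))
```

Printing the plan before running anything is a cheap way to rehearse the migration on a throwaway cluster first; each argv list can be handed to `subprocess.run(argv, check=True)` once the order has been verified.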
Gotchas: Why Practice is Important
We ran into a couple of problems trying to iron out our mirroring process. The first thing we came across was the naming differences between the clusters. Since clusters need unique names, they need unique hosted zones. This ends up creating a bunch of problems, but one of the big ones is TLS. TLS certificates are domain name specific, so we needed certificates for the new cluster. We use Let’s Encrypt with cert-manager to issue certificates on the cluster, so before setting up the secondary cluster, we added the new domains as subject alternative names (SANs) to all our certificates. This way, certificates were still valid for the new cluster.
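To illustrate the SAN change, here is a minimal sketch that adds a mirror-zone name for every primary-zone name on a certificate manifest. It assumes the certificate is a cert-manager `Certificate` resource loaded as a dictionary, with its names listed under `spec.dnsNames`; the zone names in the example are made up.

```python
import copy
from typing import Any, Dict


def add_mirror_sans(
    cert: Dict[str, Any], primary_zone: str, mirror_zone: str
) -> Dict[str, Any]:
    """Return a copy of the manifest with mirror-zone SANs appended."""
    updated = copy.deepcopy(cert)
    names = updated["spec"].setdefault("dnsNames", [])
    # Iterate over a snapshot so appending does not affect the loop.
    for name in list(names):
        if name == primary_zone or name.endswith("." + primary_zone):
            # Swap the zone suffix, keeping any subdomain prefix.
            mirrored = name[: len(name) - len(primary_zone)] + mirror_zone
            if mirrored not in names:
                names.append(mirrored)
    return updated


cert = {
    "kind": "Certificate",
    "spec": {"dnsNames": ["registry.k8s.example.com", "other.example.org"]},
}
with_sans = add_mirror_sans(cert, "k8s.example.com", "k8s-mirror.example.com")
```

Names outside the primary zone are left alone, and running the function twice adds nothing new, so it is safe to apply to every certificate before the mirror is created.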
The next problem we ran into was with our cluster Docker image registry. We needed to make all of the images in the registry available to the mirror cluster. Since our registry is backed by S3, we were able to run the same registry on the mirror cluster by copying auth secrets between clusters. The registry on the new cluster ran under a different domain (e.g. registry-mirror.example.com instead of registry.example.com). However, the copied Kubernetes resources still referred to the old registry domain (e.g. registry.example.com). If we wanted to swap the DNS for the registry too, we would need two rounds of DNS repointing. To avoid this, our script for copying Kubernetes resources rewrites the image names on resources to reference the new registry instead.
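The rewriting step in the copy script might look something like the sketch below. It is an illustration under assumptions, not our script: resources are taken to be manifests loaded as dictionaries, the list of server-assigned metadata fields to drop is a reasonable guess rather than an exhaustive one, and any `image` field whose value starts with the old registry domain is rewritten.

```python
import copy
from typing import Any, Dict

# Metadata the API server assigns; it should not be carried to another cluster.
SERVER_FIELDS = ("resourceVersion", "uid", "selfLink",
                 "creationTimestamp", "generation")


def _rewrite_images(node: Any, old: str, new: str) -> None:
    """Walk the manifest and repoint every image on the old registry."""
    if isinstance(node, dict):
        for key, value in node.items():
            if (key == "image" and isinstance(value, str)
                    and value.startswith(old + "/")):
                node[key] = new + value[len(old):]
            else:
                _rewrite_images(value, old, new)
    elif isinstance(node, list):
        for item in node:
            _rewrite_images(item, old, new)


def prepare_for_mirror(
    resource: Dict[str, Any], old_registry: str, new_registry: str
) -> Dict[str, Any]:
    """Return a copy of the resource that is ready to create on the mirror."""
    out = copy.deepcopy(resource)
    for field in SERVER_FIELDS:
        out.get("metadata", {}).pop(field, None)
    out.pop("status", None)
    _rewrite_images(out, old_registry, new_registry)
    return out


deployment = {
    "kind": "Deployment",
    "metadata": {"name": "app", "uid": "1234", "resourceVersion": "42"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "image": "registry.example.com/team/app:1.2"},
        {"name": "proxy", "image": "nginx:1.25"},
    ]}}},
    "status": {"replicas": 3},
}
mirrored = prepare_for_mirror(
    deployment, "registry.example.com", "registry-mirror.example.com"
)
```

Images hosted elsewhere are untouched, and the original manifest is not modified. Other resource kinds may need extra handling; for example, a Service's cluster-assigned `spec.clusterIP` would likely need to be cleared as well.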
One last thing, also related to naming, is IAM roles and security groups. kops creates security groups and roles for the nodes and masters in the cluster based on the name of the cluster. The new roles need the same permissions as those of the old cluster, and the security groups need to be mirrored too. For example, if you have an RDS database whose security groups are open to the original cluster, make sure to open them to the mirror cluster’s new security groups as well.
CNIs are a critical component of Kubernetes clusters. Making CNI changes without causing disruptions isn’t always possible. When switching CNIs the cluster’s network will have downtime. The desire to avoid downtime in a scenario where we couldn’t have any led us to our cluster mirroring solution. Not only did this help solve the problem of switching CNIs, but it also gives us another tool when we encounter similar cluster maintenance tasks.