Moving to Multitenancy

Here at Blend, we recently shifted to a multitenant paradigm for our core application.

That is to say, we moved from a model where a single instance of our app served traffic for a single customer to one where a single instance can serve any number of them.

Why didn’t we start that way?

If you have a system where customers need to interact with each other, multitenancy is necessary from the start. However, when we first started selling Blend, we were selling instances to only a few customers, each of which wanted to serve traffic from their own URLs, had plenty of different configuration points, and obviously wanted zero interaction of their data with any of our other customers.

This meant that managing a single tenant at a time was, at first, significantly simpler. So that’s what we did.

A logical, but problematic, first choice

However, in that pre-multitenant world, both cost and complexity multiplied with customer count rather than with traffic volume. Each customer’s separate instance had its own resource utilization, which made optimization a nightmare. In practice, that meant wasting money on over-provisioned infrastructure.

The goal of shifting to multitenancy was to change this relationship. We envisioned a world where customers would be an abstraction that most engineers would not have to deal with on a daily basis, and where economies of scale would come from traffic throughput. If we could do that, we’d reduce both infrastructure cost and the engineering complexity of managing all of these different instances.

In this post, I’ll walk you through the changes we needed to make for this paradigm shift, the costs and benefits of doing so, and some information on what new projects we’ve unlocked as a result.

Scoping a solution

By the time we decided to shift to a multitenant paradigm, it had become a major project that would require both entirely new features and large refactors of existing functionality.

First, all of our customers have their own URLs under their own domains. To avoid any unencrypted network traffic whatsoever, our security policies forbid SSL termination at the load balancer, so we needed a way to resolve different URLs to different SSL certificates in our web handler.

Second, our application logic needs to understand which tenant is in play in order to resolve to the correct database and manage tracing appropriately; this information needed to move from a single environment variable to something piped through every facet of the app as it processes requests.

Third, financial data is sensitive and vulnerable to abuse without access controls. Any multitenant approach that sacrificed security was a non-starter, so we needed to ensure that we managed all of this in a way that was sufficiently secure for dealing with users’ financial data.

Finally, we were managing service-level, tenant-specific configuration through a combination of environment variables and source control. Even without migrating to multitenancy, this system needed revisiting: we needed to share this configuration across services, and adding a new customer required changes to source code. Thus, we needed to move this configuration to a service that worked as a global source of truth for all of Blend, and to intelligently manage changes to that information.

Resolving a tenant

The first step, resolving a tenant, proved half-simple: serving different SSL certificates based on the domain used to hit your app is a problem solved by a TLS extension called Server Name Indication, or SNI. This is the same technology that load balancers use internally, and it is supported natively by Node’s https module:

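const https = require('https');
const tls = require('tls');

// getSSLKey, getSSLCert, getSSLCA, and app are defined elsewhere.
https.createServer({
  // Node invokes SNICallback with the hostname the client requested.
  SNICallback: (domain, cb) => {
    // Build a TLS context from that domain's certificate material.
    cb(null, tls.createSecureContext({
      key: getSSLKey(domain),
      cert: getSSLCert(domain),
      ca: getSSLCA(domain),
    }));
  },
}, app);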

This is a simplified sketch that leaves out caching and some implementation details, but the point remains that Node allows you to do this quite simply. After this update, all that was left was to point our Route 53 records at the correct load balancer and make sure that we could actually access the SSL certificates.
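As an illustration of the caching that the snippet above leaves out, here is a minimal sketch. The helper name makeCachedSNICallback and its loadContext parameter are hypothetical, not Blend’s implementation; the idea is simply to build each domain’s secure context once and reuse it on later handshakes.

```javascript
// Hypothetical helper: wraps a context loader in a per-domain cache.
// loadContext(domain) is assumed to return a TLS secure context or throw.
function makeCachedSNICallback(loadContext) {
  const cache = new Map();
  return (domain, cb) => {
    let context = cache.get(domain);
    if (!context) {
      try {
        context = loadContext(domain);
      } catch (err) {
        // Report the failure to Node instead of throwing mid-handshake.
        cb(err);
        return;
      }
      cache.set(domain, context);
    }
    cb(null, context);
  };
}

// Usage sketch, reusing the lookups from the snippet above:
// const SNICallback = makeCachedSNICallback((domain) =>
//   tls.createSecureContext({
//     key: getSSLKey(domain),
//     cert: getSSLCert(domain),
//     ca: getSSLCA(domain),
//   }));
```

A cache like this never expires entries, so a real version would also need a way to drop a domain’s context when its certificate is rotated.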

The other half of the problem was not so simple, however. Not only did we need to resolve which certificate to use, we also needed to pipe that information through the rest of the app. From a technical perspective, this isn’t incredibly complicated: we just attach a context to the request object that gets forwarded along to every middleware and route handler. It ended up being quite a headache nonetheless, since that context mattered to roughly 50–60% of the functions in our application, which meant changing many function signatures across many layers of nesting. Fortunately, part of this work had already been done for unrelated reasons a few months before we began work on multitenancy, but it was still a project that would have been unnecessary had we designed for this case from the beginning.
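To make the idea of attaching a context to the request concrete, here is a minimal sketch in the style of an Express middleware. The host table, the attachTenantContext and databaseNameFor names, and the database naming scheme are all assumptions made for illustration.

```javascript
// Hypothetical hostname-to-tenant table; in practice this would come from
// configuration rather than being hard-coded.
const TENANTS_BY_HOST = new Map([
  ['apply.bank-a.example', 'bank-a'],
  ['loans.bank-b.example', 'bank-b'],
]);

// Express-style middleware: resolve the tenant once and attach it to the
// request so later middleware and route handlers can read it.
function attachTenantContext(req, res, next) {
  const host = String(req.headers.host || '').split(':')[0].toLowerCase();
  const tenantId = TENANTS_BY_HOST.get(host);
  if (!tenantId) {
    return next(new Error(`Unknown tenant host: ${host}`));
  }
  req.context = { tenantId, host };
  return next();
}

// Downstream code takes the context as an explicit argument instead of
// reading a process-wide environment variable. The naming scheme here is
// illustrative only.
function databaseNameFor(context) {
  return `app_${context.tenantId}`;
}
```

This explicit argument is what forces the signature changes described above: every function between the route handler and the database call has to accept and forward the context.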

Multitenancy and security

Everything we build at Blend must adhere to high security standards, and multitenancy was no exception.

To ensure that unifying all of the tenants under a single deployment had no impact on the security and isolation of our different customers, we took several measures.

First and foremost, we still maintain separate databases for each of our customers. Each database has its own credentials, and the credentials in use are locked down based on the request that comes in, so even a hostile database query cannot reach across customers’ data or co-mingle it. This isolation is not limited to the databases either: we also maintain separate keys and encrypt data uniquely for every customer.
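As a rough illustration of encrypting data uniquely per customer, the sketch below uses Node’s crypto module with AES-256-GCM. The cipher choice, the in-memory key store, and the function names are assumptions made for the example; the post does not describe the actual scheme.

```javascript
const crypto = require('crypto');

// Hypothetical in-memory key store with one random 256-bit key per tenant.
// A real system would keep keys in a dedicated key management service.
const keysByTenant = new Map([
  ['bank-a', crypto.randomBytes(32)],
  ['bank-b', crypto.randomBytes(32)],
]);

function keyFor(tenantId) {
  const key = keysByTenant.get(tenantId);
  if (!key) throw new Error(`No key for tenant: ${tenantId}`);
  return key;
}

function encryptForTenant(tenantId, plaintext) {
  const iv = crypto.randomBytes(12); // fresh nonce for every message
  const cipher = crypto.createCipheriv('aes-256-gcm', keyFor(tenantId), iv);
  const data = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return { iv, data, tag: cipher.getAuthTag() };
}

function decryptForTenant(tenantId, { iv, data, tag }) {
  const decipher = crypto.createDecipheriv('aes-256-gcm', keyFor(tenantId), iv);
  decipher.setAuthTag(tag);
  // final() throws if the key or authentication tag does not match.
  return Buffer.concat([decipher.update(data), decipher.final()]).toString('utf8');
}
```

With separate keys, a record encrypted for one tenant fails authentication when decrypted under another tenant’s key, so a mix-up surfaces as an error rather than as leaked data.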

Furthermore, all of this isolation is backed by stringent validation on any request to access, encrypt, or decrypt data. Any time we perform an encryption-related operation, we double-check that the context performing the request was created appropriately and matches the origin of the request. That context is also immutable, which prevents tampering that could circumvent these security controls. Even a copy of a context cannot change the original tenant.
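One way to picture an immutable, validated context is sketched below. It assumes a frozen object plus a registry of contexts issued by a trusted factory; createContext, copyContext, and assertValidContext are hypothetical names, not Blend’s API.

```javascript
// Contexts issued by createContext are remembered here, so a hand-built
// look-alike object fails validation.
const issuedContexts = new WeakSet();

function createContext(tenantId, requestHost, extra = {}) {
  // Spread extras first so they can never override the tenant or host.
  const context = Object.freeze({ ...extra, tenantId, requestHost });
  issuedContexts.add(context);
  return context;
}

// Check that the context came from createContext and matches the request.
function assertValidContext(context, requestHost) {
  if (!issuedContexts.has(context)) {
    throw new Error('Context was not created by createContext');
  }
  if (context.requestHost !== requestHost) {
    throw new Error('Context does not match the request origin');
  }
}

// Copies may carry extra fields but always keep the original tenant.
function copyContext(context, extra = {}) {
  assertValidContext(context, context.requestHost);
  return createContext(context.tenantId, context.requestHost, extra);
}
```

An encryption helper would call assertValidContext before touching any key material, so a forged or mismatched context is rejected before any data is read.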

Tenant configuration as a service

As mentioned earlier, our tenant-specific configuration prior to multitenancy was managed through a combination of environment variables and source control, a system that no longer satisfied our requirements. To deal with this, we built out a microservice that serves this configuration. We call it Rolodex because it manages looking up information for any given tenant.

Building out this microservice has already proved incredibly valuable. Its function is simple: it stores metadata about our different tenants, including a few logical groupings. It’s barely over a thousand lines of src/, but we’ve already removed several times that in duplicated configuration across different services, with more to come. Not only that, but it has paved the way for deploy-less customer creation, something we’re piloting in our demo service right now (this is used to provide our sales team with customized versions of our product for potential customers, spun up with a simple Slack command).
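To give a feel for the kind of lookup such a service provides, here is a minimal in-memory sketch. The TenantRegistry class and the tenant fields (id, domain, groups) are assumptions for illustration and do not reflect Rolodex’s actual schema or API.

```javascript
// Hypothetical registry of tenant metadata with logical groupings.
class TenantRegistry {
  constructor(tenantList) {
    // Store a frozen copy of each record, keyed by tenant id.
    this.byId = new Map(tenantList.map((t) => [t.id, Object.freeze({ ...t })]));
  }

  // Look up a single tenant, failing loudly on unknown ids.
  get(id) {
    const tenant = this.byId.get(id);
    if (!tenant) throw new Error(`Unknown tenant: ${id}`);
    return tenant;
  }

  // Return every tenant that belongs to a logical group.
  inGroup(group) {
    return [...this.byId.values()].filter((t) => t.groups.includes(group));
  }
}

const registry = new TenantRegistry([
  { id: 'bank-a', domain: 'apply.bank-a.example', groups: ['production'] },
  { id: 'demo-1', domain: 'demo-1.example', groups: ['demo'] },
]);
```

Putting this behind a single service means every other service asks the same source of truth instead of carrying its own copy of the configuration.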

Efficiency gains and other benefits

Building out all of these features was a lot of work, so it’s important to note some of the final benefits that it brought us.

First, we were able to halve our AWS EC2 cluster sizes while actually increasing overall potential throughput. That translated to huge cost savings for us while improving performance and reliability for all of our customers.

We also reduced our deployment complexity dramatically. Prior to all of this work, the number of jobs we maintained grew linearly with the number of customers, which meant a great deal of effort went into creating these jobs. We no longer need to maintain any customer-specific jobs. Even jobs that we run per tenant for log isolation can now be generated automatically with the advent of Rolodex.
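Generating per-tenant jobs from tenant metadata could look something like the sketch below. The tenant list, the generateLogJobs function, and the job fields are hypothetical and do not correspond to a real scheduler schema.

```javascript
// Hypothetical tenant metadata, shaped like what a configuration service
// might return.
const tenants = [
  { id: 'bank-a', groups: ['production'] },
  { id: 'bank-b', groups: ['production'] },
  { id: 'demo-1', groups: ['demo'] },
];

// Build one log-isolation job definition per tenant in the given group.
// The job fields are illustrative only.
function generateLogJobs(allTenants, group) {
  return allTenants
    .filter((tenant) => tenant.groups.includes(group))
    .map((tenant) => ({
      name: `logs-${tenant.id}`,
      tenantId: tenant.id,
      logStream: `app/${group}/${tenant.id}`,
    }));
}

const productionJobs = generateLogJobs(tenants, 'production');
```

Because the job list is derived from metadata, adding a tenant to the registry is enough to get its job; nobody has to write a new job definition by hand.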

Not only did we reduce deployment complexity in terms of job count and infrastructure management, but raw deploy time also saw a significant reduction due to some of the parallelization and simplification here.

Last but certainly not least, the largest business value of this work is that onboarding new customers costs us nothing: we only need to pay for traffic moving through our application.

On the other hand, there were some costs here. Single-tenancy does have some benefits in terms of isolation: on a multitenant instance, something hogging the CPU or crashing for a single tenant has the potential to impact all of our customers. Additionally, root-causing an issue with a single customer can prove more challenging, since all logs are centrally located and there can be a lot of irrelevant data. Furthermore, while we removed a lot of complexity from our deployments, some of it moved into the app, since we now need to multiplex across tenants.

All that being said, the benefits far outweigh the costs. The vast majority of issues that affect performance or uptime are systemic rather than isolated, and multitenancy gives us much higher redundancy against even those. We can use search functionality to get a stream of logs by tenant. Finally, the complexity added to the app was much smaller than the complexity removed from the infrastructure.

Looking back

After finishing this project, there were a few things we wished we had known at the beginning of our undertaking.

The biggest thing to note here is that much of the complexity of this problem was accidental rather than essential. Basically, this became a major project for us because we initially built the product in a way that was incompatible with our long-term goals. I’m a strong advocate of investing in infrastructure early and often, and this is another case where a lot of hours could have been saved by making that decision early on.

Possibly the most notable indication that our existing solution needed revisiting was our management of environment variables. Environment variables are very low effort and can be an incredibly simple way to manage configuration at the beginning, but they can quickly become unsustainable. A particularly strong signal that you should design a more extensible solution is when your environment variables become interdependent or when you need to reuse the same set of variables across services.

Looking forward

Our shift to a multitenant paradigm has opened a lot of possibilities that we’re excited to start building out.

For one, now that our costs scale with volume rather than customer count, we can make onboarding a new customer almost as simple as opening an account on Facebook. To do this, we need to automate a whole suite of customer provisioning steps, including creating the database and user, generating a set of API keys, storing SSL certs, creating DNS records, and handling a whole slew of other minor infrastructure tasks. For now, we’re only exposing this internally, but some day maybe we’ll even allow customers to self-serve this process.

Additionally, our reduced deployment complexity has created a much more manageable and monitorable infrastructure, and will pave the way to move to a more continuous deployment setup.

Another possibility on the horizon is splitting our deployment. Right now, a single multitenant deployment serves all of our customers, but we could easily break it apart along boundaries other than customers. For example, it might be beneficial to release two deployments at different cadences to improve release health. Since we would still aggregate a large number of customers into each deployment, we would keep all of the benefits of multitenancy while gaining some of the isolation of multiple deployments.

Finally, this shift has paved the way for service-level auto-scaling. In our pre-multitenant world, the minimum number of containers we provisioned had capacity that far exceeded the most extreme traffic peak they would ever see, and we couldn’t scale them down without compromising the high availability of our service. In our new multitenant world, these assumptions are no longer valid.

If you’re interested in any of these problems or scaling distributed systems, our team is hiring. Apply through our open Software Engineer posting and let the recruiting team know you’re interested in the platform team.

Originally published at blend.com on January 21, 2019.
