How to Enable Testing a Distributed System on a Single Environment Using Proxy Routing

Without a dedicated QA environment, teams faced tech and coordination issues when testing a distributed system. A slow, unmaintainable CLI led an organization to shift left with automated testing. They built a tool for versioned deployments using CI and proxy routing, enabling developers to run isolated tests on multiple versions to catch bugs earlier.

At Dev Summit Boston, Po Linn Chia explained how her team re-used a single development environment to deploy multiple service versions for testing their distributed system.

Having no dedicated QA environment causes a lot of problems, both with technology and with teams, as everything is socio-technical, Chia said. The social part of the equation is always hard, probably harder than the technical part, she added.

Their testing infrastructure consists of many microservices running on a single ECS development environment cluster. Teams end up contending over the same microservice, or a change to one microservice affects the behavior of another.

They had previously invested in a homegrown command-line program that ran their entire environment on a continuous integration runner, but it took 15 to 30 minutes to bring the framework up before a single test could run, Chia said. Builds also failed from time-outs, and after the developer of the system left the company, nobody knew how to maintain it.

The solution they decided on to serve development teams was to shift left toward automated testing and to prioritize tooling and coordination, which would help them catch bugs earlier, Chia said. They would keep one environment but run multiple versions on it, deployed through continuous integration to run integration tests.

They have developed an internal deployment tool that lets engineers select which versions they want to deploy or spin down:

Under the hood, we spin up the appropriate ECS task with the desired version and register conditional routing rules with a proxy (Traefik) that looks at Baggage headers. The Baggage header should contain a special key/value pair `dynamic_route=VERSION` denoting the version a client wants to hit, otherwise it’ll fall back to routing to the `main` version of a service.
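The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration of the decision the proxy rules encode, not the team's tool or Traefik's implementation: the function names `parse_baggage` and `select_version` and the `deployed` set are hypothetical, and treating an unknown version the same as a missing key is an assumption.

```python
from typing import Dict, Optional, Set
from urllib.parse import unquote

DEFAULT_VERSION = "main"
ROUTE_KEY = "dynamic_route"


def parse_baggage(header: str) -> Dict[str, str]:
    """Parse a Baggage header into key/value pairs, ignoring member properties."""
    entries: Dict[str, str] = {}
    for member in header.split(","):
        # Drop optional ";property" metadata that may follow the value.
        pair = member.split(";", 1)[0]
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            entries[key.strip()] = unquote(value.strip())
    return entries


def select_version(baggage_header: Optional[str], deployed: Set[str]) -> str:
    """Return the version a request should reach, falling back to main."""
    if not baggage_header:
        return DEFAULT_VERSION
    requested = parse_baggage(baggage_header).get(ROUTE_KEY)
    # Assumption: a version with no registered routing rule also falls back to main.
    if requested in deployed:
        return requested
    return DEFAULT_VERSION
```

In the setup described in the article the proxy makes this choice through its routing rules, so application code would not need a function like this; the sketch only makes the rule explicit.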

They ship APM data, custom metrics, and logs to various third-party vendors, Chia said. For multiple versions of a single deployment, they update metadata such as the service name, deployed version, and also send along Baggage headers so that they can keep track of request traces and monitor deployments individually.
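A rough sketch of what per-version telemetry metadata might look like follows. The tag keys, the helper names `deployment_tags` and `downstream_headers`, and the idea of copying the inbound Baggage header onto outbound calls are all assumptions for illustration; the article does not say which vendors or libraries are used or how the metadata is attached.

```python
from typing import Dict


def deployment_tags(service: str, version: str) -> Dict[str, str]:
    """Metadata attached to traces, metrics, and logs for one deployed version."""
    # Hypothetical tag keys; real names depend on the telemetry vendor.
    return {"service.name": service, "service.version": version}


def downstream_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Copy the inbound Baggage header so outbound calls keep the same context."""
    for name, value in incoming.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "baggage":
            return {"Baggage": value}
    return {}


tags = deployment_tags("my-service", "feature-2981")
outbound = downstream_headers({"baggage": "dynamic_route=feature-2981"})
```

Tagging every signal with the deployed version is what would let a dashboard or trace search separate `feature-2981` from `main` while both run in the same environment.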

Their integration tests are live now, and developers can spin up their own ephemeral containers and write integration tests against them. Teams can also deploy on demand outside of CI to test larger changes, such as a React framework upgrade, without disturbing the development front-end, Chia said.

She mentioned that they also wired up the architecture so that developers can write the services either in a shared repo or just inside their own repo, and they can move back and forth.

InfoQ interviewed Po Linn Chia about their testing environment.

InfoQ: How did you route to different versions in one environment?

Po Linn Chia: We call it "dynamic routing" internally, where application code and DNS don’t need to be modified – just proxy rules.

To illustrate: if the main version of a service is accessed at `http://my-service.classpass.com`, traffic that doesn’t specify a Baggage header routes there by default. Requests that set `Baggage: dynamic_route=feature-2981` as a header will be evaluated against the Traefik routing rules that were added when `feature-2981` was deployed and get sent to that version of the service instead.
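From a test client's point of view, targeting a version comes down to setting one header. The following is a minimal sketch using Python's standard library; the helper name `build_versioned_request` is hypothetical, and the request is only constructed, not sent.

```python
import urllib.request


def build_versioned_request(url: str, version: str) -> urllib.request.Request:
    """Build a GET request that asks the proxy for a specific deployed version."""
    return urllib.request.Request(
        url, headers={"Baggage": f"dynamic_route={version}"}
    )


request = build_versioned_request("http://my-service.classpass.com", "feature-2981")
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Omitting the header, or building the request without it, would leave the same URL pointing at the `main` version.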

InfoQ: How do you do telemetry and what benefits does this bring you?

Chia: Telemetry on multiple deployments lets us distinguish bugs and performance issues for specific versions separately from what’s running in main. We aren’t at the stage where we perform canary deploys in production, so this is a poor man’s version as we work towards that.

It’s hugely helpful: for example, sometimes we can’t help but ship large changesets that are hard or impossible to test in small units, and when we do, being able to monitor and debug the version separately while keeping the stable latest running on our development environment has improved our QA efforts and reduced disruptions for our engineers. We’ve done a couple of major version bumps for large frameworks this way.

InfoQ: When would a tester use a shared repo, and when do they prefer using their own repo?

Chia: Since we are a microservice architecture, we have core business flows that are served by multiple underlying applications. If a test is one that’s critical to one of those types of flows, putting it in a shared repository helps – services involved in those flows can trigger these shared tests without having to re-invent the wheel in their own repo. The downside is that the tests need to be set up and written really well; a failure impacts multiple teams. It’s also a little harder to do the work of development in one repo and write tests in another.

On the flip side, if an application is fairly self-contained, having tests in its own repo is convenient and easier to iterate on. It’s the inverse of writing shared tests. At the end of the day, engineers make judgment calls on when to extract a test out to a shared location – they might start by writing it in an application repo, then realise it could be generally useful and move it out later.
