Over the past four years, the Heroku Runtime team has transitioned from occasional, manual deployments to continuous, automated deployments. Changes are now rolled out globally within a few hours of merging any change—without any human intervention. It's been an overwhelmingly positive experience for us. This post describes why we decided to make the change, how we did it, and what we learned along the way.
Where We Started

Heroku’s Runtime team builds and operates Heroku’s Private Spaces (single-tenant) and Common Runtime (multi-tenant) platforms, from container orchestration to routing and logging. When I joined the team in 2016, the process for deploying changes to a component looked something like this:
The process was manual, repetitive, and required coordination and extra approvals. It was toilsome. However, it also served the team well for a long time! In particular, at that time the team was small, owned few services, and had just two production environments (US and EU).
Notice that this process already relies on some important properties of the system and the engineering environment within Heroku:
We understood that there were opportunities to improve the deployment process, but it worked well enough most of the time that it wasn't a big priority. It was toil, but it was manageable. However, what counts as manageable shifts as a system evolves, and two things in particular changed that forced us to re-evaluate how we managed deployments:
Between these two changes, we knew it wouldn’t be sustainable to simply make incremental changes to the existing workflow; we had to bite the bullet and move to a fully automated and continuous deployment model.
Increasing Deployment Complexity

The primary control system for the Common Runtime was a monolithic Ruby application, which changed infrequently (< 1 change per week) and was running in two production environments.
The sharding project had us actively working (multiple changes per day) on multiple services running in 12 environments (2 staging, 10 production). The cost of every manual step was amplified. Pulling up the right dashboards for an environment used to cost a few minutes a week, but could now be hours. Waiting 3 hours between deploys and looking at graphs to notice anomalies? Totally untenable.
Higher Coordination Costs

The original process was to merge changes, and then release them … at some point later. It was common to merge something and then come back the next day to release it. These changes might require some manual operations before they were safe to release, but we were able to manage that by dropping a message to the team, "Hey, please don't deploy until I've had a chance to run the database migrations." As we brought more people onto the team and increased activity, however, this became a major pain point. Changes would pile up behind others, and no one knew if it was safe to release all of them together.
We also introduced a set of shared packages to support the development of our new microservices, which created a new problem: one set of changes could impact many services. If you're working on one service and introduce some changes to a shared library to support what you want to do, you'll deploy that one service and move on. We would regularly accumulate weeks' worth of undeployed changes on other services, changes that were probably unrelated but still represented a major risk.
We needed to shift to a model where changes were always being released to all of our services.
How We Did It

To build our new release system, we made use of existing Heroku primitives. Every service is part of a Pipeline. The simplest services might only be deployed to staging and production environments, but our sharded services have up to 10 environments, modeled as Heroku apps running the same code but in different regions.
The pipelines are configured to automatically deploy to the first app in staging when a change is merged into master and CI has passed. That means that for any service, changes are running in staging within a few minutes of merging. Note that this is not something custom we built—this is a feature available to any Heroku Pipelines user.
This is where we would previously begin checking dashboards and watching ops channels before manually executing a pipeline promotion from the staging app to the next app in the pipeline.
Automation To The Rescue

We built a tool called cedar-service-deployer to manage the rest of the automated pipeline promotions. It’s written in Go and uses the Heroku client package to integrate with the API.
The deployer periodically scans the pipelines it manages for differences between stages. If it detects a difference, it runs a series of checks to see if it should promote the changes.
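To make that loop concrete, here is a minimal sketch of what one reconcile tick could look like. It is written against a hypothetical stageClient interface rather than the real cedar-service-deployer code; the Stage type and the Stages, DeployedCommit, ChecksPass, and Promote methods are illustrative names, not Heroku APIs.

```go
// A sketch of one reconcile tick: compare adjacent stages, run checks,
// promote where the checks pass. Names are illustrative, not the real tool.
package deployer

import (
	"context"
	"log"
)

// Stage is one app in a pipeline, e.g. qa01, va01, ..., va05.
type Stage struct {
	AppID string
	Name  string
}

// stageClient abstracts the Heroku API calls a deployer like this needs.
type stageClient interface {
	// Stages returns a pipeline's apps in promotion order.
	Stages(ctx context.Context, pipelineID string) ([]Stage, error)
	// DeployedCommit returns the git SHA of an app's current release.
	DeployedCommit(ctx context.Context, appID string) (string, error)
	// ChecksPass runs the promotion checks against a target stage.
	ChecksPass(ctx context.Context, target Stage) (bool, error)
	// Promote copies the source app's release to the target app.
	Promote(ctx context.Context, pipelineID, sourceAppID, targetAppID string) error
}

// reconcile walks adjacent stage pairs and promotes wherever the deployed
// commits differ and every check passes. It holds no state of its own:
// each tick re-reads the current state from the Heroku API.
func reconcile(ctx context.Context, c stageClient, pipelineID string) error {
	stages, err := c.Stages(ctx, pipelineID)
	if err != nil {
		return err
	}
	for i := 0; i+1 < len(stages); i++ {
		src, dst := stages[i], stages[i+1]
		srcSHA, err := c.DeployedCommit(ctx, src.AppID)
		if err != nil {
			return err
		}
		dstSHA, err := c.DeployedCommit(ctx, dst.AppID)
		if err != nil {
			return err
		}
		if srcSHA == dstSHA {
			continue // stages already match; nothing to promote here
		}
		if ok, err := c.ChecksPass(ctx, dst); err != nil || !ok {
			continue // do nothing; the next tick will look again
		}
		log.Printf("promoting %s -> %s (%.7s)", src.Name, dst.Name, srcSHA)
		if err := c.Promote(ctx, pipelineID, src.AppID, dst.AppID); err != nil {
			return err
		}
	}
	return nil
}
```

In practice a function like this would run on a timer for every pipeline the deployer owns, which is what "periodically scans" means above.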
The suite of checks for every service is:
If any checks fail, it does nothing and waits for the next tick to check again. Once all the checks pass, it executes a pipeline promotion to push the changes to the next stage and then repeats.
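The promotion itself corresponds to the Platform API's pipeline-promotion endpoint. The actual deployer goes through the Heroku client package, but a raw-HTTP sketch shows the underlying call; the request fields below follow the public API reference as best I recall it and should be verified against the current documentation rather than taken as exact.

```go
// A sketch of the Promote step issued directly against the Platform API's
// pipeline-promotion endpoint; field names are approximate, verify against
// the API reference before relying on them.
package deployer

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

type idRef struct {
	ID string `json:"id"`
}

type appRef struct {
	App idRef `json:"app"`
}

type promotionRequest struct {
	Pipeline idRef    `json:"pipeline"`
	Source   appRef   `json:"source"`
	Targets  []appRef `json:"targets"`
}

// promote asks the Platform API to copy the source app's current release to
// the target app. A concrete stageClient would carry the token itself.
func promote(ctx context.Context, token, pipelineID, sourceAppID, targetAppID string) error {
	body, err := json.Marshal(promotionRequest{
		Pipeline: idRef{ID: pipelineID},
		Source:   appRef{App: idRef{ID: sourceAppID}},
		Targets:  []appRef{{App: idRef{ID: targetAppID}}},
	})
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://api.heroku.com/pipeline-promotions", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Accept", "application/vnd.heroku+json; version=3")
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("pipeline promotion failed: %s", resp.Status)
	}
	return nil
}
```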
The deployer itself is stateless; every time it runs, it checks the state with the Heroku API. It also doesn’t have a strict concept of a “release.” Instead, it’s always trying to reconcile differences between stages in the pipeline. That means there can be concurrent releases rolling out over time, for example here, where the 813a04 commit has been rolled out from qa01 through to va05 while, at the same time, 4cc067 is one stage away from completing its rollout.
There are a few things in particular that made the project successful:
We also learned a few unexpected things along the way!
It’s been about two years since we turned on automated deployments for our first service, and the deployer now manages 23 services. Instead of hours of checking dashboards and coordinating rollouts, Runtime engineers now just hit “merge” and trust that their changes will be released globally as long as everything is healthy.