If you maintain cloud infrastructure as part of your job, as our Cloud Operations team here at Backblaze does, you’ll recognize the wisdom in the mantra, “Automate early, automate often”. When you’re working with tens, hundreds, or even thousands of production servers, manually applying changes gets old very quickly!
Today, Backblaze is releasing a new open source project: Boardwalk, hosted on GitHub at https://github.com/Backblaze/boardwalk, to help automate rolling maintenance jobs like kernel and operating system (OS) upgrades. Boardwalk is a linear Ansible workflow engine, written in Python, that our infrastructure systems engineers built to help automate complex operations tasks for large numbers of production hosts.
Why did Backblaze create Boardwalk?
Back in 2021, the Backblaze Storage Cloud platform comprised about 1,800 servers, the majority of which were Storage Pods. Upgrading those machines to a new OS version was an arduous task. The job took over a year and required well over 1,000 hours of hands-on toil by our data center staff. It was clear that we would need to automate the next OS upgrade, especially since it would involve even more machines.
While a range of tools is available for this kind of work, we couldn’t just feed a list of server addresses into one of them and set it loose. Each Storage Pod is a server fitted with between 26 and 60 hard drives containing customer data, plus a boot drive holding the server’s OS. Twenty Pods make up a Backblaze Vault.
Normal storage operations are as follows: Incoming customer data is assigned to a Vault for storage, then split into 20 shards, each of which is stored in a separate Pod. (I’m skipping some of the details here; for the full story, see How Backblaze Scales Our Storage Cloud). If you’ve followed our Drive Stats blog posts over the years, you’ll know that, at our scale, drives fail every day, so any one of those Pods can be taken temporarily offline for a drive replacement at any time.
This architecture means that we have to be quite intentional when we take Pods offline for upgrade.
Remotely upgrading the OS on a Storage Pod takes about 40 minutes. When the Pod goes offline for upgrade, we put its Vault into read-only mode so that the upgraded server doesn’t have to catch up with writes that occurred while it was offline; the remaining 19 Pods in the Vault can still serve read requests. While one Storage Pod is being upgraded, we absolutely do not want a second Storage Pod in the same Vault to be upgrading at the same time.
Doing so would reduce read performance for the Vault, since fewer Storage Pods would be available to handle incoming requests, and would increase the risk that random drive failures in the other Pods could take the entire Vault offline. Once the upgrade is complete and the Pod comes back online, the Vault is returned to read-write mode.
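To make the sequencing concrete, here is a minimal Python sketch of that per-Pod flow. The helper names (vault.set_mode, pod.run_os_upgrade, and so on) are hypothetical placeholders, not Backblaze internals; the point is the order of operations: verify the rest of the Vault is healthy, switch to read-only, upgrade exactly one Pod, then restore read-write.

```python
# Hypothetical sketch of the per-Pod upgrade sequence described above.
# None of these helpers are real Backblaze APIs; they stand in for the
# health checks and mode switches the article describes.

def upgrade_one_pod(vault, pod):
    # Only proceed if every *other* Pod in the Vault is online and healthy,
    # so at most one Pod is ever out of service at a time.
    peers = [p for p in vault.pods if p is not pod]
    if not all(p.is_healthy() for p in peers):
        raise RuntimeError(f"Vault {vault.id} is degraded; postponing {pod.id}")

    vault.set_mode("read-only")       # avoid writes the Pod would have to catch up on
    try:
        pod.run_os_upgrade()          # takes roughly 40 minutes per Pod
        pod.wait_until_healthy()
    finally:
        vault.set_mode("read-write")  # restore normal service once the Pod is back
```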
The challenge of automation at scale
Backblaze has a long history of using Ansible to configure and deploy changes to its fleets of servers. However, while Ansible is a very capable agentless, modular, remote execution and configuration management engine, it isn’t well suited to complex, multi-stage operations tasks at Backblaze’s scale. Ansible playbooks have always helped us automate most of the process of managing so many servers, but eventually we hit challenges trying to reduce human toil even further.
Ansible is connection-oriented and most operations are performed on remote hosts, rather than on the administrative machine. From the administrative machine, Ansible connects to a remote host, copies code over, and executes it. There’s no practical way to run pre-checks on a host before connecting to it. This makes long-running background jobs difficult to work with using Ansible alone.
For example, if a playbook is running for days or weeks and fails, Ansible doesn’t retain any knowledge of where it left off, and can’t make any offline decisions about which hosts it needs to finish up with. When the playbook is re-run, Ansible will attempt to connect to all of the hosts it had previously connected to, potentially resulting in a long recovery time for a failed job. Considering Backblaze runs thousands of Storage Pods, this takes a long time!
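The workaround Boardwalk generalizes is simple in principle: keep a local record of which hosts have already completed, so a re-run only touches the remainder. Here is a minimal sketch of that idea as a wrapper around the ansible-playbook command; the playbook name, host list, and state file are placeholders, and this is an illustration of the concept rather than how Boardwalk itself is implemented.

```python
# Sketch: persist per-host progress locally so a failed or interrupted run
# can resume without reconnecting to hosts that already finished.
import json
import pathlib
import subprocess

STATE = pathlib.Path("upgrade_state.json")
hosts = ["pod-001", "pod-002", "pod-003"]  # placeholder inventory

done = set(json.loads(STATE.read_text())) if STATE.exists() else set()

for host in hosts:
    if host in done:
        continue  # completed on a previous run; skip without connecting
    subprocess.run(
        ["ansible-playbook", "upgrade.yml", "--limit", host],
        check=True,
    )
    done.add(host)
    STATE.write_text(json.dumps(sorted(done)))  # record progress after each host
```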
The reality was that we needed something more, but also wanted to leverage all of our history with Ansible, including the playbooks that we had built, and the skills we already had. So we decided to build a workflow engine around Ansible, and we called it Boardwalk.
What does Boardwalk do?
We created Boardwalk to manage these kinds of long-running Ansible workflows, codifying our vast experience operating storage systems at scale. Boardwalk makes it easy to define workflows composed of a series of jobs to perform tasks on hosts using Ansible. It connects to hosts one at a time, runs jobs in a defined order, and maintains local state as it goes; this makes stopping and resuming long-running Ansible workflows easy and efficient. It’s designed and built to be easy for DevOps and systems engineers to introduce and for frontline operators to use, while leveraging existing playbooks.
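As a rough mental model only (this is not Boardwalk’s actual API; see the project README for real workflow definitions), a workflow pairs an ordered list of jobs with per-host progress tracking, along these lines:

```python
# Conceptual model only -- class and method names here are illustrative,
# not Boardwalk's real interface.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Workflow:
    jobs: list[Callable[[str], None]]   # ordered steps, each run against one host
    completed: dict[str, int] = field(default_factory=dict)  # host -> last finished job index

    def run(self, hosts: list[str]) -> None:
        for host in hosts:              # one host at a time
            start = self.completed.get(host, 0)
            for i, job in enumerate(self.jobs[start:], start=start):
                job(host)               # e.g., invoke an Ansible playbook against the host
                self.completed[host] = i + 1  # record progress so a rerun resumes here
```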
One of Boardwalk’s features is its ability to connect to a host and determine whether it should run a job on that host now, or leave it until later. When we use Boardwalk to perform rolling OS upgrades, it connects to a Pod and requests that the Pod temporarily remove itself from its Vault. The Pod checks that the other 19 Pods in the Vault are online and healthy; if so, then that Pod proceeds. Then Boardwalk can run the Ansible playbook to upgrade it. If, on the other hand, one or more of the other Pods are offline for some reason, that Pod sends a failure response to Boardwalk, causing the upgrade to be postponed until the Vault is in its correct state.
When Boardwalk is working on a host, it acquires a virtual “lock,” and saves its progress as it walks through the steps. The lock prevents multiple instances of Boardwalk from conflicting with each other, and the progress state allows Boardwalk to pick up where it left off in case of failure. If something does go wrong, an alert brings a human into the loop. Once a Pod has been successfully upgraded, Boardwalk updates its local state accordingly.
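The locking idea can be illustrated with a short sketch. Boardwalk’s own locks and state handling are richer than this (again, see the README), but the principle is the same: take an exclusive lock before doing any work so two instances can’t step on each other. The lock path and function names below are illustrative, and the example uses a Unix-only file lock.

```python
# Illustrative sketch of the "one instance at a time" guard, using a local
# file lock. This is not how Boardwalk implements its lock; it only shows
# the concept of refusing to run if another instance is already active.
import fcntl


def with_workflow_lock(lock_path: str, run_workflow) -> None:
    with open(lock_path, "w") as lock_file:
        # Raises BlockingIOError if another instance already holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            run_workflow()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```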
In practice, for OS upgrades, we run a single Boardwalk workflow per data center, which keeps things simple. It has a list of all of the servers it needs to upgrade, and quietly works down the list, with little or no manual intervention.
Using this approach, in our most recent OS upgrade we were able to upgrade 6,000 servers over the course of nine months, with zero impact on availability and minimal intervention from data center staff. Customers were able to read files regardless of whether a Pod was being upgraded in one of the Vaults holding their data; file uploads were automatically sent to Pods in read-write mode.
What can I do with Boardwalk?
Today, we are releasing Boardwalk under the MIT License, a permissive open source license with very few restrictions on reuse. You are free to download Boardwalk, run it yourself, modify it, build it into a product, even sell it, as long as you observe the terms of the license.
We anticipate that most Boardwalk users will be able to use it as-is to automate long-running jobs across large numbers of hosts, but we welcome contributions from the community, whether they be documentation, examples, fixes, or enhancements.
We do not require contributors to sign a Contributor License Agreement (CLA) or Developer Certificate of Origin (DCO); instead, we simply accept contributions subject to the GitHub Terms of Service, specifically section D.6, which states, helpfully, in both legalese and plain English:
Whenever you add Content to a repository containing notice of a license, you license that Content under the same terms, and you agree that you have the right to license that Content under those terms. If you have a separate agreement to license that Content under different terms, such as a contributor license agreement, that agreement will supersede.
Isn’t this just how it works already? Yep. This is widely accepted as the norm in the open-source community; it’s commonly referred to by the shorthand “inbound=outbound”. We’re just making it explicit.
The CONTRIBUTING file explains how to build and test Boardwalk, and how to submit your contribution via a pull request. After you submit your pull request, a project maintainer will review it and respond within two weeks, likely much sooner unless we are flooded with contributions!
How do I get started?
The README file at https://github.com/Backblaze/boardwalk is the best place to start—it contains much more detail on Boardwalk’s architecture, design, installation, and usage. Feel free to ask questions at the Boardwalk project discussions page, or file an issue if you encounter a bug or see an opportunity to enhance Boardwalk. We hope you find Boardwalk useful, and look forward to hearing how you’re using it!
We’d like to express our gratitude to Mat Hornbeek for not only writing the initial version of Boardwalk, but also kindly contributing to this article some time after he moved on from Backblaze to a new opportunity. Thanks, Mat!