Backblaze likes to talk about hard drive failures — a lot. What we haven’t talked much about is how we deal with those failures: the daily dance of temp drives, replacement drives, and all the clones that it takes to keep over 100,000 drives healthy. Let’s go behind the scenes and take a look at that dance from the eyes of one Backblaze hard drive.
After sitting still for what seemed like forever, ZCH007BZ was on the move. ZCH007BZ, let’s call him Zach, is a Seagate 12 TB hard drive. For the last few weeks, Zach and over 6,000 friends were securely sealed inside their protective cases in the ready storage area of a Backblaze data center. Being a hard disk drive, Zach’s modest dream was to be installed in a system, spin merrily, and store data for many years to come. And now the wait was nearly over, or was it?
The Life of Zach
Zach was born in a factory in Singapore and shipped to the US, eventually finding his way to Backblaze, but he didn’t know that. He had sat sealed in the dark for weeks. Now Zach and boxes of other drives were removed from their protective cases and gently stacked on a cart. Zach was near the bottom of the pile, but even he could see endless columns of beautiful red boxes stacked seemingly to the sky. “Backblaze!” one of the drives on the cart whispered. All the other drives gasped with recognition. Thank goodness the noise-cancelling headphones worn by all Backblaze Data Center Techs covered the drives’ collective excitement.
While sitting in the dark, the drives had gossiped about where they were: a data center, a distribution warehouse, a Costco, or Best Buy. Backblaze came up a few times, but that was squashed — they couldn’t be that lucky. After all, Backblaze was the only place where a drive could be famous. Before Backblaze, hard drives labored in anonymity. Occasionally, one or two would be seen in a hard drive teardown article, but even that sort of exposure had died out a couple of years ago. But Backblaze publishes everything about their drives, their model numbers, their serial numbers, heck even their S.M.A.R.T. statistics. There was a rumor that hard drives worked extra hard at Backblaze because they knew they would be in the public eye. With red Backblaze Storage Pods as far as the eye could see, Zach and friends were about to find out.
The cart Zach and his friends were on glided to a stop at the production build facility. This is where storage pods are filled with drives and tested before being deployed. The cart stopped by the first of twenty V6.0 Backblaze Storage Pods that together would form a Backblaze Vault. At each Storage Pod station 60 drives were unloaded from the cart. The serial number of each drive was recorded along with the Storage Pod ID and drive location in the pod. Finally, each drive was fitted with a pair of drive guides and slid into its new home as a production drive in a Backblaze Storage Pod. “Spin long and prosper,” Zach said quietly each time the lid of a Storage Pod snapped in place covering the 60 giddy hard drives inside. The process was repeated for the remaining 19 Storage Pods, and when it was done Zach remained on the cart. He would not be installed in a production system today.
The Clone Room
Zach and the remaining drives on the cart were slowly wheeled down the hall. Bewildered, they were rolled in the clone room. “What’s a clone room?” Zach asked himself. The drives on the cart were divided into two groups, with one group being placed on the clone table, and the other being placed on the test table. Zach was on the test table.
Almost as soon as Zach was placed on the test table, the DC Tech picked him up again and placed him and several other drives into a machine. He was about to get formatted. The entire formatting process only took a few minutes for Zach, as it did for all of the other drives on the test table. Zach counted 25 drives, including himself.
Still confused and a little sore from the formatting, Zach and two other drives were picked up from the bench by a different DC Tech. She recorded their vitals — serial number, manufacturer, and model — and left the clone room with all three drives on a different cart.
Dreams of a Test Drive
The three drives were back on the data center floor with red Storage Pods all around. The DC Tech had maneuvered Luigi, the local Storage Pod lift unit, to hold a Storage Pod she was sliding from a data center rack. The lid was opened, the tech attached a grounding clip, and then removed one of the drives in the Storage Pod. She recorded the vitals of the removed drive. While she was doing so, Zach could hear the removed drive breathlessly mumble something about media errors, but before Zach could respond, the tech picked him up, attached drive guides to his frame and gently slid him into the Storage Pod. The tech updated her records, closed the lid, and slid the pod back into place. A few seconds later, Zach felt a jolt of electricity pass through his circuits and he and 59 other drives spun to life. Zach was now part of a production Backblaze Storage Pod.
First, Zach was introduced to the other 19 members of his tome. There are 20 drives in a tome, with each living in a separate Storage Pod. Files are divided (sharded) across these 20 drives using Backblaze’s open-sourced erasure code algorithm.
Zach’s first task was to rebuild all of the files that were stored on the drive he replaced. He’d do this by asking for pieces (shards) of all the files from the 19 other drives in his tome. He only needed 17 of the pieces to rebuild a file, but he asked everyone in case there was a problem. Rebuilding was hard work, and the other drives were often busy with reading files, performing shard integrity checks, and so on. Depending on how busy the system was, and how full the drives were, it might take Zach a couple of weeks to rebuild the files and get him up to speed with his contemporaries.
Nightmares of a Test Drive
Little did he know, but at this point, Zach was still considered a temp replacement drive. The dysfunctional drive that he replaced was making its way back to the clone room where a pair of cloning units, named Harold and Maude in this case, waited. The tech would attempt to clone the contents of the failed drive to a new drive assigned to the clone table. The primary reason for trying to clone a failed drive was recovery speed. A drive can be cloned in a couple of days, but as noted above, it can take up to a couple of weeks to rebuild a drive, especially large drives on busy systems. In short, a successful clone would speed up the recovery process.
For nearly two days straight, Zach was rebuilding. He barely had time to meet his pod neighbors, Cheryl and Carlos. Since they were not rebuilding, they had plenty of time to marvel at how hard Zach was working. He was 25 % done and going strong when the Storage Pod powered down. Moments later, the pod was slid out of the rack and the lid popped open. Zach assumed that another drive in the pod had failed, when he felt the spindly, cold fingers of the tech grab him and yank firmly. He was being replaced.
Zach had done nothing wrong. It was just that the clone was successful, with nearly all the files being copied from the previous drive to the smiling clone drive that was putting on Zach’s drive guides and gently being inserted in Zach’s old slot. “Goodbye,” he managed to eek out as he was placed on the cart and watched the tech bring the Storage Pod back to life. Confused, angry, and mostly exhausted, Zach quickly fell asleep.
Zach woke up just in time to see he was in the formatting machine again. The data he had worked so hard to rebuild was being ripped from his platters and replaced randomly with ones and zeroes. This happened multiple times and just as Zach was ready to scream, it stopped, and he was removed from his torture and stacked neatly with a few other drives.
After a while he looked around, and once the lights went out the stories started. Zach wasn’t alone. Several of the other temp drives had pretty much the same story; they thought they had found a home, only to be replaced by some uppity clone drive. One of the temp drives, Clarice, said she had been in three different systems only to be replaced each time by a clone drive. No one wanted to believe her, but no one knew what was next either.
The Day the Clone Died
Zach found out the truth a few days later when he was selected, inspected, and injected as a temp drive into another Storage Pod. Then three days later he was removed, wiped, reformatted, and placed back in the temp pool. He began to resign himself to life as a temp drive. Not exactly glamorous, but he did get his serial number in the Backblaze Drive Stats data tables while he was a temp. That was more than the millions of other drives in the world that would forever be unknown.
On his third temp drive stint, he was barely in the pod a day when the lid opened and he was unceremoniously removed. This was the life of temp drive, and when the lid opened on the fourth day of his fourth temp drive shift, he just closed his eyes and waited for his dream to end again. Except, this time, the tech’s hand reached past him and grabbed a drive a few slots away. That unfortunate drive had passed the night before, a full-fledged crash. Zach, like all the other drives nearby, had heard the screams.
Another temp drive Zach knew from the temp table replaced the dead drive, then the lid was closed, the pod slid back into place, and power was restored. With that, Zach doubled down on getting rebuilt — maybe if he could get done before the clone was finished then he could stay. What Zach didn’t know was that the clone process for the drive he had replaced had failed. This happens about half the time. Zach was home free; he just didn’t know it.
In a couple of days, Zach was finished rebuilding and become a real member of a production Backblaze Storage Pod. He now spends his days storing and retrieving data, getting his bits tested by shard integrity checks, and having his S.M.A.R.T. stats logged for the Backblaze Drive Stats. His hard drive life is better than he ever dreamed.