{"id":106520,"date":"2022-08-16T10:41:13","date_gmt":"2022-08-16T17:41:13","guid":{"rendered":"https:\/\/www.backblaze.com\/blog\/?p=106520"},"modified":"2024-08-14T12:10:50","modified_gmt":"2024-08-14T19:10:50","slug":"storing-and-querying-analytical-data-in-backblaze-b2","status":"publish","type":"post","link":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/","title":{"rendered":"Storing and Querying Analytical Data in Backblaze B2"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-106521\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png\" alt=\"\" width=\"1440\" height=\"820\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png 1440w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1-300x171.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1-1024x583.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1-768x437.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1-560x319.png 560w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" \/><\/p>\n<div class=\"abstract\" style=\"line-height: 1.8; margin: 24px 12px; padding: 24px 12px 10px 12px;\">Note: This blog is the result of a collaborative effort of the Backblaze Evangelism team members <a href=\"\/blog\/author\/andy\/\" target=\"_blank\" rel=\"noopener\">Andy Klein<\/a>, <a href=\"\/blog\/author\/pat\/\" target=\"_blank\" 
rel=\"noopener\">Pat Patterson<\/a> and <a href=\"\/blog\/author\/greg\/\" target=\"_blank\" rel=\"noopener\">Greg Hamer<\/a>.<\/div>\n<h2>Have You Ever Used Backblaze B2 Cloud Storage for Your Data Analytics?<\/h2>\n<p id=\"bzdropcap\">Backblaze customers find that Backblaze B2 Cloud Storage is optimal for a wide variety of use cases. However, one application that many teams might not yet have tried is using Backblaze B2 for data analytics. You may find that a highly reliable, pre-provisioned storage option like Backblaze B2 Cloud Storage for your data lakes can be a useful and very cost-effective alternative for your data analytics workloads.<\/p>\n<p>This article is an introductory primer on getting started with Backblaze B2 for data analytics, using our <a href=\"https:\/\/www.backblaze.com\/cloud-storage\/resources\/hard-drive-test-data\" target=\"_blank\" rel=\"noopener\">Drive Stats<\/a> as the example of the data being analyzed. For readers new to data lakes, this article can help you get your own data lake up and running on Backblaze B2 Cloud Storage.<\/p>\n<p>As you probably know, a commonly used technology for data analytics is SQL (Structured Query Language). Most people know SQL from databases. However, SQL can also be used against collections of files stored outside of databases, now commonly referred to as data lakes. We will focus here on several options for using SQL to analyze Drive Stats data stored on Backblaze B2 Cloud Storage.<\/p>\n<p>It should be noted that data lakes most frequently prove optimal for read-only or append-only datasets, whereas databases often remain optimal for \u201chot\u201d data with active insert, update, and delete of individual rows, and especially updates of individual column values on individual rows.<\/p>\n<p>We can only scratch the surface of storing, querying, and analyzing tabular data in a single blog post. 
So for this introductory article, we will:<\/p>\n<ul>\n<li>Briefly explain the Drive Stats data.<\/li>\n<li>Introduce open-source Trino as one option for executing SQL against the Drive Stats data.<\/li>\n<li>Query Drive Stats data in raw CSV format, then compare performance after transforming the data into the open-source Apache Parquet format.<\/li>\n<\/ul>\n<p>The sections below take a step-by-step approach, including details on the performance improvements realized when implementing recommended data engineering options. We start with a demonstration of analysis of the raw data, then progress through \u201cdata engineering\u201d that transforms the data into formats that are optimal for accelerating repeated queries of the dataset. We conclude by highlighting our hosted, consolidated, complete Drive Stats dataset.<\/p>\n<p>As mentioned earlier, this blog post is intended only as an introductory primer. In future blog posts, we will detail additional best practices and other common issues and opportunities with data analysis using Backblaze B2.<\/p>\n<h2>Backblaze Hard Drive Data and Stats (aka Drive Stats)<\/h2>\n<p><a href=\"https:\/\/www.backblaze.com\/cloud-storage\/resources\/hard-drive-test-data\" target=\"_blank\" rel=\"noopener\">Drive Stats<\/a> is an open-source data set of daily metrics on the hard drives in Backblaze\u2019s cloud storage infrastructure, which Backblaze has published starting with April 2013. Currently, Drive Stats comprises nearly 300 million records, occupying over 90GB of disk space in raw comma-separated values (CSV) format, rising by over 200,000 records, or about 75MB of CSV data, per day. 
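<\/p>\n<p>As a quick sanity check on those growth figures, a little arithmetic projects how fast the dataset grows. The numbers here are rough estimates derived from the approximate per-day figures above:<\/p>\n<pre>
```python
# Back-of-the-envelope projections from the approximate growth figures above.
records_per_day = 200_000   # new records added per day
csv_mb_per_day = 75         # raw CSV added per day, in MB

records_per_quarter = records_per_day * 90
records_per_year = records_per_day * 365
csv_gb_per_quarter = csv_mb_per_day * 90 / 1000

print(f'{records_per_quarter:,} records per quarter')     # 18,000,000
print(f'{records_per_year:,} records per year')           # 73,000,000
print(f'{csv_gb_per_quarter:.2f} GB of CSV per quarter')  # 6.75
```
<\/pre>\n<p>These rough projections line up with the actual quarterly figures discussed below: roughly 18 million rows and between 6GB and 7GB of raw CSV per quarter.<\/p>\n<p>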
Drive Stats is an append-only dataset, effectively logging daily statistics that, once written, are never updated or deleted.<\/p>\n<p>The Drive Stats dataset is not quite \u201cbig data,\u201d where datasets range from a few dozen terabytes to many zettabytes, but it is large enough that physical data architecture starts to have a significant effect on both the amount of space that the data occupies and how the data can be accessed.<\/p>\n<p>At the end of each quarter, Backblaze creates a CSV file for each day of data, zips those 90 or so files together, and <a href=\"https:\/\/www.backblaze.com\/cloud-storage\/resources\/hard-drive-test-data#downloading-the-raw-hard-drive-test-data\" target=\"_blank\" rel=\"noopener\">makes the compressed file available for download<\/a> from a Backblaze B2 Bucket. While it\u2019s easy to download and decompress a single file containing three months of data, this data architecture is not very flexible. With a little data engineering, though, it\u2019s possible to make analytical data, such as the Drive Stats data set, available for modern data analysis tools to access directly from cloud storage, unlocking new opportunities for data analysis and data science.<\/p>\n<p>Later, for comparison, we include a brief demonstration of data lake performance versus a traditional relational database. Architecturally, one difference between a data lake and a database is that a database integrates the query engine and the data storage. When data is inserted or loaded into a database, it is stored in the database\u2019s optimized internal structures. With a data lake, by contrast, the query engine and the data storage are separate. What we highlight below are basics for optimizing data storage in a data lake to enable the query engine to deliver the fastest query response times.<\/p>\n<p>As with all data analysis, it is helpful to understand details of what the data represents. 
Before showing results, let&#8217;s take a deeper dive into the nature of the Drive Stats data. (For readers interested in first reviewing outcomes and improved query performance results, please skip ahead to the later sections \u201cCompressed CSV\u201d and \u201cEnter Apache Parquet.\u201d)<\/p>\n<h2>Navigating the Drive Stats Data<\/h2>\n<p>At Backblaze we collect a Drive Stats record from each hard drive, each day, containing the following data:<\/p>\n<ul>\n<li><strong>date: <\/strong>the date of collection.<\/li>\n<li><strong>serial_number: <\/strong>the unique serial number of the drive.<\/li>\n<li><strong>model: <\/strong>the manufacturer\u2019s model number of the drive.<\/li>\n<li><strong>capacity_bytes: <\/strong>the drive\u2019s capacity, in bytes.<\/li>\n<li><strong>failure: <\/strong>1 if this was the last day that the drive was operational before failing, 0 if all is well.<\/li>\n<li><strong>A collection of <a href=\"\/blog\/what-smart-stats-indicate-hard-drive-failures\/\" target=\"_blank\" rel=\"noopener\">SMART<\/a> attributes. <\/strong>The number of attributes collected has risen over time; currently we store 87 SMART attributes in each record, each one in both raw and normalized form, with field names of the form smart_<em>n<\/em>_normalized and smart_<em>n<\/em>_raw, where <em>n<\/em> is between 1 and 255.<\/li>\n<\/ul>\n<p>In total, each record currently comprises 179 fields of data describing the state of an individual hard drive on a given day (the number of SMART attributes collected has risen over time).<\/p>\n<h2>Comma-Separated Values, a Lingua Franca for Tabular Data<\/h2>\n<p>A CSV file is a delimited text file that, as its name implies, uses a comma to separate values. Typically, the first line of a CSV file is a header containing the field names for the data, separated by commas. 
The remaining lines in the file hold the data: one line per record, with each line containing the field values, again separated by commas.<\/p>\n<p>Here\u2019s a subset of the Drive Stats data represented as CSV. We\u2019ve omitted most of the SMART attributes to make the records more manageable.<\/p>\n<pre>date,serial_number,model,capacity_bytes,failure,\r\nsmart_1_normalized,smart_1_raw\r\n2022-01-01,ZLW18P9K,ST14000NM001G,14000519643136,0,73,20467240\r\n2022-01-01,ZLW0EGC7,ST12000NM001G,12000138625024,0,84,228715872\r\n2022-01-01,ZA1FLE1P,ST8000NM0055,8001563222016,0,82,157857120\r\n2022-01-01,ZA16NQJR,ST8000NM0055,8001563222016,0,84,234265456\r\n2022-01-01,1050A084F97G,TOSHIBA MG07ACA14TA,14000519643136,0,100,0<\/pre>\n<p>Currently, we create a CSV file for each day\u2019s data, comprising a record for each drive that was operational at the beginning of that day. The CSV files are each named with the appropriate date in year-month-day order, for example, <code>2022-06-28.csv<\/code>. As mentioned above, we make each quarter\u2019s data available as a ZIP file containing the CSV files.<\/p>\n<p>At the beginning of the last Drive Stats quarter, Jan 1, 2022, we were spinning over 200,000 hard drives, so each daily file contained over 200,000 lines and occupied nearly 75MB of disk space. 
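<\/p>\n<p>Because the format is so simple, you can inspect a day\u2019s data with nothing more than Python\u2019s standard library. Here is a minimal sketch that parses a few rows like those shown above; the sample is inlined for illustration, and the failure flag on the last row is flipped purely so the counts are interesting:<\/p>\n<pre>
```python
import csv

# A tiny inline sample in the Drive Stats CSV layout; in practice you
# would open a daily file such as 2022-01-01.csv instead.
sample = '''date,serial_number,model,capacity_bytes,failure
2022-01-01,ZLW18P9K,ST14000NM001G,14000519643136,0
2022-01-01,ZLW0EGC7,ST12000NM001G,12000138625024,0
2022-01-01,ZA1FLE1P,ST8000NM0055,8001563222016,1'''

records = list(csv.DictReader(sample.splitlines()))
failures = sum(int(r['failure']) for r in records)
capacity_tb = sum(int(r['capacity_bytes']) for r in records) / 1e12

print(len(records), 'records,', failures, 'failure(s)')
print(f'{capacity_tb:.2f} TB of total capacity')
```
<\/pre>\n<p>This works well for a single day\u2019s file, but it requires downloading the whole file first. The tooling discussed below moves the query to the data instead.<\/p>\n<p>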
The ZIP file containing the Drive Stats data for the first quarter of 2022 compressed 90 files totaling 6.63GB of CSV data to a single 1.06GB file made available for download <a href=\"https:\/\/f001.backblazeb2.com\/file\/Backblaze-Hard-Drive-Data\/data_Q1_2022.zip\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h2>Big Data Analytics in the Cloud with Trino<\/h2>\n<p>Zipped CSV files allow users to easily download, inspect, and analyze the data locally, but a new generation of tools allows us to explore and query data in situ on Backblaze B2 and other cloud storage platforms. One example is the open-source <a href=\"https:\/\/trino.io\/\" target=\"_blank\" rel=\"noopener\">Trino<\/a> query engine (<a href=\"https:\/\/trino.io\/blog\/2020\/12\/27\/announcing-trino.html\" target=\"_blank\" rel=\"noopener\">formerly known as Presto SQL<\/a>). Trino can natively query data in Backblaze B2, Cassandra, MySQL, and many other data sources without copying that data into its own dedicated store.<\/p>\n<p>A powerful capability of Trino is that it is a distributed query engine, offering what is sometimes referred to as massively parallel processing (MPP). Thus, adding more nodes to your Trino compute cluster can deliver dramatically shorter query execution times. Faster results are always desirable. 
We achieved the results we report below running Trino on only a single node.<\/p>\n<div class=\"abstract\" style=\"line-height: 1.8; margin: 24px 12px; padding: 24px 12px 10px 12px;\">Note: If you are unfamiliar with Trino, the open-source project was previously known as Presto and leverages the Hadoop ecosystem.<\/div>\n<p>In preparing this blog post, our team used <a href=\"https:\/\/twitter.com\/bitsondatadev\" target=\"_blank\" rel=\"noopener\">Brian Olsen<\/a>\u2019s excellent <a href=\"https:\/\/github.com\/bitsondatadev\/trino-getting-started\/tree\/main\/hive\/trino-minio\" target=\"_blank\" rel=\"noopener\">Hive connector over MinIO file storage<\/a> tutorial as a starting point for integrating Trino with Backblaze B2. The tutorial environment includes a preconfigured <a href=\"https:\/\/docs.docker.com\/compose\/\" target=\"_blank\" rel=\"noopener\">Docker Compose<\/a> environment comprising the <a href=\"https:\/\/hub.docker.com\/r\/trinodb\/trino\" target=\"_blank\" rel=\"noopener\">Trino Docker image<\/a> and other required services for working with data in Backblaze B2. We brought up the environment in <a href=\"https:\/\/www.docker.com\/products\/docker-desktop\/\" target=\"_blank\" rel=\"noopener\">Docker Desktop<\/a> on both ThinkPads and MacBook Pros.<\/p>\n<p>As a first step, we downloaded the <a href=\"https:\/\/f001.backblazeb2.com\/file\/Backblaze-Hard-Drive-Data\/data_Q1_2022.zip\" target=\"_blank\" rel=\"noopener\">data set for the most recent quarter<\/a>, unzipped it to our local disks, and then reuploaded the unzipped CSV files into Backblaze B2 buckets. 
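<\/p>\n<p>For anyone following along, that reupload step can be scripted against Backblaze B2\u2019s S3 Compatible API with any standard S3 client, such as boto3. The sketch below is illustrative only\u2013the bucket name and prefix are assumptions, and the upload itself is wrapped in a function so you can supply your own credentials:<\/p>\n<pre>
```python
from pathlib import Path

BUCKET = 'b2-trino-getting-started'  # illustrative bucket name
PREFIX = 'data_q1_2022_csv'          # keep all daily files under one prefix

def object_key(csv_path):
    # One shared prefix lets a single Hive external_location
    # cover every daily file in the quarter.
    return f'{PREFIX}/{Path(csv_path).name}'

def upload_quarter(directory, key_id, application_key):
    import boto3  # third-party: pip install boto3
    s3 = boto3.client(
        's3',
        endpoint_url='https://s3.us-west-004.backblazeb2.com',
        aws_access_key_id=key_id,
        aws_secret_access_key=application_key,
    )
    for csv_file in sorted(Path(directory).glob('*.csv')):
        s3.upload_file(str(csv_file), BUCKET, object_key(csv_file))

print(object_key('/tmp/2022-01-01.csv'))  # data_q1_2022_csv/2022-01-01.csv
```
<\/pre>\n<p>Substitute the S3 Compatible API endpoint for your own bucket\u2019s region.<\/p>\n<p>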
As mentioned above, the uncompressed CSV data occupies 6.63GB of storage, so we confined our initial explorations to just a single day\u2019s data: over 200,000 records, occupying 72.8MB.<\/p>\n<h2>A Word About Apache Hive<\/h2>\n<p>Trino accesses analytical data in Backblaze B2 and other cloud storage platforms via its <a href=\"https:\/\/trino.io\/docs\/current\/connector\/hive.html\" target=\"_blank\" rel=\"noopener\">Hive connector<\/a>. Quoting from the Trino documentation:<\/p>\n<p>The Hive connector allows querying data stored in an <a href=\"https:\/\/hive.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Hive<\/a> data warehouse. Hive is a combination of three components:<\/p>\n<ul>\n<li>Data files in varying formats, that are typically stored in the Hadoop Distributed File System (HDFS) or in object storage systems such as Amazon S3.<\/li>\n<li>Metadata about how the data files are mapped to schemas and tables. This metadata is stored in a database, such as MySQL, and is accessed via the Hive metastore service.<\/li>\n<li>A query language called HiveQL. This query language is executed on a distributed computing framework such as MapReduce or Tez.<\/li>\n<\/ul>\n<p>Trino only uses the first two components: the data and the metadata. 
It does not use HiveQL or any part of Hive\u2019s execution environment.<\/p>\n<p>The Hive connector tutorial includes Docker images for the Hive metastore service (HMS) and MariaDB, so it\u2019s a convenient way to explore this functionality with Backblaze B2.<\/p>\n<h2>Configuring Trino for Backblaze B2<\/h2>\n<p>The tutorial uses <a href=\"https:\/\/min.io\/\" target=\"_blank\" rel=\"noopener\">MinIO<\/a>, an open-source implementation of the Amazon S3 API, so it was straightforward to adapt the sample <a href=\"https:\/\/github.com\/bitsondatadev\/trino-getting-started\/blob\/main\/hive\/trino-minio\/etc\/catalog\/minio.properties\" target=\"_blank\" rel=\"noopener\">MinIO configuration<\/a> to Backblaze B2\u2019s S3 Compatible API by just replacing the endpoint and credentials. Here\u2019s the <code>b2.properties<\/code> file we created:<\/p>\n<pre>connector.name=hive\r\nhive.metastore.uri=thrift:\/\/hive-metastore:9083\r\nhive.s3.path-style-access=true\r\nhive.s3.endpoint=https:\/\/s3.us-west-004.backblazeb2.com\r\nhive.s3.aws-access-key=\r\nhive.s3.aws-secret-key=\r\nhive.non-managed-table-writes-enabled=true\r\nhive.s3select-pushdown.enabled=false\r\nhive.storage-format=CSV\r\nhive.allow-drop-table=true\r\n<\/pre>\n<p>Similarly, we edited the Hive configuration files, again replacing the MinIO configuration with the corresponding Backblaze B2 values. 
Here\u2019s a sample <code>core-site.xml<\/code>:<\/p>\n<pre>&lt;?xml version=\"1.0\"?&gt;\r\n&lt;configuration&gt;\r\n\r\n    &lt;property&gt;\r\n        &lt;name&gt;fs.defaultFS&lt;\/name&gt;\r\n        &lt;value&gt;s3a:\/\/b2-trino-getting-started&lt;\/value&gt;\r\n    &lt;\/property&gt;\r\n\r\n    &lt;!-- B2 properties --&gt;\r\n    &lt;property&gt;\r\n        &lt;name&gt;fs.s3a.connection.ssl.enabled&lt;\/name&gt;\r\n        &lt;value&gt;true&lt;\/value&gt;\r\n    &lt;\/property&gt;\r\n\r\n    &lt;property&gt;\r\n        &lt;name&gt;fs.s3a.endpoint&lt;\/name&gt;\r\n        &lt;value&gt;https:\/\/s3.us-west-004.backblazeb2.com&lt;\/value&gt;\r\n    &lt;\/property&gt;\r\n\r\n    &lt;property&gt;\r\n        &lt;name&gt;fs.s3a.access.key&lt;\/name&gt;\r\n        &lt;value&gt;&lt;my b2 application key id&gt;&lt;\/value&gt;\r\n    &lt;\/property&gt;\r\n\r\n    &lt;property&gt;\r\n        &lt;name&gt;fs.s3a.secret.key&lt;\/name&gt;\r\n        &lt;value&gt;&lt;my b2 application key&gt;&lt;\/value&gt;\r\n    &lt;\/property&gt;\r\n\r\n    &lt;property&gt;\r\n        &lt;name&gt;fs.s3a.path.style.access&lt;\/name&gt;\r\n        &lt;value&gt;true&lt;\/value&gt;\r\n    &lt;\/property&gt;\r\n\r\n    &lt;property&gt;\r\n        &lt;name&gt;fs.s3a.impl&lt;\/name&gt;\r\n        &lt;value&gt;org.apache.hadoop.fs.s3a.S3AFileSystem&lt;\/value&gt;\r\n    &lt;\/property&gt;\r\n\r\n&lt;\/configuration&gt;\r\n<\/pre>\n<p>We made a similar set of edits to <code>metastore-site.xml<\/code> and restarted the Docker instances so our changes took effect.<\/p>\n<h2>Uncompressed CSV<\/h2>\n<p>Our first test validated creating a table and running a query on a single-day CSV data set. 
Hive tables are configured with the directory containing the actual data files, so we uploaded <code>2022-01-01.csv<\/code> from a local disk to <code>data_20220101_csv\/2022-01-01.csv<\/code> in a Backblaze B2 bucket, opened the Trino CLI, and created a schema and a table:<\/p>\n<pre>CREATE SCHEMA b2.ds\r\nWITH (location = 's3a:\/\/b2-trino-getting-started\/');\r\n\r\nUSE b2.ds;\r\n\r\nCREATE TABLE jan1_csv (\r\n    date VARCHAR,\r\n    serial_number VARCHAR,\r\n    model VARCHAR,\r\n    capacity_bytes VARCHAR,\r\n    failure VARCHAR,\r\n    smart_1_normalized VARCHAR,\r\n    smart_1_raw VARCHAR,\r\n    ...\r\n    smart_255_normalized VARCHAR,\r\n    smart_255_raw VARCHAR)\r\nWITH (format = 'CSV',\r\n    skip_header_line_count = 1,\r\n    external_location = 's3a:\/\/b2-trino-getting-started\/data_20220101_csv');<\/pre>\n<p>Unfortunately, the Trino Hive connector only supports the VARCHAR data type when accessing CSV data, but, as we\u2019ll see in a moment, we can use the CAST function in queries to convert character data to numeric and other types.<\/p>\n<p>Now to run some queries! A good test is to check if all the data is there:<\/p>\n<pre>trino:ds&gt; SELECT COUNT(*) FROM jan1_csv;\r\n _col0  \r\n--------\r\n 206954 \r\n(1 row)\r\n\r\nQuery 20220629_162533_00024_qy4c6, FINISHED, 1 node\r\nSplits: 8 total, 8 done (100.00%)\r\n8.23 [207K rows, 69.4MB] [25.1K rows\/s, 8.43MB\/s]\r\n<\/pre>\n<div class=\"abstract\" style=\"line-height: 1.8; margin: 24px 12px; padding: 24px 12px 10px 12px;\">Note: If you\u2019re wondering about the discrepancy between the size of the CSV file\u201372.8MB\u2013and the amount of data read by Trino\u201369.4MB\u2013it\u2019s accounted for by the different usage of the \u2018MB\u2019 abbreviation. For instance, macOS interprets MB as a <em>megabyte<\/em>, 1,000,000 bytes, while Trino is reporting <em>mebibytes<\/em>, 1,048,576 bytes. Strictly speaking, Trino should use the abbreviation MiB. 
Pat <a href=\"https:\/\/github.com\/trinodb\/trino\/issues\/13054\" target=\"_blank\" rel=\"noopener\">opened an issue<\/a> for this (with a goal of fixing it and submitting a pull request to the Trino project).<\/div>\n<p>Now let\u2019s see how many drives failed that day, grouped by the drive model:<\/p>\n<pre>trino:ds&gt; SELECT model, COUNT(*) as failures \r\n       -&gt; FROM jan1_csv \r\n       -&gt; WHERE failure = 1 \r\n       -&gt; GROUP BY model \r\n       -&gt; ORDER BY failures DESC;\r\n       model        | failures \r\n--------------------+----------\r\n TOSHIBA MQ01ABF050 |        1 \r\n ST4000DM005        |        1 \r\n ST8000NM0055       |        1 \r\n(3 rows)\r\n\r\nQuery 20220629_162609_00025_qy4c6, FINISHED, 1 node\r\nSplits: 17 total, 17 done (100.00%)\r\n8.23 [207K rows, 69.4MB] [25.1K rows\/s, 8.43MB\/s]\r\n<\/pre>\n<p>Notice that the query execution time is identical between the two queries. This makes sense\u2013the time taken to run the query is dominated by the time required to download the data from Backblaze B2.<\/p>\n<p>Finally, we can use the CAST function with SUM and ROUND to see how many exabytes of storage we were spinning on that day:<\/p>\n<pre>trino:ds&gt; SELECT ROUND(SUM(CAST(capacity_bytes AS bigint))\/1e+18, 2) FROM jan1_csv;\r\n _col0 \r\n-------\r\n  2.25 \r\n(1 row)\r\n\r\nQuery 20220629_172703_00047_qy4c6, FINISHED, 1 node\r\nSplits: 12 total, 12 done (100.00%)\r\n7.83 [207K rows, 69.4MB] [26.4K rows\/s, 8.86MB\/s]\r\n<\/pre>\n<p>Although these query times may seem long, note that we are querying the raw data. The approach we are highlighting here with Drive Stats data can also be applied to querying data in log files. As new records are written to this append-only dataset, they immediately appear as new rows in query results. This is very powerful for both real-time and near real-time analysis, and faster performance can be achieved by scaling out the Trino cluster. Remember, Trino is a distributed query engine. 
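<\/p>\n<p>You are also not limited to the Trino CLI\u2013the same SQL can be issued programmatically. Here is a brief sketch using the open-source <code>trino<\/code> Python client; the host, port, and user shown are assumptions based on the tutorial\u2019s single-node setup:<\/p>\n<pre>
```python
# Issue the failure-count query from Python rather than the Trino CLI.
# Assumes the trino client package (pip install trino) and a coordinator
# listening on localhost:8080, as in the tutorial environment.
FAILURE_QUERY = '''SELECT model, COUNT(*) AS failures
FROM jan1_csv
WHERE failure = 1
GROUP BY model
ORDER BY failures DESC'''

def failed_models(host='localhost', port=8080):
    import trino  # third-party Trino client library
    conn = trino.dbapi.connect(host=host, port=port, user='demo',
                               catalog='b2', schema='ds')
    cur = conn.cursor()
    cur.execute(FAILURE_QUERY)
    return cur.fetchall()  # list of [model, failures] rows

print(FAILURE_QUERY.splitlines()[0])  # SELECT model, COUNT(*) AS failures
```
<\/pre>\n<p>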
For this demonstration, we have limited Trino to running on just a single node.<\/p>\n<h2>Compressed CSV<\/h2>\n<p>This is pretty neat, but not exactly fast. Extrapolating, we might expect it to take about 12 minutes to run a query against a whole quarter of Drive Stats data.<\/p>\n<p>Can we improve performance? Absolutely\u2013we simply need to reduce the amount of data that needs to be downloaded for each query!<\/p>\n<p>Data pipelines, often known as ETL for Extract, Transform, and Load, are commonplace in the world of data analytics. Where data is repeatedly queried, it is often advantageous to \u201ctransform\u201d data from the raw form in which it originates into some format better optimized for the repeated queries that follow through the next stages of that data\u2019s life cycle.<\/p>\n<p>For our next test, we will perform an elementary transformation of the data: a lossless compression of the CSV data with Hive\u2019s preferred gzip format, resulting in an 11.7MB file, <code>2022-01-01.csv.gz<\/code>. After uploading the compressed file to <code>data_20220101_csv_gz\/2022-01-01.csv.gz<\/code>, we created a second table, copying the schema from the first:<\/p>\n<pre>CREATE TABLE jan1_csv_gz (\r\n\tLIKE jan1_csv\r\n)\r\nWITH (FORMAT = 'CSV',\r\n    EXTERNAL_LOCATION = 's3a:\/\/b2-trino-getting-started\/data_20220101_csv_gz');<\/pre>\n<p>Trying the failure count query:<\/p>\n<pre>trino:ds&gt; SELECT model, COUNT(*) as failures \r\n       -&gt; FROM jan1_csv_gz \r\n       -&gt; WHERE failure = 1 \r\n       -&gt; GROUP BY model \r\n       -&gt; ORDER BY failures DESC;\r\n       model        | failures \r\n--------------------+----------\r\n TOSHIBA MQ01ABF050 |        1 \r\n ST8000NM0055       |        1 \r\n ST4000DM005        |        1 \r\n(3 rows)\r\n\r\nQuery 20220629_162713_00027_qy4c6, FINISHED, 1 node\r\nSplits: 15 total, 15 done (100.00%)\r\n2.71 [207K rows, 11.1MB] [76.4K rows\/s, 4.1MB\/s]\r\n<\/pre>\n<p>As you might expect, given that Trino has to 
download less than \u2159 as much data as previously, the query time fell dramatically\u2013from just over 8 seconds to under 3 seconds. Can we do even better than this?<\/p>\n<h2>Enter Apache Parquet<\/h2>\n<p>The issue with running this kind of analytical query is that it often results in a \u201cfull table scan\u201d\u2013Trino has to read the model and failure fields from every record to execute the query. The row-oriented layout of CSV data means that Trino ends up reading the entire file. We can get around this by using a file format designed specifically for analytical workloads.<\/p>\n<p>While CSV files comprise a line of text for each record, <a href=\"https:\/\/parquet.apache.org\/\" target=\"_blank\" rel=\"noopener\">Parquet<\/a> is a <em>column-oriented, binary<\/em> file format, storing the binary values for each column contiguously. Here\u2019s a simple visualization of the difference between row and column orientation:<\/p>\n<h3>Table representation:<\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-106549 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2022\/08\/1.png\" alt=\"\" width=\"396\" height=\"139\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/1.png 396w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/1-300x105.png 300w\" sizes=\"auto, (max-width: 396px) 100vw, 396px\" \/><\/p>\n<h3>Row orientation:<\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-106550 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2022\/08\/2.png\" alt=\"\" width=\"633\" height=\"107\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/2.png 633w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/2-300x51.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/2-560x95.png 560w\" sizes=\"auto, (max-width: 
633px) 100vw, 633px\" \/><\/p>\n<h3>Column Orientation:<\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-106551 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2022\/08\/3.png\" alt=\"\" width=\"635\" height=\"108\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/3.png 635w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/3-300x51.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/3-560x95.png 560w\" sizes=\"auto, (max-width: 635px) 100vw, 635px\" \/><br \/>\nParquet also implements <a href=\"https:\/\/en.wikipedia.org\/wiki\/Run-length_encoding\" target=\"_blank\" rel=\"noopener\">run-length encoding<\/a> and other compression techniques. Where a series of records have the same value for a given field the Parquet file need only store the value and the number of repetitions:<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-106552 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2022\/08\/4.png\" alt=\"\" width=\"632\" height=\"107\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/4.png 632w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/4-300x51.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/4-560x95.png 560w\" sizes=\"auto, (max-width: 632px) 100vw, 632px\" \/><br \/>\nThe result is a compact file format well suited for analytical queries.<\/p>\n<p>There are many tools to manipulate tabular data from one format to another. 
In this case, we wrote a very simple Python script that used the <a href=\"https:\/\/arrow.apache.org\/docs\/python\/index.html\" target=\"_blank\" rel=\"noopener\">pyarrow<\/a> library to do the job:<\/p>\n<pre>import pyarrow.csv as csv\r\nimport pyarrow.parquet as parquet\r\n\r\nfilename = '2022-01-01.csv'\r\n\r\n# Read the CSV file into memory, then write it back out in Parquet format\r\nparquet.write_table(csv.read_csv(filename),\r\n                    filename.replace('.csv', '.parquet'))\r\n<\/pre>\n<p>The resulting Parquet file occupies 12.8MB\u2013only 1.1MB more than the gzip file. Again, we uploaded the resulting file and created a table in Trino.<\/p>\n<pre>CREATE TABLE jan1_parquet (\r\n    date DATE,\r\n    serial_number VARCHAR,\r\n    model VARCHAR,\r\n    capacity_bytes BIGINT,\r\n    failure TINYINT,\r\n    smart_1_normalized BIGINT,\r\n    smart_1_raw BIGINT,\r\n    ...\r\n    smart_255_normalized BIGINT,\r\n    smart_255_raw BIGINT)\r\nWITH (FORMAT = 'PARQUET',\r\n    EXTERNAL_LOCATION = 's3a:\/\/b2-trino-getting-started\/data_20220101_parquet');\r\n<\/pre>\n<p>Note that the conversion to Parquet automatically inferred appropriate data types, which we used in the table definition.<\/p>\n<p>Let\u2019s run a query and see how Parquet fares against compressed CSV:<\/p>\n<pre>trino:ds&gt; SELECT model, COUNT(*) as failures \r\n       -&gt; FROM jan1_parquet \r\n       -&gt; WHERE failure = 1 \r\n       -&gt; GROUP BY model \r\n       -&gt; ORDER BY failures DESC;\r\n       model        | failures \r\n--------------------+----------\r\n TOSHIBA MQ01ABF050 |        1 \r\n ST4000DM005        |        1 \r\n ST8000NM0055       |        1 \r\n(3 rows)\r\n\r\nQuery 20220629_163018_00031_qy4c6, FINISHED, 1 node\r\nSplits: 15 total, 15 done (100.00%)\r\n0.78 [207K rows, 334KB] [265K rows\/s, 427KB\/s]\r\n<\/pre>\n<p>The test query executes in well under a second! Looking at the last line of output, we can see that the same number of rows were read, but only 334KB of data was retrieved. 
Trino was able to retrieve just the two columns it needed, out of the 179 columns in the file, to run the query.<\/p>\n<p>Similar analytical queries execute just as efficiently. Calculating the total amount of storage in exabytes:<\/p>\n<pre>trino:ds&gt; SELECT ROUND(SUM(capacity_bytes)\/1e+18, 2) FROM jan1_parquet;\r\n _col0 \r\n-------\r\n  2.25 \r\n(1 row)\r\n\r\nQuery 20220629_163058_00033_qy4c6, FINISHED, 1 node\r\nSplits: 10 total, 10 done (100.00%)\r\n0.83 [207K rows, 156KB] [251K rows\/s, 189KB\/s]\r\n<\/pre>\n<p>What was the capacity of the largest drive in terabytes?<\/p>\n<pre>trino:ds&gt; SELECT max(capacity_bytes)\/1e+12 FROM jan1_parquet;\r\n      _col0      \r\n-----------------\r\n 18.000207937536 \r\n(1 row)\r\n\r\nQuery 20220629_163139_00034_qy4c6, FINISHED, 1 node\r\nSplits: 10 total, 10 done (100.00%)\r\n0.80 [207K rows, 156KB] [259K rows\/s, 195KB\/s]\r\n<\/pre>\n<p>Parquet\u2019s columnar layout excels with analytical workloads, but if we try a query more suited to an operational database, Trino has to read the entire file, as we would expect:<\/p>\n<pre>trino:ds&gt; SELECT * FROM jan1_parquet WHERE serial_number = 'ZLW18P9K';\r\n    date    | serial_number |     model     | capacity_bytes | failure\r\n------------+---------------+---------------+----------------+--------\r\n 2022-01-01 | ZLW18P9K      | ST14000NM001G | 14000519643136 |       0\r\n(1 row)\r\n\r\nQuery 20220629_163206_00035_qy4c6, FINISHED, 1 node\r\nSplits: 5 total, 5 done (100.00%)\r\n2.05 [207K rows, 12.2MB] [101K rows\/s, 5.95MB\/s]\r\n<\/pre>\n<h2>Scaling Up<\/h2>\n<p>After validating our Trino configuration with just a single day\u2019s data, our next step up was to create a Parquet file containing an entire quarter. 
The file weighed in at 1.0GB, a little smaller than the zipped CSV.<\/p>\n<p>Here\u2019s the failed drives query for the entire quarter, limited to the top 10 results:<\/p>\n<pre>trino:ds&gt; SELECT model, COUNT(*) as failures \r\n       -&gt; FROM q1_2022_parquet \r\n       -&gt; WHERE failure = 1 \r\n       -&gt; GROUP BY model \r\n       -&gt; ORDER BY failures DESC \r\n       -&gt; LIMIT 10;\r\n        model         | failures \r\n----------------------+----------\r\n ST4000DM000          |      117 \r\n TOSHIBA MG07ACA14TA  |       88 \r\n ST8000NM0055         |       86 \r\n ST12000NM0008        |       73 \r\n ST8000DM002          |       38 \r\n ST16000NM001G        |       24 \r\n ST14000NM001G        |       24 \r\n HGST HMS5C4040ALE640 |       21 \r\n HGST HUH721212ALE604 |       21 \r\n ST12000NM001G        |       20 \r\n(10 rows)\r\n\r\nQuery 20220629_183338_00050_qy4c6, FINISHED, 1 node\r\nSplits: 43 total, 43 done (100.00%)\r\n3.38 [18.8M rows, 15.8MB] [5.58M rows\/s, 4.68MB\/s]\r\n<\/pre>\n<p>Of course, those are absolute failure numbers; they don\u2019t take account of how many of each drive model are in use. 
We can construct a more complex query that tells us the percentages of failed drives, by model:<\/p>\n<pre>trino:ds&gt; SELECT drives.model AS model, drives.drives AS drives, \r\n       -&gt;   failures.failures AS failures, \r\n       -&gt;   ROUND((CAST(failures AS double)\/drives)*100, 6) AS percentage\r\n       -&gt; FROM\r\n       -&gt; (\r\n       -&gt;   SELECT model, COUNT(*) as drives \r\n       -&gt;   FROM q1_2022_parquet \r\n       -&gt;   GROUP BY model\r\n       -&gt; ) AS drives\r\n       -&gt; RIGHT JOIN\r\n       -&gt; (\r\n       -&gt;   SELECT model, COUNT(*) as failures \r\n       -&gt;   FROM q1_2022_parquet \r\n       -&gt;   WHERE failure = 1 \r\n       -&gt;   GROUP BY model\r\n       -&gt; ) AS failures\r\n       -&gt; ON drives.model = failures.model\r\n       -&gt; ORDER BY percentage DESC\r\n       -&gt; LIMIT 10;\r\n        model         | drives | failures | percentage \r\n----------------------+--------+----------+------------\r\n ST12000NM0117        |    873 |        1 |   0.114548 \r\n ST10000NM001G        |   1028 |        1 |   0.097276 \r\n HGST HUH728080ALE604 |   4504 |        3 |   0.066607 \r\n TOSHIBA MQ01ABF050M  |  26231 |       13 |    0.04956 \r\n TOSHIBA MQ01ABF050   |  24765 |       12 |   0.048455 \r\n ST4000DM005          |   3331 |        1 |   0.030021 \r\n WDC WDS250G2B0A      |   3338 |        1 |   0.029958 \r\n ST500LM012 HN        |  37447 |       11 |   0.029375 \r\n ST12000NM0007        | 118349 |       19 |   0.016054 \r\n ST14000NM0138        | 144333 |       17 |   0.011778 \r\n(10 rows)\r\n\r\nQuery 20220629_191755_00010_tfuuz, FINISHED, 1 node\r\nSplits: 82 total, 82 done (100.00%)\r\n8.70 [37.7M rows, 31.6MB] [4.33M rows\/s, 3.63MB\/s]\r\n<\/pre>\n<p>This query took twice as long as the last one! Again, data transfer time is the limiting factor\u2013Trino downloads the data for each subquery. 
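<\/p>
<p>As a sanity check, the percentage arithmetic from that query is easy to reproduce in plain Python, using values from the result rows above:<\/p>

```python
# The percentage column from the Trino query, reproduced in plain Python.
# Input values are taken from the query results above.
def failure_percentage(failures: int, drives: int) -> float:
    # Mirror the SQL: CAST(failures AS double) / drives * 100, ROUND(..., 6).
    # In Python 3 the / operator already yields a float, so no cast is needed.
    return round(failures / drives * 100, 6)


# ST12000NM0117: 1 failure out of 873 drives
print(failure_percentage(1, 873))  # 0.114548, matching the first result row
```

<p>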
A real-world deployment would take advantage of the <a href=\"https:\/\/trino.io\/blog\/2024\/03\/08\/cache-refresh.html\" target=\"_blank\" rel=\"noopener\">Hive Connector\u2019s storage caching feature<\/a> to avoid repeatedly retrieving the same data.<\/p>\n<h2>Picking the Right Tool for the Job<\/h2>\n<p>You might be wondering how a relational database would stack up against the Trino\/Parquet\/Backblaze B2 combination. As a quick test, we installed PostgreSQL 14 on a MacBook Pro, loaded the same quarter\u2019s data into a table, and ran the same set of queries:<\/p>\n<h3>Count Rows<\/h3>\n<pre>sql_stmt=# \\timing\r\nTiming is on.\r\nsql_stmt=# SELECT COUNT(*) FROM q1_2022;\r\n\r\n  count   \r\n----------\r\n 18845260\r\n(1 row)\r\n\r\nTime: 1579.532 ms (00:01.580)\r\n<\/pre>\n<h3>Absolute Number of Failures<\/h3>\n<pre>sql_stmt=# SELECT model, COUNT(*) as failures\r\nFROM q1_2022\r\nWHERE failure = 't'\r\nGROUP BY model\r\nORDER BY failures DESC\r\nLIMIT 10;\r\n\r\n        model         | failures \r\n----------------------+----------\r\n ST4000DM000          |      117\r\n TOSHIBA MG07ACA14TA  |       88\r\n ST8000NM0055         |       86\r\n ST12000NM0008        |       73\r\n ST8000DM002          |       38\r\n ST14000NM001G        |       24\r\n ST16000NM001G        |       24\r\n HGST HMS5C4040ALE640 |       21\r\n HGST HUH721212ALE604 |       21\r\n ST12000NM001G        |       20\r\n(10 rows)\r\n\r\nTime: 2052.019 ms (00:02.052)\r\n<\/pre>\n<h3>Relative Number of Failures<\/h3>\n<pre>sql_stmt=# SELECT drives.model AS model, drives.drives AS drives,\r\n  failures.failures,\r\n  ROUND((CAST(failures AS numeric)\/drives)*100, 6) AS percentage\r\nFROM\r\n(\r\n  SELECT model, COUNT(*) as drives\r\n  FROM q1_2022\r\n  GROUP BY model\r\n) AS drives\r\nRIGHT JOIN\r\n(\r\n  SELECT model, COUNT(*) as failures\r\n  FROM q1_2022\r\n  WHERE failure = 't'\r\n  GROUP BY model\r\n) AS failures\r\nON drives.model = failures.model\r\nORDER BY percentage DESC\r\nLIMIT 10;\r\n\r\n        model         | drives | failures | percentage \r\n----------------------+--------+----------+------------\r\n ST12000NM0117        |    873 |        1 |   0.114548\r\n ST10000NM001G        |   1028 |        1 |   0.097276\r\n HGST HUH728080ALE604 |   4504 |        3 |   0.066607\r\n TOSHIBA MQ01ABF050M  |  26231 |       13 |   0.049560\r\n TOSHIBA MQ01ABF050   |  24765 |       12 |   0.048455\r\n ST4000DM005          |   3331 |        1 |   0.030021\r\n WDC WDS250G2B0A      |   3338 |        1 |   0.029958\r\n ST500LM012 HN        |  37447 |       11 |   0.029375\r\n ST12000NM0007        | 118349 |       19 |   0.016054\r\n ST14000NM0138        | 144333 |       17 |   0.011778\r\n(10 rows)\r\n\r\nTime: 3831.924 ms (00:03.832)\r\n<\/pre>\n<h3>Retrieve a Single Record by Serial Number and Date<\/h3>\n<p>Modifying the query, since we have an entire quarter\u2019s data:<\/p>\n<pre>sql_stmt=# SELECT * FROM q1_2022 WHERE serial_number = 'ZLW18P9K' AND date = '2022-01-01';\r\n    date    | serial_number |     model     | capacity_bytes | failure\r\n------------+---------------+---------------+----------------+--------\r\n 2022-01-01 | ZLW18P9K      | ST14000NM001G | 14000519643136 | f\r\n(1 row)\r\n\r\nTime: 1690.091 ms (00:01.690)\r\n<\/pre>\n<p>For comparison, we tried to run the same query against the quarter\u2019s data in Parquet format, but Trino crashed with an out-of-memory error after 58 seconds. Clearly some tuning of the default configuration is required!<\/p>\n<p>Bringing the numbers together for the quarterly datasets, with all times in seconds:<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-106535\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2022\/08\/Screen-Shot-2022-08-16-at-10.00.46-AM.png\" alt=\"\" width=\"535\" height=\"175\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/Screen-Shot-2022-08-16-at-10.00.46-AM.png 535w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/Screen-Shot-2022-08-16-at-10.00.46-AM-300x98.png 300w\" sizes=\"auto, (max-width: 535px) 100vw, 535px\" \/><br \/>\nPostgreSQL is faster for most operations, but not by much, especially considering that its data is on a local SSD rather than Backblaze B2!<\/p>\n<p>It\u2019s worth mentioning that there are further tuning optimizations that we have not demonstrated in this exercise. For instance, the Trino Hive connector supports storage caching, which yields additional performance gains by avoiding repeated retrieval of the same data from Backblaze B2. Further, Trino is a distributed query engine with a horizontally scalable architecture, so it can also deliver shorter query run times by adding more nodes to your Trino compute cluster. 
We have limited all timings in this demonstration to Trino running on just a single node.<\/p>\n<h2>Partitioning Your Data Lake<\/h2>\n<p>Our final exercise was to create a single Drive Stats dataset containing all nine years of Drive Stats data. As stated above, at the time of writing the full Drive Stats dataset comprises nearly 300 million records, occupying over 90GB of disk space when in raw CSV format, rising by over 200,000 records per day, or about 75MB of CSV data.<\/p>\n<p>As the dataset grows in size, an additional data engineering best practice is to include partitions.<\/p>\n<p>In the introduction we mentioned that databases use optimized internal storage structures. Foremost among these are indexes. Data lakes have limited support for indexes. Data lakes do, however, support partitions. Data lake partitions are functionally similar to what databases alternately refer to as either a primary key index or index-organized tables. Regardless of the name, they effectively achieve faster data retrieval by having the data itself physically sorted. Since Drive Stats is append-only, when sorting on a date field, new records are appended to the dataset.<\/p>\n<p>Having the data physically sorted greatly aids retrieval in cases that are known as range queries. To achieve fastest retrieval on a given query, it is important to only retrieve data that resolves true on the predicate in the WHERE clause. In the case of Drive Stats, for a query on only a single month or several consecutive months we get the fastest time to the result if we can read only the data for these months. Without partitioning Trino would need to do a full table scan, resulting in slower response due to the overhead of reading records for which the WHERE clause logic resolves to false. Organizing the Drive Stats data into partitions enables Trino to efficiently skip records that resolve the WHERE clause to false. 
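<\/p>
<p>The idea of partition pruning can be pictured in a few lines of Python: with objects laid out under prefixes such as \/drivestats\/year=2022\/month=1\/, the engine only needs to read the prefixes that satisfy the predicate (the helper below is an illustration, not Trino\u2019s actual implementation):<\/p>

```python
# Sketch of partition pruning over a partitioned object layout. The prune
# helper is illustrative, not Trino's implementation: given keys stored under
# year=.../month=... prefixes, only keys matching the predicate are read.
import re


def prune(keys, year, month):
    """Return only the object keys in the year=/month= partition."""
    # The trailing slash stops month=1 from also matching month=12.
    pattern = re.compile(rf"/year={year}/month={month}/")
    return [key for key in keys if pattern.search(key)]
```

<p>A query with WHERE year = 2022 AND month = 1 then touches a single prefix instead of scanning every object. 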
Thus with partitions, many queries are far more efficient and incur the read cost only of those records whose WHERE clause logic resolves to true.<\/p>\n<p>Our final transformation required a tweak to the Python script to iterate over all of the Drive Stats CSV files, writing Parquet files partitioned by year and month, so the files have prefixes of the form:<\/p>\n<pre>\/drivestats\/year={year}\/month={month}\/<\/pre>\n<p>For example:<\/p>\n<pre>\/drivestats\/year=2021\/month=12\/<\/pre>\n<p>The number of SMART attributes reported can change from one day to the next, and a single Parquet file can have only one schema, so there are one or more files with each prefix, named:<\/p>\n<pre>{year}-{month}-{index}.parquet<\/pre>\n<p>For example:<\/p>\n<pre>2021-12-1.parquet<\/pre>\n<p>Again, we uploaded the resulting files and created a table in Trino:<\/p>\n<pre>CREATE TABLE drivestats (\r\n    serial_number VARCHAR,\r\n    model VARCHAR,\r\n    capacity_bytes BIGINT,\r\n    failure TINYINT,\r\n    smart_1_normalized BIGINT,\r\n    smart_1_raw BIGINT,\r\n    ...\r\n    smart_255_normalized BIGINT,\r\n    smart_255_raw BIGINT,\r\n    day SMALLINT,\r\n    year SMALLINT,\r\n    month SMALLINT\r\n)\r\nWITH (format = 'PARQUET',\r\n      partitioned_by = ARRAY['year', 'month'],\r\n      external_location = 's3a:\/\/b2-trino-getting-started\/drivestats-parquet');<\/pre>\n<p>Note that the conversion to Parquet automatically formatted the data using appropriate types, which we used in the table definition.<\/p>\n<p>This command tells Trino to scan for partition files:<\/p>\n<pre>CALL system.sync_partition_metadata('ds', 'drivestats', 'FULL');\r\n<\/pre>\n<p>Let\u2019s run a query and see the performance against the full Drive Stats dataset in Parquet format, partitioned by month:<\/p>\n<pre>trino:ds&gt; SELECT COUNT(*) FROM drivestats;\r\n   _col0   \r\n-----------\r\n296413574 \r\n(1 row)\r\n\r\nQuery 20220707_182743_00055_tshdf, FINISHED, 1 node\r\nSplits: 412 total, 412 done 
(100.00%)\r\n15.84 [296M rows, 5.63MB] [18.7M rows\/s, 364KB\/s]<\/pre>\n<p>It takes 16 seconds to count the total number of records, reading only 5.6MB of the 15.3GB total data.<\/p>\n<p>Next, let\u2019s run a query against just one month\u2019s data:<\/p>\n<pre>trino:ds&gt; SELECT COUNT(*) FROM drivestats WHERE year = 2022 AND month = 1;\r\n  _col0  \r\n---------\r\n 6415842 \r\n(1 row)\r\n\r\nQuery 20220707_184801_00059_tshdf, FINISHED, 1 node\r\nSplits: 16 total, 16 done (100.00%)\r\n0.85 [6.42M rows, 56KB] [7.54M rows\/s, 65.7KB\/s]<\/pre>\n<p>Counting the records for a given month takes less than a second, retrieving just 56KB of data\u2013partitioning is working!<\/p>\n<p>Now we have the entire Drive Stats data set loaded into Backblaze B2 in an efficient format and layout for running queries. Our next blog post will look at some of the queries we\u2019ve run to clean up the data set and gain insight into nine years of hard drive metrics.<\/p>\n<h2>Conclusion<\/h2>\n<p>We hope that this article inspires you to try using Backblaze for your data analytics workloads if you\u2019re not already doing so, and that it also serves as a useful primer to help you set up your own data lake using Backblaze B2 Cloud Storage. Our Drive Stats data is just one example of the type of data set that can be used for data analytics on Backblaze B2.<\/p>\n<p>Hopefully, you too will find that Backblaze B2 Cloud Storage can be a useful, powerful, and very cost effective option for your data lake workloads.<\/p>\n<p><strong>If you\u2019d like to get started working with analytical data in Backblaze B2, <a href=\"https:\/\/www.backblaze.com\/b2\/sign-up.html?referrer=nopref\" target=\"_blank\" rel=\"noopener\">sign up here<\/a> for 10 GB storage, free of charge, and get to work. 
If you\u2019re already storing and querying analytical data in Backblaze B2, please let us know in the comments what tools you\u2019re using and how it\u2019s working out for you!<\/strong><\/p>\n<p>If you already work with Trino (or other data lake analytic engines), and would like connection credentials for our partitioned, Parquet, complete Drive Stats data set that is now hosted on Backblaze B2 Cloud Storage, please contact us at <a href=\"mailto:evangelism@backblaze.com\" target=\"_blank\" rel=\"noopener\">evangelism@backblaze.com<\/a>.<br \/>\nFuture blog posts focused on Drive Stats and analytics will be using this complete Drive Stats dataset.<\/p>\n<p>Similarly, please let us know if you would like to run a proof of concept hosting your own data in a Backblaze B2 data lake and would like the assistance of the Backblaze Developer Evangelism team.<\/p>\n<p>And lastly, if you think this article may be of interest to your colleagues, we\u2019d very much appreciate you sharing it with them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Read this primer on getting started using Backblaze B2 Cloud Storage for data analytics. We&#8217;ll walk you through several options using SQL to analyze our Drive Stats data. 
<\/p>\n","protected":false},"author":178,"featured_media":106521,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[7,483,479],"tags":[468],"class_list":["post-106520","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cloud-storage","category-tech-lab","category-technology","tag-b2cloud","entry"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Storing and Querying Analytical Data in Backblaze B2<\/title>\n<meta name=\"description\" content=\"Uncover how to optimize data analytics with Backblaze B2 Cloud Storage, leveraging SQL and Trino for efficient querying of Drive Stats data.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Storing and Querying Analytical Data in Backblaze B2\" \/>\n<meta property=\"og:description\" content=\"Uncover how to optimize data analytics with Backblaze B2 Cloud Storage, leveraging SQL and Trino for efficient querying of Drive Stats data.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/\" \/>\n<meta property=\"og:site_name\" content=\"Backblaze Blog | Cloud Storage &amp; Cloud Backup\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/backblaze\" \/>\n<meta property=\"article:published_time\" content=\"2022-08-16T17:41:13+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-08-14T19:10:50+00:00\" \/>\n<meta 
property=\"og:image\" content=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1440\" \/>\n\t<meta property=\"og:image:height\" content=\"820\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Greg Hamer\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@backblaze\" \/>\n<meta name=\"twitter:site\" content=\"@backblaze\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Greg Hamer\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"26 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Storing and Querying Analytical Data in Backblaze B2","description":"Uncover how to optimize data analytics with Backblaze B2 Cloud Storage, leveraging SQL and Trino for efficient querying of Drive Stats data.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/","og_locale":"en_US","og_type":"article","og_title":"Storing and Querying Analytical Data in Backblaze B2","og_description":"Uncover how to optimize data analytics with Backblaze B2 Cloud Storage, leveraging SQL and Trino for efficient querying of Drive Stats data.","og_url":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/","og_site_name":"Backblaze Blog | Cloud Storage &amp; Cloud 
Backup","article_publisher":"https:\/\/www.facebook.com\/backblaze","article_published_time":"2022-08-16T17:41:13+00:00","article_modified_time":"2024-08-14T19:10:50+00:00","og_image":[{"width":1440,"height":820,"url":"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png","type":"image\/png"}],"author":"Greg Hamer","twitter_card":"summary_large_image","twitter_creator":"@backblaze","twitter_site":"@backblaze","twitter_misc":{"Written by":"Greg Hamer","Est. reading time":"26 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/#article","isPartOf":{"@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/"},"author":{"name":"Greg Hamer","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#\/schema\/person\/0ba2b0b510f8f3ea0ef4d8b72b69bff1"},"headline":"Storing and Querying Analytical Data in Backblaze B2","datePublished":"2022-08-16T17:41:13+00:00","dateModified":"2024-08-14T19:10:50+00:00","mainEntityOfPage":{"@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/"},"wordCount":3891,"commentCount":0,"publisher":{"@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/#primaryimage"},"thumbnailUrl":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png","keywords":["B2Cloud"],"articleSection":["Cloud Storage","Tech 
Lab","Technology"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/","url":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/","name":"Storing and Querying Analytical Data in Backblaze B2","isPartOf":{"@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/#primaryimage"},"image":{"@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/#primaryimage"},"thumbnailUrl":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png","datePublished":"2022-08-16T17:41:13+00:00","dateModified":"2024-08-14T19:10:50+00:00","description":"Uncover how to optimize data analytics with Backblaze B2 Cloud Storage, leveraging SQL and Trino for efficient querying of Drive Stats 
data.","breadcrumb":{"@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/#primaryimage","url":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png","contentUrl":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png","width":1440,"height":820},{"@type":"BreadcrumbList","@id":"https:\/\/www.backblaze.com\/blog\/storing-and-querying-analytical-data-in-backblaze-b2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Storing and Querying Analytical Data in Backblaze B2"}]},{"@type":"WebSite","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#website","url":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/","name":"Backblaze Cloud Solutions Blog","description":"Cloud Storage &amp; Cloud 
Backup","publisher":{"@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#organization","name":"Backblaze","url":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/i0.wp.com\/www.backblaze.com\/blog\/wp-content\/uploads\/2017\/12\/backblaze_icon_transparent.png?fit=512%2C512&ssl=1","contentUrl":"https:\/\/i0.wp.com\/www.backblaze.com\/blog\/wp-content\/uploads\/2017\/12\/backblaze_icon_transparent.png?fit=512%2C512&ssl=1","width":512,"height":512,"caption":"Backblaze"},"image":{"@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/backblaze","https:\/\/x.com\/backblaze","https:\/\/www.youtube.com\/user\/Backblaze","https:\/\/en.wikipedia.org\/wiki\/Backblaze"]},{"@type":"Person","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#\/schema\/person\/0ba2b0b510f8f3ea0ef4d8b72b69bff1","name":"Greg Hamer","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/05\/Hamer_Greg-1-150x150.jpg","url":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/05\/Hamer_Greg-1-150x150.jpg","contentUrl":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/05\/Hamer_Greg-1-150x150.jpg","caption":"Greg Hamer"},"description":"Greg Hamer is the senior developer evangelist. 
Greg is a builder and has spent his career working with teams capturing, storing, scaling, securing, analyzing, and speeding up data storage and retrieval systems. He is happy to assist you, your team, and your community deciding how to optimize your various data storage requirements. Connect with Greg on LinkedIn.","url":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/author\/greg\/"}]}},"jetpack_featured_media_url":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2022\/08\/bb-bh-Storing-and-Querying-Analytical-Data-in-Backblaze-B2_DesignA1.png","_links":{"self":[{"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/posts\/106520","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/users\/178"}],"replies":[{"embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/comments?post=106520"}],"version-history":[{"count":0,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/posts\/106520\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/media\/106521"}],"wp:attachment":[{"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/media?parent=106520"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/categories?post=106520"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/tags?post=106520"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}