<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>iceberg — bits on data</title>
    <link>https://bitsondata.dev/tag:iceberg</link>
    <description>&gt;_ the imposter&#39;s guide to software, data, and life</description>
    <pubDate>Fri, 17 Apr 2026 11:40:18 +0000</pubDate>
    <image>
      <url>https://i.snap.as/vWVqkBBl.png</url>
      <title>iceberg — bits on data</title>
      <link>https://bitsondata.dev/tag:iceberg</link>
    </image>
    <item>
      <title>Iceberg won the table format war</title>
      <link>https://bitsondata.dev/iceberg-won-the-table-format-war?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[the night sky is lit up over the water&#xA;&#xA;Photo by Michail Dementiev on Unsplash&#xA;&#xA;TL;DR: I believe Apache Iceberg won the table format wars, not because of a feature race, but primarily because of the open Iceberg spec. There are some features only available in Iceberg because it broke compatibility with Hive, which also contributed to adoption of the implementation.&#xA;&#xA;Disclaimer: I am the Head of Developer Relations at Tabular and a Developer Advocate in both the Apache Iceberg and Trino communities. All of my 🌶️ takes are my biased opinion and not necessarily the opinion of Tabular, the Apache Software Foundation, the Trino Software Foundation, or the communities I work with. This post also goes into a bit of my personal story about leaving my previous company, but it relates to my reasoning, so I offer a TL;DR above if you don’t care about the details.&#xA;&#xA;My revelation with Iceberg&#xA;&#xA;Two months ago, I made the difficult decision to leave Starburst, which was, hands down, the best job I’ve had up to this point. Since I left, I’ve had a lot of questions about my motivations for leaving and wanted to put some concerns to rest. This role allowed me to get deeply involved in open-source during working hours and showed me how I could help the community get traction on its work and drive the roadmap for many in the project. This was a new calling that overlapped with many altruistic parts of how I define myself and was deeply rewarding.&#xA;&#xA;I made some incredible friends, some of whom have become invaluable mentors during this process of learning the nuances and interplay between venture capital and an open-source community. So why did I leave this job that I love so much?&#xA;&#xA;Apache Iceberg Baby&#xA;&#xA;Let’s time-travel (pun intended) to the first Iceberg episode of the Trino Community Broadcast. 
In true ADHD form, I crammed learning about Apache Iceberg well into the night before the broadcast with the creator of Iceberg, Ryan Blue. While setting up that demo, I really started to understand what a game-changer Iceberg was. I had heard the Trino users and maintainers talk about Iceberg replacing Hive but it just didn’t sink in for the first couple of months. I mean really, what could be better than Hive? 🥲&#xA;&#xA;While researching I learned about hidden partitioning, schema evolution, and most importantly, the open specification. The whole package was just such an elegant solution to problems that had caused me and many in the Trino community failed deployments and late-night calls. Just as I had the epiphany with Trino (Presto at the time) of how big of a productivity booster SQL queries over multiple systems were, I had a similar experience with Iceberg that night. Preaching the combination of these two became somewhat of a mission of mine after that.&#xA;&#xA;ANSI SQL + Iceberg + Parquet + S3&#xA;&#xA;Immediately after that show, I wrote a four-blog series on Trino on Iceberg, did a talk, and built out the getting started repository for Iceberg. I was rather hooked on the thought of these two technologies in tandem. You start out with a system that can connect to any data source you throw at it and sees it as yet another SQL table. Take that system and add a table format that interacts with all the popular analytics engines people use today from Spark, Snowflake, and DuckDB, to Trino variants like EMR, Athena, and Starburst.&#xA;&#xA;Standards all the way down&#xA;&#xA;This approach to data virtualization is so interesting as each system offers full leverage over vendors trying to lock you in their particular query language or storage format. It pushes the incentives for vendors to support these open standards which puts them in a seemingly vulnerable position compared to locking users in. 
However, I hope vendors will slowly come to understand that this vulnerability is a fallacy. With the open S3 storage standard, open file standards like Parquet, open table standards like Iceberg, and the ANSI SQL spec closely followed by Trino, the entire analytics warehouse has become a modular stack of truly open components. This is not just open in the sense of the freedom to use and contribute to a project, but open in the standards that give you the freedom to simply move between different projects.&#xA;&#xA;This new freedom gives the user the features of a data warehouse, with the scalability of the cloud, and a free market of a la carte services to handle your needs at whatever price point you need at any given time. All users need to do in this new ecosystem is shop around and choose any open project or vendor that implements the open standard, and migration costs become practically non-existent. This is the ultimate definition of future-proofing your architecture.&#xA;&#xA;Back to why I left Starburst&#xA;&#xA;Trino Community&#xA;&#xA;I’ll quickly tie up why I left Starburst before I reveal why Iceberg won the table format wars. For the last three years, I have worked on building awareness around Trino. My partner in crime, Manfred Moser, had been in the Trino community and literally wrote the book on Trino. Together we spent long days and nights growing the Trino community. I loved every minute of it and honestly didn’t see myself leaving Starburst or shifting focus from Trino until it became a standard in analytics organizations.&#xA;&#xA;Something became apparent though. Trino community health was thriving, and there were many organic product movements taking place in the Trino community. Cole Bowden was boosting the Trino release process, getting us to cutting Trino releases every 1-2 weeks, which is unprecedented in open-source. 
Cole, Manfred, and I did a manual scan over the open pull requests, gracefully closing or reviving outdated or abandoned ones. The Trino community is in great shape.&#xA;&#xA;Iceberg Community&#xA;&#xA;As I looked at Iceberg, the adoption and awareness were growing at an unprecedented rate with Snowflake, BigQuery, Starburst, and Athena all announcing support between 2021 and 2022. However, nothing was moving the needle forward from a developer experience perspective. There was some initial amazing work done by Sam Redai, but there was still so much to be done. I noticed the Iceberg documentation needed improvement. While many vendors were advocating for Iceberg, there was nobody putting in consistent work on the vanilla Iceberg site. PMC members like Ryan Blue, Jack Ye, Russell Spitzer, Dan Weeks, and many others are collectively doing a great job of driving roadmap features for Iceberg, but no individual currently has the time to dedicate to the cat herding, improving communication in the project, or bettering the developer and contributor experience for users. Since Trino was on stable ground, it felt imperative to move to Iceberg and fill in these gaps. When Ryan approached me with a Head of DevRel position at Tabular, I couldn’t pass up the opportunity. To be clear, I left Starburst but not the Trino community. Working on Iceberg also helps my mission of continuing to champion these two technologies that I believe in so much.&#xA;&#xA;Tell us why Iceberg won already!&#xA;&#xA;Moving back to the meat of the subject. My first blog in the Trino community covered what I once called the invisible Hive spec, to alleviate confusion around why Trino would need a “Hive” connector if it’s a query engine itself. The reason we called the Hive connector as such is that it translated Parquet files sitting in an S3 object store into a schema that could be read and modified via a query engine that knew the Hive spec. This had nothing to do with Hive the query engine, but Hive the spec. 
Why was the Hive spec invisible 👻? Because nobody wrote it down. It lived in the minds of the engineers who put Hive together, spread across Cloudera forums by engineers who had bashed their heads against the wall and reverse-engineered the binaries to understand this “spec”.&#xA;&#xA;Why do you even need a spec?&#xA;&#xA;Having an “invisible” spec was rather problematic, as every implementation of that spec ran on different assumptions. I spent days searching Hive solutions on Hortonworks/Cloudera forums trying to solve an error with solutions that, for whatever reason, didn’t work on my instance of Hive. There were also implicit dependencies required to use the Hive spec. It’s actually incorrect to call the Hive spec as such, because a spec should be independent of platform, programming language, and hardware, while including few implementation details beyond what interoperability requires. SQL, for instance, is a declarative language that runs on systems written in C++, Rust, and Java, and doesn’t get into the business of telling query engines how to answer a query, just what behavior is expected.&#xA;&#xA;The open spec is why Iceberg won&#xA;&#xA;By the time there were various iterations of Hive from the Hadoop and Big Data boom, there was no central place to lay a stake in the ground and standardize this spec. While you may imagine that we would have learned our lesson, until Iceberg, none of the projects officially formalized their assumptions and specifications for their project. 
Delta Lake and Hudi extended the Hive models to make migration from Hive simpler but kept a few of the issues that Hive introduced, like exposing the partitioning format to users running analysis.&#xA;&#xA;Open specifications and vendor politics&#xA;&#xA;You may be skeptical that an open specification for a table format holds such weight, but it crept its way into a well-known feud between two large analytics vendors of the day, Databricks and Snowflake. Databricks has been the leading vendor in the data lakehouse market while Snowflake dominated the data warehouse market. Snowflake’s strategy initially involved encouraging movement from the data lakehouse market back to the data warehouse market, while Databricks’ original strategy was to do the exact opposite. Both reflect the outdated trap that many vendors still fall prey to: the idea that locking users in will help your business and stakeholders reduce churn and keep customers in the long run. Vendor lock-in was once a decades-long play, but as B2C consumerism expectations creep into B2B consumerism, we are seeing a gradual shift of practitioners demanding interoperability from vendors to give them the level of autonomy they have experienced with open-source projects.&#xA;&#xA;This surfaced with Snowflake when they announced that they would be offering support for Iceberg external table functionality in 2022. At first, I raised my eyebrow at this, as I figured it was a feeble attempt for Snowflake to market itself as an open-friendly warehouse when the external table would be nothing but an easy way to migrate data from the data lake to Snowflake. Whatever their motives, this was good visibility for Iceberg and I was thrilled that Snowflake was showcasing the need for an open specification, gimmick or not. 
This even pressured other competing data warehouses like BigQuery to add Iceberg support.&#xA;&#xA;The final signal that the open Iceberg spec won&#xA;&#xA;I didn’t, however, expect to see what took place at both Snowflake Summit and Data + AI Summit this year, 2023. If you don’t know, these are Snowflake and Databricks’ big events, which happened in the same week as an extension of their feud. Snowflake had already dropped some hints that they were ramping up their Iceberg support, but did a great job of keeping it under wraps. The final reveal came at Snowflake Summit; Snowflake now offers managed Iceberg catalog support.&#xA;&#xA;I was thrilled to see that finally, a data warehouse vendor understands there’s no way to beat open standards and that they should adopt open storage as part of their core business model. What’s even better about this picture is that they show both Trino and Spark engines as open compute alternatives to their own engine. This was beyond my expectations for Snowflake, and definitely showed me they were heading in the right direction for their customers.&#xA;&#xA;Meanwhile, in a California town not far away, Databricks had a response to Snowflake’s announcement stored up. Delta Lake 3.0 was announced, and it now supports compatibility across Delta Lake, Iceberg, and Hudi (albeit with limited Iceberg compatibility for now). Whatever happens, there is now an opportunity for Databricks customers to trial Iceberg, and this should excite everyone. We’re one step closer to having this spec become the common denominator. With both of these moves, from a company that initially only wanted their proprietary format to win and a company that built on a competing format, I am of the opinion that Iceberg has won the format wars.&#xA;&#xA;Now I don’t want to act like these vendors simply have users’ best interests in mind. All they want is for you to make the decision to choose them. 
I work for a vendor; we also want your money. What I am seeing, though, is that the industry is now trending towards incentivizing openness as more customers demand it. In order for companies to stay ahead, they must embrace this fact rather than fight it. What is rather historic about this moment is that since the dawn of analytics and the data warehouse, there has been vendor lock-in on many fronts of the analytics market. This, to me, signals the nail in the coffin of any capability for vendors to do this on the level playing field of open standards. It’s now up to the vendors to keep you happy and keep you from churning. This, in the long run, is good for users and vendors and will ultimately drive better products.&#xA;&#xA;One quick aside: some may mention that Hudi recently added a “spec”, so why does Iceberg having a spec give it the winning vote? I recommend you go back to the purpose of an open spec section earlier in this blog, then look at both the Iceberg spec and the Hudi “spec” and determine which one satisfies the criteria. Hudi’s “spec” exposes Java dependencies, making it unusable for any system not running on the JDK, doesn’t clarify schema, and has a lot of implementation details rather than leaving those to the systems that implement it. This “spec” is something closer to an architecture document than an open spec.&#xA;&#xA;Will the Iceberg project drive Delta Lake and Hudi to nonexistence?&#xA;&#xA;Maybe, or maybe not. That really depends on the feature set that these three table formats offer and what actual value they bring to users. In other words, you decide what’s important, we decide what we’re going to support, and vendors and the larger data community decide who stays and who goes. This simply boils down to what users want, and whether there is enough deviation for it to make sense for multiple formats to exist. 
Having competition in open-source is every bit as healthy as having competition in the vendor space - with some exceptions:&#xA;&#xA;Exception 1: The projects don’t collectively build on an open specification. This enables vendor lock-in under the guise of being “open” simply because the source is available.&#xA;&#xA;Exception 2: The projects are just duplicated efforts of each other. This is a sad waste of time and creates analysis paralysis for users. If two projects have clear value added in different domains with some overlap, then the choice becomes clearer based on the use case.&#xA;&#xA;Exception 3: Any of the competing OSS projects primarily serve the needs of one or a few companies over the larger community. This exception is an anti-pattern of innovation brought up in the Innovator’s Dilemma. Focusing on the needs of the few will eventually force the majority to move to the next thing. Truly open projects will continue to evolve at the pace of the industry.&#xA;&#xA;Just because I’m touting that Apache Iceberg has “won the table format wars” with an open spec does not mean I am discrediting the hard work done by the competing table formats or advocating not to use them. Delta Lake has made sizeable efforts to stay competitive on a feature-by-feature basis. I’m also still good friends with Mr. DeltaLake himself, Denny Lee. I hold no grudge against these amazing people trying to do better for their users and customers. I am also excited at the fact that all formats now have some level of interoperability for users outside of the Iceberg ecosystem!&#xA;&#xA;And…Hudi?&#xA;&#xA;I wrestled with my take on Hudi. I really enjoyed working with the folks from Hudi, and while I usually follow the “don’t say anything unless it’s nice” rule, it would be disingenuous not to give my real opinion on this matter. 
I even quoted Reddit user u/JudgingYouThisSecond initially, to avoid saying it in my own words, but even his take still doesn’t quite capture my thoughts concisely:&#xA;&#xA;  Ultimately, there is room enough for there to be several winners in the table format space:&#xA;  * Hudi, for folks who care about or need near-real-time ingestion and streaming. This project will likely be #3 in the overall pecking order as it&#39;s become less relevant and seems to be losing steam relative to the other two in this list (at least IMO)&#xA;  * Delta Lake, for people who play a lot in the Databricks ecosystem and don&#39;t mind that Delta Lake (while OSS) may end up being a bit of a &#34;gateway drug&#34; into a pay-to-play world&#xA;  * Iceberg, for folks looking to avoid vendor lock-in and are looking for a table format with a growing community behind it.&#xA;    While my opinion may change as these projects evolve, I consider myself an &#34;Iceberg Guy&#34; at the moment.&#xA;    Comment by u/JudgingYouThisSecond in dataengineering&#xA;&#xA;I feel like the open-source Hudi project is just not in a state where I would recommend it to a friend. I once thought perhaps Hudi’s upserts made their complexity worth it, but both Delta and Iceberg have improved on this front. My opinion from supporting all three formats in the Trino community is that Hudi is an overly complex implementation with many leaky abstractions, such as exposing Hadoop and other legacy dependencies (e.g. the KryoSerializer), and it doesn’t actually provide proper transaction semantics.&#xA;&#xA;What I personally hope for is that we get to a point where we converge down to the minimum set of table format projects needed to scratch the different itches that shouldn’t be overlapping, all adhering to the Iceberg specification. In essence, really only the Iceberg spec has won, and Iceberg is currently the only implementation to support it. 
I would be thrilled to see the Delta Lake and Hudi projects support the Iceberg spec, making this a fair race and making the new question: which Iceberg implementation won the most adoption? That game will be fun to play. Imagine Tabular, Databricks, Cloudera, and Onehouse all building on the same spec!&#xA;&#xA;Note: Please don’t respond with silly performance benchmarks with TPC (or any standardized dataset in a vendor-controlled benchmark) to suggest the performance of these benchmarks is relevant to this conversation. It’s not.&#xA;&#xA;Thanks for reading, and if you disagree and want to debate, or love this and want to discuss it, reach out to me on LinkedIn, Twitter, or the Apache Iceberg Slack.&#xA;&#xA;Stay classy, Data Folks!&#xA;&#xA;#iceberg #opensource #openstandard&#xA;&#xA;bits]]&gt;</description>
      <content:encoded><![CDATA[<p><a href="https://images.unsplash.com/photo-1650434908449-9e7db28bbf5a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzfHxzdGFycyUyMGljZWJlcmd8ZW58MHx8fHwxNjgxNzg1OTE4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080"><img src="https://images.unsplash.com/photo-1650434908449-9e7db28bbf5a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzfHxzdGFycyUyMGljZWJlcmd8ZW58MHx8fHwxNjgxNzg1OTE4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" alt="the night sky is lit up over the water" title="the night sky is lit up over the water"/></a></p>

<p>Photo by <a href="https://unsplash.com/@bot_va">Michail Dementiev</a> on <a href="https://unsplash.com/">Unsplash</a></p>

<p><strong>TL;DR:</strong> I believe Apache Iceberg won the table format wars, not because of a feature race, but primarily because of <a href="https://iceberg.apache.org/spec/">the open Iceberg spec</a>. There are <a href="https://iceberg.apache.org/docs/latest/partitioning/#icebergs-hidden-partitioning">some</a> <a href="https://iceberg.apache.org/spec/#snapshot-reference">features</a> only available in Iceberg because it broke compatibility with Hive, which also contributed to adoption of the implementation.</p>



<p><strong>Disclaimer:</strong> I am the Head of Developer Relations at Tabular and a Developer Advocate in both the Apache Iceberg and Trino communities. All of my 🌶️ takes are my biased opinion and not necessarily the opinion of Tabular, the Apache Software Foundation, the Trino Software Foundation, or the communities I work with. This post also goes into a bit of my personal story about leaving my previous company, but it relates to my reasoning, so I offer a TL;DR above if you don’t care about the details.</p>



<h2 id="my-revelation-with-iceberg">My revelation with Iceberg</h2>

<p>Two months ago, I made the difficult decision to leave <a href="https://www.starburst.io/">Starburst</a>, which was, hands down, the best job I’ve had up to this point. Since I left, I’ve had a lot of questions about my motivations for leaving and wanted to put some concerns to rest. This role allowed me to get deeply involved in open-source during working hours and showed me how I could help the community get traction on its work and drive the roadmap for many in the project. This was a new calling that overlapped with many altruistic parts of how I define myself and was deeply rewarding.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c6c7f7-dc35-4960-8808-02a4a3a723b7_2000x1015.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c6c7f7-dc35-4960-8808-02a4a3a723b7_2000x1015.png" alt=""/></a></p>

<p>I made some incredible friends, some of whom have become invaluable mentors during this process of learning the nuances and interplay between venture capital and an open-source community. So why did I leave this job that I love so much?</p>

<h2 id="apache-iceberg-baby">Apache Iceberg Baby</h2>

<p>Let’s time-travel (<a href="https://iceberg.apache.org/docs/latest/branching/">pun intended</a>) to the <a href="https://trino.io/episodes/15.html">first Iceberg episode of the Trino Community Broadcast</a>. In true ADHD form, I crammed learning about <a href="https://iceberg.apache.org/">Apache Iceberg</a> well into the night before the broadcast with the creator of Iceberg, <a href="https://www.linkedin.com/in/rdblue/">Ryan Blue</a>. While setting up that demo, I really started to understand what a game-changer Iceberg was. I had heard the Trino users and maintainers talk about Iceberg replacing Hive but it just didn’t sink in for the first couple of months. I mean really, what could be better than Hive? 🥲</p>

<p>While researching I learned about <a href="https://iceberg.apache.org/docs/latest/partitioning/#icebergs-hidden-partitioning">hidden partitioning</a>, <a href="https://iceberg.apache.org/docs/latest/evolution/#schema-evolution">schema evolution</a>, and most importantly, <a href="https://iceberg.apache.org/spec/">the open specification</a>. The whole package was just such an elegant solution to problems that had caused me and many in the Trino community failed deployments and late-night calls. Just as I had the epiphany with Trino (<a href="https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html">Presto at the time</a>) of how big of a productivity booster SQL queries over multiple systems were, I had a similar experience with Iceberg that night. Preaching the combination of these two became somewhat of a mission of mine after that.</p>
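<p>To make hidden partitioning concrete, here is a minimal sketch in plain Python (my own illustration, not Iceberg’s actual code): the table spec stores a transform such as <code>day(event_ts)</code>, the engine derives each row’s partition value from the data column on write, and a filter on <code>event_ts</code> is rewritten into partition pruning on read, so users never see, or mistype, a partition column.</p>

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts):
    """Iceberg-style 'day' transform: timestamp -> whole days since epoch.

    The partition value is derived from the data column, so writers never
    supply it and readers never filter on it directly.
    """
    return (ts - EPOCH).days

# Writing: the engine buckets each row by the derived partition value.
rows = [
    {"event_ts": datetime(2023, 6, 1, 9, 30, tzinfo=timezone.utc), "clicks": 3},
    {"event_ts": datetime(2023, 6, 1, 23, 59, tzinfo=timezone.utc), "clicks": 5},
    {"event_ts": datetime(2023, 6, 2, 0, 1, tzinfo=timezone.utc), "clicks": 1},
]
partitions = {}
for row in rows:
    partitions.setdefault(day_transform(row["event_ts"]), []).append(row)

# Reading: a predicate on event_ts is converted into a predicate on the
# derived value, so whole partitions are pruned without the user ever
# mentioning the partition layout.
query_day = day_transform(datetime(2023, 6, 1, tzinfo=timezone.utc))
matched = partitions[query_day]
```

<p>Because the transform lives in table metadata rather than in user queries, the layout can even change over time without rewriting every query, which is what makes schema and partition evolution possible.</p>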

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea787b8f-f9fa-4cc0-8e26-c18868bdf0ff_1151x647.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea787b8f-f9fa-4cc0-8e26-c18868bdf0ff_1151x647.png" alt=""/></a></p>

<h2 id="ansi-sql-iceberg-parquet-s3">ANSI SQL + Iceberg + Parquet + S3</h2>

<p>Immediately after that show, <a href="https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg.html">I wrote a four-blog series</a> on Trino on Iceberg, <a href="https://www.youtube.com/watch?v=5-Q74rCX2Z8">did a talk</a>, and built out <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">the getting started repository for Iceberg</a>. I was rather hooked on the thought of these two technologies in tandem. You start out with a system that can connect to <a href="https://trino.io/docs/current/connector.html">any data source</a> you <a href="https://trino.io/docs/current/develop/spi-overview.html">throw at it</a> and sees it as yet another SQL table. Take that system and add a table format that interacts with all the popular analytics engines people use today from <a href="https://iceberg.apache.org/docs/latest/getting-started/">Spark</a>, <a href="https://www.snowflake.com/blog/single-platform-improves-performance-analytics-data-types/">Snowflake</a>, and <a href="https://py.iceberg.apache.org/api/#duckdb">DuckDB</a>, to Trino variants like <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg.html">EMR</a>, <a href="https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html">Athena</a>, and <a href="https://docs.starburst.io/latest/connector/iceberg.html">Starburst</a>.</p>

<h3 id="standards-all-the-way-down">Standards all the way down</h3>

<p>This <a href="https://www.youtube.com/watch?v=2bkPMrjqxgs&amp;list=PLIlqnK97FLdtQEad5pi22BefNuaSeBt1e">approach to data virtualization</a> is so interesting as each system offers full leverage over vendors trying to lock you into their particular query language or storage format. It pushes the incentives for vendors to support these open standards, which puts them in a seemingly vulnerable position compared to locking users in. However, I hope vendors will slowly come to understand that this vulnerability is a fallacy. With the open S3 storage standard, open file standards like Parquet, open table standards like Iceberg, and the ANSI SQL spec closely followed by Trino, the entire analytics warehouse has become a modular stack of truly open components. This is not just open in the sense of the freedom to use and contribute to a project, but open in the standards that give you the freedom to simply move between different projects.</p>

<p>This new freedom gives the user the features of a data warehouse, with the scalability of the cloud, and a free market of a la carte services to handle your needs at whatever price point you need at any given time. All users need to do in this new ecosystem is shop around and choose any open project or vendor that implements the open standard, and migration costs become practically non-existent. This is the ultimate definition of future-proofing your architecture.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03492f03-0475-4682-addd-90623a616c64_500x500.jpeg"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03492f03-0475-4682-addd-90623a616c64_500x500.jpeg" alt=""/></a></p>



<h2 id="back-to-why-i-left-starburst">Back to why I left Starburst</h2>

<h3 id="trino-community">Trino Community</h3>

<p>I’ll quickly tie up why I left Starburst before I reveal why Iceberg won the table format wars. For the last three years, I have worked on building awareness around Trino. My partner in crime, <a href="https://www.linkedin.com/in/manfredmoser/">Manfred Moser</a>, had been in the Trino community and <a href="https://simpligility.ca/2022/10/trino-the-definitive-guide-2nd-edition/">literally wrote the book on Trino</a>. Together we spent long days and nights growing the Trino community. I loved every minute of it and honestly didn’t see myself leaving Starburst or shifting focus from Trino until it became an analytics organization standard.</p>

<p>Something became apparent though. Trino <a href="https://trino.io/blog/2023/01/10/trino-2022-the-rabbit-reflects.html">community health was thriving</a>, and there were many <a href="https://github.com/trinodb/trino/pull/17940">organic</a> <a href="https://www.youtube.com/watch?v=JMUtPl-cMRc">product</a> <a href="https://github.com/trinodb/trino/pull/17909">movements</a> taking place in the Trino community. <a href="https://www.linkedin.com/in/cole-m-bowden/">Cole Bowden</a> was boosting the Trino release process, getting us to cutting Trino releases every 1-2 weeks, which is unprecedented in open-source. Cole, Manfred, and I did a manual scan over the open pull requests, gracefully closing or reviving outdated or abandoned ones. The Trino community is in great shape.</p>

<h3 id="iceberg-community">Iceberg Community</h3>

<p>As I looked at Iceberg, the adoption and awareness were growing at an unprecedented rate with <a href="https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/">Snowflake</a>, <a href="https://cloud.google.com/bigquery/docs/iceberg-tables#create-using-biglake-metastore">BigQuery</a>, <a href="https://www.techtarget.com/searchdatamanagement/news/252509796/Starburst-Enterprise-brings-Apache-Iceberg-to-data-lakehouse">Starburst</a>, and <a href="https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html">Athena</a> all announcing support between 2021 and 2022. However, nothing was moving the needle forward from a developer experience perspective. There was some initial amazing work done by <a href="https://www.linkedin.com/in/sredai/">Sam Redai</a>, but there was still so much to be done. I noticed the <a href="https://github.com/apache/iceberg/issues?q=is%3Aopen+is%3Aissue+label%3Adocs">Iceberg documentation needed improvement</a>. While many vendors were advocating for Iceberg, there was nobody putting in consistent work on the vanilla Iceberg site. PMC members like Ryan Blue, Jack Ye, Russell Spitzer, Dan Weeks, and many others are collectively doing a great job of driving roadmap features for Iceberg, but no individual currently has the time to dedicate to the cat herding, improving communication in the project, or bettering the developer and contributor experience for users. Since Trino was on stable ground, it felt imperative to move to Iceberg and fill in these gaps. When Ryan approached me with a Head of DevRel position at Tabular, I couldn’t pass up the opportunity. To be clear, I left Starburst but not the Trino community. Working on Iceberg also helps my mission of continuing to champion these two technologies that I believe in so much.</p>

<h2 id="tell-us-why-iceberg-won-already">Tell us why Iceberg won already!</h2>

<p>Moving back to the meat of the subject. My first blog in the Trino community covered what I once called, <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector.html">the invisible Hive spec</a> to alleviate confusion around why Trino would need a “Hive” connector if it’s a query engine itself. The reason we called the Hive connector as such is that it translated Parquet files sitting in an S3 object store into a schema that could be read and modified via a query engine that knew the Hive spec. This had nothing to do with Hive the query engine, but Hive the spec. Why was the Hive spec <em>invisible</em> 👻? Because nobody wrote it down. It was in the minds of the engineers who put Hive together, spread across Cloudera forums by engineers who had bashed their heads against the wall and reverse-engineered the binaries to understand this “spec”.</p>
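<p>To make the “invisible spec” concrete: the one convention everyone did agree on was encoding partition values directly into the directory layout. Here is a toy Python sketch of parsing that layout (the path and helper are my own illustration, not part of any Hive library):</p>

```python
from urllib.parse import unquote

def hive_partition_values(path: str) -> dict:
    """Extract Hive-style key=value partition segments from an object path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = unquote(value)
    return values

# Hypothetical object path following the Hive directory convention.
path = "s3://warehouse/logging.db/events/eventtimeday=2021-04-01/part-00000.parquet"
print(hive_partition_values(path))  # {'eventtimeday': '2021-04-01'}
```

<p>Everything beyond this path convention, such as how schemas, statistics, and file listings were supposed to behave, lived only in the heads of the original engineers.</p>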

<h3 id="why-do-you-even-need-a-spec">Why do you even need a spec?</h3>

<p>Having an “invisible” spec was rather problematic, as every implementation of that spec ran on different assumptions. I spent days <a href="https://community.cloudera.com/t5/forums/filteredbylabelpage/board-id/Questions/label-name/Apache%20Hive?type=everything">searching Hive solutions on Hortonworks/Cloudera forums</a> trying to solve an error with solutions that, for whatever reason, didn’t work on my instance of Hive. There were also implicit dependencies required to use the Hive spec. It’s actually incorrect to even call the Hive spec a spec, because a spec should be independent of platform, programming language, and hardware, while including only the minimal implementation details required for interoperability between implementations. SQL, for instance, is a declarative language that runs on systems written in C++, Rust, and Java, and doesn’t get into the business of telling query engines how to answer a query, just what behavior is expected.</p>
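<p>As a toy illustration of that spec-versus-implementation split, the same declarative SQL text can be handed unchanged to any engine that implements the SQL spec. Below, Python’s built-in sqlite3 module plays the role of one such engine (the table and sample rows are made up for this example):</p>

```python
import sqlite3

# The query is pure "spec": it states WHAT result is wanted, not HOW
# the engine should compute it. Any SQL-compliant engine could run it.
QUERY = "SELECT level, COUNT(*) AS n FROM events GROUP BY level ORDER BY level"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (level TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ERROR", "Oh noes"), ("WARN", "Maybeh oh noes?"), ("ERROR", "Double oh noes")],
)
print(list(conn.execute(QUERY)))  # [('ERROR', 2), ('WARN', 1)]
```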

<h3 id="the-open-spec-is-why-iceberg-won">The open spec is why Iceberg won</h3>

<p>By the time Hive had gone through its various iterations during the Hadoop and Big Data boom, there was no central authority to lay a stake in the ground and standardize this spec. While you may imagine that we would have learned our lesson, until Iceberg, none of the successor projects officially formalized their assumptions and specifications. Delta Lake and Hudi extended the Hive model to make migration from Hive simpler, but kept a few of the issues that Hive introduced, like exposing the partitioning format to users running analysis.</p>
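<p>Iceberg’s answer to that last issue is hidden partitioning: partition values are derived from transforms declared in table metadata, so users filter on the source column and never see the layout. A minimal sketch of the day transform, which maps a timestamp to days since the Unix epoch (my own illustration, not Iceberg library code):</p>

```python
from datetime import date, datetime

def day_transform(ts: datetime) -> int:
    """Iceberg-style day transform: days since 1970-01-01."""
    return (ts.date() - date(1970, 1, 1)).days

# The engine derives the partition value; the user only ever writes
# predicates against the original timestamp column.
print(day_transform(datetime(2021, 4, 1, 12, 30)))  # 18718
```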

<h3 id="open-specifications-and-vendor-politics">Open specifications and vendor politics</h3>

<p>You may be skeptical that an open specification for a table format holds such weight, but it crept its way into <a href="https://www.reddit.com/r/dataengineering/comments/13wqhby/databricks_and_snowflake_stop_fighting_on_social/">a well-known feud</a> between two large analytics vendors of the day, Databricks and Snowflake. Databricks has been the leading vendor in the data lakehouse market, while Snowflake dominated the data warehouse market. Snowflake’s <a href="https://web.archive.org/web/20230201001308/https://www.snowflake.com/guides/what-data-lakehouse">original strategy</a> involved encouraging movement from the data lakehouse market back to the data warehouse market, while Databricks’ <a href="https://web.archive.org/web/20230626115324/https://www.databricks.com/glossary/data-warehouse">original strategy</a> was to do the exact opposite. Both strategies reflect an outdated trap that many vendors still fall prey to: the idea that locking users in will reduce churn and keep customers in the long run. Vendor lock-in was once a decades-long play, but as B2C consumer expectations creep into B2B, we are seeing a gradual shift of practitioners demanding interoperability from vendors to give them the level of autonomy they have experienced with open-source projects.</p>

<p>This surfaced with Snowflake when they announced that they would be offering <a href="https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/">support for an Iceberg external table functionality</a> in 2022. At first, I raised my eyebrow at this as I figured this was a feeble attempt for Snowflake to market itself as an open-friendly warehouse when the external table would be nothing but an easy way to migrate data from the data lake to Snowflake. Whatever their motives, this was good visibility for Iceberg and I was thrilled that Snowflake was showcasing the need for an open specification, gimmick or not. This even pressured other competing data warehouses like BigQuery to add Iceberg support.</p>



<h3 id="the-final-signal-that-the-open-iceberg-spec-won">The final signal that the open Iceberg spec won</h3>

<p>I didn’t, however, expect to see what took place at both Snowflake Summit and Data + AI Summit this year in 2023. If you don’t know, these are Snowflake’s and Databricks’ big events, which happened in the same week as an extension of their feud. Snowflake had <a href="https://www.reddit.com/r/dataengineering/comments/13wqhby/comment/jmdo259/">already dropped some hints</a> that they were ramping up their Iceberg support, while doing a great job of keeping the details under wraps. The final reveal came at Snowflake Summit; Snowflake now offers <a href="https://www.linkedin.com/feed/update/urn:li:activity:7079822990607089665">managed Iceberg catalog support</a>.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9168cb20-ff36-4f58-873f-1a7a568fc75f_4096x3072.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9168cb20-ff36-4f58-873f-1a7a568fc75f_4096x3072.png" alt=""/></a></p>

<p>I was thrilled to see that finally, a data warehouse vendor understands there’s no way to beat open standards, and that they should adopt open storage as part of their core business model. What’s even more striking about this picture is that it shows both the Trino and Spark engines as open compute alternatives to their own engine. This was beyond my expectations for Snowflake, and definitely showed me they were heading in the right direction for their customers.</p>

<p>Meanwhile, in a California town not far away, Databricks had a response to Snowflake’s announcement ready. <a href="https://www.databricks.com/company/newsroom/press-releases/announcing-delta-lake-30-new-universal-format-offers-automatic">Delta Lake 3.0 was announced</a>, and it now supports compatibility across Delta Lake, Iceberg, and Hudi (albeit <a href="https://github.com/delta-io/delta/commit/54492016f0eb9c12d25cbbdcc85804fb0a5d9a3a">limited compatibility for Iceberg for now</a>). Whatever happens, there is now an opportunity for Databricks customers to trial Iceberg, and this should excite everyone. We’re one step closer to having this spec become the common denominator. With both of these moves, one from a company that initially only wanted its proprietary format to win and one from a company that built on a competing format, I am of the opinion that Iceberg has won the format wars.</p>

<p>Now I don’t want to act like these vendors simply have users’ best interests in mind. All they want is for you to make the decision to choose them. I work for a vendor; we also want your money. What I am seeing, though, is that the industry is now trending towards incentivizing openness as more customers demand it. In order for companies to stay ahead, they must embrace this fact rather than fight it. What is rather historic about this moment in time is that since the dawn of analytics and the data warehouse, there has been vendor lock-in on many fronts of the analytics market. To me, this signals the nail in the coffin for vendors’ ability to lock users in on the level playing field of open standards. It’s now up to the vendors to keep you happy and continuously earn your business. In the long run this is good for users and vendors, and will ultimately drive better products.</p>

<p>One quick aside: some may mention that Hudi recently added a “spec”, so why does Iceberg having a spec give it the winning vote? I recommend you go back to the section on the purpose of an open spec earlier in this blog, then look at both the Iceberg <a href="https://web.archive.org/web/20230521215437/https://iceberg.apache.org/spec/">spec</a> and the Hudi <a href="https://web.archive.org/web/20230609034725/https://hudi.apache.org/tech-specs/">“spec”</a> and determine which one satisfies the criteria. Hudi’s “spec” <a href="https://web.archive.org/web/20230609034725/https://hudi.apache.org/tech-specs/#log-file-format">exposes Java dependencies</a>, making it unusable for any system not running on the JDK, doesn’t clarify the schema, and has <a href="https://web.archive.org/web/20230609034725/https://hudi.apache.org/tech-specs/#reader-expectations">a lot of implementation details</a> rather than leaving those to the systems that implement it. This “spec” is closer to an architecture document than an open spec.</p>

<h3 id="will-the-iceberg-project-drive-delta-lake-and-hudi-to-nonexistence">Will the Iceberg project drive Delta Lake and Hudi to nonexistence?</h3>

<p>Maybe, or maybe not. That really depends on the feature set that these three table formats offer and the <em><strong>actual value</strong></em> they bring to users. In other words, you decide what’s important, we decide what we’re going to support, and vendors and the larger data community decide who stays and who goes. This simply boils down to what users want, and whether there is enough deviation for it to make sense for multiple formats to exist. Having competition in open-source is every bit as healthy as having competition in the vendor space – with some exceptions:</p>

<p>Exception 1: The projects don’t collectively build on an open specification. This enables vendor lock-in under the guise of being “open” simply because the source is available.</p>

<p>Exception 2: The projects are not just duplicated efforts of each other. That is a sad waste of time and causes analysis paralysis for users. If two projects have clear value added in different domains with some overlap, then the choice becomes clearer based on the use case.</p>

<p>Exception 3: Any of the competing OSS projects primarily serve the needs of one or a few companies over the larger community. This exception is an anti-pattern of innovation brought up in <a href="https://en.wikipedia.org/wiki/The_Innovator%27s_Dilemma">the Innovator’s Dilemma</a>. Focusing on the needs of the few will eventually force the majority to move to the next thing. Truly open projects will continue to evolve at the pace of the industry.</p>

<p>Just because I’m touting that Apache Iceberg has “won the table format wars” with an open spec, does not mean I am discrediting the hard work done by the competing table formats or advocating not to use them. Delta Lake has made <a href="https://www.databricks.com/blog/2022/06/30/open-sourcing-all-of-delta-lake.html">sizeable efforts to stay competitive</a> on a feature-by-feature basis. I’m also <a href="https://www.linkedin.com/feed/update/urn:li:ugcPost:7055051825321832448?commentUrn=urn%3Ali%3Acomment%3A%28ugcPost%3A7055051825321832448%2C7055173804460818432%29&amp;replyUrn=urn%3Ali%3Acomment%3A%28ugcPost%3A7055051825321832448%2C7055245121428094976%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287055173804460818432%2Curn%3Ali%3AugcPost%3A7055051825321832448%29&amp;dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287055245121428094976%2Curn%3Ali%3AugcPost%3A7055051825321832448%29">still good friends</a> with Mr. DeltaLake himself, <a href="https://www.linkedin.com/in/dennyglee/">Denny Lee</a>. I hold no grudge against these amazing people trying to do better for their users and customers. I am also excited at the fact that all formats now have some level of interoperability for users outside of the Iceberg ecosystem!</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43f2c01a-a8ec-4368-b106-a351c1b4f3fb_1280x960.jpeg"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43f2c01a-a8ec-4368-b106-a351c1b4f3fb_1280x960.jpeg" alt=""/></a></p>

<h3 id="and-hudi">And…Hudi?</h3>

<p>I wrestled with my take on Hudi. I <a href="https://www.youtube.com/watch?v=aL3PfMjvFM4">really enjoyed working with the folks from Hudi</a>, and while I usually follow the “don’t say anything unless it’s nice” rule, it would be disingenuous not to give my real opinion on this matter. I initially quoted Reddit user u/JudgingYouThisSecond to avoid using my own words, but even their take still doesn’t quite capture my thoughts concisely:</p>

<blockquote><p>Ultimately, there is room enough for there to be several winners in the table format space:</p>
<ul><li>Hudi, for folks who care about or need near-real-time ingestion and streaming. This project will likely be #3 in the overall pecking order as it&#39;s become less relevant and seems to be losing steam relative to the other two in this list (at least IMO)</li>
<li>Delta Lake, for people who play a lot in the Databricks ecosystem and don&#39;t mind that Delta Lake (while OSS) may end up being a bit of a “gateway drug” into a pay-to-play world</li>
<li>Iceberg, for folks looking to avoid vendor lock-in and are looking for a table format with a growing community behind it.</li></ul>

<p>While my opinion may change as these projects evolve, I consider myself an “Iceberg Guy” at the moment.</p>

<p><a href="https://www.reddit.com/r/dataengineering/comments/13gevpu/whats_the_view_on_apache_iceberg/jjzntu1/">Comment</a> by <a href="https://www.reddit.com/user/JudgingYouThisSecond">u/JudgingYouThisSecond</a> in <a href="https://www.reddit.com/r/dataengineering/">dataengineering</a></p></blockquote>

<p>I feel like the open-source Hudi project is just not in a state where I would recommend it to a friend. I once thought perhaps Hudi’s upserts made their complexity worth it, but both <a href="https://docs.delta.io/latest/delta-streaming.html#table-streaming-reads-and-writes">Delta</a> and <a href="https://lists.apache.org/thread/1pjn2010r45lq6tk7jn2g4klg0xv68zn">Iceberg</a> have improved on this front. My opinion from supporting all three formats in the Trino community is that Hudi is <a href="https://www.reddit.com/r/dataengineering/comments/11qvpnl/is_hudi_a_hardcomplex_tool_as_it_seems/">an overly complex implementation</a> with many leaky abstractions, such as exposing Hadoop and other legacy dependencies (e.g. the KryoSerializer), and doesn’t actually provide proper transaction semantics.</p>

<p>What I personally hope for is that we converge down to the minimum set of table format projects needed to scratch the distinct, non-overlapping itches, all adhering to the Iceberg specification. In essence, it is really the Iceberg spec that has won, and Iceberg is currently the only implementation to support it. I would be thrilled to see the Delta Lake and Hudi projects support the Iceberg spec and make this a fair race, where the new question becomes: which Iceberg implementation won the most adoption? That game will be fun to play. Imagine Tabular, Databricks, Cloudera, and Onehouse all building on the same spec!</p>

<p>Note: Please don’t respond with <a href="https://write.as/bitsondatadev/what-is-benchmarketing-and-why-is-it-bad">silly performance benchmarks with TPC</a> (or any standardized dataset in a vendor-controlled benchmark) to suggest the performance of these benchmarks is relevant to this conversation. It’s not.</p>

<p>Thanks for reading, and if you disagree and want to debate or love this and want to discuss it, reach out to me on <a href="https://www.linkedin.com/in/bitsondatadev/">LinkedIn</a>, <a href="https://twitter.com/bitsondatadev">Twitter</a>, or the <a href="https://join.slack.com/t/apache-iceberg/shared_invite/zt-1uva9gyp1-TrLQl7o~nZ5PsTVgl6uoEQ">Apache Iceberg Slack</a>.</p>

<p>Stay classy Data Folks!</p>

<p><a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a> <a href="https://bitsondata.dev/tag:opensource" class="hashtag"><span>#</span><span class="p-category">opensource</span></a> <a href="https://bitsondata.dev/tag:openstandard" class="hashtag"><span>#</span><span class="p-category">openstandard</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/iceberg-won-the-table-format-war</guid>
      <pubDate>Wed, 05 Jul 2023 17:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice IV: Deep dive into Iceberg internals</title>
      <link>https://bitsondata.dev/trino-iceberg-iv-deep-dive?pk_campaign=rss-feed</link>
<description>&lt;![CDATA[&#xA;&#xA;So far, this series has covered some very interesting user-level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting some files that result from various operations carried out using Trino. To dissect you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool, and Iceberg’s core library. It’s useful to dissect how these files work, not only to help understand how Iceberg works, but also to aid in troubleshooting, should you have any issues during ingestion or querying of your Iceberg table. I like to think of this type of debugging much like a fun game of Operation, where you’re looking to see what causes the red errors to fly by on your screen.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series covering the details of how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially, as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;Understanding Iceberg metadata&#xA;&#xA;Iceberg can use any compatible metastore, but for Trino, it only supports the Hive metastore and AWS Glue, similar to the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that currently use data lakes already use the Hive connector and therefore the Hive metastore. This makes it convenient to have as the leading supported use case, as existing users can easily migrate from Hive to Iceberg tables. 
Since there is no indication of which connector is actually executed in the diagram of the Hive connector architecture, it serves as a diagram that can be used for both Hive and Iceberg. The only difference is the connector used, but if you create a table in Hive, you can view the same table in Iceberg.&#xA;&#xA;To recap the steps taken in the first three blogs: the first blog created an events table, while the first two blogs ran two insert statements. The first insert contained three records, while the second insert contained a single record.&#xA;&#xA;Up until this point, the state of the files in MinIO hasn’t really been shown, except for some of the manifest list pointers from the snapshot in the third blog post. Using the MinIO client tool, you can list the files that Iceberg generated through all these operations and then try to understand what purpose they serve.&#xA;&#xA;% mc tree -f local/&#xA;local/&#xA;└─ iceberg&#xA;   └─ logging.db&#xA;      └─ events&#xA;         ├─ data&#xA;         │  ├─ eventtimeday=2021-04-01&#xA;         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#xA;         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc&#xA;         │  └─ eventtimeday=2021-04-02&#xA;         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#xA;         └─ metadata&#xA;            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#xA;            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#xA;            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#xA;            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#xA;            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#xA;            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#xA;            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#xA;&#xA;There are a lot of files here, but here are a couple of patterns that 
you can observe with these files.&#xA;&#xA;First, the top two directories are named data and metadata.&#xA;&#xA;/bucket/database/table/data/&#xA;/bucket/database/table/metadata/&#xA;&#xA;As you might expect, data contains the actual ORC files split by partition. This is akin to what you would see in a Hive table data directory. What is really of interest here is the metadata directory. There are specifically three patterns of files you’ll find here.&#xA;&#xA;/bucket/database/table/metadata/file-id.avro&#xA;&#xA;/bucket/database/table/metadata/snap-snapshot-id-version-file-id.avro&#xA;&#xA;/bucket/database/table/metadata/version-commit-UUID.metadata.json&#xA;&#xA;Iceberg has a persistent tree structure that manages the various snapshots created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.&#xA;&#xA;This tree structure contains two types of Avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files, and the manifest files themselves point to various data files. This post starts out by covering these manifest files, and later covers the table metadata files that are suffixed by .metadata.json.&#xA;&#xA;The last blog covered the command in Trino that shows the snapshot information that is stored in the metastore. 
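Before dissecting them, note that the three filename patterns above are enough to tell the file types apart by name alone. Here is a small Python sketch of that classification (a hypothetical helper for illustration, not an Iceberg API):

```python
import re

# Filename shapes follow the three patterns described above
# (a sketch, not the authoritative Iceberg naming rules).
TABLE_METADATA = re.compile(r"^\d+-[0-9a-f-]{36}\.metadata\.json$")
MANIFEST_LIST = re.compile(r"^snap-\d+-\d+-[0-9a-f-]{36}\.avro$")
MANIFEST_FILE = re.compile(r"^[0-9a-f-]{36}-m\d+\.avro$")

def classify(name: str) -> str:
    if TABLE_METADATA.match(name):
        return "table metadata"
    if MANIFEST_LIST.match(name):
        return "manifest list"
    if MANIFEST_FILE.match(name):
        return "manifest file"
    return "data or unknown"

for name in [
    "00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json",
    "snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro",
    "92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro",
]:
    print(name, "->", classify(name))
```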
Here is that command and its output again for your review.&#xA;&#xA;SELECT manifestlist &#xA;FROM iceberg.logging.&#34;events$snapshots&#34;;&#xA;&#xA;Result:&#xA;&#xA;snapshots&#xA;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#xA;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#xA;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#xA;&#xA;You’ll notice that the Avro files prefixed with&#xA;snap- are returned. These files are directly correlated with the snapshot records stored in the metastore. According to the diagram above, snapshots are records in the metastore that contain the url of the manifest list Avro file. Avro files are binary files and not something you can just open up in a text editor to read. Using the avro-tools.jar tool distributed by the Apache Avro project, you can actually inspect the contents of this file to get a better understanding of how it is used by Iceberg.&#xA;&#xA;The first snapshot is generated on the creation of the events table. Upon inspecting this file, you notice that it is empty: the output is just a newline, which the jq JSON command line utility removes when pretty printing. This snapshot represents the empty state of the table upon creation. To investigate the snapshots you need to download the files to your local filesystem. 
Let&#39;s move them to the home  directory:&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .&#xA;&#xA;Result: (is empty)&#xA;&#xA;The second snapshot is a little more interesting and actually shows us the contents of a manifest list.&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .&#xA;&#xA;Result:&#xA;&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,&#xA;   &#34;manifestlength&#34;:6114,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;addeddatafilescount&#34;:{&#xA;      &#34;int&#34;:2&#xA;   },&#xA;   &#34;existingdatafilescount&#34;:{&#xA;      &#34;int&#34;:0&#xA;   },&#xA;   &#34;deleteddatafilescount&#34;:{&#xA;      &#34;int&#34;:0&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:3&#xA;   },&#xA;   &#34;existingrowscount&#34;:{&#xA;      &#34;long&#34;:0&#xA;   },&#xA;   &#34;deletedrowscount&#34;:{&#xA;      &#34;long&#34;:0&#xA;   }&#xA;}&#xA;&#xA;To understand each of the values in each of these rows, you can refer to the  Iceberg &#xA;specification in the manifest list file section. Instead of covering these exhaustively, let&#39;s focus on a few key fields. 
Below are the fields, and their definitions according to the specification.&#xA;&#xA;manifestpath - Location of the manifest file.&#xA;partitionspecid - ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.&#xA;addedsnapshotid - ID of the snapshot where the manifest file was added.&#xA;partitions - A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.&#xA;addedrowscount - Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.&#xA;&#xA;As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tell any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse this tree, you can loop over the different manifest paths to find all the manifest files associated with the particular snapshot you want to traverse. The partition spec id tells you which partition specification was used to write the manifest; the specs themselves are stored in the table metadata in the metastore. The added snapshot id tells you which snapshot is associated with the manifest list. Partitions hold some high-level partition bound information to make for faster querying. If a query is looking for a particular value, it only traverses the manifest files where the query values fall within the range of the file values. Finally, you get a few metrics like the number of changed rows and data files, one of which is the count of added rows. The first operation consisted of three row inserts and the second operation was the insertion of one row. 
Using the row counts you can easily determine which manifest file belongs to which operation.&#xA;&#xA;The following command shows the final snapshot after both operations executed and filters out only the fields pointed out above.&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &#39;. | {manifestpath: .manifestpath, partitionspecid: .partitionspecid, addedsnapshotid: .addedsnapshotid, partitions: .partitions, addedrowscount: .addedrowscount }&#39;&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#34;,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:4564366177504223700&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:1&#xA;   }&#xA;}&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   
&#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:3&#xA;   }&#xA;}&#xA;&#xA;In the listing of the manifest file related to the last snapshot, you notice the first operation where three rows were inserted is contained in the manifest file in the second JSON object. You can determine this from the snapshot id, as well as, the number of rows that were added in the operation. The first JSON object contains the last operation that inserted a single row. So the most recent operations are listed in reverse commit order.&#xA;&#xA;The next command does the same listing of the file that you ran with the manifest list, except you run this on the manifest files themselves to expose their contents and discuss them. To begin with, you run the command to show the contents of the manifest file associated with the insertion of three rows.&#xA;&#xA;% java -jar  ~/avro-tools-1.10.0.jar tojson ~/Desktop/avrofiles/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;status&#34;:1,&#xA;   &#34;snapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;datafile&#34;:{&#xA;      &#34;filepath&#34;:&#34;s3a://iceberg/logging.db/events/data/eventtimeday=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#34;,&#xA;      &#34;fileformat&#34;:&#34;ORC&#34;,&#xA;      &#34;partition&#34;:{&#xA;         &#34;eventtimeday&#34;:{&#xA;            &#34;int&#34;:18718&#xA;         }&#xA;      },&#xA;      &#34;recordcount&#34;:1,&#xA;      &#34;filesizeinbytes&#34;:870,&#xA;      &#34;blocksizeinbytes&#34;:67108864,&#xA;      &#34;columnsizes&#34;:null,&#xA;      &#34;valuecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:1&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:1&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:1&#xA;       
     },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:1&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nullvaluecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:0&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nanvaluecounts&#34;:null,&#xA;      &#34;lowerbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;upperbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;keymetadata&#34;:null,&#xA;      &#34;splitoffsets&#34;:null&#xA;   }&#xA;}&#xA;{&#xA;   &#34;status&#34;:1,&#xA;   &#34;snapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;datafile&#34;:{&#xA;      &#34;filepath&#34;:&#34;s3a://iceberg/logging.db/events/data/eventtimeday=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#34;,&#xA;      &#34;fileformat&#34;:&#34;ORC&#34;,&#xA;      &#34;partition&#34;:{&#xA;         &#34;eventtimeday&#34;:{&#xA;            &#34;int&#34;:18719&#xA;         }&#xA;      },&#xA;      
&#34;recordcount&#34;:2,&#xA;      &#34;filesizeinbytes&#34;:1084,&#xA;      &#34;blocksizeinbytes&#34;:67108864,&#xA;      &#34;columnsizes&#34;:null,&#xA;      &#34;valuecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:2&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nullvaluecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:0&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nanvaluecounts&#34;:null,&#xA;      &#34;lowerbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Double oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;upperbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;WARN&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Maybeh oh noes?&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      
&#34;keymetadata&#34;:null,&#xA;      &#34;splitoffsets&#34;:null&#xA;   }&#xA;}&#xA;&#xA;Now this is a very big output, but in summary, there’s really not too much to these files. As before, there is a Manifest section in the Iceberg spec that details what each of these fields means. Here are the important fields:&#xA;&#xA;snapshotid - Snapshot id where the file was added, or deleted if status is 2. Inherited when null.&#xA;datafile - Field containing metadata about the data files pertaining to the manifest file, such as file path, partition tuple, metrics, etc…&#xA;datafile.filepath - Full URI for the file with FS scheme.&#xA;datafile.partition - Partition data tuple, schema based on the partition spec.&#xA;datafile.recordcount - Number of records in the data file.&#xA;datafile.count - Multiple fields that contain a map from column id to the number of values, nulls, and NaNs in the file. These can be used to quickly filter out unnecessary get operations.&#xA;datafile.bounds - Multiple fields that contain a map from column id to the lower or upper bound of the column, serialized as binary. Each lower bound must be less than or equal to, and each upper bound greater than or equal to, all non-null, non-NaN values in the column for the file.&#xA;&#xA;Each data file struct contains the partition and data file that it maps to. These files are only scanned and returned if the criteria for the query are met when checking all of the count, bounds, and other statistics that are recorded in the file. Ideally, only files that contain data relevant to the query should be scanned at all. Having information like the record count may also help the query planning process determine splits and other information. This particular optimization hasn’t been completed yet, as planning typically happens before traversal of the files. It is still in ongoing discussion and is discussed a bit by Iceberg creator Ryan Blue in a recent meetup. 
If this is something you are interested in, keep posted on the Slack channel and releases as the Trino Iceberg connector progresses in this area.&#xA;&#xA;As mentioned above, the last set of files that you find in the metadata directory are the ones suffixed with .metadata.json. These files at baseline are a bit strange, as they aren’t stored in the Avro format, but instead the JSON format. This is because they are not part of the persistent tree structure. These files are essentially a copy of the table metadata that is stored in the metastore. You can find the fields for the table metadata listed in the Iceberg specification. This metadata is typically stored persistently in a metastore much like the Hive metastore, but it could easily be replaced by any datastore that supports the atomic swap (check-and-put) operation required for Iceberg’s optimistic concurrency model.&#xA;&#xA;The naming of the table metadata includes a table version and UUID: &#xA;table-version-UUID.metadata.json. To commit a new metadata version, which just adds 1 to the current version number, the writer performs these steps:&#xA;&#xA;It creates a new table metadata file using the current metadata.&#xA;It writes the new table metadata to a file following the naming scheme with the next version number.&#xA;It requests that the metastore swap the table’s metadata pointer from the old location to the new location.&#xA;&#xA;    If the swap succeeds, the commit succeeded. The new file is now the &#xA;    current metadata.&#xA;    If the swap fails, another writer has already committed a new version. The&#xA;    current writer goes back to step 1.&#xA;&#xA;If you want to see where this is stored in the Hive metastore, you can reference the TABLE_PARAMS table. 
At the time of writing, this is the only method of using the metastore that is supported by the Trino Iceberg connector.&#xA;&#xA;SELECT PARAM_KEY, PARAM_VALUE FROM metastore.TABLE_PARAMS;&#xA;&#xA;Result:&#xA;&#xA;PARAM_KEY | PARAM_VALUE&#xA;EXTERNAL | TRUE&#xA;metadata_location | s3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;numFiles | 2&#xA;previous_metadata_location | s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#xA;table_type | iceberg&#xA;totalSize | 5323&#xA;transient_lastDdlTime | 1622865672&#xA;&#xA;So as you can see, the metastore says the current metadata location is the&#xA;00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json file. Now you can dive in to see the table metadata that is being used by the Iceberg connector.&#xA;&#xA;% cat ~/Desktop/avrofiles/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;format-version&#34;:1,&#xA;   &#34;table-uuid&#34;:&#34;32e3c271-84a9-4be5-9342-2148c878227a&#34;,&#xA;   &#34;location&#34;:&#34;s3a://iceberg/logging.db/events&#34;,&#xA;   &#34;last-updated-ms&#34;:1622865686323,&#xA;   &#34;last-column-id&#34;:5,&#xA;   &#34;schema&#34;:{&#xA;      &#34;type&#34;:&#34;struct&#34;,&#xA;      &#34;fields&#34;:[&#xA;         {&#xA;            &#34;id&#34;:1,&#xA;            &#34;name&#34;:&#34;level&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;string&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:2,&#xA;            &#34;name&#34;:&#34;eventtime&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;timestamp&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:3,&#xA;            &#34;name&#34;:&#34;message&#34;,&#xA;            
&#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;string&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:4,&#xA;            &#34;name&#34;:&#34;callstack&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:{&#xA;               &#34;type&#34;:&#34;list&#34;,&#xA;               &#34;element-id&#34;:5,&#xA;               &#34;element&#34;:&#34;string&#34;,&#xA;               &#34;element-required&#34;:false&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;partition-spec&#34;:[&#xA;      {&#xA;         &#34;name&#34;:&#34;eventtimeday&#34;,&#xA;         &#34;transform&#34;:&#34;day&#34;,&#xA;         &#34;source-id&#34;:2,&#xA;         &#34;field-id&#34;:1000&#xA;      }&#xA;   ],&#xA;   &#34;default-spec-id&#34;:0,&#xA;   &#34;partition-specs&#34;:[&#xA;      {&#xA;         &#34;spec-id&#34;:0,&#xA;         &#34;fields&#34;:[&#xA;            {&#xA;               &#34;name&#34;:&#34;eventtime_day&#34;,&#xA;               &#34;transform&#34;:&#34;day&#34;,&#xA;               &#34;source-id&#34;:2,&#xA;               &#34;field-id&#34;:1000&#xA;            }&#xA;         ]&#xA;      }&#xA;   ],&#xA;   &#34;default-sort-order-id&#34;:0,&#xA;   &#34;sort-orders&#34;:[&#xA;      {&#xA;         &#34;order-id&#34;:0,&#xA;         &#34;fields&#34;:[&#xA;            &#xA;         ]&#xA;      }&#xA;   ],&#xA;   &#34;properties&#34;:{&#xA;      &#34;write.format.default&#34;:&#34;ORC&#34;&#xA;   },&#xA;   &#34;current-snapshot-id&#34;:4564366177504223943,&#xA;   &#34;snapshots&#34;:[&#xA;      {&#xA;         &#34;snapshot-id&#34;:6967685587675910019,&#xA;         &#34;timestamp-ms&#34;:1622865672882,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;0&#34;,&#xA;            &#34;total-records&#34;:&#34;0&#34;,&#xA;            &#34;total-data-files&#34;:&#34;0&#34;,&#xA;            
&#34;total-delete-files&#34;:&#34;0&#34;,&#xA;            &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#34;&#xA;      },&#xA;      {&#xA;         &#34;snapshot-id&#34;:2720489016575682283,&#xA;         &#34;parent-snapshot-id&#34;:6967685587675910019,&#xA;         &#34;timestamp-ms&#34;:1622865680419,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;added-data-files&#34;:&#34;2&#34;,&#xA;            &#34;added-records&#34;:&#34;3&#34;,&#xA;            &#34;added-files-size&#34;:&#34;1954&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;2&#34;,&#xA;            &#34;total-records&#34;:&#34;3&#34;,&#xA;            &#34;total-data-files&#34;:&#34;2&#34;,&#xA;            &#34;total-delete-files&#34;:&#34;0&#34;,&#xA;            &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#34;&#xA;      },&#xA;      {&#xA;         &#34;snapshot-id&#34;:4564366177504223943,&#xA;         &#34;parent-snapshot-id&#34;:2720489016575682283,&#xA;         &#34;timestamp-ms&#34;:1622865686278,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;added-data-files&#34;:&#34;1&#34;,&#xA;            &#34;added-records&#34;:&#34;1&#34;,&#xA;            &#34;added-files-size&#34;:&#34;746&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;1&#34;,&#xA;            &#34;total-records&#34;:&#34;4&#34;,&#xA;            &#34;total-data-files&#34;:&#34;3&#34;,&#xA;            &#34;total-delete-files&#34;:&#34;0&#34;,&#xA;   
         &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#34;&#xA;      }&#xA;   ],&#xA;   &#34;snapshot-log&#34;:[&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865672882,&#xA;         &#34;snapshot-id&#34;:6967685587675910019&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865680419,&#xA;         &#34;snapshot-id&#34;:2720489016575682283&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865686278,&#xA;         &#34;snapshot-id&#34;:4564366177504223943&#xA;      }&#xA;   ],&#xA;   &#34;metadata-log&#34;:[&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865672894,&#xA;         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#34;&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865680524,&#xA;         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#34;&#xA;      }&#xA;   ]&#xA;}&#xA;&#xA;As you can see, these JSON files can quickly grow as you perform different updates on your table. This file contains a pointer to all of the snapshots and manifest list files, much like the output you found from looking at the snapshots in the table. A really important piece to note is the schema is stored here. This is what Trino uses for validation on inserts and reads. As you may expect, there is the root location of the table itself, as well as a unique table identifier. The final part I’d like to note about this file is the partition-spec and partition-specs fields. The partition-spec field holds the current partition spec, while the partition-specs is an array that can hold a list of all partition specs that have existed for this table. 
As pointed out earlier, you can have many different manifest files that use different partition specs. That wraps up all of the metadata file types you can expect to see in Iceberg!&#xA;&#xA;This post wraps up the Trino on ice series. Hopefully these blog posts serve as a helpful initial dialogue about what is expected to grow as a vital portion of an open data lakehouse stack. What are you waiting for? Come join the fun and help us implement some of the missing features or instead go ahead and try Trino on Ice(berg) yourself!&#xA;&#xA;#trino #iceberg]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>So far, this series has covered some very interesting user-level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting the files that result from various operations carried out using Trino. To dissect, you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool, and Iceberg’s core library. It’s useful to dissect how these files work, not only to help understand how Iceberg works, but also to aid in troubleshooting, should you run into issues while ingesting or querying your Iceberg tables. I like to think of this type of debugging as a fun game of Operation, where you’re looking to see what causes the red errors to fly by on your screen.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p><img src="https://trino.io/assets/blog/trino-on-ice/operation.gif" alt=""/></p>

<h2 id="understanding-iceberg-metadata">Understanding Iceberg metadata</h2>

<p>Iceberg can use any compatible metastore, but the Trino Iceberg connector only supports the Hive metastore and AWS Glue, just like the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that currently run on data lakes already use the Hive connector, and therefore the Hive metastore. This makes it convenient as the leading supported use case, since existing users can easily migrate from Hive to Iceberg tables. Since the diagram of the Hive connector architecture gives no indication of which connector is actually executed, it serves as a diagram for both Hive and Iceberg. The only difference is the connector used; if you create a table through the Hive connector, you can view the same table through the Iceberg connector.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metadata.png" alt=""/></p>

<p>To recap the steps taken in the first three blogs: the first blog created an events table, and the first two blogs ran two insert statements. The first insert contained three records, while the second insert contained a single record.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-snapshot-files.png" alt=""/></p>

<p>Up until this point, the state of the files in MinIO hasn’t really been shown, except for some of the manifest list pointers from the snapshot in the third blog post. Using the <a href="https://docs.min.io/minio/baremetal/reference/minio-cli/minio-mc.html">MinIO client tool</a>, you can list the files that Iceberg generated through all these operations and then work out what purpose they serve.</p>

<pre><code>% mc tree -f local/
local/
└─ iceberg
   └─ logging.db
      └─ events
         ├─ data
         │  ├─ event_time_day=2021-04-01
         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc
         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc
         │  └─ event_time_day=2021-04-02
         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc
         └─ metadata
            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json
            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json
            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro
            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro
            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro
            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro
            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro
</code></pre>

<p>There are a lot of files here, but there are a few patterns you can observe among them.</p>

<p>First, the top two directories are named <code>data</code> and <code>metadata</code>.</p>

<pre><code>/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/data//&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/
</code></pre>

<p>As you might expect, <code>data</code> contains the actual ORC files split by partition. This is akin to what you would see in a Hive table <code>data</code> directory. What is really of interest here is the <code>metadata</code> directory. There are specifically three patterns of files you’ll find here.</p>

<pre><code>/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/&lt;file-id&gt;.avro

/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/snap-&lt;snapshot-id&gt;-&lt;version&gt;-&lt;file-id&gt;.avro

/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/&lt;version&gt;-&lt;commit-UUID&gt;.metadata.json
</code></pre>
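<p>To make these patterns concrete, here is a short Python sketch (not part of Iceberg’s tooling, and the naming is an implementation detail that may vary between writers) that classifies the file names you saw in the <code>metadata</code> directory:</p>

```python
import re

# Illustrative patterns for the three metadata file types listed above.
PATTERNS = {
    "manifest": re.compile(r"^[0-9a-f-]+-m\d+\.avro$"),
    "manifest list": re.compile(r"^snap-\d+-\d+-[0-9a-f-]+\.avro$"),
    "table metadata": re.compile(r"^\d+-[0-9a-f-]+\.metadata\.json$"),
}

def classify(name: str) -> str:
    """Return which kind of Iceberg metadata file a name looks like."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(name):
            return kind
    return "unknown"

print(classify("92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro"))  # manifest
print(classify("snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro"))  # manifest list
print(classify("00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json"))  # table metadata
```

Running this over the listing above sorts every file in the <code>metadata</code> directory into one of the three buckets discussed next.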

<p>Iceberg has a persistent tree structure that manages various snapshots of the data that are created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.</p>
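<p>The commit path behind this concurrency model boils down to a compare-and-swap on a single metadata pointer: a writer prepares a new metadata file, asks the metastore to atomically swap the table’s pointer, and retries from the newly observed state if another writer got there first. Here is a toy Python sketch of that loop (the <code>ToyMetastore</code> class and file names are hypothetical, purely for illustration):</p>

```python
import threading

class ToyMetastore:
    """Stands in for the metastore: holds one pointer to the current metadata file."""
    def __init__(self, location: str):
        self.location = location
        self._lock = threading.Lock()

    def swap(self, expected: str, new: str) -> bool:
        # Atomic check-and-put: succeeds only if nobody committed in between.
        with self._lock:
            if self.location != expected:
                return False
            self.location = new
            return True

def commit(store: ToyMetastore, make_new_metadata) -> str:
    """Optimistic commit: build new metadata from current, CAS, retry on conflict."""
    while True:
        current = store.location
        new = make_new_metadata(current)  # e.g. bump 00001-... to 00002-...
        if store.swap(current, new):
            return new

store = ToyMetastore("00001-aaaa.metadata.json")
print(commit(store, lambda cur: "00002-bbbb.metadata.json"))
```

Because a failed swap means someone else’s snapshot landed first, the losing writer simply rebuilds its metadata on top of the winner’s state, which is what keeps the snapshot history linear.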

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metastore-files.png" alt=""/></p>

<p>This tree structure contains two types of Avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files and the manifest files themselves point to various data files. This post starts out by covering these manifest files, and later covers the table metadata files that are suffixed by <code>.metadata.json</code>.</p>
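<p>Put together, the tree can be pictured as plain nested records. The following Python model is purely illustrative (these dataclasses are not Iceberg APIs), but it shows the traversal order: snapshot, to manifest list, to manifest files, to data files.</p>

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    file_path: str
    record_count: int

@dataclass
class ManifestFile:
    # A manifest file points at a set of data files.
    manifest_path: str
    data_files: list

@dataclass
class Snapshot:
    # A snapshot points at one manifest list, i.e. a list of manifest files.
    snapshot_id: int
    manifest_list: list

def all_data_files(snapshot: Snapshot) -> list:
    """Walk snapshot -> manifest list -> manifests -> data files."""
    return [df.file_path for m in snapshot.manifest_list for df in m.data_files]

# Shaped after the second snapshot from this series (paths shortened by hand).
snap = Snapshot(
    snapshot_id=2720489016575682283,
    manifest_list=[
        ManifestFile(
            manifest_path="92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro",
            data_files=[
                DataFile("event_time_day=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc", 1),
                DataFile("event_time_day=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc", 2),
            ],
        )
    ],
)
print(len(all_data_files(snap)))  # 2
```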

<p><a href="https://bitsondata.dev/trino-on-ice-iii-iceberg-concurrency-snapshots-spec">The last blog covered</a> the command in Trino that shows the snapshot information that is stored in the metastore. Here is that command and its output again for your review.</p>

<pre><code>SELECT manifest_list 
FROM iceberg.logging.&#34;events$snapshots&#34;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>manifest_list</th></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro</td></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro</td></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro</td></tr>
</table>

<p>You’ll notice that the Avro files prefixed with
<code>snap-</code> are returned. These files are directly correlated with the snapshot records stored in the metastore. According to the diagram above, snapshots are records in the metastore that contain the URL of the manifest list Avro file. Avro files are binary files and not something you can just open up in a text editor to read. Using the <a href="https://downloads.apache.org/avro/avro-1.10.2/java/avro-tools-1.10.2.jar">avro-tools.jar tool</a> distributed by the <a href="https://avro.apache.org/docs/current/index.html">Apache Avro project</a>, you can inspect the contents of these files to get a better understanding of how they are used by Iceberg.</p>

<p>The first snapshot is generated when the events table is created, and it represents the empty state of the table. To investigate the snapshots, you need to download the files to your local filesystem. Let&#39;s move them to the home directory. Upon inspecting this first file, you notice that it is empty: avro-tools prints only a newline, which the <code>jq</code> JSON command-line utility strips when pretty printing.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .
</code></pre>

<p>Result: (empty)</p>

<pre><code>
</code></pre>

<p>The second snapshot is a little more interesting and actually shows us the contents of a manifest list.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,
   &#34;manifest_length&#34;:6114,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;added_data_files_count&#34;:{
      &#34;int&#34;:2
   },
   &#34;existing_data_files_count&#34;:{
      &#34;int&#34;:0
   },
   &#34;deleted_data_files_count&#34;:{
      &#34;int&#34;:0
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:3
   },
   &#34;existing_rows_count&#34;:{
      &#34;long&#34;:0
   },
   &#34;deleted_rows_count&#34;:{
      &#34;long&#34;:0
   }
}
</code></pre>
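<p>One quirk worth noting in this output: <code>added_snapshot_id</code> prints as 2720489016575682000, while the manifest list’s file name embeds snapshot id 2720489016575682283. Iceberg isn’t truncating anything; the mismatch appears because tools that parse JSON numbers as 64-bit doubles (as older <code>jq</code> releases do) cannot represent an id this large exactly. You can see the rounding with a couple of lines of Python:</p>

```python
# Ids taken from the manifest list file name vs. the jq-formatted output above.
snapshot_id = 2720489016575682283

as_double = float(snapshot_id)   # what a 64-bit double stores
print(int(as_double))            # 2720489016575682048 -- low digits already lost
print(as_double == snapshot_id)  # False

# Doubles carry a 53-bit significand; integers this large are spaced
# hundreds apart, so read exact snapshot ids from the file names or from
# a tool that preserves 64-bit integers.
```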

<p>To understand each of these fields, you can refer to the Iceberg
<a href="https://iceberg.apache.org/spec/#manifest-lists">specification in the manifest list file section</a>. Instead of covering these exhaustively, let&#39;s focus on a few key fields. Below are the fields and their definitions according to the specification.</p>
<ul><li><code>manifest_path</code> – Location of the manifest file.</li>
<li><code>partition_spec_id</code> – ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.</li>
<li><code>added_snapshot_id</code> – ID of the snapshot where the manifest file was added.</li>
<li><code>partitions</code> – A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.</li>
<li><code>added_rows_count</code> – Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.</li></ul>

<p>As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tell any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse the tree, you iterate over the manifest paths to find every manifest file associated with the snapshot you are interested in. The partition spec id tells you which partition specification, stored in the table metadata in the metastore, was used to write the manifest. The added snapshot id tells you which snapshot added the manifest. The partition summaries hold some high-level partition bound information to make for faster querying: if a query is looking for a particular value, Iceberg only traverses the manifest files whose bounds contain that value. Finally, you get a few metrics, like the number of changed rows and data files, one of which is the count of added rows. The first operation inserted three rows and the second operation inserted a single row, so the row counts make it easy to determine which manifest file belongs to which operation.</p>
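<p>You can decode those partition bounds by hand. Per the Iceberg spec, single-value serialization stores an <code>int</code> as 4 little-endian bytes, and a <code>day</code> transform value counts days since the Unix epoch. The following Python sketch (hand-rolled for illustration, rather than using Iceberg’s library) decodes the <code>lower_bound</code> and <code>upper_bound</code> bytes shown in the output above:</p>

```python
import struct
from datetime import date, timedelta

def decode_day_bound(raw: bytes) -> date:
    # Little-endian int32 holding days since 1970-01-01 (the `day` transform).
    (days,) = struct.unpack("<i", raw)
    return date(1970, 1, 1) + timedelta(days=days)

# Bounds copied from the manifest list output; Avro's JSON encoding renders
# the 4 bytes as the string "\u001eI\u0000\u0000" (0x1e 0x49 0x00 0x00).
lower = decode_day_bound("\u001eI\u0000\u0000".encode("latin-1"))
upper = decode_day_bound("\u001fI\u0000\u0000".encode("latin-1"))
print(lower, upper)  # 2021-04-01 2021-04-02

# Pruning in a nutshell: a manifest is only worth opening when the queried
# day falls inside [lower, upper].
print(lower <= date(2021, 4, 2) <= upper)  # True
```

So the seemingly opaque bound bytes line up exactly with the two partition days you saw in the data directory listing.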

<p>The following command shows the manifest list of the final snapshot, after both operations executed, filtered down to only the fields pointed out above.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &#39;. | {manifest_path: .manifest_path, partition_spec_id: .partition_spec_id, added_snapshot_id: .added_snapshot_id, partitions: .partitions, added_rows_count: .added_rows_count }&#39;
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#34;,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:4564366177504223700
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:1
   }
}
{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:3
   }
}
</code></pre>

<p>In the listing of the manifest list for the last snapshot, you notice that the first operation, in which three rows were inserted, is represented by the second JSON object. You can determine this from the snapshot id, as well as the number of rows that were added in the operation. The first JSON object contains the last operation, which inserted a single row. So the manifests are listed in reverse commit order, with the most recent operation first.</p>

<p>The next command performs the same kind of listing you ran against the manifest list, except this time on the manifest files themselves, to expose their contents. To begin, run the command to show the contents of the manifest file associated with the insertion of three rows.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;status&#34;:1,
   &#34;snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;data_file&#34;:{
      &#34;file_path&#34;:&#34;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#34;,
      &#34;file_format&#34;:&#34;ORC&#34;,
      &#34;partition&#34;:{
         &#34;event_time_day&#34;:{
            &#34;int&#34;:18718
         }
      },
      &#34;record_count&#34;:1,
      &#34;file_size_in_bytes&#34;:870,
      &#34;block_size_in_bytes&#34;:67108864,
      &#34;column_sizes&#34;:null,
      &#34;value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:1
            }
         ]
      },
      &#34;null_value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:0
            }
         ]
      },
      &#34;nan_value_counts&#34;:null,
      &#34;lower_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Oh noes&#34;
            }
         ]
      },
      &#34;upper_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Oh noes&#34;
            }
         ]
      },
      &#34;key_metadata&#34;:null,
      &#34;split_offsets&#34;:null
   }
}
{
   &#34;status&#34;:1,
   &#34;snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;data_file&#34;:{
      &#34;file_path&#34;:&#34;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#34;,
      &#34;file_format&#34;:&#34;ORC&#34;,
      &#34;partition&#34;:{
         &#34;event_time_day&#34;:{
            &#34;int&#34;:18719
         }
      },
      &#34;record_count&#34;:2,
      &#34;file_size_in_bytes&#34;:1084,
      &#34;block_size_in_bytes&#34;:67108864,
      &#34;column_sizes&#34;:null,
      &#34;value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:2
            }
         ]
      },
      &#34;null_value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:0
            }
         ]
      },
      &#34;nan_value_counts&#34;:null,
      &#34;lower_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Double oh noes&#34;
            }
         ]
      },
      &#34;upper_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;WARN&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Maybeh oh noes?&#34;
            }
         ]
      },
      &#34;key_metadata&#34;:null,
      &#34;split_offsets&#34;:null
   }
}
</code></pre>

<p>Now this is a very big output, but in summary, there’s really not too much to these files. As before, there is a <a href="https://iceberg.apache.org/spec/#manifests">Manifest section in the Iceberg spec</a> that details what each of these fields means. Here are the important fields:</p>
<ul><li><code>snapshot_id</code> – Snapshot id in which the file was added (status 1) or deleted (status 2). Inherited from the manifest list when null.</li>
<li><code>data_file</code> – Field containing metadata about the data file the entry pertains to, such as file path, partition tuple, metrics, etc…</li>
<li><code>data_file.file_path</code> – Full URI for the file, including the FS scheme.</li>
<li><code>data_file.partition</code> – Partition data tuple, with a schema based on the partition spec.</li>
<li><code>data_file.record_count</code> – Number of records in the data file.</li>
<li><code>data_file.*_count</code> – Fields that map column ids to the number of values, nulls, and NaNs in the file. These can be used to quickly filter out files that cannot match a query, avoiding unnecessary get operations.</li>
<li><code>data_file.*_bounds</code> – Fields that map column ids to the lower or upper bound of the column, serialized as binary. A lower bound must be less than or equal to, and an upper bound greater than or equal to, all non-null, non-NaN values in the column for the file.</li></ul>
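<p>To make the partition tuple concrete: the <code>day</code> transform stores the number of days since the Unix epoch, so the <code>int</code> values 18718 and 18719 in the entries above decode to the dates in the file paths. A quick sketch of that decoding:</p>

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def day_ordinal_to_date(ordinal: int) -> date:
    """Iceberg's `day` partition transform stores days since the Unix epoch."""
    return EPOCH + timedelta(days=ordinal)

# Values from the two manifest entries above.
print(day_ordinal_to_date(18718))  # 2021-04-01
print(day_ordinal_to_date(18719))  # 2021-04-02
```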

<p>Each manifest entry contains a data file struct along with the partition it maps to. These files are only scanned and returned if the query criteria are met after checking all of the count, bounds, and other statistics recorded in the manifest. Ideally, only files that contain data relevant to the query should be scanned at all. Information like the record count may also help during query planning to determine splits and other details. This particular optimization hasn’t been completed yet, as planning typically happens before traversal of the files. It is still under ongoing discussion and <a href="https://youtu.be/ifXpOn0NJWk?t=2132">is discussed a bit by Iceberg creator Ryan Blue in a recent meetup</a>. If this is something you are interested in, keep posted on the Slack channel and releases as the Trino Iceberg connector progresses in this area.</p>
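<p>To illustrate how those bounds prune files, here is a toy sketch of min/max filtering using the <code>level</code> column bounds (column id 1) from the two manifest entries above. Real engines compare the binary-serialized bounds per the spec; plain strings are used here purely for illustration:</p>

```python
def may_contain(lower, upper, predicate_value):
    """A file can be skipped if the predicate value falls outside
    the [lower, upper] range recorded for the column."""
    if lower is not None and predicate_value < lower:
        return False
    if upper is not None and predicate_value > upper:
        return False
    return True

# (file, lower_bound, upper_bound) for column id 1 ("level"),
# taken from the two manifest entries above.
files = [
    ("2021-04-01 file", "ERROR", "ERROR"),
    ("2021-04-02 file", "ERROR", "WARN"),
]

# A query like `WHERE level = 'WARN'` only needs to scan the second file.
to_scan = [name for name, lo, hi in files if may_contain(lo, hi, "WARN")]
print(to_scan)  # ['2021-04-02 file']
```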

<p>As mentioned above, the last set of files you find in the metadata directory are suffixed with <code>.metadata.json</code>. These files are a bit unusual in that they are stored in the JSON format rather than Avro, because they are not part of the persistent tree structure. They are essentially a copy of the table metadata that is stored in the metastore. You can find the fields for the table metadata listed <a href="https://iceberg.apache.org/spec/#table-metadata-fields">in the Iceberg specification</a>. This metadata is typically stored persistently in a metastore, much like the Hive metastore, but it could easily be replaced by any datastore that supports <a href="https://iceberg.apache.org/spec/#metastore-tables">the atomic swap (check-and-put) operation</a> required for Iceberg’s optimistic concurrency.</p>

<p>The naming of the table metadata includes a table version and UUID:
<code>&lt;table-version&gt;-&lt;UUID&gt;.metadata.json</code>. To commit a new metadata version, which just adds 1 to the current version number, the writer performs these steps:</p>
<ol><li>It creates a new table metadata file using the current metadata.</li>
<li>It writes the new table metadata to a file following the naming with the next version number.</li>

<li><p>It requests the metastore swap the table’s metadata pointer from the old location to the new location.</p>
<ol><li>If the swap succeeds, the commit succeeded. The new file is now the
current metadata.</li>
<li>If the swap fails, another writer has already committed a new version. The
current writer goes back to step 1 and retries.</li></ol></li></ol>
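<p>The commit loop above can be sketched as a compare-and-swap on the metadata pointer. This is a minimal illustration with hypothetical names, not the real Iceberg API:</p>

```python
# Minimal sketch of the optimistic commit loop, assuming a metastore that
# exposes a check-and-put (compare-and-swap) on the table's metadata pointer.
# All names here are hypothetical.

class Metastore:
    def __init__(self, location):
        self.metadata_location = location

    def check_and_put(self, expected, new):
        """Atomically swap the pointer only if it still equals `expected`."""
        if self.metadata_location != expected:
            return False
        self.metadata_location = new
        return True

def commit(store, write_new_metadata):
    while True:
        base = store.metadata_location           # read the current pointer
        new_location = write_new_metadata(base)  # steps 1-2: write the next version
        if store.check_and_put(base, new_location):
            return new_location                  # step 3a: swap succeeded, commit done
        # step 3b: another writer won the race; loop back and retry

store = Metastore("v1.metadata.json")
print(commit(store, lambda base: "v2.metadata.json"))  # v2.metadata.json
```

<p>The whole protocol needs only this one atomic primitive from the metastore, which is why the Hive metastore can be swapped for any store offering check-and-put.</p>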

<p>If you want to see where this is stored in the Hive metastore, you can reference the <code>TABLE_PARAMS</code> table. At the time of writing, this is the only method of using the metastore that is supported by the Trino Iceberg connector.</p>

<pre><code>SELECT PARAM_KEY, PARAM_VALUE FROM metastore.TABLE_PARAMS;
</code></pre>

<p>Result:</p>

<table>
<tr><th>PARAM_KEY</th><th>PARAM_VALUE</th></tr>
<tr><td>EXTERNAL</td><td>TRUE</td></tr>
<tr><td>metadata_location</td><td>s3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json</td></tr>
<tr><td>numFiles</td><td>2</td></tr>
<tr><td>previous_metadata_location</td><td>s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json</td></tr>
<tr><td>table_type</td><td>iceberg</td></tr>
<tr><td>totalSize</td><td>5323</td></tr>
<tr><td>transient_lastDdlTime</td><td>1622865672</td></tr>
</table>

<p>So as you can see, the metastore indicates that the current metadata location is the
<code>00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json</code> file. Now you can dive in to see the table metadata that is being used by the Iceberg connector.</p>

<pre><code>% cat ~/Desktop/avro_files/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;format-version&#34;:1,
   &#34;table-uuid&#34;:&#34;32e3c271-84a9-4be5-9342-2148c878227a&#34;,
   &#34;location&#34;:&#34;s3a://iceberg/logging.db/events&#34;,
   &#34;last-updated-ms&#34;:1622865686323,
   &#34;last-column-id&#34;:5,
   &#34;schema&#34;:{
      &#34;type&#34;:&#34;struct&#34;,
      &#34;fields&#34;:[
         {
            &#34;id&#34;:1,
            &#34;name&#34;:&#34;level&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;string&#34;
         },
         {
            &#34;id&#34;:2,
            &#34;name&#34;:&#34;event_time&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;timestamp&#34;
         },
         {
            &#34;id&#34;:3,
            &#34;name&#34;:&#34;message&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;string&#34;
         },
         {
            &#34;id&#34;:4,
            &#34;name&#34;:&#34;call_stack&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:{
               &#34;type&#34;:&#34;list&#34;,
               &#34;element-id&#34;:5,
               &#34;element&#34;:&#34;string&#34;,
               &#34;element-required&#34;:false
            }
         }
      ]
   },
   &#34;partition-spec&#34;:[
      {
         &#34;name&#34;:&#34;event_time_day&#34;,
         &#34;transform&#34;:&#34;day&#34;,
         &#34;source-id&#34;:2,
         &#34;field-id&#34;:1000
      }
   ],
   &#34;default-spec-id&#34;:0,
   &#34;partition-specs&#34;:[
      {
         &#34;spec-id&#34;:0,
         &#34;fields&#34;:[
            {
               &#34;name&#34;:&#34;event_time_day&#34;,
               &#34;transform&#34;:&#34;day&#34;,
               &#34;source-id&#34;:2,
               &#34;field-id&#34;:1000
            }
         ]
      }
   ],
   &#34;default-sort-order-id&#34;:0,
   &#34;sort-orders&#34;:[
      {
         &#34;order-id&#34;:0,
         &#34;fields&#34;:[
            
         ]
      }
   ],
   &#34;properties&#34;:{
      &#34;write.format.default&#34;:&#34;ORC&#34;
   },
   &#34;current-snapshot-id&#34;:4564366177504223943,
   &#34;snapshots&#34;:[
      {
         &#34;snapshot-id&#34;:6967685587675910019,
         &#34;timestamp-ms&#34;:1622865672882,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;changed-partition-count&#34;:&#34;0&#34;,
            &#34;total-records&#34;:&#34;0&#34;,
            &#34;total-data-files&#34;:&#34;0&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#34;
      },
      {
         &#34;snapshot-id&#34;:2720489016575682283,
         &#34;parent-snapshot-id&#34;:6967685587675910019,
         &#34;timestamp-ms&#34;:1622865680419,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;added-data-files&#34;:&#34;2&#34;,
            &#34;added-records&#34;:&#34;3&#34;,
            &#34;added-files-size&#34;:&#34;1954&#34;,
            &#34;changed-partition-count&#34;:&#34;2&#34;,
            &#34;total-records&#34;:&#34;3&#34;,
            &#34;total-data-files&#34;:&#34;2&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#34;
      },
      {
         &#34;snapshot-id&#34;:4564366177504223943,
         &#34;parent-snapshot-id&#34;:2720489016575682283,
         &#34;timestamp-ms&#34;:1622865686278,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;added-data-files&#34;:&#34;1&#34;,
            &#34;added-records&#34;:&#34;1&#34;,
            &#34;added-files-size&#34;:&#34;746&#34;,
            &#34;changed-partition-count&#34;:&#34;1&#34;,
            &#34;total-records&#34;:&#34;4&#34;,
            &#34;total-data-files&#34;:&#34;3&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#34;
      }
   ],
   &#34;snapshot-log&#34;:[
      {
         &#34;timestamp-ms&#34;:1622865672882,
         &#34;snapshot-id&#34;:6967685587675910019
      },
      {
         &#34;timestamp-ms&#34;:1622865680419,
         &#34;snapshot-id&#34;:2720489016575682283
      },
      {
         &#34;timestamp-ms&#34;:1622865686278,
         &#34;snapshot-id&#34;:4564366177504223943
      }
   ],
   &#34;metadata-log&#34;:[
      {
         &#34;timestamp-ms&#34;:1622865672894,
         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#34;
      },
      {
         &#34;timestamp-ms&#34;:1622865680524,
         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#34;
      }
   ]
}
</code></pre>

<p>As you can see, these JSON files can quickly grow as you perform updates on your table. This file contains a pointer to all of the snapshots and manifest list files, much like the output you found when looking at the snapshots table. A really important piece to note is that the schema is stored here; this is what Trino uses for validation on inserts and reads. As you may expect, there is also the root location of the table itself, as well as a unique table identifier. The final part worth noting is the <code>partition-spec</code> and <code>partition-specs</code> fields. The <code>partition-spec</code> field holds the current partition spec, while <code>partition-specs</code> is an array holding all partition specs that have ever existed for this table. As pointed out earlier, you can have many different manifest files that use different partition specs. That wraps up all of the metadata file types you can expect to see in Iceberg!</p>
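<p>Because this file is plain JSON, walking the snapshot lineage is straightforward. Here is a small sketch that follows the parent pointers from the current snapshot, using a trimmed-down copy of the metadata printed above:</p>

```python
import json

# A trimmed-down copy of the table metadata above, keeping only the
# fields needed to reconstruct the snapshot lineage.
metadata = json.loads("""{
  "current-snapshot-id": 4564366177504223943,
  "snapshots": [
    {"snapshot-id": 6967685587675910019},
    {"snapshot-id": 2720489016575682283,
     "parent-snapshot-id": 6967685587675910019},
    {"snapshot-id": 4564366177504223943,
     "parent-snapshot-id": 2720489016575682283}
  ]
}""")

by_id = {s["snapshot-id"]: s for s in metadata["snapshots"]}

# Follow parent-snapshot-id pointers from the current snapshot to the root.
lineage, sid = [], metadata["current-snapshot-id"]
while sid is not None:
    lineage.append(sid)
    sid = by_id[sid].get("parent-snapshot-id")

print(lineage)  # newest snapshot first, root append last
```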

<p>This post wraps up the Trino on ice series. Hopefully these blog posts serve as a helpful introduction to what is expected to grow into a vital portion of the open data lakehouse stack. What are you waiting for? Come join the fun and help us implement some of the missing features, or go ahead and try <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">Trino on Ice(berg)</a> yourself!</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>
]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-iv-deep-dive</guid>
      <pubDate>Thu, 12 Aug 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</title>
      <link>https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This blog closes up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the Iceberg Specification.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;Concurrency Model&#xA; Some issues with the Hive model are the distinct locations where the metadata is stored and where the data files are stored. Having your data and metadata split up like this is a recipe for disaster when trying to apply updates to both services atomically.&#xA;&#xA; Iceberg metadata diagram of runtime, and file storage&#xA; A very common problem with Hive is that if a writing process failed during insertion, many times you would find the data written to file storage, but the metastore writes failed to occur. Or conversely, the metastore writes were successful, but the data failed to finish writing to file storage due to a  network or file IO failure. There’s a good  Trino Community Broadcast episode that talks about a function in Trino that exists to resolve these issues by syncing the metastore and file storage. 
You can watch  a simulation of this error on that episode.&#xA;&#xA; Aside from having issues due to the split state in the system, there are many  other issues that stem from the file system itself. In the case of HDFS,  depending on the specific filesystem implementation you are using, you may have different atomicity guarantees for various file systems and their operations, such as creating, deleting, and renaming files and directories. HDFS isn’t the only troublemaker here. Other than Amazon S3’s  recent announcement of strong consistency in their S3 service, most object storage systems only offer eventual consistency that may not show the latest files immediately after writes. Despite storage systems showing more progress towards offering better performance and guarantees, these systems still offer no reliable locking mechanism.&#xA;&#xA; Iceberg addresses all of these issues in a multitude of ways. One of the primary ways Iceberg introduces transactional guarantees is by storing the metadata in the same datastore as the data itself. This simplifies handling commit failures down to rolling back on one system rather than trying to coordinate a rollback across two systems like in Hive. Writers independently write their metadata and attempt to perform their operations, needing no coordination with other writers. The only time the writers coordinate is when they attempt to perform a commit of their operations. In order to do a commit, they perform a lock of the current snapshot record in a database. This concurrency model where writers eagerly do the work upfront is called optimistic concurrency control.&#xA; Currently, in Trino, this method still uses the Hive metastore to perform the lock-and-swap operation necessary to coordinate the final commits. Iceberg  creator, Ryan Blue, covers this lock-and-swap mechanism and how the metastore can be replaced with alternate locking methods. 
In the event that two writers attempt to commit at the same time, the writer that first acquires the lock successfully commits by swapping its snapshot as the current snapshot, while the second writer will retry to apply its changes again. The second writer should have no problem with this, assuming there are no conflicting changes between the two snapshots.&#xA;&#xA; &#xA;&#xA; This works similarly to a git workflow where the main branch is the locked resource, and two developers try to commit their changes at the same time. The first developer’s changes may conflict with the second developer’s changes. The second developer is then forced to rebase or merge the first developer’s code with their changes before commiting to the main branch again. The same logic applies to merging data files. Currently, Iceberg clients use a copy-on-write mechanism that makes a new file out of the merged data in the next snapshot. This enables accurate time traveling and preserves previous split versions of the files. At the time of writing, upserts via MERGE INTO syntax are not supported in Trino, but  this is in active development. UPDATE: Since the original writing of this post, the  MERGE syntax exists as of version 393.&#xA;&#xA; One of the great benefits of tracking each individual change that gets written to Iceberg is that you are given a view of the data at every point in time. This enables a really cool feature that I mentioned earlier called time travel.&#xA;&#xA; ## Snapshots and Time Travel&#xA;&#xA; To showcase snapshots, it’s best to go over a few examples drawing from the event table we  created in the previous blog posts. This time we’ll only be working with the Iceberg table, as this capability is not available in Hive. Snapshots allow you to have an immutable set of your data at a given time. They are automatically created on every append or removal of data. 
One thing to note is that for now, they do not store the state of your metadata.&#xA; Say that you have c&#xA; reated your events table and inserted the three initial rows as we did previously. Let’s look at the data we get back and see how to check the existing snapshots in Trino:&#xA;&#xA; &#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;| ERROR | Double oh noes |&#xA;| WARN | Maybeh oh noes? |&#xA;| ERROR | Oh noes |&#xA;&#xA;To query the snapshots, all you need is to use the $ operator appended to the&#xA;end of the table name, and add the hidden table, snapshots:&#xA;&#xA;SELECT snapshotid, parentid, operation&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| snapshotid | parentid | operation |&#xA;| --- | --- | --- |&#xA;| 7620328658793169607 | | append |&#xA;| 2115743741823353537 | 7620328658793169607 | append |&#xA;&#xA;Let’s take a look at the manifest list files that are associated with each &#xA;snapshot ID. 
You can tell which file belongs to which snapshot based on the &#xA;snapshot ID embedded in the filename:&#xA;&#xA;SELECT manifestlist&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| shapshots |&#xA;| --- |&#xA;| s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro | &#xA;| s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro | &#xA;&#xA;Now, let’s insert another row to the table:&#xA;&#xA;INSERT INTO iceberg.logging.events&#xA;VALUES&#xA;(&#xA;‘INFO’,&#xA;timestamp ‘2021-04-02 00:00:11.1122222’,&#xA;‘It is all good’,&#xA;ARRAY [‘Just updating you!’]&#xA;);&#xA;&#xA;Let’s check the snapshot table again:&#xA;&#xA;SELECT snapshotid, parentid, operation&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| snapshotid | parentid | operation |&#xA;| --- | --- | --- |&#xA;| 7620328658793169607 | | append |&#xA;| 2115743741823353537 | 7620328658793169607 | append |&#xA;| 7030511368881343137 | 2115743741823353537 | append |&#xA;&#xA;Let’s also verify that our row was added:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Oh noes |&#xA;|INFO |It is all good |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;&#xA; Since Iceberg is already tracking the list of files added and removed at each snapshot, it would make sense that you can travel back and forth between these different views into the system, right? This concept is called time traveling. You need to specify which snapshot you would like to read from and you will see the view of the data at that timestamp. 
In Trino, you need to use the @ operator, followed by the snapshot you wish to read from:&#xA; &#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.“events@2115743741823353537”;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;|ERROR|Oh noes |&#xA;&#xA; If you determine there is some issue with your data, you can always roll back to the previous state permanently as well. In Trino we have a function called rollbacktosnapshot to move the table state to another snapshot:&#xA; &#xA;CALL system.rollbacktosnapshot(‘logging’, ‘events’, 2115743741823353537);&#xA;&#xA;Now that we have rolled back, observe what happens when we query the events&#xA;table with:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;|ERROR|Oh noes |&#xA; &#xA; Notice the INFO row is still missing even though we query the table without specifying a snapshot id. Now just because we rolled back, doesn’t mean we’ve lost the snapshot we just rolled back from. In fact, we can roll forward, or as I like to call it,  back to the future! 
In Trino, you use the same function call but with a predecessor of the existing snapshot:&#xA; &#xA;CALL system.rollbacktosnapshot(‘logging’, ‘events’, 7030511368881343137)&#xA;&#xA;And now we should be able to query the table again and see the INFO row &#xA;return:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Oh noes |&#xA;|INFO |It is all good |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA; &#xA; As expected, the INFO row returns when you roll back to the future.&#xA; &#xA; Having snapshots not only provides you with a level of immutability that is key to the eventual consistency model, but gives you a rich set of features to version and move between different versions of your data like a git repository.&#xA; &#xA; ## Iceberg Specification&#xA; &#xA; Perhaps saving the best for last, the benefit of using Iceberg is the community that surrounds it, and the support you receive. It can be daunting to have to choose a project that replaces something so core to your architecture. While Hive has so many drawbacks, one of the things keeping many companies locked in is the fear of the unknown. How do you know which table format to choose? Are there unknown data corruption issues that I’m about to take on? What if this doesn’t scale like it promises on the label? It is worth noting that  alternative table formats are also emerging in this space  and we encourage you to investigate these for your own use cases. When sitting down with Iceberg creator, Ryan Blue,  comparing Iceberg to other table formats,  he claims the community’s greatest strength is their ability to look forward. They intentionally broke compatibility with Hive to enable them to provide a richer level of features. Unlike Hive, the Iceberg project explained their thinking in a spec.&#xA;&#xA; The strongest argument I can see for Iceberg is that it has a specification. 
This is something that has largely been missing from Hive and shows a real maturity in how the Iceberg community has approached the issue. On the Trino project, we think standards are important. We adhere to many of them ourselves, such as the ANSI SQL syntax, and exposing the client through a JDBC connection. By creating a standard around this, you’re no longer tied to any particular technology, not even Iceberg itself. You are adhering to a standard that will hopefully become the de facto standard over a decade or two, much like Hive did. Having the standard in clear writing invites multiple communities to the table and brings even more use  cases. Doing so improves the standards and therefore the technologies that implement them.&#xA; &#xA; The previous three blog posts of this series covered the features and massive benefits from using this novel table format. The following post will dive deeper and discuss more about how Iceberg achieves some of this functionality, with an overview into some of the internals and metadata layouts. In the meantime, feel free to try  Trino on Ice(berg).&#xA;&#xA;#trino #iceberg&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This blog closes up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the <a href="https://iceberg.apache.org/spec/">Iceberg Specification</a>.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<h2 id="concurrency-model">Concurrency Model</h2>

<p> One fundamental issue with the Hive model is that the metadata and the data files are stored in distinct locations. Having your data and metadata split up like this is a recipe for disaster when trying to apply updates to both services atomically.</p>

<p> <img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metadata.png" alt="Iceberg metadata diagram of runtime, and file storage"/>
 A very common problem with Hive is that if a writing process failed during insertion, many times you would find the data written to file storage, but the metastore writes failed to occur. Or conversely, the metastore writes were successful, but the data failed to finish writing to file storage due to a  network or file IO failure. There’s a good  <a href="https://trino.io/episodes/5.html">Trino Community Broadcast episode</a> that talks about a function in Trino that exists to resolve these issues by syncing the metastore and file storage. You can watch  <a href="https://www.youtube.com/watch?v=OXyJFZSsX5w&amp;t=2097s">a simulation of this error</a> on that episode.</p>

<p> Aside from having issues due to the split state in the system, there are many  other issues that stem from the file system itself. In the case of HDFS,  depending on the specific filesystem implementation you are using, you may have <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem">different atomicity guarantees for various file systems and their operations</a>, such as creating, deleting, and renaming files and directories. HDFS isn’t the only troublemaker here. Other than Amazon S3’s  <a href="https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3-now-delivers-strong-read-after-write-consistency-automatically-for-all-applications/">recent announcement of strong consistency in their S3 service,</a> most object storage systems only offer <em>eventual</em> consistency that may not show the latest files immediately after writes. Despite storage systems showing more progress towards offering better performance and guarantees, these systems still offer no reliable locking mechanism.</p>

<p> Iceberg addresses all of these issues in a multitude of ways. One of the primary ways Iceberg introduces transactional guarantees is by storing the metadata in the same datastore as the data itself. This simplifies handling commit failures down to rolling back on one system, rather than trying to coordinate a rollback across two systems as in Hive. Writers independently write their metadata and attempt to perform their operations, needing no coordination with other writers. The only time writers coordinate is when they attempt to commit their operations. To commit, they acquire a lock on the current snapshot record in a database. This concurrency model, where writers eagerly do the work upfront, is called <strong><em>optimistic concurrency control</em></strong>.
 Currently, in Trino, this method still uses the Hive metastore to perform the lock-and-swap operation necessary to coordinate the final commits. Iceberg creator <a href="https://www.linkedin.com/in/rdblue/">Ryan Blue</a> <a href="https://youtu.be/-iIY2sOFBRc?t=1351">covers this lock-and-swap mechanism</a> and how the metastore can be replaced with alternate locking methods. In the event that <a href="https://iceberg.apache.org/reliability/#concurrent-write-operations">two writers attempt to commit at the same time</a>, the writer that acquires the lock first successfully commits by swapping in its snapshot as the current snapshot, while the second writer must retry applying its changes. The second writer should have no problem with this, assuming there are no conflicting changes between the two snapshots.</p>
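<p>The commit protocol described above can be sketched as a compare-and-swap on the table’s current-snapshot pointer. The following Python toy model (all names are made up for illustration; this is not Iceberg’s or Trino’s actual code) shows how the loser of a concurrent commit detects the conflict and retries against the new base snapshot:</p>

```python
import threading

class ToyCatalog:
    """Toy stand-in for the catalog tracking a table's current snapshot."""

    def __init__(self, current_snapshot_id):
        self.current_snapshot_id = current_snapshot_id
        self._lock = threading.Lock()  # plays the role of the metastore lock

    def commit(self, expected_parent_id, new_snapshot_id):
        """Lock-and-swap: install the new snapshot only if nobody committed first."""
        with self._lock:
            if self.current_snapshot_id != expected_parent_id:
                return False  # another writer won the race; caller must retry
            self.current_snapshot_id = new_snapshot_id
            return True

def write_with_retries(catalog, next_snapshot_id, max_attempts=3):
    """Do the write work optimistically, then retry only the commit on conflict."""
    for _ in range(max_attempts):
        parent = catalog.current_snapshot_id  # base snapshot for this attempt
        new_id = next_snapshot_id(parent)     # "write" data and metadata files
        if catalog.commit(parent, new_id):    # attempt the lock-and-swap
            return new_id
    raise RuntimeError("commit failed after retries")

catalog = ToyCatalog(current_snapshot_id=100)
w2_parent = catalog.current_snapshot_id        # writer 2 reads its base snapshot...
assert catalog.commit(100, 101)                # ...but writer 1 commits first
assert not catalog.commit(w2_parent, 102)      # writer 2's first attempt is rejected
result = write_with_retries(catalog, lambda parent: parent + 1)
print(result)  # 102 — writer 2 succeeds after rebasing onto snapshot 101
```

<p>The key property is that only the pointer swap is serialized; all of the expensive file writing happens before the lock is ever taken.</p>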

<p> <img src="https://trino.io/assets/blog/trino-on-ice/iceberg-files.png" alt=""/></p>

<p> This works similarly to a git workflow where the main branch is the locked resource and two developers try to commit their changes at the same time. The first developer’s changes may conflict with the second developer’s changes. The second developer is then forced to rebase or merge the first developer’s code with their changes before committing to the main branch again. The same logic applies to merging data files. Currently, Iceberg clients use a <a href="https://iceberg.apache.org/reliability/#concurrent-write-operations">copy-on-write mechanism</a> that makes a new file out of the merged data in the next snapshot. This enables accurate time traveling and preserves previous split versions of the files. At the time of writing, upserts via <code>MERGE INTO</code> syntax are not supported in Trino, but <a href="https://github.com/trinodb/trino/issues/7708">this is in active development</a>. <strong><em>UPDATE:</em></strong> Since the original writing of this post, the <a href="https://github.com/trinodb/trino/pull/7933"><code>MERGE</code> syntax exists as of version 393</a>.</p>

<p> One of the great benefits of tracking each individual change that gets written to Iceberg is that you are given a view of the data at every point in time. This enables a really cool feature that I mentioned earlier called <strong><em>time travel</em></strong>.</p>

<h2 id="snapshots-and-time-travel">Snapshots and Time Travel</h2>

<p> To showcase snapshots, it’s best to go over a few examples drawing from the events table we created in the previous blog posts. This time we’ll only be working with the Iceberg table, as this capability is not available in Hive. Snapshots allow you to have an immutable set of your data at a given time. They are automatically created on every append or removal of data. One thing to note is that, for now, they do not store the state of your metadata. Say that you have created your events table and inserted the three initial rows as we did previously. Let’s look at the data we get back and see how to check the existing snapshots in Trino:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p>To query the snapshots, append the <code>$</code> operator to the end of the
table name, followed by the hidden table name, <code>snapshots</code>:</p>

<pre><code>SELECT snapshot_id, parent_id, operation
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>snapshot_id</th>
<th>parent_id</th>
<th>operation</th>
</tr>
</thead>

<tbody>
<tr>
<td>7620328658793169607</td>
<td></td>
<td>append</td>
</tr>

<tr>
<td>2115743741823353537</td>
<td>7620328658793169607</td>
<td>append</td>
</tr>
</tbody>
</table>

<p>Let’s take a look at the manifest list files that are associated with each
snapshot ID. You can tell which file belongs to which snapshot based on the
snapshot ID embedded in the filename:</p>

<pre><code>SELECT manifest_list
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>manifest_list</th>
</tr>
</thead>

<tbody>
<tr>
<td>s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro</td>
</tr>

<tr>
<td>s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro</td>
</tr>
</tbody>
</table>
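<p>As the paths above show, each manifest-list filename follows the pattern <code>snap-&lt;snapshot-id&gt;-&lt;attempt&gt;-&lt;uuid&gt;.avro</code>, so the owning snapshot can be recovered from the name alone. A quick Python sketch (the helper name is hypothetical, purely for illustration):</p>

```python
import re

def snapshot_id_from_manifest_list(path):
    """Extract the snapshot ID embedded in a manifest-list filename.

    Filenames look like snap-<snapshot-id>-<attempt>-<uuid>.avro,
    as seen in the query results above.
    """
    match = re.search(r"snap-(\d+)-\d+-[0-9a-f-]+\.avro$", path)
    if match is None:
        raise ValueError(f"not a manifest-list path: {path}")
    return int(match.group(1))

path = ("s3a://iceberg/logging.db/events/metadata/"
        "snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro")
print(snapshot_id_from_manifest_list(path))  # 7620328658793169607
```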

<p>Now, let’s insert another row to the table:</p>

<pre><code>INSERT INTO iceberg.logging.events
VALUES
(
'INFO',
timestamp '2021-04-02 00:00:11.112222',
'It is all good',
ARRAY ['Just updating you!']
);
</code></pre>

<p>Let’s check the snapshot table again:</p>

<pre><code>SELECT snapshot_id, parent_id, operation
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>snapshot_id</th>
<th>parent_id</th>
<th>operation</th>
</tr>
</thead>

<tbody>
<tr>
<td>7620328658793169607</td>
<td></td>
<td>append</td>
</tr>

<tr>
<td>2115743741823353537</td>
<td>7620328658793169607</td>
<td>append</td>
</tr>

<tr>
<td>7030511368881343137</td>
<td>2115743741823353537</td>
<td>append</td>
</tr>
</tbody>
</table>

<p>Let’s also verify that our row was added:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>

<tr>
<td>INFO</td>
<td>It is all good</td>
</tr>

<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>
</tbody>
</table>

<p> Since Iceberg is already tracking the list of files added and removed at each snapshot, it would make sense that you can travel back and forth between these different views into the system, right? This concept is called time traveling. You specify which snapshot you would like to read from, and you see the view of the data as of that snapshot. In Trino, you use the <code>@</code> operator, followed by the ID of the snapshot you wish to read from:</p>

<pre><code>SELECT level, message
FROM iceberg.logging."events@2115743741823353537";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p> If you determine there is some issue with your data, you can also roll back to a previous state permanently. In Trino we have a procedure called <code>rollback_to_snapshot</code> to move the table state to another snapshot:</p>

<pre><code>CALL system.rollback_to_snapshot('logging', 'events', 2115743741823353537);
</code></pre>

<p>Now that we have rolled back, observe what happens when we query the events
table with:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p> Notice the <code>INFO</code> row is still missing, even though we query the table without specifying a snapshot ID. Just because we rolled back doesn’t mean we’ve lost the snapshot we rolled back from. In fact, we can roll forward, or as I like to call it, <a href="https://en.wikipedia.org/wiki/Back_to_the_Future">back to the future</a>! In Trino, you use the same procedure call, but with a later snapshot that descends from the current one:</p>

<pre><code>CALL system.rollback_to_snapshot('logging', 'events', 7030511368881343137);
</code></pre>

<p>And now we should be able to query the table again and see the <code>INFO</code> row
return:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>

<tr>
<td>INFO</td>
<td>It is all good</td>
</tr>

<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>
</tbody>
</table>

<p> As expected, the <code>INFO</code> row returns when you roll back to the future.</p>

<p> Having snapshots not only provides a level of immutability that sidesteps the consistency problems described earlier, but also gives you a rich set of features to version your data and move between those versions, much like a git repository.</p>

<h2 id="iceberg-specification">Iceberg Specification</h2>

<p> Perhaps saving the best for last, a major benefit of using Iceberg is the community that surrounds it and the support you receive. It can be daunting to choose a project that replaces something so core to your architecture. While Hive has many drawbacks, one of the things keeping many companies locked in is the fear of the unknown. How do you know which table format to choose? Are there unknown data corruption issues you’re about to take on? What if this doesn’t scale like it promises on the label? It is worth noting that <a href="https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/">alternative table formats are also emerging in this space</a>, and we encourage you to investigate these for your own use cases. When sitting down with Iceberg creator Ryan Blue, <a href="https://www.twitch.tv/videos/989098630">comparing Iceberg to other table formats</a>, he claimed the community’s greatest strength is its ability to look forward. They intentionally broke compatibility with Hive to enable a richer level of features. Unlike Hive, the Iceberg project wrote its thinking down in a spec.</p>

<p> The strongest argument I can see for Iceberg is that it has a <a href="https://iceberg.apache.org/spec/">specification</a>. This is something that has largely been missing from Hive and shows a real maturity in how the Iceberg community has approached the issue. On the Trino project, we think standards are important. We adhere to many of them ourselves, such as ANSI SQL syntax and exposing the client through a JDBC connection. By creating a standard around this, you’re no longer tied to any particular technology, not even Iceberg itself. You are adhering to a standard that will hopefully become the de facto standard over a decade or two, much like Hive did. Having the standard in clear writing invites multiple communities to the table and brings even more use cases. Doing so improves the standards and therefore the technologies that implement them.</p>

<p> The previous three blog posts of this series covered the features and massive benefits of using this novel table format. The following post dives deeper into how Iceberg achieves some of this functionality, with an overview of some of the internals and metadata layouts. In the meantime, feel free to try <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">Trino on Ice(berg)</a>.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec</guid>
      <pubDate>Fri, 30 Jul 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</title>
      <link>https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;The first post covered how Iceberg is a table format and not a file format It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or go back to the first post before starting this one. This post discusses evolution. No, the post isn’t covering Darwinian nor Pokémon evolution, but in-place table evolution! &#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;You may find it a little odd that I am getting excited over tables evolving &#xA;in-place, but as mentioned in the last post, if you have experience performing table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander evolved into Charmeleon discovering that Iceberg supports Partition evolution and schema evolution. That is, until Charmeleon started treating Ash like a jerk after the evolution from Charmander. Hopefully, you won’t face the same issue when your tables evolve. &#xA;&#xA;Another important aspect that is covered, is how Iceberg is developed with cloud storage in mind. Hive and other data lake technologies were developed with file systems as their primary storage layer. 
This is still a very common layer today, but as more companies move to include object storage, table formats did not adapt to the needs of object stores. Let’s dive in!&#xA;&#xA;Partition Specification evolution&#xA;&#xA;In Iceberg, you are able to update the partition specification, shortened to partition spec in Iceberg, on a live table. You do not need to perform a table migration as you do in Hive. In Hive, partition specs don’t explicitly exist because they are tightly coupled with the creation of the Hive table. Meaning, if you ever need to change the granularity of your data partitions at any point, you need to create an entirely new table, and move all the data to the new partition granularity you desire. No pressure on choosing the right granularity or anything!&#xA;&#xA;In Iceberg, you’re not required to choose the perfect partition specification upfront, and you can have multiple partition specs in the same table, and query across the different sized partition specs. How great is that! This means, if you’re initially partitioning your data by month, and later you decide to move to a daily partitioning spec due to a growing ingest from all your new customers, you can do so with no migration, and query over the table with no issue. &#xA;&#xA;This is conveyed pretty succinctly in this graphic from the Iceberg &#xA;documentation. At the end of the year 2008, partitioning occurs at a monthly granularity and after 2009, it moves to a daily granularity. When the query to pull data from December 14th, 2008 and January 13th, 2009, the entire month of December gets scanned due to the monthly partition, but for the dates in January, only the first 13 days are scanned to answer the query.&#xA;&#xA;At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist. There are efforts to add this support in the near future. 
Edit: this has since been merged!&#xA;&#xA;Schema evolution&#xA;&#xA;Iceberg also handles schema evolution much more elegantly than Hive. In Hive, adding columns worked well enough, as data inserted before the schema change just reports null for that column. For formats that use column names, like ORC and Parquet, deletes are also straightforward for Hive, as it simply ignores fields that are no longer part of the table. For unstructured files like CSV that use the position of the column, deletes would still cause issues, as deleting one column shifts the rest of the columns. Renames for schemas pose an issue for all formats in Hive as data written prior to the rename is not modified to the new field. This effectively works the same as if you deleted the old field and added a new column with the new name. This lack of support for schema. evolution across various file types in Hive requires a lot of memorizing&#xA;the formats underneath various tables. This is very susceptible to causing user errors if someone executes one of the unsupported operations on the wrong table.&#xA;&#xA;table&#xA;thead&#xA;  tr&#xA;    th colspan=&#34;4&#34;Hive 2.2.0 schema evolution based on file type and operation./th&#xA;  /tr&#xA;/thead&#xA;tbody&#xA;  tr&#xA;    td/td&#xA;    tdAdd/td&#xA;    tdDelete/td&#xA;    tdRename/td&#xA;  /tr&#xA;  tr&#xA;    tdCSV/TSV/td&#xA;    td✅/td&#xA;    td❌/td&#xA;    td❌/td&#xA;  /tr&#xA;  tr&#xA;    tdJSON/td&#xA;    td✅/td&#xA;    td✅/td&#xA;    td❌/td&#xA;  /tr&#xA;  tr&#xA;    tdORC/Parquet/Avro/td&#xA;    td✅/td&#xA;    td✅/td&#xA;    td❌/td&#xA;  /tr&#xA;/tbody&#xA;/table&#xA;&#xA;Currently in Iceberg, schemaless position-based data formats such as CSV and TSVare not supported, though there are some discussions on adding limited support for them. This would be good from a reading standpoint, to load data from the CSV, into an Iceberg format with all the guarantees that Iceberg offers. 
&#xA;&#xA;While JSON doesn’t rely on positional data, it does have an explicit dependency on names. This means, that if I remove a text column from a JSON table named severity, then later I want to add a new int column called severity, I encounter an error when I try to read in the data with the string type from before when I try to deserialize the JSON files. Even worse would be if the new severity column you add has the same type as the original but a semantically different meaning. This results in old rows containing values that are unknowingly from a different domain, which can lead to wrong analytics. After all, someone who adds the new severity column might not even be aware of the old severity column, if it was quite some time ago when it was dropped.&#xA;&#xA;ORC, Parquet, and Avro do not suffer from these issues as they are columnar formats that keep a schema internal to the file itself, and each format tracks changes to the columns through IDs rather than name values or position. Iceberg uses these unique column IDs to also keep track of the columns as changes are applied.&#xA;&#xA;In general, Iceberg can only allow this small set of file formats due to the correctness guarantees it provides. In Trino, you can add, delete, or rename columns using the ALTER TABLE command. Here’s an example that continues from the table created  in the last post  that inserted three rows. 
The DDL statement looked like this.&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP(6), &#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;);&#xA;&#xA;Here is an ALTER TABLE sequence that adds a new column named severity, inserts data including into the new column, renames the column, and prints the data.&#xA;&#xA;ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; &#xA;&#xA;INSERT INTO iceberg.logging.events VALUES &#xA;(&#xA;  &#39;INFO&#39;, &#xA;  timestamp &#xA;  &#39;2021-04-01 19:59:59.999999&#39; AT TIME ZONE &#39;America/LosAngeles&#39;, &#xA;  &#39;es muy bueno&#39;, &#xA;  ARRAY [&#39;It is all normal&#39;], &#xA;  1&#xA;);&#xA;&#xA;ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;&#xA;&#xA;SELECT level, message, priority&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level |  message | priority |&#xA;| --- | --- | --- |&#xA;| ERROR | Double oh noes | NULL |&#xA;| WARN | Maybeh oh noes? | NULL |&#xA;| ERROR | Oh noes | NULL |&#xA;| INFO | es muy bueno | 1 |&#xA;&#xA;ALTER TABLE iceberg.logging.events &#xA;DROP COLUMN priority;&#xA;&#xA;SHOW CREATE TABLE iceberg.logging.events;&#xA;&#xA;Result&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;   level varchar,&#xA;   eventtime timestamp(6),&#xA;   message varchar,&#xA;   callstack array(varchar)&#xA;)&#xA;WITH (&#xA;   format = &#39;ORC&#39;,&#xA;   partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;)&#xA;&#xA;Notice how the priority and severity columns are both not present in the schema. As noted in the table above, Hive renames cause issues for all file formats. 
Yet in Iceberg, performing all these operations causes no issues with the table and underlying data.&#xA;&#xA;Cloud storage compatibility&#xA;&#xA;Not all developers consider or are aware of the performance implications of using Hive over a cloud object storage solution like S3 or Azure Blob storage. One thing to remember is that Hive was developed with the Hadoop Distributed File System (HDFS) in mind. HDFS is a filesystem and is particularly well suited to handle listing files on the filesystem, because they were stored in a contiguous manner. When Hive stores data associated with a table, it assumes there is a contiguous layout underneath it and performs list operations that are expensive on cloud storage systems.&#xA;&#xA;The common cloud storage systems are typically object stores that do not lay out the files in a contiguous manner based on paths. Therefore, it becomes very expensive to list out all the files in a particular path. Yet, these list operations are executed for every partition that could be included in a query, regardless of only a single row, in a single file out of thousands of files needing to be retrieved to answer the query. Even ignoring the performance costs for a minute, object stores may also pose issues for Hive due to eventual  consistency. Inserting and deleting can cause inconsistent results for readers, if the files you end up reading are out of date. &#xA;&#xA;Iceberg avoids all of these issues by tracking the data at the file level, &#xA;rather than the partition level. By tracking the files, Iceberg only accesses the files containing data relevant to the query, as opposed to accessing files in the same partition looking for the few files that are relevant to the query. Further, this allows Iceberg to control for the inconsistency issue in cloud-based file systems by using a locking mechanism at the file level. See the file layout below that Hive layout versus the Iceberg layout. 
As you can see in the next image, Iceberg makes no assumptions about the data being contiguous or not. It simply builds a persistent tree using the snapshot (S) location stored in the metadata, that points to the manifest list (ML), which points to &#xA;manifests containing partitions (P). Finally, these manifest files contain the file (F) locations and stats that can quickly be used to prune data versus &#xA;needing to do a list operation and scanning all the files.&#xA;&#xA;Referencing the picture above, if you were to run a query where the result set only contains rows from file F1, Hive would require a list operation and scanning the files, F2 and F3. In Iceberg, file metadata exists in the manifest file, P1, that would have a range on the predicate field that prunes out files F2 and F3, and only scans file F1. This example only shows a couple of files, but imagine storage that scales up to thousands of files! Listing becomes expensive on files that are not contiguously stored in memory. Having this flexibility in the logical layout is essential to increase query performance. This is especially true on cloud object stores.&#xA;&#xA;If you want to play around with Iceberg using Trino, check out the &#xA;Trino Iceberg docs. To avoid issues like the eventual consistency issue, as well as other problems of trying to sync operations across systems, Iceberg provides optimistic concurrency support, which is covered in more detail in&#xA;the next post. &#xA;&#xA;#trino #iceberg&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p><a href="https://bitsondata.dev/trino-on-ice-i-a-gentle-introduction-to-iceberg">The first post</a> covered how Iceberg is a table format and not a file format. It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or go back to the first post before starting this one. This post discusses evolution. No, the post isn’t covering Darwinian nor Pokémon evolution, but in-place table evolution!</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p><img src="https://trino.io/assets/blog/trino-on-ice/evolution.gif" alt=""/></p>

<p>You may find it a little odd that I am getting excited over tables evolving
in-place, but as mentioned in the last post, if you have experience performing table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander evolved into Charmeleon discovering that Iceberg supports Partition evolution and schema evolution. That is, until Charmeleon started treating Ash like a jerk after the evolution from Charmander. Hopefully, you won’t face the same issue when your tables evolve.</p>

<p>Another important aspect that is covered, is how Iceberg is developed with cloud storage in mind. Hive and other data lake technologies were developed with file systems as their primary storage layer. This is still a very common layer today, but as more companies move to include object storage, table formats did not adapt to the needs of object stores. Let’s dive in!</p>

<h2 id="partition-specification-evolution">Partition Specification evolution</h2>

<p>In Iceberg, you are able to update the partition specification, shortened to partition spec in Iceberg, on a live table. You do not need to perform a table migration as you do in Hive. In Hive, partition specs don’t explicitly exist because they are tightly coupled with the creation of the Hive table. Meaning, if you ever need to change the granularity of your data partitions at any point, you need to create an entirely new table, and move all the data to the new partition granularity you desire. No pressure on choosing the right granularity or anything!</p>

<p>In Iceberg, you’re not required to choose the perfect partition specification upfront, and you can have multiple partition specs in the same table, and query across the different sized partition specs. How great is that! This means, if you’re initially partitioning your data by month, and later you decide to move to a daily partitioning spec due to a growing ingest from all your new customers, you can do so with no migration, and query over the table with no issue.</p>

<p>This is conveyed pretty succinctly in this graphic from the Iceberg
documentation. Through the end of 2008, partitioning occurs at a monthly granularity, and starting in 2009, it moves to a daily granularity. When a query pulls data from December 14th, 2008 through January 13th, 2009, the entire month of December gets scanned due to the monthly partition, but for January, only the first 13 days are scanned to answer the query.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/partition-spec-evolution.png" alt=""/></p>
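<p>The pruning in that graphic can be modeled with a short sketch. Assume, purely for illustration, a toy partition index (not Iceberg’s planner): files written under the old spec carry month-granularity date ranges, files under the new spec carry day-granularity ranges, and only partitions whose range overlaps the query survive pruning:</p>

```python
import calendar
from datetime import date

# Toy partition index: monthly partitions for 2008 (old spec) and
# daily partitions for January 2009 (new spec). Each entry records
# the [start, end] date range its data files cover.
partitions = []
for month in range(1, 13):
    last_day = calendar.monthrange(2008, month)[1]
    partitions.append((f"2008-{month:02d}", date(2008, month, 1),
                       date(2008, month, last_day)))
for day in range(1, 32):
    d = date(2009, 1, day)
    partitions.append((d.isoformat(), d, d))

def partitions_to_scan(query_start, query_end):
    """Prune: keep only partitions whose range overlaps the query range."""
    return [name for name, start, end in partitions
            if start <= query_end and end >= query_start]

# Query December 14th, 2008 through January 13th, 2009: the whole December
# monthly partition is scanned, plus exactly 13 daily partitions.
scanned = partitions_to_scan(date(2008, 12, 14), date(2009, 1, 13))
print(scanned[0], scanned[-1], len(scanned))  # 2008-12 2009-01-13 14
```

<p>Both specs are answered by the same query with no migration; the coarser December partition simply scans more data than strictly necessary.</p>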

<p>At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist. <a href="https://github.com/trinodb/trino/issues/7580">There are efforts to add this support in the near future</a>. Edit: this has since been merged!</p>

<h2 id="schema-evolution">Schema evolution</h2>

<p>Iceberg also handles schema evolution much more elegantly than Hive. In Hive, adding columns worked well enough, as data inserted before the schema change just reports null for that column. For formats that use column names, like ORC and Parquet, deletes are also straightforward for Hive, as it simply ignores fields that are no longer part of the table. For unstructured files like CSV that use the position of the column, deletes would still cause issues, as deleting one column shifts the rest of the columns. Renames pose an issue for all formats in Hive, as data written prior to the rename is not migrated to the new field name. This effectively works the same as if you deleted the old field and added a new column with the new name. This lack of support for schema evolution across various file types in Hive requires a lot of memorizing
the formats underneath various tables. This is very susceptible to causing user errors if someone executes one of the unsupported operations on the wrong table.</p>

<table>
<thead>
  <tr>
    <th colspan="4">Hive 2.2.0 schema evolution based on file type and operation.</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td></td>
    <td>Add</td>
    <td>Delete</td>
    <td>Rename</td>
  </tr>
  <tr>
    <td>CSV/TSV</td>
    <td>✅</td>
    <td>❌</td>
    <td>❌</td>
  </tr>
  <tr>
    <td>JSON</td>
    <td>✅</td>
    <td>✅</td>
    <td>❌</td>
  </tr>
  <tr>
    <td>ORC/Parquet/Avro</td>
    <td>✅</td>
    <td>✅</td>
    <td>❌</td>
  </tr>
</tbody>
</table>

<p>Currently in Iceberg, schemaless position-based data formats such as CSV and TSV are not supported, though there are <a href="https://github.com/apache/iceberg/issues/118">some discussions on adding limited support for them</a>. This would be good from a reading standpoint, to load data from CSV into an Iceberg format with all the guarantees that Iceberg offers.</p>

<p>While JSON doesn’t rely on positional data, it does have an explicit dependency on names. This means, that if I remove a text column from a JSON table named <code>severity</code>, then later I want to add a new int column called <code>severity</code>, I encounter an error when I try to read in the data with the string type from before when I try to deserialize the JSON files. Even worse would be if the new <code>severity</code> column you add has the same type as the original but a semantically different meaning. This results in old rows containing values that are unknowingly from a different domain, which can lead to wrong analytics. After all, someone who adds the new <code>severity</code> column might not even be aware of the old <code>severity</code> column, if it was quite some time ago when it was dropped.</p>

<p>ORC, Parquet, and Avro do not suffer from these issues, as each of these formats embeds a schema in the file itself, and each tracks changes to columns through IDs rather than names or positions. Iceberg uses these unique column IDs to keep track of columns as changes are applied.</p>
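<p>The ID-based tracking can be sketched in a few lines. This is a toy model of the idea, with made-up field IDs and values, not Iceberg's actual data structures:</p>

```python
# Toy model of ID-based column tracking: data files store values keyed
# by a stable field ID, while the table schema maps the *current*
# column name to that ID.

# A data file written under the original schema, keyed by field IDs.
data_file = {1: "ERROR", 5: 3}  # 1 = level, 5 = severity

schema_v1 = {"level": 1, "severity": 5}
# Renaming 'severity' to 'priority' only changes the name-to-ID mapping;
# existing data files are untouched.
schema_v2 = {"level": 1, "priority": 5}

def read_column(schema, data, name):
    """Resolve a column through its stable ID, so old files still match."""
    return data.get(schema[name])

assert read_column(schema_v2, data_file, "priority") == 3
assert read_column(schema_v2, data_file, "level") == "ERROR"
```

Because the file never stores the column name, a rename, or even a drop followed by an add (which assigns a fresh ID), can never accidentally resurrect old values.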

<p>In general, Iceberg can only allow this small set of file formats due to the <a href="https://iceberg.apache.org/evolution/#correctness">correctness guarantees</a> it provides. In Trino, you can add, delete, or rename columns using the <code>ALTER TABLE</code> command. Here’s an example that continues from the table created in <a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">the last post</a>, which inserted three rows. The DDL statement looked like this.</p>

<pre><code>CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6), 
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &#39;ORC&#39;,
  partitioning = ARRAY[&#39;day(event_time)&#39;]
);
</code></pre>

<p>Here is an <code>ALTER TABLE</code> sequence that adds a new column named <code>severity</code>, inserts a row that includes a value for the new column, renames the column, and queries the data.</p>

<pre><code>ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; 

INSERT INTO iceberg.logging.events VALUES 
(
  &#39;INFO&#39;, 
  timestamp &#39;2021-04-01 19:59:59.999999&#39; AT TIME ZONE &#39;America/Los_Angeles&#39;, 
  &#39;es muy bueno&#39;, 
  ARRAY [&#39;It is all normal&#39;], 
  1
);

ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;

SELECT level, message, priority
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
<th>priority</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
<td>NULL</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
<td>NULL</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
<td>NULL</td>
</tr>

<tr>
<td>INFO</td>
<td>es muy bueno</td>
<td>1</td>
</tr>
</tbody>
</table>

<pre><code>ALTER TABLE iceberg.logging.events 
DROP COLUMN priority;

SHOW CREATE TABLE iceberg.logging.events;
</code></pre>

<p>Result:</p>

<pre><code>CREATE TABLE iceberg.logging.events (
   level varchar,
   event_time timestamp(6),
   message varchar,
   call_stack array(varchar)
)
WITH (
   format = &#39;ORC&#39;,
   partitioning = ARRAY[&#39;day(event_time)&#39;]
)
</code></pre>

<p>Notice that neither the <code>priority</code> nor the <code>severity</code> column is present in the schema. As noted in the table above, renames cause issues for all file formats in Hive. Yet in Iceberg, performing all of these operations causes no issues with the table or the underlying data.</p>

<h2 id="cloud-storage-compatibility">Cloud storage compatibility</h2>

<p>Not all developers consider, or are even aware of, the performance implications of using Hive over a cloud object storage solution like S3 or Azure Blob Storage. One thing to remember is that Hive was developed with the Hadoop Distributed File System (HDFS) in mind. HDFS is a true filesystem and is particularly well suited to listing files, because they are stored in a contiguous manner. When Hive stores data associated with a table, it assumes a contiguous layout underneath and performs list operations that are expensive on cloud storage systems.</p>

<p>The common cloud storage systems are typically object stores that do not lay out files contiguously based on paths. It therefore becomes very expensive to list all the files under a particular path. Yet these list operations are executed for every partition that could be included in a query, even when only a single row in a single file out of thousands needs to be retrieved to answer it. Performance costs aside, object stores may also pose issues for Hive due to eventual consistency: inserts and deletes can cause inconsistent results for readers if the files you end up reading are out of date.</p>

<p>Iceberg avoids all of these issues by tracking the data at the file level, rather than the partition level. By tracking files, Iceberg only accesses the files containing data relevant to the query, rather than scanning everything in a partition looking for the few files that matter. Further, this allows Iceberg to control for the inconsistency issue in cloud-based file systems by using a locking mechanism at the file level. See the file layout below, comparing the Hive layout with the Iceberg layout. As the next image shows, Iceberg makes no assumptions about whether the data is contiguous. It simply builds a persistent tree using the snapshot (S) location stored in the metadata, which points to the manifest list (ML), which points to manifests containing partitions (P). Finally, these manifest files contain the file (F) locations and stats that can quickly be used to prune data, versus needing to do a list operation and scan all the files.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/cloud-file-layout.png" alt=""/></p>

<p>Referencing the picture above, if you run a query whose result set only contains rows from file F1, Hive requires a list operation and then scans files F2 and F3 as well. In Iceberg, file metadata exists in the manifest file P1, which holds a range on the predicate field that prunes out files F2 and F3, so only file F1 is scanned. This example only shows a couple of files, but imagine storage that scales up to thousands of files! Listing becomes expensive for files that are not stored contiguously, so having this flexibility in the logical layout is essential to increasing query performance, especially on cloud object stores.</p>
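<p>The pruning logic above can be sketched as follows. This is a simplified model with invented min/max stats, not Iceberg's real manifest format:</p>

```python
# Toy sketch of stats-based file pruning: each manifest entry carries
# min/max stats for a column, so a query engine can skip files without
# listing or opening them.

manifest_p1 = [
    {"file": "F1", "min": 1,  "max": 10},
    {"file": "F2", "min": 11, "max": 20},
    {"file": "F3", "min": 21, "max": 30},
]

def files_to_scan(manifest, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range."""
    return [e["file"] for e in manifest if e["max"] >= lo and e["min"] <= hi]

# A predicate matching values 3..7 touches only F1; F2 and F3 are pruned
# from metadata alone, with no list operation against the object store.
assert files_to_scan(manifest_p1, 3, 7) == ["F1"]
```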

<p>If you want to play around with Iceberg using Trino, check out the
<a href="https://trino.io/docs/current/connector/iceberg.html">Trino Iceberg docs</a>. To avoid problems like eventual consistency, as well as the difficulty of synchronizing operations across systems, Iceberg provides optimistic concurrency support, which is covered in more detail in
<a href="https://bitsondata.dev/iceberg-concurrency-snapshots-spec">the next post</a>.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud</guid>
      <pubDate>Mon, 12 Jul 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice I: A gentle introduction To Iceberg</title>
      <link>https://bitsondata.dev/trino-iceberg-i-gentle-intro?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;Back in the Gentle introduction to the Hive connector blog post, I discussed a commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, the connector is named Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem - the invisible Hive specification.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;I call this specification invisible because it doesn’t exist. It lives in the Hive code and the minds of those who developed it. This is makes it very difficult for anybody else who has to integrate with any distributed object storage that uses Hive, since they had to rely on reverse engineering and keeping up with the changes. The way you interact with Hive changes based on which version of Hive or Hadoop you are running. It also varies if you are in the cloud or over an object store. Spark has even modified the Hive spec in some ways to fit the Hive model to their use cases. It’s a big mess that data engineers have put up with for years. Yet despite the confusion and lack of organization due to Hive’s number of unwritten assumptions, the Hive connector is the most popular connector in use for Trino. Virtually every big data query engine uses the Hive model today in some form. 
As a result it is used by numerous companies to store and access data in their data lakes.&#xA;&#xA;So how did something with no specification become so ubiquitous in data lakes? Hive was first in the large object storage and big data world as part of Hadoop. Hadoop became popular from good marketing for Hadoop to solve the problems of dealing with the increase in data with the Web 2.0 boom . Of course, Hive didn’t get everything wrong. In fact, without Hive, and the fact that it is open source, there may not have been a unified specification at all. Despite the many hours data engineers have spent bashing their heads against the wall with all the unintended consequences of Hive, it still served a very useful purpose.&#xA;&#xA;So why did I just rant about Hive for so long if I’m here to tell you about Apache Iceberg? It’s impossible for a teenager growing up today to truly appreciate music streaming services without knowing what it was like to have an iPod with limited storage, or listening to a scratched burnt CD that skips, or flipping your tape or record to side-B. The same way anyone born before the turn of the millennium really appreciates streaming services, so you too will appreciate Iceberg once you’ve learned the intricacies of managing a data lake built on Hive and Hadoop.&#xA;&#xA;If you haven’t used Hive before, this blog post outlines just a few pain points that come from this data warehousing software to give you proper context. If you have already lived through these headaches, this post acts as a guide to Iceberg from Hive. This post is the first in a series of blog posts discussing Apache Iceberg in great detail, through the lens of the Trino query engine user. If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that houses the founding Presto community after the founders of Presto left Facebook. 
This and the next couple of posts discuss the Iceberg specification and all the features Iceberg has to offer, many times in comparison with Hive.&#xA;&#xA;Before jumping into the comparisons, what is Iceberg exactly? The first thing to understand is that Iceberg is not a file format, but a table format. It may not be clear what this means by just stating that, but the function of a table format becomes clearer as the improvements Iceberg brings from the Hive table standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet, but is the layer between the query engine and the data. Iceberg maps and indexes the files in order to provide a higher level abstraction that handles the relational table format for data lakes. You will understand more about table formats through examples in this series.&#xA;&#xA;Hidden Partitions&#xA;&#xA;Hive Partitions&#xA;&#xA;Since most developers and users interact with the table format via the query language, a noticeable difference is the flexibility you have while creating a partitioned table. Assume you are trying to create a table for tracking events occurring in our system. You run both sets of SQL commands from Trino, just using the Hive and Iceberg connectors which are designated by the catalog name (i.e. the catalog name starting with hive. uses the Hive connector, while the iceberg. table uses the Iceberg connector). To begin with, the first DDL statement attempts to create an events table in the logging schema in the hive catalog, which is configured to use the Hive connector. 
Trino also creates a partition on the events table using the eventtime field which is a TIMESTAMP field.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtime&#39;]&#xA;);&#xA;&#xA;Running this in Trino using the Hive connector produces the following error message.&#xA;&#xA;Partition keys must be the last columns in the table and in the same order as the table properties: [eventtime]&#xA;&#xA;The Hive DDL is very dependent on ordering for columns and specifically partition columns. Partition fields must be located in the final column positions and in the order of partitioning in the DDL statement. The next statement attempts to create the same table, but now with the eventtime field moved to the last column position.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR),&#xA;  eventtime TIMESTAMP&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtime&#39;]&#xA;);&#xA;&#xA;This time, the DDL command works successfully, but you likely don’t want to partition your data on the plain timestamp. This results in a separate file for each distinct timestamp value in your table (likely almost a file for each event). In Hive, there’s no way to indicate the time granularity at which you want to partition natively. 
The method to support this scenario with Hive is to create a new VARCHAR column, eventtimeday that is dependent on the eventtime column to create the date partition value.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR),&#xA;  eventtimeday VARCHAR&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtimeday&#39;]&#xA;);&#xA;&#xA;This method wastes space by adding a new column to your table. Even worse, it puts the burden of knowledge on the user to include this new column for writing data. It is then necessary to use that separate column for any read access to take advantage of the performance gains from the partitioning.&#xA;&#xA;INSERT INTO hive.logging.events&#xA;VALUES&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-01 12:00:00.000001&#39;,&#xA;  &#39;Oh noes&#39;, &#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;], &#xA;  &#39;2021-04-01&#39;&#xA;),&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-02 15:55:55.555555&#39;,&#xA;  &#39;Double oh noes&#39;,&#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],&#xA;  &#39;2021-04-02&#39;&#xA;),&#xA;(&#xA;  &#39;WARN&#39;, &#xA;  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,&#xA;  &#39;Maybeh oh noes?&#39;,&#xA;  ARRAY [&#39;Bad things could be happening??&#39;], &#xA;  &#39;2021-04-02&#39;&#xA;);&#xA;&#xA;Notice that the last partition value &#39;2021-04-01&#39; has to match the TIMESTAMP date during insertion. 
There is no validation in Hive to make sure this is happening because it only requires a VARCHAR and knows to partition based on different values.&#xA;&#xA;On the other hand, If a user runs the following query:&#xA;&#xA;SELECT &#xA;FROM hive.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39;;&#xA;&#xA;they get the correct results back, but have to scan all the data in the table:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;This happens because the user forgot to include the eventtimeday &lt; &#39;2021-04-02&#39; predicate in the WHERE clause. This eliminates all the benefits that led us to create the partition in the first place and yet frequently this is missed by the users of these tables.&#xA;&#xA;SELECT &#xA;FROM hive.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39; &#xA;AND eventtimeday &lt; &#39;2021-04-02&#39;;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;Iceberg Partitions&#xA;&#xA;The following DDL statement illustrates how these issues are handled in Iceberg via the Trino Iceberg connector.&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP(6),&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;);&#xA;&#xA;Taking note of a few things. First, notice the partition on the eventtime column that is defined without having to move it to the last position. There is also no need to create a separate field to handle the daily partition on the eventtime field. 
The partition specification is maintained internally by Iceberg, and neither the user nor the reader of this table needs to know anything about the partition specification to take advantage of it. This concept is called hidden partitioning , where only the table creator/maintainer has to know the partitioning specification. Here is what the insert statements look like now:&#xA;&#xA;INSERT INTO iceberg.logging.events&#xA;VALUES&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-01 12:00:00.000001&#39;,&#xA;  &#39;Oh noes&#39;, &#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]&#xA;),&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-02 15:55:55.555555&#39;,&#xA;  &#39;Double oh noes&#39;,&#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]),&#xA;(&#xA;  &#39;WARN&#39;, &#xA;  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,&#xA;  &#39;Maybeh oh noes?&#39;,&#xA;  ARRAY [&#39;Bad things could be happening??&#39;]&#xA;);&#xA;&#xA;The VARCHAR dates are no longer needed. The eventtime field is internally converted to the proper partition value to partition each row. Also, notice that the same query that ran in Hive returns the same results. The big difference is that it doesn’t require any extra clause to indicate to filter partition as well as filter the results.&#xA;&#xA;SELECT *&#xA;FROM iceberg.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39;;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;So hopefully that gives you a glimpse into what a table format and specification are, and why Iceberg is such a wonderful improvement over the existing and outdated method of storing your data in your data lake. 
While this post covers a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…&#xA;&#xA;If you want to play around with Iceberg using Trino, check out the Trino Iceberg docs. The next post covers how table evolution works in Iceberg, as well as, how Iceberg is an improved storage format for cloud storage.&#xA;&#xA;#trino #iceberg&#xA;&#xA;bits&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>Back in the <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector">Gentle introduction to the Hive connector</a> blog post, I discussed the commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, it is named the Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem – the invisible Hive specification.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p>I call this specification invisible because it doesn’t exist. It lives in the Hive code and the minds of those who developed it. This makes it very difficult for anybody else who has to integrate with any distributed object storage that uses Hive, since they have to rely on reverse engineering and keeping up with the changes. The way you interact with Hive changes based on <a href="https://medium.com/hashmapinc/four-steps-for-migrating-from-hive-2-x-to-3-x-e85a8363a18">which version of Hive or Hadoop</a> you are running. It also varies depending on whether you are in the cloud or over an object store. Spark has even <a href="https://spark.apache.org/docs/2.4.4/sql-migration-guide-hive-compatibility.html">modified the Hive spec</a> in some ways to fit the Hive model to its use cases. It’s a big mess that data engineers have put up with for years. Yet despite the confusion caused by Hive’s many unwritten assumptions, the Hive connector is the most popular connector in use for Trino. Virtually every big data query engine uses the Hive model today in some form. As a result, it is used by numerous companies to store and access data in their data lakes.</p>

<p>So how did something with no specification become so ubiquitous in data lakes? Hive was first in the large object storage and big data world as part of Hadoop, and Hadoop became popular through good marketing as the solution to the growing data volumes of the Web 2.0 boom. Of course, Hive didn’t get everything wrong. In fact, without Hive, and the fact that it is open source, there may not have been a unified specification at all. Despite the many hours data engineers have spent bashing their heads against the wall over the unintended consequences of Hive, it still served a very useful purpose.</p>

<p>So why did I just rant about Hive for so long if I’m here to tell you about <a href="https://iceberg.apache.org/">Apache Iceberg</a>? It’s impossible for a teenager growing up today to truly appreciate music streaming services without knowing what it was like to have an iPod with limited storage, to listen to a scratched burnt CD that skips, or to flip a tape or record to side B. In the same way that anyone born before the turn of the millennium really appreciates streaming services, you too will appreciate Iceberg once you’ve learned the intricacies of managing a data lake built on Hive and Hadoop.</p>

<p>If you haven’t used Hive before, this blog post outlines just a few pain points that come from this data warehousing software to give you proper context. If you have already lived through these headaches, this post acts as a guide to Iceberg from Hive. This post is the first in a series of blog posts discussing Apache Iceberg in great detail, through the lens of the Trino query engine user. If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that houses the founding Presto community after the <a href="https://trino.io/blog/2020/12/27/announcing-trino">founders of Presto left Facebook</a>. This and the next couple of posts discuss the Iceberg specification and all the features Iceberg has to offer, many times in comparison with Hive.</p>

<p>Before jumping into the comparisons, what is Iceberg exactly? The first thing to understand is that Iceberg is not a file format, but a table format. It may not be clear what this means by just stating that, but the function of a table format becomes clearer as the improvements Iceberg brings from the Hive table standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet, but is the layer between the query engine and the data. Iceberg maps and indexes the files in order to provide a higher level abstraction that handles the relational table format for data lakes. You will understand more about table formats through examples in this series.</p>

<h2 id="hidden-partitions">Hidden Partitions</h2>

<h3 id="hive-partitions">Hive Partitions</h3>

<p>Since most developers and users interact with the table format via the query language, a noticeable difference is the flexibility you have while creating a partitioned table. Assume you are trying to create a table for tracking events occurring in your system. You run both sets of SQL commands from Trino, using the Hive and Iceberg connectors, which are designated by the catalog name (i.e. a catalog name starting with <code>hive.</code> uses the Hive connector, while <code>iceberg.</code> uses the Iceberg connector). To begin with, the first DDL statement attempts to create an <code>events</code> table in the <code>logging</code> schema of the <code>hive</code> catalog, which is configured to use the Hive connector. It also partitions the <code>events</code> table on the <code>event_time</code> field, which is a <code>TIMESTAMP</code> field.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time&#39;]
);
</code></pre>

<p>Running this in Trino using the Hive connector produces the following error message.</p>

<pre><code>Partition keys must be the last columns in the table and in the same order as the table properties: [event_time]
</code></pre>

<p>The Hive DDL is very dependent on ordering for columns and specifically partition columns. Partition fields must be located in the final column positions and in the order of partitioning in the DDL statement. The next statement attempts to create the same table, but now with the <code>event_time</code> field moved to the last column position.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time TIMESTAMP
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time&#39;]
);
</code></pre>

<p>This time, the DDL command works successfully, but you likely don’t want to partition your data on the plain timestamp. This results in a separate file for each distinct timestamp value in your table (likely almost a file per event). In Hive, there’s no native way to indicate the time granularity at which you want to partition. The workaround is to create a new <code>VARCHAR</code> column, <code>event_time_day</code>, that is derived from the <code>event_time</code> column to hold the date partition value.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time_day VARCHAR
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time_day&#39;]
);
</code></pre>

<p>This method wastes space by adding a new column to your table. Even worse, it puts the burden of knowledge on the user to include this new column for writing data. It is then necessary to use that separate column for any read access to take advantage of the performance gains from the partitioning.</p>

<pre><code>INSERT INTO hive.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Oh noes&#39;, 
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;], 
  &#39;2021-04-01&#39;
),
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-02 15:55:55.555555&#39;,
  &#39;Double oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],
  &#39;2021-04-02&#39;
),
(
  &#39;WARN&#39;, 
  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,
  &#39;Maybeh oh noes?&#39;,
  ARRAY [&#39;Bad things could be happening??&#39;], 
  &#39;2021-04-02&#39;
);
</code></pre>

<p>Notice that the last partition value <code>&#39;2021-04-01&#39;</code> has to match the <code>TIMESTAMP</code> date during insertion. There is no validation in Hive to make sure this happens: Hive only requires a <code>VARCHAR</code> and simply partitions on its distinct values.</p>
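<p>To make the missing check concrete, here is a sketch of the validation Hive never performs; the row values mirror the insert above, and the helper function is hypothetical:</p>

```python
from datetime import datetime

# Toy illustration: Hive treats event_time_day as just another VARCHAR
# column, so nothing verifies it against the event_time it was derived from.

def expected_day(ts: str) -> str:
    """Compute the day value that *should* have been supplied."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").strftime("%Y-%m-%d")

# A row where the user-supplied partition value does not match the timestamp.
row = {
    "event_time": "2021-04-02 15:55:55.555555",
    "event_time_day": "2021-04-01",
}

# This check simply does not happen in Hive; the mismatched row would be
# inserted anyway and filed into the wrong partition.
assert row["event_time_day"] != expected_day(row["event_time"])
```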

<p>On the other hand, if a user runs the following query:</p>

<pre><code>SELECT *
FROM hive.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>they get the correct results back, but have to scan all the data in the table:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>

<p>This happens because the user forgot to include the <code>event_time_day &lt; &#39;2021-04-02&#39;</code> predicate in the <code>WHERE</code> clause. This eliminates all the benefits that led us to create the partition in the first place, and yet users of these tables frequently miss it.</p>

<pre><code>SELECT *
FROM hive.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39; 
AND event_time_day &lt; &#39;2021-04-02&#39;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>

<h3 id="iceberg-partitions">Iceberg Partitions</h3>

<p>The following DDL statement illustrates how these issues are handled in Iceberg via the Trino Iceberg connector.</p>

<pre><code>CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6),
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  partitioning = ARRAY[&#39;day(event_time)&#39;]
);
</code></pre>

<p>Take note of a few things. First, notice that the partition on the <code>event_time</code> column is defined without having to move it to the last position. There is also no need to create a separate field to handle the daily partition on the <code>event_time</code> field. The <em><strong>partition specification</strong></em> is maintained internally by Iceberg, and neither the writer nor the reader of this table needs to know anything about it to take advantage of it. This concept is called <em><strong>hidden partitioning</strong></em>, where only the table creator/maintainer has to know the partitioning specification. Here is what the insert statements look like now:</p>

<pre><code>INSERT INTO iceberg.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Oh noes&#39;, 
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]
),
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-02 15:55:55.555555&#39;,
  &#39;Double oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]),
(
  &#39;WARN&#39;, 
  timestamp &#39;2021-04-02 00:00:11.112222&#39;,
  &#39;Maybeh oh noes?&#39;,
  ARRAY [&#39;Bad things could be happening??&#39;]
);
</code></pre>
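<p>At this point you can inspect the partitions Iceberg derived from the inserted rows using the Trino Iceberg connector&#39;s hidden <code>$partitions</code> metadata table. The exact set of columns this table exposes varies across Trino versions, so treat the following as a sketch:</p>

<pre><code>SELECT partition, record_count, file_count
FROM iceberg.logging.&#34;events$partitions&#34;;
</code></pre>

<p>For the three rows above, you should see one entry per day that received data, even though no <code>event_time_day</code> column was ever declared.</p>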

<p>The <code>VARCHAR</code> dates are no longer needed. The <code>event_time</code> value is internally converted to the proper partition value for each row. Also, notice that the same query that ran in Hive returns the same results. The big difference is that no extra predicate is required to prune partitions in addition to filtering the results.</p>

<pre><code>SELECT *
FROM iceberg.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>
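<p>If you want to verify that the <code>day()</code> transform is actually pruning files rather than scanning the whole table, one option is Trino&#39;s <code>EXPLAIN</code> (or <code>EXPLAIN ANALYZE</code> on a populated table). The plan text differs between Trino versions, so this only sketches the approach:</p>

<pre><code>EXPLAIN
SELECT *
FROM iceberg.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>With <code>EXPLAIN ANALYZE</code>, the input rows and splits reported for the table scan should reflect only the partitions matching the predicate.</p>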

<p>So hopefully that gives you a glimpse into what a table format and specification are, and why Iceberg is such a wonderful improvement over the existing and outdated method of storing your data in your data lake. While this post covers a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/see_myself_out.gif" alt=""/></p>

<p>If you want to play around with Iceberg using Trino, check out the <a href="https://trino.io/docs/current/connector/iceberg.html">Trino Iceberg docs</a>. The <a href="https://bitsondata.dev/in-place-table-evolution-and-cloud-compatibility-with-iceberg">next post</a> covers how table evolution works in Iceberg, as well as how Iceberg is an improved storage format for cloud storage.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-i-gentle-intro</guid>
      <pubDate>Mon, 03 May 2021 05:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>