<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>trino &#8212; bits on data</title>
    <link>https://bitsondata.dev/tag:trino</link>
    <description>&gt;_ the imposter&#39;s guide to software, data, and life</description>
    <pubDate>Wed, 15 Apr 2026 01:17:27 +0000</pubDate>
    <image>
      <url>https://i.snap.as/vWVqkBBl.png</url>
      <title>trino &#8212; bits on data</title>
      <link>https://bitsondata.dev/tag:trino</link>
    </image>
    <item>
      <title>Integrating Trino and Snowflake</title>
      <link>https://bitsondata.dev/trino-snowflake-bloomberg-oss-win?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[An open source success story&#xA;&#xA;TL;DR: Contributing to open source can be frustrating as the consensus needed for code to align to the project vision is often out of scope for many companies. This post dives deep into the obstacles and wins of two contributors from different companies working together to add the same proprietary connector. It&#39;s both inspiring and carries many lessons to bring along as you venture into open source to gain the pearls and avoid the perils.&#xA;&#xA;We’re seeing open source usher in a challenge to the economic model where the success metric is increasing the commonwealth of economic capital. This acceleration comes from playing positive-sum games with friends online and avoiding limiting a community to a vision that only benefits a small number of corporations or individuals. It’s hard to imagine how to embed such frameworks within our current zero-sum winner-takes-all economic system. There’s certainly no shortage of heated debates around how to construct a harmonious relationship between the open source community and the companies participating in it. Something we don’t talk about enough is the positive examples of when a coordinated effort in open source sticks the landing, and so many benefit from it.&#xA;&#xA;This post highlights the extraordinary contributions of Erik Anderson, Teng Yu, Yuya Ebihara, and the broader Trino community to finally contribute the long-coveted Trino Snowflake connector. It is a success story paired with a blueprint for individuals and corporations wanting to contribute to open source projects they use. 
These stories are valuable in that they demonstrate how to be most effective in collaborating with strangers-soon-to-be-friends and common pitfalls to avoid.&#xA;&#xA;A common challenge in open source&#xA;&#xA;Despite the importance of delivering marketing and education in a community (aka edutainment), it’s only the first part of the equation of what makes open source projects successful. Once developers see some exciting video or tutorial, they ultimately land on the docs site, GitHub, StackOverflow, or some communication platform in the community. It&#39;s at this point that developers can easily lose motivation if the docs lack proper getting-started materials or the community is completely silent. This is how I categorize the developer experience (aka devex), which aims to improve both the user and contributor experiences in the developer community by empowering decisions through hands-on learning, removing inefficiencies, and as we&#39;ll cover here, exposing untapped opportunities.&#xA;&#xA;Much like any open source project, maintainers on the Trino project struggle to communicate the lack of proper resources to build and test new features for various proprietary software. For those less familiar, Trino is a federated query engine with multiple data sources. Trino tests integrations with open data sources by running small local instances of the connecting system. Snowflake is a proprietary, cloud-native data warehouse, also known as a cloud data platform. This left no viable, free way to test this integration, which was eagerly sought by many. 
After an initial attempt by my friend Phillipe Gagnon, a similar pattern emerged with the second pull request, where development velocity started strong and stagnated after some months.&#xA;&#xA;Cognitive surplus and communication deficit&#xA;&#xA;A common and unfortunate class of issues is that larger objectives, well known among the core group, often move faster than less-established individual contributions. These additions are often much needed and welcome, but fail to fit the larger project roadmap narrative. Since it&#39;s easier to coordinate within the smaller core group, where trust and norms have been communicated and established, changes from outside this group are more likely to get lost in the shuffle. As an open source project grows, you end up with a cognitive surplus in the form of an abundance of bright people willing to share their time, intellect, and experience with a larger community.&#xA;&#xA;Often both contributors and maintainers are so busy with their day jobs, families, and self care that they dedicate most of their remaining energy to ensuring they write quality code and tests to the best of their ability. Lack of upfront communication to validate ideas from newer contributors, and lack of communication by maintainers who see a large number of issues to address, are two communication issues that stagnate a project. Maintainers are often doers who see more value in addressing quick-win work that flows from the well-established contributors of the project. Follow-through on either side can be difficult as newcomers don&#39;t want to be rude and maintainers accidentally forget or hope someone else will take the time to address the issues on that pull request. &#xA;&#xA;Waiting for your work to be reviewed by someone in the community works a bit like a wishing well: you toss in a coin (i.e. 
your time and effort represented as code and a pull request) and hope your wish of getting your code reviewed and merged comes true. The satisfaction of hypothetical developers benefiting from your small but significant change floods your mind, and you feel like you’ve improved humanity just that one little bit more. &#xA;&#xA;Maintainers are in a constant state of triaging the surplus of innovation being thrown at them while simultaneously looking for more help with reviews and serving as the expert in some areas of the code. As you can imagine, good communication can be hard to come by, as many newcomers are strangers and worry they are wasting precious time by asking too many questions rather than just showing a proof of concept. This backfires when developers spend a large portion of their time developing a solution that is not compatible with the project, and maintainers lose the opportunity to quickly spin up on the value of the new feature. This is why regular contributor meetings help solve both of these issues synchronously, cutting out the delayed feedback loops.&#xA;&#xA;History repeats itself, until it doesn&#39;t&#xA;&#xA;It became apparent that each time there was a discussion about how to do integration testing, there was no good way to test against a Snowflake instance given the project&#39;s lack of funding. Trino has a high bar for quality, and none of the maintainers felt the risk was worth taking, given the likely popularity of the integration and the likelihood of future maintenance issues. Each pull request hit this same fate: it stalled with no clear path to resolve the real issue of funding the Snowflake infrastructure needed by the Trino Software Foundation (TSF). 
It’s never fun to mention that you can’t move forward on work with constraints like these, and without a monetary solution, the contributor is often met with silence.&#xA;&#xA;Noticing that Teng had already done a significant amount of work to contribute his Snowflake connector, I reached out to him to see if we could brainstorm a solution. Not long after, Erik also reached out to get my thoughts on how to go about contributing Bloomberg&#39;s Snowflake connector. Great, now we had two connector implementations and no way to get the infrastructure needed to test them. During the first Trino Contributor Congregation, Erik and I brought up Bloomberg&#39;s desire to contribute a Snowflake connector and I articulated the testing issue. Ironically, this was the first time I had thoroughly articulated the issue to Erik as well.&#xA;&#xA;As soon as I was done, Erik requested the mic and said something to the effect of, &#34;Oh I wish I would have known that&#39;s the problem, the solution is simple, Bloomberg will provide the TSF a Snowflake account.&#34;&#xA;&#xA;Done!&#xA;&#xA;Just as in business, never underestimate the power of communication in an open source project. Shortly after Erik, Teng, and I discussed the best ways to merge their work, they set up the Snowflake accounts for Trino maintainers and started the arduous process of building a thorough test suite with the help of Yuya, Piotr Findeisen, Manfred Moser, and Martin Traverso.&#xA;&#xA;The long road to Snowflake&#xA;&#xA;As Teng and Erik merged their efforts, the process was anything but straightforward. There were setbacks, vacations, meticulous reviews, and infrastructure issues. But the perseverance of everyone involved was unwavering.&#xA;&#xA;Bloomberg started by creating an official Bloomberg Trino repository, originally as a means for Teng and Erik to mesh their solutions together and build the testing infrastructure that relied on Bloomberg resources. 
Without needing to rely on the main Trino project to merge incremental solutions, they were able to quickly iterate on the early solutions. This repository also facilitated Bloomberg’s now numerous contributions to Trino.&#xA;&#xA;It took a few months just to get the ForePaaS[1] and Bloomberg solutions merged. There were valuable takeaways from each system, and better integration tests were written with the new testing infrastructure. The two Snowflake connector implementations were merged together by April of 2023. Finally, the reviews could start. Once the initial two passes happened, we anticipated that we would see the Snowflake connector release in the summer of 2023 around Trino Fest. So much so that we planned a talk with Erik and Teng initially as a reveal, assuming the pull request would be merged by then. Lo and behold, this didn’t happen, as there were still a lot of concerns around use cases not being properly tested.&#xA;&#xA;The halting review problem&#xA;&#xA;A necessary evil of pull request reviews and, more broadly, distributed consensus is that reviews can drag on over time. This can lead to countless updates you have to make to your changes to accommodate the ever-changing project shifting beneath your feet as you simultaneously try to make progress on suggestions from those reviewing your code.&#xA;&#xA;Many critics of open source like to point this out as a drawback, when in fact this same problem exists in closed source systems. Closed source projects can generally delay difficult decisions to make fast upfront progress to meet certain deadlines. 
This may be seen as an advantage at first, but as many developers can attest, this simply leads to technical debt and fragile products in most environments that struggle to prioritize a healthy codebase.&#xA;&#xA;Regardless, having to face these larger discussions upfront can induce fatigue, especially when managing external circumstances; personal affairs, a project at work - you know, the entity that pays these engineers - or countless other factors will rear their ugly heads and progress will stagger with ebbs and flows of attention. This can be really dangerous territory and commonly results in contributors and reviewers abandoning the PR when it stalls.&#xA;&#xA;This is why I believe open source, while not beholden to any timelines, needs a project and product management role, which is currently often covered by project leaders and devex engineers. This can also relieve tension between the needs of open source and big businesses in the community with real deadlines, at least keeping the communication consistent while ensuring bugs and design flaws aren’t introduced to the code base.&#xA;&#xA;What’s in it for Bloomberg and ForePaaS?&#xA;&#xA;If you’ve never worked in open source or for a company that contributes to open source, you may be wondering how the heck these engineers convince their leadership to let them dump so much time into these contributions. The simple answer is, it’s good for business.&#xA;&#xA;If we look into why Bloomberg uses Trino, they aggregate data from an unusually large number of data sources across their customers who use their services. Part of this requires them to merge the customer’s dataset with existing aggregate data in Bloomberg’s product. Since Trino can connect to most customer databases out-of-the-box, Bloomberg only needs to manage a small array of custom connectors that provide their services to customers as multiple catalogs in a single convenient SQL endpoint. 
Having engineers maintain a few small connectors rather than an entire distributed query engine themselves saves a lot of time and maintenance.&#xA;&#xA;Despite how many problems Trino already solves for them, Bloomberg and ForePaaS needed this Snowflake connector and, through the open source model, created it for themselves. The drawback is that the solution must be maintained by the engineers at each company any time they want to upgrade to a new Trino feature. This consistently depletes engineering resources, so they want to maintain as few features as possible to free up their engineers&#39; time. Open source projects are generally more than happy to accept features that the community benefits from. This doesn’t mean we shouldn’t appreciate when companies contribute. This positive-sum generosity and forward-thinking approach enabled Erik and Teng to combine their battle-tested connectors, creating high value for the community.&#xA;&#xA;If you are a developer who sees the value in contributing to open source, and you aren&#39;t sure how to convince leadership to get on board, you need to speak their language. Show how companies like Bloomberg get involved in open source, and how it lowers maintenance costs when done correctly. If you see an open source project like Trino that could replace 97% of a new project, demonstrate that the upfront cost will pay off by reducing the amount of code your team has to manage, which lowers the future need to expand headcount. 
I don’t imagine a world where your boss and colleagues are altruists, but if you present an economic incentive that lowers the amortized cost of the engineers needed to maintain a project, your strategy becomes helpful to the company&#39;s bottom line.&#xA;&#xA;While the immediate investment shows small gains for a single team at a single company, once that change exists in open source, other companies can immediately benefit and offer better testing and improvements than you could have asked for when managing the original project with your own team. Humanity at large gets to benefit from every contribution done this way, and the more companies that embrace this, the less effort we waste pointlessly duplicating work.&#xA;&#xA;Esprit de Corps&#xA;&#xA;The Marines use the mantra “esprit de corps,” French for “spirit of the body,” whose “corps” I mistakenly took to mean the Marine Corps rather than the more general meaning of a body or group of people. In fact, it expresses the common spirit existing in the members of a group, inspiring enthusiasm, devotion, and strong regard for the honor of the group. Any time I see this type of shared and selfless cooperation in open source, I’m reminded of the bond, friendships, and care among my fellow Marines and me. Despite the unfortunate political circumstances of our mission, I do treasure the shared companionship with both my fellow Marines and the local Iraqi people. There is ultimately a power in the gathering of many when aimed at building an altruistic means of improving each other&#39;s lives.&#xA;&#xA;In the same way, this demonstration of human cooperation is about more than just developing a connector; it&#39;s about the shared experiences, the friendships forged, and the skills honed in the pursuit of a common goal. The successful addition of the Trino Snowflake connector is a testament to the positive-sum outcomes of open source collaboration. 
This journey has been about collaboration, learning, and growth that will benefit many. I remember the night I got the email that Yuya had merged the pull request; I was ecstatic, to say the least. The connector shipped with Trino version 440, making it possible to connect to the most widely adopted cloud data warehouse.&#xA;&#xA;Once the hard work was done, many valuable iterations, like adding Top-N support (Shopee), adding Snowflake Iceberg REST catalog support (Starburst), and adding better type mapping (Apple), were added to the Snowflake integration. I love showcasing this trailblazing and, yes, altruistic work from Erik, Teng, Yuya, Martin, Manfred, and Piotr - and everyone who helped in the Trino community. A special thanks to the managers and leadership at Bloomberg and ForePaaS for their generous commitment of time and resources.&#xA;&#xA;As we celebrate this milestone, we&#39;re already looking forward to the next adventure. Here&#39;s to federating them all, together!&#xA;&#xA;Notes:&#xA;1. ForePaaS has been integrated into OVHCloud, which is now called Data Platform.&#xA;&#xA;bits]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="an-open-source-success-story">An open source success story</h2>

<p>TL;DR: Contributing to open source can be frustrating as the consensus needed for code to align to the project vision is often out of scope for many companies. This post dives deep into the obstacles and wins of two contributors from different companies working together to add the same proprietary connector. It&#39;s both inspiring and carries many lessons to bring along as you venture into open source to gain the pearls and avoid the perils.</p>

<p><img src="https://i.snap.as/CvkkjKzk.jpeg" alt=""/></p>

<p>We’re seeing open source usher in a challenge to the economic model where the success metric is increasing the commonwealth of economic capital. This acceleration comes from playing positive-sum games with friends online and avoiding limiting a community to a vision that only benefits a small number of corporations or individuals. It’s hard to imagine how to embed such frameworks within our current zero-sum winner-takes-all economic system. There’s certainly no shortage of heated debates around how to construct a harmonious relationship between the open source community and the companies participating in it. Something we don’t talk about enough is the positive examples of when a coordinated effort in open source sticks the landing, and so many benefit from it.</p>

<p>This post highlights the extraordinary contributions of <a href="https://www.linkedin.com/in/erikanderson/">Erik Anderson</a>, <a href="https://www.linkedin.com/in/tyu-fr/">Teng Yu</a>, <a href="https://www.linkedin.com/in/ebyhr/">Yuya Ebihara</a>, and the broader <a href="https://github.com/trinodb/trino">Trino community</a> to finally contribute the long-coveted <a href="https://trino.io/docs/current/connector/snowflake.html">Trino Snowflake connector</a>. It is a success story paired with a blueprint for individuals and corporations wanting to contribute to open source projects they use. These stories are valuable in that they demonstrate how to be most effective in collaborating with strangers-soon-to-be-friends and common pitfalls to avoid.</p>



<h2 id="a-common-challenge-in-open-source">A common challenge in open source</h2>

<p>Despite the importance of delivering marketing and education in a community (aka <a href="https://en.wikipedia.org/wiki/Educational_entertainment">edutainment</a>), it’s only the first part of the equation of what makes open source projects successful. Once developers see some exciting video or tutorial, they ultimately land on the docs site, GitHub, StackOverflow, or some communication platform in the community. It&#39;s at this point that developers can easily lose motivation if the docs lack proper getting-started materials or the community is completely silent. This is how I categorize the developer experience (aka devex), which aims to improve both the user and contributor experiences in the developer community by <a href="https://en.wikipedia.org/wiki/Experiential_learning">empowering decisions through hands-on learning</a>, <a href="https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog">removing inefficiencies</a>, and as we&#39;ll cover here, exposing untapped opportunities.</p>

<p>Much like any open source project, maintainers on the Trino project struggle to communicate the lack of proper resources to build and test new features for various proprietary software. For those less familiar, Trino is a federated query engine with <a href="https://trino.io/docs/current/connector.html">multiple data sources</a>. Trino tests integrations with open data sources by running small local instances of the connecting system. Snowflake is a proprietary, cloud-native data warehouse, also known as a cloud data platform. This left no viable, free way to test this integration, which was <a href="https://github.com/trinodb/trino/pull/2551#issuecomment-873082280">eagerly</a> <a href="https://github.com/trinodb/trino/issues/1863">sought</a> <a href="https://github.com/trinodb/trino/issues/7247">by many</a>. After an <a href="https://github.com/trinodb/trino/pull/2551">initial attempt</a> by my friend <a href="https://www.linkedin.com/in/pfgagnon">Phillipe Gagnon</a>, a similar pattern emerged <a href="https://github.com/trinodb/trino/pull/10387">with the second pull request</a>, where development velocity started strong and stagnated after some months.</p>
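<p>For readers less familiar with how Trino talks to a data source: each connector is configured through a catalog properties file on the coordinator. The snippet below is only a minimal sketch of what a Snowflake catalog could look like; the hostnames, credentials, and database names are invented placeholders, not a tested configuration:</p>

<pre><code># etc/catalog/snowflake.properties (hypothetical values)
connector.name=snowflake
connection-url=jdbc:snowflake://example.snowflakecomputing.com
connection-user=trino_user
connection-password=secret
snowflake.database=EXAMPLE_DB
snowflake.warehouse=EXAMPLE_WH
</code></pre>

<p>With a file like this in place, the catalog shows up as a queryable namespace — which is exactly why the project needed real Snowflake infrastructure to test against, since no small local stand-in exists.</p>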

<h3 id="cognitive-surplus-and-communication-deficit">Cognitive surplus and communication deficit</h3>

<p>A common and unfortunate class of issues is that larger objectives, well known among the core group, often move faster than less-established individual contributions. These additions are often much needed and welcome, but fail to fit the larger project roadmap narrative. Since it&#39;s easier to coordinate within the smaller core group, where trust and norms have been communicated and established, changes from outside this group are more likely to get lost in the shuffle. As an open source project grows, you end up with a cognitive surplus in the form of an abundance of bright people willing to share their time, intellect, and experience with a larger community.</p>

<p>Often both contributors and maintainers are so busy with their day jobs, families, and self care that they dedicate most of their remaining energy to ensuring they write quality code and tests to the best of their ability. Lack of upfront communication to validate ideas from newer contributors, and lack of communication by maintainers who see a large number of issues to address, are two communication issues that stagnate a project. Maintainers are often doers who see more value in addressing quick-win work that flows from the well-established contributors of the project. Follow-through on either side can be difficult as newcomers don&#39;t want to be rude and maintainers accidentally forget or hope someone else will take the time to address the issues on that pull request.</p>

<p><img src="https://i.snap.as/7TdSoquQ.jpg" alt=""/></p>

<p>Waiting for your work to be reviewed by someone in the community works a bit like a wishing well: you toss in a coin (i.e. your time and effort represented as code and a pull request) and hope your wish of getting your code reviewed and merged comes true. The satisfaction of hypothetical developers benefiting from your small but significant change floods your mind, and you feel like you’ve improved humanity just that one little bit more.</p>

<p>Maintainers are in a constant state of triaging the surplus of innovation being thrown at them while simultaneously looking for more help with reviews and serving as the expert in some areas of the code. As you can imagine, good communication can be hard to come by, as many newcomers are strangers and worry they are wasting precious time by asking too many questions rather than just showing a proof of concept. This backfires when developers spend a large portion of their time developing a solution that is not compatible with the project, and maintainers lose the opportunity to quickly spin up on the value of the new feature. This is why regular <a href="https://github.com/trinodb/trino/wiki/Contributor-meetings">contributor meetings</a> help solve both of these issues synchronously, cutting out the delayed feedback loops.</p>

<h3 id="history-repeats-itself-until-it-doesn-t">History repeats itself, until it doesn&#39;t</h3>

<p>It became apparent that each time there was <a href="https://github.com/trinodb/trino/pull/2551#issuecomment-709220790">a discussion</a> about how to do <a href="https://github.com/trinodb/trino/pull/10387#issuecomment-1008430060">integration testing</a>, there was no good way to test against a Snowflake instance given the project&#39;s lack of funding. Trino has a high bar for quality, and none of the maintainers felt the risk was worth taking, given the likely popularity of the integration and the likelihood of future maintenance issues. Each pull request hit this same fate: it stalled with no clear path to resolve the real issue of funding the Snowflake infrastructure needed by the <a href="https://trino.io/foundation.html">Trino Software Foundation (TSF)</a>. It’s never fun to mention that you can’t move forward on work with constraints like these, and without a monetary solution, the contributor is often met with silence.</p>

<p>Noticing that Teng had already done a significant amount of work to contribute his Snowflake connector, I reached out to him to see if we could brainstorm a solution. Not long after, Erik also reached out to get my thoughts on how to go about contributing Bloomberg&#39;s Snowflake connector. Great, now we had two connector implementations and no way to get the infrastructure needed to test them. During the first <a href="https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html#trino-contributor-congregation">Trino Contributor Congregation</a>, Erik and I brought up Bloomberg&#39;s desire to contribute a Snowflake connector and I articulated the testing issue. Ironically, this was the first time I had thoroughly articulated the issue to Erik as well.</p>

<p>As soon as I was done, Erik requested the mic and said something to the effect of, “Oh I wish I would have known that&#39;s the problem, the solution is simple, Bloomberg will provide the TSF a Snowflake account.”</p>

<p>Done!</p>

<p>Just as in business, <strong>never underestimate the power of communication in an open source project</strong>. Shortly after Erik, Teng, and I discussed the best ways to merge their work, they set up the Snowflake accounts for Trino maintainers and started the arduous process of building a thorough test suite with the help of Yuya, <a href="https://www.linkedin.com/in/piotrfindeisen/">Piotr Findeisen</a>, <a href="https://www.linkedin.com/in/manfredmoser/">Manfred Moser</a>, and <a href="https://www.linkedin.com/in/traversomartin/">Martin Traverso</a>.</p>

<h2 id="the-long-road-to-snowflake">The long road to Snowflake</h2>

<p>As Teng and Erik merged their efforts, the process was anything but straightforward. There were setbacks, vacations, meticulous reviews, and infrastructure issues. But the perseverance of everyone involved was unwavering.</p>

<p>Bloomberg started by creating <a href="https://github.com/bloomberg/trino">an official Bloomberg Trino repository</a>, originally as a means for Teng and Erik to mesh their solutions together and build the testing infrastructure that relied on Bloomberg resources. Without needing to rely on the main Trino project to merge incremental solutions, they were able to quickly iterate on the early solutions. This repository also facilitated Bloomberg’s now numerous contributions to Trino.</p>

<p>It took a few months just to get the ForePaaS<sup><a class="footnote" href="#fnref1">1</a></sup> and Bloomberg solutions merged. There were valuable takeaways from each system, and better integration tests were written with the new testing infrastructure. The two Snowflake connector implementations were merged together by April of 2023. Finally, the reviews could start. Once the initial two passes happened, we anticipated that we would see the Snowflake connector release in the summer of 2023 around Trino Fest. So much so that we planned <a href="https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap.html">a talk with Erik and Teng</a> initially as a reveal, assuming the pull request would be merged by then. Lo and behold, this didn’t happen, as there were still a lot of concerns around use cases not being properly tested.</p>

<h3 id="the-halting-review-problem">The halting review problem</h3>

<p>A necessary evil of pull request reviews and, more broadly, distributed consensus is that reviews can drag on over time. This can lead to <a href="https://github.com/trinodb/trino/pull/17909#issuecomment-1841809727">countless updates</a> you have to make to your changes to accommodate the ever-changing project shifting beneath your feet as you simultaneously try to make progress on <a href="https://github.com/trinodb/trino/pull/17909#pullrequestreview-1793724311">suggestions from those reviewing your code</a>.</p>

<p>Many critics of open source like to point this out as a drawback, when in fact, this same problem exists in closed source systems. Closed source projects can generally delay difficult decisions to make fast upfront progress to meet certain deadlines. This may be seen as an advantage at first, but as many developers can attest, this simply leads to technical debt and fragile products in most environments that struggle to prioritize a healthy codebase.</p>

<p><img src="https://i.snap.as/Oi74UR5y.jpg" alt=""/></p>

<p>Regardless, having to face these larger discussions upfront can induce fatigue, especially when managing external circumstances; personal affairs, a project at work – you know, the entity that pays these engineers – or countless other factors will rear their ugly heads and <a href="https://github.com/trinodb/trino/pull/17909#discussion_r1418149737">progress will stagger</a> with ebbs and flows of attention. This can be really dangerous territory and commonly results in contributors and reviewers abandoning the PR when it stalls.</p>

<p>This is why I believe open source, while not beholden to any timelines, needs a project and product management role, which is currently often covered by project leaders and devex engineers. This can also relieve tension between the needs of open source and big businesses in the community with real deadlines, at least keeping the communication consistent while ensuring bugs and design flaws aren’t introduced to the code base.</p>

<h2 id="what-s-in-it-for-bloomberg-and-forepaas">What’s in it for Bloomberg and ForePaaS?</h2>

<p>If you’ve never worked in open source or for a company that contributes to open source, you may be wondering how the heck these engineers convince their leadership to let them pour so much time into these contributions. The simple answer is: it’s good for business.</p>

<p>If we peep into why Bloomberg uses Trino, they aggregate data from an unusually large number of data sources across the customers who use their services. Part of this requires them to merge a customer’s dataset with existing aggregate data in Bloomberg’s product. Since Trino can connect to most customer databases out-of-the-box, Bloomberg only needs to manage a small array of custom connectors that expose their services to customers as multiple catalogs behind a single convenient SQL endpoint. Having engineers maintain a few small connectors rather than an entire distributed query engine saves a lot of time and maintenance.</p>

<p>Despite how many problems Trino already solves for them, Bloomberg and ForePaaS needed this Snowflake connector, and through the open source model they created it for themselves. The drawback is that the solution must be maintained by the engineers at each company any time they want to upgrade to a new Trino version. This consistently depletes engineering resources, so they want to maintain as few private features as possible to free up their engineers’ time. Open source projects are generally more than happy to accept features that benefit the community. This doesn’t mean we shouldn’t appreciate it when companies contribute. This positive-sum generosity and forward-thinking approach enabled Erik and Teng to combine their battle-tested connectors, creating something of high value for the community.</p>

<p>If you are a developer who sees the value in contributing to open source, and you aren&#39;t sure how to convince leadership to get on board, you need to speak their language. Show how companies like Bloomberg get involved in open source, and how it lowers maintenance costs when done correctly. If you see an open source project like Trino that could replace 97% of a new project, demonstrate that the upfront cost will pay off by reducing the amount of code managed by your team, which lowers the future need to expand headcount. I don’t imagine a world where your boss and colleagues are altruists, but if you present an economic incentive that lowers the amortized cost of the engineers needed to maintain a project, your strategy becomes helpful to the company&#39;s bottom line.</p>

<p>While the immediate investment shows small gains for a single team at a single company, once that change exists in open source, other companies can immediately benefit and offer better testing and improvements than you could have asked for when managing the original project with your own team. Humanity at large benefits from every contribution made this way, and the more companies that embrace this, the less effort we waste pointlessly duplicating work.</p>

<h2 id="esprit-de-corps">Esprit de Corps</h2>

<p>The marines use the mantra “Esprit de Corps,” French for “spirit of the body,” where I mistakenly took the “Corps” part for the Marine Corps rather than the more general meaning of a body or group of people. In fact, it expresses <a href="https://www.merriam-webster.com/dictionary/esprit%20de%20corps">the common spirit existing in the members of a group and inspiring enthusiasm, devotion, and strong regard for the honor of the group</a>. Any time I see this type of shared and selfless cooperation in open source, I’m reminded of the bond, friendships, and care between me and my fellow marines. Despite the unfortunate political circumstances of our mission, I do treasure the shared companionship with both my fellow marines and the local Iraqi people. There is ultimately a power in the gathering of many when aimed at building an altruistic means of improving each other’s lives.</p>

<p><img src="https://i.snap.as/TO03Akr4.jpeg" alt=""/></p>

<p>In the same way, this demonstration of human cooperation is about more than just developing a connector; it&#39;s about the shared experiences, the friendships forged, and the skills honed in the pursuit of a common goal. The successful addition of the Trino Snowflake connector is a testament to the positive-sum outcomes of open source collaboration. This journey has been about collaboration, learning, and growth that will benefit many. I remember the night I got the email that Yuya had <a href="https://github.com/trinodb/trino/pull/17909">merged the pull request</a>; I was ecstatic to say the least. The connector shipped with <a href="https://trino.io/docs/current/release/release-440.html#general">Trino version 440</a>, making connecting to the most widely adopted cloud data warehouse possible.</p>

<p>Once the hard work was done, many valuable iterations like <a href="https://github.com/trinodb/trino/pull/21219">adding Top-N support</a> (Shopee), <a href="https://github.com/trinodb/trino/pull/21365">adding Snowflake Iceberg REST catalog support</a> (Starburst), and <a href="https://github.com/trinodb/trino/pull/21365">adding better type mapping</a> (Apple) were added to the Snowflake integration. I love showcasing this trailblazing and, yes, altruistic work from Erik, Teng, Yuya, Martin, Manfred, and Piotr – and everyone who helped in the Trino community. A special thanks to the managers and leadership at Bloomberg and ForePaaS for their generous commitment of time and resources.</p>

<p>As we celebrate this milestone, we&#39;re already looking forward to the next adventure. Here&#39;s to federating them all, together!</p>

<p>Notes:
<sup><a class="footnote-ref" href="#fn1">1</a></sup><span class="footnote-ref-text">ForePaaS has been integrated into <a href="https://ovhcloud.com">OVHCloud</a>, which is now called <a href="https://help.ovhcloud.com/csm/en-public-cloud-data-platform-what-is?id=kb_article_view&amp;sysparm_article=KB0060801">Data Platform</a>.</span></p>

<p><em>bits</em></p>
]]></content:encoded>
      <guid>https://bitsondata.dev/trino-snowflake-bloomberg-oss-win</guid>
      <pubDate>Wed, 08 May 2024 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Intro to Trino for the Trinewbie</title>
      <link>https://bitsondata.dev/intro-to-trino-for-the-trinewbie?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Learn how to quickly join data across multiple sources&#xA;&#xA;If you haven’t heard of Trino before, it is a query engine that speaks the language of many genres of databases. As such, Trino is commonly used to provide fast ad-hoc queries across heterogeneous data sources. Trino’s initial use case was built around replacing the Hive runtime engine to allow for faster querying of Big Data warehouses and data lakes. This may be the first time you have heard of Trino, but you’ve likely heard of the project from which it was “forklifted”, Presto. If you want to learn more about why the creators of Presto now work on Trino (formerly PrestoSQL) you can read the renaming blog that they produced earlier this year. Before you commit too much to this blog, I’d like to let you know why you should even care about Trino.&#xA;&#xA;!--more--&#xA;&#xA;!--emailsub--&#xA;&#xA;So what is Trino anyways?&#xA;&#xA;The first thing I like to make sure people know about when discussing Trino is that it is a SQL query engine, but not a SQL database. What does that mean? Traditional databases typically consist of a query engine and a storage engine. Trino is just a query engine and does not store data. Instead, Trino interacts with various databases that store their own data in their own formats. Trino parses and analyzes the SQL query you pass in, creates and optimizes a query execution plan that includes the data sources, and then schedules worker nodes that are able to intelligently query the underlying databases they connect to.&#xA;&#xA;I say intelligently, specifically talking about pushdown queries. That’s right, the most intelligent thing for Trino to do is to avoid making more work for itself, and try to offload that work to the underlying database. This makes sense as the underlying databases generally have special indexes and data that are stored in a specific format to optimize the read time. 
It would be silly of Trino to ignore all of that optimized reading capability and do a linear scan of all the data to run the query itself. The goal in most optimizations for Trino is to push down the query to the database and only get back the smallest amount of data needed to join with another dataset from another database, do some further Trino specific processing, or simply return as the correct result set for the query.&#xA;&#xA;Query all the things&#xA;&#xA;So I still have not really answered your question of why you should care about Trino. The short answer is, Trino acts as a single access point to query all the things. Yup. Oh, and it’s super fast at ad-hoc queries over various data sources including data lakes (e.g. Iceberg/Databricks) or data warehouses (e.g. Hive/Snowflake). It has a connector architecture that allows it to speak the language of a whole bunch of databases. If you have a special use case, you can write your own connector that abstracts any database or service away to just be another table in Trino’s domain. Pretty cool right? But that’s actually rarely needed because the most common databases already have a connector written for them. If not, more connectors are getting added by Trino’s open source community every few months.&#xA;&#xA;To make the benefits of running federated queries a bit more tangible, I will present an example. Trino brings users the ability to map standardized ANSI SQL query to query databases that have a custom query DSL like Elasticsearch. With Trino it’s incredibly simple to set up an Elasticsearch catalog and start running SQL queries on it. If that doesn’t blow your mind, let me explain why that’s so powerful.&#xA;&#xA;Imagine you have five different data stores, each with its own independent query language. Your data science or analyst team just wants access to these data stores. 
It would take a ridiculous amount of time for them to have to go to each data system individually, look up the different commands to pull data out of each one, and dump the data into one location and clean it up so that they can actually run meaningful queries. With Trino all they need to use is SQL to access them through Trino. Also, it doesn’t just stop at accessing the data, your data science team is also able to join data across tables of different databases like a search engine like Elasticsearch with an operational database like MySQL. Further, using Trino even enables joining data sources with themselves where joins are not supported, like in Elasticsearch and MongoDB. Did it happen yet? Is your mind blown?&#xA;&#xA;Getting Started with Trino&#xA;&#xA;So what is required to give Trino a test drive? Relative to many open-source database projects, Trino is one of the more simple projects to install, but this still doesn’t mean it is easy. An important element to a successful project is how it adapts to newer users and expands capability for growth and adoption. This really pushes the importance of making sure that there are multiple avenues of entry into using a product all of which have varying levels of difficulty, cost, customizability, interoperability, and scalability. As you increase in the level of customizability, interoperability, and scalability, you will generally see an increase in difficulty or cost and vice versa. Luckily, when you are starting out, you just really need to play with Trino.&#xA;&#xA;Image added by Author&#xA;&#xA;The low-cost and low difficulty way to try out Trino is to use Docker containers. The nice thing about these containers is that you don’t have to really know anything about the installation process of Trino to play around with Trino. While many enjoy poking around documentation and working with Trino to get it set up, it may not be for all. 
I certainly have my days where I prefer a nice chill CLI sesh and other days where I just need to opt-out. If you want to skip to the Easy Button way to deploy Trino (hint, it’s the SaaS deployment) then skip the next few sections.&#xA;&#xA;!--emailsub--&#xA;&#xA;Using Trino With Docker&#xA;&#xA;Trino ships with a Docker image that does a lot of the setup necessary for Trino to run. Outside of simply running a docker container, there are a few things that need to happen for setup. First, in order to use a database like MySQL, we actually need to run a MySQL container as well using the official mysql image. There is a trino-getting-started repository that contains a lot of the setup needed for using Trino on your own computer or setting it up on a test server as a proof of concept. Clone this repository and follow the instructions in the README to install Docker if it is not already installed.&#xA;&#xA;You can actually run a query before learning the specifics of how this compose file works. Before you run the query, you will need to run the mysql and trino-coordinator instances. To do this, navigate to the mysql/trino-mysql/ directory that contains the docker-compose.yml and run:&#xA;&#xA;docker-compose up -d&#xA;&#xA;Running your first query!&#xA;&#xA;Now that you have Trino running in Docker, you need to open a session to access it. The easiest way to do this is via a console. Run the following Docker command to connect to a terminal on the coordinator:&#xA;&#xA;docker container exec -it trino-mysql_trino-coordinator_1 trino&#xA;&#xA;This will bring you to the Trino terminal.&#xA;&#xA;Your first query will actually be to generate data from the tpch catalog and then query the data that was loaded into the mysql catalog. 
In the terminal, run the following two queries:&#xA;&#xA;CREATE TABLE mysql.tiny.customer&#xA;AS SELECT * FROM tpch.tiny.customer;&#xA;&#xA;SELECT custkey, name, nationkey, phone &#xA;FROM mysql.tiny.customer LIMIT 5;&#xA;&#xA;The output should look like this.&#xA;&#xA;|custkey|name              |nationkey|phone          |&#xA;|-------|------------------|---------|---------------|&#xA;|751    |Customer#000000751|0        |10-658-550-2257|&#xA;|752    |Customer#000000752|8        |18-924-993-6038|&#xA;|753    |Customer#000000753|17       |27-817-126-3646|&#xA;|754    |Customer#000000754|0        |10-646-595-5871|&#xA;|755    |Customer#000000755|16       |26-395-247-2207|&#xA;&#xA;Congrats! You just ran your first query on Trino. Did you feel the rush!? Okay well, technically we just copied data from a data generation connector and moved it into a MySQL database and queried that back out. It’s fine if this simple exercise didn’t send goosebumps flying down your spine but hopefully, you can extrapolate the possibilities when connecting to other datasets.&#xA;&#xA;A good initial exercise to study the compose file and directories before jumping into the Trino installation documentation. 
Let’s see how this was possible by breaking down the docker-compose file that you just ran.&#xA;&#xA;version: &#39;3.7&#39;&#xA;services:&#xA;  trino-coordinator:&#xA;    image: &#39;trinodb/trino:latest&#39;&#xA;    hostname: trino-coordinator&#xA;    ports:&#xA;      - &#39;8080:8080&#39;&#xA;    volumes:&#xA;      - ./etc:/etc/trino&#xA;    networks:&#xA;      - trino-network&#xA;&#xA;  mysql:&#xA;    image: mysql:latest&#xA;    hostname: mysql&#xA;    environment:&#xA;      MYSQL_ROOT_PASSWORD: admin&#xA;      MYSQL_USER: admin&#xA;      MYSQL_PASSWORD: admin&#xA;      MYSQL_DATABASE: tiny&#xA;    ports:&#xA;      - &#39;3306:3306&#39;&#xA;    networks:&#xA;      - trino-network&#xA;networks:&#xA;  trino-network:&#xA;    driver: bridge&#xA;&#xA;Notice that the hostname of mysql matches the instance name, and the mysql instance is on the trino-network that the trino-coordinator instance will also join. Also notice that the mysql image exposes port 3306 on the network.&#xA;&#xA;Finally, we use the trinodb/trino image for the trino-coordinator instance, and use the volumes option to map our local custom configurations for Trino to the /etc/trino directory discussed further down in the Trino Configuration section. Trino is also added to the trino-network and exposes port 8080, which is how external clients can access Trino. The full configurations can be found in this getting started with Trino repository.&#xA;&#xA;These instructions are a basic overview of the more complete installation instructions if you’re really going for it! If you’re not that interested in the installation, feel free to skip ahead to the Deploying Trino at Scale with Kubernetes section. If you’d rather not deal with Kubernetes I offer you another pass to the easy button section of this blog.&#xA;&#xA;Trino requirements&#xA;&#xA;The first requirement is that Trino must be run on a POSIX-compliant system such as Linux or Unix. 
There are some folks in the community that have gotten Trino to run on Windows for testing using runtime environments like cygwin but this is not supported officially. However, in our world of containerization, this is less of an issue and you will be able to at least test this on Docker no matter which operating system you use.&#xA;&#xA;Trino is written in Java and so it requires the Java Runtime Environment (JRE). Trino requires a 64-bit version of Java 11, with a minimum required version of 11.0.7. Newer patch versions such as 11.0.8 or 11.0.9 are recommended. The launch scripts for Trino bin/launcher, also require python version 2.6.x, 2.7.x, or 3.x.&#xA;&#xA;Trino Configuration&#xA;&#xA;To configure Trino, you need to first know the Trino configuration directory. If you were installing Trino by hand, the default would be in a etc/ directory relative to the installation directory. For our example, I’m going to use the default installation directory of the Trino Docker image, which is set in the run-trino script as /etc/trino. We need to create four files underneath this base directory. I will describe what these files do and you can see an example in the docker image I have created below.&#xA;&#xA;config.properties — This is the primary configuration for each node in the trino cluster. There are plenty of options that can be set here, but you’ll typically want to use the default settings when testing. The required configurations include indicating if the node is the coordinator, setting the http port that Trino communicates on, and the discovery node url so that Trino servers can find each other.&#xA;&#xA;jvm.config — This configuration contains the command line arguments you will pass down to the java process that runs Trino.&#xA;&#xA;log.properties — This configuration is helpful to indicate the log levels of various java classes in Trino. 
It can be left empty to use the default log level for all classes.&#xA;&#xA;node.properties — This configuration is used to uniquely identify nodes in the cluster and specify locations of directories in the node.&#xA;&#xA;The next directory you need to know about is the catalog/ directory, located in the root configuration directory. In the docker container, it will be in /etc/trino/catalog. This is the directory that will contain the catalog configurations that Trino will use to connect to the different data sources. For our example, we’ll configure two catalogs, the mysql catalog, and the tpch catalog. The tpch catalog is a simple data generation catalog that simply needs the connector.name property to be configured and is located in /etc/trino/catalog/tpch.properties.&#xA;&#xA;tpch.properties&#xA;&#xA;connector.name=tpch&#xA;&#xA;The mysql catalog just needs the connector.name to specify which connector plugin to use, the connection-url property to point to the mysql instance, and the connection-user and connection-password properties for the mysql user.&#xA;&#xA;mysql.properties&#xA;&#xA;connector.name=mysql&#xA;connection-url=jdbc:mysql://mysql:3306&#xA;connection-user=root&#xA;connection-password=admin&#xA;&#xA;Note: the name of the configuration file becomes the name of the catalog in Trino. If you are familiar with MySQL, you are likely to know that MySQL supports a two-tiered containment hierarchy, though you may have never known it was called that. This containment hierarchy refers to databases and tables. The first tier of the hierarchy is the tables, while the second tier consists of databases. A database contains multiple tables and therefore two tables can have the same name provided they live under a different database.&#xA;&#xA;Image by Author&#xA;&#xA;Since Trino has to connect to multiple databases, it supports a three-tiered containment hierarchy. Rather than calling the second tier databases, Trino refers to this tier as schemas. 
So a database in MySQL is equivalent to a schema in Trino. The third tier, catalogs, allows Trino to distinguish between multiple underlying data sources. Since the file provided to Trino is called mysql.properties it automatically names the catalog mysql, dropping the .properties file extension. To query the customer table in MySQL under the tiny database, you specify the table name mysql.tiny.customer.&#xA;&#xA;If you’ve reached this far, congratulations, you now know how to set up catalogs and query them through Trino! The benefits at this point should be clear, and making a proof of concept is easy to do this way. It’s time to put together that proof of concept for your team and your boss! What next though? How do you actually get this deployed in a reproducible and scalable manner? The next section covers a brief overview of faster ways to get Trino deployed at scale.&#xA;&#xA;!--emailsub--&#xA;&#xA;Deploying Trino at Scale with Kubernetes&#xA;&#xA;Up to this point, this post only describes the deployment process. What about after that, once you’ve deployed Trino to production and slowly onboard engineering, BI/Analytics, and your data science teams? As many Trino users have experienced, the demand on your Trino cluster grows quickly as it becomes the single point of access to all of your data. This is where these small proof-of-concept size installations start to fall apart and you will need something more pliable that scales as your system starts to take on heavier workloads.&#xA;&#xA;You will need to monitor your cluster and will likely need to stand up other services that run these monitoring tasks. This also applies to running other systems for security and authentication management. This list of complexity grows as you consider that all of these systems need to scale and adapt around the growing Trino clusters. 
You may, for instance, consider deploying multiple clusters to handle different workloads, or possibly running tens or hundreds of Trino clusters to provide a self-service platform to provide isolated tenancy in your platform.&#xA;&#xA;The solution to express all of these complex scenarios as the configuration is already solved by using an orchestration platform like Kubernetes, and its package manager project, Helm. Kubernetes offers a powerful way to express all the complex adaptable infrastructures based on your use cases.&#xA;&#xA;In the interest of brevity, I will not include the full set of instructions on how to run a helm chart or cover the basics of running Trino on Kubernetes. Rather, I will refer you to an episode of Trino Community Broadcast that discusses Kubernetes, the community helm chart, and the basics of running Trino on Kubernetes. In the interest of transparency, the official Trino helm charts are still in an early phase of development. There is a very popular community-contributed helm chart that is adapted by many users to suit their needs and it is currently the best open source option for self-managed deployments of Trino. If you decide to take this route, proceed with caution and know that there is development to support the helm deployments moving forward.&#xA;&#xA;While this will provide all the tools to enable a well-suited engineering department to run and maintain their own Trino cluster, this begs the question, based on your engineering team size, should you and your company be investing costly data engineer hours into maintaining, scaling, and hacking required to keep a full-size production infrastructure afloat?&#xA;&#xA;Starburst Galaxy: The Easy Button method of deploying and maintaining Trino&#xA;&#xA;Full Disclosure: This blog post was originally written while I was working at Starburst. 
I still stand by Starburst Galaxy as one of the better options, but I will add the caveat that it depends on your use case, and things change, so reach out if you need my latest thoughts on the matter. That said, Galaxy is the general purpose version of Trino the creators never got to build at Facebook. If you have custom features you&#39;d like to contribute, a common pattern is to run an open source cluster for testing while production is run on Starburst. You can then test and develop features to contribute to open source that will eventually make their way upstream to Galaxy, Athena, or any other Trino variant.&#xA;&#xA;Image By: lostvegas, License: CC BY-NC-ND 2.0&#xA;&#xA;As mentioned, Trino has a relatively simple deployment setup, with an emphasis on relatively. This blog really only hits the tip of the iceberg when it comes to the complexity involved in managing and scaling Trino. While it is certainly possible to manage running Trino and even do so at scale with helm charts in Kubernetes, it is still a difficult setup for Trinewbies and difficult to maintain and scale for those who already have experience maintaining Trino. I experienced firsthand many of these difficulties myself when I began my Trino journey years ago and started on my own quest to help others overcome some of these challenges. This is what led me to cross paths with Starburst, the company behind the SaaS Trino platform Galaxy.&#xA;&#xA;Galaxy makes Trino accessible to companies having difficulties scaling and customizing Trino to their needs. Unless you are in a company that houses a massive data platform and you have dedicated data and DevOps engineers for each system in your platform, many of these options won’t be feasible for you in the long run.&#xA;&#xA;One thing to make clear is that a Galaxy cluster is really just a Trino cluster on demand. 
Outside of managing the scaling policies, to avoid any surprises on your cloud bill, you really don’t have to think about scaling Trino up or down, or suspending it when it is not in use. The beautiful thing about Trino, and therefore Galaxy, is that it is an ephemeral compute engine, much like AWS Lambda, that you can quickly spin up or down. Not only are you able to run ad-hoc and federated queries over disparate data sources, but now you can also run the infrastructure for those queries on demand with almost no cost to your engineering team’s time.&#xA;&#xA;Getting Started With Galaxy&#xA;&#xA;Here’s a quick getting started guide for Starburst Galaxy that mirrors the setup we realized with the Docker example above with Trino and MySQL.&#xA;&#xA;Set up a trial of Galaxy by filling in your information at the bottom of the Galaxy information page.&#xA;Once you receive a link, you will see the sign-up screen. Fill out the email address, enter the pin sent to the email, and choose the domain for your cluster.&#xA;The rest of the tutorial is provided in the video below, which gives a basic demo of what you’ll need to do to get started.&#xA;&#xA;This introduction may feel a bit underwhelming, but extrapolate to being able to run federated queries across your relational databases like MySQL, a data lake storing data in S3, or soon data in many NoSQL and real-time data stores. The true power of Starburst Galaxy is that your team will no longer need to dedicate a giant backlog of tickets aimed at scaling up and down, monitoring, and securing Trino. Rather, you can return to focusing on the business problems and the best model for the data in your domain.&#xA;&#xA;trino&#xA;&#xA;!--emailsub--&#xA;]]&gt;</description>
<content:encoded><![CDATA[<h2 id="learn-how-to-quickly-join-data-across-multiple-sources">Learn how to quickly join data across multiple sources</h2>

<p>If you haven’t heard of Trino before, it is a query engine that speaks the language of many genres of databases. As such, Trino is commonly used to provide fast ad-hoc queries across heterogeneous data sources. Trino’s initial use case was built around replacing the Hive runtime engine to allow for faster querying of Big Data warehouses and data lakes. This may be the first time you have heard of <a href="https://trino.io/">Trino</a>, but you’ve likely heard of the project from which it was <a href="https://venturebeat.com/2021/08/27/who-owns-open-source-projects-people-or-companies/">“forklifted”</a>, Presto. If you want to learn more about <a href="https://trino.io/blog/2020/12/27/announcing-trino.html">why the creators of Presto now work on Trino (formerly PrestoSQL)</a> you can read the renaming blog that they produced earlier this year. Before you commit too much to this blog, I’d like to let you know why you should even care about Trino.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe91ba54f-4f2c-4516-99fc-59c5c7cd8fd0_512x241.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe91ba54f-4f2c-4516-99fc-59c5c7cd8fd0_512x241.png" alt=""/></a></p>





<h3 id="so-what-is-trino-anyways">So what is Trino anyways?</h3>

<p>The first thing I like to make sure people know about when discussing Trino is that it is a SQL query engine, but not a SQL database. What does that mean? Traditional databases typically consist of a query engine and a storage engine. Trino is just a query engine and does not store data. Instead, Trino interacts with various databases that store their own data in their own formats. Trino parses and analyzes the SQL query you pass in, creates and optimizes a query execution plan that includes the data sources, and then schedules worker nodes that are able to intelligently query the underlying databases they connect to.</p>

<p>I say intelligently, specifically referring to pushdown queries. That’s right, the most intelligent thing for Trino to do is to avoid making more work for itself, and try to offload that work to the underlying database. This makes sense as the underlying databases generally have special indexes and data that are stored in a specific format to optimize the read time. It would be silly of Trino to ignore all of that optimized reading capability and do a linear scan of all the data to run the query itself. The goal in most optimizations for Trino is to push down the query to the database and only get back the smallest amount of data needed to join with another dataset from another database, do some further Trino-specific processing, or simply return it as the correct result set for the query.</p>
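
<p>To check whether pushdown is happening, you can inspect the query plan with <code>EXPLAIN</code>. As a small sketch, assuming a catalog named <code>mysql</code> backed by the MySQL connector (the exact plan text varies by Trino version):</p>

<pre><code>-- When the connector supports predicate pushdown, the WHERE clause
-- is folded into the table scan instead of appearing as a separate
-- filter step above it in the plan.
EXPLAIN
SELECT custkey, name
FROM mysql.tiny.customer
WHERE nationkey = 8;
</code></pre>

<p>When the predicate is pushed down, MySQL evaluates the filter itself, so only the matching rows travel over the wire to Trino.</p>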

<h4 id="query-all-the-things">Query all the things</h4>

<p>So I still have not really answered your question of why you should care about Trino. The short answer is, Trino acts as a single access point to query all the things. Yup. Oh, and it’s super fast at ad-hoc queries over various data sources including data lakes (e.g. Iceberg/Databricks) or data warehouses (e.g. Hive/Snowflake). It has a <a href="https://trino.io/docs/current/develop/connectors.html">connector architecture</a> that allows it to speak the language of <a href="https://trino.io/docs/current/connector.html">a whole bunch of databases</a>. If you have a special use case, you can write your own connector that abstracts any database or service away to just be another table in Trino’s domain. Pretty cool right? But that’s actually rarely needed because the most common databases already have a connector written for them. If not, <a href="https://github.com/trinodb/trino/issues/4500">more connectors are getting added by Trino’s open source community every few months</a>.</p>

<p>To make the benefits of running federated queries a bit more tangible, I will present an example. Trino gives users the ability to run <a href="https://trino.io/docs/current/language.html">standardized ANSI SQL</a> queries against databases that have a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html">custom query DSL, like Elasticsearch</a>. With Trino it’s incredibly simple to set up an Elasticsearch catalog and start running SQL queries on it. If that doesn’t blow your mind, let me explain why that’s so powerful.</p>

<p>Imagine you have five different data stores, each with its own independent query language. Your data science or analyst team just wants access to these data stores. It would take a ridiculous amount of time for them to go to each data system individually, look up the different commands to pull data out of each one, dump the data into one location, and clean it up so that they can actually run meaningful queries. With Trino, all they need is SQL. And it doesn’t stop at accessing the data: your data science team can also join data across tables in different databases, such as a search engine like Elasticsearch with an operational database like MySQL. Further, Trino even enables joining a data source with itself in systems that do not support joins natively, like Elasticsearch and MongoDB. Did it happen yet? Is your mind blown?</p>
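<p>To make that concrete, here is a hedged sketch of a cross-source join. The elasticsearch catalog, its logs index, and its fields are hypothetical and only for illustration; mysql.tiny.customer is the table created later in this post.</p>

<pre><code>SELECT c.name, count(*) AS error_count
FROM elasticsearch.default.logs l
JOIN mysql.tiny.customer c
  ON l.custkey = c.custkey
WHERE l.level = 'ERROR'
GROUP BY c.name;
</code></pre>

<p>To the query author this is just ANSI SQL; behind the scenes Trino translates the Elasticsearch side into the appropriate DSL calls and the MySQL side into JDBC queries.</p>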

<h3 id="getting-started-with-trino">Getting Started with Trino</h3>

<p>So what is required to give Trino a test drive? Relative to many open-source database projects, Trino is one of the simpler ones to install, but that still doesn’t mean it is easy. An important element of a successful project is how it adapts to newer users and expands capability for growth and adoption. This underscores the importance of offering multiple avenues of entry into a product, with varying levels of difficulty, cost, customizability, interoperability, and scalability. As customizability, interoperability, and scalability increase, you will generally see an increase in difficulty or cost, and vice versa. Luckily, when you are starting out, you just need to play with Trino.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9876726b-409f-4e0d-a768-967bba0abe9e_600x390.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9876726b-409f-4e0d-a768-967bba0abe9e_600x390.png" alt=""/></a></p>

<p>Image added by Author</p>

<p>The low-cost, low-difficulty way to try out Trino is to use <a href="https://www.docker.com/">Docker containers</a>. The nice thing about these containers is that you don’t have to know anything about the installation process of Trino to play around with it. While many enjoy poking around documentation and working through a Trino setup, it may not be for everyone. I certainly have my days where I prefer a nice chill CLI sesh and other days where I just need to opt out. If you want to skip to the Easy Button way to deploy Trino (hint, it’s the SaaS deployment), then skip the next few sections.</p>



<h4 id="using-trino-with-docker">Using Trino With Docker</h4>

<p>Trino ships with <a href="https://hub.docker.com/r/trinodb/trino">a Docker image</a> that does a lot of the setup necessary for Trino to run. Beyond simply running the Trino container, a few other things need to happen for setup. First, in order to use a database like MySQL, we need to run a MySQL container as well, using the official mysql image. There is <a href="https://github.com/bitsondatadev/trino-getting-started">a trino-getting-started repository</a> that contains a lot of the setup needed for using Trino on your own computer or setting it up on a test server as a proof of concept. Clone this repository and follow the instructions in the README to install Docker if it is not already installed.</p>

<p>You can actually run a query before learning the specifics of how this compose file works. Before you run the query, you will need to run the mysql and trino-coordinator instances. To do this, navigate to the mysql/trino-mysql/ directory that contains the docker-compose.yml and run:</p>

<pre><code>docker-compose up -d
</code></pre>

<h4 id="running-your-first-query">Running your first query!</h4>

<p>Now that you have Trino running in Docker, you need to open a session to access it. The easiest way to do this is via a console. Run the following Docker command to connect to a terminal on the coordinator:</p>

<pre><code>docker container exec -it trino-mysql_trino-coordinator_1 trino
</code></pre>

<p>This will bring you to the Trino terminal.</p>

<pre><code>trino&gt;
</code></pre>

<p>Your first query will actually be to generate data from the tpch catalog and then query the data that was loaded into the mysql catalog. In the terminal, run the following two queries:</p>

<pre><code>CREATE TABLE mysql.tiny.customer
AS SELECT * FROM tpch.tiny.customer;
</code></pre>

<pre><code>SELECT custkey, name, nationkey, phone 
FROM mysql.tiny.customer LIMIT 5;
</code></pre>

<p>The output should look like this.</p>

<pre><code>|custkey|name              |nationkey|phone          |
|-------|------------------|---------|---------------|
|751    |Customer#000000751|0        |10-658-550-2257|
|752    |Customer#000000752|8        |18-924-993-6038|
|753    |Customer#000000753|17       |27-817-126-3646|
|754    |Customer#000000754|0        |10-646-595-5871|
|755    |Customer#000000755|16       |26-395-247-2207|
</code></pre>

<p>Congrats! You just ran your first query on Trino. Did you feel the rush!? Okay well, technically we just copied data from a data generation connector and moved it into a MySQL database and queried that back out. It’s fine if this simple exercise didn’t send goosebumps flying down your spine but hopefully, you can extrapolate the possibilities when connecting to other datasets.</p>
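<p>If you want a small taste of actual federation with this same setup, you can join the MySQL copy of the data back against a table that still lives in the tpch catalog, all in one query (a sketch; tpch.tiny.nation ships with the data generation connector):</p>

<pre><code>SELECT n.name AS nation, count(*) AS customers
FROM mysql.tiny.customer c
JOIN tpch.tiny.nation n ON c.nationkey = n.nationkey
GROUP BY n.name
ORDER BY customers DESC;
</code></pre>

<p>One side of the join is served by MySQL and the other by the data generation connector, yet the query reads like it is hitting a single database.</p>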

<p>A good initial exercise is to study the compose file and directories before jumping into the Trino installation documentation. Let’s see how this was possible by breaking down the docker-compose file that you just ran.</p>

<pre><code>version: &#39;3.7&#39;
services:
  trino-coordinator:
    image: &#39;trinodb/trino:latest&#39;
    hostname: trino-coordinator
    ports:
      - &#39;8080:8080&#39;
    volumes:
      - ./etc:/etc/trino
    networks:
      - trino-network

  mysql:
    image: mysql:latest
    hostname: mysql
    environment:
      MYSQL_ROOT_PASSWORD: admin
      MYSQL_USER: admin
      MYSQL_PASSWORD: admin
      MYSQL_DATABASE: tiny
    ports:
      - &#39;3306:3306&#39;
    networks:
      - trino-network
networks:
  trino-network:
    driver: bridge
</code></pre>

<p>Notice that the hostname of mysql matches the instance name, and the mysql instance is on the trino-network that the trino-coordinator instance will also join. Also notice that the mysql image exposes port 3306 on the network.</p>

<p>Finally, we use the trinodb/trino image for the trino-coordinator instance, and use the volumes option to map our local custom configurations for Trino to the /etc/trino directory discussed further down in the <em>Trino Configuration</em> section. Trino is also added to the trino-network and exposes port 8080, which is how external clients access Trino. The full configurations can be found in this <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/mysql/trino-mysql">getting started with Trino repository</a>.</p>

<p>These instructions are a basic overview of <a href="https://trino.io/docs/current/installation/deployment.html">the more complete installation instructions</a> if you’re really going for it! If you’re not that interested in the installation, feel free to skip ahead to the Deploying Trino at Scale with Kubernetes section. If you’d rather not deal with Kubernetes I offer you another pass to the easy button section of this blog.</p>

<h4 id="trino-requirements">Trino requirements</h4>

<p>The first requirement is that Trino must be run on a POSIX-compliant system such as Linux or Unix. There are some folks in the community that have gotten Trino to run on Windows for testing using runtime environments like cygwin but this is not supported officially. However, in our world of containerization, this is less of an issue and you will be able to at least test this on <a href="https://www.docker.com/">Docker</a> no matter which operating system you use.</p>

<p>Trino is written in Java and so it requires a Java Runtime Environment (JRE). Trino requires a 64-bit version of Java 11, with a minimum version of 11.0.7. Newer patch versions such as 11.0.8 or 11.0.9 are recommended. The Trino launch script, bin/launcher, also requires Python version 2.6.x, 2.7.x, or 3.x.</p>

<h4 id="trino-configuration">Trino Configuration</h4>

<p>To configure Trino, you need to first know the Trino configuration directory. If you were installing Trino by hand, the default would be an etc/ directory relative to the installation directory. For our example, I’m going to use the default installation directory of the <a href="https://hub.docker.com/r/trinodb/trino">Trino Docker image</a>, which is <a href="https://github.com/trinodb/trino/blob/356/core/docker/bin/run-trino#L15">set in the run-trino script</a> as /etc/trino. We need to create four files underneath this base directory. I will describe what these files do and you can see an example in the docker image I have created below.</p>
<ol><li><p>config.properties — This is the primary configuration for each node in the trino cluster. There are plenty of options that can be set here, but you’ll typically want to use the default settings when testing. The required configurations include indicating if the node is the coordinator, setting the http port that Trino communicates on, and the discovery node url so that Trino servers can find each other.</p></li>

<li><p>jvm.config — This configuration contains the command line arguments you will pass down to the java process that runs Trino.</p></li>

<li><p>log.properties — This configuration is helpful to indicate the log levels of various java classes in Trino. It can be left empty to use the default log level for all classes.</p></li>

<li><p>node.properties — This configuration is used to uniquely identify nodes in the cluster and specify locations of directories in the node.</p></li></ol>
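<p>For reference, a minimal single-node config.properties, following the single-machine example in the Trino deployment docs, looks roughly like this (the port and URI here are assumptions to adjust for your environment):</p>

<pre><code>coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://localhost:8080
</code></pre>

<p>Setting node-scheduler.include-coordinator=true lets the coordinator also perform worker duty, which is convenient for testing but not recommended for production clusters.</p>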

<p>The next directory you need to know about is the catalog/ directory, located in the root configuration directory. In the docker container, it will be in /etc/trino/catalog. This is the directory that will contain the catalog configurations that Trino will use to connect to the different data sources. For our example, we’ll configure two catalogs, the mysql catalog, and the tpch catalog. The tpch catalog is a simple data generation catalog that only needs the connector.name property to be configured, and it is located at /etc/trino/catalog/tpch.properties.</p>

<p>tpch.properties</p>

<pre><code>connector.name=tpch
</code></pre>

<p>The mysql catalog just needs the connector.name to specify which connector plugin to use, the connection-url property to point to the mysql instance, and the connection-user and connection-password properties for the mysql user.</p>

<p>mysql.properties</p>

<pre><code>connector.name=mysql
connection-url=jdbc:mysql://mysql:3306
connection-user=root
connection-password=admin
</code></pre>

<p>Note: the name of the configuration file becomes the name of the catalog in Trino. If you are familiar with MySQL, you likely know that MySQL supports a two-tiered containment hierarchy, though you may have never known it was called that. This containment hierarchy refers to databases and tables. The first tier of the hierarchy consists of <em>databases</em>, while the second tier consists of <em>tables</em>. A database contains multiple tables, and therefore two tables can have the same name provided they live under different databases.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f2b877-50b9-4d11-b064-f5ae0b8323db_800x450.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f2b877-50b9-4d11-b064-f5ae0b8323db_800x450.png" alt=""/></a></p>

<p>Image by Author</p>

<p>Since Trino has to connect to multiple databases, it supports a three-tiered containment hierarchy. Rather than calling the second tier databases, Trino refers to this tier as <em>schemas</em>. So a database in MySQL is equivalent to a schema in Trino. The additional tier that allows Trino to distinguish between multiple underlying data sources is made up of <em>catalogs</em>. Since the file provided to Trino is called mysql.properties, it automatically names the catalog mysql, dropping the .properties extension. To query the customer table in MySQL under the tiny schema, you specify the following table name: mysql.tiny.customer.</p>
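<p>Once the catalogs are in place, you can explore this three-tiered hierarchy directly from the Trino CLI; each statement below peels back one tier:</p>

<pre><code>SHOW CATALOGS;
SHOW SCHEMAS FROM mysql;
SHOW TABLES FROM mysql.tiny;
SELECT * FROM mysql.tiny.customer LIMIT 5;
</code></pre>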

<p>If you’ve reached this far, congratulations, you now know how to set up catalogs and query them through Trino! The benefits at this point should be clear, and making a proof of concept is easy to do this way. It’s time to put together that proof of concept for your team and your boss! What next though? How do you actually get this deployed in a reproducible and scalable manner? The next section covers a brief overview of faster ways to get Trino deployed at scale.</p>



<h3 id="deploying-trino-at-scale-with-kubernetes">Deploying Trino at Scale with Kubernetes</h3>

<p>Up to this point, this post has only described the deployment process. But what about after that, once you’ve deployed Trino to production and slowly onboard engineering, BI/analytics, and data science teams? As many Trino users have experienced, demand on your Trino cluster grows quickly as it becomes the single point of access to all of your data. This is where these small proof-of-concept installations start to fall apart, and you will need something more pliable that scales as your system takes on heavier workloads.</p>

<p>You will need to monitor your cluster and will likely need to stand up other services that run these monitoring tasks. The same applies to running other systems for security and authentication management. This list grows in complexity as you consider that all of these systems need to scale and adapt around the growing Trino clusters. You may, for instance, consider deploying <a href="https://shopify.engineering/faster-trino-query-execution-infrastructure">multiple clusters to handle different workloads</a>, or possibly running tens or hundreds of Trino clusters to provide a self-service platform with isolated tenancy.</p>

<p>Expressing all of these complex scenarios as configuration is a problem already solved by an orchestration platform like Kubernetes and its package manager project, Helm. Kubernetes offers a powerful way to express all of the complex, adaptable infrastructure your use cases require.</p>

<p>In the interest of brevity, I will not include the full set of instructions on how to run a helm chart or cover the basics of running Trino on Kubernetes. Rather, I will refer you to <a href="https://trino.io/episodes/24.html">an episode of Trino Community Broadcast</a> that discusses Kubernetes, the community helm chart, and the basics of running Trino on Kubernetes. In the interest of transparency, <a href="https://github.com/trinodb/charts">the official Trino helm charts</a> are still in an early phase of development. There is a very popular <a href="https://github.com/valeriano-manassero/helm-charts/tree/main/valeriano-manassero/trino">community-contributed helm chart</a> that is adapted by many users to suit their needs and it is currently the best open source option for self-managed deployments of Trino. If you decide to take this route, proceed with caution and know that there is <a href="https://github.com/trinodb/charts/pull/11">development to support the helm deployments</a> moving forward.</p>
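<p>For orientation, installing the official chart boils down to a couple of commands. This is a sketch based on the trinodb/charts repository at the time of writing; the release name here is arbitrary, and the chart’s defaults will almost certainly need overriding for any real deployment.</p>

<pre><code>helm repo add trino https://trinodb.github.io/charts
helm install example-trino-cluster trino/trino
</code></pre>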

<p>While this provides all the tools for a well-staffed engineering department to run and maintain its own Trino cluster, it begs the question: given your engineering team’s size, should you and your company be investing costly data engineering hours into the maintaining, scaling, and hacking required to keep a full-size production infrastructure afloat?</p>

<h3 id="starburst-galaxy-the-easy-button-method-of-deploying-and-maintaining-trino">Starburst Galaxy: The Easy Button method of deploying and maintaining Trino</h3>

<p><em>Full Disclosure:</em> This blog post was originally written while I was working at Starburst. I still stand by Starburst Galaxy as one of the better options, but I will add the caveat that it depends on your use case, and things change, so reach out if you need my latest thoughts on the matter. That said, Galaxy is the general-purpose version of Trino the creators never got to build at Facebook. If you have custom features you’d like to contribute, a common pattern is to run an open source cluster for testing while production runs on Starburst. You can then test and develop features to contribute to open source that will eventually make their way into Galaxy, Athena, or any other Trino variant.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf49db7c-b00d-4291-b010-3835451379d6_800x572.jpeg"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf49db7c-b00d-4291-b010-3835451379d6_800x572.jpeg" alt=""/></a></p>

<p>Image By: lostvegas, License: CC BY-NC-ND 2.0</p>

<p>As mentioned, Trino has a <em>relatively</em> simple deployment setup, with an emphasis on relatively. This blog really only hits the tip of the iceberg when it comes to the complexity involved in managing and scaling Trino. While it is certainly possible to run Trino, and even do so at scale with helm charts in Kubernetes, it is still a difficult setup for Trinewbies, and difficult to maintain and scale for those who already have experience maintaining Trino. I experienced many of these difficulties firsthand when I began my Trino journey years ago, and I started on my own quest to help others overcome some of these challenges. This is what led me to cross paths with Starburst, the company behind the SaaS Trino platform, Galaxy.</p>

<p>Galaxy makes Trino accessible to companies having difficulties scaling and customizing Trino to their needs. Unless you are in a company that houses a massive data platform and you have dedicated data and DevOps engineers to each system in your platform, many of these options won’t be feasible for you in the long run.</p>

<p>One thing to make clear is that a Galaxy cluster is really just a Trino cluster on demand. Outside of managing the scaling policies to avoid any surprises on your cloud bill, you really don’t have to think about scaling Trino up or down, or suspending it when it is not in use. The beautiful thing about Trino, and therefore Galaxy, is that it is an ephemeral compute engine, much like AWS Lambda, that you can quickly spin up or down. Not only are you able to run ad-hoc and federated queries over disparate data sources, but now you can also run the infrastructure for those queries on demand with almost no cost to your engineering team’s time.</p>

<h4 id="getting-started-with-galaxy">Getting Started With Galaxy</h4>

<p>Here’s a quick getting started guide with Starburst Galaxy that mirrors the setup from the Docker example above with Trino and MySQL.</p>
<ul><li>Set up a trial of Galaxy by filling in your information at the bottom of the <a href="http://starburst.io/galaxy">Galaxy information page</a>.</li>
<li>Once you receive a link, you will see this sign-up screen. Fill out the email address, enter the pin sent to the email, and choose the domain for your cluster.</li>
<li>The rest of the tutorial is provided in the video below, which gives a basic demo of what you’ll need to do to get started.</li></ul>

<p>This introduction may feel a bit underwhelming but extrapolate being able to run federated queries across your relational databases like MySQL, a data lake storing data in S3, or soon data in many NoSQL and real-time data stores. The true power of Starburst Galaxy is that now your team will no longer need to dedicate a giant backlog of tickets aimed at scaling up and down, monitoring, and securing Trino. Rather you can return to focus on the business problems and the best model for the data in your domain.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/intro-to-trino-for-the-trinewbie</guid>
      <pubDate>Fri, 17 Dec 2021 18:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice IV: Deep dive into Iceberg internals</title>
      <link>https://bitsondata.dev/trino-iceberg-iv-deep-dive?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;So far, this series has covered some very interesting user level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting some files that result from various operations carried out using Trino. To dissect you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool and Iceberg’s core library. It’s useful to dissect how these files work, not only to help understand how Iceberg works, but also to aid in troubleshooting issues, should you have any issues during ingestion or querying of your Iceberg table. I like to think of this type of debugging much like a fun game of operation, and you’re looking to see what causes the red errors to fly by on your screen.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;Understanding Iceberg metadata&#xA;&#xA;Iceberg can use any compatible metastore, but for Trino, it only supports the  Hive metastore and AWS Glue similar to the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that currently use data lakes already use the Hive connector and therefore the Hive metastore. This makes it convenient to have as the leading supported use case as existing users can easily migrate between Hive to Iceberg tables. 
Since there is no indication of which connector is actually executed in the diagram of the Hive connector architecture, it serves as a diagram that can be used for both Hive and Iceberg. The only difference is the connector used, but if you create a table in Hive, you can  view the same table in Iceberg.&#xA;&#xA;To recap the steps taken from the first three blogs; the first blog created an events table, while the first two blogs ran two insert statements. The first insert contained three records, while the second insert contained a single record.&#xA;&#xA;Up until this point, the state of the files in MinIO haven’t really been shown except some of the manifest list pointers from the snapshot in the third blog post. Using the MinIO client tool, you can list files that Iceberg generated through all these operations and then try to understand what purpose they are serving.&#xA;&#xA;% mc tree -f local/&#xA;local/&#xA;└─ iceberg&#xA;   └─ logging.db&#xA;      └─ events&#xA;         ├─ data&#xA;         │  ├─ eventtimeday=2021-04-01&#xA;         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#xA;         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc&#xA;         │  └─ eventtimeday=2021-04-02&#xA;         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#xA;         └─ metadata&#xA;            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#xA;            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#xA;            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#xA;            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#xA;            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#xA;            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#xA;            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#xA;&#xA;There are a lot of files here, but here are a couple of patterns that 
you can observe with these files.&#xA;&#xA;First, the top two directories are named data and metadata.&#xA;&#xA;/bucket/database/table/data//bucket/database/table/metadata/&#xA;&#xA;As you might expect, data contains the actual ORC files split by partition. This is akin to what you would see in a Hive table data directory. What is really of interest here is the metadata directory. There are specifically three patterns of files you’ll find here.&#xA;&#xA;/bucket/database/table/metadata/file-id.avro&#xA;&#xA;/bucket/database/table/metadata/snap-snapshot-id-version-file-id.avro&#xA;&#xA;/bucket/database/table/metadata/version-commit-UUID.metadata.json&#xA;&#xA;Iceberg has a persistent tree structure that manages various snapshots of the data that are created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.&#xA;&#xA;This tree structure contains two types of Avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files and the manifest files themselves point to various data files. This post starts out by covering these manifest files, and later covers the table metadata files that are suffixed by .metadata.json.&#xA;&#xA;The last blog covered the command in Trino that shows the snapshot information that is stored in the metastore. 
Here is that command and its output again for your review.&#xA;&#xA;SELECT manifestlist &#xA;FROM iceberg.logging.&#34;events$snapshots&#34;;&#xA;&#xA;Result:&#xA;&#xA;snapshots&#xA;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#xA;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#xA;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#xA;&#xA;You’ll notice that the Avro files prefixed with snap- are returned. These files are directly correlated with the snapshot record stored in the metastore. According to the diagram above, snapshots are records in the metastore that contain the url of the manifest list in the Avro file. Avro files are binary files and not something you can just open up in a text editor to read. Using the avro-tools.jar tool distributed by the Apache Avro project, you can actually inspect the contents of this file to get a better understanding of how it is used by Iceberg.&#xA;&#xA;The first snapshot is generated on the creation of the events table. Upon inspecting this file, you notice that the file is empty. The output is an empty line that the jq JSON command line utility removes on pretty printing the JSON that is returned, which is just a newline. This snapshot represents an empty state of the table upon creation. To investigate the snapshots you need to download the files to your local filesystem. 
Let&#39;s move them to the home  directory:&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .&#xA;&#xA;Result: (is empty)&#xA;&#xA;The second snapshot is a little more interesting and actually shows us the contents of a manifest list.&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .&#xA;&#xA;Result:&#xA;&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,&#xA;   &#34;manifestlength&#34;:6114,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;addeddatafilescount&#34;:{&#xA;      &#34;int&#34;:2&#xA;   },&#xA;   &#34;existingdatafilescount&#34;:{&#xA;      &#34;int&#34;:0&#xA;   },&#xA;   &#34;deleteddatafilescount&#34;:{&#xA;      &#34;int&#34;:0&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:3&#xA;   },&#xA;   &#34;existingrowscount&#34;:{&#xA;      &#34;long&#34;:0&#xA;   },&#xA;   &#34;deletedrowscount&#34;:{&#xA;      &#34;long&#34;:0&#xA;   }&#xA;}&#xA;&#xA;To understand each of the values in each of these rows, you can refer to the  Iceberg &#xA;specification in the manifest list file section. Instead of covering these exhaustively, let&#39;s focus on a few key fields. 
Below are the fields, and their definition according to the specification.&#xA;&#xA;manifestpath - Location of the manifest file.&#xA;partitionspecid - ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.&#xA;addedsnapshotid - ID of the snapshot where the manifest file was added.&#xA;partitions - A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.&#xA;addedrowscount - Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.&#xA;&#xA;As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tells any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse this tree, you can look over the different manifest paths to find all the manifest files associated with the particular snapshot you want to traverse. Partition spec ids are helpful to know the current partition specification which are stored in the table metadata in the metastore. This references where to find the spec in the metastore. Added snapshot ids tells you which snapshot is associated with the manifest list. Partitions hold some high level partition bound information to make for faster querying. If a query is looking for a particular value, it only traverses the manifest files where the query values fall within the range of the file values. Finally, you get a few metrics like the number of changed rows and data files, one of which is the count of added rows. The first operation consisted of three rows inserts and the second operation was the insertion of one row. 
Using the row counts you can easily determine which manifest file belongs to which operation.&#xA;&#xA;The following command shows the final snapshot after both operations executed and filters out only the fields pointed out above.&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &#39;. | {manifestpath: .manifestpath, partitionspecid: .partitionspecid, addedsnapshotid: .addedsnapshotid, partitions: .partitions, addedrowscount: .addedrowscount }&#39;&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#34;,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:4564366177504223700&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:1&#xA;   }&#xA;}&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   
&#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:3&#xA;   }&#xA;}&#xA;&#xA;In the listing of the manifest file related to the last snapshot, you notice the first operation where three rows were inserted is contained in the manifest file in the second JSON object. You can determine this from the snapshot id, as well as, the number of rows that were added in the operation. The first JSON object contains the last operation that inserted a single row. So the most recent operations are listed in reverse commit order.&#xA;&#xA;The next command does the same listing of the file that you ran with the manifest list, except you run this on the manifest files themselves to expose their contents and discuss them. To begin with, you run the command to show the contents of the manifest file associated with the insertion of three rows.&#xA;&#xA;% java -jar  ~/avro-tools-1.10.0.jar tojson ~/Desktop/avrofiles/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;status&#34;:1,&#xA;   &#34;snapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;datafile&#34;:{&#xA;      &#34;filepath&#34;:&#34;s3a://iceberg/logging.db/events/data/eventtimeday=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#34;,&#xA;      &#34;fileformat&#34;:&#34;ORC&#34;,&#xA;      &#34;partition&#34;:{&#xA;         &#34;eventtimeday&#34;:{&#xA;            &#34;int&#34;:18718&#xA;         }&#xA;      },&#xA;      &#34;recordcount&#34;:1,&#xA;      &#34;filesizeinbytes&#34;:870,&#xA;      &#34;blocksizeinbytes&#34;:67108864,&#xA;      &#34;columnsizes&#34;:null,&#xA;      &#34;valuecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:1&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:1&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:1&#xA;       
     },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:1&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nullvaluecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:0&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nanvaluecounts&#34;:null,&#xA;      &#34;lowerbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;upperbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;keymetadata&#34;:null,&#xA;      &#34;splitoffsets&#34;:null&#xA;   }&#xA;}&#xA;{&#xA;   &#34;status&#34;:1,&#xA;   &#34;snapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;datafile&#34;:{&#xA;      &#34;filepath&#34;:&#34;s3a://iceberg/logging.db/events/data/eventtimeday=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#34;,&#xA;      &#34;fileformat&#34;:&#34;ORC&#34;,&#xA;      &#34;partition&#34;:{&#xA;         &#34;eventtimeday&#34;:{&#xA;            &#34;int&#34;:18719&#xA;         }&#xA;      },&#xA;      
&#34;recordcount&#34;:2,&#xA;      &#34;filesizeinbytes&#34;:1084,&#xA;      &#34;blocksizeinbytes&#34;:67108864,&#xA;      &#34;columnsizes&#34;:null,&#xA;      &#34;valuecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:2&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nullvaluecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:0&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nanvaluecounts&#34;:null,&#xA;      &#34;lowerbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Double oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;upperbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;WARN&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Maybeh oh noes?&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      
&#34;keymetadata&#34;:null,&#xA;      &#34;splitoffsets&#34;:null&#xA;   }&#xA;}&#xA;&#xA;Now this is a very big output, but in summary, there’s really not too much to these files. As before, there is a Manifest section in the Iceberg spec that details what each of these fields means. Here are the important fields:&#xA;&#xA;snapshotid - Snapshot id where the file was added, or deleted if status is two. Inherited when null.&#xA;datafile - Field containing metadata about the data files pertaining to the manifest file, such as file path, partition tuple, metrics, etc…&#xA;datafile.filepath - Full URI for the file with FS scheme.&#xA;datafile.partition - Partition data tuple, schema based on the partition spec.&#xA;datafile.recordcount - Number of records in the data file.&#xA;datafile.count - Multiple fields that contain a map from column id to  number of values, null, nan counts in the file. These can be used to quickly  filter out unnecessary get operations.&#xA;datafile.bounds - Multiple fields that contain a map from column id to lower or upper bound in the column serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.&#xA;&#xA;Each data file struct contains a partition and data file that it maps to. These files only be scanned and returned if the criteria for the query is met when  checking all of the count, bounds, and other statistics that are recorded in the file. Ideally only files that contain data relevant to the query should be scanned at all. Having information like the record count may also help in the query planning process to determine splits and other information. This particular optimization hasn’t been completed yet as planning typically happens before traversal of the files. It is still in ongoing discussion and is discussed a bit by Iceberg creator Ryan Blue in a recent meetup. 
If this is something you are interested in, keep posted on the Slack channel and releases as the Trino Iceberg connector progresses in this area.&#xA;&#xA;As mentioned above, the last set of files that you find in the metadata directory which are suffixed with .metadata.json. These files at baseline are a bit strange as they aren’t stored in the Avro format, but instead the JSON format. This is because they are not part of the persistent tree structure. These files are essentially a copy of the table metadata that is stored in the metastore. You can find the fields for the table metadata listed in the Iceberg specification. These tables are typically stored persistently in a metasture much like the Hive metastore but could easily be replaced by any datastore that can support an atomic swap (check-and-put) operation required for Iceberg to support the optimistic concurrency operation.&#xA;&#xA;The naming of the table metadata includes a table version and UUID: &#xA;table-version-UUID.metadata.json. To commit a new metadata version, which just adds 1 to the current version number, the writer performs these steps:&#xA;&#xA;It creates a new table metadata file using the current metadata.&#xA;It writes the new table metadata to a file following the naming with the next version number.&#xA;It requests the metastore swap the table’s metadata pointer from the old location to the new location.&#xA;&#xA;    If the swap succeeds, the commit succeeded. The new file is now the &#xA;    current metadata.&#xA;    If the swap fails, another writer has already created their own. The&#xA;    current writer goes back to step 1.&#xA;&#xA;If you want to see where this is stored in the Hive metastore, you can reference the TABLEPARAMS table. 
At the time of writing, this is the only method of using the metastore that is supported by the Trino Iceberg connector.&#xA;&#xA;SELECT PARAMKEY, PARAMVALUE FROM metastore.TABLEPARAMS;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthPARAMKEY/ththPARAMVALUE/th/tr&#xA;trtdEXTERNAL/tdtdTRUE/td/tr&#xA;trtdmetadatalocation/tdtds3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json/td/tr&#xA;trtdnumFiles/tdtd2/td/tr&#xA;trtdpreviousmetadatalocation/tdtds3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json/td/tr&#xA;trtdtabletype/tdtdiceberg/td/tr&#xA;trtdtotalSize/tdtd5323/td/tr&#xA;trtdtransientlastDdlTime/tdtd1622865672/td/tr&#xA;/table&#xA;&#xA;So as you can see, the metastore is saying the current metadata location is the&#xA;00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json file. Now you can dive in to see the table metadata that is being used by the Iceberg connector.&#xA;&#xA;% cat ~/Desktop/avrofiles/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;format-version&#34;:1,&#xA;   &#34;table-uuid&#34;:&#34;32e3c271-84a9-4be5-9342-2148c878227a&#34;,&#xA;   &#34;location&#34;:&#34;s3a://iceberg/logging.db/events&#34;,&#xA;   &#34;last-updated-ms&#34;:1622865686323,&#xA;   &#34;last-column-id&#34;:5,&#xA;   &#34;schema&#34;:{&#xA;      &#34;type&#34;:&#34;struct&#34;,&#xA;      &#34;fields&#34;:[&#xA;         {&#xA;            &#34;id&#34;:1,&#xA;            &#34;name&#34;:&#34;level&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;string&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:2,&#xA;            &#34;name&#34;:&#34;eventtime&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;timestamp&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:3,&#xA;            &#34;name&#34;:&#34;message&#34;,&#xA;            
&#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;string&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:4,&#xA;            &#34;name&#34;:&#34;callstack&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:{&#xA;               &#34;type&#34;:&#34;list&#34;,&#xA;               &#34;element-id&#34;:5,&#xA;               &#34;element&#34;:&#34;string&#34;,&#xA;               &#34;element-required&#34;:false&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;partition-spec&#34;:[&#xA;      {&#xA;         &#34;name&#34;:&#34;eventtimeday&#34;,&#xA;         &#34;transform&#34;:&#34;day&#34;,&#xA;         &#34;source-id&#34;:2,&#xA;         &#34;field-id&#34;:1000&#xA;      }&#xA;   ],&#xA;   &#34;default-spec-id&#34;:0,&#xA;   &#34;partition-specs&#34;:[&#xA;      {&#xA;         &#34;spec-id&#34;:0,&#xA;         &#34;fields&#34;:[&#xA;            {&#xA;               &#34;name&#34;:&#34;eventtime_day&#34;,&#xA;               &#34;transform&#34;:&#34;day&#34;,&#xA;               &#34;source-id&#34;:2,&#xA;               &#34;field-id&#34;:1000&#xA;            }&#xA;         ]&#xA;      }&#xA;   ],&#xA;   &#34;default-sort-order-id&#34;:0,&#xA;   &#34;sort-orders&#34;:[&#xA;      {&#xA;         &#34;order-id&#34;:0,&#xA;         &#34;fields&#34;:[&#xA;            &#xA;         ]&#xA;      }&#xA;   ],&#xA;   &#34;properties&#34;:{&#xA;      &#34;write.format.default&#34;:&#34;ORC&#34;&#xA;   },&#xA;   &#34;current-snapshot-id&#34;:4564366177504223943,&#xA;   &#34;snapshots&#34;:[&#xA;      {&#xA;         &#34;snapshot-id&#34;:6967685587675910019,&#xA;         &#34;timestamp-ms&#34;:1622865672882,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;0&#34;,&#xA;            &#34;total-records&#34;:&#34;0&#34;,&#xA;            &#34;total-data-files&#34;:&#34;0&#34;,&#xA;            
&#34;total-delete-files&#34;:&#34;0&#34;,&#xA;            &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#34;&#xA;      },&#xA;      {&#xA;         &#34;snapshot-id&#34;:2720489016575682283,&#xA;         &#34;parent-snapshot-id&#34;:6967685587675910019,&#xA;         &#34;timestamp-ms&#34;:1622865680419,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;added-data-files&#34;:&#34;2&#34;,&#xA;            &#34;added-records&#34;:&#34;3&#34;,&#xA;            &#34;added-files-size&#34;:&#34;1954&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;2&#34;,&#xA;            &#34;total-records&#34;:&#34;3&#34;,&#xA;            &#34;total-data-files&#34;:&#34;2&#34;,&#xA;            &#34;total-delete-files&#34;:&#34;0&#34;,&#xA;            &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#34;&#xA;      },&#xA;      {&#xA;         &#34;snapshot-id&#34;:4564366177504223943,&#xA;         &#34;parent-snapshot-id&#34;:2720489016575682283,&#xA;         &#34;timestamp-ms&#34;:1622865686278,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;added-data-files&#34;:&#34;1&#34;,&#xA;            &#34;added-records&#34;:&#34;1&#34;,&#xA;            &#34;added-files-size&#34;:&#34;746&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;1&#34;,&#xA;            &#34;total-records&#34;:&#34;4&#34;,&#xA;            &#34;total-data-files&#34;:&#34;3&#34;,&#xA;            &#34;total-delete-files&#34;:&#34;0&#34;,&#xA;   
         &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#34;&#xA;      }&#xA;   ],&#xA;   &#34;snapshot-log&#34;:[&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865672882,&#xA;         &#34;snapshot-id&#34;:6967685587675910019&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865680419,&#xA;         &#34;snapshot-id&#34;:2720489016575682283&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865686278,&#xA;         &#34;snapshot-id&#34;:4564366177504223943&#xA;      }&#xA;   ],&#xA;   &#34;metadata-log&#34;:[&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865672894,&#xA;         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#34;&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865680524,&#xA;         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#34;&#xA;      }&#xA;   ]&#xA;}&#xA;&#xA;As you can see, these JSON files can quickly grow as you perform different updates on your table. This file contains a pointer to all of the snapshots and manifest list files, much like the output you found from looking at the snapshots in the table. A really important piece to note is the schema is stored here. This is what Trino uses for validation on inserts and reads. As you may expect, there is the root location of the table itself, as well as a unique table identifier. The final part I’d like to note about this file is the partition-spec and partition-specs fields. The partition-spec field holds the current partition spec, while the partition-specs is an array that can hold a list of all partition specs that have existed for this table. 
As pointed out earlier, you can have many different manifest files that use different partition specs. That wraps up all of the metadata file types you can expect to see in Iceberg!&#xA;&#xA;This post wraps up the Trino on ice series. Hopefully these blog posts serve as a helpful initial dialogue about what is expected to grow as a vital portion of an open data lakehouse stack. What are you waiting for? Come join the fun and help us implement some of the missing features or instead go ahead and try Trino on Ice(berg) yourself!&#xA;&#xA;#trino #iceberg]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>So far, this series has covered some very interesting user-level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting some of the files that result from various operations carried out using Trino. To dissect you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool, and Iceberg’s core library. Dissecting these files is useful not only to understand how Iceberg works, but also to aid in troubleshooting, should you run into issues while ingesting into or querying your Iceberg tables. I like to think of this type of debugging as a fun game of Operation, where you’re looking to see what causes the red errors to fly by on your screen.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p><img src="https://trino.io/assets/blog/trino-on-ice/operation.gif" alt=""/></p>

<h2 id="understanding-iceberg-metadata">Understanding Iceberg metadata</h2>

<p>Iceberg can use any compatible metastore, but the Trino Iceberg connector only supports the Hive metastore and AWS Glue, just like the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that currently run on data lakes already use the Hive connector, and therefore the Hive metastore. This makes it the natural leading use case to support, as existing users can easily migrate from Hive to Iceberg tables. Since the diagram of the Hive connector architecture gives no indication of which connector is actually executing, it serves as a diagram for both Hive and Iceberg. The only difference is the connector used, but if you create a table in Hive, you can view the same table in Iceberg.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metadata.png" alt=""/></p>

<p>To recap the steps taken in the first three blog posts: the first post created an events table, and the first two posts together ran two insert statements. The first insert contained three records, while the second insert contained a single record.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-snapshot-files.png" alt=""/></p>

<p>Up until this point, the state of the files in MinIO haven’t really been shown except some of the manifest list pointers from the snapshot in the third blog post. Using the <a href="https://docs.min.io/minio/baremetal/reference/minio-cli/minio-mc.html">MinIO client tool</a>, you can list files that Iceberg generated through all these operations and then try to understand what purpose they are serving.</p>

<pre><code>% mc tree -f local/
local/
└─ iceberg
   └─ logging.db
      └─ events
         ├─ data
         │  ├─ event_time_day=2021-04-01
         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc
         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc
         │  └─ event_time_day=2021-04-02
         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc
         └─ metadata
            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json
            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json
            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro
            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro
            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro
            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro
            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro
</code></pre>

<p>There are a lot of files here, but here are a couple of patterns that you can observe with these files.</p>

<p>First, the top two directories are named <code>data</code> and <code>metadata</code>.</p>

<pre><code>/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/data//&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/
</code></pre>

<p>As you might expect, <code>data</code> contains the actual ORC files split by partition. This is akin to what you would see in a Hive table <code>data</code> directory. What is really of interest here is the <code>metadata</code> directory. There are specifically three patterns of files you’ll find here.</p>

<pre><code>/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/&lt;file-id&gt;.avro

/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/snap-&lt;snapshot-id&gt;-&lt;version&gt;-&lt;file-id&gt;.avro

/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/&lt;version&gt;-&lt;commit-UUID&gt;.metadata.json
</code></pre>
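<p>As a rough illustration, these three name patterns can be told apart mechanically. The following Python sketch is my own approximation of the naming scheme shown above (the regular expressions are not taken from the Iceberg library) and classifies a metadata file name into one of the three kinds:</p>

```python
import re

# Approximate patterns for the three metadata file kinds shown above:
# plain manifests, snap-* manifest lists, and *.metadata.json files.
PATTERNS = {
    "manifest": re.compile(r"^[0-9a-f\-]+-m\d+\.avro$"),
    "manifest_list": re.compile(r"^snap-\d+-\d+-[0-9a-f\-]+\.avro$"),
    "table_metadata": re.compile(r"^\d+-[0-9a-f\-]+\.metadata\.json$"),
}

def classify(file_name: str) -> str:
    """Return which metadata file kind a name looks like."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(file_name):
            return kind
    return "unknown"

print(classify("92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro"))  # manifest
print(classify("snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro"))  # manifest_list
print(classify("00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json"))  # table_metadata
```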

<p>Iceberg has a persistent tree structure that manages various snapshots of the data that are created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metastore-files.png" alt=""/></p>

<p>This tree structure contains two types of Avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files and the manifest files themselves point to various data files. This post starts out by covering these manifest files, and later covers the table metadata files that are suffixed by <code>.metadata.json</code>.</p>
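<p>One way to picture the persistent tree is as a toy data model. The classes below are illustrative only, not Iceberg’s actual types: a snapshot references the entries of its manifest list, each entry references a manifest file, and each manifest file references data files, so finding a snapshot’s data is a two-level traversal:</p>

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataFile:
    file_path: str
    record_count: int

@dataclass
class ManifestFile:
    manifest_path: str
    data_files: List[DataFile] = field(default_factory=list)

@dataclass
class Snapshot:
    snapshot_id: int
    # Contents of the snapshot's manifest list.
    manifest_list: List[ManifestFile] = field(default_factory=list)

def data_files_for(snapshot: Snapshot) -> List[str]:
    # Walk manifest list -> manifest files -> data files.
    return [df.file_path for m in snapshot.manifest_list for df in m.data_files]
```

A real reader would also prune manifests using partition summaries and column statistics before descending, rather than walking every branch.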

<p><a href="https://bitsondata.dev/trino-on-ice-iii-iceberg-concurrency-snapshots-spec">The last blog covered</a> the command in Trino that shows the snapshot information that is stored in the metastore. Here is that command and its output again for your review.</p>

<pre><code>SELECT manifest_list 
FROM iceberg.logging.&#34;events$snapshots&#34;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>manifest_list</th></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro</td></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro</td></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro</td></tr>
</table>

<p>You’ll notice that the query returns the manifest list Avro files, the ones prefixed with
<code>snap-</code>. These files are directly correlated with the snapshot records stored in the metastore. According to the diagram above, snapshots are records in the metastore that contain the URL of the manifest list Avro file. Avro files are binary files, not something you can just open up in a text editor to read. Using the <a href="https://downloads.apache.org/avro/avro-1.10.2/java/avro-tools-1.10.2.jar">avro-tools.jar tool</a> distributed by the <a href="https://avro.apache.org/docs/current/index.html">Apache Avro project</a>, you can inspect the contents of these files to get a better understanding of how they are used by Iceberg.</p>

<p>The first snapshot is generated on the creation of the events table. Upon inspecting this file, you notice that it is empty: the only output is a newline, which the <code>jq</code> JSON command-line utility strips when pretty-printing. This snapshot represents the empty state of the table upon creation. To investigate the snapshots you first need to download the files to your local filesystem. Let&#39;s move them to the home directory:</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .
</code></pre>

<p>Result: (empty)</p>

<pre><code>
</code></pre>

<p>The second snapshot is a little more interesting and actually shows us the contents of a manifest list.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,
   &#34;manifest_length&#34;:6114,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;added_data_files_count&#34;:{
      &#34;int&#34;:2
   },
   &#34;existing_data_files_count&#34;:{
      &#34;int&#34;:0
   },
   &#34;deleted_data_files_count&#34;:{
      &#34;int&#34;:0
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:3
   },
   &#34;existing_rows_count&#34;:{
      &#34;long&#34;:0
   },
   &#34;deleted_rows_count&#34;:{
      &#34;long&#34;:0
   }
}
</code></pre>

<p>To understand each of the values in each of these rows, you can refer to the Iceberg
<a href="https://iceberg.apache.org/spec/#manifest-lists">specification in the manifest list file section</a>. Instead of covering these exhaustively, let&#39;s focus on a few key fields. Below are the fields and their definitions according to the specification.</p>
<ul><li><code>manifest_path</code> – Location of the manifest file.</li>
<li><code>partition_spec_id</code> – ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.</li>
<li><code>added_snapshot_id</code> – ID of the snapshot where the manifest file was added.</li>
<li><code>partitions</code> – A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.</li>
<li><code>added_rows_count</code> – Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.</li></ul>

<p>As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tell any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse this tree, you walk over the different manifest paths to find all the manifest files associated with the particular snapshot you are interested in. The partition spec id tells you which partition specification was used to write the manifest; the specs themselves are stored in the table metadata in the metastore, so this id references where to find the spec. The added snapshot id tells you which snapshot is associated with the manifest list. The partitions field holds some high-level partition bound information to make for faster querying: if a query is looking for a particular value, only the manifest files whose bounds contain that value are traversed. Finally, you get a few metrics such as the number of changed rows and data files, one of which is the count of added rows. The first operation inserted three rows and the second operation inserted one row. Using the row counts you can easily determine which manifest file belongs to which operation.</p>
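<p>Those opaque-looking partition bounds are just serialized column values. The Iceberg spec stores an <code>int</code> bound as 4 little-endian bytes, and a <code>day</code>-transformed partition value counts days since the Unix epoch. A small Python sketch (my own helper, not part of any Iceberg library) decodes the bounds shown in the output above:</p>

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def decode_day_bound(raw: bytes) -> date:
    # Iceberg serializes int bounds as 4-byte little-endian values;
    # a `day` partition value is a count of days since 1970-01-01.
    (days,) = struct.unpack("<i", raw)
    return EPOCH + timedelta(days=days)

# The bounds from the manifest list above: "\u001eI\u0000\u0000" and "\u001fI\u0000\u0000".
print(decode_day_bound(b"\x1e\x49\x00\x00"))  # 2021-04-01
print(decode_day_bound(b"\x1f\x49\x00\x00"))  # 2021-04-02
```

These decode to the same days as the <code>event_time_day=2021-04-01</code> and <code>event_time_day=2021-04-02</code> partition directories in the file listing earlier.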

<p>The following command shows the final snapshot after both operations executed and filters out only the fields pointed out above.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &#39;. | {manifest_path: .manifest_path, partition_spec_id: .partition_spec_id, added_snapshot_id: .added_snapshot_id, partitions: .partitions, added_rows_count: .added_rows_count }&#39;
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#34;,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:4564366177504223700
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:1
   }
}
{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:3
   }
}
</code></pre>

<p>In the listing of the manifest list for the last snapshot, you notice that the first operation, where three rows were inserted, is tracked by the manifest file in the second JSON object. You can determine this from the snapshot id as well as the number of rows added in the operation. The first JSON object contains the last operation, which inserted a single row. So manifests are listed in reverse commit order, with the most recent operation first.</p>
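<p>The <code>lower_bound</code> and <code>upper_bound</code> bytes in the partition summaries above are the partition values in Iceberg&#39;s binary single-value serialization; for a <code>date</code> (stored as days from the Unix epoch) that is a 4-byte little-endian integer. A quick Python sketch to decode them:</p>

```python
import datetime
import struct

def decode_day_bound(raw: bytes) -> datetime.date:
    """Decode a 4-byte little-endian days-from-epoch partition bound."""
    days = struct.unpack("<i", raw)[0]
    return datetime.date(1970, 1, 1) + datetime.timedelta(days=days)

# "\u001eI\u0000\u0000" from the output above is the byte sequence 1e 49 00 00
print(decode_day_bound(b"\x1eI\x00\x00"))  # 2021-04-01 (day 18718)
print(decode_day_bound(b"\x1fI\x00\x00"))  # 2021-04-02 (day 18719)
```

<p>These match the <code>event_time_day</code> partition values you see in the manifest files themselves (18718 and 18719), confirming that the rows landed in the 2021-04-01 and 2021-04-02 partitions.</p>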

<p>The next command performs the same listing you ran against the manifest list, except this time against the manifest files themselves to expose their contents. To begin, run the command to show the contents of the manifest file associated with the insertion of three rows.</p>

<pre><code>% java -jar  ~/avro-tools-1.10.0.jar tojson ~/Desktop/avro_files/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;status&#34;:1,
   &#34;snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;data_file&#34;:{
      &#34;file_path&#34;:&#34;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#34;,
      &#34;file_format&#34;:&#34;ORC&#34;,
      &#34;partition&#34;:{
         &#34;event_time_day&#34;:{
            &#34;int&#34;:18718
         }
      },
      &#34;record_count&#34;:1,
      &#34;file_size_in_bytes&#34;:870,
      &#34;block_size_in_bytes&#34;:67108864,
      &#34;column_sizes&#34;:null,
      &#34;value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:1
            }
         ]
      },
      &#34;null_value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:0
            }
         ]
      },
      &#34;nan_value_counts&#34;:null,
      &#34;lower_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Oh noes&#34;
            }
         ]
      },
      &#34;upper_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Oh noes&#34;
            }
         ]
      },
      &#34;key_metadata&#34;:null,
      &#34;split_offsets&#34;:null
   }
}
{
   &#34;status&#34;:1,
   &#34;snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;data_file&#34;:{
      &#34;file_path&#34;:&#34;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#34;,
      &#34;file_format&#34;:&#34;ORC&#34;,
      &#34;partition&#34;:{
         &#34;event_time_day&#34;:{
            &#34;int&#34;:18719
         }
      },
      &#34;record_count&#34;:2,
      &#34;file_size_in_bytes&#34;:1084,
      &#34;block_size_in_bytes&#34;:67108864,
      &#34;column_sizes&#34;:null,
      &#34;value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:2
            }
         ]
      },
      &#34;null_value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:0
            }
         ]
      },
      &#34;nan_value_counts&#34;:null,
      &#34;lower_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Double oh noes&#34;
            }
         ]
      },
      &#34;upper_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;WARN&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Maybeh oh noes?&#34;
            }
         ]
      },
      &#34;key_metadata&#34;:null,
      &#34;split_offsets&#34;:null
   }
}
</code></pre>

<p>Now this is a very big output, but in summary, there’s really not too much to these files. As before, there is a <a href="https://iceberg.apache.org/spec/#manifests">Manifest section in the Iceberg spec</a> that details what each of these fields means. Here are the important fields:</p>
<ul><li><code>snapshot_id</code> – ID of the snapshot in which the file was added (status 1) or deleted (status 2); inherited from the manifest list when null.</li>
<li><code>data_file</code> – Field containing metadata about the data files pertaining to the manifest file, such as file path, partition tuple, metrics, etc…</li>
<li><code>data_file.file_path</code> – Full URI for the file with FS scheme.</li>
<li><code>data_file.partition</code> – Partition data tuple, schema based on the partition spec.</li>
<li><code>data_file.record_count</code> – Number of records in the data file.</li>
<li><code>data_file.*_count</code> – Multiple fields containing maps from column id to the number of values, nulls, or NaN values in the file. These can be used to quickly filter out files that cannot match a query.</li>
<li><code>data_file.*_bounds</code> – Multiple fields containing maps from column id to the lower or upper bound of the column, serialized as binary. The lower bound must be less than or equal to, and the upper bound greater than or equal to, all non-null, non-NaN values in the column for the file.</li></ul>

<p>Each data file struct contains the partition tuple and the data file it maps to. These files are only scanned and returned if the criteria for the query are met when checking all of the count, bounds, and other statistics recorded in the manifest. Ideally, only files that contain data relevant to the query should be scanned at all. Having information like the record count may also help in the query planning process to determine splits and other information. This particular optimization hasn’t been completed yet, as planning typically happens before traversal of the files. It is still under discussion and <a href="https://youtu.be/ifXpOn0NJWk?t=2132">is discussed a bit by Iceberg creator Ryan Blue in a recent meetup</a>. If this is something you are interested in, stay posted on the Slack channel and releases as the Trino Iceberg connector progresses in this area.</p>
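<p>As a rough illustration of that pruning step, here is a hypothetical Python sketch (not connector code) of the check a reader could perform against the per-column bounds before opening a file:</p>

```python
from typing import Optional

def may_contain(lower: Optional[str], upper: Optional[str], value: str) -> bool:
    """Conservatively decide whether a data file could hold rows where a
    column equals `value`, given the file's bounds from its manifest entry.
    Missing bounds mean we cannot prune and must scan the file."""
    if lower is None or upper is None:
        return True
    return lower <= value <= upper

# Using the `level` bounds from the manifest output above: the file bounded
# by [ERROR, ERROR] can be skipped for level = 'WARN', while the file
# bounded by [ERROR, WARN] still has to be scanned.
print(may_contain("ERROR", "ERROR", "WARN"))  # False
print(may_contain("ERROR", "WARN", "WARN"))   # True
```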

<p>As mentioned above, the last set of files you find in the metadata directory are those suffixed with <code>.metadata.json</code>. These files are a bit unusual in that they are stored in JSON rather than Avro format, because they are not part of the persistent tree structure. These files are essentially a copy of the table metadata that is stored in the metastore. You can find the fields for the table metadata listed <a href="https://iceberg.apache.org/spec/#table-metadata-fields">in the Iceberg specification</a>. This metadata is typically stored persistently in a metastore, much like the Hive metastore, but the metastore could easily be replaced by any datastore that supports the <a href="https://iceberg.apache.org/spec/#metastore-tables">atomic swap (check-and-put) operation</a> required for Iceberg&#39;s optimistic concurrency model.</p>

<p>The naming of the table metadata includes a table version and UUID:
<code>&lt;table-version&gt;-&lt;UUID&gt;.metadata.json</code>. To commit a new metadata version, which increments the current version number by one, the writer performs these steps:</p>
<ol><li>It creates a new table metadata file using the current metadata.</li>
<li>It writes the new table metadata to a file following the naming with the next version number.</li>

<li><p>It requests the metastore swap the table’s metadata pointer from the old location to the new location.</p>
<ol><li>If the swap succeeds, the commit succeeded. The new file is now the
current metadata.</li>
<li>If the swap fails, another writer has already committed a new version. The
current writer goes back to step 1.</li></ol></li></ol>
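<p>The commit steps above can be sketched as a retry loop around a single compare-and-swap. This is a simplified Python sketch under the assumption that the metastore exposes exactly one atomic operation on the metadata pointer; the class and function names are made up for illustration:</p>

```python
import threading

class Metastore:
    """Minimal stand-in for the metastore: a single metadata pointer with an
    atomic compare-and-swap, which is all Iceberg requires of it."""
    def __init__(self, location: str):
        self._lock = threading.Lock()
        self.metadata_location = location

    def swap(self, expected: str, new: str) -> bool:
        with self._lock:
            if self.metadata_location != expected:
                return False  # another writer committed first
            self.metadata_location = new
            return True

def commit(store: Metastore, write_new_metadata) -> str:
    """Optimistic commit: build new metadata from the current version, then
    attempt the atomic pointer swap; on failure, rebuild and retry."""
    while True:
        current = store.metadata_location           # read current metadata (step 1)
        new_location = write_new_metadata(current)  # write next version (step 2)
        if store.swap(current, new_location):       # request the swap (step 3)
            return new_location                     # swap succeeded (3.1)
        # swap failed: another writer won the race, go back to step 1 (3.2)
```

<p>Because writers do all their work up front and only coordinate at the swap, a losing writer wastes some effort but never blocks a winning one.</p>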

<p>If you want to see where this is stored in the Hive metastore, you can reference the <code>TABLE_PARAMS</code> table. At the time of writing, this is the only method of using the metastore that is supported by the Trino Iceberg connector.</p>

<pre><code>SELECT PARAM_KEY, PARAM_VALUE FROM metastore.TABLE_PARAMS;
</code></pre>

<p>Result:</p>

<table>
<tr><th>PARAM_KEY</th><th>PARAM_VALUE</th></tr>
<tr><td>EXTERNAL</td><td>TRUE</td></tr>
<tr><td>metadata_location</td><td>s3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json</td></tr>
<tr><td>numFiles</td><td>2</td></tr>
<tr><td>previous_metadata_location</td><td>s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json</td></tr>
<tr><td>table_type</td><td>iceberg</td></tr>
<tr><td>totalSize</td><td>5323</td></tr>
<tr><td>transient_lastDdlTime</td><td>1622865672</td></tr>
</table>

<p>As you can see, the metastore reports that the current metadata location is the
<code>00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json</code> file. Now you can dive in to see the table metadata being used by the Iceberg connector.</p>

<pre><code>% cat ~/Desktop/avro_files/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;format-version&#34;:1,
   &#34;table-uuid&#34;:&#34;32e3c271-84a9-4be5-9342-2148c878227a&#34;,
   &#34;location&#34;:&#34;s3a://iceberg/logging.db/events&#34;,
   &#34;last-updated-ms&#34;:1622865686323,
   &#34;last-column-id&#34;:5,
   &#34;schema&#34;:{
      &#34;type&#34;:&#34;struct&#34;,
      &#34;fields&#34;:[
         {
            &#34;id&#34;:1,
            &#34;name&#34;:&#34;level&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;string&#34;
         },
         {
            &#34;id&#34;:2,
            &#34;name&#34;:&#34;event_time&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;timestamp&#34;
         },
         {
            &#34;id&#34;:3,
            &#34;name&#34;:&#34;message&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;string&#34;
         },
         {
            &#34;id&#34;:4,
            &#34;name&#34;:&#34;call_stack&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:{
               &#34;type&#34;:&#34;list&#34;,
               &#34;element-id&#34;:5,
               &#34;element&#34;:&#34;string&#34;,
               &#34;element-required&#34;:false
            }
         }
      ]
   },
   &#34;partition-spec&#34;:[
      {
         &#34;name&#34;:&#34;event_time_day&#34;,
         &#34;transform&#34;:&#34;day&#34;,
         &#34;source-id&#34;:2,
         &#34;field-id&#34;:1000
      }
   ],
   &#34;default-spec-id&#34;:0,
   &#34;partition-specs&#34;:[
      {
         &#34;spec-id&#34;:0,
         &#34;fields&#34;:[
            {
               &#34;name&#34;:&#34;event_time_day&#34;,
               &#34;transform&#34;:&#34;day&#34;,
               &#34;source-id&#34;:2,
               &#34;field-id&#34;:1000
            }
         ]
      }
   ],
   &#34;default-sort-order-id&#34;:0,
   &#34;sort-orders&#34;:[
      {
         &#34;order-id&#34;:0,
         &#34;fields&#34;:[
            
         ]
      }
   ],
   &#34;properties&#34;:{
      &#34;write.format.default&#34;:&#34;ORC&#34;
   },
   &#34;current-snapshot-id&#34;:4564366177504223943,
   &#34;snapshots&#34;:[
      {
         &#34;snapshot-id&#34;:6967685587675910019,
         &#34;timestamp-ms&#34;:1622865672882,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;changed-partition-count&#34;:&#34;0&#34;,
            &#34;total-records&#34;:&#34;0&#34;,
            &#34;total-data-files&#34;:&#34;0&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#34;
      },
      {
         &#34;snapshot-id&#34;:2720489016575682283,
         &#34;parent-snapshot-id&#34;:6967685587675910019,
         &#34;timestamp-ms&#34;:1622865680419,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;added-data-files&#34;:&#34;2&#34;,
            &#34;added-records&#34;:&#34;3&#34;,
            &#34;added-files-size&#34;:&#34;1954&#34;,
            &#34;changed-partition-count&#34;:&#34;2&#34;,
            &#34;total-records&#34;:&#34;3&#34;,
            &#34;total-data-files&#34;:&#34;2&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#34;
      },
      {
         &#34;snapshot-id&#34;:4564366177504223943,
         &#34;parent-snapshot-id&#34;:2720489016575682283,
         &#34;timestamp-ms&#34;:1622865686278,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;added-data-files&#34;:&#34;1&#34;,
            &#34;added-records&#34;:&#34;1&#34;,
            &#34;added-files-size&#34;:&#34;746&#34;,
            &#34;changed-partition-count&#34;:&#34;1&#34;,
            &#34;total-records&#34;:&#34;4&#34;,
            &#34;total-data-files&#34;:&#34;3&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#34;
      }
   ],
   &#34;snapshot-log&#34;:[
      {
         &#34;timestamp-ms&#34;:1622865672882,
         &#34;snapshot-id&#34;:6967685587675910019
      },
      {
         &#34;timestamp-ms&#34;:1622865680419,
         &#34;snapshot-id&#34;:2720489016575682283
      },
      {
         &#34;timestamp-ms&#34;:1622865686278,
         &#34;snapshot-id&#34;:4564366177504223943
      }
   ],
   &#34;metadata-log&#34;:[
      {
         &#34;timestamp-ms&#34;:1622865672894,
         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#34;
      },
      {
         &#34;timestamp-ms&#34;:1622865680524,
         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#34;
      }
   ]
}
</code></pre>

<p>As you can see, these JSON files can quickly grow as you perform different updates on your table. This file contains a pointer to all of the snapshots and manifest list files, much like the output you found from looking at the snapshots in the table. An important piece to note is that the schema is stored here; this is what Trino uses for validation on inserts and reads. As you may expect, there is also the root location of the table itself, as well as a unique table identifier. The final parts I’d like to note about this file are the <code>partition-spec</code> and <code>partition-specs</code> fields. The <code>partition-spec</code> field holds the current partition spec, while <code>partition-specs</code> is an array of all partition specs that have existed for this table. As pointed out earlier, you can have many different manifest files that use different partition specs. That wraps up all of the metadata file types you can expect to see in Iceberg!</p>
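<p>To see how simple the read path becomes once you have this file, here is a short Python sketch (a hypothetical helper, not connector code) that resolves the current snapshot to the root of the persistent tree, its manifest list:</p>

```python
import json

def current_manifest_list(metadata_json: str) -> str:
    """Return the manifest-list path of the current snapshot from a
    <table-version>-<UUID>.metadata.json document."""
    metadata = json.loads(metadata_json)
    current_id = metadata["current-snapshot-id"]
    snapshot = next(s for s in metadata["snapshots"]
                    if s["snapshot-id"] == current_id)
    return snapshot["manifest-list"]
```

<p>Running this against the file above resolves <code>current-snapshot-id</code> 4564366177504223943 to the <code>snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro</code> manifest list, which is exactly where the traversal in this post started.</p>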

<p>This post wraps up the Trino on ice series. Hopefully these blog posts serve as a helpful initial dialogue about what is expected to grow as a vital portion of an open data lakehouse stack. What are you waiting for? Come join the fun and help us implement some of the missing features or instead go ahead and try <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">Trino on Ice(berg)</a> yourself!</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>
]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-iv-deep-dive</guid>
      <pubDate>Thu, 12 Aug 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</title>
      <link>https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This blog closes up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the Iceberg Specification.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;Concurrency Model&#xA; Some issues with the Hive model are the distinct locations where the metadata is stored and where the data files are stored. Having your data and metadata split up like this is a recipe for disaster when trying to apply updates to both services atomically.&#xA;&#xA; Iceberg metadata diagram of runtime, and file storage&#xA; A very common problem with Hive is that if a writing process failed during insertion, many times you would find the data written to file storage, but the metastore writes failed to occur. Or conversely, the metastore writes were successful, but the data failed to finish writing to file storage due to a  network or file IO failure. There’s a good  Trino Community Broadcast episode that talks about a function in Trino that exists to resolve these issues by syncing the metastore and file storage. 
You can watch  a simulation of this error on that episode.&#xA;&#xA; Aside from having issues due to the split state in the system, there are many  other issues that stem from the file system itself. In the case of HDFS,  depending on the specific filesystem implementation you are using, you may have different atomicity guarantees for various file systems and their operations, such as creating, deleting, and renaming files and directories. HDFS isn’t the only troublemaker here. Other than Amazon S3’s  recent announcement of strong consistency in their S3 service, most object storage systems only offer eventual consistency that may not show the latest files immediately after writes. Despite storage systems showing more progress towards offering better performance and guarantees, these systems still offer no reliable locking mechanism.&#xA;&#xA; Iceberg addresses all of these issues in a multitude of ways. One of the primary ways Iceberg introduces transactional guarantees is by storing the metadata in the same datastore as the data itself. This simplifies handling commit failures down to rolling back on one system rather than trying to coordinate a rollback across two systems like in Hive. Writers independently write their metadata and attempt to perform their operations, needing no coordination with other writers. The only time the writers coordinate is when they attempt to perform a commit of their operations. In order to do a commit, they perform a lock of the current snapshot record in a database. This concurrency model where writers eagerly do the work upfront is called optimistic concurrency control.&#xA; Currently, in Trino, this method still uses the Hive metastore to perform the lock-and-swap operation necessary to coordinate the final commits. Iceberg  creator, Ryan Blue, covers this lock-and-swap mechanism and how the metastore can be replaced with alternate locking methods. 
In the event that two writers attempt to commit at the same time, the writer that first acquires the lock successfully commits by swapping its snapshot as the current snapshot, while the second writer will retry to apply its changes again. The second writer should have no problem with this, assuming there are no conflicting changes between the two snapshots.&#xA;&#xA; &#xA;&#xA; This works similarly to a git workflow where the main branch is the locked resource, and two developers try to commit their changes at the same time. The first developer’s changes may conflict with the second developer’s changes. The second developer is then forced to rebase or merge the first developer’s code with their changes before committing to the main branch again. The same logic applies to merging data files. Currently, Iceberg clients use a copy-on-write mechanism that makes a new file out of the merged data in the next snapshot. This enables accurate time traveling and preserves previous split versions of the files. At the time of writing, upserts via MERGE INTO syntax are not supported in Trino, but  this is in active development. UPDATE: Since the original writing of this post, the  MERGE syntax exists as of version 393.&#xA;&#xA; One of the great benefits of tracking each individual change that gets written to Iceberg is that you are given a view of the data at every point in time. This enables a really cool feature that I mentioned earlier called time travel.&#xA;&#xA; ## Snapshots and Time Travel&#xA;&#xA; To showcase snapshots, it’s best to go over a few examples drawing from the event table we  created in the previous blog posts. This time we’ll only be working with the Iceberg table, as this capability is not available in Hive. Snapshots allow you to have an immutable set of your data at a given time. They are automatically created on every append or removal of data. 
One thing to note is that for now, they do not store the state of your metadata.&#xA; Say that you have created your events table and inserted the three initial rows as we did previously. Let’s look at the data we get back and see how to check the existing snapshots in Trino:&#xA;&#xA; &#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;| ERROR | Double oh noes |&#xA;| WARN | Maybeh oh noes? |&#xA;| ERROR | Oh noes |&#xA;&#xA;To query the snapshots, all you need is to use the $ operator appended to the&#xA;end of the table name, and add the hidden table, snapshots:&#xA;&#xA;SELECT snapshotid, parentid, operation&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| snapshotid | parentid | operation |&#xA;| --- | --- | --- |&#xA;| 7620328658793169607 | | append |&#xA;| 2115743741823353537 | 7620328658793169607 | append |&#xA;&#xA;Let’s take a look at the manifest list files that are associated with each &#xA;snapshot ID. 
You can tell which file belongs to which snapshot based on the &#xA;snapshot ID embedded in the filename:&#xA;&#xA;SELECT manifestlist&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| manifestlist |&#xA;| --- |&#xA;| s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro | &#xA;| s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro | &#xA;&#xA;Now, let’s insert another row to the table:&#xA;&#xA;INSERT INTO iceberg.logging.events&#xA;VALUES&#xA;(&#xA;‘INFO’,&#xA;timestamp ‘2021-04-02 00:00:11.1122222’,&#xA;‘It is all good’,&#xA;ARRAY [‘Just updating you!’]&#xA;);&#xA;&#xA;Let’s check the snapshot table again:&#xA;&#xA;SELECT snapshotid, parentid, operation&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| snapshotid | parentid | operation |&#xA;| --- | --- | --- |&#xA;| 7620328658793169607 | | append |&#xA;| 2115743741823353537 | 7620328658793169607 | append |&#xA;| 7030511368881343137 | 2115743741823353537 | append |&#xA;&#xA;Let’s also verify that our row was added:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Oh noes |&#xA;|INFO |It is all good |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;&#xA; Since Iceberg is already tracking the list of files added and removed at each snapshot, it would make sense that you can travel back and forth between these different views into the system, right? This concept is called time traveling. You need to specify which snapshot you would like to read from and you will see the view of the data at that timestamp. 
In Trino, you need to use the @ operator, followed by the snapshot you wish to read from:&#xA; &#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.“events@2115743741823353537”;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;|ERROR|Oh noes |&#xA;&#xA; If you determine there is some issue with your data, you can always roll back to the previous state permanently as well. In Trino we have a function called rollbacktosnapshot to move the table state to another snapshot:&#xA; &#xA;CALL system.rollbacktosnapshot(‘logging’, ‘events’, 2115743741823353537);&#xA;&#xA;Now that we have rolled back, observe what happens when we query the events&#xA;table with:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;|ERROR|Oh noes |&#xA; &#xA; Notice the INFO row is still missing even though we query the table without specifying a snapshot id. Now just because we rolled back, doesn’t mean we’ve lost the snapshot we just rolled back from. In fact, we can roll forward, or as I like to call it,  back to the future! 
In Trino, you use the same function call but with a predecessor of the existing snapshot:&#xA; &#xA;CALL system.rollbacktosnapshot(‘logging’, ‘events’, 7030511368881343137)&#xA;&#xA;And now we should be able to query the table again and see the INFO row &#xA;return:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Oh noes |&#xA;|INFO |It is all good |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA; &#xA; As expected, the INFO row returns when you roll back to the future.&#xA; &#xA; Having snapshots not only provides you with a level of immutability that is key to the eventual consistency model, but gives you a rich set of features to version and move between different versions of your data like a git repository.&#xA; &#xA; ## Iceberg Specification&#xA; &#xA; Perhaps saving the best for last, the benefit of using Iceberg is the community that surrounds it, and the support you receive. It can be daunting to have to choose a project that replaces something so core to your architecture. While Hive has so many drawbacks, one of the things keeping many companies locked in is the fear of the unknown. How do you know which table format to choose? Are there unknown data corruption issues that I’m about to take on? What if this doesn’t scale like it promises on the label? It is worth noting that  alternative table formats are also emerging in this space  and we encourage you to investigate these for your own use cases. When sitting down with Iceberg creator, Ryan Blue,  comparing Iceberg to other table formats,  he claims the community’s greatest strength is their ability to look forward. They intentionally broke compatibility with Hive to enable them to provide a richer level of features. Unlike Hive, the Iceberg project explained their thinking in a spec.&#xA;&#xA; The strongest argument I can see for Iceberg is that it has a specification. 
This is something that has largely been missing from Hive and shows a real maturity in how the Iceberg community has approached the issue. On the Trino project, we think standards are important. We adhere to many of them ourselves, such as the ANSI SQL syntax, and exposing the client through a JDBC connection. By creating a standard around this, you’re no longer tied to any particular technology, not even Iceberg itself. You are adhering to a standard that will hopefully become the de facto standard over a decade or two, much like Hive did. Having the standard in clear writing invites multiple communities to the table and brings even more use  cases. Doing so improves the standards and therefore the technologies that implement them.&#xA; &#xA; The previous three blog posts of this series covered the features and massive benefits from using this novel table format. The following post will dive deeper and discuss more about how Iceberg achieves some of this functionality, with an overview into some of the internals and metadata layouts. In the meantime, feel free to try  Trino on Ice(berg).&#xA;&#xA;#trino #iceberg&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This post wraps up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the <a href="https://iceberg.apache.org/spec/">Iceberg Specification</a>.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<h2 id="concurrency-model">Concurrency Model</h2>

<p>One issue with the Hive model is that the metadata and the data files are stored in two distinct systems. Having your data and metadata split up like this is a recipe for disaster when trying to apply updates to both services atomically.</p>

<p> <img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metadata.png" alt="Iceberg metadata diagram of runtime, and file storage"/>
 A very common problem with Hive is that when a writing process fails during insertion, you often find the data written to file storage while the metastore writes never occurred. Or conversely, the metastore writes succeeded, but the data failed to finish writing to file storage due to a network or file IO failure. There’s a good <a href="https://trino.io/episodes/5.html">Trino Community Broadcast episode</a> covering a Trino function that resolves these issues by syncing the metastore and file storage, and you can watch <a href="https://www.youtube.com/watch?v=OXyJFZSsX5w&amp;t=2097s">a simulation of this error</a> on that episode.</p>
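<p>To make that failure mode concrete, here is a toy Python sketch (purely illustrative, not Hive code) of why a write that spans two systems has a window where a crash leaves them out of sync:</p>

```python
# Illustrative sketch: a write spanning two systems has a window where a
# crash leaves them inconsistent.
file_storage = set()   # stands in for HDFS/S3 data file paths
metastore = set()      # stands in for Hive metastore partition entries

def hive_style_insert(partition, crash_after_files=False):
    file_storage.add(partition)        # step 1: write the data files
    if crash_after_files:
        raise RuntimeError("writer died before metastore update")
    metastore.add(partition)           # step 2: register the partition

try:
    hive_style_insert("day=2021-04-02", crash_after_files=True)
except RuntimeError:
    pass

# Orphaned data: files exist that the metastore knows nothing about.
orphans = file_storage - metastore
print(orphans)  # {'day=2021-04-02'}
```

The reverse ordering has the mirror-image failure: the metastore advertises a partition whose files never finished writing.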

<p> Aside from having issues due to the split state in the system, there are many  other issues that stem from the file system itself. In the case of HDFS,  depending on the specific filesystem implementation you are using, you may have <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem">different atomicity guarantees for various file systems and their operations</a>, such as creating, deleting, and renaming files and directories. HDFS isn’t the only troublemaker here. Other than Amazon S3’s  <a href="https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3-now-delivers-strong-read-after-write-consistency-automatically-for-all-applications/">recent announcement of strong consistency in their S3 service,</a> most object storage systems only offer <em>eventual</em> consistency that may not show the latest files immediately after writes. Despite storage systems showing more progress towards offering better performance and guarantees, these systems still offer no reliable locking mechanism.</p>

<p> Iceberg addresses all of these issues in a multitude of ways. One of the primary ways Iceberg introduces transactional guarantees is by storing the metadata in the same datastore as the data itself. This simplifies handling commit failures down to rolling back on one system rather than trying to coordinate a rollback across two systems like in Hive. Writers independently write their metadata and attempt to perform their operations, needing no coordination with other writers. The only time the writers coordinate is when they attempt to perform a commit of their operations. In order to do a commit, they perform a lock of the current snapshot record in a database. This concurrency model where writers eagerly do the work upfront is called <strong><em>optimistic concurrency control</em></strong>.
 Currently, in Trino, this method still uses the Hive metastore to perform the lock-and-swap operation necessary to coordinate the final commits. Iceberg creator, <a href="https://www.linkedin.com/in/rdblue/">Ryan Blue</a>, <a href="https://youtu.be/-iIY2sOFBRc?t=1351">covers this lock-and-swap mechanism</a> and how the metastore can be replaced with alternate locking methods. In the event that <a href="https://iceberg.apache.org/reliability/#concurrent-write-operations">two writers attempt to commit at the same time</a>, the writer that first acquires the lock successfully commits by swapping its snapshot in as the current snapshot, while the second writer retries applying its changes on top of the new current snapshot. The second writer should have no problem with this, assuming there are no conflicting changes between the two snapshots.</p>
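<p>As a rough mental model, the commit protocol boils down to a compare-and-swap on the table’s current-snapshot pointer plus a retry loop. The sketch below is illustrative Python with made-up names, not Iceberg’s actual API:</p>

```python
# Minimal sketch of optimistic concurrency control: writers do their work
# up front, then commit by atomically swapping the current-snapshot
# pointer, retrying when another writer got there first.
current_snapshot = {"id": 0, "files": []}

def try_swap(expected_id, new_snapshot):
    """Stand-in for the atomic lock-and-swap the metastore provides."""
    global current_snapshot
    if current_snapshot["id"] != expected_id:
        return False                      # another writer committed first
    current_snapshot = new_snapshot
    return True

def commit(new_files):
    while True:                           # retry loop
        base = current_snapshot
        candidate = {"id": base["id"] + 1,
                     "files": base["files"] + new_files}
        if try_swap(base["id"], candidate):
            return candidate["id"]
        # else: re-read the new current snapshot and reapply our changes

# Writer 2 reads its base, but writer 1 commits first:
stale_base = current_snapshot
commit(["a.parquet"])                      # writer 1 wins the swap
losing = {"id": stale_base["id"] + 1,
          "files": stale_base["files"] + ["b.parquet"]}
print(try_swap(stale_base["id"], losing))  # False: writer 2 must retry
commit(["b.parquet"])                      # retry succeeds on the new base
print(current_snapshot["files"])           # ['a.parquet', 'b.parquet']
```

The key property: writers never block each other while doing work, and only the tiny swap step needs coordination.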

<p> <img src="https://trino.io/assets/blog/trino-on-ice/iceberg-files.png" alt=""/></p>

<p> This works similarly to a git workflow where the main branch is the locked resource, and two developers try to commit their changes at the same time. The first developer’s changes may conflict with the second developer’s changes. The second developer is then forced to rebase or merge the first developer’s code with their changes before committing to the main branch again. The same logic applies to merging data files. Currently, Iceberg clients use a <a href="https://iceberg.apache.org/reliability/#concurrent-write-operations">copy-on-write mechanism</a> that makes a new file out of the merged data in the next snapshot. This enables accurate time traveling and preserves previous split versions of the files. At the time of writing, upserts via <code>MERGE INTO</code> syntax are not supported in Trino, but <a href="https://github.com/trinodb/trino/issues/7708">this is in active development</a>. <strong><em>UPDATE:</em></strong> Since the original writing of this post, the <a href="https://github.com/trinodb/trino/pull/7933"><code>MERGE</code> syntax exists as of version 393</a>.</p>

<p> One of the great benefits of tracking each individual change that gets written to Iceberg is that you are given a view of the data at every point in time. This enables a really cool feature that I mentioned earlier called <strong><em>time travel</em></strong>.</p>

<h2 id="snapshots-and-time-travel">Snapshots and Time Travel</h2>

<p> To showcase snapshots, it’s best to go over a few examples drawing from the events table we created in the previous blog posts. This time we’ll only be working with the Iceberg table, as this capability is not available in Hive. Snapshots give you an immutable view of your data at a given point in time. They are automatically created on every append or removal of data. One thing to note is that, for now, they do not store the state of your metadata. Say that you have created your events table and inserted the three initial rows as we did previously. Let’s look at the data we get back and see how to check the existing snapshots in Trino:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p>To query the snapshots, append the <code>$</code> character and the hidden table name, <code>snapshots</code>, to the end of the table name, quoting the combined name:</p>

<pre><code>SELECT snapshot_id, parent_id, operation
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>snapshot_id</th>
<th>parent_id</th>
<th>operation</th>
</tr>
</thead>

<tbody>
<tr>
<td>7620328658793169607</td>
<td></td>
<td>append</td>
</tr>

<tr>
<td>2115743741823353537</td>
<td>7620328658793169607</td>
<td>append</td>
</tr>
</tbody>
</table>

<p>Let’s take a look at the manifest list files that are associated with each
snapshot ID. You can tell which file belongs to which snapshot based on the
snapshot ID embedded in the filename:</p>

<pre><code>SELECT manifest_list
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>manifest_list</th>
</tr>
</thead>

<tbody>
<tr>
<td>s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro</td>
</tr>

<tr>
<td>s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro</td>
</tr>
</tbody>
</table>

<p>Now, let’s insert another row to the table:</p>

<pre><code>INSERT INTO iceberg.logging.events
VALUES
(
  'INFO',
  timestamp '2021-04-02 00:00:11.1122222',
  'It is all good',
  ARRAY ['Just updating you!']
);
</code></pre>

<p>Let’s check the snapshot table again:</p>

<pre><code>SELECT snapshot_id, parent_id, operation
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>snapshot_id</th>
<th>parent_id</th>
<th>operation</th>
</tr>
</thead>

<tbody>
<tr>
<td>7620328658793169607</td>
<td></td>
<td>append</td>
</tr>

<tr>
<td>2115743741823353537</td>
<td>7620328658793169607</td>
<td>append</td>
</tr>

<tr>
<td>7030511368881343137</td>
<td>2115743741823353537</td>
<td>append</td>
</tr>
</tbody>
</table>

<p>Let’s also verify that our row was added:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>

<tr>
<td>INFO</td>
<td>It is all good</td>
</tr>

<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>
</tbody>
</table>

<p> Since Iceberg is already tracking the list of files added and removed at each snapshot, it would make sense that you can travel back and forth between these different views into the system, right? This concept is called time travel. In Trino, you append the <code>@</code> character and the ID of the snapshot you wish to read from to the table name, and you see the view of the data as of that snapshot:</p>

<pre><code>SELECT level, message
FROM iceberg.logging."events@2115743741823353537";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p> If you determine there is some issue with your data, you can also roll back to a previous state permanently. In Trino, the <code>rollback_to_snapshot</code> procedure moves the table state to another snapshot:</p>

<pre><code>CALL system.rollback_to_snapshot(‘logging’, ‘events’, 2115743741823353537);
</code></pre>

<p>Now that we have rolled back, observe what happens when we query the events
table with:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p> Notice the <code>INFO</code> row is still missing even though we query the table without specifying a snapshot ID. Just because we rolled back doesn’t mean we’ve lost the snapshot we rolled back from. In fact, we can roll forward, or as I like to call it, go <a href="https://en.wikipedia.org/wiki/Back_to_the_Future">back to the future</a>! In Trino, you use the same procedure call, but with a snapshot that is a descendant of the current one:</p>

<pre><code>CALL system.rollback_to_snapshot(‘logging’, ‘events’, 7030511368881343137)
</code></pre>

<p>And now we should be able to query the table again and see the <code>INFO</code> row
return:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>

<tr>
<td>INFO</td>
<td>It is all good</td>
</tr>

<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>
</tbody>
</table>

<p> As expected, the <code>INFO</code> row returns when you roll back to the future.</p>

<p> Snapshots not only provide a level of immutability that is key to working with eventually consistent storage, but also give you a rich set of features to version your data and move between those versions, much like a git repository.</p>
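<p>The bookkeeping behind all of this can be pictured as a log of immutable snapshots plus a movable “current” pointer. Here is a toy Python model (not Iceberg internals) using the snapshot IDs from the examples above; the per-snapshot row sets are invented for illustration:</p>

```python
# Toy model of the snapshot log: snapshots are immutable, and "current"
# is just a pointer, so rolling back never deletes anything and rolling
# forward still works.
snapshots = {
    7620328658793169607: {"parent": None,
                          "rows": ["Oh noes"]},
    2115743741823353537: {"parent": 7620328658793169607,
                          "rows": ["Oh noes", "Double oh noes",
                                   "Maybeh oh noes?"]},
    7030511368881343137: {"parent": 2115743741823353537,
                          "rows": ["Oh noes", "Double oh noes",
                                   "Maybeh oh noes?", "It is all good"]},
}
current = 7030511368881343137

def read(snapshot_id=None):
    """Time travel: read any snapshot; default to the current pointer."""
    return snapshots[snapshot_id if snapshot_id else current]["rows"]

current = 2115743741823353537           # rollback_to_snapshot(...)
print("It is all good" in read())       # False: the INFO row is hidden
current = 7030511368881343137           # roll forward: back to the future
print("It is all good" in read())       # True: nothing was lost
```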

<h2 id="iceberg-specification">Iceberg Specification</h2>

<p> Perhaps saving the best for last, one of the biggest benefits of using Iceberg is the community that surrounds it and the support you receive. It can be daunting to choose a project that replaces something so core to your architecture. While Hive has many drawbacks, one of the things keeping companies locked in is the fear of the unknown. How do you know which table format to choose? Are there unknown data corruption issues you’re about to take on? What if it doesn’t scale like it promises on the label? It is worth noting that <a href="https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/">alternative table formats are also emerging in this space</a>, and we encourage you to investigate these for your own use cases. When sitting down with Iceberg creator, Ryan Blue, <a href="https://www.twitch.tv/videos/989098630">comparing Iceberg to other table formats</a>, he claimed the community’s greatest strength is its ability to look forward. They intentionally broke compatibility with Hive to enable a richer set of features. Unlike Hive, the Iceberg project explained their thinking in a spec.</p>

<p> The strongest argument I can see for Iceberg is that it has a <a href="https://iceberg.apache.org/spec/">specification</a>. This is something that has largely been missing from Hive and shows real maturity in how the Iceberg community has approached the issue. On the Trino project, we think standards are important. We adhere to many of them ourselves, such as ANSI SQL syntax and exposing clients through JDBC connections. By creating a standard, you’re no longer tied to any particular technology, not even Iceberg itself. You are adhering to a standard that will hopefully become the de facto standard over a decade or two, much like Hive did. Having the standard in clear writing invites multiple communities to the table and brings even more use cases. Doing so improves the standards, and therefore the technologies that implement them.</p>

<p> The previous three blog posts of this series covered the features and massive benefits of using this novel table format. The following post dives deeper into how Iceberg achieves some of this functionality, with an overview of the internals and metadata layouts. In the meantime, feel free to try <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">Trino on Ice(berg)</a>.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec</guid>
      <pubDate>Fri, 30 Jul 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</title>
      <link>https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;The first post covered how Iceberg is a table format and not a file format It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or go back to the first post before starting this one. This post discusses evolution. No, the post isn’t covering Darwinian nor Pokémon evolution, but in-place table evolution! &#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;You may find it a little odd that I am getting excited over tables evolving &#xA;in-place, but as mentioned in the last post, if you have experience performing table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander evolved into Charmeleon discovering that Iceberg supports Partition evolution and schema evolution. That is, until Charmeleon started treating Ash like a jerk after the evolution from Charmander. Hopefully, you won’t face the same issue when your tables evolve. &#xA;&#xA;Another important aspect that is covered, is how Iceberg is developed with cloud storage in mind. Hive and other data lake technologies were developed with file systems as their primary storage layer. 
This is still a very common layer today, but as more companies move to include object storage, table formats did not adapt to the needs of object stores. Let’s dive in!&#xA;&#xA;Partition Specification evolution&#xA;&#xA;In Iceberg, you are able to update the partition specification, shortened to partition spec in Iceberg, on a live table. You do not need to perform a table migration as you do in Hive. In Hive, partition specs don’t explicitly exist because they are tightly coupled with the creation of the Hive table. Meaning, if you ever need to change the granularity of your data partitions at any point, you need to create an entirely new table, and move all the data to the new partition granularity you desire. No pressure on choosing the right granularity or anything!&#xA;&#xA;In Iceberg, you’re not required to choose the perfect partition specification upfront, and you can have multiple partition specs in the same table, and query across the different sized partition specs. How great is that! This means, if you’re initially partitioning your data by month, and later you decide to move to a daily partitioning spec due to a growing ingest from all your new customers, you can do so with no migration, and query over the table with no issue. &#xA;&#xA;This is conveyed pretty succinctly in this graphic from the Iceberg &#xA;documentation. At the end of the year 2008, partitioning occurs at a monthly granularity and after 2009, it moves to a daily granularity. When the query to pull data from December 14th, 2008 and January 13th, 2009, the entire month of December gets scanned due to the monthly partition, but for the dates in January, only the first 13 days are scanned to answer the query.&#xA;&#xA;At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist. There are efforts to add this support in the near future. 
Edit: this has since been merged!&#xA;&#xA;Schema evolution&#xA;&#xA;Iceberg also handles schema evolution much more elegantly than Hive. In Hive, adding columns worked well enough, as data inserted before the schema change just reports null for that column. For formats that use column names, like ORC and Parquet, deletes are also straightforward for Hive, as it simply ignores fields that are no longer part of the table. For unstructured files like CSV that use the position of the column, deletes would still cause issues, as deleting one column shifts the rest of the columns. Renames for schemas pose an issue for all formats in Hive as data written prior to the rename is not modified to the new field. This effectively works the same as if you deleted the old field and added a new column with the new name. This lack of support for schema. evolution across various file types in Hive requires a lot of memorizing&#xA;the formats underneath various tables. This is very susceptible to causing user errors if someone executes one of the unsupported operations on the wrong table.&#xA;&#xA;table&#xA;thead&#xA;  tr&#xA;    th colspan=&#34;4&#34;Hive 2.2.0 schema evolution based on file type and operation./th&#xA;  /tr&#xA;/thead&#xA;tbody&#xA;  tr&#xA;    td/td&#xA;    tdAdd/td&#xA;    tdDelete/td&#xA;    tdRename/td&#xA;  /tr&#xA;  tr&#xA;    tdCSV/TSV/td&#xA;    td✅/td&#xA;    td❌/td&#xA;    td❌/td&#xA;  /tr&#xA;  tr&#xA;    tdJSON/td&#xA;    td✅/td&#xA;    td✅/td&#xA;    td❌/td&#xA;  /tr&#xA;  tr&#xA;    tdORC/Parquet/Avro/td&#xA;    td✅/td&#xA;    td✅/td&#xA;    td❌/td&#xA;  /tr&#xA;/tbody&#xA;/table&#xA;&#xA;Currently in Iceberg, schemaless position-based data formats such as CSV and TSVare not supported, though there are some discussions on adding limited support for them. This would be good from a reading standpoint, to load data from the CSV, into an Iceberg format with all the guarantees that Iceberg offers. 
&#xA;&#xA;While JSON doesn’t rely on positional data, it does have an explicit dependency on names. This means, that if I remove a text column from a JSON table named severity, then later I want to add a new int column called severity, I encounter an error when I try to read in the data with the string type from before when I try to deserialize the JSON files. Even worse would be if the new severity column you add has the same type as the original but a semantically different meaning. This results in old rows containing values that are unknowingly from a different domain, which can lead to wrong analytics. After all, someone who adds the new severity column might not even be aware of the old severity column, if it was quite some time ago when it was dropped.&#xA;&#xA;ORC, Parquet, and Avro do not suffer from these issues as they are columnar formats that keep a schema internal to the file itself, and each format tracks changes to the columns through IDs rather than name values or position. Iceberg uses these unique column IDs to also keep track of the columns as changes are applied.&#xA;&#xA;In general, Iceberg can only allow this small set of file formats due to the correctness guarantees it provides. In Trino, you can add, delete, or rename columns using the ALTER TABLE command. Here’s an example that continues from the table created  in the last post  that inserted three rows. 
The DDL statement looked like this.&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP(6), &#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;);&#xA;&#xA;Here is an ALTER TABLE sequence that adds a new column named severity, inserts data including into the new column, renames the column, and prints the data.&#xA;&#xA;ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; &#xA;&#xA;INSERT INTO iceberg.logging.events VALUES &#xA;(&#xA;  &#39;INFO&#39;, &#xA;  timestamp &#xA;  &#39;2021-04-01 19:59:59.999999&#39; AT TIME ZONE &#39;America/LosAngeles&#39;, &#xA;  &#39;es muy bueno&#39;, &#xA;  ARRAY [&#39;It is all normal&#39;], &#xA;  1&#xA;);&#xA;&#xA;ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;&#xA;&#xA;SELECT level, message, priority&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level |  message | priority |&#xA;| --- | --- | --- |&#xA;| ERROR | Double oh noes | NULL |&#xA;| WARN | Maybeh oh noes? | NULL |&#xA;| ERROR | Oh noes | NULL |&#xA;| INFO | es muy bueno | 1 |&#xA;&#xA;ALTER TABLE iceberg.logging.events &#xA;DROP COLUMN priority;&#xA;&#xA;SHOW CREATE TABLE iceberg.logging.events;&#xA;&#xA;Result&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;   level varchar,&#xA;   eventtime timestamp(6),&#xA;   message varchar,&#xA;   callstack array(varchar)&#xA;)&#xA;WITH (&#xA;   format = &#39;ORC&#39;,&#xA;   partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;)&#xA;&#xA;Notice how the priority and severity columns are both not present in the schema. As noted in the table above, Hive renames cause issues for all file formats. 
Yet in Iceberg, performing all these operations causes no issues with the table and underlying data.&#xA;&#xA;Cloud storage compatibility&#xA;&#xA;Not all developers consider or are aware of the performance implications of using Hive over a cloud object storage solution like S3 or Azure Blob storage. One thing to remember is that Hive was developed with the Hadoop Distributed File System (HDFS) in mind. HDFS is a filesystem and is particularly well suited to handle listing files on the filesystem, because they were stored in a contiguous manner. When Hive stores data associated with a table, it assumes there is a contiguous layout underneath it and performs list operations that are expensive on cloud storage systems.&#xA;&#xA;The common cloud storage systems are typically object stores that do not lay out the files in a contiguous manner based on paths. Therefore, it becomes very expensive to list out all the files in a particular path. Yet, these list operations are executed for every partition that could be included in a query, regardless of only a single row, in a single file out of thousands of files needing to be retrieved to answer the query. Even ignoring the performance costs for a minute, object stores may also pose issues for Hive due to eventual  consistency. Inserting and deleting can cause inconsistent results for readers, if the files you end up reading are out of date. &#xA;&#xA;Iceberg avoids all of these issues by tracking the data at the file level, &#xA;rather than the partition level. By tracking the files, Iceberg only accesses the files containing data relevant to the query, as opposed to accessing files in the same partition looking for the few files that are relevant to the query. Further, this allows Iceberg to control for the inconsistency issue in cloud-based file systems by using a locking mechanism at the file level. See the file layout below that Hive layout versus the Iceberg layout. 
As you can see in the next image, Iceberg makes no assumptions about the data being contiguous or not. It simply builds a persistent tree using the snapshot (S) location stored in the metadata, that points to the manifest list (ML), which points to &#xA;manifests containing partitions (P). Finally, these manifest files contain the file (F) locations and stats that can quickly be used to prune data versus &#xA;needing to do a list operation and scanning all the files.&#xA;&#xA;Referencing the picture above, if you were to run a query where the result set only contains rows from file F1, Hive would require a list operation and scanning the files, F2 and F3. In Iceberg, file metadata exists in the manifest file, P1, that would have a range on the predicate field that prunes out files F2 and F3, and only scans file F1. This example only shows a couple of files, but imagine storage that scales up to thousands of files! Listing becomes expensive on files that are not contiguously stored in memory. Having this flexibility in the logical layout is essential to increase query performance. This is especially true on cloud object stores.&#xA;&#xA;If you want to play around with Iceberg using Trino, check out the &#xA;Trino Iceberg docs. To avoid issues like the eventual consistency issue, as well as other problems of trying to sync operations across systems, Iceberg provides optimistic concurrency support, which is covered in more detail in&#xA;the next post. &#xA;&#xA;#trino #iceberg&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p><a href="https://bitsondata.dev/trino-on-ice-i-a-gentle-introduction-to-iceberg">The first post</a> covered how Iceberg is a table format and not a file format. It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or go back to the first post before starting this one. This post discusses evolution. No, the post isn’t covering Darwinian or Pokémon evolution, but in-place table evolution!</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p><img src="https://trino.io/assets/blog/trino-on-ice/evolution.gif" alt=""/></p>

<p>You may find it a little odd that I am getting excited over tables evolving
in-place, but as mentioned in the last post, if you have experience performing table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander evolved into Charmeleon discovering that Iceberg supports Partition evolution and schema evolution. That is, until Charmeleon started treating Ash like a jerk after the evolution from Charmander. Hopefully, you won’t face the same issue when your tables evolve.</p>

<p>Another important aspect covered here is how Iceberg was developed with cloud storage in mind. Hive and other data lake technologies were developed with file systems as their primary storage layer. File systems are still very common today, but as more companies moved to object storage, table formats did not adapt to the needs of object stores. Let’s dive in!</p>

<h2 id="partition-specification-evolution">Partition Specification evolution</h2>

<p>In Iceberg, you are able to update the partition specification, shortened to partition spec in Iceberg, on a live table. You do not need to perform a table migration as you do in Hive. In Hive, partition specs don’t explicitly exist because they are tightly coupled with the creation of the Hive table. Meaning, if you ever need to change the granularity of your data partitions at any point, you need to create an entirely new table, and move all the data to the new partition granularity you desire. No pressure on choosing the right granularity or anything!</p>

<p>In Iceberg, you’re not required to choose the perfect partition spec upfront: you can have multiple partition specs in the same table, and query across the differently partitioned data. How great is that! This means that if you initially partition your data by month, and later decide to move to daily partitioning due to growing ingest from all your new customers, you can do so with no migration and query over the table with no issue.</p>

<p>This is conveyed pretty succinctly in this graphic from the Iceberg
documentation. Through the end of 2008, partitioning occurs at a monthly granularity, and from 2009 onward it moves to a daily granularity. When a query pulls data from December 14th, 2008 through January 13th, 2009, the entire month of December gets scanned due to the monthly partition, but for the dates in January, only the first 13 days are scanned to answer the query.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/partition-spec-evolution.png" alt=""/></p>

<p>At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist. <a href="https://github.com/trinodb/trino/issues/7580">There are efforts to add this support in the near future</a>. Edit: this has since been merged!</p>
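<p>With that support merged, here is a sketch of what partition spec evolution looks like through Trino’s Iceberg connector. This assumes a recent Trino release that supports changing the <code>partitioning</code> property, and the <code>booking_events</code> table name is hypothetical:</p>

<pre><code>-- Hypothetical table, initially partitioned at monthly granularity
CREATE TABLE iceberg.logging.booking_events (
  level VARCHAR,
  event_time TIMESTAMP(6),
  message VARCHAR
) WITH (
  format = &#39;ORC&#39;,
  partitioning = ARRAY[&#39;month(event_time)&#39;]
);

-- Evolve to daily partitions: existing files keep the old monthly spec,
-- new writes use the daily spec, and no data migration is needed
ALTER TABLE iceberg.logging.booking_events
SET PROPERTIES partitioning = ARRAY[&#39;day(event_time)&#39;];
</code></pre>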

<h2 id="schema-evolution">Schema evolution</h2>

<p>Iceberg also handles schema evolution much more elegantly than Hive. In Hive, adding columns works well enough, as data inserted before the schema change just reports null for that column. For formats that use column names, like ORC and Parquet, deletes are also straightforward for Hive, as it simply ignores fields that are no longer part of the table. For unstructured files like CSV that rely on column position, deletes still cause issues, as deleting one column shifts the rest of the columns. Renames pose an issue for all formats in Hive, as data written prior to the rename is not migrated to the new field name. This effectively works the same as if you deleted the old field and added a new column with the new name. This lack of support for schema evolution across the various file types in Hive requires a lot of memorizing which formats sit underneath which tables, and it is very susceptible to user error if someone executes one of the unsupported operations on the wrong table.</p>

<table>
<thead>
  <tr>
    <th colspan="4">Hive 2.2.0 schema evolution based on file type and operation.</th>
  </tr>
  <tr>
    <th></th>
    <th>Add</th>
    <th>Delete</th>
    <th>Rename</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>CSV/TSV</td>
    <td>✅</td>
    <td>❌</td>
    <td>❌</td>
  </tr>
  <tr>
    <td>JSON</td>
    <td>✅</td>
    <td>✅</td>
    <td>❌</td>
  </tr>
  <tr>
    <td>ORC/Parquet/Avro</td>
    <td>✅</td>
    <td>✅</td>
    <td>❌</td>
  </tr>
</tbody>
</table>

<p>Currently in Iceberg, schemaless position-based data formats such as CSV and TSV are not supported, though there are <a href="https://github.com/apache/iceberg/issues/118">some discussions on adding limited support for them</a>. This would be useful from a reading standpoint: loading data from CSV into an Iceberg table with all the guarantees that Iceberg offers.</p>

<p>While JSON doesn’t rely on positional data, it does have an explicit dependency on names. This means that if you remove a text column named <code>severity</code> from a JSON table, and later add a new int column called <code>severity</code>, you encounter an error when the reader tries to deserialize the old string values from the JSON files. Even worse would be if the new <code>severity</code> column has the same type as the original but a semantically different meaning. This results in old rows containing values that are unknowingly from a different domain, which can lead to wrong analytics. After all, someone who adds the new <code>severity</code> column might not even be aware of the old <code>severity</code> column if it was dropped quite some time ago.</p>
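<p>As a sketch of that failure mode, assuming a hypothetical Hive table backed by JSON files and a connector that permits both operations:</p>

<pre><code>-- Old rows still contain string values like &#34;severity&#34;: &#34;HIGH&#34;
-- in the JSON files; dropping the column does not rewrite them
ALTER TABLE hive.logging.json_events DROP COLUMN severity;

-- Re-adding the same name with a new type makes reads fail (or silently
-- mix domains), because JSON resolves columns by name, not by ID
ALTER TABLE hive.logging.json_events ADD COLUMN severity INTEGER;
</code></pre>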

<p>ORC, Parquet, and Avro do not suffer from these issues, as they are self-describing formats that keep a schema internal to the file itself, and each format tracks changes to the columns through IDs rather than names or positions. Iceberg uses these unique column IDs to keep track of the columns as changes are applied.</p>

<p>In general, Iceberg can only allow this small set of file formats due to the <a href="https://iceberg.apache.org/evolution/#correctness">correctness guarantees</a> it provides. In Trino, you can add, delete, or rename columns using the <code>ALTER TABLE</code> command. Here’s an example that continues from the table created in the last post, into which three rows were inserted. The DDL statement looked like this.</p>

<pre><code>CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6), 
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &#39;ORC&#39;,
  partitioning = ARRAY[&#39;day(event_time)&#39;]
);
</code></pre>

<p>Here is an <code>ALTER TABLE</code> sequence that adds a new column named <code>severity</code>, inserts data including into the new column, renames the column, and prints the data.</p>

<pre><code>ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; 

INSERT INTO iceberg.logging.events VALUES 
(
  &#39;INFO&#39;, 
  timestamp 
  &#39;2021-04-01 19:59:59.999999&#39; AT TIME ZONE &#39;America/Los_Angeles&#39;, 
  &#39;es muy bueno&#39;, 
  ARRAY [&#39;It is all normal&#39;], 
  1
);

ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;

SELECT level, message, priority
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
<th>priority</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
<td>NULL</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
<td>NULL</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
<td>NULL</td>
</tr>

<tr>
<td>INFO</td>
<td>es muy bueno</td>
<td>1</td>
</tr>
</tbody>
</table>

<pre><code>ALTER TABLE iceberg.logging.events 
DROP COLUMN priority;

SHOW CREATE TABLE iceberg.logging.events;
</code></pre>

<p>Result</p>

<pre><code>CREATE TABLE iceberg.logging.events (
   level varchar,
   event_time timestamp(6),
   message varchar,
   call_stack array(varchar)
)
WITH (
   format = &#39;ORC&#39;,
   partitioning = ARRAY[&#39;day(event_time)&#39;]
)
</code></pre>

<p>Notice how neither the priority nor the severity column is present in the schema. As noted in the table above, renames cause issues for all file formats in Hive. Yet in Iceberg, performing all these operations causes no issues with the table and underlying data.</p>
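<p>Because Iceberg assigns a fresh internal ID to every new column, you can even re-add a column under the old name with a different type without the old values resurfacing. A small sketch, assuming the table state from the example above:</p>

<pre><code>-- The new severity column maps to a new column ID, so the dropped
-- column&#39;s data is never read back into it
ALTER TABLE iceberg.logging.events ADD COLUMN severity VARCHAR;

-- Every existing row reports NULL for the re-added column
SELECT level, severity FROM iceberg.logging.events;
</code></pre>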

<h2 id="cloud-storage-compatibility">Cloud storage compatibility</h2>

<p>Not all developers consider, or are even aware of, the performance implications of using Hive over a cloud object storage solution like S3 or Azure Blob Storage. One thing to remember is that Hive was developed with the Hadoop Distributed File System (HDFS) in mind. HDFS is a filesystem, and it is particularly well suited to listing files because they are stored in a contiguous manner. When Hive stores data associated with a table, it assumes there is a contiguous layout underneath it and performs list operations, which are expensive on cloud storage systems.</p>

<p>The common cloud storage systems are typically object stores that do not lay out files contiguously based on paths, so it becomes very expensive to list all the files under a particular path. Yet these list operations are executed for every partition that could be included in a query, even if only a single row in a single file out of thousands needs to be retrieved to answer it. Even ignoring the performance costs for a minute, object stores may also pose issues for Hive due to eventual consistency. Inserting and deleting can cause inconsistent results for readers if the files you end up reading are out of date.</p>

<p>Iceberg avoids all of these issues by tracking the data at the file level,
rather than the partition level. By tracking the files, Iceberg only accesses the files containing data relevant to the query, as opposed to scanning every file in a partition looking for the few that are relevant. Further, this allows Iceberg to control for the inconsistency issue in cloud storage by using a locking mechanism at the file level. See the file layout below comparing the Hive layout with the Iceberg layout. As you can see in the next image, Iceberg makes no assumptions about whether or not the data is contiguous. It simply builds a persistent tree using the snapshot (S) location stored in the metadata, which points to the manifest list (ML), which points to
manifests containing partitions (P). Finally, these manifest files contain the file (F) locations and stats that can quickly be used to prune data, versus
needing to do a list operation and scan all the files.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/cloud-file-layout.png" alt=""/></p>

<p>Referencing the picture above, if you were to run a query whose result set only contains rows from file F1, Hive would require a list operation and a scan of files F2 and F3 as well. In Iceberg, file metadata exists in the manifest file P1, which holds a range on the predicate field that prunes out files F2 and F3, so only file F1 is scanned. This example only shows a couple of files, but imagine storage that scales up to thousands of files! Listing becomes expensive on files that are not stored contiguously. Having this flexibility in the logical layout is essential to increasing query performance, and this is especially true on cloud object stores.</p>
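<p>You can inspect this file-level bookkeeping yourself through the metadata tables exposed by Trino’s Iceberg connector. A sketch, assuming the <code>events</code> table from earlier (the <code>$files</code> table and these column names come from the Trino Iceberg connector documentation):</p>

<pre><code>-- Each row is a data file Iceberg tracks for the table, along with
-- per-file stats that Iceberg uses to prune files at query time
SELECT file_path, record_count, file_size_in_bytes
FROM iceberg.logging.&#34;events$files&#34;;
</code></pre>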

<p>If you want to play around with Iceberg using Trino, check out the
<a href="https://trino.io/docs/current/connector/iceberg.html">Trino Iceberg docs</a>. To avoid problems like eventual consistency, as well as other difficulties of syncing operations across systems, Iceberg provides optimistic concurrency support, which is covered in more detail in
<a href="https://bitsondata.dev/iceberg-concurrency-snapshots-spec">the next post</a>.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud</guid>
      <pubDate>Mon, 12 Jul 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice I: A gentle introduction To Iceberg</title>
      <link>https://bitsondata.dev/trino-iceberg-i-gentle-intro?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;Back in the Gentle introduction to the Hive connector blog post, I discussed a commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, the connector is named Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem - the invisible Hive specification.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;I call this specification invisible because it doesn’t exist. It lives in the Hive code and the minds of those who developed it. This is makes it very difficult for anybody else who has to integrate with any distributed object storage that uses Hive, since they had to rely on reverse engineering and keeping up with the changes. The way you interact with Hive changes based on which version of Hive or Hadoop you are running. It also varies if you are in the cloud or over an object store. Spark has even modified the Hive spec in some ways to fit the Hive model to their use cases. It’s a big mess that data engineers have put up with for years. Yet despite the confusion and lack of organization due to Hive’s number of unwritten assumptions, the Hive connector is the most popular connector in use for Trino. Virtually every big data query engine uses the Hive model today in some form. 
As a result it is used by numerous companies to store and access data in their data lakes.&#xA;&#xA;So how did something with no specification become so ubiquitous in data lakes? Hive was first in the large object storage and big data world as part of Hadoop. Hadoop became popular from good marketing for Hadoop to solve the problems of dealing with the increase in data with the Web 2.0 boom . Of course, Hive didn’t get everything wrong. In fact, without Hive, and the fact that it is open source, there may not have been a unified specification at all. Despite the many hours data engineers have spent bashing their heads against the wall with all the unintended consequences of Hive, it still served a very useful purpose.&#xA;&#xA;So why did I just rant about Hive for so long if I’m here to tell you about Apache Iceberg? It’s impossible for a teenager growing up today to truly appreciate music streaming services without knowing what it was like to have an iPod with limited storage, or listening to a scratched burnt CD that skips, or flipping your tape or record to side-B. The same way anyone born before the turn of the millennium really appreciates streaming services, so you too will appreciate Iceberg once you’ve learned the intricacies of managing a data lake built on Hive and Hadoop.&#xA;&#xA;If you haven’t used Hive before, this blog post outlines just a few pain points that come from this data warehousing software to give you proper context. If you have already lived through these headaches, this post acts as a guide to Iceberg from Hive. This post is the first in a series of blog posts discussing Apache Iceberg in great detail, through the lens of the Trino query engine user. If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that houses the founding Presto community after the founders of Presto left Facebook. 
This and the next couple of posts discuss the Iceberg specification and all the features Iceberg has to offer, many times in comparison with Hive.&#xA;&#xA;Before jumping into the comparisons, what is Iceberg exactly? The first thing to understand is that Iceberg is not a file format, but a table format. It may not be clear what this means by just stating that, but the function of a table format becomes clearer as the improvements Iceberg brings from the Hive table standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet, but is the layer between the query engine and the data. Iceberg maps and indexes the files in order to provide a higher level abstraction that handles the relational table format for data lakes. You will understand more about table formats through examples in this series.&#xA;&#xA;Hidden Partitions&#xA;&#xA;Hive Partitions&#xA;&#xA;Since most developers and users interact with the table format via the query language, a noticeable difference is the flexibility you have while creating a partitioned table. Assume you are trying to create a table for tracking events occurring in our system. You run both sets of SQL commands from Trino, just using the Hive and Iceberg connectors which are designated by the catalog name (i.e. the catalog name starting with hive. uses the Hive connector, while the iceberg. table uses the Iceberg connector). To begin with, the first DDL statement attempts to create an events table in the logging schema in the hive catalog, which is configured to use the Hive connector. 
Trino also creates a partition on the events table using the eventtime field which is a TIMESTAMP field.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtime&#39;]&#xA;);&#xA;&#xA;Running this in Trino using the Hive connector produces the following error message.&#xA;&#xA;Partition keys must be the last columns in the table and in the same order as the table properties: [eventtime]&#xA;&#xA;The Hive DDL is very dependent on ordering for columns and specifically partition columns. Partition fields must be located in the final column positions and in the order of partitioning in the DDL statement. The next statement attempts to create the same table, but now with the eventtime field moved to the last column position.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR),&#xA;  eventtime TIMESTAMP&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtime&#39;]&#xA;);&#xA;&#xA;This time, the DDL command works successfully, but you likely don’t want to partition your data on the plain timestamp. This results in a separate file for each distinct timestamp value in your table (likely almost a file for each event). In Hive, there’s no way to indicate the time granularity at which you want to partition natively. 
The method to support this scenario with Hive is to create a new VARCHAR column, eventtimeday that is dependent on the eventtime column to create the date partition value.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR),&#xA;  eventtimeday VARCHAR&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtimeday&#39;]&#xA;);&#xA;&#xA;This method wastes space by adding a new column to your table. Even worse, it puts the burden of knowledge on the user to include this new column for writing data. It is then necessary to use that separate column for any read access to take advantage of the performance gains from the partitioning.&#xA;&#xA;INSERT INTO hive.logging.events&#xA;VALUES&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-01 12:00:00.000001&#39;,&#xA;  &#39;Oh noes&#39;, &#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;], &#xA;  &#39;2021-04-01&#39;&#xA;),&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-02 15:55:55.555555&#39;,&#xA;  &#39;Double oh noes&#39;,&#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],&#xA;  &#39;2021-04-02&#39;&#xA;),&#xA;(&#xA;  &#39;WARN&#39;, &#xA;  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,&#xA;  &#39;Maybeh oh noes?&#39;,&#xA;  ARRAY [&#39;Bad things could be happening??&#39;], &#xA;  &#39;2021-04-02&#39;&#xA;);&#xA;&#xA;Notice that the last partition value &#39;2021-04-01&#39; has to match the TIMESTAMP date during insertion. 
There is no validation in Hive to make sure this is happening because it only requires a VARCHAR and knows to partition based on different values.&#xA;&#xA;On the other hand, If a user runs the following query:&#xA;&#xA;SELECT &#xA;FROM hive.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39;;&#xA;&#xA;they get the correct results back, but have to scan all the data in the table:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;This happens because the user forgot to include the eventtimeday &lt; &#39;2021-04-02&#39; predicate in the WHERE clause. This eliminates all the benefits that led us to create the partition in the first place and yet frequently this is missed by the users of these tables.&#xA;&#xA;SELECT &#xA;FROM hive.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39; &#xA;AND eventtimeday &lt; &#39;2021-04-02&#39;;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;Iceberg Partitions&#xA;&#xA;The following DDL statement illustrates how these issues are handled in Iceberg via the Trino Iceberg connector.&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP(6),&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;);&#xA;&#xA;Taking note of a few things. First, notice the partition on the eventtime column that is defined without having to move it to the last position. There is also no need to create a separate field to handle the daily partition on the eventtime field. 
The partition specification is maintained internally by Iceberg, and neither the user nor the reader of this table needs to know anything about the partition specification to take advantage of it. This concept is called hidden partitioning , where only the table creator/maintainer has to know the partitioning specification. Here is what the insert statements look like now:&#xA;&#xA;INSERT INTO iceberg.logging.events&#xA;VALUES&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-01 12:00:00.000001&#39;,&#xA;  &#39;Oh noes&#39;, &#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]&#xA;),&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-02 15:55:55.555555&#39;,&#xA;  &#39;Double oh noes&#39;,&#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]),&#xA;(&#xA;  &#39;WARN&#39;, &#xA;  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,&#xA;  &#39;Maybeh oh noes?&#39;,&#xA;  ARRAY [&#39;Bad things could be happening??&#39;]&#xA;);&#xA;&#xA;The VARCHAR dates are no longer needed. The eventtime field is internally converted to the proper partition value to partition each row. Also, notice that the same query that ran in Hive returns the same results. The big difference is that it doesn’t require any extra clause to indicate to filter partition as well as filter the results.&#xA;&#xA;SELECT *&#xA;FROM iceberg.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39;;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;So hopefully that gives you a glimpse into what a table format and specification are, and why Iceberg is such a wonderful improvement over the existing and outdated method of storing your data in your data lake. 
While this post covers a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…&#xA;&#xA;If you want to play around with Iceberg using Trino, check out the Trino Iceberg docs. The next post covers how table evolution works in Iceberg, as well as, how Iceberg is an improved storage format for cloud storage.&#xA;&#xA;#trino #iceberg&#xA;&#xA;bits&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>Back in the <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector">Gentle introduction to the Hive connector</a> blog post, I discussed the commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, it is named the Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem – the invisible Hive specification.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p>I call this specification invisible because it doesn’t exist. It lives in the Hive code and the minds of those who developed it. This makes it very difficult for anybody else who has to integrate with any distributed object storage that uses Hive, since they have to rely on reverse engineering and keeping up with the changes. The way you interact with Hive changes based on <a href="https://medium.com/hashmapinc/four-steps-for-migrating-from-hive-2-x-to-3-x-e85a8363a18">which version of Hive or Hadoop</a> you are running. It also varies depending on whether you are in the cloud or over an object store. Spark has even <a href="https://spark.apache.org/docs/2.4.4/sql-migration-guide-hive-compatibility.html">modified the Hive spec</a> in some ways to fit the Hive model to its use cases. It’s a big mess that data engineers have put up with for years. Yet despite the confusion and lack of organization due to Hive’s many unwritten assumptions, the Hive connector is the most popular connector in use for Trino. Virtually every big data query engine uses the Hive model today in some form. As a result it is used by numerous companies to store and access data in their data lakes.</p>

<p>So how did something with no specification become so ubiquitous in data lakes? Hive was first in the large object storage and big data world as part of Hadoop, and Hadoop became popular through good marketing that promised to solve the problems of dealing with the explosion of data during the Web 2.0 boom. Of course, Hive didn’t get everything wrong. In fact, without Hive, and the fact that it is open source, there may not have been a unified specification at all. Despite the many hours data engineers have spent bashing their heads against the wall with all the unintended consequences of Hive, it still served a very useful purpose.</p>

<p>So why did I just rant about Hive for so long if I’m here to tell you about <a href="https://iceberg.apache.org/">Apache Iceberg</a>? It’s impossible for a teenager growing up today to truly appreciate music streaming services without knowing what it was like to have an iPod with limited storage, or listening to a scratched burnt CD that skips, or flipping your tape or record to side-B. The same way anyone born before the turn of the millennium really appreciates streaming services, so you too will appreciate Iceberg once you’ve learned the intricacies of managing a data lake built on Hive and Hadoop.</p>

<p>If you haven’t used Hive before, this blog post outlines just a few pain points that come from this data warehousing software to give you proper context. If you have already lived through these headaches, this post acts as a guide to Iceberg from Hive. This post is the first in a series of blog posts discussing Apache Iceberg in great detail, through the lens of the Trino query engine user. If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that houses the founding Presto community after the <a href="https://trino.io/blog/2020/12/27/announcing-trino">founders of Presto left Facebook</a>. This and the next couple of posts discuss the Iceberg specification and all the features Iceberg has to offer, many times in comparison with Hive.</p>

<p>Before jumping into the comparisons, what is Iceberg exactly? The first thing to understand is that Iceberg is not a file format, but a table format. It may not be clear what this means by just stating that, but the function of a table format becomes clearer as the improvements Iceberg brings from the Hive table standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet, but is the layer between the query engine and the data. Iceberg maps and indexes the files in order to provide a higher level abstraction that handles the relational table format for data lakes. You will understand more about table formats through examples in this series.</p>

<h2 id="hidden-partitions">Hidden Partitions</h2>

<h3 id="hive-partitions">Hive Partitions</h3>

<p>Since most developers and users interact with the table format via the query language, a noticeable difference is the flexibility you have while creating a partitioned table. Assume you are trying to create a table for tracking events occurring in your system. You run both sets of SQL commands from Trino, just using the Hive and Iceberg connectors, which are designated by the catalog name (i.e. a catalog name starting with <code>hive.</code> uses the Hive connector, while an <code>iceberg.</code> table uses the Iceberg connector). To begin with, the first DDL statement attempts to create an <code>events</code> table in the <code>logging</code> schema in the <code>hive</code> catalog, which is configured to use the Hive connector. Trino also creates a partition on the <code>events</code> table using the <code>event_time</code> field, which is a <code>TIMESTAMP</code> field.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time&#39;]
);
</code></pre>

<p>Running this in Trino using the Hive connector produces the following error message.</p>

<pre><code>Partition keys must be the last columns in the table and in the same order as the table properties: [event_time]
</code></pre>

<p>The Hive DDL is very dependent on ordering for columns and specifically partition columns. Partition fields must be located in the final column positions and in the order of partitioning in the DDL statement. The next statement attempts to create the same table, but now with the <code>event_time</code> field moved to the last column position.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time TIMESTAMP
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time&#39;]
);
</code></pre>

<p>This time, the DDL command works successfully, but you likely don’t want to partition your data on the plain timestamp. This results in a separate file for each distinct timestamp value in your table (likely almost a file for each event). In Hive, there’s no way to indicate the time granularity at which you want to partition natively. The method to support this scenario with Hive is to create a new <code>VARCHAR</code> column, <code>event_time_day</code> that is dependent on the <code>event_time</code> column to create the date partition value.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time_day VARCHAR
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time_day&#39;]
);
</code></pre>

<p>This method wastes space by adding a redundant column to your table. Even worse, it puts the burden of knowledge on the user, who must populate this extra column on every write and filter on it on every read to gain the performance benefits of the partitioning.</p>

<pre><code>INSERT INTO hive.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Oh noes&#39;, 
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;], 
  &#39;2021-04-01&#39;
),
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-02 15:55:55.555555&#39;,
  &#39;Double oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],
  &#39;2021-04-02&#39;
),
(
  &#39;WARN&#39;, 
  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,
  &#39;Maybeh oh noes?&#39;,
  ARRAY [&#39;Bad things could be happening??&#39;], 
  &#39;2021-04-02&#39;
);
</code></pre>

<p>Notice that the last value, the partition value <code>&#39;2021-04-01&#39;</code>, has to match the date of the <code>TIMESTAMP</code> in the same row. Hive performs no validation to make sure it does; the column only requires a <code>VARCHAR</code>, and Hive partitions on whatever distinct values it receives.</p>
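<p>For example, Hive happily accepts a row whose partition value disagrees with its timestamp; the row simply lands in the wrong partition. The values below are hypothetical:</p>

<pre><code>INSERT INTO hive.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Mismatched partition&#39;,
  ARRAY [&#39;Some stack trace&#39;],
  &#39;2021-05-01&#39; -- wrong day, accepted without complaint
);
</code></pre>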

<p>On the other hand, if a user runs the following query:</p>

<pre><code>SELECT *
FROM hive.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>they get the correct results back, but have to scan all the data in the table:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>

<p>This happens because the user forgot to include the <code>event_time_day &lt; &#39;2021-04-02&#39;</code> predicate in the <code>WHERE</code> clause. This eliminates all the benefits that led us to create the partition in the first place, and yet users of these tables frequently miss it. To prune partitions, the query needs both predicates:</p>

<pre><code>SELECT *
FROM hive.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39; 
AND event_time_day &lt; &#39;2021-04-02&#39;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>

<h3 id="iceberg-partitions">Iceberg Partitions</h3>

<p>The following DDL statement illustrates how these issues are handled in Iceberg via the Trino Iceberg connector.</p>

<pre><code>CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6),
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  partitioning = ARRAY[&#39;day(event_time)&#39;]
);
</code></pre>

<p>There are a few things to note. First, the partition on the <code>event_time</code> column is defined without having to move the column to the last position. There is also no need for a separate field to hold the daily partition value derived from <code>event_time</code>. The <em><strong>partition specification</strong></em> is maintained internally by Iceberg, and neither the writer nor the reader of this table needs to know anything about it to take advantage of it. This concept is called <em><strong>hidden partitioning</strong></em>, where only the table creator/maintainer has to know the <em><strong>partitioning specification</strong></em>. Here is what the insert statement looks like now:</p>

<pre><code>INSERT INTO iceberg.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Oh noes&#39;, 
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]
),
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-02 15:55:55.555555&#39;,
  &#39;Double oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]
),
(
  &#39;WARN&#39;, 
  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,
  &#39;Maybeh oh noes?&#39;,
  ARRAY [&#39;Bad things could be happening??&#39;]
);
</code></pre>

<p>The <code>VARCHAR</code> dates are no longer needed. The <code>event_time</code> field is internally converted to the proper partition value for each row. Also, notice that the same query that ran in Hive returns the same results. The big difference is that it doesn’t require any extra predicate on a partition column; filtering on <code>event_time</code> alone both prunes partitions and filters the results.</p>

<pre><code>SELECT *
FROM iceberg.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>
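<p>As a side note, if you are curious what Iceberg is doing behind the scenes, the Trino Iceberg connector exposes metadata tables such as <code>$partitions</code>. A query like the following (output omitted, and details may vary by connector version) lists the hidden daily partitions and statistics about them:</p>

<pre><code>SELECT *
FROM iceberg.logging.&#34;events$partitions&#34;;
</code></pre>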

<p>So hopefully that gives you a glimpse into what a table format and specification are, and why Iceberg is such a wonderful improvement over the existing and outdated method of storing your data in your data lake. While this post covers a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/see_myself_out.gif" alt=""/></p>

<p>If you want to play around with Iceberg using Trino, check out the <a href="https://trino.io/docs/current/connector/iceberg.html">Trino Iceberg docs</a>. The <a href="https://bitsondata.dev/in-place-table-evolution-and-cloud-compatibility-with-iceberg">next post</a> covers how table evolution works in Iceberg, as well as how Iceberg is an improved storage format for cloud storage.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-i-gentle-intro</guid>
      <pubDate>Mon, 03 May 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>A gentle introduction to the Hive connector</title>
      <link>https://bitsondata.dev/a-gentle-introduction-to-the-hive-connector?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[TL;DR: The Hive connector is what you use in Trino for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.&#xA;&#xA;!--more--&#xA;&#xA;Originally Posted on https://trino.io/blog/2020/10/20/intro-to-hive-connector.html&#xA;&#xA;One of the most confusing aspects when starting Trino is the Hive connector. Typically, you seek out the use of Trino when you experience an intensely slow query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In fact, the genesis of Trino came about due to these slow Hive query conditions at Facebook back in 2012.&#xA;&#xA;So when you learn that Trino has a Hive connector, it can be rather confusing since you moved to Trino to circumvent the slowness of your current Hive cluster. Another common source of confusion is when you want to query your data from your cloud object storage, such as AWS S3, MinIO, and Google Cloud Storage. This too uses the Hive connector. If that confuses you, don’t worry, you are not alone. This blog aims to explain this commonly confusing nomenclature.&#xA;&#xA;Hive architecture&#xA;&#xA;To understand the origins and inner workings of Trino’s Hive connector, you first need to know a few high-level components of the Hive architecture.&#xA;&#xA;You can simplify the Hive architecture to four components:&#xA;&#xA;The runtime contains the logic of the query engine that translates the SQL -esque Hive Query Language(HQL) into MapReduce jobs that run over files stored in the filesystem.&#xA;&#xA;The storage component is simply that, it stores files in various formats and index structures to recall these files. The file formats can be anything as simple as JSON and CSV, to more complex files such as columnar formats like ORC and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed Filesystem (HDFS). 
As cloud-based options became more prevalent, object storage like Amazon S3, Azure Blob Storage, Google Cloud Storage, and others needed to be leveraged as well and replaced HDFS as the storage component.&#xA;&#xA;In order for Hive to process these files, it must have a mapping from SQL tables in the runtime to files and directories in the storage component. To accomplish this, Hive uses the Hive Metastore Service (HMS), often shortened to the metastore to manage the metadata about the files such as table columns, file locations, file formats, etc…&#xA;&#xA;The last component not included in the image is Hive’s data organization specification. The documentation of this element only exists in the code in Hive and has been reverse engineered to be used by other systems like Trino to remain compatible with other systems.&#xA;&#xA;Trino reuses all of these components except for the runtime. This is the same approach most compute engine takes when dealing with data in object stores, specifically, Trino, Spark, Drill, and Impala. When you think of the Hive connector, you should think about a connector that is capable of reading data organized by the unwritten Hive specification.&#xA;&#xA;Trino runtime replaces Hive runtime&#xA;&#xA;In the early days of big data systems, many expected query turnaround to take a long time due to the high volume of unstructured data in ETL workloads. The primary goal in early iterations of these systems was simply throughput over large volumes of data while maintaining fault-tolerance. Now, more businesses want to run fast interactive queries over their big data instead of running jobs that take hours and produce possibly undesirable results. Many companies have petabytes of data and metadata in their data warehouse. Data in storage is cumbersome to move and the data in the metastore takes a long time to repopulate in other formats. 
Since only the runtime that executed Hive queries needs replacement, the Trino engine utilizes the existing metastore metadata and files residing in storage, and the Trino runtime effectively replaces the Hive runtime responsible for analyzing the data.&#xA;&#xA;Trino Architecture&#xA;&#xA;The Hive connector nomenclature&#xA;&#xA;Notice, that the only change in the Trino architecture is the runtime. The HMS still exists along with the storage. This is not by accident. This design exists to address a common problem faced by many companies. It simplifies the migration from using Hive to using Trino. Regardless of the storage component used the runtime makes use of the HMS and that is the reason this connector is the Hive connector.&#xA;&#xA;Where the confusion tends to come from, is when you search for a connector from the context of the storage systems you want to query. You may not even be aware the metastore is a necessity or even exists. Typically, you look for an S3 connector, a GCS connector or a MinIO connector. All you need is the Hive connector and the HMS to manage the metadata of the objects in your storage.&#xA;&#xA;The Hive Metastore Service&#xA;&#xA;The HMS is the only Hive process used in the entire Trino ecosystem when using the Hive connector. The HMS is actually a simple service with a binary API using the Thrift protocol. This service makes updates to the metadata, stored in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There are other compatible replacements of the HMS such as AWS Glue, a drop-in substitution for the HMS.&#xA;&#xA;https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio&#xA;&#xA;Getting started with the Hive Connector on Trino&#xA;&#xA;To drive this point home, I created a tutorial that showcases using Trino and looking at the metadata it produces. 
In the following scenario, the docker environment contains four docker containers:&#xA;&#xA;trino - the runtime in this scenario that replaces Hive.&#xA;minio - the storage is an open-source cloud object storage.&#xA;hive-metastore - the metastore service instance.&#xA;mariadb - the database that the metastore uses to store the metadata.&#xA;&#xA;You can play around with the system and optionally view the configurations. The scenario asks you to run a query to populate data in MinIO and then see the resulting metadata populated in MariaDB by the HMS. The next step asks you to run queries over the mariadb database which holds the generated metadata from the metastore.&#xA;&#xA;If you have any questions or run into any issues with the example, you can find us on slack on the #dev or #general channels.&#xA;&#xA;Have fun!&#xA;&#xA;https://trino.io/assets/blog/intro-to-hive-connector/intro-to-hive.jpeg&#xA;&#xA;https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio&#xA;&#xA;#trino #hive&#xA;&#xA;bits&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p>TL;DR: The Hive connector is what you use in Trino for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.</p>




<p>Originally Posted on <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector.html">https://trino.io/blog/2020/10/20/intro-to-hive-connector.html</a></p>

<p>One of the most confusing aspects when starting Trino is the Hive connector. Typically, you seek out the use of Trino when you experience an intensely slow query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In fact, the genesis of Trino came about due to these slow Hive query conditions at Facebook back in 2012.</p>

<p>So when you learn that Trino has a Hive connector, it can be rather confusing since you moved to Trino to circumvent the slowness of your current Hive cluster. Another common source of confusion is when you want to query your data from your cloud object storage, such as AWS S3, MinIO, and Google Cloud Storage. This too uses the Hive connector. If that confuses you, don’t worry, you are not alone. This blog aims to explain this commonly confusing nomenclature.</p>

<h3 id="hive-architecture">Hive architecture</h3>

<p>To understand the origins and inner workings of Trino’s Hive connector, you first need to know a few high-level components of the Hive architecture.</p>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/hive.png" alt=""/></p>

<p>You can simplify the Hive architecture to four components:</p>

<p><em>The runtime</em> contains the logic of the query engine that translates the SQL-esque Hive Query Language (HQL) into MapReduce jobs that run over files stored in the filesystem.</p>

<p><em>The storage</em> component is simply that: it stores files in various formats, along with index structures to recall them. The file formats range from ones as simple as JSON and CSV to more complex columnar formats like ORC and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed Filesystem (HDFS). As cloud-based options became more prevalent, object storage like Amazon S3, Azure Blob Storage, Google Cloud Storage, and others replaced HDFS as the storage component.</p>

<p>In order for Hive to process these files, it must have a mapping from SQL tables in <em>the runtime</em> to files and directories in <em>the storage</em> component. To accomplish this, Hive uses the Hive Metastore Service (HMS), often shortened to <em>the metastore</em>, to manage metadata about the files, such as table columns, file locations, and file formats.</p>

<p>The last component, not included in the image, is Hive’s <em>data organization specification</em>. This specification is documented only in the Hive code itself and has been reverse engineered by systems like Trino to remain compatible with Hive and with each other.</p>

<p>Trino reuses all of these components except for <em>the runtime</em>. This is the same approach most compute engines take when dealing with data in object stores, including Spark, Drill, and Impala. When you think of the Hive connector, think of a connector capable of reading data organized by the unwritten Hive specification.</p>

<h3 id="trino-runtime-replaces-hive-runtime">Trino runtime replaces Hive runtime</h3>

<p>In the early days of big data systems, many expected query turnaround to take a long time due to the high volume of unstructured data in ETL workloads. The primary goal in early iterations of these systems was simply throughput over large volumes of data while maintaining fault tolerance. Now, more businesses want to run fast interactive queries over their big data instead of running jobs that take hours and produce possibly undesirable results. Many companies have petabytes of data and metadata in their data warehouse. Data in storage is cumbersome to move, and the data in the metastore takes a long time to repopulate in other formats. Since only the runtime that executes Hive queries needs replacement, the Trino engine utilizes the existing metastore metadata and the files residing in storage, and the Trino runtime effectively replaces the Hive runtime responsible for analyzing the data.</p>

<h3 id="trino-architecture">Trino Architecture</h3>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/trino.png" alt=""/></p>

<h3 id="the-hive-connector-nomenclature">The Hive connector nomenclature</h3>

<p>Notice that the only change in the Trino architecture is <em>the runtime</em>. The HMS still exists, along with <em>the storage</em>. This is not by accident: this design addresses a common problem faced by many companies by simplifying the migration from Hive to Trino. Regardless of <em>the storage</em> component used, <em>the runtime</em> makes use of the HMS, and that is why this connector is called the Hive connector.</p>

<p>Where the confusion tends to come from is when you search for a connector from the context of the storage system you want to query. You may not even be aware that <em>the metastore</em> is a necessity, or that it exists. Typically, you look for an S3 connector, a GCS connector, or a MinIO connector. All you need is the Hive connector and the HMS to manage the metadata of the objects in your storage.</p>

<h3 id="the-hive-metastore-service">The Hive Metastore Service</h3>

<p>The HMS is the only Hive process used in the entire Trino ecosystem when using the Hive connector. The HMS is actually a simple service with a binary API using <strong><a href="https://thrift.apache.org/">the Thrift protocol</a></strong>. This service manages the metadata, stored in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There are also compatible replacements, such as AWS Glue, a drop-in substitute for the HMS.</p>
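<p>To make this concrete, here is a minimal sketch of a Trino catalog properties file wiring the Hive connector to an HMS over Thrift. The file name and hostname are assumptions for illustration; <code>connector.name</code> and <code>hive.metastore.uri</code> are the properties the Hive connector expects (older versions may use a different connector name, so check the docs for your release).</p>

<pre><code># etc/catalog/hive.properties (file name is an assumption)
connector.name=hive
# Thrift endpoint of the Hive Metastore Service; hostname and port assumed
hive.metastore.uri=thrift://hive-metastore:9083
</code></pre>

<p>Any catalog configured this way queries object storage through the Hive connector, no matter which storage system sits underneath.</p>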

<p><a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio">https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio</a></p>

<h3 id="getting-started-with-the-hive-connector-on-trino">Getting started with the Hive Connector on Trino</h3>

<p>To drive this point home, I <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio">created a tutorial that showcases using Trino and looking at the metadata it produces</a>. In the following scenario, the docker environment contains four docker containers:</p>
<ul><li><code>trino</code> - <em>the runtime</em> in this scenario that replaces Hive.</li>
<li><code>minio</code> - <em>the storage</em> is an open-source cloud object storage.</li>
<li><code>hive-metastore</code> - <em>the metastore</em> service instance.</li>
<li><code>mariadb</code> - the database that <em>the metastore</em> uses to store the metadata.</li></ul>

<p>You can play around with the system and optionally view the configurations. The scenario asks you to run a query to populate data in MinIO and then see the resulting metadata populated in MariaDB by the HMS. The next step asks you to run queries over the <code>mariadb</code> database which holds the generated metadata from <em>the metastore</em>.</p>
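<p>As an illustration of what you might find there, the standard HMS relational schema includes tables such as <code>DBS</code> and <code>TBLS</code>; a query along these lines shows the table metadata the HMS recorded (the <code>metastore_db</code> schema name is an assumption and may differ in your setup):</p>

<pre><code>SELECT TBL_ID, DB_ID, TBL_NAME, TBL_TYPE
FROM metastore_db.TBLS;
</code></pre>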

<p>If you have any questions or run into any issues with the example, you can find us on <a href="https://trino.io/slack.html">slack</a> on the <a href="https://bitsondata.dev/tag:dev" class="hashtag"><span>#</span><span class="p-category">dev</span></a> or <a href="https://bitsondata.dev/tag:general" class="hashtag"><span>#</span><span class="p-category">general</span></a> channels.</p>

<p>Have fun!</p>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/intro-to-hive.jpeg" alt=""/></p>

<p><a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio">https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio</a></p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:hive" class="hashtag"><span>#</span><span class="p-category">hive</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/a-gentle-introduction-to-the-hive-connector</guid>
      <pubDate>Wed, 21 Oct 2020 17:00:00 +0000</pubDate>
    </item>
    <item>
      <title>What is benchmarketing and why is it bad?</title>
      <link>https://bitsondata.dev/what-is-benchmarketing-and-why-is-it-bad?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[There’s something I have to get off my chest. If you really need to, just read the TLDR and listen to the Justin Bieber parody posted below. If you’re confused by the lingo, the rest of the post will fill in any gaps.&#xA;&#xA;TL;DR: Benchmarketing, the practice of using benchmarks for marketing, is bad. Consumers should run their own benchmarks and ideally open-source them instead of relying on an internal and biased report.&#xA;&#xA;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/FSy8V-R0Zw&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen/iframe&#xA;&#xA;!--more--&#xA;&#xA;For the longest time, I have wondered what is the point of corporations, specifically in the database sectors, running their own benchmarks. Would a company ever have any incentive to post results from a benchmark that didn’t show its own system winning in at least the majority of cases? I understand that these benchmarks have become part of the furniture we come to expect to see when visiting any hot new database’s website. I doubt anybody in the public domain gains much insight out of these results, to begin with, at least nothing they weren’t expecting to see.&#xA;&#xA;Now to be clear, I am in no way indicating that companies running their own internal benchmarks to analyze their own performance in comparison to their competitors is a bad thing. It’s when they take those results and intentionally skew the methods or data from these benchmarks for sales or marketing purposes that is the problem we’re discussing here. 
Vendors that take part in the practice, not only use these benchmarks to show their systems succeeding a little but rather perversely taint their methodology with settings, caching, and other performance enhancements while leaving their competition’s settings untouched.&#xA;&#xA;This should be obvious that this is NOT what benchmarking is about! If you read about the history of the Transaction Processing Performance Council (TPC) you come to understand that this is the very wrongdoing that the council was created to address. But like with any proxy involving measurements, the measurements are inherently pliable.&#xA;&#xA;  By the spring of 1991, the TPC was clearly a success. Dozens of companies were running multiple TPC-A and TPC-B results. Not surprisingly, these companies wanted to capitalize on the TPC’s cachet and leverage the investment they had made in TPC benchmarking. Several companies launched aggressive advertising and public relations campaigns based around their TPC results. In many ways, this was exactly why the TPC was created: to provide objective measures of performance. What was wrong, therefore, with companies wanting to brag about their good results? What was wrong is that there was often a large gap between the objective benchmark results and their benchmark marketing claims — this gap, over the years, has been dubbed “benchmarketing.” So the TPC was faced with an ironic situation. It had poured an enormous amount of time and energy into creating a good benchmark and even a good benchmark review process. However, the TPC had no means to control how those results were used once they were approved. The resulting problems generated intense debates within the TPC.&#xA;&#xA;This benchmarketing ultimately fails the clients that these companies are marketing to. 
It demonstrates not only a lack of care for addressing the users’ actual pain but a lack of respect by intentionally pulling the wool over their eyes simply in an attempt to mask that their performance isn’t up to par with their competitors. This leads to consumers not being able to make informed decisions as most of our decisions are made from gut instincts and human emotion which these benchmarks aim to manipulate.&#xA;&#xA;If you’re not sure exactly how a company would pull this off, an example of might be that database A enables using a cost-based optimizer that requires precomputing statistics about different tables involved in the computation, while database B is running a query against this table without any type of stats based optimization made available to it. Database A will clearly dominate as now it can reorder joins and apply better execution plans while database B is going to go with the simplest plan and run much slower in most scenarios. The company whose product depends on database A will then hone in on the numerical outcomes of this report. Even if they’re decent enough to report the methods they skewed to get these results, they bury it within their report and focus on advertising the outcome of what would otherwise be considered an absurd comparison. Companies will even go as far as to say that their competition’s database wasn’t straightforward to configure when they were setting up optimizations. If you’re not capable of understanding how to make equivalent changes to both systems, well then I guess you don’t get to run that comparison until you figure it out.&#xA;&#xA;Many think that consumers are not susceptible to such attacks and would be able to see right through this scheme, but these reports appeal to any of us when we don’t have the necessity or resources to thoroughly examine all the data. 
Many times we have to take cues from our gut when a decision needs to be made and the time to make it is constrained by our time and other business needs. We see this type of phenomenon described in the book, Thinking Fast and Slow by Daniel Kahneman. To briefly summarize the model they use, there are two modes that humans use when they reason about their decisions, System 1 and System 2.&#xA;&#xA;  Systems 1 and 2 are both active whenever we are awake. System 1 runs automatically and System 2 is normally in comfortable low-effort mode, in which only a fraction of its capacity is engaged. System 1 continuously generates suggestions for System 2: impressions, intuitions, intentions, and feelings. If endorsed by System 2, impressions and intuitions turn into beliefs, and impulses turn into voluntary actions. When all goes smoothly, which is most of the time, System 2 adopts the suggestions of System 1 with little or no modification. You generally believe your impressions and act on your desires, and that is fine — usually.&#xA;&#xA;No surprise, that’s usually the part where we get into trouble. While we like to think that we are generally thinking in the logical System 2 mode, we don’t have time or energy to live in this space for long periods throughout the day and we find ourselves very reliant on System 1 for much of our decision making.&#xA;&#xA;  The measure of success for System 1 is the coherence of the story it manages to create. The amount and quality of the data on which the story is based are largely irrelevant. When information is scarce, which is a common occurrence, System 1 operates as a machine for jumping to conclusions.&#xA;&#xA;This is why benchmarketing can be so dangerous because it is so effective at manipulating our belief in claims that simply aren’t true. These decisions affect how your architecture will unfold, your time-to-value, and lost hours for your team and customers. 
It makes having these systems that fairly compare the performance and merits of two systems all the more paramount.&#xA;&#xA;https://xkcd.com/882/&#xA;&#xA;So why am I talking about this now?&#xA;&#xA;I have become a pretty big fanboy of Trino, a distributed query engine that runs interactive queries from many sources. I have witnessed firsthand how fast a cluster of Trino nodes is able to process a huge amount of data at fast speeds. When you dive into how these speeds are achieved you find that this project is an incredible modern feat of solid engineering that makes interactive analysis over petabytes of data a reality. Going into all the reasons I like this project would be too tangential but it fuels the fire for why I believe this message needs to be heard.&#xA;&#xA;Recently there was a “benchmark” that came out comparing the performance Dremio and Trino (then Presto) open-source and enterprise versions, touting performance improvements over Trino by an amount that would have been called out as too high in a CSI episode insert canonical csi clip here. Trino isn’t the only system in the data space to come under similar types of attacks. It makes sense too, as this type of technical peacocking is common as it successfully gains attention.&#xA;&#xA;Luckily, as more companies strive to become transparent and associate themselves with open-source efforts, we are starting to see a relatively new pattern of open-source efforts emerge. Typically, you’re used to hearing about open-source within the context of software projects maintained by open-source communities. We are now arriving at the age of any noun being able to be used in an open-source framework. There is open-source music, open-source education, and even open-source data. So why not reach a point where open-source benchmarking through consumer collaboration is a thing? 
This is not just for the sake of the consumers of these technologies who simply want to have more data to inform their design choices to better serve their clients, it’s also unfortunate that this affects developer communities that are putting in a lot of hard work on these projects, only to have that hard work get berated unintelligibly by the likes of some corporate status competition.&#xA;&#xA;Now I’m clearly a little biased when I tell you that I think Trino is currently the best analytics engine on the market today. When I say this, you really should be skeptical too. Really, I encourage it. You should verify in some way beyond a shadow of a doubt that:&#xA;&#xA;Any TPC or other benchmarks are validated and no “magic” was used to improve their performance.&#xA;&#xA;using your own use cases to make sure the system you choose is going to meet the needs of your particular use case.&#xA;&#xA;While this may seem like a lot of work, with cloud infrastructure and the simplicity of deploying different systems into the cloud, it’s now more possible to do this today than it ever was even 10 years ago to run a benchmark of competing systems internally and at scale. Not only can this benchmark be run by your own unbiased data engineers who have more stake to find out which system best fits the companies’ needs, but you don’t have to rely on generic benchmarking data to analyze this if you don’t want. You can spin up these systems and let them query your system, using your use cases, and do it any way you want it.&#xA;&#xA;In summary, if consumers can work together, we can work to get rid of this specific type of misinformation while providing a richer more insightful analysis that will aid both companies and consumers. As I mention in the song above, go run the test yourselves.&#xA;&#xA;#trino #presto #opensource&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p>There’s something I have to get off my chest. If you’re short on time, just read the TL;DR and listen to the Justin Bieber parody posted below. If you’re confused by the lingo, the rest of the post will fill in any gaps.</p>

<p>TL;DR: Benchmarketing, the practice of skewing benchmarks for marketing purposes, is bad. Consumers should run their own benchmarks and ideally open-source them instead of relying on a vendor’s internal, biased report.</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/FSy8V-R0_Zw" frameborder="0" allowfullscreen=""></iframe>



<p>For the longest time, I have wondered what the point is of corporations, particularly in the database sector, running their own benchmarks. Would a company ever have any incentive to publish results from a benchmark that didn’t show its own system winning in at least the majority of cases? I understand that these benchmarks have become part of the furniture we expect to see when visiting any hot new database’s website, but I doubt anybody in the public gains much insight from these results, at least nothing they weren’t already expecting to see.</p>

<p>Now, to be clear, I am in no way suggesting that companies running internal benchmarks to compare their own performance against their competitors is a bad thing. The problem we’re discussing here is when they intentionally skew the methods or data behind those benchmarks for sales or marketing purposes. Vendors that take part in this practice don’t merely use benchmarks to show their systems in a favorable light; they perversely taint their methodology with tuned settings, caching, and other performance enhancements while leaving their competitors’ configurations untouched.</p>

<p>It should be obvious that this is NOT what benchmarking is about! If you read about the history of the <a href="http://www.tpc.org/information/about/history5.asp">Transaction Processing Performance Council (TPC)</a>, you come to understand that this is the very wrongdoing the council was created to address. But as with any measurement used as a proxy, the numbers are inherently pliable.</p>

<blockquote><p><em>By the spring of 1991, the TPC was clearly a success. Dozens of companies were running multiple TPC-A and TPC-B results. Not surprisingly, these companies wanted to capitalize on the TPC’s cachet and leverage the investment they had made in TPC benchmarking. Several companies launched aggressive advertising and public relations campaigns based around their TPC results. In many ways, this was exactly why the TPC was created: to provide objective measures of performance. What was wrong, therefore, with companies wanting to brag about their good results? What was wrong is that there was often a large gap between the objective benchmark results and their benchmark marketing claims — this gap, over the years, has been dubbed “benchmarketing.” So the TPC was faced with an ironic situation. It had poured an enormous amount of time and energy into creating a good benchmark and even a good benchmark review process. However, the TPC had no means to control how those results were used once they were approved. The resulting problems generated intense debates within the TPC.</em></p></blockquote>

<p>This benchmarketing ultimately fails the clients these companies are marketing to. It demonstrates not only a lack of care for addressing users’ actual pain but also a lack of respect, intentionally pulling the wool over their eyes to mask that the vendor’s performance isn’t up to par with that of its competitors. <strong>This leaves consumers unable to make informed decisions, since most of our decisions are driven by gut instinct and human emotion, which these benchmarks aim to manipulate.</strong></p>

<p>If you’re not sure exactly how a company would pull this off, an example might be that database A enables a cost-based optimizer that requires precomputing statistics about the tables involved in a query, while database B runs the same query without any statistics-based optimization made available to it. Database A will clearly dominate: it can reorder joins and choose better execution plans, while database B falls back to the simplest plan and runs much slower in most scenarios. The company behind database A then homes in on the numerical outcomes of this report. Even if they’re decent enough to disclose the methods they skewed to get these results, they bury that disclosure deep in the report and advertise the outcome of what would otherwise be considered an absurd comparison. Companies will even go as far as to claim that their competitor’s database wasn’t straightforward to configure when they were setting up optimizations. If you’re not capable of making equivalent changes to both systems, then you don’t get to run that comparison until you figure it out.</p>
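<p>As a toy illustration of that asymmetry, here’s a sketch using SQLite as a stand-in for the hypothetical databases A and B. Both get an identical table and index, but only one has its planner statistics precomputed with <code>ANALYZE</code>, the same kind of one-sided preparation described above:</p>

```python
import sqlite3

def prepare(db, collect_stats):
    """Build an identical table and index in each database."""
    db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, category INTEGER)")
    db.executemany("INSERT INTO t (category) VALUES (?)",
                   [(i % 10,) for i in range(1000)])
    db.execute("CREATE INDEX idx_category ON t (category)")
    if collect_stats:
        # ANALYZE precomputes planner statistics (stored in sqlite_stat1),
        # the kind of preparation a cost-based optimizer relies on.
        db.execute("ANALYZE")

db_a = sqlite3.connect(":memory:")  # "database A": statistics precomputed
db_b = sqlite3.connect(":memory:")  # "database B": left untuned
prepare(db_a, collect_stats=True)
prepare(db_b, collect_stats=False)

# Only database A has statistics for its query planner to use.
def has_stats(db):
    return db.execute(
        "SELECT count(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
    ).fetchone()[0]

print(has_stats(db_a), has_stats(db_b))  # 1 0
```

<p>A fair comparison would run the statistics collection on both systems, or on neither; running it on only one is exactly the skew being criticized here.</p>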

<p>Many think that consumers are not susceptible to such tactics and would see right through the scheme, but these reports appeal to any of us when we don’t have the time or resources to thoroughly examine all the data. We often have to take cues from our gut when a decision must be made under the constraints of time and other business needs. This phenomenon is described in the book <em>Thinking, Fast and Slow</em> by Daniel Kahneman. To briefly summarize the model he uses, there are two modes humans use when reasoning about decisions, System 1 and System 2.</p>

<blockquote><p><em>Systems 1 and 2 are both active whenever we are awake. System 1 runs automatically and System 2 is normally in comfortable low-effort mode, in which only a fraction of its capacity is engaged. System 1 continuously generates suggestions for System 2: impressions, intuitions, intentions, and feelings. If endorsed by System 2, impressions and intuitions turn into beliefs, and impulses turn into voluntary actions. When all goes smoothly, which is most of the time, System 2 adopts the suggestions of System 1 with little or no modification. You generally believe your impressions and act on your desires, and that is fine — usually.</em></p></blockquote>

<p>No surprise, that’s usually where we get into trouble. While we like to think we generally operate in the logical System 2 mode, we don’t have the time or energy to live in that space for long stretches of the day, and we end up relying on System 1 for much of our decision making.</p>

<blockquote><p><em>The measure of success for System 1 is the coherence of the story it manages to create. The amount and quality of the data on which the story is based are largely irrelevant. When information is scarce, which is a common occurrence, System 1 operates as a machine for jumping to conclusions.</em></p></blockquote>

<p>This is why benchmarketing is so dangerous: it is remarkably effective at manipulating our belief in claims that simply aren’t true. These decisions affect how your architecture unfolds, your time-to-value, and lost hours for your team and customers. That makes systems that fairly compare the performance and merits of competing products all the more paramount.</p>

<p><img src="https://imgs.xkcd.com/comics/significant.png" alt="https://xkcd.com/882/"/></p>

<p>So why am I talking about this now?</p>

<p>I have become a pretty big fanboy of Trino, a distributed query engine that runs interactive queries across many data sources. I have witnessed firsthand how quickly a cluster of Trino nodes can process huge amounts of data. When you dive into how these speeds are achieved, you find that the project is an incredible feat of modern engineering that makes interactive analysis over petabytes of data a reality. Going into all the reasons I like this project would be too tangential, but it fuels the fire for why I believe this message needs to be heard.</p>

<p>Recently there was a <a href="https://web.archive.org/web/20211022224127/https://www.dremio.com/dremio-vs-presto/">“benchmark”</a> comparing the performance of the open-source and enterprise versions of Dremio and Trino (<a href="https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html">then Presto</a>), touting performance improvements over Trino by an amount that would have been called out as too high in <a href="https://www.youtube.com/watch?v=hkDD03yeLnU">a CSI episode</a>. <a href="https://blog.yugabyte.com/yugabytedb-vs-cockroachdb-bringing-truth-to-performance-benchmark-claims-part-1/">Trino isn’t the only system in the data space to come under this kind of attack.</a> It makes sense, too: this kind of technical peacocking is common because it successfully gains attention.</p>

<p>Luckily, as more companies strive to be transparent and associate themselves with open-source efforts, we are starting to see a relatively new pattern emerge. Typically, you hear about open source within the context of software projects maintained by open-source communities. We are now arriving at an age where nearly any noun can be open-sourced: there is open-source music, open-source education, and even open-source data. So why not reach a point where open-source benchmarking through consumer collaboration is a thing? It would serve not just the consumers of these technologies, who simply want more data to inform the design choices that serve their clients, but also the developer communities putting hard work into these projects, only to have that work unintelligibly berated in some corporate status competition.</p>

<p>Now, I’m clearly a little biased when I tell you that I think Trino is currently the best analytics engine on the market. You should be skeptical of that claim. Really, I encourage it. You should verify beyond a shadow of a doubt that:</p>
<ol><li><p>Any TPC or other benchmark results are validated and no “magic” was used to improve their performance.</p></li>

<li><p>The system you choose is tested against your own use cases, to make sure it meets your particular needs.</p></li></ol>

<p>While this may seem like a lot of work, cloud infrastructure and the simplicity of deploying different systems into the cloud make it more feasible than it was even 10 years ago to run a benchmark of competing systems internally and at scale. Such a benchmark can be run by your own unbiased data engineers, who have a real stake in finding the system that best fits your company’s needs, and you don’t have to rely on generic benchmarking data if you don’t want to. You can spin up these systems, let them query your data using your use cases, and do it any way you want.</p>

<p>In summary, if consumers work together, we can get rid of this particular type of misinformation while providing a richer, more insightful analysis that aids both companies and consumers. As I mention in the song above, go run the test yourselves.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:presto" class="hashtag"><span>#</span><span class="p-category">presto</span></a> <a href="https://bitsondata.dev/tag:opensource" class="hashtag"><span>#</span><span class="p-category">opensource</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/what-is-benchmarketing-and-why-is-it-bad</guid>
      <pubDate>Sat, 12 Sep 2020 17:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>