<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>trino &#8212; bits on data</title>
    <link>https://bitsondata.dev/tag:trino</link>
    <description>&gt;_ the imposter&#39;s guide to software, data, and life</description>
    <pubDate>Wed, 15 Apr 2026 01:17:27 +0000</pubDate>
    <image>
      <url>https://i.snap.as/vWVqkBBl.png</url>
      <title>trino &#8212; bits on data</title>
      <link>https://bitsondata.dev/tag:trino</link>
    </image>
    <item>
      <title>Integrating Trino and Snowflake</title>
      <link>https://bitsondata.dev/trino-snowflake-bloomberg-oss-win?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[An open source success story&#xA;&#xA;TL;DR: Contributing to open source can be frustrating as the consensus needed for code to align to the project vision is often out of scope for many companies. This post dives deep into the obstacles and wins of two contributors from different companies working together to add the same proprietary connector. It&#39;s both inspiring and carries many lessons to bring along as you venture into open source to gain the pearls and avoid the perils.&#xA;&#xA;We’re seeing open source usher in a challenge to the economic model where the success metric is increasing the commonwealth of economic capital. This acceleration comes from playing positive-sum games with friends online and avoiding limiting a community to a vision that only benefits a small number of corporations or individuals. It’s hard to imagine how to embed such frameworks within our current zero-sum winner-takes-all economic system. There’s certainly no shortage of heated debates around how to construct a harmonious relationship between the open source community and the companies participating in it. Something we don’t talk about enough is the positive examples of when a coordinated effort in open source sticks the landing, and so many benefit from it.&#xA;&#xA;This post highlights the extraordinary contributions of Erik Anderson, Teng Yu, Yuya Ebihara, and the broader Trino community to finally contribute the long-coveted Trino Snowflake connector. It is a success story paired with a blueprint for individuals and corporations wanting to contribute to open source projects they use. 
These stories are valuable in that they demonstrate how to be most effective in collaborating with strangers-soon-to-be-friends and common pitfalls to avoid.&#xA;&#xA;A common challenge in open source&#xA;&#xA;Despite the importance of delivering marketing and education in a community (aka edutainment), it’s only the first part of the equation of what makes open source projects successful. Once developers see some exciting video or tutorial, they ultimately land on the docs site, GitHub, StackOverflow, or some communication platform in the community. It&#39;s at this point that developers can easily lose motivation if the docs lack proper getting-started materials or the community is completely silent. This is how I categorize the developer experience (aka devex), which aims to improve both the user and contributor experiences in the developer community by empowering decisions through hands-on learning, removing inefficiencies, and as we&#39;ll cover here, exposing untapped opportunities.&#xA;&#xA;Much like any open source project, maintainers on the Trino project struggle to communicate the lack of proper resources to build and test new features for various proprietary software. For those less familiar, Trino is a federated query engine with multiple data sources. Trino tests integrations with open data sources by running small local instances of the connecting system. Snowflake is a proprietary, cloud-native data warehouse, also known as a cloud data platform. This left no viable, free way to test this integration, which was eagerly sought by many. 
After an initial attempt by my friend Phillipe Gagnon, a similar pattern emerged with the second pull request, where development velocity started strong and stagnated after some months.&#xA;&#xA;Cognitive surplus and communication deficit&#xA;&#xA;A common and unfortunate class of issues is that larger objectives, well known among the core group, often move faster than less-established individual contributions. These additions are often much needed and welcome, but fail to fit the larger project roadmap narrative. Since it&#39;s easier to coordinate within the smaller core group, where trust and norms have been communicated and established, changes from outside this group are more likely to get lost in the shuffle. As an open source project grows, you end up with a cognitive surplus in the form of an abundance of bright people willing to share their time, intellect, and experience with a larger community.&#xA;&#xA;Often both contributors and maintainers are so busy with their day jobs, families, and self care that they dedicate most of their remaining energy to ensuring they write quality code and tests to the best of their ability. Lack of upfront communication to validate ideas from newer contributors, and lack of communication by maintainers who see a large number of issues to address, are two communication issues that stagnate a project. Maintainers are often doers who see more value in addressing quick-win work that flows from the well-established contributors of the project. Follow-through on either side can be difficult as newcomers don&#39;t want to be rude and maintainers accidentally forget or hope someone else will take the time to address the issues on that pull request. &#xA;&#xA;Waiting for your work to be reviewed by someone in the community works a bit like a wishing well: you toss in a coin (i.e. 
your time and effort represented as code and a pull request) and hope your wish of getting your code reviewed and merged comes true. The satisfaction of hypothetical developers benefiting from your small but significant change floods your mind, and you feel like you’ve improved humanity just that one little bit more. &#xA;&#xA;Maintainers are in a constant state of triaging the surplus of innovation being thrown at them while simultaneously looking for more help with reviews and serving as the expert in some areas of the code. As you can imagine, good communication can be hard to come by, as many newcomers are strangers and worry they are wasting precious time by asking too many questions rather than just showing a proof of concept. This backfires when developers spend a large portion of their time developing a solution that is not compatible with the project, and maintainers lose the opportunity to quickly spin up on the value of the new feature. This is why regular contributor meetings help solve both of these issues synchronously, cutting out the delayed feedback loops.&#xA;&#xA;History repeats itself, until it doesn&#39;t&#xA;&#xA;It became apparent that each time there was a discussion about how to do integration testing, there was no good way to test against a Snowflake instance given the project&#39;s lack of funding. Trino has a high bar for quality, and none of the maintainers felt the risk was worth taking, given the likely popularity of the integration and the likelihood of future maintenance issues. Each pull request hit this same fate: it stalled with no clear path to resolve the real issue of funding the Snowflake infrastructure needed by the Trino Software Foundation (TSF). 
It’s never fun to mention that you can’t move forward on work with constraints like these, and without a monetary solution, the contributor is often met with silence.&#xA;&#xA;Noticing that Teng had already done a significant amount of work to contribute his Snowflake connector, I reached out to him to see if we could brainstorm a solution. Not long after, Erik also reached out to get my thoughts on how to go about contributing Bloomberg&#39;s Snowflake connector. Great, now we had two connector implementations and no way to get the infrastructure needed to test them. During the first Trino Contributor Congregation, Erik and I brought up Bloomberg&#39;s desire to contribute a Snowflake connector and I articulated the testing issue. Ironically, this was the first time I had thoroughly articulated the issue to Erik as well.&#xA;&#xA;As soon as I was done, Erik requested the mic and said something to the effect of, &#34;Oh I wish I would have known that&#39;s the problem, the solution is simple, Bloomberg will provide the TSF a Snowflake account.&#34;&#xA;&#xA;Done!&#xA;&#xA;Just as in business, never underestimate the power of communication in an open source project. Shortly after Erik, Teng, and I discussed the best ways to merge their work, they set up the Snowflake accounts for Trino maintainers and started the arduous process of building a thorough test suite with the help of Yuya, Piotr Findeisen, Manfred Moser, and Martin Traverso.&#xA;&#xA;The long road to Snowflake&#xA;&#xA;As Teng and Erik merged their efforts, the process was anything but straightforward. There were setbacks, vacations, meticulous reviews, and infrastructure issues. But the perseverance of everyone involved was unwavering.&#xA;&#xA;Bloomberg started by creating an official Bloomberg Trino repository, originally as a means for Teng and Erik to mesh their solutions together and build the testing infrastructure that relied on Bloomberg resources. 
Without needing to rely on the main Trino project to merge incremental solutions, they were able to quickly iterate on the early solutions. This repository also facilitated Bloomberg’s now numerous contributions to Trino.&#xA;&#xA;It took a few months just to get the ForePaaS[1] and Bloomberg solutions merged. There were valuable takeaways from each system, and better integration tests were written with the new testing infrastructure. The two Snowflake connector implementations were merged together by April of 2023. Finally, the reviews could start. Once the initial two passes happened, we anticipated that we would see the Snowflake connector release in the summer of 2023 around Trino Fest. So much so that we planned a talk with Erik and Teng initially as a reveal, assuming the pull request would be merged by then. Lo and behold, this didn’t happen, as there were still a lot of concerns around use cases not being properly tested.&#xA;&#xA;The halting review problem&#xA;&#xA;A necessary evil of pull request reviews and, more broadly, distributed consensus is that reviews can drag on over time. This can lead to countless updates you have to make to your changes to accommodate the ever-changing project shifting beneath your feet as you simultaneously try to make progress on suggestions from those reviewing your code.&#xA;&#xA;Many critics of open source like to point this out as a drawback, when in fact this same problem exists in closed source systems. Closed source projects can generally delay difficult decisions to make fast upfront progress to meet certain deadlines. 
This may be seen as an advantage at first, but as many developers can attest, this simply leads to technical debt and fragile products in most environments that struggle to prioritize a healthy codebase.&#xA;&#xA;Regardless, having to face these larger discussions upfront can induce fatigue, especially when managing external circumstances; personal affairs, a project at work - you know, the entity that pays these engineers - or countless other factors will rear their ugly heads and progress will stagger with ebbs and flows of attention. This can be really dangerous territory and commonly results in contributors and reviewers abandoning the PR when it stalls.&#xA;&#xA;This is why I believe open source, while not beholden to any timelines, needs a project and product management role, which is currently often covered by project leaders and devex engineers. This can also relieve tension between the needs of open source and big businesses in the community with real deadlines, at least keeping the communication consistent while ensuring bugs and design flaws aren’t introduced to the code base.&#xA;&#xA;What’s in it for Bloomberg and ForePaaS?&#xA;&#xA;If you’ve never worked in open source or for a company that contributes to open source, you may be wondering how the heck these engineers convince their leadership to let them dump so much time into these contributions. The simple answer is, it’s good for business.&#xA;&#xA;If we look into why Bloomberg uses Trino, they aggregate data from an unusually large number of data sources across their customers who use their services. Part of this requires them to merge the customer’s dataset with existing aggregate data in Bloomberg’s product. Since Trino can connect to most customer databases out-of-the-box, Bloomberg only needs to manage a small array of custom connectors that provide their services to customers as multiple catalogs in a single convenient SQL endpoint. 
Having engineers maintain a few small connectors rather than an entire distributed query engine themselves saves a lot of time and maintenance.&#xA;&#xA;Despite how many problems Trino already solves for them, Bloomberg and ForePaaS needed this Snowflake connector and, through the open source model, created it for themselves. The drawback is that the solution must be maintained by the engineers at each company any time they want to upgrade to a new Trino feature. This consistently depletes engineering resources, so they want to maintain as few features as possible to free up their engineers&#39; time. Open source projects are generally more than happy to accept features that the community benefits from. This doesn’t mean we shouldn’t appreciate when companies contribute. This positive-sum generosity and forward-thinking approach enabled Erik and Teng to combine their battle-tested connectors, creating high value for the community.&#xA;&#xA;If you are a developer who sees the value in contributing to open source, and you aren&#39;t sure how to convince leadership to get on board, you need to speak their language. Show how companies like Bloomberg get involved in open source, and how it lowers maintenance costs when done correctly. If you see an open source project like Trino that could replace 97% of a new project, demonstrate that the upfront cost will pay off by reducing the amount of code your team has to manage, which lowers the future need to expand headcount. 
I don’t imagine a world where your boss and colleagues are altruists, but if you present an economic incentive that lowers the amortized cost of the engineers needed to maintain a project, your strategy becomes helpful to the company&#39;s bottom line.&#xA;&#xA;While the immediate investment shows small gains for a single team at a single company, once that change exists in open source, other companies can immediately benefit and offer better testing and improvements than you could have asked for when managing the original project with your own team. Humanity at large gets to benefit from every contribution done this way, and the more companies that embrace this, the less effort we waste pointlessly duplicating work.&#xA;&#xA;Esprit de Corps&#xA;&#xA;The Marines use the mantra “esprit de corps,” French for “spirit of the body,” whose “corps” I mistakenly took to mean the Marine Corps rather than the more general meaning of a body or group of people. In fact, it expresses the common spirit existing in the members of a group, inspiring enthusiasm, devotion, and strong regard for the honor of the group. Any time I see this type of shared and selfless cooperation in open source, I’m reminded of the bond, friendships, and care among my fellow Marines and me. Despite the unfortunate political circumstances of our mission, I do treasure the shared companionship with both my fellow Marines and the local Iraqi people. There is ultimately a power in the gathering of many when aimed at building an altruistic means of improving each other&#39;s lives.&#xA;&#xA;In the same way, this demonstration of human cooperation is about more than just developing a connector; it&#39;s about the shared experiences, the friendships forged, and the skills honed in the pursuit of a common goal. The successful addition of the Trino Snowflake connector is a testament to the positive-sum outcomes of open source collaboration. 
This journey has been about collaboration, learning, and growth that will benefit many. I remember the night I got the email that Yuya had merged the pull request; I was ecstatic, to say the least. The connector shipped with Trino version 440, making it possible to connect to the most widely adopted cloud data warehouse.&#xA;&#xA;Once the hard work was done, many valuable iterations, like adding Top-N support (Shopee), adding Snowflake Iceberg REST catalog support (Starburst), and adding better type mapping (Apple), were added to the Snowflake integration. I love showcasing this trailblazing and, yes, altruistic work from Erik, Teng, Yuya, Martin, Manfred, and Piotr - and everyone who helped in the Trino community. A special thanks to the managers and leadership at Bloomberg and ForePaaS for their generous commitment of time and resources.&#xA;&#xA;As we celebrate this milestone, we&#39;re already looking forward to the next adventure. Here&#39;s to federating them all, together!&#xA;&#xA;Notes:&#xA;1. ForePaaS has been integrated into OVHCloud, which is now called Data Platform.&#xA;&#xA;bits]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="an-open-source-success-story">An open source success story</h2>

<p>TL;DR: Contributing to open source can be frustrating as the consensus needed for code to align to the project vision is often out of scope for many companies. This post dives deep into the obstacles and wins of two contributors from different companies working together to add the same proprietary connector. It&#39;s both inspiring and carries many lessons to bring along as you venture into open source to gain the pearls and avoid the perils.</p>

<p><img src="https://i.snap.as/CvkkjKzk.jpeg" alt=""/></p>

<p>We’re seeing open source usher in a challenge to the economic model where the success metric is increasing the commonwealth of economic capital. This acceleration comes from playing positive-sum games with friends online and avoiding limiting a community to a vision that only benefits a small number of corporations or individuals. It’s hard to imagine how to embed such frameworks within our current zero-sum winner-takes-all economic system. There’s certainly no shortage of heated debates around how to construct a harmonious relationship between the open source community and the companies participating in it. Something we don’t talk about enough is the positive examples of when a coordinated effort in open source sticks the landing, and so many benefit from it.</p>

<p>This post highlights the extraordinary contributions of <a href="https://www.linkedin.com/in/erikanderson/">Erik Anderson</a>, <a href="https://www.linkedin.com/in/tyu-fr/">Teng Yu</a>, <a href="https://www.linkedin.com/in/ebyhr/">Yuya Ebihara</a>, and the broader <a href="https://github.com/trinodb/trino">Trino community</a> to finally contribute the long-coveted <a href="https://trino.io/docs/current/connector/snowflake.html">Trino Snowflake connector</a>. It is a success story paired with a blueprint for individuals and corporations wanting to contribute to open source projects they use. These stories are valuable in that they demonstrate how to be most effective in collaborating with strangers-soon-to-be-friends and common pitfalls to avoid.</p>



<h2 id="a-common-challenge-in-open-source">A common challenge in open source</h2>

<p>Despite the importance of delivering marketing and education in a community (aka <a href="https://en.wikipedia.org/wiki/Educational_entertainment">edutainment</a>), it’s only the first part of the equation of what makes open source projects successful. Once developers see some exciting video or tutorial, they ultimately land on the docs site, GitHub, StackOverflow, or some communication platform in the community. It&#39;s at this point that developers can easily lose motivation if the docs lack proper getting-started materials or the community is completely silent. This is how I categorize the developer experience (aka devex), which aims to improve both the user and contributor experiences in the developer community by <a href="https://en.wikipedia.org/wiki/Experiential_learning">empowering decisions through hands-on learning</a>, <a href="https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog">removing inefficiencies</a>, and as we&#39;ll cover here, exposing untapped opportunities.</p>

<p>Much like any open source project, maintainers on the Trino project struggle to communicate the lack of proper resources to build and test new features for various proprietary software. For those less familiar, Trino is a federated query engine with <a href="https://trino.io/docs/current/connector.html">multiple data sources</a>. Trino tests integrations with open data sources by running small local instances of the connecting system. Snowflake is a proprietary, cloud-native data warehouse, also known as a cloud data platform. This left no viable, free way to test this integration, which was <a href="https://github.com/trinodb/trino/pull/2551#issuecomment-873082280">eagerly</a> <a href="https://github.com/trinodb/trino/issues/1863">sought</a> <a href="https://github.com/trinodb/trino/issues/7247">by many</a>. After an <a href="https://github.com/trinodb/trino/pull/2551">initial attempt</a> by my friend <a href="https://www.linkedin.com/in/pfgagnon">Phillipe Gagnon</a>, a similar pattern emerged <a href="https://github.com/trinodb/trino/pull/10387">with the second pull request</a>, where development velocity started strong and stagnated after some months.</p>
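<p>For readers less familiar with how Trino talks to a data source: each connector is configured through a catalog properties file on the coordinator. The snippet below is only a minimal sketch of what a Snowflake catalog could look like; the hostnames, credentials, and database names are invented placeholders, not a tested configuration:</p>

<pre><code># etc/catalog/snowflake.properties (hypothetical values)
connector.name=snowflake
connection-url=jdbc:snowflake://example.snowflakecomputing.com
connection-user=trino_user
connection-password=secret
snowflake.database=EXAMPLE_DB
snowflake.warehouse=EXAMPLE_WH
</code></pre>

<p>With a file like this in place, the catalog shows up as a queryable namespace — which is exactly why the project needed real Snowflake infrastructure to test against, since no small local stand-in exists.</p>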

<h3 id="cognitive-surplus-and-communication-deficit">Cognitive surplus and communication deficit</h3>

<p>A common and unfortunate class of issues is that larger objectives, well known among the core group, often move faster than less-established individual contributions. These additions are often much needed and welcome, but fail to fit the larger project roadmap narrative. Since it&#39;s easier to coordinate within the smaller core group, where trust and norms have been communicated and established, changes from outside this group are more likely to get lost in the shuffle. As an open source project grows, you end up with a cognitive surplus in the form of an abundance of bright people willing to share their time, intellect, and experience with a larger community.</p>

<p>Often both contributors and maintainers are so busy with their day jobs, families, and self care that they dedicate most of their remaining energy to ensuring they write quality code and tests to the best of their ability. Lack of upfront communication to validate ideas from newer contributors, and lack of communication by maintainers who see a large number of issues to address, are two communication issues that stagnate a project. Maintainers are often doers who see more value in addressing quick-win work that flows from the well-established contributors of the project. Follow-through on either side can be difficult as newcomers don&#39;t want to be rude and maintainers accidentally forget or hope someone else will take the time to address the issues on that pull request.</p>

<p><img src="https://i.snap.as/7TdSoquQ.jpg" alt=""/></p>

<p>Waiting for your work to be reviewed by someone in the community works a bit like a wishing well: you toss in a coin (i.e. your time and effort represented as code and a pull request) and hope your wish of getting your code reviewed and merged comes true. The satisfaction of hypothetical developers benefiting from your small but significant change floods your mind, and you feel like you’ve improved humanity just that one little bit more.</p>

<p>Maintainers are in a constant state of triaging the surplus of innovation being thrown at them while simultaneously looking for more help with reviews and serving as the expert in some areas of the code. As you can imagine, good communication can be hard to come by, as many newcomers are strangers and worry they are wasting precious time by asking too many questions rather than just showing a proof of concept. This backfires when developers spend a large portion of their time developing a solution that is not compatible with the project, and maintainers lose the opportunity to quickly spin up on the value of the new feature. This is why regular <a href="https://github.com/trinodb/trino/wiki/Contributor-meetings">contributor meetings</a> help solve both of these issues synchronously, cutting out the delayed feedback loops.</p>

<h3 id="history-repeats-itself-until-it-doesn-t">History repeats itself, until it doesn&#39;t</h3>

<p>It became apparent that each time there was <a href="https://github.com/trinodb/trino/pull/2551#issuecomment-709220790">a discussion</a> about how to do <a href="https://github.com/trinodb/trino/pull/10387#issuecomment-1008430060">integration testing</a>, there was no good way to test against a Snowflake instance given the project&#39;s lack of funding. Trino has a high bar for quality, and none of the maintainers felt the risk was worth taking, given the likely popularity of the integration and the likelihood of future maintenance issues. Each pull request hit this same fate: it stalled with no clear path to resolve the real issue of funding the Snowflake infrastructure needed by the <a href="https://trino.io/foundation.html">Trino Software Foundation (TSF)</a>. It’s never fun to mention that you can’t move forward on work with constraints like these, and without a monetary solution, the contributor is often met with silence.</p>

<p>Noticing that Teng had already done a significant amount of work to contribute his Snowflake connector, I reached out to him to see if we could brainstorm a solution. Not long after, Erik also reached out to get my thoughts on how to go about contributing Bloomberg&#39;s Snowflake connector. Great, now we had two connector implementations and no way to get the infrastructure needed to test them. During the first <a href="https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html#trino-contributor-congregation">Trino Contributor Congregation</a>, Erik and I brought up Bloomberg&#39;s desire to contribute a Snowflake connector and I articulated the testing issue. Ironically, this was the first time I had thoroughly articulated the issue to Erik as well.</p>

<p>As soon as I was done, Erik requested the mic and said something to the effect of, “Oh I wish I would have known that&#39;s the problem, the solution is simple, Bloomberg will provide the TSF a Snowflake account.”</p>

<p>Done!</p>

<p>Just as in business, <strong>never underestimate the power of communication in an open source project</strong>. Shortly after Erik, Teng, and I discussed the best ways to merge their work, they set up the Snowflake accounts for Trino maintainers and started the arduous process of building a thorough test suite with the help of Yuya, <a href="https://www.linkedin.com/in/piotrfindeisen/">Piotr Findeisen</a>, <a href="https://www.linkedin.com/in/manfredmoser/">Manfred Moser</a>, and <a href="https://www.linkedin.com/in/traversomartin/">Martin Traverso</a>.</p>

<h2 id="the-long-road-to-snowflake">The long road to Snowflake</h2>

<p>As Teng and Erik merged their efforts, the process was anything but straightforward. There were setbacks, vacations, meticulous reviews, and infrastructure issues. But the perseverance of everyone involved was unwavering.</p>

<p>Bloomberg started by creating <a href="https://github.com/bloomberg/trino">an official Bloomberg Trino repository</a>, originally as a means for Teng and Erik to mesh their solutions together and build the testing infrastructure that relied on Bloomberg resources. Without needing to rely on the main Trino project to merge incremental solutions, they were able to quickly iterate on the early solutions. This repository also facilitated Bloomberg’s now numerous contributions to Trino.</p>

<p>It took a few months just to get the ForePaaS<sup><a class="footnote" href="#fnref1">1</a></sup> and Bloomberg solutions merged. There were valuable takeaways from each system, and better integration tests were written with the new testing infrastructure. The two Snowflake connector implementations were merged together by April of 2023. Finally, the reviews could start. Once the initial two passes happened, we anticipated that we would see the Snowflake connector release in the summer of 2023 around Trino Fest. So much so that we planned <a href="https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap.html">a talk with Erik and Teng</a> initially as a reveal, assuming the pull request would be merged by then. Lo and behold, this didn’t happen, as there were still a lot of concerns around use cases not being properly tested.</p>

<h3 id="the-halting-review-problem">The halting review problem</h3>

<p>A necessary evil of pull request reviews and, more broadly, distributed consensus is that reviews can drag on over time. This can lead to <a href="https://github.com/trinodb/trino/pull/17909#issuecomment-1841809727">countless updates</a> you have to make to your changes to accommodate the ever-changing project shifting beneath your feet as you simultaneously try to make progress on <a href="https://github.com/trinodb/trino/pull/17909#pullrequestreview-1793724311">suggestions from those reviewing your code</a>.</p>

<p>Many critics of open source like to point this out as a drawback, when in fact, this same problem exists in closed source systems. Closed source projects can generally delay difficult decisions to make fast upfront progress to meet certain deadlines. This may be seen as an advantage at first, but as many developers can attest, this simply leads to technical debt and fragile products in most environments that struggle to prioritize a healthy codebase.</p>

<p><img src="https://i.snap.as/Oi74UR5y.jpg" alt=""/></p>

<p>Regardless, having to face these larger discussions upfront can induce fatigue, especially when managing external circumstances; personal affairs, a project at work – you know, the entity that pays these engineers – or countless other factors will rear their ugly heads and <a href="https://github.com/trinodb/trino/pull/17909#discussion_r1418149737">progress will stagger</a> with ebbs and flows of attention. This can be really dangerous territory and commonly results in contributors and reviewers abandoning the PR when it stalls.</p>

<p>This is why I believe open source, while not beholden to any timelines, needs a project and product management role, which is currently often covered by project leaders and devex engineers. This can also relieve tension between the needs of open source and big businesses in the community with real deadlines, at least keeping the communication consistent while ensuring bugs and design flaws aren’t introduced to the code base.</p>

<h2 id="what-s-in-it-for-bloomberg-and-forepaas">What’s in it for Bloomberg and ForePaaS?</h2>

<p>If you’ve never worked in open source or for a company that contributes to open source, you may be wondering how the heck these engineers convince their leadership to let them pour so much time into these contributions. The simple answer is: it’s good for business.</p>

<p>If we peep into why Bloomberg uses Trino, they aggregate data from an unusually large number of data sources across the customers who use their services. Part of this requires them to merge a customer’s dataset with existing aggregate data in Bloomberg’s product. Since Trino can connect to most customer databases out-of-the-box, Bloomberg only needs to manage a small array of custom connectors that expose their services to customers as multiple catalogs behind a single convenient SQL endpoint. Having engineers maintain a few small connectors rather than an entire distributed query engine saves a lot of time and maintenance.</p>

<p>Despite how many problems Trino already solves for them, Bloomberg and ForePaaS needed this Snowflake connector, and through the open source model they created it for themselves. The drawback is that the solution must be maintained by the engineers at each company any time they want to upgrade to a new Trino version. This consistently depletes engineering resources, so they want to maintain as few private features as possible to free up their engineers’ time. Open source projects are generally more than happy to accept features that benefit the community. This doesn’t mean we shouldn’t appreciate it when companies contribute. This positive-sum generosity and forward-thinking approach enabled Erik and Teng to combine their battle-tested connectors, creating something of high value for the community.</p>

<p>If you are a developer who sees the value in contributing to open source, and you aren&#39;t sure how to convince leadership to get on board, you need to speak their language. Show how companies like Bloomberg get involved in open source, and how it lowers maintenance costs when done correctly. If you see an open source project like Trino that could replace 97% of a new project, demonstrate that the upfront cost will pay off by reducing the amount of code managed by your team, which lowers the future need to expand headcount. I don’t imagine a world where your boss and colleagues are altruists, but if you present an economic incentive that lowers the amortized cost of the engineers needed to maintain a project, your strategy becomes helpful to the company&#39;s bottom line.</p>

<p>While the immediate investment shows small gains for a single team at a single company, once that change exists in open source, other companies can immediately benefit and offer better testing and improvements than you could have asked for when managing the original project with your own team. Humanity at large benefits from every contribution made this way, and the more companies that embrace this, the less effort we waste pointlessly duplicating work.</p>

<h2 id="esprit-de-corps">Esprit de Corps</h2>

<p>The marines use the mantra “Esprit de Corps,” French for “spirit of the body,” where I mistakenly took the “Corps” part for the Marine Corps rather than the more general meaning of a body or group of people. In fact, it expresses <a href="https://www.merriam-webster.com/dictionary/esprit%20de%20corps">the common spirit existing in the members of a group and inspiring enthusiasm, devotion, and strong regard for the honor of the group</a>. Any time I see this type of shared and selfless cooperation in open source, I’m reminded of the bond, friendships, and care between me and my fellow marines. Despite the unfortunate political circumstances of our mission, I do treasure the shared companionship with both my fellow marines and the local Iraqi people. There is ultimately a power in the gathering of many when aimed at building an altruistic means of improving each other’s lives.</p>

<p><img src="https://i.snap.as/TO03Akr4.jpeg" alt=""/></p>

<p>In the same way, this demonstration of human cooperation is about more than just developing a connector; it&#39;s about the shared experiences, the friendships forged, and the skills honed in the pursuit of a common goal. The successful addition of the Trino Snowflake connector is a testament to the positive-sum outcomes of open source collaboration. This journey has been about collaboration, learning, and growth that will benefit many. I remember the night I got the email that Yuya had <a href="https://github.com/trinodb/trino/pull/17909">merged the pull request</a>; I was ecstatic to say the least. The connector shipped with <a href="https://trino.io/docs/current/release/release-440.html#general">Trino version 440</a>, making connecting to the most widely adopted cloud data warehouse possible.</p>

<p>Once the hard work was done, many valuable iterations like <a href="https://github.com/trinodb/trino/pull/21219">adding Top-N support</a> (Shopee), <a href="https://github.com/trinodb/trino/pull/21365">adding Snowflake Iceberg REST catalog support</a> (Starburst), and <a href="https://github.com/trinodb/trino/pull/21365">adding better type mapping</a> (Apple) were added to the Snowflake integration. I love showcasing this trailblazing and, yes, altruistic work from Erik, Teng, Yuya, Martin, Manfred, and Piotr – and everyone who helped in the Trino community. A special thanks to the managers and leadership at Bloomberg and ForePaaS for their generous commitment of time and resources.</p>

<p>As we celebrate this milestone, we&#39;re already looking forward to the next adventure. Here&#39;s to federating them all, together!</p>

<p>Notes:
<sup><a class="footnote-ref" href="#fn1">1</a></sup><span class="footnote-ref-text">ForePaaS has been integrated into <a href="https://ovhcloud.com">OVHCloud</a>, which is now called <a href="https://help.ovhcloud.com/csm/en-public-cloud-data-platform-what-is?id=kb_article_view&amp;sysparm_article=KB0060801">Data Platform</a>.</span></p>

<p><em>bits</em></p>
]]></content:encoded>
      <guid>https://bitsondata.dev/trino-snowflake-bloomberg-oss-win</guid>
      <pubDate>Wed, 08 May 2024 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Intro to Trino for the Trinewbie</title>
      <link>https://bitsondata.dev/intro-to-trino-for-the-trinewbie?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Learn how to quickly join data across multiple sources&#xA;&#xA;If you haven’t heard of Trino before, it is a query engine that speaks the language of many genres of databases. As such, Trino is commonly used to provide fast ad-hoc queries across heterogeneous data sources. Trino’s initial use case was built around replacing the Hive runtime engine to allow for faster querying of Big Data warehouses and data lakes. This may be the first time you have heard of Trino, but you’ve likely heard of the project from which it was “forklifted”, Presto. If you want to learn more about why the creators of Presto now work on Trino (formerly PrestoSQL) you can read the renaming blog that they produced earlier this year. Before you commit too much to this blog, I’d like to let you know why you should even care about Trino.&#xA;&#xA;!--more--&#xA;&#xA;!--emailsub--&#xA;&#xA;So what is Trino anyways?&#xA;&#xA;The first thing I like to make sure people know about when discussing Trino is that it is a SQL query engine, but not a SQL database. What does that mean? Traditional databases typically consist of a query engine and a storage engine. Trino is just a query engine and does not store data. Instead, Trino interacts with various databases that store their own data in their own formats. Trino parses and analyzes the SQL query you pass in, creates and optimizes a query execution plan that includes the data sources, and then schedules worker nodes that are able to intelligently query the underlying databases they connect to.&#xA;&#xA;I say intelligently, specifically talking about pushdown queries. That’s right, the most intelligent thing for Trino to do is to avoid making more work for itself, and try to offload that work to the underlying database. This makes sense as the underlying databases generally have special indexes and data that are stored in a specific format to optimize the read time. 
It would be silly of Trino to ignore all of that optimized reading capability and do a linear scan of all the data to run the query itself. The goal in most optimizations for Trino is to push down the query to the database and only get back the smallest amount of data needed to join with another dataset from another database, do some further Trino specific processing, or simply return as the correct result set for the query.&#xA;&#xA;Query all the things&#xA;&#xA;So I still have not really answered your question of why you should care about Trino. The short answer is, Trino acts as a single access point to query all the things. Yup. Oh, and it’s super fast at ad-hoc queries over various data sources including data lakes (e.g. Iceberg/Databricks) or data warehouses (e.g. Hive/Snowflake). It has a connector architecture that allows it to speak the language of a whole bunch of databases. If you have a special use case, you can write your own connector that abstracts any database or service away to just be another table in Trino’s domain. Pretty cool right? But that’s actually rarely needed because the most common databases already have a connector written for them. If not, more connectors are getting added by Trino’s open source community every few months.&#xA;&#xA;To make the benefits of running federated queries a bit more tangible, I will present an example. Trino brings users the ability to map standardized ANSI SQL query to query databases that have a custom query DSL like Elasticsearch. With Trino it’s incredibly simple to set up an Elasticsearch catalog and start running SQL queries on it. If that doesn’t blow your mind, let me explain why that’s so powerful.&#xA;&#xA;Imagine you have five different data stores, each with its own independent query language. Your data science or analyst team just wants access to these data stores. 
It would take a ridiculous amount of time for them to have to go to each data system individually, look up the different commands to pull data out of each one, and dump the data into one location and clean it up so that they can actually run meaningful queries. With Trino all they need to use is SQL to access them through Trino. Also, it doesn’t just stop at accessing the data, your data science team is also able to join data across tables of different databases like a search engine like Elasticsearch with an operational database like MySQL. Further, using Trino even enables joining data sources with themselves where joins are not supported, like in Elasticsearch and MongoDB. Did it happen yet? Is your mind blown?&#xA;&#xA;Getting Started with Trino&#xA;&#xA;So what is required to give Trino a test drive? Relative to many open-source database projects, Trino is one of the more simple projects to install, but this still doesn’t mean it is easy. An important element to a successful project is how it adapts to newer users and expands capability for growth and adoption. This really pushes the importance of making sure that there are multiple avenues of entry into using a product all of which have varying levels of difficulty, cost, customizability, interoperability, and scalability. As you increase in the level of customizability, interoperability, and scalability, you will generally see an increase in difficulty or cost and vice versa. Luckily, when you are starting out, you just really need to play with Trino.&#xA;&#xA;Image added by Author&#xA;&#xA;The low-cost and low difficulty way to try out Trino is to use Docker containers. The nice thing about these containers is that you don’t have to really know anything about the installation process of Trino to play around with Trino. While many enjoy poking around documentation and working with Trino to get it set up, it may not be for all. 
I certainly have my days where I prefer a nice chill CLI sesh and other days where I just need to opt-out. If you want to skip to the Easy Button way to deploy Trino (hint, it’s the SaaS deployment) then skip the next few sections.&#xA;&#xA;!--emailsub--&#xA;&#xA;Using Trino With Docker&#xA;&#xA;Trino ships with a Docker image that does a lot of the setup necessary for Trino to run. Outside of simply running a docker container, there are a few things that need to happen for setup. First, in order to use a database like MySQL, we actually need to run a MySQL container as well using the official mysql image. There is a trino-getting-started repository that contains a lot of the setup needed for using Trino on your own computer or setting it up on a test server as a proof of concept. Clone this repository and follow the instructions in the README to install Docker if it is not already installed.&#xA;&#xA;You can actually run a query before learning the specifics of how this compose file works. Before you run the query, you will need to run the mysql and trino-coordinator instances. To do this, navigate to the mysql/trino-mysql/ directory that contains the docker-compose.yml and run:&#xA;&#xA;docker-compose up -d&#xA;&#xA;Running your first query!&#xA;&#xA;Now that you have Trino running in Docker, you need to open a session to access it. The easiest way to do this is via a console. Run the following Docker command to connect to a terminal on the coordinator:&#xA;&#xA;docker container exec -it trino-mysql_trino-coordinator_1 trino&#xA;&#xA;This will bring you to the Trino terminal.&#xA;&#xA;Your first query will actually be to generate data from the tpch catalog and then query the data that was loaded into the mysql catalog. 
In the terminal, run the following two queries:&#xA;&#xA;CREATE TABLE mysql.tiny.customer&#xA;AS SELECT * FROM tpch.tiny.customer;&#xA;&#xA;SELECT custkey, name, nationkey, phone &#xA;FROM mysql.tiny.customer LIMIT 5;&#xA;&#xA;The output should look like this.&#xA;&#xA;|custkey|name              |nationkey|phone          |&#xA;|-------|------------------|---------|---------------|&#xA;|751    |Customer#000000751|0        |10-658-550-2257|&#xA;|752    |Customer#000000752|8        |18-924-993-6038|&#xA;|753    |Customer#000000753|17       |27-817-126-3646|&#xA;|754    |Customer#000000754|0        |10-646-595-5871|&#xA;|755    |Customer#000000755|16       |26-395-247-2207|&#xA;&#xA;Congrats! You just ran your first query on Trino. Did you feel the rush!? Okay well, technically we just copied data from a data generation connector and moved it into a MySQL database and queried that back out. It’s fine if this simple exercise didn’t send goosebumps flying down your spine but hopefully, you can extrapolate the possibilities when connecting to other datasets.&#xA;&#xA;A good initial exercise to study the compose file and directories before jumping into the Trino installation documentation. 
Let’s see how this was possible by breaking down the docker-compose file that you just ran.&#xA;&#xA;version: &#39;3.7&#39;&#xA;services:&#xA;  trino-coordinator:&#xA;    image: &#39;trinodb/trino:latest&#39;&#xA;    hostname: trino-coordinator&#xA;    ports:&#xA;      - &#39;8080:8080&#39;&#xA;    volumes:&#xA;      - ./etc:/etc/trino&#xA;    networks:&#xA;      - trino-network&#xA;&#xA;  mysql:&#xA;    image: mysql:latest&#xA;    hostname: mysql&#xA;    environment:&#xA;      MYSQL_ROOT_PASSWORD: admin&#xA;      MYSQL_USER: admin&#xA;      MYSQL_PASSWORD: admin&#xA;      MYSQL_DATABASE: tiny&#xA;    ports:&#xA;      - &#39;3306:3306&#39;&#xA;    networks:&#xA;      - trino-network&#xA;networks:&#xA;  trino-network:&#xA;    driver: bridge&#xA;&#xA;Notice that the hostname of mysql matches the instance name, and the mysql instance is on the trino-network that the trino-coordinator instance will also join. Also notice that the mysql image exposes port 3306 on the network.&#xA;&#xA;Finally, we use the trinodb/trino image for the trino-coordinator instance, and use the volumes option to map our local custom configurations for Trino to the /etc/trino directory discussed further down in the Trino Configuration section. Trino is also added to the trino-network and exposes port 8080, which is how external clients can access Trino. The full configurations can be found in this getting started with Trino repository.&#xA;&#xA;These instructions are a basic overview of the more complete installation instructions if you’re really going for it! If you’re not that interested in the installation, feel free to skip ahead to the Deploying Trino at Scale with Kubernetes section. If you’d rather not deal with Kubernetes I offer you another pass to the easy button section of this blog.&#xA;&#xA;Trino requirements&#xA;&#xA;The first requirement is that Trino must be run on a POSIX-compliant system such as Linux or Unix. 
There are some folks in the community that have gotten Trino to run on Windows for testing using runtime environments like cygwin but this is not supported officially. However, in our world of containerization, this is less of an issue and you will be able to at least test this on Docker no matter which operating system you use.&#xA;&#xA;Trino is written in Java and so it requires the Java Runtime Environment (JRE). Trino requires a 64-bit version of Java 11, with a minimum required version of 11.0.7. Newer patch versions such as 11.0.8 or 11.0.9 are recommended. The launch scripts for Trino bin/launcher, also require python version 2.6.x, 2.7.x, or 3.x.&#xA;&#xA;Trino Configuration&#xA;&#xA;To configure Trino, you need to first know the Trino configuration directory. If you were installing Trino by hand, the default would be in a etc/ directory relative to the installation directory. For our example, I’m going to use the default installation directory of the Trino Docker image, which is set in the run-trino script as /etc/trino. We need to create four files underneath this base directory. I will describe what these files do and you can see an example in the docker image I have created below.&#xA;&#xA;config.properties — This is the primary configuration for each node in the trino cluster. There are plenty of options that can be set here, but you’ll typically want to use the default settings when testing. The required configurations include indicating if the node is the coordinator, setting the http port that Trino communicates on, and the discovery node url so that Trino servers can find each other.&#xA;&#xA;jvm.config — This configuration contains the command line arguments you will pass down to the java process that runs Trino.&#xA;&#xA;log.properties — This configuration is helpful to indicate the log levels of various java classes in Trino. 
It can be left empty to use the default log level for all classes.&#xA;&#xA;node.properties — This configuration is used to uniquely identify nodes in the cluster and specify locations of directories in the node.&#xA;&#xA;The next directory you need to know about is the catalog/ directory, located in the root configuration directory. In the docker container, it will be in /etc/trino/catalog. This is the directory that will contain the catalog configurations that Trino will use to connect to the different data sources. For our example, we’ll configure two catalogs, the mysql catalog, and the tpch catalog. The tpch catalog is a simple data generation catalog that simply needs the connector.name property to be configured and is located in /etc/trino/catalog/tpch.properties.&#xA;&#xA;tpch.properties&#xA;&#xA;connector.name=tpch&#xA;&#xA;The mysql catalog just needs the connector.name to specify which connector plugin to use, the connection-url property to point to the mysql instance, and the connection-user and connection-password properties for the mysql user.&#xA;&#xA;mysql.properties&#xA;&#xA;connector.name=mysql&#xA;connection-url=jdbc:mysql://mysql:3306&#xA;connection-user=root&#xA;connection-password=admin&#xA;&#xA;Note: the name of the configuration file becomes the name of the catalog in Trino. If you are familiar with MySQL, you are likely to know that MySQL supports a two-tiered containment hierarchy, though you may have never known it was called that. This containment hierarchy refers to databases and tables. The first tier of the hierarchy is the tables, while the second tier consists of databases. A database contains multiple tables and therefore two tables can have the same name provided they live under a different database.&#xA;&#xA;Image by Author&#xA;&#xA;Since Trino has to connect to multiple databases, it supports a three-tiered containment hierarchy. Rather than calling the second tier databases, Trino refers to this tier as schemas. 
So a database in MySQL is equivalent to a schema in Trino. The third tier, catalogs, allows Trino to distinguish between multiple underlying data sources. Since the file provided to Trino is called mysql.properties it automatically names the catalog mysql, dropping the .properties file extension. To query the customer table in MySQL under the tiny database, you specify the table name mysql.tiny.customer.&#xA;&#xA;If you’ve reached this far, congratulations, you now know how to set up catalogs and query them through Trino! The benefits at this point should be clear, and making a proof of concept is easy to do this way. It’s time to put together that proof of concept for your team and your boss! What next though? How do you actually get this deployed in a reproducible and scalable manner? The next section covers a brief overview of faster ways to get Trino deployed at scale.&#xA;&#xA;!--emailsub--&#xA;&#xA;Deploying Trino at Scale with Kubernetes&#xA;&#xA;Up to this point, this post only describes the deployment process. What about after that, once you’ve deployed Trino to production and slowly onboard engineering, BI/Analytics, and your data science teams? As many Trino users have experienced, the demand on your Trino cluster grows quickly as it becomes the single point of access to all of your data. This is where these small proof-of-concept size installations start to fall apart and you will need something more pliable that scales as your system starts to take on heavier workloads.&#xA;&#xA;You will need to monitor your cluster and will likely need to stand up other services that run these monitoring tasks. This also applies to running other systems for security and authentication management. This list of complexity grows as you consider that all of these systems need to scale and adapt around the growing Trino clusters. 
You may, for instance, consider deploying multiple clusters to handle different workloads, or possibly running tens or hundreds of Trino clusters to provide a self-service platform to provide isolated tenancy in your platform.&#xA;&#xA;The solution to express all of these complex scenarios as the configuration is already solved by using an orchestration platform like Kubernetes, and its package manager project, Helm. Kubernetes offers a powerful way to express all the complex adaptable infrastructures based on your use cases.&#xA;&#xA;In the interest of brevity, I will not include the full set of instructions on how to run a helm chart or cover the basics of running Trino on Kubernetes. Rather, I will refer you to an episode of Trino Community Broadcast that discusses Kubernetes, the community helm chart, and the basics of running Trino on Kubernetes. In the interest of transparency, the official Trino helm charts are still in an early phase of development. There is a very popular community-contributed helm chart that is adapted by many users to suit their needs and it is currently the best open source option for self-managed deployments of Trino. If you decide to take this route, proceed with caution and know that there is development to support the helm deployments moving forward.&#xA;&#xA;While this will provide all the tools to enable a well-suited engineering department to run and maintain their own Trino cluster, this begs the question, based on your engineering team size, should you and your company be investing costly data engineer hours into maintaining, scaling, and hacking required to keep a full-size production infrastructure afloat?&#xA;&#xA;Starburst Galaxy: The Easy Button method of deploying and maintaining Trino&#xA;&#xA;Full Disclosure: This blog post was originally written while I was working at Starburst. 
I still stand by Starburst Galaxy as one of the better options, but I will add the caveat that it depends on your use case, and things change, so reach out if you need my latest thoughts on the matter. That said, Galaxy is the general purpose version of Trino the creators never got to build at Facebook. If you have custom features you&#39;d like to contribute, a common pattern is to run an open source cluster for testing while production is run on Starburst. You can then test and develop features to contribute to open source that will eventually make their way upstream to Galaxy, Athena, or any other Trino variant.&#xA;&#xA;Image By: lostvegas, License: CC BY-NC-ND 2.0&#xA;&#xA;As mentioned, Trino has a relatively simple deployment setup, with an emphasis on relatively. This blog really only hits the tip of the iceberg when it comes to the complexity involved in managing and scaling Trino. While it is certainly possible to manage running Trino and even do so at scale with helm charts in Kubernetes, it is still a difficult setup for Trinewbies and difficult to maintain and scale for those who already have experience maintaining Trino. I experienced firsthand many of these difficulties myself when I began my Trino journey years ago and started on my own quest to help others overcome some of these challenges. This is what led me to cross paths with Starburst, the company behind the SaaS Trino platform Galaxy.&#xA;&#xA;Galaxy makes Trino accessible to companies having difficulties scaling and customizing Trino to their needs. Unless you are in a company that houses a massive data platform and you have dedicated data and DevOps engineers for each system in your platform, many of these options won’t be feasible for you in the long run.&#xA;&#xA;One thing to make clear is that a Galaxy cluster is really just a Trino cluster on demand. 
Outside of managing the scaling policies, to avoid any surprises on your cloud bill, you really don’t have to think about scaling Trino up or down, or suspending it when it is not in use. The beautiful thing about Trino, and therefore Galaxy, is that it is an ephemeral compute engine, much like AWS Lambda, that you can quickly spin up or down. Not only are you able to run ad-hoc and federated queries over disparate data sources, but now you can also run the infrastructure for those queries on demand with almost no cost to your engineering team’s time.&#xA;&#xA;Getting Started With Galaxy&#xA;&#xA;Here’s a quick getting started guide for Starburst Galaxy that mirrors the setup we realized with the Docker example above with Trino and MySQL.&#xA;&#xA;Set up a trial of Galaxy by filling in your information at the bottom of the Galaxy information page.&#xA;Once you receive a link, you will see the sign-up screen. Fill out the email address, enter the pin sent to the email, and choose the domain for your cluster.&#xA;The rest of the tutorial is provided in the video below, which gives a basic demo of what you’ll need to do to get started.&#xA;&#xA;This introduction may feel a bit underwhelming, but extrapolate to being able to run federated queries across your relational databases like MySQL, a data lake storing data in S3, or soon data in many NoSQL and real-time data stores. The true power of Starburst Galaxy is that your team will no longer need to dedicate a giant backlog of tickets aimed at scaling up and down, monitoring, and securing Trino. Rather, you can return to focusing on the business problems and the best model for the data in your domain.&#xA;&#xA;trino&#xA;&#xA;!--emailsub--&#xA;]]&gt;</description>
<content:encoded><![CDATA[<h2 id="learn-how-to-quickly-join-data-across-multiple-sources">Learn how to quickly join data across multiple sources</h2>

<p>If you haven’t heard of Trino before, it is a query engine that speaks the language of many genres of databases. As such, Trino is commonly used to provide fast ad-hoc queries across heterogeneous data sources. Trino’s initial use case was built around replacing the Hive runtime engine to allow for faster querying of Big Data warehouses and data lakes. This may be the first time you have heard of <a href="https://trino.io/">Trino</a>, but you’ve likely heard of the project from which it was <a href="https://venturebeat.com/2021/08/27/who-owns-open-source-projects-people-or-companies/">“forklifted”</a>, Presto. If you want to learn more about <a href="https://trino.io/blog/2020/12/27/announcing-trino.html">why the creators of Presto now work on Trino (formerly PrestoSQL)</a> you can read the renaming blog that they produced earlier this year. Before you commit too much to this blog, I’d like to let you know why you should even care about Trino.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe91ba54f-4f2c-4516-99fc-59c5c7cd8fd0_512x241.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe91ba54f-4f2c-4516-99fc-59c5c7cd8fd0_512x241.png" alt=""/></a></p>





<h3 id="so-what-is-trino-anyways">So what is Trino anyways?</h3>

<p>The first thing I like to make sure people know about when discussing Trino is that it is a SQL query engine, but not a SQL database. What does that mean? Traditional databases typically consist of a query engine and a storage engine. Trino is just a query engine and does not store data. Instead, Trino interacts with various databases that store their own data in their own formats. Trino parses and analyzes the SQL query you pass in, creates and optimizes a query execution plan that includes the data sources, and then schedules worker nodes that are able to intelligently query the underlying databases they connect to.</p>

<p>I say intelligently, specifically referring to pushdown queries. That’s right, the most intelligent thing for Trino to do is to avoid making more work for itself, and try to offload that work to the underlying database. This makes sense as the underlying databases generally have special indexes and data that are stored in a specific format to optimize the read time. It would be silly of Trino to ignore all of that optimized reading capability and do a linear scan of all the data to run the query itself. The goal in most optimizations for Trino is to push down the query to the database and only get back the smallest amount of data needed to join with another dataset from another database, do some further Trino-specific processing, or simply return it as the correct result set for the query.</p>
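
<p>To check whether pushdown is happening, you can inspect the query plan with <code>EXPLAIN</code>. As a small sketch, assuming a catalog named <code>mysql</code> backed by the MySQL connector (the exact plan text varies by Trino version):</p>

<pre><code>-- When the connector supports predicate pushdown, the WHERE clause
-- is folded into the table scan instead of appearing as a separate
-- filter step above it in the plan.
EXPLAIN
SELECT custkey, name
FROM mysql.tiny.customer
WHERE nationkey = 8;
</code></pre>

<p>When the predicate is pushed down, MySQL evaluates the filter itself, so only the matching rows travel over the wire to Trino.</p>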

<h4 id="query-all-the-things">Query all the things</h4>

<p>So I still have not really answered your question of why you should care about Trino. The short answer is, Trino acts as a single access point to query all the things. Yup. Oh, and it’s super fast at ad-hoc queries over various data sources including data lakes (e.g. Iceberg/Databricks) or data warehouses (e.g. Hive/Snowflake). It has a <a href="https://trino.io/docs/current/develop/connectors.html">connector architecture</a> that allows it to speak the language of <a href="https://trino.io/docs/current/connector.html">a whole bunch of databases</a>. If you have a special use case, you can write your own connector that abstracts any database or service away to just be another table in Trino’s domain. Pretty cool right? But that’s actually rarely needed because the most common databases already have a connector written for them. If not, <a href="https://github.com/trinodb/trino/issues/4500">more connectors are getting added by Trino’s open source community every few months</a>.</p>

<p>To make the benefits of running federated queries a bit more tangible, I will present an example. Trino gives users the ability to run <a href="https://trino.io/docs/current/language.html">standardized ANSI SQL</a> queries against databases that have a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html">custom query DSL, like Elasticsearch</a>. With Trino it’s incredibly simple to set up an Elasticsearch catalog and start running SQL queries on it. If that doesn’t blow your mind, let me explain why that’s so powerful.</p>

<p>Imagine you have five different data stores, each with its own independent query language. Your data science or analyst team just wants access to these data stores. It would take a ridiculous amount of time for them to go to each data system individually, look up the different commands to pull data out of each one, dump the data into one location, and clean it up so that they can actually run meaningful queries. With Trino, all they need is SQL. And it doesn’t stop at accessing the data: your data science team can also join data across tables in different databases, such as a search engine like Elasticsearch with an operational database like MySQL. Further, Trino even enables joining a data source with itself in systems that do not support joins natively, like Elasticsearch and MongoDB. Did it happen yet? Is your mind blown?</p>
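<p>To make that concrete, here is a hedged sketch of a cross-source join. The elasticsearch catalog, its logs index, and its fields are hypothetical and only for illustration; mysql.tiny.customer is the table created later in this post.</p>

<pre><code>SELECT c.name, count(*) AS error_count
FROM elasticsearch.default.logs l
JOIN mysql.tiny.customer c
  ON l.custkey = c.custkey
WHERE l.level = 'ERROR'
GROUP BY c.name;
</code></pre>

<p>To the query author this is just ANSI SQL; behind the scenes Trino translates the Elasticsearch side into the appropriate DSL calls and the MySQL side into JDBC queries.</p>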

<h3 id="getting-started-with-trino">Getting Started with Trino</h3>

<p>So what is required to give Trino a test drive? Relative to many open-source database projects, Trino is one of the simpler ones to install, but that still doesn’t mean it is easy. An important element of a successful project is how it adapts to newer users and expands capability for growth and adoption. This underscores the importance of offering multiple avenues of entry into a product, with varying levels of difficulty, cost, customizability, interoperability, and scalability. As customizability, interoperability, and scalability increase, you will generally see an increase in difficulty or cost, and vice versa. Luckily, when you are starting out, you just need to play with Trino.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9876726b-409f-4e0d-a768-967bba0abe9e_600x390.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9876726b-409f-4e0d-a768-967bba0abe9e_600x390.png" alt=""/></a></p>

<p>Image added by Author</p>

<p>The low-cost, low-difficulty way to try out Trino is to use <a href="https://www.docker.com/">Docker containers</a>. The nice thing about these containers is that you don’t have to know anything about the installation process of Trino to play around with it. While many enjoy poking around documentation and working through a Trino setup, it may not be for everyone. I certainly have my days where I prefer a nice chill CLI sesh and other days where I just need to opt out. If you want to skip to the Easy Button way to deploy Trino (hint, it’s the SaaS deployment), then skip the next few sections.</p>



<h4 id="using-trino-with-docker">Using Trino With Docker</h4>

<p>Trino ships with <a href="https://hub.docker.com/r/trinodb/trino">a Docker image</a> that does a lot of the setup necessary for Trino to run. Beyond simply running the Trino container, a few other things need to happen for setup. First, in order to use a database like MySQL, we need to run a MySQL container as well, using the official mysql image. There is <a href="https://github.com/bitsondatadev/trino-getting-started">a trino-getting-started repository</a> that contains a lot of the setup needed for using Trino on your own computer or setting it up on a test server as a proof of concept. Clone this repository and follow the instructions in the README to install Docker if it is not already installed.</p>

<p>You can actually run a query before learning the specifics of how this compose file works. Before you run the query, you will need to run the mysql and trino-coordinator instances. To do this, navigate to the mysql/trino-mysql/ directory that contains the docker-compose.yml and run:</p>

<pre><code>docker-compose up -d
</code></pre>

<h4 id="running-your-first-query">Running your first query!</h4>

<p>Now that you have Trino running in Docker, you need to open a session to access it. The easiest way to do this is via a console. Run the following Docker command to connect to a terminal on the coordinator:</p>

<pre><code>docker container exec -it trino-mysql_trino-coordinator_1 trino
</code></pre>

<p>This will bring you to the Trino terminal.</p>

<pre><code>trino&gt;
</code></pre>

<p>Your first query will actually be to generate data from the tpch catalog and then query the data that was loaded into the mysql catalog. In the terminal, run the following two queries:</p>

<pre><code>CREATE TABLE mysql.tiny.customer
AS SELECT * FROM tpch.tiny.customer;
</code></pre>

<pre><code>SELECT custkey, name, nationkey, phone 
FROM mysql.tiny.customer LIMIT 5;
</code></pre>

<p>The output should look like this.</p>

<pre><code>|custkey|name              |nationkey|phone          |
|-------|------------------|---------|---------------|
|751    |Customer#000000751|0        |10-658-550-2257|
|752    |Customer#000000752|8        |18-924-993-6038|
|753    |Customer#000000753|17       |27-817-126-3646|
|754    |Customer#000000754|0        |10-646-595-5871|
|755    |Customer#000000755|16       |26-395-247-2207|
</code></pre>

<p>Congrats! You just ran your first query on Trino. Did you feel the rush!? Okay well, technically we just copied data from a data generation connector and moved it into a MySQL database and queried that back out. It’s fine if this simple exercise didn’t send goosebumps flying down your spine but hopefully, you can extrapolate the possibilities when connecting to other datasets.</p>
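<p>If you want a small taste of actual federation with this same setup, you can join the MySQL copy of the data back against a table that still lives in the tpch catalog, all in one query (a sketch; tpch.tiny.nation ships with the data generation connector):</p>

<pre><code>SELECT n.name AS nation, count(*) AS customers
FROM mysql.tiny.customer c
JOIN tpch.tiny.nation n ON c.nationkey = n.nationkey
GROUP BY n.name
ORDER BY customers DESC;
</code></pre>

<p>One side of the join is served by MySQL and the other by the data generation connector, yet the query reads like it is hitting a single database.</p>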

<p>A good initial exercise is to study the compose file and directories before jumping into the Trino installation documentation. Let’s see how this was possible by breaking down the docker-compose file that you just ran.</p>

<pre><code>version: &#39;3.7&#39;
services:
  trino-coordinator:
    image: &#39;trinodb/trino:latest&#39;
    hostname: trino-coordinator
    ports:
      - &#39;8080:8080&#39;
    volumes:
      - ./etc:/etc/trino
    networks:
      - trino-network

  mysql:
    image: mysql:latest
    hostname: mysql
    environment:
      MYSQL_ROOT_PASSWORD: admin
      MYSQL_USER: admin
      MYSQL_PASSWORD: admin
      MYSQL_DATABASE: tiny
    ports:
      - &#39;3306:3306&#39;
    networks:
      - trino-network
networks:
  trino-network:
    driver: bridge
</code></pre>

<p>Notice that the hostname of mysql matches the instance name, and the mysql instance is on the trino-network that the trino-coordinator instance will also join. Also notice that the mysql image exposes port 3306 on the network.</p>

<p>Finally, we use the trinodb/trino image for the trino-coordinator instance, and use the volumes option to map our local custom configurations for Trino to the /etc/trino directory discussed further down in the <em>Trino Configuration</em> section. Trino is also added to the trino-network and exposes port 8080, which is how external clients access Trino. The full configurations can be found in this <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/mysql/trino-mysql">getting started with Trino repository</a>.</p>

<p>These instructions are a basic overview of <a href="https://trino.io/docs/current/installation/deployment.html">the more complete installation instructions</a> if you’re really going for it! If you’re not that interested in the installation, feel free to skip ahead to the Deploying Trino at Scale with Kubernetes section. If you’d rather not deal with Kubernetes I offer you another pass to the easy button section of this blog.</p>

<h4 id="trino-requirements">Trino requirements</h4>

<p>The first requirement is that Trino must be run on a POSIX-compliant system such as Linux or Unix. There are some folks in the community that have gotten Trino to run on Windows for testing using runtime environments like cygwin but this is not supported officially. However, in our world of containerization, this is less of an issue and you will be able to at least test this on <a href="https://www.docker.com/">Docker</a> no matter which operating system you use.</p>

<p>Trino is written in Java and so it requires a Java Runtime Environment (JRE). Trino requires a 64-bit version of Java 11, with a minimum version of 11.0.7. Newer patch versions such as 11.0.8 or 11.0.9 are recommended. The Trino launch script, bin/launcher, also requires Python version 2.6.x, 2.7.x, or 3.x.</p>

<h4 id="trino-configuration">Trino Configuration</h4>

<p>To configure Trino, you need to first know the Trino configuration directory. If you were installing Trino by hand, the default would be an etc/ directory relative to the installation directory. For our example, I’m going to use the default installation directory of the <a href="https://hub.docker.com/r/trinodb/trino">Trino Docker image</a>, which is <a href="https://github.com/trinodb/trino/blob/356/core/docker/bin/run-trino#L15">set in the run-trino script</a> as /etc/trino. We need to create four files underneath this base directory. I will describe what these files do and you can see an example in the docker image I have created below.</p>
<ol><li><p>config.properties — This is the primary configuration for each node in the trino cluster. There are plenty of options that can be set here, but you’ll typically want to use the default settings when testing. The required configurations include indicating if the node is the coordinator, setting the http port that Trino communicates on, and the discovery node url so that Trino servers can find each other.</p></li>

<li><p>jvm.config — This configuration contains the command line arguments you will pass down to the java process that runs Trino.</p></li>

<li><p>log.properties — This configuration is helpful to indicate the log levels of various java classes in Trino. It can be left empty to use the default log level for all classes.</p></li>

<li><p>node.properties — This configuration is used to uniquely identify nodes in the cluster and specify locations of directories in the node.</p></li></ol>
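<p>For reference, a minimal single-node config.properties, following the single-machine example in the Trino deployment docs, looks roughly like this (the port and URI here are assumptions to adjust for your environment):</p>

<pre><code>coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://localhost:8080
</code></pre>

<p>Setting node-scheduler.include-coordinator=true lets the coordinator also perform worker duty, which is convenient for testing but not recommended for production clusters.</p>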

<p>The next directory you need to know about is the catalog/ directory, located in the root configuration directory. In the docker container, it will be in /etc/trino/catalog. This is the directory that will contain the catalog configurations that Trino will use to connect to the different data sources. For our example, we’ll configure two catalogs, the mysql catalog, and the tpch catalog. The tpch catalog is a simple data generation catalog that only needs the connector.name property to be configured, and it is located at /etc/trino/catalog/tpch.properties.</p>

<p>tpch.properties</p>

<pre><code>connector.name=tpch
</code></pre>

<p>The mysql catalog just needs the connector.name to specify which connector plugin to use, the connection-url property to point to the mysql instance, and the connection-user and connection-password properties for the mysql user.</p>

<p>mysql.properties</p>

<pre><code>connector.name=mysql
connection-url=jdbc:mysql://mysql:3306
connection-user=root
connection-password=admin
</code></pre>

<p>Note: the name of the configuration file becomes the name of the catalog in Trino. If you are familiar with MySQL, you likely know that MySQL supports a two-tiered containment hierarchy, though you may have never known it was called that. This containment hierarchy refers to databases and tables. The first tier of the hierarchy consists of <em>databases</em>, while the second tier consists of <em>tables</em>. A database contains multiple tables, and therefore two tables can have the same name provided they live under different databases.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f2b877-50b9-4d11-b064-f5ae0b8323db_800x450.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f2b877-50b9-4d11-b064-f5ae0b8323db_800x450.png" alt=""/></a></p>

<p>Image by Author</p>

<p>Since Trino has to connect to multiple databases, it supports a three-tiered containment hierarchy. Rather than calling the second tier databases, Trino refers to this tier as <em>schemas</em>. So a database in MySQL is equivalent to a schema in Trino. The additional tier that allows Trino to distinguish between multiple underlying data sources is made up of <em>catalogs</em>. Since the file provided to Trino is called mysql.properties, it automatically names the catalog mysql, dropping the .properties extension. To query the customer table in MySQL under the tiny schema, you specify the following table name: mysql.tiny.customer.</p>
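<p>Once the catalogs are in place, you can explore this three-tiered hierarchy directly from the Trino CLI; each statement below peels back one tier:</p>

<pre><code>SHOW CATALOGS;
SHOW SCHEMAS FROM mysql;
SHOW TABLES FROM mysql.tiny;
SELECT * FROM mysql.tiny.customer LIMIT 5;
</code></pre>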

<p>If you’ve reached this far, congratulations, you now know how to set up catalogs and query them through Trino! The benefits at this point should be clear, and making a proof of concept is easy to do this way. It’s time to put together that proof of concept for your team and your boss! What next though? How do you actually get this deployed in a reproducible and scalable manner? The next section covers a brief overview of faster ways to get Trino deployed at scale.</p>



<h3 id="deploying-trino-at-scale-with-kubernetes">Deploying Trino at Scale with Kubernetes</h3>

<p>Up to this point, this post has only described the deployment process. But what about after that, once you’ve deployed Trino to production and slowly onboard engineering, BI/analytics, and data science teams? As many Trino users have experienced, demand on your Trino cluster grows quickly as it becomes the single point of access to all of your data. This is where these small proof-of-concept installations start to fall apart, and you will need something more pliable that scales as your system takes on heavier workloads.</p>

<p>You will need to monitor your cluster and will likely need to stand up other services that run these monitoring tasks. The same applies to running other systems for security and authentication management. This list grows in complexity as you consider that all of these systems need to scale and adapt around the growing Trino clusters. You may, for instance, consider deploying <a href="https://shopify.engineering/faster-trino-query-execution-infrastructure">multiple clusters to handle different workloads</a>, or possibly running tens or hundreds of Trino clusters to provide a self-service platform with isolated tenancy.</p>

<p>Expressing all of these complex scenarios as configuration is a problem already solved by an orchestration platform like Kubernetes and its package manager project, Helm. Kubernetes offers a powerful way to express all of the complex, adaptable infrastructure your use cases require.</p>

<p>In the interest of brevity, I will not include the full set of instructions on how to run a helm chart or cover the basics of running Trino on Kubernetes. Rather, I will refer you to <a href="https://trino.io/episodes/24.html">an episode of Trino Community Broadcast</a> that discusses Kubernetes, the community helm chart, and the basics of running Trino on Kubernetes. In the interest of transparency, <a href="https://github.com/trinodb/charts">the official Trino helm charts</a> are still in an early phase of development. There is a very popular <a href="https://github.com/valeriano-manassero/helm-charts/tree/main/valeriano-manassero/trino">community-contributed helm chart</a> that is adapted by many users to suit their needs and it is currently the best open source option for self-managed deployments of Trino. If you decide to take this route, proceed with caution and know that there is <a href="https://github.com/trinodb/charts/pull/11">development to support the helm deployments</a> moving forward.</p>
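<p>For orientation, installing the official chart boils down to a couple of commands. This is a sketch based on the trinodb/charts repository at the time of writing; the release name here is arbitrary, and the chart’s defaults will almost certainly need overriding for any real deployment.</p>

<pre><code>helm repo add trino https://trinodb.github.io/charts
helm install example-trino-cluster trino/trino
</code></pre>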

<p>While this provides all the tools for a well-staffed engineering department to run and maintain its own Trino cluster, it begs the question: given your engineering team’s size, should you and your company be investing costly data engineering hours into the maintaining, scaling, and hacking required to keep a full-size production infrastructure afloat?</p>

<h3 id="starburst-galaxy-the-easy-button-method-of-deploying-and-maintaining-trino">Starburst Galaxy: The Easy Button method of deploying and maintaining Trino</h3>

<p><em>Full Disclosure:</em> This blog post was originally written while I was working at Starburst. I still stand by Starburst Galaxy as one of the better options, but I will add the caveat that it depends on your use case, and things change, so reach out if you need my latest thoughts on the matter. That said, Galaxy is the general-purpose version of Trino the creators never got to build at Facebook. If you have custom features you’d like to contribute, a common pattern is to run an open source cluster for testing while production runs on Starburst. You can then test and develop features to contribute to open source that will eventually make their way into Galaxy, Athena, or any other Trino variant.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf49db7c-b00d-4291-b010-3835451379d6_800x572.jpeg"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf49db7c-b00d-4291-b010-3835451379d6_800x572.jpeg" alt=""/></a></p>

<p>Image By: lostvegas, License: CC BY-NC-ND 2.0</p>

<p>As mentioned, Trino has a <em>relatively</em> simple deployment setup, with an emphasis on relatively. This blog really only hits the tip of the iceberg when it comes to the complexity involved in managing and scaling Trino. While it is certainly possible to run Trino, and even do so at scale with helm charts in Kubernetes, it is still a difficult setup for Trinewbies, and difficult to maintain and scale for those who already have experience maintaining Trino. I experienced many of these difficulties firsthand when I began my Trino journey years ago, and I started on my own quest to help others overcome some of these challenges. This is what led me to cross paths with Starburst, the company behind the SaaS Trino platform, Galaxy.</p>

<p>Galaxy makes Trino accessible to companies having difficulties scaling and customizing Trino to their needs. Unless you are in a company that houses a massive data platform and you have dedicated data and DevOps engineers to each system in your platform, many of these options won’t be feasible for you in the long run.</p>

<p>One thing to make clear is that a Galaxy cluster is really just a Trino cluster on demand. Outside of managing the scaling policies to avoid any surprises on your cloud bill, you really don’t have to think about scaling Trino up or down, or suspending it when it is not in use. The beautiful thing about Trino, and therefore Galaxy, is that it is an ephemeral compute engine, much like AWS Lambda, that you can quickly spin up or down. Not only are you able to run ad-hoc and federated queries over disparate data sources, but now you can also run the infrastructure for those queries on demand with almost no cost to your engineering team’s time.</p>

<h4 id="getting-started-with-galaxy">Getting Started With Galaxy</h4>

<p>Here’s a quick getting started guide with Starburst Galaxy that mirrors the setup from the Docker example above with Trino and MySQL.</p>
<ul><li>Set up a trial of Galaxy by filling in your information at the bottom of the <a href="http://starburst.io/galaxy">Galaxy information page</a>.</li>
<li>Once you receive a link, you will see this sign-up screen. Fill out the email address, enter the pin sent to the email, and choose the domain for your cluster.</li>
<li>The rest of the tutorial is provided in the video below, which gives a basic demo of what you’ll need to do to get started.</li></ul>

<p>This introduction may feel a bit underwhelming but extrapolate being able to run federated queries across your relational databases like MySQL, a data lake storing data in S3, or soon data in many NoSQL and real-time data stores. The true power of Starburst Galaxy is that now your team will no longer need to dedicate a giant backlog of tickets aimed at scaling up and down, monitoring, and securing Trino. Rather you can return to focus on the business problems and the best model for the data in your domain.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/intro-to-trino-for-the-trinewbie</guid>
      <pubDate>Fri, 17 Dec 2021 18:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice IV: Deep dive into Iceberg internals</title>
      <link>https://bitsondata.dev/trino-iceberg-iv-deep-dive?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;So far, this series has covered some very interesting user level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting some files that result from various operations carried out using Trino. To dissect you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool and Iceberg’s core library. It’s useful to dissect how these files work, not only to help understand how Iceberg works, but also to aid in troubleshooting issues, should you have any issues during ingestion or querying of your Iceberg table. I like to think of this type of debugging much like a fun game of operation, and you’re looking to see what causes the red errors to fly by on your screen.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;Understanding Iceberg metadata&#xA;&#xA;Iceberg can use any compatible metastore, but for Trino, it only supports the  Hive metastore and AWS Glue similar to the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that currently use data lakes already use the Hive connector and therefore the Hive metastore. This makes it convenient to have as the leading supported use case as existing users can easily migrate between Hive to Iceberg tables. 
Since there is no indication of which connector is actually executed in the diagram of the Hive connector architecture, it serves as a diagram that can be used for both Hive and Iceberg. The only difference is the connector used, but if you create a table in Hive, you can  view the same table in Iceberg.&#xA;&#xA;To recap the steps taken from the first three blogs; the first blog created an events table, while the first two blogs ran two insert statements. The first insert contained three records, while the second insert contained a single record.&#xA;&#xA;Up until this point, the state of the files in MinIO haven’t really been shown except some of the manifest list pointers from the snapshot in the third blog post. Using the MinIO client tool, you can list files that Iceberg generated through all these operations and then try to understand what purpose they are serving.&#xA;&#xA;% mc tree -f local/&#xA;local/&#xA;└─ iceberg&#xA;   └─ logging.db&#xA;      └─ events&#xA;         ├─ data&#xA;         │  ├─ eventtimeday=2021-04-01&#xA;         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#xA;         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc&#xA;         │  └─ eventtimeday=2021-04-02&#xA;         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#xA;         └─ metadata&#xA;            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#xA;            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#xA;            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#xA;            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#xA;            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#xA;            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#xA;            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#xA;&#xA;There are a lot of files here, but here are a couple of patterns that 
you can observe with these files.&#xA;&#xA;First, the top two directories are named data and metadata.&#xA;&#xA;/bucket/database/table/data//bucket/database/table/metadata/&#xA;&#xA;As you might expect, data contains the actual ORC files split by partition. This is akin to what you would see in a Hive table data directory. What is really of interest here is the metadata directory. There are specifically three patterns of files you’ll find here.&#xA;&#xA;/bucket/database/table/metadata/file-id.avro&#xA;&#xA;/bucket/database/table/metadata/snap-snapshot-id-version-file-id.avro&#xA;&#xA;/bucket/database/table/metadata/version-commit-UUID.metadata.json&#xA;&#xA;Iceberg has a persistent tree structure that manages various snapshots of the data that are created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.&#xA;&#xA;This tree structure contains two types of Avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files and the manifest files themselves point to various data files. This post starts out by covering these manifest files, and later covers the table metadata files that are suffixed by .metadata.json.&#xA;&#xA;The last blog covered the command in Trino that shows the snapshot information that is stored in the metastore. 
Here is that command and its output again for your review.&#xA;&#xA;SELECT manifestlist &#xA;FROM iceberg.logging.&#34;events$snapshots&#34;;&#xA;&#xA;Result:&#xA;&#xA;snapshots&#xA;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#xA;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#xA;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#xA;&#xA;You’ll notice that the Avro files prefixed with snap- are returned. These files are directly correlated with the snapshot record stored in the metastore. According to the diagram above, snapshots are records in the metastore that contain the url of the manifest list in the Avro file. Avro files are binary files and not something you can just open up in a text editor to read. Using the avro-tools.jar tool distributed by the Apache Avro project, you can actually inspect the contents of this file to get a better understanding of how it is used by Iceberg.&#xA;&#xA;The first snapshot is generated on the creation of the events table. Upon inspecting this file, you notice that the file is empty. The output is an empty line that the jq JSON command line utility removes on pretty printing the JSON that is returned, which is just a newline. This snapshot represents an empty state of the table upon creation. To investigate the snapshots you need to download the files to your local filesystem. 
Let&#39;s move them to the home  directory:&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .&#xA;&#xA;Result: (is empty)&#xA;&#xA;The second snapshot is a little more interesting and actually shows us the contents of a manifest list.&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .&#xA;&#xA;Result:&#xA;&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,&#xA;   &#34;manifestlength&#34;:6114,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;addeddatafilescount&#34;:{&#xA;      &#34;int&#34;:2&#xA;   },&#xA;   &#34;existingdatafilescount&#34;:{&#xA;      &#34;int&#34;:0&#xA;   },&#xA;   &#34;deleteddatafilescount&#34;:{&#xA;      &#34;int&#34;:0&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:3&#xA;   },&#xA;   &#34;existingrowscount&#34;:{&#xA;      &#34;long&#34;:0&#xA;   },&#xA;   &#34;deletedrowscount&#34;:{&#xA;      &#34;long&#34;:0&#xA;   }&#xA;}&#xA;&#xA;To understand each of the values in each of these rows, you can refer to the  Iceberg &#xA;specification in the manifest list file section. Instead of covering these exhaustively, let&#39;s focus on a few key fields. 
Below are the fields, and their definition according to the specification.&#xA;&#xA;manifestpath - Location of the manifest file.&#xA;partitionspecid - ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.&#xA;addedsnapshotid - ID of the snapshot where the manifest file was added.&#xA;partitions - A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.&#xA;addedrowscount - Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.&#xA;&#xA;As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tells any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse this tree, you can look over the different manifest paths to find all the manifest files associated with the particular snapshot you want to traverse. Partition spec ids are helpful to know the current partition specification which are stored in the table metadata in the metastore. This references where to find the spec in the metastore. Added snapshot ids tells you which snapshot is associated with the manifest list. Partitions hold some high level partition bound information to make for faster querying. If a query is looking for a particular value, it only traverses the manifest files where the query values fall within the range of the file values. Finally, you get a few metrics like the number of changed rows and data files, one of which is the count of added rows. The first operation consisted of three rows inserts and the second operation was the insertion of one row. 
Using the row counts you can easily determine which manifest file belongs to which operation.&#xA;&#xA;The following command shows the final snapshot after both operations executed and filters out only the fields pointed out above.&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &#39;. | {manifestpath: .manifestpath, partitionspecid: .partitionspecid, addedsnapshotid: .addedsnapshotid, partitions: .partitions, addedrowscount: .addedrowscount }&#39;&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#34;,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:4564366177504223700&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:1&#xA;   }&#xA;}&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   
&#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:3&#xA;   }&#xA;}&#xA;&#xA;In the listing of the manifest file related to the last snapshot, you notice the first operation where three rows were inserted is contained in the manifest file in the second JSON object. You can determine this from the snapshot id, as well as, the number of rows that were added in the operation. The first JSON object contains the last operation that inserted a single row. So the most recent operations are listed in reverse commit order.&#xA;&#xA;The next command does the same listing of the file that you ran with the manifest list, except you run this on the manifest files themselves to expose their contents and discuss them. To begin with, you run the command to show the contents of the manifest file associated with the insertion of three rows.&#xA;&#xA;% java -jar  ~/avro-tools-1.10.0.jar tojson ~/Desktop/avrofiles/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;status&#34;:1,&#xA;   &#34;snapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;datafile&#34;:{&#xA;      &#34;filepath&#34;:&#34;s3a://iceberg/logging.db/events/data/eventtimeday=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#34;,&#xA;      &#34;fileformat&#34;:&#34;ORC&#34;,&#xA;      &#34;partition&#34;:{&#xA;         &#34;eventtimeday&#34;:{&#xA;            &#34;int&#34;:18718&#xA;         }&#xA;      },&#xA;      &#34;recordcount&#34;:1,&#xA;      &#34;filesizeinbytes&#34;:870,&#xA;      &#34;blocksizeinbytes&#34;:67108864,&#xA;      &#34;columnsizes&#34;:null,&#xA;      &#34;valuecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:1&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:1&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:1&#xA;       
     },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:1&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nullvaluecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:0&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nanvaluecounts&#34;:null,&#xA;      &#34;lowerbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;upperbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;keymetadata&#34;:null,&#xA;      &#34;splitoffsets&#34;:null&#xA;   }&#xA;}&#xA;{&#xA;   &#34;status&#34;:1,&#xA;   &#34;snapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;datafile&#34;:{&#xA;      &#34;filepath&#34;:&#34;s3a://iceberg/logging.db/events/data/eventtimeday=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#34;,&#xA;      &#34;fileformat&#34;:&#34;ORC&#34;,&#xA;      &#34;partition&#34;:{&#xA;         &#34;eventtimeday&#34;:{&#xA;            &#34;int&#34;:18719&#xA;         }&#xA;      },&#xA;      
&#34;recordcount&#34;:2,&#xA;      &#34;filesizeinbytes&#34;:1084,&#xA;      &#34;blocksizeinbytes&#34;:67108864,&#xA;      &#34;columnsizes&#34;:null,&#xA;      &#34;valuecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:2&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nullvaluecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:0&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nanvaluecounts&#34;:null,&#xA;      &#34;lowerbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Double oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;upperbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;WARN&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Maybeh oh noes?&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      
&#34;keymetadata&#34;:null,&#xA;      &#34;splitoffsets&#34;:null&#xA;   }&#xA;}&#xA;&#xA;Now this is a very big output, but in summary, there’s really not too much to these files. As before, there is a Manifest section in the Iceberg spec that details what each of these fields means. Here are the important fields:&#xA;&#xA;snapshotid - Snapshot id where the file was added, or deleted if status is two. Inherited when null.&#xA;datafile - Field containing metadata about the data files pertaining to the manifest file, such as file path, partition tuple, metrics, etc…&#xA;datafile.filepath - Full URI for the file with FS scheme.&#xA;datafile.partition - Partition data tuple, schema based on the partition spec.&#xA;datafile.recordcount - Number of records in the data file.&#xA;datafile.count - Multiple fields that contain a map from column id to  number of values, null, nan counts in the file. These can be used to quickly  filter out unnecessary get operations.&#xA;datafile.bounds - Multiple fields that contain a map from column id to lower or upper bound in the column serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.&#xA;&#xA;Each data file struct contains a partition and data file that it maps to. These files only be scanned and returned if the criteria for the query is met when  checking all of the count, bounds, and other statistics that are recorded in the file. Ideally only files that contain data relevant to the query should be scanned at all. Having information like the record count may also help in the query planning process to determine splits and other information. This particular optimization hasn’t been completed yet as planning typically happens before traversal of the files. It is still in ongoing discussion and is discussed a bit by Iceberg creator Ryan Blue in a recent meetup. 
If this is something you are interested in, keep posted on the Slack channel and releases as the Trino Iceberg connector progresses in this area.&#xA;&#xA;As mentioned above, the last set of files that you find in the metadata directory which are suffixed with .metadata.json. These files at baseline are a bit strange as they aren’t stored in the Avro format, but instead the JSON format. This is because they are not part of the persistent tree structure. These files are essentially a copy of the table metadata that is stored in the metastore. You can find the fields for the table metadata listed in the Iceberg specification. These tables are typically stored persistently in a metasture much like the Hive metastore but could easily be replaced by any datastore that can support an atomic swap (check-and-put) operation required for Iceberg to support the optimistic concurrency operation.&#xA;&#xA;The naming of the table metadata includes a table version and UUID: &#xA;table-version-UUID.metadata.json. To commit a new metadata version, which just adds 1 to the current version number, the writer performs these steps:&#xA;&#xA;It creates a new table metadata file using the current metadata.&#xA;It writes the new table metadata to a file following the naming with the next version number.&#xA;It requests the metastore swap the table’s metadata pointer from the old location to the new location.&#xA;&#xA;    If the swap succeeds, the commit succeeded. The new file is now the &#xA;    current metadata.&#xA;    If the swap fails, another writer has already created their own. The&#xA;    current writer goes back to step 1.&#xA;&#xA;If you want to see where this is stored in the Hive metastore, you can reference the TABLEPARAMS table. 
At the time of writing, this is the only method of using the metastore that is supported by the Trino Iceberg connector.&#xA;&#xA;SELECT PARAMKEY, PARAMVALUE FROM metastore.TABLEPARAMS;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthPARAMKEY/ththPARAMVALUE/th/tr&#xA;trtdEXTERNAL/tdtdTRUE/td/tr&#xA;trtdmetadatalocation/tdtds3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json/td/tr&#xA;trtdnumFiles/tdtd2/td/tr&#xA;trtdpreviousmetadatalocation/tdtds3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json/td/tr&#xA;trtdtabletype/tdtdiceberg/td/tr&#xA;trtdtotalSize/tdtd5323/td/tr&#xA;trtdtransientlastDdlTime/tdtd1622865672/td/tr&#xA;/table&#xA;&#xA;So as you can see, the metastore is saying the current metadata location is the&#xA;00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json file. Now you can dive in to see the table metadata that is being used by the Iceberg connector.&#xA;&#xA;% cat ~/Desktop/avrofiles/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;format-version&#34;:1,&#xA;   &#34;table-uuid&#34;:&#34;32e3c271-84a9-4be5-9342-2148c878227a&#34;,&#xA;   &#34;location&#34;:&#34;s3a://iceberg/logging.db/events&#34;,&#xA;   &#34;last-updated-ms&#34;:1622865686323,&#xA;   &#34;last-column-id&#34;:5,&#xA;   &#34;schema&#34;:{&#xA;      &#34;type&#34;:&#34;struct&#34;,&#xA;      &#34;fields&#34;:[&#xA;         {&#xA;            &#34;id&#34;:1,&#xA;            &#34;name&#34;:&#34;level&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;string&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:2,&#xA;            &#34;name&#34;:&#34;eventtime&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;timestamp&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:3,&#xA;            &#34;name&#34;:&#34;message&#34;,&#xA;            
&#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;string&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:4,&#xA;            &#34;name&#34;:&#34;callstack&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:{&#xA;               &#34;type&#34;:&#34;list&#34;,&#xA;               &#34;element-id&#34;:5,&#xA;               &#34;element&#34;:&#34;string&#34;,&#xA;               &#34;element-required&#34;:false&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;partition-spec&#34;:[&#xA;      {&#xA;         &#34;name&#34;:&#34;eventtimeday&#34;,&#xA;         &#34;transform&#34;:&#34;day&#34;,&#xA;         &#34;source-id&#34;:2,&#xA;         &#34;field-id&#34;:1000&#xA;      }&#xA;   ],&#xA;   &#34;default-spec-id&#34;:0,&#xA;   &#34;partition-specs&#34;:[&#xA;      {&#xA;         &#34;spec-id&#34;:0,&#xA;         &#34;fields&#34;:[&#xA;            {&#xA;               &#34;name&#34;:&#34;eventtime_day&#34;,&#xA;               &#34;transform&#34;:&#34;day&#34;,&#xA;               &#34;source-id&#34;:2,&#xA;               &#34;field-id&#34;:1000&#xA;            }&#xA;         ]&#xA;      }&#xA;   ],&#xA;   &#34;default-sort-order-id&#34;:0,&#xA;   &#34;sort-orders&#34;:[&#xA;      {&#xA;         &#34;order-id&#34;:0,&#xA;         &#34;fields&#34;:[&#xA;            &#xA;         ]&#xA;      }&#xA;   ],&#xA;   &#34;properties&#34;:{&#xA;      &#34;write.format.default&#34;:&#34;ORC&#34;&#xA;   },&#xA;   &#34;current-snapshot-id&#34;:4564366177504223943,&#xA;   &#34;snapshots&#34;:[&#xA;      {&#xA;         &#34;snapshot-id&#34;:6967685587675910019,&#xA;         &#34;timestamp-ms&#34;:1622865672882,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;0&#34;,&#xA;            &#34;total-records&#34;:&#34;0&#34;,&#xA;            &#34;total-data-files&#34;:&#34;0&#34;,&#xA;            
&#34;total-delete-files&#34;:&#34;0&#34;,&#xA;            &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#34;&#xA;      },&#xA;      {&#xA;         &#34;snapshot-id&#34;:2720489016575682283,&#xA;         &#34;parent-snapshot-id&#34;:6967685587675910019,&#xA;         &#34;timestamp-ms&#34;:1622865680419,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;added-data-files&#34;:&#34;2&#34;,&#xA;            &#34;added-records&#34;:&#34;3&#34;,&#xA;            &#34;added-files-size&#34;:&#34;1954&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;2&#34;,&#xA;            &#34;total-records&#34;:&#34;3&#34;,&#xA;            &#34;total-data-files&#34;:&#34;2&#34;,&#xA;            &#34;total-delete-files&#34;:&#34;0&#34;,&#xA;            &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#34;&#xA;      },&#xA;      {&#xA;         &#34;snapshot-id&#34;:4564366177504223943,&#xA;         &#34;parent-snapshot-id&#34;:2720489016575682283,&#xA;         &#34;timestamp-ms&#34;:1622865686278,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;added-data-files&#34;:&#34;1&#34;,&#xA;            &#34;added-records&#34;:&#34;1&#34;,&#xA;            &#34;added-files-size&#34;:&#34;746&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;1&#34;,&#xA;            &#34;total-records&#34;:&#34;4&#34;,&#xA;            &#34;total-data-files&#34;:&#34;3&#34;,&#xA;            &#34;total-delete-files&#34;:&#34;0&#34;,&#xA;   
         &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#34;&#xA;      }&#xA;   ],&#xA;   &#34;snapshot-log&#34;:[&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865672882,&#xA;         &#34;snapshot-id&#34;:6967685587675910019&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865680419,&#xA;         &#34;snapshot-id&#34;:2720489016575682283&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865686278,&#xA;         &#34;snapshot-id&#34;:4564366177504223943&#xA;      }&#xA;   ],&#xA;   &#34;metadata-log&#34;:[&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865672894,&#xA;         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#34;&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865680524,&#xA;         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#34;&#xA;      }&#xA;   ]&#xA;}&#xA;&#xA;As you can see, these JSON files can quickly grow as you perform different updates on your table. This file contains a pointer to all of the snapshots and manifest list files, much like the output you found from looking at the snapshots in the table. A really important piece to note is the schema is stored here. This is what Trino uses for validation on inserts and reads. As you may expect, there is the root location of the table itself, as well as a unique table identifier. The final part I’d like to note about this file is the partition-spec and partition-specs fields. The partition-spec field holds the current partition spec, while the partition-specs is an array that can hold a list of all partition specs that have existed for this table. 
As pointed out earlier, you can have many different manifest files that use different partition specs. That wraps up all of the metadata file types you can expect to see in Iceberg!&#xA;&#xA;This post wraps up the Trino on ice series. Hopefully these blog posts serve as a helpful initial dialogue about what is expected to grow as a vital portion of an open data lakehouse stack. What are you waiting for? Come join the fun and help us implement some of the missing features or instead go ahead and try Trino on Ice(berg) yourself!&#xA;&#xA;#trino #iceberg]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>So far, this series has covered some very interesting user-level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting some of the files that result from various operations carried out using Trino. To dissect you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool, and Iceberg’s core library. Dissecting these files is useful not only to understand how Iceberg works, but also to aid in troubleshooting, should you run into issues while ingesting into or querying your Iceberg tables. I like to think of this type of debugging as a fun game of Operation, where you’re looking to see what causes the red errors to fly by on your screen.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p><img src="https://trino.io/assets/blog/trino-on-ice/operation.gif" alt=""/></p>

<h2 id="understanding-iceberg-metadata">Understanding Iceberg metadata</h2>

<p>Iceberg can use any compatible metastore, but the Trino Iceberg connector only supports the Hive metastore and AWS Glue, just like the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that currently run on data lakes already use the Hive connector, and therefore the Hive metastore. This makes it the natural leading use case to support, as existing users can easily migrate from Hive to Iceberg tables. Since the diagram of the Hive connector architecture gives no indication of which connector is actually executing, it serves as a diagram for both Hive and Iceberg. The only difference is the connector used, but if you create a table in Hive, you can view the same table in Iceberg.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metadata.png" alt=""/></p>

<p>To recap the steps taken in the first three blog posts: the first post created an events table, and the first two posts together ran two insert statements. The first insert contained three records, while the second insert contained a single record.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-snapshot-files.png" alt=""/></p>

<p>Up until this point, the state of the files in MinIO haven’t really been shown except some of the manifest list pointers from the snapshot in the third blog post. Using the <a href="https://docs.min.io/minio/baremetal/reference/minio-cli/minio-mc.html">MinIO client tool</a>, you can list files that Iceberg generated through all these operations and then try to understand what purpose they are serving.</p>

<pre><code>% mc tree -f local/
local/
└─ iceberg
   └─ logging.db
      └─ events
         ├─ data
         │  ├─ event_time_day=2021-04-01
         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc
         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc
         │  └─ event_time_day=2021-04-02
         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc
         └─ metadata
            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json
            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json
            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro
            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro
            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro
            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro
            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro
</code></pre>

<p>There are a lot of files here, but here are a couple of patterns that you can observe with these files.</p>

<p>First, the top two directories are named <code>data</code> and <code>metadata</code>.</p>

<pre><code>/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/data//&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/
</code></pre>

<p>As you might expect, <code>data</code> contains the actual ORC files split by partition. This is akin to what you would see in a Hive table <code>data</code> directory. What is really of interest here is the <code>metadata</code> directory. There are specifically three patterns of files you’ll find here.</p>

<pre><code>/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/&lt;file-id&gt;.avro

/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/snap-&lt;snapshot-id&gt;-&lt;version&gt;-&lt;file-id&gt;.avro

/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/&lt;version&gt;-&lt;commit-UUID&gt;.metadata.json
</code></pre>
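<p>As a rough illustration, these three name patterns can be told apart mechanically. The following Python sketch is my own approximation of the naming scheme shown above (the regular expressions are not taken from the Iceberg library) and classifies a metadata file name into one of the three kinds:</p>

```python
import re

# Approximate patterns for the three metadata file kinds shown above:
# plain manifests, snap-* manifest lists, and *.metadata.json files.
PATTERNS = {
    "manifest": re.compile(r"^[0-9a-f\-]+-m\d+\.avro$"),
    "manifest_list": re.compile(r"^snap-\d+-\d+-[0-9a-f\-]+\.avro$"),
    "table_metadata": re.compile(r"^\d+-[0-9a-f\-]+\.metadata\.json$"),
}

def classify(file_name: str) -> str:
    """Return which metadata file kind a name looks like."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(file_name):
            return kind
    return "unknown"

print(classify("92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro"))  # manifest
print(classify("snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro"))  # manifest_list
print(classify("00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json"))  # table_metadata
```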

<p>Iceberg has a persistent tree structure that manages various snapshots of the data that are created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metastore-files.png" alt=""/></p>

<p>This tree structure contains two types of Avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files and the manifest files themselves point to various data files. This post starts out by covering these manifest files, and later covers the table metadata files that are suffixed by <code>.metadata.json</code>.</p>
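<p>One way to picture the persistent tree is as a toy data model. The classes below are illustrative only, not Iceberg’s actual types: a snapshot references the entries of its manifest list, each entry references a manifest file, and each manifest file references data files, so finding a snapshot’s data is a two-level traversal:</p>

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataFile:
    file_path: str
    record_count: int

@dataclass
class ManifestFile:
    manifest_path: str
    data_files: List[DataFile] = field(default_factory=list)

@dataclass
class Snapshot:
    snapshot_id: int
    # Contents of the snapshot's manifest list.
    manifest_list: List[ManifestFile] = field(default_factory=list)

def data_files_for(snapshot: Snapshot) -> List[str]:
    # Walk manifest list -> manifest files -> data files.
    return [df.file_path for m in snapshot.manifest_list for df in m.data_files]
```

A real reader would also prune manifests using partition summaries and column statistics before descending, rather than walking every branch.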

<p><a href="https://bitsondata.dev/trino-on-ice-iii-iceberg-concurrency-snapshots-spec">The last blog covered</a> the command in Trino that shows the snapshot information that is stored in the metastore. Here is that command and its output again for your review.</p>

<pre><code>SELECT manifest_list 
FROM iceberg.logging.&#34;events$snapshots&#34;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>manifest_list</th></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro</td></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro</td></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro</td></tr>
</table>

<p>You’ll notice that the query returns the manifest list Avro files, the ones prefixed with
<code>snap-</code>. These files are directly correlated with the snapshot records stored in the metastore. According to the diagram above, snapshots are records in the metastore that contain the URL of the manifest list Avro file. Avro files are binary files, not something you can just open up in a text editor to read. Using the <a href="https://downloads.apache.org/avro/avro-1.10.2/java/avro-tools-1.10.2.jar">avro-tools.jar tool</a> distributed by the <a href="https://avro.apache.org/docs/current/index.html">Apache Avro project</a>, you can inspect the contents of these files to get a better understanding of how they are used by Iceberg.</p>

<p>The first snapshot is generated on the creation of the events table. Upon inspecting this file, you notice that it is empty: the only output is a newline, which the <code>jq</code> JSON command-line utility strips when pretty-printing. This snapshot represents the empty state of the table upon creation. To investigate the snapshots you first need to download the files to your local filesystem. Let&#39;s move them to the home directory:</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .
</code></pre>

<p>Result: (empty)</p>

<pre><code>
</code></pre>

<p>The second snapshot is a little more interesting and actually shows us the contents of a manifest list.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,
   &#34;manifest_length&#34;:6114,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;added_data_files_count&#34;:{
      &#34;int&#34;:2
   },
   &#34;existing_data_files_count&#34;:{
      &#34;int&#34;:0
   },
   &#34;deleted_data_files_count&#34;:{
      &#34;int&#34;:0
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:3
   },
   &#34;existing_rows_count&#34;:{
      &#34;long&#34;:0
   },
   &#34;deleted_rows_count&#34;:{
      &#34;long&#34;:0
   }
}
</code></pre>

<p>To understand each of the values in each of these rows, you can refer to the Iceberg
<a href="https://iceberg.apache.org/spec/#manifest-lists">specification in the manifest list file section</a>. Instead of covering these exhaustively, let&#39;s focus on a few key fields. Below are the fields and their definitions according to the specification.</p>
<ul><li><code>manifest_path</code> – Location of the manifest file.</li>
<li><code>partition_spec_id</code> – ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.</li>
<li><code>added_snapshot_id</code> – ID of the snapshot where the manifest file was added.</li>
<li><code>partitions</code> – A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.</li>
<li><code>added_rows_count</code> – Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.</li></ul>

<p>As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tell any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse this tree, you walk over the different manifest paths to find all the manifest files associated with the particular snapshot you are interested in. The partition spec id tells you which partition specification was used to write the manifest; the specs themselves are stored in the table metadata in the metastore, so this id references where to find the spec. The added snapshot id tells you which snapshot is associated with the manifest list. The partitions field holds some high-level partition bound information to make for faster querying: if a query is looking for a particular value, only the manifest files whose bounds contain that value are traversed. Finally, you get a few metrics such as the number of changed rows and data files, one of which is the count of added rows. The first operation inserted three rows and the second operation inserted one row. Using the row counts you can easily determine which manifest file belongs to which operation.</p>
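<p>Those opaque-looking partition bounds are just serialized column values. The Iceberg spec stores an <code>int</code> bound as 4 little-endian bytes, and a <code>day</code>-transformed partition value counts days since the Unix epoch. A small Python sketch (my own helper, not part of any Iceberg library) decodes the bounds shown in the output above:</p>

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def decode_day_bound(raw: bytes) -> date:
    # Iceberg serializes int bounds as 4-byte little-endian values;
    # a `day` partition value is a count of days since 1970-01-01.
    (days,) = struct.unpack("<i", raw)
    return EPOCH + timedelta(days=days)

# The bounds from the manifest list above: "\u001eI\u0000\u0000" and "\u001fI\u0000\u0000".
print(decode_day_bound(b"\x1e\x49\x00\x00"))  # 2021-04-01
print(decode_day_bound(b"\x1f\x49\x00\x00"))  # 2021-04-02
```

These decode to the same days as the <code>event_time_day=2021-04-01</code> and <code>event_time_day=2021-04-02</code> partition directories in the file listing earlier.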

<p>The following command shows the final snapshot after both operations executed and filters out only the fields pointed out above.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &#39;. | {manifest_path: .manifest_path, partition_spec_id: .partition_spec_id, added_snapshot_id: .added_snapshot_id, partitions: .partitions, added_rows_count: .added_rows_count }&#39;
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#34;,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:4564366177504223700
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:1
   }
}
{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:3
   }
}
</code></pre>

<p>In the listing of the manifest list for the last snapshot, you notice that the first operation, where three rows were inserted, is tracked by the manifest file in the second JSON object. You can determine this from the snapshot id as well as the number of rows added in the operation. The first JSON object contains the last operation, which inserted a single row. So manifests are listed in reverse commit order, with the most recent operation first.</p>
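<p>The <code>lower_bound</code> and <code>upper_bound</code> bytes in the partition summaries above are the partition values in Iceberg&#39;s binary single-value serialization; for a <code>date</code> (stored as days from the Unix epoch) that is a 4-byte little-endian integer. A quick Python sketch to decode them:</p>

```python
import datetime
import struct

def decode_day_bound(raw: bytes) -> datetime.date:
    """Decode a 4-byte little-endian days-from-epoch partition bound."""
    days = struct.unpack("<i", raw)[0]
    return datetime.date(1970, 1, 1) + datetime.timedelta(days=days)

# "\u001eI\u0000\u0000" from the output above is the byte sequence 1e 49 00 00
print(decode_day_bound(b"\x1eI\x00\x00"))  # 2021-04-01 (day 18718)
print(decode_day_bound(b"\x1fI\x00\x00"))  # 2021-04-02 (day 18719)
```

<p>These match the <code>event_time_day</code> partition values you see in the manifest files themselves (18718 and 18719), confirming that the rows landed in the 2021-04-01 and 2021-04-02 partitions.</p>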

<p>The next command performs the same listing you ran against the manifest list, except this time against the manifest files themselves to expose their contents. To begin, run the command to show the contents of the manifest file associated with the insertion of three rows.</p>

<pre><code>% java -jar  ~/avro-tools-1.10.0.jar tojson ~/Desktop/avro_files/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;status&#34;:1,
   &#34;snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;data_file&#34;:{
      &#34;file_path&#34;:&#34;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#34;,
      &#34;file_format&#34;:&#34;ORC&#34;,
      &#34;partition&#34;:{
         &#34;event_time_day&#34;:{
            &#34;int&#34;:18718
         }
      },
      &#34;record_count&#34;:1,
      &#34;file_size_in_bytes&#34;:870,
      &#34;block_size_in_bytes&#34;:67108864,
      &#34;column_sizes&#34;:null,
      &#34;value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:1
            }
         ]
      },
      &#34;null_value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:0
            }
         ]
      },
      &#34;nan_value_counts&#34;:null,
      &#34;lower_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Oh noes&#34;
            }
         ]
      },
      &#34;upper_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Oh noes&#34;
            }
         ]
      },
      &#34;key_metadata&#34;:null,
      &#34;split_offsets&#34;:null
   }
}
{
   &#34;status&#34;:1,
   &#34;snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;data_file&#34;:{
      &#34;file_path&#34;:&#34;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#34;,
      &#34;file_format&#34;:&#34;ORC&#34;,
      &#34;partition&#34;:{
         &#34;event_time_day&#34;:{
            &#34;int&#34;:18719
         }
      },
      &#34;record_count&#34;:2,
      &#34;file_size_in_bytes&#34;:1084,
      &#34;block_size_in_bytes&#34;:67108864,
      &#34;column_sizes&#34;:null,
      &#34;value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:2
            }
         ]
      },
      &#34;null_value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:0
            }
         ]
      },
      &#34;nan_value_counts&#34;:null,
      &#34;lower_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Double oh noes&#34;
            }
         ]
      },
      &#34;upper_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;WARN&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Maybeh oh noes?&#34;
            }
         ]
      },
      &#34;key_metadata&#34;:null,
      &#34;split_offsets&#34;:null
   }
}
</code></pre>

<p>Now this is a very big output, but in summary, there’s really not too much to these files. As before, there is a <a href="https://iceberg.apache.org/spec/#manifests">Manifest section in the Iceberg spec</a> that details what each of these fields means. Here are the important fields:</p>
<ul><li><code>snapshot_id</code> – ID of the snapshot in which the file was added (status 1) or deleted (status 2); inherited from the manifest list when null.</li>
<li><code>data_file</code> – Field containing metadata about the data files pertaining to the manifest file, such as file path, partition tuple, metrics, etc…</li>
<li><code>data_file.file_path</code> – Full URI for the file with FS scheme.</li>
<li><code>data_file.partition</code> – Partition data tuple, schema based on the partition spec.</li>
<li><code>data_file.record_count</code> – Number of records in the data file.</li>
<li><code>data_file.*_count</code> – Multiple fields containing maps from column id to the number of values, nulls, or NaN values in the file. These can be used to quickly filter out files that cannot match a query.</li>
<li><code>data_file.*_bounds</code> – Multiple fields containing maps from column id to the lower or upper bound of the column, serialized as binary. The lower bound must be less than or equal to, and the upper bound greater than or equal to, all non-null, non-NaN values in the column for the file.</li></ul>

<p>Each data file struct contains the partition tuple and the data file it maps to. These files are only scanned and returned if the criteria for the query are met when checking all of the count, bounds, and other statistics recorded in the manifest. Ideally, only files that contain data relevant to the query should be scanned at all. Having information like the record count may also help in the query planning process to determine splits and other information. This particular optimization hasn’t been completed yet, as planning typically happens before traversal of the files. It is still under discussion and <a href="https://youtu.be/ifXpOn0NJWk?t=2132">is discussed a bit by Iceberg creator Ryan Blue in a recent meetup</a>. If this is something you are interested in, stay posted on the Slack channel and releases as the Trino Iceberg connector progresses in this area.</p>
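<p>As a rough illustration of that pruning step, here is a hypothetical Python sketch (not connector code) of the check a reader could perform against the per-column bounds before opening a file:</p>

```python
from typing import Optional

def may_contain(lower: Optional[str], upper: Optional[str], value: str) -> bool:
    """Conservatively decide whether a data file could hold rows where a
    column equals `value`, given the file's bounds from its manifest entry.
    Missing bounds mean we cannot prune and must scan the file."""
    if lower is None or upper is None:
        return True
    return lower <= value <= upper

# Using the `level` bounds from the manifest output above: the file bounded
# by [ERROR, ERROR] can be skipped for level = 'WARN', while the file
# bounded by [ERROR, WARN] still has to be scanned.
print(may_contain("ERROR", "ERROR", "WARN"))  # False
print(may_contain("ERROR", "WARN", "WARN"))   # True
```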

<p>As mentioned above, the last set of files you find in the metadata directory are those suffixed with <code>.metadata.json</code>. These files are a bit unusual in that they are stored in JSON rather than Avro format, because they are not part of the persistent tree structure. These files are essentially a copy of the table metadata that is stored in the metastore. You can find the fields for the table metadata listed <a href="https://iceberg.apache.org/spec/#table-metadata-fields">in the Iceberg specification</a>. This metadata is typically stored persistently in a metastore, much like the Hive metastore, but the metastore could easily be replaced by any datastore that supports the <a href="https://iceberg.apache.org/spec/#metastore-tables">atomic swap (check-and-put) operation</a> required for Iceberg&#39;s optimistic concurrency model.</p>

<p>The naming of the table metadata includes a table version and UUID:
<code>&lt;table-version&gt;-&lt;UUID&gt;.metadata.json</code>. To commit a new metadata version, which increments the current version number by one, the writer performs these steps:</p>
<ol><li>It creates a new table metadata file using the current metadata.</li>
<li>It writes the new table metadata to a file following the naming with the next version number.</li>

<li><p>It requests the metastore swap the table’s metadata pointer from the old location to the new location.</p>
<ol><li>If the swap succeeds, the commit succeeded. The new file is now the
current metadata.</li>
<li>If the swap fails, another writer has already committed a new version. The
current writer goes back to step 1.</li></ol></li></ol>
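<p>The commit steps above can be sketched as a retry loop around a single compare-and-swap. This is a simplified Python sketch under the assumption that the metastore exposes exactly one atomic operation on the metadata pointer; the class and function names are made up for illustration:</p>

```python
import threading

class Metastore:
    """Minimal stand-in for the metastore: a single metadata pointer with an
    atomic compare-and-swap, which is all Iceberg requires of it."""
    def __init__(self, location: str):
        self._lock = threading.Lock()
        self.metadata_location = location

    def swap(self, expected: str, new: str) -> bool:
        with self._lock:
            if self.metadata_location != expected:
                return False  # another writer committed first
            self.metadata_location = new
            return True

def commit(store: Metastore, write_new_metadata) -> str:
    """Optimistic commit: build new metadata from the current version, then
    attempt the atomic pointer swap; on failure, rebuild and retry."""
    while True:
        current = store.metadata_location           # read current metadata (step 1)
        new_location = write_new_metadata(current)  # write next version (step 2)
        if store.swap(current, new_location):       # request the swap (step 3)
            return new_location                     # swap succeeded (3.1)
        # swap failed: another writer won the race, go back to step 1 (3.2)
```

<p>Because writers do all their work up front and only coordinate at the swap, a losing writer wastes some effort but never blocks a winning one.</p>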

<p>If you want to see where this is stored in the Hive metastore, you can reference the <code>TABLE_PARAMS</code> table. At the time of writing, this is the only method of using the metastore that is supported by the Trino Iceberg connector.</p>

<pre><code>SELECT PARAM_KEY, PARAM_VALUE FROM metastore.TABLE_PARAMS;
</code></pre>

<p>Result:</p>

<table>
<tr><th>PARAM_KEY</th><th>PARAM_VALUE</th></tr>
<tr><td>EXTERNAL</td><td>TRUE</td></tr>
<tr><td>metadata_location</td><td>s3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json</td></tr>
<tr><td>numFiles</td><td>2</td></tr>
<tr><td>previous_metadata_location</td><td>s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json</td></tr>
<tr><td>table_type</td><td>iceberg</td></tr>
<tr><td>totalSize</td><td>5323</td></tr>
<tr><td>transient_lastDdlTime</td><td>1622865672</td></tr>
</table>

<p>As you can see, the metastore reports that the current metadata location is the
<code>00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json</code> file. Now you can dive in to see the table metadata being used by the Iceberg connector.</p>

<pre><code>% cat ~/Desktop/avro_files/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;format-version&#34;:1,
   &#34;table-uuid&#34;:&#34;32e3c271-84a9-4be5-9342-2148c878227a&#34;,
   &#34;location&#34;:&#34;s3a://iceberg/logging.db/events&#34;,
   &#34;last-updated-ms&#34;:1622865686323,
   &#34;last-column-id&#34;:5,
   &#34;schema&#34;:{
      &#34;type&#34;:&#34;struct&#34;,
      &#34;fields&#34;:[
         {
            &#34;id&#34;:1,
            &#34;name&#34;:&#34;level&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;string&#34;
         },
         {
            &#34;id&#34;:2,
            &#34;name&#34;:&#34;event_time&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;timestamp&#34;
         },
         {
            &#34;id&#34;:3,
            &#34;name&#34;:&#34;message&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;string&#34;
         },
         {
            &#34;id&#34;:4,
            &#34;name&#34;:&#34;call_stack&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:{
               &#34;type&#34;:&#34;list&#34;,
               &#34;element-id&#34;:5,
               &#34;element&#34;:&#34;string&#34;,
               &#34;element-required&#34;:false
            }
         }
      ]
   },
   &#34;partition-spec&#34;:[
      {
         &#34;name&#34;:&#34;event_time_day&#34;,
         &#34;transform&#34;:&#34;day&#34;,
         &#34;source-id&#34;:2,
         &#34;field-id&#34;:1000
      }
   ],
   &#34;default-spec-id&#34;:0,
   &#34;partition-specs&#34;:[
      {
         &#34;spec-id&#34;:0,
         &#34;fields&#34;:[
            {
               &#34;name&#34;:&#34;event_time_day&#34;,
               &#34;transform&#34;:&#34;day&#34;,
               &#34;source-id&#34;:2,
               &#34;field-id&#34;:1000
            }
         ]
      }
   ],
   &#34;default-sort-order-id&#34;:0,
   &#34;sort-orders&#34;:[
      {
         &#34;order-id&#34;:0,
         &#34;fields&#34;:[
            
         ]
      }
   ],
   &#34;properties&#34;:{
      &#34;write.format.default&#34;:&#34;ORC&#34;
   },
   &#34;current-snapshot-id&#34;:4564366177504223943,
   &#34;snapshots&#34;:[
      {
         &#34;snapshot-id&#34;:6967685587675910019,
         &#34;timestamp-ms&#34;:1622865672882,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;changed-partition-count&#34;:&#34;0&#34;,
            &#34;total-records&#34;:&#34;0&#34;,
            &#34;total-data-files&#34;:&#34;0&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#34;
      },
      {
         &#34;snapshot-id&#34;:2720489016575682283,
         &#34;parent-snapshot-id&#34;:6967685587675910019,
         &#34;timestamp-ms&#34;:1622865680419,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;added-data-files&#34;:&#34;2&#34;,
            &#34;added-records&#34;:&#34;3&#34;,
            &#34;added-files-size&#34;:&#34;1954&#34;,
            &#34;changed-partition-count&#34;:&#34;2&#34;,
            &#34;total-records&#34;:&#34;3&#34;,
            &#34;total-data-files&#34;:&#34;2&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#34;
      },
      {
         &#34;snapshot-id&#34;:4564366177504223943,
         &#34;parent-snapshot-id&#34;:2720489016575682283,
         &#34;timestamp-ms&#34;:1622865686278,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;added-data-files&#34;:&#34;1&#34;,
            &#34;added-records&#34;:&#34;1&#34;,
            &#34;added-files-size&#34;:&#34;746&#34;,
            &#34;changed-partition-count&#34;:&#34;1&#34;,
            &#34;total-records&#34;:&#34;4&#34;,
            &#34;total-data-files&#34;:&#34;3&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#34;
      }
   ],
   &#34;snapshot-log&#34;:[
      {
         &#34;timestamp-ms&#34;:1622865672882,
         &#34;snapshot-id&#34;:6967685587675910019
      },
      {
         &#34;timestamp-ms&#34;:1622865680419,
         &#34;snapshot-id&#34;:2720489016575682283
      },
      {
         &#34;timestamp-ms&#34;:1622865686278,
         &#34;snapshot-id&#34;:4564366177504223943
      }
   ],
   &#34;metadata-log&#34;:[
      {
         &#34;timestamp-ms&#34;:1622865672894,
         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#34;
      },
      {
         &#34;timestamp-ms&#34;:1622865680524,
         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#34;
      }
   ]
}
</code></pre>

<p>As you can see, these JSON files can quickly grow as you perform different updates on your table. This file contains a pointer to all of the snapshots and manifest list files, much like the output you found from looking at the snapshots in the table. An important piece to note is that the schema is stored here; this is what Trino uses for validation on inserts and reads. As you may expect, there is also the root location of the table itself, as well as a unique table identifier. The final parts I’d like to note about this file are the <code>partition-spec</code> and <code>partition-specs</code> fields. The <code>partition-spec</code> field holds the current partition spec, while <code>partition-specs</code> is an array of all partition specs that have existed for this table. As pointed out earlier, you can have many different manifest files that use different partition specs. That wraps up all of the metadata file types you can expect to see in Iceberg!</p>
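<p>To see how simple the read path becomes once you have this file, here is a short Python sketch (a hypothetical helper, not connector code) that resolves the current snapshot to the root of the persistent tree, its manifest list:</p>

```python
import json

def current_manifest_list(metadata_json: str) -> str:
    """Return the manifest-list path of the current snapshot from a
    <table-version>-<UUID>.metadata.json document."""
    metadata = json.loads(metadata_json)
    current_id = metadata["current-snapshot-id"]
    snapshot = next(s for s in metadata["snapshots"]
                    if s["snapshot-id"] == current_id)
    return snapshot["manifest-list"]
```

<p>Running this against the file above resolves <code>current-snapshot-id</code> 4564366177504223943 to the <code>snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro</code> manifest list, which is exactly where the traversal in this post started.</p>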

<p>This post wraps up the Trino on ice series. Hopefully these blog posts serve as a helpful initial dialogue about what is expected to grow as a vital portion of an open data lakehouse stack. What are you waiting for? Come join the fun and help us implement some of the missing features or instead go ahead and try <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">Trino on Ice(berg)</a> yourself!</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>
]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-iv-deep-dive</guid>
      <pubDate>Thu, 12 Aug 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</title>
      <link>https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This blog closes up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the Iceberg Specification.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;Concurrency Model&#xA; Some issues with the Hive model are the distinct locations where the metadata is stored and where the data files are stored. Having your data and metadata split up like this is a recipe for disaster when trying to apply updates to both services atomically.&#xA;&#xA; Iceberg metadata diagram of runtime, and file storage&#xA; A very common problem with Hive is that if a writing process failed during insertion, many times you would find the data written to file storage, but the metastore writes failed to occur. Or conversely, the metastore writes were successful, but the data failed to finish writing to file storage due to a  network or file IO failure. There’s a good  Trino Community Broadcast episode that talks about a function in Trino that exists to resolve these issues by syncing the metastore and file storage. 
You can watch  a simulation of this error on that episode.&#xA;&#xA; Aside from having issues due to the split state in the system, there are many  other issues that stem from the file system itself. In the case of HDFS,  depending on the specific filesystem implementation you are using, you may have different atomicity guarantees for various file systems and their operations, such as creating, deleting, and renaming files and directories. HDFS isn’t the only troublemaker here. Other than Amazon S3’s  recent announcement of strong consistency in their S3 service, most object storage systems only offer eventual consistency that may not show the latest files immediately after writes. Despite storage systems showing more progress towards offering better performance and guarantees, these systems still offer no reliable locking mechanism.&#xA;&#xA; Iceberg addresses all of these issues in a multitude of ways. One of the primary ways Iceberg introduces transactional guarantees is by storing the metadata in the same datastore as the data itself. This simplifies handling commit failures down to rolling back on one system rather than trying to coordinate a rollback across two systems like in Hive. Writers independently write their metadata and attempt to perform their operations, needing no coordination with other writers. The only time the writers coordinate is when they attempt to perform a commit of their operations. In order to do a commit, they perform a lock of the current snapshot record in a database. This concurrency model where writers eagerly do the work upfront is called optimistic concurrency control.&#xA; Currently, in Trino, this method still uses the Hive metastore to perform the lock-and-swap operation necessary to coordinate the final commits. Iceberg  creator, Ryan Blue, covers this lock-and-swap mechanism and how the metastore can be replaced with alternate locking methods. 
In the event that two writers attempt to commit at the same time, the writer that first acquires the lock successfully commits by swapping its snapshot as the current snapshot, while the second writer will retry to apply its changes again. The second writer should have no problem with this, assuming there are no conflicting changes between the two snapshots.&#xA;&#xA; &#xA;&#xA; This works similarly to a git workflow where the main branch is the locked resource, and two developers try to commit their changes at the same time. The first developer’s changes may conflict with the second developer’s changes. The second developer is then forced to rebase or merge the first developer’s code with their changes before committing to the main branch again. The same logic applies to merging data files. Currently, Iceberg clients use a copy-on-write mechanism that makes a new file out of the merged data in the next snapshot. This enables accurate time traveling and preserves previous split versions of the files. At the time of writing, upserts via MERGE INTO syntax are not supported in Trino, but  this is in active development. UPDATE: Since the original writing of this post, the  MERGE syntax exists as of version 393.&#xA;&#xA; One of the great benefits of tracking each individual change that gets written to Iceberg is that you are given a view of the data at every point in time. This enables a really cool feature that I mentioned earlier called time travel.&#xA;&#xA; ## Snapshots and Time Travel&#xA;&#xA; To showcase snapshots, it’s best to go over a few examples drawing from the event table we  created in the previous blog posts. This time we’ll only be working with the Iceberg table, as this capability is not available in Hive. Snapshots allow you to have an immutable set of your data at a given time. They are automatically created on every append or removal of data. 
One thing to note is that for now, they do not store the state of your metadata.&#xA; Say that you have created your events table and inserted the three initial rows as we did previously. Let’s look at the data we get back and see how to check the existing snapshots in Trino:&#xA;&#xA; &#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;| ERROR | Double oh noes |&#xA;| WARN | Maybeh oh noes? |&#xA;| ERROR | Oh noes |&#xA;&#xA;To query the snapshots, all you need is to use the $ operator appended to the&#xA;end of the table name, and add the hidden table, snapshots:&#xA;&#xA;SELECT snapshotid, parentid, operation&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| snapshotid | parentid | operation |&#xA;| --- | --- | --- |&#xA;| 7620328658793169607 | | append |&#xA;| 2115743741823353537 | 7620328658793169607 | append |&#xA;&#xA;Let’s take a look at the manifest list files that are associated with each &#xA;snapshot ID. 
You can tell which file belongs to which snapshot based on the &#xA;snapshot ID embedded in the filename:&#xA;&#xA;SELECT manifestlist&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| manifestlist |&#xA;| --- |&#xA;| s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro | &#xA;| s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro | &#xA;&#xA;Now, let’s insert another row to the table:&#xA;&#xA;INSERT INTO iceberg.logging.events&#xA;VALUES&#xA;(&#xA;‘INFO’,&#xA;timestamp ‘2021-04-02 00:00:11.1122222’,&#xA;‘It is all good’,&#xA;ARRAY [‘Just updating you!’]&#xA;);&#xA;&#xA;Let’s check the snapshot table again:&#xA;&#xA;SELECT snapshotid, parentid, operation&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| snapshotid | parentid | operation |&#xA;| --- | --- | --- |&#xA;| 7620328658793169607 | | append |&#xA;| 2115743741823353537 | 7620328658793169607 | append |&#xA;| 7030511368881343137 | 2115743741823353537 | append |&#xA;&#xA;Let’s also verify that our row was added:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Oh noes |&#xA;|INFO |It is all good |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;&#xA; Since Iceberg is already tracking the list of files added and removed at each snapshot, it would make sense that you can travel back and forth between these different views into the system, right? This concept is called time traveling. You need to specify which snapshot you would like to read from and you will see the view of the data at that timestamp. 
In Trino, you need to use the @ operator, followed by the snapshot you wish to read from:&#xA; &#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.“events@2115743741823353537”;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;|ERROR|Oh noes |&#xA;&#xA; If you determine there is some issue with your data, you can always roll back to the previous state permanently as well. In Trino we have a function called rollbacktosnapshot to move the table state to another snapshot:&#xA; &#xA;CALL system.rollbacktosnapshot(‘logging’, ‘events’, 2115743741823353537);&#xA;&#xA;Now that we have rolled back, observe what happens when we query the events&#xA;table with:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;|ERROR|Oh noes |&#xA; &#xA; Notice the INFO row is still missing even though we query the table without specifying a snapshot id. Now just because we rolled back, doesn’t mean we’ve lost the snapshot we just rolled back from. In fact, we can roll forward, or as I like to call it,  back to the future! 
In Trino, you use the same function call but with a predecessor of the existing snapshot:&#xA; &#xA;CALL system.rollbacktosnapshot(‘logging’, ‘events’, 7030511368881343137)&#xA;&#xA;And now we should be able to query the table again and see the INFO row &#xA;return:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Oh noes |&#xA;|INFO |It is all good |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA; &#xA; As expected, the INFO row returns when you roll back to the future.&#xA; &#xA; Having snapshots not only provides you with a level of immutability that is key to the eventual consistency model, but gives you a rich set of features to version and move between different versions of your data like a git repository.&#xA; &#xA; ## Iceberg Specification&#xA; &#xA; Perhaps saving the best for last, the benefit of using Iceberg is the community that surrounds it, and the support you receive. It can be daunting to have to choose a project that replaces something so core to your architecture. While Hive has so many drawbacks, one of the things keeping many companies locked in is the fear of the unknown. How do you know which table format to choose? Are there unknown data corruption issues that I’m about to take on? What if this doesn’t scale like it promises on the label? It is worth noting that  alternative table formats are also emerging in this space  and we encourage you to investigate these for your own use cases. When sitting down with Iceberg creator, Ryan Blue,  comparing Iceberg to other table formats,  he claims the community’s greatest strength is their ability to look forward. They intentionally broke compatibility with Hive to enable them to provide a richer level of features. Unlike Hive, the Iceberg project explained their thinking in a spec.&#xA;&#xA; The strongest argument I can see for Iceberg is that it has a specification. 
This is something that has largely been missing from Hive and shows a real maturity in how the Iceberg community has approached the issue. On the Trino project, we think standards are important. We adhere to many of them ourselves, such as the ANSI SQL syntax, and exposing the client through a JDBC connection. By creating a standard around this, you’re no longer tied to any particular technology, not even Iceberg itself. You are adhering to a standard that will hopefully become the de facto standard over a decade or two, much like Hive did. Having the standard in clear writing invites multiple communities to the table and brings even more use  cases. Doing so improves the standards and therefore the technologies that implement them.&#xA; &#xA; The previous three blog posts of this series covered the features and massive benefits from using this novel table format. The following post will dive deeper and discuss more about how Iceberg achieves some of this functionality, with an overview into some of the internals and metadata layouts. In the meantime, feel free to try  Trino on Ice(berg).&#xA;&#xA;#trino #iceberg&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This post wraps up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the <a href="https://iceberg.apache.org/spec/">Iceberg Specification</a>.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<h2 id="concurrency-model">Concurrency Model</h2>

<p>One issue with the Hive model is that the metadata and the data files are stored in two distinct systems. Having your data and metadata split up like this is a recipe for disaster when trying to apply updates to both services atomically.</p>

<p> <img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metadata.png" alt="Iceberg metadata diagram of runtime, and file storage"/>
 A very common problem with Hive is that when a writing process fails during insertion, you often find the data written to file storage while the metastore writes never occurred. Or conversely, the metastore writes succeeded, but the data failed to finish writing to file storage due to a network or file IO failure. There’s a good <a href="https://trino.io/episodes/5.html">Trino Community Broadcast episode</a> covering a Trino function that resolves these issues by syncing the metastore and file storage, and you can watch <a href="https://www.youtube.com/watch?v=OXyJFZSsX5w&amp;t=2097s">a simulation of this error</a> on that episode.</p>
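<p>To make that failure mode concrete, here is a toy Python sketch (purely illustrative, not Hive code) of why a write that spans two systems has a window where a crash leaves them out of sync:</p>

```python
# Illustrative sketch: a write spanning two systems has a window where a
# crash leaves them inconsistent.
file_storage = set()   # stands in for HDFS/S3 data file paths
metastore = set()      # stands in for Hive metastore partition entries

def hive_style_insert(partition, crash_after_files=False):
    file_storage.add(partition)        # step 1: write the data files
    if crash_after_files:
        raise RuntimeError("writer died before metastore update")
    metastore.add(partition)           # step 2: register the partition

try:
    hive_style_insert("day=2021-04-02", crash_after_files=True)
except RuntimeError:
    pass

# Orphaned data: files exist that the metastore knows nothing about.
orphans = file_storage - metastore
print(orphans)  # {'day=2021-04-02'}
```

The reverse ordering has the mirror-image failure: the metastore advertises a partition whose files never finished writing.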

<p> Aside from having issues due to the split state in the system, there are many  other issues that stem from the file system itself. In the case of HDFS,  depending on the specific filesystem implementation you are using, you may have <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem">different atomicity guarantees for various file systems and their operations</a>, such as creating, deleting, and renaming files and directories. HDFS isn’t the only troublemaker here. Other than Amazon S3’s  <a href="https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3-now-delivers-strong-read-after-write-consistency-automatically-for-all-applications/">recent announcement of strong consistency in their S3 service,</a> most object storage systems only offer <em>eventual</em> consistency that may not show the latest files immediately after writes. Despite storage systems showing more progress towards offering better performance and guarantees, these systems still offer no reliable locking mechanism.</p>

<p> Iceberg addresses all of these issues in a multitude of ways. One of the primary ways Iceberg introduces transactional guarantees is by storing the metadata in the same datastore as the data itself. This simplifies handling commit failures down to rolling back on one system rather than trying to coordinate a rollback across two systems like in Hive. Writers independently write their metadata and attempt to perform their operations, needing no coordination with other writers. The only time the writers coordinate is when they attempt to perform a commit of their operations. In order to do a commit, they perform a lock of the current snapshot record in a database. This concurrency model where writers eagerly do the work upfront is called <strong><em>optimistic concurrency control</em></strong>.
 Currently, in Trino, this method still uses the Hive metastore to perform the lock-and-swap operation necessary to coordinate the final commits. Iceberg creator, <a href="https://www.linkedin.com/in/rdblue/">Ryan Blue</a>, <a href="https://youtu.be/-iIY2sOFBRc?t=1351">covers this lock-and-swap mechanism</a> and how the metastore can be replaced with alternate locking methods. In the event that <a href="https://iceberg.apache.org/reliability/#concurrent-write-operations">two writers attempt to commit at the same time</a>, the writer that first acquires the lock successfully commits by swapping its snapshot in as the current snapshot, while the second writer retries applying its changes on top of the new current snapshot. The second writer should have no problem with this, assuming there are no conflicting changes between the two snapshots.</p>
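<p>As a rough mental model, the commit protocol boils down to a compare-and-swap on the table’s current-snapshot pointer plus a retry loop. The sketch below is illustrative Python with made-up names, not Iceberg’s actual API:</p>

```python
# Minimal sketch of optimistic concurrency control: writers do their work
# up front, then commit by atomically swapping the current-snapshot
# pointer, retrying when another writer got there first.
current_snapshot = {"id": 0, "files": []}

def try_swap(expected_id, new_snapshot):
    """Stand-in for the atomic lock-and-swap the metastore provides."""
    global current_snapshot
    if current_snapshot["id"] != expected_id:
        return False                      # another writer committed first
    current_snapshot = new_snapshot
    return True

def commit(new_files):
    while True:                           # retry loop
        base = current_snapshot
        candidate = {"id": base["id"] + 1,
                     "files": base["files"] + new_files}
        if try_swap(base["id"], candidate):
            return candidate["id"]
        # else: re-read the new current snapshot and reapply our changes

# Writer 2 reads its base, but writer 1 commits first:
stale_base = current_snapshot
commit(["a.parquet"])                      # writer 1 wins the swap
losing = {"id": stale_base["id"] + 1,
          "files": stale_base["files"] + ["b.parquet"]}
print(try_swap(stale_base["id"], losing))  # False: writer 2 must retry
commit(["b.parquet"])                      # retry succeeds on the new base
print(current_snapshot["files"])           # ['a.parquet', 'b.parquet']
```

The key property: writers never block each other while doing work, and only the tiny swap step needs coordination.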

<p> <img src="https://trino.io/assets/blog/trino-on-ice/iceberg-files.png" alt=""/></p>

<p> This works similarly to a git workflow where the main branch is the locked resource, and two developers try to commit their changes at the same time. The first developer’s changes may conflict with the second developer’s changes. The second developer is then forced to rebase or merge the first developer’s code with their changes before committing to the main branch again. The same logic applies to merging data files. Currently, Iceberg clients use a <a href="https://iceberg.apache.org/reliability/#concurrent-write-operations">copy-on-write mechanism</a> that makes a new file out of the merged data in the next snapshot. This enables accurate time traveling and preserves previous split versions of the files. At the time of writing, upserts via <code>MERGE INTO</code> syntax are not supported in Trino, but <a href="https://github.com/trinodb/trino/issues/7708">this is in active development</a>. <strong><em>UPDATE:</em></strong> Since the original writing of this post, the <a href="https://github.com/trinodb/trino/pull/7933"><code>MERGE</code> syntax exists as of version 393</a>.</p>

<p> One of the great benefits of tracking each individual change that gets written to Iceberg is that you are given a view of the data at every point in time. This enables a really cool feature that I mentioned earlier called <strong><em>time travel</em></strong>.</p>

<h2 id="snapshots-and-time-travel">Snapshots and Time Travel</h2>

<p> To showcase snapshots, it’s best to go over a few examples drawing from the events table we created in the previous blog posts. This time we’ll only be working with the Iceberg table, as this capability is not available in Hive. Snapshots give you an immutable view of your data at a given point in time. They are automatically created on every append or removal of data. One thing to note is that, for now, they do not store the state of your metadata. Say that you have created your events table and inserted the three initial rows as we did previously. Let’s look at the data we get back and see how to check the existing snapshots in Trino:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p>To query the snapshots, append the <code>$</code> character and the hidden table name, <code>snapshots</code>, to the end of the table name, quoting the combined name:</p>

<pre><code>SELECT snapshot_id, parent_id, operation
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>snapshot_id</th>
<th>parent_id</th>
<th>operation</th>
</tr>
</thead>

<tbody>
<tr>
<td>7620328658793169607</td>
<td></td>
<td>append</td>
</tr>

<tr>
<td>2115743741823353537</td>
<td>7620328658793169607</td>
<td>append</td>
</tr>
</tbody>
</table>

<p>Let’s take a look at the manifest list files that are associated with each
snapshot ID. You can tell which file belongs to which snapshot based on the
snapshot ID embedded in the filename:</p>

<pre><code>SELECT manifest_list
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>manifest_list</th>
</tr>
</thead>

<tbody>
<tr>
<td>s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro</td>
</tr>

<tr>
<td>s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro</td>
</tr>
</tbody>
</table>

<p>Now, let’s insert another row to the table:</p>

<pre><code>INSERT INTO iceberg.logging.events
VALUES
(
  'INFO',
  timestamp '2021-04-02 00:00:11.1122222',
  'It is all good',
  ARRAY ['Just updating you!']
);
</code></pre>

<p>Let’s check the snapshot table again:</p>

<pre><code>SELECT snapshot_id, parent_id, operation
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>snapshot_id</th>
<th>parent_id</th>
<th>operation</th>
</tr>
</thead>

<tbody>
<tr>
<td>7620328658793169607</td>
<td></td>
<td>append</td>
</tr>

<tr>
<td>2115743741823353537</td>
<td>7620328658793169607</td>
<td>append</td>
</tr>

<tr>
<td>7030511368881343137</td>
<td>2115743741823353537</td>
<td>append</td>
</tr>
</tbody>
</table>

<p>Let’s also verify that our row was added:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>

<tr>
<td>INFO</td>
<td>It is all good</td>
</tr>

<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>
</tbody>
</table>

<p> Since Iceberg is already tracking the list of files added and removed at each snapshot, it would make sense that you can travel back and forth between these different views into the system, right? This concept is called time travel. In Trino, you append the <code>@</code> character and the ID of the snapshot you wish to read from to the table name, and you see the view of the data as of that snapshot:</p>

<pre><code>SELECT level, message
FROM iceberg.logging."events@2115743741823353537";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p> If you determine there is some issue with your data, you can also roll back to a previous state permanently. In Trino, the <code>rollback_to_snapshot</code> procedure moves the table state to another snapshot:</p>

<pre><code>CALL system.rollback_to_snapshot(‘logging’, ‘events’, 2115743741823353537);
</code></pre>

<p>Now that we have rolled back, observe what happens when we query the events
table with:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p> Notice the <code>INFO</code> row is still missing even though we query the table without specifying a snapshot ID. Just because we rolled back doesn’t mean we’ve lost the snapshot we rolled back from. In fact, we can roll forward, or as I like to call it, go <a href="https://en.wikipedia.org/wiki/Back_to_the_Future">back to the future</a>! In Trino, you use the same procedure call, but with a snapshot that is a descendant of the current one:</p>

<pre><code>CALL system.rollback_to_snapshot(‘logging’, ‘events’, 7030511368881343137)
</code></pre>

<p>And now we should be able to query the table again and see the <code>INFO</code> row
return:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>

<tr>
<td>INFO</td>
<td>It is all good</td>
</tr>

<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>
</tbody>
</table>

<p> As expected, the <code>INFO</code> row returns when you roll back to the future.</p>

<p> Snapshots not only provide a level of immutability that is key to working with eventually consistent storage, but also give you a rich set of features to version your data and move between those versions, much like a git repository.</p>
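<p>The bookkeeping behind all of this can be pictured as a log of immutable snapshots plus a movable “current” pointer. Here is a toy Python model (not Iceberg internals) using the snapshot IDs from the examples above; the per-snapshot row sets are invented for illustration:</p>

```python
# Toy model of the snapshot log: snapshots are immutable, and "current"
# is just a pointer, so rolling back never deletes anything and rolling
# forward still works.
snapshots = {
    7620328658793169607: {"parent": None,
                          "rows": ["Oh noes"]},
    2115743741823353537: {"parent": 7620328658793169607,
                          "rows": ["Oh noes", "Double oh noes",
                                   "Maybeh oh noes?"]},
    7030511368881343137: {"parent": 2115743741823353537,
                          "rows": ["Oh noes", "Double oh noes",
                                   "Maybeh oh noes?", "It is all good"]},
}
current = 7030511368881343137

def read(snapshot_id=None):
    """Time travel: read any snapshot; default to the current pointer."""
    return snapshots[snapshot_id if snapshot_id else current]["rows"]

current = 2115743741823353537           # rollback_to_snapshot(...)
print("It is all good" in read())       # False: the INFO row is hidden
current = 7030511368881343137           # roll forward: back to the future
print("It is all good" in read())       # True: nothing was lost
```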

<h2 id="iceberg-specification">Iceberg Specification</h2>

<p> Perhaps saving the best for last, one of the biggest benefits of using Iceberg is the community that surrounds it and the support you receive. It can be daunting to choose a project that replaces something so core to your architecture. While Hive has many drawbacks, one of the things keeping companies locked in is the fear of the unknown. How do you know which table format to choose? Are there unknown data corruption issues you’re about to take on? What if it doesn’t scale like it promises on the label? It is worth noting that <a href="https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/">alternative table formats are also emerging in this space</a>, and we encourage you to investigate these for your own use cases. When sitting down with Iceberg creator, Ryan Blue, <a href="https://www.twitch.tv/videos/989098630">comparing Iceberg to other table formats</a>, he claimed the community’s greatest strength is its ability to look forward. They intentionally broke compatibility with Hive to enable a richer set of features. Unlike Hive, the Iceberg project explained their thinking in a spec.</p>

<p> The strongest argument I can see for Iceberg is that it has a <a href="https://iceberg.apache.org/spec/">specification</a>. This is something that has largely been missing from Hive and shows real maturity in how the Iceberg community has approached the issue. On the Trino project, we think standards are important. We adhere to many of them ourselves, such as ANSI SQL syntax and exposing clients through JDBC connections. By creating a standard, you’re no longer tied to any particular technology, not even Iceberg itself. You are adhering to a standard that will hopefully become the de facto standard over a decade or two, much like Hive did. Having the standard in clear writing invites multiple communities to the table and brings even more use cases. Doing so improves the standards, and therefore the technologies that implement them.</p>

<p> The previous three blog posts of this series covered the features and massive benefits of using this novel table format. The following post dives deeper into how Iceberg achieves some of this functionality, with an overview of the internals and metadata layouts. In the meantime, feel free to try <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">Trino on Ice(berg)</a>.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec</guid>
      <pubDate>Fri, 30 Jul 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</title>
      <link>https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;The first post covered how Iceberg is a table format and not a file format It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or go back to the first post before starting this one. This post discusses evolution. No, the post isn’t covering Darwinian nor Pokémon evolution, but in-place table evolution! &#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;You may find it a little odd that I am getting excited over tables evolving &#xA;in-place, but as mentioned in the last post, if you have experience performing table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander evolved into Charmeleon discovering that Iceberg supports Partition evolution and schema evolution. That is, until Charmeleon started treating Ash like a jerk after the evolution from Charmander. Hopefully, you won’t face the same issue when your tables evolve. &#xA;&#xA;Another important aspect that is covered, is how Iceberg is developed with cloud storage in mind. Hive and other data lake technologies were developed with file systems as their primary storage layer. 
This is still a very common layer today, but as more companies move to include object storage, table formats did not adapt to the needs of object stores. Let’s dive in!&#xA;&#xA;Partition Specification evolution&#xA;&#xA;In Iceberg, you are able to update the partition specification, shortened to partition spec in Iceberg, on a live table. You do not need to perform a table migration as you do in Hive. In Hive, partition specs don’t explicitly exist because they are tightly coupled with the creation of the Hive table. Meaning, if you ever need to change the granularity of your data partitions at any point, you need to create an entirely new table, and move all the data to the new partition granularity you desire. No pressure on choosing the right granularity or anything!&#xA;&#xA;In Iceberg, you’re not required to choose the perfect partition specification upfront, and you can have multiple partition specs in the same table, and query across the different sized partition specs. How great is that! This means, if you’re initially partitioning your data by month, and later you decide to move to a daily partitioning spec due to a growing ingest from all your new customers, you can do so with no migration, and query over the table with no issue. &#xA;&#xA;This is conveyed pretty succinctly in this graphic from the Iceberg &#xA;documentation. At the end of the year 2008, partitioning occurs at a monthly granularity and after 2009, it moves to a daily granularity. When the query to pull data from December 14th, 2008 and January 13th, 2009, the entire month of December gets scanned due to the monthly partition, but for the dates in January, only the first 13 days are scanned to answer the query.&#xA;&#xA;At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist. There are efforts to add this support in the near future. 
Edit: this has since been merged!&#xA;&#xA;Schema evolution&#xA;&#xA;Iceberg also handles schema evolution much more elegantly than Hive. In Hive, adding columns worked well enough, as data inserted before the schema change just reports null for that column. For formats that use column names, like ORC and Parquet, deletes are also straightforward for Hive, as it simply ignores fields that are no longer part of the table. For unstructured files like CSV that use the position of the column, deletes would still cause issues, as deleting one column shifts the rest of the columns. Renames for schemas pose an issue for all formats in Hive as data written prior to the rename is not modified to the new field. This effectively works the same as if you deleted the old field and added a new column with the new name. This lack of support for schema. evolution across various file types in Hive requires a lot of memorizing&#xA;the formats underneath various tables. This is very susceptible to causing user errors if someone executes one of the unsupported operations on the wrong table.&#xA;&#xA;table&#xA;thead&#xA;  tr&#xA;    th colspan=&#34;4&#34;Hive 2.2.0 schema evolution based on file type and operation./th&#xA;  /tr&#xA;/thead&#xA;tbody&#xA;  tr&#xA;    td/td&#xA;    tdAdd/td&#xA;    tdDelete/td&#xA;    tdRename/td&#xA;  /tr&#xA;  tr&#xA;    tdCSV/TSV/td&#xA;    td✅/td&#xA;    td❌/td&#xA;    td❌/td&#xA;  /tr&#xA;  tr&#xA;    tdJSON/td&#xA;    td✅/td&#xA;    td✅/td&#xA;    td❌/td&#xA;  /tr&#xA;  tr&#xA;    tdORC/Parquet/Avro/td&#xA;    td✅/td&#xA;    td✅/td&#xA;    td❌/td&#xA;  /tr&#xA;/tbody&#xA;/table&#xA;&#xA;Currently in Iceberg, schemaless position-based data formats such as CSV and TSVare not supported, though there are some discussions on adding limited support for them. This would be good from a reading standpoint, to load data from the CSV, into an Iceberg format with all the guarantees that Iceberg offers. 
&#xA;&#xA;While JSON doesn’t rely on positional data, it does have an explicit dependency on names. This means, that if I remove a text column from a JSON table named severity, then later I want to add a new int column called severity, I encounter an error when I try to read in the data with the string type from before when I try to deserialize the JSON files. Even worse would be if the new severity column you add has the same type as the original but a semantically different meaning. This results in old rows containing values that are unknowingly from a different domain, which can lead to wrong analytics. After all, someone who adds the new severity column might not even be aware of the old severity column, if it was quite some time ago when it was dropped.&#xA;&#xA;ORC, Parquet, and Avro do not suffer from these issues as they are columnar formats that keep a schema internal to the file itself, and each format tracks changes to the columns through IDs rather than name values or position. Iceberg uses these unique column IDs to also keep track of the columns as changes are applied.&#xA;&#xA;In general, Iceberg can only allow this small set of file formats due to the correctness guarantees it provides. In Trino, you can add, delete, or rename columns using the ALTER TABLE command. Here’s an example that continues from the table created  in the last post  that inserted three rows. 
The DDL statement looked like this.&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP(6), &#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;);&#xA;&#xA;Here is an ALTER TABLE sequence that adds a new column named severity, inserts data including into the new column, renames the column, and prints the data.&#xA;&#xA;ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; &#xA;&#xA;INSERT INTO iceberg.logging.events VALUES &#xA;(&#xA;  &#39;INFO&#39;, &#xA;  timestamp &#xA;  &#39;2021-04-01 19:59:59.999999&#39; AT TIME ZONE &#39;America/LosAngeles&#39;, &#xA;  &#39;es muy bueno&#39;, &#xA;  ARRAY [&#39;It is all normal&#39;], &#xA;  1&#xA;);&#xA;&#xA;ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;&#xA;&#xA;SELECT level, message, priority&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level |  message | priority |&#xA;| --- | --- | --- |&#xA;| ERROR | Double oh noes | NULL |&#xA;| WARN | Maybeh oh noes? | NULL |&#xA;| ERROR | Oh noes | NULL |&#xA;| INFO | es muy bueno | 1 |&#xA;&#xA;ALTER TABLE iceberg.logging.events &#xA;DROP COLUMN priority;&#xA;&#xA;SHOW CREATE TABLE iceberg.logging.events;&#xA;&#xA;Result&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;   level varchar,&#xA;   eventtime timestamp(6),&#xA;   message varchar,&#xA;   callstack array(varchar)&#xA;)&#xA;WITH (&#xA;   format = &#39;ORC&#39;,&#xA;   partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;)&#xA;&#xA;Notice how the priority and severity columns are both not present in the schema. As noted in the table above, Hive renames cause issues for all file formats. 
Yet in Iceberg, performing all these operations causes no issues with the table and underlying data.&#xA;&#xA;Cloud storage compatibility&#xA;&#xA;Not all developers consider or are aware of the performance implications of using Hive over a cloud object storage solution like S3 or Azure Blob storage. One thing to remember is that Hive was developed with the Hadoop Distributed File System (HDFS) in mind. HDFS is a filesystem and is particularly well suited to handle listing files on the filesystem, because they were stored in a contiguous manner. When Hive stores data associated with a table, it assumes there is a contiguous layout underneath it and performs list operations that are expensive on cloud storage systems.&#xA;&#xA;The common cloud storage systems are typically object stores that do not lay out the files in a contiguous manner based on paths. Therefore, it becomes very expensive to list out all the files in a particular path. Yet, these list operations are executed for every partition that could be included in a query, regardless of only a single row, in a single file out of thousands of files needing to be retrieved to answer the query. Even ignoring the performance costs for a minute, object stores may also pose issues for Hive due to eventual  consistency. Inserting and deleting can cause inconsistent results for readers, if the files you end up reading are out of date. &#xA;&#xA;Iceberg avoids all of these issues by tracking the data at the file level, &#xA;rather than the partition level. By tracking the files, Iceberg only accesses the files containing data relevant to the query, as opposed to accessing files in the same partition looking for the few files that are relevant to the query. Further, this allows Iceberg to control for the inconsistency issue in cloud-based file systems by using a locking mechanism at the file level. See the file layout below that Hive layout versus the Iceberg layout. 
As you can see in the next image, Iceberg makes no assumptions about the data being contiguous or not. It simply builds a persistent tree using the snapshot (S) location stored in the metadata, that points to the manifest list (ML), which points to &#xA;manifests containing partitions (P). Finally, these manifest files contain the file (F) locations and stats that can quickly be used to prune data versus &#xA;needing to do a list operation and scanning all the files.&#xA;&#xA;Referencing the picture above, if you were to run a query where the result set only contains rows from file F1, Hive would require a list operation and scanning the files, F2 and F3. In Iceberg, file metadata exists in the manifest file, P1, that would have a range on the predicate field that prunes out files F2 and F3, and only scans file F1. This example only shows a couple of files, but imagine storage that scales up to thousands of files! Listing becomes expensive on files that are not contiguously stored in memory. Having this flexibility in the logical layout is essential to increase query performance. This is especially true on cloud object stores.&#xA;&#xA;If you want to play around with Iceberg using Trino, check out the &#xA;Trino Iceberg docs. To avoid issues like the eventual consistency issue, as well as other problems of trying to sync operations across systems, Iceberg provides optimistic concurrency support, which is covered in more detail in&#xA;the next post. &#xA;&#xA;#trino #iceberg&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p><a href="https://bitsondata.dev/trino-on-ice-i-a-gentle-introduction-to-iceberg">The first post</a> covered how Iceberg is a table format and not a file format. It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or go back to the first post before starting this one. This post discusses evolution. No, the post isn’t covering Darwinian or Pokémon evolution, but in-place table evolution!</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p><img src="https://trino.io/assets/blog/trino-on-ice/evolution.gif" alt=""/></p>

<p>You may find it a little odd that I am getting excited over tables evolving
in-place, but as mentioned in the last post, if you have experience performing table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander evolved into Charmeleon discovering that Iceberg supports Partition evolution and schema evolution. That is, until Charmeleon started treating Ash like a jerk after the evolution from Charmander. Hopefully, you won’t face the same issue when your tables evolve.</p>

<p>Another important aspect covered here is how Iceberg was developed with cloud storage in mind. Hive and other data lake technologies were developed with file systems as their primary storage layer. File systems are still very common today, but as more companies moved to object storage, table formats did not adapt to the needs of object stores. Let’s dive in!</p>

<h2 id="partition-specification-evolution">Partition Specification evolution</h2>

<p>In Iceberg, you are able to update the partition specification, shortened to partition spec in Iceberg, on a live table. You do not need to perform a table migration as you do in Hive. In Hive, partition specs don’t explicitly exist because they are tightly coupled with the creation of the Hive table. Meaning, if you ever need to change the granularity of your data partitions at any point, you need to create an entirely new table, and move all the data to the new partition granularity you desire. No pressure on choosing the right granularity or anything!</p>

<p>In Iceberg, you’re not required to choose the perfect partition spec upfront: you can have multiple partition specs in the same table, and query across the differently partitioned data. How great is that! This means that if you initially partition your data by month, and later decide to move to daily partitioning due to growing ingest from all your new customers, you can do so with no migration and query over the table with no issue.</p>

<p>This is conveyed pretty succinctly in this graphic from the Iceberg
documentation. Through the end of 2008, partitioning occurs at a monthly granularity, and from 2009 onward it moves to a daily granularity. When a query pulls data from December 14th, 2008 through January 13th, 2009, the entire month of December gets scanned due to the monthly partition, but for the dates in January, only the first 13 days are scanned to answer the query.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/partition-spec-evolution.png" alt=""/></p>

<p>At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist. <a href="https://github.com/trinodb/trino/issues/7580">There are efforts to add this support in the near future</a>. Edit: this has since been merged!</p>
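<p>With that support merged, here is a sketch of what partition spec evolution looks like through Trino’s Iceberg connector. This assumes a recent Trino release that supports changing the <code>partitioning</code> property, and the <code>booking_events</code> table name is hypothetical:</p>

<pre><code>-- Hypothetical table, initially partitioned at monthly granularity
CREATE TABLE iceberg.logging.booking_events (
  level VARCHAR,
  event_time TIMESTAMP(6),
  message VARCHAR
) WITH (
  format = &#39;ORC&#39;,
  partitioning = ARRAY[&#39;month(event_time)&#39;]
);

-- Evolve to daily partitions: existing files keep the old monthly spec,
-- new writes use the daily spec, and no data migration is needed
ALTER TABLE iceberg.logging.booking_events
SET PROPERTIES partitioning = ARRAY[&#39;day(event_time)&#39;];
</code></pre>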

<h2 id="schema-evolution">Schema evolution</h2>

<p>Iceberg also handles schema evolution much more elegantly than Hive. In Hive, adding columns works well enough, as data inserted before the schema change just reports null for that column. For formats that use column names, like ORC and Parquet, deletes are also straightforward for Hive, as it simply ignores fields that are no longer part of the table. For unstructured files like CSV that rely on column position, deletes still cause issues, as deleting one column shifts the rest of the columns. Renames pose an issue for all formats in Hive, as data written prior to the rename is not migrated to the new field name. This effectively works the same as if you deleted the old field and added a new column with the new name. This lack of support for schema evolution across the various file types in Hive requires a lot of memorizing which formats sit underneath which tables, and it is very susceptible to user error if someone executes one of the unsupported operations on the wrong table.</p>

<table>
<thead>
  <tr>
    <th colspan="4">Hive 2.2.0 schema evolution based on file type and operation.</th>
  </tr>
  <tr>
    <th></th>
    <th>Add</th>
    <th>Delete</th>
    <th>Rename</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>CSV/TSV</td>
    <td>✅</td>
    <td>❌</td>
    <td>❌</td>
  </tr>
  <tr>
    <td>JSON</td>
    <td>✅</td>
    <td>✅</td>
    <td>❌</td>
  </tr>
  <tr>
    <td>ORC/Parquet/Avro</td>
    <td>✅</td>
    <td>✅</td>
    <td>❌</td>
  </tr>
</tbody>
</table>

<p>Currently in Iceberg, schemaless position-based data formats such as CSV and TSV are not supported, though there are <a href="https://github.com/apache/iceberg/issues/118">some discussions on adding limited support for them</a>. This would be useful from a reading standpoint: loading data from CSV into an Iceberg table with all the guarantees that Iceberg offers.</p>

<p>While JSON doesn’t rely on positional data, it does have an explicit dependency on names. This means that if you remove a text column named <code>severity</code> from a JSON table, and later add a new int column called <code>severity</code>, you encounter an error when the reader tries to deserialize the old string values from the JSON files. Even worse would be if the new <code>severity</code> column has the same type as the original but a semantically different meaning. This results in old rows containing values that are unknowingly from a different domain, which can lead to wrong analytics. After all, someone who adds the new <code>severity</code> column might not even be aware of the old <code>severity</code> column if it was dropped quite some time ago.</p>
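<p>As a sketch of that failure mode, assuming a hypothetical Hive table backed by JSON files and a connector that permits both operations:</p>

<pre><code>-- Old rows still contain string values like &#34;severity&#34;: &#34;HIGH&#34;
-- in the JSON files; dropping the column does not rewrite them
ALTER TABLE hive.logging.json_events DROP COLUMN severity;

-- Re-adding the same name with a new type makes reads fail (or silently
-- mix domains), because JSON resolves columns by name, not by ID
ALTER TABLE hive.logging.json_events ADD COLUMN severity INTEGER;
</code></pre>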

<p>ORC, Parquet, and Avro do not suffer from these issues, as they are self-describing formats that keep a schema internal to the file itself, and each format tracks changes to the columns through IDs rather than names or positions. Iceberg uses these unique column IDs to keep track of the columns as changes are applied.</p>

<p>In general, Iceberg can only allow this small set of file formats due to the <a href="https://iceberg.apache.org/evolution/#correctness">correctness guarantees</a> it provides. In Trino, you can add, delete, or rename columns using the <code>ALTER TABLE</code> command. Here’s an example that continues from the table created in the last post, into which three rows were inserted. The DDL statement looked like this.</p>

<pre><code>CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6), 
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &#39;ORC&#39;,
  partitioning = ARRAY[&#39;day(event_time)&#39;]
);
</code></pre>

<p>Here is an <code>ALTER TABLE</code> sequence that adds a new column named <code>severity</code>, inserts data including into the new column, renames the column, and prints the data.</p>

<pre><code>ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; 

INSERT INTO iceberg.logging.events VALUES 
(
  &#39;INFO&#39;, 
  timestamp 
  &#39;2021-04-01 19:59:59.999999&#39; AT TIME ZONE &#39;America/Los_Angeles&#39;, 
  &#39;es muy bueno&#39;, 
  ARRAY [&#39;It is all normal&#39;], 
  1
);

ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;

SELECT level, message, priority
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
<th>priority</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
<td>NULL</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
<td>NULL</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
<td>NULL</td>
</tr>

<tr>
<td>INFO</td>
<td>es muy bueno</td>
<td>1</td>
</tr>
</tbody>
</table>

<pre><code>ALTER TABLE iceberg.logging.events 
DROP COLUMN priority;

SHOW CREATE TABLE iceberg.logging.events;
</code></pre>

<p>Result</p>

<pre><code>CREATE TABLE iceberg.logging.events (
   level varchar,
   event_time timestamp(6),
   message varchar,
   call_stack array(varchar)
)
WITH (
   format = &#39;ORC&#39;,
   partitioning = ARRAY[&#39;day(event_time)&#39;]
)
</code></pre>

<p>Notice how neither the priority nor the severity column is present in the schema. As noted in the table above, renames cause issues for all file formats in Hive. Yet in Iceberg, performing all these operations causes no issues with the table and underlying data.</p>
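<p>Because Iceberg assigns a fresh internal ID to every new column, you can even re-add a column under the old name with a different type without the old values resurfacing. A small sketch, assuming the table state from the example above:</p>

<pre><code>-- The new severity column maps to a new column ID, so the dropped
-- column&#39;s data is never read back into it
ALTER TABLE iceberg.logging.events ADD COLUMN severity VARCHAR;

-- Every existing row reports NULL for the re-added column
SELECT level, severity FROM iceberg.logging.events;
</code></pre>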

<h2 id="cloud-storage-compatibility">Cloud storage compatibility</h2>

<p>Not all developers consider, or are even aware of, the performance implications of using Hive over a cloud object storage solution like S3 or Azure Blob Storage. One thing to remember is that Hive was developed with the Hadoop Distributed File System (HDFS) in mind. HDFS is a filesystem, and it is particularly well suited to listing files because they are stored in a contiguous manner. When Hive stores data associated with a table, it assumes there is a contiguous layout underneath it and performs list operations, which are expensive on cloud storage systems.</p>

<p>The common cloud storage systems are typically object stores that do not lay out files contiguously based on paths, so it becomes very expensive to list all the files under a particular path. Yet these list operations are executed for every partition that could be included in a query, even if only a single row in a single file out of thousands needs to be retrieved to answer it. Even ignoring the performance costs for a minute, object stores may also pose issues for Hive due to eventual consistency. Inserting and deleting can cause inconsistent results for readers if the files you end up reading are out of date.</p>

<p>Iceberg avoids all of these issues by tracking the data at the file level,
rather than the partition level. By tracking the files, Iceberg only accesses the files containing data relevant to the query, as opposed to scanning every file in a partition looking for the few that are relevant. Further, this allows Iceberg to control for the inconsistency issue in cloud storage by using a locking mechanism at the file level. See the file layout below comparing the Hive layout with the Iceberg layout. As you can see in the next image, Iceberg makes no assumptions about whether or not the data is contiguous. It simply builds a persistent tree using the snapshot (S) location stored in the metadata, which points to the manifest list (ML), which points to
manifests containing partitions (P). Finally, these manifest files contain the file (F) locations and stats that can quickly be used to prune data, versus
needing to do a list operation and scan all the files.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/cloud-file-layout.png" alt=""/></p>

<p>Referencing the picture above, if you were to run a query whose result set only contains rows from file F1, Hive would require a list operation and a scan of files F2 and F3 as well. In Iceberg, file metadata exists in the manifest file P1, which holds a range on the predicate field that prunes out files F2 and F3, so only file F1 is scanned. This example only shows a couple of files, but imagine storage that scales up to thousands of files! Listing becomes expensive on files that are not stored contiguously. Having this flexibility in the logical layout is essential to increasing query performance, and this is especially true on cloud object stores.</p>
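<p>You can inspect this file-level bookkeeping yourself through the metadata tables exposed by Trino’s Iceberg connector. A sketch, assuming the <code>events</code> table from earlier (the <code>$files</code> table and these column names come from the Trino Iceberg connector documentation):</p>

<pre><code>-- Each row is a data file Iceberg tracks for the table, along with
-- per-file stats that Iceberg uses to prune files at query time
SELECT file_path, record_count, file_size_in_bytes
FROM iceberg.logging.&#34;events$files&#34;;
</code></pre>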

<p>If you want to play around with Iceberg using Trino, check out the
<a href="https://trino.io/docs/current/connector/iceberg.html">Trino Iceberg docs</a>. To avoid problems like eventual consistency, as well as other difficulties of syncing operations across systems, Iceberg provides optimistic concurrency support, which is covered in more detail in
<a href="https://bitsondata.dev/iceberg-concurrency-snapshots-spec">the next post</a>.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud</guid>
      <pubDate>Mon, 12 Jul 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice I: A gentle introduction To Iceberg</title>
      <link>https://bitsondata.dev/trino-iceberg-i-gentle-intro?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;Back in the Gentle introduction to the Hive connector blog post, I discussed a commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, the connector is named Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem - the invisible Hive specification.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;I call this specification invisible because it doesn’t exist. It lives in the Hive code and the minds of those who developed it. This is makes it very difficult for anybody else who has to integrate with any distributed object storage that uses Hive, since they had to rely on reverse engineering and keeping up with the changes. The way you interact with Hive changes based on which version of Hive or Hadoop you are running. It also varies if you are in the cloud or over an object store. Spark has even modified the Hive spec in some ways to fit the Hive model to their use cases. It’s a big mess that data engineers have put up with for years. Yet despite the confusion and lack of organization due to Hive’s number of unwritten assumptions, the Hive connector is the most popular connector in use for Trino. Virtually every big data query engine uses the Hive model today in some form. 
As a result it is used by numerous companies to store and access data in their data lakes.&#xA;&#xA;So how did something with no specification become so ubiquitous in data lakes? Hive was first in the large object storage and big data world as part of Hadoop. Hadoop became popular from good marketing for Hadoop to solve the problems of dealing with the increase in data with the Web 2.0 boom . Of course, Hive didn’t get everything wrong. In fact, without Hive, and the fact that it is open source, there may not have been a unified specification at all. Despite the many hours data engineers have spent bashing their heads against the wall with all the unintended consequences of Hive, it still served a very useful purpose.&#xA;&#xA;So why did I just rant about Hive for so long if I’m here to tell you about Apache Iceberg? It’s impossible for a teenager growing up today to truly appreciate music streaming services without knowing what it was like to have an iPod with limited storage, or listening to a scratched burnt CD that skips, or flipping your tape or record to side-B. The same way anyone born before the turn of the millennium really appreciates streaming services, so you too will appreciate Iceberg once you’ve learned the intricacies of managing a data lake built on Hive and Hadoop.&#xA;&#xA;If you haven’t used Hive before, this blog post outlines just a few pain points that come from this data warehousing software to give you proper context. If you have already lived through these headaches, this post acts as a guide to Iceberg from Hive. This post is the first in a series of blog posts discussing Apache Iceberg in great detail, through the lens of the Trino query engine user. If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that houses the founding Presto community after the founders of Presto left Facebook. 
This and the next couple of posts discuss the Iceberg specification and all the features Iceberg has to offer, many times in comparison with Hive.&#xA;&#xA;Before jumping into the comparisons, what is Iceberg exactly? The first thing to understand is that Iceberg is not a file format, but a table format. It may not be clear what this means by just stating that, but the function of a table format becomes clearer as the improvements Iceberg brings from the Hive table standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet, but is the layer between the query engine and the data. Iceberg maps and indexes the files in order to provide a higher level abstraction that handles the relational table format for data lakes. You will understand more about table formats through examples in this series.&#xA;&#xA;Hidden Partitions&#xA;&#xA;Hive Partitions&#xA;&#xA;Since most developers and users interact with the table format via the query language, a noticeable difference is the flexibility you have while creating a partitioned table. Assume you are trying to create a table for tracking events occurring in our system. You run both sets of SQL commands from Trino, just using the Hive and Iceberg connectors which are designated by the catalog name (i.e. the catalog name starting with hive. uses the Hive connector, while the iceberg. table uses the Iceberg connector). To begin with, the first DDL statement attempts to create an events table in the logging schema in the hive catalog, which is configured to use the Hive connector. 
Trino also creates a partition on the events table using the eventtime field which is a TIMESTAMP field.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtime&#39;]&#xA;);&#xA;&#xA;Running this in Trino using the Hive connector produces the following error message.&#xA;&#xA;Partition keys must be the last columns in the table and in the same order as the table properties: [eventtime]&#xA;&#xA;The Hive DDL is very dependent on ordering for columns and specifically partition columns. Partition fields must be located in the final column positions and in the order of partitioning in the DDL statement. The next statement attempts to create the same table, but now with the eventtime field moved to the last column position.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR),&#xA;  eventtime TIMESTAMP&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtime&#39;]&#xA;);&#xA;&#xA;This time, the DDL command works successfully, but you likely don’t want to partition your data on the plain timestamp. This results in a separate file for each distinct timestamp value in your table (likely almost a file for each event). In Hive, there’s no way to indicate the time granularity at which you want to partition natively. 
The method to support this scenario with Hive is to create a new VARCHAR column, eventtimeday that is dependent on the eventtime column to create the date partition value.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR),&#xA;  eventtimeday VARCHAR&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtimeday&#39;]&#xA;);&#xA;&#xA;This method wastes space by adding a new column to your table. Even worse, it puts the burden of knowledge on the user to include this new column for writing data. It is then necessary to use that separate column for any read access to take advantage of the performance gains from the partitioning.&#xA;&#xA;INSERT INTO hive.logging.events&#xA;VALUES&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-01 12:00:00.000001&#39;,&#xA;  &#39;Oh noes&#39;, &#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;], &#xA;  &#39;2021-04-01&#39;&#xA;),&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-02 15:55:55.555555&#39;,&#xA;  &#39;Double oh noes&#39;,&#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],&#xA;  &#39;2021-04-02&#39;&#xA;),&#xA;(&#xA;  &#39;WARN&#39;, &#xA;  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,&#xA;  &#39;Maybeh oh noes?&#39;,&#xA;  ARRAY [&#39;Bad things could be happening??&#39;], &#xA;  &#39;2021-04-02&#39;&#xA;);&#xA;&#xA;Notice that the last partition value &#39;2021-04-01&#39; has to match the TIMESTAMP date during insertion. 
There is no validation in Hive to make sure this is happening because it only requires a VARCHAR and knows to partition based on different values.&#xA;&#xA;On the other hand, If a user runs the following query:&#xA;&#xA;SELECT &#xA;FROM hive.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39;;&#xA;&#xA;they get the correct results back, but have to scan all the data in the table:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;This happens because the user forgot to include the eventtimeday &lt; &#39;2021-04-02&#39; predicate in the WHERE clause. This eliminates all the benefits that led us to create the partition in the first place and yet frequently this is missed by the users of these tables.&#xA;&#xA;SELECT &#xA;FROM hive.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39; &#xA;AND eventtimeday &lt; &#39;2021-04-02&#39;;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;Iceberg Partitions&#xA;&#xA;The following DDL statement illustrates how these issues are handled in Iceberg via the Trino Iceberg connector.&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP(6),&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;);&#xA;&#xA;Taking note of a few things. First, notice the partition on the eventtime column that is defined without having to move it to the last position. There is also no need to create a separate field to handle the daily partition on the eventtime field. 
The partition specification is maintained internally by Iceberg, and neither the user nor the reader of this table needs to know anything about the partition specification to take advantage of it. This concept is called hidden partitioning , where only the table creator/maintainer has to know the partitioning specification. Here is what the insert statements look like now:&#xA;&#xA;INSERT INTO iceberg.logging.events&#xA;VALUES&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-01 12:00:00.000001&#39;,&#xA;  &#39;Oh noes&#39;, &#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]&#xA;),&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-02 15:55:55.555555&#39;,&#xA;  &#39;Double oh noes&#39;,&#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]),&#xA;(&#xA;  &#39;WARN&#39;, &#xA;  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,&#xA;  &#39;Maybeh oh noes?&#39;,&#xA;  ARRAY [&#39;Bad things could be happening??&#39;]&#xA;);&#xA;&#xA;The VARCHAR dates are no longer needed. The eventtime field is internally converted to the proper partition value to partition each row. Also, notice that the same query that ran in Hive returns the same results. The big difference is that it doesn’t require any extra clause to indicate to filter partition as well as filter the results.&#xA;&#xA;SELECT *&#xA;FROM iceberg.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39;;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;So hopefully that gives you a glimpse into what a table format and specification are, and why Iceberg is such a wonderful improvement over the existing and outdated method of storing your data in your data lake. 
While this post covers a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…&#xA;&#xA;If you want to play around with Iceberg using Trino, check out the Trino Iceberg docs. The next post covers how table evolution works in Iceberg, as well as, how Iceberg is an improved storage format for cloud storage.&#xA;&#xA;#trino #iceberg&#xA;&#xA;bits&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>Back in the <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector">Gentle introduction to the Hive connector</a> blog post, I discussed the commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, it is named the Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem – the invisible Hive specification.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p>I call this specification invisible because it doesn’t exist. It lives in the Hive code and the minds of those who developed it. This makes it very difficult for anybody else who has to integrate with any distributed object storage that uses Hive, since they have to rely on reverse engineering and keeping up with the changes. The way you interact with Hive changes based on <a href="https://medium.com/hashmapinc/four-steps-for-migrating-from-hive-2-x-to-3-x-e85a8363a18">which version of Hive or Hadoop</a> you are running. It also varies depending on whether you are in the cloud or over an object store. Spark has even <a href="https://spark.apache.org/docs/2.4.4/sql-migration-guide-hive-compatibility.html">modified the Hive spec</a> in some ways to fit the Hive model to its use cases. It’s a big mess that data engineers have put up with for years. Yet despite the confusion and lack of organization due to Hive’s many unwritten assumptions, the Hive connector is the most popular connector in use for Trino. Virtually every big data query engine uses the Hive model today in some form. As a result it is used by numerous companies to store and access data in their data lakes.</p>

<p>So how did something with no specification become so ubiquitous in data lakes? Hive was first in the large object storage and big data world as part of Hadoop, and Hadoop became popular through good marketing that promised to solve the problems of dealing with the explosion of data during the Web 2.0 boom. Of course, Hive didn’t get everything wrong. In fact, without Hive, and the fact that it is open source, there may not have been a unified specification at all. Despite the many hours data engineers have spent bashing their heads against the wall with all the unintended consequences of Hive, it still served a very useful purpose.</p>

<p>So why did I just rant about Hive for so long if I’m here to tell you about <a href="https://iceberg.apache.org/">Apache Iceberg</a>? It’s impossible for a teenager growing up today to truly appreciate music streaming services without knowing what it was like to have an iPod with limited storage, or listening to a scratched burnt CD that skips, or flipping your tape or record to side-B. The same way anyone born before the turn of the millennium really appreciates streaming services, so you too will appreciate Iceberg once you’ve learned the intricacies of managing a data lake built on Hive and Hadoop.</p>

<p>If you haven’t used Hive before, this blog post outlines just a few pain points that come from this data warehousing software to give you proper context. If you have already lived through these headaches, this post acts as a guide to Iceberg from Hive. This post is the first in a series of blog posts discussing Apache Iceberg in great detail, through the lens of the Trino query engine user. If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that houses the founding Presto community after the <a href="https://trino.io/blog/2020/12/27/announcing-trino">founders of Presto left Facebook</a>. This and the next couple of posts discuss the Iceberg specification and all the features Iceberg has to offer, many times in comparison with Hive.</p>

<p>Before jumping into the comparisons, what is Iceberg exactly? The first thing to understand is that Iceberg is not a file format, but a table format. It may not be clear what this means by just stating that, but the function of a table format becomes clearer as the improvements Iceberg brings from the Hive table standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet, but is the layer between the query engine and the data. Iceberg maps and indexes the files in order to provide a higher level abstraction that handles the relational table format for data lakes. You will understand more about table formats through examples in this series.</p>

<h2 id="hidden-partitions">Hidden Partitions</h2>

<h3 id="hive-partitions">Hive Partitions</h3>

<p>Since most developers and users interact with the table format via the query language, a noticeable difference is the flexibility you have while creating a partitioned table. Assume you are trying to create a table for tracking events occurring in your system. You run both sets of SQL commands from Trino, just using the Hive and Iceberg connectors, which are designated by the catalog name (i.e. a catalog name starting with <code>hive.</code> uses the Hive connector, while an <code>iceberg.</code> table uses the Iceberg connector). To begin with, the first DDL statement attempts to create an <code>events</code> table in the <code>logging</code> schema in the <code>hive</code> catalog, which is configured to use the Hive connector. Trino also creates a partition on the <code>events</code> table using the <code>event_time</code> field, which is a <code>TIMESTAMP</code> field.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time&#39;]
);
</code></pre>

<p>Running this in Trino using the Hive connector produces the following error message.</p>

<pre><code>Partition keys must be the last columns in the table and in the same order as the table properties: [event_time]
</code></pre>

<p>The Hive DDL is very dependent on ordering for columns and specifically partition columns. Partition fields must be located in the final column positions and in the order of partitioning in the DDL statement. The next statement attempts to create the same table, but now with the <code>event_time</code> field moved to the last column position.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time TIMESTAMP
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time&#39;]
);
</code></pre>

<p>This time, the DDL command works successfully, but you likely don’t want to partition your data on the plain timestamp. This results in a separate file for each distinct timestamp value in your table (likely almost a file for each event). In Hive, there’s no way to indicate the time granularity at which you want to partition natively. The method to support this scenario with Hive is to create a new <code>VARCHAR</code> column, <code>event_time_day</code> that is dependent on the <code>event_time</code> column to create the date partition value.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time_day VARCHAR
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time_day&#39;]
);
</code></pre>

<p>This method wastes space by adding a redundant column to your table. Even worse, it puts the burden of knowledge on the user, who must populate this extra column on every write and filter on it on every read to gain the performance benefits of the partitioning.</p>

<pre><code>INSERT INTO hive.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Oh noes&#39;, 
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;], 
  &#39;2021-04-01&#39;
),
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-02 15:55:55.555555&#39;,
  &#39;Double oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],
  &#39;2021-04-02&#39;
),
(
  &#39;WARN&#39;, 
  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,
  &#39;Maybeh oh noes?&#39;,
  ARRAY [&#39;Bad things could be happening??&#39;], 
  &#39;2021-04-02&#39;
);
</code></pre>

<p>Notice that the last value, the partition value <code>&#39;2021-04-01&#39;</code>, has to match the date of the <code>TIMESTAMP</code> in the same row. Hive performs no validation to make sure it does; the column only requires a <code>VARCHAR</code>, and Hive partitions on whatever distinct values it receives.</p>
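<p>For example, Hive happily accepts a row whose partition value disagrees with its timestamp; the row simply lands in the wrong partition. The values below are hypothetical:</p>

<pre><code>INSERT INTO hive.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Mismatched partition&#39;,
  ARRAY [&#39;Some stack trace&#39;],
  &#39;2021-05-01&#39; -- wrong day, accepted without complaint
);
</code></pre>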

<p>On the other hand, if a user runs the following query:</p>

<pre><code>SELECT *
FROM hive.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>they get the correct results back, but have to scan all the data in the table:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>

<p>This happens because the user forgot to include the <code>event_time_day &lt; &#39;2021-04-02&#39;</code> predicate in the <code>WHERE</code> clause. This eliminates all the benefits that led us to create the partition in the first place, and yet users of these tables frequently miss it. To prune partitions, the query needs both predicates:</p>

<pre><code>SELECT *
FROM hive.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39; 
AND event_time_day &lt; &#39;2021-04-02&#39;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>

<h3 id="iceberg-partitions">Iceberg Partitions</h3>

<p>The following DDL statement illustrates how these issues are handled in Iceberg via the Trino Iceberg connector.</p>

<pre><code>CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6),
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  partitioning = ARRAY[&#39;day(event_time)&#39;]
);
</code></pre>

<p>There are a few things to note. First, the partition on the <code>event_time</code> column is defined without having to move the column to the last position. There is also no need for a separate field to hold the daily partition value derived from <code>event_time</code>. The <em><strong>partition specification</strong></em> is maintained internally by Iceberg, and neither the writer nor the reader of this table needs to know anything about it to take advantage of it. This concept is called <em><strong>hidden partitioning</strong></em>, where only the table creator/maintainer has to know the <em><strong>partitioning specification</strong></em>. Here is what the insert statement looks like now:</p>

<pre><code>INSERT INTO iceberg.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Oh noes&#39;, 
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]
),
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-02 15:55:55.555555&#39;,
  &#39;Double oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]
),
(
  &#39;WARN&#39;, 
  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,
  &#39;Maybeh oh noes?&#39;,
  ARRAY [&#39;Bad things could be happening??&#39;]
);
</code></pre>

<p>The <code>VARCHAR</code> dates are no longer needed. The <code>event_time</code> field is internally converted to the proper partition value for each row. Also, notice that the same query that ran in Hive returns the same results. The big difference is that it doesn’t require any extra predicate on a partition column; filtering on <code>event_time</code> alone both prunes partitions and filters the results.</p>

<pre><code>SELECT *
FROM iceberg.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>
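<p>As a side note, if you are curious what Iceberg is doing behind the scenes, the Trino Iceberg connector exposes metadata tables such as <code>$partitions</code>. A query like the following (output omitted, and details may vary by connector version) lists the hidden daily partitions and statistics about them:</p>

<pre><code>SELECT *
FROM iceberg.logging.&#34;events$partitions&#34;;
</code></pre>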

<p>So hopefully that gives you a glimpse into what a table format and specification are, and why Iceberg is such a wonderful improvement over the existing and outdated method of storing your data in your data lake. While this post covers a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/see_myself_out.gif" alt=""/></p>

<p>If you want to play around with Iceberg using Trino, check out the <a href="https://trino.io/docs/current/connector/iceberg.html">Trino Iceberg docs</a>. The <a href="https://bitsondata.dev/in-place-table-evolution-and-cloud-compatibility-with-iceberg">next post</a> covers how table evolution works in Iceberg, as well as how Iceberg is an improved storage format for cloud storage.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-i-gentle-intro</guid>
      <pubDate>Mon, 03 May 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>A gentle introduction to the Hive connector</title>
      <link>https://bitsondata.dev/a-gentle-introduction-to-the-hive-connector?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[TL;DR: The Hive connector is what you use in Trino for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.&#xA;&#xA;!--more--&#xA;&#xA;Originally Posted on https://trino.io/blog/2020/10/20/intro-to-hive-connector.html&#xA;&#xA;One of the most confusing aspects when starting Trino is the Hive connector. Typically, you seek out the use of Trino when you experience an intensely slow query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In fact, the genesis of Trino came about due to these slow Hive query conditions at Facebook back in 2012.&#xA;&#xA;So when you learn that Trino has a Hive connector, it can be rather confusing since you moved to Trino to circumvent the slowness of your current Hive cluster. Another common source of confusion is when you want to query your data from your cloud object storage, such as AWS S3, MinIO, and Google Cloud Storage. This too uses the Hive connector. If that confuses you, don’t worry, you are not alone. This blog aims to explain this commonly confusing nomenclature.&#xA;&#xA;Hive architecture&#xA;&#xA;To understand the origins and inner workings of Trino’s Hive connector, you first need to know a few high-level components of the Hive architecture.&#xA;&#xA;You can simplify the Hive architecture to four components:&#xA;&#xA;The runtime contains the logic of the query engine that translates the SQL -esque Hive Query Language(HQL) into MapReduce jobs that run over files stored in the filesystem.&#xA;&#xA;The storage component is simply that, it stores files in various formats and index structures to recall these files. The file formats can be anything as simple as JSON and CSV, to more complex files such as columnar formats like ORC and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed Filesystem (HDFS). 
As cloud-based options became more prevalent, object storage like Amazon S3, Azure Blob Storage, Google Cloud Storage, and others needed to be leveraged as well and replaced HDFS as the storage component.&#xA;&#xA;In order for Hive to process these files, it must have a mapping from SQL tables in the runtime to files and directories in the storage component. To accomplish this, Hive uses the Hive Metastore Service (HMS), often shortened to the metastore to manage the metadata about the files such as table columns, file locations, file formats, etc…&#xA;&#xA;The last component not included in the image is Hive’s data organization specification. The documentation of this element only exists in the code in Hive and has been reverse engineered to be used by other systems like Trino to remain compatible with other systems.&#xA;&#xA;Trino reuses all of these components except for the runtime. This is the same approach most compute engine takes when dealing with data in object stores, specifically, Trino, Spark, Drill, and Impala. When you think of the Hive connector, you should think about a connector that is capable of reading data organized by the unwritten Hive specification.&#xA;&#xA;Trino runtime replaces Hive runtime&#xA;&#xA;In the early days of big data systems, many expected query turnaround to take a long time due to the high volume of unstructured data in ETL workloads. The primary goal in early iterations of these systems was simply throughput over large volumes of data while maintaining fault-tolerance. Now, more businesses want to run fast interactive queries over their big data instead of running jobs that take hours and produce possibly undesirable results. Many companies have petabytes of data and metadata in their data warehouse. Data in storage is cumbersome to move and the data in the metastore takes a long time to repopulate in other formats. 
Since only the runtime that executed Hive queries needs replacement, the Trino engine utilizes the existing metastore metadata and files residing in storage, and the Trino runtime effectively replaces the Hive runtime responsible for analyzing the data.&#xA;&#xA;Trino Architecture&#xA;&#xA;The Hive connector nomenclature&#xA;&#xA;Notice, that the only change in the Trino architecture is the runtime. The HMS still exists along with the storage. This is not by accident. This design exists to address a common problem faced by many companies. It simplifies the migration from using Hive to using Trino. Regardless of the storage component used the runtime makes use of the HMS and that is the reason this connector is the Hive connector.&#xA;&#xA;Where the confusion tends to come from, is when you search for a connector from the context of the storage systems you want to query. You may not even be aware the metastore is a necessity or even exists. Typically, you look for an S3 connector, a GCS connector or a MinIO connector. All you need is the Hive connector and the HMS to manage the metadata of the objects in your storage.&#xA;&#xA;The Hive Metastore Service&#xA;&#xA;The HMS is the only Hive process used in the entire Trino ecosystem when using the Hive connector. The HMS is actually a simple service with a binary API using the Thrift protocol. This service makes updates to the metadata, stored in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There are other compatible replacements of the HMS such as AWS Glue, a drop-in substitution for the HMS.&#xA;&#xA;https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio&#xA;&#xA;Getting started with the Hive Connector on Trino&#xA;&#xA;To drive this point home, I created a tutorial that showcases using Trino and looking at the metadata it produces. 
In the following scenario, the docker environment contains four docker containers:&#xA;&#xA;trino - the runtime in this scenario that replaces Hive.&#xA;minio - the storage is an open-source cloud object storage.&#xA;hive-metastore - the metastore service instance.&#xA;mariadb - the database that the metastore uses to store the metadata.&#xA;&#xA;You can play around with the system and optionally view the configurations. The scenario asks you to run a query to populate data in MinIO and then see the resulting metadata populated in MariaDB by the HMS. The next step asks you to run queries over the mariadb database which holds the generated metadata from the metastore.&#xA;&#xA;If you have any questions or run into any issues with the example, you can find us on slack on the #dev or #general channels.&#xA;&#xA;Have fun!&#xA;&#xA;https://trino.io/assets/blog/intro-to-hive-connector/intro-to-hive.jpeg&#xA;&#xA;https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio&#xA;&#xA;#trino #hive&#xA;&#xA;bits&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p>TL;DR: The Hive connector is what you use in Trino for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.</p>




<p>Originally Posted on <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector.html">https://trino.io/blog/2020/10/20/intro-to-hive-connector.html</a></p>

<p>One of the most confusing aspects when starting Trino is the Hive connector. Typically, you seek out the use of Trino when you experience an intensely slow query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In fact, the genesis of Trino came about due to these slow Hive query conditions at Facebook back in 2012.</p>

<p>So when you learn that Trino has a Hive connector, it can be rather confusing since you moved to Trino to circumvent the slowness of your current Hive cluster. Another common source of confusion is when you want to query your data from your cloud object storage, such as AWS S3, MinIO, and Google Cloud Storage. This too uses the Hive connector. If that confuses you, don’t worry, you are not alone. This blog aims to explain this commonly confusing nomenclature.</p>

<h3 id="hive-architecture">Hive architecture</h3>

<p>To understand the origins and inner workings of Trino’s Hive connector, you first need to know a few high-level components of the Hive architecture.</p>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/hive.png" alt=""/></p>

<p>You can simplify the Hive architecture to four components:</p>

<p><em>The runtime</em> contains the logic of the query engine that translates the SQL-esque Hive Query Language (HQL) into MapReduce jobs that run over files stored in the filesystem.</p>

<p><em>The storage</em> component is simply that: it stores files in various formats, along with index structures to recall them. The file formats range from ones as simple as JSON and CSV to more complex columnar formats like ORC and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed Filesystem (HDFS). As cloud-based options became more prevalent, object storage like Amazon S3, Azure Blob Storage, Google Cloud Storage, and others replaced HDFS as the storage component.</p>

<p>In order for Hive to process these files, it must have a mapping from SQL tables in <em>the runtime</em> to files and directories in <em>the storage</em> component. To accomplish this, Hive uses the Hive Metastore Service (HMS), often shortened to <em>the metastore</em>, to manage metadata about the files, such as table columns, file locations, and file formats.</p>

<p>The last component, not included in the image, is Hive’s <em>data organization specification</em>. This specification is documented only in the Hive code itself and has been reverse engineered by systems like Trino to remain compatible with Hive and with each other.</p>

<p>Trino reuses all of these components except for <em>the runtime</em>. This is the same approach most compute engines take when dealing with data in object stores, including Spark, Drill, and Impala. When you think of the Hive connector, think of a connector capable of reading data organized by the unwritten Hive specification.</p>

<h3 id="trino-runtime-replaces-hive-runtime">Trino runtime replaces Hive runtime</h3>

<p>In the early days of big data systems, many expected query turnaround to take a long time due to the high volume of unstructured data in ETL workloads. The primary goal in early iterations of these systems was simply throughput over large volumes of data while maintaining fault tolerance. Now, more businesses want to run fast interactive queries over their big data instead of running jobs that take hours and produce possibly undesirable results. Many companies have petabytes of data and metadata in their data warehouse. Data in storage is cumbersome to move, and the data in the metastore takes a long time to repopulate in other formats. Since only the runtime that executes Hive queries needs replacement, the Trino engine utilizes the existing metastore metadata and the files residing in storage, and the Trino runtime effectively replaces the Hive runtime responsible for analyzing the data.</p>

<h3 id="trino-architecture">Trino Architecture</h3>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/trino.png" alt=""/></p>

<h3 id="the-hive-connector-nomenclature">The Hive connector nomenclature</h3>

<p>Notice that the only change in the Trino architecture is <em>the runtime</em>. The HMS still exists, along with <em>the storage</em>. This is not by accident: this design addresses a common problem faced by many companies by simplifying the migration from Hive to Trino. Regardless of <em>the storage</em> component used, <em>the runtime</em> makes use of the HMS, and that is why this connector is called the Hive connector.</p>

<p>Where the confusion tends to come from is when you search for a connector from the context of the storage system you want to query. You may not even be aware that <em>the metastore</em> is a necessity, or that it exists. Typically, you look for an S3 connector, a GCS connector, or a MinIO connector. All you need is the Hive connector and the HMS to manage the metadata of the objects in your storage.</p>

<h3 id="the-hive-metastore-service">The Hive Metastore Service</h3>

<p>The HMS is the only Hive process used in the entire Trino ecosystem when using the Hive connector. The HMS is actually a simple service with a binary API using <strong><a href="https://thrift.apache.org/">the Thrift protocol</a></strong>. This service manages the metadata, stored in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There are also compatible replacements, such as AWS Glue, a drop-in substitute for the HMS.</p>
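<p>To make this concrete, here is a minimal sketch of a Trino catalog properties file wiring the Hive connector to an HMS over Thrift. The file name and hostname are assumptions for illustration; <code>connector.name</code> and <code>hive.metastore.uri</code> are the properties the Hive connector expects (older versions may use a different connector name, so check the docs for your release).</p>

<pre><code># etc/catalog/hive.properties (file name is an assumption)
connector.name=hive
# Thrift endpoint of the Hive Metastore Service; hostname and port assumed
hive.metastore.uri=thrift://hive-metastore:9083
</code></pre>

<p>Any catalog configured this way queries object storage through the Hive connector, no matter which storage system sits underneath.</p>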

<p><a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio">https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio</a></p>

<h3 id="getting-started-with-the-hive-connector-on-trino">Getting started with the Hive Connector on Trino</h3>

<p>To drive this point home, I <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio">created a tutorial that showcases using Trino and looking at the metadata it produces</a>. In the following scenario, the docker environment contains four docker containers:</p>
<ul><li><code>trino</code> - <em>the runtime</em> in this scenario that replaces Hive.</li>
<li><code>minio</code> - <em>the storage</em> is an open-source cloud object storage.</li>
<li><code>hive-metastore</code> - <em>the metastore</em> service instance.</li>
<li><code>mariadb</code> - the database that <em>the metastore</em> uses to store the metadata.</li></ul>

<p>You can play around with the system and optionally view the configurations. The scenario asks you to run a query to populate data in MinIO and then see the resulting metadata populated in MariaDB by the HMS. The next step asks you to run queries over the <code>mariadb</code> database which holds the generated metadata from <em>the metastore</em>.</p>
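<p>As an illustration of what you might find there, the standard HMS relational schema includes tables such as <code>DBS</code> and <code>TBLS</code>; a query along these lines shows the table metadata the HMS recorded (the <code>metastore_db</code> schema name is an assumption and may differ in your setup):</p>

<pre><code>SELECT TBL_ID, DB_ID, TBL_NAME, TBL_TYPE
FROM metastore_db.TBLS;
</code></pre>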

<p>If you have any questions or run into any issues with the example, you can find us on <a href="https://trino.io/slack.html">slack</a> on the <a href="https://bitsondata.dev/tag:dev" class="hashtag"><span>#</span><span class="p-category">dev</span></a> or <a href="https://bitsondata.dev/tag:general" class="hashtag"><span>#</span><span class="p-category">general</span></a> channels.</p>

<p>Have fun!</p>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/intro-to-hive.jpeg" alt=""/></p>

<p><a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio">https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio</a></p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:hive" class="hashtag"><span>#</span><span class="p-category">hive</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/a-gentle-introduction-to-the-hive-connector</guid>
      <pubDate>Wed, 21 Oct 2020 17:00:00 +0000</pubDate>
    </item>
    <item>
      <title>What is benchmarketing and why is it bad?</title>
      <link>https://bitsondata.dev/what-is-benchmarketing-and-why-is-it-bad?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[There’s something I have to get off my chest. If you really need to, just read the TLDR and listen to the Justin Bieber parody posted below. If you’re confused by the lingo, the rest of the post will fill in any gaps.&#xA;&#xA;TL;DR: Benchmarketing, the practice of using benchmarks for marketing, is bad. Consumers should run their own benchmarks and ideally open-source them instead of relying on an internal and biased report.&#xA;&#xA;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/FSy8V-R0Zw&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen/iframe&#xA;&#xA;!--more--&#xA;&#xA;For the longest time, I have wondered what is the point of corporations, specifically in the database sectors, running their own benchmarks. Would a company ever have any incentive to post results from a benchmark that didn’t show its own system winning in at least the majority of cases? I understand that these benchmarks have become part of the furniture we come to expect to see when visiting any hot new database’s website. I doubt anybody in the public domain gains much insight out of these results, to begin with, at least nothing they weren’t expecting to see.&#xA;&#xA;Now to be clear, I am in no way indicating that companies running their own internal benchmarks to analyze their own performance in comparison to their competitors is a bad thing. It’s when they take those results and intentionally skew the methods or data from these benchmarks for sales or marketing purposes that is the problem we’re discussing here. 
Vendors that take part in the practice, not only use these benchmarks to show their systems succeeding a little but rather perversely taint their methodology with settings, caching, and other performance enhancements while leaving their competition’s settings untouched.&#xA;&#xA;This should be obvious that this is NOT what benchmarking is about! If you read about the history of the Transaction Processing Performance Council (TPC) you come to understand that this is the very wrongdoing that the council was created to address. But like with any proxy involving measurements, the measurements are inherently pliable.&#xA;&#xA;  By the spring of 1991, the TPC was clearly a success. Dozens of companies were running multiple TPC-A and TPC-B results. Not surprisingly, these companies wanted to capitalize on the TPC’s cachet and leverage the investment they had made in TPC benchmarking. Several companies launched aggressive advertising and public relations campaigns based around their TPC results. In many ways, this was exactly why the TPC was created: to provide objective measures of performance. What was wrong, therefore, with companies wanting to brag about their good results? What was wrong is that there was often a large gap between the objective benchmark results and their benchmark marketing claims — this gap, over the years, has been dubbed “benchmarketing.” So the TPC was faced with an ironic situation. It had poured an enormous amount of time and energy into creating a good benchmark and even a good benchmark review process. However, the TPC had no means to control how those results were used once they were approved. The resulting problems generated intense debates within the TPC.&#xA;&#xA;This benchmarketing ultimately fails the clients that these companies are marketing to. 
It demonstrates not only a lack of care for addressing the users’ actual pain but a lack of respect by intentionally pulling the wool over their eyes simply in an attempt to mask that their performance isn’t up to par with their competitors. This leads to consumers not being able to make informed decisions as most of our decisions are made from gut instincts and human emotion which these benchmarks aim to manipulate.&#xA;&#xA;If you’re not sure exactly how a company would pull this off, an example of might be that database A enables using a cost-based optimizer that requires precomputing statistics about different tables involved in the computation, while database B is running a query against this table without any type of stats based optimization made available to it. Database A will clearly dominate as now it can reorder joins and apply better execution plans while database B is going to go with the simplest plan and run much slower in most scenarios. The company whose product depends on database A will then hone in on the numerical outcomes of this report. Even if they’re decent enough to report the methods they skewed to get these results, they bury it within their report and focus on advertising the outcome of what would otherwise be considered an absurd comparison. Companies will even go as far as to say that their competition’s database wasn’t straightforward to configure when they were setting up optimizations. If you’re not capable of understanding how to make equivalent changes to both systems, well then I guess you don’t get to run that comparison until you figure it out.&#xA;&#xA;Many think that consumers are not susceptible to such attacks and would be able to see right through this scheme, but these reports appeal to any of us when we don’t have the necessity or resources to thoroughly examine all the data. 
Many times we have to take cues from our gut when a decision needs to be made and the time to make it is constrained by our time and other business needs. We see this type of phenomenon described in the book, Thinking Fast and Slow by Daniel Kahneman. To briefly summarize the model they use, there are two modes that humans use when they reason about their decisions, System 1 and System 2.&#xA;&#xA;  Systems 1 and 2 are both active whenever we are awake. System 1 runs automatically and System 2 is normally in comfortable low-effort mode, in which only a fraction of its capacity is engaged. System 1 continuously generates suggestions for System 2: impressions, intuitions, intentions, and feelings. If endorsed by System 2, impressions and intuitions turn into beliefs, and impulses turn into voluntary actions. When all goes smoothly, which is most of the time, System 2 adopts the suggestions of System 1 with little or no modification. You generally believe your impressions and act on your desires, and that is fine — usually.&#xA;&#xA;No surprise, that’s usually the part where we get into trouble. While we like to think that we are generally thinking in the logical System 2 mode, we don’t have time or energy to live in this space for long periods throughout the day and we find ourselves very reliant on System 1 for much of our decision making.&#xA;&#xA;  The measure of success for System 1 is the coherence of the story it manages to create. The amount and quality of the data on which the story is based are largely irrelevant. When information is scarce, which is a common occurrence, System 1 operates as a machine for jumping to conclusions.&#xA;&#xA;This is why benchmarketing can be so dangerous because it is so effective at manipulating our belief in claims that simply aren’t true. These decisions affect how your architecture will unfold, your time-to-value, and lost hours for your team and customers. 
It makes having these systems that fairly compare the performance and merits of two systems all the more paramount.&#xA;&#xA;https://xkcd.com/882/&#xA;&#xA;So why am I talking about this now?&#xA;&#xA;I have become a pretty big fanboy of Trino, a distributed query engine that runs interactive queries from many sources. I have witnessed firsthand how fast a cluster of Trino nodes is able to process a huge amount of data at fast speeds. When you dive into how these speeds are achieved you find that this project is an incredible modern feat of solid engineering that makes interactive analysis over petabytes of data a reality. Going into all the reasons I like this project would be too tangential but it fuels the fire for why I believe this message needs to be heard.&#xA;&#xA;Recently there was a “benchmark” that came out comparing the performance Dremio and Trino (then Presto) open-source and enterprise versions, touting performance improvements over Trino by an amount that would have been called out as too high in a CSI episode insert canonical csi clip here. Trino isn’t the only system in the data space to come under similar types of attacks. It makes sense too, as this type of technical peacocking is common as it successfully gains attention.&#xA;&#xA;Luckily, as more companies strive to become transparent and associate themselves with open-source efforts, we are starting to see a relatively new pattern of open-source efforts emerge. Typically, you’re used to hearing about open-source within the context of software projects maintained by open-source communities. We are now arriving at the age of any noun being able to be used in an open-source framework. There is open-source music, open-source education, and even open-source data. So why not reach a point where open-source benchmarking through consumer collaboration is a thing? 
This is not just for the sake of the consumers of these technologies who simply want to have more data to inform their design choices to better serve their clients, it’s also unfortunate that this affects developer communities that are putting in a lot of hard work on these projects, only to have that hard work get berated unintelligibly by the likes of some corporate status competition.&#xA;&#xA;Now I’m clearly a little biased when I tell you that I think Trino is currently the best analytics engine on the market today. When I say this, you really should be skeptical too. Really, I encourage it. You should verify in some way beyond a shadow of a doubt that:&#xA;&#xA;Any TPC or other benchmarks are validated and no “magic” was used to improve their performance.&#xA;&#xA;using your own use cases to make sure the system you choose is going to meet the needs of your particular use case.&#xA;&#xA;While this may seem like a lot of work, with cloud infrastructure and the simplicity of deploying different systems into the cloud, it’s now more possible to do this today than it ever was even 10 years ago to run a benchmark of competing systems internally and at scale. Not only can this benchmark be run by your own unbiased data engineers who have more stake to find out which system best fits the companies’ needs, but you don’t have to rely on generic benchmarking data to analyze this if you don’t want. You can spin up these systems and let them query your system, using your use cases, and do it any way you want it.&#xA;&#xA;In summary, if consumers can work together, we can work to get rid of this specific type of misinformation while providing a richer more insightful analysis that will aid both companies and consumers. As I mention in the song above, go run the test yourselves.&#xA;&#xA;#trino #presto #opensource&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p>There’s something I have to get off my chest. If you’re short on time, just read the TL;DR and listen to the Justin Bieber parody posted below. If you’re confused by the lingo, the rest of the post will fill in any gaps.</p>

<p>TL;DR: Benchmarketing, the practice of skewing benchmarks for marketing purposes, is bad. Consumers should run their own benchmarks and ideally open-source them instead of relying on a vendor’s internal, biased report.</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/FSy8V-R0_Zw" frameborder="0" allowfullscreen=""></iframe>



<p>For the longest time, I have wondered what the point is of corporations, particularly in the database sector, running their own benchmarks. Would a company ever have any incentive to publish results from a benchmark that didn’t show its own system winning in at least the majority of cases? I understand that these benchmarks have become part of the furniture we expect to see when visiting any hot new database’s website, but I doubt anybody in the public gains much insight from these results, at least nothing they weren’t already expecting to see.</p>

<p>Now, to be clear, I am in no way suggesting that companies running internal benchmarks to compare their own performance against their competitors is a bad thing. The problem we’re discussing here is when they intentionally skew the methods or data behind those benchmarks for sales or marketing purposes. Vendors that take part in this practice don’t merely use benchmarks to show their systems in a favorable light; they perversely taint their methodology with tuned settings, caching, and other performance enhancements while leaving their competitors’ configurations untouched.</p>

<p>It should be obvious that this is NOT what benchmarking is about! If you read about the history of the <a href="http://www.tpc.org/information/about/history5.asp">Transaction Processing Performance Council (TPC)</a>, you come to understand that this is the very wrongdoing the council was created to address. But as with any measurement used as a proxy, the numbers are inherently pliable.</p>

<blockquote><p><em>By the spring of 1991, the TPC was clearly a success. Dozens of companies were running multiple TPC-A and TPC-B results. Not surprisingly, these companies wanted to capitalize on the TPC’s cachet and leverage the investment they had made in TPC benchmarking. Several companies launched aggressive advertising and public relations campaigns based around their TPC results. In many ways, this was exactly why the TPC was created: to provide objective measures of performance. What was wrong, therefore, with companies wanting to brag about their good results? What was wrong is that there was often a large gap between the objective benchmark results and their benchmark marketing claims — this gap, over the years, has been dubbed “benchmarketing.” So the TPC was faced with an ironic situation. It had poured an enormous amount of time and energy into creating a good benchmark and even a good benchmark review process. However, the TPC had no means to control how those results were used once they were approved. The resulting problems generated intense debates within the TPC.</em></p></blockquote>

<p>This benchmarketing ultimately fails the clients these companies are marketing to. It demonstrates not only a lack of care for addressing users’ actual pain but also a lack of respect, intentionally pulling the wool over their eyes to mask that the vendor’s performance isn’t up to par with that of its competitors. <strong>This leaves consumers unable to make informed decisions, since most of our decisions are driven by gut instinct and human emotion, which these benchmarks aim to manipulate.</strong></p>

<p>If you’re not sure exactly how a company would pull this off, an example might be that database A enables a cost-based optimizer that requires precomputing statistics about the tables involved in a query, while database B runs the same query without any statistics-based optimization made available to it. Database A will clearly dominate: it can reorder joins and choose better execution plans, while database B falls back to the simplest plan and runs much slower in most scenarios. The company behind database A then homes in on the numerical outcomes of this report. Even if they’re decent enough to disclose the methods they skewed to get these results, they bury that disclosure deep in the report and advertise the outcome of what would otherwise be considered an absurd comparison. Companies will even go as far as to claim that their competitor’s database wasn’t straightforward to configure when they were setting up optimizations. If you’re not capable of making equivalent changes to both systems, then you don’t get to run that comparison until you figure it out.</p>
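<p>As a toy illustration of that asymmetry, here’s a sketch using SQLite as a stand-in for the hypothetical databases A and B. Both get an identical table and index, but only one has its planner statistics precomputed with <code>ANALYZE</code>, the same kind of one-sided preparation described above:</p>

```python
import sqlite3

def prepare(db, collect_stats):
    """Build an identical table and index in each database."""
    db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, category INTEGER)")
    db.executemany("INSERT INTO t (category) VALUES (?)",
                   [(i % 10,) for i in range(1000)])
    db.execute("CREATE INDEX idx_category ON t (category)")
    if collect_stats:
        # ANALYZE precomputes planner statistics (stored in sqlite_stat1),
        # the kind of preparation a cost-based optimizer relies on.
        db.execute("ANALYZE")

db_a = sqlite3.connect(":memory:")  # "database A": statistics precomputed
db_b = sqlite3.connect(":memory:")  # "database B": left untuned
prepare(db_a, collect_stats=True)
prepare(db_b, collect_stats=False)

# Only database A has statistics for its query planner to use.
def has_stats(db):
    return db.execute(
        "SELECT count(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
    ).fetchone()[0]

print(has_stats(db_a), has_stats(db_b))  # 1 0
```

<p>A fair comparison would run the statistics collection on both systems, or on neither; running it on only one is exactly the skew being criticized here.</p>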

<p>Many think that consumers are not susceptible to such tactics and would see right through the scheme, but these reports appeal to any of us when we don’t have the time or resources to thoroughly examine all the data. We often have to take cues from our gut when a decision must be made under the constraints of time and other business needs. This phenomenon is described in the book <em>Thinking, Fast and Slow</em> by Daniel Kahneman. To briefly summarize the model he uses, there are two modes humans use when reasoning about decisions, System 1 and System 2.</p>

<blockquote><p><em>Systems 1 and 2 are both active whenever we are awake. System 1 runs automatically and System 2 is normally in comfortable low-effort mode, in which only a fraction of its capacity is engaged. System 1 continuously generates suggestions for System 2: impressions, intuitions, intentions, and feelings. If endorsed by System 2, impressions and intuitions turn into beliefs, and impulses turn into voluntary actions. When all goes smoothly, which is most of the time, System 2 adopts the suggestions of System 1 with little or no modification. You generally believe your impressions and act on your desires, and that is fine — usually.</em></p></blockquote>

<p>No surprise, that’s usually where we get into trouble. While we like to think we generally operate in the logical System 2 mode, we don’t have the time or energy to live in that space for long stretches of the day, and we end up relying on System 1 for much of our decision making.</p>

<blockquote><p><em>The measure of success for System 1 is the coherence of the story it manages to create. The amount and quality of the data on which the story is based are largely irrelevant. When information is scarce, which is a common occurrence, System 1 operates as a machine for jumping to conclusions.</em></p></blockquote>

<p>This is why benchmarketing is so dangerous: it is remarkably effective at manipulating our belief in claims that simply aren’t true. These decisions affect how your architecture unfolds, your time-to-value, and lost hours for your team and customers. That makes systems that fairly compare the performance and merits of competing products all the more paramount.</p>

<p><img src="https://imgs.xkcd.com/comics/significant.png" alt="https://xkcd.com/882/"/></p>

<p>So why am I talking about this now?</p>

<p>I have become a pretty big fanboy of Trino, a distributed query engine that runs interactive queries across many data sources. I have witnessed firsthand how quickly a cluster of Trino nodes can process huge amounts of data. When you dive into how these speeds are achieved, you find that the project is an incredible feat of modern engineering that makes interactive analysis over petabytes of data a reality. Going into all the reasons I like this project would be too tangential, but it fuels the fire for why I believe this message needs to be heard.</p>

<p>Recently there was a <a href="https://web.archive.org/web/20211022224127/https://www.dremio.com/dremio-vs-presto/">“benchmark”</a> comparing the performance of the open-source and enterprise versions of Dremio and Trino (<a href="https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html">then Presto</a>), touting performance improvements over Trino by an amount that would have been called out as too high in <a href="https://www.youtube.com/watch?v=hkDD03yeLnU">a CSI episode</a>. <a href="https://blog.yugabyte.com/yugabytedb-vs-cockroachdb-bringing-truth-to-performance-benchmark-claims-part-1/">Trino isn’t the only system in the data space to come under this kind of attack.</a> It makes sense, too: this kind of technical peacocking is common because it successfully gains attention.</p>

<p>Luckily, as more companies strive to be transparent and associate themselves with open-source efforts, we are starting to see a relatively new pattern emerge. Typically, you hear about open source within the context of software projects maintained by open-source communities. We are now arriving at an age where nearly any noun can be open-sourced: there is open-source music, open-source education, and even open-source data. So why not reach a point where open-source benchmarking through consumer collaboration is a thing? It would serve not just the consumers of these technologies, who simply want more data to inform the design choices that serve their clients, but also the developer communities putting hard work into these projects, only to have that work unintelligibly berated in some corporate status competition.</p>

<p>Now, I’m clearly a little biased when I tell you that I think Trino is currently the best analytics engine on the market. You should be skeptical of that claim. Really, I encourage it. You should verify beyond a shadow of a doubt that:</p>
<ol><li><p>Any TPC or other benchmark results are validated and no “magic” was used to improve their performance.</p></li>

<li><p>The system you choose is tested against your own use cases, to make sure it meets your particular needs.</p></li></ol>

<p>While this may seem like a lot of work, cloud infrastructure and the simplicity of deploying different systems into the cloud make it more feasible than it was even 10 years ago to run a benchmark of competing systems internally and at scale. Such a benchmark can be run by your own unbiased data engineers, who have a real stake in finding the system that best fits your company’s needs, and you don’t have to rely on generic benchmarking data if you don’t want to. You can spin up these systems, let them query your data using your use cases, and do it any way you want.</p>

<p>In summary, if consumers work together, we can get rid of this particular type of misinformation while providing a richer, more insightful analysis that aids both companies and consumers. As I mention in the song above, go run the test yourselves.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:presto" class="hashtag"><span>#</span><span class="p-category">presto</span></a> <a href="https://bitsondata.dev/tag:opensource" class="hashtag"><span>#</span><span class="p-category">opensource</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/what-is-benchmarketing-and-why-is-it-bad</guid>
      <pubDate>Sat, 12 Sep 2020 17:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>