<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>bits on data</title>
    <link>https://bitsondata.dev/</link>
    <description>&gt;_ the imposter&#39;s guide to software, data, and life</description>
    <pubDate>Thu, 09 Apr 2026 11:52:40 +0000</pubDate>
    <image>
      <url>https://i.snap.as/vWVqkBBl.png</url>
      <title>bits on data</title>
      <link>https://bitsondata.dev/</link>
    </image>
    <item>
      <title>Humans aren&#39;t going anywhere</title>
      <link>https://bitsondata.dev/humans-arent-going-anywhere?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[The Human in the loop series&#xA;&#xA;This is the first post in a larger human in the loop series that takes a social libertarian stance on AI and argues why humans are relevant, if not required, for the wave of AI technology we experience today. Although this view contradicts many proponents of the current AI wave, I also want to lend credibility to the utility of generative AI, more specifically the transformer architecture.&#xA;&#xA;In my view, the value of transformers lies more in their extension of information retrieval (i.e. search) technologies as opposed to the marketed goal of building a centralized superintelligence that requires large amounts of our data and unsustainable energy demands. I believe that, if done correctly, this technology could set the stage to bring humans back into the loop for understanding, sharing with consent, and indexing our own precious information, unlike the post-Google decades that brought us to surveillance capitalism.&#xA;&#xA;The GenAI-to-AGI dichotomy&#xA;&#xA;Generative artificial intelligence (GenAI) is a new wave of artificial intelligence that gained the public&#39;s attention in late 2022 when OpenAI released their ChatGPT application. GenAI uses the transformer architecture to train on large amounts of human language scraped from the internet to develop systems that mimic expected language. Despite their convincing prose, these systems lack any human-like intuition or model of the world, yet are convincing enough to fool most people into believing there is real understanding and critical thinking behind what is asked of them. This has split public opinion to either see humans as essential for any AI application to work properly, or see human intelligence as soon-to-be replaced entirely by GenAI. 
&#xA;&#xA;Artificial General Intelligence (AGI) is a theory suggesting that once a computer system is given a sufficient amount of training data, it will be able to outperform humans in nearly every task and update its own understanding, or world model, without the need for human intervention. Unfortunately, the GenAI-to-AGI debate has mirrored our tribal political environment, causing many to support or oppose the use and development of GenAI altogether.&#xA;&#xA;Gary Marcus, a psychologist at NYU and early skeptic of the utility of today&#39;s LLM-based AI systems, and OpenAI CEO Sam Altman in 2023. Photo by: Andrew Caballero-Reynolds/AFP—Getty Images&#xA;&#xA;AGI skeptics against GenAI investment correctly see current implementations like ChatGPT as exploitative of the humans who created both public and private work without consent, energy hungry and unsustainable, failing to create real value, creating real job loss, and causing mental health problems for its users.&#xA;&#xA;Those who believe GenAI will achieve AGI tend to focus on theoretical benefits and often see these downsides as inevitable and temporary precursors to AI&#39;s pending value towards global human progress. This group is made up of techno-optimists who generally lack accessible education on how GenAI works, CEOs investing in AI due to groupthink consensus, and others who often profit off of the AGI narrative in one way or another.&#xA;&#xA;There is, of course, a group of technophobic folks who also believe GenAI will become AGI but fall into the anti-GenAI category out of fear that the technology will evolve itself to wipe out humanity. I won&#39;t lie, there was a week where I felt this way before I educated myself on how the parlor trick worked. We can add this diminishing group to the anti-GenAI group. 
Let&#39;s then focus on the group who allegedly believes GenAI still could become AGI and views it as a positive for humanity.&#xA;&#xA;In an IBM study that questioned 2,000 CEOs, two-thirds of the participants acknowledged they are taking a risk on AI and don&#39;t have a clear understanding of how it would add value to their company. A little over a third of the CEOs believe, &#34;it’s better to be &#39;fast and wrong&#39; than &#39;right and slow&#39; when it comes to technology adoption.&#34; An even smaller group among the GenAI-to-AGI crowd are the techno-elitists who have a vested stake in selling the notion of AGI, as it incentivizes large, centralized AI systems to become ubiquitous and build a monopoly in a domain with few regulations and little public understanding. Further, the AGI narrative has spurred on a nationalist technology race between the United States, China, and other countries, adding a sense of urgency and an impetus to ignore the impact of the scaling war on the climate and energy needs of local populations.&#xA;&#xA;Like many in the AI community, I believe generative language models provide value and should be researched — but building AGI superintelligence as the north star with unrestricted resource consumption is where I believe we must draw the line. This is actually where a lot of nuance between the two extreme viewpoints lives, and it&#39;s exactly where the conversation needs to be today. What is important to understand is that GenAI is not synonymous with developing superintelligence, displacing humans from jobs, hoarding energy resources from communities, or enabling capitalists to profiteer off our personal information. 
Despite what either side believes, businesses and investors are running low on patience, and there would need to be a drastic boost in the performance of the next large model for Silicon Valley to continue this experiment.&#xA;&#xA;The business use case for artificial intelligence&#xA;&#xA;If you look at the economy and the language used by company leadership when describing their investment in AI, the sentiment becomes clear. &#34;65% (of CEOs) say their organization will use automation to address future skill gaps&#34;, which signals their intent to use GenAI to cut their staff budget. Yet, the MIT NANDA project just released a report that found that, &#34;despite $30–40 billion in enterprise investment into GenAI...95% of organizations are getting zero return.&#34;&#xA;&#xA;The report continues to point out:&#xA;&#xA;  the primary factor keeping organizations on the wrong side of the GenAI Divide is the learning gap, tools that don&#39;t learn, integrate poorly, or match workflows. Users prefer ChatGPT for simple tasks, but abandon it for mission-critical work due to its lack of memory. What&#39;s missing is systems that adapt, remember, and evolve, capabilities that define the difference between the two sides of the divide.&#xA;&#xA;What I found most striking about this report is its focus on value to mission-critical workflows that rely on carbon-based employees with domain experience to keep them running. The receipts proving that AI will actually replace us are not showing up, due to the lack of leadership guiding current staff to retrofit AI into the production workflows that rely on human intuition. Ironically, this report mentions that most employees found more value in simpler interfaces like ChatGPT than in inflexible domain-specific tooling that wasn&#39;t able to adapt to a company&#39;s specific needs. That said, simpler interfaces were only useful for the most basic tasks as opposed to complex or domain-specific work. 
&#xA;&#xA;The 5% of pilots that were successful in domain-specific workloads, &#34;focus on narrow but high-value use cases, integrate deeply into workflows, and scale through continuous learning rather than broad feature sets. Domain fluency and workflow integration matter more than flashy UX.&#34; This indicates that there is clearly value to this technology, but that the ROI is evident only in narrow AI use cases as opposed to general ones. Without true ROI for businesses, the centralized super-intelligent system will no longer justify the investment required to fund the costly and energy-inefficient scaling of larger and larger models.&#xA;&#xA;Where the buck stops&#xA;&#xA;I would like to challenge three sentiments promoted by the current AI companies that are essential to the hype and high valuations:&#xA;&#xA;Language models must be &#34;large&#34; in their training data in order to achieve the value companies see with GenAI today.&#xA;Language models must be proprietary and centralized to perform the best in comparison to open models.&#xA;Language models do not need humans to evolve their understanding of the world.&#xA;&#xA;Does infinite scale mean infinitely better performance?&#xA;&#xA;Cal Newport recounted in his recent New Yorker article that the original premise, set by an OpenAI paper from 2020, argued that language models would likely perform much better at virtually any task if trained on larger data sets. The release of GPT-3 provided compelling evidence, as it used ten times more data and saw drastically improved performance over the prior GPT-2 model. This evidence was sufficient for venture capital investors to fund further testing, and it offered a rather convenient narrative for centralizing and profiting off the technology. 
&#xA;&#xA;If OpenAI&#39;s theory were correct, they could build a system with the potential to outperform humans on any task, one attainable only with more money, more computing power, and more data. The continued success of OpenAI&#39;s larger GPT-4 model boosted the momentum of this belief and had investors throwing money at multiple companies in hopes they would invest in the one that made it to AGI first.&#xA;&#xA;This was only to be followed by years of incremental releases from OpenAI&#39;s competitors of bigger models that barely performed better, and by OpenAI&#39;s continuous delays of the GPT-5 release. The AI industry began expressing doubt that scaling would consistently provide exponentially growing results. Two and a half years after the GPT-4 release, the unimpressive release of GPT-5 left the question on everyone&#39;s mind, &#34;Was GPT-4 the peak size and performance for transformer-based models?&#34;&#xA;&#xA;This question causes a great deal of tension today as the larger-means-better approach starts to unravel. The entire viability of Nvidia&#39;s surge in valuation rests on the truth of this scaling power law. If models don&#39;t require increasingly larger datasets to perform better, then Nvidia&#39;s sales will stagnate once companies have a sufficient number of GPUs to train and fine-tune existing open models. It&#39;s not that Nvidia won&#39;t sell a lot of GPUs in the near future, it&#39;s that the number has a clear finite bound. This can be further constrained as open models provide a clear path to building custom models without the entry fee of starting from scratch.&#xA;&#xA;Open models prove size isn&#39;t always what matters&#xA;&#xA;Open source has become part of the strategic landscape, and it is how companies like Meta diminish OpenAI&#39;s advantage: by growing a community around an open alternative, as they did with the LLaMA model. 
Open language models come in all domains and sizes and are widely available on platforms like Kaggle and HuggingFace. Some models obfuscate their training and release only model weights for use and limited reuse, while others open their entire training sets and publish their methods to communities like OpenML or MLCommons. A study from late 2024 compared the performance of open models like LLaMA 2, Phi 2, and Mistral OpenOrca against OpenAI&#39;s GPT-3.5 and GPT-4, and concluded:&#xA;&#xA;  Open-source models, while showing slightly lower scores in terms of precision and speed, offer an interesting alternative due to their deployment flexibility. For example, Mistral-7b-OpenOrca achieved an 83% exact match and a ROUGE-2 score of 80%, while LLaMA-2 showed a 76% exact match, proving their competitiveness in controlled and secure environments. These open-source models, with their optimized attention mechanisms and adjusted quantization configurations, show that they can compete with proprietary models while allowing companies to customize the models according to their specific needs. These models represent viable and cost-effective solutions for sectors where data privacy and the ability to deploy on private infrastructures are essential.&#xA;&#xA;Open models make it possible for more companies to outperform proprietary models with the use of retrieval augmentation or fine-tuning methods. This raised a question for companies that value cost-efficiency: how well do smaller fine-tuned transformer models work, with or without other attention tuning like retrieval augmentation?&#xA;&#xA;Fine-tuned small language models (SLMs) have also begun to outperform large language models, from traditional tasks like text classification to domain-specific edge models and coding tasks, with lower operating costs and higher ROI. This fine-tuning and RAG optimization doesn&#39;t come for free, but its benefits compound for users and it avoids lock-in to proprietary models. 
Of the 5% of domain-specific workloads that succeeded in the MIT study, the collaborations between domain specialists and AI engineers provided substantial benefit by making the mechanisms of generative AI more transparent to knowledge workers, increasing their trust in these tools and their confidence in using them to automate other workflows. Domain specialists can then feed updates of shifting institutional knowledge into a company&#39;s custom model while bringing more efficient workloads into the scaffold of mission-critical tasks.&#xA;&#xA;It&#39;s hard to say that there will never be a case for large models; perhaps everyone will still want ChatGPT, only cheaper and less AGI-focused. If more research is placed into pushing the performance of small language models towards the optimal performance of GPT-4, then perhaps there is still a valid, albeit far smaller, use case for these general models that could be offered as a service to more and more companies. This does put a nail in the coffin of mindless scaling as a means to generate higher value for consumers.&#xA;&#xA;Will SLMs and open models kill centralized or proprietary models?&#xA;&#xA;The growing research using open models has substantially reduced the overall necessity, and therefore the value, of a centralized platform. If training does not require substantial investment in a large data center, there would be more long-term value for companies with enough use cases to invest in developing internal tools, or to use existing open ones, to create and run domain-specific models. As I pointed out in Your Own Private AI, any consumer these days can run open models locally and use RAG to draw on their own information. This makes it nearly impossible for a single AI company like OpenAI to become the next Google monopoly in information networks.&#xA;&#xA;In contrast, this opens up room for a new market of smaller proprietary and domain-specific AI tools. 
This may look similar to the current proliferation of LLM wrapper companies, with the difference that they take the time to think through and develop their own foundational models to address specific needs. The models won&#39;t necessarily need to be large, but rather, work incredibly well within their domain. The only centralization of knowledge would exist for that domain; it is no longer trying to get data from every corner of the planet to solve all problems. This ends up looking like a slightly more generalized version of narrow or traditional AI tooling. This tooling may be so unremarkable within its domain that the application may not even be branded as an AI tool, but rather serve some features of an application, much like Markov chains give us autocorrect on our phones.&#xA;&#xA;What about the job market?&#xA;&#xA;As the narrative that all human value can be produced better and faster by a centralized superintelligence fades into the next AI winter, we are left with an insecure feeling: AGI didn&#39;t happen this time, but what about the next? Because GenAI was such a convincing parlor trick, it spurred on a lot of conversations and new research across experts in anthropology, neuroscience, computer science, and artificial intelligence. It became very important for us to understand how we humans think, learn, and most importantly, what we take for granted about our organic and specifically human cognition.&#xA;&#xA;I personally took solace in learning that large language models emulate language, but they don&#39;t model a world view, nor are they guided by biological stimuli like emotions that factor into their learning. I&#39;m not saying that computers couldn&#39;t outpace us in some ways, but this drove me to appreciate the vast complexity of our own experience: we rarely practice metacognition because we are always thinking, much like we always breathe and always have a heartbeat. 
What also becomes clear when looking at this is how early the research is on understanding the connection between our thoughts and the biological matter we&#39;ve evolved over millennia. I&#39;d like to share one aspect of human cognition we do understand that already falls outside of what is modeled by a language model: the human nervous system&#39;s role in conscious thought.&#xA;&#xA;Language models can&#39;t emulate human cognition&#xA;&#xA;I recently learned about the Enteric Nervous System (ENS), our &#34;second brain&#34;, which is a mesh-like system of neurons that controls our gastrointestinal tract. It can control digestive function entirely on its own without signals from our brain or the central nervous system. It is responsible for 90% of the serotonin and 50% of the dopamine generated in our body, which has a large effect on our emotional state. It is why we use the language of &#34;gut feelings&#34; to describe the intuition that guides our decision-making. It also sends signals that make us cranky and contributes to worse decision-making when we are hungry. Because humans aren&#39;t just brains with fingers, and we have an entire body that dictates how we think and learn, we must consider how the many complex systems of human anatomy factor into our experience when making a choice. This is just one of many things that make human thinking distinct from AI, and why we need to avoid comparing language emulation models to human thought.&#xA;&#xA;If that weren&#39;t enough, there is a much more complex way our behaviors are affected by our environment, culture, social interactions, and the information we consume. When we see AI generate false information with no model to verify it against (i.e. &#34;hallucinate&#34;), it is clear that AI is only working in limited language or sensory dimensions, such as image and video, to generate something that is plausible or possible, but not likely correct. 
There&#39;s a lot of great reading in AI papers from the 2000s that aimed to create frameworks like Distributed Cognition, not to build an AI model around them, but simply to create a vocabulary for the complex cognition seen in animals and to clarify which taxonomies of cognition were being emulated and which weren&#39;t. It&#39;s a stark reminder of how anthropomorphism and confirmation bias can cause us to be reductive about the complex mechanisms of our cognition. Although it&#39;s still possible that one day we will build AGI that may or may not think like a human, it most definitely won&#39;t be agents or language models. I believe most, if not all, jobs are safe from being replaced by language models.&#xA;&#xA;In the post-LLM hype, I anticipate a rise in the job market. Transformer-based model research will live on in AI academia and in business, and we&#39;ll see an explosion in developments around open training and open data sets. There will be a larger focus on developing domain-specific smaller models with fine-tuning. There will also be a large interest in embedded AI for IoT devices and lower energy consumption. This type of AI economy makes the knowledge of every human valuable. It will take time, and it will require consent if we do this right.&#xA;&#xA;Companies should still invest in AI, but gradually, following the techniques of the few companies that have had success. This will involve bringing back the domain experts and human resources AI suggested we could replace, and instead training them on how to effectively use AI in their workflow. As employee knowledge becomes important, companies must prioritize proper documentation and knowledge work. If leaders focus on proper incentives to capture workflows across teams through internal wikis, software logic mapping and validation, ops reports, meeting summaries, and the like, 
GenAI has a lot of potential to lower the burden of time-consuming communication gaps and can provide a lot of information for experts to share knowledge with newcomers and remain productive within their role. This all comes down to telling the right stories around how we enable individuals to do their best work and allow their unique experiences and talents to shape the larger company organism into a more efficient being.&#xA;&#xA;GenAI and Search&#xA;&#xA;The most pervasive and influential information technology we know today is search, specifically Google&#39;s search, initially powered by PageRank. Though there are general open source search engines like Solr, Lucene, and Elasticsearch, meta search engines like SearXNG, and more modern vector-search hybrids like Meilisearch, these only provide the mechanisms of how search works; they are missing the gargantuan amount of data that Google possesses through its search monopoly, data-collecting products, adware metrics, and its highly adopted web browser, all of which feed into providing context for search. The slow, coercive shift of society granting Google its incredible influence over how humans around the globe mentally model and obtain knowledge also shaped the way netizens structure their information to be found.&#xA;&#xA;Much of the GenAI training data was procured without consent from large troves of publicly available data. This ranged from information scraped off forum sites like Reddit and StackOverflow to small individual blogs. All of these were used to feed the large data needs of the LLMs. As we sit here in the aftermath of this technology, I think it&#39;s incredibly important that we reflect on how we structure our information as an internet society. There is clearly power in the conversations we have and the information we produce. 
It&#39;s important that the exploitative practices of companies like OpenAI and Google drive us towards safeguarding personal information, while ensuring our intellectual assets remain available to the public. We should continue building consensual information sharing that benefits everyone, while safeguarding personal information from private actors who can leave the public vulnerable in the current political climate.&#xA;&#xA;Many in open source have started developing open alternatives that enable us to create and use our information on our own terms. Social media alternatives have grown through Fediverse technologies such as Mastodon and Pixelfed. There are also user-driven search systems such as USENET that enable users to curate their own search indexes and share them with others. This could build information networks that create an information economy expanding our existing peer-to-peer blog funding economies: you would be funded for curating valuable information within your expertise. However, today&#39;s average netizen would consider a system like USENET complex and unusable, as it was created before social media and personal phones shaped how users find information.&#xA;&#xA;Much of the current challenge posed to open designers and developers is to match current search mental models with those of a new system that would also link across sites through semantic web standards. Those who understand how to build these economies can make some early examples themselves and train future generations of individuals and corporations on how to manage and profit off their own open digital gardens. 
I believe democratizing the ownership of how information flows on the internet will break up information bubbles and enable us to share the information we want while maintaining our privacy where we want.&#xA;&#xA;Nobody can do that alone, but it is possible if we start to pool our resources together and build something that has the distribution of USENET, the UX search model of Google, the interoperability of the semantic web, and the ability for folks to work on a repository of documents like Wikipedia, but as many small wikis that can reference each other. With proper design and consent mechanisms, humans will want to collectively curate the valuable information in our own heads into shared value for globally interconnected local economies.&#xA;&#xA;We&#39;ll dive more into some of these fundamentals in the rest of this series. The next post will dive into the internals of search technology and its relevance to GenAI.]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="the-human-in-the-loop-series">The Human in the loop series</h2>

<p>This is the first post in a larger human in the loop series that takes a social libertarian stance on AI and argues why humans are relevant, if not required, for the wave of AI technology we experience today. Although this view contradicts many proponents of the current AI wave, I also want to lend credibility to the utility of generative AI, more specifically <a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)">the transformer architecture</a>.</p>

<p>In my view, the value of transformers lies more in their extension of information retrieval (i.e. search) technologies as opposed to the marketed goal of building a centralized superintelligence that requires large amounts of our data and unsustainable energy demands. I believe that, if done correctly, this technology could set the stage to bring humans back into the loop for understanding, sharing with consent, and indexing our own precious information, unlike the post-Google decades that brought us to surveillance capitalism.</p>



<h3 id="the-genai-to-agi-dichotomy">The GenAI-to-AGI dichotomy</h3>

<p><a href="https://en.wikipedia.org/wiki/Generative_artificial_intelligence">Generative artificial intelligence (GenAI)</a> is a new wave of artificial intelligence that gained the public&#39;s attention in late 2022 when OpenAI released their ChatGPT application. GenAI uses the transformer architecture to train on large amounts of human language scraped from the internet to develop systems that mimic expected language. Despite their convincing prose, these systems lack any human-like intuition or model of the world, yet are convincing enough to fool most people into believing there is real understanding and critical thinking behind what is asked of them. This has split public opinion to either <a href="https://garymarcus.substack.com/p/generative-ais-crippling-and-widespread">see humans as essential for any AI application</a> to work properly, or <a href="https://archive.ph/CKYh6">see human intelligence as soon-to-be replaced</a> entirely by GenAI.</p>

<p><a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">Artificial General Intelligence (AGI)</a> is a theory suggesting that once a computer system is given a sufficient amount of training data, it will be able to outperform humans in nearly every task and update its own understanding, or world model, without the need for human intervention. Unfortunately, the GenAI-to-AGI debate has mirrored our tribal political environment, causing many to support or oppose the use and development of GenAI altogether.</p>

<p><img src="https://i.snap.as/wWvboSHQ.jpg" alt=""/>
<em>Gary Marcus, a psychologist at NYU and early skeptic of the utility of today&#39;s LLM-based AI systems, and OpenAI CEO Sam Altman in 2023. Photo by: Andrew Caballero-Reynolds/AFP—Getty Images</em></p>

<p>AGI skeptics against GenAI investment correctly see current implementations like ChatGPT as <a href="https://futurism.com/meta-copyrighted-books-no-value">exploitative of the humans who created both public and private work</a> without consent, <a href="https://www.technologyreview.com/2025/05/20/1116272/ai-natural-gas-data-centers-energy-power-plants/">energy hungry</a> and <a href="https://www.technologyreview.com/2025/05/20/1116287/ai-data-centers-nevada-water-reno-computing-environmental-impact/">unsustainable</a>, <a href="https://futurism.com/ceos-return-ai-investments">failing to create real value</a>, creating <a href="https://www.cbsnews.com/news/ai-jobs-layoffs-us-2025/">real job loss</a>, and <a href="https://www.nbcnews.com/tech/tech-news/chatgpt-adds-mental-health-guardrails-openai-announces-rcna222999">causing mental health problems for its users</a>.</p>

<p>Those who believe GenAI will achieve AGI tend to focus on theoretical benefits and often see these downsides as inevitable and temporary precursors to AI&#39;s pending value towards global human progress. This group is made up of techno-optimists who generally lack accessible education on how GenAI works, CEOs investing in AI due to <a href="https://en.wikipedia.org/wiki/Group_decision-making">groupthink</a> consensus, and others who often profit off of the AGI narrative in one way or another.</p>

<p>There is, of course, a group of technophobic folks who also believe GenAI will become AGI but fall into the anti-GenAI category out of fear that the technology will evolve itself to wipe out humanity. I won&#39;t lie, there was a week where I felt this way before I educated myself on how the parlor trick worked. We can add this diminishing group to the anti-GenAI group. Let&#39;s then focus on the group who allegedly believes GenAI still could become AGI and views it as a positive for humanity.</p>

<p>In an <a href="https://newsroom.ibm.com/2025-05-06-ibm-study-ceos-double-down-on-ai-while-navigating-enterprise-hurdles">IBM study that questioned 2,000 CEOs</a>, two-thirds of the participants acknowledged they are taking a risk on AI and don&#39;t have a clear understanding of how it would add value to their company. A little over a third of the CEOs believe, “it’s better to be &#39;fast and wrong&#39; than &#39;right and slow&#39; when it comes to technology adoption.” An even smaller group among the GenAI-to-AGI crowd are the techno-elitists who have a vested stake in selling the notion of AGI, as it incentivizes large, centralized AI systems to become ubiquitous and build a monopoly in a domain with few regulations and little public understanding. Further, the AGI narrative has spurred on a nationalist technology race between the United States, China, and other countries, adding a sense of urgency and an impetus to ignore the impact of the scaling war on the climate and energy needs of local populations.</p>

<p>Like many in the AI community, I believe generative language models provide value and should be researched — but <a href="https://arxiv.org/pdf/2502.03689">building AGI superintelligence as the north star</a> with unrestricted resource consumption is where I believe we must draw the line. This is actually where a lot of nuance between the two extreme viewpoints lives, and it&#39;s exactly where the conversation needs to be today. What is important to understand is that GenAI is not synonymous with developing superintelligence, displacing humans from jobs, hoarding energy resources from communities, or enabling capitalists to profiteer off our personal information. Despite what either side believes, businesses and investors are running low on patience, and there would need to be a drastic boost in the performance of the next large model for Silicon Valley to continue this experiment.</p>

<h2 id="the-business-use-case-for-artificial-intelligence">The business use case for artificial intelligence</h2>

<p>If you look at the economy and the language used by company leadership when describing their investment in AI, the sentiment becomes clear. “65% <a href="https://newsroom.ibm.com/2025-05-06-ibm-study-ceos-double-down-on-ai-while-navigating-enterprise-hurdles">(of CEOs)</a> say their organization will use automation to address future skill gaps”, which signals their intent to use GenAI to cut their staff budget. Yet, the <a href="https://nanda.media.mit.edu/">MIT NANDA project</a> just <a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">released a report</a> that found that, “despite $30–40 billion in enterprise investment into GenAI...95% of organizations are getting zero return.”</p>

<p><img src="https://i.snap.as/08o9HnYs.png" alt=""/></p>

<p>The report goes on to point out:</p>

<blockquote><p>the primary factor keeping organizations on the wrong side of the GenAI Divide is the learning gap, tools that don&#39;t learn, integrate poorly, or match workflows. Users prefer ChatGPT for simple tasks, but abandon it for mission-critical work due to its lack of memory. What&#39;s missing is systems that adapt, remember, and evolve, capabilities that define the difference between the two sides of the divide.</p></blockquote>

<p>What I found most striking about this report is its focus on value to mission-critical workflows that rely on carbon-based employees with domain experience to keep them running. The receipts proving AI will actually replace us are not showing up, owing to the lack of leadership guiding current staff to retrofit AI into production workflows in place of their human intuition. Ironically, the report mentions that most employees got more value from a simpler interface like ChatGPT than from inflexible domain-specific tooling that wasn&#39;t able to adapt to a company&#39;s specific needs. That said, simpler interfaces were only useful for the most basic tasks, as opposed to complex or domain-specific work.</p>

<p><img src="https://i.snap.as/AOl5lxck.png" alt=""/></p>

<p>The 5% of pilots that succeeded in domain-specific workloads <em>“focus on narrow but high-value use cases, integrate deeply into workflows, and scale through continuous learning rather than broad feature sets. Domain fluency and workflow integration matter more than flashy UX.”</em> This indicates that there is clearly value in this technology, but the ROI is evident only in narrow AI use cases, not general ones. Without true ROI for businesses, the centralized super-intelligent system will no longer justify the funding required for the costly and energy-inefficient scaling of ever-larger models.</p>



<h2 id="where-the-buck-stops">Where the buck stops</h2>

<p>I would like to challenge three sentiments pushed by the current AI companies that are essential to the hype and their high valuations:</p>
<ol><li>Language models must be “large” in their training data in order to achieve the value companies see with GenAI today.</li>
<li>Language models must be proprietary and centralized to perform the best in comparison to open models.</li>
<li>Language models do not need humans to evolve their understanding of the world.</li></ol>

<h3 id="does-infinite-scale-mean-infinitely-better-performance">Does infinite scale mean infinitely better performance?</h3>

<p>Cal Newport recounted <a href="https://archive.ph/HKYjM">in his recent New Yorker article</a> that the original premise, set by <a href="https://arxiv.org/abs/2001.08361">an OpenAI paper from 2020</a>, argued that language models would likely perform much better at virtually any task if trained on larger data sets. The release of GPT-3 provided compelling evidence, as it used ten times more data and saw drastically improved performance over the prior GPT-2 model. This evidence was sufficient for venture capital investors to fund further testing, and it made for a rather convenient narrative for centralizing and profiting off the technology.</p>

<p>If OpenAI&#39;s theory was correct, they could build a system with the potential to outperform humans on any task, and it was only attainable with more money, more computing power, and more data. The continued success of OpenAI&#39;s larger GPT-4 model boosted the momentum of this belief and had investors throwing money at multiple companies in hopes they would be backing the one that made it to AGI first.</p>

<p>This was only to be followed by years of incremental releases from OpenAI&#39;s competitors of bigger models that barely performed better, and by OpenAI&#39;s continuous delays of the GPT-5 release. The AI industry began expressing doubt that scaling would consistently provide exponentially growing results. Two and a half years after the GPT-4 release, <a href="https://tech.yahoo.com/ai/articles/evidence-grows-gpt-5-bit-151457162.html">the unimpressive release of GPT-5</a> left the question on everyone&#39;s mind: “Was GPT-4 the peak size and performance for transformer-based models?”</p>

<p>This question causes a great deal of tension today as the larger-means-better approach starts to unravel. The entire viability of Nvidia&#39;s surge in valuation rests on the truth of this scaling power law. If models don&#39;t require increasingly larger datasets to perform better, then Nvidia&#39;s sales will stagnate once companies have a sufficient number of GPUs to train and fine-tune existing open models. It&#39;s not that Nvidia won&#39;t sell a lot of GPUs in the near future, it&#39;s that the number has a clear finite bound. The bound is further constrained as open models provide a clear path to build custom models without the entry fee of starting from scratch.</p>

<h4 id="open-models-prove-size-isn-t-always-what-matters">Open models prove size isn&#39;t always what matters</h4>

<p>Open source has become part of the <a href="https://blog.matt-rickard.com/p/why-did-meta-open-source-llama-2">strategic landscape, and it&#39;s how companies like Meta</a> diminish OpenAI&#39;s advantage by growing a community around an open alternative, as they did with the LLaMA model. Open language models come in all domains and sizes and are widely available on platforms like <a href="https://www.kaggle.com/models">Kaggle</a> and <a href="https://huggingface.co/">HuggingFace</a>. Some models keep their training opaque and release only model weights for use and limited reuse, while others open their <a href="https://www.openml.org/search?type=data&amp;status=active">entire training sets</a> and publish their methods to communities like <a href="https://docs.openml.org/">OpenML</a> or <a href="https://mlcommons.org/">MLCommons</a>. <a href="https://arxiv.org/html/2406.13713v2">A study from late 2024</a> compared the performance of open models like Llama 2, Phi-2, and Mistral OpenOrca against OpenAI&#39;s GPT-3.5 and GPT-4, and concluded:</p>

<blockquote><p>Open-source models, while showing slightly lower scores in terms of precision and speed, offer an interesting alternative due to their deployment flexibility. For example, Mistral-7b-OpenOrca achieved an 83% exact match and a ROUGE-2 score of 80%, while LLaMA-2 showed a 76% exact match, proving their competitiveness in controlled and secure environments. These open-source models, with their optimized attention mechanisms and adjusted quantization configurations, show that they can compete with proprietary models while allowing companies to customize the models according to their specific needs. These models represent viable and cost-effective solutions for sectors where data privacy and the ability to deploy on private infrastructures are essential.</p></blockquote>

<p>Open models make it possible for more companies to outperform proprietary models with the use of retrieval augmentation or fine-tuning methods. This raised the question for companies who value cost-efficiency: how well do fine-tuned smaller models and other attention-tuning methods like retrieval augmentation actually work?</p>

<p>Fine-tuned small language models (SLMs) have also begun to outperform large language models on tasks ranging from traditional <a href="https://arxiv.org/abs/2406.08660">text classification</a> to <a href="https://arxiv.org/abs/2503.01933">domain-specific edge models</a> and <a href="https://arxiv.org/abs/2504.16584">coding tasks</a>, with lower operating costs and higher ROI. This fine-tuning and RAG optimization doesn&#39;t come for free, but it compounds benefits for users and avoids lock-in to proprietary models. Among the 5% of domain-specific workloads that succeeded in the MIT study, collaborations between domain specialists and AI engineers provided substantial benefit by making the mechanisms of generative AI more transparent to knowledge workers, increasing their trust in these tools and giving them more confidence to automate other workflows. Domain specialists can then feed shifts in institutional knowledge into a company&#39;s custom model while bringing more efficient workloads into the scaffold of mission-critical tasks.</p>

<p>It&#39;s hard to say that there will never be a case for large models; perhaps everyone will still want ChatGPT, just cheaper and less AGI-focused. If more research goes into pushing small language models toward the peak performance of GPT-4, then perhaps there is still a valid, albeit far smaller, use case for general models offered as a service to more and more companies. This does, however, put a nail in the coffin of mindless scaling as a means to generate higher value for consumers.</p>

<h2 id="will-slms-and-open-models-kill-centralized-or-proprietary-models">Will SLMs and open models kill centralized or proprietary models?</h2>



<p>The growing research using open models has substantially reduced the overall necessity, and therefore the value, of a centralized platform. If training does not require substantial investment in a large data center, there is more long-term value for companies with enough use cases to invest in developing internal tooling, or to use existing open tools, to create and run domain-specific models. As I pointed out in <a href="https://bitsondata.dev/your-own-private-ai">Your Own Private AI</a>, any consumer these days can run open models locally and use RAG over their own information. This makes it nearly impossible for a single AI company like OpenAI to become the next <a href="https://apnews.com/article/google-search-antitrust-case-59114d8bf1dc4c8453c08acaa4051f14">Google monopoly</a> in information networks.</p>
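<p>The retrieval step behind running RAG over your own information can be sketched in a few lines. This is a toy illustration under simplified assumptions, not any particular tool&#39;s implementation: real local setups use learned embeddings and a vector store, while here a plain bag-of-words cosine similarity picks which local document to prepend to the prompt. The documents and question are invented for the example.</p>

```python
import math
from collections import Counter

# Minimal sketch of the retrieval step in RAG: score each local
# document against the question, then prepend the best match as
# context for a locally running model. Bag-of-words vectors stand
# in for the learned embeddings a real system would use.
def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Invoices from 2024 are stored in the finance shared drive",
    "The hiking club meets every second Saturday at the trailhead",
]

def build_prompt(question):
    # Pick the document most similar to the question and splice it in.
    best = max(documents, key=lambda d: cosine(vectorize(d), vectorize(question)))
    return f"Context: {best}\n\nQuestion: {question}"

print(build_prompt("Where are the 2024 invoices stored?"))
```

<p>The resulting prompt, context plus question, is what gets handed to the local model, so the answer can be grounded in your own documents rather than whatever the model memorized in training.</p>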

<p>In contrast, this opens up room for a new market of smaller, proprietary, domain-specific AI tools. This may look similar to the current proliferation of LLM-wrapper companies, with the difference that these take the time to think through and develop their own foundational models to address specific needs. The models won&#39;t necessarily need to be large, but rather to work incredibly well within their domain. The only centralization of knowledge would exist for that domain; no one is trying to gather data from every corner of the planet to solve all problems. This ends up looking like a slightly more generalized version of narrow or traditional AI tooling. The tooling may be so unremarkable within its domain that the application may not even be branded as an AI tool, instead quietly serving some features of an application, much like Markov chains give us autocorrect on our phones.</p>

<h3 id="what-about-the-job-market">What about the job market?</h3>

<p><img src="https://i.snap.as/vfTj0tlI.jpg" alt=""/></p>

<p>As the narrative that all human value can be produced better and faster by a centralized superintelligence fades into the next AI winter, we are left with an insecure feeling: AGI didn&#39;t happen this time, but what about the next? Because GenAI was such a convincing parlor trick, it spurred a lot of conversations and new research among experts in anthropology, neuroscience, computer science, and artificial intelligence. It became very important for us to understand how we humans think, how we learn, and, most importantly, what we take for granted about our organic and specifically human cognition.</p>

<p>I personally took solace in learning that large language models emulate language, but they don&#39;t model a world view, nor are they guided by biological stimuli like emotions that factor into their learning. I&#39;m not saying that computers couldn&#39;t outpace us in some ways, but this drove me to question the vast complexity of our own experience: we rarely practice metacognition because we are always thinking, much like we always breathe and have a heartbeat. What also becomes clear is how early the research is on understanding the connection between our thoughts and the biological matter we&#39;ve evolved over millennia. I&#39;d like to share one aspect of human cognition we do understand that already falls outside of what a language model captures: the human nervous system&#39;s role in conscious thought.</p>

<h4 id="language-models-can-t-emulate-human-cognition">Language models can&#39;t emulate human cognition</h4>

<p><a href="https://en.wikipedia.org/wiki/Enteric_nervous_system"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Enteric_Nervous_System.png/330px-Enteric_Nervous_System.png" alt=""/></a></p>

<p>I recently learned about the <a href="https://en.wikipedia.org/wiki/Enteric_nervous_system">Enteric Nervous System (ENS)</a>, our “second brain”, a mesh-like system of neurons that controls our gastrointestinal tract. It can <a href="https://youtu.be/wmgiHDb215U">control digestive function entirely on its own</a> without signals from the brain or the central nervous system. It is responsible for 90% of the serotonin and 50% of the dopamine generated in our body, which has a large effect on our emotional state. It is why we use the language of “gut feelings” to describe the intuition that guides our decision making. It also sends signals that make us cranky and <a href="https://en.wikipedia.org/wiki/Hungry_judge_effect">contributes to worse decision-making when we are hungry</a>. Because humans aren&#39;t just brains with fingers, and we have an entire body that dictates how we think and learn, we must consider how the many complex systems of human anatomy factor into our experience when making a choice. This is just one of many things that make human thinking distinct from AI, and why we need to avoid comparing language emulation models to human thought.</p>

<p>If that weren&#39;t enough, there is a much more complex way our behaviors are affected by our environment, culture, social interactions, and the information we consume. When we see AI generate false information with no model to verify it against (i.e. “hallucinate”), it is clear that AI is only working in limited language or sensory dimensions, such as image and video, to generate something that is plausible or possible, but not likely correct. There&#39;s a lot of great reading in AI papers from the 2000s that aimed to create frameworks like <a href="https://dl.acm.org/doi/10.1145/353485.353487">Distributed Cognition</a>, not to build an AI model around them, but simply to create a vocabulary for the complex cognition seen in animals and to clarify which types of cognition were being emulated and which weren&#39;t. It&#39;s a stark reminder of how anthropomorphism and confirmation bias can cause us to be reductive about the complex mechanisms of our cognition. Although it&#39;s still possible that one day we will build AGI that may or may not think like a human, it most definitely won&#39;t be agents or language models. I believe most, if not all, jobs are safe from being replaced by language models.</p>



<p>In the post-LLM hype, I anticipate a rise in the job market. Transformer-based model research will live on in AI academia and in business, and we&#39;ll see an explosion of developments around open training and open data sets. There will be a larger focus on developing domain-specific smaller models with fine-tuning, and a large interest in embedded AI for IoT devices with lower energy consumption. This type of AI economy makes the knowledge of every human valuable. It will take time, and it will require consent if we do this right.</p>

<p>Companies should still invest in AI, but gradually, following the techniques of the few companies that have had success. This will involve bringing back the domain experts and human resources AI suggested we could replace, and instead training them on how to effectively use AI in their workflows. As employee knowledge becomes important, companies must prioritize proper documentation and knowledge work. If leaders focus on proper incentives to capture workflows across teams through internal wikis, software logic mapping and validation, ops reports, meeting summaries, and the like, GenAI has a lot of potential to lower the burden of time-consuming communication gaps and can provide a lot of information for experts to share knowledge with newcomers and remain productive within their roles. This all comes down to telling the right stories around how we enable individuals to do their best work, and how their unique experiences and talents shape the larger company organism into a more efficient being.</p>

<h2 id="genai-and-search">GenAI and Search</h2>

<p>The single most pervasive and influential information technology we know today is search, specifically Google&#39;s search, initially powered by <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a>. Though there are general open source search engines like <a href="https://solr.apache.org/">Solr</a>, <a href="https://lucene.apache.org/">Lucene</a>, and <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a>, meta search engines like <a href="https://docs.searxng.org/">SearXNG</a>, and more modern vector-and-search hybrids like <a href="https://www.meilisearch.com/">Meilisearch</a>, these only provide the mechanisms of how search works. They are missing the gargantuan amount of data that Google possesses through its search monopoly, data-collecting products, adware metrics, and its highly adopted web browser, all of which feed context into search. The slow, coercive shift of society granting Google its incredible influence over how humans around the globe mentally model and obtain the knowledge that forms our world views has also shaped the way netizens structure their information to be found.</p>

<p>Much of the GenAI training data was procured without consent from large troves of publicly available data, ranging from information scraped off forum sites like Reddit and Stack Overflow to small individual blogs. All of it was used to feed the large data needs of LLMs. As we sit here in the aftermath of this technology, I think it&#39;s incredibly important that we reflect on how we structure our information as an internet society. There is clearly power in the conversations we have and the information we produce. It&#39;s important that the exploitative practices of companies like OpenAI and Google drive us towards safeguarding personal information while ensuring our intellectual assets remain available to the public. We should continue building <a href="https://en.wikipedia.org/wiki/Freedom_of_information">consensual information sharing</a> to benefit everyone, while <a href="https://en.wikipedia.org/wiki/Freedom_of_information#Privacy_protections">safeguarding personal information from private benefactors who can leave the public vulnerable</a> in the current political climate.</p>

<p>Many in open source have started developing open alternatives that enable us to create and use our information on our own terms. Social media alternatives have grown through <a href="https://en.wikipedia.org/wiki/Fediverse">Fediverse</a> technologies, such as <a href="https://en.wikipedia.org/wiki/Mastodon_(social_network)">Mastodon</a> and <a href="https://en.wikipedia.org/wiki/Pixelfed">Pixelfed</a>. There are also user-driven systems such as <a href="https://en.wikipedia.org/wiki/Usenet">USENET</a> that enable users to curate their own search indexes and share them with others. These could build information networks that create an information economy expanding our existing peer-to-peer blog funding economies; instead, you would be funded for curating valuable information in your area of expertise. However, today&#39;s average netizen would consider a system like USENET complex and unusable, as it was created before social media and personal phones shaped how users find information.</p>

<p>Much of the current challenge posed to open designers and developers is to match current search mental models with those of a new system that would also link across sites through semantic web standards. Those who understand how to build these economies can make some early examples themselves and train future generations of individuals and corporations on how to manage and profit off their own open digital gardens. I believe democratizing the ownership of how information flows on the internet will break up information bubbles and enable us to share the information we want while maintaining our privacy where we want.</p>

<p>Nobody can do that alone, but it is possible if we start to pool our resources and build something that has the distribution of USENET, the UX search model of Google, the interoperability of the <a href="https://en.wikipedia.org/wiki/Semantic_Web">semantic web</a>, and the ability for folks to work on a repository of documents like Wikipedia, but as many small wikis that can reference each other. With proper design and consent mechanisms, humans will want to collectively curate the valuable information in our own heads into shared value for globally interconnected local economies.</p>

<p>We&#39;ll dive more into some of these fundamentals in the rest of this series. The next post will dig into the internals of search technology and its relevance to GenAI.</p>


]]></content:encoded>
      <guid>https://bitsondata.dev/humans-arent-going-anywhere</guid>
      <pubDate>Thu, 21 Aug 2025 21:30:58 +0000</pubDate>
    </item>
    <item>
      <title>Your own Private AI 🕵️</title>
      <link>https://bitsondata.dev/your-own-private-ai?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[I&#39;ve had a bunch of conversations with family, friends, and folks who avoid tech until it&#39;s necessary about the latest wave of artificial intelligence (AI) craze. A lot of these folks are curious to use AI, but are concerned about privacy or receiving biased views and misinformation. Many technologists have limited this exposure using locally running AI models which avoids sharing personal information to platforms and enables you to choose a model that has more public validation of the data it trains on and therefore why it may have a stance on a particular topic.&#xA;&#xA;I realized how most blog posts on how to run local AI are written by technologists for technologists and everyone else is beholden to ChatGPT and the like. I myself have been quite excited about the potential for AI, but as a socialist who has worked in the Big Data industry and seen how information is being used against the average citizen, I want to help everyone in learning how to take ownership of your information without avoiding participating in this fun and valuable tech. !--more-- &#xA;&#xA;Before we even jump into that, I just want to provide a little context about the main things you need to know about AI since a tool called ChatGPT made its debut in November 2022. Rather than share the massive amounts of activity happening here, I&#39;ll cover the most interesting larger developments mixed with what you&#39;ll need to know for the tutorial below.&#xA;&#xA;!--emailsub--&#xA;&#xA;Generative AI in a nutshell &#xA;&#xA;First, some clarity on the term &#34;AI&#34;... What is commonly referred to in the media in recent years as &#34;AI&#34; actually refers to a subset (i.e. Generative AI) of a subset (i.e. Deep Learning) of the larger field of Artificial Intelligence. Artificial intelligence prior to the year 2022 was interested in statistical models that were easier to create and computationally &#34;small&#34; enough to run on mobile phone chips. 
This enables features like autocomplete, voice-to-text, facial recognition to log into you phone or run snapchat filters to add a mustache, etc... Before those applications, artificial intelligence was more often seen in video games when a computer would generate a series of actions for another character or adversary during game play.&#xA;&#xA;ChatGPT is the flagship application of the modern wave of artificial intelligence products built by a closed company ironically named OpenAI. They set a record for most downloaded application and set off the interest and investments into a lot of products that internally rely on their services. For most people ChatGPT is a list of text conversations you have in what looks like an instant messenger window, but the words replying back to you are generated by a computer.&#xA;&#xA;Computers generate this text by running programs that copy as much text off the internet and as many books, magazines, YouTube transcripts, or anything a company can freely get public access to, and build a statistical prediction model called Large Language Models. When we say model, think less Heidi Klum and more like a physics model that predicts the weather. LLMs encode the patterns of speech given a context. For example, if I say &#34;It was the best of times, it was the ___ of times&#34; you likely can guess based on your own model and understanding of language (i.e. your intuition) that the missing word is &#34;worst&#34; even if you&#39;ve never read Charles Dickens&#39; A Tale of Two Cities. LLMs trained on different data sets will have different strengths, weaknesses, and biases when faced with various tasks. In a similar way our brain uses mental models to remember or recall information, you can think of LLMs as just stochastic computer-based intuition models. 
It should also be noted that human mental models are a shallow analogy and despite the propaganda that AI can think the same way humans do, this is utter rubbish and anthropomorphizing AI has the potential to cause a lot of harm. &#xA;&#xA;One of the early open-source families of LLMs that challenged the superior performance of early ChatGPT models is called Llama). Llama brought the first equally powerful open LLM to center stage, making it possible for people to run LLMs that could even outperform ChatGPT on their own computers privately. The release of Llama was a strategic power play from Meta (aka Facebook) and by opening large scale models to the public, Meta could reap the benefits of the adoption and rapid development in open ecosystems and tooling, such as Ollama.&#xA;&#xA;Despite Ollama&#39;s similar name to the Llama LLMs it is not an LLM itself, but a platform to manage and build off of any open LLM. These platforms are also called LLM providers. Ollama&#39;s name came from its initial use of the llama.cpp to customize (aka fine-tuning) Llama&#39;s LLM to new tasks. As more open LLMs were created, Ollama built its own runtime and became a platform for any open source model. Although this tool is very helpful for those in the industry, it still requires anyone with less technical experience to type all of their work in a command line and wasn&#39;t the experience you would get from ChatGPT. This is where tools like Anything LLM become handy. AnythingLLM is a versatile application that can run on your own laptop and provide the similar user interface and features that you might find on ChatGPT.&#xA;&#xA;Now that I&#39;ve given you all of that information, let&#39;s cover an installation of Ollama to download (aka pull) either Llama or Deepseek LLMs down that are small enough to run on the average computer. The Deepseek model should run on most modern laptops and will give a generally better output than the smaller LLama model I&#39;m providing as backup. 
Once that is complete we will install Anything LLM to give you a nice sleek chat interface similar to ChatGPT which can recall all your previous chats. I&#39;ve added videos to help with installation visuals as I always think just having a human explain stuff helps me with this stuff as well.&#xA;&#xA;Requirements&#xA;&#xA;You should have a computer that has at least a dualcore or ideally quadcore CPU and at least 4GB (ideally 8GB) of RAM. You will also need about 5GB of storage space depending on the LLM you use (more below).&#xA;&#xA;Tutorial&#xA;&#xA;Download and install Ollama and open the terminal.&#xA;If you have an older laptop or have less than 4GB ram, type:&#xA;    ollama run llama3.2 to run the LLaMa LLM (2.0GB of disk space), or&#xA;    ollama run deepseek-r1 to run Deepseek LLM (4.7GB of disk space).&#xA;Download and install Anything LLM Desktop.&#xA;Open Anything LLM and connect to Ollama&#xA;&#xA;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/p82aMGJJLU8?t=100&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen/iframe&#xA;&#xA;Once you have that set up, you can then you can use AnythingLLM application similar to ChatGPT. You can even import your local documents to the local instance of AnythingLLM. Watch this video for a nice and recent AnythingLLM overview. &#xA;&#xA;If a lot of this terminology still felt confusing it has more to do with the intentional complexity and lack of user design to avoid too much democratization of running your own AI to keep you dependent on a paid service that extracts and reveals your conversations. I&#39;ll do my best to keep this article up-to-date as new tools and models come along. &#xA;&#xA;!--emailsub--&#xA;]]&gt;</description>
<content:encoded><![CDATA[<p>I&#39;ve had a bunch of conversations about the latest wave of the artificial intelligence (AI) craze with family, friends, and folks who avoid tech until it&#39;s necessary. A lot of these folks are curious to use AI, but are <a href="https://www.reuters.com/legal/legalindustry/privacy-paradox-with-ai-2023-10-31/">concerned about privacy</a> or <a href="https://casmi.northwestern.edu/news/articles/2024/misinformation-at-scale-elon-musks-grok-and-the-battle-for-truth.html">receiving biased views and misinformation</a>. Many technologists have limited this exposure by using locally running AI models, which avoids sharing personal information with platforms and enables you to choose a model that has more public validation of the data it trains on, and therefore of why it may have a stance on a particular topic.</p>

<p>I realized that most blog posts on how to run local AI are written by technologists for technologists, and everyone else is beholden to ChatGPT and the like. I myself have been quite excited about the potential for AI, but as a socialist who has worked in the Big Data industry and seen <a href="https://en.wikipedia.org/wiki/The_Age_of_Surveillance_Capitalism">how information is being used against the average citizen</a>, I want to help everyone learn how to take ownership of their information without having to avoid participating in this fun and valuable tech. </p>

<p>Before we even jump into that, I want to provide a little context about the main things you need to know about AI since a tool called ChatGPT made its debut in November 2022. Rather than recount the massive amount of activity happening in this space, I&#39;ll cover the most interesting larger developments, mixed with what you&#39;ll need to know for the tutorial below.</p>



<h2 id="generative-ai-in-a-nutshell">Generative AI in a nutshell</h2>

<p>First, some clarity on the term “AI”... What the media has commonly called “AI” in recent years actually refers to a subset (i.e. Generative AI) of a subset (i.e. Deep Learning) of the larger field of Artificial Intelligence. Before 2022, applied artificial intelligence focused on statistical models that were easier to create and computationally “small” enough to run on mobile phone chips. This enabled features like autocomplete, voice-to-text, and facial recognition to log into your phone or run Snapchat filters that add a mustache. Before those applications, artificial intelligence was most often seen in video games, where the computer would generate a series of actions for another character or adversary during game play.</p>

<p><img src="https://upload.wikimedia.org/wikipedia/commons/6/62/Pong_Game_Test2.gif" alt=""/></p>

<p><a href="https://en.wikipedia.org/wiki/ChatGPT">ChatGPT</a> is the flagship application of the modern wave of artificial intelligence products, built by a closed company ironically named OpenAI. It set a record for application downloads and set off the interest and investment in a host of products that internally rely on OpenAI&#39;s services. For most people, ChatGPT is a list of text conversations in what looks like an instant messenger window, except the words replying to you are generated by a computer.</p>

<p><img src="https://i.snap.as/v6C7TgWP.jpeg" alt=""/></p>

<p>Computers generate this text by running programs that copy as much text as a company can freely get public access to (the internet, books, magazines, YouTube transcripts) and use it to build statistical prediction models called <a href="https://en.wikipedia.org/wiki/Large_language_model">Large Language Models</a> (LLMs). When we say model, think less Heidi Klum and more a physics model that predicts the weather. LLMs encode the patterns of speech given a context. For example, if I say “It was the best of times, it was the ____ of times”, you can likely guess, based on your own model and understanding of language (i.e. your intuition), that the missing word is “worst”, even if you&#39;ve never read Charles Dickens&#39; <em>A Tale of Two Cities</em>. LLMs trained on different data sets will have different strengths, weaknesses, and biases when faced with various tasks. In a similar way to how our brain uses mental models to remember or recall information, you can think of LLMs as stochastic, computer-based intuition models. It should be noted that the mental-model comparison is a shallow analogy: despite the propaganda that AI can think the same way humans do, this is <a href="https://archive.ph/HKYjM">utter rubbish</a>, and anthropomorphizing AI has the potential to cause a lot of harm.</p>
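<p>To make the idea of a statistical prediction model concrete, here is a toy sketch in Python. This is not how real LLMs work internally (they use neural networks with billions of parameters and look at thousands of words of context), but it captures the same predict-the-next-word spirit by counting which word tends to follow which in a tiny training text:</p>

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the word most often seen after `word`, or None if unseen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

corpus = "it was the best of times it was the worst of times"
model = train_bigram(corpus)
print(predict_next(model, "of"))  # prints "times", the only word seen after "of"
```

<p>An LLM does the same kind of thing at a vastly larger scale, which is why its guesses feel so much more fluent than this two-word toy.</p>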

<p>One of the early open-source families of LLMs that challenged the superior performance of early ChatGPT models is called <a href="https://en.wikipedia.org/wiki/Llama_(language_model)">Llama</a>. Llama brought the first comparably powerful open LLM to center stage, making it possible for people to run LLMs privately on their own computers that could even outperform ChatGPT. The release of Llama was a strategic power play from Meta (aka Facebook): by opening large-scale models to the public, Meta could reap the benefits of the adoption and rapid development in open ecosystems and tooling, such as <a href="https://ollama.org/">Ollama</a>.</p>

<p><a href="https://i.snap.as/d5U3FBVR.png"><img src="https://i.snap.as/d5U3FBVR.png" alt=""/></a></p>

<p>Despite Ollama&#39;s name being similar to the Llama LLMs, it is not an LLM itself but a platform to manage and build on top of any open LLM. Such platforms are also called LLM providers. Ollama&#39;s name came from its initial use of llama.cpp to customize (aka fine-tune) Llama&#39;s LLMs for new tasks. As more open LLMs were created, Ollama built its own runtime and became a platform for any open source model. Although this tool is very helpful for those in the industry, it still requires anyone with less technical experience to type all of their work into a command line, which isn&#39;t the experience you get from ChatGPT. This is where tools like AnythingLLM become handy. AnythingLLM is a versatile application that runs on your own laptop and provides a user interface and features similar to what you might find on ChatGPT.</p>
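<p>Because Ollama runs as a local service, other applications (like AnythingLLM below) talk to it over a small web API on your own machine, so nothing leaves your computer. As a sketch of what happens under the hood, here is a minimal Python example using Ollama&#39;s documented <code>/api/generate</code> endpoint; it assumes Ollama is installed and a model such as <code>llama3.2</code> has already been pulled:</p>

```python
import json
import urllib.request

# Ollama's default local endpoint; requests never go over the internet
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model, prompt):
    """Build the JSON body that Ollama's /api/generate endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model, prompt):
    """Send a prompt to the locally running Ollama server, return its reply."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["response"]
```

<p>With a model pulled, calling <code>ask("llama3.2", "Why is the sky blue?")</code> returns the model&#39;s answer as a string, which is exactly the kind of call a chat interface makes on your behalf.</p>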

<p>Now that I&#39;ve given you all of that context, let&#39;s walk through installing Ollama to download (aka pull) either the Llama or Deepseek LLM, both small enough to run on the average computer. The Deepseek model should run on most modern laptops and will generally give better output than the smaller Llama model I&#39;m offering as a backup. Once that is complete, we will install AnythingLLM to give you a sleek chat interface similar to ChatGPT that can recall all your previous chats. I&#39;ve added videos for installation visuals, since having a human explain things always helps me too.</p>

<h2 id="requirements">Requirements</h2>

<p>You should have a computer with at least a dual-core (ideally quad-core) CPU and at least 4GB (ideally 8GB) of RAM. You will also need about 5GB of free storage space, depending on the LLM you use (more below).</p>
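<p>If you aren&#39;t sure what your machine has, this small Python sketch checks the CPU and free disk space (Python comes preinstalled on most Macs and Linux machines; the RAM check only works on Linux and macOS, so it is wrapped to fail gracefully elsewhere):</p>

```python
import os
import shutil

# Number of CPU cores (you want at least 2, ideally 4)
print("CPU cores:", os.cpu_count())

# Free disk space on the main drive (you want roughly 5GB free)
free_gb = shutil.disk_usage("/").free / 1_000_000_000
print(f"Free disk space: {free_gb:.1f} GB")

# Total RAM (Linux/macOS only; you want at least 4GB, ideally 8GB)
try:
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1_000_000_000
    print(f"RAM: {ram_gb:.1f} GB")
except (AttributeError, ValueError, OSError):
    print("RAM check not available on this system")
```

<p>On Windows, the &#34;About your PC&#34; page in Settings shows similar information without any typing.</p>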

<h2 id="tutorial">Tutorial</h2>
<ol><li><a href="https://ollama.com/">Download and install Ollama</a> and <a href="https://youtu.be/3W-trR0ROUY">open the terminal</a>.</li>
<li>If you have an older laptop or less than 8GB of RAM, type
<code>ollama run llama3.2</code> to run the Llama LLM (2.0GB of disk space); otherwise, type
<code>ollama run deepseek-r1</code> to run the Deepseek LLM (4.7GB of disk space).</li>
<li><a href="https://anythingllm.com/desktop">Download and install Anything LLM Desktop</a>.</li>
<li>Open Anything LLM and <a href="https://docs.anythingllm.com/setup/llm-configuration/local/ollama">connect to Ollama</a>.</li></ol>

<iframe width="560" height="315" src="https://www.youtube.com/embed/p82aMGJJLU8" frameborder="0" allowfullscreen=""></iframe>

<p>Once you have that set up, you can use the AnythingLLM application much like ChatGPT. You can even import your local documents into your local instance of AnythingLLM. <a href="https://youtu.be/WsQLC1jOO1U?t=164">Watch this video</a> for a nice and recent AnythingLLM overview.</p>

<p>If a lot of this terminology still feels confusing, that has more to do with intentional complexity and a lack of user-centered design: democratizing the ability to run your own AI would loosen your dependence on a paid service <a href="https://www.yahoo.com/news/articles/leaked-chatgpt-conversation-shows-user-202618439.html">that extracts and reveals your conversations</a>. I&#39;ll do my best to keep this article up-to-date as new tools and models come along.</p>


]]></content:encoded>
      <guid>https://bitsondata.dev/your-own-private-ai</guid>
      <pubDate>Thu, 14 Aug 2025 01:23:00 +0000</pubDate>
    </item>
    <item>
      <title>Integrating Trino and Snowflake</title>
      <link>https://bitsondata.dev/trino-snowflake-bloomberg-oss-win?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[An open source Success Story&#xA;&#xA;TL;DR: Contributing to open source can be frustrating as the consensus needed for code to align to the project vision is often out of scope for many companies. This post dives deep into the obstacles and wins of two contributors from different companies working together to add the same proprietary connector. It&#39;s both inspiring and carries many lessons to bring along as you venture into open source to gain the pearls and avoid the perils.&#xA;&#xA;We’re seeing open source usher in a challenge to the economic model where the success metric is increasing the commonwealth of economic capital. This acceleration comes from playing positive-sum games with friends online and avoiding limiting a community to a vision that only benefits a small number of corporations or individuals. It’s hard to imagine how to embed such frameworks within our current zero-sum winner-takes-all economic system.!--more-- There’s certainly no shortage of heated debates around how to construct a harmonious relationship between the open source community and companies participating in them. Something we don’t talk about enough are the positive examples of when a coordinated effort in open source sticks the landing, and so many benefit from it.&#xA;&#xA;This post highlights the extraordinary contributions of Erik Anderson, Teng Yu, Yuya Ebihara, and the broader Trino community to finally contribute the long-coveted Trino Snowflake connector. It is a success story paired with a blueprint for individuals and corporations wanting to contribute to open source projects they use. 
These stories are valuable in that they demonstrate how to be most effective in collaborating with strangers-soon-to-be-friends and common pitfalls to avoid.&#xA;&#xA;!--emailsub--&#xA;&#xA;A common challenge in open source&#xA;&#xA;Despite the importance of delivering marketing and education in a community (aka edutainment), it’s only the first part of the equation of what makes open source projects successful. Once developers see some exciting video or tutorial, they ultimately land on the docs site, GitHub, StackOverflow, or some communication platform in the community. It&#39;s at this point that developers can easily lose the motivation if the docs lacks proper getting started materials or the community is completely silent. This is how I categorize the developer experience (aka devex), which aims to improve both the user and contributor experiences in the developer community by empowering decisions through hands-on learning, removing inefficiencies, and as we&#39;ll cover here, exposing untapped opportunities. &#xA;&#xA;Much like any open source project, maintainers on the Trino project struggle with communicating the lack of proper resources to build and test new features built for various proprietary software. For those less familiar, Trino is a federated query engine with multiple data sources. Trino tests integrations with open data sources by running small local instances of the connecting system. Snowflake is a proprietary, cloud-native data warehouse, also known as cloud data platform. This provided no viable and free way to support testing this integration that was eagerly sought by many. 
After an initial attempt by my friend Phillipe Gagnon, a similar pattern emerged with the second pull request where the development velocity started strong and after some months stagnated.&#xA;&#xA;Cognitive surplus and communication deficit&#xA;&#xA;A common and unfortunate class of issues are that various well-known larger objectives known among the core group often move faster than less-established individual contributions. These additions are often much needed and welcomed, but often fail to fit a larger project roadmap narrative. As its easier to coordinate between the smaller core group as trust and norms have been communicated and established. This makes changes outside of this group have a higher likelihood to get lost in the shuffle. As an open source project grows, you end up with a cognitive surplus in the form of an abundance of bright people willing to share their time, intellect, and experience with a larger community.&#xA;&#xA;Often both contributors and maintainers are so busy with their day jobs, families, and self care, that they dedicate most of their remaining energy to ensuring they write quality code and tests to the best of their ability. Lack of upfront communication to validate ideas from newer contributors, and lack of communication by maintainers who see a large number of issues to address are two communication issues that stagnate a project. Maintainers are often doers that see more value in addressing quick-win work that flows from the well-established contributors of the project. Followthrough on either side can be difficult as newcomers don&#39;t want to be rude and maintainers accidentally forget or hope someone else will take the time to address the issues on that pull request. &#xA;&#xA;Waiting for your work to be reviewed by someone in the community kind of works like a wishing well, you toss in a coin (i.e. 
your time and effort represented as code and a pull request) and hope your wish of getting your code reviewed and merged comes true. The satisfaction of hypothetical developers that benefit from your small and significant change floods your mind and you feel like you’ve improved humanity just that one little bit more. &#xA;&#xA;Maintainers are in a constant state of pulling triage on all the surplus of innovation being thrown at them and simultaneously trying to look for more help reviewing and being the expert at some areas of the code. As you can imagine, good communication can be hard to come by as many newcomers are strangers and concerned they are wasting precious time by asking too many questions rather than just showing a proof of concept. This backfires when developers will spend a large portion of their time developing a solution that is not compatible with the project, and maintainers will lose the opportunity to quickly spin up on the value of the new feature. This is why regular contributor meetings help solve both of these issues synchronously to cut out the delayed feedback loops.&#xA;&#xA;History repeats itself, until it doesn&#39;t&#xA;&#xA;It became apparent that each time there was a discussion for how to do integration testing there was no good way to test a Snowflake instance with the lack of funding for the project. Trino has a high bar for quality and none of the maintainers felt it was a risk worth taking due to the likely popularity of the integration and likelihood of future maintenance issues. Once each pull request hit this same fate, it stalled with no clear path to resolve the real issue of funding the Snowflake infrastructure needed by the Trino Software Foundation (TSF). 
It’s never fun to mention that you can’t move forward on work with constraints like these, and without a monetary solution, silence is what is experienced on the side of the contributor.&#xA;&#xA;Noticing that Teng had already done a significant amount of work to contribute his Snowflake connector, I reached out to him to see if we could brainstorm a solution. Not long after, Erik also reached out to get my thoughts on how to go about contributing Bloomberg&#39;s Snowflake connector. Great, now we have two connector implementations and no solution to getting the infrastructure to get them tested. During the first Trino Contributor Congregation, Erik and I brought up Bloomberg&#39;s desire to contribute a Snowflake connector and I articulated the testing issue. Ironically, this was the first time I had thoroughly articulated the issue to Erik as well.&#xA;&#xA;As soon as I was done, Erik requested the mic said something to the effect of, &#34;Oh I wish I would have known that&#39;s the problem, the solution is simple, Bloomberg will provide the TSF a Snowflake account.&#34;&#xA;&#xA;Done!&#xA;&#xA;Just as in business, you can never underestimate the power of communication in an open source project as well. Shortly after Erik, Teng, and I discussed the best ways to merge their work, they set up the Snowflake accounts for Trino maintainers and start the arduous process of building a thorough test suite with the help of Yuya, Piotr Findeisen, Manfred Moser, and Martin Traverso.&#xA;&#xA;The long road to Snowflake&#xA;&#xA;As Teng and Erik merged their efforts, the process was anything but straightforward. There were setbacks, vacations, meticulous reviews, and infrastructure issues. But the perseverance of everyone involved was unwavering.&#xA;&#xA;Bloomberg started by creating an official Bloomberg Trino repository originally as a means for Teng and Erik to mesh their solutions together and build the testing infrastructure that relied on Bloomberg resources. 
Without needing to rely on the main Trino project to merge incremental solutions, they were able to quickly iterate the early solutions. This repository also facilitated Bloomberg’s now numerous contributions to Trino.&#xA;&#xA;It took a few months just to get the ForePaaSa name=&#34;fn1&#34;/asupa class=&#34;footnote&#34; href=&#34;#fnref1&#34;1/a/sup and Bloomberg solutions merged. There were valuable takes from each system and better integration tests were written with the new testing infrastructure. The two Snowflake connector implementations were merged together by April of 2023. Finally, the reviews could start. Once the initial two passes happened we anticipated that we would see the Snowflake connector release in the summer of 2023 around Trino Fest. So much so, that we planned a talk with Erik and Teng initially as a reveal, assuming the pull request would be merged by then. Lo and behold, this didn’t happen, as there were still a lot of concerns around use cases not being properly tested.&#xA;&#xA;The halting review problem&#xA;&#xA;A necessary evil that comes with pull request reviews and more broadly, distributed consensus is that reviews can drag on over time. This can lead to countless number of updates you have to make to your changes to accommodate the ever changing project shifting beneath your feet as you simultaneously try to make progress on suggestions from those reviewing your code.&#xA;&#xA;Many critics of open source like to point this out as a drawback, when in fact, this same problem exists in closed source systems. Closed source projects can generally delay difficult decisions to make fast upfront progress to meet certain deadlines. 
This may be seen as an advantage at first, but as many developers can attest, this simply leads to technical debt and fragile products in most environments that struggle to prioritize a healthy codebase.&#xA;&#xA;Regardless, having to face these larger discussions upfront can induce fatigue, especially when managing external circumstances; personal affairs, a project at work - you know, the entity that pays these engineers - or countless other factors will rear their ugly heads and progress will stagger with ebbs and flows of attention. This can be really dangerous territory and commonly resolves in contributors and reviewers abandoning the PR when it stalls.&#xA;&#xA;This is why I believe open source, while not beholden to any timelines, needs a project and product management role which is currently covered often by project leaders and devex engineers. This can also relieve tension between the needs of open source and big businesses in the community with real deadlines, at least keeping the communication consistent while ensuring bugs and design flaws aren’t introduced to the code base.&#xA;&#xA;What’s in it for Bloomberg and ForePaaS?&#xA;&#xA;If you’ve never worked in open source or for a company that contributes to open source, you may be thinking how the heck do these engineers convince their leadership to let them dump so much time into these contributions? The simple answer is, it’s good for business.&#xA;&#xA;If we peep into why Bloomberg uses Trino, they aggregate data from an unusually large number of data sources across their customers who use their services. Part of this requires them to merge the customer’s dataset with existing aggregate data in Bloomberg’s product. Since Trino can connect to most customer databases out-of-the-box, this requires Bloomberg to manage a small array of custom connectors that provide their services to customers as multiple catalogs in a single convention SQL endpoint. 
Having engineers maintain a few small connectors rather than an entire distributed query engine themselves saves a lot of time and maintenance.&#xA;&#xA;Despite how many problems Trino already solves for them, Bloomberg and ForePaaS needed this Snowflake connector and through the open source model created it for themselves. The drawback is that the solution must be maintained by the engineers at each company any time they want to upgrade to a new Trino feature. This takes consistently depletes engineering resources and so they want to maintain as few features as possible to relieve their engineer’s time. Open source projects are generally more than happy to accept features that the community benefits from. This doesn’t mean we shouldn’t appreciate when companies contribute. This dual-sum generosity and forward-thinking approach enabled Erik and Teng to combine their battle-tested connectors, crafting a high value creation for the community.&#xA;&#xA;If you are a developer who sees the value in contributing to open source, and you aren&#39;t sure how to convince leadership to get on board, you need to speak their language. Show how companies like Bloomberg get involved in open source, and how it lowers maintenance costs when done correctly. If you see an open project like Trino that could replace 97% of a new project, demonstrate that the upfront cost will pay off when you remove the amount of code to be managed by your team which lowers the future need to expand headcounts. 
I don’t imagine a world where your boss and colleagues are altruists, but present an economic incentive that lowers the amortized cost of engineers needed to maintain a project, then your strategy becomes helpful to the company&#39;s bottom line.&#xA;&#xA;While the immediate investment shows small gains for a single team on a single company, once that change exists in open source, other companies can immediately benefit and offer better testing and improvements than you could have asked for when managing the original project with your own team. Humanity at large gets to benefit upon every contribution done this way, and the more companies that embrace this, the less we waste our efforts of pointlessly duplicating work.&#xA;&#xA;Esprit de Corps&#xA;&#xA;The marines use the mantra, “Esprit de Corps,” latin for “spirit of the people”, which I mistakenly took the “Corps” part for the Marine Corps rather than the more general meaning of a body or group of people. In fact, it expresses the common spirit existing in the members of a group and inspiring enthusiasm, devotion, and strong regard for the honor of the group. Any time I see this type of shared and selfless cooperation in open source, I’m reminded of the bond, friendships, and care of me and my fellow marines. Despite the unfortunate political circumstances of our mission, I do treasure the shared companionship with both my fellow marines and the local Iraqi people. There is ultimately a power in the gathering of many when aimed for building an altruistic means of improving each others lives.&#xA;&#xA;In the same way, demonstration of human cooperation is about more than just developing a connector; it&#39;s about the shared experiences, the friendships forged, and the skills honed in the pursuit of a common goal. The successful addition of the Trino Snowflake connector is a testament to the positive sum outcomes of open source collaboration. 
This journey has been about collaboration, learning, and growth that will benefit many. I remember the night I got the email that Yuya had merged the pull request, I was ecstatic to say the least. The connector shipped with Trino version 440, and made connection to the most widely adopted cloud warehouse possible.&#xA;&#xA;Once the hard work was done, many valuable iterations like adding Top-N support(Shoppee), adding Snowflake Iceberg REST catalog support (Starburst), and adding better type mapping(Apple) were added to the Snowflake integration. I love showcasing this trailblazing and yes, altruistic work from Erik, Teng, Yuya, Martin, Manfred, and Piotr - and everyone who helped in the Trino community. A special thanks to the managers and leadership at Bloomberg and ForePaaS for their generous commitment of time and resources.&#xA;&#xA;As we celebrate this milestone, we&#39;re already looking forward to the next adventure. Here&#39;s to federating them all, together!&#xA;&#xA;Notes:&#xA;a name=&#34;fnref1&#34;/asupa class=&#34;footnote-ref&#34; href=&#34;#fn1&#34;1/a/supspan class=&#34;footnote-ref-text&#34;ForePaaS has been integrated into OVHCloud, which is now called Data Platform./span&#xA;&#xA;bits]]&gt;</description>
<content:encoded><![CDATA[<h2 id="an-open-source-success-story">An open source success story</h2>

<p>TL;DR: Contributing to open source can be frustrating as the consensus needed for code to align to the project vision is often out of scope for many companies. This post dives deep into the obstacles and wins of two contributors from different companies working together to add the same proprietary connector. It&#39;s both inspiring and carries many lessons to bring along as you venture into open source to gain the pearls and avoid the perils.</p>

<p><img src="https://i.snap.as/CvkkjKzk.jpeg" alt=""/></p>

<p>We’re seeing open source usher in a challenge to the economic model where the success metric is increasing the commonwealth of economic capital. This acceleration comes from playing positive-sum games with friends online and avoiding limiting a community to a vision that only benefits a small number of corporations or individuals. It’s hard to imagine how to embed such frameworks within our current zero-sum, winner-takes-all economic system. There’s certainly no shortage of heated debates around how to construct a harmonious relationship between the open source community and the companies participating in it. What we don’t talk about enough are the positive examples of a coordinated open source effort sticking the landing, so that many benefit from it.</p>

<p>This post highlights the extraordinary contributions of <a href="https://www.linkedin.com/in/erikanderson/">Erik Anderson</a>, <a href="https://www.linkedin.com/in/tyu-fr/">Teng Yu</a>, <a href="https://www.linkedin.com/in/ebyhr/">Yuya Ebihara</a>, and the broader <a href="https://github.com/trinodb/trino">Trino community</a> to finally contribute the long-coveted <a href="https://trino.io/docs/current/connector/snowflake.html">Trino Snowflake connector</a>. It is a success story paired with a blueprint for individuals and corporations wanting to contribute to open source projects they use. These stories are valuable in that they demonstrate how to be most effective in collaborating with strangers-soon-to-be-friends and common pitfalls to avoid.</p>



<h2 id="a-common-challenge-in-open-source">A common challenge in open source</h2>

<p>Despite the importance of delivering marketing and education in a community (aka <a href="https://en.wikipedia.org/wiki/Educational_entertainment">edutainment</a>), it’s only the first part of the equation of what makes open source projects successful. Once developers see some exciting video or tutorial, they ultimately land on the docs site, GitHub, StackOverflow, or some communication platform in the community. It&#39;s at this point that developers can easily lose motivation if the docs lack proper getting started materials or the community is completely silent. This is how I categorize the developer experience (aka devex), which aims to improve both the user and contributor experiences in the developer community by <a href="https://en.wikipedia.org/wiki/Experiential_learning">empowering decisions through hands-on learning</a>, <a href="https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog">removing inefficiencies</a>, and, as we&#39;ll cover here, exposing untapped opportunities.</p>

<p>Like maintainers of any open source project, Trino maintainers struggle with the lack of proper resources to build and test new features that target proprietary software. For those less familiar, Trino is a federated query engine that connects to <a href="https://trino.io/docs/current/connector.html">multiple data sources</a>. Trino tests integrations with open data sources by running small local instances of the connecting system. Snowflake, however, is a proprietary, cloud-native data warehouse, also known as a cloud data platform, so there was no viable and free way to test this integration, which was <a href="https://github.com/trinodb/trino/pull/2551#issuecomment-873082280">eagerly</a> <a href="https://github.com/trinodb/trino/issues/1863">sought</a> <a href="https://github.com/trinodb/trino/issues/7247">by many</a>. After an <a href="https://github.com/trinodb/trino/pull/2551">initial attempt</a> by my friend <a href="https://www.linkedin.com/in/pfgagnon">Phillipe Gagnon</a>, a similar pattern emerged <a href="https://github.com/trinodb/trino/pull/10387">with the second pull request</a>: development velocity started strong and, after some months, stagnated.</p>

<h3 id="cognitive-surplus-and-communication-deficit">Cognitive surplus and communication deficit</h3>

<p>A common and unfortunate class of issue is that larger objectives, well-known among the core group, often move faster than less-established individual contributions. These additions are often much needed and welcomed, but frequently fail to fit the larger project roadmap narrative. It&#39;s easier to coordinate within the smaller core group, where trust and norms have been communicated and established, so changes from outside this group have a higher likelihood of getting lost in the shuffle. As an open source project grows, you end up with a cognitive surplus: an abundance of bright people willing to share their time, intellect, and experience with the larger community.</p>

<p>Often both contributors and maintainers are so busy with their day jobs, families, and self care that they dedicate most of their remaining energy to writing quality code and tests to the best of their ability. A lack of upfront communication to validate ideas from newer contributors, and a lack of communication from maintainers facing a large number of issues, are two communication gaps that stagnate a project. Maintainers are often doers who see more value in addressing quick-win work that flows from the well-established contributors of the project. Follow-through on either side can be difficult: newcomers don&#39;t want to be rude, and maintainers forget or hope someone else will take the time to address the issues on a given pull request.</p>

<p><img src="https://i.snap.as/7TdSoquQ.jpg" alt=""/></p>

<p>Waiting for your work to be reviewed by someone in the community works a bit like a wishing well: you toss in a coin (i.e. your time and effort represented as code and a pull request) and hope your wish of getting your code reviewed and merged comes true. The satisfaction of hypothetical developers benefiting from your small but significant change floods your mind, and you feel like you’ve improved humanity just that one little bit more.</p>

<p>Maintainers are in a constant state of triage, handling the surplus of innovation being thrown at them while simultaneously looking for more help reviewing and serving as the expert in some areas of the code. As you can imagine, good communication can be hard to come by, as many newcomers are strangers who worry they are wasting precious time by asking too many questions rather than just showing a proof of concept. This backfires when developers spend a large portion of their time building a solution that is not compatible with the project, and maintainers lose the opportunity to quickly spin up on the value of the new feature. This is why regular <a href="https://github.com/trinodb/trino/wiki/Contributor-meetings">contributor meetings</a> help solve both of these issues synchronously, cutting out the delayed feedback loops.</p>

<h3 id="history-repeats-itself-until-it-doesn-t">History repeats itself, until it doesn&#39;t</h3>

<p>It became apparent that each time there was <a href="https://github.com/trinodb/trino/pull/2551#issuecomment-709220790">a discussion</a> about how to do <a href="https://github.com/trinodb/trino/pull/10387#issuecomment-1008430060">integration testing</a>, there was no good way to test against a Snowflake instance given the project&#39;s lack of funding. Trino has a high bar for quality, and none of the maintainers felt it was a risk worth taking, given the likely popularity of the integration and the likelihood of future maintenance issues. Each pull request hit this same fate and stalled with no clear path to resolving the real issue: funding the Snowflake infrastructure needed by the <a href="https://trino.io/foundation.html">Trino Software Foundation (TSF)</a>. It’s never fun to mention that you can’t move forward on work because of constraints like these, and without a monetary solution, silence is what the contributor experiences.</p>

<p>Noticing that Teng had already done a significant amount of work to contribute his Snowflake connector, I reached out to him to see if we could brainstorm a solution. Not long after, Erik also reached out to get my thoughts on how to go about contributing Bloomberg&#39;s Snowflake connector. Great, now we had two connector implementations and no way to get the infrastructure needed to test them. During the first <a href="https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html#trino-contributor-congregation">Trino Contributor Congregation</a>, Erik and I brought up Bloomberg&#39;s desire to contribute a Snowflake connector, and I articulated the testing issue. Ironically, this was the first time I had thoroughly articulated the issue to Erik as well.</p>

<p>As soon as I was done, Erik requested the mic and said something to the effect of, “Oh, I wish I would have known that&#39;s the problem; the solution is simple: Bloomberg will provide the TSF a Snowflake account.”</p>

<p>Done!</p>

<p>Just as in business, <strong>you can never underestimate the power of communication in an open source project</strong>. Shortly after Erik, Teng, and I discussed the best ways to merge their work, they set up the Snowflake accounts for Trino maintainers and started the arduous process of building a thorough test suite with the help of Yuya, <a href="https://www.linkedin.com/in/piotrfindeisen/">Piotr Findeisen</a>, <a href="https://www.linkedin.com/in/manfredmoser/">Manfred Moser</a>, and <a href="https://www.linkedin.com/in/traversomartin/">Martin Traverso</a>.</p>

<h2 id="the-long-road-to-snowflake">The long road to Snowflake</h2>

<p>As Teng and Erik merged their efforts, the process was anything but straightforward. There were setbacks, vacations, meticulous reviews, and infrastructure issues. But the perseverance of everyone involved was unwavering.</p>

<p>Bloomberg started by creating <a href="https://github.com/bloomberg/trino">an official Bloomberg Trino repository</a>, originally as a means for Teng and Erik to mesh their solutions together and build the testing infrastructure that relied on Bloomberg resources. Without needing to rely on the main Trino project to merge incremental solutions, they were able to quickly iterate on the early solutions. This repository also facilitated Bloomberg’s now numerous contributions to Trino.</p>

<p>It took a few months just to get the ForePaaS<sup><a class="footnote" href="#fnref1">1</a></sup> and Bloomberg solutions merged. There were valuable pieces from each system, and better integration tests were written with the new testing infrastructure. The two Snowflake connector implementations were merged together by April of 2023. Finally, the reviews could start. Once the initial two passes happened, we anticipated that we would see the Snowflake connector release in the summer of 2023 around Trino Fest. So much so that we planned <a href="https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap.html">a talk with Erik and Teng</a>, initially as a reveal, assuming the pull request would be merged by then. Lo and behold, this didn’t happen, as there were still a lot of concerns around use cases not being properly tested.</p>

<h3 id="the-halting-review-problem">The halting review problem</h3>

<p>A necessary evil that comes with pull request reviews, and more broadly with distributed consensus, is that reviews can drag on over time. This can lead to <a href="https://github.com/trinodb/trino/pull/17909#issuecomment-1841809727">countless updates</a> you have to make to your changes to accommodate the ever-changing project shifting beneath your feet as you simultaneously try to make progress on <a href="https://github.com/trinodb/trino/pull/17909#pullrequestreview-1793724311">suggestions from those reviewing your code</a>.</p>

<p>Many critics of open source like to point this out as a drawback, when in fact this same problem exists in closed source systems. Closed source projects can generally delay difficult decisions to make fast upfront progress and meet certain deadlines. This may be seen as an advantage at first, but as many developers can attest, it simply leads to technical debt and fragile products in most environments that struggle to prioritize a healthy codebase.</p>

<p><img src="https://i.snap.as/Oi74UR5y.jpg" alt=""/></p>

<p>Regardless, having to face these larger discussions upfront can induce fatigue, especially when managing external circumstances: personal affairs, a project at work – you know, the entity that pays these engineers – or countless other factors will rear their ugly heads, and <a href="https://github.com/trinodb/trino/pull/17909#discussion_r1418149737">progress will stagger</a> with ebbs and flows of attention. This can be really dangerous territory and commonly results in contributors and reviewers abandoning the PR when it stalls.</p>

<p>This is why I believe open source, while not beholden to any timelines, needs a project and product management role, one currently often covered by project leaders and devex engineers. This can also relieve tension between the needs of open source and the big businesses in the community with real deadlines, at least keeping the communication consistent while ensuring bugs and design flaws aren’t introduced to the code base.</p>

<h2 id="what-s-in-it-for-bloomberg-and-forepaas">What’s in it for Bloomberg and ForePaaS?</h2>

<p>If you’ve never worked in open source or for a company that contributes to open source, you may be wondering: how the heck do these engineers convince their leadership to let them dump so much time into these contributions? The simple answer is that it’s good for business.</p>

<p>If we peek into why Bloomberg uses Trino, they aggregate data from an unusually large number of data sources across the customers who use their services. Part of this requires them to merge the customer’s dataset with existing aggregate data in Bloomberg’s product. Since Trino can connect to most customer databases out-of-the-box, Bloomberg only needs to manage a small array of custom connectors that provide their services to customers as multiple catalogs behind a single, convenient SQL endpoint. Having engineers maintain a few small connectors rather than an entire distributed query engine saves a lot of time and maintenance.</p>
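<p>As a hedged sketch of what that looks like in practice (the catalog, schema, table, and column names below are hypothetical, not Bloomberg’s actual setup), a single Trino endpoint lets one ANSI SQL query join a customer’s database with internal aggregate data across catalogs:</p>

<pre><code>-- Hypothetical catalogs: "customer_pg" (a customer's PostgreSQL database,
-- reached through an out-of-the-box connector) and "aggregates" (internal
-- data behind a custom connector). Each catalog is just a connector config.
SELECT p.account_id,
       p.position_value,
       a.benchmark_return
FROM customer_pg.public.portfolios AS p
JOIN aggregates.market.benchmarks AS a
  ON p.benchmark_id = a.benchmark_id;
</code></pre>

<p>Because each catalog maps to one connector configuration, supporting a new customer data source is a configuration change rather than new engine code.</p>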

<p>Despite how many problems Trino already solves for them, Bloomberg and ForePaaS needed this Snowflake connector, and through the open source model they created it for themselves. The drawback is that the solution must be maintained by the engineers at each company any time they want to upgrade to a new Trino feature. This consistently depletes engineering resources, so they want to maintain as few features as possible to free up their engineers’ time. Open source projects are generally more than happy to accept features that the community benefits from. This doesn’t mean we shouldn’t appreciate when companies contribute. This positive-sum generosity and forward-thinking approach enabled Erik and Teng to combine their battle-tested connectors, creating something of high value for the community.</p>

<p>If you are a developer who sees the value in contributing to open source, and you aren&#39;t sure how to convince leadership to get on board, you need to speak their language. Show how companies like Bloomberg get involved in open source, and how it lowers maintenance costs when done correctly. If you see an open project like Trino that could replace 97% of a new project, demonstrate that the upfront cost will pay off when you reduce the amount of code managed by your team, which lowers the future need to expand headcount. I don’t imagine a world where your boss and colleagues are altruists, but if you present an economic incentive that lowers the amortized cost of the engineers needed to maintain a project, then your strategy becomes helpful to the company&#39;s bottom line.</p>

<p>While the immediate investment shows small gains for a single team at a single company, once that change exists in open source, other companies can immediately benefit and offer better testing and improvements than you could have asked for when managing the original project with your own team. Humanity at large benefits from every contribution done this way, and the more companies that embrace this, the less effort we waste pointlessly duplicating work.</p>

<h2 id="esprit-de-corps">Esprit de Corps</h2>

<p>The Marines use the mantra “Esprit de Corps,” French for “spirit of the body,” though I mistakenly took the “Corps” part to mean the Marine Corps rather than the more general meaning of a body or group of people. In fact, it expresses <a href="https://www.merriam-webster.com/dictionary/esprit%20de%20corps">the common spirit existing in the members of a group and inspiring enthusiasm, devotion, and strong regard for the honor of the group</a>. Any time I see this type of shared and selfless cooperation in open source, I’m reminded of the bond, friendships, and care between me and my fellow Marines. Despite the unfortunate political circumstances of our mission, I do treasure the shared companionship with both my fellow Marines and the local Iraqi people. There is ultimately a power in the gathering of many when aimed at building an altruistic means of improving each other’s lives.</p>

<p><img src="https://i.snap.as/TO03Akr4.jpeg" alt=""/></p>

<p>In the same way, this demonstration of human cooperation is about more than just developing a connector; it&#39;s about the shared experiences, the friendships forged, and the skills honed in the pursuit of a common goal. The successful addition of the Trino Snowflake connector is a testament to the positive-sum outcomes of open source collaboration. This journey has been about collaboration, learning, and growth that will benefit many. I remember the night I got the email that Yuya had <a href="https://github.com/trinodb/trino/pull/17909">merged the pull request</a>; I was ecstatic, to say the least. The connector shipped with <a href="https://trino.io/docs/current/release/release-440.html#general">Trino version 440</a>, making it possible to connect to the most widely adopted cloud warehouse.</p>

<p>Once the hard work was done, many valuable iterations like <a href="https://github.com/trinodb/trino/pull/21219">adding Top-N support</a> (Shopee), <a href="https://github.com/trinodb/trino/pull/21365">adding Snowflake Iceberg REST catalog support</a> (Starburst), and <a href="https://github.com/trinodb/trino/pull/21365">adding better type mapping</a> (Apple) were added to the Snowflake integration. I love showcasing this trailblazing and, yes, altruistic work from Erik, Teng, Yuya, Martin, Manfred, and Piotr – and everyone who helped in the Trino community. A special thanks to the managers and leadership at Bloomberg and ForePaaS for their generous commitment of time and resources.</p>

<p>As we celebrate this milestone, we&#39;re already looking forward to the next adventure. Here&#39;s to federating them all, together!</p>

<p>Notes:
<sup><a class="footnote-ref" href="#fn1">1</a></sup><span class="footnote-ref-text">ForePaaS has been integrated into <a href="https://ovhcloud.com">OVHCloud</a>, which is now called <a href="https://help.ovhcloud.com/csm/en-public-cloud-data-platform-what-is?id=kb_article_view&amp;sysparm_article=KB0060801">Data Platform</a>.</span></p>

<p><em>bits</em></p>
]]></content:encoded>
      <guid>https://bitsondata.dev/trino-snowflake-bloomberg-oss-win</guid>
      <pubDate>Wed, 08 May 2024 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Iceberg won the table format war</title>
      <link>https://bitsondata.dev/iceberg-won-the-table-format-war?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[the night sky is lit up over the water&#xA;&#xA;Photo by Michail Dementiev on Unsplash&#xA;&#xA;TL;DR: I believe Apache Iceberg won the table format wars, not because of a feature race, but primarily because of the open Iceberg spec. There are some features only available in Iceberg due to the breaking of compatibility with Hive, which was also a contributing factor to the adoption of the implementation.&#xA;&#xA;!--more--&#xA;&#xA;Disclaimer: I am the Head of Developer Relations at Tabular and a Developer Advocate in both Apache Iceberg and Trino communities. All of my 🌶️ takes are my biased opinion and not necessarily the opinion of Tabular, the Apache Software Foundation, the Trino Software Foundation, or the communities I work with. This also goes into a bit of my personal story for leaving my previous company but relates to my reasoning so I offer you a TL;DR if you don’t care about the details.&#xA;&#xA;!--emailsub--&#xA;&#xA;My revelation with Iceberg&#xA;&#xA;Two months ago, I made the difficult decision to leave Starburst which was hands down the best job I’ve ever had up to this point. Since I left I’ve had a lot of questions about my motivations for leaving and wanted to put some concerns to rest. This role allowed me to get deeply involved in open-source during working hours and showed me how I could aid the community to get traction on their work and drive the roadmap for many in the project. This was a new calling that overlapped with many altruist parts of how I define myself and was deeply rewarding.&#xA;&#xA;I made some incredible friends, some of which have become invaluable mentors during this process of learning the nuances and interplay between venture capital and an open-source community. So why did I leave this job that I love so much?&#xA;&#xA;Apache Iceberg Baby&#xA;&#xA;Let’s time-travel (pun intended) to the first Iceberg episode of the Trino Community Broadcast. 
In true ADHD form, I crammed learning about Apache Iceberg well into the night before the broadcast with the creator of Iceberg, Ryan Blue. While setting up that demo, I really started to understand what a game-changer Iceberg was. I had heard the Trino users and maintainers talk about Iceberg replacing Hive but it just didn’t sink in for the first couple of months. I mean really, what could be better than Hive? 🥲&#xA;&#xA;While researching I learned about hidden partitioning, schema evolution, and most importantly, the open specification. The whole package was just such an elegant solution to problems that had caused me and many in the Trino community failed deployments and late-night calls. Just as I had the epiphany with Trino (Presto at the time) of how big of a productivity booster SQL queries over multiple systems were, I had a similar experience with Iceberg that night. Preaching the combination of these two became somewhat of a mission of mine after that.&#xA;&#xA;ANSI SQL + Iceberg + Parquet + S3&#xA;&#xA;Immediately after that show, I wrote a four-blog series on Trino on Iceberg, did a talk, and built out the getting started repository for Iceberg. I was rather hooked on the thought of these two technologies in tandem. You start out with a system that can connect to any data source you throw at it and sees it as yet another SQL table. Take that system and add a table format that interacts with all the popular analytics engines people use today from Spark, Snowflake, and DuckDB, to Trino variants like EMR, Athena, and Starburst.&#xA;&#xA;Standards all the way down&#xA;&#xA;This approach to data virtualization is so interesting as each system offers full leverage over vendors trying to lock you in their particular query language or storage format. It pushes the incentives for vendors to support these open standards which puts them in a seemingly vulnerable position compared to locking users in. 
However, that’s a fallacy I hope vendors will slowly begin to understand is not true. With the open S3 storage standard, open file standards like Parquet, open table standards like Iceberg, and the ANSI SQL spec closely followed by Trino, the entire analytics warehouse has become a modular stack of truly open components. This is not just open in the sense of the freedom to use and contribute to a project, but the open standard that enables you the freedom to simply move between different projects.&#xA;&#xA;This new freedom gives the user the features of a data warehouse, with the scalability of the cloud, and a free market of a la carte services to handle your needs at whatever price point you need at any given time. All users need to do in this new ecosystem is shop around and choose any open project or vendor that implements the open standard and your migration cost will be practically non-existent. This is the ultimate definition of future-proofing your architecture.&#xA;&#xA;!--emailsub--&#xA;&#xA;Back to why I left Starburst&#xA;&#xA;Trino Community&#xA;&#xA;I’ll quickly tie up why I left Starburst before I reveal why Iceberg won the table format wars. For the last three years, I have worked on building awareness around Trino. My partner in crime, Manfred Moser, had been in the Trino community and literally wrote the book on Trino. Together we spent long days and nights growing the Trino community. I loved every minute of it and honestly didn’t see myself leaving Starburst or shifting focus from Trino until it became an analytics organization standard.&#xA;&#xA;Something became apparent though. Trino community health was thriving, and there were many organic product movements taking place in the Trino community. Cole Bowden was boosting the Trino release process getting us to cutting Trino releases every 1-2 weeks which is unprecedented in open-source. 
Cole, Manfred, and I did a manual scan over the pull requests and gracefully closed or revived outdated or abandoned pull requests. The Trino community is in great shape.&#xA;&#xA;Iceberg Community&#xA;&#xA;As I looked at Iceberg, the adoption and awareness were growing at an unprecedented rate with Snowflake, BigQuery, Starburst, and Athena all announcing support between 2021 and 2022. However, nothing was moving the needle forward from a developer experience perspective. There was some initial amazing work done by Sam Redai, but there was still so much to be done. I noticed the Iceberg documentation needed improvement. While many vendors were advocating for Iceberg, there was nobody putting in consistent work to the vanilla Iceberg site. PMCs like Ryan Blue, Jack Ye, Russel Spitzer, Dan Weeks, and many others are doing a great shared job of driving roadmap features for Iceberg, but no individual currently has the time to dedicate to the cat herding, improving communication in the project, or bettering the developer and contributor experience for users. Since Trino was on stable ground it felt imperative to move to Iceberg and fill in these gaps. When Ryan approached me with a Head of DevRel position at Tabular, I couldn’t pass up the opportunity. To be clear I left Starburst but not the Trino community. Being at Iceberg also helped me in my mission to continue forging these two technologies that I believe in so much.&#xA;&#xA;Tell us why Iceberg won already!&#xA;&#xA;Moving back to the meat of the subject. My first blog in the Trino community covered what I once called, the invisible Hive spec to alleviate confusion around why Trino would need a “Hive” connector if it’s a query engine itself. The reason we called the Hive connector as such is that it translated Parquet files sitting in an S3 object store into a schema that could be read and modified via a query engine that knew the Hive spec. This had nothing to do with Hive the query engine, but Hive the spec. 
Why was the Hive spec invisible 👻? Because nobody wrote it down. It was in the minds of the engineers who put Hive together, spread across Cloudera forums by engineers who had bashed their heads against the wall and reverse-engineered the binaries to understand this “spec”.&#xA;&#xA;Why do you even need a spec?&#xA;&#xA;Having an “invisible” spec was rather problematic, as every implementation of that spec ran on different assumptions. I spent days searching Hive solutions on Hortonworks/Cloudera forums trying to solve an error with solutions that for whatever reason, didn’t work on my instance of Hive. There were also implicit dependencies required with using the Hive spec. It’s actually incorrect to call the Hive spec as such because a spec should have independence of platform, programming language, and hardware, while including minimal implementation details not required for interoperability between implementations. SQL, for instance, is a declarative language that runs on systems written in, C++, Rust, Java, and doesn’t get into the business of telling query engines how to answer that query, just what behavior is expected.&#xA;&#xA;The open spec is why Iceberg won&#xA;&#xA;By the time there were various iterations of Hive from the Hadoop and Big Data Boom, there was not a very central source to lay a stake in the ground to standardize this spec. While you may imagine that we would have learned our lesson, until Iceberg, none of the projects officially formalized their assumptions and specifications for their project. 
Delta Lake and Hudi extended the Hive models to make migration from Hive simpler but kept a few of the issues that Hive introduced, like exposing the partitioning format to users running analysis.&#xA;&#xA;Open specifications and vendor politics&#xA;&#xA;You may seem skeptical that an open specification for a table format holds such weight, but it crept its way into a well-known feud between two large analytics vendors of the day, Databricks and Snowflake. Databricks has been the leading vendor in the data lakehouse market while Snowflake dominated the data warehouse market. Snowflake’s original strategy initially involved encouraging movement from the data lakehouse market back to the data warehouse market, while Databrick’s original strategy was to do the exact opposite. This conveyed the outdated trap that many vendors still fall prey to. This idea is that locking users in will help your business and stakeholders reduce churn and keep customers in the long run. Vendor lock-in was once a decades-long play, but as B2C consumerism expectations creep into B2B consumerism, we are seeing a gradual shift of practitioners demanding interoperability from vendors to give them the level of autonomy they have experienced with open-source projects.&#xA;&#xA;This surfaced with Snowflake when they announced that they would be offering support for an Iceberg external table functionality in 2022. At first, I raised my eyebrow at this as I figured this was a feeble attempt for Snowflake to market itself as an open-friendly warehouse when the external table would be nothing but an easy way to migrate data from the data lake to Snowflake. Whatever their motives, this was good visibility for Iceberg and I was thrilled that Snowflake was showcasing the need for an open specification, gimmick or not. 
This even pressured other competing data warehouses like BigQuery to add Iceberg support.&#xA;&#xA;!--emailsub--&#xA;&#xA;The final signal that the open Iceberg spec won&#xA;&#xA;I didn’t, however, expect to see what took place at both Snowflake Summit and Data + AI Summit this year in 2023. If you don’t know these are Snowflake and Databricks’ big events that happened on the same week as an extension of their feud. There were already some hints that Snowflake had dropped to signal that they were ramping up their Iceberg support but were doing a great job at keeping it under wraps. The final reveal came at Snowflake Summit; Snowflake now offers managed Iceberg catalog support.&#xA;&#xA;I was thrilled to see that finally, a data warehouse vendor understands there’s no way to beat open standards and they should adopt open storage as part of their core business model. What’s even more about this picture is they show both Trino and Spark engines as open compute alternatives to their own engine. This was beyond my expectations for Snowflake, and definitely showed me they were heading in the right direction for their customers.&#xA;&#xA;Meanwhile, in a California town not far away, Databricks would also have a response to Snowflake’s announcement stored up. Delta Lake 3.0 was announced and it now supports compatibility across Delta Lake, Iceberg, and Hudi (be it limited compatibility for Iceberg for now). Whatever happens, there is now opportunity for Databricks customers to trial Iceberg, and this should excite everyone. We’re one step closer to having this spec become the common denominator. With both of these moves from a company that initially only wanted their proprietary format to win and a company that built on a competing format, I have the opinion that Iceberg has won the format wars.&#xA;&#xA;Now I don’t want to act like these vendors simply have users’ best interests in mind. All they want is for you to make the decision to choose them. 
I work for a vendor, we also want your money. What I am seeing though is that now the industry is trending towards incentivizing openness as more customers demand it. In order for companies to stay ahead, they must embrace this fact rather than fight it. What is rather historical about this moment in time is that since the dawn of analytics and the data warehouse, there has been vendor lock-in on many fronts of the analytics market. This to me, signals the nail in the coffin of any capability for vendors to do this on the level playing field of open standards. It’s now up to the vendors to keep you happy and continuously stop you from churning. This in the long run is good for users and vendors and will ultimately drive better products.&#xA;&#xA;One quick aside, some may mention that Hudi recently added a “spec” so why does Iceberg having a spec give it the winning vote? I recommend you go back to reading the purpose of an open spec section earlier in this blog, then look at both the Iceberg spec and the Hudi “spec” and determine which one satisfies the criteria. Hudi’s “spec” exposes Java dependencies making it unusable for any system not running on the JDK, doesn’t clarify schema, and has a lot of implementation details rather than leaving that to the systems that implement it. This “spec” is something closer to an architecture document than an open spec.&#xA;&#xA;Will the Iceberg project drive Delta Lake and Hudi to nonexistence?&#xA;&#xA;Maybe, or maybe not. That really depends on the feature set that these three table formats offer and what the actual value they bring to the users is. In other words, you decide what’s important, we decide what we’re going to support, and vendors and the larger data community decide who stays and who goes. This simply boils down to what users want, and if there is enough deviation for it to make sense for multiple formats to exist. 
Having competition in open-source is every bit as healthy as having competition in the vendor space - with some exceptions:&#xA;&#xA;Exception 1: The projects don’t collectively build on an open specification. This enables vendor lock-in under the guise of being “open” simply because the source is available.&#xA;&#xA;Exception 2: The projects are not just doubled efforts of each other. This is a sad waste of time and adds analysis paralysis to users. If two projects have clear value added in different domains with some overlap, then the choice becomes clearer based on the use case.&#xA;&#xA;Exception 3: Any of the competing OSS projects primarily serve the needs of one or a few companies over the larger community. This exception is an anti-pattern of innovation brought up in the Innovator’s Dilemma. Focusing on the needs of the few will eventually force the majority to move to the next thing. Truly open projects will continue to evolve at the pace of the industry.&#xA;&#xA;Just because I’m touting that Apache Iceberg has “won the table format wars” with an open spec, does not mean I am discrediting the hard work done by the competing table formats or advocating not to use them. Delta Lake has made sizeable efforts to stay competitive on a feature-by-feature basis. I’m also still good friends with Mr. DeltaLake himself, Denny Lee. I hold no grudge against these amazing people trying to do better for their users and customers. I am also excited at the fact that all formats now have some level of interoperability for users outside of the Iceberg ecosystem!&#xA;&#xA;And…Hudi?&#xA;&#xA;I toiled with my take on Hudi. I really enjoyed working with the folks from Hudi and while I usually follow the “don’t say anything unless it’s nice” rule, it would be disingenuous not to give my real opinion on this matter. 
I even quoted Reddit user u/JudgingYouThisSecond to initially avoid saying my own words, but even his take still doesn’t quite capture my thoughts concisely:&#xA;&#xA;  Ultimately, there is room enough for there to be several winners in the table format space:&#xA;    Hudi, for folks who care about or need near-real-time ingestion and streaming. This project will likely be #3 in the overall pecking order as it&#39;s become less relevant and seems to be losing steam realative to the other two in this list (at least IMO)&#xA;  Delta Lake, for people who play a lot in the Databricks ecosystem and don&#39;t mind that Delta Lake (while OSS) may end up being a bit of a &#34;gateway drug&#34; into a pay-to-play world&#xA;  * Iceberg, for folks looking to avoid vendor lock-in and are looking for a table format with a growing community behind it.&#xA;    While my opinion may change as these projects evolve, I consider myself an &#34;Iceberg Guy&#34; at the moment.&#xA;    Comment by u/JudgingYouThisSecond in dataengineering&#xA;&#xA;I feel like the open-source Hudi project is just not in a state where I would recommend it to a friend. I once thought perhaps Hudi’s upserts made their complexity worth it, but both Delta and Iceberg have improved on this front. My opinion from supporting all three formats in the Trino community is that Hudi is an overly complex implementation with many leaky abstractions such as exposing Hadoop and other legacy dependencies (e.g. the KyroSerializer) and doesn’t actually provide proper transaction semantics.&#xA;&#xA;What I personally hope for is that we get to a point where we converge down to the minimum set of table format projects needed to scratch the different itches that shouldn’t be overlapping, and all adhering to the Iceberg specification. In essence, really only the Iceberg spec has won and Iceberg is currently the only implementation to support it. 
I would be thrilled to see Delta Lake and Hudi projects support the Iceberg spec and make this a fair race and make the new question, which Iceberg implementation won the most adoption? That game will be fun to play. Imagine Tabular, Databricks, Cloudera, and Onehouse all building on the same spec!&#xA;&#xA;Note: Please don’t respond with silly performance benchmarks with TPC (or any standardized dataset in a vendor-controlled benchmark) to suggest the performance of these benchmarks is relevant to this conversation. It’s not.&#xA;&#xA;Thanks for reading, and if you disagree and want to debate or love this and want to discuss it, reach out to me on LinkedIn, Twitter, or the Apache Iceberg Slack.&#xA;&#xA;Stay classy Data Folks!&#xA;&#xA;#iceberg #opensource #openstandard&#xA;&#xA;bits&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><a href="https://images.unsplash.com/photo-1650434908449-9e7db28bbf5a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzfHxzdGFycyUyMGljZWJlcmd8ZW58MHx8fHwxNjgxNzg1OTE4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080"><img src="https://images.unsplash.com/photo-1650434908449-9e7db28bbf5a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwzfHxzdGFycyUyMGljZWJlcmd8ZW58MHx8fHwxNjgxNzg1OTE4&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" alt="the night sky is lit up over the water" title="the night sky is lit up over the water"/></a></p>

<p>Photo by <a href="https://unsplash.com/@bot_va">Michail Dementiev</a> on <a href="https://unsplash.com/">Unsplash</a></p>

<p><strong>TL;DR:</strong> I believe Apache Iceberg won the table format wars, not because of a feature race, but primarily because of <a href="https://iceberg.apache.org/spec/">the open Iceberg spec</a>. There are <a href="https://iceberg.apache.org/docs/latest/partitioning/#icebergs-hidden-partitioning">some</a> <a href="https://iceberg.apache.org/spec/#snapshot-reference">features</a> only available in Iceberg due to the breaking of compatibility with Hive, which was also a contributing factor to the adoption of the implementation.</p>



<p><strong>Disclaimer:</strong> I am the Head of Developer Relations at Tabular and a Developer Advocate in both the Apache Iceberg and Trino communities. All of my 🌶️ takes are my biased opinions and not necessarily the opinion of Tabular, the Apache Software Foundation, the Trino Software Foundation, or the communities I work with. This post also gets into a bit of my personal story of leaving my previous company; it relates to my reasoning, so I offer the TL;DR above if you don’t care about the details.</p>



<h2 id="my-revelation-with-iceberg">My revelation with Iceberg</h2>

<p>Two months ago, I made the difficult decision to leave <a href="https://www.starburst.io/">Starburst</a>, hands down the best job I’d ever had up to that point. Since leaving, I’ve gotten a lot of questions about my motivations and wanted to put some concerns to rest. The role allowed me to get deeply involved in open source during working hours and showed me how I could help the community gain traction on their work and drive the roadmap for many in the project. This was a new calling that overlapped with many altruistic parts of how I define myself, and it was deeply rewarding.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c6c7f7-dc35-4960-8808-02a4a3a723b7_2000x1015.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c6c7f7-dc35-4960-8808-02a4a3a723b7_2000x1015.png" alt=""/></a></p>

<p>I made some incredible friends, some of whom have become invaluable mentors during this process of learning the nuances and interplay between venture capital and an open-source community. So why did I leave this job that I love so much?</p>

<h2 id="apache-iceberg-baby">Apache Iceberg Baby</h2>

<p>Let’s time-travel (<a href="https://iceberg.apache.org/docs/latest/branching/">pun intended</a>) to the <a href="https://trino.io/episodes/15.html">first Iceberg episode of the Trino Community Broadcast</a>. In true ADHD form, I crammed learning about <a href="https://iceberg.apache.org/">Apache Iceberg</a> well into the night before the broadcast with the creator of Iceberg, <a href="https://www.linkedin.com/in/rdblue/">Ryan Blue</a>. While setting up that demo, I really started to understand what a game-changer Iceberg was. I had heard the Trino users and maintainers talk about Iceberg replacing Hive but it just didn’t sink in for the first couple of months. I mean really, what could be better than Hive? 🥲</p>

<p>While researching I learned about <a href="https://iceberg.apache.org/docs/latest/partitioning/#icebergs-hidden-partitioning">hidden partitioning</a>, <a href="https://iceberg.apache.org/docs/latest/evolution/#schema-evolution">schema evolution</a>, and most importantly, <a href="https://iceberg.apache.org/spec/">the open specification</a>. The whole package was just such an elegant solution to problems that had caused me and many in the Trino community failed deployments and late-night calls. Just as I had the epiphany with Trino (<a href="https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html">Presto at the time</a>) of how big of a productivity booster SQL queries over multiple systems were, I had a similar experience with Iceberg that night. Preaching the combination of these two became somewhat of a mission of mine after that.</p>
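<p>To make hidden partitioning concrete, here is a minimal Python sketch of the spec’s <code>day</code> partition transform, which the Iceberg spec defines as days from 1970-01-01. The table derives the partition value from the timestamp column itself, so users filter on the timestamp directly and never on a separate, exposed partition column. This is an illustrative model of the idea, not the Iceberg library’s actual code:</p>

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)

def day_transform(ts: datetime) -> int:
    """Iceberg-style `day` partition transform: days since the Unix epoch."""
    return (ts.date() - EPOCH).days

# Two events on the same calendar day land in the same hidden partition;
# the writer computes this, and the reader prunes files with it, but the
# user only ever sees and filters the timestamp column.
a = day_transform(datetime(2023, 7, 5, 9, 30, tzinfo=timezone.utc))
b = day_transform(datetime(2023, 7, 5, 23, 59, tzinfo=timezone.utc))
```

<p>Because the transform is part of the table metadata rather than the query, the partitioning scheme can later evolve without rewriting user queries.</p>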

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea787b8f-f9fa-4cc0-8e26-c18868bdf0ff_1151x647.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea787b8f-f9fa-4cc0-8e26-c18868bdf0ff_1151x647.png" alt=""/></a></p>

<h2 id="ansi-sql-iceberg-parquet-s3">ANSI SQL + Iceberg + Parquet + S3</h2>

<p>Immediately after that show, <a href="https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg.html">I wrote a four-blog series</a> on Trino on Iceberg, <a href="https://www.youtube.com/watch?v=5-Q74rCX2Z8">did a talk</a>, and built out <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">the getting started repository for Iceberg</a>. I was rather hooked on the thought of these two technologies in tandem. You start out with a system that can connect to <a href="https://trino.io/docs/current/connector.html">any data source</a> you <a href="https://trino.io/docs/current/develop/spi-overview.html">throw at it</a> and sees it as yet another SQL table. Take that system and add a table format that interacts with all the popular analytics engines people use today from <a href="https://iceberg.apache.org/docs/latest/getting-started/">Spark</a>, <a href="https://www.snowflake.com/blog/single-platform-improves-performance-analytics-data-types/">Snowflake</a>, and <a href="https://py.iceberg.apache.org/api/#duckdb">DuckDB</a>, to Trino variants like <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg.html">EMR</a>, <a href="https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html">Athena</a>, and <a href="https://docs.starburst.io/latest/connector/iceberg.html">Starburst</a>.</p>
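<p>The connector idea can be sketched in a few lines of Python. This is a hypothetical model of the pattern, not Trino’s actual SPI: the engine sees each source as rows behind a <code>scan</code> function and pushes the filter down to the connector rather than scanning everything itself:</p>

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Connector:
    name: str
    scan: Callable  # takes a row predicate, returns the matching rows

def run_query(connectors: dict, table: str, predicate) -> list:
    # The "engine" does no scanning itself: it hands the predicate to the
    # connector, which applies it at the source (predicate pushdown).
    return connectors[table].scan(predicate)

# A toy source standing in for a real MySQL catalog.
mysql_rows = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
connectors = {
    "mysql.orders": Connector("mysql",
                              lambda pred: [r for r in mysql_rows if pred(r)]),
}
eu_orders = run_query(connectors, "mysql.orders", lambda r: r["region"] == "EU")
```

<p>Every source behind this interface looks like just another SQL table, which is what makes joining across heterogeneous systems feel uniform.</p>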

<h3 id="standards-all-the-way-down">Standards all the way down</h3>

<p>This <a href="https://www.youtube.com/watch?v=2bkPMrjqxgs&amp;list=PLIlqnK97FLdtQEad5pi22BefNuaSeBt1e">approach to data virtualization</a> is so interesting because each open standard gives users full leverage over vendors trying to lock them into a particular query language or storage format. It pushes vendors to support these open standards, which puts them in a seemingly vulnerable position compared to locking users in. That vulnerability, however, is a fallacy I hope vendors will slowly come to recognize. With the open S3 storage standard, open file standards like Parquet, open table standards like Iceberg, and the ANSI SQL spec closely followed by Trino, the entire analytics warehouse has become a modular stack of truly open components. This is open not just in the sense of the freedom to use and contribute to a project, but in the sense of an open standard that gives you the freedom to simply move between different projects.</p>

<p>This new freedom gives users the features of a data warehouse, the scalability of the cloud, and a free market of a la carte services to handle your needs at whatever price point works at any given time. All you need to do in this new ecosystem is shop around and choose any open project or vendor that implements the open standard, and your migration cost will be practically nonexistent. This is the ultimate definition of future-proofing your architecture.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03492f03-0475-4682-addd-90623a616c64_500x500.jpeg"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03492f03-0475-4682-addd-90623a616c64_500x500.jpeg" alt=""/></a></p>



<h2 id="back-to-why-i-left-starburst">Back to why I left Starburst</h2>

<h3 id="trino-community">Trino Community</h3>

<p>I’ll quickly tie up why I left Starburst before I reveal why Iceberg won the table format wars. For the last three years, I have worked on building awareness around Trino. My partner in crime, <a href="https://www.linkedin.com/in/manfredmoser/">Manfred Moser</a>, had been in the Trino community and <a href="https://simpligility.ca/2022/10/trino-the-definitive-guide-2nd-edition/">literally wrote the book on Trino</a>. Together we spent long days and nights growing the Trino community. I loved every minute of it and honestly didn’t see myself leaving Starburst or shifting focus from Trino until it became an analytics organization standard.</p>

<p>Something became apparent though. Trino <a href="https://trino.io/blog/2023/01/10/trino-2022-the-rabbit-reflects.html">community health was thriving</a>, and there were many <a href="https://github.com/trinodb/trino/pull/17940">organic</a> <a href="https://www.youtube.com/watch?v=JMUtPl-cMRc">product</a> <a href="https://github.com/trinodb/trino/pull/17909">movements</a> taking place in the Trino community. <a href="https://www.linkedin.com/in/cole-m-bowden/">Cole Bowden</a> was streamlining the Trino release process, getting us to a Trino release every one to two weeks, which is unprecedented in open source. Cole, Manfred, and I did a manual scan over the pull requests and gracefully closed or revived outdated or abandoned ones. The Trino community is in great shape.</p>

<h3 id="iceberg-community">Iceberg Community</h3>

<p>As I looked at Iceberg, adoption and awareness were growing at an unprecedented rate, with <a href="https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/">Snowflake</a>, <a href="https://cloud.google.com/bigquery/docs/iceberg-tables#create-using-biglake-metastore">BigQuery</a>, <a href="https://www.techtarget.com/searchdatamanagement/news/252509796/Starburst-Enterprise-brings-Apache-Iceberg-to-data-lakehouse">Starburst</a>, and <a href="https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html">Athena</a> all announcing support between 2021 and 2022. However, nothing was moving the needle forward from a developer experience perspective. There was some initial amazing work done by <a href="https://www.linkedin.com/in/sredai/">Sam Redai</a>, but there was still so much to be done. I noticed the <a href="https://github.com/apache/iceberg/issues?q=is%3Aopen+is%3Aissue+label%3Adocs">Iceberg documentation needed improvement</a>. While many vendors were advocating for Iceberg, nobody was putting in consistent work on the vanilla Iceberg site. PMCs like Ryan Blue, Jack Ye, Russell Spitzer, Dan Weeks, and many others do a great shared job of driving roadmap features for Iceberg, but no individual currently has the time to dedicate to the cat herding, improving communication in the project, or bettering the developer and contributor experience. Since Trino was on stable ground, it felt imperative to move to Iceberg and fill in these gaps. When Ryan approached me with a Head of DevRel position at Tabular, I couldn’t pass up the opportunity. To be clear, I left Starburst, not the Trino community. Working on Iceberg full time also helps me in my mission to continue forging together these two technologies I believe in so much.</p>

<h2 id="tell-us-why-iceberg-won-already">Tell us why Iceberg won already!</h2>

<p>Moving back to the meat of the subject: my first blog in the Trino community covered what I once called <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector.html">the invisible Hive spec</a>, written to alleviate confusion around why Trino would need a “Hive” connector if it’s a query engine itself. We called the connector that because it translated Parquet files sitting in an S3 object store into a schema that could be read and modified by any query engine that knew the Hive spec. This had nothing to do with Hive the query engine, only Hive the spec. Why was the Hive spec <em>invisible</em> 👻? Because nobody wrote it down. It lived in the minds of the engineers who put Hive together, and spread across Cloudera forums via engineers who had bashed their heads against the wall and reverse-engineered the binaries to understand this “spec”.</p>
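<p>A large part of that unwritten convention is the directory layout itself: partition values encoded as <code>key=value</code> path segments that every engine had to parse the same way. Here is a toy sketch of what such a parser implicitly assumes (illustrative, not Hive’s actual code):</p>

```python
from urllib.parse import unquote

def parse_hive_partitions(path: str) -> dict:
    """Recover partition values from a Hive-style path such as
    s3://bucket/table/year=2023/month=07/part-0.parquet.
    This directory-naming convention is the heart of the unwritten
    'Hive spec' that every engine had to reverse-engineer."""
    partitions = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = unquote(value)  # values are URL-encoded on disk
    return partitions

parts = parse_hive_partitions("s3://lake/events/year=2023/month=07/part-0.parquet")
```

<p>Every engine had to agree on details like this encoding without a document to agree on, which is exactly the gap a written spec closes.</p>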

<h3 id="why-do-you-even-need-a-spec">Why do you even need a spec?</h3>

<p>Having an “invisible” spec was rather problematic, as every implementation of that spec ran on different assumptions. I spent days <a href="https://community.cloudera.com/t5/forums/filteredbylabelpage/board-id/Questions/label-name/Apache%20Hive?type=everything">searching Hive solutions on Hortonworks/Cloudera forums</a> trying to solve an error with solutions that, for whatever reason, didn’t work on my instance of Hive. There were also implicit dependencies required to use the Hive spec. It’s actually incorrect to call the Hive spec a spec at all, because a spec should be independent of platform, programming language, and hardware, while including no implementation details beyond those required for interoperability between implementations. SQL, for instance, is a declarative language that runs on systems written in C++, Rust, or Java, and doesn’t get into the business of telling query engines how to answer a query, just what behavior is expected.</p>

<h3 id="the-open-spec-is-why-iceberg-won">The open spec is why Iceberg won</h3>

<p>By the time various iterations of Hive had emerged from the Hadoop and Big Data boom, there was no central place to lay a stake in the ground and standardize this spec. While you might imagine we would have learned our lesson, until Iceberg, none of the projects officially formalized their assumptions and specifications. Delta Lake and Hudi extended the Hive model to make migration from Hive simpler, but kept a few of the issues Hive introduced, like exposing the partitioning format to users running analysis.</p>

<h3 id="open-specifications-and-vendor-politics">Open specifications and vendor politics</h3>

<p>You may be skeptical that an open specification for a table format holds such weight, but it crept its way into <a href="https://www.reddit.com/r/dataengineering/comments/13wqhby/databricks_and_snowflake_stop_fighting_on_social/">a well-known feud</a> between two large analytics vendors of the day, Databricks and Snowflake. Databricks has been the leading vendor in the data lakehouse market, while Snowflake dominated the data warehouse market. Snowflake’s <a href="https://web.archive.org/web/20230201001308/https://www.snowflake.com/guides/what-data-lakehouse">original strategy</a> involved encouraging movement from the data lakehouse market back to the data warehouse market, while Databricks’ <a href="https://web.archive.org/web/20230626115324/https://www.databricks.com/glossary/data-warehouse">original strategy</a> was to do the exact opposite. Both strategies reflect an outdated trap many vendors still fall prey to: the idea that locking users in will help your business and stakeholders reduce churn and keep customers in the long run. Vendor lock-in was once a decades-long play, but as B2C consumer expectations creep into B2B, practitioners are gradually demanding interoperability from vendors to get the level of autonomy they have experienced with open-source projects.</p>

<p>This surfaced with Snowflake when they announced in 2022 that they would be offering <a href="https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/">support for Iceberg external table functionality</a>. At first, I raised an eyebrow at this, figuring it was a feeble attempt by Snowflake to market itself as an open-friendly warehouse while the external table would be nothing but an easy way to migrate data from the data lake into Snowflake. Whatever their motives, this was good visibility for Iceberg, and I was thrilled that Snowflake was showcasing the need for an open specification, gimmick or not. It even pressured competing data warehouses like BigQuery to add Iceberg support.</p>



<h3 id="the-final-signal-that-the-open-iceberg-spec-won">The final signal that the open Iceberg spec won</h3>

<p>I didn’t, however, expect to see what took place at Snowflake Summit and Data + AI Summit this year in 2023. If you don’t know, these are Snowflake’s and Databricks’ big events, held the same week as an extension of their feud. There were <a href="https://www.reddit.com/r/dataengineering/comments/13wqhby/comment/jmdo259/">already some hints Snowflake had dropped</a> signaling that they were ramping up their Iceberg support, though they did a great job of keeping it under wraps. The final reveal came at Snowflake Summit: Snowflake now offers <a href="https://www.linkedin.com/feed/update/urn:li:activity:7079822990607089665">managed Iceberg catalog support</a>.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9168cb20-ff36-4f58-873f-1a7a568fc75f_4096x3072.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9168cb20-ff36-4f58-873f-1a7a568fc75f_4096x3072.png" alt=""/></a></p>

<p>I was thrilled to see that finally a data warehouse vendor understands there’s no way to beat open standards, and that adopting open storage should be part of their core business model. What’s more, the picture shows both Trino and Spark as open compute alternatives to Snowflake’s own engine. This was beyond my expectations for Snowflake, and it definitely showed me they were heading in the right direction for their customers.</p>

<p>Meanwhile, in a California town not far away, Databricks had a response to Snowflake’s announcement stored up. <a href="https://www.databricks.com/company/newsroom/press-releases/announcing-delta-lake-30-new-universal-format-offers-automatic">Delta Lake 3.0 was announced</a>, and it now supports compatibility across Delta Lake, Iceberg, and Hudi (be it <a href="https://github.com/delta-io/delta/commit/54492016f0eb9c12d25cbbdcc85804fb0a5d9a3a">limited compatibility for Iceberg for now</a>). Whatever happens, there is now an opportunity for Databricks customers to trial Iceberg, and this should excite everyone. We’re one step closer to having this spec become the common denominator. With both of these moves, one from a company that initially only wanted its proprietary format to win and one from a company that built on a competing format, I am of the opinion that Iceberg has won the format wars.</p>

<p>Now I don’t want to act like these vendors simply have users’ best interests in mind. All they want is for you to choose them. I work for a vendor; we also want your money. What I am seeing, though, is that the industry is trending toward incentivizing openness as more customers demand it. For companies to stay ahead, they must embrace this fact rather than fight it. What is rather historic about this moment is that since the dawn of analytics and the data warehouse, there has been vendor lock-in on many fronts of the analytics market. This, to me, is the nail in the coffin for vendors’ ability to lock users in on the level playing field of open standards. It’s now up to the vendors to keep you happy and continuously prevent you from churning. In the long run, this is good for users and vendors alike and will ultimately drive better products.</p>

<p>One quick aside, some may mention that Hudi recently added a “spec” so why does Iceberg having a spec give it the winning vote? I recommend you go back to reading the purpose of an open spec section earlier in this blog, then look at both the Iceberg <a href="https://web.archive.org/web/20230521215437/https://iceberg.apache.org/spec/">spec</a> and the Hudi <a href="https://web.archive.org/web/20230609034725/https://hudi.apache.org/tech-specs/">“spec”</a> and determine which one satisfies the criteria. Hudi’s “spec” <a href="https://web.archive.org/web/20230609034725/https://hudi.apache.org/tech-specs/#log-file-format">exposes Java dependencies</a> making it unusable for any system not running on the JDK, doesn’t clarify schema, and has <a href="https://web.archive.org/web/20230609034725/https://hudi.apache.org/tech-specs/#reader-expectations">a lot of implementation details</a> rather than leaving that to the systems that implement it. This “spec” is something closer to an architecture document than an open spec.</p>

<h3 id="will-the-iceberg-project-drive-delta-lake-and-hudi-to-nonexistence">Will the Iceberg project drive Delta Lake and Hudi to nonexistence?</h3>

<p>Maybe, or maybe not. That really depends on the feature sets these three table formats offer and the <em><strong>actual value</strong></em> they bring to users. In other words, you decide what’s important, we decide what we’re going to support, and vendors and the larger data community decide who stays and who goes. It simply boils down to what users want, and whether there is enough deviation to justify multiple formats existing. Having competition in open source is every bit as healthy as having competition in the vendor space, with some exceptions:</p>

<p>Exception 1: The projects don’t collectively build on an open specification. This enables vendor lock-in under the guise of being “open” simply because the source is available.</p>

<p>Exception 2: The projects are merely duplicated efforts of each other. This is a sad waste of time and creates analysis paralysis for users. If two projects add clear value in different domains with some overlap, then the choice becomes clearer based on the use case.</p>

<p>Exception 3: Any of the competing OSS projects primarily serve the needs of one or a few companies over the larger community. This exception is an anti-pattern of innovation brought up in <a href="https://en.wikipedia.org/wiki/The_Innovator%27s_Dilemma">the Innovator’s Dilemma</a>. Focusing on the needs of the few will eventually force the majority to move to the next thing. Truly open projects will continue to evolve at the pace of the industry.</p>

<p>Just because I’m touting that Apache Iceberg has “won the table format wars” with an open spec, does not mean I am discrediting the hard work done by the competing table formats or advocating not to use them. Delta Lake has made <a href="https://www.databricks.com/blog/2022/06/30/open-sourcing-all-of-delta-lake.html">sizeable efforts to stay competitive</a> on a feature-by-feature basis. I’m also <a href="https://www.linkedin.com/feed/update/urn:li:ugcPost:7055051825321832448?commentUrn=urn%3Ali%3Acomment%3A%28ugcPost%3A7055051825321832448%2C7055173804460818432%29&amp;replyUrn=urn%3Ali%3Acomment%3A%28ugcPost%3A7055051825321832448%2C7055245121428094976%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287055173804460818432%2Curn%3Ali%3AugcPost%3A7055051825321832448%29&amp;dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287055245121428094976%2Curn%3Ali%3AugcPost%3A7055051825321832448%29">still good friends</a> with Mr. DeltaLake himself, <a href="https://www.linkedin.com/in/dennyglee/">Denny Lee</a>. I hold no grudge against these amazing people trying to do better for their users and customers. I am also excited at the fact that all formats now have some level of interoperability for users outside of the Iceberg ecosystem!</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43f2c01a-a8ec-4368-b106-a351c1b4f3fb_1280x960.jpeg"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43f2c01a-a8ec-4368-b106-a351c1b4f3fb_1280x960.jpeg" alt=""/></a></p>

<h3 id="and-hudi">And…Hudi?</h3>

<p>I wrestled with my take on Hudi. I <a href="https://www.youtube.com/watch?v=aL3PfMjvFM4">really enjoyed working with the folks from Hudi</a>, and while I usually follow the “don’t say anything unless it’s nice” rule, it would be disingenuous not to give my real opinion on this matter. I even quoted Reddit user u/JudgingYouThisSecond to initially avoid saying my own words, but even his take still doesn’t quite capture my thoughts concisely:</p>

<blockquote><p>Ultimately, there is room enough for there to be several winners in the table format space:</p>
<ul><li>Hudi, for folks who care about or need near-real-time ingestion and streaming. This project will likely be #3 in the overall pecking order as it&#39;s become less relevant and seems to be losing steam relative to the other two in this list (at least IMO)</li>
<li>Delta Lake, for people who play a lot in the Databricks ecosystem and don&#39;t mind that Delta Lake (while OSS) may end up being a bit of a “gateway drug” into a pay-to-play world</li>
<li>Iceberg, for folks looking to avoid vendor lock-in and are looking for a table format with a growing community behind it.</li></ul>

<p>While my opinion may change as these projects evolve, I consider myself an “Iceberg Guy” at the moment.</p>

<p><a href="https://www.reddit.com/r/dataengineering/comments/13gevpu/whats_the_view_on_apache_iceberg/jjzntu1/">Comment</a> by <a href="https://www.reddit.com/user/JudgingYouThisSecond">u/JudgingYouThisSecond</a> in <a href="https://www.reddit.com/r/dataengineering/">dataengineering</a></p></blockquote>

<p>I feel like the open-source Hudi project is just not in a state where I would recommend it to a friend. I once thought perhaps Hudi’s upserts made its complexity worth it, but both <a href="https://docs.delta.io/latest/delta-streaming.html#table-streaming-reads-and-writes">Delta</a> and <a href="https://lists.apache.org/thread/1pjn2010r45lq6tk7jn2g4klg0xv68zn">Iceberg</a> have improved on this front. My opinion from supporting all three formats in the Trino community is that Hudi is <a href="https://www.reddit.com/r/dataengineering/comments/11qvpnl/is_hudi_a_hardcomplex_tool_as_it_seems/">an overly complex implementation</a> with many leaky abstractions, such as exposing Hadoop and other legacy dependencies (e.g. the KryoSerializer), and it doesn’t actually provide proper transaction semantics.</p>
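<p>For readers unfamiliar with the term, an upsert follows MERGE-style semantics: source rows whose key matches an existing row update it, and the rest are inserted. A toy in-memory Python sketch of those semantics (the real table formats do this transactionally over immutable data files):</p>

```python
def merge_into(target: list, source: list, key: str) -> list:
    """Toy MERGE/upsert: update target rows whose key matches a source
    row, insert source rows with new keys."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        by_key[row[key]] = dict(row)  # matched -> update, unmatched -> insert
    return list(by_key.values())

orders = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
changes = [{"id": 2, "status": "shipped"}, {"id": 3, "status": "open"}]
merged = merge_into(orders, changes, "id")
```

<p>The hard part in a table format is not this logic but making it atomic and concurrent over files in object storage, which is where the formats differ.</p>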

<p>What I personally hope is that we converge down to the minimum set of table format projects needed to scratch the different, non-overlapping itches, all adhering to the Iceberg specification. In essence, only the Iceberg spec has really won, and Iceberg is currently the only implementation that supports it. I would be thrilled to see the Delta Lake and Hudi projects support the Iceberg spec, making this a fair race and making the new question: which Iceberg implementation won the most adoption? That game will be fun to play. Imagine Tabular, Databricks, Cloudera, and Onehouse all building on the same spec!</p>

<p>Note: Please don’t respond with <a href="https://write.as/bitsondatadev/what-is-benchmarketing-and-why-is-it-bad">silly performance benchmarks with TPC</a> (or any standardized dataset in a vendor-controlled benchmark) to suggest the performance of these benchmarks is relevant to this conversation. It’s not.</p>

<p>Thanks for reading, and if you disagree and want to debate or love this and want to discuss it, reach out to me on <a href="https://www.linkedin.com/in/bitsondatadev/">LinkedIn</a>, <a href="https://twitter.com/bitsondatadev">Twitter</a>, or the <a href="https://join.slack.com/t/apache-iceberg/shared_invite/zt-1uva9gyp1-TrLQl7o~nZ5PsTVgl6uoEQ">Apache Iceberg Slack</a>.</p>

<p>Stay classy Data Folks!</p>

<p><a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a> <a href="https://bitsondata.dev/tag:opensource" class="hashtag"><span>#</span><span class="p-category">opensource</span></a> <a href="https://bitsondata.dev/tag:openstandard" class="hashtag"><span>#</span><span class="p-category">openstandard</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/iceberg-won-the-table-format-war</guid>
      <pubDate>Wed, 05 Jul 2023 17:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Intro to Trino for the Trinewbie</title>
      <link>https://bitsondata.dev/intro-to-trino-for-the-trinewbie?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Learn how to quickly join data across multiple sources&#xA;&#xA;If you haven’t heard of Trino before, it is a query engine that speaks the language of many genres of databases. As such, Trino is commonly used to provide fast ad-hoc queries across heterogeneous data sources. Trino’s initial use case was built around replacing the Hive runtime engine to allow for faster querying of Big Data warehouses and data lakes. This may be the first time you have heard of Trino, but you’ve likely heard of the project from which it was “forklifted”, Presto. If you want to learn more about why the creators of Presto now work on Trino (formerly PrestoSQL) you can read the renaming blog that they produced earlier this year. Before you commit too much to this blog, I’d like to let you know why you should even care about Trino.&#xA;&#xA;!--more--&#xA;&#xA;!--emailsub--&#xA;&#xA;So what is Trino anyways?&#xA;&#xA;The first thing I like to make sure people know about when discussing Trino is that it is a SQL query engine, but not a SQL database. What does that mean? Traditional databases typically consist of a query engine and a storage engine. Trino is just a query engine and does not store data. Instead, Trino interacts with various databases that store their own data in their own formats. Trino parses and analyzes the SQL query you pass in, creates and optimizes a query execution plan that includes the data sources, and then schedules worker nodes that are able to intelligently query the underlying databases they connect to.&#xA;&#xA;I say intelligently, specifically talking about pushdown queries. That’s right, the most intelligent thing for Trino to do is to avoid making more work for itself, and try to offload that work to the underlying database. This makes sense as the underlying databases generally have special indexes and data that are stored in a specific format to optimize the read time. 
It would be silly of Trino to ignore all of that optimized reading capability and do a linear scan of all the data to run the query itself. The goal in most optimizations for Trino is to push down the query to the database and only get back the smallest amount of data needed to join with another dataset from another database, do some further Trino specific processing, or simply return as the correct result set for the query.&#xA;&#xA;Query all the things&#xA;&#xA;So I still have not really answered your question of why you should care about Trino. The short answer is, Trino acts as a single access point to query all the things. Yup. Oh, and it’s super fast at ad-hoc queries over various data sources including data lakes (e.g. Iceberg/Databricks) or data warehouses (e.g. Hive/Snowflake). It has a connector architecture that allows it to speak the language of a whole bunch of databases. If you have a special use case, you can write your own connector that abstracts any database or service away to just be another table in Trino’s domain. Pretty cool right? But that’s actually rarely needed because the most common databases already have a connector written for them. If not, more connectors are getting added by Trino’s open source community every few months.&#xA;&#xA;To make the benefits of running federated queries a bit more tangible, I will present an example. Trino brings users the ability to map standardized ANSI SQL query to query databases that have a custom query DSL like Elasticsearch. With Trino it’s incredibly simple to set up an Elasticsearch catalog and start running SQL queries on it. If that doesn’t blow your mind, let me explain why that’s so powerful.&#xA;&#xA;Imagine you have five different data stores, each with its own independent query language. Your data science or analyst team just wants access to these data stores. 
It would take a ridiculous amount of time for them to have to go to each data system individually, look up the different commands to pull data out of each one, and dump the data into one location and clean it up so that they can actually run meaningful queries. With Trino all they need to use is SQL to access them through Trino. Also, it doesn’t just stop at accessing the data, your data science team is also able to join data across tables of different databases like a search engine like Elasticsearch with an operational database like MySQL. Further, using Trino even enables joining data sources with themselves where joins are not supported, like in Elasticsearch and MongoDB. Did it happen yet? Is your mind blown?&#xA;&#xA;Getting Started with Trino&#xA;&#xA;So what is required to give Trino a test drive? Relative to many open-source database projects, Trino is one of the more simple projects to install, but this still doesn’t mean it is easy. An important element to a successful project is how it adapts to newer users and expands capability for growth and adoption. This really pushes the importance of making sure that there are multiple avenues of entry into using a product all of which have varying levels of difficulty, cost, customizability, interoperability, and scalability. As you increase in the level of customizability, interoperability, and scalability, you will generally see an increase in difficulty or cost and vice versa. Luckily, when you are starting out, you just really need to play with Trino.&#xA;&#xA;Image added by Author&#xA;&#xA;The low-cost and low difficulty way to try out Trino is to use Docker containers. The nice thing about these containers is that you don’t have to really know anything about the installation process of Trino to play around with Trino. While many enjoy poking around documentation and working with Trino to get it set up, it may not be for all. 
I certainly have my days where I prefer a nice chill CLI sesh and other days where I just need to opt-out. If you want to skip to the Easy Button way to deploy Trino (hint, it’s the SaaS deployment) then skip the next few sections.&#xA;&#xA;!--emailsub--&#xA;&#xA;Using Trino With Docker&#xA;&#xA;Trino ships with a Docker image that does a lot of the setup necessary for Trino to run. Outside of simply running a docker container, there are a few things that need to happen for setup. First, in order to use a database like MySQL, we actually need to run a MySQL container as well using the official mysql image. There is a trino-getting-started repository that contains a lot of the setup needed for using Trino on your own computer or setting it up on a test server as a proof of concept. Clone this repository and follow the instructions in the README to install Docker if it is not already.&#xA;&#xA;You can actually run a query before learning the specifics of how this compose file works. Before you run the query, you will need to run the mysql and trino-coordinator instances. To do this, navigate to the mysql/trino-mysql/ directory that contains the docker-compose.yml and run:&#xA;&#xA;docker-compose up -d&#xA;&#xA;Running your first query!&#xA;&#xA;Now that you have Trino running in Docker, you need to open a session to access it. The easiest way to do this is via a console. Run the following Docker command to connect to a terminal on the coordinator:&#xA;&#xA;docker container exec -it trino-mysqltrino-coordinator1 trino&#xA;&#xA;This will bring you to the Trino terminal.&#xA;&#xA;trino  Your first query will actually be to generate data from the tpch catalog and then query the data that was loaded into mysql catalog. 
In the terminal, run the following two queries:&#xA;&#xA;CREATE TABLE mysql.tiny.customer&#xA;AS SELECT * FROM tpch.tiny.customer;&#xA;&#xA;SELECT custkey, name, nationkey, phone &#xA;FROM mysql.tiny.customer LIMIT 5;&#xA;&#xA;The output should look like this.&#xA;&#xA;|custkey|name              |nationkey|phone          |&#xA;|-------|------------------|---------|---------------|&#xA;|751    |Customer#000000751|0        |10-658-550-2257|&#xA;|752    |Customer#000000752|8        |18-924-993-6038|&#xA;|753    |Customer#000000753|17       |27-817-126-3646|&#xA;|754    |Customer#000000754|0        |10-646-595-5871|&#xA;|755    |Customer#000000755|16       |26-395-247-2207|&#xA;&#xA;Congrats! You just ran your first query on Trino. Did you feel the rush!? Okay well, technically we just copied data from a data generation connector and moved it into a MySQL database and queried that back out. It’s fine if this simple exercise didn’t send goosebumps flying down your spine but hopefully, you can extrapolate the possibilities when connecting to other datasets.&#xA;&#xA;A good initial exercise to study the compose file and directories before jumping into the Trino installation documentation. 
Let’s see how this was possible by breaking down the docker-compose file that you just ran.&#xA;&#xA;version: &#39;3.7&#39;&#xA;services:&#xA;  trino-coordinator:&#xA;    image: &#39;trinodb/trino:latest&#39;&#xA;    hostname: trino-coordinator&#xA;    ports:&#xA;      &#39;8080:8080&#39;&#xA;    volumes:&#xA;      ./etc:/etc/trino&#xA;    networks:&#xA;      trino-network&#xA;&#xA;  mysql:&#xA;    image: mysql:latest&#xA;    hostname: mysql&#xA;    environment:&#xA;      MYSQLROOTPASSWORD: admin&#xA;      MYSQLUSER: admin&#xA;      MYSQLPASSWORD: admin&#xA;      MYSQLDATABASE: tiny&#xA;    ports:&#xA;      &#39;3306:3306&#39;&#xA;    networks:&#xA;      trino-network&#xA;networks:&#xA;  trino-network:&#xA;    driver: bridge&#xA;&#xA;Notice that the hostname of mysql matches the instance name, and the mysql instance is on the trino-network that the trino-coordinator instance will also join. Also notice that the mysql image exposes port 3306 on the network.&#xA;&#xA;Finally, we will use the trinodb/trino image for the trino-coordinator instance, and use the volumes option to map our local custom configurations for Trino to the /etc/trino directory we discuss further down in the Trino Configuration section. Trino should also be added to the trino-network and expose ports 8080 which is how external clients can access Trino. Below is an example of the docker-compose.yml file. The full configurations can be found in this getting started with Trino repository.&#xA;&#xA;These instructions are a basic overview of the more complete installation instructions if you’re really going for it! If you’re not that interested in the installation, feel free to skip ahead to the Deploying Trino at Scale with Kubernetes section. If you’d rather not deal with Kubernetes I offer you another pass to the easy button section of this blog.&#xA;&#xA;Trino requirements&#xA;&#xA;The first requirement is that Trino must be run on a POSIX-compliant system such as Linux or Unix. 
There are some folks in the community that have gotten Trino to run on Windows for testing using runtime environments like cygwin but this is not supported officially. However, in our world of containerization, this is less of an issue and you will be able to at least test this on Docker no matter which operating system you use.&#xA;&#xA;Trino is written in Java and so it requires the Java Runtime Environment (JRE). Trino requires a 64-bit version of Java 11, with a minimum required version of 11.0.7. Newer patch versions such as 11.0.8 or 11.0.9 are recommended. The launch scripts for Trino bin/launcher, also require python version 2.6.x, 2.7.x, or 3.x.&#xA;&#xA;Trino Configuration&#xA;&#xA;To configure Trino, you need to first know the Trino configuration directory. If you were installing Trino by hand, the default would be in a etc/ directory relative to the installation directory. For our example, I’m going to use the default installation directory of the Trino Docker image, which is set in the run-trino script as /etc/trino. We need to create four files underneath this base directory. I will describe what these files do and you can see an example in the docker image I have created below.&#xA;&#xA;config.properties — This is the primary configuration for each node in the trino cluster. There are plenty of options that can be set here, but you’ll typically want to use the default settings when testing. The required configurations include indicating if the node is the coordinator, setting the http port that Trino communicates on, and the discovery node url so that Trino servers can find each other.&#xA;&#xA;jvm.config — This configuration contains the command line arguments you will pass down to the java process that runs Trino.&#xA;&#xA;log.properties — This configuration is helpful to indicate the log levels of various java classes in Trino. 
It can be left empty to use the default log level for all classes.&#xA;&#xA;node.properties — This configuration is used to uniquely identify nodes in the cluster and specify locations of directories in the node.&#xA;&#xA;The next directory you need to know about is the catalog/ directory, located in the root configuration directory. In the docker container, it will be in /etc/trino/catalog. This is the directory that will contain the catalog configurations that Trino will use to connect to the different data sources. For our example, we’ll configure two catalogs, the mysql catalog, and the tpch catalog. The tpch catalog is a simple data generation catalog that simply needs the conector.name property to be configured and is located in /etc/trino/catalog/tpch.properties.&#xA;&#xA;tpch.properties&#xA;&#xA;connector.name=tpch&#xA;&#xA;The mysql catalog just needs the connector.name to specify which connector plugin to use, the connection-url property to point to the mysql instance, and the connection-user and connection-password properties for the mysql user.&#xA;&#xA;mysql.properties&#xA;&#xA;connector.name=mysql&#xA;connection-url=jdbc:mysql://mysql:3306&#xA;connection-user=root&#xA;connection-password=admin&#xA;&#xA;Note: the name of the configuration file becomes the name of the catalog in Trino. If you are familiar with MySQL, you are likely to know that MySQL supports a two-tiered containment hierarchy, though you may have never known it was called that. This containment hierarchy refers to databases and tables. The first tier of the hierarchy is the tables, while the second tier consists of databases. A database contains multiple tables and therefore two tables can have the same name provided they live under a different database.&#xA;&#xA;Image by Author&#xA;&#xA;Since Trino has to connect to multiple databases, it supports a three-tiered containment hierarchy. Rather than call the second tier, databases, Trino refers to this tier as schemas. 
So a database in MySQL is equivalent to a schema in Trino. The third tier allows Trino to distinguish between multiple underlying data sources which are made of catalogs. Since the file provided to Trino is called mysql.properties it automatically names the catalog mysql without the .properties file type. To query the customer table in MySQL under the tiny you specify the following table name mysql.tiny.customer.&#xA;&#xA;If you’ve reached this far, congratulations, you now know how to set up catalogs and query them through Trino! The benefits at this point should be clear, and making a proof of concept is easy to do this way. It’s time to put together that proof of concept for your team and your boss! What next though? How do you actually get this deployed in a reproducible and scalable manner? The next section covers a brief overview of faster ways to get Trino deployed at scale.&#xA;&#xA;!--emailsub--&#xA;&#xA;Deploying Trino at Scale with Kubernetes&#xA;&#xA;Up to this point, this post only describes the deployment process. What about after that once you’ve deployed Trino to production and you slowly onboard engineering, BI/Analytics, and your data science teams. As many Trino users have experienced, the demand on your Trino cluster grows quickly as it becomes the single point of access to all of your data. This is where these small proof-of-concept size installations start to fall apart and you will need something more pliable to scale as your system starts to take on heavier workloads.&#xA;&#xA;You will need to monitor your cluster and will likely need to stand up other services that run these monitoring tasks. This also applies to running other systems for security and authentication management. This list of complexity grows as you consider all of these systems need to scale and adapt around the growing Trino clusters. 
You may, for instance, consider deploying multiple clusters to handle different workloads, or possibly running tens or hundreds of Trino clusters to provide a self-service platform to provide isolated tenancy in your platform.&#xA;&#xA;The solution to express all of these complex scenarios as the configuration is already solved by using an orchestration platform like Kubernetes, and its package manager project, Helm. Kubernetes offers a powerful way to express all the complex adaptable infrastructures based on your use cases.&#xA;&#xA;In the interest of brevity, I will not include the full set of instructions on how to run a helm chart or cover the basics of running Trino on Kubernetes. Rather, I will refer you to an episode of Trino Community Broadcast that discusses Kubernetes, the community helm chart, and the basics of running Trino on Kubernetes. In the interest of transparency, the official Trino helm charts are still in an early phase of development. There is a very popular community-contributed helm chart that is adapted by many users to suit their needs and it is currently the best open source option for self-managed deployments of Trino. If you decide to take this route, proceed with caution and know that there is development to support the helm deployments moving forward.&#xA;&#xA;While this will provide all the tools to enable a well-suited engineering department to run and maintain their own Trino cluster, this begs the question, based on your engineering team size, should you and your company be investing costly data engineer hours into maintaining, scaling, and hacking required to keep a full-size production infrastructure afloat?&#xA;&#xA;Starburst Galaxy: The Easy Button method of deploying and maintaining Trino&#xA;&#xA;Full Disclosure: This blog post was originally written while I was working at Starburst. 
I still stand by Starburst Galaxy as one of the better options but I will add the caveat that it depends on your use case and things change so reach out if you need my latest thoughts on the matter. That said, Galaxy is the general purpose version of Trino the creators never got to build at Facebook. If you have custom features you need that you&#39;d like to contribute, a lot of folks run an open source cluster in testing and production is run by Starburst. You can then test and develop features to contribute to open source that will eventually upstream to Galaxy, Athena, or any other Trino variant.&#xA;&#xA;Image By: lostvegas, License: CC BY-NC-ND 2.0&#xA;&#xA;As mentioned, Trino has a relatively_ simple deployment setup, with an emphasis on relatively. This blog really only hits the tip of the iceberg when it comes to the complexity involved in managing and scaling Trino. While it is certainly possible to manage running Trino and even do so at scale with helm charts in Kubernetes, it is still a difficult setup for Trinewbies and difficult to maintain and scale for those who already have experience maintaining Trino. I experienced firsthand many of these difficulties myself when I began my Trino journey years ago and started on my own quest to help others overcome some of these challenges. This is what led me to cross paths with Starburst, the company behind the SaaS Trino platform Galaxy.&#xA;&#xA;Galaxy makes Trino accessible to companies having difficulties scaling and customizing Trino to their needs. Unless you are in a company that houses a massive data platform and you have dedicated data and DevOps engineers to each system in your platform, many of these options won’t be feasible for you in the long run.&#xA;&#xA;One thing to make clear is that a Galaxy cluster is really just a Trino cluster on demand. 
Outside of managing the scaling policies, to avoid any surprises on your cloud bill, you really don’t have to think about scaling Trino up or down, or suspending it when it is not in use. The beautiful thing about Trino and therefore Galaxy is that it is an ephemeral compute engine much like AWS Lambda that you can quickly spin up or down. Not only are you able to run ad-hoc and federated queries over disparate data sources, but now you can also run the infrastructure for those queries on-demand with almost no cost to your engineering team’s time.&#xA;&#xA;Getting Started With Galaxy&#xA;&#xA;Here’s a quick getting started guide with the Starburst Galaxy that mirrors the setup we realized with the Docker example above with Trino and MySQL.&#xA;&#xA;Set up a trial of Galaxy by filling in your information at the bottom of the Galaxy information page.&#xA;Once you receive a link, you will see this sign-up screen. Fill out the email address, enter the pin sent to the email, and choose the domain for your cluster.&#xA;The rest of the tutorial is provided in the video below provides a basic demo of what you’ll need to do to get started.&#xA;&#xA;This introduction may feel a bit underwhelming but extrapolate being able to run federated queries across your relational databases like MySQL, a data lake storing data in S3, or soon data in many NoSQL and real-time data stores. The true power of Starburst Galaxy is that now your team will no longer need to dedicate a giant backlog of tickets aimed at scaling up and down, monitoring, and securing Trino. Rather you can return to focus on the business problems and the best model for the data in your domain.&#xA;&#xA;trino&#xA;&#xA;!--emailsub--&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="learn-how-to-quickly-join-data-across-multiple-sources">Learn how to quickly join data across multiple sources</h2>

<p>If you haven’t heard of Trino before, it is a query engine that speaks the language of many genres of databases. As such, Trino is commonly used to provide fast ad-hoc queries across heterogeneous data sources. Trino’s initial use case was built around replacing the Hive runtime engine to allow for faster querying of Big Data warehouses and data lakes. This may be the first time you have heard of <a href="https://trino.io/">Trino</a>, but you’ve likely heard of the project from which it was <a href="https://venturebeat.com/2021/08/27/who-owns-open-source-projects-people-or-companies/">“forklifted”</a>, Presto. If you want to learn more about <a href="https://trino.io/blog/2020/12/27/announcing-trino.html">why the creators of Presto now work on Trino (formerly PrestoSQL)</a> you can read the renaming blog that they produced earlier this year. Before you commit too much to this blog, I’d like to let you know why you should even care about Trino.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe91ba54f-4f2c-4516-99fc-59c5c7cd8fd0_512x241.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe91ba54f-4f2c-4516-99fc-59c5c7cd8fd0_512x241.png" alt=""/></a></p>





<h3 id="so-what-is-trino-anyways">So what is Trino anyways?</h3>

<p>The first thing I like to make sure people know about when discussing Trino is that it is a SQL query engine, but not a SQL database. What does that mean? Traditional databases typically consist of a query engine and a storage engine. Trino is just a query engine and does not store data. Instead, Trino interacts with various databases that store their own data in their own formats. Trino parses and analyzes the SQL query you pass in, creates and optimizes a query execution plan that includes the data sources, and then schedules worker nodes that are able to intelligently query the underlying databases they connect to.</p>

<p>When I say intelligently, I’m specifically talking about pushdown queries. That’s right, the most intelligent thing for Trino to do is to avoid making more work for itself, and instead offload that work to the underlying database. This makes sense, as the underlying databases generally have special indexes and store data in formats optimized for read time. It would be silly of Trino to ignore all of that optimized reading capability and do a linear scan of all the data to run the query itself. The goal of most Trino optimizations is to push the query down to the database and get back only the smallest amount of data needed: to join with another dataset from another database, to do some further Trino-specific processing, or simply to return it as the correct result set for the query.</p>
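<p>To make pushdown a little more concrete, here is a sketch using the mysql catalog that appears later in this post. The <code>EXPLAIN</code> statement is a real Trino feature, but the exact plan output varies by version and connector, so treat this as illustrative rather than exact output.</p>

<pre><code>-- With predicate pushdown, Trino hands the WHERE clause to MySQL
-- instead of pulling every row over the network and filtering itself.
SELECT name, phone
FROM mysql.tiny.customer
WHERE nationkey = 8;

-- EXPLAIN shows the query plan; when the filter is pushed down, it
-- appears as part of the table scan rather than a separate Filter node.
EXPLAIN
SELECT name, phone
FROM mysql.tiny.customer
WHERE nationkey = 8;
</code></pre>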

<h4 id="query-all-the-things">Query all the things</h4>

<p>So I still have not really answered your question of why you should care about Trino. The short answer is, Trino acts as a single access point to query all the things. Yup. Oh, and it’s super fast at ad-hoc queries over various data sources including data lakes (e.g. Iceberg/Databricks) or data warehouses (e.g. Hive/Snowflake). It has a <a href="https://trino.io/docs/current/develop/connectors.html">connector architecture</a> that allows it to speak the language of <a href="https://trino.io/docs/current/connector.html">a whole bunch of databases</a>. If you have a special use case, you can write your own connector that abstracts any database or service away to just be another table in Trino’s domain. Pretty cool right? But that’s actually rarely needed because the most common databases already have a connector written for them. If not, <a href="https://github.com/trinodb/trino/issues/4500">more connectors are getting added by Trino’s open source community every few months</a>.</p>

<p>To make the benefits of running federated queries a bit more tangible, I will present an example. Trino gives users the ability to write <a href="https://trino.io/docs/current/language.html">standardized ANSI SQL</a> to query databases that have a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html">custom query DSL, like Elasticsearch</a>. With Trino it’s incredibly simple to set up an Elasticsearch catalog and start running SQL queries on it. If that doesn’t blow your mind, let me explain why that’s so powerful.</p>

<p>Imagine you have five different data stores, each with its own independent query language. Your data science or analyst team just wants access to these data stores. It would take a ridiculous amount of time for them to go to each data system individually, look up the different commands to pull data out of each one, dump the data into one location, and clean it up so that they can actually run meaningful queries. With Trino, all they need is SQL. And it doesn’t stop at accessing the data: your data science team can also join tables across different databases, say a search engine like Elasticsearch with an operational database like MySQL. Further, Trino even enables joining a data source with itself when the source doesn’t support joins natively, as with Elasticsearch and MongoDB. Did it happen yet? Is your mind blown?</p>
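<p>As a sketch of what that looks like in practice: assume an <code>elasticsearch</code> catalog with a hypothetical <code>orders</code> index, joined to the <code>mysql</code> catalog set up later in this post. Neither the catalog nor the index is part of the getting-started repository; the point is that a single ANSI SQL statement spans both systems.</p>

<pre><code>-- Hypothetical federated join: the elasticsearch catalog and its
-- "orders" index are assumptions for illustration only.
SELECT o.order_id, c.name, c.phone
FROM elasticsearch.default.orders AS o
JOIN mysql.tiny.customer AS c
  ON o.custkey = c.custkey
WHERE o.status = 'OPEN';
</code></pre>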

<h3 id="getting-started-with-trino">Getting Started with Trino</h3>

<p>So what is required to give Trino a test drive? Relative to many open-source database projects, Trino is one of the simpler projects to install, but that still doesn’t mean it is easy. An important element of a successful project is how it adapts to newer users and expands capability for growth and adoption. That makes it important to offer multiple avenues of entry into a product, with varying levels of difficulty, cost, customizability, interoperability, and scalability. As customizability, interoperability, and scalability increase, you will generally see an increase in difficulty or cost, and vice versa. Luckily, when you are starting out, you really just need to play with Trino.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9876726b-409f-4e0d-a768-967bba0abe9e_600x390.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9876726b-409f-4e0d-a768-967bba0abe9e_600x390.png" alt=""/></a></p>

<p>Image added by Author</p>

<p>The low-cost and low-difficulty way to try out Trino is to use <a href="https://www.docker.com/">Docker containers</a>. The nice thing about these containers is that you don’t really have to know anything about the installation process to play around with Trino. While many enjoy poking around documentation and working with Trino to get it set up, it may not be for everyone. I certainly have my days where I prefer a nice chill CLI sesh and other days where I just need to opt out. If you want to skip to the Easy Button way to deploy Trino (hint, it’s the SaaS deployment) then skip the next few sections.</p>



<h4 id="using-trino-with-docker">Using Trino With Docker</h4>

<p>Trino ships with <a href="https://hub.docker.com/r/trinodb/trino">a Docker image</a> that does a lot of the setup necessary for Trino to run. Beyond simply running a Docker container, there are a few things that need to happen for setup. First, in order to use a database like MySQL, we actually need to run a MySQL container as well, using the official mysql image. There is <a href="https://github.com/bitsondatadev/trino-getting-started">a trino-getting-started repository</a> that contains a lot of the setup needed for using Trino on your own computer or setting it up on a test server as a proof of concept. Clone this repository and follow the instructions in the README to install Docker if it is not already installed.</p>

<p>You can actually run a query before learning the specifics of how this compose file works. Before you run the query, you will need to run the mysql and trino-coordinator instances. To do this, navigate to the mysql/trino-mysql/ directory that contains the docker-compose.yml and run:</p>

<pre><code>docker-compose up -d
</code></pre>

<h4 id="running-your-first-query">Running your first query!</h4>

<p>Now that you have Trino running in Docker, you need to open a session to access it. The easiest way to do this is via a console. Run the following Docker command to connect to a terminal on the coordinator:</p>

<pre><code>docker container exec -it trino-mysql_trino-coordinator_1 trino
</code></pre>

<p>This will bring you to the Trino terminal.</p>

<pre><code>trino&gt;
</code></pre>

<p>Your first query will actually generate data from the tpch catalog and then query the data that was loaded into the mysql catalog. In the terminal, run the following two queries:</p>

<pre><code>CREATE TABLE mysql.tiny.customer
AS SELECT * FROM tpch.tiny.customer;
</code></pre>

<pre><code>SELECT custkey, name, nationkey, phone 
FROM mysql.tiny.customer LIMIT 5;
</code></pre>

<p>The output should look like this.</p>

<pre><code>|custkey|name              |nationkey|phone          |
|-------|------------------|---------|---------------|
|751    |Customer#000000751|0        |10-658-550-2257|
|752    |Customer#000000752|8        |18-924-993-6038|
|753    |Customer#000000753|17       |27-817-126-3646|
|754    |Customer#000000754|0        |10-646-595-5871|
|755    |Customer#000000755|16       |26-395-247-2207|
</code></pre>

<p>Congrats! You just ran your first query on Trino. Did you feel the rush!? Okay well, technically we just copied data from a data generation connector and moved it into a MySQL database and queried that back out. It’s fine if this simple exercise didn’t send goosebumps flying down your spine but hopefully, you can extrapolate the possibilities when connecting to other datasets.</p>

<p>A good initial exercise is to study the compose file and directories before jumping into the Trino installation documentation. Let’s see how this was possible by breaking down the docker-compose file that you just ran.</p>

<pre><code>version: &#39;3.7&#39;
services:
  trino-coordinator:
    image: &#39;trinodb/trino:latest&#39;
    hostname: trino-coordinator
    ports:
      - &#39;8080:8080&#39;
    volumes:
      - ./etc:/etc/trino
    networks:
      - trino-network

  mysql:
    image: mysql:latest
    hostname: mysql
    environment:
      MYSQL_ROOT_PASSWORD: admin
      MYSQL_USER: admin
      MYSQL_PASSWORD: admin
      MYSQL_DATABASE: tiny
    ports:
      - &#39;3306:3306&#39;
    networks:
      - trino-network
networks:
  trino-network:
    driver: bridge
</code></pre>

<p>Notice that the hostname of mysql matches the instance name, and the mysql instance is on the trino-network that the trino-coordinator instance will also join. Also notice that the mysql image exposes port 3306 on the network.</p>

<p>Finally, we use the trinodb/trino image for the trino-coordinator instance, and use the volumes option to map our local custom configurations for Trino to the /etc/trino directory discussed further down in the <em>Trino Configuration</em> section. Trino is also added to the trino-network and exposes port 8080, which is how external clients access Trino. The snippet above is an example of the docker-compose.yml file; the full configurations can be found in this <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/mysql/trino-mysql">getting started with Trino repository</a>.</p>

<p>These instructions are a basic overview of <a href="https://trino.io/docs/current/installation/deployment.html">the more complete installation instructions</a> if you’re really going for it! If you’re not that interested in the installation, feel free to skip ahead to the Deploying Trino at Scale with Kubernetes section. If you’d rather not deal with Kubernetes I offer you another pass to the easy button section of this blog.</p>

<h4 id="trino-requirements">Trino requirements</h4>

<p>The first requirement is that Trino must be run on a POSIX-compliant system such as Linux or Unix. Some folks in the community have gotten Trino to run on Windows for testing using runtime environments like Cygwin, but this is not officially supported. However, in our world of containerization this is less of an issue, and you will at least be able to test this on <a href="https://www.docker.com/">Docker</a> no matter which operating system you use.</p>

<p>Trino is written in Java and so requires the Java Runtime Environment (JRE). Trino requires a 64-bit version of Java 11, with a minimum required version of 11.0.7. Newer patch versions such as 11.0.8 or 11.0.9 are recommended. The launch scripts for Trino, bin/launcher, also require Python version 2.6.x, 2.7.x, or 3.x.</p>

<h4 id="trino-configuration">Trino Configuration</h4>

<p>To configure Trino, you first need to know the Trino configuration directory. If you were installing Trino by hand, the default would be an etc/ directory relative to the installation directory. For our example, I’m going to use the default installation directory of the <a href="https://hub.docker.com/r/trinodb/trino">Trino Docker image</a>, which is <a href="https://github.com/trinodb/trino/blob/356/core/docker/bin/run-trino#L15">set in the run-trino script</a> as /etc/trino. We need to create four files underneath this base directory. I will describe what each of these files does, and you can see examples in the docker image I have created below.</p>
<ol><li><p>config.properties — This is the primary configuration for each node in the Trino cluster. There are plenty of options that can be set here, but you’ll typically want to use the default settings when testing. The required configurations include indicating whether the node is the coordinator, setting the HTTP port that Trino communicates on, and the discovery URI so that Trino servers can find each other.</p></li>

<li><p>jvm.config — This configuration contains the command line arguments you will pass down to the java process that runs Trino.</p></li>

<li><p>log.properties — This configuration is helpful to indicate the log levels of various java classes in Trino. It can be left empty to use the default log level for all classes.</p></li>

<li><p>node.properties — This configuration is used to uniquely identify nodes in the cluster and specify locations of directories in the node.</p></li></ol>
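<p>To make this concrete, here is a minimal sketch of what three of these files might contain for a single-node test cluster, where the node acts as both coordinator and worker (the property names come from the Trino deployment docs; the memory and path values are placeholder assumptions you should tune):</p>

<p>config.properties</p>

<pre><code>coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
</code></pre>

<p>jvm.config</p>

<pre><code>-server
-Xmx4G
-XX:+UseG1GC
-XX:+ExitOnOutOfMemoryError
</code></pre>

<p>node.properties</p>

<pre><code>node.environment=test
node.id=coordinator-1
node.data-dir=/var/trino/data
</code></pre>

<p>log.properties can simply be left empty, as noted above, to keep the default log level for all classes.</p>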

<p>The next directory you need to know about is the catalog/ directory, located in the root configuration directory. In the Docker container, it will be in /etc/trino/catalog. This is the directory that contains the catalog configurations that Trino uses to connect to the different data sources. For our example, we’ll configure two catalogs, the mysql catalog and the tpch catalog. The tpch catalog is a simple data generation catalog that only needs the connector.name property to be configured, and it is located in /etc/trino/catalog/tpch.properties.</p>

<p>tpch.properties</p>

<pre><code>connector.name=tpch
</code></pre>

<p>The mysql catalog just needs the connector.name to specify which connector plugin to use, the connection-url property to point to the mysql instance, and the connection-user and connection-password properties for the mysql user.</p>

<p>mysql.properties</p>

<pre><code>connector.name=mysql
connection-url=jdbc:mysql://mysql:3306
connection-user=root
connection-password=admin
</code></pre>

<p>Note: the name of the configuration file becomes the name of the catalog in Trino. If you are familiar with MySQL, you likely know that MySQL supports a two-tiered containment hierarchy, though you may never have known it was called that. This containment hierarchy refers to databases and tables. The first tier of the hierarchy consists of <em>tables</em>, while the second tier consists of <em>databases</em>. A database contains multiple tables, and therefore two tables can have the same name provided they live under different databases.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f2b877-50b9-4d11-b064-f5ae0b8323db_800x450.png"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f2b877-50b9-4d11-b064-f5ae0b8323db_800x450.png" alt=""/></a></p>

<p>Image by Author</p>

<p>Since Trino has to connect to multiple databases, it supports a three-tiered containment hierarchy. Rather than calling the second tier databases, Trino refers to this tier as <em>schemas</em>. So a database in MySQL is equivalent to a schema in Trino. The third tier, made up of <em>catalogs</em>, allows Trino to distinguish between multiple underlying data sources. Since the file provided to Trino is called mysql.properties, Trino automatically names the catalog mysql, dropping the .properties file extension. To query the customer table in MySQL under the tiny schema, you specify the following table name: mysql.tiny.customer.</p>
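<p>Putting the three tiers together, here are a few illustrative queries you could run once the catalogs above are configured (the mysql results naturally depend on which databases and tables your MySQL instance actually contains):</p>

<pre><code>SHOW CATALOGS;                      -- lists mysql, tpch, system, ...
SHOW SCHEMAS FROM mysql;            -- MySQL databases show up as schemas
SELECT * FROM mysql.tiny.customer;  -- catalog.schema.table
</code></pre>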

<p>If you’ve made it this far, congratulations, you now know how to set up catalogs and query them through Trino! The benefits at this point should be clear, and making a proof of concept is easy to do this way. It’s time to put together that proof of concept for your team and your boss! What next, though? How do you actually get this deployed in a reproducible and scalable manner? The next section gives a brief overview of faster ways to get Trino deployed at scale.</p>



<h3 id="deploying-trino-at-scale-with-kubernetes">Deploying Trino at Scale with Kubernetes</h3>

<p>Up to this point, this post has only described the deployment process. What about after that, once you’ve deployed Trino to production and slowly onboard engineering, BI/analytics, and data science teams? As many Trino users have experienced, the demand on your Trino cluster grows quickly as it becomes the single point of access to all of your data. This is where these small proof-of-concept installations start to fall apart, and you will need something more pliable that can scale as your system starts to take on heavier workloads.</p>

<p>You will need to monitor your cluster and will likely need to stand up other services that run these monitoring tasks. This also applies to running other systems for security and authentication management. The complexity grows as you consider that all of these systems need to scale and adapt around the growing Trino clusters. You may, for instance, consider deploying <a href="https://shopify.engineering/faster-trino-query-execution-infrastructure">multiple clusters to handle different workloads</a>, or possibly running tens or hundreds of Trino clusters to provide a self-service platform with isolated tenancy.</p>

<p>Expressing all of these complex scenarios as configuration is a problem already solved by an orchestration platform like Kubernetes and its package manager project, Helm. Kubernetes offers a powerful way to express the complex, adaptable infrastructure your use cases demand.</p>

<p>In the interest of brevity, I will not include the full set of instructions on how to run a helm chart or cover the basics of running Trino on Kubernetes. Rather, I will refer you to <a href="https://trino.io/episodes/24.html">an episode of Trino Community Broadcast</a> that discusses Kubernetes, the community helm chart, and the basics of running Trino on Kubernetes. In the interest of transparency, <a href="https://github.com/trinodb/charts">the official Trino helm charts</a> are still in an early phase of development. There is a very popular <a href="https://github.com/valeriano-manassero/helm-charts/tree/main/valeriano-manassero/trino">community-contributed helm chart</a> that many users have adapted to suit their needs, and it is currently the best open source option for self-managed deployments of Trino. If you decide to take this route, proceed with caution and know that there is <a href="https://github.com/trinodb/charts/pull/11">development to support the helm deployments</a> moving forward.</p>
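<p>For orientation, installing Trino from the official chart repository looks roughly like the following (treat the exact chart values, such as server.workers, as a sketch, since the charts are still evolving):</p>

<pre><code># register the official Trino chart repository and install a small cluster
helm repo add trino https://trinodb.github.io/charts
helm install my-trino trino/trino --set server.workers=3
# watch the coordinator and worker pods come up
kubectl get pods
</code></pre>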

<p>While this provides all the tools for a well-suited engineering department to run and maintain its own Trino cluster, it begs the question: given your engineering team’s size, should you and your company be investing costly data engineering hours in the maintaining, scaling, and hacking required to keep a full-size production infrastructure afloat?</p>

<h3 id="starburst-galaxy-the-easy-button-method-of-deploying-and-maintaining-trino">Starburst Galaxy: The Easy Button method of deploying and maintaining Trino</h3>

<p><em>Full Disclosure:</em> This blog post was originally written while I was working at Starburst. I still stand by Starburst Galaxy as one of the better options, but I will add the caveat that it depends on your use case, and things change, so reach out if you need my latest thoughts on the matter. That said, Galaxy is the general-purpose version of Trino that its creators never got to build at Facebook. If there are custom features you need and would like to contribute, a common pattern is to run an open source cluster in testing while production runs on Starburst. You can then test and develop features to contribute to open source that will eventually land in Galaxy, Athena, or any other Trino variant.</p>

<p><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf49db7c-b00d-4291-b010-3835451379d6_800x572.jpeg"><img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf49db7c-b00d-4291-b010-3835451379d6_800x572.jpeg" alt=""/></a></p>

<p>Image By: lostvegas, License: CC BY-NC-ND 2.0</p>

<p>As mentioned, Trino has a <em>relatively</em> simple deployment setup, with an emphasis on relatively. This blog really only hits the tip of the iceberg when it comes to the complexity involved in managing and scaling Trino. While it is certainly possible to run Trino yourself, and even do so at scale with helm charts in Kubernetes, it is still a difficult setup for Trinewbies, and difficult to maintain and scale even for those with experience running Trino. I faced many of these difficulties firsthand when I began my Trino journey years ago, and they started me on my own quest to help others overcome them. This is what led me to cross paths with Starburst, the company behind the SaaS Trino platform Galaxy.</p>

<p>Galaxy makes Trino accessible to companies that have difficulty scaling and customizing Trino to their needs. Unless your company houses a massive data platform and has dedicated data and DevOps engineers for each system in that platform, many of the self-managed options above won’t be feasible for you in the long run.</p>

<p>One thing to make clear is that a Galaxy cluster is really just a Trino cluster on demand. Outside of managing the scaling policies to avoid any surprises on your cloud bill, you really don’t have to think about scaling Trino up or down, or suspending it when it is not in use. The beautiful thing about Trino, and therefore Galaxy, is that it is an ephemeral compute engine, much like AWS Lambda, that you can quickly spin up or down. Not only are you able to run ad-hoc and federated queries over disparate data sources, but now you can also run the infrastructure for those queries on demand with almost no cost to your engineering team’s time.</p>

<h4 id="getting-started-with-galaxy">Getting Started With Galaxy</h4>

<p>Here’s a quick getting started guide for Starburst Galaxy that mirrors the setup we created in the Docker example above with Trino and MySQL.</p>
<ul><li>Set up a trial of Galaxy by filling in your information at the bottom of the <a href="http://starburst.io/galaxy">Galaxy information page</a>.</li>
<li>Once you receive a link, you will see the sign-up screen. Fill out the email address, enter the PIN sent to that email, and choose the domain for your cluster.</li>
<li>The rest of the tutorial is provided in the video below, which gives a basic demo of what you’ll need to do to get started.</li></ul>

<p>This introduction may feel a bit underwhelming, but extrapolate from it: being able to run federated queries across your relational databases like MySQL, a data lake storing data in S3, and soon many NoSQL and real-time data stores. The true power of Starburst Galaxy is that your team will no longer need to dedicate a giant backlog of tickets to scaling, monitoring, and securing Trino. Rather, you can return your focus to the business problems and the best model for the data in your domain.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/intro-to-trino-for-the-trinewbie</guid>
      <pubDate>Fri, 17 Dec 2021 18:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice IV: Deep dive into Iceberg internals</title>
      <link>https://bitsondata.dev/trino-iceberg-iv-deep-dive?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;So far, this series has covered some very interesting user level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting some files that result from various operations carried out using Trino. To dissect you must use some surgical instrumentation, namely Trino, Avro tools, the MinIO client tool and Iceberg’s core library. It’s useful to dissect how these files work, not only to help understand how Iceberg works, but also to aid in troubleshooting issues, should you have any issues during ingestion or querying of your Iceberg table. I like to think of this type of debugging much like a fun game of operation, and you’re looking to see what causes the red errors to fly by on your screen.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;Understanding Iceberg metadata&#xA;&#xA;Iceberg can use any compatible metastore, but for Trino, it only supports the  Hive metastore and AWS Glue similar to the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that currently use data lakes already use the Hive connector and therefore the Hive metastore. This makes it convenient to have as the leading supported use case as existing users can easily migrate between Hive to Iceberg tables. 
Since there is no indication of which connector is actually executed in the diagram of the Hive connector architecture, it serves as a diagram that can be used for both Hive and Iceberg. The only difference is the connector used, but if you create a table in Hive, you can  view the same table in Iceberg.&#xA;&#xA;To recap the steps taken from the first three blogs; the first blog created an events table, while the first two blogs ran two insert statements. The first insert contained three records, while the second insert contained a single record.&#xA;&#xA;Up until this point, the state of the files in MinIO haven’t really been shown except some of the manifest list pointers from the snapshot in the third blog post. Using the MinIO client tool, you can list files that Iceberg generated through all these operations and then try to understand what purpose they are serving.&#xA;&#xA;% mc tree -f local/&#xA;local/&#xA;└─ iceberg&#xA;   └─ logging.db&#xA;      └─ events&#xA;         ├─ data&#xA;         │  ├─ eventtimeday=2021-04-01&#xA;         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#xA;         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc&#xA;         │  └─ eventtimeday=2021-04-02&#xA;         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#xA;         └─ metadata&#xA;            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#xA;            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#xA;            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#xA;            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#xA;            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#xA;            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#xA;            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#xA;&#xA;There are a lot of files here, but here are a couple of patterns that 
you can observe with these files.&#xA;&#xA;First, the top two directories are named data and metadata.&#xA;&#xA;/bucket/database/table/data//bucket/database/table/metadata/&#xA;&#xA;As you might expect, data contains the actual ORC files split by partition. This is akin to what you would see in a Hive table data directory. What is really of interest here is the metadata directory. There are specifically three patterns of files you’ll find here.&#xA;&#xA;/bucket/database/table/metadata/file-id.avro&#xA;&#xA;/bucket/database/table/metadata/snap-snapshot-id-version-file-id.avro&#xA;&#xA;/bucket/database/table/metadata/version-commit-UUID.metadata.json&#xA;&#xA;Iceberg has a persistent tree structure that manages various snapshots of the data that are created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.&#xA;&#xA;This tree structure contains two types of Avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files and the manifest files themselves point to various data files. This post starts out by covering these manifest files, and later covers the table metadata files that are suffixed by .metadata.json.&#xA;&#xA;The last blog covered the command in Trino that shows the snapshot information that is stored in the metastore. 
Here is that command and its output again for your review.&#xA;&#xA;SELECT manifestlist &#xA;FROM iceberg.logging.&#34;events$snapshots&#34;;&#xA;&#xA;Result:&#xA;&#xA;snapshots&#xA;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#xA;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#xA;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#xA;&#xA;You’ll notice that the manifest list query returns the Avro files prefixed with&#xA;snap-. These files are directly correlated with the snapshot record stored in the metastore. According to the diagram above, snapshots are records in the metastore that contain the URL of the manifest list Avro file. Avro files are binary files and not something you can just open up in a text editor to read. Using the avro-tools.jar tool distributed by the Apache Avro project, you can actually inspect the contents of this file to get a better understanding of how it is used by Iceberg.&#xA;&#xA;The first snapshot is generated on the creation of the events table. Upon inspecting this file, you notice that the file is empty. The output is an empty line that the jq JSON command line utility removes on pretty printing the JSON that is returned, which is just a newline. This snapshot represents the empty state of the table upon creation. To investigate the snapshots you need to download the files to your local filesystem. 
Let&#39;s move them to the home  directory:&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .&#xA;&#xA;Result: (is empty)&#xA;&#xA;The second snapshot is a little more interesting and actually shows us the contents of a manifest list.&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .&#xA;&#xA;Result:&#xA;&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,&#xA;   &#34;manifestlength&#34;:6114,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;addeddatafilescount&#34;:{&#xA;      &#34;int&#34;:2&#xA;   },&#xA;   &#34;existingdatafilescount&#34;:{&#xA;      &#34;int&#34;:0&#xA;   },&#xA;   &#34;deleteddatafilescount&#34;:{&#xA;      &#34;int&#34;:0&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:3&#xA;   },&#xA;   &#34;existingrowscount&#34;:{&#xA;      &#34;long&#34;:0&#xA;   },&#xA;   &#34;deletedrowscount&#34;:{&#xA;      &#34;long&#34;:0&#xA;   }&#xA;}&#xA;&#xA;To understand each of the values in each of these rows, you can refer to the  Iceberg &#xA;specification in the manifest list file section. Instead of covering these exhaustively, let&#39;s focus on a few key fields. 
Below are the fields, and their definition according to the specification.&#xA;&#xA;manifestpath - Location of the manifest file.&#xA;partitionspecid - ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.&#xA;addedsnapshotid - ID of the snapshot where the manifest file was added.&#xA;partitions - A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.&#xA;addedrowscount - Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.&#xA;&#xA;As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tells any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse this tree, you can look over the different manifest paths to find all the manifest files associated with the particular snapshot you want to traverse. Partition spec ids are helpful to know the current partition specification which are stored in the table metadata in the metastore. This references where to find the spec in the metastore. Added snapshot ids tells you which snapshot is associated with the manifest list. Partitions hold some high level partition bound information to make for faster querying. If a query is looking for a particular value, it only traverses the manifest files where the query values fall within the range of the file values. Finally, you get a few metrics like the number of changed rows and data files, one of which is the count of added rows. The first operation consisted of three rows inserts and the second operation was the insertion of one row. 
Using the row counts you can easily determine which manifest file belongs to which operation.&#xA;&#xA;The following command shows the final snapshot after both operations executed and filters out only the fields pointed out above.&#xA;&#xA;% java -jar  ~/Desktop/avrofiles/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &#39;. | {manifestpath: .manifestpath, partitionspecid: .partitionspecid, addedsnapshotid: .addedsnapshotid, partitions: .partitions, addedrowscount: .addedrowscount }&#39;&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#34;,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:4564366177504223700&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:1&#xA;   }&#xA;}&#xA;{&#xA;   &#34;manifestpath&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,&#xA;   &#34;partitionspecid&#34;:0,&#xA;   &#34;addedsnapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;partitions&#34;:{&#xA;      &#34;array&#34;:[&#xA;         {&#xA;            &#34;containsnull&#34;:false,&#xA;            &#34;lowerbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;&#xA;            },&#xA;            &#34;upperbound&#34;:{&#xA;               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   
&#34;addedrowscount&#34;:{&#xA;      &#34;long&#34;:3&#xA;   }&#xA;}&#xA;&#xA;In the listing of the manifest file related to the last snapshot, you notice the first operation where three rows were inserted is contained in the manifest file in the second JSON object. You can determine this from the snapshot id, as well as, the number of rows that were added in the operation. The first JSON object contains the last operation that inserted a single row. So the most recent operations are listed in reverse commit order.&#xA;&#xA;The next command does the same listing of the file that you ran with the manifest list, except you run this on the manifest files themselves to expose their contents and discuss them. To begin with, you run the command to show the contents of the manifest file associated with the insertion of three rows.&#xA;&#xA;% java -jar  ~/avro-tools-1.10.0.jar tojson ~/Desktop/avrofiles/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;status&#34;:1,&#xA;   &#34;snapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;datafile&#34;:{&#xA;      &#34;filepath&#34;:&#34;s3a://iceberg/logging.db/events/data/eventtimeday=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#34;,&#xA;      &#34;fileformat&#34;:&#34;ORC&#34;,&#xA;      &#34;partition&#34;:{&#xA;         &#34;eventtimeday&#34;:{&#xA;            &#34;int&#34;:18718&#xA;         }&#xA;      },&#xA;      &#34;recordcount&#34;:1,&#xA;      &#34;filesizeinbytes&#34;:870,&#xA;      &#34;blocksizeinbytes&#34;:67108864,&#xA;      &#34;columnsizes&#34;:null,&#xA;      &#34;valuecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:1&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:1&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:1&#xA;       
     },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:1&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nullvaluecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:0&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nanvaluecounts&#34;:null,&#xA;      &#34;lowerbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;upperbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;keymetadata&#34;:null,&#xA;      &#34;splitoffsets&#34;:null&#xA;   }&#xA;}&#xA;{&#xA;   &#34;status&#34;:1,&#xA;   &#34;snapshotid&#34;:{&#xA;      &#34;long&#34;:2720489016575682000&#xA;   },&#xA;   &#34;datafile&#34;:{&#xA;      &#34;filepath&#34;:&#34;s3a://iceberg/logging.db/events/data/eventtimeday=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#34;,&#xA;      &#34;fileformat&#34;:&#34;ORC&#34;,&#xA;      &#34;partition&#34;:{&#xA;         &#34;eventtimeday&#34;:{&#xA;            &#34;int&#34;:18719&#xA;         }&#xA;      },&#xA;      
&#34;recordcount&#34;:2,&#xA;      &#34;filesizeinbytes&#34;:1084,&#xA;      &#34;blocksizeinbytes&#34;:67108864,&#xA;      &#34;columnsizes&#34;:null,&#xA;      &#34;valuecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:2&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:2&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nullvaluecounts&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:2,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:0&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:4,&#xA;               &#34;value&#34;:0&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;nanvaluecounts&#34;:null,&#xA;      &#34;lowerbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;ERROR&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Double oh noes&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      &#34;upperbounds&#34;:{&#xA;         &#34;array&#34;:[&#xA;            {&#xA;               &#34;key&#34;:1,&#xA;               &#34;value&#34;:&#34;WARN&#34;&#xA;            },&#xA;            {&#xA;               &#34;key&#34;:3,&#xA;               &#34;value&#34;:&#34;Maybeh oh noes?&#34;&#xA;            }&#xA;         ]&#xA;      },&#xA;      
&#34;keymetadata&#34;:null,&#xA;      &#34;splitoffsets&#34;:null&#xA;   }&#xA;}&#xA;&#xA;Now this is a very big output, but in summary, there’s really not too much to these files. As before, there is a Manifest section in the Iceberg spec that details what each of these fields means. Here are the important fields:&#xA;&#xA;snapshotid - Snapshot id where the file was added, or deleted if status is two. Inherited when null.&#xA;datafile - Field containing metadata about the data files pertaining to the manifest file, such as file path, partition tuple, metrics, etc…&#xA;datafile.filepath - Full URI for the file with FS scheme.&#xA;datafile.partition - Partition data tuple, schema based on the partition spec.&#xA;datafile.recordcount - Number of records in the data file.&#xA;datafile.count - Multiple fields that contain a map from column id to  number of values, null, nan counts in the file. These can be used to quickly  filter out unnecessary get operations.&#xA;datafile.bounds - Multiple fields that contain a map from column id to lower or upper bound in the column serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.&#xA;&#xA;Each data file struct contains a partition and data file that it maps to. These files only be scanned and returned if the criteria for the query is met when  checking all of the count, bounds, and other statistics that are recorded in the file. Ideally only files that contain data relevant to the query should be scanned at all. Having information like the record count may also help in the query planning process to determine splits and other information. This particular optimization hasn’t been completed yet as planning typically happens before traversal of the files. It is still in ongoing discussion and is discussed a bit by Iceberg creator Ryan Blue in a recent meetup. 
If this is something you are interested in, keep posted on the Slack channel and releases as the Trino Iceberg connector progresses in this area.&#xA;&#xA;As mentioned above, the last set of files that you find in the metadata directory which are suffixed with .metadata.json. These files at baseline are a bit strange as they aren’t stored in the Avro format, but instead the JSON format. This is because they are not part of the persistent tree structure. These files are essentially a copy of the table metadata that is stored in the metastore. You can find the fields for the table metadata listed in the Iceberg specification. These tables are typically stored persistently in a metasture much like the Hive metastore but could easily be replaced by any datastore that can support an atomic swap (check-and-put) operation required for Iceberg to support the optimistic concurrency operation.&#xA;&#xA;The naming of the table metadata includes a table version and UUID: &#xA;table-version-UUID.metadata.json. To commit a new metadata version, which just adds 1 to the current version number, the writer performs these steps:&#xA;&#xA;It creates a new table metadata file using the current metadata.&#xA;It writes the new table metadata to a file following the naming with the next version number.&#xA;It requests the metastore swap the table’s metadata pointer from the old location to the new location.&#xA;&#xA;    If the swap succeeds, the commit succeeded. The new file is now the &#xA;    current metadata.&#xA;    If the swap fails, another writer has already created their own. The&#xA;    current writer goes back to step 1.&#xA;&#xA;If you want to see where this is stored in the Hive metastore, you can reference the TABLEPARAMS table. 
At the time of writing, this is the only method of using the metastore that is supported by the Trino Iceberg connector.&#xA;&#xA;SELECT PARAMKEY, PARAMVALUE FROM metastore.TABLEPARAMS;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthPARAMKEY/ththPARAMVALUE/th/tr&#xA;trtdEXTERNAL/tdtdTRUE/td/tr&#xA;trtdmetadatalocation/tdtds3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json/td/tr&#xA;trtdnumFiles/tdtd2/td/tr&#xA;trtdpreviousmetadatalocation/tdtds3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json/td/tr&#xA;trtdtabletype/tdtdiceberg/td/tr&#xA;trtdtotalSize/tdtd5323/td/tr&#xA;trtdtransientlastDdlTime/tdtd1622865672/td/tr&#xA;/table&#xA;&#xA;So as you can see, the metastore is saying the current metadata location is the&#xA;00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json file. Now you can dive in to see the table metadata that is being used by the Iceberg connector.&#xA;&#xA;% cat ~/Desktop/avrofiles/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json&#xA;&#xA;Result: &#xA;&#xA;{&#xA;   &#34;format-version&#34;:1,&#xA;   &#34;table-uuid&#34;:&#34;32e3c271-84a9-4be5-9342-2148c878227a&#34;,&#xA;   &#34;location&#34;:&#34;s3a://iceberg/logging.db/events&#34;,&#xA;   &#34;last-updated-ms&#34;:1622865686323,&#xA;   &#34;last-column-id&#34;:5,&#xA;   &#34;schema&#34;:{&#xA;      &#34;type&#34;:&#34;struct&#34;,&#xA;      &#34;fields&#34;:[&#xA;         {&#xA;            &#34;id&#34;:1,&#xA;            &#34;name&#34;:&#34;level&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;string&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:2,&#xA;            &#34;name&#34;:&#34;eventtime&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;timestamp&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:3,&#xA;            &#34;name&#34;:&#34;message&#34;,&#xA;            
&#34;required&#34;:false,&#xA;            &#34;type&#34;:&#34;string&#34;&#xA;         },&#xA;         {&#xA;            &#34;id&#34;:4,&#xA;            &#34;name&#34;:&#34;callstack&#34;,&#xA;            &#34;required&#34;:false,&#xA;            &#34;type&#34;:{&#xA;               &#34;type&#34;:&#34;list&#34;,&#xA;               &#34;element-id&#34;:5,&#xA;               &#34;element&#34;:&#34;string&#34;,&#xA;               &#34;element-required&#34;:false&#xA;            }&#xA;         }&#xA;      ]&#xA;   },&#xA;   &#34;partition-spec&#34;:[&#xA;      {&#xA;         &#34;name&#34;:&#34;eventtimeday&#34;,&#xA;         &#34;transform&#34;:&#34;day&#34;,&#xA;         &#34;source-id&#34;:2,&#xA;         &#34;field-id&#34;:1000&#xA;      }&#xA;   ],&#xA;   &#34;default-spec-id&#34;:0,&#xA;   &#34;partition-specs&#34;:[&#xA;      {&#xA;         &#34;spec-id&#34;:0,&#xA;         &#34;fields&#34;:[&#xA;            {&#xA;               &#34;name&#34;:&#34;eventtime_day&#34;,&#xA;               &#34;transform&#34;:&#34;day&#34;,&#xA;               &#34;source-id&#34;:2,&#xA;               &#34;field-id&#34;:1000&#xA;            }&#xA;         ]&#xA;      }&#xA;   ],&#xA;   &#34;default-sort-order-id&#34;:0,&#xA;   &#34;sort-orders&#34;:[&#xA;      {&#xA;         &#34;order-id&#34;:0,&#xA;         &#34;fields&#34;:[&#xA;            &#xA;         ]&#xA;      }&#xA;   ],&#xA;   &#34;properties&#34;:{&#xA;      &#34;write.format.default&#34;:&#34;ORC&#34;&#xA;   },&#xA;   &#34;current-snapshot-id&#34;:4564366177504223943,&#xA;   &#34;snapshots&#34;:[&#xA;      {&#xA;         &#34;snapshot-id&#34;:6967685587675910019,&#xA;         &#34;timestamp-ms&#34;:1622865672882,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;0&#34;,&#xA;            &#34;total-records&#34;:&#34;0&#34;,&#xA;            &#34;total-data-files&#34;:&#34;0&#34;,&#xA;            
&#34;total-delete-files&#34;:&#34;0&#34;,&#xA;            &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#34;&#xA;      },&#xA;      {&#xA;         &#34;snapshot-id&#34;:2720489016575682283,&#xA;         &#34;parent-snapshot-id&#34;:6967685587675910019,&#xA;         &#34;timestamp-ms&#34;:1622865680419,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;added-data-files&#34;:&#34;2&#34;,&#xA;            &#34;added-records&#34;:&#34;3&#34;,&#xA;            &#34;added-files-size&#34;:&#34;1954&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;2&#34;,&#xA;            &#34;total-records&#34;:&#34;3&#34;,&#xA;            &#34;total-data-files&#34;:&#34;2&#34;,&#xA;            &#34;total-delete-files&#34;:&#34;0&#34;,&#xA;            &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#34;&#xA;      },&#xA;      {&#xA;         &#34;snapshot-id&#34;:4564366177504223943,&#xA;         &#34;parent-snapshot-id&#34;:2720489016575682283,&#xA;         &#34;timestamp-ms&#34;:1622865686278,&#xA;         &#34;summary&#34;:{&#xA;            &#34;operation&#34;:&#34;append&#34;,&#xA;            &#34;added-data-files&#34;:&#34;1&#34;,&#xA;            &#34;added-records&#34;:&#34;1&#34;,&#xA;            &#34;added-files-size&#34;:&#34;746&#34;,&#xA;            &#34;changed-partition-count&#34;:&#34;1&#34;,&#xA;            &#34;total-records&#34;:&#34;4&#34;,&#xA;            &#34;total-data-files&#34;:&#34;3&#34;,&#xA;            &#34;total-delete-files&#34;:&#34;0&#34;,&#xA;   
         &#34;total-position-deletes&#34;:&#34;0&#34;,&#xA;            &#34;total-equality-deletes&#34;:&#34;0&#34;&#xA;         },&#xA;         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#34;&#xA;      }&#xA;   ],&#xA;   &#34;snapshot-log&#34;:[&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865672882,&#xA;         &#34;snapshot-id&#34;:6967685587675910019&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865680419,&#xA;         &#34;snapshot-id&#34;:2720489016575682283&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865686278,&#xA;         &#34;snapshot-id&#34;:4564366177504223943&#xA;      }&#xA;   ],&#xA;   &#34;metadata-log&#34;:[&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865672894,&#xA;         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#34;&#xA;      },&#xA;      {&#xA;         &#34;timestamp-ms&#34;:1622865680524,&#xA;         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#34;&#xA;      }&#xA;   ]&#xA;}&#xA;&#xA;As you can see, these JSON files can quickly grow as you perform different updates on your table. This file contains a pointer to all of the snapshots and manifest list files, much like the output you found from looking at the snapshots in the table. A really important piece to note is the schema is stored here. This is what Trino uses for validation on inserts and reads. As you may expect, there is the root location of the table itself, as well as a unique table identifier. The final part I’d like to note about this file is the partition-spec and partition-specs fields. The partition-spec field holds the current partition spec, while the partition-specs is an array that can hold a list of all partition specs that have existed for this table. 
As pointed out earlier, you can have many different manifest files that use different partition specs. That wraps up all of the metadata file types you can expect to see in Iceberg!&#xA;&#xA;This post wraps up the Trino on ice series. Hopefully these blog posts serve as a helpful initial dialogue about what is expected to grow as a vital portion of an open data lakehouse stack. What are you waiting for? Come join the fun and help us implement some of the missing features or instead go ahead and try Trino on Ice(berg) yourself!&#xA;&#xA;#trino #iceberg]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>So far, this series has covered some very interesting user-level concepts of the Iceberg model, and how you can take advantage of them using the Trino query engine. This blog post dives into some implementation details of Iceberg by dissecting the files that result from various operations carried out using Trino. To dissect them you need some surgical instruments, namely Trino, Avro tools, the MinIO client tool, and Iceberg’s core library. Dissecting these files is useful not only for understanding how Iceberg works, but also for troubleshooting, should you hit any issues during ingestion or querying of your Iceberg table. I like to think of this type of debugging much like a fun game of Operation, where you’re looking to see what causes the red errors to fly by on your screen.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p><img src="https://trino.io/assets/blog/trino-on-ice/operation.gif" alt=""/></p>

<h2 id="understanding-iceberg-metadata">Understanding Iceberg metadata</h2>

<p>Iceberg can use any compatible metastore, but the Trino Iceberg connector only supports the Hive metastore and AWS Glue, just like the Hive connector. This is because there is already a vast amount of testing and support for using the Hive metastore in Trino. Likewise, many Trino use cases that run on data lakes already use the Hive connector, and therefore the Hive metastore. This makes it convenient as the leading supported use case, since existing users can easily migrate from Hive to Iceberg tables. The diagram of the Hive connector architecture gives no indication of which connector is actually executing, so it serves as a diagram for both Hive and Iceberg. The only difference is the connector used: if you create a table with the Hive connector, you can view the same table with the Iceberg connector.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metadata.png" alt=""/></p>

<p>To recap the steps taken in the first three blogs: the first blog created an events table, and the first two blogs each ran an insert statement. The first insert contained three records, while the second insert contained a single record.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-snapshot-files.png" alt=""/></p>

<p>Up until this point, the state of the files in MinIO hasn’t really been shown, except for some of the manifest list pointers from the snapshots in the third blog post. Using the <a href="https://docs.min.io/minio/baremetal/reference/minio-cli/minio-mc.html">MinIO client tool</a>, you can list the files that Iceberg generated through all of these operations and then work out what purpose they serve.</p>

<pre><code>% mc tree -f local/
local/
└─ iceberg
   └─ logging.db
      └─ events
         ├─ data
         │  ├─ event_time_day=2021-04-01
         │  │  ├─ 51eb1ea6-266b-490f-8bca-c63391f02d10.orc
         │  │  └─ cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc
         │  └─ event_time_day=2021-04-02
         │     └─ b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc
         └─ metadata
            ├─ 00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json
            ├─ 00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json
            ├─ 00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
            ├─ 23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro
            ├─ 92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro
            ├─ snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro
            ├─ snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro
            └─ snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro
</code></pre>

<p>There are a lot of files here, but you can observe a couple of patterns among them.</p>

<p>First, the top two directories are named <code>data</code> and <code>metadata</code>.</p>

<pre><code>/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/data/
/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/
</code></pre>

<p>As you might expect, <code>data</code> contains the actual ORC files split by partition. This is akin to what you would see in a Hive table <code>data</code> directory. What is really of interest here is the <code>metadata</code> directory. There are specifically three patterns of files you’ll find here.</p>

<pre><code>/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/&lt;file-id&gt;.avro

/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/snap-&lt;snapshot-id&gt;-&lt;version&gt;-&lt;file-id&gt;.avro

/&lt;bucket&gt;/&lt;database&gt;/&lt;table&gt;/metadata/&lt;version&gt;-&lt;commit-UUID&gt;.metadata.json
</code></pre>
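<p>The three naming patterns are regular enough to match mechanically. As a quick illustration, here is one way to classify a file in the <code>metadata</code> directory by its name alone. This is a hypothetical helper, not part of any Iceberg library:</p>

```python
import re

# Hypothetical helper: classify an Iceberg metadata file by name alone.
# The three regexes mirror the naming patterns listed above; "manifest"
# means the plain Avro manifest files (e.g. 92382234-...-m0.avro).
PATTERNS = {
    "manifest-list": re.compile(r"^snap-(?P<snapshot_id>\d+)-(?P<version>\d+)-(?P<file_id>[0-9a-f-]+)\.avro$"),
    "table-metadata": re.compile(r"^(?P<version>\d{5})-(?P<commit_uuid>[0-9a-f-]+)\.metadata\.json$"),
    "manifest": re.compile(r"^(?P<file_id>[0-9a-f-]+-m\d+)\.avro$"),
}

def classify(name: str) -> str:
    """Return which of the three metadata file kinds a name matches."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(name):
            return kind
    return "unknown"

print(classify("snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro"))  # manifest-list
print(classify("00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json"))              # table-metadata
print(classify("92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro"))                          # manifest
```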

<p>Iceberg has a persistent tree structure that manages various snapshots of the data that are created for every mutation of the data. This enables not only a concurrency model that supports serializable isolation, but also cool features like time travel across a linear progression of snapshots.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metastore-files.png" alt=""/></p>

<p>This tree structure contains two types of Avro files, manifest lists and manifest files. Manifest list files contain pointers to various manifest files and the manifest files themselves point to various data files. This post starts out by covering these manifest files, and later covers the table metadata files that are suffixed by <code>.metadata.json</code>.</p>
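<p>The tree described here can be pictured as a small nested structure. The following sketch uses the actual file names from this table (the manifest-to-data-file mapping comes from the manifest contents dumped later in this post); <code>plan_scan</code> is a hypothetical helper, not a real Iceberg API:</p>

```python
# A toy in-memory model of the persistent tree: a snapshot's manifest list
# points to manifest files, and each manifest points to data files.
tree = {
    "snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro": {
        "23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro": [
            "data/event_time_day=2021-04-01/cbcf052d-240d-4881-8a68-2bbc0f7e5233.orc",
        ],
        "92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro": [
            "data/event_time_day=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc",
            "data/event_time_day=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc",
        ],
    },
}

def plan_scan(manifest_list: str) -> list:
    """Walk from a manifest list down to every data file it reaches."""
    return [f for manifest in tree[manifest_list].values() for f in manifest]

# The latest snapshot reaches all three ORC files in the data directory.
print(len(plan_scan("snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro")))  # 3
```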

<p><a href="https://bitsondata.dev/trino-on-ice-iii-iceberg-concurrency-snapshots-spec">The last blog covered</a> the command in Trino that shows the snapshot information that is stored in the metastore. Here is that command and its output again for your review.</p>

<pre><code>SELECT manifest_list 
FROM iceberg.logging.&#34;events$snapshots&#34;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>snapshots</th></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro</td></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro</td></tr>
<tr><td>s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro</td></tr>
</table>

<p>You’ll notice that the manifest list query returns the Avro files prefixed with
<code>snap-</code>. These files correlate directly with the snapshot records stored in the metastore. As the diagram above shows, snapshots are records in the metastore that contain the URL of the manifest list Avro file. Avro files are binary files, not something you can just open up in a text editor to read. Using the <a href="https://downloads.apache.org/avro/avro-1.10.2/java/avro-tools-1.10.2.jar">avro-tools.jar tool</a> distributed by the <a href="https://avro.apache.org/docs/current/index.html">Apache Avro project</a>, you can inspect the contents of these files to get a better understanding of how they are used by Iceberg.</p>

<p>The first snapshot is generated on the creation of the events table. To investigate the snapshots, you need to download the files to your local filesystem; let&#39;s move them to the home directory. Upon inspecting this first file, you notice that it is empty: the output is just a newline, which the <code>jq</code> JSON command line utility removes when pretty printing. This snapshot represents the empty state of the table upon creation.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro | jq .
</code></pre>

<p>Result (empty):</p>

<pre><code>
</code></pre>

<p>The second snapshot is a little more interesting and actually shows us the contents of a manifest list.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro | jq .
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,
   &#34;manifest_length&#34;:6114,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;added_data_files_count&#34;:{
      &#34;int&#34;:2
   },
   &#34;existing_data_files_count&#34;:{
      &#34;int&#34;:0
   },
   &#34;deleted_data_files_count&#34;:{
      &#34;int&#34;:0
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:3
   },
   &#34;existing_rows_count&#34;:{
      &#34;long&#34;:0
   },
   &#34;deleted_rows_count&#34;:{
      &#34;long&#34;:0
   }
}
</code></pre>

<p>To understand the values in each of these rows, you can refer to the Iceberg
<a href="https://iceberg.apache.org/spec/#manifest-lists">specification in the manifest list file section</a>. Instead of covering these exhaustively, let&#39;s focus on a few key fields. Below are those fields and their definitions according to the specification.</p>
<ul><li><code>manifest_path</code> – Location of the manifest file.</li>
<li><code>partition_spec_id</code> – ID of a partition spec used to write the manifest; must be listed in table metadata partition-specs.</li>
<li><code>added_snapshot_id</code> – ID of the snapshot where the manifest file was added.</li>
<li><code>partitions</code> – A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec.</li>
<li><code>added_rows_count</code> – Number of rows in all files in the manifest that have status ADDED, when null this is assumed to be non-zero.</li></ul>

<p>As mentioned above, manifest lists hold references to various manifest files. These manifest paths are the pointers in the persistent tree that tell any client using Iceberg where to find all of the manifest files associated with a particular snapshot. To traverse this tree, you loop over the manifest paths to find all of the manifest files associated with the snapshot you want to traverse. The partition spec id identifies the partition specification used to write the manifest, which is stored in the table metadata in the metastore; it references where to find that spec. The added snapshot id tells you which snapshot added the manifest. Partitions hold some high-level partition bound information to make for faster querying: if a query is looking for a particular value, it only traverses the manifest files whose bound range contains that value. Finally, you get a few metrics, like the number of changed rows and data files; one of these is the count of added rows. The first operation inserted three rows and the second operation inserted one row, so using the row counts you can easily determine which manifest file belongs to which operation.</p>
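<p>Those binary partition bounds can be decoded by hand. For a day-transformed timestamp column, the value is a little-endian 32-bit count of days since the Unix epoch, so the <code>\u001eI\u0000\u0000</code> bytes in the output above decode to 2021-04-01. A sketch using only the Python standard library:</p>

```python
import struct
from datetime import date, timedelta

def decode_day_bound(raw: bytes) -> date:
    """Decode a day-partition bound: little-endian int32 days since epoch."""
    days, = struct.unpack("<i", raw)
    return date(1970, 1, 1) + timedelta(days=days)

# The bytes from the manifest list output above (\u001eI\u0000\u0000 etc.).
lower = decode_day_bound(b"\x1eI\x00\x00")  # 18718 days -> 2021-04-01
upper = decode_day_bound(b"\x1fI\x00\x00")  # 18719 days -> 2021-04-02

# Hypothetical pruning check: a manifest is only worth reading if the
# queried partition value falls inside its [lower, upper] range.
def manifest_may_match(query_day: date, lower: date, upper: date) -> bool:
    return lower <= query_day <= upper

print(lower, upper)                                        # 2021-04-01 2021-04-02
print(manifest_may_match(date(2021, 4, 2), lower, upper))  # True
print(manifest_may_match(date(2021, 5, 1), lower, upper))  # False
```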

<p>The following command shows the manifest list of the final snapshot, after both operations executed, keeping only the fields pointed out above.</p>

<pre><code>% java -jar  ~/Desktop/avro_files/avro-tools-1.10.0.jar tojson ~/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro | jq &#39;. | {manifest_path: .manifest_path, partition_spec_id: .partition_spec_id, added_snapshot_id: .added_snapshot_id, partitions: .partitions, added_rows_count: .added_rows_count }&#39;
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/23cc980c-9570-42ed-85cf-8658fda2727d-m0.avro&#34;,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:4564366177504223700
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:1
   }
}
{
   &#34;manifest_path&#34;:&#34;s3a://iceberg/logging.db/events/metadata/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro&#34;,
   &#34;partition_spec_id&#34;:0,
   &#34;added_snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;partitions&#34;:{
      &#34;array&#34;:[
         {
            &#34;contains_null&#34;:false,
            &#34;lower_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001eI\u0000\u0000&#34;
            },
            &#34;upper_bound&#34;:{
               &#34;bytes&#34;:&#34;\u001fI\u0000\u0000&#34;
            }
         }
      ]
   },
   &#34;added_rows_count&#34;:{
      &#34;long&#34;:3
   }
}
</code></pre>

<p>In the listing of the manifest list for the last snapshot, you notice that the first operation, where three rows were inserted, is represented by the manifest file in the second JSON object. You can determine this from the snapshot id, as well as the number of rows that were added in the operation. The first JSON object contains the last operation, which inserted a single row. So the most recent operations are listed in reverse commit order.</p>

<p>The next command runs the same listing that you ran on the manifest lists, except on the manifest files themselves, to expose and discuss their contents. To begin, run the command to show the contents of the manifest file associated with the insertion of three rows.</p>

<pre><code>% java -jar  ~/avro-tools-1.10.0.jar tojson ~/Desktop/avro_files/92382234-a4a6-4a1b-bc9b-24839472c2f6-m0.avro | jq .
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;status&#34;:1,
   &#34;snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;data_file&#34;:{
      &#34;file_path&#34;:&#34;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-01/51eb1ea6-266b-490f-8bca-c63391f02d10.orc&#34;,
      &#34;file_format&#34;:&#34;ORC&#34;,
      &#34;partition&#34;:{
         &#34;event_time_day&#34;:{
            &#34;int&#34;:18718
         }
      },
      &#34;record_count&#34;:1,
      &#34;file_size_in_bytes&#34;:870,
      &#34;block_size_in_bytes&#34;:67108864,
      &#34;column_sizes&#34;:null,
      &#34;value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:1
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:1
            }
         ]
      },
      &#34;null_value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:0
            }
         ]
      },
      &#34;nan_value_counts&#34;:null,
      &#34;lower_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Oh noes&#34;
            }
         ]
      },
      &#34;upper_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Oh noes&#34;
            }
         ]
      },
      &#34;key_metadata&#34;:null,
      &#34;split_offsets&#34;:null
   }
}
{
   &#34;status&#34;:1,
   &#34;snapshot_id&#34;:{
      &#34;long&#34;:2720489016575682000
   },
   &#34;data_file&#34;:{
      &#34;file_path&#34;:&#34;s3a://iceberg/logging.db/events/data/event_time_day=2021-04-02/b012ec20-bbdd-47f5-89d3-57b9e32ea9eb.orc&#34;,
      &#34;file_format&#34;:&#34;ORC&#34;,
      &#34;partition&#34;:{
         &#34;event_time_day&#34;:{
            &#34;int&#34;:18719
         }
      },
      &#34;record_count&#34;:2,
      &#34;file_size_in_bytes&#34;:1084,
      &#34;block_size_in_bytes&#34;:67108864,
      &#34;column_sizes&#34;:null,
      &#34;value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:2
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:2
            }
         ]
      },
      &#34;null_value_counts&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:2,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:0
            },
            {
               &#34;key&#34;:4,
               &#34;value&#34;:0
            }
         ]
      },
      &#34;nan_value_counts&#34;:null,
      &#34;lower_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;ERROR&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Double oh noes&#34;
            }
         ]
      },
      &#34;upper_bounds&#34;:{
         &#34;array&#34;:[
            {
               &#34;key&#34;:1,
               &#34;value&#34;:&#34;WARN&#34;
            },
            {
               &#34;key&#34;:3,
               &#34;value&#34;:&#34;Maybeh oh noes?&#34;
            }
         ]
      },
      &#34;key_metadata&#34;:null,
      &#34;split_offsets&#34;:null
   }
}
</code></pre>

<p>Now this is a very big output, but in summary, there’s really not too much to these files. As before, there is a <a href="https://iceberg.apache.org/spec/#manifests">Manifest section in the Iceberg spec</a> that details what each of these fields means. Here are the important fields:</p>
<ul><li><code>snapshot_id</code> – Snapshot id where the file was added, or deleted if status is two. Inherited when null.</li>
<li><code>data_file</code> – Field containing metadata about the data files pertaining to the manifest file, such as file path, partition tuple, metrics, etc…</li>
<li><code>data_file.file_path</code> – Full URI for the file with FS scheme.</li>
<li><code>data_file.partition</code> – Partition data tuple, schema based on the partition spec.</li>
<li><code>data_file.record_count</code> – Number of records in the data file.</li>
<li><code>data_file.*_count</code> – Multiple fields that contain a map from column id to the number of values, nulls, or NaNs in the file. These can be used to quickly filter out unnecessary get operations.</li>
<li><code>data_file.*_bounds</code> – Multiple fields that contain a map from column id to lower or upper bound in the column serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.</li></ul>
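<p>To make the bounds concrete, here is a sketch (not the connector’s actual code) of pruning with the statistics from the two <code>data_file</code> entries above, for a hypothetical <code>WHERE level = &#39;INFO&#39;</code> predicate. Column id 1 is the <code>level</code> column, and string bounds compare lexicographically:</p>

```python
# Per-file column stats copied from the two data_file entries above,
# reduced to the `level` column (column id 1).
files = [
    {"name": "51eb1ea6.orc", "lower": {1: "ERROR"}, "upper": {1: "ERROR"}},
    {"name": "b012ec20.orc", "lower": {1: "ERROR"}, "upper": {1: "WARN"}},
]

def may_contain(f: dict, col: int, value: str) -> bool:
    """A file can only hold `value` if it sits between the recorded bounds."""
    return f["lower"][col] <= value <= f["upper"][col]

# "INFO" falls between "ERROR" and "WARN" but not between "ERROR" and
# "ERROR", so only the second file needs to be scanned.
to_scan = [f["name"] for f in files if may_contain(f, 1, "INFO")]
print(to_scan)  # ['b012ec20.orc']
```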

<p>Each data file struct contains the partition and data file that it maps to. These files are only scanned and returned if the query criteria are met when checking all of the count, bounds, and other statistics recorded in the file. Ideally, only files that contain data relevant to the query should be scanned at all. Having information like the record count may also help the query planning process determine splits and other information. This particular optimization hasn’t been completed yet, as planning typically happens before traversal of the files. It is still under discussion and <a href="https://youtu.be/ifXpOn0NJWk?t=2132">is discussed a bit by Iceberg creator Ryan Blue in a recent meetup</a>. If this is something you are interested in, keep posted on the Slack channel and releases as the Trino Iceberg connector progresses in this area.</p>

<p>As mentioned above, the last set of files you find in the metadata directory are those suffixed with <code>.metadata.json</code>. These files are a bit strange at first glance, as they aren’t stored in the Avro format but in JSON. This is because they are not part of the persistent tree structure. These files are essentially a copy of the table metadata that is stored in the metastore. You can find the fields for the table metadata listed <a href="https://iceberg.apache.org/spec/#table-metadata-fields">in the Iceberg specification</a>. This metadata is typically stored persistently in a metastore, much like the Hive metastore, but that could easily be replaced by any datastore that supports <a href="https://iceberg.apache.org/spec/#metastore-tables">an atomic swap (check-and-put) operation</a>, which Iceberg requires for its optimistic concurrency model.</p>

<p>The naming of the table metadata includes a table version and UUID:
<code>&lt;table-version&gt;-&lt;UUID&gt;.metadata.json</code>. To commit a new metadata version, which just adds 1 to the current version number, the writer performs these steps:</p>
<ol><li>It creates a new table metadata file using the current metadata.</li>
<li>It writes the new table metadata to a file following the naming with the next version number.</li>

<li><p>It requests the metastore swap the table’s metadata pointer from the old location to the new location.</p>
<ol><li>If the swap succeeds, the commit succeeded. The new file is now the
current metadata.</li>
<li>If the swap fails, another writer has already committed a new version. The
current writer goes back to step 1.</li></ol></li></ol>
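<p>The commit steps above can be sketched as a small retry loop against a check-and-put primitive. This is only an illustration of the protocol: the class and file names are made up, and a thread lock stands in for the metastore’s atomicity guarantee.</p>

```python
import threading

class Metastore:
    """Toy stand-in: holds a single pointer to the current metadata file
    and offers an atomic check-and-put (compare-and-swap) operation."""
    def __init__(self, location):
        self.location = location
        self._lock = threading.Lock()

    def check_and_put(self, expected, new):
        # The swap succeeds only if no other writer committed in between.
        with self._lock:
            if self.location != expected:
                return False
            self.location = new
            return True

def commit(metastore, write_new_metadata):
    while True:
        base = metastore.location               # step 1: current table metadata
        new = write_new_metadata(base)          # step 2: write the next version
        if metastore.check_and_put(base, new):  # step 3: atomic pointer swap
            return new                          # commit succeeded
        # Swap failed: another writer won the race; retry from the newly
        # committed metadata (back to step 1).

ms = Metastore("00001-example.metadata.json")
print(commit(ms, lambda base: "00002-example.metadata.json"))
# 00002-example.metadata.json
```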

<p>If you want to see where this is stored in the Hive metastore, you can reference the <code>TABLE_PARAMS</code> table. At the time of writing, this is the only method of using the metastore that is supported by the Trino Iceberg connector.</p>

<pre><code>SELECT PARAM_KEY, PARAM_VALUE FROM metastore.TABLE_PARAMS;
</code></pre>

<p>Result:</p>

<table>
<tr><th>PARAM_KEY</th><th>PARAM_VALUE</th></tr>
<tr><td>EXTERNAL</td><td>TRUE</td></tr>
<tr><td>metadata_location</td><td>s3a://iceberg/logging.db/events/metadata/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json</td></tr>
<tr><td>numFiles</td><td>2</td></tr>
<tr><td>previous_metadata_location</td><td>s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json</td></tr>
<tr><td>table_type</td><td>iceberg</td></tr>
<tr><td>totalSize</td><td>5323</td></tr>
<tr><td>transient_lastDdlTime</td><td>1622865672</td></tr>
</table>

<p>So as you can see, the metastore is saying the current metadata location is the
<code>00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json</code> file. Now you can dive in to see the table metadata that is being used by the Iceberg connector.</p>

<pre><code>% cat ~/Desktop/avro_files/00002-33d69acc-94cb-44bc-b2a1-71120e749d9a.metadata.json
</code></pre>

<p>Result:</p>

<pre><code>{
   &#34;format-version&#34;:1,
   &#34;table-uuid&#34;:&#34;32e3c271-84a9-4be5-9342-2148c878227a&#34;,
   &#34;location&#34;:&#34;s3a://iceberg/logging.db/events&#34;,
   &#34;last-updated-ms&#34;:1622865686323,
   &#34;last-column-id&#34;:5,
   &#34;schema&#34;:{
      &#34;type&#34;:&#34;struct&#34;,
      &#34;fields&#34;:[
         {
            &#34;id&#34;:1,
            &#34;name&#34;:&#34;level&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;string&#34;
         },
         {
            &#34;id&#34;:2,
            &#34;name&#34;:&#34;event_time&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;timestamp&#34;
         },
         {
            &#34;id&#34;:3,
            &#34;name&#34;:&#34;message&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:&#34;string&#34;
         },
         {
            &#34;id&#34;:4,
            &#34;name&#34;:&#34;call_stack&#34;,
            &#34;required&#34;:false,
            &#34;type&#34;:{
               &#34;type&#34;:&#34;list&#34;,
               &#34;element-id&#34;:5,
               &#34;element&#34;:&#34;string&#34;,
               &#34;element-required&#34;:false
            }
         }
      ]
   },
   &#34;partition-spec&#34;:[
      {
         &#34;name&#34;:&#34;event_time_day&#34;,
         &#34;transform&#34;:&#34;day&#34;,
         &#34;source-id&#34;:2,
         &#34;field-id&#34;:1000
      }
   ],
   &#34;default-spec-id&#34;:0,
   &#34;partition-specs&#34;:[
      {
         &#34;spec-id&#34;:0,
         &#34;fields&#34;:[
            {
               &#34;name&#34;:&#34;event_time_day&#34;,
               &#34;transform&#34;:&#34;day&#34;,
               &#34;source-id&#34;:2,
               &#34;field-id&#34;:1000
            }
         ]
      }
   ],
   &#34;default-sort-order-id&#34;:0,
   &#34;sort-orders&#34;:[
      {
         &#34;order-id&#34;:0,
         &#34;fields&#34;:[
            
         ]
      }
   ],
   &#34;properties&#34;:{
      &#34;write.format.default&#34;:&#34;ORC&#34;
   },
   &#34;current-snapshot-id&#34;:4564366177504223943,
   &#34;snapshots&#34;:[
      {
         &#34;snapshot-id&#34;:6967685587675910019,
         &#34;timestamp-ms&#34;:1622865672882,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;changed-partition-count&#34;:&#34;0&#34;,
            &#34;total-records&#34;:&#34;0&#34;,
            &#34;total-data-files&#34;:&#34;0&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-6967685587675910019-1-bcbe9133-c51c-42a9-9c73-f5b745702cb0.avro&#34;
      },
      {
         &#34;snapshot-id&#34;:2720489016575682283,
         &#34;parent-snapshot-id&#34;:6967685587675910019,
         &#34;timestamp-ms&#34;:1622865680419,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;added-data-files&#34;:&#34;2&#34;,
            &#34;added-records&#34;:&#34;3&#34;,
            &#34;added-files-size&#34;:&#34;1954&#34;,
            &#34;changed-partition-count&#34;:&#34;2&#34;,
            &#34;total-records&#34;:&#34;3&#34;,
            &#34;total-data-files&#34;:&#34;2&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-2720489016575682283-1-92382234-a4a6-4a1b-bc9b-24839472c2f6.avro&#34;
      },
      {
         &#34;snapshot-id&#34;:4564366177504223943,
         &#34;parent-snapshot-id&#34;:2720489016575682283,
         &#34;timestamp-ms&#34;:1622865686278,
         &#34;summary&#34;:{
            &#34;operation&#34;:&#34;append&#34;,
            &#34;added-data-files&#34;:&#34;1&#34;,
            &#34;added-records&#34;:&#34;1&#34;,
            &#34;added-files-size&#34;:&#34;746&#34;,
            &#34;changed-partition-count&#34;:&#34;1&#34;,
            &#34;total-records&#34;:&#34;4&#34;,
            &#34;total-data-files&#34;:&#34;3&#34;,
            &#34;total-delete-files&#34;:&#34;0&#34;,
            &#34;total-position-deletes&#34;:&#34;0&#34;,
            &#34;total-equality-deletes&#34;:&#34;0&#34;
         },
         &#34;manifest-list&#34;:&#34;s3a://iceberg/logging.db/events/metadata/snap-4564366177504223943-1-23cc980c-9570-42ed-85cf-8658fda2727d.avro&#34;
      }
   ],
   &#34;snapshot-log&#34;:[
      {
         &#34;timestamp-ms&#34;:1622865672882,
         &#34;snapshot-id&#34;:6967685587675910019
      },
      {
         &#34;timestamp-ms&#34;:1622865680419,
         &#34;snapshot-id&#34;:2720489016575682283
      },
      {
         &#34;timestamp-ms&#34;:1622865686278,
         &#34;snapshot-id&#34;:4564366177504223943
      }
   ],
   &#34;metadata-log&#34;:[
      {
         &#34;timestamp-ms&#34;:1622865672894,
         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00000-c5cfaab4-f82f-4351-b2a5-bd0e241f84bc.metadata.json&#34;
      },
      {
         &#34;timestamp-ms&#34;:1622865680524,
         &#34;metadata-file&#34;:&#34;s3a://iceberg/logging.db/events/metadata/00001-27c8c2d1-fdbb-429d-9263-3654d818250e.metadata.json&#34;
      }
   ]
}
</code></pre>

<p>As you can see, these JSON files can quickly grow as you perform updates on your table. This file contains a pointer to every snapshot and manifest list file, much like the output you found when querying the snapshots table. One important piece to note is that the schema is stored here; this is what Trino uses for validation on inserts and reads. As you may expect, there is also the root location of the table itself, as well as a unique table identifier. The final part I’d like to note about this file is the <code>partition-spec</code> and <code>partition-specs</code> fields. The <code>partition-spec</code> field holds the current partition spec, while <code>partition-specs</code> is an array holding every partition spec that has existed for this table. As pointed out earlier, you can have many different manifest files that use different partition specs. That wraps up all of the metadata file types you can expect to see in Iceberg!</p>
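<p>To make the role of <code>partition-specs</code> concrete, here is a brief sketch of how a reader could resolve the spec a manifest was written with by its spec id, so manifests written under older specs stay readable. The dictionaries mirror the JSON above, but the lookup function is illustrative, not Iceberg’s actual API.</p>

```python
# The partition-specs array from the table metadata above, as plain dicts.
partition_specs = [
    {"spec-id": 0,
     "fields": [{"name": "event_time_day", "transform": "day",
                 "source-id": 2, "field-id": 1000}]},
]

def spec_for(manifest_spec_id):
    """Resolve the partition spec a manifest was written with by spec id."""
    return next(s for s in partition_specs if s["spec-id"] == manifest_spec_id)

print(spec_for(0)["fields"][0]["name"])  # event_time_day
```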

<p>This post wraps up the Trino on ice series. Hopefully these blog posts serve as a helpful initial dialogue about what is expected to grow into a vital part of the open data lakehouse stack. What are you waiting for? Come join the fun and help us implement some of the missing features, or go ahead and try <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">Trino on Ice(berg)</a> yourself!</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>
]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-iv-deep-dive</guid>
      <pubDate>Thu, 12 Aug 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</title>
      <link>https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This blog closes up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the Iceberg Specification.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;Concurrency Model&#xA; Some issues with the Hive model are the distinct locations where the metadata is stored and where the data files are stored. Having your data and metadata split up like this is a recipe for disaster when trying to apply updates to both services atomically.&#xA;&#xA; Iceberg metadata diagram of runtime, and file storage&#xA; A very common problem with Hive is that if a writing process failed during insertion, many times you would find the data written to file storage, but the metastore writes failed to occur. Or conversely, the metastore writes were successful, but the data failed to finish writing to file storage due to a  network or file IO failure. There’s a good  Trino Community Broadcast episode that talks about a function in Trino that exists to resolve these issues by syncing the metastore and file storage. 
You can watch  a simulation of this error on that episode.&#xA;&#xA; Aside from having issues due to the split state in the system, there are many  other issues that stem from the file system itself. In the case of HDFS,  depending on the specific filesystem implementation you are using, you may have different atomicity guarantees for various file systems and their operations, such as creating, deleting, and renaming files and directories. HDFS isn’t the only troublemaker here. Other than Amazon S3’s  recent announcement of strong consistency in their S3 service, most object storage systems only offer eventual consistency that may not show the latest files immediately after writes. Despite storage systems showing more progress towards offering better performance and guarantees, these systems still offer no reliable locking mechanism.&#xA;&#xA; Iceberg addresses all of these issues in a multitude of ways. One of the primary ways Iceberg introduces transactional guarantees is by storing the metadata in the same datastore as the data itself. This simplifies handling commit failures down to rolling back on one system rather than trying to coordinate a rollback across two systems like in Hive. Writers independently write their metadata and attempt to perform their operations, needing no coordination with other writers. The only time the writers coordinate is when they attempt to perform a commit of their operations. In order to do a commit, they perform a lock of the current snapshot record in a database. This concurrency model where writers eagerly do the work upfront is called optimistic concurrency control.&#xA; Currently, in Trino, this method still uses the Hive metastore to perform the lock-and-swap operation necessary to coordinate the final commits. Iceberg  creator, Ryan Blue, covers this lock-and-swap mechanism and how the metastore can be replaced with alternate locking methods. 
In the event that two writers attempt to commit at the same time, the writer that first acquires the lock successfully commits by swapping its snapshot as the current snapshot, while the second writer will retry to apply its changes again. The second writer should have no problem with this, assuming there are no conflicting changes between the two snapshots.&#xA;&#xA; &#xA;&#xA; This works similarly to a git workflow where the main branch is the locked resource, and two developers try to commit their changes at the same time. The first developer’s changes may conflict with the second developer’s changes. The second developer is then forced to rebase or merge the first developer’s code with their changes before commiting to the main branch again. The same logic applies to merging data files. Currently, Iceberg clients use a copy-on-write mechanism that makes a new file out of the merged data in the next snapshot. This enables accurate time traveling and preserves previous split versions of the files. At the time of writing, upserts via MERGE INTO syntax are not supported in Trino, but  this is in active development. UPDATE: Since the original writing of this post, the  MERGE syntax exists as of version 393.&#xA;&#xA; One of the great benefits of tracking each individual change that gets written to Iceberg is that you are given a view of the data at every point in time. This enables a really cool feature that I mentioned earlier called time travel.&#xA;&#xA; ## Snapshots and Time Travel&#xA;&#xA; To showcase snapshots, it’s best to go over a few examples drawing from the event table we  created in the previous blog posts. This time we’ll only be working with the Iceberg table, as this capability is not available in Hive. Snapshots allow you to have an immutable set of your data at a given time. They are automatically created on every append or removal of data. 
One thing to note is that for now, they do not store the state of your metadata.&#xA; Say that you have c&#xA; reated your events table and inserted the three initial rows as we did previously. Let’s look at the data we get back and see how to check the existing snapshots in Trino:&#xA;&#xA; &#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;| ERROR | Double oh noes |&#xA;| WARN | Maybeh oh noes? |&#xA;| ERROR | Oh noes |&#xA;&#xA;To query the snapshots, all you need is to use the $ operator appended to the&#xA;end of the table name, and add the hidden table, snapshots:&#xA;&#xA;SELECT snapshotid, parentid, operation&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| snapshotid | parentid | operation |&#xA;| --- | --- | --- |&#xA;| 7620328658793169607 | | append |&#xA;| 2115743741823353537 | 7620328658793169607 | append |&#xA;&#xA;Let’s take a look at the manifest list files that are associated with each &#xA;snapshot ID. 
You can tell which file belongs to which snapshot based on the &#xA;snapshot ID embedded in the filename:&#xA;&#xA;SELECT manifestlist&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| shapshots |&#xA;| --- |&#xA;| s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro | &#xA;| s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro | &#xA;&#xA;Now, let’s insert another row to the table:&#xA;&#xA;INSERT INTO iceberg.logging.events&#xA;VALUES&#xA;(&#xA;‘INFO’,&#xA;timestamp ‘2021-04-02 00:00:11.1122222’,&#xA;‘It is all good’,&#xA;ARRAY [‘Just updating you!’]&#xA;);&#xA;&#xA;Let’s check the snapshot table again:&#xA;&#xA;SELECT snapshotid, parentid, operation&#xA;FROM iceberg.logging.“events$snapshots”;&#xA;&#xA;Result:&#xA;&#xA;| snapshotid | parentid | operation |&#xA;| --- | --- | --- |&#xA;| 7620328658793169607 | | append |&#xA;| 2115743741823353537 | 7620328658793169607 | append |&#xA;| 7030511368881343137 | 2115743741823353537 | append |&#xA;&#xA;Let’s also verify that our row was added:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Oh noes |&#xA;|INFO |It is all good |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;&#xA; Since Iceberg is already tracking the list of files added and removed at each snapshot, it would make sense that you can travel back and forth between these different views into the system, right? This concept is called time traveling. You need to specify which snapshot you would like to read from and you will see the view of the data at that timestamp. 
In Trino, you need to use the @ operator, followed by the snapshot you wish to read from:&#xA; &#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.“events@2115743741823353537”;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;|ERROR|Oh noes |&#xA;&#xA; If you determine there is some issue with your data, you can always roll back to the previous state permanently as well. In Trino we have a function called rollbacktosnapshot to move the table state to another snapshot:&#xA; &#xA;CALL system.rollbacktosnapshot(‘logging’, ‘events’, 2115743741823353537);&#xA;&#xA;Now that we have rolled back, observe what happens when we query the events&#xA;table with:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA;|ERROR|Oh noes |&#xA; &#xA; Notice the INFO row is still missing even though we query the table without specifying a snapshot id. Now just because we rolled back, doesn’t mean we’ve lost the snapshot we just rolled back from. In fact, we can roll forward, or as I like to call it,  back to the future! 
In Trino, you use the same function call but with a predecessor of the existing snapshot:&#xA; &#xA;CALL system.rollbacktosnapshot(‘logging’, ‘events’, 7030511368881343137)&#xA;&#xA;And now we should be able to query the table again and see the INFO row &#xA;return:&#xA;&#xA;SELECT level, message&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level | message |&#xA;| --- | --- |&#xA;|ERROR|Oh noes |&#xA;|INFO |It is all good |&#xA;|ERROR|Double oh noes |&#xA;|WARN |Maybeh oh noes?|&#xA; &#xA; As expected, the INFO row returns when you roll back to the future.&#xA; &#xA; Having snapshots not only provides you with a level of immutability that is key to the eventual consistency model, but gives you a rich set of features to version and move between different versions of your data like a git repository.&#xA; &#xA; ## Iceberg Specification&#xA; &#xA; Perhaps saving the best for last, the benefit of using Iceberg is the community that surrounds it, and the support you receive. It can be daunting to have to choose a project that replaces something so core to your architecture. While Hive has so many drawbacks, one of the things keeping many companies locked in is the fear of the unknown. How do you know which table format to choose? Are there unknown data corruption issues that I’m about to take on? What if this doesn’t scale like it promises on the label? It is worth noting that  alternative table formats are also emerging in this space  and we encourage you to investigate these for your own use cases. When sitting down with Iceberg creator, Ryan Blue,  comparing Iceberg to other table formats,  he claims the community’s greatest strength is their ability to look forward. They intentionally broke compatibility with Hive to enable them to provide a richer level of features. Unlike Hive, the Iceberg project explained their thinking in a spec.&#xA;&#xA; The strongest argument I can see for Iceberg is that it has a specification. 
This is something that has largely been missing from Hive and shows a real maturity in how the Iceberg community has approached the issue. On the Trino project, we think standards are important. We adhere to many of them ourselves, such as the ANSI SQL syntax, and exposing the client through a JDBC connection. By creating a standard around this, you’re no longer tied to any particular technology, not even Iceberg itself. You are adhering to a standard that will hopefully become the de facto standard over a decade or two, much like Hive did. Having the standard in clear writing invites multiple communities to the table and brings even more use  cases. Doing so improves the standards and therefore the technologies that implement them.&#xA; &#xA; The previous three blog posts of this series covered the features and massive benefits from using this novel table format. The following post will dive deeper and discuss more about how Iceberg achieves some of this functionality, with an overview into some of the internals and metadata layouts. In the meantime, feel free to try  Trino on Ice(berg).&#xA;&#xA;#trino #iceberg&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet. We introduced concepts and issues that table formats address. This blog closes up the overview of Iceberg features by discussing the concurrency model Iceberg uses to ensure data integrity, how to use snapshots via Trino, and the <a href="https://iceberg.apache.org/spec/">Iceberg Specification</a>.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<h2 id="concurrency-model">Concurrency Model</h2>

<p> One issue with the Hive model is that the metadata and the data files are stored in distinct locations. Having your data and metadata split up like this is a recipe for disaster when trying to apply updates to both services atomically.</p>

<p> <img src="https://trino.io/assets/blog/trino-on-ice/iceberg-metadata.png" alt="Iceberg metadata diagram of runtime, and file storage"/>
 A very common problem with Hive is that if a writing process failed during insertion, you would often find the data written to file storage, but the metastore writes failed to occur. Or conversely, the metastore writes were successful, but the data failed to finish writing to file storage due to a network or file IO failure. There’s a good <a href="https://trino.io/episodes/5.html">Trino Community Broadcast episode</a> that talks about a function in Trino that exists to resolve these issues by syncing the metastore and file storage. You can watch <a href="https://www.youtube.com/watch?v=OXyJFZSsX5w&amp;t=2097s">a simulation of this error</a> on that episode.</p>

<p> Aside from having issues due to the split state in the system, there are many other issues that stem from the file system itself. In the case of HDFS, depending on the specific filesystem implementation you are using, you may have <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem">different atomicity guarantees for various file systems and their operations</a>, such as creating, deleting, and renaming files and directories. HDFS isn’t the only troublemaker here. Other than Amazon S3’s <a href="https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3-now-delivers-strong-read-after-write-consistency-automatically-for-all-applications/">recent announcement of strong consistency in their S3 service</a>, most object storage systems only offer <em>eventual</em> consistency that may not show the latest files immediately after writes. Despite storage systems making progress towards better performance and guarantees, these systems still offer no reliable locking mechanism.</p>

<p> Iceberg addresses all of these issues in a multitude of ways. One of the primary ways Iceberg introduces transactional guarantees is by storing the metadata in the same datastore as the data itself. This simplifies handling commit failures down to rolling back on one system rather than trying to coordinate a rollback across two systems like in Hive. Writers independently write their metadata and attempt to perform their operations, needing no coordination with other writers. The only time the writers coordinate is when they attempt to perform a commit of their operations. In order to do a commit, they perform a lock of the current snapshot record in a database. This concurrency model where writers eagerly do the work upfront is called <strong><em>optimistic concurrency control</em></strong>.
 Currently, in Trino, this method still uses the Hive metastore to perform the lock-and-swap operation necessary to coordinate the final commits. Iceberg  creator, <a href="https://www.linkedin.com/in/rdblue/">Ryan Blue</a>, <a href="https://youtu.be/-iIY2sOFBRc?t=1351">covers this lock-and-swap mechanism</a> and how the metastore can be replaced with alternate locking methods. In the event that <a href="https://iceberg.apache.org/reliability/#concurrent-write-operations">two writers attempt to commit at the same time</a>, the writer that first acquires the lock successfully commits by swapping its snapshot as the current snapshot, while the second writer will retry to apply its changes again. The second writer should have no problem with this, assuming there are no conflicting changes between the two snapshots.</p>

<p> <img src="https://trino.io/assets/blog/trino-on-ice/iceberg-files.png" alt=""/></p>

<p> This works similarly to a git workflow where the main branch is the locked resource, and two developers try to commit their changes at the same time. The first developer’s changes may conflict with the second developer’s changes. The second developer is then forced to rebase or merge the first developer’s code with their changes before committing to the main branch again. The same logic applies to merging data files. Currently, Iceberg clients use a <a href="https://iceberg.apache.org/reliability/#concurrent-write-operations">copy-on-write mechanism</a> that makes a new file out of the merged data in the next snapshot. This enables accurate time traveling and preserves previous split versions of the files. At the time of writing, upserts via <code>MERGE INTO</code> syntax are not supported in Trino, but <a href="https://github.com/trinodb/trino/issues/7708">this is in active development</a>. <strong><em>UPDATE:</em></strong> Since the original writing of this post, the <a href="https://github.com/trinodb/trino/pull/7933"><code>MERGE</code> syntax exists as of version 393</a>.</p>
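<p>A toy model helps show why copy-on-write keeps time travel accurate. In this sketch (all names invented, data files modeled as plain lists), an append writes a brand-new merged file and a new snapshot that points to it, while the old snapshot keeps its original file list untouched.</p>

```python
# Data files are immutable; snapshots are just lists of file names.
store = {"f1": [("ERROR", "Oh noes")]}
snapshots = {1: ["f1"]}

def append_with_merge(snapshot_id, new_rows):
    """Copy-on-write: merge the parent's files with new rows into a new file."""
    merged = [r for f in snapshots[snapshot_id] for r in store[f]] + new_rows
    store["f2"] = merged                 # a brand-new file; f1 is never mutated
    snapshots[snapshot_id + 1] = ["f2"]  # the new snapshot points at the new file
    return snapshot_id + 1

new_snap = append_with_merge(1, [("INFO", "It is all good")])

def rows_at(snapshot_id):
    return [r for f in snapshots[snapshot_id] for r in store[f]]

print(rows_at(1))         # [('ERROR', 'Oh noes')] -- the old view is intact
print(rows_at(new_snap))  # [('ERROR', 'Oh noes'), ('INFO', 'It is all good')]
```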

<p> One of the great benefits of tracking each individual change that gets written to Iceberg is that you are given a view of the data at every point in time. This enables a really cool feature that I mentioned earlier called <strong><em>time travel</em></strong>.</p>

<h2 id="snapshots-and-time-travel">Snapshots and Time Travel</h2>

<p> To showcase snapshots, it’s best to go over a few examples drawing from the events table we created in the previous blog posts. This time we’ll only be working with the Iceberg table, as this capability is not available in Hive. Snapshots allow you to have an immutable set of your data at a given time. They are automatically created on every append or removal of data. One thing to note is that, for now, they do not store the state of your metadata.
 Say that you have created your events table and inserted the three initial rows as we did previously. Let’s look at the data we get back and see how to check the existing snapshots in Trino:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p>To query the snapshots, all you need to do is append the <code>$</code> operator to the
end of the table name, followed by the hidden table name, <code>snapshots</code>:</p>

<pre><code>SELECT snapshot_id, parent_id, operation
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>snapshot_id</th>
<th>parent_id</th>
<th>operation</th>
</tr>
</thead>

<tbody>
<tr>
<td>7620328658793169607</td>
<td></td>
<td>append</td>
</tr>

<tr>
<td>2115743741823353537</td>
<td>7620328658793169607</td>
<td>append</td>
</tr>
</tbody>
</table>

<p>Let’s take a look at the manifest list files that are associated with each
snapshot ID. You can tell which file belongs to which snapshot based on the
snapshot ID embedded in the filename:</p>

<pre><code>SELECT manifest_list
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>manifest_list</th>
</tr>
</thead>

<tbody>
<tr>
<td>s3a://iceberg/logging.db/events/metadata/snap-7620328658793169607-1-cc857d89-1c07-4087-bdbc-2144a814dae2.avro</td>
</tr>

<tr>
<td>s3a://iceberg/logging.db/events/metadata/snap-2115743741823353537-1-4cb458be-7152-4e99-8db7-b2dda52c556c.avro</td>
</tr>
</tbody>
</table>

<p>Now, let’s insert another row to the table:</p>

<pre><code>INSERT INTO iceberg.logging.events
VALUES
(
  'INFO',
  timestamp '2021-04-02 00:00:11.1122222',
  'It is all good',
  ARRAY['Just updating you!']
);
</code></pre>

<p>Let’s check the snapshot table again:</p>

<pre><code>SELECT snapshot_id, parent_id, operation
FROM iceberg.logging."events$snapshots";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>snapshot_id</th>
<th>parent_id</th>
<th>operation</th>
</tr>
</thead>

<tbody>
<tr>
<td>7620328658793169607</td>
<td></td>
<td>append</td>
</tr>

<tr>
<td>2115743741823353537</td>
<td>7620328658793169607</td>
<td>append</td>
</tr>

<tr>
<td>7030511368881343137</td>
<td>2115743741823353537</td>
<td>append</td>
</tr>
</tbody>
</table>

<p>Let’s also verify that our row was added:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>

<tr>
<td>INFO</td>
<td>It is all good</td>
</tr>

<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>
</tbody>
</table>

<p>Since Iceberg already tracks the list of files added and removed at each snapshot, it would make sense that you can travel back and forth between these different views of the table, right? This concept is called time travel. You specify which snapshot you would like to read from, and you see the data as it existed at that point in time. In Trino, you use the <code>@</code> operator, followed by the snapshot ID you wish to read from:</p>

<pre><code>SELECT level, message
FROM iceberg.logging."events@2115743741823353537";
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p>If you determine there is some issue with your data, you can also permanently roll back to a previous state. Trino provides a procedure called <code>rollback_to_snapshot</code> to move the table state back to another snapshot:</p>

<pre><code>CALL iceberg.system.rollback_to_snapshot('logging', 'events', 2115743741823353537);
</code></pre>
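<p>To confirm the rollback took effect, you can also inspect the <code>history</code> hidden table, which records when each snapshot became the current one; after the rollback, the snapshot we moved away from should no longer be an ancestor of the current state (the column names here follow the Trino Iceberg connector and may differ between versions):</p>

<pre><code>SELECT made_current_at, snapshot_id, is_current_ancestor
FROM iceberg.logging."events$history";
</code></pre>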

<p>Now that we have rolled back, observe what happens when we query the events
table with:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>
</tbody>
</table>

<p>Notice the <code>INFO</code> row is still missing even though we query the table without specifying a snapshot ID. Just because we rolled back doesn’t mean we’ve lost the snapshot we rolled back from. In fact, we can roll forward, or as I like to call it, go <a href="https://en.wikipedia.org/wiki/Back_to_the_Future">back to the future</a>! In Trino, you use the same procedure call, but with a successor of the current snapshot:</p>

<pre><code>CALL iceberg.system.rollback_to_snapshot('logging', 'events', 7030511368881343137);
</code></pre>

<p>And now we should be able to query the table again and see the <code>INFO</code> row
return:</p>

<pre><code>SELECT level, message
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Oh noes</td>
</tr>

<tr>
<td>INFO</td>
<td>It is all good</td>
</tr>

<tr>
<td>ERROR</td>
<td>Double oh noes</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
</tr>
</tbody>
</table>

<p>As expected, the <code>INFO</code> row returns when you roll forward, back to the future.</p>

<p>Snapshots not only provide the immutability that keeps Iceberg tables consistent on eventually consistent object stores, but also give you a rich set of features to version and move between different states of your data, much like a git repository.</p>

<h2 id="iceberg-specification">Iceberg Specification</h2>

<p>Perhaps saving the best for last, a major benefit of using Iceberg is the community that surrounds it and the support you receive. It can be daunting to choose a project that replaces something so core to your architecture. Hive has many drawbacks, yet one of the things keeping many companies locked in is fear of the unknown. How do you know which table format to choose? Are there unknown data corruption issues you’re about to take on? What if this doesn’t scale like it promises on the label? It is worth noting that <a href="https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/">alternative table formats are also emerging in this space</a>, and we encourage you to investigate these for your own use cases. When Iceberg creator Ryan Blue sat down to <a href="https://www.twitch.tv/videos/989098630">compare Iceberg to other table formats</a>, he claimed the community’s greatest strength is its ability to look forward. They intentionally broke compatibility with Hive so they could provide a richer set of features. Unlike Hive, the Iceberg project wrote their thinking down in a spec.</p>

<p>The strongest argument I can make for Iceberg is that it has a <a href="https://iceberg.apache.org/spec/">specification</a>. This is something that has largely been missing from Hive, and it shows real maturity in how the Iceberg community has approached the problem. On the Trino project, we think standards are important. We adhere to many of them ourselves, such as ANSI SQL syntax and exposing clients through JDBC. By creating a standard, you’re no longer tied to any particular technology, not even Iceberg itself. You are adhering to a standard that will hopefully become the de facto standard over a decade or two, much like Hive did. Having the standard in clear writing invites multiple communities to the table and brings even more use cases. Doing so improves the standard and, therefore, the technologies that implement it.</p>

<p>The first three posts of this series covered the features and massive benefits of using this novel table format. The following post dives deeper into how Iceberg achieves some of this functionality, with an overview of its internals and metadata layouts. In the meantime, feel free to try <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/iceberg/trino-iceberg-minio">Trino on Ice(berg)</a>.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-iii-concurrency-snapshots-spec</guid>
      <pubDate>Fri, 30 Jul 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</title>
      <link>https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;The first post covered how Iceberg is a table format and not a file format It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or go back to the first post before starting this one. This post discusses evolution. No, the post isn’t covering Darwinian nor Pokémon evolution, but in-place table evolution! &#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;You may find it a little odd that I am getting excited over tables evolving &#xA;in-place, but as mentioned in the last post, if you have experience performing table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander evolved into Charmeleon discovering that Iceberg supports Partition evolution and schema evolution. That is, until Charmeleon started treating Ash like a jerk after the evolution from Charmander. Hopefully, you won’t face the same issue when your tables evolve. &#xA;&#xA;Another important aspect that is covered, is how Iceberg is developed with cloud storage in mind. Hive and other data lake technologies were developed with file systems as their primary storage layer. 
This is still a very common layer today, but as more companies move to include object storage, table formats did not adapt to the needs of object stores. Let’s dive in!&#xA;&#xA;Partition Specification evolution&#xA;&#xA;In Iceberg, you are able to update the partition specification, shortened to partition spec in Iceberg, on a live table. You do not need to perform a table migration as you do in Hive. In Hive, partition specs don’t explicitly exist because they are tightly coupled with the creation of the Hive table. Meaning, if you ever need to change the granularity of your data partitions at any point, you need to create an entirely new table, and move all the data to the new partition granularity you desire. No pressure on choosing the right granularity or anything!&#xA;&#xA;In Iceberg, you’re not required to choose the perfect partition specification upfront, and you can have multiple partition specs in the same table, and query across the different sized partition specs. How great is that! This means, if you’re initially partitioning your data by month, and later you decide to move to a daily partitioning spec due to a growing ingest from all your new customers, you can do so with no migration, and query over the table with no issue. &#xA;&#xA;This is conveyed pretty succinctly in this graphic from the Iceberg &#xA;documentation. At the end of the year 2008, partitioning occurs at a monthly granularity and after 2009, it moves to a daily granularity. When the query to pull data from December 14th, 2008 and January 13th, 2009, the entire month of December gets scanned due to the monthly partition, but for the dates in January, only the first 13 days are scanned to answer the query.&#xA;&#xA;At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist. There are efforts to add this support in the near future. 
Edit: this has since been merged!&#xA;&#xA;Schema evolution&#xA;&#xA;Iceberg also handles schema evolution much more elegantly than Hive. In Hive, adding columns worked well enough, as data inserted before the schema change just reports null for that column. For formats that use column names, like ORC and Parquet, deletes are also straightforward for Hive, as it simply ignores fields that are no longer part of the table. For unstructured files like CSV that use the position of the column, deletes would still cause issues, as deleting one column shifts the rest of the columns. Renames for schemas pose an issue for all formats in Hive as data written prior to the rename is not modified to the new field. This effectively works the same as if you deleted the old field and added a new column with the new name. This lack of support for schema. evolution across various file types in Hive requires a lot of memorizing&#xA;the formats underneath various tables. This is very susceptible to causing user errors if someone executes one of the unsupported operations on the wrong table.&#xA;&#xA;table&#xA;thead&#xA;  tr&#xA;    th colspan=&#34;4&#34;Hive 2.2.0 schema evolution based on file type and operation./th&#xA;  /tr&#xA;/thead&#xA;tbody&#xA;  tr&#xA;    td/td&#xA;    tdAdd/td&#xA;    tdDelete/td&#xA;    tdRename/td&#xA;  /tr&#xA;  tr&#xA;    tdCSV/TSV/td&#xA;    td✅/td&#xA;    td❌/td&#xA;    td❌/td&#xA;  /tr&#xA;  tr&#xA;    tdJSON/td&#xA;    td✅/td&#xA;    td✅/td&#xA;    td❌/td&#xA;  /tr&#xA;  tr&#xA;    tdORC/Parquet/Avro/td&#xA;    td✅/td&#xA;    td✅/td&#xA;    td❌/td&#xA;  /tr&#xA;/tbody&#xA;/table&#xA;&#xA;Currently in Iceberg, schemaless position-based data formats such as CSV and TSVare not supported, though there are some discussions on adding limited support for them. This would be good from a reading standpoint, to load data from the CSV, into an Iceberg format with all the guarantees that Iceberg offers. 
&#xA;&#xA;While JSON doesn’t rely on positional data, it does have an explicit dependency on names. This means, that if I remove a text column from a JSON table named severity, then later I want to add a new int column called severity, I encounter an error when I try to read in the data with the string type from before when I try to deserialize the JSON files. Even worse would be if the new severity column you add has the same type as the original but a semantically different meaning. This results in old rows containing values that are unknowingly from a different domain, which can lead to wrong analytics. After all, someone who adds the new severity column might not even be aware of the old severity column, if it was quite some time ago when it was dropped.&#xA;&#xA;ORC, Parquet, and Avro do not suffer from these issues as they are columnar formats that keep a schema internal to the file itself, and each format tracks changes to the columns through IDs rather than name values or position. Iceberg uses these unique column IDs to also keep track of the columns as changes are applied.&#xA;&#xA;In general, Iceberg can only allow this small set of file formats due to the correctness guarantees it provides. In Trino, you can add, delete, or rename columns using the ALTER TABLE command. Here’s an example that continues from the table created  in the last post  that inserted three rows. 
The DDL statement looked like this.&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP(6), &#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;);&#xA;&#xA;Here is an ALTER TABLE sequence that adds a new column named severity, inserts data including into the new column, renames the column, and prints the data.&#xA;&#xA;ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; &#xA;&#xA;INSERT INTO iceberg.logging.events VALUES &#xA;(&#xA;  &#39;INFO&#39;, &#xA;  timestamp &#xA;  &#39;2021-04-01 19:59:59.999999&#39; AT TIME ZONE &#39;America/LosAngeles&#39;, &#xA;  &#39;es muy bueno&#39;, &#xA;  ARRAY [&#39;It is all normal&#39;], &#xA;  1&#xA;);&#xA;&#xA;ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;&#xA;&#xA;SELECT level, message, priority&#xA;FROM iceberg.logging.events;&#xA;&#xA;Result:&#xA;&#xA;| level |  message | priority |&#xA;| --- | --- | --- |&#xA;| ERROR | Double oh noes | NULL |&#xA;| WARN | Maybeh oh noes? | NULL |&#xA;| ERROR | Oh noes | NULL |&#xA;| INFO | es muy bueno | 1 |&#xA;&#xA;ALTER TABLE iceberg.logging.events &#xA;DROP COLUMN priority;&#xA;&#xA;SHOW CREATE TABLE iceberg.logging.events;&#xA;&#xA;Result&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;   level varchar,&#xA;   eventtime timestamp(6),&#xA;   message varchar,&#xA;   callstack array(varchar)&#xA;)&#xA;WITH (&#xA;   format = &#39;ORC&#39;,&#xA;   partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;)&#xA;&#xA;Notice how the priority and severity columns are both not present in the schema. As noted in the table above, Hive renames cause issues for all file formats. 
Yet in Iceberg, performing all these operations causes no issues with the table and underlying data.&#xA;&#xA;Cloud storage compatibility&#xA;&#xA;Not all developers consider or are aware of the performance implications of using Hive over a cloud object storage solution like S3 or Azure Blob storage. One thing to remember is that Hive was developed with the Hadoop Distributed File System (HDFS) in mind. HDFS is a filesystem and is particularly well suited to handle listing files on the filesystem, because they were stored in a contiguous manner. When Hive stores data associated with a table, it assumes there is a contiguous layout underneath it and performs list operations that are expensive on cloud storage systems.&#xA;&#xA;The common cloud storage systems are typically object stores that do not lay out the files in a contiguous manner based on paths. Therefore, it becomes very expensive to list out all the files in a particular path. Yet, these list operations are executed for every partition that could be included in a query, regardless of only a single row, in a single file out of thousands of files needing to be retrieved to answer the query. Even ignoring the performance costs for a minute, object stores may also pose issues for Hive due to eventual  consistency. Inserting and deleting can cause inconsistent results for readers, if the files you end up reading are out of date. &#xA;&#xA;Iceberg avoids all of these issues by tracking the data at the file level, &#xA;rather than the partition level. By tracking the files, Iceberg only accesses the files containing data relevant to the query, as opposed to accessing files in the same partition looking for the few files that are relevant to the query. Further, this allows Iceberg to control for the inconsistency issue in cloud-based file systems by using a locking mechanism at the file level. See the file layout below that Hive layout versus the Iceberg layout. 
As you can see in the next image, Iceberg makes no assumptions about the data being contiguous or not. It simply builds a persistent tree using the snapshot (S) location stored in the metadata, that points to the manifest list (ML), which points to &#xA;manifests containing partitions (P). Finally, these manifest files contain the file (F) locations and stats that can quickly be used to prune data versus &#xA;needing to do a list operation and scanning all the files.&#xA;&#xA;Referencing the picture above, if you were to run a query where the result set only contains rows from file F1, Hive would require a list operation and scanning the files, F2 and F3. In Iceberg, file metadata exists in the manifest file, P1, that would have a range on the predicate field that prunes out files F2 and F3, and only scans file F1. This example only shows a couple of files, but imagine storage that scales up to thousands of files! Listing becomes expensive on files that are not contiguously stored in memory. Having this flexibility in the logical layout is essential to increase query performance. This is especially true on cloud object stores.&#xA;&#xA;If you want to play around with Iceberg using Trino, check out the &#xA;Trino Iceberg docs. To avoid issues like the eventual consistency issue, as well as other problems of trying to sync operations across systems, Iceberg provides optimistic concurrency support, which is covered in more detail in&#xA;the next post. &#xA;&#xA;#trino #iceberg&#xA;&#xA;bits_&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p><a href="https://bitsondata.dev/trino-on-ice-i-a-gentle-introduction-to-iceberg">The first post</a> covered how Iceberg is a table format and not a file format. It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.” I just thought that sounded better than not-hidden partitioning. If any of that wasn’t clear, I recommend either that you stop reading now, or that you go back to the first post before starting this one. This post discusses evolution. No, not Darwinian or Pokémon evolution, but in-place table evolution!</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p><img src="https://trino.io/assets/blog/trino-on-ice/evolution.gif" alt=""/></p>

<p>You may find it a little odd that I am getting excited over tables evolving in-place, but as mentioned in the last post, if you have experience performing table evolution in Hive, you’d be as happy as Ash Ketchum when Charmander evolved into Charmeleon upon discovering that Iceberg supports partition evolution and schema evolution. That is, until Charmeleon started treating Ash like a jerk after evolving from Charmander. Hopefully, you won’t face the same issue when your tables evolve.</p>

<p>Another important aspect covered here is how Iceberg is developed with cloud storage in mind. Hive and other data lake technologies were developed with file systems as their primary storage layer. File systems are still very common today, but as more companies moved to object storage, table formats did not adapt to the needs of object stores. Let’s dive in!</p>

<h2 id="partition-specification-evolution">Partition specification evolution</h2>

<p>In Iceberg, you are able to update the partition specification, shortened to partition spec in Iceberg, on a live table. You do not need to perform a table migration as you do in Hive. In Hive, partition specs don’t explicitly exist because they are tightly coupled with the creation of the Hive table. Meaning, if you ever need to change the granularity of your data partitions at any point, you need to create an entirely new table, and move all the data to the new partition granularity you desire. No pressure on choosing the right granularity or anything!</p>

<p>In Iceberg, you’re not required to choose the perfect partition specification upfront, and you can have multiple partition specs in the same table, and query across the different sized partition specs. How great is that! This means, if you’re initially partitioning your data by month, and later you decide to move to a daily partitioning spec due to a growing ingest from all your new customers, you can do so with no migration, and query over the table with no issue.</p>

<p>This is conveyed pretty succinctly in this graphic from the Iceberg
documentation. Through the end of 2008, partitioning occurs at a monthly granularity, and starting in 2009 it moves to a daily granularity. When a query pulls data from between December 14th, 2008 and January 13th, 2009, the entire month of December gets scanned due to the monthly partition, but for the dates in January, only the first 13 days are scanned to answer the query.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/partition-spec-evolution.png" alt=""/></p>

<p>At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist. <a href="https://github.com/trinodb/trino/issues/7580">There are efforts to add this support in the near future</a>. Edit: this has since been merged!</p>
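<p>With that write support in place, changing the partition spec of a live table is a metadata-only operation. As a sketch of what this looks like in recent Trino releases, assuming a table that was initially partitioned by <code>month(event_time)</code> and should now be partitioned daily:</p>

<pre><code>ALTER TABLE iceberg.logging.events
SET PROPERTIES partitioning = ARRAY['day(event_time)'];
</code></pre>

<p>Existing data files keep the old monthly spec; only newly written data uses the daily spec, and queries span both transparently.</p>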

<h2 id="schema-evolution">Schema evolution</h2>

<p>Iceberg also handles schema evolution much more elegantly than Hive. In Hive, adding columns works well enough, as data inserted before the schema change simply reports null for that column. For formats that use column names, like ORC and Parquet, deletes are also straightforward for Hive, as it simply ignores fields that are no longer part of the table. For files like CSV that rely on column position, deletes still cause issues, as deleting one column shifts the rest of the columns. Renames pose an issue for all formats in Hive, as data written prior to the rename is not migrated to the new field name. This effectively works the same as if you deleted the old field and added a new column with the new name. This patchy support for schema evolution across file types in Hive requires a lot of memorizing which formats sit underneath which tables, and it is very susceptible to user error if someone executes an unsupported operation on the wrong table.</p>

<table>
<thead>
  <tr>
    <th colspan="4">Hive 2.2.0 schema evolution based on file type and operation.</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td></td>
    <td>Add</td>
    <td>Delete</td>
    <td>Rename</td>
  </tr>
  <tr>
    <td>CSV/TSV</td>
    <td>✅</td>
    <td>❌</td>
    <td>❌</td>
  </tr>
  <tr>
    <td>JSON</td>
    <td>✅</td>
    <td>✅</td>
    <td>❌</td>
  </tr>
  <tr>
    <td>ORC/Parquet/Avro</td>
    <td>✅</td>
    <td>✅</td>
    <td>❌</td>
  </tr>
</tbody>
</table>

<p>Currently in Iceberg, schemaless position-based data formats such as CSV and TSV are not supported, though there are <a href="https://github.com/apache/iceberg/issues/118">some discussions on adding limited support for them</a>. This would be useful from a reading standpoint, to load data from CSV into an Iceberg format with all the guarantees that Iceberg offers.</p>

<p>While JSON doesn’t rely on positional data, it does have an explicit dependency on names. This means that if I remove a text column named <code>severity</code> from a JSON table, then later add a new int column called <code>severity</code>, I encounter an error when I try to deserialize older JSON files that still contain the string values. Even worse is if the new <code>severity</code> column has the same type as the original but a semantically different meaning. This results in old rows containing values that are unknowingly from a different domain, which can lead to wrong analytics. After all, someone who adds the new <code>severity</code> column might not even be aware of the old <code>severity</code> column if it was dropped quite some time ago.</p>

<p>ORC, Parquet, and Avro do not suffer from these issues as they are columnar formats that keep a schema internal to the file itself, and each format tracks changes to the columns through IDs rather than name values or position. Iceberg uses these unique column IDs to also keep track of the columns as changes are applied.</p>

<p>In general, Iceberg can only allow this small set of file formats due to the <a href="https://iceberg.apache.org/evolution/#correctness">correctness guarantees</a> it provides. In Trino, you can add, delete, or rename columns using the <code>ALTER TABLE</code> command. Here’s an example that continues from the table created in the last post, which inserted three rows. The DDL statement looked like this:</p>

<pre><code>CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6), 
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &#39;ORC&#39;,
  partitioning = ARRAY[&#39;day(event_time)&#39;]
);
</code></pre>

<p>Here is an <code>ALTER TABLE</code> sequence that adds a new column named <code>severity</code>, inserts data including into the new column, renames the column, and prints the data.</p>

<pre><code>ALTER TABLE iceberg.logging.events ADD COLUMN severity INTEGER; 

INSERT INTO iceberg.logging.events VALUES 
(
  &#39;INFO&#39;, 
  timestamp 
  &#39;2021-04-01 19:59:59.999999&#39; AT TIME ZONE &#39;America/Los_Angeles&#39;, 
  &#39;es muy bueno&#39;, 
  ARRAY [&#39;It is all normal&#39;], 
  1
);

ALTER TABLE iceberg.logging.events RENAME COLUMN severity TO priority;

SELECT level, message, priority
FROM iceberg.logging.events;
</code></pre>

<p>Result:</p>

<table>
<thead>
<tr>
<th>level</th>
<th>message</th>
<th>priority</th>
</tr>
</thead>

<tbody>
<tr>
<td>ERROR</td>
<td>Double oh noes</td>
<td>NULL</td>
</tr>

<tr>
<td>WARN</td>
<td>Maybeh oh noes?</td>
<td>NULL</td>
</tr>

<tr>
<td>ERROR</td>
<td>Oh noes</td>
<td>NULL</td>
</tr>

<tr>
<td>INFO</td>
<td>es muy bueno</td>
<td>1</td>
</tr>
</tbody>
</table>

<pre><code>ALTER TABLE iceberg.logging.events 
DROP COLUMN priority;

SHOW CREATE TABLE iceberg.logging.events;
</code></pre>

<p>Result</p>

<pre><code>CREATE TABLE iceberg.logging.events (
   level varchar,
   event_time timestamp(6),
   message varchar,
   call_stack array(varchar)
)
WITH (
   format = &#39;ORC&#39;,
   partitioning = ARRAY[&#39;day(event_time)&#39;]
)
</code></pre>

<p>Notice how the priority and severity columns are both not present in the schema. As noted in the table above, Hive renames cause issues for all file formats. Yet in Iceberg, performing all these operations causes no issues with the table and underlying data.</p>

<h2 id="cloud-storage-compatibility">Cloud storage compatibility</h2>

<p>Not all developers consider, or are even aware of, the performance implications of using Hive over a cloud object storage solution like S3 or Azure Blob Storage. One thing to remember is that Hive was developed with the Hadoop Distributed File System (HDFS) in mind. HDFS is a filesystem and is particularly well suited to listing files, because they are stored in a contiguous manner. When Hive stores data associated with a table, it assumes there is a contiguous layout underneath it and performs list operations that are expensive on cloud storage systems.</p>

<p>The common cloud storage systems are typically object stores that do not lay out files in a contiguous manner based on paths. Therefore, it becomes very expensive to list out all the files under a particular path. Yet these list operations are executed for every partition that could be included in a query, even if only a single row, in a single file out of thousands, needs to be retrieved to answer the query. Even ignoring the performance costs for a minute, object stores may also pose issues for Hive due to eventual consistency. Inserting and deleting can cause inconsistent results for readers if the files you end up reading are out of date.</p>

<p>Iceberg avoids all of these issues by tracking the data at the file level,
rather than the partition level. By tracking the files, Iceberg only accesses the files containing data relevant to the query, as opposed to scanning every file in the same partition looking for the few that are relevant. Further, this allows Iceberg to control for the inconsistency issue in cloud-based storage by using a locking mechanism on the table metadata. See the file layout below comparing the Hive layout to the Iceberg layout. As you can see in the next image, Iceberg makes no assumptions about the data being contiguous. It simply builds a persistent tree using the snapshot (S) location stored in the metadata, which points to the manifest list (ML), which points to
manifests containing partitions (P). Finally, these manifest files contain the file (F) locations and stats that can quickly be used to prune data, versus
needing to do a list operation and scan all the files.</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/cloud-file-layout.png" alt=""/></p>

<p>Referencing the picture above, if you were to run a query whose result set only contains rows from file F1, Hive would require a list operation and a scan of files F2 and F3 as well. In Iceberg, file metadata exists in the manifest file, P1, which holds value ranges for the predicate field that prune out files F2 and F3, so only file F1 is scanned. This example only shows a couple of files, but imagine storage that scales up to thousands of files! Listing becomes expensive when files are not stored contiguously. Having this flexibility in the logical layout is essential to increasing query performance, especially on cloud object stores.</p>

<p>If you want to play around with Iceberg using Trino, check out the
<a href="https://trino.io/docs/current/connector/iceberg.html">Trino Iceberg docs</a>. To avoid eventual consistency problems, as well as other issues with trying to sync operations across systems, Iceberg provides optimistic concurrency support, which is covered in more detail in
<a href="https://bitsondata.dev/iceberg-concurrency-snapshots-spec">the next post</a>.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud</guid>
      <pubDate>Mon, 12 Jul 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Trino on ice I: A gentle introduction To Iceberg</title>
      <link>https://bitsondata.dev/trino-iceberg-i-gentle-intro?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[&#xA;&#xA;Back in the Gentle introduction to the Hive connector blog post, I discussed a commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, the connector is named Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem - the invisible Hive specification.&#xA;&#xA;!--more--&#xA;&#xA;---&#xA;&#xA;Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:&#xA;&#xA;Trino on ice I: A gentle introduction to Iceberg&#xA;Trino on ice II: In-place table evolution and cloud compatibility with Iceberg&#xA;Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec&#xA;Trino on ice IV: Deep dive into Iceberg internals&#xA;&#xA;---&#xA;&#xA;I call this specification invisible because it doesn’t exist. It lives in the Hive code and the minds of those who developed it. This is makes it very difficult for anybody else who has to integrate with any distributed object storage that uses Hive, since they had to rely on reverse engineering and keeping up with the changes. The way you interact with Hive changes based on which version of Hive or Hadoop you are running. It also varies if you are in the cloud or over an object store. Spark has even modified the Hive spec in some ways to fit the Hive model to their use cases. It’s a big mess that data engineers have put up with for years. Yet despite the confusion and lack of organization due to Hive’s number of unwritten assumptions, the Hive connector is the most popular connector in use for Trino. Virtually every big data query engine uses the Hive model today in some form. 
As a result it is used by numerous companies to store and access data in their data lakes.&#xA;&#xA;So how did something with no specification become so ubiquitous in data lakes? Hive was first in the large object storage and big data world as part of Hadoop. Hadoop became popular from good marketing for Hadoop to solve the problems of dealing with the increase in data with the Web 2.0 boom . Of course, Hive didn’t get everything wrong. In fact, without Hive, and the fact that it is open source, there may not have been a unified specification at all. Despite the many hours data engineers have spent bashing their heads against the wall with all the unintended consequences of Hive, it still served a very useful purpose.&#xA;&#xA;So why did I just rant about Hive for so long if I’m here to tell you about Apache Iceberg? It’s impossible for a teenager growing up today to truly appreciate music streaming services without knowing what it was like to have an iPod with limited storage, or listening to a scratched burnt CD that skips, or flipping your tape or record to side-B. The same way anyone born before the turn of the millennium really appreciates streaming services, so you too will appreciate Iceberg once you’ve learned the intricacies of managing a data lake built on Hive and Hadoop.&#xA;&#xA;If you haven’t used Hive before, this blog post outlines just a few pain points that come from this data warehousing software to give you proper context. If you have already lived through these headaches, this post acts as a guide to Iceberg from Hive. This post is the first in a series of blog posts discussing Apache Iceberg in great detail, through the lens of the Trino query engine user. If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that houses the founding Presto community after the founders of Presto left Facebook. 
This and the next couple of posts discuss the Iceberg specification and all the features Iceberg has to offer, many times in comparison with Hive.&#xA;&#xA;Before jumping into the comparisons, what is Iceberg exactly? The first thing to understand is that Iceberg is not a file format, but a table format. It may not be clear what this means by just stating that, but the function of a table format becomes clearer as the improvements Iceberg brings from the Hive table standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet, but is the layer between the query engine and the data. Iceberg maps and indexes the files in order to provide a higher level abstraction that handles the relational table format for data lakes. You will understand more about table formats through examples in this series.&#xA;&#xA;Hidden Partitions&#xA;&#xA;Hive Partitions&#xA;&#xA;Since most developers and users interact with the table format via the query language, a noticeable difference is the flexibility you have while creating a partitioned table. Assume you are trying to create a table for tracking events occurring in our system. You run both sets of SQL commands from Trino, just using the Hive and Iceberg connectors which are designated by the catalog name (i.e. the catalog name starting with hive. uses the Hive connector, while the iceberg. table uses the Iceberg connector). To begin with, the first DDL statement attempts to create an events table in the logging schema in the hive catalog, which is configured to use the Hive connector. 
Trino also creates a partition on the events table using the eventtime field which is a TIMESTAMP field.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtime&#39;]&#xA;);&#xA;&#xA;Running this in Trino using the Hive connector produces the following error message.&#xA;&#xA;Partition keys must be the last columns in the table and in the same order as the table properties: [eventtime]&#xA;&#xA;The Hive DDL is very dependent on ordering for columns and specifically partition columns. Partition fields must be located in the final column positions and in the order of partitioning in the DDL statement. The next statement attempts to create the same table, but now with the eventtime field moved to the last column position.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR),&#xA;  eventtime TIMESTAMP&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtime&#39;]&#xA;);&#xA;&#xA;This time, the DDL command works successfully, but you likely don’t want to partition your data on the plain timestamp. This results in a separate file for each distinct timestamp value in your table (likely almost a file for each event). In Hive, there’s no way to indicate the time granularity at which you want to partition natively. 
The method to support this scenario with Hive is to create a new VARCHAR column, eventtimeday that is dependent on the eventtime column to create the date partition value.&#xA;&#xA;CREATE TABLE hive.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP,&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR),&#xA;  eventtimeday VARCHAR&#xA;) WITH (&#xA;  format = &#39;ORC&#39;,&#xA;  partitionedby = ARRAY[&#39;eventtimeday&#39;]&#xA;);&#xA;&#xA;This method wastes space by adding a new column to your table. Even worse, it puts the burden of knowledge on the user to include this new column for writing data. It is then necessary to use that separate column for any read access to take advantage of the performance gains from the partitioning.&#xA;&#xA;INSERT INTO hive.logging.events&#xA;VALUES&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-01 12:00:00.000001&#39;,&#xA;  &#39;Oh noes&#39;, &#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;], &#xA;  &#39;2021-04-01&#39;&#xA;),&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-02 15:55:55.555555&#39;,&#xA;  &#39;Double oh noes&#39;,&#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],&#xA;  &#39;2021-04-02&#39;&#xA;),&#xA;(&#xA;  &#39;WARN&#39;, &#xA;  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,&#xA;  &#39;Maybeh oh noes?&#39;,&#xA;  ARRAY [&#39;Bad things could be happening??&#39;], &#xA;  &#39;2021-04-02&#39;&#xA;);&#xA;&#xA;Notice that the last partition value &#39;2021-04-01&#39; has to match the TIMESTAMP date during insertion. 
There is no validation in Hive to make sure this is happening because it only requires a VARCHAR and knows to partition based on different values.&#xA;&#xA;On the other hand, If a user runs the following query:&#xA;&#xA;SELECT &#xA;FROM hive.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39;;&#xA;&#xA;they get the correct results back, but have to scan all the data in the table:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;This happens because the user forgot to include the eventtimeday &lt; &#39;2021-04-02&#39; predicate in the WHERE clause. This eliminates all the benefits that led us to create the partition in the first place and yet frequently this is missed by the users of these tables.&#xA;&#xA;SELECT &#xA;FROM hive.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39; &#xA;AND eventtimeday &lt; &#39;2021-04-02&#39;;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;Iceberg Partitions&#xA;&#xA;The following DDL statement illustrates how these issues are handled in Iceberg via the Trino Iceberg connector.&#xA;&#xA;CREATE TABLE iceberg.logging.events (&#xA;  level VARCHAR,&#xA;  eventtime TIMESTAMP(6),&#xA;  message VARCHAR,&#xA;  callstack ARRAY(VARCHAR)&#xA;) WITH (&#xA;  partitioning = ARRAY[&#39;day(eventtime)&#39;]&#xA;);&#xA;&#xA;Taking note of a few things. First, notice the partition on the eventtime column that is defined without having to move it to the last position. There is also no need to create a separate field to handle the daily partition on the eventtime field. 
The partition specification is maintained internally by Iceberg, and neither the user nor the reader of this table needs to know anything about the partition specification to take advantage of it. This concept is called hidden partitioning , where only the table creator/maintainer has to know the partitioning specification. Here is what the insert statements look like now:&#xA;&#xA;INSERT INTO iceberg.logging.events&#xA;VALUES&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-01 12:00:00.000001&#39;,&#xA;  &#39;Oh noes&#39;, &#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]&#xA;),&#xA;(&#xA;  &#39;ERROR&#39;,&#xA;  timestamp &#39;2021-04-02 15:55:55.555555&#39;,&#xA;  &#39;Double oh noes&#39;,&#xA;  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]),&#xA;(&#xA;  &#39;WARN&#39;, &#xA;  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,&#xA;  &#39;Maybeh oh noes?&#39;,&#xA;  ARRAY [&#39;Bad things could be happening??&#39;]&#xA;);&#xA;&#xA;The VARCHAR dates are no longer needed. The eventtime field is internally converted to the proper partition value to partition each row. Also, notice that the same query that ran in Hive returns the same results. The big difference is that it doesn’t require any extra clause to indicate to filter partition as well as filter the results.&#xA;&#xA;SELECT *&#xA;FROM iceberg.logging.events&#xA;WHERE eventtime &lt; timestamp &#39;2021-04-02&#39;;&#xA;&#xA;Result:&#xA;&#xA;table&#xA;trthlevel/ththeventtime/ththmessage/ththcallstack/th/tr&#xA;trtdERROR/tdtd2021-04-01 12:00:00/tdtdOh noes/tdtdException in thread &#34;main&#34; java.lang.NullPointerException/td/tr&#xA;/table&#xA;&#xA;So hopefully that gives you a glimpse into what a table format and specification are, and why Iceberg is such a wonderful improvement over the existing and outdated method of storing your data in your data lake. 
While this post covers a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…&#xA;&#xA;If you want to play around with Iceberg using Trino, check out the Trino Iceberg docs. The next post covers how table evolution works in Iceberg, as well as, how Iceberg is an improved storage format for cloud storage.&#xA;&#xA;#trino #iceberg&#xA;&#xA;bits&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://trino.io/assets/blog/trino-on-ice/trino-iceberg.png" alt=""/></p>

<p>Back in the <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector">Gentle introduction to the Hive connector</a> blog post, I discussed the commonly misunderstood architecture and uses of the Trino Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries. Instead, it is named the Hive connector because it relies on Hive conventions and implementation details from the Hadoop ecosystem – the invisible Hive specification.</p>



<hr/>

<p>Trino on ice is a series, covering the details around how the Iceberg table format works with the Trino query engine. It’s recommended to read the posts sequentially as the examples build on previous posts in this series:</p>
<ul><li><a href="https://bitsondata.dev/trino-iceberg-i-gentle-intro">Trino on ice I: A gentle introduction to Iceberg</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-ii-table-evolution-cloud">Trino on ice II: In-place table evolution and cloud compatibility with Iceberg</a></li>
<li><a href="https://write.as/bitsondatadev/trino-iceberg-iii-concurrency-snapshots-spec">Trino on ice III: Iceberg concurrency model, snapshots, and the Iceberg spec</a></li>
<li><a href="https://bitsondata.dev/trino-iceberg-iv-deep-dive">Trino on ice IV: Deep dive into Iceberg internals</a></li></ul>

<hr/>

<p>I call this specification invisible because it doesn’t exist. It lives in the Hive code and the minds of those who developed it. This makes it very difficult for anybody else who has to integrate with any distributed object storage that uses Hive, since they have to rely on reverse engineering and keeping up with changes. The way you interact with Hive changes based on <a href="https://medium.com/hashmapinc/four-steps-for-migrating-from-hive-2-x-to-3-x-e85a8363a18">which version of Hive or Hadoop</a> you are running. It also varies if you are in the cloud or over an object store. Spark has even <a href="https://spark.apache.org/docs/2.4.4/sql-migration-guide-hive-compatibility.html">modified the Hive spec</a> in some ways to fit the Hive model to its use cases. It’s a big mess that data engineers have put up with for years. Yet despite the confusion and lack of organization due to Hive’s many unwritten assumptions, the Hive connector is the most popular connector in use for Trino. Virtually every big data query engine uses the Hive model today in some form. As a result, it is used by numerous companies to store and access data in their data lakes.</p>

<p>So how did something with no specification become so ubiquitous in data lakes? Hive was first in the large object storage and big data world as part of Hadoop. Hadoop became popular through good marketing that positioned it to solve the problems of dealing with the increase in data from the Web 2.0 boom. Of course, Hive didn’t get everything wrong. In fact, without Hive, and the fact that it is open source, there may not have been a unified specification at all. Despite the many hours data engineers have spent bashing their heads against the wall with all the unintended consequences of Hive, it still served a very useful purpose.</p>

<p>So why did I just rant about Hive for so long if I’m here to tell you about <a href="https://iceberg.apache.org/">Apache Iceberg</a>? It’s impossible for a teenager growing up today to truly appreciate music streaming services without knowing what it was like to have an iPod with limited storage, or listening to a scratched burnt CD that skips, or flipping your tape or record to side-B. The same way anyone born before the turn of the millennium really appreciates streaming services, so you too will appreciate Iceberg once you’ve learned the intricacies of managing a data lake built on Hive and Hadoop.</p>

<p>If you haven’t used Hive before, this blog post outlines just a few pain points that come from this data warehousing software to give you proper context. If you have already lived through these headaches, this post acts as a guide to Iceberg from Hive. This post is the first in a series of blog posts discussing Apache Iceberg in great detail, through the lens of the Trino query engine user. If you’re not aware of Trino (formerly PrestoSQL) yet, it is the project that houses the founding Presto community after the <a href="https://trino.io/blog/2020/12/27/announcing-trino">founders of Presto left Facebook</a>. This and the next couple of posts discuss the Iceberg specification and all the features Iceberg has to offer, many times in comparison with Hive.</p>

<p>Before jumping into the comparisons, what is Iceberg exactly? The first thing to understand is that Iceberg is not a file format, but a table format. It may not be clear what this means by just stating that, but the function of a table format becomes clearer as the improvements Iceberg brings from the Hive table standard materialize. Iceberg doesn’t replace file formats like ORC and Parquet, but is the layer between the query engine and the data. Iceberg maps and indexes the files in order to provide a higher level abstraction that handles the relational table format for data lakes. You will understand more about table formats through examples in this series.</p>

<h2 id="hidden-partitions">Hidden Partitions</h2>

<h3 id="hive-partitions">Hive Partitions</h3>

<p>Since most developers and users interact with the table format via the query language, a noticeable difference is the flexibility you have while creating a partitioned table. Assume you are trying to create a table for tracking events occurring in your system. You run both sets of SQL commands from Trino, using the Hive and Iceberg connectors, which are designated by the catalog name (i.e. a catalog name starting with <code>hive.</code> uses the Hive connector, while <code>iceberg.</code> uses the Iceberg connector). To begin with, the first DDL statement attempts to create an <code>events</code> table in the <code>logging</code> schema in the <code>hive</code> catalog, which is configured to use the Hive connector. Trino also creates a partition on the <code>events</code> table using the <code>event_time</code> field, which is a <code>TIMESTAMP</code> field.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time&#39;]
);
</code></pre>

<p>Running this in Trino using the Hive connector produces the following error message.</p>

<pre><code>Partition keys must be the last columns in the table and in the same order as the table properties: [event_time]
</code></pre>

<p>The Hive DDL is very dependent on ordering for columns and specifically partition columns. Partition fields must be located in the final column positions and in the order of partitioning in the DDL statement. The next statement attempts to create the same table, but now with the <code>event_time</code> field moved to the last column position.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time TIMESTAMP
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time&#39;]
);
</code></pre>

<p>This time, the DDL command works successfully, but you likely don’t want to partition your data on the plain timestamp. This results in a separate file for each distinct timestamp value in your table (likely almost a file for each event). In Hive, there’s no way to indicate the time granularity at which you want to partition natively. The method to support this scenario with Hive is to create a new <code>VARCHAR</code> column, <code>event_time_day</code> that is dependent on the <code>event_time</code> column to create the date partition value.</p>

<pre><code>CREATE TABLE hive.logging.events (
  level VARCHAR,
  event_time TIMESTAMP,
  message VARCHAR,
  call_stack ARRAY(VARCHAR),
  event_time_day VARCHAR
) WITH (
  format = &#39;ORC&#39;,
  partitioned_by = ARRAY[&#39;event_time_day&#39;]
);
</code></pre>

<p>This method wastes space by adding a new column to your table. Even worse, it puts the burden of knowledge on the user to include this new column for writing data. It is then necessary to use that separate column for any read access to take advantage of the performance gains from the partitioning.</p>

<pre><code>INSERT INTO hive.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Oh noes&#39;, 
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;], 
  &#39;2021-04-01&#39;
),
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-02 15:55:55.555555&#39;,
  &#39;Double oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],
  &#39;2021-04-02&#39;
),
(
  &#39;WARN&#39;, 
  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,
  &#39;Maybeh oh noes?&#39;,
  ARRAY [&#39;Bad things could be happening??&#39;], 
  &#39;2021-04-02&#39;
);
</code></pre>

<p>Notice that the last partition value <code>&#39;2021-04-01&#39;</code> has to match the date of the <code>TIMESTAMP</code> during insertion. There is no validation in Hive to make sure this happens, because Hive only sees a <code>VARCHAR</code> column and simply partitions on its distinct values.</p>
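<p>To make the lack of validation concrete, here is a hypothetical insert where the partition value disagrees with the timestamp; Hive accepts it without any error and files the row under the wrong day:</p>

<pre><code>INSERT INTO hive.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-03 08:30:00.000000&#39;,
  &#39;Misfiled oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;],
  &#39;2021-04-07&#39; -- wrong partition: does not match the timestamp above
);
</code></pre>

<p>A later query that filters on <code>event_time_day</code> would silently skip this row, even though its <code>event_time</code> falls inside the requested range.</p>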

<p>On the other hand, if a user runs the following query:</p>

<pre><code>SELECT *
FROM hive.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>they get the correct results back, but have to scan all the data in the table:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>

<p>This happens because the user forgot to include the <code>event_time_day &lt; &#39;2021-04-02&#39;</code> predicate in the <code>WHERE</code> clause. This eliminates all the benefits that led us to create the partition in the first place, and yet users of these tables frequently miss it.</p>

<pre><code>SELECT *
FROM hive.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39; 
AND event_time_day &lt; &#39;2021-04-02&#39;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>

<h3 id="iceberg-partitions">Iceberg Partitions</h3>

<p>The following DDL statement illustrates how these issues are handled in Iceberg via the Trino Iceberg connector.</p>

<pre><code>CREATE TABLE iceberg.logging.events (
  level VARCHAR,
  event_time TIMESTAMP(6),
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
) WITH (
  partitioning = ARRAY[&#39;day(event_time)&#39;]
);
</code></pre>

<p>There are a few things to note. First, notice the partition on the <code>event_time</code> column is defined without having to move it to the last position. There is also no need to create a separate field to handle the daily partition on the <code>event_time</code> field. The <em><strong>partition specification</strong></em> is maintained internally by Iceberg, and neither the writer nor the reader of this table needs to know anything about it to take advantage of it. This concept is called <em><strong>hidden partitioning</strong></em>, where only the table creator/maintainer has to know the partitioning specification. Here is what the insert statements look like now:</p>

<pre><code>INSERT INTO iceberg.logging.events
VALUES
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-01 12:00:00.000001&#39;,
  &#39;Oh noes&#39;, 
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]
),
(
  &#39;ERROR&#39;,
  timestamp &#39;2021-04-02 15:55:55.555555&#39;,
  &#39;Double oh noes&#39;,
  ARRAY [&#39;Exception in thread &#34;main&#34; java.lang.NullPointerException&#39;]
),
(
  &#39;WARN&#39;, 
  timestamp &#39;2021-04-02 00:00:11.1122222&#39;,
  &#39;Maybeh oh noes?&#39;,
  ARRAY [&#39;Bad things could be happening??&#39;]
);
</code></pre>

<p>The <code>VARCHAR</code> dates are no longer needed. The <code>event_time</code> field is internally converted to the proper partition value to partition each row. Also, notice that the same query that ran in Hive returns the same results. The big difference is that it doesn’t require any extra predicate to filter on the partition in addition to filtering the results.</p>

<pre><code>SELECT *
FROM iceberg.logging.events
WHERE event_time &lt; timestamp &#39;2021-04-02&#39;;
</code></pre>

<p>Result:</p>

<table>
<tr><th>level</th><th>event_time</th><th>message</th><th>call_stack</th></tr>
<tr><td>ERROR</td><td>2021-04-01 12:00:00</td><td>Oh noes</td><td>Exception in thread &#34;main&#34; java.lang.NullPointerException</td></tr>
</table>
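<p>Because the partitioning is hidden, you may wonder how to confirm which day partitions Iceberg actually created. As a sketch, the Trino Iceberg connector exposes a <code>$partitions</code> metadata table that reports each partition along with its row and file counts:</p>

<pre><code>SELECT *
FROM iceberg.logging.&#34;events$partitions&#34;;
</code></pre>

<p>For the inserts above you would expect one partition for 2021-04-01 and one for 2021-04-02, derived from <code>day(event_time)</code> without any user-supplied column.</p>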

<p>So hopefully that gives you a glimpse into what a table format and specification are, and why Iceberg is such a wonderful improvement over the existing and outdated method of storing your data in your data lake. While this post covers a lot of aspects of Iceberg’s capabilities, this is just the tip of the Iceberg…</p>

<p><img src="https://trino.io/assets/blog/trino-on-ice/see_myself_out.gif" alt=""/></p>

<p>If you want to play around with Iceberg using Trino, check out the <a href="https://trino.io/docs/current/connector/iceberg.html">Trino Iceberg docs</a>. The <a href="https://bitsondata.dev/in-place-table-evolution-and-cloud-compatibility-with-iceberg">next post</a> covers how table evolution works in Iceberg, as well as how Iceberg is an improved storage format for cloud storage.</p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:iceberg" class="hashtag"><span>#</span><span class="p-category">iceberg</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/trino-iceberg-i-gentle-intro</guid>
      <pubDate>Mon, 03 May 2021 05:00:00 +0000</pubDate>
    </item>
    <item>
      <title>A gentle introduction to the Hive connector</title>
      <link>https://bitsondata.dev/a-gentle-introduction-to-the-hive-connector?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[TL;DR: The Hive connector is what you use in Trino for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.&#xA;&#xA;!--more--&#xA;&#xA;Originally Posted on https://trino.io/blog/2020/10/20/intro-to-hive-connector.html&#xA;&#xA;One of the most confusing aspects when starting Trino is the Hive connector. Typically, you seek out the use of Trino when you experience an intensely slow query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In fact, the genesis of Trino came about due to these slow Hive query conditions at Facebook back in 2012.&#xA;&#xA;So when you learn that Trino has a Hive connector, it can be rather confusing since you moved to Trino to circumvent the slowness of your current Hive cluster. Another common source of confusion is when you want to query your data from your cloud object storage, such as AWS S3, MinIO, and Google Cloud Storage. This too uses the Hive connector. If that confuses you, don’t worry, you are not alone. This blog aims to explain this commonly confusing nomenclature.&#xA;&#xA;Hive architecture&#xA;&#xA;To understand the origins and inner workings of Trino’s Hive connector, you first need to know a few high-level components of the Hive architecture.&#xA;&#xA;You can simplify the Hive architecture to four components:&#xA;&#xA;The runtime contains the logic of the query engine that translates the SQL -esque Hive Query Language(HQL) into MapReduce jobs that run over files stored in the filesystem.&#xA;&#xA;The storage component is simply that, it stores files in various formats and index structures to recall these files. The file formats can be anything as simple as JSON and CSV, to more complex files such as columnar formats like ORC and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed Filesystem (HDFS). 
As cloud-based options became more prevalent, object storage like Amazon S3, Azure Blob Storage, Google Cloud Storage, and others needed to be leveraged as well and replaced HDFS as the storage component.&#xA;&#xA;In order for Hive to process these files, it must have a mapping from SQL tables in the runtime to files and directories in the storage component. To accomplish this, Hive uses the Hive Metastore Service (HMS), often shortened to the metastore to manage the metadata about the files such as table columns, file locations, file formats, etc…&#xA;&#xA;The last component not included in the image is Hive’s data organization specification. The documentation of this element only exists in the code in Hive and has been reverse engineered to be used by other systems like Trino to remain compatible with other systems.&#xA;&#xA;Trino reuses all of these components except for the runtime. This is the same approach most compute engine takes when dealing with data in object stores, specifically, Trino, Spark, Drill, and Impala. When you think of the Hive connector, you should think about a connector that is capable of reading data organized by the unwritten Hive specification.&#xA;&#xA;Trino runtime replaces Hive runtime&#xA;&#xA;In the early days of big data systems, many expected query turnaround to take a long time due to the high volume of unstructured data in ETL workloads. The primary goal in early iterations of these systems was simply throughput over large volumes of data while maintaining fault-tolerance. Now, more businesses want to run fast interactive queries over their big data instead of running jobs that take hours and produce possibly undesirable results. Many companies have petabytes of data and metadata in their data warehouse. Data in storage is cumbersome to move and the data in the metastore takes a long time to repopulate in other formats. 
Since only the runtime that executes Hive queries needs replacement, the Trino engine utilizes the existing metastore metadata and the files residing in storage; the Trino runtime effectively replaces the Hive runtime responsible for analyzing the data.&#xA;&#xA;Trino Architecture&#xA;&#xA;The Hive connector nomenclature&#xA;&#xA;Notice that the only change in the Trino architecture is the runtime. The HMS still exists, along with the storage. This is not by accident: this design addresses a common problem faced by many companies by simplifying the migration from Hive to Trino. Regardless of the storage component used, the runtime makes use of the HMS, and that is why this connector is called the Hive connector.&#xA;&#xA;Where the confusion tends to come from is when you search for a connector from the context of the storage system you want to query. You may not even be aware that the metastore exists, let alone that it is a necessity. Typically, you look for an S3 connector, a GCS connector, or a MinIO connector. All you need is the Hive connector and the HMS to manage the metadata of the objects in your storage.&#xA;&#xA;The Hive Metastore Service&#xA;&#xA;The HMS is the only Hive process used in the entire Trino ecosystem when using the Hive connector. The HMS is actually a simple service with a binary API using the Thrift protocol. This service updates the metadata, which is stored in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There are also compatible drop-in replacements for the HMS, such as AWS Glue.&#xA;&#xA;https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio&#xA;&#xA;Getting started with the Hive Connector on Trino&#xA;&#xA;To drive this point home, I created a tutorial that showcases using Trino and looking at the metadata it produces. 
In the following scenario, the Docker environment contains four containers:&#xA;&#xA;trino - the runtime in this scenario that replaces Hive.&#xA;minio - the storage, an open-source object storage server.&#xA;hive-metastore - the metastore service instance.&#xA;mariadb - the database that the metastore uses to store the metadata.&#xA;&#xA;You can play around with the system and optionally view the configurations. The scenario asks you to run a query to populate data in MinIO and then see the resulting metadata populated in MariaDB by the HMS. The next step asks you to run queries over the mariadb database, which holds the metadata generated by the metastore.&#xA;&#xA;If you have any questions or run into any issues with the example, you can find us on Slack in the #dev or #general channels.&#xA;&#xA;Have fun!&#xA;&#xA;https://trino.io/assets/blog/intro-to-hive-connector/intro-to-hive.jpeg&#xA;&#xA;https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio&#xA;&#xA;#trino #hive&#xA;&#xA;bits&#xA;&#xA;!--emailsub--]]&gt;</description>
      <content:encoded><![CDATA[<p>TL;DR: The Hive connector is what you use in Trino for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.</p>




<p>Originally Posted on <a href="https://trino.io/blog/2020/10/20/intro-to-hive-connector.html">https://trino.io/blog/2020/10/20/intro-to-hive-connector.html</a></p>

<p>One of the most confusing aspects when starting with Trino is the Hive connector. Typically, you seek out Trino when you experience painfully slow query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In fact, the genesis of Trino came about due to these slow Hive query conditions at Facebook back in 2012.</p>

<p>So when you learn that Trino has a Hive connector, it can be rather confusing, since you moved to Trino to circumvent the slowness of your current Hive cluster. Another common source of confusion arises when you want to query data in your cloud object storage, such as AWS S3, MinIO, or Google Cloud Storage: this, too, uses the Hive connector. If that confuses you, don’t worry, you are not alone. This blog aims to explain this commonly confusing nomenclature.</p>

<h3 id="hive-architecture">Hive architecture</h3>

<p>To understand the origins and inner workings of Trino’s Hive connector, you first need to know a few high-level components of the Hive architecture.</p>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/hive.png" alt=""/></p>

<p>You can simplify the Hive architecture into four components:</p>

<p><em>The runtime</em> contains the logic of the query engine, translating the SQL-esque Hive Query Language (HQL) into MapReduce jobs that run over files stored in the filesystem.</p>

<p><em>The storage</em> component is exactly that: it stores files in various formats, along with index structures to recall those files. The file formats range from simple ones like JSON and CSV to more complex columnar formats like ORC and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed Filesystem (HDFS). As cloud-based options became more prevalent, object storage such as Amazon S3, Azure Blob Storage, and Google Cloud Storage replaced HDFS as the storage component.</p>

<p>In order for Hive to process these files, it must have a mapping from SQL tables in <em>the runtime</em> to files and directories in <em>the storage</em> component. To accomplish this, Hive uses the Hive Metastore Service (HMS), often shortened to <em>the metastore</em>, to manage metadata about the files, such as table columns, file locations, and file formats.</p>
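<p>To make that mapping concrete, here is a sketch of a table definition issued through Trino against a Hive-connector catalog; the catalog, schema, and bucket names are hypothetical. The metastore records the columns, file format, and storage location so the runtime can resolve the SQL table to files:</p>

```sql
-- Hypothetical Trino DDL for a catalog backed by the Hive connector.
-- The HMS stores this schema and location mapping; no Hive runtime is involved.
CREATE TABLE hive.sales.orders (
  order_id   bigint,
  order_date date,
  total      double
)
WITH (
  format = 'ORC',
  external_location = 's3a://example-bucket/sales/orders/'
);
```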

<p>The last component, not included in the image, is Hive’s <em>data organization specification</em>. This specification is documented only in Hive’s code and has been reverse engineered by other systems, such as Trino, to remain compatible with Hive.</p>
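<p>Although unwritten, the core of that specification shows up in how files are laid out: a partitioned table encodes its partition keys directly in directory names. A sketch, with hypothetical table and bucket names, of how that looks through the Hive connector:</p>

```sql
-- A table partitioned by year and month (partition columns come last):
CREATE TABLE hive.sales.orders_by_month (
  order_id bigint,
  total    double,
  year     varchar,
  month    varchar
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['year', 'month'],
  external_location = 's3a://example-bucket/sales/orders_by_month/'
);

-- Resulting directory layout that any Hive-compatible engine can read:
--   s3a://example-bucket/sales/orders_by_month/year=2020/month=10/...
--   s3a://example-bucket/sales/orders_by_month/year=2020/month=11/...
```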

<p>Trino reuses all of these components except for <em>the runtime</em>. This is the same approach most compute engines take when dealing with data in object stores, including Spark, Drill, and Impala. When you think of the Hive connector, think of a connector that is capable of reading data organized by the unwritten Hive specification.</p>

<h3 id="trino-runtime-replaces-hive-runtime">Trino runtime replaces Hive runtime</h3>

<p>In the early days of big data systems, long query turnaround was expected due to the high volume of unstructured data in ETL workloads. The primary goal of early iterations of these systems was throughput over large volumes of data while maintaining fault tolerance. Now, more businesses want to run fast, interactive queries over their big data instead of running jobs that take hours and may produce undesirable results. Many companies have petabytes of data and metadata in their data warehouses. Data in storage is cumbersome to move, and the data in the metastore takes a long time to repopulate in other formats. Since only the runtime that executes Hive queries needs replacement, the Trino engine utilizes the existing metastore metadata and the files residing in storage; the Trino runtime effectively replaces the Hive runtime responsible for analyzing the data.</p>

<h3 id="trino-architecture">Trino Architecture</h3>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/trino.png" alt=""/></p>

<h3 id="the-hive-connector-nomenclature">The Hive connector nomenclature</h3>

<p>Notice that the only change in the Trino architecture is <em>the runtime</em>. The HMS still exists, along with <em>the storage</em>. This is not by accident: this design addresses a common problem faced by many companies by simplifying the migration from Hive to Trino. Regardless of <em>the storage</em> component used, <em>the runtime</em> makes use of the HMS, and that is why this connector is called the Hive connector.</p>

<p>Where the confusion tends to come from is when you search for a connector from the context of the storage system you want to query. You may not even be aware that <em>the metastore</em> exists, let alone that it is a necessity. Typically, you look for an S3 connector, a GCS connector, or a MinIO connector. All you need is the Hive connector and the HMS to manage the metadata of the objects in your storage.</p>

<h3 id="the-hive-metastore-service">The Hive Metastore Service</h3>

<p>The HMS is the only Hive process used in the entire Trino ecosystem when using the Hive connector. The HMS is actually a simple service with a binary API using <strong><a href="https://thrift.apache.org/">the Thrift protocol</a></strong>. This service updates the metadata, which is stored in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There are also compatible drop-in replacements for the HMS, such as AWS Glue.</p>
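<p>Because the metadata lives in an ordinary relational database, you can inspect it with plain SQL run against the metastore database itself (not through Trino). The HMS schema includes tables such as <code>DBS</code>, <code>TBLS</code>, and <code>SDS</code>; a query along these lines lists each table and its storage location:</p>

```sql
-- Inspect the HMS backing database directly (e.g. in MariaDB):
SELECT d.NAME AS schema_name,
       t.TBL_NAME,
       s.LOCATION
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID;
```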


<h3 id="getting-started-with-the-hive-connector-on-trino">Getting started with the Hive Connector on Trino</h3>

<p>To drive this point home, I <a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio">created a tutorial that showcases using Trino and looking at the metadata it produces</a>. In the following scenario, the Docker environment contains four containers:</p>
<ul><li><code>trino</code> - <em>the runtime</em> in this scenario that replaces Hive.</li>
<li><code>minio</code> - <em>the storage</em>, an open-source object storage server.</li>
<li><code>hive-metastore</code> - <em>the metastore</em> service instance.</li>
<li><code>mariadb</code> - the database that <em>the metastore</em> uses to store the metadata.</li></ul>

<p>You can play around with the system and optionally view the configurations. The scenario asks you to run a query to populate data in MinIO and then see the resulting metadata populated in MariaDB by the HMS. The next step asks you to run queries over the <code>mariadb</code> database, which holds the metadata generated by <em>the metastore</em>.</p>
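<p>If you just want a feel for the exercise before opening the repository, the flow looks roughly like the following; treat the catalog, schema, and bucket names as illustrative rather than the tutorial's literal contents:</p>

```sql
-- Create a schema whose tables live in a MinIO bucket:
CREATE SCHEMA minio.tiny WITH (location = 's3a://tiny/');

-- Populate a table from Trino's built-in TPC-H sample data;
-- the HMS records the table's metadata as a side effect:
CREATE TABLE minio.tiny.customer
WITH (format = 'ORC')
AS SELECT * FROM tpch.tiny.customer;

-- Verify the data reads back through the Hive connector:
SELECT * FROM minio.tiny.customer LIMIT 5;
```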

<p>If you have any questions or run into any issues with the example, you can find us on <a href="https://trino.io/slack.html">Slack</a> in the <code>#dev</code> or <code>#general</code> channels.</p>

<p>Have fun!</p>

<p><img src="https://trino.io/assets/blog/intro-to-hive-connector/intro-to-hive.jpeg" alt=""/></p>

<p><a href="https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio">https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio</a></p>

<p><a href="https://bitsondata.dev/tag:trino" class="hashtag"><span>#</span><span class="p-category">trino</span></a> <a href="https://bitsondata.dev/tag:hive" class="hashtag"><span>#</span><span class="p-category">hive</span></a></p>

<p><em>bits</em></p>


]]></content:encoded>
      <guid>https://bitsondata.dev/a-gentle-introduction-to-the-hive-connector</guid>
      <pubDate>Wed, 21 Oct 2020 17:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>