What is benchmarketing and why is it bad?

There’s something I have to get off my chest. If you’re short on time, just read the TL;DR and listen to the Justin Bieber parody posted below. If you’re confused by the lingo, the rest of the post will fill in any gaps.

TL;DR: Benchmarketing, the practice of using benchmarks as marketing, is bad. Consumers should run their own benchmarks, and ideally open-source them, instead of relying on a vendor’s internal, biased report.

For the longest time, I have wondered what the point is of corporations, specifically in the database sector, publishing their own benchmarks. Would a company ever have an incentive to post results from a benchmark that didn’t show its own system winning in at least the majority of cases? I understand that these benchmarks have become part of the furniture we expect to see when visiting any hot new database’s website, but I doubt anybody in the public gains much insight from these results, at least nothing they weren’t already expecting to see.

Now, to be clear, I am in no way suggesting that companies running internal benchmarks to compare their own performance against their competitors is a bad thing. The problem is when they intentionally skew the methods or data behind those benchmarks for sales or marketing purposes. Vendors that take part in this practice don’t just use benchmarks to show their systems in a modestly favorable light; they taint the methodology by tuning settings, warming caches, and applying other performance enhancements to their own system while leaving their competition’s configuration untouched.

It should be obvious that this is NOT what benchmarking is about! If you read about the history of the Transaction Processing Performance Council (TPC), you come to understand that this is the very wrongdoing the council was created to address. But as with any proxy measurement, the numbers are inherently pliable.

By the spring of 1991, the TPC was clearly a success. Dozens of companies were running multiple TPC-A and TPC-B results. Not surprisingly, these companies wanted to capitalize on the TPC’s cachet and leverage the investment they had made in TPC benchmarking. Several companies launched aggressive advertising and public relations campaigns based around their TPC results. In many ways, this was exactly why the TPC was created: to provide objective measures of performance. What was wrong, therefore, with companies wanting to brag about their good results? What was wrong is that there was often a large gap between the objective benchmark results and their benchmark marketing claims — this gap, over the years, has been dubbed “benchmarketing.” So the TPC was faced with an ironic situation. It had poured an enormous amount of time and energy into creating a good benchmark and even a good benchmark review process. However, the TPC had no means to control how those results were used once they were approved. The resulting problems generated intense debates within the TPC.

This benchmarketing ultimately fails the very clients these companies are marketing to. It demonstrates not only a lack of care for addressing users’ actual pain, but a lack of respect: pulling the wool over their eyes in an attempt to mask performance that isn’t up to par with competitors. Consumers are left unable to make informed decisions, since so many of our decisions are driven by gut instinct and emotion, which is exactly what these benchmarks aim to manipulate.

If you’re not sure exactly how a company would pull this off, here is an example. Database A is run with a cost-based optimizer enabled, which requires precomputing statistics about the tables involved in a query, while database B runs the same query with no stats-based optimization made available to it. Database A will clearly dominate: it can reorder joins and choose better execution plans, while database B falls back to the simplest plan and runs much slower in most scenarios. The company whose product depends on database A then homes in on the numerical outcomes of this report. Even if they’re decent enough to disclose the methods they skewed, they bury that disclosure deep in the report and focus their advertising on the outcome of what would otherwise be considered an absurd comparison. Companies will even go as far as to claim that their competitor’s database wasn’t straightforward to configure when they were setting up optimizations. If you’re not capable of making equivalent changes to both systems, then you don’t get to run that comparison until you figure it out.
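To make this concrete, here is a minimal sketch, using Trino’s Python client, of how much that one asymmetry can matter. The host, catalog, schema, and query are placeholder assumptions; the point is that the same engine, on the same data, can look wildly different depending on whether its cost-based optimizer was given statistics.

```python
# Minimal sketch: the same Trino query, with and without cost-based join
# reordering. Host, catalog, schema, and tables are placeholder assumptions.
# Requires the `trino` client package: pip install trino
import time

import trino

QUERY = """
SELECT o.orderpriority, count(*) AS order_count
FROM orders o
JOIN lineitem l ON l.orderkey = o.orderkey
GROUP BY o.orderpriority
"""

def run_once(session_properties):
    conn = trino.dbapi.connect(
        host="localhost", port=8080, user="bench",
        catalog="hive", schema="tpch",
        session_properties=session_properties,
    )
    cur = conn.cursor()
    start = time.monotonic()
    cur.execute(QUERY)
    cur.fetchall()  # drain all rows so we time the full query
    return time.monotonic() - start

# "Database B" conditions: no stats-based join reordering available.
naive = run_once({"join_reordering_strategy": "NONE"})

# "Database A" conditions: cost-based reordering, assuming table statistics
# were collected first (e.g. `ANALYZE orders` and `ANALYZE lineitem`).
tuned = run_once({"join_reordering_strategy": "AUTOMATIC"})

print(f"naive plan: {naive:.1f}s, cost-based plan: {tuned:.1f}s")
```

A vendor that reports only the first number for a competitor and only the second for itself hasn’t benchmarked anything; it has benchmarketed.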

Many think consumers aren’t susceptible to such attacks and would see right through this scheme, but these reports appeal to any of us when we lack the time or resources to thoroughly examine all the data. Often we have to take cues from our gut because a decision needs to be made and the window to make it is constrained by other business needs. This phenomenon is described in Daniel Kahneman’s book Thinking, Fast and Slow. To briefly summarize the model he uses, there are two modes humans use when reasoning about decisions, System 1 and System 2.

Systems 1 and 2 are both active whenever we are awake. System 1 runs automatically and System 2 is normally in comfortable low-effort mode, in which only a fraction of its capacity is engaged. System 1 continuously generates suggestions for System 2: impressions, intuitions, intentions, and feelings. If endorsed by System 2, impressions and intuitions turn into beliefs, and impulses turn into voluntary actions. When all goes smoothly, which is most of the time, System 2 adopts the suggestions of System 1 with little or no modification. You generally believe your impressions and act on your desires, and that is fine — usually.

No surprise: that’s usually where we get into trouble. While we like to think we generally operate in the logical System 2 mode, we don’t have the time or energy to live in that space for long stretches of the day, so we end up relying on System 1 for much of our decision-making.

The measure of success for System 1 is the coherence of the story it manages to create. The amount and quality of the data on which the story is based are largely irrelevant. When information is scarce, which is a common occurrence, System 1 operates as a machine for jumping to conclusions.

This is why benchmarketing is so dangerous: it is remarkably effective at manipulating our belief in claims that simply aren’t true. The resulting decisions affect how your architecture unfolds, your time-to-value, and the hours lost by your team and your customers. All of this makes having processes that fairly compare the performance and merits of two systems all the more paramount.

https://xkcd.com/882/

So why am I talking about this now?

I have become a pretty big fanboy of Trino, a distributed query engine that runs interactive queries across many data sources. I have witnessed firsthand how quickly a cluster of Trino nodes can process huge amounts of data. When you dive into how those speeds are achieved, you find an incredible feat of modern engineering that makes interactive analysis over petabytes of data a reality. Going into all the reasons I like this project would be too tangential, but it fuels the fire for why I believe this message needs to be heard.

Recently, a “benchmark” came out comparing the performance of Dremio against the open-source and enterprise versions of Trino (then Presto), touting performance improvements over Trino by an amount that would have been called out as implausible in a CSI episode. Trino isn’t the only system in the data space to come under this type of attack, and it makes sense: this kind of technical peacocking is common because it successfully gains attention.

Luckily, as more companies strive for transparency and associate themselves with open-source efforts, we are starting to see a relatively new pattern emerge. Typically, you hear about open source in the context of software projects maintained by open-source communities, but we are now arriving at an age where almost any noun can be open-sourced: there is open-source music, open-source education, even open-source data. So why not open-source benchmarking through consumer collaboration? This matters not just for the consumers of these technologies, who simply want more data to inform the design choices they make on behalf of their clients; it’s also unfortunate that developer communities putting hard work into these projects see that work denigrated by what amounts to corporate status competition.

Now I’m clearly a little biased when I tell you that I think Trino is the best analytics engine on the market today. When I say this, you really should be skeptical, and I encourage it. You should verify, beyond a shadow of a doubt, that:

  1. Any TPC or other benchmark results are validated and no “magic” was used to improve performance.

  2. The system you choose meets your needs, by running benchmarks against your own use cases.

While this may seem like a lot of work, cloud infrastructure and the simplicity of deploying systems into it make running an internal, at-scale benchmark of competing systems more feasible today than it was even 10 years ago. Not only can this benchmark be run by your own unbiased data engineers, who have a real stake in finding out which system best fits the company’s needs, but you don’t have to rely on generic benchmarking data at all if you don’t want to. You can spin up these systems, point them at your own data, run your own use cases, and do it any way you want.
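As a starting point, here is a minimal sketch of such a harness. It assumes only DB-API-style connections (the `trino` package provides one; most other engines have an equivalent driver); the queries and connections are placeholders you would swap for your own workload.

```python
# Minimal sketch of a do-it-yourself benchmark harness: identical queries,
# identical warmup, identical run counts for every candidate system.
# Connections and queries are placeholders; substitute your own workload.
import statistics
import time

def time_query(cursor, sql, runs=5, warmups=1):
    """Return the median wall-clock seconds for one query."""
    for _ in range(warmups):  # warm caches the same way for every system
        cursor.execute(sql)
        cursor.fetchall()
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        cursor.execute(sql)
        cursor.fetchall()  # drain all rows so the full query is timed
        samples.append(time.monotonic() - start)
    return statistics.median(samples)

def benchmark(systems, queries):
    """systems: name -> open DB-API connection; queries: name -> SQL text."""
    return {
        sys_name: {
            query_name: time_query(conn.cursor(), sql)
            for query_name, sql in queries.items()
        }
        for sys_name, conn in systems.items()
    }

# Example wiring (hypothetical hosts and files; use your real clusters):
#   import trino
#   systems = {"trino": trino.dbapi.connect(host="trino.internal", port=8080,
#                                           user="bench", catalog="hive",
#                                           schema="prod")}
#   queries = {"daily_rollup": open("queries/daily_rollup.sql").read()}
#   print(benchmark(systems, queries))
```

Run it on identical hardware with identical data, then publish the script alongside the numbers; anyone who doubts your results can rerun them, which is exactly the property that benchmarketing lacks.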

In summary, if consumers work together, we can get rid of this specific type of misinformation while producing richer, more insightful analysis that aids both companies and consumers. As I mention in the song above, go run the test yourselves.

#trino #presto #opensource
