Snowflake and Databricks Summit Recap: Selling Shovels for the AI Gold Rush
People often say “you can’t judge a book by its cover.” But what I’m asking is… can you judge a data conference by its title?
Apparently not, if those events happen to be Databricks’ Data + AI Summit or Snowflake’s own Data Cloud Summit, subtitled the Era of Enterprise AI.
While both of these titanic events unquestionably sought to affirm their fidelity to the AI overlords, it was clear from “go” that the biggest news coming out of these events was far from the latest AI code generators.
Of course, there were plenty of flashy generative demos (like Shutterstock AI), not to mention a couple of live snafus, but these were merely a palate cleanser between keynotes as both events hurtled toward the real star of their shows: data enablement.
So, what was the big news coming out of Summit season this year? In this blog, we’ll compare and contrast both of these monster events to understand the biggest news of the day, what it says about the state of the data industry, and why data is forever and always the star of AI.
Selling shovels for the AI gold rush
While LLMs will almost certainly remain the “it” technology of the decade—at least until we figure out how to upload our consciousness to the metaverse—the star of both Summit events this year was undoubtedly all the new ways Snowflake and Databricks customers can enable data for a given use-case.
The 2024 announcements that made the biggest splash can be summed up in three categories:
- Data pipelines
- Data infrastructure
- Data governance
And each of these developments focused on one of three intended purposes: scalability, interoperability, or reliability.
Look, we’ve all been told there’s data in the LLM hills. But we can’t just go dig it up with our hands. We’ve gotta get the right shovel, the right pickaxe, and the right wheelbarrow to haul it out. And that’s what this year’s events were all about. Getting the right tools to deliver structured, governed, and reliable data into a modern data platform that can do something with it—whatever that something may be.
So, with that, let’s take a closer peek at some of those announcements and see what they can tell us about how the industry’s brightest luminaries are looking at the future of data.
Data pipelines
Databricks announces Lakeflow
Ostensibly positioned as the “everything app” for data pipelines, Lakeflow is Databricks’ answer to the fragmentation of existing pipeline tooling.
Running the gamut from ingestion to transformation, Lakeflow will theoretically enable time- and resource-strapped Databricks users to move their data more easily, with point-and-click tools to build pipelines and ingest data in one fell swoop, including connectors for MySQL, Postgres, SQL Server, and Oracle, as well as critical enterprise apps like Salesforce, Dynamics, SharePoint, Workday, NetSuite, and Google Analytics. The goal is to help teams create a single source of truth.
But Databricks already has a robust partner ecosystem to solve these problems. So, why duplicate work to create an alternative? At the Databricks CIO Forum two years ago, CEO Ali Ghodsi was expecting to field requests for more ML. But that wasn’t the case.
“Everybody in the audience said: we just want to be able to get data in from all these SaaS applications and databases into Databricks.”
AI doesn’t work when data is spread across six on-prem systems. Step one is getting the data into the cloud, so it can be put to work. This is Databricks’ attempt to help streamline that process en masse.
As it turns out, pipeline tooling is a lot like data quality: data teams benefit from an end-to-end approach, and more scalability and usability at the data level is worth more than all the LLMs put together.
Snowflake announces Dynamic Tables
Where Databricks had Lakeflow to daisy-chain your data pipeline construction, Snowflake created a new table type to handle the pipeline management for you.
Dynamic Tables (which Monte Carlo supports) help transform data on load without the need for an additional target table, simplifying the movement of data at scale.
In Snowflake’s own words: “Dynamic tables simplify data engineering in Snowflake by providing a reliable, cost-effective, and automated way to transform data. Instead of managing transformation steps with tasks and scheduling, you define the end state using dynamic tables and let Snowflake handle the pipeline management.”
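To make that “define the end state” idea concrete, here’s a minimal sketch of creating a dynamic table, in this case through the Snowflake Python connector. The connection details, object names, target lag, and query are all placeholders for illustration, not anything announced at Summit.

```python
# A minimal sketch of a dynamic table, created here via the Snowflake Python
# connector. Connection details, object names, the target lag, and the query
# are placeholders; adjust for your own environment.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",    # placeholder account identifier
    user="my_user",          # placeholder user
    password="...",          # or your preferred auth method
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

# Declare the end state (a cleaned orders table) and let Snowflake manage
# the refresh pipeline within the specified target lag.
conn.cursor().execute("""
    CREATE OR REPLACE DYNAMIC TABLE clean_orders
      TARGET_LAG = '30 minutes'
      WAREHOUSE = TRANSFORM_WH
      AS
        SELECT order_id, customer_id, order_total, ordered_at
        FROM raw.orders
        WHERE order_total IS NOT NULL
""")

conn.close()
```

No separate target table, no task scheduling to babysit; Snowflake keeps the table within the declared lag of its sources.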
But whether we’re talking about end-to-end tooling or reimagining the pipeline process, the intention is the same: help data teams do more with their data, faster.
Data infrastructure
Open source catalogs for all
If you were at both events like I was, you might have thought you were experiencing déjà vu when both Databricks and Snowflake announced the open-sourcing of their data catalog features. But I assure you, you’re not in the Matrix. This really happened.
In a move heralded as the key to everything, Databricks made the shocking announcement that it was open-sourcing its popular Unity Catalog, which initially launched in 2021 as a response to the growing demand for discoverability in the lakehouse.
Unity Catalog OSS will purportedly offer a universal interface to support any data format and compute engine, including the ability to read tables with Delta Lake, Apache Iceberg™, and Apache Hudi™ clients via Delta Lake UniForm—enabling both interoperability and choice across a variety of compute engines, tools, and platforms.
Of course, you wouldn’t be nearly as shocked by this if you’d been in that exact same room one week prior when Snowflake announced the same thing for the Polaris Catalog (albeit without the same on-stage theatrics of pushing it live in front of a drooling audience).
By open-sourcing their popular catalog solutions, both companies hope to enable greater interoperability and scale at the operational and governance levels, including a unified UI for both data and AI use cases.
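To picture what that interoperability could look like day to day, here’s a rough sketch of reading a table through an Iceberg REST catalog endpoint (the open protocol Polaris implements, and one that Unity Catalog supports as well) using PyIceberg. The endpoint URI, credential, and table identifier are hypothetical placeholders, not official examples from either vendor.

```python
# A rough sketch of engine-agnostic reads through an Iceberg REST catalog
# endpoint using PyIceberg. The URI, credential, and table identifier are
# hypothetical placeholders; point them at whichever catalog hosts your tables.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",  # placeholder endpoint
        "credential": "client_id:client_secret",           # placeholder credential
    },
)

# Load a table by namespace and name, then scan it into an Arrow table.
table = catalog.load_table("analytics.clean_orders")
arrow_table = table.scan().to_arrow()
print(arrow_table.num_rows)
```

The point isn’t the specific client; it’s that once the catalog speaks an open protocol, the engine reading the table becomes a choice rather than a constraint.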
Of course, step two is to make sure all that data has some real semantic meaning, and that it’s properly governed for a given purpose. Open source catalogs could certainly be part of that, but a data catalog has never been an ultimate solution.
Also serverless compute for all
It looks like Snowflake and Databricks were swapping homework, because both teams also came to class with big announcements for generally available serverless compute.
If OSS catalogs are all about interoperability, then serverless is all about scale.
In a bid to improve flexibility and scalability (though not without some overspending risks if you’re not paying attention), Snowflake and Databricks have been hard at work on serverless solutions that give teams more control of costs by managing when and how their compute jobs will run.
The one major difference here is that serverless for Snowflake sounds a little more like an alternative, while serverless for Databricks sounds a little more like a “do this now or you won’t get all our latest features.”
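For a sense of what the Databricks flavor looks like in practice, here’s a loose sketch using the Databricks Python SDK: a job defined with no cluster configuration at all, which, in workspaces where serverless compute for jobs is enabled, is left to serverless to run. The job name and notebook path are placeholders, and the serverless behavior is an assumption about your workspace settings.

```python
# A loose sketch of a job with no cluster configuration, defined with the
# Databricks Python SDK. In workspaces where serverless compute for jobs is
# enabled, a task like this runs on serverless compute. Job name and notebook
# path are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up credentials from the environment or a config profile

job = w.jobs.create(
    name="nightly_orders_refresh",  # placeholder job name
    tasks=[
        jobs.Task(
            task_key="refresh",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/pipelines/refresh_orders"  # placeholder path
            ),
            # No new_cluster or existing_cluster_id here: compute is left to serverless.
        )
    ],
)
print(f"Created job {job.job_id}")
```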
Data governance
Finally, if there’s one thing the “age of enterprise AI” has shed light on, it’s the need for stronger data governance practices. From identifying and enforcing standards to leveraging data observability to detect, triage, and resolve issues quickly, how teams manage their data (and the tools that support it) was front and center this year.
Here were the big data governance reveals:
Databricks’ Mosaic AI Agent Evaluation
An AI-assisted evaluation tool to measure AI outputs by defining quality, collecting feedback, and identifying and resolving issues in the model.
Data classification and custom classification
The ability to specify properties at the schema level or using specific rules to better identify, tag, and govern your data.
Snowflake data lineage
New upstream and downstream data lineage, specifically for tables within the Snowflake ecosystem. This announcement also includes the addition of ML lineage, which is an exciting development for anyone developing and training machine learning models.
Snowflake data quality monitoring
A point solution to monitor basic or custom metrics (like null count) for data within Snowflake tables, available on Snowflake’s Enterprise edition or higher.
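That last one deserves a closer look, because the shape of it is telling. Here’s a rough sketch of attaching Snowflake’s built-in null-count data metric function to a table and putting it on a schedule, via the Python connector; the connection details, table, column, and schedule are placeholders, and availability depends on your Snowflake edition.

```python
# A rough sketch of scheduling Snowflake's built-in NULL_COUNT data metric
# function on a table column via the Python connector. Connection details,
# table, column, and schedule are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()

# Run attached metrics on a schedule (here, hourly)...
cur.execute("ALTER TABLE clean_orders SET DATA_METRIC_SCHEDULE = '60 MINUTE'")

# ...and monitor null counts on a column that should always be populated.
cur.execute("""
    ALTER TABLE clean_orders
      ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT
      ON (customer_id)
""")

conn.close()
```

Useful, but scoped: it watches metrics for tables inside Snowflake, which is exactly the kind of single-angle coverage worth keeping in mind.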
The sheer number of governance and quality announcements this year underscores how important this is to get right. Each of these solutions attempts to come at the problem from one angle or another. But without end-to-end coverage that extends beyond a given environment, and a commitment to data quality innovation over time, point solutions will always be a drop in the bucket for this monumental problem.
That’s why more enterprise teams are turning to third-party data observability to extend quality coverage across the entire data estate (at the data, system, and code level) and deliver data quality at scale, beyond a given moment in time.
Check out the latest Gartner data observability report to hear what they have to say about the need for data observability in the age of AI.
The data estate reigns supreme
You could certainly be forgiven if you went into this year’s summit season expecting some shiny new AI model to be the belle of the ball.
However, while AI use-cases were still a meta-theme at both events, 2024 was a sober reminder that unless data teams can get their data estate in order, all the LLMs in the world won’t help them deliver more value for their stakeholders.
Large language models. Small language models. Good old-fashioned machine learning and dashboards. Data products come and go. But what’s at the heart of all of it? Your own reliable first-party data.
Data is the real gold rush in 2024. And your data team doesn’t just provide the shovels—they build the refineries, the vaults, and the jewelry stores too. That’s why empowering data teams to activate accurate and reliable data will forever and always be the star of the show.
Need to get your data’s quality in order? Contact the team below and find out how data observability can help you find the fool’s gold in your data estate.
Our promise: we will show you the product.