
How WHOOP Built and Launched a Reliable GenAI Chatbot


Sara Gates

Sara is a content strategist and writer at Monte Carlo.

Boston-based WHOOP is on a mission to help its members understand their bodies and unlock their potential through a performance wearable, accompanying app, and 24/7 health data. 

Data isn’t just critical to optimizing physical potential – it’s also the lifeblood of WHOOP. Director of Business Analytics Matt Luizzi and his team leverage this data firsthand. Their primary focus is personalizing the in-app experience, optimizing marketing investment, and helping product managers understand feature adoption and engagement – all through data-driven decision making.

“One of the things that makes WHOOP exciting is that we’re a small company and very agile — so when we catch on to the right insights, it leads to changes in the product roadmap and new features that we’re shipping really quickly to our members,” Matt says. 

That includes early adoption of GenAI to power an in-app feature called WHOOP Coach, then repurposing the underlying infrastructure to help internal team members access even more data. 

We’re suckers for a good self-serve data product, especially when it’s powered by GenAI. So we were thrilled when Matt sat down with our Field CTO Shane Murray at the 18th annual CDOIQ Symposium to share how his team is leveraging LLMs to deliver reliable insights to stakeholders throughout the organization. Here are a few key insights from their conversation.  

The challenge: Making data trustworthy and accessible to chatbot users

Matt wanted to solve an age-old problem: “too many questions and not enough analysts.” Like many organizations, WHOOP’s marketers and product managers relied on data to make decisions, but they lacked the SQL or Python skills to find their own answers. Often that meant Slacking a data analyst, who might take days to come back with an answer.

Matt saw an opportunity to build an internal chatbot that these users could leverage to get answers — and make decisions — faster. 

The initial proof of concept came together quickly, thanks to the existing relationship with OpenAI from building the in-app experience for WHOOP Coach. “We already had access to all the tools between OpenAI, Snowflake, and Streamlit, and built a chat interface.”

Matt took three datasets and connected them to the LLM and chatbot. Initially, he focused on answering questions like “Which countries are growing the fastest by new membership sales?” — ones that could be answered with a single table and some aggregations and window functions. 
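To make that pattern concrete, here is a minimal sketch of such a proof of concept, assuming a hypothetical memberships table and standard OpenAI, Snowflake, and Streamlit clients. WHOOP’s actual implementation isn’t public, so treat every name, prompt, and configuration below as illustrative.

```python
# A minimal sketch of the proof-of-concept pattern described above: a Streamlit
# chat interface asks an OpenAI model for Snowflake SQL, runs it, and shows
# both the SQL and the result. The table, model name, prompt wording, and
# connection setup are illustrative assumptions, not WHOOP's actual code.
import snowflake.connector
import streamlit as st
from openai import OpenAI

SCHEMA_NOTES = """
memberships(membership_id, country, signup_date)  -- hypothetical table
"""

SYSTEM_PROMPT = (
    "You write Snowflake SQL. Use only the tables described below and reply "
    "with a single SQL statement, no markdown fences, no commentary.\n"
    + SCHEMA_NOTES
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@st.cache_resource
def get_connection():
    # Connection details live in Streamlit secrets; adjust to your setup.
    return snowflake.connector.connect(**st.secrets["snowflake"])


question = st.chat_input("Ask a question about the data")
if question:
    st.chat_message("user").write(question)
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption; any capable chat model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    sql = response.choices[0].message.content.strip()
    with st.chat_message("assistant"):
        st.code(sql, language="sql")  # surface the generated SQL to the user
        df = get_connection().cursor().execute(sql).fetch_pandas_all()
        st.dataframe(df)  # show the result set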

He brought the chatbot to the CTO. “You see the lightbulb go on when they realize how easy it is to get data into people’s hands,” says Matt. “It was really powerful to show ‘Hey, this is something we can do in a week — imagine if we actually put resources behind this project and scaled it up.’”

That’s exactly what happened. Matt soon had a dedicated team tasked solely with making the internal chatbot enterprise-grade and keeping its performance current with the latest advancements in LLMs.

The first order of business: getting their data quality in order. 

“We had several hundred dashboards and all the typical sprawl you see in BI,” says Matt. “Everyone’s creating things, nobody knows what’s being used or what’s correct, and there might be two Looker Explores that are the exact same thing with different filters. Depending on where you go, you may or may not get the right answer.”

If they were building an internal tool that would drive important decision-making, how could they govern and monitor their data to ensure reliable answers when working with such massive datasets?

The solution: Implementing data quality standards

First, Matt and his team re-architected their dbt project to improve documentation and accessibility. They wanted to understand what their key datasets were and have a single source of truth for all their metrics.

Given their data sprawl, the process was “quite a beast,” says Matt. But they were able to leverage Monte Carlo’s data observability platform to plan the new architecture. 

“With all of the lineage that Monte Carlo gives us, beyond our dbt project, into the BI layer, and even upstream into our data lake, we could immediately go and start deleting dashboards that weren’t being queried,” says Matt. “Our view was that if people weren’t using something, they didn’t need it and we weren’t going to support it. We eliminated 80% of the bloat.”

From there, Matt and his team were able to diagram their ideal data architecture. They identified all the metrics they cared about as a company, then worked backwards to understand how they wanted to report on those metrics and which tables that would require. They were able to reduce their number of dbt models and implement better version control.

“We were really meticulous about having the right documentation in place,” says Matt. “We created YAML files with table-level and column-level descriptions. When we cut over all our analytics, ultimately, we improved the accuracy of our reporting.”
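Those descriptions only help downstream if they exist for every model and column. One way to keep that true — an illustration, not something WHOOP describes — is a small check that scans the dbt schema files and fails whenever documentation is missing. The `models/` path and file layout here are assumptions based on a standard dbt project.

```python
# Illustrative documentation-coverage check (an assumption, not WHOOP's code):
# scan dbt schema YAML files and flag any model or column missing a
# description, so undocumented tables never reach reporting or the LLM prompt.
import sys
from pathlib import Path

import yaml  # PyYAML

missing = []
for schema_file in Path("models").rglob("*.yml"):  # assumes a standard dbt layout
    spec = yaml.safe_load(schema_file.read_text()) or {}
    for model in spec.get("models", []):
        if not model.get("description"):
            missing.append(f"{schema_file}: model {model['name']}")
        for column in model.get("columns", []):
            if not column.get("description"):
                missing.append(f"{schema_file}: {model['name']}.{column['name']}")

if missing:
    print("Missing descriptions:")
    print("\n".join(missing))
    sys.exit(1)  # fail the build until the docs are filled in
print("All models and columns are documented.")
```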

Matt’s goal was to enable the LLM to interpret the data correctly when answering conversational questions from chatbot users. But first, he tapped their analytics team — people who could read and write SQL — to be the initial chatbot users. 

“We’d have the chat interface actually show the SQL that was being executed and sent to Snowflake,” says Matt, “so it was pretty easy for an analyst who knew the data sets to tell whether the answer was sensible and likely correct, or whether it was garbage.”

They also had every analyst submit ten “golden questions” with their own corresponding SQL statements, to compare against the chatbot’s results. Today, this automated testing framework includes around 150 questions. “It can run both SQL queries simultaneously and compare the resulting data frames, so we get a sense of how accurate these are,” says Matt. “We can evaluate new models by asking: Is it returning SQL? Is the SQL executable? Is it correct?” 
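The evaluation loop Matt describes maps naturally onto a small pandas harness: generate SQL for each golden question, check that it executes, and compare the resulting data frame against the analyst’s reference query. In this sketch, `generate_sql` (the chatbot call) and `run_query` (the warehouse client) are hypothetical stand-ins.

```python
# Sketch of a golden-question harness. `generate_sql` and `run_query` are
# hypothetical stand-ins for the chatbot call and the warehouse client.
import pandas as pd


def normalize(df: pd.DataFrame) -> pd.DataFrame:
    # Sort rows and reset the index so row order doesn't affect equality.
    return df.sort_values(list(df.columns)).reset_index(drop=True)


def evaluate(golden_questions, generate_sql, run_query) -> pd.DataFrame:
    results = []
    for item in golden_questions:  # [{"question": ..., "reference_sql": ...}, ...]
        row = {"question": item["question"], "returned_sql": False,
               "executable": False, "correct": False}
        candidate_sql = generate_sql(item["question"])
        if candidate_sql:
            row["returned_sql"] = True
            try:
                candidate_df = run_query(candidate_sql)
                row["executable"] = True
                reference_df = run_query(item["reference_sql"])
                row["correct"] = normalize(candidate_df).equals(normalize(reference_df))
            except Exception:
                pass  # non-executable SQL counts as a failure, not a crash
        results.append(row)
    return pd.DataFrame(results)
```

Tracking those three flags per question gives a quick scorecard whenever a new model or prompt is swapped in.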

Over time, the chatbot was able to support more complex questions, like “What’s the monthly churn rate of people who’ve engaged with a certain feature compared to those who haven’t?” As the underlying models continue to advance, Matt predicts, so will the chatbot’s capabilities. 

Outcome: Achieving organizational value from data quality

Building an enterprise-grade, self-serve data platform takes an upfront investment of time. Matt knew it would be slow going at first, because he couldn’t let users run wild on a chatbot until it was fully trusted to deliver accurate responses. 

“It’s a big investment upfront, but we’re already starting to see returns in the interest in data quality and people driving new conversations with the data team,” says Matt.

For one thing, his team has proven they deserve a seat at the decision-making table. 

“Getting in the room and having conversations with the right stakeholders is half the battle,” says Matt. “For us, being able to showcase the fact that we’re able to not only create dashboards and run A/B tests but actually build tooling that’s serving the business — that’s gotten us a lot of value in the organization.” 

Outcome: Improving visibility across the data stack

The chatbot has also given Matt and his team visibility into areas where they have poor documentation. If the LLM can’t accurately interpret and respond to a question, sometimes it’s because the prompt needs to be adjusted — but often, there’s a problem with the documentation. 

“If the chatbot can’t answer the question, it’s either something wrong with our semantic modeling or our business relationships are unclear — in which case we should probably fix the documentation,” says Matt. “It’s been really helpful in terms of driving clarity around where we’ve done a good job, and where we’ve done less well — and we’ve been able to iterate from there.” 

Outcome: Implementing best practices around data quality

As Matt and his team undertook the dbt re-architecture, they began to understand why their data quality had suffered in the first place. 

“As the analytics team grew, it was getting unruly,” Matt says. “More people could just commit code to main with one approver, and we typically got a bunch of people saying ‘Hey, it runs. Looks good to me.’ That was very much not the direction we wanted to go in as an organization, because data quality is super important.”

To make sure data quality remained intact going forward, Matt took a few specific steps to follow software engineering best practices. He implemented a weekly release cadence within their dbt project to help eliminate all avoidable data errors, from upstream changes to syntax errors. 

“Everyone can commit pull requests to our QA branch by Wednesday at 3pm, and then our analytics engineering team has 24 hours to run all the data quality checks we need to and ensure there’s no regression on metrics,” says Matt. 
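That release window boils down to two automated steps: build the project against a QA target, then check that key metrics haven’t regressed relative to production before merging. A minimal sketch follows, with the Snowflake connection, schema names, metric, and tolerance all as assumptions rather than WHOOP’s actual pipeline.

```python
# Sketch of a pre-merge check for the QA branch (assumptions: a Snowflake
# warehouse, an "analytics" prod schema, an "analytics_qa" QA schema, and one
# hypothetical headline metric).
import subprocess

import snowflake.connector

CONNECTION = dict(account="...", user="...", password="...", warehouse="...")  # fill in
METRIC_SQL = "select count(*) from {schema}.memberships"  # hypothetical metric
TOLERANCE = 0.01  # flag moves larger than 1%

# 1) Build and test the dbt project against the QA target.
subprocess.run(["dbt", "build", "--target", "qa"], check=True)

# 2) Compare the headline metric between prod and QA before merging to main.
conn = snowflake.connector.connect(**CONNECTION)


def metric(schema: str) -> float:
    cur = conn.cursor()
    cur.execute(METRIC_SQL.format(schema=schema))
    return float(cur.fetchone()[0])


prod_value, qa_value = metric("analytics"), metric("analytics_qa")
drift = abs(qa_value - prod_value) / max(prod_value, 1.0)
if drift > TOLERANCE:
    raise SystemExit(
        f"Metric regression: prod={prod_value}, qa={qa_value} ({drift:.1%} drift)"
    )
print("No metric regression detected; QA branch is safe to merge.")
```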

With Monte Carlo monitoring helping ensure everything looks good in the QA branch, the engineers can confidently merge to main on a weekly basis. “Monte Carlo is that one-tool-fits-all for us when it comes to observability,” he says. “We have freshness monitors out of the box, we don’t have to do anything there. And then from a SQL rules point of view, we have aggregations and metrics and cardinality out of the box.”

Matt describes the shift to a rigorous QA process and weekly release cadence as the biggest change in data quality he’s seen over the last three years. “We’ve been doing this for six months or so now, and we’ve had zero instances where avoidable mistakes were actually merged into production,” says Matt.

The bright future of data at WHOOP

Looking ahead, Matt and his team plan to apply Monte Carlo across more tables and more dimensions that are impactful to the business. They’re also focused on performance optimization for the chatbot — constantly evaluating the latest technology to power their product, from adopting Snowflake Cortex to building an in-house RAG system. 

And the data team’s leadership within the organization continues to grow. Matt and his team are tackling experimentation — automating A/B testing and understanding how to personalize experiences within the product. Quarter-over-quarter, they’ve been able to quadruple the number of experiments run.

“We’ve seen the outcomes of some of our experiments be such needle-movers for us — like small copy changes, or a different signup flow, or funnel optimizations,” says Matt. “It’s the kind of stuff where you look at the results and say, ‘How did we not do this sooner?’ It’s been a real wake-up call for the management team on how we can move quickly and use data to drive more decisions.”

Ready to see how reliable, accurate data can move the needle at your company? Contact our team to learn how data observability can lay the foundation for data trust across the organization.

Our promise: we will show you the product.