Data Observability

Tales from the Pipeline: 4 Data Horror Stories To Keep You Up at Night

Tim Osborn

Tim is a content creator at Monte Carlo who writes about data quality, technology, and snacks—occasionally in that order.

“As he lay awake in his Bay Area apartment, the data leader couldn’t shake the feeling that something wasn’t right. He tried to shut his eyes—to force them closed—but the more he tried, the more convinced he became.

Suddenly, a light appeared from the darkness. 

It was a Slack from the CEO. 

She was working late. And the data…it couldn’t be…it looked wrong. 

His blood froze. 

Somewhere in the distance, thunder cracked.”

Does this horror story sound familiar?

Nothing keeps data leaders up at night like bad data. From loss of revenue to loss of reputation, data quality issues can wreak havoc on an organization. And without the right data quality strategy in place, you won’t know until it’s already too late.

In honor of Spooky Season, we gathered around the campfire to share four tales of terror from real Monte Carlo customers—and discover how data observability helped them put their data quality fears to rest. 

Names have been changed out of respect for the victims—all other details have been told exactly as they occurred.

Read on… if you dare.

Contact Records … from Beyond the Grave

Sometimes what we think is gone and buried can be the very thing that comes back to haunt us. For one Monte Carlo customer, that specter turned out to be a huge batch of old contact records. 

One dark and stormy evening, the marketing team at one of the world’s most popular wellness apps was putting the finishing touches on a new email campaign launch they’d been preparing for months.

To make sure the emails made it to the right inboxes, the data team employed several data models to identify the best recipients and their corresponding contact information.

Unfortunately, amidst a rush of last-minute changes, an old data source was also activated by mistake. To make matters worse, a simple oversight allowed the mistake to go unnoticed through manual testing. And the only thing worse than a marketing email is a marketing email intended for someone else.

If the old data source hadn’t been discovered quickly, that little mistake would have had email campaigns slipping uninvited into thousands of wrong inboxes—tripping spam filters and destroying deliverability in the process.

Fortunately, Monte Carlo’s out-of-the-box volume monitor—which had been activated for their Snowflake assets—triggered an alert in the team’s Slack integration.

“The error was spotted when the old data source started showing a spike in data and the support team went to look into why an old data source was suddenly sending huge amounts of contact records.”

Within a day, the team had traced the issue back to its source and updated the production flow to exclude the outdated data—and those old records were once again lost to memory.
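For readers curious about the mechanics: a volume monitor like the one that caught this incident boils down to comparing a table's latest row count against its historical baseline and alerting on anomalous deviations. Here's a minimal, illustrative sketch in Python (not Monte Carlo's actual implementation; the counts and threshold are invented for illustration):

```python
from statistics import mean, stdev

def volume_alert(daily_row_counts, new_count, z_threshold=3.0):
    """Flag a row-count spike that deviates from the historical baseline.

    daily_row_counts: recent daily row counts for the table.
    Returns True if new_count is anomalous under a simple z-score test.
    """
    baseline = mean(daily_row_counts)
    spread = stdev(daily_row_counts) or 1.0  # guard against zero variance
    z = (new_count - baseline) / spread
    return abs(z) > z_threshold

# A dormant source suddenly emitting a flood of contact records:
history = [1200, 1150, 1230, 1180, 1210, 1190, 1220]
assert volume_alert(history, 58_000) is True   # spike -> alert
assert volume_alert(history, 1205) is False    # normal day -> quiet
```

Production systems learn seasonality and update thresholds automatically, but the underlying comparison is the same: today's volume against an expected range.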

The Night the Schema Changed

Like ancient tombs or books bound with human skin, some things are better left alone—and critical JSON schemas are on that list. 

As the data engineer of a leading enterprise digital security solution stared into the flickering flames of our campfire, he shared how his organization’s customer usage data took a frightening turn one fateful weekend—and how his data consumers very nearly paid the price.

“We rely a lot on customer usage data to help our technical account management (TAM) team. We built dashboards that show important account activities, help with upselling, and track customer health. Everything was going well—until one weekend, things took a turn.”

On that fateful weekend, the team’s Monte Carlo monitors began picking up ominous alerts about a data quality issue in their pipeline. Customer usage data wasn’t coming through correctly, and there was a huge spike in missing data.

Was it the ghost of a vengeful ancestor? A curse from the proprietor of a decaying bed and breakfast?

No. 

It was a change in the system’s schema. 

Fortunately, the data team had configured Slack alerts for Monte Carlo as part of their operational process, and the issue was flagged immediately; but here’s the really scary part—no one checked Slack over the weekend.

If the issue wasn’t caught quickly, executive and account manager dashboards would return inaccurate information, leading to poor decisions for customers and harming the business and its customer relationships in one fell swoop.

“By Monday morning, it became clear that customer data hadn’t been updated for over two days. Panic set in, and we had to dig deep to find the problem.”

The team needed all the alerts and context to be waiting for them when they logged on Monday morning—and waiting it was. Thanks to Monte Carlo’s fully automated schema monitors, which had been programmatically deployed across their tables, and the tailored root cause insights delivered with the alert, the team was able to dig deep and fix the issue quickly before it made its way to customers.

“Monte Carlo helped us catch the issue just in time, but the thought of what could’ve happened still gives us chills.”
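Under the hood, schema monitoring amounts to diffing the columns and types a pipeline expects against what actually arrived. A toy sketch of the idea, with hypothetical column names (this is not Monte Carlo's API, just the concept):

```python
def schema_diff(expected: dict, observed: dict) -> list:
    """Compare an expected column->type mapping against what the
    pipeline actually received, returning human-readable changes."""
    changes = []
    for col, typ in expected.items():
        if col not in observed:
            changes.append(f"missing column: {col}")
        elif observed[col] != typ:
            changes.append(f"type change: {col} {typ} -> {observed[col]}")
    for col in observed.keys() - expected.keys():
        changes.append(f"new column: {col}")
    return changes

# A weekend rename in the upstream system's JSON payload:
expected = {"account_id": "string", "events": "int", "ts": "timestamp"}
observed = {"account_id": "string", "event_count": "int", "ts": "timestamp"}
assert schema_diff(expected, observed) == [
    "missing column: events",
    "new column: event_count",
]
```

A renamed or retyped field is invisible to a pipeline that only checks whether rows arrived, which is why schema checks are a separate class of monitor from volume checks.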

The Curse of the Missing Data Source

Imagine waking up to find that one of your critical data sources has disappeared without a trace. Would you catch the issue in time? What would happen if you didn’t?…

When it comes to third-party data brokers, selling missing data is a lot like selling a car without brakes. You probably could do it, but the buyer won’t be in a good place when they find out. 

For one of the nation’s leading providers of third-party rental data, this nightmare scenario almost became more than a hypothetical when they discovered a spike in nulls returned from one of their most important endpoints. 

Fortunately for this data team, they came prepared.

“We had created a [Null metric monitor in Monte Carlo] for this exact scenario, because it was something we knew would happen eventually. An alert came into one of our Slack channels, and a Data Engineer [immediately] alerted the rest of the team to what happened. We had to take action immediately…” 

But where was the data source? Trapped in a wormhole? Absorbed into the spectral plane? Upon further investigation, the team discovered that the endpoint had been deprecated without their knowledge.

“If we hadn’t caught this quickly, we would have overwritten data that gets sent to customers and displayed in-app with null values. We would have needed to pause all customer fulfillments and app updates until we could fix it. [And] given some legacy infrastructure we were working with at the time, fixing the null data would have been far more challenging and time-consuming than preventing the nulls from making it downstream in the first place.” 

After leveraging Monte Carlo to triage and root cause the incident, the team confirmed that the endpoint had indeed been deprecated and began mitigating the issue within three hours of the initial impact.

With Monte Carlo manning the EMF meter, you’ll always know what’s lurking in the shadows of your pipelines—or in this case, what isn’t. 
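The null metric monitor that saved the day is, at its core, a threshold on the share of nulls in each new batch of data. A hypothetical sketch (the values and threshold are invented for illustration, not taken from this customer's setup):

```python
def null_rate_alert(values, max_null_rate=0.05):
    """Alert when the share of nulls in a batch exceeds a threshold."""
    if not values:
        return True  # an empty batch is itself suspicious
    null_rate = sum(v is None for v in values) / len(values)
    return null_rate > max_null_rate

# A deprecated endpoint silently starts returning nulls:
healthy = [1450, 1500, 1480, 1510, 1490, 1475, 1460, 1495]
broken = [None] * 8 + [1480, 1500]
assert null_rate_alert(healthy) is False
assert null_rate_alert(broken) is True
```

Catching the spike before the nulls overwrite downstream tables is the whole game: as the quote above notes, preventing bad data from landing is far cheaper than backfilling it out of legacy infrastructure.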

Wailing in the Webapp

There are forces at work in this world. Forces you can’t control. You don’t always see them… but you can feel their impact. The hairs standing up on the back of your neck. A chill running down into your toes.  

And sometimes… On a particularly cold night—when the moon is full and the night beasts lie still in the dark—you get a data quality alert about it. 

When Luke, a data leader responsible for external application data, received a volume alert showing no change in row count for a table that logged activity for a particular app feature, he knew something was wrong.

“Part of our webapp was migrated to a different service, and a Data Engineer caught a Monte Carlo alert that showed that a particular database hadn’t increased in row count.” 

But a deeper investigation showed that the data was just the tip of this chilling iceberg. 

“Some detective work showed that the ETLs were running fine, and that an app feature was actually down. It was a lesser-used feature, but thanks to Monte Carlo we discovered that the feature didn’t have sufficient internal alerting to indicate an outage.”

The data engineering team alerted the applications platform team before the helpdesk ticket even came through. Within 24 hours, the issue was resolved and new alerting was in place to detect future gremlins in the app.

“Because of Monte Carlo, this was able to be caught and fixed early Monday morning. If it had been any later, more users would have been impacted by [the outage].”

With Monte Carlo’s programmatic volume alerts, Luke’s team was able not only to uncover the darker root cause at work in their pipeline, but also to protect external trust and improve future reliability in the process.
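The check that kicked off this chain is conceptually the inverse of the spike detector from the first story: a table that normally grows had stopped changing. A minimal, hypothetical sketch of that kind of flat-line detection:

```python
def stalled_table_alert(row_count_snapshots, window=4):
    """Alert when a table that normally grows shows no change
    across the last `window` periodic row-count snapshots."""
    recent = row_count_snapshots[-window:]
    return len(recent) == window and len(set(recent)) == 1

# Hourly snapshots of the feature's event table:
snapshots = [10_400, 10_520, 10_610, 10_610, 10_610, 10_610]
assert stalled_table_alert(snapshots) is True   # flat-lined -> alert
assert stalled_table_alert([10_400, 10_520, 10_610, 10_700]) is False
```

Note that in Luke's case the pipelines themselves were healthy; the frozen row count was a symptom of an application outage upstream, which is exactly why data-side monitoring caught what app-side alerting missed.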

Put your data quality fears to rest with Monte Carlo

The only thing scarier than the data quality issues you know about is the data quality issues you don’t.

With Data Observability from Monte Carlo, you’ll have the visibility you need to detect your data quality monsters faster, resolve them sooner, and scale bigger than ever. 

Don’t let fear of the unknown overwhelm your data team. Take control of your enterprise data quality with Monte Carlo. Contact our team to learn more. 

Our promise: we will show you the product.