The 5 Data Quality Rules You Should Never Write Again
You know what they say about rules: they’re meant to be broken. Or, when it comes to data quality, it’s more like they’re bound to be broken.
Data breaks, that much is certain. The challenge is knowing when, where, and why it happens. For most data analysts, combating that means writing data rules – lots of data rules – to ensure your data products are accurate and reliable. But, while handwriting hundreds (or even thousands) of manual rules may be the de facto approach to data quality, that doesn’t mean it has to be.
What’s more—it isn’t enough.
In data, change is the only constant. Volume expands, systems evolve, use cases grow, and data analysts wind up on the hook for an endless stream of data quality rules – or worse yet, waiting on the hook for someone else to prioritize them.
Which raises the question: is there a better way?
At Monte Carlo, we pride ourselves on making data quality rules easier for enterprise analysts. Our data quality and observability platform’s AI-powered monitors learn the statistical profile of data within the fields of a table, and automatically send you and your engineers an alert if those patterns are violated. Instead of writing manual rules (and more rules), our monitors do the heavy lifting—so you can spend less time writing tests and more time driving insights.
Curious how rule automation can transform your data quality workflows? Here are five rules across common dimensions of data quality—including validity, uniqueness, accuracy, timeliness, and completeness—that data analysts never need to write again.
Uniqueness rules
Uniqueness – the notion that an event or entity should only be recorded once – can be tricky to define. After all, there’s no universal standard of what a “good unique %” is.
Let’s say you have a Customer Fact Table with a number of fields, such as customer_id, customer_state, and customer_firstname.
Data analysts typically have to write a rule for each of those fields because they all have different thresholds. For example, customer_id should always be unique, there’s an expected amount of duplicates in customer_firstname, and even more expected duplicates in customer_state. Now, imagine there are 100 columns in this table. Maybe even 1000!
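To make the tedium concrete, here’s a rough sketch of what those hand-written checks tend to look like: a hypothetical Python/pandas example with invented thresholds, using the column names from the example above.

```python
import pandas as pd

# One hand-tuned rule per column: the analyst has to guess an acceptable
# unique-rate range for each field. (These thresholds are illustrative only.)
EXPECTED_UNIQUE_RATE = {
    "customer_id":        (0.999, 1.0),   # should be fully unique
    "customer_firstname": (0.001, 0.10),  # many duplicates expected
    "customer_state":     (0.0,   0.001), # at most ~50 distinct values
}

def check_uniqueness(df: pd.DataFrame) -> list[str]:
    """Flag any column whose unique rate falls outside its hand-set range."""
    failures = []
    for column, (low, high) in EXPECTED_UNIQUE_RATE.items():
        rate = df[column].nunique() / len(df)
        if not low <= rate <= high:
            failures.append(
                f"{column}: unique rate {rate:.4%} outside [{low:.4%}, {high:.4%}]"
            )
    return failures
```

Now multiply that dictionary by every column in every table you own.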
Rather than profiling each of these columns and taking your best guess at a threshold for every one, you can use automated monitors. A unique-rate machine learning monitor, like UNIQUE_RATE, across the entire table will do the trick – automatically alerting you if the percentage of unique values across rows or columns becomes anomalous.
No rules required.
Validity rules & dimension drift
Writing data validity rules, such as accepted values for low-cardinality fields, can be tedious. Let’s revisit our Customer Fact Table example from above – specifically the customer_state field.
If you’re in the USA, there should only ever be 50 accepted values for the customer_state field. To address this, you might write a rule that alerts you whenever a value falls outside those 50 possibilities (‘AL’, ‘AK’, ‘AZ’, ‘CA’…), and spelling out every possibility by hand is time-consuming.
Harder still is catching drift in a low-cardinality field. You might write rules to alert you if, for example, ‘AL’ makes up more than 10% of the values in the column or ‘CA’ makes up more than 50%. Not only is enumerating every case a hassle, but it’s also prone to errors.
ML-powered dimension tracking automatically applies distribution analysis to the values of a given field, tracking the relative frequency of each value over time against its historical baseline.
Instead of writing and rewriting rules ad nauseam, automatic dimension tracking monitors proactively alert you to dimensional issues, like ‘AL’ appearing more frequently than normal or a new customer_state value, like ‘XY’, showing up that doesn’t fit historical patterns.
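Under the hood, the idea is simple to sketch, even if doing it reliably at scale is not. The toy Python example below (not Monte Carlo’s implementation; the 10-point shift threshold is invented) compares a field’s current value distribution against a historical baseline and flags new values or outsized shifts.

```python
import pandas as pd

def dimension_drift(current: pd.Series, baseline: pd.Series,
                    shift_threshold: float = 0.10) -> list[str]:
    """Compare a field's current value distribution against a historical baseline.

    `baseline` and `current` are raw columns (e.g. customer_state) from two time
    windows; `shift_threshold` (10 percentage points here) is an illustrative cutoff.
    """
    alerts = []
    base_freq = baseline.value_counts(normalize=True)
    curr_freq = current.value_counts(normalize=True)

    # New values that never appeared historically (e.g. 'XY').
    for value in curr_freq.index.difference(base_freq.index):
        alerts.append(f"new value '{value}' at {curr_freq[value]:.1%} of rows")

    # Known values whose share has shifted by more than the threshold.
    for value in base_freq.index.intersection(curr_freq.index):
        shift = abs(curr_freq[value] - base_freq[value])
        if shift > shift_threshold:
            alerts.append(
                f"'{value}' shifted from {base_freq[value]:.1%} to {curr_freq[value]:.1%}"
            )
    return alerts
```

The difference is that a monitor learns the baseline and the thresholds from your data’s history instead of asking you to pick them.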
Timeliness rules
Among the most common rules analysts write are timeliness rules, which alert you when data doesn’t arrive on time.
For example, you might write a rule to alert you if table_one doesn’t receive rows by date_time. The problem is that pipelines often experience minor hiccups that self-correct fairly quickly. If manual rules are too narrowly defined, you can easily be buried in alerts for anomalies that would have resolved on their own.
In addition, timeliness rules typically only apply to tables that are refreshed at very regular intervals – like every 6 hours, every day, or every week. If a table has multiple update patterns, you’ll need to understand what they are and manually set a rule for the longest possible refresh window… a rule that’s particularly prone to human error, since it’s difficult to account for fluctuations.
Moreover, you might write different timeliness rules for different table types – temporary tables, dynamic tables, and so on – plus rules for tables that can be updated without new rows being added. The special cases go on forever, and without automation, you’re on the hook for manual rules to cover them all.
Instead of banging your head against a wall trying to cover every case, you can deploy an AI-powered monitor that understands data freshness at any scale. If data typically arrives every 48 hours, you can simply set a monitor, like TIME_SINCE_LAST_ROW_COUNT_CHANGE, to alert you if that threshold is breached. Easy.
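Conceptually, that kind of freshness check compares the time since a table’s last update against a threshold learned from its past update cadence. Here’s a simplified illustration (not Monte Carlo’s algorithm; the mean-plus-three-standard-deviations threshold is an assumption), and it also shows why hand-rolling this is fragile: it quietly assumes a single, regular update pattern.

```python
from datetime import datetime, timedelta
import statistics

def freshness_alert(update_times: list[datetime], now: datetime) -> str | None:
    """Alert when the gap since the last update is anomalous versus history.

    `update_times` are historical load timestamps for one table (at least three,
    in chronological order). The threshold is derived from past gaps rather than
    hard-coded; the exact formula here is illustrative.
    """
    gaps = [(b - a).total_seconds()
            for a, b in zip(update_times, update_times[1:])]
    mean, stdev = statistics.mean(gaps), statistics.stdev(gaps)
    threshold = timedelta(seconds=mean + 3 * stdev)  # tolerate normal jitter

    time_since_last = now - update_times[-1]
    if time_since_last > threshold:
        return (f"No update in {time_since_last}; expected roughly every "
                f"{timedelta(seconds=mean)} (threshold {threshold}).")
    return None
```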
Accuracy rules
Knowing the data is accurate is essential to delivering reliable business insights – but writing rules to validate it is time-consuming and tedious.
Manual accuracy rules require profiling each numeric column individually to define its accepted range, and if you have hundreds of tables to cover…well, that’s a lot of rules to write.
What’s worse, those ranges are highly prone to change. A value slightly outside your previous maximum might be no big deal, while a major outlier is a very big deal. And unless you’re a data analyst who moonlights as a data scientist, writing predictive algorithms to distinguish normal distributions from anomalous ones, you’re going to need some level of automation to be effective, not just efficient.
Automated distribution monitors can programmatically check for shifts in the numeric profile of your data across tables. If an anomalous value shows up in any table, the monitor will trigger an alert, generally routed to your Slack, Teams, or other communication channel of choice.
Instead of writing an endless list of accepted_values rules, you can simply turn on a distribution monitor (sketched conceptually after the list below) to look for fluctuations in:
- Mean / median
- Min / max
- 20th percentile (or 40th, 60th, 80th, etc.)
- Zero % / zero count
- Negative % / negative count
- Standard deviation
- True / false %
And more!
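For intuition, here’s a toy sketch of the kind of check a distribution monitor automates: profile a numeric column, compare it to a historical profile, and flag metrics that move too far. This is illustrative Python/pandas only, not Monte Carlo’s implementation, and the 25% tolerance is invented; a real monitor learns what “too far” means from your data’s history.

```python
import pandas as pd

def numeric_profile(col: pd.Series) -> dict[str, float]:
    """Summarize a numeric column with the kinds of metrics listed above."""
    return {
        "mean": col.mean(),
        "min": col.min(),
        "max": col.max(),
        "p20": col.quantile(0.20),
        "p80": col.quantile(0.80),
        "zero_pct": (col == 0).mean(),
        "negative_pct": (col < 0).mean(),
        "std": col.std(),
    }

def profile_drift(current: pd.Series, baseline: pd.Series,
                  tolerance: float = 0.25) -> list[str]:
    """Flag metrics that moved more than `tolerance` (25% here) from the baseline profile."""
    curr, base = numeric_profile(current), numeric_profile(baseline)
    alerts = []
    for metric, base_value in base.items():
        denom = abs(base_value) if base_value else 1.0  # avoid divide-by-zero
        change = abs(curr[metric] - base_value) / denom
        if change > tolerance:
            alerts.append(f"{metric}: {base_value:.3g} -> {curr[metric]:.3g}")
    return alerts
```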
That rule you didn’t think of
For all the rules data analysts write, there’s a rule that goes forgotten. It’s the rule you don’t think to write because it applies to a situation you didn’t think to look for.
You might write a rule saying that “currency can never be negative,” but would you write a rule to catch the specific currency calculation error that creates those negative numbers in your revenue column in the first place? Probably not, right?
Monitors that cover ‘unknown unknown’ issues are some of the most important for data teams to create, and they’re also the checks that most often get missed.
Automated monitors for ‘unknown unknowns’ aim to address issues across your entire data pipeline, not just the areas covered by specific tests—issues that even the most comprehensive testing can’t account for. Issues like:
- A distribution anomaly in a critical field that causes your Tableau dashboard to malfunction
- A JSON schema change made by another team that turns 6 columns into 600
- A code change that causes an API to stop collecting data feeding an important new product
You can’t write rules for issues you don’t anticipate. That’s why machine learning monitors are so important. By automatically monitoring for ‘unknown unknown’ anomalies at scale, you can identify issues before they break your pipelines – and preserve precious stakeholder trust in the process.
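As one illustration of how broad, automated checks catch issues nobody wrote a rule for, consider the schema-change example above. A hypothetical monitor that simply compares every table’s current columns against the last schema it observed would flag the 6-columns-into-600 change without anyone anticipating it. (This is a toy Python sketch, not Monte Carlo’s product.)

```python
import pandas as pd

def schema_change(current: pd.DataFrame, baseline_columns: list[str]) -> list[str]:
    """Flag columns that appeared or disappeared since the last observed schema."""
    current_cols, baseline_cols = set(current.columns), set(baseline_columns)
    alerts = [f"new column: {c}" for c in sorted(current_cols - baseline_cols)]
    alerts += [f"dropped column: {c}" for c in sorted(baseline_cols - current_cols)]
    return alerts
```

Applied automatically to every table, checks like this (alongside freshness, volume, and distribution signals) cover the cases you never thought to test.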
The new (automated) rules of data observability
Automated data quality solutions like data observability platforms ensure you’re the first to know about bad data, and then arm you with the incident management resources to do something about it.
When data breaks, automated data observability platforms like Monte Carlo triage the issue and alert the right stakeholder, like your data engineering team, with root-cause insights to understand and resolve it fast.
Plus, a best-of-breed data quality and observability solution like Monte Carlo requires minimal configuration and practically no threshold-setting, so you can get out-of-the-box AI monitors up and running in no time.
As a data analyst, you have more than enough to do. Don’t let the work of writing manual data quality rules weigh you down. Let AI monitors do the heavy lifting so you can focus on the work that really matters, like generating real, needle-moving insights for your stakeholders.
We can’t guarantee data observability will protect you from every ad-hoc request, but we’ll certainly try.
To learn more about how you can replace your manual rules with scalable AI monitors, reach out to our team.
Our promise: we will show you the product.