6 Common dbt Unit Testing Mistakes
Ah, dbt unit testing. It’s like that friend who always points out when you’ve got spinach in your teeth. Slightly annoying? Maybe. Necessary for avoiding public embarrassment? Absolutely.
Yet even in the most rigorous data engineering environments, mistakes happen. They’re the inevitable byproduct of innovation and ambition. And from these mistakes? Lessons. Valuable, timeless lessons.
But why stumble through the same pitfalls others have already faced? Learn from these six common missteps in dbt unit testing so you can sidestep them in your own data quality journey.
Wait, can’t I avoid mistakes by just skipping dbt unit testing?
While avoiding dbt unit testing might seem like a shortcut, it’s a shortcut most data engineers wouldn’t take. The potential pitfalls and long-term costs of not testing far outweigh the initial effort of setting up and conducting those unit tests.
Just because dbt makes it easier to transform data doesn’t mean you should skip validating the transformations. dbt unit testing ensures every piece of data is accurate, consistent, and adheres to predefined quality standards.
Think of this as lesson zero: don’t skip unit testing.
Assuming dbt unit testing is still on your menu, let’s get into the real pitfalls.
Mistake #1: Neglecting Data Quality Checks
Always prioritize data quality. In dbt unit testing, this means rigorously checking every transformation, every join, every aggregation.
Start with dbt's four built-in generic tests, plus one widely used package test:
- unique: This test checks that a column or a combination of columns in a model has unique values.
- not_null: This makes sure that a specific column doesn’t have any null values.
- relationships: Use this to validate foreign key relationships between tables. It checks that all values in a child table’s foreign key column match values in the parent table’s primary key column.
- accepted_values: This test checks that values in a column match a predefined set of accepted values. It’s useful for fields that should only have specific values, like status fields.
- accepted_range (from the dbt_utils package): This test ensures that numeric values in a column fall within a specified minimum and maximum range. Unlike the four above, it isn't built into dbt itself, but dbt_utils is installed in most projects.
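Before walking through one test end to end, here's a rough sketch of how several of these might look together on a hypothetical orders model. The column names and accepted values are illustrative, and the accepted_range test assumes dbt_utils is installed:

version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: sale_amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0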
For example, here’s how you can set up and run the not_null test:
1. Define the test in your dbt model’s YAML file
Let’s say you have a model named orders and you want to ensure that the order_id column doesn’t have any null values. You would define the test in your schema.yml (or another appropriate YAML file) like this:
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
2. Run the test
After defining the test in the YAML file, you can run it using the dbt CLI:
dbt test
This command will execute all tests defined in your dbt project. If you only want to run tests for the orders model, you can scope the run with a selector:
dbt test --select orders
(On older versions of dbt, the equivalent flag is --models.)
3. Review the results
After running the test, dbt will provide feedback in the terminal. If the not_null test passes, it means there are no null values in the order_id column of the orders model. If the test fails, dbt will provide details about the failure, including the number of records that failed the test.
Incorporating the not_null test and other built-in dbt tests into your data transformation workflow goes a long way toward ensuring that your transformed data meets specific quality standards. But you can also write custom SQL tests to validate any data quality or business rule requirements that the built-in tests don't cover.
Mistake #2: Relying Only on Built-in Tests
Relying on built-in tests exclusively means you’re potentially missing out on addressing specific nuances and intricacies of your data. Every organization has its own unique quirks, business rules, and edge cases, and there’s no guarantee that built-in tests will capture these subtleties.
Imagine an e-commerce platform where users can leave reviews for products they’ve purchased, and each review has a timestamp indicating when it was submitted.
Built-in tests can ensure that each review has a valid user ID and product ID, and that the timestamp is not null. However, due to a software glitch, some reviews begin receiving timestamps set in the future. While these timestamps are technically valid from a data type perspective, they make no sense from a business logic standpoint, and relying solely on built-in tests wouldn't catch this anomaly.
A custom test that checks that review timestamps are not set in the future would be crucial to maintaining data integrity and providing accurate insights into product feedback trends.
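In dbt, that check is a natural fit for a singular test: a SQL file in your tests/ directory that selects the offending rows. Here's a minimal sketch, assuming a reviews model with a submitted_at column (both names are illustrative, and the exact current-timestamp syntax varies slightly by warehouse):

-- tests/no_future_review_timestamps.sql
-- A singular test fails if it returns any rows, so any review
-- claiming to come from the future will flag the test.
select
    review_id,
    submitted_at
from {{ ref('reviews') }}
where submitted_at > current_timestamp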
Mistake #3: Overlooking Business Rule Violations
Let’s consider another hypothetical. Imagine you were tasked with transforming sales data and the business had a rule that any sale below $10 was considered a sample that shouldn’t be included in revenue calculations. Unfortunately, due to a transform error, the transformed data counted every sale, leading to inflated revenue figures.
Understanding the business and its internal policies is just as crucial as understanding the data. It can be easy to become enamored with the technical intricacies of our pipelines and overlook the guiding business rules that underpin our data. These rules, often defined by domain experts and stakeholders, ensure that our data not only looks good on paper but also accurately reflects the realities and constraints of the business.
Ignoring or misinterpreting these rules during transformation can lead to data that, while technically correct, is misleading or even erroneous in a business context.
Always collaborate closely with domain experts and stakeholders to understand the business rules that should guide your transformations. Incorporate these rules as explicit checks in your dbt unit tests. For instance, if there’s a rule about minimum sale amounts, write a test to flag any records that violate this rule.
-- tests/exclude_sample_sales.sql
with sample_sales as (
    select
        order_id,
        sale_amount
    from {{ ref('your_model_name') }}
    where sale_amount < 10
)

select
    order_id,
    sale_amount
from sample_sales
This SQL test returns any records from your model where sale_amount is below $10. Because a dbt singular test fails whenever it returns rows, any sample sales that slip into your revenue model get flagged automatically.
Mistake #4: Inconsistent Testing Across Datasets
I recently heard about a project where an engineer was integrating multiple datasets. His team had a robust set of tests for the primary dataset and assumed they could apply the same tests to a newly integrated one. It wasn't long before discrepancies arose.
The new dataset had different naming conventions, different null value representations, and even some columns the primary dataset lacked entirely. By blindly applying the same tests, the team missed these nuances, leading to a few embarrassing meetings with stakeholders.
You should always tailor your tests to the specific characteristics of each dataset. While it's tempting to create a "one-size-fits-all" testing approach, it's essential to recognize that every dataset is different. When integrating new data, take the time to understand the structure, conventions, and potential pitfalls of that dataset; then adapt or create new tests that address those issues specifically. It's a bit more work upfront, but it ensures that your data quality remains both high and consistent.
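For example, suppose the new dataset encodes missing values as the literal string 'N/A' rather than true nulls, so a copied-over not_null test passes even when the data is effectively missing. A tailored singular test might look like this (the model and column names are hypothetical):

-- tests/no_na_placeholders_vendor_orders.sql
-- The vendor feed uses the string 'N/A' instead of real nulls,
-- so not_null alone won't catch missing emails.
select
    order_id,
    customer_email
from {{ ref('vendor_orders') }}
where trim(customer_email) = 'N/A'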
Mistake #5: Failing to Update Tests with Schema Changes
Imagine you’re working on a project where you decide to split a full_name column into first_name and last_name for better granularity. The transformation goes smoothly, but in the hustle, you forget to update your dbt unit tests. It’s not until a stakeholder points out discrepancies in a report that you realize your tests are still referencing the old full_name column.
Datasets evolve, business needs shift, and schemas change. A common problem is neglecting to update dbt tests when making these schema changes. It’s like getting into a new car but forgetting to adjust the mirrors. The result? Tests that either break or, worse, give a false sense of security.
dbt unit tests are only as good as their alignment with the current schema. Whenever you make schema changes, consider it a two-step process: modify the transformation and update the dbt unit tests.
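Continuing the full_name example, the second step is mechanical: retire the tests on the dropped column and carry them over to the new ones. A sketch, assuming the tests live on a customers model:

version: 2
models:
  - name: customers
    columns:
      # full_name (and its tests) removed with the schema change
      - name: first_name
        tests:
          - not_null
      - name: last_name
        tests:
          - not_null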
Mistake #6: Ignoring Continuous Integration in Testing
While setting up dbt unit testing is a commendable first step, it’s just the beginning. The real magic happens when you integrate these tests into a continuous integration (CI) pipeline.
In this context, CI refers to automatically validating and testing every change to your data pipelines and workflows as it's made.
Ignoring CI means you’re running tests manually or sporadically. This approach can lead to long intervals between test runs, during which data issues can creep in unnoticed. Without CI, you’re essentially driving with a blind spot, unaware of potential data pitfalls until it’s too late.
Select a CI platform like dbt Cloud, CircleCI, or GitHub Actions to automate the execution of your dbt tests.
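As a rough sketch, a GitHub Actions workflow that runs your dbt tests on every pull request might look like the following. The adapter, profile setup, and credential handling are simplified placeholders that will depend on your warehouse:

# .github/workflows/dbt_ci.yml
name: dbt CI
on:
  pull_request:

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dbt
        run: pip install dbt-snowflake  # swap in the adapter for your warehouse
      - name: Run dbt tests
        # Assumes a profiles.yml committed to the repo that reads
        # credentials from environment variables like this one.
        env:
          DBT_PROFILES_DIR: .
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
        run: |
          dbt deps
          dbt test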
Supplement dbt unit tests with data observability platform Monte Carlo
In any data system, there are two types of data quality issues: those you can predict (known unknowns) and those you can’t (unknown unknowns). Your team can have hundreds of unit tests in place to catch a discrete set of known unknowns, but no amount of testing accounts for the unknown unknowns, and hand-writing new tests doesn’t scale as your company’s data ecosystem evolves.
To truly elevate and maintain data quality, you need automated end-to-end coverage with data observability. Data observability helps you catch your unknown unknowns through automated monitoring and rule generation, scaling QA coverage of your data stack and eliminating the need for manual threshold setting. And Monte Carlo is the #1 data observability platform on G2.
Plus, Monte Carlo integrates with both dbt and GitHub, so you can take data observability to your transformations, CI/CD, and beyond.
Learn more about how data observability can supplement your testing. Set up a time to talk to us using the form below.
Our promise: we will show you the product.