Best Data Observability Tools (with RFP Template and Analyst Reports)
Data observability has been one of the hottest emerging data engineering technologies of the last several years.
This momentum shows no signs of stopping, with data quality and reliability becoming central topics in the data product and AI conversations taking place across organizations of all types and sizes.
Benefits of a data observability solution include:
- Increasing data trust and adoption
- Mitigating operational, reputational, and compliance risks associated with bad data
- Boosting revenue
- Reducing time and resources associated with data quality (more efficient DataOps)
Following Monte Carlo’s creation of the data observability category in 2019, alternative data observability tools have entered the market at various levels of maturity.
In this post, we will share analyst reports and the core evaluation criteria we see organizations use when ranking data observability solutions. We’ll also provide a sample RFP template.
Finally, we’ll share our perspective on alternative data observability vendors, from relative newcomers to open-source stopgaps, and when to choose Monte Carlo instead.
What are data observability tools? And why do they matter?
The data estate has changed, providing data leaders with greater value – and greater complexity.
Now, enterprise teams are responsible for more:
- Diverse data sources
- Layers of data transformations
- Interdependent technologies
- Consumers
- Analytical, AI, or ML data products
Delivering trusted data in this context is difficult, especially with legacy data quality approaches. Similar to how software engineers turned to observability to deliver reliable applications, organizations must embrace data observability to deliver reliable data products.
Traditional data quality solutions evaluate data across the six dimensions – completeness, consistency, uniqueness, accuracy, validity, and timeliness – at a moment in time. But:
- The data upstream might be incorrect or late
- Issues may be rare, conditional, or persistent
- The root cause may be resolved quickly…or not
Your team needs a modern approach designed to reduce the amount of time data is not fit for use.
This approach is called data observability, and it’s the only way organizations can resolve incidents at scale to achieve data trust and reduce risk.
Data observability combines data quality detection features with powerful resolution capabilities, giving you full visibility into your entire data estate: your data, your systems, and your code. Anomalous data is correlated to its root cause for lightning-fast resolution. For example:
- DATA – what source sent us bad data?
- SYSTEM – what specific job failed?
- CODE – what code change led to the anomaly?
By providing AI-powered monitoring and incident management across your entire data estate, Monte Carlo enables you to take the next step on your data quality journey.
Key features of data observability tools: the analyst perspective
Let’s take a look at what some leading industry analysts have identified as the key evaluation criteria for data observability tools.
Gartner
While Gartner hasn’t produced a data observability Magic Quadrant or a report ranking data observability vendors, they have named it one of the hottest emerging technologies and placed it on the 2023 Data Management Hype Cycle.
They say data and analytics leaders should, “Explore the data observability tools available in the market by investigating their features, upfront setup, deployment models and possible constraints. Also consider how it fits to overall data ecosystems and how it interoperates with the existing tools.”
We anticipate Gartner will continue to evolve and add to their guidance on data observability tools this year.
GigaOm
GigaOm’s Data Observability Radar Report covers the problem data observability tools look to solve, saying: “Data observability is critical for countering, if not eliminating, data downtime, in which the results of analytics or the performance of applications are compromised because of unhealthy, inaccurate data.”
The authors include a list of key criteria and a list of evaluation metrics.
Key criteria include:
- Schema change monitoring
- Data pipeline support
- AIOps
- Advanced data quality
- Edge capabilities
Evaluation metrics:
- Contextualization
- Ease of connectability or configurability
- Security and compliance
- BI-like experience
- Reusability
The analyst’s take at the conclusion of the report also highlights the importance of end-to-end coverage and root cause analysis–two features we believe are essential factors for evaluating data observability tools as well.
Ventana
The Ventana Research Buyers Guide does a good job capturing the essence of these tools saying, “data observability tools monitor not just the data in an individual environment for a specific purpose at a given point in time, but also the associated upstream and downstream data pipelines.”
They also ranked vendors along the standard dimensions of SaaS platforms:
- Adaptability
- Capability
- Manageability
- Reliability
- Usability
- Customer Experience
- TCO/ROI
- Validation
But product capability is the highest weighted, at 25% of the evaluation. Here Ventana really hit the nail on the head, saying that the best data observability solutions go beyond detection to focus on resolution, prevention, and other workflows:
“The research largely focuses on how vendors apply data observability and the specific processes where some specialize, such as the detection of data reliability issues, compared to resolution and prevention. Vendors that have more breadth and depth and support the entire set of needs fared better than others. Vendors who specialize in the detection of data reliability issues did not perform as well as the others.”
G2 Crowd
G2 was one of the earliest non-vendor resources to put together a credible list of data observability vendors and a definition for the category. They say:
To qualify for inclusion in the G2 Crowd data observability category, a product must:
- Proactively monitor, alert, track, log, compare, and analyze data for any errors or issues across the entire data stack
- Monitor data at rest and data in motion, without requiring data extraction from the current storage location
- Connect to an existing stack without any need to write code or modify data pipelines
Vendors are evaluated by verified users of the product across a list of organizational and product-specific capabilities, including:
- Quality of support
- Ease of admin
- Ease of use
- Integrations
- Alerting
- Monitoring
- Product direction
- Automation
- Single pane view
Key features of data observability tools: our perspective
At its core, a data observability solution takes data quality management to the next level by operationalizing the detection, triage, and resolution of data issues through a mix of AI-powered and customizable data quality tools.
Comprehensive data observability tools will integrate across your entire data stack and provide coverage for issues like data freshness, volume anomalies, and schema changes from CI/CD in addition to monitoring your data directly on your most critical tables for things like NULL rates, duplicates, and values outside a normal distribution.
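For intuition, here is a minimal sketch of the kind of learned-threshold volume check such tools automate at scale. The `volume_anomaly` helper and its simple z-score baseline are illustrative assumptions, not any vendor’s actual model:

```python
from statistics import mean, stdev

def volume_anomaly(history: list[int], today_count: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates sharply from recent history.

    `history` holds the table's daily row counts (gathered from warehouse
    metadata); a simple z-score stands in for a learned model.
    """
    recent = history[-30:]  # learn the baseline from the last 30 days
    if len(recent) < 7:
        return False  # not enough history to learn a reliable baseline
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return today_count != mu
    return abs(today_count - mu) / sigma > z_threshold

# A table that normally lands ~10,000 rows/day suddenly lands 1,200:
print(volume_anomaly([9800, 10200, 10050, 9900, 10100, 9950, 10000], 1200))  # True
```

The point of automation is that checks like this are set, tuned, and updated for every table without anyone writing them by hand.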
At Monte Carlo, our customers’ needs are never far from our mind when we think about developing data observability as a category and our own feature roadmap. Considering the needs of our customers and the data industry at large, here are the key evaluation criteria we feel are the most critical to a comprehensive and future-proof data quality and observability solution:
- Enterprise readiness
- End-to-end coverage
- Seamless incident management
- Integrated data lineage
- Comprehensive root cause analysis
- Quick time-to-value
- AI observability
So, now that we’ve got those points in mind, let’s dive into each of these criteria in a bit more detail.
Enterprise readiness
Data is like fashion – it’s ever evolving. You don’t need another vendor; you need a data observability provider that can serve as a strategic advisor – someone who will be innovating alongside you for the long haul and ensuring your operationalization is informed by best practices.
Vendors will promise the world, but can they deliver if they are 12 people in a garage? Will they be around next year?
These are important questions to answer through customer reference calls and an understanding of their overall maturity. As we saw above, these dimensions are also well covered during analyst reviews.
Some key areas to evaluate for enterprise readiness include:
- Security: Do they have SOC 2 certification? Robust role-based access controls?
- Architecture: Do they offer multiple deployment options for the level of control you need over the connection? How does the solution impact data warehouse/lakehouse performance?
- Usability: This can be subjective and superficial during a committee POC, so it’s important to balance it with the perspective of actual users. Otherwise you might over-prioritize how pretty an alert appears versus the aspects that will save you time, such as the ability to bulk-update incidents or deploy monitors-as-code.
- Scalability: This is important for small organizations and essential for larger ones. We all know the nature of data and data-driven organizations lends itself to fast, and at times unexpected, growth. What are the vendor’s largest deployments? Have they proven their ability to grow alongside their customer base? Other key features here include the ability to support domains, reporting, change logging, and more. These typically aren’t flashy features, so many vendors don’t prioritize them.
- Support: Data observability isn’t just a technology; it’s an operational process. The maturity of the vendor’s customer success organization can impact your level of success, as can support SLAs (the vendor doesn’t even have support SLAs? Red flag!).
- Innovation history and roadmap: The data world changes rapidly, and as we enter the AI era you need a partner with a history of being at the forefront of these trends. Fast followers are often anything but, with comparable features shipped six months to a year later. That’s 25 in chief data officer years! Cloud-native solutions often have a leg up here.
End-to-end coverage
The true power of data observability tools lies in their ability to integrate across the layers of the modern data platform to create end-to-end visibility into your critical pipelines.
Don’t fish with a line; shoot fish in a barrel. (Yes, we have reasoning behind this convoluted analogy.)
For years, data testing – whether hardcoded checks, dbt tests, or some other type of unit test – was the primary mechanism for catching bad data.
While still relevant in the right context, the problem with data testing as a complete practice is that you couldn’t possibly write a test for every single way your data could break. No matter how well you know your pipelines, unknown unknowns will still be a fact of life. And even if you could identify every potential break (which you can’t), you certainly wouldn’t be able to scale your testing to account for each one as your environment grew. That leaves a lot of cracks in your pipelines to fill.
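The scaling problem is easy to see in code. Below is a sketch of hand-written tests for a hypothetical orders feed: each assertion covers exactly one failure mode someone anticipated, and everything else slips through.

```python
def test_orders(row: dict) -> list[str]:
    """Hand-written data tests for a hypothetical orders record.
    Each check covers one failure mode someone thought of in advance."""
    failures = []
    if row.get("order_id") is None:
        failures.append("order_id is NULL")
    if row.get("amount", 0) < 0:
        failures.append("amount is negative")
    if row.get("currency") not in {"USD", "EUR", "GBP"}:
        failures.append("unexpected currency")
    # Nothing here catches the breakage nobody anticipated: a stuck
    # upstream job, a silent schema change, or a duplicated load.
    return failures

print(test_orders({"order_id": 42, "amount": -5.0, "currency": "USD"}))
# ['amount is negative']
```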
Data observability tools should offer both broad, automated metadata monitoring across all tables in your selected schemas (including new tables as they are added) and deep monitoring for issues inherent in the data itself.
A strong data observability tool will also integrate widely and robustly across your modern data platform, from ingestion to BI and consumption, and enable quick time-to-value through simple plug-and-play integrations.
Be sure to verify that your chosen solution offers integrations for each of the layers you’ll need to monitor in order to validate the quality of your data products, as well as integrations into existing workflows with tools like Slack, Microsoft Teams, Jira, and GitHub. Speaking of…
Seamless incident management
Most data teams we talk to initially have a detection-focused mindset when it comes to data quality, likely formed by their experience with data testing.
The beauty of data observability is that not only can you catch more meaningful incidents, but the best solutions will also include features that improve and accelerate your ability to manage incidents. Bad data is inevitable and having tools to mitigate its impact provides tremendous value.
There are a few areas to evaluate when it comes to incident management:
- Impact analysis: How do you know if an incident is critical and requires prioritizing? Easy—you look at the impact. Data observability tools that provide automated column-level lineage out-of-the-box will also sometimes provide an impact radius dashboard to illustrate how far a quality issue has extended from its root. This can help data engineers understand at a glance how many teams or products have been impacted by a particular issue and who needs to be kept informed as it moves through triage and resolution.
- Internal team collaboration: Once an alert has triggered there needs to be a process for assigning and potentially transferring ownership surrounding the incident. This may involve integrating with external ticket management solutions like JIRA or ServiceNow, or some teams may choose to manage the incident lifecycle within the data observability tool itself. Either way, it’s helpful to have the flexibility to do both.
- Proactive communication with data consumers: When consumers use bad data to make decisions, you lose data trust. Data observability solutions should provide a means of proactively communicating the current health of particular datasets or data products to data consumers (a minimal sketch of such a notification follows this list).
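Here is that sketch: a bare-bones health notification posted to a consumer-facing Slack channel. It assumes a Slack incoming-webhook URL you have already configured; a real data observability platform handles this routing and formatting for you.

```python
import json
from urllib import request

def notify_consumers(webhook_url: str, dataset: str, status: str, impact: str) -> None:
    """Post a dataset-health update to a consumer-facing Slack channel.
    Assumes `webhook_url` is a pre-configured Slack incoming webhook."""
    message = (
        f":rotating_light: *{dataset}* is currently *{status}*\n"
        f"Known impact: {impact}\n"
        "Please hold off on downstream use until the incident is resolved."
    )
    req = request.Request(
        webhook_url,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)

# Hypothetical usage:
# notify_consumers(WEBHOOK_URL, "analytics.daily_revenue", "degraded",
#                  "3 dashboards and 2 downstream tables")
```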
Comprehensive root cause analysis
What is your standard root cause analysis process? Does it feel disjointed hopping across multiple tools? How long does it take to resolve an issue?
Data can go bad in a lot of ways. A comprehensive data observability tool should help you identify if the root cause is an issue with the data, system, or code.
For example, the data can be bad from the source. If an application bug caused you to start seeing abnormally low sales prices on orders from New York, that would be considered a data issue.
Alternatively, a data environment is made up of a panoply of irreducibly complex systems that all need to work in tandem to deliver valuable data products to your downstream consumers. Sometimes the issue is hidden within this web of dependencies. If a failed Airflow job caused your data to arrive late or incomplete, the real culprit wouldn’t be the data but a system issue.
Or if a bad dbt model or data warehouse query change ultimately broke the data product downstream, that would be considered a code issue.
A thorough data observability tool would be able to accurately identify these issues and provide the proper context to help your team remediate each at its source.
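To picture what that correlation might look like mechanically, here is a toy sketch: given the time an anomaly was detected, gather the data, system, and code events that touched the table in a window just before it. The event records (including the dbt model and PR number) are hypothetical; real platforms derive them automatically from query logs and integrations.

```python
from datetime import datetime, timedelta

# Hypothetical event log spanning the three root cause categories.
events = [
    {"kind": "code", "at": datetime(2024, 5, 1, 9, 15), "table": "orders",
     "detail": "dbt model orders.sql changed in PR #412"},
    {"kind": "system", "at": datetime(2024, 5, 1, 9, 40), "table": "orders",
     "detail": "Airflow task load_orders retried twice"},
    {"kind": "data", "at": datetime(2024, 4, 28, 2, 0), "table": "orders",
     "detail": "upstream source schema changed"},
]

def candidate_root_causes(table: str, detected_at: datetime, window_hours: int = 24) -> list[dict]:
    """Return the data/system/code events that touched `table` shortly
    before the anomaly was detected, most recent first."""
    cutoff = detected_at - timedelta(hours=window_hours)
    hits = [e for e in events
            if e["table"] == table and cutoff <= e["at"] <= detected_at]
    return sorted(hits, key=lambda e: e["at"], reverse=True)

for e in candidate_root_causes("orders", datetime(2024, 5, 1, 10, 0)):
    print(f'{e["kind"]}: {e["detail"]}')
# system: Airflow task load_orders retried twice
# code: dbt model orders.sql changed in PR #412
```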
Integrated column-level data lineage
Lineage is a dependency map that allows you to visualize the flow of data through your pipelines and simplify root cause analysis and remediation.
While a variety of tools like dbt will provide lineage mapping at the table level, very few extend that lineage into the columns of a table or show how data flows across all of your systems. Sometimes called “field-level lineage,” column-level lineage maps the dependencies between columns, tables, and data products so you can see visually how data moves through your pipelines.
It’s also important that your data lineage and data incident detection features work as an integrated solution within the same platform. A key reason for this is that lineage grouped alerting not only reduces alert fatigue, but helps tell a more cohesive story when an event impacts multiple tables.
Rather than getting 12 jumbled chapters that may be part of one or two stories, you are getting an alert with the full story and table of contents.
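Under the hood, impact analysis over lineage is just graph reachability. Here is a minimal sketch with a hand-built column-level edge map; a real tool derives these edges automatically by parsing query logs.

```python
from collections import deque

# Hypothetical column-level lineage: each column maps to the downstream
# columns that read from it.
lineage = {
    "raw.orders.amount": ["stg.orders.amount_usd"],
    "stg.orders.amount_usd": ["mart.revenue.daily_total", "mart.finance.gross"],
    "mart.revenue.daily_total": ["bi.exec_dashboard.revenue_tile"],
}

def blast_radius(column: str) -> set[str]:
    """Breadth-first search over lineage edges to find every downstream
    column an incident on `column` could affect."""
    impacted, queue = set(), deque([column])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

print(sorted(blast_radius("raw.orders.amount")))
# ['bi.exec_dashboard.revenue_tile', 'mart.finance.gross',
#  'mart.revenue.daily_total', 'stg.orders.amount_usd']
```

Alerts on all four impacted columns can then be grouped under the single upstream incident rather than firing as separate, disconnected notifications.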
Quick time-to-value
Data observability is intended to reduce work—not to add more.
If a data observability tool is providing the right integrations and automated monitors for your environment out-of-the-box, it will be quick to implement and deliver near immediate time-to-value for data teams.
A data observability solution that requires more than an hour to set up, or more than a couple of days to start delivering value, is unlikely to deliver the data quality efficiencies a growing data organization requires to scale data quality long-term.
AI ready
Building differentiated, useful generative AI applications requires first-party data. That means data engineers and high-quality data are integral to the solution.
Most data observability solutions today will monitor the data pipelines powering RAG or fine-tuning use cases – they are essentially the same as the pipelines powering other data products such as dashboards, ML applications, or customer-facing data.
However, the generative AI ecosystem is evolving rapidly, and your data observability vendor needs to be not just monitoring this evolution but helping to lead the charge. That means features like observability for vector databases and streaming data sources, and capabilities that keep pipelines as performant as possible.
Data Observability RFP Template
Not every data team needs to issue an RFP or undergo a proof of concept. But some organizations find it helpful to organize their evaluations this way. This template is inspired by RFPs we have seen from customers and is an amalgamation of the very best.
**Data Observability Request for Proposal**

| Section | Criteria |
| --- | --- |
| Company Background | What is your experience in the industry? |
| | What references and case studies can you provide from similar projects? |
| | What is your vision for the future of data observability? |
| | What are your planned enhancements and new features in the next quarter? |
| Security | Is your product delivered as a Software-as-a-Service (SaaS) offering? |
| | Can you deploy the agent in our AWS/GCP/Azure instance if desired? |
| | Can you provide proof of SOC 2 Type II certification? |
| | What options do you provide for authentication? |
| | Do you have role-based access controls? |
| | Do you provide an API for retrieving security events to import into a SIEM? |
| | Are third-party penetration tests available? |
| | Is PII filtering available? |
| | What data is exported from our environment? Is it encrypted? |
| Configuration and Management | What functionality is available via API? |
| | Can monitors be created via API? |
| | Can audiences be created and configured via API? |
| | Is a command line interface available to simplify API interactions? |
| | Are any SDKs available, e.g., for use in Python scripts or data science notebooks? |
| | Can monitoring and alerting be configured from within your Airflow jobs, without breaking a workflow? |
| | Can monitoring be defined in YAML? |
| | Can alert routing and audiences be defined in YAML? |
| | How will your product impact our data warehouse/lake/lakehouse performance and compute costs? |
| Integrations | What cloud-native data warehouse technologies does your platform integrate with? |
| | What cloud-native data lakehouse technologies does your platform integrate with? |
| | What other database technologies does your platform integrate with? |
| | What query engines and metastores does your platform integrate with? |
| | What ETL technologies does your platform integrate with? |
| | What catalog technologies does your platform integrate with? |
| | What BI technologies does your platform integrate with? |
| | What collaboration channels does your platform integrate with? |
| Support & CS | Do you provide web-based self-support resources? |
| | What is your support SLA? |
| | Are your releases backwards compatible? |
| | Do you charge additional fees for providing product support? |
| | What training, onboarding, and ongoing support is available? |
| Pricing Structure | What is the basis of licenses for the product? |
| | Is on-demand/usage-based pricing available? |
| ML Monitoring & Anomaly Detection | What kinds of intelligent features are implemented that allow for finding anomalies without the need for manual input? |
| | Can anomaly detection be automated to cover tables upon creation based on schema, database, domain, tag, and table name? |
| | Can anomaly detection be automated to cover specific pipelines or data products, based on the selection of a downstream asset? |
| | Are specific anomaly detection models applied based on the table's pattern type classification (e.g., streaming_table / weekend_pattern / multimodal_update_pattern)? |
| | Approximately how many production tables are your ML monitors deployed across currently? |
| | Are there mechanisms for adjusting the sensitivity of the anomaly detection models? |
| | Can exclusion windows be set for anomaly detection models to ignore expected irregular or seasonal data patterns? |
| | Can exclusion windows be automatically set for major holidays and other seasonal events? |
| | Can machine learning anomaly detection monitors be deployed on specific segments of a table for one or multiple metrics? |
| | Can machine learning anomaly detection monitors be adjusted to detect anomalies based on hourly, daily, or all-record aggregations? |
| | Can machine learning anomaly detection monitors be deployed without code for multiple variables (condition X AND condition Y = true)? |
| | Can machine learning anomaly detection monitors be scheduled? |
| | Can machine learning anomaly detection monitors be set to execute whenever a table is updated? |
| | Are failure notifications sent when a monitor fails to execute successfully? |
| | Can monitoring cost be allocated to specified query engines within the same warehouse? |
| | Are machine learning monitor configurations automatically recommended by the platform? |
| | Does the monitor creation process mandate a name and description? |
| | Does the product provide machine learning anomaly detection models to detect data freshness anomalies? |
| | Does the product provide machine learning anomaly detection models to determine when the number of rows added to a table is too high or low based on historical patterns? |
| | Does the product provide machine learning anomaly detection models to detect when a table's size remains unexpectedly unchanged? |
| | Can volume anomaly detection be run on external views or tables? |
| | Can custom anomaly detection monitors be built to monitor custom metrics with automatically set and updated thresholds? |
| | Does the product automatically provide schema change detection (column name, data type) for selected datasets? |
| | Does the product allow users to monitor nested JSON within a given field in a table? |
| | Does the product provide machine learning anomaly detection for NULLs or missing values? |
| | Does the product provide machine learning anomaly detection for the percentage of unique values in a field? |
| | Does the product provide machine learning anomaly detection for distribution using metrics such as min, max, average, stddev, variance, skew, and percentiles? |
| | Does the product provide machine learning anomaly detection to alert when a column receives new values or an anomalous distribution of values? |
| | Does the product provide machine learning anomaly detection for validation, such as detecting an anomaly in the number of values within a column that are not in a standard format (e.g., email, social security number, US state code)? |
| | Does the product provide machine learning anomaly detection for timestamp fields, including detecting anomalies in the percentage of values in a field that are in the future or past? |
| Data Validation & Rules | Does the platform offer pre-built data validations or rule templates where thresholds can be defined and deployed without code? |
| | Can custom data validations and rules with manually defined thresholds be created and deployed with SQL? |
| | Can data validations and rules be tested prior to deployment? |
| | Can complex data validations, with alert conditions featuring multiple variables, be created and deployed without code? |
| | Can data validations and rules be executed based on a schedule or a manual trigger? |
| | Can value-based thresholds be set to alert when a single numerical value returned for a specific column drops below a given value? |
| | Can data quality rules be deployed that compare values across tables? |
| | Does the platform offer the ability to identify records or key fields that are present in one dataset but missing in another (referential integrity)? |
| | Can data quality rules be deployed that compare values across databases, warehouses, or lakehouses (source-to-target checks)? |
| | Can custom SQL rules be generated with AI from within the platform? |
| | Does the tool offer a circuit breaker feature that can stop pipelines when data does not meet a set of quality or integrity thresholds? |
| | Does the platform offer the ability to profile a table interactively and without code? |
| | Does the platform offer the ability to automatically suggest data quality rules or ML monitors based on insights from data profiling and analysis? |
| Assets & Metadata Consolidation | What types of data assets are discoverable within the platform? |
| | Are different table types – such as views, external tables, wildcard tables, and dynamic tables – discoverable within the platform? |
| | Are non-table assets such as BI reports, orchestration jobs, and streams discoverable within the platform? |
| | What general information and metadata is listed within each asset? |
| | Are tags from other data systems, such as the data warehouse or ETL tools, surfaced and inherited within the platform? |
| | Are usage statistics for each table – including reads/writes per day, number of users, latest updates, and number of dependencies – automatically discovered and surfaced within the platform? |
| | Is the schema, table type, and database name surfaced for each asset? |
| | Is field-level lineage surfaced for each asset? |
| | Is table-level lineage surfaced for each asset? |
| | Does the platform's data lineage show ingestion, orchestration, and transformation workflows between lineage nodes? |
| | Does the platform surface the type and execution history of monitors on a specific asset? |
| | Are the number of downstream reports, and overall upstream and downstream dependencies, surfaced for each asset? |
| | Is the historical update cadence, including time since last row count change, surfaced for each asset? |
| | Is the row count history surfaced for each asset? |
| | Are the associated query logs for each asset surfaced? |
| | Is the performance of ETL systems acting on a specific asset surfaced? |
| Alerting & Notifications | Can alerts be muted? |
| | Can notifications for rules that remain in breach be adjusted to alert every time, after a certain number of runs, or only if the count of breached rows changes? |
| | What alert channels are supported? |
| | How are alerts routed to specific channels and audiences? |
| | Can alerts be routed to specific channels based on alert type, domain, asset importance, tag, table name, or dataset? |
| | Can alerts be sent in real time or in a daily digest? |
| | Can alert notifications have pre-set priority levels? |
| | Can alert notifications have pre-set descriptions, including the ability to tag specific owners or teams? |
| | What kind of guidance or initial information do system alerts provide? |
| | For breached data validations or rules, are the description of the validation, the number of breached rows, and the last breach date provided within the alert notification? |
| | Do alert notifications indicate the importance of an asset for quick prioritization and triage? |
| | Are triggered alerts with the same dependencies (cascading incidents) automatically grouped within the same thread to improve management and avoid fatigue? |
| | How is ownership of incidents and data assets tracked within the platform? |
| | Can the product automatically infer the priority of a detected issue from factors such as table popularity and consumption at the repository and BI layers? |
| | Are system alerts from ETL tools ingested and consolidated within the platform? |
| Incident Management | Do alerts feature information on the specific downstream reports impacted (automatic impact analysis)? |
| | Can alerts be escalated to incidents with different severity levels within the platform? |
| | Can alert/incident status be assigned and changed within the platform? |
| | What workflows and feeds exist for managing alerts/incidents within the platform? |
| | Can alerts be filtered by status, type, severity, tag, owner, table, schema, database, or audience? |
| | What service/ticket management/workflow integrations exist for managing incidents outside of the platform? |
| | Is the history of data incidents or failed checks on an asset accessible? |
| | Can those incidents be annotated with custom notes? |
| | What data catalog integrations are available? |
| | How can data consumers be proactively alerted to incidents or the current health status of a dataset? |
| Resolution/Root Cause Analysis | Does the tool surface insights that facilitate the discovery of the root cause of a particular data incident? |
| | Does the platform provide segmentation analysis to automatically identify patterns within anomalous records (for example, a spike in NULLs that correlates to values of invoice_status=pending)? |
| | Does the platform provide users with automated suggestions for investigation queries based on how their team researched breaches of that rule in the past? |
| | Does the product integrate multi-system data lineage with anomaly detection to help users pinpoint the origin of an incident? |
| | For machine learning monitors, does the product automatically provide samples of anomalous records? |
| | For data validations, does the product automatically provide samples of breached rows? |
| | Does the product automatically surface system integration failures due to permissioning or credentialing issues? |
| | Does the product leverage metadata to automatically surface job failure alerts from ETL and orchestration systems? |
| | Does the product leverage metadata to surface the performance of orchestration jobs acting on specific assets to provide additional insight for troubleshooting? |
| | Does the product automatically correlate data anomalies to relevant system failures to accelerate root cause analysis? |
| | Does the platform surface the query logs of specific assets to provide additional insight for troubleshooting? |
| | Does the platform automatically correlate anomalies to relevant changes in the query code of the underlying or upstream asset? |
| | Does the platform leverage metadata to correlate anomalies to failed queries? |
| | Does the platform leverage metadata to correlate anomalies to empty queries – queries that executed successfully but did not update or modify any data? |
| | Does the platform provide a unified incident timeline and systems performance dashboard to help users accelerate their root cause analysis? |
| | Does the platform correlate data anomalies to relevant pull requests (PRs) on relevant assets? |
| | Can the platform help prevent incidents by automatically surfacing the impact of a schema change on downstream assets during the pull request process? |
| Reporting | Does the product automatically compile alert and incident metrics – including status and severity – at the domain, dataset, audience, and data product levels? |
| | Does the product contain prebuilt dashboards that compile operational response metrics – including time-to-response and time-to-fixed – at the domain, dataset, audience, and data product levels? |
| | Does the product contain prebuilt dashboards that automatically surface the data health – number of breached data validations and custom rules – of specific tables? |
| | Does the product contain prebuilt dashboards that contain all incident metrics, data health, and operational response metrics across all the tables within a selected dataset (data product pipeline)? |
| | Can metrics be easily exported via API? |
What’s the future of data observability tools?
There’s one critical feature that we didn’t mention earlier that plays a huge role in the long-term viability of a data observability solution: category leadership.
Like any piece of enterprise software, you aren’t just making a decision for the here and now—you’re making a bet on the future as well.
When you choose a data observability solution, you’re making a statement about the vision of that company and how closely it aligns to your own long-term goals. “Will this partner make the right decisions to continue to provide adequate data quality coverage as the data landscape changes and my own needs expand?”
Particularly as AI proliferates, having a solution that will innovate when and how you need it is just as important as what that platform offers today.
Not only has Monte Carlo been named a proven category leader by the likes of G2, Gartner, Ventana, and the industry at large, but with a commitment to supporting vector databases for RAG and helping organizations across industries power the future of market-ready enterprise AI, Monte Carlo has become the de facto leader for AI reliability as well.
There’s no question that AI is a data product. And with a mission to power data quality for your most critical data products, Monte Carlo is committed to helping you deliver the most reliable and valuable AI products for your stakeholders.
Ready to find out how the category leader in data quality and data observability tools can help you get the most out of your data and AI products? Let’s talk!
Our promise: we will show you the product.