Announcing Monte Carlo’s Incident IQ, a Root Cause Analysis Workflow for Data Teams

Francisco Alberini

Francisco is a product manager at Monte Carlo.

Incident IQ gives data engineers and analysts a centralized, all-in-one solution for conducting incident management and root cause analysis on your data pipelines. Video courtesy of Monte Carlo.

Today, we are excited to announce the release of Monte Carlo’s data incident management feature, Incident IQ, a new solution that allows data teams to collaboratively identify, alert on, and remediate the root cause of critical data issues before they impact downstream systems and end users. 

By bringing the workflows and end-to-end incident management capabilities of best-of-breed application performance monitoring solutions to your data pipelines, Monte Carlo can now help data teams achieve full visibility into data health.

Incident IQ is the first fully automated, end-to-end solution that conducts root cause analysis for data issues and changes at each stage of the pipeline, from ingestion in the data warehouse or lake to analytics in your business intelligence dashboards. To help companies eliminate “data downtime” (periods when data is missing, erroneous, or otherwise inaccurate), Incident IQ automatically generates historical insights about your data: it identifies patterns in query logs, runs investigative follow-on queries, and monitors upstream dependency changes to pinpoint exactly what caused the issue.

Here’s how it works: 

Data Incident Alerting & Routing

Monte Carlo data incident alerting in Slack

When a data issue occurs, alerts are routed through Slack, PagerDuty, Opsgenie, email, or webhooks to those who need to know so they can update the incident status for observers and take action. 
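
As a rough illustration of what routing alerts to "those who need to know" can look like, here is a minimal sketch. It is not Monte Carlo's API or implementation: the RoutingRule class, its fields, the numeric severity scale, and the table names are all hypothetical, and the channel names are simply borrowed from the list above.

```python
from dataclasses import dataclass

# Hypothetical rule format: none of these names come from Monte Carlo.
@dataclass(frozen=True)
class RoutingRule:
    table_prefix: str   # rule applies to tables whose name starts with this
    min_severity: int   # rule applies when severity >= this value
    channels: tuple     # e.g. ("slack", "pagerduty")

def route_alert(table, severity, rules):
    """Return the channels that should receive the alert, without duplicates."""
    selected = []
    for rule in rules:
        if table.startswith(rule.table_prefix) and severity >= rule.min_severity:
            for channel in rule.channels:
                if channel not in selected:
                    selected.append(channel)
    return selected

# Illustrative rules only (assumed severity scale: higher is worse).
RULES = [
    RoutingRule("analytics.", 1, ("slack",)),
    RoutingRule("analytics.", 3, ("pagerduty", "slack")),
    RoutingRule("", 2, ("email",)),  # empty prefix matches every table
]

print(route_alert("analytics.orders", 3, RULES))
```

Keeping the rules as plain data means a team could change who is notified without touching the routing logic itself.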

Central UI for Incident Management

Incident IQ gives data teams a centralized UI for troubleshooting and resolving data incidents in real time. Video courtesy of Monte Carlo.

Alerted parties can go into the Monte Carlo application and access the Incident Report via a central UI that provides:

  • An incident timeline that makes it easy to view impacted tables and every action that was taken to manage and resolve the incident.
  • Comprehensive query logs that reveal periodic ETL queries, ad hoc/backfill queries, changes in query patterns, and other hints that help teams identify the root cause of data incidents.
  • Access to sample data, to help users immediately see what the data involved in the incident looks like compared with typical data.
  • ML-generated insights to help pinpoint specific groups and subsets in the data that are contributing to the incident.
  • Automatic, end-to-end lineage that maps impacted downstream BI dashboards to the furthest upstream tables, helping teams narrow the focus of root cause investigations. 
  • Quick links to Monte Carlo’s Lineage, historical incidents, Pipelines, and Catalog features, making it easy to identify, root cause, and fix data issues all from the same interface. 
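
To make the lineage idea in the list above more concrete, the following sketch walks from an impacted dashboard back to its furthest upstream tables. It assumes a hand-written dependency map; the asset names and the furthest_upstream helper are hypothetical and do not reflect how Monte Carlo builds or stores lineage.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it reads from directly.
UPSTREAM = {
    "dashboard.revenue": ["analytics.orders_daily"],
    "analytics.orders_daily": ["staging.orders", "staging.customers"],
    "staging.orders": ["raw.orders"],
    "staging.customers": ["raw.customers"],
    "raw.orders": [],
    "raw.customers": [],
}

def furthest_upstream(asset, upstream):
    """Breadth-first walk from an asset to the sources that have no parents."""
    seen = {asset}
    queue = deque([asset])
    roots = []
    while queue:
        node = queue.popleft()
        parents = upstream.get(node, [])
        if not parents:
            roots.append(node)  # nothing further upstream of this asset
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(roots)

print(furthest_upstream("dashboard.revenue", UPSTREAM))
```

Starting a root cause investigation at these root tables, rather than at the dashboard, is one way to narrow the search.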

Communication & Collaboration

Monte Carlo root cause incident summary
Incident IQ gives data engineers quick updates on the status of incidents as they evolve, including number of events, key assets impacted, owners, and incident severity level. Image courtesy of Monte Carlo.

Once a root cause (or several!) has been identified, incident managers can use Incident IQ to provide updates on the state of the issue, triage it, and collaborate to resolve incidents. Features include:

  • An incident status bar that allows data engineers and analysts to mark the status of the incident as investigating, fixed, expected, no action needed, or resolved, depending on the severity of the issue, and to assign incident owners. When a user changes status, owner, or severity, an additional entry will automatically be captured on the incident’s timeline for postmortems and future learnings.
  • Automatic runbooks and workflows to make the incident resolution and triaging process easy, fast, and collaborative between data engineers and analysts. 
  • Real-time notification of incident status across relevant team channels, including Slack, PagerDuty, Opsgenie, email, and webhooks. 
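
The status-tracking behavior described above (every change to status, owner, or severity adds a timeline entry) can be sketched roughly as follows. The status labels are taken from the text; the Incident class, its fields, the numeric severity, and the default status are assumptions made for illustration, not Monte Carlo's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Status labels come from the text; everything else here is hypothetical.
STATUSES = {"investigating", "fixed", "expected", "no action needed", "resolved"}
TRACKED_FIELDS = ("status", "owner", "severity")

@dataclass
class Incident:
    incident_id: str
    status: str = "investigating"  # assumed default
    owner: str = ""
    severity: int = 1
    timeline: list = field(default_factory=list)

    def update(self, user, **changes):
        """Apply changes to tracked fields and log one entry per real change."""
        for name, new_value in changes.items():
            if name not in TRACKED_FIELDS:
                raise ValueError(f"cannot update field: {name}")
            if name == "status" and new_value not in STATUSES:
                raise ValueError(f"unknown status: {new_value}")
            old_value = getattr(self, name)
            if old_value == new_value:
                continue  # no change, so nothing to record
            setattr(self, name, new_value)
            self.timeline.append({
                "at": datetime.now(timezone.utc),
                "user": user,
                "field": name,
                "old": old_value,
                "new": new_value,
            })

incident = Incident("inc-1")
incident.update("ana", status="fixed", owner="ana")
for entry in incident.timeline:
    print(entry["field"], entry["old"], "->", entry["new"])
```

Recording the old and new values on each entry is what makes the timeline useful for a postmortem later on.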

Incident Resolution & Prevention

Monte Carlo schema changes
Incident IQ gives teams a historical log of past incidents, filtered by severity, owner, pipeline, team, and more. Image courtesy of Monte Carlo.

After the incident has been resolved, Incident IQ will alert relevant stakeholders and record vital information about the issue to help data engineering teams prevent future incidents. 

  • Incident trends: Metrics related to each incident are easily available within the UI to help teams track total incidents by severity, owner, pipeline, team, and more.
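
As a simple picture of what tracking total incidents by severity, owner, or pipeline involves, the sketch below counts a handful of made-up incident records along one dimension at a time. The records, keys, and the count_by helper are hypothetical and are not drawn from Monte Carlo's product.

```python
from collections import Counter

# Hypothetical incident records; the keys and values are illustrative only.
INCIDENTS = [
    {"severity": "high", "owner": "ana", "pipeline": "orders"},
    {"severity": "low", "owner": "ben", "pipeline": "orders"},
    {"severity": "high", "owner": "ana", "pipeline": "billing"},
]

def count_by(incidents, dimension):
    """Count incidents grouped by one dimension, such as severity or owner."""
    return Counter(incident[dimension] for incident in incidents)

for dimension in ("severity", "owner", "pipeline"):
    print(dimension, dict(count_by(INCIDENTS, dimension)))
```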

Customers have already benefited from the rich insights, incident alerting, and root cause analysis capabilities of Incident IQ. Here’s what they have to say: 

  • “Incident IQ is really nice!” – data engineer at leading insurtech startup
  • “I’m seeing the new incident page and love it!” – Head of Data Engineering at Fortune 50 food & beverage company
  • “To troubleshoot a problem, I’d love to see all the impacted tables, their query logs, and any of their past issues we’ve looked into. Now, we have that all in one place!” – data engineer at 2,000-employee e-commerce company

Availability

Monte Carlo’s Incident IQ is currently available for qualified organizations. Be sure to check out our Live Product Demo on July 15, 2021 at 12:00 p.m. ET / 9:00 a.m. PT to learn more.

Interested in learning more about Incident IQ and Monte Carlo’s end-to-end Data Observability Platform? Book a time to speak with us using the form below.