Data Engineering
December 1, 2025 · 6 min read

Data Quality: The Silent Killer of ML Projects

85% of ML projects fail, and bad data is the #1 cause. Here's how to build data quality into your pipeline from day one.

Data Quality
Machine Learning
MLOps
Data Engineering
Best Practices
Dr. Jody-Ann Jones

Founder & CEO, The Data Sensei


85% of machine learning projects fail. Not "shipped late" or "underwhelming"—actually fail. Never make it to production.

And the cause isn't what you'd expect. It's not algorithm selection, model architecture, or lack of compute. It's data quality.

Your Model Isn't Wrong. Your Data Is.

I've seen it dozens of times. A team spends months building a sophisticated model, only to discover:

  • The training data had duplicate records (inflating accuracy)
  • Null values were silently converted to zeros (biasing predictions)
  • Labels were inconsistently applied (model learned noise)
  • Production data looked nothing like training data (model collapsed)

The model wasn't wrong. The data was.

What "Data Quality" Actually Means

Data quality isn't one thing—it's a constellation of properties:

| Dimension | Definition | Example Failure |
| --- | --- | --- |
| Completeness | No missing values | Customer age is null for 40% of records |
| Consistency | Same thing represented same way | "USA", "US", "United States" |
| Accuracy | Values reflect reality | Revenue in cents stored as dollars |
| Timeliness | Data is current | Using 6-month-old customer segments |
| Validity | Values within expected ranges | Age = -5 or 200 |
| Uniqueness | No unintended duplicates | Same transaction recorded twice |

Every ML pipeline needs checks for each dimension.
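
To make that concrete, here's a minimal sketch of what such checks can look like in plain pandas before you reach for dedicated tooling (the dataset and column names are illustrative; the sections below cover purpose-built tools):

import pandas as pd

df = pd.read_csv("customers.csv")  # illustrative dataset and column names

# Completeness: flag columns with excessive nulls
null_rates = df.isna().mean()
high_nulls = null_rates[null_rates > 0.05]
assert high_nulls.empty, f"High null rates:\n{high_nulls}"

# Consistency: only canonical country codes allowed
assert df["country"].isin(["US", "CA", "GB"]).all(), "Non-standard country values"

# Validity: ages within a plausible range
assert df["age"].between(0, 120).all(), "Implausible ages found"

# Uniqueness: no duplicate transactions
assert not df.duplicated(subset=["transaction_id"]).any(), "Duplicate transactions"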

What Bad Data Actually Costs

A project I consulted on last year went like this:

Initial approach: Build churn model → Deploy → Wonder why it doesn't work

  • 3 months of engineering time: $75,000
  • Model in production for 2 months before issues discovered: $40,000 (lost predictions)
  • Rework to fix data issues: $50,000
  • Total: $165,000

Better approach: Data quality audit first → Fix issues → Build model → Deploy

  • 2 weeks data quality work: $15,000
  • 2 months engineering with clean data: $50,000
  • Model works first time: $0 rework
  • Total: $65,000

That's $100,000 saved by doing data quality upfront.

Building Data Quality Into Your Pipeline

Here's the framework I use with clients:

1. Profile Before You Model

Before writing a single line of model code, run data profiling:

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("customers.csv")  # whatever dataset you're about to model

# Generate a comprehensive data profile
profile = ProfileReport(df, title="Data Quality Report")
profile.to_file("data_quality_report.html")

Look for:

  • Missing value patterns (random or systematic? see the sketch after this list)
  • Distribution anomalies (outliers? bimodal?)
  • Correlation issues (multicollinearity?)
  • Cardinality problems (too many categories?)
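
One quick way to probe that first question: compare null rates across segments. If missingness varies sharply by some other attribute, it's systematic, not random. A sketch, reusing the `df` above (the `channel` column is illustrative):

# Null rate of `age` by acquisition channel; large gaps between
# segments suggest systematic (not random) missingness
null_by_segment = df["age"].isna().groupby(df["channel"]).mean()
print(null_by_segment.sort_values(ascending=False))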

2. Define Expectations

Use tools like Great Expectations to codify data contracts:

import great_expectations as gx

# Get a validator for the DataFrame (fluent API, GX 0.16+;
# adjust for your GX version)
context = gx.get_context()
validator = context.sources.pandas_default.read_dataframe(dataframe=df)

# Define expectations
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("age", 0, 120)
validator.expect_column_values_to_be_in_set("status", ["active", "churned"])
validator.expect_column_pair_values_a_to_be_greater_than_b("end_date", "start_date")

These expectations become automated tests that run on every data load.
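
A sketch of what that gating can look like, reusing the `validator` above (classic pre-1.0 GX API; Checkpoints are the production-grade mechanism):

# Run the suite and stop the pipeline on failure
results = validator.validate()
if not results.success:
    raise RuntimeError("Data quality checks failed; aborting downstream steps")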

3. Monitor Continuously

Data quality isn't a one-time check. Production data drifts. Sources change. Build monitoring:

# Example: dbt data tests (schema.yml)
version: 2
models:
  - name: customers
    tests:
      # Fail if no new customers in 24h (requires the dbt_utils package)
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 1
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - unique
          - not_null
      - name: created_at
        tests:
          - not_null

4. Create Data Contracts

When data crosses team boundaries, establish contracts:

{
  "name": "customer_events",
  "version": "2.1.0",
  "owner": "product-analytics",
  "schema": {
    "customer_id": {"type": "string", "required": true},
    "event_type": {"type": "string", "enum": ["purchase", "view", "click"]},
    "timestamp": {"type": "datetime", "required": true}
  },
  "sla": {
    "freshness": "1 hour",
    "completeness": "99.5%"
  }
}

Breaking changes require version bumps and migration paths.
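
The contract format above is bespoke, so enforcement is whatever you make it. Here's a minimal hand-rolled checker as a sketch (file name and record fields are illustrative); in practice you might compile the contract to JSON Schema and validate with a library like jsonschema instead:

import json

def validate_event(event: dict, contract: dict) -> list:
    """Return contract violations for a single record."""
    errors = []
    for field, rules in contract["schema"].items():
        value = event.get(field)
        if rules.get("required") and value is None:
            errors.append(f"missing required field: {field}")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}={value!r} not in {rules['enum']}")
    return errors

# Load the contract shown above (file name illustrative)
with open("customer_events.contract.json") as f:
    contract = json.load(f)

event = {"customer_id": "c-123", "event_type": "purchase",
         "timestamp": "2025-12-01T10:00:00Z"}
violations = validate_event(event, contract)
assert not violations, violations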

The Data Quality Checklist

Before any ML project, run through this checklist:

Data Understanding

  • What is the source of this data?
  • How often is it updated?
  • Who owns it?
  • What transformations have been applied?
  • Are there known issues or caveats?

Completeness

  • What % of each column is null? (see the one-liner after this list)
  • Are nulls random or systematic?
  • What's the null handling strategy?
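
That first question is a pandas one-liner, assuming the `df` loaded during profiling:

# Percent null per column, worst offenders first
null_pct = df.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(null_pct.head(10))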

Consistency

  • Are categorical values standardized? (see the check after this list)
  • Are date formats consistent?
  • Are units documented and consistent?
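
A quick spot-check for the first question, reusing the country example from the dimensions table (column name illustrative):

# Surface non-standard spellings before they fragment your features
print(df["country"].value_counts())

# Collapse known variants to one canonical label
canonical = {"USA": "US", "United States": "US"}
df["country"] = df["country"].replace(canonical)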

Accuracy

  • Can we validate against a known source?
  • Are there sanity checks for numeric ranges?
  • Do aggregates match expected values?

Freshness

  • How old is the data?
  • Is there a staleness alert?
  • What's the update frequency?

Uniqueness

  • Are there duplicate records? (see the check after this list)
  • Is the primary key truly unique?
  • Are there near-duplicates (fuzzy matches)?
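
The first two questions are quick pandas checks (assuming the `df` and key column from earlier); near-duplicates need a fuzzy-matching library such as recordlinkage or rapidfuzz, which is beyond a quick check:

# Exact duplicates across all columns
dupes = df[df.duplicated(keep=False)]
print(f"{len(dupes)} exact duplicate rows")

# Primary-key uniqueness
assert df["transaction_id"].is_unique, "transaction_id is not a true primary key"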

Tools I Recommend

| Tool | Purpose | When to Use |
| --- | --- | --- |
| Great Expectations | Data validation | Every project |
| dbt tests | SQL-based checks | Data warehouse projects |
| Pandera | DataFrame validation | Python pipelines |
| Monte Carlo | Automated monitoring | Large-scale production |
| Soda | Data observability | Multi-source environments |
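
To give a flavor of the Pandera row, here's what a schema for the customer table might look like (a sketch with assumed column names):

import pandera as pa

schema = pa.DataFrameSchema({
    "customer_id": pa.Column(str, unique=True, nullable=False),
    "age": pa.Column(int, pa.Check.in_range(0, 120)),
    "status": pa.Column(str, pa.Check.isin(["active", "churned"])),
})

validated = schema.validate(df)  # raises SchemaError on violations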

The Bottom Line

Data quality work doesn't get you Twitter followers or conference invitations. Nobody's impressed by a well-structured dbt test suite.

But it's the single best predictor of whether your ML project ships or dies.

Do the boring work first. Everything else gets easier.


Seeing weird model behavior that might be a data issue? I'm happy to take a look.


