85% of machine learning projects fail. Not "shipped late" or "underwhelming"—actually fail. Never make it to production.
And the cause isn't what you'd expect. It's not algorithm selection, model architecture, or lack of compute. It's data quality.
Your Model Isn't Wrong. Your Data Is.
I've seen it dozens of times. A team spends months building a sophisticated model, only to discover:
- The training data had duplicate records (inflating accuracy)
- Null values were silently converted to zeros (biasing predictions)
- Labels were inconsistently applied (model learned noise)
- Production data looked nothing like training data (model collapsed)
The model wasn't wrong. The data was.
What "Data Quality" Actually Means
Data quality isn't one thing—it's a constellation of properties:
| Dimension | Definition | Example Failure |
|---|---|---|
| Completeness | No missing values | Customer age is null for 40% of records |
| Consistency | Same thing represented same way | "USA", "US", "United States" |
| Accuracy | Values reflect reality | Revenue in cents stored as dollars |
| Timeliness | Data is current | Using 6-month-old customer segments |
| Validity | Values within expected ranges | Age = -5 or 200 |
| Uniqueness | No unintended duplicates | Same transaction recorded twice |
Every ML pipeline needs checks for each dimension.
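To make that concrete, here is a minimal sketch of per-dimension checks in plain pandas. The column names (`customer_id`, `country`, `age`, `updated_at`) and thresholds are placeholders, and accuracy is deliberately left out because it needs an external reference to compare against:

```python
import pandas as pd

def quality_checks(df: pd.DataFrame) -> dict:
    """Toy per-dimension checks; columns and thresholds are illustrative only."""
    now = pd.Timestamp.now()
    return {
        # Completeness: no column should be more than 5% null
        "completeness": df.isna().mean().max() < 0.05,
        # Consistency: categorical values drawn from one canonical set
        "consistency": df["country"].isin(["US", "CA", "GB"]).all(),
        # Validity: numeric values inside plausible ranges
        "validity": df["age"].between(0, 120).all(),
        # Timeliness: newest record is less than a day old (assumes naive timestamps)
        "timeliness": now - pd.to_datetime(df["updated_at"]).max() < pd.Timedelta("1D"),
        # Uniqueness: the primary key has no duplicates
        "uniqueness": df["customer_id"].is_unique,
    }
```

Failing any of these should block the training run, not just log a warning.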
What Bad Data Actually Costs
A project I consulted on last year went like this:
Initial approach: Build churn model → Deploy → Wonder why it doesn't work
- 3 months of engineering time: $75,000
- Model in production for 2 months before issues were discovered: $40,000 (value lost to bad predictions)
- Rework to fix data issues: $50,000
- Total: $165,000
Better approach: Data quality audit first → Fix issues → Build model → Deploy
- 2 weeks data quality work: $15,000
- 2 months engineering with clean data: $50,000
- Model works first time: $0 rework
- Total: $65,000
That's $100,000 saved by doing data quality upfront.
Building Data Quality Into Your Pipeline
Here's the framework I use with clients:
1. Profile Before You Model
Before writing a single line of model code, run data profiling:
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the training table (path is a placeholder)
df = pd.read_csv("training_data.csv")

# Generate comprehensive data profile
profile = ProfileReport(df, title="Data Quality Report")
profile.to_file("data_quality_report.html")
```
Look for the following (a quick pandas pass is sketched after the list):
- Missing value patterns (random? or systematic?)
- Distribution anomalies (outliers? bimodal?)
- Correlation issues (multicollinearity?)
- Cardinality problems (too many categories?)
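Something like this surfaces most of those patterns directly; it assumes the `df` loaded in the profiling snippet above:

```python
# Quick follow-ups on the profile, straight from pandas
print(df.isna().mean().sort_values(ascending=False).head(10))   # worst missing-value rates
print(df.describe(percentiles=[0.01, 0.5, 0.99]).T)             # outliers and skew at a glance
print(df.select_dtypes("number").corr().abs().round(2))         # multicollinearity candidates
print(df.select_dtypes("object").nunique().sort_values(ascending=False))  # cardinality
```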
2. Define Expectations
Use tools like Great Expectations to codify data contracts:
```python
import great_expectations as gx

# Get a validator over the dataset (setup varies by GX version; file name is a placeholder)
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("customers.csv")

# Define expectations
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("age", 0, 120)
validator.expect_column_values_to_be_in_set("status", ["active", "churned"])
validator.expect_column_pair_values_a_to_be_greater_than_b("end_date", "start_date")
```
These expectations become automated tests that run on every data load.
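Concretely, on each load you can validate the batch and fail fast. A minimal sketch, continuing from the `validator` above (result objects differ a little across Great Expectations versions, and larger setups would wrap this in a Checkpoint):

```python
# Validate the current batch against the expectations defined above
results = validator.validate()

if not results.success:
    failed = [r.expectation_config.expectation_type
              for r in results.results if not r.success]
    raise ValueError(f"Data quality checks failed: {failed}")
```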
3. Monitor Continuously
Data quality isn't a one-time check. Production data drifts. Sources change. Build monitoring:
```yaml
# Example: dbt data tests (schema.yml)
version: 2

models:
  - name: customers
    tests:
      # Requires the dbt_utils package; fails if no new customers in the last day
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 1
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - unique
          - not_null
      - name: created_at
        tests:
          - not_null
```
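These run with `dbt test` (or as part of `dbt build`), so a failing check can block the load instead of silently feeding bad rows to downstream models.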
4. Create Data Contracts
When data crosses team boundaries, establish contracts:
```json
{
  "name": "customer_events",
  "version": "2.1.0",
  "owner": "product-analytics",
  "schema": {
    "customer_id": {"type": "string", "required": true},
    "event_type": {"type": "string", "enum": ["purchase", "view", "click"]},
    "timestamp": {"type": "datetime", "required": true}
  },
  "sla": {
    "freshness": "1 hour",
    "completeness": "99.5%"
  }
}
```
Breaking changes require version bumps and migration paths.
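Contracts only help if they're enforced in code, not just written down. Below is a minimal producer-side sketch against the contract above; the file name and helper are hypothetical, and in practice you'd more likely reach for JSON Schema, Protobuf, or a schema registry:

```python
import json
from datetime import datetime, timezone

def contract_violations(event: dict, contract: dict) -> list[str]:
    """Check one event against the contract's schema block (required fields and enums only)."""
    errors = []
    for field, spec in contract["schema"].items():
        if spec.get("required") and field not in event:
            errors.append(f"missing required field: {field}")
        elif field in event and "enum" in spec and event[field] not in spec["enum"]:
            errors.append(f"{field}={event[field]!r} not allowed; expected one of {spec['enum']}")
    return errors

with open("customer_events.contract.json") as f:  # the JSON document shown above
    contract = json.load(f)

event = {
    "customer_id": "c-123",
    "event_type": "purchase",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
assert contract_violations(event, contract) == []
```

The same check can run on the consumer side as a smoke test before training data is assembled.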
The Data Quality Checklist
Before any ML project, run through this checklist:
Data Understanding
- What is the source of this data?
- How often is it updated?
- Who owns it?
- What transformations have been applied?
- Are there known issues or caveats?
Completeness
- What % of each column is null?
- Are nulls random or systematic?
- What's the null handling strategy?
Consistency
- Are categorical values standardized?
- Are date formats consistent?
- Are units documented and consistent?
Accuracy
- Can we validate against a known source?
- Are there sanity checks for numeric ranges?
- Do aggregates match expected values?
Freshness
- How old is the data?
- Is there a staleness alert?
- What's the update frequency?
Uniqueness
- Are there duplicate records?
- Is the primary key truly unique?
- Are there near-duplicates (fuzzy matches)?
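For the uniqueness questions in particular, exact duplicates are cheap to catch, and near-duplicates usually just need some normalization before comparing. A rough sketch, with `df` as the table under audit and `customer_id`, `email`, and `name` as placeholder columns:

```python
# Exact duplicates and primary-key uniqueness
print("duplicate rows:", df.duplicated().sum())
print("duplicate customer_ids:", df["customer_id"].duplicated().sum())

# Crude near-duplicate pass: strip and lowercase the obvious variation first
normalized = df.assign(
    email=df["email"].str.strip().str.lower(),
    name=df["name"].str.strip().str.lower(),
)
print("near-duplicate rows:", normalized.duplicated(subset=["email", "name"]).sum())
```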
Tools I Recommend
| Tool | Purpose | When to Use |
|---|---|---|
| Great Expectations | Data validation | Every project |
| dbt tests | SQL-based checks | Data warehouse projects |
| Pandera | DataFrame validation | Python pipelines |
| Monte Carlo | Automated monitoring | Large-scale production |
| Soda | Data observability | Multi-source environments |
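If you're starting from a Python pipeline, Pandera is the lowest-friction entry point. A minimal sketch of a schema mirroring the earlier expectations (column names are examples, and the import path may differ slightly by Pandera version):

```python
import pandera as pa
from pandera import Check, Column

customer_schema = pa.DataFrameSchema(
    {
        "customer_id": Column(str, unique=True, nullable=False),
        "age": Column(int, Check.in_range(0, 120), nullable=False),
        "status": Column(str, Check.isin(["active", "churned"])),
    },
    strict=False,  # allow extra columns; set True to enforce the full schema
)

validated_df = customer_schema.validate(df)  # raises SchemaError on violations
```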
The Bottom Line
Data quality work doesn't get you Twitter followers or conference invitations. Nobody's impressed by a well-structured dbt test suite.
But it's the single best predictor of whether your ML project ships or dies.
Do the boring work first. Everything else gets easier.
Seeing weird model behavior that might be a data issue? I'm happy to take a look.



