85% of machine learning projects fail. Not "shipped late" or "underwhelming"—actually fail. Never make it to production.
And the cause isn't what you'd expect. It's not algorithm selection, model architecture, or lack of compute. It's data quality.
Your Model Isn't Wrong. Your Data Is.
I've seen it dozens of times. A team spends months building a sophisticated model, only to discover:
- The training data had duplicate records (inflating accuracy)
- Null values were silently converted to zeros (biasing predictions)
- Labels were inconsistently applied (model learned noise)
- Production data looked nothing like training data (model collapsed)
The model wasn't wrong. The data was.
What "Data Quality" Actually Means
Data quality isn't one thing—it's a constellation of properties:
| Dimension | Definition | Example Failure |
|---|---|---|
| Completeness | No missing values | Customer age is null for 40% of records |
| Consistency | Same thing represented same way | "USA", "US", "United States" |
| Accuracy | Values reflect reality | Revenue in cents stored as dollars |
| Timeliness | Data is current | Using 6-month-old customer segments |
| Validity | Values within expected ranges | Age = -5 or 200 |
| Uniqueness | No unintended duplicates | Same transaction recorded twice |
Every ML pipeline needs checks for each dimension.
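To make that concrete, here is a minimal sketch of per-dimension checks in plain pandas. The column names (`customer_id`, `country`, `age`, `updated_at`) and thresholds are placeholders, and accuracy is deliberately left out because it needs an external reference to compare against:

```python
import pandas as pd

def quality_checks(df: pd.DataFrame) -> dict:
    """Toy per-dimension checks; columns and thresholds are illustrative only."""
    now = pd.Timestamp.now()
    return {
        # Completeness: no column should be more than 5% null
        "completeness": df.isna().mean().max() < 0.05,
        # Consistency: categorical values drawn from one canonical set
        "consistency": df["country"].isin(["US", "CA", "GB"]).all(),
        # Validity: numeric values inside plausible ranges
        "validity": df["age"].between(0, 120).all(),
        # Timeliness: newest record is less than a day old (assumes naive timestamps)
        "timeliness": now - pd.to_datetime(df["updated_at"]).max() < pd.Timedelta("1D"),
        # Uniqueness: the primary key has no duplicates
        "uniqueness": df["customer_id"].is_unique,
    }
```

Failing any of these should block the training run, not just log a warning.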
What Bad Data Actually Costs
A project I consulted on last year went like this:
Initial approach: Build churn model → Deploy → Wonder why it doesn't work
- 3 months of engineering time: $75,000
- Model in production for 2 months before issues were discovered: $40,000 (value lost to bad predictions)
- Rework to fix data issues: $50,000
- Total: $165,000
Better approach: Data quality audit first → Fix issues → Build model → Deploy
- 2 weeks data quality work: $15,000
- 2 months engineering with clean data: $50,000
- Model works first time: $0 rework
- Total: $65,000
That's $100,000 saved by doing data quality upfront.
Building Data Quality Into Your Pipeline
Here's the framework I use with clients:
1. Profile Before You Model
Before writing a single line of model code, run data profiling:
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the training table (path is a placeholder)
df = pd.read_csv("training_data.csv")

# Generate comprehensive data profile
profile = ProfileReport(df, title="Data Quality Report")
profile.to_file("data_quality_report.html")
```
Look for the following (a quick pandas pass is sketched after the list):
- Missing value patterns (random? or systematic?)
- Distribution anomalies (outliers? bimodal?)
- Correlation issues (multicollinearity?)
- Cardinality problems (too many categories?)
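Something like this surfaces most of those patterns directly; it assumes the `df` loaded in the profiling snippet above:

```python
# Quick follow-ups on the profile, straight from pandas
print(df.isna().mean().sort_values(ascending=False).head(10))   # worst missing-value rates
print(df.describe(percentiles=[0.01, 0.5, 0.99]).T)             # outliers and skew at a glance
print(df.select_dtypes("number").corr().abs().round(2))         # multicollinearity candidates
print(df.select_dtypes("object").nunique().sort_values(ascending=False))  # cardinality
```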
2. Define Expectations
Use tools like Great Expectations to codify data contracts:
```python
import great_expectations as gx

# Get a validator over the dataset (setup varies by GX version; file name is a placeholder)
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("customers.csv")

# Define expectations
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("age", 0, 120)
validator.expect_column_values_to_be_in_set("status", ["active", "churned"])
validator.expect_column_pair_values_a_to_be_greater_than_b("end_date", "start_date")
```
These expectations become automated tests that run on every data load.
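Concretely, on each load you can validate the batch and fail fast. A minimal sketch, continuing from the `validator` above (result objects differ a little across Great Expectations versions, and larger setups would wrap this in a Checkpoint):

```python
# Validate the current batch against the expectations defined above
results = validator.validate()

if not results.success:
    failed = [r.expectation_config.expectation_type
              for r in results.results if not r.success]
    raise ValueError(f"Data quality checks failed: {failed}")
```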
3. Monitor Continuously
Data quality isn't a one-time check. Production data drifts. Sources change. Build monitoring:
```yaml
# Example: dbt data tests (schema.yml)
version: 2

models:
  - name: customers
    tests:
      # Requires the dbt_utils package; fails if no new customers in the last day
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 1
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - unique
          - not_null
      - name: created_at
        tests:
          - not_null
```
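These run with `dbt test` (or as part of `dbt build`), so a failing check can block the load instead of silently feeding bad rows to downstream models.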
4. Create Data Contracts
When data crosses team boundaries, establish contracts:
```json
{
  "name": "customer_events",
  "version": "2.1.0",
  "owner": "product-analytics",
  "schema": {
    "customer_id": {"type": "string", "required": true},
    "event_type": {"type": "string", "enum": ["purchase", "view", "click"]},
    "timestamp": {"type": "datetime", "required": true}
  },
  "sla": {
    "freshness": "1 hour",
    "completeness": "99.5%"
  }
}
```
Breaking changes require version bumps and migration paths.
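Contracts only help if they're enforced in code, not just written down. Below is a minimal producer-side sketch against the contract above; the file name and helper are hypothetical, and in practice you'd more likely reach for JSON Schema, Protobuf, or a schema registry:

```python
import json
from datetime import datetime, timezone

def contract_violations(event: dict, contract: dict) -> list[str]:
    """Check one event against the contract's schema block (required fields and enums only)."""
    errors = []
    for field, spec in contract["schema"].items():
        if spec.get("required") and field not in event:
            errors.append(f"missing required field: {field}")
        elif field in event and "enum" in spec and event[field] not in spec["enum"]:
            errors.append(f"{field}={event[field]!r} not allowed; expected one of {spec['enum']}")
    return errors

with open("customer_events.contract.json") as f:  # the JSON document shown above
    contract = json.load(f)

event = {
    "customer_id": "c-123",
    "event_type": "purchase",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
assert contract_violations(event, contract) == []
```

The same check can run on the consumer side as a smoke test before training data is assembled.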
The Data Quality Checklist
Before any ML project, run through this checklist:
Data Understanding
- What is the source of this data?
- How often is it updated?
- Who owns it?
- What transformations have been applied?
- Are there known issues or caveats?
Completeness
- What % of each column is null?
- Are nulls random or systematic?
- What's the null handling strategy?
Consistency
- Are categorical values standardized?
- Are date formats consistent?
- Are units documented and consistent?
Accuracy
- Can we validate against a known source?
- Are there sanity checks for numeric ranges?
- Do aggregates match expected values?
Freshness
- How old is the data?
- Is there a staleness alert?
- What's the update frequency?
Uniqueness
- Are there duplicate records?
- Is the primary key truly unique?
- Are there near-duplicates (fuzzy matches)?
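For the uniqueness questions in particular, exact duplicates are cheap to catch, and near-duplicates usually just need some normalization before comparing. A rough sketch, with `df` as the table under audit and `customer_id`, `email`, and `name` as placeholder columns:

```python
# Exact duplicates and primary-key uniqueness
print("duplicate rows:", df.duplicated().sum())
print("duplicate customer_ids:", df["customer_id"].duplicated().sum())

# Crude near-duplicate pass: strip and lowercase the obvious variation first
normalized = df.assign(
    email=df["email"].str.strip().str.lower(),
    name=df["name"].str.strip().str.lower(),
)
print("near-duplicate rows:", normalized.duplicated(subset=["email", "name"]).sum())
```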
Tools I Recommend
| Tool | Purpose | When to Use |
|---|---|---|
| Great Expectations | Data validation | Every project |
| dbt tests | SQL-based checks | Data warehouse projects |
| Pandera | DataFrame validation | Python pipelines |
| Monte Carlo | Automated monitoring | Large-scale production |
| Soda | Data observability | Multi-source environments |
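If you're starting from a Python pipeline, Pandera is the lowest-friction entry point. A minimal sketch of a schema mirroring the earlier expectations (column names are examples, and the import path may differ slightly by Pandera version):

```python
import pandera as pa
from pandera import Check, Column

customer_schema = pa.DataFrameSchema(
    {
        "customer_id": Column(str, unique=True, nullable=False),
        "age": Column(int, Check.in_range(0, 120), nullable=False),
        "status": Column(str, Check.isin(["active", "churned"])),
    },
    strict=False,  # allow extra columns; set True to enforce the full schema
)

validated_df = customer_schema.validate(df)  # raises SchemaError on violations
```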
The Bottom Line
Data quality work doesn't get you Twitter followers or conference invitations. Nobody's impressed by a well-structured dbt test suite.
But it's the single best predictor of whether your ML project ships or dies.
Do the boring work first. Everything else gets easier.
Seeing weird model behavior that might be a data issue? I'm happy to take a look.



