Data Contracts: Silver Bullet or Just More Bloat?

Article
Daniel Pritchard

Daniel Pritchard is the CEO of Simple Machines, a global consultancy that designs, builds, and deploys advanced data and AI systems for some of the world’s largest enterprises. His team works at the coalface of AI implementation, ensuring autonomous systems are not only powerful but reliable and safe.

Simon Scoltock

Simon Scoltock is Director of Technology at Simple Machines, leading the company’s data platforms practice from Christchurch as part of the New Zealand leadership team. With deep experience across telecommunications, insurance, and enterprise software, Simon is focused on the hard problems at the intersection of data and AI, where the real breakthroughs lie not just in using intelligence, but in engineering the data foundations that make it possible.

Jason Martin

Jason Martin is the Chief Technology Officer of Simple Machines, where he leads global technology strategy and consulting delivery. A founding member of the company, Jason brings a background in signals intelligence and defence, and was previously Chief Technology and Product Officer at Beonic, one of the world’s leading spatial analytics platforms. Based in Sydney, he leads the development of Simple Machines’ consulting practice focused on solving how to make AI and embedded intelligence deliver real value for global clients.

Design: James Duthie

Welcome to the Series: Why Data Contracts Matter

As the hype around AI reaches fever pitch, it’s hard not to feel like we’re watching a reboot of Jurassic Park. Everyone’s obsessed with bringing the creatures to life, but no one’s asking whether the electric fences are strong enough, high enough, and actually turned on. While headlines shriek about generative breakthroughs and autonomous agents, the real story is unfolding beneath the surface, in the part of the stack that doesn’t trend on that casserole formerly known as Twitter: data quality and trust.

Because here’s the thing: while enterprises are pouring billions into AI, too many are building on foundations as stable as a Jenga tower in a light breeze. And as anyone who’s ever tried to debug someone else’s 3,000-line SQL script filled with nested CTEs, implicit joins, and a mysterious WHERE 1=1 knows, AI is only as good as the data it’s built on.

At Simple Machines, we work with some of the world’s most ambitious organisations to modernise their data platforms and deploy intelligent, trustworthy systems. And time after time, we come back to the same lesson: you can’t shortcut trust.

Now, don’t get us wrong. We’re genuinely excited about what’s coming. We’re elbow-deep in designing agentic and autonomous systems for our clients, pushing the frontier of what’s possible. But amidst all that bleeding-edge work, we’ve found ourselves equally captivated by something decidedly less flashy, and that’s data contracts.

Yes, data contracts. Sober, structured, and quietly transformative. If AI is the flashy lead singer, data contracts have the potential to be the rhythm section — the quiet force keeping everything in time, helping ensure the rest of the band doesn’t fall apart mid-performance. At their best, they promise real accountability, clarity, and durability in the data layer. And for organisations worn down by schema changes that break dashboards or trigger last-minute fire drills before the board report lands, that promise is worth paying attention to.

This series aims to cut through the noise, offering a clear-eyed look at what data contracts are, why they matter, and how to actually make them work. We’ll break down their anatomy, spotlight real-world deployments, examine trade-offs and tooling, and ask the big question: are data contracts ready for primetime, or just another governance fad dressed in YAML?

Let’s find out.

The Problem Statement: Data Chaos in Modern Enterprises

Every company aspires to be data-driven. The promise? Data-fuelled decisions, competitive advantages, and personalised customer experiences. The reality? A constant battle with unreliable, inconsistent, and incorrect data.

Ask any data team about their biggest headache, and they’ll tell you the same thing: data breaks all the time. Reports show inconsistent metrics, dashboards return different numbers for the same KPIs, and machine learning models fail due to poorly structured inputs. These issues stem from uncontrolled schema changes, unclear data ownership, and an overall lack of coordination between data producers and consumers.

Data contracts aim to put an end to this chaos by establishing clear agreements about what data should look like, how it should behave, and who is responsible for maintaining its integrity. But contracts alone aren’t enough. Without strong cataloguing and discovery practices, teams can’t find the data, or the contracts, in the first place. And without agreement on key metrics and their definitions, even the best contracts can’t prevent semantic drift.

It’s not just about writing some YAML and heading to the gym. Trustworthy data requires a coordinated system of contracts, visibility, and shared understanding, from the raw inputs to the metrics executives rely on.

What This Series Covers

Over the next four articles, we’ll break down data contracts in a way that’s both practical and critical:

  1. The Why and The What (You are here): We introduce the concept, explore the data quality crisis, and define what data contracts are and how they work.
  2. The How and The Who: A deep dive into implementation strategies, the key players in the space, and the different tools available.
  3. Proof of Concept: We put data contracts to the test with a real-world example, comparing tools and standards to see how they perform.
  4. The Verdict: Are data contracts enterprise-ready, or do they still have a long way to go? We’ll weigh the pros and cons and offer recommendations on when and where they make sense.

Throughout this series, we’ll pull insights from leading voices in data, including Zhamak Dehghani (Data Mesh), Chad Sanderson (Gable AI), and Jean-Georges Perrin (Implementing Data Mesh). We’ll also analyse how major enterprises are tackling data quality issues using data contracts.

Let’s start by diving into the core problem: why is data quality still such a mess, and what makes data contracts a potential solution?

The Data Dream, The Data Nightmare

There was a time when enterprises believed that more data meant better decisions. Warehouses grew and pipelines stretched across continents. But then data lakes became data swamps. Data teams, once empowered, found themselves drowning in an endless flood of schema mismatches, breaking pipelines, and the most dreaded phrase in the business: “this number doesn’t match my dashboard.”

As businesses scale, they accumulate thousands, sometimes millions, of data pipelines. Each one carries assumptions, implicit dependencies, and unspoken agreements between the teams who produce data and those who consume it. But here’s the issue: without explicit agreements, these assumptions fall apart.

Enter data contracts, hailed as the missing piece in the data reliability puzzle, promising to restore order to chaos. Are they the silver bullet we’ve been waiting for, or just another layer of complexity in an already convoluted data landscape? Let’s find out.

The Data Quality Crisis

If you’ve ever been on a data team, you’ve lived this nightmare: marketing launches an email campaign based on segmentation logic that quietly changed three pipelines back in the chain. Your AI model starts predicting churn for customers who don’t even exist. And who gets the call? You do. Downstream. Again.

The root cause? Not incompetence, but untracked schema changes, upstream breakages, and the slow drift of data logic through tribal knowledge and undocumented tweaks. The common denominator? Bad data, born from misalignment and lack of accountability across the pipeline.

According to the Gartner article Data Quality: Best Practices for Accurate Insights, poor data quality costs organisations an average of 12.9 million US dollars per year. And that’s not just wasted time: it’s lost revenue, missed opportunities, and compliance risks. Yet, despite billions spent on modern data stacks, broken data pipelines remain one of the top headaches for enterprises.

For the past decade, solutions to this problem have revolved around monitoring and observability (Monte Carlo, Bigeye), orchestration and lineage (Dagster, Datahub), and various modelling paradigms, from semantic layers and star schemas to Data Vault and even schema-less approaches. While these tools help detect and track issues, they don’t prevent them from occurring in the first place.

Tools like dbt offer flexibility across modelling styles and now support contracts as a first-class concept, but their effectiveness still depends on broader agreements across teams.

Why Do Data Breakages Keep Happening?

  1. Unannounced Schema and Behaviour Changes: Data producers (typically engineering teams) modify schemas or underlying behaviours without informing data consumers such as analysts, ML engineers, or BI teams. A field that was once compulsory may suddenly become optional. A transaction that previously triggered event X might now result in event Z instead, breaking assumptions and downstream logic.
  2. Undefined Ownership: No single entity is responsible for ensuring data consistency across departments.
  3. Delayed Feedback Loops: Data issues often surface only when reports or models fail, making remediation costly.

The Promise of Data Contracts
Data contracts formalise and enforce expectations around data: structure, semantics, and quality. Think of them as API contracts, but for data (a minimal sketch in code follows the list below).

They aim to:

  • Prevent Breaking Changes: Enforce schema consistency across pipelines.
  • Define Ownership: Make data producers responsible for quality, not just consumers.
  • Automate Validation: Stop bad data before it spreads downstream.
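
To make the analogy concrete, here is a minimal, stdlib-only Python sketch of a contract as code: the contract declares the expected fields, and a producer-side gate refuses to publish records that break it. The field names and contract shape are illustrative assumptions, not a reference to any particular tool or standard.

```python
# A minimal data contract: the fields a producer promises to emit, their types,
# and whether nulls are allowed. Names and structure are illustrative only.
ORDERS_CONTRACT = {
    "order_id":   {"type": int,   "nullable": False},
    "amount":     {"type": float, "nullable": False},
    "created_at": {"type": str,   "nullable": False},  # ISO-8601 timestamp as a string
}

def violations(record: dict, contract: dict) -> list[str]:
    """Return the contract violations for a single record (empty list = conforming)."""
    problems = []
    for field, rules in contract.items():
        value = record.get(field)
        if value is None:
            if not rules["nullable"]:
                problems.append(f"{field}: required field is missing or null")
        elif not isinstance(value, rules["type"]):
            problems.append(f"{field}: expected {rules['type'].__name__}, got {type(value).__name__}")
    return problems

def publish(record: dict) -> None:
    """Producer-side gate: refuse to emit records that break the contract."""
    problems = violations(record, ORDERS_CONTRACT)
    if problems:
        raise ValueError("contract violation: " + "; ".join(problems))
    # ...hand the record to the queue or warehouse here...

publish({"order_id": 42, "amount": 19.99, "created_at": "2025-01-01T10:00:00Z"})  # passes
# publish({"order_id": "42", "amount": None})  # would raise ValueError
```

Real deployments push the same check into schema registries, CI pipelines, or streaming platforms rather than application code, but the shape of the agreement is the same: producers, not consumers, are stopped at the boundary.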

Chad Sanderson, former head of data at Convoy and now founder of Gable AI, has been a vocal proponent of data contracts. “The fundamental issue is that data is treated as a by-product rather than a product. Data contracts force producers to take responsibility, just like engineers do with APIs,” he explains.

What’s Inside a Data Contract? Breaking It Down and Building It Right

A data contract is a structured agreement between data producers (typically data engineering teams) and data consumers (data analysts, AI and ML practitioners, business intelligence teams). The contract enforces rules, expectations, and guarantees about the data being shared, helping prevent downstream failures before they occur.

1. Schema Definition: The Blueprint of Reliable Data

At the core of every data contract is the schema definition, a clear, explicit blueprint that specifies the structure of the data being exchanged.

What it includes:

  • Column Names and Types: Every dataset must define clear column types, whether integer, string, boolean, timestamp, or structured JSON
  • Constraints: Fields like order_id should always be an integer, timestamp must never be null, and email_address should conform to an expected pattern
  • Expected Cardinality: If a table is expected to have unique values per row (like user_id), the contract should enforce it to avoid duplication errors

Airbnb’s Listings Data Schema
Airbnb relies heavily on structured data contracts to manage its listings database. When Airbnb engineers modify a database schema, the changes go through automated contract validation:

  • listing_id must be unique and persist across updates.
  • price_per_night should always be stored as a floating-point number with two decimal places.
  • created_at timestamps must be in UTC format to maintain consistency across global operations.

Without these safeguards, Airbnb’s pricing algorithms could ingest faulty data, resulting in incorrect nightly rates being displayed to users.
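
A minimal sketch of how rules like these might be enforced in an automated check. The column names follow the example above; the code is illustrative and is not Airbnb’s actual implementation.

```python
from datetime import datetime, timedelta

# Illustrative listings rows; column names follow the example above, not Airbnb's real schema.
listings = [
    {"listing_id": 1, "price_per_night": 120.50, "created_at": "2024-06-01T08:00:00+00:00"},
    {"listing_id": 2, "price_per_night": 89.00,  "created_at": "2024-06-02T09:30:00+00:00"},
]

def check_unique(rows, column):
    values = [row[column] for row in rows]
    assert len(values) == len(set(values)), f"{column} must be unique per row"

def check_two_decimal_places(rows, column):
    # A production contract would store money as a decimal type; rounding is enough for a sketch.
    for row in rows:
        assert round(row[column], 2) == row[column], f"{column} must have at most two decimal places"

def check_utc(rows, column):
    for row in rows:
        assert datetime.fromisoformat(row[column]).utcoffset() == timedelta(0), f"{column} must be in UTC"

check_unique(listings, "listing_id")
check_two_decimal_places(listings, "price_per_night")
check_utc(listings, "created_at")
print("All schema checks passed")
```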

2. Validation Rules: Guardrails for Data Integrity

Validation rules enforce business logic. They ensure the data makes sense beyond just structural correctness.

Examples of common validation rules (each maps to a small, testable check, sketched after the list):

  • Temporal validations: an order cannot have its order_status set to shipped before the order has been created.
  • Data ranges: temperature_reading cannot be below absolute zero (-273.15°C).
  • Referential integrity: customer_id in the orders table must exist in the customers table.
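
Here is a stdlib-only sketch of those three rules as small, testable checks, using illustrative table shapes and field names:

```python
ABSOLUTE_ZERO_C = -273.15

def check_temporal(order: dict) -> None:
    """Temporal rule: an order cannot be shipped before it was created."""
    if order.get("shipped_at") is not None:
        assert order["shipped_at"] >= order["created_at"], "order shipped before it was created"

def check_range(reading: dict) -> None:
    """Range rule: a temperature reading cannot be below absolute zero."""
    assert reading["temperature_reading"] >= ABSOLUTE_ZERO_C, "temperature below absolute zero"

def check_referential_integrity(orders: list, customers: list) -> None:
    """Referential rule: every customer_id in orders must exist in customers."""
    known = {c["customer_id"] for c in customers}
    missing = {o["customer_id"] for o in orders} - known
    assert not missing, f"orders reference unknown customers: {missing}"

customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [{"customer_id": 1, "created_at": "2024-06-01", "shipped_at": "2024-06-03"}]

for order in orders:
    check_temporal(order)
check_range({"temperature_reading": 21.5})
check_referential_integrity(orders, customers)
print("All validation rules passed")
```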

Preventing Fraud at Stripe
Global payments provider Stripe prevents fraudulent transactions by applying structured validation rules and machine learning models. While Stripe has not publicly documented the use of data contracts in this context, their approach reflects similar principles.

For example, their fraud detection systems rely on validation logic such as:

  • transaction_amount should always be positive.
  • A chargeback should never be recorded before an initial transaction.
  • New currency types must be approved before being processed.

These types of rules help ensure that bad data does not enter the system, thereby reducing the risk of fraud and maintaining trust in Stripe’s infrastructure.

The principles behind this approach are supported by Stripe’s guide to Radar custom rules and their article on machine learning for fraud detection. Both emphasise the importance of data validation and real-time checks as part of their fraud prevention strategy.

3. Change Management Policies: Preventing Breaking Changes

A well-structured data contract must also define how changes to data schemas are managed. Without proper versioning and governance, even a small modification such as renaming a column can break downstream dependencies.

Key change management strategies:

  • Backward compatibility: Ensure schema updates do not break existing consumers by allowing optional fields instead of deleting required ones
  • Deprecation notices: When removing a field, mark it as deprecated before deletion, giving downstream consumers time to adjust
  • Approval workflows: Require schema changes to be reviewed and approved through a pull request or automated validation tool

Netflix’s Data Contracts for Personalised Recommendations
Netflix relies on change management protocols to avoid breaking its recommendation engine. If engineers change the format of viewer_history, all downstream ML models consuming that data must be notified. Through automated versioning policies, Netflix ensures that old and new schemas can coexist, preventing service disruptions.
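
One way to mechanise this kind of policy is to diff a proposed contract against the current one and reject breaking changes while allowing additive, optional fields. The sketch below is illustrative only; the field names are borrowed from the example above, and this is not how Netflix actually implements it.

```python
def breaking_changes(current: dict, proposed: dict) -> list[str]:
    """List schema changes that would break existing consumers of the contract."""
    problems = []
    for field, spec in current.items():
        if field not in proposed:
            problems.append(f"removed field: {field} (deprecate it first, then remove)")
        elif proposed[field]["type"] != spec["type"]:
            problems.append(f"type change on {field}: {spec['type']} -> {proposed[field]['type']}")
    for field, spec in proposed.items():
        if field not in current and spec.get("required", False):
            problems.append(f"new required field: {field} (introduce it as optional first)")
    return problems

current_contract = {
    "viewer_id":  {"type": "int"},
    "watched_at": {"type": "timestamp"},
}
proposed_contract = {
    "viewer_id":  {"type": "int"},
    "watched_at": {"type": "timestamp"},
    "device":     {"type": "string", "required": False},  # additive and optional: allowed
}

print(breaking_changes(current_contract, proposed_contract))  # [] -> old and new schemas can coexist
```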

4. Versioning and Compatibility Checks: Keeping the System Flexible

Version control is critical for large organisations dealing with evolving data needs. A data contract should specify how new versions are introduced and maintained.

Types of versioning approaches:

  • Major versioning (v1, v2, v3): Used when fundamental schema changes occur
  • Feature flags and soft rollouts: Allow new schema versions to be tested in production before a full migration
  • Automated contract testing: CI/CD pipelines should test for backward compatibility before deploying changes

Google’s Approach to API and Data Versioning
Google uses strict versioning policies for its APIs and internal datasets. Whenever a new schema is proposed, it is automatically validated against existing consumers. If the update would cause a breaking change, it is rejected unless it follows the proper migration process.
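
In the same spirit, an automated contract test can work out what kind of version bump a change requires and reject a release that understates it. A toy, self-contained sketch; the version strings and rules are illustrative assumptions, not Google’s actual policy.

```python
def required_bump(breaking: list[str], additive: list[str]) -> str:
    """Decide the minimum version bump a proposed contract change requires."""
    if breaking:
        return "major"
    if additive:
        return "minor"
    return "patch"

def check_release(current_version: str, new_version: str,
                  breaking: list[str], additive: list[str]) -> None:
    """Reject a release whose version number understates the change it ships."""
    bump = required_bump(breaking, additive)
    current_major = int(current_version.split(".")[0])
    new_major = int(new_version.split(".")[0])
    if bump == "major" and new_major <= current_major:
        raise ValueError(f"breaking changes require a major version bump: {breaking}")

# Additive-only change: a minor bump is enough.
check_release("1.2.0", "1.3.0", breaking=[], additive=["added optional field device"])

# A removed field shipped as a minor release would be rejected:
# check_release("1.2.0", "1.3.0", breaking=["removed field watched_at"], additive=[])
```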

5. Ownership and Enforcement Mechanisms: Who’s Responsible?

Finally, every data contract should define clear ownership. Without defined accountability, data contracts can become a theoretical exercise rather than an enforceable agreement.

Best practices for ownership in data contracts:

  • Data producers own data quality: Application teams responsible for generating the data must ensure adherence to the contract.
  • Automated monitoring and alerts: If a contract is violated, notifications should be sent to relevant teams. But alerts alone are not enough. There must be predefined action plans for handling these violations, whether that means rolling back changes, quarantining bad data, alerting downstream consumers, or triggering a coordinated incident response. Without clear remediation paths, teams risk alert fatigue and unresolved issues continuing to impact data reliability.
  • Service-level agreements (SLAs): Define data reliability expectations, for example that 95 percent of updates must be processed within one hour (a check of this kind is sketched below).
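
That SLA bullet translates directly into a scheduled check: measure the share of updates processed within the agreed window and alert the owning team when the target is missed. A minimal sketch with illustrative timestamps and thresholds:

```python
from datetime import datetime, timedelta

SLA_TARGET = 0.95                # 95 percent of updates...
SLA_WINDOW = timedelta(hours=1)  # ...must be processed within one hour

def sla_compliance(events: list[dict]) -> float:
    """Fraction of events processed within the agreed window."""
    on_time = sum(1 for e in events if e["processed_at"] - e["created_at"] <= SLA_WINDOW)
    return on_time / len(events)

events = [
    {"created_at": datetime(2024, 6, 1, 9, 0),  "processed_at": datetime(2024, 6, 1, 9, 20)},
    {"created_at": datetime(2024, 6, 1, 9, 5),  "processed_at": datetime(2024, 6, 1, 9, 40)},
    {"created_at": datetime(2024, 6, 1, 9, 10), "processed_at": datetime(2024, 6, 1, 11, 0)},  # late
]

compliance = sla_compliance(events)
if compliance < SLA_TARGET:
    # This is where the predefined remediation path kicks in: page the owning team,
    # quarantine the late data, or alert downstream consumers.
    print(f"SLA breached: {compliance:.0%} of updates processed within one hour (target {SLA_TARGET:.0%})")
```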

Case Study: How a Financial Services Firm Stopped Pipeline Failures

The Problem: A Never-Ending Cycle of Data Breakages
GoCardless, a global payments company, was facing a persistent and costly issue: their data pipelines were failing at an alarming rate. Critical transaction data, which fed into risk models, fraud detection systems, and compliance reporting, frequently contained inconsistencies, causing massive delays and operational disruptions. Each time a failure occurred, data engineers were forced to stop what they were doing and embark on yet another forensic investigation.

The root cause? Uncommunicated schema changes. Engineering teams responsible for application databases would modify table structures, add new fields, or alter data types without notifying downstream consumers. Analysts, data scientists, and regulatory teams relying on this data would only discover the changes when their reports broke or models failed to train correctly.

The impact was significant:

  • Regulatory non-compliance risks: The firm had strict reporting obligations. Any errors in compliance reporting could result in hefty fines from regulators.
  • Delayed fraud detection: Fraud models trained on incorrect transaction data led to an increase in undetected fraudulent transactions.
  • Operational inefficiencies: Data engineering teams were constantly firefighting issues instead of working on strategic initiatives.

The Solution: Introducing Data Contracts at the Engineering Level
Recognising the need for proactive rather than reactive data governance, the firm decided to implement data contracts. The goal was to formalise expectations around data quality and enforce them automatically within their existing development workflows.

Here’s How They Did It

1. Enforcing Column Constraints to Prevent Schema Drift
The first step was defining strict column constraints on transaction data. Previously, columns like amount were sometimes stored as strings instead of numeric values, leading to failed calculations in downstream analytics. Similarly, timestamps were occasionally null, making it impossible to track transaction times accurately.

With data contracts, they established clear rules:

  • amount must always be numeric and cannot contain unexpected characters or formats.
  • timestamp cannot be null, ensuring every transaction has a valid time reference.
  • New columns must be explicitly approved and documented before being added.

These rules were embedded into a central data contract repository, which both engineering and data teams could reference. Any attempt to push changes violating these constraints would be blocked automatically.
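
As a rough illustration of what an entry in such a repository and its check might look like, here is a small sketch. The structure and column names mirror the rules described above; this is not GoCardless’s actual implementation.

```python
# Illustrative entry in a central contract repository for the transactions table.
TRANSACTIONS_CONTRACT = {
    "approved_columns": {"transaction_id", "amount", "timestamp", "currency"},
    "types": {"amount": "numeric", "timestamp": "timestamp"},
}

def schema_violations(proposed_columns: dict) -> list[str]:
    """Check a proposed table schema (column -> declared type) against the contract."""
    problems = []
    unapproved = set(proposed_columns) - TRANSACTIONS_CONTRACT["approved_columns"]
    if unapproved:
        problems.append(f"unapproved new columns: {sorted(unapproved)}")
    for column, expected in TRANSACTIONS_CONTRACT["types"].items():
        if proposed_columns.get(column) != expected:
            problems.append(f"{column} must be {expected}, got {proposed_columns.get(column)}")
    return problems

proposed = {"transaction_id": "string", "amount": "string", "timestamp": "timestamp", "notes": "string"}
print(schema_violations(proposed))
# Flags the unapproved 'notes' column and the amount column declared as a string.
```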

2. Integrating Contract Validation into CI/CD Pipelines
To ensure compliance with data contracts at every stage of the data lifecycle, the firm integrated contract validation into their Continuous Integration and Continuous Deployment (CI/CD) pipelines.

Here’s how it worked:

  • Every time an engineer made schema changes, the CI/CD pipeline automatically checked the data contract definitions. If the changes violated an existing contract (for example, removing a required field or altering a data type without approval), the deployment was halted immediately.
  • Engineers received detailed feedback on what failed and why, enabling them to fix issues before they reached production.
  • Approved schema changes were version-controlled, ensuring traceability and documentation of every modification.

This ‘shift-left’ approach allowed engineering teams to catch and correct schema issues early in the development process rather than waiting for production failures.
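
The CI/CD gate itself can be as simple as a script that compares the proposed schema with the contract, prints the reasons for failure, and exits non-zero so the pipeline halts. A minimal sketch; in practice the contract and proposed schema would be loaded from files in the repository, and the inline dictionaries here are just for illustration.

```python
import sys

def run_contract_checks(contract: dict, proposed: dict) -> list[str]:
    """Compare the proposed schema against the contract and collect human-readable failures."""
    failures = []
    for column, column_type in contract.items():
        if column not in proposed:
            failures.append(f"required column removed: {column}")
        elif proposed[column] != column_type:
            failures.append(f"type changed without approval: {column} ({column_type} -> {proposed[column]})")
    return failures

if __name__ == "__main__":
    contract = {"transaction_id": "string", "amount": "numeric", "timestamp": "timestamp"}
    proposed = {"transaction_id": "string", "amount": "string",  "timestamp": "timestamp"}

    failures = run_contract_checks(contract, proposed)
    if failures:
        print("Contract validation failed:")
        for failure in failures:
            print(f"  - {failure}")
        sys.exit(1)  # a non-zero exit halts the pipeline before the change reaches production
    print("Contract validation passed")
```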

3. Assigning Data Ownership to Engineering Teams
One of the most significant organisational changes was shifting data ownership to engineering teams. Traditionally, data quality had been considered the responsibility of data engineers and analytics teams, who were constantly reacting to broken pipelines. The firm recognised that for data contracts to be effective, ownership needed to be upstream with the teams generating the data.

To facilitate this shift, they introduced:

  • Clear data ownership roles: Application engineers became responsible for defining and maintaining contracts for their services.
  • Data quality SLAs: Teams were measured on how well they adhered to contract agreements, tying data quality directly to performance evaluations.
  • Regular cross-functional reviews: Data and engineering teams met monthly to refine contracts, discuss pain points, and iterate on governance strategies.

This cultural change reinforced data as a shared responsibility rather than an afterthought.

The Lifecycle of a Data Contract

Think of a data contract not as a static document, but as a living, breathing agreement — not just written, but grown — a bit like a houseplant, shaped by the hands that look after it. It starts life in development, where producers and consumers agree on what “good data” actually means. From there, it graduates into CI/CD pipelines, where it’s tested like any other piece of code. If it passes, it’s deployed with proper versioning, monitored in production to catch any unruly behaviour, and — crucially — it evolves. Because no contract should live forever untouched. Requirements change, data grows up, and contracts need to mature with them.

The Results: A 40%+ Reduction in Pipeline Incidents

Within six months of implementing data contracts, the firm saw remarkable improvements:

  • Greater than 40 percent reduction in data pipeline failures: Engineers caught issues before they reached production, preventing disruptions
  • Significantly reduced debugging time: Clear ownership and automatic validation reduced the need for emergency investigations
  • More reliable compliance reporting: Regulatory filings were more accurate and timely, reducing audit risks
  • Improved collaboration between data and engineering teams: With explicit contracts in place, both sides had a shared understanding of expectations

This case study is informed by the excellent work of Andrew Jones and the GoCardless engineering team, who have publicly shared their experience implementing data contracts via Monte Carlo’s blog (7 Key Learnings from Our Experience Implementing Data Contracts) and on Medium (Improving Data Quality with Data Contracts and Implementing Data Contracts at GoCardless).

Key Takeaways for Other Enterprises

The success of this financial firm illustrates why data contracts are more than just a theoretical concept. They provide tangible benefits by:

  • Reducing schema-related failures that cause data inconsistencies.
  • Shifting accountability upstream, making data producers responsible for quality.
  • Automating enforcement mechanisms through CI/CD validation.
  • Fostering better collaboration between data consumers and producers.

For organisations struggling with broken data pipelines, frequent schema changes, and a lack of clear ownership, data contracts offer a scalable, proactive approach to maintaining data integrity.

What’s Next? Things Start to Get Interesting

If this article laid the foundation, the next one picks up the tools.

In Part Two of our series, we’re going hands-on. We’ll move from principles to practice and explore how data contracts actually get implemented inside real-world organisations. Think CI/CD pipelines that scream if your schema slips. Think GitHub workflows that reject bad data before it breathes. Think collaboration models that make data producers and consumers feel like a band, not a battlefield.

We’ll cover:

  • The tooling landscape, including what we’ve been trialling ourselves — from emerging players like Gable AI, to open standards like ODCS.
  • Step-by-step implementation strategies and integration into engineering workflows.
  • The common traps that teams fall into, and the tactical moves to avoid them.

And we’ll let you in on a little secret: one of the most promising tools in this space doesn’t even have a product you can use yet, but it might just shape the future of how data contracts are defined, enforced and scaled.

Stay tuned for Part Two: The How and The Who of Data Contracts. It’s where the theory meets reality, and things start to get interesting.