Data Quality in Delta Lake and Iceberg

Part 1: Stable Metadata + Operational Evidence

May 28, 2026


Most companies already run some form of data quality monitoring. They have freshness checks, null checks, schema validation, row count checks, sometimes even anomaly detection, alerting, and incident workflows.

The problem is not that quality signals do not exist.

The problem is that they are usually hidden in operational tools, disconnected from the data assets people actually consume. Data quality then becomes a vanity metric: a number that lives somewhere in Confluence and is proudly shown to leadership. But it means nothing.

A data analyst opens a table in a catalog and sees the owner (fingers crossed), a description, lineage, maybe some tags. But the real question is usually much simpler:

Can I trust this table?

That question leads to an idea:

Should data quality indicators become part of the table metadata itself?

Should Delta Lake or Apache Iceberg expose quality status directly as part of the asset?

After looking at this from the perspective of open table formats, catalogs, data quality platforms, and data engineering workflows, my conclusion is:

Yes, data quality should be visible as asset metadata. But no, run-by-run data quality results should not be embedded directly into the open table format.

The distinction matters.

The Two Kinds of Data Quality Metadata

When people say “data quality metadata,” they often mix two very different things.

The first category is stable asset state:

quality_certification: certified
data_contract_status: certified
quality_owner: finance-engineering
monitoring_required: true
sla_tier: gold
retention_policy: 1 year

It does not change every few minutes. It is useful for discovery, governance, certification, access policies, platform automation, and downstream consumption.

The second category is operational quality state, something like:

last_freshness_check: failed
null_rate_customer_id: 0.08
row_count_anomaly_score: 0.91
failed_checks_count: 3
incident_id: MC-12345
last_run_id: dq_run_20260527_103000

This information is also important.

But it is operational. It changes every time a check runs. It belongs to the monitoring layer, not necessarily to the table definition.

The industry often gets into trouble when it treats these two categories as the same thing.

They are not the same thing.
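One way to keep the two categories apart in practice is an explicit allowlist of stable keys. The sketch below is only an illustration: the key names come from the examples above, while `STABLE_KEYS` and `split_quality_metadata` are hypothetical names, not part of Delta Lake or Iceberg.

```python
# Hypothetical allowlist of stable, asset-level keys (taken from the examples above).
STABLE_KEYS = frozenset({
    "quality_certification",
    "data_contract_status",
    "quality_owner",
    "monitoring_required",
    "sla_tier",
})


def split_quality_metadata(metadata: dict) -> tuple:
    """Split a flat metadata dict into (stable, operational) parts.

    Anything not on the allowlist is treated as operational and should be
    routed to the monitoring layer rather than the table definition.
    """
    stable = {k: v for k, v in metadata.items() if k in STABLE_KEYS}
    operational = {k: v for k, v in metadata.items() if k not in STABLE_KEYS}
    return stable, operational


stable, operational = split_quality_metadata({
    "quality_certification": "certified",
    "sla_tier": "gold",
    "last_freshness_check": "failed",
    "failed_checks_count": "3",
})
```

Treating unknown keys as operational by default is a deliberate choice here: a new key has to be consciously promoted to the allowlist before it can land on the asset.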

What Data Quality Support Delta Lake and Iceberg Already Provide

Before inventing a new metadata model, it is worth looking at what open table formats already support.

Delta Lake and Apache Iceberg are not data quality platforms, but they do contain several building blocks that are useful for data quality.

Delta Lake

Delta Lake has a few native capabilities that are directly or indirectly related to data quality.

The most obvious one is constraint enforcement.

Hard rules: constraint enforcement.

Delta supports NOT NULL constraints. This is the simplest form of quality enforcement: a required field cannot be missing.

Delta also supports CHECK constraints, which allow teams to define rules such as:

ALTER TABLE finance.orders ADD CONSTRAINT valid_revenue
CHECK (revenue >= 0);

This is useful because the rule is enforced at write time. If bad data is written, the write fails.

That makes constraints stronger than a dashboard, stronger than a catalog tag, and stronger than a downstream alert.
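To show what "the write fails" means in practice, here is a plain-Python illustration of the all-or-nothing behavior. It is not Delta's implementation or API: `write_batch` and `ConstraintViolation` are hypothetical names, and the table is just a list of dicts. The rule mirrors the `valid_revenue` constraint above.

```python
class ConstraintViolation(Exception):
    """Raised when a batch contains a row that breaks a hard rule."""


def write_batch(table: list, batch: list) -> None:
    """Append `batch` to `table` only if every row satisfies revenue >= 0.

    The check runs before anything is appended, so a single bad row
    rejects the whole batch and the table is left unchanged.
    """
    for row in batch:
        if not (row["revenue"] >= 0):
            raise ConstraintViolation(f"valid_revenue violated by row: {row}")
    table.extend(batch)


orders = []
write_batch(orders, [{"order_id": 1, "revenue": 120.0}])

rejected = None
try:
    # The second row is invalid, so neither row of this batch is written.
    write_batch(orders, [{"order_id": 2, "revenue": 35.5},
                         {"order_id": 3, "revenue": -10.0}])
except ConstraintViolation as exc:
    rejected = str(exc)
```

After this runs, `orders` still holds only the first row. A dashboard or alert would have reported the negative revenue after the fact; the constraint keeps it out of the table.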

Soft expectations: key constraints.

In some environments, especially when using Unity Catalog, teams can also define informational primary key and foreign key constraints. These are not enforced the way traditional relational database constraints are, but they are still valuable metadata: they describe expected uniqueness and relationships between datasets.

Technical signals: data-skipping statistics.

Delta also collects file-level statistics for data skipping. These statistics can include values such as minimum and maximum column values, and they help query engines avoid reading unnecessary files. They are not designed as data quality indicators, but they can support quality-adjacent use cases. For example, they can help answer questions like:

Does this file contain unexpected value ranges?
Are some partitions empty?
Did a column suddenly stop appearing in newly written data?
Is the table layout still useful for common access patterns?

But this is important: Delta statistics are primarily an optimization feature.

They are not a semantic data quality model.
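As a rough sketch of the "unexpected value ranges" question, the snippet below scans per-file statistics for a column whose min or max falls outside an expected range. The field names `numRecords`, `minValues`, `maxValues`, and `nullCount` follow the per-file statistics JSON that Delta writes with each added file, as far as I know; the file paths, values, and the `files_outside_range` helper are invented for illustration.

```python
import json

# Hypothetical `add` actions; real ones live in the table's _delta_log files.
add_actions = [
    {"path": "part-0001.parquet",
     "stats": '{"numRecords": 1000, "minValues": {"revenue": 0.0}, '
              '"maxValues": {"revenue": 950.0}, "nullCount": {"revenue": 0}}'},
    {"path": "part-0002.parquet",
     "stats": '{"numRecords": 800, "minValues": {"revenue": -40.0}, '
              '"maxValues": {"revenue": 700.0}, "nullCount": {"revenue": 12}}'},
]


def files_outside_range(actions, column, low, high):
    """Return paths of files whose min/max statistics for `column` fall outside [low, high]."""
    flagged = []
    for action in actions:
        stats = json.loads(action["stats"])
        col_min = stats.get("minValues", {}).get(column)
        col_max = stats.get("maxValues", {}).get(column)
        if col_min is None or col_max is None:
            continue  # statistics were not collected for this column
        if col_min < low or col_max > high:
            flagged.append(action["path"])
    return flagged


suspicious = files_outside_range(add_actions, "revenue", low=0.0, high=10_000.0)
```

This only tells you which files deserve a closer look. Statistics may be missing for some columns, so an empty result is not proof that the data is clean.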

So Delta gives us three useful layers:

1. Hard rules        → NOT NULL, CHECK constraints
2. Soft expectations → informational keys, schema expectations
3. Technical signals → data-skipping statistics

That is useful, but it is not the same as a full data quality platform.

Delta does not natively answer questions like:

Is this table certified?
Did the freshness check fail this morning?
Is there an open incident?
Was the latest anomaly acknowledged?
Which team owns the failed check?
What was the quality score over the last 30 days?

Those questions belong to the observability and governance layer.

Apache Iceberg

Apache Iceberg has a different but equally interesting metadata model.

Iceberg tracks table state through metadata files, snapshots, manifest lists, and manifest files. Each snapshot represents the state of the table at a point in time. This makes Iceberg strong for table evolution, reproducibility, rollback, and time travel.

File-level metrics.

Iceberg manifests track data files and include file-level metrics. Depending on the writer and table configuration, these metrics may include information such as:

record counts
null value counts
lower and upper bounds
column sizes
value counts

This is very useful metadata.

For data engineers, these metrics can help identify quality symptoms.

For example:

A null count suddenly increases.
A partition has far fewer records than usual.
A timestamp upper bound is older than expected.
A numeric column has values outside the expected range.

Again, though, Iceberg does not treat these as business-level data quality indicators. They are technical metadata used mostly for planning, pruning, and efficient reads.
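A minimal sketch of turning those file-level metrics into a quality symptom: aggregating per-file null counts into a table-level null rate. The rows are shaped like Iceberg's `files` metadata table, where `value_counts` and `null_value_counts` are maps keyed by field ID. The sample values, the mapping of field ID 1 to `customer_id`, and the `null_rate` helper are assumptions made for illustration.

```python
# Rows shaped like Iceberg's `files` metadata table; values are invented.
# Metric maps are keyed by field ID, here 1 = customer_id (an assumption).
file_rows = [
    {"file_path": "data/a.parquet", "record_count": 1000,
     "value_counts": {1: 1000}, "null_value_counts": {1: 10}},
    {"file_path": "data/b.parquet", "record_count": 500,
     "value_counts": {1: 500}, "null_value_counts": {1: 110}},
]


def null_rate(rows, field_id):
    """Aggregate per-file counts into one null rate for a field; None if no metrics exist."""
    total_values = 0
    total_nulls = 0
    for row in rows:
        values = row["value_counts"].get(field_id)
        nulls = row["null_value_counts"].get(field_id)
        if values is None or nulls is None:
            continue  # this writer did not record metrics for the field
        total_values += values
        total_nulls += nulls
    if total_values == 0:
        return None
    return total_nulls / total_values


customer_id_null_rate = null_rate(file_rows, field_id=1)
```

With these sample rows the result is 120 nulls out of 1,500 values, a null rate of 0.08. Whether 0.08 is acceptable is a business decision the metrics themselves cannot make.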

Rich table metadata.

Iceberg (like Delta Lake) also supports custom table properties. This gives teams a simple place to attach stable metadata such as:

quality_certification: certified
data_contract_status: certified
monitoring_required: true
quality_owner: finance-platform
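Setting such properties is typically a single `ALTER TABLE ... SET TBLPROPERTIES` statement in Spark SQL, for both formats. The helper below is a hedged sketch that only builds the statement string; `set_properties_sql` is a hypothetical name, and it rejects unusual characters rather than trying to escape them.

```python
import re

_SAFE_TEXT = re.compile(r"[A-Za-z0-9_.\- ]+")
_SAFE_TABLE = re.compile(r"[A-Za-z0-9_.]+")


def set_properties_sql(table: str, properties: dict) -> str:
    """Build an ALTER TABLE ... SET TBLPROPERTIES statement for stable metadata.

    Keys and values are restricted to a conservative character set instead of
    being escaped, so nothing can break out of the quoted literals.
    """
    if not properties:
        raise ValueError("at least one property is required")
    if not _SAFE_TABLE.fullmatch(table):
        raise ValueError(f"unsupported table name: {table!r}")
    for text in (*properties.keys(), *properties.values()):
        if not _SAFE_TEXT.fullmatch(text):
            raise ValueError(f"unsupported characters in property text: {text!r}")
    # Sort keys so the generated statement is deterministic.
    pairs = ", ".join(f"'{k}' = '{v}'" for k, v in sorted(properties.items()))
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({pairs})"


statement = set_properties_sql("finance.orders", {
    "quality_certification": "certified",
    "monitoring_required": "true",
})
```

In a Spark session configured for Delta Lake or Iceberg, the resulting string could be passed to `spark.sql(statement)`.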

For richer metadata, Iceberg has Puffin files. Puffin is designed to store additional statistics or index-like metadata that does not fit naturally into Iceberg manifests.

This could theoretically support more advanced quality-related artifacts, especially if the industry wanted to standardize richer table-level statistics.

But even with Puffin, I would be careful.

IMHO, Puffin is a place for statistics and technical metadata. It should not become a dumping ground for every data quality run result, incident, alert, and failed check payload.

The Important Distinction: Stable Quality Metadata vs Operational Data Quality Indicators

Delta and Iceberg already provide useful quality-adjacent primitives:

Schema enforcement
Constraints
Snapshots
Table properties
Column statistics
File metrics
Metadata tables
Additional statistics artifacts

These are valuable.

But they mostly answer structural and technical questions:

Is this write valid?
What files belong to this table version?
What are the column-level file statistics?
What changed between snapshots?
What properties describe this table?

They do not answer the higher-level trust questions users care about:

Is this asset certified?
Is it safe for executive reporting?
Did the latest freshness check pass?
Is there an open data incident?
Is this table covered by a data contract?
Who is responsible for quality?

I would not say that Delta and Iceberg need to become data quality systems.

I would say:

Delta and Iceberg should expose stable quality metadata.

Operational Data Quality Results Do Not Belong in Table Metadata

The obvious implementation is tempting.

Every time Monte Carlo, Soda, Great Expectations, dbt tests, or a custom data quality framework runs, write the latest status back to the table metadata.

Something like:

quality_status: failed
quality_score: 0.82
last_checked_at: 2026-05-27T10:30:00Z
failed_checks_count: 3
results_uri: s3://dq-results/finance_metrics/run_123.json

For a small number of tables, this looks feasible. At platform scale, it becomes problematic. Data quality indicators are operational and time-sensitive. They change on every check:

Freshness can fail at 10:00 and recover at 10:15.

A volume anomaly can be detected in one run and disappear in the next.

A null-rate check can fail because of a temporary upstream delay.

Incidents can be opened, acknowledged, suppressed, escalated, or resolved.

If all of that is written directly into table metadata, the metadata layer becomes a high-churn operational store.
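A back-of-the-envelope calculation makes the churn concrete. The numbers below are hypothetical and chosen only to show how the volume scales; the sketch also assumes every check run writes its status back as a separate property update.

```python
def metadata_updates_per_day(tables: int, checks_per_table: int, runs_per_day: int) -> int:
    """Number of table-property writes if every check run writes its status back."""
    return tables * checks_per_table * runs_per_day


# Hypothetical platform sizes, not measurements.
small_team = metadata_updates_per_day(tables=20, checks_per_table=5, runs_per_day=24)
platform = metadata_updates_per_day(tables=5_000, checks_per_table=5, runs_per_day=24)
```

Under these assumptions the small team produces 2,400 metadata writes a day and the platform 600,000, almost none of which describe a real change to the asset.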

That creates several problems.

First, metadata history becomes noisy. Instead of capturing meaningful asset changes — schema updates, ownership changes, lifecycle transitions, contract certification — the table history gets filled with operational status updates.

Second, catalogs and sync systems are not designed to be incident event stores. They are optimized for discovery, governance, lineage, and relatively stable asset metadata. Constantly mutating properties across thousands of assets creates unnecessary load.

Third, consumers may read stale quality state. The data quality system may have the latest result, but the catalog sync may lag. The table property may show yesterday’s status. The BI tool may cache an older version. Now we have multiple versions of “truth.”

Fourth, large per-check payloads do not fit well into table properties. Detailed DQ output includes check names, thresholds, observed values, sample failures, incident links, owners, routing rules, and historical context.

Trying to squeeze this into table metadata turns metadata into an awkward JSON dump. That is not asset metadata anymore.

That is an operational log pretending to be metadata.

A data quality platform (Monte Carlo, Soda, dbt tests, or your own custom platform) should be the source of truth for operational quality state because it is designed to store check history, detect anomalies, route alerts, manage incidents, and provide debugging context. Catalogs and table metadata should expose only stable trust signals, such as certification or contract status, while detailed check results remain in the DQ platform.

The Better Pattern: Stable Metadata + Operational Evidence

The compromise I like is simple.

Put stable state in the asset metadata.

quality_certification: certified
data_contract_status: certified
quality_owner: data-platform
monitoring_required: true
sla_tier: gold

Put run-level results in a dedicated system.

That could be Monte Carlo. It could be Soda Cloud. It could be Great Expectations Cloud. It could be a custom observability service. It could also be a sidecar Delta or Iceberg table if you want a queryable internal history, with columns such as:

dq_run_id
asset_id
checked_at
status
score
failed_checks_count
failed_checks
incident_id
results_uri
producer

This gives us both things we need:

The asset remains clean and discoverable.

The operational history remains detailed and queryable.
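To make the sidecar-table option concrete, here is a minimal in-memory sketch of the run history and the most common query against it: the latest result per asset. `DQRun` mirrors the columns listed above, while `latest_run_per_asset`, the sample run IDs, and the sample values are hypothetical. A real implementation would append these rows to a Delta or Iceberg table instead of a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class DQRun:
    """One row of a hypothetical sidecar history table."""
    dq_run_id: str
    asset_id: str
    checked_at: datetime
    status: str
    score: float
    failed_checks: Tuple[str, ...] = ()
    incident_id: Optional[str] = None
    results_uri: Optional[str] = None
    producer: str = "custom"

    @property
    def failed_checks_count(self) -> int:
        # Derived from failed_checks so the two can never disagree.
        return len(self.failed_checks)


def latest_run_per_asset(runs):
    """Return {asset_id: most recent DQRun}, which is what a catalog UI would link to."""
    latest = {}
    for run in runs:
        current = latest.get(run.asset_id)
        if current is None or run.checked_at > current.checked_at:
            latest[run.asset_id] = run
    return latest


# Freshness fails at 10:00 and recovers at 10:15; both runs stay in the history.
history = [
    DQRun("dq_run_20260527_100000", "finance.orders",
          datetime(2026, 5, 27, 10, 0, tzinfo=timezone.utc),
          status="failed", score=0.82, failed_checks=("freshness",)),
    DQRun("dq_run_20260527_101500", "finance.orders",
          datetime(2026, 5, 27, 10, 15, tzinfo=timezone.utc),
          status="passed", score=1.0),
]

latest = latest_run_per_asset(history)
```

The table properties never change during this sequence. Only the history grows, and the failed run remains available for debugging after the check recovers.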

Data, Engineering, and Beyond
