Data quality In Delta Lake and Iceberg

Part 2: Three Pillars of Data Engineering Monitoring

Sergey

May 28, 2026


For data engineers, “data quality” is only one part of the operational picture.

A production data asset is trustworthy only when three things are true:

1. Pipeline Health - the pipeline that creates the data is healthy.

2. Data Quality - the data is correct enough for its use case.

3. Data Freshness - the data is fresh enough for its SLA.

They are not the same thing.

A pipeline can be green while the data is wrong.

A table can pass all schema and null checks while still being stale.

A freshness check can pass even if the pipeline is silently producing duplicated data.

Each pillar should produce operational signals, and only some of those signals should be promoted into stable asset metadata.

This is the second part of a series. The previous part is "Data quality In Delta Lake and Iceberg" (Sergey, May 28).

Pillar 1: Pipeline Health

Pipeline health answers the question:

Did the system that produces this asset run successfully?

This is usually monitored by Airflow, dbt Cloud, Spark jobs, Databricks Workflows, Flink, Kafka Connect, or another orchestration/runtime system.

Typical pipeline health signals:

pipeline.last_run_status: success | failed | skipped | running
pipeline.last_run_at: 2026-05-27T08:00:00Z
pipeline.last_successful_run_at: 2026-05-27T08:00:00Z
pipeline.duration_seconds: 842
pipeline.retries: 1
pipeline.owner: data-platform
pipeline.run_url: https://orchestrator/runs/123

Taken together, healthy values for these signals could mean:

The Airflow DAG finished successfully at 08:00.
The Spark job wrote the output table.
No task failed.
The pipeline SLA was met.

This is good news, but it does not prove the data is correct.

A pipeline can complete successfully and still produce bad data because of upstream delays, incorrect joins, unexpected source changes, or silent business logic issues.

Pipeline Health Check

A data engineer may define a simple rule:

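# `orchestrator` and `catalog` stand in for your orchestration and catalog clients.
pipeline_run = orchestrator.get_latest_run("build_finance_orders_mart")

# Mark the asset as not trusted before failing, so consumers see the state change.
if pipeline_run.status != "success":
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        pipeline_health="failed",
        asset_state="not_trusted"
    )
    raise RuntimeError("Pipeline failed")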

This signal is useful for operational dashboards and incident routing.

But I would not store every task log, retry event, executor metric, and stack trace in the table metadata. Those belong in the orchestrator and logging system.

The asset metadata should expose a compact state:

pipeline_health: healthy
last_successful_pipeline_run_at: 2026-05-27T08:00:00Z
pipeline_owner: finance-data-eng
pipeline_run_url: https://orchestrator/runs/123
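One way to get from a noisy run history to that compact state is a small projection function. The sketch below is only an illustration: `PipelineRun` and `compact_pipeline_state` are hypothetical names rather than any orchestrator's API, and the run records are assumed to have been fetched already, oldest first.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class PipelineRun:
    """Hypothetical run record, as it might be fetched from an orchestrator."""
    status: str            # "success" | "failed" | "skipped" | "running"
    finished_at: datetime
    owner: str
    run_url: str


def compact_pipeline_state(runs: List[PipelineRun]) -> Dict[str, Optional[str]]:
    """Reduce a run history (oldest first) to the compact asset-level summary."""
    if not runs:
        return {
            "pipeline_health": "unknown",
            "last_successful_pipeline_run_at": None,
            "pipeline_owner": None,
            "pipeline_run_url": None,
        }

    latest = runs[-1]
    # Health is decided by the most recent run that reached a terminal state;
    # skipped or still-running runs do not change it.
    terminal = [r for r in runs if r.status in ("success", "failed")]
    if not terminal:
        health = "unknown"
    else:
        health = "healthy" if terminal[-1].status == "success" else "failed"

    successes = [r for r in runs if r.status == "success"]
    last_success = successes[-1].finished_at.isoformat() if successes else None

    return {
        "pipeline_health": health,
        "last_successful_pipeline_run_at": last_success,
        "pipeline_owner": latest.owner,
        "pipeline_run_url": latest.run_url,  # pointer back to the full logs
    }
```

Everything else (task logs, retries, stack traces) stays reachable through `pipeline_run_url` instead of being copied into the asset metadata.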

Pillar 2: Data Quality

Data quality answers the question:

Did the produced data meet the expected rules?

This is where tools such as Monte Carlo, Soda, Great Expectations, dbt tests, or even custom Spark checks are useful.

Typical data quality checks:

Required columns are not null.
Primary business keys are unique.
Revenue is non-negative.
Country codes match reference data.
Order status belongs to an allowed list.
Row count is within an expected range.
Distribution of values did not unexpectedly shift.
Schema did not break.
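Several of these checks are simple enough to express as plain predicates over rows. The following is a minimal sketch in pure Python, assuming rows arrive as dictionaries; the allowed status list and the row count bounds are made-up values, and a real deployment would more likely use one of the tools named above or run the checks in Spark or SQL.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, object]

# Assumed values for illustration only.
ALLOWED_STATUSES = {"created", "paid", "shipped", "cancelled"}
MIN_ROWS, MAX_ROWS = 1, 1000


def order_id_not_null(rows: List[Row]) -> bool:
    return all(r.get("order_id") is not None for r in rows)


def unique_order_id(rows: List[Row]) -> bool:
    counts = Counter(r.get("order_id") for r in rows)
    return all(n == 1 for n in counts.values())


def revenue_non_negative(rows: List[Row]) -> bool:
    # Null revenue is left to a separate not-null check.
    return all(r["revenue"] >= 0 for r in rows if r.get("revenue") is not None)


def valid_order_status(rows: List[Row]) -> bool:
    return all(r.get("status") in ALLOWED_STATUSES for r in rows)


def row_count_within_expected_range(rows: List[Row]) -> bool:
    return MIN_ROWS <= len(rows) <= MAX_ROWS


CHECKS: Dict[str, Callable[[List[Row]], bool]] = {
    "order_id_not_null": order_id_not_null,
    "unique_order_id": unique_order_id,
    "revenue_non_negative": revenue_non_negative,
    "valid_order_status": valid_order_status,
    "row_count_within_expected_range": row_count_within_expected_range,
}


def run_checks(rows: List[Row]) -> List[str]:
    """Return the names of the failed checks, sorted alphabetically."""
    return sorted(name for name, check in CHECKS.items() if not check(rows))
```

Distribution shift and schema checks need a baseline to compare against, so they are left out of this sketch.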

Example DQ result:

dq.status: failed
dq.failed_checks_count: 2
dq.failed_checks:
  - order_id_not_null
  - revenue_non_negative
dq.run_id: dq_run_123
dq.results_uri: https://dq-platform/runs/123

This is operational state. It may change every time checks run.

The detailed result should stay in the data quality platform or a sidecar DQ results table.

The asset metadata should expose only a stable summary or lifecycle signal:

quality_certification: certified
data_contract_status: warning
monitoring_required: true
quality_owner: finance-platform
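The split between a changing run result and a stable summary can be sketched as a function that collapses the list of failed checks into one status. This is an assumption-heavy illustration: which checks count as critical, the `summarize_dq_run` name, and the `dq_results_uri` pointer field are all hypothetical choices, not part of any DQ tool.

```python
from typing import Dict, Iterable, Set

# Assumed severity mapping: failures here block the asset, others only warn.
CRITICAL_CHECKS: Set[str] = {"order_id_not_null", "unique_order_id"}


def summarize_dq_run(failed_checks: Iterable[str], results_uri: str) -> Dict[str, str]:
    """Collapse one DQ run into the compact summary kept on the asset."""
    failed = set(failed_checks)
    if failed & CRITICAL_CHECKS:
        status = "blocked"
    elif failed:
        status = "warning"
    else:
        status = "certified"
    return {
        "data_contract_status": status,
        # Pointer to the full result; the details themselves stay in the DQ system.
        "dq_results_uri": results_uri,
    }
```

For the example result above, `order_id_not_null` is in the assumed critical set, so the summary would be `blocked`, while a run that only failed `revenue_non_negative` would surface as `warning`.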

Data Quality Gates

A pipeline can run checks after producing a table:

orders_mart = build_orders_mart(raw_orders)

write_table(orders_mart, "finance.orders_mart")

dq_result = dq.run_checks(
    asset="finance.orders_mart",
    checks=[
        "order_id_not_null",
        "unique_order_id",
        "revenue_non_negative",
        "valid_order_status",
        "row_count_within_expected_range"
    ]
)

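# Keep the detailed run result in the DQ results store, not in asset metadata.
dq_results_table.append(dq_result)

# Critical failures block the asset and stop the pipeline.
if dq_result.has_critical_failures:
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        data_contract_status="blocked"
    )
    raise RuntimeError("Critical DQ checks failed")

# Only a passing run promotes the stable status on the asset.
if dq_result.passed_required_checks:
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        data_contract_status="certified"
    )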

The DQ run result is stored in the DQ results system, while certification status is stored in asset metadata.

That keeps the asset metadata clean while still allowing pipelines to act on quality.

Pillar 3: Data Freshness

Data freshness answers the question:

Is the data current enough for the business use case?

Freshness is related to pipeline health, but it is not the same thing.

A pipeline may run successfully at 08:00, but if the upstream source stopped receiving new events at 02:00, the table is still stale.

Freshness checks usually compare expected arrival patterns with actual data timestamps or ingestion times.

Typical freshness signals:

freshness.status: fresh | stale | delayed | unknown
freshness.max_event_time: 2026-05-27T07:55:00Z
freshness.max_ingested_at: 2026-05-27T08:01:00Z
freshness.expected_by: 2026-05-27T08:15:00Z
freshness.delay_minutes: 20
freshness.sla_minutes: 60

Example:

The pipeline ran successfully.
The table has new rows.
But the newest business event is six hours old.
For a near-real-time reporting table, this is a freshness failure.
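The scenario above can be made concrete with a small calculation. The sketch below assumes freshness is measured as the age of the newest business event relative to the current time, which is one possible definition of `delay_minutes`; `freshness_signal` is a hypothetical helper, and `max_event_time` would normally come from a query such as a `MAX` over the event timestamp column.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def freshness_signal(
    max_event_time: Optional[datetime],
    now: datetime,
    sla_minutes: int,
) -> Dict[str, object]:
    """Classify freshness from the newest event time, not from the pipeline run time."""
    if max_event_time is None:
        # An empty table or a missing timestamp gives no evidence either way.
        return {"status": "unknown", "delay_minutes": None, "sla_minutes": sla_minutes}

    delay_minutes = int((now - max_event_time).total_seconds() // 60)
    status = "fresh" if delay_minutes <= sla_minutes else "stale"
    return {"status": status, "delay_minutes": delay_minutes, "sla_minutes": sla_minutes}


# The pipeline finished at 08:00, but the newest event is from 02:00.
now = datetime(2026, 5, 27, 8, 0, tzinfo=timezone.utc)
newest_event = datetime(2026, 5, 27, 2, 0, tzinfo=timezone.utc)
signal = freshness_signal(newest_event, now, sla_minutes=60)
# signal["status"] is "stale" and signal["delay_minutes"] is 360
```

Because the function never looks at the pipeline run status, a green run cannot mask a stalled source.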

Freshness Gate

For a critical dashboard table, a data engineer may define:

freshness = dq.get_freshness_status("finance.orders_mart")

if freshness.delay_minutes > freshness.sla_minutes:
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        freshness_state="stale",
        data_contract_status="warning"
    )
    notify_owner("finance.orders_mart is stale")

For a downstream pipeline:

# Check the upstream asset before rebuilding anything that depends on it.
freshness = dq.get_freshness_status("raw.orders")

if freshness.status == "stale":
    raise RuntimeError("Cannot rebuild finance.orders_mart because raw.orders is stale")

Freshness is often the most important quality signal for business users.

A table can have perfect schema, zero nulls, and valid values, but if it is three days late, it is not useful.

How the Three Pillars Work Together

The real value comes from combining the three signals.

Consider this asset:

asset: finance.orders_mart
pipeline_health: healthy
data_quality_status: passed
freshness_status: fresh
quality_certification: certified

This table is in good shape.

Now consider this one:

asset: finance.orders_mart
pipeline_health: healthy
data_quality_status: passed
freshness_status: stale
quality_certification: certified

The pipeline is green and quality checks pass, but the data is stale. This should trigger a warning, especially for operational dashboards.

Another case:

asset: finance.orders_mart
pipeline_health: failed
data_quality_status: unknown
freshness_status: stale
quality_certification: certified

The last run failed, so we do not know whether the latest data would pass checks. Freshness is stale because the table has not been updated. This is primarily a pipeline incident.

Another case:

asset: finance.orders_mart
pipeline_health: healthy
data_quality_status: failed
freshness_status: fresh
quality_certification: suspended

The pipeline ran and the data is fresh, but the values are wrong. This is a data quality incident.

These distinctions matter because the response should be different.

Pipeline failed → fix orchestration, runtime, permissions, infrastructure, code
DQ failed → inspect data values, business rules, source changes, transformations
Freshness failed → inspect source arrival, ingestion lag, scheduling, upstream dependencies
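The routing above can be sketched as a small triage function over the three signals. This is an illustrative sketch, not a prescribed design: the `triage` name is hypothetical, and the rule that a failed pipeline takes precedence over the other signals is an assumption drawn from the failed-pipeline case described earlier.

```python
from typing import Dict, List

RESPONSES: Dict[str, str] = {
    "pipeline": "fix orchestration, runtime, permissions, infrastructure, code",
    "data_quality": "inspect data values, business rules, source changes, transformations",
    "freshness": "inspect source arrival, ingestion lag, scheduling, upstream dependencies",
}


def triage(asset_state: Dict[str, str]) -> List[str]:
    """Return the incident types implied by the three pillar signals."""
    if asset_state.get("pipeline_health") == "failed":
        # Stale or unknown downstream signals are treated as consequences
        # of the failed run, so this is routed as a pipeline incident only.
        return ["pipeline"]

    incidents: List[str] = []
    if asset_state.get("data_quality_status") == "failed":
        incidents.append("data_quality")
    if asset_state.get("freshness_status") == "stale":
        incidents.append("freshness")
    return incidents


state = {
    "pipeline_health": "healthy",
    "data_quality_status": "passed",
    "freshness_status": "stale",
}
for incident in triage(state):
    print(f"{incident}: {RESPONSES[incident]}")
```

A single `quality_status` field could not drive this routing, because it would not say which of the three branches to take.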

This is why the metadata model should not simply say:

quality_status: failed

That is too vague.

A better model says:

pipeline_health: healthy
data_quality_status: passed
freshness_status: stale
data_contract_status: warning

Now we know what is wrong and can triage it.

Data, Engineering, and Beyond
