Data quality In Delta Lake and Iceberg

Part 3: Getting practical with data quality

Sergey

May 29, 2026


This is the third part of a series. The previous parts are:

Data quality In Delta Lake and Iceberg (Sergey, May 28): Most companies already run some form of data quality monitoring. They have freshness checks, null checks, schema validation, row count checks, sometimes even anomaly detection, alerting, and incident workflows.

Data quality In Delta Lake and Iceberg (Sergey, May 28): For data engineers, “data quality” is only one part of the operational picture.

Imagine a daily finance.orders_mart table used by executives.

The data contract says:

asset: finance.orders_mart
expected_schedule: daily
freshness_sla: available by 08:30 Europe/Warsaw
required_checks:
  - order_id_not_null
  - unique_order_id
  - revenue_non_negative
  - valid_currency
  - row_count_within_expected_range
owner: finance-data-platform
alert_route: "#data-finance-alerts"
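One way to make the contract actionable is to load it into a small typed object and compare it with what a DQ run actually executed. The sketch below is illustrative only: `DataContract` and `missing_checks` are hypothetical names, not part of any DQ tool, and the fields simply mirror the contract above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical in-memory form of the contract shown above."""
    asset: str
    freshness_sla: str
    required_checks: tuple[str, ...]
    owner: str
    alert_route: str


def missing_checks(contract: DataContract, executed: set[str]) -> list[str]:
    """Return required checks that the DQ run did not execute, in contract order."""
    return [name for name in contract.required_checks if name not in executed]


contract = DataContract(
    asset="finance.orders_mart",
    freshness_sla="available by 08:30 Europe/Warsaw",
    required_checks=(
        "order_id_not_null",
        "unique_order_id",
        "revenue_non_negative",
        "valid_currency",
        "row_count_within_expected_range",
    ),
    owner="finance-data-platform",
    alert_route="#data-finance-alerts",
)
```

A run that skipped a required check can then be treated as a contract violation, even if every check that did run passed.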

A pipeline run starts at 08:00.

Step 1: Check Upstream Freshness

raw_orders_freshness = dq.get_freshness_status("raw.orders")

if raw_orders_freshness.status == "stale":
    stop_pipeline("raw.orders is stale")

Step 2: Run Transformation

orders = spark.table("raw.orders")
orders_mart = build_orders_mart(orders)

write_table(orders_mart, "finance.orders_mart")

Step 3: Check Pipeline Health

pipeline_status = orchestrator.get_current_run_status()

if pipeline_status == "success":
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        pipeline_health="healthy",
        last_successful_pipeline_run_at=now()
    )

Step 4: Run DQ Checks

dq_result = dq.run_checks(
    asset="finance.orders_mart",
    checks=[
        "order_id_not_null",
        "unique_order_id",
        "revenue_non_negative",
        "valid_currency",
        "row_count_within_expected_range"
    ]
)

dq_results_table.append(dq_result)

Step 5: Check Output Freshness

freshness = dq.check_freshness(
    asset="finance.orders_mart",
    timestamp_column="order_created_at",
    expected_by="08:30",
    timezone="Europe/Warsaw"
)

Step 6: Publish Stable Asset State

if pipeline_status == "success" and dq_result.passed and freshness.passed:
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        data_contract_status="certified",
        quality_certification="certified",
        pipeline_health="healthy",
        freshness_state="fresh"
    )
else:
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        data_contract_status="warning"
    )
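The publishing decision in Step 6 can be isolated into a pure function, which makes it easy to unit test without a catalog. This is a minimal sketch: `derive_asset_state` is a hypothetical helper, and unlike the snippet above it also records degraded `pipeline_health` and `freshness_state` values. The labels `"failing"` and `"stale"` are assumptions.

```python
def derive_asset_state(pipeline_ok: bool, dq_passed: bool, is_fresh: bool) -> dict[str, str]:
    """Collapse three run-level signals into the stable properties written to the catalog."""
    certified = pipeline_ok and dq_passed and is_fresh
    state = {
        "data_contract_status": "certified" if certified else "warning",
        # "failing" and "stale" are assumed labels for the non-happy path.
        "pipeline_health": "healthy" if pipeline_ok else "failing",
        "freshness_state": "fresh" if is_fresh else "stale",
    }
    if certified:
        # Only a fully passing run (re)asserts certification; otherwise it is left untouched.
        state["quality_certification"] = "certified"
    return state
```

The returned dictionary could be passed to the catalog update as keyword arguments, so the catalog never keeps showing `healthy` or `fresh` after a run that was neither.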

The detailed operational results should stay in the right systems:

Pipeline logs      → orchestrator
DQ check results   → DQ platform / sidecar results table
Freshness history  → DQ platform / sidecar results table
Stable trust state → catalog / asset metadata

The rule is simple:

Operational evidence belongs in operational systems.
Stable trust state belongs in asset metadata.

The Combined Metadata Model

For asset metadata, I suggest exposing a compact view:

asset: finance.orders_mart

pipeline:
  health: healthy
  last_successful_run_at: 2026-05-27T08:12:00+02:00
  owner: finance-data-platform
  run_url: https://orchestrator/runs/123

quality:
  certification: certified
  contract_status: certified
  monitoring_required: true
  owner: finance-data-platform
  latest_results_uri: https://dq-platform/runs/456

freshness:
  state: fresh
  sla: available_by_08_30
  last_checked_at: 2026-05-27T08:20:00+02:00
  latest_results_uri: https://dq-platform/freshness/789

This is enough for catalogs, BI tools, policies, and pipelines. It is not trying to store every check result.
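If the compact view is mirrored into table properties, which are flat string key-value pairs in both Delta Lake and Iceberg, the nested structure has to be flattened first. The sketch below shows one possible convention. The `trust` prefix and the `flatten_trust_metadata` helper are assumptions for illustration, not an established standard.

```python
def flatten_trust_metadata(view: dict, prefix: str = "trust") -> dict[str, str]:
    """Flatten a nested metadata view into dotted string keys and string values."""
    flat: dict[str, str] = {}
    for key, value in view.items():
        dotted = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten_trust_metadata(value, dotted))
        elif isinstance(value, bool):
            # Table properties are strings, so booleans become "true"/"false".
            flat[dotted] = "true" if value else "false"
        else:
            flat[dotted] = str(value)
    return flat


view = {
    "pipeline": {"health": "healthy", "owner": "finance-data-platform"},
    "quality": {"certification": "certified", "monitoring_required": True},
    "freshness": {"state": "fresh", "sla": "available_by_08_30"},
}
properties = flatten_trust_metadata(view)
```

Here `properties` contains keys such as `trust.quality.certification` and `trust.freshness.state`, which stay small and change only when the trust state changes.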

How Data Engineers Should Use Quality Indicators in Pipelines

The most useful quality metadata is not decorative.

It should change how pipelines behave.

A tag like quality_certification=certified is only valuable if systems and people use it to make decisions. Otherwise, it is just another label in a catalog.

For data engineers, quality indicators can become pipeline control signals.

1. Input Gating

Before a pipeline reads from an upstream table, it can check the upstream asset state.

For example:

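import warnings

# Read the stable trust state from the catalog, not individual DQ results.
source_quality = catalog.get_asset_quality("raw.orders")

# A blocked contract is a hard stop.
if source_quality.data_contract_status == "blocked":
    raise RuntimeError("raw.orders is blocked by its data contract status")

# A deprecated input only warns, so existing pipelines keep running.
if source_quality.data_contract_status == "deprecated":
    warnings.warn("raw.orders is deprecated and should not be used for new pipelines")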

This does not require parsing every DQ result. The pipeline only needs a stable signal:

data_contract_status: certified | warning | blocked | deprecated

This allows teams to prevent bad data from flowing silently into downstream tables.
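To keep the gate strict about the four allowed values, the status can be parsed into an enum before any decision is made. The following is a sketch under assumptions: `ContractStatus`, `BlockedInputError`, and `gate_input` are hypothetical names, and the status string is assumed to come from the catalog.

```python
from enum import Enum


class ContractStatus(str, Enum):
    CERTIFIED = "certified"
    WARNING = "warning"
    BLOCKED = "blocked"
    DEPRECATED = "deprecated"


class BlockedInputError(RuntimeError):
    """Raised when an upstream asset must not be read."""


def gate_input(asset: str, raw_status: str) -> list[str]:
    """Raise for blocked inputs; otherwise return warnings the caller should log."""
    status = ContractStatus(raw_status)  # raises ValueError on an unknown status
    if status is ContractStatus.BLOCKED:
        raise BlockedInputError(f"{asset} is blocked by its data contract status")
    if status is ContractStatus.DEPRECATED:
        return [f"{asset} is deprecated and should not be used for new pipelines"]
    if status is ContractStatus.WARNING:
        return [f"{asset} has data contract status 'warning'"]
    return []
```

Failing on an unknown status is a deliberate choice in this sketch: a typo in the catalog should stop the pipeline rather than be silently treated as certified.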

2. Freshness-Aware Execution

Some pipelines should only run if the upstream data is fresh enough.

For example, a daily revenue table should not be rebuilt if the upstream orders table has not received today’s data.

The pipeline can check a freshness signal:

freshness = dq_service.get_latest_check(
    asset="raw.orders",
    check="freshness"
)

if freshness.status == "failed":
    raise Exception("Upstream orders data is stale")

There are two possible patterns here.

For critical operational freshness, read from the DQ platform directly because it has the latest check state.

For stable lifecycle decisions, read from the catalog or asset metadata.

That gives us a useful split:

Need latest operational status? → DQ platform
Need stable trust state?        → Catalog / asset metadata
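Whichever system answers the question, the freshness signal itself usually reduces to comparing the latest load time with a maximum allowed age. A minimal sketch, assuming the caller already knows when the upstream table was last loaded; `is_fresh` and the six-hour threshold in the usage lines are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True if the latest load is no older than max_age."""
    if last_loaded_at.tzinfo is None or now.tzinfo is None:
        # Naive datetimes hide offset mistakes, so reject them outright.
        raise ValueError("use timezone-aware datetimes")
    return now - last_loaded_at <= max_age


now = datetime(2026, 5, 27, 6, 0, tzinfo=timezone.utc)
loaded_today = datetime(2026, 5, 27, 1, 0, tzinfo=timezone.utc)
loaded_yesterday = datetime(2026, 5, 26, 1, 0, tzinfo=timezone.utc)

fresh_today = is_fresh(loaded_today, now, timedelta(hours=6))
fresh_yesterday = is_fresh(loaded_yesterday, now, timedelta(hours=6))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check deterministic in tests and in backfills.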

3. Promotion Gates

Quality indicators can, and should, control movement between data layers.

For example:

bronze → silver
Requires:
- schema valid
- required fields present
- basic freshness check passing
- no severe ingestion errors

silver → gold
Requires:
- business rules passing
- accepted volume ranges
- key dimensions populated
- owner assigned

gold → certified
Requires:
- data contract approved
- monitoring enabled
- SLA defined
- alert routing configured
- successful check history

This makes certification a real engineering workflow instead of a manual catalog label.

A pipeline could implement this as:

dq_result = dq.run_checks("curated.orders")

if dq_result.passed_required_checks:
    catalog.update_asset_metadata(
        asset="curated.orders",
        data_contract_status="managed"
    )

if dq_result.passed_certification_checks and owner_approved:
    catalog.update_asset_metadata(
        asset="curated.orders",
        quality_certification="certified"
    )

The important part is that the pipeline does not write every check result into the table metadata.

It writes detailed results to the DQ system, then updates only the stable asset state when the lifecycle state changes.
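The layer requirements listed earlier in this section can also be expressed as data, so that the gate reports exactly what is missing. This is a sketch, not a standard: the requirement labels are paraphrased from the lists, and `PROMOTION_REQUIREMENTS`, `unmet_requirements`, and `can_promote` are hypothetical names.

```python
# Requirement labels are paraphrased from the promotion lists; they are not a standard.
PROMOTION_REQUIREMENTS: dict[str, frozenset[str]] = {
    "silver": frozenset({
        "schema_valid", "required_fields_present",
        "basic_freshness_passing", "no_severe_ingestion_errors",
    }),
    "gold": frozenset({
        "business_rules_passing", "volume_in_accepted_range",
        "key_dimensions_populated", "owner_assigned",
    }),
    "certified": frozenset({
        "data_contract_approved", "monitoring_enabled", "sla_defined",
        "alert_routing_configured", "successful_check_history",
    }),
}


def unmet_requirements(target_layer: str, satisfied: set[str]) -> list[str]:
    """Return the requirements still missing for promotion, sorted by name."""
    return sorted(PROMOTION_REQUIREMENTS[target_layer] - satisfied)


def can_promote(target_layer: str, satisfied: set[str]) -> bool:
    """True when every requirement of the target layer is satisfied."""
    return not unmet_requirements(target_layer, satisfied)
```

The sketch checks one transition at a time. It assumes the asset already sits in the previous layer, so promoting to gold does not re-verify the silver requirements.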

4. Output Validation

Every important pipeline should validate what it produces.

This is where DQ tools are most useful.

After writing the output table, the pipeline runs checks such as:

order_id is not null
revenue is non-negative
event_time is within expected freshness window
row count is within expected range
country_code matches known reference values
no duplicate primary business keys

Then it publishes the result:

dq_result = dq.run_checks(
    asset="finance.orders_mart",
    checks=[
        "order_id_not_null",
        "revenue_non_negative",
        "freshness_within_sla",
        "row_count_within_expected_range",
        "valid_country_code",
        "unique_order_id"
    ]
)

dq_results_table.append(dq_result)
dq_platform.publish(dq_result)

The asset metadata may then be updated with a stable summary:

monitoring_required: true
quality_certification: certified
data_contract_status: certified
quality_owner: finance-data-platform

But the run-level details stay outside the asset metadata.

5. Incident-Aware Pipelines

If a critical upstream asset has an open incident, downstream jobs can behave differently.

For example:

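incident = dq_service.get_open_incident("raw.payments")

# Assumption: get_open_incident returns None when no incident is open.
if incident is not None and incident.severity == "critical":
    stop_pipeline("raw.payments has an open critical incident")
elif incident is not None and incident.severity == "warning":
    run_pipeline_but_mark_output_as_impacted()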

This enables a more nuanced model than simply “run or fail.”

Possible actions:

Critical failure → stop pipeline
Warning          → continue but mark output as impacted
Deprecated input → continue for existing jobs, block new dependencies
Freshness delay  → wait, retry, or skip publish
Schema break     → fail immediately

This is how quality metadata becomes operationally useful.
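The action table above can be written as a small policy lookup. The sketch below is one possible encoding: the signal names, the `Action` enum, and especially the restrictiveness ordering used when several signals are active are my assumptions, and "wait, retry, or skip publish" is simplified to a single action.

```python
from enum import Enum


class Action(Enum):
    FAIL_IMMEDIATELY = "fail_immediately"
    STOP = "stop"
    WAIT_AND_RETRY = "wait_and_retry"
    CONTINUE_MARK_IMPACTED = "continue_mark_impacted"
    CONTINUE_EXISTING_ONLY = "continue_existing_only"
    RUN = "run"


# Signal names are assumed labels for the rows of the action table.
POLICY: dict[str, Action] = {
    "critical_failure": Action.STOP,
    "warning": Action.CONTINUE_MARK_IMPACTED,
    "deprecated_input": Action.CONTINUE_EXISTING_ONLY,
    "freshness_delay": Action.WAIT_AND_RETRY,
    "schema_break": Action.FAIL_IMMEDIATELY,
}

# Most restrictive first, so simultaneous signals resolve predictably.
RESTRICTIVENESS = list(Action)


def decide(signals: list[str]) -> Action:
    """Pick the most restrictive action among all active upstream signals."""
    actions = {POLICY[signal] for signal in signals} | {Action.RUN}
    return min(actions, key=RESTRICTIVENESS.index)
```

With no active signals the pipeline simply runs. An unknown signal raises `KeyError`, which keeps new incident types from being ignored by accident.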

6. Lineage-Aware Impact Propagation

The most powerful pattern is lineage-aware propagation.

If raw.orders fails, the platform should know that curated.orders, finance.revenue_mart, and executive.arr_dashboard may be impacted.

This does not mean all downstream tables immediately become “failed.” It means their trust state should reflect dependency risk.

For example:

quality_state: impacted
impacted_by: raw.orders
impact_reason: upstream_freshness_failed

This is extremely useful for consumers.

Instead of discovering a broken dashboard manually, users can see that the underlying data is impacted by an upstream incident.

For data engineers, this also helps prioritize response. If a failed raw table impacts a board-level dashboard, it should be treated differently from a failure in an unused sandbox table.
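Propagation like this is a graph traversal over lineage. The sketch below uses a breadth-first walk. The edges in `LINEAGE` are an assumed chain built from the asset names in the example, and `propagate_impact` is a hypothetical helper rather than a catalog API.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to its direct downstream consumers.
LINEAGE: dict[str, list[str]] = {
    "raw.orders": ["curated.orders"],
    "curated.orders": ["finance.revenue_mart"],
    "finance.revenue_mart": ["executive.arr_dashboard"],
    "executive.arr_dashboard": [],
}


def propagate_impact(
    failed_asset: str, reason: str, lineage: dict[str, list[str]]
) -> dict[str, dict[str, str]]:
    """Breadth-first walk that marks every transitive downstream asset as impacted."""
    impacted: dict[str, dict[str, str]] = {}
    queue = deque(lineage.get(failed_asset, []))
    while queue:
        asset = queue.popleft()
        if asset in impacted or asset == failed_asset:
            continue  # already visited; also guards against cycles
        impacted[asset] = {
            "quality_state": "impacted",
            "impacted_by": failed_asset,
            "impact_reason": reason,
        }
        queue.extend(lineage.get(asset, []))
    return impacted


result = propagate_impact("raw.orders", "upstream_freshness_failed", LINEAGE)
```

The failed asset itself is not in the result, because it has failed rather than been impacted. Only its downstream dependents receive the `impacted` state.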

7. Environment and Release Gates

Quality metadata can also be used in CI/CD workflows.

For example, a data contract change should not be promoted to production unless required checks exist.

A dbt model should not be marked as certified unless it has an owner, tests, freshness checks, and alert routing.

A table should not move from experimental to managed unless it has basic quality coverage.

This turns quality into a release policy:

No owner               → cannot certify
No freshness check     → cannot certify
No null checks         → cannot certify
No alert route         → cannot certify
Open critical incident → cannot promote

This is where catalog metadata, DQ results, and pipeline orchestration come together.
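In a CI job, the release policy can be a function that returns the list of violations, with the job failing when the list is not empty. A minimal sketch, assuming the asset description is a plain dictionary; the keys `owner`, `checks`, `alert_route`, and `open_critical_incident` and the check names `freshness` and `not_null` are hypothetical.

```python
def certification_violations(asset: dict) -> list[str]:
    """Return human-readable reasons why an asset cannot be certified or promoted."""
    checks = set(asset.get("checks", []))
    violations = []
    if not asset.get("owner"):
        violations.append("no owner: cannot certify")
    if "freshness" not in checks:
        violations.append("no freshness check: cannot certify")
    if "not_null" not in checks:
        violations.append("no null checks: cannot certify")
    if not asset.get("alert_route"):
        violations.append("no alert route: cannot certify")
    if asset.get("open_critical_incident", False):
        violations.append("open critical incident: cannot promote")
    return violations


candidate = {
    "owner": "finance-data-platform",
    "checks": ["freshness", "not_null"],
    "alert_route": "#data-finance-alerts",
}
problems = certification_violations(candidate)
```

Returning every violation at once, instead of stopping at the first, lets the asset owner fix all gaps in a single change.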

8. BI and Consumer Warnings

Data engineers also need to think about downstream consumption.

A BI tool, notebook environment, or query interface can read asset trust metadata and display warnings.

For example:

Warning: this table is not certified.
Warning: this table is impacted by an upstream freshness incident.
Warning: this table is deprecated and will be removed after 2026-09-01.

This is not a pipeline pattern, but data engineers need to publish the metadata that makes it possible.
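On the consumer side, turning the published metadata into warning text takes very little logic. The sketch below is an assumption about how a BI tool or notebook plugin might do it; `build_warnings` and the `removal_date` property are hypothetical.

```python
def build_warnings(meta: dict) -> list[str]:
    """Turn stable trust metadata into warning lines a BI tool could display."""
    lines = []
    if meta.get("quality_certification") != "certified":
        lines.append("Warning: this table is not certified.")
    if meta.get("quality_state") == "impacted":
        source = meta.get("impacted_by", "an upstream asset")
        lines.append(f"Warning: this table is impacted by an incident on {source}.")
    if meta.get("data_contract_status") == "deprecated":
        removal = meta.get("removal_date")  # hypothetical property, ISO date string
        suffix = f" and will be removed after {removal}" if removal else ""
        lines.append(f"Warning: this table is deprecated{suffix}.")
    return lines
```

Because the function reads only stable properties, it never needs access to the DQ platform or to run-level results.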

The Architecture I Would Recommend

The clean architecture looks like this:

Data pipeline
   ↓
Pipeline health checks
   ↓
DQ checks
   ↓
Freshness checks
   ↓
DQ platform / sidecar DQ results table
   ↓
Stable asset metadata update
   ↓
Catalog / governance layer
   ↓
Consumers, policies, BI warnings, certification workflows

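This works because each layer keeps only what it is good at: the DQ platform holds the detailed, volatile results, and the catalog holds the stable trust state that consumers and policies act on.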

Final Take

Data quality indicators should be part of the asset experience.

When someone opens a table, they should immediately understand whether it is certified, monitored, owned, trusted, stale, deprecated, blocked, or impacted by an upstream incident.

That information belongs in the catalog and can be mirrored into table metadata as stable properties.

At the same time, data quality operational results should not be embedded directly into Delta or Iceberg table metadata. They are too volatile, too detailed, and too operational. The tables themselves can, and should, carry enough stable trust metadata to make data assets more discoverable, governable, and usable.

Data, Engineering, and Beyond
