Data Quality in Delta Lake and Iceberg
Part 3: Getting practical with data quality
This is the third part of a series.
Imagine a daily finance.orders_mart table used by executives.
The data contract says:
asset: finance.orders_mart
expected_schedule: daily
freshness_sla: available by 08:30 Europe/Warsaw
required_checks:
- order_id_not_null
- unique_order_id
- revenue_non_negative
- valid_currency
- row_count_within_expected_range
owner: finance-data-platform
alert_route: "#data-finance-alerts"
A pipeline run starts at 08:00.
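Before walking through the steps, it helps to see what the freshness_sla line means as a concrete instant. This is a minimal sketch, assuming the 08:00 start is Warsaw local time; the helper name sla_deadline and the example date are my own illustration, and the only dependency is Python's standard zoneinfo module.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def sla_deadline(run_date: date, local_time: time, tz_name: str) -> datetime:
    """Return the SLA deadline for run_date as a timezone-aware UTC datetime."""
    local_deadline = datetime.combine(run_date, local_time, tzinfo=ZoneInfo(tz_name))
    return local_deadline.astimezone(timezone.utc)


# Hypothetical run date, chosen only for illustration.
run_date = date(2026, 5, 27)
warsaw = ZoneInfo("Europe/Warsaw")

deadline = sla_deadline(run_date, time(8, 30), "Europe/Warsaw")
# Assumption: the run starts at 08:00 Warsaw local time.
run_start = datetime.combine(run_date, time(8, 0), tzinfo=warsaw)

# Time the pipeline has to transform, validate, and publish.
budget = deadline - run_start
print(deadline.isoformat(), budget)
```

Storing the deadline in UTC while declaring it in local time keeps the contract readable for the business and unambiguous for the scheduler, including across daylight saving changes.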
Step 1: Check Upstream Freshness
raw_orders_freshness = dq.get_freshness_status("raw.orders")
if raw_orders_freshness.status == "stale":
    stop_pipeline("raw.orders is stale")
Step 2: Run Transformation
orders = spark.table("raw.orders")
orders_mart = build_orders_mart(orders)
write_table(orders_mart, "finance.orders_mart")
Step 3: Check Pipeline Health
pipeline_status = orchestrator.get_current_run_status()
if pipeline_status == "success":
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        pipeline_health="healthy",
        last_successful_pipeline_run_at=now()
    )
Step 4: Run DQ Checks
dq_result = dq.run_checks(
    asset="finance.orders_mart",
    checks=[
        "order_id_not_null",
        "unique_order_id",
        "revenue_non_negative",
        "valid_currency",
        "row_count_within_expected_range"
    ]
)
dq_results_table.append(dq_result)
Step 5: Check Output Freshness
freshness = dq.check_freshness(
    asset="finance.orders_mart",
    timestamp_column="order_created_at",
    expected_by="08:30",
    timezone="Europe/Warsaw"
)
Step 6: Publish Stable Asset State
if pipeline_status == "success" and dq_result.passed and freshness.passed:
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        data_contract_status="certified",
        quality_certification="certified",
        pipeline_health="healthy",
        freshness_state="fresh"
    )
else:
    catalog.update_asset_metadata(
        asset="finance.orders_mart",
        data_contract_status="warning"
    )
The detailed operational results should stay in the right systems:
Pipeline logs → orchestrator
DQ check results → DQ platform / sidecar results table
Freshness history → DQ platform / sidecar results table
Stable trust state → catalog / asset metadata
The rule is simple:
Operational evidence belongs in operational systems.
Stable trust state belongs in asset metadata.
The Combined Metadata Model
For asset metadata, I suggest exposing a compact view:
asset: finance.orders_mart
pipeline:
  health: healthy
  last_successful_run_at: 2026-05-27T08:12:00Z
  owner: finance-data-platform
  run_url: https://orchestrator/runs/123
quality:
  certification: certified
  contract_status: certified
  monitoring_required: true
  owner: finance-data-platform
  latest_results_uri: https://dq-platform/runs/456
freshness:
  state: fresh
  sla: available_by_08_30
  last_checked_at: 2026-05-27T08:20:00Z
  latest_results_uri: https://dq-platform/freshness/789
This is enough for catalogs, BI tools, policies, and pipelines. It is not trying to store every check result.
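To make the "compact, not exhaustive" idea concrete, here is a rough sketch of how run-level evidence could be collapsed into that view. The RunEvidence shape, the compact_view function, and the "unhealthy" value are assumptions of mine for illustration, not part of any catalog API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    """Outcome of one pipeline run plus a pointer to the detailed results (hypothetical shape)."""
    pipeline_succeeded: bool
    checks_passed: bool
    freshness_passed: bool
    dq_results_uri: str


def compact_view(asset: str, evidence: RunEvidence) -> dict:
    """Reduce run-level evidence to the few stable fields the catalog exposes."""
    all_passed = (
        evidence.pipeline_succeeded
        and evidence.checks_passed
        and evidence.freshness_passed
    )
    return {
        "asset": asset,
        # "unhealthy" is an assumed value; the article only names "healthy".
        "pipeline": {"health": "healthy" if evidence.pipeline_succeeded else "unhealthy"},
        "quality": {
            "contract_status": "certified" if all_passed else "warning",
            # Link to the details instead of copying individual check results.
            "latest_results_uri": evidence.dq_results_uri,
        },
        "freshness": {"state": "fresh" if evidence.freshness_passed else "stale"},
    }


evidence = RunEvidence(
    pipeline_succeeded=True,
    checks_passed=True,
    freshness_passed=False,
    dq_results_uri="https://dq-platform/runs/456",
)
view = compact_view("finance.orders_mart", evidence)
print(view)
```

The view carries a URI to the detailed results rather than the results themselves, which is the whole point of the split.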
How Data Engineers Should Use Quality Indicators in Pipelines
The most useful quality metadata is not decorative.
It should change how pipelines behave.
A tag like quality_certification=certified is only valuable if systems and people use it to make decisions. Otherwise, it is just another label in a catalog.
For data engineers, quality indicators can become pipeline control signals.
1. Input Gating
Before a pipeline reads from an upstream table, it can check the upstream asset state.
For example:
source_quality = catalog.get_asset_quality("raw.orders")
if source_quality.data_contract_status == "blocked":
    raise Exception("raw.orders is blocked by its data contract status")
if source_quality.quality_certification == "deprecated":
    warn("raw.orders is deprecated and should not be used for new pipelines")
This does not require parsing every DQ result. The pipeline only needs a stable signal:
data_contract_status: certified | warning | blocked | deprecated
This allows teams to prevent bad data from flowing silently into downstream tables.
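One way to keep that signal stable in code is to parse it into a closed set of values, so an unexpected status fails loudly instead of slipping through. This is only a sketch: ContractStatus, BlockedInputError, and gate_input are hypothetical names, and blocking new pipelines on deprecated inputs is my assumption about the desired policy.

```python
from enum import Enum


class ContractStatus(str, Enum):
    CERTIFIED = "certified"
    WARNING = "warning"
    BLOCKED = "blocked"
    DEPRECATED = "deprecated"


class BlockedInputError(RuntimeError):
    """Raised when an upstream asset must not be read."""


def gate_input(asset: str, status: str, is_new_pipeline: bool = False) -> list[str]:
    """Return warnings for a readable input, or raise if the input must not be used."""
    parsed = ContractStatus(status)  # raises ValueError for an unknown status
    if parsed is ContractStatus.BLOCKED:
        raise BlockedInputError(f"{asset} is blocked by its data contract status")
    warnings: list[str] = []
    if parsed is ContractStatus.DEPRECATED:
        # Assumed policy: existing jobs keep running, new dependencies are refused.
        if is_new_pipeline:
            raise BlockedInputError(f"{asset} is deprecated; new pipelines must not depend on it")
        warnings.append(f"{asset} is deprecated")
    elif parsed is ContractStatus.WARNING:
        warnings.append(f"{asset} has contract status 'warning'")
    return warnings


print(gate_input("raw.orders", "warning"))
```

In a real pipeline the status string would come from the catalog lookup shown above; here it is passed in directly so the gate can be tested in isolation.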
2. Freshness-Aware Execution
Some pipelines should only run if the upstream data is fresh enough.
For example, a daily revenue table should not be rebuilt if the upstream orders table has not received today’s data.
The pipeline can check a freshness signal:
freshness = dq_service.get_latest_check(
    asset="raw.orders",
    check="freshness"
)
if freshness.status == "failed":
    raise Exception("Upstream orders data is stale")
There are two possible patterns here.
For critical operational freshness, read from the DQ platform directly because it has the latest check state.
For stable lifecycle decisions, read from the catalog or asset metadata.
That gives us a useful split:
Need latest operational status? → DQ platform
Need stable trust state? → Catalog / asset metadata
3. Promotion Gates
Quality indicators can, and should, control movement between data layers.
For example:
bronze → silver
Requires:
- schema valid
- required fields present
- basic freshness check passing
- no severe ingestion errors
silver → gold
Requires:
- business rules passing
- accepted volume ranges
- key dimensions populated
- owner assigned
gold → certified
Requires:
- data contract approved
- monitoring enabled
- SLA defined
- alert routing configured
- successful check history
This makes certification a real engineering workflow instead of a manual catalog label.
A pipeline could implement this as:
dq_result = dq.run_checks("curated.orders")
if dq_result.passed_required_checks:
    catalog.update_asset_metadata(
        asset="curated.orders",
        data_contract_status="managed"
    )
if dq_result.passed_certification_checks and owner_approved:
    catalog.update_asset_metadata(
        asset="curated.orders",
        quality_certification="certified"
    )
The important part is that the pipeline does not write every check result into the table metadata.
It writes detailed results to the DQ system, then updates only the stable asset state when the lifecycle state changes.
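The layer requirements listed earlier in this section can also be expressed as data, which makes the "update only on a state change" rule easy to enforce. The following is a sketch under my own naming: the requirement keys are transcriptions of the bullet points, and the states dict stands in for the catalog.

```python
# Requirements per promotion step, transcribed from the bronze/silver/gold lists.
PROMOTION_REQUIREMENTS: dict[tuple[str, str], frozenset[str]] = {
    ("bronze", "silver"): frozenset({
        "schema_valid", "required_fields_present",
        "basic_freshness_passing", "no_severe_ingestion_errors",
    }),
    ("silver", "gold"): frozenset({
        "business_rules_passing", "accepted_volume_ranges",
        "key_dimensions_populated", "owner_assigned",
    }),
    ("gold", "certified"): frozenset({
        "data_contract_approved", "monitoring_enabled", "sla_defined",
        "alert_routing_configured", "successful_check_history",
    }),
}


def missing_requirements(current: str, target: str, satisfied: set[str]) -> set[str]:
    """Return the requirements still unmet for the current -> target step."""
    return set(PROMOTION_REQUIREMENTS[(current, target)] - satisfied)


def apply_promotion(states: dict[str, str], asset: str, target: str, satisfied: set[str]) -> bool:
    """Promote the asset if every requirement is met; return True only when the state changed."""
    current = states[asset]
    if missing_requirements(current, target, satisfied):
        return False  # nothing is written, so the stable state stays untouched
    states[asset] = target
    return True


# The dict stands in for the catalog; a real pipeline would call the catalog API.
states = {"curated.orders": "silver"}
satisfied = {"business_rules_passing", "accepted_volume_ranges", "key_dimensions_populated"}
blocked_by = missing_requirements("silver", "gold", satisfied)
changed = apply_promotion(states, "curated.orders", "gold", satisfied)
print(blocked_by, changed, states)
```

Returning the set of missing requirements, rather than a bare boolean, gives the owner something actionable when a promotion is refused.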
4. Output Validation
Every important pipeline should validate what it produces.
This is where DQ tools are most useful.
After writing the output table, the pipeline runs checks such as:
order_id is not null
revenue is non-negative
event_time is within expected freshness window
row count is within expected range
country_code matches known reference values
no duplicate primary business keys
Then it publishes the result:
dq_result = dq.run_checks(
    asset="finance.orders_mart",
    checks=[
        "order_id_not_null",
        "revenue_non_negative",
        "freshness_within_sla",
        "row_count_within_expected_range",
        "valid_country_code",
        "unique_order_id"
    ]
)
dq_results_table.append(dq_result)
dq_platform.publish(dq_result)
The asset metadata may then be updated with a stable summary:
monitoring_required: true
quality_certification: certified
data_contract_status: certified
quality_owner: finance-platform
But the run-level details stay outside the asset metadata.
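To show the difference between run-level details and the stable summary, here is a toy version of a few of these checks over in-memory rows. In practice the checks would run in Spark or a DQ tool; the function names and the sample rows are mine, and the "warning" fallback simply mirrors the else branch from the earlier walkthrough.

```python
from collections import Counter


def run_checks(rows: list[dict], min_rows: int, max_rows: int) -> dict[str, bool]:
    """Evaluate a handful of output checks and return one pass/fail flag per check."""
    order_ids = [row.get("order_id") for row in rows]
    id_counts = Counter(oid for oid in order_ids if oid is not None)
    return {
        "order_id_not_null": all(oid is not None for oid in order_ids),
        "unique_order_id": all(count == 1 for count in id_counts.values()),
        # A missing revenue value counts as a failure here.
        "revenue_non_negative": all(
            row.get("revenue") is not None and row["revenue"] >= 0 for row in rows
        ),
        "row_count_within_expected_range": min_rows <= len(rows) <= max_rows,
    }


def stable_summary(results: dict[str, bool]) -> dict[str, str]:
    """Collapse per-check results into the single state that goes to asset metadata."""
    return {"data_contract_status": "certified" if all(results.values()) else "warning"}


# Made-up sample rows: one duplicate order_id and one negative revenue.
rows = [
    {"order_id": 1, "revenue": 10.0},
    {"order_id": 2, "revenue": -5.0},
    {"order_id": 2, "revenue": 3.0},
]
results = run_checks(rows, min_rows=1, max_rows=10)
summary = stable_summary(results)
print(results, summary)
```

The results dict is what belongs in the DQ results table; only the summary is a candidate for asset metadata.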
5. Incident-Aware Pipelines
If a critical upstream asset has an open incident, downstream jobs can behave differently.
For example:
incident = dq_service.get_open_incident("raw.payments")
# Assumes get_open_incident returns None when no incident is open.
if incident is not None and incident.severity == "critical":
    stop_pipeline()
elif incident is not None and incident.severity == "warning":
    run_pipeline_but_mark_output_as_impacted()
This enables a more nuanced model than simply “run or fail.”
Possible actions:
Critical failure → stop pipeline
Warning → continue but mark output as impacted
Deprecated input → continue for existing jobs, block new dependencies
Freshness delay → wait, retry, or skip publish
Schema break → fail immediately
This is how quality metadata becomes operationally useful.
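The list of possible actions can be read as a small decision table. The sketch below encodes it as a function; the signal keys, the Action names, and above all the precedence order (most restrictive first) are my assumptions, since the list itself does not say which rule wins when several apply.

```python
from enum import Enum


class Action(Enum):
    RUN = "run"
    RUN_MARK_IMPACTED = "run_but_mark_output_as_impacted"
    WAIT_OR_RETRY = "wait_or_retry"
    STOP = "stop"


def decide_action(signals: dict) -> Action:
    """Pick one action from upstream signals, checking the most restrictive rules first."""
    # Schema breaks and critical incidents stop the run outright.
    if signals.get("schema_break") or signals.get("incident_severity") == "critical":
        return Action.STOP
    # Deprecated inputs only block pipelines that are adding a new dependency.
    if signals.get("deprecated_input") and signals.get("new_dependency"):
        return Action.STOP
    # Late data is often transient, so waiting is preferred over failing.
    if signals.get("freshness_delayed"):
        return Action.WAIT_OR_RETRY
    if signals.get("incident_severity") == "warning":
        return Action.RUN_MARK_IMPACTED
    return Action.RUN


print(decide_action({"incident_severity": "warning"}))
```

Keeping the decision in one pure function means the orchestrator only has to map four outcomes to behavior, and the precedence rules can be unit tested without running a pipeline.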
6. Lineage-Aware Impact Propagation
The most powerful pattern is lineage-aware propagation.
If raw.orders fails, the platform should know that curated.orders, finance.revenue_mart, and executive.arr_dashboard may be impacted.
This does not mean all downstream tables immediately become “failed.” It means their trust state should reflect dependency risk.
For example:
quality_state: impacted
impacted_by: raw.orders
impact_reason: upstream_freshness_failed
This is extremely useful for consumers.
Instead of discovering a broken dashboard manually, users can see that the underlying data is impacted by an upstream incident.
For data engineers, this also helps prioritize response. If a failed raw table impacts a board-level dashboard, it should be treated differently from a failure in an unused sandbox table.
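Propagation itself is a graph traversal. Here is a minimal sketch using a breadth-first walk over a downstream adjacency map; the exact edges between the four assets are my assumption (the example above only says all three may be impacted), and propagate_impact is a hypothetical helper.

```python
from collections import deque

# Assumed lineage: each asset maps to its direct downstream consumers.
LINEAGE = {
    "raw.orders": ["curated.orders"],
    "curated.orders": ["finance.revenue_mart"],
    "finance.revenue_mart": ["executive.arr_dashboard"],
}


def propagate_impact(lineage: dict[str, list[str]], failed_asset: str, reason: str) -> dict[str, dict]:
    """Mark every asset downstream of failed_asset as impacted (not failed)."""
    impacted: dict[str, dict] = {}
    queue = deque(lineage.get(failed_asset, []))
    while queue:
        asset = queue.popleft()
        # Skip assets already visited, and the origin itself if the graph has a cycle.
        if asset in impacted or asset == failed_asset:
            continue
        impacted[asset] = {
            "quality_state": "impacted",
            "impacted_by": failed_asset,
            "impact_reason": reason,
        }
        queue.extend(lineage.get(asset, []))
    return impacted


impacted = propagate_impact(LINEAGE, "raw.orders", "upstream_freshness_failed")
for name, state in impacted.items():
    print(name, state)
```

Because the walk is breadth-first, assets come back ordered by distance from the failure, which is a reasonable first cut for prioritizing whom to notify.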
7. Environment and Release Gates
Quality metadata can also be used in CI/CD workflows.
For example, a data contract change should not be promoted to production unless required checks exist.
A dbt model should not be marked as certified unless it has an owner, tests, freshness checks, and alert routing.
A table should not move from experimental to managed unless it has basic quality coverage.
This turns quality into a release policy:
No owner → cannot certify
No freshness check → cannot certify
No null checks → cannot certify
No alert route → cannot certify
Open critical incident → cannot promote
This is where catalog metadata, DQ results, and pipeline orchestration come together.
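A release policy like this is straightforward to evaluate in a CI job. The sketch below treats each rule as a message plus a predicate; ReleaseCandidate, its fields, and the check-type names "freshness" and "not_null" are hypothetical, chosen only to mirror the five rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ReleaseCandidate:
    """What the CI job knows about an asset at release time (hypothetical shape)."""
    owner: Optional[str] = None
    checks: set = field(default_factory=set)  # configured check types
    alert_route: Optional[str] = None
    open_critical_incident: bool = False


# Each rule pairs a message with a predicate that is True when the rule is violated.
Rule = tuple[str, Callable[[ReleaseCandidate], bool]]

CERTIFICATION_RULES: list[Rule] = [
    ("No owner", lambda c: not c.owner),
    ("No freshness check", lambda c: "freshness" not in c.checks),
    ("No null checks", lambda c: "not_null" not in c.checks),
    ("No alert route", lambda c: not c.alert_route),
]
PROMOTION_RULES: list[Rule] = [
    ("Open critical incident", lambda c: c.open_critical_incident),
]


def violations(candidate: ReleaseCandidate, rules: list[Rule]) -> list[str]:
    """Return the messages of every violated rule, in rule order."""
    return [message for message, violated in rules if violated(candidate)]


candidate = ReleaseCandidate(
    owner="finance-data-platform",
    checks={"not_null"},
    alert_route="#data-finance-alerts",
)
cert_blockers = violations(candidate, CERTIFICATION_RULES)
promo_blockers = violations(candidate, PROMOTION_RULES)
print(cert_blockers, promo_blockers)
```

A CI step could exit non-zero whenever either list is non-empty and print the messages, so the author of the change sees exactly which rule blocked the release.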
8. BI and Consumer Warnings
Data engineers also need to think about downstream consumption.
A BI tool, notebook environment, or query interface can read asset trust metadata and display warnings.
For example:
Warning: this table is not certified.
Warning: this table is impacted by an upstream freshness incident.
Warning: this table is deprecated and will be removed after 2026-09-01.
This is not a pipeline pattern, but data engineers need to publish the metadata that makes it possible.
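On the consuming side, turning that metadata into messages is a few lines of code. This sketch assumes the metadata arrives as a flat dict; the key removal_date is hypothetical, and the other keys reuse field names from earlier in this article.

```python
def consumer_warnings(meta: dict) -> list[str]:
    """Translate stable asset metadata into user-facing warning strings."""
    warnings: list[str] = []
    if meta.get("quality_certification") != "certified":
        warnings.append("Warning: this table is not certified.")
    if meta.get("quality_state") == "impacted":
        source = meta.get("impacted_by", "an upstream asset")
        warnings.append(f"Warning: this table is impacted by an incident on {source}.")
    if meta.get("data_contract_status") == "deprecated":
        removal = meta.get("removal_date")  # hypothetical key
        suffix = f" and will be removed after {removal}" if removal else ""
        warnings.append(f"Warning: this table is deprecated{suffix}.")
    return warnings


meta = {
    "quality_certification": "certified",
    "quality_state": "impacted",
    "impacted_by": "raw.orders",
}
messages = consumer_warnings(meta)
print(messages)
```

A BI plugin or notebook extension would call something like this when a table is opened; the pipeline's only job is to keep the metadata current.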
The Architecture I Would Recommend
The clean architecture looks like this:
Data pipeline
↓
Pipeline health checks
↓
DQ checks
↓
Freshness checks
↓
DQ platform / sidecar DQ results table
↓
Stable asset metadata update
↓
Catalog / governance layer
↓
Consumers, policies, BI warnings, certification workflows
That is why it works.
Final Take
Data quality indicators should be part of the asset experience.
When someone opens a table, they should immediately understand whether it is certified, monitored, owned, trusted, stale, deprecated, blocked, or impacted by an upstream incident.
That information belongs in the catalog and can be mirrored into table metadata as stable properties.
At the same time, operational data quality results should not be embedded directly into Delta or Iceberg table metadata. They are too volatile, too detailed, and too operational. The tables themselves can, and should, carry enough stable trust metadata to make data assets more discoverable, governable, and usable.

