Apache Iceberg
Why Data Engineers Are Quietly Standardising on It
Over the last 18–24 months, I’ve seen the same decision show up in very different companies. Startups building their first serious data platform. Scaleups trying to escape warehouse lock-in. Large enterprises re-platforming analytics.
Different contexts. Same outcome.
“Let’s standardise on Iceberg.”
Not because it's trendy. Not because a vendor pushed it. But because Iceberg fixes a set of structural problems that data teams have been carrying since the Hive era and that warehouses never really solved.
This article explains what problem Iceberg actually solves, why it’s gaining traction now, and how teams are using it in practice — with Snowflake, Databricks, and plain object storage like AWS S3.
The Core Problem Iceberg Solves (That Warehouses Don’t)
Most traditional data stacks tightly couple storage, metadata, and compute. Data lives where the engine lives. Moving compute usually means copying data. Governance, retention, and lifecycle rules get re-implemented per platform. Costs become opaque very quickly.
Warehouses optimise for convenience and performance, not architectural separation.
Iceberg flips this model.
Iceberg is not a database. It's an open table format for data lakes.
Data is stored once — typically as Parquet on object storage — and any engine that understands Iceberg can read or write it safely. That “safely” part matters more than people realise, because it’s where previous lake designs failed.
Why Iceberg Succeeded Where Hive Tables Failed
Hive tables and raw Parquet worked until scale and concurrency arrived. Then the cracks appeared.
They lacked atomic writes. There was no snapshot isolation. Concurrent writers were unsafe. Schema evolution was fragile and often required rewrites or downtime. These weren’t edge cases — they were structural limitations.
Iceberg fixes this by making metadata first-class.
Every Iceberg table is defined by immutable data files, immutable metadata files, and snapshots that represent a consistent table state. Writers commit atomically. Readers always see a coherent view. Time travel, rollback, concurrent writes, and controlled schema evolution become normal operations rather than special cases.
In practice, Iceberg behaves like a real database table — but lives entirely on object storage.
That’s the real breakthrough.
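To make that concrete, here is a minimal PySpark sketch of snapshot-based reads. The catalog name ("lake") and the table (lake.analytics.events) are illustrative, and the sketch assumes a Spark session that is already configured with an Iceberg catalog of that name.

```python
# A minimal PySpark sketch of snapshot semantics. The catalog ("lake") and the
# table (lake.analytics.events) are illustrative; they assume a Spark session
# already configured with an Iceberg catalog of that name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Every atomic commit produces a new snapshot; the snapshots metadata table lists them.
spark.sql("SELECT snapshot_id, committed_at, operation FROM lake.analytics.events.snapshots").show()

# Time travel: read the table exactly as it looked at an earlier point in time.
spark.sql("SELECT count(*) FROM lake.analytics.events TIMESTAMP AS OF '2024-06-01 00:00:00'").show()

# Or pin a read to a specific snapshot id taken from the listing above.
spark.sql("SELECT count(*) FROM lake.analytics.events VERSION AS OF 1234567890123456789").show()
```

Rollback works the same way in reverse: instead of restoring a backup, you point the table's current state back at an earlier snapshot.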
Why Iceberg Is Becoming the Standard
Iceberg isn’t winning because it’s novel. It’s winning because several things finally aligned.
First, it is genuinely vendor-neutral. Snowflake, Databricks, AWS, Trino, Flink, and DuckDB all support it, and no single vendor controls the specification. That matters more than individual features.
Second, Iceberg enforces a clean separation of concerns. Storage lives in S3, GCS, or ADLS. Metadata lives in a catalog such as Glue, Nessie, Unity, Polaris, or Lakekeeper. Compute is whatever engine you choose today — and can change tomorrow. Each layer evolves independently.
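As a rough sketch of how those layers wire together, here is a PySpark session pointed at a Glue-backed Iceberg catalog with data on S3. The bucket, catalog, and table names are invented, and the Iceberg Spark runtime and AWS bundle jars are assumed to be on the classpath already.

```python
# A sketch of the three layers wired together in PySpark: S3 for storage, AWS
# Glue for the catalog, Spark for compute. All names here are made up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Metadata layer: an Iceberg catalog backed by AWS Glue.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    # Storage layer: plain object storage that you own, not the engine.
    .config("spark.sql.catalog.lake.warehouse", "s3://my-data-lake/warehouse")
    .getOrCreate()
)

# Compute layer: today this is Spark; tomorrow the same table can be read by
# Trino, Flink, DuckDB, or a warehouse, because none of them own the data.
spark.sql("SELECT * FROM lake.analytics.orders LIMIT 10").show()
```

Swapping the compute engine means changing this configuration, not copying the data.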
Third, Iceberg is now operationally mature. Compaction, snapshot expiration, partition evolution, and schema evolution are no longer afterthoughts. Five years ago these were DIY problems. Today they’re table stakes.
Finally, Iceberg is enterprise-ready. It works at petabyte scale, supports multi-writer workloads, and integrates with existing IAM and catalog systems. That full combination simply didn’t exist before.
How Iceberg Is Used in Practice
In Snowflake environments, Iceberg is increasingly used to decouple storage from the warehouse. Data lives in S3 or GCS. Snowflake manages Iceberg metadata and queries the data in place. Teams keep Snowflake for BI and analytics, avoid duplicating raw and curated data, and retain an exit option if pricing or strategy changes.
The trade-off is that Snowflake controls the metadata layer and optimisation is less transparent than with native tables. Performance, however, remains very strong for analytics workloads. This pattern is especially common in finance and large enterprise analytics teams.
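As a hedged illustration of the pattern (not a recipe), creating a Snowflake-managed Iceberg table over an external volume looks roughly like this via the Snowflake Python connector. Every name and credential below is a placeholder, and the exact DDL options should be verified against Snowflake's Iceberg documentation.

```python
# A hedged sketch of the Snowflake pattern: Snowflake manages the Iceberg
# metadata, while the Parquet files land on an external volume (S3/GCS) that
# you own. All names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="ANALYTICS_WH", database="ANALYTICS", schema="PUBLIC",
)

conn.cursor().execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS events (
        event_id STRING,
        event_ts TIMESTAMP_NTZ
    )
    CATALOG = 'SNOWFLAKE'            -- Snowflake owns the Iceberg metadata
    EXTERNAL_VOLUME = 'lake_s3_vol'  -- data files live in your own bucket
    BASE_LOCATION = 'events/'
""")
```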
In Databricks environments, Iceberg often plays a different role. Tables live on S3, metadata sits in Unity Catalog or an external catalog, and Spark is used for heavy transformations. Querying may happen in Databricks, Snowflake, or Trino.
Teams choose Iceberg here to avoid Delta lock-in, share tables across engines, and align with multi-cloud strategies. Databricks increasingly acts as a powerful compute engine rather than the system of record.
Then there’s the “bare metal” S3 model — where Iceberg really shines.
Here, S3 is the system of record. Iceberg provides table semantics. Glue, Nessie, or Lakekeeper manage metadata. Spark, Trino, Flink, or DuckDB handle compute. This model is popular with infra-heavy startups, platform teams, and cost-optimising organisations.
It works because storage is cheap, compute is elastic, governance logic is centralised, and there’s no per-terabyte warehouse tax. But it only scales well if teams invest in compaction, file size control, and snapshot management.
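To give a feel for what that investment means day to day, here is a sketch of routine maintenance using Iceberg's built-in Spark procedures. The catalog and table names are the same illustrative ones as above, and parameter details can differ between Iceberg versions.

```python
# A sketch of routine Iceberg table maintenance via Spark procedures.
# Catalog ("lake") and table names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compaction: rewrite many small files into fewer, well-sized ones (~512 MB here).
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Snapshot expiration: drop old snapshots (and the files only they reference)
# so storage and metadata don't grow forever.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'analytics.orders',
        older_than => TIMESTAMP '2024-06-01 00:00:00',
        retain_last => 50
    )
""")

# Remove files that no snapshot references at all.
spark.sql("CALL lake.system.remove_orphan_files(table => 'analytics.orders')")
```

None of this is hard, but it has to be scheduled and owned by someone.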
Performance: How Iceberg Actually Compares
Let’s be honest: Iceberg itself doesn’t make queries fast.
Performance depends on file sizes, partitioning, metadata pruning, and the compute engine. Get those wrong and Iceberg will feel slow. Get them right and it performs extremely well.
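For example, a layout that holds up over time usually starts with hidden partitioning and evolves later without rewrites. The sketch below uses Spark SQL with the same illustrative catalog and table names as the earlier examples.

```python
# A sketch of the layout decisions that actually drive performance: hidden
# partitioning on a timestamp, plus partition evolution later. Names assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hidden partitioning: queries filter on event_ts and Iceberg prunes partitions
# and files from metadata, without users referencing a partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.events (
        event_id STRING,
        user_id  BIGINT,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Partition evolution: change the layout for future writes without rewriting
# existing data or breaking existing queries.
spark.sql("ALTER TABLE lake.analytics.events ADD PARTITION FIELD bucket(16, user_id)")

# This predicate can be answered from metadata pruning plus a small file subset.
spark.sql("""
    SELECT count(*) FROM lake.analytics.events
    WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00'
""").show()
```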
Compared to raw Parquet on S3, Iceberg is dramatically faster for analytical queries because engines can prune files using metadata and read consistent snapshots. Compared to Snowflake native tables, Snowflake still wins for small BI queries, but Iceberg becomes competitive at scale and wins on cost predictability and flexibility.
Compared to Delta Lake, performance is broadly similar. Iceberg wins on openness and portability. Delta wins on Databricks-specific optimisations.
In real systems, the bottleneck is almost always small files and poor lifecycle management, not Iceberg itself.
The Hidden Cost Nobody Talks About
Iceberg introduces responsibility.
You now own compaction, snapshot expiration, retention, file size health, and cost observability across engines. Warehouses hide these concerns from you. Iceberg exposes them.
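In practice that means watching table health yourself. A small, illustrative check against the files metadata table (names assumed, as before) is often enough to catch the small-file problem early.

```python
# A sketch of a table-health check: inspect the files metadata table to spot
# the small-file problem before it hurts queries. Names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT
        count(*)                                       AS data_files,
        round(avg(file_size_in_bytes) / 1048576)       AS avg_file_mb,
        round(sum(file_size_in_bytes) / 1073741824, 1) AS total_gb
    FROM lake.analytics.orders.files
""").show()
# If avg_file_mb sits in the single digits while data_files keeps climbing,
# compaction and snapshot expiration are overdue.
```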
This is exactly why governance and FinOps around Iceberg are becoming critical. Once data is shared across multiple engines, someone needs to own its lifecycle and economics.
I ran into these trade-offs directly while building Dativo Ingest, an open-source ingestion project built around Iceberg. Once Iceberg is your contract, you stop optimising for a single engine and start thinking in terms of table health, commits, and long-term operability.
When Iceberg Is the Wrong Choice
Iceberg is not a silver bullet.
If you only need BI on a small amount of data, don’t want to operate data infrastructure, or have no Spark or Trino experience, a warehouse alone is still a perfectly reasonable choice.
Iceberg pays off when flexibility, scale, and long-term economics matter.
Final Thought
Iceberg isn’t popular because it’s fashionable.
It’s popular because data teams are tired of rebuilding the same foundations in every warehouse.
Iceberg gives you a durable data layer, engine independence, and predictable long-term economics. And once teams adopt it, they rarely go back.
If you’re designing a data platform today, Iceberg should at least be in the conversation — even if Snowflake or Databricks still sit on top.
That’s the real shift we’re seeing.

