Building an Open Data Platform with Apache Iceberg
Ryan Blue
Alluxio Day 8, December 2021
What is Iceberg?
Where are we going?
Current data architecture
● Multi-engine
  ○ Spark for ETL, ML
  ○ Trino for ad-hoc, ETL
  ○ Flink for streaming
  ○ Druid for aggregates
● In the cloud (or moving)
● Hive Metastore
  ○ No metastore?
● Investing in data
  ○ In people
  ○ In tools
  ○ In infrastructure
But the pieces don’t fit together quite right
What is Iceberg?
What is Iceberg?
● A table format
  ○ Akin to columnar file formats
  ○ Transactional guarantees
  ○ Performance enhancements
● A standard for analytic tables
  ○ Open source spec and library
  ○ Integrated into query engines
And how does that help?
The gap
[Diagram labels: Object storage, Data & metadata, Compute, Apache Spark, Catalog, ???]
Shared storage requirements
Technical:
● Must handle concurrent writes
● Must be scalable, performant
● Must be cloud native
Practical:
● Must be open source
● Must be neutral
● Must address productivity
Iceberg’s goals
● Add reliable transactions
● Unlock performance
● Fix usability
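The "reliable transactions" goal, together with the "must handle concurrent writes" requirement, is commonly met with optimistic concurrency: each writer prepares a new immutable snapshot and then asks the catalog to atomically swap the table's current pointer. The sketch below illustrates that idea only. The `Catalog` class, `compare_and_swap`, and `commit_append` are hypothetical names for this example, not Iceberg's API, and a snapshot is reduced to a tuple of file names.

```python
import threading


class Catalog:
    """Hypothetical in-memory catalog: one current-snapshot pointer per table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Replace the pointer only if no other writer committed in between."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit_append(catalog, table, new_files, max_retries=5):
    """Append files by building a new snapshot and swapping the pointer."""
    for _ in range(max_retries):
        base = catalog.current(table)                 # snapshot this attempt builds on
        candidate = (base or ()) + tuple(new_files)   # new immutable snapshot
        if catalog.compare_and_swap(table, base, candidate):
            return candidate
        # Another writer won the race: re-read the pointer and try again.
    raise RuntimeError("commit failed after retries")


catalog = Catalog()
commit_append(catalog, "db.tab", ["a.parquet"])
commit_append(catalog, "db.tab", ["b.parquet"])
```

Readers always see either the old snapshot or the new one, never a half-written state, because the only mutation is the single pointer swap.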
Open data platform
[Diagram labels: Object storage, Data & metadata, Compute, Apache Spark, Catalog, Vertical solutions, Open data stack, Data Services]
Thank you!
Iceberg exists to fix productivity
Lessons learned
● Avoid unpleasant surprises
  ○ Principle of least surprise
● Don’t steal attention
  ○ Reduce context switching
We try to make Iceberg invisible
Usability improvements
● Schema evolution
  ○ Instantaneous – no rewrites
  ○ Safe – no undead columns 🧟
  ○ Saves days of headache
  ALTER TABLE db.tab RENAME COLUMN id TO customer_id
● Layout evolution
  ○ Lazy – only rewrite if needed
  ○ Partitioning mistakes are okay
  ○ Changes with your data
  ○ Saves a month of headache
  ALTER TABLE db.tab ADD PARTITION FIELD bucket(256, id)
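One way to see why a rename can be instantaneous and why dropped columns do not come back as "undead" data: if stored rows refer to columns by a stable field ID and the schema only maps IDs to names, a rename touches metadata alone, and a re-added column gets a fresh ID. The sketch below assumes that model. The `Schema` class and `read_row` helper are hypothetical, written for this illustration.

```python
class Schema:
    """Hypothetical table schema: names are labels on stable field IDs."""

    def __init__(self):
        self._next_id = 1
        self.fields = {}  # field ID -> current column name

    def add_column(self, name):
        field_id = self._next_id
        self._next_id += 1            # IDs are never reused
        self.fields[field_id] = name
        return field_id

    def rename_column(self, old, new):
        for field_id, name in self.fields.items():
            if name == old:
                self.fields[field_id] = new   # metadata-only change
                return
        raise KeyError(old)

    def drop_column(self, name):
        for field_id, col in list(self.fields.items()):
            if col == name:
                del self.fields[field_id]
                return
        raise KeyError(name)


def read_row(schema, stored_row):
    """Project a stored row (keyed by field ID) through the current schema."""
    return {name: stored_row.get(fid) for fid, name in schema.fields.items()}


schema = Schema()
id_field = schema.add_column("id")
note_field = schema.add_column("note")
old_file_row = {id_field: 42, note_field: "legacy"}   # written before any change

schema.rename_column("id", "customer_id")             # no data file is rewritten
after_rename = read_row(schema, old_file_row)

schema.drop_column("note")
schema.add_column("note")                             # same name, new field ID
after_readd = read_row(schema, old_file_row)
```

Here `after_rename` exposes the old value under `customer_id`, and `after_readd` shows `note` as `None` because the old "legacy" value belongs to a field ID that no longer exists in the schema.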
Practical improvements
● Hidden partitioning
  ○ No silent correctness bugs
  ○ No conversion mistakes
  ○ Query without understanding a table’s physical layout
● Reliable updates
  ○ Stop manual cleanup
  ○ Use any query engine
  ○ Automate maintenance
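Hidden partitioning can be pictured as the table, not the user, deriving partition values from a column and translating filters on that column into partition filters. The sketch below assumes a day-granularity transform on a timestamp column. The functions `day_transform`, `write`, and `scan` are hypothetical names for this example, and a "file" is just a list of rows per partition value.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """Partition value derived from a timestamp: whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days


def write(files, rows):
    """Group rows by derived partition value; writers never supply it by hand."""
    for row in rows:
        files.setdefault(day_transform(row["ts"]), []).append(row)


def scan(files, start, end):
    """Return rows with start <= ts < end, reading only partitions that can match."""
    first_day, last_day = day_transform(start), day_transform(end)
    matched, partitions_read = [], 0
    for day, rows in files.items():
        if first_day <= day <= last_day:      # pruning on the derived value
            partitions_read += 1
            matched.extend(r for r in rows if start <= r["ts"] < end)
    return matched, partitions_read


def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)


files = {}
write(files, [
    {"ts": utc(2021, 12, 1, 10), "event": "a"},
    {"ts": utc(2021, 12, 1, 23), "event": "b"},
    {"ts": utc(2021, 12, 2, 1), "event": "c"},
    {"ts": utc(2021, 12, 5, 0), "event": "d"},
])

# The query filters on ts only; it never mentions a partition column.
rows, partitions_read = scan(files, utc(2021, 12, 1, 22), utc(2021, 12, 2, 2))
```

Because the same transform is applied on write and on read, there is no separate partition column for a query to forget or to convert incorrectly.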
Performance improvements
● Indexed metadata
  ○ Fast job planning
  ○ Fast query execution
  ○ Faster iteration
● Table configuration
  ○ Tune tables, not jobs
  ○ Automate table tuning
  ○ Cluster and sort from config
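As a rough illustration of how indexed metadata can speed up job planning: if table metadata records lower and upper bounds of a column for each data file, a planner can skip files whose range cannot overlap the query filter without opening them. The sketch below assumes that setup. `DataFileEntry` and `plan_files` are hypothetical names, and the bounds cover a single integer column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical metadata entry: a file path plus bounds for one column."""
    path: str
    lower: int
    upper: int


def plan_files(manifest, lo, hi):
    """Select files whose [lower, upper] range can overlap lo <= x <= hi."""
    return [f.path for f in manifest if f.upper >= lo and f.lower <= hi]


manifest = [
    DataFileEntry("a.parquet", 1, 100),
    DataFileEntry("b.parquet", 101, 200),
    DataFileEntry("c.parquet", 150, 300),
]

# Only files that might contain values in [120, 160] are scheduled for reading.
selected = plan_files(manifest, 120, 160)
```

Planning cost here depends on the size of the metadata rather than on the number of data files that must be opened, which is the property the slide refers to as fast job planning.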
