Building an Open Data Platform with Apache Iceberg

Building an Open Data Platform with Apache Iceberg
Ryan Blue
Alluxio Day 8, December 2021

Current data architecture
● Multi-engine
  ○ Spark for ETL, ML
  ○ Trino for ad-hoc, ETL
  ○ Flink for streaming
  ○ Druid for aggregates
● In the cloud (or moving)
● Hive Metastore
  ○ No metastore?
● Investing in data
  ○ In people
  ○ In tools
  ○ In infrastructure

But the pieces don’t fit together quite right

What is Iceberg?
● A table format
  ○ Akin to columnar file formats
  ○ Transactional guarantees
  ○ Performance enhancements
● A standard for analytic tables
  ○ Open source spec and library
  ○ Integrated into query engines

The gap
[Diagram labels: Object storage, Data & metadata, Compute, Apache Spark, Catalog, ???]

Shared storage requirements
Technical:
● Must handle concurrent writes
● Must be scalable, performant
● Must be cloud native
Practical:
● Must be open source
● Must be neutral
● Must address productivity

Iceberg’s goals
● Add reliable transactions
● Unlock performance
● Fix usability
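The "concurrent writes" requirement and the "reliable transactions" goal can be illustrated with optimistic concurrency: each writer builds a new table snapshot from the one it read, then asks the catalog to swap it in atomically, retrying if another writer committed first. The following is a minimal toy sketch of that idea, not Iceberg's library API; ToyCatalog, Snapshot, append_files and the file names are hypothetical names made up for this example.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: a version number and its data files."""
    version: int
    files: tuple


class ToyCatalog:
    """Hypothetical catalog holding a pointer to the current snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = Snapshot(0, ())

    def current(self):
        return self._current

    def swap(self, expected_version, new_snapshot):
        # Atomic compare-and-swap: succeeds only if nobody else has
        # committed since the writer read `expected_version`.
        with self._lock:
            if self._current.version != expected_version:
                return False
            self._current = new_snapshot
            return True


def append_files(catalog, new_files, max_retries=10):
    """Commit new files on top of the latest snapshot, retrying on conflict."""
    for _ in range(max_retries):
        base = catalog.current()
        candidate = Snapshot(base.version + 1, base.files + tuple(new_files))
        if catalog.swap(base.version, candidate):
            return candidate
    raise RuntimeError("commit failed after retries")


if __name__ == "__main__":
    catalog = ToyCatalog()
    writers = [
        threading.Thread(target=append_files, args=(catalog, [f"f{i}.parquet"]))
        for i in range(4)
    ]
    for w in writers:
        w.start()
    for w in writers:
        w.join()
    # Every writer's file is present; no commit was lost or half-applied.
    print(catalog.current())
```

Readers in this model always see a complete snapshot, because the only shared mutable state is the single pointer that the swap replaces.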

Open data platform
[Diagram labels: Object storage, Data & metadata, Compute, Apache Spark, Catalog, Vertical solutions, Open data stack, Data Services]

Iceberg exists to fix productivity

Lessons learned
● Avoid unpleasant surprises
  ○ Principle of least surprise
● Don’t steal attention
  ○ Reduce context switching

We try to make Iceberg invisible

Usability improvements
● Schema evolution
  ○ Instantaneous – no rewrites
  ○ Safe – no undead columns 🧟
  ○ Saves days of headache
    ALTER TABLE db.tab RENAME COLUMN id TO customer_id
● Layout evolution
  ○ Lazy – only rewrite if needed
  ○ Partitioning mistakes are okay
  ○ Changes with your data
  ○ Saves a month of headache
    ALTER TABLE db.tab ADD PARTITION FIELD bucket(256, id)
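One way to see why a rename can be instantaneous and why dropped columns do not come back from the dead: Iceberg tracks columns by a unique field ID rather than by name, so a rename only edits table metadata, and a re-added column gets a fresh ID. The following is a toy sketch of that idea under simplifying assumptions; ToySchema and read_row are hypothetical names, and rows are modeled as dictionaries keyed by field ID rather than as real data files.

```python
class ToySchema:
    """Hypothetical schema that maps stable field IDs to current column names."""

    def __init__(self, column_names):
        self.names_by_id = {i + 1: name for i, name in enumerate(column_names)}
        self.next_id = len(column_names) + 1

    def _id_of(self, name):
        for field_id, current in self.names_by_id.items():
            if current == name:
                return field_id
        raise KeyError(name)

    def rename(self, old, new):
        # Metadata-only change: stored data is untouched.
        self.names_by_id[self._id_of(old)] = new

    def drop(self, name):
        del self.names_by_id[self._id_of(name)]

    def add(self, name):
        # A new column always gets a fresh ID, even if the name was used before.
        self.names_by_id[self.next_id] = name
        self.next_id += 1


def read_row(schema, stored_row):
    """Project a stored row (keyed by field ID) through the current schema."""
    return {name: stored_row.get(fid) for fid, name in schema.names_by_id.items()}


if __name__ == "__main__":
    schema = ToySchema(["id", "name"])
    stored = {1: 42, 2: "Ada"}  # written before any schema change

    schema.rename("id", "customer_id")
    print(read_row(schema, stored))  # old data is readable under the new name

    schema.drop("name")
    schema.add("name")
    print(read_row(schema, stored))  # the old "name" values do not reappear
```

Because the old value of "name" was written under field ID 2 and the re-added column has ID 3, the reader returns None for it instead of resurrecting stale data.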

Practical improvements
● Hidden partitioning
  ○ No silent correctness bugs
  ○ No conversion mistakes
  ○ Query without understanding a table’s physical layout
● Reliable updates
  ○ Stop manual cleanup
  ○ Use any query engine
  ○ Automate maintenance
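Hidden partitioning can be sketched as follows: the table, not the query author, owns the transform from a source column to a partition value. Writers apply it automatically, and readers filter on the source column while the table derives which partitions to skip. This is a toy model under stated assumptions (a single monotonic day transform, in-memory rows); ToyTable and day_transform are hypothetical names, not Iceberg's API.

```python
from datetime import datetime


def day_transform(ts):
    """Partition transform: timestamp -> calendar day."""
    return ts.date()


class ToyTable:
    """Hypothetical table that stores its partition transform as metadata."""

    def __init__(self, source_column, transform):
        self.source_column = source_column
        self.transform = transform
        self.partitions = {}  # partition value -> list of rows

    def append(self, row):
        # The writer never computes a partition column by hand.
        key = self.transform(row[self.source_column])
        self.partitions.setdefault(key, []).append(row)

    def scan(self, lower, upper):
        """Return rows with lower <= source < upper and the partitions read."""
        # Assumes the transform is monotonic, so the bounds can be transformed.
        lo, hi = self.transform(lower), self.transform(upper)
        rows, partitions_read = [], 0
        for key, stored in self.partitions.items():
            if lo <= key <= hi:
                partitions_read += 1
                rows.extend(
                    r for r in stored if lower <= r[self.source_column] < upper
                )
        return rows, partitions_read


if __name__ == "__main__":
    table = ToyTable("ts", day_transform)
    for ts in [
        datetime(2021, 12, 1, 10, 0),
        datetime(2021, 12, 2, 9, 0),
        datetime(2021, 12, 2, 23, 30),
        datetime(2021, 12, 3, 8, 0),
    ]:
        table.append({"ts": ts})

    # The query filters on ts only; the table decides which partitions to read.
    rows, partitions_read = table.scan(
        datetime(2021, 12, 2, 0, 0), datetime(2021, 12, 2, 12, 0)
    )
    print(len(rows), partitions_read)
```

The query never mentions a derived date column, so there is no hand-written conversion to get wrong, which is the source of the "silent correctness bugs" the slide refers to.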

Performance improvements
● Indexed metadata
  ○ Fast job planning
  ○ Fast query execution
  ○ Faster iteration
● Table configuration
  ○ Tune tables, not jobs
  ○ Automate table tuning
  ○ Cluster and sort from config
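"Indexed metadata" can be illustrated with per-file column bounds: if table metadata records the lowest and highest value of a column in each data file, a planner can discard files that cannot match a filter without opening them. The sketch below is a toy version of that idea, not Iceberg's manifest format or API; FileEntry, plan_files and the file paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical metadata record: a data file and the bounds of one column."""
    path: str
    lower: int
    upper: int


def plan_files(entries, value):
    """Return only the files whose [lower, upper] range could contain `value`."""
    return [e.path for e in entries if e.lower <= value <= e.upper]


if __name__ == "__main__":
    entries = [
        FileEntry("a.parquet", 1, 100),
        FileEntry("b.parquet", 101, 200),
        FileEntry("c.parquet", 201, 300),
    ]
    # A filter like id = 150 needs to read one file out of three.
    print(plan_files(entries, 150))
```

Planning touches only metadata, so its cost depends on the number of metadata entries rather than on the size of the data, which is what makes job planning fast in this model.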
