Introduction
Apache Iceberg is one of the three popular open table formats (OTFs), alongside Apache Hudi (originally from Uber) and Delta Lake (from Databricks).
In this post:
- Iceberg specification
- Docker Compose Hands-on
- Metadata
What is an OTF?
Turn files into tables
An open table format is a specification for organizing a collection of files that hold the same kind of data so that they can be presented as a single table.
The idea is that all these files should be viewable and updatable as if they were a single entity: the table.
We can then interact with this collection of files the same way we would interact with a table in a database.
Various parties must implement this specification to produce usable software.
Apache Iceberg specification
To implement the Apache Iceberg specification, we need three things:
- Catalog: keeps track of all the metadata files
- Processing engine: e.g., a query engine
- Scalable storage: e.g., object storage
The processing engine is the compute node that ties everything together. We issue commands to it for:
- Creating tables
- Inserting data into tables
- Querying tables
Logical Diagram
- The REST catalog uses MinIO for storing metadata.
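To make the diagram concrete, here is a minimal sketch of how a Spark session can be wired to a REST catalog that keeps its data and metadata in MinIO. The catalog name (demo), the service hostnames (rest, minio), and the warehouse path are assumptions matching a quickstart-style Docker Compose setup like the one used below; the Iceberg Spark runtime and S3 bundle jars are assumed to be on the classpath, and S3 credentials are assumed to come from the usual AWS environment variables.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session talking to a REST catalog backed by MinIO.
# The catalog name "demo", the hostnames "rest"/"minio", and the warehouse
# path are assumptions; adjust them to match your docker-compose services.
spark = (
    SparkSession.builder
    .appName("iceberg-otf-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "rest")                                # REST catalog
    .config("spark.sql.catalog.demo.uri", "http://rest:8181")                     # catalog service
    .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.demo.warehouse", "s3://warehouse/")                 # bucket in MinIO
    .config("spark.sql.catalog.demo.s3.endpoint", "http://minio:9000")             # MinIO endpoint
    .config("spark.sql.defaultCatalog", "demo")
    .getOrCreate()
)
```

With a session like this, the SQL that follows (CREATE DATABASE, CREATE TABLE, inserts, queries) goes through the catalog, and every file it produces lands in the warehouse bucket in MinIO.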
Iceberg's Data Architecture
- Catalog: the processing engine connects to the catalog to get the list of tables. For each table, the catalog keeps track of its metadata files.
- Metadata file: holds the table's schema, partition spec, and snapshots (including previous schemas and partition specs).
- Manifest lists: one per snapshot; a manifest list represents the snapshot itself (s0, s1, ...).
- Manifest files:
  - Point to one or more data files (Parquet files).
  - Carry column-level information, e.g., min/max values for each data file.
  - This information is important for efficient query execution, since data files that cannot match a filter can be skipped.
- Data files: the actual data, typically in Parquet format.
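Each of these layers can be inspected directly from Spark through Iceberg's metadata tables. Below is a minimal sketch in PySpark; it assumes an active spark session wired to the catalog and the climate.weather table created in the hands-on section that follows.

```python
# Snapshots: each row points to a manifest list file.
spark.sql(
    "SELECT snapshot_id, operation, manifest_list FROM climate.weather.snapshots"
).show(truncate=False)

# Manifest files tracked for the table.
spark.sql(
    "SELECT path, added_data_files_count FROM climate.weather.manifests"
).show(truncate=False)

# Data files, with record counts and the column-level bounds used for pruning.
spark.sql(
    "SELECT file_path, record_count, lower_bounds, upper_bounds FROM climate.weather.files"
).show(truncate=False)
```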
Hands-on
Bring the stack up with Docker Compose, then open a Spark SQL shell inside the spark-iceberg container:

```sh
$ docker compose up
$ docker exec -it spark-iceberg spark-sql
```
Create a database
```
❯ docker exec -it spark-iceberg spark-sql
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/16 08:01:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/16 08:01:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark Web UI available at http://e5356708c7af:4041
Spark master: local[*], Application Id: local-1758009676374
spark-sql ()> CREATE DATABASE IF NOT EXISTS climate;
Time taken: 0.537 seconds
```
Create a table
```
spark-sql ()> CREATE TABLE IF NOT EXISTS climate.weather (
            >   datetime timestamp,
            >   temp double,
            >   lat double,
            >   long double,
            >   cloud_coverage string,
            >   precip double,
            >   wind_speed double
            > )
            > USING iceberg
            > PARTITIONED BY (days(datetime))
            > ;
Time taken: 0.911 seconds
```
{ "format-version": 2, "table-uuid": "195f0af4-6ff1-4ea1-8def-73c85fa2d483", "location": "s3://warehouse/climate/weather", "last-sequence-number": 0, "last-updated-ms": 1758009757603, "last-column-id": 7, "current-schema-id": 0, "schemas": [ { "type": "struct", "schema-id": 0, "fields": [ { "id": 1, "name": "datetime", "required": false, "type": "timestamptz" }, { "id": 2, "name": "temp", "required": false, "type": "double" }, { "id": 3, "name": "lat", "required": false, "type": "double" }, { "id": 4, "name": "long", "required": false, "type": "double" }, { "id": 5, "name": "cloud_coverage", "required": false, "type": "string" }, { "id": 6, "name": "precip", "required": false, "type": "double" }, { "id": 7, "name": "wind_speed", "required": false, "type": "double" } ] } ], "default-spec-id": 0, "partition-specs": [ { "spec-id": 0, "fields": [ { "name": "datetime_day", "transform": "day", "source-id": 1, "field-id": 1000 } ] } ], "last-partition-id": 1000, "default-sort-order-id": 0, "sort-orders": [ { "order-id": 0, "fields": [] } ], "properties": { "owner": "root", "write.parquet.compression-codec": "zstd" }, "current-snapshot-id": -1, "refs": {}, "snapshots": [], "statistics": [], "partition-statistics": [], "snapshot-log": [], "metadata-log": [] }
Adding Data
Let's add data from a Jupyter notebook (using PySpark).
```python
from datetime import datetime

schema = spark.table("climate.weather").schema

data = [
    (datetime(2023, 8, 16), 76.2, 40.951908, -74.075272, "Partially sunny", 0.0, 3.5),
    (datetime(2023, 8, 17), 82.5, 40.951908, -74.075272, "Sunny", 0.0, 1.2),
    (datetime(2023, 8, 18), 70.9, 40.951908, -74.075272, "Cloudy", 0.5, 5.2),
]

df = spark.createDataFrame(data, schema)
df.writeTo("climate.weather").append()
```
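The append commits the table's first snapshot. As a quick sanity check (a sketch, run in the same notebook), we can read the rows back and look at the history metadata table:

```python
# Read the rows back through the table abstraction.
spark.table("climate.weather").orderBy("datetime").show(truncate=False)

# The history metadata table should now show one snapshot made current by the append.
spark.sql(
    "SELECT made_current_at, snapshot_id, is_current_ancestor FROM climate.weather.history"
).show(truncate=False)
```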