Streaming Data Lakes Using Kafka Connect + Apache Hudi
Balaji Varadarajan, Vinoth Chandar
Speakers

Vinoth Chandar
PMC Chair / Creator of Hudi
Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking)
Principal Eng @ Confluent (ksqlDB, Kafka/Streams)
Staff Eng @ LinkedIn (Voldemort, DDS)
Sr. Eng @ Oracle (CDC/GoldenGate/XStream)

Balaji Varadarajan
PMC Member, Apache Hudi
Sr. Staff Eng @ Robinhood, Data Infra
Tech Lead @ Uber, Data Platform
Staff Engineer @ LinkedIn, Databus CDC
Agenda 1) Background 2) Hudi 101 3) Hudi’s Spark Writers (existing) 4) Kafka Connect Sink (new) 5) Onwards
Background Event Streams, Data Lakes
Data Lakes are now essential
Architectural pattern for analytical data
❏ Data Lake != Spark, Flink
❏ Data Lake != Files on S3
❏ Raw data (OLTP schema)
❏ Derived data (OLAP/BI, ML schema)
Open Storage + Scalable Compute
❏ Avoid data lock-in; open formats (data + metadata)
❏ Efficient, universal (Analytics, Data Science)
Lots of exciting progress
❏ Lakehouse = Lake + Warehouse
❏ Data meshes on lakes => need for streams
Source: https://martinfowler.com/bliki/images/dataLake/context.png
Event Streams are the new norm
Events come in many flavors
Database change events
❏ High fidelity, high value, updates/deletes
❏ E.g: Debezium changelogs into Kafka
Application/Service business events
❏ High volume, immutable or deltas
❏ E.g: Uber app events, changes emitted from IoT sensors
SaaS data sources
❏ Lower volume, mutable
❏ E.g: polling the GitHub events API
Extracting Event Streams
[Diagram: Kafka Connect sources extract event streams from databases, apps/services (event firehose), and external sources into the Kafka cluster]
Why not just Connect File Sinks?
[Diagram: Kafka Connect sinks (S3/HDFS) write raw files from the Kafka cluster onto DFS/cloud storage, and queries hit those files directly. Data Lake??]
Challenges
Working at the file abstraction level is painful
❏ Transactions, concurrency control
❏ Updating subsets of data, indexing for faster access
Scalability, operational overhead
❏ Writing columnar files is resource intensive
❏ Partitioned data increases memory overhead
Lack of management
❏ Controlling file sizes, deletes for GDPR/compliance
❏ Re-aligning storage for better query performance
Apache Hudi
Transactional writes, MVCC/OCC
❏ Work with tables and records
❏ Automatic compaction, clustering, sizing
First-class support for updates, deletes
❏ Record-level updates/deletes inspired by stream processors
CDC streams from lake storage
❏ Storage layout optimized for incremental fetches
❏ Hudi's unique contribution in the space
Hudi 101 Components, APIs, Architecture
Stream processing + Batch data
The Incremental Stack
+ Intelligent, Incremental
+ Fast, Efficient
+ Scans, Columnar formats
+ Scalable Compute
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/ (2016)
The Hudi Stack
❏ Complete "data" lake platform
❏ Tightly integrated, self-managing
❏ Write using Spark, Flink
❏ Query using Spark, Flink, Hive, Presto, Trino, Impala, AWS Athena/Redshift, Aliyun DLA, etc.
❏ Out-of-box tools/services for data ops
http://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform
Storage Layout
The Community
❏ Powers arguably the largest transactional data lake on the planet @ Uber
❏ (Database CDC) Robinhood's near-realtime data lake
❏ (ML Feature stores) @ Logical Clocks
❏ (Event Deletions/De-Duping) @ Moveworks
❏ Many more companies; pre-installed by 5 major cloud providers
1000+ Slack members | 150+ Contributors | 1000+ GH Engagers | ~10-20 PRs/week | 20+ Committers | 10+ PMCs
Hudi DeltaStreamer Efficient, Micro-batched
DeltaStreamer Utility, Spark Streaming
[Diagram: event streams are pulled from Kafka using Spark, de-duped and indexed, then applied as transactions to tables on DFS/cloud storage; table services (cluster/optimize, compact, clean) run against the same tables]
Current Kafka to Hudi Options
- Ingest streaming data into the Data Lake (raw tables)
- Current solutions, through Spark:
  - Hudi DeltaStreamer
  - Spark Structured Streaming
[Diagram: both options read from the Kafka cluster and apply records to tables on DFS/cloud storage]
Structured Streaming Sink

// Read data from stream
Dataset<Row> streamingInput = spark.readStream()...

// Write to Hudi in a streaming fashion
DataStreamWriter<Row> writer = streamingInput.writeStream()
    .format("org.apache.hudi")
    .option(DataSourceWriteOptions.TABLE_TYPE.key(), tableType)
    .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "_row_key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME.key(), tableName)
    .option("checkpointLocation", checkpointLocation)
    .outputMode(OutputMode.Append());

String tablePath = "s3://….";

// Schedule the job
StreamingQuery query = writer.trigger(Trigger.ProcessingTime(500)).start(tablePath);
query.awaitTermination(streamingDurationInMs);
DeltaStreamer Utility
❏ Fully managed ingestion and ETL service
❏ Integration with various streaming and batch sources
❏ Table state & checkpoints are transactionally consistent
❏ Pluggable transformations for ETL use cases (see the sketch below)
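For illustration, a custom transformation plugged into DeltaStreamer might look roughly like the sketch below. It assumes the Transformer interface shipped in hudi-utilities; exact package paths and the method signature should be verified against the Hudi version in use, and the class name and dropped column are purely hypothetical.

// Hedged sketch of a pluggable DeltaStreamer transformer (verify the interface and
// package paths against your Hudi version; class and column names are made up).
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MaskPiiTransformer implements Transformer {
  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    // Drop a sensitive column before the records land in the raw table.
    return rowDataset.drop("ssn");
  }
}

Such a class would be handed to DeltaStreamer via --transformer-class, as in the example on the next slide.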
DeltaStreamer Example

spark-submit --master yarn \
  --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.8.0 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.task.maxFailures=5 \
  ... \
  --enable-sync \
  --hoodie-conf auto.offset.reset=latest \
  --hoodie-conf hoodie.avro.schema.validate=true \
  …. \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --props /path/job.properties \
  --transformer-class com.some.someTransformer \
  --continuous   # ← enables async compaction, clustering & cleaning along with streaming writes

Streaming Data Lake without writing any code!
Case Study: Robinhood Data Lake
[Diagram: the master RDS replicates to a replica RDS, whose change stream lands in per-table topics; DeltaStreamer (Bootstrap) updates schema and partitions, while DeltaStreamer (Live) writes incremental data and checkpoints offsets into the data lake (s3://xxx/…)]
Case Study: Robinhood Data Lake
❏ 1000s of CDC-based streaming ingest pipelines supported by Apache Hudi DeltaStreamer
❏ Data lake freshness latency down from hours to 5-15 mins
❏ Powers critical dashboards and use cases
End-to-End Streaming Data Lake
❏ The Data Lake has both raw tables and derived tables built through ETLs
❏ A streaming data lake needs streaming semantics for both kinds of tables
❏ The missing primitive: derived tables need a changelog view of the upstream dataset -> Apache Hudi Incremental Reads to the rescue (see the sketch below)
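As a rough sketch of that incremental-read primitive, a derived-table job could pull only the records committed after its last checkpoint via the Hudi Spark datasource. The option keys below are the commonly documented ones but should be double-checked against the Hudi version in use; beginInstant and rawTablePath are placeholders.

// Hedged sketch: incremental read of a raw Hudi table to drive a derived-table ETL.
Dataset<Row> changes = spark.read()
    .format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", beginInstant) // last processed commit
    .load(rawTablePath);
// `changes` holds only records written after beginInstant, so the derived table
// can be upserted from this changelog instead of re-reading the full raw table.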
The Big Picture
[Diagram: event streams from databases (CDC), apps/services, and external sources are pushed/pulled into the streaming data lake; DeltaStreamer / Spark Streaming jobs write the raw tables, and Hudi changelogs feed further DeltaStreamer / Spark Streaming jobs that build the derived tables]
Connect Hudi Sink Kafkaesque, Commit protocol, Transactional
Motivations
Integration with Kafka Connect
❏ Separation of concerns (writing vs optimization/management)
❏ Streamline operations: just one framework for ingesting
❏ Less need for Spark expertise
Faster data
❏ Amortize startup costs (containers, queue delays)
❏ Commit frequently, i.e. every 1 minute (every N secs in the near future)
❏ E.g. Avro records from the Kafka log go straight into Hudi's log format
Putting it all together
[Diagram: the Hudi Connect Sink (writing) pulls event streams from Kafka, de-dupes, indexes, and commits transactionally to tables on DFS/cloud storage; Hudi's table services (optimization, management) handle compaction, clustering, cleaning, and deletes]
Design Challenges
Determining transaction boundaries
❏ No coordination via a driver process, unlike Spark/Flink
❏ Workers doing their own commits => horrible concurrency bottlenecks
Connect APIs cannot express DAGs
❏ Meant to be simple `put()`/`preCommit()` calls (see the skeleton below)
❏ Indexing, de-duplication, storage optimization all shuffle data
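For context, this is roughly the surface area a Kafka Connect sink task gets to work with: a bare skeleton of the standard SinkTask API, not the actual Hudi sink. Each task only sees batches of records for its own partitions plus a pre-commit hook, which is why shuffles and global coordination have to happen elsewhere.

// Illustrative skeleton of the Kafka Connect SinkTask API surface.
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class IllustrativeSinkTask extends SinkTask {
  @Override
  public String version() { return "0.0.1"; }

  @Override
  public void start(Map<String, String> props) {
    // Open writers, read config. No global view of the other tasks here.
  }

  @Override
  public void put(Collection<SinkRecord> records) {
    // Called with records for *this task's* partitions only;
    // any indexing/de-duping that needs a shuffle cannot happen here.
  }

  @Override
  public Map<TopicPartition, OffsetAndMetadata> preCommit(
      Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
    // Return the offsets that are actually safe to commit back to Kafka.
    return currentOffsets;
  }

  @Override
  public void stop() { }
}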
Design Overview
Central transaction co-ordination
❏ Use Kafka to elect the co-ordinator
❏ Runs in one of the workers
Kafka as the control channel
❏ Consume from the latest control topic offsets
https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi
Design Overview
Transaction Co-ordinator
❏ Daemon thread on the owner of partition 0
❏ Sends commands to participants
Embedded Hudi Java Writer
❏ Lands data into a set of file groups, mapped to a partition
❏ Hudi's commit fencing guards against failures/partial writes
Co-ordinator State Machine
Paxos-like two-phase commit (see the toy sketch below)
❏ Co-ordinator process starts and ends commits
❏ Safety > liveness; abort after timeout
Participants "pause" at each commit boundary
❏ Return latest write offsets to the co-ordinator
❏ Resume again on start of the next commit
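The toy sketch below mimics the commit round described above, with plain objects standing in for Connect tasks and direct method calls standing in for the Kafka control topic. It only illustrates the shape of the protocol and is not the RFC-32 implementation.

// Toy, self-contained sketch of the two-phase commit round (illustrative only).
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class TwoPhaseCommitSketch {

  // A participant "pauses" at the commit boundary and reports its latest write offset.
  static class Participant {
    private long latestWrittenOffset = 0;
    void write(long offset) { latestWrittenOffset = offset; }     // normal writing phase
    long pauseAndReportOffset() { return latestWrittenOffset; }   // phase 1: pause + report
    void resume(String instantTime) { /* start buffering for the next commit */ }
  }

  static class Coordinator {
    private final List<Participant> participants;
    Coordinator(List<Participant> participants) { this.participants = participants; }

    // One commit round: start commit, collect offsets, then finalize and resume.
    Optional<List<Long>> runCommitRound(String instantTime) {
      List<Long> offsets = new ArrayList<>();
      for (Participant p : participants) {
        // In the real sink this is a control-topic message with a timeout;
        // if a participant fails to respond, the round is aborted (safety > liveness).
        offsets.add(p.pauseAndReportOffset());
      }
      // Phase 2: here the coordinator would write the Hudi commit metadata (stubbed out),
      // then let every participant resume consuming.
      participants.forEach(p -> p.resume(instantTime));
      return Optional.of(offsets);
    }
  }

  public static void main(String[] args) {
    Participant a = new Participant(); a.write(42);
    Participant b = new Participant(); b.write(17);
    Coordinator c = new Coordinator(List.of(a, b));
    System.out.println("committed offsets: " + c.runCommitRound("20211001120000"));
  }
}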
Example Sink Configuration

# hudi table properties
target.base.path
target.table.name
target.database.name
schemaprovider.class
partition.field.name
hoodie.table.base.file.format

# controller properties
control.topic.name
coordinator.writestatus.timeout
write.retry.timeout

(Pre-release, subject to change. Refer to the official Hudi docs for actual config names.)
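Purely as an illustration, a filled-in sink config combining those keys with the standard Connect connector properties might look like the snippet below. All names and values are hypothetical, including the connector class, since the sink was pre-release at the time; take the real keys from the official Hudi docs.

# Hypothetical hudi-sink.properties (illustrative only; verify every key and the
# connector class against the official Hudi docs)
name=hudi-sink
# assumed class name, not confirmed by this deck
connector.class=org.apache.hudi.connect.HoodieSinkConnector
tasks.max=4
topics=stock_ticks
target.base.path=s3://bucket/hudi/stock_ticks
target.table.name=stock_ticks
target.database.name=default
partition.field.name=date
control.topic.name=hudi-control-topic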
Choosing Right
DeltaStreamer:
❏ Provides the full set of Hudi features
❏ Offers better elasticity for merging/writing columnar data, i.e. copy-on-write tables
❏ Data freshness of several minutes, if not running in continuous mode
❏ Needs experience with Spark/Flink
Connect Sink:
❏ Insert only for now; indexes/updates coming as enhancements
❏ Great impedance match with Kafka for landing avro/row-oriented data, i.e. merge-on-read tables
❏ Approaches ~1 min freshness
❏ Operate all data ingestion in a single framework
What’s to come Onwards
Kafka + Hudi
Support for mutable, keyed updates/deletes
❏ Need to implement a new index, à la the Flink writer
❏ preCombine, buffering/batching
What if: back Kafka's tiered storage with Hudi?
❏ Map offsets to Hudi commit_seq_no
❏ Columnar reads for historical/catch-up reads
Engage With Our Community
User Docs: https://hudi.apache.org
Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI
GitHub: https://github.com/apache/hudi/
Twitter: https://twitter.com/apachehudi
Mailing list(s): dev-subscribe@hudi.apache.org (send an empty email to subscribe)
                 dev@hudi.apache.org (actual mailing list)
Slack: https://join.slack.com/t/apache-hudi/signup
Questions? Thanks!
