End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
The document discusses the importance of data in machine learning and presents an overview of tools like Hopsworks, Databricks Delta, and various feature stores. It highlights advancements in data lakes with ACID transactions, incremental ingestion, and efficient querying using frameworks such as Delta, Hudi, and Iceberg. The Hopsworks feature store is emphasized as the world's first open-source feature store, supporting end-to-end ML pipelines with reliable and timely data access.
Introduction of WIFI details, presenters Kim Hammar and Jim Dowling, topic focus on ML Pipelines with Databricks and Hopsworks.
Explains the significance of data in ML, stating it's the hardest part, with modelers focusing on feature selection and transformation.
Introduction to data sourcing from the Feature Store, emphasizing its vital role in ML operations.
An outline of the presentation covering Hopsworks, Databricks Delta, Feature Store, demo, and summary.
Describes the ecosystem including data sources and applications, and how Hopsworks integrates various technologies like Apache Beam, Spark, and TensorFlow.
Defines new characteristics of Data Lakes, like ACID transactional layers, and solutions for issues like incremental updates and rollback failures.
Details on Upsert and Time Travel functionalities in data management, providing examples of how these concepts work.
Discusses Delta Lake's transactional layer, ACID transactions, open format storage, and time-travel capabilities.
Covers optimistic concurrency control, mutual exclusion, retrial strategies, and how scalable metadata management works.
Comparison of Delta, Hudi, and Iceberg frameworks, highlighting their common goals of reliable updates and storage efficiency.
Discusses how Feature Stores can utilize log-structured storage, and how integration with Databricks provides incremental ACID ingestion and data validation.
Explains incremental feature engineering and point-in-time correct data with examples using Hudi and Hopsworks.
Demonstrates the integration of Hopsworks Feature Store and Databricks platform in action.
Summarizes key functionalities of Delta, Hudi, Iceberg for data lakes and introduces Hopsworks as an open-source feature store supporting end-to-end ML.
Provides company information, resources for further reading, and thanks to team members involved in the project.
Kim Hammar, Logical Clocks AB, KimHammar1
Jim Dowling, Logical Clocks AB, jim_dowling
End-to-End ML Pipelines with Databricks Delta and Hopsworks Feature Store
#UnifiedDataAnalytics #SparkAISummit
Where does the Data come from?
“Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models.” [Uber on Michelangelo]
Next-Gen Data Lakes
Data Lakes are starting to resemble databases:
– Apache Hudi, Delta, and Apache Iceberg add:
• ACID transactional layers on top of the data lake
• Indexes to speed up queries (data skipping)
• Incremental Ingestion (late data, delete existing records)
• Time-travel queries
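To make "data skipping" concrete, here is a minimal sketch in plain Python of file-level min/max filtering. It is not the implementation used by Delta, Hudi, or Iceberg; the DataFile class, the file names, and the values are hypothetical and exist only for illustration. The idea is that each file carries the minimum and maximum of a column, so a range query can skip any file whose range cannot overlap the predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFile:
    """Hypothetical data file with min/max statistics for one column."""
    name: str
    min_value: int
    max_value: int
    rows: List[int]


def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_value >= lo and f.min_value <= hi]


def query(files, lo, hi):
    """Read the surviving files and filter their rows by the predicate."""
    hits = []
    for f in files_to_scan(files, lo, hi):
        hits.extend(r for r in f.rows if lo <= r <= hi)
    return hits


files = [
    DataFile("part-0", 1, 10, [1, 4, 10]),
    DataFile("part-1", 11, 20, [11, 15, 20]),
    DataFile("part-2", 21, 30, [21, 25, 30]),
]

scanned = [f.name for f in files_to_scan(files, 12, 22)]  # part-0 is skipped
result = query(files, 12, 22)
print(scanned, result)
```

In this toy example the query for values between 12 and 22 never opens part-0, because its maximum (10) is below the lower bound.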
Delta Lake by Databricks
• Delta Lake is a Transactional Layer that sits on top of your Data Lake:
– ACID Transactions with Optimistic Concurrency Control
– Log-Structured Storage
– Open Format (Parquet-based storage)
– Time-travel
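The combination of log-structured storage, optimistic concurrency control, and time-travel can be sketched with a toy in-memory table. This is an illustration under simplifying assumptions, not Delta Lake's actual protocol or API: ToyLogTable, ConflictError, and commit_with_retry are hypothetical names, and the toy rejects any commit made against a stale version, whereas a real system may accept concurrent commits that do not conflict.

```python
class ConflictError(Exception):
    """Raised when another writer committed after this writer read the table."""


class ToyLogTable:
    """Toy table whose state is defined by an append-only commit log."""

    def __init__(self):
        self._log = []  # entry i holds the key/value changes of version i + 1

    @property
    def version(self):
        return len(self._log)

    def commit(self, read_version, changes):
        # Optimistic concurrency: succeed only if nobody committed since we read.
        if read_version != self.version:
            raise ConflictError(
                f"read version {read_version}, table is at {self.version}"
            )
        self._log.append(dict(changes))
        return self.version

    def snapshot(self, as_of=None):
        # Time travel: replay the log up to the requested version.
        upto = self.version if as_of is None else as_of
        state = {}
        for entry in self._log[:upto]:
            state.update(entry)
        return state


def commit_with_retry(table, make_changes, max_attempts=3):
    """Re-read and retry when a concurrent writer wins the race."""
    for _ in range(max_attempts):
        read_version = table.version
        changes = make_changes(table.snapshot(read_version))
        try:
            return table.commit(read_version, changes)
        except ConflictError:
            continue
    raise ConflictError("gave up after repeated conflicts")


table = ToyLogTable()
v1 = table.commit(0, {"user_1": 10})
v2 = table.commit(1, {"user_1": 12, "user_2": 7})

old_state = table.snapshot(as_of=1)  # state as of version 1
new_state = table.snapshot()         # latest state

try:
    table.commit(1, {"user_1": 99})  # stale writer: read version 1, table is at 2
    stale_rejected = False
except ConflictError:
    stale_rejected = True
```

Because no committed log entry is ever modified, reading an older version is just a replay of a shorter log prefix, which is why time-travel comes almost for free with this storage layout.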
Other Frameworks: Apache Hudi, Apache Iceberg
• Hudi was developed by Uber for their Hadoop Data Lake (HDFS first, then S3 support)
• Iceberg was developed by Netflix with S3 as target storage layer
• All three frameworks (Delta, Hudi, Iceberg) have common goals of adding ACID updates, incremental ingestion, efficient queries.
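The upsert semantics behind incremental ingestion (late-arriving updates and deletes of existing records) can be sketched in plain Python. This is only an illustration of the merge logic, not the Hudi, Delta, or Iceberg API; the upsert function, the "id" key, and the "_deleted" flag are hypothetical conventions chosen for the example.

```python
def upsert(table, batch, key="id"):
    """Merge a batch into a table keyed by `key`.

    Rows in the batch replace existing rows with the same key, rows with new
    keys are inserted, and rows flagged with "_deleted" remove the key.
    """
    merged = {row[key]: row for row in table}
    for row in batch:
        if row.get("_deleted"):
            merged.pop(row[key], None)
        else:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])


table_v1 = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]

# A later batch: a late correction for id 2, a delete of id 3, and a new id 4.
batch = [
    {"id": 2, "amount": 25},
    {"id": 3, "_deleted": True},
    {"id": 4, "amount": 40},
]

table_v2 = upsert(table_v1, batch)
print(table_v2)
```

The function returns a new list instead of mutating table_v1, which mirrors the idea that an older version of the table remains readable after the upsert.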
Next-Gen Data Lakes Compared
Feature | Delta | Hudi | Iceberg
Incremental Ingestion | Spark | Spark | Spark
ACID updates | HDFS, S3* | HDFS | S3, HDFS
File Formats | Parquet | Avro, Parquet | Parquet, ORC
Data Skipping (File-Level Indexes) | Min-Max Stats + Z-Order Clustering* | File-Level Max-Min stats + Bloom Filter | File-Level Max-Min Filtering
Concurrency Control | Optimistic | Optimistic | Optimistic
Data Validation | Expectations (coming soon) | In Hopsworks | N/A
Merge-on-Read | No | Yes (coming soon) | No
Schema Evolution | Yes | Yes | Yes
File I/O Cache | Yes* | No | No
Cleanup | Manual | Automatic, Manual | No
Compaction | Manual | Automatic | No
*Databricks version only (not open-source)
How can a Feature Store leverage Log-Structured Storage (e.g., Delta or Hudi or Iceberg)?
Hopsworks Feature Store (architecture diagram)
• Layers: Feature Mgmt, Storage, Access
• Labels shown: Statistics, Discovery, Access Control, Time Travel, Data Validation, Online Features, Offline Features
• Storage: MySQL Cluster (Metadata, Online Features), Apache Hive Columnar DB (Offline Features), HopsFS
• Data Engineer: Feature Data Ingestion from a Pandas or PySpark DataFrame, or from an External DB with a Feature Defn 'select ..'
• Feature CRUD: Add/remove features, access control, feature data validation.
• Data Scientist, Online Apps, Batch Apps, JDBC (SAS, R, etc): Discover features, create training data, save models, read online/offline/on-demand features, historical feature values.
• Training Data (S3, HDFS), Models
AWS Sagemaker and Databricks Integration
• Computation engine (Spark)
• Incremental ACID Ingestion
• Time-Travel
• Data Validation
• On-Demand or Cached Features
• Online or Offline Features
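Reading "historical feature values" for training data usually means a point-in-time correct lookup: each training example gets the feature value that was valid at the time of its event, never a later one. The sketch below shows that lookup in plain Python. It is not the Hopsworks API; feature_as_of, the timestamps, and the values are hypothetical and chosen only to illustrate the idea.

```python
from bisect import bisect_right


def feature_as_of(history, event_time):
    """Return the latest feature value with timestamp <= event_time.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at event_time.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, event_time)
    return history[i - 1][1] if i > 0 else None


# Hypothetical history of one feature for one entity: (commit time, value).
history = [(1, 0.2), (5, 0.7), (9, 0.9)]

# Hypothetical label events: (event time, label).
events = [(0, "a"), (5, "b"), (7, "c"), (12, "d")]

training_rows = [(label, feature_as_of(history, t)) for t, label in events]
print(training_rows)
```

The event at time 7 is paired with the value committed at time 5, not the one committed at time 9, which is what prevents future information from leaking into the training data.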
Summary
• Delta, Hudi, Iceberg bring Reliability, Upserts & Time-Travel to Data Lakes
– Functionalities that are well suited for Feature Stores
• Hopsworks Feature Store builds on Hudi/Hive and is the world’s first open-source Feature Store (released 2018)
• The Hopsworks Platform also supports End-to-End ML pipelines using the Feature Store and Spark/Beam/Flink, TensorFlow/PyTorch, and Airflow
Thank you!
470 Ramona St, Palo Alto
Kista, Stockholm
https://www.logicalclocks.com
Register for a free account at www.hops.site
Twitter: @logicalclocks, @hopsworks
GitHub: https://github.com/logicalclocks/hopsworks, https://github.com/hopshadoop/hops
References
• Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/
• Python-First ML Pipelines with Hopsworks. https://hops.readthedocs.io/en/latest/hopsml/hopsML.html
• Hopsworks white paper. https://www.logicalclocks.com/whitepapers/hopsworks
• HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
• Open Source: https://github.com/logicalclocks/hopsworks, https://github.com/hopshadoop/hops
• Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, Alex Ormenisan, Rasmus Toivonen, Steffen Grohsschmiedt, and Moritz Meister
Don’t forget to rate and review the sessions. Search: Spark + AI Summit