Get Started with
Databricks for
Machine Learning
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Learning goals
Upon completion of this content, you should be able to:
1. Explain fundamental concepts about using the Databricks Lakehouse Platform for machine learning.
2. Perform basic notebook tasks using the Databricks Lakehouse Platform.
3. Store and manage data in the Lakehouse for machine learning tasks.
4. Create and use a baseline model using AutoML.
5. Create and use a feature store table for model training.
6. Track, register, and manage the stage of a model with MLflow.
©2023 Databricks Inc. — All rights reserved
Prerequisites/Technical Considerations
Things to keep in mind before you work through this course
Prerequisites
1. Intermediate-level knowledge of Python
2. Basic knowledge of data science and machine learning topics such as regression/classification models and model evaluation metrics
3. Basic knowledge of a machine learning library such as scikit-learn

Technical Considerations
1. A cluster running DBR ML 13.3+
2. A Unity Catalog-enabled workspace
3. A Model Serving-enabled workspace
©2023 Databricks Inc. — All rights reserved
Databricks Lakehouse Fundamentals:
Databricks
Fundamentals
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Learning objectives
Things you’ll be able to do after completing this lesson
• Identify Databricks as the Lakehouse Platform.
• Describe core services of the Databricks Lakehouse Platform for different personas.
• Identify different types of assets in the Databricks Workspace.
• Navigate through different sections of the Workspace.
• Perform common actions available in the Workspace.
• Describe Databricks Repos and its features.
©2023 Databricks Inc. — All rights reserved
Databricks Overview
What is Databricks?
The Lakehouse Company
• Inventor and pioneer of the data lakehouse
• 5,000+ global employees
• $1B+ in revenue
• $3B in investment
• Gartner-recognized Leader in Database Management Systems and in Data Science and Machine Learning Platforms
©2023 Databricks Inc. — All rights reserved
Databricks
Lakehouse Platform
The Lakehouse Platform stack:
• Data Warehousing, Data Engineering, Data Streaming, and Data Science and ML workloads
• Unity Catalog: fine-grained governance for data and AI
• Delta Lake: data reliability and performance
• Cloud Data Lake: all structured and unstructured data

Why the Lakehouse:
• Simple: unify your data warehousing and AI use cases on a single platform
• Open: built on open source and open standards
• Multicloud: one consistent data platform across clouds
©2023 Databricks Inc. — All rights reserved
Databricks ecosystem
©2023 Databricks Inc. — All rights reserved
The lakehouse is for ALL data practitioners
Machine learning practitioners, data engineers, data analysts, and data governance teams all work on the Databricks Lakehouse.
©2023 Databricks Inc. — All rights reserved
Data Engineering workloads on Databricks
• Simplifies data engineering with a curated data lake approach through Delta Lake
• Data orchestration through Databricks Workflows
• Delta Live Tables manage your full data pipelines
©2023 Databricks Inc. — All rights reserved
Data Analyst workloads on Databricks
• Great performance and concurrency for BI and SQL workloads on Delta Lake
• Native SQL interface for analysts
• Support for BI tools to directly query your most recent data in Delta Lake
©2023 Databricks Inc. — All rights reserved
ML & Data Science workloads on Databricks
Machine Learning
• Model registry, reproducibility, and productionization with MLflow
• Leverages Delta Lake for reproducibility
• AutoML for citizen data scientists
Data Science
• Collaborative notebooks and dashboards for interactive analysis
• Native support for Python, Java, R, and Scala
• Delta Lake data natively supported
©2023 Databricks Inc. — All rights reserved
Lakehouse Governance with Unity Catalog
Govern and manage all data assets
• Warehouse, Tables, Columns
• Data Lake, Files
• Machine Learning Models
• Dashboards and Notebooks
Capabilities
• Data lineage
• Attribute-based access control
• Security policies
• Auditing
• Data sharing
©2023 Databricks Inc. — All rights reserved
Demo:
Exploring the
Workspace
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Demo
High-level steps
Overview of the UI
• Landing page
• Navigation
Workspace
• Creating and managing assets
• Search assets
• Repos
• Clone a repo
• Pull/push changes
©2023 Databricks Inc. — All rights reserved
Databricks Lakehouse Fundamentals:
Working with
Notebooks
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Learning objectives
Things you’ll be able to do after completing this lesson
• Describe Databricks Notebooks as the most common interface for
data engineers when working with Databricks.
• Recognize common use cases for data engineers when working with
Notebooks.
• Describe Databricks clusters.
• Describe the basic cloud-based compute structure of Databricks.
©2023 Databricks Inc. — All rights reserved
Compute Resources
©2023 Databricks Inc. — All rights reserved
Clusters
Overview
Clusters are made up of one or more virtual machine (VM) instances and distribute workloads across workers:
• The driver coordinates the activities of the executors
• Workers run the tasks that compose a Spark job
(Diagram: workloads such as notebooks, jobs, and DBSQL queries run on a cluster consisting of a driver VM instance and worker VM instances.)
©2023 Databricks Inc. — All rights reserved
Clusters
Overview
Three main compute types:
• All-purpose clusters for interactive development
• Job clusters for automating workloads
• SQL warehouses: (serverless) instant compute to run DBSQL queries and dashboards
©2023 Databricks Inc. — All rights reserved
Cluster Mode
Single node: a low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis.
Standard (multi node): the default mode for workloads developed in any supported language (requires at least two VM instances).
©2023 Databricks Inc. — All rights reserved
Databricks Runtimes
Databricks Runtime: a set of core components that run on Databricks clusters.
• Standard: Apache Spark and many other components and updates that provide an optimized big data analytics experience
• Photon: an optional add-on to optimize Spark queries (e.g. SQL, DataFrame)
• Machine Learning: adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost
©2023 Databricks Inc. — All rights reserved
ML Runtime
Pre-built machine learning infrastructure
Databricks Machine Learning Runtime:
• Optimized and pre-configured ML frameworks
• Turnkey distributed ML
• Built-in AutoML
• GPU support out of the box
Built-in support spans ML frameworks and model explainability, distributed training, AutoML and hyperparameter tuning, and hardware accelerators.
©2023 Databricks Inc. — All rights reserved
Notebooks
©2023 Databricks Inc. — All rights reserved
Databricks Notebooks
Collaborative, reproducible, and enterprise ready
• Multi-language: use Python, SQL, Scala, and R, all in one notebook
• Reproducible: automatically track version history, and use git version control with Repos
• Visualizations: built-in visualizations and support for the most popular visualization libraries (e.g. matplotlib, ggplot)
• Collaborative: real-time co-presence, co-editing, and commenting
• Adaptable: install standard libraries and use local modules
• Enterprise ready: enterprise-grade access controls, identity management, and auditability
©2023 Databricks Inc. — All rights reserved
Ideal for exploratory data analysis
Native tools for visualizing and understanding data in ML workflow
Create interactive charts to visualize data in the notebook with only two clicks, and summarize a data set's essential properties and statistics in a data profile with the push of a button (see the sketch below).
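As a rough illustration, the cell below uses display() and dbutils.data.summarize(), which are available in Databricks notebooks where spark and dbutils are predefined; the table name is a placeholder, not part of the course materials.

```python
# A minimal sketch, assuming a table named "main.default.customers" exists.
df = spark.table("main.default.customers")

# Render an interactive result table; charts can be added from the result's "+" menu.
display(df)

# Generate a data profile (per-column statistics and distributions) programmatically.
dbutils.data.summarize(df)
```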
©2023 Databricks Inc. — All rights reserved
Right tool for quick development
Multi-language support, use standard libraries and custom modules
Mix and match languages based on use case and preferred workflow, choosing from Python, SQL, Scala, and R. Install Python libraries for a notebook without affecting other users with %pip, and import local modules using arbitrary file support when working in Repos (see the sketch below).
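A hedged sketch of both ideas follows; the package, version, and the helpers.cleaning module are hypothetical examples, not files shipped with this course.

```python
# Notebook cell 1: install a library scoped to this notebook only (other users are unaffected).
%pip install lifelines==0.27.8

# Notebook cell 2: import a local Python module stored next to the notebook in a Repo,
# e.g. a file helpers/cleaning.py that defines a clean() function (hypothetical).
from helpers.cleaning import clean
```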
©2023 Databricks Inc. — All rights reserved
Demo:
Working with
Notebooks
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Demo
High-level steps
Compute
• Configure and launch a cluster for ML
Notebooks
• UI Walkthrough
• Using multiple languages
• Working with Markdown
• Data visualization
• Table
• Graphs
• Data Profiler
©2023 Databricks Inc. — All rights reserved
Databricks Lakehouse Fundamentals:
Data Storage and
Management
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Learning objectives
Things you’ll be able to do after completing this lesson
• Describe how data is stored in cloud object storage locations and accessed via Databricks.
• Explain the benefits of data storage in the data lakehouse architecture
across roles and Databricks services.
• Identify Delta Lake as the optimized storage layer that provides the
foundation for data storage for the data lakehouse.
• Describe Unity Catalog as a centralized governance solution in
Databricks.
• Explain the three-tier namespace and its levels.
©2023 Databricks Inc. — All rights reserved
(Architecture diagram: the Databricks control plane)
©2023 Databricks Inc. — All rights reserved
Delta Lake
Open-source, default storage format on Databricks
• Delta Lake is an open-source project.
• It is the default format for the tables created
in Databricks.
• Delta Lake is the optimized storage layer that
provides the foundation for storing data and
tables in the Databricks Lakehouse Platform.
• Designed to improve data reliability, quality, and
performance in data lakes.
©2023 Databricks Inc. — All rights reserved
Delta Lake brings ACID to object storage
• Atomicity means all transactions either succeed or fail completely.
• Consistency guarantees relate to how a given state of the data is observed by simultaneous operations.
• Isolation refers to how simultaneous operations conflict with one another. The isolation guarantees that Delta Lake provides do differ from other systems.
• Durability means that committed changes are permanent.
©2023 Databricks Inc. — All rights reserved
Delta Lake features
Key features
• Unified batch and streaming
• Automatic schema validation
• Supports upserts using the MERGE operation
• Update your table schema without rewriting data
• Track row-level changes with Change Data Feed
• Query previous versions of a table by version number or timestamp
• Performance optimization with OPTIMIZE and ZORDER
• Supports multiple programming languages like Python, Scala, and SQL
A few of these features are sketched below.
©2023 Databricks Inc. — All rights reserved
Delta’s rich ecosystem of connectors
(Logo grid — connector categories: cloud platforms, API languages, SQL engines, and ETL and streaming engines; named examples include Google Dataproc, Azure Synapse, Scala, Ruby, Python, Rust, AWS Redshift Spectrum, AWS Athena, and Power BI.)
©2023 Databricks Inc. — All rights reserved
Data ingestion and transformation for ML
An example data ingestion workflow
©2023 Databricks Inc. — All rights reserved
Data and AI
Governance with
Unity Catalog
©2023 Databricks Inc. — All rights reserved
Today, data and AI governance is complex
Data consumers — data analysts, data engineers, ML engineers, and applications — ask:
• "Where do I discover the datasets, models, notebooks, and dashboards?"
• "Can I trust the data and ML models?"
The data governance team manages:
• Permissions on files (data lake)
• Permissions on tables, rows, and columns (data warehouse)
• Permissions on ML models and features
• Permissions on reports and dashboards (BI dashboards)
and asks:
• "How do we secure these assets?"
• "Who is accessing these assets and how?"
• "Are we meeting regulatory compliance?"
©2023 Databricks Inc. — All rights reserved
Unity Catalog (UC)
Unified governance for data and AI
Unified visibility into data and AI
Single permission model for data and AI
AI-powered monitoring and observability
Open data sharing
Databricks Unity Catalog capabilities: Discovery, Access Controls, Lineage, Monitoring, Auditing, and Data Sharing — across Tables, Files, Models, Notebooks, and Dashboards.
©2023 Databricks Inc. — All rights reserved
Key Capabilities of UC
Governance model:
• Unified governance across clouds
• Centralized metadata and user management
• Centralized access controls
• Grant or revoke permissions on data and AI assets using the UI or the API:
  GRANT … ON … TO …
  REVOKE … ON … FROM …
Securable objects include catalogs, databases (schemas), tables, views, storage credentials, and external locations. A short example follows.
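The catalog, table, and group names below are placeholders; the cell only illustrates the GRANT/REVOKE pattern above.

```python
# A minimal sketch of Unity Catalog privileges, run from a notebook with spark.sql().
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-scientists`")
spark.sql("GRANT SELECT ON TABLE main.default.users TO `data-scientists`")
spark.sql("REVOKE SELECT ON TABLE main.default.users FROM `data-scientists`")
```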
©2023 Databricks Inc. — All rights reserved
The three level namespace of UC
How to use UC
A Unity Catalog metastore is assigned to one or more Databricks workspaces in a Databricks account. Within the metastore, data objects are organized in a three-level hierarchy:
• Catalog
• Schema (database)
• Tables (managed and external), views, and models
Reference an object with all three levels, for example:
SELECT * FROM catalog1.database1.table1;
©2023 Databricks Inc. — All rights reserved
Demo:
Data Storage and
Management
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Demo
High-level steps
Data Storage and Management
• Ingest data and create a Delta table
• View and manage tables
• Performance optimization for Delta
• Manage permissions with Unity Catalog
©2023 Databricks Inc. — All rights reserved
Databricks for Machine Learning:
Introduction to
Databricks for
Machine Learning
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Learning objectives
Things you’ll be able to do after completing this lesson
• Describe MLflow as an open source platform for managing the
end-to-end machine learning lifecycle that’s built into Databricks.
• Describe MLflow Experiments as a tool for tracking model
development runs and comparing the resulting model parameters
and metrics.
• Describe the Model Registry as a centralized model store for
managing models’ full lifecycle, including versioning and annotating.
• [Extra] Describe AutoML and its features.
©2023 Databricks Inc. — All rights reserved
Databricks supports both code-first and low-code users
• Low-code ML with AutoML: UI-based ML development with a glass-box approach
• Multi-language notebooks: co-edit notebooks in Python, R, Scala, and SQL
©2023 Databricks Inc. — All rights reserved
Databricks Machine Learning
A data-native and collaborative solution for the full ML lifecycle
• Collaborative multi-language notebooks and AutoML
• Data prep, model training, model tuning, and managed runtimes and environments
• Batch scoring and online serving
• Data versioning, Feature Store, jobs and API automation, and monitoring
• MLOps / governance powered by MLflow
• An open, multi-cloud data lakehouse foundation with Delta Lake
©2023 Databricks Inc. — All rights reserved
AutoML
Rapid, simplified machine learning for everyone
• Quick-start ML initiatives: accelerate your time to production and save weeks on ML projects
• Ensure best practices: customize baseline models with your domain expertise
• Wide range of problems: solve classification, regression, and forecasting problems with a variety of ML libraries
Workflow: select and input a dataset (UI or API), then AutoML performs automated data prep, automated feature engineering, automated training and model selection, and automated hyperparameter tuning; explore the generated artifacts and auto-generated notebooks, then deploy and monitor.
©2023 Databricks Inc. — All rights reserved
Databricks AutoML
A glass-box solution empowering data teams without taking away control
• UI and API to start AutoML training
• An auto-created MLflow experiment to track models and metrics
• An auto-generated data exploration notebook to understand and debug data quality and preprocessing
• Auto-generated notebooks with source code for each model, so you can iterate further on models from AutoML and add your expertise
• Easy deployment to the Model Registry
A minimal API sketch is shown below.
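The sketch assumes a Spark DataFrame df with a label column named churn; both are illustrative rather than part of the course dataset.

```python
# A hedged sketch of starting an AutoML classification experiment from the Python API.
from databricks import automl

summary = automl.classify(
    dataset=df,            # Spark DataFrame with features and the label column
    target_col="churn",    # placeholder label column
    timeout_minutes=30,
)

# Every trial is logged to an MLflow experiment; inspect the best run's model.
print(summary.best_trial.model_path)
```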
©2023 Databricks Inc. — All rights reserved
AutoML solves two key pain points
• Quickly verify the predictive power of a dataset: a marketing team hands a dataset to the data science team and asks, "Can this dataset be used to predict customer churn?"
• Get a baseline model to guide project direction: the data science team asks, "What direction should I go in for this ML project, and what benchmark should I aim to beat?"
©2023 Databricks Inc. — All rights reserved
MLflow
©2023 Databricks Inc. — All rights reserved
Core Machine Learning Issues
Modern ML lifecycle comes with many challenges
• Keeping track of experiments or model development
• Reproducing code
• Comparing models
• Standardization of packaging and deploying models
MLflow addresses these issues.
©2023 Databricks Inc. — All rights reserved
MLflow
What is MLflow?
• An open-source platform for the machine learning lifecycle
• Operationalizes machine learning
• Developed by Databricks
• Pre-installed on the Databricks Runtime for ML
©2023 Databricks Inc. — All rights reserved
MLflow Components
The four components of MLflow
• Tracking: record and query experiments — code, data, config, and results
• Projects: a packaging format for reproducible runs on any platform
• Models: a general model format that supports diverse deployment tools
• Model Registry: centralized and collaborative model lifecycle management
APIs: CLI, Python, R, Java, REST
©2023 Databricks Inc. — All rights reserved
Model Tracking and Auto-logging using MLflow
Ensure reproducibility
• Track ML development with one line of code — mlflow.autolog() — capturing data lineage, the model, and the environment
• Automatically logs the model, environment, parameters, metrics, and artifacts
• Inspect, visualize, and compare metrics across runs
• Auto-generated data exploration notebook
A minimal sketch is shown below.
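The sketch uses autologging with scikit-learn on the ML Runtime; the synthetic dataset is only for illustration.

```python
# A minimal autologging sketch using scikit-learn and a synthetic dataset.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

mlflow.autolog()  # one line: parameters, metrics, artifacts, and the model are logged

X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="autolog-demo"):
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)   # logged as a metric by autologging
```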
©2023 Databricks Inc. — All rights reserved
MLflow Model Registry
Features and Architecture
• A collaborative, centralized model hub
• Versioning of ML artifacts
• Facilitates experimentation, testing, and production
• Integrates with approval and governance workflows
• Audit log of stage transitions and requests, with an approval workflow for stage transitions
• Helps with automation through CI/CD integration
(Architecture: the tracking server records parameters, metrics, artifacts, and models; registered model versions (v1, v2, v3) move through Staging, Production, and Archived stages and are used by data scientists and deployment engineers.)
A registration sketch is shown below.
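The run ID and model name in the sketch are placeholders to be replaced with values from your own experiment.

```python
# A hedged sketch of registering a logged model and moving it to Staging.
import mlflow
from mlflow.tracking import MlflowClient

model_version = mlflow.register_model(
    model_uri="runs:/<run_id>/model",   # URI of a model logged in an MLflow run
    name="churn_classifier",            # placeholder registered-model name
)

client = MlflowClient()
client.transition_model_version_stage(
    name="churn_classifier",
    version=model_version.version,
    stage="Staging",                    # stages include None, Staging, Production, Archived
)
```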
©2023 Databricks Inc. — All rights reserved
Demo:
Experimentation
with AutoML
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Demo
High-level steps
Create an Experiment
• Create and run an AutoML experiment
• View the best model
Model Registry
• Register the best model to Model Registry
• Manage model stages
©2023 Databricks Inc. — All rights reserved
Databricks for Machine Learning:
End-to-End ML
on the Lakehouse
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Learning objectives
Things you’ll be able to do after completing this lesson
• Compare and contrast model governance solutions with and without
Unity Catalog.
• Describe the Databricks Feature Store as a centralized repository that
enables data scientists to find and share features.
• Describe Workflows as a capability to productionize data workflows.
• Describe Jobs as a simple solution to schedule and automate one or
more tasks.
• Describe Databricks' built-in model serving capabilities for real-time, streaming, and batch inference.
©2023 Databricks Inc. — All rights reserved
Databricks Machine Learning
A data-native and collaborative solution for the full ML lifecycle
• Collaborative multi-language notebooks and AutoML
• Data prep, model training, model tuning, and managed runtimes and environments
• Batch scoring and online serving
• Data versioning, Feature Store, jobs and API automation, and monitoring
• MLOps / governance powered by MLflow
• An open, multi-cloud data lakehouse foundation with Delta Lake
MLOps: an end-to-end workflow
• Data prep & featurization: ETL + EDA and feature engineering with Koalas, backed by the Feature Store
• Build a baseline model with AutoML: MLflow autologging, Hyperopt + Spark, recorded in MLflow Tracking
• Promote the best run to the Model Registry: annotate the model and request a transition to Staging
• Automated model testing: an MLflow Model Registry webhook triggers testing jobs (schema, demographic accuracy, docs & artifacts, …) with Slack notifications; the request is approved (move to Staging) or rejected
• Run inferences: load the model for batch or real-time (HTTP) scoring, with a monthly retrain scheduled as a Databricks Job
Personas involved: Data Scientist, ML Engineer, and Data Engineer.
©2023 Databricks Inc. — All rights reserved
Feature Store
©2023 Databricks Inc. — All rights reserved
Feature Store
How do feature stores help?
• A feature store provides a centralized repository for managing and serving machine learning (ML) features.
• Feature stores provide auditing and logging capabilities to track who accessed or modified features.
• A feature store helps handle the scaling requirements of feature storage, retrieval, and serving, ensuring that ML pipelines can operate efficiently.
• A feature store allows reusing features across projects, reducing duplication.
A creation sketch is shown below.
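The cell creates a feature table with the Databricks Feature Store client; the source table, feature columns, and table name are placeholders for illustration.

```python
# A minimal sketch of creating a feature table, assuming a source table of customer data.
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

customer_features_df = spark.table("main.default.customers").select(
    "customer_id", "num_purchases", "avg_basket_value"
)

fs.create_table(
    name="main.default.customer_features",   # stored as a Delta table
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Per-customer aggregate features (illustrative)",
)
```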
©2023 Databricks Inc. — All rights reserved
Why would you need a feature store?
Basic Motivations
Discovery
Multiple Data Scientists are trying to solve similar modeling tasks and come up with different definitions
of the same features. How can I find the features?
Lineage
Model governance requires documentation of the features used to train a model, as well as the
upstream lineage of a feature to reliably use it. How is it computed, and who owns it?
Skew
When multiple teams manage feature computation and ML models in production, minor yet significant
skew in upstream data at the input of a feature pipeline can be very hard to detect and fix.
Online Serving
During exploration and model experimentation phases features are implemented in frameworks that do
not scale to production.
©2023 Databricks Inc. — All rights reserved
Databricks Feature Store
• Feature definitions: define reusable, shareable featurization logic and save it into feature tables
• Feature tables: represent features (e.g., customer features, item features) as tables that can be queried from any language, with SQL access, ACLs, versioning, and performance optimizations
• Training data set creation: snapshot features from feature tables to build training sets
• Batch scoring: load features from feature tables for batch scoring
• Online serving: publish feature tables to online stores for low-latency lookups behind a Databricks Model Serving REST endpoint
A training-set sketch is shown below.
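The sketch assumes the illustrative customer_features table from the earlier sketch and a hypothetical label table.

```python
# A hedged sketch of assembling training data from a feature table with FeatureLookup.
from databricks.feature_store import FeatureLookup, FeatureStoreClient

fs = FeatureStoreClient()

label_df = spark.table("main.default.churn_labels")   # placeholder: customer_id + churn label

training_set = fs.create_training_set(
    df=label_df,
    feature_lookups=[
        FeatureLookup(
            table_name="main.default.customer_features",
            lookup_key="customer_id",
        )
    ],
    label="churn",
)

training_df = training_set.load_df()   # Spark DataFrame ready for model training
```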
©2023 Databricks Inc. — All rights reserved
Model Deployment
©2023 Databricks Inc. — All rights reserved
Model Serving Modes
Serving models for batch, streaming, real-time, and edge inference
• Batch: high latency; leverages databases or object storage; fast retrieval of stored predictions
• Streaming: stream processing; moderately fast scoring on new data
• Real time: low-latency scoring; high availability; usually served over REST (containers, Kubernetes)
• Embedded (edge): special-case deployments; limited connectivity with cloud services
(Diagram: models are trained from Delta Lake / Feature Store data, registered in the Model Registry, and then deployed in one of these modes.)
A batch-scoring sketch is shown below.
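For the batch mode, a common pattern is to score a Delta table with a model loaded as a Spark UDF; the model and table names below are placeholders.

```python
# A minimal batch-scoring sketch using an MLflow model as a Spark UDF.
import mlflow.pyfunc

predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/churn_classifier/Staging",   # placeholder registered model and stage
)

scored_df = (
    spark.table("main.default.customer_features")
         .withColumn("prediction", predict_udf("num_purchases", "avg_basket_value"))
)

# Persist predictions for fast retrieval by downstream consumers.
scored_df.write.mode("overwrite").saveAsTable("main.default.churn_predictions")
```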
©2023 Databricks Inc. — All rights reserved
Challenges with building Real-time ML Systems
Most ML models don’t get into production
• ML infrastructure is hard: real-time ML systems require fast and scalable serving infrastructure, which is costly to build and maintain.
• Deploying real-time models needs disparate tools: data teams use diverse tools to develop models, and customers use separate platforms for data, ML, and serving, adding complexity and cost.
• Operating production ML requires expert resources: deployment tools have a steep learning curve, and model deployment is bottlenecked by limited engineering resources, limiting the ability to scale.
©2023 Databricks Inc. — All rights reserved
Databricks Model Serving
World-class model scoring and deployment options:
• Multiple model scoring and deployment choices
• A leading multi-cloud inference provider, giving customers the choice of what, where, and when they score their models
• Ultra-low-latency real-time model serving
©2023 Databricks Inc. — All rights reserved
Model deployment with Model Serving
Flexible deployment at any scale
• Batch scoring: one-click deployment of models from the Model Registry to scalable compute clusters for batch scoring
• Online scoring: one-click deployment of models to REST endpoints for auto-scaling, low-latency scoring (a query sketch is shown below)
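In the sketch, the workspace URL, token, endpoint name, and input record are all placeholders.

```python
# A hedged sketch of calling a Model Serving REST endpoint.
import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"   # placeholder
endpoint_name = "churn-endpoint"                                   # placeholder
token = "<personal-access-token>"                                  # placeholder

response = requests.post(
    f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"num_purchases": 12, "avg_basket_value": 34.5}]},
)
print(response.json())
```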
©2023 Databricks Inc. — All rights reserved
Core Features of Model Serving
Support real-time production ML workloads
Real time
• Low overhead latency: <100 ms
• Throughput: 3K+ QPS
• Availability: 99.5%
• Scalable: automatically scales up/down to handle bursty traffic
• Secure: PrivateLink and IP allowlist

Lakehouse unified
• Feature Store integrated: automated online lookups
• MLflow integrated: fast and easy model deployment
• Quality & diagnostics: payload logging to Delta
• Unified governance: manage data and AI with UC

Simplified deployment
• Simple: endpoints UI and API for simple deployment
• Flexible: traffic splitting for staged roll-outs and A/B testing
• Manageable: endpoint observability with built-in metrics and export options
©2023 Databricks Inc. — All rights reserved
Orchestration with
Workflows
©2023 Databricks Inc. — All rights reserved
Databricks Workflows
Workflows is a fully managed, cloud-based, general-purpose task orchestration service for the entire Lakehouse. It lets data engineers, data scientists, and analysts build reliable data, analytics, and AI workflows on any cloud.
(Diagram: the Lakehouse Platform stack — Data Warehousing, Data Engineering, Data Streaming, Data Science and ML, with Unity Catalog, Delta Lake, and the cloud data lake.)
©2023 Databricks Inc. — All rights reserved
Workflows Features
• Orchestrate anything anywhere: run diverse workloads for the full data and AI lifecycle, on any cloud — notebooks, Delta Live Tables, jobs for SQL, ML models, and more
• Fully managed: remove operational overhead with a fully managed orchestration service, so you focus on your workflows, not on managing your infrastructure
• Simple workflow authoring: an easy point-and-click authoring experience for all your data teams, not just those with specialized skills
©2023 Databricks Inc. — All rights reserved
Workflow features
Key features
Databricks Workflows offers:
• Monitoring and debugging
• Repair of only the failed tasks and sub-tasks, reducing the time and resources required to recover from unsuccessful job runs
• Access control: manage access across different teams
• Scheduling: run jobs immediately or periodically
• Alerts
A job-creation sketch is shown below.
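The sketch creates a scheduled job via the Jobs 2.1 REST API; the notebook path, cluster ID, schedule, and credentials are placeholders.

```python
# A hedged sketch of creating a scheduled job with the Jobs 2.1 API.
import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"   # placeholder
token = "<personal-access-token>"                                  # placeholder

job_spec = {
    "name": "nightly-batch-inference",
    "tasks": [
        {
            "task_key": "score",
            "notebook_task": {"notebook_path": "/Repos/ml/batch_inference"},  # placeholder
            "existing_cluster_id": "<cluster-id>",                            # placeholder
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # every day at 02:00
        "timezone_id": "UTC",
    },
}

response = requests.post(
    f"{workspace_url}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(response.json())   # returns the new job_id on success
```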
©2023 Databricks Inc. — All rights reserved
Example Workflow
• Data ingestion funnel (e.g. Auto Loader, DLT)
• Data filtering, quality assurance, and transformation (e.g. DLT, SQL, Python)
• ML feature extraction (e.g. MLflow)
• Persisting features and training a prediction model
©2023 Databricks Inc. — All rights reserved
Demo:
End-to-End ML
on the Lakehouse
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Demo
High-level steps
End-to-end ML
• Create a feature store table
• Train and track a model with MLflow
• Register a model to Model Registry
• Transition model to next stage
• Use model for batch inference
• Automate inference with Workflows
©2023 Databricks Inc. — All rights reserved
Course Summary
and Next Steps
Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved
Extra Resources
©2023 Databricks Inc. — All rights reserved
Feature Store
The first feature store co-designed with a data and MLOps platform
The Feature Store serves features in batch (high throughput) and online (low latency) modes.

Feature Registry
• Discoverability and reusability
• Versioning
• Upstream and downstream lineage
Co-designed with Delta Lake:
• Open format
• Built-in data versioning and governance
• Native access through PySpark, SQL, etc.

Feature Provider
• Batch and online access to features
• Feature lookup packaged with models
• Simplified deployment process
Co-designed with MLflow:
• Open model format that supports all ML frameworks
• Feature version and lookup logic hermetically logged with the model
©2023 Databricks Inc. — All rights reserved