Getting started with Databricks
DATABRICKS CONCEPTS
Kevin Barlow
Data Practitioner
Compute cluster refresh
Create your first cluster
The first step is to create a cluster for your data processing!
Configuration options:
- Cluster policies and access
- Databricks Runtime
- Photon Acceleration
- Node instance types and number
- Auto-scaling / Auto-termination
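These options map directly onto the cluster spec accepted by the Clusters API; here is a minimal sketch (the workspace URL, token, and all field values are hypothetical examples):

import requests

# Each field corresponds to one configuration option above
cluster_spec = {
    "cluster_name": "my-first-cluster",
    "policy_id": "<policy-id>",                         # cluster policies and access
    "spark_version": "13.3.x-scala2.12",                # Databricks Runtime
    "runtime_engine": "PHOTON",                         # Photon Acceleration
    "node_type_id": "i3.xlarge",                        # node instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},  # auto-scaling
    "autotermination_minutes": 60,                      # auto-termination
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
print(resp.json())  # returns the new cluster_id on success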
Data Explorer
Get familiar with the Data Explorer! In this UI, you can:
1. Browse available catalogs/schemas/tables
2. Look at sample data and summary statistics
3. View data lineage and history
You can also upload new data by clicking the "plus" icon!
Create a notebook
Databricks notebooks:
- Standard interface for Databricks
- Improvements on open-source Jupyter
- Support for many languages
  - Python, R, Scala, SQL
  - Magic commands (%sql)
- Built-in visualizations
- Real-time commenting and collaboration
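For example, a cell in a Python notebook can be switched to SQL with a magic command; a minimal sketch of two cells (the table name is a hypothetical example):

# Cell 1 -- runs in the notebook's default language (Python)
df = spark.read.table("main.default.sales")
display(df)

%sql
-- Cell 2 -- the %sql magic runs this cell as SQL
SELECT COUNT(*) FROM main.default.sales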
Let's practice!
Data Engineering foundations in Databricks
DATABRICKS CONCEPTS
Kevin Barlow
Data Practitioner
Medallion architecture
The medallion architecture refines data in stages: raw data lands in bronze tables, is cleaned and validated into silver tables, and is aggregated into business-ready gold tables.
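A minimal sketch of the three layers in PySpark (paths and table names are hypothetical):

from pyspark.sql.functions import col, sum as sum_

# Bronze: raw ingest, stored as-is
(spark.read.format("json").load("/landing/orders")
    .write.mode("append").saveAsTable("orders_bronze"))

# Silver: cleaned and deduplicated
(spark.read.table("orders_bronze")
    .filter(col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
    .write.mode("overwrite").saveAsTable("orders_silver"))

# Gold: business-level aggregates
(spark.read.table("orders_silver")
    .groupBy("region")
    .agg(sum_("amount").alias("total_sales"))
    .write.mode("overwrite").saveAsTable("orders_gold"))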
Reading data
Spark is a highly flexible framework and can read from various data sources/types.
Common data sources and types:
- Delta tables
- File formats (CSV, JSON, Parquet, XML)
- Databases (MySQL, Postgres, EDW)
- Streaming data
- Images / Videos

# Delta table
spark.read.table('<table_name>')

# CSV files
spark.read.format('csv').load('*.csv')

# Postgres table
(spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .load())
Structure of a Delta table
A Delta table provides table-like qualities to an open file format.
- Feels like a table when reading
- Access to underlying files (Parquet and JSON)
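Peeking at the storage layer makes this concrete; a sketch using a hypothetical path:

# Data files are Parquet; the transaction log is JSON under _delta_log
display(dbutils.fs.ls("/delta/sales"))             # part-*.snappy.parquet files
display(dbutils.fs.ls("/delta/sales/_delta_log"))  # 00000000000000000000.json, ...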
DataFrames
DataFrames are two-dimensional representations of data.
- Look and feel similar to tables
- Similar concept across many different data tools: Spark (default), pandas, dplyr, SQL queries
- Underlying construct for most data processes

id  customerName  bookTitle
1   John Data     Guide to Spark
2   Sally Bricks  SQL for Data Engineering
3   Adam Delta    Keeping Data Clean

df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/data.csv"))
Writing data
Kinds of tables in Databricks:
1. Managed tables
   - Default type
   - Stored with Unity Catalog
   - Databricks managed
2. External tables
   - Stored in another location
   - Set LOCATION
   - Customer managed

# Managed table
df.write.saveAsTable(table_name)

CREATE TABLE table_name
USING delta
AS ...

# External table (supply a storage path)
df.write.option("path", "<path>").saveAsTable(table_name)

CREATE TABLE table_name
USING delta
LOCATION "<path>"
AS ...
Let's practice!
Data transformations in Databricks
DATABRICKS CONCEPTS
Kevin Barlow
Data Practitioner
SQL for data engineering
SQL:
- Familiar for Database Administrators (DBAs)
- Great for standard manipulations
- Execute pre-defined UDFs

-- Creating a new table in SQL
CREATE TABLE table_name
USING delta
AS (
  SELECT *
  FROM source_table
  WHERE date >= '2023-01-01'
)
Other languages for data engineering
Python, R, Scala:
- Familiar for software engineers
- Standard and complex transformations
- Use and define custom functions

# Creating a new table in PySpark
from pyspark.sql.functions import col

(spark
    .read
    .table('source_table')
    .filter(col('date') >= '2023-01-01')
    .write
    .saveAsTable('table_name'))
Common transformations
Schema manipulation
- Add and remove columns
- Redefine columns

# PySpark
(df
  .withColumn('newCol', ...)  # first argument is the new column name, as a string
  .drop('oldCol'))

Filtering
- Reduce DataFrame to subset of data
- Pass multiple criteria

# PySpark
(df
  .filter(col('date') >= target_date)
  .filter(col('id').isNotNull()))
Common transformations (continued)
Nested data
- Arrays or Struct data
- Expand or contract

# PySpark -- explode() and flatten() are functions applied inside select()
from pyspark.sql.functions import col, explode, flatten, sum

df.select(explode(col('arrayCol')))  # one row per array element (wide to long)
df.select(flatten(col('items')))     # merge an array of arrays into a single array

Aggregation
- Group data based on columns
- Calculate data summarizations

(df
  .groupBy(col('region'))
  .agg(sum(col('sales'))))
Auto Loader
Auto Loader processes new data files as they land in a data lake.
- Incremental processing
- Efficient processing
- Automatic

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(file_path))
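A fuller sketch of the same pattern, adding schema tracking and an incremental write (paths and the target table name are hypothetical):

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/orders/schema")
    .load("/landing/orders")
    .writeStream
    .option("checkpointLocation", "/checkpoints/orders")
    .trigger(availableNow=True)   # process all new files, then stop
    .toTable("orders_bronze"))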
1 https://www.databricks.com/blog/2020/02/24/introducing-databricks-ingest-easy-data-ingestion-into-delta-lake.html
Structured Streaming

# Enrich a Kafka stream with a static table, then write back to Kafka
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<host:port>")
    .option("subscribe", "<topic>")
    .load()
    .join(table_df,
          on="<id>", how="left")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<host:port>")
    .option("topic", "<topic>")
    .option("checkpointLocation", "<checkpoint_path>")
    .start())
Let's practice!
Orchestration in Databricks
DATABRICKS CONCEPTS
Kevin Barlow
Data Analytics Practitioner
What is data orchestration?
Data orchestration is a form of automation!
Enables data engineers to automate the end-to-end data life cycle
Databricks Workflows
Databricks Workflows is a collection of built-in capabilities to orchestrate all your data processes, at no additional cost!
(Figure: Example Databricks Workflow)
1 https://docs.databricks.com/workflows
What can we orchestrate?
- Data engineers / data scientists
- Data analysts
Databricks Jobs
Workflows UI
Users can create jobs directly from the Databricks UI:
- Directly from a notebook
- In the Workflows section
1 https://docs.databricks.com/workflows/jobs
Databricks Jobs
Programmatic
Users can also programmatically create jobs using the Jobs CLI or Jobs API with the Databricks platform.

{
  "name": "A multitask job",
  "tags": {},
  "tasks": [],
  "job_clusters": [],
  "format": "MULTI_TASK"
}
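A minimal sketch of submitting such a payload to the Jobs API (workspace URL, token, notebook path, and cluster values are hypothetical examples):

import requests

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "name": "A multitask job",
        "tasks": [{
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/pipeline/ingest"},
            "job_cluster_key": "shared",
        }],
        "job_clusters": [{
            "job_cluster_key": "shared",
            "new_cluster": {"spark_version": "13.3.x-scala2.12",
                            "node_type_id": "i3.xlarge",
                            "num_workers": 2},
        }],
        "format": "MULTI_TASK",
    },
)
print(resp.json())  # contains the new job_id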
Delta Live Tables
Delta Live Tables (DLT) is a declarative framework: you define tables and data quality expectations in code, and Databricks manages the pipeline's orchestration, retries, and monitoring.
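A minimal sketch of a DLT notebook (table names, the landing path, and the expectation are hypothetical):

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested with Auto Loader")
def orders_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/landing/orders"))

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_order", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read_stream("orders_bronze").filter(col("amount") > 0)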
Let's practice!
End-to-end data pipeline example in Databricks
DATABRICKS CONCEPTS
Kevin Barlow
Data Practitioner
Let's practice!