Exercise files and notebooks for the LinkedIn Learning course "Introduction to Spark SQL and DataFrames" by Dan Sullivan.
This repository contains comprehensive hands-on exercises covering Apache Spark DataFrames and Spark SQL fundamentals. The course provides practical experience with distributed data processing, combining both DataFrame API and SQL approaches for maximum flexibility.
Course Details:
- Duration: 1h 54m
- Level: Intermediate
- Instructor: Dan Sullivan (Data Architect, Author, and Instructor)
- Last Updated: April 1, 2024
Prerequisites:
- Python programming experience
- Basic understanding of data structures and SQL concepts
- Familiarity with Jupyter notebooks
Follow the platform-specific setup instructions:
- Mac/Linux: see "Spark Mac Linux Export Environment Variables"
- Windows: see "Spark Windows Instructions"
Install PySpark and Jupyter:

```bash
pip install pyspark jupyter
```
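To confirm the installation works, here is a minimal check (the app name is arbitrary, and `local[*]` simply runs Spark locally on all cores):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all available cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("setup-check")   # arbitrary name, shown in the Spark UI
         .getOrCreate())

print(spark.version)  # should print the installed Spark version
spark.stop()
```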
```
├── CH 03/                # DataFrame API Fundamentals
│   ├── begin/            # Starting notebooks with exercises
│   │   ├── 03.01 Loading csv files into dataframes.ipynb
│   │   ├── 03.02 Reading JSON Files.ipynb
│   │   ├── 03.03 Basic Dataframe Operations.ipynb
│   │   ├── 03.04 Filtering using Dataframe API.ipynb
│   │   ├── 03.05 Aggregating using Dataframe API.ipynb
│   │   ├── 03.06 Sampling using Dataframe API.ipynb
│   │   └── 03.07 Saving Data from Dataframes.ipynb
│   └── end/              # Completed solutions
│       └── [corresponding completed notebooks]
├── CH 04/                # Spark SQL
│   ├── begin/
│   │   ├── 04.01 Querying Dataframes with SQL.ipynb
│   │   ├── 04.02 Filtering Dataframes with SQL.ipynb
│   │   ├── 04.03 Aggregating Dataframes with SQL.ipynb
│   │   ├── 04.04 Joining Dataframes with SQL.ipynb
│   │   ├── 04.05 De-duplicating.ipynb
│   │   └── 04.06 Working with NAs.ipynb
│   └── end/
│       └── [corresponding completed notebooks]
├── CH 05/                # Advanced Data Science Applications
│   ├── begin/
│   │   ├── 05.01 Exploratory Analysis.ipynb
│   │   ├── 05.02 Timeseries Analysis .ipynb
│   │   ├── 05.03 Machine Learning - Clustering.ipynb
│   │   └── 05.04 Machine Learning - Linear Regression.ipynb
│   └── end/
│       └── [corresponding completed notebooks]
├── Data/                 # Sample datasets
│   ├── location_temp.csv # Temperature sensor readings
│   ├── server_name.csv   # Server ID to name mappings
│   ├── utilization.csv   # Server utilization metrics
│   └── utilization.json  # Server utilization (JSON format)
└── [Platform-specific setup files]
```
Learn core DataFrame operations using Spark's Python API (a short sketch follows this list):
- Data Loading: Read CSV and JSON files into DataFrames
- Basic Operations: Schema inspection, column selection, data types
- Filtering: Row-level filtering with conditions
- Aggregation: Group-by operations, statistical functions
- Sampling: Data sampling techniques for large datasets
- Data Export: Saving processed DataFrames in various formats
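A compact, hedged sketch of these operations; paths are relative to the repository root, and the header/inferSchema options plus the column names (taken from the dataset samples below) are assumptions about the files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-basics").getOrCreate()

# Load CSV into a DataFrame (header/inferSchema are assumptions about the file)
df = spark.read.csv("Data/location_temp.csv", header=True, inferSchema=True)
df_json = spark.read.json("Data/utilization.json")        # JSON variant

df.printSchema()                                          # inspect inferred schema
warm = df.filter(df["temp_celsius"] > 25)                 # row-level filtering
by_loc = df.groupBy("location_id").avg("temp_celsius")    # aggregation
sampled = df.sample(fraction=0.1, seed=42)                # ~10% random sample

# Persist results; CSV, JSON, and Parquet writers are all available
by_loc.write.mode("overwrite").parquet("output/avg_temp_by_location")
```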
Master SQL-based data manipulation (see the sketch after this list):
- SQL Queries: Transform DataFrames using familiar SQL syntax
- Advanced Filtering: Complex WHERE clauses and conditions
- Aggregation Functions: GROUP BY, statistical aggregations
- Joins: INNER, LEFT, RIGHT joins across multiple DataFrames
- Data Quality: Deduplication strategies and null value handling
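As an illustration, a minimal sketch of this workflow, assuming the utilization.csv and server_name.csv columns shown in the Datasets section below:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql").getOrCreate()
util = spark.read.csv("Data/utilization.csv", header=True, inferSchema=True)
names = spark.read.csv("Data/server_name.csv", header=True, inferSchema=True)

# Register DataFrames as temporary views so SQL can reference them
util.createOrReplaceTempView("utilization")
names.createOrReplaceTempView("server_name")

# Filtering, aggregation, and a join in one declarative query
result = spark.sql("""
    SELECT n.server_name,
           ROUND(AVG(u.cpu_utilization), 2) AS avg_cpu
    FROM utilization u
    JOIN server_name n ON u.server_id = n.server_id
    WHERE u.session_count > 50
    GROUP BY n.server_name
""")

# Data quality: drop duplicate rows and fill missing numeric values with 0
clean = util.dropDuplicates().na.fill(0)
```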
Apply Spark to real-world data science scenarios (a window-function sketch follows this list):
- Exploratory Data Analysis: Statistical profiling and data discovery
- Time Series Analysis: Window functions, moving averages, temporal patterns
- Machine Learning: Clustering algorithms and linear regression implementation
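For example, a moving average over a partitioned window, in the spirit of the time-series lesson; the column names and the timestamp format are assumptions based on the utilization.csv sample below:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, to_timestamp
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("timeseries").getOrCreate()
util = spark.read.csv("Data/utilization.csv", header=True, inferSchema=True)

# Parse the string timestamp so ordering is chronological, not lexical
util = util.withColumn("ts", to_timestamp(col("event_datetime"), "MM/dd/yyyy HH:mm:ss"))

# Per-server window: current row plus the two before it -> 3-point moving average
w = Window.partitionBy("server_id").orderBy("ts").rowsBetween(-2, 0)
rolling = util.withColumn("cpu_ma3", avg("cpu_utilization").over(w))
rolling.select("server_id", "ts", "cpu_utilization", "cpu_ma3").show(5)
```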
The course uses realistic datasets that simulate common data engineering scenarios:
Time-series temperature readings from multiple sensor locations (location_temp.csv):

| event_date | location_id | temp_celsius |
| --- | --- | --- |
| 03/04/2019 19:48:06 | loc0 | 29 |
| 03/04/2019 19:53:06 | loc0 | 27 |
System performance metrics including CPU, memory, and session data (utilization.csv, also provided as utilization.json):

| cpu_utilization | event_datetime | free_memory | server_id | session_count |
| --- | --- | --- | --- | --- |
| 0.77 | 03/16/2019 17:21:40 | 0.22 | 115 | 58 |
Mapping table for server identification (server_name.csv); a join sketch follows the table:

| server_id | server_name |
| --- | --- |
| 115 | 115 Server |
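The mapping table is small enough to ship to every executor, so a broadcast join is a natural fit for enriching utilization rows with server names. A sketch via the DataFrame API (the broadcast hint is an assumption about relative table sizes):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("enrich").getOrCreate()
util = spark.read.csv("Data/utilization.csv", header=True, inferSchema=True)
names = spark.read.csv("Data/server_name.csv", header=True, inferSchema=True)

# Broadcast the small lookup table to every executor to avoid a shuffle
enriched = util.join(broadcast(names), on="server_id", how="inner")
enriched.select("server_name", "cpu_utilization", "session_count").show(5)
```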
The course demonstrates both approaches for maximum flexibility (the sketch after this list shows the same query both ways):
- DataFrame API: Programmatic, method-chaining approach
- Spark SQL: Declarative SQL syntax for complex queries
- Window Functions: Partition-based calculations and moving averages
- Join Strategies: Multiple table relationships and data enrichment
- Data Quality: Handling missing values and duplicate records
- Performance: Sampling strategies for large-scale data processing
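To see the flexibility in practice, here is the same question answered both ways; this is a sketch, with column names assumed from the utilization.csv sample above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-vs-sql").getOrCreate()
util = spark.read.csv("Data/utilization.csv", header=True, inferSchema=True)

# DataFrame API: programmatic, method-chaining style
busy_api = (util.filter(util["cpu_utilization"] > 0.9)
                .groupBy("server_id")
                .count())

# Spark SQL: the same result via a declarative query
util.createOrReplaceTempView("utilization")
busy_sql = spark.sql("""
    SELECT server_id, COUNT(*) AS cnt
    FROM utilization
    WHERE cpu_utilization > 0.9
    GROUP BY server_id
""")
```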
- Setup Environment: Follow platform-specific Spark installation instructions
- Start with Chapter 3: Begin with the `begin/` notebooks to practice exercises
- Check Solutions: Compare your work with the `end/` notebooks
- Progress Sequentially: Each chapter builds on previous concepts
Start Jupyter Notebook from the repository root:

```bash
jupyter notebook
```

Navigate to the `begin/` folder for any chapter and work through the exercises. Each notebook is self-contained and includes detailed explanations.
Upon completion, you will be able to:
- Set up and configure Apache Spark with PySpark
- Load and manipulate large datasets using DataFrames
- Perform complex data analysis using both DataFrame API and SQL
- Implement data quality procedures and join operations
- Apply Spark to machine learning and time-series analysis
- Optimize data processing workflows for distributed computing
- Apache Spark 2.x or 3.x
- Python 3.6+
- PySpark
- Jupyter Notebook
- Minimum 4GB RAM recommended for exercises
- LinkedIn Learning Course: Introduction to Spark SQL and DataFrames
- Instructor: Dan Sullivan
- Course Materials: Exercise files and sample datasets included
These materials are provided for educational use only. All rights belong to LinkedIn Learning and the original instructor. Redistribution for commercial purposes is strictly prohibited.