Exercise files and notebooks for the LinkedIn Learning course "Introduction to Spark SQL and DataFrames" by Dan Sullivan.
This repository contains comprehensive hands-on exercises covering Apache Spark DataFrames and Spark SQL fundamentals. The course provides practical experience with distributed data processing, combining both DataFrame API and SQL approaches for maximum flexibility.
Course Details:
- Duration: 1h 54m
- Level: Intermediate
- Instructor: Dan Sullivan (Data Architect, Author, and Instructor)
- Last Updated: April 1, 2024
Prerequisites:
- Python programming experience
- Basic understanding of data structures and SQL concepts
- Familiarity with Jupyter notebooks
Follow the platform-specific setup instructions:
- Mac/Linux: see "Spark Mac Linux Export Environment Variables"
- Windows: see "Spark Windows Instructions"
Install PySpark and Jupyter:

```bash
pip install pyspark jupyter
```
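To confirm the installation works, here is a minimal check (the app name is arbitrary, and `local[*]` simply runs Spark locally on all cores):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all available cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("setup-check")   # arbitrary name, shown in the Spark UI
         .getOrCreate())

print(spark.version)  # should print the installed Spark version
spark.stop()
```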
```
├── CH 03/                # DataFrame API Fundamentals
│   ├── begin/            # Starting notebooks with exercises
│   │   ├── 03.01 Loading csv files into dataframes.ipynb
│   │   ├── 03.02 Reading JSON Files.ipynb
│   │   ├── 03.03 Basic Dataframe Operations.ipynb
│   │   ├── 03.04 Filtering using Dataframe API.ipynb
│   │   ├── 03.05 Aggregating using Dataframe API.ipynb
│   │   ├── 03.06 Sampling using Dataframe API.ipynb
│   │   └── 03.07 Saving Data from Dataframes.ipynb
│   └── end/              # Completed solutions
│       └── [corresponding completed notebooks]
├── CH 04/                # Spark SQL
│   ├── begin/
│   │   ├── 04.01 Querying Dataframes with SQL.ipynb
│   │   ├── 04.02 Filtering Dataframes with SQL.ipynb
│   │   ├── 04.03 Aggregating Dataframes with SQL.ipynb
│   │   ├── 04.04 Joining Dataframes with SQL.ipynb
│   │   ├── 04.05 De-duplicating.ipynb
│   │   └── 04.06 Working with NAs.ipynb
│   └── end/
│       └── [corresponding completed notebooks]
├── CH 05/                # Advanced Data Science Applications
│   ├── begin/
│   │   ├── 05.01 Exploratory Analysis.ipynb
│   │   ├── 05.02 Timeseries Analysis .ipynb
│   │   ├── 05.03 Machine Learning - Clustering.ipynb
│   │   └── 05.04 Machine Learning - Linear Regression.ipynb
│   └── end/
│       └── [corresponding completed notebooks]
├── Data/                 # Sample datasets
│   ├── location_temp.csv # Temperature sensor readings
│   ├── server_name.csv   # Server ID to name mappings
│   ├── utilization.csv   # Server utilization metrics
│   └── utilization.json  # Server utilization (JSON format)
└── [Platform-specific setup files]
```
Learn core DataFrame operations using Spark's Python API (a short sketch follows this list):
- Data Loading: Read CSV and JSON files into DataFrames
- Basic Operations: Schema inspection, column selection, data types
- Filtering: Row-level filtering with conditions
- Aggregation: Group-by operations, statistical functions
- Sampling: Data sampling techniques for large datasets
- Data Export: Saving processed DataFrames in various formats
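A compact, hedged sketch of these operations; paths are relative to the repository root, and the header/inferSchema options plus the column names (taken from the dataset samples below) are assumptions about the files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-basics").getOrCreate()

# Load CSV into a DataFrame (header/inferSchema are assumptions about the file)
df = spark.read.csv("Data/location_temp.csv", header=True, inferSchema=True)
df_json = spark.read.json("Data/utilization.json")        # JSON variant

df.printSchema()                                          # inspect inferred schema
warm = df.filter(df["temp_celsius"] > 25)                 # row-level filtering
by_loc = df.groupBy("location_id").avg("temp_celsius")    # aggregation
sampled = df.sample(fraction=0.1, seed=42)                # ~10% random sample

# Persist results; CSV, JSON, and Parquet writers are all available
by_loc.write.mode("overwrite").parquet("output/avg_temp_by_location")
```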
Master SQL-based data manipulation (see the sketch after this list):
- SQL Queries: Transform DataFrames using familiar SQL syntax
- Advanced Filtering: Complex WHERE clauses and conditions
- Aggregation Functions: GROUP BY, statistical aggregations
- Joins: INNER, LEFT, RIGHT joins across multiple DataFrames
- Data Quality: Deduplication strategies and null value handling
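As an illustration, a minimal sketch of this workflow, assuming the utilization.csv and server_name.csv columns shown in the Datasets section below:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql").getOrCreate()
util = spark.read.csv("Data/utilization.csv", header=True, inferSchema=True)
names = spark.read.csv("Data/server_name.csv", header=True, inferSchema=True)

# Register DataFrames as temporary views so SQL can reference them
util.createOrReplaceTempView("utilization")
names.createOrReplaceTempView("server_name")

# Filtering, aggregation, and a join in one declarative query
result = spark.sql("""
    SELECT n.server_name,
           ROUND(AVG(u.cpu_utilization), 2) AS avg_cpu
    FROM utilization u
    JOIN server_name n ON u.server_id = n.server_id
    WHERE u.session_count > 50
    GROUP BY n.server_name
""")

# Data quality: drop duplicate rows and fill missing numeric values with 0
clean = util.dropDuplicates().na.fill(0)
```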
Apply Spark to real-world data science scenarios (a window-function sketch follows this list):
- Exploratory Data Analysis: Statistical profiling and data discovery
- Time Series Analysis: Window functions, moving averages, temporal patterns
- Machine Learning: Clustering algorithms and linear regression implementation
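For example, a moving average over a partitioned window, in the spirit of the time-series lesson; the column names and the timestamp format are assumptions based on the utilization.csv sample below:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, to_timestamp
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("timeseries").getOrCreate()
util = spark.read.csv("Data/utilization.csv", header=True, inferSchema=True)

# Parse the string timestamp so ordering is chronological, not lexical
util = util.withColumn("ts", to_timestamp(col("event_datetime"), "MM/dd/yyyy HH:mm:ss"))

# Per-server window: current row plus the two before it -> 3-point moving average
w = Window.partitionBy("server_id").orderBy("ts").rowsBetween(-2, 0)
rolling = util.withColumn("cpu_ma3", avg("cpu_utilization").over(w))
rolling.select("server_id", "ts", "cpu_utilization", "cpu_ma3").show(5)
```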
The course uses realistic datasets that simulate common data engineering scenarios:
Time-series temperature readings from multiple sensor locations (location_temp.csv):

| event_date | location_id | temp_celsius |
| --- | --- | --- |
| 03/04/2019 19:48:06 | loc0 | 29 |
| 03/04/2019 19:53:06 | loc0 | 27 |
System performance metrics including CPU, memory, and session data (utilization.csv, also provided as utilization.json):

| cpu_utilization | event_datetime | free_memory | server_id | session_count |
| --- | --- | --- | --- | --- |
| 0.77 | 03/16/2019 17:21:40 | 0.22 | 115 | 58 |
Mapping table for server identification (server_name.csv); a join sketch follows the table:

| server_id | server_name |
| --- | --- |
| 115 | 115 Server |
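The mapping table is small enough to ship to every executor, so a broadcast join is a natural fit for enriching utilization rows with server names. A sketch via the DataFrame API (the broadcast hint is an assumption about relative table sizes):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("enrich").getOrCreate()
util = spark.read.csv("Data/utilization.csv", header=True, inferSchema=True)
names = spark.read.csv("Data/server_name.csv", header=True, inferSchema=True)

# Broadcast the small lookup table to every executor to avoid a shuffle
enriched = util.join(broadcast(names), on="server_id", how="inner")
enriched.select("server_name", "cpu_utilization", "session_count").show(5)
```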
The course demonstrates both approaches for maximum flexibility (the sketch after this list shows the same query both ways):
- DataFrame API: Programmatic, method-chaining approach
- Spark SQL: Declarative SQL syntax for complex queries
- Window Functions: Partition-based calculations and moving averages
- Join Strategies: Multiple table relationships and data enrichment
- Data Quality: Handling missing values and duplicate records
- Performance: Sampling strategies for large-scale data processing
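To see the flexibility in practice, here is the same question answered both ways; this is a sketch, with column names assumed from the utilization.csv sample above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-vs-sql").getOrCreate()
util = spark.read.csv("Data/utilization.csv", header=True, inferSchema=True)

# DataFrame API: programmatic, method-chaining style
busy_api = (util.filter(util["cpu_utilization"] > 0.9)
                .groupBy("server_id")
                .count())

# Spark SQL: the same result via a declarative query
util.createOrReplaceTempView("utilization")
busy_sql = spark.sql("""
    SELECT server_id, COUNT(*) AS cnt
    FROM utilization
    WHERE cpu_utilization > 0.9
    GROUP BY server_id
""")
```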
- Setup Environment: Follow platform-specific Spark installation instructions
- Start with Chapter 3: Begin with the `begin/` notebooks to practice exercises
- Check Solutions: Compare your work with the `end/` notebooks
- Progress Sequentially: Each chapter builds on previous concepts
Start Jupyter Notebook from the repository root:

```bash
jupyter notebook
```

Navigate to the `begin/` folder for any chapter and work through the exercises. Each notebook is self-contained and includes detailed explanations.
Upon completion, you will be able to:
- Set up and configure Apache Spark with PySpark
- Load and manipulate large datasets using DataFrames
- Perform complex data analysis using both DataFrame API and SQL
- Implement data quality procedures and join operations
- Apply Spark to machine learning and time-series analysis
- Optimize data processing workflows for distributed computing
- Apache Spark 2.x or 3.x
- Python 3.6+
- PySpark
- Jupyter Notebook
- Minimum 4GB RAM recommended for exercises
- LinkedIn Learning Course: Introduction to Spark SQL and DataFrames
- Instructor: Dan Sullivan
- Course Materials: Exercise files and sample datasets included
These materials are provided for educational use only. All rights belong to LinkedIn Learning and the original instructor. Redistribution for commercial purposes is strictly prohibited.