
Introduction to Spark SQL and DataFrames

Exercise files and notebooks for the LinkedIn Learning course "Introduction to Spark SQL and DataFrames" by Dan Sullivan.

Course Overview

This repository contains hands-on exercises covering Apache Spark DataFrame and Spark SQL fundamentals. The course provides practical experience with distributed data processing, using both the DataFrame API and Spark SQL so you can pick the approach that fits each task.

Course Details:

  • Duration: 1h 54m
  • Level: Intermediate
  • Instructor: Dan Sullivan (Data Architect, Author, and Instructor)
  • Last Updated: April 1, 2024

Prerequisites

  • Python programming experience
  • Basic understanding of data structures and SQL concepts
  • Familiarity with Jupyter notebooks

Installation

Spark Setup

Follow the platform-specific setup instructions:

  • Mac/Linux: See the "Spark Mac Linux Export Environment Variables" file
  • Windows: See the "Spark Windows Instructions" file

Required Dependencies

pip install pyspark jupyter
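
To verify the setup, you can start a local SparkSession from Python. A minimal sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("spark-sql-course").getOrCreate()

print(spark.version)  # should print the installed Spark version
spark.stop()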

Repository Structure

├── CH 03/                          # DataFrame API Fundamentals
│   ├── begin/                      # Starting notebooks with exercises
│   │   ├── 03.01 Loading csv files into dataframes.ipynb
│   │   ├── 03.02 Reading JSON Files.ipynb
│   │   ├── 03.03 Basic Dataframe Operations.ipynb
│   │   ├── 03.04 Filtering using Dataframe API.ipynb
│   │   ├── 03.05 Aggregating using Dataframe API.ipynb
│   │   ├── 03.06 Sampling using Dataframe API.ipynb
│   │   └── 03.07 Saving Data from Dataframes.ipynb
│   └── end/                        # Completed solutions
│       └── [corresponding completed notebooks]
├── CH 04/                          # Spark SQL
│   ├── begin/
│   │   ├── 04.01 Querying Dataframes with SQL.ipynb
│   │   ├── 04.02 Filtering Dataframes with SQL.ipynb
│   │   ├── 04.03 Aggregating Dataframes with SQL.ipynb
│   │   ├── 04.04 Joining Dataframes with SQL.ipynb
│   │   ├── 04.05 De-duplicating.ipynb
│   │   └── 04.06 Working with NAs.ipynb
│   └── end/
│       └── [corresponding completed notebooks]
├── CH 05/                          # Advanced Data Science Applications
│   ├── begin/
│   │   ├── 05.01 Exploratory Analysis.ipynb
│   │   ├── 05.02 Timeseries Analysis .ipynb
│   │   ├── 05.03 Machine Learning - Clustering.ipynb
│   │   └── 05.04 Machine Learning - Linear Regression.ipynb
│   └── end/
│       └── [corresponding completed notebooks]
├── Data/                           # Sample datasets
│   ├── location_temp.csv           # Temperature sensor readings
│   ├── server_name.csv             # Server ID to name mappings
│   ├── utilization.csv             # Server utilization metrics
│   └── utilization.json            # Server utilization (JSON format)
└── [Platform-specific setup files]

Learning Path

Chapter 3: DataFrame API Fundamentals

Learn core DataFrame operations using Spark's Python API (a short sketch follows the list):

  • Data Loading: Read CSV and JSON files into DataFrames
  • Basic Operations: Schema inspection, column selection, data types
  • Filtering: Row-level filtering with conditions
  • Aggregation: Group-by operations, statistical functions
  • Sampling: Data sampling techniques for large datasets
  • Data Export: Saving processed DataFrames in various formats
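
As a taste of what the Chapter 3 notebooks cover, here is a minimal sketch using Data/location_temp.csv; the column names follow the Sample Data section below, and the output path is arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Load a CSV file into a DataFrame, inferring column types from the data
df = spark.read.csv("Data/location_temp.csv", header=True, inferSchema=True)
df.printSchema()

# Filter rows and aggregate: average temperature per location
avg_temp = (df.filter(F.col("temp_celsius") > 25)
              .groupBy("location_id")
              .agg(F.avg("temp_celsius").alias("avg_temp")))

# Sample 10% of the rows without replacement
sample = df.sample(withReplacement=False, fraction=0.1)

# Save the aggregated result (output path is arbitrary)
avg_temp.write.mode("overwrite").csv("output/avg_temp_by_location")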

Chapter 4: Spark SQL

Master SQL-based data manipulation (a short sketch follows the list):

  • SQL Queries: Transform DataFrames using familiar SQL syntax
  • Advanced Filtering: Complex WHERE clauses and conditions
  • Aggregation Functions: GROUP BY, statistical aggregations
  • Joins: INNER, LEFT, RIGHT joins across multiple DataFrames
  • Data Quality: Deduplication strategies and null value handling
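
A minimal sketch of the SQL workflow, assuming the utilization and server-name datasets from Data/ have already been loaded into DataFrames named util_df and names_df:

# Register DataFrames as temporary views so they can be queried with SQL
util_df.createOrReplaceTempView("utilization")
names_df.createOrReplaceTempView("server_name")

# Filter, aggregate, and join using plain SQL
result = spark.sql("""
    SELECT n.server_name,
           ROUND(AVG(u.cpu_utilization), 2) AS avg_cpu
    FROM utilization u
    JOIN server_name n ON u.server_id = n.server_id
    WHERE u.session_count > 50
    GROUP BY n.server_name
    ORDER BY avg_cpu DESC
""")
result.show()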

Chapter 5: Advanced Analytics

Apply Spark to real-world data science scenarios (a short sketch follows the list):

  • Exploratory Data Analysis: Statistical profiling and data discovery
  • Time Series Analysis: Window functions, moving averages, temporal patterns
  • Machine Learning: Clustering algorithms and linear regression implementation
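
For example, a moving average can be computed with a window function. A minimal sketch, assuming util_df holds the utilization data with event_datetime already cast to a timestamp:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Moving average of CPU utilization over the current and two preceding readings,
# computed separately for each server and ordered by time
w = (Window.partitionBy("server_id")
           .orderBy("event_datetime")
           .rowsBetween(-2, 0))

(util_df.withColumn("cpu_moving_avg", F.avg("cpu_utilization").over(w))
        .select("server_id", "event_datetime", "cpu_utilization", "cpu_moving_avg")
        .show(5))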

Sample Data

The course uses realistic datasets that simulate common data engineering scenarios:

Temperature Monitoring (location_temp.csv)

Time-series temperature readings from multiple sensor locations:

event_date          | location_id | temp_celsius
03/04/2019 19:48:06 | loc0        | 29
03/04/2019 19:53:06 | loc0        | 27

Server Utilization (utilization.json)

System performance metrics including CPU, memory, and session data:

cpu_utilization | event_datetime      | free_memory | server_id | session_count
0.77            | 03/16/2019 17:21:40 | 0.22        | 115       | 58

Server Registry (server_name.csv)

Mapping table for server identification:

server_id | server_name
115       | 115 Server

Key Concepts Covered

DataFrame API vs SQL

The course demonstrates both approaches so you can choose whichever fits the task (a short comparison follows the list):

  • DataFrame API: Programmatic, method-chaining approach
  • Spark SQL: Declarative SQL syntax for complex queries
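
The same query can usually be written either way. A minimal sketch, assuming df holds the utilization data and has been registered as a temporary view named utilization:

from pyspark.sql import functions as F

# DataFrame API: programmatic, method-chaining style
api_result = (df.filter(F.col("cpu_utilization") > 0.7)
                .groupBy("server_id")
                .count())

# Spark SQL: the equivalent declarative query
sql_result = spark.sql("""
    SELECT server_id, COUNT(*) AS count
    FROM utilization
    WHERE cpu_utilization > 0.7
    GROUP BY server_id
""")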

Advanced Techniques

  • Window Functions: Partition-based calculations and moving averages
  • Join Strategies: Multiple table relationships and data enrichment
  • Data Quality: Handling missing values and duplicate records
  • Performance: Sampling strategies for large-scale data processing (see the sketch below)
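
A minimal sketch of the data-quality and sampling techniques above, assuming df is a loaded utilization DataFrame:

# Remove exact duplicate rows, or duplicates on selected columns
deduped = df.dropDuplicates()
deduped_by_server = df.dropDuplicates(["server_id", "event_datetime"])

# Handle missing values: drop rows with nulls, or fill them with a default
no_nulls = df.na.drop()
filled = df.na.fill({"session_count": 0})

# Work on a 5% sample while developing, then rerun on the full dataset
dev_sample = df.sample(fraction=0.05, seed=42)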

Getting Started

  1. Set Up the Environment: Follow the platform-specific Spark installation instructions
  2. Start with Chapter 3: Work through the exercises in the begin/ notebooks
  3. Check Solutions: Compare your work with the end/ notebooks
  4. Progress Sequentially: Each chapter builds on the previous one

Usage

Start Jupyter Notebook from the repository root:

jupyter notebook

Navigate to the begin/ folder for any chapter and work through the exercises. Each notebook is self-contained and includes detailed explanations.

Learning Outcomes

Upon completion, you will be able to:

  • Set up and configure Apache Spark with PySpark
  • Load and manipulate large datasets using DataFrames
  • Perform complex data analysis using both DataFrame API and SQL
  • Implement data quality procedures and join operations
  • Apply Spark to machine learning and time-series analysis
  • Optimize data processing workflows for distributed computing

Technical Requirements

  • Apache Spark 2.x or 3.x
  • Python 3.6+
  • PySpark
  • Jupyter Notebook
  • Minimum 4GB RAM recommended for exercises

Course Resources

License

These materials are provided for educational use only. All rights belong to LinkedIn Learning and the original instructor. Redistribution for commercial purposes is strictly prohibited.

