Python + PySpark with Databricks
Syllabus:
Python Syllabus
1. Introduction to Python
Python Introduction, History of Python, Introduction to the Python
Interpreter and program execution, Python Installation Process on
Windows and Linux, Python IDEs, Introduction to Anaconda, Python
variable declaration, Keywords, Indentation in Python,
Python input/output operations
2. Python’s Operators
Arithmetic Operators, Comparison Operators, Assignment Operators,
Logical Operators, Bitwise Operators, Membership Operators, Identity
Operators, Ternary Operator, Operator precedence.
3. Python’s Built-in Data types
String, List, Tuple, Set, Dictionary (characteristics and methods)
4. Conditional Statements & Loops
Conditional statements (if, if-else, if-elif-else, nested if, etc.),
loops (for, while, nested loops) and loop control statements
(break, continue, pass)
5. Functions in Python
Introduction to functions, Function definition and calling, Function
parameters, Default arguments, Variable-length arguments,
Built-in functions in Python, Scope of variables in Python
6. File Processing
Concept of files, Opening a file in various modes and closing a file,
Reading from a file, Writing to a file, Important file-handling
functions, e.g. open(), close(), read(), readline(), etc.
7. Modules
Concept of modularization, Importance of modules in Python,
Importing modules, Built-in and third-party modules (e.g. NumPy)
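Sample exercise for the Python syllabus: the short sketch below touches several of the topics listed above (ternary operator, default and variable-length arguments, loops, and file handling with open(), readline() and read()). The function names and the file name demo.txt are made up for illustration only.

```python
import os
import tempfile


def describe(number, label="value"):
    """Default argument: label falls back to "value" when it is omitted."""
    parity = "even" if number % 2 == 0 else "odd"  # ternary operator
    return f"{label} {number} is {parity}"


def total(*numbers):
    """Variable-length arguments: accepts any number of positional values."""
    result = 0
    for n in numbers:  # numbers arrives as a tuple
        result += n
    return result


def write_and_read(lines):
    """Write lines to a temporary file, then read them back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.txt")  # hypothetical file name
        with open(path, "w") as handle:  # "w" creates or truncates the file
            for line in lines:
                handle.write(line + "\n")
        with open(path, "r") as handle:  # "r" opens the file for reading
            first = handle.readline().strip()  # first line only
            rest = handle.read().splitlines()  # everything after it
    return first, rest
```

Using the with statement closes each file automatically, so an explicit close() call is not needed here.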
Databricks Concepts.
1) Databricks Introduction
A. Databricks Architecture
B. Databricks Components overview
C. Benefits for data engineers and data scientists
2) Databricks concepts
A. Workspace - creating and managing workspaces.
B. Notebook - creating notebooks, calling and managing
different notebooks.
C. Library - installing and managing libraries.
3) Data Management
A. Databricks File System (DBFS) - copying and managing
files using DBFS commands.
B. Database - creating and managing databases and tables.
C. Table - creating tables, dropping tables, loading data.
D. Metastore - managing metadata; creating and managing
Delta tables.
E. Unity Catalog configuration and creation
4) Computation Management
A. Cluster - creating and managing clusters.
B. Pool - creating pools and using pools for autoscaling.
C. Databricks Runtime - understanding and choosing Databricks runtimes
based on requirements.
D. Jobs - creating jobs from notebooks and assigning cluster types
to jobs.
E. Workload - monitoring jobs and managing loads.
F. Execution Context - understanding the execution context.
5) Databricks Advanced topics.
A. Databricks Workflows
B. Workflow task
C. Implementing parallel and sequential tasks
D. Scheduling workflows in Databricks
E. Calling one notebook from another notebook
F. Parameterization in notebooks
G. How to implement parallelism in notebook execution
H. Mounting Azure Blob Storage and Data Lake Storage accounts
I. Repos integration in Databricks
J. Volumes in Databricks
K. Costing and Performance monitoring
L. Databricks Unity Catalog
M. Databricks Delta Live Tables
N. Databricks Change Data Feed
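Sample exercise for the advanced topics: the sketch below contrasts sequential and parallel execution of parameterized notebook runs using a thread pool from the Python standard library. It assumes a hypothetical run_notebook helper and made-up notebook paths; inside a Databricks notebook the helper body would typically call dbutils.notebook.run instead, which is not available outside Databricks.

```python
from concurrent.futures import ThreadPoolExecutor


def run_notebook(path, params):
    """Hypothetical stand-in for one notebook run; returns a status string."""
    return f"{path} finished with {params}"


# Made-up notebook paths and parameters, for illustration only.
jobs = [
    ("/demo/ingest", {"day": "2024-01-01"}),
    ("/demo/transform", {"day": "2024-01-01"}),
    ("/demo/report", {"day": "2024-01-01"}),
]

# Sequential: each run starts only after the previous one has returned.
sequential = [run_notebook(path, params) for path, params in jobs]

# Parallel: up to two runs are in flight at once; pool.map returns the
# results in submission order regardless of which run finishes first.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(lambda job: run_notebook(*job), jobs))
```

Threads suit this pattern because each worker mostly waits for a remote run to finish rather than doing local computation.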
PySpark Content
1 PySpark Introduction
2 PySpark Features and Advantages
3 PySpark RDD Computation
4 PySpark Transformations and Actions
5 PySpark Fault-Tolerance mechanism
6 PySpark RDD persistence
7 Different persistence options
8 Test
9 ON Lambda filter and map functions
10 PySpark RDD built-in Transformations
11 PySpark key-value Transformations
12 PySpark built-in Actions
13 PySpark built-in actions and increasing part
14 PySpark Filtering operations and word count
15 PySpark Groupings and Aggregations
16 PySpark installation within Jupyter Notebook
17 PySpark SQL and Creating DataFrames
18 PySpark SQL DataFrame functions
19 PySpark various DataFrame Functions
20 PySpark SQL DataFrame Functions
21 PySpark different types of joins
22 PySpark working with SQL statements
23 PySpark Working with CSV and JSON data
24 MultiLine JSON and PySpark integration with
25 PySpark Column Transformations
26 NoSQL Introduction
27 NoSQL HBase Introduction
28 NoSQL HBase CRUD operations
29 Importing data from RDBMS to HBase table
30 MySQL and HBase
31 Various PySpark functions
32 Filtering and Replacing column values
33 PySpark Jupyter and PySpark pandas and cal
34 PySpark Date and Timestamp functions
35 Stages and Tasks, Narrow and Wide Transformations
36 Test
37 NiFi Lecture 1
38 NiFi Lecture 2
39 Kafka Lecture 1
40 Kafka Lecture 2
41 Streaming Lecture 1
43 Streaming Lecture 2
44 Streaming Lecture 3
45 PySpark Project(s)
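Sample exercise for the lambda, filter, map and word count lectures: the sketch below is a plain-Python analogue of the classic word count, not PySpark code. Each step is labelled with the RDD method it mirrors (flatMap, filter, map, reduceByKey); the input lines and the add_pair helper are made up for illustration.

```python
from functools import reduce

lines = ["spark makes big data simple", "big data needs spark"]

# flatMap analogue: split every line into words and flatten the result.
words = [word for line in lines for word in line.split()]

# filter analogue: keep only words longer than three characters.
long_words = list(filter(lambda w: len(w) > 3, words))

# map analogue: pair each word with an initial count of 1.
pairs = list(map(lambda w: (w, 1), long_words))


def add_pair(counts, pair):
    """reduceByKey analogue: add the pair's count to its word's running total."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


word_counts = reduce(add_pair, pairs, {})
```

In PySpark the same pipeline would run lazily across partitions and only execute when an action is called; this local version runs eagerly on a single machine.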
1) Delta Lake usage in Databricks.
A. Delta Lake Architecture
B. Delta Lake Storage Understanding
C. Delta Lake table creation and API options
D. Delta Lake DML Operations usage
E. Delta Lake partitions
F. Delta Lake Schema Enforcement
G. Delta Lake Schema Evolution
H. Delta Lake Versions
I. Delta Lake Time Travel
J. Delta Lake Vacuum
K. Delta Lake Merge (SCD Type 1 and SCD Type 2)
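Sample exercise for the merge topic: the sketch below illustrates the difference between SCD Type 1 (overwrite, no history) and SCD Type 2 (close the old row, append a new version) on plain Python data. It is a conceptual model only and does not use the Delta Lake API; the column names key, value, start_date, end_date and is_current are assumptions chosen for illustration.

```python
from datetime import date


def merge_scd1(target, updates):
    """SCD Type 1: overwrite matching keys in place; no history is kept."""
    merged = dict(target)
    merged.update(updates)
    return merged


def merge_scd2(history, updates, effective):
    """SCD Type 2: expire the current row and append a new current version."""
    result = [dict(row) for row in history]  # copy so the input is not mutated
    for key, value in updates.items():
        current = next(
            (r for r in result if r["key"] == key and r["is_current"]), None
        )
        if current is not None and current["value"] == value:
            continue  # nothing changed, keep the current version
        if current is not None:
            current["is_current"] = False  # close the old version
            current["end_date"] = effective
        result.append(
            {
                "key": key,
                "value": value,
                "start_date": effective,
                "end_date": None,
                "is_current": True,
            }
        )
    return result
```

Type 1 keeps a single row per key, so earlier values are lost; Type 2 grows by one row per change, which is what makes point-in-time queries on the dimension possible.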
B. Understand storage account keys.
C. Understand shared access signatures.
D. Understand transport-level encryption with HTTPS.
E. Understand Advanced Threat Protection.
F. Control network access.
SPARK SQL:
1) Introduction to Spark SQL.
2) Spark SQL Create database
3) Drop databases
4) Create internal table
5) Create external table
6) Create partitioned table
7) Create partitioned table with bucketing
8) Spark SQL DML: insert, update, delete and merge operations
9) Spark SQL DRL: SELECT queries with different clauses
10) Spark SQL MERGE with SCD Type 1 and SCD Type 2
11) Spark SQL WHERE, GROUP BY and HAVING clauses
12) Spark SQL ORDER BY and SORT BY clauses
13) Spark SQL join types, Window, Pivot, Limit and Like
14) Spark SQL Grouping Sets, Rollup and Cube
15) Spark SQL Cluster By and Distribute By
16) Spark SQL Case, With and Take sample
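Sample exercise for the WHERE, GROUP BY, HAVING and ORDER BY topics: the sketch below runs one query that combines all four clauses. It uses SQLite from the Python standard library as a local stand-in, on the assumption that these clauses are written the same way in Spark SQL; the sales table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 250), ("west", 40), ("west", 30), ("north", 500)],
)

query = """
SELECT region, SUM(amount) AS total
FROM sales
WHERE amount > 35              -- row filter, applied before grouping
GROUP BY region                -- one output row per region
HAVING SUM(amount) >= 300      -- group filter, applied after aggregation
ORDER BY total DESC            -- largest total first
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The point to notice is the order of evaluation: WHERE removes individual rows first, and HAVING then removes whole groups based on the aggregated value.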
AZURE
1) Overview of the Microsoft Azure Platform
A. Introduction to Azure
B. Basics of Cloud computing
C. Azure Infrastructure
D. Walkthrough of Azure Portal
E. Overview of Azure Services
2) Azure Data Architecture
A. Traditional RDBMS workloads.
B. Data Warehousing Approach
C. Big data architectures.
D. Transferring data to and from Azure
3) Blob Storage
A. Azure Blob Resources
B. Azure storage account data objects
C. Azure storage account types and Options
D. Replications in distribution
E. Secure access to an application's data
F. Azure Import/Export service
G. Storage Explorer
H. Practical section on Blob Storage
PROJECT AZURE
Streaming Project Using
1. NiFi
2. Kafka
3. PySpark
4. Azure