- Author: Mahmoud Parsian
- This new book (to be published by O'Reilly) is the 2nd edition of Data Algorithms (also published by O'Reilly)
- The first edition used Java for Spark, but for the new book, I will use PySpark (much simpler and more readable)
- This GitHub repository will host all source code and scripts for Data Algorithms with Spark
- Estimated publication date: April 2022
- Author Contact: [ Email ] [ Mahmoud Parsian @LinkedIn ] [ Mahmoud Parsian @GitHub ]
Chapter solutions are provided in PySpark and Scala:
- PySpark solutions are provided by Mahmoud Parsian
- Scala solutions are provided by Deepak Kumar and Biman Mandal

| Spark | Python | Scala | Java |
|---|---|---|---|
| Apache Spark 3.2.0 | Python 3.7.2 | Scala 2.13 | Java 8 |
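
Before running the PySpark solutions, it may help to confirm that the local interpreter is at least as new as the Python version listed in the table above. The following is a minimal sketch using only the standard library. The helper name `python_is_compatible` is hypothetical, and treating the table entry (3.7.2) as a lower bound is an assumption, not something the repository states.

```python
import sys

# Python version taken from the version table above (Python 3.7.2).
# Treating it as a minimum is an assumption of this sketch.
REQUIRED_PYTHON = (3, 7, 2)


def python_is_compatible(version_info=None, required=REQUIRED_PYTHON):
    """Return True if the given (or running) Python version is >= `required`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor, micro) as plain tuples.
    return tuple(version_info[:3]) >= tuple(required)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_compatible() else "older than the listed version"
    print(f"Python {current}: {status}")
```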

| Chapter | Title |
|---|---|
| Bonus Chapters | Bonus Chapters (TF-IDF, Correlation, K-mers, anagrams, ...) |
| Chapter 1 | Introduction to Data Algorithms |
| Chapter 2 | Transformations in Action |
| Chapter 3 | Mapper Transformations |
| Chapter 4 | Reductions in Spark |
| Chapter 5 | Partitioning Data |
| Chapter 6 | Graph Algorithms |
| Chapter 7 | Interacting with External Data Sources |
| Chapter 8 | Ranking Algorithms |
| Chapter 9 | Fundamental Data Design Patterns |
| Chapter 10 | Common Data Design Patterns |
| Chapter 11 | Join Design Patterns |
| Chapter 12 | Feature Engineering in PySpark |