Apache Spark and Python: unified Big Data analytics
@anguenot @ilandcloud
Flatiron School, Houston - July 2019
OUTLINE
● What is Apache Spark?
● The issue of Big Data
● Past, Present and Future of Spark
● Spark architecture quick overview
● Spark’s languages and APIs
● PySpark
● Community & Ecosystem
● Examples and demo using Jupyter notebooks from Andrew Sprague
What is Apache Spark?
● Unified computing engine
● Libraries for parallel data processing on clusters
● Supports multiple programming languages: SQL, Scala, Java, Python, R
● Provides libraries for streaming, machine learning and graph computing
● Spark UI: web UI to monitor and inspect jobs and tasks
● Can run anywhere: from a laptop to clusters (on-prem or cloud)
● De facto standard for Big Data processing across all industries and use cases
● Open Source @ Apache Software Foundation
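To make "unified computing engine" concrete, here is a minimal, self-contained PySpark sketch: one small script that loads a text file and runs a distributed word count. The file path is a placeholder; any plain-text file works.

```python
# Minimal PySpark sketch: distributed word count over a text file.
# "README.md" is a placeholder path; substitute any plain-text file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("README.md")  # one row per line, in a "value" column
counts = (
    lines.select(explode(split(col("value"), r"\s+")).alias("word"))
         .groupBy("word")
         .count()
         .orderBy(col("count").desc())
)
counts.show(10)
spark.stop()
```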
Unified
● One (1) compute engine
● One (1) set of APIs
● One (1) way of developing and deploying applications (or jobs)
● Data loading, machine learning, streaming computation, etc.
● Interactive or traditional application deployment
● Code reuse and access to multiple libraries
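A sketch of what "one set of APIs" means in practice: the same function can be applied unchanged to a batch DataFrame and to a streaming one. The paths, column names and schema below are hypothetical.

```python
# Sketch: one function, reused unchanged for batch and streaming input.
# Paths, column names and the schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("unified-apis").getOrCreate()

order_schema = StructType([
    StructField("customer", StringType()),
    StructField("amount", DoubleType()),
])

def large_orders(df):
    # Identical business logic whether df is batch or streaming.
    return df.filter(col("amount") > 100).groupBy("customer").count()

# Batch: read once, compute, print.
large_orders(spark.read.schema(order_schema).json("/data/orders/")).show()

# Streaming: the same function, executed incrementally as files arrive.
query = (
    large_orders(spark.readStream.schema(order_schema).json("/data/orders/"))
    .writeStream.outputMode("complete").format("console").start()
)
```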
Compute engine
● Compute engine only: not a persistent data storage system
● Supports a wide range of persistent storage systems: Amazon S3, Azure Storage, Apache Hadoop, Apache Cassandra, etc.
● Easier to deploy and maintain without a persistent storage layer
● Also supports message buses such as Apache Kafka
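A hedged illustration of that separation, assuming a SparkSession named spark as in the earlier sketch: the same read entry point fans out to different storage systems. Bucket names, keyspace, and broker address are placeholders, and the S3 and Cassandra reads assume the matching connector packages are on the classpath.

```python
# Sketch: one read API, many storage backends. All names are placeholders,
# and the S3 / Cassandra reads assume the matching connector packages
# (e.g. hadoop-aws, spark-cassandra-connector) are installed.
events_s3 = spark.read.parquet("s3a://my-bucket/events/")                # Amazon S3
users_hdfs = spark.read.csv("hdfs:///warehouse/users.csv", header=True)  # Hadoop HDFS
events_cass = (spark.read.format("org.apache.spark.sql.cassandra")
                    .options(keyspace="my_ks", table="events")
                    .load())                                             # Apache Cassandra
events_kafka = (spark.readStream.format("kafka")
                     .option("kafka.bootstrap.servers", "broker:9092")
                     .option("subscribe", "events")
                     .load())                                            # Apache Kafka
```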
The issue of Big Data
● Before, applications mostly ran on a single processor
● Single-core processor speeds stopped improving around 2005
● The amount of data keeps increasing
● The price of storage keeps decreasing: it is cheap to store data
● Solution: clusters and parallel CPU cores
● Performance
○ In-memory processing is faster
○ Moving past Hadoop MapReduce, at least for real-time analytics and streaming
Past, Present and Future of Spark
● Started in 2009 as a research project at UC Berkeley, CA
● Hadoop MapReduce was the first Open Source parallel computing engine for clusters
● Its limitation: multiple passes over the data required multiple jobs, with each pass writing results to disk
● Spark enabled multi-step applications with efficient in-memory data sharing between steps (batch only at first)
● Interactive and ad-hoc queries (data scientists)
● Became an Apache Software Foundation project in 2013
● Spark 1.0 in 2014 introduced Spark SQL and structured data
● Then came structured streaming, machine learning pipelines and graph processing
● Now the de facto standard across industries and institutions: Netflix, Uber, CERN, MIT, Harvard, etc.
Spark Architecture Overview
Graphics from https://spark.apache.org
Spark Languages
● Scala: the default language
● Java: available but not popular
● Python: supports nearly all constructs that Scala supports
● SQL: a subset of the SQL 2003 standard
● R: SparkR and sparklyr (community), but Python is in the process of displacing R in the data science community
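To illustrate the multi-language point, here is a sketch (hypothetical flights dataset; spark is an existing SparkSession) of the same aggregation written in SQL and in the Python DataFrame API. Both compile to the same plan.

```python
# Sketch: one query, two languages. The "flights" data is hypothetical.
from pyspark.sql.functions import count

df = spark.read.json("/data/flights.json")
df.createOrReplaceTempView("flights")

# SQL flavor
via_sql = spark.sql(
    "SELECT dest_country, count(*) AS n FROM flights GROUP BY dest_country"
)

# Python DataFrame flavor: same logical plan as the SQL version
via_python = df.groupBy("dest_country").agg(count("*").alias("n"))

via_sql.explain()  # compare with via_python.explain(): identical plans
```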
Spark’s high-level structured APIs (1/2)
● SparkSession
○ The driver process controlling the Spark application across the cluster
○ One-to-one correspondence between a SparkSession and a Spark application
● DataFrames
○ The most common structured API
○ A table of data with rows and columns and a schema (column names and value types)
○ Think distributed!
○ Columns / Rows and Spark types
○ Essentially the same as the tables and views you execute SQL against with Spark SQL
○ An “untyped” Dataset
● Partitions
○ Chunks of data executed in parallel; Spark partitions for you by default when using DataFrames
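A short sketch tying these three concepts together (all values are illustrative):

```python
# Sketch: SparkSession, a DataFrame with a schema, and its partitions.
from pyspark.sql import SparkSession

# SparkSession: the driver-side entry point, one per Spark application.
spark = SparkSession.builder.appName("structured-apis").getOrCreate()

# DataFrame: distributed rows under named, typed columns.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    schema="name STRING, age INT",
)
people.printSchema()  # the schema: column names and Spark types
people.show()

# Partitions: the chunks Spark executes on in parallel.
print(people.rdd.getNumPartitions())
people = people.repartition(4)  # explicitly repartition if needed
```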
Spark’s high-level structured APIs (2/2)
● Transformations
○ Data structures are immutable: to change them, you apply instructions called transformations
○ Narrow transformations (one-to-one partition mapping) stay in memory and can be pipelined
○ Wide transformations (one-to-many partition mapping) shuffle data, with writes to disk
● Lazy evaluation
○ A streamlined plan of transformations
○ An optimized graph of computation instructions: the logical plan
● Actions
○ count(), collect(), take(n), top(), countByValue(), reduce(), fold(), aggregate(), foreach()
○ An action triggers computation against the plan of transformations
○ A single job, broken down into multiple stages and tasks executed across the cluster
○ Physical plan (clustered RDD manipulations)
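Continuing the sketch above (reusing the hypothetical people DataFrame), transformations only build a plan; an action triggers the job:

```python
# Sketch: lazy transformations vs. eager actions.
from pyspark.sql.functions import col

adults = people.where(col("age") >= 30)  # narrow transformation: no shuffle
by_age = adults.groupBy("age").count()   # wide transformation: shuffles data

# Nothing has executed yet; Spark has only built a logical plan.
by_age.explain()                         # inspect the optimized/physical plan

# Actions trigger a job, broken into stages and tasks across the cluster.
print(by_age.count())                    # action
rows = by_age.collect()                  # action: returns rows to the driver
```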
A Spark application
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do

Spark’s toolkit
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do

The Catalyst Optimizer
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do

The Structured API logical planning process
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do

The Physical Planning Process
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
PySpark
● Use cases:
○ exploratory data analysis at scale
○ building machine learning pipelines
○ creating ETLs for a data platform
○ streaming data pipelines
● Interactive pyspark shell
● Available as a PyPI package since Spark 2.2
● Differences from native Scala? With the structured APIs, the Catalyst engine optimizes the same query plan regardless of language
● pandas DataFrame & UDF integration (pandas itself runs single-node)
● Python is the fastest-growing language for data science & machine learning
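A hedged sketch of the pandas UDF integration mentioned above (column names and values are made up; assumes Spark 2.3+ with pyspark and pyarrow installed, and an existing SparkSession named spark):

```python
# Sketch: a vectorized (pandas) UDF. It operates on whole pandas Series
# per batch instead of row-by-row Python calls. Names are illustrative.
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("double")
def to_fahrenheit(celsius: pd.Series) -> pd.Series:
    # Applied to batches of rows as pandas Series: far less per-row
    # serialization overhead than a plain Python UDF.
    return celsius * 9.0 / 5.0 + 32.0

readings = spark.createDataFrame([(0.0,), (21.5,), (100.0,)], "celsius DOUBLE")
readings.select(to_fahrenheit(col("celsius")).alias("fahrenheit")).show()
```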
Ecosystem & Community
● Apache Spark website: https://spark.apache.org/
● Mailing lists: user@spark.apache.org, dev@spark.apache.org
● Community resources:
○ https://spark.apache.org/community.html
● Spark Packages: https://spark-packages.org/
● Spark Summit
● Local meetups:
○ @ Houston: https://www.meetup.com/Houston-Spark-Meetup/
“Software is like sex: it's better when it's free.”
- Linus Torvalds, creator of Linux (and Git)
Examples and demo with Andrew Sprague
https://github.com/spraguesy/spark-ncaa-bb
Q&A
