Apache Spark and Python: unified Big Data analytics
@anguenot @ilandcloud
Flatiron School, Houston - July 2019
OUTLINE
● What is Apache Spark?
● The issue of Big Data
● Past, Present and Future of Spark
● Spark architecture quick overview
● Spark’s languages and APIs
● PySpark
● Community & Ecosystem
● Examples and demo using Jupyter notebooks from Andrew Sprague
What is Apache Spark?
● Unified computing engine
● Libraries for parallel data processing on clusters
● Supports multiple programming languages: SQL, Scala, Java, Python, R
● Provides libraries for streaming, machine learning and graph computing
● Spark UI: web UI to monitor and inspect jobs and tasks
● Can run anywhere: from a laptop to clusters (on-prem or cloud)
● De facto standard for Big Data processing across all industries and use cases
● Open Source @ Apache Software Foundation
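To make "unified computing engine" concrete, here is a minimal, self-contained PySpark sketch: one small script that loads a text file and runs a distributed word count. The file path is a placeholder; any plain-text file works.

```python
# Minimal PySpark sketch: distributed word count over a text file.
# "README.md" is a placeholder path; substitute any plain-text file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("README.md")  # one row per line, in a "value" column
counts = (
    lines.select(explode(split(col("value"), r"\s+")).alias("word"))
         .groupBy("word")
         .count()
         .orderBy(col("count").desc())
)
counts.show(10)
spark.stop()
```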
Unified
● One (1) compute engine
● One (1) set of APIs
● One (1) way of developing and deploying applications (or jobs)
● Data loading, machine learning, streaming computation, etc.
● Interactive or traditional application deployment
● Code reuse and access to multiple libraries
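A sketch of what "one set of APIs" means in practice: the same function can be applied unchanged to a batch DataFrame and to a streaming one. The paths, column names and schema below are hypothetical.

```python
# Sketch: one function, reused unchanged for batch and streaming input.
# Paths, column names and the schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("unified-apis").getOrCreate()

order_schema = StructType([
    StructField("customer", StringType()),
    StructField("amount", DoubleType()),
])

def large_orders(df):
    # Identical business logic whether df is batch or streaming.
    return df.filter(col("amount") > 100).groupBy("customer").count()

# Batch: read once, compute, print.
large_orders(spark.read.schema(order_schema).json("/data/orders/")).show()

# Streaming: the same function, executed incrementally as files arrive.
query = (
    large_orders(spark.readStream.schema(order_schema).json("/data/orders/"))
    .writeStream.outputMode("complete").format("console").start()
)
```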
Compute engine
● Compute engine only: not a persistent data storage system
● Supports a wide range of persistent storage systems: Amazon S3, Azure Storage, Apache Hadoop, Apache Cassandra, etc.
● Easier to deploy and maintain without a persistent storage layer
● Also supports message buses such as Apache Kafka
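A hedged illustration of that separation, assuming a SparkSession named spark as in the earlier sketch: the same read entry point fans out to different storage systems. Bucket names, keyspace, and broker address are placeholders, and the S3 and Cassandra reads assume the matching connector packages are on the classpath.

```python
# Sketch: one read API, many storage backends. All names are placeholders,
# and the S3 / Cassandra reads assume the matching connector packages
# (e.g. hadoop-aws, spark-cassandra-connector) are installed.
events_s3 = spark.read.parquet("s3a://my-bucket/events/")                # Amazon S3
users_hdfs = spark.read.csv("hdfs:///warehouse/users.csv", header=True)  # Hadoop HDFS
events_cass = (spark.read.format("org.apache.spark.sql.cassandra")
                    .options(keyspace="my_ks", table="events")
                    .load())                                             # Apache Cassandra
events_kafka = (spark.readStream.format("kafka")
                     .option("kafka.bootstrap.servers", "broker:9092")
                     .option("subscribe", "events")
                     .load())                                            # Apache Kafka
```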
The issue of Big Data
● Before, applications mostly ran on a single processor
● Single-core processor speeds stopped improving around 2005
● The amount of data keeps increasing
● The price of storage keeps decreasing: it is cheap to store data
● Solution: clusters and parallel CPU cores
● Performance
○ In-memory processing is faster
○ Moving past Hadoop MapReduce, at least for real-time analytics and streaming
Past, Present and Future of Spark
● Started in 2009 as a research project at UC Berkeley, CA
● Hadoop MapReduce was the first Open Source parallel computing engine for clusters
● Its limitation: multiple passes over the data required multiple jobs, with each pass writing results to disk
● Spark enabled multi-step applications with efficient in-memory data sharing between steps (batch only at first)
● Interactive and ad-hoc queries (data scientists)
● Became an Apache Software Foundation project in 2013
● Spark 1.0 in 2014 introduced Spark SQL and structured data
● Then came structured streaming, machine learning pipelines and graph processing
● Now the de facto standard across industries and institutions: Netflix, Uber, CERN, MIT, Harvard, etc.
Spark Architecture Overview
Graphics from https://spark.apache.org
Spark Languages
● Scala: the default language
● Java: available but not popular
● Python: supports nearly all constructs that Scala supports
● SQL: a subset of the SQL 2003 standard
● R: SparkR and sparklyr (community), but Python is in the process of displacing R in the data science community
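To illustrate the multi-language point, here is a sketch (hypothetical flights dataset; spark is an existing SparkSession) of the same aggregation written in SQL and in the Python DataFrame API. Both compile to the same plan.

```python
# Sketch: one query, two languages. The "flights" data is hypothetical.
from pyspark.sql.functions import count

df = spark.read.json("/data/flights.json")
df.createOrReplaceTempView("flights")

# SQL flavor
via_sql = spark.sql(
    "SELECT dest_country, count(*) AS n FROM flights GROUP BY dest_country"
)

# Python DataFrame flavor: same logical plan as the SQL version
via_python = df.groupBy("dest_country").agg(count("*").alias("n"))

via_sql.explain()  # compare with via_python.explain(): identical plans
```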
Spark’s high-level structured APIs (1/2)
● SparkSession
○ The driver process controlling the Spark application across the cluster
○ One-to-one correspondence between a SparkSession and a Spark application
● DataFrames
○ The most common structured API
○ A table of data with rows and columns and a schema (column names and value types)
○ Think distributed!
○ Columns / Rows and Spark types
○ Essentially the same as the tables and views you execute SQL against with Spark SQL
○ An “untyped” Dataset
● Partitions
○ Chunks of data executed in parallel; Spark partitions for you by default when using DataFrames
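A short sketch tying these three concepts together (all values are illustrative):

```python
# Sketch: SparkSession, a DataFrame with a schema, and its partitions.
from pyspark.sql import SparkSession

# SparkSession: the driver-side entry point, one per Spark application.
spark = SparkSession.builder.appName("structured-apis").getOrCreate()

# DataFrame: distributed rows under named, typed columns.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    schema="name STRING, age INT",
)
people.printSchema()  # the schema: column names and Spark types
people.show()

# Partitions: the chunks Spark executes on in parallel.
print(people.rdd.getNumPartitions())
people = people.repartition(4)  # explicitly repartition if needed
```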
Spark’s high-level structured APIs (2/2)
● Transformations
○ Data structures are immutable: to change them, you apply instructions called transformations
○ Narrow transformations (one-to-one partition mapping) stay in memory and can be pipelined
○ Wide transformations (one-to-many partition mapping) shuffle data, with writes to disk
● Lazy evaluation
○ A streamlined plan of transformations
○ An optimized graph of computation instructions: the logical plan
● Actions
○ count(), collect(), take(n), top(), countByValue(), reduce(), fold(), aggregate(), foreach()
○ An action triggers computation against the plan of transformations
○ A single job, broken down into multiple stages and tasks executed across the cluster
○ Physical plan (clustered RDD manipulations)
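Continuing the sketch above (reusing the hypothetical people DataFrame), transformations only build a plan; an action triggers the job:

```python
# Sketch: lazy transformations vs. eager actions.
from pyspark.sql.functions import col

adults = people.where(col("age") >= 30)  # narrow transformation: no shuffle
by_age = adults.groupBy("age").count()   # wide transformation: shuffles data

# Nothing has executed yet; Spark has only built a logical plan.
by_age.explain()                         # inspect the optimized/physical plan

# Actions trigger a job, broken into stages and tasks across the cluster.
print(by_age.count())                    # action
rows = by_age.collect()                  # action: returns rows to the driver
```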
A Spark application
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do

Spark’s toolkit
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do

The Catalyst Optimizer
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do

The Structured API logical planning process
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do

The Physical Planning Process
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
PySpark
● Use cases:
○ exploratory data analysis at scale
○ building machine learning pipelines
○ creating ETLs for a data platform
○ streaming data pipelines
● Interactive pyspark shell
● Available as a PyPI package since Spark 2.2
● Differences from native Scala? With the structured APIs, the Catalyst engine optimizes the same query plan regardless of language
● pandas DataFrame & UDF integration (pandas itself runs single-node)
● Python is the fastest-growing language for data science & machine learning
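A hedged sketch of the pandas UDF integration mentioned above (column names and values are made up; assumes Spark 2.3+ with pyspark and pyarrow installed, and an existing SparkSession named spark):

```python
# Sketch: a vectorized (pandas) UDF. It operates on whole pandas Series
# per batch instead of row-by-row Python calls. Names are illustrative.
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("double")
def to_fahrenheit(celsius: pd.Series) -> pd.Series:
    # Applied to batches of rows as pandas Series: far less per-row
    # serialization overhead than a plain Python UDF.
    return celsius * 9.0 / 5.0 + 32.0

readings = spark.createDataFrame([(0.0,), (21.5,), (100.0,)], "celsius DOUBLE")
readings.select(to_fahrenheit(col("celsius")).alias("fahrenheit")).show()
```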
Ecosystem & Community
● Apache Spark website: https://spark.apache.org/
● Mailing lists: user@spark.apache.org, dev@spark.apache.org
● Community resources:
○ https://spark.apache.org/community.html
● Spark Packages: https://spark-packages.org/
● Spark Summit
● Local meetups:
○ @ Houston: https://www.meetup.com/Houston-Spark-Meetup/
“Software is like sex: it's better when it's free.”
- Linus Torvalds, creator of Linux (and Git)
Examples and demo with Andrew Sprague
https://github.com/spraguesy/spark-ncaa-bb
Q&A
