Data Science at Scale with Apache Spark and Zeppelin Notebook
Carolyn Duby, Big Data Solutions Architect, Hortonworks
About Carolyn Duby • Big Data Solutions Architect • High performance, data-intensive systems • Data science • ScB, ScM Computer Science, Brown University • LinkedIn: https://www.linkedin.com/in/carolynduby/ • Twitter: @carolynduby • GitHub: carolynduby • Hortonworks – Innovation through data – Enterprise ready, 100% open source, modern data platforms – Engineering, Technical Support, Professional Services, Training
https://www.meetup.com/futureofdata-boston
Agenda • Moving beyond the desktop to distributed computing – Apache Spark • Recording your results and sharing them with others – Apache Zeppelin
Are you Outgrowing your Desktop? • Analyzing and training with only a portion of the available data • Analysis or training too slow • Out of memory • Data accumulates over time
How do you collaborate and record results? • Show your work – Effective peer review – Answer questions more quickly – Correct errors – Apply methods to other data • Increased quality and respect for results • Justify business decisions
Data Science at Scale with Apache Open Source • Apache Spark version 2.1 – Cleaning and analysis of large data sets – http://spark.apache.org • Apache Zeppelin Notebook 0.7.0 – Capture and share analysis – Visualize data for exploration and results – https://zeppelin.apache.org
Apache Spark • Distributed processing efficiently crunches large data sets – Optimized – Horizontally scalable with multi-tenancy – Fault tolerant • One platform for streaming, cleaning, analyzing • Elegant APIs – Scala, Python, Java, R • Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
Spark Libraries • Same API for all data sources • SQL - http://spark.apache.org/sql/ – Access structured data and combine with other sources • MLlib - http://spark.apache.org/mllib/ – Machine learning for training models and predicting • GraphX - http://spark.apache.org/graphx/ – Graph processing and connectivity algorithms • Streaming - http://spark.apache.org/streaming/ – Complex event processing and data ingest
Zeppelin • Notebook – Combine Markdown, shell, Spark, and SQL paragraphs in the same notebook – Easily integrate with Spark in different languages – Visualize data using graphs and pivot charts – Share notebooks or paragraphs
ARCHITECTURE
[Diagram: the client browser connects to Zeppelin, which hosts the Spark driver; the driver coordinates with a Spark application master running in a YARN container, and the work executes as tasks on Spark executors, each in its own YARN container]
Getting Started • Use a distribution – Curated set of compatible open source projects • Sandbox - single node cluster in a VM or Azure – https://hortonworks.com/products/sandbox/ • Hortonworks Community Connection – http://community.hortonworks.com • On premises – Use Apache Ambari to manage on-premises physical hardware • Cloud – Automated provisioning with Cloudbreak (https://github.com/hortonworks/cloudbreak) – AWS, Azure, Google Cloud
Zeppelin Basics • Notes are composed of paragraphs • Paragraph contains code or markdown – Specify the interpreter with %<interpreter name>, or leave blank for the default – Enter commands – Click the play button to run the code on the cluster – Results display in the paragraph • Code and results can be shown or hidden
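For example, a minimal Spark paragraph (the content is illustrative, not from the slides):

```scala
%spark
// A one-line Scala paragraph; click the play button to run it on the cluster.
// The value of the last expression is displayed below the paragraph.
spark.version
```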
[Screenshot: Zeppelin UI — create/open note, note tools, paragraph tools, and user/note configuration; a Markdown interpreter paragraph (%md) with its editor hidden, and a shell interpreter paragraph (%sh) with its editor shown]
Markdown
[Screenshot: a %md paragraph — # headers, hyperlinks, and block quotes render in place; controls to show/hide the editor, run the paragraph, or run all paragraphs]
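A small %md paragraph exercising the features called out above (the content itself is illustrative):

```markdown
%md
# Chicago Crimes Analysis
A [hyperlink](https://www.kaggle.com/currie32/crimes-in-chicago) to the dataset.
> Block quotes work too.
```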
Example • Crimes in Chicago Kaggle Dataset • Interesting opportunities for time series and prediction https://www.kaggle.com/currie32/crimes-in-chicago
Data PIPELINE: Acquire (Kaggle) → Common Store (raw CSV zip) → Clean (ORC) → Explore → Analyze
Optimizing Data Cleaning • Keep a raw copy – Web sites go away, remove data, change links and interfaces • Store the clean data – Saves time each time you analyze • Use a standard format (Optimized Row Columnar (ORC), Parquet, etc.) – Query data with Hive • Shared location if security and privacy requirements allow – Collaborate by sharing data with others
Acquire Dataset (pipeline stage: Acquire)
• %sh interpreter – Bash shell • Show intermediate results for debugging
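A sketch of an acquire paragraph; the file names and HDFS paths are assumptions, not the slide's exact commands:

```bash
%sh
# Unzip the Kaggle download and stage the CSVs in HDFS.
# Paths are hypothetical: adjust to your environment.
unzip -o /tmp/crimes-in-chicago.zip -d /tmp/crimes
hdfs dfs -mkdir -p /data/crimes/raw
hdfs dfs -put -f /tmp/crimes/*.csv /data/crimes/raw
# Intermediate result for debugging: confirm the files landed.
hdfs dfs -ls /data/crimes/raw
```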
CLEAN DATASET (pipeline stage: Clean)
Switching to Spark: Scala code
Spark is fast but lazy • Transformations – Specify which data to read – Modify data • Actions – Show data – Write data
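A small illustration of the distinction (paths and column names are assumptions based on the Kaggle schema):

```scala
%spark
// Transformations are lazy: these lines build a plan but read nothing yet.
val crimes = spark.read.option("header", "true").csv("/data/crimes/raw")
val thefts = crimes.filter(crimes("Primary Type") === "THEFT")

// Actions force execution: only now does Spark read the files and run the filter.
thefts.show(5)
println(thefts.count())
```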
Header and case data on the same CSV lines
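The slide's screenshot is not recoverable here; as one hedged sketch of the cleanup, assuming stray header text has to be dropped from among the case rows (the ID column name is an assumption):

```scala
%spark
// Rows whose ID field holds the literal string "ID" are header text,
// not case data; keep only real records.
val raw = spark.read.option("header", "true").csv("/data/crimes/raw")
val cases = raw.filter(raw("ID") =!= "ID")
```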
Apply numeric types to the clean data; add some columns to make aggregations easier
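Continuing in the same session (Zeppelin paragraphs share the Spark interpreter), a sketch of this step; the Kaggle column names and timestamp format are assumptions:

```scala
%spark
import org.apache.spark.sql.functions._

// Parse the timestamp and cast numeric fields. Spark 2.1: use unix_timestamp,
// since to_timestamp only arrives in 2.2.
val typed = cases
  .withColumn("ts", unix_timestamp(col("Date"), "MM/dd/yyyy hh:mm:ss a").cast("timestamp"))
  .withColumn("Beat", col("Beat").cast("int"))
  .withColumn("Ward", col("Ward").cast("int"))

// Derived columns that make time-based aggregations easier.
val clean = typed
  .withColumn("year", year(col("ts")))
  .withColumn("month", month(col("ts")))
  .withColumn("day", dayofmonth(col("ts")))
  .withColumn("hour", hour(col("ts")))
```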
Create a table for SQL; save the clean data as ORC
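A minimal sketch, reusing the clean DataFrame from above (the ORC path is an assumption):

```scala
%spark
// Register a temporary view so %sql paragraphs can query the data,
// and persist the clean copy as ORC so later sessions skip the cleanup.
clean.createOrReplaceTempView("crimes")
clean.write.mode("overwrite").orc("/data/crimes/clean")
```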
EXPLORE DATASET (pipeline stage: Explore)
Read clean data and create table
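A sketch of this step, assuming the ORC path used above:

```scala
%spark
// Later sessions start from the clean ORC copy instead of re-cleaning.
val crimes = spark.read.orc("/data/crimes/clean")
crimes.createOrReplaceTempView("crimes")
```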
[Screenshot: specify the query, select a visualization, and configure it — e.g., which fields go on the X and Y axes]
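An illustrative %sql paragraph of the kind shown (not the slide's exact query), using the derived year column assumed earlier:

```sql
%sql
-- Crimes per year; pick the bar chart and put year on X, n on Y.
SELECT year, COUNT(*) AS n
FROM crimes
GROUP BY year
ORDER BY year
```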
Hover to see values
ANALYZE DATASET (pipeline stage: Analyze)
CREATE DATA TO FIT POISSON
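One way to build such a dataset (a sketch, not the slide's exact code): count incidents per day and assemble a feature vector, since MLlib estimators expect a vector column. The feature choice here is an assumption for illustration.

```scala
%spark
import org.apache.spark.ml.feature.VectorAssembler

// Daily incident counts: a natural candidate for a Poisson model.
val daily = spark.sql(
  "SELECT year, month, day, COUNT(*) AS n FROM crimes GROUP BY year, month, day")

// Assemble the predictor column(s) into the features vector MLlib expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("month"))
  .setOutputCol("features")
val poissonData = assembler.transform(daily)
```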
Fit Poisson Model
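A hedged sketch using Spark ML's generalized linear regression with a Poisson family, continuing from the data above:

```scala
%spark
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val Array(train, test) = poissonData.randomSplit(Array(0.8, 0.2), seed = 42)

// A GLM with a Poisson family and log link models the daily counts.
val glm = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setLabelCol("n")
  .setFeaturesCol("features")
val model = glm.fit(train)
```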
Evaluate Model
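One way to evaluate (a sketch, assuming the model and test split above): training diagnostics from the GLM summary, plus held-out error.

```scala
%spark
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Training diagnostics from the GLM summary...
println(s"AIC: ${model.summary.aic}, deviance: ${model.summary.deviance}")

// ...and held-out error on the test split.
val predictions = model.transform(test)
val rmse = new RegressionEvaluator()
  .setLabelCol("n")
  .setMetricName("rmse")
  .evaluate(predictions)
println(s"Test RMSE: $rmse")
```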
Model Pipelines • https://spark.apache.org/docs/2.1.0/ml-pipeline.html
[Diagram: a Pipeline chains Transformers and an Estimator; fitting it on training data produces a PipelineModel, which transforms test data into predictions]
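A minimal sketch of the same fit expressed as a Pipeline, reusing the assembler and GLM from above:

```scala
%spark
import org.apache.spark.ml.Pipeline

val Array(trainDaily, testDaily) = daily.randomSplit(Array(0.8, 0.2), seed = 42)

// Chain the assembler (Transformer) and GLM (Estimator) into one Estimator.
val pipeline = new Pipeline().setStages(Array(assembler, glm))

// fit() on training data yields a PipelineModel (a Transformer)...
val pipelineModel = pipeline.fit(trainDaily)
// ...which turns raw test rows into predictions in one call.
val scored = pipelineModel.transform(testDaily)
```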
Python – Looks like Scala
R
Tips and Tricks • Use val for variables used across paragraphs – Vars can yield unpredictable results when run out of order • Break up big notebooks – Store intermediate results – Avoid reloading and recalculating the same values • Verify your notebook by running all paragraphs
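A minimal illustration of the val tip (the path is hypothetical):

```scala
%spark
// val fails loudly on reassignment, so a paragraph run out of order
// cannot silently overwrite a value another paragraph depends on.
val cleanPath = "/data/crimes/clean"
// var cleanPath = ...  // avoid: whichever paragraph ran last wins silently
```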
Sharing Notebooks • Share a link to the notebook or paragraph – Readers access your Zeppelin server – Use logins and permissions • Export to JSON and save to a shared file – Readers get the JSON from the shared file (GitHub, cloud, etc.) – Import to their Zeppelin server • Sync your notebooks to Zeppelin Hub (https://www.zepl.com) – Share the Zeppelin Hub link with readers – Free version for small teams
Questions and THANK YOU!
REFERENCES
Reproducible Research • Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9(10): e1003285. doi:10.1371/journal.pcbi.1003285 – http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003285&type=printable
Zeppelin and Spark • Spark – https://dzone.com/articles/try-the-latest-innovations-in-apache-spark-and-apa – https://hortonworks.com/hadoop-tutorial/learning-spark-zeppelin/ – https://spark.apache.org/docs/2.1.0/ml-pipeline.html • Example Notebooks – https://github.com/hortonworks-gallery/zeppelin-notebooks
Zeppelin Interpreters • Markdown syntax – http://daringfireball.net/projects/markdown/syntax
Example • Chicago Crimes Data Set – https://www.kaggle.com/currie32/crimes-in-chicago • Example notebooks – https://github.com/carolynduby/ODSC2017
www.globalbigdataconference.com • Twitter: @bigdataconf