Data Science at Scale with Apache Spark and Zeppelin Notebook
Carolyn Duby, Big Data Solutions Architect, Hortonworks
About Carolyn Duby • Big Data Solutions Architect • High performance, data-intensive systems • Data science • ScB, ScM Computer Science, Brown University • LinkedIn: https://www.linkedin.com/in/carolynduby/ • Twitter: @carolynduby • GitHub: carolynduby • Hortonworks – Innovation through data – Enterprise ready, 100% open source, modern data platforms – Engineering, Technical Support, Professional Services, Training
https://www.meetup.com/futureofdata-boston
Agenda • Moving beyond the desktop to distributed computing – Apache Spark • Recording your results and sharing them with others – Apache Zeppelin
Are you Outgrowing your Desktop? • Analyzing and training with only a portion of the available data • Analysis or training too slow • Out of memory • Data accumulates over time
How do you collaborate and record results? • Show your work – Effective peer review – Answer questions more quickly – Correct errors – Apply methods to other data • Increased quality and respect for results • Justify business decisions
Data Science at Scale with Apache Open Source • Apache Spark version 2.1 – Cleaning and analysis of large data sets – http://spark.apache.org • Apache Zeppelin Notebook 0.7.0 – Capture and share analysis – Visualize data for exploration and results – https://zeppelin.apache.org
Apache Spark • Distributed processing efficiently crunches large data sets – Optimized – Horizontally scalable with multi-tenancy – Fault tolerant • One platform for streaming, cleaning, analyzing • Elegant APIs – Scala, Python, Java, R • Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
Spark Libraries • Same API for all data sources • SQL - http://spark.apache.org/sql/ – Access structured data and combine with other sources • MLlib - http://spark.apache.org/mllib/ – Machine learning for training models and predicting • GraphX - http://spark.apache.org/graphx/ – Graph processing and connectivity algorithms • Streaming - http://spark.apache.org/streaming/ – Complex event processing and data ingest
Zeppelin • Notebook – Combine Markdown, shell, Spark, and SQL paragraphs in the same notebook – Easily integrate with Spark in different languages – Visualize data using graphs and pivot charts – Share notebooks or paragraphs
ARCHITECTURE
[Diagram: the client browser connects to Zeppelin, which hosts the Spark driver; the driver coordinates with a Spark application master running in a YARN container, and the work executes as tasks on Spark executors, each in its own YARN container]
Getting Started • Use a distribution – Curated set of compatible open source projects • Sandbox - single node cluster in a VM or Azure – https://hortonworks.com/products/sandbox/ • Hortonworks Community Connection – http://community.hortonworks.com • On premises – Use Apache Ambari to manage on-premises physical hardware • Cloud – Automated provisioning with Cloudbreak (https://github.com/hortonworks/cloudbreak) – AWS, Azure, Google Cloud
Zeppelin Basics • Notes are composed of paragraphs • Paragraph contains code or markdown – Specify the interpreter with %<interpreter name>, or leave blank for the default – Enter commands – Click the play button to run the code on the cluster – Results display in the paragraph • Code and results can be shown or hidden
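For example, a minimal Spark paragraph (the content is illustrative, not from the slides):

```scala
%spark
// A one-line Scala paragraph; click the play button to run it on the cluster.
// The value of the last expression is displayed below the paragraph.
spark.version
```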
[Screenshot: Zeppelin UI — create/open note, note tools, paragraph tools, and user/note configuration; a Markdown interpreter paragraph (%md) with its editor hidden, and a shell interpreter paragraph (%sh) with its editor shown]
Markdown
[Screenshot: a %md paragraph — # headers, hyperlinks, and block quotes render in place; controls to show/hide the editor, run the paragraph, or run all paragraphs]
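A small %md paragraph exercising the features called out above (the content itself is illustrative):

```markdown
%md
# Chicago Crimes Analysis
A [hyperlink](https://www.kaggle.com/currie32/crimes-in-chicago) to the dataset.
> Block quotes work too.
```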
Example • Crimes in Chicago Kaggle Dataset • Interesting opportunities for time series and prediction https://www.kaggle.com/currie32/crimes-in-chicago
Data PIPELINE: Acquire (Kaggle) → Common Store (raw CSV zip) → Clean (ORC) → Explore → Analyze
Optimizing Data Cleaning • Keep a raw copy – Web sites go away, remove data, change links and interfaces • Store the clean data – Saves time each time you analyze • Use a standard format (Optimized Row Columnar (ORC), Parquet, etc.) – Query data with Hive • Shared location if security and privacy requirements allow – Collaborate by sharing data with others
Acquire Dataset (pipeline stage: Acquire)
• %sh interpreter – Bash shell • Show intermediate results for debugging
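A sketch of an acquire paragraph; the file names and HDFS paths are assumptions, not the slide's exact commands:

```bash
%sh
# Unzip the Kaggle download and stage the CSVs in HDFS.
# Paths are hypothetical: adjust to your environment.
unzip -o /tmp/crimes-in-chicago.zip -d /tmp/crimes
hdfs dfs -mkdir -p /data/crimes/raw
hdfs dfs -put -f /tmp/crimes/*.csv /data/crimes/raw
# Intermediate result for debugging: confirm the files landed.
hdfs dfs -ls /data/crimes/raw
```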
CLEAN DATASET (pipeline stage: Clean)
Switching to Spark: Scala code
Spark is fast but lazy • Transformations – Specify which data to read – Modify data • Actions – Show data – Write data
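A small illustration of the distinction (paths and column names are assumptions based on the Kaggle schema):

```scala
%spark
// Transformations are lazy: these lines build a plan but read nothing yet.
val crimes = spark.read.option("header", "true").csv("/data/crimes/raw")
val thefts = crimes.filter(crimes("Primary Type") === "THEFT")

// Actions force execution: only now does Spark read the files and run the filter.
thefts.show(5)
println(thefts.count())
```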
Header and case data on the same CSV lines
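The slide's screenshot is not recoverable here; as one hedged sketch of the cleanup, assuming stray header text has to be dropped from among the case rows (the ID column name is an assumption):

```scala
%spark
// Rows whose ID field holds the literal string "ID" are header text,
// not case data; keep only real records.
val raw = spark.read.option("header", "true").csv("/data/crimes/raw")
val cases = raw.filter(raw("ID") =!= "ID")
```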
Apply numeric types to the clean data; add some columns to make aggregations easier
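Continuing in the same session (Zeppelin paragraphs share the Spark interpreter), a sketch of this step; the Kaggle column names and timestamp format are assumptions:

```scala
%spark
import org.apache.spark.sql.functions._

// Parse the timestamp and cast numeric fields. Spark 2.1: use unix_timestamp,
// since to_timestamp only arrives in 2.2.
val typed = cases
  .withColumn("ts", unix_timestamp(col("Date"), "MM/dd/yyyy hh:mm:ss a").cast("timestamp"))
  .withColumn("Beat", col("Beat").cast("int"))
  .withColumn("Ward", col("Ward").cast("int"))

// Derived columns that make time-based aggregations easier.
val clean = typed
  .withColumn("year", year(col("ts")))
  .withColumn("month", month(col("ts")))
  .withColumn("day", dayofmonth(col("ts")))
  .withColumn("hour", hour(col("ts")))
```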
Create a table for SQL; save the clean data as ORC
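A minimal sketch, reusing the clean DataFrame from above (the ORC path is an assumption):

```scala
%spark
// Register a temporary view so %sql paragraphs can query the data,
// and persist the clean copy as ORC so later sessions skip the cleanup.
clean.createOrReplaceTempView("crimes")
clean.write.mode("overwrite").orc("/data/crimes/clean")
```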
EXPLORE DATASET (pipeline stage: Explore)
Read clean data and create table
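A sketch of this step, assuming the ORC path used above:

```scala
%spark
// Later sessions start from the clean ORC copy instead of re-cleaning.
val crimes = spark.read.orc("/data/crimes/clean")
crimes.createOrReplaceTempView("crimes")
```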
[Screenshot: specify the query, select a visualization, and configure it — e.g., which fields go on the X and Y axes]
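An illustrative %sql paragraph of the kind shown (not the slide's exact query), using the derived year column assumed earlier:

```sql
%sql
-- Crimes per year; pick the bar chart and put year on X, n on Y.
SELECT year, COUNT(*) AS n
FROM crimes
GROUP BY year
ORDER BY year
```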
Hover to see values
ANALYZE DATASET (pipeline stage: Analyze)
CREATE DATA TO FIT POISSON
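One way to build such a dataset (a sketch, not the slide's exact code): count incidents per day and assemble a feature vector, since MLlib estimators expect a vector column. The feature choice here is an assumption for illustration.

```scala
%spark
import org.apache.spark.ml.feature.VectorAssembler

// Daily incident counts: a natural candidate for a Poisson model.
val daily = spark.sql(
  "SELECT year, month, day, COUNT(*) AS n FROM crimes GROUP BY year, month, day")

// Assemble the predictor column(s) into the features vector MLlib expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("month"))
  .setOutputCol("features")
val poissonData = assembler.transform(daily)
```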
Fit Poisson Model
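A hedged sketch using Spark ML's generalized linear regression with a Poisson family, continuing from the data above:

```scala
%spark
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val Array(train, test) = poissonData.randomSplit(Array(0.8, 0.2), seed = 42)

// A GLM with a Poisson family and log link models the daily counts.
val glm = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setLabelCol("n")
  .setFeaturesCol("features")
val model = glm.fit(train)
```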
Evaluate Model
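One way to evaluate (a sketch, assuming the model and test split above): training diagnostics from the GLM summary, plus held-out error.

```scala
%spark
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Training diagnostics from the GLM summary...
println(s"AIC: ${model.summary.aic}, deviance: ${model.summary.deviance}")

// ...and held-out error on the test split.
val predictions = model.transform(test)
val rmse = new RegressionEvaluator()
  .setLabelCol("n")
  .setMetricName("rmse")
  .evaluate(predictions)
println(s"Test RMSE: $rmse")
```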
Model Pipelines • https://spark.apache.org/docs/2.1.0/ml-pipeline.html
[Diagram: a Pipeline chains Transformers and an Estimator; fitting it on training data produces a PipelineModel, which transforms test data into predictions]
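A minimal sketch of the same fit expressed as a Pipeline, reusing the assembler and GLM from above:

```scala
%spark
import org.apache.spark.ml.Pipeline

val Array(trainDaily, testDaily) = daily.randomSplit(Array(0.8, 0.2), seed = 42)

// Chain the assembler (Transformer) and GLM (Estimator) into one Estimator.
val pipeline = new Pipeline().setStages(Array(assembler, glm))

// fit() on training data yields a PipelineModel (a Transformer)...
val pipelineModel = pipeline.fit(trainDaily)
// ...which turns raw test rows into predictions in one call.
val scored = pipelineModel.transform(testDaily)
```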
Python – Looks like Scala
R
Tips and Tricks • Use val for variables used across paragraphs – Vars can yield unpredictable results when run out of order • Break up big notebooks – Store intermediate results – Avoid reloading and recalculating the same values • Verify your notebook by running all paragraphs
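A minimal illustration of the val tip (the path is hypothetical):

```scala
%spark
// val fails loudly on reassignment, so a paragraph run out of order
// cannot silently overwrite a value another paragraph depends on.
val cleanPath = "/data/crimes/clean"
// var cleanPath = ...  // avoid: whichever paragraph ran last wins silently
```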
Sharing Notebooks • Share a link to the notebook or paragraph – Readers access your Zeppelin server – Use logins and permissions • Export to JSON and save to a shared file – Readers get the JSON from the shared file (GitHub, cloud, etc.) – Import to their Zeppelin server • Sync your notebooks to Zeppelin Hub (https://www.zepl.com) – Share the Zeppelin Hub link with readers – Free version for small teams
Questions and THANK YOU!
REFERENCES
Reproducible Research • Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9(10): e1003285. doi:10.1371/journal.pcbi.1003285 – http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003285&type=printable
Zeppelin and Spark • Spark – https://dzone.com/articles/try-the-latest-innovations-in-apache-spark-and-apa – https://hortonworks.com/hadoop-tutorial/learning-spark-zeppelin/ – https://spark.apache.org/docs/2.1.0/ml-pipeline.html • Example Notebooks – https://github.com/hortonworks-gallery/zeppelin-notebooks
Zeppelin Interpreters • Markdown syntax – http://daringfireball.net/projects/markdown/syntax
Example • Chicago Crimes Data Set – https://www.kaggle.com/currie32/crimes-in-chicago • Example notebooks – https://github.com/carolynduby/ODSC2017
www.globalbigdataconference.com • Twitter: @bigdataconf