Building A Scalable Data Science Platform with R
Mario Inchiosa, Principal Software Engineer
Roni Burd, Principal Program Manager
R – What is it?
• Open Source “lingua franca”: Analytics, Computing, Modeling
• Global Community: Millions of users
• Ecosystem: 7000+ Algorithms, Test Data & Evaluations
Common R use cases
Categories: Sales & Marketing, Finance & Risk, Customer & Channel, Operations & Workforce
• Retail: Demand Forecasting, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition, Fraud Detection, Pricing Strategy, Personalization, Lifetime Customer Value, Product Segmentation, Store Location Demographics, Supply Chain Management, Inventory Management
• Financial Services: Customer Churn, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition, Fraud Detection, Risk & Compliance, Loan Defaults, Personalization, Lifetime Customer Value, Call Center Optimization, Pay for Performance
• Healthcare: Marketing Mix Optimization, Patient Acquisition, Fraud Detection, Bill Collection, Population Health, Patient Demographics, Operational Efficiency, Pay for Performance
• Manufacturing: Demand Forecasting, Marketing Mix Optimization, Pricing Strategy, Perf Risk Management, Supply Chain Optimization, Personalization, Remote Monitoring, Predictive Maintenance, Asset Management
R has some challenges
• Data Flows Overwhelm Open Source R
  – In-Memory Operation
  – Lack of Implicit Parallelism
  – Expensive Data Movement & Duplication
• Not enterprise ready
  – Inadequacy of Community Support
  – Lack of Guaranteed Support Timeliness
  – No SLAs or Support models
Source label on slide: IEEE Spectrum, July 2015
R Server: scale-out R, Enterprise Class!
• Enterprise Scale & Performance
  – Scales from workstations to large clusters
  – Can process terabytes of data faster
  – Growing portfolio of parallelized algorithms
• Enterprise Class Support
• Secure R Deployment/Operationalization
• Write Once Deploy Anywhere for multiple platforms
  – Cloud: HDInsight and Marketplace
  – On Prem: SQL Server, Hadoop (Hortonworks, Cloudera, MapR) and Teradata DB
• IDE for data scientists and developers (R Tools for Visual Studio)
Components: DistributedR, RTVS, DeployR, ScaleR, ConnectR
R Server: scale-out R, Enterprise Class!
• 100% compatible with open source R
  – Any code/package that works today with R will work in R Server
• Wide range of scalable and distributed R functions
  – Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict()
• Ability to parallelize any R function
  – Ideal for parameter sweeps, simulation or multiple runs
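To make the "parallelize any R function" point concrete, here is a minimal sketch of a parameter sweep. It is a stand-in that uses base R's `parallel` package rather than R Server's own API, so that it runs on any R installation. The function `fit_rmse`, the toy data, and the choice of two workers are all assumptions made for illustration.

```r
library(parallel)

# Hypothetical task: fit a polynomial of a given degree, return training RMSE.
fit_rmse <- function(degree, data) {
  model <- lm(y ~ poly(x, degree), data = data)
  sqrt(mean(residuals(model)^2))
}

# Toy data, simulated for illustration only.
set.seed(42)
toy <- data.frame(x = seq(-2, 2, length.out = 200))
toy$y <- toy$x^3 - toy$x + rnorm(200, sd = 0.3)

# The parameter values to sweep over.
degrees <- 1:6

# Start two local worker processes and evaluate one degree per task.
cl <- makeCluster(2)
rmse <- parSapply(cl, degrees, fit_rmse, data = toy)
stopCluster(cl)

sweep_result <- data.frame(degree = degrees, rmse = rmse)
print(sweep_result)
```

Each call to `fit_rmse` is independent of the others, which is what makes sweeps, simulations, and repeated runs easy to distribute.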
Parallelized & Distributed Algorithms
Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Machine Learning
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Variable Selection
• Stepwise Regression
Custom Parallelization
• rxDataStep
• rxExec
• PEMA-R API
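Many of the statistics listed above can be computed without holding all the data in memory: summarize each chunk, then combine the summaries. The sketch below illustrates that chunk-and-combine idea for mean and variance in base R. It is not the PEMA-R API or any RevoScaleR function. The helper names `chunk_stats` and `merge_stats`, the simulated data, and the chunk size are assumptions.

```r
# Summarize one chunk: count, mean, and sum of squared deviations.
chunk_stats <- function(x) {
  m <- mean(x)
  list(n = length(x), mean = m, m2 = sum((x - m)^2))
}

# Combine two chunk summaries into one (pairwise update of mean and M2).
merge_stats <- function(a, b) {
  n <- a$n + b$n
  delta <- b$mean - a$mean
  list(
    n = n,
    mean = a$mean + delta * b$n / n,
    m2 = a$m2 + b$m2 + delta^2 * a$n * b$n / n
  )
}

# Simulated data standing in for a file too large to load at once.
set.seed(1)
values <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(values, ceiling(seq_along(values) / 1000))

# Each chunk could be summarized on a different worker; only the small
# summaries need to be brought together.
total <- Reduce(merge_stats, lapply(chunks, chunk_stats))
chunked_mean <- total$mean
chunked_var <- total$m2 / (total$n - 1)
```

Because `merge_stats` only needs the three numbers from each side, the per-chunk work can run in any order or in parallel, and the result matches `mean(values)` and `var(values)` up to floating-point error.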
HDInsight + R Server: Managed Hadoop for advanced analytics
Hadoop: lingua franca for Big Data
• Spark (Standard)
  – Integrated notebooks experience
  – Upgraded to latest version (1.6)
• R Server (Premium)
  – Leverage R skills with massively scalable algorithms and statistical functions
  – Reuse existing R functions over multiple machines
Diagram labels: Spark (data manipulation), R Server (big computation), R, Spark and Hadoop, Blob Storage, Data Lake Storage, Notebooks, Scala/Java
R Server HDInsight Architecture
Diagram labels: RStudio Server (optional); R Server; Master R process on Edge node; Apache YARN and Spark; Worker R processes on Data nodes
Typical advanced analytics lifecycle
Stages: Prepare, Model, Operationalize
Airline Arrival Delay Prediction Demo
• Clean/Join – Using SparkR from R Server
• Train/Score/Evaluate – Scalable R Server functions
• Deploy/Consume – Using AzureML from R Server
Airline data set
• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection
• >20 years of data
• 300+ Airports
• Every carrier, every commercial flight
• http://www.transtats.bts.gov
Weather data set
• Hourly land-based weather observations from NOAA
• > 2,000 weather stations
• http://www.ncdc.noaa.gov/orders/qclcd/
Provisioning a cluster with R Server
Scaling a cluster
Clean and Join using SparkR in R Server
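The demo performs this step with SparkR on the cluster. As a rough, small-scale illustration of the same clean-then-join logic, the sketch below uses base R's `merge` on tiny made-up data frames. The column names (`origin`, `hour`, `arr_delay`, `airport`, `visibility`) and all values are hypothetical and are not taken from the actual airline or weather files.

```r
# Made-up flight records; NA marks missing values to be cleaned out.
flights <- data.frame(
  origin = c("SEA", "SEA", "JFK", "ORD", NA),
  hour = c(8, 9, 8, 17, 12),
  arr_delay = c(5, 42, NA, 18, 3)
)

# Made-up hourly weather observations keyed by airport and hour.
weather <- data.frame(
  airport = c("SEA", "SEA", "JFK", "ORD"),
  hour = c(8, 9, 8, 17),
  visibility = c(10, 2, 8, 6)
)

# Clean: drop flights missing the join key or the value to be predicted.
flights_clean <- flights[!is.na(flights$origin) & !is.na(flights$arr_delay), ]

# Join: attach the weather at the origin airport for the same hour.
joined <- merge(
  flights_clean, weather,
  by.x = c("origin", "hour"),
  by.y = c("airport", "hour")
)
print(joined)
```

The result has one row per cleaned flight with the matching weather columns appended, which is the shape a modeling step would consume.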
Train, Score, and Evaluate using R Server
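The demo uses R Server's scalable functions for this step. The sketch below shows the same train, score, evaluate sequence at small scale with base R's `glm` as a stand-in. The data is simulated, and the variable names (`late`, `visibility`, `dep_delay`), the 80/20 split, and the `auc` helper are assumptions made for illustration.

```r
# Simulated data standing in for the joined flight and weather table.
set.seed(7)
n <- 1000
visibility <- runif(n, 0, 10)
dep_delay <- rexp(n, rate = 1 / 10)
p_late <- plogis(-1 + 0.08 * dep_delay - 0.2 * visibility)
d <- data.frame(
  late = rbinom(n, 1, p_late),
  visibility = visibility,
  dep_delay = dep_delay
)

# Hold out 20% of the rows for evaluation.
train_idx <- sample(n, 0.8 * n)
train <- d[train_idx, ]
holdout <- d[-train_idx, ]

# Train: logistic regression on the training rows.
model <- glm(late ~ visibility + dep_delay, data = train, family = binomial())

# Score: predicted probability of a late arrival for held-out rows.
holdout$score <- predict(model, newdata = holdout, type = "response")

# Evaluate: area under the ROC curve, computed from ranks.
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
holdout_auc <- auc(holdout$score, holdout$late)
print(holdout_auc)
```

Evaluating on rows the model never saw keeps the reported AUC from being inflated by overfitting.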
Publish Web Service from R
Demo Technologies
• HDInsight Premium: Hadoop cluster
• Spark on YARN: distributed computing
• R Server: R interpreter
• SparkR: data manipulation functions
• RevoScaleR: statistical & machine learning functions
• AzureML R package and Azure ML web service
Building a genetic disease risk application with R
Data
• Public genome data from 1000 Genomes
• About 2 TB of raw data
Processing
• VariantTools R package (Bioconductor)
• Match against NHGRI GWAS catalog
Analytics
• Disease Risk
• Ancestry
Presentation
• Expose as Web Service APIs
• Phone app, Web page, Enterprise applications
Platform
• HDInsight Hadoop (8 clusters)
• 1500 cores, 4 data centers
• Microsoft R Server
Diagram labels: BAM, VariantTools, GWAS
The Four Transformational Trends
• Cloud computing: 2011 to 2016, 5x increase
• Data science: universities filling 300,000 US talent gap
• Big data: 90% of the data in the world today has been created in the last two years alone
• Open source: including R, Linux, Hadoop
microsoft.com/hdinsight microsoft.com/r-server


Editor's Notes

  • #2 Abstract: Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Services API. Come learn how to use the magic of the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
  • #21 Disease Risk Prediction Pipeline
      – Alignment of unaligned short reads: done beforehand using MSR’s SNAP
      – Variant Calling: pick most likely letter (SNP) from ~30 overlapping reads
      – Risk Scoring: look up disease risks for each SNP in GWAS table and aggregate by disease
    The Thousand Genomes Project
      – Raw sequence data: unaligned short reads
    The Bioconductor R Packages
      – VariantTools
    Genome-Wide Association Studies (GWAS)
      – Associations between SNPs (i.e. genetic mutations) and disease risk
      – Compiled by the National Human Genome Research Institute
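The risk scoring step in the note above amounts to a table lookup followed by a grouped aggregation. The sketch below shows one way that could look in base R. The note does not say how risks are aggregated, so multiplying odds ratios (summing their logs) is an assumption here, and the SNP identifiers, disease names, and odds ratios are made-up placeholders rather than GWAS catalog entries.

```r
# Placeholder variant calls for one individual.
variants <- data.frame(snp = c("snp_a", "snp_b", "snp_c", "snp_d"))

# Placeholder association table: one row per SNP and disease pair.
gwas <- data.frame(
  snp = c("snp_a", "snp_b", "snp_c", "snp_e"),
  disease = c("disease_x", "disease_x", "disease_y", "disease_y"),
  odds_ratio = c(1.2, 1.5, 0.8, 2.0)
)

# Look up: keep only the called SNPs that appear in the association table.
hits <- merge(variants, gwas, by = "snp")

# Aggregate by disease: sum log odds ratios, then convert back.
# This aggregation rule is an assumption, not taken from the talk.
hits$log_or <- log(hits$odds_ratio)
risk <- aggregate(log_or ~ disease, data = hits, FUN = sum)
risk$combined_or <- exp(risk$log_or)
print(risk)
```

SNPs with no catalog entry (`snp_d`) and catalog entries the individual does not carry (`snp_e`) both drop out of the inner join, so only matched pairs contribute to the score.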