Building a scalable data science platform with R

Building a Scalable Data Science Platform with R
Mario Inchiosa, Principal Software Engineer
Roni Burd, Principal Program Manager

R – What is it?
• Open source “lingua franca”: analytics, computing, modeling
• Global community: millions of users
• Ecosystem: 7000+ algorithms, test data & evaluations

Common R use cases (by vertical)

Retail
• Sales & Marketing: Demand Forecasting, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Pricing Strategy
• Customer & Channel: Personalization, Lifetime Customer Value, Product Segmentation
• Operations & Workforce: Store Location Demographics, Supply Chain Management, Inventory Management

Financial Services
• Sales & Marketing: Customer Churn, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Risk & Compliance, Loan Defaults
• Customer & Channel: Personalization, Lifetime Customer Value
• Operations & Workforce: Call Center Optimization, Pay for Performance

Healthcare
• Sales & Marketing: Marketing Mix Optimization, Patient Acquisition
• Finance & Risk: Fraud Detection, Bill Collection
• Customer & Channel: Population Health, Patient Demographics
• Operations & Workforce: Operational Efficiency, Pay for Performance

Manufacturing
• Sales & Marketing: Demand Forecasting, Marketing Mix Optimization
• Finance & Risk: Pricing Strategy, Perf Risk Management
• Customer & Channel: Supply Chain Optimization, Personalization
• Operations & Workforce: Remote Monitoring, Predictive Maintenance, Asset Management

R has some challenges
IEEE Spectrum, July 2015
• Data flows overwhelm open source R
  – In-memory operation
  – Lack of implicit parallelism
  – Expensive data movement & duplication
• Not enterprise ready
  – Inadequacy of community support
  – Lack of guaranteed support timeliness
  – No SLAs or support models

R Server: scale-out R, Enterprise Class!
• Enterprise Scale & Performance
  – Scales from workstations to large clusters
  – Can process terabytes of data faster
  – Growing portfolio of parallelized algorithms
• Enterprise Class Support
• Secure R Deployment/Operationalization
• Write Once Deploy Anywhere for multiple platforms
  – Cloud: HDInsight and Marketplace
  – On Prem: SQL Server, Hadoop (Hortonworks, Cloudera, MapR) and Teradata DB
• IDE for data scientists and developers (R Tools for Visual Studio)
Components: DistributedR, RTVS, DeployR, ScaleR, ConnectR

R Server: scale-out R, Enterprise Class!
• 100% compatible with open source R
  – Any code/package that works today with R will work in R Server
• Wide range of scalable and distributed R functions
  – Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict()
• Ability to parallelize any R function
  – Ideal for parameter sweeps, simulation or multiple runs
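To make the "parameter sweep" bullet concrete, here is a minimal sketch of running one R function over several parameter values in parallel. It uses the `parallel` package that ships with open source R as a stand-in; R Server's own mechanism (rxExec) is not shown, and the task itself (`fit_rmse`, polynomial fits to the built-in `mtcars` data) is a hypothetical example, not something from the talk.

```r
library(parallel)

# Hypothetical task: fit a polynomial of the given degree to the built-in
# mtcars data and return the in-sample root mean squared error.
fit_rmse <- function(degree) {
  fit <- lm(mpg ~ poly(hp, degree), data = mtcars)
  sqrt(mean(residuals(fit)^2))
}

# The parameter values to sweep over.
degrees <- 1:4

# Start two local worker processes, evaluate one degree per task, then
# shut the workers down again.
cl <- makeCluster(2)
rmse <- parSapply(cl, degrees, fit_rmse)
stopCluster(cl)

names(rmse) <- paste0("degree_", degrees)
print(rmse)
```

Each call is independent of the others, which is what makes sweeps, simulations and repeated runs easy to spread across cores or machines.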

Parallelized & Distributed Algorithms

Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM)
  – Exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie
  – Standard link functions: cauchit, identity, log, logit, probit
  – User defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Predictions/scoring for models
• Residuals for all models

Cluster Analysis
• K-Means

Machine Learning
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Variable Selection
• Stepwise Regression

Custom Parallelization
• rxDataStep
• rxExec
• PEMA-R API
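Many of the statistics listed above can be computed without holding all the data in memory: each chunk of rows is reduced to a small partial result, and the partial results are combined at the end. The sketch below illustrates that chunk-and-combine idea for the mean and variance in plain R. It is an assumption-laden illustration of the general pattern, not RevoScaleR's implementation; the helper names (`chunk_stats`, `combine_stats`, `finalize_stats`) and the simulated data are hypothetical.

```r
# Reduce one chunk of values to a small, combinable summary.
chunk_stats <- function(x) {
  c(n = length(x), sum = sum(x), sumsq = sum(x^2))
}

# Partial summaries combine by simple addition, so chunks can be
# processed in any order or on different workers.
combine_stats <- function(a, b) a + b

# Turn the combined summary into the mean and the sample variance.
# (The sum-of-squares formula is simple but can lose precision on
# badly scaled data; it is adequate for this illustration.)
finalize_stats <- function(s) {
  m <- s[["sum"]] / s[["n"]]
  v <- (s[["sumsq"]] - s[["n"]] * m^2) / (s[["n"]] - 1)
  c(mean = m, var = v)
}

# Simulated data, split into 10 chunks of 1000 values each.
set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))

partials <- lapply(chunks, chunk_stats)   # one pass per chunk
total <- Reduce(combine_stats, partials)  # merge the partial results
result <- finalize_stats(total)
print(result)
```

Because only the three-number summary travels between steps, the same pattern works whether the chunks are blocks of a file on one workstation or partitions spread over cluster nodes.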

HDInsight + R Server: Managed Hadoop for advanced analytics
[Diagram labels: Spark (data manipulation); R Server (big computation); R; Spark and Hadoop; Blob Storage; Data Lake Storage; Notebooks; Scala/Java; Hadoop: lingua franca for BigData]
• Spark (Standard)
  – Integrated notebooks experience
  – Upgraded to latest Version 1.6
• R Server (Premium)
  – Leverage R skills with massively scalable algorithms and statistical functions
  – Reuse existing R functions over multiple machines

R Server HDInsight Architecture
[Diagram labels: RStudio Server (optional); R Server; Master R process on Edge node; Apache YARN and Spark; Worker R processes on Data nodes]

Typical advanced analytics lifecycle
Prepare → Model → Operationalize

Airline Arrival Delay Prediction Demo
• Clean/Join – Using SparkR from R Server
• Train/Score/Evaluate – Scalable R Server functions
• Deploy/Consume – Using AzureML from R Server

Airline data set
• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection
• >20 years of data
• 300+ Airports
• Every carrier, every commercial flight
• http://www.transtats.bts.gov

Weather data set
• Hourly land-based weather observations from NOAA
• > 2,000 weather stations
• http://www.ncdc.noaa.gov/orders/qclcd/

Provisioning a cluster with R Server

Clean and Join using SparkR in R Server
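As a rough picture of what the clean-and-join step does, the sketch below drops incomplete flight records and joins the rest to weather observations by airport and hour. The demo itself uses SparkR on the cluster; this stand-in uses base R's `merge()` on tiny made-up data frames so it runs anywhere, and every column name and value here is hypothetical.

```r
# Tiny made-up flight records (hypothetical columns and values).
flights <- data.frame(
  origin    = c("SEA", "SEA", "JFK", "ORD"),
  hour      = c(8, 9, 8, 17),
  arr_delay = c(5, NA, 42, 0),
  stringsAsFactors = FALSE
)

# Tiny made-up hourly weather observations per airport.
weather <- data.frame(
  airport    = c("SEA", "SEA", "JFK"),
  hour       = c(8, 9, 8),
  visibility = c(10, 2, 4),
  stringsAsFactors = FALSE
)

# Clean: drop flights with a missing arrival delay.
flights_clean <- flights[!is.na(flights$arr_delay), ]

# Join: inner join on (airport, hour); flights without a matching
# weather observation are dropped.
joined <- merge(flights_clean, weather,
                by.x = c("origin", "hour"),
                by.y = c("airport", "hour"))
print(joined)
```

The result has one row per flight that has both a recorded delay and a weather observation for its airport and hour, which is the shape a model-training step expects.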

Train, Score, and Evaluate using R Server
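The train, score and evaluate steps can be sketched as follows. This is an illustration in open source R on synthetic data, using `glm()` for a logistic regression; the demo runs R Server's scalable functions on the real airline and weather data instead, and the variables (`visibility`, `dep_hour`, `delayed`) and the coefficients used to simulate them are assumptions made up for this example.

```r
set.seed(42)
n <- 2000

# Synthetic predictors (hypothetical): visibility and departure hour.
visibility <- runif(n, 0, 10)
dep_hour <- sample(0:23, n, replace = TRUE)

# Synthetic outcome: delays become less likely as visibility improves.
p_delay <- plogis(0.5 - 0.4 * visibility + 0.05 * dep_hour)
delayed <- rbinom(n, 1, p_delay)
flights <- data.frame(delayed, visibility, dep_hour)

# Hold out 20% of the rows for evaluation.
train_idx <- sample(n, size = 0.8 * n)
train <- flights[train_idx, ]
test <- flights[-train_idx, ]

# Train: logistic regression on the training rows.
model <- glm(delayed ~ visibility + dep_hour,
             data = train, family = binomial())

# Score: predicted delay probability for the held-out rows.
test$prob <- predict(model, newdata = test, type = "response")

# Evaluate: accuracy at a 0.5 probability threshold.
test$pred <- as.integer(test$prob > 0.5)
accuracy <- mean(test$pred == test$delayed)
print(accuracy)
```

Accuracy at a fixed threshold is only one possible metric; with imbalanced delay data, an ROC curve or precision and recall would usually be more informative.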

Demo Technologies
• HDInsight Premium Hadoop cluster
• Spark on YARN distributed computing
• R Server R interpreter
• SparkR data manipulation functions
• RevoScaleR Statistical & Machine Learning functions
• AzureML R package and Azure ML web service

Building a genetic disease risk application with R

Data
• Public genome data from 1000 Genomes
• About 2TB of raw data

Processing
• VariantTools R package (Bioconductor)
• Match against NHGRI GWAS catalog

Analytics
• Disease Risk
• Ancestry

Presentation
• Expose as Web Service APIs
• Phone app, Web page, Enterprise applications

Platform
• HDInsight Hadoop (8 clusters)
• 1500 cores, 4 data centers
• Microsoft R Server

[Diagram labels: BAM, VariantTools, GWAS]

The Four Transformational Trends
• Cloud computing: 2011 → 2016, 5x increase
• Data science: universities filling 300,000 US talent gap
• Big data: 90% of the data in the world today has been created in the last two years alone
• Open source: including R, Linux, Hadoop

microsoft.com/hdinsight
microsoft.com/r-server
