Introducing Microsoft R Server & Microsoft R Open
Krit Kamtuo, Technical Evangelist, Microsoft (Thailand) Limited

What is R?
Language Platform
• A programming language for statistics, analytics, and data science
• A data visualization framework
• Provided as open source
Community
• Used by 2.5M+ data scientists, statisticians and analysts
• Taught in most university statistics programs
• Active and thriving user groups across the world
• New and recent graduates prefer it
Ecosystem
• CRAN: 7000+ freely available algorithms, test data and evaluation
• Many of these are applicable to big data if scaled

History of R
• 1993: Research project in New Zealand
• 1995: Open source project
• 1997: R-Core Group
• 2000: R-1.0.0 released
• 2003: R Foundation
• 2004: First useR! conference
• 2009: New York Times article
• 2015: R-3.2.0 and R Consortium (founded by Microsoft)

Challenges posed by open source R
• Uncertain total cost of ownership
• Inadequate access to important business data
• Limited business agility
• Limited business value

R from Microsoft brings
Microsoft R Products
Microsoft R Open
• Free and open source R distribution
• Enhanced and distributed by Revolution Analytics
SQL Server R Services
• Built-in advanced analytics and stand-alone server capability
• Leverages the benefits of SQL Server 2016 Enterprise Edition

Microsoft R Server
• Microsoft R Server for Red Hat Linux
• Microsoft R Server for SUSE Linux
• Microsoft R Server for Teradata DB
• Microsoft R Server for Hadoop on Red Hat

Introducing SQL Server 2016 R Services
• Enterprise speed and performance
• Near-DB analytics
• Parallel threading and processing
• Model on-premises, store in cloud, or vice versa
• Hybrid memory and disk scalability
• Not bound by memory limits, enabling larger datasets
• Included in SQL Server 2016
• Reuse and optimize existing R code
• Eliminate data movement across machines
• Write once, deploy anywhere

Microsoft R Server for distributed computing
The First NIDA Business Analytics and Data Sciences Contest/Conference
1-2 September 2016 (B.E. 2559), Navamindradhiraj Building, National Institute of Development Administration (NIDA)
- Introduction to Microsoft R Server
- How distributed computing works and what benefits it offers
- How to configure distributed computing
https://businessanalyticsnida.wordpress.com
https://www.facebook.com/BusinessAnalyticsNIDA/
Krit Kamtuo, Technical Evangelist, Microsoft (Thailand)
- Distributed computing and Big Data
- Analytics on R Server
- Demonstration and teaching in workshop format
Computer Lab 2, 10th floor, Siam Boromrajakumari Building, 1 September 2016, 9:00-12:30

Scalable in-database analytics
Data Scientist
• Interacts directly with data
• Creates models and experiments
Data Analyst/DBA
• Manages data and analytics together
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
[Diagram: Relational Data; Extensibility; R Integration; Analytic Library (Open Source R, Revolution PEMA); T-SQL Interface]
How is it Integrated?
• T-SQL calls a stored procedure
• Script is run in SQL through the extensibility model
• Result sets are sent through Web API to database or applications
Benefits
• Faster deployment of ML models
• Less data movement, faster insights
• Work with large datasets: mitigate R memory and scalability limitations

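To make the integration steps above more concrete, here is a minimal sketch of the kind of R script a stored procedure could run for the "sales forecasting" example. It assumes the convention, used by default by SQL Server's sp_execute_external_script, that the input query arrives as a data frame named InputDataSet and the result is returned as a data frame named OutputDataSet. The sales figures, column names and linear trend model are invented for illustration; outside SQL Server the input is simply simulated.

```r
# Stand-in for the rows SQL Server would pass in from the T-SQL input query.
# (Illustrative data only: a perfectly linear sales trend over 12 months.)
InputDataSet <- data.frame(month = 1:12, sales = 100 + 10 * (1:12))

# Fit a simple trend model and score the next three months.
model  <- lm(sales ~ month, data = InputDataSet)
future <- data.frame(month = 13:15)

# The script hands its result back to the caller as a data frame.
OutputDataSet <- data.frame(
  month    = future$month,
  forecast = unname(predict(model, newdata = future))
)
print(OutputDataSet)
```

Because the script only reads InputDataSet and writes OutputDataSet, the same body can be developed in any R session and later embedded in the database call unchanged.
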
Cost effectiveness
• Best advanced analytics value
• R Services and PolyBase are built in
o Part of SQL Server 2016 Enterprise Edition
• In-DB analytics shrinks analysis cost and time
o No data movement reduces costs
• No proprietary hardware requirement
o Can be installed on commodity hardware
• Integration between cloud and open source offerings
SQL Server 2016: $648K + $120 per user for Power BI (costs based on a server with 2 processors / 8 cores)

Introducing Microsoft R Server
Microsoft R Server is a broadly deployable, enterprise-class analytics platform based on R that is supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics: exploration, analysis, visualization and modeling.
High-performance open source R plus:
• Data source connectivity to big-data objects
• Big-data advanced analytics
• Multi-platform environment support
• In-Hadoop and in-Teradata predictive modeling
• Development and production environment support
• IDE for data scientist developers
• Secure, scalable R deployment
Components: R Open, R Server, DevelopR, DeployR

The Microsoft R Server Platform
Components: R Open, Microsoft R Server, DeployR, DevelopR
ConnectR
• High-speed & direct connectors, available for:
o High-performance XDF
o SAS, SPSS, delimited & fixed format text data files
o Hadoop HDFS (text & XDF)
o Teradata Database & Aster
o EDWs and ADWs
o ODBC
ScaleR
• Ready-to-use high-performance big data analytics
• Fully parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
• Wide data sets supported: thousands of variables
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R interpreter
• R 3.1.2
• Freely available huge range of R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages
RevoR
• Performance-enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions

ScaleR - Parallel + “Big Data”
• Stream data into RAM in blocks. “Big Data” can be any data size: we handle megabytes to gigabytes to terabytes.
• Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
• Interim results are collected and combined analytically to produce the output on the entire data set.
• The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.

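The block-streaming idea above can be sketched in plain R, without ScaleR itself. The example below is only an illustration under stated assumptions: a small temporary text file stands in for a large data set, the block size of 100 lines is arbitrary, and the "interim results" are running counts and sums that are combined at the end into a mean and variance for the whole file.

```r
# Toy data standing in for a large on-disk file: one numeric value per line.
set.seed(42)
values <- round(rnorm(1000, mean = 10, sd = 3), 2)
path <- tempfile(fileext = ".txt")
writeLines(as.character(values), path)

# Stream the file in fixed-size blocks, keeping only running totals in memory.
block_size <- 100  # arbitrary block size, chosen for illustration
con <- file(path, open = "r")
n <- 0
total <- 0
total_sq <- 0
repeat {
  block <- readLines(con, n = block_size)
  if (length(block) == 0) break      # end of file reached
  x <- as.numeric(block)
  n        <- n + length(x)          # interim result: row count
  total    <- total + sum(x)         # interim result: sum
  total_sq <- total_sq + sum(x^2)    # interim result: sum of squares
}
close(con)

# Combine the interim results into statistics for the entire data set.
chunked_mean <- total / n
chunked_var  <- (total_sq - n * chunked_mean^2) / (n - 1)
print(c(mean = chunked_mean, var = chunked_var))
```

Only one block is ever held in memory, so the same loop would work for a file much larger than RAM; the statistics agree with computing mean() and var() on the full vector.
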
SQL Server R Services (SQL Server 2016 Enterprise Edition)
Integration Facilities (SQL Server Query Processor):
• Component Integration
• Launchers
• Parameter Passing
• Results Return
• Console Output Return
• Parallel Data Exchange (RTM)
• Stored Procedures
• Package Administration
Algorithm Library (fast, parallel, storage-efficient algorithms):
• Data Prep
• Descriptive Stats
• Sampling
• Statistical Tests
• Predictive Models
• Variable Selection
• Clustering
• Classification
• Custom APIs for R + CRAN
• Parallel Scoring
Microsoft R Open (open source R interpreter):
• 100% Open Source R
• Fully CRAN Compatible
• Accelerated Math

Run R In-Database from T-SQL
SQL Server 2016: In-Database Execution of R + CRAN + SQL
In-Database Execution of:
• R Code
• CRAN Packages
Move the Work to the Data:
• Run R from the query processor
• Retrieve models, scores, transformed data, plots/images
• Operationalise scoring/prediction in database for data batches or real-time

Run Parallel Algorithms in Database from an R Client
SQL In-Database Execution:
• Remote Execution
• Parallelized Compute
Explore and Model:
• In Parallel, In-Database
• Parallelize distributable R and CRAN
Operationalize:
• Score In Parallel
Move BIG Work to the Data
[Diagram: SQL Server Remote Execution Context; Large Data Sets in Chunks; Parallel Worker Tasks; Parallel Algorithm; Iterate/Sequence]

Conceptual Flow
[Diagram: SQL 2016; R Interpreter; ScaleR PEMAs: Fast, Parallel, Storage Efficient Algorithms]

Parallelized Algorithms in Database
[Diagram labels:]
• R IDE, XSP
• SQL Processor; Data Segments (CTP3 is via files); Input Data Set via ODBC
• RTerm.exe, R.dll (MSLP$SQL16)
• BxlServer.exe (MSLP$SQL16)
• ScaleR Master Process: spawn worker processes, assemble intermediate results, iterate/sequence
• Worker Processes (three shown), Data Segments, MPI Ring
• Console Out
• Results: Models, Data

ScaleR - Parallelized Algorithms & Functions
Data Preparation
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Combination
• rxDataStep
• rxExec
Custom Algorithms
• PEMA-R API (new)

ScaleR - Performance comparison
• Microsoft R Server is not limited by the size of available RAM.
• Open source R fails when it operates on data sets that exceed RAM.
• Microsoft R Server scales linearly well beyond RAM limits, and its parallel algorithms are much faster.
Benchmark setup:
• US flight data for 20 years
• Linear regression on arrival delay
• Run on a 4-core laptop with 16 GB RAM and a 500 GB SSD

DistributedR - Model development and model compute choice: “Write Once. Deploy Anywhere.”
Code Portability Across Platforms (Microsoft R Server: DistributedR, ScaleR, ConnectR, DevelopR)
• In the Cloud: Azure Marketplace
• Workstations & Servers: Linux, Windows
• EDW: Teradata
• Hadoop: Hortonworks, Cloudera, MapR
• Roadmap: + HDInsight, + Hadoop Spark, + R Tools for Visual Studio, + Azure ML, + SQL Server v16

DistributedR - How Does Remote Execution Work?
• Work: “pack and ship” requests to remote environments
• Algorithm master: distribute work, compile results
• Big data predictive algorithm: load a block at a time, analyze blocks in parallel
• Results are returned to the caller
The Results:
• Even faster computation
• Larger data set capacity
• Fewer security concerns
• No data movement, no copies
Microsoft R Server functions
• A compute context defines the remote connection
• Microsoft R functions are prefixed with rx
• The current compute context determines the processing location

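The compute-context pattern described above can be imitated in a few lines of base R. The sketch below is a toy illustration only: set_compute_context() and context_mean() are hypothetical names, not part of the RevoScaleR API, and the "blocks" context merely splits the work on one machine rather than shipping it to a remote environment. What it shows is the design idea: the analysis call is written once, and a separately set context decides how it runs.

```r
# Hypothetical names throughout; this is not the RevoScaleR API.
.ctx <- new.env()
.ctx$current <- "local"

# Select where/how subsequent analysis calls will execute.
set_compute_context <- function(name) {
  stopifnot(name %in% c("local", "blocks"))
  .ctx$current <- name
  invisible(name)
}

# The analysis function is written once; the current context picks the strategy.
context_mean <- function(x, block_size = 250) {
  if (.ctx$current == "local") {
    return(mean(x))
  }
  # "blocks" context: compute partial results per block, then combine them.
  blocks <- split(x, ceiling(seq_along(x) / block_size))
  parts  <- lapply(blocks, function(b) c(sum = sum(b), n = length(b)))
  totals <- Reduce(`+`, parts)
  unname(totals["sum"] / totals["n"])
}

x <- as.numeric(1:1000)

set_compute_context("local")
m_local <- context_mean(x)

set_compute_context("blocks")
m_blocks <- context_mean(x)   # same call, different execution strategy

print(c(local = m_local, blocks = m_blocks))
```

The calling code never changes between the two runs; only the context does, which is the property the slide attributes to the rx-prefixed functions.
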
DistributedR - Revolution Code Portability
ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce. The compute context R script sets where the model will run; the functional model R script does not need to change to run in Hadoop.

Compute context R script, in-Hadoop:

### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall/AirlineDemoSmall.xdf", fileSystem = hdfsFS)

Compute context R script, local parallel processing (Linux or Windows):

### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall/AirlineDemoSmall.xdf", fileSystem = linuxFS)

Functional model R script (the same for either compute context):

### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)

DistributedR - In-Hadoop
• Uses Hadoop nodes for R computations
• Eliminate data movement latency on very large data
• Remove data duplication
• Faster model development
• No MapReduce R coding
• Develop better models using all the data
[Diagram legend: = Microsoft R Server]

MRS and Hadoop Architecture options
1. Copy
2. Stream
3. Send
[Diagram: RStudio Server Pro; Microsoft R Server; ScaleR; Production]

DistributedR - Hadoop Processing Methods
Method 1 (“Beside” or “Edge”, copy to local file): Local (Linux) parallel processing using all cores on one node, copying data from HDFS to store in the local Linux file system.
• Compute context: Local Parallel
• Processing: 1 edge node
• Data: CSV, XDF; Linux FS read / write
Method 2: Local (Linux) parallel processing using all cores on one node, streaming data from / to HDFS.
• Compute context: Local Parallel
• Processing: 1 edge node
• Data: CSV, XDF in HDFS
[Diagram labels: Compute Context Hadoop; Compute Context Local Parallel; Linux (Local) File-System; HDFS; 1 edge node; 1:n data nodes; 1:n disks; 1:(n x number of nodes) disks]

Method 3 (“inside”, R script sent to data nodes): Hadoop (Map-Reduce) parallel processing using all cores on n nodes, using HDFS data on each node.
• Compute context: Hadoop
• Processing: 1:n nodes
• Data: CSV, XDF; HDFS read / write
[Diagram labels: Compute Context Hadoop; Compute Context Local Parallel; Linux (Local) File-System; HDFS; 1 edge node; 1:n disks; 1:(n x number of nodes) disks]
R model script sent to Master Node:
1. Starts a master process
2. Distribute work
3. Master tasks for each node
4. Master initiates distributed work
   1. Hadoop schedules mapper for each split
   2. Algorithm computes intermediate result
   3. Reducer combines intermediate results
5. Master process evaluates completion
6. Iterates as required by the algorithm
7. Returns consolidated answer to script

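The mapper / intermediate result / reducer steps above can be imitated on a single machine in base R. The sketch below is an illustration under assumptions, not how ScaleR is implemented: the data is simulated, the four "splits" are just row partitions of a data frame, and linear regression is chosen because its intermediate results (the cross products X'X and X'y) can simply be summed across splits, so one pass suffices and no iteration step is needed.

```r
# Simulated data set; in Hadoop each split would live on a different data node.
set.seed(1)
n <- 600
d <- data.frame(x1 = rnorm(n), x2 = runif(n))
d$y <- 2 + 3 * d$x1 - 1.5 * d$x2 + rnorm(n, sd = 0.1)

# "Splits": partition the rows into 4 blocks.
splits <- split(d, rep(1:4, length.out = n))

# Map: each split computes its own intermediate result (cross products).
map_split <- function(block) {
  X <- cbind(1, block$x1, block$x2)   # design matrix with intercept column
  list(xtx = crossprod(X), xty = crossprod(X, block$y))
}
partials <- lapply(splits, map_split)

# Reduce: combine the intermediate results by summing them across splits.
combine <- function(a, b) list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty)
total <- Reduce(combine, partials)

# Master: solve the normal equations once on the combined result.
beta <- drop(solve(total$xtx, total$xty))
names(beta) <- c("(Intercept)", "x1", "x2")
print(beta)
```

The coefficients match what lm(y ~ x1 + x2, data = d) returns on the full data, even though no step ever needed all rows at once. Replacing lapply() with a parallel map would distribute the map step across cores or nodes.
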
DistributedR - What processing mode to use, when?
Analytic data set size and processing complexity (e.g. simple summary statistics vs iterative algorithm) guide the use of Methods 1 and 2 (edge node / server Linux local processing) vs Method 3 (in-Hadoop processing).
[Chart: processing complexity (low, medium, high) against data size (small data < 10GB, medium data < 50GB, bigger data > 50GB). Legend: edge node Linux processing; in-Hadoop processing; local Linux file-system; Hadoop file-system]

While Open Source R delivers:
• Capability
o 6500+ algorithm & connector packages available for free in CRAN
• Simplicity
o R skills transfer / lower cost of talent
o Ease of integration with other analytics packages & data
o Access to huge libraries of R analytical algorithms
• Speed
o Intel-optimized computation
SQL Server R Services and Microsoft R Server deliver:
• Peace of mind
o Knowledge that your business is using a stable platform backed with commercial support and services
o Platform longevity for more predictability around costs
• Speed and scalability
o Faster decisions using advanced analytics that were previously unachievable
o In-Hadoop & in-Teradata analysis
• Efficiency
o Continue getting returns on existing hardware and software investments
o Developers can write code once and deploy it anywhere, keeping costs low
• Flexibility and agility
o Model data in a hybrid environment: on-premises, in the cloud, or both
o Scripting, modeling, and in-database analytics across platforms shrink analysis time and enable agile response to business needs

Introducing Microsoft R Open
• Enhanced open source R distribution
o Based on the latest open source R (3.1.2)
o Built, tested and distributed by Microsoft
o Enhanced by Intel MKL library to speed up linear algebra functions
• Compatible with all R-related software
o CRAN packages, RStudio, third-party R integrations, …
• Revolution's open-source R packages
o Reproducible R Toolkit: checkpoint, miniCRAN
o ParallelR: parallelise execution via ‘foreach’ loop
o RHadoop: rhdfs, rhbase, ravro, rmr2, plyrmr
o AzureML: read/write data to Azure ML, publish R code as ML API
• MRAN website: mran.revolutionanalytics.com
o Enhanced documentation and learning resources
o Discover 6500 free add-on R packages
• Open source (GPLv2 license): 100% free to download, use and share

CRAN, MRO, MRS Comparison
• Data size
o CRAN R: in-memory
o Microsoft R Open: in-memory
o Microsoft R Server: in-memory or disk-based
• Speed of analysis
o CRAN R: single-threaded
o Microsoft R Open: multi-threaded
o Microsoft R Server: multi-threaded, parallel processing on 1:N servers
• Support
o CRAN R: community
o Microsoft R Open: community
o Microsoft R Server: community + commercial
• Analytic breadth & depth
o CRAN R: 7500+ innovative analytic packages
o Microsoft R Open: 7500+ innovative analytic packages
o Microsoft R Server: 7500+ innovative packages + commercial parallel high-speed functions
• Licence
o CRAN R: open source
o Microsoft R Open: open source
o Microsoft R Server: commercial license; supported release with indemnity

CRAN R compared to Microsoft R Open
More efficient, multi-threaded math computation benefits math-intensive processing. There is no benefit to program logic or data transforms.
• Matrix calculation: up to 27x faster
• Matrix functions: up to 16x faster
• Program logic: no speed-up (0x faster)

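The distinction above between math-intensive work and program logic can be checked on any R installation with a small timing script. The sketch below is only a measurement harness: the matrix size and loop length are arbitrary choices, and it makes no claim about what numbers a given machine or R distribution will produce. Matrix operations are handed to whatever BLAS/LAPACK library R is linked against, which is where a multi-threaded math library can help; the interpreted loop is not.

```r
set.seed(123)
n <- 300                       # arbitrary matrix size for illustration
A <- matrix(rnorm(n * n), n, n)

# Math-intensive work: delegated to the BLAS/LAPACK library R is linked against.
t_crossprod <- system.time(B <- crossprod(A))[["elapsed"]]
t_solve     <- system.time(A_inv <- solve(A))[["elapsed"]]

# Program logic: an interpreted loop, which a faster math library does not touch.
t_loop <- system.time({
  acc <- 0
  for (i in seq_len(1e5)) acc <- acc + i %% 7
})[["elapsed"]]

timings <- c(crossprod = t_crossprod, solve = t_solve, loop = t_loop)
print(timings)
```

Running the same script under CRAN R and under Microsoft R Open on the same machine gives a like-for-like comparison of the two categories.
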
