© Copyright 2015 Hitachi Consulting
1. Operational Machine Learning: Using Microsoft Technologies for Applied Data Science
    Khalid M. Salama, Ph.D.
    Business Insights & Analytics, Hitachi Consulting UK
    We Make it Happen. Better.
2. Outline
    • Introduction to Data Science
    • From Experimental Data Science to Operational Machine Learning
    • MS Technologies for Data Science & Advanced Analytics
    • Demos & Screenshots
    • Concluding Remarks
3. Introduction to Data Science and Machine Learning
4. Data Science and Machine Learning: What?
    Data Science: Machine Learning, Statistics, Artificial Intelligence, Databases, Other Technologies
    “Data mining, an interdisciplinary subfield of computer science, is the computational process of automatically discovering interesting and useful patterns in large data sets”
    Other Related Technologies:
    • Visualization
    • Big Data
    • High Performance Computing
    • Cloud Computing
    • Others
5. Data Science and Machine Learning: Why?
    • Vision Analytics
    • Recommendation engines
    • Advertising analysis
    • Weather forecasting for business planning
    • Social network analysis
    • Legal discovery and document archiving
    • Pricing analysis
    • Fraud detection
    • Churn analysis
    • Predictive Maintenance
    • Location-based tracking and services
    • Personalized Insurance
    The objective of data science is to provide you with actionable insights to support decision making.
6. Data Science and Machine Learning: How?
    • Classification Learning: build a model that can predict the target class of an input case
    • Cluster Analysis: discover natural groupings within the data points
    • Association Rule Discovery: extract frequent patterns present in the data
    • Regression Modeling: build a model that can estimate the response value given an input case
    • Time Series Analysis: analyze temporal data to forecast future values
    • Probabilistic Modeling: compute the probability of an event occurring given a set of conditions
    • Similarity Analysis: identify cases similar to a given input case based on the input features
    • Collaborative Filtering: filter information using techniques that involve collaboration among viewpoints
    Rule list shown on the slide: IF .. AND .. AND .. THEN A ELSE IF .. AND .. THEN C ELSE IF .. AND .. THEN B .. .. ELSE C
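The IF / ELSE IF rule list on slide 6 can be read as an ordered list of classification rules: the first rule whose conditions all hold decides the class, and a default class covers everything else. The sketch below illustrates that reading in plain Python. The feature names, thresholds, and class assignments are hypothetical and are not taken from the slides.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, float]
Rule = Tuple[Callable[[Case], bool], str]

# Hypothetical rules: feature names, thresholds, and labels are
# illustrative only and do not come from the slides.
RULES: List[Rule] = [
    (lambda c: c["age"] < 30 and c["income"] > 50_000 and c["visits"] > 5, "A"),
    (lambda c: c["age"] >= 30 and c["income"] <= 20_000, "C"),
    (lambda c: c["visits"] > 10 and c["income"] > 20_000, "B"),
]
DEFAULT_CLASS = "C"  # the final ELSE branch


def classify(case: Case) -> str:
    """Return the class of the first rule whose conditions all hold."""
    for condition, label in RULES:
        if condition(case):
            return label
    return DEFAULT_CLASS
```

Because the rules are evaluated in order, a case that satisfies several rules is assigned the class of the earliest one, which is what makes such a list easy to read and interpret.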
7. From Experimental Data Science to Operational Machine Learning
8. Data Science Activities: Experimentation vs. Operationalization
    Exploratory Data Analysis: Collect Data, Blend, Visualize, Prepare (output: Learning Dataset)
    ML Experiment: Algorithm Selection, Parameter Tuning, Training & Testing, Model
    Report of Visuals & Findings, leading to a Decision!
    Data Analysis & Experimentation:
    • Interactive
    • Easy to perform
    • Rich Visualizations
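The Prepare step on slide 8 covers ML-specific data preparation, which the speaker notes for this slide describe as including scaling and handling missing values. As a minimal sketch of two such steps, the snippet below fills missing values with the column mean and then rescales the column to [0, 1]. Mean imputation and min-max scaling are chosen here only for illustration, and the sample column is hypothetical.

```python
from statistics import mean
from typing import List, Optional


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw feature column with one missing value.
ages = [20.0, None, 40.0, 60.0]
prepared = min_max_scale(impute_mean(ages))
```

In an operational setting the fill value and the min/max bounds would be learned on the training data and reused unchanged at scoring time, so that new cases are prepared exactly as the training cases were.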
9. Data Science Activities: Experimentation vs. Operationalization
    Automated ML Pipeline: Data Ingestion, Data Processing, Model Training, Scoring
    Model: Train, Export, Deploy, Predict (Batch, Real-time)
    Online Apps: Web APIs
    Operational ML Pipelines:
    • Pipelined (ETL Integration)
    • Scalable
    • Apps Integration
10. Microsoft Advanced Analytics Technologies
11. Microsoft Advanced Analytics: Cortana Intelligence Suite
    https://gallery.cortanaintelligence.com/
12. Microsoft Advanced Analytics: Data Science, Machine Learning, & Intelligence
    • Data Mining – SQL Server Analysis Services
    • Azure Machine Learning
    • Spark ML – Azure HDInsight
    • Microsoft R Server – SQL Server R Services
    • Azure Cognitive Services
    • Cognitive Features – Azure Data Lake Analytics
    • Microsoft Bot Framework
13. Microsoft Azure Machine Learning
14. Azure Machine Learning: MS Cloud-native Data Science
    • Cloud-based Machine Learning Services
    • Interactive Data Science Studio
    • Rich built-in functionality
    • Imports data from many different sources
    • Easy to develop and productionize – Web Services
    • Extensible via R and Python scripts
    Diagram labels: Azure Machine Learning (build and deploy models in the cloud), Import Data, Publish, Web Services, Input, Result, Batch Scoring, Retrain Model
    Limitations:
    • Only Cloud-based (Data Regulations)
    • Scalability – Maximum dataset size = 10GB
    • Microsoft R Open is not supported, yet
    • No Source Control
15. Azure Machine Learning: Real-time Predictions
    The App sends data points to Event Hub; Stream Analytics consumes the messages, sends the input to the Azure ML Web Service, receives the output, and sends the results (input, output) to Power BI.
16. Azure Machine Learning: Built-in Features
17. Azure Machine Learning: Algorithms Cheat Sheet
18. Azure Machine Learning: ML Studio
19. Azure Machine Learning: Web Service
20. Azure Machine Learning: Stream Analytics Integration
21. Azure Machine Learning: AzureML R Library
22. Microsoft R Server
23. Microsoft R Server: R in Microsoft World
    Microsoft R Open (MRO):
    • Based on latest Open Source R (3.2.2) - Built, tested, and distributed by Microsoft
    • More efficient and multi-threaded computation
    • Enhanced by Intel Math Kernel Library (MKL) to speed up linear algebra functions
    • Compatible with all R-related software
24. Microsoft R Server: Comparison (CRAN | MRO | MRS)
    • Data size: In-memory | In-memory | In-memory & disk
    • Efficiency: Single threaded | Multi-threaded | Multi-threaded, parallel processing 1:N servers
    • Support: Community | Community | Community + Commercial
    • Functionality: 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions
    • Licence: Open Source | Open Source | Commercial license
25. Microsoft R Server: Components & Compute Contexts
    Diagram labels: Microsoft R Server (CRAN & MS R Open, ScaleR, DistributedR, ConnectR, MicrosoftML package, Operationalization (mrsdeploy)), RStudio | RTVS, MS R Client, Scale & Deploy, Different Compute Contexts
    • Installed on Windows or Linux
    • ScaleR – Optimized for parallel execution on Big Data, to eliminate memory limitations.
    • ConnectR – Provides access to local file systems, HDFS, Hive, SQL Server, Teradata, etc.
    • DistributedR – Adaptable parallel execution framework to enable running on different (distributed) compute contexts.
    • Operationalization (mrsdeploy) – Deploy the model as a Web API.
26. Microsoft R Server: ScaleR Example
    Steps shown: Check Environment, Load XDF, Prepare Data – Process XDF, Build Predictive Model, Perform Prediction
27. Microsoft R Server: ScaleR Functionality
28. SQL Server (in-database) R Services
29. SQL Server R Services: In-database Analytics
    • R Services (in-database) – Keep your analytics close to the data
    • T-SQL Script – Can be encapsulated in Stored Procedures
    • Models are built, trained, and saved as part of the ETL process (SSIS)
    • Used for batch prediction (as part of the ETL process)
    • Visual Studio SQL Database Project, Source Controlled, etc.
    • Uses Microsoft ScaleR libraries
    Limitations:
    • Not supported in Azure SQL DB/DW, yet
    • Not suitable for Interactive Data Science
    • Only R, no Python, yet
    ETL using SSIS (EXECUTE sp_execute_external_script), from Data Sources:
    • Training Pipeline: Process Data, Train R Model, Serialize, Store Models, Maintain Models
    • Prediction Pipeline: Process Data, Load Model, Perform Prediction, Store Results
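Slide 29 names sp_execute_external_script as the entry point for running R inside SQL Server. As a rough sketch of what such a call looks like, the Python helper below composes the T-SQL text for a trivial R script that just counts its input rows. The helper name, the table, and the R script are hypothetical placeholders, and the statement is only built as a string here, not executed against a server.

```python
def build_external_script_call(input_query: str) -> str:
    """Compose a T-SQL batch that runs a small R script in-database."""
    # Placeholder R script: returns the number of input rows. A real
    # training or scoring script would go here instead.
    r_script = "OutputDataSet <- data.frame(row_count = nrow(InputDataSet))"
    # Single quotes inside a T-SQL string literal are escaped by doubling.
    escaped_query = input_query.replace("'", "''")
    return (
        "EXECUTE sp_execute_external_script\n"
        "    @language = N'R',\n"
        f"    @script = N'{r_script}',\n"
        f"    @input_data_1 = N'{escaped_query}'\n"
        "WITH RESULT SETS ((row_count INT));"
    )


# Hypothetical table and filter, for illustration only.
statement = build_external_script_call(
    "SELECT * FROM dbo.Customers WHERE Region = 'UK'"
)
print(statement)
```

Inside the R script, the query result arrives as the data frame InputDataSet and whatever is assigned to OutputDataSet is returned to SQL Server, which is why the same statement can be wrapped in a stored procedure and called from an SSIS package.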
30. SQL Server R Services: T-SQL Script
    Screenshot labels: Prediction, Model Summary, Prediction Output, Build and Save Model, Configure
31. Microsoft Analysis Services Data Mining
32. SQL Server Analysis Services: Data Mining
    • Process data from many OLEDB and ODBC data sources
    • Easy to build, interpret, deploy, and productionize
    • SSIS Support – Tasks to Train & Predict
    • Interactive Visuals for model interpretation
    • Excel Integration – Data Mining Add-in
    Diagram labels: Azure SQL DW/DB, SQL Server Analysis Services, Online Apps, Build Model, Result, Explore/Interpret Model, DMX Query, Batch Scoring, Retrain Model
    Limitations:
    • Limited Extensibility
    • Limited Algorithms & Functionalities
    • No Azure PaaS Service
33. SQL Server Analysis Services: Overview
    Data Source View, Mining Structure, Mining Algorithm, Mining Model
    Mining algorithms:
    • Decision Trees
    • Naïve-Bayes
    • Linear Regression
    • Neural Networks
    • Association Rules
    • Clustering
    • Sequence Clustering
    • Time Series
34. SQL Server Analysis Services: Visualizing Models
35. SQL Server Analysis Services: Excel Data Mining Add-in
36. Azure Cognitive Services
37. Azure Cognitive Services: Ready-to-use Intelligence
38. Azure Cognitive Services: Set up a Cognitive Services API
    https://www.microsoft.com/cognitive-services/
39. Cognitive Features in Azure Data Lake Analytics
40. Azure Data Lake Analytics: Cognitive Features
    • Pre-built intelligence – Text & Image Analysis
    • Integrated with your data processing pipelines (DLA)
    • Used for batch recognition (not singleton real-time)
    • Scheduled & Automated using Azure Data Factory
    • R & Python Extensions!
    • Scalable – Suitable for Big Data
    Diagram labels: Source Data (Text, Images, etc.), Ingest, Data Lake Store, Input, Data Lake Analytics Jobs (Data Processing & Pattern Recognition), Output, Data Lake Store, PolyBase, Enterprise Data Warehouse (Azure SQL DW), Azure Data Factory
    Limitations:
    • Limited Features
    • Not suitable for real-time scoring
41. Azure Data Lake Analytics: First-time Installation
42. Azure Data Lake Analytics: U-SQL Script
43. Azure Data Lake Analytics: Execution & Output
44. Spark ML on HDInsight
45. Spark ML on HDInsight: Scalable ML for Big Data
    • Rich Spark ML Libraries
    • Scalable, distributed, in-memory
    • Extensible – Python, R, Java, Scala
    • Suitable for Big Data - Batch Model Training and Scoring
    • Spark Streaming for Real-time predictions
    • Scheduled & Automated Using Azure Data Factory
    Pipeline steps: Ingest - Process Data - Build Model - Save Model - Load Model - Perform Predictions - Save Results
    Diagram labels: Source Data, HDInsight, Save, Load, PolyBase, Enterprise Data Warehouse (Azure SQL DW), Azure Data Factory
    Limitations:
    • Expensive to keep it up & running
    • Slow to spin-up
46. Spark ML on HDInsight: Spark ML Pipelines
    Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple tasks into a single pipeline, or workflow.
    • Transformers – used for data pre-processing. Input: DataFrame - Output: DataFrame
    • Estimators – ML algorithms used to build a predictive model. Input: DataFrame - Output: Model
    • Parameters – Configurations for Transformers and Estimators
    • Pipeline – Chains Transformers and Estimators
    Diagram labels: ML Pipeline, Dataset (DataFrame), Transformer A (pre-processing) … Transformer Z (pre-processing), Estimator (ML Learning Algorithm), Model, Evaluation, Parameters
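The Transformer / Estimator / Pipeline roles on slide 46 can be illustrated without a Spark cluster. The sketch below is plain Python and is not the Spark ML API: a list of dictionaries stands in for a DataFrame, and every class name, column name, and the toy "learn the mean as a threshold" estimator are hypothetical. It only mirrors the fit/transform naming that Spark ML uses.

```python
from typing import Dict, List

Row = Dict[str, float]
Data = List[Row]  # plain-Python stand-in for a Spark DataFrame


class AddRatio:
    """Transformer: derives a new column from two existing ones."""

    def transform(self, data: Data) -> Data:
        return [{**row, "ratio": row["spend"] / row["visits"]} for row in data]


class ThresholdModel:
    """Fitted model; like a Spark ML model, it is itself a transformer."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def transform(self, data: Data) -> Data:
        return [
            {**row, "prediction": float(row["ratio"] > self.threshold)}
            for row in data
        ]


class MeanThresholdEstimator:
    """Estimator: learns a threshold (the mean ratio) from training data."""

    def fit(self, data: Data) -> ThresholdModel:
        ratios = [row["ratio"] for row in data]
        return ThresholdModel(sum(ratios) / len(ratios))


class Pipeline:
    """Chains transformers, then fits the final estimator on their output."""

    def __init__(self, transformers: list, estimator: MeanThresholdEstimator) -> None:
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, data: Data) -> "PipelineModel":
        for transformer in self.transformers:
            data = transformer.transform(data)
        return PipelineModel(self.transformers + [self.estimator.fit(data)])


class PipelineModel:
    """Fitted pipeline: applies every stage in order to new data."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, data: Data) -> Data:
        for stage in self.stages:
            data = stage.transform(data)
        return data


# Hypothetical training and scoring data.
train = [{"spend": 10.0, "visits": 5.0}, {"spend": 60.0, "visits": 10.0}]
model = Pipeline([AddRatio()], MeanThresholdEstimator()).fit(train)
scored = model.transform([{"spend": 50.0, "visits": 5.0}])
```

The point of the pattern is that the fitted pipeline carries the pre-processing with it, so scoring new data repeats exactly the steps used in training. That is what makes a pipeline a convenient unit to save, load, and schedule.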
47. Spark ML on HDInsight: Spark ML Functionality
    Transformers:
    • Text Feature Extraction: TF-IDF (HashingTF and IDF), Word2Vec, CountVectorizer, Tokenizer, StopWordsRemover, n-gram
    • Feature Selection: VectorSlicer, RFormula, ChiSqSelector
    • Dimensionality Reduction: PCA
    • Features Vector Preparation: VectorAssembler, VectorIndexer, StringIndexer, IndexToString
    • Feature Type Conversion: Binarizer, Discrete Cosine Transform (DCT), OneHotEncoder, Bucketizer, QuantileDiscretizer
    • Feature Scaling: Normalizer, StandardScaler, MinMaxScaler
    • Feature Construction: SQLTransformer, ElementwiseProduct, PolynomialExpansion
    Estimators (supervised):
    • Classification: Decision Trees – Ensembles, Naïve-Bayes, SVM
    • Regression: Linear Regression, SVM
    Other (unsupervised):
    • Clustering
    • Collaborative Filtering
    • Frequent Pattern Mining
48. Spark ML on HDInsight: Spark ML - Example
49. Spark ML on HDInsight: BigDL – Intel’s Distributed Deep Learning Library
    https://azure.microsoft.com/en-us/blog/use-bigdl-on-hdinsight-spark-for-distributed-deep-learning/
50. Concluding Remarks
    • Interactive Data Science Studio: Azure ML
    • Extensibility: Spark on HDI, Azure ML, Microsoft R Server
    • Built-in Features: Azure ML, Spark on HDI
    • Rich Model Interpretability: SSAS Data Mining, Microsoft R Server
    • Scalability (Big Data): Microsoft R Server, Spark on HDI
    • ML Pipelining: Spark on HDI, Azure Data Lake Analytics, SQL Server R Services, Data Mining SSAS
    • Integration with Operational Apps: Azure ML, Azure Cognitive Services, Microsoft R Operationalization
    • Pre-built Intelligence: Azure Cognitive Services, Azure Data Lake Analytics
51. My Background: Applying Computational Intelligence in Data Mining
    • Honorary Research Fellow, School of Computing, University of Kent.
    • Ph.D. Computer Science, University of Kent, Canterbury, UK.
    • 28+ published journal and conference papers in the fields of AI and ML
    https://www.researchgate.net/profile/Khalid_Salama
    https://www.linkedin.com/in/khalid-salama-24403144/
    https://github.com/khalid-m-salama/sqlbits-2017
52. Thanks!

Operational Machine Learning: Using Microsoft Technologies for Applied Data Science

Editor's Notes

  • #2 Hello everyone and welcome to the last day of SQLBits… My name is Khalid Salama. I work at Hitachi Consulting, in the Business Insights & Analytics practice, focusing on designing and delivering Data & Analytics solutions. In this session, I would like to explore with you the various Microsoft technologies that can help to operationalize your Machine Learning pipelines and enable scalable data science. Well, it’s more of an engineering session than a data science one, to be fair. However, I think it is an important topic to discuss, because data science is perceived as an experimental, isolated activity, while in many contemporary applications, especially with the rise of digital transformation and IoT, your data science products need to be incorporated into your operational systems, and your ML pipelines need to be an integral part of your ETL process. So, we will try to touch on the various Microsoft options for performing both experimental data science and operational ML.
  • #3 So without further ado, we have a lot of ground to cover… I’ll start with a very quick intro to data science; I assume everybody here has “a” background in data science. Then, I will give some insights into the difference between exploratory data science and operational ML. After that, we are going to delve into the MS technologies for Advanced Analytics and show several demos. And finally, I will conclude with some general remarks.
  • #4 So let’s get started: Data Science and Machine Learning.
  • #5 So Data Science, also known in academia as data mining, is the process of discovering interesting & useful patterns hidden in your data… It’s an interdisciplinary field of computing, where concepts and techniques from different areas are employed, such as statistics, artificial intelligence, machine learning, databases, and others, like visualization, big data, cloud computing, et cetera…
  • #6 The objective of data science is to provide you with actionable insights, that is, valuable findings that support decision making, for example, whether to invest in a new product line, or whether to perform a certain critical medical operation… These findings may form a reliable model that is able, to a certain degree of confidence, to predict, estimate, or forecast a certain value or a future event, which can allow your business to better optimize its responsive actions, for instance, to perform stock optimization or service-package personalization. There is an enormous number of data science & ML applications in many business domains, some of which I have had the opportunity to work on, including customer propensity modelling for campaigning and churn analysis, automatic risk detection in security operations, social media & customer feedback analysis, demand sensing, and many others…
  • #7 The principal data mining tasks are classification, regression, clustering, association rule discovery (or frequent pattern mining), time series analysis, probabilistic modelling, similarity analysis, and collaborative filtering. Each focuses on tackling a particular analytics challenge. Some, of course, fall under the category of supervised learning; others are considered unsupervised learning techniques.
  • #8  Now let’s take a look onto the activities of any a science process, to try to discriminate between experimental data science and operational machine learning
  • #9 It starts with an exploratory data analysis phase… After being presented with an analytics problem, you start with collecting the relevant data and importing it to your environment… Then you blend this data by performing some generic data engineering tasks, such as merging, joining, aggerating, and so on…. After that, you apply some machine learning-specific data preparation tasks, also know as features engineering, including features construction, extraction, selection, and feature tuning, like scaling, handling missing values & outliers, and so on. The output of this phase is a learning dataset, that will be used in your ML experimentation phase. In this phase, you perform iterative steps training & testing to select the algorithm & parameters that produce the model that best captures the hidden patterns in your data… The final output of this whole experimentation phase is a report of findings, along with comprehensive visuals. That can be in the form of a markdown file, using jupyter notebooks, that tills the end-to-end data analysis story and support reproducibility. These results may lead to a specific decision or recommendation. In some scenarios, these results are the ultimate output of the data science activity
  • #10 However, in many other scenarios, where you need repeated and real-time intelligence, such as targeted advertising and recommender systems, you need to productionize the models produced by the previous data science process, and integrate them with your operational systems to perform online predictions and recommendations. In that case, the whole ML pipeline, including data ingestion, processing, and model training and/or scoring, needs to be a repeatable, automated process. The process should produce a model that exposes a Web API, so that it can be integrated with your operational apps and consumed in real time.
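As a rough illustration of what "a model that exposes a Web API" in the notes for slide #10 can look like, here is a small sketch that uses only the Python standard library. Everything in it is hypothetical: the linear "churn" model, its weights, the feature names, and the port are assumptions made for the example, and none of the Microsoft services discussed later work exactly this way.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained model: intercept and weights of a linear churn score.
MODEL = {"intercept": 0.1, "weights": {"tenure_months": -0.01, "complaints": 0.2}}

def score(features):
    """Return the linear model's score for one input case (a dict of features)."""
    total = MODEL["intercept"]
    for name, weight in MODEL["weights"].items():
        total += weight * float(features.get(name, 0.0))
    return total

def handle_scoring_request(body):
    """Turn a JSON request body into a JSON response body."""
    features = json.loads(body)
    return json.dumps({"score": score(features)})

class ScoringHandler(BaseHTTPRequestHandler):
    """HTTP wrapper: POST a JSON object of features, receive a JSON score."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request_body = self.rfile.read(length).decode("utf-8")
        response_body = handle_scoring_request(request_body).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response_body)))
        self.end_headers()
        self.wfile.write(response_body)

def serve(port=8000):
    """Start the scoring service; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the service, and an operational app would then POST a body such as `{"tenure_months": 10, "complaints": 2}` to it. The sketch leaves out input validation, error handling, and authentication, all of which a real deployment would need.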
  • #11 Now let’s switch gears and talk about technology…
  • #12 You have probably seen this diagram more than 100 times during the last couple of days. This is the Cortana Intelligence Suite, which provides you with a plethora of services to build an end-to-end, batch and real-time, data analytics platform. You can visit the Cortana Intelligence Gallery online to find many templates for analytics solutions. However, we are going to focus only on the machine learning and intelligence parts of it.
  • #13 Here we have the different Microsoft technologies for data science & machine learning, which are: Azure Machine Learning; Microsoft R Server & SQL Server (in-database) R Services; Analysis Services Data Mining; Spark Machine Learning on Azure HDInsight; Cognitive Features in Azure Data Lake Analytics; and Azure Cognitive Services & the Microsoft Bot Framework. Does anyone know any other Microsoft tool or technology for Data Mining or Machine Learning? So I can declare this an all-inclusive list :) Alright, we will try to touch on each of these, except the Bot Framework, discuss features and limitations, and show a simple demo for each.
  • #14 So let’s start with Azure Machine Learning. Azure ML has been around for a while now, and has gained a lot of popularity in both experimental data science and operationalizing ML models.
  • #15 It is a cloud-based PaaS service. It provides an interactive (drag and drop) data science studio to perform your experiments, with a lot of built-in functionality for data processing and modelling. You can import data from different sources. However, the most interesting feature in Azure ML is that you can easily productionize your ML models as web services, so that you can integrate them into your ETL for training and batch scoring, or into your operational applications for real-time predictions. You can also extend your ML experiments using Python and R scripts. Let’s quickly switch to the Azure portal and have a look at it.
  • #23 Microsoft R Server: probably the most important analytics product for Microsoft at the moment. If you are an R developer, you will probably know that open-source R has scalability limitations, because it is single-threaded and in-memory only. You needed to use commercial R libraries to make your program multi-threaded, to process your data partly in-memory and partly on-disk, so that you can handle data sizes bigger than your workstation’s memory, and to run your R app on a cluster for distributed computing and scaling your data processing.
  • #24 Well, Microsoft has acquired a company that builds such libraries, called Revolution Analytics, and included their open-source libraries in Microsoft R Open (MRO), and their commercial ones in Microsoft R Server (MRS). Besides, MRO has an enhanced Math Kernel Library for more efficient mathematical computations, and it is compatible with all R-related software.
  • #25 So in this comparison, you can see that Microsoft R Server processes the data in-memory as well as on-disk (using external data frames, or XDF, which we will see shortly). It is multi-threaded, and supports distributed computing to scale for big data processing.
  • #26 Let’s take a closer look at the main components of MS R Server. ScaleR: the core libraries in MS R, optimized for parallel execution, which use external data frames to overcome the memory limitation. ConnectR: provides access to various data sources, including distributed file systems and relational databases. DistributedR: allows your R application to run in different execution contexts, including a distributed one. So you can write your application once and, with a few lines of code, configure it to run in a different execution context in order to scale it. MS R Server Operationalization: allows you to deploy your R models, on a configured R Server, as Web APIs (similar to what we have seen in Azure ML) using the mrsdeploy library. Let’s have a look at some sample MS R code.
  • #29 SQL Server R Services…
  • #30 Supporting the execution of R scripts within SQL Server has opened interesting opportunities to bring your analytics close to your data, and to integrate your machine learning process with your SQL-based ETL. Now you can encapsulate your ML task, as R scripts, in a T-SQL stored procedure, to perform model training or batch scoring. Then you can call this proc from an SSIS package. For a training pipeline, you use data in SQL Server tables to train an R model, then you serialize this model and store and maintain it in a table. Similarly, for a prediction pipeline, you load the R model that is stored in the table, perform prediction using data in a SQL table, and save back the results. This is all done using sp_execute_external_script. Another advantage is that your ML scripts are managed and source controlled as part of your SQL Database project in Visual Studio. In terms of limitations, R Services are not yet supported in Azure SQL DB/DW. It is not suitable for interactive data science, and it’s a bit hard to debug. It also supports R only, which is not ideal for Python data scientists. However, I think we will see SQL Server in-database Python Services in the future. Let’s have a look at how this works.
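The train, serialize, store, load, and predict cycle described in the notes for slide #30 can be sketched outside SQL Server as well. The example below is only an analogy in Python: it swaps in the standard library's `sqlite3` and `pickle` for SQL Server, R serialization, and sp_execute_external_script, and the table names, the toy data, and the trivial "mean of the target" model are all hypothetical.

```python
import pickle
import sqlite3

def train_mean_model(rows):
    """'Train' a trivial model: the mean of the target column."""
    targets = [target for (_feature, target) in rows]
    return {"mean_target": sum(targets) / len(targets)}

# In-memory database standing in for the SQL Server database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE training_data (feature REAL, target REAL)")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, payload BLOB)")
conn.execute(
    "CREATE TABLE scoring_data (id INTEGER PRIMARY KEY, feature REAL, prediction REAL)"
)
conn.executemany(
    "INSERT INTO training_data VALUES (?, ?)",
    [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
)
conn.executemany(
    "INSERT INTO scoring_data (id, feature) VALUES (?, ?)",
    [(1, 1.5), (2, 2.5)],
)

# Training pipeline: read the table, fit the model, serialize it, store it.
rows = conn.execute("SELECT feature, target FROM training_data").fetchall()
model = train_mean_model(rows)
conn.execute(
    "INSERT OR REPLACE INTO models VALUES (?, ?)",
    ("demo_model", pickle.dumps(model)),
)

# Prediction pipeline: load the stored model, score the rows, save the results.
(payload,) = conn.execute(
    "SELECT payload FROM models WHERE name = ?", ("demo_model",)
).fetchone()
loaded_model = pickle.loads(payload)
conn.execute("UPDATE scoring_data SET prediction = ?", (loaded_model["mean_target"],))
predictions = conn.execute(
    "SELECT id, prediction FROM scoring_data ORDER BY id"
).fetchall()
```

The point of the pattern is that the model becomes just another row in a table, so it can be versioned, backed up, and secured together with the data it scores.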
  • #32 Well, I can’t talk about Microsoft & Machine Learning without mentioning my old friend Analysis Services Data Mining. I’ve personally delivered a couple of interesting Data Mining solutions using this technology. It has been around for more than 10 years, since SQL Server 2005, yet no one is talking about it now…
  • #33 Although SQL Server Analysis Services has limited Data Mining features, as well as very limited extensibility (that is, you need to write your own ML algorithms and integrate them using only C++), I think it has a number of useful features that make it a good candidate for delivering and productionizing ML solutions. It can process data from various OLEDB and ODBC data sources, including Azure SQL DB and DW. It is very easy to build and deploy your data mining models in an Analysis Services database and use them for batch prediction using DMX. It integrates seamlessly with SSIS, which includes special tasks to perform train and predict queries. However, the most interesting thing about Analysis Services is that it provides a very useful interactive console to explore and interpret the constructed models. It also has an Excel Add-in that allows non-technical users to do Data Mining.
  • #34 In Analysis Services Data Mining, you build your mining structures based on the tables in the data source view. Then you run an algorithm on the data in the mining structure to produce a mining model. The algorithms available are:… with a wide range of parameter configurations.
  • #38 If you want your system to perform generic intelligence tasks, such as text analysis or image recognition, you should probably consider Azure Cognitive Services before you build your own models. You have a number of REST APIs that perform various analytics, including language, speech, vision, and search, that can be consumed directly from your operational app. Very good documentation is supplied for each API.
  • #39 You can create a cognitive service in your Azure Portal by specifying the API type and service level (which affects how much you will pay), and look at the documentation to learn how to consume a specific service. Let’s have a look at a demo that uses the Emotion API.
  • #41 While Azure Cognitive Services APIs are used in your operational apps for real-time predictions, Data Lake Analytics provides cognitive features to be integrated into your batch big data processing pipeline. This includes image and text analytics. So you can benefit from the scalability of U-SQL jobs to process large amounts of data, as well as from the orchestration of Azure Data Factory to streamline your process. In addition, Data Lake Analytics supports Python extensions in U-SQL scripts; let’s have a look. So basically you can write some Python functions to process your data as part of the U-SQL job. Now let’s see a Text Analytics demo using Data Lake Analytics Cognitive Features.
  • #46 If you are a Spark developer, you will probably know that Spark has rich ML libraries to perform various data processing and model learning tasks. It is scalable and extensible (you can implement your own algorithms using PySpark, SparkR, Java, or Scala). It is typically suitable for model training and batch scoring. And as it is available on HDInsight, your Spark data processing and ML pipelines can be automated and scheduled using Azure Data Factory. Here is the link for the Data Factory Template for this…
  • #47 Spark has standard APIs for ML that make it easier to combine multiple tasks, including pre-processing and modelling, into a single workflow, or pipeline.
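The idea from the notes for slide #47, chaining several tasks into a single workflow, can be illustrated without a Spark cluster. The sketch below is plain Python and is not the Spark ML API: the `Tokenizer`, `VocabularyCounter`, and `Pipeline` classes are hypothetical stand-ins written for this example, and they only mimic the general fit-then-transform pattern that pipeline APIs tend to follow.

```python
class Tokenizer:
    """Stateless stage: split each document into lowercase words."""

    def fit(self, docs):
        return self  # nothing to learn

    def transform(self, docs):
        return [doc.lower().split() for doc in docs]


class VocabularyCounter:
    """Fitted stage: learn a vocabulary, then map token lists to count vectors."""

    def fit(self, token_lists):
        words = sorted({w for tokens in token_lists for w in tokens})
        self.index_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, token_lists):
        vectors = []
        for tokens in token_lists:
            vec = [0] * len(self.index_)
            for w in tokens:
                if w in self.index_:  # words unseen during fit are ignored
                    vec[self.index_[w]] += 1
            vectors.append(vec)
        return vectors


class Pipeline:
    """Run stages in order; each stage is fitted on the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


# Fit the whole workflow once, then reuse it on new data.
pipeline = Pipeline([Tokenizer(), VocabularyCounter()])
pipeline.fit(["Spark is fast", "Spark scales"])
features = pipeline.transform(["spark spark rocks"])
```

Because the fitted pipeline is a single object, the same pre-processing is guaranteed to be applied at training time and at scoring time, which is the main practical benefit of the pipeline approach.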