Introduction to Machine Learning with Azure & Databricks
The document outlines a series of upcoming workshops focused on data modernization, analytics, and data governance, hosted by CCG. It features sessions on machine learning fundamentals, classification, and clustering methods, aiming to provide insights into the applications and advantages of machine learning in various business contexts. Additionally, it mentions the importance of aligning people, processes, technology, and data for effective machine learning implementation.
1.
CCG: Upcoming Workshops
Data Modernization in a Day | March 30th | 9:00 AM – 12:00 PM EST
• This workshop will cover everything from whiteboarding migration strategies to hands-on experiences with data migration tools.
Analytics in a Day Ft. Synapse Workshop | April 20th | 9:00 AM – 1:00 PM EST
• Learn how to simplify and accelerate your journey towards the modern data warehouse.
Data Governance Workshop with CCG + Profisee | May 4th | 9:00 AM – 12:00 PM EST
• Learn how leveraging an MDM strategy within the context of Data Governance drives organizational alignment, ensures data quality, and accelerates Digital Transformation.
Read more and register at ccganalytics.com/events
Follow us on LinkedIn @CCGAnalytics to stay up to date on events
2.
Housekeeping
• PLEASE POST QUESTIONS IN THE CHAT!
• PLEASE MUTE YOUR LINE WHEN NOT SPEAKING! CCG WILL CONTROL MUTING AND UNMUTING.
• THIS SESSION WILL BE RECORDED.
• WE WILL SHARE SLIDES WITH YOU.
• TO MAKE PRESENTATION LARGER, DRAG THE BOTTOM HALF OF SCREEN ‘UP’
CCG is a full-service cloud analytics provider.
Strategy and Governance
• Solutions: Data Governance Solution, Data Privacy Solution, Strategy Roadmap Solution
• Services: Health Assessments, Roadmaps, Data Governance, Data Privacy, Master Data Management with Profisee, Metadata Management
Analytics and Insights
• Solutions: Customer Intelligence Solution, Visualization & Reporting Solutions
• Services: Dashboards & Visualizations, Operational Reporting, Data Exploration, Customer Insights, Marketing Analytics, Power BI, D365 Customer Insights
AI and Data Science
• Solutions: Machine Learning Solution, Model As A Service
• Services: Prescriptive Analytics, Azure Cognitive Services, Natural Language Processing, Computer Vision / Image, ML Ops, Data Mining, Data Science Enablement, Data Science Roadmap
Data and Infrastructure
• Solutions: Platform Modernization Solution, Cloud Migration and Management
• Services: DR/BC, Security, Azure Governance, Data Warehousing, Data Integration, Data Architecture, PowerApps, Synapse DW
5.
TODAY’S SPEAKER
Brian Beesley, Data Science Practice Director
At CCG since 2017, Brian has led Machine Learning and AI initiatives with a broad range of clients from numerous industries, including financial services, healthcare, industrial goods, retail, professional services, construction, and entertainment.
Spent 2012-2017 mostly focused on large banks and insurance clients on both regulatory compliance and discretionary projects doing:
– Model Process & Governance
– Data Aggregation & Reporting
– Data Management & Governance
– Business Analysis
– Program Management & Org Change Management
Master’s Cert in Financial Services Analytics from Stevens Institute of Technology ’16
Bachelor’s in Business Economics and Public Policy from Indiana University Kelley School of Business ’09
Machine Learning 101
An introduction to Machine Learning and its uses in business
8.
Agenda
• Why should anyone care about Machine Learning?
• What is Machine Learning?
• How does Machine Learning work?
• Ok, but how does it really work?
• How can an organization use Machine Learning?
9.
The concepts in Machine Learning are not new. (Source: https://www.quantinsti.com/blog/machine-learning-basics)
10.
Though the concepts have been around, Machine Learning has just started getting buzz in recent years because the barriers to entry are much lower.
• Flood of data and decreasing costs of storage
• Increasing computational power
• Increased attention from researchers
• Growth of open source technologies
• Support from industries
Machine Learning doesn’t just have to be the realm of high tech. There are practical ways to incorporate it across the business.
Sales/Marketing
• Price Optimization
• Inventory Forecasting
• Customer Segmentation
• Cross Sell / Upsell / Recommendation Engines
• Customer Churn Predictions
• Customer Lifetime Value
Finance
• Asset Pricing
• Risk Analysis
• Fraud detection
• Market Forecasting
• Anti Money Laundering
Operations
• Inventory Forecasting
• Robotics
• Automated Workflows
• Predictive Maintenance
• Schedule Optimization
• IoT Production Line Monitoring
Service
• Single View of Customer
• Customer Service analysis
• Chat Bots / Digital Assistants
• Social Media Analysis
• Lead Scoring
Agenda
• Why should anyone care about Machine Learning?
• What is Machine Learning?
• How does Machine Learning work?
• Ok, but how does it really work?
• How can an organization use Machine Learning?
15.
Machine Learning is a discipline that supports the data science process. It is a technique, and its value is in the outputs it drives.
Diagram stages: Discipline, Process, Decision, Actions
• Data Science: A broad process for generating insights that may involve data ingestion from one or many sources (including external data, streaming data, or big data), data processing and cleansing, model generation using either statistical or machine learning approaches, model selection, model deployment and maintenance, and visualization of data.
• Advanced Analytics: Apply data science to predictive (what will happen?) or prescriptive (what should we do?) business use cases.
• Artificial Intelligence / Cognitive Computing: Apply data science to approximate human intuition and decision making (e.g. strategy, creativity, planning) or human sensory functions (e.g. computer vision, natural language understanding, etc.)
• Statistics: A branch of math for generating descriptions or inferences about a population, often based on samples of the population. Inferences may take the form of “models,” which are equations that approximate the data’s inherent relationships.
• Machine Learning: Combines computer science with math concepts to generate models by rapidly iterating on large datasets.
• Other Analytics Disciplines: High Performance Computing, Data Engineering, Visualization, etc.
• Automation / Robotics / Intelligent Devices
• Strategy / Operations
16.
Advanced Analytics can enable predictive and prescriptive uses of data. Traditional analytics focuses on understanding and explaining the data that has been collected. Advanced Analytics focuses on generating new data in the form of predictions or decisions, and going the extra step to automate decision-making when possible.
17.
Simply put, machine learning is the science of making best guesses by iterative trial and error.
18.
Agenda
• Why should anyone care about Machine Learning?
• What is Machine Learning?
• How does Machine Learning work?
• Ok, but how does it really work?
• How can an organization use Machine Learning?
19.
Machine Learning works by using “algorithms” to generate “models”.
A model is a repeatable, data-driven approach to making a best guess. It does this by formalizing mathematical relationships between data in the form of either:
– Rules (e.g. predict applicants will default on a loan if Credit Score < 700 and Debt to Income Ratio > 30%)
– Or an equation (e.g. predict Home Price = 100*Square Footage + 2*Average Income in the Area)
Note that this is different from other types of models, like operating models or data models.
Diagram labels: Statistical Model, Data Model, Operating Model; People, Process, Technology, Data; Guide, Support, Enable.
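The two example forms above, a rule and an equation, can be written down directly as code. This is a minimal Python sketch: the function names are hypothetical, and the thresholds and coefficients are simply the illustrative ones quoted on the slide, not fitted values.

```python
def predicts_default(credit_score, debt_to_income):
    """Rule-style model: flag a loan applicant as a likely default.

    Thresholds (700 and 30%) are the illustrative values from the slide.
    """
    return credit_score < 700 and debt_to_income > 0.30


def predict_home_price(square_footage, average_area_income):
    """Equation-style model: a weighted sum of the inputs.

    Coefficients (100 and 2) are the illustrative values from the slide.
    """
    return 100 * square_footage + 2 * average_area_income


# Hypothetical inputs, chosen only to exercise both models.
risky_applicant = predicts_default(credit_score=650, debt_to_income=0.35)
safe_applicant = predicts_default(credit_score=720, debt_to_income=0.35)
price = predict_home_price(square_footage=2000, average_area_income=50000)

print(risky_applicant, safe_applicant, price)  # True False 300000
```

In both cases the "model" is just a repeatable way of turning inputs into a best guess; machine learning is about finding the thresholds and coefficients from data rather than choosing them by hand.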
20.
What’s a model?
In the past we’ve told computers how to use data to answer our questions.
Data: Prior month sales: $4MM; 2 months prior: $3MM; 3 months prior: $2MM
Program / Model: This month sales = (prior month + 2 months prior + 3 months prior) / 3
Answer: This month’s sales = $3MM?
21.
What’s a model?
But we’ve found that if we give the machine historic facts, we can let it find the right program / model to plug in for future answers.
Historic facts (the slide shows a stack of Data / Answer cards):
– Data: Prior month sales: $4MM; 2 months prior: $3MM; 3 months prior: $1MM. Answer: Last month’s sales: $2MM
– Data: Prior month sales: $4MM; 2 months prior: $3MM; 3 months prior: $2MM. Answer: Last month’s sales: $2MM
Program / Model: This month’s sales = 1/8 * Prior month + 1/3 * 2 months prior + 1/4 * 3 months prior
22.
What’s a model?
Once we have our machine-defined program, we can use it with new data to make better predictions.
Historic facts (the slide shows a stack of Data / Answer cards):
– Data: Prior month sales: $4MM; 2 months prior: $3MM; 3 months prior: $1MM. Answer: Last month’s sales: $2MM
– Data: Prior month sales: $4MM; 2 months prior: $3MM; 3 months prior: $2MM. Answer: Last month’s sales: $2MM
Program / Model: This month’s sales = 1/8 * Prior month + 1/3 * 2 months prior + 1/4 * 3 months prior
New Data: Prior month sales: $8MM; 2 months prior: $6MM; 3 months prior: $8MM
Answer: This month’s sales = $5MM
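To make the contrast concrete, here is a small Python sketch of the two programs from these slides: the hand-written average and the machine-defined weighted sum. The function names are hypothetical, and the weights are copied from the slide rather than learned here; exact fractions are used so the arithmetic matches the slide figures.

```python
from fractions import Fraction


def hand_written_model(prior_months):
    """The program a person chose: a plain average of the prior months."""
    return Fraction(sum(prior_months), len(prior_months))


# Weights quoted on the slide for (prior month, 2 months prior, 3 months prior).
LEARNED_WEIGHTS = (Fraction(1, 8), Fraction(1, 3), Fraction(1, 4))


def learned_model(prior_months, weights=LEARNED_WEIGHTS):
    """The machine-defined program: a weighted sum of the prior months."""
    return sum(w * x for w, x in zip(weights, prior_months))


history = (4, 3, 2)   # $MM: prior month, 2 months prior, 3 months prior
new_data = (8, 6, 8)  # $MM: the "New Data" card

print(hand_written_model(history))  # 3
print(learned_model(history))       # 2
print(learned_model(new_data))      # 5
```

On the historic example the weighted model reproduces the recorded answer of $2MM, where the simple average would have guessed $3MM; on the new data it gives the slide's $5MM.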
23.
What is an algorithm?
The word algorithm gets used a lot, but it isn’t always defined.
• A defined set of steps for solving a problem
• Often involves repeating steps
• In Machine Learning, it may or may not have an ending condition. Some common ending conditions are:
– The problem is solved to our satisfaction (for example, stop when the last 4 iterations have been 95% accurate or better)
– The problem hasn’t been solved but we don’t seem to be getting any closer to solving it (for example, stop if the last 10 iterations have not seen any improvement in accuracy)
– The process has run for a long time (for example, stop after the program has run for 12 hours, regardless of whether progress is still being made)
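The three ending conditions above can be sketched as checks inside a training loop. This is an illustrative Python sketch only: `run_until_done` and its parameter names are hypothetical, and `step` stands in for whatever single training iteration returns an accuracy.

```python
import itertools
import time


def run_until_done(step, target=0.95, window=4, patience=10,
                   max_seconds=12 * 3600):
    """Call step() repeatedly until one of three ending conditions is met."""
    start = time.monotonic()
    history = []
    best, rounds_without_gain = float("-inf"), 0
    while True:
        accuracy = step()
        history.append(accuracy)
        if accuracy > best:
            best, rounds_without_gain = accuracy, 0
        else:
            rounds_without_gain += 1
        # 1. Solved to our satisfaction: the last `window` results all hit target.
        if len(history) >= window and min(history[-window:]) >= target:
            return "satisfied", history
        # 2. Not getting closer: no improvement for `patience` iterations.
        if rounds_without_gain >= patience:
            return "stalled", history
        # 3. Ran for a long time: time budget used up.
        if time.monotonic() - start >= max_seconds:
            return "out of time", history


# Hypothetical accuracy sequences standing in for real training runs.
rising = iter([0.5, 0.7, 0.9, 0.95, 0.96, 0.97, 0.98])
rising_reason, rising_history = run_until_done(lambda: next(rising))

flat = itertools.repeat(0.6)
flat_reason, flat_history = run_until_done(lambda: next(flat))

print(rising_reason, len(rising_history))  # satisfied 7
print(flat_reason, len(flat_history))      # stalled 11
```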
24.
What is an algorithm?
Almost all machine learning algorithms follow the same general pattern.
• Create a hypothesis: Collect the data and randomly create initial decision rules.
• Evaluate the hypothesis: Design a method for measurably evaluating how good or bad your hypothesis is.
• Adjust the hypothesis: Update your hypothesis in a way that marginally improves the performance of your decision rules.
• Repeat until convergence: Continue this process until either you are satisfied with the results, or your hypothesis can’t improve anymore with the data available.
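As one concrete instance of this create / evaluate / adjust / repeat pattern, the sketch below fits a single number `w` so that `w * x` approximates `y`. It uses gradient descent on mean squared error, which is a choice made for this illustration rather than something the slides prescribe; the function names and the toy data are hypothetical.

```python
import random


def mean_squared_error(w, xs, ys):
    """Evaluate: how far off is the hypothesis y = w * x, on average?"""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_slope(xs, ys, step=0.01, tolerance=1e-9, seed=0, max_rounds=100_000):
    # Create a hypothesis: start from a random guess for w.
    w = random.Random(seed).uniform(-1.0, 1.0)
    error = mean_squared_error(w, xs, ys)
    for _ in range(max_rounds):
        # Adjust the hypothesis: nudge w in the direction that lowers the error.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        candidate = w - step * gradient
        candidate_error = mean_squared_error(candidate, xs, ys)
        # Repeat until convergence: stop once the improvement is negligible.
        if error - candidate_error < tolerance:
            break
        w, error = candidate, candidate_error
    return w


# Toy data that follows y = 2 * x exactly.
slope = fit_slope([1, 2, 3, 4], [2, 4, 6, 8])
print(round(slope, 3))  # 2.0
```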
25.
Agenda
• Why should anyone care about Machine Learning?
• What is Machine Learning?
• How does Machine Learning work?
• Ok, but how does it really work?
• How can an organization use Machine Learning?
26.
There are two main families of algorithms to choose from.
Supervised Learning: We know the “right answers” for some of the scenarios.
– We may have history we can look back on
– We may be hoping to replicate human decision making
Examples: Predict our profits next quarter. Identify the number written on a check. Predict a user’s rating for a given product.
Unsupervised Learning: There aren’t necessarily “right answers,” we just want to get a better understanding of our data.
Examples: Group our customers into segments. Find the most important variables in a dataset. Identify credit card transactions that are out of the ordinary.
Image credit: Gowthamy Vaseekaran via Medium.com, available at https://medium.com/@gowthamy/machine-learning-supervised-learning-vs-unsupervised-learning-f1658e12a780
27.
Now let’s walk through two of the most popular machine learning approaches and discuss how the algorithms are applied.
• Classification
• Clustering
28.
Use classification when you want to guess a non-numeric value, like a yes/no answer. We will take a decision tree approach.
Create a hypothesis: Everyone will repay their loan.
Data: 20 outstanding loans
29.
Evaluate the hypothesis: Calculate accuracy as the % of predictions that are correct based on your current set of rules.
20 outstanding loans: 12 repaid, 8 defaulted
Accuracy = 12/20 = 60%
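The accuracy calculation for this starting hypothesis can be reproduced in a few lines. A minimal Python sketch, with hypothetical names; the 12 repaid and 8 defaulted counts are the ones on the slide.

```python
def accuracy(predictions, actuals):
    """Share of predictions that match what actually happened."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


# The 20 outstanding loans from the slide: 12 repaid, 8 defaulted.
actuals = ["repaid"] * 12 + ["defaulted"] * 8

# Starting hypothesis: everyone will repay their loan.
predictions = ["repaid"] * len(actuals)

baseline = accuracy(predictions, actuals)
print(f"{baseline:.0%}")  # 60%
```

Any rule the tree adds later has to beat this 60% baseline to be worth keeping.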
30.
Adjust the hypothesis: Find the next branch by looking for the data split that would have the biggest impact on the purity of each node. There are several ways to do this mathematically (Gini Index, Information Gain, Chi-Square).
Candidate splits of the 20 outstanding loans (accuracy per branch, in the order shown on the slide):
• Income > 60k / Income < 60k: 70% / 50%, 60% weighted
• Credit Score > 700 / Credit Score < 700: 71% / 53%, 59% weighted
• DTI > 40% / DTI < 40%: 80% / 73%, 75% weighted
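The comparison of candidate splits can be sketched with two measures: the weighted accuracy quoted on the slide and the Gini index mentioned as one purity measure. The per-branch loan counts below are assumptions, chosen only because they reproduce the slide's percentages for the Income and DTI splits; the slide does not give the counts themselves.

```python
def gini(counts):
    """Gini impurity of one node, given class counts such as (repaid, defaulted)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(branches):
    """Impurity of a split: each branch's Gini weighted by its share of loans."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * gini(b) for b in branches)


def weighted_accuracy(branches):
    """Accuracy if each branch predicts its majority class."""
    total = sum(sum(b) for b in branches)
    return sum(max(b) for b in branches) / total


# Assumed (repaid, defaulted) counts per branch; each split covers all 20 loans.
income_split = [(7, 3), (5, 5)]  # 70% and 50% accurate -> 60% weighted
dti_split = [(1, 4), (11, 4)]    # 80% and 73% accurate -> 75% weighted

print(weighted_accuracy(income_split))  # 0.6
print(weighted_accuracy(dti_split))     # 0.75
print(weighted_gini(dti_split) < weighted_gini(income_split))  # True
```

Under these assumed counts both measures agree that the DTI split leaves purer branches than the Income split.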
31.
Repeat until convergence: Repeat the process for each of your new “leaf” nodes. Stop when you reach an acceptable level of accuracy, or when your accuracy begins getting worse with independent data.
Tree nodes shown: 20 outstanding loans; DTI > 40% / DTI < 40%; Credit Score > 700 / Credit Score < 700; Income > $60k / Income < $60k
Leaf accuracies: 100%, 50%, 100%, 100%; 80% weighted
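Repeating the split search on each new leaf is naturally recursive. The sketch below is a deliberately small decision tree builder in plain Python, not the implementation behind the slides: it picks splits by Gini impurity, and it stops at a fixed `max_depth` instead of checking accuracy on independent data. All names and the eight example loans are hypothetical.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=2):
    """Split recursively; a leaf is just the majority label of its node."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))


def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] < t else right
    return tree


# Hypothetical loans: (debt-to-income ratio, credit score) and what happened.
loans = [(0.45, 650), (0.50, 720), (0.42, 600), (0.30, 710),
         (0.20, 760), (0.35, 640), (0.25, 690), (0.38, 720)]
outcomes = ["default", "default", "default", "repay",
            "repay", "default", "default", "repay"]

tree = build_tree(loans, outcomes)
training_accuracy = sum(predict(tree, r) == y for r, y in zip(loans, outcomes)) / len(loans)
```

A tree grown until it is perfect on its own training data can easily be too specific, which is why the slide's stopping rule looks at accuracy on independent data.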
32.
Classification is used for lots of problems that copy human intuition. Think about how you classify information to identify these images! These use cases are obviously more complex than our simple decision tree. But with more advanced approaches like convolutional neural networks these pictures can definitely be classified by a machine.
33.
Now let’s walk through two of the most popular machine learning approaches and discuss how the algorithms are applied.
• Classification
• Clustering
34.
Use clustering when there’s no “correct” classification, but you still want to assign individuals to groups. This algorithm is called k-means clustering.
Imagine Marketing has asked you to split these customers into 3 groups. How would you do it?
35.
Create a hypothesis: I can segment my customers by assigning them to 3 groups. We’ll set down 3 random “anchors” and assign each customer to its closest anchor.
36.
Evaluate the hypothesis: Move the anchors to the center of each cluster. Count how many customers are actually closer to one of the other anchors.
37.
Adjust the hypothesis: Re-assign each customer to the group corresponding to the center they’re closest to.
38.
Repeat until convergence: Move the anchors again. Continue re-assigning customers and moving the anchors until the anchors stop moving.
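The four k-means steps walked through above fit in a short loop. This is a minimal plain-Python sketch for 2-D points, not a production implementation; the function names and the nine example customers are hypothetical. As an assumption of this sketch, random anchors are drawn from the customers themselves when no starting anchors are passed in.

```python
import random


def dist2(p, q):
    """Squared straight-line distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(points, anchors):
    """Put each customer in the group of its closest anchor."""
    clusters = [[] for _ in anchors]
    for p in points:
        nearest = min(range(len(anchors)), key=lambda i: dist2(p, anchors[i]))
        clusters[nearest].append(p)
    return clusters


def centers(clusters, anchors):
    """Move each anchor to the center of its group (keep it if the group is empty)."""
    moved = []
    for cluster, old in zip(clusters, anchors):
        if cluster:
            moved.append((sum(x for x, _ in cluster) / len(cluster),
                          sum(y for _, y in cluster) / len(cluster)))
        else:
            moved.append(old)
    return moved


def kmeans(points, k=3, initial_anchors=None, seed=0, max_iter=100):
    anchors = (list(initial_anchors) if initial_anchors
               else random.Random(seed).sample(points, k))
    for _ in range(max_iter):
        new_anchors = centers(assign(points, anchors), anchors)
        if new_anchors == anchors:  # the anchors stopped moving
            break
        anchors = new_anchors
    return anchors, assign(points, anchors)


# Hypothetical customers plotted on two attributes.
customers = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10),
             (20, 1), (21, 1), (20, 2)]
anchors, groups = kmeans(customers, initial_anchors=[(1, 1), (10, 10), (20, 1)])
print([len(g) for g in groups])  # [3, 3, 3]
```

The result depends on where the anchors start: an unlucky random start can settle on a poorer grouping, which is why k-means is often run several times from different starting points.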
39.
This is just the tip of the iceberg. There are several algorithms available for various types of problems.
40.
Agenda
• Why should anyone care about Machine Learning?
• What is Machine Learning?
• How does Machine Learning work?
• Ok, but how does it really work?
• How can an organization use Machine Learning?
41.
Engaging with Machine Learning
Delivering analytics with Machine Learning requires alignment across people, process, technology, and data.
Image inspired by Microsoft
42.
The sources of data for machine learning can be quite broad.
Data Warehouses
• Curated & Governed data
• Big data
• Cloud or on-prem
Data Lakes
• Unstructured & Semi-structured data
• Streaming data
• Partially curated
Externally Procured Data
• May be purchased from 3rd party providers
• May be scraped from the web
• May require designing research experiments
Data science teams typically have the programming and data integration skills to use data from anywhere it can be found.
43.
Data scientists combine broad skills to integrate data, build models, and drive business value.
The outputs of the data science process can be used in traditional analytics, analyzed directly, or fed into automated decision-making.
Traditional Analytics
• Store and access data.
• Filter and aggregate it.
• Visualize it.
• Show it to the business so they can take action.
Machine Learning
• Filter and aggregate it.
• Create a model.
• Generate new data (predictions, etc.).
• The new data can be stored with the rest of the data for use in analytics.
• Or it can be visualized directly to gain insights.
• Or it can automate decisions or actions, allowing better processes to run faster and 24/7.
46.
We’ll spend the rest of the workshop talking about the tools that enable all this to happen.
Azure Machine Learning: Python-based machine learning service
• Develop models faster with automated machine learning
• Use any Python environment and ML frameworks
• Manage models across the cloud and the edge
Azure Databricks: Apache Spark-based big-data service
• Prepare and clean data at massive scale
• Enable collaboration between data scientists and data engineers
• Access machine learning optimized clusters
Azure Machine Learning offers a suite of tools for managing the Machine Learning lifecycle.
A. Train and evaluate model
B. Organize model assets
C. Deploy and manage model
50.
Automated ML provides an easy way to quickly iterate through multiple models.
Example question: How much is this car worth?
51.
Automated ML provides an easy way to quickly iterate through multiple models.
• Which features? Mileage, Condition, Car brand, Year of make, Regulations, …
• Which algorithm? Gradient Boosted, Nearest Neighbors, SGD, Bayesian Regression, LGBM, …
• Which parameters? Parameter 1, Parameter 2, Parameter 3, Parameter 4, …
Example model: features Mileage, Car brand, Year of make; algorithm Gradient Boosted; parameters Criterion, Loss, Min Samples Split, Min Samples Leaf, XYZ.
52.
Automated ML provides an easy way to quickly iterate through multiple models.
• Which features? Mileage, Condition, Car brand, Year of make, Regulations, …
• Which algorithm? Gradient Boosted, Nearest Neighbors, SGD, Bayesian Regression, LGBM, …
• Which parameters? They depend on the algorithm: Criterion, Loss, Min Samples Split, Min Samples Leaf, XYZ for Gradient Boosted; N Neighbors, Weights, Metric, P, ZYX for Nearest Neighbors.
Iterate: each candidate model pairs one algorithm and its parameters with a feature subset, such as (Mileage, Car brand, Year of make) or (Car brand, Year of make, Condition).
53.
Automated ML provides an easy way to quickly iterate through multiple models.
Which algorithm? Which parameters? Which features? Iterate.
54.
Automated ML provides an easy way to quickly iterate through multiple models.
• Input: enter data, define goals, apply constraints
• Intelligently test multiple models in parallel
• Optimized model
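The idea of iterating over features and parameters and keeping the best candidate can be illustrated without any cloud service. The sketch below is plain Python and is not the Azure Machine Learning API: it tries one algorithm (nearest neighbors) with different feature subsets and neighbor counts, scores each on held-out cars, and keeps the lowest error. The function names and the car data (mileage in thousands of miles, age in years, price in $ thousands) are hypothetical.

```python
from itertools import product


def knn_predict(train, query, features, k):
    """Average the price of the k training cars most similar to the query."""
    def distance(row):
        return sum((row[f] - query[f]) ** 2 for f in features)
    nearest = sorted(train, key=distance)[:k]
    return sum(row["price"] for row in nearest) / len(nearest)


def mean_abs_error(train, valid, features, k):
    """Average size of the pricing mistake on cars the model has not seen."""
    errors = [abs(knn_predict(train, row, features, k) - row["price"])
              for row in valid]
    return sum(errors) / len(errors)


def search(train, valid, feature_sets, ks):
    """Try every feature set / parameter combination and return the best one."""
    results = [(mean_abs_error(train, valid, fs, k), fs, k)
               for fs, k in product(feature_sets, ks)]
    return min(results, key=lambda r: r[0]), results


train = [
    {"mileage": 10, "age": 1, "price": 30},
    {"mileage": 30, "age": 3, "price": 24},
    {"mileage": 60, "age": 5, "price": 18},
    {"mileage": 90, "age": 8, "price": 12},
    {"mileage": 120, "age": 10, "price": 8},
]
valid = [
    {"mileage": 20, "age": 2, "price": 27},
    {"mileage": 100, "age": 9, "price": 10},
]
feature_sets = [("mileage",), ("age",), ("mileage", "age")]

best, results = search(train, valid, feature_sets, ks=(1, 3))
print(f"best error {best[0]:.2f} with features {best[1]} and k={best[2]}")
```

An automated ML service does the same kind of search over many more algorithms and settings, and runs the candidates in parallel.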
55.
For those who prefer a no-code experience, there’s a drag-n-drop interface in Azure Machine Learning Designer.
56.
Azure Machine Learning allows you to take advantage of cloud compute through local tools or Azure Notebooks.
57.
For image data, you can also train custom object detection models with the intuitive Labeling interface.
58.
Azure Machine Learning offers a suite of tools for managing the Machine Learning lifecycle.
A. Train and evaluate model
B. Organize model assets
C. Deploy and manage model
59.
Experiments allow you to capture training metrics to run side-by-side comparisons and easily select the best model.
60.
Pipelines can organize multiple data preparation and modeling steps into a single resource.
61.
Explain machine learning models to support business users and compliance processes.
Azure Machine Learning offers a suite of tools for managing the Machine Learning lifecycle.
A. Train and evaluate model
B. Organize model assets
C. Deploy and manage model
Models can be deployed to containers and shipped to the edge or accessed via REST APIs.
Proactively manage model performance
• Identify and promote your best models
• Capture model telemetry
• Retrain models with APIs
Deploy models closer to your data
• Deploy models anywhere
• Scale out to containers
• Infuse intelligence into the IoT edge
Bring models to life quickly
• Build and deploy models in minutes
• Iterate quickly on serverless infrastructure
• Easily change environments
Diagram labels: Train and evaluate models; Model MGMT, experimentation, and run history (Azure ML service); Containers (AKS, ACI, IoT edge); Docker
66.
Monitor data drift over time to know when your model may require re-training.
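One simple way to think about data drift is to compare recent inputs with the data the model was trained on. The sketch below is an illustration in plain Python, not how Azure Machine Learning computes drift: it measures how far the recent average has moved, in units of the training data's standard deviation. The function names, the threshold of 2.0, and the sample values are all assumptions.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """Shift in the average of a feature, in baseline standard deviations."""
    return abs(mean(recent) - mean(baseline)) / stdev(baseline)


def needs_review(baseline, recent, threshold=2.0):
    """Flag the feature when recent data has moved far from the training data."""
    return drift_score(baseline, recent) > threshold


# Hypothetical values of one input feature.
training_values = [50, 52, 48, 51, 49, 50]
stable_week = [49, 51, 50, 50]
shifted_week = [58, 60, 59, 61]

print(needs_review(training_values, stable_week))   # False
print(needs_review(training_values, shifted_week))  # True
```

A flag like this does not prove the model is wrong; it is a prompt to check performance and consider re-training.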
67.
That was a high-level overview of the tools and utilities. Let’s dive in to see the citizen data science tool in action.
What is Databricks, in a nutshell?
Databricks is a unified platform powered by Apache Spark, capable of abstracting complex cluster management to scale out your data processing and machine learning workloads, with intelligent optimizations to dynamically reallocate workers given computational demands.
71.
Scaling Out with Distributed Processing vs. Scaling Up
Imagine this… I need to find every entry in the phone book with my first name. I’d like to hire someone to read through the entire phone book and pick them out.
Option A / Option B (diagram; phone book sections shown: A-G, H-N, O-T, U-Z)
72.
Scaling Out with Distributed Processing vs. Scaling Up
Imagine this… I need to find every entry in the phone book with my first name. I’d like to hire someone to read through the entire phone book and pick them out.
Option A / Option B (diagram; phone book sections shown: A-D, E-I, J-M, N-R, S-V, W-Z)
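The phone book analogy maps onto a split / work / combine pattern. The sketch below illustrates it in plain Python with a thread pool on one machine; it is only an analogy for what Spark does across many worker nodes, and Python threads will not actually speed up this kind of work. The function names and the phone book entries (assumed to be in "First Last" form) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def find_matches(entries, first_name):
    """One reader scanning one section of the phone book."""
    return [e for e in entries if e.split()[0] == first_name]


def scale_out(phone_book, first_name, workers=4):
    """Split the book into sections, hand each to a worker, combine the results."""
    size = max(1, -(-len(phone_book) // workers))  # ceiling division
    sections = [phone_book[i:i + size] for i in range(0, len(phone_book), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(find_matches, sections, [first_name] * len(sections))
        return [match for part in parts for match in part]


phone_book = ["Brian Adams", "Ann Baker", "Brian Cole", "Dee Evans",
              "Raj Hill", "Brian Nash", "Sue Olsen", "Tom Upton", "Brian Zane"]

one_reader = find_matches(phone_book, "Brian")  # scaling up: one person, whole book
many_readers = scale_out(phone_book, "Brian")   # scaling out: one section each

print(many_readers)
```

Both approaches return the same entries; the difference is that the sections can be read at the same time, and more readers can be added as the book grows.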
73.
Scaling Out with Distributed Processing vs. Scaling Up
Imagine this…
• I need to extract every numeric column in my dataset and normalize the values in each.
• I need to perform a grid search of hyperparameters to improve the accuracy of my classification model.
• I need to train an algorithm to make correct classifications based on several features.
These common pieces of machine learning pipelines may sound simple, but in working with big data, tasks like these can add hours, days, or weeks to your timeline, or be too cost inefficient to complete at all.
Scaling out (Option A vs. Option B in the diagram): more flexible, more easily scalable, and, with Databricks and Spark, easy to spin up and manage.
74.
What is Spark?
Spark is an open source framework enabling distributed cluster computing for large scale data processing. The Spark architecture works to scale processing out across compute resources with a managing driver node assigning processing tasks to worker nodes.
Spark was founded with the singular goal to “democratize” the “superpower” of big data by offering high-level APIs and a unified engine to complete processing at all steps of the data pipeline. Since then, thousands of contributors have developed Spark projects that improve the accessibility and versatility of the Spark framework and distributed processing.
Timeline
• 2010: Started at UC Berkeley
• 2013: Databricks started & donated to ASF
• 2014: Spark 1.0 and additions to Spark Core (SQL, ML, GraphX)
• 2015: DataFrames/Datasets, Tungsten, ML Pipelines
• 2016: Apache Spark 2.0 (Easier, Smarter, Faster)
• 2018: Continued feature development to further support distributed ML
• 2020: Apache Spark 3.0 released, Adaptive Query Execution, new Pandas function APIs
75.
Scaling Out with Databricks
Databricks is a unified platform powered by Apache Spark, capable of abstracting complex cluster management to scale out your data processing and machine learning workloads, with intelligent optimizations to dynamically reallocate workers given computational demands.
Databricks brings scaling out to your workloads in a way that’s easy to spin up, familiar to work with, and integrates with tools you already use every day.
Familiar Options & Distributed Frameworks on Databricks
In an accessible setting
• Multiple languages in Databricks Notebooks (Python, R, Scala, SQL)
• Databricks Connect: connect external tools with Databricks (IDEs, RStudio, Jupyter…)
• Work on a single node and utilize the most common ML frameworks
Distributed machine learning
• Spark MLlib for distributed models
• Migrate Single Node to distributed with just a few lines of code changes
• Distributed hyperparameter search (Hyperopt, Gridsearch)
• PandasUDF to distribute models over subsets of data or hyperparameters
• Koalas: Pandas DataFrame API on Spark
• Deep Learning distributed training (HorovodRunner)
78.
Enhanced Accessibility on Azure Databricks
Azure Databricks is a first party service on Azure.
• Not an Azure Marketplace or a 3rd party hosted service
• PAAS: Platform as a Service
Azure Databricks is integrated seamlessly with Azure services.
• Azure Storage Services: Directly access data in Azure Blob Storage and Azure Data Lake Store
• Azure Active Directory: For user authentication, eliminate the need to maintain two separate sets of users in Databricks and Azure.
79.
What is MLflow?
MLflow is an open source platform for managing the end-to-end machine learning lifecycle. MLflow offers an integrated experience for tracking and securing machine learning model training runs and running machine learning projects.
80.
MLflow’s Five Key Components
• Tracking: Record and query experiments: code, data, configuration, results
• Projects: Package data science code in a format to reproduce runs on any platform
• Models: Deploy machine learning models in diverse serving environments
• Registry: Store, annotate, discover, and manage models in a central repository
• Serving: Host ML models as REST endpoints that are updated automatically
We can work with your business to deliver custom predictive and prescriptive analytics across the lifecycle.
Machine Learning Strategy
• Develop a backlog of predictive and prescriptive use cases
• Refine and prioritize use cases by value
• Develop a predictive roadmap
Model Development / Data Mining
• Aggregate data from across internal and external data sources
• Perform correlation analyses, develop models, and find new relationships in your data
Model Maintenance
• Monitor and maintain statistical models to sustain predictive power
• Develop a model telemetry dashboard
• Test model design changes to improve predictive power
Model Governance & Operating Model
• Assess existing Data Science & Artificial Intelligence maturity
• Develop standards and processes to help guide data science output
• Build a Data Science Center of Excellence
Model Deployment / MLOps
• Customize and deploy pre-existing models from Azure Cognitive Services
• Deploy custom model as an API or batch job, or support deployment in existing systems
Offerings: RapidInsight Prototype Offering, Model as a Service Subscription Offering, Elastic AI Research & Development, MLOps POC, Managed Services, Accelerators