Python + MPP Database = Large Scale AI/ML Projects in Production Faster

Python + MPP Database = In Production Faster
Paige Roberts, Open Source Relations Manager, Vertica

Mauro Barbieri, Senior Scientist at Philips: https://www.brighttalk.com/webcast/8913/351928

[Architecture diagram. Data sources: SQL Server, Philips Remote Service Network, Teradata (Salesforce, SAP data). Components: distributed pub/sub system, large-scale storage, ETL (Extract, Transform, Load), MPP analytics / machine learning, visualization / reporting / application. Paths: batch and low latency.]

Production Machine Learning Needs
• Speed: fast data processing without heavy operations cost
• Ease of use: functions with a high level of abstraction
• Features: a wide range of functionality
• Flexibility: change is constant (code, deployment, data sources, algorithms, …)
• Open architecture: able to connect with many different technologies

Advantages of Python
• Broad utility: many capabilities; one of the most broadly useful programming languages.
• Flexibility: many right ways to do things, a lot of freedom, works on many platforms.
• Ease of use: a high level of abstraction makes Python one of the easiest programming languages.
• Strong community: most data scientists know Python well, and there are many useful packages (pandas, scikit-learn, …).

Python Uses & Challenges
Python is great for data science work such as:
• Predictive maintenance
• Ensuring quality of service
• Proactive sales
• New products & markets
• Differentiation
• A/B testing
• Marketing behaviors and click analysis
Python has challenges with performance on big data:
• Global interpreter lock
• CPU thread management
• Access to data in multiple nodes
• Methods for efficiently accessing data (indexing and data optimization)
• Concurrency

End-to-End Machine Learning Process: Business Understanding, Data Analysis, Data Preparation, Modeling, Evaluation, Deployment


Challenges of Machine Learning at Scale
• The need for speed at reasonable cost
• Not easy to move big data around
• Sub-sampling can compromise accuracy

Challenges of Machine Learning at Scale: sub-sampling can compromise accuracy, so work with all of your data.

Sampling vs. Full Dataset
• Data usually matters more than algorithms for complex problems
• Small datasets usually lack generalization and are prone to over-fitting
• Large datasets result in better model generalization
Source: https://towardsdatascience.com/breaking-the-curse-of-small-datasets-in-machine-learning-part-1-36f28b0c044d
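A rough illustration of the sampling point, as a sketch only: the data below is synthetic (random values standing in for hypothetical sensor readings), and the quantity estimated is a simple mean rather than a trained model. It compares how far a small sample and a larger sample land from the full-data answer, on average.

```python
import random
import statistics


def mean_abs_error(population, sample_size, trials, rng):
    """Average absolute error of the sample mean over repeated random samples."""
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)


rng = random.Random(0)  # fixed seed so runs are repeatable

# Synthetic stand-in for a large table of readings (an assumption, not real data).
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

small = mean_abs_error(population, sample_size=100, trials=30, rng=rng)
large = mean_abs_error(population, sample_size=10_000, trials=30, rng=rng)

print(f"mean abs error with n=100:    {small:.4f}")
print(f"mean abs error with n=10,000: {large:.4f}")
```

The larger sample should land much closer to the full-data value; with the full dataset there is no sampling error at all. Model training is more complicated than estimating a mean, but the same pressure applies.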

Challenges of Machine Learning at Scale: it is not easy to move big data around, so bring models to the data.

Bring Data to the Model: Slow (Data Has Gravity)
• Data transfer is the bottleneck: fighting inertia
• The need to downsample reduces accuracy
• Results are not where you need them to interact with production systems

Bring the Model to the Data: Fast! (Data Has Gravity)
• Ease of integration with production systems
• Parallelized
• Data stays where it is: security, provenance
• Model management in the database

Challenges of Machine Learning at Scale: the need for speed at reasonable cost means picking the right scaling architecture.

[Architecture diagram. Sources: RDBMS (MySQL, PostgreSQL, …), Cassandra, key/value DB. Ingestion: EL (Extract, Load), schema enforced; ETL (flattened, modeled tables). Tools: Hive, Spark, Presto, notebooks. Recent data. Applications: ETL/modeling, CityOps, machine learning, experiments. Ad hoc analytics: CityOps, data scientists. Generic layers: data sources, distributed pub/sub system, large-scale storage, ETL (Extract, Transform, Load), MPP analytics / machine learning, visualization / reporting / application. Paths: batch and low latency.]

Advantages of MPP Analytical Database
• MPP scale: clusters with no name node or other single point of failure allow unlimited scale
• Speed and concurrency: query optimization and resource management across multiple nodes
• Features: ML algorithm parallelization, moving windows, geospatial analysis, time series joins, fast data prep, ...
• Open architecture: integration with many other applications (BI, ETL, Kafka, Spark, data science labs, …)

High Performance + High Concurrency
• Get data quickly enough to act on it.
• Explore data interactively.
• Enable everyone to make their own data-driven decisions.
[Diagram labels: Scale Data Volumes; Scale Users; SQL Database ++ Analytics & ML Query Engine]

Advantages of Python + MPP Analytical Database
• MPP scale: clusters with no name node or other single point of failure allow unlimited scale
• Speed and concurrency: query optimization and resource management across multiple nodes
• Features: ML algorithm parallelization, moving windows, geospatial, time series, fast data prep, ...
• Open architecture: integration with many other applications (BI, ETL, Kafka, Spark, data science labs, …)
• Broad utility: many capabilities; one of the most broadly useful programming languages.
• Flexibility: many right ways to do things, a lot of freedom, works on many platforms.
• Ease of use: a high level of abstraction makes Python one of the easiest programming languages.
• Strong community: most data scientists know Python well, and there are many useful packages (pandas, scikit-learn, …).

Parallelization: predicting and scoring on multiple nodes
• Python models get copied to all nodes
• Different portions of data are processed simultaneously
• Result: fast response
[Diagram: Node 1, Node 2, and Node 3, each holding its own portion of the data]
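The copy-the-model, split-the-data pattern can be sketched on a single machine. This is only an illustration of the idea, not how the database implements it: the "nodes" are worker threads, and `LinearModel`, `score_partition`, and `score_in_parallel` are hypothetical names. Because of the global interpreter lock, threads in CPython will not speed up CPU-bound scoring; in an MPP database the portions sit on separate machines.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class LinearModel:
    """Toy model: score = intercept + sum(weight * feature)."""

    def __init__(self, intercept, weights):
        self.intercept = intercept
        self.weights = weights

    def predict(self, row):
        return self.intercept + sum(w * x for w, x in zip(self.weights, row))


def score_partition(model, rows):
    """Score one node's portion of the data with that node's copy of the model."""
    return [model.predict(row) for row in rows]


def score_in_parallel(model, partitions):
    """Copy the model once per partition and score all partitions concurrently."""
    copies = [copy.deepcopy(model) for _ in partitions]
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_node = list(pool.map(score_partition, copies, partitions))
    # Flatten back into one list, preserving partition order.
    return [score for node_scores in per_node for score in node_scores]


model = LinearModel(intercept=1.0, weights=[2.0, 0.5])
partitions = [
    [[1.0, 2.0], [0.0, 4.0]],  # data held by "node 1"
    [[3.0, 0.0]],              # data held by "node 2"
    [[2.0, 2.0], [1.0, 0.0]],  # data held by "node 3"
]
scores = score_in_parallel(model, partitions)
print(scores)
```

Each worker touches only its own rows and its own model copy, so no data moves between partitions; only the small model and the scores travel.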

Built-In Statistical and Quality Functions
Process: Business Understanding, Data Exploration, Data Preparation, Modeling, Evaluation, Deployment
Qualities: speed, ANSI SQL, scalability, parallel data preparation, parallel machine learning algorithms, deploy anywhere
SQL functions: statistical summary, time series, sessionize, pattern matching, date/time algebra, window/partition, date type handling, sequences, and more
SQL functions: outlier detection, normalization, imbalanced data processing, sampling, missing value imputation, and more

Automate Model Training and Validation
Modeling: SVM, random forests, logistic regression, linear regression, ridge regression, Naive Bayes, cross validation, and more
Evaluation: model-level stats, ROC tables, error rate, lift table, confusion matrix, R-squared, MSE
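For readers less familiar with the evaluation statistics named on this slide, here is a plain-Python sketch of four of them. It shows only the definitions; it is not the database's implementation, and the labels and predictions are made-up values.

```python
def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def error_rate(actual, predicted):
    """Fraction of predictions that do not match the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def mse(actual, predicted):
    """Mean squared error for numeric predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up classification results.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
guesses = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(labels, guesses), error_rate(labels, guesses))

# Made-up regression results.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
print(mse(y_true, y_pred), r_squared(y_true, y_pred))
```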

Manage Model Life Cycle
Deployment: in-database scoring for speed, scale, and security

Bring your R, TensorFlow, and Python code inside the database and analyze the data in place.
https://github.com/vertica/vertica-python
https://github.com/vertica/Vertica-ML-Python

• Huge improvements in stability and performance after moving to Vertica
• 24 mins on Spark, 3 mins in Vertica
• Can incorporate other data, like weather, to optimize predictive thermostat efficiency after moving to Vertica ML
• Citing speed of analytics, ease of use when coding in SQL, and improvements in the accuracy of models after moving workloads to Vertica ML
• Solving issues that were previously unsolvable
• Minimal hardware, software, and personnel investments when differentiating with data science

Thank you!
Learn more: academy.vertica.com
Try it free: vertica.com/try
Paige Roberts, Open Source Relations Manager
E: Paige.Roberts@microfocus.com

Advantages of In-Database Machine Learning
• Eliminate overhead of data transfer
• Keep data secure with clear provenance
• Store and manage models and data together
• Serve hundreds of concurrent users
• Use highly scalable, high-performance machine learning functionality
• Avoid maintenance cost of a separate analytical system
• Increase productivity with simple SQL calls instead of coding everything
• Prep data faster
[Diagram: Node 1, Node 2, and Node 3 connected by a network, each with its own schema, tables, and models]

Benefits of In-database Machine Learning
In-database machine learning transforms the way data scientists and analysts interact with data.
• Scale: empower more users within your organization to leverage machine learning and increase data scientist productivity with a simple SQL interface. Democratized predictive analytics applications.
• Speed: you need massively parallel processing power to build and train models at the speed of business. Faster time to market for machine learning projects.
• Accuracy: run machine learning models based on all your historical data, not just a subset of down-sampled data. Deploy predictive use cases and stay ahead.

Simple SQL Execution
Put the power of predictive analytics in the hands of more analysts and database users. With Vertica, users can create, train, and deploy machine learning models using simple SQL calls, at massive scale.
Algorithms: Linear Regression, Logistic Regression, K-Means Clustering, Random Forest, Naive Bayes, Support Vector Machines
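As a sketch of what "simple SQL calls" might look like from Python: the helpers below build a training statement and a scoring statement using Vertica's `LINEAR_REG` and `PREDICT_LINEAR_REG` functions, and an optional runner sends them through the vertica-python client. The table, column, and model names and the connection settings are hypothetical, and the helper functions are illustrative, not part of any library. Names are interpolated directly into the SQL, so they must come from trusted code, not user input.

```python
def train_linear_regression_sql(model, table, response, predictors):
    """Build SQL that trains an in-database linear regression model."""
    predictor_list = ", ".join(predictors)
    return (
        f"SELECT LINEAR_REG('{model}', '{table}', "
        f"'{response}', '{predictor_list}');"
    )


def predict_linear_regression_sql(model, table, predictors):
    """Build SQL that scores rows in place with a previously trained model."""
    cols = ", ".join(predictors)
    return (
        f"SELECT {cols}, "
        f"PREDICT_LINEAR_REG({cols} USING PARAMETERS model_name='{model}') "
        f"AS prediction FROM {table};"
    )


def run_statements(statements, conn_info):
    """Run statements with vertica-python and return the last result set.

    Needs `pip install vertica-python` and a reachable Vertica database.
    """
    import vertica_python  # imported lazily so the SQL builders work without it

    with vertica_python.connect(**conn_info) as connection:
        cursor = connection.cursor()
        rows = []
        for sql in statements:
            cursor.execute(sql)
            rows = cursor.fetchall()
        return rows


# Hypothetical names, for illustration only.
predictors = ["temperature", "vibration"]
train_sql = train_linear_regression_sql(
    "demo_model", "sensor_training", "failure_score", predictors
)
predict_sql = predict_linear_regression_sql("demo_model", "sensor_new", predictors)
print(train_sql)
print(predict_sql)

# To execute against a real cluster (placeholder credentials):
# conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
#              "password": "", "database": "demo"}
# rows = run_statements([train_sql, predict_sql], conn_info)
```

Both statements run inside the database, so only the SQL text and the result rows cross the network.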

An Open Architecture with a Rich Ecosystem
[Diagram labels: Python, SQL, C++, Geospatial, Time Series, Event Series, Real-time, User-Defined Storage, Security, External Tables: Analyze in Place, Machine Learning, Text Analytics, Regression, Pattern Matching, User-Defined Functions, Data Transformation, Streaming, ETL, User-Defined Loads, BI & Visualization, ODBC, JDBC, OLEDB, S3, R, Java]

The Vertica Analytics Platform
• Native High Availability: built-in redundancy that also speeds up queries
• Standard SQL Interface: leverages BI, ETL, Hadoop/MapReduce, and OLTP investments
• Column Orientation: no disk I/O bottleneck; simultaneously load and query
• Machine Learning: in-database machine learning functions for predictive analytics at scale
• Advanced Compression: up to 90% space reduction using 10+ algorithms
• MPP (Massively Parallel Processing): native DB-aware clustering on low-cost x86 Linux nodes
• 10-50x faster than legacy databases
• Scales from TB to PB with industry-standard hardware
• Simple integration with existing ETL and BI solutions
• SQL-99+ compliant
• Ultimate deployment flexibility
• Extended analytics
• In-database machine learning
• 24/7 load & query

Predictive Maintenance Demo
Analyze sensor data from cooling towers across the US, enabling equipment manufacturers to predict and prevent equipment failure.

Flight Tracker Demo
Vertica operates at the “edge” with flight track detail. Sensor data is collected using a Raspberry Pi with a radio receiver and antenna. Data is loaded into Vertica at thousands of records per second and builds to billions of flight data points collected within a 250-mile radius.
https://www.vertica.com/blog/blog-post-series-using-vertica-track-commercial-aircraft-near-real-time/

Moving data science workloads from Spark on Hadoop to in-database
• Improvements in stability and performance
• Creating customer segmentation via clustering algorithms on a 15 million customer dataset took 24 mins on Spark and 3 mins in database
• Concurrently running other algorithms without performance impact
Cardlytics partners with more than 1,500 financial institutions to run their online and mobile banking rewards programs, which gives us a robust view into where and when consumers are spending their money.

Fidelis Cybersecurity protects the world's most sensitive data by identifying and removing attackers no matter where they're hiding on your network and endpoints.
The data science team was experiencing performance challenges while using Spark ML. Moving workloads from Spark ML to in-database ML provided:
• Speed of analytics
• Ease of use when coding in SQL
• Increased accuracy of models

Some Vertica IoT Customer Resources
Case Studies
• Anritsu ROI case study: https://www.vertica.com/wp-content/uploads/2017/01/r24-HPE-Vertica-ROI-case-study-Anritsu.pdf
• Infographic of ROI: https://www.vertica.com/wp-content/uploads/2017/03/Anritsu-v2.pdf
• Nimble Storage ROI case study: https://www.vertica.com/wp-content/uploads/2017/08/Nimble-Storage-ROI.pdf
• Optimal+ case study: https://www.vertica.com/wp-content/uploads/2017/06/Optimal-MF-rebrand-FINAL-lo-res.pdf
• *Climate Corp case study: https://www.vertica.com/wp-content/uploads/2019/01/Climate-Corp_Success-Story-FINAL.pdf
Webcasts: Data Disruptors
• Philips: https://www.brighttalk.com/webcast/10477/277693
• Climate Corp: https://www.brighttalk.com/webcast/8913/336201
• Nimble Storage (HPE InfoBright): https://www.brighttalk.com/webcast/8913/330769
• Zebrium: https://www.brighttalk.com/webcast/8913/332838
• Simpli.fi: https://www.brighttalk.com/webcast/8913/354325/simpli-fi-delivers-advertising-insights-on-billions-of-streaming-bid-messages
Videos
• Optimal+: https://www.youtube.com/watch?v=IZkkoy5ZT1M&feature=youtu.be
• Anritsu: https://www.youtube.com/watch?v=QZ5vWqblVXU&feature=youtu.be

Try Vertica
• 3 easy ways to try Vertica (https://www.vertica.com/try/)
  o Get started in minutes with Vertica by the Hour from AWS Marketplace, Google Cloud, or Microsoft Azure
  o Free Community Edition (for up to 1 TB and a 3-node cluster)
  o Vertica Start-Up Accelerator Program (free 1-year term, 25 TB license)
