Continuous Machine and Deep Learning with Apache Ignite

Continuous Machine and Deep Learning at Scale With Apache Ignite
Denis Magda
Apache Ignite Committer & PMC Chair
@denismagda

2019 © GridGain Systems @denismagda @ApacheIgnite

Agenda
• Why Machine Learning at Scale?
• Ignite Machine Learning Intro
• TensorFlow Integration
• Ignite Machine Learning Internals
• Q&A

5-Minute Guide to Ignite: Overview and Why Support ML?

Why Machine Learning at Scale?
• Scalability
  – Data exceeds the capacity of a single server
  – Burden for developers and the business
• Models trained and deployed in different systems
  – Move data out for training
  – Wait for training to complete
  – Redeploy models in production

Continuous Learning Approach Without ETL
Before:
• App: storing and processing the working set
• Periodic ETL of terabytes of data
• Loading data for training
• Model training & testing
• Periodic update of models
After (with CL):
• App: storing and processing the working set
• ML/DL engine: model training & testing, no ETL
• Instant updates of models

Apache Ignite Overview
• Application layer: Web, SaaS, Social, Mobile, IoT
• Ignite components: Key-Value, SQL, Transactions, Messaging, Streaming, Events, Machine and Deep Learning, Compute Grid, Service Grid
• In-Memory Data Store
• Ignite Persistence
• Persistent layer: RDBMS, NoSQL, Hadoop, Mainframe

Ignite Deployment Modes
Enhance Legacy Architecture - IMDG:
• Application layer: Web-Scale Apps, IoT, Social Media, Mobile Apps
• Ignite In-Memory Storage
• External database: RDBMS, NoSQL, Hadoop
Simplified Modern Architecture - IMDB:
• Application layer: Web-Scale Apps, IoT, Social Media, Mobile Apps
• Ignite In-Memory Storage
• Ignite Persistence

Ignite Machine Learning: Slightly More Details

Ignite Machine and Deep Learning
• Distributed algorithms
• Large-scale parallelization
• Multi-language support
• No ETL
Components shown in the diagram:
• Clients: C++, .NET, Java, Python over the Binary Protocol (thin client)
• Compute and Service Grid
• Ignite Machine and Deep Learning: Regressions, K-Means, Decision Trees, TensorFlow
• Distributed Machine Learning Datasets (distributed dataset based on partitioned caches)
• In-Memory Data Store
• Ignite Persistence

Distributed Classification
• Logistic Regression
• SVM, KNN, ANN
• Decision trees
• Random Forest
• Naive Bayes

Distributed Regression
• KNN Regression
• Linear Regression
• Decision tree regression
• Random forest regression
• Gradient-boosted tree regression

Distributed Clustering
• K-means
• GMM

Multilayer Perceptron Neural Network

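Ignite ML API Usage

    // Load the Titanic passenger records into an Ignite cache.
    IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
    // Use coordinates 0, 5 and 6 as features and coordinate 1 as the label.
    Vectorizer vectorizer = new SampleVectorizer(0, 5, 6).labeled(1);
    // Decision tree trainer: max depth 5, minimum impurity decrease 0.
    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    // Train on the cache contents in place.
    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
    // Score the trained model on the same cache using accuracy.
    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());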

Machine Learning Pipelines

Pipelining with Apache Ignite

    IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

    // Extracts "pclass", "sibsp", "parch", "sex", "embarked", "age", "fare".
    Vectorizer<Integer, Vector, Integer, Double> vectorizer =
        new DummyVectorizer<Integer>(0, 3, 4, 5, 6, 8, 10).labeled(1);

    PipelineMdl<Integer, Vector> mdl = new Pipeline<Integer, Vector, Integer, Double>()
        .addVectorizer(vectorizer)
        .addPreprocessingTrainer(new EncoderTrainer<Integer, Vector>()
            .withEncoderType(EncoderType.STRING_ENCODER)
            .withEncodedFeature(1)
            .withEncodedFeature(6))
        .addPreprocessingTrainer(new ImputerTrainer<Integer, Vector>())
        .addPreprocessingTrainer(new MinMaxScalerTrainer<Integer, Vector>())
        .addPreprocessingTrainer(new NormalizationTrainer<Integer, Vector>()
            .withP(1))
        .addTrainer(new DecisionTreeClassificationTrainer(5, 0))
        .fit(ignite, dataCache);
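To make the scaling and normalization stages of the pipeline concrete, here is a minimal plain-Java sketch of the arithmetic such stages are generally understood to perform: min-max scaling of each feature column to [0, 1], then p = 1 normalization of each row. The class and method names are hypothetical and this is not Ignite's implementation; the handling of constant columns and all-zero rows is an assumption made for the sketch.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of the Ignite API. */
final class PreprocessingSketch {
    /** Rescales every column to [0, 1]; a constant column maps to 0 (assumption). */
    static double[][] minMaxScale(double[][] rows) {
        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        // First pass: per-column minimum and maximum.
        for (double[] row : rows) {
            for (int c = 0; c < cols; c++) {
                min[c] = Math.min(min[c], row[c]);
                max[c] = Math.max(max[c], row[c]);
            }
        }
        // Second pass: (value - min) / (max - min).
        double[][] out = new double[rows.length][cols];
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < cols; c++) {
                double range = max[c] - min[c];
                out[r][c] = range == 0 ? 0.0 : (rows[r][c] - min[c]) / range;
            }
        }
        return out;
    }

    /** Divides a row by its L1 norm (p = 1); an all-zero row is returned unchanged (assumption). */
    static double[] normalizeL1(double[] row) {
        double norm = 0.0;
        for (double v : row) norm += Math.abs(v);
        double[] out = new double[row.length];
        for (int i = 0; i < row.length; i++) out[i] = norm == 0 ? row[i] : row[i] / norm;
        return out;
    }

    public static void main(String[] args) {
        double[][] scaled = minMaxScale(new double[][] {{0, 10}, {5, 20}, {10, 30}});
        for (double[] row : scaled) System.out.println(Arrays.toString(normalizeL1(row)));
    }
}
```

In the pipeline on the slide these steps are fitted as trainers, so the column statistics come from the training cache rather than from a local array.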

Continuous Learning With Apache Ignite

    SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
    // Initial training on the first cache.
    SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
    // Update the existing model with new data instead of retraining from scratch.
    SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
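The fit-then-update pattern can be illustrated without a cluster. The sketch below is a hypothetical stand-in, not Ignite code: a tiny linear classifier trained by SGD on hinge loss, where update starts from the previous model's weights rather than from zero. The learning rate, regularization constant, and epoch count are arbitrary assumptions.

```java
import java.util.Arrays;

/** Hypothetical illustration of fit/update semantics; not the Ignite SVM trainer. */
final class IncrementalSvmSketch {
    /** Immutable model state: weights plus a bias term. */
    static final class Model {
        final double[] w;
        final double b;

        Model(double[] w, double b) {
            this.w = w;
            this.b = b;
        }

        /** Returns +1 or -1 depending on the side of the hyperplane. */
        int predict(double[] x) {
            double s = b;
            for (int i = 0; i < w.length; i++) s += w[i] * x[i];
            return s >= 0 ? 1 : -1;
        }
    }

    /** Trains from scratch: an update that starts from all-zero weights. */
    static Model fit(double[][] xs, int[] ys) {
        return update(new Model(new double[xs[0].length], 0.0), xs, ys);
    }

    /** Continues training from prev on new data; prev itself is left untouched. */
    static Model update(Model prev, double[][] xs, int[] ys) {
        double[] w = Arrays.copyOf(prev.w, prev.w.length);
        double b = prev.b;
        double lr = 0.1;       // learning rate (assumption)
        double lambda = 0.01;  // L2 regularization strength (assumption)
        for (int epoch = 0; epoch < 100; epoch++) {
            for (int n = 0; n < xs.length; n++) {
                double margin = b;
                for (int i = 0; i < w.length; i++) margin += w[i] * xs[n][i];
                margin *= ys[n];
                boolean violated = margin < 1;
                // Hinge-loss subgradient step: weight decay always, data term only on violation.
                for (int i = 0; i < w.length; i++) {
                    double grad = lambda * w[i] - (violated ? ys[n] * xs[n][i] : 0.0);
                    w[i] -= lr * grad;
                }
                if (violated) b += lr * ys[n];
            }
        }
        return new Model(w, b);
    }

    public static void main(String[] args) {
        Model mdl1 = fit(new double[][] {{2, 0}, {-2, 0}}, new int[] {1, -1});
        Model mdl2 = update(mdl1, new double[][] {{0, 2}, {0, -2}}, new int[] {1, -1});
        System.out.println(mdl2.predict(new double[] {3, 0}) + " " + mdl2.predict(new double[] {0, -3}));
    }
}
```

The second batch teaches the model about a feature the first batch never exercised, while what was learned from the first batch is carried over in the starting weights.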

Demo: Payments Fraud Detection

Ignite and TensorFlow

TensorFlow Integration: Benefits
• Ignite as a distributed data source
  – Perfect fit for distributed TF training
• Less ETL
  – TF nodes deployed together with Ignite nodes
  – In-machine data movement only

TensorFlow Integration: Main Features
• Distribution of user tasks written in Python
• Automatic creation and maintenance of the TF cluster
• Minimization of ETL costs
• Fault tolerance for both Ignite and TF instances

    >>> import tensorflow as tf
    >>> from tensorflow.contrib.ignite import IgniteDataset
    >>>
    >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
    >>> iterator = dataset.make_one_shot_iterator()
    >>> next_obj = iterator.get_next()
    >>>
    >>> with tf.Session() as sess:
    ...     for _ in range(3):
    ...         print(sess.run(next_obj))
    ...
    {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
    {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
    {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

Ignite Machine Learning: Internals

Ignite Memory-Centric Storage
Distributed In-Memory Data Store:
• Off-heap removes noticeable GC pauses
• Automatic defragmentation
• Predictable memory consumption
Distributed Persistent Store:
• Stores superset of data
• Fully transactional WAL (Write Ahead Log)
• Instantaneous restarts
Ignite Cluster: three server nodes, each with an In-Memory Data Store and a Persistent Store

Record to Node Mapping
Key → Partition → Server Node (ON-DISK)

Caches and Partitions
Cache:
• Partition 1: (K1, V1), (K2, V2), (K3, V3), (K4, V4)
• Partition 2: (K5, V5), (K6, V6), (K7, V7), (K8, V8), (K9, V9)

Partitions Distribution
[Diagram: partitions 0 to 3 spread across Node 1 to Node 4, each partition shown as a primary copy and a backup copy]
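The key → partition → node chain from the last three slides can be sketched in a few lines. This is a deliberately simplified illustration with hypothetical names: it uses hash-modulo for the partition and round-robin for node assignment, which is an assumption for readability and not how Ignite's own affinity function assigns partitions.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; Ignite's real affinity function is more involved. */
final class PartitionMappingSketch {
    /** Key -> partition: non-negative hash modulo the partition count. */
    static int partitionFor(Object key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Partition -> primary node: simple round-robin assignment (assumption). */
    static String primaryFor(int partition, List<String> nodes) {
        return nodes.get(partition % nodes.size());
    }

    /** Partition -> backup node: the next node in the list, so it differs from the primary with 2+ nodes. */
    static String backupFor(int partition, List<String> nodes) {
        return nodes.get((partition + 1) % nodes.size());
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("Node 1", "Node 2", "Node 3", "Node 4");
        for (int key = 1; key <= 9; key++) {
            int p = partitionFor(key, 4);
            System.out.println("K" + key + " -> partition " + p
                + " -> primary " + primaryFor(p, nodes) + ", backup " + backupFor(p, nodes));
        }
    }
}
```

The point of the indirection is that keys never map to nodes directly: when the cluster changes, only the partition-to-node assignment has to move.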

Partition-Based Dataset
• Node 1: P1, C, D → Training
• Node 2: P2, C, D → Training
• Client: REDUCE
• Initial solution
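The train-locally-then-reduce flow on this slide can be sketched with a model whose training decomposes into per-partition sums. The example below is an assumption chosen for clarity (one-feature least-squares regression, hypothetical class and method names), not the algorithm or API Ignite uses: each partition produces partial sums next to its data, and the client merges them and solves for the line.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of per-partition training plus a client-side reduce. */
final class PartitionReduceSketch {
    /** Runs next to the data: partial sums {n, sum x, sum y, sum xy, sum x^2} for one partition of (x, y) rows. */
    static double[] localStats(double[][] partition) {
        double[] s = new double[5];
        for (double[] row : partition) {
            double x = row[0];
            double y = row[1];
            s[0] += 1;
            s[1] += x;
            s[2] += y;
            s[3] += x * y;
            s[4] += x * x;
        }
        return s;
    }

    /** Runs on the client: adds up the partial sums and solves least squares; returns {slope, intercept}. */
    static double[] reduce(List<double[]> partials) {
        double[] t = new double[5];
        for (double[] p : partials) {
            for (int i = 0; i < t.length; i++) t[i] += p[i];
        }
        double slope = (t[0] * t[3] - t[1] * t[2]) / (t[0] * t[4] - t[1] * t[1]);
        double intercept = (t[2] - slope * t[1]) / t[0];
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        double[][] p1 = {{0, 1}, {1, 3}};  // rows held by Node 1
        double[][] p2 = {{2, 5}, {3, 7}};  // rows held by Node 2
        double[] model = reduce(Arrays.asList(localStats(p1), localStats(p2)));
        System.out.println("slope=" + model[0] + ", intercept=" + model[1]);
    }
}
```

Only five numbers per partition cross the network here, regardless of how many rows each node holds; that is the sense in which no ETL is needed.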

Training Failover
[Diagram: Node 1 and Node 3; one holds P, C, D and the other holds P, C, D*]
Legend: P = Partition; C = Partition Context; D = Partition Data; D* = Local ETL
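One way to read the legend: the partition (P) survives a node failure, while the partition data (D) is a local, derived structure that can be rebuilt from P by a local ETL step (D*). The sketch below illustrates that idea only; the class, its fields, and the rebuild counter are hypothetical, and the ETL function stands in for whatever transformation a trainer would apply.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration: derived partition data is rebuilt on demand after a node is lost. */
final class LocalEtlFailoverSketch {
    /** D: derived, node-local training data per partition; not replicated. */
    private final Map<Integer, double[]> derived = new HashMap<>();
    /** Local ETL: rebuilds D for a partition from the (still available) partition P. */
    private final Function<Integer, double[]> localEtl;
    /** Counts how often the ETL step had to run. */
    int rebuilds = 0;

    LocalEtlFailoverSketch(Function<Integer, double[]> localEtl) {
        this.localEtl = localEtl;
    }

    /** Returns D for the partition, running the local ETL only if D is missing. */
    double[] data(int partition) {
        return derived.computeIfAbsent(partition, p -> {
            rebuilds++;
            return localEtl.apply(p);
        });
    }

    /** Simulates losing the node that held D for this partition. */
    void simulateNodeLoss(int partition) {
        derived.remove(partition);
    }

    public static void main(String[] args) {
        LocalEtlFailoverSketch ds =
            new LocalEtlFailoverSketch(p -> new double[] {p.doubleValue(), p * 2.0});
        ds.data(1);              // first access builds D
        ds.simulateNodeLoss(1);  // D is gone, P is assumed to survive via its backup
        System.out.println(Arrays.toString(ds.data(1)) + " rebuilds=" + ds.rebuilds);
    }
}
```

Training can therefore resume on another node without a checkpoint of D, at the cost of re-running the ETL for the affected partitions.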

Full Python Support and Model Importing
• Model importing from Spark, XGBoost, etc.
• Full Python support
  – https://github.com/gridgain/ml-python-api

Apache Ignite Benefits for ML Use Cases
• Massive scalability
  – Horizontal + Vertical
  – RAM + Disk
• Minimal ETL
  – Train models and run algorithms in place
• Fault tolerance and continuous learning
  – Partition-based dataset

Resources
• Documentation:
  – https://apacheignite.readme.io/docs
• Examples and Tutorials:
  – https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/ml
• Details on TensorFlow:
  – https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb

Apache Ignite – We’re Hiring ;)
• Rapidly growing community
• Great way to learn distributed storage, computing, SQL, ML, transactions
• How to contribute:
  – https://ignite.apache.org/

Apache Ignite Is a Top 5 Apache Project
• A Top 5 Apache Software Foundation project
• Over 2M downloads per year and 4M total downloads
• Top 5 Dev Mailing Lists and Top 5 User Mailing Lists
• Source: Apache Software Foundation blog post “Apache in 2018 – By The Digits” (from January 1, 2019)
[Chart: Monthly Ignite/GridGain Downloads, Apr-14 to Dec-18; y-axis 0 to 200,000]

Apache Ignite Users
• Financial Services
• FinTech
• Software/Cloud
• Telecom & Mobile
• IoT
• AdTech / Media / Entertainment
• Logistics & Transportation
• eCommerce & Retail
• Pharma & Healthcare
