© 2018 GridGain Systems, Inc. In-Memory Performance Durability of Disk
Machine Learning and Deep Learning with Apache Ignite
Akmal Chaudhri, GridGain Systems

Agenda
• Apache Ignite Overview
• Data Science Toolkit
• Memory-Centric Storage
• Compute Grid
• Machine Learning
• Genetic Algorithms
• Summary
• Q&A

Apache Ignite: Database, Caching and Processing Platform
• IoT, Financial Services, Pharma & Healthcare, E-Commerce, Travel & Logistics, Telco
• SQL, Transactions, Compute, Services, ML, Streaming, Key/Value
• Memory-Centric Storage
• Ignite Native Persistence (Flash, SSD, Intel 3D XPoint)
• Third-Party Persistence (RDBMS, HDFS, NoSQL)

Machine Learning Business Case
1. Models trained and deployed in different systems
  • Move data out for training
  • Wait for training to complete
  • Redeploy models in production
2. Scalability
  • Data exceed capacity of single server
  • Need complex solutions
  • Burden for developers

Memory-Centric Storage
• Off-heap
• Removes noticeable GC pauses
• Automatic Defragmentation
• Stores Superset of Data
• Predictable memory consumption
• Fully Transactional (Write-Ahead Log)
• Instantaneous Restarts
[Diagram: three server nodes, each with durable memory]

Compute Grid
• Automatic Failover
• Load Balancing
• Zero Deployment
[Diagram: an Ignite cluster with durable memory on each node; labels C1, R1, C2, R2]
• C = Task (Compute); C1, C2 = Jobs; C = C1 + C2
• R = Result; R = R1 + R2
• Result in T/2 time

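The split of a task C into jobs C1 and C2, with the result R assembled from R1 and R2, can be sketched in plain Python. This is only an illustration of the idea, not the Ignite compute API: local threads stand in for cluster nodes, and `count_words` and `run_task` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """One job (C1 or C2): count the words in its share of the input."""
    return sum(len(line.split()) for line in chunk)

def run_task(lines, n_jobs=2):
    """Split task C into jobs, run them in parallel, and reduce the results."""
    size = max(1, -(-len(lines) // n_jobs))  # ceiling division, at least 1
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Threads are an assumption standing in for the nodes of a cluster.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partial_results = list(pool.map(count_words, chunks))  # R1, R2, ...
    return sum(partial_results)  # R = R1 + R2

lines = [
    "in memory performance",
    "durability of disk",
    "compute grid",
    "zero deployment",
]
total = run_task(lines)
```

With two jobs of equal size, each job handles half of the input, which is the intuition behind the "result in T/2 time" label on the slide.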
Machine Learning
• Distributed Algorithms: K-Means, Regressions, Decision Trees, Random Forest
• Dense and Sparse Algebra
• Large Scale Parallelization
• Multi-Language Support: R, C++, Python, Java, Scala, REST
• No ETL
[Diagram: Distributed Core Algebra running across three server nodes, each with durable memory]

Partition-Based Dataset
• Abstraction layer on top of Ignite storage and computation
• MapReduce using Compute Grid
• Partition data
  • Can be recovered from another node
• Partition context
  • ML algorithms are iterative and require context

Partition-Based Dataset
[Diagram: Ignite Node 1 holds partition P1 and Ignite Node 2 holds partition P2, each with context C and data D. Labels: Training (on each node), REDUCE, Client, Initial solution]

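The map and reduce steps over partitions that each carry data (D) and context (C) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Ignite dataset API: the `Partition` class, `local_stats`, and `merge` are hypothetical names, and the "training" is reduced to computing a mean.

```python
from dataclasses import dataclass, field
from functools import reduce

@dataclass
class Partition:
    data: list                                   # D: rows held by this partition
    context: dict = field(default_factory=dict)  # C: state kept between iterations

def local_stats(part):
    """Map step: runs next to the partition's data and returns a small summary."""
    part.context["calls"] = part.context.get("calls", 0) + 1  # update context
    return (sum(part.data), len(part.data))

def merge(a, b):
    """Reduce step: combine two partition summaries into one."""
    return (a[0] + b[0], a[1] + b[1])

# Two partitions, as on the slide (P1 on node 1, P2 on node 2).
partitions = [Partition([1.0, 2.0, 3.0]), Partition([4.0, 5.0])]
total, count = reduce(merge, map(local_stats, partitions))
mean = total / count
```

Only the small per-partition summaries travel to the reducer; the rows themselves stay where they are stored.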
Iterative Optimization

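As a rough sketch of iterative optimization over partitions, the loop below fits y ≈ w·x by gradient descent: each iteration maps over the partitions to get partial gradients, reduces them, and updates the current solution. The model, learning rate, and the names `partial_gradient` and `train` are assumptions made for this example; the slide itself does not specify an algorithm.

```python
def partial_gradient(part, w):
    """Map step: squared-error gradient for y ~ w * x on one partition."""
    grad = sum(2 * (w * x - y) * x for x, y in part)
    return grad, len(part)

def train(partitions, lr=0.01, iterations=200):
    """Repeat map + reduce, refining the solution on every pass."""
    w = 0.0  # initial solution
    for _ in range(iterations):
        results = [partial_gradient(p, w) for p in partitions]  # map
        grad = sum(g for g, _ in results)                       # reduce
        n = sum(c for _, c in results)
        w -= lr * grad / n                                      # update
    return w

# Toy data generated from y = 2 * x, split across two partitions.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = train(partitions)
```

On this toy data the weight converges towards 2, the slope used to generate it.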
Recovery after Node Failure
[Diagram: Ignite Node 1 and Ignite Node 2, showing a partition as P, C, D and as P, C, D*]
• P = Partition
• C = Partition Context
• D = Partition Data
• D* = Local ETL

Algorithms and Applicability
Classification
• Description: Identify to which category a new observation belongs, on the basis of a training set of data
• Applicability: spam detection, image recognition, credit scoring, disease identification
• Algorithms: nearest neighbor, decision tree classification, neural network
Regression
• Description: Modeling the relationship between a scalar dependent variable y and one or more explanatory variables x
• Applicability: drug response, stock prices, supermarket revenue
• Algorithms: linear regression, decision tree regression, nearest neighbor, neural network

Algorithms and Applicability
Clustering
• Description: Grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups
• Applicability: customer segmentation, grouping experiment outcomes, grouping shopping items
• Algorithms: k-means
Preprocessing
• Description: Feature extraction and normalization
• Applicability: transform input data, such as text, for use with machine learning algorithms
• Algorithms: Normalization preprocessor

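To make the two columns concrete, here is a small single-machine sketch that normalizes features and then clusters them with k-means. It is not the Ignite ML implementation: min-max scaling is an assumed choice of normalization, the initial centers are simply the first k points, and the names `normalize`, `dist2`, and `kmeans` are hypothetical.

```python
def normalize(rows):
    """Preprocessing: min-max scale every feature column to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Clustering: alternate between assigning points and moving centers."""
    centers = list(points[:k])  # assumption: first k points as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Two features on very different scales; normalization makes them comparable.
rows = [(1, 100), (2, 110), (9, 900), (10, 1000)]
points = normalize(rows)
centers, clusters = kmeans(points, k=2)
```

Without the normalization step, the second feature would dominate the distance calculation because of its larger scale.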
Demo

Genetic Algorithms
• Find good solutions to complex problems
• Simulate biological evolution
• Population
  • Consists of chromosomes
• Chromosome
  • Possible solution
  • Consists of genes
• Genes
  • Combined to derive new chromosomes

Genetic Algorithms
• Fitness Calculation
  • Chromosomes contain a fitness score, which is used to compare different solutions
• Crossover
  • The process of combining genes to produce new chromosomes
• Mutation
  • Some genes within chromosomes are updated to produce new characteristics

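The population, chromosome, and gene terms, together with the fitness, crossover, and mutation steps, can be sketched on a toy problem. Everything specific here is an assumption for illustration and not the Ignite genetic algorithm API: genes are bits, fitness is the number of 1 bits, selection keeps the fitter half, and the best chromosome is carried over unchanged (elitism).

```python
import random

GENES = 16  # chromosome length (hypothetical toy problem size)

def fitness(chromosome):
    """Fitness score: the number of genes set to 1."""
    return sum(chromosome)

def crossover(a, b, rng):
    """Single-point crossover: combine the genes of two parents."""
    point = rng.randrange(1, GENES)
    return a[:point] + b[point:]

def mutate(chromosome, rng, rate=0.05):
    """Flip each gene with a small probability."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def evolve(pop_size=20, generations=40, seed=0):
    rng = random.Random(seed)
    # Population: a list of chromosomes, each a list of genes.
    population = [[rng.randint(0, 1) for _ in range(GENES)] for _ in range(pop_size)]
    start_best = max(map(fitness, population))
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half as parents
        children = [population[0]]             # elitism: best chromosome survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=fitness), start_best

best, start_best = evolve()
```

Because the best chromosome is always carried into the next generation, the best fitness in the population never decreases from one generation to the next.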
Genetic Algorithms
• Collocated Computation
• Biological Evolution Simulation
• Chromosome and Genes Cluster
[Diagram: an Ignite cluster with durable memory on each node; labels F1, C1, M1 and F2, C2, M2]
• F = Fitness Calculation; F = F1 + F2
• C = Crossover; C = C1 + C2
• M = Mutation; M = M1 + M2

Genetic Algorithms

Demo

Apache Ignite Benefits
• Distributed Machine Learning and Deep Learning when data do not fit within a single server unit
• Zero-ETL
  • Train models and run algorithms in place
• Massive scalability
  • Horizontal + Vertical
  • RAM + Disk
• Fault tolerance and continuous learning
  • Partition-based dataset

Among Top 5 Apache Projects
• Top 5 by Commits: 1. Hadoop 2. Ambari 3. Camel 4. Ignite 5. Beam
• Top 5 Developer Mailing Lists: 1. Ignite 2. Kafka 3. Tomcat 4. Beam 5. James
• Top 5 User Mailing Lists: 1. Lucene/Solr 2. Ignite 3. Flink 4. Kafka 5. Cassandra
• Over 1M downloads per year

Any Questions?
Thank you for joining us. Follow the conversation.
http://ignite.apache.org
#apacheignite

Machine learning and deep learning with Apache Ignite
