Machine Learning with Apache Flink
Till Rohrmann, Flink committer
trohrmann@apache.org, @stsffap
What is Flink?
§ Large-scale data processing engine
§ Easy and powerful APIs for batch and real-time streaming analysis (Java / Scala)
§ Backed by a very robust execution backend
  • with true streaming capabilities,
  • a custom memory manager,
  • native iteration execution,
  • and a cost-based optimizer.
Technology inside Flink
§ Technology inspired by compilers + MPP databases + distributed systems
§ For ease of use, reliable performance, and scalability

    case class Path(from: Long, to: Long)

    val tc = edges.iterate(10) { paths: DataSet[Path] =>
      val next = paths
        .join(edges)
        .where("to")
        .equalTo("from") { (path, edge) =>
          Path(path.from, edge.to)
        }
        .union(paths)
        .distinct()
      next
    }

[Architecture diagram: pre-flight (client) with the cost-based optimizer and type extraction stack; the master handling task scheduling and recovery metadata; workers with the memory manager, out-of-core algorithms, real-time streaming, the data serialization stack, the streaming network stack, ...]
How do you use Flink?
Example: WordCount

    case class Word(word: String, frequency: Int)

    val env = ExecutionEnvironment.getExecutionEnvironment()

    val lines = env.readTextFile(...)

    lines
      .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()

    env.execute()

Flink has mirrored Java and Scala APIs that offer the same functionality, including by-name addressing.
Flink API in a Nutshell
§ map, flatMap, filter, groupBy, reduce, reduceGroup, aggregate, join, coGroup, cross, project, distinct, union, iterate, iterateDelta, ...
§ All Hadoop input formats are supported
§ API similar for data sets and data streams with slightly different operator semantics
§ Window functions for data streams
§ Counters, accumulators, and broadcast variables
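To make a few of these operators concrete, here is a minimal sketch combining filter, join, groupBy, and sum on the Scala DataSet API. The Visit and User records and their fields are made up for illustration; depending on the Flink version, an explicit env.execute() may be needed after print(), as in the WordCount example above.

    import org.apache.flink.api.scala._

    // Hypothetical records, just to exercise a few operators
    case class Visit(userId: Int, url: String, seconds: Int)
    case class User(userId: Int, country: String)

    val env = ExecutionEnvironment.getExecutionEnvironment

    val visits = env.fromElements(Visit(1, "/docs", 30), Visit(2, "/blog", 5), Visit(1, "/blog", 12))
    val users  = env.fromElements(User(1, "SE"), User(2, "DE"))

    visits
      .filter(_.seconds >= 10)                           // keep only longer visits
      .join(users).where("userId").equalTo("userId") {   // enrich each visit with the user's country
        (visit, user) => (user.country, visit.seconds)
      }
      .groupBy(0).sum(1)                                 // total seconds per country
      .print()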
Machine learning with Flink
Does ML work like that?
More realistic scenario!
Machine learning pipelines
§ Pipelining inspired by scikit-learn
§ Transformer: modifies data
§ Learner: trains a model
§ Reusable components
§ Lets you quickly build ML pipelines
§ The model inherits the pipeline of its learner
Linear regression in polynomial space

    val polynomialBase = PolynomialBase()
    val learner = MultipleLinearRegression()

    val pipeline = polynomialBase.chain(learner)

    val trainingDS = env.fromCollection(trainingData)

    val parameters = ParameterMap()
      .add(PolynomialBase.Degree, 3)
      .add(MultipleLinearRegression.Stepsize, 0.002)
      .add(MultipleLinearRegression.Iterations, 100)

    val model = pipeline.fit(trainingDS, parameters)

[Pipeline: Input Data -> Polynomial Base Mapper -> Multiple Linear Regression -> Linear Model]
Current state of Flink-ML
§ Existing learners
  • Multiple linear regression
  • Alternating least squares
  • Communication efficient distributed dual coordinate ascent (PR pending)
§ Feature transformer
  • Polynomial base feature mapper
§ Tooling
Distributed linear algebra
§ Linear algebra is a universal language for data analysis
§ High-level abstraction
§ Fast prototyping
§ Pre- and post-processing step
Example: Gaussian non-negative matrix factorization
§ Given an input matrix V, find W and H such that $V \approx WH$
§ Iterative approximation:

$H_{t+1} = H_t \ast \left( W_t^T V \,/\, W_t^T W_t H_t \right)$
$W_{t+1} = W_t \ast \left( V H_{t+1}^T \,/\, W_t H_{t+1} H_{t+1}^T \right)$

    var i = 0
    var H: CheckpointedDrm[Int] = randomMatrix(k, V.numCols)
    var W: CheckpointedDrm[Int] = randomMatrix(V.numRows, k)

    while (i < maxIterations) {
      H = H * (W.t %*% V / W.t %*% W %*% H)
      W = W * (V %*% H.t / W %*% H %*% H.t)
      i += 1
    }
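The update rule can also be read in a purely local form. The following is a minimal sketch of the same multiplicative updates on Breeze dense matrices (*:* and /:/ denote element-wise product and division); it only illustrates the formula and is not the distributed code shown above.

    import breeze.linalg.DenseMatrix

    // Local illustration of the multiplicative NMF updates (Breeze, not Flink)
    def nmf(v: DenseMatrix[Double], k: Int, maxIterations: Int): (DenseMatrix[Double], DenseMatrix[Double]) = {
      var h = DenseMatrix.rand(k, v.cols)   // H: k x numCols
      var w = DenseMatrix.rand(v.rows, k)   // W: numRows x k
      for (_ <- 0 until maxIterations) {
        // H <- H * (W^T V / W^T W H), element-wise * and /
        h = h *:* ((w.t * v) /:/ (w.t * w * h))
        // W <- W * (V H^T / W H H^T)
        w = w *:* ((v * h.t) /:/ (w * h * h.t))
      }
      (w, h)
    }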
Why is Flink a good fit for ML?
Flink's features
§ Stateful iterations
  • Keep state across iterations
§ Delta iterations
  • Limit computation to elements which matter
§ Pipelining
  • Avoiding materialization of large intermediate state
CoCoA

$\min_{w \in \mathbb{R}^d} P(w) := \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_i\left( w^T x_i \right)$
Bulk Iterations

[Diagram: in each round, the step function consumes the partial solution (and possibly other data sets X, Y) and produces a new partial solution that replaces the old one; the loop starts from an initial solution and ends with the iteration result.]
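As a concrete sketch of the bulk iteration API, the Monte Carlo pi estimation below (in the spirit of the example from Flink's iteration documentation) applies the step function to the partial solution a fixed number of times:

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment

    // Partial solution: the number of random samples that fell inside the unit circle
    val initial = env.fromElements(0)

    val count = initial.iterate(10000) { iterationInput =>
      iterationInput.map { i =>
        val x = Math.random()
        val y = Math.random()
        i + (if (x * x + y * y < 1) 1 else 0)
      }
    }

    val pi = count.map(c => c / 10000.0 * 4)
    pi.print()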
Delta iterations

[Diagram: the step function consumes the partial solution and the current workset (plus other data sets X, Y); it emits a delta set, which is merged into the partial solution, and a new workset for the next round. The loop starts from an initial solution and an initial workset and ends with the iteration result.]
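A hedged sketch of the delta iteration API, in the style of connected components: only vertices whose component ID changed remain in the workset, so later rounds touch fewer and fewer elements (see the chart below). The toy vertex and edge data sets are assumptions for illustration.

    import org.apache.flink.api.scala._
    import org.apache.flink.util.Collector

    val env = ExecutionEnvironment.getExecutionEnvironment

    // (vertexId, componentId) pairs and undirected edges, made up for illustration
    val vertices: DataSet[(Long, Long)] = env.fromElements((1L, 1L), (2L, 2L), (3L, 3L), (4L, 4L))
    val edges: DataSet[(Long, Long)] = env.fromElements((1L, 2L), (2L, 3L), (3L, 4L))
      .flatMap(e => Seq(e, (e._2, e._1)))

    val components = vertices.iterateDelta(vertices, 100, Array(0)) { (solution, workset) =>
      // send the component ID of every changed vertex to its neighbors
      val candidates = workset.join(edges).where(0).equalTo(0) { (v, e) => (e._2, v._2) }
      val minCandidates = candidates.groupBy(0).min(1)
      // keep only vertices whose component ID shrank; they form the delta set and the next workset
      val deltas = minCandidates.join(solution).where(0).equalTo(0) {
        (candidate, current, out: Collector[(Long, Long)]) =>
          if (candidate._2 < current._2) out.collect(candidate)
      }
      (deltas, deltas)
    }

    components.print()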
Effect of delta iterations

[Chart: number of elements updated per iteration over 61 iterations; the y-axis ranges up to roughly 45 million elements.]
Iteration performance

[Chart: runtime in minutes (0 to 60) of 30 and 61 PageRank iterations on a Twitter follower graph, comparing Hadoop MapReduce with Flink bulk iterations and Flink delta iterations.]
How to factorize really large matrices?
Collaborative Filtering
§ Recommend items based on users with similar preferences
§ Latent factor models capture underlying characteristics of items and preferences of users
§ Predicted preference: $\hat{r}_{u,i} = x_u^T y_i$
Matrix factorization

$R \approx X^T Y$

$\min_{X,Y} \sum_{r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)$

[Diagram: the rating matrix R factorized into user factors X and item factors Y.]
Alternating least squares
§ Fixing one matrix gives a quadratic form
§ The solution is guaranteed to decrease the overall cost function
§ To calculate $x_u$, all of user u's rated item vectors and ratings are needed:

$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T, \qquad S^u_{ii} = \begin{cases} 1 & \text{if } r_{u,i} \neq 0 \\ 0 & \text{else} \end{cases}$
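To make the user update concrete, here is a minimal local sketch with Breeze. The function name and arguments are hypothetical; it receives only the item vectors the user has rated plus the corresponding ratings, builds the regularized Gram matrix, and solves the small k x k system.

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Local sketch of one ALS user update: x_u = (Y S^u Y^T + lambda * n_u * I)^(-1) Y r_u^T
    def updateUser(itemVectors: Seq[DenseVector[Double]],
                   ratings: Seq[Double],
                   lambda: Double): DenseVector[Double] = {
      val k  = itemVectors.head.length
      val nu = itemVectors.size

      // Y S^u Y^T as a sum of outer products over the rated items, plus regularization
      val gram = itemVectors.map(y => y * y.t).reduce(_ + _) +
        (DenseMatrix.eye[Double](k) * (lambda * nu))

      // Y r_u^T as a rating-weighted sum of the rated item vectors
      val rhs = itemVectors.zip(ratings).map { case (y, r) => y * r }.reduce(_ + _)

      gram \ rhs // solve the k x k linear system
    }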
Data partitioning
Naïve ALS

    case class Rating(userID: Int, itemID: Int, rating: Double)
    case class ColumnVector(columnIndex: Int, vector: Array[Double])

    val items: DataSet[ColumnVector] = _
    val ratings: DataSet[Rating] = _

    // Generate tuples of items with their ratings
    val uVA = items.join(ratings).where(0).equalTo(1) {
      (item, ratingEntry) => {
        val Rating(uID, _, rating) = ratingEntry
        (uID, rating, item.vector)
      }
    }
Naïve ALS contd.

    uVA.groupBy(0).reduceGroup {
      vectors => {
        var uID = -1
        val matrix = FloatMatrix.zeros(factors, factors)
        val vector = FloatMatrix.zeros(factors)
        var n = 0

        for ((id, rating, v) <- vectors) {
          uID = id
          vector += rating * v
          matrix += outerProduct(v, v)
          n += 1
        }

        for (idx <- 0 until factors) {
          matrix(idx, idx) += lambda * n
        }

        new ColumnVector(uID, Solve(matrix, vector))
      }
    }
Problems of naïve ALS
§ Problem:
  • Item vectors are sent redundantly -> high network load
§ Solution:
  • Blocking of user and item vectors to share common data
  • Avoids blown-up intermediate state
Data partitioning
Performance comparison
• 40-node GCE cluster, highmem-8 instances
• 10 ALS iterations with 50 latent factors

[Chart: runtime in minutes (0 to 900) vs. number of non-zero entries in billions (0 to 30) for Blocked ALS, Blocked ALS on highmem-16, and Naive ALS, with annotated runtimes of 1h, 2.5h, 5.5h, and 14h.]
Streaming machine learning
Why is streaming ML important?
§ Spam detection in emails
§ Patterns might change over time
§ Retraining of the model becomes necessary
§ Best solution: online models
Applications §  Spam detection §  Recommendation §  News feed personalization §  Credit card fraud detection 34
Apache SAMOA
§ Scalable Advanced Massive Online Analysis
§ Distributed streaming machine learning framework
§ Incubating at the Apache Software Foundation
§ Runs on multiple stream processing engines (S4, Storm, Samza)
§ Support for Flink is a pending pull request
Supported algorithms
§ Classification: Vertical Hoeffding Tree
§ Clustering: CluStream
§ Regression: Adaptive Model Rules
§ Frequent pattern mining: PARMA
Closing
Flink-ML Outlook
§ Support more algorithms
§ Support for distributed linear algebra
§ Integration with streaming machine learning
§ Interactive programs and Zeppelin
flink.apache.org @ApacheFlink
