Building Machine Learning Algorithms on Apache Spark with William Benton

Building machine learning algorithms on Apache Spark William Benton (@willb) Red Hat, Inc. Session hashtag: #EUds5

#EUds5 Forecast Introducing our case study: self-organizing maps Parallel implementations for partitioned collections (in particular, RDDs) Beyond the RDD: data frames and ML pipelines Practical considerations and key takeaways 6

#EUds5 Introducing self-organizing maps

#EUds5 Training self-organizing maps 9

#EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt

#EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order

#EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected

#EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected the learning rate controls how much closer to the example each unit gets

#EUds5 Parallel implementations for partitioned collections

#EUds5 Historical aside: Amdahl’s Law 14 1 1 - p lim So =sp —> ∞

#EUds5 What forces serial execution? 15

#EUds5 What forces serial execution? 16

#EUds5 What forces serial execution? 17 state[t+1] = combine(state[t], x)

#EUds5 What forces serial execution? 18 state[t+1] = combine(state[t], x)

#EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T

#EUds5 How can we fix these? 20 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

#EUds5 How can we fix these? 24 L-BGFSSGD a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

#EUds5 How can we fix these? 25 SGD L-BGFS a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) There will be examples of each of these approaches for many problems in the literature and in open-source code!

#EUds5 Implementing atop RDDs We’ll start with a batch implementation of our technique: 26 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)

#EUds5 Implementing atop RDDs 27 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models

#EUds5 Implementing atop RDDs 28 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models partition

#EUds5 Implementing atop RDDs 29 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) This won’t always work!

#EUds5 An implementation template 30 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) }

#EUds5 An implementation template 31 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } “fold”: update the state for this partition with a single new example

#EUds5 An implementation template 32 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } “reduce”: combine the states from two partitions

#EUds5 An implementation template 33 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } this will cause the model object to be serialized with the closure nextModel

#EUds5 An implementation template 34 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } broadcast the current working model for this iterationvar nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } get the value of the broadcast variable

#EUds5 An implementation template 35 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } remove the stale broadcasted model var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist

#EUds5 An implementation template 36 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } the wrong implementation of the right interface

#EUds5 workersdriver (aggregate) Implementing on RDDs 37 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕

#EUds5 workersdriver (aggregate) Implementing on RDDs 38 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕

#EUds5 workersdriver (aggregate) Implementing on RDDs 38

#EUds5 workersdriver (treeAggregate) Implementing on RDDs 39 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕

#EUds5 workersdriver (treeAggregate) Implementing on RDDs 40 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕

#EUds5 workersdriver (treeAggregate) Implementing on RDDs 41 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕

#EUds5 workersdriver (treeAggregate) Implementing on RDDs 42 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ driver (treeAggregate) ⊕ ⊕ ⊕ ⊕

#EUds5 workersdriver (treeAggregate) Implementing on RDDs 43 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕

#EUds5 workersdriver (treeAggregate) Implementing on RDDs 44 ⊕ ⊕

#EUds5 driver (treeAggregate) workers Implementing on RDDs 45 ⊕ ⊕

#EUds5 driver (treeAggregate) workers Implementing on RDDs 45 ⊕ ⊕⊕

#EUds5 Beyond the RDD: Data frames and ML Pipelines

#EUds5 RDDs: some good parts 47 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()

#EUds5 RDDs: some good parts 48 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile

#EUds5 RDDs: some good parts 49 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile

#EUds5 RDDs: some good parts 50 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile crashes at runtime

#EUds5 RDDs: some good parts 51 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn($"predictions", predict($"features"))

#EUds5 RDDs: some good parts 52 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn($"predictions", predict($"features"))

#EUds5 RDDs versus query planning 53 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.cartesian(numbers2) .map((x, y) => (x, y, expensive(x, y))) .filter((x, y, _) => isPrime(x), isPrime(y))

#EUds5 RDDs versus query planning 54 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.filter(isPrime(_)) .cartesian(numbers2.filter(isPrime(_))) .map((x, y) => (x, y, expensive(x, y)))

#EUds5 RDDs and the JVM heap 55 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))

#EUds5 RDDs and the Java heap 56 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0

#EUds5 RDDs and the Java heap 57 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0 32 bytes of data…

#EUds5 RDDs and the Java heap 58 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0 …and 64 bytes of overhead! 32 bytes of data…

#EUds5 ML pipelines: a quick example 59 from pyspark.ml.clustering import KMeans K, SEED = 100, 0xdea110c8 randomDF = make_random_df() kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features") model = kmeans.fit(randomDF) withPredictions = model.transform(randomDF).select("x", "y", "prediction")

#EUds5 Working with ML pipelines 60 estimator.fit(df)

#EUds5 Working with ML pipelines 61 estimator.fit(df) model.transform(df)

#EUds5 Working with ML pipelines 62 model.transform(df)

#EUds5 Working with ML pipelines 63 model.transform(df)

#EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df)

#EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed

#EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed outputCol

#EUds5 Defining parameters 65 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...

#EUds5 Defining parameters 66 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...

#EUds5 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... Defining parameters 67 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...

#EUds5 Don’t repeat yourself 71 /** * Common params for KMeans and KMeansModel */ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol { /* ... */ }

#EUds5 Estimators and transformers 72 estimator.fit(df)

#EUds5 Estimators and transformers 72 estimator.fit(df) model.transform(df)

#EUds5 Validate and transform at once 73 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }

#EUds5 Validate and transform at once 74 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… require(schema.fieldNames.contains($(featuresCol))) // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }

#EUds5 Validate and transform at once 75 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type schema($(featuresCol)) match { case sf: StructField => require(sf.dataType.equals(VectorType)) } // ...and that the output columns don’t exist // ...and then make a new schema }

#EUds5 Validate and transform at once 76 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist require(!schema.fieldNames.contains($(predictionCol))) require(!schema.fieldNames.contains($(similarityCol))) // ...and then make a new schema }

#EUds5 Validate and transform at once 77 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema schema.add($(predictionCol), "int") .add($(similarityCol), "double") }

#EUds5 Training on data frames 78 def fit(examples: DataFrame) = { import examples.sparkSession.implicits._ import org.apache.spark.ml.linalg.{Vector=>SV} val dfexamples = examples.select($(exampleCol)).rdd.map { case Row(sv: SV) => sv } /* construct a model object with the result of training */ new SOMModel(train(dfexamples, $(x), $(y))) }

#EUds5 Practical considerations  and key takeaways

#EUds5 Retire your visibility hacks 80 package org.apache.spark.ml.hacks object Hacks { import org.apache.spark.ml.linalg.VectorUDT val vectorUDT = new VectorUDT }

#EUds5 Retire your visibility hacks 81 package org.apache.spark.ml.linalg /* imports, etc., are elided ... */ @Since("2.0.0") @DeveloperApi object SQLDataTypes { val VectorType: DataType = new VectorUDT val MatrixType: DataType = new MatrixUDT }

#EUds5 Caching training data 82 val wasUncached = examples.storageLevel == StorageLevel.NONE if (wasUncached) { examples.cache() } /* actually train here */ if (wasUncached) { examples.unpersist() }

#EUds5 Improve serial execution times Are you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms. Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication. Seek to limit the number of library invocations you make and thus the time you spend copying data to and from your linear algebra library. 83

#EUds5 Key takeaways There are several techniques you can use to develop parallel implementations of machine learning algorithms. The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you’re developing libraries for Spark. As a library developer, you might need to rely on developer APIs and dive in to Spark’s source code, but things are getting easier with each release! 84

#EUds5 @willb • willb@redhat.com https://chapeau.freevariable.com https://radanalytics.io Thanks!

Building Machine Learning Algorithms on Apache Spark with William Benton

More Related Content

What's hot

Viewers also liked

Similar to Building Machine Learning Algorithms on Apache Spark with William Benton

More from Spark Summit

Recently uploaded

Building Machine Learning Algorithms on Apache Spark with William Benton