Spark MLlib
Overview
• MLlib is Spark's library of machine learning (ML) functions, designed to run in parallel on clusters. It contains a variety of learning algorithms.
• MLlib invokes these algorithms on RDDs.
• Some classic ML algorithms are not included in MLlib because they were not designed for parallel execution.
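To make "designed to run in parallel" concrete, here is a minimal sketch in plain Python (not Spark code). It assumes a toy least-squares fit of y ≈ w·x: each partition computes a partial gradient independently, and the partial results are then summed. The function names, learning rate, and data are hypothetical; on a cluster the per-partition step would run on separate workers.

```python
from functools import reduce

def partial_gradient(partition, w):
    """Sum of least-squares gradients (w*x - y) * x over one partition."""
    return sum((w * x - y) * x for x, y in partition)

def fit_slope(partitions, steps=100, lr=0.05):
    """Gradient descent for y ~ w * x, combining per-partition gradients."""
    n = sum(len(p) for p in partitions)
    w = 0.0
    for _ in range(steps):
        # "map" step: each partition is processed independently
        partials = [partial_gradient(p, w) for p in partitions]
        # "reduce" step: combine the partial results into one gradient
        grad = reduce(lambda a, b: a + b, partials) / n
        w -= lr * grad
    return w

# Toy data generated from y = 3x, split into two partitions (hypothetical)
partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
slope = fit_slope(partitions)
```

An algorithm fits this pattern when its expensive step decomposes into independent per-partition work plus a cheap combine. Algorithms that need sequential access to the whole dataset do not decompose this way.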
Overview
• MLlib is divided into two packages:
  • spark.mllib contains the original API, built on top of RDDs.
  • spark.ml provides a higher-level API, built on top of DataFrames.
• Using spark.ml is recommended because the DataFrame-based API is more versatile and flexible.
• The plan is to keep supporting spark.mllib alongside the development of spark.ml.
Machine Learning Recap
• Machine learning algorithms try to make predictions or decisions based on training data.
• There are multiple types of learning problems, including classification, regression, and clustering, each with a different objective.
Spark MLlib Data Types
• MLlib contains a few specific data types, including Vector, LabeledPoint, Rating, Matrix (local and distributed), and various Model classes.
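As a rough illustration of two of these types, the sketch below uses plain Python stand-ins rather than the MLlib classes themselves. MLlib vectors come in dense and sparse forms, and a LabeledPoint pairs a numeric label with a feature vector. The names `SparseVec`, `Labeled`, `to_dense`, and `dot` are hypothetical and exist only for this example.

```python
from typing import NamedTuple, Sequence

class SparseVec(NamedTuple):
    """Hypothetical stand-in for a sparse vector: size plus non-zero entries."""
    size: int
    indices: Sequence[int]
    values: Sequence[float]

class Labeled(NamedTuple):
    """Hypothetical stand-in for a labeled point: label plus feature vector."""
    label: float
    features: SparseVec

def to_dense(v):
    """Expand a sparse vector into a plain list, filling gaps with 0.0."""
    dense = [0.0] * v.size
    for i, x in zip(v.indices, v.values):
        dense[i] = x
    return dense

def dot(v, weights):
    """Dot product that touches only the non-zero entries."""
    return sum(x * weights[i] for i, x in zip(v.indices, v.values))

point = Labeled(label=1.0, features=SparseVec(5, [1, 3], [2.0, 4.0]))
dense = to_dense(point.features)
score = dot(point.features, [0.5, 0.5, 0.5, 0.5, 0.5])
```

The sparse form stores only the non-zero entries, which saves memory and computation when most features are zero.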
MLlib Supported Supervised Learning Algorithms
• Binary classification: linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes
• Multiclass classification: logistic regression, decision trees, random forests, naive Bayes
• Regression: linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression
MLlib Supported Unsupervised Models
• K-means
• Gaussian mixture
• Power iteration clustering (PIC)
• Latent Dirichlet allocation (LDA)
• Bisecting k-means
• Streaming k-means
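For readers unfamiliar with the first model in this list, the following is a minimal plain-Python sketch of the basic k-means iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not MLlib's implementation. The starting centers and toy points are hypothetical, and real implementations choose initial centers more carefully.

```python
def closest(point, centers):
    """Index of the center with the smallest squared distance to point."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )

def kmeans(points, centers, iterations=10):
    """Basic k-means: alternate assignment and center-update steps."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers

# Two well-separated groups of 2-D points (hypothetical toy data)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The assignment step is independent for every point, which is why k-means suits a data-parallel setting.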
Recommender Systems
• Collaborative filtering is commonly used for recommender systems.
• spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries.
• spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors.
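The idea of latent factors and alternating least squares can be sketched in plain Python. This is a toy version, not spark.mllib's implementation: it assumes a single latent factor per user and per product (rank 1), so each least-squares solve reduces to a one-line formula. The function name, the regularization value `lam`, and the ratings are hypothetical.

```python
def als_rank1(ratings, n_users, n_products, iterations=50, lam=0.01):
    """Rank-1 ALS: approximate each observed rating as u[user] * v[product]."""
    u = [1.0] * n_users
    v = [1.0] * n_products
    for _ in range(iterations):
        # Fix the product factors and solve for each user factor in closed form
        for i in range(n_users):
            num = sum(r * v[j] for (a, j, r) in ratings if a == i)
            den = lam + sum(v[j] ** 2 for (a, j, r) in ratings if a == i)
            u[i] = num / den
        # Fix the user factors and solve for each product factor in closed form
        for j in range(n_products):
            num = sum(r * u[i] for (i, b, r) in ratings if b == j)
            den = lam + sum(u[i] ** 2 for (i, b, r) in ratings if b == j)
            v[j] = num / den
    return u, v

# Observed (user, product, rating) triples; the entry for user 1, product 2 is missing
ratings = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
u, v = als_rank1(ratings, n_users=2, n_products=3)

# Predict the missing entry from the learned factors
predicted = u[1] * v[2]
```

In this toy data user 1 rates every product about twice as high as user 0, so the predicted missing rating lands close to 6. With more than one latent factor, each update becomes a small regularized linear system per user or product rather than a single division.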
For more, visit https://supergloo.com
