Machine learning for (JVM) developers Mateusz Dymczyk Software Engineer H2O.ai 11th May 2016
Say who? • Software Engineer @ H2O.ai • Ph.D. drop-out (AGH in Krakow) • ex Fujitsu Laboratories research trainee
Say what? • Status quo of data • Why Machine Learning? • Intro to Machine Learning • Machine Learning and the JVM • Machine Learning Demo
The state of data
Exponential growth
Data source • Data collection • Data storage • Simple analytics • Data processing
Ideas (retail, healthcare, insurance/banking) • Alerting from real-time data • Similarity search • Recommendations • Store layout • Ad targeting • Stock price predictions • Anomaly/fraud detection • Automatic investments https://www.kaggle.com/wiki/DataScienceUseCases
Machine Learning
Definition “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”
Simply speaking… • Subfield of Artificial Intelligence which… • Tries to find patterns in data using… • Math, statistics, probability, optimisation theory etc. to create… • Model which can be used to predict values or cluster • Theoretical concept with many implementations
Basic terminology
Observations • Objects which are used for learning and evaluation • Anything that can be described using quantitative features
{
  "title": "House schema",
  "type": "object",
  "properties": {
    "age": { "type": "number" },
    "rooms": { "type": "integer" },
    "size": { "type": "number" },
    "location": { "type": "string" }
  }
}
Feature • A quantitative trait that (partially) represents an observation • Feature vector: an n-dimensional vector of features that represents an observation • Feature extraction vs. feature selection
{
  "title": "House schema",
  "type": "object",
  "properties": {
    "age": { "type": "number" },
    "rooms": { "type": "integer" },
    "size": { "type": "number" },
    "location": { "type": "string" }
  }
}
Feature vector: [5, 3, 60.5]
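A minimal Scala sketch of feature extraction for the house schema above. The `House` case class, the `toFeatureVector` helper and the sample location are hypothetical, and the sketch assumes the slide's vector `[5, 3, 60.5]` lists age, rooms and size in that order.

```scala
object FeatureExtraction {
  // Hypothetical observation type mirroring the fields of the schema on the slide.
  final case class House(age: Double, rooms: Int, size: Double, location: String)

  // Keep only the numeric fields; `location` is a string and would need
  // its own encoding before it could join the vector.
  def toFeatureVector(h: House): Array[Double] =
    Array(h.age, h.rooms.toDouble, h.size)
}
```

Under those assumptions, `FeatureExtraction.toFeatureVector(FeatureExtraction.House(5.0, 3, 60.5, "Krakow"))` yields the three values 5.0, 3.0 and 60.5. Dropping `location` is an example of feature selection; turning it into numbers would be feature extraction.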
System • A set of related objects forming a complex whole (e.g. the set of all possible distinct observations) • In our case, the set of all possible houses
Model • The description of a system using mathematical concepts/language • Result of a machine learning technique • Can be used for predictions/clustering • Online or offline
Supervised Learning • User needs to know: • the structure of the data • possible outputs • Sample data has to be labeled for training
Classification • Required: • all possible labels • already labeled samples • Output: predicted label for new inputs • Examples: • spam classification based on email content • gender classification based on physical features
Regression • Required: • samples with actual values associated • Output: predicted values for new inputs • Examples: • price prediction based on historical prices
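To make "samples with actual values" concrete, here is a small sketch of one-variable linear regression fitted by ordinary least squares. It is plain Scala rather than any of the frameworks named in this deck, and the object and method names are illustrative only.

```scala
object SimpleLinearRegression {
  /** Fits y = slope * x + intercept by ordinary least squares; returns (slope, intercept). */
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two paired samples")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    // Sum of co-deviations and sum of squared x deviations.
    val covXY = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(varX != 0.0, "x values must not all be equal")
    val slope = covXY / varX
    (slope, meanY - slope * meanX)
  }

  /** Predicts a value for a new input using a fitted (slope, intercept) pair. */
  def predict(model: (Double, Double), x: Double): Double =
    model._1 * x + model._2
}
```

With the made-up samples x = 1, 2, 3, 4 and y = 3, 5, 7, 9, `fit` recovers slope 2 and intercept 1, so `predict` returns 11 for x = 5.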
Unsupervised Learning • Doesn’t require the user to know what should be the output • No labelling necessary by the user • Useful for finding structure in data • Examples: • grouping users (clustering)
Clustering • Required: • data, no labelling necessary • Output: data grouped into clusters • Examples: • grouping users with similar tastes
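One common clustering technique is k-means. The deck does not name a specific algorithm, so treat the following as an assumed example: a sketch of Lloyd's algorithm on one-dimensional points, with the starting centroids supplied by the caller.

```scala
object KMeans1D {
  /** Runs a fixed number of k-means iterations and returns the final centroids. */
  def cluster(points: Seq[Double], initial: Seq[Double], iterations: Int = 10): Seq[Double] = {
    var centroids = initial
    for (_ <- 1 to iterations) {
      // Assignment step: group every point under its nearest centroid.
      val groups = points.groupBy(p => centroids.minBy(c => math.abs(p - c)))
      // Update step: move each centroid to the mean of its group
      // (a centroid with no points stays where it is).
      centroids = centroids.map(c => groups.get(c).map(g => g.sum / g.size).getOrElse(c))
    }
    centroids
  }
}
```

For the made-up points 1, 2, 3, 10, 11, 12 and starting centroids 0 and 5, the centroids settle at 2 and 11. No labels were needed to find the two groups.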
Types of machine learning • Regression: when you want to predict a real number • Clustering: when you want to cluster or have too much data • Classification: when you want to assign to a category • Association analysis: when you want to find relations between data
Generic flow • TRAINING: Raw data → Feature extraction → Machine learning magic → Model • PREDICTING: Incoming new data → Feature extraction → Model → Predictions/clusters
Validation • How do we know the model is good? • Cross validation: • divide the data into training and testing subsets (sometimes a third one is necessary) • train using the training set, validate using the testing set • repeat the split multiple times and take the average!
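The split-train-validate-average loop can be sketched as k-fold cross validation. The names below are illustrative, and `trainAndScore` is a placeholder for whatever training and scoring routine is being evaluated.

```scala
object KFold {
  /** Builds k (training, testing) pairs; every element lands in exactly one testing set. */
  def splits[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the number of samples")
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      // Elements whose index falls in this fold are held out for testing.
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }

  /** Trains and scores once per fold, then averages the scores. */
  def crossValidate[A](data: Seq[A], k: Int)(trainAndScore: (Seq[A], Seq[A]) => Double): Double = {
    val scores = splits(data, k).map { case (train, test) => trainAndScore(train, test) }
    scores.sum / scores.size
  }
}
```

In practice the data would usually be shuffled before splitting, so that any ordering in the input does not leak into the folds; the sketch leaves that to the caller.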
Common pitfalls • Overfitting • Underfitting
ML and the JVM
The tools… • SMILE • Weka • Mahout • Deeplearning4j/s • TridentML (Storm) • MLlib (Spark) • FlinkML (Flink) • H2O
Spark? • Distributed, fast, in-memory computational framework • Based on RDDs (Resilient Distributed Dataset: abstract, immutable, distributed, easily rebuilt data format) • Support for Scala, Java, Python and R • Focuses on well-known methods (map(), flatMap(), filter(), reduce() …)
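Spark?
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val conf = new SparkConf().setAppName("Spark App")
    val sc = new SparkContext(conf)
    // Read the input as an RDD of lines
    val textFile: RDD[String] = sc.textFile("hdfs://...")
    // Split lines into words, pair each word with 1, then sum the 1s per word
    val counts: RDD[(String, Int)] = textFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // count() returns the number of distinct words
    println(s"Found ${counts.count()}")
    counts.saveAsTextFile("hdfs://...")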
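The word count above uses the same method names as Scala's own collections. As a rough local analogue, with no cluster and no Spark dependency, the pipeline can be sketched on a plain `Seq`. Here `groupBy` followed by a sum stands in for `reduceByKey`, which ordinary collections do not have, and the extra `filter` for empty strings is an addition of this sketch.

```scala
object LocalWordCount {
  /** Counts word occurrences across all lines, splitting on single spaces. */
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // lines -> words
      .filter(_.nonEmpty)      // drop empty strings left by repeated spaces
      .map(word => (word, 1))  // pair each word with a count of 1
      .groupBy(_._1)           // gather the pairs per word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

For example, `LocalWordCount.count(Seq("a b a", "b c"))` gives `Map("a" -> 2, "b" -> 2, "c" -> 1)`. The Spark version has the same shape but distributes the work across the cluster.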
Why Spark/MLlib PROS • extensive community, part of Spark (Databricks support) • Java, Scala, Python, R APIs • solid implementation of most popular algorithms • easy to use, well documented, multitude of examples • fast and robust CONS • only Spark • very young • mainly simple algorithms • still pretty “low level”
Demos
Price prediction • TRAINING: Raw house data → Feature extraction → Linear regression modelling → Model • PREDICTING: Incoming new data → Feature extraction → Model → Predicted price
Date  Open
26    708.58
25    700.01
24    688.92
23    701.45
22    707.45
19    695.03
18    710
17    699
16    692.98
12    690.26
11    675
10    686.86
9     672.32
8     667.85
[Figures: opening price plotted against date]
Spam classification • TRAINING: Raw spam emails + Raw ok emails → Feature extraction → Logistic regression modelling → Model • PREDICTING: Incoming emails → Feature extraction → Model → Spam/not spam
Word representation • Some algorithms are ok with strings • Stopword extraction, form normalisation • Many approaches to transform into numerical values: • set of words • bag of words (TF) • TF-IDF • ...
Term frequency
All terms   i  love  like  cake  pie  cookies
Document1   1  0     1     1     0    0
Document2   1  1     0     0     1    0
Document3   1  1     0     0     0    1
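A term-frequency table like this one can be produced with a few lines of plain Scala. The tokenizer below is deliberately naive (lower-case, split on whitespace), and the example sentence in the note after the code is an assumption reconstructed from the counts, not text from the deck.

```scala
object TermFrequency {
  /** Lower-cases the document and splits it on whitespace. */
  def tokenize(doc: String): Seq[String] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  /** Counts how often each vocabulary term occurs in the document. */
  def vectorize(vocabulary: Seq[String], doc: String): Seq[Int] = {
    val tokens = tokenize(doc)
    vocabulary.map(term => tokens.count(_ == term))
  }
}
```

With the vocabulary `Seq("i", "love", "like", "cake", "pie", "cookies")`, the assumed sentence "I like cake" maps to 1, 0, 1, 1, 0, 0, which matches the Document1 row.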
What next? • Get ideas: o https://www.kaggle.com/wiki/DataScienceUseCases • Learn the basics: o https://www.coursera.org/learn/machine-learning o https://work.caltech.edu/telecourse.html • Get started with MLlib: o http://spark.apache.org/docs/latest/mllib-guide.html o https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x • Try out other frameworks and courses: o https://github.com/h2oai/sparkling-water o https://www.coursera.org/course/mmds • Practical books: o “Advanced Analytics with Spark” — Sandy Ryza et al. O’Reilly Media o “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark” — Mahmoud Parsian, O’Reilly Media
Thank you! @mdymczyk Mateusz Dymczyk mateusz@h2o.ai
Q&A
