Machine Learning for (JVM) Developers

Machine learning for (JVM) developers Mateusz Dymczyk Software Engineer H2O.ai 11th May 2016

Say who? • Software Engineer @ H2O.ai • Ph.D. drop-out (AGH in Krakow) • ex Fujitsu Laboratories research trainee

Say what? • Status quo of data • Why Machine Learning? • Intro to Machine Learning • Machine Learning and the JVM • Machine Learning Demo

Data source • Data collection • Data storage • Simple analytics • Data processing

Ideas • Industries: retail, healthcare, insurance/banking • Alerting from real-time data • Similarity search • Recommendations • Store layout • Ad targeting • Stock price predictions • Anomaly/fraud detection • Automatic investments • More: https://www.kaggle.com/wiki/DataScienceUseCases

Definition “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”

Simply speaking… • Subfield of Artificial Intelligence which… • Tries to find patterns in data using… • Math, statistics, probability, optimisation theory etc. to create… • Model which can be used to predict values or cluster • Theoretical concept with many implementations

Observations • Observations are objects which are used for learning and evaluation • Anything that can be described using quantitative features • Example (a house):

    {
      "title": "House schema",
      "type": "object",
      "properties": {
        "age":      { "type": "number" },
        "rooms":    { "type": "integer" },
        "size":     { "type": "number" },
        "location": { "type": "string" }
      }
    }

Feature • A feature is a quantitative trait that (partially) represents an observation • A feature vector is an n-dimensional vector of features that represents an observation • Feature extraction vs. feature selection • Example: the house schema and a feature vector built from its numeric fields:

    {
      "title": "House schema",
      "type": "object",
      "properties": {
        "age":      { "type": "number" },
        "rooms":    { "type": "integer" },
        "size":     { "type": "number" },
        "location": { "type": "string" }
      }
    }

    [5, 3, 60.5]
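A minimal sketch of turning one observation into the feature vector shown on the slide. The `House` case class and the `toFeatureVector` helper are hypothetical names introduced only for illustration, and dropping `location` is an assumed (and very crude) form of feature selection.

```scala
// Hypothetical type mirroring the fields of the schema on the slide.
final case class House(age: Double, rooms: Int, size: Double, location: String)

object FeatureExtraction {
  // Keep only the numeric fields; `location` is dropped here.
  // A real pipeline would more likely encode it (e.g. one-hot) than discard it.
  def toFeatureVector(h: House): Vector[Double] =
    Vector(h.age, h.rooms.toDouble, h.size)
}
```

For example, `FeatureExtraction.toFeatureVector(House(5.0, 3, 60.5, "somewhere"))` yields the three numeric features in the order age, rooms, size.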

System • A system is a set of related objects forming a complex whole (e.g. the set of all possible distinct observations) • In our case: the set of all possible houses

Model • A model is the description of a system using mathematical concepts/language • The result of applying a machine learning technique • Can be used for predictions/clustering • Can be trained online (incrementally, as data arrives) or offline (in batch)

Supervised Learning • User needs to know: • the structure of the data • possible outputs • Sample data has to be labeled for training

Classification • Required: • all possible labels • already labeled samples • Output: predicted label for new inputs • Examples: • spam classification based on email content • gender classification based on physical features

Regression • Required: • samples with actual values associated • Output: predicted values for new inputs • Examples: • price prediction based on historical prices
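To make the regression idea concrete, here is a small sketch of single-feature linear regression fitted by ordinary least squares. It is plain Scala with no library dependencies, and the object and method names are assumptions made for this example, not part of MLlib or any other framework mentioned in the deck.

```scala
object SimpleLinearRegression {
  /** Fits y = slope * x + intercept by ordinary least squares; returns (slope, intercept). */
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two paired samples")
    val n = xs.size.toDouble
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Covariance of x and y, and variance of x (both left unnormalised).
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx != 0.0, "x values must not all be equal")
    val slope = sxy / sxx
    (slope, meanY - slope * meanX)
  }

  /** Predicts a value for a new input using a fitted (slope, intercept) pair. */
  def predict(model: (Double, Double), x: Double): Double =
    model._1 * x + model._2
}
```

The samples with known values play the role of the training data; `predict` is what would run on new inputs.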

Unsupervised Learning • Doesn’t require the user to know what the output should be • No labelling necessary by the user • Useful for finding structure in data • Examples: • grouping users (clustering)

Clustering • Required: • data, no labelling necessary • Output: data grouped into clusters • Examples: • grouping users with similar tastes
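One common way to group unlabelled data is k-means. The deck does not name a specific clustering algorithm, so the following is only an illustrative sketch in plain Scala: it runs a fixed number of iterations from caller-supplied initial centroids, and all names in it are assumptions.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  private def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty group of points.
  private def mean(points: Seq[Point]): Point =
    Vector.tabulate(points.head.size)(d => points.map(_(d)).sum / points.size)

  /** Index of the centroid closest to `p`. */
  def nearest(centroids: Seq[Point], p: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), p))

  /** Alternates assignment and centroid update for a fixed number of iterations. */
  def fit(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial) { (centroids, _) =>
      val groups = points.groupBy(p => nearest(centroids, p))
      centroids.indices.map { i =>
        // Keep the old centroid if no point was assigned to it.
        groups.get(i).map(mean).getOrElse(centroids(i))
      }
    }
}
```

No labels are passed in anywhere: the clusters come purely from the distances between points. Production implementations add smarter initialisation and a convergence check.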

Types of machine learning • Regression: when you want to predict a real number • Clustering: when you want to cluster the data or have too much of it • Classification: when you want to assign to a category • Association analysis: when you want to find relations between data

Generic flow • TRAINING: Raw data → Feature extraction → Machine learning magic → Model • PREDICTING: Incoming new data → Feature extraction → Model → Predictions/clusters

Validation • How do we know the model is good? • Cross validation: • divide the data into training and testing subsets (sometimes a third one is necessary) • train using the training set, validate using the testing set • repeat the split multiple times and take the average of the scores
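The splitting and averaging steps can be sketched as k-fold cross validation. This is a plain Scala illustration with hypothetical names; it assigns every k-th sample to the same fold and assumes the data has already been shuffled.

```scala
object CrossValidation {
  /** Returns k (training, testing) pairs; fold i tests on every k-th element starting at index i. */
  def kFold[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the number of samples")
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }

  /** Averages the score returned by `trainAndScore(training, testing)` over all folds. */
  def averageScore[A](data: Seq[A], k: Int)(trainAndScore: (Seq[A], Seq[A]) => Double): Double = {
    val scores = kFold(data, k).map { case (train, test) => trainAndScore(train, test) }
    scores.sum / scores.size
  }
}
```

The caller supplies `trainAndScore`, which would train a model on the first argument and return its error or accuracy on the second.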

Common pitfalls • Overfitting • Underfitting

The tools… • SMILE • Weka • Mahout • Deeplearning4j/s • TridentML (Storm) • MLlib (Spark) • FlinkML (Flink) • H2O

Spark? • Distributed, fast, in-memory computational framework • Based on RDDs (Resilient Distributed Dataset: abstract, immutable, distributed, easily rebuilt data format) • Support for Scala, Java, Python and R • Focuses on well known methods   (map(), flatMap(), filter(), reduce() …)

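Spark?

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val conf = new SparkConf().setAppName("Spark App")
    val sc = new SparkContext(conf)

    // Read the input as an RDD of lines
    val textFile: RDD[String] = sc.textFile("hdfs://...")

    // Split each line into words and count the occurrences of every word
    val counts: RDD[(String, Int)] = textFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    println(s"Found ${counts.count()}")
    counts.saveAsTextFile("hdfs://...")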

Why Spark/MLlib PROS • extensive community, part of Spark (Databricks support) • Java, Scala, Python, R APIs • solid implementation of most popular algorithms • easy to use, well documented, multitude of examples • fast and robust CONS • only Spark • very young • mainly simple algorithms • still pretty “low level”

Price prediction • TRAINING: Raw house data → Feature extraction → Linear regression modelling → Model • PREDICTING: Incoming new data → Feature extraction → Model → Predicted price

Date  Open
26    708.58
25    700.01
24    688.92
23    701.45
22    707.45
19    695.03
18    710
17    699
16    692.98
12    690.26
11    675
10    686.86
9     672.32
8     667.85

[Charts: opening price (y-axis) plotted against date (x-axis), shown at two different y-axis scales]

Spam classification • TRAINING: Raw spam emails + raw ok emails → Feature extraction → Logistic regression modelling → Model • PREDICTING: Incoming emails → Feature extraction → Model → Spam/not spam
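As a rough illustration of the logistic regression step, the sketch below trains a binary classifier with batch gradient descent in plain Scala. It is not the MLlib implementation; the names, the zero initialisation, and the fixed learning rate and epoch count are all assumptions made for this example.

```scala
object LogisticRegressionSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** Probability that feature vector `x` belongs to the positive class (e.g. spam). */
  def predictProbability(weights: Vector[Double], bias: Double, x: Vector[Double]): Double =
    sigmoid(weights.zip(x).map { case (w, xi) => w * xi }.sum + bias)

  /** Batch gradient descent on the log-loss; labels are 1.0 (spam) or 0.0 (not spam). */
  def fit(xs: Seq[Vector[Double]], ys: Seq[Double],
          learningRate: Double, epochs: Int): (Vector[Double], Double) = {
    val n = xs.size.toDouble
    val dims = xs.head.size
    (1 to epochs).foldLeft((Vector.fill(dims)(0.0), 0.0)) { case ((w, b), _) =>
      // Prediction error per sample: predicted probability minus label.
      val errors = xs.zip(ys).map { case (x, y) => predictProbability(w, b, x) - y }
      val gradW = Vector.tabulate(dims)(d => xs.zip(errors).map { case (x, e) => x(d) * e }.sum / n)
      val gradB = errors.sum / n
      (w.zip(gradW).map { case (wi, g) => wi - learningRate * g }, b - learningRate * gradB)
    }
  }
}
```

In the spam setting, each `x` would be the numeric feature vector extracted from one email, and a probability above 0.5 would be read as "spam".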

Word representation • Some algorithms are ok with strings • Stopword extraction, form normalisation • Many approaches to transform into numerical values: • set of words • bag of words (TF) • TF-IDF • ...

Term frequency

All terms   i  love  like  cake  pie  cookies
Document1   1  0     1     1     0    0
Document2   1  1     0     0     1    0
Document3   1  1     0     0     0    1
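The term-frequency rows can be reproduced with a few lines of plain Scala. This is a sketch with a deliberately naive whitespace tokenizer; the object and method names are hypothetical, and the sentence "I like cake" used below is an assumed reconstruction of Document1 from its row.

```scala
object TermFrequency {
  /** Lower-cases the text and splits it on whitespace. */
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  /** Counts how often each vocabulary term occurs in the document. */
  def vectorize(vocabulary: Seq[String], document: String): Vector[Int] = {
    val tokens = tokenize(document)
    vocabulary.map(term => tokens.count(_ == term)).toVector
  }
}
```

With the vocabulary `Seq("i", "love", "like", "cake", "pie", "cookies")`, calling `TermFrequency.vectorize(vocabulary, "I like cake")` gives the counts in vocabulary order, matching the Document1 row.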

What next?
• Get ideas:
  o https://www.kaggle.com/wiki/DataScienceUseCases
• Learn the basics:
  o https://www.coursera.org/learn/machine-learning
  o https://work.caltech.edu/telecourse.html
• Get started with MLlib:
  o http://spark.apache.org/docs/latest/mllib-guide.html
  o https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• Try out other frameworks and courses:
  o https://github.com/h2oai/sparkling-water
  o https://www.coursera.org/course/mmds
• Practical books:
  o “Advanced Analytics with Spark”, Sandy Ryza et al., O’Reilly Media
  o “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark”, Mahmoud Parsian, O’Reilly Media

Thank you! @mdymczyk Mateusz Dymczyk mateusz@h2o.ai
