Introduction to Machine Learning with Python and scikit-learn
Python Atlanta, Nov. 14th 2013
Matt Hagy, matt@liveramp.com

Machine Learning (ML):
• Finding patterns in data
• Modeling those patterns
• Using the models to make predictions

ML can be easy*
• You already have ML applications!
• You can start applying ML methods now with Python and scikit-learn
• Theoretical knowledge of ML is not needed (initially)*
*Gaining more background, theory, and experience will help

Simple Example

Simple Model

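    import numpy as np
    from sklearn.linear_model import LinearRegression

    # np.load on an .npz archive returns a dict-like object, not a tuple;
    # the key names 'x' and 'y' are an assumption about how data.npz was saved
    with np.load('data.npz') as data:
        x, y = data['x'], data['y']
    x_test = np.linspace(0, 200)

    # scikit-learn expects a 2-D (n_samples, n_features) input array
    model = LinearRegression()
    model.fit(x[:, np.newaxis], y)
    y_test = model.predict(x_test[:, np.newaxis])
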
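The data file used on this slide is not part of the transcript, so the same fit can be tried on synthetic data instead. The following is a minimal sketch that assumes a made-up linear relationship (slope 3, intercept 10, Gaussian noise); those values are illustrative and do not come from the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for data.npz (hypothetical values, not the talk's data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 200, size=100)
y = 3.0 * x + 10.0 + rng.normal(scale=5.0, size=x.shape)

# Fit on a 2-D (n_samples, 1) view of the single input variable
model = LinearRegression()
model.fit(x[:, np.newaxis], y)

# Inspect what was learned and predict on an evenly spaced grid
slope = model.coef_[0]
intercept = model.intercept_
x_test = np.linspace(0, 200)
y_test = model.predict(x_test[:, np.newaxis])
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With enough samples the recovered slope and intercept should land close to the values used to generate the data.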
Variance/Bias Trade-Off
• We need models that can adapt to the relationships in our data
• Highly adaptable models can over-fit and then fail to generalize
• Regularization is a common strategy for addressing the variance/bias trade-off

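One way to see the trade-off is to fit models of increasing flexibility and compare the error on the training data with the error on held-out data. The sketch below is an illustration only: it uses NumPy polynomial fits on a synthetic noisy sine curve, and the degrees, sample sizes, and noise level are all arbitrary assumptions rather than anything from the talk.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_sine(n):
    """Draw n samples from a sine curve with additive Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = noisy_sine(20)   # small training set
x_val, y_val = noisy_sine(200)      # held-out data from the same process

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# Higher degree means a more adaptable model
results = {}
for degree in (1, 4, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))

for degree, (train_err, val_err) in results.items():
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, held-out MSE {val_err:.4f}")
```

The training error can only go down as the degree grows, which is why it is a poor guide on its own; the held-out error is the number to watch when judging whether a model generalizes.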
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # the key names 'x' and 'y' are an assumption about how data.npz was saved
    with np.load('data.npz') as data:
        x, y = data['x'], data['y']
    x_test = np.linspace(0, 200)

    model = Pipeline([
        ('standardize', StandardScaler()),
        # C sets the regularization strength (the slide's "regularization term")
        ('svr', SVR(kernel='rbf', verbose=0, C=5e6, epsilon=20))
    ])
    model.fit(x[:, np.newaxis], y)
    y_test = model.predict(x_test[:, np.newaxis])

Supervised Learning
Modeling the relationship between inputs and outputs
(one row per sample)

Input, X   Output, Y
0          1
3          6
1          3
3          7
4          9
2          3
9          17
3          6
4          7

Multiple Inputs
(one row per sample; X1 … Xn are the inputs)

X1   X2   X3   …   Xn   Output, Y
0    2    1    …   4    1
3    3    0    …   7    6
1    1    3    …   0    3
3    6    1    …   2    7
4    8    2    …   9    9
2    9    7    …   1    3
9    1    5    …   3    17
3    2    4    …   2    6
4    3    2    …   1    7

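In scikit-learn, a table like this maps onto a 2-D array with one row per sample and one column per input. The sketch below reads the slide's numbers column by column, which is an assumption about the flattened layout, and treats the four visible columns as the complete input set; the new sample at the end is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature matrix: shape (n_samples, n_features); columns are X1, X2, X3, Xn
X = np.array([
    [0, 2, 1, 4],
    [3, 3, 0, 7],
    [1, 1, 3, 0],
    [3, 6, 1, 2],
    [4, 8, 2, 9],
    [2, 9, 7, 1],
    [9, 1, 5, 3],
    [3, 2, 4, 2],
    [4, 3, 2, 1],
])
# Target vector: one output per sample
y = np.array([1, 6, 3, 7, 9, 3, 17, 6, 7])

model = LinearRegression().fit(X, y)

# One learned coefficient per input column
print(dict(zip(["X1", "X2", "X3", "Xn"], model.coef_.round(2))))

# Predict the output for a new, hypothetical sample
new_sample = np.array([[2, 4, 1, 3]])
prediction = model.predict(new_sample)
```

The same `fit` and `predict` calls are used as in the single-input case; only the number of columns in `X` changes.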
Example: Image Classification
• Classify handwritten digits with ML models
• Each input is an entire image
• The output is the digit in the image

[Figure: two example input images, X, with their output labels, Y: 9 and 2]

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    with np.load('train.npz') as data:
        pixels_train = data['pixels']
        labels_train = data['labels']
    with np.load('test.npz') as data:
        pixels_test = data['pixels']

    # flatten each image into a 1-D feature vector
    X_train = pixels_train.reshape(pixels_train.shape[0], -1)
    X_test = pixels_test.reshape(pixels_test.shape[0], -1)

    model = RandomForestClassifier(n_estimators=50)
    model.fit(X_train, labels_train)
    labels_test = model.predict(X_test)

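The train.npz and test.npz files are not available here, but the same workflow can be tried with the small 8x8 digits dataset that ships with scikit-learn. This is a sketch under that substitution: `load_digits` and the 75/25 split stand in for the talk's data, and the fixed `random_state` values are only there to make the run repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bundled stand-in dataset: 8x8 grayscale images of handwritten digits
digits = load_digits()
pixels, labels = digits.images, digits.target

# Flatten each image into a 1-D feature vector, as on the slide
X = pixels.reshape(pixels.shape[0], -1)

# Hold out a quarter of the labeled images to measure accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Fraction of held-out images whose digit is predicted correctly
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Because the held-out labels are known, this variant also gives a way to check how well the model generalizes, which the slide's unlabeled test set does not.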
Predicting the tags of Stack Overflow questions with machine learning
Kaggle Data Science Competition
• Given 6 million training questions labeled with tags
• Predict the tags for 2 million unlabeled test questions
www.users.globalnet.co.uk/~slocks/instructions.html
stackoverflow.com/questions/895371/bubble-sort-homework

Text Classification Overview
Raw Posts → Feature Extraction & Selection → Vector Space → Model Selection & Training → Machine Learning Model

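The stages in this overview can be chained with a scikit-learn `Pipeline`. The sketch below is a deliberately tiny illustration: the four posts and their tags are invented, each post gets a single tag (the competition allowed several per question), and `CountVectorizer` plus `LogisticRegression` are one plausible choice of components, not necessarily the ones used in the talk.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical raw posts and tags (not from the Kaggle data)
posts = [
    "java homework help with arrays",
    "how to sort an array in java",
    "python list comprehension question",
    "numpy array slicing in python",
]
tags = ["java", "java", "python", "python"]

model = Pipeline([
    # Raw posts -> vector space of term counts
    ("counts", CountVectorizer()),
    # Vector space -> trained classifier
    ("classifier", LogisticRegression()),
])
model.fit(posts, tags)

# Predict tags for unseen posts
predicted = model.predict(["java sorting homework", "python numpy question"])
print(list(predicted))
```

Keeping feature extraction inside the pipeline means new posts go through exactly the same vocabulary and counting steps as the training posts.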
Term Frequency Feature Extraction
Characterize text by the frequency of specific words in each text entry
Example Title: “Why is processing a sorted array faster than processing an array that is not sorted?”

Term Frequencies
why   processing   sorted   array   faster
1     2            2        2       1

Ignore common words (i.e. stop words)

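The counting step can be sketched with the standard library alone. The stop-word set below is a hypothetical one, chosen so that the remaining terms match the five shown on the slide; a real stop-word list would be much longer.

```python
import re
from collections import Counter

# Hypothetical stop-word list, just large enough for this title
STOP_WORDS = {"is", "a", "an", "than", "that", "not"}

def term_frequencies(text):
    """Lowercase the text, split it into words, and count the non-stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)

title = ("Why is processing a sorted array faster than "
         "processing an array that is not sorted?")
freqs = term_frequencies(title)

for term in ("why", "processing", "sorted", "array", "faster"):
    print(term, freqs[term])
```

The resulting counts are the term frequencies listed for the example title.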
          why   processing   sorted   array   faster   need   help   java   homework
Title 1   1     2            2        2       1        0      0      0      0
Title 2   0     0            0        0       0        1      1      1      1
Title 3   0     0            1        1       0        0      1      0      1

The frequency of key terms is expected to correlate with the tags of the question

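A term-frequency matrix of this shape can be produced with scikit-learn's `CountVectorizer`. In the sketch below the vocabulary is fixed to the nine terms from the slide, so every other word is ignored. Only the first title comes from the talk; the second and third are hypothetical titles written to reproduce the counts in the table.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Column order of the matrix follows this list
terms = ["why", "processing", "sorted", "array", "faster",
         "need", "help", "java", "homework"]

titles = [
    # Title 1: the example title from the talk
    "Why is processing a sorted array faster than processing an array that is not sorted?",
    # Titles 2 and 3: hypothetical, invented for illustration
    "Need help with Java homework",
    "Homework help: is my array sorted?",
]

# Words outside the fixed vocabulary are dropped, much like stop words
vectorizer = CountVectorizer(vocabulary=terms)
X = vectorizer.fit_transform(titles)  # sparse matrix, one row per title

counts = X.toarray()
for title_number, row in enumerate(counts, start=1):
    print(f"Title {title_number}", row.tolist())
```

Each row of `counts` is the feature vector for one title and can be passed directly to a scikit-learn model.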
Example Model Coefficients

ML can be easy*
• You already have ML problems!
• You can start applying ML methods now with Python and scikit-learn
• Theoretical knowledge of ML is not needed (initially)*
scikit-learn.org
github.com/scikit-learn

Helping companies use their marketing data to delight customers
Opportunities
• Backend Engineers
• Data Scientists
• Full-Stack Engineers
Tools
• Java
• Hadoop (Map/Reduce)
• Ruby
Build and work with large distributed systems that process massive data sets.
Check out: liveramp.com/careers
