Missing Data: A Machine Learning Approach
DAMIAN MINGLE CHIEF DATA SCIENTIST, WPC Healthcare @DamianMingle
GET THE FULL STORY bit.ly/UseSciKitNow
What's Imputation Anyway?
• Some models don't do well with missing values, so filling the gaps with substitute values can prove useful.
• Missing values can be replaced by the mean, median, or most frequent value of the column.
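A minimal sketch of the three replacement strategies, assuming a recent scikit-learn in which the imputer lives at `sklearn.impute.SimpleImputer`. The toy array is invented for illustration and is not the dataset used later in the deck.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the last row is entirely missing.
X = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [4.0, 8.0],
              [np.nan, np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Keep only the last row, which is the one that held the NaNs.
    filled[strategy] = imp.fit_transform(X)[-1]
    print(strategy, filled[strategy])
```

The mean is pulled toward the outlying third row, while the median and most frequent value are not, which is one reason to try more than one strategy.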
Why Imputation Matters
• Imputing the missing values can give better results than discarding every sample that contains a missing value.
• Imputation does not always improve predictions, so check its effect with cross-validation.
• In some cases, dropping rows or using marker values is more effective.
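One way to run that check is sketched below, assuming scikit-learn's `SimpleImputer`, `make_pipeline`, and `cross_val_score`. The data is synthetic, and the model choice (`LogisticRegression`) and the 10% missing rate are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption): 4 features, label depends on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries at random.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

# Option 1: impute inside a pipeline, so each CV fold fits its own means.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
imputed_scores = cross_val_score(pipe, X_missing, y, cv=5)

# Option 2: discard every row that has at least one missing value.
complete = ~np.isnan(X_missing).any(axis=1)
dropped_scores = cross_val_score(
    LogisticRegression(), X_missing[complete], y[complete], cv=5
)

print("impute:", imputed_scores.mean())
print("drop:  ", dropped_scores.mean())
```

Which option scores higher depends on the data, which is the point of running the comparison rather than assuming the answer.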
• Preprocessing
• Clustering
• Regression
• Classification
• Dimensionality Reduction
• Model Selection
Let's Look at an ML Recipe: Imputation
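The Imports
    import urllib.request  # Python 3 location of urlopen
    import numpy as np
    # SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
    from sklearn.impute import SimpleImputer

Load Dataset with Missing Values
    url = "https://goo.gl/3jvZXE"
    raw_data = urllib.request.urlopen(url)
    dataset = np.loadtxt(raw_data, delimiter=",")
    print(dataset.shape)

Separate Features from Target
    # 0:7 selects columns 0-6 (the stop index is excluded), so column 7 is left out;
    # use dataset[:, 0:8] if every column before the target is a feature
    X = dataset[:, 0:7]
    y = dataset[:, 8]

Mark Zero Values as Missing
    X[X == 0] = np.nan

Impute Missing Values with Mean
    imp = SimpleImputer(missing_values=np.nan, strategy="mean")
    imputed_X = imp.fit_transform(X)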
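What the mean strategy computes can be sketched by hand with NumPy alone. The small array below is made up for illustration and is not the deck's dataset, and this is a hand-rolled equivalent rather than the library's implementation.

```python
import numpy as np

# Hypothetical toy data: zeros stand in for missing measurements.
X = np.array([[1.0, 0.0],
              [3.0, 4.0],
              [0.0, 8.0]])

# Mark zeros as missing. The array must be float, since NaN has no integer form.
X[X == 0] = np.nan

# Column means computed over the observed (non-NaN) values only.
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column.
imputed_X = np.where(np.isnan(X), col_means, X)
print(imputed_X)
```

Marking every zero this way also hides zeros that are genuine measurements, so it is worth checking column by column whether a zero really means "missing".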
Imputation Recipe
    # Impute missing values with the mean
    import urllib.request
    import numpy as np
    from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

    # Load dataset from UCI Machine Learning Repo
    url = "https://goo.gl/3jvZXE"
    raw_data = urllib.request.urlopen(url)
    dataset = np.loadtxt(raw_data, delimiter=",")
    print(dataset.shape)

    # Segregate the data by features and target
    # (0:7 selects columns 0-6; column 7 is left out)
    X = dataset[:, 0:7]
    y = dataset[:, 8]

    # All values equal to 0 become "not a number" (NaN)
    X[X == 0] = np.nan

    # Fill each NaN with the mean value of its attribute (column)
    imp = SimpleImputer(missing_values=np.nan, strategy="mean")
    imputed_X = imp.fit_transform(X)
Resources
• Society of Data Scientists
• SciKit Learn
Also:
• fit(X[, y]): fit the imputer on X
• fit_transform(X[, y]): fit to data, then transform it
• transform(X): impute all missing values in X
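The three methods can be sketched as follows, assuming `SimpleImputer` from a recent scikit-learn. The train and test arrays are hypothetical, and the split into "train" and "test" is an illustration of why `fit` and `transform` are separate calls.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with gaps in both sets.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [np.nan, 30.0]])
X_test = np.array([[np.nan, 5.0],
                   [7.0, np.nan]])

imp = SimpleImputer(strategy="mean")

# fit: learn one fill value per column from the training data.
imp.fit(X_train)
learned_means = imp.statistics_

# transform: fill NaNs using the values learned during fit.
X_test_filled = imp.transform(X_test)

# fit_transform: fit and transform the same array in one call.
X_train_filled = SimpleImputer(strategy="mean").fit_transform(X_train)

print(learned_means)
print(X_test_filled)
```

Fitting on the training data and only transforming the test data keeps test-set values out of the learned means.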

Scikit Learn: How to Deal with Missing Values
