Evaluating Classifier Performance for the Titanic Survivor Dataset
Raman Kannan, Adjunct, NYU Tandon School of Engineering
Tandon OnLine: Introduction to Machine Learning, CS 6923
Random Forest: Improving Classifier Performance

Process:
– Preparing the data: training and testing datasets
– Baseline classifier: GLM
– Tree: refine and fine-tune, all features on the entire dataset
– Random Forest: aggregating weak learners into a strong learner; random subset of features, bootstrapped dataset
– Compare performance: ROC curve, AUC

Pipeline: prepare dataset → baseline classifier → decision tree → random forest → compare and refine, all against the same training and testing datasets.

The simplest strategy may sometimes be the best solution; sophisticated strategies may not always yield the best one. Set up an objective means of comparison: the same training and testing sets, and the same measurement criteria.
Dataset Preparation

titanic <- read.csv("http://christianherta.de/lehre/dataScience/machineLearning/data/titanic-train.csv", header=TRUE)
sm_titanic_3 <- titanic[, c(2,3,5,6,10)]    # keep Survived, Pclass, Sex, Age, Fare; we don't need all the columns
sm_titanic_3 <- sm_titanic_3[complete.cases(sm_titanic_3),]   # keep only complete rows (714 remain)
set.seed(43)                                # make the split reproducible
tst_idx <- sample(714, 200, replace=FALSE)  # hold out 200 observations for testing
tstdata <- sm_titanic_3[tst_idx,]
trdata  <- sm_titanic_3[-tst_idx,]
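As a quick sanity check (optional, base R; not part of the slides), confirm the split sizes and the selected columns before modeling:

str(sm_titanic_3)        # expect Survived, Pclass, Sex, Age, Fare
nrow(trdata)             # expect 514
nrow(tstdata)            # expect 200
table(trdata$Survived)   # class balance in the training set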
Running GLM

Create a model using logistic regression:
glm_sm_titanic_3 <- glm(Survived ~ ., data=trdata, family=binomial())

Predict using the test dataset and the GLM model just created:
glm_predicted <- predict(glm_sm_titanic_3, tstdata[,2:5], type="response")
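Before the ROC analysis below, the predicted probabilities can be thresholded into class labels for a first look at accuracy. A minimal sketch, assuming a conventional 0.5 cutoff (the cutoff is not in the slides):

glm_class <- ifelse(glm_predicted > 0.5, 1, 0)          # assumed 0.5 cutoff
mean(glm_class == tstdata$Survived)                     # raw test accuracy
table(Predicted=glm_class, Observed=tstdata$Survived)   # confusion table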
How well did GLM perform?

require(ROCR)   # use the ROCR package

The raw counts are the true/false positives and true/false negatives. We will use precision, recall, and derived measures:
– Precision (related to false positives)
– Recall (related to false negatives)
https://en.wikipedia.org/wiki/Precision_and_recall
Binary Classifier

Precision → TP / (TP + FP)
Recall → TP / (all positives in the sample) = TP / (TP + FN)
https://uberpython.wordpress.com/2012/01/01/precision-recall-sensitivity-and-specificity/
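These definitions translate directly into a few lines of base R. A minimal sketch (prec_recall is a hypothetical helper, not from the slides), applied to 0/1 predicted and observed vectors:

prec_recall <- function(predicted, observed) {   # hypothetical helper
  tp <- sum(predicted == 1 & observed == 1)      # true positives
  fp <- sum(predicted == 1 & observed == 0)      # false positives
  fn <- sum(predicted == 0 & observed == 1)      # false negatives
  c(precision = tp/(tp + fp), recall = tp/(tp + fn))
}
# e.g. prec_recall(glm_class, tstdata$Survived) with the thresholded GLM predictions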
Calculate

# Model with training data and predict with test data
glm_sm_titanic_3 <- glm(Survived ~ ., data=trdata, family=binomial())
glm_predicted <- predict(glm_sm_titanic_3, tstdata[,2:5], type="response")

# Extract fp and tp using the ROCR package
require(ROCR)
glm_auc_1 <- prediction(glm_predicted, tstdata$Survived)
glm_prf <- performance(glm_auc_1, measure="tpr", x.measure="fpr")
glm_slot_fp <- slot(glm_auc_1, "fp")
glm_slot_tp <- slot(glm_auc_1, "tp")

# Calculate the fpr and tpr vectors
glm_fpr3 <- unlist(glm_slot_fp)/unlist(slot(glm_auc_1, "n.neg"))
glm_tpr3 <- unlist(glm_slot_tp)/unlist(slot(glm_auc_1, "n.pos"))

# Calculate the Area Under the ROC Curve
glm_perf_AUC <- performance(glm_auc_1, "auc")
glm_AUC <- glm_perf_AUC@y.values[[1]]
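The AUC can also be recovered from the fpr/tpr vectors themselves. A sketch using the trapezoidal rule (a cross-check added here, not part of the slides); it should agree closely with glm_AUC:

ord <- order(glm_fpr3, glm_tpr3)               # order points along the curve
x <- glm_fpr3[ord]; y <- glm_tpr3[ord]
sum(diff(x) * (head(y, -1) + tail(y, -1))/2)   # trapezoidal-rule AUC ≈ glm_AUC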
Plot the results for GLM

# Plot the ROC curve
plot(glm_fpr3, glm_tpr3, main="ROC Curve from first principles -- raw counts", xlab="FPR", ylab="TPR")

# Draw a diagonal and see the lift
points(glm_fpr3, glm_fpr3, cex=0.3)   # plotting fpr against itself generates the diagonal

# Place the AUC on the plot
text(0.4, 0.6, paste("GLM AUC = ", format(glm_AUC, digits=5, scientific=FALSE)))
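Equivalently (base R, not in the slides), abline draws the chance diagonal in a single call:

abline(a=0, b=1, lty=2)   # y = x: the ROC of a random classifier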
Let us repeat the steps for individual trees

We will use the recursive partitioning package rpart – there are many other packages.

# Model with training data and predict with test data
require(rpart)
tree <- rpart(Survived ~ ., data=trdata)
pruned.tree.03 <- prune(tree, cp=0.03)   # prune at complexity parameter 0.03
tree_predicted_prob_03 <- predict(pruned.tree.03, tstdata[,2:5])
tree_predicted_class_03 <- round(tree_predicted_prob_03)

# Extract measures using the ROCR package
tree_prediction_rocr_03 <- prediction(tree_predicted_class_03, tstdata$Survived)
tree_prf_rocr_03 <- performance(tree_prediction_rocr_03, measure="tpr", x.measure="fpr")
tree_perf_AUC_03 <- performance(tree_prediction_rocr_03, "auc")

# Visualize
plot(tree_prf_rocr_03, main="ROC plot cp=0.03 (DTREE using rpart)")
text(0.5, 0.5, paste("AUC=", format(tree_perf_AUC_03@y.values[[1]], digits=5, scientific=FALSE)))
# Use prune and printcp/plotcp to fine-tune the model, as shown in the video
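A sketch of the fine-tuning step the last comment alludes to: printcp/plotcp report the cross-validated error for each complexity parameter, and one common rule (an assumption here, not stated in the slides) is to prune at the cp minimizing xerror:

printcp(tree)   # cp table with cross-validated error (xerror)
plotcp(tree)    # plot xerror against cp
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]   # assumed rule
pruned.best <- prune(tree, cp=best_cp)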
Motivation for Ensemble Methods

A tree, while non-parametric, uses all features and all observations, and consequently can overfit.

Bootstrap aggregation (aka bagging):
– Construct multiple individual trees
– At each split while constructing a tree, randomly select a subset of features
– Bootstrap the dataset (sample with REPLACEMENT)
– Aggregate the results from the multiple trees, using any strategy:
– Allow voting and pick the most-voted class, or
– Simply average over all the trees

A hand-rolled sketch of this idea follows below.
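A minimal bagging sketch (an illustration, not the slides' code; assumptions: plain bagging over rpart trees, 25 bootstrap replicates, aggregation by averaging predicted probabilities):

require(rpart)
set.seed(43)
n_trees <- 25                                        # assumed ensemble size
probs <- matrix(0, nrow=nrow(tstdata), ncol=n_trees)
for (b in 1:n_trees) {
  boot_idx <- sample(nrow(trdata), replace=TRUE)     # bootstrap WITH replacement
  fit <- rpart(Survived ~ ., data=trdata[boot_idx,])
  probs[, b] <- predict(fit, tstdata[,2:5])          # each tree's predictions
}
bagged_prob <- rowMeans(probs)                       # average over all the trees
mean(round(bagged_prob) == tstdata$Survived)         # test accuracy of the ensemble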
Let us repeat the steps for Random Forest

We will use the randomForest package (which uses CART) – there are many other packages.

# Model with training data and predict with test data
require(randomForest)
fac.titanic.rf <- randomForest(as.factor(Survived) ~ ., data=trdata, keep.inbag=TRUE,
    importance=TRUE, keep.forest=TRUE, ntree=193)
# classification is implied because the response is a factor
predicted.rf <- predict(fac.titanic.rf, tstdata[,-1], type='response')
confusionTable <- table(predicted.rf, tstdata[,1], dnn=c('Predicted','Observed'))
table(predicted.rf == tstdata[,1])

# Extract measures using the ROCR package
pred.rf <- prediction(as.numeric(predicted.rf), as.numeric(tstdata[,1]))
perf.rf <- performance(pred.rf, measure="tpr", x.measure="fpr")
auc_rf <- performance(pred.rf, measure="auc")

# Visualize
plot(perf.rf, col=rainbow(7), main="ROC curve Titanic (Random Forest)", xlab="FPR", ylab="TPR")
text(0.5, 0.5, paste("AUC=", format(auc_rf@y.values[[1]], digits=5, scientific=FALSE)))
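Because type='response' yields hard class labels, the ROC above has a single operating point. A sketch for a full curve, using the forest's vote fractions as continuous scores (this step is an addition, not the slides' code):

prob.rf <- predict(fac.titanic.rf, tstdata[,-1], type='prob')[, "1"]   # P(survived)
pred.rf.prob <- prediction(prob.rf, tstdata[,1])
perf.rf.prob <- performance(pred.rf.prob, measure="tpr", x.measure="fpr")
performance(pred.rf.prob, measure="auc")@y.values[[1]]   # AUC from scores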
Comparing Models

Random forests generally outperform individual trees, and they do here. In this experiment, however, GLM outperforms the Random Forest.
Research Directions: Random Forest

R exercise: extract and reconstruct the individual trees from the forest for further investigation (a starting point is sketched below).

GLM still rules here, because of its superior lift:
– AUC for GLM → 0.84
– AUC for DT → 0.78
– AUC for RF → 0.82

More AUC (more area) → more lift.
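A starting point for that exercise (a sketch; getTree is part of the randomForest package):

single_tree <- getTree(fac.titanic.rf, k=1, labelVar=TRUE)   # k-th tree as a table
head(single_tree)   # left/right daughters, split variable, split point, prediction
# Reconstructing a standalone tree object from this table is the exercise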
