Evaluating Classifier Performance for the Titanic Survivor Dataset
Raman Kannan, Adjunct, NYU Tandon School of Engineering
Tandon OnLine: Introduction to Machine Learning, CS 6923
Random Forest: Improving Classifier Performance

Process:
– Preparing the data: training and testing datasets
– Baseline classifier: GLM
– Tree: refine and fine-tune, all features on the entire dataset
– Random Forest: aggregating weak learners into a strong learner; random subset of features, bootstrapped dataset
– Compare performance: ROC curve, AUC

Pipeline: prepare dataset → baseline classifier → decision tree → random forest → compare and refine, all against the same training and testing datasets.

The simplest strategy may sometimes be the best solution; sophisticated strategies may not always yield the best one. Set up an objective means of comparison: the same training and testing sets, and the same measurement criteria.
Dataset Preparation

titanic <- read.csv("http://christianherta.de/lehre/dataScience/machineLearning/data/titanic-train.csv", header=TRUE)
sm_titanic_3 <- titanic[, c(2,3,5,6,10)]    # keep Survived, Pclass, Sex, Age, Fare; we don't need all the columns
sm_titanic_3 <- sm_titanic_3[complete.cases(sm_titanic_3),]   # keep only complete rows (714 remain)
set.seed(43)                                # make the split reproducible
tst_idx <- sample(714, 200, replace=FALSE)  # hold out 200 observations for testing
tstdata <- sm_titanic_3[tst_idx,]
trdata  <- sm_titanic_3[-tst_idx,]
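As a quick sanity check (optional, base R; not part of the slides), confirm the split sizes and the selected columns before modeling:

str(sm_titanic_3)        # expect Survived, Pclass, Sex, Age, Fare
nrow(trdata)             # expect 514
nrow(tstdata)            # expect 200
table(trdata$Survived)   # class balance in the training set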
Running GLM

Create a model using logistic regression:
glm_sm_titanic_3 <- glm(Survived ~ ., data=trdata, family=binomial())

Predict using the test dataset and the GLM model just created:
glm_predicted <- predict(glm_sm_titanic_3, tstdata[,2:5], type="response")
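Before the ROC analysis below, the predicted probabilities can be thresholded into class labels for a first look at accuracy. A minimal sketch, assuming a conventional 0.5 cutoff (the cutoff is not in the slides):

glm_class <- ifelse(glm_predicted > 0.5, 1, 0)          # assumed 0.5 cutoff
mean(glm_class == tstdata$Survived)                     # raw test accuracy
table(Predicted=glm_class, Observed=tstdata$Survived)   # confusion table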
How well did GLM perform?

require(ROCR)   # use the ROCR package

The raw counts are the true/false positives and true/false negatives. We will use precision, recall, and derived measures:
– Precision (related to false positives)
– Recall (related to false negatives)
https://en.wikipedia.org/wiki/Precision_and_recall
Binary Classifier

Precision → TP / (TP + FP)
Recall → TP / (all positives in the sample) = TP / (TP + FN)
https://uberpython.wordpress.com/2012/01/01/precision-recall-sensitivity-and-specificity/
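These definitions translate directly into a few lines of base R. A minimal sketch (prec_recall is a hypothetical helper, not from the slides), applied to 0/1 predicted and observed vectors:

prec_recall <- function(predicted, observed) {   # hypothetical helper
  tp <- sum(predicted == 1 & observed == 1)      # true positives
  fp <- sum(predicted == 1 & observed == 0)      # false positives
  fn <- sum(predicted == 0 & observed == 1)      # false negatives
  c(precision = tp/(tp + fp), recall = tp/(tp + fn))
}
# e.g. prec_recall(glm_class, tstdata$Survived) with the thresholded GLM predictions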
Calculate

# Model with training data and predict with test data
glm_sm_titanic_3 <- glm(Survived ~ ., data=trdata, family=binomial())
glm_predicted <- predict(glm_sm_titanic_3, tstdata[,2:5], type="response")

# Extract fp and tp using the ROCR package
require(ROCR)
glm_auc_1 <- prediction(glm_predicted, tstdata$Survived)
glm_prf <- performance(glm_auc_1, measure="tpr", x.measure="fpr")
glm_slot_fp <- slot(glm_auc_1, "fp")
glm_slot_tp <- slot(glm_auc_1, "tp")

# Calculate the fpr and tpr vectors
glm_fpr3 <- unlist(glm_slot_fp)/unlist(slot(glm_auc_1, "n.neg"))
glm_tpr3 <- unlist(glm_slot_tp)/unlist(slot(glm_auc_1, "n.pos"))

# Calculate the Area Under the ROC Curve
glm_perf_AUC <- performance(glm_auc_1, "auc")
glm_AUC <- glm_perf_AUC@y.values[[1]]
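The AUC can also be recovered from the fpr/tpr vectors themselves. A sketch using the trapezoidal rule (a cross-check added here, not part of the slides); it should agree closely with glm_AUC:

ord <- order(glm_fpr3, glm_tpr3)               # order points along the curve
x <- glm_fpr3[ord]; y <- glm_tpr3[ord]
sum(diff(x) * (head(y, -1) + tail(y, -1))/2)   # trapezoidal-rule AUC ≈ glm_AUC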
Plot the results for GLM

# Plot the ROC curve
plot(glm_fpr3, glm_tpr3, main="ROC Curve from first principles -- raw counts", xlab="FPR", ylab="TPR")

# Draw a diagonal and see the lift
points(glm_fpr3, glm_fpr3, cex=0.3)   # plotting fpr against itself generates the diagonal

# Place the AUC on the plot
text(0.4, 0.6, paste("GLM AUC = ", format(glm_AUC, digits=5, scientific=FALSE)))
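Equivalently (base R, not in the slides), abline draws the chance diagonal in a single call:

abline(a=0, b=1, lty=2)   # y = x: the ROC of a random classifier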
Let us repeat the steps for individual trees

We will use the recursive partitioning package rpart – there are many other packages.

# Model with training data and predict with test data
require(rpart)
tree <- rpart(Survived ~ ., data=trdata)
pruned.tree.03 <- prune(tree, cp=0.03)   # prune at complexity parameter 0.03
tree_predicted_prob_03 <- predict(pruned.tree.03, tstdata[,2:5])
tree_predicted_class_03 <- round(tree_predicted_prob_03)

# Extract measures using the ROCR package
tree_prediction_rocr_03 <- prediction(tree_predicted_class_03, tstdata$Survived)
tree_prf_rocr_03 <- performance(tree_prediction_rocr_03, measure="tpr", x.measure="fpr")
tree_perf_AUC_03 <- performance(tree_prediction_rocr_03, "auc")

# Visualize
plot(tree_prf_rocr_03, main="ROC plot cp=0.03 (DTREE using rpart)")
text(0.5, 0.5, paste("AUC=", format(tree_perf_AUC_03@y.values[[1]], digits=5, scientific=FALSE)))
# Use prune and printcp/plotcp to fine-tune the model, as shown in the video
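A sketch of the fine-tuning step the last comment alludes to: printcp/plotcp report the cross-validated error for each complexity parameter, and one common rule (an assumption here, not stated in the slides) is to prune at the cp minimizing xerror:

printcp(tree)   # cp table with cross-validated error (xerror)
plotcp(tree)    # plot xerror against cp
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]   # assumed rule
pruned.best <- prune(tree, cp=best_cp)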
Motivation for Ensemble Methods

A tree, while non-parametric, uses all features and all observations, and consequently can overfit.

Bootstrap aggregation (aka bagging):
– Construct multiple individual trees
– At each split while constructing a tree, randomly select a subset of features
– Bootstrap the dataset (sample with REPLACEMENT)
– Aggregate the results from the multiple trees, using any strategy:
– Allow voting and pick the most-voted class, or
– Simply average over all the trees

A hand-rolled sketch of this idea follows below.
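A minimal bagging sketch (an illustration, not the slides' code; assumptions: plain bagging over rpart trees, 25 bootstrap replicates, aggregation by averaging predicted probabilities):

require(rpart)
set.seed(43)
n_trees <- 25                                        # assumed ensemble size
probs <- matrix(0, nrow=nrow(tstdata), ncol=n_trees)
for (b in 1:n_trees) {
  boot_idx <- sample(nrow(trdata), replace=TRUE)     # bootstrap WITH replacement
  fit <- rpart(Survived ~ ., data=trdata[boot_idx,])
  probs[, b] <- predict(fit, tstdata[,2:5])          # each tree's predictions
}
bagged_prob <- rowMeans(probs)                       # average over all the trees
mean(round(bagged_prob) == tstdata$Survived)         # test accuracy of the ensemble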
Let us repeat the steps for Random Forest

We will use the randomForest package (which uses CART) – there are many other packages.

# Model with training data and predict with test data
require(randomForest)
fac.titanic.rf <- randomForest(as.factor(Survived) ~ ., data=trdata, keep.inbag=TRUE,
    importance=TRUE, keep.forest=TRUE, ntree=193)
# classification is implied because the response is a factor
predicted.rf <- predict(fac.titanic.rf, tstdata[,-1], type='response')
confusionTable <- table(predicted.rf, tstdata[,1], dnn=c('Predicted','Observed'))
table(predicted.rf == tstdata[,1])

# Extract measures using the ROCR package
pred.rf <- prediction(as.numeric(predicted.rf), as.numeric(tstdata[,1]))
perf.rf <- performance(pred.rf, measure="tpr", x.measure="fpr")
auc_rf <- performance(pred.rf, measure="auc")

# Visualize
plot(perf.rf, col=rainbow(7), main="ROC curve Titanic (Random Forest)", xlab="FPR", ylab="TPR")
text(0.5, 0.5, paste("AUC=", format(auc_rf@y.values[[1]], digits=5, scientific=FALSE)))
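Because type='response' yields hard class labels, the ROC above has a single operating point. A sketch for a full curve, using the forest's vote fractions as continuous scores (this step is an addition, not the slides' code):

prob.rf <- predict(fac.titanic.rf, tstdata[,-1], type='prob')[, "1"]   # P(survived)
pred.rf.prob <- prediction(prob.rf, tstdata[,1])
perf.rf.prob <- performance(pred.rf.prob, measure="tpr", x.measure="fpr")
performance(pred.rf.prob, measure="auc")@y.values[[1]]   # AUC from scores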
Comparing Models

Random forests generally outperform individual trees, and they do here. In this experiment, however, GLM outperforms the Random Forest.
Research Directions: Random Forest

R exercise: extract and reconstruct the individual trees from the forest for further investigation (a starting point is sketched below).

GLM still rules here, because of its superior lift:
– AUC for GLM → 0.84
– AUC for DT → 0.78
– AUC for RF → 0.82

More AUC (more area) → more lift.
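A starting point for that exercise (a sketch; getTree is part of the randomForest package):

single_tree <- getTree(fac.titanic.rf, k=1, labelVar=TRUE)   # k-th tree as a table
head(single_tree)   # left/right daughters, split variable, split point, prediction
# Reconstructing a standalone tree object from this table is the exercise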
