Regression and Classification with R ∗ Yanchang Zhao http://www.RDataMining.com R and Data Mining Course Beijing University of Posts and Telecommunications, Beijing, China July 2019 ∗ Chapters 4 & 5, in R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf 1 / 53
Contents Introduction Linear Regression Generalized Linear Regression Decision Trees with Package party Decision Trees with Package rpart Random Forest Online Resources 2 / 53
Regression and Classification with R † Basics of regression and classification Building a linear regression model to predict CPI data Building a generalized linear model (GLM) Building decision trees with package party and rpart Training a random forest model with package randomForest † Chapter 4: Decision Trees and Random Forest & Chapter 5: Regression, in book R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf 3 / 53
Regression and Classification Regression: to predict a continuous value, such as the volume of rain Classification: to predict a categorical class label, such as weather: rainy, sunny, cloudy or snowy 4 / 53
Regression Regression is to build a function of independent variables (also known as predictors) to predict a dependent variable (also called response). For example, banks assess the risk of home-loan applicants based on their age, income, expenses, occupation, number of dependents, total credit limit, etc. Linear regression models Generalized linear models (GLM) 5 / 53
An Example of Decision Tree Edible Mushroom decision tree‡ ‡ http://users.cs.cf.ac.uk/Dave.Marshall/AI2/node147.html 6 / 53
Random Forest Ensemble learning with many decision trees Each tree is trained with a random sample of the training dataset and on a randomly chosen subspace. The final prediction result is derived from the predictions of all individual trees, with mean (for regression) or majority voting (for classification). Better performance and less likely to overfit than a single decision tree, but with less interpretability 7 / 53
Regression Evaluation
MAE: Mean Absolute Error
    MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|    (1)
MSE: Mean Squared Error
    MSE = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²    (2)
RMSE: Root Mean Squared Error
    RMSE = √( (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)² )    (3)
where y_i is the actual value and ŷ_i is the predicted value. 8 / 53
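These measures are straightforward to compute in R. Below is a minimal sketch, not from the original slides; actual and predicted are small made-up vectors used only for illustration.
## a minimal sketch: MAE, MSE and RMSE in R (illustrative values, not from the slides)
actual <- c(3.1, 2.5, 4.0, 5.2)      # observed values y_i
predicted <- c(2.9, 2.7, 4.4, 5.0)   # predicted values y-hat_i
mae <- mean(abs(predicted - actual))
mse <- mean((predicted - actual)^2)
rmse <- sqrt(mse)
c(MAE = mae, MSE = mse, RMSE = rmse)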
Overfitting A model is overly complex and performs very well on training data but poorly on unseen data. Evaluate models with out-of-sample test data, i.e., data not included in the training data. 9 / 53
Training and Test Randomly split the data into training and test sets, e.g., 80/20, 70/30 or 60/40. [figure: split of the data into a training set and a test set] 10 / 53
k-Fold Cross Validation Split data into k subsets of equal size Reserve one subset for testing and use the rest for training; repeat so that each subset is used once for testing Average the performance over all k runs 11 / 53
An Example: 5-Fold Cross Validation [figure: five splits of the data, with a different fold held out as the test set each time] 12 / 53
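As a concrete illustration of the k-fold procedure described above, below is a minimal sketch in base R; the built-in cars data and the simple lm() model are assumptions made for illustration, not data or code from these slides.
## a sketch: 5-fold cross validation with a linear model on the built-in 'cars' data
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars)))   # random fold assignment
rmse <- numeric(k)
for (i in 1:k) {
  train <- cars[folds != i, ]                 # k-1 folds for training
  test  <- cars[folds == i, ]                 # 1 fold held out for testing
  fit <- lm(dist ~ speed, data = train)
  pred <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((pred - test$dist)^2)) # RMSE on the held-out fold
}
mean(rmse)                                    # average performance over the k folds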
Contents Introduction Linear Regression Generalized Linear Regression Decision Trees with Package party Decision Trees with Package rpart Random Forest Online Resources 13 / 53
Linear Regression Linear regression is to predict the response with a linear function of predictors as follows: y = c0 + c1x1 + c2x2 + · · · + ckxk, where x1, x2, · · · , xk are predictors, y is the response to predict, and c0, c1, · · · , ck are coefficients to learn. Linear regression in R: lm() The Australian Consumer Price Index (CPI) data: quarterly CPIs from 2008 to 2010 § § From Australian Bureau of Statistics, http://www.abs.gov.au. 14 / 53
The CPI Data ## CPI data year <- rep(2008:2010, each = 4) quarter <- rep(1:4, 3) cpi <- c(162.2, 164.6, 166.5, 166.0, 166.2, 167.0, 168.6, 169.5, 171.0, 172.1, 173.3, 174.0) plot(cpi, xaxt="n", ylab="CPI", xlab="") # draw x-axis, where "las=3" makes text vertical axis(1, labels=paste(year,quarter,sep="Q"), at=1:12, las=3) [plot: quarterly CPI from 2008Q1 to 2010Q4] 15 / 53
Linear Regression ## correlation between CPI and year / quarter cor(year, cpi) ## [1] 0.9096316 cor(quarter, cpi) ## [1] 0.3738028 ## build a linear regression model with function lm() fit <- lm(cpi ~ year + quarter) fit ## ## Call: ## lm(formula = cpi ~ year + quarter) ## ## Coefficients: ## (Intercept) year quarter ## -7644.488 3.888 1.167 16 / 53
With the above linear model, CPI is calculated as cpi = c0 + c1 ∗ year + c2 ∗ quarter, where c0, c1 and c2 are coefficients from model fit. What will the CPI be in 2011? # make prediction cpi2011 <- fit$coefficients[[1]] + fit$coefficients[[2]] * 2011 + fit$coefficients[[3]] * (1:4) cpi2011 ## [1] 174.4417 175.6083 176.7750 177.9417 An easier way is to use function predict(). 17 / 53
More details of the model can be obtained with the code below. ## attributes of the model attributes(fit) ## $names ## [1] "coefficients" "residuals" "effects" ## [4] "rank" "fitted.values" "assign" ## [7] "qr" "df.residual" "xlevels" ## [10] "call" "terms" "model" ## ## $class ## [1] "lm" fit$coefficients ## (Intercept) year quarter ## -7644.487500 3.887500 1.166667 18 / 53
Function residuals(): differences btw observed & fitted values ## differences between observed values and fitted values residuals(fit) ## 1 2 3 4 5 ## -0.57916667 0.65416667 1.38750000 -0.27916667 -0.46666667 ## 6 7 8 9 10 ## -0.83333333 -0.40000000 -0.66666667 0.44583333 0.37916667 ## 11 12 ## 0.41250000 -0.05416667 summary(fit) ## ## Call: ## lm(formula = cpi ~ year + quarter) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.8333 -0.4948 -0.1667 0.4208 1.3875 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -7644.4875 518.6543 -14.739 1.31e-07 *** ## year 3.8875 0.2582 15.058 1.09e-07 *** 19 / 53
3D Plot of the Fitted Model library(scatterplot3d) s3d <- scatterplot3d(year, quarter, cpi, highlight.3d=T, type="h", lab=c(2,3)) # lab: number of tickmarks on x-/y-axes s3d$plane3d(fit) # draws the fitted plane [plot: 3D scatter plot of cpi against year and quarter, with the fitted regression plane] 20 / 53
Prediction of CPIs in 2011 data2011 <- data.frame(year=2011, quarter=1:4) cpi2011 <- predict(fit, newdata=data2011) style <- c(rep(1,12), rep(2,4)) plot(c(cpi, cpi2011), xaxt="n", ylab="CPI", xlab="", pch=style, col=style) txt <- c(paste(year,quarter,sep="Q"), "2011Q1", "2011Q2", "2011Q3", "2011Q4") axis(1, at=1:16, las=3, labels=txt) [plot: observed CPIs for 2008Q1 to 2010Q4 and predicted CPIs for 2011Q1 to 2011Q4] 21 / 53
Contents Introduction Linear Regression Generalized Linear Regression Decision Trees with Package party Decision Trees with Package rpart Random Forest Online Resources 22 / 53
Generalized Linear Model (GLM) Generalizes linear regression by allowing the linear model to be related to the response variable via a link function and allowing the magnitude of the variance of each measurement to be a function of its predicted value Unifies various other statistical models, including linear regression, logistic regression and Poisson regression Function glm(): fits generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution 23 / 53
Build a Generalized Linear Model ## build a regression model data("bodyfat", package="TH.data") myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth bodyfat.glm <- glm(myFormula, family=gaussian("log"), data=bodyfat) summary(bodyfat.glm) ## ## Call: ## glm(formula = myFormula, family = gaussian("log"), data = b... ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -11.5688 -3.0065 0.1266 2.8310 10.0966 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.734293 0.308949 2.377 0.02042 * ## age 0.002129 0.001446 1.473 0.14560 ## waistcirc 0.010489 0.002479 4.231 7.44e-05 *** ## hipcirc 0.009702 0.003231 3.003 0.00379 ** ## elbowbreadth 0.002355 0.045686 0.052 0.95905 24 / 53
Prediction with Generalized Linear Regression Model ## make prediction and visualise result pred <- predict(bodyfat.glm, type = "response") plot(bodyfat$DEXfat, pred, xlab = "Observed", ylab = "Prediction") abline(a = 0, b = 1) [plot: predictions against observed DEXfat values, with the y = x reference line] 25 / 53
Contents Introduction Linear Regression Generalized Linear Regression Decision Trees with Package party Decision Trees with Package rpart Random Forest Online Resources 26 / 53
The iris Data str(iris) ## 'data.frame': 150 obs. of 5 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0... ## $ Species : Factor w/ 3 levels "setosa","versicolor",.... # split data into two subsets: training (70%) and test (30%); set # a fixed random seed to make results reproducible set.seed(1234) ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3)) train.data <- iris[ind == 1, ] test.data <- iris[ind == 2, ] 27 / 53
Build a ctree Control the training of decision trees: MinSplit, MinBucket, MaxSurrogate and MaxDepth (see the ctree_control() sketch after this slide) Target variable: Species Independent variables: all other variables ## build a ctree library(party) # myFormula <- Species ~ . # predict Species with all other # variables myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width iris.ctree <- ctree(myFormula, data = train.data) # check the prediction table(predict(iris.ctree), train.data$Species) ## ## setosa versicolor virginica ## setosa 40 0 0 ## versicolor 0 37 3 ## virginica 0 1 31 28 / 53
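The control parameters mentioned above are left at their defaults in the code. A hedged sketch of setting them explicitly with ctree_control(), continuing from the code above; the values are arbitrary examples, not those used in these slides.
## a sketch: passing control parameters to ctree() (values are arbitrary examples)
library(party)
myControl <- ctree_control(minsplit = 20,    # min observations required to attempt a split
                           minbucket = 7,    # min observations in a terminal node
                           maxsurrogate = 0, # number of surrogate splits to evaluate
                           maxdepth = 3)     # maximum depth of the tree
iris.ctree2 <- ctree(myFormula, data = train.data, controls = myControl)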
Print ctree print(iris.ctree) ## ## Conditional inference tree with 4 terminal nodes ## ## Response: Species ## Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width ## Number of observations: 112 ## ## 1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643 ## 2)* weights = 40 ## 1) Petal.Length > 1.9 ## 3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939 ## 4) Petal.Length <= 4.4; criterion = 0.974, statistic = ... ## 5)* weights = 21 ## 4) Petal.Length > 4.4 ## 6)* weights = 19 ## 3) Petal.Width > 1.7 ## 7)* weights = 32 29 / 53
plot(iris.ctree) [plot: the fitted tree, splitting on Petal.Length (≤ 1.9 and ≤ 4.4) and Petal.Width (≤ 1.7), with bar plots of the class distribution at its four terminal nodes (n = 40, 21, 19 and 32)] 30 / 53
plot(iris.ctree, type = "simple") [plot: the same tree in simple form, showing n and the class proportions y at each terminal node, e.g. n = 19, y = (0, 0.842, 0.158) at node 6] 31 / 53
Test # predict on test data testPred <- predict(iris.ctree, newdata = test.data) table(testPred, test.data$Species) ## ## testPred setosa versicolor virginica ## setosa 10 0 0 ## versicolor 0 12 2 ## virginica 0 0 14 32 / 53
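A common follow-up to the confusion matrix above, not shown in the slides, is the overall accuracy on the test set; a one-line sketch:
## a sketch: overall accuracy on the test data (not in the original slides)
mean(testPred == test.data$Species)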
Contents Introduction Linear Regression Generalized Linear Regression Decision Trees with Package party Decision Trees with Package rpart Random Forest Online Resources 33 / 53
The bodyfat Dataset ## build a decision tree with rpart data("bodyfat", package = "TH.data") dim(bodyfat) ## [1] 71 10 # str(bodyfat) head(bodyfat, 5) ## age DEXfat waistcirc hipcirc elbowbreadth kneebreadth ## 47 57 41.68 100.0 112.0 7.1 9.4 ## 48 65 43.29 99.5 116.5 6.5 8.9 ## 49 59 35.41 96.0 108.5 6.2 8.9 ## 50 58 22.79 72.0 96.5 6.1 9.2 ## 51 60 36.42 89.5 100.5 7.1 10.0 ## anthro3a anthro3b anthro3c anthro4 ## 47 4.42 4.95 4.50 6.13 ## 48 4.63 5.01 4.48 6.37 ## 49 4.12 4.74 4.60 5.82 ## 50 4.03 4.48 3.91 5.66 ## 51 4.24 4.68 4.15 5.91 34 / 53
Train a Decision Tree with Package rpart # split into training and test subsets set.seed(1234) ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3)) bodyfat.train <- bodyfat[ind==1,] bodyfat.test <- bodyfat[ind==2,] # train a decision tree library(rpart) myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth bodyfat.rpart <- rpart(myFormula, data = bodyfat.train, control = rpart.control(minsplit = 10)) # print(bodyfat.rpart$cptable) library(rpart.plot) rpart.plot(bodyfat.rpart) 35 / 53
The rpart Tree ## n= 56 ## ## node), split, n, deviance, yval ## * denotes terminal node ## ## 1) root 56 7265.0290000 30.94589 ## 2) waistcirc< 88.4 31 960.5381000 22.55645 ## 4) hipcirc< 96.25 14 222.2648000 18.41143 ## 8) age< 60.5 9 66.8809600 16.19222 * ## 9) age>=60.5 5 31.2769200 22.40600 * ## 5) hipcirc>=96.25 17 299.6470000 25.97000 ## 10) waistcirc< 77.75 6 30.7345500 22.32500 * ## 11) waistcirc>=77.75 11 145.7148000 27.95818 ## 22) hipcirc< 99.5 3 0.2568667 23.74667 * ## 23) hipcirc>=99.5 8 72.2933500 29.53750 * ## 3) waistcirc>=88.4 25 1417.1140000 41.34880 ## 6) waistcirc< 104.75 18 330.5792000 38.09111 ## 12) hipcirc< 109.9 9 68.9996200 34.37556 * ## 13) hipcirc>=109.9 9 13.0832000 41.80667 * ## 7) waistcirc>=104.75 7 404.3004000 49.72571 * 36 / 53
The rpart Tree [plot: the fitted rpart tree drawn with rpart.plot(), splitting on waistcirc, hipcirc and age, with the mean DEXfat and the percentage of observations shown at each node] 37 / 53
Select the Best Tree # select the tree with the minimum prediction error opt <- which.min(bodyfat.rpart$cptable[, "xerror"]) cp <- bodyfat.rpart$cptable[opt, "CP"] # prune tree bodyfat.prune <- prune(bodyfat.rpart, cp = cp) # plot tree rpart.plot(bodyfat.prune) 38 / 53
Selected Tree [plot: the pruned tree drawn with rpart.plot(), again splitting on waistcirc, hipcirc and age but with one fewer split than the unpruned tree] 39 / 53
Model Evaluation ## make prediction DEXfat_pred <- predict(bodyfat.prune, newdata = bodyfat.test) xlim <- range(bodyfat$DEXfat) plot(DEXfat_pred ~ DEXfat, data = bodyfat.test, xlab = "Observed", ylab = "Prediction", ylim = xlim, xlim = xlim) abline(a = 0, b = 1) [plot: predictions against observed DEXfat values on the test set, with the y = x reference line] 40 / 53
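Beyond the scatter plot, the test-set error can be summarised numerically; a short sketch (the choice of RMSE here is an illustrative addition, not from the slides):
## a sketch: RMSE of the pruned tree on the test set
sqrt(mean((DEXfat_pred - bodyfat.test$DEXfat)^2))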
Contents Introduction Linear Regression Generalized Linear Regression Decision Trees with Package party Decision Trees with Package rpart Random Forest Online Resources 41 / 53
R Packages for Random Forest Package randomForest very fast cannot handle data with missing values a limit of 32 to the maximum number of levels of each categorical attribute extensions: extendedForest, gradientForest Package party: cforest() not limited to the above maximum levels slow needs more memory 42 / 53
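A minimal sketch of training a forest with cforest() from package party; the ntree and mtry values are illustrative assumptions, not taken from these slides.
## a sketch: conditional inference forest with party::cforest() (illustrative settings)
library(party)
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 100, mtry = 2))
table(predict(cf, OOB = TRUE), iris$Species)   # out-of-bag predictions vs actual classes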
Train a Random Forest # split into two subsets: training (70%) and test (30%) ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3)) train.data <- iris[ind==1,] test.data <- iris[ind==2,] # use all other variables to predict Species library(randomForest) rf <- randomForest(Species ~ ., data=train.data, ntree=100, proximity=T) 43 / 53
table(predict(rf), train.data$Species) ## ## setosa versicolor virginica ## setosa 36 0 0 ## versicolor 0 32 2 ## virginica 0 0 34 print(rf) ## ## Call: ## randomForest(formula = Species ~ ., data = train.data, ntr... ## Type of random forest: classification ## Number of trees: 100 ## No. of variables tried at each split: 2 ## ## OOB estimate of error rate: 1.92% ## Confusion matrix: ## setosa versicolor virginica class.error ## setosa 36 0 0 0.00000000 ## versicolor 0 32 0 0.00000000 ## virginica 0 2 34 0.05555556 attributes(rf) 44 / 53
Error Rate of Random Forest plot(rf, main = "") [plot: error rates against the number of trees (1 to 100)] 45 / 53
Variable Importance importance(rf) ## MeanDecreaseGini ## Sepal.Length 6.834364 ## Sepal.Width 1.383795 ## Petal.Length 28.207859 ## Petal.Width 32.043213 46 / 53
Variable Importance varImpPlot(rf) [plot: variables ranked by MeanDecreaseGini; Petal.Width and Petal.Length are the most important, Sepal.Width the least] 47 / 53
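Only MeanDecreaseGini is available above because the forest was trained without importance = TRUE. A hedged sketch of also obtaining permutation-based importance, as an extension of the slides rather than part of them:
## a sketch: permutation-based importance requires importance = TRUE when training
library(randomForest)
rf2 <- randomForest(Species ~ ., data = train.data, ntree = 100, importance = TRUE)
importance(rf2, type = 1)   # type 1: mean decrease in accuracy (permutation-based)
importance(rf2, type = 2)   # type 2: mean decrease in Gini (node impurity)
varImpPlot(rf2)             # plots both measures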
Margin of Predictions The margin of a data point is defined as the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. A positive margin means correct classification. irisPred <- predict(rf, newdata = test.data) table(irisPred, test.data$Species) ## ## irisPred setosa versicolor virginica ## setosa 14 0 0 ## versicolor 0 17 3 ## virginica 0 1 11 plot(margin(rf, test.data$Species)) 48 / 53
Margin of Predictions [plot: margins of the predictions plotted against their index, ranging from about −0.2 to 1] 49 / 53
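The definition above can also be checked by hand from per-class vote proportions on the test data; a sketch under the assumption that predict() with type = "prob" is used to obtain them (an illustration, not the slides' own code):
## a sketch: computing margins by hand from vote proportions on the test data
votes <- predict(rf, newdata = test.data, type = "prob")   # per-class vote proportions
trueClass <- as.character(test.data$Species)
myMargin <- sapply(seq_len(nrow(votes)), function(i) {
  p.true  <- votes[i, trueClass[i]]                          # proportion for the correct class
  p.other <- max(votes[i, colnames(votes) != trueClass[i]])  # best competing class
  p.true - p.other
})
summary(myMargin)   # positive values correspond to correct classifications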
Contents Introduction Linear Regression Generalized Linear Regression Decision Trees with Package party Decision Trees with Package rpart Random Forest Online Resources 50 / 53
Online Resources Book titled R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/RDataMining-book.pdf R Reference Card for Data Mining http://www.rdatamining.com/docs/RDataMining-reference-card.pdf Free online courses and documents http://www.rdatamining.com/resources/ RDataMining Group on LinkedIn (27,000+ members) http://group.rdatamining.com Twitter (3,300+ followers) @RDataMining 51 / 53
The End Thanks! Email: yanchang(at)RDataMining.com Twitter: @RDataMining 52 / 53
How to Cite This Work Citation Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf. BibTex @BOOK{Zhao2012R, title = {R and Data Mining: Examples and Case Studies}, publisher = {Academic Press, Elsevier}, year = {2012}, author = {Yanchang Zhao}, pages = {256}, month = {December}, isbn = {978-0-123-96963-7}, keywords = {R, data mining}, url = {http://www.rdatamining.com/docs/RDataMining-book.pdf} } 53 / 53
