Decision Trees and Modeling

Data Mining Methods According to the Nature of the Data

Continuous predictors
  - Continuous response: linear regression, neural nets, k-nearest neighbors
  - Categorical response: logistic regression, neural nets, discriminant analysis, k-nearest neighbors
  - No response: principal components, cluster analysis

Categorical predictors
  - Continuous response: linear regression, neural nets, regression trees
  - Categorical response: neural nets, classification trees, logistic regression, naïve Bayes
  - No response: association rules
Types of Data Mining Methods

Supervised
  - Algorithms used in classification and prediction
  - Classification: models discrete-valued functions
  - Prediction: models continuous-valued functions
  - Require data in which the value of the outcome of interest (e.g., purchase or no purchase) is known
  - Use training data, validation data, and test data

Unsupervised
  - Algorithms where there is no outcome variable to predict or classify
  - Include association rules, data reduction methods, and clustering techniques
Supervised Learning Process: Two Steps

1. Learning (training): learn a model from the training data (induction).
2. Testing: apply the model to unseen test data to assess model accuracy (deduction). A minimal code sketch of these two steps follows the source note below.

[Figure 3.1: the learning algorithm induces a model from the training data; the model is then applied to test data.]
Source: Bing Liu, Web Data Mining, 2nd ed., Fig. 3.1., p. 66, Springer, 2011.
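A minimal sketch of the two steps in Python with scikit-learn (an assumption made purely for illustration; the hands-on exercises in this deck use IBM SPSS Modeler, not Python):

# Minimal sketch of the learn-then-test process using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # any labeled data set works here

# Step 1 (induction / training): learn a model from the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (deduction / testing): apply the model to unseen test data
# to estimate its accuracy.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))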
Supervised Learning: Classification

The task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

[Diagram: attribute set (x) -> classification model -> class label (y)]

The target function is also known informally as a classification model, which is useful for predicting or describing data with binary or nominal categories.
Classification: Explanatory vs. Predictive Modeling

Descriptive models
  - Serve as an explanatory tool to distinguish between objects of different classes
  - Fit the data closely
  - Use the entire dataset for estimating the best-fit model to maximize information about the hypothesized relationship in the population
  - Performance measures assess how well the model approximates the data

Predictive models
  - Predict the class label of new cases accurately
  - Data is split into a training set and a validation (holdout) set
  - Training set: estimate the model
  - Holdout set: test the model's performance on new, unobserved data
  - Performance measures assess predictive accuracy
Introducing Decision Trees

Decision trees are the workhorses of many practical applications of machine learning because they are fast and produce intelligible output that is often surprisingly accurate.

Aim: partition the data into smaller, more homogeneous groups (i.e., the nodes resulting from a split are purer than their parent).

  - Fast training performance
  - High degree of accuracy
  - Relatively simple computing methodology
  - Easy to build
  - Easy to interpret
  - Do not require any assumptions about the distribution of the measurements in each group, which can be categorical, discrete numeric, or continuous
A Simple Decision Tree for Buying a Car

  - Each node specifies a test (question) of an attribute.
  - Each branch from a node is a possible value of this attribute.
Source: IBM SPSS Modeler 14.2 Modeling Nodes, 2011, Figure 6-2, p. 115.
Decision Tree Goals

1. Accurate classification: minimize error
2. Efficient decision making: fewest number of decisions/tests
How Decision Trees Work

Generate a tree structure of training data subsets by splitting a dataset on selected input attributes in a top-down recursive manner (a toy sketch of this procedure follows the list):

1. Begin with a single node (the root) composed of all observations in the training set.
2. Choose the best input attribute, i.e., the one whose values within the node can be used to split the node into subsets that are more homogeneous with respect to the classification attribute than the unsplit node. (Attributes are selected on the basis of an impurity function.)
3. For each subset of the split, determine whether that subset can or should be split further. If so, return to step 2 for that subset, and continue until the stopping criteria are reached.
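A toy sketch of this top-down recursive procedure, assuming categorical attributes only and using Gini impurity as the impurity function; real algorithms (C4.5, CART, CHAID) add numeric splits, pruning, and more refined stopping rules:

# Toy top-down recursive partitioner mirroring the three steps above.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split gives the lowest weighted impurity."""
    def weighted_impurity(attr):
        groups = {}
        for row, lab in zip(rows, labels):
            groups.setdefault(row[attr], []).append(lab)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_impurity)

def build_tree(rows, labels, attributes):
    # Stop: node is pure or no attributes remain -> leaf with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    node = {"split_on": attr, "children": {}}
    remaining = [a for a in attributes if a != attr]
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(row[attr], ([], []))
        groups[row[attr]][0].append(row)
        groups[row[attr]][1].append(lab)
    for value, (sub_rows, sub_labels) in groups.items():
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

# Tiny made-up demo: three records, two categorical attributes.
rows = [{"colour": "red", "size": "big"},
        {"colour": "red", "size": "small"},
        {"colour": "blue", "size": "big"}]
print(build_tree(rows, ["yes", "yes", "no"], ["colour", "size"]))
# {'split_on': 'colour', 'children': {'red': 'yes', 'blue': 'no'}}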
How a Decision Tree Works: Three Types of Nodes

Root node
  - Has no incoming edges and zero or more outgoing edges

Internal nodes
  - Each has exactly one incoming edge and two or more outgoing edges

Leaf (terminal) nodes
  - Each has exactly one incoming edge and no outgoing edges
  - Each has a class label
The Play Golf Training Data
Day Outlook Temperature Humidity Windy Play Golf
D1 sunny 85 85 false no
D2 sunny 80 90 true no
D3 overcast 83 86 false yes
D4 rainy 70 96 false yes
D5 rainy 68 80 false yes
D6 rainy 65 70 true no
D7 overcast 64 65 true yes
D8 sunny 72 95 false no
D9 sunny 69 70 false yes
D10 rainy 75 80 false yes
D11 sunny 75 70 true yes
D12 overcast 72 90 true yes
D13 overcast 81 75 false yes
D14 rainy 71 91 true no
Decision Tree Example: Play Golf

[Tree diagram annotated with the root node, internal nodes, and leaf nodes]

  - Each node specifies a test of an attribute.
  - Each branch from a node is a possible value of this attribute.

Decision trees seek to create a set of leaf nodes that are as "pure" as possible!
Source: http://gautam.lis.illinois.edu/monkmiddleware/public/analytics/decisiontree.html
What do we mean by "pure"?

[Figure: three example groups, ranging from a very impure group to a less impure group to a very pure group]
Three requirements that must be met before DT algorithms may be applied:

1. DTs represent supervised learning and thus require pre-classified target variables. A training data set must be supplied which provides the algorithm with the values of the target variable.
Three requirements that must be met before DT algorithms may be applied:

2. The training set should be rich and varied, providing the algorithm with a healthy cross section of the types of records for which classification may be needed in the future.
  - Training set too large: the created tree may overfit.
  - DTs learn by example: if examples are systematically lacking for a definable subset of records, classification and prediction for this subset will be problematic or impossible.
  - Training data set too small: the tree might not be specific enough to work with more general data (underfit - both training and test errors are large).
Three requirements that must be met before DT algorithms may be applied:

3. The target attribute classes must be discrete; DT algorithms cannot be applied to a continuous target variable. The target variable must take on values that are clearly demarcated as either belonging to a particular class or not belonging.
  - Play golf (yes, no)
  - Credit risk (high, low)
  - Loan applications (accept, reject)
  - Credit card transaction classification (fraudulent, legitimate)
  - Medical diagnosis (cancer, no cancer)
  - Spam filters (spam, no spam)
  - Security screening (threat, no threat)
Decision Tree Algorithms

DT algorithms mainly differ on:
  - Splitting criteria
    - Which variable to split first?
    - What values to use to split?
    - How many splits to form for each node?
  - Stopping criteria
    - When to stop building the tree
  - Pruning (generalization method)
    - Pre-pruning versus post-pruning

Most popular DT algorithms:
  - ID3, C4.5, C5
  - CART (Classification And Regression Tree - binary decision tree)
  - CHAID, M5
SPSS Modeler Decision Tree Nodes

C&RT
  - Target field measurement type: continuous, categorical, flag, nominal, or ordinal
  - Number of splits: binary
  - Method used for splitting: dispersion measure (Gini) or least squared deviation (continuous target) to minimize impurity at each step

CHAID
  - Target field measurement type: continuous, categorical, flag, nominal, or ordinal
  - Number of splits: multiple
  - Method used for splitting: chi-square statistics to identify optimal splits

QUEST
  - Target field measurement type: categorical, flag, or nominal
  - Number of splits: binary
  - Method used for splitting: chi-square test (categorical predictors), ANOVA (continuous inputs)

C5.0
  - Target field measurement type: flag, nominal, or ordinal
  - Number of splits: multiple
  - Method used for splitting: information theory measure - information gain ratio
Alternative Measures for Selecting the Best Split

Gini index
  - Determines the purity of a specific class as a result of a decision to branch along a particular attribute (used in CART)
  - Chooses the attribute that maximizes the reduction in impurity as the splitting attribute at a particular node
  - See the video: https://www.youtube.com/watch?v=Zze7SKuz9QQ

Information gain
  - Most common splitting criterion, based on entropy, a measure of the extent of uncertainty or randomness of a particular attribute/value split (used in ID3, C4.5, C5)
  - Measures the change in entropy due to new information being added: how much would be gained by branching on an attribute
  - Chooses the attribute with the highest information gain as the splitting attribute at a particular node

Chi-square statistics (used in CHAID)

(A small sketch computing these measures on the Play Golf data follows.)
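For concreteness, a small Python sketch computing entropy, the Gini index, and the information gain of splitting the Play Golf training data (shown earlier) on the Outlook attribute; the class distributions below are read directly from that table:

# Entropy, Gini, and the information gain of splitting the Play Golf data on Outlook.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = ["yes"] * 9 + ["no"] * 5                  # 9 yes, 5 no overall
split = {                                          # child nodes after the Outlook split
    "sunny":    ["no", "no", "no", "yes", "yes"],
    "overcast": ["yes", "yes", "yes", "yes"],
    "rainy":    ["yes", "yes", "yes", "no", "no"],
}

weighted_child_entropy = sum(
    len(part) / len(parent) * entropy(part) for part in split.values()
)
info_gain = entropy(parent) - weighted_child_entropy
print(f"parent entropy = {entropy(parent):.3f}, parent Gini = {gini(parent):.3f}")
print(f"information gain of Outlook = {info_gain:.3f}")   # about 0.247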
A Loan Application Training Data Set

[Table 3.1 from the source: 15 loan applications with the attributes Age, Has_job, Own_house, Credit_rating, and the class attribute Approved]

Class attribute: indicates whether the loan application was approved (Yes) in the past or not (No).

Source: Bing Liu, Web Data Mining, 2nd ed., Table 3.1., p. 64, Springer, 2011.
Chapter 3 can be downloaded from: http://www.springer.com/cda/content/document/cda_downloaddocument/9783642194597-c3.pdf?SGWID=0-0-45-1229046-p174122057
Loan Application Training Data Set

  - Which attributes should be used for splitting?
  - In which order?
  - What splits are needed; what is the test condition?

[Figure: the training data table, with the class attribute "Application Approved" highlighted]
Selecting the Best Split

The objective is to reduce the impurity or uncertainty in the data as much as possible.
  - A subset of data is pure if all instances belong to the same class.

An attribute is good when:
  - The resulting node contains records of mostly one class, i.e., it is "pure" or homogeneous.

An attribute is poor when:
  - It provides no discrimination - the attribute is immaterial to the decision.
  - For each value we have the same number of positive and negative instances.
Choosing Splitting Attributes: Which one discriminates best vis-à-vis the class attribute "Approved"?

[Figure 3.5: class distributions (percentages of No and Yes) in the child nodes of a multi-way split versus a two-way split, ranging from 0% No / 100% Yes to 66.67% No / 33.33% Yes]
Source: Bing Liu, Web Data Mining, 2nd ed., Fig. 3.5., p. 71, Springer, 2011.
One Possible Tree

The tree generalizes the data: it captures the key regularities in the data.

[Figure 3.2: a decision tree for the loan data; dependent variable Approved (Yes: 9, No: 6)]
Source: Bing Liu, Web Data Mining, 2nd ed., Fig. 3.2., p. 67, Springer, 2011.
Using the Tree to Predict the Class of a New Loan Applicant

New applicant: Age = young, Has_job = false, Own_house = false, Credit_rating = good, Class = ?

Decision rule:
  IF Age = young AND Has_job = false THEN Class = No
  Support = 3/15 = 20%
  Confidence = 3/3 = 100%

Support: the proportion of all records in the data set that rest in that particular terminal leaf node.
Confidence of the rule: the proportion of records in the leaf node for which the decision rule is true (here the leaf nodes are pure).
Is the first tree unique?

No, there are many possible trees that can be learned from the data. Here is a simpler, smaller tree.

[Figure 3.3: a smaller decision tree for the loan data; dependent variable Approved (Yes: 9, No: 6)]

In practice, one wants a small and accurate tree:
  - It is more general.
  - It is easier for human users to understand.
Source: Bing Liu, Web Data Mining, 2nd ed., Fig. 3.3., p. 68, Springer, 2011.
Using the Smaller Tree to Predict the Class of a New Loan Applicant

New applicant: Age = young, Has_job = false, Own_house = false, Credit_rating = good, Class = ?

It is possible to read a set of rules directly off a decision tree. Preconditions are logically ANDed together, and all the tests must succeed if the rule is to fire.

  IF Own_house = false AND Has_job = false THEN Class = No
  Support = 6/15 = 40%
  Confidence = 6/6 = 100%
C5.0 Tree Built with SPSS Modeler 18

One of the most attractive aspects of decision trees lies in their interpretability, especially with respect to the construction of decision rules.
From Trees to Rules

Each path from the root to a leaf is a rule. The tree paths or rules are mutually exclusive and exhaustive: every data instance is covered by a single rule (a tree path).

Rule 1: IF Own_house = true THEN Yes (loan approved) - Support 6/15, Confidence 6/6 = 100%
Rule 2: IF Own_house = false AND Has_job = true THEN Yes (loan approved) - Support 3/15, Confidence 3/3 = 100%
Rule 3: IF Own_house = false AND Has_job = false THEN No (loan not approved) - Support 6/15, Confidence 6/6 = 100%
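A small sketch of reading such rules off a tree programmatically; the nested dictionary is a hand-coded stand-in for the smaller loan-approval tree above, not output from any particular tool:

# Reading rules off a tree: each root-to-leaf path becomes one IF ... THEN rule.
tree = {
    "split_on": "Own_house",
    "children": {
        "true": "Yes",
        "false": {
            "split_on": "Has_job",
            "children": {"true": "Yes", "false": "No"},
        },
    },
}

def tree_to_rules(node, conditions=()):
    if not isinstance(node, dict):                      # leaf: emit one rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {antecedent} THEN Class = {node}"]
    rules = []
    for value, child in node["children"].items():
        rules += tree_to_rules(child, conditions + ((node["split_on"], value),))
    return rules

for rule in tree_to_rules(tree):
    print(rule)
# IF Own_house = true THEN Class = Yes
# IF Own_house = false AND Has_job = true THEN Class = Yes
# IF Own_house = false AND Has_job = false THEN Class = No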
Classifier Evaluation

Predictive accuracy
  - A simple measure of classification error
Speed/efficiency
  - Time to construct the model
  - Time to use the model
Robustness
  - Ability to make reasonably accurate predictions given noisy data or data with missing and erroneous values
Scalability
  - Ability to construct a prediction model given a rather large amount of data
Interpretability
  - Level of understanding and insight provided by the model
Compactness of the model
  - Size of the tree, or the number of rules
Judging Classification Performance: Evaluation Methods

The available data set is divided into two disjoint subsets:
  - the training set (for learning a model)
  - the test set (for testing the model)

The training set should not be used in testing, and the test set should not be used in learning. An unseen test set provides an unbiased estimate of accuracy. The test set is also called the holdout set. This approach is mainly used when the data set D is large.
Judging Classification Performance

Errors committed by a classification model:
  - Training errors: the number of misclassification errors committed on training records
  - Generalization errors: the expected error of the model on previously unseen records (testing records)

The goal is a good classification model with:
  - Low training error: fits the training data well, AND
  - Low generalization error: accurately classifies records it has never seen before
Classifier Evaluation: Confusion Matrix

A confusion matrix shows the predicted and actual classifications. It is of size L x L, where L is the number of different label values. The accuracy measures are defined for a 2 x 2 confusion matrix (L = 2); a sketch of the standard definitions follows the source note below.
Source: Kohavi & Provost, Glossary of Terms, Machine Learning, 30, 271-274 (1998),
http://robotics.stanford.edu/~ronnyk/glossary.html
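The 2 x 2 matrix itself is not reproduced here; the sketch below lists the usual textbook definitions (accuracy, error rate, true/false positive rates, precision), using Model 1 from the later "Which Model Performs Better?" slide as an example:

# Standard measures computed from the four cells of a 2 x 2 confusion matrix.
def confusion_measures(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return {
        "accuracy":            (tp + tn) / total,
        "error rate":          (fp + fn) / total,
        "true positive rate":  tp / (tp + fn),   # recall / sensitivity
        "false positive rate": fp / (fp + tn),
        "true negative rate":  tn / (fp + tn),   # specificity
        "precision":           tp / (tp + fp),
    }

# Example: Model 1 from the "Which Model Performs Better?" slide (accuracy 0.80).
print(confusion_measures(tp=150, fn=40, fp=60, tn=250))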
Evaluating Performance When One Class Is More Important

Data sets with imbalanced class distributions are quite common in many real applications. In many such cases it is more important to identify members of one class:
  - Tax fraud
  - Credit card fraud
  - Credit default
  - Response to a promotional offer
  - Detecting electronic network intrusion
  - Predicting delayed flights

In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention.
Judging Classification Performance: Different Kinds of Wrong Predictions

False positive
  - A record that is classified as positive but is actually negative
False negative
  - A record that is classified as negative but is actually positive

Example: an insurance company wants to predict whether an uninvestigated claim is fraudulent.
  - False positive: a claim predicted as fraudulent that is actually legitimate. Cost: additional money to investigate a claim that should be paid.
  - False negative: a claim predicted as legitimate that is actually fraudulent. Cost: payoff of a claim that should not be paid.
Judging Classification Performance: Which Model Performs Better?

Good results correspond to large numbers down the main diagonal (from top left to bottom right) and small, ideally zero, off-diagonal elements.

Model 1 (N = 500)
  - Actual Yes: 150 predicted Yes, 40 predicted No
  - Actual No: 60 predicted Yes, 250 predicted No

Model 2 (N = 500)
  - Actual Yes: 250 predicted Yes, 45 predicted No
  - Actual No: 5 predicted Yes, 200 predicted No
Judging Classification Performance: A Numeric Example with 3 Classes

How accurate is the model that produced this classification matrix (N = 200)?

            Predicted a   Predicted b   Predicted c   Total
Actual a         88            10             2         100
Actual b         14            40             6          60
Actual c         18            10            12          40
Total           120            60            20         200
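Answer: accuracy is the sum of the main diagonal divided by the total number of records:

  accuracy = (88 + 40 + 12) / 200 = 140 / 200 = 0.70

i.e., 70% of the 200 records are classified correctly; the error rate is 60/200 = 0.30.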
The Cost of Classification: A Targeted Marketing Example

False positive: a consumer classified as a likely responder who does not respond. Cost: preparing and mailing marketing materials.
False negative: a consumer predicted not to be a likely responder who would have bought if offered. No money was spent, nothing was gained.
True positive: a consumer is offered the product and buys it. Profit based on revenue minus costs.
True negative: a consumer who was not offered a deal and would not have bought it even if it had been offered. No cost incurred, no profit.
If costs are known, they can be incorporated into a financial analysis of the decision-making process.

Cost matrix (actual vs. predicted class):
  - Actual Yes: predicted Yes = -1, predicted No = 100
  - Actual No: predicted Yes = 1, predicted No = 0

Model 1 (N = 500)
  - Actual Yes: 150 predicted Yes, 40 predicted No
  - Actual No: 60 predicted Yes, 250 predicted No
  - Overall accuracy: 400/500 = .80
  - FP rate: 60/310 = .1935; FN rate: 40/190 = .2105

Model 2 (N = 500)
  - Actual Yes: 250 predicted Yes, 45 predicted No
  - Actual No: 5 predicted Yes, 200 predicted No
  - Overall accuracy: 450/500 = .90
  - FP rate: 5/205 = .0243; FN rate: 45/295 = .1525

Cost M1 = 150 * (-1) + 40 * 100 + 60 * 1 + 250 * 0 = 3910
Cost M2 = 250 * (-1) + 45 * 100 + 5 * 1 + 200 * 0 = 4255

  - A standard accuracy measure would have preferred model M2 over M1.
  - Incorporating costs, model M2 is inferior, since the improvement comes at the expense of increasing the more costly false negative errors.
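The same cost calculation as a small Python sketch (cost weights and cell counts copied from the matrices above):

# Weighing each confusion-matrix cell by the cost matrix; a lower total cost is better.
cost = {("yes", "yes"): -1, ("yes", "no"): 100, ("no", "yes"): 1, ("no", "no"): 0}

def total_cost(confusion):
    """confusion maps (actual, predicted) -> count."""
    return sum(count * cost[cell] for cell, count in confusion.items())

m1 = {("yes", "yes"): 150, ("yes", "no"): 40, ("no", "yes"): 60, ("no", "no"): 250}
m2 = {("yes", "yes"): 250, ("yes", "no"): 45, ("no", "yes"): 5, ("no", "no"): 200}
print("cost of Model 1:", total_cost(m1))   # 3910
print("cost of Model 2:", total_cost(m2))   # 4255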
Benchmarking the Model's Performance to Morrison's Proportional Chance Criterion

Morrison's likelihood analysis (1969) provides a criterion that may be used to compare the proportion of correctly classified observations with the proportion expected by chance.

The overall percentage of objects likely to be classified correctly by chance alone:
  Cpro = alpha^2 + (1 - alpha)^2
  where alpha is the proportion of cases in class = Yes and (1 - alpha) is the proportion of cases in class = No.

The predictive accuracy of the model should be higher than the proportional chance criterion!

Example: Class = Yes in 73% of the cases, Class = No in 27% of the cases:
  Cpro = 0.73^2 + 0.27^2 = 0.5329 + 0.0729 = 0.61 (approximately)
The accuracy expected by chance, about 61%, serves as the cutoff that the model's estimated accuracy must exceed.
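As a tiny sketch, the same criterion as a Python function (alpha is the positive-class proportion):

# Morrison's proportional chance criterion for a two-class problem.
def c_pro(alpha):
    """alpha = proportion of cases in the positive class."""
    return alpha ** 2 + (1 - alpha) ** 2

print(round(c_pro(0.73), 3))   # about 0.606 -> the model's accuracy should beat this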
Cutoff for Classification

Most DM algorithms classify via a two-step process. For each record:
  1. Compute the probability of belonging to class "1".
  2. Compare it to the cutoff value and classify accordingly (see the sketch below).

The default cutoff value is 0.50:
  - If the probability >= 0.50, classify as "1".
  - If the probability < 0.50, classify as "0".

Different cutoff values can be used; typically, the error rate is lowest for cutoff = 0.50.
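A short sketch of applying different cutoff values; the probability scores here are made up purely for illustration:

# Turning estimated class-1 probabilities into class labels at different cutoffs.
probabilities = [0.90, 0.80, 0.75, 0.60, 0.55, 0.45, 0.30, 0.20]   # illustrative scores

def classify(probs, cutoff=0.50):
    return [1 if p >= cutoff else 0 for p in probs]

for cutoff in (0.75, 0.50, 0.25):
    labels = classify(probabilities, cutoff)
    print(f"cutoff {cutoff}: {sum(labels)} records classified as 1")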
Cutoff Table Example

  - Cutoff value = 0.75: 8 records are classified as "1"
  - Cutoff value = 0.50: 13 records are classified as "1"
  - Cutoff value = 0.25: 15 records are classified as "1"
Source: Figure 5.7 in Shmueli et al. (2010), Data Mining for Business Intelligence, p. 104.
Classification Confusion Matrices for Various Cutoff Values

Cutoff value = 0.75
  - Actual 0: 11 predicted 0, 1 predicted 1
  - Actual 1: 5 predicted 0, 7 predicted 1
  - True positive rate: 7/12 = .5833; false positive rate: 1/12 = .0833

Cutoff value = 0.50
  - Actual 0: 10 predicted 0, 2 predicted 1
  - Actual 1: 1 predicted 0, 11 predicted 1
  - True positive rate: 11/12 = .9166; false positive rate: 2/12 = .1667

Cutoff value = 0.25
  - Actual 0: 8 predicted 0, 4 predicted 1
  - Actual 1: 1 predicted 0, 11 predicted 1
  - True positive rate: 11/12 = .9166; false positive rate: 4/12 = .3333
The Class Imbalance Problem

An imbalance occurs when one or more classes have very low proportions in the training data as compared to the other classes.

Example: 1% of credit card transactions are fraudulent; Fraudulent = yes is a "rare" event/class.

Detecting instances of the rare class is akin to finding a needle in a haystack. The likely prediction result is that every record is predicted as non-fraudulent (predicted target = 0), making the result insufficiently informative, i.e., the model does not detect any class 1 (fraudulent) cases.
Accuracy Limitations with Imbalanced Data Sets

The accuracy measure may be misleading and not well suited for evaluating models derived from imbalanced data sets. In the credit card example above, a model that labels every transaction as legitimate is 99% accurate yet detects no fraud at all.
Remedies for Severe Class Imbalance: Sampling Methods

Up-sampling (oversampling)
  - Simulates/imputes additional data points to improve balance across the classes
  - Replicates the positive cases (the "rare" event cases)

Down-sampling (undersampling)
  - Reduces the number of samples to improve balance across the classes
  - Eliminates negative examples, i.e., majority event cases
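A possible sketch of simple random over- and under-sampling using scikit-learn's resample utility (one of several ways to do this outside SPSS Modeler, whose Balance node is shown next):

# Simple random over-/under-sampling to balance a rare positive class.
from sklearn.utils import resample

def rebalance(majority_rows, minority_rows, strategy="up"):
    if strategy == "up":     # replicate minority cases until the classes are even
        minority_rows = resample(minority_rows, replace=True,
                                 n_samples=len(majority_rows), random_state=1)
    else:                    # drop majority cases until the classes are even
        majority_rows = resample(majority_rows, replace=False,
                                 n_samples=len(minority_rows), random_state=1)
    return majority_rows + minority_rows

majority = [{"fraud": 0}] * 99
minority = [{"fraud": 1}] * 1
print(len(rebalance(majority, minority, "up")))     # 198 rows, 50/50 split
print(len(rebalance(majority, minority, "down")))   # 2 rows, 50/50 split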
Remedying the Imbalance Problem with SPSS Modeler's Balance Node (Record Ops)

  - Performs random undersampling by assigning a factor smaller than 1 to the majority class.
  - Carries out simple random oversampling by assigning a factor larger than 1 to the minority class.
Avoid Overfitting in Classification

Model overfitting
  - The model fits the training data too well: good accuracy on training data but poor accuracy on test data
  - Symptoms: the tree is too large - too deep and with too many branches; some branches may reflect anomalies in the training data due to noise or outliers

Trees are pruned to avoid overfitting
  - Pruning removes the weakest, least reliable branches so a tree captures the patterns but not the noise in the training data
  - The resulting tree is smaller and less complex, and thus easier to understand
  - It is faster and better at correctly classifying independent test data (i.e., previously unseen tuples) than an unpruned tree
How Does Tree Pruning Work?

Prepruning approach
  - Tree construction is halted early, e.g., no further splits or partitions at a given node
  - Based on measures of statistical significance, information gain, Gini index, etc.
  - Upon halting, the node becomes a leaf

Postpruning approach
  - Successively selects a decision node and redesignates it as a leaf node
  - Removes subtrees from a fully grown tree, i.e., lops off the branches extending beyond a particular decision node
Pruning a Decision Tree: When to Stop Splitting Data

  - All examples for a given node belong to the same class
  - There are no examples left
  - There are no remaining attributes for further partitioning - the majority class becomes the leaf
  - No attribute satisfies a minimum information gain threshold
  - A maximal depth is reached
  - There are fewer than a certain number of examples in the current subtree - again a mechanism to prevent overfitting

(A sketch of how such stopping and pruning criteria appear as parameters follows.)
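A sketch of how such criteria appear as pre-pruning stopping parameters and a post-pruning parameter in scikit-learn's DecisionTreeClassifier (parameter names are scikit-learn's, used here for illustration, not SPSS Modeler's):

# Pre-pruning via stopping criteria and post-pruning via cost-complexity pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pre_pruned = DecisionTreeClassifier(
    max_depth=4,                  # stop when a maximal depth is reached
    min_samples_leaf=10,          # stop when a node would have too few examples
    min_impurity_decrease=0.01,   # require a minimum impurity reduction per split
).fit(X, y)

post_pruned = DecisionTreeClassifier(
    ccp_alpha=0.01,               # grow fully, then lop off weak subtrees
).fit(X, y)

print("pre-pruned depth:", pre_pruned.get_depth())
print("post-pruned depth:", post_pruned.get_depth())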
IBM SPSS Modeler 18: Building Decision Trees - Hands-On Exercise

  - Drug Reaction (C&RT, C5.0)
  - Survival of Passengers on the Titanic (CHAID)

CHAID Tree without Partition
  - Stream 1: CHAID Tree - Table node
  - $R-Survived: the predicted score
  - $RC-Survived: the confidence value (between 0.0 and 1.0), the model's own estimate of how accurate each predicted value is

CHAID Tree - Partition Node
CHAID Tree - Analysis Node