Decision Trees
Decision tree induction is the
learning of decision trees from
class-labelled training examples.
A decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute,
each branch represents an outcome
of the test, and each leaf node holds
a class label.
Decision Tree Induction
ID3 (Iterative Dichotomiser 3), by J. Ross Quinlan
C4.5 is a successor of ID3
Classification and Regression Trees (CART) – generates binary trees
ID3, C4.5 and CART adopt a greedy (non-backtracking) approach in
which decision trees are constructed in a
top-down recursive divide-and-conquer
manner.
The algorithm starts with a training set of
tuples and their associated class labels.
The training set is recursively partitioned
into smaller subsets as the tree is being
built.
The Algorithm
Generate_decision_tree(D, attribute_list):
Create a node N
If all tuples in D are of the same class C, then return N as a leaf node labelled with class C
If attribute_list is empty, then return N as a leaf node labelled with the majority class in D (majority voting)
Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion
Label node N with splitting_criterion
If splitting_attribute is discrete-valued and multiway splits are allowed, then
    attribute_list ← attribute_list − splitting_attribute
For each outcome j of splitting_criterion:
    Let Dj be the set of data tuples in D satisfying outcome j
    If Dj is empty, then attach a leaf labelled with the majority class in D to node N
    Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N
Endfor
Return N
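The pseudocode can be turned into a short Python sketch. This is a minimal illustration (not the exact ID3/C4.5/CART implementation), assuming D is a list of dicts with a "class" key and that an attribute-selection function (for example, one based on information gain, introduced later) is passed in as select_attribute:

```python
from collections import Counter

def majority_class(D):
    """Most common class label in partition D (used for majority voting)."""
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, select_attribute):
    """Top-down, recursive, divide-and-conquer construction with multiway discrete splits."""
    classes = {t["class"] for t in D}
    if len(classes) == 1:                      # all tuples in the same class -> leaf
        return {"leaf": classes.pop()}
    if not attribute_list:                     # no attributes left -> majority voting
        return {"leaf": majority_class(D)}
    A = select_attribute(D, attribute_list)    # "best" splitting attribute
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]   # discrete multiway split: drop A
    for value in {t[A] for t in D}:            # one branch per known value of A
        Dj = [t for t in D if t[A] == value]
        if not Dj:   # cannot occur with this enumeration; kept to mirror the pseudocode
            node["branches"][value] = {"leaf": majority_class(D)}
        else:
            node["branches"][value] = generate_decision_tree(Dj, remaining, select_attribute)
    return node
```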
To partition the tuples in D on the splitting attribute, there are three possible
scenarios. Let A be the splitting attribute. A has v
distinct values, {a1, a2, …, av}, based on the training
data.
A is discrete-valued: a branch is created for each known value aj of A, and Dj contains the tuples in D with A = aj.
A is continuous-valued: two branches are created, for A ≤ split_point and A > split_point.
A is discrete-valued and a binary tree must be produced: the test has the form "A ∈ SA?", where SA is a subset of the known values of A.
The recursive partitioning stops only when one of
the following terminating conditions is true:
All the tuples in partition D belong to the same class
There are no remaining attributes on which the tuples
may be further partitioned. In this case majority
voting is employed.
There are no tuples for a given branch, that is, a
partition Dj is empty. In this case, a leaf is created
with the majority class in D.
What is the computational complexity?
Attribute Selection Procedure
It is a heuristic for selecting the splitting criterion
that “best” separates a given data partition, D, of
class-labelled training tuples into individual
classes.
It is also known as a splitting rule because it
determines how the tuples at a given node are to be
split.
It provides a ranking for each attribute describing the
given training tuples. The attribute having the best
score for the measure is chosen as the splitting
attribute for the given tuples.
Well-known attribute selection measures are:
information gain, gain ratio, and Gini index.
Information Gain (ID3/C4.5)
Select the attribute with the highest information
gain.
Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a
tuple in D:
Info(D) = − Σ_{i=1}^{m} pi × log2(pi)
Information needed (after using A to split D into v
partitions) to classify D:
Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A
Gain(A) = Info(D) − Info_A(D)
Exercise
Calculate the information(entropy)
Given:
Set S contains 14 examples
9 Positive values
5 Negative values
Exercise
Info(S) = - (9/14) log2 (9/14) - (5/14) log2 (5/14)
= 0.940
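A quick way to check this arithmetic is a small entropy helper (a sketch; the function name is our own):

```python
import math

def entropy(counts):
    """Info(S) = -sum pi * log2(pi), computed from per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.94, matching Info(S) above
```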
Information Gain
Information gain is based on the decrease in
entropy after a dataset is split on an attribute.
We look for the attribute that creates the most
homogeneous branches.
Information Gain Example
14 examples, 9 positive 5 negative
The attribute is Wind.
Values of wind are Weak and Strong
Exercise (cont.)
8 occurrences of weak winds
6 occurrences of strong winds
For the weak winds, 6 are positive and 2 are
negative
For the strong winds, 3 are positive and 3 are
negative
Exercise (cont.)
Gain(S,Wind) =
Info(S) - (8/14) * Info(Weak)
-(6/14) * Info(Strong)
Info(Weak) = - (6/8)*log2(6/8) - (2/8)*log2(2/8) =
0.811
Info(Strong) = - (3/6)*log2(3/6) - (3/6)*log2(3/6) =
1.00
Exercise (cont.)
So Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.00
= 0.048
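The same helper reproduces the Wind example (reusing the entropy function from the sketch above):

```python
# Gain(S, Wind) = Info(S) - (8/14)*Info(Weak) - (6/14)*Info(Strong)
info_s      = entropy([9, 5])      # 0.940
info_weak   = entropy([6, 2])      # 0.811
info_strong = entropy([3, 3])      # 1.000
gain_wind = info_s - (8/14) * info_weak - (6/14) * info_strong
print(round(gain_wind, 3))         # 0.048
```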
age income student credit_rating buys_computer
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
Sample training data to determine
whether an animal lays eggs.
Independent/condition attributes: Warm-blooded, Feathers, Fur, Swims; dependent/decision attribute: Lays Eggs

Animal     Warm-blooded  Feathers  Fur  Swims  Lays Eggs
Ostrich    Yes           Yes       No   No     Yes
Crocodile  No            No        No   Yes    Yes
Raven      Yes           Yes       No   No     Yes
Albatross  Yes           Yes       No   No     Yes
Dolphin    Yes           No        No   Yes    No
Koala      Yes           No        Yes  No     No
Entropy(4Y,2N): -(4/6)log2(4/6) – (2/6)log2(2/6)
= 0.91829
Now we have to find the information gain (IG) for all four attributes:
Warm-blooded, Feathers, Fur, Swims
For attribute ‘Warm-blooded’:
Values(Warm-blooded) : [Yes,No]
S = [4Y,2N]
SYes = [3Y,2N] E(SYes) = 0.97095
SNo = [1Y,0N] E(SNo) = 0 (all members belong to same class)
Gain(S,Warm-blooded) = 0.91829 – [(5/6)*0.97095 + (1/6)*0]
= 0.10916
For attribute ‘Feathers’:
Values(Feathers) : [Yes,No]
S = [4Y,2N]
SYes = [3Y,0N] E(SYes) = 0
SNo = [1Y,2N] E(SNo) = 0.91829
Gain(S,Feathers) = 0.91829 – [(3/6)*0 + (3/6)*0.91829]
= 0.45914
For attribute ‘Fur’:
Values(Fur) : [Yes,No]
S = [4Y,2N]
SYes = [0Y,1N] E(SYes) = 0
SNo = [4Y,1N] E(SNo) = 0.7219
Gain(S,Fur) = 0.91829 – [(1/6)*0 + (5/6)*0.7219]
= 0.3167
For attribute ‘Swims’:
Values(Swims) : [Yes,No]
S = [4Y,2N]
SYes = [1Y,1N] E(SYes) = 1 (equal members in both classes)
SNo = [3Y,1N] E(SNo) = 0.81127
Gain(S,Swims) = 0.91829 – [(2/6)*1 + (4/6)*0.81127] = 0.04411
Gain(S,Warm-blooded) = 0.10916
Gain(S,Feathers) = 0.45914
Gain(S,Fur) = 0.31670
Gain(S,Swims) = 0.04411
Gain(S,Feathers) is maximum, so Feathers is chosen as the root node
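As a cross-check, here is a sketch that recomputes all four gains from the lays-eggs table (attribute names and values are spelled as in the table; the entropy helper from the earlier sketch is reused):

```python
from collections import Counter

animals = [  # Warm-blooded, Feathers, Fur, Swims, LaysEggs
    {"Warm-blooded": "Yes", "Feathers": "Yes", "Fur": "No",  "Swims": "No",  "LaysEggs": "Yes"},  # Ostrich
    {"Warm-blooded": "No",  "Feathers": "No",  "Fur": "No",  "Swims": "Yes", "LaysEggs": "Yes"},  # Crocodile
    {"Warm-blooded": "Yes", "Feathers": "Yes", "Fur": "No",  "Swims": "No",  "LaysEggs": "Yes"},  # Raven
    {"Warm-blooded": "Yes", "Feathers": "Yes", "Fur": "No",  "Swims": "No",  "LaysEggs": "Yes"},  # Albatross
    {"Warm-blooded": "Yes", "Feathers": "No",  "Fur": "No",  "Swims": "Yes", "LaysEggs": "No"},   # Dolphin
    {"Warm-blooded": "Yes", "Feathers": "No",  "Fur": "Yes", "Swims": "No",  "LaysEggs": "No"},   # Koala
]

def class_counts(rows, target="LaysEggs"):
    """Per-class counts of the target attribute in a set of rows."""
    return list(Counter(r[target] for r in rows).values())

def gain(rows, attribute, target="LaysEggs"):
    """Gain(S, A) = Info(S) - sum_j (|Sj|/|S|) * Info(Sj)."""
    base, total = entropy(class_counts(rows, target)), len(rows)
    expected = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        expected += len(subset) / total * entropy(class_counts(subset, target))
    return base - expected

for a in ["Warm-blooded", "Feathers", "Fur", "Swims"]:
    print(a, round(gain(animals, a), 5))   # Feathers has the largest gain (~0.459)
```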
Splitting on Feathers:
Feathers = Y: [Ostrich, Raven, Albatross] – this descendant has only positive examples and becomes a leaf node with classification ‘Lays Eggs’
Feathers = N: [Crocodile, Dolphin, Koala] – this branch is still mixed and must be split further
The remaining tuples for the Feathers = N branch are:
Animal     Warm-blooded  Feathers  Fur  Swims  Lays Eggs
Crocodile  No            No        No   Yes    Yes
Dolphin    Yes           No        No   Yes    No
Koala      Yes           No        Yes  No     No
We now repeat the procedure for the Feathers = N branch:
S: [Crocodile, Dolphin, Koala]
S: [1+,2-]
Entropy(S) = -(1/3)log2(1/3) – (2/3)log2(2/3)
= 0.91829
For attribute ‘Warm-blooded’:
Values(Warm-blooded) : [Yes,No]
S = [1Y,2N]
SYes = [0Y,2N] E(SYes) = 0
SNo = [1Y,0N] E(SNo) = 0
Gain(S,Warm-blooded) = 0.91829 – [(2/3)*0 + (1/3)*0] = 0.91829
For attribute ‘Fur’:
Values(Fur) : [Yes,No]
S = [1Y,2N]
SYes = [0Y,1N] E(SYes) = 0
SNo = [1Y,1N] E(SNo) = 1
Gain(S,Fur) = 0.91829 – [(1/3)*0 + (2/3)*1] = 0.25162
For attribute ‘Swims’:
Values(Swims) : [Yes,No]
S = [1Y,2N]
SYes = [1Y,1N] E(SYes) = 1
SNo = [0Y,1N] E(SNo) = 0
Gain(S,Swims) = 0.91829 – [(2/3)*1 + (1/3)*0] = 0.25162
Gain(S,Warm-blooded) is maximum, so Warm-blooded is chosen for the next split
The final decision tree will be:
Feathers
  Y → Lays Eggs
  N → Warm-blooded
        Y → Does not lay eggs
        N → Lays Eggs
Factors affecting sunburn
Name Hair Height Weight Lotion Sunburned
Sarah Blonde Average Light No Yes
Dana Blonde Tall Average Yes No
Alex Brown Short Average Yes No
Annie Blonde Short Average No Yes
Emily Red Average Heavy No Yes
Pete Brown Tall Heavy No No
John Brown Average Heavy No No
Katie Blonde Short Light Yes No
S = [3+, 5-]
Entropy(S) = -(3/8)log2(3/8) – (5/8)log2(5/8)
= 0.95443
Find IG for all 4 attributes: Hair, Height, Weight, Lotion
For attribute ‘Hair’:
Values(Hair) : [Blonde, Brown, Red]
S = [3+,5-]
SBlonde = [2+,2-] E(SBlonde) = 1
SBrown = [0+,3-] E(SBrown) = 0
SRed = [1+,0-] E(SRed) = 0
Gain(S,Hair) = 0.95443 – [(4/8)*1 + (3/8)*0 + (1/8)*0]
= 0.45443
For attribute ‘Height’:
Values(Height) : [Average, Tall, Short]
SAverage = [2+,1-] E(SAverage) = 0.91829
STall = [0+,2-] E(STall) = 0
SShort = [1+,2-] E(SShort) = 0.91829
Gain(S,Height) = 0.95443 – [(3/8)*0.91829 + (2/8)*0 + (3/8)*0.91829] = 0.26571
For attribute ‘Weight’:
Values(Weight) : [Light, Average, Heavy]
SLight = [1+,1-] E(SLight) = 1
SAverage = [1+,2-] E(SAverage) = 0.91829
SHeavy = [1+,2-] E(SHeavy) = 0.91829
Gain(S,Weight) = 0.95443 – [(2/8)*1 + (3/8)*0.91829 + (3/8)*0.91829] = 0.01571
For attribute ‘Lotion’:
Values(Lotion) : [Yes, No]
SYes = [0+,3-] E(SYes) = 0
SNo = [3+,2-] E(SNo) = 0.97095
Gain(S,Lotion) = 0.95443 – [(3/8)*0 + (5/8)*0.97095] = 0.3475
Gain(S,Hair) = 0.45443
Gain(S,Height) = 0.26571
Gain(S,Weight) = 0.01571
Gain(S,Lotion) = 0.3475
Gain(S,Hair) is maximum, so Hair is chosen as the root node
Splitting on Hair:
Hair = Blonde: [Sarah, Dana, Annie, Katie] – still mixed, so this branch must be split further
Hair = Red: [Emily] – leaf node with classification ‘Sunburned’
Hair = Brown: [Alex, Pete, John] – leaf node with classification ‘Not Sunburned’
The remaining tuples for the Hair = Blonde branch are:
Name Hair Height Weight Lotion Sunburned
Sarah Blonde Average Light No Yes
Dana Blonde Tall Average Yes No
Annie Blonde Short Average No Yes
Katie Blonde Short Light Yes No
Repeating the procedure for the Hair = Blonde branch:
S = [Sarah, Dana, Annie, Katie]
S: [2+,2-]
Entropy(S) = 1
Find IG for remaining 3 attributes Height, Weight, Lotion
For attribute ‘Height’:
Values(Height) : [Average, Tall, Short]
S = [2+,2-]
SAverage = [1+,0-] E(SAverage) = 0
STall = [0+,1-] E(STall) = 0
SShort = [1+,1-] E(SShort) = 1
Gain(S,Height) = 1 – [(1/4)*0 + (1/4)*0 + (2/4)*1] = 0.5
For attribute ‘Weight’:
Values(Weight) : [Average, Light]
S = [2+,2-]
SAverage = [1+,1-] E(SAverage) = 1
SLight = [1+,1-] E(SLight) = 1
Gain(S,Weight) = 1 – [(2/4)*1 + (2/4)*1]= 0
For attribute ‘Lotion’:
Values(Lotion) : [Yes, No]
S = [2+,2-]
SYes = [0+,2-] E(SYes) = 0
SNo = [2+,0-] E(SNo) = 0
Gain(S,Lotion) = 1 – [(2/4)*0 + (2/4)*0]= 1
Therefore, Gain(S,Lotion) is maximum
In this case, the final decision tree will be
Hair
  Blonde → Lotion
        Y → Not Sunburned
        N → Sunburned
  Red → Sunburned
  Brown → Not Sunburned
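One convenient way to use the finished tree is to encode it as nested dictionaries and walk it; this representation is our own choice for illustration, not part of ID3 itself:

```python
sunburn_tree = {
    "attribute": "Hair",
    "branches": {
        "Blonde": {"attribute": "Lotion",
                   "branches": {"Yes": "Not Sunburned", "No": "Sunburned"}},
        "Red":    "Sunburned",
        "Brown":  "Not Sunburned",
    },
}

def classify(tree, example):
    """Follow branches until a leaf (a plain string label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attribute"]]]
    return tree

print(classify(sunburn_tree, {"Hair": "Blonde", "Lotion": "No"}))   # Sunburned (e.g. Sarah)
```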
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
(ai + ai+1)/2 is the midpoint between the values ai and ai+1
The point with the minimum expected information requirement for
A is selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set
of tuples in D satisfying A > split-point
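A sketch of this midpoint search, assuming numeric values paired with class labels and reusing the entropy helper from earlier:

```python
from collections import Counter

def best_split_point(values, labels):
    """Return (split_point, Info_A(D)): the midpoint with the minimum expected
    information requirement among all adjacent-value midpoints."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                   # no midpoint between equal values
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left  = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        info = (len(left) / len(pairs)) * entropy(list(Counter(left).values())) \
             + (len(right) / len(pairs)) * entropy(list(Counter(right).values()))
        if info < best[1]:
            best = (split, info)
    return best

print(best_split_point([45, 50, 60, 70], ["no", "no", "yes", "yes"]))   # (55.0, 0.0)
```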
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes
with a large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
GainRatio(A) = Gain(A)/SplitInfo(A)
Ex.
gain_ratio(income) = 0.029/1.557 = 0.019
The attribute with the maximum gain ratio is selected as
the splitting attribute
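A sketch of SplitInfo and gain ratio; the income partition sizes (4 high, 6 medium, 4 low) are counted from the table below, and Gain(income) = 0.029 is the value quoted above:

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum (|Dj|/|D|) * log2(|Dj|/|D|)."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s > 0)

si = split_info([4, 6, 4])          # income: 4 high, 6 medium, 4 low -> ~1.557
print(round(0.029 / si, 3))         # gain_ratio(income) ~ 0.019
```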
age income student credit_rating buys_computer
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
Gini Index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, gini index,
gini(D) is defined as
gini(D) = 1 − Σ_{j=1}^{n} pj²
where pj is the relative frequency of class j in D
If a data set D is split on A into two subsets D1 and D2, the gini
index gini_A(D) is defined as
gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2)
Reduction in impurity:
Δgini(A) = gini(D) − gini_A(D)
The attribute that provides the smallest gini_split(D) (or, equivalently, the largest
reduction in impurity) is chosen to split the node (we need to
enumerate all possible splitting points for each attribute)
Computation of Gini Index
Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
gini(D) = 1 − (9/14)² − (5/14)² = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium}
and 4 tuples in D2: {high}
gini_{income ∈ {low,medium}}(D) = (10/14) × Gini(D1) + (4/14) × Gini(D2) = 0.443
Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on the
{low, medium} (and {high}) partitioning, since it has the lowest Gini index
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
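The arithmetic above can be reproduced with a small sketch (the class counts per income partition are tallied from the buys_computer table):

```python
def gini(counts):
    """gini(D) = 1 - sum pj^2, computed from per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_d = gini([9, 5])                                                 # ~0.459
# D1 = {low, medium}: 10 tuples (7 yes, 3 no); D2 = {high}: 4 tuples (2 yes, 2 no)
gini_income_split = (10/14) * gini([7, 3]) + (4/14) * gini([2, 2])    # ~0.443
print(round(gini_d, 3), round(gini_income_split, 3))
```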
Comparing Attribute Selection Measures
The three measures, in general, return good results but
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index:
biased towards multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistic: has a close approximation to χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
Most give good results; none is significantly superior to the others
Preparing Data for
Classification and Prediction
Data Cleaning: remove or reduce noise (apply smoothing) and
treat missing values (replace a missing value with the most
commonly occurring value for that attribute, or with the most
probable value based on statistics)
Relevance Analysis: 1) Correlation analysis is used to check
whether any two given attributes are statistically related. 2)
The dataset may also contain irrelevant attributes; these kinds of
attributes may otherwise slow down and possibly mislead the
learning step.
Data Transformation and Reduction: data may be transformed by
normalization (scaling attribute values so that they fall within a
small specified range) or by generalizing it to higher-level concepts.
Advantages of ID3
Understandable prediction rules are created from
the training data.
Builds the fastest tree.
Builds a short tree.
Only need to test enough attributes until all data is
classified.
Finding leaf nodes enables test data to be pruned,
reducing number of tests.
Disadvantages of ID3
Data may be over-fitted or over-classified, if a small
sample is tested.
Only one attribute at a time is tested for making a
decision.
Classifying continuous data may be
computationally expensive, as many trees must be
generated to see where to break the continuum.
Bayes’ Theorem: Basics
Total probability Theorem:
P(B) = Σ_{i=1}^{M} P(B | Ai) × P(Ai)
Bayes’ Theorem:
P(H | X) = P(X | H) P(H) / P(X)
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X) (i.e., the posteriori probability): the
probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes’ theorem
P(H | X) = P(X | H) P(H) / P(X)
Informally, this can be viewed as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
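A tiny numeric illustration of "posteriori = likelihood × prior / evidence"; the likelihood values here are placeholders chosen for the example, not figures from the slides:

```python
p_h            = 9 / 14    # prior P(H), e.g. fraction of "buys computer = yes" tuples
p_x_given_h    = 0.044     # P(X | H)      (placeholder value)
p_x_given_noth = 0.019     # P(X | not H)  (placeholder value)

# Total probability theorem: P(X) = sum_i P(X | Ai) P(Ai)
p_x = p_x_given_h * p_h + p_x_given_noth * (1 - p_h)

# Bayes' theorem: P(H | X) = P(X | H) P(H) / P(X)
posterior = p_x_given_h * p_h / p_x
print(round(posterior, 3))
```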
Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
This can be derived from Bayes’ theorem
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only
P(Ci | X) ∝ P(X | Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X | Ci) = ∏_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
This greatly reduces the computation cost: Only counts
the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) × e^(−(x − μ)² / (2σ²))
and P(xk|Ci) is
P(xk | Ci) = g(xk, μCi, σCi)
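A minimal naïve Bayes sketch for categorical attributes, assuming training tuples are dicts with a "class" key; Laplace smoothing and the Gaussian case for continuous attributes are omitted to keep it short:

```python
from collections import Counter, defaultdict

def train_naive_bayes(D, attributes, target="class"):
    """Estimate P(Ci) and P(xk | Ci) by simple counting (categorical attributes only)."""
    class_counts = Counter(t[target] for t in D)
    priors = {c: n / len(D) for c, n in class_counts.items()}
    cond = defaultdict(lambda: defaultdict(Counter))        # cond[c][attr][value] = count
    for t in D:
        for a in attributes:
            cond[t[target]][a][t[a]] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts, attributes):
    """Return the class Ci maximizing P(Ci) * prod_k P(xk | Ci)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for a in attributes:
            score *= cond[c][a][x[a]] / class_counts[c]     # 0 if the value was never seen with class c
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

On the buys_computer table above, train_naive_bayes simply tallies the counts described in the categorical case, and predict picks the class with the largest P(X|Ci)P(Ci).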