Decision Tree Induction

The document discusses decision trees, which are flow-chart like tree structures used for classification or regression. Decision trees work by recursively splitting a dataset based on attribute tests at internal nodes. Popular algorithms for constructing decision trees include ID3, C4.5, and CART, which use a greedy top-down approach to build the tree. The algorithm starts with a root node and all training data, and recursively partitions the data into purer subsets based on the best splitting attribute until reaching leaf nodes with class labels.

Decision Trees

 Decision tree induction is the learning of decision trees from class-labelled training examples.
 A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
Decision Tree Induction

 ID3 (Iterative Dichotomiser 3), by J. Ross Quinlan
 C4.5 is a successor of ID3
 Classification and Regression Trees (CART) – generation of binary trees
 ID3, C4.5 and CART adopt a greedy (non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
 The algorithm starts with a training set of tuples and their associated class labels.
 The training set is recursively partitioned into smaller subsets as the tree is being built.
The Algorithm

 Create a root node N for the tree.
 If all examples in D are of the same class C, then return N as a leaf node labelled with the class C.
 If attribute_list is empty, then return N as a leaf node labelled with the majority class in D (majority voting).
 Apply Attribute_Selection_method(D, attribute_list) to find the "best" splitting_criterion.
 Label node N with splitting_criterion.
 If splitting_attribute is discrete-valued and multiway splits are allowed, then
    attribute_list ← attribute_list − splitting_attribute
 For each outcome j of splitting_criterion:
    Let Dj be the set of data tuples in D satisfying outcome j.
    If Dj is empty, then attach a leaf labelled with the majority class in D to node N;
    else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N.
 Endfor
 Return N
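
A minimal Python sketch of this recursive procedure is given below. It assumes information gain (introduced on the later slides) as the Attribute_Selection_method and uses a nested-dict tree representation; the helper names (entropy, info_gain, generate_decision_tree) are illustrative, not part of the original listing.

import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D) for a discrete-valued attribute A."""
    total = len(labels)
    info_a = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        info_a += len(subset) / total * entropy(subset)
    return entropy(labels) - info_a

def generate_decision_tree(rows, labels, attribute_list):
    """Top-down, recursive, divide-and-conquer construction (ID3-style)."""
    if len(set(labels)) == 1:           # all tuples are of the same class C
        return labels[0]
    if not attribute_list:              # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    # Attribute selection: pick the "best" splitting attribute.
    best = max(attribute_list, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attribute_list if a != best]   # discrete, multiway split
    node = {best: {}}
    for value in set(r[best] for r in rows):
        branch = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        # (If an outcome had no tuples, the algorithm would attach a leaf with the
        #  majority class in D; here every iterated value has at least one tuple.)
        sub_rows, sub_labels = zip(*branch)
        node[best][value] = generate_decision_tree(list(sub_rows), list(sub_labels), remaining)
    return node

Applied to the "lays eggs" table later in these slides, this sketch should reproduce the Feathers/Warm-blooded tree derived there by hand.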
 To partition the tuples in D on the splitting attribute A, which has v distinct values {a1, a2, …, av} based on the training data, there are three possible scenarios:
 A is discrete-valued: a branch is created for each known value aj of A.
 A is continuous-valued: two branches are created, for A ≤ split_point and A > split_point.
 A is discrete-valued and a binary tree must be produced: the test has the form A ∈ SA, where SA is a subset of the known values of A.
 The recursive partitioning stops only when one of the following terminating conditions is true:
 All the tuples in partition D belong to the same class.
 There are no remaining attributes on which the tuples may be further partitioned. In this case, majority voting is employed.
 There are no tuples for a given branch, that is, a partition Dj is empty. In this case, a leaf is created with the majority class in D.
 What is the computational complexity?
Attribute Selection Procedure

 It is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labelled training tuples into individual classes.
 It is also known as a splitting rule because it determines how the tuples at a given node are to be split.
 It provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
 Well-known attribute selection measures are: information gain, gain ratio, and Gini index.
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain.
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
 Expected information (entropy) needed to classify a tuple in D:

Info(D) = − Σ_{i=1}^{m} pi log2(pi)

 Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

 Information gained by branching on attribute A:

Gain(A) = Info(D) − Info_A(D)


Exercise

 Calculate the information (entropy).
 Given:
 Set S contains 14 examples
 9 positive values
 5 negative values
Exercise

 Info(S) = −(9/14) log2(9/14) − (5/14) log2(5/14)
         = 0.940
Information Gain

 Information gain is based on the decrease in entropy after a dataset is split on an attribute.
 We are looking for the attribute that creates the most homogeneous branches.
Information Gain Example

 14 examples, 9 positive and 5 negative
 The attribute is Wind.
 Values of Wind are Weak and Strong.
Exercise (cont.)

 8 occurrences of weak winds
 6 occurrences of strong winds
 For the weak winds, 6 are positive and 2 are negative
 For the strong winds, 3 are positive and 3 are negative
Exercise (cont.)

 Gain(S,Wind) = Info(S) − (8/14) × Info(Weak) − (6/14) × Info(Strong)
 Info(Weak) = −(6/8) log2(6/8) − (2/8) log2(2/8) = 0.811
 Info(Strong) = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1.00
Exercise (cont.)

 So…
 Gain(S,Wind) = 0.940 − (8/14) × 0.811 − (6/14) × 1.00 = 0.048
age income student credit_rating buys_computer
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
Sample training data to determine whether an animal lays eggs.
(Warm-blooded, Feathers, Fur and Swims are the independent/condition attributes; Lays Eggs is the dependent/decision attribute.)

Animal     Warm-blooded  Feathers  Fur  Swims  Lays Eggs
Ostrich    Yes           Yes       No   No     Yes
Crocodile  No            No        No   Yes    Yes
Raven      Yes           Yes       No   No     Yes
Albatross  Yes           Yes       No   No     Yes
Dolphin    Yes           No        No   Yes    No
Koala      Yes           No        Yes  No     No

Entropy(4Y,2N) = −(4/6) log2(4/6) − (2/6) log2(2/6) = 0.91829

Now we have to find the IG for all four attributes: Warm-blooded, Feathers, Fur and Swims.
For attribute ‘Warm-blooded’:
Values(Warm-blooded) : [Yes,No]
S = [4Y,2N]
SYes = [3Y,2N] E(SYes) = 0.97095
SNo = [1Y,0N] E(SNo) = 0 (all members belong to same class)
Gain(S,Warm-blooded) = 0.91829 – [(5/6)*0.97095 + (1/6)*0]
= 0.10916
For attribute ‘Feathers’:
Values(Feathers) : [Yes,No]
S = [4Y,2N]
SYes = [3Y,0N] E(SYes) = 0
SNo = [1Y,2N] E(SNo) = 0.91829
Gain(S,Feathers) = 0.91829 – [(3/6)*0 + (3/6)*0.91829]
= 0.45914
For attribute ‘Fur’:
Values(Fur) : [Yes,No]
S = [4Y,2N]
SYes = [0Y,1N] E(SYes) = 0
SNo = [4Y,1N] E(SNo) = 0.7219
Gain(S,Fur) = 0.91829 – [(1/6)*0 + (5/6)*0.7219]
= 0.3167
For attribute ‘Swims’:
Values(Swims) : [Yes,No]
S = [4Y,2N]
SYes = [1Y,1N] E(SYes) = 1 (equal members in both classes)
SNo = [3Y,1N] E(SNo) = 0.81127
Gain(S,Swims) = 0.91829 – [(2/6)*1 + (4/6)*0.81127] = 0.04411
Gain(S,Warm-blooded) = 0.10916
Gain(S,Feathers) = 0.45914
Gain(S,Fur) = 0.31670
Gain(S,Swims) = 0.04411
Gain(S,Feathers) is maximum, so it is considered as the root node
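
The four gains above can also be reproduced from the class counts alone. A small sketch, with the (lays-eggs, does-not-lay-eggs) counts per attribute value taken from the table; the helper names are illustrative.

import math

def entropy(counts):
    # Entropy of a partition given per-class counts, e.g. (4, 2) for 4Y/2N.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def gain(parent, branches):
    # Gain = Entropy(S) minus the weighted entropy of each branch.
    total = sum(parent)
    return entropy(parent) - sum(sum(b) / total * entropy(b) for b in branches)

S = (4, 2)   # 4 animals lay eggs, 2 do not
gains = {
    "Warm-blooded": gain(S, [(3, 2), (1, 0)]),   # branches: Yes, No
    "Feathers":     gain(S, [(3, 0), (1, 2)]),
    "Fur":          gain(S, [(0, 1), (4, 1)]),
    "Swims":        gain(S, [(1, 1), (3, 1)]),
}
print(max(gains, key=gains.get))   # -> Feathers (gain ≈ 0.459)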

The 'Y' descendant of Feathers contains only positive examples and becomes a leaf node with the classification 'Lays Eggs'. The 'N' descendant [Crocodile, Dolphin, Koala] still needs to be split:

                    Feathers
               Y /          \ N
   [Ostrich, Raven,      [Crocodile, Dolphin,
    Albatross]            Koala]
   Lays Eggs              ?
Animal     Warm-blooded  Feathers  Fur  Swims  Lays Eggs
Crocodile  No            No        No   Yes    Yes
Dolphin    Yes           No        No   Yes    No
Koala      Yes           No        Yes  No     No

We now repeat the procedure for
S = [Crocodile, Dolphin, Koala]
S: [1+, 2-]
Entropy(S) = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.91829
 For attribute ‘Warm-blooded’:
Values(Warm-blooded) : [Yes,No]
S = [1Y,2N]
SYes = [0Y,2N] E(SYes) = 0
SNo = [1Y,0N] E(SNo) = 0
Gain(S,Warm-blooded) = 0.91829 – [(2/3)*0 + (1/3)*0] = 0.91829
 For attribute ‘Fur’:
Values(Fur) : [Yes,No]
S = [1Y,2N]
SYes = [0Y,1N] E(SYes) = 0
SNo = [1Y,1N] E(SNo) = 1
Gain(S,Fur) = 0.91829 – [(1/3)*0 + (2/3)*1] = 0.25162
 For attribute ‘Swims’:
Values(Swims) : [Yes,No]
S = [1Y,2N]
SYes = [1Y,1N] E(SYes) = 1
SNo = [0Y,1N] E(SNo) = 0
Gain(S,Swims) = 0.91829 – [(2/3)*1 + (1/3)*0] = 0.25162
Gain(S,Warm-blooded) is maximum
The final decision tree will be:

                  Feathers
             Y /          \ N
       Lays Eggs       Warm-blooded
                      Y /          \ N
          Does not lay eggs      Lays Eggs
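
The finished tree can be stored as a nested dict and used to classify unseen animals. A minimal sketch; the tree literal below simply encodes the diagram above, and the test animal is an illustration.

# Inner nodes map an attribute name to its branches; strings are leaf labels.
tree = {"Feathers": {
    "Yes": "Lays Eggs",
    "No":  {"Warm-blooded": {"Yes": "Does not lay eggs",
                             "No":  "Lays Eggs"}},
}}

def classify(node, example):
    # Walk from the root, following the branch matching each attribute test.
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[example[attribute]]
    return node

penguin = {"Warm-blooded": "Yes", "Feathers": "Yes", "Fur": "No", "Swims": "Yes"}
print(classify(tree, penguin))    # -> Lays Eggs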


Factors affecting sunburn

Name Hair Height Weight Lotion Sunburned


Sarah Blonde Average Light No Yes
Dana Blonde Tall Average Yes No
Alex Brown Short Average Yes No
Annie Blonde Short Average No Yes
Emily Red Average Heavy No Yes
Pete Brown Tall Heavy No No
John Brown Average Heavy No No
Katie Blonde Short Light Yes No
S = [3+, 5-]
Entropy(S) = -(3/8)log2(3/8) – (5/8)log2(5/8)
= 0.95443
Find IG for all 4 attributes: Hair, Height, Weight, Lotion
 For attribute ‘Hair’:

Values(Hair) : [Blonde, Brown, Red]


S = [3+,5-]
SBlonde = [2+,2-] E(SBlonde) = 1
SBrown = [0+,3-] E(SBrown) = 0
SRed = [1+,0-] E(SRed) = 0
Gain(S,Hair) = 0.95443 – [(4/8)*1 + (3/8)*0 + (1/8)*0]
= 0.45443
 For attribute ‘Height’:
Values(Height) : [Average, Tall, Short]
SAverage = [2+,1-] E(SAverage) = 0.91829
STall = [0+,2-] E(STall) = 0
SShort = [1+,2-] E(SShort) = 0.91829
Gain(S,Height) = 0.95443 – [(3/8)*0.91829 + (2/8)*0 + (3/8)*0.91829] = 0.26571
 For attribute ‘Weight’:
Values(Weight) : [Light, Average, Heavy]
SLight = [1+,1-] E(SLight) = 1
SAverage = [1+,2-] E(SAverage) = 0.91829
SHeavy = [1+,2-] E(SHeavy) = 0.91829
Gain(S,Weight) = 0.95443 – [(2/8)*1 + (3/8)*0.91829 + (3/8)*0.91829] = 0.01571
 For attribute ‘Lotion’:
Values(Lotion) : [Yes, No]
SYes = [0+,3-] E(SYes) = 0
SNo = [3+,2-] E(SNo) = 0.97095
Gain(S,Lotion) = 0.95443 – [(3/8)*0 + (5/8)*0.97095] = 0.34759

Summary:
Gain(S,Hair) = 0.45443
Gain(S,Height) = 0.26571
Gain(S,Weight) = 0.01571
Gain(S,Lotion) = 0.34759
Gain(S,Hair) is maximum, so it is considered as the root node
Splitting on Hair:

                        Hair
        Blonde /      Red |       \ Brown
  [Sarah, Dana,       [Emily]      [Alex, Pete, John]
   Annie, Katie]      Sunburned    Not Sunburned
        ?
Name Hair Height Weight Lotion Sunburned
Sarah Blonde Average Light No Yes
Dana Blonde Tall Average Yes No
Annie Blonde Short Average No Yes
Katie Blonde Short Light Yes No

Repeating again:
S = [Sarah, Dana, Annie, Katie]
S: [2+,2-]
Entropy(S) = 1
Find IG for remaining 3 attributes Height, Weight, Lotion
 For attribute ‘Height’:

Values(Height) : [Average, Tall, Short]


S = [2+,2-]
SAverage = [1+,0-] E(SAverage) = 0
STall = [0+,1-] E(STall) = 0
SShort = [1+,1-] E(SShort) = 1
Gain(S,Height) = 1 – [(1/4)*0 + (1/4)*0 + (2/4)*1] = 0.5
 For attribute ‘Weight’:
Values(Weight) : [Average, Light]
S = [2+,2-]
SAverage = [1+,1-] E(SAverage) = 1
SLight = [1+,1-] E(SLight) = 1
Gain(S,Weight) = 1 – [(2/4)*1 + (2/4)*1]= 0
 For attribute ‘Lotion’:

Values(Lotion) : [Yes, No]


S = [2+,2-]
SYes = [0+,2-] E(SYes) = 0
SNo = [2+,0-] E(SNo) = 0
Gain(S,Lotion) = 1 – [(2/4)*0 + (2/4)*0]= 1

Therefore, Gain(S,Lotion) is maximum.
 In this case, the final decision tree will be:

                        Hair
        Blonde /      Red |       \ Brown
        Lotion        Sunburned    Not Sunburned
      Y /    \ N
 Not Sunburned  Sunburned
Computing Information-Gain for
Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values is considered as a possible split point
 (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information requirement for A is selected as the split-point for A
 Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
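
A sketch of this procedure for a single continuous attribute: sort the values, form the midpoints, and keep the split-point with the minimum expected information Info_A(D). The sample ages and labels are made-up illustration data, not from the slides.

import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    # Return (split_point, info_A) minimising expected information for A <= split_point.
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        mid = (a_i + a_next) / 2                    # midpoint (ai + ai+1) / 2
        d1 = [lab for v, lab in pairs if v <= mid]  # D1: A <= split-point
        d2 = [lab for v, lab in pairs if v > mid]   # D2: A > split-point
        info_a = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        if info_a < best[1]:
            best = (mid, info_a)
    return best

ages   = [23, 25, 31, 35, 40, 46, 52, 60]                        # illustration only
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]
print(best_split_point(ages, labels))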
Gain Ratio for Attribute Selection (C4.5)
 The information gain measure is biased towards attributes with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)

 GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex.: gain_ratio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute
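
A small sketch of SplitInfo and GainRatio. The counts (4, 6, 4) are the numbers of tuples with income = high, medium and low in the 14-tuple table on the next slide, and 0.029 is the information gain for income quoted above.

import math

def split_info(partition_sizes):
    # SplitInfo_A(D) = -sum_j (|Dj|/|D|) * log2(|Dj|/|D|)
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes if n)

def gain_ratio(gain, partition_sizes):
    # GainRatio(A) = Gain(A) / SplitInfo_A(D)
    return gain / split_info(partition_sizes)

print(round(split_info([4, 6, 4]), 3))          # ≈ 1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))   # ≈ 0.019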
age income student credit_rating buys_computer
youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no
Gini Index (CART, IBM IntelligentMiner)
 If a data set D contains examples from n classes, the gini index gini(D) is defined as

gini(D) = 1 − Σ_{j=1}^{n} pj²

where pj is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

 Reduction in impurity:

Δgini(A) = gini(D) − gini_A(D)

 The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
 Ex.: D has 9 tuples with buys_computer = “yes” and 5 with “no”

gini(D) = 1 − (9/14)² − (5/14)² = 0.459

 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}

gini_{income ∈ {low,medium}}(D) = (10/14) × Gini(D1) + (4/14) × Gini(D2) = 0.443

 Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index.
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes
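
A sketch of the same computation; the (yes, no) counts per income value are read from the buys_computer table, and the three binary groupings match the ones compared above.

def gini(counts):
    # gini(D) = 1 - sum_j pj^2, given per-class counts such as (yes, no)
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(d1, d2):
    # gini_A(D) for a binary split of D into D1 and D2
    n1, n2 = sum(d1), sum(d2)
    return n1 / (n1 + n2) * gini(d1) + n2 / (n1 + n2) * gini(d2)

def combine(*groups):
    # Element-wise sum of per-class counts, e.g. combine(low, medium)
    return tuple(sum(cs) for cs in zip(*groups))

low, medium, high = (3, 1), (4, 2), (2, 2)   # (yes, no) counts per income value

print(round(gini((9, 5)), 3))                             # gini(D) ≈ 0.459
print(round(gini_split(combine(low, medium), high), 3))   # ≈ 0.443
print(round(gini_split(combine(low, high), medium), 3))   # ≈ 0.458
print(round(gini_split(combine(medium, high), low), 3))   # ≈ 0.450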
Comparing Attribute Selection Measures

 The three measures, in general, return good results, but:
 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one partition is much smaller than the others
 Gini index:
 biased towards multivalued attributes
 has difficulty when the number of classes is large
 tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: has a close approximation to χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
 The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results, but none is significantly superior to the others
Preparing Data for
Classification and Prediction
 Data Cleaning: remove or reduce noise (apply smoothing) and treat missing values (e.g., replace a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
 Relevance Analysis: 1) Correlation analysis is used to check whether any two given attributes are statistically related. 2) The dataset may also contain irrelevant attributes; these kinds of attributes may slow down and possibly mislead the learning step.
 Data Transformation and Reduction: data may be transformed by normalization (scaling attribute values so that they fall within a common range) or by generalizing them to higher-level concepts.
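
As one concrete example of the transformation step, min-max normalization rescales an attribute so its values fall within a chosen range. A small sketch; the [0, 1] target range and the sample incomes are assumptions for illustration.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Scale values linearly so they fall in [new_min, new_max].
    old_min, old_max = min(values), max(values)
    if old_max == old_min:           # constant attribute: map everything to new_min
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

print(min_max_normalize([12000, 35000, 54000, 98000]))   # incomes scaled into [0, 1]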
Advantages of ID3

 Understandable prediction rules are created from the training data.
 Builds the fastest tree.
 Builds a short tree.
 Only needs to test enough attributes until all data is classified.
 Finding leaf nodes enables test data to be pruned, reducing the number of tests.
Disadvantages of ID3

 Data may be over-fitted or over-classified if a small sample is tested.
 Only one attribute at a time is tested for making a decision.
 Classifying continuous data may be computationally expensive, as many trees must be generated to see where to break the continuum.
Bayes’ Theorem: Basics
 Total Probability Theorem:

P(B) = Σ_{i=1}^{M} P(B | Ai) P(Ai)

 Bayes’ Theorem:

P(H | X) = P(X | H) P(H) / P(X)

 Let X be a data sample (“evidence”): the class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), (i.e., posteriori probability): the
probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
Prediction Based on Bayes’ Theorem
 Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

P(H | X) = P(X | H) P(H) / P(X)

 Informally, this can be viewed as
posteriori = likelihood × prior / evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes’ theorem:

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

 Since P(X) is constant for all classes, only P(X | Ci) P(Ci)
needs to be maximized
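
A minimal sketch of this decision rule: choose the class Ci that maximises P(X|Ci)P(Ci). The priors come from the 9/5 class split in the buys_computer table; the likelihoods are made-up numbers for one observed tuple X, purely for illustration.

# Hypothetical likelihoods P(X|Ci) for a single tuple X, with priors P(Ci).
priors      = {"buys_computer = yes": 9 / 14, "buys_computer = no": 5 / 14}
likelihoods = {"buys_computer = yes": 0.044,  "buys_computer = no": 0.019}   # assumed

# P(X) is the same for every class, so it is enough to compare P(X|Ci) * P(Ci).
scores = {c: likelihoods[c] * priors[c] for c in priors}
print(max(scores, key=scores.get), scores)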
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

P(X | Ci) = Π_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

 This greatly reduces the computation cost: only the class distribution needs to be counted
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak, divided by |Ci,D| (the # of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) × exp(−(x − μ)² / (2σ²))

and P(xk|Ci) = g(xk, μCi, σCi)
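
A sketch of a categorical naïve Bayes classifier along these lines, trained on (attribute-dict, label) pairs from the animal table. Add-one (Laplace) smoothing is used so that unseen attribute values do not zero out the product; that is an addition of this sketch, not something stated on the slides.

import math
from collections import Counter, defaultdict

class NaiveBayes:
    # Categorical naive Bayes: P(Ci|X) is proportional to P(Ci) * product_k P(xk|Ci).

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # value_counts[ci][attr][value] = number of tuples of class ci with that value
        self.value_counts = defaultdict(lambda: defaultdict(Counter))
        self.domain = defaultdict(set)
        for row, ci in zip(rows, labels):
            for attr, value in row.items():
                self.value_counts[ci][attr][value] += 1
                self.domain[attr].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_class, best_score = None, -math.inf
        for ci, n_ci in self.class_counts.items():
            # log P(Ci) + sum_k log P(xk|Ci), with add-one smoothing
            score = math.log(n_ci / total)
            for attr, value in row.items():
                count = self.value_counts[ci][attr][value]
                score += math.log((count + 1) / (n_ci + len(self.domain[attr])))
            if score > best_score:
                best_class, best_score = ci, score
        return best_class

rows = [
    {"Warm-blooded": "Yes", "Feathers": "Yes", "Fur": "No",  "Swims": "No"},   # Ostrich
    {"Warm-blooded": "No",  "Feathers": "No",  "Fur": "No",  "Swims": "Yes"},  # Crocodile
    {"Warm-blooded": "Yes", "Feathers": "Yes", "Fur": "No",  "Swims": "No"},   # Raven
    {"Warm-blooded": "Yes", "Feathers": "Yes", "Fur": "No",  "Swims": "No"},   # Albatross
    {"Warm-blooded": "Yes", "Feathers": "No",  "Fur": "No",  "Swims": "Yes"},  # Dolphin
    {"Warm-blooded": "Yes", "Feathers": "No",  "Fur": "Yes", "Swims": "No"},   # Koala
]
labels = ["Yes", "Yes", "Yes", "Yes", "No", "No"]   # Lays Eggs
model = NaiveBayes().fit(rows, labels)
print(model.predict({"Warm-blooded": "Yes", "Feathers": "Yes", "Fur": "No", "Swims": "Yes"}))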
