Data Mining
Classification Trees (2)

Ad Feelders
Universiteit Utrecht

September 16, 2020
Basic Tree Construction Algorithm

Construct tree:
    nodelist ← {{training data}}
    repeat
        current node ← select node from nodelist
        nodelist ← nodelist − current node
        if impurity(current node) > 0 then
            S ← set of candidate splits in current node
            s* ← arg max_{s ∈ S} impurity reduction(s, current node)
            child nodes ← apply(s*, current node)
            nodelist ← nodelist ∪ child nodes
        fi
    until nodelist = ∅
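As a concrete illustration, here is a minimal runnable Python sketch of this loop. The Gini index as impurity measure and midpoint splits on numeric attributes are illustrative choices; the slide leaves the impurity measure and the set of candidate splits abstract.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    rows: list                       # training records (feature_vector, class_label)
    split: tuple = None              # (feature index, threshold) once the node is split
    children: list = field(default_factory=list)

def impurity(rows):
    """Gini index of the class labels in rows (one possible impurity measure)."""
    n = len(rows)
    counts = {}
    for _, y in rows:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def candidate_splits(rows):
    """All (feature, threshold) pairs halfway between consecutive observed values."""
    splits = []
    for j in range(len(rows[0][0])):
        values = sorted({x[j] for x, _ in rows})
        splits += [(j, (a + b) / 2) for a, b in zip(values, values[1:])]
    return splits

def impurity_reduction(split, rows):
    """Impurity of the parent minus the weighted impurity of the two children."""
    j, c = split
    left = [r for r in rows if r[0][j] <= c]
    right = [r for r in rows if r[0][j] > c]
    n = len(rows)
    return (impurity(rows)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

def construct_tree(training_data):
    root = Node(rows=training_data)
    nodelist = [root]
    while nodelist:                              # until nodelist = ∅
        node = nodelist.pop()                    # select node, remove from nodelist
        if impurity(node.rows) > 0:              # only split impure nodes
            splits = candidate_splits(node.rows)
            if not splits:                       # identical x-vectors: cannot split
                continue
            best = max(splits, key=lambda s: impurity_reduction(s, node.rows))
            j, c = best
            node.split = best
            node.children = [Node([r for r in node.rows if r[0][j] <= c]),
                             Node([r for r in node.rows if r[0][j] > c])]
            nodelist.extend(node.children)       # add child nodes to nodelist
    return root
```

For example, `construct_tree([((0, 0), 'a'), ((0, 1), 'b'), ((1, 0), 'b'), ((1, 1), 'a')])` grows a tree that separates this XOR-like toy sample perfectly, even though the first split gives no impurity reduction.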
Overfitting and Pruning

The tree growing algorithm continues splitting until all leaf nodes of $T$ contain examples of a single class (i.e. resubstitution error $R(T) = 0$).

Is this a good tree for predicting the class of new examples?

Not unless the problem is truly "deterministic"! This is the problem of overfitting.
Proposed Solutions

How can we prevent overfitting?

- Stopping rules: e.g. do not expand a node if the impurity reduction of the best split is below some threshold.
- Pruning: grow a very large tree $T_{max}$ and merge nodes back together.

Note: in the practical assignment we do use a stopping rule, based on the nmin and minleaf parameters (see the sketch below).
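A small sketch of such a rule, assuming the usual reading of these parameters (nmin: minimum number of observations a node must contain to be split; minleaf: minimum number of observations in a leaf; this interpretation is an assumption, check the assignment text):

```python
def allowed_split(node_size, left_size, right_size, nmin, minleaf):
    # Don't split nodes with fewer than nmin observations,
    # and reject splits that would create a leaf smaller than minleaf.
    if node_size < nmin:
        return False
    return left_size >= minleaf and right_size >= minleaf
```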
Stopping Rules

Disadvantage: sometimes you first have to make a weak split to be able to follow up with a good split. Since we only look one step ahead, we may miss the good follow-up split.

[Figure: scatter plot with axes $x_1$ and $x_2$ illustrating such a situation: neither single split separates the classes well, but the combination of two splits does.]
Pruning

To avoid the problem of stopping rules, we first grow a very large tree on the training sample, and then prune this large tree.

Objective: select the pruned subtree that has the lowest true error rate.

Problem: how to find this pruned subtree?
Pruning Methods

- Cost-complexity pruning (Breiman et al.; CART), also called weakest-link pruning.
- Reduced-error pruning (Quinlan).
- Pessimistic pruning (Quinlan; C4.5).
- ...
Terminology: Tree T

[Figure: example tree $T$ with nodes $t_1, \ldots, t_9$; $t_1$ is the root and $t_5, t_6, t_7, t_8, t_9$ are the leaf nodes.]

$\tilde{T}$ denotes the collection of leaf nodes of tree $T$:
$$\tilde{T} = \{t_5, t_6, t_7, t_8, t_9\}, \qquad |\tilde{T}| = 5$$
Terminology: Pruning $T$ in node $t_2$

[Figure: the tree $T$ again, with the branch hanging from $t_2$ (nodes $t_4, t_5, t_8, t_9$) marked for pruning.]
Terminology: $T$ after pruning in $t_2$: $T - T_{t_2}$

[Figure: the pruned tree, consisting of $t_1, t_2, t_3, t_6, t_7$; node $t_2$ has become a leaf.]
Terminology: Branch $T_{t_2}$

[Figure: the branch $T_{t_2}$ with root $t_2$, internal node $t_4$, and leaves $t_5, t_8, t_9$.]

$$\tilde{T}_{t_2} = \{t_5, t_8, t_9\}, \qquad |\tilde{T}_{t_2}| = 3$$
Cost-complexity pruning

The total number of pruned subtrees of a balanced binary tree with $\ell$ leaves is $\lfloor 1.5028369^{\ell} \rfloor$. With just 40 leaf nodes we already have approximately 12 million pruned subtrees ($1.5028369^{40} \approx 1.2 \times 10^7$), so exhaustive search is not recommended.

Basic idea of cost-complexity pruning: reduce the number of pruned subtrees we have to consider by selecting the ones that are the "best of their kind" (in a sense to be defined shortly...).
Total cost of a tree

Strike a balance between fit and complexity.

Total cost $C_\alpha(T)$ of tree $T$:
$$C_\alpha(T) = R(T) + \alpha |\tilde{T}|$$

Total cost consists of two components: the resubstitution error $R(T)$, and a penalty for the complexity of the tree, $\alpha |\tilde{T}|$ (with $\alpha \ge 0$).

Note:
$$R(T) = \frac{\text{number of wrong classifications made by } T}{\text{number of examples in the training sample}}$$
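The formula translates directly into code; a small check with made-up numbers:

```python
def total_cost(n_errors, n_cases, n_leaves, alpha):
    # C_alpha(T) = R(T) + alpha * |T~|
    return n_errors / n_cases + alpha * n_leaves

# Illustrative numbers: a tree with 5 leaves making 3 errors on 200 training cases
print(total_cost(3, 200, 5, alpha=0.05))   # 3/200 + 0.05 * 5 = 0.265
```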
Tree with lowest total cost

Depending on the value of $\alpha$, different pruned subtrees will have the lowest total cost.

- For $\alpha = 0$ (no complexity penalty) the tree with the smallest resubstitution error wins.
- For higher values of $\alpha$, a less complex tree that makes a few more errors might win.

As it turns out, we can find a nested sequence of pruned subtrees of $T_{max}$, such that the trees in the sequence minimize total cost for consecutive intervals of $\alpha$ values.
Smallest minimizing subtree

For any value of $\alpha$, there exists a smallest minimizing subtree $T(\alpha)$ of $T_{max}$ that satisfies the following conditions:

1. $C_\alpha(T(\alpha)) = \min_{T \le T_{max}} C_\alpha(T)$ (that is, $T(\alpha)$ minimizes total cost for that value of $\alpha$).
2. If $C_\alpha(T) = C_\alpha(T(\alpha))$ then $T(\alpha) \le T$ (that is, $T(\alpha)$ is a pruned subtree of all trees that minimize total cost).

Note: $T' \le T$ means $T'$ is a pruned subtree of $T$, i.e. it can be obtained by pruning $T$ in 0 or more nodes.
Sequence of subtrees

Construct a decreasing sequence of pruned subtrees of $T_{max}$:
$$T_{max} > T_1 > T_2 > T_3 > \ldots > \{t_1\}$$
(where $t_1$ is the root node of the tree) such that $T_k$ is the smallest minimizing subtree for $\alpha \in [\alpha_k, \alpha_{k+1})$.

Note: from a computational viewpoint, the important property is that $T_{k+1}$ is a pruned subtree of $T_k$, i.e. it can be obtained by pruning $T_k$. No backtracking is required.
Decomposition of total cost

Total cost has an additive decomposition over the leaf nodes of a tree:
$$C_\alpha(T) = \sum_{t \in \tilde{T}} (R(t) + \alpha)$$

Here $R(t)$ is the number of errors we make in node $t$ if we predict the majority class, divided by the total number of observations in the training sample.
Effect on cost of pruning in node t

Before pruning in $t$, the branch $T_t$ contributes
$$C_\alpha(T_t) = \sum_{t' \in \tilde{T}_t} (R(t') + \alpha)$$
to the total cost. After pruning in $t$, the single leaf $t$ contributes
$$C_\alpha(\{t\}) = R(t) + \alpha$$
Finding the $T_k$ and corresponding $\alpha_k$

$T_t$: the branch of $T$ with root node $t$. After pruning in $t$, its contribution to total cost is
$$C_\alpha(\{t\}) = R(t) + \alpha$$

The contribution of $T_t$ to the total cost is
$$C_\alpha(T_t) = \sum_{t' \in \tilde{T}_t} (R(t') + \alpha) = R(T_t) + \alpha |\tilde{T}_t|$$

$T - T_t$ becomes the better choice than $T$ once $C_\alpha(\{t\}) = C_\alpha(T_t)$: at that point their total costs are equal, and the smaller tree is preferred.
Computing contributions to total cost of T

[Figure: example tree with class frequencies per node; the root $t_1$ contains 100 cases of each class (200 in total), and all leaf nodes are pure.]

$$C_\alpha(\{t_2\}) = R(t_2) + \alpha = \frac{3}{10} + \alpha$$
$$C_\alpha(T_{t_2}) = R(T_{t_2}) + \alpha |\tilde{T}_{t_2}| = \sum_{t' \in \tilde{T}_{t_2}} R(t') + \alpha |\tilde{T}_{t_2}| = 0 + 3\alpha$$
Solving for α

The total costs of $T$ and $T - T_t$ become equal when
$$C_\alpha(\{t\}) = C_\alpha(T_t)$$

At what value of $\alpha$ does this happen?
$$R(t) + \alpha = R(T_t) + \alpha |\tilde{T}_t|$$

Solving for $\alpha$ we get
$$\alpha = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}$$

Note: for this value of $\alpha$ the total cost of $T$ and $T - T_t$ is the same, but $T - T_t$ is preferred because we want the smallest minimizing subtree.
Computing g(t): the "critical" α value for node t

For each non-terminal node $t$ we compute its "critical" alpha value:
$$g(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}$$

In words:
$$g(t) = \frac{\text{increase in error due to pruning in } t}{\text{decrease in number of leaf nodes due to pruning in } t}$$

Subsequently, we prune in the nodes for which $g(t)$ is smallest (the "weakest links"). This process is repeated until we reach the root node; see the sketch below.
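A sketch of the resulting procedure in Python. The node representation is illustrative (each node stores the number of errors it would make as a leaf), and the exact-equality test on floats is simplified for exposition:

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Node:
    errors: int                         # errors if this node were a leaf (majority rule)
    children: list = field(default_factory=list)

def leaf_stats(t):
    """(total error count, number of leaves) of the branch T_t."""
    if not t.children:
        return t.errors, 1
    err = leaves = 0
    for c in t.children:
        e, l = leaf_stats(c)
        err, leaves = err + e, leaves + l
    return err, leaves

def g(t, n):
    """g(t) = (R(t) - R(T_t)) / (|T~_t| - 1), with n the training sample size."""
    branch_err, leaves = leaf_stats(t)
    return (t.errors - branch_err) / (n * (leaves - 1))

def internal_nodes(t):
    if t.children:
        yield t
        for c in t.children:
            yield from internal_nodes(c)

def prune_weakest_links(root, n):
    """Prune in every node attaining the minimal g(t); return that critical alpha."""
    alpha = min(g(t, n) for t in internal_nodes(root))
    for t in list(internal_nodes(root)):
        if g(t, n) == alpha:            # a weakest link
            t.children = []             # the branch T_t collapses into the leaf {t}
    return alpha

def cost_complexity_sequence(root, n):
    """Repeat weakest-link pruning to obtain the nested sequence (alpha_k, T_k)."""
    seq = [(0.0, copy.deepcopy(root))]
    while root.children:
        alpha = prune_weakest_links(root, n)
        seq.append((alpha, copy.deepcopy(root)))
    return seq
```

Each call prunes at least one node, so the loop terminates with the root-only tree $\{t_1\}$; the returned $\alpha_k$ values mark the interval boundaries from the earlier slide.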
Computing g(t): the "critical" α value for node t

[Figure: the example tree again, with the critical value $g(t)$ attached to each internal node.]

$$g(t_1) = \frac{1}{8}, \quad g(t_2) = \frac{3}{20}, \quad g(t_3) = \frac{1}{20}, \quad g(t_5) = \frac{1}{20}$$
Computing g(t): the "critical" α value for node t

Calculation examples:
$$g(t_1) = \frac{R(t_1) - R(T_{t_1})}{|\tilde{T}_{t_1}| - 1} = \frac{1/2 - 0}{5 - 1} = \frac{1}{8}$$
$$g(t_2) = \frac{R(t_2) - R(T_{t_2})}{|\tilde{T}_{t_2}| - 1} = \frac{3/10 - 0}{3 - 1} = \frac{3}{20}$$
$$g(t_3) = \frac{R(t_3) - R(T_{t_3})}{|\tilde{T}_{t_3}| - 1} = \frac{1/20 - 0}{2 - 1} = \frac{1}{20}$$
$$g(t_5) = \frac{R(t_5) - R(T_{t_5})}{|\tilde{T}_{t_5}| - 1} = \frac{1/20 - 0}{2 - 1} = \frac{1}{20}$$
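These numbers are easy to check with a few lines of standalone arithmetic (error counts and leaf counts taken from the example tree):

```python
n = 200   # training cases in the example (100 per class in the root)
# node: (errors if made a leaf, number of leaves of its branch)
nodes = {"t1": (100, 5), "t2": (60, 3), "t3": (10, 2), "t5": (10, 2)}
for name, (errors, leaves) in nodes.items():
    print(name, (errors / n) / (leaves - 1))
# t1 0.125 (= 1/8), t2 0.15 (= 3/20), t3 0.05, t5 0.05 -> weakest links: t3 and t5
```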
Finding the weakest links

[Figure: the example tree with its $g(t)$ values; the minimum $g(t) = \frac{1}{20}$ is attained in $t_3$ and $t_5$, the weakest links.]

$$g(t_1) = \frac{1}{8}, \quad g(t_2) = \frac{3}{20}, \quad g(t_3) = \frac{1}{20}, \quad g(t_5) = \frac{1}{20}$$
Pruning in the weakest links

[Figure: the tree after pruning in $t_3$ and $t_5$: only $t_1, t_2, t_3, t_4, t_5$ remain, with $t_3$ and $t_5$ now leaf nodes.]

By pruning in the weakest links we obtain the next tree in the sequence.