Applied Machine Learning: Decision Trees
Siamak Ravanbakhsh, COMP 551 (Winter 2020)

Learning objectives - decision trees: the model, the cost function, how it ...


Cost function

Objective: find a decision tree minimizing the cost function.

- Regression cost: in region $R_k$ we predict a constant $w_k \in \mathbb{R}$; the cost per region is the mean squared error (MSE),
  $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} (y^{(n)} - w_k)^2$, with $w_k = \text{mean}(y^{(n)} \mid x^{(n)} \in R_k)$.
- Classification cost: we predict a constant class $w_k \in \{1, \dots, C\}$; the cost per region is the misclassification rate,
  $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k)$, with $w_k = \text{mode}(y^{(n)} \mid x^{(n)} \in R_k)$.
- Here $N_k$ is the number of instances in region $k$; in both cases the total cost is the normalized sum $\sum_k \frac{N_k}{N} \text{cost}(R_k, \mathcal{D})$ (a code sketch of these costs follows below).
- It is sometimes possible to build a tree with zero cost: build a large tree in which each instance has its own region (overfitting!).
- New objective: find a decision tree with K tests minimizing the cost function.
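Below is a minimal numpy sketch of the two per-region costs and the normalized total; the function names and the list-of-regions representation are illustrative choices, not from the slides.

```python
import numpy as np

def regression_cost(y):
    """MSE of a region when predicting the constant w_k = mean(y)."""
    return np.mean((y - y.mean()) ** 2)

def classification_cost(y):
    """Misclassification rate when predicting the constant class w_k = mode(y)."""
    values, counts = np.unique(y, return_counts=True)
    w = values[np.argmax(counts)]
    return np.mean(y != w)

def total_cost(regions, cost_fn):
    """Normalized sum over regions: sum_k (N_k / N) * cost(R_k)."""
    N = sum(len(y) for y in regions)
    return sum(len(y) / N * cost_fn(np.asarray(y)) for y in regions)

print(classification_cost(np.array([0, 0, 1, 1, 1])))  # 0.4 (mode is 1, 2/5 misclassified)
print(regression_cost(np.array([1.0, 2.0, 3.0])))      # 0.666... (mean is 2)
```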

Search space

Objective: find a decision tree with K tests (K+1 regions) minimizing the cost function; alternatively, find the smallest tree (smallest K) that classifies all examples correctly. Note that not every partition of the input space can be produced by a decision tree.

Assuming D features, how many different partitions of size K+1 are there?

- The number of full binary trees with K+1 leaves (regions) is the Catalan number $\frac{1}{K+1}\binom{2K}{K}$, which is exponential in K:
  1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452, ...
  (a quick check of this sequence follows below).
- We also have a choice of feature $x_d$ at each of the K internal nodes: $D^K$ possibilities.
- Moreover, for each feature there are different choices of the splitting threshold $s_{d,n} \in S_d$.

Bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem.
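A quick sanity check of the Catalan numbers quoted above; this is just $\frac{1}{K+1}\binom{2K}{K}$ evaluated for small K.

```python
from math import comb

def catalan(K):
    """Number of full binary trees with K+1 leaves (K internal test nodes)."""
    return comb(2 * K, K) // (K + 1)

print([catalan(K) for K in range(10)])
# [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]
```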

Greedy heuristic

Recursively split the regions based on a greedy choice of the next test, and end the recursion if splitting is not worthwhile (a Python sketch of this recursion follows below):

    function fit-tree(R_node, D, depth)
        R_left, R_right = greedy-test(R_node, D)
        if not worth-splitting(depth, R_left, R_right)
            return R_node
        else
            left-set  = fit-tree(R_left,  D, depth+1)
            right-set = fit-tree(R_right, D, depth+1)
            return {left-set, right-set}

The final decision tree is a nested list of regions, e.g. {{R_1, R_2}, {R_3, {R_4, R_5}}}.
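A minimal Python sketch of the recursion above; greedy_test and worth_splitting are assumed to be the subroutines sketched after the next slides, and the region/data representation is a placeholder rather than the course code.

```python
def fit_tree(region, data, depth=0):
    """Greedy, recursive tree growing: returns a nested list of regions."""
    left, right = greedy_test(region, data)        # best one-step split
    if not worth_splitting(depth, left, right):
        return region                              # leaf: predict the region constant
    return [fit_tree(left, data, depth + 1),
            fit_tree(right, data, depth + 1)]      # nested list, e.g. [[R1, R2], [R3, [R4, R5]]]
```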

Choosing tests

The split is greedy because it only looks one step ahead; this may not lead to the lowest overall cost.

    function greedy-test(R_node, D)
        best-cost = +infinity                       # so the first candidate split is accepted
        for d in {1, ..., D}, s_{d,n} in S_d
            R_left  = R_node ∩ {x_d < s_{d,n}}      # creating new regions
            R_right = R_node ∩ {x_d >= s_{d,n}}
            split-cost = (N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D)   # evaluate their cost
            if split-cost < best-cost
                best-cost = split-cost
                R*_left  = R_left
                R*_right = R_right
        return R*_left, R*_right                    # the split with the lowest greedy cost

(A numpy sketch of this subroutine follows below.)
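A numpy sketch of greedy-test for axis-aligned threshold splits, assuming a node is represented by the data (X, y) it contains and `cost` is one of the per-region costs sketched earlier; the names and the return value (boolean masks for the two children) are illustrative.

```python
import numpy as np

def greedy_test(X, y, cost):
    """Return boolean masks (left, right) of the lowest-cost single split."""
    N, D = X.shape
    best_cost, best_left, best_right = np.inf, None, None   # start at +infinity
    for d in range(D):
        for s in np.unique(X[:, d]):          # candidate thresholds S_d taken from the data
            left, right = X[:, d] < s, X[:, d] >= s
            if not left.any() or not right.any():
                continue                      # skip degenerate splits
            split_cost = (left.sum() / N) * cost(y[left]) \
                       + (right.sum() / N) * cost(y[right])
            if split_cost < best_cost:        # keep the split with the lowest greedy cost
                best_cost, best_left, best_right = split_cost, left, right
    return best_left, best_right
```

For example, `greedy_test(X, y, classification_cost)` uses the misclassification rate as the per-region cost.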

Stopping the recursion

The worth-splitting subroutine: if we only stop when $R_{\text{node}}$ has zero cost, we may overfit.

Heuristics for stopping the splitting (a sketch follows below):
- we reached a desired depth;
- the number of examples in $R_{\text{left}}$ or $R_{\text{right}}$ is too small;
- $w_k$ is already a good approximation, i.e. the cost is small enough;
- the reduction in cost from splitting is small:
  $\text{cost}(R_{\text{node}}, \mathcal{D}) - \left( \frac{N_{\text{left}}}{N_{\text{node}}} \text{cost}(R_{\text{left}}, \mathcal{D}) + \frac{N_{\text{right}}}{N_{\text{node}}} \text{cost}(R_{\text{right}}, \mathcal{D}) \right)$.

Image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
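A sketch of a worth-splitting test combining the heuristics above; the thresholds (maximum depth, minimum region size, minimum cost reduction) are illustrative hyperparameters, not values from the course.

```python
def worth_splitting(depth, y_node, y_left, y_right, cost,
                    max_depth=5, min_samples=5, min_gain=1e-3):
    if depth >= max_depth:                    # reached the desired depth
        return False
    if len(y_left) < min_samples or len(y_right) < min_samples:
        return False                          # one of the children is too small
    N = len(y_node)
    gain = cost(y_node) - (len(y_left) / N * cost(y_left)
                           + len(y_right) / N * cost(y_right))
    return gain > min_gain                    # split only if the cost reduction is large enough
```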

Revisiting the classification cost

Ideally we want to optimize the 0-1 loss (misclassification rate)
$\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k)$,
but this may not be the best cost for each step of the greedy heuristic.

Example (figure: each region is annotated with (fraction of class 1, fraction of the data); the node $R_{\text{node}}$ is (.5, 100%); one split gives regions (.25, 50%) and (.75, 50%), the other gives (.33, 75%) and (1, 25%)). Both splits have the same misclassification rate (2/8); however, the second split may be preferable because one of its regions needs no further splitting. This motivates using a measure of the homogeneity of the labels in each region.

Entropy

Entropy is the expected amount of information from observing a random variable $y$ (it is common to use capital letters for random variables; here, for consistency, we use lower case):
$H(y) = -\sum_{c=1}^{C} p(y=c) \log p(y=c)$

$-\log p(y=c)$ is the amount of information in observing $c$:
- zero information if $p(c) = 1$;
- less probable events are more informative: $p(c) < p(c') \Rightarrow -\log p(c) > -\log p(c')$;
- information from two independent events is additive: $-\log(p(c)\,q(d)) = -\log p(c) - \log q(d)$.

A uniform distribution has the highest entropy: $H(y) = -\sum_{c=1}^{C} \frac{1}{C} \log \frac{1}{C} = \log C$.
A deterministic random variable has the lowest entropy: $H(y) = -1 \cdot \log(1) = 0$. (A small sketch follows below.)
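A small sketch of the entropy of a label vector (base-2 logs, matching the example later), covering the two extreme cases above:

```python
import numpy as np

def entropy(y):
    """Entropy (in bits) of the empirical label distribution."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 1, 2, 3]))   # uniform over C=4 classes -> log2(4) = 2.0
print(entropy([1, 1, 1, 1]))   # deterministic -> zero entropy
```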

Mutual information

For two random variables $t$ and $y$, the mutual information is the amount of information $t$ conveys about $y$: the change in the entropy of $y$ after observing the value of $t$,
$I(t, y) = H(y) - H(y \mid t)$,
where the conditional entropy is $H(y \mid t) = \sum_{l=1}^{L} p(t=l)\, H(y \mid t=l)$.

Equivalently,
$I(t, y) = \sum_l \sum_c p(y=c, t=l) \log \frac{p(y=c, t=l)}{p(y=c)\, p(t=l)}$,
which is symmetric with respect to $y$ and $t$: $I(t, y) = H(t) - H(t \mid y) = I(y, t)$.

Mutual information is always non-negative, and it is zero only if $y$ and $t$ are independent (try to prove these properties; a sketch follows below).
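A sketch of mutual information estimated from paired samples of t and y, using the symmetric form above (base-2 logs; the helper name is mine):

```python
import numpy as np

def mutual_information(t, y):
    """I(t, y) = sum_{l,c} p(t=l, y=c) log2( p(t=l, y=c) / (p(t=l) p(y=c)) )."""
    t, y = np.asarray(t), np.asarray(y)
    I = 0.0
    for tv in np.unique(t):
        for yv in np.unique(y):
            p_ty = np.mean((t == tv) & (y == yv))
            if p_ty > 0:
                I += p_ty * np.log2(p_ty / (np.mean(t == tv) * np.mean(y == yv)))
    return I

print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent -> 0.0
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # t determines y -> H(y) = 1.0 bit
```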

Entropy for classification cost

We care about the distribution of labels in each region:
$p_k(y = c) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} = c)$.

- Misclassification cost: $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k) = 1 - p_k(w_k)$, where $w_k = \arg\max_c p_k(c)$ is the most probable class.
- Entropy cost: $\text{cost}(R_k, \mathcal{D}) = H(y)$ under $p_k$; choose the split with the lowest entropy.

With the entropy cost, the reduction in cost from a split becomes the mutual information between the test and the labels:
$\text{cost}(R_{\text{node}}, \mathcal{D}) - \left( \frac{N_{\text{left}}}{N_{\text{node}}} \text{cost}(R_{\text{left}}, \mathcal{D}) + \frac{N_{\text{right}}}{N_{\text{node}}} \text{cost}(R_{\text{right}}, \mathcal{D}) \right)$
$= H(y) - \big( p(x_d \geq s_{d,n})\, H(y \mid x_d \geq s_{d,n}) + p(x_d < s_{d,n})\, H(y \mid x_d < s_{d,n}) \big) = I(y, \, x_d \geq s_{d,n})$,
so we are choosing the test that is maximally informative about the labels (a sketch follows below).
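A self-contained sketch of the information gain of a test: the drop in entropy cost, which equals the mutual information between the labels and the test outcome (names are illustrative):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, test):
    """y: labels in the node; test: boolean array with the outcome of x_d >= s."""
    y, test = np.asarray(y), np.asarray(test)
    if test.all() or not test.any():
        return 0.0                            # degenerate split: no information
    p = test.mean()
    return entropy(y) - (p * entropy(y[test]) + (1 - p) * entropy(y[~test]))

print(information_gain([0, 0, 1, 1], [False, False, True, True]))  # 1.0 bit (perfect test)
print(information_gain([0, 1, 0, 1], [False, False, True, True]))  # 0.0 (uninformative test)
```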

Entropy for classification cost: example

Same example as before: the node is (.5, 100%); split A gives regions (.25, 50%) and (.75, 50%), split B gives (.33, 75%) and (1, 25%).

Misclassification cost (the same for both splits):
split A: $\frac{4}{8} \cdot \frac{1}{4} + \frac{4}{8} \cdot \frac{1}{4} = \frac{1}{4}$; split B: $\frac{6}{8} \cdot \frac{1}{3} + \frac{2}{8} \cdot 0 = \frac{1}{4}$.

Entropy cost (using the base-2 logarithm):
split A: $\frac{4}{8}\left(-\frac{1}{4}\log\frac{1}{4} - \frac{3}{4}\log\frac{3}{4}\right) + \frac{4}{8}\left(-\frac{1}{4}\log\frac{1}{4} - \frac{3}{4}\log\frac{3}{4}\right) \approx 0.81$;
split B: $\frac{6}{8}\left(-\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3}\right) + \frac{2}{8} \cdot 0 \approx 0.69$, the lower-cost split.
(These numbers are reproduced in the sketch below.)
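Reproducing the numbers above (values in bits); the binary-entropy helper is an illustrative name:

```python
import numpy as np

def entropy2(p):
    """Binary entropy of a region whose class-1 fraction is p."""
    ps = np.array([p, 1 - p])
    ps = ps[ps > 0]
    return -np.sum(ps * np.log2(ps))

# split A: two regions of 4 examples each, class-1 fractions 0.25 and 0.75
print(4/8 * entropy2(0.25) + 4/8 * entropy2(0.75))   # ~0.811
# split B: 6 examples with fraction 1/3, 2 examples with fraction 1
print(6/8 * entropy2(1/3) + 2/8 * entropy2(1.0))     # ~0.689 (the lower-cost split)
```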

Gini index

Another cost for selecting the test in classification. So far:
- misclassification (error) rate: $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k) = 1 - p_k(w_k)$;
- entropy: $\text{cost}(R_k, \mathcal{D}) = H(y)$.

Gini index: the expected error rate,
$\text{cost}(R_k, \mathcal{D}) = \sum_{c=1}^{C} p_k(c)\,(1 - p_k(c))$
(the probability of class $c$ times the probability of error on it)
$= \sum_{c=1}^{C} p_k(c) - \sum_{c=1}^{C} p_k(c)^2 = 1 - \sum_{c=1}^{C} p_k(c)^2$.

(Figure: comparison of the three costs for a node as a function of $p(y=1)$ when there are two classes; a small sketch follows below.)
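A small sketch of the three per-region costs for a binary node as a function of p = p(y = 1), the quantity compared in the figure (entropy in bits; helper names are mine):

```python
import numpy as np

def misclassification(p):
    return min(p, 1 - p)

def gini(p):
    return 2 * p * (1 - p)            # sum_c p_c (1 - p_c) with two classes

def entropy_binary(p):
    return sum(-q * np.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.25, 0.5):
    print(p, misclassification(p), gini(p), entropy_binary(p))
```

All three are zero for a pure node and largest at p = 0.5; entropy and Gini are strictly concave, which is why they can prefer one of two splits that the misclassification rate cannot distinguish.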

Example

Decision tree for the Iris dataset (D = 2 features): the figure shows the dataset, the fitted decision tree, and the resulting decision boundaries. The decision boundaries suggest overfitting, which is confirmed using a validation set: training accuracy ~85%, (cross-)validation accuracy ~70%. (A scikit-learn sketch follows below.)
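A scikit-learn sketch of this experiment, assuming the first two Iris features are used; the exact accuracies depend on the train/validation split and the tree depth, so they will not match the slide's ~85% / ~70% exactly, but the gap illustrates the same overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                   # keep D = 2 features (assumed choice)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier()                # no depth limit: prone to overfitting
tree.fit(X_tr, y_tr)
print("training accuracy:  ", tree.score(X_tr, y_tr))
print("validation accuracy:", tree.score(X_va, y_va))
```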

Overfitting

A decision tree can fit any Boolean function (binary classification with binary features); for example, a decision tree can represent a Boolean function of D = 3 variables (image credit: https://www.wikiwand.com/en/Binary_decision_diagram). There are $2^{2^D}$ such functions. Why?

Large decision trees have high variance and low bias (low training error, high test error).

Idea 1: grow a small tree. (A small demonstration follows below.)
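A small demonstration that a decision tree can represent a Boolean function exactly: fitting the D = 2 XOR truth table drives the training error to zero with a depth-2 tree (scikit-learn is used only for convenience).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR of the two binary features

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(X))                          # [0 1 1 0]: zero training error
print(tree.get_depth())                         # 2: one test per feature on each path
```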
