Applied Machine Learning: Decision Trees
Siamak Ravanbakhsh, COMP 551 (Winter 2020)

Learning objectives - decision trees: the model, the cost function, how it ...


Cost function

Objective: find a decision tree minimizing the cost function.

- Regression cost: in region $R_k$ we predict a constant $w_k \in \mathbb{R}$; the cost per region is the mean squared error (MSE),
  $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} (y^{(n)} - w_k)^2$, with $w_k = \text{mean}(y^{(n)} \mid x^{(n)} \in R_k)$.
- Classification cost: we predict a constant class $w_k \in \{1, \dots, C\}$; the cost per region is the misclassification rate,
  $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k)$, with $w_k = \text{mode}(y^{(n)} \mid x^{(n)} \in R_k)$.
- Here $N_k$ is the number of instances in region $k$; in both cases the total cost is the normalized sum $\sum_k \frac{N_k}{N} \text{cost}(R_k, \mathcal{D})$ (a code sketch of these costs follows below).
- It is sometimes possible to build a tree with zero cost: build a large tree in which each instance has its own region (overfitting!).
- New objective: find a decision tree with K tests minimizing the cost function.
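Below is a minimal numpy sketch of the two per-region costs and the normalized total; the function names and the list-of-regions representation are illustrative choices, not from the slides.

```python
import numpy as np

def regression_cost(y):
    """MSE of a region when predicting the constant w_k = mean(y)."""
    return np.mean((y - y.mean()) ** 2)

def classification_cost(y):
    """Misclassification rate when predicting the constant class w_k = mode(y)."""
    values, counts = np.unique(y, return_counts=True)
    w = values[np.argmax(counts)]
    return np.mean(y != w)

def total_cost(regions, cost_fn):
    """Normalized sum over regions: sum_k (N_k / N) * cost(R_k)."""
    N = sum(len(y) for y in regions)
    return sum(len(y) / N * cost_fn(np.asarray(y)) for y in regions)

print(classification_cost(np.array([0, 0, 1, 1, 1])))  # 0.4 (mode is 1, 2/5 misclassified)
print(regression_cost(np.array([1.0, 2.0, 3.0])))      # 0.666... (mean is 2)
```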

Search space

Objective: find a decision tree with K tests (K+1 regions) minimizing the cost function; alternatively, find the smallest tree (smallest K) that classifies all examples correctly. Note that not every partition of the input space can be produced by a decision tree.

Assuming D features, how many different partitions of size K+1 are there?

- The number of full binary trees with K+1 leaves (regions) is the Catalan number $\frac{1}{K+1}\binom{2K}{K}$, which is exponential in K:
  1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452, ...
  (a quick check of this sequence follows below).
- We also have a choice of feature $x_d$ at each of the K internal nodes: $D^K$ possibilities.
- Moreover, for each feature there are different choices of the splitting threshold $s_{d,n} \in S_d$.

Bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem.
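A quick sanity check of the Catalan numbers quoted above; this is just $\frac{1}{K+1}\binom{2K}{K}$ evaluated for small K.

```python
from math import comb

def catalan(K):
    """Number of full binary trees with K+1 leaves (K internal test nodes)."""
    return comb(2 * K, K) // (K + 1)

print([catalan(K) for K in range(10)])
# [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]
```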

Greedy heuristic

Recursively split the regions based on a greedy choice of the next test, and end the recursion if splitting is not worthwhile (a Python sketch of this recursion follows below):

    function fit-tree(R_node, D, depth)
        R_left, R_right = greedy-test(R_node, D)
        if not worth-splitting(depth, R_left, R_right)
            return R_node
        else
            left-set  = fit-tree(R_left,  D, depth+1)
            right-set = fit-tree(R_right, D, depth+1)
            return {left-set, right-set}

The final decision tree is a nested list of regions, e.g. {{R_1, R_2}, {R_3, {R_4, R_5}}}.
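A minimal Python sketch of the recursion above; greedy_test and worth_splitting are assumed to be the subroutines sketched after the next slides, and the region/data representation is a placeholder rather than the course code.

```python
def fit_tree(region, data, depth=0):
    """Greedy, recursive tree growing: returns a nested list of regions."""
    left, right = greedy_test(region, data)        # best one-step split
    if not worth_splitting(depth, left, right):
        return region                              # leaf: predict the region constant
    return [fit_tree(left, data, depth + 1),
            fit_tree(right, data, depth + 1)]      # nested list, e.g. [[R1, R2], [R3, [R4, R5]]]
```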

Choosing tests

The split is greedy because it only looks one step ahead; this may not lead to the lowest overall cost.

    function greedy-test(R_node, D)
        best-cost = +infinity                       # so the first candidate split is accepted
        for d in {1, ..., D}, s_{d,n} in S_d
            R_left  = R_node ∩ {x_d < s_{d,n}}      # creating new regions
            R_right = R_node ∩ {x_d >= s_{d,n}}
            split-cost = (N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D)   # evaluate their cost
            if split-cost < best-cost
                best-cost = split-cost
                R*_left  = R_left
                R*_right = R_right
        return R*_left, R*_right                    # the split with the lowest greedy cost

(A numpy sketch of this subroutine follows below.)
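A numpy sketch of greedy-test for axis-aligned threshold splits, assuming a node is represented by the data (X, y) it contains and `cost` is one of the per-region costs sketched earlier; the names and the return value (boolean masks for the two children) are illustrative.

```python
import numpy as np

def greedy_test(X, y, cost):
    """Return boolean masks (left, right) of the lowest-cost single split."""
    N, D = X.shape
    best_cost, best_left, best_right = np.inf, None, None   # start at +infinity
    for d in range(D):
        for s in np.unique(X[:, d]):          # candidate thresholds S_d taken from the data
            left, right = X[:, d] < s, X[:, d] >= s
            if not left.any() or not right.any():
                continue                      # skip degenerate splits
            split_cost = (left.sum() / N) * cost(y[left]) \
                       + (right.sum() / N) * cost(y[right])
            if split_cost < best_cost:        # keep the split with the lowest greedy cost
                best_cost, best_left, best_right = split_cost, left, right
    return best_left, best_right
```

For example, `greedy_test(X, y, classification_cost)` uses the misclassification rate as the per-region cost.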

Stopping the recursion

The worth-splitting subroutine: if we only stop when $R_{\text{node}}$ has zero cost, we may overfit.

Heuristics for stopping the splitting (a sketch follows below):
- we reached a desired depth;
- the number of examples in $R_{\text{left}}$ or $R_{\text{right}}$ is too small;
- $w_k$ is already a good approximation, i.e. the cost is small enough;
- the reduction in cost from splitting is small:
  $\text{cost}(R_{\text{node}}, \mathcal{D}) - \left( \frac{N_{\text{left}}}{N_{\text{node}}} \text{cost}(R_{\text{left}}, \mathcal{D}) + \frac{N_{\text{right}}}{N_{\text{node}}} \text{cost}(R_{\text{right}}, \mathcal{D}) \right)$.

Image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
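A sketch of a worth-splitting test combining the heuristics above; the thresholds (maximum depth, minimum region size, minimum cost reduction) are illustrative hyperparameters, not values from the course.

```python
def worth_splitting(depth, y_node, y_left, y_right, cost,
                    max_depth=5, min_samples=5, min_gain=1e-3):
    if depth >= max_depth:                    # reached the desired depth
        return False
    if len(y_left) < min_samples or len(y_right) < min_samples:
        return False                          # one of the children is too small
    N = len(y_node)
    gain = cost(y_node) - (len(y_left) / N * cost(y_left)
                           + len(y_right) / N * cost(y_right))
    return gain > min_gain                    # split only if the cost reduction is large enough
```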

Revisiting the classification cost

Ideally we want to optimize the 0-1 loss (misclassification rate)
$\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k)$,
but this may not be the best cost for each step of the greedy heuristic.

Example (figure: each region is annotated with (fraction of class 1, fraction of the data); the node $R_{\text{node}}$ is (.5, 100%); one split gives regions (.25, 50%) and (.75, 50%), the other gives (.33, 75%) and (1, 25%)). Both splits have the same misclassification rate (2/8); however, the second split may be preferable because one of its regions needs no further splitting. This motivates using a measure of the homogeneity of the labels in each region.

Entropy

Entropy is the expected amount of information from observing a random variable $y$ (it is common to use capital letters for random variables; here, for consistency, we use lower case):
$H(y) = -\sum_{c=1}^{C} p(y=c) \log p(y=c)$

$-\log p(y=c)$ is the amount of information in observing $c$:
- zero information if $p(c) = 1$;
- less probable events are more informative: $p(c) < p(c') \Rightarrow -\log p(c) > -\log p(c')$;
- information from two independent events is additive: $-\log(p(c)\,q(d)) = -\log p(c) - \log q(d)$.

A uniform distribution has the highest entropy: $H(y) = -\sum_{c=1}^{C} \frac{1}{C} \log \frac{1}{C} = \log C$.
A deterministic random variable has the lowest entropy: $H(y) = -1 \cdot \log(1) = 0$. (A small sketch follows below.)
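A small sketch of the entropy of a label vector (base-2 logs, matching the example later), covering the two extreme cases above:

```python
import numpy as np

def entropy(y):
    """Entropy (in bits) of the empirical label distribution."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 1, 2, 3]))   # uniform over C=4 classes -> log2(4) = 2.0
print(entropy([1, 1, 1, 1]))   # deterministic -> zero entropy
```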

Mutual information

For two random variables $t$ and $y$, the mutual information is the amount of information $t$ conveys about $y$: the change in the entropy of $y$ after observing the value of $t$,
$I(t, y) = H(y) - H(y \mid t)$,
where the conditional entropy is $H(y \mid t) = \sum_{l=1}^{L} p(t=l)\, H(y \mid t=l)$.

Equivalently,
$I(t, y) = \sum_l \sum_c p(y=c, t=l) \log \frac{p(y=c, t=l)}{p(y=c)\, p(t=l)}$,
which is symmetric with respect to $y$ and $t$: $I(t, y) = H(t) - H(t \mid y) = I(y, t)$.

Mutual information is always non-negative, and it is zero only if $y$ and $t$ are independent (try to prove these properties; a sketch follows below).
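A sketch of mutual information estimated from paired samples of t and y, using the symmetric form above (base-2 logs; the helper name is mine):

```python
import numpy as np

def mutual_information(t, y):
    """I(t, y) = sum_{l,c} p(t=l, y=c) log2( p(t=l, y=c) / (p(t=l) p(y=c)) )."""
    t, y = np.asarray(t), np.asarray(y)
    I = 0.0
    for tv in np.unique(t):
        for yv in np.unique(y):
            p_ty = np.mean((t == tv) & (y == yv))
            if p_ty > 0:
                I += p_ty * np.log2(p_ty / (np.mean(t == tv) * np.mean(y == yv)))
    return I

print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent -> 0.0
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # t determines y -> H(y) = 1.0 bit
```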

Entropy for classification cost

We care about the distribution of labels in each region:
$p_k(y = c) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} = c)$.

- Misclassification cost: $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k) = 1 - p_k(w_k)$, where $w_k = \arg\max_c p_k(c)$ is the most probable class.
- Entropy cost: $\text{cost}(R_k, \mathcal{D}) = H(y)$ under $p_k$; choose the split with the lowest entropy.

With the entropy cost, the reduction in cost from a split becomes the mutual information between the test and the labels:
$\text{cost}(R_{\text{node}}, \mathcal{D}) - \left( \frac{N_{\text{left}}}{N_{\text{node}}} \text{cost}(R_{\text{left}}, \mathcal{D}) + \frac{N_{\text{right}}}{N_{\text{node}}} \text{cost}(R_{\text{right}}, \mathcal{D}) \right)$
$= H(y) - \big( p(x_d \geq s_{d,n})\, H(y \mid x_d \geq s_{d,n}) + p(x_d < s_{d,n})\, H(y \mid x_d < s_{d,n}) \big) = I(y, \, x_d \geq s_{d,n})$,
so we are choosing the test that is maximally informative about the labels (a sketch follows below).
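A self-contained sketch of the information gain of a test: the drop in entropy cost, which equals the mutual information between the labels and the test outcome (names are illustrative):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, test):
    """y: labels in the node; test: boolean array with the outcome of x_d >= s."""
    y, test = np.asarray(y), np.asarray(test)
    if test.all() or not test.any():
        return 0.0                            # degenerate split: no information
    p = test.mean()
    return entropy(y) - (p * entropy(y[test]) + (1 - p) * entropy(y[~test]))

print(information_gain([0, 0, 1, 1], [False, False, True, True]))  # 1.0 bit (perfect test)
print(information_gain([0, 1, 0, 1], [False, False, True, True]))  # 0.0 (uninformative test)
```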

Entropy for classification cost: example

Same example as before: the node is (.5, 100%); split A gives regions (.25, 50%) and (.75, 50%), split B gives (.33, 75%) and (1, 25%).

Misclassification cost (the same for both splits):
split A: $\frac{4}{8} \cdot \frac{1}{4} + \frac{4}{8} \cdot \frac{1}{4} = \frac{1}{4}$; split B: $\frac{6}{8} \cdot \frac{1}{3} + \frac{2}{8} \cdot 0 = \frac{1}{4}$.

Entropy cost (using the base-2 logarithm):
split A: $\frac{4}{8}\left(-\frac{1}{4}\log\frac{1}{4} - \frac{3}{4}\log\frac{3}{4}\right) + \frac{4}{8}\left(-\frac{1}{4}\log\frac{1}{4} - \frac{3}{4}\log\frac{3}{4}\right) \approx 0.81$;
split B: $\frac{6}{8}\left(-\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3}\right) + \frac{2}{8} \cdot 0 \approx 0.69$, the lower-cost split.
(These numbers are reproduced in the sketch below.)
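Reproducing the numbers above (values in bits); the binary-entropy helper is an illustrative name:

```python
import numpy as np

def entropy2(p):
    """Binary entropy of a region whose class-1 fraction is p."""
    ps = np.array([p, 1 - p])
    ps = ps[ps > 0]
    return -np.sum(ps * np.log2(ps))

# split A: two regions of 4 examples each, class-1 fractions 0.25 and 0.75
print(4/8 * entropy2(0.25) + 4/8 * entropy2(0.75))   # ~0.811
# split B: 6 examples with fraction 1/3, 2 examples with fraction 1
print(6/8 * entropy2(1/3) + 2/8 * entropy2(1.0))     # ~0.689 (the lower-cost split)
```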

Gini index

Another cost for selecting the test in classification. So far:
- misclassification (error) rate: $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k) = 1 - p_k(w_k)$;
- entropy: $\text{cost}(R_k, \mathcal{D}) = H(y)$.

Gini index: the expected error rate,
$\text{cost}(R_k, \mathcal{D}) = \sum_{c=1}^{C} p_k(c)\,(1 - p_k(c))$
(the probability of class $c$ times the probability of error on it)
$= \sum_{c=1}^{C} p_k(c) - \sum_{c=1}^{C} p_k(c)^2 = 1 - \sum_{c=1}^{C} p_k(c)^2$.

(Figure: comparison of the three costs for a node as a function of $p(y=1)$ when there are two classes; a small sketch follows below.)
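A small sketch of the three per-region costs for a binary node as a function of p = p(y = 1), the quantity compared in the figure (entropy in bits; helper names are mine):

```python
import numpy as np

def misclassification(p):
    return min(p, 1 - p)

def gini(p):
    return 2 * p * (1 - p)            # sum_c p_c (1 - p_c) with two classes

def entropy_binary(p):
    return sum(-q * np.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.25, 0.5):
    print(p, misclassification(p), gini(p), entropy_binary(p))
```

All three are zero for a pure node and largest at p = 0.5; entropy and Gini are strictly concave, which is why they can prefer one of two splits that the misclassification rate cannot distinguish.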

Example

Decision tree for the Iris dataset (D = 2 features): the figure shows the dataset, the fitted decision tree, and the resulting decision boundaries. The decision boundaries suggest overfitting, which is confirmed using a validation set: training accuracy ~85%, (cross-)validation accuracy ~70%. (A scikit-learn sketch follows below.)
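A scikit-learn sketch of this experiment, assuming the first two Iris features are used; the exact accuracies depend on the train/validation split and the tree depth, so they will not match the slide's ~85% / ~70% exactly, but the gap illustrates the same overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                   # keep D = 2 features (assumed choice)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier()                # no depth limit: prone to overfitting
tree.fit(X_tr, y_tr)
print("training accuracy:  ", tree.score(X_tr, y_tr))
print("validation accuracy:", tree.score(X_va, y_va))
```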

Overfitting

A decision tree can fit any Boolean function (binary classification with binary features); for example, a decision tree can represent a Boolean function of D = 3 variables (image credit: https://www.wikiwand.com/en/Binary_decision_diagram). There are $2^{2^D}$ such functions. Why?

Large decision trees have high variance and low bias (low training error, high test error).

Idea 1: grow a small tree. (A small demonstration follows below.)
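A small demonstration that a decision tree can represent a Boolean function exactly: fitting the D = 2 XOR truth table drives the training error to zero with a depth-2 tree (scikit-learn is used only for convenience).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR of the two binary features

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(X))                          # [0 1 1 0]: zero training error
print(tree.get_depth())                         # 2: one test per feature on each path
```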
