

  1. Applied Machine Learning Decision Trees Siamak Ravanbakhsh COMP 551 (Fall 2020)

  2. https://scholarstrikecanada.ca

  3. Admin: we have created groups for students who are in time zones very different from EST; you can use these groups to more easily search for teammates.

  4. Admin: your input on the class format; we may use either format depending on the topic. Questions on NumPy.

  5. Learning objectives. Decision trees: how do they model the data? How do we specify the best model using a cost function? How is the cost function optimized?

  6. Decision trees: motivation. Pros: decision trees are interpretable; they are not very sensitive to outliers; they do not need data normalization. Cons: they can easily overfit and they are unstable. Image credit: https://mymodernmet.com/the-30second-rule-a-decision/

  7. Notation overview. $x, y$ denote the input and labels, with $x = [x_1, x_2, \ldots, x_D]$; we use $D$ to denote the number of features (the dimensionality of the input space). $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ is our dataset; we use $N$ to denote the size of the dataset and $n$ for indexing. For classification problems, we use $C$ for the number of classes, $y \in \{1, \ldots, C\}$.

  8. Decision trees: idea. Divide the input space into regions $R_1, \ldots, R_K$ using a tree structure and assign a prediction $w_k$ to each region, so that $f(x) = \sum_k w_k \mathbb{I}(x \in R_k)$. For classification, $w_k$ is a class label; for regression, it is a real scalar or vector. How do we build the regions and the tree? Split regions successively based on the value of a single-variable test; each region is then a set of conditions, e.g., $R_2 = \{x_1 \le t_1, x_2 \le t_4\}$.
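To make the sum $f(x) = \sum_k w_k \mathbb{I}(x \in R_k)$ concrete, here is a minimal sketch of evaluating such a tree of axis-aligned tests; the dict-based tree structure and the name `predict_tree` are illustrative choices, not something defined on the slides.

```python
import numpy as np

# A tree is either a leaf {'w': prediction} or an internal node
# {'feature': d, 'threshold': t, 'left': subtree, 'right': subtree}.
# This particular tree is hypothetical, just to show the region view.
tree = {
    'feature': 0, 'threshold': 1.0,
    'left':  {'w': 5.0},                              # region x_1 <= 1.0
    'right': {'feature': 1, 'threshold': 2.0,
              'left':  {'w': 3.0},                    # x_1 > 1.0 and x_2 <= 2.0
              'right': {'w': 1.0}},                   # x_1 > 1.0 and x_2 > 2.0
}

def predict_tree(node, x):
    """Follow the tests until a leaf; return its constant prediction w_k."""
    while 'w' not in node:
        d, t = node['feature'], node['threshold']
        node = node['left'] if x[d] <= t else node['right']
    return node['w']

print(predict_tree(tree, np.array([0.5, 3.0])))  # falls in x_1 <= 1.0, returns 5.0
print(predict_tree(tree, np.array([2.0, 3.0])))  # x_1 > 1.0, x_2 > 2.0, returns 1.0
```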

  9. Possible tests. Next questions: what are all the possible tests, and which test do we choose next? Continuous features: all the values that appear in the dataset can be used as split points. Categorical features: if a feature can take $C$ values, $x_i \in \{1, \ldots, C\}$, convert it into $C$ binary features $x_{i,1}, \ldots, x_{i,C} \in \{0, 1\}$ (one-hot coding) and split based on the value of a binary feature. Alternatives: a multi-way split, which can lead to regions with few datapoints, or binary splits that produce balanced subsets.
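A short sketch of the one-hot conversion described above, assuming an integer-coded categorical feature stored as a NumPy array (illustrative only):

```python
import numpy as np

x_i = np.array([2, 0, 1, 2])      # categorical feature taking C = 3 values {0, 1, 2}
C = 3
one_hot = (x_i[:, None] == np.arange(C)).astype(int)
# each column x_{i,c} is a binary feature that can be tested on its own
print(one_hot)
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]
```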

  10. Cost function. ML algorithms usually minimize a cost function or maximize an objective function; here we want to find a decision tree minimizing the following cost function, which specifies "what is a good decision or regression tree?". First calculate the cost per region. Regression cost: for region $R_k$ we predict $w_k = \mathrm{mean}(y^{(n)} \mid x^{(n)} \in R_k)$ and measure the mean squared error (MSE), $\mathrm{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} (y^{(n)} - w_k)^2$, where $N_k$ is the number of instances in region $k$. The total cost is the normalized sum over all regions, $\mathrm{cost}(\mathcal{D}) = \sum_k \frac{N_k}{N} \mathrm{cost}(R_k, \mathcal{D})$.
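A short sketch of the regression cost above, assuming the regions are given as boolean masks over the dataset (the function names are mine, not the course's):

```python
import numpy as np

def region_cost_mse(y_region):
    """cost(R_k, D): MSE of a region when we predict its mean w_k."""
    w_k = y_region.mean()
    return np.mean((y_region - w_k) ** 2)

def total_cost(y, region_masks):
    """Normalized sum over regions: sum_k (N_k / N) * cost(R_k, D)."""
    N = len(y)
    return sum(mask.sum() / N * region_cost_mse(y[mask]) for mask in region_masks)

y = np.array([1.0, 1.5, 4.0, 5.0])
masks = [np.array([True, True, False, False]), np.array([False, False, True, True])]
print(total_cost(y, masks))   # 0.5*0.0625 + 0.5*0.25 = 0.15625
```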

  11. Cost function (continued). Classification cost: again, calculate the cost per region. For each region we predict the most frequent label, $w_k = \mathrm{mode}(y^{(n)} \mid x^{(n)} \in R_k)$, and measure the misclassification rate, $\mathrm{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k)$, where $N_k$ is the number of instances in region $k$. The total cost is again the normalized sum $\mathrm{cost}(\mathcal{D}) = \sum_k \frac{N_k}{N} \mathrm{cost}(R_k, \mathcal{D})$.
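The analogous sketch for the classification cost, predicting the most frequent label of each region (again, names are illustrative):

```python
import numpy as np

def region_cost_misclass(y_region):
    """cost(R_k, D): misclassification rate of a region when we predict its mode."""
    values, counts = np.unique(y_region, return_counts=True)
    w_k = values[np.argmax(counts)]            # most frequent label in the region
    return np.mean(y_region != w_k)

y = np.array([0, 0, 0, 1, 1, 2])
print(region_cost_misclass(y))   # predicts 0, misclassifies 3 of 6 -> 0.5
```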

  12. Cost function (continued). Problem: it is sometimes possible to build a tree with zero cost: build a large tree in which each instance has its own region (overfitting!). Example: using features such as height, eye color, etc., we can make perfect predictions on the training data. Solution: find a decision tree with at most $K$ tests minimizing the cost function; $K$ tests = $K$ internal nodes in our binary tree = $K+1$ leaves (regions).

  13. Search space. Objective: find a decision tree with $K$ tests ($K+1$ regions) minimizing the cost function. The number of full binary trees with $K+1$ leaves (regions) is the Catalan number $\frac{1}{K+1}\binom{2K}{K}$: 1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, ... which grows exponentially in $K$. We also have a choice among $D$ features for each of the $K$ internal nodes and, for each feature $x_d$, different choices of splitting threshold. Bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem.
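A quick check of the Catalan-number growth quoted above, using the closed form $\frac{1}{K+1}\binom{2K}{K}$ (the division is always exact):

```python
from math import comb

# number of full binary trees with K+1 leaves, for K = 0, 1, ..., 11
catalan = [comb(2 * K, K) // (K + 1) for K in range(12)]
print(catalan)   # [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786]
```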

  14. Greedy heuristic. Finding the optimal tree is too difficult; instead, use a greedy heuristic to find a good tree: recursively split the regions based on a greedy choice of the next test, and end the recursion if it is not worth splitting. In pseudocode (a runnable sketch of this recursion, together with the next two slides, follows slide 16):
    function fit-tree(R_node, D, depth):
        R_left, R_right = greedy-test(R_node, D)
        if not worth-splitting(depth, R_left, R_right):
            return R_node
        else:
            left-set = fit-tree(R_left, D, depth+1)
            right-set = fit-tree(R_right, D, depth+1)
            return {left-set, right-set}
  The final decision tree is returned in the form of a nested list of regions, e.g., $\{\{R_1, R_2\}, \{R_3, \{R_4, R_5\}\}\}$.

  15. Choosing tests. The split is greedy because it looks only one step ahead; this may not lead to the lowest overall cost. In pseudocode (see the runnable sketch after slide 16):
    function greedy-test(R_node, D):
        best-cost = +inf
        for each feature d ∈ {1, ..., D} and each possible test:
            split R_node into R_left, R_right based on the test
            split-cost = (N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D)
            if split-cost < best-cost:
                best-cost = split-cost
                R*_left = R_left
                R*_right = R_right
        return R*_left, R*_right

  16. Stopping the recursion: the worth-splitting subroutine. If we only stop when $R_{node}$ has zero cost, we may overfit. Heuristics for stopping the splitting: we have reached a desired depth; the number of examples in $R_{left}$ or $R_{right}$ is too small; $w_k$ is already a good approximation, i.e., the cost of the region is small enough; or the reduction in cost from splitting, $\mathrm{cost}(R_{node}, \mathcal{D}) - \left(\frac{N_{left}}{N_{node}} \mathrm{cost}(R_{left}, \mathcal{D}) + \frac{N_{right}}{N_{node}} \mathrm{cost}(R_{right}, \mathcal{D})\right)$, is small. Image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
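Putting slides 14-16 together, here is a minimal runnable sketch in Python for classification with the misclassification cost. All names (`fit_tree`, `greedy_test`, `worth_splitting`), the dict-based tree, and the default stopping thresholds are my illustrative choices, not the course's reference code; labels are assumed to be non-negative integers.

```python
import numpy as np

def region_cost(y):
    """Misclassification rate of a region when predicting its most frequent label.
    Assumes integer class labels 0..C-1."""
    counts = np.bincount(y)
    return 1.0 - counts.max() / len(y)

def greedy_test(X, y):
    """One-step lookahead: try every feature d and every observed threshold t and
    return the split minimizing (N_left/N) cost(R_left) + (N_right/N) cost(R_right)."""
    best_cost, best_split = np.inf, None            # minimizing, so start at +inf
    N, D = X.shape
    for d in range(D):
        for t in np.unique(X[:, d])[:-1]:           # the largest value gives an empty side
            left = X[:, d] <= t
            split_cost = (left.sum() / N) * region_cost(y[left]) \
                       + ((~left).sum() / N) * region_cost(y[~left])
            if split_cost < best_cost:
                best_cost, best_split = split_cost, (d, t, left)
    return best_cost, best_split

def worth_splitting(y, left, best_cost, depth, max_depth=3, min_leaf=1, min_gain=1e-12):
    """Stopping heuristics from slide 16 (thresholds are assumptions)."""
    if depth >= max_depth:                                    # reached a desired depth
        return False
    if left.sum() < min_leaf or (~left).sum() < min_leaf:     # too few examples in a child
        return False
    return region_cost(y) - best_cost > min_gain              # cost reduction is large enough

def leaf(y):
    values, counts = np.unique(y, return_counts=True)
    return {'w': values[np.argmax(counts)]}                   # predict the mode

def fit_tree(X, y, depth=0):
    best_cost, best_split = greedy_test(X, y)
    if best_split is None:
        return leaf(y)                                        # no non-trivial split exists
    d, t, left = best_split
    if not worth_splitting(y, left, best_cost, depth):
        return leaf(y)
    return {'feature': d, 'threshold': t,
            'left': fit_tree(X[left], y[left], depth + 1),
            'right': fit_tree(X[~left], y[~left], depth + 1)}

# Tiny demo: the classes are separated by the test x_1 <= 2.
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(fit_tree(X, y))   # one internal node testing feature 0 at threshold 2.0
```

The nested dict it returns can be evaluated with a routine like the `predict_tree` sketch after slide 8.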

  17. Revisiting the classification cost. Ideally we want to optimize the misclassification rate, $\mathrm{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k)$, but this may not be the best cost for each step of the greedy heuristic. Example: both candidate splits of $R_{node}$ (class proportion .5, 100% of the data) have the same misclassification rate (2/8): one split gives regions with (.25, 50%) and (.75, 50%), the other gives (.33, 75%) and (1, 25%). However, the second split may be preferable because one of its regions does not need further splitting. Idea: use a measure of the homogeneity of labels in the regions (a numeric check of this example follows slide 20).

  18. Entropy. Entropy is the expected amount of information gained from observing a random variable $y$ (note: it is common to use capital letters for random variables; here, for consistency, we use lower-case): $H(y) = -\sum_{c=1}^{C} p(y=c) \log p(y=c)$. The term $-\log p(y=c)$ is the amount of information in observing $c$: there is zero information if $p(c) = 1$; less probable events are more informative, $p(c) < p(c') \Rightarrow -\log p(c) > -\log p(c')$; and information from two independent events is additive, $-\log(p(c)\,q(d)) = -\log p(c) - \log q(d)$. A uniform distribution has the highest entropy, $H(y) = -\sum_{c=1}^{C} \frac{1}{C} \log \frac{1}{C} = \log C$, and a deterministic random variable has the lowest entropy, $H(y) = -1 \cdot \log(1) = 0$.
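A small sketch of this entropy computation in NumPy, using base-2 logarithms (a choice of mine; the slide does not fix the base):

```python
import numpy as np

def entropy(p):
    """H(y) = -sum_c p(y=c) log p(y=c); zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform: log2(4) = 2.0, the maximum
print(entropy([1.0, 0.0, 0.0, 0.0]))       # deterministic: 0.0, the minimum
```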

  19. Mutual information. For two random variables $t, y$, the mutual information is the amount of information $t$ conveys about $y$, i.e., the change in the entropy of $y$ after observing the value of $t$: $I(t, y) = H(y) - H(y \mid t)$, where the conditional entropy is $H(y \mid t) = \sum_{l=1}^{L} p(t=l)\, H(y \mid t=l)$. Expanding, $I(t, y) = \sum_l \sum_c p(y=c, t=l) \log \frac{p(y=c, t=l)}{p(y=c)\, p(t=l)}$, which is symmetric w.r.t. $y$ and $t$: $I(t, y) = H(t) - H(t \mid y) = I(y, t)$. Mutual information is always non-negative, and it is zero only if $y$ and $t$ are independent.
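A sketch that computes $I(t, y) = H(y) - H(y \mid t)$ from a joint probability table; the toy joint below is made up purely to illustrate the formula and its symmetry:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(t, y) = H(y) - H(y|t) for a joint table p(y=c, t=l) with rows = y, cols = t."""
    p_y = joint.sum(axis=1)                    # marginal p(y)
    p_t = joint.sum(axis=0)                    # marginal p(t)
    H_y_given_t = sum(p_t[l] * entropy(joint[:, l] / p_t[l])
                      for l in range(joint.shape[1]) if p_t[l] > 0)
    return entropy(p_y) - H_y_given_t

joint = np.array([[0.3, 0.2],                  # p(y=0, t=0), p(y=0, t=1)
                  [0.1, 0.4]])                 # p(y=1, t=0), p(y=1, t=1)
print(mutual_information(joint))               # ~0.1245 bits
print(mutual_information(joint.T))             # same value: I(t, y) = I(y, t)
```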

  20. Entropy for the classification cost. We care about the distribution of labels in each region, $p_k(y=c) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} = c)$. The misclassification cost can be written as $\mathrm{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k) = 1 - p_k(w_k)$, where $w_k = \arg\max_c p_k(c)$ is the most probable class. The entropy cost instead uses the entropy of the region's label distribution, $\mathrm{cost}(R_k, \mathcal{D}) = H(p_k)$: choose the split with the lowest entropy. With this cost, the change in cost from a split becomes the mutual information between the test and the labels: $\mathrm{cost}(R_{node}, \mathcal{D}) - \left(\frac{N_{left}}{N_{node}} \mathrm{cost}(R_{left}, \mathcal{D}) + \frac{N_{right}}{N_{node}} \mathrm{cost}(R_{right}, \mathcal{D})\right) = H(y) - \big(p(x_d \ge t)\, H(p(y \mid x_d \ge t)) + p(x_d < t)\, H(p(y \mid x_d < t))\big) = I(y, x_d \ge t)$. This means that by using entropy as our cost, we are choosing the test that is maximally informative about the labels.
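As a numeric check of the slide-17 example (the per-region class counts below are my reading of the figure: an 8-point node with 4 points of each class), both splits tie on misclassification rate at 2/8, but the entropy cost distinguishes them and prefers the split with a pure region:

```python
import numpy as np

def entropy_from_counts(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def misclass_from_counts(counts):
    return 1.0 - max(counts) / sum(counts)

def split_cost(regions, cost):
    """Weighted cost sum_k (N_k / N_node) * cost(R_k) for regions given as class counts."""
    N = sum(sum(r) for r in regions)
    return sum(sum(r) / N * cost(r) for r in regions)

split_A = [[3, 1], [1, 3]]    # regions (.25, 50%) and (.75, 50%)
split_B = [[4, 2], [0, 2]]    # regions (.33, 75%) and (1, 25%)

print(split_cost(split_A, misclass_from_counts),
      split_cost(split_B, misclass_from_counts))    # 0.25 0.25 (tied)
print(split_cost(split_A, entropy_from_counts),
      split_cost(split_B, entropy_from_counts))     # ~0.811 vs ~0.689 (prefers split B)
```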
