Decision Trees: Discussion (Machine Learning)


  1. Decision Trees: Discussion
     Machine Learning
     Some slides from Tom Mitchell, Dan Roth and others

  2. This lecture: Learning Decision Trees
       1. Representation: What are decision trees?
       2. Algorithm: Learning decision trees
          – The ID3 algorithm: A greedy heuristic
       3. Some extensions

  4. Tips and Tricks
       1. Decision tree variants
       2. Handling examples with missing feature values
       3. Non-Boolean features
       4. Avoiding overfitting

  5. 1. Variants of information gain
     Information gain is defined using entropy to measure the disorder (impurity) of the labels. There are other ways to measure disorder, e.g., MajorityError and the Gini index.
     Example: MajorityError asks, “Suppose the tree was not grown below this node and the most frequent label were chosen. What would be the error?”
     Suppose that at some node there are 15 positive and 5 negative examples. What is the MajorityError? Answer: 5/20 = ¼.
     MajorityError works like entropy.
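
The MajorityError example above can be checked with a few lines of Python. This is a minimal sketch, assuming labels are given as a plain list; the function name `majority_error` is illustrative and not from the slides.

```python
from collections import Counter

def majority_error(labels):
    """Fraction of examples misclassified if we always predict the most frequent label."""
    if not labels:
        return 0.0
    # Count of the single most frequent label at this node.
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - most_common_count / len(labels)

# The node from the slide: 15 positive and 5 negative examples.
node_labels = ["+"] * 15 + ["-"] * 5
print(majority_error(node_labels))  # 0.25
```

Because it works from label counts, the same function also handles nodes with more than two labels.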

  7. 1. Variants of information gain
     Let $p$ denote the fraction of positive examples. Then $1 - p$ is the fraction of negative examples.
       Entropy: $-\left(p \log_2 p + (1 - p) \log_2 (1 - p)\right)$
       Gini index: $1 - \left(p^2 + (1 - p)^2\right)$
       MajorityError: $\min(p, 1 - p)$
     [Plot: the impurity measures as functions of $p$, the fraction of positive examples]
     • Each measure peaks when uncertainty is highest (i.e., $p = 0.5$).
     • Each measure is lowest (zero) when uncertainty is lowest (i.e., $p = 0$ or $p = 1$).
     • Each of these works like entropy. They can replace entropy in the definition of information gain.
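
To make the "replace entropy in the definition of information gain" idea concrete, here is a small sketch for binary labels. The function names, the `(positives, negatives)` count representation, and the example split are assumptions made for illustration; they do not come from the slides.

```python
import math

def entropy(p):
    """Binary entropy of a node whose fraction of positive examples is p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def majority_error_p(p):
    return min(p, 1 - p)

def gain(parent, children, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of the children.

    parent and each child are (num_positive, num_negative) pairs.
    """
    def node_impurity(pos, neg):
        return impurity(pos / (pos + neg))

    total = sum(parent)
    weighted = sum(
        (pos + neg) / total * node_impurity(pos, neg)
        for pos, neg in children
        if pos + neg > 0  # skip empty branches
    )
    return node_impurity(*parent) - weighted

# Hypothetical split of a node with 15 positive and 5 negative examples.
parent = (15, 5)
children = [(10, 0), (5, 5)]
for measure in (entropy, gini, majority_error_p):
    print(measure.__name__, gain(parent, children, impurity=measure))
```

Passing the measure as a parameter keeps the gain computation unchanged while the notion of disorder is swapped out.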

  11. 2. Missing feature values
      Suppose an example is missing the value of an attribute. What can we do at training time?

      Day  Outlook  Temperature  Humidity  Wind    PlayTennis
      1    Sunny    Hot          High      Weak    No
      2    Sunny    Hot          High      Strong  No
      8    Sunny    Mild         ???       Weak    No
      9    Sunny    Cool         High      Weak    Yes
      11   Sunny    Mild         Normal    Strong  Yes

  12. 2. Missing feature values
      Suppose an example is missing the value of an attribute. What can we do at training time?
      Different methods to “complete the example”:
        – Use the most common value of the attribute in the data
        – Use the most common value of the attribute among all examples with the same output
        – Use fractional counts of the attribute values
          • E.g., Outlook = {5/14 Sunny, 4/14 Overcast, 5/14 Rain}
          • Exercise: Will this change probability computations?
      At test time? Use the same method.
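
The three completion strategies can be sketched as follows. This is only an illustration, assuming examples are dictionaries and a missing value is stored as `None`; the helper names are hypothetical. The data are the five Sunny rows from the table on the earlier slide.

```python
from collections import Counter

def most_common_value(examples, attribute):
    """Most frequent observed value of the attribute (missing values ignored)."""
    values = [ex[attribute] for ex in examples if ex[attribute] is not None]
    return Counter(values).most_common(1)[0][0]

def most_common_value_for_label(examples, attribute, label_key, label):
    """Most frequent observed value among examples that share the given label."""
    same_label = [ex for ex in examples if ex[label_key] == label]
    return most_common_value(same_label, attribute)

def fractional_counts(examples, attribute):
    """Observed values with their relative frequencies, usable as fractional counts."""
    values = [ex[attribute] for ex in examples if ex[attribute] is not None]
    return {value: count / len(values) for value, count in Counter(values).items()}

sunny_days = [
    {"Day": 1, "Humidity": "High", "PlayTennis": "No"},
    {"Day": 2, "Humidity": "High", "PlayTennis": "No"},
    {"Day": 8, "Humidity": None, "PlayTennis": "No"},  # the missing value
    {"Day": 9, "Humidity": "High", "PlayTennis": "Yes"},
    {"Day": 11, "Humidity": "Normal", "PlayTennis": "Yes"},
]

print(most_common_value(sunny_days, "Humidity"))  # High
print(most_common_value_for_label(sunny_days, "Humidity", "PlayTennis", "No"))  # High
print(fractional_counts(sunny_days, "Humidity"))  # {'High': 0.75, 'Normal': 0.25}
```

The sketch does not handle the case where every example with the same label is missing the attribute; a fuller version would need a fallback there.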

  15. 3. Non-Boolean features
      • If the features can take multiple values:
        – We have seen one edge per value (i.e., a multi-way split)
          [Figure: an Outlook node with one outgoing edge for each of Sunny, Overcast and Rain]
        – Another option: make the attributes Boolean by testing for each value
          Convert Outlook=Sunny → { Outlook:Sunny=True, Outlook:Overcast=False, Outlook:Rain=False }
        – Or, perhaps, group values into disjoint sets
      • For numeric features, use thresholds or ranges to get Boolean/discrete alternatives
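
Both conversions can be sketched in a few lines. The function names are illustrative, the temperature readings are made up, and choosing thresholds at midpoints between consecutive distinct values is one common heuristic assumed here, not something the slides prescribe.

```python
def booleanize(attribute, value, all_values):
    """Turn one multi-valued attribute into one Boolean feature per possible value."""
    return {f"{attribute}:{v}": (v == value) for v in all_values}

def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values of a numeric feature."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def below_threshold(x, threshold):
    """Boolean test 'x <= threshold' that replaces the numeric feature."""
    return x <= threshold

print(booleanize("Outlook", "Sunny", ["Sunny", "Overcast", "Rain"]))
# {'Outlook:Sunny': True, 'Outlook:Overcast': False, 'Outlook:Rain': False}

temperatures = [64, 70, 70, 80]  # hypothetical numeric readings
print(candidate_thresholds(temperatures))  # [67.0, 75.0]
print(below_threshold(64, 67.0))  # True
```

Each candidate threshold yields one Boolean feature, which can then be scored with information gain like any other feature.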

  18. 4. Overfitting

  19. The “First Bit” function
      • A Boolean function with n inputs
      • Simply returns the value of the first input; all others are irrelevant

      X0  X1  Y
      F   F   F
      F   T   F
      T   F   T
      T   T   T

      Y = X0; X1 is irrelevant.
      What is the decision tree for this function? A single test on X0: if X0 = T, predict T; if X0 = F, predict F.
      Exercise: Convince yourself that ID3 will generate this tree.
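
One way to approach the exercise is to compute the information gain of every feature at the root. The sketch below assumes n = 3 and uses all 2^n inputs; the helper names are illustrative.

```python
import math
from itertools import product

def entropy(labels):
    """Binary entropy of a list of Boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of True labels
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    remainder = 0.0
    for value in (False, True):
        subset = [y for x, y in zip(rows, labels) if x[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

n = 3
rows = list(product([False, True], repeat=n))  # all 2^n inputs
labels = [x[0] for x in rows]                  # "First Bit": Y = X0
gains = [information_gain(rows, labels, j) for j in range(n)]
print(gains)  # [1.0, 0.0, 0.0]
```

Only X0 has non-zero gain, so a greedy learner that picks the highest-gain feature splits on X0 first, and both resulting branches are already pure.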

  22. The best case scenario: Perfect data
      Suppose we have all $2^n$ examples for training. What will the error be on any future examples?
      Zero! Because we have seen every possible input, the decision tree can represent the function, and ID3 will build a consistent tree.
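
The perfect-data case can be tried out with a compact ID3-style learner. This is a minimal sketch, assuming Boolean features addressed by position and entropy-based feature selection; `id3` and `predict` are illustrative names and the tree representation (a `(feature, children)` tuple or a bare label) is a choice made here, not taken from the slides.

```python
import math
from collections import Counter
from itertools import product

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, features):
    """Return a label (leaf) or a (feature, {value: subtree}) pair."""
    if len(set(labels)) == 1:
        return labels[0]                               # pure node
    if not features:
        return Counter(labels).most_common(1)[0][0]    # no tests left: majority label

    def remainder(f):
        # Weighted entropy of the labels after splitting on feature f.
        total = 0.0
        for v in set(x[f] for x in rows):
            sub = [y for x, y in zip(rows, labels) if x[f] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total

    best = min(features, key=remainder)  # lowest remainder = highest gain
    rest = [f for f in features if f != best]
    children = {}
    for v in set(x[best] for x in rows):
        idx = [i for i, x in enumerate(rows) if x[best] == v]
        children[v] = id3([rows[i] for i in idx], [labels[i] for i in idx], rest)
    return (best, children)

def predict(tree, x):
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[x[feature]]
    return tree

n = 4
rows = list(product([False, True], repeat=n))  # all 2^n inputs
labels = [x[0] for x in rows]                  # "First Bit" target
tree = id3(rows, labels, list(range(n)))
errors = sum(predict(tree, x) != y for x, y in zip(rows, labels))
print(tree)    # (0, {False: False, True: True})
print(errors)  # 0
```

Since the training set already contains every possible input, zero training error here is also zero error on any future example.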

  24. Noisy data
      What if the data is noisy, and we have all $2^n$ examples?
      Suppose the outputs of both training and test sets are randomly corrupted.
      Train and test sets are no longer identical. Both have noise, possibly different.

      X0  X1  X2  Y
      F   F   F   F
      F   F   T   F → T  (corrupted)
      F   T   F   F
      F   T   T   F
      T   F   F   T
      T   F   T   T → F  (corrupted)
      T   T   F   T
      T   T   T   T

  26. E.g.: Output corrupted with probability 0.25
      The data is noisy, and we have all $2^n$ examples.
      [Plot: Test accuracy for different input sizes. x-axis: number of features (2 to 15); y-axis: accuracy (0 to 1). The error bars are generated by running the same experiment multiple times for the same setting.]
      Error = approx. 0.375

      We can analytically compute the test error in this case.
      Correct prediction:
        P(training example uncorrupted AND test example uncorrupted) = 0.75 × 0.75
        P(training example corrupted AND test example corrupted) = 0.25 × 0.25
        P(correct prediction) = 0.625
      Incorrect prediction:
        P(training example uncorrupted AND test example corrupted) = 0.75 × 0.25
        P(training example corrupted AND test example uncorrupted) = 0.25 × 0.75
        P(incorrect prediction) = 0.375

      What about the training accuracy?
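
The analytic figures can be compared against a small simulation. This sketch makes a loud assumption: instead of running ID3, it models a fully grown tree as a lookup table that memorizes every training label, which is plausible here only because all 2^n training inputs are distinct. The function name `simulate`, the seed, and the choice n = 12 are illustrative.

```python
import random
from itertools import product

def simulate(n, noise=0.25, seed=0):
    """Train and test on all 2^n inputs of 'First Bit' with independently corrupted labels."""
    rng = random.Random(seed)
    inputs = list(product([False, True], repeat=n))

    def noisy_labels():
        # Flip the true label x[0] with probability `noise`.
        return {x: ((not x[0]) if rng.random() < noise else x[0]) for x in inputs}

    train = noisy_labels()
    test = noisy_labels()
    model = dict(train)  # assumption: the tree memorizes each training label
    train_acc = sum(model[x] == train[x] for x in inputs) / len(inputs)
    test_acc = sum(model[x] == test[x] for x in inputs) / len(inputs)
    return train_acc, test_acc

noise = 0.25
p_correct = (1 - noise) ** 2 + noise ** 2  # both uncorrupted, or both corrupted
print(p_correct)      # 0.625
print(1 - p_correct)  # 0.375

train_acc, test_acc = simulate(n=12)
print(train_acc)  # 1.0, by construction of the memorizing model
print(test_acc)   # close to 0.625; the exact value depends on the seed
```

Under this memorization assumption the training accuracy is perfect while the test accuracy stays near 0.625, which is the gap the overfitting discussion is about.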
