Decision Trees Gavin Brown
Every Learning Method has Limitations. Linear model? KNN? SVM?
Explain your decisions Sometimes we need interpretable results from our techniques. How would you explain a decision made by one of these models?
Different types of data Rugby players: height and weight can be plotted in 2-D. But how do you plot hair colour (Black, Brown, Blonde)? Predicting heart disease: how do you plot blood type (A, B, O)? In general, how do you deal with categorical data?
The Tennis Problem You are working for the local tennis club. They want a program that will advise inexperienced new members on whether they are likely to enjoy a game today, given the current weather conditions. However, they need the program to pop out interpretable rules so they can be sure it's not giving bad advice. They provide you with some historical data...
The Tennis Problem

        Outlook    Temperature  Humidity  Wind    Play Tennis?
     1  Sunny      Hot          High      Weak    No
     2  Sunny      Hot          High      Strong  No
     3  Overcast   Hot          High      Weak    Yes
     4  Rain       Mild         High      Weak    Yes
     5  Rain       Cool         Normal    Weak    Yes
     6  Rain       Cool         Normal    Strong  No
     7  Overcast   Cool         Normal    Strong  Yes
     8  Sunny      Mild         High      Weak    No
     9  Sunny      Cool         Normal    Weak    Yes
    10  Rain       Mild         Normal    Weak    Yes
    11  Sunny      Mild         Normal    Strong  Yes
    12  Overcast   Mild         High      Strong  Yes
    13  Overcast   Hot          Normal    Weak    Yes
    14  Rain       Mild         High      Strong  No

Note: 9 examples say 'yes', 5 examples say 'no'.
A Decision Tree for the Tennis Problem This tree works for any example in the table — try it!
Learning a Decision Tree: Basic recursive algorithm

    tree ← learntree( data )
        if all examples in data have same label,
            return leaf node with that label
        else
            pick the most “important” feature, call it F
            for each possible value v of F
                data(v) ← all examples where F == v
                add branch ← learntree( data(v) )
            endfor
            return tree
        endif
Example: partitioning data by the “Wind” feature

Wind = Strong (left branch):
        Outlook    Temp  Humid   Wind    Play?
     2  Sunny      Hot   High    Strong  No
     6  Rain       Cool  Normal  Strong  No
     7  Overcast   Cool  Normal  Strong  Yes
    11  Sunny      Mild  Normal  Strong  Yes
    12  Overcast   Mild  High    Strong  Yes
    14  Rain       Mild  High    Strong  No
3 examples say yes, 3 say no.

Wind = Weak (right branch):
        Outlook    Temp  Humid   Wind    Play?
     1  Sunny      Hot   High    Weak    No
     3  Overcast   Hot   High    Weak    Yes
     4  Rain       Mild  High    Weak    Yes
     5  Rain       Cool  Normal  Weak    Yes
     8  Sunny      Mild  High    Weak    No
     9  Sunny      Cool  Normal  Weak    Yes
    10  Rain       Mild  Normal  Weak    Yes
    13  Overcast   Hot   Normal  Weak    Yes
6 examples say yes, 2 examples say no.
Learning a Decision Tree: Basic recursive algorithm

    tree ← learntree( data )
        if all examples in data have same label,
            return leaf node with that label
        else
            pick the most “important” feature, call it F
            for each possible value v of F
                data(v) ← all examples where F == v
                add branch ← learntree( data(v) )
            endfor
            return tree
        endif

Which is the most important feature?
Thinking in Probabilities...

Before the split:      9 'yes', 5 'no'    p('yes') = 9/14 ≈ 0.64
On the left branch:    3 'yes', 3 'no'    p('yes') = 3/6 = 0.5
On the right branch:   6 'yes', 2 'no'    p('yes') = 6/8 = 0.75

Remember: p('no') = 1 − p('yes')
The ‘Information’ contained in a variable: Entropy
More uncertainty = less information. For example, a binary variable with p = 0.5 and 0.5 has H(X) = 1.0.
The ‘Information’ contained in a variable: Entropy
Lower uncertainty = more information. For example, a binary variable with p = 0.8 and 0.2 has H(X) = 0.72193.
Entropy
The amount of randomness in a variable X is called the 'entropy':

    H(X) = − ∑_i p(x_i) log p(x_i)      (1)

The log is base 2, giving us units of measurement: 'bits'.
Reducing Entropy = Maximise Information Gain
The variable of interest is “T” (for tennis), taking on 'yes' or 'no' values.
Before the split: 9 'yes', 5 'no'    p('yes') = 9/14 ≈ 0.64

In the whole dataset, the entropy is:

    H(T) = − ∑_i p(x_i) log p(x_i)
         = − ( 5/14 log 5/14 + 9/14 log 9/14 )
         = 0.94029

H(T) is the entropy before we split. See worked example in the supporting material.
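As a concrete check of the 0.94029 figure, here is a minimal Python sketch (my own illustration, not code from the course; the helper name entropy and the hard-coded label list are assumptions for the example):

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Entropy in bits of a list of class labels: H(X) = -sum p(x) log2 p(x).
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    # The whole tennis dataset: 9 'yes' and 5 'no' examples.
    labels = ['yes'] * 9 + ['no'] * 5
    print(entropy(labels))   # approximately 0.94029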
Reducing Entropy = Maximise Information Gain
H(T) is the entropy before we split.
H(T | W = strong) is the entropy of the data on the left branch.
H(T | W = weak) is the entropy of the data on the right branch.
H(T | W) is the weighted average of the two.
Choose the feature with the maximum value of H(T) − H(T | W).
See worked example in the supporting material.
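To make the weighted average concrete, a short sketch in the same spirit (again my own illustration, reusing the counts from the Wind partition earlier: 3 yes / 3 no on the Strong branch, 6 yes / 2 no on the Weak branch):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    whole  = ['yes'] * 9 + ['no'] * 5   # before the split
    strong = ['yes'] * 3 + ['no'] * 3   # left branch,  Wind = Strong
    weak   = ['yes'] * 6 + ['no'] * 2   # right branch, Wind = Weak

    h_before = entropy(whole)                                      # H(T)   ~ 0.940
    h_after  = (len(strong) * entropy(strong) +
                len(weak)   * entropy(weak)) / len(whole)          # H(T|W) ~ 0.892
    print(h_before - h_after)   # information gain for Wind, ~ 0.048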
Learning a Decision Tree: the ID3 algorithm

    tree ← learntree( data )
        if all examples in data have same label,
            return leaf node with that label
        else
            pick the most “important” feature, call it F
            for each possible value v of F
                data(v) ← all examples where F == v
                add branch ← learntree( data(v) )
            endfor
            return tree
        endif

Or, in very simple terms:
Step 1. Pick the feature that maximises information gain.
Step 2. Recurse on each branch.
The ID3 algorithm

    function id3( examples ) returns tree T
        if all the items in examples have the same conclusion,
            return a leaf node with value = majority conclusion
        let A be the feature with the largest information gain
        create a blank tree T
        let s(1), s(2), s(3), ... be the data subsets produced by splitting examples on feature A
        for each subset s(n)
            tree t(n) = id3( s(n) )
            add t(n) as a new branch of T
        endfor
        return T
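For readers who want to see the recursion end to end, here is a small Python sketch of the algorithm (an illustrative implementation under my own naming, not the course's reference code); it stores each example as a dictionary of feature values and the tree as nested dictionaries:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(examples, feature, target):
        # Gain = entropy before the split minus the weighted entropy after it.
        before = entropy([ex[target] for ex in examples])
        after = 0.0
        for value in set(ex[feature] for ex in examples):
            subset = [ex[target] for ex in examples if ex[feature] == value]
            after += len(subset) / len(examples) * entropy(subset)
        return before - after

    def id3(examples, features, target):
        labels = [ex[target] for ex in examples]
        # Leaf: all labels agree, or no features left to split on.
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        # Step 1: pick the feature that maximises information gain.
        best = max(features, key=lambda f: info_gain(examples, f, target))
        # Step 2: recurse on each branch.
        tree = {best: {}}
        for value in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == value]
            tree[best][value] = id3(subset, [f for f in features if f != best], target)
        return tree

On the tennis table this should place Outlook at the root, since it has the largest information gain of the four features.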
A Decision Tree for the Tennis Problem
Following each path down the tree, we can make up a list of rules:

    if (Outlook = Sunny AND Humidity = High)   → NO
    if (Outlook = Sunny AND Humidity = Normal) → YES
    if (Outlook = Overcast)                    → YES
    if (Outlook = Rain AND Wind = Strong)      → NO
    if (Outlook = Rain AND Wind = Weak)        → YES
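One way to see how these rules correspond to the tree: the sketch below (illustrative only; the nested-dictionary encoding is my choice) writes the tree out by hand and walks it for one example.

    # The learned tree, written out by hand as nested dictionaries.
    tree = {'Outlook': {
        'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
        'Overcast': 'Yes',
        'Rain':     {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
    }}

    def classify(node, example):
        # Walk down until we reach a leaf (a plain label string).
        while isinstance(node, dict):
            feature = next(iter(node))
            node = node[feature][example[feature]]
        return node

    print(classify(tree, {'Outlook': 'Rain', 'Temperature': 'Mild',
                          'Humidity': 'Normal', 'Wind': 'Weak'}))   # Yes (row 10)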
'Overfitting' a tree
◮ The number of possible paths tells you the number of rules.
◮ More rules = more complicated.
◮ We could have N rules, where N is the size of the dataset. This would mean no generalisation outside of the training data; the tree is overfitted.
Overfitting = fine-tuning to the specifics of the training data.
Overfitting
What if it's rainy and hot? Look back at the dataset table: none of the Rain examples has Temperature = Hot, so that exact combination never appears in the training data.
Overfitting
How do you know if you've overfitted?
◮ “Validation” dataset: another dataset that you do not use to train, but just to check whether you've overfitted or not.
How can we avoid it?
◮ Stop after a certain depth, i.e. keep the tree short (see the sketch below)
◮ Post-prune the final tree
◮ ... both in order to control validation error
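As a concrete sketch of the “stop after a certain depth” idea (my own illustration, reusing the entropy and info_gain helpers from the ID3 sketch above): return the majority label once the depth budget is used up.

    from collections import Counter

    def id3_limited(examples, features, target, max_depth):
        labels = [ex[target] for ex in examples]
        # Stop early: depth budget exhausted, labels pure, or no features left.
        if max_depth == 0 or len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]   # majority label at this node
        best = max(features, key=lambda f: info_gain(examples, f, target))
        tree = {best: {}}
        for value in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == value]
            tree[best][value] = id3_limited(subset, [f for f in features if f != best],
                                            target, max_depth - 1)
        return tree

Post-pruning works the other way round: grow the full tree first, then collapse subtrees whose removal does not hurt validation error.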
Missing data?

        Outlook    Temperature  Humidity  Wind    Play Tennis?
     1  Sunny      Hot          High      Weak    No
     2  Sunny      Hot          High      Strong  No
     3  Overcast   ?            High      Weak    Yes
     4  Rain       Mild         High      Weak    Yes
     5  Rain       Cool         Normal    Weak    Yes
     6  Rain       ?            Normal    ?       No
     7  Overcast   Cool         Normal    ?       Yes
     8  Sunny      ?            High      ?       No
     9  Sunny      Cool         Normal    Weak    Yes
    10  Rain       Mild         Normal    Weak    Yes
    11  Sunny      ?            Normal    Strong  Yes
    12  Overcast   ?            High      Strong  Yes
    13  Overcast   ?            Normal    Weak    Yes
    14  Rain       Mild         High      Strong  No

('?' marks a missing value.)

Insert the average (mean, median or mode) of the available values. Or use other, more complex strategies such as Bayes' Rule... NEXT WEEK...
Ultimately the best strategy is problem dependent.
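As a minimal illustration of the “insert the most common value” strategy for categorical features, a sketch using pandas (my choice of library; the slides do not prescribe one):

    import pandas as pd

    # A small fragment of the table, with one missing Temperature value.
    df = pd.DataFrame({
        'Outlook':     ['Sunny', 'Sunny', 'Overcast', 'Rain'],
        'Temperature': ['Hot',   'Hot',   None,       'Mild'],
        'Humidity':    ['High',  'High',  'High',     'High'],
    })

    # Fill each column's missing entries with its mode (the most common value).
    for col in df.columns:
        df[col] = df[col].fillna(df[col].mode()[0])

    print(df)   # the missing Temperature becomes 'Hot'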
Conclusion
Decision Trees provide a flexible and interpretable model.
There are many variations on the simple ID3 algorithm.
Further reading: www.decisiontrees.net (a site written by a former student of this course).
Why wasn't the Temperature feature used in the tree? Answer in the next session.