Decision Trees Gavin Brown
Every Learning Method has Limitations. Linear model? KNN? SVM?
Explain your decisions Sometimes we need interpretable results from our techniques. How would you explain a decision made by one of these models?
Different types of data Rugby players: height and weight can be plotted in 2-D. But how do you plot hair colour (Black, Brown, Blonde)? Predicting heart disease: how do you plot blood type (A, B, O)? In general, how do you deal with categorical data?
The Tennis Problem You are working for the local tennis club. They want a program that will advise inexperienced new members on whether they are likely to enjoy a game today, given the current weather conditions. However, they need the program to pop out interpretable rules so they can be sure it's not giving bad advice. They provide you with some historical data...
The Tennis Problem

        Outlook    Temperature  Humidity  Wind    Play Tennis?
     1  Sunny      Hot          High      Weak    No
     2  Sunny      Hot          High      Strong  No
     3  Overcast   Hot          High      Weak    Yes
     4  Rain       Mild         High      Weak    Yes
     5  Rain       Cool         Normal    Weak    Yes
     6  Rain       Cool         Normal    Strong  No
     7  Overcast   Cool         Normal    Strong  Yes
     8  Sunny      Mild         High      Weak    No
     9  Sunny      Cool         Normal    Weak    Yes
    10  Rain       Mild         Normal    Weak    Yes
    11  Sunny      Mild         Normal    Strong  Yes
    12  Overcast   Mild         High      Strong  Yes
    13  Overcast   Hot          Normal    Weak    Yes
    14  Rain       Mild         High      Strong  No

Note: 9 examples say 'yes', 5 examples say 'no'.
A Decision Tree for the Tennis Problem This tree works for any example in the table — try it!
Learning a Decision Tree: Basic recursive algorithm

    tree ← learntree( data )
        if all examples in data have same label,
            return leaf node with that label
        else
            pick the most “important” feature, call it F
            for each possible value v of F
                data(v) ← all examples where F == v
                add branch ← learntree( data(v) )
            endfor
            return tree
        endif
Example: partitioning data by the “Wind” feature

Wind = Strong (left branch):
        Outlook    Temp  Humid   Wind    Play?
     2  Sunny      Hot   High    Strong  No
     6  Rain       Cool  Normal  Strong  No
     7  Overcast   Cool  Normal  Strong  Yes
    11  Sunny      Mild  Normal  Strong  Yes
    12  Overcast   Mild  High    Strong  Yes
    14  Rain       Mild  High    Strong  No
3 examples say yes, 3 say no.

Wind = Weak (right branch):
        Outlook    Temp  Humid   Wind    Play?
     1  Sunny      Hot   High    Weak    No
     3  Overcast   Hot   High    Weak    Yes
     4  Rain       Mild  High    Weak    Yes
     5  Rain       Cool  Normal  Weak    Yes
     8  Sunny      Mild  High    Weak    No
     9  Sunny      Cool  Normal  Weak    Yes
    10  Rain       Mild  Normal  Weak    Yes
    13  Overcast   Hot   Normal  Weak    Yes
6 examples say yes, 2 examples say no.
Learning a Decision Tree: Basic recursive algorithm

    tree ← learntree( data )
        if all examples in data have same label,
            return leaf node with that label
        else
            pick the most “important” feature, call it F
            for each possible value v of F
                data(v) ← all examples where F == v
                add branch ← learntree( data(v) )
            endfor
            return tree
        endif

Which is the most important feature?
Thinking in Probabilities...

Before the split:      9 'yes', 5 'no'    p('yes') = 9/14 ≈ 0.64
On the left branch:    3 'yes', 3 'no'    p('yes') = 3/6 = 0.5
On the right branch:   6 'yes', 2 'no'    p('yes') = 6/8 = 0.75

Remember: p('no') = 1 − p('yes')
The ‘Information’ contained in a variable: Entropy
More uncertainty = less information. For example, a binary variable with p = 0.5 and 0.5 has H(X) = 1.0.
The ‘Information’ contained in a variable: Entropy
Lower uncertainty = more information. For example, a binary variable with p = 0.8 and 0.2 has H(X) = 0.72193.
Entropy
The amount of randomness in a variable X is called the 'entropy':

    H(X) = − ∑_i p(x_i) log p(x_i)      (1)

The log is base 2, giving us units of measurement: 'bits'.
Reducing Entropy = Maximise Information Gain
The variable of interest is “T” (for tennis), taking on 'yes' or 'no' values.
Before the split: 9 'yes', 5 'no'    p('yes') = 9/14 ≈ 0.64

In the whole dataset, the entropy is:

    H(T) = − ∑_i p(x_i) log p(x_i)
         = − ( 5/14 log 5/14 + 9/14 log 9/14 )
         = 0.94029

H(T) is the entropy before we split. See worked example in the supporting material.
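As a concrete check of the 0.94029 figure, here is a minimal Python sketch (my own illustration, not code from the course; the helper name entropy and the hard-coded label list are assumptions for the example):

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Entropy in bits of a list of class labels: H(X) = -sum p(x) log2 p(x).
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    # The whole tennis dataset: 9 'yes' and 5 'no' examples.
    labels = ['yes'] * 9 + ['no'] * 5
    print(entropy(labels))   # approximately 0.94029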
Reducing Entropy = Maximise Information Gain
H(T) is the entropy before we split.
H(T | W = strong) is the entropy of the data on the left branch.
H(T | W = weak) is the entropy of the data on the right branch.
H(T | W) is the weighted average of the two.
Choose the feature with the maximum value of H(T) − H(T | W).
See worked example in the supporting material.
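To make the weighted average concrete, a short sketch in the same spirit (again my own illustration, reusing the counts from the Wind partition earlier: 3 yes / 3 no on the Strong branch, 6 yes / 2 no on the Weak branch):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    whole  = ['yes'] * 9 + ['no'] * 5   # before the split
    strong = ['yes'] * 3 + ['no'] * 3   # left branch,  Wind = Strong
    weak   = ['yes'] * 6 + ['no'] * 2   # right branch, Wind = Weak

    h_before = entropy(whole)                                      # H(T)   ~ 0.940
    h_after  = (len(strong) * entropy(strong) +
                len(weak)   * entropy(weak)) / len(whole)          # H(T|W) ~ 0.892
    print(h_before - h_after)   # information gain for Wind, ~ 0.048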
Learning a Decision Tree: the ID3 algorithm

    tree ← learntree( data )
        if all examples in data have same label,
            return leaf node with that label
        else
            pick the most “important” feature, call it F
            for each possible value v of F
                data(v) ← all examples where F == v
                add branch ← learntree( data(v) )
            endfor
            return tree
        endif

Or, in very simple terms:
Step 1. Pick the feature that maximises information gain.
Step 2. Recurse on each branch.
The ID3 algorithm

    function id3( examples ) returns tree T
        if all the items in examples have the same conclusion,
            return a leaf node with value = majority conclusion
        let A be the feature with the largest information gain
        create a blank tree T
        let s(1), s(2), s(3), ... be the data subsets produced by splitting examples on feature A
        for each subset s(n)
            tree t(n) = id3( s(n) )
            add t(n) as a new branch of T
        endfor
        return T
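For readers who want to see the recursion end to end, here is a small Python sketch of the algorithm (an illustrative implementation under my own naming, not the course's reference code); it stores each example as a dictionary of feature values and the tree as nested dictionaries:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(examples, feature, target):
        # Gain = entropy before the split minus the weighted entropy after it.
        before = entropy([ex[target] for ex in examples])
        after = 0.0
        for value in set(ex[feature] for ex in examples):
            subset = [ex[target] for ex in examples if ex[feature] == value]
            after += len(subset) / len(examples) * entropy(subset)
        return before - after

    def id3(examples, features, target):
        labels = [ex[target] for ex in examples]
        # Leaf: all labels agree, or no features left to split on.
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        # Step 1: pick the feature that maximises information gain.
        best = max(features, key=lambda f: info_gain(examples, f, target))
        # Step 2: recurse on each branch.
        tree = {best: {}}
        for value in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == value]
            tree[best][value] = id3(subset, [f for f in features if f != best], target)
        return tree

On the tennis table this should place Outlook at the root, since it has the largest information gain of the four features.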
A Decision Tree for the Tennis Problem
Following each path down the tree, we can make up a list of rules:

    if (Outlook = Sunny AND Humidity = High)   → NO
    if (Outlook = Sunny AND Humidity = Normal) → YES
    if (Outlook = Overcast)                    → YES
    if (Outlook = Rain AND Wind = Strong)      → NO
    if (Outlook = Rain AND Wind = Weak)        → YES
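One way to see how these rules correspond to the tree: the sketch below (illustrative only; the nested-dictionary encoding is my choice) writes the tree out by hand and walks it for one example.

    # The learned tree, written out by hand as nested dictionaries.
    tree = {'Outlook': {
        'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
        'Overcast': 'Yes',
        'Rain':     {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
    }}

    def classify(node, example):
        # Walk down until we reach a leaf (a plain label string).
        while isinstance(node, dict):
            feature = next(iter(node))
            node = node[feature][example[feature]]
        return node

    print(classify(tree, {'Outlook': 'Rain', 'Temperature': 'Mild',
                          'Humidity': 'Normal', 'Wind': 'Weak'}))   # Yes (row 10)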
'Overfitting' a tree
◮ The number of possible paths tells you the number of rules.
◮ More rules = more complicated.
◮ We could have N rules, where N is the size of the dataset. This would mean no generalisation outside of the training data; the tree is overfitted.
Overfitting = fine-tuning to the specifics of the training data.
Overfitting
What if it's rainy and hot? Look back at the dataset table: none of the Rain examples has Temperature = Hot, so that exact combination never appears in the training data.
Overfitting
How do you know if you've overfitted?
◮ “Validation” dataset: another dataset that you do not use to train, but just to check whether you've overfitted or not.
How can we avoid it?
◮ Stop after a certain depth, i.e. keep the tree short (see the sketch below)
◮ Post-prune the final tree
◮ ... both in order to control validation error
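As a concrete sketch of the “stop after a certain depth” idea (my own illustration, reusing the entropy and info_gain helpers from the ID3 sketch above): return the majority label once the depth budget is used up.

    from collections import Counter

    def id3_limited(examples, features, target, max_depth):
        labels = [ex[target] for ex in examples]
        # Stop early: depth budget exhausted, labels pure, or no features left.
        if max_depth == 0 or len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]   # majority label at this node
        best = max(features, key=lambda f: info_gain(examples, f, target))
        tree = {best: {}}
        for value in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == value]
            tree[best][value] = id3_limited(subset, [f for f in features if f != best],
                                            target, max_depth - 1)
        return tree

Post-pruning works the other way round: grow the full tree first, then collapse subtrees whose removal does not hurt validation error.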
Missing data?

        Outlook    Temperature  Humidity  Wind    Play Tennis?
     1  Sunny      Hot          High      Weak    No
     2  Sunny      Hot          High      Strong  No
     3  Overcast   ?            High      Weak    Yes
     4  Rain       Mild         High      Weak    Yes
     5  Rain       Cool         Normal    Weak    Yes
     6  Rain       ?            Normal    ?       No
     7  Overcast   Cool         Normal    ?       Yes
     8  Sunny      ?            High      ?       No
     9  Sunny      Cool         Normal    Weak    Yes
    10  Rain       Mild         Normal    Weak    Yes
    11  Sunny      ?            Normal    Strong  Yes
    12  Overcast   ?            High      Strong  Yes
    13  Overcast   ?            Normal    Weak    Yes
    14  Rain       Mild         High      Strong  No

('?' marks a missing value.)

Insert the average (mean, median or mode) of the available values. Or use other, more complex strategies such as Bayes' Rule... NEXT WEEK...
Ultimately the best strategy is problem dependent.
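As a minimal illustration of the “insert the most common value” strategy for categorical features, a sketch using pandas (my choice of library; the slides do not prescribe one):

    import pandas as pd

    # A small fragment of the table, with one missing Temperature value.
    df = pd.DataFrame({
        'Outlook':     ['Sunny', 'Sunny', 'Overcast', 'Rain'],
        'Temperature': ['Hot',   'Hot',   None,       'Mild'],
        'Humidity':    ['High',  'High',  'High',     'High'],
    })

    # Fill each column's missing entries with its mode (the most common value).
    for col in df.columns:
        df[col] = df[col].fillna(df[col].mode()[0])

    print(df)   # the missing Temperature becomes 'Hot'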
Conclusion
Decision Trees provide a flexible and interpretable model.
There are many variations on the simple ID3 algorithm.
Further reading: www.decisiontrees.net (a site written by a former student of this course).
Why wasn't the Temperature feature used in the tree? Answer in the next session.