Decision Trees
Aarti Singh
Machine Learning 10-701/15-781
Oct 6, 2010
Learning a good prediction rule
• Learn a mapping f : X → Y
• Best prediction rule: the one with smallest expected loss
• Hypothesis space / function class
  – Parametric classes (Gaussian, binomial, etc.)
  – Conditionally independent class densities (Naïve Bayes)
  – Linear decision boundary (Logistic regression)
  – Nonparametric classes (Histograms, nearest neighbor, kernel estimators, Decision Trees – Today)
• Given training data, find a hypothesis/function in the chosen class that is close to the best prediction rule.
First …
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?
Decision Tree for Tax Fraud Detection

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Tree:
  Refund? — Yes → NO; No → MarSt
  MarSt? — Married → NO; Single, Divorced → TaxInc
  TaxInc? — < 80K → NO; > 80K → YES

• Each internal node: tests one feature X_i
• Each branch from a node: selects one value for X_i
• Each leaf node: predicts Y
Decision Tree for Tax Fraud Detection (walkthrough)

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No → follow the No branch to MarSt; MarSt = Married → reach the NO leaf.
Assign Cheat to "No".
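To make the label-assignment procedure concrete, here is a minimal Python sketch of this particular tree (my own illustration, not code from the lecture; the function name `predict_cheat` and the dict encoding of the query are assumptions):

```python
def predict_cheat(query):
    """Route a query through the tax-fraud tree and return the leaf label.

    `query` is a dict with keys 'Refund', 'MaritalStatus', 'TaxableIncome'
    (an encoding assumed here for illustration).
    """
    if query['Refund'] == 'Yes':                 # internal node: test Refund
        return 'No'                              # leaf
    if query['MaritalStatus'] == 'Married':      # internal node: test MarSt
        return 'No'                              # leaf
    # Single or Divorced: test Taxable Income
    return 'No' if query['TaxableIncome'] < 80_000 else 'Yes'

# The query from the slide: Refund = No, Married, Taxable Income = 80K
print(predict_cheat({'Refund': 'No', 'MaritalStatus': 'Married',
                     'TaxableIncome': 80_000}))   # -> 'No'
```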
Decision Trees more generally…
• Features can be discrete, continuous or categorical
• Each internal node: tests some set of features {X_i}
• Each branch from a node: selects a set of values for {X_i}
• Each leaf node: predicts Y

(Figure: a rectangular partition of the feature space, with each cell labeled 0 or 1.)
So far…
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?

Now …
• How do we learn a decision tree from training data?
• What is the decision on each leaf?
How to learn a decision tree
• Top-down induction [ID3, C4.5, CART, …]

(Figure: the tax-fraud tree built top-down — Refund, then MarSt, then TaxInc.)
Which feature is best to split?

  X1  X2  Y
  T   T   T
  T   F   T
  F   T   F
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   F   F

• Split on X1: X1 = T → Y: 4 Ts, 0 Fs (absolutely sure); X1 = F → Y: 1 T, 3 Fs (kind of sure)
• Split on X2: X2 = T → Y: 3 Ts, 1 F (kind of sure); X2 = F → Y: 2 Ts, 2 Fs (absolutely unsure)

A split is good if we are more certain about the classification after the split – a uniform distribution of labels is bad.
Which feature is best to split?

Pick the attribute/feature which yields maximum information gain:

    arg max_i IG(X_i) = arg max_i [ H(Y) – H(Y | X_i) ]

where H(Y) is the entropy of Y and H(Y | X_i) is the conditional entropy of Y given X_i.
Entropy
• Entropy of a random variable Y:

    H(Y) = – Σ_y P(Y = y) log2 P(Y = y)

(Figure: entropy of Y ~ Bernoulli(p) as a function of p – more uncertainty means more entropy; the uniform case p = 1/2 has maximum entropy, the deterministic cases p = 0 or 1 have zero entropy.)

• Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).
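To make the definition concrete, here is a small Python sketch (my own, not from the lecture) computing binary entropy, confirming the maximum at p = 1/2 and zero entropy in the deterministic cases:

```python
import math

def entropy(probs):
    """H(Y) = -sum_y p(y) * log2 p(y), in bits; the 0 * log 0 term is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"Bernoulli({p}): H = {entropy([p, 1 - p]):.3f} bits")
# Bernoulli(0.5) attains the maximum of 1 bit; Bernoulli(0) and Bernoulli(1) give 0 bits.
```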
Andrew Moore's Entropy in a Nutshell

High Entropy: ..the values (locations of soup) are unpredictable… almost uniformly sampled throughout our dining room.
Low Entropy: ..the values (locations of soup) are sampled almost entirely from within the soup bowl.
Information Gain
• Advantage of an attribute = decrease in uncertainty
  – Entropy of Y before the split: H(Y)
  – Entropy of Y after splitting based on X_i
    • Weight each branch by the probability of following it:
      H(Y | X_i) = Σ_x P(X_i = x) H(Y | X_i = x)
• Information gain is the difference: IG(X_i) = H(Y) – H(Y | X_i)

Max information gain = min conditional entropy
Information Gain (example)

For the table above (Y: 5 Ts, 3 Fs overall):
• Split on X1: X1 = T → Y: 4 Ts, 0 Fs; X1 = F → Y: 1 T, 3 Fs
• Split on X2: X2 = T → Y: 3 Ts, 1 F; X2 = F → Y: 2 Ts, 2 Fs

IG(X1) > IG(X2) > 0, so splitting on X1 is better.
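Here is a hedged Python sketch (mine, not from the lecture) that computes the information gain of each feature on the toy table, assuming the 8 rows reconstructed on the earlier slide; the helper names are my own:

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Empirical entropy H(Y) of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature_index):
    """IG(X) = H(Y) - sum_x P(X = x) * H(Y | X = x)."""
    n = len(labels)
    cond = 0.0
    for value in set(x[feature_index] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[feature_index] == value]
        cond += (len(subset) / n) * entropy_of_labels(subset)
    return entropy_of_labels(labels) - cond

# The toy table: columns (X1, X2) and the label Y
X = [('T', 'T'), ('T', 'F'), ('F', 'T'), ('T', 'T'),
     ('T', 'F'), ('F', 'T'), ('F', 'F'), ('F', 'F')]
Y = ['T', 'T', 'F', 'T', 'T', 'T', 'F', 'F']

print(information_gain(X, Y, 0))   # IG(X1) ≈ 0.549 bits
print(information_gain(X, Y, 1))   # IG(X2) ≈ 0.049 bits -> X1 is the better split
```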
Which feature is best to split?

Pick the attribute/feature which yields maximum information gain:

    IG(X_i) = H(Y) – H(Y | X_i)

where H(Y) is the entropy of Y and H(Y | X_i) is the conditional entropy of Y given X_i.
The feature which yields the maximum reduction in entropy provides the maximum information about Y.
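Putting the pieces together, a rough sketch of greedy top-down induction in the spirit of ID3 (a simplified illustration under my own assumptions – categorical features, no pruning – reusing `information_gain`, `X`, and `Y` from the previous sketch; not the exact algorithm from the lecture):

```python
from collections import Counter

def build_tree(examples, labels, features):
    """Greedy top-down induction: split on the feature with maximum information
    gain and recurse; stop when the labels are pure or no features remain."""
    if len(set(labels)) == 1:                    # pure node -> leaf
        return labels[0]
    if not features:                             # no features left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    children = {}
    for value in set(x[best] for x in examples):
        idx = [i for i, x in enumerate(examples) if x[best] == value]
        children[value] = build_tree([examples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [f for f in features if f != best])
    return {'feature': best, 'children': children}

def predict(tree, x):
    while isinstance(tree, dict):                # descend until a leaf label is reached
        tree = tree['children'][x[tree['feature']]]
    return tree

tree = build_tree(X, Y, features=[0, 1])         # X, Y, information_gain from the sketch above
print(predict(tree, ('F', 'T')))
```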
Expressiveness of Decision Trees
• Decision trees can express any function of the input features.
• E.g., for Boolean functions, each truth table row maps to a path from root to leaf.
• There is a decision tree which perfectly classifies any training set, with one path to a leaf for each example.
• But it won't generalize well to new examples – prefer to find more compact decision trees.
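As a concrete instance of "truth table row → path to leaf", a hypothetical depth-2 tree expressing XOR, with one leaf per row of the truth table (my own example, assuming Boolean features):

```python
def xor_tree(x1: bool, x2: bool) -> bool:
    """A depth-2 decision tree computing y = x1 XOR x2; each of the four
    truth-table rows corresponds to one root-to-leaf path."""
    if x1:                                   # split on x1
        return False if x2 else True         # leaves for rows (T, T) and (T, F)
    else:
        return True if x2 else False         # leaves for rows (F, T) and (F, F)

for a in (False, True):
    for b in (False, True):
        print(a, b, xor_tree(a, b))          # reproduces the XOR truth table
```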
Decision Trees – Overfitting
A tree with one training example per leaf overfits – we need a compact/pruned decision tree.
Bias-Variance Tradeoff

(Figure: classifiers based on different training data, their average, and the ideal classifier.)
• Coarse partition: large bias, small variance
• Fine partition: small bias, large variance
When to Stop?
• Many strategies for picking simpler trees:
  – Pre-pruning
    • Fixed depth
    • Fixed number of leaves
  – Post-pruning
    • Chi-square test (a sketch of the idea follows below)
      – Convert the decision tree to a set of rules
      – Eliminate variable values in rules which are independent of the label (using a chi-square test for independence)
      – Simplify the rule set by eliminating unnecessary rules
  – Information criteria: MDL (Minimum Description Length)
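One way the chi-square test can be applied (a sketch under my own assumptions, using SciPy's `chi2_contingency`; not the exact procedure from the lecture): test whether a candidate feature is independent of the label at a node, and keep the split only if independence is rejected.

```python
import numpy as np
from scipy.stats import chi2_contingency

def split_is_significant(feature_values, labels, alpha=0.05):
    """Chi-square test of independence between a candidate feature and the label.
    Returns True if independence is rejected, i.e. the split looks informative."""
    values = sorted(set(feature_values))
    classes = sorted(set(labels))
    table = np.array([[sum(1 for f, y in zip(feature_values, labels)
                           if f == v and y == c) for c in classes]
                      for v in values])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha

# Toy usage: a feature strongly associated with the label vs. a pure-noise feature.
labels = ['T'] * 20 + ['F'] * 20
useful = ['A'] * 18 + ['B'] * 2 + ['A'] * 2 + ['B'] * 18
noise  = ['A', 'B'] * 20
print(split_is_significant(useful, labels))   # True: keep the split
print(split_is_significant(noise, labels))    # False: prune / do not split
```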
Information Criteria
• Penalize complex models by introducing a cost term: choose the model that maximizes
    log likelihood (fit to the training data, for regression or classification) – cost (model complexity)
• For decision trees, the cost penalizes trees with more leaves.
Information Criteria – MDL
Penalize complex models based on their information content: the number of bits needed to describe f (the description length).
MDL – Minimum Description Length

Example: binary decision trees
  k leaves ⇒ 2k – 1 nodes
  2k – 1 bits to encode the tree structure + k bits to encode the label of each leaf (0/1)
  e.g., 5 leaves ⇒ 9 bits to encode the structure (+ 5 bits for the leaf labels)
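A tiny sketch (my own illustration) of model selection by description length: total bits = bits to describe the tree + bits to describe the training errors it makes; the error-coding term used here is an assumption for illustration.

```python
import math

def tree_description_bits(num_leaves):
    """Binary tree with k leaves: (2k - 1) bits for the structure + k bits for leaf labels."""
    return (2 * num_leaves - 1) + num_leaves

def mdl_score(num_leaves, num_errors, num_examples):
    """Total description length = bits for the tree + bits to list the training
    examples it misclassifies (a simple index-based coding assumed here)."""
    error_bits = num_errors * math.log2(num_examples) if num_errors else 0.0
    return tree_description_bits(num_leaves) + error_bits

# Compare a small tree making a few mistakes with a large tree making none.
print(mdl_score(num_leaves=5, num_errors=3, num_examples=100))    # 14 + ~19.9 bits
print(mdl_score(num_leaves=40, num_errors=0, num_examples=100))   # 119 bits -> the small tree wins
```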
So far…
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?
• How do we learn a decision tree from training data?

Now …
• What is the decision on each leaf?
How to assign a label to each leaf
• Classification – majority vote over the training labels at the leaf
• Regression – constant / linear / polynomial fit to the training data at the leaf
Regression trees
(Example split: Num Children? ≥ 2 / < 2.)
Average (fit a constant) using the training data at the leaves.
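A minimal sketch of "fit a constant at each leaf" for regression (the names and the toy spending data are my own, hypothetical): each leaf predicts the mean of the training responses that fall in it.

```python
def fit_constant_leaves(x, y, threshold):
    """A single split on one feature; each leaf predicts the mean of its training y."""
    left = [yi for xi, yi in zip(x, y) if xi < threshold]
    right = [yi for xi, yi in zip(x, y) if xi >= threshold]
    left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
    return lambda x_new: left_mean if x_new < threshold else right_mean

# Hypothetical data: predict spending from number of children, split at 2
num_children = [0, 1, 1, 2, 3, 4]
spending     = [50, 55, 60, 90, 95, 100]
predict_spend = fit_constant_leaves(num_children, spending, threshold=2)
print(predict_spend(1), predict_spend(3))   # 55.0 95.0
```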
Connection between nearest neighbor/histogram classifiers and decision trees 32
Local prediction
Histogram, kernel density estimation, k-nearest neighbor classifier, kernel regression – all predict using the training data in a fixed local neighborhood of the query point.
(Figure: a uniform grid partition of the domain D – the histogram classifier.)
Local Adaptive prediction
Let the neighborhood size adapt to the data – small neighborhoods near the decision boundary (small bias), large neighborhoods elsewhere (small variance).
(Figure: an adaptive partition of the domain D around a query point x – the decision tree classifier, with a majority vote at each leaf.)
Histogram Classifier vs Decision Trees
(Figure: the ideal classifier, a histogram partition, and a decision-tree partition – 256 cells in each partition.)
Application to Image Coding
(Figure: non-adaptive vs. tree-adaptive partitions of an image – 1024 cells in each partition.)
Application to Image Coding
(Figure: JPEG at 0.125 bpp – non-adaptive partitioning – vs. JPEG 2000 at 0.125 bpp – adaptive partitioning.)
What you should know
• Decision trees are one of the most popular data mining tools
  – Simplicity of design
  – Interpretability
  – Ease of implementation
  – Good performance in practice (for small dimensions)
• Information gain to select attributes (ID3, C4.5, …)
• Can be used for classification, regression and density estimation too
• Decision trees will overfit!
  – Must use tricks to find "simple trees", e.g.,
    • Pre-pruning: fixed depth / fixed number of leaves
    • Post-pruning: chi-square test of independence
    • Complexity-penalized / MDL model selection