Decision Trees, Lecture 11
David Sontag, New York University
Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore

A learning problem: predict fuel efficiency
• 40 data points
• Goal: predict MPG
• Need to find: f : X → Y
• Discrete data (for now)
From the UCI repository (thanks to Ross Quinlan)
[Table: data set with input attributes X and output Y]

Hypotheses: decision trees f : X → Y
• Each internal node tests an attribute x_i
• Each branch assigns an attribute value x_i = v
• Each leaf assigns a class y
• To classify input x: traverse the tree from root to leaf, output the labeled y
Human interpretable!
[Figure: decision tree. The root tests Cylinders (branches 3, 4, 5, 6, 8); lower nodes test Maker (america, asia, europe) and Horsepower (low, med, high); leaves are labeled good or bad.]
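The traversal rule can be sketched in a few lines of Python. This is only an illustration: the nested-tuple representation, the `classify` helper, and most leaf labels are assumptions made for the example, not a transcription of the figure.

```python
# Hypothetical tree: internal nodes are (attribute, {value: subtree}) tuples,
# leaves are class-label strings. Leaf labels are illustrative only.
TREE = ("cylinders", {
    3: "good",
    4: ("maker", {"america": "bad", "asia": "good", "europe": "good"}),
    5: "bad",
    6: "bad",
    8: ("horsepower", {"low": "good", "med": "bad", "high": "bad"}),
})


def classify(tree, x):
    """Walk from the root to a leaf, following the branch that matches x."""
    while not isinstance(tree, str):   # a string is a leaf label
        attribute, children = tree     # internal node: tests one attribute
        tree = children[x[attribute]]  # follow the branch x_i = v
    return tree


print(classify(TREE, {"cylinders": 4, "maker": "asia", "horsepower": "low"}))
```

Only the attributes on the path from root to leaf are ever read, which is part of what makes the prediction easy to explain.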

Hypothesis space
• How many possible hypotheses?
• What functions can be represented?
[Figure: the Cylinders / Maker / Horsepower decision tree]

What functions can be represented?
• Decision trees can represent any function of the input attributes!
• For Boolean functions, path to leaf gives truth table row
• But, could require exponentially many nodes…
A  B  A xor B
F  F  F
F  T  T
T  F  T
T  T  F
[Figure from Stuart Russell: the truth table drawn as a decision tree that tests A at the root and B on each branch]
cyl=3 ∨ (cyl=4 ∧ (maker=asia ∨ maker=europe)) ∨ …
[Figure: the Cylinders / Maker / Horsepower decision tree]

Hypothesis space
• How many possible hypotheses?
• What functions can be represented?
• How many will be consistent with a given dataset?
• How will we choose the best one?
• Let's first look at how to split nodes, then consider how to find the best tree
[Figure: the Cylinders / Maker / Horsepower decision tree]

What is the Simplest Tree?
predict mpg=bad
Is this a good tree?
[22+, 18-] Means: correct on 22 examples, incorrect on 18 examples

A Decision Stump

Recursive Step
Take the Original Dataset.. And partition it according to the value of the attribute we split on
• Records in which cylinders = 4
• Records in which cylinders = 5
• Records in which cylinders = 6
• Records in which cylinders = 8
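A minimal sketch of the partition step, assuming each record is a Python dict. The `partition` helper and the four records are made up for illustration and are not the lecture's MPG data.

```python
from collections import defaultdict


def partition(records, attribute):
    """Group records by their value of `attribute` (one group per branch)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record)
    return dict(groups)


# Made-up records for illustration only.
records = [
    {"cylinders": 4, "maker": "asia", "mpg": "good"},
    {"cylinders": 8, "maker": "america", "mpg": "bad"},
    {"cylinders": 4, "maker": "europe", "mpg": "good"},
    {"cylinders": 6, "maker": "america", "mpg": "bad"},
]

by_cylinders = partition(records, "cylinders")
for value, subset in sorted(by_cylinders.items()):
    print(value, len(subset))
```

Each group then becomes the data set for one child of the node, and every record lands in exactly one group.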

Recursive Step
• Records in which cylinders = 4: build tree from these records..
• Records in which cylinders = 5: build tree from these records..
• Records in which cylinders = 6: build tree from these records..
• Records in which cylinders = 8: build tree from these records..

Second level of tree
Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia
(Similar recursion in the other cases)

A full tree

Are all decision trees equal?
• Many trees can represent the same concept
• But, not all trees will have the same size!
– e.g., φ = (A ∧ B) ∨ (¬A ∧ C) -- ((A and B) or (not A and C))
• Which tree do we prefer?
[Figure: two decision trees for φ, a smaller one rooted at A and a larger one rooted at B]
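The size difference can be checked mechanically. In the sketch below, the two tree shapes are assumptions chosen to compute φ (one rooted at A, one rooted at B). They may not match the slide's figure node for node.

```python
from itertools import product

# Trees are nested tuples (attribute, subtree_if_true, subtree_if_false);
# leaves are booleans. Both shapes are assumptions made for illustration.
SMALL = ("A", ("B", True, False), ("C", True, False))
LARGE = ("B",
         ("C", True, ("A", True, False)),
         ("C", ("A", False, True), False))


def evaluate(tree, x):
    """Follow the true/false branches until a boolean leaf is reached."""
    while not isinstance(tree, bool):
        attribute, if_true, if_false = tree
        tree = if_true if x[attribute] else if_false
    return tree


def internal_nodes(tree):
    """Count the attribute tests in a tree."""
    if isinstance(tree, bool):
        return 0
    _, if_true, if_false = tree
    return 1 + internal_nodes(if_true) + internal_nodes(if_false)


def phi(x):
    """The target concept (A and B) or (not A and C)."""
    return (x["A"] and x["B"]) or ((not x["A"]) and x["C"])


# Both trees agree with phi on all eight assignments.
for a, b, c in product([True, False], repeat=3):
    x = {"A": a, "B": b, "C": c}
    assert evaluate(SMALL, x) == evaluate(LARGE, x) == phi(x)

print(internal_nodes(SMALL), internal_nodes(LARGE))
```

Both trees compute the same function, but the one rooted at A needs fewer tests, which is the sense in which a smaller tree is preferred.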

Learning decision trees is hard!!!
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest ’76]
• Resort to a greedy heuristic:
– Start from empty decision tree
– Split on next best attribute (feature)
– Recurse

Splitting: choosing a good attribute
Would we prefer to split on X_1 or X_2?
X_1  X_2  Y
T    T    T
T    F    T
T    T    T
T    F    T
F    T    T
F    F    F
F    T    F
F    F    F
Split on X_1: X_1=t gives Y=t : 4, Y=f : 0; X_1=f gives Y=t : 1, Y=f : 3
Split on X_2: X_2=t gives Y=t : 3, Y=f : 1; X_2=f gives Y=t : 2, Y=f : 2
Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!
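The leaf counts can be recomputed from the eight rows of the table. The helper name `leaf_counts` and the tuple encoding of the rows are choices made for this sketch.

```python
from collections import Counter

# (X1, X2, Y) rows from the slide's table.
ROWS = [
    ("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"), ("T", "F", "T"),
    ("F", "T", "T"), ("F", "F", "F"), ("F", "T", "F"), ("F", "F", "F"),
]


def leaf_counts(rows, column):
    """Count Y labels in each branch created by splitting on `column`."""
    counts = {}
    for row in rows:
        counts.setdefault(row[column], Counter())[row[-1]] += 1
    return counts


for name, column in [("X1", 0), ("X2", 1)]:
    for value, labels in sorted(leaf_counts(ROWS, column).items(), reverse=True):
        print(f"{name}={value}: Y=t {labels['T']}, Y=f {labels['F']}")
```

Dividing each count by its branch total turns the counts into the per-leaf probability distributions mentioned in the slide.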

Measuring uncertainty
• Good split if we are more certain about classification after split
– Deterministic good (all true or all false)
– Uniform distribution bad
– What about distributions in between?
P(Y=A) = 1/2, P(Y=B) = 1/4, P(Y=C) = 1/8, P(Y=D) = 1/8
P(Y=A) = 1/4, P(Y=B) = 1/4, P(Y=C) = 1/4, P(Y=D) = 1/4

Entropy
Entropy H(Y) of a random variable Y
More uncertainty, more entropy!
Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under most efficient code)
[Figure: Entropy of a coin flip. x-axis: Probability of heads; y-axis: Entropy]
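A small sketch of the quantity being plotted, assuming the standard Shannon definition H(Y) = -Σ_i P(Y=y_i) log_2 P(Y=y_i), with outcomes of probability zero contributing nothing. The function name `entropy` is a choice made here.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Coin flip: entropy peaks at a fair coin and is zero for a certain outcome.
for p_heads in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p_heads, round(entropy([p_heads, 1 - p_heads]), 3))

# The two four-valued distributions from the "Measuring uncertainty" slide.
print(entropy([1/2, 1/4, 1/8, 1/8]))  # skewed distribution
print(entropy([1/4, 1/4, 1/4, 1/4]))  # uniform distribution
```

Under this definition the skewed four-valued distribution needs fewer bits on average than the uniform one, which answers the "distributions in between" question: it sits between the deterministic and uniform extremes.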

High, Low Entropy
• “High Entropy”
– Y is from a uniform-like distribution
– Flat histogram
– Values sampled from it are less predictable
• “Low Entropy”
– Y is from a varied (peaks and valleys) distribution
– Histogram has many lows and highs
– Values sampled from it are more predictable
(Slide from Vibhav Gogate)

Entropy Example
X_1  X_2  Y
T    T    T
T    F    T
T    T    T
T    F    T
F    T    T
F    F    F
P(Y=t) = 5/6
P(Y=f) = 1/6
H(Y) = - 5/6 log_2 5/6 - 1/6 log_2 1/6 = 0.65
[Figure: Entropy of a coin flip. x-axis: Probability of heads; y-axis: Entropy]

Conditional Entropy
Conditional Entropy H(Y|X) of a random variable Y conditioned on a random variable X
X_1  X_2  Y
T    T    T
T    F    T
T    T    T
T    F    T
F    T    T
F    F    F
Example: split on X_1
X_1=t: Y=t : 4, Y=f : 0
X_1=f: Y=t : 1, Y=f : 1
P(X_1=t) = 4/6
P(X_1=f) = 2/6
H(Y|X_1) = - 4/6 (1 log_2 1 + 0 log_2 0) - 2/6 (1/2 log_2 1/2 + 1/2 log_2 1/2) = 2/6 (taking 0 log_2 0 = 0)
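For reference, the worked example is an instance of the usual definition of conditional entropy: the entropy of Y within each branch, weighted by the probability of that branch. The index symbols x_j and y_i are notation assumed here.

```latex
H(Y \mid X) = -\sum_{j} P(X = x_j) \sum_{i} P(Y = y_i \mid X = x_j)\, \log_2 P(Y = y_i \mid X = x_j)
```

With X = X_1, the branch X_1 = t is pure and contributes 4/6 · 0, while the branch X_1 = f is an even split and contributes 2/6 · 1, giving 2/6.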

Information gain
• Decrease in entropy (uncertainty) after splitting
X_1  X_2  Y
T    T    T
T    F    T
T    T    T
T    F    T
F    T    T
F    F    F
In our running example:
IG(X_1) = H(Y) - H(Y|X_1) = 0.65 - 0.33 = 0.32
IG(X_1) > 0 → we prefer the split!
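The running example can be reproduced end to end. This is a sketch in which probabilities are estimated from row counts; the helper names are choices made here rather than anything defined in the lecture.

```python
import math
from collections import Counter

# (X1, X2, Y) rows of the running example.
ROWS = [
    ("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"),
    ("T", "F", "T"), ("F", "T", "T"), ("F", "F", "F"),
]


def entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def conditional_entropy(rows, column):
    """H(Y | X): branch entropies weighted by the fraction of rows per branch."""
    branches = {}
    for row in rows:
        branches.setdefault(row[column], []).append(row[-1])
    return sum(len(ys) / len(rows) * entropy(ys) for ys in branches.values())


def information_gain(rows, column):
    """IG(X) = H(Y) - H(Y | X)."""
    return entropy([row[-1] for row in rows]) - conditional_entropy(rows, column)


print(round(entropy([row[-1] for row in ROWS]), 2))  # H(Y)
print(round(conditional_entropy(ROWS, 0), 2))        # H(Y | X1)
print(round(information_gain(ROWS, 0), 2))           # IG(X1)
print(round(information_gain(ROWS, 1), 2))           # IG(X2)
```

On these six rows X_2 also has a positive gain, but a smaller one than X_1, so a learner that ranks attributes by gain would split on X_1 first.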

Learning decision trees
• Start from empty decision tree
• Split on next best attribute (feature)
– Use, for example, information gain to select the attribute: pick the one with the highest gain
• Recurse

Suppose we want to predict MPG
Look at all the information gains…

A Decision Stump
First split looks good! But, when do we stop?

Base Case One Don’t split a node if all matching records have the same output value

Base Case Two Don’t split a node if data points are identical on remaining attributes

Base Case Two: No attributes can distinguish

Base Cases: An idea
• Base Case One: If all records in current data subset have the same output then don’t recurse
• Base Case Two: If all records have exactly the same set of input attributes then don’t recurse
Proposed Base Case 3: If all attributes have zero information gain then don’t recurse
• Is this a good idea?

The problem with Base Case 3
y = a XOR b
The information gains:
The resulting decision tree:
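The XOR case is easy to check numerically. The sketch below assumes the four-row truth table as the training set and estimates entropies from counts; the helper names are illustrative.

```python
import math
from collections import Counter

# Truth table of y = a XOR b as (a, b, y) rows.
XOR_ROWS = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(rows, column):
    """H(Y) minus the weighted entropy of Y inside each branch."""
    remainder = 0.0
    for value in {row[column] for row in rows}:
        branch = [row[-1] for row in rows if row[column] == value]
        remainder += len(branch) / len(rows) * entropy(branch)
    return entropy([row[-1] for row in rows]) - remainder


# At the root, neither attribute looks useful on its own.
gains = {name: information_gain(XOR_ROWS, col) for col, name in enumerate("ab")}
print(gains)

# After splitting on a anyway, b separates each branch perfectly.
branch_a0 = [row for row in XOR_ROWS if row[0] == 0]
print(information_gain(branch_a0, 1))
```

Both root-level gains are zero, so Base Case 3 would stop at a single leaf even though one more level of splitting classifies every row correctly.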

If we omit Base Case 3:
y = a XOR b
The resulting decision tree:
Is it OK to omit Base Case 3?

Summary: Building Decision Trees
BuildTree(DataSet, Output)
• If all output values are the same in DataSet, return a leaf node that says “predict this unique output”
• If all input values are the same, return a leaf node that says “predict the majority output”
• Else find attribute X with highest Info Gain
• Suppose X has n_X distinct values (i.e. X has arity n_X).
– Create a non-leaf node with n_X children.
– The i’th child should be built by calling BuildTree(DS_i, Output), where DS_i contains the records in DataSet where X = i’th value of X.
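The procedure above can be sketched in Python as follows. This is one possible rendering, not the lecture's code: records are assumed to be `(inputs, output)` pairs with `inputs` a dict, ties in information gain are broken by attribute order, and the names `build_tree`, `split`, and `predict` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def split(records, attribute):
    """Partition (inputs, output) records by the value of one attribute."""
    subsets = {}
    for inputs, output in records:
        subsets.setdefault(inputs[attribute], []).append((inputs, output))
    return subsets


def information_gain(records, attribute):
    """H(Y) minus the weighted entropy of Y inside each branch."""
    remainder = sum(len(s) / len(records) * entropy([y for _, y in s])
                    for s in split(records, attribute).values())
    return entropy([y for _, y in records]) - remainder


def build_tree(records, attributes):
    """Return a leaf label or an (attribute, {value: subtree}) node."""
    outputs = [y for _, y in records]
    # Base Case One: every record has the same output.
    if len(set(outputs)) == 1:
        return outputs[0]
    # Base Case Two: records agree on every remaining attribute.
    usable = [a for a in attributes if len(split(records, a)) > 1]
    if not usable:
        return Counter(outputs).most_common(1)[0][0]  # majority output
    # Greedy step: split on the attribute with the highest information gain.
    best = max(usable, key=lambda a: information_gain(records, a))
    remaining = [a for a in attributes if a != best]
    return (best, {value: build_tree(subset, remaining)
                   for value, subset in split(records, best).items()})


def predict(tree, inputs):
    """Traverse from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children[inputs[attribute]]
    return tree


# y = a XOR b: every gain is zero at the root, yet the recursion still
# recovers the function because there is no "Base Case 3".
xor = [({"a": a, "b": b}, a ^ b) for a in (0, 1) for b in (0, 1)]
tree = build_tree(xor, ["a", "b"])
print(tree)
```

The sketch creates one child per attribute value seen in the data, so `predict` raises a `KeyError` for a value that never appeared at that node. A fuller implementation would need a fallback such as the node's majority output.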

MPG Test set error
The test set error is much worse than the training set error…
…why?
