
Tree Models — Weinan Zhang, Shanghai Jiao Tong University (2019 CS420, Machine Learning, Lecture 5)


  1. 2019 CS420, Machine Learning, Lecture 5 — Tree Models. Weinan Zhang, Shanghai Jiao Tong University. http://wnzhang.net | http://wnzhang.net/teaching/cs420/index.html

  2. ML Task: Function Approximation
  • Problem setting
    • Instance feature space $X$
    • Instance label space $Y$
    • Unknown underlying function (target) $f: X \mapsto Y$
    • Set of function hypotheses $H = \{ h \mid h: X \mapsto Y \}$
  • Input: training data generated from the unknown $f$: $\{(x^{(i)}, y^{(i)})\} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$
  • Output: a hypothesis $h \in H$ that best approximates $f$
  • Optimize in functional space, not just parameter space

  3. Optimize in Functional Space
  • Tree models
    • Intermediate nodes for splitting data
    • Leaf nodes for label prediction
  • Continuous data example (a minimal code sketch follows)
  [Figure: a decision tree whose root node tests $x_1 < a_1$, whose intermediate nodes test $x_2 < a_2$ and $x_2 < a_3$, and whose leaves predict $y = -1, 1, 1, -1$; alongside it, the corresponding axis-parallel partition of the $(x_1, x_2)$ plane into Class 1 and Class 2 regions.]
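As a concrete reading of the figure above, the tree's prediction rule is just two nested threshold tests. In the sketch below the threshold values a1, a2, a3 and the exact leaf-to-label assignment are placeholders chosen only to mirror the figure, not values from the lecture.

```python
# A minimal sketch of the pictured tree as nested if/else tests.
# The threshold values a1, a2, a3 are made-up placeholders.
def predict(x1, x2, a1=0.5, a2=0.3, a3=0.7):
    if x1 < a1:                       # root node: split on x1
        return -1 if x2 < a2 else 1   # left intermediate node: split on x2
    else:
        return 1 if x2 < a3 else -1   # right intermediate node: split on x2

print(predict(0.2, 0.1))  # reaches the left-most leaf -> -1
```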

  4. Optimize in Functional Space
  • Tree models
    • Intermediate nodes for splitting data
    • Leaf nodes for label prediction
  • Discrete/categorical data example
  [Figure: a decision tree on the weather data: the root node tests Outlook (Sunny / Overcast / Rain); the intermediate nodes test Humidity (High / Normal) and Wind (Strong / Weak); the leaves predict y = -1 or y = 1 (the classic "play tennis" tree).]
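For the categorical case, the same kind of tree can be written as a nested lookup table. The sketch below assumes the standard "play tennis" layout (Sunny branches on Humidity, Rain branches on Wind, Overcast is a pure y = 1 leaf); the branch assignment in the lecture's figure may differ.

```python
# A sketch of the categorical tree as a nested dict (assumed "play tennis" layout).
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": -1, "Normal": 1}},
        "Overcast": 1,
        "Rain":     {"Wind": {"Strong": -1, "Weak": 1}},
    }
}

def predict(node, x):
    while isinstance(node, dict):
        feature = next(iter(node))        # feature tested at this node
        node = node[feature][x[feature]]  # follow the branch matching x's value
    return node                           # leaf label

print(predict(tree, {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))  # -> 1
```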

  5. Decision Tree Learning
  • Problem setting
    • Instance feature space $X$
    • Instance label space $Y$
    • Unknown underlying function (target) $f: X \mapsto Y$
    • Set of function hypotheses $H = \{ h \mid h: X \mapsto Y \}$
  • Input: training data generated from the unknown $f$: $\{(x^{(i)}, y^{(i)})\} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$
  • Output: a hypothesis $h \in H$ that best approximates $f$
  • Here each hypothesis $h$ is a decision tree
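A minimal sketch of this search over tree hypotheses using scikit-learn's DecisionTreeClassifier (an off-the-shelf learner, not the lecture's own implementation); the toy data below is made up purely for illustration.

```python
# A minimal sketch: fitting a tree hypothesis h to made-up training data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.2, 0.1], [0.3, 0.8], [0.7, 0.4], [0.9, 0.9]])  # instances x^(i)
y = np.array([-1, 1, 1, -1])                                     # labels y^(i)

# Search the space of (depth-limited) trees for an h that approximates f.
h = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(h.predict([[0.25, 0.2]]))  # predicted label for a new instance
```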

  6. Decision Tree – Decision Boundary
  • Decision trees divide the feature space into axis-parallel (hyper-)rectangles
  • Each rectangular region is labeled with one label
    • or with a probability distribution over labels
  Slide credit: Eric Eaton

  7. History of Decision-Tree Research
  • Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
  • In the late 1970s, Quinlan developed ID3 with the information gain heuristic to learn expert systems from examples.
  • Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
  • In the 1980s a variety of improvements were introduced to handle noise, continuous features, missing features, and improved splitting criteria. Various expert-system development tools resulted.
  • Quinlan's updated decision-tree package (C4.5) was released in 1993.
  • Scikit-learn (Python) and Weka (Java) now include decision-tree learners based on these algorithms.
  Slide credit: Raymond J. Mooney

  8. Decision Trees
  • Tree models
    • Intermediate nodes for splitting data
    • Leaf nodes for label prediction
  • Key questions for decision trees
    • How to select node splitting conditions?
    • How to make a prediction?
    • How to decide the tree structure?

  9. Node Splitting
  • Which node splitting condition should we choose? For example, split on Outlook (Sunny / Overcast / Rain) or on Temperature (Hot / Mild / Cool)?
  • Choose the feature with higher classification capacity
  • Quantitatively: the one with higher information gain

  10. Fundamentals of Information Theory
  • Entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message.
  • Suppose $X$ is a random variable with $n$ discrete values, $P(X = x_i) = p_i$; then its entropy $H(X)$ is
    $H(X) = -\sum_{i=1}^{n} p_i \log p_i$
  • It is easy to verify that
    $H(X) = -\sum_{i=1}^{n} p_i \log p_i \le -\sum_{i=1}^{n} p_i \log \frac{1}{n} = \log n$
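A small numeric sketch of the entropy formula and its $\log n$ upper bound; base-2 logarithms are an arbitrary choice here (the slides leave the base unspecified).

```python
# A small sketch of Shannon entropy, using base-2 logarithms (entropy in bits).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

p = [0.5, 0.25, 0.25]
print(entropy(p))                     # 1.5 bits
print(entropy(p) <= np.log2(len(p)))  # True: H(X) <= log n, equality iff uniform
```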

  11. Illustration of Entropy
  • Entropy of a binary distribution:
    $H(X) = -p_1 \log p_1 - (1 - p_1) \log(1 - p_1)$

  12. Cross Entropy
  • Cross entropy is used to measure the difference between two random variable distributions:
    $H(X, Y) = -\sum_{i=1}^{n} P(X = i) \log P(Y = i)$
  • Continuous formulation:
    $H(p, q) = -\int p(x) \log q(x) \, dx$
  • Compared to the KL divergence:
    $D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx = H(p, q) - H(p)$

  13. KL Divergence
  • Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution diverges from a second, expected probability distribution:
    $D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx = H(p, q) - H(p)$
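A small numeric sketch of the discrete versions of these quantities, checking that $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p)$; the two distributions below are made up.

```python
# A small numeric check that D_KL(p || q) = H(p, q) - H(p) for made-up distributions.
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])

H_p  = -np.sum(p * np.log2(p))       # entropy H(p)
H_pq = -np.sum(p * np.log2(q))       # cross entropy H(p, q)
D_kl =  np.sum(p * np.log2(p / q))   # KL divergence D_KL(p || q)

print(H_pq - H_p, D_kl)              # both print 0.25
```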

  14. Review: Cross Entropy in Logistic Regression
  • Logistic regression is a binary classification model:
    $p_\theta(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}$ and $p_\theta(y = 0 \mid x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$
  • Cross entropy loss function:
    $L(y, x, p_\theta) = -y \log \sigma(\theta^\top x) - (1 - y) \log(1 - \sigma(\theta^\top x))$
  • Gradient (using $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1 - \sigma(z))$ with $z = \theta^\top x$):
    $\frac{\partial L(y, x, p_\theta)}{\partial \theta} = -y \frac{1}{\sigma(\theta^\top x)} \sigma(z)(1 - \sigma(z)) x - (1 - y) \frac{-1}{1 - \sigma(\theta^\top x)} \sigma(z)(1 - \sigma(z)) x = (\sigma(\theta^\top x) - y) x$
  • Update rule: $\theta \leftarrow \theta + (y - \sigma(\theta^\top x)) x$
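A minimal sketch of this gradient update on a single made-up training example; the learning rate eta is an added assumption (the slide's update rule omits it).

```python
# A sketch of the update theta <- theta + eta * (y - sigma(theta^T x)) * x
# on one made-up example; eta is an assumed learning rate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(3)
eta = 0.1
x, y = np.array([1.0, 0.5, -0.2]), 1     # one training example (made up)

for _ in range(200):
    theta += eta * (y - sigmoid(theta @ x)) * x  # cross-entropy gradient step

print(sigmoid(theta @ x))  # p(y = 1 | x) moves toward the label y = 1
```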

  15. Conditional Entropy
  • Entropy: $H(X) = -\sum_{i=1}^{n} P(X = i) \log P(X = i)$
  • Specific conditional entropy of $X$ given $Y = v$:
    $H(X \mid Y = v) = -\sum_{i=1}^{n} P(X = i \mid Y = v) \log P(X = i \mid Y = v)$
  • Conditional entropy of $X$ given $Y$:
    $H(X \mid Y) = \sum_{v \in \mathrm{values}(Y)} P(Y = v) H(X \mid Y = v)$
  • Information gain of $X$ given $Y$:
    $I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$
    where $H(X, Y)$ is the entropy of the pair $(X, Y)$ (joint entropy), not the cross entropy
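A small sketch computing $H(X)$, $H(X \mid Y)$, and the information gain $I(X, Y)$ on a made-up categorical sample, where $X$ plays the role of the class label and $Y$ a candidate splitting feature.

```python
# A small sketch of H(X), H(X | Y), and information gain on a made-up sample.
from collections import Counter
import numpy as np

def H(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

X = [1, 1, 1, -1, -1, -1, 1, -1]              # class labels
Y = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']  # feature values

H_X_given_Y = sum(
    Y.count(v) / len(Y) * H([x for x, y in zip(X, Y) if y == v])
    for v in set(Y)
)
print(H(X) - H_X_given_Y)  # information gain I(X, Y), about 0.19 bits
```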

  16. Information Gain
  • Information gain of $X$ given $Y$:
    $I(X, Y) = H(X) - H(X \mid Y)$
    $= -\sum_v P(X = v) \log P(X = v) + \sum_u P(Y = u) \sum_v P(X = v \mid Y = u) \log P(X = v \mid Y = u)$
    $= -\sum_v P(X = v) \log P(X = v) + \sum_u \sum_v P(X = v, Y = u) \log P(X = v \mid Y = u)$
    $= -\sum_v P(X = v) \log P(X = v) + \sum_u \sum_v P(X = v, Y = u) \left[ \log P(X = v, Y = u) - \log P(Y = u) \right]$
    $= -\sum_v P(X = v) \log P(X = v) - \sum_u P(Y = u) \log P(Y = u) + \sum_{u,v} P(X = v, Y = u) \log P(X = v, Y = u)$
    $= H(X) + H(Y) - H(X, Y)$
  • Here $H(X, Y)$ is the entropy of $(X, Y)$ (joint entropy), not the cross entropy:
    $H(X, Y) = -\sum_{u,v} P(X = v, Y = u) \log P(X = v, Y = u)$
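A quick numeric check of the identity $I(X, Y) = H(X) + H(Y) - H(X, Y)$ for a made-up joint distribution.

```python
# A numeric check of I(X, Y) = H(X) + H(Y) - H(X, Y) for a made-up joint P(X, Y).
import numpy as np

P = np.array([[0.3, 0.2],   # P(X = v, Y = u): rows index X, columns index Y
              [0.1, 0.4]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X, H_Y, H_XY = H(P.sum(axis=1)), H(P.sum(axis=0)), H(P.ravel())
H_X_given_Y = H_XY - H_Y      # chain rule: H(X, Y) = H(Y) + H(X | Y)

print(H_X - H_X_given_Y)      # I(X, Y) as H(X) - H(X | Y)
print(H_X + H_Y - H_XY)       # the same value via the identity above
```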
