Deconstructing Data Science
David Bamman, UC Berkeley, Info 290
Lecture 7: Decision trees & random forests
Feb 10, 2016
Linear regression, deep learning, decision trees, ordinal regression, probabilistic graphical models, random forests, logistic regression, networks, support vector machines, topic models, survival models, k-means clustering, neural networks, hierarchical clustering, perceptron
Decision trees Random forests
20 questions
Feature                              Value
follow clinton                       0
follow trump                         0
“benghazi”                           0
negative sentiment + “benghazi”      0
“illegal immigrants”                 0
“republican” in profile              0
“democrat” in profile                0
self-reported location = Berkeley    1

[Diagram: an example decision tree over these features; the root tests “lives in Berkeley,” internal nodes test “follows Trump” and whether the profile contains “email” or “Republican,” and the leaves are labeled D or R]
[Diagram: two candidate decision trees for the D/R classification task: one built from political features (lives in Berkeley, follows Trump, contains “email,” profile contains “Republican”) and one built from function words (contains “the,” “a,” “he,” “she,” “they”)]

How do we find the best tree?
[Diagram: a much larger decision tree built from function words alone (contains “the,” “a,” “he,” “she,” “they,” “an,” “are,” “our,” “him,” “them,” “her,” “his,” “hers,” …)]

How do we find the best tree?
Decision trees from Flach 2014
[Figure: ⟨x, y⟩ training data in feature space, partitioned by the splits x2 > 15 vs. x2 ≤ 15, then x2 > 5 vs. x2 ≤ 5 and x1 > 10 vs. x1 ≤ 10]
Decision trees from Flach 2014
Decision trees
• Homogeneous(D): the elements in D are homogeneous enough that they can be labeled with a single label
• Label(D): the single most appropriate label for all elements in D
Decision trees

                  Homogeneous(D)                               Label(D)
Classification    all (or most) of the elements in D           the shared label y
                  share the same label y
Regression        the elements in D have low variance          the average of the elements in D
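A minimal sketch of these two subroutines in Python (the function names and the 80% homogeneity threshold are my illustrative assumptions, not from the lecture):

```python
from collections import Counter
from statistics import mean, pstdev

def homogeneous(labels, threshold=0.8):
    """Classification: do enough elements of D share the same label?"""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels) >= threshold

def label(labels):
    """Classification: the single most frequent label in D."""
    return Counter(labels).most_common(1)[0][0]

def homogeneous_reg(values, max_std=1.0):
    """Regression: do the elements of D have low variance?"""
    return pstdev(values) <= max_std

def label_reg(values):
    """Regression: the average of the elements in D."""
    return mean(values)
```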
Decision trees from Flach 2014
Entropy

Measure of uncertainty in a probability distribution:

H(X) = −Σ_{x ∈ X} P(x) log P(x)

• a great _______
• the oakland ______
a great …              the oakland …
deal         12196     athletics    185
job           2164     raiders      185
idea          1333     museum        92
opportunity    855     hills         72
weekend        585     tribune       51
player         556     police        49
extent         439     coliseum      41
honor          282
pleasure       267
gift           256
humor          221
tool           184
athlete        173
disservice     108
…

(Corpus of Contemporary American English)
Entropy

H(X) = −Σ_{x ∈ X} P(x) log P(x)

• High entropy means the phenomenon is less predictable
• Entropy of 0 means it is entirely predictable
Entropy

A uniform distribution has maximum entropy. For a fair six-sided die, P(X = x) = 1/6 for each face:

H = −Σ_{x=1}^{6} (1/6) log2(1/6) = log2 6 ≈ 2.58

For a loaded die with P(X = 2) = 0.4 and P(X = x) = 0.12 for the other five faces:

H = −0.4 log2 0.4 − 5 × 0.12 log2 0.12 ≈ 2.36

This entropy is lower because the outcome is more predictable (if we always guess 2, we would be right 40% of the time).
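The two dice above can be checked in a few lines of Python (a quick sketch using base-2 logarithms, as the slide's numbers assume):

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair six-sided die: maximum entropy for six outcomes.
uniform = [1/6] * 6
print(round(entropy(uniform), 2))   # 2.58

# A loaded die: P(2) = 0.4, the other five faces 0.12 each.
loaded = [0.12, 0.4, 0.12, 0.12, 0.12, 0.12]
print(round(entropy(loaded), 2))    # 2.36
```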
Conditional entropy • Measures your level of surprise about some phenomenon Y if you have information about another phenomenon X • Y = word, X = preceding bigram (“the oakland ___”) • Y = label (democrat, republican), X = feature (lives in Berkeley)
Conditional entropy

• Measures your level of surprise about some phenomenon Y if you have information about another phenomenon X (X = feature, Y = label)

H(Y | X) = Σ_{x} P(X = x) H(Y | X = x)

H(Y | X = x) = −Σ_{y ∈ Y} P(y | x) log P(y | x)
Information gain

• aka “mutual information”: the reduction in entropy in Y as a result of knowing information about X

IG = H(Y) − H(Y | X)

H(Y) = −Σ_{y ∈ Y} P(y) log P(y)

H(Y | X) = −Σ_{x ∈ X} P(x) Σ_{y ∈ Y} P(y | x) log P(y | x)
       1  2  3  4  5  6
x1:    0  1  1  0  0  1
x2:    0  0  0  1  1  1
y:     ⊕  ⊖  ⊖  ⊕  ⊕  ⊖

Which of these features gives you more information about y?
       1  2  3  4  5  6
x1:    0  1  1  0  0  1
x2:    0  0  0  1  1  1
y:     ⊕  ⊖  ⊖  ⊕  ⊕  ⊖

Splitting on x1:    x1 = 0: 3 ⊕, 0 ⊖        x1 = 1: 0 ⊕, 3 ⊖
Splitting on x1:    x1 = 0: 3 ⊕, 0 ⊖        x1 = 1: 0 ⊕, 3 ⊖

H(Y | X) = −Σ_{x ∈ X} P(x) Σ_{y ∈ Y} P(y | x) log P(y | x)

P(x = 0) = 3 / (3 + 3) = 0.5        P(y = ⊕ | x = 0) = 3 / (3 + 0) = 1
                                    P(y = ⊖ | x = 0) = 0 / (3 + 0) = 0
P(x = 1) = 3 / (3 + 3) = 0.5        P(y = ⊕ | x = 1) = 0 / (0 + 3) = 0
                                    P(y = ⊖ | x = 1) = 3 / (0 + 3) = 1
Splitting on x1:    x1 = 0: 3 ⊕, 0 ⊖        x1 = 1: 0 ⊕, 3 ⊖

H(Y | X) = −(3/6)(1 log 1 + 0 log 0) − (3/6)(0 log 0 + 1 log 1) = 0
       1  2  3  4  5  6
x1:    0  1  1  0  0  1
x2:    0  0  0  1  1  1
y:     ⊕  ⊖  ⊖  ⊕  ⊕  ⊖

Splitting on x2:    x2 = 0: 1 ⊕, 2 ⊖        x2 = 1: 2 ⊕, 1 ⊖
Splitting on x2:    x2 = 0: 1 ⊕, 2 ⊖        x2 = 1: 2 ⊕, 1 ⊖

P(x = 0) = 3 / (3 + 3) = 0.5        P(y = ⊕ | x = 0) = 1 / (1 + 2) = 0.33
                                    P(y = ⊖ | x = 0) = 2 / (1 + 2) = 0.67
P(x = 1) = 3 / (3 + 3) = 0.5        P(y = ⊕ | x = 1) = 2 / (2 + 1) = 0.67
                                    P(y = ⊖ | x = 1) = 1 / (2 + 1) = 0.33
Splitting on x2:    x2 = 0: 1 ⊕, 2 ⊖        x2 = 1: 2 ⊕, 1 ⊖

H(Y | X) = −Σ_{x ∈ X} P(x) Σ_{y ∈ Y} P(y | x) log P(y | x)

H(Y | X) = −(3/6)(0.33 log 0.33 + 0.67 log 0.67) − (3/6)(0.67 log 0.67 + 0.33 log 0.33) ≈ 0.91
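The conditional entropies worked out above can be reproduced directly (a sketch using base-2 logs; the slide's 0.91 reflects rounding the probabilities to 0.33/0.67 before taking logs, while the unrounded value is ≈ 0.92):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y | X) = sum over x of P(x) * H(Y | X = x)."""
    n = len(xs)
    h = 0.0
    for x in set(xs):
        ys_given_x = [y for xi, y in zip(xs, ys) if xi == x]
        h += (len(ys_given_x) / n) * entropy(ys_given_x)
    return h

# The six-example toy dataset from the slides
x1 = [0, 1, 1, 0, 0, 1]
x2 = [0, 0, 0, 1, 1, 1]
y  = ["+", "-", "-", "+", "+", "-"]

print(round(conditional_entropy(x1, y), 2))   # 0.0: x1 predicts y perfectly
print(round(conditional_entropy(x2, y), 2))   # 0.92
```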
Feature                              H(Y | X)
follow clinton                       0.91
follow trump                         0.77
“benghazi”                           0.45
negative sentiment + “benghazi”      0.33
“illegal immigrants”                 0
“republican” in profile              0.31
“democrat” in profile                0.67
self-reported location = Berkeley    0.80

In decision trees, the feature with the lowest conditional entropy (highest information gain) defines the “best split”

MI = IG = H(Y) − H(Y | X)

For a given partition, H(Y) is the same for all features, so we can ignore it when deciding among them
Feature                              H(Y | X)
follow clinton                       0.91
follow trump                         0.77
“benghazi”                           0.45
negative sentiment + “benghazi”      0.33
“illegal immigrants”                 0
“republican” in profile              0.31
“democrat” in profile                0.67
self-reported location = Berkeley    0.80

How could we use this in other models (e.g., the perceptron)?
Decision trees BestSplit identifies the feature with the highest information gain and partitions the data according to values for that feature
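A minimal recursive sketch of tree growing with this BestSplit (assuming the conditional-entropy criterion from the preceding slides; helper names like grow_tree are mine, not from Flach):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(data, labels, features):
    """BestSplit: the feature with the lowest H(Y | X), i.e. highest IG."""
    def cond_entropy(f):
        n = len(data)
        h = 0.0
        for v in set(d[f] for d in data):
            ys = [lab for d, lab in zip(data, labels) if d[f] == v]
            h += (len(ys) / n) * entropy(ys)
        return h
    return min(features, key=cond_entropy)

def grow_tree(data, labels, features):
    # Homogeneous(D): stop once a single label remains (or features run out)
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]   # Label(D)
    f = best_split(data, labels, features)
    branches = {}
    for v in set(d[f] for d in data):
        idx = [i for i, d in enumerate(data) if d[f] == v]
        branches[v] = grow_tree([data[i] for i in idx],
                                [labels[i] for i in idx],
                                [g for g in features if g != f])
    return (f, branches)

# The six-example toy dataset from the slides
data = [{"x1": 0, "x2": 0}, {"x1": 1, "x2": 0}, {"x1": 1, "x2": 0},
        {"x1": 0, "x2": 1}, {"x1": 0, "x2": 1}, {"x1": 1, "x2": 1}]
y = ["+", "-", "-", "+", "+", "-"]
print(grow_tree(data, y, ["x1", "x2"]))   # ('x1', {0: '+', 1: '-'})
```

As expected, the root splits on x1, the feature with zero conditional entropy, and both children are immediately homogeneous.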
Gini impurity

• Measures the “purity” of a partition (how diverse the labels are): if we were to pick an element of D at random and assign it a label in proportion to the label distribution in D, how often would we make a mistake?

G = Σ_{y ∈ Y} p_y (1 − p_y)

p_y is the probability of selecting an item with label y at random; (1 − p_y) is the probability of then randomly assigning it the wrong label.
Gini impurity

G = Σ_{y ∈ Y} p_y (1 − p_y)

Splitting on x1 (x1 = 0: 3 ⊕, 0 ⊖;   x1 = 1: 0 ⊕, 3 ⊖):
G(0) = 1 × (1 − 1) + 0 × (1 − 0) = 0
G(1) = 0 × (1 − 0) + 1 × (1 − 1) = 0
G(x1) = (3/6) × 0 + (3/6) × 0 = 0

Splitting on x2 (x2 = 0: 1 ⊕, 2 ⊖;   x2 = 1: 2 ⊕, 1 ⊖):
G(0) = 0.33 × (1 − 0.33) + 0.67 × (1 − 0.67) = 0.44
G(1) = 0.67 × (1 − 0.67) + 0.33 × (1 − 0.33) = 0.44
G(x2) = (3/6) × 0.44 + (3/6) × 0.44 = 0.44
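These split evaluations can be checked with a short sketch (the helper names are my own, not from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of labels: sum over y of p_y * (1 - p_y)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_of_split(xs, ys):
    """Weighted Gini impurity of the partition induced by a feature."""
    n = len(xs)
    total = 0.0
    for v in set(xs):
        part = [y for x, y in zip(xs, ys) if x == v]
        total += (len(part) / n) * gini(part)
    return total

# The six-example toy dataset from the slides
x1 = [0, 1, 1, 0, 0, 1]
x2 = [0, 0, 0, 1, 1, 1]
y  = ["+", "-", "-", "+", "+", "-"]

print(gini_of_split(x1, y))             # 0.0: a perfectly pure split
print(round(gini_of_split(x2, y), 2))   # 0.44
```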
Classification

A mapping h from input data x (drawn from instance space 𝒳) to a label (or labels) y from some enumerable output space 𝒴

𝒳 = set of all skyscrapers
𝒴 = {art deco, neo-gothic, modern}
x = the Empire State Building
y = art deco
Feature                              Value
follow clinton                       0
follow trump                         0
“benghazi”                           0
negative sentiment + “benghazi”      0
“illegal immigrants”                 0
“republican” in profile              0
“democrat” in profile                0
self-reported location = Berkeley    1

[Diagram: the learned decision tree, with root “lives in Berkeley,” internal tests “follows Trump” and profile contains “email” / “Republican,” and leaves labeled D and R]

The tree that we’ve learned is the mapping ĥ(x)
Feature                              Value
follow clinton                       0
follow trump                         0
“benghazi”                           0
negative sentiment + “benghazi”      0
“illegal immigrants”                 0
“republican” in profile              0
“democrat” in profile                0
self-reported location = Berkeley    1

[Diagram: the learned decision tree, with root “lives in Berkeley,” internal tests “follows Trump” and profile contains “email” / “Republican,” and leaves labeled D and R]

How is this different from the perceptron?
Regression

A mapping from input data x (drawn from instance space 𝒳) to a point y in ℝ (ℝ = the set of real numbers)

x = the Empire State Building
y = 17444.5625”