  1. 
 Deconstructing Data Science David Bamman, UC Berkeley 
 Info 290 
 Lecture 7: Decision trees & random forests Feb 10, 2016

  2. Linear regression, Deep learning, Decision trees, Ordinal regression, Probabilistic graphical models, Random forests, Logistic regression, Networks, Support vector machines, Topic models, Survival models, K-means clustering, Neural networks, Hierarchical clustering, Perceptron

  3. Decision trees Random forests

  4. 20 questions

  5. [Decision tree diagram over the features in the table below: splits on lives in Berkeley, follows Trump, contains "email", profile contains "Republican"; leaves labeled D / R]
     Feature | Value
     follow clinton | 0
     follow trump | 0
     "benghazi" | 0
     negative sentiment + "benghazi" | 0
     "illegal immigrants" | 0
     "republican" in profile | 0
     "democrat" in profile | 0
     self-reported location = Berkeley | 1

  6. [Two candidate decision trees built from different features (lives in Berkeley, follows Trump, contains "email", contains "the" / "a" / "he" / "she" / "they", profile contains "Republican"); leaves labeled D / R]
     How do we find the best tree?

  7. [A much larger candidate tree splitting only on function words: contains "the", "a", "he", "she", "they", "an", "are", "our", "him", "them", "her", "his", "hers", …; leaves labeled D / R]
     How do we find the best tree?

  8. Decision trees from Flach 2014

  9. ⟨x, y⟩ training data
     [Figure: the training data partitioned by the splits x₂ > 15 vs. x₂ ≤ 15, x₂ > 5 vs. x₂ ≤ 5, and x₁ > 10 vs. x₁ ≤ 10]

  10. Decision trees from Flach 2014

  11. Decision trees • Homogeneous(D): the elements in D are homogeneous enough that they can be labeled with a single label • Label(D): the single most appropriate label for all elements in D

  12. Decision trees
      Task           | Homogeneous(D)                                             | Label(D)
      Classification | all (or most) of the elements in D share the same label y | the shared label y
      Regression     | the elements in D have low variance                        | the average of the elements in D
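
      As a concrete illustration, here is a minimal Python sketch of how Homogeneous(D) and Label(D) from the previous slide could be instantiated for the two rows of this table; the function names and thresholds are assumptions for illustration, not from the lecture.

      from collections import Counter
      from statistics import mean, pvariance

      # Classification: D is a list of class labels
      def homogeneous_classification(labels, threshold=0.95):
          # "all (or most) of the elements in D share the same label y"
          top_count = Counter(labels).most_common(1)[0][1]
          return top_count / len(labels) >= threshold

      def label_classification(labels):
          # the single most common label in D
          return Counter(labels).most_common(1)[0][0]

      # Regression: D is a list of real-valued targets
      def homogeneous_regression(values, max_variance=1.0):
          # "the elements in D have low variance"
          return pvariance(values) <= max_variance

      def label_regression(values):
          # the average of the elements in D
          return mean(values)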

  13. Decision trees from Flach 2014

  14. Entropy: a measure of uncertainty in a probability distribution
      H(X) = −∑_{x ∈ X} P(x) log P(x)
      • a great _______
      • the oakland ______

  15. a great …          |  the oakland …
      deal 12196         |  athletics 185
      job 2164           |  raiders 185
      idea 1333          |  museum 92
      opportunity 855    |  hills 72
      weekend 585        |  tribune 51
      player 556         |  police 49
      extent 439         |  coliseum 41
      honor 282          |
      pleasure 267       |
      gift 256           |
      humor 221          |
      tool 184           |
      athlete 173        |
      disservice 108     |
      …                  |
      (Corpus of Contemporary American English)

  16. Entropy
      H(X) = −∑_{x ∈ X} P(x) log P(x)
      • High entropy means the phenomenon is less predictable
      • Entropy of 0 means it is entirely predictable

  17. Entropy
      [Bar chart: the uniform distribution P(X = x) = 1/6 over x ∈ {1, …, 6}]
      −6 × (1/6) log(1/6) = 2.58. A uniform distribution has maximum entropy.
      [Bar chart: P(X = 2) = 0.4 and P(X = x) = 0.12 for the other five values]
      −0.4 log 0.4 − 5 × 0.12 log 0.12 = 2.36. This entropy is lower because the distribution is more predictable
      (if we always guess 2, we would be right 40% of the time).
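
      A short Python sketch (assuming base-2 logarithms, as the values 2.58 and 2.36 imply) that reproduces the two entropies above:

      import math

      def entropy(probs):
          # H(X) = -sum_x P(x) log2 P(x), treating 0 log 0 as 0
          return -sum(p * math.log2(p) for p in probs if p > 0)

      uniform = [1/6] * 6                               # fair six-sided die
      skewed = [0.12, 0.40, 0.12, 0.12, 0.12, 0.12]     # guessing "2" is right 40% of the time

      print(round(entropy(uniform), 2))   # 2.58 (maximum entropy for six outcomes)
      print(round(entropy(skewed), 2))    # 2.36 (lower: the distribution is more predictable)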

  18. Conditional entropy • Measures your level of surprise about some phenomenon Y if you have information about another phenomenon X • Y = word, X = preceding bigram (“the oakland ___”) • Y = label (democrat, republican), X = feature (lives in Berkeley)

  19. Conditional entropy
      • Measures your level of surprise about some phenomenon Y if you have information about another phenomenon X
      X = feature, x = value, Y = label
      H(Y | X) = ∑_x P(X = x) H(Y | X = x)
      H(Y | X = x) = −∑_{y ∈ Y} p(y | x) log p(y | x)

  20. Information gain
      • aka "Mutual Information": the reduction in entropy in Y as a result of knowing information about X
      IG = H(Y) − H(Y | X)
      H(Y) = −∑_{y ∈ Y} p(y) log p(y)
      H(Y | X) = −∑_{x ∈ X} p(x) ∑_{y ∈ Y} p(y | x) log p(y | x)
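
      A Python sketch of these quantities, computed from a list of feature values and a parallel list of labels (base-2 logs; the function names are mine, not the lecture's):

      import math
      from collections import Counter

      def label_entropy(labels):
          # H(Y) = -sum_y p(y) log2 p(y), estimated from label counts
          n = len(labels)
          return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

      def conditional_entropy(feature, labels):
          # H(Y | X) = sum_x P(X = x) * H(Y | X = x)
          n = len(labels)
          total = 0.0
          for value, count in Counter(feature).items():
              subset = [y for x, y in zip(feature, labels) if x == value]
              total += (count / n) * label_entropy(subset)
          return total

      def information_gain(feature, labels):
          # IG = MI = H(Y) - H(Y | X)
          return label_entropy(labels) - conditional_entropy(feature, labels)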

  21. Example | 1 | 2 | 3 | 4 | 5 | 6
      x₁      | 0 | 1 | 1 | 0 | 0 | 1
      x₂      | 0 | 0 | 0 | 1 | 1 | 1
      y       | ⊕ | ⊖ | ⊖ | ⊕ | ⊕ | ⊖
      Which of these features gives you more information about y?

  22. Example | 1 | 2 | 3 | 4 | 5 | 6
      x₁      | 0 | 1 | 1 | 0 | 0 | 1
      x₂      | 0 | 0 | 0 | 1 | 1 | 1
      y       | ⊕ | ⊖ | ⊖ | ⊕ | ⊕ | ⊖
      Contingency table for x₁:
      x₁:        x = 0      x = 1
      y counts:  3 ⊕, 0 ⊖   0 ⊕, 3 ⊖

  23. x₁:        x = 0      x = 1
      y counts:  3 ⊕, 0 ⊖   0 ⊕, 3 ⊖
      H(Y | X) = −∑_{x ∈ X} p(x) ∑_{y ∈ Y} p(y | x) log p(y | x)
      P(x = 0) = 3 / (3 + 3) = 0.5       P(y = + | x = 0) = 3 / (3 + 0) = 1
                                         P(y = − | x = 0) = 0 / (3 + 0) = 0
      P(x = 1) = 3 / (3 + 3) = 0.5       P(y = + | x = 1) = 0 / (0 + 3) = 0
                                         P(y = − | x = 1) = 3 / (0 + 3) = 1

  24. x₁:        x = 0      x = 1
      y counts:  3 ⊕, 0 ⊖   0 ⊕, 3 ⊖
      H(Y | X) = −∑_{x ∈ X} p(x) ∑_{y ∈ Y} p(y | x) log p(y | x)
      = −(3/6)(1 log 1 + 0 log 0) − (3/6)(0 log 0 + 1 log 1) = 0

  25. Example | 1 | 2 | 3 | 4 | 5 | 6
      x₁      | 0 | 1 | 1 | 0 | 0 | 1
      x₂      | 0 | 0 | 0 | 1 | 1 | 1
      y       | ⊕ | ⊖ | ⊖ | ⊕ | ⊕ | ⊖
      Contingency table for x₂:
      x₂:        x = 0      x = 1
      y counts:  1 ⊕, 2 ⊖   2 ⊕, 1 ⊖

  26. x₂:        x = 0      x = 1
      y counts:  1 ⊕, 2 ⊖   2 ⊕, 1 ⊖
      P(x = 0) = 3 / (3 + 3) = 0.5       P(y = + | x = 0) = 1 / (1 + 2) = 0.33
                                         P(y = − | x = 0) = 2 / (1 + 2) = 0.67
      P(x = 1) = 3 / (3 + 3) = 0.5       P(y = + | x = 1) = 2 / (2 + 1) = 0.67
                                         P(y = − | x = 1) = 1 / (2 + 1) = 0.33

  27. x₂:        x = 0      x = 1
      y counts:  1 ⊕, 2 ⊖   2 ⊕, 1 ⊖
      H(Y | X) = −∑_{x ∈ X} p(x) ∑_{y ∈ Y} p(y | x) log p(y | x)
      = −(3/6)(0.33 log 0.33 + 0.67 log 0.67) − (3/6)(0.67 log 0.67 + 0.33 log 0.33) = 0.91
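
      Running the conditional_entropy sketch from above on this toy data reproduces both results (the exact value for x₂ is ≈ 0.92; the slide's 0.91 comes from rounding the probabilities to 0.33 and 0.67):

      x1 = [0, 1, 1, 0, 0, 1]
      x2 = [0, 0, 0, 1, 1, 1]
      y  = ["+", "-", "-", "+", "+", "-"]

      print(conditional_entropy(x1, y))   # 0.0    -> x1 predicts y perfectly
      print(conditional_entropy(x2, y))   # 0.918… -> x2 is far less informative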

  28. Feature                           | H(Y | X)
      follow clinton                    | 0.91
      follow trump                      | 0.77
      "benghazi"                        | 0.45
      negative sentiment + "benghazi"   | 0.33
      "illegal immigrants"              | 0
      "republican" in profile           | 0.31
      "democrat" in profile             | 0.67
      self-reported location = Berkeley | 0.80
      In decision trees, the feature with the lowest conditional entropy / highest information gain defines the "best split".
      MI = IG = H(Y) − H(Y | X)
      For a given partition, H(Y) is the same for all features, so we can ignore it when deciding among them.

  29. Feature                           | H(Y | X)
      follow clinton                    | 0.91
      follow trump                      | 0.77
      "benghazi"                        | 0.45
      negative sentiment + "benghazi"   | 0.33
      "illegal immigrants"              | 0
      "republican" in profile           | 0.31
      "democrat" in profile             | 0.67
      self-reported location = Berkeley | 0.80
      How could we use this in other models (e.g., the perceptron)?

  30. Decision trees BestSplit identifies the feature with the highest information gain and partitions the data according to values for that feature
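
      Putting the pieces together, here is a compact recursive sketch in Python; it reuses information_gain from the earlier sketch as BestSplit, and (as one possible assumption) treats a partition as homogeneous only when all of its labels agree:

      from collections import Counter

      def best_split(features, labels):
          # features: dict mapping feature name -> list of values (one per example)
          # pick the feature with the highest information gain (lowest H(Y | X))
          return max(features, key=lambda f: information_gain(features[f], labels))

      def grow_tree(features, labels):
          # Homogeneous(D) / Label(D): return a leaf when labels agree or no features remain
          if len(set(labels)) == 1 or not features:
              return Counter(labels).most_common(1)[0][0]
          f = best_split(features, labels)
          children = {}
          for value in set(features[f]):
              rows = [i for i, x in enumerate(features[f]) if x == value]
              rest = {g: [vals[i] for i in rows] for g, vals in features.items() if g != f}
              children[value] = grow_tree(rest, [labels[i] for i in rows])
          return (f, children)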

  31. Gini impurity
      • Measures the "purity" of a partition (how diverse the labels are): if we were to pick an element of D at random and assign it a label in proportion to the label distribution in D, how often would we make a mistake?
      ∑_{y ∈ Y} p_y (1 − p_y)
      p_y is the probability of selecting an item with label y at random; (1 − p_y) is the probability of randomly assigning it the wrong label.

  32. Gini impurity: ∑_{y ∈ Y} p_y (1 − p_y)
      x₁:        x = 0      x = 1             x₂:        x = 0      x = 1
      y counts:  3 ⊕, 0 ⊖   0 ⊕, 3 ⊖          y counts:  1 ⊕, 2 ⊖   2 ⊕, 1 ⊖
      For x₁:
      G(0) = 1 × (1 − 1) + 0 × (1 − 0) = 0
      G(1) = 0 × (1 − 0) + 1 × (1 − 1) = 0
      G(x₁) = (3 / (3 + 3)) × 0 + (3 / (3 + 3)) × 0 = 0
      For x₂:
      G(0) = 0.33 × (1 − 0.33) + 0.67 × (1 − 0.67) = 0.44
      G(1) = 0.67 × (1 − 0.67) + 0.33 × (1 − 0.33) = 0.44
      G(x₂) = (3 / (3 + 3)) × 0.44 + (3 / (3 + 3)) × 0.44 = 0.44
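
      The same computation as a small Python sketch (the function names are mine, not the lecture's):

      from collections import Counter

      def gini(labels):
          # sum_y p_y * (1 - p_y) over the label distribution of one partition
          n = len(labels)
          return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

      def weighted_gini(feature, labels):
          # average impurity of the child partitions, weighted by their sizes
          n = len(labels)
          return sum(
              (count / n) * gini([y for x, y in zip(feature, labels) if x == value])
              for value, count in Counter(feature).items()
          )

      x1 = [0, 1, 1, 0, 0, 1]
      x2 = [0, 0, 0, 1, 1, 1]
      y  = ["+", "-", "-", "+", "+", "-"]

      print(weighted_gini(x1, y))   # 0.0
      print(weighted_gini(x2, y))   # 0.444… (the slide's 0.44)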

  33. Classification
      A mapping h from input data x (drawn from instance space 𝒳) to a label (or labels) y from some enumerable output space 𝒴
      𝒳 = set of all skyscrapers
      𝒴 = {art deco, neo-gothic, modern}
      x = the empire state building
      y = art deco

  34. [Decision tree diagram over the features in the table below: splits on lives in Berkeley, follows Trump, contains "email", profile contains "Republican"; leaves labeled D / R]
      Feature | Value
      follow clinton | 0
      follow trump | 0
      "benghazi" | 0
      negative sentiment + "benghazi" | 0
      "illegal immigrants" | 0
      "republican" in profile | 0
      "democrat" in profile | 0
      self-reported location = Berkeley | 1
      The tree that we've learned is the mapping ĥ(x).

  35. [Same decision tree diagram and feature table as the previous slide]
      How is this different from the perceptron?

  36. Regression
      A mapping from input data x (drawn from instance space 𝒳) to a point y in ℝ (ℝ = the set of real numbers)
      x = the empire state building
      y = 17444.5625"
