Deconstructing Data Science
David Bamman, UC Berkeley, Info 290
Lecture 7: Decision trees & random forests
Feb 10, 2016
Linear regression, deep learning, decision trees, ordinal regression, probabilistic graphical models, random forests, logistic regression, networks, support vector machines, topic models, survival models, k-means clustering, neural networks, hierarchical clustering, perceptron
Decision trees Random forests
20 questions
Feature                              Value
follow clinton                       0
follow trump                         0
“benghazi”                           0
negative sentiment + “benghazi”      0
“illegal immigrants”                 0
“republican” in profile              0
“democrat” in profile                0
self-reported location = Berkeley    1

[Diagram: an example decision tree over these features; the root tests “lives in Berkeley,” internal nodes test “follows Trump” and whether the profile contains “email” or “Republican,” and the leaves are labeled D or R]
[Diagram: two candidate decision trees for the D/R classification task: one built from political features (lives in Berkeley, follows Trump, contains “email,” profile contains “Republican”) and one built from function words (contains “the,” “a,” “he,” “she,” “they”)]

How do we find the best tree?
[Diagram: a much larger decision tree built from function words alone (contains “the,” “a,” “he,” “she,” “they,” “an,” “are,” “our,” “him,” “them,” “her,” “his,” “hers,” …)]

How do we find the best tree?
Decision trees from Flach 2014
[Figure: ⟨x, y⟩ training data in feature space, partitioned by the splits x2 > 15 vs. x2 ≤ 15, then x2 > 5 vs. x2 ≤ 5 and x1 > 10 vs. x1 ≤ 10]
Decision trees from Flach 2014
Decision trees
• Homogeneous(D): the elements in D are homogeneous enough that they can be labeled with a single label
• Label(D): the single most appropriate label for all elements in D
Decision trees

                  Homogeneous(D)                               Label(D)
Classification    all (or most) of the elements in D           the shared label y
                  share the same label y
Regression        the elements in D have low variance          the average of the elements in D
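A minimal sketch of these two subroutines in Python (the function names and the 80% homogeneity threshold are my illustrative assumptions, not from the lecture):

```python
from collections import Counter
from statistics import mean, pstdev

def homogeneous(labels, threshold=0.8):
    """Classification: do enough elements of D share the same label?"""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels) >= threshold

def label(labels):
    """Classification: the single most frequent label in D."""
    return Counter(labels).most_common(1)[0][0]

def homogeneous_reg(values, max_std=1.0):
    """Regression: do the elements of D have low variance?"""
    return pstdev(values) <= max_std

def label_reg(values):
    """Regression: the average of the elements in D."""
    return mean(values)
```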
Decision trees from Flach 2014
Entropy

Measure of uncertainty in a probability distribution:

H(X) = −Σ_{x ∈ X} P(x) log P(x)

• a great _______
• the oakland ______
a great …              the oakland …
deal         12196     athletics    185
job           2164     raiders      185
idea          1333     museum        92
opportunity    855     hills         72
weekend        585     tribune       51
player         556     police        49
extent         439     coliseum      41
honor          282
pleasure       267
gift           256
humor          221
tool           184
athlete        173
disservice     108
…

(Corpus of Contemporary American English)
Entropy

H(X) = −Σ_{x ∈ X} P(x) log P(x)

• High entropy means the phenomenon is less predictable
• Entropy of 0 means it is entirely predictable
Entropy

A uniform distribution has maximum entropy. For a fair six-sided die, P(X = x) = 1/6 for each face:

H = −Σ_{x=1}^{6} (1/6) log2(1/6) = log2 6 ≈ 2.58

For a loaded die with P(X = 2) = 0.4 and P(X = x) = 0.12 for the other five faces:

H = −0.4 log2 0.4 − 5 × 0.12 log2 0.12 ≈ 2.36

This entropy is lower because the outcome is more predictable (if we always guess 2, we would be right 40% of the time).
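The two dice above can be checked in a few lines of Python (a quick sketch using base-2 logarithms, as the slide's numbers assume):

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair six-sided die: maximum entropy for six outcomes.
uniform = [1/6] * 6
print(round(entropy(uniform), 2))   # 2.58

# A loaded die: P(2) = 0.4, the other five faces 0.12 each.
loaded = [0.12, 0.4, 0.12, 0.12, 0.12, 0.12]
print(round(entropy(loaded), 2))    # 2.36
```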
Conditional entropy • Measures your level of surprise about some phenomenon Y if you have information about another phenomenon X • Y = word, X = preceding bigram (“the oakland ___”) • Y = label (democrat, republican), X = feature (lives in Berkeley)
Conditional entropy

• Measures your level of surprise about some phenomenon Y if you have information about another phenomenon X (X = feature, Y = label)

H(Y | X) = Σ_{x} P(X = x) H(Y | X = x)

H(Y | X = x) = −Σ_{y ∈ Y} P(y | x) log P(y | x)
Information gain

• aka “mutual information”: the reduction in entropy in Y as a result of knowing information about X

IG = H(Y) − H(Y | X)

H(Y) = −Σ_{y ∈ Y} P(y) log P(y)

H(Y | X) = −Σ_{x ∈ X} P(x) Σ_{y ∈ Y} P(y | x) log P(y | x)
       1  2  3  4  5  6
x1:    0  1  1  0  0  1
x2:    0  0  0  1  1  1
y:     ⊕  ⊖  ⊖  ⊕  ⊕  ⊖

Which of these features gives you more information about y?
       1  2  3  4  5  6
x1:    0  1  1  0  0  1
x2:    0  0  0  1  1  1
y:     ⊕  ⊖  ⊖  ⊕  ⊕  ⊖

Splitting on x1:    x1 = 0: 3 ⊕, 0 ⊖        x1 = 1: 0 ⊕, 3 ⊖
Splitting on x1:    x1 = 0: 3 ⊕, 0 ⊖        x1 = 1: 0 ⊕, 3 ⊖

H(Y | X) = −Σ_{x ∈ X} P(x) Σ_{y ∈ Y} P(y | x) log P(y | x)

P(x = 0) = 3 / (3 + 3) = 0.5        P(y = ⊕ | x = 0) = 3 / (3 + 0) = 1
                                    P(y = ⊖ | x = 0) = 0 / (3 + 0) = 0
P(x = 1) = 3 / (3 + 3) = 0.5        P(y = ⊕ | x = 1) = 0 / (0 + 3) = 0
                                    P(y = ⊖ | x = 1) = 3 / (0 + 3) = 1
Splitting on x1:    x1 = 0: 3 ⊕, 0 ⊖        x1 = 1: 0 ⊕, 3 ⊖

H(Y | X) = −(3/6)(1 log 1 + 0 log 0) − (3/6)(0 log 0 + 1 log 1) = 0
       1  2  3  4  5  6
x1:    0  1  1  0  0  1
x2:    0  0  0  1  1  1
y:     ⊕  ⊖  ⊖  ⊕  ⊕  ⊖

Splitting on x2:    x2 = 0: 1 ⊕, 2 ⊖        x2 = 1: 2 ⊕, 1 ⊖
Splitting on x2:    x2 = 0: 1 ⊕, 2 ⊖        x2 = 1: 2 ⊕, 1 ⊖

P(x = 0) = 3 / (3 + 3) = 0.5        P(y = ⊕ | x = 0) = 1 / (1 + 2) = 0.33
                                    P(y = ⊖ | x = 0) = 2 / (1 + 2) = 0.67
P(x = 1) = 3 / (3 + 3) = 0.5        P(y = ⊕ | x = 1) = 2 / (2 + 1) = 0.67
                                    P(y = ⊖ | x = 1) = 1 / (2 + 1) = 0.33
Splitting on x2:    x2 = 0: 1 ⊕, 2 ⊖        x2 = 1: 2 ⊕, 1 ⊖

H(Y | X) = −Σ_{x ∈ X} P(x) Σ_{y ∈ Y} P(y | x) log P(y | x)

H(Y | X) = −(3/6)(0.33 log 0.33 + 0.67 log 0.67) − (3/6)(0.67 log 0.67 + 0.33 log 0.33) ≈ 0.91
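The conditional entropies worked out above can be reproduced directly (a sketch using base-2 logs; the slide's 0.91 reflects rounding the probabilities to 0.33/0.67 before taking logs, while the unrounded value is ≈ 0.92):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y | X) = sum over x of P(x) * H(Y | X = x)."""
    n = len(xs)
    h = 0.0
    for x in set(xs):
        ys_given_x = [y for xi, y in zip(xs, ys) if xi == x]
        h += (len(ys_given_x) / n) * entropy(ys_given_x)
    return h

# The six-example toy dataset from the slides
x1 = [0, 1, 1, 0, 0, 1]
x2 = [0, 0, 0, 1, 1, 1]
y  = ["+", "-", "-", "+", "+", "-"]

print(round(conditional_entropy(x1, y), 2))   # 0.0: x1 predicts y perfectly
print(round(conditional_entropy(x2, y), 2))   # 0.92
```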
Feature                              H(Y | X)
follow clinton                       0.91
follow trump                         0.77
“benghazi”                           0.45
negative sentiment + “benghazi”      0.33
“illegal immigrants”                 0
“republican” in profile              0.31
“democrat” in profile                0.67
self-reported location = Berkeley    0.80

In decision trees, the feature with the lowest conditional entropy (highest information gain) defines the “best split”

MI = IG = H(Y) − H(Y | X)

For a given partition, H(Y) is the same for all features, so we can ignore it when deciding among them
Feature                              H(Y | X)
follow clinton                       0.91
follow trump                         0.77
“benghazi”                           0.45
negative sentiment + “benghazi”      0.33
“illegal immigrants”                 0
“republican” in profile              0.31
“democrat” in profile                0.67
self-reported location = Berkeley    0.80

How could we use this in other models (e.g., the perceptron)?
Decision trees BestSplit identifies the feature with the highest information gain and partitions the data according to values for that feature
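A minimal recursive sketch of tree growing with this BestSplit (assuming the conditional-entropy criterion from the preceding slides; helper names like grow_tree are mine, not from Flach):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(data, labels, features):
    """BestSplit: the feature with the lowest H(Y | X), i.e. highest IG."""
    def cond_entropy(f):
        n = len(data)
        h = 0.0
        for v in set(d[f] for d in data):
            ys = [lab for d, lab in zip(data, labels) if d[f] == v]
            h += (len(ys) / n) * entropy(ys)
        return h
    return min(features, key=cond_entropy)

def grow_tree(data, labels, features):
    # Homogeneous(D): stop once a single label remains (or features run out)
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]   # Label(D)
    f = best_split(data, labels, features)
    branches = {}
    for v in set(d[f] for d in data):
        idx = [i for i, d in enumerate(data) if d[f] == v]
        branches[v] = grow_tree([data[i] for i in idx],
                                [labels[i] for i in idx],
                                [g for g in features if g != f])
    return (f, branches)

# The six-example toy dataset from the slides
data = [{"x1": 0, "x2": 0}, {"x1": 1, "x2": 0}, {"x1": 1, "x2": 0},
        {"x1": 0, "x2": 1}, {"x1": 0, "x2": 1}, {"x1": 1, "x2": 1}]
y = ["+", "-", "-", "+", "+", "-"]
print(grow_tree(data, y, ["x1", "x2"]))   # ('x1', {0: '+', 1: '-'})
```

As expected, the root splits on x1, the feature with zero conditional entropy, and both children are immediately homogeneous.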
Gini impurity

• Measures the “purity” of a partition (how diverse the labels are): if we were to pick an element of D at random and assign it a label in proportion to the label distribution in D, how often would we make a mistake?

G = Σ_{y ∈ Y} p_y (1 − p_y)

p_y is the probability of selecting an item with label y at random; (1 − p_y) is the probability of then randomly assigning it the wrong label.
Gini impurity

G = Σ_{y ∈ Y} p_y (1 − p_y)

Splitting on x1 (x1 = 0: 3 ⊕, 0 ⊖;   x1 = 1: 0 ⊕, 3 ⊖):
G(0) = 1 × (1 − 1) + 0 × (1 − 0) = 0
G(1) = 0 × (1 − 0) + 1 × (1 − 1) = 0
G(x1) = (3/6) × 0 + (3/6) × 0 = 0

Splitting on x2 (x2 = 0: 1 ⊕, 2 ⊖;   x2 = 1: 2 ⊕, 1 ⊖):
G(0) = 0.33 × (1 − 0.33) + 0.67 × (1 − 0.67) = 0.44
G(1) = 0.67 × (1 − 0.67) + 0.33 × (1 − 0.33) = 0.44
G(x2) = (3/6) × 0.44 + (3/6) × 0.44 = 0.44
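These split evaluations can be checked with a short sketch (the helper names are my own, not from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of labels: sum over y of p_y * (1 - p_y)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_of_split(xs, ys):
    """Weighted Gini impurity of the partition induced by a feature."""
    n = len(xs)
    total = 0.0
    for v in set(xs):
        part = [y for x, y in zip(xs, ys) if x == v]
        total += (len(part) / n) * gini(part)
    return total

# The six-example toy dataset from the slides
x1 = [0, 1, 1, 0, 0, 1]
x2 = [0, 0, 0, 1, 1, 1]
y  = ["+", "-", "-", "+", "+", "-"]

print(gini_of_split(x1, y))             # 0.0: a perfectly pure split
print(round(gini_of_split(x2, y), 2))   # 0.44
```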
Classification

A mapping h from input data x (drawn from instance space 𝒳) to a label (or labels) y from some enumerable output space 𝒴

𝒳 = set of all skyscrapers
𝒴 = {art deco, neo-gothic, modern}
x = the Empire State Building
y = art deco
Feature                              Value
follow clinton                       0
follow trump                         0
“benghazi”                           0
negative sentiment + “benghazi”      0
“illegal immigrants”                 0
“republican” in profile              0
“democrat” in profile                0
self-reported location = Berkeley    1

[Diagram: the learned decision tree, with root “lives in Berkeley,” internal tests “follows Trump” and profile contains “email” / “Republican,” and leaves labeled D and R]

The tree that we’ve learned is the mapping ĥ(x)
Feature                              Value
follow clinton                       0
follow trump                         0
“benghazi”                           0
negative sentiment + “benghazi”      0
“illegal immigrants”                 0
“republican” in profile              0
“democrat” in profile                0
self-reported location = Berkeley    1

[Diagram: the learned decision tree, with root “lives in Berkeley,” internal tests “follows Trump” and profile contains “email” / “Republican,” and leaves labeled D and R]

How is this different from the perceptron?
Regression

A mapping from input data x (drawn from instance space 𝒳) to a point y in ℝ (ℝ = the set of real numbers)

x = the Empire State Building
y = 17444.5625”