  1. Deconstructing Data Science
      David Bamman, UC Berkeley
      Info 290
      Lecture 9: Logistic regression, Feb 22, 2016

  2. Generative vs. Discriminative models
      • Generative models specify a joint distribution over the labels and the data; with this you could generate new data: $P(x, y) = P(y)\,P(x \mid y)$
      • Discriminative models specify the conditional distribution of the label y given the data x; these models focus on how to discriminate between the classes: $P(y \mid x)$

  3. Generating
      [Figure: bar charts of the word distributions P(x | y = Hamlet) and P(x | y = Romeo and Juliet) over a small vocabulary (the, a, of, love, sword, poison, hamlet, romeo, king, capulet, be, woe, him, most)]

  4. Generative models
      • With generative models (e.g., Naive Bayes), we ultimately also care about P(y | x), but we get there by modeling more:
      $$\underbrace{P(Y = y \mid x)}_{\text{posterior}} = \frac{\overbrace{P(Y = y)}^{\text{prior}}\;\overbrace{P(x \mid Y = y)}^{\text{likelihood}}}{\sum_{y' \in \mathcal{Y}} P(Y = y')\,P(x \mid Y = y')}$$
      • Discriminative models focus on modeling P(y | x), and only P(y | x), directly.
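      A minimal added sketch (not from the slides) of how a generative model such as Naive Bayes turns a prior and a per-class likelihood into the posterior P(y | x); the class priors, vocabulary, and word probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical class priors and per-class word probabilities (made-up values).
priors = {"Hamlet": 0.5, "Romeo and Juliet": 0.5}
likelihoods = {
    "Hamlet":           {"the": 0.10, "sword": 0.04, "poison": 0.02, "love": 0.01},
    "Romeo and Juliet": {"the": 0.10, "sword": 0.01, "poison": 0.03, "love": 0.05},
}

def posterior(doc, priors, likelihoods):
    """P(y | x) by Bayes' rule: prior times likelihood, normalized over classes."""
    scores = {}
    for y in priors:
        p = priors[y]
        for word in doc:
            p *= likelihoods[y].get(word, 1e-6)   # tiny floor for unseen words
        scores[y] = p
    z = sum(scores.values())                       # the denominator above
    return {y: s / z for y, s in scores.items()}

print(posterior(["love", "poison"], priors, likelihoods))
```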

  5. Remember
      $$\sum_{i=1}^{F} x_i \beta_i = x_1\beta_1 + x_2\beta_2 + \dots + x_F\beta_F$$
      $$\prod_{i=1}^{F} x_i = x_1 \times x_2 \times \dots \times x_F$$
      $$\exp(x) = e^x \approx 2.7^x$$
      $$\exp(x + y) = \exp(x)\exp(y)$$
      $$\log(x) = y \rightarrow e^y = x$$
      $$\log(xy) = \log(x) + \log(y)$$
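      A quick numerical check of these identities, added here as a sketch (NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0])
beta = np.array([0.5, -1.0, 0.25])

# sum_i x_i * beta_i is just a dot product
assert np.isclose(np.sum(x * beta), x @ beta)

# exp(a + b) = exp(a) * exp(b)
a, b = 1.3, -0.7
assert np.isclose(np.exp(a + b), np.exp(a) * np.exp(b))

# log(pq) = log(p) + log(q), and log inverts exp
p, q = 2.0, 5.0
assert np.isclose(np.log(p * q), np.log(p) + np.log(q))
assert np.isclose(np.exp(np.log(p)), p)

print("all identities hold")
```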

  6. Classification
      A mapping h from input data x (drawn from instance space 𝒳) to a label (or labels) y from some enumerable output space 𝒴
      𝒳 = set of all skyscrapers
      𝒴 = {art deco, neo-gothic, modern}
      x = the Empire State Building
      y = art deco

  7. x = feature vector, β = coefficients
      Feature                              x (value)    β
      follow clinton                           0        -3.1
      follow trump                             0         6.8
      “benghazi”                               0         1.4
      negative sentiment + “benghazi”          0         3.2
      “illegal immigrants”                     0         8.7
      “republican” in profile                  0         7.9
      “democrat” in profile                    0        -3.0
      self-reported location = Berkeley        1        -1.7

  8. Logistic regression
      $$P(y \mid x, \beta) = \frac{\exp\left(\sum_{i=1}^{F} x_i \beta_i\right)}{1 + \exp\left(\sum_{i=1}^{F} x_i \beta_i\right)}$$
      output space: 𝒴 = {0, 1}
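      As an added sketch (the feature and coefficient values are arbitrary), the same probability written as code:

```python
import numpy as np

def logistic_prob(x, beta):
    """P(y = 1 | x, beta) = exp(sum_i x_i beta_i) / (1 + exp(sum_i x_i beta_i))."""
    a = np.dot(x, beta)
    return np.exp(a) / (1.0 + np.exp(a))

x = np.array([1.0, 0.0, 1.0])        # made-up binary feature vector
beta = np.array([0.7, 1.2, -1.1])    # made-up coefficients
print(logistic_prob(x, beta))        # ~0.40
```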

  9.           benghazi   follows trump   follows clinton   a = ∑ x_i β_i   exp(a)   exp(a)/(1+exp(a))
      β           0.7          1.2             -1.1
      x1          1            1                0                1.9         6.69         87.0%
      x2          0            0                1               -1.1         0.33         25.0%
      x3          1            0                1               -0.4         0.67         40.1%
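      These rows can be reproduced directly (a short added check):

```python
import numpy as np

beta = np.array([0.7, 1.2, -1.1])    # benghazi, follows trump, follows clinton

examples = {
    "x1": np.array([1.0, 1.0, 0.0]),
    "x2": np.array([0.0, 0.0, 1.0]),
    "x3": np.array([1.0, 0.0, 1.0]),
}

for name, x in examples.items():
    a = x @ beta                          # a = sum_i x_i * beta_i
    p = np.exp(a) / (1 + np.exp(a))       # P(y = 1 | x, beta)
    print(f"{name}: a = {a:+.1f}, exp(a) = {np.exp(a):.2f}, P = {p:.1%}")
# x1: a = +1.9, exp(a) = 6.69, P = 87.0%
# x2: a = -1.1, exp(a) = 0.33, P = 25.0%
# x3: a = -0.4, exp(a) = 0.67, P = 40.1%
```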

  10. How do we get good values for β?
      β = coefficients
      Feature                              β
      follow clinton                      -3.1
      follow trump                         6.8
      “benghazi”                           1.4
      negative sentiment + “benghazi”      3.2
      “illegal immigrants”                 8.7
      “republican” in profile              7.9
      “democrat” in profile               -3.0
      self-reported location = Berkeley   -1.7

  11. Likelihood
      Remember: the likelihood of data is its probability under some parameter values.
      In maximum likelihood estimation, we pick the values of the parameters under which the data is most likely.

  12. Likelihood
      [Figure: bar charts of P(roll) for a fair die (uniform over 1–6) and a “not fair” die that puts half its mass on 6]
      P(2, 6, 6 | fair die)     = .17 × .17 × .17 = 0.004913
      P(2, 6, 6 | not fair die) = .1  × .5  × .5  = 0.025
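      A short added sketch of the same computation (die probabilities read off the slide; the fair die's 1/6 is rounded to .17 there):

```python
import numpy as np

fair     = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}
not_fair = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}   # weighted toward 6

rolls = [2, 6, 6]

def likelihood(rolls, die):
    """Probability of the observed rolls under a given die (independent rolls)."""
    return np.prod([die[r] for r in rolls])

print(likelihood(rolls, fair))       # ~0.0046 (= (1/6)^3)
print(likelihood(rolls, not_fair))   # 0.025
```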

  13. Conditional likelihood
      For all training data, we want the probability of the true label y for each data point x to be high:
      $$\prod_{i=1}^{N} P(y_i \mid x_i, \beta)$$
      This principle gives us a way to pick the values of the parameters β that maximize the probability of the training data ⟨x, y⟩.

  14. The value of β that maximizes the likelihood also maximizes the log likelihood:
      $$\arg\max_{\beta} \prod_{i=1}^{N} P(y_i \mid x_i, \beta) = \arg\max_{\beta} \log \prod_{i=1}^{N} P(y_i \mid x_i, \beta)$$
      The log likelihood is an easier form to work with:
      $$\log \prod_{i=1}^{N} P(y_i \mid x_i, \beta) = \sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)$$
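      One practical reason it is easier, shown as an added sketch with made-up per-example probabilities: a long product of probabilities underflows to zero, while the sum of logs stays well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.6, 0.99, size=5000)   # pretend these are P(y_i | x_i, beta)

product = np.prod(probs)          # underflows to 0.0 for a product this long
log_lik = np.sum(np.log(probs))   # a finite, usable number

print(product, log_lik)
```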

  15. • We want to find the value of β that leads to the highest value of the log likelihood:
      $$\ell(\beta) = \sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)$$
      • Solution: derivatives!

  16. [Figure: plot of f(x) = −x² with gradient-ascent iterates stepping toward the maximum at x = 0]
      $$\frac{d}{dx}\left(-x^2\right) = -2x$$
      Update: x ← x + α(−2x), with α = 0.1
      Iterates: 8.00 → 6.40 → 5.12 → 4.10 → 3.28 → 2.62 → 2.10 → 1.68 → 1.34 → 1.07 → 0.86 → 0.69 → …
      We can get to the maximum value of this function by following the gradient.
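      The same iterates as a tiny added sketch:

```python
# Gradient ascent on f(x) = -x^2; its derivative is -2x.
alpha = 0.1
x = 8.0
for step in range(11):
    x = x + alpha * (-2 * x)     # step uphill along the gradient (x <- 0.8 x)
    print(f"step {step + 1}: x = {x:.2f}")
# 6.40, 5.12, 4.10, 3.28, 2.62, 2.10, 1.68, 1.34, 1.07, 0.86, 0.69
```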

  17. We want to find the values of β that make the value of this function the greatest:
      $$\sum_{\langle x, y=1 \rangle} \log P(1 \mid x, \beta) + \sum_{\langle x, y=0 \rangle} \log P(0 \mid x, \beta)$$
      $$\frac{\partial \ell(\beta)}{\partial \beta_i} = \sum_{\langle x, y \rangle} \left(y - \hat{p}(x)\right) x_i$$

  18. Gradient descent
      The update is proportional to the error (y − p̂(x)):
      • If y is 1 and p̂(x) = 0, then this pushes the weights a lot.
      • If y is 1 and p̂(x) = 0.99, then this still pushes the weights, but just a little bit.

  19. Stochastic g.d.
      • Batch gradient descent reasons over every training data point for each update of β. This can be slow to converge.
      • Stochastic gradient descent updates β after each data point.
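      A minimal added sketch of stochastic gradient descent for logistic regression, using the per-example gradient (y − p̂(x)) x from slide 17; the data, learning rate, and epoch count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 binary features, labels drawn from a logistic model with known weights.
true_beta = np.array([0.7, 1.2, -1.1])
X = rng.integers(0, 2, size=(1000, 3)).astype(float)
y = (rng.uniform(size=1000) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)

beta = np.zeros(3)
alpha = 0.1                          # learning rate (arbitrary choice)

for epoch in range(5):
    for x_i, y_i in zip(X, y):       # stochastic: update beta after every single data point
        p_hat = 1 / (1 + np.exp(-(x_i @ beta)))
        beta += alpha * (y_i - p_hat) * x_i

print(beta)                          # should end up roughly near [0.7, 1.2, -1.1]
```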

  20. Perceptron

  21. 21

  22. Stochastic g.d.
      Logistic regression stochastic update: $\beta_i \leftarrow \beta_i + \alpha\,(y - \hat{p}(x))\,x_i$   (p̂ is between 0 and 1)
      Perceptron stochastic update: $\beta_i \leftarrow \beta_i + \alpha\,(y - \hat{y})\,x_i$   (ŷ is exactly 0 or 1)
      The perceptron is an approximation to logistic regression.
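      A small added sketch contrasting the two updates on a single hypothetical example:

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0])          # one training example (made up)
y = 1.0                                 # its true label
beta = np.array([0.7, 1.2, -1.1])       # current weights
alpha = 0.1

a = x @ beta
p_hat = np.exp(a) / (1 + np.exp(a))     # logistic regression: a probability in (0, 1)
y_hat = 1.0 if a >= 0 else 0.0          # perceptron: a hard 0/1 prediction

beta_lr   = beta + alpha * (y - p_hat) * x   # graded push, proportional to the error
beta_perc = beta + alpha * (y - y_hat) * x   # all-or-nothing push

print(beta_lr)     # [ 0.76  1.2  -1.04]  (p_hat ~ 0.40, so a ~0.06 push on active features)
print(beta_perc)   # [ 0.8   1.2  -1.0 ]  (y_hat = 0 here, so the full 0.1 push)
```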

  23. Practicalities
      • When calculating P(y | x) or the gradient, you don't need to loop through all features, only those with nonzero values
      • (Which makes sparse, binary values useful)
      $$P(y \mid x, \beta) = \frac{\exp\left(\sum_{i=1}^{F} x_i \beta_i\right)}{1 + \exp\left(\sum_{i=1}^{F} x_i \beta_i\right)} \qquad\qquad \frac{\partial \ell(\beta)}{\partial \beta_i} = \sum_{\langle x, y \rangle} \left(y - \hat{p}(x)\right) x_i$$
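      A minimal added sketch of that trick, storing each example as a dict of only its active features (the feature names are hypothetical):

```python
from collections import defaultdict
from math import exp

beta = defaultdict(float)                 # weights default to 0 for unseen features

def p_hat(x, beta):
    """P(y = 1 | x): loop only over the features present in the example."""
    a = sum(value * beta[feat] for feat, value in x.items())
    return exp(a) / (1 + exp(a))

def sgd_update(x, y, beta, alpha=0.1):
    """beta_i += alpha * (y - p_hat(x)) * x_i, touching only nonzero x_i."""
    err = y - p_hat(x, beta)
    for feat, value in x.items():
        beta[feat] += alpha * err * value

x = {"follow trump": 1, "benghazi": 1}    # one example, only active features listed
sgd_update(x, 1.0, beta)
print(dict(beta))                         # only the two active features moved off 0
```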

  24. $$\frac{\partial \ell(\beta)}{\partial \beta_i} = \sum_{\langle x, y \rangle} \left(y - \hat{p}(x)\right) x_i$$
      If a feature x_i only shows up with one class (e.g., democrats), what are the possible values of its corresponding β_i?
      $$\frac{\partial \ell(\beta)}{\partial \beta_i} = \sum_{\langle x, y \rangle} (1 - 0) \cdot 1 \qquad\qquad \frac{\partial \ell(\beta)}{\partial \beta_i} = \sum_{\langle x, y \rangle} (1 - 0.9999999) \cdot 1$$
      The gradient is always positive, so β_i just keeps growing.

  25. β = coefficients
      Many features that show up rarely may appear, just by chance, with only one label. More generally, they may appear so few times that the noise of randomness dominates.
      Feature                                       β
      follow clinton                               -3.1
      follow trump + follow NFL + follow bieber     7299302
      “benghazi”                                    1.4
      negative sentiment + “benghazi”               3.2
      “illegal immigrants”                          8.7
      “republican” in profile                       7.9
      “democrat” in profile                        -3.0
      self-reported location = Berkeley            -1.7

  26. Feature selection
      • We could threshold features by minimum count, but that also throws away information
      • We can take a probabilistic approach and encode a prior belief that all β should be 0 unless we have strong evidence otherwise

  27. L2 regularization
      $$\ell(\beta) = \underbrace{\sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)}_{\text{we want this to be high}} - \underbrace{\eta \sum_{j=1}^{F} \beta_j^2}_{\text{but we want this to be small}}$$
      • We can do this by changing the function we're trying to optimize, adding a penalty for values of β that are high
      • This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0
      • η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data)
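      A small added sketch of how the penalty shows up in a single stochastic gradient step (η and α are arbitrary here): the penalty's gradient, −2ηβ, pulls every weight toward 0.

```python
import numpy as np

def sgd_step_l2(beta, x, y, alpha=0.1, eta=0.01):
    """One stochastic step on log P(y | x, beta) - eta * sum_j beta_j^2."""
    a = x @ beta
    p_hat = np.exp(a) / (1 + np.exp(a))
    grad = (y - p_hat) * x - 2 * eta * beta   # likelihood term minus L2 penalty term
    return beta + alpha * grad

beta = np.array([0.7, 1.2, -1.1])
x, y = np.array([1.0, 0.0, 1.0]), 1.0
print(sgd_step_l2(beta, x, y))   # weights move on the error and are also shrunk toward 0
```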

  28. no L2 regularization          some L2 regularization      high L2 regularization
      33.83  Won Bin                2.17  Eddie Murphy          0.41  Family Film
      29.91  Alexander Beyer        1.98  Tom Cruise            0.41  Thriller
      24.78  Bloopers               1.70  Tyler Perry           0.36  Fantasy
      23.01  Daniel Brühl           1.70  Michael Douglas       0.32  Action
      22.11  Ha Jeong-woo           1.66  Robert Redford        0.25  Buddy film
      20.49  Supernatural           1.66  Julia Roberts         0.24  Adventure
      18.91  Kristine DeBell        1.64  Dance                 0.20  Comp Animation
      18.61  Eddie Murphy           1.63  Schwarzenegger        0.19  Animation
      18.33  Cher                   1.63  Lee Tergesen          0.18  Science Fiction
      18.18  Michael Douglas        1.62  Cher                  0.18  Bruce Willis

  29. [Figure: graphical model with μ and σ² as parents of β, and x and β as parents of y]
      $$\beta \sim \text{Norm}(\mu, \sigma^2)$$
      $$y \sim \text{Ber}\left(\frac{\exp\left(\sum_{i=1}^{F} x_i \beta_i\right)}{1 + \exp\left(\sum_{i=1}^{F} x_i \beta_i\right)}\right)$$

  30. L1 regularization
      $$\ell(\beta) = \underbrace{\sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)}_{\text{we want this to be high}} - \underbrace{\eta \sum_{j=1}^{F} |\beta_j|}_{\text{but we want this to be small}}$$
      • L1 regularization encourages coefficients to be exactly 0
      • η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data)
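      In practice a library handles both penalties; an added sketch with scikit-learn (assuming it is installed; note sklearn's C is the inverse of the penalty strength, so small C means strong regularization), on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)             # toy binary features
y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0.5).astype(int)  # only feature 0 matters

l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print(l2.coef_)   # small but nonzero coefficients on the irrelevant features
print(l1.coef_)   # many coefficients driven exactly to 0
```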

  31. What do the coefficients mean?
      $$P(y \mid x, \beta) = \frac{\exp(x_0\beta_0 + x_1\beta_1)}{1 + \exp(x_0\beta_0 + x_1\beta_1)}$$
      $$P(y \mid x, \beta)\left(1 + \exp(x_0\beta_0 + x_1\beta_1)\right) = \exp(x_0\beta_0 + x_1\beta_1)$$
      $$P(y \mid x, \beta) + P(y \mid x, \beta)\exp(x_0\beta_0 + x_1\beta_1) = \exp(x_0\beta_0 + x_1\beta_1)$$

  32. $$P(y \mid x, \beta) + P(y \mid x, \beta)\exp(x_0\beta_0 + x_1\beta_1) = \exp(x_0\beta_0 + x_1\beta_1)$$
      $$P(y \mid x, \beta) = \exp(x_0\beta_0 + x_1\beta_1) - P(y \mid x, \beta)\exp(x_0\beta_0 + x_1\beta_1)$$
      $$P(y \mid x, \beta) = \exp(x_0\beta_0 + x_1\beta_1)\left(1 - P(y \mid x, \beta)\right)$$
      $$\frac{P(y \mid x, \beta)}{1 - P(y \mid x, \beta)} = \exp(x_0\beta_0 + x_1\beta_1)$$
      This is the odds of y occurring.
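      A short added numerical check of this identity, plus the usual reading of a coefficient that follows from it (increasing x_1 by one multiplies the odds by exp(β_1)); the coefficient values below are made up:

```python
import numpy as np

beta = np.array([0.5, 1.4])              # made-up beta_0, beta_1

def prob(x, beta):
    a = x @ beta
    return np.exp(a) / (1 + np.exp(a))

def odds(x, beta):
    p = prob(x, beta)
    return p / (1 - p)

x = np.array([1.0, 2.0])
assert np.isclose(odds(x, beta), np.exp(x @ beta))            # odds = exp(x0*b0 + x1*b1)

x_plus = np.array([1.0, 3.0])                                 # x_1 increased by 1
print(odds(x_plus, beta) / odds(x, beta), np.exp(beta[1]))    # both ~4.06
```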

  33. Odds
      • Ratio of an event occurring to its not taking place: $\frac{P(x)}{1 - P(x)}$
      • Example: Green Bay Packers vs. SF 49ers, probability of GB winning = 0.75, so the odds for GB winning are $\frac{0.75}{0.25} = \frac{3}{1}$, i.e., 3:1.
