

  1. COMS 4721: Machine Learning for Data Science, Lecture 9, 2/16/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University

  2. LOGISTIC REGRESSION

  3. BINARY CLASSIFICATION
     Linear classifiers
     Given: data $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$.
     A linear classifier takes a vector $w \in \mathbb{R}^d$ and a scalar $w_0 \in \mathbb{R}$ and predicts
     $$y_i = f(x_i; w, w_0) = \operatorname{sign}(x_i^T w + w_0).$$
     We discussed two methods last time:
     ◮ Least squares: sensitive to outliers
     ◮ Perceptron: convergence issues, assumes linear separability
     Can we combine the separating hyperplane idea with probability to fix this?
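The sign rule above is straightforward to express in code. Below is a minimal NumPy sketch (not from the slides; the array names are illustrative) of the linear decision rule.

```python
import numpy as np

def predict_linear(X, w, w0):
    """Predict labels in {-1, +1} using the rule sign(x^T w + w0).

    X : (n, d) array of feature vectors, w : (d,) weights, w0 : scalar offset.
    """
    scores = X @ w + w0                   # x_i^T w + w0 for every row x_i
    return np.where(scores >= 0, 1, -1)   # ties at 0 are sent to +1 here

# Toy example with made-up numbers
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
w = np.array([1.0, 0.5])
print(predict_linear(X, w, w0=-1.0))      # -> [ 1 -1]
```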

  4. BAYES LINEAR CLASSIFICATION
     Linear discriminant analysis
     We saw an example of a linear classification rule using a Bayes classifier. For the model $y \sim \mathrm{Bern}(\pi)$ and $x \mid y \sim N(\mu_y, \Sigma)$, declare $y = 1$ given $x$ if
     $$\ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 0)\, p(y = 0)} > 0.$$
     In this case, the log odds equals
     $$\ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 0)\, p(y = 0)} = \underbrace{\ln \frac{\pi}{1 - \pi} - \frac{1}{2} (\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)}_{\text{a constant } w_0} \;+\; x^T \underbrace{\Sigma^{-1} (\mu_1 - \mu_0)}_{\text{a vector } w}$$
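For concreteness, here is a small sketch (an illustration, not code from the course) that turns the LDA quantities above into the hyperplane parameters $w_0$ and $w$; the inputs pi, mu0, mu1, and Sigma are assumed to have been estimated already.

```python
import numpy as np

def lda_hyperplane(pi, mu0, mu1, Sigma):
    """Convert Bayes/LDA parameters into (w0, w) for the rule sign(x^T w + w0).

    pi : prior probability p(y = 1);  mu0, mu1 : (d,) class means;
    Sigma : (d, d) shared covariance matrix.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)                            # "a vector w"
    w0 = np.log(pi / (1 - pi)) \
         - 0.5 * (mu1 + mu0) @ Sigma_inv @ (mu1 - mu0)     # "a constant w0"
    return w0, w

# Two 2-D Gaussians with identity covariance and equal priors
w0, w = lda_hyperplane(pi=0.5, mu0=np.zeros(2), mu1=np.ones(2), Sigma=np.eye(2))
print(w0, w)   # -> -1.0 [1. 1.]
```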

  5. LOG ODDS AND BAYES CLASSIFICATION
     Original formulation
     Recall that originally we wanted to declare $y = 1$ given $x$ if
     $$\ln \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} > 0.$$
     We didn't have a way to define $p(y \mid x)$ directly, so we used Bayes rule:
     ◮ Write $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$ and let the $p(x)$ terms cancel in the ratio
     ◮ Define $p(y)$ to be a Bernoulli distribution (coin-flip distribution)
     ◮ Define $p(x \mid y)$ however we want (e.g., a single Gaussian)
     Now we want to directly define $p(y \mid x)$. We'll use the log odds to do this.

  6. LOG ODDS AND HYPERPLANES
     Log odds and hyperplanes
     [Figure: a point $x$ and a hyperplane $H$ with normal vector $w$ and offset $-w_0/\|w\|_2$, drawn in the $(x_1, x_2)$ plane.]
     Classifying $x$ based on the log odds
     $$L = \ln \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)},$$
     we notice that
     1. $L \gg 0$: more confident that $y = +1$,
     2. $L \ll 0$: more confident that $y = -1$,
     3. $L = 0$: can go either way.
     The linear function $x^T w + w_0$ captures these three objectives:
     ◮ The distance of $x$ to the hyperplane $H$ defined by $(w, w_0)$ is $\frac{x^T w}{\|w\|_2} + \frac{w_0}{\|w\|_2}$.
     ◮ The sign of the function captures which side of $H$ the point $x$ is on.
     ◮ As $x$ moves away from / toward $H$, we become more / less confident.
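The first bullet is easy to check numerically. A minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed Euclidean distance from x to the hyperplane {z : z^T w + w0 = 0}.

    Positive on the side the normal vector w points toward, negative on the other.
    """
    return (x @ w + w0) / np.linalg.norm(w)

# A point sitting one unit above the line x2 = 0 (here w = [0, 2], w0 = 0)
print(signed_distance(np.array([3.0, 1.0]), np.array([0.0, 2.0]), 0.0))  # -> 1.0
```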

  7. LOG ODDS AND HYPERPLANES
     Logistic link function
     We can directly plug in the hyperplane representation for the log odds:
     $$\ln \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)} = x^T w + w_0.$$
     Question: What is different from the previous Bayes classifier?
     Answer: There, a formula determined $w$ and $w_0$ from the prior model and the data $x$. Now we put no restrictions on these values.
     Setting $p(y = -1 \mid x) = 1 - p(y = +1 \mid x)$ and solving for $p(y = +1 \mid x)$, we find
     $$p(y = +1 \mid x) = \frac{\exp\{x^T w + w_0\}}{1 + \exp\{x^T w + w_0\}} = \sigma(x^T w + w_0).$$
     ◮ $\sigma(\cdot)$ is called the sigmoid function.
     ◮ We have chosen $x^T w + w_0$ as the link function for the log odds.
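The algebra above is just the sigmoid function applied to $x^T w + w_0$. A short, numerically stable sketch (not from the slides):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = e^z / (1 + e^z), evaluated without overflowing for large |z|."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))               # safe when z is large and positive
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))  # safe when z is very negative
    return out

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # -> approx [0.0067 0.5 0.9933]
```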

  8. LOGISTIC SIGMOID FUNCTION
     [Figure: the sigmoid curve, rising from 0 to 1 as its argument ranges over roughly $-5$ to $5$.]
     ◮ Red line: the sigmoid function $\sigma(x^T w + w_0)$, which maps $x$ to $p(y = +1 \mid x)$.
     ◮ The function $\sigma(\cdot)$ captures our desire to be more confident as we move away from the separating hyperplane, defined by the $x$-axis.
     ◮ (Blue dashed line: not discussed.)

  9. LOGISTIC REGRESSION
     As with regression, absorb the offset: $w \leftarrow \begin{bmatrix} w_0 \\ w \end{bmatrix}$ and $x \leftarrow \begin{bmatrix} 1 \\ x \end{bmatrix}$.
     Definition
     Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a set of binary labeled data with $y \in \{-1, +1\}$. Logistic regression models each $y_i$ as independently generated, with
     $$P(y_i = +1 \mid x_i, w) = \sigma(x_i^T w), \qquad \sigma(x_i; w) = \frac{e^{x_i^T w}}{1 + e^{x_i^T w}}.$$
     Discriminative vs. generative classifiers
     ◮ This is a discriminative classifier because $x$ is not directly modeled.
     ◮ Bayes classifiers are known as generative because $x$ is modeled.
     Discriminative: $p(y \mid x)$. Generative: $p(x \mid y)\, p(y)$.
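Absorbing the offset amounts to appending a constant feature of 1 to every $x_i$. A one-function sketch (illustrative, not from the slides):

```python
import numpy as np

def absorb_offset(X):
    """Prepend a column of ones so that w0 becomes the first entry of w.

    X : (n, d) data matrix  ->  (n, d+1) matrix whose rows are [1, x_i].
    """
    return np.hstack([np.ones((X.shape[0], 1)), X])

print(absorb_offset(np.array([[2.0, 3.0], [4.0, 5.0]])))
# [[1. 2. 3.]
#  [1. 4. 5.]]
```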

  10. LOGISTIC REGRESSION LIKELIHOOD
      Data likelihood
      Define $\sigma_i(w) = \sigma(x_i^T w)$. The joint likelihood of $y_1, \ldots, y_n$ is
      $$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w) = \prod_{i=1}^n p(y_i \mid x_i, w) = \prod_{i=1}^n \sigma_i(w)^{\mathbb{1}(y_i = +1)} \big(1 - \sigma_i(w)\big)^{\mathbb{1}(y_i = -1)}$$
      ◮ Notice that each $x_i$ modifies the probability of a '+1' for its respective $y_i$.
      ◮ Predicting new data is the same as before:
        ◮ If $x^T w > 0$, then $\sigma(x^T w) > 1/2$ and we predict $y = +1$, and vice versa.
        ◮ We now also get a confidence in our prediction via the probability $\sigma(x^T w)$.
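The last two bullets translate into a few lines of code. A sketch (assuming the offset has already been absorbed into $w$; names are illustrative):

```python
import numpy as np

def predict_with_confidence(X, w):
    """Return labels in {-1, +1} plus the probability the model assigns to them.

    P(y = +1 | x, w) = sigma(x^T w), so the confidence in the predicted
    label is sigma(|x^T w|), whichever side of the hyperplane x falls on.
    """
    scores = X @ w
    labels = np.where(scores >= 0, 1, -1)
    confidence = 1.0 / (1.0 + np.exp(-np.abs(scores)))
    return labels, confidence

X = np.array([[1.0, 3.0], [1.0, -0.2]])      # first column is the absorbed offset
w = np.array([0.5, 1.0])
print(predict_with_confidence(X, w))         # labels [1 1], confidences approx [0.97 0.57]
```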

  11. LOGISTIC REGRESSION AND MAXIMUM LIKELIHOOD
      More notation changes
      Use the following fact to condense the notation:
      $$\underbrace{\frac{e^{y_i x_i^T w}}{1 + e^{y_i x_i^T w}}}_{\sigma_i(y_i \cdot w)} = \Bigg(\underbrace{\frac{e^{x_i^T w}}{1 + e^{x_i^T w}}}_{\sigma_i(w)}\Bigg)^{\mathbb{1}(y_i = +1)} \Bigg(\underbrace{1 - \frac{e^{x_i^T w}}{1 + e^{x_i^T w}}}_{1 - \sigma_i(w)}\Bigg)^{\mathbb{1}(y_i = -1)}$$
      Therefore the data likelihood can be written compactly as
      $$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w) = \prod_{i=1}^n \sigma_i(y_i \cdot w).$$
      We want to maximize this over $w$.
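Because the next step takes logarithms of $\sigma_i(y_i \cdot w)$, it is worth evaluating $\ln \sigma(y_i x_i^T w)$ in a numerically stable way. A sketch using the identity $\ln \sigma(z) = -\ln(1 + e^{-z})$ (names are illustrative):

```python
import numpy as np

def log_sigmoid_margins(X, y, w):
    """Compute ln sigma(y_i x_i^T w) for every data point, stably.

    ln sigma(z) = -ln(1 + e^{-z}) = -logaddexp(0, -z), which avoids
    forming e^{z} or e^{-z} directly when |z| is large.
    """
    margins = y * (X @ w)                 # y_i x_i^T w
    return -np.logaddexp(0.0, -margins)

X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.0, 1.0])
print(log_sigmoid_margins(X, y, w))       # -> [ln sigma(2), ln sigma(1)]
```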

  12. LOGISTIC REGRESSION AND MAXIMUM LIKELIHOOD
      Maximum likelihood
      The maximum likelihood solution for $w$ can be written
      $$w_{\mathrm{ML}} = \arg\max_w \sum_{i=1}^n \ln \sigma_i(y_i \cdot w) = \arg\max_w \mathcal{L}.$$
      As with the Perceptron, we can't directly set $\nabla_w \mathcal{L} = 0$, so we need an iterative algorithm. Since we want to maximize $\mathcal{L}$, at step $t$ we can update
      $$w^{(t+1)} = w^{(t)} + \eta \nabla_w \mathcal{L}, \qquad \nabla_w \mathcal{L} = \sum_{i=1}^n \big(1 - \sigma_i(y_i \cdot w)\big) y_i x_i.$$
      We will see that this results in an algorithm similar to the Perceptron.
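The gradient above vectorizes cleanly. A minimal sketch (not from the slides; names are illustrative):

```python
import numpy as np

def log_likelihood_grad(X, y, w):
    """Gradient of L(w) = sum_i ln sigma(y_i x_i^T w).

    Returns sum_i (1 - sigma(y_i x_i^T w)) y_i x_i as a (d,) vector.
    """
    margins = y * (X @ w)                  # y_i x_i^T w
    sig = 1.0 / (1.0 + np.exp(-margins))   # sigma_i(y_i . w)
    return X.T @ ((1.0 - sig) * y)         # weighted sum of the y_i x_i

X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
print(log_likelihood_grad(X, y, w=np.zeros(2)))   # at w = 0, every sigma_i = 1/2
```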

  13. LOGISTIC REGRESSION ALGORITHM (STEEPEST ASCENT)
      Input: training data $(x_1, y_1), \ldots, (x_n, y_n)$ and step size $\eta > 0$
      1. Set $w^{(1)} = \vec{0}$
      2. For iteration $t = 1, 2, \ldots$ do
         • Update $w^{(t+1)} = w^{(t)} + \eta \sum_{i=1}^n \big(1 - \sigma_i(y_i \cdot w^{(t)})\big) y_i x_i$
      Perceptron: search for a misclassified $(x_i, y_i)$, then update $w^{(t+1)} = w^{(t)} + \eta\, y_i x_i$.
      Logistic regression: something similar, except we sum over all of the data.
      ◮ Recall that $\sigma_i(y_i \cdot w)$ is the probability the model gives to the observed $y_i$.
      ◮ Therefore $1 - \sigma_i(y_i \cdot w)$ is the probability the model gives to the wrong value.
      ◮ The Perceptron is "all-or-nothing": each point is either correctly or incorrectly classified.
      ◮ Logistic regression has a probabilistic "fudge factor."
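A runnable sketch of the whole steepest-ascent loop on synthetic data (the step size, iteration count, and data are illustrative choices, not values from the course):

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.1, iterations=100):
    """Steepest ascent on the logistic regression log likelihood.

    X : (n, d) data with the offset already absorbed; y : (n,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])                       # step 1: w^(1) = 0
    for _ in range(iterations):                    # step 2: iterate
        sig = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigma_i(y_i . w^(t))
        w = w + eta * X.T @ ((1.0 - sig) * y)      # w^(t+1) = w^(t) + eta * gradient
    return w

# Tiny synthetic example: two well-separated 2-D clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
X = np.hstack([np.ones((40, 1)), X])               # absorb the offset
y = np.concatenate([np.ones(20), -np.ones(20)])
w = train_logistic_regression(X, y)
print(np.mean(np.sign(X @ w) == y))                # training accuracy (1.0 if the clusters separate)
```

Note that on separable data like this, running many more iterations keeps inflating $\|w\|_2$, which is exactly the over-fitting issue raised on the next slide.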

  14. BAYESIAN LOGISTIC REGRESSION
      Problem: If a hyperplane can separate all of the training data, then $\|w_{\mathrm{ML}}\|_2 \to \infty$. This drives $\sigma_i(y_i \cdot w) \to 1$ for each $(x_i, y_i)$. Even for nearly separable data it might get a few points very wrong in order to be more confident about the rest. This is a case of "over-fitting."
      [Figure: accompanying two-dimensional plot; only its axis tick values survived extraction.]
      A solution: Regularize $w$ with $\lambda w^T w$:
      $$w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^n \ln \sigma_i(y_i \cdot w) - \lambda w^T w$$
      We've seen how this corresponds to a Gaussian prior distribution on $w$.
      How about the posterior $p(w \mid x, y)$?
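The MAP objective only changes the gradient by a $-2\lambda w$ term, so the steepest-ascent sketch above needs a one-line modification. A sketch (the values of $\lambda$, $\eta$, and the iteration count are illustrative):

```python
import numpy as np

def train_logistic_regression_map(X, y, lam=1.0, eta=0.1, iterations=100):
    """Steepest ascent on sum_i ln sigma(y_i x_i^T w) - lam * w^T w."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        sig = 1.0 / (1.0 + np.exp(-y * (X @ w)))
        grad = X.T @ ((1.0 - sig) * y) - 2.0 * lam * w   # gradient of the penalized objective
        w = w + eta * grad
    return w
```

The $-2\lambda w$ term pulls $w$ back toward the origin, so $\|w\|_2$ stays bounded even when the training data are linearly separable.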

  15. LAPLACE APPROXIMATION

  16. BAYESIAN LOGISTIC REGRESSION
      Posterior calculation
      Define the prior distribution on $w$ to be $w \sim N(0, \lambda^{-1} I)$. The posterior is
      $$p(w \mid x, y) = \frac{p(w) \prod_{i=1}^n \sigma_i(y_i \cdot w)}{\int p(w) \prod_{i=1}^n \sigma_i(y_i \cdot w)\, dw}$$
      This is not a "standard" distribution and we can't calculate the denominator. Therefore we can't actually say what $p(w \mid x, y)$ is.
      Can we approximate $p(w \mid x, y)$?

  17. LAPLACE APPROXIMATION
      One strategy
      Pick a distribution to approximate $p(w \mid x, y)$. We will say
      $$p(w \mid x, y) \approx \mathrm{Normal}(\mu, \Sigma).$$
      Now we need a method for setting $\mu$ and $\Sigma$.
      Laplace approximations
      Using a condensed notation, notice from Bayes rule that
      $$p(w \mid x, y) = \frac{e^{\ln p(y, w \mid x)}}{\int e^{\ln p(y, w \mid x)}\, dw}.$$
      We will approximate $\ln p(y, w \mid x)$ in the numerator and denominator.

  18. LAPLACE APPROXIMATION
      Let's define $f(w) = \ln p(y, w \mid x)$.
      Taylor expansions
      We can approximate $f(w)$ with a second order Taylor expansion. Recall that $w \in \mathbb{R}^{d+1}$. For any point $z \in \mathbb{R}^{d+1}$,
      $$f(w) \approx f(z) + (w - z)^T \nabla f(z) + \frac{1}{2} (w - z)^T \big(\nabla^2 f(z)\big) (w - z).$$
      The notation $\nabla f(z)$ is short for $\nabla_w f(w) \big|_z$, and similarly for the matrix of second derivatives. We just need to pick $z$. The Laplace approximation defines $z = w_{\mathrm{MAP}}$.

  19. LAPLACE APPROXIMATION (SOLVING)
      Recall $f(w) = \ln p(y, w \mid x)$ and $z = w_{\mathrm{MAP}}$. From Bayes rule and the Laplace approximation we now have
      $$p(w \mid x, y) = \frac{e^{f(w)}}{\int e^{f(w)}\, dw} \approx \frac{e^{f(z) + (w - z)^T \nabla f(z) + \frac{1}{2} (w - z)^T (\nabla^2 f(z)) (w - z)}}{\int e^{f(z) + (w - z)^T \nabla f(z) + \frac{1}{2} (w - z)^T (\nabla^2 f(z)) (w - z)}\, dw}$$
      This can be simplified in two ways:
      1. The term $e^{f(w_{\mathrm{MAP}})}$ in the numerator and denominator can be viewed as a multiplicative constant since it doesn't vary in $w$. The two therefore cancel.
      2. By definition of how we find $w_{\mathrm{MAP}}$, the vector $\nabla_w \ln p(y, w \mid x) \big|_{w_{\mathrm{MAP}}} = 0$.
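Carrying the two simplifications through leaves a Gaussian: $p(w \mid x, y) \approx N(w_{\mathrm{MAP}}, \Sigma)$ with $\Sigma = \big(-\nabla^2 f(w_{\mathrm{MAP}})\big)^{-1}$. For this model the Hessian has a closed form, so the covariance can be computed directly; a sketch (assuming $w_{\mathrm{MAP}}$ has already been found, e.g. with the regularized ascent sketched earlier):

```python
import numpy as np

def laplace_covariance(X, y, w_map, lam):
    """Covariance of the Laplace approximation N(w_MAP, Sigma) for Bayesian
    logistic regression with prior w ~ N(0, (1/lam) I).

    For f(w) = sum_i ln sigma(y_i x_i^T w) + ln p(w), the Hessian is
        grad^2 f(w) = -sum_i s_i (1 - s_i) x_i x_i^T - lam * I,
    with s_i = sigma(y_i x_i^T w), and Sigma = (-grad^2 f(w_MAP))^{-1}.
    """
    s = 1.0 / (1.0 + np.exp(-y * (X @ w_map)))
    curvature = s * (1.0 - s)                                    # per-point curvature weights
    neg_hessian = (X.T * curvature) @ X + lam * np.eye(X.shape[1])
    return np.linalg.inv(neg_hessian)
```

Pairing this $\Sigma$ with the mean $w_{\mathrm{MAP}}$ gives the approximating Gaussian.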
