

  1. COMS 4721: Machine Learning for Data Science, Lecture 9, 2/16/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University

  2. LOGISTIC REGRESSION

  3. BINARY CLASSIFICATION
     Linear classifiers
     Given: data $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$.
     A linear classifier takes a vector $w \in \mathbb{R}^d$ and a scalar $w_0 \in \mathbb{R}$ and predicts
     $$y_i = f(x_i; w, w_0) = \operatorname{sign}(x_i^T w + w_0).$$
     We discussed two methods last time:
     ◮ Least squares: sensitive to outliers
     ◮ Perceptron: convergence issues, assumes linear separability
     Can we combine the separating hyperplane idea with probability to fix this?
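The sign rule above is straightforward to express in code. Below is a minimal NumPy sketch (not from the slides; the array names are illustrative) of the linear decision rule.

```python
import numpy as np

def predict_linear(X, w, w0):
    """Predict labels in {-1, +1} using the rule sign(x^T w + w0).

    X : (n, d) array of feature vectors, w : (d,) weights, w0 : scalar offset.
    """
    scores = X @ w + w0                   # x_i^T w + w0 for every row x_i
    return np.where(scores >= 0, 1, -1)   # ties at 0 are sent to +1 here

# Toy example with made-up numbers
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
w = np.array([1.0, 0.5])
print(predict_linear(X, w, w0=-1.0))      # -> [ 1 -1]
```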

  4. BAYES LINEAR CLASSIFICATION
     Linear discriminant analysis
     We saw an example of a linear classification rule using a Bayes classifier. For the model $y \sim \mathrm{Bern}(\pi)$ and $x \mid y \sim N(\mu_y, \Sigma)$, declare $y = 1$ given $x$ if
     $$\ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 0)\, p(y = 0)} > 0.$$
     In this case, the log odds equals
     $$\ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 0)\, p(y = 0)} = \underbrace{\ln \frac{\pi}{1 - \pi} - \frac{1}{2} (\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)}_{\text{a constant } w_0} \;+\; x^T \underbrace{\Sigma^{-1} (\mu_1 - \mu_0)}_{\text{a vector } w}$$
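For concreteness, here is a small sketch (an illustration, not code from the course) that turns the LDA quantities above into the hyperplane parameters $w_0$ and $w$; the inputs pi, mu0, mu1, and Sigma are assumed to have been estimated already.

```python
import numpy as np

def lda_hyperplane(pi, mu0, mu1, Sigma):
    """Convert Bayes/LDA parameters into (w0, w) for the rule sign(x^T w + w0).

    pi : prior probability p(y = 1);  mu0, mu1 : (d,) class means;
    Sigma : (d, d) shared covariance matrix.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)                            # "a vector w"
    w0 = np.log(pi / (1 - pi)) \
         - 0.5 * (mu1 + mu0) @ Sigma_inv @ (mu1 - mu0)     # "a constant w0"
    return w0, w

# Two 2-D Gaussians with identity covariance and equal priors
w0, w = lda_hyperplane(pi=0.5, mu0=np.zeros(2), mu1=np.ones(2), Sigma=np.eye(2))
print(w0, w)   # -> -1.0 [1. 1.]
```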

  5. LOG ODDS AND BAYES CLASSIFICATION
     Original formulation
     Recall that originally we wanted to declare $y = 1$ given $x$ if
     $$\ln \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} > 0.$$
     We didn't have a way to define $p(y \mid x)$ directly, so we used Bayes rule:
     ◮ Write $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$ and let the $p(x)$ terms cancel in the ratio
     ◮ Define $p(y)$ to be a Bernoulli distribution (coin-flip distribution)
     ◮ Define $p(x \mid y)$ however we want (e.g., a single Gaussian)
     Now we want to directly define $p(y \mid x)$. We'll use the log odds to do this.

  6. LOG ODDS AND HYPERPLANES
     Log odds and hyperplanes
     [Figure: a point $x$ and a hyperplane $H$ with normal vector $w$ and offset $-w_0/\|w\|_2$, drawn in the $(x_1, x_2)$ plane.]
     Classifying $x$ based on the log odds
     $$L = \ln \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)},$$
     we notice that
     1. $L \gg 0$: more confident that $y = +1$,
     2. $L \ll 0$: more confident that $y = -1$,
     3. $L = 0$: can go either way.
     The linear function $x^T w + w_0$ captures these three objectives:
     ◮ The distance of $x$ to the hyperplane $H$ defined by $(w, w_0)$ is $\frac{x^T w}{\|w\|_2} + \frac{w_0}{\|w\|_2}$.
     ◮ The sign of the function captures which side of $H$ the point $x$ is on.
     ◮ As $x$ moves away from / toward $H$, we become more / less confident.
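The first bullet is easy to check numerically. A minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed Euclidean distance from x to the hyperplane {z : z^T w + w0 = 0}.

    Positive on the side the normal vector w points toward, negative on the other.
    """
    return (x @ w + w0) / np.linalg.norm(w)

# A point sitting one unit above the line x2 = 0 (here w = [0, 2], w0 = 0)
print(signed_distance(np.array([3.0, 1.0]), np.array([0.0, 2.0]), 0.0))  # -> 1.0
```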

  7. LOG ODDS AND HYPERPLANES
     Logistic link function
     We can directly plug in the hyperplane representation for the log odds:
     $$\ln \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)} = x^T w + w_0.$$
     Question: What is different from the previous Bayes classifier?
     Answer: There, a formula determined $w$ and $w_0$ from the prior model and the data $x$. Now we put no restrictions on these values.
     Setting $p(y = -1 \mid x) = 1 - p(y = +1 \mid x)$ and solving for $p(y = +1 \mid x)$, we find
     $$p(y = +1 \mid x) = \frac{\exp\{x^T w + w_0\}}{1 + \exp\{x^T w + w_0\}} = \sigma(x^T w + w_0).$$
     ◮ $\sigma(\cdot)$ is called the sigmoid function.
     ◮ We have chosen $x^T w + w_0$ as the link function for the log odds.
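The algebra above is just the sigmoid function applied to $x^T w + w_0$. A short, numerically stable sketch (not from the slides):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = e^z / (1 + e^z), evaluated without overflowing for large |z|."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))               # safe when z is large and positive
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))  # safe when z is very negative
    return out

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # -> approx [0.0067 0.5 0.9933]
```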

  8. LOGISTIC SIGMOID FUNCTION
     [Figure: the sigmoid curve, rising from 0 to 1 as its argument ranges over roughly $-5$ to $5$.]
     ◮ Red line: the sigmoid function $\sigma(x^T w + w_0)$, which maps $x$ to $p(y = +1 \mid x)$.
     ◮ The function $\sigma(\cdot)$ captures our desire to be more confident as we move away from the separating hyperplane, defined by the $x$-axis.
     ◮ (Blue dashed line: not discussed.)

  9. LOGISTIC REGRESSION
     As with regression, absorb the offset: $w \leftarrow \begin{bmatrix} w_0 \\ w \end{bmatrix}$ and $x \leftarrow \begin{bmatrix} 1 \\ x \end{bmatrix}$.
     Definition
     Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a set of binary labeled data with $y \in \{-1, +1\}$. Logistic regression models each $y_i$ as independently generated, with
     $$P(y_i = +1 \mid x_i, w) = \sigma(x_i^T w), \qquad \sigma(x_i; w) = \frac{e^{x_i^T w}}{1 + e^{x_i^T w}}.$$
     Discriminative vs. generative classifiers
     ◮ This is a discriminative classifier because $x$ is not directly modeled.
     ◮ Bayes classifiers are known as generative because $x$ is modeled.
     Discriminative: $p(y \mid x)$. Generative: $p(x \mid y)\, p(y)$.
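Absorbing the offset amounts to appending a constant feature of 1 to every $x_i$. A one-function sketch (illustrative, not from the slides):

```python
import numpy as np

def absorb_offset(X):
    """Prepend a column of ones so that w0 becomes the first entry of w.

    X : (n, d) data matrix  ->  (n, d+1) matrix whose rows are [1, x_i].
    """
    return np.hstack([np.ones((X.shape[0], 1)), X])

print(absorb_offset(np.array([[2.0, 3.0], [4.0, 5.0]])))
# [[1. 2. 3.]
#  [1. 4. 5.]]
```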

  10. LOGISTIC REGRESSION LIKELIHOOD
      Data likelihood
      Define $\sigma_i(w) = \sigma(x_i^T w)$. The joint likelihood of $y_1, \ldots, y_n$ is
      $$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w) = \prod_{i=1}^n p(y_i \mid x_i, w) = \prod_{i=1}^n \sigma_i(w)^{\mathbb{1}(y_i = +1)} \big(1 - \sigma_i(w)\big)^{\mathbb{1}(y_i = -1)}$$
      ◮ Notice that each $x_i$ modifies the probability of a '+1' for its respective $y_i$.
      ◮ Predicting new data is the same as before:
        ◮ If $x^T w > 0$, then $\sigma(x^T w) > 1/2$ and we predict $y = +1$, and vice versa.
        ◮ We now also get a confidence in our prediction via the probability $\sigma(x^T w)$.
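The last two bullets translate into a few lines of code. A sketch (assuming the offset has already been absorbed into $w$; names are illustrative):

```python
import numpy as np

def predict_with_confidence(X, w):
    """Return labels in {-1, +1} plus the probability the model assigns to them.

    P(y = +1 | x, w) = sigma(x^T w), so the confidence in the predicted
    label is sigma(|x^T w|), whichever side of the hyperplane x falls on.
    """
    scores = X @ w
    labels = np.where(scores >= 0, 1, -1)
    confidence = 1.0 / (1.0 + np.exp(-np.abs(scores)))
    return labels, confidence

X = np.array([[1.0, 3.0], [1.0, -0.2]])      # first column is the absorbed offset
w = np.array([0.5, 1.0])
print(predict_with_confidence(X, w))         # labels [1 1], confidences approx [0.97 0.57]
```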

  11. LOGISTIC REGRESSION AND MAXIMUM LIKELIHOOD
      More notation changes
      Use the following fact to condense the notation:
      $$\underbrace{\frac{e^{y_i x_i^T w}}{1 + e^{y_i x_i^T w}}}_{\sigma_i(y_i \cdot w)} = \Bigg(\underbrace{\frac{e^{x_i^T w}}{1 + e^{x_i^T w}}}_{\sigma_i(w)}\Bigg)^{\mathbb{1}(y_i = +1)} \Bigg(\underbrace{1 - \frac{e^{x_i^T w}}{1 + e^{x_i^T w}}}_{1 - \sigma_i(w)}\Bigg)^{\mathbb{1}(y_i = -1)}$$
      Therefore the data likelihood can be written compactly as
      $$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w) = \prod_{i=1}^n \sigma_i(y_i \cdot w).$$
      We want to maximize this over $w$.
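Because the next step takes logarithms of $\sigma_i(y_i \cdot w)$, it is worth evaluating $\ln \sigma(y_i x_i^T w)$ in a numerically stable way. A sketch using the identity $\ln \sigma(z) = -\ln(1 + e^{-z})$ (names are illustrative):

```python
import numpy as np

def log_sigmoid_margins(X, y, w):
    """Compute ln sigma(y_i x_i^T w) for every data point, stably.

    ln sigma(z) = -ln(1 + e^{-z}) = -logaddexp(0, -z), which avoids
    forming e^{z} or e^{-z} directly when |z| is large.
    """
    margins = y * (X @ w)                 # y_i x_i^T w
    return -np.logaddexp(0.0, -margins)

X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.0, 1.0])
print(log_sigmoid_margins(X, y, w))       # -> [ln sigma(2), ln sigma(1)]
```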

  12. LOGISTIC REGRESSION AND MAXIMUM LIKELIHOOD
      Maximum likelihood
      The maximum likelihood solution for $w$ can be written
      $$w_{\mathrm{ML}} = \arg\max_w \sum_{i=1}^n \ln \sigma_i(y_i \cdot w) = \arg\max_w \mathcal{L}.$$
      As with the Perceptron, we can't directly set $\nabla_w \mathcal{L} = 0$, so we need an iterative algorithm. Since we want to maximize $\mathcal{L}$, at step $t$ we can update
      $$w^{(t+1)} = w^{(t)} + \eta \nabla_w \mathcal{L}, \qquad \nabla_w \mathcal{L} = \sum_{i=1}^n \big(1 - \sigma_i(y_i \cdot w)\big) y_i x_i.$$
      We will see that this results in an algorithm similar to the Perceptron.
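The gradient above vectorizes cleanly. A minimal sketch (not from the slides; names are illustrative):

```python
import numpy as np

def log_likelihood_grad(X, y, w):
    """Gradient of L(w) = sum_i ln sigma(y_i x_i^T w).

    Returns sum_i (1 - sigma(y_i x_i^T w)) y_i x_i as a (d,) vector.
    """
    margins = y * (X @ w)                  # y_i x_i^T w
    sig = 1.0 / (1.0 + np.exp(-margins))   # sigma_i(y_i . w)
    return X.T @ ((1.0 - sig) * y)         # weighted sum of the y_i x_i

X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
print(log_likelihood_grad(X, y, w=np.zeros(2)))   # at w = 0, every sigma_i = 1/2
```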

  13. LOGISTIC REGRESSION ALGORITHM (STEEPEST ASCENT)
      Input: training data $(x_1, y_1), \ldots, (x_n, y_n)$ and step size $\eta > 0$
      1. Set $w^{(1)} = \vec{0}$
      2. For iteration $t = 1, 2, \ldots$ do
         • Update $w^{(t+1)} = w^{(t)} + \eta \sum_{i=1}^n \big(1 - \sigma_i(y_i \cdot w^{(t)})\big) y_i x_i$
      Perceptron: search for a misclassified $(x_i, y_i)$, then update $w^{(t+1)} = w^{(t)} + \eta\, y_i x_i$.
      Logistic regression: something similar, except we sum over all of the data.
      ◮ Recall that $\sigma_i(y_i \cdot w)$ is the probability the model gives to the observed $y_i$.
      ◮ Therefore $1 - \sigma_i(y_i \cdot w)$ is the probability the model gives to the wrong value.
      ◮ The Perceptron is "all-or-nothing": each point is either correctly or incorrectly classified.
      ◮ Logistic regression has a probabilistic "fudge factor."
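A runnable sketch of the whole steepest-ascent loop on synthetic data (the step size, iteration count, and data are illustrative choices, not values from the course):

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.1, iterations=100):
    """Steepest ascent on the logistic regression log likelihood.

    X : (n, d) data with the offset already absorbed; y : (n,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])                       # step 1: w^(1) = 0
    for _ in range(iterations):                    # step 2: iterate
        sig = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigma_i(y_i . w^(t))
        w = w + eta * X.T @ ((1.0 - sig) * y)      # w^(t+1) = w^(t) + eta * gradient
    return w

# Tiny synthetic example: two well-separated 2-D clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
X = np.hstack([np.ones((40, 1)), X])               # absorb the offset
y = np.concatenate([np.ones(20), -np.ones(20)])
w = train_logistic_regression(X, y)
print(np.mean(np.sign(X @ w) == y))                # training accuracy (1.0 if the clusters separate)
```

Note that on separable data like this, running many more iterations keeps inflating $\|w\|_2$, which is exactly the over-fitting issue raised on the next slide.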

  14. BAYESIAN LOGISTIC REGRESSION
      Problem: If a hyperplane can separate all of the training data, then $\|w_{\mathrm{ML}}\|_2 \to \infty$. This drives $\sigma_i(y_i \cdot w) \to 1$ for each $(x_i, y_i)$. Even for nearly separable data it might get a few points very wrong in order to be more confident about the rest. This is a case of "over-fitting."
      [Figure: accompanying two-dimensional plot; only its axis tick values survived extraction.]
      A solution: Regularize $w$ with $\lambda w^T w$:
      $$w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^n \ln \sigma_i(y_i \cdot w) - \lambda w^T w$$
      We've seen how this corresponds to a Gaussian prior distribution on $w$.
      How about the posterior $p(w \mid x, y)$?
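The MAP objective only changes the gradient by a $-2\lambda w$ term, so the steepest-ascent sketch above needs a one-line modification. A sketch (the values of $\lambda$, $\eta$, and the iteration count are illustrative):

```python
import numpy as np

def train_logistic_regression_map(X, y, lam=1.0, eta=0.1, iterations=100):
    """Steepest ascent on sum_i ln sigma(y_i x_i^T w) - lam * w^T w."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        sig = 1.0 / (1.0 + np.exp(-y * (X @ w)))
        grad = X.T @ ((1.0 - sig) * y) - 2.0 * lam * w   # gradient of the penalized objective
        w = w + eta * grad
    return w
```

The $-2\lambda w$ term pulls $w$ back toward the origin, so $\|w\|_2$ stays bounded even when the training data are linearly separable.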

  15. LAPLACE APPROXIMATION

  16. BAYESIAN LOGISTIC REGRESSION
      Posterior calculation
      Define the prior distribution on $w$ to be $w \sim N(0, \lambda^{-1} I)$. The posterior is
      $$p(w \mid x, y) = \frac{p(w) \prod_{i=1}^n \sigma_i(y_i \cdot w)}{\int p(w) \prod_{i=1}^n \sigma_i(y_i \cdot w)\, dw}$$
      This is not a "standard" distribution and we can't calculate the denominator. Therefore we can't actually say what $p(w \mid x, y)$ is.
      Can we approximate $p(w \mid x, y)$?

  17. LAPLACE APPROXIMATION
      One strategy
      Pick a distribution to approximate $p(w \mid x, y)$. We will say
      $$p(w \mid x, y) \approx \mathrm{Normal}(\mu, \Sigma).$$
      Now we need a method for setting $\mu$ and $\Sigma$.
      Laplace approximations
      Using a condensed notation, notice from Bayes rule that
      $$p(w \mid x, y) = \frac{e^{\ln p(y, w \mid x)}}{\int e^{\ln p(y, w \mid x)}\, dw}.$$
      We will approximate $\ln p(y, w \mid x)$ in the numerator and denominator.

  18. LAPLACE APPROXIMATION
      Let's define $f(w) = \ln p(y, w \mid x)$.
      Taylor expansions
      We can approximate $f(w)$ with a second order Taylor expansion. Recall that $w \in \mathbb{R}^{d+1}$. For any point $z \in \mathbb{R}^{d+1}$,
      $$f(w) \approx f(z) + (w - z)^T \nabla f(z) + \frac{1}{2} (w - z)^T \big(\nabla^2 f(z)\big) (w - z).$$
      The notation $\nabla f(z)$ is short for $\nabla_w f(w) \big|_z$, and similarly for the matrix of second derivatives. We just need to pick $z$. The Laplace approximation defines $z = w_{\mathrm{MAP}}$.

  19. LAPLACE APPROXIMATION (SOLVING)
      Recall $f(w) = \ln p(y, w \mid x)$ and $z = w_{\mathrm{MAP}}$. From Bayes rule and the Laplace approximation we now have
      $$p(w \mid x, y) = \frac{e^{f(w)}}{\int e^{f(w)}\, dw} \approx \frac{e^{f(z) + (w - z)^T \nabla f(z) + \frac{1}{2} (w - z)^T (\nabla^2 f(z)) (w - z)}}{\int e^{f(z) + (w - z)^T \nabla f(z) + \frac{1}{2} (w - z)^T (\nabla^2 f(z)) (w - z)}\, dw}$$
      This can be simplified in two ways:
      1. The term $e^{f(w_{\mathrm{MAP}})}$ in the numerator and denominator can be viewed as a multiplicative constant since it doesn't vary in $w$. The two therefore cancel.
      2. By definition of how we find $w_{\mathrm{MAP}}$, the vector $\nabla_w \ln p(y, w \mid x) \big|_{w_{\mathrm{MAP}}} = 0$.
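Carrying the two simplifications through leaves a Gaussian: $p(w \mid x, y) \approx N(w_{\mathrm{MAP}}, \Sigma)$ with $\Sigma = \big(-\nabla^2 f(w_{\mathrm{MAP}})\big)^{-1}$. For this model the Hessian has a closed form, so the covariance can be computed directly; a sketch (assuming $w_{\mathrm{MAP}}$ has already been found, e.g. with the regularized ascent sketched earlier):

```python
import numpy as np

def laplace_covariance(X, y, w_map, lam):
    """Covariance of the Laplace approximation N(w_MAP, Sigma) for Bayesian
    logistic regression with prior w ~ N(0, (1/lam) I).

    For f(w) = sum_i ln sigma(y_i x_i^T w) + ln p(w), the Hessian is
        grad^2 f(w) = -sum_i s_i (1 - s_i) x_i x_i^T - lam * I,
    with s_i = sigma(y_i x_i^T w), and Sigma = (-grad^2 f(w_MAP))^{-1}.
    """
    s = 1.0 / (1.0 + np.exp(-y * (X @ w_map)))
    curvature = s * (1.0 - s)                                    # per-point curvature weights
    neg_hessian = (X.T * curvature) @ X + lam * np.eye(X.shape[1])
    return np.linalg.inv(neg_hessian)
```

Pairing this $\Sigma$ with the mean $w_{\mathrm{MAP}}$ gives the approximating Gaussian.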
