COMS 4721: Machine Learning for Data Science
Lecture 3, 1/24/2017
Prof. John Paisley
Department of Electrical Engineering & Data Science Institute
Columbia University
REGRESSION: PROBLEM DEFINITION

Data: Measured pairs $(x, y)$, where $x \in \mathbb{R}^{d+1}$ (input) and $y \in \mathbb{R}$ (output).

Goal: Find a function $f : \mathbb{R}^{d+1} \to \mathbb{R}$ such that $y \approx f(x; w)$ for the data pair $(x, y)$. Here $f(x; w)$ is the regression function and the vector $w$ contains its parameters.

Definition of linear regression: A regression method is called linear if the prediction $f$ is a linear function of the unknown parameters $w$.
LEAST SQUARES (CONTINUED)
LEAST SQUARES LINEAR REGRESSION

Least squares solution
Least squares finds the $w$ that minimizes the sum of squared errors. The least squares objective in its most basic form, where $f(x; w) = x^T w$, is
$$\mathcal{L} = \sum_{i=1}^{n} (y_i - x_i^T w)^2 = \|y - Xw\|^2 = (y - Xw)^T (y - Xw).$$
We defined $y = [y_1, \dots, y_n]^T$ and $X = [x_1, \dots, x_n]^T$.

Taking the gradient with respect to $w$ and setting it to zero, we find that
$$\nabla_w \mathcal{L} = 2X^T X w - 2X^T y = 0 \;\Rightarrow\; w_{LS} = (X^T X)^{-1} X^T y.$$
In other words, $w_{LS}$ is the vector that minimizes $\mathcal{L}$.
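As an illustration (not part of the original slides), here is a minimal NumPy sketch of this closed-form solution; the simulated data, noise level, and variable names are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumed for illustration): n points, d features plus an intercept.
n, d = 100, 3
w_true = rng.normal(size=d + 1)                              # ground-truth parameters
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])    # column of 1's absorbs the intercept
y = X @ w_true + rng.normal(scale=0.5, size=n)               # y = Xw + Gaussian noise

# w_LS = (X^T X)^{-1} X^T y, computed by solving the normal equations
# rather than forming the matrix inverse explicitly.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ls)        # should be close to w_true
```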
PROBABILISTIC VIEW

◮ Last class, we discussed the geometric interpretation of least squares.
◮ Least squares also has an insightful probabilistic interpretation that allows us to analyze its properties.
◮ That is, given that we pick this model as reasonable for our problem, we can ask: What kinds of assumptions are we making?
PROBABILISTIC VIEW

Recall: Gaussian density in $n$ dimensions
Assume a diagonal covariance matrix $\Sigma = \sigma^2 I$. The density is
$$p(y \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left( -\frac{1}{2\sigma^2} (y - \mu)^T (y - \mu) \right).$$
What if we restrict the mean to $\mu = Xw$ and find the maximum likelihood solution for $w$?
PROBABILISTIC VIEW

Maximum likelihood for Gaussian linear regression
Plug $\mu = Xw$ into the multivariate Gaussian distribution and solve for $w$ using maximum likelihood:
$$w_{ML} = \arg\max_w \ln p(y \mid \mu = Xw, \sigma^2) = \arg\max_w -\frac{1}{2\sigma^2}\|y - Xw\|^2 - \frac{n}{2}\ln(2\pi\sigma^2).$$
Least squares (LS) and maximum likelihood (ML) share the same solution:
$$\text{LS: } \arg\min_w \|y - Xw\|^2 \;\Leftrightarrow\; \text{ML: } \arg\max_w -\frac{1}{2\sigma^2}\|y - Xw\|^2.$$
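As a quick numerical check of this equivalence (a sketch added here, not from the slides), one can maximize the Gaussian log-likelihood with a generic optimizer and compare against the closed-form least squares solution; the data, noise level, and tolerance are assumptions of the example:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d, sigma = 50, 2, 0.3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=n)

# Negative Gaussian log-likelihood as a function of w (sigma^2 held fixed).
def neg_log_lik(w):
    r = y - X @ w
    return r @ r / (2 * sigma**2) + (n / 2) * np.log(2 * np.pi * sigma**2)

w_ml = minimize(neg_log_lik, x0=np.zeros(d + 1)).x    # generic numerical maximizer
w_ls = np.linalg.solve(X.T @ X, X.T @ y)              # closed-form least squares
print(np.allclose(w_ml, w_ls, atol=1e-3))             # True: the two solutions coincide
```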
PROBABILISTIC VIEW

◮ Therefore, in a sense we are making an independent Gaussian noise assumption about the error, $\epsilon_i = y_i - x_i^T w$.
◮ Other ways of saying this:
  1) $y_i = x_i^T w + \epsilon_i$, with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, for $i = 1, \dots, n$,
  2) $y_i \overset{ind}{\sim} N(x_i^T w, \sigma^2)$, for $i = 1, \dots, n$,
  3) $y \sim N(Xw, \sigma^2 I)$, as on the previous slides.
◮ Can we use this probabilistic line of analysis to better understand the maximum likelihood (i.e., least squares) solution?
PROBABILISTIC VIEW

Expected solution
Given: The modeling assumption that $y \sim N(Xw, \sigma^2 I)$. We can calculate the expectation of the ML solution under this distribution,
$$\begin{aligned}
\mathbb{E}[w_{ML}] &= \mathbb{E}[(X^T X)^{-1} X^T y] = \int (X^T X)^{-1} X^T y \; p(y \mid X, w) \, dy \\
&= (X^T X)^{-1} X^T \mathbb{E}[y] = (X^T X)^{-1} X^T X w = w.
\end{aligned}$$
Therefore $w_{ML}$ is an unbiased estimate of $w$, i.e., $\mathbb{E}[w_{ML}] = w$.
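A Monte Carlo sanity check of this unbiasedness claim might look as follows (a sketch; the fixed design matrix, noise level, and number of replications are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 30, 2, 1.0
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])    # fixed design matrix
w = np.array([0.5, -1.0, 2.0])                               # true parameters

# Draw many datasets y ~ N(Xw, sigma^2 I) and average the ML estimates.
estimates = []
for _ in range(20000):
    y = X @ w + rng.normal(scale=sigma, size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))   # approximately equal to w, since E[w_ML] = w
```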
REVIEW: AN EQUALITY FROM PROBABILITY

◮ Even though the "expected" maximum likelihood solution is the correct one, should we actually expect to get something near it?
◮ We should also look at the covariance. Recall that if $y \sim N(\mu, \Sigma)$, then
$$\text{Var}[y] = \mathbb{E}[(y - \mathbb{E}[y])(y - \mathbb{E}[y])^T] = \Sigma.$$
◮ Plugging in $\mathbb{E}[y] = \mu$, this is equivalently written as
$$\text{Var}[y] = \mathbb{E}[(y - \mu)(y - \mu)^T] = \mathbb{E}[yy^T - y\mu^T - \mu y^T + \mu\mu^T] = \mathbb{E}[yy^T] - \mu\mu^T.$$
◮ Immediately we also get $\mathbb{E}[yy^T] = \Sigma + \mu\mu^T$.
PROBABILISTIC VIEW

Variance of the solution
Returning to least squares linear regression, we wish to find
$$\text{Var}[w_{ML}] = \mathbb{E}[(w_{ML} - \mathbb{E}[w_{ML}])(w_{ML} - \mathbb{E}[w_{ML}])^T] = \mathbb{E}[w_{ML} w_{ML}^T] - \mathbb{E}[w_{ML}]\,\mathbb{E}[w_{ML}]^T.$$
The sequence of equalities follows:
$$\begin{aligned}
\text{Var}[w_{ML}] &= \mathbb{E}[(X^T X)^{-1} X^T y y^T X (X^T X)^{-1}] - ww^T \\
&= (X^T X)^{-1} X^T \mathbb{E}[yy^T] X (X^T X)^{-1} - ww^T \\
&= (X^T X)^{-1} X^T (\sigma^2 I + Xww^T X^T) X (X^T X)^{-1} - ww^T \\
&= (X^T X)^{-1} X^T \sigma^2 I X (X^T X)^{-1} + (X^T X)^{-1} X^T X ww^T X^T X (X^T X)^{-1} - ww^T \\
&= \sigma^2 (X^T X)^{-1}.
\end{aligned}$$

Aside: For matrices $A$, $B$ and vector $c$, recall that $(ABc)^T = c^T B^T A^T$.
PROBABILISTIC VIEW

◮ We've shown that, under the Gaussian assumption $y \sim N(Xw, \sigma^2 I)$,
$$\mathbb{E}[w_{ML}] = w, \qquad \text{Var}[w_{ML}] = \sigma^2 (X^T X)^{-1}.$$
◮ When there are very large values in $\sigma^2 (X^T X)^{-1}$, the values of $w_{ML}$ are very sensitive to the measured data $y$ (more analysis later).
◮ This is bad if we want to analyze and predict using $w_{ML}$.
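The variance formula and the sensitivity it implies can also be checked empirically. In this sketch (an illustration with assumed values, not from the slides), two nearly collinear columns make $(X^T X)^{-1}$ contain large entries, and the Monte Carlo covariance of $w_{ML}$ matches $\sigma^2 (X^T X)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 50, 1.0

# Design with two nearly collinear columns -> (X^T X)^{-1} has large entries.
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
w = np.array([1.0, 2.0, -1.0])

theory = sigma**2 * np.linalg.inv(X.T @ X)      # Var[w_ML] = sigma^2 (X^T X)^{-1}

samples = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ w + rng.normal(scale=sigma, size=n)))
    for _ in range(20000)
])
empirical = np.cov(samples, rowvar=False)       # rows are samples, columns are variables

print(np.round(theory, 1))
print(np.round(empirical, 1))                   # close to the theoretical covariance
```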
RIDGE REGRESSION
REGULARIZED LEAST SQUARES

◮ We saw how, with least squares, the values in $w_{ML}$ may be huge.
◮ In general, when developing a model for data we often wish to constrain the model parameters in some way.
◮ There are many models of the form
$$w_{OPT} = \arg\min_w \|y - Xw\|^2 + \lambda g(w).$$
◮ The added terms are
  1. $\lambda > 0$: a regularization parameter,
  2. $g(w) > 0$: a penalty function that encourages desired properties of $w$.
RIDGE REGRESSION

Ridge regression is one choice of $g(w)$ that addresses variance issues with $w_{ML}$. It uses the squared penalty on the regression coefficient vector $w$,
$$w_{RR} = \arg\min_w \|y - Xw\|^2 + \lambda \|w\|^2.$$
The term $g(w) = \|w\|^2$ penalizes large values in $w$. However, there is a tradeoff between the first and second terms that is controlled by $\lambda$.

◮ Case $\lambda \to 0$: $w_{RR} \to w_{LS}$
◮ Case $\lambda \to \infty$: $w_{RR} \to \vec{0}$
RIDGE REGRESSION SOLUTION

Objective: We can solve the ridge regression problem using exactly the same procedure as for least squares,
$$\mathcal{L} = \|y - Xw\|^2 + \lambda \|w\|^2 = (y - Xw)^T (y - Xw) + \lambda w^T w.$$

Solution: First, take the gradient of $\mathcal{L}$ with respect to $w$ and set it to zero,
$$\nabla_w \mathcal{L} = -2X^T y + 2X^T X w + 2\lambda w = 0.$$
Then, solve for $w$ to find that
$$w_{RR} = (\lambda I + X^T X)^{-1} X^T y.$$
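A minimal sketch of this closed form, reusing the kind of simulated data from the earlier examples (the values of $\lambda$ and the data are assumptions), also makes the two limiting cases from the previous slide visible:

```python
import numpy as np

def ridge(X, y, lam):
    """w_RR = (lambda*I + X^T X)^{-1} X^T y, via a linear solve."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(4)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ rng.normal(size=d + 1) + rng.normal(scale=0.5, size=n)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(ridge(X, y, lam=1e-8))    # lambda -> 0: approaches w_LS
print(w_ls)
print(ridge(X, y, lam=1e8))     # lambda -> infinity: shrinks toward the zero vector
```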
RIDGE REGRESSION GEOMETRY

There is a tradeoff between the squared error and the penalty on $w$. We can write both in terms of level sets: curves where the function evaluation gives the same number. The sum of these gives a new set of levels with a unique minimum.

[Figure: level sets of $(w - w_{LS})^T (X^T X)(w - w_{LS})$ centered at $w_{LS}$ and of $\lambda w^T w$ centered at the origin, plotted in the $(w_1, w_2)$ plane.]

You can check that we can write
$$\|y - Xw\|^2 + \lambda \|w\|^2 = (w - w_{LS})^T (X^T X)(w - w_{LS}) + \lambda w^T w + (\text{const. w.r.t. } w).$$
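This identity can be verified numerically. The following sketch (with assumed random data, added here for illustration) evaluates both sides at random points $w$ and checks that they differ only by a constant that does not depend on $w$, namely $\|y - Xw_{LS}\|^2$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 40, 3, 2.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

def objective(w):
    """Original ridge objective: ||y - Xw||^2 + lambda * ||w||^2."""
    return np.sum((y - X @ w) ** 2) + lam * w @ w

def rewritten(w):
    """Level-set form without the constant: (w - w_LS)^T X^T X (w - w_LS) + lambda * w^T w."""
    diff = w - w_ls
    return diff @ (X.T @ X) @ diff + lam * w @ w

# The gap should equal ||y - X w_LS||^2 for every w (the constant term).
const = np.sum((y - X @ w_ls) ** 2)
for _ in range(3):
    w = rng.normal(size=d)
    print(np.isclose(objective(w) - rewritten(w), const))   # True each time
```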