Regression Methods


  1. 0. Regression Methods

  2. 1. Linear Regression with only one parameter, and without offset; MLE and MAP estimation (CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, midterm, pr. 3)

  3. 2. Consider real-valued variables X and Y. The Y variable is generated, conditional on X, from the following process:

$\varepsilon \sim N(0, \sigma^2), \qquad Y = aX + \varepsilon,$

where every $\varepsilon$ is an independent variable, called a noise term, which is drawn from a Gaussian distribution with mean 0 and standard deviation $\sigma$. This is a one-feature linear regression model, where $a$ is the only weight parameter. The conditional probability of $Y$ has the distribution $p(Y \mid X, a) \sim N(aX, \sigma^2)$, so it can be written as

$p(Y \mid X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y - aX)^2 \right).$
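
As an illustration (not part of the original problem), here is a minimal numpy sketch that samples $(X_i, Y_i)$ pairs from this process and evaluates the conditional density $p(Y \mid X, a)$; the values of `a_true`, `sigma` and `n` are arbitrary choices.

```python
# Minimal sketch: sample from Y = a*X + eps, eps ~ N(0, sigma^2),
# then evaluate the conditional density p(Y | X, a) = N(Y; a*X, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
a_true, sigma, n = 2.0, 1.0, 50      # arbitrary illustrative values

X = rng.uniform(-3, 3, size=n)
Y = a_true * X + rng.normal(0.0, sigma, size=n)

def cond_density(y, x, a, sigma):
    """p(Y=y | X=x, a), i.e. the Gaussian density with mean a*x and std sigma."""
    return np.exp(-(y - a * x) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

print(cond_density(Y[0], X[0], a_true, sigma))
```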

  4. 3. MLE estimation

a. Assume we have a training dataset of $n$ pairs $(X_i, Y_i)$ for $i = 1, \dots, n$, and $\sigma$ is known. Which of the following equations correctly represent the maximum likelihood problem for estimating $a$? Say yes or no to each one. More than one of them should have the answer yes.

i. $\arg\max_a \sum_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

ii. $\arg\max_a \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

iii. $\arg\max_a \sum_i \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

iv. $\arg\max_a \prod_i \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

v. $\arg\max_a \sum_i (Y_i - aX_i)^2$

vi. $\arg\min_a \sum_i (Y_i - aX_i)^2$

  5. 4. Answer:

$L_D(a) \stackrel{\text{def.}}{=} p(Y_1, \dots, Y_n \mid a) = p(Y_1, \dots, Y_n \mid X_1, \dots, X_n, a) \stackrel{\text{i.i.d.}}{=} \prod_{i=1}^{n} p(Y_i \mid X_i, a) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

Therefore

$a_{\text{MLE}} \stackrel{\text{def.}}{=} \arg\max_a L_D(a) = \arg\max_a \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$  (ii.)

$= \arg\max_a \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \prod_{i=1}^{n} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) = \arg\max_a \prod_{i=1}^{n} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$  (iv.)

$= \arg\max_a \ln \prod_{i=1}^{n} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) = \arg\max_a \sum_{i=1}^{n} -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 = \arg\min_a \sum_{i=1}^{n} (Y_i - aX_i)^2$  (vi.)
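
The chain of equivalences above can be sanity-checked numerically: over a grid of candidate values of $a$, the maximizer of the log-likelihood and the minimizer of the sum of squared residuals coincide. The sketch below is mine (synthetic data; `a_true`, `sigma` and `n` are arbitrary), not part of the original solution.

```python
# Over a grid of candidate slopes, argmax of the log-likelihood (up to a
# constant) equals argmin of the sum of squared residuals.
import numpy as np

rng = np.random.default_rng(1)
a_true, sigma, n = 2.0, 1.0, 200
X = rng.uniform(-3, 3, size=n)
Y = a_true * X + rng.normal(0.0, sigma, size=n)

grid = np.linspace(0.0, 4.0, 4001)
# log-likelihood up to the additive constant -n*log(sqrt(2*pi)*sigma)
loglik = np.array([-np.sum((Y - a * X) ** 2) / (2 * sigma ** 2) for a in grid])
sse = np.array([np.sum((Y - a * X) ** 2) for a in grid])

assert grid[np.argmax(loglik)] == grid[np.argmin(sse)]
print(grid[np.argmax(loglik)])   # close to a_true = 2.0
```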

  6. 5. b. Derive the maximum likelihood estimate of the parameter $a$ in terms of the training examples $X_i$ and $Y_i$. We recommend you start with the simplest form of the problem you found above.

Answer:

$a_{\text{MLE}} = \arg\min_a \sum_{i=1}^{n} (Y_i - aX_i)^2 = \arg\min_a \left( a^2 \sum_{i=1}^{n} X_i^2 - 2a \sum_{i=1}^{n} X_i Y_i + \sum_{i=1}^{n} Y_i^2 \right)$

This is a quadratic in $a$, minimized at its vertex:

$a_{\text{MLE}} = \frac{-\left( -2 \sum_{i=1}^{n} X_i Y_i \right)}{2 \sum_{i=1}^{n} X_i^2} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$
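
A quick numerical check of this closed form (my own sketch on synthetic data; `a_true`, `sigma` and `n` are arbitrary): $a_{\text{MLE}} = \sum_i X_i Y_i / \sum_i X_i^2$ should agree with numpy's least-squares solver applied to the one-column design matrix.

```python
# Closed-form MLE for the no-offset, one-parameter model vs. np.linalg.lstsq.
import numpy as np

rng = np.random.default_rng(2)
a_true, sigma, n = 2.0, 1.0, 500
X = rng.uniform(-3, 3, size=n)
Y = a_true * X + rng.normal(0.0, sigma, size=n)

a_mle = np.sum(X * Y) / np.sum(X ** 2)                        # formula derived above
a_lstsq = np.linalg.lstsq(X.reshape(-1, 1), Y, rcond=None)[0][0]

print(a_mle, a_lstsq)            # both close to a_true = 2.0
assert np.isclose(a_mle, a_lstsq)
```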

  7. 6. MAP estimation. Let's put a prior on $a$. Assume $a \sim N(0, \lambda^2)$, so

$p(a \mid \lambda) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\left( -\frac{1}{2\lambda^2} a^2 \right)$

The posterior probability of $a$ is

$p(a \mid Y_1, \dots, Y_n, X_1, \dots, X_n, \lambda) = \frac{p(Y_1, \dots, Y_n \mid X_1, \dots, X_n, a)\, p(a \mid \lambda)}{\int_{a'} p(Y_1, \dots, Y_n \mid X_1, \dots, X_n, a')\, p(a' \mid \lambda)\, da'}$

We can ignore the denominator when doing MAP estimation.

c. Assume $\sigma = 1$ and a fixed prior parameter $\lambda$. Solve for the MAP estimate of $a$,

$\arg\max_a \left[ \ln p(Y_1, \dots, Y_n \mid X_1, \dots, X_n, a) + \ln p(a \mid \lambda) \right]$

Your solution should be in terms of the $X_i$'s, $Y_i$'s, and $\lambda$.

  8. 7. Answer:

$p(Y_1, \dots, Y_n \mid X_1, \dots, X_n, a) \cdot p(a \mid \lambda) = \left[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) \right] \cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\left( -\frac{a^2}{2\lambda^2} \right)$

$\stackrel{\sigma = 1}{=} \left[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (Y_i - aX_i)^2 \right) \right] \cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\left( -\frac{a^2}{2\lambda^2} \right)$

Therefore the MAP optimization problem is

$\arg\max_a \left[ n \ln\frac{1}{\sqrt{2\pi}} + \ln\frac{1}{\sqrt{2\pi}\,\lambda} - \frac{1}{2} \sum_{i=1}^{n} (Y_i - aX_i)^2 - \frac{1}{2\lambda^2} a^2 \right]$

$= \arg\max_a \left[ -\frac{1}{2} \sum_{i=1}^{n} (Y_i - aX_i)^2 - \frac{1}{2\lambda^2} a^2 \right] = \arg\min_a \left[ \sum_{i=1}^{n} (Y_i - aX_i)^2 + \frac{a^2}{\lambda^2} \right] = \arg\min_a \left[ a^2 \left( \sum_{i=1}^{n} X_i^2 + \frac{1}{\lambda^2} \right) - 2a \sum_{i=1}^{n} X_i Y_i + \sum_{i=1}^{n} Y_i^2 \right]$

$\Rightarrow a_{\text{MAP}} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2 + \frac{1}{\lambda^2}}$
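
A small sketch (mine, with $\sigma = 1$ and arbitrary synthetic data) of the MAP formula just derived; it also shows the shrinkage effect of the prior: the smaller $\lambda$, the more $a_{\text{MAP}}$ is pulled toward 0 relative to $a_{\text{MLE}}$.

```python
# MAP estimate a_MAP = sum(X_i*Y_i) / (sum(X_i^2) + 1/lambda^2), sigma = 1.
import numpy as np

rng = np.random.default_rng(3)
a_true, n = 2.0, 100
X = rng.uniform(-3, 3, size=n)
Y = a_true * X + rng.normal(0.0, 1.0, size=n)    # sigma = 1

a_mle = np.sum(X * Y) / np.sum(X ** 2)
for lam in (0.01, 0.1, 1.0, 100.0):
    a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
    # tight prior (small lambda) shrinks a_MAP toward 0; wide prior leaves it near a_MLE
    print(f"lambda={lam:>6}: a_MAP={a_map:.4f}  (a_MLE={a_mle:.4f})")
```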

  9. 8. d. Under the following conditions, how do the prior and conditional likelihood curves change? Do $a_{\text{MLE}}$ and $a_{\text{MAP}}$ become closer together, or further apart?

| | Conditional likelihood $p(Y_1, \dots, Y_n \mid X_1, \dots, X_n, a)$: wider, narrower, or same? | Prior probability $p(a \mid \lambda)$: wider, narrower, or same? | $\lvert a_{\text{MLE}} - a_{\text{MAP}} \rvert$: increase or decrease? |
|---|---|---|---|
| As $\lambda \to \infty$ | | | |
| As $\lambda \to 0$ | | | |
| More data: as $n \to \infty$ (fixed $\lambda$) | | | |

  10. 9. Answer:

| | Conditional likelihood $p(Y_1, \dots, Y_n \mid X_1, \dots, X_n, a)$ | Prior probability $p(a \mid \lambda)$ | $\lvert a_{\text{MLE}} - a_{\text{MAP}} \rvert$ |
|---|---|---|---|
| As $\lambda \to \infty$ | same | wider | decrease |
| As $\lambda \to 0$ | same | narrower | increase |
| More data: as $n \to \infty$ (fixed $\lambda$) | narrower | same | decrease |
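
The trends in this table can be verified numerically with the two closed forms derived above. The sketch below is my own illustration (synthetic data, $\sigma = 1$, an arbitrary true slope), not part of the original solution.

```python
# As lambda grows the prior flattens and a_MAP approaches a_MLE; as lambda
# shrinks, a_MAP is pulled toward 0; with more data (fixed lambda) the
# likelihood dominates and |a_MLE - a_MAP| shrinks again.
import numpy as np

rng = np.random.default_rng(4)
a_true = 2.0

def gap(n, lam):
    X = rng.uniform(-3, 3, size=n)
    Y = a_true * X + rng.normal(0.0, 1.0, size=n)        # sigma = 1
    a_mle = np.sum(X * Y) / np.sum(X ** 2)
    a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
    return abs(a_mle - a_map)

print([round(gap(50, lam), 6) for lam in (0.1, 1.0, 10.0, 100.0)])  # lambda up -> gap down
print([round(gap(n, 0.5), 6) for n in (10, 100, 1000, 10000)])      # n up -> gap down
```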

  11. 10. Linear Regression – the general case:
  • MLE and the Least Squares Loss function
    ◦ Nonlinear Regression
  • [L2/Ridge] Regularization; MAP
  (CMU, 2015 spring, Alex Smola, HW1, pr. 2)

  12. 11. The objective of this problem is to gain knowledge of linear regression, Maximum Likelihood Estimation (MLE), Maximum-a-Posteriori (MAP) estimation, and the variants of regression problems obtained by introducing regularization terms.

Part I: Linear Regression — MLE and Least Squares

Consider a linear model with some Gaussian noise:

$Y_i = X_i \cdot w + b + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0, \sigma^2), \; i = 1, \dots, n, \qquad (1)$

where $Y_i \in \mathbb{R}$ is a scalar, $X_i \in \mathbb{R}^d$ is a $d$-dimensional vector, $b \in \mathbb{R}$ is a constant, $w \in \mathbb{R}^d$ is a $d$-dimensional weight on $X_i$, and $\varepsilon_i$ is i.i.d. Gaussian noise with variance $\sigma^2$. Given the data $X_i$, $i = 1, \dots, n$, our goal is to estimate $w$ and $b$, which specify the model. We will show that solving the linear model (1) with the MLE method is the same as solving the following Least Squares problem:

$\arg\min_\beta\, (Y - X'\beta)^\top (Y - X'\beta), \qquad (2)$

where $Y = (Y_1, \dots, Y_n)^\top$, $X_i' = (1, X_i)^\top$, $X' = (X_1', \dots, X_n')^\top$ and $\beta = (b, w)^\top$.
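
As a small illustration (my own, with arbitrary $n$ and $d$), the augmented design matrix $X'$ of (2) is just the data matrix with a column of ones prepended, so that $\beta = (b, w)^\top$ treats the offset $b$ as the weight of the constant feature.

```python
# Build X' whose rows are X'_i = (1, X_i); beta = (b, w) then absorbs the offset.
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 3                                     # arbitrary illustrative sizes
X = rng.normal(size=(n, d))                     # rows are the X_i
X_prime = np.hstack([np.ones((n, 1)), X])       # rows are (1, X_i)
print(X_prime.shape)                            # (n, d + 1)
```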

  13. 12. a. From the model (1), derive the conditional distribution of $Y_i \mid X_i, w, b$. Remember that $X_i$ is a fixed data point.

Answer: Note that $Y_i \mid X_i; w, b \sim N(X_i \cdot w + b, \sigma^2)$, thus we can write the p.d.f. of $Y_i \mid X_i, w, b$ in the following form:

$f(Y_i = y_i \mid X_i; w, b) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - X_i \cdot w - b)^2}{2\sigma^2} \right).$

  14. 13. b. Assuming the $\varepsilon_i$, $i = 1, \dots, n$, are i.i.d., give an explicit expression for the log-likelihood $\ell(Y \mid \beta)$ of the data. Note: the notation for $Y$ and $\beta$ was given at (2). Given that the $\varepsilon_i$'s are i.i.d., it follows that $P(Y \mid \beta) = \prod_i P(Y_i \mid w, b)$. Remark that we are just omitting $X_i$ for convenience, as the problem explicitly tells us that the $X_i$ are fixed points.

Answer: Given $y = (y_1, \dots, y_n)^\top$, since the $Y_i$ are independent (as the $\varepsilon_i$'s are i.i.d. and the $X_i$'s are given), the likelihood of $Y \mid \beta$ is as follows:

$f(Y = y \mid \beta) = \prod_{i=1}^{n} f(y_i \mid w, b) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - X_i \cdot w - b)^2}{2\sigma^2} \right) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left( -\frac{\sum_{i=1}^{n} (y_i - X_i \cdot w - b)^2}{2\sigma^2} \right).$

Now, taking the $\ln$, the log-likelihood of $Y \mid \beta$ is as follows:

$\ell(Y = y \mid \beta) = n \ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - X_i \cdot w - b)^2. \qquad (3)$
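
A brief numpy sketch of (3) (my own synthetic data, parameters chosen arbitrarily): the closed-form log-likelihood matches the sum of the individual Gaussian log-densities.

```python
# ell(Y=y | beta) = n*ln(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma^2)) * sum_i (y_i - X_i.w - b)^2
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma = 100, 3, 0.5
X = rng.normal(size=(n, d))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ w_true + b_true + rng.normal(0.0, sigma, size=n)

def log_likelihood(y, X, w, b, sigma):
    resid = y - X @ w - b
    return y.size * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - np.sum(resid ** 2) / (2 * sigma ** 2)

# direct sum of per-observation Gaussian log-densities
direct = np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                - (y - X @ w_true - b_true) ** 2 / (2 * sigma ** 2))
assert np.isclose(log_likelihood(y, X, w_true, b_true, sigma), direct)
```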

  15. 14. c. Now show that solving for the $\beta$ that maximizes the log-likelihood (i.e., MLE) is the same as solving the Least Squares problem of (2).

Answer: To maximize the log-likelihood $\ell(Y = y \mid \beta)$, we focus on the second term, since the first term of the log-likelihood (3) is a constant. In short, to maximize the second term of (3), we want to minimize $\sum_{i=1}^{n} (y_i - X_i \cdot w - b)^2$. Writing it in matrix-vector form, we get:

$\max_\beta \ell(Y = y \mid \beta) = \min_\beta \sum_{i=1}^{n} (y_i - X_i \cdot w - b)^2 = \min_\beta\, (y - X'\beta)^\top (y - X'\beta),$

where again $Y = (Y_1, \dots, Y_n)^\top$, $X_i' = (1, X_i)^\top$, $X' = (X_1', \dots, X_n')^\top$ and $\beta = (b, w)^\top$.

  16. 15. d. Derive the $\beta$ that maximizes the log-likelihood.

Hint: You may find useful the following formulas, rules (5a), (5b) and (6c) from Matrix Identities, Sam Roweis, 1999, http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf:

$\frac{\partial}{\partial z}(XY) = X \frac{\partial Y}{\partial z} + \frac{\partial X}{\partial z} Y \quad (5a), \qquad \frac{\partial}{\partial X} a^\top X = \frac{\partial}{\partial X} X^\top a = a \quad (5b), \qquad \frac{\partial}{\partial X} X^\top A X = (A + A^\top) X \quad (6c)$

Answer: Setting the objective function $J(\beta)$ as $J(\beta) = (y - X'\beta)^\top (y - X'\beta)$ implies (see rules (6c), (5a) and (5b) in the document mentioned in the Hint):

$\nabla_\beta J(\beta) = 2 X'^\top (X'\beta - y).$

The log-likelihood maximizer $\hat\beta$ can be found by solving the following optimality condition:

$\nabla_\beta J(\hat\beta) = 0 \;\Leftrightarrow\; X'^\top (X'\hat\beta - y) = 0 \;\Leftrightarrow\; X'^\top X' \hat\beta = X'^\top y \;\Leftrightarrow\; \hat\beta = (X'^\top X')^{-1} X'^\top y. \qquad (4)$

Note that inverting $X'^\top X'$ is possible because it was assumed that $X'$ has full column rank.
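
Finally, a sketch of the closed form (4) on synthetic data (my own choices of $n$, $d$, $w$, $b$): solving the normal equations with np.linalg.solve and cross-checking against np.linalg.lstsq.

```python
# beta_hat = (X'^T X')^{-1} X'^T y via the normal equations, checked with lstsq.
import numpy as np

rng = np.random.default_rng(7)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ w_true + b_true + rng.normal(0.0, 0.1, size=n)

X_prime = np.hstack([np.ones((n, 1)), X])                       # rows (1, X_i)
beta_hat = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ y)  # (b_hat, w_hat)
beta_lstsq = np.linalg.lstsq(X_prime, y, rcond=None)[0]

print(beta_hat)                    # close to (0.3, 1.0, -2.0, 0.5)
assert np.allclose(beta_hat, beta_lstsq)
```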
