Regression Methods


  0. Regression Methods

  1. Linear Regression and Logistic Regression: definitions and a common property (CMU, 2004 fall, Andrew Moore, HW2, pr. 4)

  2. Linear Regression and Logistic Regression: Definitions

     Given an input vector $X$, linear regression models a real-valued output $Y$ as
     $$Y \mid X \sim \mathrm{Normal}(\mu(X), \sigma^2), \qquad \text{where } \mu(X) = \beta^\top X = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p.$$
     Given an input vector $X$, logistic regression models a binary output $Y$ by
     $$Y \mid X \sim \mathrm{Bernoulli}(\theta(X)),$$
     where the Bernoulli parameter is related to $\beta^\top X$ by the logit transformation
     $$\mathrm{logit}(\theta(X)) \stackrel{\text{def.}}{=} \log \frac{\theta(X)}{1 - \theta(X)} = \beta^\top X.$$
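     As a quick illustration of the two conditional models just defined, here is a minimal NumPy sketch (not part of the original exercise; the names `sample_linear`, `sample_logistic`, `beta`, and `sigma` are chosen for illustration) that draws $Y$ given an input vector $x$ under each model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear(x, beta, sigma):
    """Draw Y | X = x from Normal(beta^T x, sigma^2)."""
    mu = beta @ x
    return rng.normal(mu, sigma)

def sample_logistic(x, beta):
    """Draw Y | X = x from Bernoulli(theta(x)), theta(x) = 1 / (1 + exp(-beta^T x))."""
    theta = 1.0 / (1.0 + np.exp(-(beta @ x)))
    return rng.binomial(1, theta)

# Example: x carries a leading 1 so that beta_0 plays the role of the intercept.
x = np.array([1.0, 2.5, -0.3])
beta = np.array([0.5, 1.2, -2.0])
print(sample_linear(x, beta, sigma=1.0), sample_logistic(x, beta))
```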

  3. a. For each of the two regression models defined above, write the log likelihood function and its gradient with respect to the parameter vector $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$.

     Answer: For linear regression, we can write the log likelihood function as:
     $$LL(\beta) = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( \frac{-(y_i - \mu(x_i))^2}{2\sigma^2} \right)
     = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( \frac{-(y_i - \beta^\top x_i)^2}{2\sigma^2} \right)
     = -n \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta^\top x_i)^2.$$
     Therefore, its gradient is:
     $$\nabla_\beta LL(\beta) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \beta^\top x_i)\, x_i.$$
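     For concreteness, a small sketch of this log likelihood and its gradient in NumPy (illustrative only; it assumes a design matrix `X` whose rows are the $x_i$, with a leading column of ones for $\beta_0$), together with a finite-difference check of one gradient coordinate:

```python
import numpy as np

def linreg_loglik(beta, X, y, sigma=1.0):
    """LL(beta) = -n*log(sqrt(2*pi)*sigma) - sum_i (y_i - beta^T x_i)^2 / (2*sigma^2)."""
    resid = y - X @ beta
    n = len(y)
    return -n * np.log(np.sqrt(2 * np.pi) * sigma) - resid @ resid / (2 * sigma**2)

def linreg_grad(beta, X, y, sigma=1.0):
    """Gradient: (1/sigma^2) * sum_i (y_i - beta^T x_i) x_i = X^T (y - X beta) / sigma^2."""
    return X.T @ (y - X @ beta) / sigma**2

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)
beta = np.zeros(3)

# Finite-difference check of the first coordinate of the gradient.
eps = 1e-6
e0 = np.zeros(3)
e0[0] = eps
fd = (linreg_loglik(beta + e0, X, y) - linreg_loglik(beta - e0, X, y)) / (2 * eps)
print(fd, linreg_grad(beta, X, y)[0])   # the two numbers should agree closely
```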

  4. For logistic regression:
     $$\log \frac{\theta(X)}{1 - \theta(X)} = \beta^\top X \;\Leftrightarrow\; \frac{\theta(X)}{1 - \theta(X)} = e^{\beta^\top X} \;\Leftrightarrow\; e^{\beta^\top X} = \theta(X)\,(1 + e^{\beta^\top X}).$$
     Therefore,
     $$\theta(X) = \frac{e^{\beta^\top X}}{1 + e^{\beta^\top X}} = \frac{1}{1 + e^{-\beta^\top X}} \quad \text{and} \quad 1 - \theta(X) = \frac{1}{1 + e^{\beta^\top X}}.$$
     Note that $Y \mid X \sim \mathrm{Bernoulli}(\theta(X))$ means that $P(Y = 1 \mid X) = \theta(X)$ and $P(Y = 0 \mid X) = 1 - \theta(X)$, which can be equivalently written as $P(Y = y \mid X) = \theta(X)^y (1 - \theta(X))^{1-y}$ for all $y \in \{0, 1\}$.
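     A short numeric sanity check of these two identities, with arbitrary values standing in for $\beta^\top X$ (illustrative only):

```python
import numpy as np

z = np.array([-3.0, 0.0, 2.5])                          # stands in for beta^T X
theta = np.exp(z) / (1 + np.exp(z))
print(np.allclose(theta, 1 / (1 + np.exp(-z))))         # theta(X) = 1 / (1 + e^{-beta^T X})
print(np.allclose(1 - theta, 1 / (1 + np.exp(z))))      # 1 - theta(X) = 1 / (1 + e^{beta^T X})
```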

  5. So, in this case the log likelihood function is:
     $$LL(\beta) = \log \prod_{i=1}^{n} \theta(x_i)^{y_i} (1 - \theta(x_i))^{1 - y_i}
     = \sum_{i=1}^{n} \{ y_i \log \theta(x_i) + (1 - y_i) \log(1 - \theta(x_i)) \}$$
     $$= \sum_{i=1}^{n} \{ y_i (\beta^\top x_i + \log(1 - \theta(x_i))) + (1 - y_i) \log(1 - \theta(x_i)) \}
     = \sum_{i=1}^{n} \{ y_i (\beta^\top x_i) - \log(1 + e^{\beta^\top x_i}) \}.$$
     And therefore,
     $$\nabla_\beta LL(\beta) = \sum_{i=1}^{n} \left( y_i x_i - \frac{e^{\beta^\top x_i}}{1 + e^{\beta^\top x_i}}\, x_i \right) = \sum_{i=1}^{n} (y_i - \theta(x_i))\, x_i.$$
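     The logistic-regression MLE has no closed form, but the gradient above drops straight into gradient ascent. A minimal sketch under assumed illustrative names (`logreg_mle`, full-batch ascent with a fixed step size, design matrix `X` with a leading column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_mle(X, y, lr=0.1, n_iters=2000):
    """Maximize LL(beta) = sum_i [y_i beta^T x_i - log(1 + exp(beta^T x_i))] by gradient ascent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ beta))   # sum_i (y_i - theta(x_i)) x_i
        beta += lr * grad / len(y)             # divide by n for a stable step size
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([-0.5, 1.5, -1.0])
y = rng.binomial(1, sigmoid(X @ beta_true))
print(logreg_mle(X, y))   # should be roughly close to beta_true
```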

  6. Remark

     Actually, in the above solutions the full log likelihood function should first look like the following:
     $$\text{log-likelihood} = \sum_{i=1}^{n} \log p(x_i, y_i) = \sum_{i=1}^{n} \log \left( p_{Y|X}(y_i \mid x_i)\, p_X(x_i) \right)
     = \log \left( \prod_{i=1}^{n} p_{Y|X}(y_i \mid x_i) \cdot \prod_{i=1}^{n} p_X(x_i) \right)$$
     $$= \sum_{i=1}^{n} \log p_{Y|X}(y_i \mid x_i) + \sum_{i=1}^{n} \log p_X(x_i) = LL + LL_x.$$
     Because $LL_x$ does not depend on the parameter $\beta$, when doing MLE we can just consider maximizing $LL$.

  7. b. Show that for each of the two regression models above, at the MLE $\hat\beta$ the following property holds:
     $$\sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} E[Y \mid X = x_i, \beta = \hat\beta]\, x_i.$$
     Answer:

     For linear regression:
     $$\nabla_\beta LL(\beta) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} (\hat\beta^\top x_i)\, x_i.$$
     Since $Y \mid X \sim \mathrm{Normal}(\mu(X), \sigma^2)$, we have $E[Y \mid X = x_i, \beta = \hat\beta] = \mu(x_i) = \hat\beta^\top x_i$.
     So $\sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} E[Y \mid X = x_i, \beta = \hat\beta]\, x_i$.

     For logistic regression:
     $$\nabla_\beta LL(\beta) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} \theta(x_i)\, x_i.$$
     Since $Y \mid X \sim \mathrm{Bernoulli}(\theta(X))$, we have $E[Y \mid X = x_i, \beta = \hat\beta] = \theta(x_i) = \dfrac{e^{\hat\beta^\top x_i}}{1 + e^{\hat\beta^\top x_i}}$.
     So $\sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} E[Y \mid X = x_i, \beta = \hat\beta]\, x_i$.
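     This common property is easy to verify numerically. The sketch below (illustrative; the linear model is fitted with `np.linalg.lstsq`, the logistic model with a simple gradient ascent) compares the two sides of the identity at the fitted parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.3, -1.0, 2.0])

# Linear regression: the MLE under Gaussian noise is the least-squares fit.
y_lin = X @ beta_true + rng.normal(size=n)
beta_hat, *_ = np.linalg.lstsq(X, y_lin, rcond=None)
print(np.allclose(X.T @ y_lin, X.T @ (X @ beta_hat)))      # sum_i y_i x_i == sum_i E[Y|x_i, beta_hat] x_i

# Logistic regression: approximate the MLE by full-batch gradient ascent.
y_log = rng.binomial(1, sigmoid(X @ beta_true))
b = np.zeros(3)
for _ in range(5000):
    b += 0.1 * X.T @ (y_log - sigmoid(X @ b)) / n
print(np.abs(X.T @ y_log - X.T @ sigmoid(X @ b)).max())    # near 0 at the (approximate) MLE
```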

  8. Linear Regression with only one parameter; MLE and MAP estimation (CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, midterm, pr. 3)

  9. Consider real-valued variables $X$ and $Y$. The $Y$ variable is generated, conditional on $X$, from the following process:
     $$Y = aX + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$
     where every $\varepsilon$ is an independent variable, called a noise term, drawn from a Gaussian distribution with mean $0$ and standard deviation $\sigma$. This is a one-feature linear regression model, where $a$ is the only weight parameter. The conditional probability of $Y$ has the distribution $p(Y \mid X, a) \sim N(aX, \sigma^2)$, so it can be written as
     $$p(Y \mid X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (Y - aX)^2 \right).$$

  10. MLE estimation

      a. Assume we have a training dataset of $n$ pairs $(X_i, Y_i)$ for $i = 1, \ldots, n$, and $\sigma$ is known. Which of the following equations correctly represent the maximum likelihood problem for estimating $a$? Say yes or no to each one. More than one of them should have the answer yes.
      $$\text{i.} \;\; \arg\max_a \sum_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) \qquad
      \text{ii.} \;\; \arg\max_a \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$$
      $$\text{iii.} \;\; \arg\max_a \sum_i \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) \qquad
      \text{iv.} \;\; \arg\max_a \prod_i \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$$
      $$\text{v.} \;\; \arg\max_a \sum_i (Y_i - aX_i)^2 \qquad
      \text{vi.} \;\; \arg\min_a \sum_i (Y_i - aX_i)^2$$

  11. Answer:
      $$L_D(a) \stackrel{\text{def.}}{=} p(Y_1, \ldots, Y_n \mid a) = p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) \stackrel{\text{i.i.d.}}{=} \prod_{i=1}^{n} p(Y_i \mid X_i, a) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$$
      Therefore
      $$a_{\mathrm{MLE}} \stackrel{\text{def.}}{=} \arg\max_a L_D(a) = \arg\max_a \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) \quad (\text{ii.})$$
      $$= \arg\max_a \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\!\left( -\sum_{i=1}^{n} \frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) = \arg\max_a \prod_{i=1}^{n} \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) \quad (\text{iv.})$$
      $$= \arg\max_a \ln \prod_{i=1}^{n} \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) = \arg\max_a \sum_{i=1}^{n} -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 = \arg\min_a \sum_{i=1}^{n} (Y_i - aX_i)^2 \quad (\text{vi.})$$

  12. b. Derive the maximum likelihood estimate of the parameter $a$ in terms of the training examples $X_i$ and $Y_i$. We recommend you start with the simplest form of the problem you found above.

      Answer:
      $$a_{\mathrm{MLE}} = \arg\min_a \sum_{i=1}^{n} (Y_i - aX_i)^2 = \arg\min_a \left( a^2 \sum_{i=1}^{n} X_i^2 - 2a \sum_{i=1}^{n} X_i Y_i + \sum_{i=1}^{n} Y_i^2 \right).$$
      Setting the derivative with respect to $a$ to zero, $2a \sum_{i=1}^{n} X_i^2 - 2 \sum_{i=1}^{n} X_i Y_i = 0$, gives
      $$a_{\mathrm{MLE}} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}.$$
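      In code the closed form is a one-liner. The sketch below (synthetic data, illustrative names) computes $a_{\mathrm{MLE}}$ and cross-checks it against a generic least-squares solver applied to the single-column design matrix:

```python
import numpy as np

def a_mle(X, Y):
    """Closed-form MLE for the one-parameter model Y = a X + noise (no intercept)."""
    return np.sum(X * Y) / np.sum(X ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=100)
Y = 2.5 * X + rng.normal(size=100)

# Cross-check against np.linalg.lstsq on the single-column design matrix.
a_lstsq, *_ = np.linalg.lstsq(X[:, None], Y, rcond=None)
print(a_mle(X, Y), a_lstsq[0])   # the two estimates coincide up to floating-point error
```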

  13. MAP estimation

      Let's put a prior on $a$. Assume $a \sim N(0, \lambda^2)$, so
      $$p(a \mid \lambda) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left( -\frac{1}{2\lambda^2} a^2 \right).$$
      The posterior probability of $a$ is
      $$p(a \mid Y_1, \ldots, Y_n, X_1, \ldots, X_n, \lambda) = \frac{p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a)\, p(a \mid \lambda)}{\int_{a'} p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a')\, p(a' \mid \lambda)\, da'}.$$
      We can ignore the denominator when doing MAP estimation.

      c. Assume $\sigma = 1$ and a fixed prior parameter $\lambda$. Solve for the MAP estimate of $a$,
      $$\arg\max_a \left[ \ln p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) + \ln p(a \mid \lambda) \right].$$
      Your solution should be in terms of the $X_i$'s, the $Y_i$'s, and $\lambda$.

  14. Answer:
      $$p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) \cdot p(a \mid \lambda)
      = \left[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) \right] \cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left( -\frac{a^2}{2\lambda^2} \right)$$
      $$\stackrel{\sigma = 1}{=} \left[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} (Y_i - aX_i)^2 \right) \right] \cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left( -\frac{a^2}{2\lambda^2} \right).$$
      Therefore the MAP optimization problem is
      $$\arg\max_a \left[ n \ln \frac{1}{\sqrt{2\pi}} - \frac{1}{2} \sum_{i=1}^{n} (Y_i - aX_i)^2 + \ln \frac{1}{\sqrt{2\pi}\,\lambda} - \frac{1}{2\lambda^2} a^2 \right]
      = \arg\max_a \left[ -\frac{1}{2} \sum_{i=1}^{n} (Y_i - aX_i)^2 - \frac{1}{2\lambda^2} a^2 \right]$$
      $$= \arg\min_a \left[ \sum_{i=1}^{n} (Y_i - aX_i)^2 + \frac{a^2}{\lambda^2} \right]
      = \arg\min_a \left[ a^2 \left( \sum_{i=1}^{n} X_i^2 + \frac{1}{\lambda^2} \right) - 2a \sum_{i=1}^{n} X_i Y_i + \sum_{i=1}^{n} Y_i^2 \right]$$
      $$\Rightarrow\; a_{\mathrm{MAP}} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2 + \dfrac{1}{\lambda^2}}.$$
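      A small side-by-side sketch of the two estimators (assuming $\sigma = 1$ as in the exercise; names are illustrative). The extra $1/\lambda^2$ in the denominator acts like an L2 (ridge) penalty, shrinking $a_{\mathrm{MAP}}$ toward the prior mean $0$:

```python
import numpy as np

def a_mle(X, Y):
    return np.sum(X * Y) / np.sum(X ** 2)

def a_map(X, Y, lam):
    """MAP estimate under the prior a ~ N(0, lam^2), with sigma = 1."""
    return np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)

rng = np.random.default_rng(5)
X = rng.normal(size=20)
Y = 2.5 * X + rng.normal(size=20)

for lam in (0.1, 1.0, 10.0):
    print(f"lambda={lam:5.1f}  a_MLE={a_mle(X, Y):.3f}  a_MAP={a_map(X, Y, lam):.3f}")
```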

  15. d. Under the following conditions, how do the prior and conditional likelihood curves change? Do $a_{\mathrm{MLE}}$ and $a_{\mathrm{MAP}}$ become closer together, or further apart?

                                          conditional likelihood             prior p(a | λ):         |a_MLE − a_MAP|:
                                          p(Y_1,...,Y_n | X_1,...,X_n, a):   wider, narrower,        increase or
                                          wider, narrower, or same?          or same?                decrease?
      As λ → ∞
      As λ → 0
      More data: as n → ∞ (fixed λ)

  16. Answer:

                                          conditional likelihood             prior p(a | λ)          |a_MLE − a_MAP|
      As λ → ∞                            same                               wider                   decrease
      As λ → 0                            same                               narrower                increase
      More data: as n → ∞ (fixed λ)       narrower                           same                    decrease
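      The last column of this table can be reproduced numerically. The sketch below (illustrative, reusing the closed-form estimators derived above on synthetic data) prints $|a_{\mathrm{MLE}} - a_{\mathrm{MAP}}|$ as $\lambda$ grows and as $n$ grows; it shrinks in both cases:

```python
import numpy as np

rng = np.random.default_rng(6)
X_all = rng.normal(size=10000)
Y_all = 2.5 * X_all + rng.normal(size=10000)

def gap(n, lam):
    """|a_MLE - a_MAP| on the first n points, for prior width lam (sigma = 1)."""
    X, Y = X_all[:n], Y_all[:n]
    a_mle = np.sum(X * Y) / np.sum(X ** 2)
    a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
    return abs(a_mle - a_map)

# Rows 1-2 of the table: widening the prior (λ → ∞) shrinks the gap.
print([round(gap(20, lam), 5) for lam in (0.1, 1.0, 10.0, 100.0)])
# Row 3: more data (fixed λ) also shrinks the gap.
print([round(gap(n, 1.0), 5) for n in (10, 100, 1000, 10000)])
```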

  17. Linear Regression in $\mathbb{R}^2$ [without "intercept" term] with either Gaussian or Laplace noise (CMU, 2009 fall, Carlos Guestrin, HW3, pr. 1.5.2; CMU, 2012 fall, Eric Xing, Aarti Singh, HW1, pr. 2)
