Today: Finish Linear Regression - Best Linear Function Prediction of Y Given X (PowerPoint PPT Presentation)

  1. Today: Finish Linear Regression: Best linear function prediction of Y given X. MMSE: Best function that predicts Y from X. Conditional Expectation. Applications to random processes.

  2. LLSE Theorem: Consider two RVs X, Y with a given distribution Pr[X = x, Y = y]. Then L[Y|X] = Ŷ = E[Y] + (cov(X, Y)/var(X))(X − E[X]). Proof 1: Y − Ŷ = (Y − E[Y]) − (cov(X, Y)/var[X])(X − E[X]), so E[Y − Ŷ] = 0 by linearity. Also, E[(Y − Ŷ)X] = 0, after a bit of algebra. (See next slide.) Combining these two facts: E[(Y − Ŷ)(c + dX)] = 0 for any c, d. Since Ŷ = α + βX for some α, β, for every (a, b) there exist c, d such that Ŷ − a − bX = c + dX. Then E[(Y − Ŷ)(Ŷ − a − bX)] = 0, ∀ a, b. Now, E[(Y − a − bX)²] = E[(Y − Ŷ + Ŷ − a − bX)²] = E[(Y − Ŷ)²] + E[(Ŷ − a − bX)²] + 0 ≥ E[(Y − Ŷ)²]. This shows that E[(Y − Ŷ)²] ≤ E[(Y − a − bX)²] for all (a, b). Thus Ŷ is the LLSE.
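
A small numerical sketch (not part of the slides) of the theorem and the proof idea: it computes L[Y|X] from a toy joint distribution, checks the two orthogonality facts used above, and confirms by a brute-force grid search that no other line does better in mean square. The pmf values are made up for illustration.

```python
# Minimal sketch of the LLSE theorem on a small, made-up joint pmf.
pmf = {(0, 1): 0.2, (1, 1): 0.3, (1, 3): 0.3, (2, 4): 0.2}
E = lambda f: sum(p * f(x, y) for (x, y), p in pmf.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
varX = E(lambda x, y: (x - EX) ** 2)
cov = E(lambda x, y: (x - EX) * (y - EY))

# LLSE: L[Y|X] = E[Y] + cov(X,Y)/var(X) * (X - E[X]) = a + b X
b = cov / varX
a = EY - b * EX

# The two facts used in the proof: the error Y - Yhat has mean 0
# and is orthogonal to X.
err = lambda x, y: y - (a + b * x)
print(E(err), E(lambda x, y: err(x, y) * x))        # both ~ 0

# Brute-force check over a grid of lines a' + b'X: none beats the LLSE.
mse = lambda a_, b_: E(lambda x, y: (y - a_ - b_ * x) ** 2)
best = min(mse(i * 0.05, j * 0.05)
           for i in range(-60, 61) for j in range(-60, 61))
assert mse(a, b) <= best + 1e-9
```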

  3. A Bit of Algebra: Y − Ŷ = (Y − E[Y]) − (cov(X, Y)/var[X])(X − E[X]). Hence, E[Y − Ŷ] = 0. We want to show that E[(Y − Ŷ)X] = 0. Note that E[(Y − Ŷ)X] = E[(Y − Ŷ)(X − E[X])], because E[(Y − Ŷ)E[X]] = 0. Now, E[(Y − Ŷ)(X − E[X])] = E[(Y − E[Y])(X − E[X])] − (cov(X, Y)/var[X]) E[(X − E[X])(X − E[X])] = cov(X, Y) − (cov(X, Y)/var[X]) var[X] = 0, using (∗). (∗): Recall that cov(X, Y) = E[(X − E[X])(Y − E[Y])] and var[X] = E[(X − E[X])²].

  4. Estimation Error: We saw that the LLSE of Y given X is L[Y|X] = Ŷ = E[Y] + (cov(X, Y)/var(X))(X − E[X]). How good is this estimator? That is, what is the mean squared estimation error? We find E[|Y − L[Y|X]|²] = E[(Y − E[Y] − (cov(X, Y)/var(X))(X − E[X]))²] = E[(Y − E[Y])²] − 2(cov(X, Y)/var(X)) E[(Y − E[Y])(X − E[X])] + (cov(X, Y)/var(X))² E[(X − E[X])²] = var(Y) − cov(X, Y)²/var(X). Without observations, the estimate is E[Y] and the error is var(Y). Observing X reduces the error.
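
A short follow-up sketch (same made-up pmf as above) checking the error formula E[(Y − L[Y|X])²] = var(Y) − cov(X, Y)²/var(X):

```python
# Check the estimation-error formula on a small, made-up joint pmf.
pmf = {(0, 1): 0.2, (1, 1): 0.3, (1, 3): 0.3, (2, 4): 0.2}
E = lambda f: sum(p * f(x, y) for (x, y), p in pmf.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
varX = E(lambda x, y: (x - EX) ** 2)
varY = E(lambda x, y: (y - EY) ** 2)
cov = E(lambda x, y: (x - EX) * (y - EY))

L = lambda x: EY + cov / varX * (x - EX)       # the LLSE
mse = E(lambda x, y: (y - L(x)) ** 2)          # actual mean squared error

assert abs(mse - (varY - cov ** 2 / varX)) < 1e-9
print(mse, varY)   # observing X reduces the error from var(Y) down to mse
```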

  5. Estimation Error: A Picture. We saw that L[Y|X] = Ŷ = E[Y] + (cov(X, Y)/var(X))(X − E[X]) and E[|Y − L[Y|X]|²] = var(Y) − cov(X, Y)²/var(X). Here is a picture when E[X] = 0, E[Y] = 0: dimensions correspond to sample points of a uniform sample space, and the vector Y has component (1/√|Ω|)·Y(ω) along dimension ω. [Figure: L[Y|X] shown as the projection of Y onto X.]

  6. Linear Regression Examples. Example 1: [figure only].

  7. Linear Regression Examples. Example 2: We find E[X] = 0; E[Y] = 0; E[X²] = 1/2; E[XY] = 1/2; var[X] = E[X²] − E[X]² = 1/2; cov(X, Y) = E[XY] − E[X]E[Y] = 1/2. LR: Ŷ = E[Y] + (cov(X, Y)/var[X])(X − E[X]) = X.

  8. Linear Regression Examples. Example 3: We find E[X] = 0; E[Y] = 0; E[X²] = 1/2; E[XY] = −1/2; var[X] = E[X²] − E[X]² = 1/2; cov(X, Y) = E[XY] − E[X]E[Y] = −1/2. LR: Ŷ = E[Y] + (cov(X, Y)/var[X])(X − E[X]) = −X.

  9. Linear Regression Examples. Example 4: We find E[X] = 3; E[Y] = 2.5; E[X²] = (3/15)(1 + 2² + 3² + 4² + 5²) = 11; E[XY] = (1/15)(1×1 + 1×2 + ··· + 5×4) = 8.4; var[X] = 11 − 9 = 2; cov(X, Y) = 8.4 − 3×2.5 = 0.9. LR: Ŷ = 2.5 + (0.9/2)(X − 3) = 1.15 + 0.45X.
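
For reference, a few lines reproducing the arithmetic of Example 4 from the moments stated above:

```python
# Example 4 arithmetic, starting from the moments given on the slide.
EX, EY, EX2, EXY = 3.0, 2.5, 11.0, 8.4
varX = EX2 - EX ** 2           # 11 - 9 = 2
cov = EXY - EX * EY            # 8.4 - 7.5 = 0.9
slope = cov / varX             # 0.45
intercept = EY - slope * EX    # 2.5 - 1.35 = 1.15
print(intercept, slope)        # 1.15 0.45, i.e. Yhat = 1.15 + 0.45 X
```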

  10. LR: Another Figure. Note that ◮ the LR line goes through (E[X], E[Y]) ◮ its slope is cov(X, Y)/var(X).

  11. Summary: Linear Regression. 1. Linear Regression: L[Y|X] = E[Y] + (cov(X, Y)/var(X))(X − E[X]). 2. Non-Bayesian: minimize ∑_n (Y_n − a − bX_n)². 3. Bayesian: minimize E[(Y − a − bX)²].
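
As a companion to item 2, here is a minimal sketch of the non-Bayesian fit, where expectations are replaced by sample averages; the data points are made up for illustration.

```python
# Non-Bayesian linear regression: minimize sum_n (Y_n - a - b X_n)^2
# by using sample averages in place of expectations.
xs = [1, 1, 2, 3, 4, 4, 5]                     # made-up data
ys = [1.2, 0.8, 2.1, 2.9, 3.8, 4.3, 5.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
var_x = sum((x - mx) ** 2 for x in xs) / n
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

b = cov_xy / var_x             # slope = cov(X, Y) / var(X)
a = my - b * mx                # the line goes through (mean of X, mean of Y)
print(a, b)
```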

  12. CS70: Nonlinear Regression. 1. Review: joint distribution, LLSE 2. Quadratic Regression 3. Definition of Conditional Expectation 4. Properties of CE 5. Applications: Diluting, Mixing, Rumors 6. CE = MMSE

  13. Review Definitions. Let X and Y be RVs on Ω. ◮ Joint Distribution: Pr[X = x, Y = y] ◮ Marginal Distribution: Pr[X = x] = ∑_y Pr[X = x, Y = y] ◮ Conditional Distribution: Pr[Y = y | X = x] = Pr[X = x, Y = y]/Pr[X = x] ◮ LLSE: L[Y|X] = a + bX where a, b minimize E[(Y − a − bX)²]. We saw that L[Y|X] = E[Y] + (cov(X, Y)/var[X])(X − E[X]). Recall the non-Bayesian and Bayesian viewpoints.

  14. Nonlinear Regression: Motivation. There are many situations where a good guess about Y given X is not linear. E.g., (diameter of object, weight), (school years, income), (PSA level, cancer risk). Our goal: explore estimates Ŷ = g(X) for nonlinear functions g(·).

  15. Quadratic Regression. Let X, Y be two random variables defined on the same probability space. Definition: The quadratic regression of Y over X is the random variable Q[Y|X] = a + bX + cX², where a, b, c are chosen to minimize E[(Y − a − bX − cX²)²]. Derivation: We set to zero the derivatives with respect to a, b, c and get 0 = E[Y − a − bX − cX²], 0 = E[(Y − a − bX − cX²)X], 0 = E[(Y − a − bX − cX²)X²]. We solve these three equations in the three unknowns (a, b, c). Note: These equations imply that E[(Y − Q[Y|X])h(X)] = 0 for any h(X) = d + eX + fX². That is, the estimation error is orthogonal to all the quadratic functions of X. Hence, Q[Y|X] is the projection of Y onto the space of quadratic functions of X.
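
A sketch of this derivation in code: it assembles the three normal equations as a 3×3 linear system with entries E[X^(j+k)] and right-hand side E[X^j Y], solves for (a, b, c), and checks the projection property. The joint pmf is an arbitrary illustrative choice, and numpy is assumed to be available.

```python
import numpy as np

# Made-up joint pmf of (X, Y); X takes at least three distinct values,
# so the moment matrix below is invertible.
pmf = {(-1, 1.2): 0.25, (0, 0.1): 0.25, (1, 0.9): 0.25, (2, 4.1): 0.25}
E = lambda f: sum(p * f(x, y) for (x, y), p in pmf.items())

# Normal equations: 0 = E[(Y - a - bX - cX^2) X^j] for j = 0, 1, 2,
# i.e.  M @ [a, b, c] = v  with  M[j][k] = E[X^(j+k)] and v[j] = E[X^j Y].
M = np.array([[E(lambda x, y: x ** (j + k)) for k in range(3)] for j in range(3)])
v = np.array([E(lambda x, y: x ** j * y) for j in range(3)])
a, b, c = np.linalg.solve(M, v)                # Q[Y|X] = a + b X + c X^2

# Projection property: the error is orthogonal to 1, X, X^2.
err = lambda x, y: y - (a + b * x + c * x * x)
print([E(lambda x, y, k=k: err(x, y) * x ** k) for k in range(3)])   # ~ [0, 0, 0]
```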

  16. Conditional Expectation. Definition: Let X and Y be RVs on Ω. The conditional expectation of Y given X is defined as E[Y|X] = g(X), where g(x) := E[Y | X = x] := ∑_y y·Pr[Y = y | X = x]. Fact: E[Y | X = x] = ∑_ω Y(ω) Pr[ω | X = x]. Proof: E[Y | X = x] = E[Y | A] with A = {ω : X(ω) = x}.
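
A direct implementation of this definition on a small, made-up joint pmf: compute g(x) = E[Y | X = x] for each x; the random variable E[Y|X] is then g(X).

```python
from collections import defaultdict

pmf = {(0, 0): 0.1, (0, 1): 0.3, (1, 1): 0.2, (1, 2): 0.4}   # made-up joint pmf

marginal_x = defaultdict(float)
for (x, y), p in pmf.items():
    marginal_x[x] += p                          # Pr[X = x]

def g(x):
    """g(x) = E[Y | X = x] = sum_y y * Pr[Y = y | X = x]."""
    return sum(y * p / marginal_x[x] for (x_, y), p in pmf.items() if x_ == x)

print({x: g(x) for x in marginal_x})            # the function defining E[Y|X]
```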

  17. Deja vu, all over again? Have we seen this before? Yes. Is anything new? Yes: the idea of defining g(x) = E[Y | X = x] and then E[Y|X] = g(X). Big deal? Quite! Simple but most convenient. Recall that L[Y|X] = a + bX is a function of X. This is similar: E[Y|X] = g(X) for some function g(·). In general, g(X) is not linear, i.e., not a + bX. It could be that g(X) = a + bX + cX². Or that g(X) = 2sin(4X) + exp{−3X}. Or something else.

  18. Properties of CE. E[Y | X = x] = ∑_y y·Pr[Y = y | X = x]. Theorem: (a) X, Y independent ⇒ E[Y|X] = E[Y]; (b) E[aY + bZ | X] = aE[Y|X] + bE[Z|X]; (c) E[Y h(X) | X] = h(X) E[Y|X], ∀h(·); (d) E[h(X) E[Y|X]] = E[h(X) Y], ∀h(·); (e) E[E[Y|X]] = E[Y]. Proof: (a), (b): obvious. (c): E[Y h(X) | X = x] = ∑_ω Y(ω) h(X(ω)) Pr[ω | X = x] = ∑_ω Y(ω) h(x) Pr[ω | X = x] = h(x) E[Y | X = x].

  19. Properties of CE. E[Y | X = x] = ∑_y y·Pr[Y = y | X = x]. Theorem: (a) X, Y independent ⇒ E[Y|X] = E[Y]; (b) E[aY + bZ | X] = aE[Y|X] + bE[Z|X]; (c) E[Y h(X) | X] = h(X) E[Y|X], ∀h(·); (d) E[h(X) E[Y|X]] = E[h(X) Y], ∀h(·); (e) E[E[Y|X]] = E[Y]. Proof (continued): (d): E[h(X) E[Y|X]] = ∑_x h(x) E[Y | X = x] Pr[X = x] = ∑_x h(x) ∑_y y·Pr[Y = y | X = x] Pr[X = x] = ∑_x h(x) ∑_y y·Pr[X = x, Y = y] = ∑_{x,y} h(x) y·Pr[X = x, Y = y] = E[h(X) Y].

  20. Properties of CE. E[Y | X = x] = ∑_y y·Pr[Y = y | X = x]. Theorem: (a) X, Y independent ⇒ E[Y|X] = E[Y]; (b) E[aY + bZ | X] = aE[Y|X] + bE[Z|X]; (c) E[Y h(X) | X] = h(X) E[Y|X], ∀h(·); (d) E[h(X) E[Y|X]] = E[h(X) Y], ∀h(·); (e) E[E[Y|X]] = E[Y]. Proof (continued): (e): Let h(X) = 1 in (d).
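
A quick numerical check of properties (c), (d), and (e) on a small joint pmf; the pmf and the function h are arbitrary choices for illustration.

```python
pmf = {(0, 0): 0.1, (0, 1): 0.3, (1, 1): 0.2, (1, 2): 0.4}   # made-up joint pmf
E = lambda f: sum(p * f(x, y) for (x, y), p in pmf.items())
px = {x: sum(p for (x_, _), p in pmf.items() if x_ == x) for x, _ in pmf}
ce = {x: sum(y * p / px[x] for (x_, y), p in pmf.items() if x_ == x) for x in px}

h = lambda x: 3 * x + 1                         # an arbitrary function of X

# (c) E[Y h(X) | X = x] = h(x) E[Y | X = x]
for x in px:
    lhs = sum(y * h(x) * p / px[x] for (x_, y), p in pmf.items() if x_ == x)
    assert abs(lhs - h(x) * ce[x]) < 1e-9

# (d) E[h(X) E[Y|X]] = E[h(X) Y]    and    (e) E[E[Y|X]] = E[Y]  (take h = 1)
assert abs(E(lambda x, y: h(x) * ce[x]) - E(lambda x, y: h(x) * y)) < 1e-9
assert abs(E(lambda x, y: ce[x]) - E(lambda x, y: y)) < 1e-9
```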

  21. Properties of CE. Theorem: (a) X, Y independent ⇒ E[Y|X] = E[Y]; (b) E[aY + bZ | X] = aE[Y|X] + bE[Z|X]; (c) E[Y h(X) | X] = h(X) E[Y|X], ∀h(·); (d) E[h(X) E[Y|X]] = E[h(X) Y], ∀h(·); (e) E[E[Y|X]] = E[Y]. Note that (d) says that E[(Y − E[Y|X]) h(X)] = 0. We say that the estimation error Y − E[Y|X] is orthogonal to every function h(X) of X. We call this the projection property. More about this later.

  22. Application: Calculating E[Y|X]. Let X, Y, Z be i.i.d. with mean 0 and variance 1. We want to calculate E[2 + 5X + 7XY + 11X² + 13X³Z² | X]. We find E[2 + 5X + 7XY + 11X² + 13X³Z² | X] = 2 + 5X + 7X·E[Y|X] + 11X² + 13X³·E[Z²|X] = 2 + 5X + 7X·E[Y] + 11X² + 13X³·E[Z²] = 2 + 5X + 11X² + 13X³(var[Z] + E[Z]²) = 2 + 5X + 11X² + 13X³.
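
To double-check this simplification on a concrete case (an assumption, not from the slide), take X, Y, Z i.i.d. uniform on {−1, +1}, which indeed have mean 0 and variance 1; E[· | X = x] can then be computed exactly by averaging over Y and Z.

```python
import itertools

vals = [-1, 1]                 # uniform on {-1, +1}: mean 0, variance 1
for x in vals:
    # E[2 + 5X + 7XY + 11X^2 + 13X^3 Z^2 | X = x]: average over (Y, Z).
    lhs = sum(2 + 5 * x + 7 * x * y + 11 * x ** 2 + 13 * x ** 3 * z ** 2
              for y, z in itertools.product(vals, repeat=2)) / 4
    rhs = 2 + 5 * x + 11 * x ** 2 + 13 * x ** 3
    assert lhs == rhs          # matches 2 + 5X + 11X^2 + 13X^3
```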

  23. Application: Diluting. At each step, pick a ball from a well-mixed urn and replace it with a blue ball. Let X_n be the number of red balls in the urn at step n. What is E[X_n]? Given X_n = m, X_{n+1} = m − 1 w.p. m/N (if you pick a red ball) and X_{n+1} = m otherwise. Hence, E[X_{n+1} | X_n = m] = m − m/N = m(N − 1)/N, i.e., E[X_{n+1} | X_n] = ρX_n with ρ := (N − 1)/N. Consequently, E[X_{n+1}] = E[E[X_{n+1} | X_n]] = ρE[X_n], n ≥ 1, so E[X_n] = ρ^{n−1}E[X_1] = N((N − 1)/N)^{n−1}, n ≥ 1, since the urn starts with X_1 = N red balls.
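
A simulation sketch of the diluting process (assuming the urn starts with X_1 = N red balls, as in the formula above), comparing the simulated average of X_n with N((N − 1)/N)^(n−1); the values of N, the number of steps, and the number of trials are arbitrary.

```python
import random

N, steps, trials = 20, 30, 5000
totals = [0.0] * steps                       # totals[n-1] accumulates X_n
for _ in range(trials):
    red = N                                  # X_1 = N red balls
    for n in range(steps):
        totals[n] += red
        if random.random() < red / N:        # a red ball is picked w.p. red/N
            red -= 1                         # ...and replaced by a blue one

rho = (N - 1) / N
for n in (1, 10, 30):
    print(n, totals[n - 1] / trials, N * rho ** (n - 1))   # simulation vs formula
```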

  24. Diluting. Here is a plot of E[X_n] = N((N − 1)/N)^{n−1} versus n: [figure].
