CS70: Jean Walrand: Lecture 32. Chernoff, Jensen, Polling, Confidence Intervals, Linear Regression


  1. CS70: Jean Walrand: Lecture 32. Chernoff, Jensen, Polling, Confidence Intervals, Linear Regression
     1. About M3
     2. Chernoff
     3. Jensen
     4. Polling
     5. Confidence Intervals
     6. Linear Regression

  2. About M3 Not easy. Definitely! Should I worry? Why? Me worry? Probability takes a while to get used to. The math looks trivial, but what the ... are we computing? Be patient! Patience has its own rewards.

  3. Some Mysteries
     ◮ A probability space is a set Ω with probabilities assigned to elements ...
     ◮ It is uniform if all the elements have the same probability ...
     ◮ Let Ω = {1, 2, 3, 4} be a uniform probability space ...
     ◮ Say what! Never heard of that before!!!
     ◮ A random variable is a function X : Ω → ℜ ...
     ◮ Define two random variables on the uniform probability space Ω = {1, 2, 3, 4} so that ...
     ◮ Let me try: "If you first get an odd number, then X = 2; if you then get an even number, then Y = −3 ...."
     ◮ What happened?
     ◮ Gee!!, these are conceptual questions! Not like the homework!! Nor the M3 review!!
     ◮ Meaning, "to do the homework, we did not need to understand probability spaces nor random variables." Really??

  4. Seriously Folks! You have time to get these ideas straight. If you knew it all already, you would not learn anything from this course. It is not that complicated. You will get to the bottom of this! A midterm de-briefing will take place next week. Time and place TBA on Piazza.

  5. Sample Question
     Question: On the uniform probability space Ω := {1, 2, 3, 4}, define RVs X and Y such that E[XY] = E[X]E[Y] even though X and Y are not independent.
     Recall the M3 review in lecture: E[XY] = E[X]E[Y] if X and Y are independent, but not only if. We have to define X : Ω → ℜ and Y : Ω → ℜ so that ....
     Let us try a pair with XY = 0: then E[XY] = 0 = E[X]E[Y], and X, Y are not independent. Note that X = 0 and Y = ... does not work, because then X and Y are independent.
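
     One concrete pair that works (a hypothetical choice for illustration; the slide's actual construction may differ): let X be ±1 on {1, 2} and 0 on {3, 4}, and let Y be 0 on {1, 2} and ±1 on {3, 4}. Then XY = 0 on every outcome and E[X] = E[Y] = 0, yet the variables are not independent. A quick Python check:

         from fractions import Fraction

         # Uniform probability space Omega = {1, 2, 3, 4}; each outcome has probability 1/4.
         Omega = [1, 2, 3, 4]
         p = Fraction(1, 4)

         # Hypothetical choice (not necessarily the one from lecture):
         X = {1: 1, 2: -1, 3: 0, 4: 0}
         Y = {1: 0, 2: 0, 3: 1, 4: -1}

         E_X = sum(p * X[w] for w in Omega)            # 0
         E_Y = sum(p * Y[w] for w in Omega)            # 0
         E_XY = sum(p * X[w] * Y[w] for w in Omega)    # 0, since XY = 0 on every outcome
         assert E_XY == E_X * E_Y

         # Not independent: Pr[X = 1, Y = 1] = 0, but Pr[X = 1] Pr[Y = 1] = 1/16.
         assert sum(p for w in Omega if X[w] == 1 and Y[w] == 1) == 0
         assert sum(p for w in Omega if X[w] == 1) * sum(p for w in Omega if Y[w] == 1) == Fraction(1, 16)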

  6. Herman Chernoff Herman Chernoff (born July 1, 1923, New York) is an American applied mathematician, statistician and physicist, formerly a professor at MIT and currently working at Harvard University.

  7. Chernoff Faces
     Chernoff faces, invented by Herman Chernoff, display multivariate data in the shape of a human face. The individual parts, such as eyes, ears, mouth and nose, represent values of the variables by their shape, size, placement and orientation.
     Figure: This example shows Chernoff faces for lawyers' ratings of twelve judges.

  8. Chernoff's Inequality
     Chernoff's inequality is due to Herman Rubin, continuing the tradition started with Markov's inequality (which is due to Chebyshev).
     Theorem (Chernoff's Inequality):
     Pr[X ≥ a] ≤ min_{θ > 0} E[e^{θX}] / e^{θa}.
     Proof: We use Markov's inequality with f(x) = e^{θx} for θ > 0. We find
     Pr[X ≥ a] ≤ E[f(X)] / f(a) = E[e^{θX}] / e^{θa}.
     Since the inequality holds for all θ > 0, this concludes the proof.

  9. Chernoff's Inequality and B(n, p)
     Let X = B(n, p). We want a bound on Pr[X ≥ a]. Since X = X_1 + ··· + X_n with Pr[X_m = 1] = p = 1 − Pr[X_m = 0], we have
     E[e^{θX}] = E[e^{θ(X_1 + ··· + X_n)}] = E[e^{θX_1} × ··· × e^{θX_n}] = (E[e^{θX_1}])^n = [pe^θ + (1 − p)]^n.
     Thus,
     Pr[X ≥ a] ≤ [pe^θ + (1 − p)]^n / e^{θa}.
     We minimize the RHS over θ > 0 and find (after some algebra ...)
     Pr[X ≥ a] ≤ Pr[X = a] / Pr[Y = a], where Y = B(n, a/n).
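
     To see how tight this is, here is a short script (an illustration, not part of the lecture) comparing the optimized bound with the exact tail probability; the closed form comes from setting the derivative in θ to zero, which gives e^θ = a(1 − p)/((n − a)p):

         import math

         def chernoff_binomial(n, p, a):
             # Optimized Chernoff bound on Pr[B(n, p) >= a] for n*p < a < n.
             # Equals Pr[X = a] / Pr[Y = a] with Y = B(n, a/n).
             q = a / n
             return (p / q) ** a * ((1 - p) / (1 - q)) ** (n - a)

         def exact_tail(n, p, a):
             # Exact Pr[B(n, p) >= a] by direct summation.
             return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(a, n + 1))

         n, p, a = 100, 0.5, 70
         print(chernoff_binomial(n, p, a))   # upper bound
         print(exact_tail(n, p, a))          # exact tail probability, smaller than the bound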

  10. Chernoff's Inequality and B(n,p) Here is a picture:

  11. Chernoff's Inequality and P(λ)
     Let X = P(λ). We want a bound on Pr[X ≥ a]. We have
     E[e^{θX}] = ∑_{n ≥ 0} e^{θn} (λ^n / n!) e^{−λ} = e^{−λ} ∑_{n ≥ 0} (λe^θ)^n / n! = exp{λe^θ} exp{−λ} = exp{λ(e^θ − 1)}.
     Thus,
     Pr[X ≥ a] ≤ E[e^{θX}] / e^{θa} = exp{λ(e^θ − 1) − θa}.
     We minimize over θ > 0 and find (after some algebra)
     Pr[X ≥ a] ≤ (λ/a)^a e^{a − λ} = Pr[X = a] / Pr[Y = a], where Y = P(a).
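
     A similar numerical check for the Poisson case (again illustrative, not from the slides); the minimizing θ satisfies e^θ = a/λ:

         import math

         def chernoff_poisson(lam, a):
             # Optimized Chernoff bound on Pr[P(lam) >= a] for a > lam:
             # exp(lam * (e^t - 1) - t * a) evaluated at e^t = a / lam.
             return (lam / a) ** a * math.exp(a - lam)

         def exact_tail(lam, a):
             # Exact Pr[P(lam) >= a] as 1 minus the CDF at a - 1.
             return 1 - sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(a))

         lam, a = 10, 20
         print(chernoff_poisson(lam, a))   # upper bound
         print(exact_tail(lam, a))         # exact tail probability, smaller than the bound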

  12. Chernoff’s Inequality and P ( λ ) Here is a picture:

  13. Chernoff's Inequality
     Chernoff's inequality is typically used to estimate
     Pr[(X_1 + ··· + X_n)/n ≥ a]
     where X_1, ..., X_n are independent with the same distribution, n ≫ 1, and a > E[X_1]. We expect the average (X_1 + ··· + X_n)/n to be close to the mean, so that the desired probability is small. Chernoff's inequality yields useful bounds. It works because
     E[exp{θ(X_1 + ··· + X_n)/n}] = (E[exp{θX_1/n}])^n, by independence.
     Thus, Chernoff's bound is typically used for rare events. Herman Chernoff is now 92, a rare event.

  14. Jensen's Inequality
     A function g(x) is convex if it lies above all its tangents. Consider the tangent at the point (E[X], g(E[X])) shown in the figure, and let a be its slope. We have
     g(X) ≥ g(E[X]) + a(X − E[X]).
     Taking expectations (the second term has mean zero), we conclude that
     g(·) convex ⇒ E[g(X)] ≥ g(E[X]).

  15. Jensen's Inequality: Examples
     ◮ E[|X|] ≥ |E[X]|
     ◮ E[X^4] ≥ E[X]^4
     ◮ E[e^{θX}] ≥ e^{θE[X]}
     ◮ E[ln(X)] ≤ ln(E[X]) (ln(·) is concave, so the inequality flips)
     ◮ E[max{X^2, 1 + X}] ≥ max{E[X]^2, 1 + E[X]}
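
     A tiny numerical check of the first two examples, on a made-up three-point distribution (illustrative only):

         from fractions import Fraction

         # Made-up distribution: X takes values -2, 1, 3 with the probabilities below.
         dist = {-2: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}

         def E(f):
             # Expectation of f(X) under the distribution above.
             return sum(p * f(x) for x, p in dist.items())

         EX = E(lambda x: x)                     # E[X] = 0 here
         assert E(abs) >= abs(EX)                # E[|X|] >= |E[X]|
         assert E(lambda x: x ** 4) >= EX ** 4   # E[X^4] >= E[X]^4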

  16. Polling: Problem Here is a central question about polling. Setup: Assume people vote democrat with probability p . We poll n people. Let A n be the fraction of those who vote democrat. Question: How large should n be so that Pr [ | A n − p | ≥ 0 . 1 ] ≤ 0 . 05 ?

  17. Polling: Analysis
     Recall the problem: find n so that Pr[|A_n − p| ≥ 0.1] ≤ 0.05.
     Approach: Chebyshev! Recall Chebyshev's inequality:
     Pr[|X − E[X]| ≥ a] ≤ var[X] / a^2.
     Here, X = A_n = Y/n where Y is the number of people out of n who vote democrat. Thus, Y = B(n, p). Hence, E[Y] = np and var[Y] = np(1 − p). Consequently, E[X] = p and var[X] = p(1 − p)/n. This gives
     Pr[|A_n − p| ≥ 0.1] ≤ p(1 − p) / (n(0.1)^2) = 100 p(1 − p) / n.
     However, we do not know p. What should we do? We know that p(1 − p) ≤ 1/4. Hence,
     Pr[|A_n − p| ≥ 0.1] ≤ 25/n.
     Thus, if n = 500, we find Pr[|A_n − p| ≥ 0.1] ≤ 0.05.
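
     The same calculation in code (a direct transcription of the bound above; done with exact fractions to avoid rounding):

         import math
         from fractions import Fraction

         def poll_size(error, delta):
             # Smallest n with p(1 - p) / (n * error^2) <= delta for every p,
             # using the worst case p(1 - p) <= 1/4 in Chebyshev's bound.
             return math.ceil(Fraction(1, 4) / (error ** 2 * delta))

         print(poll_size(Fraction(1, 10), Fraction(1, 20)))   # 500, matching the slide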

  18. Estimation
     Common problem: Estimate a mean value. Examples: height, weight, lifetime, arrival rate, job duration, ....
     Setup: X_1, X_2, ... are independent RVs with E[X_n] = µ and var[X_n] = σ^2. We observe {X_1, ..., X_n} and want to estimate µ.
     Approach: Let A_n = (X_1 + ··· + X_n)/n be the average (sample mean). Then,
     E[A_n] = µ and var[A_n] = σ^2 / n.
     Using Chebyshev: Pr[|A_n − µ| ≥ a] ≤ σ^2 / (na^2). Thus, Pr[|A_n − µ| ≥ a] → 0 as n → ∞. This is the WLLN, as we know.
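
     A quick simulation (illustrative, not from the slides) of the sample mean of Bernoulli(1/2) variables concentrating around µ = 1/2, shown next to the Chebyshev bound:

         import random

         random.seed(1)
         mu, var = 0.5, 0.25      # Bernoulli(1/2): mean 1/2, variance 1/4
         a, trials = 0.05, 200    # deviation threshold and number of repetitions

         for n in [100, 1000, 10000]:
             hits = 0
             for _ in range(trials):
                 A_n = sum(random.random() < 0.5 for _ in range(n)) / n
                 hits += abs(A_n - mu) >= a
             print(n, hits / trials, "Chebyshev bound:", var / (n * a ** 2))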

  19. Confidence Interval
     How much confidence in our estimate? Recall the setup: X_m independent with mean µ and variance σ^2. Chebyshev told us
     Pr[|A_n − µ| ≥ a] ≤ σ^2 / (na^2).
     This probability is at most δ if σ^2 ≤ na^2 δ, i.e., a ≥ σ/√(nδ). Thus
     Pr[|A_n − µ| ≥ σ/√(nδ)] ≤ δ.
     Equivalently,
     Pr[µ ∈ [A_n − σ/√(nδ), A_n + σ/√(nδ)]] ≥ 1 − δ.
     We say that
     [A_n − σ/√(nδ), A_n + σ/√(nδ)] is a (1 − δ)-confidence interval for µ.

  20. Confidence Interval, continued
     We just found out that
     [A_n − σ/√(nδ), A_n + σ/√(nδ)] is a (1 − δ)-confidence interval for µ.
     For δ = 0.05, we have 1/√δ ≈ 4.5, so this shows that
     [A_n − 4.5σ/√n, A_n + 4.5σ/√n] is a 95%-confidence interval for µ.
     A more refined analysis, using the Central Limit Theorem, allows us to replace 4.5 by 2.

  21. CI with Unknown Variance
     If σ is not known, we replace it by the estimate s_n:
     s_n^2 = (1/n) ∑_{m=1}^{n} (X_m − A_n)^2.
     Thus, we expect that, for n large enough (e.g., larger than 20),
     [A_n − 2 s_n/√n, A_n + 2 s_n/√n] is a 95%-confidence interval for µ.
     Does this work well? The theory says we have to be careful: the error in estimating σ may throw us off. What is known is that if the X_m have a nice distribution (e.g., Gaussian) and n is not too small (say ≥ 15), then this is fine.
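
     A minimal sketch of the resulting recipe on simulated Gaussian data (µ, σ, and n below are made up for illustration):

         import math
         import random

         random.seed(0)
         mu, sigma, n = 5.0, 2.0, 100
         samples = [random.gauss(mu, sigma) for _ in range(n)]

         A_n = sum(samples) / n                                       # sample mean
         s_n = math.sqrt(sum((x - A_n) ** 2 for x in samples) / n)    # estimate of sigma

         # Approximate 95%-confidence interval: A_n +/- 2 * s_n / sqrt(n).
         half = 2 * s_n / math.sqrt(n)
         print(f"[{A_n - half:.3f}, {A_n + half:.3f}]")               # should contain mu = 5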

  22. CI for Pr [ H ] CIs using the upper bound σ ≤ 0 . 5 or using the estimated s n .

  23. Linear Regression An example: Random experiment: select people at random and plot their (age, height). You get ( X n , Y n ) for n = 1 ,..., N where X n = age and Y n = height for person n . The linear regression is a guess a + bX n for Y n that is close to the true values, in some sense to be made precise.

  24. Linear Regression Another example:

  25. LR: Two Viewpoints
     Linear regression: a + bX_n is a guess for Y_n. There are two ways to look at linear regression: Bayesian and non-Bayesian.
     ◮ Bayesian Viewpoint:
       ◮ We have a prior: Pr[X = x, Y = y], x = ..., y = ...;
       ◮ We choose (a, b) to minimize E[(Y − a − bX)^2].
     ◮ Non-Bayesian Viewpoint:
       ◮ We have no prior, but samples: {(X_n, Y_n), n = 1, ..., N};
       ◮ We choose (a, b) to minimize ∑_{n=1}^{N} (Y_n − a − bX_n)^2;
       ◮ We hope Y_k ≈ a + bX_k for future samples.
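
     For the non-Bayesian viewpoint, the minimizing pair has the familiar closed form b = cov(X, Y)/var(X) and a = mean(Y) − b · mean(X). Here is a minimal sketch (the data is made up, echoing the earlier (age, height) example):

         def linear_regression(xs, ys):
             # Least squares: choose (a, b) minimizing the sum over n of (y_n - a - b*x_n)^2.
             # Minimizer: b = cov(X, Y) / var(X), a = mean(Y) - b * mean(X).
             N = len(xs)
             mx, my = sum(xs) / N, sum(ys) / N
             cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / N
             var = sum((x - mx) ** 2 for x in xs) / N
             b = cov / var
             return my - b * mx, b

         # Made-up (age, height) samples for illustration.
         ages = [5, 8, 10, 12, 15]
         heights = [105, 125, 138, 150, 165]
         a, b = linear_regression(ages, heights)
         print(f"height ~ {a:.1f} + {b:.1f} * age")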

  26. Summary
     Chernoff, Jensen, Polling, Confidence Intervals, Linear Regression
     ◮ Chernoff: Pr[X ≥ a] ≤ min_{θ > 0} E[e^{θ(X − a)}].
     ◮ Jensen: E[c(X)] ≥ c(E[X]) for c(·) convex.
     ◮ Polling: How many people to poll?
     ◮ Confidence Interval: Sample mean ± 2σ/√n.
     ◮ Linear Regression: Y ≈ a + bX.
     B or not B?
