1. Viva La Correlación!

• Say X and Y are arbitrary random variables
• Correlation of X and Y, denoted ρ(X, Y):

    ρ(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y))

• Note: −1 ≤ ρ(X, Y) ≤ 1
• Correlation measures linearity between X and Y
  - ρ(X, Y) = 1  ⇒  Y = aX + b where a = σ_y / σ_x
  - ρ(X, Y) = −1  ⇒  Y = aX + b where a = −σ_y / σ_x
  - ρ(X, Y) = 0  ⇒  absence of linear relationship
    o But X and Y can still be related in some other way!
  - If ρ(X, Y) = 0, we say X and Y are "uncorrelated"
    o Note: independence implies uncorrelated, but not vice versa!

Fun with Indicator Variables

• Let I_A and I_B be indicators for events A and B:
  I_A = 1 if A occurs, 0 otherwise;  I_B = 1 if B occurs, 0 otherwise
• E[I_A] = P(A),  E[I_B] = P(B),  E[I_A I_B] = P(AB)
• Cov(I_A, I_B) = E[I_A I_B] − E[I_A] E[I_B]
                = P(AB) − P(A)P(B)
                = P(A | B)P(B) − P(A)P(B)
                = P(B)[P(A | B) − P(A)]
• So Cov(I_A, I_B) is determined by P(A | B) − P(A):
  - P(A | B) > P(A)  ⇒  ρ(I_A, I_B) > 0
  - P(A | B) = P(A)  ⇒  ρ(I_A, I_B) = 0 (and Cov(I_A, I_B) = 0)
  - P(A | B) < P(A)  ⇒  ρ(I_A, I_B) < 0

Can't Get Enough of that Multinomial

• Multinomial distribution
  - n independent trials of an experiment are performed
  - Each trial results in one of m outcomes, with respective probabilities p_1, p_2, …, p_m, where Σ_{i=1}^m p_i = 1
  - X_i = number of trials with outcome i

    P(X_1 = c_1, X_2 = c_2, …, X_m = c_m) = [n! / (c_1! c_2! ··· c_m!)] · p_1^{c_1} p_2^{c_2} ··· p_m^{c_m}

  - E.g., rolling a 6-sided die multiple times and counting how many of each value {1, 2, 3, 4, 5, 6} we get
  - We would expect the X_i to be negatively correlated
  - Let's see... when i ≠ j, what is Cov(X_i, X_j)?

Covariance and the Multinomial

• Computing Cov(X_i, X_j)
  - Indicator I_i(k) = 1 if trial k has outcome i, 0 otherwise
  - E[I_i(k)] = p_i,  X_i = Σ_{k=1}^n I_i(k),  X_j = Σ_{k=1}^n I_j(k)
  - Cov(X_i, X_j) = Σ_{a=1}^n Σ_{b=1}^n Cov(I_i(b), I_j(a))
  - When a ≠ b, trials a and b are independent: Cov(I_i(b), I_j(a)) = 0
  - When a = b: Cov(I_i(a), I_j(a)) = E[I_i(a) I_j(a)] − E[I_i(a)] E[I_j(a)]
  - Since trial a cannot have both outcome i and outcome j: E[I_i(a) I_j(a)] = 0
  - So:

    Cov(X_i, X_j) = Σ_{a=1}^n (−E[I_i(a)] E[I_j(a)]) = Σ_{a=1}^n (−p_i p_j) = −n p_i p_j

  - X_i and X_j are negatively correlated (a simulation check appears after this slide group)

Multinomials All Around

• Multinomial distributions:
  - Count of strings hashed across buckets in a hash table
  - Number of server requests across machines in a cluster
  - Distribution of words/tokens in an email
  - Etc.
• When m (# outcomes) is large, p_i is small
  - For equally likely outcomes: p_i = 1/m
  - Cov(X_i, X_j) = −n p_i p_j = −n / m²
  - Large m  ⇒  X_i and X_j very mildly negatively correlated
  - Poisson paradigm still applicable

Conditional Expectation

• X and Y are jointly discrete random variables
  - Recall the conditional PMF of X given Y = y:

    p_{X|Y}(x | y) = P(X = x | Y = y) = p_{X,Y}(x, y) / p_Y(y)

• Define the conditional expectation of X given Y = y:

    E[X | Y = y] = Σ_x x · P(X = x | Y = y) = Σ_x x · p_{X|Y}(x | y)

• Analogously, for jointly continuous random variables:

    f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)    and    E[X | Y = y] = ∫ x · f_{X|Y}(x | y) dx

  (a small worked example in code appears below)
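The covariance result Cov(X_i, X_j) = −n·p_i·p_j above is easy to sanity-check numerically. Below is a minimal Python sketch (not part of the original slides): it rolls a fair 6-sided die n = 10 times per experiment, records X_1 and X_2 (the counts of 1s and 2s), and compares the empirical covariance to the theoretical value −10·(1/6)·(1/6) ≈ −0.278.

    import random
    from statistics import mean

    def count_ones_and_twos(n=10):
        # Roll a fair 6-sided die n times; return (X_1, X_2) = (# of 1s, # of 2s)
        rolls = [random.randint(1, 6) for _ in range(n)]
        return rolls.count(1), rolls.count(2)

    samples = [count_ones_and_twos() for _ in range(200_000)]
    x1 = [a for a, _ in samples]
    x2 = [b for _, b in samples]

    emp_cov = mean(a * b for a, b in zip(x1, x2)) - mean(x1) * mean(x2)
    print(emp_cov)               # empirical Cov(X_1, X_2)
    print(-10 * (1 / 6) ** 2)    # theoretical -n * p_i * p_j, about -0.278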

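To make the conditional-expectation definition above concrete, here is a small Python sketch. The joint PMF table p_XY is made up purely for illustration (it is not from the slides); the function just implements E[X | Y = y] = Σ_x x · p_{X|Y}(x | y).

    # Made-up joint PMF p_{X,Y}(x, y) over x in {1, 2, 3} and y in {0, 1}; entries sum to 1
    p_XY = {
        (1, 0): 0.10, (2, 0): 0.20, (3, 0): 0.10,
        (1, 1): 0.05, (2, 1): 0.15, (3, 1): 0.40,
    }

    def cond_expectation_of_X_given(y):
        p_Y = sum(p for (x, yy), p in p_XY.items() if yy == y)            # marginal p_Y(y)
        return sum(x * p / p_Y for (x, yy), p in p_XY.items() if yy == y)

    print(cond_expectation_of_X_given(0))   # (1*0.10 + 2*0.20 + 3*0.10) / 0.40 = 2.0
    print(cond_expectation_of_X_given(1))   # (1*0.05 + 2*0.15 + 3*0.40) / 0.60 ≈ 2.58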
2. Rolling Dice

• Roll two 6-sided dice D_1 and D_2
  - X = value of D_1 + D_2,  Y = value of D_2
  - What is E[X | Y = 6]?

    E[X | Y = 6] = Σ_x x · P(X = x | Y = 6) = (7 + 8 + 9 + 10 + 11 + 12) · (1/6) = 57/6 = 9.5

  - Intuitively makes sense: 6 + E[value of D_1] = 6 + 3.5

Hyper for the Hypergeometric

• X and Y are independent random variables
  - X ~ Bin(n, p) and Y ~ Bin(n, p)
  - What is E[X | X + Y = m], where m ≤ n?
  - Start by computing P(X = k | X + Y = m):

    P(X = k | X + Y = m) = P(X = k, X + Y = m) / P(X + Y = m)
                         = P(X = k, Y = m − k) / P(X + Y = m)
                         = P(X = k) · P(Y = m − k) / P(X + Y = m)
                         = [C(n, k) p^k (1−p)^{n−k}] · [C(n, m−k) p^{m−k} (1−p)^{n−(m−k)}] / [C(2n, m) p^m (1−p)^{2n−m}]
                         = C(n, k) · C(n, m−k) / C(2n, m)

  - Hypergeometric: (X | X + Y = m) ~ HypG(m, 2n, n), where m = # trials, 2n = total trials, n = total "X" successes
  - E[X | X + Y = m] = nm / (2n) = m/2

Properties of Conditional Expectation

• X and Y are jointly distributed random variables

    E[g(X) | Y = y] = Σ_x g(x) · p_{X|Y}(x | y)    or    ∫ g(x) · f_{X|Y}(x | y) dx

• Expectation of a conditional sum:

    E[Σ_{i=1}^n X_i | Y = y] = Σ_{i=1}^n E[X_i | Y = y]

Expectations of Conditional Expectations

• Define g(Y) = E[X | Y]
  - g(Y) is a random variable
  - For any Y = y, g(y) = E[X | Y = y]
    o This is just a function of Y, since we sum over all values of X
  - What is E[E[X | Y]] = E[g(Y)]? (Consider the discrete case)

    E[E[X | Y]] = Σ_y E[X | Y = y] · P(Y = y)
                = Σ_y [Σ_x x · P(X = x | Y = y)] · P(Y = y)
                = Σ_y Σ_x x · P(X = x, Y = y)
                = Σ_x x · Σ_y P(X = x, Y = y)
                = Σ_x x · P(X = x) = E[X]        (same for the continuous case)

Analyzing Recursive Code

    int Recurse()
    {
        int x = randomInt(1, 3);   // Equally likely values 1, 2, 3
        if (x == 1)       return 3;
        else if (x == 2)  return (5 + Recurse());
        else              return (7 + Recurse());
    }

• Let Y = value returned by Recurse(). What is E[Y]?
  - Let X = value of randomInt(1, 3) in the first call, and condition on X:

    E[Y] = E[Y | X = 1]·P(X = 1) + E[Y | X = 2]·P(X = 2) + E[Y | X = 3]·P(X = 3)
    E[Y | X = 1] = 3,   E[Y | X = 2] = 5 + E[Y],   E[Y | X = 3] = 7 + E[Y]
    E[Y] = 3·(1/3) + (5 + E[Y])·(1/3) + (7 + E[Y])·(1/3) = (1/3)(15 + 2·E[Y])

  - Solving gives E[Y] = 15 (a simulation check appears after this slide group)

Random Number of Random Variables

• Say you have a web site: PimentoLoaf.com
  - X = number of people who visit your site per day,  X ~ N(50, 25)
  - Y_i = number of minutes spent by visitor i,  Y_i ~ Poi(8)
  - X and all the Y_i are independent
  - Time spent by all visitors in a day: W = Σ_{i=1}^X Y_i.  What is E[W]?

    E[Σ_{i=1}^X Y_i | X = n] = E[Σ_{i=1}^n Y_i] = n·E[Y_i]   so   E[Σ_{i=1}^X Y_i | X] = X·E[Y_i]

    E[W] = E[Σ_{i=1}^X Y_i] = E[ E[Σ_{i=1}^X Y_i | X] ] = E[X·E[Y_i]] = E[X]·E[Y_i] = 50 · 8 = 400
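As a quick check on the E[Y] = 15 result, the Recurse() pseudocode above can be translated into Python and simulated; this is only a sketch, with randomInt(1, 3) modeled by random.randint(1, 3).

    import random
    from statistics import mean

    def recurse():
        x = random.randint(1, 3)       # equally likely values 1, 2, 3
        if x == 1:
            return 3
        elif x == 2:
            return 5 + recurse()
        else:
            return 7 + recurse()

    print(mean(recurse() for _ in range(200_000)))   # should be close to E[Y] = 15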

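The PimentoLoaf.com answer E[W] = E[X]·E[Y_i] = 400 can also be checked by simulation. One assumption in this sketch: X ~ N(50, 25) is continuous, so the sampled value is rounded to the nearest integer (and clamped at 0) before being used as a visitor count; with σ = 5 that has negligible effect on the mean.

    import numpy as np

    rng = np.random.default_rng(0)

    def total_minutes_one_day():
        x = int(max(0.0, np.rint(rng.normal(50, 5))))   # visitors; variance 25 means sigma = 5
        return rng.poisson(8, size=x).sum()             # each visitor's minutes ~ Poi(8)

    days = [total_minutes_one_day() for _ in range(20_000)]
    print(np.mean(days))                                # should be close to E[W] = 400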
3. Conditional Variance

• Recall the definition: Var(X) = E[(X − E[X])²]
  - Define: Var(X | Y) = E[(X − E[X | Y])² | Y]
• Derived: Var(X) = E[X²] − (E[X])²
  - Can similarly derive: Var(X | Y) = E[X² | Y] − (E[X | Y])²
• After a bit more math (in the book):

    Var(X) = E[Var(X | Y)] + Var(E[X | Y])

  - Intuitively, let Y = true temperature and X = thermostat reading
  - The variance in thermostat readings depends on:
    o the average variance in the thermostat reading at different temperatures, plus
    o the variance in the average thermostat reading at different temperatures
  - (a numerical check of this decomposition appears after this slide group)

Making Predictions

• We observe random variable X
  - Want to make a prediction about Y
  - E.g., X = stock price at 9am, Y = stock price at 10am
  - Let g(X) be the function we use to predict Y, i.e., Ŷ = g(X)
  - Choose g(X) to minimize E[(Y − g(X))²]
  - Best predictor: g(X) = E[Y | X]
  - Intuitively: E[(Y − c)²] is minimized when c = E[Y]
    o Now you observe X, and Y depends on X, so use c = E[Y | X]
  - You just took your first baby steps into Machine Learning
    o We'll go into this more rigorously in a few weeks

Speaking of Babies...

• Say my height is X inches (x = 71)
  (The slide shows a photo of my son next to someone he does not look like.)
  - Say, historically, sons grow to heights Y ~ N(X + 1, 4), where X is the height of the father
    o Y = (X + 1) + C, where C ~ N(0, 4)
  - What should I predict for the eventual height of my son?

    E[Y | X = 71] = E[X + 1 + C | X = 71] = E[72 + C] = E[72] + E[C] = 72 + 0 = 72 inches
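The thermostat intuition for Var(X) = E[Var(X | Y)] + Var(E[X | Y]) can be checked numerically. The distributions in this sketch are assumptions chosen only for illustration: the true temperature Y is uniform on [60, 80], and the thermostat reads X = Y + noise with noise ~ N(0, 4).

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.uniform(60, 80, size=500_000)     # true temperatures Y
    x = y + rng.normal(0, 2, size=y.size)     # thermostat readings X; noise variance 2**2 = 4

    # E[Var(X | Y)] = 4 (same noise variance at every temperature)
    # Var(E[X | Y]) = Var(Y) = (80 - 60)**2 / 12 = 33.33...
    print(x.var())                            # empirical Var(X)
    print(4 + (80 - 60) ** 2 / 12)            # E[Var(X|Y)] + Var(E[X|Y]) = 37.33...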

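Finally, a small sketch of the height-prediction example: with Y given X = 71 distributed as 72 + C, C ~ N(0, 4), the mean squared error E[(Y − c)²] is smallest at the conditional mean c = 72, which is the point of choosing g(X) = E[Y | X] as the predictor.

    import numpy as np

    rng = np.random.default_rng(2)
    y = 72 + rng.normal(0, 2, size=500_000)   # Y given X = 71, i.e. 72 + C with Var(C) = 4

    for c in (70.0, 72.0, 74.0):
        print(c, np.mean((y - c) ** 2))       # MSE is smallest at c = 72 (about 4.0)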