Viva La Correlación!

• Say X and Y are arbitrary random variables
• Correlation of X and Y, denoted ρ(X, Y):

    ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

• Note: -1 ≤ ρ(X, Y) ≤ 1
• Correlation measures linearity between X and Y
  - ρ(X, Y) = 1   ⇔  Y = aX + b where a = σ_y / σ_x
  - ρ(X, Y) = -1  ⇔  Y = aX + b where a = -σ_y / σ_x
  - ρ(X, Y) = 0   ⇔  absence of linear relationship
    o But X and Y can still be related in some other way!
  - If ρ(X, Y) = 0, we say X and Y are "uncorrelated"
    o Note: Independence implies uncorrelated, but not vice versa!

Fun with Indicator Variables

• Let I_A and I_B be indicators for events A and B:
  - I_A = 1 if A occurs, 0 otherwise
  - I_B = 1 if B occurs, 0 otherwise
• E[I_A] = P(A),  E[I_B] = P(B),  E[I_A I_B] = P(AB)
• Cov(I_A, I_B) = E[I_A I_B] - E[I_A] E[I_B]
                = P(AB) - P(A)P(B)
                = P(A | B)P(B) - P(A)P(B)
                = P(B)[P(A | B) - P(A)]
• So the sign of Cov(I_A, I_B) is determined by P(A | B) - P(A):
  - P(A | B) > P(A)  ⇒  ρ(I_A, I_B) > 0
  - P(A | B) = P(A)  ⇒  ρ(I_A, I_B) = 0  (and Cov(I_A, I_B) = 0)
  - P(A | B) < P(A)  ⇒  ρ(I_A, I_B) < 0

Can't Get Enough of that Multinomial

• Multinomial distribution
  - n independent trials of an experiment are performed
  - Each trial results in one of m outcomes, with respective probabilities p_1, p_2, ..., p_m, where Σ_i p_i = 1
  - X_i = number of trials with outcome i

    P(X_1 = c_1, X_2 = c_2, ..., X_m = c_m) = (n choose c_1, c_2, ..., c_m) p_1^c_1 p_2^c_2 ··· p_m^c_m,  where Σ_i c_i = n

• E.g., rolling a 6-sided die multiple times and counting how many of each value {1, 2, 3, 4, 5, 6} we get
• Would expect that the X_i are negatively correlated
  - Let's see... when i ≠ j, what is Cov(X_i, X_j)?

Covariance and the Multinomial

• Computing Cov(X_i, X_j) for i ≠ j
  - Indicator I_i(k) = 1 if trial k has outcome i, 0 otherwise
  - E[I_i(k)] = p_i,   X_i = Σ_{k=1}^n I_i(k),   X_j = Σ_{k=1}^n I_j(k)

    Cov(X_i, X_j) = Σ_{a=1}^n Σ_{b=1}^n Cov(I_i(a), I_j(b))

  - When a ≠ b, trials a and b are independent:  Cov(I_i(a), I_j(b)) = 0
  - When a = b:  Cov(I_i(a), I_j(a)) = E[I_i(a) I_j(a)] - E[I_i(a)] E[I_j(a)]
  - Since trial a cannot have both outcome i and outcome j:  E[I_i(a) I_j(a)] = 0

    Cov(X_i, X_j) = Σ_{a=1}^n Cov(I_i(a), I_j(a)) = Σ_{a=1}^n (-E[I_i(a)] E[I_j(a)]) = Σ_{a=1}^n (-p_i p_j) = -n p_i p_j

• X_i and X_j are negatively correlated, as expected

Multinomials All Around

• Multinomial distributions:
  - Count of strings hashed across buckets in a hash table
  - Number of server requests across machines in a cluster
  - Distribution of words/tokens in an email
  - Etc.
• When m (# outcomes) is large, p_i is small
  - For equally likely outcomes: p_i = 1/m

    Cov(X_i, X_j) = -n p_i p_j = -n / m^2

  - Large m ⇒ X_i and X_j are very mildly negatively correlated
  - Poisson paradigm still applicable

Conditional Expectation

• X and Y are jointly discrete random variables
  - Recall the conditional PMF of X given Y = y:

    p_{X|Y}(x | y) = P(X = x | Y = y) = p_{X,Y}(x, y) / p_Y(y)

• Define the conditional expectation of X given Y = y:

    E[X | Y = y] = Σ_x x P(X = x | Y = y) = Σ_x x p_{X|Y}(x | y)

• Analogously, for jointly continuous random variables:

    f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)
    E[X | Y = y] = ∫ x f_{X|Y}(x | y) dx
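Before moving on, a quick simulation sketch (not from the slides) to sanity-check the multinomial covariance result Cov(X_i, X_j) = -n p_i p_j using the die-rolling example above. It assumes Python with numpy; the number of rolls per experiment and the number of repetitions are arbitrary illustrative choices.

    # Empirically check Cov(X_i, X_j) = -n * p_i * p_j for the multinomial.
    import numpy as np

    rng = np.random.default_rng(0)

    n = 10                      # die rolls per experiment
    m = 6                       # outcomes
    p = np.full(m, 1.0 / m)     # equally likely outcomes, p_i = 1/m

    # Each row of `samples` is one experiment's count vector (X_1, ..., X_m).
    samples = rng.multinomial(n, p, size=200_000)

    i, j = 0, 1                 # counts of outcome 1 vs. outcome 2
    empirical_cov = np.cov(samples[:, i], samples[:, j])[0, 1]
    theoretical_cov = -n * p[i] * p[j]      # = -n / m^2 here

    print(f"empirical Cov(X_i, X_j):    {empirical_cov:.4f}")
    print(f"theoretical -n * p_i * p_j: {theoretical_cov:.4f}")

The two printed values should agree to within simulation noise (about -0.278 for these parameters), and both shrink toward 0 as m grows, matching the "mildly negatively correlated" observation above.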
Rolling Dice

• Roll two 6-sided dice, D_1 and D_2
  - X = value of D_1 + D_2
  - Y = value of D_2
• What is E[X | Y = 6]?

    E[X | Y = 6] = Σ_x x P(X = x | Y = 6)
                 = (1/6)(7 + 8 + 9 + 10 + 11 + 12) = 57/6 = 9.5

• Intuitively makes sense: 6 + E[value of D_1] = 6 + 3.5

Hyper for the Hypergeometric

• X and Y are independent random variables
  - X ~ Bin(n, p),  Y ~ Bin(n, p)
• What is E[X | X + Y = m], where m ≤ n?
• Start by computing P(X = k | X + Y = m):

    P(X = k | X + Y = m) = P(X = k, X + Y = m) / P(X + Y = m)
                         = P(X = k, Y = m - k) / P(X + Y = m)
                         = P(X = k) P(Y = m - k) / P(X + Y = m)
                         = [ C(n, k) p^k (1-p)^{n-k} · C(n, m-k) p^{m-k} (1-p)^{n-(m-k)} ] / [ C(2n, m) p^m (1-p)^{2n-m} ]
                         = C(n, k) C(n, m-k) / C(2n, m)

• Hypergeometric: (X | X + Y = m) ~ HypG(m, 2n, n)
  - m = # trials, 2n = total, n = total "X" successes
• E[X | X + Y = m] = nm / 2n = m/2

Properties of Conditional Expectation

• X and Y are jointly distributed random variables

    E[g(X) | Y = y] = Σ_x g(x) p_{X|Y}(x | y)    or    ∫ g(x) f_{X|Y}(x | y) dx

• Expectation of a conditional sum:

    E[ Σ_{i=1}^n X_i | Y = y ] = Σ_{i=1}^n E[X_i | Y = y]

Expectations of Conditional Expectations

• Define g(Y) = E[X | Y]
  - g(Y) is a random variable
  - For any Y = y, g(y) = E[X | Y = y]
    o This is just a function of Y, since we sum over all values of X
• What is E[E[X | Y]] = E[g(Y)]?  (Consider the discrete case)

    E[E[X | Y]] = Σ_y E[X | Y = y] P(Y = y)
                = Σ_y [ Σ_x x P(X = x | Y = y) ] P(Y = y)
                = Σ_y Σ_x x P(X = x, Y = y)
                = Σ_x x Σ_y P(X = x, Y = y)
                = Σ_x x P(X = x) = E[X]

    (Same for the continuous case)

Analyzing Recursive Code

    int Recurse() {
        int x = randomInt(1, 3);   // equally likely values 1, 2, 3
        if (x == 1) return 3;
        else if (x == 2) return (5 + Recurse());
        else return (7 + Recurse());
    }

• Let Y = value returned by Recurse(). What is E[Y]?

    E[Y] = E[Y | X = 1] P(X = 1) + E[Y | X = 2] P(X = 2) + E[Y | X = 3] P(X = 3)
    E[Y | X = 1] = 3,   E[Y | X = 2] = 5 + E[Y],   E[Y | X = 3] = 7 + E[Y]
    E[Y] = 3(1/3) + (5 + E[Y])(1/3) + (7 + E[Y])(1/3) = (1/3)(15 + 2 E[Y])
    ⇒ E[Y] = 15

Random Number of Random Variables

• Say you have a web site: PimentoLoaf.com
  - X = number of people who visit your site per day.  X ~ N(50, 25)
  - Y_i = number of minutes spent by visitor i.  Y_i ~ Poi(8)
  - X and all the Y_i are independent
• Time spent by all visitors in a day: W = Σ_{i=1}^X Y_i.  What is E[W]?

    E[W] = E[ Σ_{i=1}^X Y_i ] = E[ E[ Σ_{i=1}^X Y_i | X ] ]
    E[ Σ_{i=1}^X Y_i | X = n ] = E[ Σ_{i=1}^n Y_i ] = n E[Y_i]
    E[ Σ_{i=1}^X Y_i | X ] = X E[Y_i]
    E[W] = E[X E[Y_i]] = E[X] E[Y_i] = 50 · 8 = 400
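A simulation sketch (not from the slides) of the two examples above; both estimates should land near the derived answers of 15 and 400. It assumes Python with numpy, interprets X ~ N(50, 25) as mean 50 and variance 25 (standard deviation 5), and rounds X to a non-negative integer visitor count; the seed and sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)

    # --- Analyzing Recursive Code: estimate E[Y] (derived value: 15) ---
    def recurse() -> int:
        x = rng.integers(1, 4)          # equally likely values 1, 2, 3
        if x == 1:
            return 3
        elif x == 2:
            return 5 + recurse()
        else:
            return 7 + recurse()

    y_samples = [recurse() for _ in range(100_000)]
    print("estimated E[Y]:", np.mean(y_samples))      # ~15

    # --- Random Number of Random Variables: estimate E[W] (derived value: 400) ---
    def total_minutes() -> float:
        x = max(0, int(round(rng.normal(50, 5))))     # visitors today
        return rng.poisson(8, size=x).sum()           # minutes over all visitors

    w_samples = [total_minutes() for _ in range(100_000)]
    print("estimated E[W]:", np.mean(w_samples))      # ~400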
Conditional Variance

• Recall the definition: Var(X) = E[(X - E[X])^2]
  - Define: Var(X | Y) = E[(X - E[X | Y])^2 | Y]
• We derived: Var(X) = E[X^2] - (E[X])^2
  - Can similarly derive: Var(X | Y) = E[X^2 | Y] - (E[X | Y])^2
• After a bit more math (in the book):

    Var(X) = E[Var(X | Y)] + Var(E[X | Y])

• Intuitively, let Y = true temperature, X = thermostat reading
  - Variance in thermostat readings depends on:
    o average variance of the thermostat at different temperatures, plus
    o variance of the average thermostat reading at different temperatures

Making Predictions

• We observe random variable X
  - Want to make a prediction about Y
  - E.g., X = stock price at 9am, Y = stock price at 10am
• Let g(X) be the function we use to predict Y, i.e., Ŷ = g(X)
  - Choose g(X) to minimize E[(Y - g(X))^2]
• Best predictor: g(X) = E[Y | X]
  - Intuitively: E[(Y - c)^2] is minimized when c = E[Y]
    o Now you observe X, and Y depends on X, so use c = E[Y | X]
  - You just took your first baby steps into Machine Learning
    o We'll go into this more rigorously in a few weeks

Speaking of Babies...

• Say my height is X inches (x = 71)
  - [Slide photos: my son, and someone he does not look like]
• Say, historically, sons grow to heights Y ~ N(X + 1, 4), where X is the height of the father
  - Y = (X + 1) + C, where C ~ N(0, 4)
• What should I predict for the eventual height of my son?

    E[Y | X = 71] = E[X + 1 + C | X = 71] = E[72 + C] = E[72] + E[C] = 72 + 0 = 72 inches
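A small numerical sketch (not from the slides) of the prediction claim above: for the height model Y = (X + 1) + C with C ~ N(0, 4), the predictor g(X) = E[Y | X] = X + 1 attains the smallest mean squared error. It assumes Python with numpy; the distribution used to generate fathers' heights is an arbitrary illustrative choice, and N(0, 4) is read as variance 4 (standard deviation 2).

    import numpy as np

    rng = np.random.default_rng(2)

    n = 200_000
    x = rng.normal(70, 3, size=n)        # fathers' heights (illustrative)
    c = rng.normal(0, 2, size=n)         # C ~ N(0, 4): variance 4, std dev 2
    y = (x + 1) + c                      # sons' heights

    predictors = {
        "E[Y | X] = X + 1": x + 1,
        "X (no shift)":     x,
        "E[Y] (constant)":  np.full(n, y.mean()),
    }
    for name, pred in predictors.items():
        mse = np.mean((y - pred) ** 2)
        print(f"{name:20s} mean squared error = {mse:.3f}")

The conditional-expectation predictor's MSE comes out near 4 (the variance of C); every other choice of g(X) does worse, which is exactly what "best predictor" means here.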