Scalable Machine Learning 12. Tail Bounds and Averages - Geoff Gordon and Alex Smola (CMU)


  1. Scalable Machine Learning 12. Tail bounds and Averages Geoff Gordon and Alex Smola CMU http://alex.smola.org/teaching/10-701x

  2. Estimating Probabilities

  3. Binomial Distribution
     • Two outcomes (head, tail), encoded as (1, 0)
     • Data likelihood p(X; π) = π^{n_1} (1 − π)^{n_0}
     • Maximum likelihood estimation is a constrained optimization problem: π ∈ [0, 1]
     • Incorporate the constraint via p(x; θ) = e^{xθ} / (1 + e^θ)
     • Taking derivatives yields θ = log(n_1 / n_0), hence p(x = 1) = n_1 / (n_0 + n_1)
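
A minimal numerical sketch of this estimate (the sample and variable names are illustrative, not from the slides):

```python
import math

# Hypothetical coin-flip sample: 1 = head, 0 = tail
X = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n1 = sum(X)                # number of ones
n0 = len(X) - n1           # number of zeros

theta = math.log(n1 / n0)                          # MLE of the natural parameter
p_one = math.exp(theta) / (1 + math.exp(theta))    # map back to a probability

print(theta, p_one, n1 / (n0 + n1))  # p_one coincides with the empirical frequency
```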

  4. ... in detail ...
     p(X; θ) = ∏_{i=1}^n p(x_i; θ) = ∏_{i=1}^n e^{θ x_i} / (1 + e^θ)
     ⇒ log p(X; θ) = θ ∑_{i=1}^n x_i − n log(1 + e^θ)
     ⇒ ∂_θ log p(X; θ) = ∑_{i=1}^n x_i − n e^θ / (1 + e^θ)
     ⇒ (1/n) ∑_{i=1}^n x_i = e^θ / (1 + e^θ) = p(x = 1)

  5. ... in detail ... (same derivation, annotated)
     Setting the derivative to zero gives (1/n) ∑_{i=1}^n x_i = e^θ / (1 + e^θ) = p(x = 1):
     the maximum likelihood estimate equals the empirical probability of x = 1.

  6. Discrete Distribution
     • n outcomes (e.g. USA, Canada, India, UK, NZ)
     • Data likelihood p(X; π) = ∏_i π_i^{n_i}
     • Maximum likelihood estimation: a constrained optimization problem ... or ...
     • Incorporate the constraint via p(x; θ) = exp(θ_x) / ∑_{x'} exp(θ_{x'})
     • Taking derivatives yields θ_i = log(n_i / ∑_j n_j), hence p(x = i) = n_i / ∑_j n_j
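
The same computation for the discrete case, as a small sketch (country labels and counts are made up):

```python
from collections import Counter
import math

# Hypothetical categorical sample
X = ["USA", "Canada", "India", "USA", "UK", "India", "USA", "NZ"]
counts = Counter(X)
total = sum(counts.values())

theta = {k: math.log(n / total) for k, n in counts.items()}  # one natural parameter per outcome
Z = sum(math.exp(t) for t in theta.values())                 # softmax normalizer (equals 1 here)
p = {k: math.exp(t) / Z for k, t in theta.items()}

print(p)  # equals the empirical frequencies n_i / sum_j n_j
```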

  7. Tossing a Die [figure: empirical frequencies of the six faces after 12, 24, 60, and 120 tosses]

  8. Tossing a Die [figure: the same experiment shown again, after 12, 24, 60, and 120 tosses]

  9. Key Questions
     • Do empirical averages converge?
       • Probabilities
       • Means / moments
     • Rate of convergence and limit distribution
     • Worst case guarantees
     • Using prior knowledge
     Applications: drug testing, semiconductor fabs, computational advertising, user interface design, ...

  10. Tail Bounds: Chebyshev, Chernoff, Hoeffding

  11. Expectations
      • Random variable x with probability measure p
      • Expected value of f(x): E[f(x)] = ∫ f(x) dp(x)
      • Special case, discrete probability mass: Pr{x = c} = E[1{x = c}] = ∫ 1{x = c} dp(x) (the same trick works for intervals)
      • Draw x_i identically and independently from p
      • Empirical average: E_emp[f(x)] = (1/n) ∑_{i=1}^n f(x_i) and Pr_emp{x = c} = (1/n) ∑_{i=1}^n 1{x_i = c}
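
A quick illustration of these empirical estimates (the distribution, the function f, and the sample size are arbitrary):

```python
import random

random.seed(0)
n = 10_000
xs = [random.randint(1, 6) for _ in range(n)]    # i.i.d. draws from a fair die

f = lambda x: x * x
emp_mean_f = sum(f(x) for x in xs) / n           # empirical E[f(x)], true value 91/6 ≈ 15.17
emp_prob_6 = sum(x == 6 for x in xs) / n         # empirical Pr{x = 6}, true value 1/6

print(emp_mean_f, emp_prob_6)
```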

  12. Deviations
      • A gambler rolls a die 100 times; the empirical frequency is P̂(X = 6) = (1/n) ∑_{i=1}^n 1{x_i = 6}
      • '6' only occurs 11 times; the fair number is 16.7. IS THE DIE TAINTED?
      • Probability of seeing '6' at most 11 times:
        Pr(X ≤ 11) = ∑_{i=0}^{11} p(i) = ∑_{i=0}^{11} C(100, i) (1/6)^i (5/6)^{100−i} ≈ 7.0%
      • It's probably OK ... can we develop a general theory?

  13. Deviations (continued)
      • The same question arises when testing whether an ad campaign is working, a new page layout is better, or a drug is working
      • Probability of seeing '6' at most 11 times: Pr(X ≤ 11) = ∑_{i=0}^{11} C(100, i) (1/6)^i (5/6)^{100−i} ≈ 7.0%
      • It's probably OK ... can we develop a general theory?
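
A sketch of this tail computation (scipy is used only as a cross-check; the result matches the slide's 7.0% up to rounding):

```python
from math import comb
from scipy.stats import binom

# Probability of at most 11 sixes in 100 rolls of a fair die
p_tail = sum(comb(100, i) * (1/6)**i * (5/6)**(100 - i) for i in range(12))
print(p_tail)                     # ≈ 0.070
print(binom.cdf(11, 100, 1/6))    # same value via scipy.stats
```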

  14. Empirical average for a die [figure: running empirical average over 10 to 1000 rolls, y-axis from 1 to 6]. How quickly does it converge?

  15. Law of Large Numbers
      • Random variables x_i with mean μ = E[x_i]
      • Empirical average μ̂_n := (1/n) ∑_{i=1}^n x_i
      • Weak Law of Large Numbers: lim_{n→∞} Pr(|μ̂_n − μ| ≤ ε) = 1 for any ε > 0 (convergence in probability)
      • Strong Law of Large Numbers: Pr(lim_{n→∞} μ̂_n = μ) = 1
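
A small simulation of the running average for a fair die (seed and checkpoints chosen arbitrarily):

```python
import random

random.seed(1)
mu = 3.5                       # true mean of a fair die
total = 0.0
for n in range(1, 1001):
    total += random.randint(1, 6)
    if n in (10, 100, 1000):
        print(n, total / n)    # the running average drifts toward 3.5 as n grows
```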

  16. Empirical average for a die [figure: 5 sample traces of the running average over 10 to 1000 rolls]
      • Upper and lower bounds are μ ± sqrt(Var(x)/n)
      • This is an example of the central limit theorem

  17. Central Limit Theorem
      • Independent random variables x_i with mean μ_i and standard deviation σ_i
      • The random variable z_n := [∑_{i=1}^n σ_i²]^{−1/2} ∑_{i=1}^n (x_i − μ_i) converges to a normal distribution N(0, 1)
      • Special case - i.i.d. random variables and their average: (√n / σ) [(1/n) ∑_{i=1}^n x_i − μ] → N(0, 1)
      • Convergence is O(n^{−1/2})

  18. Central Limit Theorem (same statement as the previous slide)
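
A quick simulation check of the i.i.d. special case (die rolls; the sample sizes are arbitrary):

```python
import math
import random
import statistics

random.seed(2)
mu, sigma = 3.5, math.sqrt(35 / 12)      # mean and standard deviation of a fair die
n, trials = 100, 5000

zs = []
for _ in range(trials):
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    zs.append(math.sqrt(n) * (xbar - mu) / sigma)   # standardized average

# If the CLT applies, z is approximately N(0, 1)
print(statistics.mean(zs), statistics.stdev(zs))    # close to 0 and 1
print(sum(abs(z) <= 1.96 for z in zs) / trials)     # close to 0.95
```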

  19. Slutsky's Theorem
      • Continuous mapping theorem
      • X_i and Y_i are sequences of random variables
      • X_i has the random variable X as its limit
      • Y_i has the constant c as its limit
      • g(x, y) is continuous at (x, c) for all x
      • Then g(X_i, Y_i) converges in distribution to g(X, c)

  20. Delta Method
      • Random variable X_n convergent to b: a_n^{−2} (X_n − b) → N(0, Σ) with a_n² → 0 for n → ∞
      • g is a continuously differentiable function at b
      • Then g(X_n) inherits the convergence properties: a_n^{−2} (g(X_n) − g(b)) → N(0, [∇_x g(b)] Σ [∇_x g(b)]^T)
      • Proof: use a Taylor expansion for g(X_n) − g(b): a_n^{−2} [g(X_n) − g(b)] = [∇_x g(ξ_n)]^T a_n^{−2} (X_n − b)
      • ξ_n lies on the line segment [X_n, b]
      • By Slutsky's theorem ∇_x g(ξ_n) converges to ∇_x g(b)
      • Hence g(X_n) is asymptotically normal
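
A simulation sketch of the scalar i.i.d. case with g(x) = x² (a standard textbook example, not taken from the slides): if √n (X̄_n − μ) → N(0, σ²), the delta method predicts √n (g(X̄_n) − g(μ)) → N(0, g′(μ)² σ²).

```python
import math
import random
import statistics

random.seed(3)
mu, sigma2 = 3.5, 35 / 12          # mean and variance of a fair die
g = lambda x: x * x                # g(x) = x^2, so g'(mu) = 2 * mu

n, trials = 200, 5000
vals = []
for _ in range(trials):
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    vals.append(math.sqrt(n) * (g(xbar) - g(mu)))

print(statistics.variance(vals))   # empirical variance of the scaled statistic
print((2 * mu) ** 2 * sigma2)      # delta-method prediction: g'(mu)^2 * sigma^2 ≈ 142.9
```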

  21. Tools for the proof

  22. Fourier Transform
      • Fourier transform relations
        F[f](ω) := (2π)^{−d/2} ∫_{ℝ^d} f(x) exp(−i⟨ω, x⟩) dx
        F^{−1}[g](x) := (2π)^{−d/2} ∫_{ℝ^d} g(ω) exp(i⟨ω, x⟩) dω
      • Useful identities
        Identity: F^{−1} ∘ F = F ∘ F^{−1} = Id
        Derivative: F[∂_x f] = −iω F[f]
        Convolution: F[f ∗ g] = (2π)^{d/2} F[f] · F[g] (also holds for the inverse transform)

  23. The Characteristic Function Method
      • Characteristic function φ_X(ω) := F^{−1}[p(x)] = ∫ exp(i⟨ω, x⟩) dp(x)
      • For X and Y independent we have
        Joint distribution of the sum is a convolution: p_{X+Y}(z) = ∫ p_X(z − y) p_Y(y) dy = (p_X ∗ p_Y)(z)
        Characteristic function is a product: φ_{X+Y}(ω) = φ_X(ω) · φ_Y(ω)
      • Proof - plug in the definition of the Fourier transform
      • The characteristic function is unique
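
A tiny numerical check of the product property for two independent fair dice (the frequency grid is arbitrary):

```python
import numpy as np

faces = np.arange(1, 7)

# Characteristic function of one fair die: phi(w) = E[exp(i w X)]
phi = lambda w: np.mean(np.exp(1j * w * faces))

# Exact distribution of the sum of two independent dice, and its characteristic function
sums, counts = np.unique(faces[:, None] + faces[None, :], return_counts=True)
probs = counts / counts.sum()
phi_sum = lambda w: np.sum(probs * np.exp(1j * w * sums))

for w in np.linspace(-2.0, 2.0, 9):
    assert np.isclose(phi_sum(w), phi(w) ** 2)   # phi_{X+Y}(w) = phi_X(w) * phi_Y(w)
print("product property verified on the grid")
```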

  24. Proof - Weak law of large numbers
      • Require that the expectation exists
      • Taylor expansion of the exponential: exp(iωx) = 1 + i⟨ω, x⟩ + o(|ω|), hence φ_X(ω) = 1 + iω E_X[x] + o(|ω|)
        (we need to assume that we can bound the tail)
      • The average of m random variables corresponds to a convolution:
        φ_{μ̂_m}(ω) = (1 + i ω μ / m + o(m^{−1} |ω|))^m   (higher order terms vanish)
      • The limit is the characteristic function of a constant distribution:
        φ_{μ̂_m}(ω) → exp(iωμ) = 1 + iωμ + ...   (the mean)

  25. Warning
      • Moments may not always exist
      • Cauchy distribution p(x) = (1/π) · 1/(1 + x²)
      • For the mean to exist, the following integral would have to converge:
        ∫ |x| dp(x) ≥ (2/π) ∫_1^∞ x/(1 + x²) dx ≥ (1/π) ∫_1^∞ (1/x) dx = ∞
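
A simulation sketch of this failure mode (seed and checkpoints are arbitrary): the running mean of Cauchy samples keeps jumping instead of settling.

```python
import math
import random

random.seed(4)

def cauchy():
    # Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))
    return math.tan(math.pi * (random.random() - 0.5))

total = 0.0
for n in range(1, 100_001):
    total += cauchy()
    if n in (100, 1_000, 10_000, 100_000):
        print(n, total / n)   # does not stabilize, unlike the fair-die averages above
```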

  26. Proof - Central limit theorem
      • Require that the second order moment exists (we assume they are all identical w.l.o.g.)
      • Characteristic function: exp(iωx) = 1 + iωx − (1/2) ω² x² + o(|ω|²), hence
        φ_X(ω) = 1 + iω E_X[x] − (1/2) ω² var_X[x] + o(|ω|²)
      • Subtract out the mean (centering): z_m := [∑_{i=1}^m σ_i²]^{−1/2} ∑_{i=1}^m (x_i − μ_i)
      • φ_{z_m}(ω) = (1 − ω²/(2m) + o(m^{−1} |ω|²))^m → exp(−ω²/2) for m → ∞
        This is the Fourier transform of a normal distribution

  27. Central Limit Theorem in Practice [figure: densities of averages across five panels, unscaled (x from −5 to 5) and rescaled (x from −1 to 1)]

  28. Finite sample tail bounds

  29. Simple tail bounds
      • Gauss-Markov inequality: for a nonnegative random variable X with mean μ, Pr(X ≥ ε) ≤ μ/ε
        Proof - decompose the expectation:
        Pr(X ≥ ε) = ∫_ε^∞ dp(x) ≤ ∫_ε^∞ (x/ε) dp(x) ≤ ε^{−1} ∫_0^∞ x dp(x) = μ/ε
      • Chebyshev inequality: for a random variable X with mean μ and variance σ²,
        Pr(|μ̂_m − μ| > ε) ≤ σ² m^{−1} ε^{−2}, or equivalently ε ≤ σ / sqrt(mδ) with confidence 1 − δ
        Proof - apply Gauss-Markov to Y = (μ̂_m − μ)², which has mean σ²/m, with threshold ε².
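
A rough numerical comparison of the Chebyshev bound with the true deviation probability for the mean of m fair die rolls (m, ε, and the trial count are arbitrary):

```python
import random

random.seed(5)
mu, sigma2 = 3.5, 35 / 12
m, eps, trials = 100, 0.3, 20_000

hits = 0
for _ in range(trials):
    mean_m = sum(random.randint(1, 6) for _ in range(m)) / m
    hits += abs(mean_m - mu) > eps

print("empirical Pr(|mean - mu| > eps):", hits / trials)               # well below the bound
print("Chebyshev bound sigma^2/(m eps^2):", sigma2 / (m * eps ** 2))   # about 0.32, valid but loose
```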

  30. Scaling behavior
      • Gauss-Markov: ε ≤ μ/δ - scales properly in μ but is expensive in δ
      • Chebyshev: ε ≤ σ / sqrt(mδ) - proper scaling in σ but still bad in δ
      • Can we get logarithmic scaling in δ?

  31. Chernoff bound
      • KL-divergence variant of the Chernoff bound: K(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
      • For n independent tosses of a coin with bias p:
        Pr(∑_i x_i ≥ nq) ≤ exp(−n K(q, p)) ≤ exp(−2n (p − q)²)   (the second step is Pinsker's inequality)
      • Proof - w.l.o.g. q > p; for any k ≥ qn,
        Pr{∑_i x_i = k | q} / Pr{∑_i x_i = k | p} = q^k (1 − q)^{n−k} / (p^k (1 − p)^{n−k}) ≥ q^{qn} (1 − q)^{n−qn} / (p^{qn} (1 − p)^{n−qn}) = exp(n K(q, p))
      • Hence Pr{∑_i x_i ≥ nq | p} = ∑_{k ≥ nq} Pr{∑_i x_i = k | p} ≤ exp(−n K(q, p)) ∑_{k ≥ nq} Pr{∑_i x_i = k | q} ≤ exp(−n K(q, p))
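
A sketch comparing the exact tail with these two bounds for a concrete coin (n, p, and q are chosen only for illustration):

```python
import math
from scipy.stats import binom

def kl(q, p):
    # Bernoulli KL divergence K(q, p)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

n, p, q = 100, 0.5, 0.65
k = 65                                       # threshold n * q

exact = binom.sf(k - 1, n, p)                # Pr(sum >= n*q) exactly
chernoff = math.exp(-n * kl(q, p))           # KL / Chernoff bound
pinsker = math.exp(-2 * n * (p - q) ** 2)    # weaker bound via Pinsker's inequality

print(exact, chernoff, pinsker)              # exact <= chernoff <= pinsker
```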
