Scalable Machine Learning 12. Tail Bounds and Averages - Geoff Gordon and Alex Smola (CMU)


  1. Scalable Machine Learning 12. Tail bounds and Averages Geoff Gordon and Alex Smola CMU http://alex.smola.org/teaching/10-701x

  2. Estimating Probabilities

  3. Binomial Distribution
     • Two outcomes (head, tail), encoded as (1, 0)
     • Data likelihood p(X; π) = π^{n_1} (1 − π)^{n_0}
     • Maximum likelihood estimation is a constrained optimization problem: π ∈ [0, 1]
     • Incorporate the constraint via p(x; θ) = e^{xθ} / (1 + e^θ)
     • Taking derivatives yields θ = log(n_1 / n_0), hence p(x = 1) = n_1 / (n_0 + n_1)
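
A minimal numerical sketch of this estimate (the sample and variable names are illustrative, not from the slides):

```python
import math

# Hypothetical coin-flip sample: 1 = head, 0 = tail
X = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n1 = sum(X)                # number of ones
n0 = len(X) - n1           # number of zeros

theta = math.log(n1 / n0)                          # MLE of the natural parameter
p_one = math.exp(theta) / (1 + math.exp(theta))    # map back to a probability

print(theta, p_one, n1 / (n0 + n1))  # p_one coincides with the empirical frequency
```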

  4. ... in detail ...
     p(X; θ) = ∏_{i=1}^n p(x_i; θ) = ∏_{i=1}^n e^{θ x_i} / (1 + e^θ)
     ⇒ log p(X; θ) = θ ∑_{i=1}^n x_i − n log(1 + e^θ)
     ⇒ ∂_θ log p(X; θ) = ∑_{i=1}^n x_i − n e^θ / (1 + e^θ)
     ⇒ (1/n) ∑_{i=1}^n x_i = e^θ / (1 + e^θ) = p(x = 1)

  5. ... in detail ... (same derivation, annotated)
     Setting the derivative to zero gives (1/n) ∑_{i=1}^n x_i = e^θ / (1 + e^θ) = p(x = 1):
     the maximum likelihood estimate equals the empirical probability of x = 1.

  6. Discrete Distribution
     • n outcomes (e.g. USA, Canada, India, UK, NZ)
     • Data likelihood p(X; π) = ∏_i π_i^{n_i}
     • Maximum likelihood estimation: a constrained optimization problem ... or ...
     • Incorporate the constraint via p(x; θ) = exp(θ_x) / ∑_{x'} exp(θ_{x'})
     • Taking derivatives yields θ_i = log(n_i / ∑_j n_j), hence p(x = i) = n_i / ∑_j n_j
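
The same computation for the discrete case, as a small sketch (country labels and counts are made up):

```python
from collections import Counter
import math

# Hypothetical categorical sample
X = ["USA", "Canada", "India", "USA", "UK", "India", "USA", "NZ"]
counts = Counter(X)
total = sum(counts.values())

theta = {k: math.log(n / total) for k, n in counts.items()}  # one natural parameter per outcome
Z = sum(math.exp(t) for t in theta.values())                 # softmax normalizer (equals 1 here)
p = {k: math.exp(t) / Z for k, t in theta.items()}

print(p)  # equals the empirical frequencies n_i / sum_j n_j
```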

  7. Tossing a Die [figure: empirical frequencies of the six faces after 12, 24, 60, and 120 tosses]

  8. Tossing a Die [figure: the same experiment shown again, after 12, 24, 60, and 120 tosses]

  9. Key Questions
     • Do empirical averages converge?
       • Probabilities
       • Means / moments
     • Rate of convergence and limit distribution
     • Worst case guarantees
     • Using prior knowledge
     Applications: drug testing, semiconductor fabs, computational advertising, user interface design, ...

  10. Tail Bounds: Chebyshev, Chernoff, Hoeffding

  11. Expectations
      • Random variable x with probability measure p
      • Expected value of f(x): E[f(x)] = ∫ f(x) dp(x)
      • Special case, discrete probability mass: Pr{x = c} = E[1{x = c}] = ∫ 1{x = c} dp(x) (the same trick works for intervals)
      • Draw x_i identically and independently from p
      • Empirical average: E_emp[f(x)] = (1/n) ∑_{i=1}^n f(x_i) and Pr_emp{x = c} = (1/n) ∑_{i=1}^n 1{x_i = c}
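
A quick illustration of these empirical estimates (the distribution, the function f, and the sample size are arbitrary):

```python
import random

random.seed(0)
n = 10_000
xs = [random.randint(1, 6) for _ in range(n)]    # i.i.d. draws from a fair die

f = lambda x: x * x
emp_mean_f = sum(f(x) for x in xs) / n           # empirical E[f(x)], true value 91/6 ≈ 15.17
emp_prob_6 = sum(x == 6 for x in xs) / n         # empirical Pr{x = 6}, true value 1/6

print(emp_mean_f, emp_prob_6)
```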

  12. Deviations
      • A gambler rolls a die 100 times; the empirical frequency is P̂(X = 6) = (1/n) ∑_{i=1}^n 1{x_i = 6}
      • '6' only occurs 11 times; the fair number is 16.7. IS THE DIE TAINTED?
      • Probability of seeing '6' at most 11 times:
        Pr(X ≤ 11) = ∑_{i=0}^{11} p(i) = ∑_{i=0}^{11} C(100, i) (1/6)^i (5/6)^{100−i} ≈ 7.0%
      • It's probably OK ... can we develop a general theory?

  13. Deviations (continued)
      • The same question arises when testing whether an ad campaign is working, a new page layout is better, or a drug is working
      • Probability of seeing '6' at most 11 times: Pr(X ≤ 11) = ∑_{i=0}^{11} C(100, i) (1/6)^i (5/6)^{100−i} ≈ 7.0%
      • It's probably OK ... can we develop a general theory?
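
A sketch of this tail computation (scipy is used only as a cross-check; the result matches the slide's 7.0% up to rounding):

```python
from math import comb
from scipy.stats import binom

# Probability of at most 11 sixes in 100 rolls of a fair die
p_tail = sum(comb(100, i) * (1/6)**i * (5/6)**(100 - i) for i in range(12))
print(p_tail)                     # ≈ 0.070
print(binom.cdf(11, 100, 1/6))    # same value via scipy.stats
```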

  14. Empirical average for a die [figure: running empirical average over 10 to 1000 rolls, y-axis from 1 to 6]. How quickly does it converge?

  15. Law of Large Numbers
      • Random variables x_i with mean μ = E[x_i]
      • Empirical average μ̂_n := (1/n) ∑_{i=1}^n x_i
      • Weak Law of Large Numbers: lim_{n→∞} Pr(|μ̂_n − μ| ≤ ε) = 1 for any ε > 0 (convergence in probability)
      • Strong Law of Large Numbers: Pr(lim_{n→∞} μ̂_n = μ) = 1
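
A small simulation of the running average for a fair die (seed and checkpoints chosen arbitrarily):

```python
import random

random.seed(1)
mu = 3.5                       # true mean of a fair die
total = 0.0
for n in range(1, 1001):
    total += random.randint(1, 6)
    if n in (10, 100, 1000):
        print(n, total / n)    # the running average drifts toward 3.5 as n grows
```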

  16. Empirical average for a die [figure: 5 sample traces of the running average over 10 to 1000 rolls]
      • Upper and lower bounds are μ ± sqrt(Var(x)/n)
      • This is an example of the central limit theorem

  17. Central Limit Theorem
      • Independent random variables x_i with mean μ_i and standard deviation σ_i
      • The random variable z_n := [∑_{i=1}^n σ_i²]^{−1/2} ∑_{i=1}^n (x_i − μ_i) converges to a normal distribution N(0, 1)
      • Special case - i.i.d. random variables and their average: (√n / σ) [(1/n) ∑_{i=1}^n x_i − μ] → N(0, 1)
      • Convergence is O(n^{−1/2})

  18. Central Limit Theorem (same statement as the previous slide)
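
A quick simulation check of the i.i.d. special case (die rolls; the sample sizes are arbitrary):

```python
import math
import random
import statistics

random.seed(2)
mu, sigma = 3.5, math.sqrt(35 / 12)      # mean and standard deviation of a fair die
n, trials = 100, 5000

zs = []
for _ in range(trials):
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    zs.append(math.sqrt(n) * (xbar - mu) / sigma)   # standardized average

# If the CLT applies, z is approximately N(0, 1)
print(statistics.mean(zs), statistics.stdev(zs))    # close to 0 and 1
print(sum(abs(z) <= 1.96 for z in zs) / trials)     # close to 0.95
```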

  19. Slutsky's Theorem
      • Continuous mapping theorem
      • X_i and Y_i are sequences of random variables
      • X_i has the random variable X as its limit
      • Y_i has the constant c as its limit
      • g(x, y) is continuous at (x, c) for all x
      • Then g(X_i, Y_i) converges in distribution to g(X, c)

  20. Delta Method
      • Random variable X_n convergent to b: a_n^{−2} (X_n − b) → N(0, Σ) with a_n² → 0 for n → ∞
      • g is a continuously differentiable function at b
      • Then g(X_n) inherits the convergence properties: a_n^{−2} (g(X_n) − g(b)) → N(0, [∇_x g(b)] Σ [∇_x g(b)]^T)
      • Proof: use a Taylor expansion for g(X_n) − g(b): a_n^{−2} [g(X_n) − g(b)] = [∇_x g(ξ_n)]^T a_n^{−2} (X_n − b)
      • ξ_n lies on the line segment [X_n, b]
      • By Slutsky's theorem ∇_x g(ξ_n) converges to ∇_x g(b)
      • Hence g(X_n) is asymptotically normal
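
A simulation sketch of the scalar i.i.d. case with g(x) = x² (a standard textbook example, not taken from the slides): if √n (X̄_n − μ) → N(0, σ²), the delta method predicts √n (g(X̄_n) − g(μ)) → N(0, g′(μ)² σ²).

```python
import math
import random
import statistics

random.seed(3)
mu, sigma2 = 3.5, 35 / 12          # mean and variance of a fair die
g = lambda x: x * x                # g(x) = x^2, so g'(mu) = 2 * mu

n, trials = 200, 5000
vals = []
for _ in range(trials):
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    vals.append(math.sqrt(n) * (g(xbar) - g(mu)))

print(statistics.variance(vals))   # empirical variance of the scaled statistic
print((2 * mu) ** 2 * sigma2)      # delta-method prediction: g'(mu)^2 * sigma^2 ≈ 142.9
```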

  21. Tools for the proof

  22. Fourier Transform
      • Fourier transform relations
        F[f](ω) := (2π)^{−d/2} ∫_{ℝ^d} f(x) exp(−i⟨ω, x⟩) dx
        F^{−1}[g](x) := (2π)^{−d/2} ∫_{ℝ^d} g(ω) exp(i⟨ω, x⟩) dω
      • Useful identities
        Identity: F^{−1} ∘ F = F ∘ F^{−1} = Id
        Derivative: F[∂_x f] = −iω F[f]
        Convolution: F[f ∗ g] = (2π)^{d/2} F[f] · F[g] (also holds for the inverse transform)

  23. The Characteristic Function Method
      • Characteristic function φ_X(ω) := F^{−1}[p(x)] = ∫ exp(i⟨ω, x⟩) dp(x)
      • For X and Y independent we have
        Joint distribution of the sum is a convolution: p_{X+Y}(z) = ∫ p_X(z − y) p_Y(y) dy = (p_X ∗ p_Y)(z)
        Characteristic function is a product: φ_{X+Y}(ω) = φ_X(ω) · φ_Y(ω)
      • Proof - plug in the definition of the Fourier transform
      • The characteristic function is unique
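
A tiny numerical check of the product property for two independent fair dice (the frequency grid is arbitrary):

```python
import numpy as np

faces = np.arange(1, 7)

# Characteristic function of one fair die: phi(w) = E[exp(i w X)]
phi = lambda w: np.mean(np.exp(1j * w * faces))

# Exact distribution of the sum of two independent dice, and its characteristic function
sums, counts = np.unique(faces[:, None] + faces[None, :], return_counts=True)
probs = counts / counts.sum()
phi_sum = lambda w: np.sum(probs * np.exp(1j * w * sums))

for w in np.linspace(-2.0, 2.0, 9):
    assert np.isclose(phi_sum(w), phi(w) ** 2)   # phi_{X+Y}(w) = phi_X(w) * phi_Y(w)
print("product property verified on the grid")
```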

  24. Proof - Weak law of large numbers
      • Require that the expectation exists
      • Taylor expansion of the exponential: exp(iωx) = 1 + i⟨ω, x⟩ + o(|ω|), hence φ_X(ω) = 1 + iω E_X[x] + o(|ω|)
        (we need to assume that we can bound the tail)
      • The average of m random variables corresponds to a convolution:
        φ_{μ̂_m}(ω) = (1 + i ω μ / m + o(m^{−1} |ω|))^m   (higher order terms vanish)
      • The limit is the characteristic function of a constant distribution:
        φ_{μ̂_m}(ω) → exp(iωμ) = 1 + iωμ + ...   (the mean)

  25. Warning
      • Moments may not always exist
      • Cauchy distribution p(x) = (1/π) · 1/(1 + x²)
      • For the mean to exist, the following integral would have to converge:
        ∫ |x| dp(x) ≥ (2/π) ∫_1^∞ x/(1 + x²) dx ≥ (1/π) ∫_1^∞ (1/x) dx = ∞
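
A simulation sketch of this failure mode (seed and checkpoints are arbitrary): the running mean of Cauchy samples keeps jumping instead of settling.

```python
import math
import random

random.seed(4)

def cauchy():
    # Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))
    return math.tan(math.pi * (random.random() - 0.5))

total = 0.0
for n in range(1, 100_001):
    total += cauchy()
    if n in (100, 1_000, 10_000, 100_000):
        print(n, total / n)   # does not stabilize, unlike the fair-die averages above
```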

  26. Proof - Central limit theorem
      • Require that the second order moment exists (we assume they are all identical w.l.o.g.)
      • Characteristic function: exp(iωx) = 1 + iωx − (1/2) ω² x² + o(|ω|²), hence
        φ_X(ω) = 1 + iω E_X[x] − (1/2) ω² var_X[x] + o(|ω|²)
      • Subtract out the mean (centering): z_m := [∑_{i=1}^m σ_i²]^{−1/2} ∑_{i=1}^m (x_i − μ_i)
      • φ_{z_m}(ω) = (1 − ω²/(2m) + o(m^{−1} |ω|²))^m → exp(−ω²/2) for m → ∞
        This is the Fourier transform of a normal distribution

  27. Central Limit Theorem in Practice [figure: densities of averages across five panels, unscaled (x from −5 to 5) and rescaled (x from −1 to 1)]

  28. Finite sample tail bounds

  29. Simple tail bounds
      • Gauss-Markov inequality: for a nonnegative random variable X with mean μ, Pr(X ≥ ε) ≤ μ/ε
        Proof - decompose the expectation:
        Pr(X ≥ ε) = ∫_ε^∞ dp(x) ≤ ∫_ε^∞ (x/ε) dp(x) ≤ ε^{−1} ∫_0^∞ x dp(x) = μ/ε
      • Chebyshev inequality: for a random variable X with mean μ and variance σ²,
        Pr(|μ̂_m − μ| > ε) ≤ σ² m^{−1} ε^{−2}, or equivalently ε ≤ σ / sqrt(mδ) with confidence 1 − δ
        Proof - apply Gauss-Markov to Y = (μ̂_m − μ)², which has mean σ²/m, with threshold ε².
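
A rough numerical comparison of the Chebyshev bound with the true deviation probability for the mean of m fair die rolls (m, ε, and the trial count are arbitrary):

```python
import random

random.seed(5)
mu, sigma2 = 3.5, 35 / 12
m, eps, trials = 100, 0.3, 20_000

hits = 0
for _ in range(trials):
    mean_m = sum(random.randint(1, 6) for _ in range(m)) / m
    hits += abs(mean_m - mu) > eps

print("empirical Pr(|mean - mu| > eps):", hits / trials)               # well below the bound
print("Chebyshev bound sigma^2/(m eps^2):", sigma2 / (m * eps ** 2))   # about 0.32, valid but loose
```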

  30. Scaling behavior
      • Gauss-Markov: ε ≤ μ/δ - scales properly in μ but is expensive in δ
      • Chebyshev: ε ≤ σ / sqrt(mδ) - proper scaling in σ but still bad in δ
      • Can we get logarithmic scaling in δ?

  31. Chernoff bound
      • KL-divergence variant of the Chernoff bound: K(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
      • For n independent tosses of a coin with bias p:
        Pr(∑_i x_i ≥ nq) ≤ exp(−n K(q, p)) ≤ exp(−2n (p − q)²)   (the second step is Pinsker's inequality)
      • Proof - w.l.o.g. q > p; for any k ≥ qn,
        Pr{∑_i x_i = k | q} / Pr{∑_i x_i = k | p} = q^k (1 − q)^{n−k} / (p^k (1 − p)^{n−k}) ≥ q^{qn} (1 − q)^{n−qn} / (p^{qn} (1 − p)^{n−qn}) = exp(n K(q, p))
      • Hence Pr{∑_i x_i ≥ nq | p} = ∑_{k ≥ nq} Pr{∑_i x_i = k | p} ≤ exp(−n K(q, p)) ∑_{k ≥ nq} Pr{∑_i x_i = k | q} ≤ exp(−n K(q, p))
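
A sketch comparing the exact tail with these two bounds for a concrete coin (n, p, and q are chosen only for illustration):

```python
import math
from scipy.stats import binom

def kl(q, p):
    # Bernoulli KL divergence K(q, p)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

n, p, q = 100, 0.5, 0.65
k = 65                                       # threshold n * q

exact = binom.sf(k - 1, n, p)                # Pr(sum >= n*q) exactly
chernoff = math.exp(-n * kl(q, p))           # KL / Chernoff bound
pinsker = math.exp(-2 * n * (p - q) ** 2)    # weaker bound via Pinsker's inequality

print(exact, chernoff, pinsker)              # exact <= chernoff <= pinsker
```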
