10701 Machine Learning, Recitation 7: Tail Bounds and Averages. Ahmed Hefny (slides mostly by Alex Smola)
Why this stuff? • Can machine learning work? • Yes, otherwise: • No Google • No spam filters • No face detectors • No 701 midterm • I’d be living my life
Why this stuff? • Will machine learning always work? • No. [figure: two overlapping point clouds, labeled Class 1? Class 2?]
Why this stuff? • We need some theory to analyze machine learning algorithms. • We will go through the basic tools used to build that theory. • How well can we estimate quantities from data? • What is the convergence behavior of empirical averages?
Outline • Estimation Example • Convergence of Averages • Law of Large Numbers • Central Limit Theorem • Inequalities and Tail Bounds • Markov Inequality • Chebyshev’s Inequality • Hoeffding’s and McDiarmid’s Inequalities • Proof Tools • Union Bound • Fourier Analysis • Characteristic Functions
Estimating Probabilities
Discrete Distribution • $n$ outcomes (e.g. faces of a die) • Data likelihood: $p(X;\pi)=\prod_{i=1}^{m}\pi_{x_i}=\prod_{j=1}^{n}\pi_j^{m_j}$, where $m_j$ counts how often outcome $j$ occurs among $m$ draws • Maximum Likelihood Estimation • Constrained optimization problem ($\pi_j \ge 0$, $\sum_j \pi_j = 1$) ... or ... • Incorporate constraint via $\pi_j = e^{\theta_j}/\sum_k e^{\theta_k}$ • Taking derivatives yields $\pi_j = m_j / m$, the empirical frequencies
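A minimal sketch of this estimator in code: the ML estimate is just the vector of normalized counts (the function name mle_discrete and the fair-die setup are illustrative, not part of the original derivation).

```python
import numpy as np

def mle_discrete(samples, n_outcomes):
    """Maximum-likelihood estimate of a discrete distribution:
    pi_j = m_j / m, i.e. normalized empirical counts."""
    counts = np.bincount(samples, minlength=n_outcomes)
    return counts / counts.sum()

rng = np.random.default_rng(0)
samples = rng.integers(0, 6, size=120)    # 120 tosses of a fair die, faces encoded 0..5
print(mle_discrete(samples, 6))           # estimates approach 1/6 as the sample grows
```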
Tossing a Die [figure: ML estimates of the face probabilities after 12, 24, 60, and 120 tosses]
Empirical average for a die • Is it guaranteed to converge? • How quickly does it converge?
Convergence of Empirical Averages
Expectations • Random variable $x$ with probability measure $p$ • Expected value of $f(x)$: $\mathbb{E}[f(x)] = \int f(x)\,dp(x)$ • Special case - discrete probability mass: $\mathbb{E}[f(x)] = \sum_x p(x)\,f(x)$ (same trick works for intervals) • Draw $x_i$ identically and independently from $p$ • Empirical average: $\hat{\mathbb{E}}[f(x)] = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$
Law of Large Numbers • Random variables $X_i$ with mean $\mu = \mathbb{E}[X_i]$ • Empirical average $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ • Weak Law of Large Numbers: convergence in probability, $\lim_{n\to\infty} \Pr(|\hat{\mu}_n - \mu| > \varepsilon) = 0$ for every $\varepsilon > 0$ • Strong Law of Large Numbers: almost sure convergence, $\Pr(\lim_{n\to\infty} \hat{\mu}_n = \mu) = 1$
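A small simulation sketch of the law of large numbers for a die: the running average of tosses settles near the true mean 3.5 (the seed and sample size are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
tosses = rng.integers(1, 7, size=n)                  # die faces 1..6, true mean 3.5
running_avg = np.cumsum(tosses) / np.arange(1, n + 1)

# The running average drifts toward 3.5 as more tosses are included.
for k in (10, 100, 1_000, 10_000):
    print(k, running_avg[k - 1])
```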
Empirical average for a die [figure: 5 sample traces of the running average]
Central Limit Theorem • Independent random variables $X_i$ with mean $\mu_i$ and standard deviation $\sigma_i$ • The random variable $Z_n = \bigl(\sum_{i=1}^{n}\sigma_i^2\bigr)^{-1/2}\sum_{i=1}^{n}(X_i-\mu_i)$ converges in distribution to a Normal Distribution $\mathcal{N}(0,1)$ • Special case - IID random variables and their average: $\sqrt{n}\,\frac{\hat{\mu}_n - \mu}{\sigma} \to \mathcal{N}(0,1)$, i.e. the average converges at rate $O(1/\sqrt{n})$
Central Limit Theorem in Practice [figure: distribution of averages of die tosses, unscaled vs. scaled and centered]
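A minimal sketch of the same experiment in code: centering the averages and scaling by $\sqrt{n}$ produces values that look standard normal (the trial counts are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 20_000
tosses = rng.integers(1, 7, size=(trials, n))        # each row: n tosses of a fair die
mu, sigma = 3.5, np.sqrt(35 / 12)                     # mean and std of a fair die

z = np.sqrt(n) * (tosses.mean(axis=1) - mu) / sigma   # centered, scaled averages
# If the CLT holds, z is approximately N(0, 1): mean ~ 0, std ~ 1,
# and roughly 68% of the mass lies within one standard deviation.
print(z.mean(), z.std(), np.mean(np.abs(z) < 1))
```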
Tail Bounds
Simple tail bounds • Gauss-Markov inequality: for a non-negative random variable $X$ with mean $\mu$, $\Pr(X \ge \varepsilon) \le \mu/\varepsilon$ • Proof - decompose the expectation: $\mu = \mathbb{E}[X] \ge \mathbb{E}[X\,\mathbf{1}\{X \ge \varepsilon\}] \ge \varepsilon \Pr(X \ge \varepsilon)$
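A quick numerical check of the Gauss-Markov bound, using an exponential distribution purely as an example of a non-negative variable (the thresholds are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100_000)   # non-negative, mean 1

for eps in (1.0, 2.0, 5.0):
    empirical = np.mean(x >= eps)              # P(X >= eps) from the sample
    markov = x.mean() / eps                    # Markov bound mu / eps
    print(eps, empirical, markov)              # empirical probability <= bound (loosely)
```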
Simple tail bounds • Chebyshev inequality: for a random variable $X$ with mean $\mu$ and variance $\sigma^2$, $\Pr(|X - \mu| \ge \varepsilon) \le \sigma^2/\varepsilon^2$ • Proof - applying Gauss-Markov to $Y = (X - \mu)^2$ with threshold $\varepsilon^2$ yields the result. • Applied to an empirical average, this makes the estimate... Correct? Approximately Correct? Probably Approximately Correct!
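A minimal simulation sketch of this "probably approximately correct" statement for the die average: the empirical failure probability stays below the Chebyshev bound $\sigma^2/(n\varepsilon^2)$ (n, $\varepsilon$, and the trial count are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(4)
n, eps, trials = 100, 0.5, 20_000
mu, var = 3.5, 35 / 12                                # fair-die mean and variance

avgs = rng.integers(1, 7, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(avgs - mu) >= eps)         # P(|mu_hat - mu| >= eps)
chebyshev = var / (n * eps**2)                        # bound: sigma^2 / (n eps^2)
print(empirical, chebyshev)                           # failure rate stays below the bound
```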
Scaling behavior • Gauss-Markov: setting $\mu/\varepsilon = \delta$ gives $\varepsilon = \mu/\delta$ - scales properly in $\mu$ but is expensive in $\delta$ • Chebyshev: setting $\sigma^2/\varepsilon^2 = \delta$ gives $\varepsilon = \sigma/\sqrt{\delta}$ - proper scaling in $\sigma$ but still bad in $\delta$ • Can we get logarithmic scaling in $\delta$?
Chernoff bound • For a Bernoulli random variable with $\Pr(X=1)=p$ • Example: $n$ independent tosses of a biased coin with probability $p$ of heads, and empirical frequency $\hat{p}_n = \frac{1}{n}\sum_i X_i$: $\Pr(\hat{p}_n - p \ge \varepsilon) \le \exp(-2n\varepsilon^2)$
Chernoff bound • Proof: we first show that $\Pr(\hat{p}_n - p \ge \varepsilon) \le \exp\bigl(-n\,\mathrm{KL}(p+\varepsilon \,\|\, p)\bigr)$ and then apply Pinsker’s inequality $\mathrm{KL}(q \,\|\, p) \ge 2(q-p)^2$ • Where $\mathrm{KL}(q \,\|\, p) = q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}$
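A small simulation sketch of the Chernoff bound for a biased coin: the observed frequency of large one-sided deviations stays below $\exp(-2n\varepsilon^2)$ (the values of p, n, and $\varepsilon$ are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, eps, trials = 0.3, 200, 0.1, 50_000

heads = rng.random(size=(trials, n)) < p              # biased coin flips
p_hat = heads.mean(axis=1)                            # empirical frequency per trial

empirical = np.mean(p_hat - p >= eps)                 # one-sided deviation probability
chernoff = np.exp(-2 * n * eps**2)                    # exp(-2 n eps^2)
print(empirical, chernoff)                            # empirical value stays below the bound
```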
Hoeffding’s Inequality • If the $X_i$ have bounded range $c$ (each $X_i$ takes values in an interval of length $c$), then $\Pr(|\hat{\mu}_n - \mu| \ge \varepsilon) \le 2\exp\bigl(-\tfrac{2n\varepsilon^2}{c^2}\bigr)$
Hoeffding’s Inequality • Scaling behavior: to guarantee $|\hat{\mu}_n - \mu| \le \varepsilon$ with probability at least $1-\delta$ it suffices that $n \ge \frac{c^2}{2\varepsilon^2}\log\frac{2}{\delta}$, i.e. only logarithmic dependence on $1/\delta$ • This helps when we need to combine several tail bounds, since we only pay logarithmically for their combination.
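A minimal sketch comparing the sample sizes the two bounds demand for the same $\varepsilon$ and $\delta$, using the fair die (variance 35/12, range 5) as an example; the helper names are illustrative.

```python
import numpy as np

def n_chebyshev(sigma2, eps, delta):
    # sigma^2 / (n eps^2) <= delta  =>  n >= sigma^2 / (eps^2 delta)
    return sigma2 / (eps**2 * delta)

def n_hoeffding(c, eps, delta):
    # 2 exp(-2 n eps^2 / c^2) <= delta  =>  n >= c^2 log(2/delta) / (2 eps^2)
    return c**2 * np.log(2 / delta) / (2 * eps**2)

eps, sigma2, c = 0.1, 35 / 12, 5          # die: variance 35/12, range 6 - 1 = 5
for delta in (1e-1, 1e-3, 1e-6):
    print(delta, n_chebyshev(sigma2, eps, delta), n_hoeffding(c, eps, delta))
# Chebyshev's requirement grows like 1/delta; Hoeffding's only like log(1/delta).
```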
McDiarmid Inequality • Generalization of Hoeffding’s Inequality • Independent random variables $X_i$ • Function $f: \mathcal{X}^n \to \mathbb{R}$ with bounded differences: changing $X_i$ to $X_i'$ changes $f$ by at most $c_i$ • Deviation from the expected value: $\Pr\bigl(|f(X_1,\dots,X_n) - \mathbb{E}[f]| \ge \varepsilon\bigr) \le 2\exp(-2\varepsilon^2/C)$ • Here $C$ is given by $C = \sum_{i=1}^{n} c_i^2$, where $c_i$ bounds the change in $f$ when only $X_i$ changes • If $f$ is the average and the $X_i$ have bounded range $c$, this recovers Hoeffding’s Inequality (worked out below)
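A short worked special case, written out here as a sketch: with $f$ the empirical average and each $X_i$ of range $c$, changing one coordinate moves $f$ by at most $c/n$, so McDiarmid reduces to Hoeffding.

```latex
% f = empirical average, each X_i with range c  =>  bounded differences c_i = c/n
\[
  C = \sum_{i=1}^{n} c_i^{2} = n \cdot \frac{c^{2}}{n^{2}} = \frac{c^{2}}{n},
  \qquad
  \Pr\bigl(\lvert f(X) - \mathbb{E} f(X) \rvert \ge \varepsilon\bigr)
    \le 2 \exp\!\Bigl(-\frac{2\varepsilon^{2}}{C}\Bigr)
    = 2 \exp\!\Bigl(-\frac{2 n \varepsilon^{2}}{c^{2}}\Bigr).
\]
```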
More tail bounds • Higher order moments • Bernstein inequality (needs a variance bound): for $|X_i - \mathbb{E}X_i| \le C$ and variance $\sigma^2$, $\Pr(|\hat{\mu}_n - \mu| \ge \varepsilon) \le 2\exp\!\left(-\frac{n\varepsilon^2/2}{\sigma^2 + C\varepsilon/3}\right)$ • Absolute / relative error bounds • Bounds for (weakly) dependent random variables
Summary • Markov [X is non-negative] • Chebyshev [finite variance] • Hoeffding [bound on range] • Bernstein [bound on range + bound on second moment] • Tighter bounds require more assumptions
Tools for the proof
Union Bound • $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) \le \Pr(A) + \Pr(B)$ • In general: $\Pr\bigl(\bigcup_j A_j\bigr) \le \sum_j \Pr(A_j)$
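A tiny numerical illustration with three overlapping events defined from one uniform variable (the probabilities are arbitrary); the union probability never exceeds the sum of the individual probabilities.

```python
import numpy as np

rng = np.random.default_rng(6)
u = rng.random(size=100_000)

# Overlapping events A_j = {U < p_j}; the union bound says
# P(A_1 u A_2 u A_3) <= P(A_1) + P(A_2) + P(A_3), whatever the dependence.
ps = [0.1, 0.2, 0.3]
events = np.stack([u < p for p in ps], axis=1)
print(np.mean(events.any(axis=1)), sum(ps))   # ~0.3 <= 0.6
```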
Fourier Transform • Fourier transform relations (symmetric convention): $F[f](\omega) = (2\pi)^{-d/2}\int f(x)\,e^{-i\langle\omega,x\rangle}\,dx$ and $F^{-1}[g](x) = (2\pi)^{-d/2}\int g(\omega)\,e^{i\langle\omega,x\rangle}\,d\omega$ • Useful identities • Identity: $F^{-1}[F[f]] = f$ • Derivative: $F[\partial_x f](\omega) = i\omega\,F[f](\omega)$ • Convolution: $F[f * g] = (2\pi)^{d/2}\,F[f]\cdot F[g]$ (also holds for the inverse transform)
The Characteristic Function Method • Characteristic function: $\phi_X(\omega) = \mathbb{E}_X[e^{i\omega X}]$, i.e. (up to constants) the inverse Fourier transform of the density • For $X$ and $Y$ independent and $Z = X + Y$: • Joint distribution is a convolution: $p_Z = p_X * p_Y$ • Characteristic function is a product: $\phi_Z(\omega) = \phi_X(\omega)\,\phi_Y(\omega)$ • Proof - plug in the definition of the Fourier transform • The characteristic function is unique: it determines the distribution
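A minimal sketch checking the product property numerically via empirical characteristic functions of independent samples (the distributions and frequencies are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(size=200_000)
y = rng.normal(size=200_000)              # independent of x

def ecf(z, w):
    """Empirical characteristic function: average of exp(i * w * Z)."""
    return np.mean(np.exp(1j * w * z))

for w in (0.5, 1.0, 2.0):
    lhs = ecf(x + y, w)                   # characteristic function of the sum
    rhs = ecf(x, w) * ecf(y, w)           # product of the individual ones
    print(w, abs(lhs - rhs))              # small (sampling error only) for independent x, y
```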
Proof - Weak law of large numbers • Require that the expectation exists • Taylor expansion of the exponential: $\phi_X(\omega) = 1 + i\omega\mu + o(\omega)$ (need to assume that we can bound the tail) • The average of random variables corresponds to a convolution of (rescaled) densities, hence $\phi_{\hat{\mu}_n}(\omega) = \bigl[\phi_X(\omega/n)\bigr]^n = \bigl[1 + \tfrac{i\omega\mu}{n} + o(\omega/n)\bigr]^n \to e^{i\omega\mu}$ • The limit is the characteristic function of the constant distribution at the mean; the higher order terms vanish
Warning • Moments may not always exist • Cauchy distribution: $p(x) = \frac{1}{\pi(1+x^2)}$ • For the mean to exist the following integral would have to converge: $\int \frac{|x|}{\pi(1+x^2)}\,dx$, but it diverges
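A small simulation sketch of this warning: the running average of standard Cauchy samples keeps wandering instead of settling (compare with the die example above).

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
x = rng.standard_cauchy(size=n)
running_avg = np.cumsum(x) / np.arange(1, n + 1)

# The mean of a Cauchy distribution does not exist, so the law of
# large numbers does not apply and the running average never stabilizes.
for k in (100, 1_000, 10_000, 100_000):
    print(k, running_avg[k - 1])
```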
Proof - Central limit theorem • Require that second order moments exist (we assume they are all identical, WLOG) • Characteristic function of the centered, rescaled variable: $\phi_{(X-\mu)/\sigma}(\omega) = 1 - \tfrac{\omega^2}{2} + o(\omega^2)$ • Subtract out the mean (centering) and scale the average by $\sqrt{n}$: $\phi_{Z_n}(\omega) = \bigl[1 - \tfrac{\omega^2}{2n} + o(\omega^2/n)\bigr]^n \to e^{-\omega^2/2}$ • This is the FT (characteristic function) of a Normal Distribution $\mathcal{N}(0,1)$
Conclusion & what’s next? • We looked at the basic building blocks of learning theory: - Convergence of empirical averages - Tail bounds - Union bound
Conclusion & what’s next? • Evaluate a classifier C on N data points and estimate its accuracy. Can we upper-bound the estimation error? • Yes: Chernoff bound / Hoeffding’s inequality
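A minimal sketch of that bound in code: per-point correctness is a 0/1 variable, so the range is c = 1 and, with probability at least 1 - delta, the estimated accuracy is within sqrt(log(2/delta)/(2N)) of the true accuracy (the function name is illustrative).

```python
import numpy as np

def accuracy_error_bound(n_points, delta):
    """Hoeffding: with probability >= 1 - delta, the estimated accuracy of a single
    classifier evaluated on N points is within eps = sqrt(log(2/delta)/(2N)) of the truth."""
    return np.sqrt(np.log(2 / delta) / (2 * n_points))

for n in (100, 1_000, 10_000):
    print(n, accuracy_error_bound(n, delta=0.05))
```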
Conclusion & what’s next? • Evaluate a set of classifiers on N data points and pick the one with the best accuracy. Can we upper-bound the estimation error? • Yes: Chernoff bound / Hoeffding’s inequality + union bound
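A minimal sketch of the combined bound: applying Hoeffding to each of M classifiers with failure probability delta/M and taking a union bound gives a uniform error of sqrt(log(2M/delta)/(2N)), so only a log M price is paid (the function name is illustrative).

```python
import numpy as np

def selection_error_bound(n_points, n_classifiers, delta):
    """Hoeffding + union bound: with probability >= 1 - delta, all M classifiers
    simultaneously satisfy |estimated - true accuracy| <= sqrt(log(2M/delta)/(2N))."""
    return np.sqrt(np.log(2 * n_classifiers / delta) / (2 * n_points))

for m in (1, 10, 1_000):
    print(m, selection_error_bound(n_points=1_000, n_classifiers=m, delta=0.05))
```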
Conclusion & what’s next? • What if the set of classifiers is infinite?