10701 Machine Learning Recitation 7 - Tail bounds and Averages


  1. 10701 Machine Learning Recitation 7 - Tail bounds and Averages Ahmed Hefny Slides mostly by Alex Smola

  2. Why this stuff ? • Can machine learning work ?

  3. Why this stuff ? • Can machine learning work ? • Yes, otherwise: • No Google • No spam-filters • No face detectors • No 701 midterm • I’d be living my life

  4. Why this stuff ? • Will machine learning always work ?

  5. Why this stuff ? • Will machine learning always work ? • No. [figure: data where it is unclear which points belong to Class 1 and which to Class 2]

  6. Why this stuff ? • We need some theory to analyze machine learning algorithms. • We will go through basic tools used to build theory. • How well can we estimate stuff from data ? • What is the convergence behavior of empirical averages ?

  7. Outline • Estimation Example • Convergence of Averages • Law of Large Numbers • Central Limit Theorem • Inequalities and Tail Bounds • Markov Inequality • Chebyshev’s Inequality • Hoeffding’s and McDiarmid’s Inequalities • Proof Tools • Union Bound • Fourier Analysis • Characteristic Functions

  8. Estimating Probabilities

  9. Discrete Distribution • n outcomes (e.g. faces of a die) • Data likelihood • Maximum Likelihood Estimation • Constrained optimization problem ... or ... • Incorporate the constraint via a Lagrange multiplier • Taking derivatives yields the relative-frequency estimate
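     The derivation the bullets refer to, in standard form (m draws in total, n_j the count of outcome j; the slide's own formulas were lost):
         p(X \mid \pi) = \prod_{j=1}^{n} \pi_j^{\,n_j}, \qquad \max_{\pi}\; \sum_j n_j \log \pi_j + \lambda\Big(1 - \sum_j \pi_j\Big)
         \frac{\partial}{\partial \pi_j}: \; \frac{n_j}{\pi_j} - \lambda = 0 \;\Rightarrow\; \pi_j = \frac{n_j}{\lambda} = \frac{n_j}{m}
     since the constraint \sum_j \pi_j = 1 forces \lambda = \sum_j n_j = m.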

  10.-13. Tossing a Die [figures: estimates from 12, 24, 60, and 120 tosses]

  14. Empirical average for a die

  15. Empirical average for a die: is it guaranteed to converge? How quickly does it converge?

  16. Convergence of Empirical Averages

  17. Expectations • Random variable x with probability measure • Expected value of f(x) • Special case - discrete probability mass (same trick works for intervals) • Draw x_i identically and independently from p • Empirical average

  18. Law of Large Numbers • Random variables x_i with mean μ • Empirical average • Weak Law of Large Numbers: convergence in probability • Strong Law of Large Numbers: almost sure convergence
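     The two statements, in standard form (\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i; the slide's formulas were lost):
         \text{Weak law: } \lim_{n \to \infty} \Pr\left[\left|\bar{X}_n - \mu\right| > \epsilon\right] = 0 \;\text{ for every } \epsilon > 0
         \text{Strong law: } \Pr\left[\lim_{n \to \infty} \bar{X}_n = \mu\right] = 1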

  19. Empirical average for a die [figure: 5 sample traces]
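     A minimal sketch of the experiment behind the figure, in Python; the number of traces, the roll count, and the seed are illustrative choices, not values from the slide:
        import numpy as np

        # Running empirical average of fair die rolls, one line per sample trace.
        rng = np.random.default_rng(42)
        n_traces, n_rolls = 5, 1000
        for t in range(n_traces):
            rolls = rng.integers(1, 7, size=n_rolls)                   # faces 1..6
            running_avg = np.cumsum(rolls) / np.arange(1, n_rolls + 1)
            print(f"trace {t}: average after {n_rolls} rolls = {running_avg[-1]:.3f}")  # tends to 3.5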

  20. Central Limit Theorem • Independent random variables x_i with mean μ_i and standard deviation σ_i • The random variable converges to a Normal Distribution • Special case - IID random variables & average convergence
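     In the i.i.d. special case (mean \mu, variance \sigma^2) the convergence is the standard one (the slide's formula was lost):
         \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
     In the general independent case the sum is centered by \sum_i \mu_i and scaled by \sqrt{\sum_i \sigma_i^2}, under mild (Lindeberg-type) regularity conditions.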

  21. Central Limit Theorem in Practice [figures: unscaled and scaled averages]

  22. Tail Bounds

  23. Simple tail bounds • Gauss-Markov inequality: non-negative random variable X with mean μ • Proof - decompose the expectation
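     The bound and the one-line proof, in standard form (the slide's formula was lost):
         \Pr[X \ge \epsilon] \le \frac{\mu}{\epsilon} \quad \text{for non-negative } X
     since \mu = \mathbf{E}[X] \ge \mathbf{E}\left[X\,\mathbf{1}\{X \ge \epsilon\}\right] \ge \epsilon\,\Pr[X \ge \epsilon].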

  24. Simple tail bounds • Chebyshev inequality: random variable X with mean μ and variance σ² • Proof - applying Gauss-Markov to Y = (X - μ)² with confidence ε² yields the result.

  25. Simple tail bounds • Chebyshev inequality: random variable X with mean μ and variance σ² • Proof - applying Gauss-Markov to Y = (X - μ)² with confidence ε² yields the result. Correct ?

  26. Simple tail bounds • Chebyshev inequality: random variable X with mean μ and variance σ² • Proof - applying Gauss-Markov to Y = (X - μ)² with confidence ε² yields the result. Approximately Correct ?

  27. Simple tail bounds • Chebyshev inequality: random variable X with mean μ and variance σ² • Proof - applying Gauss-Markov to Y = (X - μ)² with confidence ε² yields the result. Probably Approximately Correct !
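     The Chebyshev statement that these four slides build on, in standard form:
         \Pr\left[|X - \mu| \ge \epsilon\right] \le \frac{\sigma^2}{\epsilon^2}
     Read as a PAC statement: with probability at least 1 - \delta, X is within \sigma/\sqrt{\delta} of \mu (probably, approximately correct).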

  28. Scaling behavior • Gauss-Markov: scales properly in μ but is expensive in δ • Chebyshev: proper scaling in σ but still bad in δ • Can we get logarithmic scaling in δ?
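     Solving each bound for the deviation at confidence level 1 - \delta makes the scaling explicit (standard algebra, not shown on the slide):
         \text{Gauss-Markov: } X \le \mu/\delta \qquad \text{Chebyshev: } |X - \mu| \le \sigma/\sqrt{\delta} \qquad \text{Hoeffding: } |\bar{X}_n - \mu| \le c\sqrt{\tfrac{\log(2/\delta)}{2n}}
     each holding with probability at least 1 - \delta; only the last is logarithmic in 1/\delta.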

  29. Chernoff bound • For a Bernoulli random variable with P(x=1) = p • Ex: n independent tosses of a biased coin with probability p of heads: \Pr\left[\nu_n - p \ge \epsilon\right] \le \exp\left(-2 n \epsilon^2\right), where \nu_n is the observed fraction of heads
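     A quick numerical sanity check of this bound in Python; p, n, ε, and the number of repetitions are illustrative choices, not values from the slide:
        import numpy as np

        rng = np.random.default_rng(0)
        p, n, eps, n_trials = 0.3, 100, 0.1, 100_000

        # Empirical frequency of heads in each of n_trials repetitions of n tosses.
        nu = rng.binomial(n, p, size=n_trials) / n

        empirical_tail = np.mean(nu - p >= eps)   # estimate of Pr[nu_n - p >= eps]
        bound = np.exp(-2 * n * eps ** 2)         # exp(-2 n eps^2) = exp(-2) ~ 0.135

        print(f"empirical tail probability: {empirical_tail:.4f}")
        print(f"Chernoff/Hoeffding bound:   {bound:.4f}")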

  30. Chernoff bound • Proof: we show that \Pr\left[\nu_n - p \ge \epsilon\right] \le \exp\left(-n\,\mathrm{KL}(p + \epsilon \,\|\, p)\right) and then apply Pinsker’s inequality \mathrm{KL}(q \,\|\, p) \ge 2 (q - p)^2 • Where \mathrm{KL}(q \,\|\, p) = q \log \tfrac{q}{p} + (1 - q) \log \tfrac{1 - q}{1 - p}

  31. Hoeffding’s Inequality • If the X_i have bounded range c
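     The standard form of the bound for the empirical average of n such variables (the slide's formula was lost):
         \Pr\left[\left|\bar{X}_n - \mathbf{E}\,\bar{X}_n\right| \ge \epsilon\right] \le 2 \exp\left(-\frac{2 n \epsilon^2}{c^2}\right)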

  32. Hoeffding’s Inequality • Scaling behavior • This helps when we need to combine several tail bounds, since we only pay logarithmically for the combination.

  33. McDiarmid Inequality • Generalization of Hoeffding’s Inequality • Independent random variables X_i • Function f(X_1, ..., X_n) • Deviation from expected value • Here C is given by C^2 = \sum_i c_i^2, where c_i bounds the change in f when only X_i is modified • If f is the average and the X_i have bounded range c, this reduces to Hoeffding’s Inequality
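     The standard statement, with c_i the bounded-difference constant of coordinate i (changing only X_i changes f by at most c_i):
         \Pr\left[\left|f(X_1, \dots, X_n) - \mathbf{E} f\right| \ge \epsilon\right] \le 2 \exp\left(-\frac{2 \epsilon^2}{\sum_i c_i^2}\right)
     For f the average of variables with range c, c_i = c/n and the bound reduces to Hoeffding’s inequality above.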

  34. More tail bounds • Higher-order moments • Bernstein inequality (needs a variance bound): \Pr\left[\left|\hat{\mu}_n - \mu\right| \ge \epsilon\right] \le 2 \exp\left(-\frac{n^2 \epsilon^2 / 2}{\sum_j \mathrm{Var}(X_j) + C n \epsilon / 3}\right), where C bounds |X_j - \mathbf{E} X_j| • Absolute / relative error bounds • Bounds for (weakly) dependent random variables

  35. Summary • Markov [X is non-negative] • Chebyshev [finite variance] • Hoeffding [bound on range] • Bernstein [bound on range + bound on second moment] • Tighter bounds require more assumptions

  36. Tools for the proof

  37. Union Bound: \Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B] \le \Pr[A] + \Pr[B]. In general, \Pr\left[\bigcup_i A_i\right] \le \sum_i \Pr[A_i]

  38. Fourier Transform • Fourier transform relations • Useful identities • Identity • Derivative • Convolution (also holds for inverse transform)

  39. The Characteristic Function Method • Characteristic function • For X and Y independent we have • Joint distribution is convolution • Characteristic function is product • Proof - plug in definition of Fourier transform • Characteristic function is unique
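     In symbols (standard definitions, since the slide's formulas were lost):
         \phi_X(\omega) = \mathbf{E}_X\!\left[e^{i \omega x}\right], \qquad p_{X+Y} = p_X * p_Y, \qquad \phi_{X+Y}(\omega) = \phi_X(\omega)\,\phi_Y(\omega) \;\text{ for independent } X, Y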

  40. Proof - Weak law of large numbers • Require that the expectation exists • Taylor expansion of the exponential (need to assume that we can bound the tail) • Average of random variables ⇒ convolution • Limit is the constant distribution at the mean (higher-order terms vanish)
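     The computation the bullets sketch, in standard form for i.i.d. X_i with characteristic function \phi_X:
         \phi_{\bar{X}_n}(\omega) = \left[\phi_X\!\left(\tfrac{\omega}{n}\right)\right]^n = \left[1 + \tfrac{i \mu \omega}{n} + o\!\left(\tfrac{1}{n}\right)\right]^n \longrightarrow e^{i \mu \omega}
     which is the characteristic function of the constant \mu, so \bar{X}_n converges to \mu (convergence to a constant in distribution implies convergence in probability).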

  41. Warning • Moments may not always exist • Cauchy distribution • For the mean to exist the following integral would have to converge
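     Concretely, for the standard Cauchy density p(x) = \frac{1}{\pi (1 + x^2)} the integral in question diverges:
         \mathbf{E}|X| = \int_{-\infty}^{\infty} \frac{|x|}{\pi (1 + x^2)}\, dx = \infty
     so the mean is undefined and the law of large numbers does not apply.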

  42. Proof - Central limit theorem • Require that second order moment exists (we assume they’re all identical WLOG) • Characteristic function • Subtract out mean (centering) • This is the FT of a Normal Distribution
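     The corresponding computation for the centered and rescaled average (\sigma^2 the common variance, standard form):
         Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma}, \qquad \phi_{Z_n}(\omega) = \left[1 - \frac{\omega^2}{2n} + o\!\left(\tfrac{1}{n}\right)\right]^n \longrightarrow e^{-\omega^2 / 2}
     the characteristic function (Fourier transform) of the standard normal distribution.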

  43. Conclusion & what’s next ? We looked at basic building blocks of learning theory - Convergence of empirical averages - Tail bounds - Union bound

  44. Conclusion & what’s next ? Evaluate classifier C on N data points and estimate accuracy. Can we upper-bound estimation error ?

  45. Conclusion & what’s next ? Evaluate classifier C on N data points and estimate accuracy. Can we upper-bound estimation error ? Yes, Chernoff bound / Hoeffding’s inequality

  46. Conclusion & what’s next ? Evaluate a set of classifiers on N data points and pick the one with the best accuracy. Can we upper-bound estimation error ?

  47. Conclusion & what’s next ? Evaluate a set of classifiers on N data points and pick the one with the best accuracy. Can we upper-bound estimation error ? Yes, Chernoff bound / Hoeffding’s inequality + union bound
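     Concretely, for M classifiers each evaluated on the same N i.i.d. points, Hoeffding plus the union bound give (standard argument; the symbols are assumptions, not from the slide):
         \Pr\left[\max_{j \le M} \left|\widehat{\mathrm{acc}}_j - \mathrm{acc}_j\right| \ge \epsilon\right] \le 2 M \exp\left(-2 N \epsilon^2\right)
     so with probability at least 1 - \delta every estimate is within \sqrt{\log(2M/\delta)/(2N)} of the true accuracy: selecting among M classifiers costs only a logarithmic factor.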

  48. Conclusion & what’s next ? What if the set of classifiers is infinite ??
