

1. This Lecture
Basic definitions and concepts. Introduction to the problem of learning. Probability tools.

2. Definitions
Spaces: input space X, output space Y.
Loss function: L: Y × Y → R.
• L(ŷ, y): cost of predicting ŷ instead of y.
• binary classification: 0-1 loss, L(y, y') = 1_{y' ≠ y}.
• regression: L(y, y') = (y' − y)², with Y ⊆ R.
Hypothesis set: H ⊆ Y^X, subset of functions out of which the learner selects its hypothesis.
• depends on the features.
• represents prior knowledge about the task.
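As a quick illustration (not part of the slides), here is a minimal sketch of the two loss functions above; the function names are arbitrary.

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss for binary classification: 1 if the prediction is wrong, 0 otherwise."""
    return float(y_pred != y_true)

def squared_loss(y_pred, y_true):
    """Squared loss for regression: L(y', y) = (y' - y)^2."""
    return (y_pred - y_true) ** 2

# Cost of predicting +1 when the label is -1, and of predicting 2.5 when the target is 3.0.
print(zero_one_loss(+1, -1))   # 1.0
print(squared_loss(2.5, 3.0))  # 0.25
```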

3. Supervised Learning Set-Up
Training data: sample S of size m drawn i.i.d. from X × Y according to distribution D:
  S = ((x_1, y_1), ..., (x_m, y_m)).
Problem: find a hypothesis h ∈ H with small generalization error.
• deterministic case: output label deterministic function of input, y = f(x).
• stochastic case: output probabilistic function of input.
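For concreteness, the sketch below draws such an i.i.d. sample from a toy distribution D (uniform inputs, threshold labels, optional label flipping); the distribution and the flip probability are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(m, flip_prob=0.1):
    """Draw S = ((x_1, y_1), ..., (x_m, y_m)) i.i.d. from a toy distribution D:
    x uniform on [0, 1], y = 1{x > 0.5}, with the label flipped with probability
    flip_prob (stochastic case; set flip_prob=0 for the deterministic case)."""
    x = rng.uniform(0.0, 1.0, size=m)
    y = (x > 0.5).astype(int)
    flip = rng.random(m) < flip_prob
    return x, np.where(flip, 1 - y, y)

x_train, y_train = draw_sample(m=100)
```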

4. Errors
Generalization error: for h ∈ H, it is defined by
  R(h) = E_{(x,y)∼D}[L(h(x), y)].
Empirical error: for h ∈ H and sample S, it is
  R̂(h) = (1/m) Σ_{i=1}^m L(h(x_i), y_i).
Bayes error:
  R* = inf_{h measurable} R(h).
• in the deterministic case, R* = 0.
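The empirical error is straightforward to compute; the sketch below evaluates it for a hypothetical threshold hypothesis on noisy toy data (the data-generating choices are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_error(h, x, y, loss):
    """Empirical error: average loss of h over the sample ((x_1, y_1), ..., (x_m, y_m))."""
    return float(np.mean([loss(h(xi), yi) for xi, yi in zip(x, y)]))

# Toy data with a 10% label flip and the threshold hypothesis h(x) = 1{x > 0.5}.
m = 1000
x = rng.uniform(0.0, 1.0, size=m)
clean = (x > 0.5).astype(int)
y = np.where(rng.random(m) < 0.1, 1 - clean, clean)

h = lambda xi: int(xi > 0.5)
zero_one = lambda yp, yt: float(yp != yt)
print(empirical_error(h, x, y, zero_one))  # ≈ 0.1, the flip rate (which is also R* here)
```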

5. Noise
Noise:
• in binary classification, for any x ∈ X,
  noise(x) = min{Pr[1 | x], Pr[0 | x]}.
• observe that E[noise(x)] = R*.
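A small check of E[noise(x)] = R* on a hypothetical two-point input space (the probabilities below are made up for illustration): the Bayes classifier predicts the more likely label at each x, so its conditional error at x is exactly noise(x).

```python
# Hypothetical two-point input space with marginal Pr[x] and conditional Pr[1 | x].
p_x  = {"a": 0.6, "b": 0.4}
p1_x = {"a": 0.9, "b": 0.3}

# Bayes classifier: predict the more likely label at each x; its conditional error is noise(x).
bayes_pred = {x: int(p >= 0.5) for x, p in p1_x.items()}
err_at_x = {x: (1 - p1_x[x]) if bayes_pred[x] == 1 else p1_x[x] for x in p_x}

bayes_error    = sum(p_x[x] * err_at_x[x] for x in p_x)
expected_noise = sum(p_x[x] * min(p1_x[x], 1 - p1_x[x]) for x in p_x)
print(bayes_error, expected_noise)  # both equal 0.6*0.1 + 0.4*0.3 = 0.18
```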

6. Learning ≠ Fitting
Notion of simplicity/complexity. How do we define complexity?

7. Generalization
Observations:
• the best hypothesis on the sample may not be the best overall.
• generalization is not memorization.
• complex rules (very complex separation surfaces) can be poor predictors.
• trade-off: complexity of the hypothesis set vs. sample size (underfitting/overfitting).

8. Model Selection
General equality: for any h ∈ H, with h* the best-in-class hypothesis (h* = argmin_{h ∈ H} R(h)),
  R(h) − R* = [R(h) − R(h*)] + [R(h*) − R*],
where the first term is the estimation error and the second the approximation error.
Approximation: not a random variable, only depends on H.
Estimation: only term we can hope to bound.

9. Empirical Risk Minimization
Select hypothesis set H. Find the hypothesis h ∈ H minimizing the empirical error:
  ĥ = argmin_{h ∈ H} R̂(h).
• but H may be too complex.
• the sample size m may not be large enough.
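A minimal sketch of ERM over a finite hypothesis set of threshold classifiers with the 0-1 loss; the hypothesis set and the data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def erm(hypotheses, x, y):
    """ERM over a finite hypothesis set: return the hypothesis with the
    smallest empirical 0-1 error on the sample."""
    def emp_error(h):
        return np.mean([h(xi) != yi for xi, yi in zip(x, y)])
    return min(hypotheses, key=emp_error)

# Finite hypothesis set: threshold classifiers h_t(x) = 1{x > t} on a grid of thresholds.
H = [lambda xi, t=t: int(xi > t) for t in np.linspace(0.0, 1.0, 21)]

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
y = (x > 0.5).astype(int)
h_hat = erm(H, x, y)
```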

10. Generalization Bounds
Definition: an upper bound on
  Pr[ sup_{h ∈ H} |R(h) − R̂(h)| > ε ].
Bound on the estimation error for the hypothesis h_0 returned by ERM (using R̂(h_0) ≤ R̂(h*), since h_0 minimizes the empirical error):
  R(h_0) − R(h*) = R(h_0) − R̂(h_0) + R̂(h_0) − R(h*)
                 ≤ R(h_0) − R̂(h_0) + R̂(h*) − R(h*)
                 ≤ 2 sup_{h ∈ H} |R(h) − R̂(h)|.
How should we choose H? (model selection problem)

11. Model Selection
  H = ∪_{γ ∈ Γ} H_γ.
[Figure: error as a function of the parameter γ, showing the estimation error, the approximation error, and their upper bound, with the best trade-off at γ*.]

12. Structural Risk Minimization (Vapnik, 1995)
Principle: consider an infinite sequence of hypothesis sets ordered by inclusion,
  H_1 ⊂ H_2 ⊂ ··· ⊂ H_n ⊂ ···,
and select
  ĥ = argmin_{h ∈ H_n, n ∈ N} R̂(h) + penalty(H_n, m).
• strong theoretical guarantees.
• typically computationally hard.
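A sketch of the SRM rule over nested finite sets of threshold classifiers; the particular nesting and the penalty term (a typical √(log|H_n|/(2m)) form for finite classes) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def srm(nested_sets, x, y, penalty):
    """SRM sketch: over nested hypothesis sets H_1 ⊆ H_2 ⊆ ..., return the hypothesis
    minimizing empirical 0-1 error plus penalty(n, m)."""
    m = len(x)
    best_h, best_obj = None, np.inf
    for n, H_n in enumerate(nested_sets, start=1):
        for h in H_n:
            obj = np.mean([h(xi) != yi for xi, yi in zip(x, y)]) + penalty(n, m)
            if obj < best_obj:
                best_h, best_obj = h, obj
    return best_h

# Nested sets: H_n uses a dyadic grid of 2^n + 1 thresholds, so H_n ⊆ H_{n+1}.
nested = [[lambda xi, t=t: int(xi > t) for t in np.linspace(0.0, 1.0, 2**n + 1)]
          for n in range(1, 6)]
penalty = lambda n, m: np.sqrt(np.log(2**n + 1) / (2 * m))  # illustrative penalty for a finite H_n

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=300)
y = (x > 0.3).astype(int)
h_hat = srm(nested, x, y, penalty)
```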

13. General Algorithm Families
Empirical risk minimization (ERM):
  ĥ = argmin_{h ∈ H} R̂(h).
Structural risk minimization (SRM): with H_n ⊆ H_{n+1},
  ĥ = argmin_{h ∈ H_n, n ∈ N} R̂(h) + penalty(H_n, m).
Regularization-based algorithms: for λ ≥ 0,
  ĥ = argmin_{h ∈ H} R̂(h) + λ‖h‖².
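As one concrete instance of the regularization-based family (my choice, not named on the slide): squared loss with an L2 penalty on a linear hypothesis, i.e. ridge regression, which has a closed-form solution.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/m) * ||Xw - y||^2 + lam * ||w||^2 over w.
    Setting the gradient to zero gives (X^T X / m + lam I) w = X^T y / m."""
    m, d = X.shape
    A = X.T @ X / m + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y / m)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w_hat = ridge_fit(X, y, lam=0.1)   # larger lam shrinks w_hat toward 0
```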

14. This Lecture
Basic definitions and concepts. Introduction to the problem of learning. Probability tools.

15. Basic Properties
Union bound: Pr[A ∨ B] ≤ Pr[A] + Pr[B].
Inversion: if Pr[X ≥ ε] ≤ f(ε), then, for any δ > 0, with probability at least 1 − δ, X ≤ f⁻¹(δ).
Jensen's inequality: if f is convex, f(E[X]) ≤ E[f(X)].
Expectation: if X ≥ 0, E[X] = ∫_0^{+∞} Pr[X > t] dt.
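A quick numerical check of the last identity for X ~ Exponential(1) (a toy choice, where Pr[X > t] = e^{−t} and E[X] = 1), approximating the integral by a Riemann sum.

```python
import numpy as np

dt = 1e-3
t = np.arange(0.0, 50.0, dt)
tail = np.exp(-t)                 # Pr[X > t] for X ~ Exponential(1)
print(np.sum(tail) * dt)          # ≈ 1.0 = E[X]
```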

16. Basic Inequalities
Markov's inequality: if ε > 0 and X ≥ 0, then
  Pr[X ≥ ε] ≤ E[X] / ε.
Chebyshev's inequality: for any ε > 0,
  Pr[|X − E[X]| ≥ ε] ≤ σ_X² / ε².
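A Monte Carlo sanity check of both inequalities for an exponential random variable (an arbitrary choice with E[X] = Var[X] = 1).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, Var[X] = 1
eps = 3.0

print(np.mean(X >= eps), "<=", X.mean() / eps)                       # Markov:    ≈ 0.050 <= ≈ 0.333
print(np.mean(np.abs(X - X.mean()) >= eps), "<=", X.var() / eps**2)  # Chebyshev: ≈ 0.018 <= ≈ 0.111
```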

17. Hoeffding's Inequality
Theorem: let X_1, ..., X_m be independent random variables with the same expectation μ and X_i ∈ [a, b] (a < b). Then, for any ε > 0, the following inequalities hold:
  Pr[ (1/m) Σ_{i=1}^m X_i − μ > ε ] ≤ exp( −2mε² / (b − a)² ),
  Pr[ μ − (1/m) Σ_{i=1}^m X_i > ε ] ≤ exp( −2mε² / (b − a)² ).
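A simulation comparing the empirical deviation probability with the Hoeffding bound for Bernoulli(1/2) variables (so [a, b] = [0, 1]); the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, mu, eps, trials = 100, 0.5, 0.1, 20_000

# Estimate Pr[(1/m) sum X_i - mu > eps] for X_i ~ Bernoulli(mu)
# and compare it to exp(-2 m eps^2 / (b - a)^2).
deviations = (rng.random((trials, m)) < mu).mean(axis=1) - mu
empirical = np.mean(deviations > eps)
bound = np.exp(-2 * m * eps**2 / (1.0 - 0.0) ** 2)
print(empirical, "<=", bound)     # ≈ 0.02 <= ≈ 0.135
```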

18. McDiarmid's Inequality (McDiarmid, 1989)
Theorem: let X_1, ..., X_m be independent random variables taking values in U and f: U^m → R a function verifying, for all i ∈ [1, m],
  sup_{x_1, ..., x_m, x_i'} |f(x_1, ..., x_i, ..., x_m) − f(x_1, ..., x_i', ..., x_m)| ≤ c_i.
Then, for all ε > 0,
  Pr[ |f(X_1, ..., X_m) − E[f(X_1, ..., X_m)]| > ε ] ≤ 2 exp( −2ε² / Σ_{i=1}^m c_i² ).
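A standard specialization (not spelled out on the slide) is worth recording: taking f to be the empirical mean of variables in [a, b] recovers the two-sided form of Hoeffding's inequality.

```latex
% Specializing McDiarmid's inequality to the empirical mean:
\[
f(x_1,\dots,x_m) = \frac{1}{m}\sum_{i=1}^m x_i, \quad x_i \in [a,b]
\;\Longrightarrow\; c_i = \frac{b-a}{m}, \quad \sum_{i=1}^m c_i^2 = \frac{(b-a)^2}{m},
\]
% so the bound becomes the two-sided Hoeffding inequality:
\[
\Pr\Big[\Big|\frac{1}{m}\sum_{i=1}^m X_i - \mu\Big| > \epsilon\Big]
\le 2\exp\!\Big(-\frac{2m\epsilon^2}{(b-a)^2}\Big).
\]
```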

19. Appendix

20. Markov's Inequality
Theorem: let X be a non-negative random variable with E[X] < ∞. Then, for all t > 0,
  Pr[X ≥ t E[X]] ≤ 1/t.
Proof:
  Pr[X ≥ t E[X]] = Σ_{x ≥ t E[X]} Pr[X = x]
                 ≤ Σ_{x ≥ t E[X]} Pr[X = x] · x / (t E[X])
                 ≤ Σ_x Pr[X = x] · x / (t E[X])
                 = E[ X / (t E[X]) ] = 1/t.

21. Chebyshev's Inequality
Theorem: let X be a random variable with Var[X] < ∞. Then, for all t > 0,
  Pr[|X − E[X]| ≥ t σ_X] ≤ 1/t².
Proof: observe that
  Pr[|X − E[X]| ≥ t σ_X] = Pr[(X − E[X])² ≥ t² σ_X²].
The result follows from Markov's inequality.

22. Weak Law of Large Numbers
Theorem: let (X_n)_{n ∈ N} be a sequence of independent random variables with the same mean μ and variance σ² < ∞, and let X̄_n = (1/n) Σ_{i=1}^n X_i. Then, for any ε > 0,
  lim_{n → ∞} Pr[|X̄_n − μ| ≥ ε] = 0.
Proof: since the variables are independent,
  Var[X̄_n] = Σ_{i=1}^n Var[X_i / n] = n σ²/n² = σ²/n.
Thus, by Chebyshev's inequality,
  Pr[|X̄_n − μ| ≥ ε] ≤ σ² / (n ε²).
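The Chebyshev bound σ²/(nε²) from the proof can be checked by simulation; standard normal variables (μ = 0, σ² = 1) below are an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, eps, trials = 1.0, 0.1, 5_000

# Pr[|mean_n - mu| >= eps] versus the Chebyshev bound sigma^2 / (n eps^2), for growing n.
for n in (10, 100, 1000):
    means = rng.normal(0.0, 1.0, size=(trials, n)).mean(axis=1)
    print(n, np.mean(np.abs(means) >= eps), "<=", sigma2 / (n * eps**2))
```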

23. Concentration Inequalities
Some general tools for error analysis and bounds:
• Hoeffding's inequality (additive).
• Chernoff bounds (multiplicative).
• McDiarmid's inequality (more general).

24. Hoeffding's Lemma
Lemma: let X be a random variable with E[X] = 0 and X ∈ [a, b], b ≠ a. Then, for any t > 0,
  E[e^{tX}] ≤ e^{t²(b − a)²/8}.
Proof: by convexity of x ↦ e^{tx}, for all a ≤ x ≤ b,
  e^{tx} ≤ ((b − x)/(b − a)) e^{ta} + ((x − a)/(b − a)) e^{tb}.
Thus, using E[X] = 0,
  E[e^{tX}] ≤ E[ ((b − X)/(b − a)) e^{ta} + ((X − a)/(b − a)) e^{tb} ]
            = (b/(b − a)) e^{ta} + (−a/(b − a)) e^{tb} = e^{φ(t)},
with
  φ(t) = log( (b/(b − a)) e^{ta} + (−a/(b − a)) e^{tb} )
       = ta + log( b/(b − a) + (−a/(b − a)) e^{t(b − a)} ).

25. Hoeffding's Lemma (continued)
Taking the derivative gives:
  φ'(t) = a − a e^{t(b − a)} / ( b/(b − a) − (a/(b − a)) e^{t(b − a)} )
        = a − a / ( (b/(b − a)) e^{−t(b − a)} − a/(b − a) ).
Note that φ(0) = 0 and φ'(0) = 0. Furthermore, with α = −a/(b − a),
  φ''(t) = −ab e^{−t(b − a)} / [ (b/(b − a)) e^{−t(b − a)} − a/(b − a) ]²
         = α(1 − α)(b − a)² e^{−t(b − a)} / [ (1 − α) e^{−t(b − a)} + α ]²
         = u(1 − u)(b − a)²,  with u = α / [ (1 − α) e^{−t(b − a)} + α ]
         ≤ (b − a)²/4.
There exists 0 ≤ θ ≤ t such that:
  φ(t) = φ(0) + t φ'(0) + (t²/2) φ''(θ) ≤ t²(b − a)²/8.
