This Lecture
• Basic definitions and concepts.
• Introduction to the problem of learning.
• Probability tools.
Definitions
Spaces: input space X, output space Y.
Loss function: L: Y × Y → R.
• L(ŷ, y): cost of predicting ŷ instead of y.
• binary classification: 0-1 loss, L(y, y′) = 1_{y′ ≠ y}.
• regression: Y ⊆ R, L(y, y′) = (y′ − y)².
Hypothesis set: H ⊆ Y^X, subset of functions out of which the learner selects his hypothesis.
• depends on features.
• represents prior knowledge about task.
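A minimal sketch (not from the slides) of these two losses as plain Python functions; the function names are illustrative.

```python
# Illustrative sketch of the two losses named above; names chosen here, not from the slides.

def zero_one_loss(y_pred, y_true):
    """0-1 loss for binary classification: 1 if the prediction is wrong, else 0."""
    return 1.0 if y_pred != y_true else 0.0

def squared_loss(y_pred, y_true):
    """Squared loss for regression: (y' - y)^2."""
    return (y_pred - y_true) ** 2

print(zero_one_loss(1, 0))   # 1.0 (misclassification)
print(squared_loss(2.5, 3))  # 0.25
```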
Supervised Learning Set-Up
Training data: sample S of size m drawn i.i.d. from X × Y according to distribution D:
S = ((x_1, y_1), ..., (x_m, y_m)).
Problem: find hypothesis h ∈ H with small generalization error.
• deterministic case: output label deterministic function of input, y = f(x).
• stochastic case: output probabilistic function of input.
Errors
Generalization error: for h ∈ H, it is defined by
R(h) = E_{(x,y)∼D}[L(h(x), y)].
Empirical error: for h ∈ H and sample S, it is
R̂(h) = (1/m) Σ_{i=1}^m L(h(x_i), y_i).
Bayes error:
R* = inf_{h measurable} R(h).
• in deterministic case, R* = 0.
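A minimal sketch of the empirical error R̂(h) as the average loss over a sample; the toy data and names are invented for illustration.

```python
# Sketch: empirical error of a hypothesis h on a sample S, for an arbitrary loss.

def empirical_error(h, sample, loss):
    """Average loss of h over the sample ((x_1, y_1), ..., (x_m, y_m))."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

# Example: a threshold classifier on a toy labeled sample, with the 0-1 loss.
sample = [(0.2, 0), (0.7, 1), (0.4, 0), (0.9, 1), (0.6, 0)]
h = lambda x: 1 if x > 0.5 else 0
print(empirical_error(h, sample, lambda yp, y: float(yp != y)))  # 0.2 (one mistake out of five)
```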
Noise
Noise:
• in binary classification, for any x ∈ X,
noise(x) = min{ Pr[1 | x], Pr[0 | x] }.
• observe that E[noise(x)] = R*.
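A short illustration with an invented finite distribution (not from the slides) of noise(x) and of the identity E[noise(x)] = R*: the Bayes classifier predicts the more likely label at each x, so its conditional error at x is exactly noise(x).

```python
# Sketch: noise(x) and the Bayes error R* for a small invented distribution over X × {0, 1}.

p_x = {"a": 0.5, "b": 0.3, "c": 0.2}            # marginal Pr[x]
p_1_given_x = {"a": 0.9, "b": 0.5, "c": 0.1}    # conditional Pr[1 | x]

# noise(x) = min(Pr[1 | x], Pr[0 | x]) at each point.
noise = {x: min(p, 1 - p) for x, p in p_1_given_x.items()}

# The Bayes classifier errs at x with probability noise(x), so R* = E[noise(x)].
bayes_error = sum(p_x[x] * noise[x] for x in p_x)
print(noise)        # ≈ {'a': 0.1, 'b': 0.5, 'c': 0.1}
print(bayes_error)  # 0.5*0.1 + 0.3*0.5 + 0.2*0.1 ≈ 0.22
```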
Learning ≠ Fitting
Notion of simplicity/complexity. How do we define complexity?
Generalization
Observations:
• the best hypothesis on the sample may not be the best overall.
• generalization is not memorization.
• complex rules (very complex separation surfaces) can be poor predictors.
• trade-off: complexity of hypothesis set vs. sample size (underfitting/overfitting).
Model Selection
General equality: for any h ∈ H,
R(h) − R* = [R(h) − R(h*)] + [R(h*) − R*],
             (estimation)     (approximation)
where h* is the best-in-class hypothesis in H.
Approximation: not a random variable, only depends on H.
Estimation: only term we can hope to bound.
Empirical Risk Minimization
Select hypothesis set H.
Find hypothesis h ∈ H minimizing empirical error:
ĥ = argmin_{h ∈ H} R̂(h).
• but H may be too complex.
• the sample size m may not be large enough.
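A minimal ERM sketch over an invented finite hypothesis set of threshold classifiers; the data and thresholds are illustrative.

```python
# Sketch of ERM over a small finite hypothesis set: threshold classifiers on the real line.

sample = [(0.1, 0), (0.3, 0), (0.45, 1), (0.6, 1), (0.8, 1)]

def make_threshold(t):
    return lambda x: 1 if x >= t else 0

hypothesis_set = {t: make_threshold(t) for t in (0.2, 0.4, 0.5, 0.7)}

def empirical_error(h, sample):
    return sum(h(x) != y for x, y in sample) / len(sample)

# ERM: pick the hypothesis with the smallest empirical error on the sample.
best_t = min(hypothesis_set, key=lambda t: empirical_error(hypothesis_set[t], sample))
print(best_t, empirical_error(hypothesis_set[best_t], sample))  # 0.4 with empirical error 0.0
```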
Generalization Bounds
Definition: upper bound on
Pr[ sup_{h ∈ H} |R(h) − R̂(h)| > ε ].
Bound on estimation error for hypothesis h_0 given by ERM:
R(h_0) − R(h*) = R(h_0) − R̂(h_0) + R̂(h_0) − R(h*)
             ≤ R(h_0) − R̂(h_0) + R̂(h*) − R(h*)   (since R̂(h_0) ≤ R̂(h*) by definition of ERM)
             ≤ 2 sup_{h ∈ H} |R(h) − R̂(h)|.
How should we choose H? (model selection problem)
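A rough illustration (toy setup, not from the slides) of the uniform deviation sup_{h ∈ H} |R(h) − R̂(h)| for a finite H, on a problem where the true error is known in closed form.

```python
# Sketch: uniform deviation over H = {threshold classifiers at 0.0, 0.1, ..., 1.0},
# with x uniform on [0, 1] and labels given by the deterministic rule 1{x >= 0.5}.
import random

random.seed(0)
TRUE_T = 0.5
thresholds = [0.1 * k for k in range(11)]

def true_error(t):
    # h_t disagrees with 1{x >= 0.5} exactly on the interval between t and 0.5.
    return abs(t - TRUE_T)

m = 200
sample = [(x, int(x >= TRUE_T)) for x in (random.random() for _ in range(m))]

def empirical_error(t):
    return sum((x >= t) != bool(y) for x, y in sample) / m

sup_dev = max(abs(true_error(t) - empirical_error(t)) for t in thresholds)
print(sup_dev)  # small for m = 200, and shrinks as m grows
```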
Model Selection
[Figure: estimation error, approximation error, and their upper bound as a function of the complexity parameter γ, with the optimum at γ*.]
H = ∪_{γ ∈ Γ} H_γ.
Structural Risk Minimization (Vapnik, 1995)
Principle: consider an infinite sequence of hypothesis sets ordered for inclusion,
H_1 ⊂ H_2 ⊂ ··· ⊂ H_n ⊂ ···
ĥ = argmin_{h ∈ H_n, n ∈ N} R̂(h) + penalty(H_n, m).
• strong theoretical guarantees.
• typically computationally hard.
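A minimal sketch of the SRM selection rule with nested hypothesis sets H_n = polynomials of degree at most n; the sqrt(n/m) penalty and the data are illustrative stand-ins, not the specific penalty from the theory.

```python
# Sketch of SRM: minimize empirical risk plus a complexity penalty over nested classes.
import numpy as np

rng = np.random.default_rng(0)
m = 50
x = rng.uniform(-1, 1, m)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(m)   # noisy target

def empirical_risk(degree):
    coeffs = np.polyfit(x, y, degree)              # least-squares fit in H_degree
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def srm_objective(degree):
    return empirical_risk(degree) + np.sqrt(degree / m)   # illustrative penalty(H_n, m)

best_degree = min(range(1, 11), key=srm_objective)
print(best_degree)  # trades off fit (empirical risk) against complexity (penalty)
```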
General Algorithm Families
Empirical risk minimization (ERM):
ĥ = argmin_{h ∈ H} R̂(h).
Structural risk minimization (SRM): H_n ⊆ H_{n+1},
ĥ = argmin_{h ∈ H_n, n ∈ N} R̂(h) + penalty(H_n, m).
Regularization-based algorithms: λ ≥ 0,
ĥ = argmin_{h ∈ H} R̂(h) + λ‖h‖².
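A minimal sketch of a regularization-based algorithm: ridge regression, where R̂ is the mean squared error and the regularizer is λ‖w‖². The data and the value of λ are illustrative.

```python
# Sketch: ridge regression, i.e. regularized least squares, solved in closed form.
import numpy as np

rng = np.random.default_rng(1)
m, d = 100, 5
X = rng.standard_normal((m, d))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(m)

lam = 0.1
# argmin_w (1/m)||Xw - y||^2 + lam*||w||^2 has closed form (X^T X / m + lam*I)^{-1} X^T y / m.
w_hat = np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)
print(np.round(w_hat, 2))  # close to w_true, shrunk slightly toward 0 by the penalty
```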
This Lecture
• Basic definitions and concepts.
• Introduction to the problem of learning.
• Probability tools.
Basic Properties
Union bound: Pr[A ∨ B] ≤ Pr[A] + Pr[B].
Inversion: if Pr[X ≥ ε] ≤ f(ε), then, for any δ > 0, with probability at least 1 − δ, X ≤ f^{−1}(δ).
Jensen's inequality: if f is convex, f(E[X]) ≤ E[f(X)].
Expectation: if X ≥ 0, E[X] = ∫_0^{+∞} Pr[X > t] dt.
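A quick numeric check (illustrative, not from the slides) of two of these properties, Jensen's inequality with f(x) = x² and the tail-integral formula for the expectation, on an exponential variable.

```python
# Sketch: simulation check of Jensen's inequality and E[X] = ∫_0^∞ Pr[X > t] dt.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.exponential(scale=2.0, size=100_000)        # E[X] = 2

# Jensen: f(E[X]) <= E[f(X)] for convex f(x) = x^2.
print(np.mean(xs) ** 2, "<=", np.mean(xs ** 2))      # ≈ 4 <= ≈ 8

# Tail-integral formula, approximated on a grid of t values.
ts = np.linspace(0, 40, 401)
tail = np.array([np.mean(xs > t) for t in ts])
print(np.trapz(tail, ts), "≈", np.mean(xs))          # both ≈ 2
```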
Basic Inequalities
Markov's inequality: if X ≥ 0 and ε > 0, then
Pr[X ≥ ε] ≤ E[X] / ε.
Chebyshev's inequality: for any ε > 0,
Pr[|X − E[X]| ≥ ε] ≤ σ_X² / ε².
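A quick simulation check (illustrative) of both inequalities for an exponential random variable with mean 1 and variance 1.

```python
# Sketch: empirical frequencies vs. the Markov and Chebyshev bounds.
import random

random.seed(0)
n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]   # E[X] = 1, Var[X] = 1

eps = 3.0
markov_lhs = sum(x >= eps for x in xs) / n
print(markov_lhs, "<=", 1.0 / eps)                 # Pr[X >= 3] ≈ 0.05 <= E[X]/3 ≈ 0.33

cheb_lhs = sum(abs(x - 1.0) >= eps for x in xs) / n
print(cheb_lhs, "<=", 1.0 / eps**2)                # Pr[|X - 1| >= 3] ≈ 0.018 <= Var[X]/9 ≈ 0.11
```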
Hoe ff ding’s Inequality Theorem: Let be indep. rand. variables with the X 1 , . . . , X m same expectation and , ( ). Then, for any , X i ∈ [ a, b ] a<b ✏ > 0 µ the following inequalities hold: m − 2 m ✏ 2 µ − 1 � ✓ ◆ X Pr ≤ exp X i > ✏ ( b − a ) 2 m i =1 1 m − 2 m ✏ 2 � ✓ ◆ X Pr ≤ exp X i − µ > ✏ . ( b − a ) 2 m i =1 Foundations of Machine Learning page 32
McDiarmid’s Inequality (McDiarmid, 1989) Theorem: let be independent random variables X 1 , . . . , X m taking values in and a function verifying for f : U m → R U all , i ∈ [1 , m ] | f ( x 1 , . . . , x i , . . . , x m ) − f ( x 1 , . . . , x 0 sup i , . . . , x m ) | ≤ c i . x 1 ,...,x m ,x 0 i Then, for all , ✏ > 0 2 ✏ 2 ✓ ◆ h� i � Pr � f ( X 1 , . . . , X m ) − E[ f ( X 1 , . . . , X m )] ≤ 2 exp � > ✏ . − P m i =1 c 2 i Foundations of Machine Learning page 33
Appendix
Markov’s Inequality Theorem: let be a non-negative random variable X with , then, for all , E[ X ] < ∞ t> 0 Pr[ X ≥ t E[ X ]] ≤ 1 t . Proof: X Pr[ X ≥ t E[ X ]] = Pr[ X = x ] x ≥ t E[ X ] x X Pr[ X = x ] ≤ t E[ X ] x ≥ t E[ X ] x X Pr[ X = x ] ≤ t E[ X ] x � = 1 X = E t . t E[ X ] Foundations of Machine Learning page 35
Chebyshev’s Inequality Theorem: let be a random variable with , then, Var[ X ] < ∞ X for all , t> 0 Pr[ | X − E[ X ] | ≥ t σ X ] ≤ 1 t 2 . Proof: Observe that Pr[ | X − E[ X ] | ≥ t σ X ] = Pr[( X − E[ X ]) 2 ≥ t 2 σ 2 X ] . The result follows Markov’s inequality. Foundations of Machine Learning page 36
Weak Law of Large Numbers
Theorem: let (X_n)_{n ∈ N} be a sequence of independent random variables with the same mean µ and variance σ² < ∞, and let X̄_n = (1/n) Σ_{i=1}^n X_i. Then, for any ε > 0,
lim_{n→∞} Pr[|X̄_n − µ| ≥ ε] = 0.
Proof: Since the variables are independent,
Var[X̄_n] = Σ_{i=1}^n Var[X_i / n] = n σ² / n² = σ² / n.
Thus, by Chebyshev's inequality,
Pr[|X̄_n − µ| ≥ ε] ≤ σ² / (n ε²).
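A quick simulation (illustrative) of the law for fair coin flips, comparing the estimated deviation probability to the Chebyshev bound σ²/(nε²) used in the proof.

```python
# Sketch: Pr[|X̄_n - µ| >= ε] estimated over many trials, at several values of n.
import numpy as np

rng = np.random.default_rng(0)
mu, var, eps, trials = 0.5, 0.25, 0.05, 10_000

for n in (100, 1_000, 10_000):
    means = rng.binomial(n, mu, size=trials) / n       # independent copies of X̄_n
    freq = np.mean(np.abs(means - mu) >= eps)
    print(n, freq, "<=", var / (n * eps**2))           # frequency shrinks toward 0 as n grows
```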
Concentration Inequalities
Some general tools for error analysis and bounds:
• Hoeffding's inequality (additive).
• Chernoff bounds (multiplicative).
• McDiarmid's inequality (more general).
Hoe ff ding’s Lemma Lemma: Let be a random variable with X ∈ [ a, b ] E[ X ]=0 and . Then for any , b 6 = a t> 0 t 2( b − a )2 E [ e tX ] ≤ e 8 . Proof: by convexity of , for all , x 7! e tx a ≤ x ≤ b e tx ≤ b − x b − ae ta + x − a b − a e tb . Thus, b − a e ta + X − a b − a e ta + − a b − a e tb = e φ ( t ) , E[ e tX ] ≤ E [ b − X b b − a e tb ] = with, b − a e ta + − a b b − a + − a b b − a e t ( b − a ) ) . b − a e tb ) = ta + log( φ ( t ) = log( Foundations of Machine Learning page 39
Taking the derivative gives:
φ′(t) = a − (a e^{t(b−a)}) / ( b/(b − a) − (a/(b − a)) e^{t(b−a)} ) = a − a / ( (b/(b − a)) e^{−t(b−a)} − a/(b − a) ).
Note that φ(0) = 0 and φ′(0) = 0. Furthermore, with α = −a/(b − a),
φ″(t) = (−ab e^{−t(b−a)}) / ( (b/(b − a)) e^{−t(b−a)} − a/(b − a) )²
     = ( α(1 − α) e^{−t(b−a)} (b − a)² ) / ( (1 − α) e^{−t(b−a)} + α )²
     = [ α / ((1 − α) e^{−t(b−a)} + α) ] · [ (1 − α) e^{−t(b−a)} / ((1 − α) e^{−t(b−a)} + α) ] · (b − a)²
     = u(1 − u)(b − a)² ≤ (b − a)²/4,
with u = α / ((1 − α) e^{−t(b−a)} + α).
By the second-order Taylor expansion with remainder, there exists 0 ≤ θ ≤ t such that:
φ(t) = φ(0) + t φ′(0) + (t²/2) φ″(θ) ≤ t²(b − a)²/8.
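A quick numeric check (illustrative, with an invented two-point distribution) that the bound of the lemma holds:

```python
# Sketch: E[e^{tX}] vs. the Hoeffding-lemma bound e^{t^2 (b-a)^2 / 8} for a zero-mean X
# bounded in [a, b] = [-1, 2]: X = -1 with probability 2/3 and X = 2 with probability 1/3.
import math

a, b = -1.0, 2.0
probs = {-1.0: 2 / 3, 2.0: 1 / 3}              # E[X] = -2/3 + 2/3 = 0

for t in (0.1, 0.5, 1.0):
    mgf = sum(p * math.exp(t * x) for x, p in probs.items())
    bound = math.exp(t**2 * (b - a)**2 / 8)
    print(t, mgf, "<=", bound)                 # the moment generating function stays below the bound
```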