Classification rule/algorithm

Classification rule: $\widehat{f} : \bigcup_{n \ge 1} (\mathcal{X} \times \{0,1\})^n \to \mathcal{S}$
Input: a data set $D_n$ (of any size $n \ge 1$)
Output: a classifier $\widehat{f}(D_n) : \mathcal{X} \to \{0,1\}$
Example: $k$-nearest neighbours ($k$-NN): $x \in \mathcal{X} \mapsto$ majority vote among the $Y_i$ such that $X_i$ is one of the $k$ nearest neighbours of $x$ in $X_1, \dots, X_n$
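As an illustration of the $k$-NN rule just defined, here is a minimal NumPy sketch; the simulated data and the choice $k = 3$ are illustrative assumptions, not part of the slides.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Majority vote among the labels Y_i of the k nearest X_i (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
    neighbours = np.argsort(dists)[:k]            # indices of the k nearest neighbours
    votes = y_train[neighbours]                   # their labels, in {0, 1}
    return int(2 * votes.sum() > k)               # majority vote (ties broken towards 0)

# Toy usage with simulated one-dimensional data
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = (rng.uniform(size=200) < 0.5 + 0.4 * np.sin(X_train[:, 0])).astype(int)
print(knn_predict(np.array([5.0]), X_train, y_train, k=3))
```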
Example: 3-nearest neighbours
(figure: the 3-NN classifier on a simulated one-dimensional data set, labels in {0, 1}, inputs in [0, 10])
Universal consistency

weak consistency: $\mathbb{E}\bigl[R(\widehat{f}(D_n))\bigr] \xrightarrow[n \to \infty]{} R(f^\star)$
strong consistency: $R(\widehat{f}(D_n)) \xrightarrow[n \to \infty]{a.s.} R(f^\star)$
universal (weak) consistency: for all $P$, $\mathbb{E}\bigl[R(\widehat{f}(D_n))\bigr] \xrightarrow[n \to \infty]{} R(f^\star)$
universal strong consistency: for all $P$, $R(\widehat{f}(D_n)) \xrightarrow[n \to \infty]{a.s.} R(f^\star)$

Stone's theorem [Stone, 1977]: if $\mathcal{X} = \mathbb{R}^d$ with the Euclidean distance, $k_n$-NN is (weakly) universally consistent if $k_n \to +\infty$ and $k_n / n \to 0$ as $n \to +\infty$.
Uniform universal consistency?

universal weak consistency: $\sup_{P \in \mathcal{M}_1(\mathcal{X} \times \{0,1\})} \lim_{n \to +\infty} \bigl\{ \mathbb{E}\bigl[R(\widehat{f}(D_n))\bigr] - R(f^\star) \bigr\} = 0$
uniform universal weak consistency: $\lim_{n \to +\infty} \sup_{P \in \mathcal{M}_1(\mathcal{X} \times \{0,1\})} \bigl\{ \mathbb{E}\bigl[R(\widehat{f}(D_n))\bigr] - R(f^\star) \bigr\} = 0$
that is, a common learning rate for all $P$?

Yes if $\mathcal{X}$ is finite. No otherwise (see Chapter 7 of [Devroye et al., 1996]).
Classification on $\mathcal{X}$ finite

Theorem. If $\mathcal{X}$ is finite and $\widehat{f}_{\mathrm{maj}}$ is the majority vote rule (for each $x \in \mathcal{X}$, majority vote among $\{Y_i \,/\, X_i = x\}$), then
$$\sup_P \mathbb{E}\bigl[ R(\widehat{f}_{\mathrm{maj}}(D_n)) - R(f^\star) \bigr] \le \sqrt{\frac{\operatorname{Card}(\mathcal{X}) \log(2)}{2n}} .$$

Proof: standard risk bounds (see next section) + maximal inequality
$$\mathbb{E}\Bigl[ \sup_{t \in T} \frac{1}{n} \sum_{i=1}^n \xi_{i,t} \Bigr] \le \sqrt{\frac{\log(\operatorname{Card}(T))}{2n}}$$
if for all $t$, the $(\xi_{i,t})_i$ are independent, centered and in $[0,1]$. See e.g. http://www.di.ens.fr/~arlot/2013orsay.htm

Constants matter: $\operatorname{Card}(\mathcal{X})$ can be larger than $n$ ⇒ beware of asymptotic results and $O(\cdot)$ that can hide such constants in first- or second-order terms.
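The majority vote rule of the theorem is easy to write down; below is a minimal sketch on $\mathcal{X} = \{0, \dots, K-1\}$ with simulated data (the default label 0 for values of $x$ never observed in $D_n$ is an illustrative convention).

```python
import numpy as np

def majority_vote_rule(X, Y, x_values):
    """For each x, predict the majority label among {Y_i : X_i = x} (0 if x was never observed)."""
    f_hat = {}
    for x in x_values:
        labels = Y[X == x]
        f_hat[x] = int(labels.mean() > 0.5) if len(labels) > 0 else 0
    return f_hat

rng = np.random.default_rng(0)
K, n = 10, 200
eta = rng.uniform(size=K)                       # eta(x) = P(Y = 1 | X = x)
X = rng.integers(0, K, size=n)
Y = (rng.uniform(size=n) < eta[X]).astype(int)
print(majority_vote_rule(X, Y, range(K)))
```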
No Free Lunch Theorem

Theorem. If $\mathcal{X}$ is infinite, for any classification rule $\widehat{f}$ and any $n \ge 1$,
$$\sup_{P \in \mathcal{M}_1(\mathcal{X} \times \{0,1\})} \mathbb{E}\bigl[ R(\widehat{f}(D_n)) - R(f^\star) \bigr] \ge \frac{1}{2} .$$

(figure: $Y = \eta(X)$ as a function of $X$, with the data points, the estimator, and unobserved points: nothing constrains the estimator at the unobserved points)

Remark: for any $(a_n)$ decreasing to zero and any $\widehat{f}$, some $P$ exists such that $\mathbb{E}\bigl[R(\widehat{f}(D_n))\bigr] - R(f^\star) \ge a_n$. See Chapter 7 of [Devroye et al., 1996].
⇒ impossible to have $\frac{C(P)}{\log \log n}$ as a universal risk bound!
No Free Lunch Theorem: proof

Assume $\mathbb{N} \subset \mathcal{X}$ and let $K \ge 1$. For any $r \in \{0,1\}^K$, define $P_r$ by: $X$ uniform on $\{1, \dots, K\}$ and $P(Y = r_i \mid X = i) = 1$ for all $i = 1, \dots, K$. Under $P_r$, $f^\star(x) = r_x$ and $R(f^\star) = 0$. So, taking $r$ uniformly distributed on $\{0,1\}^K$,
$$\sup_P \mathbb{E}_P\bigl[ R_P(\widehat{f}(D_n)) - R_P(f^\star) \bigr] \ge \sup_r \mathbb{P}_{P_r}\bigl( \widehat{f}(X; D_n) \ne r_X \bigr) \ge \mathbb{E}_r \Bigl[ \mathbb{P}_{P_r}\bigl( \widehat{f}(X; D_n) \ne r_X \bigr) \Bigr]$$
$$\ge \mathbb{E}_{X,\, (X_i, r_{X_i})_{i=1,\dots,n}} \Bigl[ \mathbf{1}_{X \notin \{X_1, \dots, X_n\}} \; \mathbb{E}\bigl[ \mathbf{1}_{\widehat{f}(X; (X_i, r_{X_i})_{i=1,\dots,n}) \ne r_X} \,\big|\, X, (X_i, r_{X_i})_{i=1,\dots,n} \bigr] \Bigr] = \frac{1}{2} \mathbb{P}\bigl( X \notin \{X_1, \dots, X_n\} \bigr) = \frac{1}{2} \Bigl( 1 - \frac{1}{K} \Bigr)^n ,$$
since, at an unobserved point $X$, $r_X$ is independent of the data and uniform on $\{0,1\}$. Letting $K \to +\infty$ gives the lower bound $1/2$.
Learning rates

How can we get a bound such as $R(\widehat{f}(D_n)) - R(f^\star) \le C(P)\, n^{-1/2}$?
No Free Lunch Theorems ⇒ must make assumptions on $P$.
Minimax rate: given a set $\mathcal{P} \subset \mathcal{M}_1(\mathcal{X} \times \{0,1\})$,
$$\inf_{\widehat{f}} \; \sup_{P \in \mathcal{P}} \; \mathbb{E}\bigl[ R(\widehat{f}(D_n)) - R(f^\star) \bigr]$$
Examples:
- $\sqrt{V/n}$ when $f^\star \in S$ known and $\dim_{\mathrm{VC}}(S) = V$ [Devroye et al., 1996]
- $V/(nh)$ when in addition $P(|\eta(X) - 1/2| \le h) = 0$ (margin assumption) [Massart and Nédélec, 2006]
Outline
1. Introduction
2. Goals
3. Overfitting
4. Examples
5. Key issues
Overfitting with $k$-nearest-neighbours: $k = 1$
(figure: the 1-NN classifier on a simulated one-dimensional data set, labels in {0, 1}, inputs in [0, 10])
Choosing $k \in \{1, 3, 20, 200\}$ for $k$-NN ($n = 200$)
(figure: four panels showing the $k$-NN classifier for $k = 1, 3, 20, 200$ on simulated data, labels in {0, 1}, inputs in [0, 10])
Empirical risk minimization

Empirical risk: $\widehat{R}_n(f) := \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i)$
Empirical risk minimizer over a model $S \subset \mathcal{S}$: $\widehat{f}_S \in \operatorname{argmin}_{f \in S} \bigl\{ \widehat{R}_n(f) \bigr\}$
Examples:
- partitioning rule: $S = \bigl\{ \sum_{k \ge 1} \alpha_k \mathbf{1}_{A_k} \,/\, \alpha_k \in \{0,1\} \bigr\}$ for some partition $(A_k)_{k \ge 1}$ of $\mathcal{X}$
- linear discrimination ($\mathcal{X} = \mathbb{R}^d$): $S = \bigl\{ x \mapsto \mathbf{1}_{\beta^\top x + \beta_0 \ge 0} \,/\, \beta \in \mathbb{R}^d, \beta_0 \in \mathbb{R} \bigr\}$
- ...
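To make ERM concrete, here is a sketch over a deliberately tiny model (one-dimensional threshold classifiers); the model, the scan over observed thresholds and the noisy simulated data are illustrative assumptions.

```python
import numpy as np

def empirical_risk(pred, y):
    """R_n(f) = (1/n) * sum of 0-1 losses."""
    return np.mean(pred != y)

def erm_threshold(X, y):
    """ERM over S = { x -> 1_{x >= t} : t real }, scanning the observed values as candidate thresholds."""
    best_risk, best_t = np.inf, None
    for t in np.unique(X):
        risk = empirical_risk((X >= t).astype(int), y)
        if risk < best_risk:
            best_risk, best_t = risk, t
    return best_risk, best_t

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = (X > 6).astype(int)
y[rng.uniform(size=200) < 0.1] ^= 1          # flip 10% of the labels
print(erm_threshold(X, y))
```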
Example: linear discrimination
(Fig. 4.3 of [Devroye et al., 1996])
Bias-variance trade-off

$$\mathbb{E}\bigl[ R(\widehat{f}_S) \bigr] - R(f^\star) = \text{Bias} + \text{Variance}$$
Bias or approximation error: $R(f^\star_S) - R(f^\star) = \inf_{f \in S} R(f) - R(f^\star)$
Variance or estimation error:
- OLS in regression: $\dfrac{\sigma^2 \dim(S)}{n}$
- $k$-NN in regression: $\dfrac{\sigma^2}{k}$

Bias-variance trade-off ⇔ avoid overfitting and underfitting
Outline
1. Introduction
2. Goals
3. Overfitting
4. Examples
   - Plug-in rules
   - Empirical risk minimization and model selection
   - Convexification and support vector machines
   - Decision trees and forests
5. Key issues
Plug-in classifiers

Idea: $f^\star(x) = \mathbf{1}_{\eta(x) \ge 1/2}$ ⇒ if $\widehat{\eta}(D_n)$ estimates $\eta$ (a regression problem), take
$$\widehat{f}(x; D_n) = \mathbf{1}_{\widehat{\eta}(x; D_n) \ge 1/2}$$
Examples: partitioning, $k$-NN, local average classifiers [Devroye et al., 1996], [Audibert and Tsybakov, 2007]...
Risk bound for plug-in

Proposition (Theorem 2.2 in [Devroye et al., 1996]). For a plug-in classifier $\widehat{f}$,
$$R(\widehat{f}(D_n)) - R(f^\star) \le 2\, \mathbb{E}\bigl[ |\eta(X) - \widehat{\eta}(X; D_n)| \,\big|\, D_n \bigr] \le 2 \sqrt{ \mathbb{E}\bigl[ (\eta(X) - \widehat{\eta}(X; D_n))^2 \,\big|\, D_n \bigr] }$$
(First step for proving Stone's theorem [Stone, 1977].)
Proof: $R(\widehat{f}(D_n)) - R(f^\star) = \mathbb{E}\bigl[ |2\eta(X) - 1| \, \mathbf{1}_{\widehat{f}(X; D_n) \ne f^\star(X)} \,\big|\, D_n \bigr]$ and $\widehat{f}(X; D_n) \ne f^\star(X)$ implies $|2\eta(X) - 1| \le 2 |\eta(X) - \widehat{\eta}(X; D_n)|$.
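A sketch of the plug-in principle: estimate $\eta$ with any regression method, then threshold at $1/2$. The local-average regression estimator and its bandwidth below are illustrative choices, not the ones studied in the slides.

```python
import numpy as np

def eta_hat(x, X_train, y_train, bandwidth=0.5):
    """Local-average (Nadaraya-Watson style) estimate of eta(x) = P(Y = 1 | X = x)."""
    w = np.exp(-((X_train - x) ** 2) / (2 * bandwidth ** 2))
    return np.sum(w * y_train) / np.sum(w)

def plug_in_classifier(x, X_train, y_train, bandwidth=0.5):
    """f_hat(x; D_n) = 1 if eta_hat(x; D_n) >= 1/2, else 0."""
    return int(eta_hat(x, X_train, y_train, bandwidth) >= 0.5)

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=300)
y_train = (rng.uniform(size=300) < 0.5 + 0.4 * np.sin(X_train)).astype(int)
print([plug_in_classifier(x, X_train, y_train) for x in (1.0, 2.5, 5.0)])
```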
Empirical risk minimization (ERM)

ERM over $S$: $\widehat{f}_S \in \operatorname{argmin}_{f \in S} \bigl\{ \widehat{R}_n(f) \bigr\}$
$$\mathbb{E}\bigl[ R(\widehat{f}_S) \bigr] - R(f^\star) = \text{Approximation error} + \text{Estimation error}$$
Approximation error $R(f^\star_S) - R(f^\star)$: bounded thanks to approximation theory, or assumed equal to zero.
Estimation error:
$$\mathbb{E}\bigl[ R(\widehat{f}_S) - R(f^\star_S) \bigr] \le \mathbb{E}\Bigl[ \sup_{f \in S} \bigl\{ R(f) - \widehat{R}_n(f) \bigr\} \Bigr]$$
Proof:
$$R(\widehat{f}_S) - R(f^\star_S) = R(\widehat{f}_S) - \widehat{R}_n(\widehat{f}_S) + \widehat{R}_n(\widehat{f}_S) - \widehat{R}_n(f^\star_S) + \widehat{R}_n(f^\star_S) - R(f^\star_S) \le \sup_{f \in S} \bigl\{ R(f) - \widehat{R}_n(f) \bigr\} + \widehat{R}_n(f^\star_S) - R(f^\star_S)$$
Bounds on the estimation error (1): global approach

$$\mathbb{E}\bigl[ R(\widehat{f}_S) - R(f^\star_S) \bigr] \le \mathbb{E}\Bigl[ \sup_{f \in S} \bigl\{ R(f) - \widehat{R}_n(f) \bigr\} \Bigr] \quad \text{(global complexity of } S\text{)}$$
$$\le 2\, \mathbb{E}\Bigl[ \sup_{f \in S} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \, \ell(f(X_i), Y_i) \Bigr] \quad \text{(symmetrization)}$$
$$\le \frac{2\sqrt{2}}{\sqrt{n}}\, \mathbb{E}\Bigl[ \sqrt{ H(S; X_1, \dots, X_n) } \Bigr] \quad \text{(combinatorial entropy)}$$
$$\le 2 \sqrt{ \frac{2 V(S) \log\bigl( \frac{en}{V(S)} \bigr)}{n} } \quad \text{(VC dimension)}$$
References: Section 3 of [Boucheron et al., 2005], Chapters 12–13 of [Devroye et al., 1996]. See also lectures 1–2 of http://www.di.ens.fr/~arlot/2013orsay.htm
Bounds on the estimation error (2): localization

$\sup_{f \in S} \bigl\{ \operatorname{var}\bigl( R(f) - \widehat{R}_n(f) \bigr) \bigr\} \ge C n^{-1/2}$ ⇒ no faster rate.
Margin condition: $P(|\eta(X) - 1/2| \le h) = 0$ with $h > 0$ [Mammen and Tsybakov, 1999].
Localization idea: use that $\widehat{f}_S$ does not lie just anywhere in $S$:
$$\widehat{f}_S \in \bigl\{ f \in S \,/\, R(f) - R(f^\star) \le \varepsilon \bigr\} \subset \bigl\{ f \in S \,/\, \operatorname{var}\bigl( \ell(f(X), Y) - \ell(f^\star(X), Y) \bigr) \le \varepsilon / h \bigr\}$$
by the margin condition.
+ Talagrand's concentration inequality [Talagrand, 1996, Bousquet, 2002] + ... ⇒ fast rates (depending on the assumptions), e.g.,
$$\frac{\kappa V(S)}{nh} \Bigl( 1 + \log\Bigl( \frac{n h^2}{V(S)} \Bigr) \Bigr)$$
[Boucheron et al., 2005, Sec. 5], [Massart and Nédélec, 2006].
Model selection

family of models $(S_m)_{m \in \mathcal{M}}$ ⇒ family of classifiers $(\widehat{f}_m(D_n))_{m \in \mathcal{M}}$
⇒ choose $\widehat{m} = \widehat{m}(D_n)$ such that $R\bigl( \widehat{f}_{\widehat{m}}(D_n) \bigr)$ is minimal?
Goal: minimize the risk, i.e., an oracle inequality (in expectation or with a large probability):
$$R\bigl( \widehat{f}_{\widehat{m}} \bigr) - R(f^\star) \le C \inf_{m \in \mathcal{M}} \bigl\{ R\bigl( \widehat{f}_m \bigr) - R(f^\star) \bigr\} + R_n$$
Interpretation of $\widehat{m}$: the best model can be wrong / the true model can be worse than smaller ones.
Penalization for model selection

Penalization: $\widehat{m} \in \operatorname{argmin}_{m \in \mathcal{M}} \bigl\{ \widehat{R}_n(\widehat{f}_m) + \operatorname{pen}(m) \bigr\}$
Ideal penalty: $\operatorname{pen}_{\mathrm{id}}(m) = R(\widehat{f}_m) - \widehat{R}_n(\widehat{f}_m)$ $\Leftrightarrow$ $\widehat{m} \in \operatorname{argmin}_{m \in \mathcal{M}} \bigl\{ R(\widehat{f}_m) \bigr\}$
General idea: choose pen such that $\operatorname{pen}(m) \approx \operatorname{pen}_{\mathrm{id}}(m)$, or at least $\operatorname{pen}(m) \ge \operatorname{pen}_{\mathrm{id}}(m)$ for all $m \in \mathcal{M}$.
Lemma (proved below): if $\operatorname{pen}(m) \ge \operatorname{pen}_{\mathrm{id}}(m)$ for all $m \in \mathcal{M}$,
$$R\bigl( \widehat{f}_{\widehat{m}} \bigr) - R(f^\star) \le \inf_{m \in \mathcal{M}} \bigl\{ R\bigl( \widehat{f}_m \bigr) - R(f^\star) + \operatorname{pen}(m) - \operatorname{pen}_{\mathrm{id}}(m) \bigr\} .$$
Penalization for model selection: lemma

Lemma. If $\forall m \in \mathcal{M}$, $-B(m) \le \operatorname{pen}(m) - \operatorname{pen}_{\mathrm{id}}(m) \le A(m)$, then
$$R\bigl( \widehat{f}_{\widehat{m}} \bigr) - R(f^\star) - B(\widehat{m}) \le \inf_{m \in \mathcal{M}} \bigl\{ R\bigl( \widehat{f}_m \bigr) - R(f^\star) + A(m) \bigr\} .$$
Proof: For all $m \in \mathcal{M}$, by definition of $\widehat{m}$,
$$\widehat{R}_n\bigl( \widehat{f}_{\widehat{m}} \bigr) + \operatorname{pen}(\widehat{m}) \le \widehat{R}_n\bigl( \widehat{f}_m \bigr) + \operatorname{pen}(m) .$$
So, $\widehat{R}_n\bigl( \widehat{f}_{\widehat{m}} \bigr) + \operatorname{pen}(\widehat{m}) = R\bigl( \widehat{f}_{\widehat{m}} \bigr) - \operatorname{pen}_{\mathrm{id}}(\widehat{m}) + \operatorname{pen}(\widehat{m}) \ge R\bigl( \widehat{f}_{\widehat{m}} \bigr) - B(\widehat{m})$
and $\widehat{R}_n\bigl( \widehat{f}_m \bigr) + \operatorname{pen}(m) = R\bigl( \widehat{f}_m \bigr) - \operatorname{pen}_{\mathrm{id}}(m) + \operatorname{pen}(m) \le R\bigl( \widehat{f}_m \bigr) + A(m)$.
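To illustrate the penalization scheme, here is a sketch where the models $S_m$ are histogram classifiers with $m$ equal-width cells on $[0,1]$ and the penalty $\sqrt{m \log 2 / (2n)}$ mimics the finite-$\mathcal{X}$ bound seen earlier; both the family of models and this particular penalty are illustrative assumptions, not recommendations from the slides.

```python
import numpy as np

def fit_histogram_classifier(X, y, m):
    """ERM over S_m: majority label on each of m equal-width cells of [0, 1]."""
    cells = np.minimum((X * m).astype(int), m - 1)
    return np.array([int(y[cells == j].mean() > 0.5) if np.any(cells == j) else 0
                     for j in range(m)])

def predict(cell_labels, X):
    m = len(cell_labels)
    return cell_labels[np.minimum((X * m).astype(int), m - 1)]

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=n)
y = (rng.uniform(size=n) < 0.5 + 0.4 * np.sin(6 * X)).astype(int)

best_crit, best_m = np.inf, None
for m in (1, 2, 4, 8, 16, 32, 64):
    cell_labels = fit_histogram_classifier(X, y, m)
    emp_risk = np.mean(predict(cell_labels, X) != y)   # R_n(f_hat_m)
    pen = np.sqrt(m * np.log(2) / (2 * n))             # illustrative penalty pen(m)
    if emp_risk + pen < best_crit:
        best_crit, best_m = emp_risk + pen, m
print("selected number of cells:", best_m)
```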
Penalization for model selection

Structural risk minimization (Vapnik): $\operatorname{pen}_{\mathrm{id}}(m) \le \sup_{f \in S_m} \bigl\{ R(f) - \widehat{R}_n(f) \bigr\}$
⇒ can use the previous bounds [Koltchinskii, 2001, Bartlett et al., 2002, Fromont, 2007], but remainder terms $\ge C n^{-1/2}$ ⇒ no fast rates.
Tighter estimates of $\operatorname{pen}_{\mathrm{id}}(m)$ for fast rates: localization [Koltchinskii, 2006], resampling [Arlot, 2009]. See also Section 8 of [Boucheron et al., 2005].
Convexification of the classification problem

Convention: $Y_i \in \{-1, 1\}$, so that $\mathbf{1}_{y \ne y'} = \mathbf{1}_{yy' < 0} = \Phi_{0\text{-}1}(yy')$
$\min_f \frac{1}{n} \sum_{i=1}^n \Phi_{0\text{-}1}(Y_i f(X_i))$ is computationally heavy in general.
Classifier $f: \mathcal{X} \to \{-1, 1\}$ ⇒ prediction function $f: \mathcal{X} \to \mathbb{R}$ such that $\operatorname{sign}(f(x))$ will be used to classify $x$.
Risk $R_{0\text{-}1}(f) = \mathbb{E}[\Phi_{0\text{-}1}(Y f(X))]$ ⇒ $\Phi$-risk $R_\Phi(f) = \mathbb{E}[\Phi(Y f(X))]$ for some $\Phi: \mathbb{R} \to \mathbb{R}_+$
⇒ $\min_{f \in S} \frac{1}{n} \sum_{i=1}^n \Phi(Y_i f(X_i))$ with $S$ and $\Phi$ convex.
Examples of functions $\Phi$

- exponential: $\Phi(u) = e^{-u}$ ⇒ AdaBoost
- hinge: $\Phi(u) = \max\{1 - u, 0\}$ ⇒ support vector machines
- logistic/logit: $\Phi(u) = \log(1 + \exp(-u))$ ⇒ logistic regression
- truncated quadratic: $\Phi(u) = (\max\{1 - u, 0\})^2$

(Figure from [Bartlett et al., 2006].)
References: [Bartlett et al., 2006] and Section 4 of [Boucheron et al., 2005].
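The losses listed above are direct to transcribe as functions of the margin $u = y f(x)$; a minimal sketch (the evaluation grid is only there to print a few values):

```python
import numpy as np

def zero_one(u):            return (u < 0).astype(float)        # Phi_{0-1}
def exponential(u):         return np.exp(-u)                   # AdaBoost
def hinge(u):               return np.maximum(1.0 - u, 0.0)     # support vector machines
def logistic(u):            return np.log(1.0 + np.exp(-u))     # logistic regression
def truncated_quadratic(u): return np.maximum(1.0 - u, 0.0) ** 2

u = np.linspace(-2.0, 2.0, 5)
for phi in (zero_one, exponential, hinge, logistic, truncated_quadratic):
    print(f"{phi.__name__:20s}", np.round(phi(u), 3))
```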
Links between 0–1 and convex risks

Definition. $\Phi$ is classification-calibrated if, for any $x$ with $\eta(x) \ne 1/2$, $\operatorname{sign}(f^\star_\Phi(x)) = f^\star(x)$ for any $f^\star_\Phi \in \operatorname{argmin}_f R_\Phi(f)$.
Theorem ([Bartlett et al., 2006]). $\Phi$ convex is classification-calibrated ⇔ $\Phi$ is differentiable at $0$ and $\Phi'(0) < 0$. Then, a function $\psi$ exists such that
$$\psi\bigl( R_{0\text{-}1}(f) - R_{0\text{-}1}(f^\star_{0\text{-}1}) \bigr) \le R_\Phi(f) - R_\Phi(f^\star_\Phi) .$$
Examples:
- exponential loss: $\psi(\theta) = 1 - \sqrt{1 - \theta^2}$
- hinge loss: $\psi(\theta) = |\theta|$
- truncated quadratic: $\psi(\theta) = \theta^2$
Support Vector Machines: linear classifier

$\mathcal{X} = \mathbb{R}^d$, linear classifier: $\operatorname{sign}(\beta^\top x + \beta_0)$ with $\beta \in \mathbb{R}^d$, $\beta_0 \in \mathbb{R}$
$$\operatorname{argmin}_{\beta, \beta_0 \,/\, \|\beta\| \le R} \Bigl\{ \frac{1}{n} \sum_{i=1}^n \Phi_{\mathrm{hinge}}\bigl( Y_i (\beta^\top X_i + \beta_0) \bigr) \Bigr\}$$
$$\Leftrightarrow \operatorname{argmin}_{\beta, \beta_0} \Bigl\{ \frac{1}{n} \sum_{i=1}^n \Phi_{\mathrm{hinge}}\bigl( Y_i (\beta^\top X_i + \beta_0) \bigr) + \lambda \|\beta\|^2 \Bigr\}$$
up to some (random) reparametrization ($\lambda = \lambda(R; D_n)$).
⇒ quadratic program with $2n$ linear constraints.
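A sketch of the penalized formulation, minimized by plain subgradient descent on the regularized hinge loss; the step size, λ and the simulated data are illustrative assumptions, and a real SVM solver would use the quadratic program (or its dual) instead.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, lr=0.1, n_iter=500):
    """Minimize (1/n) sum_i hinge(y_i (beta^T x_i + beta_0)) + lam * ||beta||^2 by subgradient descent."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta + beta0)
        active = margins < 1                    # points where the hinge loss is positive
        grad_beta = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
        grad_beta0 = -y[active].sum() / n
        beta -= lr * grad_beta
        beta0 -= lr * grad_beta0
    return beta, beta0

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))   # labels in {-1, +1}
beta, beta0 = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ beta + beta0) == y))
```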
Support Vector Machines: linear classifier
(Figure from http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf)
Support Vector Machines: kernel trick

Positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $(k(X_i, X_j))_{i,j}$ is symmetric positive definite.
Reproducing Kernel Hilbert Space (RKHS) $\mathcal{F}$: space of functions $\mathcal{X} \to \mathbb{R}$ spanned by the $\Phi(x) = k(x, \cdot)$, $x \in \mathcal{X}$.
(Figure from http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf)
Theorem (Representer theorem). For any cost function $\ell$,
$$\min_{f \in \mathcal{F}} \Bigl\{ \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)) + \lambda \|f\|^2_{\mathcal{F}} \Bigr\}$$
is attained at some $f$ of the form $\sum_{i=1}^n \alpha_i k(X_i, \cdot)$
⇒ any algorithm for $\mathcal{X} = \mathbb{R}^d$ relying only on the dot products $(\langle X_i, X_j \rangle)_{i,j}$ can be kernelized.
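A sketch of the representer theorem in action: with a Gaussian kernel, the minimizer is searched in the form $f = \sum_i \alpha_i k(X_i, \cdot)$. For simplicity the squared loss is used (kernel ridge regression on the labels) rather than the hinge loss of an SVM, and the kernel width and λ are illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)                  # labels in {-1, +1}

lam = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)   # squared loss: (K + n*lam*I) alpha = y

X_test = rng.normal(size=(5, 2))
f_test = gaussian_kernel(X_test, X) @ alpha                     # f(x) = sum_i alpha_i k(X_i, x)
print(np.sign(f_test))
```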
Kernel examples

- linear kernel: $\mathcal{X} = \mathbb{R}^d$, $k(x, y) = \langle x, y \rangle$ ⇒ $\mathcal{F} = \mathbb{R}^d$ Euclidean
- polynomial kernel: $\mathcal{X} = \mathbb{R}^d$, $k(x, y) = (\langle x, y \rangle + 1)^r$ ⇒ $\mathcal{F} = \mathbb{R}_r[X_1, \dots, X_d]$
- Gaussian kernel: $\mathcal{X} = \mathbb{R}^d$, $k(x, y) = e^{-\|x - y\|^2 / (2\sigma^2)}$
- Laplace kernel: $\mathcal{X} = \mathbb{R}$, $k(x, y) = e^{-|x - y| / 2}$ ⇒ $\mathcal{F} = H^1$ (Sobolev space), $\|f\|^2_{\mathcal{F}} = \|f\|^2_{L^2} + \|f'\|^2_{L^2}$
- min kernel: $\mathcal{X} = [0, 1]$, $k(x, y) = \min\{x, y\}$ ⇒ $\mathcal{F} = \{ f \in C^0([0,1]),\, f' \in L^2,\, f(0) = 0 \}$, $\|f\|_{\mathcal{F}} = \|f'\|_{L^2}$
- intersection kernel: $\mathcal{X} = \{ p \in [0,1]^d \,/\, p_1 + \dots + p_d = 1 \}$, $k(p, q) = \sum_{i=1}^d \min(p_i, q_i)$, useful in computer vision [Hein and Bousquet, 2004, Maji et al., 2008]
- other kernels on non-vectorial data (graphs, words / DNA sequences, ...): see for instance [Schölkopf et al., 2004, Mahé et al., 2005, Shervashidze et al., 2011] and http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf
Support Vector Machines: results / references

Main mathematical tools for SVM analysis: probability in Hilbert spaces (RKHS), functional analysis.
Some references:
- Risk bounds: e.g., [Blanchard et al., 2008] (SVM as a penalization procedure for selecting among balls); see also [Boucheron et al., 2005, Section 4]
- Tutorials and lecture notes: [Burges, 1998], http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf
- Books: e.g., [Steinwart and Christmann, 2008, Hastie et al., 2009, Schölkopf and Smola, 2001]
Decision / classification tree

- piecewise constant predictor
- partition obtained by recursive splitting of $\mathcal{X} \subset \mathbb{R}^p$, orthogonally to one axis ($X_j < t$ vs. $X_j \ge t$)
- empirical risk minimization

(Figures from [Hastie et al., 2009])
CART (Classification And Regression Trees)

CART [Breiman et al., 1984]:
1. generate one large tree by recursively splitting the data (minimization of some impurity measure) ⇒ over-adapted to the data
2. pruning (⇔ model selection)

Model selection results: e.g., [Gey and Nédélec, 2005, Sauvé and Tuleau-Malot, 2011, Gey and Mary-Huard, 2011].
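A sketch of the elementary step of tree growing: pick the axis-aligned split $X_j < t$ minimizing the weighted Gini impurity of the two children. The Gini criterion and the exhaustive scan are one standard choice of impurity measure; stopping rules and pruning are omitted.

```python
import numpy as np

def gini(y):
    """Gini impurity 2 p (1 - p) of a set of labels in {0, 1}."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2 * p * (1 - p)

def best_split(X, y):
    """Best axis-aligned split (feature j, threshold t) minimizing the weighted child impurity."""
    n, d = X.shape
    best = (np.inf, None, None)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[0]:
                best = (score, j, t)
    return best   # (impurity after splitting, feature index, threshold)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] > 0.4).astype(int)
print(best_split(X, y))
```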
Random forests [Breiman, 2001]

$D_n$ ⇒ (bootstrap) $D_n^{\star 1}, D_n^{\star 2}, \dots, D_n^{\star K}$ ⇒ (tree building) $T_1, T_2, \dots, T_K$ ⇒ (voting) classifier
Various ways to build individual trees (subset of variables...)
Purely random forests: partitions independent from the training data.
Results on random forests (classification and regression)

Most theoretical results are on purely random forests (partitions independent from the training data: by data splitting or with simpler models):
- consistency result in classification [Biau et al., 2008]
- convergence rate and some combination with variable selection [Biau, 2012]
- from a single tree to a large forest: estimation error reduction (at least a constant factor) [Genuer, 2012]; approximation error reduction (A. & Genuer, work in progress) ⇒ sometimes an improvement in the learning rate

See also [Breiman, 2004, Genuer et al., 2008, Genuer et al., 2010].
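A sketch of the bootstrap-and-vote scheme pictured above, with decision stumps as (deliberately weak) base trees; the stump learner, K = 25 and the simulated data are illustrative assumptions, and Breiman's actual algorithm additionally subsamples the variables at each split.

```python
import numpy as np

def fit_stump(X, y):
    """Depth-1 tree: best split (j, t) with a majority label on each side."""
    n, d = X.shape
    best_err, best_stump = np.inf, None
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            pred_l = int(left.mean() > 0.5) if len(left) else 0
            pred_r = int(right.mean() > 0.5) if len(right) else 0
            err = (np.sum(left != pred_l) + np.sum(right != pred_r)) / n
            if err < best_err:
                best_err, best_stump = err, (j, t, pred_l, pred_r)
    return best_stump

def stump_predict(stump, X):
    j, t, pred_l, pred_r = stump
    return np.where(X[:, j] < t, pred_l, pred_r)

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = ((X[:, 0] + X[:, 1]) > 1).astype(int)

K = 25
stumps = []
for _ in range(K):                                   # bootstrap resamples D_n^{*k}
    idx = rng.integers(0, len(X), size=len(X))
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.mean([stump_predict(s, X) for s in stumps], axis=0)
forest_pred = (votes > 0.5).astype(int)              # majority vote over the K trees
print("training accuracy:", np.mean(forest_pred == y))
```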
Kinect: depth features ⇒ body part

Depth image ⇒ depth comparison features at each pixel ⇒ body part at each pixel ⇒ body part positions ⇒ ···
(Figure from [Shotton et al., 2011])
Outline
1. Introduction
2. Goals
3. Overfitting
4. Examples
5. Key issues
Hyperparameter choice

Always one or several parameters to choose: $k$ for $k$-NN, model selection, $\lambda$ for SVM, kernel bandwidth for SVM with a Gaussian kernel, tree size in random forests, ...
No universal choice possible (No Free Lunch Theorems apply) ⇒ must use some prior knowledge at some point.
Most general idea: data splitting (cross-validation) [Arlot and Celisse, 2010] (see the sketch below).
Sometimes specific approaches (penalization...): more efficient (for risk and computational cost) but also dependent on stronger assumptions.
Important to choose a good parametrization (e.g., for cross-validation, the optimal parameter should not vary too much from one sample to another).
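A sketch of the data-splitting idea for choosing $k$ in $k$-NN, using scikit-learn's cross-validation helpers (assuming scikit-learn is available; the candidate grid and the simulated data are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = (rng.uniform(size=400) < 0.5 + 0.4 * np.sin(X[:, 0])).astype(int)

# 5-fold cross-validation over a grid of candidate values of k
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 11, 21, 51]}, cv=5)
search.fit(X, y)
print("selected k:", search.best_params_["n_neighbors"])
```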
Computational complexity

Most classifiers are defined as $\widehat{f} \in \operatorname{argmin}_{f \in S} \mathcal{C}(f)$.
Optimization algorithms: usually faster (polynomial) when $\mathcal{C}$ and $S$ are convex. Often NP-hard with the 0–1 loss. Counterexample: interval classification [Kearns et al., 1997].