Suboptimality of Penalized Empirical Risk Minimization in Classification. Guillaume Lecué, Université Paris 6. COLT 2007, June 13.
General Framework. Aggregation Procedures. Optimality in classification.

Motivation.
M prior estimators ("weak" estimators): $f_1, \dots, f_M$.
n observations: $D_n$.
Aim: construct a new estimator that is approximately as good as the best "weak" estimator. This is an aggregation method, or aggregate.
Examples.
Adaptation. Observations: $D_{m+n}$. Estimation: $D_m$ → non-adaptive estimators $f_1, \dots, f_M$. Learning: $D_{(n)}$ → aggregate $\tilde f_n$ (adaptive).
Estimation. $\epsilon$-net: $f_1, \dots, f_M$ (functions). Learning: $D_n$ → aggregate $\tilde f_n$.
Model of classification.
$(\mathcal{X}, \mathcal{A})$ a measurable space, $(X, Y) \sim \pi$ valued in $\mathcal{X} \times \{-1, 1\}$.
$D_n = ((X_1, Y_1), \dots, (X_n, Y_n))$: n i.i.d. observations.
Problem of prediction: $x \in \mathcal{X}$ → label $y \in \{-1, 1\}$?
$f : \mathcal{X} \longmapsto \{-1, 1\}$: prediction rule.
Bayes risk: $A_0(f) = P[f(X) \neq Y]$.
Bayes rule: $f^*(x) = \mathrm{Sign}(2\eta(x) - 1)$, where $\eta(x) = P[Y = 1 \mid X = x]$.
$A_0^* \overset{\mathrm{def}}{=} \min_f A_0(f) = A_0(f^*)$.
Prediction → estimation: estimation of $f^*$.
Excess risk: $A_0(f) - A_0^*$.
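The definitions of the Bayes rule and the excess risk can be checked on a small finite example. The sketch below is only an illustration: the three-point space, the marginal `px`, and the values of `eta` are hypothetical, and the tie $\eta(x) = 1/2$ is sent to $+1$ by assumption.

```python
def bayes_rule(eta):
    """f*(x) = Sign(2*eta(x) - 1) on a finite space; ties go to +1 (assumed convention)."""
    return {x: (1 if 2 * p - 1 >= 0 else -1) for x, p in eta.items()}


def risk(f, eta, px):
    """A_0(f) = P[f(X) != Y] for a rule f given as a dict x -> label."""
    total = 0.0
    for x, p in eta.items():
        # Predicting +1 is wrong with probability 1 - eta(x); predicting -1 with eta(x).
        err = (1 - p) if f[x] == 1 else p
        total += px[x] * err
    return total


# Hypothetical toy distribution on X = {"a", "b", "c"}.
px = {"a": 0.5, "b": 0.3, "c": 0.2}    # marginal law of X
eta = {"a": 0.9, "b": 0.4, "c": 0.5}   # eta(x) = P[Y = 1 | X = x]

f_star = bayes_rule(eta)
bayes_risk = risk(f_star, eta, px)                 # A_0^* = A_0(f*)
f_const = {x: 1 for x in px}                       # rule that always predicts +1
excess_risk = risk(f_const, eta, px) - bayes_risk  # A_0(f) - A_0^*
```

Any other rule on this toy space has nonnegative excess risk, which is the sense in which $f^*$ is the target of estimation.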
Model of classification.
$(f : \mathcal{X} \longmapsto \mathbb{R})$ → risk $A_0(f) = E[\phi_0(Y f(X))]$, where
$\phi_0(x) = 1\!\mathrm{I}(x \le 0)$: classical loss or 0-1 loss.
$\phi_1(x) = \max(0, 1 - x)$: hinge loss (SVM loss).
$x \longmapsto \log_2(1 + \exp(-x))$: "Logit-Boosting" loss.
$x \longmapsto \exp(-x)$: exponential Boosting loss.
$x \longmapsto (1 - x)^2$: quadratic loss.
$x \longmapsto \max(0, 1 - x)^2$: 2-norm soft margin loss.
$\phi$-risk: $A^{\phi}(f) = E[\phi(Y f(X))]$, $A^{\phi}_{*} \overset{\mathrm{def}}{=} \inf_f A^{\phi}(f) = A^{\phi}(f^{\phi}_{*})$.
Excess $\phi$-risk: $A^{\phi}(f) - A^{\phi}_{*}$.
Empirical $\phi$-risk: $A^{\phi}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f(X_i))$.
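The losses listed above and the empirical $\phi$-risk translate directly into code. This is a minimal sketch: the dictionary keys, the five-point sample, and the rule `f` are hypothetical choices made for illustration.

```python
import math

# The six losses from the slide, as functions of the margin x = y * f(x).
LOSSES = {
    "zero_one": lambda x: 1.0 if x <= 0 else 0.0,
    "hinge": lambda x: max(0.0, 1.0 - x),
    "logit": lambda x: math.log2(1.0 + math.exp(-x)),
    "exponential": lambda x: math.exp(-x),
    "quadratic": lambda x: (1.0 - x) ** 2,
    "soft_margin_2": lambda x: max(0.0, 1.0 - x) ** 2,
}


def empirical_risk(phi, f, sample):
    """A_n^phi(f) = (1/n) * sum_i phi(Y_i * f(X_i))."""
    return sum(phi(y * f(x)) for x, y in sample) / len(sample)


# Hypothetical sample of (x, y) pairs with labels in {-1, +1}.
sample = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1), (0.8, -1)]


def f(x):
    """Hypothetical real-valued rule."""
    return x


risks = {name: empirical_risk(phi, f, sample) for name, phi in LOSSES.items()}
```

With these definitions every loss equals 1 at margin 0, and on this sample each convex loss gives an empirical risk at least as large as the empirical 0-1 risk.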
Selectors.
$\phi : \mathbb{R} \longmapsto \mathbb{R}$ a loss, $\mathcal{F}_0 = \{f_1, \dots, f_M\} \subset \mathcal{F}$ a dictionary.
Empirical Risk Minimization (ERM) (Vapnik, Chervonenkis, ...):
$\tilde f_n^{ERM} \in \mathrm{Arg}\min_{f \in \mathcal{F}_0} A_n^{\phi}(f)$.
Penalized Empirical Risk Minimization (pERM):
$\tilde f_n^{pERM} \in \mathrm{Arg}\min_{f \in \mathcal{F}_0} [A_n^{\phi}(f) + \mathrm{pen}(f)]$,
where pen is a penalty function (Barron, Bartlett, Birgé, Boucheron, Koltchinskii, Lugosi, Massart, ...).
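Both selectors pick a single element of the dictionary, and they differ only through the penalty. The sketch below assumes the hinge loss, a hypothetical three-element dictionary, and an arbitrary illustrative penalty; none of these choices come from the slides. When the minimizer is not unique, `min` returns the first one in dictionary order.

```python
def hinge(x):
    """Hinge loss phi_1(x) = max(0, 1 - x)."""
    return max(0.0, 1.0 - x)


def empirical_risk(phi, f, sample):
    """A_n^phi(f) = (1/n) * sum_i phi(Y_i * f(X_i))."""
    return sum(phi(y * f(x)) for x, y in sample) / len(sample)


def erm(dictionary, phi, sample):
    """Name of an element of Arg min over the dictionary of A_n^phi(f)."""
    return min(dictionary, key=lambda k: empirical_risk(phi, dictionary[k], sample))


def perm(dictionary, phi, sample, pen):
    """Name of an element of Arg min over the dictionary of A_n^phi(f) + pen(f)."""
    return min(
        dictionary,
        key=lambda k: empirical_risk(phi, dictionary[k], sample) + pen[k],
    )


# Hypothetical sample and dictionary F_0 = {f_1, f_2, f_3}.
sample = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1), (0.8, -1)]
dictionary = {
    "identity": lambda x: x,
    "constant": lambda x: 1.0,
    "sign": lambda x: 1.0 if x > 0 else -1.0,
}
# Hypothetical penalty values, chosen only to show that pERM can differ from ERM.
pen = {"identity": 0.0, "constant": 0.0, "sign": 0.5}

erm_choice = erm(dictionary, hinge, sample)
perm_choice = perm(dictionary, hinge, sample, pen)
```

On this toy data the two selectors return different dictionary elements, because the penalty on one element outweighs its smaller empirical risk.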
Aggregation methods with exponential weights.
$\phi : \mathbb{R} \longmapsto \mathbb{R}$ a loss, $\mathcal{F}_0 = \{f_1, \dots, f_M\} \subset \mathcal{F}$ a dictionary.
Aggregate with Exponential Weights (AEW):
$\tilde f_{n,T}^{AEW} = \sum_{f \in \mathcal{F}_0} w_T^{(n)}(f)\, f$, where $w_T^{(n)}(f) = \frac{\exp(-nT A_n^{\phi}(f))}{\sum_{g \in \mathcal{F}_0} \exp(-nT A_n^{\phi}(g))}$,
$T^{-1}$: temperature parameter.
Cumulative Aggregate with Exponential Weights (CAEW) (Catoni, Yang, ...):
$\tilde f_{n,T}^{CAEW} = \frac{1}{n} \sum_{k=1}^{n} \tilde f_{k,T}^{AEW}$.
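The AEW and CAEW formulas can be sketched as follows. This is an illustration under stated assumptions: the hinge loss, the sample, and the dictionary are hypothetical, and $\tilde f_{k,T}^{AEW}$ is taken to be the AEW built from the first k observations. Subtracting the largest exponent before exponentiating is a numerical-stability step that leaves the weights unchanged.

```python
import math


def hinge(x):
    """Hinge loss phi_1(x) = max(0, 1 - x)."""
    return max(0.0, 1.0 - x)


def empirical_risk(phi, f, sample):
    """A_n^phi(f) = (1/n) * sum_i phi(Y_i * f(X_i))."""
    return sum(phi(y * f(x)) for x, y in sample) / len(sample)


def aew_weights(dictionary, phi, sample, T):
    """w_T^(n)(f) proportional to exp(-n * T * A_n^phi(f)); T is the inverse temperature."""
    n = len(sample)
    scores = [-n * T * empirical_risk(phi, f, sample) for f in dictionary]
    top = max(scores)  # shift for numerical stability; cancels in the ratio
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]


def caew_weights(dictionary, phi, sample, T):
    """Average of the AEW weights computed on the first k observations, k = 1..n."""
    n = len(sample)
    acc = [0.0] * len(dictionary)
    for k in range(1, n + 1):
        w = aew_weights(dictionary, phi, sample[:k], T)
        acc = [a + wi for a, wi in zip(acc, w)]
    return [a / n for a in acc]


def aggregate(dictionary, weights):
    """Convex combination x -> sum_f w(f) * f(x)."""
    return lambda x: sum(w * f(x) for w, f in zip(weights, dictionary))


# Hypothetical sample and dictionary (same toy setting as a finite F_0).
sample = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1), (0.8, -1)]
dictionary = [lambda x: x, lambda x: 1.0, lambda x: 1.0 if x > 0 else -1.0]

w_aew = aew_weights(dictionary, hinge, sample, T=1.0)
w_caew = caew_weights(dictionary, hinge, sample, T=1.0)
f_aew = aggregate(dictionary, w_aew)
f_caew = aggregate(dictionary, w_caew)
```

Because both aggregates are convex combinations of the dictionary, CAEW can be represented by averaging the AEW weight vectors, which is what `caew_weights` does. As T grows, the AEW weights concentrate on the empirical risk minimizer.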