An improper estimator with optimal excess risk in misspecified density estimation and logistic regression


  1. An improper estimator with optimal excess risk in misspecified density estimation and logistic regression
Jaouad Mourtada∗, Stéphane Gaïffas†
StatMathAppli 2019, Fréjus
∗CMAP, École polytechnique; †LPSM, Université Paris-Diderot
On arXiv soon.

  2. Predictive density estimation

  3. Predictive density estimation: setting
• Space $\mathcal{Z}$; i.i.d. sample $Z_1^n = (Z_1, \dots, Z_n) \sim P^n$, with $P$ an unknown distribution on $\mathcal{Z}$.
• Given $Z_1^n$, predict a new sample $Z \sim P$ (probabilistic prediction).
• For $f$ a density on $\mathcal{Z}$ (w.r.t. a base measure $\mu$) and $z \in \mathcal{Z}$, the log-loss is $\ell(f, z) = -\log f(z)$, with risk $R(f) = \mathbb{E}[\ell(f, Z)]$ where $Z \sim P$.
• A family $\mathcal{F}$ of densities on $\mathcal{Z}$ is the statistical model.
• Goal: find a density $\hat g_n = \hat g_n(Z_1^n)$ with small excess risk $\mathbb{E}[R(\hat g_n)] - \inf_{f \in \mathcal{F}} R(f)$.

  4. On the logarithmic loss $\ell(f, z) = -\log f(z)$
• Standard loss function, connected to lossless compression.
• Minimizing the risk amounts to maximizing the joint probability attributed to a large test sample $(Z'_1, \dots, Z'_m) \sim P^m$:
$$\prod_{j=1}^m f(Z'_j) = \exp\Big( -\sum_{j=1}^m \ell(f, Z'_j) \Big) = \exp\big[ -m \, (R(f) + o(1)) \big].$$
• Letting $p = dP/d\mu$ be the true density,
$$R(f) - R(p) = \mathbb{E}_{Z \sim P}\Big[ \log \frac{p(Z)}{f(Z)} \Big] =: \mathrm{KL}(p, f) \geq 0.$$
The risk is minimized by the true density, $f^* = p$, and the excess risk is the Kullback-Leibler divergence (relative entropy).
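
To make this concrete, here is a small numerical sketch (my own illustration, not from the talk): with true density $p = N(0, 1)$ and model density $f = N(\mu, 1)$, the Monte Carlo estimate of the excess risk matches the closed-form $\mathrm{KL}(p, f) = \mu^2/2$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu = 0.7
z = rng.standard_normal(100_000)             # large test sample Z' ~ p = N(0, 1)

log_loss_f = -stats.norm(mu, 1).logpdf(z)    # ell(f, z) = -log f(z)
log_loss_p = -stats.norm(0, 1).logpdf(z)     # ell(p, z) = -log p(z)

mc_excess = (log_loss_f - log_loss_p).mean() # Monte Carlo estimate of R(f) - R(p)
print(mc_excess, mu**2 / 2)                  # both ~ 0.245 = KL(p, f)
```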

  5. Well-specified case: asymptotic optimality of the MLE
Here, assume that $p \in \mathcal{F}$ (well-specified model), with $\mathcal{F}$ a regular parametric family of dimension $d$. The Maximum Likelihood Estimator (MLE) $\hat f_n$, defined by
$$\hat f_n := \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f, Z_i) = \mathop{\mathrm{argmax}}_{f \in \mathcal{F}} \prod_{i=1}^n f(Z_i),$$
satisfies, as $n \to \infty$,
$$R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f) = \mathrm{KL}(p, \hat f_n) = \frac{d}{2n} + o\Big( \frac{1}{n} \Big).$$
The $d/(2n)$ rate is asymptotically optimal (locally asymptotically minimax; Hájek, Le Cam): the MLE is efficient.
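
A quick Monte Carlo check of this rate (my sketch, not from the slides) in the simplest case $\mathcal{F} = \{N(\theta, 1) : \theta \in \mathbb{R}\}$, $d = 1$: the MLE is the sample mean, and $\mathrm{KL}(p, \hat f_n) = (\hat\theta_n - \theta)^2/2$ has expectation $1/(2n)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 20_000
theta_hat = rng.normal(0.0, 1.0, (reps, n)).mean(axis=1)  # MLE = sample mean
print(np.mean(theta_hat**2 / 2), 1 / (2 * n))             # both ~ 0.005 = d/(2n)
```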

  6. Misspecified case (statistical learning viewpoint)
The assumption $p \in \mathcal{F}$ is restrictive and generally not satisfied: the model is chosen by the statistician, a simplification of the truth.
General misspecified case where $p \notin \mathcal{F}$: the model $\mathcal{F}$ is false but useful. Excess risk is a relevant objective.
The MLE $\hat f_n$ can degrade under model misspecification:
$$R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f) = \frac{d_{\mathrm{eff}}}{2n} + o\Big( \frac{1}{n} \Big),$$
where $d_{\mathrm{eff}} = \mathrm{Tr}[H^{-1} G]$, with $G = \mathbb{E}[\nabla \ell(f^*, Z) \nabla \ell(f^*, Z)^\top]$ and $H = \nabla^2 R(f^*)$.
Misspecified case: $d_{\mathrm{eff}}$ depends on $P$, and we may have $d_{\mathrm{eff}} \gg d$.
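
A hypothetical plug-in illustration of $d_{\mathrm{eff}} = \mathrm{Tr}[H^{-1} G]$ (my example, not from the talk): the unit-variance Gaussian linear model of a later slide, fitted on heteroskedastic data. For that loss, $\nabla \ell(\beta, (x, y)) = -(y - \langle \beta, x \rangle) x$, so $G = \mathbb{E}[(Y - \langle \beta^*, X \rangle)^2 X X^\top]$ and $H = \mathbb{E}[X X^\top]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 5
X = rng.standard_normal((n, d))
beta_star = np.ones(d)
noise = rng.standard_normal(n) * (1.0 + np.abs(X[:, 0]))  # variance depends on X
Y = X @ beta_star + noise                                 # model is misspecified

resid = Y - X @ beta_star                     # gradient of the loss is -resid * x
G = (X * resid[:, None] ** 2).T @ X / n       # plug-in estimate of E[grad grad^T]
H = X.T @ X / n                               # Hessian of the risk, E[X X^T]
d_eff = np.trace(np.linalg.solve(H, G))
print(d_eff, d)                               # d_eff comes out well above d here
```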

  7. Cumulative risk/regret and online-to-batch conversion
Well-established theory (Merhav & Feder 1998, Cesa-Bianchi & Lugosi 2006) for controlling the cumulative excess risk
$$\mathrm{Regret}_n = \sum_{t=1}^n \ell(\hat g_{t-1}, Z_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ell(f, Z_t).$$
For a bounded family $\mathcal{F}$: minimax regret of $(d \log n)/2 + O(1)$. This implies an excess risk of $(d \log n)/(2n) + O(1/n)$ for the averaged predictor
$$\bar g_n = \frac{1}{n+1} \sum_{t=0}^n \hat g_t.$$
⊕ Valid under model misspecification (distribution-free);
⊖ Suboptimal rate for individual risk, inefficient predictor. Infinite regret for unbounded families (e.g. Gaussian); computational complexity.
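
A toy instance of the conversion (my construction; the model and helper names are illustrative): the per-round predictors $\hat g_t$ are Gaussian plug-ins fitted on the first $t$ points, and the batch predictor $\bar g_n$ averages their densities, giving a mixture that generally lies outside $\mathcal{F}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
Z = rng.normal(2.0, 1.0, size=200)

def g_t(t):
    """Per-round predictor after Z_1..Z_t: N(running mean, 1); g_0 = N(0, 1)."""
    return stats.norm(Z[:t].mean() if t > 0 else 0.0, 1.0)

def g_bar(z, n):
    """Averaged predictor: the mixture (1/(n+1)) sum_t g_t, not a member of F."""
    return np.mean([g_t(t).pdf(z) for t in range(n + 1)], axis=0)

print(g_bar(np.array([1.5, 2.0, 2.5]), len(Z)))
```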

  8. The Sample Minimax Predictor

  9. The Sample Minimax Predictor (SMP)
We introduce the Sample Minimax Predictor, given by:
$$\hat f_n = \mathop{\mathrm{argmin}}_{g} \sup_{z \in \mathcal{Z}} \big[ \ell(g, z) - \ell(\hat f_n^z, z) \big] = \frac{\hat f_n^z(z)}{\int_{\mathcal{Z}} \hat f_n^{z'}(z') \, \mu(dz')},$$
where
$$\hat f_n^z = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \Big\{ \sum_{i=1}^n \ell(f, Z_i) + \ell(f, z) \Big\}.$$
• In general, $\hat f_n \notin \mathcal{F}$: improper predictor.
• Conditional variant $\hat f_n(y \mid x)$ for conditional density estimation.
• Regularized variant.
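
A numerical sketch of the SMP in the 1-D Gaussian location model $\mathcal{F} = \{N(\theta, 1) : \theta \in \mathbb{R}\}$ (my toy example, not from the talk): for each candidate point $z$, refit the MLE on the sample augmented with $z$, evaluate the refit density at $z$, then normalize over $z$.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(0)
Z = rng.normal(1.0, 1.0, size=20)
n = len(Z)

z_grid = np.linspace(-6.0, 8.0, 2001)
theta_z = (Z.sum() + z_grid) / (n + 1)         # MLE refit on the sample plus {z}
unnorm = stats.norm(theta_z, 1.0).pdf(z_grid)  # \hat f_n^z(z), one value per z
smp = unnorm / trapezoid(unnorm, z_grid)       # normalize: the SMP density

print(trapezoid(unnorm, z_grid))               # normalizing constant, ~ 1.05 here
```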

  10. Excess risk bound for the SMP
$$\hat f_n(z) = \frac{\hat f_n^z(z)}{\int_{\mathcal{Z}} \hat f_n^{z'}(z') \, \mu(dz')} \tag{1}$$
Theorem (M., Gaïffas, Scornet, 2019). The SMP $\hat f_n$ of (1) satisfies:
$$\mathbb{E}\big[ R(\hat f_n) \big] - \inf_{f \in \mathcal{F}} R(f) \leq \mathbb{E}_{Z_1^n}\Big[ \log \int_{\mathcal{Z}} \hat f_n^z(z) \, \mu(dz) \Big]. \tag{2}$$
• Analogous excess risk bound in the conditional case.
• Typically a simple $d/n + o(n^{-1})$ bound for standard models (Gaussian, multinomial), even in the misspecified case.
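
For the Gaussian location sketch above, the right-hand side of (2) can be evaluated exactly (my derivation for that toy case): $\hat f_n^z(z) = \varphi\big(n(z - \bar Z_n)/(n+1)\big)$, so $\int \hat f_n^z(z)\, dz = (n+1)/n$ and the bound equals $\log(1 + 1/n) \approx 1/n$, a $d/n$-type rate with $d = 1$.

```python
import numpy as np

n = 20
print(np.log(1 + 1 / n))   # exact value of the bound (2) in the toy model, ~ 0.0488
```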

  11. Application: Gaussian linear model

  12. Gaussian linear model
• Conditional density estimation problem: probabilistic prediction of a response $Y \in \mathbb{R}$ given covariates $X \in \mathbb{R}^d$. The risk of a conditional density $f(y \mid x)$ is $R(f) = \mathbb{E}[\ell(f(X), Y)] = \mathbb{E}[-\log f(Y \mid X)]$.
• $\mathcal{F} = \{ f_\beta : \beta \in \mathbb{R}^d \}$ with $f_\beta(\cdot \mid x) = N(\langle \beta, x \rangle, 1)$, so that $\ell(f_\beta, (x, y)) = \frac{1}{2} (y - \langle \beta, x \rangle)^2$ (up to an additive constant).
• The MLE is $\hat f_n(\cdot \mid x) = N(\langle \hat\beta_n, x \rangle, 1)$, with $\hat\beta_n$ the ordinary least squares estimator:
$$\hat\beta_n = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^d} \sum_{i=1}^n (Y_i - \langle \beta, X_i \rangle)^2 = \Big( \sum_{i=1}^n X_i X_i^\top \Big)^{-1} \sum_{i=1}^n Y_i X_i.$$
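
A quick check (my sketch) that the Gaussian-model MLE is ordinary least squares via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.standard_normal((n, d))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # OLS via the normal equations
print(beta_hat)                                # close to (1, -2, 0.5)
```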

  13. SMP for the Gaussian linear model
Let $\Sigma = \mathbb{E}[X X^\top]$ and $\hat\Sigma_n = n^{-1} \sum_{i=1}^n X_i X_i^\top$ denote the true and sample covariance matrices.
Theorem (distribution-free excess risk for SMP). The SMP is $\hat f_n(\cdot \mid x) = N\big( \langle \hat\beta_n, x \rangle, \, 1 + \langle (n \hat\Sigma_n)^{-1} x, x \rangle \big)$. If $\mathbb{E}[Y^2] < +\infty$, then
$$\mathbb{E}\big[ R(\hat f_n) \big] - \inf_{\beta \in \mathbb{R}^d} R(\beta) \leq \mathbb{E}\Big[ -\log\big( 1 - \underbrace{\langle (n \hat\Sigma_n + X X^\top)^{-1} X, X \rangle}_{\text{"leverage score"}} \big) \Big],$$
which is twice the minimax risk in the well-specified case.
• Smaller than $\mathbb{E}[\mathrm{Tr}(\Sigma^{1/2} \hat\Sigma_n^{-1} \Sigma^{1/2})]/n \sim d/n$ under a regularity assumption on $P_X$ ($\Sigma^{-1/2} X$ not too close to any hyperplane).
• By contrast, for the MLE: $\mathbb{E}[R(\hat f_n)] - R(\beta^*) \sim \mathbb{E}\big[ (Y - \langle \beta^*, X \rangle)^2 \, \| \Sigma^{-1/2} X \|^2 \big] / (2n)$.
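
A minimal sketch of the Gaussian SMP prediction (my implementation; variable names are mine): same mean as the MLE, with variance inflated by $\langle (n \hat\Sigma_n)^{-1} x, x \rangle$, so the predictive distribution widens exactly where the OLS fit is least reliable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.standard_normal((n, d))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # MLE / OLS, as on the previous slide

def smp_predict(x):
    """SMP predictive law at x: N(<beta_hat, x>, 1 + <(n Sigma_n)^{-1} x, x>)."""
    var = 1.0 + x @ np.linalg.solve(X.T @ X, x)  # n * Sigma_n = X^T X
    return x @ beta_hat, var

print(smp_predict(np.array([1.0, 0.0, -1.0])))   # variance exceeds 1 by the leverage of x
```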

  14. Application to logistic regression

  15. Logistic regression: setting
• Binary label $Y \in \{-1, 1\}$, covariates $X \in \mathbb{R}^d$. The risk of a conditional density $f(\pm 1 \mid x)$ is $R(f) = \mathbb{E}[-\log f(Y \mid X)]$.
• $\mathcal{F} = \{ f_\beta : \beta \in \mathbb{R}^d \}$, a family of conditional densities of $Y \mid X$:
$$f_\beta(y \mid x) = P_\beta(Y = y \mid X = x) = \sigma(y \langle \beta, x \rangle), \quad y \in \{-1, 1\},$$
with $\sigma(u) = e^u / (1 + e^u)$ the sigmoid function. For $\beta, x \in \mathbb{R}^d$ and $y \in \{\pm 1\}$,
$$\ell(\beta, (x, y)) = \log\big( 1 + e^{-y \langle \beta, x \rangle} \big).$$
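
The model and loss, written out (standard definitions; the helper names are mine):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))              # sigma(u) = e^u / (1 + e^u)

def logistic_loss(beta, x, y):
    """ell(beta, (x, y)) = log(1 + exp(-y <beta, x>)), with y in {-1, +1}."""
    return np.logaddexp(0.0, -y * (x @ beta))    # numerically stable form

print(logistic_loss(np.array([1.0, -1.0]), np.array([2.0, 0.5]), +1))  # ~ 0.201
```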

  16. Limitations of MLE and proper (plug-in) predictors
• The MLE $f_{\hat\beta_n}(y \mid x) = \sigma(y \langle \hat\beta_n, x \rangle)$ is not fully satisfying for prediction:
– Ill-defined when the sets $\{X_i : Y_i = 1\}$ and $\{X_i : Y_i = -1\}$ are linearly separated; yields 0 or 1 probabilities (⇒ infinite risk).
– Risk $d_{\mathrm{eff}}/(2n)$; if $\|X\| \leq R$, then $d_{\mathrm{eff}}$ may be as large as $d \, e^{\|\beta^*\| R}$ (Bach & Moulines 2013; see also Ostrovskii & Bach 2018).
• Lower bound (Hazan et al., 2014) of $\min\big( BR/\sqrt{n}, \, d e^{BR}/n \big)$ for any proper (within-class) predictor.
• Better $O\big( d \log(BRn)/n \big)$ through online-to-batch conversion with an improper predictor (Foster et al., 2018), but computationally expensive (posterior sampling).

  17. Sample Minimax Predictor for logistic regression
The SMP writes:
$$\hat f_n(y \mid x) = \frac{ \hat f_n^{(x,y)}(y \mid x) }{ \hat f_n^{(x,-1)}(-1 \mid x) + \hat f_n^{(x,1)}(1 \mid x) },$$
where $\hat f_n^{(x,y)}$ is the MLE obtained when adding $(x, y)$ to the sample.
• Well-defined, even in the separated case; invariant under linear transformations of $X$ ("prior-free"). Never outputs a 0 probability.
• Computationally reasonable: the prediction is obtained by solving two logistic regressions (replaces sampling by optimization).
• NB: still more expensive than plain logistic regression (the logistic regression solution must be updated for each test input $x$).
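
A minimal sketch of the logistic SMP (my own implementation, not the authors' code); to keep each refit well-defined even under separation, it uses the ridge-penalized variant from the next slide:

```python
import numpy as np
from scipy.optimize import minimize

def fit(X, y, lam):
    """Penalized logistic MLE: argmin_b sum_i log(1 + e^{-y_i <b, x_i>}) + lam |b|^2 / 2."""
    def obj(b):
        return np.logaddexp(0.0, -y * (X @ b)).sum() + 0.5 * lam * b @ b
    return minimize(obj, np.zeros(X.shape[1]), method="L-BFGS-B").x

def smp_predict(X, y, x, lam):
    """P(Y = +1 | x) under the SMP: one refit per virtual label of the test point x."""
    prob = {}
    for label in (-1.0, 1.0):
        Xa, ya = np.vstack([X, x]), np.append(y, label)        # add (x, label)
        b = fit(Xa, ya, lam)
        prob[label] = 1.0 / (1.0 + np.exp(-label * (x @ b)))   # f^(x,label)(label | x)
    return prob[1.0] / (prob[1.0] + prob[-1.0])

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.sign(X @ np.array([1.5, -1.0]) + 0.3 * rng.standard_normal(100))
print(smp_predict(X, y, np.array([1.0, 1.0]), lam=1.0))
```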

  18. Excess risk bound for the penalized SMP
Theorem (M., Gaïffas, Scornet, 2019). Assume that $\|X\| \leq R$ a.s., and let $\lambda = 2R^2/(n+1)$. Then the logistic SMP with penalty $\lambda \|\beta\|^2/2$ satisfies, for every $\beta \in \mathbb{R}^d$:
$$\mathbb{E}\big[ R(\hat f_{\lambda, n}) \big] - R(\beta) \leq 3 \, \frac{d + \|\beta\|^2 R^2}{n}. \tag{3}$$
Remark. Fast rate under no assumption on $\mathcal{L}(Y \mid X)$.
If $R = O(\sqrt{d})$ and $\|\beta^*\| = O(1)$, this gives the optimal $O(d/n)$ excess risk.
Recall the $\min\big( BR/\sqrt{n}, \, d e^{BR}/n \big)$ lower bound for proper predictors (incl. ridge logistic regression). Also better than the $O(d \log n / n)$ rate from online-to-batch conversion, but with a worse dependence on $\|\beta^*\|$.
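
Plugging numbers into the theorem (illustrative only, using the bound as reconstructed above): with $n = 1000$, $R = 1$, $d = 10$ and $\|\beta\| = 1$, the penalty level and the bound are:

```python
n, R, d, beta_norm = 1000, 1.0, 10, 1.0
lam = 2 * R**2 / (n + 1)                       # penalty level from the theorem
bound = 3 * (d + beta_norm**2 * R**2) / n      # right-hand side of (3)
print(lam, bound)                              # ~ 0.002 and ~ 0.033
```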

  19. Conclusion

  20. Conclusion
The Sample Minimax Predictor is a procedure for predictive density estimation.
• General excess risk bound, which typically does not degrade under model misspecification.
• Gaussian linear model: tight bound, within a factor of 2 of minimax.
• Logistic regression: a simple predictor that bypasses the lower bounds for proper (plug-in) predictors (removes the exponential factor for worst-case distributions).
Next directions:
• Other GLMs?
• Online logistic regression (individual sequences)?
• Application to statistical learning with other loss functions?

  21. Thank you!
