Overparametrization and the bias-variance dilemma

  1. Overparametrization and the bias-variance dilemma
     Johannes Schmidt-Hieber, joint work with Alexis Derumigny
     https://arxiv.org/abs/2006.00278.pdf

  2. Double descent and implicit regularization
     Overparametrization generalizes well ⇒ implicit regularization.

  3. Can we defy the bias-variance trade-off?
     Geman et al. '92: "the fundamental limitations resulting from the bias-variance dilemma apply to all nonparametric inference methods, including neural networks"
     Because of the double descent phenomenon, there is some doubt whether this statement is true.
     Recent work includes ...

  4. Lower bounds on the bias-variance trade-off
     Similar to minimax lower bounds, we want to establish a general mathematical framework to derive lower bounds on the bias-variance trade-off that hold for all estimators.
     Given such bounds we can answer many interesting questions:
     • are there methods (e.g. deep learning) that can defy the bias-variance trade-off?
     • lower bounds for the U-shaped curve of the classical bias-variance trade-off

  5. Related literature
     • Low '95 provides a complete characterization of the bias-variance trade-off for functionals in the Gaussian white noise model
     • Pfanzagl '99 shows that estimators of functionals satisfying an asymptotic unbiasedness property must have unbounded variance
     No general treatment of lower bounds for the bias-variance trade-off yet.

  6. Cramér-Rao inequality
     For parametric problems:
         $V(\theta) \;\ge\; \frac{(1 + B'(\theta))^2}{F(\theta)}$
     • V(θ) the variance
     • B'(θ) the derivative of the bias
     • F(θ) the Fisher information
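
To make the inequality concrete, here is a small Monte Carlo sketch (a toy example of my own, not taken from the talk): for i.i.d. N(θ, 1) data the Fisher information of the sample is F(θ) = n, and for the shrinkage estimator c·X̄ the bias is B(θ) = (c − 1)θ, so the bound reads Var ≥ c²/n and is in fact attained.

    import numpy as np

    # Biased Cramer-Rao bound  Var(theta_hat) >= (1 + B'(theta))^2 / F(theta)
    # for the shrinkage estimator theta_hat = c * mean(X) with X_1, ..., X_n
    # i.i.d. N(theta, 1), so that F(theta) = n and B(theta) = (c - 1) * theta.
    rng = np.random.default_rng(0)
    theta, n, c, reps = 1.0, 50, 0.8, 100_000

    X = rng.normal(theta, 1.0, size=(reps, n))
    theta_hat = c * X.mean(axis=1)

    empirical_var = theta_hat.var()
    bound = (1 + (c - 1)) ** 2 / n              # (1 + B'(theta))^2 / F(theta) = c^2 / n

    print(f"empirical variance of theta_hat : {empirical_var:.5f}")
    print(f"biased Cramer-Rao lower bound   : {bound:.5f}")
    # For this linear estimator the bound is attained, so the two numbers agree
    # up to Monte Carlo error.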

  7. Change of expectation inequalities
     • probability measures P_0, ..., P_M
     • χ²(P_0, ..., P_M) the matrix with entries
           $\chi^2(P_0, \dots, P_M)_{j,k} = \int \frac{dP_j \, dP_k}{dP_0} - 1$
     • any random variable X
     • Δ := (E_{P_1}[X] − E_{P_0}[X], ..., E_{P_M}[X] − E_{P_0}[X])^⊤
     Then,
         $\Delta^\top \, \chi^2(P_0, \dots, P_M)^{-1} \, \Delta \;\le\; \operatorname{Var}_{P_0}(X)$
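
A quick numerical check of the inequality in a toy Gaussian location family (my own choice of example): with P_0 = N(0, 1), P_j = N(μ_j, 1) and X the identity statistic, the χ² matrix has the closed form χ²_{j,k} = exp(μ_j μ_k) − 1, E_{P_j}[X] = μ_j and Var_{P_0}(X) = 1.

    import numpy as np

    # Check  Delta^T chi2(P_0, ..., P_M)^{-1} Delta  <=  Var_{P_0}(X)
    # for P_0 = N(0, 1), P_j = N(mu_j, 1) and X the identity statistic, so that
    # E_{P_j}[X] = mu_j, Var_{P_0}(X) = 1 and chi2_{j,k} = exp(mu_j * mu_k) - 1.
    mu = np.array([0.3, 0.5])                  # means of the alternatives P_1, P_2

    chi2 = np.exp(np.outer(mu, mu)) - 1.0      # chi-square divergence matrix
    delta = mu - 0.0                           # E_{P_j}[X] - E_{P_0}[X]

    lhs = delta @ np.linalg.solve(chi2, delta)
    print(f"Delta^T chi2^-1 Delta = {lhs:.4f}  <=  Var_P0(X) = 1.0")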

  8. Pointwise estimation
     Gaussian white noise model: we observe (Y_x)_x with
         $dY_x = f(x)\,dx + n^{-1/2}\,dW_x$
     • estimate f(x_0) for a fixed x_0
     • C^β(R) denotes the ball of Hölder β-smooth functions
     • for any estimator f̂(x_0), we obtain the bias-variance lower bound
           $\inf_{\hat f}\,\Big(\sup_{f \in C^\beta(R)} \big|\operatorname{Bias}_f\big(\hat f(x_0)\big)\big|\Big)^{1/\beta}\, \sup_{f \in C^\beta(R)} \operatorname{Var}_f\big(\hat f(x_0)\big) \;\gtrsim\; \frac{1}{n}$
     • the bound is attained by most estimators
     • generates the U-shaped curve
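
The sketch below illustrates both the U-shaped curve and the 1/n bound in a regression analogue of the white noise model; the test function, bandwidths and sample size are my own illustrative choices, not the construction used in the proof.

    import numpy as np

    # Regression analogue y_i = f(i/n) + eps_i, eps_i ~ N(0, 1), of the white
    # noise model, with the moving-average estimator
    #   f_hat(x0) = mean of the y_i with |i/n - x0| <= h.
    # For this f (Lipschitz, beta = 1) bias and variance are computed exactly.
    n, x0, sigma = 1000, 0.5, 1.0
    f = lambda x: np.abs(x - 0.5)              # kink at x0, so the bias is ~ h/2
    grid = np.arange(1, n + 1) / n

    print("    h     bias^2   variance      MSE    |bias|*var")
    for h in [0.005, 0.02, 0.05, 0.1, 0.2, 0.4]:
        window = np.abs(grid - x0) <= h
        bias = f(grid[window]).mean() - f(x0)  # deterministic part of the error
        var = sigma**2 / window.sum()          # roughly 1/(2nh)
        print(f"{h:7.3f}  {bias**2:8.5f}  {var:8.5f}  {bias**2 + var:8.5f}   {abs(bias) * var:.2e}")
    # The MSE column is U-shaped in h, while |bias|^{1/beta} * variance stays
    # close to 1/(4n) for every h, in line with the 1/n lower bound (beta = 1).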

  9. High-dimensional models
     Gaussian sequence model:
     • observe independent X_i ~ N(θ_i, 1), i = 1, ..., n
     • Θ(s) the space of s-sparse vectors (here: s ≤ √n / 2)
     • bias-variance decomposition
           $E_\theta\big[\|\hat\theta - \theta\|^2\big] \;=\; \underbrace{\big\|E_\theta[\hat\theta] - \theta\big\|^2}_{B^2(\theta)} \;+\; \sum_{i=1}^n \operatorname{Var}_\theta(\hat\theta_i)$
     • bias-variance lower bound: if B²(θ) ≤ γ s log(n/s²) for all θ ∈ Θ(s), then
           $\sum_{i=1}^n \operatorname{Var}_0(\hat\theta_i) \;\gtrsim\; n \Big(\frac{s^2}{n}\Big)^{4\gamma}$
     • the bound is matched (up to a factor in the exponent) by soft thresholding
     • the bias-variance trade-off is more extreme than the U-shape
     • results also extend to high-dimensional linear regression
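
Since soft thresholding is said to (nearly) match the bound, the following Monte Carlo sketch compares its squared bias on an s-sparse signal with its summed variance at θ = 0; the particular signal, threshold and simulation sizes are my own illustrative choices.

    import numpy as np

    # Gaussian sequence model X_i ~ N(theta_i, 1): squared bias on an s-sparse
    # signal versus the summed variance at theta = 0 for soft thresholding with
    # the sparsity-adapted threshold lambda = sqrt(2 log(n / s^2)).
    rng = np.random.default_rng(1)
    n, s, reps = 1000, 10, 5000
    lam = np.sqrt(2 * np.log(n / s**2))

    def soft_threshold(x, lam):
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    theta_sparse = np.zeros(n)
    theta_sparse[:s] = 4.0                     # an s-sparse signal (illustrative choice)

    for name, theta in [("s-sparse theta", theta_sparse), ("theta = 0     ", np.zeros(n))]:
        X = theta + rng.normal(size=(reps, n))
        est = soft_threshold(X, lam)
        sq_bias = np.sum((est.mean(axis=0) - theta) ** 2)   # B^2(theta)
        total_var = np.sum(est.var(axis=0))                 # sum_i Var_theta(theta_hat_i)
        print(f"{name}:  B^2(theta) = {sq_bias:7.2f}   sum of variances = {total_var:7.2f}")
    # The squared bias on the sparse signal is of the order s*log(n/s^2) (up to
    # a constant), while the summed variance at theta = 0 stays far below n,
    # illustrating the extreme form of the trade-off described above.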

  10. L2-loss
      Gaussian white noise model: we observe (Y_x)_x with
          $dY_x = f(x)\,dx + n^{-1/2}\,dW_x$
      • bias-variance decomposition
            $\operatorname{MISE}_f\big(\hat f\big) := E_f\big[\|\hat f - f\|_{L^2[0,1]}^2\big] = \int_0^1 \operatorname{Bias}_f^2\big(\hat f(x)\big)\,dx + \int_0^1 \operatorname{Var}_f\big(\hat f(x)\big)\,dx =: \operatorname{IBias}_f^2(\hat f) + \operatorname{IVar}_f(\hat f)$
      • is there a bias-variance trade-off between IBias_f²(f̂) and IVar_f(f̂)?
      • this turns out to be a very hard problem
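
The decomposition can be checked numerically; the sketch below does so for a moving-average estimator in a toy regression analogue of the white noise model (all choices are mine, for illustration only).

    import numpy as np

    # Regression analogue of the white noise model: y_i = f(i/n) + eps_i.
    # The MISE of a moving-average estimator is computed once directly and once
    # as IBias^2 + IVar (Riemann sums over the grid approximate the integrals).
    rng = np.random.default_rng(2)
    n, h, reps = 500, 0.05, 5000
    grid = np.arange(1, n + 1) / n
    f = np.sin(2 * np.pi * grid)

    # smoothing matrix: f_hat(x_j) = mean of the y_i with |x_i - x_j| <= h
    W = (np.abs(grid[:, None] - grid[None, :]) <= h).astype(float)
    W /= W.sum(axis=1, keepdims=True)

    Y = f + rng.normal(size=(reps, n))         # independent replications of the data
    f_hat = Y @ W.T                            # each row: one estimate on the grid

    mise = np.mean(np.sum((f_hat - f) ** 2, axis=1)) / n
    ibias2 = np.sum((f_hat.mean(axis=0) - f) ** 2) / n
    ivar = np.sum(f_hat.var(axis=0)) / n

    print(f"MISE = {mise:.5f}    IBias^2 + IVar = {ibias2 + ivar:.5f}")
    # The two numbers coincide, as the decomposition is an exact identity.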

  11. L2-loss (ctd.)
      • we propose a two-fold reduction scheme:
        • reduction to a simpler model
        • reduction to a smaller class of estimators
      • S^β(R) Sobolev space of β-smooth functions
      Bias-variance lower bound: for any estimator f̂,
          $\inf_{\hat f}\,\Big(\sup_{f \in S^\beta(R)} \operatorname{IBias}_f\big(\hat f\big)\Big)^{1/\beta}\, \sup_{f \in S^\beta(R)} \operatorname{IVar}_f\big(\hat f\big) \;\ge\; \frac{1}{8n}$
      • many estimators f̂ can be found with matching upper bound ≲ 1/n

  12. Mean absolute deviation
      • several extensions of the bias-variance trade-off have been proposed in the literature, e.g. for classification
      • the mean absolute deviation (MAD) of an estimator θ̂ is E_θ[|θ̂ − m|], with m either the mean or the median of θ̂
      • can the general framework be extended to lower bounds on the trade-off between bias and MAD?
      • we derived a corresponding change of expectation inequality
      • this can be used to obtain a partial answer for pointwise estimation in the Gaussian white noise model
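
The MAD itself is straightforward to approximate by simulation; the sketch below does this for the soft-threshold estimator of a single Gaussian mean (a purely illustrative choice, not the construction behind the partial answer mentioned above).

    import numpy as np

    # Monte Carlo approximation of the MAD  E_theta|theta_hat - m|  (m = mean or
    # median of theta_hat) for the soft-threshold estimator of a single Gaussian
    # mean, theta_hat = sign(X) * max(|X| - 1, 0) with X ~ N(theta, 1).
    rng = np.random.default_rng(3)
    lam, reps = 1.0, 200_000

    for theta in [0.0, 0.5, 2.0]:
        X = theta + rng.normal(size=reps)
        est = np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)
        bias = est.mean() - theta
        mad_mean = np.abs(est - est.mean()).mean()
        mad_median = np.abs(est - np.median(est)).mean()
        print(f"theta = {theta:3.1f}:  bias = {bias:+.3f}   "
              f"MAD about mean = {mad_mean:.3f}   MAD about median = {mad_median:.3f}")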

  13. Summary
      • general framework to derive bias-variance lower bounds
      • leads to matching bias-variance lower bounds for standard models in nonparametric and high-dimensional statistics
      • different types of the bias-variance trade-off occur
      • can machine learning methods defy the bias-variance trade-off? No, there are universal lower bounds that no method can avoid
      For details and more results, consult the preprint: https://arxiv.org/abs/2006.00278.pdf
