Overparametrization and the bias-variance dilemma
Johannes Schmidt-Hieber
joint work with Alexis Derumigny
https://arxiv.org/abs/2006.00278.pdf
double descent and implicit regularization
• overparametrization generalizes well ⇒ implicit regularization
can we defy the bias-variance trade-off?
Geman et al. '92: "the fundamental limitations resulting from the bias-variance dilemma apply to all nonparametric inference methods, including neural networks"
Because of the double descent phenomenon, there is some doubt whether this statement is true.
Recent work includes ...
lower bounds on the bias-variance trade-off
Similar to minimax lower bounds, we want to establish a general mathematical framework to derive lower bounds on the bias-variance trade-off that hold for all estimators.
Given such bounds, we can answer many interesting questions:
• are there methods (e.g. deep learning) that can defy the bias-variance trade-off?
• lower bounds for the U-shaped curve of the classical bias-variance trade-off
related literature
• Low '95 provides a complete characterization of the bias-variance trade-off for functionals in the Gaussian white noise model
• Pfanzagl '99 shows that estimators of functionals satisfying an asymptotic unbiasedness property must have unbounded variance
No general treatment of lower bounds for the bias-variance trade-off yet.
Cramér-Rao inequality
for parametric problems:
$$V(\theta) \;\ge\; \frac{(1 + B'(\theta))^2}{F(\theta)}$$
• V(θ) the variance
• B'(θ) the derivative of the bias
• F(θ) the Fisher information
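A quick worked example, not from the slides, of how the biased Cramér-Rao bound already encodes a trade-off: a shrinkage estimator of a Gaussian mean (the setup and the shrinkage factor c are illustrative assumptions).

```latex
% Illustrative setup (assumed): X_1, ..., X_n iid N(theta, 1) and \hat\theta = c \bar X, 0 < c <= 1.
\[
B(\theta) = \mathbb{E}_\theta[c\bar X] - \theta = (c-1)\theta,
\qquad B'(\theta) = c-1,
\qquad F(\theta) = n.
\]
\[
\text{Cram\'er--Rao:}\quad
V(\theta) \ \ge\ \frac{(1+B'(\theta))^2}{F(\theta)} \ =\ \frac{c^2}{n},
\qquad \text{attained, since } \operatorname{Var}(c\bar X) = \frac{c^2}{n}.
\]
% Shrinking (c < 1) lowers the admissible variance but introduces bias (c-1)\theta:
% the inequality alone forces a one-parameter bias-variance trade-off.
```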
change of expectation inequalities
• probability measures P_0, ..., P_M
• χ²(P_0, ..., P_M) the matrix with entries
$$\chi^2(P_0, \ldots, P_M)_{j,k} = \int \frac{dP_j \, dP_k}{dP_0} - 1$$
• any random variable X
• Δ := (E_{P_1}[X] − E_{P_0}[X], ..., E_{P_M}[X] − E_{P_0}[X])^⊤
then,
$$\Delta^\top \chi^2(P_0, \ldots, P_M)^{-1} \Delta \;\le\; \operatorname{Var}_{P_0}(X)$$
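A minimal numerical sanity check, assuming the M = 1 special case of the inequality reads (E_{P_1}[X] − E_{P_0}[X])² ≤ χ²(P_1, P_0) · Var_{P_0}(X); the Gaussian pair, the choice of X as the identity, and the sample size are illustrative, not taken from the paper.

```python
import numpy as np

# Sanity check of the M = 1 case (an assumed special case of the slide's inequality):
#   (E_{P1}[X] - E_{P0}[X])^2  <=  chi^2(P1, P0) * Var_{P0}(X)
# with P0 = N(0, 1), P1 = N(mu, 1) and X the identity map.

rng = np.random.default_rng(0)

for mu in [0.1, 0.5, 1.0, 2.0]:
    lhs = mu**2                    # (E_{P1}[X] - E_{P0}[X])^2
    chi2 = np.exp(mu**2) - 1.0     # chi^2-divergence between N(mu,1) and N(0,1), closed form
    var0 = 1.0                     # Var_{P0}(X)

    # Monte Carlo confirmation of the chi^2 value: chi^2 = E_{P0}[(dP1/dP0)^2] - 1
    z = rng.standard_normal(200_000)          # draws from P0
    lik_ratio = np.exp(mu * z - mu**2 / 2)    # dP1/dP0 evaluated at the draws
    chi2_mc = np.mean(lik_ratio**2) - 1.0

    print(f"mu={mu:3.1f}  lhs={lhs:6.3f}  chi2*Var={chi2 * var0:7.3f}  "
          f"chi2 (MC)={chi2_mc:7.3f}  bound holds: {lhs <= chi2 * var0}")
```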
pointwise estimation
Gaussian white noise model: we observe (Y_x)_x with
$$dY_x = f(x)\, dx + n^{-1/2}\, dW_x$$
• estimate f(x_0) for a fixed x_0
• C^β(R) denotes the ball of Hölder β-smooth functions
• for any estimator f̂(x_0), we obtain the bias-variance lower bound
$$\inf_{\hat f}\; \Big( \sup_{f \in C^\beta(R)} \big|\operatorname{Bias}_f\big(\hat f(x_0)\big)\big| \Big)^{1/\beta} \; \sup_{f \in C^\beta(R)} \operatorname{Var}_f\big(\hat f(x_0)\big) \;\gtrsim\; \frac{1}{n}$$
• bound is attained by most estimators
• generates the U-shaped curve
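A back-of-the-envelope calculation, not from the slides, showing how the lower bound reproduces the classical picture: plug in the standard rates of a kernel-type estimator with bandwidth h (a generic smoothing heuristic, not the paper's proof).

```latex
% Standard kernel-smoothing heuristics over the Hoelder ball C^beta(R) (assumed, illustrative):
\[
|\operatorname{Bias}| \asymp h^{\beta},
\qquad
\operatorname{Var} \asymp \frac{1}{nh}
\quad\Longrightarrow\quad
|\operatorname{Bias}|^{1/\beta}\,\operatorname{Var}
\ \asymp\ h \cdot \frac{1}{nh} \ =\ \frac{1}{n},
\]
% so the lower bound is matched for every bandwidth h. Balancing the two terms of the MSE,
\[
h^{2\beta} + \frac{1}{nh}
\quad\text{is minimized at}\quad
h \asymp n^{-1/(2\beta+1)},
\qquad\text{giving the classical rate } n^{-2\beta/(2\beta+1)}.
\]
% Plotting the MSE against 1/h (model complexity) traces out the U-shaped curve.
```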
high-dimensional models
Gaussian sequence model:
• observe independent X_i ∼ N(θ_i, 1), i = 1, ..., n
• Θ(s) the space of s-sparse vectors (here: s ≤ √n / 2)
• bias-variance decomposition
$$E_\theta\big[\|\hat\theta - \theta\|^2\big] \;=\; \underbrace{\big\|E_\theta[\hat\theta] - \theta\big\|^2}_{B^2(\theta)} \;+\; \sum_{i=1}^n \operatorname{Var}_\theta\big(\hat\theta_i\big)$$
• bias-variance lower bound: if B²(θ) ≤ γ s log(n/s²), then
$$\sum_{i=1}^n \operatorname{Var}_0\big(\hat\theta_i\big) \;\gtrsim\; n \Big(\frac{s^2}{n}\Big)^{4\gamma}$$
• bound is matched (up to a factor in the exponent) by soft thresholding
• bias-variance trade-off more extreme than U-shape
• results also extend to high-dimensional linear regression
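An illustrative Monte Carlo sketch of the "more extreme than U-shape" behaviour: soft thresholding in the sequence model, with the signal strength and the grid of thresholds chosen for illustration only (not the tuning used in the paper).

```python
import numpy as np

# Soft thresholding theta_i_hat = sign(X_i) * max(|X_i| - lambda, 0) in the model
# X_i ~ N(theta_i, 1). Raising the threshold trades squared bias on an s-sparse
# signal against the variance at theta = 0, and that variance falls roughly
# exponentially in lambda^2 -- far more extreme than a U-shaped trade-off.

rng = np.random.default_rng(1)
n, s, reps = 10_000, 20, 400       # s <= sqrt(n)/2 as on the slide
spike = 10.0                       # value of the s nonzero coordinates (illustrative)

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

for lam in [0.5, 1.0, 2.0, 3.0]:
    # squared bias B^2(theta): only the s spike coordinates contribute,
    # coordinates with theta_i = 0 are unbiased by symmetry
    est_spike = soft(spike + rng.standard_normal((reps, s)), lam)
    sq_bias = np.sum((est_spike.mean(axis=0) - spike) ** 2)

    # variance at theta = 0: sum of the coordinatewise variances
    est_zero = soft(rng.standard_normal((reps, n)), lam)
    var_at_zero = np.sum(est_zero.var(axis=0))

    print(f"lambda={lam:3.1f}   squared bias={sq_bias:8.2f}   variance at 0={var_at_zero:9.2f}")
```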
L²-loss
Gaussian white noise model: we observe (Y_x)_x with
$$dY_x = f(x)\, dx + n^{-1/2}\, dW_x$$
• bias-variance decomposition
$$\operatorname{MISE}_f\big(\hat f\big) := E_f\Big[\big\|\hat f - f\big\|^2_{L^2[0,1]}\Big] = \int_0^1 \operatorname{Bias}_f^2\big(\hat f(x)\big)\, dx + \int_0^1 \operatorname{Var}_f\big(\hat f(x)\big)\, dx =: \operatorname{IBias}_f^2(\hat f) + \operatorname{IVar}_f(\hat f)$$
• is there a bias-variance trade-off between IBias²_f(f̂) and IVar_f(f̂)?
• turns out to be a very hard problem
L²-loss (ctd.)
• we propose a two-fold reduction scheme
  • reduction to a simpler model
  • reduction to a smaller class of estimators
• S^β(R) Sobolev space of β-smooth functions
Bias-variance lower bound: for any estimator f̂,
$$\Big( \sup_{f \in S^\beta(R)} \operatorname{IBias}_f\big(\hat f\big) \Big)^{1/\beta} \; \sup_{f \in S^\beta(R)} \operatorname{IVar}_f\big(\hat f\big) \;\ge\; \frac{1}{8n}$$
• many estimators f̂ can be found with matching upper bound of order 1/n
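A standard heuristic, not from the slides, behind the last bullet: a projection (series) estimator that keeps the first K coefficients matches the 1/(8n) lower bound up to constants.

```latex
% Projection estimator with cutoff K over the Sobolev ball S^beta(R) (assumed, standard rates):
\[
\operatorname{IBias} \asymp K^{-\beta},
\qquad
\operatorname{IVar} \asymp \frac{K}{n}
\quad\Longrightarrow\quad
\operatorname{IBias}^{1/\beta}\,\operatorname{IVar}
\ \asymp\ \frac{1}{K}\cdot\frac{K}{n} \ =\ \frac{1}{n},
\]
% so the product is of order 1/n for every cutoff K, matching the lower bound up to constants.
```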
mean absolute deviation
• several extensions of the bias-variance trade-off have been proposed in the literature, e.g. for classification
• the mean absolute deviation (MAD) of an estimator θ̂ is E_θ[|θ̂ − m|] with m either the mean or the median of θ̂
Can the general framework be extended to lower bounds on the trade-off between bias and MAD?
• we derived a change of expectation inequality for this setting
• this can be used to obtain a partial answer for pointwise estimation in the Gaussian white noise model
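A small side calculation, not from the slides, for why MAD is a natural substitute for the variance: for an exactly Gaussian estimator the two differ only by a constant factor (the Gaussianity is an assumption for illustration).

```latex
% If \hat\theta ~ N(mu, sigma^2) (illustrative assumption), the MAD around the mean is
\[
\mathbb{E}\big[\,|\hat\theta - \mu|\,\big] \;=\; \sigma\sqrt{\tfrac{2}{\pi}},
\]
% so in Gaussian models MAD and standard deviation agree up to the factor sqrt(2/pi),
% which is why a bias-MAD trade-off parallels the bias-variance trade-off.
```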
Summary
• general framework to derive bias-variance lower bounds
• leads to matching bias-variance lower bounds for standard models in nonparametric and high-dimensional statistics
• different types of the bias-variance trade-off occur
• can machine learning methods defy the bias-variance trade-off? No, there are universal lower bounds that no method can avoid
for details and more results consult the preprint https://arxiv.org/abs/2006.00278.pdf