Generalization of Deep Learning
Yuan Yao, HKUST
Some theories are limited but help:
- Approximation Theory and Harmonic Analysis: What functions are represented well by deep neural networks, without suffering the curse of dimensionality and better than shallow networks?
  - Sparse (local), hierarchical (multiscale), compositional functions avoid the curse of dimensionality.
  - Group (translation, rotation, scaling, deformation) invariances are achieved as depth grows.
- Generalization: How can deep learning generalize well without overfitting the noise?
  - Double descent curve with overparameterized models
  - Implicit regularization of SGD: max-margin classifier
  - "Benign overfitting"?
- Optimization: What is the landscape of the empirical risk and how to optimize it efficiently?
  - Wide networks may have simple landscapes for GD/SGD algorithms ...
Empirical Risk vs. Population Risk

- Consider the empirical risk minimization under i.i.d. (independent and identically distributed) samples:

  \hat{R}_n(\theta) = \hat{\mathbb{E}}_n \, \ell(y, f(x;\theta)) := \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i;\theta))

- The population risk with respect to the unknown distribution P:

  R(\theta) = \mathbb{E}_{(x,y) \sim P} \, \ell(y, f(x;\theta))
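As a concrete illustration (not from the slides), here is a minimal numpy sketch contrasting the two quantities for a fixed parameter theta: the empirical risk averages the loss over a small i.i.d. training sample, while the population risk is approximated by Monte Carlo on a very large fresh sample. The linear model, squared loss, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta):
    # toy model f(x; theta) = theta * x
    return theta * x

def sq_loss(y, yhat):
    return (y - yhat) ** 2

def sample(n):
    # assumed data-generating process: y = 2x + noise
    x = rng.normal(size=n)
    y = 2.0 * x + 0.5 * rng.normal(size=n)
    return x, y

theta = 1.8
x_train, y_train = sample(50)            # i.i.d. training sample
x_big, y_big = sample(1_000_000)         # large sample approximating the population

emp_risk = sq_loss(y_train, f(x_train, theta)).mean()   # \hat{R}_n(theta)
pop_risk = sq_loss(y_big, f(x_big, theta)).mean()       # Monte Carlo estimate of R(theta)
print(f"empirical risk {emp_risk:.4f}, population risk (approx.) {pop_risk:.4f}")
```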
Optimization vs. Generalization

- Fundamental Theorem of Machine Learning (for 0-1 misclassification loss, called 'errors' below):

  \underbrace{R(\theta)}_{\text{test/validation/generalization loss}} = \underbrace{\hat{R}_n(\theta)}_{\text{training loss}} + \underbrace{R(\theta) - \hat{R}_n(\theta)}_{\text{generalization gap}}

  \sup_{\theta \in \Theta} \, | R(\theta) - \hat{R}_n(\theta) | \le \mathrm{Complexity}(\Theta), e.g. Rademacher complexity

- How to make training loss/error small? – Optimization issue
- How to make the generalization gap small? – Model complexity issue
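A hedged toy sketch of this decomposition for the 0-1 loss: a one-parameter threshold classifier is "trained" by grid search on the empirical error, and the generalization gap is read off as test error minus training error on a large held-out sample. The data-generating process, noise level, and grid are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(m):
    # assumed ground truth: label 1 iff x > 0.3, with 10% label noise
    x = rng.normal(size=m)
    y = (x > 0.3).astype(int)
    flip = rng.random(m) < 0.1
    return x, np.where(flip, 1 - y, y)

def err01(x, y, t):
    # empirical 0-1 error of the classifier "predict 1 iff x > t"
    return np.mean((x > t).astype(int) != y)

x_tr, y_tr = sample(200)            # training sample
x_te, y_te = sample(100_000)        # large sample as a proxy for the population

# "training": pick the threshold minimizing the empirical 0-1 error over a grid
grid = np.linspace(-2.0, 2.0, 401)
t_hat = grid[np.argmin([err01(x_tr, y_tr, t) for t in grid])]

train_err = err01(x_tr, y_tr, t_hat)     # \hat{R}_n(\hat{\theta})
test_err = err01(x_te, y_te, t_hat)      # approximately R(\hat{\theta})
print(f"train {train_err:.3f}  test {test_err:.3f}  gap {test_err - train_err:.3f}")
```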
Uniform Convergence: Another View

- For \theta^* \in \arg\min_{\theta \in \Theta} R(\theta) and \hat{\theta}_n \in \arg\min_{\theta \in \Theta} \hat{R}_n(\theta),

  \underbrace{R(\hat{\theta}_n) - R(\theta^*)}_{\text{excess risk}}
  = \underbrace{R(\hat{\theta}_n) - \hat{R}_n(\hat{\theta}_n)}_{A}
  + \underbrace{\hat{R}_n(\hat{\theta}_n) - \hat{R}_n(\theta^*)}_{\le 0}
  + \underbrace{\hat{R}_n(\theta^*) - R(\theta^*)}_{B}

- To make both A and B small,

  \sup_{\theta \in \Theta} \, | R(\theta) - \hat{R}_n(\theta) | \le \mathrm{Complexity}(\Theta), e.g. Rademacher complexity
Example: regression and square loss

- Given an estimate \hat{f} and a set of predictors X, we can predict Y using \hat{Y} = \hat{f}(X).
- Assume for a moment that both \hat{f} and X are fixed. In the regression setting,

  \mathbb{E}(Y - \hat{Y})^2 = \mathbb{E}[f(X) + \epsilon - \hat{f}(X)]^2
  = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Irreducible}},   (2)

  where \mathbb{E}(Y - \hat{Y})^2 represents the expected squared error between the predicted and actual value of Y, and \mathrm{Var}(\epsilon) represents the variance associated with the error term \epsilon. An optimal estimate minimizes the reducible error.
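A small Monte Carlo check of the reducible/irreducible split above, under an assumed true function f, a fixed estimate \hat{f}, and Gaussian noise; none of these particular choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

f_true = lambda x: np.sin(x)          # assumed true regression function f
f_hat  = lambda x: 0.9 * x            # some fixed estimate \hat{f}
sigma  = 0.3                          # std of the noise term epsilon

x0 = 1.2                              # fix the predictor value X = x0
eps = sigma * rng.normal(size=1_000_000)
y = f_true(x0) + eps                  # Y = f(X) + eps

mse         = np.mean((y - f_hat(x0)) ** 2)        # E (Y - \hat{Y})^2
reducible   = (f_true(x0) - f_hat(x0)) ** 2        # [f(X) - \hat{f}(X)]^2
irreducible = sigma ** 2                           # Var(eps)
print(mse, reducible + irreducible)                # the two should nearly match
```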
Bias-Variance Decomposition

- Let f(X) be the true function which we aim at estimating from a training data set D.
- Let \hat{f}(X; D) be the estimated function from the training data set D.
- Take the expectation with respect to D:

  \mathbb{E}_D \big[ (f(X) - \hat{f}(X; D))^2 \big]
  = \underbrace{\big[ f(X) - \mathbb{E}_D(\hat{f}(X; D)) \big]^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D \big[ (\hat{f}(X; D) - \mathbb{E}_D(\hat{f}(X; D)))^2 \big]}_{\text{Variance}}
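This decomposition can be estimated numerically by resampling the training set D many times. The sketch below does so for polynomial regression of a few degrees at a single test point x0; the true function, noise level, sample sizes, and degrees are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(3)

f_true = lambda x: np.sin(2 * np.pi * x)
x0 = 0.25                      # evaluate the decomposition at a fixed test point

def fit_and_predict(deg, n=30, sigma=0.3):
    # draw a fresh training set D, fit a degree-`deg` polynomial, predict at x0
    x = rng.uniform(0, 1, n)
    y = f_true(x) + sigma * rng.normal(size=n)
    coefs = np.polyfit(x, y, deg)
    return np.polyval(coefs, x0)

for deg in [1, 3, 9]:
    preds = np.array([fit_and_predict(deg) for _ in range(2000)])   # E_D via Monte Carlo
    bias2 = (f_true(x0) - preds.mean()) ** 2
    var = preds.var()
    print(f"degree {deg}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```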
Bias-Variance Tradeoff
Why do big models in NN generalize well?

CIFAR10: n = 50,000, d = 3,072, k = 10. What happens when I turn off the regularizers?

  Model                                # parameters    p/n    Train loss    Test error
  CudaConvNet                          145,578         2.9    0             23%
  CudaConvNet (with regularization)    145,578         2.9    0.34          18%
  MicroInception                       1,649,402       33     0             14%
  ResNet                               2,401,440       48     0             13%

Chiyuan Zhang et al. 2016
The Bias-Variance Tradeoff?

Deep models: models where p > 20n are common.
Increasing # parameters

Figure: Test and train error vs. # parameters / # samples (N/n). Experiments on MNIST. Left: [Belkin, Hsu, Ma, Mandal, 2018]. Right: [Spigler, Geiger, Ascoli, Sagun, Biroli, Wyart, 2018].

A similar phenomenon appeared in the literature: [LeCun, Kanter, and Solla, 1991], [Krogh and Hertz, 1992], [Opper and Kinzel, 1995], [Neyshabur, Tomioka, Srebro, 2014], [Advani and Saxe, 2017].
“Double Descent”

Figure: A cartoon of the double descent risk curve by [Belkin, Hsu, Ma, Mandal, 2018].

- Peak at the interpolation threshold.
- Monotone decreasing in the overparameterized regime.
- Global minimum when the number of parameters is infinite.
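A minimal sketch that reproduces the double descent shape with minimum-norm least squares on random ReLU features (in the spirit of, but not identical to, the cited experiments): as the number of features p sweeps past the interpolation threshold p ≈ n, the test error peaks and then decreases. The teacher model, feature map, noise level, and dimensions are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# toy teacher: y = <w*, x> + noise, x in R^d
d, n, n_test, sigma = 20, 100, 2000, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)

def sample(m):
    X = rng.normal(size=(m, d))
    y = X @ w_star + sigma * rng.normal(size=m)
    return X, y

X_tr, y_tr = sample(n)
X_te, y_te = sample(n_test)

W = rng.normal(size=(d, 2000))          # fixed random feature directions

def features(X, p):
    # random ReLU features; p = number of parameters of the linear fit on top
    return np.maximum(X @ W[:, :p], 0.0)

for p in [10, 50, 90, 100, 110, 200, 500, 1000]:
    Phi_tr, Phi_te = features(X_tr, p), features(X_te, p)
    # minimum-norm least-squares fit (interpolates the training data once p >= n)
    theta = np.linalg.pinv(Phi_tr) @ y_tr
    train_mse = np.mean((Phi_tr @ theta - y_tr) ** 2)
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"p = {p:5d}  train {train_mse:.4f}  test {test_mse:.4f}")
```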
Complementary rather than Contradiction

- U-shaped curve: test error vs. a model complexity measure that tightly controls generalization. Examples: the norm in a linear model, “ ” in nearest-neighbors.
- Double descent: test error vs. the number of parameters. Example: # parameters in NN.
- In NN, # parameters ≠ the model complexity that tightly controls generalization. [Bartlett, 1997], [Bartlett and Mendelson, 2002]
Let’s go to two talks

- Prof. Misha Belkin (OSU/UCSD): "From Classical Statistics to Modern Machine Learning", at the Simons Institute, Berkeley
  - How interpolation models do not overfit ...
- Prof. Song Mei (UC Berkeley): "Generalization of linearized neural networks: staircase decay and double descent", at HKUST
  - How simple linearized single-hidden-layer models help understand ...
Thank you!