The generalization error of random features model: Precise asymptotics and double descent curve

Song Mei (Stanford University)
Joint work with Andrea Montanari
September 8, 2019
Surprises in the generalization behavior of neural networks

Figure: Experiments on MNIST by [Neyshabur, Tomioka, Srebro, 2014a].

Surprise: why doesn't higher model complexity induce larger generalization error?
Partial explanations: the intrinsic model complexity is not the number of parameters, but "some norm" of the weights, and this intrinsic complexity is implicitly controlled by SGD. [Neyshabur, Tomioka, Srebro, 2014b], [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, 2017], ...
Train more carefully to better interpolate the data

Figure: Experiments on MNIST. Left: [Spigler, Geiger, Ascoli, Sagun, Biroli, Wyart, 2018]. Right: [Belkin, Hsu, Ma, Mandal, 2018].
Double descent

Figure: A cartoon by [Belkin, Hsu, Ma, Mandal, 2018].

✓ Peak at the interpolation threshold.
✓ Monotonically decreasing in the overparameterized regime.
✓ Global minimum as the number of parameters tends to infinity.
The misspecified linear model

Figure: By [Hastie, Montanari, Rosset, Tibshirani, 2019]. See also [Belkin, Hsu, Xu, 2019].

Model: $y = \langle x_S, \beta_S \rangle + \varepsilon$ for $|S| = k$.
Fitting: $L(\beta) = \hat{\mathbb{E}}[(y - \langle x, \beta \rangle)^2]$.
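A minimal numerical sketch of this setup (my illustration, not code from the talk): fit min-norm least squares using the first $p$ of $d$ coordinates of Gaussian data, so the model is misspecified whenever $p < d$, and watch the test error spike near the interpolation threshold $p = n$. All sizes, the noise level, and the signal are illustrative choices.

```python
# Misspecified linear model: min-norm least squares on the first p of d
# coordinates. Test error (against the noiseless target) peaks near p = n.
import numpy as np

rng = np.random.default_rng(0)
d, k, n, n_test, tau = 400, 100, 200, 2000, 0.5
beta = np.zeros(d)
beta[:k] = 1.0 / np.sqrt(k)          # true signal supported on |S| = k coords

def test_error(p):
    X = rng.standard_normal((n, d))
    y = X @ beta + tau * rng.standard_normal(n)
    bhat = np.linalg.lstsq(X[:, :p], y, rcond=None)[0]   # min-norm solution
    Xt = rng.standard_normal((n_test, d))
    return np.mean((Xt[:, :p] @ bhat - Xt @ beta) ** 2)

for p in [50, 150, 190, 200, 210, 300, 400]:
    print(p, test_error(p))          # error spikes near the threshold p = n
```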
The misspecified linear model

✓ Peak at the interpolation threshold.
✗ Monotonically decreasing in the overparameterized regime.
✗ Global minimum as the number of parameters tends to infinity.
Goal: find a tractable model that exhibits all the features of the double descent curve.

Figure: By [Belkin, Hsu, Ma, Mandal, 2018].
The neural tangent model

◮ Let $f(x; \theta)$ be a multi-layer neural network,
\[ f(x; \theta) = \sigma(W_1 \sigma(W_2 \cdots \sigma(W_L x))). \]
◮ NT model: linearization of $f(x; \theta)$ around the initialization $\theta_0$,
\[ f_{\mathrm{NT}}(x; \theta) = \langle \theta, \nabla_\theta f(x; \theta_0) \rangle. \]
[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Chizat, Bach, 2018b].
◮ Under suitable conditions on the initialization and the learning rate, the training trajectories of the neural tangent model and of the neural network stay uniformly close.
Two-layer neural tangent model

The two-layer neural tangent model:
\[ f_{\mathrm{NT}}(x; \{a_j\}, \{t_j\}) = \underbrace{\sum_{j=1}^N a_j \sigma(\langle w_j, x \rangle)}_{\text{second-layer linearization}} + \underbrace{\sum_{j=1}^N \langle t_j, x \rangle \, \sigma'(\langle w_j, x \rangle)}_{\text{first-layer linearization}}. \]
Random weights $w_j \sim_{\mathrm{iid}} \mathrm{Unif}(\mathbb{S}^{d-1})$.
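To make the linearization concrete, here is a small numpy sketch (my illustration, with ReLU chosen as the activation) of the two-layer NT featurization: differentiating $f(x) = \sum_j a_j \sigma(\langle w_j, x \rangle)$ at initialization gives the second-layer features $\sigma(\langle w_j, x \rangle)$ and the first-layer features $a_j \sigma'(\langle w_j, x \rangle)\, x$, so the NT model is linear in $N + Nd$ parameters.

```python
# Two-layer neural tangent featurization with sigma = ReLU.
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 50
W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)     # w_j ~ Unif(S^{d-1})
a0 = rng.standard_normal(N) / np.sqrt(N)          # second-layer initialization

def nt_features(x):
    z = W @ x                                     # pre-activations <w_j, x>
    phi_a = np.maximum(z, 0.0)                    # d f / d a_j = sigma(z_j)
    phi_w = (a0 * (z > 0))[:, None] * x[None, :]  # d f / d w_j = a_j sigma'(z_j) x
    return np.concatenate([phi_a, phi_w.ravel()])

x = rng.standard_normal(d)
print(nt_features(x).shape)                       # (N + N*d,) linear features
```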
An even simpler model

The random features model:
\[ f_{\mathrm{RF}}(x; a) = \sum_{j=1}^N a_j \sigma(\langle w_j, x \rangle). \]
Random weights $w_j \sim_{\mathrm{iid}} \mathrm{Unif}(\mathbb{S}^{d-1})$.
Setting

◮ $n$ data points, $N$ features, dimension $d$, all proportional as $d \to \infty$.
◮ Data $(x_i)_{i \le n} \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $y_i = f_\star(x_i) + \varepsilon_i$, $\mathbb{E}[\varepsilon_i^2] = \tau^2$.
◮ Features $(w_j)_{j \in [N]} \sim_{\mathrm{iid}} \mathrm{Unif}(\mathbb{S}^{d-1})$.
◮ Random feature ridge regression: $\hat{a}_\lambda = \arg\min_a L_\lambda(a)$,
\[ L_\lambda(a) = \frac{1}{n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^N a_j \sigma(\langle x_i, w_j \rangle) \Big)^2 + \frac{\lambda N}{d} \|a\|_2^2, \]
\[ R(a) = \mathbb{E}_x \Big[ \Big( f_\star(x) - \sum_{j=1}^N a_j \sigma(\langle x, w_j \rangle) \Big)^2 \Big]. \]
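The estimator has a closed form, so the whole experiment fits in a few lines. A minimal sketch (assuming ReLU for $\sigma$; the dimensions are illustrative, and this is my code rather than the authors'): sample the data and features as above, solve the ridge normal equations $(Z^\top Z / n + (\lambda N / d) I)\, \hat{a} = Z^\top y / n$, and estimate $R(\hat{a})$ by Monte Carlo on fresh test points.

```python
# Random feature ridge regression, solved in closed form.
import numpy as np

rng = np.random.default_rng(0)
d, N, n, tau, lam = 100, 300, 200, 0.5, 1e-3
beta1 = rng.standard_normal(d)
beta1 /= np.linalg.norm(beta1)                    # target f_star(x) = <beta1, x>

def sphere(m, dim, radius=1.0):
    z = rng.standard_normal((m, dim))
    return radius * z / np.linalg.norm(z, axis=1, keepdims=True)

W = sphere(N, d)                                  # w_j ~ Unif(S^{d-1})
X = sphere(n, d, np.sqrt(d))                      # x_i ~ Unif(S^{d-1}(sqrt(d)))
y = X @ beta1 + tau * rng.standard_normal(n)

Z = np.maximum(X @ W.T, 0.0)                      # Z[i, j] = sigma(<x_i, w_j>)
a_hat = np.linalg.solve(Z.T @ Z / n + (lam * N / d) * np.eye(N), Z.T @ y / n)

Xt = sphere(5000, d, np.sqrt(d))                  # fresh test points
R = np.mean((Xt @ beta1 - np.maximum(Xt @ W.T, 0.0) @ a_hat) ** 2)
print("estimated test error:", R)
```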
Assumptions

◮ Proportional regime: $N/d \to \psi_1$, $n/d \to \psi_2$ as $d \to \infty$.
◮ Activation: $\sigma$ of sub-exponential growth, including ReLU, $\tanh$, etc.
◮ Target function: $f_\star(x) = \langle \beta_1, x \rangle$.
Precise asymptotics

Theorem (M. and Montanari, 2019). Assume $f_\star(x) = \langle \beta_1, x \rangle$ and define (for $G \sim \mathsf{N}(0, 1)$)
\[ \mu_1 = \mathbb{E}[\sigma(G) G], \qquad \mu_\star^2 = \mathbb{E}[\sigma(G)^2] - \mathbb{E}[\sigma(G)]^2 - \mathbb{E}[\sigma(G) G]^2, \qquad \zeta = \mu_1 / \mu_\star. \]
Let $N/d \to \psi_1$, $n/d \to \psi_2$ as $d \to \infty$. Then for any $\lambda > 0$ we have
\[ R_{\mathrm{RF}}(\hat{a}_\lambda, f_\star) = \|\beta_1\|_2^2 \cdot \mathscr{B}(\zeta, \psi_1, \psi_2, \lambda/\mu_\star^2) + \tau^2 \cdot \mathscr{V}(\zeta, \psi_1, \psi_2, \lambda/\mu_\star^2) + o_{d, \mathbb{P}}(1), \]
where the functions $\mathscr{B}$ and $\mathscr{V}$ are given explicitly below.
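The constants $\mu_1$, $\mu_\star$, $\zeta$ are one-dimensional Gaussian integrals, so they are easy to evaluate for any activation. A quick sketch (my addition) using Gauss-Hermite quadrature, shown for ReLU, where the exact values are $\mu_1 = 1/2$ and $\mu_\star^2 = 1/4 - 1/(2\pi)$:

```python
# Constants of the theorem for G ~ N(0, 1), via Gauss-Hermite quadrature.
import numpy as np

nodes, weights = np.polynomial.hermite_e.hermegauss(200)  # weight exp(-x^2/2)
w = weights / np.sqrt(2 * np.pi)                          # normalize to N(0, 1)

def gauss_mean(f):
    return np.sum(w * f(nodes))

sigma = lambda g: np.maximum(g, 0.0)                      # ReLU activation
mu1 = gauss_mean(lambda g: sigma(g) * g)
mu_star2 = gauss_mean(lambda g: sigma(g) ** 2) - gauss_mean(sigma) ** 2 - mu1 ** 2
zeta = mu1 / np.sqrt(mu_star2)
print(mu1, mu_star2, zeta)    # ReLU: 0.5, 1/4 - 1/(2*pi) ~ 0.0908, zeta ~ 1.66
```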
Explicit formulae

Let the functions $\nu_1, \nu_2 : \mathbb{C}_+ \to \mathbb{C}_+$ be the unique solutions of
\[ \nu_1 = \psi_1 \Big( -\xi - \nu_2 - \frac{\zeta^2 \nu_2}{1 - \zeta^2 \nu_1 \nu_2} \Big)^{-1}, \qquad \nu_2 = \psi_2 \Big( -\xi - \nu_1 - \frac{\zeta^2 \nu_1}{1 - \zeta^2 \nu_1 \nu_2} \Big)^{-1}. \]
Let $\chi \equiv \nu_1(\mathrm{i} (\psi_1 \psi_2 \lambda)^{1/2}) \cdot \nu_2(\mathrm{i} (\psi_1 \psi_2 \lambda)^{1/2})$, and
\[ \begin{aligned} E_0(\zeta, \psi_1, \psi_2, \lambda) \equiv{}& -\chi^5 \zeta^6 + 3 \chi^4 \zeta^4 + (\psi_1 \psi_2 - \psi_2 - \psi_1 + 1) \chi^3 \zeta^6 - 2 \chi^3 \zeta^4 - 3 \chi^3 \zeta^2 \\ &+ (\psi_1 + \psi_2 - 3 \psi_1 \psi_2 + 1) \chi^2 \zeta^4 + 2 \chi^2 \zeta^2 + \chi^2 + 3 \psi_1 \psi_2 \chi \zeta^2 - \psi_1 \psi_2, \\ E_1(\zeta, \psi_1, \psi_2, \lambda) \equiv{}& \psi_2 \chi^3 \zeta^4 - \psi_2 \chi^2 \zeta^2 + \psi_1 \psi_2 \chi \zeta^2 - \psi_1 \psi_2, \\ E_2(\zeta, \psi_1, \psi_2, \lambda) \equiv{}& \chi^5 \zeta^6 - 3 \chi^4 \zeta^4 + (\psi_1 - 1) \chi^3 \zeta^6 + 2 \chi^3 \zeta^4 + 3 \chi^3 \zeta^2 \\ &+ (-\psi_1 - 1) \chi^2 \zeta^4 - 2 \chi^2 \zeta^2 - \chi^2. \end{aligned} \]
We then have
\[ \mathscr{B}(\zeta, \psi_1, \psi_2, \lambda) \equiv \frac{E_1(\zeta, \psi_1, \psi_2, \lambda)}{E_0(\zeta, \psi_1, \psi_2, \lambda)}, \qquad \mathscr{V}(\zeta, \psi_1, \psi_2, \lambda) \equiv \frac{E_2(\zeta, \psi_1, \psi_2, \lambda)}{E_0(\zeta, \psi_1, \psi_2, \lambda)}. \]
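These formulae are directly computable. The sketch below (my illustration; the damped fixed-point iteration and its starting point in the upper half plane are heuristic choices, not prescribed by the slides) solves for $(\nu_1, \nu_2)$ at $\xi = \mathrm{i}(\psi_1 \psi_2 \lambda)^{1/2}$, forms $\chi$, and evaluates $\mathscr{B}$ and $\mathscr{V}$. Note that `lam` here plays the role of the rescaled $\lambda / \mu_\star^2$ from the theorem.

```python
# Asymptotic bias/variance (B, V) from the explicit formulae.
import numpy as np

def BV(zeta, psi1, psi2, lam, iters=10_000):
    xi = 1j * np.sqrt(psi1 * psi2 * lam)
    nu1 = nu2 = 0.1j                        # start in the upper half plane
    for _ in range(iters):                  # damped fixed-point iteration
        denom = 1 - zeta**2 * nu1 * nu2
        new1 = psi1 / (-xi - nu2 - zeta**2 * nu2 / denom)
        new2 = psi2 / (-xi - nu1 - zeta**2 * nu1 / denom)
        nu1, nu2 = 0.9 * nu1 + 0.1 * new1, 0.9 * nu2 + 0.1 * new2
    chi, z2 = nu1 * nu2, zeta**2
    E0 = (-chi**5 * z2**3 + 3 * chi**4 * z2**2
          + (psi1 * psi2 - psi2 - psi1 + 1) * chi**3 * z2**3
          - 2 * chi**3 * z2**2 - 3 * chi**3 * z2
          + (psi1 + psi2 - 3 * psi1 * psi2 + 1) * chi**2 * z2**2
          + 2 * chi**2 * z2 + chi**2 + 3 * psi1 * psi2 * chi * z2 - psi1 * psi2)
    E1 = (psi2 * chi**3 * z2**2 - psi2 * chi**2 * z2
          + psi1 * psi2 * chi * z2 - psi1 * psi2)
    E2 = (chi**5 * z2**3 - 3 * chi**4 * z2**2 + (psi1 - 1) * chi**3 * z2**3
          + 2 * chi**3 * z2**2 + 3 * chi**3 * z2
          + (-psi1 - 1) * chi**2 * z2**2 - 2 * chi**2 * z2 - chi**2)
    return (E1 / E0).real, (E2 / E0).real

print(BV(zeta=1.66, psi1=2.0, psi2=3.0, lam=1e-3))   # (B, V) at these ratios
# Sweeping psi1 at fixed psi2 traces out the double descent curve below.
```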
Figure: Test error as a function of $N/n$. Left: $\lambda = 3 \times 10^{-4}$. Right: $\lambda = 0+$.

✓ Peak at the interpolation threshold.
✓ Monotonically decreasing in the overparameterized regime.
✓ Global minimum as the number of parameters tends to infinity.
Further insights
Figure: Test error as a function of $N/n$ (log scale). Left: SNR = 5. Right: SNR = 1/5.

For any $\lambda$, the minimum generalization error is achieved as $N/n \to \infty$.
Figure: Test error as a function of $N/n$ (log scale) at the optimal $\lambda$.

For optimal $\lambda$, the generalization error is monotonically decreasing in $N/n$.
Figure: Test error as a function of $\lambda$ (log scale). Left: SNR = 5. Right: SNR = 1/10.

◮ High SNR: minimum at $\lambda = 0+$.
◮ Low SNR: minimum at $\lambda > 0$.
Proof strategy

Random matrix theory for random kernel inner-product matrices.
Conclusion

◮ The number of parameters is not the right measure of model complexity for controlling the generalization error (we already knew this).
◮ The double descent phenomenon also appears in linearized neural networks.
◮ When the SNR is high, interpolating without regularization ($\lambda = 0+$) can outperform ridge regularization.