The generalization error of random features model: Precise asymptotics and double descent curve

Song Mei (Stanford University)
Joint work with Andrea Montanari
September 8, 2019
Surprises in the generalization behavior of neural networks

Figure: Experiments on MNIST by [Neyshabur, Tomioka, Srebro, 2014a].

Surprise: why doesn't higher model complexity induce larger generalization error?
Partial explanations: the intrinsic model complexity is not the number of parameters, but "some norm" of the weights, and this intrinsic complexity is implicitly controlled by SGD. [Neyshabur, Tomioka, Srebro, 2014b], [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, 2017], ...
Train more carefully to better interpolate the data

Figure: Experiments on MNIST. Left: [Spigler, Geiger, Ascoli, Sagun, Biroli, Wyart, 2018]. Right: [Belkin, Hsu, Ma, Mandal, 2018].
Double descent

Figure: A cartoon by [Belkin, Hsu, Ma, Mandal, 2018].

✓ Peak at the interpolation threshold.
✓ Monotonically decreasing in the overparameterized regime.
✓ Global minimum as the number of parameters tends to infinity.
The misspecified linear model

Figure: By [Hastie, Montanari, Rosset, Tibshirani, 2019]. See also [Belkin, Hsu, Xu, 2019].

Model: $y = \langle x_S, \beta_S \rangle + \varepsilon$ for $|S| = k$.
Fitting: $L(\beta) = \hat{\mathbb{E}}[(y - \langle x, \beta \rangle)^2]$.
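A minimal numerical sketch of this setup (my illustration, not code from the talk): fit min-norm least squares using the first $p$ of $d$ coordinates of Gaussian data, so the model is misspecified whenever $p < d$, and watch the test error spike near the interpolation threshold $p = n$. All sizes, the noise level, and the signal are illustrative choices.

```python
# Misspecified linear model: min-norm least squares on the first p of d
# coordinates. Test error (against the noiseless target) peaks near p = n.
import numpy as np

rng = np.random.default_rng(0)
d, k, n, n_test, tau = 400, 100, 200, 2000, 0.5
beta = np.zeros(d)
beta[:k] = 1.0 / np.sqrt(k)          # true signal supported on |S| = k coords

def test_error(p):
    X = rng.standard_normal((n, d))
    y = X @ beta + tau * rng.standard_normal(n)
    bhat = np.linalg.lstsq(X[:, :p], y, rcond=None)[0]   # min-norm solution
    Xt = rng.standard_normal((n_test, d))
    return np.mean((Xt[:, :p] @ bhat - Xt @ beta) ** 2)

for p in [50, 150, 190, 200, 210, 300, 400]:
    print(p, test_error(p))          # error spikes near the threshold p = n
```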
The misspecified linear model

✓ Peak at the interpolation threshold.
✗ Monotonically decreasing in the overparameterized regime.
✗ Global minimum as the number of parameters tends to infinity.
Goal: find a tractable model that exhibits all the features of the double descent curve.

Figure: By [Belkin, Hsu, Ma, Mandal, 2018].
The neural tangent model

◮ Let $f(x; \theta)$ be a multi-layer neural network,
\[ f(x; \theta) = \sigma(W_1 \sigma(W_2 \cdots \sigma(W_L x))). \]
◮ NT model: linearization of $f(x; \theta)$ around the initialization $\theta_0$,
\[ f_{\mathrm{NT}}(x; \theta) = \langle \theta, \nabla_\theta f(x; \theta_0) \rangle. \]
[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Chizat, Bach, 2018b].
◮ Under suitable conditions on the initialization and the learning rate, the training trajectories of the neural tangent model and of the neural network stay uniformly close.
Two-layer neural tangent model

The two-layer neural tangent model:
\[ f_{\mathrm{NT}}(x; \{a_j\}, \{t_j\}) = \underbrace{\sum_{j=1}^N a_j \sigma(\langle w_j, x \rangle)}_{\text{second-layer linearization}} + \underbrace{\sum_{j=1}^N \langle t_j, x \rangle \, \sigma'(\langle w_j, x \rangle)}_{\text{first-layer linearization}}. \]
Random weights $w_j \sim_{\mathrm{iid}} \mathrm{Unif}(\mathbb{S}^{d-1})$.
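To make the linearization concrete, here is a small numpy sketch (my illustration, with ReLU chosen as the activation) of the two-layer NT featurization: differentiating $f(x) = \sum_j a_j \sigma(\langle w_j, x \rangle)$ at initialization gives the second-layer features $\sigma(\langle w_j, x \rangle)$ and the first-layer features $a_j \sigma'(\langle w_j, x \rangle)\, x$, so the NT model is linear in $N + Nd$ parameters.

```python
# Two-layer neural tangent featurization with sigma = ReLU.
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 50
W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)     # w_j ~ Unif(S^{d-1})
a0 = rng.standard_normal(N) / np.sqrt(N)          # second-layer initialization

def nt_features(x):
    z = W @ x                                     # pre-activations <w_j, x>
    phi_a = np.maximum(z, 0.0)                    # d f / d a_j = sigma(z_j)
    phi_w = (a0 * (z > 0))[:, None] * x[None, :]  # d f / d w_j = a_j sigma'(z_j) x
    return np.concatenate([phi_a, phi_w.ravel()])

x = rng.standard_normal(d)
print(nt_features(x).shape)                       # (N + N*d,) linear features
```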
An even simpler model

The random features model:
\[ f_{\mathrm{RF}}(x; a) = \sum_{j=1}^N a_j \sigma(\langle w_j, x \rangle). \]
Random weights $w_j \sim_{\mathrm{iid}} \mathrm{Unif}(\mathbb{S}^{d-1})$.
Setting

◮ $n$ data points, $N$ features, dimension $d$, all proportional as $d \to \infty$.
◮ Data $(x_i)_{i \le n} \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $y_i = f_\star(x_i) + \varepsilon_i$, $\mathbb{E}[\varepsilon_i^2] = \tau^2$.
◮ Features $(w_j)_{j \in [N]} \sim_{\mathrm{iid}} \mathrm{Unif}(\mathbb{S}^{d-1})$.
◮ Random feature ridge regression: $\hat{a}_\lambda = \arg\min_a L_\lambda(a)$,
\[ L_\lambda(a) = \frac{1}{n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^N a_j \sigma(\langle x_i, w_j \rangle) \Big)^2 + \frac{\lambda N}{d} \|a\|_2^2, \]
\[ R(a) = \mathbb{E}_x \Big[ \Big( f_\star(x) - \sum_{j=1}^N a_j \sigma(\langle x, w_j \rangle) \Big)^2 \Big]. \]
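The estimator has a closed form, so the whole experiment fits in a few lines. A minimal sketch (assuming ReLU for $\sigma$; the dimensions are illustrative, and this is my code rather than the authors'): sample the data and features as above, solve the ridge normal equations $(Z^\top Z / n + (\lambda N / d) I)\, \hat{a} = Z^\top y / n$, and estimate $R(\hat{a})$ by Monte Carlo on fresh test points.

```python
# Random feature ridge regression, solved in closed form.
import numpy as np

rng = np.random.default_rng(0)
d, N, n, tau, lam = 100, 300, 200, 0.5, 1e-3
beta1 = rng.standard_normal(d)
beta1 /= np.linalg.norm(beta1)                    # target f_star(x) = <beta1, x>

def sphere(m, dim, radius=1.0):
    z = rng.standard_normal((m, dim))
    return radius * z / np.linalg.norm(z, axis=1, keepdims=True)

W = sphere(N, d)                                  # w_j ~ Unif(S^{d-1})
X = sphere(n, d, np.sqrt(d))                      # x_i ~ Unif(S^{d-1}(sqrt(d)))
y = X @ beta1 + tau * rng.standard_normal(n)

Z = np.maximum(X @ W.T, 0.0)                      # Z[i, j] = sigma(<x_i, w_j>)
a_hat = np.linalg.solve(Z.T @ Z / n + (lam * N / d) * np.eye(N), Z.T @ y / n)

Xt = sphere(5000, d, np.sqrt(d))                  # fresh test points
R = np.mean((Xt @ beta1 - np.maximum(Xt @ W.T, 0.0) @ a_hat) ** 2)
print("estimated test error:", R)
```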
Assumptions

◮ Proportional regime: $N/d \to \psi_1$, $n/d \to \psi_2$ as $d \to \infty$.
◮ Activation: $\sigma$ of sub-exponential growth, including ReLU, $\tanh$, etc.
◮ Target function: $f_\star(x) = \langle \beta_1, x \rangle$.
Precise asymptotics

Theorem (M. and Montanari, 2019). Assume $f_\star(x) = \langle \beta_1, x \rangle$ and define (for $G \sim \mathsf{N}(0, 1)$)
\[ \mu_1 = \mathbb{E}[\sigma(G) G], \qquad \mu_\star^2 = \mathbb{E}[\sigma(G)^2] - \mathbb{E}[\sigma(G)]^2 - \mathbb{E}[\sigma(G) G]^2, \qquad \zeta = \mu_1 / \mu_\star. \]
Let $N/d \to \psi_1$, $n/d \to \psi_2$ as $d \to \infty$. Then for any $\lambda > 0$ we have
\[ R_{\mathrm{RF}}(\hat{a}_\lambda, f_\star) = \|\beta_1\|_2^2 \cdot \mathscr{B}(\zeta, \psi_1, \psi_2, \lambda/\mu_\star^2) + \tau^2 \cdot \mathscr{V}(\zeta, \psi_1, \psi_2, \lambda/\mu_\star^2) + o_{d, \mathbb{P}}(1), \]
where the functions $\mathscr{B}$ and $\mathscr{V}$ are given explicitly below.
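The constants $\mu_1$, $\mu_\star$, $\zeta$ are one-dimensional Gaussian integrals, so they are easy to evaluate for any activation. A quick sketch (my addition) using Gauss-Hermite quadrature, shown for ReLU, where the exact values are $\mu_1 = 1/2$ and $\mu_\star^2 = 1/4 - 1/(2\pi)$:

```python
# Constants of the theorem for G ~ N(0, 1), via Gauss-Hermite quadrature.
import numpy as np

nodes, weights = np.polynomial.hermite_e.hermegauss(200)  # weight exp(-x^2/2)
w = weights / np.sqrt(2 * np.pi)                          # normalize to N(0, 1)

def gauss_mean(f):
    return np.sum(w * f(nodes))

sigma = lambda g: np.maximum(g, 0.0)                      # ReLU activation
mu1 = gauss_mean(lambda g: sigma(g) * g)
mu_star2 = gauss_mean(lambda g: sigma(g) ** 2) - gauss_mean(sigma) ** 2 - mu1 ** 2
zeta = mu1 / np.sqrt(mu_star2)
print(mu1, mu_star2, zeta)    # ReLU: 0.5, 1/4 - 1/(2*pi) ~ 0.0908, zeta ~ 1.66
```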
Explicit formulae

Let the functions $\nu_1, \nu_2 : \mathbb{C}_+ \to \mathbb{C}_+$ be the unique solutions of
\[ \nu_1 = \psi_1 \Big( -\xi - \nu_2 - \frac{\zeta^2 \nu_2}{1 - \zeta^2 \nu_1 \nu_2} \Big)^{-1}, \qquad \nu_2 = \psi_2 \Big( -\xi - \nu_1 - \frac{\zeta^2 \nu_1}{1 - \zeta^2 \nu_1 \nu_2} \Big)^{-1}. \]
Let $\chi \equiv \nu_1(\mathrm{i} (\psi_1 \psi_2 \lambda)^{1/2}) \cdot \nu_2(\mathrm{i} (\psi_1 \psi_2 \lambda)^{1/2})$, and
\[ \begin{aligned} E_0(\zeta, \psi_1, \psi_2, \lambda) \equiv{}& -\chi^5 \zeta^6 + 3 \chi^4 \zeta^4 + (\psi_1 \psi_2 - \psi_2 - \psi_1 + 1) \chi^3 \zeta^6 - 2 \chi^3 \zeta^4 - 3 \chi^3 \zeta^2 \\ &+ (\psi_1 + \psi_2 - 3 \psi_1 \psi_2 + 1) \chi^2 \zeta^4 + 2 \chi^2 \zeta^2 + \chi^2 + 3 \psi_1 \psi_2 \chi \zeta^2 - \psi_1 \psi_2, \\ E_1(\zeta, \psi_1, \psi_2, \lambda) \equiv{}& \psi_2 \chi^3 \zeta^4 - \psi_2 \chi^2 \zeta^2 + \psi_1 \psi_2 \chi \zeta^2 - \psi_1 \psi_2, \\ E_2(\zeta, \psi_1, \psi_2, \lambda) \equiv{}& \chi^5 \zeta^6 - 3 \chi^4 \zeta^4 + (\psi_1 - 1) \chi^3 \zeta^6 + 2 \chi^3 \zeta^4 + 3 \chi^3 \zeta^2 \\ &+ (-\psi_1 - 1) \chi^2 \zeta^4 - 2 \chi^2 \zeta^2 - \chi^2. \end{aligned} \]
We then have
\[ \mathscr{B}(\zeta, \psi_1, \psi_2, \lambda) \equiv \frac{E_1(\zeta, \psi_1, \psi_2, \lambda)}{E_0(\zeta, \psi_1, \psi_2, \lambda)}, \qquad \mathscr{V}(\zeta, \psi_1, \psi_2, \lambda) \equiv \frac{E_2(\zeta, \psi_1, \psi_2, \lambda)}{E_0(\zeta, \psi_1, \psi_2, \lambda)}. \]
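These formulae are directly computable. The sketch below (my illustration; the damped fixed-point iteration and its starting point in the upper half plane are heuristic choices, not prescribed by the slides) solves for $(\nu_1, \nu_2)$ at $\xi = \mathrm{i}(\psi_1 \psi_2 \lambda)^{1/2}$, forms $\chi$, and evaluates $\mathscr{B}$ and $\mathscr{V}$. Note that `lam` here plays the role of the rescaled $\lambda / \mu_\star^2$ from the theorem.

```python
# Asymptotic bias/variance (B, V) from the explicit formulae.
import numpy as np

def BV(zeta, psi1, psi2, lam, iters=10_000):
    xi = 1j * np.sqrt(psi1 * psi2 * lam)
    nu1 = nu2 = 0.1j                        # start in the upper half plane
    for _ in range(iters):                  # damped fixed-point iteration
        denom = 1 - zeta**2 * nu1 * nu2
        new1 = psi1 / (-xi - nu2 - zeta**2 * nu2 / denom)
        new2 = psi2 / (-xi - nu1 - zeta**2 * nu1 / denom)
        nu1, nu2 = 0.9 * nu1 + 0.1 * new1, 0.9 * nu2 + 0.1 * new2
    chi, z2 = nu1 * nu2, zeta**2
    E0 = (-chi**5 * z2**3 + 3 * chi**4 * z2**2
          + (psi1 * psi2 - psi2 - psi1 + 1) * chi**3 * z2**3
          - 2 * chi**3 * z2**2 - 3 * chi**3 * z2
          + (psi1 + psi2 - 3 * psi1 * psi2 + 1) * chi**2 * z2**2
          + 2 * chi**2 * z2 + chi**2 + 3 * psi1 * psi2 * chi * z2 - psi1 * psi2)
    E1 = (psi2 * chi**3 * z2**2 - psi2 * chi**2 * z2
          + psi1 * psi2 * chi * z2 - psi1 * psi2)
    E2 = (chi**5 * z2**3 - 3 * chi**4 * z2**2 + (psi1 - 1) * chi**3 * z2**3
          + 2 * chi**3 * z2**2 + 3 * chi**3 * z2
          + (-psi1 - 1) * chi**2 * z2**2 - 2 * chi**2 * z2 - chi**2)
    return (E1 / E0).real, (E2 / E0).real

print(BV(zeta=1.66, psi1=2.0, psi2=3.0, lam=1e-3))   # (B, V) at these ratios
# Sweeping psi1 at fixed psi2 traces out the double descent curve below.
```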
Figure: Test error as a function of $N/n$. Left: $\lambda = 3 \times 10^{-4}$. Right: $\lambda = 0+$.

✓ Peak at the interpolation threshold.
✓ Monotonically decreasing in the overparameterized regime.
✓ Global minimum as the number of parameters tends to infinity.
Further insights
Figure: Test error as a function of $N/n$ (log scale). Left: SNR = 5. Right: SNR = 1/5.

For any $\lambda$, the minimum generalization error is achieved as $N/n \to \infty$.
Figure: Test error as a function of $N/n$ (log scale) at the optimal $\lambda$.

For optimal $\lambda$, the generalization error is monotonically decreasing in $N/n$.
Figure: Test error as a function of $\lambda$ (log scale). Left: SNR = 5. Right: SNR = 1/10.

◮ High SNR: minimum at $\lambda = 0+$.
◮ Low SNR: minimum at $\lambda > 0$.
Proof strategy

Random matrix theory for random kernel inner-product matrices.
Conclusion

◮ The number of parameters is not the right measure of model complexity for controlling the generalization error (we already knew this).
◮ The double descent phenomenon also appears in linearized neural networks.
◮ When the SNR is high, interpolating without regularization ($\lambda = 0+$) can outperform ridge regularization.