

1. The generalization error of random features model: Precise asymptotics and double descent curve. Song Mei, Stanford University. Joint work with Andrea Montanari. September 8, 2019.

2. Surprises in the generalization behavior of neural networks
Figure: experiments on MNIST by [Neyshabur, Tomioka, Srebro, 2014a].
Surprise: why doesn't higher model complexity induce larger generalization error?

3. Partial explanations
The intrinsic model complexity is not the number of parameters but "some norm" of the weights, and this intrinsic complexity is implicitly controlled by SGD. [Neyshabur, Tomioka, Srebro, 2014b], [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, 2017], ...

4. Train more carefully to interpolate the data better
Figure: experiments on MNIST. Left: [Spigler, Geiger, d'Ascoli, Sagun, Biroli, Wyart, 2018]. Right: [Belkin, Hsu, Ma, Mandal, 2018].

5. Double descent
Figure: a cartoon by [Belkin, Hsu, Ma, Mandal, 2018].
◮ Peak at the interpolation threshold.
◮ Monotonically decreasing in the overparameterized regime.
◮ Global minimum when the number of parameters is infinite.

6. The misspecified linear model
Figure: by [Hastie, Montanari, Rosset, Tibshirani, 2019]; see also [Belkin, Hsu, Xu, 2019].
Model: $y = \langle x_S, \beta_S \rangle + \varepsilon$ for $|S| = k$. Fitting: $L(\beta) = \hat{\mathbb{E}}[(y - \langle x, \beta \rangle)^2]$.
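
As a sanity check on this picture, here is a minimal numpy sketch of the misspecified linear model (the dimensions, noise level, and grid of $p$ values are illustrative choices, not values from the slide): the truth uses $k$ of $d$ coordinates, we fit min-norm least squares on the first $p$ coordinates, and the exact test risk for isotropic Gaussian $x$ peaks near the interpolation threshold $p = n$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n, tau = 200, 100, 80, 0.5            # ambient dim, true support size, samples, noise
beta = np.zeros(d)
beta[:k] = rng.normal(size=k) / np.sqrt(k)  # truth supported on the first k coordinates

X = rng.normal(size=(n, d))
y = X @ beta + tau * rng.normal(size=n)

def test_risk(p):
    """Min-norm least squares on the first p coordinates; exact risk for isotropic x."""
    bhat = np.linalg.pinv(X[:, :p]) @ y     # pinv gives the min-norm solution when p > n
    resid = beta.copy()
    resid[:p] -= bhat                       # misfit on fitted coords + omitted coords
    return resid @ resid + tau**2           # E[(y_new - <x_new[:p], bhat>)^2]

for p in [20, 60, 79, 80, 81, 120, 200]:
    print(f"p = {p:3d}   risk ~ {test_risk(p):.3f}")
```

The risk blows up near $p = n$ and then descends again as $p$ grows, reproducing the shape in the figure.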

7. The misspecified linear model
✓ Peak at the interpolation threshold.
✗ Monotonically decreasing in the overparameterized regime.
✗ Global minimum when the number of parameters is infinite.

8. Goal: find a tractable model that exhibits all the features of the double descent curve.
Figure: by [Belkin, Hsu, Ma, Mandal, 2018].

9. The neural tangent model
◮ Let $f(x; \theta)$ be a multi-layer neural network: $f(x; \theta) = \sigma(W_1 \sigma(W_2 \cdots \sigma(W_L x)))$.
◮ NT model: the linearization of $f(x; \theta)$ around the initialization $\theta_0$,
$$f_{\mathrm{NT}}(x; \theta) = \langle \theta, \nabla_\theta f(x; \theta_0) \rangle.$$
[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Chizat, Bach, 2018b].
◮ Under suitable conditions on the initialization and learning rate, the training trajectory of the neural tangent model stays uniformly close to that of the neural network.

10. Two-layer neural tangent model
The two-layer neural tangent model is
$$f_{\mathrm{NT}}(x; \{a_j\}, \{t_j\}) = \underbrace{\sum_{j=1}^{N} a_j \,\sigma(\langle w_j, x \rangle)}_{\text{second-layer linearization}} + \underbrace{\sum_{j=1}^{N} \langle t_j, x \rangle \,\sigma'(\langle w_j, x \rangle)}_{\text{first-layer linearization}},$$
with random weights $w_j \sim_{\mathrm{iid}} \mathrm{Unif}(S^{d-1})$.
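
Read concretely, the NT model is linear regression on a stacked feature map. The sketch below (function names and sizes are illustrative choices, not from the slide) assembles that design matrix for ReLU; regressing $y$ on its $N + Nd$ columns recovers the coefficients $\{a_j\}$ and $\{t_j\}$.

```python
import numpy as np

def nt_features(X, W, sigma, dsigma):
    """Two-layer NT design matrix: columns sigma(<w_j, x>) (second-layer
    linearization, coefficients a_j) and x * dsigma(<w_j, x>) (first-layer
    linearization, coefficients t_j)."""
    Z = X @ W.T                                    # (n, N) preactivations <w_j, x_i>
    second = sigma(Z)                              # (n, N)
    first = (dsigma(Z)[:, :, None] * X[:, None, :]).reshape(len(X), -1)  # (n, N*d)
    return np.hstack([second, first])

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

rng = np.random.default_rng(0)
n, d, N = 50, 20, 30
X = rng.normal(size=(n, d))
W = rng.normal(size=(N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # w_j ~ Unif(S^{d-1})
print(nt_features(X, W, relu, drelu).shape)        # (50, 30 + 30*20) = (50, 630)
```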

11. An even simpler model
The random features model is
$$f_{\mathrm{RF}}(x; a) = \sum_{j=1}^{N} a_j \,\sigma(\langle w_j, x \rangle),$$
with random weights $w_j \sim_{\mathrm{iid}} \mathrm{Unif}(S^{d-1})$.

12. Setting
◮ $n$ data points, $N$ features, in dimension $d$, all proportional as $d \to \infty$.
◮ Data $(x_i)_{i \in [n]} \sim \mathrm{Unif}(S^{d-1}(\sqrt{d}))$, $y_i = f_\star(x_i) + \varepsilon_i$, $\mathbb{E}[\varepsilon_i^2] = \tau^2$.
◮ Features $(w_j)_{j \in [N]} \sim_{\mathrm{iid}} \mathrm{Unif}(S^{d-1})$.
◮ Random feature regression: $\hat{a}_\lambda = \arg\min_a L_\lambda(a)$, where
$$L_\lambda(a) = \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{N} a_j \sigma(\langle x_i, w_j \rangle) \Big)^2 + \frac{N \lambda}{d} \|a\|_2^2, \qquad R(a) = \mathbb{E}_{x,y} \Big[ \Big( f_\star(x) - \sum_{j=1}^{N} a_j \sigma(\langle x, w_j \rangle) \Big)^2 \Big].$$
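
This setting can be simulated in a few lines. The sketch below assumes a ReLU activation, takes $f_\star(x) = \langle \beta_1, x \rangle$ as on the next slides, uses illustrative values of $d, \psi_1, \psi_2, \lambda, \tau$, and estimates $R(\hat{a}_\lambda)$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
d, psi1, psi2, lam, tau = 100, 3.0, 2.0, 1e-3, 0.5   # illustrative values
N, n = int(psi1 * d), int(psi2 * d)

def sphere(m, dim, radius):
    """m iid points uniform on the sphere of the given radius in R^dim."""
    Z = rng.normal(size=(m, dim))
    return radius * Z / np.linalg.norm(Z, axis=1, keepdims=True)

beta = rng.normal(size=d)
beta /= np.linalg.norm(beta)                  # f_star(x) = <beta_1, x>, ||beta_1|| = 1
W = sphere(N, d, 1.0)                         # w_j ~ Unif(S^{d-1})
X = sphere(n, d, np.sqrt(d))                  # x_i ~ Unif(S^{d-1}(sqrt(d)))
y = X @ beta + tau * rng.normal(size=n)

Phi = np.maximum(X @ W.T, 0.0)                # ReLU random features sigma(<x_i, w_j>)
# Ridge estimator: argmin_a (1/n)||y - Phi a||^2 + (N lam / d)||a||^2
a_hat = np.linalg.solve(Phi.T @ Phi / n + (N * lam / d) * np.eye(N), Phi.T @ y / n)

X_test = sphere(20_000, d, np.sqrt(d))        # Monte Carlo estimate of R(a_hat)
pred = np.maximum(X_test @ W.T, 0.0) @ a_hat
print("test risk ~", np.mean((X_test @ beta - pred) ** 2))
```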

13. Assumptions
◮ Proportional regime: $N/d \to \psi_1$, $n/d \to \psi_2$, as $d \to \infty$.
◮ Activation: $\sigma$ has sub-exponential growth, which includes ReLU, $\tanh$, etc.
◮ Truth: $f_\star(x) = \langle \beta_1, x \rangle$.

14. Precise asymptotics
Theorem (M. and Montanari, 2019). Assume $f_\star(x) = \langle \beta_1, x \rangle$ and define (for $G \sim N(0,1)$)
$$\mu_1 = \mathbb{E}[\sigma(G)\, G], \qquad \mu_\star^2 = \mathbb{E}[\sigma(G)^2] - \mathbb{E}[\sigma(G)]^2 - \mathbb{E}[\sigma(G)\, G]^2, \qquad \zeta = \mu_1 / \mu_\star.$$
Let $N/d \to \psi_1$, $n/d \to \psi_2$, as $d \to \infty$. Then for any $\lambda > 0$, we have
$$R_{\mathrm{RF}}(\hat{a}_\lambda, f_\star) = \|\beta_1\|_2^2 \cdot B(\zeta, \psi_1, \psi_2, \lambda/\mu_\star^2) + \tau^2 \cdot V(\zeta, \psi_1, \psi_2, \lambda/\mu_\star^2) + o_{d,\mathbb{P}}(1),$$
where the functions $B$ and $V$ are given explicitly below.
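
These Gaussian moments are easy to evaluate for any concrete activation; for ReLU they come out in closed form to $\mu_1 = 1/2$ and $\mu_\star^2 = 1/4 - 1/(2\pi) \approx 0.0908$, hence $\zeta \approx 1.659$. A minimal quadrature sketch (the quadrature order is an arbitrary choice):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite quadrature for expectations over G ~ N(0, 1)
nodes, weights = hermegauss(100)
weights = weights / np.sqrt(2 * np.pi)        # normalize so sum(weights) = 1
E = lambda f: np.sum(weights * f(nodes))

sigma = lambda g: np.maximum(g, 0.0)          # ReLU; swap in any activation
mu1 = E(lambda g: sigma(g) * g)               # = 1/2 for ReLU
mu_star2 = E(lambda g: sigma(g) ** 2) - E(sigma) ** 2 - mu1 ** 2
zeta = mu1 / np.sqrt(mu_star2)

print(mu1, mu_star2, zeta)                    # ~ 0.5, 0.0908 (= 1/4 - 1/(2*pi)), 1.659
```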

15. Explicit formulae
Let the functions $\nu_1, \nu_2 : \mathbb{C}_+ \to \mathbb{C}_+$ be the unique solutions of
$$\nu_1 = \psi_1 \Big( -\xi - \nu_2 - \frac{\zeta^2 \nu_2}{1 - \zeta^2 \nu_1 \nu_2} \Big)^{-1}, \qquad \nu_2 = \psi_2 \Big( -\xi - \nu_1 - \frac{\zeta^2 \nu_1}{1 - \zeta^2 \nu_1 \nu_2} \Big)^{-1}.$$
Let $\chi \equiv \nu_1\big(i(\psi_1 \psi_2 \lambda)^{1/2}\big) \cdot \nu_2\big(i(\psi_1 \psi_2 \lambda)^{1/2}\big)$, and
$$\begin{aligned}
E_0(\zeta, \psi_1, \psi_2, \lambda) &\equiv -\chi^5 \zeta^6 + 3\chi^4 \zeta^4 + (\psi_1\psi_2 - \psi_2 - \psi_1 + 1)\chi^3 \zeta^6 - 2\chi^3 \zeta^4 - 3\chi^3 \zeta^2 \\
&\quad + (\psi_1 + \psi_2 - 3\psi_1\psi_2 + 1)\chi^2 \zeta^4 + 2\chi^2 \zeta^2 + \chi^2 + 3\psi_1\psi_2 \chi \zeta^2 - \psi_1\psi_2, \\
E_1(\zeta, \psi_1, \psi_2, \lambda) &\equiv \psi_2 \chi^3 \zeta^4 - \psi_2 \chi^2 \zeta^2 + \psi_1\psi_2 \chi \zeta^2 - \psi_1\psi_2, \\
E_2(\zeta, \psi_1, \psi_2, \lambda) &\equiv \chi^5 \zeta^6 - 3\chi^4 \zeta^4 + (\psi_1 - 1)\chi^3 \zeta^6 + 2\chi^3 \zeta^4 + 3\chi^3 \zeta^2 + (-\psi_1 - 1)\chi^2 \zeta^4 - 2\chi^2 \zeta^2 - \chi^2.
\end{aligned}$$
We then have
$$B(\zeta, \psi_1, \psi_2, \lambda) \equiv \frac{E_1(\zeta, \psi_1, \psi_2, \lambda)}{E_0(\zeta, \psi_1, \psi_2, \lambda)}, \qquad V(\zeta, \psi_1, \psi_2, \lambda) \equiv \frac{E_2(\zeta, \psi_1, \psi_2, \lambda)}{E_0(\zeta, \psi_1, \psi_2, \lambda)}.$$
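
These formulae are directly computable. The sketch below evaluates $B$ and $V$ with a naively damped fixed-point iteration for $\nu_1, \nu_2$ at $\xi = i(\psi_1\psi_2\lambda)^{1/2}$ (a simple heuristic that appears to converge on the positive imaginary axis, not a prescribed solver from the talk), then plugs $\chi$ into $E_0, E_1, E_2$:

```python
import numpy as np

def BV(zeta, psi1, psi2, lam, iters=20_000, damp=0.9):
    """Evaluate B and V via a damped fixed-point iteration for nu1, nu2."""
    z2 = zeta ** 2
    xi = 1j * np.sqrt(psi1 * psi2 * lam)
    nu1 = nu2 = 0.5j                              # start in the upper half-plane
    for _ in range(iters):
        den = 1 - z2 * nu1 * nu2
        new1 = psi1 / (-xi - nu2 - z2 * nu2 / den)
        new2 = psi2 / (-xi - nu1 - z2 * nu1 / den)
        nu1 = damp * nu1 + (1 - damp) * new1
        nu2 = damp * nu2 + (1 - damp) * new2
    chi = (nu1 * nu2).real                        # chi = nu1(xi) * nu2(xi)
    E0 = (-chi**5 * z2**3 + 3 * chi**4 * z2**2
          + (psi1 * psi2 - psi2 - psi1 + 1) * chi**3 * z2**3
          - 2 * chi**3 * z2**2 - 3 * chi**3 * z2
          + (psi1 + psi2 - 3 * psi1 * psi2 + 1) * chi**2 * z2**2
          + 2 * chi**2 * z2 + chi**2 + 3 * psi1 * psi2 * chi * z2 - psi1 * psi2)
    E1 = psi2 * chi**3 * z2**2 - psi2 * chi**2 * z2 + psi1 * psi2 * chi * z2 - psi1 * psi2
    E2 = (chi**5 * z2**3 - 3 * chi**4 * z2**2 + (psi1 - 1) * chi**3 * z2**3
          + 2 * chi**3 * z2**2 + 3 * chi**3 * z2
          - (psi1 + 1) * chi**2 * z2**2 - 2 * chi**2 * z2 - chi**2)
    return E1 / E0, E2 / E0

# With the theorem on the previous slide: risk ~ ||beta_1||^2 * B + tau^2 * V,
# with the last argument set to lambda / mu_star^2 (here: ReLU, zeta ~ 1.659).
B, V = BV(zeta=1.659, psi1=3.0, psi2=2.0, lam=1e-3 / 0.0908)
print(B, V)
```

Sweeping $\psi_1$ at fixed $\psi_2$ with this function traces out the double descent curves on the following slides.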

16. Figure: asymptotic test error curves for $\lambda = 3 \times 10^{-4}$ (left) and $\lambda = 0+$ (right).
✓ Peak at the interpolation threshold.
✓ Monotonically decreasing in the overparameterized regime.
✓ Global minimum when the number of parameters is infinite.

17. Further insights

18. Figure: test error as a function of $N/n$ (log scale), for SNR = 5 (left) and SNR = 1/5 (right).
For any $\lambda$, the minimum generalization error is achieved as $N/n \to \infty$.

19. Figure: test error as a function of $N/n$ (log scale).
For the optimal $\lambda$, the generalization error is monotonically decreasing in $N/n$.

20. Figure: test error as a function of $\lambda$ (log scale), for SNR = 5 (left) and SNR = 1/10 (right).
◮ High SNR: minimum at $\lambda = 0+$.
◮ Low SNR: minimum at some $\lambda > 0$.

21. Proof strategy: random matrix theory for random kernel inner-product matrices.

22. Conclusion
◮ The number of parameters is not the right notion of model complexity for controlling the generalization error (we already knew this).
◮ The double descent phenomenon also appears in linearized neural networks.
◮ When the SNR is high, training without regularization can outperform training with regularization.
