SLIDE 1

The generalization error of random features model: Precise asymptotics and double descent curve

Song Mei

Stanford University

September 8, 2019

Joint work with Andrea Montanari

Song Mei (Stanford University) Random feature regression September 8, 2019 1 / 22

SLIDE 2

Surprises in the generalization behavior of neural networks

Figure: Experiments on MNIST by [Neyshabur, Tomioka, Srebro, 2014a]

Surprise: why doesn't higher model complexity induce larger generalization error?

SLIDE 3

Partial explanations: the intrinsic model complexity is not the number of parameters but "some norm" of the weights, and this intrinsic complexity is implicitly controlled by SGD. [Neyshabur, Tomioka, Srebro, 2014b], [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, 2017], among others.

SLIDE 4

Training more carefully to better interpolate the data

Figure: Experiments on MNIST. Left: [Spigler, Geiger, d'Ascoli, Sagun, Biroli, Wyart, 2018]. Right: [Belkin, Hsu, Ma, Mandal, 2018].

SLIDE 5

Double descent

Figure: A cartoon by [Belkin, Hsu, Ma, Mandal, 2018].

Peak at the interpolation threshold. Monotonically decreasing in the overparameterized regime. Global minimum as the number of parameters tends to infinity.

SLIDE 6

The misspecified linear model

Figure: By [Hastie, Montanari, Rosset, Tibshirani, 2019]. See also [Belkin, Hsu, Xu, 2019].

Model: y = ⟨x_S, β_S⟩ + ε, for |S| = k. Fitting: L(β) = Ê[(y − ⟨x, β⟩)²].
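A minimal simulation reproduces this picture. The sketch below is my own illustration, not code from the talk: all sizes (D, n, tau, and the grid of p values) are arbitrary choices. It fits min-norm least squares on the first p of D true features, in the spirit of [Hastie, Montanari, Rosset, Tibshirani, 2019], so the model is misspecified whenever p < D:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, tau, trials = 160, 40, 0.5, 30    # true dimension, sample size, noise sd
beta = rng.standard_normal(D)
beta /= np.linalg.norm(beta)            # ||beta|| = 1

def risk(p):
    """Average excess risk of min-norm least squares using the first p of D features."""
    out = []
    for _ in range(trials):
        X = rng.standard_normal((n, D))
        y = X @ beta + tau * rng.standard_normal(n)
        bhat = np.linalg.pinv(X[:, :p]) @ y   # min-norm least squares on p features
        diff = beta.copy()
        diff[:p] -= bhat                      # risk = ||beta - padded(bhat)||^2 for isotropic x
        out.append(diff @ diff)
    return float(np.mean(out))

# risk typically spikes near the interpolation threshold p = n
r_under, r_thresh, r_over = risk(20), risk(n), risk(160)
```

Varying p over a finer grid traces out the full double descent curve of the figure.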

SLIDE 7

The misspecified linear model

Peak at the interpolation threshold. Monotonically decreasing in the overparameterized regime? Global minimum when the number of parameters is infinite?

SLIDE 8

Goal: find a tractable model that exhibits all the features of the double descent curve.

Figure: By [Belkin, Hsu, Ma, Mandal, 2018].

SLIDE 9

The neural tangent model

◮ Let f(x; θ) be a multi-layer neural network,

f(x; θ) = σ(W₁ σ(W₂ ⋯ σ(W_L x))).

◮ NT model: linearization of f(x; θ) around the initialization θ₀,

f_NT(x; θ) = ⟨θ, ∇_θ f(x, θ₀)⟩.

[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Chizat, Bach, 2018b].

◮ Under suitable conditions on the initialization and learning rate, the training trajectories of the neural tangent model and of the neural network stay uniformly close.

SLIDE 10

Two-layer neural tangent model

The two-layer neural tangent model:

f_NT(x; {a_j}, {t_j}) = Σ_{j=1}^N a_j σ(⟨w_j, x⟩) + Σ_{j=1}^N ⟨t_j, x⟩ σ′(⟨w_j, x⟩),

where the first sum is the second-layer linearization and the second sum is the first-layer linearization. Random weights w_j ∼iid Unif(S^{d−1}).
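To make the linearization concrete, here is a small numerical check (my own sketch; the network sizes, the tanh activation, and the perturbation scale are arbitrary choices) that the NT model is the first-order Taylor expansion of a two-layer network around its initialization, with the slide's t_j playing the role of a_j · δw_j:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 10, 50

x = rng.standard_normal(d)
a0 = rng.standard_normal(N) / np.sqrt(N)       # second-layer init
W0 = rng.standard_normal((N, d)) / np.sqrt(d)  # first-layer init

def f(a, W):
    """Two-layer network f(x) = sum_j a_j sigma(<w_j, x>), with sigma = tanh."""
    return a @ np.tanh(W @ x)

def f_nt(da, dW):
    """NT model: second-layer linearization + first-layer linearization."""
    z = W0 @ x
    return da @ np.tanh(z) + (a0 * (1.0 - np.tanh(z) ** 2)) @ (dW @ x)

eps = 1e-4
da = eps * rng.standard_normal(N)
dW = eps * rng.standard_normal((N, d))
# first-order Taylor error is O(eps^2), hence tiny at this scale
err = abs(f(a0 + da, W0 + dW) - f(a0, W0) - f_nt(da, dW))
```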

SLIDE 11

An even simpler model

The random features model:

f_RF(x; a) = Σ_{j=1}^N a_j σ(⟨w_j, x⟩).

Random weights w_j ∼iid Unif(S^{d−1}).

SLIDE 12

Setting

◮ n data points, N features, in dimension d, all proportional as d → ∞.

◮ Data (x_i)_{i∈[n]} ∼ Unif(S^{d−1}(√d)), with y_i = f⋆(x_i) + ε_i and E[ε_i²] = τ².

◮ Features (w_j)_{j∈[N]} ∼iid Unif(S^{d−1}).

◮ Random features ridge regression: â_λ = argmin_a L_λ(a), where

L_λ(a) = (1/n) Σ_{i=1}^n ( y_i − Σ_{j=1}^N a_j σ(⟨x_i, w_j⟩) )² + (Nλ/d) ‖a‖₂²,

R(a) = E_x[ ( f⋆(x) − Σ_{j=1}^N a_j σ(⟨x, w_j⟩) )² ].
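This setting is easy to simulate directly. The following sketch is my own code, not from the talk: the sizes d, n, N and the values of τ and λ are arbitrary choices, the activation is ReLU, and the generalization error is estimated by Monte Carlo on fresh test points:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 80, 800, 400                  # psi_1 = N/d = 5, psi_2 = n/d = 10
tau, lam = 0.1, 1e-3
sigma = lambda t: np.maximum(t, 0.0)    # ReLU activation

def sphere(m, dim, radius):
    """m iid points uniform on the sphere of the given radius in R^dim."""
    g = rng.standard_normal((m, dim))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)            # target f_star(x) = <beta, x>, ||beta|| = 1
W = sphere(N, d, 1.0)                   # w_j ~ Unif(S^{d-1})
X = sphere(n, d, np.sqrt(d))            # x_i ~ Unif(S^{d-1}(sqrt(d)))
y = X @ beta + tau * rng.standard_normal(n)

# ridge solution of (1/n)||y - Z a||^2 + (N lam / d)||a||^2
Z = sigma(X @ W.T)
a_hat = np.linalg.solve(Z.T @ Z / n + (N * lam / d) * np.eye(N), Z.T @ y / n)

# Monte Carlo estimate of R(a_hat) = E_x[(f_star(x) - f_RF(x; a_hat))^2]
X_test = sphere(5000, d, np.sqrt(d))
R_hat = float(np.mean((X_test @ beta - sigma(X_test @ W.T) @ a_hat) ** 2))
```

Sweeping N (hence ψ₁) at fixed n traces out the double descent curves of the later slides.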

SLIDE 13

Assumption

◮ Proportional regime: N/d → ψ₁ and n/d → ψ₂, as d → ∞. ◮ Activation: σ of sub-exponential growth, including ReLU, tanh, etc. ◮ Target function: f⋆(x) = ⟨β₁, x⟩.

SLIDE 14

Precise asymptotics

Theorem (M. and Montanari, 2019)

Assume f⋆(x) = ⟨β₁, x⟩ and define, for G ∼ N(0, 1),

μ₁ = E[σ(G) G],  μ⋆² = E[σ(G)²] − E[σ(G)]² − E[σ(G) G]²,  ζ = μ₁/μ⋆.

Let N/d → ψ₁ and n/d → ψ₂, as d → ∞. Then, for any λ > 0, we have

R_RF(â_λ, f⋆) = ‖β₁‖₂² · B(ζ, ψ₁, ψ₂, λ/μ⋆²) + τ² · V(ζ, ψ₁, ψ₂, λ/μ⋆²) + o_{d,P}(1),

where the functions B and V are given explicitly below.
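For a concrete activation, the constants in the theorem are simple Gaussian moments. A quick check for ReLU (my own sketch; the closed forms μ₁ = 1/2, E[σ(G)] = 1/√(2π), E[σ(G)²] = 1/2, hence μ⋆² = 1/4 − 1/(2π), follow from standard Gaussian integrals):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal(2_000_000)
s = np.maximum(G, 0.0)                       # sigma = ReLU

mu1 = float(np.mean(s * G))                  # E[sigma(G) G], closed form 1/2
mu_star2 = float(np.mean(s**2) - np.mean(s)**2 - mu1**2)
mu_star2_exact = 0.25 - 1.0 / (2.0 * np.pi)  # closed form, ~0.0908
zeta = mu1 / np.sqrt(mu_star2)               # ~1.66 for ReLU
```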

SLIDE 15

Explicit formulae

Write λ̄ = λ/μ⋆². Let the functions ν₁, ν₂ : C₊ → C₊ be the unique solution of

ν₁ = ψ₁ ( −ξ − ν₂ − ζ²ν₂/(1 − ζ²ν₁ν₂) )⁻¹,
ν₂ = ψ₂ ( −ξ − ν₁ − ζ²ν₁/(1 − ζ²ν₁ν₂) )⁻¹.

Let

χ ≡ ν₁(i(ψ₁ψ₂λ̄)^{1/2}) · ν₂(i(ψ₁ψ₂λ̄)^{1/2}),

and

E₀(ζ, ψ₁, ψ₂, λ̄) ≡ −χ⁵ζ⁶ + 3χ⁴ζ⁴ + (ψ₁ψ₂ − ψ₂ − ψ₁ + 1)χ³ζ⁶ − 2χ³ζ⁴ − 3χ³ζ² + (ψ₁ + ψ₂ − 3ψ₁ψ₂ + 1)χ²ζ⁴ + 2χ²ζ² + χ² + 3ψ₁ψ₂χζ² − ψ₁ψ₂,

E₁(ζ, ψ₁, ψ₂, λ̄) ≡ ψ₂χ³ζ⁴ − ψ₂χ²ζ² + ψ₁ψ₂χζ² − ψ₁ψ₂,

E₂(ζ, ψ₁, ψ₂, λ̄) ≡ χ⁵ζ⁶ − 3χ⁴ζ⁴ + (ψ₁ − 1)χ³ζ⁶ + 2χ³ζ⁴ + 3χ³ζ² − (ψ₁ + 1)χ²ζ⁴ − 2χ²ζ² − χ².

We then have

B(ζ, ψ₁, ψ₂, λ̄) ≡ E₁(ζ, ψ₁, ψ₂, λ̄) / E₀(ζ, ψ₁, ψ₂, λ̄),  V(ζ, ψ₁, ψ₂, λ̄) ≡ E₂(ζ, ψ₁, ψ₂, λ̄) / E₀(ζ, ψ₁, ψ₂, λ̄).
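These formulae can be evaluated numerically. The sketch below is my own code, not from the talk: it solves the fixed-point equations by damped iteration at ξ = i(ψ₁ψ₂λ̄)^{1/2} and assembles B and V; the damping factor and iteration count are ad hoc choices. As a sanity check, heavy regularization drives the predictor to zero, so B → 1 and V → 0:

```python
import numpy as np

def BV(zeta, psi1, psi2, lam, iters=20000, t=0.5):
    """Evaluate B and V at (zeta, psi1, psi2, lam) via the fixed point for nu1, nu2."""
    z2 = zeta**2
    xi = 1j * np.sqrt(psi1 * psi2 * lam)
    nu1 = nu2 = 0.5j                       # start in the upper half plane
    for _ in range(iters):                 # damped fixed-point iteration
        nu1 = (1 - t) * nu1 + t * psi1 / (-xi - nu2 - z2 * nu2 / (1 - z2 * nu1 * nu2))
        nu2 = (1 - t) * nu2 + t * psi2 / (-xi - nu1 - z2 * nu1 / (1 - z2 * nu1 * nu2))
    chi = (nu1 * nu2).real                 # chi is real on the imaginary axis
    E0 = (-chi**5 * z2**3 + 3 * chi**4 * z2**2
          + (psi1 * psi2 - psi2 - psi1 + 1) * chi**3 * z2**3
          - 2 * chi**3 * z2**2 - 3 * chi**3 * z2
          + (psi1 + psi2 - 3 * psi1 * psi2 + 1) * chi**2 * z2**2
          + 2 * chi**2 * z2 + chi**2 + 3 * psi1 * psi2 * chi * z2 - psi1 * psi2)
    E1 = psi2 * chi**3 * z2**2 - psi2 * chi**2 * z2 + psi1 * psi2 * chi * z2 - psi1 * psi2
    E2 = (chi**5 * z2**3 - 3 * chi**4 * z2**2 + (psi1 - 1) * chi**3 * z2**3
          + 2 * chi**3 * z2**2 + 3 * chi**3 * z2
          - (psi1 + 1) * chi**2 * z2**2 - 2 * chi**2 * z2 - chi**2)
    return E1 / E0, E2 / E0

B_hi, V_hi = BV(1.5, 2.0, 3.0, 1e6)    # heavy ridge: expect B ~ 1, V ~ 0
B_mid, V_mid = BV(1.5, 2.0, 3.0, 0.5)
```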

SLIDE 16

Figure: asymptotic test error curves for λ = 0+ (left) and λ = 3 × 10⁻⁴ (right).

Peak at the interpolation threshold. Monotonically decreasing in the overparameterized regime. Global minimum as the number of parameters tends to infinity.

SLIDE 17

Further insights

SLIDE 18

Figure: test error as a function of N/n, at SNR = 5 (left) and SNR = 1/5 (right).

For any λ, the minimum generalization error is achieved as N/n → ∞.

SLIDE 19

Figure: test error as a function of N/n at the optimal λ.

For the optimal λ, the generalization error is monotonically decreasing in N/n.

SLIDE 20

Figure: test error as a function of λ, at SNR = 5 (left) and SNR = 1/10 (right).

◮ High SNR: minimum at λ = 0+. ◮ Low SNR: minimum at some λ > 0.

SLIDE 21

Proof strategy

Random matrix theory for the random kernel inner product matrices

SLIDE 22

Conclusion

◮ The number of parameters is not the right complexity measure for controlling the generalization error (we already knew this). ◮ The double descent phenomenon also appears in linearized neural networks. ◮ When the SNR is high, fitting without regularization can be better than fitting with regularization.
