

slide-1
SLIDE 1

A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu (MSR AI)

Yuanzhi Li (Stanford / Princeton)

Zhao Song (UT Austin / U of Washington / Harvard)

slide-2
SLIDE 2

Main Result

[Figure: network diagram. Input layer A, then L hidden layers with weight matrices W_1, …, W_L ∈ ℝ^{m×m}, each followed by ReLU, then output layer B. Training samples x_1, …, x_n ∈ ℝ^d.]

The main result is the following. Consider training the L hidden layers of a deep neural network, given n training data points that are non-degenerate, meaning their pairwise relative distance is at least δ. Suppose the network is overparameterized, meaning the number of neurons m is polynomial in n, L and δ^{-1}.

Main Theorem.
  • If the data are non-degenerate (e.g. unit norm and ‖x_i − x_j‖_2 ≥ δ)
  • If overparameterized: m ≥ poly(n, L, δ^{-1})

slide-3
SLIDE 3

Main Result

[Figure: the same L-hidden-layer ReLU network, with training samples x_1, …, x_n ∈ ℝ^d.]

Main Theorem.
  • If the data are non-degenerate (e.g. unit norm and ‖x_i − x_j‖_2 ≥ δ)
  • If overparameterized: m ≥ poly(n, L, δ^{-1})
  • Then SGD finds a training global minimum in T = poly(n, L)/δ^2 · log(1/ε) iterations for ℓ2-regression

Then, we proved stochastic gradient descent can find global minima in polynomial time by training only the hidden layers.
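To make the setting concrete, here is a minimal PyTorch sketch of the architecture and training procedure the theorem refers to: an L-hidden-layer, width-m ReLU network in which only the hidden weight matrices W_1, …, W_L are trained by SGD on the ℓ2-regression objective. This is our own illustration, not the authors' code; the class name DeepReLUNet and all dimensions and hyperparameters below are placeholder values.

```python
# Minimal sketch (assumed hyperparameters, not the paper's experiments):
# an L-hidden-layer, width-m ReLU network; only W_1..W_L are trained by SGD
# on the l2-regression objective F(W) = (1/2) * sum_i ||f(x_i) - y_i||^2.
import torch
import torch.nn as nn

n, d, m, L, d_out = 100, 20, 512, 5, 10   # samples, input dim, width, depth, output dim

class DeepReLUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.A = nn.Linear(d, m, bias=False)                  # input layer (frozen)
        self.W = nn.ModuleList(nn.Linear(m, m, bias=False) for _ in range(L))
        self.B = nn.Linear(m, d_out, bias=False)              # output layer (frozen)
        for p in list(self.A.parameters()) + list(self.B.parameters()):
            p.requires_grad_(False)                           # train hidden layers only

    def forward(self, x):
        h = torch.relu(self.A(x))
        for W_l in self.W:
            h = torch.relu(W_l(h))
        return self.B(h)

torch.manual_seed(0)
x = torch.nn.functional.normalize(torch.randn(n, d), dim=1)   # unit-norm inputs
y = torch.randn(n, d_out)

net = DeepReLUNet()
opt = torch.optim.SGD([p for p in net.parameters() if p.requires_grad], lr=1e-3)
for step in range(500):
    opt.zero_grad()
    loss = 0.5 * (net(x) - y).pow(2).sum()                    # l2-regression objective
    loss.backward()
    opt.step()
```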

slide-4
SLIDE 4

Main Result

[Figure: the same L-hidden-layer ReLU network, with training samples x_1, …, x_n ∈ ℝ^d.]

Main Theorem.
  • If the data are non-degenerate (e.g. unit norm and ‖x_i − x_j‖_2 ≥ δ)
  • If overparameterized: m ≥ poly(n, L, δ^{-1})
  • Then SGD finds a training global minimum in T = poly(n, L)/δ^2 · log(1/ε) iterations for ℓ2-regression

Similar results also hold for other losses and other network architectures such as ResNet and CNN. These can be found in the paper.

In paper:

  • also for other smooth losses (cross-entropy, etc)
  • also for other architectures (ResNet, CNN, etc)
slide-5
SLIDE 5

Key Message 1

[Figure: the same L-hidden-layer ReLU network, with training samples x_1, …, x_n ∈ ℝ^d.]

Main Theorem.
  • If the data are non-degenerate (e.g. unit norm and ‖x_i − x_j‖_2 ≥ δ)
  • If overparameterized: m ≥ poly(n, L, δ^{-1})
  • Then SGD finds a training global minimum in T = poly(n, L)/δ^2 · log(1/ε) iterations for ℓ2-regression

Our first key message is the following. Our theorem is obtained by training the hidden layers, whereas prior work [Daniely, NeurIPS 2017] studies training essentially only the last layer, which is an easy convex problem.
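To illustrate that contrast, here is a small sketch (ours, not from the paper or from [Daniely, NeurIPS 2017]): if the hidden layers are frozen at random initialization and only the output layer B is trained, the features are fixed functions of the input, so the ℓ2 objective in B is an ordinary least-squares problem and hence convex; training the hidden layers W_1, …, W_L is what makes the objective non-convex. All dimensions are toy placeholders.

```python
# Sketch (ours): with hidden layers frozen, the features Phi(x) are fixed, and
# fitting the output layer B under the l2 loss is ordinary least squares, a
# convex problem with a closed-form solution.
import torch

n, d, m, d_out = 200, 20, 64, 10
torch.manual_seed(0)
x = torch.nn.functional.normalize(torch.randn(n, d), dim=1)   # unit-norm inputs
y = torch.randn(n, d_out)

A = torch.randn(m, d) / d ** 0.5      # frozen random input layer
W = torch.randn(m, m) / m ** 0.5      # one frozen random hidden layer (for brevity)
phi = torch.relu(W @ torch.relu(A @ x.T)).T                   # fixed features, shape (n, m)

B = torch.linalg.lstsq(phi, y).solution.T                     # convex: least squares in B
print(((phi @ B.T - y) ** 2).mean().item())
```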

slide-6
SLIDE 6

Key Message 2: poly(L)

[Figure: the same L-hidden-layer ReLU network.]

Main Theorem.
  • If the data are non-degenerate (e.g. unit norm and ‖x_i − x_j‖_2 ≥ δ)
  • If overparameterized: m ≥ poly(n, L, δ^{-1})
  • Then SGD finds a training global minimum in T = poly(n, L)/δ^2 · log(1/ε) iterations for ℓ2-regression

Our second key message is the following. We prove polynomial dependence on the depth L. In contrast,

  • the independent work [Du et al., ICML 2019] needs exponential time in L;
  • prior work [Daniely, NeurIPS 2017] for training the last layer also needs e^{O(L)}.
slide-7
SLIDE 7

Key Message 2: poly(L)

[Figure: the same L-hidden-layer ReLU network.]

Main Theorem.
  • If the data are non-degenerate (e.g. unit norm and ‖x_i − x_j‖_2 ≥ δ)
  • If overparameterized: m ≥ poly(n, L, δ^{-1})
  • Then SGD finds a training global minimum in T = poly(n, L)/δ^2 · log(1/ε) iterations for ℓ2-regression

Intrinsically, our polynomial bound is possible because ReLU prevents exponential gradient explosion/vanishing, in a provable sense!

(for a sufficiently large region near random initialization)

slide-8
SLIDE 8

Key Message 2: poly(L)

[Figure: the same L-hidden-layer ReLU network.]

Main Theorem.
  • If the data are non-degenerate (e.g. unit norm and ‖x_i − x_j‖_2 ≥ δ)
  • If overparameterized: m ≥ poly(n, L, δ^{-1})
  • Then SGD finds a training global minimum in T = poly(n, L)/δ^2 · log(1/ε) iterations for ℓ2-regression

Intrinsically, our polynomial bound is possible because ReLU prevents exponential gradient explosion/vanishing, in a provable sense!

(for a sufficiently large region near random initialization)

In contrast, getting e^{O(L)} is almost trivial: each hidden weight matrix W_ℓ has spectral norm 2, so the overall bound is 2^L. The hard part is proving poly(L).
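A quick numerical illustration of this point (our own sketch, not the paper's code, with an N(0, 2/m) initialization scale assumed): the naive chain-rule bound multiplies the spectral norms of W_1, …, W_L and grows exponentially in L, while the norm of an actual backpropagated gradient through the ReLU layers stays moderate near random initialization.

```python
# Illustration (our own sketch): compare the naive product-of-spectral-norms
# bound, which is exponential in L, with the norm of an actual gradient
# (vector-Jacobian product) through the ReLU layers at random initialization.
import torch

m, L = 512, 30
torch.manual_seed(0)
Ws = [torch.randn(m, m) * (2.0 / m) ** 0.5 for _ in range(L)]   # assumed init scale

naive_bound = 1.0
for W in Ws:
    naive_bound *= torch.linalg.matrix_norm(W, ord=2).item()    # spectral norm per layer

h = torch.randn(m, requires_grad=True)
out = h
for W in Ws:
    out = torch.relu(W @ out)
v = torch.randn(m)                                               # random backward direction
(g,) = torch.autograd.grad(out, h, grad_outputs=v)

print(f"naive bound  prod_l ||W_l||_2 = {naive_bound:.3e}")      # exponential in L
print(f"actual       ||v^T dout/dh||  = {g.norm().item():.3e}")  # stays moderate
print(f"reference    ||v||            = {v.norm().item():.3e}")
```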

slide-9
SLIDE 9

Key Message 3: almost-convex geometry

[Figure: the same L-hidden-layer ReLU network.]

The third key message is the following. We prove in the paper that, for a sufficiently large neighborhood of the random initialization, the training objective is almost convex.
slide-10
SLIDE 10

Key Message 3: almost-convex geometry

[Figure: the same L-hidden-layer ReLU network.]

This means: if the objective is large, then the gradient is large.

Main Lemma. If the loss is large, then the gradient is large: ‖∇F(W)‖_F^2 ≥ F(W) · δ/n^2

(after appropriate normalization)

slide-11
SLIDE 11

Key Message 3: almost-convex geometry

[Figure: the same L-hidden-layer ReLU network.]

Main Lemma. If the loss is large, then the gradient is large: ‖∇F(W)‖_F^2 ≥ F(W) · δ/n^2

Main Lemma. The objective is semi-smooth: F(W + W′) = F(W) + ⟨∇F(W), W′⟩ ± poly(n, L) · ‖W′‖_F

Also, the objective is sufficiently smooth, meaning that if you move in the negative gradient direction, the objective value can be sufficiently decreased.

slide-12
SLIDE 12

Key Message 3: almost-convex geometry

[Figure: the same L-hidden-layer ReLU network.]

We verified this is true also on real data: CIFAR10/100 with VGG19 / ResNet32 / ResNet110. Goodfellow et al. [ICLR 2015] also observed this phenomenon, but a proof was not known.

Main Lemma. If the loss is large, then the gradient is large: ‖∇F(W)‖_F^2 ≥ F(W) · δ/n^2

Main Lemma. The objective is semi-smooth: F(W + W′) = F(W) + ⟨∇F(W), W′⟩ ± poly(n, L) · ‖W′‖_F
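Here is a minimal sketch of how one could log this relationship during training; the model, loss, and data loader are placeholders rather than the paper's exact CIFAR/VGG/ResNet setup. The "almost-convex" claim predicts that the ratio ‖∇F(W)‖_F^2 / F(W) stays bounded away from zero while the loss is large.

```python
# Sketch (assumed setup, not the paper's experiment code): record the training
# loss F(W) and squared gradient norm ||grad F(W)||_F^2 during training, to
# check that their ratio stays bounded below while the loss is large.
import torch

def loss_and_grad_sq(model, loss_fn, x, y):
    """Return (F(W), ||grad F(W)||_F^2) for one batch, leaving grads populated."""
    model.zero_grad(set_to_none=True)
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_sq = sum(p.grad.pow(2).sum().item()
                  for p in model.parameters() if p.grad is not None)
    return loss.item(), grad_sq

# Usage inside any training loop (model, loader, optimizer are placeholders):
# for x, y in loader:
#     F_val, G2 = loss_and_grad_sq(model, torch.nn.functional.cross_entropy, x, y)
#     history.append((F_val, G2, G2 / max(F_val, 1e-12)))
#     optimizer.step()   # gradients were already computed by loss.backward()
```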

slide-13
SLIDE 13

Key Message 3: almost-convex geometry

[Figure: the same L-hidden-layer ReLU network; plots: CIFAR10/100 with VGG19 / ResNet32 / ResNet110.]

These two main lemmas together imply our main theorem.

Main Theorem. SGD finds global minima in polynomial time.

Main Lemma. If the loss is large, then the gradient is large: ‖∇F(W)‖_F^2 ≥ F(W) · δ/n^2

Main Lemma. The objective is semi-smooth: F(W + W′) = F(W) + ⟨∇F(W), W′⟩ ± poly(n, L) · ‖W′‖_F
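To see how the two lemmas combine, here is the standard one-step argument in rough form (a sketch with simplified constants, not the paper's exact derivation): take a gradient step W′ = −η∇F(W), apply semi-smoothness, then the gradient lower bound.

```latex
% One-step sketch with simplified constants (the paper tracks them precisely).
% Gradient step: W' = -\eta \nabla F(W).
\begin{aligned}
F\bigl(W - \eta \nabla F(W)\bigr)
  &\le F(W) - \eta\,\|\nabla F(W)\|_F^2 + (\text{semi-smoothness error terms})\\
  &\le F(W) - \tfrac{\eta}{2}\,\|\nabla F(W)\|_F^2
      && \text{for a suitably small step size } \eta,\\
  &\le \Bigl(1 - \tfrac{\eta\,\delta}{2\,n^2}\Bigr) F(W)
      && \text{by the gradient lower bound } \|\nabla F(W)\|_F^2 \ge F(W)\cdot\delta/n^2.
\end{aligned}
% Iterating: F(W_T) \le (1 - \eta\delta/(2n^2))^T F(W_0) \le \varepsilon after
% T = O\!\bigl((n^2/(\eta\delta))\,\log(F(W_0)/\varepsilon)\bigr) iterations,
% which matches the poly(n, L)/\delta^2 \cdot \log(1/\varepsilon) form of the
% Main Theorem once the step size and the hidden error terms are accounted for.
```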

slide-14
SLIDE 14

Equivalent View: neural tangent kernel

[Figure: the same L-hidden-layer ReLU network.]

If m ≥ poly(n, L), then for a sufficiently large neighborhood of the random initialization, neural networks behave like the Neural Tangent Kernel (NTK). In fact… we proved

Finally, let us take an alternative view. Going into the paper, we proved the following: if m, the number of neurons, is polynomially large, then for a sufficiently large neighborhood of the random initialization, neural networks behave nearly identically to the so-called neural tangent kernel, or NTK.

slide-15
SLIDE 15

Equivalent View: neural tangent kernel

[Figure: the same L-hidden-layer ReLU network.]

If m ≥ poly(n, L), then for a sufficiently large neighborhood of the random initialization, neural networks behave like the Neural Tangent Kernel (NTK). In fact… we proved

  • ∇F(W) = (1 ± 1/m) · (feature space of the NTK)
  • F(W*) = F_NTK(W*) ± 1/m^{1/6}

Specifically, this means two things: the gradient behaves like the NTK, and the objective behaves like the NTK.
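As an illustration of what "the gradient behaves like the NTK" means, here is a small sketch (ours, not the paper's code) that computes the empirical NTK Gram matrix K(x_i, x_j) = ⟨∇_W f(x_i), ∇_W f(x_j)⟩ over the trainable hidden weights; the claim is that for large width m this matrix is nearly the same at initialization and after training within the neighborhood.

```python
# Sketch (ours, toy setup): empirical NTK Gram matrix of a scalar-output
# network, K[i, j] = < grad_W f(x_i), grad_W f(x_j) >, with gradients taken
# over the trainable hidden weights only.
import torch

def param_grad(model, x):
    """Flattened gradient of the scalar output f(x) w.r.t. trainable weights."""
    out = model(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def empirical_ntk(model, xs):
    G = torch.stack([param_grad(model, x) for x in xs])   # (n, #trainable params)
    return G @ G.T                                        # (n, n) kernel matrix

# Usage with the DeepReLUNet sketch from earlier (set d_out = 1 for a scalar output):
# net = DeepReLUNet()
# xs = torch.nn.functional.normalize(torch.randn(8, d), dim=1)
# K0 = empirical_ntk(net, xs)      # at initialization
# ...a few SGD steps on the hidden layers...
# K1 = empirical_ntk(net, xs)      # nearly identical to K0 when the width m is large
```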

slide-16
SLIDE 16

Conclusion

We proved: if m ≥ poly(n, L), then within a certain initialization and learning-rate regime,

  Over-parameterized deep networks = Neural Tangent Kernel (NTK)
  ⟹ networks are essentially convex and smooth
  ⟹ training is EASY

In other words, we proved that within a certain parameter regime, over-parameterized deep neural networks behave nearly the same as the NTK. Therefore, the training task is essentially convex, so training is easy.

slide-17
SLIDE 17

Conclusion

We proved: if m ≥ poly(n, L), then within a certain initialization and learning-rate regime,

  Over-parameterized deep networks = Neural Tangent Kernel (NTK)
  ⟹ networks are essentially convex and smooth
  ⟹ training is EASY

Note this is not true for other learning-rate regimes, where neural networks can be provably more powerful than the NTK; see our follow-up work. Author note: for other regimes, neural networks are provably more powerful than the NTK; see [A-L, arXiv:1905.10337], "What Can ResNet Learn Efficiently, Going Beyond Kernels?"

slide-18
SLIDE 18

Conclusion

We proved: if m ≥ poly(n, L), then within a certain initialization and learning-rate regime,

  Over-parameterized deep networks = Neural Tangent Kernel (NTK)
  ⟹ networks are essentially convex and smooth
  ⟹ training is EASY

We emphasize again that prior work studying the relationship to the NTK requires either m = ∞ or m ≥ e^{Ω(L)}. Our result is polynomial in the depth L.
