A Convergence Theory for Deep Learning
via Over-Parameterization
Zeyuan Allen-Zhu
MSR AI
Yuanzhi Li
Stanford
Zhao Song
UT Austin / U of Washington / Harvard / Princeton
Main Result
[Figure: a fully-connected ReLU network with L hidden layers, m neurons per layer, hidden weight matrices W_ℓ ∈ ℝ^{m×m}, plus input and output layers.]
samples y_1, …, y_n ∈ ℝ^d
The main result is the following. Consider training the L hidden layers of a deep neural network, given n training data points that are non-degenerate, meaning their pairwise relative distance is at least δ. Suppose the network is sufficiently over-parameterized, meaning the number of neurons m per layer is at least polynomial in n, L, and 1/δ.
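As a quick illustration of the non-degeneracy assumption (a minimal sketch, not code from the paper; the random data is only for the demo), one can normalize the samples to unit norm and read off δ as the minimum pairwise distance:

import numpy as np

def min_pairwise_distance(Y):
    # Y: (n, d) array of unit-norm samples; returns min over i<j of ||y_i - y_j||_2
    n = Y.shape[0]
    return min(np.linalg.norm(Y[i] - Y[j]) for i in range(n) for j in range(i + 1, n))

rng = np.random.default_rng(0)
n, d = 20, 50
Y = rng.standard_normal((n, d))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # every sample normalized to norm 1
print("non-degeneracy parameter delta =", min_pairwise_distance(Y))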
Main Theorem
If the data is non-degenerate (e.g. norm 1 and ‖y_i − y_j‖_2 ≥ δ)
If over-parameterized: m ≥ poly(n, L, δ^{-1})
Then, SGD finds training global minima in T = poly(n, L)/δ² · log(1/ε) iterations for ℓ2-regression.
Then, we proved that stochastic gradient descent can find global minima in polynomial time by training only the hidden layers.
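To make this concrete, here is a minimal PyTorch sketch (an illustration only, not the paper's experiments; the width, depth, He-style initialization and step size are assumptions chosen for the demo). It runs full-batch gradient steps on the ℓ2-regression objective of an over-parameterized deep ReLU network and prints the training loss as it drops:

import torch

torch.manual_seed(0)
n, d, m, L = 20, 10, 500, 4                        # samples, input dim, width, hidden layers
Y = torch.nn.functional.normalize(torch.randn(n, d), dim=1)   # unit-norm inputs
targets = torch.randn(n, 1)                                    # regression targets

layers = [torch.nn.Linear(d, m), torch.nn.ReLU()]
for _ in range(L - 1):
    layers += [torch.nn.Linear(m, m), torch.nn.ReLU()]
layers += [torch.nn.Linear(m, 1)]
net = torch.nn.Sequential(*layers)
for module in net:                                 # He-style random initialization, N(0, 2/fan_in)
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.normal_(module.weight, std=(2.0 / module.in_features) ** 0.5)
        torch.nn.init.zeros_(module.bias)

opt = torch.optim.SGD(net.parameters(), lr=3e-3)   # full-batch gradient steps for simplicity
for t in range(3001):
    opt.zero_grad()
    loss = 0.5 * ((net(Y) - targets) ** 2).mean()  # l2-regression objective
    loss.backward()
    opt.step()
    if t % 500 == 0:
        print(t, loss.item())                      # training loss should keep decreasing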
Similar results also hold for other losses and other network architectures such as ResNet and CNN. These can be found in the paper.
Our first key message is the following. Our theorem is obtained by training the hidden layers, whereas prior work [Daniely, NeurIPS 2017] studies training essentially only the last layer, which is an easy convex problem.
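To see why last-layer-only training is the easy convex case, here is a toy sketch (an illustration only, with a single frozen random hidden layer, which is a much simpler setting than the theorem above): the frozen hidden layer acts as a fixed random ReLU feature map, and ℓ2-regression over those features is ordinary least squares:

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 10, 500
Y = rng.standard_normal((n, d))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)          # unit-norm samples
targets = rng.standard_normal(n)

W = rng.normal(0.0, np.sqrt(2.0 / d), size=(m, d))     # frozen random hidden layer
Phi = np.maximum(Y @ W.T, 0.0)                          # fixed ReLU features, shape (n, m)

# Training only the output layer = linear least squares on these fixed features.
b, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
print("residual of last-layer-only training:", np.linalg.norm(Phi @ b - targets))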
Our second key message is the following. We prove polynomial dependence on the depth L. In contrast, prior analyses and naive bounds are exponential in the depth L (e.g., requiring m ≥ e^{Ω(L)}).
Intrinsically, our polynomial bound is possible because ReLU prevents exponential gradient explosion/vanishing, in a provable sense!
(for a sufficiently large region near random initialization)
In contrast, getting a 2^L-type bound is almost trivial: each hidden weight matrix W_ℓ has spectral norm roughly 2, so the naive bound is 2^L overall. The hard part is proving poly(L).
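A quick numerical sketch of this point (an illustration only; the width, depth and N(0, 2/m) weight distribution are assumptions chosen for the demo): the product of spectral norms grows exponentially with the depth, while the actual ReLU forward signal keeps a roughly constant norm:

import numpy as np

rng = np.random.default_rng(0)
m, L = 500, 20
h = rng.standard_normal(m)
h /= np.linalg.norm(h)                                  # ||h_0|| = 1

naive_bound = 1.0
for l in range(1, L + 1):
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))  # random hidden weight matrix
    naive_bound *= np.linalg.norm(W, 2)                 # product of spectral norms: exponential in L
    h = np.maximum(W @ h, 0.0)                          # ReLU keeps ||h_l|| close to ||h_0||
    print(f"layer {l:2d}: ||h_l|| = {np.linalg.norm(h):.3f}, naive spectral bound = {naive_bound:.2e}")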
The third key message is the following. We prove in the paper that, for a sufficiently large neighborhood of the random initialization, the training objective satisfies the following two properties.
This means: if the objective is large, then the gradient is large.
Main Lemma: if the loss is large, then the gradient is large:
‖∇F(W)‖_F² ≥ F(W) · δ / n²
(after appropriate normalization)
Main Lemma: the objective is semi-smooth:
F(W + W′) = F(W) + ⟨∇F(W), W′⟩ ± poly(n, L) · ‖W′‖ · √F(W)
Also, the objective is sufficiently smooth, meaning that if you move in the negative gradient direction, the objective value can be sufficiently decreased.
We verified this is true also on real data. Goodfellow et al. [ICLR 2015] also observed this phenomenon, but a proof was not known.
CIFAR10/100 VGG19/ResNet32/ResNet110
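For readers who want to check this relation themselves, here is a small helper (an illustration only, not the paper's CIFAR experiments; the toy model and data at the bottom are placeholders): it reports the objective F(W) and the squared gradient norm together, so the two can be logged side by side along a real training run:

import torch

def loss_and_grad_norm_sq(net, X, targets):
    # Returns (F(W), ||grad F(W)||_F^2) for the l2-regression objective on (X, targets).
    net.zero_grad()
    loss = 0.5 * ((net(X) - targets) ** 2).sum()
    loss.backward()
    grad_sq = sum(float((p.grad ** 2).sum()) for p in net.parameters() if p.grad is not None)
    return loss.item(), grad_sq

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(10, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1))
X, targets = torch.randn(32, 10), torch.randn(32, 1)
print(loss_and_grad_norm_sq(net, X, targets))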
These two main lemmas together imply our main theorem.
If the loss is large then the gradient is large (Main Lemma) + the objective is semi-smooth (Main Lemma) ⟹ SGD finds global minima in polynomial time (Main Theorem).
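A back-of-the-envelope sketch of how the two lemmas combine (an illustrative summary; the constants, the exact poly(n, L) factors, and the step size η are simplifications of what the paper actually tracks):

% one gradient step W -> W - \eta \nabla F(W), plugged into semi-smoothness:
F\big(W - \eta \nabla F(W)\big) \le F(W) - \eta \|\nabla F(W)\|^2 + (\text{error, negligible for small } \eta)
% apply the gradient lower bound \|\nabla F(W)\|^2 \ge F(W)\cdot\delta/n^2:
\phantom{F\big(W - \eta \nabla F(W)\big)} \le \Big(1 - \eta\,\tfrac{\delta}{n^2}\Big)\, F(W)
% iterate T times:
F(W_T) \le \Big(1 - \eta\,\tfrac{\delta}{n^2}\Big)^{T} F(W_0) \le \varepsilon
\quad \text{once } T \gtrsim \tfrac{n^2}{\eta\,\delta}\,\log\tfrac{F(W_0)}{\varepsilon}
% with \eta = \Theta(\delta/\mathrm{poly}(n, L)) this is T = \mathrm{poly}(n, L)/\delta^2 \cdot \log(1/\varepsilon).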
If m ≥ poly(n, L), then for a sufficiently large neighborhood of the random initialization, neural networks behave like the Neural Tangent Kernel (NTK).
Finally, let us take an alternative view. If one goes into the paper, we proved the following. If m, the number of neurons, is polynomially large, then for a sufficiently large neighborhood of the random initialization, neural networks behave nearly identically to the so-called neural tangent kernel, or NTK.
In fact… we proved:
gradient of the network = (1 ± 1/poly(m)) × the corresponding gradient in the feature space of the NTK
objective of the network = the NTK objective ± 1/m^{1/6}
Specifically, this means two things. The gradient behaves like NTK, and the objective behaves like NTK.
We proved: if m ≥ poly(n, L), then within a certain initialization and learning-rate regime, over-parameterized deep networks = Neural Tangent Kernel (NTK) ⟹ networks essentially convex and smooth ⟹ training is EASY.
In other words, we proved that within a certain parameter regime, over-parameterized deep neural networks behave nearly the same as the NTK. Therefore, the training task is essentially convex, so training is easy.
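A rough numerical sketch of this fixed-kernel behavior (an illustration only, not from the paper; the width, learning rate and step count are assumptions chosen for the demo): for a very wide network, gradient descent drives the training loss down while the hidden weights barely move from their random initialization, which is the regime where the NTK description is accurate:

import copy
import torch

torch.manual_seed(0)
n, d, m = 20, 10, 4096
Y = torch.nn.functional.normalize(torch.randn(n, d), dim=1)
targets = torch.randn(n, 1)

net = torch.nn.Sequential(torch.nn.Linear(d, m), torch.nn.ReLU(), torch.nn.Linear(m, 1))
init_state = copy.deepcopy(net.state_dict())        # remember the random initialization

opt = torch.optim.SGD(net.parameters(), lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    loss = 0.5 * ((net(Y) - targets) ** 2).mean()
    loss.backward()
    opt.step()

move = torch.norm(net[0].weight - init_state["0.weight"]) / torch.norm(init_state["0.weight"])
print(f"final loss = {loss.item():.4f}, relative movement of hidden weights = {move.item():.4f}")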
Note this is not true for other learning-rate regimes, and neural networks can be provably more powerful than NTK; see our follow-up work. Author note: for other regimes, neural networks are provably more powerful than NTK. See [A-L, 1905.10337], "What Can ResNet Learn Efficiently, Going Beyond Kernels?"
We emphasize again that prior work studying the relationship to NTK either requires m = ∞ or m ≥ e^{Ω(L)}. Our result is polynomial in the depth L.