A Convergence Theory for Deep Learning
via Over-Parameterization
Zeyuan Allen-Zhu
MSR AI
Yuanzhi Li
Stanford
Zhao Song
UT Austin / U of Washington / Harvard / Princeton
Main Result
[Figure: a fully-connected ReLU network with L hidden layers, m neurons per layer, hidden weight matrices W_ℓ ∈ ℝ^{m×m}, plus input and output layers.]
samples y_1, …, y_n ∈ ℝ^d
The main result is the following. Consider training the L hidden layers of a deep neural network, given n training data points that are non-degenerate, meaning their pairwise relative distance is at least δ. Suppose the network is sufficiently over-parameterized, meaning the number of neurons m per layer is at least polynomial in n, L, and 1/δ.
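As a quick illustration of the non-degeneracy assumption (a minimal sketch, not code from the paper; the random data is only for the demo), one can normalize the samples to unit norm and read off δ as the minimum pairwise distance:

import numpy as np

def min_pairwise_distance(Y):
    # Y: (n, d) array of unit-norm samples; returns min over i<j of ||y_i - y_j||_2
    n = Y.shape[0]
    return min(np.linalg.norm(Y[i] - Y[j]) for i in range(n) for j in range(i + 1, n))

rng = np.random.default_rng(0)
n, d = 20, 50
Y = rng.standard_normal((n, d))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # every sample normalized to norm 1
print("non-degeneracy parameter delta =", min_pairwise_distance(Y))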
Main Theorem
If the data is non-degenerate (e.g. norm 1 and ‖y_i − y_j‖_2 ≥ δ)
If over-parameterized: m ≥ poly(n, L, δ^{-1})
Then, SGD finds training global minima in T = poly(n, L)/δ² · log(1/ε) iterations for ℓ2-regression.
Then, we proved that stochastic gradient descent can find global minima in polynomial time by training only the hidden layers.
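To make this concrete, here is a minimal PyTorch sketch (an illustration only, not the paper's experiments; the width, depth, He-style initialization and step size are assumptions chosen for the demo). It runs full-batch gradient steps on the ℓ2-regression objective of an over-parameterized deep ReLU network and prints the training loss as it drops:

import torch

torch.manual_seed(0)
n, d, m, L = 20, 10, 500, 4                        # samples, input dim, width, hidden layers
Y = torch.nn.functional.normalize(torch.randn(n, d), dim=1)   # unit-norm inputs
targets = torch.randn(n, 1)                                    # regression targets

layers = [torch.nn.Linear(d, m), torch.nn.ReLU()]
for _ in range(L - 1):
    layers += [torch.nn.Linear(m, m), torch.nn.ReLU()]
layers += [torch.nn.Linear(m, 1)]
net = torch.nn.Sequential(*layers)
for module in net:                                 # He-style random initialization, N(0, 2/fan_in)
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.normal_(module.weight, std=(2.0 / module.in_features) ** 0.5)
        torch.nn.init.zeros_(module.bias)

opt = torch.optim.SGD(net.parameters(), lr=3e-3)   # full-batch gradient steps for simplicity
for t in range(3001):
    opt.zero_grad()
    loss = 0.5 * ((net(Y) - targets) ** 2).mean()  # l2-regression objective
    loss.backward()
    opt.step()
    if t % 500 == 0:
        print(t, loss.item())                      # training loss should keep decreasing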
Similar results also hold for other losses and other network architectures such as ResNet and CNN. These can be found in the paper.
Our first key message is the following. Our theorem is obtained by training the hidden layers, whereas prior work [Daniely, NeurIPS 2017] studies training essentially only the last layer, which is an easy convex problem.
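To see why last-layer-only training is the easy convex case, here is a toy sketch (an illustration only, with a single frozen random hidden layer, which is a much simpler setting than the theorem above): the frozen hidden layer acts as a fixed random ReLU feature map, and ℓ2-regression over those features is ordinary least squares:

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 10, 500
Y = rng.standard_normal((n, d))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)          # unit-norm samples
targets = rng.standard_normal(n)

W = rng.normal(0.0, np.sqrt(2.0 / d), size=(m, d))     # frozen random hidden layer
Phi = np.maximum(Y @ W.T, 0.0)                          # fixed ReLU features, shape (n, m)

# Training only the output layer = linear least squares on these fixed features.
b, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
print("residual of last-layer-only training:", np.linalg.norm(Phi @ b - targets))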
Our second key message is the following. We prove polynomial dependence on the depth L. In contrast, prior analyses and naive bounds are exponential in the depth L (e.g., requiring m ≥ e^{Ω(L)}).
Intrinsically, our polynomial bound is possible because ReLU prevents exponential gradient explosion/vanishing, in a provable sense!
(for a sufficiently large region near random initialization)
In contrast, getting a 2^L-type bound is almost trivial: each hidden weight matrix W_ℓ has spectral norm roughly 2, so the naive bound is 2^L overall. The hard part is proving poly(L).
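A quick numerical sketch of this point (an illustration only; the width, depth and N(0, 2/m) weight distribution are assumptions chosen for the demo): the product of spectral norms grows exponentially with the depth, while the actual ReLU forward signal keeps a roughly constant norm:

import numpy as np

rng = np.random.default_rng(0)
m, L = 500, 20
h = rng.standard_normal(m)
h /= np.linalg.norm(h)                                  # ||h_0|| = 1

naive_bound = 1.0
for l in range(1, L + 1):
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))  # random hidden weight matrix
    naive_bound *= np.linalg.norm(W, 2)                 # product of spectral norms: exponential in L
    h = np.maximum(W @ h, 0.0)                          # ReLU keeps ||h_l|| close to ||h_0||
    print(f"layer {l:2d}: ||h_l|| = {np.linalg.norm(h):.3f}, naive spectral bound = {naive_bound:.2e}")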
The third key message is the following. We prove in the paper that, for a sufficiently large neighborhood of the random initialization, the training objective satisfies the following two properties.
This means: if the objective is large, then the gradient is large.
Main Lemma: if the loss is large, then the gradient is large:
‖∇F(W)‖_F² ≥ F(W) · δ / n²
(after appropriate normalization)
Main Lemma: the objective is semi-smooth:
F(W + W′) = F(W) + ⟨∇F(W), W′⟩ ± poly(n, L) · ‖W′‖ · √F(W)
Also, the objective is sufficiently smooth, meaning that if you move in the negative gradient direction, the objective value can be sufficiently decreased.
We verified this is true also on real data. Goodfellow et al. [ICLR 2015] also observed this phenomenon, but a proof was not known.
CIFAR10/100 VGG19/ResNet32/ResNet110
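For readers who want to check this relation themselves, here is a small helper (an illustration only, not the paper's CIFAR experiments; the toy model and data at the bottom are placeholders): it reports the objective F(W) and the squared gradient norm together, so the two can be logged side by side along a real training run:

import torch

def loss_and_grad_norm_sq(net, X, targets):
    # Returns (F(W), ||grad F(W)||_F^2) for the l2-regression objective on (X, targets).
    net.zero_grad()
    loss = 0.5 * ((net(X) - targets) ** 2).sum()
    loss.backward()
    grad_sq = sum(float((p.grad ** 2).sum()) for p in net.parameters() if p.grad is not None)
    return loss.item(), grad_sq

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(10, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1))
X, targets = torch.randn(32, 10), torch.randn(32, 1)
print(loss_and_grad_norm_sq(net, X, targets))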
These two main lemmas together imply our main theorem.
If the loss is large then the gradient is large (Main Lemma) + the objective is semi-smooth (Main Lemma) ⟹ SGD finds global minima in polynomial time (Main Theorem).
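A back-of-the-envelope sketch of how the two lemmas combine (an illustrative summary; the constants, the exact poly(n, L) factors, and the step size η are simplifications of what the paper actually tracks):

% one gradient step W -> W - \eta \nabla F(W), plugged into semi-smoothness:
F\big(W - \eta \nabla F(W)\big) \le F(W) - \eta \|\nabla F(W)\|^2 + (\text{error, negligible for small } \eta)
% apply the gradient lower bound \|\nabla F(W)\|^2 \ge F(W)\cdot\delta/n^2:
\phantom{F\big(W - \eta \nabla F(W)\big)} \le \Big(1 - \eta\,\tfrac{\delta}{n^2}\Big)\, F(W)
% iterate T times:
F(W_T) \le \Big(1 - \eta\,\tfrac{\delta}{n^2}\Big)^{T} F(W_0) \le \varepsilon
\quad \text{once } T \gtrsim \tfrac{n^2}{\eta\,\delta}\,\log\tfrac{F(W_0)}{\varepsilon}
% with \eta = \Theta(\delta/\mathrm{poly}(n, L)) this is T = \mathrm{poly}(n, L)/\delta^2 \cdot \log(1/\varepsilon).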
If m ≥ poly(n, L), then for a sufficiently large neighborhood of the random initialization, neural networks behave like the Neural Tangent Kernel (NTK).
Finally, let us take an alternative view. If one goes into the paper, we proved the following. If m, the number of neurons, is polynomially large, then for a sufficiently large neighborhood of the random initialization, neural networks behave nearly identically to the so-called neural tangent kernel, or NTK.
In fact… we proved:
gradient of the network = (1 ± 1/poly(m)) × the corresponding gradient in the feature space of the NTK
objective of the network = the NTK objective ± 1/m^{1/6}
Specifically, this means two things. The gradient behaves like NTK, and the objective behaves like NTK.
We proved: if m ≥ poly(n, L), then within a certain initialization and learning-rate regime, over-parameterized deep networks = Neural Tangent Kernel (NTK) ⟹ networks essentially convex and smooth ⟹ training is EASY.
In other words, we proved that within a certain parameter regime, over-parameterized deep neural networks behave nearly the same as the NTK. Therefore, the training task is essentially convex, so training is easy.
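A rough numerical sketch of this fixed-kernel behavior (an illustration only, not from the paper; the width, learning rate and step count are assumptions chosen for the demo): for a very wide network, gradient descent drives the training loss down while the hidden weights barely move from their random initialization, which is the regime where the NTK description is accurate:

import copy
import torch

torch.manual_seed(0)
n, d, m = 20, 10, 4096
Y = torch.nn.functional.normalize(torch.randn(n, d), dim=1)
targets = torch.randn(n, 1)

net = torch.nn.Sequential(torch.nn.Linear(d, m), torch.nn.ReLU(), torch.nn.Linear(m, 1))
init_state = copy.deepcopy(net.state_dict())        # remember the random initialization

opt = torch.optim.SGD(net.parameters(), lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    loss = 0.5 * ((net(Y) - targets) ** 2).mean()
    loss.backward()
    opt.step()

move = torch.norm(net[0].weight - init_state["0.weight"]) / torch.norm(init_state["0.weight"])
print(f"final loss = {loss.item():.4f}, relative movement of hidden weights = {move.item():.4f}")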
Note this is not true for other learning-rate regimes, and neural networks can be provably more powerful than NTK; see our follow-up work. Author note: for other regimes, neural networks are provably more powerful than NTK. See [A-L, 1905.10337], "What Can ResNet Learn Efficiently, Going Beyond Kernels?"
We emphasize again that prior work studying the relationship to NTK either requires m = ∞ or m ≥ e^{Ω(L)}. Our result is polynomial in the depth L.