Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks
Alexander Shevchenko, Marco Mondelli
Neural Network Training
From a theoretical perspective, training neural networks is difficult (NP-hardness, local/disconnected minima, ...), but in practice it works remarkably well!
Two key ingredients of success:
• Over-parameterization
• (Stochastic) gradient descent
Training Landscape is indeed NICE
• SGD minima are connected via a piecewise linear path with constant loss [Garipov et al., 2018; Draxler et al., 2018]
• Mode connectivity proved assuming properties of well-trained networks (dropout/noise stability) [Kuditipudi et al., 2019]
What do we show?
Theorem (Informal). As the neural network grows wider, the solutions obtained via SGD become increasingly more dropout stable, and the barriers between local optima disappear.
Mean-field view:
• Two layers [Mei et al., 2019]
• Multiple layers [Araujo et al., 2019]
Quantitative bounds:
• independent of the input dimension for two-layer networks, scale linearly with it for multiple layers
• change in loss scales with the network width as $1/\mathrm{width}$
• the number of training samples is only required to scale faster than $\log(\mathrm{width})$
Related Work
• Local minima are globally optimal for deep linear networks and for networks with more neurons than training samples
• Connected landscape if the number of neurons grows large (two-layer networks, energy gap exponential in the input dimension)
Strong assumptions on the model and poor scaling of the parameters 😟
Warm-up: Two-Layer Networks
Data: $\{(x_1, y_1), \ldots, (x_n, y_n)\} \overset{\mathrm{i.i.d.}}{\sim} \mathbb{P}(\mathbb{R}^d \times \mathbb{R})$
Model: $\hat{y}_N(x, \theta) = \frac{1}{N} \sum_{i=1}^{N} a_i \, \sigma(x; w_i)$
Goal: Minimize the loss $L_N(\theta) = \mathbb{E}\bigl\{\bigl(y - \frac{1}{N}\sum_{i=1}^{N} a_i \,\sigma(x; w_i)\bigr)^2\bigr\}$, where $\theta = (w, a)$
Online SGD: $\theta^{k+1} = \theta^k - \alpha N \, \nabla_{\theta^k}\bigl(y_k - \frac{1}{N}\sum_{i=1}^{N} a_i^k \,\sigma(x_k; w_i^k)\bigr)^2$
Assumptions:
• $y$ bounded, $\nabla_w \sigma(x, w)$ sub-Gaussian
• $\sigma$ bounded and differentiable, $\nabla \sigma$ bounded and Lipschitz
• initialization of $a_i$ with bounded support
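A minimal numpy sketch of this setup and of one online SGD step (the tanh activation, Gaussian inputs, and all dimensions below are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, alpha = 20, 1000, 0.01          # input dim, width, step size

# parameters theta = (w, a): first-layer weights and output weights
w = rng.normal(size=(N, d))
a = rng.uniform(-1.0, 1.0, size=N)    # bounded-support initialization

def sigma(x, w):
    """Neuron response sigma(x; w_i); tanh is an illustrative bounded,
    differentiable choice with bounded Lipschitz gradient."""
    return np.tanh(w @ x)

def y_hat(x):
    # mean-field scaling: (1/N) * sum_i a_i * sigma(x; w_i)
    return np.mean(a * sigma(x, w))

def sgd_step(x_k, y_k):
    """One online-SGD step on the squared loss, with the N-scaled rate alpha*N."""
    global w, a
    s = sigma(x_k, w)                  # shape (N,)
    err = y_k - y_hat(x_k)             # scalar residual
    # gradients of (y_k - y_hat)^2 w.r.t. a_i and w_i (each carries a 1/N factor)
    grad_a = -2.0 * err * s / N
    grad_w = -2.0 * err * (a * (1.0 - s**2))[:, None] * x_k[None, :] / N
    a -= alpha * N * grad_a
    w -= alpha * N * grad_w

# one step on a synthetic sample (illustrative data, not the paper's distribution)
x_k = rng.normal(size=d)
y_k = 1.0
sgd_step(x_k, y_k)
```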
Recap: Dropout Stability and Connectivity
$L_M(\theta) = \mathbb{E}\bigl\{\bigl(y - \frac{1}{M}\sum_{i=1}^{M} a_i \,\sigma(x; w_i)\bigr)^2\bigr\}$
$\theta$ is $\varepsilon_D$-dropout stable if $|L_N(\theta) - L_M(\theta)| \le \varepsilon_D$
$\theta$ and $\theta'$ are $\varepsilon_C$-connected if there exists a continuous path connecting them along which the loss does not increase by more than $\varepsilon_C$
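A short sketch of how the dropout-stability gap can be estimated empirically for a two-layer network: keep the first M of the N neurons, replace the 1/N normalization by 1/M, and compare the two empirical losses (the weights and data below are random stand-ins for a trained network and a held-out test set):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, M, n_test = 20, 1000, 500, 2000

# stand-ins for a trained network and held-out data
w = rng.normal(size=(N, d))
a = rng.uniform(-1.0, 1.0, size=N)
X = rng.normal(size=(n_test, d))
y = np.sin(X[:, 0])                      # illustrative targets

def loss_keep(K):
    """Empirical squared loss when keeping the first K neurons with 1/K scaling."""
    preds = (np.tanh(X @ w[:K].T) * a[:K]).mean(axis=1)
    return np.mean((y - preds) ** 2)

L_N, L_M = loss_keep(N), loss_keep(M)
eps_D = abs(L_N - L_M)                   # empirical dropout-stability gap
print(f"L_N={L_N:.4f}  L_M={L_M:.4f}  |L_N - L_M|={eps_D:.4f}")
```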
Main Results: Dropout Stability
• Change in loss scales as $\alpha(D + \log N) + \frac{\log M}{M}$
• Loss change vanishes as $\alpha \ll (D + \log N)^{-1}$ and $M \gg 1$
• $M$ does not need to scale with $N$ or $D$
Main Results: Connectivity
• Change in loss scales as $\frac{\log N}{N} + \alpha(D + \log N)$
• Can connect SGD solutions obtained from different training data (but the same data distribution) and different initializations
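One way to probe such barriers numerically is to evaluate the loss along a path between two solutions. The sketch below uses a plain linear interpolation as a generic probe; the paper's result is for a specific piecewise linear path, and the toy loss here is only a placeholder:

```python
import numpy as np

def path_barrier(theta_a, theta_b, loss_fn, n_points=21):
    """Maximum increase of loss_fn along a straight line from theta_a to
    theta_b, relative to the larger of the two endpoint losses."""
    ts = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - t) * theta_a + t * theta_b) for t in ts])
    return losses.max() - max(losses[0], losses[-1])

# illustrative usage with a toy quadratic "loss" on flattened parameters
rng = np.random.default_rng(2)
theta_a, theta_b = rng.normal(size=100), rng.normal(size=100)
toy_loss = lambda th: float(np.sum(th ** 2))
print("barrier along linear path:", path_barrier(theta_a, theta_b, toy_loss))
```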
Proof Idea
Discrete dynamics of SGD ≈ continuous dynamics of gradient flow
• $\theta^k$ is close to $N$ i.i.d. particles that evolve with gradient flow
• $L_N(\theta)$ and $L_M(\theta)$ concentrate to the same limit
• Dropout stability with $M = N/2$ ⇒ connectivity
Multilayer Case: Setup
Data: $\{(x_1, y_1), \ldots, (x_n, y_n)\} \overset{\mathrm{i.i.d.}}{\sim} \mathbb{P}(\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})$
Model: $\hat{y}_N(x, \theta) = \frac{1}{N} W^{L+1} \sigma_L\bigl(\cdots \bigl(\frac{1}{N} W^2 \sigma_1(W^1 x)\bigr) \cdots\bigr)$
Goal: Minimize the loss $L_N(\theta) = \mathbb{E}\bigl\{\|y - \hat{y}_N(x, \theta)\|_2^2\bigr\}$
Online SGD: $\theta^{k+1} = \theta^k + \alpha N^2 \,\nabla_{\theta^k}\hat{y}_N(x_k, \theta^k)^{\top}\bigl(y_k - \hat{y}_N(x_k, \theta^k)\bigr)$
Assumptions:
• $y$ bounded
• $\sigma_\ell$ bounded and differentiable, $\nabla \sigma_\ell$ bounded and Lipschitz
• initialization with bounded support
• $W^1$ and $W^{L+1}$ stay fixed during training (random features)
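A sketch of this mean-field parameterization with its 1/N factors (the widths, tanh activations, and Gaussian weights are placeholder assumptions; as stated above, W^1 and W^{L+1} would stay fixed during training):

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, d_y, N, L = 20, 5, 500, 3            # input/output dims, width, hidden layers

# W^1: (N, d_x), W^2..W^L: (N, N), W^{L+1}: (d_y, N)
W = [rng.normal(size=(N, d_x))] \
    + [rng.normal(size=(N, N)) for _ in range(L - 1)] \
    + [rng.normal(size=(d_y, N))]

def forward(x):
    """y_hat_N(x): every layer after the first carries a 1/N factor,
    matching the mean-field scaling of the model above."""
    h = np.tanh(W[0] @ x)                 # sigma_1(W^1 x)
    for ell in range(1, L):
        h = np.tanh(W[ell] @ h / N)       # sigma_{ell+1}((1/N) W^{ell+1} h)
    return W[L] @ h / N                   # (1/N) W^{L+1} sigma_L(...)

x = rng.normal(size=d_x)
print(forward(x).shape)                   # (d_y,)
```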
Multilayer Case: Dropout Stability and Connectivity
Dropout stability: the loss does not change much if we remove part of the neurons from each layer (and suitably rescale the remaining neurons).
$L_M(\theta) :=$ loss when we keep at most $M$ neurons per layer
$\theta$ is $\varepsilon_D$-dropout stable if $|L_N(\theta) - L_M(\theta)| \le \varepsilon_D$
$\theta$ and $\theta'$ are $\varepsilon_C$-connected if there exists a continuous path connecting them along which the loss does not increase by more than $\varepsilon_C$
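The per-layer dropout operation described above, sketched in numpy: keep M of the N neurons in every hidden layer and replace the 1/N factors by 1/M, which rescales the surviving neurons by N/M (the weights below are random placeholders rather than SGD solutions):

```python
import numpy as np

def forward_keep(W, x, keep):
    """Forward pass keeping only the neuron indices in `keep` at every hidden
    layer; the 1/N factors become 1/M, i.e. the surviving neurons are rescaled."""
    M = len(keep)
    h = np.tanh(W[0][keep] @ x)                    # keep rows of W^1
    for ell in range(1, len(W) - 1):
        h = np.tanh(W[ell][np.ix_(keep, keep)] @ h / M)
    return W[-1][:, keep] @ h / M                  # keep columns of W^{L+1}

# illustrative comparison between the full network and the half-width one
rng = np.random.default_rng(4)
d_x, d_y, N, L = 20, 5, 500, 3                     # placeholder dimensions
W = [rng.normal(size=(N, d_x))] \
    + [rng.normal(size=(N, N)) for _ in range(L - 1)] \
    + [rng.normal(size=(d_y, N))]
x, y = rng.normal(size=d_x), rng.normal(size=d_y)
full = forward_keep(W, x, np.arange(N))            # keeping all neurons recovers y_hat_N
half = forward_keep(W, x, np.arange(N // 2))       # keep the first N/2 neurons
print(np.sum((y - full) ** 2), np.sum((y - half) ** 2))
```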
Multilayer Case: Results
Proof Challenges
Discrete dynamics of SGD ≈ continuous dynamics of gradient flow, but:
• Ideal particles are no longer independent (weights in different layers are correlated)
• Need a bound on the norm of the weights during training
• Need to bound the maximum distance between SGD and ideal particles ([Araujo et al., 2019] bounds only the average distance)
Numerical Results
• CIFAR-10 dataset
• Pretrained VGG-16 features
• # layers = 3
• Keep half of the neurons
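A hedged sketch of this experimental pipeline, assuming the VGG-16 features have already been extracted (the random stand-in features, width 1024, cross-entropy loss, and training schedule below are placeholder assumptions, not the paper's exact protocol):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_feat, width, n_classes = 512, 1024, 10          # placeholder dimensions

# stand-ins for precomputed VGG-16 features of CIFAR-10 images and their labels
X_tr, y_tr = torch.randn(5000, d_feat), torch.randint(0, n_classes, (5000,))
X_te, y_te = torch.randn(1000, d_feat), torch.randint(0, n_classes, (1000,))

def make_net(h):
    # 3 hidden layers of width h on top of the (fixed) feature extractor
    return nn.Sequential(
        nn.Linear(d_feat, h), nn.ReLU(),
        nn.Linear(h, h), nn.ReLU(),
        nn.Linear(h, h), nn.ReLU(),
        nn.Linear(h, n_classes),
    )

net = make_net(width)
opt = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()                    # placeholder choice of loss
for step in range(200):                            # short illustrative training run
    idx = torch.randint(0, X_tr.shape[0], (128,))
    opt.zero_grad()
    loss_fn(net(X_tr[idx]), y_tr[idx]).backward()
    opt.step()

# keep the first half of the neurons in every hidden layer and rescale by N/M = 2
half = width // 2
small = make_net(half)
with torch.no_grad():
    big_lins = [m for m in net if isinstance(m, nn.Linear)]
    small_lins = [m for m in small if isinstance(m, nn.Linear)]
    for i, (big, sm) in enumerate(zip(big_lins, small_lins)):
        rows = slice(0, half) if i < 3 else slice(None)   # output layer keeps all rows
        cols = slice(None) if i == 0 else slice(0, half)  # input layer keeps all columns
        scale = 1.0 if i == 0 else 2.0                    # compensate for dropped inputs
        sm.weight.copy_(big.weight[rows, cols] * scale)
        sm.bias.copy_(big.bias[rows])
    L_full = loss_fn(net(X_te), y_te).item()
    L_half = loss_fn(small(X_te), y_te).item()
print(f"full-width test loss {L_full:.3f} | half-width (rescaled) test loss {L_half:.3f}")
```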
Conclusion
Thank You for Your Attention