

  1. Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks
  Alexander Shevchenko, Marco Mondelli

  2. Neural Network Training
  From a theoretical perspective, training neural networks is difficult (NP-hardness, local/disconnected minima, ...), but in practice it works remarkably well! Two key ingredients of success:
  • Over-parameterization
  • (Stochastic) gradient descent

  3. Training Landscape is indeed NICE
  • SGD minima are connected via a piecewise linear path with constant loss [Garipov et al., 2018; Draxler et al., 2018]
  • Mode connectivity proved assuming properties of well-trained networks (dropout/noise stability) [Kuditipudi et al., 2019]

  4. What do we show?
  Theorem (Informal). As the neural network grows wider, the solutions obtained via SGD become increasingly dropout stable and the barriers between local optima disappear.
  Mean-field view: two layers [Mei et al., 2019], multiple layers [Araujo et al., 2019]
  Quantitative bounds:
  • independent of the input dimension for two-layer networks, scale linearly for multiple layers
  • change in loss scales with the network width as 1/width
  • the number of training samples is only required to scale faster than log(width)

  5. Related Work
  • Local minima are globally optimal for deep linear networks and for networks with more neurons than training samples
  • Connected landscape if the number of neurons grows large (two-layer networks, energy gap exponential in the input dimension)
  Strong assumptions on the model and poor scaling of the parameters 😟

  6. Warm-up: Two-Layer Networks
  Data: {(x_1, y_1), ..., (x_n, y_n)} i.i.d. from ℙ on ℝ^d × ℝ
  Model: ŷ_N(x, θ) = (1/N) Σ_{i=1}^N a_i σ(x; w_i), with θ = (w, a)
  Goal: minimize the loss L_N(θ) = 𝔼{(y − (1/N) Σ_{i=1}^N a_i σ(x; w_i))²}
  Online SGD: θ^{k+1} = θ^k − α N ∇_{θ^k} (y_k − (1/N) Σ_{i=1}^N a_i^k σ(x_k; w_i^k))²
  Assumptions:
  • y bounded, ∇_w σ(x; w) sub-Gaussian
  • σ bounded and differentiable, ∇σ bounded and Lipschitz
  • initialization of a_i with bounded support
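The following is a minimal sketch (not the authors' code) of this two-layer mean-field model and one pass of online SGD, assuming a synthetic Gaussian data stream and the concrete choice σ(x; w) = tanh(⟨w, x⟩); all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, alpha = 10, 1000, 0.01            # input dimension, width, step size

# theta = (w, a): one weight vector and one output scalar per neuron.
w = rng.normal(size=(N, d))
a = rng.uniform(-1.0, 1.0, size=N)      # bounded-support initialization of a_i

def predict(x, w, a):
    # y_hat_N(x, theta) = (1/N) * sum_i a_i * sigma(x; w_i)
    return np.mean(a * np.tanh(w @ x))

def sgd_step(x, y, w, a, alpha):
    # One online-SGD step on the squared loss (y - y_hat_N)^2; with the 1/N
    # factor inside the model, the step size is scaled by N.
    s = np.tanh(w @ x)                              # sigma(x; w_i) for all i
    err = y - np.mean(a * s)                        # residual on this sample
    grad_a = -2.0 * err * s / N                     # d loss / d a_i
    grad_w = -2.0 * err * (a * (1 - s**2))[:, None] * x[None, :] / N
    return w - alpha * N * grad_w, a - alpha * N * grad_a

for k in range(200):                                # one pass over a data stream
    x = rng.normal(size=d)
    y = np.tanh(x[0])                               # toy target
    w, a = sgd_step(x, y, w, a, alpha)
print((y - predict(x, w, a)) ** 2)                  # loss on the last sample
```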

  7. Recap: Dropout Stability
  L_M(θ) = 𝔼{(y − (1/M) Σ_{i=1}^M a_i σ(x; w_i))²}
  θ is ε_D-dropout stable if |L_N(θ) − L_M(θ)| ≤ ε_D
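To make the definition concrete in the two-layer setting above, here is a minimal sketch (same assumptions and variables as the previous block) that estimates L_N and L_M on a held-out sample and reports the dropout-stability gap; M, X, Y are placeholders.

```python
import numpy as np

def loss(w, a, X, Y):
    # Empirical estimate of E[(y - (1/width) * sum_i a_i sigma(x; w_i))^2],
    # where the width is the number of neurons actually passed in.
    preds = np.tanh(X @ w.T) @ a / a.shape[0]
    return np.mean((Y - preds) ** 2)

def dropout_gap(w, a, X, Y, M):
    # |L_N(theta) - L_M(theta)|, keeping only the first M neurons.
    return abs(loss(w, a, X, Y) - loss(w[:M], a[:M], X, Y))

# Example (with w, a from the previous sketch and a held-out sample X, Y):
# gap = dropout_gap(w, a, X, Y, M=w.shape[0] // 2)
```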

  8. Recap: Dropout Stability and Connectivity
  L_M(θ) = 𝔼{(y − (1/M) Σ_{i=1}^M a_i σ(x; w_i))²}
  θ is ε_D-dropout stable if |L_N(θ) − L_M(θ)| ≤ ε_D
  θ and θ′ are ε_C-connected if there exists a continuous path connecting them along which the loss does not increase by more than ε_C
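A minimal sketch of how ε_C could be checked numerically along a given path: here the path is a single linear segment between θ and θ′, while the paths constructed in the paper are piecewise linear, so the same evaluation would be applied segment by segment. It assumes the loss() helper from the previous block.

```python
import numpy as np

def max_increase_on_segment(theta, theta_prime, loss_fn, n_points=21):
    # theta, theta_prime: dicts with keys 'w' and 'a'; loss_fn(w, a) -> scalar.
    base = max(loss_fn(theta["w"], theta["a"]),
               loss_fn(theta_prime["w"], theta_prime["a"]))
    worst = base
    for t in np.linspace(0.0, 1.0, n_points):
        w_t = (1 - t) * theta["w"] + t * theta_prime["w"]
        a_t = (1 - t) * theta["a"] + t * theta_prime["a"]
        worst = max(worst, loss_fn(w_t, a_t))
    # epsilon_C: how far the loss rises above its value at the endpoints.
    return worst - base

# Example: max_increase_on_segment(theta, theta_prime,
#                                  lambda w, a: loss(w, a, X, Y))
```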

  9. Main Results: Dropout Stability

  10. Main Results: Dropout Stability
  Change in loss scales as (log M)/M + α (D + log N)

  11. Main Results: Dropout Stability
  • Loss change vanishes as α ≪ 1/(D + log N) and M ≫ 1
  • M does not need to scale with N or D

  12. Main Results: Connectivity

  13. Main Results: Connectivity
  • Change in loss scales as (log N)/N + α (D + log N)

  14. Main Results: Connectivity
  • Change in loss scales as (log N)/N + α (D + log N)
  • Can connect SGD solutions obtained from different training data (but the same data distribution) and different initializations

  15. Proof Idea
  Couple the discrete dynamics of SGD with the continuous dynamics of gradient flow:
  • θ^k is close to N i.i.d. particles that evolve with gradient flow
  • L_N(θ) and L_M(θ) concentrate to the same limit
  • Dropout stability with M = N/2 ⇒ connectivity
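A small numerical illustration (not part of the proof) of the first bullet: run online SGD and an Euler discretization of the gradient flow, with the population gradient approximated by a large batch, from the same initialization on the same synthetic task as before, and track how far individual neurons drift apart. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, alpha, steps, big_batch = 10, 500, 0.01, 200, 2048

w_sgd = rng.normal(size=(N, d)); a_sgd = rng.uniform(-1, 1, size=N)
w_gf, a_gf = w_sgd.copy(), a_sgd.copy()             # same initialization

def grads(X, Y, w, a):
    # Batch gradient of the squared loss for the two-layer mean-field model.
    S = np.tanh(X @ w.T)                            # (batch, N)
    err = Y - S @ a / N                             # residuals, (batch,)
    B = len(Y)
    ga = -2.0 / (B * N) * (S.T @ err)               # (N,)
    C = err[:, None] * (1.0 - S**2) * a[None, :]    # (batch, N)
    gw = -2.0 / (B * N) * (C.T @ X)                 # (N, d)
    return gw, ga

def sample(batch):
    X = rng.normal(size=(batch, d))
    return X, np.tanh(X[:, 0])                      # toy target

for _ in range(steps):
    gw, ga = grads(*sample(1), w_sgd, a_sgd)        # SGD: one fresh sample
    w_sgd -= alpha * N * gw; a_sgd -= alpha * N * ga
    gw, ga = grads(*sample(big_batch), w_gf, a_gf)  # ~ population gradient
    w_gf -= alpha * N * gw; a_gf -= alpha * N * ga

# Maximum per-neuron distance between the two coupled trajectories.
print(np.max(np.linalg.norm(w_sgd - w_gf, axis=1) + np.abs(a_sgd - a_gf)))
```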

  16. Multilayer Case: Setup
  Data: {(x_1, y_1), ..., (x_n, y_n)} i.i.d. from ℙ on ℝ^{d_x} × ℝ^{d_y}
  Model: ŷ_N(x, θ) = (1/N) W_{L+1} σ_L( ⋯ ((1/N) W_2 σ_1(W_1 x)) ⋯ )
  Goal: minimize the loss L_N(θ) = 𝔼{ ‖y − ŷ_N(x, θ)‖₂² }
  Online SGD: θ^{k+1} = θ^k − α N ∇_{θ^k} ‖y_k − ŷ_N(x_k, θ^k)‖₂²
  Assumptions:
  • y bounded
  • σ_ℓ bounded and differentiable, ∇σ_ℓ bounded and Lipschitz
  • initialization with bounded support
  • W_1 and W_{L+1} stay fixed (random features)
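A minimal sketch (not the authors' code) of this multilayer mean-field model with illustrative choices: tanh for every σ_ℓ, L = 3, and random Gaussian weights; during training W_1 and W_{L+1} would be kept fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, N, L = 10, 1, 500, 3

W = [rng.normal(size=(N, d_x))]                         # W_1 (kept fixed)
W += [rng.normal(size=(N, N)) for _ in range(L - 1)]    # W_2, ..., W_L (trained)
W += [rng.normal(size=(d_y, N))]                        # W_{L+1} (kept fixed)

def forward(W, x):
    # z_1 = sigma_1(W_1 x), z_l = sigma_l((1/N) W_l z_{l-1}) for l = 2, ..., L,
    # y_hat_N(x, theta) = (1/N) W_{L+1} z_L
    z = np.tanh(W[0] @ x)
    for W_l in W[1:-1]:
        z = np.tanh(W_l @ z / N)
    return W[-1] @ z / N

y_hat = forward(W, rng.normal(size=d_x))                # shape (d_y,)
```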

  17. Multilayer Case: Dropout Stability
  Dropout stability: the loss does not change much if we remove part of the neurons from each layer (and suitably rescale the remaining neurons).
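A minimal sketch of this remove-and-rescale operation on the multilayer model above: keep the first M neurons in every hidden layer and multiply the kept blocks of W_2, ..., W_{L+1} by N/M, so that, combined with the fixed 1/N factors in forward(), the sub-network averages over the M kept neurons (M is a placeholder).

```python
def dropout_subnetwork(W, M):
    # Keep the first M neurons per hidden layer; rescale so that, together with
    # the 1/N factors in forward(), the sub-network uses 1/M averages instead.
    N_full = W[0].shape[0]
    scale = N_full / M
    W_sub = [W[0][:M]]                          # W_1: keep M rows, no rescaling
    for W_l in W[1:-1]:
        W_sub.append(scale * W_l[:M, :M])       # hidden layers: keep M x M block
    W_sub.append(scale * W[-1][:, :M])          # W_{L+1}: keep M columns
    return W_sub

# Example: dropped-out prediction with half of the neurons kept.
# forward(dropout_subnetwork(W, N // 2), x)
```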

  18. Multilayer Case: Dropout Stability
  L_M(θ) := loss when we keep at most M neurons per layer
  θ is ε_D-dropout stable if |L_N(θ) − L_M(θ)| ≤ ε_D

  19. Multilayer Case: Dropout Stability and Connectivity
  L_M(θ) := loss when we keep at most M neurons per layer
  θ is ε_D-dropout stable if |L_N(θ) − L_M(θ)| ≤ ε_D
  θ and θ′ are ε_C-connected if there exists a continuous path connecting them along which the loss does not increase by more than ε_C

  20. Multilayer Case: Results

  21. Multilayer Case: Results

  22. Proof Challenges
  Coupling the discrete dynamics of SGD with the continuous dynamics of gradient flow is harder in the multilayer case:
  • Ideal particles are no longer independent (weights in different layers are correlated)
  • Need a bound on the norm of the weights during training
  • Need a bound on the maximum distance between SGD and ideal particles ([Araujo et al., 2019] bounds the average distance)

  23. Numerical Results
  • CIFAR-10 dataset
  • Pretrained VGG-16 features
  • # layers = 3
  • Keep half of the neurons
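A hedged sketch (not the authors' code) of how this setting could be reproduced: CIFAR-10 images passed through the frozen convolutional part of a pretrained VGG-16, a 3-layer fully connected head on top, and a "keep half of the neurons" sub-network built by truncating each hidden layer and rescaling the following layer by 2; the width N and all other details are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

# Frozen VGG-16 convolutional features; on 32x32 CIFAR-10 inputs the output is
# a 512-dimensional vector per image after flattening.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

N = 1024                                  # width of the trained head (assumed)
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512, N), nn.ReLU(),
    nn.Linear(N, N), nn.ReLU(),
    nn.Linear(N, 10),
)

def keep_half(head, N):
    # Keep the first N//2 neurons of each hidden layer; multiply the weights of
    # the following layer by 2 (dropout-style rescaling of the kept neurons).
    M = N // 2
    fc1, fc2, fc3 = head[1], head[3], head[5]
    sub = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512, M), nn.ReLU(),
        nn.Linear(M, M), nn.ReLU(),
        nn.Linear(M, 10),
    )
    with torch.no_grad():
        sub[1].weight.copy_(fc1.weight[:M]);         sub[1].bias.copy_(fc1.bias[:M])
        sub[3].weight.copy_(2 * fc2.weight[:M, :M]); sub[3].bias.copy_(fc2.bias[:M])
        sub[5].weight.copy_(2 * fc3.weight[:, :M]);  sub[5].bias.copy_(fc3.bias)
    return sub

# Example: logits = keep_half(head, N)(vgg(images)) for a CIFAR-10 batch `images`.
```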

  24. Numerical Results (same setting as slide 23)

  25. Numerical Results (same setting as slide 23)

  26. Numerical Results

  27. Conclusion

  28. Thank You for Your Attention
