Step Size Matters in Deep Learning
Kamil Nar, Shankar Sastry
Neural Information Processing Systems, December 4, 2018

Gradient Descent: Effect of Step Size

Example:   min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)²

[Plot of f(x) versus x, showing the two global minima x*_1 = 1 and x*_2 = 2]

From random initialization, gradient descent with step size δ
• converges to x*_1 only if δ ≤ 0.5
• converges to x*_2 only if δ ≤ 0.2

If the algorithm converges with δ = 0.3, the solution is x*_1.

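A minimal numerical sketch of this example (not part of the slides): perturb each minimum slightly and run gradient descent with two step sizes. The helper names (run_gd, grad_f), the starting offsets, and the iteration count are illustrative choices. With δ = 0.15 both minima are locally stable, while with δ = 0.3 the iterate cannot settle at x*_2 = 2.

```python
import numpy as np

def grad_f(x):
    # Product-rule derivative of f(x) = (x^2 + 1)(x - 1)^2 (x - 2)^2,
    # which has global minima at x = 1 and x = 2.
    return (2*x * (x - 1)**2 * (x - 2)**2
            + (x**2 + 1) * 2*(x - 1) * (x - 2)**2
            + (x**2 + 1) * (x - 1)**2 * 2*(x - 2))

def run_gd(x0, delta, n_iters=5000):
    x = x0
    for _ in range(n_iters):
        x = x - delta * grad_f(x)
        if not np.isfinite(x) or abs(x) > 1e3:
            return None  # iterates blew up
    return x

# Perturb each minimum slightly and check whether gradient descent stays there.
for delta in (0.15, 0.3):
    for x_star in (1.0, 2.0):
        x_final = run_gd(x_star + 1e-3, delta)
        print(f"delta = {delta}, start near x* = {x_star}: final x = {x_final}")
```
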
Deep Linear Networks

x ↦ W_L W_{L−1} ⋯ W_2 W_1 x

• The cost function has infinitely many local minima
• Different dynamic characteristics at different optima

Lyapunov Stability of Gradient Descent: Deep Linear Networks

Proposition
• λ ∈ ℝ and λ ≠ 0
• λ is estimated as the product of scalar parameters {w_i}:

    min_{w_i}  ½ (w_L ⋯ w_2 w_1 − λ)²

For convergence to {w*_i} with w*_L ⋯ w*_2 w*_1 = λ, the step size must satisfy

    δ ≤ 2 / Σ_{i=1}^L (λ / w*_i)²

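A small sketch of this proposition (not from the slides) for L = 3 and λ = 8: compare a balanced factorization (2, 2, 2) with a disproportionate one (8, 1, 1), compute the predicted step-size limit 2 / Σ (λ/w*_i)² for each, and check whether gradient descent started right next to the equilibrium returns to it. The helper final_point, the tolerances, and the specific numbers are assumptions made for illustration.

```python
import numpy as np

def final_point(w_star, lam, delta, n_iters=50000, eps=1e-3):
    """Gradient descent on 0.5 * (w_L ... w_2 w_1 - lam)^2, started slightly off w_star."""
    w = np.array(w_star, dtype=float) + eps
    for _ in range(n_iters):
        prod = np.prod(w)
        grad = (prod - lam) * prod / w   # d/dw_i = (prod - lam) * prod_{j != i} w_j
        w = w - delta * grad
        if not np.all(np.isfinite(w)) or np.max(np.abs(w)) > 1e6:
            return None                  # iterates blew up
    return w

lam = 8.0
for w_star in [(2.0, 2.0, 2.0), (8.0, 1.0, 1.0)]:
    bound = 2.0 / np.sum((lam / np.array(w_star)) ** 2)
    print(f"w* = {w_star}: predicted step-size limit ~ {bound:.4f}")
    for delta in (0.9 * bound, 1.1 * bound):
        w_end = final_point(w_star, lam, delta)
        stayed = w_end is not None and np.allclose(w_end, np.array(w_star), atol=1e-2)
        print(f"  delta = {delta:.4f}: returned to w*? {stayed}")
```

The test asks whether the iterates come back to this particular factorization rather than whether the loss reaches zero, since above the limit gradient descent may still drift to a different, more balanced factorization of the same λ.
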
Lyapunov Stability of Gradient Descent: Deep Linear Networks

• δ needs to be very small for equilibria with disproportionate {w*_i}
• For each δ, the algorithm can converge only to a subset of the optima
• No finite Lipschitz constant for the gradient on the whole parameter space

Deep Linear Networks

Theorem
• {x_i}_{i ∈ [N]} satisfies (1/N) Σ_{i=1}^N x_i x_iᵀ = I
• R is estimated as the product of {W_j} by

    min_{W_j}  (1/2N) Σ_{i=1}^N ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

Assume the gradient descent algorithm with random initialization has converged to R̂. Then,

    ρ(R̂) ≤ (2 / (Lδ))^{L/(2L−2)}   almost surely.

• The step size bounds the Lipschitz constant of the estimated function
• In contrast to ordinary least squares

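A hypothetical numerical check of this bound, not taken from the slides: when (1/N) Σ x_i x_iᵀ = I, the loss above equals ½ ‖R − W_L ⋯ W_1‖_F², which the sketch below minimizes directly by gradient descent with near-identity initialization. The dimensions, targets, helper names (mat_prod, train_deep_linear), and tolerances are all choices made here. With L = 3 and δ = 0.1 the theorem caps ρ(R̂) at roughly 4.15, so a target with ρ(R) = 3 can be reached while one with ρ(R) = 6 cannot.

```python
import numpy as np

def mat_prod(mats, n):
    """Return mats[k-1] @ ... @ mats[0]; the identity if the list is empty."""
    P = np.eye(n)
    for M in mats:
        P = M @ P
    return P

def train_deep_linear(R, L, delta, n_iters=20000, init_scale=1e-3, seed=0):
    """Gradient descent on 0.5 * ||R - W_L ... W_1||_F^2 with near-identity init."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    Ws = [np.eye(n) + init_scale * rng.standard_normal((n, n)) for _ in range(L)]
    for _ in range(n_iters):
        E = mat_prod(Ws, n) - R          # residual W_L ... W_1 - R
        if not np.all(np.isfinite(E)) or np.linalg.norm(E) > 1e6:
            return None                  # iterates blew up
        # gradient w.r.t. W_j is (W_L ... W_{j+1})^T E (W_{j-1} ... W_1)^T;
        # the comprehension uses the old Ws for every factor (simultaneous update)
        Ws = [W - delta * mat_prod(Ws[j+1:], n).T @ E @ mat_prod(Ws[:j], n).T
              for j, W in enumerate(Ws)]
    return mat_prod(Ws, n)

L, delta = 3, 0.1
bound = (2.0 / (L * delta)) ** (L / (2 * L - 2))
print(f"L = {L}, delta = {delta}: the theorem bounds rho(R_hat) by {bound:.2f}")

for r in (3.0, 6.0):                     # spectral radius below / above the bound
    R = r * np.eye(2)
    R_hat = train_deep_linear(R, L, delta)
    if R_hat is None or np.linalg.norm(R_hat - R) > 1e-3:
        print(f"rho(R) = {r}: gradient descent did not converge to R")
    else:
        rho = np.max(np.abs(np.linalg.eigvals(R_hat)))
        print(f"rho(R) = {r}: converged, rho(R_hat) = {rho:.2f}")
```
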
Deep Linear Networks

Symmetric positive semidefinite matrices:
• The bound is tight with identity initialization
• Identity initialization allows convergence with the largest step size

Nonlinear Networks (Poster #8)

Two-layer ReLU network: x ↦ W(Vx − b)₊

Theorem
Let f : ℝⁿ → ℝᵐ be estimated by

    min_{W,V}  ½ Σ_{i=1}^N ‖W(Vx_i − b)₊ − f(x_i)‖₂²

If the algorithm converges, then the estimate f̂ satisfies

    max_{i ∈ [N]} ‖x_i‖ ‖f̂(x_i)‖ ≤ 1/δ   almost surely.

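A rough sketch of the quantity this theorem constrains (the dataset, network width, step sizes, and the helper train_relu_net are all assumptions, not from the slides): fit a scalar target with full-batch gradient descent and compare max_i ‖x_i‖ ‖f̂(x_i)‖ with 1/δ. The bound only applies to runs that actually converge; with the larger step size below, 1/δ = 2 is smaller than max_i ‖x_i‖ ‖f(x_i)‖ = 4, so the iterates are not expected to settle on a good fit.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def train_relu_net(x, y, delta, hidden=8, n_iters=50000, seed=0):
    """Full-batch gradient descent on 0.5 * sum_i (W(Vx_i - b)_+ - y_i)^2
    for scalar inputs and outputs.  Returns (W, V, b), or None if iterates blow up."""
    rng = np.random.default_rng(seed)
    V = 0.5 * rng.standard_normal(hidden)
    b = 0.5 * rng.standard_normal(hidden)
    W = 0.5 * rng.standard_normal(hidden)
    for _ in range(n_iters):
        pre = np.outer(x, V) - b               # pre-activations, shape (N, hidden)
        h = relu(pre)
        err = h @ W - y                        # residuals, shape (N,)
        if not np.all(np.isfinite(err)) or np.max(np.abs(err)) > 1e6:
            return None
        mask = (pre > 0).astype(float)         # ReLU (sub)gradient
        gW = h.T @ err
        gV = ((err[:, None] * mask) * W).T @ x
        gb = -((err[:, None] * mask) * W).sum(axis=0)
        W, V, b = W - delta * gW, V - delta * gV, b - delta * gb
    return W, V, b

x = np.array([0.5, 1.0, 1.5, 2.0])
y = x.copy()                                   # target f(x) = x, so max_i |x_i||f(x_i)| = 4
for delta in (0.02, 0.5):                      # 1/delta = 50 and 2
    params = train_relu_net(x, y, delta)
    if params is None:
        print(f"delta = {delta}: iterates blew up, so the theorem does not apply")
        continue
    W, V, b = params
    f_hat = relu(np.outer(x, V) - b) @ W
    loss = 0.5 * np.sum((f_hat - y) ** 2)
    quantity = np.max(np.abs(x) * np.abs(f_hat))
    print(f"delta = {delta}: loss = {loss:.2e}, "
          f"max_i |x_i||f_hat(x_i)| = {quantity:.2f}, 1/delta = {1/delta:.0f}")
```
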