DL Theory: Expressiveness, Optimization & Generalization

[Figure: hypothesis-space diagram. Within the space of all functions lie the ground truth f*_D, the hypothesis space H with its optimal hypothesis h*_D, the empirically optimal hypothesis h*_S, and the returned hypothesis h̄. Approximation error corresponds to expressiveness, estimation error to generalization, training error to optimization. Annotated: "Not well understood".]

Optimization
- Empirical loss minimization is a non-convex program: h*_S is not unique; many hypotheses have low training error
- Gradient descent (GD) somehow reaches one of these

Expressiveness & Generalization
Vast difference from classical ML:
- Some low-training-error hypotheses generalize well, others don't
- With typical data, the solution returned by GD often generalizes well
- Expanding H reduces approximation error, but also estimation error!

Outline
1. Deep Learning Theory: Expressiveness, Optimization and Generalization
2. Analyzing Optimization via Trajectories
3. Trajectories of Gradient Descent for Deep Linear Neural Networks
   - Convergence to Global Optimum
   - Acceleration by Depth
4. Conclusion

Section: Analyzing Optimization via Trajectories

Optimization

[Figure: hypothesis-space diagram, highlighting the training-error (optimization) part.]

- f*_D: ground truth
- h*_D: optimal hypothesis
- h*_S: empirically optimal hypothesis
- h̄: returned hypothesis

Approach: Convergence via Critical Points

A prominent approach for analyzing optimization in DL is via critical points (∇ = 0) in the loss landscape.

[Figure: four types of critical points: good local minimum (≈ global minimum), poor local minimum, strict saddle, non-strict saddle; annotations (1) and (2) refer to the conditions of the result below.]

Result (cf. Ge et al. 2015; Lee et al. 2016)
If (1) there are no poor local minima, and (2) all saddle points are strict, then gradient descent (GD) converges to global minimum.

Motivated by this, many works studied the validity of (1) and/or (2), e.g. Haeffele & Vidal 2015; Kawaguchi 2016; Soudry & Carmon 2016; Safran & Shamir 2018.

Limitations

- Convergence of GD to global minimum was proven via critical points only for problems involving shallow (2-layer) models.
- The approach is insufficient when treating deep (≥ 3-layer) models:
  - (2) is violated: non-strict saddles exist, e.g. when all weights equal 0
  - Algorithmic aspects essential for convergence with deep models, e.g. proper initialization, are ignored

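To see why depth ≥ 3 breaks the strict-saddle condition at the all-zero point, here is a small numerical check. The scalar toy objective and the finite-difference Hessian are my own illustration, not from the talk: with N factors and loss ½(w_N···w_1 − 1)², the Hessian at the origin has a negative eigenvalue for N = 2 (strict saddle) but vanishes entirely for N = 3 (non-strict saddle).

```python
import numpy as np

def loss(w):
    """Scalar toy objective: 0.5 * (w_N * ... * w_1 - 1)^2."""
    return 0.5 * (np.prod(w) - 1.0) ** 2

def hessian(f, w, eps=1e-3):
    """Central finite-difference Hessian of f at w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return H

for N in (2, 3):
    eigs = np.linalg.eigvalsh(hessian(loss, np.zeros(N)))
    print(N, np.round(eigs, 4))   # N=2: [-1, 1] (strict saddle); N=3: all ~0 (non-strict)
```
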
Optimizer Trajectories Matter

Different optimization trajectories may lead to qualitatively different results
⇒ details of the algorithm and initialization should be taken into account!

Existing Trajectory Analyses

The trajectory approach has led to successful analyses of shallow models:
- Brutzkus & Globerson 2017; Li & Yuan 2017; Zhong et al. 2017; Tian 2017; Brutzkus et al. 2018; Li et al. 2018; Du et al. 2018; Oymak & Soltanolkotabi 2018

It has also allowed treating prohibitively large deep models:
- Du et al. 2018; Allen-Zhu et al. 2018; Zou et al. 2018

For deep linear residual networks, trajectories were used to show efficient convergence of GD to global minimum (Bartlett et al. 2018).

Section: Trajectories of Gradient Descent for Deep Linear Neural Networks

Sources

- "On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization", Arora, Cohen, Hazan (alphabetical order), International Conference on Machine Learning (ICML) 2018
- "A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks", Arora, Cohen, Golowich, Hu (alphabetical order), to appear: International Conference on Learning Representations (ICLR) 2019

Collaborators: Sanjeev Arora, Elad Hazan, Noah Golowich, Wei Hu

Subsection: Convergence to Global Optimum

Linear Neural Networks

Linear neural networks (LNN) are fully-connected neural networks with linear (i.e. no) activation:

    x → W_1 → W_2 → ... → W_N,    y = W_N ··· W_2 W_1 x

As a surrogate for optimization in DL, GD over LNN (a highly non-convex problem) is studied extensively, e.g. Saxe et al. 2014; Kawaguchi 2016; Hardt & Ma 2017; Laurent & Brecht 2018.

Existing Result (Bartlett et al. 2018)
With linear residual networks (a special case: the W_j are square and initialized to I_d), for ℓ2 loss on certain data, GD efficiently converges to global minimum.

This is the only existing proof of efficient convergence to global minimum for GD training a deep model.

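To fix notation, here is a minimal NumPy sketch (dimensions and names are illustrative, not from the talk) of a depth-N LNN and its equivalent single linear map, the end-to-end matrix W_{1:N}:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 3, 4                                   # depth and width, chosen arbitrarily
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(N)]   # W_1, ..., W_N

def lnn_forward(Ws, x):
    """Apply the linear network: y = W_N ... W_2 W_1 x."""
    for W in Ws:
        x = W @ x
    return x

def end_to_end(Ws):
    """End-to-end matrix W_{1:N} = W_N ... W_2 W_1."""
    P = np.eye(Ws[0].shape[1])
    for W in Ws:
        P = W @ P
    return P

x = rng.normal(size=d)
assert np.allclose(lnn_forward(Ws, x), end_to_end(Ws) @ x)   # same linear map
```

The model class is just linear maps, so depth adds no expressiveness here; the question is what the factorization does to the optimization dynamics.
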
Gradient Flow

Gradient flow (GF) is a continuous version of GD (learning rate → 0):

    d/dt α(t) = −∇f(α(t)),    t > 0

[Figure: discrete gradient descent steps vs. the continuous gradient flow curve.]

It admits use of theoretical tools from differential geometry/equations.

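The talk treats GF analytically; numerically one can emulate it with GD at a very small step size (forward Euler). A minimal sketch, using a toy quadratic whose flow is known in closed form (everything here is illustrative, not from the talk):

```python
import numpy as np

def gradient_flow(grad, alpha0, T=1.0, dt=1e-4):
    """Forward-Euler integration of d/dt alpha(t) = -grad(alpha(t)) up to time T.
    As dt -> 0 this approaches gradient flow; for finite dt it is just GD with
    learning rate dt."""
    alpha = np.array(alpha0, dtype=float)
    for _ in range(int(T / dt)):
        alpha -= dt * grad(alpha)
    return alpha

# f(a) = 0.5 * ||a||^2 has gradient a and flow alpha(t) = alpha(0) * exp(-t)
approx = gradient_flow(lambda a: a, alpha0=[1.0, -2.0], T=1.0)
exact = np.exp(-1.0) * np.array([1.0, -2.0])
print(approx, exact)   # nearly identical
```
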
Trajectories of Gradient Flow

    x → W_1 → W_2 → ... → W_N,    y = W_N ··· W_2 W_1 x

A loss ℓ(·) for the linear model induces an overparameterized objective for the LNN:

    φ(W_1, ..., W_N) := ℓ(W_N ··· W_2 W_1)

Definition
Weights W_1, ..., W_N are balanced if W_{j+1}^⊤ W_{j+1} = W_j W_j^⊤ for all j.

This holds approximately under near-zero initialization, and exactly under residual (I_d) initialization.

Claim
Trajectories of GF over LNN preserve balancedness: if W_1, ..., W_N are balanced at initialization, they remain that way throughout GF optimization.

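The claim is exact for the continuous flow; with a small but finite step size, balancedness is only approximately conserved. Here is a hedged numerical sketch (synthetic data, identity init, illustrative hyperparameters, ℓ2 loss) that runs GD on φ and reports how far the weights drift from balancedness:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, m = 3, 4, 50
X = rng.normal(size=(m, d))
Y = X @ rng.normal(size=(d, d)).T            # synthetic linear regression targets
Ws = [np.eye(d) for _ in range(N)]            # residual (I_d) init: exactly balanced

def product(Ws):
    P = np.eye(d)
    for W in Ws:
        P = W @ P                             # end-to-end matrix of the given layers
    return P

def balancedness_gap(Ws):
    return max(np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T)
               for j in range(len(Ws) - 1))

lr = 1e-3
for step in range(2000):
    E = X @ product(Ws).T - Y                 # residuals of the end-to-end map
    grad_e2e = E.T @ X / m                    # gradient of the l2 loss w.r.t. W_{1:N}
    grads = []
    for j in range(N):                        # chain rule through the matrix product
        left = product(Ws[j + 1:])            # W_N ... W_{j+2}
        right = product(Ws[:j])               # W_j ... W_1
        grads.append(left.T @ grad_e2e @ right.T)
    Ws = [W - lr * g for W, g in zip(Ws, grads)]

print("balancedness gap after training:", balancedness_gap(Ws))   # stays close to 0
```
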
Implicit Preconditioning

Question
How does the end-to-end matrix W_{1:N} := W_N ··· W_1 move on GF trajectories?

[Figure: gradient flow over (W_1, ..., W_N) in the linear neural network corresponds to preconditioned gradient flow over W_{1:N} in the equivalent linear model.]

Theorem
If W_1, ..., W_N are balanced at initialization, W_{1:N} follows the end-to-end dynamics:

    d/dt vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

where P_{W_{1:N}(t)} is a preconditioner (PSD matrix) that "reinforces" W_{1:N}(t).

Adding (redundant) linear layers to a classic linear model induces a preconditioner promoting movement in directions already taken!

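The slide does not spell out P explicitly. As a hedged sanity check (my own working of a special case consistent with the theorem, not its general statement), consider the scalar network where every W_j is 1×1 and balancedness means all |w_j| are equal:

```latex
% Scalar LNN: w_{1:N} = w_N \cdots w_1, \quad \phi(w_1,\dots,w_N) = \ell(w_{1:N}).
% Gradient flow on each factor:
\frac{d}{dt} w_j = -\frac{\partial \phi}{\partial w_j}
                 = -\ell'(w_{1:N}) \prod_{i \neq j} w_i .
% Chain rule for the end-to-end weight:
\frac{d}{dt} w_{1:N} = \sum_{j=1}^{N} \Big(\prod_{i \neq j} w_i\Big)\,\frac{d}{dt} w_j
                     = -\ell'(w_{1:N}) \sum_{j=1}^{N} \Big(\prod_{i \neq j} w_i\Big)^{2} .
% Balancedness (all |w_j| equal) gives |w_j| = |w_{1:N}|^{1/N}, hence
\frac{d}{dt} w_{1:N} = -\,N\,|w_{1:N}|^{\,2 - 2/N}\;\ell'(w_{1:N}) .
```

In this special case the "preconditioner" is the scalar N|w_{1:N}|^{2−2/N}: the larger the end-to-end weight already is, the more the flow is amplified, which matches the "reinforces" description above.
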
Convergence to Global Optimum

    d/dt vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

- P_{W_{1:N}(t)} ≻ 0 when W_{1:N}(t) has full rank
- ⇒ the loss decreases until: (1) ∇ℓ(W_{1:N}(t)) = 0, or (2) W_{1:N}(t) is singular
- ℓ(·) is typically convex ⇒ (1) means global minimum was reached

Corollary
Assume ℓ(·) is convex and the LNN is initialized such that:
- W_1, ..., W_N are balanced
- ℓ(W_{1:N}) < ℓ(W) for any singular W
Then GF converges to global minimum.

From Gradient Flow to Gradient Descent

Our convergence result for GF made two assumptions on initialization:
1. Weights are balanced: ‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F = 0 for all j
2. Loss is smaller than that of any singular solution: ℓ(W_{1:N}) < ℓ(W) for all W s.t. σ_min(W) = 0

For translating to GD, we define discrete forms of these conditions:

Definition
For δ ≥ 0, weights W_1, ..., W_N are δ-balanced if ‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F ≤ δ for all j.

Definition
For c > 0, weights W_1, ..., W_N have deficiency margin c if ℓ(W_{1:N}) ≤ ℓ(W) for all W s.t. σ_min(W) ≤ c.

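δ-balancedness is directly computable from the weights; the deficiency margin also depends on the loss and the data, so it is not checked here. A small sketch (dimensions and initialization scales are illustrative) of the δ-balancedness measure:

```python
import numpy as np

def delta_balancedness(Ws):
    """Smallest delta for which W_1, ..., W_N are delta-balanced:
    max_j || W_{j+1}^T W_{j+1} - W_j W_j^T ||_F."""
    return max(np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T, "fro")
               for j in range(len(Ws) - 1))

rng = np.random.default_rng(0)
d, N = 4, 3
print(delta_balancedness([np.eye(d) for _ in range(N)]))                             # 0: identity init
print(delta_balancedness([rng.normal(scale=1e-3, size=(d, d)) for _ in range(N)]))   # tiny: near-zero init
print(delta_balancedness([rng.normal(size=(d, d)) for _ in range(N)]))               # not small: generic init
```
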
Convergence to Global Optimum for Gradient Descent

Suppose ℓ(·) is the ℓ2 loss, i.e. ℓ(W) = (1/2m) Σ_{i=1}^m ‖W x_i − y_i‖².

Theorem
Assume GD over LNN is initialized s.t. W_1, ..., W_N have deficiency margin c > 0 and are δ-balanced with δ ≤ O(c²). Then, for any learning rate η ≤ O(c⁴):

    loss(iteration t) ≤ e^{−Ω(c² η t)}

Claim
The assumptions on initialization (deficiency margin and δ-balancedness):
- Are necessary (violating any of them can lead to divergence)
- For output dimension 1, hold with constant probability under a random "balanced" initialization

This is a guarantee of efficient (linear-rate) convergence to global minimum, and the most general guarantee to date for GD efficiently training a deep net.

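The slides do not detail the random "balanced" initialization. One way to produce weights that are exactly balanced while realizing a sampled end-to-end matrix is to factor that matrix through its SVD; this construction is my own illustration and mirrors, but is not necessarily identical to, the initialization referred to in the claim (square layers assumed for simplicity):

```python
import numpy as np

def balanced_factorization(A, N, rng):
    """Factor a square matrix A into N layers W_N ... W_1 = A that are exactly
    balanced: W_{j+1}^T W_{j+1} = W_j W_j^T for all j."""
    U, s, Vt = np.linalg.svd(A)
    root = np.diag(s ** (1.0 / N))            # Sigma^{1/N}
    # orthogonal "interfaces": O_0 = V, O_N = U, O_1..O_{N-1} random orthogonal
    Os = [Vt.T] + [np.linalg.qr(rng.normal(size=A.shape))[0] for _ in range(N - 1)] + [U]
    return [Os[j + 1] @ root @ Os[j].T for j in range(N)]   # W_{j+1} = O_{j+1} Sigma^{1/N} O_j^T

rng = np.random.default_rng(0)
d, N = 4, 3
A = rng.normal(size=(d, d))                   # sampled end-to-end matrix
Ws = balanced_factorization(A, N, rng)

P = np.eye(d)
for W in Ws:
    P = W @ P
print(np.allclose(P, A))                                                   # True: product recovers A
print(max(np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T)
          for j in range(N - 1)))                                          # ~0: exactly balanced
```
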
Subsection: Acceleration by Depth

The Effect of Depth

Conventional wisdom: depth boosts expressiveness...

[Figure: input, early layers, intermediate layers, deep layers.]

...but complicates optimization.

We will see: not always true...

Effect of Depth for Linear Neural Networks

For LNN, we derived the end-to-end dynamics:

    d/dt vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

Consider a discrete version:

    vec[W_{1:N}(t+1)] ← vec[W_{1:N}(t)] − η · P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

Claim
For any p > 2, there exist settings where ℓ(·) is the ℓp loss (convex):

    ℓ(W) = (1/m) Σ_{i=1}^m ‖W x_i − y_i‖_p^p

and the discrete end-to-end dynamics reach global minimum arbitrarily faster than GD.

[Figure: optimization trajectories in the (w_1, w_2) plane: gradient descent vs. the end-to-end dynamics.]

Experiments

Linear neural networks:
- Regression problem from UCI ML Repository; ℓ4 loss

[Figures: training curves.]

- Depth can speed up GD, even without any gain in expressiveness, and despite introducing non-convexity!
- This speed-up can outperform popular acceleration methods designed for convex problems!

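A rough way to probe this at home (not the talk's experiment: synthetic data instead of the UCI problem, a scalar output, hidden width 1, and illustrative hyperparameters; whether and by how much depth helps depends on the initialization scale, step size, and data) is to run GD on the convex ℓ4 objective and on a depth-3 overparameterization of the same linear model, and compare losses:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d)                    # scalar-output linear regression

def l4_loss(w):
    return np.mean((X @ w - y) ** 4)

def l4_grad(w):                               # gradient of the (convex) l4 loss
    r = X @ w - y
    return 4 * X.T @ (r ** 3) / m

w = np.zeros(d)                               # (a) plain GD on the l4 objective
w1 = np.full(d, 0.1)                          # (b) depth-3 LNN with hidden width 1:
a = b = np.linalg.norm(w1)                    #     end-to-end map is b * a * w1 (balanced init)

lr = 1e-3
for t in range(50_000):
    w -= lr * l4_grad(w)
    g = l4_grad(b * a * w1)                   # gradient w.r.t. the end-to-end vector
    w1, a, b = (w1 - lr * a * b * g,          # GD on the factors (chain rule)
                a - lr * b * np.dot(g, w1),
                b - lr * a * np.dot(g, w1))
    if t % 10_000 == 0:
        print(t, l4_loss(w), l4_loss(b * a * w1))
```
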
Experiments (cont'd)

Non-linear neural networks: