No Spurious Local Minima in Training Deep Quadratic Networks
Abbas Kazemipour
Conference on Mathematical Theory of Deep Neural Networks
October 31, 2019, New York City, NY
Need for New Optimization Theory
§ The mystery of deep neural networks and gradient descent
  › Good solutions despite highly nonlinear and nonconvex landscapes
§ Roles of overparameterization, regularization, normalization and side information
Shallow Quadratic Networks
§ Quadratic NNs: a sweet spot between theory and practice
  › A stepping stone toward higher-order polynomials and analytical, continuous activation functions
§ Architecture: quadratic activations $(w_j^\top x)^2$ in the hidden layer (quadratic features), followed by a linear layer with weights $\lambda_j$
§ Loss: $\mathcal{L}(\Lambda, W) = \sum_i \big( y_i - \sum_j \lambda_j (w_j^\top x_i)^2 \big)^2 = \sum_i \big( y_i - x_i^\top W \Lambda W^\top x_i \big)^2$
§ What is the minimum number of hidden units $h$ needed if $x_i \in \mathbb{R}^d$? (A sketch of the model follows below.)
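For concreteness, here is a minimal JAX sketch of the shallow quadratic network output $x^\top W \Lambda W^\top x$ and its squared loss; the function names and the toy data are illustrative assumptions, not part of the original slides.

```python
import jax
import jax.numpy as jnp

def quadratic_net(params, x):
    """Shallow quadratic network: sum_j lambda_j (w_j^T x)^2 = x^T W Lambda W^T x."""
    W, lam = params              # W: (d, h), lam: (h,)
    z = W.T @ x                  # hidden pre-activations w_j^T x
    return jnp.sum(lam * z**2)   # quadratic activation + linear output layer

def loss(params, X, y):
    """Squared loss L(Lambda, W) = sum_i (y_i - x_i^T W Lambda W^T x_i)^2."""
    preds = jax.vmap(lambda x: quadratic_net(params, x))(X)
    return jnp.sum((y - preds) ** 2)

# Hypothetical toy data: d = 5 inputs, h = 5 hidden units, n = 100 samples.
k1, k2, _ = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(k1, (100, 5))
params = (jax.random.normal(k2, (5, 5)), jnp.ones(5))
y = jax.vmap(lambda x: quadratic_net(params, x))(X)   # planted targets
print(loss(params, X, y))                             # 0.0 at the planted solution
```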
Simple vs. Complex Cells
§ Primary visual cortex: cells sensitive vs. insensitive to contrast [Rust et al., 2005]
Low-Rank Matrix Recovery
§ Measurement matrices $X_i \in \mathbb{R}^{d \times d}$, responses $y_i \in \mathbb{R}$, factor $W \in \mathbb{R}^{d \times h}$
§ $M = W \Lambda W^\top$ is low-rank; $y_i = \langle M, X_i \rangle = \langle W \Lambda W^\top, X_i \rangle$ are random linear measurements
§ $\mathcal{L}(\Lambda, W) = \sum_i \big( y_i - \langle W \Lambda W^\top, X_i \rangle \big)^2$
[Figure: schematic of linear measurements of a low-rank matrix.]
Low-Rank Matrix Recovery
§ $X_i \in \mathbb{R}^{d \times d}$, $y_i \in \mathbb{R}$, $M = W \Lambda W^\top$ low-rank
§ minimize $\mathcal{L}(M) = \sum_i \big( y_i - \langle M, X_i \rangle \big)^2$ subject to $\mathrm{rank}(M) \le r$ — nonconvex
§ Convexification via nuclear-norm minimization (e.g., under RIP of the $X_i$)
  › SDP: computationally challenging
§ Can we solve for $(\Lambda, W)$ instead? (Burer-Monteiro '02, '05) See the sketch below.
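To make the connection concrete, a small sketch of my own, assuming rank-one measurements $X_i = x_i x_i^\top$: the factored, Burer-Monteiro-style objective in $(\Lambda, W)$ is exactly the quadratic-network loss from the previous slide.

```python
import jax
import jax.numpy as jnp

d, h, n = 4, 4, 50
kx, kw, ks = jax.random.split(jax.random.PRNGKey(1), 3)
X = jax.random.normal(kx, (n, d))                  # inputs x_i; measurements X_i = x_i x_i^T
W = jax.random.normal(kw, (d, h))
lam = jnp.sign(jax.random.normal(ks, (h,)))        # diagonal of Lambda

M = W @ jnp.diag(lam) @ W.T                        # symmetric matrix of rank <= h
y = jnp.einsum('ni,ij,nj->n', X, M, X)             # y_i = x_i^T M x_i = <x_i x_i^T, M>

def factored_loss(W, lam):
    """Burer-Monteiro-style objective in (Lambda, W): identical to the
    quadratic-network loss, i.e. recovery from rank-one measurements."""
    preds = jnp.einsum('ni,ij,j,kj,nk->n', X, W, lam, W, X)
    return jnp.sum((y - preds) ** 2)

print(factored_loss(W, lam))                       # 0.0 at the planted factors
```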
Global Optimality Conditions
§ Least squares + SVD (convex reparameterization):
  minimize over $M$: $\mathcal{L}(M) = \sum_i \big( y_i - x_i^\top M x_i \big)^2$
  › Convex; $h \ge d$ neurons sufficient; no local minima; computationally efficient
§ Nonconvex formulation (local search methods, e.g. SGD):
  minimize over $(\Lambda, W)$: $\mathcal{L}(\Lambda, W) = \sum_i \big( y_i - x_i^\top W \Lambda W^\top x_i \big)^2$
  › Possible local minima
  › Equivalent to minimizing $\mathcal{L}(\tilde\Lambda, W) = \sum_i \big( y_i - x_i^\top W \tilde\Lambda W^\top x_i \big)^2$ over a symmetric $\tilde\Lambda$, whose eigendecomposition can be absorbed into $W$
§ A solution is globally optimal iff $\sum_i r_i\, x_i x_i^\top = 0$, where $r_i = y_i - x_i^\top W \Lambda W^\top x_i$ (see the sketch below)
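A sketch of both routes under the rank-one-measurement setting: `global_optimality_gap` evaluates the certificate $\sum_i r_i\, x_i x_i^\top$ at a candidate $(\Lambda, W)$, and `convex_fit` implements the least-squares-plus-eigendecomposition solution (for symmetric $M$ the eigendecomposition plays the role of the SVD). Both function names, and the assumption of sufficiently many measurements, are mine.

```python
import jax
import jax.numpy as jnp

def global_optimality_gap(W, lam, X, y):
    """Certificate matrix sum_i r_i x_i x_i^T; a candidate (Lambda, W) is
    globally optimal iff this matrix is zero."""
    preds = jnp.einsum('ni,ij,j,kj,nk->n', X, W, lam, W, X)   # x_i^T W Lam W^T x_i
    r = y - preds                                             # residuals r_i
    return jnp.einsum('n,ni,nj->ij', r, X, X)

def convex_fit(X, y):
    """Convex route: least squares over symmetric M (needs enough measurements,
    roughly n >= d^2), then an eigendecomposition M = W diag(lam) W^T recovers a
    globally optimal quadratic network with h = d hidden units."""
    d = X.shape[1]
    A = jax.vmap(lambda x: jnp.outer(x, x).ravel())(X)        # row i = vec(x_i x_i^T)
    m, _, _, _ = jnp.linalg.lstsq(A, y)
    M = 0.5 * (m.reshape(d, d) + m.reshape(d, d).T)           # symmetrize
    lam, W = jnp.linalg.eigh(M)
    return W, lam
```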
Properties of Stationary Points
minimize $\mathcal{L}(\Lambda, W) = \sum_i \big( y_i - x_i^\top W \Lambda W^\top x_i \big)^2$, with residuals $r_i = y_i - x_i^\top W \Lambda W^\top x_i$
§ First-order optimality: $\sum_i r_i\, x_i x_i^\top W = 0$
  › If $W$ is full-rank, then $\sum_i r_i\, x_i x_i^\top = 0$
§ Second-order optimality: for every perturbation direction $Q$, $\;2\sum_i \big( x_i^\top Q \Lambda W^\top x_i \big)^2 - \sum_i r_i\, x_i^\top Q \Lambda Q^\top x_i \ge 0$
  › If $W$ is low-rank, this forces $\sum_i r_i\, x_i x_i^\top \preceq 0$ or $\succeq 0$, depending on the sign of the corresponding $\lambda_j$
§ Can we force $W$ to be full-rank, or exploit this semidefiniteness? (A numerical check of the first-order condition follows below.)
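As a quick numerical check (hypothetical names, random data), the gradient of the loss with respect to $W$ is $-4\big(\sum_i r_i\, x_i x_i^\top\big) W \Lambda$, so first-order stationarity with an invertible $\Lambda$ reduces to $\sum_i r_i\, x_i x_i^\top W = 0$:

```python
import jax
import jax.numpy as jnp

def loss(params, X, y):
    W, lam = params
    preds = jnp.einsum('ni,ij,j,kj,nk->n', X, W, lam, W, X)
    return jnp.sum((y - preds) ** 2)

def residual_matrix(params, X, y):
    W, lam = params
    preds = jnp.einsum('ni,ij,j,kj,nk->n', X, W, lam, W, X)
    return jnp.einsum('n,ni,nj->ij', y - preds, X, X)         # sum_i r_i x_i x_i^T

# Random data and a random (non-stationary) point, purely for the check.
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(2), 4)
X = jax.random.normal(k1, (30, 4))
y = jax.random.normal(k2, (30,))
params = (jax.random.normal(k3, (4, 3)), jax.random.normal(k4, (3,)))

W, lam = params
grad_W = jax.grad(loss)(params, X, y)[0]                      # autodiff gradient w.r.t. W
manual = -4.0 * residual_matrix(params, X, y) @ W @ jnp.diag(lam)
print(jnp.max(jnp.abs(grad_W - manual)))                      # ~0: the formulas agree
```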
Escaping Spurious Critical Points
§ Nonconvex orthogonality penalty (forces $W$ full-rank and near-orthonormal for $\beta$ large enough):
  minimize $\mathcal{L}_O(\Lambda, W) = \sum_i \big( y_i - x_i^\top W \Lambda W^\top x_i \big)^2 + \beta\, \| W W^\top - I \|^2$
  › Theorem 1: the global minimum is achieved
  › The solution is an eigenvalue decomposition ⇒ $O(d^3)$ complexity
§ Adding the norm of the input as a regressor (side information), using $x_i^\top I\, x_i = \|x_i\|^2$:
  minimize $\mathcal{L}_N(\alpha, \Lambda, W) = \sum_i \big( y_i - x_i^\top (W \Lambda W^\top + \alpha I)\, x_i \big)^2$
  › Theorem 2: all stationary points are global minima with probability 1
  › Advantage of data normalization
(A sketch of both modified objectives follows below.)
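A minimal sketch of the two modified objectives; the squared Frobenius norm for the penalty and the absence of a $\tfrac12$ factor are my assumptions, and the function names are hypothetical.

```python
import jax.numpy as jnp

def predict(W, lam, X):
    """x_i^T W Lambda W^T x_i for every row x_i of X."""
    return jnp.einsum('ni,ij,j,kj,nk->n', X, W, lam, W, X)

def loss_orth(W, lam, X, y, beta):
    """Quadratic-network loss plus the orthogonality penalty beta * ||W W^T - I||_F^2."""
    d = W.shape[0]
    fit = jnp.sum((y - predict(W, lam, X)) ** 2)
    penalty = jnp.sum((W @ W.T - jnp.eye(d)) ** 2)
    return fit + beta * penalty

def loss_norm(W, lam, alpha, X, y):
    """Adds alpha * ||x_i||^2 as a regressor, i.e. fits x_i^T (W Lam W^T + alpha I) x_i."""
    preds = predict(W, lam, X) + alpha * jnp.sum(X ** 2, axis=1)
    return jnp.sum((y - preds) ** 2)
```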
Deep Quadratic Networks: Induction
§ Overparameterization: how big should the hidden layer be?
§ Input $x \in \mathbb{R}^d$; first-layer units compute quadratic features $z_j = x^\top A_j x$; the next layer is again quadratic in $z$
§ Key identity: $z_j = \mathrm{vec}(A_j)^\top (x \otimes x)$ and $(x \otimes x)^\top (A_j \otimes A_k)(x \otimes x) = (x^\top A_j x)(x^\top A_k x)$, so the composition of two quadratic layers is a quadratic network in the lifted input $x^{\otimes 2} = x \otimes x$, with lifted weights $[\mathrm{vec}(A_1), \dots, \mathrm{vec}(A_h)]$ and quadratic form $\sum_{jk} c_{jk}\, A_j \otimes A_k$
§ The induction goes through provided the hidden layer is wide enough, $h \ge d^2$ (the dimension of $x \otimes x$); see the numerical check below
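The induction step can be checked numerically; the following sketch (my own illustration, with arbitrary random forms $A_j$ and coefficients $c_{jk}$) verifies that two stacked quadratic layers equal one quadratic form in the Kronecker-lifted input.

```python
import jax
import jax.numpy as jnp

d, h = 3, 4
kA, kc, kx = jax.random.split(jax.random.PRNGKey(3), 3)
A = jax.random.normal(kA, (h, d, d))            # first-layer quadratic forms A_j
c = jax.random.normal(kc, (h, h))               # second-layer coefficients c_jk
x = jax.random.normal(kx, (d,))

z = jnp.einsum('i,jik,k->j', x, A, x)           # z_j = x^T A_j x
two_layers = jnp.einsum('j,jk,k->', z, c, z)    # sum_jk c_jk z_j z_k

x2 = jnp.kron(x, x)                             # lifted input x (x) x, dimension d^2
B = jnp.einsum('jk,jab,kcd->acbd', c, A, A).reshape(d * d, d * d)  # sum_jk c_jk A_j (x) A_k
one_quadratic_layer = x2 @ B @ x2               # a single quadratic form in the lifted input

print(two_layers, one_quadratic_layer)          # the two values agree
```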
Deep Quadratic Networks
§ Stack quadratic layers with weights $W^{(1)}, \dots, W^{(k)}$; the number of neurons required grows super-exponentially in depth
§ Regularized objective: minimize $\mathcal{L}(\Lambda, W) = \sum_i \big( y_i - \hat y_i(\Lambda, W) \big)^2 + \beta \sum_\ell \big\| W^{(\ell)} W^{(\ell)\top} - I \big\|^2$
§ Theorem 3: all stationary points of $\mathcal{L}$ are global minima
  › A similar objective can be formed by adding norms (a sketch of a deep quadratic network follows below)
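The exact layer parameterization on this slide is hard to recover from the transcript, so the following is only a sketch of one plausible depth-$k$ quadratic network with per-layer diagonal scalings, an orthogonality penalty on every layer, and a scalar readout; all names, shapes, and the readout choice are assumptions.

```python
import jax
import jax.numpy as jnp

def deep_quadratic_net(weights, scales, x):
    """One plausible depth-k quadratic network: each layer squares linear units,
    z -> lam * (W^T z)**2 (elementwise), followed by a scalar linear readout."""
    z = x
    for W, lam in zip(weights, scales):
        z = lam * (W.T @ z) ** 2
    return jnp.sum(z)

def deep_loss(weights, scales, X, y, beta):
    """Squared fit plus an orthogonality penalty on every layer's weight matrix."""
    preds = jax.vmap(lambda x: deep_quadratic_net(weights, scales, x))(X)
    fit = jnp.sum((y - preds) ** 2)
    penalty = sum(jnp.sum((W @ W.T - jnp.eye(W.shape[0])) ** 2) for W in weights)
    return fit + beta * penalty

# Hypothetical shapes: input d = 4, two hidden layers of widths 16 and 256
# (the required width grows rapidly with depth, as noted on the slide).
keys = jax.random.split(jax.random.PRNGKey(4), 5)
weights = [jax.random.normal(keys[0], (4, 16)), jax.random.normal(keys[1], (16, 256))]
scales = [jax.random.normal(keys[2], (16,)), jax.random.normal(keys[3], (256,))]
X = jax.random.normal(keys[4], (10, 4))
y = jax.vmap(lambda x: deep_quadratic_net(weights, scales, x))(X)
print(deep_loss(weights, scales, X, y, beta=1.0))   # fit term is zero at the generating params
```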
How well does gradient descent perform?
§ Experiment setup: $x_i \in \mathbb{R}^{d} \sim \mathcal{N}(0, I)$, $y_i = \sum_j \lambda_j (w_j^\top x_i)^2$ with planted $\lambda_j = \pm 1$ w.p. $\tfrac{1}{2}$
[Figure: fraction of runs achieving the global minimizer and average normalized error vs. number of hidden units, for the regular loss $\mathcal{L}(\Lambda, W)$, the orthogonality-penalized loss $\mathcal{L}_O(\Lambda, W)$, and the added-norm loss $\mathcal{L}_N(\alpha, \Lambda, W)$ (legend: Regular, Norm, Orthogonal).]
§ Most bad critical points are close to a global solution! (A reproduction sketch follows below.)
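A rough sketch of this kind of experiment; the dimensions, learning rate, step count, and success threshold below are assumptions, not the values behind the slides.

```python
import jax
import jax.numpy as jnp

def run_trial(key, d=5, h_planted=5, h_model=8, n=200, steps=20000, lr=1e-3):
    """Plant a quadratic network with lambda_j = +-1, run plain gradient descent on
    L(Lambda, W) from a small random start, and return the final training loss."""
    kx, kw, ks, k0, k1 = jax.random.split(key, 5)
    X = jax.random.normal(kx, (n, d))                               # x_i ~ N(0, I)
    W_star = jax.random.normal(kw, (d, h_planted))
    lam_star = jnp.sign(jax.random.normal(ks, (h_planted,)))        # +-1 w.p. 1/2
    y = jnp.einsum('ni,ij,j,kj,nk->n', X, W_star, lam_star, W_star, X)

    def loss(params):
        W, lam = params
        preds = jnp.einsum('ni,ij,j,kj,nk->n', X, W, lam, W, X)
        return jnp.mean((y - preds) ** 2)

    params = (0.1 * jax.random.normal(k0, (d, h_model)),
              0.1 * jax.random.normal(k1, (h_model,)))
    grad_fn = jax.jit(jax.grad(loss))
    for _ in range(steps):
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grad_fn(params))
    return loss(params)

# Repeating this over many seeds and hidden-layer widths gives curves of the
# "fraction achieving the global minimizer" type shown on the slide.
final = run_trial(jax.random.PRNGKey(0))
print("reached (numerically) global minimum:", bool(final < 1e-6))
```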
Power of Gradient Descent
§ How well does gradient descent work in practice?
[Figure: average normalized error vs. number of hidden units for the Regular, Norm, and Orthogonal objectives, across three data blocks; rows: planted identity with random signs, planted Gaussian; Gaussian input distribution.]
Power of Gradient Descent
§ How well does gradient descent work in practice?
[Figure: fraction of runs achieving the global minimizer vs. number of hidden units, comparing the regular quadratic network, added-norm, orthogonality-penalty, and least-squares setups; rows: planted identity with random signs, planted Gaussian, non-planted (random); Gaussian input distribution.]
Power of Gradient Descent
§ How well does gradient descent work in practice?
[Figure: average normalized error and fraction achieving the global minimizer vs. number of hidden units, for input dimensions 8, 9, and 10.]
Summary
§ Quadratic neural networks are a sweet spot between theory and practice
  › Local minima can be easily escaped via
    • Overparameterization
    • Normalization
    • Regularization
  › Next steps: higher-order polynomials, analytical and continuous activation functions
Collaborators: Shaul Druckmann, Brett Larsen