  1. ON THE OPTIMIZATION LANDSCAPE OF NEURAL NETWORKS
     JOAN BRUNA, CIMS + CDS, NYU
     In collaboration with D. Freeman (UC Berkeley), Luca Venturi & Afonso Bandeira (NYU)

  3. MOTIVATION
     ➤ We consider the standard Empirical Risk Minimization setup, with $\ell(z)$ convex:
       $\hat{E}(\Theta) = \mathbb{E}_{(X,Y)\sim \hat{P}}\, \ell(\Phi(X;\Theta), Y) + R(\Theta)$, where $R(\Theta)$ is a regularization term,
       $E(\Theta) = \mathbb{E}_{(X,Y)\sim P}\, \ell(\Phi(X;\Theta), Y)$, and $\hat{P} = \frac{1}{L} \sum_{l \le L} \delta_{(x_l, y_l)}$.
     ➤ Population loss decomposition (aka the "fundamental theorem of ML"):
       $E(\Theta^*) = \underbrace{\hat{E}(\Theta^*)}_{\text{training error}} + \underbrace{E(\Theta^*) - \hat{E}(\Theta^*)}_{\text{generalization gap}}$.
     ➤ Long history of techniques to provably control generalization error via appropriate regularization.
     ➤ Generalization error and optimization are entangled [Bottou & Bousquet].
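
As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of this decomposition: the linear model, the Gaussian data-generating process, the sample sizes and all names below are made-up assumptions for the example.

```python
# Minimal sketch of the decomposition E(Θ*) = training error + generalization gap.
# The linear model and the Gaussian data-generating process are illustrative
# assumptions, not the networks discussed in the slides.
import numpy as np

rng = np.random.default_rng(0)
n, m, L = 10, 1, 50                      # input dim, output dim, training-set size

A_true = rng.normal(size=(m, n))         # ground-truth linear map defining P(X, Y)

def sample(num):
    X = rng.normal(size=(num, n))
    Y = X @ A_true.T + 0.5 * rng.normal(size=(num, m))   # noisy labels
    return X, Y

def risk(theta, X, Y):
    """Squared loss ℓ(Φ(X;Θ), Y) averaged over the given sample."""
    return np.mean(np.sum((X @ theta.T - Y) ** 2, axis=1))

# Fit Θ* by minimizing the empirical risk (ordinary least squares here).
X_tr, Y_tr = sample(L)
theta_star = np.linalg.lstsq(X_tr, Y_tr, rcond=None)[0].T

train_err = risk(theta_star, X_tr, Y_tr)          # Ê(Θ*)
X_pop, Y_pop = sample(200_000)                    # large sample ≈ population P
pop_err = risk(theta_star, X_pop, Y_pop)          # E(Θ*)

print(f"training error Ê(Θ*)  = {train_err:.3f}")
print(f"population risk E(Θ*) = {pop_err:.3f}")
print(f"generalization gap    = {pop_err - train_err:.3f}")
```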

  4. MOTIVATION
     ➤ However, when $\Phi(X;\Theta)$ is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:
     ➤ Stochastic Optimization
       ➤ "During training, it adds the sampling noise that corresponds to the empirical-population mismatch" [Léon Bottou].
     ➤ Make the model convolutional and very large.
       ➤ See e.g. "Understanding Deep Learning Requires Rethinking Generalization" [Ch. Zhang et al., ICLR'17].

  5. MOTIVATION
     ➤ However, when $\Phi(X;\Theta)$ is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:
     ➤ Stochastic Optimization
     ➤ Make the model convolutional and as large as possible.
     ➤ We first address how overparametrization affects the energy landscapes.
     ➤ Goal 1: Study simple topological properties of these landscapes $E(\Theta)$, $\hat{E}(\Theta)$ for half-rectified neural networks.
     ➤ Goal 2: Estimate simple geometric properties with efficient, scalable algorithms. Diagnostic tool.

  6. OUTLINE
     ➤ Topology of Neural Network Energy Landscapes
     ➤ Geometry of Neural Network Energy Landscapes
     [Figure: loss-surface visualizations, (a) without skip connections, (b) with skip connections; from Li et al.'17]

  9. PRIOR RELATED WORK
     ➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].
     ➤ Tensor factorization models capture some of the essence of the non-convexity [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].
     ➤ [Safran and Shamir'15] studies basins of attraction in neural networks in the overparametrized regime.
     ➤ [Soudry'16, Song et al.'16] study Empirical Risk Minimization in two-layer ReLU networks, also in the over-parametrized regime.
     ➤ [Tian'17] studies learning dynamics in a Gaussian generative setting.
     ➤ [Chaudhari et al.'17] studies local smoothing of the energy landscape using the local entropy method from statistical physics.
     ➤ [Pennington & Bahri'17]: Hessian analysis using Random Matrix Theory.
     ➤ [Soltanolkotabi, Javanmard & Lee'17]: layer-wise quadratic NNs.

  11. NON-CONVEXITY ≠ NOT OPTIMIZABLE
     ➤ We can perturb any convex function in such a way that it is no longer convex, but such that gradient descent still converges.
     ➤ E.g. quasi-convex functions.
     ➤ In particular, deep models have internal symmetries: $F(\theta) = F(g \cdot \theta)$, $g \in G$ compact.
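
To make the quasi-convex remark concrete, here is a small illustrative sketch (an addition, not from the slides): $f(x) = \log(1 + x^2)$ is concave for $|x| > 1$, hence non-convex, yet quasi-convex since its sublevel sets are intervals, and plain gradient descent still reaches the global minimum.

```python
# Gradient descent on a non-convex but quasi-convex function: f(x) = log(1 + x^2).
# f is concave for |x| > 1, yet its sublevel sets are intervals, and GD still
# converges to the global minimum at x = 0. (Illustrative example.)
import math

def f(x):
    return math.log(1.0 + x * x)

def grad_f(x):
    return 2.0 * x / (1.0 + x * x)

x, step = 4.0, 0.5
for _ in range(200):
    x -= step * grad_f(x)

print(f"final x = {x:.6f}, f(x) = {f(x):.6f}")   # x ≈ 0, f(x) ≈ 0
```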

  14. ANALYSIS OF NON-CONVEX LOSS SURFACES
     ➤ Given a loss $E(\theta)$, $\theta \in \mathbb{R}^d$, we consider its representation in terms of level sets:
       $E(\theta) = \int_0^\infty \mathbf{1}(\theta \in \Omega_u)\, du$, with $\Omega_u = \{ y \in \mathbb{R}^d \,;\, E(y) \le u \}$.
     ➤ A first notion we address is the topology of the level sets $\Omega_u$.
     ➤ In particular, we ask how connected they are, i.e. how many connected components $N_u$ there are at each energy level $u$.
     ➤ Related to the presence of poor local minima:
       Proposition: If $N_u = 1$ for all $u$, then $E$ has no poor local minima (i.e. no local minima $y^*$ s.t. $E(y^*) > \min_y E(y)$).
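
As a toy illustration of counting $N_u$ (added for this write-up, not part of the slides), the sketch below discretizes a 2-D loss on a grid and counts the connected components of the sublevel set $\Omega_u$ with scipy.ndimage.label; the double-well loss used here is a made-up example.

```python
# Count connected components N_u of a sublevel set Ω_u = {θ : E(θ) ≤ u}
# for a toy 2-D "double well" loss, by discretizing the parameter plane.
# (Illustrative sketch; the loss E below is made up for the example.)
import numpy as np
from scipy import ndimage

def E(theta1, theta2):
    # Two wells at (±1, 0) separated by a barrier of height 1 at the origin.
    return (theta1 ** 2 - 1.0) ** 2 + theta2 ** 2

xs = np.linspace(-2.0, 2.0, 400)
T1, T2 = np.meshgrid(xs, xs)
energy = E(T1, T2)

for u in (0.5, 2.0):
    mask = energy <= u                      # indicator of the sublevel set Ω_u
    _, n_components = ndimage.label(mask)   # N_u on the discretized grid
    print(f"u = {u}: N_u ≈ {n_components}")

# Expected: the sublevel set is disconnected (N_u = 2) below the barrier height
# and connected (N_u = 1) above it.
```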

  16. LINEAR VS NON-LINEAR DEEP MODELS
     ➤ Some authors have considered linear "deep" models as a first step towards understanding nonlinear deep models:
       $E(W_1, \ldots, W_K) = \mathbb{E}_{(X,Y)\sim P} \| W_K \cdots W_1 X - Y \|^2$, with $X \in \mathbb{R}^n$, $Y \in \mathbb{R}^m$, $W_k \in \mathbb{R}^{n_k \times n_{k-1}}$.
     ➤ Theorem [Kawaguchi'16]: If $\Sigma = \mathbb{E}(XX^T)$ and $\mathbb{E}(XY^T)$ are full-rank and $\Sigma$ has distinct eigenvalues, then $E(\Theta)$ has no poor local minima.
       • studying critical points.
       • later generalized in [Hardt & Ma'16, Lu & Kawaguchi'17].
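
The following numpy sketch (added here for illustration, not from the slides) runs gradient descent on a two-layer linear network $E(W_1, W_2) = \frac{1}{L}\|W_2 W_1 X - Y\|_F^2$ and compares the result with the unconstrained least-squares optimum; consistent with the absence of poor local minima, plain GD typically reaches the global value. Dimensions, data, step size and iteration count are arbitrary choices for the demo.

```python
# Gradient descent on a two-layer *linear* network E(W1, W2) = ||W2 W1 X - Y||_F^2 / L,
# compared against the global optimum of the equivalent convex problem over A = W2 W1.
# (Illustrative sketch; data, dimensions and hyperparameters are arbitrary.)
import numpy as np

rng = np.random.default_rng(0)
n, n1, m, L = 8, 12, 3, 200            # n1 > min(n, m): the factorization imposes no rank constraint

X = rng.normal(size=(n, L))
Y = rng.normal(size=(m, n)) @ X + 0.1 * rng.normal(size=(m, L))

W1 = 0.1 * rng.normal(size=(n1, n))
W2 = 0.1 * rng.normal(size=(m, n1))

step = 0.01
for _ in range(10_000):
    R = W2 @ W1 @ X - Y                # residual, shape (m, L)
    gW2 = 2.0 / L * R @ (W1 @ X).T     # ∂E/∂W2
    gW1 = 2.0 / L * W2.T @ R @ X.T     # ∂E/∂W1
    W1 -= step * gW1
    W2 -= step * gW2

loss_gd = np.sum((W2 @ W1 @ X - Y) ** 2) / L

# Global optimum over unconstrained linear maps A (ordinary least squares).
A_star = Y @ X.T @ np.linalg.inv(X @ X.T)
loss_star = np.sum((A_star @ X - Y) ** 2) / L

print(f"GD on factored model: {loss_gd:.5f}   global optimum: {loss_star:.5f}")
```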

  18. LINEAR VS NON-LINEAR DEEP MODELS
     $E(W_1, \ldots, W_K) = \mathbb{E}_{(X,Y)\sim P} \| W_K \cdots W_1 X - Y \|^2$.
     ➤ Proposition [BF'16]:
       1. If $n_k > \min(n, m)$ for $0 < k < K$, then $N_u = 1$ for all $u$.
       2. (2-layer case, ridge regression) $E(W_1, W_2) = \mathbb{E}_{(X,Y)\sim P} \| W_2 W_1 X - Y \|^2 + \lambda (\|W_1\|^2 + \|W_2\|^2)$ satisfies $N_u = 1$ for all $u$ if $n_1 > \min(n, m)$.
     ➤ We pay an extra redundancy price to get simple topology.
     ➤ This simple topology is an "artifact" of the linearity of the network:
       Proposition [BF'16]: For any architecture (choice of internal dimensions), there exists a distribution $P(X, Y)$ such that $N_u > 1$ in the ReLU case $\rho(z) = \max(0, z)$.

  20. PROOF SKETCH
     ➤ Goal: Given $\Theta^A = (W_1^A, \ldots, W_K^A)$ and $\Theta^B = (W_1^B, \ldots, W_K^B)$, we construct a path $\gamma(t)$ that connects $\Theta^A$ with $\Theta^B$ s.t. $E(\gamma(t)) \le \max(E(\Theta^A), E(\Theta^B))$.
     ➤ Main idea:
       1. Induction on $K$.
       2. Lift the parameter space to $\widetilde{W} = W_1 W_2$: the problem is convex $\Rightarrow$ there exists a (linear) path $\tilde{\gamma}(t)$ that connects $\Theta^A$ and $\Theta^B$.
       3. Write the path in terms of the original coordinates by factorizing $\tilde{\gamma}(t)$.
     ➤ Simple fact: If $M_0, M_1 \in \mathbb{R}^{n \times n'}$ with $n' > n$, then there exists a path $\gamma : [0,1] \to \mathbb{R}^{n \times n'}$ with $\gamma(0) = M_0$, $\gamma(1) = M_1$ and $M_0, M_1 \in \mathrm{span}(\gamma(t))$ for all $t \in (0,1)$.
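
The sketch below (added for illustration, not part of the slides) checks the convexity step of the argument numerically in the two-layer linear case: lifting to the product matrix $\widetilde{W} = W_2 W_1$, the loss is convex in $\widetilde{W}$, so along the straight line between the two lifted endpoints the energy never exceeds the maximum of the endpoint energies. Factorizing the lifted path back into $(W_1(t), W_2(t))$ is where the overparametrization / "simple fact" step of the proof is needed, and is not attempted here.

```python
# Check the "lifted" step of the connectedness argument for a 2-layer linear network:
# the loss is convex as a function of the product matrix W̃ = W2 W1, so along the
# straight line between two lifted parameter points the energy stays below the
# maximum of the endpoint energies. (Illustrative sketch; data are random.)
import numpy as np

rng = np.random.default_rng(1)
n, n1, m, L = 6, 10, 4, 100

X = rng.normal(size=(n, L))
Y = rng.normal(size=(m, L))

def loss_lifted(W_tilde):
    return np.sum((W_tilde @ X - Y) ** 2) / L

# Two arbitrary parameter configurations Θ^A and Θ^B.
W1_A, W2_A = rng.normal(size=(n1, n)), rng.normal(size=(m, n1))
W1_B, W2_B = rng.normal(size=(n1, n)), rng.normal(size=(m, n1))

Wt_A, Wt_B = W2_A @ W1_A, W2_B @ W1_B
e_max = max(loss_lifted(Wt_A), loss_lifted(Wt_B))

# Straight-line path in the lifted (convex) space.
path_energies = [loss_lifted((1 - t) * Wt_A + t * Wt_B) for t in np.linspace(0, 1, 101)]

print(f"max energy along lifted path: {max(path_energies):.3f}")
print(f"max of endpoint energies:     {e_max:.3f}")
assert max(path_energies) <= e_max + 1e-9   # convexity ⇒ no barrier in the lifted space
```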

  22. MODEL SYMMETRIES [with L. Venturi, A. Bandeira, '17]
     ➤ How much extra redundancy are we paying to achieve $N_u = 1$ instead of simply no poor local minima?
     ➤ In the multilinear case, we don't need $n_k > \min(n, m)$:
       $(W_1, W_2, \ldots, W_K) \sim (\widetilde{W}_1, \ldots, \widetilde{W}_K) \iff \widetilde{W}_k = U_k W_k U_{k-1}^{-1}, \quad U_k \in GL(\mathbb{R}^{n_k \times n_k})$.
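
As a small sanity check of this symmetry (an addition for this write-up), the sketch below verifies numerically that rescaling the factors of a multilinear network by invertible matrices $U_k$ (with $U_0 = U_K = I$) leaves the end-to-end map, and hence the loss, unchanged. The dimensions are arbitrary.

```python
# Verify the GL symmetry of a multilinear (deep linear) network:
# W̃_k = U_k W_k U_{k-1}^{-1} with U_0 = U_K = I leaves the product W_K ... W_1,
# and therefore the loss, invariant. (Illustrative numerical check.)
import numpy as np

rng = np.random.default_rng(2)
dims = [5, 7, 6, 3]                      # n_0 = n (input), ..., n_K = m (output)
K = len(dims) - 1

# Random factors W_k ∈ R^{n_k × n_{k-1}}.
W = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(K)]

# Random well-conditioned invertible U_k for the hidden layers, identity at the ends.
U = [np.eye(dims[0])] + \
    [rng.normal(size=(dims[k], dims[k])) + dims[k] * np.eye(dims[k]) for k in range(1, K)] + \
    [np.eye(dims[K])]

W_tilde = [U[k + 1] @ W[k] @ np.linalg.inv(U[k]) for k in range(K)]

def product(mats):
    out = np.eye(dims[0])
    for M in mats:                       # applies W_1 first, ..., W_K last
        out = M @ out
    return out

print(np.allclose(product(W), product(W_tilde)))   # True: same end-to-end map
```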
