  1. ON THE OPTIMIZATION LANDSCAPE OF NEURAL NETWORKS
     JOAN BRUNA, CIMS + CDS, NYU
     In collaboration with D. Freeman (UC Berkeley), Luca Venturi & Afonso Bandeira (NYU)

  3. MOTIVATION
     ➤ We consider the standard Empirical Risk Minimization setup, with $\ell(z)$ convex:
       $\hat{E}(\Theta) = \mathbb{E}_{(X,Y)\sim \hat{P}}\, \ell(\Phi(X;\Theta), Y) + R(\Theta)$, where $R(\Theta)$ is a regularization term,
       $E(\Theta) = \mathbb{E}_{(X,Y)\sim P}\, \ell(\Phi(X;\Theta), Y)$, and $\hat{P} = \frac{1}{L} \sum_{l \le L} \delta_{(x_l, y_l)}$.
     ➤ Population loss decomposition (aka the "fundamental theorem of ML"):
       $E(\Theta^*) = \underbrace{\hat{E}(\Theta^*)}_{\text{training error}} + \underbrace{E(\Theta^*) - \hat{E}(\Theta^*)}_{\text{generalization gap}}$.
     ➤ Long history of techniques to provably control generalization error via appropriate regularization.
     ➤ Generalization error and optimization are entangled [Bottou & Bousquet].
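
As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of this decomposition: the linear model, the Gaussian data-generating process, the sample sizes and all names below are made-up assumptions for the example.

```python
# Minimal sketch of the decomposition E(Θ*) = training error + generalization gap.
# The linear model and the Gaussian data-generating process are illustrative
# assumptions, not the networks discussed in the slides.
import numpy as np

rng = np.random.default_rng(0)
n, m, L = 10, 1, 50                      # input dim, output dim, training-set size

A_true = rng.normal(size=(m, n))         # ground-truth linear map defining P(X, Y)

def sample(num):
    X = rng.normal(size=(num, n))
    Y = X @ A_true.T + 0.5 * rng.normal(size=(num, m))   # noisy labels
    return X, Y

def risk(theta, X, Y):
    """Squared loss ℓ(Φ(X;Θ), Y) averaged over the given sample."""
    return np.mean(np.sum((X @ theta.T - Y) ** 2, axis=1))

# Fit Θ* by minimizing the empirical risk (ordinary least squares here).
X_tr, Y_tr = sample(L)
theta_star = np.linalg.lstsq(X_tr, Y_tr, rcond=None)[0].T

train_err = risk(theta_star, X_tr, Y_tr)          # Ê(Θ*)
X_pop, Y_pop = sample(200_000)                    # large sample ≈ population P
pop_err = risk(theta_star, X_pop, Y_pop)          # E(Θ*)

print(f"training error Ê(Θ*)  = {train_err:.3f}")
print(f"population risk E(Θ*) = {pop_err:.3f}")
print(f"generalization gap    = {pop_err - train_err:.3f}")
```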

  4. MOTIVATION
     ➤ However, when $\Phi(X;\Theta)$ is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:
     ➤ Stochastic Optimization
       ➤ "During training, it adds the sampling noise that corresponds to the empirical-population mismatch" [Léon Bottou].
     ➤ Make the model convolutional and very large.
       ➤ See e.g. "Understanding Deep Learning Requires Rethinking Generalization" [Ch. Zhang et al., ICLR'17].

  5. MOTIVATION
     ➤ However, when $\Phi(X;\Theta)$ is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:
     ➤ Stochastic Optimization
     ➤ Make the model convolutional and as large as possible.
     ➤ We first address how overparametrization affects the energy landscapes.
     ➤ Goal 1: Study simple topological properties of these landscapes $E(\Theta)$, $\hat{E}(\Theta)$ for half-rectified neural networks.
     ➤ Goal 2: Estimate simple geometric properties with efficient, scalable algorithms. Diagnostic tool.

  6. OUTLINE
     ➤ Topology of Neural Network Energy Landscapes
     ➤ Geometry of Neural Network Energy Landscapes
     [Figure: loss-surface visualizations, (a) without skip connections, (b) with skip connections; from Li et al.'17]

  9. PRIOR RELATED WORK
     ➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].
     ➤ Tensor factorization models capture some of the essence of the non-convexity [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].
     ➤ [Safran and Shamir'15] studies basins of attraction in neural networks in the overparametrized regime.
     ➤ [Soudry'16, Song et al.'16] study Empirical Risk Minimization in two-layer ReLU networks, also in the over-parametrized regime.
     ➤ [Tian'17] studies learning dynamics in a Gaussian generative setting.
     ➤ [Chaudhari et al.'17] studies local smoothing of the energy landscape using the local entropy method from statistical physics.
     ➤ [Pennington & Bahri'17]: Hessian analysis using Random Matrix Theory.
     ➤ [Soltanolkotabi, Javanmard & Lee'17]: layer-wise quadratic NNs.

  11. NON-CONVEXITY ≠ NOT OPTIMIZABLE
     ➤ We can perturb any convex function in such a way that it is no longer convex, but such that gradient descent still converges.
     ➤ E.g. quasi-convex functions.
     ➤ In particular, deep models have internal symmetries: $F(\theta) = F(g \cdot \theta)$, $g \in G$ compact.
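
To make the quasi-convex remark concrete, here is a small illustrative sketch (an addition, not from the slides): $f(x) = \log(1 + x^2)$ is concave for $|x| > 1$, hence non-convex, yet quasi-convex since its sublevel sets are intervals, and plain gradient descent still reaches the global minimum.

```python
# Gradient descent on a non-convex but quasi-convex function: f(x) = log(1 + x^2).
# f is concave for |x| > 1, yet its sublevel sets are intervals, and GD still
# converges to the global minimum at x = 0. (Illustrative example.)
import math

def f(x):
    return math.log(1.0 + x * x)

def grad_f(x):
    return 2.0 * x / (1.0 + x * x)

x, step = 4.0, 0.5
for _ in range(200):
    x -= step * grad_f(x)

print(f"final x = {x:.6f}, f(x) = {f(x):.6f}")   # x ≈ 0, f(x) ≈ 0
```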

  14. ANALYSIS OF NON-CONVEX LOSS SURFACES
     ➤ Given a loss $E(\theta)$, $\theta \in \mathbb{R}^d$, we consider its representation in terms of level sets:
       $E(\theta) = \int_0^\infty \mathbf{1}(\theta \in \Omega_u)\, du$, with $\Omega_u = \{ y \in \mathbb{R}^d \,;\, E(y) \le u \}$.
     ➤ A first notion we address is the topology of the level sets $\Omega_u$.
     ➤ In particular, we ask how connected they are, i.e. how many connected components $N_u$ there are at each energy level $u$.
     ➤ Related to the presence of poor local minima:
       Proposition: If $N_u = 1$ for all $u$, then $E$ has no poor local minima (i.e. no local minima $y^*$ s.t. $E(y^*) > \min_y E(y)$).
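
As a toy illustration of counting $N_u$ (added for this write-up, not part of the slides), the sketch below discretizes a 2-D loss on a grid and counts the connected components of the sublevel set $\Omega_u$ with scipy.ndimage.label; the double-well loss used here is a made-up example.

```python
# Count connected components N_u of a sublevel set Ω_u = {θ : E(θ) ≤ u}
# for a toy 2-D "double well" loss, by discretizing the parameter plane.
# (Illustrative sketch; the loss E below is made up for the example.)
import numpy as np
from scipy import ndimage

def E(theta1, theta2):
    # Two wells at (±1, 0) separated by a barrier of height 1 at the origin.
    return (theta1 ** 2 - 1.0) ** 2 + theta2 ** 2

xs = np.linspace(-2.0, 2.0, 400)
T1, T2 = np.meshgrid(xs, xs)
energy = E(T1, T2)

for u in (0.5, 2.0):
    mask = energy <= u                      # indicator of the sublevel set Ω_u
    _, n_components = ndimage.label(mask)   # N_u on the discretized grid
    print(f"u = {u}: N_u ≈ {n_components}")

# Expected: the sublevel set is disconnected (N_u = 2) below the barrier height
# and connected (N_u = 1) above it.
```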

  16. LINEAR VS NON-LINEAR DEEP MODELS
     ➤ Some authors have considered linear "deep" models as a first step towards understanding nonlinear deep models:
       $E(W_1, \ldots, W_K) = \mathbb{E}_{(X,Y)\sim P} \| W_K \cdots W_1 X - Y \|^2$, with $X \in \mathbb{R}^n$, $Y \in \mathbb{R}^m$, $W_k \in \mathbb{R}^{n_k \times n_{k-1}}$.
     ➤ Theorem [Kawaguchi'16]: If $\Sigma = \mathbb{E}(XX^T)$ and $\mathbb{E}(XY^T)$ are full-rank and $\Sigma$ has distinct eigenvalues, then $E(\Theta)$ has no poor local minima.
       • studying critical points.
       • later generalized in [Hardt & Ma'16, Lu & Kawaguchi'17].
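
The following numpy sketch (added here for illustration, not from the slides) runs gradient descent on a two-layer linear network $E(W_1, W_2) = \frac{1}{L}\|W_2 W_1 X - Y\|_F^2$ and compares the result with the unconstrained least-squares optimum; consistent with the absence of poor local minima, plain GD typically reaches the global value. Dimensions, data, step size and iteration count are arbitrary choices for the demo.

```python
# Gradient descent on a two-layer *linear* network E(W1, W2) = ||W2 W1 X - Y||_F^2 / L,
# compared against the global optimum of the equivalent convex problem over A = W2 W1.
# (Illustrative sketch; data, dimensions and hyperparameters are arbitrary.)
import numpy as np

rng = np.random.default_rng(0)
n, n1, m, L = 8, 12, 3, 200            # n1 > min(n, m): the factorization imposes no rank constraint

X = rng.normal(size=(n, L))
Y = rng.normal(size=(m, n)) @ X + 0.1 * rng.normal(size=(m, L))

W1 = 0.1 * rng.normal(size=(n1, n))
W2 = 0.1 * rng.normal(size=(m, n1))

step = 0.01
for _ in range(10_000):
    R = W2 @ W1 @ X - Y                # residual, shape (m, L)
    gW2 = 2.0 / L * R @ (W1 @ X).T     # ∂E/∂W2
    gW1 = 2.0 / L * W2.T @ R @ X.T     # ∂E/∂W1
    W1 -= step * gW1
    W2 -= step * gW2

loss_gd = np.sum((W2 @ W1 @ X - Y) ** 2) / L

# Global optimum over unconstrained linear maps A (ordinary least squares).
A_star = Y @ X.T @ np.linalg.inv(X @ X.T)
loss_star = np.sum((A_star @ X - Y) ** 2) / L

print(f"GD on factored model: {loss_gd:.5f}   global optimum: {loss_star:.5f}")
```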

  18. LINEAR VS NON-LINEAR DEEP MODELS
     $E(W_1, \ldots, W_K) = \mathbb{E}_{(X,Y)\sim P} \| W_K \cdots W_1 X - Y \|^2$.
     ➤ Proposition [BF'16]:
       1. If $n_k > \min(n, m)$ for $0 < k < K$, then $N_u = 1$ for all $u$.
       2. (2-layer case, ridge regression) $E(W_1, W_2) = \mathbb{E}_{(X,Y)\sim P} \| W_2 W_1 X - Y \|^2 + \lambda (\|W_1\|^2 + \|W_2\|^2)$ satisfies $N_u = 1$ for all $u$ if $n_1 > \min(n, m)$.
     ➤ We pay an extra redundancy price to get simple topology.
     ➤ This simple topology is an "artifact" of the linearity of the network:
       Proposition [BF'16]: For any architecture (choice of internal dimensions), there exists a distribution $P(X, Y)$ such that $N_u > 1$ in the ReLU case $\rho(z) = \max(0, z)$.

  20. PROOF SKETCH
     ➤ Goal: Given $\Theta^A = (W_1^A, \ldots, W_K^A)$ and $\Theta^B = (W_1^B, \ldots, W_K^B)$, we construct a path $\gamma(t)$ that connects $\Theta^A$ with $\Theta^B$ s.t. $E(\gamma(t)) \le \max(E(\Theta^A), E(\Theta^B))$.
     ➤ Main idea:
       1. Induction on $K$.
       2. Lift the parameter space to $\widetilde{W} = W_1 W_2$: the problem is convex $\Rightarrow$ there exists a (linear) path $\tilde{\gamma}(t)$ that connects $\Theta^A$ and $\Theta^B$.
       3. Write the path in terms of the original coordinates by factorizing $\tilde{\gamma}(t)$.
     ➤ Simple fact: If $M_0, M_1 \in \mathbb{R}^{n \times n'}$ with $n' > n$, then there exists a path $\gamma : [0,1] \to \mathbb{R}^{n \times n'}$ with $\gamma(0) = M_0$, $\gamma(1) = M_1$ and $M_0, M_1 \in \mathrm{span}(\gamma(t))$ for all $t \in (0,1)$.
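
The sketch below (added for illustration, not part of the slides) checks the convexity step of the argument numerically in the two-layer linear case: lifting to the product matrix $\widetilde{W} = W_2 W_1$, the loss is convex in $\widetilde{W}$, so along the straight line between the two lifted endpoints the energy never exceeds the maximum of the endpoint energies. Factorizing the lifted path back into $(W_1(t), W_2(t))$ is where the overparametrization / "simple fact" step of the proof is needed, and is not attempted here.

```python
# Check the "lifted" step of the connectedness argument for a 2-layer linear network:
# the loss is convex as a function of the product matrix W̃ = W2 W1, so along the
# straight line between two lifted parameter points the energy stays below the
# maximum of the endpoint energies. (Illustrative sketch; data are random.)
import numpy as np

rng = np.random.default_rng(1)
n, n1, m, L = 6, 10, 4, 100

X = rng.normal(size=(n, L))
Y = rng.normal(size=(m, L))

def loss_lifted(W_tilde):
    return np.sum((W_tilde @ X - Y) ** 2) / L

# Two arbitrary parameter configurations Θ^A and Θ^B.
W1_A, W2_A = rng.normal(size=(n1, n)), rng.normal(size=(m, n1))
W1_B, W2_B = rng.normal(size=(n1, n)), rng.normal(size=(m, n1))

Wt_A, Wt_B = W2_A @ W1_A, W2_B @ W1_B
e_max = max(loss_lifted(Wt_A), loss_lifted(Wt_B))

# Straight-line path in the lifted (convex) space.
path_energies = [loss_lifted((1 - t) * Wt_A + t * Wt_B) for t in np.linspace(0, 1, 101)]

print(f"max energy along lifted path: {max(path_energies):.3f}")
print(f"max of endpoint energies:     {e_max:.3f}")
assert max(path_energies) <= e_max + 1e-9   # convexity ⇒ no barrier in the lifted space
```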

  22. MODEL SYMMETRIES [with L. Venturi, A. Bandeira, '17]
     ➤ How much extra redundancy are we paying to achieve $N_u = 1$ instead of simply no poor local minima?
     ➤ In the multilinear case, we don't need $n_k > \min(n, m)$:
       $(W_1, W_2, \ldots, W_K) \sim (\widetilde{W}_1, \ldots, \widetilde{W}_K) \iff \widetilde{W}_k = U_k W_k U_{k-1}^{-1}, \quad U_k \in GL(\mathbb{R}^{n_k \times n_k})$.
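
As a small sanity check of this symmetry (an addition for this write-up), the sketch below verifies numerically that rescaling the factors of a multilinear network by invertible matrices $U_k$ (with $U_0 = U_K = I$) leaves the end-to-end map, and hence the loss, unchanged. The dimensions are arbitrary.

```python
# Verify the GL symmetry of a multilinear (deep linear) network:
# W̃_k = U_k W_k U_{k-1}^{-1} with U_0 = U_K = I leaves the product W_K ... W_1,
# and therefore the loss, invariant. (Illustrative numerical check.)
import numpy as np

rng = np.random.default_rng(2)
dims = [5, 7, 6, 3]                      # n_0 = n (input), ..., n_K = m (output)
K = len(dims) - 1

# Random factors W_k ∈ R^{n_k × n_{k-1}}.
W = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(K)]

# Random well-conditioned invertible U_k for the hidden layers, identity at the ends.
U = [np.eye(dims[0])] + \
    [rng.normal(size=(dims[k], dims[k])) + dims[k] * np.eye(dims[k]) for k in range(1, K)] + \
    [np.eye(dims[K])]

W_tilde = [U[k + 1] @ W[k] @ np.linalg.inv(U[k]) for k in range(K)]

def product(mats):
    out = np.eye(dims[0])
    for M in mats:                       # applies W_1 first, ..., W_K last
        out = M @ out
    return out

print(np.allclose(product(W), product(W_tilde)))   # True: same end-to-end map
```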
