An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

Behrooz Ghorbani, Department of Electrical Engineering, Stanford University
(Joint work with Shankar Krishnan & Ying Xiao, Google Research)
June 2019
Overview

- Gradient descent and its variants are the most popular methods for optimizing neural networks.
- The performance of these optimizers depends heavily on the local curvature of the loss surface, so it is important to study the loss curvature.
- We present a scalable algorithm for computing the full eigenvalue density of the Hessian for deep neural networks.
- We leverage this algorithm to study the effect of architecture and hyper-parameter choices on the optimization landscape.
Basic Definitions

- $\theta \in \mathbb{R}^n$ is the model parameter, and $L(\theta) \equiv \frac{1}{N} \sum_{i=1}^{N} L(\theta, (x_i, y_i))$.
- The Hessian matrix, $H$, is an $n \times n$ symmetric matrix of second derivatives:
  $$H(\theta_t)_{i,j} = \frac{\partial^2 L}{\partial \theta_i \, \partial \theta_j} \Big|_{\theta = \theta_t}$$
- $H(\theta)$ represents the (local) loss curvature at the point $\theta$.
- $H(\theta)$ has eigenvalue-eigenvector pairs $(\lambda_i, q_i)_{i=1}^{n}$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$.
- $\lambda_i$ is the curvature of the loss in the direction of $q_i$ in the neighborhood of $\theta$.
- We focus on estimating the empirical distribution of the $\lambda_i$ as a concrete way to study the loss curvature, as illustrated by the sketch below.
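To make these definitions concrete, here is a minimal sketch (not from the talk): it forms the Hessian of a small toy loss by finite differences and reads off the eigenvalue/eigenvector pairs, i.e. the curvatures and their directions. The toy loss, the point `theta`, and the step size `eps` are illustrative assumptions.

```python
# A minimal sketch (illustrative, not the authors' code): numerically form the
# Hessian of a toy scalar loss and inspect its eigenvalue/eigenvector pairs.
import numpy as np

def toy_loss(theta):
    # A simple non-convex loss on R^3, used only for illustration.
    return np.sum(theta ** 4) - np.sum(theta ** 2) + 0.1 * theta[0] * theta[1]

def numerical_hessian(loss, theta, eps=1e-4):
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i], np.eye(n)[j]
            # Central second difference approximates d^2 L / (d theta_i d theta_j).
            H[i, j] = (loss(theta + eps * e_i + eps * e_j)
                       - loss(theta + eps * e_i - eps * e_j)
                       - loss(theta - eps * e_i + eps * e_j)
                       + loss(theta - eps * e_i - eps * e_j)) / (4 * eps ** 2)
    return H

theta = np.array([0.5, -0.3, 0.2])
H = numerical_hessian(toy_loss, theta)
eigvals, eigvecs = np.linalg.eigh(H)               # ascending eigenvalues
print("curvatures (lambda_1 >= ... >= lambda_n):", eigvals[::-1])
```

In a deep network the Hessian is far too large to form explicitly like this; the point of the talk is to estimate its spectrum using only Hessian-vector products.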
Hessian Computation in Deep Networks

- The eigenvalue distribution function of $H$ is defined as
  $$\phi(t) = \frac{1}{n} \sum_{i=1}^{n} \delta(t - \lambda_i).$$
- Let $f_\sigma(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$ be the Gaussian density.
- Convolving $\phi$ with this Gaussian kernel gives the smoothed density
  $$\phi_\sigma(t) = (\phi * f_\sigma)(t) = \frac{1}{n} \sum_{i=1}^{n} f_\sigma(t - \lambda_i).$$

[Figure: the raw spectrum as spikes $\delta(t - \lambda_i)$ (left) and the corresponding Gaussian-smoothed density (right), plotted over $t \in [-2, 2]$.]
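The smoothed density is straightforward to evaluate once the $\lambda_i$ are known. The sketch below (illustrative, not the slides' code) evaluates $\phi_\sigma$ on a grid for a hypothetical set of eigenvalues; the eigenvalues, grid, and $\sigma$ are assumptions made for the example.

```python
# A minimal sketch: evaluate the Gaussian-smoothed spectral density phi_sigma(t)
# on a grid, given a (hypothetical) list of eigenvalues lambda_i.
import numpy as np

def smoothed_density(eigvals, grid, sigma=0.1):
    # phi_sigma(t) = (1/n) * sum_i f_sigma(t - lambda_i), with a Gaussian kernel f_sigma.
    diffs = grid[:, None] - eigvals[None, :]                   # shape (T, n)
    kernel = np.exp(-diffs ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return kernel.mean(axis=1)

eigvals = np.array([-1.2, -0.4, 0.0, 0.0, 0.3, 1.5])           # hypothetical spectrum
grid = np.linspace(-2.0, 2.0, 401)
phi = smoothed_density(eigvals, grid)
print(float(phi.sum() * (grid[1] - grid[0])))                  # integrates to ~1
```

The hard part in a deep network is that the $\lambda_i$ are not available; the next slide's quadrature machinery estimates such spectral sums without ever computing them.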
Estimating the Smoothed Density

- The estimation approach goes back to Gene Golub and his students [Golub and Welsch (1969); Bai et al. (1996)].
- It constructs quadrature nodes and weights $(w_i, \ell_i)_{i=1}^{m}$ such that for all "nice" functions $g$,
  $$\frac{1}{n} \sum_{i=1}^{n} g(\lambda_i) \approx \sum_{i=1}^{m} w_i \, g(\ell_i).$$
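The sketch below shows one standard way to obtain such nodes and weights, in the spirit of the Golub-style quadrature the slide cites (not the authors' exact implementation): run Lanczos using only matrix-vector products, take the eigenvalues of the resulting tridiagonal matrix as nodes $\ell_i$ and the squared first components of its eigenvectors as weights $w_i$, and average over random probe vectors. The test matrix, the Lanczos depth $m$, and the number of probes are illustrative assumptions.

```python
# A minimal sketch of Lanczos quadrature: estimate (1/n) * sum_i g(lambda_i)
# using only matrix-vector products with A (a stand-in for Hessian-vector products).
import numpy as np

def lanczos(matvec, dim, m, rng):
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    V = np.zeros((dim, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    V[:, 0] = v
    for j in range(m):
        w = matvec(V[:, j])
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)    # full reorthogonalization
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    return alpha, beta

def quadrature_estimate(matvec, dim, g, m=30, num_probes=10, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(num_probes):
        alpha, beta = lanczos(matvec, dim, m, rng)
        T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        nodes, evecs = np.linalg.eigh(T)            # quadrature nodes ell_i
        weights = evecs[0, :] ** 2                  # quadrature weights w_i
        estimates.append(np.sum(weights * g(nodes)))
    return np.mean(estimates)                       # approximates (1/n) * sum_i g(lambda_i)

# Example: compare against the exact spectral average on a small symmetric test matrix.
n = 200
B = np.random.default_rng(1).standard_normal((n, n))
A = (B + B.T) / np.sqrt(2 * n)
g = lambda x: np.exp(-x ** 2)                       # any smooth g, e.g. a Gaussian kernel
print(quadrature_estimate(lambda v: A @ v, n, g), np.mean(g(np.linalg.eigvalsh(A))))
```

Taking $g = f_\sigma(t - \cdot)$ for each grid point $t$ turns this estimator directly into an estimate of the smoothed density $\phi_\sigma(t)$ from the previous slide.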