Average-Case Acceleration Through Spectral Density Estimation and Universal Asymptotic Optimality of Polyak Momentum

Fabian Pedregosa (Google Brain, Montreal) · Damien Scieur (SAIT AI Lab, Montreal)
Motivation

Consider the convex, quadratic optimization problem

min_{x∈R^d} f(x) = ½ (x − x⋆)^T H (x − x⋆) + f⋆.

Efficient methods:
• Conjugate gradients ("most optimal")
• Chebyshev (1st kind) acceleration (worst-case optimal)
• Polyak heavy-ball method (asymptotically worst-case optimal)
Polyak Heavy-Ball

Polyak momentum algorithm, for ℓI ⪯ H ⪯ LI:

x_{t+1} = x_t − h ∇f(x_t) + m (x_t − x_{t−1}),

where

h = 4 / (√L + √ℓ)²,   m = ((√L − √ℓ) / (√L + √ℓ))².

• Requires the knowledge of ℓ, L.
• Easy to use (widely used in deep learning).
• Works well for non-quadratic problems (deterministic or stochastic).
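As a quick illustration, here is a minimal NumPy sketch of this iteration on a synthetic quadratic. Only the step-size and momentum formulas come from the slide; the problem instance (H, x⋆), the dimension and the iteration count are made-up choices for the example.

```python
# Minimal sketch of Polyak heavy-ball on a synthetic quadratic
# f(x) = 1/2 (x - x_star)^T H (x - x_star). H, x_star and the number of
# iterations are illustrative, not taken from the talk.
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))
H = A @ A.T / d + 0.1 * np.eye(d)          # positive definite Hessian
x_star = rng.standard_normal(d)

ell, L = np.linalg.eigvalsh(H)[[0, -1]]    # smallest / largest eigenvalue
h = 4.0 / (np.sqrt(L) + np.sqrt(ell)) ** 2                            # step size
m = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2  # momentum

grad = lambda x: H @ (x - x_star)

x_prev = x = np.zeros(d)
for t in range(200):
    # x_{t+1} = x_t - h * grad(x_t) + m * (x_t - x_{t-1})
    x, x_prev = x - h * grad(x) + m * (x - x_prev), x
print("final error:", np.linalg.norm(x - x_star))
```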
Deep learning and large-scale problems

In deep learning, we solve

min_{x∈R^d} Σ_{i=1}^N f_i(x)

for huge d. Consequences:
• The minimum eigenvalue ℓ is extremely hard to estimate.
• The objective behaves like a quadratic when using gradient descent.
• Nice statistical properties, like an expected spectral density.
Figure 1: Empirical vs expected spectral density of ∇²f(x).
In this talk

We study the average-case convergence on quadratic problems.
• Standard optimization methods only use ℓ and L. What if we use the expected density function?
• How do we build average-case optimal methods for given spectral densities? Can we get rid of ℓ?
• What is the asymptotic behavior of these optimal methods?
Part 1: Spectral density, optimal methods and orthogonal polynomials
Setting

Consider a class of convex, quadratic optimization problems:

min_{x∈R^d} ½ (x − x⋆)^T H (x − x⋆) + f⋆.

For simplicity, assume that H is sampled randomly from some unknown distribution.

We define the expected spectral density µ of H by

P[λ_i(H) ∈ [a, b]] = ∫_a^b dµ,   for random i, H.

Remark: we are not interested in knowing the distribution over H!
Beyond the condition number: spectral density

We know the distribution of the eigenvalues of H.

[Figure: spectral density — eigenvalues are likely where the density is large, and unlikely where it is near zero.]
First-order methods and polynomials

We will use first-order methods to solve the quadratic problem:

x_t ∈ x_0 + span{∇f(x_0), …, ∇f(x_{t−1})}.

Main property. The error at iteration t is a residual polynomial of degree t in H:

x_t − x⋆ = P_t(H)(x_0 − x⋆),   with P_t(0) = 1.

Example: gradient descent.

x_t − x⋆ = x_{t−1} − x⋆ − h H (x_{t−1} − x⋆) = (I − hH)^t (x_0 − x⋆) = P_t^Grad(H)(x_0 − x⋆),

with P_t^Grad(λ) = (1 − hλ)^t.
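A quick numerical sanity check of this identity for gradient descent; the quadratic instance, step size and iteration count below are made-up choices.

```python
# Check that after t gradient-descent steps, x_t - x_star = (I - h H)^t (x_0 - x_star).
import numpy as np

rng = np.random.default_rng(1)
d, t, h = 30, 10, 0.05
A = rng.standard_normal((d, d))
H = A @ A.T / d                      # illustrative positive semidefinite Hessian
x_star = rng.standard_normal(d)
x0 = rng.standard_normal(d)

# run gradient descent
x = x0.copy()
for _ in range(t):
    x = x - h * H @ (x - x_star)

# predicted error via the residual polynomial P_t(lambda) = (1 - h*lambda)^t
predicted = np.linalg.matrix_power(np.eye(d) - h * H, t) @ (x0 - x_star)
print(np.allclose(x - x_star, predicted))   # True
```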
From algorithms to polynomials

All first-order methods are polynomials*, and all polynomials* are first-order methods!

(*if P_t(0) = 1, i.e., P_t is a residual polynomial)
Comparison of Polynomials

Visualizing the polynomials for gradient descent and momentum.
• Gradient descent: x_t = x_{t−1} − h ∇f(x_{t−1}).
• Optimal momentum: x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).

The worst-case rate of convergence is given by the largest value of the polynomial:

‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² max_{λ∈[ℓ,L]} P_t²(λ).
[Figure: the residual polynomial p_t on [λ_min, λ_max], shown for t = 4, 6, and 12. Gradient descent: p_t(z) = (1 − 2z/(λ_min + λ_max))^t; momentum: Chebyshev polynomials.]
The worst-case rate of convergence is given by the largest value:

‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² max_{λ∈[ℓ,L]} P_t²(λ).

What about the average case?

Proposition
If the eigenvalues λ_i of H are distributed according to µ, then

E_H[‖x_t − x⋆‖²] ≤ ‖x_0 − x⋆‖² ∫_R P_t² dµ.

Note: the expectation is taken over the inputs. Contrary to the worst case, the average case is aware of the whole spectrum of the matrix H!
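A small Monte Carlo sketch of this proposition for gradient descent: sample H with eigenvalues drawn from a chosen µ, average the error over many draws, and compare with ‖x_0 − x⋆‖² times the empirical mean of P_t²(λ). The spectral law (uniform on [0.1, 2]), dimensions, step size and trial count are illustrative assumptions, not taken from the talk.

```python
# Average ||x_t - x_star||^2 over random H versus ||x_0 - x_star||^2 * mean of P_t(lambda)^2.
import numpy as np

rng = np.random.default_rng(2)
d, t, h, n_trials = 40, 15, 0.5, 300
v = rng.standard_normal(d)                    # x_0 - x_star, kept fixed

errors, poly_vals = [], []
for _ in range(n_trials):
    lam = rng.uniform(0.1, 2.0, size=d)       # eigenvalues drawn from mu
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random eigenvectors
    H = Q @ np.diag(lam) @ Q.T
    err = v.copy()
    for _ in range(t):
        err = err - h * H @ err               # error recursion of gradient descent
    errors.append(np.linalg.norm(err) ** 2)
    poly_vals.append(np.mean((1 - h * lam) ** (2 * t)))   # mean of P_t(lambda)^2

print(np.mean(errors))                                    # close to the value below
print(np.linalg.norm(v) ** 2 * np.mean(poly_vals))
```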
Optimal Worst Case vs Optimal Average Case

The optimal worst-case method solves

min_{P: P(0)=1} max_{λ∈[ℓ,L]} P²(λ).

The unique solution is given by the Chebyshev polynomials of the first kind, depending only on ℓ, L.

The optimal average-case method solves

min_{P: P(0)=1} ∫_R P² dµ.

The solution depends on µ, and uses the concept of orthogonal residual polynomials.
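To make the worst-case side concrete, the sketch below compares max_{λ∈[ℓ,L]} P_t²(λ) for the gradient-descent polynomial and for the rescaled Chebyshev polynomial T_t((L + ℓ − 2λ)/(L − ℓ)) / T_t((L + ℓ)/(L − ℓ)), which is the standard form of the worst-case optimal residual polynomial; the values of ℓ, L and t are illustrative.

```python
# Worst-case quantity max over [ell, L] of P_t(lambda)^2: gradient descent vs Chebyshev.
import numpy as np
from scipy.special import eval_chebyt

ell, L, t = 0.1, 2.0, 10
lam = np.linspace(ell, L, 2000)

p_grad = (1 - 2 * lam / (ell + L)) ** t                   # gradient descent polynomial
p_cheb = (eval_chebyt(t, (L + ell - 2 * lam) / (L - ell))
          / eval_chebyt(t, (L + ell) / (L - ell)))        # rescaled Chebyshev polynomial

print("GD worst case       :", np.max(p_grad ** 2))
print("Chebyshev worst case:", np.max(p_cheb ** 2))       # much smaller
```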
Optimal Polynomial

Proposition (e.g., Bernd Fischer)
If {P_i} is a sequence of residual orthogonal polynomials w.r.t. λµ(λ), i.e.,

∫_R P_i(λ) P_j(λ) d[λµ(λ)]  = 0 if i ≠ j,  > 0 otherwise,

then P_t solves

P_t ∈ argmin_{P: P(0)=1, deg(P)=t} ∫_R P² dµ.
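As a concrete example of this proposition (anticipating the exponential density used later in the talk, and using a known fact rather than anything from the slides): for µ(λ) = e^{−λ}, the weight λµ(λ) = λe^{−λ} is exactly the weight of the generalized Laguerre polynomials L_n^{(1)}, so P_n(λ) = L_n^{(1)}(λ)/(n + 1) is a sequence of residual orthogonal polynomials. A quick numerical check:

```python
# Check that P_n = L_n^{(1)} / (n+1) satisfies P_n(0) = 1 and is orthogonal
# with respect to the weight lambda * exp(-lambda) on [0, inf).
import numpy as np
from scipy.special import genlaguerre
from scipy.integrate import quad

def P(n):
    poly = genlaguerre(n, 1)          # generalized Laguerre polynomial L_n^{(1)}
    return lambda lam: poly(lam) / (n + 1)

def inner(i, j):
    f = lambda lam: P(i)(lam) * P(j)(lam) * lam * np.exp(-lam)
    val, _ = quad(f, 0, np.inf)
    return val

print(P(3)(0.0))                      # 1.0  (residual polynomial)
print(inner(2, 3))                    # ~0   (orthogonality for i != j)
print(inner(3, 3))                    # > 0
```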
Polynomials to algorithms

The optimal polynomials come from an orthogonal basis, and satisfy a recurrence involving the two previous polynomials!

Proposition
Let {P_1, P_2, …} be residual orthogonal polynomials. Then, for some m_i and h_i (functions of λµ(λ)),

P_i(λ) = P_{i−1}(λ) − h_i λ P_{i−1}(λ) + m_i (P_{i−1}(λ) − P_{i−2}(λ)).

The optimal average-case algorithm reads

x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
The recipe to create your own optimal method!

1. Find the distribution µ of the eigenvalues of H, or of ∇²f(x).
2. Find a sequence of orthogonal polynomials P_t w.r.t. λµ(λ). (This gives you m_t and h_t.)
3. Iterate over t:

x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
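A minimal Python sketch of step 3, assuming the coefficient sequences h_t and m_t have already been derived in step 2. The helper name, the constant coefficients in the usage example and the quadratic instance are illustrative, not from the talk.

```python
# Generic momentum-style recursion driven by user-supplied coefficient sequences.
import numpy as np

def average_case_method(grad, x0, steps_h, steps_m):
    """Run x_t = x_{t-1} - h_t * grad(x_{t-1}) + m_t * (x_{t-1} - x_{t-2})."""
    x_prev = x = x0
    for h_t, m_t in zip(steps_h, steps_m):
        x, x_prev = x - h_t * grad(x) + m_t * (x - x_prev), x
    return x

# Tiny usage example with constant (heavy-ball-like) coefficients.
rng = np.random.default_rng(3)
d = 20
A = rng.standard_normal((d, d))
H = A @ A.T / d + 0.1 * np.eye(d)
x_star = rng.standard_normal(d)
grad = lambda x: H @ (x - x_star)

T = 100
x_T = average_case_method(grad, np.zeros(d), [0.3] * T, [0.5] * T)
print(np.linalg.norm(x_T - x_star))
```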
Part 2: Spectral density estimation
Examples of spectral densities

In the paper we study 3 different cases:
• Uniform distribution on [ℓ, L].
• Exponential decay, µ(λ) = e^{−λ}: a midpoint between convex quadratic and convex non-smooth optimization.
• Marchenko-Pastur distribution: the typical expected spectral distribution of ∇²f(x⋆) for deep neural networks.
Exponential decay

Assume the spectral density of H is µ(λ) = e^{−λλ_0}.

Optimal algorithm:

x_t = x_{t−1} − (λ_0/(t+1)) ∇f(x_{t−1}) + ((t−1)/(t+1)) (x_{t−1} − x_{t−2}).

Very close to stochastic averaged gradient for quadratics [Flammarion and Bach, 2015].

Rate of convergence:

E_H ‖x_t − x⋆‖² = (1/(λ_0(t+1))) ‖x_0 − x⋆‖².
Marchenko-Pastur distribution

We now study the Marchenko-Pastur distribution:

µ(λ) = (1 − r)_+ δ_0(λ) + (√((L − λ)(λ − ℓ)) / (2πσ²λ)) 1_{λ∈[ℓ,L]},

with ℓ := σ²(1 − √r)², L := σ²(1 + √r)².

• σ² is the variance.
• σ²r is the mean.
• Presence of zeros if r < 1!

Motivation: for a certain class of nonlinear activation functions, the spectrum of the Hessian of a neural network follows the MP distribution [Pennington et al., 2018].
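A small empirical sketch of where this law shows up, assuming one standard random-matrix construction consistent with the parameters above: H = XᵀX/d with X an n×d Gaussian matrix and r = n/d (all sizes below are illustrative). The eigenvalue spectrum then has a mass of roughly (1 − r) at zero and a bulk between the predicted edges ℓ and L.

```python
# Empirical check of the Marchenko-Pastur edges and the atom at zero.
import numpy as np

rng = np.random.default_rng(5)
d, r, sigma = 1000, 0.5, 1.0
n = int(r * d)
X = sigma * rng.standard_normal((n, d))
H = X.T @ X / d                            # d x d matrix of rank n

eigs = np.linalg.eigvalsh(H)
ell = sigma**2 * (1 - np.sqrt(r)) ** 2
L = sigma**2 * (1 + np.sqrt(r)) ** 2

print("fraction of (near-)zero eigenvalues:", np.mean(eigs < 1e-10), "~", 1 - r)
print("bulk edges observed :", eigs[eigs > 1e-10].min(), eigs.max())
print("bulk edges predicted:", ell, L)
```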
Figure 2: MP distribution for different values of r .