Average-Case Acceleration Through Spectral Density Estimation and Universal Asymptotic Optimality of Polyak Momentum

Fabian Pedregosa (Google Brain, Montreal) · Damien Scieur (SAIT AI Lab, Montreal)
Motivation

Consider the convex, quadratic optimization problem

min_{x∈R^d} f(x) = ½ (x − x⋆)^T H (x − x⋆) + f⋆.

Efficient methods:
• Conjugate gradients ("most optimal")
• Chebyshev (1st kind) acceleration (worst-case optimal)
• Polyak heavy-ball method (asymptotically worst-case optimal)
Polyak Heavy-Ball

Polyak momentum algorithm, for ℓI ⪯ H ⪯ LI:

x_{t+1} = x_t − h ∇f(x_t) + m (x_t − x_{t−1}),

where

h = 4 / (√L + √ℓ)²,   m = ((√L − √ℓ) / (√L + √ℓ))².

• Requires the knowledge of ℓ, L.
• Easy to use (widely used in deep learning).
• Works well for non-quadratic problems (deterministic or stochastic).
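As a quick illustration, here is a minimal NumPy sketch of this iteration on a synthetic quadratic. Only the step-size and momentum formulas come from the slide; the problem instance (H, x⋆), the dimension and the iteration count are made-up choices for the example.

```python
# Minimal sketch of Polyak heavy-ball on a synthetic quadratic
# f(x) = 1/2 (x - x_star)^T H (x - x_star). H, x_star and the number of
# iterations are illustrative, not taken from the talk.
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))
H = A @ A.T / d + 0.1 * np.eye(d)          # positive definite Hessian
x_star = rng.standard_normal(d)

ell, L = np.linalg.eigvalsh(H)[[0, -1]]    # smallest / largest eigenvalue
h = 4.0 / (np.sqrt(L) + np.sqrt(ell)) ** 2                            # step size
m = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2  # momentum

grad = lambda x: H @ (x - x_star)

x_prev = x = np.zeros(d)
for t in range(200):
    # x_{t+1} = x_t - h * grad(x_t) + m * (x_t - x_{t-1})
    x, x_prev = x - h * grad(x) + m * (x - x_prev), x
print("final error:", np.linalg.norm(x - x_star))
```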
Deep learning and large-scale problems

In deep learning, we solve

min_{x∈R^d} Σ_{i=1}^N f_i(x)

for huge d. Consequences:
• The minimum eigenvalue ℓ is extremely hard to estimate.
• The objective behaves like a quadratic when using gradient descent.
• Nice statistical properties, like an expected spectral density.
Figure 1: Empirical vs expected spectral density of ∇²f(x).
In this talk

We study the average-case convergence on quadratic problems.
• Standard optimization methods only use ℓ and L. What if we use the expected density function?
• How do we build average-case optimal methods for given spectral densities? Can we get rid of ℓ?
• What is the asymptotic behavior of these optimal methods?
Part 1: Spectral density, optimal methods and orthogonal polynomials
Setting

Consider a class of convex, quadratic optimization problems:

min_{x∈R^d} ½ (x − x⋆)^T H (x − x⋆) + f⋆.

For simplicity, assume that H is sampled randomly from some unknown distribution.

We define the expected spectral density µ of H by

P[λ_i(H) ∈ [a, b]] = ∫_a^b dµ,   for random i, H.

Remark: we are not interested in knowing the distribution over H!
Beyond the condition number: spectral density

We know the distribution of the eigenvalues of H.

[Figure: spectral density — eigenvalues are likely where the density is large, and unlikely where it is near zero.]
First-order methods and polynomials

We will use first-order methods to solve the quadratic problem:

x_t ∈ x_0 + span{∇f(x_0), …, ∇f(x_{t−1})}.

Main property. The error at iteration t is a residual polynomial of degree t in H:

x_t − x⋆ = P_t(H)(x_0 − x⋆),   with P_t(0) = 1.

Example: gradient descent.

x_t − x⋆ = x_{t−1} − x⋆ − h H (x_{t−1} − x⋆) = (I − hH)^t (x_0 − x⋆) = P_t^Grad(H)(x_0 − x⋆),

with P_t^Grad(λ) = (1 − hλ)^t.
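A quick numerical sanity check of this identity for gradient descent; the quadratic instance, step size and iteration count below are made-up choices.

```python
# Check that after t gradient-descent steps, x_t - x_star = (I - h H)^t (x_0 - x_star).
import numpy as np

rng = np.random.default_rng(1)
d, t, h = 30, 10, 0.05
A = rng.standard_normal((d, d))
H = A @ A.T / d                      # illustrative positive semidefinite Hessian
x_star = rng.standard_normal(d)
x0 = rng.standard_normal(d)

# run gradient descent
x = x0.copy()
for _ in range(t):
    x = x - h * H @ (x - x_star)

# predicted error via the residual polynomial P_t(lambda) = (1 - h*lambda)^t
predicted = np.linalg.matrix_power(np.eye(d) - h * H, t) @ (x0 - x_star)
print(np.allclose(x - x_star, predicted))   # True
```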
From algorithms to polynomials

All first-order methods are polynomials*, and all polynomials* are first-order methods!

(*if P_t(0) = 1, i.e., P_t is a residual polynomial)
Comparison of Polynomials

Visualizing the polynomials for gradient descent and momentum.
• Gradient descent: x_t = x_{t−1} − h ∇f(x_{t−1}).
• Optimal momentum: x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).

The worst-case rate of convergence is given by the largest value of the polynomial:

‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² max_{λ∈[ℓ,L]} P_t²(λ).
[Figure: the residual polynomial p_t on [λ_min, λ_max], shown for t = 4, 6, and 12. Gradient descent: p_t(z) = (1 − 2z/(λ_min + λ_max))^t; momentum: Chebyshev polynomials.]
The worst-case rate of convergence is given by the largest value:

‖x_t − x⋆‖² ≤ ‖x_0 − x⋆‖² max_{λ∈[ℓ,L]} P_t²(λ).

What about the average case?

Proposition
If the eigenvalues λ_i of H are distributed according to µ, then

E_H[‖x_t − x⋆‖²] ≤ ‖x_0 − x⋆‖² ∫_R P_t² dµ.

Note: the expectation is taken over the inputs. Contrary to the worst case, the average case is aware of the whole spectrum of the matrix H!
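A small Monte Carlo sketch of this proposition for gradient descent: sample H with eigenvalues drawn from a chosen µ, average the error over many draws, and compare with ‖x_0 − x⋆‖² times the empirical mean of P_t²(λ). The spectral law (uniform on [0.1, 2]), dimensions, step size and trial count are illustrative assumptions, not taken from the talk.

```python
# Average ||x_t - x_star||^2 over random H versus ||x_0 - x_star||^2 * mean of P_t(lambda)^2.
import numpy as np

rng = np.random.default_rng(2)
d, t, h, n_trials = 40, 15, 0.5, 300
v = rng.standard_normal(d)                    # x_0 - x_star, kept fixed

errors, poly_vals = [], []
for _ in range(n_trials):
    lam = rng.uniform(0.1, 2.0, size=d)       # eigenvalues drawn from mu
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random eigenvectors
    H = Q @ np.diag(lam) @ Q.T
    err = v.copy()
    for _ in range(t):
        err = err - h * H @ err               # error recursion of gradient descent
    errors.append(np.linalg.norm(err) ** 2)
    poly_vals.append(np.mean((1 - h * lam) ** (2 * t)))   # mean of P_t(lambda)^2

print(np.mean(errors))                                    # close to the value below
print(np.linalg.norm(v) ** 2 * np.mean(poly_vals))
```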
Optimal Worst Case vs Optimal Average Case

The optimal worst-case method solves

min_{P: P(0)=1} max_{λ∈[ℓ,L]} P²(λ).

The unique solution is given by the Chebyshev polynomials of the first kind, depending only on ℓ, L.

The optimal average-case method solves

min_{P: P(0)=1} ∫_R P² dµ.

The solution depends on µ, and uses the concept of orthogonal residual polynomials.
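To make the worst-case side concrete, the sketch below compares max_{λ∈[ℓ,L]} P_t²(λ) for the gradient-descent polynomial and for the rescaled Chebyshev polynomial T_t((L + ℓ − 2λ)/(L − ℓ)) / T_t((L + ℓ)/(L − ℓ)), which is the standard form of the worst-case optimal residual polynomial; the values of ℓ, L and t are illustrative.

```python
# Worst-case quantity max over [ell, L] of P_t(lambda)^2: gradient descent vs Chebyshev.
import numpy as np
from scipy.special import eval_chebyt

ell, L, t = 0.1, 2.0, 10
lam = np.linspace(ell, L, 2000)

p_grad = (1 - 2 * lam / (ell + L)) ** t                   # gradient descent polynomial
p_cheb = (eval_chebyt(t, (L + ell - 2 * lam) / (L - ell))
          / eval_chebyt(t, (L + ell) / (L - ell)))        # rescaled Chebyshev polynomial

print("GD worst case       :", np.max(p_grad ** 2))
print("Chebyshev worst case:", np.max(p_cheb ** 2))       # much smaller
```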
Optimal Polynomial

Proposition (e.g., Bernd Fischer)
If {P_i} is a sequence of residual orthogonal polynomials w.r.t. λµ(λ), i.e.,

∫_R P_i(λ) P_j(λ) d[λµ(λ)]  = 0 if i ≠ j,  > 0 otherwise,

then P_t solves

P_t ∈ argmin_{P: P(0)=1, deg(P)=t} ∫_R P² dµ.
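As a concrete example of this proposition (anticipating the exponential density used later in the talk, and using a known fact rather than anything from the slides): for µ(λ) = e^{−λ}, the weight λµ(λ) = λe^{−λ} is exactly the weight of the generalized Laguerre polynomials L_n^{(1)}, so P_n(λ) = L_n^{(1)}(λ)/(n + 1) is a sequence of residual orthogonal polynomials. A quick numerical check:

```python
# Check that P_n = L_n^{(1)} / (n+1) satisfies P_n(0) = 1 and is orthogonal
# with respect to the weight lambda * exp(-lambda) on [0, inf).
import numpy as np
from scipy.special import genlaguerre
from scipy.integrate import quad

def P(n):
    poly = genlaguerre(n, 1)          # generalized Laguerre polynomial L_n^{(1)}
    return lambda lam: poly(lam) / (n + 1)

def inner(i, j):
    f = lambda lam: P(i)(lam) * P(j)(lam) * lam * np.exp(-lam)
    val, _ = quad(f, 0, np.inf)
    return val

print(P(3)(0.0))                      # 1.0  (residual polynomial)
print(inner(2, 3))                    # ~0   (orthogonality for i != j)
print(inner(3, 3))                    # > 0
```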
Polynomials to algorithms

The optimal polynomials come from an orthogonal basis, and satisfy a recurrence involving the two previous polynomials!

Proposition
Let {P_1, P_2, …} be residual orthogonal polynomials. Then, for some m_i and h_i (functions of λµ(λ)),

P_i(λ) = P_{i−1}(λ) − h_i λ P_{i−1}(λ) + m_i (P_{i−1}(λ) − P_{i−2}(λ)).

The optimal average-case algorithm reads

x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
The recipe to create your own optimal method!

1. Find the distribution µ of the eigenvalues of H, or of ∇²f(x).
2. Find a sequence of orthogonal polynomials P_t w.r.t. λµ(λ). (This gives you m_t and h_t.)
3. Iterate over t:

x_t = x_{t−1} − h_t ∇f(x_{t−1}) + m_t (x_{t−1} − x_{t−2}).
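A minimal Python sketch of step 3, assuming the coefficient sequences h_t and m_t have already been derived in step 2. The helper name, the constant coefficients in the usage example and the quadratic instance are illustrative, not from the talk.

```python
# Generic momentum-style recursion driven by user-supplied coefficient sequences.
import numpy as np

def average_case_method(grad, x0, steps_h, steps_m):
    """Run x_t = x_{t-1} - h_t * grad(x_{t-1}) + m_t * (x_{t-1} - x_{t-2})."""
    x_prev = x = x0
    for h_t, m_t in zip(steps_h, steps_m):
        x, x_prev = x - h_t * grad(x) + m_t * (x - x_prev), x
    return x

# Tiny usage example with constant (heavy-ball-like) coefficients.
rng = np.random.default_rng(3)
d = 20
A = rng.standard_normal((d, d))
H = A @ A.T / d + 0.1 * np.eye(d)
x_star = rng.standard_normal(d)
grad = lambda x: H @ (x - x_star)

T = 100
x_T = average_case_method(grad, np.zeros(d), [0.3] * T, [0.5] * T)
print(np.linalg.norm(x_T - x_star))
```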
Part 2: Spectral density estimation
Examples of spectral densities

In the paper we study 3 different cases:
• Uniform distribution on [ℓ, L].
• Exponential decay, µ(λ) = e^{−λ}: a midpoint between convex quadratic and convex non-smooth optimization.
• Marchenko-Pastur distribution: the typical expected spectral distribution of ∇²f(x⋆) for deep neural networks.
Exponential decay

Assume the spectral density of H is µ(λ) = e^{−λλ_0}.

Optimal algorithm:

x_t = x_{t−1} − (λ_0/(t+1)) ∇f(x_{t−1}) + ((t−1)/(t+1)) (x_{t−1} − x_{t−2}).

Very close to stochastic averaged gradient for quadratics [Flammarion and Bach, 2015].

Rate of convergence:

E_H ‖x_t − x⋆‖² = (1/(λ_0(t+1))) ‖x_0 − x⋆‖².
Marchenko-Pastur distribution

We now study the Marchenko-Pastur distribution:

µ(λ) = (1 − r)_+ δ_0(λ) + (√((L − λ)(λ − ℓ)) / (2πσ²λ)) 1_{λ∈[ℓ,L]},

with ℓ := σ²(1 − √r)², L := σ²(1 + √r)².

• σ² is the variance.
• σ²r is the mean.
• Presence of zeros if r < 1!

Motivation: for a certain class of nonlinear activation functions, the spectrum of the Hessian of a neural network follows the MP distribution [Pennington et al., 2018].
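A small empirical sketch of where this law shows up, assuming one standard random-matrix construction consistent with the parameters above: H = XᵀX/d with X an n×d Gaussian matrix and r = n/d (all sizes below are illustrative). The eigenvalue spectrum then has a mass of roughly (1 − r) at zero and a bulk between the predicted edges ℓ and L.

```python
# Empirical check of the Marchenko-Pastur edges and the atom at zero.
import numpy as np

rng = np.random.default_rng(5)
d, r, sigma = 1000, 0.5, 1.0
n = int(r * d)
X = sigma * rng.standard_normal((n, d))
H = X.T @ X / d                            # d x d matrix of rank n

eigs = np.linalg.eigvalsh(H)
ell = sigma**2 * (1 - np.sqrt(r)) ** 2
L = sigma**2 * (1 + np.sqrt(r)) ** 2

print("fraction of (near-)zero eigenvalues:", np.mean(eigs < 1e-10), "~", 1 - r)
print("bulk edges observed :", eigs[eigs > 1e-10].min(), eigs.max())
print("bulk edges predicted:", ell, L)
```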
Figure 2: MP distribution for different values of r .