

1. The Implicit Regularization of Stochastic Gradient Flow for Least Squares. Alnur Ali (Stanford University), Edgar Dobriban (University of Pennsylvania), and Ryan J. Tibshirani (Carnegie Mellon University)

2. Outline: Overview, Continuous-time viewpoint, Risk bounds, Numerical examples, Conclusion

3. Introduction
◮ Given the sizes of modern data sets, stochastic gradient descent is one of the most widely used optimization algorithms today
– Its computational and statistical properties have been studied for decades (Robbins & Monro, 1951; Fabian, 1968; Ruppert, 1988; Kushner & Yin, 2003; Polyak & Juditsky, 1992; ...)
◮ Recently, there has been a lot of interest in implicit regularization
◮ In particular, a line of work shows that (early-stopped) gradient descent is linked to ℓ2 regularization
◮ This connection is interesting, but also computationally convenient

6. Introduction
◮ Natural to ask: do the iterates generated by (mini-batch) stochastic gradient descent also possess (implicit) ℓ2 regularity?
◮ Why might there be a connection at all?
– Compare the solution paths for least squares regression
[Figure: coefficient paths for ridge regression (plotted against 1/λ) and for stochastic gradient descent (plotted against iteration k); the two sets of paths look strikingly similar]
◮ In this paper, we focus on least squares regression
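The side-by-side comparison above is easy to reproduce. Below is a minimal sketch (not the paper's code) on synthetic Gaussian data, with hypothetical choices of λ grid, step size, and mini-batch size; both paths start at zero and end near the least squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
y = X @ beta0 + 0.5 * rng.standard_normal(n)

# Ridge solutions along a grid of tuning parameters lambda (large to small)
def ridge(lam):
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

ridge_path = np.array([ridge(lam) for lam in np.geomspace(10.0, 0.01, 50)])

# Mini-batch SGD on the least squares objective, with a constant step size
eps, m, K = 0.01, 10, 1000
beta = np.zeros(p)
sgd_path = []
for k in range(K):
    I = rng.choice(n, size=m, replace=False)
    beta = beta + eps * X[I].T @ (y[I] - X[I] @ beta) / m
    sgd_path.append(beta.copy())
sgd_path = np.array(sgd_path)

# Both endpoints are close to the least squares solution
ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(ridge_path[-1] - ls), np.linalg.norm(sgd_path[-1] - ls))
```

Plotting each column of `ridge_path` against 1/λ and each column of `sgd_path` against k reproduces the qualitative picture: coordinates leave zero and settle at the least squares coefficients along visually similar trajectories.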

8. Introduction
◮ Main tool for making the connection: a stochastic differential equation that we call stochastic gradient flow
– Linked to SGD with a constant step size; more on this later
◮ We give a bound on the excess risk of stochastic gradient flow at time t, over ridge regression with tuning parameter λ = 1/t
– The results hold across the entire optimization path
– The results do not place strong conditions on the features
– The proofs are simpler than in discrete time
◮ Roughly speaking, the bound decomposes into three parts
– The variance of ridge regression, scaled by a constant less than 1
– The "price of stochasticity": a term that is non-negative, but vanishes as time grows
– A term tied to the limiting optimization error: this term is zero in the overparametrized regime, but positive otherwise

10. Outline: Overview, Continuous-time viewpoint, Risk bounds, Numerical examples, Conclusion

11. Stochastic gradient flow
◮ We consider the stochastic differential equation
  dβ(t) = (1/n) X^T (y − Xβ(t)) dt + Q_ε(β(t))^{1/2} dW(t),   (1)
where the drift is just the gradient for least squares regression, and the fluctuations are governed by the covariance of the stochastic gradients. Here β(0) = 0,
  Q_ε(β) = ε · Cov_I( (1/m) X_I^T (y_I − X_I β) )
is the diffusion coefficient, I ⊆ {1, ..., n} is a mini-batch of size m, and ε > 0 is a (fixed) step size
◮ We call (1) stochastic gradient flow
– It has a few nice properties, and bears several connections to SGD with a constant step size; more on this next
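A discretized simulation of (1) can be sketched as follows. This is an illustration, not the paper's code: since computing Q_ε(β) exactly over all mini-batches is expensive, the sketch estimates it by Monte Carlo over sampled mini-batches, and all data and constants are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 4
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.3 * rng.standard_normal(n)

eps, m = 0.05, 8      # step size and mini-batch size in the definition of Q_eps
dt, T = 0.01, 10.0    # Euler-Maruyama discretization of the SDE (1)

def sqrt_Q(beta, draws=100):
    # Monte Carlo estimate of Q_eps(beta), the eps-scaled covariance of the
    # mini-batch gradients, followed by a symmetric matrix square root
    grads = np.empty((draws, p))
    for j in range(draws):
        I = rng.choice(n, size=m, replace=False)
        grads[j] = X[I].T @ (y[I] - X[I] @ beta) / m
    w, V = np.linalg.eigh(eps * np.cov(grads, rowvar=False))
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

beta = np.zeros(p)
for _ in range(int(T / dt)):
    drift = X.T @ (y - X @ beta) / n                 # least squares gradient
    beta = beta + drift * dt + np.sqrt(dt) * sqrt_Q(beta) @ rng.standard_normal(p)

# For moderate eps the process hovers near the least squares solution
print(np.linalg.norm(beta - np.linalg.lstsq(X, y, rcond=None)[0]))
```

The diffusion term shrinks as the residuals shrink, which is the key difference from a constant-covariance Langevin-type process.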

12. Stochastic gradient flow
◮ Lemma: the Euler discretization of stochastic gradient flow, β̃^(k), and constant step size SGD, β^(k), share first and second moments, i.e.,
  E(β̃^(k)) = E(β^(k))  and  Cov(β̃^(k)) = Cov(β^(k))
– This implies the prediction errors match
– It also implies that any deviation between the first two moments of stochastic gradient flow and SGD must be due to discretization error
◮ Sanity check: revisiting the solution/optimization paths from earlier
[Figure: coefficient paths for ridge regression (against 1/λ), stochastic gradient descent (against iteration k), and stochastic gradient flow (against time t); all three look alike]
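A one-step Monte Carlo sanity check of the lemma can be sketched as follows (this is an illustration on synthetic data, not the paper's proof; the Euler step size is taken equal to ε): the empirical mean and covariance of a single SGD update should match those of a single Euler step of stochastic gradient flow.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.2 * rng.standard_normal(n)
eps, m, reps = 0.1, 6, 20000

# Mini-batch gradients at a fixed iterate (here beta = 0)
beta = np.zeros(p)
g = np.array([X[I].T @ (y[I] - X[I] @ beta) / m
              for I in (rng.choice(n, m, replace=False) for _ in range(reps))])
full_grad = X.T @ (y - X @ beta) / n

# One SGD step: eps times a random mini-batch gradient
sgd_steps = eps * g

# One Euler step of SGF with dt = eps: eps*full_grad + sqrt(eps)*Q_eps^(1/2)*xi
Q = eps * np.cov(g, rowvar=False)
w, V = np.linalg.eigh(Q)
sqrtQ = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
euler_steps = eps * full_grad + np.sqrt(eps) * rng.standard_normal((reps, p)) @ sqrtQ

print(np.abs(sgd_steps.mean(0) - euler_steps.mean(0)).max())
print(np.abs(np.cov(sgd_steps, rowvar=False) - np.cov(euler_steps, rowvar=False)).max())
```

Both printed discrepancies are at Monte Carlo noise level: the SGD update has mean ε·(full gradient) and covariance ε²·Cov(mini-batch gradient), and the Euler step of (1) reproduces exactly these two moments.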

15. Stochastic gradient flow
◮ A number of works consider instead the constant covariance process
  dβ(t) = (1/n) X^T (y − Xβ(t)) dt + ((ε/m) · Σ̂)^{1/2} dW(t),   (2)
where Σ̂ = X^T X / n (cf. Langevin dynamics)
◮ It turns out (both theoretically and empirically) that stochastic gradient flow is a more accurate approximation to SGD than (2) is
[Figure: a coefficient path over time for SGD, non-constant covariance SGF, and constant covariance SGF; the non-constant covariance path tracks SGD more closely]
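The gap between the two diffusion coefficients can be seen numerically. The sketch below (synthetic data; the iterate is taken to be the true coefficient vector, so the residuals are pure noise) compares a Monte Carlo estimate of the state-dependent Q_ε(β) from (1) against the constant (ε/m)·Σ̂ from (2), as transcribed above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 4
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)               # current iterate (true coefficients)
y = X @ beta + 0.3 * rng.standard_normal(n)
eps, m = 0.05, 8

# State-dependent diffusion used by stochastic gradient flow, eq. (1)
grads = np.array([X[I].T @ (y[I] - X[I] @ beta) / m
                  for I in (rng.choice(n, m, replace=False) for _ in range(5000))])
Q_state = eps * np.cov(grads, rowvar=False)

# Constant diffusion used by the process in eq. (2)
Q_const = (eps / m) * (X.T @ X / n)

print(np.linalg.norm(Q_state - Q_const))  # nonzero: the two diffusions differ
```

Near a well-fitting iterate, Q_ε(β) is scaled down by the residual noise level, whereas (2) injects noise at a fixed scale regardless of the fit; this is one intuition for why (1) tracks SGD more closely.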

17. Outline: Overview, Continuous-time viewpoint, Risk bounds, Numerical examples, Conclusion

18. Setup
◮ Assume a standard regression model
  y = Xβ₀ + η,  η ∼ (0, σ² I)
◮ Fix X; let s_i, i = 1, ..., p, denote the eigenvalues of X^T X / n
◮ Recall a useful result for (batch) gradient flow (Ali et al., 2018)
– For least squares regression, gradient flow is
  β̇(t) = (1/n) X^T (y − Xβ(t)),  β(0) = 0
– It has the solution
  β̂^gf(t) = (X^T X)^+ (I − exp(−t X^T X / n)) X^T y
– Then, for any time t ≥ 0 (note the correspondence t ↔ 1/λ),
  Bias²(β̂^gf(t); β₀) ≤ Bias²(β̂^ridge(1/t); β₀)  and  Var(β̂^gf(t)) ≤ 1.6862 · Var(β̂^ridge(1/t)),
so that
  Risk(β̂^gf(t); β₀) ≤ 1.6862 · Risk(β̂^ridge(1/t); β₀)
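The gradient flow solution makes these bounds easy to check numerically, since bias and variance have closed forms in the eigenbasis of X^T X/n. In the sketch below, Bias² is taken to be the squared norm of the estimation bias and Var the trace of the covariance (an assumption on my part; the paper's exact risk definition may differ):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 50, 8, 1.0
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
s, V = np.linalg.eigh(X.T @ X / n)   # eigenvalues s_i of the sample covariance
b = (V.T @ beta0) ** 2               # squared coordinates of beta0 in the eigenbasis

def gf_bias2_var(t):
    # From beta_gf(t) = (X^T X)^+ (I - exp(-t X^T X / n)) X^T y, in the eigenbasis:
    # bias decays like exp(-t s_i), variance accumulates like (1 - exp(-t s_i))^2 / s_i
    return (b * np.exp(-2 * t * s)).sum(), sigma**2 / n * ((1 - np.exp(-t * s))**2 / s).sum()

def ridge_bias2_var(lam):
    return (b * lam**2 / (s + lam)**2).sum(), sigma**2 / n * (s / (s + lam)**2).sum()

for t in [0.1, 1.0, 10.0, 100.0]:
    (gb, gv), (rb, rv) = gf_bias2_var(t), ridge_bias2_var(1.0 / t)
    print(t, gb <= rb, gv <= 1.6862 * rv)
```

The bias inequality reduces, per eigendirection with x = t·s_i, to exp(−2x) ≤ 1/(1 + x)², which follows from exp(x) ≥ 1 + x; the constant 1.6862 bounds the corresponding variance ratio.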

21. Excess risk bound (over ridge)
◮ Theorem: for any time t > 0 (provided the step size is small enough),
  Risk(β̂^sgf(t); β₀) − Risk(β̂^ridge(1/t); β₀)
    ≤ 0.6862 · Var_η(β̂^ridge(1/t))   (scaled ridge variance)
    + (εn/m) · Σ_{i=1}^p E_η[ exp(δ_y) · s_i / (s_i − α/2) · (exp(−αt) − exp(−2 t s_i)) ]   ("price of stochasticity")
    + (εn/m) · Σ_{i=1}^p E_η[ γ_y · (1 − exp(−2 t s_i)) ]   (limiting optimization error)
◮ ε and m denote the step size and mini-batch size, respectively
◮ s_i denote the eigenvalues of the sample covariance matrix
◮ α, γ_y, δ_y depend on n, p, m, ε, s_i, y, but not on t (see the paper for details)

22. Implications/observations
◮ The second and third (variance) terms ...
– Roughly scale with ε/m (Goyal et al., 2017; Smith et al., 2017; You et al., 2017; Shallue et al., 2019); this is different from gradient flow
– Depend on the signal-to-noise ratio; this is also different from gradient flow (and from linear smoothers in general, because stochastic gradient flow/descent are actually randomized linear smoothers)
– The second term decreases with time, just as a bias would; this too is different from gradient flow (see the lemma in the paper)
