Outline: bibliography
a) Non-parametric Stochastic Approximation with Large Step-sizes, A. Dieuleveut and F. Bach, in the Annals of Statistics.
b) Harder, Better, Faster, Stronger Convergence Rates for Least-squares Regression, A. Dieuleveut, N. Flammarion and F. Bach, in the Journal of Machine Learning Research.
c) Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains, A. Dieuleveut, A. Durmus and F. Bach, under submission.
[Table: papers a)–c) marked against the settings they cover — quadratic loss, smooth loss, finite dimension (FD), non-parametric — and mapped to Parts 1–3 of the talk.]
Outline
1. Introduction.
2. A warm-up: results in finite dimension (d ≫ n).
   ◮ Averaged stochastic gradient descent: adaptivity.
   ◮ Acceleration: two optimal rates.
3. Non-parametric stochastic approximation.
4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.
Behavior of Stochastic Approximation in high dimension

Least-squares regression in finite dimension:
R(θ) = E_ρ[(⟨θ, Φ(X)⟩ − Y)²].

Let Σ = E[Φ(X)Φ(X)^⊤] ∈ R^{d×d}: for θ* the best linear predictor,
R(θ) − R(θ*) = ‖Σ^{1/2}(θ − θ*)‖².
Let R² := E[‖Φ(X)‖²] and σ² := E[(Y − ⟨θ*, Φ(X)⟩)²].

Consider stochastic gradient descent with averaging (a.k.a. least-mean-squares, LMS).

Theorem 1. For any γ ≤ 1/(4R²), for any α > 1, for any r ≥ 0, for any n ∈ N,
E[R(θ̄_n)] − R(θ*) ≤ 4σ² γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α} + 4 ‖Σ^{1/2−r}(θ* − θ0)‖² / (γ^{2r} n^{min(2r,2)}).
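As a concrete illustration of this recursion, here is a minimal sketch of averaged SGD/LMS with the constant step-size γ = 1/(4R²) from Theorem 1. The synthetic Gaussian data, dimensions and seed are my own choices, not code or an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 20, 10_000, 0.5
theta_star = rng.normal(size=d)

# One-pass stream of observations (Phi(X_k), Y_k), with Y = <theta*, Phi(X)> + noise.
X = rng.normal(size=(n, d))
Y = X @ theta_star + sigma * rng.normal(size=n)

R2 = np.mean(np.sum(X ** 2, axis=1))   # estimate of R^2 = E ||Phi(X)||^2
gamma = 1.0 / (4.0 * R2)               # large constant step-size of Theorem 1

theta = np.zeros(d)                    # theta_0
theta_bar = np.zeros(d)                # Polyak-Ruppert running average
for k in range(n):
    grad = (X[k] @ theta - Y[k]) * X[k]    # unbiased stochastic gradient at theta
    theta = theta - gamma * grad
    theta_bar += (theta - theta_bar) / (k + 1)

# Estimate of the excess risk R(theta_bar) - R(theta*) = E <theta_bar - theta*, Phi(X)>^2.
print(np.mean((X @ (theta_bar - theta_star)) ** 2))
```

The printed quantity estimates the left-hand side of Theorem 1 for the averaged iterate θ̄_n.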
Theorem 1†, consequences

Theorem 1. For any γ ≤ 1/(4R²), for any α > 1, for any r ≥ 0, for any n ∈ N,
E[R(θ̄_n)] − R(θ*) ≤ 4σ² γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α}   [Variance]
                    + 4 ‖Σ^{1/2−r}(θ* − θ0)‖² / (γ^{2r} n^{min(2r,2)})   [Bias].

Variance term:  α = 1: γσ² tr(Σ);   α → ∞: σ²d / n.
Bias term:      r = 1/2: ‖θ* − θ0‖² / (γn);   r = 1: ‖Σ^{−1/2}(θ* − θ0)‖² / (γ²n²).

The choices α → ∞ and r = 1/2 recover Bach and Moulines [2013]; the choice r = 1 improves the asymptotic bias.

† Dieuleveut and Bach [2015].
Theorem 1, consequences

Theorem 1. For any γ ≤ 1/(4R²), for any n ∈ N,
E[R(θ̄_n)] − R(θ*) ≤ inf_{α>1, r≥0} { 4σ² γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α}  [Variance]  +  4 ‖Σ^{1/2−r}(θ* − θ0)‖² / (γ^{2r} n^{min(2r,2)})  [Bias] }.

Adaptivity.
[Figure: upper bound γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α} on the variance term as a function of α, in the regime d ≫ n.]
Limits to SA performance: two lower bounds

Stochastic Approximation: approximates the minimum of an (L-smooth) function in n iterations, using first-order information.
◮ Optimization lower bound: L ‖θ0 − θ*‖² / n².

Supervised ML: builds an estimator given n observations.
◮ Statistical lower bound: σ²d / n.

Here the number of iterations equals the number of observations, n = t.

Theorem 1, for averaged SGD, gives as upper bound:
σ²d / n + min( L ‖θ0 − θ*‖² / n ;  L² ‖Σ^{−1/2}(θ0 − θ*)‖² / n² ).
Acceleration†

The optimal rate for deterministic optimization is achieved by accelerated gradient descent:
θ_n = η_{n−1} − γ_n f′(η_{n−1})
η_n = θ_n + δ_n (θ_n − θ_{n−1}).

Problem: acceleration is sensitive to noise [d'Aspremont, 2008].

Combining SGD, acceleration and averaging,
◮ using extra regularization,
◮ and for an "additive" noise model only,
we achieve both of the optimal rates.

Caveat: the LMS recursion does not provide an additive-noise oracle; a different recursion, with Σ known, is used.

† Dieuleveut, Flammarion, Bach [2016].
Acceleration and averaging

More precisely, we consider:
θ_n = ν_{n−1} − γ R′_n(ν_{n−1}) − γλ (ν_{n−1} − θ0)
ν_n = θ_n + δ (θ_n − θ_{n−1}).

Theorem. For any γ ≤ 1/(2R²), for δ = 1 and λ = 0,
E[R(θ̄_n)] − R(θ*) ≤ 8σ²d / (n+1) + 36 ‖θ0 − θ*‖² / (γ (n+1)²).

Optimal rate from both the statistical and the optimization point of view.
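A rough sketch of this recursion with δ = 1 and λ = 0 follows (my own synthetic setup). It plugs in the plain LMS stochastic gradient, so it ignores the caveat above about the additive-noise oracle; read it as an illustration of the update equations, not as the exact algorithm analysed in the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 20, 10_000, 0.5
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_star + sigma * rng.normal(size=n)

R2 = np.mean(np.sum(X ** 2, axis=1))
gamma = 1.0 / (2.0 * R2)               # gamma <= 1/(2 R^2) as in the theorem

theta_prev = np.zeros(d)               # theta_{n-1}
theta = np.zeros(d)                    # theta_n
nu = np.zeros(d)                       # nu_n
theta_bar = np.zeros(d)                # running average of theta_n
for k in range(n):
    # Stochastic (LMS) gradient at nu_{n-1}; with this multiplicative noise the
    # accelerated recursion is not covered by the theorem and may be less stable.
    grad = (X[k] @ nu - Y[k]) * X[k]
    theta_prev, theta = theta, nu - gamma * grad
    nu = theta + 1.0 * (theta - theta_prev)   # momentum step, delta = 1
    theta_bar += (theta - theta_bar) / (k + 1)

print(np.mean((X @ (theta_bar - theta_star)) ** 2))   # estimated excess risk
```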
Outline
1. Introduction.
2. A warm-up: results in finite dimension (d ≫ n).
3. Non-parametric stochastic approximation.
   ◮ Averaged stochastic gradient descent: statistical rate of convergence.
   ◮ Acceleration: improving convergence in ill-conditioned regimes.
4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.
Non-parametric Random Design Least-Squares Regression

Goal: min_g R(g) = E_ρ[(Y − g(X))²], where
◮ ρ_X is the marginal distribution of X on the input space 𝒳,
◮ L²_{ρ_X} is the set of square-integrable functions w.r.t. ρ_X.

The Bayes predictor minimizes the quadratic risk over L²_{ρ_X}: g_ρ(X) = E[Y | X].
Moreover, for any function g in L²_{ρ_X}, the excess risk is
R(g) − R(g_ρ) = ‖g − g_ρ‖²_{L²_{ρ_X}}.

For H a space of functions, there exists g_H ∈ H̄^{L²_{ρ_X}} (the closure of H in L²_{ρ_X}) such that
R(g_H) = inf_{g∈H} R(g).
Reproducing Kernel Hilbert Space

Definition. A Reproducing Kernel Hilbert Space (RKHS) H is a space of functions from 𝒳 into R such that there exists a reproducing kernel K : 𝒳 × 𝒳 → R satisfying:
◮ For any x ∈ 𝒳, H contains the function K_x defined by K_x : z ↦ K(x, z).
◮ For any x ∈ 𝒳 and f ∈ H, the reproducing property holds: ⟨K_x, f⟩_H = f(x).
Why are RKHS so nice?

◮ Computation:
  ◮ Linear spaces of functions.
  ◮ Existence of gradients (Hilbert structure).
  ◮ Inner products can be computed thanks to the reproducing property.
  ◮ One only deals with functions in span{K_{x_i}, i = 1 … n} (representer theorem).
  ⇒ the algebraic framework is preserved!

◮ Approximation: many kernels satisfy H̄^{L²_{ρ_X}} = L²_{ρ_X}; there is no approximation error!

◮ Representation: the feature map 𝒳 → H, x ↦ K_x, maps points from any set into a linear space, where a linear method can be applied.
Stochastic approximation in the RKHS

As R(g) = E[(⟨g, K_X⟩_H − Y)²], for each pair of observations (x_n, y_n),
(⟨g, K_{x_n}⟩_H − y_n) K_{x_n} = (g(x_n) − y_n) K_{x_n}
is an unbiased stochastic gradient of R at g.

Consider the stochastic gradient recursion, starting from g_0 ∈ H:
g_n = g_{n−1} − γ (⟨g_{n−1}, K_{x_n}⟩_H − y_n) K_{x_n},
where γ is the step-size.

Thus g_n = Σ_{i=1}^n a_i K_{x_i}, with a_n = −γ (g_{n−1}(x_n) − y_n) for n ≥ 1.

With averaging: ḡ_n = (1/(n+1)) Σ_{k=0}^n g_k.
Total complexity: O(n²).
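A minimal sketch of this kernel SGD recursion with averaging is given below; it stores the representer coefficients a_i, and the inner loop makes the O(n²) total cost explicit. The one-dimensional synthetic data, Gaussian kernel and bandwidth are my own illustrative choices, not taken from the slides.

```python
import numpy as np

def kernel(x, z, bandwidth=0.1):
    """Gaussian kernel; any reproducing kernel could be substituted here."""
    return np.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(0.0, 1.0, size=n)                     # inputs x_1, ..., x_n
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.normal(size=n)

gamma = 0.25                    # constant step-size, gamma <= 1/(4 sup_x K(x, x))
a = np.zeros(n)                 # g_k = sum_i a_i K_{x_i}; a_i is fixed once set
a_bar = np.zeros(n)             # coefficients of the averaged iterate
for k in range(n):
    # Evaluate g_{k-1}(x_k) = sum_{i<k} a_i K(x_i, x_k): O(k) work, O(n^2) overall.
    g_xk = a[:k] @ kernel(x[:k], x[k])
    a[k] = -gamma * (g_xk - y[k])
    a_bar += (a - a_bar) / (k + 1)                    # running average of coefficients

def g_bar(z):
    """Averaged predictor evaluated at a new point z."""
    return a_bar @ kernel(x, z)

print(g_bar(0.25), np.sin(2.0 * np.pi * 0.25))        # prediction vs noiseless target
```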
Kernel regression: analysis

Assume E[K(X, X)] and E[Y²] are finite. Define the covariance operator
Σ = E[K_X K_X^⊤].

We make two assumptions:
◮ Capacity condition: eigenvalue decay of Σ.
◮ Source condition: position of g_H w.r.t. the kernel space H.

Σ is a trace-class operator that can be decomposed over its eigenspaces; its powers Σ^τ, τ > 0, are thus well defined.
Capacity condition (CC)

CC(α): for some α > 1, we assume that tr(Σ^{1/α}) < ∞.

If we denote by (µ_i)_{i∈I} the sequence of non-zero eigenvalues of the operator Σ, in decreasing order, then µ_i = O(i^{−α}).

[Figure: eigenvalue decay of the covariance operator, log10(µ_i) versus log10(i). Left: first-order Sobolev (min) kernel, ρ_X = U[0, 1] → CC(α = 2). Right: Gaussian kernel, ρ_X = U[−1, 1] → CC(α) for all α ≥ 1.]
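The figure's message can be reproduced numerically. The sketch below (my own construction; the bandwidth and sample size are arbitrary) uses the fact that the eigenvalues of the normalized Gram matrix K/n approximate the leading eigenvalues of Σ, and fits the log-log slope, which is roughly −α for a power-law decay and much steeper for the Gaussian kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Min kernel K(x, z) = min(x, z) with x ~ U[0, 1] (first-order Sobolev space).
x_min = rng.uniform(0.0, 1.0, size=n)
K_min = np.minimum.outer(x_min, x_min)

# Gaussian kernel with x ~ U[-1, 1]; bandwidth 0.2 is an arbitrary choice.
x_g = rng.uniform(-1.0, 1.0, size=n)
K_gauss = np.exp(-((x_g[:, None] - x_g[None, :]) ** 2) / (2.0 * 0.2 ** 2))

for name, K in [("min kernel", K_min), ("Gaussian kernel", K_gauss)]:
    mu = np.sort(np.linalg.eigvalsh(K / n))[::-1]   # eigenvalues of the empirical Sigma
    mu = mu[mu > 1e-12]
    i = np.arange(1, len(mu) + 1)
    slope, _ = np.polyfit(np.log(i[5:200]), np.log(mu[5:200]), 1)
    print(f"{name}: fitted log-log slope ~ {slope:.2f}")
```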
Source condition (SC)

Concerning the optimal function g_H, we assume:
SC(r): for some r ≥ 0, g_H ∈ Σ^r(L²_{ρ_X}).
Thus ‖Σ^{−r} g_H‖_{L²_{ρ_X}} < ∞.

[Figure: illustration of the three regimes r < 0.5, r = 0.5, r > 0.5.]
NPSA with large step sizes

Theorem 1. Assume CC(α) and SC(r). Then for any γ ≤ 1/(4R²),
E[R(ḡ_n)] − R(g_H) ≤ 4σ² γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α} + 4 ‖Σ^{−r}(g_H − g_0)‖²_{L²_{ρ_X}} / (γ^{2r} n^{min(2r,2)}).

For γ = γ_0 n^{−(2αr+1−α)/(2αr+1)} and (α−1)/(2α) ≤ r ≤ 1, this gives
E[R(ḡ_n)] − R(g_H) ≤ ( 4σ² tr(Σ^{1/α}) + 4 ‖Σ^{−r}(g_H − g_0)‖²_{L²_{ρ_X}} ) n^{−2αr/(2αr+1)}.

◮ Statistically optimal rate [Caponnetto and De Vito, 2007].
◮ Beyond: online, minimal assumptions...
Optimality regions

[Figure: optimality regions in the (r, α) plane, delimited by the curves r = (α−1)/(2α), r = 1/2 and r = 1, showing the region where the bias dominates the variance (B > V) and the saturation region.]

Optimal rates in the RKHS can be achieved via large step sizes and averaging in many situations.
Acceleration: reproducing kernel Hilbert space setting

We consider the RKHS setting presented before.

Theorem. Assume CC(α) and SC(r). Then for γ = γ_0 n^{−(4rα+α−2)/(2rα+1)} and λ = 1/(γn²), for r ≥ (α−2)/(2α),
E[R(ḡ_n)] − R(g_H) ≤ C_{θ0,θ*,Σ} n^{−2αr/(2αr+1)}.

[Figure: optimality regions in the (r, α) plane for the accelerated method; the boundary moves from r = (α−1)/(2α) to r = (α−2)/(2α), enlarging the region where the optimal rate is reached, with the bias-dominated (B > V) and saturation regions as before.]
Least squares: some conclusions

◮ Optimal rates of convergence under two assumptions for non-parametric regression in Hilbert spaces, using large step sizes and averaging.
◮ Sheds some light on the finite-dimensional case.
◮ Possible to attain simultaneously the optimal rates from the statistical and the optimization points of view.
Outline
1. Introduction.
2. Non-parametric stochastic approximation.
3. Faster rates with acceleration.
4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.
   ◮ Motivation.
   ◮ Assumptions.
   ◮ Convergence in Wasserstein distance.
Motivation 1/2: large step sizes!

[Figure: logistic regression, log10(R(θ̄_n) − R(θ*)) versus log10(n); final iterate (dashed) and averaged recursion (plain).]
Motivation 2/2: difference between the quadratic and the logistic loss

Logistic regression:      E[R(θ̄_n)] − R(θ*) = O(γ²),   with γ = 1/(4R²).
Least-squares regression: E[R(θ̄_n)] − R(θ*) = O(1/n),  with γ = 1/(4R²).
SGD: a homogeneous Markov chain

Consider an L-smooth and µ-strongly convex function R.
SGD with a constant step-size γ > 0 is a homogeneous Markov chain:
θ^γ_{k+1} = θ^γ_k − γ (R′(θ^γ_k) + ε_{k+1}(θ^γ_k)),
which
◮ satisfies the Markov property,
◮ is homogeneous, for γ constant and (ε_k)_{k∈N} i.i.d.

We also assume:
◮ R′_k = R′ + ε_{k+1} is almost surely L-co-coercive,
◮ bounded moments: E[‖ε_k(θ*)‖⁴] < ∞.
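To make the Markov-chain view concrete, here is a small sketch (my own example: ℓ2-regularized logistic regression on streaming synthetic data, so that R is L-smooth and µ-strongly convex) in which one SGD step is written explicitly as a fixed transition kernel driven by i.i.d. noise.

```python
import numpy as np

rng = np.random.default_rng(4)
d, lam, gamma = 5, 0.1, 0.05
theta_true = rng.normal(size=d)

def sample_observation():
    """Fresh i.i.d. observation (x, y) from a logistic model."""
    x = rng.normal(size=d)
    y = float(rng.uniform() < 1.0 / (1.0 + np.exp(-x @ theta_true)))
    return x, y

def transition(theta):
    """One step of the homogeneous chain: theta -> theta - gamma*(R'(theta) + eps(theta)).
    The stochastic gradient of the l2-regularized logistic risk on one sample is
    (sigmoid(x.theta) - y) x + lam * theta; its expectation is R'(theta)."""
    x, y = sample_observation()
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    return theta - gamma * ((p - y) * x + lam * theta)

theta = np.zeros(d)                 # theta_0
for _ in range(50_000):
    theta = transition(theta)       # the chain fluctuates within O(sqrt(gamma)) of theta*
```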
Stochastic gradient descent as a Markov chain: analysis framework†

◮ Existence of a limit distribution π_γ, and linear convergence to this distribution:
θ^γ_n →(d) π_γ.
◮ Convergence of the second-order moments of the chain:
θ̄_{n,γ} →(L²) θ̄_γ := E_{π_γ}[θ] as n → ∞.
◮ Behavior under the limit distribution (γ → 0): θ̄_γ = θ* + ?
⇒ Provable convergence improvement with extrapolation tricks.

† Dieuleveut, Durmus, Bach [2017].
Existence of a limit distribution

Goal: (θ^γ_n)_{n≥0} →(d) π_γ, then study π_γ as γ → 0.

Theorem. For any γ < 1/L, the chain (θ^γ_n)_{n≥0} admits a unique stationary distribution π_γ. In addition, for all θ0 ∈ R^d and n ∈ N:
W²_2(θ^γ_n, π_γ) ≤ (1 − 2µγ(1 − γL))^n ∫_{R^d} ‖θ0 − ϑ‖² dπ_γ(ϑ).

The Wasserstein metric W_2 is a distance between probability measures.
Behavior under the limit distribution

Ergodic theorem: θ̄_n → E_{π_γ}[θ] =: θ̄_γ. Where is θ̄_γ?

If θ0 ∼ π_γ, then θ1 ∼ π_γ. Taking expectations in
θ^γ_1 = θ^γ_0 − γ (R′(θ^γ_0) + ε_1(θ^γ_0))
gives E_{π_γ}[R′(θ)] = 0.

In the quadratic case (linear gradients), Σ E_{π_γ}[θ − θ*] = 0, hence θ̄_γ = θ*!

In the general case, using E_{π_γ}[‖θ − θ*‖⁴] ≤ Cγ², a Taylor expansion of R′ around θ*, and iterating this reasoning on higher moments of the chain:
θ̄_γ − θ* = γ R″(θ*)^{−1} R‴(θ*)[ (R″(θ*) ⊗ I + I ⊗ R″(θ*))^{−1} E_{π_γ}[ε(θ)^{⊗2}] ] + O(γ²).

Overall, θ̄_γ − θ* = γ∆ + O(γ²).
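The expansion θ̄_γ = θ* + γ∆ + O(γ²) is what makes a Richardson–Romberg extrapolation work: 2θ̄_γ − θ̄_{2γ} cancels the first-order term. Below is a rough numerical sketch of this trick (my own setup: the empirical risk of ℓ2-regularized logistic regression, so θ* can be computed by full-batch gradient descent; all constants are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(6)
d, m, lam = 5, 500, 0.1
X = rng.normal(size=(m, d))
Y = (rng.uniform(size=m) < 1.0 / (1.0 + np.exp(-X @ rng.normal(size=d)))).astype(float)

def full_gradient(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (p - Y) / m + lam * theta

theta_star = np.zeros(d)                     # minimizer of the empirical risk
for _ in range(20_000):
    theta_star -= 0.5 * full_gradient(theta_star)

def averaged_sgd(gamma, n_steps=200_000):
    """Constant step-size SGD with i.i.d. index sampling; returns theta_bar_gamma."""
    theta, theta_bar = np.zeros(d), np.zeros(d)
    for k in range(n_steps):
        i = rng.integers(m)
        p = 1.0 / (1.0 + np.exp(-X[i] @ theta))
        theta -= gamma * ((p - Y[i]) * X[i] + lam * theta)
        theta_bar += (theta - theta_bar) / (k + 1)
    return theta_bar

gamma = 0.05
bar_g, bar_2g = averaged_sgd(gamma), averaged_sgd(2.0 * gamma)
print("|theta_bar(gamma)    - theta*| =", np.linalg.norm(bar_g - theta_star))
print("|2 bar(g) - bar(2g)  - theta*| =", np.linalg.norm(2.0 * bar_g - bar_2g - theta_star))
```

With enough iterations the extrapolated estimate typically has a smaller bias, at the price of a somewhat larger variance.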
Constant learning-rate SGD: convergence in the quadratic case

[Figure: trajectory of the iterates θ0, θ1, ..., θn of constant step-size SGD in the quadratic case.]