Bridging the gap between Stochastic Approximation and Markov chains

Aymeric Dieuleveut, ENS Paris, INRIA. 17 November 2017. Joint work with Francis Bach and Alain Durmus.

  1. Bridging the gap between Stochastic Approximation and Markov chains
     Aymeric Dieuleveut, ENS Paris, INRIA. 17 November 2017.
     Joint work with Francis Bach and Alain Durmus.

  2. Outline
     ◮ Introduction to Stochastic Approximation for Machine Learning.
     ◮ Markov chain: a simple yet insightful point of view on constant step size Stochastic Approximation.

  3. Supervised Machine Learning
     ◮ Consider an input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, following some unknown distribution $\rho$.
     ◮ $\mathcal{Y} = \mathbb{R}$ (regression) or $\{-1, 1\}$ (classification).
     ◮ We want to find a function $\theta : \mathcal{X} \to \mathbb{R}$ such that $\theta(X)$ is a good prediction for $Y$.
     ◮ Prediction as a linear function $\langle \theta, \Phi(X) \rangle$ of features $\Phi(X) \in \mathbb{R}^d$.
     ◮ Consider a loss function $\ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$: squared loss, logistic loss, 0-1 loss, etc.
     ◮ We define the risk (generalization error) as $R(\theta) := \mathbb{E}_\rho[\ell(Y, \langle \theta, \Phi(X) \rangle)]$.

  4. Empirical Risk minimization (I)
     ◮ Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.
     ◮ $n$ very large, up to $10^9$.
     ◮ Computer vision: $d = 10^4$ to $10^6$.
     ◮ Empirical risk (or training error):
       $$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle).$$
     ◮ Empirical risk minimization (regularized): find $\hat{\theta}$ solution of
       $$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle) + \mu \Omega(\theta),$$
       that is, a convex data fitting term plus a regularizer.

  5. Empirical Risk minimization (II)
     ◮ For example, least-squares regression:
       $$\min_{\theta \in \mathbb{R}^d} \frac{1}{2n} \sum_{i=1}^n \left( y_i - \langle \theta, \Phi(x_i) \rangle \right)^2 + \mu \Omega(\theta),$$
     ◮ and logistic regression:
       $$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \log\left( 1 + \exp(-y_i \langle \theta, \Phi(x_i) \rangle) \right) + \mu \Omega(\theta).$$
     ◮ Two fundamental questions: (1) computing $\hat{\theta}$ and (2) analyzing $\hat{\theta}$.
     Two important insights for ML (Bottou and Bousquet (2008)):
     1. No need to optimize below statistical error.
     2. Testing error is more important than training error.
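
As a concrete reading of the two objectives above, the following minimal Python sketch evaluates the unregularized empirical risks (that is, with $\mu = 0$) for least-squares and logistic regression. The helper names (`dot`, `least_squares_risk`, `logistic_risk`), the toy data, and the convention that each feature vector $\Phi(x_i)$ is already given as a list of floats are assumptions made for illustration; they do not come from the slides.

```python
import math

def dot(u, v):
    """Inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def least_squares_risk(theta, features, labels):
    """(1 / 2n) * sum_i (y_i - <theta, Phi(x_i)>)^2, without regularizer."""
    n = len(labels)
    return sum((y - dot(theta, phi)) ** 2
               for phi, y in zip(features, labels)) / (2 * n)

def logistic_risk(theta, features, labels):
    """(1 / n) * sum_i log(1 + exp(-y_i <theta, Phi(x_i)>)), labels in {-1, 1}."""
    n = len(labels)
    return sum(math.log1p(math.exp(-y * dot(theta, phi)))
               for phi, y in zip(features, labels)) / n

# Hypothetical toy data: two observations with d = 2 features.
features = [[1.0, 0.0], [0.0, 1.0]]
labels = [1.0, -1.0]
print(least_squares_risk([0.0, 0.0], features, labels))
print(logistic_risk([0.0, 0.0], features, labels))
```

At $\theta = 0$ every margin is zero, so the logistic risk equals $\log 2$ and the least-squares risk reduces to $\frac{1}{2n} \sum_i y_i^2$.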

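  6. Stochastic Approximation
     ◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_n$.
     ◮ $\theta^* := \operatorname{argmin}_{\mathbb{R}^d} f(\theta)$.
     [Figure: sketch showing the iterates $\theta_0$, $\theta_1$ and the minimizer $\theta^*$.]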

  7. Stochastic Approximation in Machine learning
     Loss for a single pair of observations, for any $k \le n$:
     $$f_k(\theta) = \ell(y_k, \langle \theta, \Phi(x_k) \rangle).$$
     ◮ Use one observation at each step!
     ◮ Complexity: $O(d)$ per iteration.
     ◮ Can be used for both true risk and empirical risk.

  8. Stochastic Approximation in Machine learning
     ◮ For the empirical error $\hat{R}(\theta) = \frac{1}{n} \sum_{k=1}^n \ell(y_k, \langle \theta, \Phi(x_k) \rangle)$:
       ◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$.
       ◮ $\mathcal{F}_k = \sigma((x_i, y_i)_{1 \le i \le n}, (I_i)_{1 \le i \le k})$.
       ◮ At step $k \in \mathbb{N}^*$, use:
         $$f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle), \qquad \mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \hat{R}'(\theta_{k-1}).$$
     ◮ For the risk $R(\theta) = \mathbb{E} f_k(\theta) = \mathbb{E}\, \ell(y_k, \langle \theta, \Phi(x_k) \rangle)$:
       ◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma((x_i, y_i)_{1 \le i \le k})$.
       ◮ At step $0 < k \le n$, use a new point independent of $\theta_{k-1}$:
         $$f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle), \qquad \mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = R'(\theta_{k-1}).$$
       ◮ Single pass through the data, running time $O(nd)$.
       ◮ "Automatic" regularization.
     Analysis: key assumptions are smoothness and/or strong convexity.
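
The empirical-risk variant above can be sketched in a few lines of Python. This is a minimal illustration, assuming the squared loss $f_k(\theta) = \frac{1}{2}(y_k - \langle \theta, \Phi(x_k) \rangle)^2$, a constant step size, and a noiseless toy dataset; the function names, the data and the value of `gamma` are hypothetical choices, not taken from the slides.

```python
import random

def dot(u, v):
    """Inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def sample_gradient(theta, phi, y):
    """Gradient in theta of f_k(theta) = 0.5 * (y - <theta, phi>)^2."""
    residual = dot(theta, phi) - y
    return [residual * p for p in phi]

def sgd_empirical_risk(features, labels, theta0, gamma, n_steps, seed=0):
    """SGD on the empirical risk: one uniformly sampled observation per step."""
    rng = random.Random(seed)
    theta = list(theta0)
    for _ in range(n_steps):
        i = rng.randrange(len(labels))      # I_k uniform over the n observations
        grad = sample_gradient(theta, features[i], labels[i])
        theta = [t - gamma * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical noiseless toy problem: y_i = <theta_star, Phi(x_i)>.
theta_star = [2.0, -1.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
labels = [dot(theta_star, phi) for phi in features]
theta = sgd_empirical_risk(features, labels, [0.0, 0.0], gamma=0.5, n_steps=200)
print(theta)
```

Averaging `sample_gradient` over all $n$ observations at a fixed $\theta$ gives the gradient of $\hat{R}$, which is the unbiasedness property $\mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \hat{R}'(\theta_{k-1})$. Each step costs $O(d)$ operations.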

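  9. Mathematical framework: Smoothness
     ◮ A twice differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if and only if
       $$\forall \theta \in \mathbb{R}^d, \quad \text{eigenvalues}\left[ g''(\theta) \right] \le L.$$
     For all $\theta, \theta' \in \mathbb{R}^d$:
     $$g(\theta) \le g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{L}{2} \|\theta - \theta'\|^2.$$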

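  10. Mathematical framework: Strong Convexity
      ◮ A twice differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex if and only if
        $$\forall \theta \in \mathbb{R}^d, \quad \text{eigenvalues}\left[ g''(\theta) \right] \ge \mu.$$
      For all $\theta, \theta' \in \mathbb{R}^d$:
      $$g(\theta) \ge g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{\mu}{2} \|\theta - \theta'\|^2.$$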
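
A worked instance may help fix the two definitions. Assume a quadratic $g(\theta) = \frac{1}{2} \theta^\top H \theta$ with $H$ symmetric positive definite (an illustrative choice, not taken from the slides). The gap between $g$ and its linearization at $\theta'$ is then exactly a quadratic form in $H$:

```latex
g(\theta) - g(\theta') - \langle g'(\theta'), \theta - \theta' \rangle
  = \tfrac{1}{2} (\theta - \theta')^\top H (\theta - \theta'),
\qquad\text{so}\qquad
\tfrac{\lambda_{\min}(H)}{2} \|\theta - \theta'\|^2
  \le g(\theta) - g(\theta') - \langle g'(\theta'), \theta - \theta' \rangle
  \le \tfrac{\lambda_{\max}(H)}{2} \|\theta - \theta'\|^2 .
```

In this case $g$ is $L$-smooth with $L = \lambda_{\max}(H)$ and $\mu$-strongly convex with $\mu = \lambda_{\min}(H)$; the ratio $L / \mu$ is the condition number of the problem.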

  11. Application to machine learning
      ◮ We consider a loss that is almost surely convex in $\theta$. Thus $\hat{R}$ and $R$ are convex.
      ◮ Hessian of $\hat{R}$ (resp. $R$) $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$ or $\mathbb{E}[\Phi(X) \Phi(X)^\top]$:
        $$R''(\theta) = \mathbb{E}\left[ \ell''(Y, \langle \theta, \Phi(X) \rangle)\, \Phi(X) \Phi(X)^\top \right].$$
      ◮ If $\ell$ is smooth and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, then $R$ is smooth.
      ◮ If $\ell$ is $\mu$-strongly convex and the data has an invertible covariance matrix (low correlation/dimension), then $R$ is strongly convex.
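
The link between the loss and the Hessian can be made explicit for the two running examples. Writing $z = \langle \theta, \Phi(X) \rangle$ and $\sigma(u) = 1 / (1 + e^{-u})$ (notation introduced here for illustration only), differentiating twice in $z$ gives:

```latex
\text{squared loss: } \ell(y, z) = \tfrac{1}{2}(y - z)^2
  \;\Rightarrow\; \partial_z^2 \ell = 1
  \;\Rightarrow\; R''(\theta) = \mathbb{E}[\Phi(X)\Phi(X)^\top], \\
\text{logistic loss: } \ell(y, z) = \log(1 + e^{-yz}),\ y \in \{-1, 1\}
  \;\Rightarrow\; \partial_z^2 \ell = \sigma(yz)\,(1 - \sigma(yz)) \le \tfrac{1}{4}
  \;\Rightarrow\; R''(\theta) \preccurlyeq \tfrac{1}{4}\, \mathbb{E}[\Phi(X)\Phi(X)^\top].
```

Under $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, the largest eigenvalue of $\mathbb{E}[\Phi(X) \Phi(X)^\top]$ is at most its trace, hence at most $r^2$. So $R$ is $r^2$-smooth for the squared loss and $(r^2 / 4)$-smooth for the logistic loss. The logistic second derivative tends to zero for large margins, which is why strong convexity is not automatic in that case.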

  12. Analysis: behaviour of $(\theta_n)_{n \ge 0}$
      $$\theta_n = \theta_{n-1} - \gamma_n f'_n(\theta_{n-1})$$
      Importance of the learning rate (or sequence of step sizes) $(\gamma_n)_{n \ge 0}$.
      For smooth and strongly convex problems, traditional analysis (Fabian (1968); Robbins and Siegmund (1985)) shows that $\theta_n \to \theta^*$ almost surely if
      $$\sum_{n=1}^\infty \gamma_n = \infty, \qquad \sum_{n=1}^\infty \gamma_n^2 < \infty,$$
      and asymptotic normality, $\sqrt{n} (\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, V)$, for $\gamma_n = \gamma_0 / n$ with $\gamma_0 \ge 1 / \mu$.
      ◮ Limit variance scales as $1 / \mu^2$.
      ◮ Very sensitive to ill-conditioned problems.
      ◮ $\mu$ generally unknown, so hard to choose the step size.
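
The sensitivity to $\mu$ can be seen on a one-dimensional toy problem. The sketch below assumes $f(\theta) = \frac{\mu}{2} (\theta - \theta^*)^2$ with additive gradient noise and the step size $\gamma_n = 1 / (\mu n)$, that is $\gamma_0 = 1 / \mu$; the function name, the Gaussian noise and the numerical values are illustrative assumptions. In this special case the recursion telescopes to $\theta_n - \theta^* = -\frac{1}{\mu n} \sum_{k=1}^n \varepsilon_k$, so the error variance is $\sigma^2 / (\mu^2 n)$ for noise of variance $\sigma^2$.

```python
import random

def sgd_decaying_step(theta0, theta_star, mu, noise):
    """Run theta_n = theta_{n-1} - gamma_n * f'_n(theta_{n-1}) with gamma_n = 1 / (mu * n).

    The noisy gradient is f'_n(theta) = mu * (theta - theta_star) + eps_n.
    """
    theta = theta0
    for n, eps in enumerate(noise, start=1):
        grad = mu * (theta - theta_star) + eps
        theta -= grad / (mu * n)
    return theta

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
mu, theta_star = 0.1, 3.0
theta_n = sgd_decaying_step(theta0=10.0, theta_star=theta_star, mu=mu, noise=noise)

# Closed form for this toy case: theta_star minus the averaged noise divided by mu.
closed_form = theta_star - sum(noise) / (mu * len(noise))
print(theta_n, closed_form)
```

For the same noise sequence, dividing $\mu$ by ten multiplies the error by ten, which is the $1 / \mu^2$ scaling of the limit variance.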

  13. Polyak-Ruppert averaging
      Introduced by Polyak and Juditsky (1992) and Ruppert (1988):
      $$\bar{\theta}_n = \frac{1}{n + 1} \sum_{k=0}^n \theta_k.$$
      [Figure: sketch showing the iterates $\theta_0, \theta_1, \theta_2, \dots, \theta_n$ and $\theta^*$.]
      ◮ Offline averaging reduces the effect of noise.
      ◮ Online computation: $\bar{\theta}_{n+1} = \frac{1}{n+2} \theta_{n+1} + \frac{n+1}{n+2} \bar{\theta}_n$.
      ◮ One could also consider other averaging schemes (e.g., Lacoste-Julien et al. (2012)).
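
The average does not require storing the iterates. With the convention $\bar{\theta}_n = \frac{1}{n+1} \sum_{k=0}^n \theta_k$, a running mean can be updated in $O(d)$ per step. The generator below is a minimal sketch; its name and the toy iterates are hypothetical.

```python
def online_average(iterates):
    """Yield the running averages bar_theta_n of a stream theta_0, theta_1, ..."""
    average = None
    for n, theta in enumerate(iterates):
        if average is None:
            average = list(theta)           # bar_theta_0 = theta_0
        else:
            # bar_theta_n = bar_theta_{n-1} + (theta_n - bar_theta_{n-1}) / (n + 1)
            average = [a + (t - a) / (n + 1) for a, t in zip(average, theta)]
        yield list(average)

# Hypothetical iterates in dimension 2.
iterates = [[0.0, 4.0], [2.0, 0.0], [4.0, 2.0]]
averages = list(online_average(iterates))
print(averages[-1])
```

The last value equals the plain mean of the three vectors, so the online and offline computations agree.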

  14. Convex stochastic approximation: convergence results
      ◮ Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin (1983); Agarwal et al. (2012)):
        ◮ Strongly convex: $O((\mu n)^{-1})$, attained by averaged stochastic gradient descent with $\gamma_n \propto (\mu n)^{-1}$.
        ◮ Non-strongly convex: $O(n^{-1/2})$, attained by averaged stochastic gradient descent with $\gamma_n \propto n^{-1/2}$.
      ◮ Smooth strongly convex problems:
        ◮ All step sizes $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$, with averaging, lead to $O(n^{-1})$:
          ◮ asymptotic normality (Polyak and Juditsky (1992)), with variance independent of $\mu$!
          ◮ non-asymptotic analysis (Bach and Moulines (2011)).
        ◮ Rate $\frac{1}{\mu n}$ for $\gamma_n \propto n^{-1/2}$: adapts to strong convexity.

  15. Stochastic Approximation: take-home message
      ◮ Powerful algorithm:
        ◮ Simple to implement
        ◮ Cheap
        ◮ No regularization needed
        ◮ Convergence guarantees
      ◮ $\gamma_n = \frac{1}{\sqrt{n}}$ is a good choice in most situations.
      Problems:
      ◮ Initial conditions can be forgotten slowly: could we use even larger step sizes?

  16. Motivation 1/2. Large step sizes!
      [Figure: $\log_{10}\left[ R(\bar{\theta}_n) - R(\theta^*) \right]$ against $\log_{10}(n)$.]
      Logistic regression. Final iterate (dashed) and averaged recursion (solid).

  17. Motivation 1/2. Large step sizes, real data
      [Figure: $\log_{10}\left[ R(\bar{\theta}_n) - R(\theta^*) \right]$ against $\log_{10}(n)$.]
      Logistic regression, Covertype dataset, $n = 581012$, $d = 54$. Comparison between a constant learning rate and a learning rate decaying as $\frac{1}{\sqrt{n}}$.

  18. Motivation 2/2. Difference between quadratic and logistic loss
      [Figure: two panels, one per loss.]
      ◮ Logistic regression: $\mathbb{E} R(\bar{\theta}_n) - R(\theta^*) = O(\gamma^2)$ with $\gamma = 1 / (2R^2)$.
      ◮ Least-squares regression: $\mathbb{E} R(\bar{\theta}_n) - R(\theta^*) = O\left( \frac{1}{n} \right)$ with $\gamma = 1 / (2R^2)$.

  19. Larger step sizes: Least-mean-square algorithm
      ◮ Least-squares: $R(\theta) = \frac{1}{2} \mathbb{E}\left[ (Y - \langle \Phi(X), \theta \rangle)^2 \right]$ with $\theta \in \mathbb{R}^d$.
        ◮ SGD = least-mean-square algorithm.
        ◮ Usually studied without averaging and with decreasing step sizes.
      ◮ New analysis for averaging and constant step size $\gamma = 1 / (4R^2)$ (Bach and Moulines (2013)):
        ◮ Assume $\|\Phi(x_n)\| \le r$ and $|y_n - \langle \Phi(x_n), \theta^* \rangle| \le \sigma$ almost surely (in the step size, $R^2$ denotes the bound $r^2$ on $\|\Phi(x_n)\|^2$, not the risk).
        ◮ No assumption regarding lowest eigenvalues of the Hessian.
        ◮ Main result:
          $$\mathbb{E} R(\bar{\theta}_n) - R(\theta^*) \le \frac{4 \sigma^2 d}{n} + \frac{\|\theta_0 - \theta^*\|^2}{\gamma n}.$$
        ◮ Matches statistical lower bound (Tsybakov (2003)).
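
The averaged constant step-size recursion can be tried on synthetic data. The sketch below assumes features uniform on $[-1, 1]^d$ (so $\|\Phi(x)\|^2 \le d =: r^2$ and $\mathbb{E}[\Phi(X) \Phi(X)^\top] = I / 3$), bounded uniform noise, and $\gamma = 1 / (4 r^2)$; all names and numerical values are hypothetical. The stated bound holds in expectation, so a single run is only indicative.

```python
import random

def dot(u, v):
    """Inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def averaged_lms(samples, theta0, gamma):
    """Constant step-size LMS recursion; returns the average of theta_0, ..., theta_n."""
    theta = list(theta0)
    average = list(theta0)
    for n, (phi, y) in enumerate(samples, start=1):
        residual = dot(theta, phi) - y
        theta = [t - gamma * residual * p for t, p in zip(theta, phi)]
        average = [a + (t - a) / (n + 1) for a, t in zip(average, theta)]
    return average

# Hypothetical synthetic model: features uniform on [-1, 1]^d, noise bounded by sigma.
rng = random.Random(0)
d, n, sigma = 2, 5000, 1.0
theta_star = [1.0, -2.0]
r2 = float(d)                        # ||Phi(x)||^2 <= d = r^2
gamma = 1.0 / (4.0 * r2)

def draw():
    phi = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    y = dot(theta_star, phi) + rng.uniform(-sigma, sigma)
    return phi, y

theta_bar = averaged_lms((draw() for _ in range(n)), [0.0] * d, gamma)

# With E[Phi Phi^T] = I / 3: R(theta) - R(theta_star) = ||theta - theta_star||^2 / 6.
excess_risk = sum((a - b) ** 2 for a, b in zip(theta_bar, theta_star)) / 6.0
bound = 4 * sigma ** 2 * d / n + sum(t ** 2 for t in theta_star) / (gamma * n)
print(excess_risk, bound)
```

The step size depends only on the bound $r^2$ on the features, not on the smallest eigenvalue of the Hessian, which is what makes the constant step usable when $\mu$ is unknown.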

  20. Related work in Sierra
      Led to numerous (non-trivial) extensions, at least in our lab!
      ◮ Beyond parametric models: Non Parametric Stochastic Approximation with Large step sizes. Dieuleveut and Bach (2015)
      ◮ Improved sampling: Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. Défossez and Bach (2015)
      ◮ Acceleration: Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression. Dieuleveut et al. (2016)
      ◮ Beyond smoothness and Euclidean geometry: Stochastic Composite Least-Squares Regression with convergence rate $O(1/n)$. Flammarion and Bach (2017)

  21. SGD: a homogeneous Markov chain
      Consider an $L$-smooth and $\mu$-strongly convex function $R$. SGD with a step size $\gamma > 0$ is a homogeneous Markov chain:
      $$\theta^\gamma_{k+1} = \theta^\gamma_k - \gamma \left[ R'(\theta^\gamma_k) + \varepsilon_{k+1}(\theta^\gamma_k) \right],$$
      ◮ satisfies the Markov property;
      ◮ is homogeneous, for $\gamma$ constant and $(\varepsilon_k)_{k \in \mathbb{N}}$ i.i.d.
      Also assume:
      ◮ $R'_k = R' + \varepsilon_{k+1}$ is almost surely $L$-co-coercive.
      ◮ Bounded moments: $\mathbb{E}[\|\varepsilon_k(\theta^*)\|^4] < \infty$.
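
The Markov chain point of view can be illustrated on a one-dimensional quadratic. The sketch below assumes $R(\theta) = \frac{\mu}{2} (\theta - \theta^*)^2$ and additive Gaussian noise that does not depend on $\theta$ (a simplification of $\varepsilon_{k+1}(\theta)$); names and values are hypothetical. Two chains are driven by the same noise from different starting points, and their distance shrinks by the factor $1 - \gamma \mu$ at each step, so the chain forgets its initial condition at a geometric rate.

```python
import random

def sgd_chain(theta0, gamma, noise, mu=1.0, theta_star=0.0):
    """Constant step-size SGD on R(theta) = mu / 2 * (theta - theta_star)^2.

    One transition: theta_{k+1} = theta_k - gamma * (R'(theta_k) + eps_{k+1}).
    The next state depends only on the current state and on fresh noise.
    """
    path = [theta0]
    for eps in noise:
        theta = path[-1]
        path.append(theta - gamma * (mu * (theta - theta_star) + eps))
    return path

rng = random.Random(0)
gamma, mu, steps = 0.1, 1.0, 50
noise = [rng.gauss(0.0, 1.0) for _ in range(steps)]

# Two chains driven by the same noise but started far apart (synchronous coupling).
path_a = sgd_chain(5.0, gamma, noise, mu)
path_b = sgd_chain(-5.0, gamma, noise, mu)
gap = abs(path_a[-1] - path_b[-1])
print(gap, 10.0 * (1.0 - gamma * mu) ** steps)
```

The iterates themselves keep fluctuating around $\theta^*$ because the step size does not decay; only the dependence on the starting point vanishes.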

  22. Stochastic gradient descent as a Markov chain: analysis framework†
      ◮ Existence of a limit distribution $\pi_\gamma$, and linear convergence to this distribution: $\theta^\gamma_n \xrightarrow{d} \pi_\gamma$.
      ◮ Convergence of second-order moments of the chain: $\bar{\theta}^\gamma_n \xrightarrow[n \to \infty]{L^2} \bar{\theta}_\gamma := \mathbb{E}_{\pi_\gamma}[\theta]$.
      ◮ Behavior under the limit distribution ($\gamma \to 0$): $\bar{\theta}_\gamma = \theta^* + ?$.
      ⇒ Provable convergence improvement with extrapolation tricks.
      † Dieuleveut, Durmus, Bach [2017].
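
One standard extrapolation trick is Richardson-Romberg extrapolation, which combines averaged iterates obtained with two step sizes. As a sketch, assume the mean under the limit distribution admits a first-order expansion $\bar{\theta}_\gamma = \theta^* + \gamma \Delta + O(\gamma^2)$ for some vector $\Delta$ independent of $\gamma$; this form is an assumption made here for illustration, since the slide leaves the remainder open. The first-order term then cancels:

```latex
2\,\bar{\theta}_{\gamma} - \bar{\theta}_{2\gamma}
  = 2\left(\theta^* + \gamma \Delta\right) - \left(\theta^* + 2\gamma \Delta\right) + O(\gamma^2)
  = \theta^* + O(\gamma^2).
```

In practice this would mean running two averaged recursions, one with step size $\gamma$ and one with $2\gamma$, and combining their averages as $2 \bar{\theta}^{\gamma}_n - \bar{\theta}^{2\gamma}_n$.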

