Stochastic Algorithms in Machine Learning
Aymeric DIEULEVEUT, EPFL (PowerPoint presentation transcript)


  1. Stochastic Algorithms in Machine Learning. Aymeric DIEULEVEUT, EPFL, Lausanne. December 1st, 2017, Journée Algorithmes Stochastiques, Paris Dauphine.

  2. Outline. 1. Machine learning context. 2. Stochastic algorithms to minimize the Empirical Risk. 3. Stochastic Approximation: using stochastic gradient descent (SGD) to minimize the Generalization Risk. 4. Markov chain: an insightful point of view on constant step-size Stochastic Approximation.

  3. Supervised Machine Learning. Goal: predict a phenomenon from “explanatory variables”, given a set of observations. Bio-informatics: input DNA/RNA sequence; output disease predisposition / drug responsiveness; n from 10 to 10^4; d (e.g., number of bases) up to 10^6. Image classification: input handwritten digits / images; output digit; n up to 10^9; d (e.g., number of pixels) up to 10^6. “Large scale” learning framework: both the number of examples n and the number of explanatory variables d are large.

  4. Supervised Machine Learning.
◮ Consider an input/output pair (X, Y) ∈ X × Y, following some unknown distribution ρ.
◮ Y = ℝ (regression) or {−1, 1} (classification).
◮ Goal: find a function θ : X → ℝ, such that θ(X) is a good prediction for Y.
◮ Prediction as a linear function ⟨θ, Φ(X)⟩ of features Φ(X) ∈ ℝ^d.
◮ Consider a loss function ℓ : Y × ℝ → ℝ₊: squared loss, logistic loss, 0-1 loss, etc.
◮ Define the Generalization risk (a.k.a. generalization error, “true risk”) as R(θ) := E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)].

  5. Empirical Risk Minimization (I).
◮ Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n, i.i.d.
◮ n very large, up to 10^9; computer vision: d = 10^4 to 10^6.
◮ Empirical risk (or training error): R̂(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩).
◮ Empirical risk minimization (ERM), regularized: find θ̂ solution of min_{θ ∈ ℝ^d} (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩) + µ Ω(θ): a convex data-fitting term plus a regularizer.
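As a minimal sketch of this objective (all function and variable names here are illustrative, not from the slides), the regularized empirical risk for a generic loss ℓ and regularizer Ω can be computed as:

```python
import numpy as np

def empirical_risk(theta, Phi, y, loss, mu, omega):
    """Regularized ERM objective:
    (1/n) * sum_i loss(y_i, <theta, Phi(x_i)>) + mu * omega(theta)."""
    predictions = Phi @ theta  # inner products <theta, Phi(x_i)>, shape (n,)
    data_fit = np.mean([loss(yi, pi) for yi, pi in zip(y, predictions)])
    return data_fit + mu * omega(theta)

# Example plug-ins: squared loss and a ridge regularizer.
squared_loss = lambda yi, pred: 0.5 * (yi - pred) ** 2
ridge = lambda t: 0.5 * float(t @ t)
```

Minimizing this function over θ ∈ ℝ^d is the (convex) optimization problem the following slides address.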

  6. Empirical Risk Minimization (II). For example, least-squares regression: min_{θ ∈ ℝ^d} (1/(2n)) Σ_{i=1}^n (y_i − ⟨θ, Φ(x_i)⟩)² + µ Ω(θ), and logistic regression: min_{θ ∈ ℝ^d} (1/n) Σ_{i=1}^n log(1 + exp(−y_i ⟨θ, Φ(x_i)⟩)) + µ Ω(θ). Two fundamental questions: (1) computing θ̂; (2) analyzing θ̂. Take home:
◮ The problem is formalized as a (convex) optimization problem.
◮ In the large-scale setting, the problem is high dimensional and has many examples.
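Both objectives can be sketched directly in code (illustrative names; labels y_i ∈ {−1, +1} for the logistic case, and Ω(θ) = ½‖θ‖²):

```python
import numpy as np

def least_squares_objective(theta, Phi, y, mu):
    """(1/(2n)) * sum_i (y_i - <theta, Phi_i>)^2 + (mu/2) * ||theta||^2."""
    return 0.5 * np.mean((y - Phi @ theta) ** 2) + 0.5 * mu * theta @ theta

def logistic_objective(theta, Phi, y, mu):
    """(1/n) * sum_i log(1 + exp(-y_i <theta, Phi_i>)) + (mu/2) * ||theta||^2."""
    margins = y * (Phi @ theta)
    # np.logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * theta @ theta
```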

  7. Stochastic algorithms for ERM. min_{θ ∈ ℝ^d} R̂(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩).
1. High dimension d ⇒ first-order algorithms. Gradient Descent (GD): θ_k = θ_{k−1} − γ_k R̂′(θ_{k−1}). Problem: computing the gradient costs O(dn) per iteration.
2. Large n ⇒ stochastic algorithms: Stochastic Gradient Descent (SGD).
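A sketch of the GD update for the least-squares empirical risk (names and step size are illustrative); each iteration uses all n examples, which is exactly the O(dn) per-iteration cost mentioned above:

```python
import numpy as np

def gradient_descent(Phi, y, gamma=0.5, iters=200):
    """Full-gradient descent on (1/(2n)) * ||y - Phi @ theta||^2.
    Each step computes the exact gradient: O(d * n) work."""
    n, d = Phi.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = Phi.T @ (Phi @ theta - y) / n  # touches every example
        theta = theta - gamma * grad
    return theta
```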

  8. Stochastic Gradient Descent.
◮ Goal: min_{θ ∈ ℝ^d} f(θ), given unbiased gradient estimates f′_k.
◮ θ∗ := argmin_{ℝ^d} f(θ). (Figure: iterates θ_0, θ_1, ... approaching θ∗.)

  9. SGD for ERM: f = R̂. Loss for a single pair of observations, for any j ≤ n: f_j(θ) := ℓ(y_j, ⟨θ, Φ(x_j)⟩). One observation at each step ⇒ complexity O(d) per iteration. For the empirical risk R̂(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨θ, Φ(x_i)⟩):
◮ At each step k ∈ ℕ∗, sample I_k ∼ U{1, ..., n}, and use f′_{I_k}(θ_{k−1}) = ℓ′(y_{I_k}, ⟨θ_{k−1}, Φ(x_{I_k})⟩).
◮ With F_k = σ((x_i, y_i)_{1≤i≤n}, (I_i)_{1≤i≤k}): E[f′_{I_k}(θ_{k−1}) | F_{k−1}] = (1/n) Σ_{i=1}^n ℓ′(y_i, ⟨θ_{k−1}, Φ(x_i)⟩) = R̂′(θ_{k−1}).
Mathematical framework: smoothness and/or strong convexity.
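The sampling scheme above can be sketched for the least-squares loss (illustrative names; a constant step size for simplicity). Each iteration touches a single example, hence the O(d) per-iteration cost:

```python
import numpy as np

def sgd(Phi, y, gamma=0.1, iters=2000, seed=0):
    """SGD on the least-squares empirical risk: at each step, sample
    an index uniformly and follow that example's gradient, which is
    an unbiased estimate of the full gradient."""
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)                        # I_k ~ U{0, ..., n-1}
        grad_i = (Phi[i] @ theta - y[i]) * Phi[i]  # single-example gradient
        theta = theta - gamma * grad_i
    return theta
```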

  10. Mathematical framework: Smoothness.
◮ A function g : ℝ^d → ℝ is L-smooth if and only if it is twice differentiable and, for all θ ∈ ℝ^d, the eigenvalues of g″(θ) are at most L.
◮ For all θ, θ′ ∈ ℝ^d: g(θ) ≤ g(θ′) + ⟨g′(θ′), θ − θ′⟩ + (L/2) ‖θ − θ′‖².
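This quadratic upper bound can be checked numerically; a sketch (illustrative names) that evaluates the gap between the bound and g itself, which is nonnegative whenever g is L-smooth:

```python
import numpy as np

def smoothness_gap(g, g_prime, L, theta, theta_p):
    """g(theta') + <g'(theta'), theta - theta'> + (L/2)*||theta - theta'||^2
    minus g(theta): nonnegative at every pair of points iff g is L-smooth."""
    diff = theta - theta_p
    upper = g(theta_p) + g_prime(theta_p) @ diff + 0.5 * L * diff @ diff
    return upper - g(theta)

# For the quadratic g(theta) = 0.5 * ||theta||^2, g'' = I, so L = 1
# and the upper bound holds with equality.
```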

  11. Mathematical framework: Strong Convexity.
◮ A twice differentiable function g : ℝ^d → ℝ is µ-strongly convex if and only if, for all θ ∈ ℝ^d, the eigenvalues of g″(θ) are at least µ.
◮ For all θ, θ′ ∈ ℝ^d: g(θ) ≥ g(θ′) + ⟨g′(θ′), θ − θ′⟩ + (µ/2) ‖θ − θ′‖².

  12. Application to machine learning.
◮ We consider a loss that is almost surely convex in θ. Thus R̂ and R are convex.
◮ Hessian of R̂: R̂″(θ) = (1/n) Σ_{i=1}^n ℓ″(y_i, ⟨θ, Φ(x_i)⟩) Φ(x_i)Φ(x_i)ᵀ ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)ᵀ (≃ E[Φ(X)Φ(X)ᵀ]).
◮ If ℓ is smooth, and E[‖Φ(X)‖²] ≤ r², then R is smooth.
◮ If ℓ is µ-strongly convex, and the data has an invertible covariance matrix (low correlation/dimension), then R is strongly convex.
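A sketch of this Hessian computation (illustrative names): for the least-squares loss ℓ″ ≡ 1, so the Hessian reduces exactly to the empirical covariance matrix:

```python
import numpy as np

def erm_hessian(Phi, ell_second):
    """Hessian of the empirical risk:
    (1/n) * sum_i ell''_i * Phi_i Phi_i^T,
    where ell_second[i] is the loss curvature at example i
    (identically 1 for the squared loss)."""
    n = Phi.shape[0]
    return (Phi * ell_second[:, None]).T @ Phi / n
```

R̂ is smooth when the eigenvalues of this matrix stay bounded, and strongly convex when the (weighted) covariance is invertible, matching the bullet points above.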

  13. Analysis: behaviour of (θ_k)_{k≥0}. θ_k = θ_{k−1} − γ_k f′_k(θ_{k−1}). Importance of the learning rate (the sequence of step sizes (γ_k)_{k≥0}). For smooth and strongly convex problems, the traditional analysis (Fabian, 1968; Robbins and Siegmund, 1985) shows that θ_k → θ∗ almost surely if Σ_{k=1}^∞ γ_k = ∞ and Σ_{k=1}^∞ γ_k² < ∞. And asymptotic normality: for γ_k = γ_0/k with γ_0 ≥ 1/µ, √k (θ_k − θ∗) → N(0, V).
◮ The limit variance scales as 1/µ².
◮ Very sensitive to ill-conditioned problems.
◮ µ is generally unknown, so the step size is hard to choose...

  14. Polyak-Ruppert averaging. Introduced by Ruppert (1988) and Polyak and Juditsky (1992): θ̄_k = (1/(k+1)) Σ_{i=0}^k θ_i.
◮ Offline averaging reduces the effect of the noise.
◮ Online computation: θ̄_{k+1} = (1/(k+2)) θ_{k+1} + ((k+1)/(k+2)) θ̄_k.
◮ One could also consider other averaging schemes (e.g., Lacoste-Julien et al. (2012)).
(Figure: the averaged iterates θ̄_k approach θ∗ more directly than the raw iterates θ_k.)
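The online recursion can be sketched on top of the SGD loop (illustrative names; the running average starts from θ_0):

```python
import numpy as np

def averaged_sgd(Phi, y, gamma=0.1, iters=5000, seed=0):
    """SGD with online Polyak-Ruppert averaging:
    theta_bar_{k+1} = theta_bar_k + (theta_{k+1} - theta_bar_k)/(k + 2),
    i.e. the running mean of theta_0, ..., theta_{k+1}."""
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta = np.zeros(d)
    theta_bar = theta.copy()  # mean of the iterates seen so far
    for k in range(iters):
        i = rng.integers(n)
        grad_i = (Phi[i] @ theta - y[i]) * Phi[i]
        theta = theta - gamma * grad_i
        theta_bar = theta_bar + (theta - theta_bar) / (k + 2)
    return theta_bar
```

The average converges more slowly at first (it still carries the early iterates) but smooths out the gradient noise.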

  15. Convex stochastic approximation: convergence.
◮ Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012).
◮ Strongly convex: O((µk)⁻¹), attained by averaged stochastic gradient descent with γ_k ∝ (µk)⁻¹.
◮ Non-strongly convex: O(k^{−1/2}), attained by averaged stochastic gradient descent with γ_k ∝ k^{−1/2}.
◮ Smooth strongly convex problems: rate O(1/(µk)) for γ_k ∝ k^{−1/2}: adapts to strong convexity.

  16. Convergence rate for f(θ̃_k) − f(θ∗), smooth f.

              |                 min R̂                 | min R
              | SGD       | GD         | SAG           | SGD
    Convex    | O(1/√k)   | O(1/k)     | O(1/k)        | O(1/√k)
    Stgly-Cvx | O(1/(µk)) | O(e^{−µk}) | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update. Can we get the best of both worlds?


  18. Methods for finite-sum minimization.
◮ GD: at step k, use (1/n) Σ_{i=1}^n f′_i(θ_k).
◮ SGD: at step k, sample i_k ∼ U{1, ..., n}, use f′_{i_k}(θ_k).
◮ SAG: at step k, keep a “full gradient” (1/n) Σ_{i=1}^n f′_i(θ_{k_i}), with each θ_{k_i} ∈ {θ_1, ..., θ_k}; sample i_k ∼ U{1, ..., n}, and use (1/n) (Σ_{i=1}^n f′_i(θ_{k_i}) − f′_{i_k}(θ_{k_{i_k}}) + f′_{i_k}(θ_k)).
⊕ An update costs the same as an SGD update.
⊖ Needs to store all gradients f′_i(θ_{k_i}) at “points in the past”.
Some references:
◮ SAG: Schmidt et al. (2013); SAGA: Defazio et al. (2014a).
◮ SVRG: Johnson and Zhang (2013) (reduces the memory cost, but needs 2 epochs...).
◮ FINITO: Defazio et al. (2014b).
◮ S2GD: Konečný and Richtárik (2013)... and many others. See for example Niao He's lecture notes for a nice overview.
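A minimal SAG sketch for the least-squares loss, following the bullet points above (illustrative names; stored gradients initialized at zero, a common practical variant):

```python
import numpy as np

def sag(Phi, y, gamma=0.1, iters=10000, seed=0):
    """SAG: keep one stored gradient per example; each step refreshes
    a single entry and moves along the average of all stored gradients.
    Per-step cost matches SGD; memory cost is O(n * d)."""
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta = np.zeros(d)
    stored = np.zeros((n, d))  # f'_i evaluated at "points in the past"
    grad_sum = np.zeros(d)     # sum of stored gradients, kept incrementally
    for _ in range(iters):
        i = rng.integers(n)
        g_i = (Phi[i] @ theta - y[i]) * Phi[i]  # fresh gradient at theta_k
        grad_sum += g_i - stored[i]             # swap old entry for new one
        stored[i] = g_i
        theta = theta - gamma * grad_sum / n    # average of stored gradients
    return theta
```

SVRG trades the O(nd) gradient table for periodic full-gradient recomputations, which is the memory/epochs trade-off noted in the references above.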

  19. Convergence rate for f(θ̃_k) − f(θ∗), smooth objective f (table as on slide 16). GD, SGD, SAG. (Figure from Schmidt et al. (2013).)

  20. Convergence rate for f(θ̃_k) − f(θ∗), smooth objective f (table as on slide 16), with lower bounds:
◮ α: stochastic optimization information-theoretic lower bounds, Agarwal et al. (2012);
◮ β: black-box first-order optimization, Nesterov (2004);
◮ γ: lower bounds for optimizing finite sums, Agarwal and Bottou (2014); Arjevani and Shamir (2016).

  21. Convergence rate for f(θ̃_k) − f(θ∗), smooth objective f, with accelerated gradient descent (AGD) in place of GD.

              |                 min R̂                    | min R
              | SGD       | AGD           | SAG           | SGD
    Convex    | O(1/√k)   | O(1/k²)       | O(1/k)        | O(1/√k)
    Stgly-Cvx | O(1/(µk)) | O(e^{−√µ k})  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

Lower bounds: α: stochastic optimization information-theoretic lower bounds, Agarwal et al. (2012); β: black-box first-order optimization, Nesterov (2004); γ: lower bounds for optimizing finite sums, Agarwal and Bottou (2014).
