Communication trade-offs for synchronized distributed SGD with large step size


  1. Communication trade-offs for synchronized distributed SGD with large step size. Aymeric DIEULEVEUT, EPFL, MLO. 17 November 2017. Joint work with Kumar Kshitij Patel.

  2. Outline
     1. Stochastic gradient descent: supervised machine learning; setting, assumptions and proof techniques.
     2. Synchronized distributed SGD: from mini-batch averaging to model averaging.
     3. Optimality of Local-SGD.

  3-5. Stochastic Gradient Descent
     ◮ Goal: $\min_{\theta \in \mathbb{R}^d} F(\theta)$, given unbiased gradient estimates $g_n$.
     ◮ $\theta_\star := \arg\min_{\theta \in \mathbb{R}^d} F(\theta)$.
     ◮ Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951):
       $$\theta_k = \theta_{k-1} - \eta_k\, g_k(\theta_{k-1}).$$
     ◮ $\mathbb{E}\big[g_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}\big] = F'(\theta_{k-1})$ for a filtration $(\mathcal{F}_k)_{k \ge 0}$; $\theta_k$ is $\mathcal{F}_k$-measurable.
     [Figure: successive iterates $\theta_0, \theta_1, \dots, \theta_n$ approaching the minimizer $\theta_\star$.]
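
For concreteness, here is a minimal Python sketch of the SGD recursion above. The least-squares data stream, the step-size schedule and all names are my own illustrative choices, not from the talk.

```python
import numpy as np

def sgd(grad_estimate, theta0, step_size, n_steps, rng):
    """Run SGD: theta_k = theta_{k-1} - eta_k * g_k(theta_{k-1})."""
    theta = theta0.copy()
    for k in range(1, n_steps + 1):
        theta -= step_size(k) * grad_estimate(theta, rng)
    return theta

# Hypothetical example: a stream of least-squares samples y = <theta_star, x> + noise.
rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d)

def grad_estimate(theta, rng):
    # Unbiased estimate of F'(theta) from one fresh sample (x, y).
    x = rng.normal(size=d)
    y = theta_star @ x + 0.1 * rng.normal()
    return (theta @ x - y) * x

theta_hat = sgd(grad_estimate, np.zeros(d),
                step_size=lambda k: 1.0 / np.sqrt(k),
                n_steps=10_000, rng=rng)
```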

  6-7. Supervised Machine Learning
     ◮ We define the risk (generalization error) as $R(\theta) := \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$.
     ◮ Empirical risk (or training error): $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$.
     ◮ For example, least-squares regression:
       $$\min_{\theta \in \mathbb{R}^d} \ \frac{1}{2n} \sum_{i=1}^{n} \big( y_i - \langle \theta, \Phi(x_i) \rangle \big)^2 + \mu\, \Omega(\theta),$$
     ◮ and logistic regression:
       $$\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \log\big( 1 + \exp(-y_i \langle \theta, \Phi(x_i) \rangle) \big) + \mu\, \Omega(\theta).$$
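
A small sketch of these two regularized empirical risks, assuming raw features $\Phi(x) = x$ and a squared-norm regularizer $\Omega(\theta) = \|\theta\|^2$; these choices are illustrative and not specified in the slides.

```python
import numpy as np

def least_squares_risk(theta, X, y, mu):
    """(1/2n) * sum_i (y_i - <theta, x_i>)^2 + mu * ||theta||^2."""
    n = len(y)
    residuals = y - X @ theta
    return residuals @ residuals / (2 * n) + mu * theta @ theta

def logistic_risk(theta, X, y, mu):
    """(1/n) * sum_i log(1 + exp(-y_i <theta, x_i>)) + mu * ||theta||^2, labels in {-1, +1}."""
    margins = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + mu * theta @ theta
```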

  8-10. Polyak-Ruppert averaging
     Introduced by Polyak and Juditsky (1992) and Ruppert (1988):
     $$\bar{\theta}_n = \frac{1}{n+1} \sum_{k=0}^{n} \theta_k.$$
     ◮ Off-line averaging reduces the effect of the noise.
     ◮ On-line computation: $\bar{\theta}_n = \frac{1}{n+1}\, \theta_n + \frac{n}{n+1}\, \bar{\theta}_{n-1}$.
     [Figure: iterates $\theta_0, \theta_1, \theta_2, \dots, \theta_n$ and the averaged iterate $\bar{\theta}_n$ near $\theta_\star$.]
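
A minimal sketch of SGD combined with the on-line averaging update above, maintained in O(d) memory alongside the iterate; this is an illustration of the recursion, not code from the talk.

```python
def averaged_sgd(grad_estimate, theta0, step_size, n_steps, rng):
    """SGD iterate plus its Polyak-Ruppert average, both updated on-line (theta0: NumPy array)."""
    theta = theta0.copy()
    theta_bar = theta0.copy()              # bar{theta}_0 = theta_0
    for n in range(1, n_steps + 1):
        theta = theta - step_size(n) * grad_estimate(theta, rng)
        # On-line update: bar{theta}_n = theta_n / (n+1) + n * bar{theta}_{n-1} / (n+1)
        theta_bar = theta_bar + (theta - theta_bar) / (n + 1)
    return theta, theta_bar
```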

  11-13. Assumptions
     Recursion: $\theta_k = \theta_{k-1} - \eta_k\, g_k(\theta_{k-1})$.  Goal: $\min_\theta F(\theta)$.
     A1 [Strong convexity] The function $F$ is strongly convex with convexity constant $\mu > 0$.
     A2 [Smoothness and regularity] The function $F$ is three times continuously differentiable with uniformly bounded second and third derivatives: $\sup_{\theta \in \mathbb{R}^d} \|F^{(2)}(\theta)\| < L$ and $\sup_{\theta \in \mathbb{R}^d} \|F^{(3)}(\theta)\| < M$. In particular, $F$ is $L$-smooth.
     Or:
     Q1 [Quadratic function] There exists a positive definite matrix $\Sigma \in \mathbb{R}^{d \times d}$ such that $F$ is the quadratic function $\theta \mapsto \|\Sigma^{1/2}(\theta - \theta_\star)\|^2 / 2$.
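
To make Q1 concrete, a short illustrative sketch (my own construction, not from the slides) of a quadratic objective $F(\theta) = \|\Sigma^{1/2}(\theta - \theta_\star)\|^2 / 2$, for which the strong-convexity constant $\mu$ and the smoothness constant $L$ are the smallest and largest eigenvalues of $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)            # positive definite matrix
theta_star = rng.normal(size=d)

def F(theta):
    """Quadratic objective ||Sigma^{1/2}(theta - theta_star)||^2 / 2 = delta' Sigma delta / 2."""
    delta = theta - theta_star
    return 0.5 * delta @ Sigma @ delta

def F_grad(theta):
    return Sigma @ (theta - theta_star)

eigenvalues = np.linalg.eigvalsh(Sigma)           # ascending order
mu, L = eigenvalues[0], eigenvalues[-1]           # strong convexity / smoothness constants
```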

  14. Which step size would you use for smooth functions?
      Candidates: $\eta_k \equiv \eta_0$ (constant), $\eta_k = 1/\sqrt{k}$, or $\eta_k = 1/(\mu k)$, in each of the convex, strongly convex, and quadratic cases.

  15-16. Classical bound: the Lyapunov approach
     $$\mathbb{E}\big[ \|\theta_{k+1} - \theta_\star\|^2 \mid \mathcal{F}_k \big] \le \|\theta_k - \theta_\star\|^2 - 2 \eta_k \langle F'(\theta_k), \theta_k - \theta_\star \rangle + \eta_k^2\, \mathbb{E}\big[ \|g_k(\theta_k)\|^2 \mid \mathcal{F}_k \big]$$
     $$\le \|\theta_k - \theta_\star\|^2 - 2 \eta_k (1 - \eta_k L) \langle F'(\theta_k), \theta_k - \theta_\star \rangle + \eta_k^2\, \mathbb{E}\big[ \|g_k(\theta_\star)\|^2 \mid \mathcal{F}_k \big]$$
     $$\eta_k \big( F(\theta_k) - F(\theta_\star) \big) \le (1 - \eta_k \mu)\, \|\theta_k - \theta_\star\|^2 - \mathbb{E}\big[ \|\theta_{k+1} - \theta_\star\|^2 \mid \mathcal{F}_k \big] + \eta_k^2\, \mathbb{E}\big[ \|g_k(\theta_\star)\|^2 \mid \mathcal{F}_k \big]$$
     Conclusion: with $\eta_k = \frac{1}{\mu k}$, a telescopic sum and Jensen's inequality give
     $$\mathbb{E}\big[ F(\bar{\theta}_k) \big] - F(\theta_\star) \le O\!\left( \frac{1}{\mu k} \right).$$
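
A quick numerical sanity check of this conclusion, on a synthetic strongly convex quadratic of my own choosing (all constants and the dimension are arbitrary): averaged SGD with $\eta_k = 1/(\mu k)$ should have excess risk on the order of $1/(\mu k)$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
Sigma = np.diag(np.linspace(0.1, 1.0, d))      # eigenvalues in [mu, L] = [0.1, 1.0]
mu = 0.1
theta_star = rng.normal(size=d)
sigma_noise = 0.5

theta = np.zeros(d)
theta_bar = np.zeros(d)
n_steps = 100_000
for k in range(1, n_steps + 1):
    g = Sigma @ (theta - theta_star) + sigma_noise * rng.normal(size=d)  # unbiased gradient
    theta -= g / (mu * k)                      # eta_k = 1 / (mu * k)
    theta_bar += (theta - theta_bar) / (k + 1)

delta = theta_bar - theta_star
excess = 0.5 * delta @ Sigma @ delta           # F(bar{theta}_k) - F(theta_star)
print(f"excess risk: {excess:.2e}  (bound scale 1/(mu*k) = {1 / (mu * n_steps):.2e})")
```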

  17-19. Trivial case: decaying step sizes are not that great!
     Consider least squares: $y_i = \theta_\star^\top x_i + \varepsilon_i$, with $\varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$.
     Start at $\theta_0 = \theta_\star$. Then
     $$\bar{\theta}_k - \theta_\star = \frac{1}{k} \sum_{i=1}^{k} \eta_i\, M_k^i\, \varepsilon_i.$$
     ◮ Even with a large constant step size $\eta_i \equiv \eta$, the CLT is enough to control this term!
     ◮ A tight control is much easier to obtain directly on the stochastic process $\theta_k - \theta_\star$ than through the "Lyapunov approach".
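
An illustrative simulation of this point, in the simplest scalar case with $x_i \equiv 1$ (my own simplification, not the slides' setting): starting from $\theta_0 = \theta_\star$ and using a large constant step size, the averaged error concentrates at the $\sigma/\sqrt{k}$ scale predicted by the CLT.

```python
import numpy as np

rng = np.random.default_rng(2)
eta, sigma, k_max, n_runs = 0.5, 1.0, 2_000, 500

theta_err = np.zeros(n_runs)        # theta_0 - theta_star = 0 in every run
running_sum = np.zeros(n_runs)
for i in range(1, k_max + 1):
    eps = sigma * rng.normal(size=n_runs)
    theta_err = (1.0 - eta) * theta_err + eta * eps   # scalar LMS step with x_i = 1
    running_sum += theta_err
avg_err = running_sum / k_max                         # bar{theta}_k - theta_star, per run
print("std of averaged error:", avg_err.std(), " vs sigma/sqrt(k):", sigma / np.sqrt(k_max))
```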

  20-24. Other proof: introduce a decomposition
     Original proof of averaging in Polyak and Juditsky (1992):
     $$\eta_k F''(\theta_\star)(\theta_{k-1} - \theta_\star) = \theta_{k-1} - \theta_k - \eta_k \big( g_k(\theta_{k-1}) - F'(\theta_{k-1}) \big) - \eta_k \big( F'(\theta_{k-1}) - F''(\theta_\star)(\theta_{k-1} - \theta_\star) \big).$$
     Thus, for $\eta_k \equiv \eta$,
     $$F''(\theta_\star)\big( \bar{\theta}_K - \theta_\star \big) = \frac{\theta_0 - \theta_K}{\eta K} - \frac{1}{K} \sum_{k=1}^{K} \big( g_k(\theta_{k-1}) - F'(\theta_{k-1}) \big) - \frac{1}{K} \sum_{k=1}^{K} \big( F'(\theta_{k-1}) - F''(\theta_\star)(\theta_{k-1} - \theta_\star) \big).$$
     ◮ The three terms are the initial condition, the noise, and the non-quadratic residual.
     ◮ This gives a tight control of $\| F''(\theta_\star)( \bar{\theta}_K - \theta_\star ) \|$.
     ◮ Correct control of the noise for smooth and strongly convex functions: all step sizes $\eta_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$.
     ◮ LMS algorithm: a constant step size gives statistical optimality.
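
A sketch of the decomposition in action, on a toy one-dimensional objective of my own choosing ($F(\theta) = \log\cosh(\theta) + \theta^2/2$, minimizer $\theta_\star = 0$, $F''(\theta_\star) = 2$, additive Gaussian gradient noise): it evaluates the initial-condition, noise, and non-quadratic-residual terms and checks numerically that they sum to $F''(\theta_\star)(\bar{\theta}_K - \theta_\star)$.

```python
import numpy as np

rng = np.random.default_rng(3)
eta, K, sigma = 0.3, 5_000, 0.5
theta_star, F2_star = 0.0, 2.0                   # minimizer and F''(theta_star)
F1 = lambda t: np.tanh(t) + t                    # F'(theta) for F(t) = log cosh(t) + t^2/2

theta = 1.5                                      # theta_0
thetas, noises = [theta], []
for k in range(K):
    xi = sigma * rng.normal()                    # gradient noise: g_k = F' + xi
    theta = theta - eta * (F1(theta) + xi)
    thetas.append(theta)
    noises.append(xi)

prev = np.array(thetas[:-1])                     # theta_0, ..., theta_{K-1}
lhs = F2_star * (prev.mean() - theta_star)       # F''(theta_star) (bar{theta}_K - theta_star)
init_term = (thetas[0] - thetas[-1]) / (eta * K)
noise_term = -np.mean(noises)
resid_term = -np.mean(F1(prev) - F2_star * (prev - theta_star))
print(f"lhs = {lhs:.6f},  sum of the three terms = {init_term + noise_term + resid_term:.6f}")
```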
