Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers


1. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing. Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers. Serge Kas Hanna. Email: serge.k.hanna@rutgers.edu. Website: tiny.cc/serge-kas-hanna

2. Joint work with Parimal Parag (IISc), Rawad Bitar (TUM), Venkat Dasari (US Army RL), and Salim El Rouayheb (Rutgers).

3. Distributed Computing and Applications: the age of big data, the Internet of Things (IoT), cloud computing, and outsourcing computations to companies. Focus of this talk: distributed machine learning.

4. Speeding Up Distributed Machine Learning
- The master wants to run a machine-learning algorithm on a large dataset A.
- The learning process can be made faster by outsourcing computations to worker nodes: the data is split into parts A_1, ..., A_n, and the workers perform local computations and communicate their results back to the master.
- Challenge (stragglers): slow or unresponsive workers can significantly delay the learning process, since the master is only as fast as the slowest worker.

5. Distributed Machine Learning
- The master has a dataset X ∈ ℝ^{m×d} (m data vectors of dimension d) with labels y ∈ ℝ^m, and wants to learn a model x* ∈ ℝ^d that best represents y as a function of X.
- Optimization problem: find x* ∈ ℝ^d that minimizes a certain loss function F, i.e., x* = arg min_x F(X, y, x).
- When the dataset is large (m ≫ 1), computation is a bottleneck.
- Distributed learning: recruit workers. 1) Partition the data A = [X | y] into A_1, ..., A_n and distribute it to n workers. 2) The workers compute on their local data and send the results to the master. 3) The master aggregates the responses and updates the model. (A least-squares sketch of the loss follows below.)
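To make the setup concrete, here is a minimal sketch (not from the slides) of the ℓ2 (least-squares) instance of this loss and its gradient, which is the loss used in the talk's experiments; the function names are illustrative.

    import numpy as np

    def l2_loss(X, y, x):
        """Least-squares loss F(X, y, x) = (1/(2m)) * ||X x - y||^2."""
        m = X.shape[0]
        r = X @ x - y
        return 0.5 * np.dot(r, r) / m

    def l2_gradient(X, y, x):
        """Gradient of the least-squares loss with respect to the model x."""
        return X.T @ (X @ x - y) / X.shape[0]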

6. GD, SGD, and Batch SGD
- Gradient descent (GD): choose x_0 randomly, then iterate x_{j+1} = x_j − η ∇F(A, x_j), where η is the step size and ∇F is the gradient of F.
- When the dataset A is large, computing the full gradient ∇F(A, x_j) is cumbersome.
- Stochastic gradient descent (SGD): at each iteration, update x_j based on a single row a ∈ ℝ^{d+1} of A chosen uniformly at random: x_{j+1} = x_j − η ∇F(a, x_j).
- Batch SGD: choose a batch S of s < m data vectors uniformly at random: x_{j+1} = x_j − η ∇F(S, x_j).
- SGD and batch SGD can still converge to x*, at the cost of a higher number of iterations. (A batch-SGD sketch follows below.)
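A minimal centralized batch-SGD sketch of the update x_{j+1} = x_j − η ∇F(S, x_j) for the ℓ2 loss; the default batch size, step size, and iteration count are illustrative assumptions, not values from the talk.

    import numpy as np

    def batch_sgd(X, y, eta=0.005, s=32, iters=1000, rng=None):
        """Batch SGD with a fixed step size on the least-squares loss."""
        rng = np.random.default_rng() if rng is None else rng
        m, d = X.shape
        x = rng.standard_normal(d)                       # random initial model x_0
        for _ in range(iters):
            idx = rng.choice(m, size=s, replace=False)   # batch S: s rows chosen uniformly at random
            grad = X[idx].T @ (X[idx] @ x - y[idx]) / s  # stochastic gradient on the batch
            x = x - eta * grad                           # x_{j+1} = x_j - eta * grad F(S, x_j)
        return x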

7. Synchronous Distributed GD
- Distributed GD: each worker i computes a partial gradient on its local data, g_i(x_j) = ∇F(A_i, x_j), and the master computes the full gradient g(x_j) = g_1(x_j) + g_2(x_j) + ... + g_n(x_j).
- At iteration j: 1) the master sends the current model x_j to all workers; 2) the workers compute their partial gradients and send them to the master; 3) the master aggregates the partial gradients by summing them to obtain the full gradient.
- Aggregation by simple summation works when ∇F is additively separable, e.g., for the ℓ2 loss.
- Straggler problem: the master is only as fast as the slowest worker. (A sketch of one such iteration follows below.)
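The following sketch shows one synchronous distributed-GD iteration with summation aggregation, assuming the ℓ2 loss so that the gradient is additively separable across the local parts A_i; the partitioning helper and names are illustrative, and stragglers are not modeled here.

    import numpy as np

    def distributed_gd_step(parts, x, eta):
        """One synchronous iteration: worker i returns g_i(x) = grad F(A_i, x); the master sums the responses."""
        m = sum(Xi.shape[0] for Xi, _ in parts)
        partial = [Xi.T @ (Xi @ x - yi) for Xi, yi in parts]  # partial gradients, one per worker
        g = sum(partial) / m                                  # master aggregates: g = g_1 + ... + g_n
        return x - eta * g                                    # gradient-descent update

    # Usage: split the dataset A = [X | y] row-wise into n local parts (X_i, y_i), e.g.
    # parts = list(zip(np.array_split(X, n), np.array_split(y, n)))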

8. Speeding Up Distributed GD: Previous Work
- Coding-theoretic approach: gradient coding [Tandon et al. '17], [Yu et al. '17], [Halbawi et al. '18], [Kumar et al. '18], ... Main idea: distribute the data redundantly and encode the partial gradients; the responses of stragglers are treated as erasures, and the full gradient is decoded from the responses of the non-stragglers.
- Approximate gradient coding: [Chen et al. '17], [Wang et al. '19], [Bitar et al. '19], ... Main idea: the master does not need the exact gradient (as in SGD); ignore the responses of the stragglers and obtain an estimate of the full gradient. Fastest-k SGD: wait for the responses of the fastest k < n workers and ignore the responses of the n − k stragglers.
- Mixed strategies: [Charles et al. '17], [Maity et al. '18], ...

9. Fastest-k SGD
- Our question: how should the value of k be chosen in fastest-k SGD with a fixed step size?
- Numerical example on synthetic data (linear regression, ℓ2 loss): n = 50 workers, m = 2000 data points, d = 10 dimensions, worker response times iid ∼ exp(1). [Figure: error vs. time of fastest-k SGD.]
- Key observation, an error-runtime trade-off: convergence is faster for small k, but the final accuracy is lower. (A simulation sketch of this trade-off follows below.)
- What does theory say? Theorem [Murata 1998]: SGD with a fixed step size goes through an exponential phase, in which the error decreases exponentially, and then enters a stationary phase, in which x_j oscillates around x*.
- Previous work on fastest-k SGD: analysis by [Bottou et al. '18] and [Dutta et al. '18] for a predetermined (fixed) k.
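To illustrate the trade-off, here is a simulation sketch of fastest-k SGD in the spirit of this example: in each iteration only the k fastest of the n workers are aggregated, and the elapsed time grows by the k-th order statistic of iid exp(1) response times. The data handling, batch size, and function names are illustrative assumptions, not the authors' code.

    import numpy as np

    def fastest_k_sgd(parts, k, eta=0.005, s=32, total_time=50.0, rng=None):
        """Fastest-k SGD on the least-squares loss: wait only for the k fastest of n workers per iteration."""
        rng = np.random.default_rng() if rng is None else rng
        n = len(parts)
        X_all = np.vstack([Xi for Xi, _ in parts])        # only used to track the training error
        y_all = np.concatenate([yi for _, yi in parts])
        x = np.zeros(X_all.shape[1])
        t, history = 0.0, []
        while t < total_time:
            delays = rng.exponential(1.0, size=n)         # iid exp(1) response times
            fastest = np.argsort(delays)[:k]              # the k fastest workers in this iteration
            t += np.sort(delays)[k - 1]                   # elapsed time grows by the k-th order statistic
            grads = []
            for i in fastest:                             # batch gradient on each fast worker's local data
                Xi, yi = parts[i]
                idx = rng.choice(Xi.shape[0], size=s, replace=False)
                grads.append(Xi[idx].T @ (Xi[idx] @ x - yi[idx]) / s)
            x = x - eta * sum(grads) / k                  # aggregate the k responses and update
            history.append((t, 0.5 * np.mean((X_all @ x - y_all) ** 2)))
        return x, history                                 # history traces the error-runtime curve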

10. Our Contribution: Adaptive Fastest-k SGD
- Our goal: speed up distributed SGD in the presence of stragglers, i.e., achieve a lower error in less time by following the envelope of the fastest-k error curves.
- Approach: adapt the value of k throughout the runtime so as to maximize the time spent in the exponential-decrease phase.
- Adaptive scheme: start with the smallest k and increase k gradually every time the error hits a plateau.
- Challenge: in practice the error is not known, because x* is not known.
- Our results: 1) Theoretical: derive an upper bound on the error of fastest-k SGD as a function of time, and determine the bound-optimal switching times. 2) Practical: devise an algorithm for adaptive fastest-k SGD based on a statistical heuristic.

11. Our Theoretical Results

Theorem 1 [Error vs. time of fastest-k SGD]: Under certain assumptions on the loss function, the error of fastest-k SGD with fixed step size η after wall-clock time t satisfies

E[F(x_{J(t)}) − F(x*)] ≤ ηLσ²/(2cks) + (1 − ηc)^((1−ε) t / μ_k) · [F(x_0) − F(x*) − ηLσ²/(2cks)]

with high probability for large t, where 0 < ε ≪ 1 is a constant error term, J(t) is the number of iterations completed in time t, μ_k is the average of the k-th order statistic of the workers' random response times, s is the per-worker batch size, and c, L, and σ² are the strong-convexity, smoothness, and gradient-variance constants of the loss.

Theorem 2 [Bound-optimal switching times]: The bound-optimal switching times t_k, k = 1, ..., n − 1, at which the master should switch from waiting for the fastest k workers to waiting for the fastest k + 1 workers, are given by

t_k = t_{k−1} + μ_k/(−ln(1 − ηc)) · [ln(μ_{k+1} − μ_k) − ln(ηLσ²μ_k) + ln(2ck(k+1)s (F(x_{t_{k−1}}) − F(x*)) − ηL(k+1)σ²)]

where t_0 = 0.

12. Example on Theorem 2
- Recall Theorem 2 (previous slide): the bound-optimal switching times t_k, k = 1, ..., n − 1, are given in closed form in terms of η, c, L, σ², s, the order-statistic means μ_k and μ_{k+1}, and the loss gap F(x_{t_{k−1}}) − F(x*), with t_0 = 0.
- Example with iid exponential response times: evaluate the upper bound of Theorem 1 and apply Theorem 2 to obtain the switching times. (A numerical sketch follows below.)
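Below is a numerical sketch of this example based on the reconstruction of Theorem 2 above, using the fact that for iid exp(1) response times the k-th order statistic out of n has mean μ_k = H_n − H_{n−k} (a difference of harmonic numbers); the constants c, L, σ², s and the initial loss gap are placeholder values, and the way the loss gap is updated between switches is an approximation.

    import numpy as np

    def order_stat_mean_exp(n, k):
        """Mean of the k-th order statistic of n iid exp(1) variables: H_n - H_{n-k}."""
        return sum(1.0 / i for i in range(n - k + 1, n + 1))

    def switching_times(n, eta, c, L, sigma2, s, gap0):
        """Bound-optimal switching times t_1, ..., t_{n-1} following the Theorem 2 formula above."""
        t_prev, gap, times = 0.0, gap0, []
        for k in range(1, n):
            mu_k = order_stat_mean_exp(n, k)
            mu_k1 = order_stat_mean_exp(n, k + 1)
            arg = 2 * c * k * (k + 1) * s * gap - eta * L * (k + 1) * sigma2
            if arg <= 0:                                  # the bound for this k is already at its floor
                break
            dt = mu_k / (-np.log(1 - eta * c)) * (
                np.log(mu_k1 - mu_k) - np.log(eta * L * sigma2 * mu_k) + np.log(arg))
            t_prev += max(dt, 0.0)
            times.append(t_prev)
            gap = eta * L * sigma2 / (2 * c * k * s)      # approximate loss gap at the switch by the error floor
        return times

    # Example: n = 50 workers, step size eta = 0.005, illustrative constants
    print(switching_times(n=50, eta=0.005, c=1.0, L=1.0, sigma2=1.0, s=32, gap0=100.0))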

13. Algorithm for Adaptive Fastest-k SGD
- Start with k = 1 and increase k every time a phase transition is detected.
- Phase-transition detection: monitor the sign of the inner product of consecutive gradients (a stochastic-approximation idea going back to [Pflug 1990], used for detecting the transition in [Chee and Toulis '18]).
- In the exponential phase, consecutive gradients are likely to point in the same direction, i.e., ∇F(x_j)ᵀ ∇F(x_{j+1}) > 0; in the stationary phase, they are likely to point in opposite directions due to oscillation, i.e., ∇F(x_j)ᵀ ∇F(x_{j+1}) < 0.
- Initialize a counter to zero and update it at every iteration: counter ← counter + 1 if ∇F(x_j)ᵀ ∇F(x_{j+1}) < 0, and counter ← counter − 1 if ∇F(x_j)ᵀ ∇F(x_{j+1}) > 0.
- Declare a phase transition when the counter exceeds a certain threshold, then increase k. (A sketch of this detector follows below.)
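A sketch of this detection heuristic, packaged as a small helper; the class name and the threshold value are illustrative assumptions.

    import numpy as np

    class PhaseTransitionDetector:
        """Detects the exponential-to-stationary transition from signs of consecutive gradient inner products."""

        def __init__(self, threshold=10):
            self.threshold = threshold
            self.counter = 0
            self.prev_grad = None

        def update(self, grad):
            """Feed the current aggregated gradient; returns True when a phase transition is declared."""
            if self.prev_grad is not None:
                if np.dot(self.prev_grad, grad) < 0:      # opposite directions: evidence of oscillation
                    self.counter += 1
                else:                                     # same direction: evidence of exponential decrease
                    self.counter -= 1
            self.prev_grad = grad.copy()
            if self.counter > self.threshold:             # enough evidence: declare a transition and reset
                self.counter = 0
                return True
            return False

    # Usage inside the training loop: if detector.update(aggregated_grad): k = min(k + 1, n)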

14. Simulation Results: Non-Adaptive vs. Adaptive Fastest-k SGD
- Simulation on synthetic data: generate X by picking m data vectors uniformly at random from {1, 2, ..., 10}^d; pick x* uniformly at random from {1, 2, ..., 100}^d; generate the labels as y ∼ N(X x*, 1); loss function: ℓ2 loss (least-squares error); the workers' response times are iid ∼ exp(1) and independent across iterations.
- Setting: n = 50 workers, m = 2000 data vectors, d = 100 dimensions, step size η = 0.005. [Figure: simulation results comparing non-adaptive and adaptive fastest-k SGD for n = 50 workers.] (A data-generation sketch follows below.)
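For completeness, here is a data-generation sketch matching the simulation setup on this slide; the partitioning across workers and the function name are added so the data can be fed directly to the fastest-k sketches above.

    import numpy as np

    def make_synthetic_data(m=2000, d=100, n_workers=50, rng=None):
        """Generate the synthetic regression data of slide 14 and split it row-wise across the workers."""
        rng = np.random.default_rng() if rng is None else rng
        X = rng.integers(1, 11, size=(m, d)).astype(float)    # data vectors with entries in {1, ..., 10}
        x_star = rng.integers(1, 101, size=d).astype(float)   # true model with entries in {1, ..., 100}
        y = X @ x_star + rng.standard_normal(m)               # labels y ~ N(X x*, 1)
        parts = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))
        return X, y, x_star, parts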
