2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers

Serge Kas Hanna
Email: serge.k.hanna@rutgers.edu
Website: tiny.cc/serge-kas-hanna
Joint work with:
Parimal Parag, IISc
Rawad Bitar, TUM
Venkat Dasari, US Army RL
Salim El Rouayheb, Rutgers
Distributed Computing and Applications

The Age of Big Data:
• Internet of Things (IoT)
• Cloud computing
• Outsourcing computations to companies
• Distributed Machine Learning

Focus of this talk: Distributed Machine Learning
Speeding Up Distributed Machine Learning

• Master wants to run an ML algorithm on a large dataset A.
• The learning process can be made faster by outsourcing computations to worker nodes, which perform local computations and communicate the results back to the master.
• Challenge (stragglers): slow or unresponsive workers can significantly delay the learning process; the master is as fast as the slowest worker!
  [Diagram: master partitions the dataset A into A_1, ..., A_n and distributes the parts to Worker 1, ..., Worker n; stragglers are highlighted.]
Distributed Machine Learning

• Master has a dataset X ∈ R^{m×d} (m data vectors of dimension d) and labels y ∈ R^m, and wants to learn a model w* ∈ R^d that best represents y as a function of X.
• Optimization problem: find the w* ∈ R^d that minimizes a certain loss function F,
  w* = arg min_w F(X, y, w).
• When the dataset is large (m ≫ 1), computation is a bottleneck.
• Distributed learning: recruit n workers.
  1) Distribute the data A = [X | y], partitioned into A_1, ..., A_n, to the n workers.
  2) Workers compute on their local data and send the results to the master.
  3) Master aggregates the responses and updates the model.
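For later slides it helps to have the loss concrete. Below is a minimal Python sketch, under the assumption of the ℓ2 (least-squares) loss used in the numerical examples later in the talk; the function names are illustrative placeholders.

```python
import numpy as np

def loss(X, y, w):
    """Least-squares loss F(X, y, w) = ||Xw - y||^2 / (2m)."""
    m = X.shape[0]
    r = X @ w - y
    return r @ r / (2 * m)

def gradient(X, y, w):
    """Gradient of the least-squares loss with respect to the model w."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m
```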
GD, SGD & Batch SGD

• Gradient Descent (GD): choose w_0 randomly, then iterate
  w_{j+1} = w_j − η ∇F(A, w_j),
  where η is the step size and ∇F is the gradient of the loss F.
• When the dataset A is large, computing the full gradient ∇F(A, w_j) is cumbersome.
• Stochastic Gradient Descent (SGD): at each iteration, update w_j based on one row a ∈ R^{d+1} of A chosen uniformly at random:
  w_{j+1} = w_j − η ∇F(a, w_j).
• Batch SGD: choose a batch S of s < m rows of A uniformly at random:
  w_{j+1} = w_j − η ∇F(S, w_j).
• SGD and batch SGD can still approach w*, but they require a higher number of iterations than GD.
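For concreteness, here is a minimal sketch of the batch SGD update for the least-squares loss above; the batch size and step size below are illustrative choices, not values from the talk.

```python
import numpy as np

def batch_sgd(X, y, eta=0.01, s=32, iters=1000, seed=0):
    """Plain (non-distributed) batch SGD with a fixed step size eta."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = rng.standard_normal(d)                       # random initial model w_0
    for _ in range(iters):
        idx = rng.choice(m, size=s, replace=False)   # batch of s rows chosen uniformly at random
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / s  # stochastic estimate of the gradient
        w = w - eta * grad                           # w_{j+1} = w_j - eta * grad
    return w
```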
Synchronous Distributed GD

• Distributed GD: each worker i computes a partial gradient g_i(w_j) = ∇F(A_i, w_j) on its local data A_i.
• At iteration j:
  1. Master sends the current model w_j to all workers.
  2. Workers compute their partial gradients g_i(w_j) and send them to the master.
  3. Master aggregates the partial gradients by summing them to obtain the full gradient g(w_j) = g_1(w_j) + g_2(w_j) + ... + g_n(w_j) (see the sketch below).
• Aggregation by simple summation works if ∇F is additively separable, e.g., for the ℓ2 loss.
• Straggler problem: the master is as fast as the slowest worker.
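The sketch below simulates one synchronous distributed GD iteration for the ℓ2 loss. It is an illustration only: workers are simulated as a sequential loop rather than actual parallel processes, so it shows the aggregation logic but not the straggler delays.

```python
import numpy as np

def partition(A, n):
    """Split the rows of A = [X | y] into n roughly equal local datasets A_1, ..., A_n."""
    return np.array_split(A, n)

def distributed_gd_step(parts, w, eta):
    """One synchronous iteration: every worker returns its partial gradient,
    the master sums them and then updates the model."""
    partial_grads = []
    for Ai in parts:                                 # each loop body plays the role of one worker
        Xi, yi = Ai[:, :-1], Ai[:, -1]
        partial_grads.append(Xi.T @ (Xi @ w - yi))   # partial gradient on the local data A_i
    m = sum(len(Ai) for Ai in parts)
    g = sum(partial_grads) / m                       # summation recovers the full (averaged) gradient
    return w - eta * g
```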
Speeding Up Distributed GD: Previous Work

• Coding-theoretic approach: gradient coding [Tandon et al. '17], [Yu et al. '17], [Halbawi et al. '18], [Kumar et al. '18], ...
  - Main idea: distribute the data redundantly and encode the partial gradients.
  - Responses from stragglers are treated as erasures, and the full gradient is decoded from the responses of the non-stragglers.
• Approximate gradient coding: [Chen et al. '17], [Wang et al. '19], [Bitar et al. '19], ...
  - Main idea: the master does not need to compute the exact gradient, e.g., SGD.
  - Ignore the responses of stragglers and obtain an estimate of the full gradient.
  - Fastest-k SGD: wait for the responses of the fastest k < n workers and ignore the responses of the n − k stragglers.
• Mixed strategies: [Charles et al. '17], [Maity et al. '18], ...
Fastest-k SGD

• Our question: how should the value of k be chosen in fastest-k SGD with a fixed step size?
• Numerical example on synthetic data: linear regression with the ℓ2 loss; n = 50 workers, m = 2000 data points, d = 10 dimensions, worker response times iid ∼ exp(1).
  [Figure: error vs. time of fastest-k SGD for several values of k.]
• Key observation (error-runtime trade-off): convergence is faster for small k, but the final accuracy is lower.
• What does theory say? Theorem [Murata 1998]: SGD with a fixed step size goes through an exponential phase, where the error decreases exponentially, and then enters a stationary phase, where w_j oscillates around w*.
• Previous work on fastest-k SGD: analysis by [Bottou et al. '18] and [Dutta et al. '18] for a predetermined (fixed) k.
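A minimal simulation sketch in the spirit of this numerical example (not the authors' code): response times are drawn iid from exp(1), the master's clock advances by the k-th fastest response in each iteration, and each of the k fastest workers contributes a gradient computed on its whole local partition, which is a simplifying assumption.

```python
import numpy as np

def fastest_k_sgd(X, y, n=50, k=10, eta=0.005, iters=500, seed=0):
    """Simulate fastest-k SGD and return (wall-clock times, errors) per iteration."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    parts = np.array_split(rng.permutation(m), n)    # local data indices of each worker
    w, clock = np.zeros(d), 0.0
    times, errors = [], []
    for _ in range(iters):
        response = rng.exponential(1.0, size=n)      # iid exp(1) response times
        fastest = np.argsort(response)[:k]           # the k fastest workers this iteration
        clock += np.sort(response)[k - 1]            # master waits for the k-th arrival
        g = np.zeros(d)
        for i in fastest:
            idx = parts[i]
            g += X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w = w - eta * g / k                          # average of the k partial gradients
        times.append(clock)
        errors.append(np.mean((X @ w - y) ** 2))
    return np.array(times), np.array(errors)
```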
Our Contribution: Adaptive Fastest-k SGD

• Our goal: speed up distributed SGD in the presence of stragglers, i.e., achieve a lower error in less time.
• Approach: adapt the value of k throughout the runtime to maximize the time spent in the exponential-decrease phase.
  [Figure: error vs. time curves for different values of k and their lower envelope.]
• Adaptive scheme: start with the smallest k and increase k gradually every time the error hits a plateau.
• Challenge: in practice we do not know the error, because we do not know w*.
• Our results:
  1. Theoretical:
     - Derive an upper bound on the error of fastest-k SGD as a function of time.
     - Determine the bound-optimal switching times.
  2. Practical: devise an algorithm for adaptive fastest-k SGD based on a statistical heuristic.
Our Theoretical Results

Theorem 1 [Error vs. time of fastest-k SGD]: Under certain assumptions on the loss function, the error of fastest-k SGD with fixed step size η after wall-clock time t satisfies

$$\mathbb{E}\!\left[F\big(w_{J(t)}\big) - F(w^*)\right] \;\le\; \frac{\eta L \sigma^2}{2cks} \;+\; (1-\eta c)^{\frac{(1-\epsilon)\,t}{\mu_k}}\left(F(w_0) - F(w^*) - \frac{\eta L \sigma^2}{2cks}\right)$$

with high probability for large t, where 0 < ε ≪ 1 is a constant error term, J(t) is the number of iterations completed in time t, μ_k is the average of the k-th order statistic of the random response times, c and L are the strong-convexity and smoothness constants of F, σ² bounds the variance of the stochastic gradients, and s is the batch size per worker.

Theorem 2 [Bound-optimal switching times]: The bound-optimal switching times t_k, k = 1, ..., n − 1, at which the master should switch from waiting for the fastest k workers to waiting for the fastest k + 1 workers, are given by

$$t_k \;=\; t_{k-1} \;+\; \frac{\mu_k}{-\ln(1-\eta c)}\left[\ln\!\big(\mu_{k+1}-\mu_k\big) \;-\; \ln\!\big(\eta L \sigma^2 \mu_k\big) \;+\; \ln\!\Big(2csk(k+1)\big(F\big(w_{J(t_{k-1})}\big) - F(w^*)\big)\Big)\right]$$

where t_0 = 0.
Example on Theorem 2

Recall Theorem 2 [Bound-optimal switching times]: the bound-optimal switching times t_k are given by the recursion stated on the previous slide, with t_0 = 0.

• Example with iid exponential response times: evaluate the upper bound of Theorem 1 for each k and apply Theorem 2 to obtain the switching times.
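A small numerical sketch of how such an example can be set up, assuming the bound of Theorem 1 as reconstructed above. For iid exp(1) response times, the mean of the k-th order statistic out of n is H_n − H_{n−k}. All loss-related constants below (c, L, σ², s, and the initial gap) are placeholder values for illustration, not the ones used in the talk.

```python
import numpy as np

def mu_k(n, k):
    """Mean of the k-th order statistic of n iid exp(1) variables: H_n - H_{n-k}."""
    return sum(1.0 / i for i in range(n - k + 1, n + 1))

def theorem1_bound(t, n, k, eta, c, L, sigma2, s, gap0):
    """Evaluate the (reconstructed) Theorem 1 bound at wall-clock time t, with epsilon omitted."""
    floor = eta * L * sigma2 / (2 * c * k * s)       # stationary error floor for this k
    decay = (1 - eta * c) ** (t / mu_k(n, k))        # exponential-phase factor
    return floor + decay * (gap0 - floor)

# Placeholder parameters, chosen only for illustration.
n, eta, c, L, sigma2, s, gap0 = 50, 0.005, 1.0, 1.0, 1.0, 40, 10.0
t = np.linspace(0, 200, 2001)
bounds = {k: theorem1_bound(t, n, k, eta, c, L, sigma2, s, gap0) for k in (1, 10, 25, 50)}
# The lower envelope of these curves indicates when switching to a larger k becomes beneficial.
```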
Algorithm for Adaptive Fastest-k SGD

• Start with k = 1 and increase k every time a phase transition is detected.
• Phase-transition detection: monitor the sign of the inner product of consecutive gradients (stochastic approximation [Pflug 1990]; detecting the phase transition [Chee and Toulis '18]).
  - Exponential phase: consecutive gradients are likely to point in the same direction, so ⟨∇F(w_j), ∇F(w_{j+1})⟩ > 0.
  - Stationary phase: consecutive gradients are likely to point in opposite directions due to oscillation around w*, so ⟨∇F(w_j), ∇F(w_{j+1})⟩ < 0.
• Initialize a counter to zero and update it at every iteration:
  counter ← counter + 1 if ⟨∇F(w_j), ∇F(w_{j+1})⟩ < 0,
  counter ← counter − 1 if ⟨∇F(w_j), ∇F(w_{j+1})⟩ > 0.
• Declare a phase transition if the counter goes above a certain threshold, and increase k (see the sketch below).
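A minimal sketch of this heuristic wrapped around the fastest-k update (not the authors' implementation); the threshold value, the nonnegativity clamp on the counter, and the counter reset after each switch are illustrative choices.

```python
import numpy as np

def adaptive_fastest_k_sgd(X, y, n=50, eta=0.005, iters=2000, threshold=5, seed=0):
    """Adaptive fastest-k SGD: start with k = 1 and increase k whenever the
    sign-of-inner-product heuristic signals a phase transition."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    parts = np.array_split(rng.permutation(m), n)
    w, k, counter, prev_grad = np.zeros(d), 1, 0, None
    for _ in range(iters):
        response = rng.exponential(1.0, size=n)          # simulated response times
        fastest = np.argsort(response)[:k]               # fastest k workers this iteration
        g = np.zeros(d)
        for i in fastest:
            idx = parts[i]
            g += X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        g /= k
        if prev_grad is not None:
            counter += 1 if g @ prev_grad < 0 else -1    # +1 on a sign flip, -1 otherwise
            counter = max(counter, 0)                    # keep the counter nonnegative (illustrative)
        if counter > threshold and k < n:                # phase transition detected
            k, counter = k + 1, 0                        # wait for one more worker, reset the counter
        prev_grad = g
        w = w - eta * g
    return w
```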
Simulation Results: Non-adaptive vs. Adaptive Fastest-k SGD

• Simulation on synthetic data (a data-generation sketch is given below):
  - Generate X: pick m data vectors uniformly at random from {1, 2, ..., 10}^d.
  - Pick w* uniformly at random from {1, 2, ..., 100}^d.
  - Generate labels: y ∼ N(Xw*, 1).
  - Loss function: ℓ2 loss (least-squares error).
  - Workers' response times are iid ∼ exp(1) and independent across iterations.
• [Figure: simulation results on adaptive fastest-k SGD for n = 50 workers, m = 2000 data vectors, d = 100 dimensions, and step size η = 0.005.]
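A minimal sketch of the synthetic-data generation described above; the commented-out calls assume the fastest_k_sgd and adaptive_fastest_k_sgd sketches from the earlier slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, eta = 50, 2000, 100, 0.005

X = rng.integers(1, 11, size=(m, d)).astype(float)   # entries uniform over {1, ..., 10}
w_star = rng.integers(1, 101, size=d).astype(float)  # true model, uniform over {1, ..., 100}
y = X @ w_star + rng.standard_normal(m)              # labels y ~ N(Xw*, 1)

# Illustrative comparison (uncomment once the earlier sketches are defined):
# times, errors = fastest_k_sgd(X, y, n=n, k=10, eta=eta)      # non-adaptive, fixed k
# w_hat = adaptive_fastest_k_sgd(X, y, n=n, eta=eta)           # adaptive k
```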