TensorLights: End-Host Traffic Scheduling for Distributed Deep Learning
Xin Sunny Huang, Ang Chen, T. S. Eugene Ng
Rice University
This Work
• The Parameter Server (PS) architecture is the most popular approach for distributed Deep Learning.
• Disadvantage: traffic contention at the PS introduces harmful stragglers.
• TensorLights mitigates these stragglers, improving application performance and machine utilization.
The Rise of Deep Learning (DL)
• Classic AI problems: language processing, image recognition.
• Also used for: power system scheduling [1], network security [2], routing [3], database indexes [4].
• 10.5× increase in DL training jobs at Microsoft [5].
[1] DeepMind AI reduces Google data centre cooling bill by 40%. (2016)
[2] Abadi, M. et al. Learning to protect communications with adversarial neural cryptography. (arXiv 2016)
[3] Valadarsky, A. et al. Learning to route. (HotNets 2017)
[4] Kraska, T. et al. The case for learned index structures. (SIGMOD 2018)
[5] Gu, J. et al. Tiresias: A GPU cluster manager for distributed deep learning. (NSDI 2019)
Distributed Deep Learning (DL) with Parameter Server (PS)
[Figure: in each step, worker 1 and worker 2 send gradients to the PS; the PS sends the updated model back; a barrier separates step=1 from step=2.]
• Update steps per job: 1,000s to 1,000,000s [1].
[1] Szegedy, C. et al. Going deeper with convolutions. (CVPR 2015)
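To make the step structure concrete, here is a minimal toy sketch of one synchronous PS training step; the quadratic "gradient", learning rate, and shard values are all illustrative assumptions, not TensorFlow's actual mechanics.

```python
import numpy as np

# A minimal toy of one synchronous PS training step. Workers push gradients,
# the PS averages them, applies the update, and broadcasts the new model;
# the barrier is implicit in waiting for all workers each step.

def worker_gradient(model, shard):
    # Toy gradient: pretend each worker minimizes 0.5 * ||model - shard||^2.
    return model - shard

def ps_training_step(model, shards, lr=0.1):
    grads = [worker_gradient(model, s) for s in shards]  # workers compute
    avg_grad = np.mean(grads, axis=0)                    # PS aggregates
    return model - lr * avg_grad                         # PS updates + broadcasts

model = np.zeros(4)
shards = [np.full(4, k) for k in (1.0, 2.0, 3.0)]  # one data shard per worker
for step in range(500):   # real jobs run 1,000s to 1,000,000s of such steps
    model = ps_training_step(model, shards)
print(model)  # -> approximately [2. 2. 2. 2.], the shard mean
```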
Supporting DL at Scale
• Cluster scheduler (e.g. YARN [1]): manages the lifecycles of DL jobs.
• Grid search: run many DL jobs that train the same model with different hyperparameter configurations (e.g. model initialization methods) to find the best configuration.
• As DL jobs arrive one after another, the cluster scheduler packs their PS and worker tasks onto shared hosts, so PSes of different jobs become collocated.
• Contention among collocated PSes! How would PS contention impact the performance of distributed DL jobs?
[1] Vavilapalli, V. K. et al. Apache Hadoop YARN: Yet another resource negotiator. (ACM SoCC 2013)
Measurement Setup
• Workload: each TensorFlow [1] job runs 1 parameter server (PS) and 20 workers, with every task on a different machine.
• Each job trains the ResNet-32 [2] model on the CIFAR-10 [3] dataset until global step 30,000 is reached.
• 21 concurrent jobs in total.
[1] https://www.tensorflow.org/
[2] He, K. et al. Deep residual learning for image recognition. (IEEE CVPR 2016)
[3] Krizhevsky, A. Learning multiple layers of features from tiny images. (University of Toronto Technical Report 2009)
Measurement Setup (cont.)
• Testbed: CPU cluster with 21 hosts, all connected to one Ethernet switch at 10 Gbps link rate.
• Task placement: each job's 21 tasks are placed on different hosts. PS placements range from skewed to uniform: fully skewed puts all 21 PSes on one host; intermediate placements spread them out (e.g. 7 PSes on each of 3 hosts); fully uniform puts one PS per host.
Impact of PS Placements
[Figure: Job Completion Time (JCT, lower is better) under PS placements #1 through #8, ordered from intense to mild traffic contention among PSes: 1830, 1471, 1213, 1153, 1110, 1092, 1078, and 1045 seconds.]
• The most contended placement is 75% slower than the least contended one.
• Application performance degrades due to contention at the PS.
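The 75% figure follows directly from the measured JCTs; a short check:

```python
# Reproduce the headline slowdown from the measured JCTs (seconds),
# placements #1 (most contended) through #8 (least contended).
jct = [1830, 1471, 1213, 1153, 1110, 1092, 1078, 1045]
slowdown = (jct[0] - jct[-1]) / jct[-1]
print(f"{slowdown:.0%}")  # -> 75%
```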
Stragglers under Contention
[Figure: a host machine runs PS 1 and PS 2; under FIFO, their model updates to workers interleave on the outgoing link over time.]
• Possible stragglers detected! Workers (of PS 1) receiving the tail part of the update will delay the progress of the whole job.
• Intra-job level: one straggling worker delays the whole job, including the other workers.
• Inter-job level: multiple jobs are delayed simultaneously if each job has a few stragglers.
• The result: application performance degradation and machine underutilization.
Mitigate Stragglers with Traffic Priority
[Figure: with priority "1>2", PS 1's model update is sent to completion before PS 2's, instead of the two interleaving under FIFO.]
• One priority for one job's model update (from its PS).
• Traffic prioritization mitigates stragglers: workers of the same job are expected to wait for similar lengths of time.
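The talk does not specify the end-host mechanism; one plausible way to realize one-priority-per-job on Linux is the SO_PRIORITY socket option, sketched below. The helper name and the chosen priority values are our assumptions.

```python
import socket

# Hypothetical helper for one-priority-per-job at the end host. SO_PRIORITY
# is a real Linux socket option that feeds the kernel's queueing discipline;
# whether TensorLights uses this exact mechanism is our assumption.

def set_job_priority(sock: socket.socket, priority: int) -> None:
    # On Linux, higher values mean higher priority; values above 6 require
    # CAP_NET_ADMIN. SO_PRIORITY is not available on all platforms.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY, priority)

# Usage sketch for the "1>2" assignment: PS 1's update traffic outranks PS 2's.
ps1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ps2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
set_job_priority(ps1, 6)
set_job_priority(ps2, 0)
```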
Reducing Stragglers with TensorLights
[Figure: FIFO interleaves the updates of jobs 1 and 2; TensorLights-One fixes the assignment "1>2"; TensorLights-RoundRobin rotates the priority assignments "1>2" and "2>1".]
• Reducing stragglers with priority while achieving fair progress among concurrent jobs!
Scheduling Model with TensorLights
[Figure: timelines of PS 1's and PS 2's updates under FIFO, TensorLights-RoundRobin, and TensorLights-One.]
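A sketch of the two policies as priority-assignment functions; the function names and the once-per-step rotation granularity are our assumptions based on the slides.

```python
from itertools import cycle

# Rank 0 is the highest priority.

def tensorlights_one(jobs):
    # Fixed assignment, e.g. "1>2": the job order never changes.
    return {job: rank for rank, job in enumerate(sorted(jobs))}

def tensorlights_round_robin(jobs):
    # Rotate which job holds the top priority, one rotation per step, so the
    # delay that priority imposes is shared fairly across concurrent jobs.
    n = len(jobs)
    for shift in cycle(range(n)):
        yield {job: (rank + shift) % n for rank, job in enumerate(sorted(jobs))}

rr = tensorlights_round_robin(["job1", "job2"])
print(tensorlights_one(["job1", "job2"]))  # {'job1': 0, 'job2': 1} -> "1>2"
print(next(rr))                            # {'job1': 0, 'job2': 1} -> "1>2"
print(next(rr))                            # {'job1': 1, 'job2': 0} -> "2>1"
```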
TensorLights vs. Other Communication Acceleration Strategies
• Resource scheduling: TensorLights is work conserving (✓); inaccurate rate control in other strategies leads to bandwidth under-utilization (✗).
• Scheduling overhead: TensorLights is local and light-weight (✓); other strategies need global coordination (✗).
• Deployment: TensorLights requires no change to the application, cluster scheduler, or hardware (✓); other strategies require modifications at various levels (✗).