

  1. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
  Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

  2. Deep Learning at a Large Enterprise
  Cortana, Speech, Image, Ads, NLP, Web Search, ... DL training jobs require large GPU clusters.
  Philly: cluster manager for DL workloads on large shared GPU clusters.
  Recent cluster managers, motivated by observations in Philly:
    Manager      Optimus [EuroSys 18]   Gandiva [OSDI 18]   Tiresias [NSDI 19]
    Objective    Average JCT            Consolidation       Average JCT
    Scheduler    SRTF                   Time-sharing        Gittins Index

  3. Microsoft Philly
  Significant increase in scale during 2017: 10.5× in DL training jobs, 5× in GPU cluster size.
  Philly cluster manager handles:
  • Resource scheduling (GPU, network)
  • Storage for data & model checkpoints
  • Failure handling
  • Multi-tenancy
  • ...

  4. Job Lifecycle in Philly
  [Diagram: N-GPU DL jobs (e.g., 2- and 4-GPU) wait in a job queue; the Philly scheduler & job placement component assigns them to free GPUs on 4-GPU machines in the GPU cluster.]
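To make the lifecycle concrete, below is a minimal Python sketch of the queue-and-place flow in the diagram. It is not Philly's implementation; `Job`, `Machine`, and `place_fifo` are illustrative names, and the greedy placement may spread one job across machines (the locality issue revisited in slide 12).

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Machine:
    """A server with a fixed number of GPUs (4 in the Philly diagram)."""
    free_gpus: int = 4

@dataclass
class Job:
    job_id: str
    num_gpus: int  # an N-GPU DL job

def place_fifo(queue: deque, cluster: list) -> list:
    """Pop jobs in arrival order and greedily assign them free GPUs,
    possibly spreading one job across several machines."""
    placements = []
    while queue and sum(m.free_gpus for m in cluster) >= queue[0].num_gpus:
        job = queue.popleft()
        needed, assigned = job.num_gpus, []
        for idx, m in enumerate(cluster):
            if needed == 0:
                break
            take = min(m.free_gpus, needed)
            if take:
                m.free_gpus -= take
                needed -= take
                assigned.append((idx, take))
        placements.append((job.job_id, assigned))
    return placements

# Example: a 2-GPU and a 4-GPU job on two 4-GPU machines.
cluster = [Machine(), Machine()]
queue = deque([Job("j1", 2), Job("j2", 4)])
print(place_fifo(queue, cluster))  # j2 ends up split across both machines
```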

  5. Contributions
  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  6. Contributions
  1. First characterization study of large-scale GPU clusters for DNN training
     – 75-day period from Oct. 2017 to Dec. 2017
     – Total of 96,260 jobs across thousands of users
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  7. Study Details
  Track scheduling decisions and utilization info during the job lifecycle:
  • Scheduler logs – job arrival, GPU allocation, finish status
  • HW perf counters – GPU, CPU, memory utilization
  • AI engine logs – stderr/stdout for executed jobs
  [Same job-lifecycle diagram as slide 4.]
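A rough way to think about the analysis pipeline: records from the three sources are joined per job. The record types and field names below are hypothetical sketches, not the published trace schema.

```python
from dataclasses import dataclass

# Hypothetical record types for the three data sources; field names are
# illustrative and need not match the public philly-traces schema.

@dataclass
class SchedulerEvent:
    job_id: str
    user_id: str
    submit_time: float     # job arrival
    gpus_allocated: int    # GPU allocation
    finish_status: str     # e.g. "Pass", "Killed", "Failed"

@dataclass
class UtilSample:
    job_id: str
    timestamp: float
    gpu_util: float        # from HW perf counters
    cpu_util: float
    mem_util: float

@dataclass
class EngineLogLine:
    job_id: str
    stream: str            # "stderr" or "stdout"
    text: str

def join_by_job(events, samples, log_lines):
    """Group the three sources by job_id for per-job analysis."""
    per_job = {}
    blank = lambda: {"event": None, "samples": [], "logs": []}
    for e in events:
        per_job.setdefault(e.job_id, blank())["event"] = e
    for s in samples:
        per_job.setdefault(s.job_id, blank())["samples"].append(s)
    for l in log_lines:
        per_job.setdefault(l.job_id, blank())["logs"].append(l)
    return per_job
```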

  8. Contributions
  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  9. Most GPUs in the cluster are allocated
  How effectively are the GPUs utilized for DNN training?

  10. GPU Utilization for Job Sizes
  GPU utilization is low, and lower still in distributed training. Two reasons:
  - Distribution across servers
  - Intra-server interference
  [Bar chart titled "Median GPU Utilization" (%): 1-GPU jobs 64.7, 4-GPU jobs 59.2, 8-GPU jobs 51.6, 16-GPU jobs 44.8.]
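Given per-job utilization samples (as in the sketch after slide 7), the per-job-size summary could be computed along these lines. This is an illustrative reconstruction, not the paper's analysis code.

```python
from statistics import mean, median

def util_by_job_size(per_job):
    """Bucket jobs by allocated GPU count and summarize their GPU utilization.
    `per_job` is the job_id -> {event, samples, logs} mapping sketched earlier."""
    buckets = {}
    for info in per_job.values():
        event, samples = info["event"], info["samples"]
        if event is None or not samples:
            continue
        job_util = mean(s.gpu_util for s in samples)   # average over the job's lifetime
        buckets.setdefault(event.gpus_allocated, []).append(job_util)
    return {n: {"mean": mean(utils), "median": median(utils)}
            for n, utils in sorted(buckets.items())}
```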

  11. Effect of Distribution on Dedicated Servers
  Dedicated servers → no other jobs run on the server.
  Distributed training itself causes utilization to go lower!

  12. Scheduling Distributed Training
  Relaxing locality constraints:
  • High intra-server locality
    – High communication efficiency
    – Long queueing time
  • Low intra-server locality
    – Low queueing time
    – Contention in the use of the network
    – Risk of intra-server interference (across jobs)
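One plausible way to encode this trade-off is a placement routine that prefers a single server and only spreads a job once it has waited too long. The sketch below reuses the `Machine`/`Job` classes from the earlier sketch; the threshold and policy are illustrative, not Philly's exact rule.

```python
def try_place(job, cluster, waited_hours, relax_after_hours=1.0):
    """Locality-aware placement with relaxation (illustrative sketch).

    First try to fit the whole job on one machine (high intra-server locality).
    If the job has queued longer than `relax_after_hours`, allow spreading it
    across machines (low locality: faster start, but network contention and
    interference risk)."""
    # High-locality attempt: a single machine with enough free GPUs.
    for idx, m in enumerate(cluster):
        if m.free_gpus >= job.num_gpus:
            m.free_gpus -= job.num_gpus
            return [(idx, job.num_gpus)]
    # Relaxed attempt: spread across machines once the job has waited long enough.
    if waited_hours >= relax_after_hours and \
            sum(m.free_gpus for m in cluster) >= job.num_gpus:
        needed, assigned = job.num_gpus, []
        for idx, m in enumerate(cluster):
            take = min(m.free_gpus, needed)
            if take:
                m.free_gpus -= take
                needed -= take
                assigned.append((idx, take))
            if needed == 0:
                break
        return assigned
    return None  # keep the job queued
```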

  13. Failures occur during training
  How do job failures affect cluster utilization?

  14. Failures Can Reduce Cluster Utilization
  A job is unsuccessful if it repeatedly fails before training completes (wasting resources).
  On average: one failure per distributed training job.
  [Chart: failures per job by job size – 1-GPU jobs: 0.33, 2-4 GPU jobs: 0.98, 5-8 GPU jobs: 1.09, >8 GPU jobs: 1.11.]
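The wasted work can be quantified as GPU hours spent in failed attempts (GPUs × time to failure). A tiny illustrative helper, with an assumed input shape:

```python
def wasted_gpu_hours(attempts):
    """GPU hours consumed by failed attempts: #GPUs x time-to-failure, summed.
    `attempts` is a list of (num_gpus, hours_run, succeeded) tuples; this input
    shape is assumed for illustration."""
    return sum(gpus * hours for gpus, hours, ok in attempts if not ok)

# Example: a 4-GPU job that fails twice (2 h and 3 h) before a successful run.
print(wasted_gpu_hours([(4, 2.0, False), (4, 3.0, False), (4, 10.0, True)]))  # 20.0
```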

  15. Challenge: Failures across the Stack
  Failures come from multiple layers: infrastructure, AI engine, user program, resource scheduler.
  Our study: classify failures into types and identify their utilization impact, to improve failure handling.

  16. Failure Classifier
  • Who – job & user ID
  • Where – infra? AI engine? user?
  • Impact – GPU hours = # of GPUs × time to failure
  Classifies stderr/stdout into (signature, failure category) using >230 signatures.
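A signature-based classifier can be approximated as a list of (regex, category, layer) rules applied to stderr/stdout. The handful of patterns below are examples only; the study's classifier uses more than 230 signatures, and its exact rules are not reproduced here.

```python
import re

# Illustrative signatures: (pattern, failure category, where in the stack).
SIGNATURES = [
    (re.compile(r"CUDA out of memory|CUDA_ERROR_OUT_OF_MEMORY", re.I), "GPU OOM", "AI engine"),
    (re.compile(r"MemoryError|Cannot allocate memory", re.I),          "CPU OOM", "AI engine"),
    (re.compile(r"SyntaxError", re.I),                                 "Syntax error", "User"),
    (re.compile(r"No such file or directory|FileNotFoundError", re.I), "Incorrect inputs", "User"),
    (re.compile(r"MPI_ABORT|mpirun.*exited", re.I),                    "MPI runtime failure", "Infrastructure"),
]

def classify_failure(stderr_text: str):
    """Return (signature, category, where) for the first matching rule,
    or a catch-all if nothing matches."""
    for pattern, category, where in SIGNATURES:
        match = pattern.search(stderr_text)
        if match:
            return match.group(0), category, where
    return None, "Unknown", "Unclassified"

print(classify_failure("RuntimeError: CUDA out of memory. Tried to allocate 2.0 GiB"))
```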

  17. Failures in High Frequency
  Reason: user errors in code or configuration – repetitive and appearing early.
  [Chart: % of total failure occurrences by category (CPU OOM, incorrect inputs, semantic error, invalid memory access, syntax error, GPU OOM); bar values include 31.5, 25.4, 7.7, 6.8, 2.9, 1.3.]

  18. Failures in High Resource Use
  Reason: infrastructure failures and semantic errors – spread across many layers of the system stack.
  [Chart: % of total GPU hours until failure by category (incorrect inputs, semantic error, model checkpoint error, MPI runtime failure); bar values include 24.2, 17.6, 16.3, 15.3.]

  19. Contributions
  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  20. Locality vs. Waiting Time
  • Users prefer lower queueing delays
  • For long-running jobs, the run-time cost of giving up locality can outweigh the initial queueing delay saved
  Example:          Queueing   Run time
    Low locality    0 hours    24 hours
    High locality   1 hour     16 hours
  The scheduler needs to consider:
  1) the trade-off between queueing delay and locality-aware scheduling
  2) incorporating job migration
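The decision rule implied by the example is simply to minimize expected queueing plus run time. A minimal sketch with the slide's numbers plugged in (in practice both estimates would come from predictors):

```python
def pick_placement(options):
    """Choose the placement with the smallest expected completion time
    (queueing + run time)."""
    return min(options, key=lambda o: o["queue_h"] + o["run_h"])

options = [
    {"name": "low locality",  "queue_h": 0.0, "run_h": 24.0},
    {"name": "high locality", "queue_h": 1.0, "run_h": 16.0},
]
print(pick_placement(options)["name"])  # "high locality": 17 h beats 24 h
```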

  21. Job Pre-Run before Scheduling
  Reason: user errors in code or configuration (the failure-occurrence chart from slide 17).
  Simple validation before scheduling (e.g., a pre-run) avoids a majority of these failures.
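A pre-run check could be as simple as executing the job briefly, in a bounded way, before granting its full GPU allocation. The sketch below is one assumed interface (`pre_run_ok`, a 5-minute timeout), not Philly's mechanism.

```python
import subprocess

def pre_run_ok(cmd, timeout_s=300):
    """Run the user's training command briefly (e.g., on one GPU or a tiny
    dataset) before full-scale scheduling. Catching syntax errors, missing
    inputs, and immediate OOMs here avoids wasting multi-GPU allocations.
    The timeout and single-command interface are illustrative."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return True  # survived the probe window; likely past the early failures
    if proc.returncode != 0:
        # The stderr could be fed to the failure classifier sketched above.
        print("pre-run failed:", proc.stderr.splitlines()[-1] if proc.stderr else proc.returncode)
        return False
    return True
```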

  22. More in the Paper
  • Job queueing
    – Fair-share delay vs. fragmentation delay
    – Impact of out-of-order scheduling on job queueing
  • Job failures
    – Full classification of failures and detailed statistics
    – How to mitigate failures by proactively analyzing them at runtime
  • Effectiveness of the last epochs
    – Opportunity to skip the last few epochs

  23. Conclusion
  1. First characterization study of large-scale GPU clusters for DNN training
  2. Inefficiencies come from multiple factors
  3. Lessons on locality-awareness and failure handling
  Traces available: https://github.com/msr-fiddle/philly-traces
