

  1. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
  Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

  2. Deep Learning at a Large Enterprise
  Cortana, Speech, Image, Ads, NLP, Web Search, ... DL training jobs require large GPU clusters.
  Philly: cluster manager for DL workloads on large shared GPU clusters.
  Recent cluster managers, motivated by observations in Philly:
    Manager      Optimus [EuroSys 18]   Gandiva [OSDI 18]   Tiresias [NSDI 19]
    Objective    Average JCT            Consolidation       Average JCT
    Scheduler    SRTF                   Time-sharing        Gittins Index

  3. Microsoft Philly
  Significant increase in scale during 2017: 10.5× in DL training jobs, 5× in GPU cluster size.
  Philly cluster manager handles:
  • Resource scheduling (GPU, network)
  • Storage for data & model checkpoints
  • Failure handling
  • Multi-tenancy
  • ...

  4. Job Lifecycle in Philly
  [Diagram: N-GPU DL jobs (e.g., 2- and 4-GPU) wait in a job queue; the Philly scheduler & job placement component assigns them to free GPUs on 4-GPU machines in the GPU cluster.]
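To make the lifecycle concrete, below is a minimal Python sketch of the queue-and-place flow in the diagram. It is not Philly's implementation; `Job`, `Machine`, and `place_fifo` are illustrative names, and the greedy placement may spread one job across machines (the locality issue revisited in slide 12).

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Machine:
    """A server with a fixed number of GPUs (4 in the Philly diagram)."""
    free_gpus: int = 4

@dataclass
class Job:
    job_id: str
    num_gpus: int  # an N-GPU DL job

def place_fifo(queue: deque, cluster: list) -> list:
    """Pop jobs in arrival order and greedily assign them free GPUs,
    possibly spreading one job across several machines."""
    placements = []
    while queue and sum(m.free_gpus for m in cluster) >= queue[0].num_gpus:
        job = queue.popleft()
        needed, assigned = job.num_gpus, []
        for idx, m in enumerate(cluster):
            if needed == 0:
                break
            take = min(m.free_gpus, needed)
            if take:
                m.free_gpus -= take
                needed -= take
                assigned.append((idx, take))
        placements.append((job.job_id, assigned))
    return placements

# Example: a 2-GPU and a 4-GPU job on two 4-GPU machines.
cluster = [Machine(), Machine()]
queue = deque([Job("j1", 2), Job("j2", 4)])
print(place_fifo(queue, cluster))  # j2 ends up split across both machines
```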

  5. Contributions
  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  6. Contributions
  1. First characterization study of large-scale GPU clusters for DNN training
     – 75-day period from Oct. 2017 to Dec. 2017
     – Total of 96,260 jobs across thousands of users
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  7. Study Details
  Track scheduling decisions and utilization info during the job lifecycle:
  • Scheduler logs – job arrival, GPU allocation, finish status
  • HW perf counters – GPU, CPU, memory utilization
  • AI engine logs – stderr/stdout for executed jobs
  [Same job-lifecycle diagram as slide 4.]
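A rough way to think about the analysis pipeline: records from the three sources are joined per job. The record types and field names below are hypothetical sketches, not the published trace schema.

```python
from dataclasses import dataclass

# Hypothetical record types for the three data sources; field names are
# illustrative and need not match the public philly-traces schema.

@dataclass
class SchedulerEvent:
    job_id: str
    user_id: str
    submit_time: float     # job arrival
    gpus_allocated: int    # GPU allocation
    finish_status: str     # e.g. "Pass", "Killed", "Failed"

@dataclass
class UtilSample:
    job_id: str
    timestamp: float
    gpu_util: float        # from HW perf counters
    cpu_util: float
    mem_util: float

@dataclass
class EngineLogLine:
    job_id: str
    stream: str            # "stderr" or "stdout"
    text: str

def join_by_job(events, samples, log_lines):
    """Group the three sources by job_id for per-job analysis."""
    per_job = {}
    blank = lambda: {"event": None, "samples": [], "logs": []}
    for e in events:
        per_job.setdefault(e.job_id, blank())["event"] = e
    for s in samples:
        per_job.setdefault(s.job_id, blank())["samples"].append(s)
    for l in log_lines:
        per_job.setdefault(l.job_id, blank())["logs"].append(l)
    return per_job
```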

  8. Contributions
  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  9. Most GPUs in the cluster are allocated
  How effectively are the GPUs utilized for DNN training?

  10. GPU Utilization for Job Sizes
  GPU utilization is low, and lower still in distributed training. Two reasons:
  - Distribution across servers
  - Intra-server interference
  [Bar chart titled "Median GPU Utilization" (%): 1-GPU jobs 64.7, 4-GPU jobs 59.2, 8-GPU jobs 51.6, 16-GPU jobs 44.8.]
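Given per-job utilization samples (as in the sketch after slide 7), the per-job-size summary could be computed along these lines. This is an illustrative reconstruction, not the paper's analysis code.

```python
from statistics import mean, median

def util_by_job_size(per_job):
    """Bucket jobs by allocated GPU count and summarize their GPU utilization.
    `per_job` is the job_id -> {event, samples, logs} mapping sketched earlier."""
    buckets = {}
    for info in per_job.values():
        event, samples = info["event"], info["samples"]
        if event is None or not samples:
            continue
        job_util = mean(s.gpu_util for s in samples)   # average over the job's lifetime
        buckets.setdefault(event.gpus_allocated, []).append(job_util)
    return {n: {"mean": mean(utils), "median": median(utils)}
            for n, utils in sorted(buckets.items())}
```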

  11. Effect of Distribution on Dedicated Servers
  Dedicated servers → no other jobs run on the server.
  Distributed training itself causes utilization to go lower!

  12. Scheduling Distributed Training
  Relaxing locality constraints:
  • High intra-server locality
    – High communication efficiency
    – Long queueing time
  • Low intra-server locality
    – Low queueing time
    – Contention in the use of the network
    – Risk of intra-server interference (across jobs)
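One plausible way to encode this trade-off is a placement routine that prefers a single server and only spreads a job once it has waited too long. The sketch below reuses the `Machine`/`Job` classes from the earlier sketch; the threshold and policy are illustrative, not Philly's exact rule.

```python
def try_place(job, cluster, waited_hours, relax_after_hours=1.0):
    """Locality-aware placement with relaxation (illustrative sketch).

    First try to fit the whole job on one machine (high intra-server locality).
    If the job has queued longer than `relax_after_hours`, allow spreading it
    across machines (low locality: faster start, but network contention and
    interference risk)."""
    # High-locality attempt: a single machine with enough free GPUs.
    for idx, m in enumerate(cluster):
        if m.free_gpus >= job.num_gpus:
            m.free_gpus -= job.num_gpus
            return [(idx, job.num_gpus)]
    # Relaxed attempt: spread across machines once the job has waited long enough.
    if waited_hours >= relax_after_hours and \
            sum(m.free_gpus for m in cluster) >= job.num_gpus:
        needed, assigned = job.num_gpus, []
        for idx, m in enumerate(cluster):
            take = min(m.free_gpus, needed)
            if take:
                m.free_gpus -= take
                needed -= take
                assigned.append((idx, take))
            if needed == 0:
                break
        return assigned
    return None  # keep the job queued
```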

  13. Failures occur during training
  How do job failures affect cluster utilization?

  14. Failures Can Reduce Cluster Utilization
  A job is unsuccessful if it repeatedly fails before training completes (wasting resources).
  On average: one failure per distributed training job.
  [Chart: failures per job by job size – 1-GPU jobs: 0.33, 2-4 GPU jobs: 0.98, 5-8 GPU jobs: 1.09, >8 GPU jobs: 1.11.]
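The wasted work can be quantified as GPU hours spent in failed attempts (GPUs × time to failure). A tiny illustrative helper, with an assumed input shape:

```python
def wasted_gpu_hours(attempts):
    """GPU hours consumed by failed attempts: #GPUs x time-to-failure, summed.
    `attempts` is a list of (num_gpus, hours_run, succeeded) tuples; this input
    shape is assumed for illustration."""
    return sum(gpus * hours for gpus, hours, ok in attempts if not ok)

# Example: a 4-GPU job that fails twice (2 h and 3 h) before a successful run.
print(wasted_gpu_hours([(4, 2.0, False), (4, 3.0, False), (4, 10.0, True)]))  # 20.0
```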

  15. Challenge: Failures across the Stack
  Failures come from multiple layers: infrastructure, AI engine, user program, resource scheduler.
  Our study: classify failures into types and identify their utilization impact, to improve failure handling.

  16. Failure Classifier
  • Who – job & user ID
  • Where – infra? AI engine? user?
  • Impact – GPU hours = # of GPUs × time to failure
  Classifies stderr/stdout into (signature, failure category) using >230 signatures.
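A signature-based classifier can be approximated as a list of (regex, category, layer) rules applied to stderr/stdout. The handful of patterns below are examples only; the study's classifier uses more than 230 signatures, and its exact rules are not reproduced here.

```python
import re

# Illustrative signatures: (pattern, failure category, where in the stack).
SIGNATURES = [
    (re.compile(r"CUDA out of memory|CUDA_ERROR_OUT_OF_MEMORY", re.I), "GPU OOM", "AI engine"),
    (re.compile(r"MemoryError|Cannot allocate memory", re.I),          "CPU OOM", "AI engine"),
    (re.compile(r"SyntaxError", re.I),                                 "Syntax error", "User"),
    (re.compile(r"No such file or directory|FileNotFoundError", re.I), "Incorrect inputs", "User"),
    (re.compile(r"MPI_ABORT|mpirun.*exited", re.I),                    "MPI runtime failure", "Infrastructure"),
]

def classify_failure(stderr_text: str):
    """Return (signature, category, where) for the first matching rule,
    or a catch-all if nothing matches."""
    for pattern, category, where in SIGNATURES:
        match = pattern.search(stderr_text)
        if match:
            return match.group(0), category, where
    return None, "Unknown", "Unclassified"

print(classify_failure("RuntimeError: CUDA out of memory. Tried to allocate 2.0 GiB"))
```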

  17. Failures in High Frequency
  Reason: user errors in code or configuration – repetitive and appearing early.
  [Chart: % of total failure occurrences by category (CPU OOM, incorrect inputs, semantic error, invalid memory access, syntax error, GPU OOM); bar values include 31.5, 25.4, 7.7, 6.8, 2.9, 1.3.]

  18. Failures in High Resource Use
  Reason: infrastructure failures and semantic errors – spread across many layers of the system stack.
  [Chart: % of total GPU hours until failure by category (incorrect inputs, semantic error, model checkpoint error, MPI runtime failure); bar values include 24.2, 17.6, 16.3, 15.3.]

  19. Contributions
  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  20. Locality vs. Waiting Time
  • Users prefer lower queueing delays
  • For long-running jobs, the run-time cost of giving up locality can outweigh the initial queueing delay saved
  Example:          Queueing   Run time
    Low locality    0 hours    24 hours
    High locality   1 hour     16 hours
  The scheduler needs to consider:
  1) the trade-off between queueing delay and locality-aware scheduling
  2) incorporating job migration
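The decision rule implied by the example is simply to minimize expected queueing plus run time. A minimal sketch with the slide's numbers plugged in (in practice both estimates would come from predictors):

```python
def pick_placement(options):
    """Choose the placement with the smallest expected completion time
    (queueing + run time)."""
    return min(options, key=lambda o: o["queue_h"] + o["run_h"])

options = [
    {"name": "low locality",  "queue_h": 0.0, "run_h": 24.0},
    {"name": "high locality", "queue_h": 1.0, "run_h": 16.0},
]
print(pick_placement(options)["name"])  # "high locality": 17 h beats 24 h
```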

  21. Job Pre-Run before Scheduling
  Reason: user errors in code or configuration (the failure-occurrence chart from slide 17).
  Simple validation before scheduling (e.g., a pre-run) avoids a majority of these failures.
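A pre-run check could be as simple as executing the job briefly, in a bounded way, before granting its full GPU allocation. The sketch below is one assumed interface (`pre_run_ok`, a 5-minute timeout), not Philly's mechanism.

```python
import subprocess

def pre_run_ok(cmd, timeout_s=300):
    """Run the user's training command briefly (e.g., on one GPU or a tiny
    dataset) before full-scale scheduling. Catching syntax errors, missing
    inputs, and immediate OOMs here avoids wasting multi-GPU allocations.
    The timeout and single-command interface are illustrative."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return True  # survived the probe window; likely past the early failures
    if proc.returncode != 0:
        # The stderr could be fed to the failure classifier sketched above.
        print("pre-run failed:", proc.stderr.splitlines()[-1] if proc.stderr else proc.returncode)
        return False
    return True
```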

  22. More in the Paper
  • Job queueing
    – Fair-share delay vs. fragmentation delay
    – Impact of out-of-order scheduling on job queueing
  • Job failures
    – Full classification of failures and detailed statistics
    – How to mitigate failures by proactively analyzing them at runtime
  • Effectiveness of the last epochs
    – Opportunity to skip the last few epochs

  23. Conclusion
  1. First characterization study of large-scale GPU clusters for DNN training
  2. Inefficiencies come from multiple factors
  3. Lessons on locality-awareness and failure handling
  Traces available: https://github.com/msr-fiddle/philly-traces
