
Tiresias: A GPU Cluster Manager for Distributed Deep Learning - PowerPoint PPT Presentation



  1. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang (Harry) Liu, Chuanxiong Guo

  2. GPU Cluster for Deep Learning Training
  • Deep learning (DL) is popular: 10.5× increase of DL training jobs in Microsoft
  • DL training jobs require GPUs (e.g., Google Lens, Siri)
  • Distributed deep learning (DDL) training uses multiple GPUs
  • GPU clusters for DL training: 5× increase of GPU cluster scale in Microsoft [1]
  How to efficiently manage a GPU cluster for DL training jobs?
  [1]. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. https://arxiv.org/abs/1901.05758

  3. GPU Cluster Manager
  Design objectives:
  1. Minimize cluster-wide average job completion time (JCT)
  2. Achieve high resource (GPU) utilization
  (Figure: a job queue of N-GPU DL jobs feeds a scheduler and a placement scheme over a GPU cluster of 4-GPU machines; legend: free vs. occupied GPUs.)
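To make the first objective concrete, here is a small hypothetical example (the job durations and the run-one-job-at-a-time setup are illustrative, not from the talk) of how scheduling order changes cluster-wide average JCT:

```python
def avg_jct(durations):
    """Average job completion time when jobs run one at a time,
    in the given order, all arriving at t = 0."""
    t, total = 0, 0
    for d in durations:
        t += d            # completion time of this job
        total += t        # accumulate completion times
    return total / len(durations)

fifo = avg_jct([2, 8, 6])            # arrival order
shortest_first = avg_jct([2, 6, 8])  # shortest job first
```

Here FIFO gives an average JCT of 28/3 ≈ 9.33 while running the shortest job first gives 26/3 ≈ 8.67, which is why job-size information (when it exists) is so valuable to a JCT-minimizing scheduler.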

  4. Challenge I: Unpredictable Training Time
  § Unknown execution time of DL training jobs
  § Job execution time is useful when minimizing JCT
  § Predicting job execution time: use the smooth loss curve of DL training jobs (Optimus [1])
  (Figure: normalized training loss vs. progress; left: smooth curves for Job 1 and Job 2; right: curves for DSSM, ResNext, and Seq2Seq.)
  [1]. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys'18

  5. Challenge I: Unpredictable Training Time (cont.)
  Takeaway: it is hard to predict the training time of DL jobs in many cases; curves like those of DSSM, ResNext, and Seq2Seq do not follow the smooth pattern that prediction relies on.

  6. Challenge II: Over-Aggressive Job Consolidation
  § Network overhead in DDL training
  (Figure: a 4-GPU job waits in the job queue before four 4-GPU machines; legend: free vs. occupied GPUs.)

  7. Challenge II: Over-Aggressive Job Consolidation (cont.)
  § Consolidated placement (all of a job's GPUs on one machine) for good training performance
  (Figure: the 4-GPU job placed entirely on Machine 2.)

  8. Challenge II: Over-Aggressive Job Consolidation (cont.)
  (Figure: further jobs also consolidated, each filling a single machine.)

  9. Challenge II: Over-Aggressive Job Consolidation (cont.)
  § Consolidation leaves fragmented free GPUs in the cluster
  § Longer queuing delay for subsequent multi-GPU jobs
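The fragmentation effect above can be sketched in a few lines. The machine layout below is hypothetical, and "consolidated placement" here simply means all of a job's GPUs on one machine:

```python
def consolidated_fit(free_gpus_per_machine, n):
    """Return the index of a machine that can host all n GPUs of a job,
    or None if consolidated placement is impossible."""
    for i, free in enumerate(free_gpus_per_machine):
        if free >= n:
            return i
    return None

# Four 4-GPU machines after several consolidated placements:
free_gpus = [2, 1, 1, 0]                    # 4 free GPUs in total, but fragmented
machine = consolidated_fit(free_gpus, 4)    # None: a 4-GPU job must queue
```

Even though the cluster has 4 free GPUs in total, no single machine can host the 4-GPU job, so it waits in the queue: this is the longer queuing delay caused by over-aggressive consolidation.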

  10. Prior Solutions

  System      | I. Unpredictable Training Time (Scheduling) | II. Over-Aggressive Job Consolidation (Job Placement)
  Optimus [1] | None                                        | None
  YARN-CS     | FIFO                                        | None
  Gandiva [2] | Time-sharing                                | Trial-and-error

  [1]. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys'18
  [2]. Gandiva: Introspective Cluster Scheduling for Deep Learning, OSDI'18

  11. Tiresias: A GPU Cluster Manager for Distributed Deep Learning Without Complete Knowledge
  1. Age-based scheduler: minimize JCT without complete knowledge of jobs
  2. Model profile-based placement: place jobs without additional information from users

  12. Challenge I: How to schedule DL training jobs without complete job information?

  13. Characteristics of DL Training Jobs: Temporal and Spatial Co-scheduling
  § Variations in both temporal (job execution time) and spatial (number of GPUs) aspects
  (Figure: scatter plot of number of GPUs, 1 to 128, vs. job execution time, 10 to 10^5 minutes.)

  14. Characteristics of DL Training Jobs: Temporal and Spatial Co-scheduling (cont.)
  Takeaway: the scheduler should consider both temporal and spatial aspects of DL training jobs.

  15. Available Job Information
  1. Spatial: number of GPUs
  (Figure: timeline of jobs on GPUs G1-G3 over time 0-11; each job's remaining duration is unknown.)

  16. Available Job Information
  1. Spatial: number of GPUs
  2. Temporal: executed time
  (Figure: the same timeline, annotated with each job's executed time so far.)

  17. Age-Based Schedulers
  • Least-Attained Service (LAS) [1]
    • Prioritize the job that has the shortest executed time
  • Gittins index policy [2]
    • Needs the distribution of job execution time
    • Prioritize the job that has the highest probability of completing in the near future
  (Figure: timeline of jobs on GPUs; a job's age is its executed time.)
  [1]. Feedback queueing models for time-shared systems. JACM, 1968
  [2]. Multi-armed bandit allocation indices. Wiley, Chichester, 1989

  18. Two-Dimensional Age-Based Scheduler (2DAS)
  • Age calculated by two-dimensional attained service, i.e., a job's total executed GPU time (# of GPUs × executed time)
  • No prior information: 2D-LAS
  • With partial information (the distribution of job GPU time): 2D-Gittins index
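A minimal sketch of the 2D-LAS priority rule described above. The job names match the deck's 2D-Gittins example, but the executed times are a hypothetical mid-run snapshot; under 2D-LAS, less attained GPU time means higher priority:

```python
def attained_service_2d(num_gpus, executed_time):
    """2D attained service: a job's total executed GPU time."""
    return num_gpus * executed_time

# Hypothetical snapshot: (job name, # of GPUs, executed wall-clock time)
jobs = [("J1", 2, 2.0), ("J2", 1, 8.0), ("J3", 2, 6.0)]

# 2D-LAS: serve the job with the least attained GPU time first
order = sorted(jobs, key=lambda j: attained_service_2d(j[1], j[2]))
names = [name for name, _, _ in order]   # J1 (4) < J2 (8) < J3 (12)
```

Note that J2 ranks ahead of J3 even though J2 has run longer in wall-clock time; the two-dimensional age accounts for the number of GPUs a job occupies, not just its elapsed time.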

  19. 2D-Gittins Index: Partial Information
  • Higher probability to complete (higher Gittins index), higher priority
  • Example: three jobs whose total GPU time is drawn from the distribution (4, 8, 12)

  Job | # of GPUs | Duration | Attained Service | Gittins Index
  J1  | 2         | 2        | 0                | 0.25
  J2  | 1         | 8        | 0                | 0.25
  J3  | 2         | 6        | 0                | 0.25

  • As jobs run, attained service grows and each job's Gittins index is recomputed (plot: Gittins index value vs. attained service, 0 to 12); the scheduler switches jobs when priorities change, e.g., when J1 ends
  • Result: average JCT of 10.0 with 2D-Gittins (extra information: the GPU-time distribution) vs. 11.7 with 2D-LAS (no extra information)
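One common way to compute a Gittins-style index from a discrete distribution of total GPU time is sketched below. The uniform (4, 8, 12) distribution matches the deck's example, but the exact index values depend on the chosen formulation, so the numbers on the slides may follow a different convention:

```python
def gittins_index(service_dist, attained):
    """Gittins index of a job with `attained` GPU time, when total GPU
    time S is drawn uniformly from `service_dist`. Uses the formulation
        sup over quanta d of  P(S - a <= d | S > a) / E[min(S - a, d) | S > a]
    so a higher value means a higher chance of finishing soon per unit
    of further service invested."""
    remaining = [s - attained for s in service_dist if s > attained]
    if not remaining:
        return float("inf")                # the job must already be done
    best = 0.0
    for d in sorted(set(remaining)):       # candidate service quanta
        p = sum(r <= d for r in remaining) / len(remaining)
        e = sum(min(r, d) for r in remaining) / len(remaining)
        best = max(best, p / e)
    return best

dist = (4, 8, 12)
early = gittins_index(dist, 0)   # low: little evidence the job is short
late = gittins_index(dist, 8)    # higher: at most 4 GPU-time units remain
```

The index rises as a job's attained service approaches the distribution's upper values, which is how 2D-Gittins exploits the GPU-time distribution that 2D-LAS ignores (the slides' example reports average JCT 10.0 vs. 11.7).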
