SLAQ: Quality-Driven Scheduling for Distributed Machine Learning

Haoyu Zhang*, Logan Stafman*, Andrew Or, Michael J. Freedman

“AI is the new electricity.”
• Machine translation
• Recommendation system
• Autonomous driving
• Object detection and recognition
[Figure: types of learning: supervised, unsupervised, transfer, reinforcement]

ML algorithms are approximate
• ML model: a parametric transformation $f_\theta : X \to Y$
• maps input variables $X$ to output variables $Y$
• typically contains a set of parameters $\theta$
• Quality: how well the model maps input to the correct output
• Loss function: discrepancy between model output and ground truth

Training ML models: an iterative process
[Figure: a job sends tasks to workers; workers hold model replicas and data shards and update the model]
• Training algorithms iteratively minimize a loss function
• E.g., stochastic gradient descent (SGD), L-BFGS

Training ML models: an iterative process
[Figure: Loss Reduction % vs. Cumulative Time % for LogReg, LDA, MLPC, and SVM]
• Quality improvement is subject to diminishing returns
• More than 80% of the work is done in 20% of the time

Exploratory ML training: not a one-time effort
[Figure: training workflow with the steps Collect Data, Extract Features, Train ML Models, Adjust Feature Space, Tune Hyperparameters, Restructure Models]
• Train model multiple times for exploratory purposes
• Provide early feedback, direct model search for high quality models

How to schedule multiple training jobs on shared cluster?
[Figure: a scheduler assigning workers to Job #1, Job #2, and Job #3]
• Key features of ML jobs
• Approximate
• Diminishing returns
• Exploratory process
• Problem with resource fairness scheduling
• Jobs in early stage: could benefit a lot from additional resources
• Jobs almost converged: make only marginal improvement

SLAQ: quality-aware scheduling
• Intuition: in the context of approximate ML training, more resources should be allocated to jobs that have the most potential for quality improvement
[Figure: Accuracy vs. Time and Loss vs. Time, comparing Quality-Aware and Fair Resource allocation]

Solution Overview
• Normalize quality metrics
• Predict quality improvement
• Quality-driven scheduling

Normalizing quality metrics
[Table: candidate quality metrics rated against four criteria]
Criteria: Applicable to All Algorithms? Comparable Magnitudes? Known Range? Predictable?
Metrics: Accuracy / F1 Score / Area Under Curve / Confusion Matrix / etc.; Loss; Normalized Loss; ∆ Loss; Normalized ∆ Loss

Normalizing quality metrics
• Normalize change of loss values w.r.t. largest change so far
• Currently does not support some non-convex optimization algorithms
[Figure: Normalized ∆ Loss vs. Iteration for K-Means, SVM, SVMPoly, MLPC, LogReg, GBT, LDA, LinReg, and GBTReg]
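The normalization described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not SLAQ's code: the function name `normalized_loss_deltas` and the sign convention (positive when the loss decreases) are assumptions made for the example.

```python
def normalized_loss_deltas(losses):
    """Scale each per-iteration loss change by the largest change seen so far.

    `losses` is the sequence of loss values reported by one job.
    The sign convention (positive = loss went down) is an assumption.
    """
    normalized = []
    largest = 0.0  # largest absolute change observed up to this iteration
    for prev, curr in zip(losses, losses[1:]):
        delta = prev - curr
        largest = max(largest, abs(delta))
        # Guard against a job whose loss has not moved yet.
        normalized.append(delta / largest if largest > 0 else 0.0)
    return normalized


example = normalized_loss_deltas([10.0, 6.0, 4.0, 3.0, 3.5])
print(example)
```

Because every value is divided by a running maximum, jobs whose raw losses differ by orders of magnitude end up on a comparable scale, which is what lets a scheduler compare them.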

Training iterations: loss prediction
• Previous work: offline profiling / analysis [Ernest NSDI 16] [CherryPick NSDI 17]
• Overhead for frequent offline analysis is huge
• Strawman: use last ∆Loss as prediction for future ∆Loss
• SLAQ: online prediction using weighted curve fitting
[Figure: Prediction Error % (log scale), Strawman vs. Weighted Curve, for LDA, GBT, LinReg, SVM, MLPC, LogReg, and SVMPoly]
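One way to picture weighted curve fitting is the sketch below. It is an illustration under stated assumptions rather than SLAQ's actual predictor: the curve family `a / k + b`, the geometric `decay` weighting, and the name `predict_loss` are all choices made for this example. The slide only says that the fit is weighted and done online.

```python
import numpy as np


def predict_loss(losses, horizon=1, decay=0.8):
    """Predict the loss `horizon` iterations ahead with a weighted fit.

    Assumed model: loss(k) ~ a / k + b, fitted by weighted least squares.
    Recent iterations get more weight (newest weight 1, older ones decay).
    """
    k = np.arange(1, len(losses) + 1, dtype=float)
    y = np.asarray(losses, dtype=float)
    weights = decay ** (len(losses) - k)
    # The model is linear in (a, b), so ordinary least squares applies
    # after scaling rows by the square root of the weights.
    design = np.column_stack([1.0 / k, np.ones_like(k)])
    root_w = np.sqrt(weights)
    (a, b), *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return a / (len(losses) + horizon) + b


history = [2.0 / i + 0.5 for i in range(1, 11)]
predicted_next = predict_loss(history)
predicted_delta = history[-1] - predicted_next  # expected loss reduction
strawman_delta = history[-2] - history[-1]      # "last ∆Loss" baseline
print(predicted_next, predicted_delta, strawman_delta)
```

On a curve with diminishing returns the strawman tends to overestimate the next improvement, since the last observed ∆Loss is larger than the one that follows.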

Scheduling approximate ML training jobs
• Predict how much quality can be improved when assigning X workers to jobs
• Reassign workers to maximize quality improvement
[Figure: the scheduler uses predictions to allocate workers to Job #1, Job #2, and Job #3]
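The allocation step can be illustrated with a greedy sketch that hands out workers one at a time to the job with the largest predicted marginal gain. The greedy strategy, the `predict_gain` callback, and the toy gain curve are assumptions made for illustration; the slide states only that workers are reassigned to maximize quality improvement.

```python
import heapq


def allocate_workers(predict_gain, jobs, total_workers):
    """Greedily assign workers by predicted marginal quality gain.

    `predict_gain(job, n)` is a hypothetical callback returning the
    predicted loss reduction for `job` when it runs with `n` workers.
    """
    allocation = {job: 0 for job in jobs}
    # Max-heap via negated gains: (negative marginal gain, job name).
    heap = [(-(predict_gain(job, 1) - predict_gain(job, 0)), job) for job in jobs]
    heapq.heapify(heap)
    for _ in range(total_workers):
        if not heap:
            break
        _, job = heapq.heappop(heap)
        allocation[job] += 1
        n = allocation[job]
        marginal = predict_gain(job, n + 1) - predict_gain(job, n)
        heapq.heappush(heap, (-marginal, job))
    return allocation


# Toy gain curves with diminishing returns (made-up numbers).
potential = {"early": 1.0, "mid": 0.4, "converged": 0.05}


def toy_gain(job, n):
    return potential[job] * (1 - 0.5 ** n)


allocation = allocate_workers(toy_gain, list(potential), total_workers=8)
print(allocation)
```

When each job's gain curve is concave, as in the toy example, picking the best marginal gain at every step yields an allocation that maximizes the total predicted gain. The nearly converged job receives no workers here, which mirrors the contrast with fair sharing drawn earlier in the talk.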

Experiment setup
• Representative mix of training jobs
• Compare against a work-conserving fair scheduler

Algorithm                          Acronym  Type            Optimization Algorithm  Dataset
K-Means                            K-Means  Clustering      Lloyd Algorithm         Synthetic
Logistic Regression                LogReg   Classification  Gradient Descent        Epsilon [33]
Support Vector Machine             SVM      Classification  Gradient Descent        Epsilon
SVM (polynomial kernel)            SVMPoly  Classification  Gradient Descent        MNIST [34]
Gradient Boosted Tree              GBT      Classification  Gradient Boosting       Epsilon
GBT Regression                     GBTReg   Regression      Gradient Boosting       YearPredictionMSD [35]
Multi-Layer Perceptron Classifier  MLPC     Classification  L-BFGS                  Epsilon
Latent Dirichlet Allocation        LDA      Clustering      EM / Online Algorithm   Associated Press Corpus [36]
Linear Regression                  LinReg   Regression      L-BFGS                  YearPredictionMSD

Evaluation: resource allocation across jobs
• 160 training jobs submitted to the cluster following a Poisson distribution
• 25% jobs with high loss values
• 25% jobs with medium loss values
• 50% jobs with low loss values (almost converged)
[Figure: Share of Cluster CPUs (%) vs. Time (seconds) for the Top 25%, Second 25%, and Bottom 50% of jobs]

Evaluation: cluster-wide quality and time
• Quality: SLAQ’s average loss is 73% lower than that of the fair scheduler
• Time: SLAQ reduces time to reach 90% (95%) loss reduction by 45% (30%)
[Figure: Loss vs. Time (seconds), Fair Resource vs. SLAQ]
[Figure: Time (seconds) vs. Loss Reduction %, Fair Resource vs. SLAQ]

SLAQ Evaluation: Scalability
• Frequently reschedule and reconfigure in reaction to changes of progress
• Even with thousands of concurrent jobs, SLAQ makes rescheduling decisions in just a few seconds
[Figure: Scheduling Time (s) vs. Number of Workers (1000 to 16000) for 1000, 2000, 3000, and 4000 jobs]

Conclusion
• SLAQ leverages the approximate and iterative ML training process
• Highly tailored prediction for iterative job quality
• Allocate resources to maximize quality improvement
• SLAQ achieves better overall quality and end-to-end training time

Training iterations: runtime prediction
• Iteration runtime: $c \cdot S / N$
• Model complexity $c$, data size $S$, number of workers $N$
• Model update (i.e., size of $\Delta w$) is comparably much smaller
[Figure: Iteration Time (s) vs. Number of Cores (32 to 256)]
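The runtime model on this slide (iteration time proportional to model complexity times data size, divided by the number of workers) can be exercised with a short sketch. The symbol names `c`, `S`, `N`, both function names, and the sample numbers are assumptions introduced here for illustration.

```python
def estimate_complexity(observed_seconds, data_size, workers):
    """Back out the complexity constant c from one measured iteration."""
    return observed_seconds * workers / data_size


def predict_iteration_seconds(complexity, data_size, workers):
    """Predicted iteration runtime under the model c * S / N."""
    return complexity * data_size / workers


# Made-up measurement: one iteration over 1,000,000 records took 8 s on 16 workers.
c = estimate_complexity(observed_seconds=8.0, data_size=1_000_000, workers=16)
predicted = predict_iteration_seconds(c, data_size=1_000_000, workers=64)
print(predicted)
```

The sketch ignores the cost of exchanging model updates, in line with the slide's remark that the update is much smaller than the data being processed.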
