
SLAQ: Quality-Driven Scheduling for Distributed Machine Learning



  1. SLAQ: Quality-Driven Scheduling for Distributed Machine Learning Haoyu Zhang*, Logan Stafman*, Andrew Or, Michael J. Freedman

  2. “AI is the new electricity.”
     • Machine translation
     • Recommendation system
     • Autonomous driving
     • Object detection and recognition
     [Figure: types of learning: supervised, unsupervised, transfer, reinforcement]

  3. ML algorithms are approximate
     • ML model: a parametric transformation $X \xrightarrow{f_\theta} Y$

  4. ML algorithms are approximate
     • ML model: a parametric transformation $X \xrightarrow{f_\theta} Y$
       • maps input variables $X$ to output variables $Y$
       • typically contains a set of parameters $\theta$
     • Quality: how well the model maps input to the correct output
     • Loss function: discrepancy between model output and ground truth

  5. Training ML models: an iterative process
     • Training algorithms iteratively minimize a loss function
     • E.g., stochastic gradient descent (SGD), L-BFGS
     [Diagram: a job sends tasks to workers; each worker holds a model replica and a data shard and sends model updates back to the job]
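
The iterative loop on this slide can be illustrated with a minimal single-machine sketch. It assumes a one-feature linear model trained by plain SGD on squared error; the function name `train_sgd` and its parameters are hypothetical and are not taken from SLAQ or from any ML library.

```python
import random

def train_sgd(points, lr=0.1, epochs=50, seed=0):
    """Fit y ~ w*x + b with SGD; return (w, b, losses), one loss per epoch boundary."""
    rng = random.Random(seed)
    points = list(points)  # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0

    def mean_squared_loss():
        # Loss: discrepancy between model output and ground truth.
        return sum((w * x + b - y) ** 2 for x, y in points) / len(points)

    losses = [mean_squared_loss()]
    for _ in range(epochs):
        rng.shuffle(points)
        for x, y in points:
            err = w * x + b - y
            # Gradient step on the half squared error of one sample.
            w -= lr * err * x
            b -= lr * err
        losses.append(mean_squared_loss())
    return w, b, losses

# Hypothetical noise-free data drawn from y = 2x + 1.
points = [(i / 19, 2.0 * (i / 19) + 1.0) for i in range(20)]
w, b, losses = train_sgd(points)
```

Inspecting `losses` in this toy run shows the pattern the next slide describes: most of the reduction happens in the first few epochs.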

  6. Training ML models: an iterative process
     • Quality improvement is subject to diminishing returns
     • More than 80% of work done in 20% of time
     [Figure: Loss Reduction % vs. Cumulative Time % for LogReg, LDA, MLPC, and SVM]

  7. Exploratory ML training: not a one-time effort
     • Train model multiple times for exploratory purposes
     • Provide early feedback, direct model search for high quality models
     [Diagram: Collect Data, Extract Features, Train ML Models, with feedback steps Adjust Feature Space, Tune Hyperparameters, Restructure Models]

  8. How to schedule multiple training jobs on a shared cluster?
     • Key features of ML jobs
       • Approximate
       • Diminishing returns
       • Exploratory process
     • Problem with resource fairness scheduling
       • Jobs in early stage: could benefit a lot from additional resources
       • Jobs almost converged: make only marginal improvement
     [Diagram: a scheduler assigning workers to Job #1, Job #2, and Job #3]

  9. SLAQ: quality-aware scheduling
     • Intuition: in the context of approximate ML training, more resources should be allocated to jobs that have the most potential for quality improvement
     [Figure: Accuracy vs. Time and Loss vs. Time, comparing Quality-Aware and Fair Resource scheduling]

  10. Solution Overview
     • Normalize quality metrics
     • Predict quality improvement
     • Quality-driven scheduling

  11. Normalizing quality metrics
     Candidate metrics, compared on four criteria (Applicable to All Algorithms? Comparable Magnitudes? Known Range? Predictable?):
     • Accuracy / F1 Score / Area Under Curve / Confusion Matrix / etc.
     • Loss
     • Normalized Loss
     • ∆Loss
     • Normalized ∆Loss
     [Table: the per-metric check marks did not survive text extraction]

  12. Normalizing quality metrics
     • Normalize change of loss values w.r.t. largest change so far
     • Currently does not support some non-convex optimization algorithms
     [Figure: Normalized ∆Loss vs. Iteration for K-Means, SVMPoly, MLPC, LogReg, GBT, LDA, LinReg, SVM, and GBTReg]
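
One way to read "normalize change of loss values w.r.t. largest change so far" is sketched below. This is an illustration under stated assumptions, not SLAQ's code: the helper name `normalized_delta_losses` is hypothetical, and keeping the sign of each change (so a loss increase yields a negative value) is a choice made here.

```python
def normalized_delta_losses(losses):
    """Divide each per-iteration loss change by the largest change magnitude seen so far."""
    normalized = []
    largest = 0.0
    for prev, cur in zip(losses, losses[1:]):
        delta = prev - cur                    # positive when the loss went down
        largest = max(largest, abs(delta))    # running maximum of |delta|
        normalized.append(delta / largest if largest > 0 else 0.0)
    return normalized

# Hypothetical loss trace: fast early progress, then a small regression.
print(normalized_delta_losses([10.0, 6.0, 4.0, 3.0, 3.5]))
# → [1.0, 0.5, 0.25, -0.125]
```

Because every value is scaled by the job's own largest change, jobs whose raw losses differ by orders of magnitude become comparable on a common scale.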

  13. Training iterations: loss prediction
     • Previous work: offline profiling / analysis [Ernest NSDI 16] [CherryPick NSDI 17]
       • Overhead for frequent offline analysis is huge
     • Strawman: use last ∆Loss as prediction for future ∆Loss
     • SLAQ: online prediction using weighted curve fitting
     [Figure: Prediction Error % (log scale) of Strawman vs. Weighted Curve for LDA, GBT, LinReg, SVM, MLPC, LogReg, and SVMPoly]
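
The idea of online prediction by weighted curve fitting can be sketched as follows. The curve family used here, loss(k) ≈ a/k + b, and the exponential weighting toward recent iterations are simplifying assumptions made for illustration; the slide does not specify the curves or weights SLAQ fits, and the names `fit_weighted_inverse`, `predict_loss`, and `decay` are hypothetical.

```python
def fit_weighted_inverse(losses, decay=0.9):
    """Weighted least-squares fit of loss(k) ~ a / k + b for k = 1..n.

    Recent iterations receive larger weights (weight = decay ** age).
    Requires at least two loss values.
    """
    n = len(losses)
    xs = [1.0 / k for k in range(1, n + 1)]           # feature: 1 / iteration
    ws = [decay ** (n - k) for k in range(1, n + 1)]  # newest point has weight 1
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, losses)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, losses))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def predict_loss(a, b, k):
    """Predicted loss at a future iteration k under the fitted curve."""
    return a / k + b

# Hypothetical history of 10 iterations, then a prediction for iteration 20.
history = [2.0 / k + 0.5 for k in range(1, 11)]
a, b = fit_weighted_inverse(history)
predicted = predict_loss(a, b, 20)
```

The strawman on the slide corresponds to reusing only the last observed ∆Loss; fitting a curve uses the whole history while still emphasizing recent behavior.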

  14. Scheduling approximate ML training jobs
     • Predict how much quality can be improved when X workers are assigned to a job
     • Reassign workers to maximize quality improvement
     [Diagram: the scheduler uses predictions to decide the resource allocation of workers across Job #1, Job #2, and Job #3]
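
A small sketch of "reassign workers to maximize quality improvement" follows. It assumes a greedy strategy that hands out one worker at a time to the job with the largest predicted marginal gain; the slide does not state the exact procedure, and `allocate_workers` and the per-job gain functions are hypothetical stand-ins for the scheduler's predictions.

```python
import heapq

def allocate_workers(gain_fns, total_workers):
    """Greedily assign workers by predicted marginal quality gain.

    gain_fns maps a job name to a function n_workers -> predicted quality gain.
    Every job starts with one worker, so total_workers must cover all jobs.
    """
    if total_workers < len(gain_fns):
        raise ValueError("need at least one worker per job")
    alloc = {job: 1 for job in gain_fns}
    remaining = total_workers - len(alloc)
    # Max-heap (via negated values) of the gain from giving each job one more worker.
    heap = [(-(fn(2) - fn(1)), job) for job, fn in gain_fns.items()]
    heapq.heapify(heap)
    while remaining > 0 and heap:
        neg_gain, job = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # no job is predicted to benefit from another worker
        alloc[job] += 1
        remaining -= 1
        n, fn = alloc[job], gain_fns[job]
        heapq.heappush(heap, (-(fn(n + 1) - fn(n)), job))
    return alloc

# Hypothetical predictions with diminishing returns in the number of workers.
def make_gain(scale):
    return lambda n: scale * (1 - 0.5 ** n)

jobs = {"early": make_gain(1.0), "mid": make_gain(0.4), "late": make_gain(0.05)}
print(allocate_workers(jobs, 8))
# → {'early': 4, 'mid': 3, 'late': 1}
```

In this toy run the nearly converged job keeps only its single starting worker, while the job with the most remaining potential receives the largest share.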

  15. Experiment setup
     • Representative mix of training jobs with
     • Compare against a work-conserving fair scheduler

     | Algorithm | Acronym | Type | Optimization Algorithm | Dataset |
     |---|---|---|---|---|
     | K-Means | K-Means | Clustering | Lloyd Algorithm | Synthetic |
     | Logistic Regression | LogReg | Classification | Gradient Descent | Epsilon [33] |
     | Support Vector Machine | SVM | Classification | Gradient Descent | Epsilon |
     | SVM (polynomial kernel) | SVMPoly | Classification | Gradient Descent | MNIST [34] |
     | Gradient Boosted Tree | GBT | Classification | Gradient Boosting | Epsilon |
     | GBT Regression | GBTReg | Regression | Gradient Boosting | YearPredictionMSD [35] |
     | Multi-Layer Perceptron Classifier | MLPC | Classification | L-BFGS | Epsilon |
     | Latent Dirichlet Allocation | LDA | Clustering | EM / Online Algorithm | Associated Press Corpus [36] |
     | Linear Regression | LinReg | Regression | L-BFGS | YearPredictionMSD |

  16. Evaluation: resource allocation across jobs
     • 160 training jobs submitted to cluster following Poisson distribution
       • 25% jobs with high loss values
       • 25% jobs with medium loss values
       • 50% jobs with low loss values (almost converged)
     [Figure: Share of Cluster CPUs (%) vs. Time (seconds) for the Top 25%, Second 25%, and Bottom 50% of jobs]

  17. Evaluation: cluster-wide quality and time
     • Quality: SLAQ’s average loss is 73% lower than that of the fair scheduler
     • Time: SLAQ reduces time to reach 90% (95%) loss reduction by 45% (30%)
     [Figure, Quality: Loss vs. Time (seconds) for Fair Resource and SLAQ]
     [Figure, Time: Time (seconds) vs. Loss Reduction % for Fair Resource and SLAQ]

  18. SLAQ Evaluation: Scalability
     • Frequently reschedule and reconfigure in reaction to changes of progress
     • Even with thousands of concurrent jobs, SLAQ makes rescheduling decisions in just a few seconds
     [Figure: Scheduling Time (s) vs. Number of Workers (1000 to 16000) for 1000, 2000, 3000, and 4000 jobs]

  19. Conclusion
     • SLAQ leverages the approximate and iterative ML training process
     • Highly tailored prediction for iterative job quality
     • Allocate resources to maximize quality improvement
     • SLAQ achieves better overall quality and end-to-end training time
     [Figures: Share of Cluster CPUs (%) vs. Time (seconds) by job group; Prediction Error % of Strawman vs. Weighted Curve by algorithm]

  20. Training iterations: runtime prediction
     • Iteration runtime: $c \cdot S / N$
       • Model complexity $c$, data size $S$, number of workers $N$
     • Model update (i.e., size of $\Delta\theta$) is comparably much smaller
     [Figure: Iteration Time (s) vs. Number of Cores (32 to 256)]
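
The runtime model on this slide (iteration time proportional to model complexity times data size, divided by the number of workers) can be exercised with a short sketch. The function names and the sample numbers are hypothetical; the sketch assumes the complexity constant is estimated from a single observed iteration and that communication of model updates is negligible, as the slide suggests.

```python
def estimate_complexity(observed_time, data_size, workers):
    """Back out the complexity constant from one observed iteration: c = t * N / S."""
    return observed_time * workers / data_size

def predict_iteration_time(complexity, data_size, workers):
    """Predicted iteration time under the model t = c * S / N."""
    return complexity * data_size / workers

# Hypothetical observation: one iteration over 1000 data units took 10 s on 4 workers.
c = estimate_complexity(10.0, 1000, 4)
# Doubling the workers should roughly halve the predicted iteration time.
t8 = predict_iteration_time(c, 1000, 8)
```

Combined with a per-iteration loss prediction, such a runtime estimate lets a scheduler translate "N workers for the next scheduling interval" into an expected number of iterations and therefore an expected loss reduction.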
