HyperSched: Deadline-aware Scheduler for Model Development (PowerPoint PPT Presentation)


  1. HyperSched: Deadline-aware Scheduler for Model Development. Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov

  2. Data Science @ Boogle Inc.

  3. Learning Rate? Momentum?? Network Size? Preprocessing Parameters??? Featurization?????

  4. How to optimize? Try Random Search.
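     A minimal sketch of random search over the knobs listed two slides back. The search space and the train_and_evaluate callback are invented for illustration, not code from the talk:

     import random

     # Hypothetical search space for the hyperparameters named above.
     SPACE = {
         "learning_rate": lambda: 10 ** random.uniform(-4, -1),
         "momentum":      lambda: random.uniform(0.8, 0.99),
         "network_size":  lambda: random.choice([64, 128, 256]),
     }

     def random_search(num_trials, train_and_evaluate):
         """Sample configurations independently and keep the best one."""
         best_config, best_score = None, float("-inf")
         for _ in range(num_trials):
             config = {name: sample() for name, sample in SPACE.items()}
             score = train_and_evaluate(config)  # returns final accuracy
             if score > best_score:
                 best_config, best_score = config, score
         return best_config, best_score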

  5. Trials (sets of hyperparameters to evaluate). Terri is faced with the decision of choosing the right level of parallelism. [Figures: accuracy vs. time; # GPUs vs. time]

  6. Scheduling Problem? DEADLINES EXIST. [Figures: accuracy vs. time; # GPUs vs. time]

  7. Scheduling Problem: an exploration problem and an exploitation problem. Given finite time and compute resources, evaluate many random trials (configurations) to obtain the best trained model. (Contrast with prior schedulers that increase DL cluster efficiency [OSDI 2018] or improve job completion time [NSDI 2019, EuroSys 2018].)

  8. HyperSched is an application-level scheduler for model development. It balances explore and exploit by adaptively allocating resources based on:
     • Awareness of resource constraints (# GPUs, time)
     • Awareness of training objectives (accuracy over time)
     (A toy illustration of this balance follows.)
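     As a toy illustration only (this is not HyperSched's published policy; running_trials, deadline, and the 50% switch point are all invented for the sketch), a deadline-aware allocator might spread GPUs across trials early and concentrate them on the leader late:

     def allocate(running_trials, total_gpus, elapsed, deadline):
         """Toy deadline-aware policy (NOT HyperSched's algorithm):
         spread GPUs while exploring, then concentrate them on the
         current leader as the deadline approaches."""
         time_left = deadline - elapsed
         if time_left > 0.5 * deadline:
             # Explore: one GPU per trial, as many trials as GPUs fit.
             return {t: 1 for t in running_trials[:total_gpus]}
         # Exploit: give every GPU to the best trial seen so far.
         leader = max(running_trials, key=lambda t: t.accuracy)
         return {leader: total_gpus}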

  9. Properties/Assumptions of model development workloads. Model development consists of evaluating many trials:
     • Each trial is iterative and returns intermediate results (accuracy over time)
     • Trials can be checkpointed during training
     • All trials share the same objective; we care only about 1 model
     • Model training can be accelerated by parallelizing/distributing its workload (data parallelism)
     (These assumptions map onto the small trial interface sketched below.)
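     A minimal sketch of that interface, assuming a hypothetical Trial class; this is illustrative, not HyperSched's actual API:

     import copy

     class Trial:
         """A checkpointable, iterative training job (toy version)."""

         def __init__(self, config, num_gpus=1):
             self.config = config        # one set of hyperparameters
             self.num_gpus = num_gpus    # degree of data parallelism
             self.iter = 0
             self.accuracy = 0.0
             self.paused = False

         def run_one_epoch(self):
             # Placeholder for one epoch of (possibly distributed)
             # training; returns an intermediate result.
             self.iter += 1
             return self.accuracy

         def checkpoint(self):
             # Trials can be checkpointed during training.
             return copy.deepcopy(self.__dict__)

         def restore(self, state):
             self.__dict__.update(copy.deepcopy(state))

         def scale_to(self, num_gpus):
             # Data parallelism: training speeds up with more GPUs.
             self.num_gpus = num_gpus

         def pause(self):
             # Yield resources so new trials can start.
             self.paused = True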

  10. How to use allocation for exploration and exploitation? [Figure: # GPUs vs. time]

  11. Naive Approach: Static Space/Time Allocation. Split the GPU-time budget into an exploration phase followed by an exploitation phase. [Figure: # GPUs vs. time, partitioned into exploration and exploitation regions] (A sketch of this split follows.)
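     A minimal sketch of the static split, reusing the toy Trial above (make_trial, the 50/50 split, and the wall-clock loop are assumptions for illustration):

     import time

     def static_allocation(configs, total_gpus, deadline_s, make_trial,
                           explore_frac=0.5):
         """Phase 1: every config gets 1 GPU for a fixed time slice.
         Phase 2: the early leader gets all GPUs until the deadline."""
         start = time.time()
         trials = [make_trial(c, num_gpus=1) for c in configs]
         # Exploration phase: equal resources, fixed window.
         while time.time() - start < explore_frac * deadline_s:
             for t in trials:
                 t.run_one_epoch()
         # Exploitation phase: commit to the current best trial -- risky,
         # since initial performance is a weak proxy (next slide).
         best = max(trials, key=lambda t: t.accuracy)
         best.scale_to(total_gpus)
         while time.time() - start < deadline_s:
             best.run_one_epoch()
         return best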

  12. Naive Approach: Static Space/Time Allocation. Problem: initial performance is a weak proxy of final behavior. [Figure: 4-layer CNN on CIFAR-10; Mukkamala, ICML 2017]

  13. Naive Solution: Static Space/Time Allocation. Underallocate exploration… [Figure: # GPUs vs. time]

  14. Naive Solution: Static Space/Time Allocation. …or underallocate exploitation. [Figure: # GPUs vs. time]

  15. Naive Solution: Static Space/Time Allocation. Main problem: we cannot rely on initial performance.

  16. Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li2018]
     • A distributed hyperparameter tuning algorithm based on optimal resource allocation
     • SOTA results over other existing algorithms
     • Deployed in many AutoML offerings today

  17. Better Solution: ASHA [Li2018], simplified representation. [Figures: promotion ladder at r, η·r, η²·r epochs; accuracy vs. time; # GPUs vs. time]
     • r: minimum number of epochs per rung
     • R: maximum number of epochs
     • η (eta): balances explore/exploit (the top 1/η of trials at each rung are promoted)
     • Intuition: progressively allocate more resources to promising trials

     LIMIT = r
     while trial.iter < R:
         trial.run_one_epoch()
         if trial.iter == LIMIT:
             if is_top(trial, LIMIT, 1/η):
                 LIMIT *= η
             else:  # allow new trials to start
                 trial.pause(); break

     (A runnable version of this loop follows.)
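     Filling in the slide's loop as a runnable sketch, reusing the toy Trial above. The rung bookkeeping, the concrete is_top, and the default r/R/eta values are assumptions; the slide leaves them abstract, and this follows the simplified loop rather than the full ASHA paper:

     from collections import defaultdict

     rungs = defaultdict(list)  # rung limit -> scores of trials that reached it

     def is_top(trial, limit, fraction):
         """True iff this trial is within the top `fraction` at this rung."""
         scores = sorted(rungs[limit], reverse=True)
         cutoff = scores[max(0, int(len(scores) * fraction) - 1)]
         return trial.accuracy >= cutoff

     def asha_worker(trial, r=1, R=81, eta=3):
         """Run one trial, promoting it while it stays in the top 1/eta."""
         limit = r
         while trial.iter < R:
             trial.run_one_epoch()
             if trial.iter == limit:
                 rungs[limit].append(trial.accuracy)
                 if is_top(trial, limit, 1 / eta):
                     limit *= eta       # promote: train to the next rung
                 else:
                     trial.pause()      # free resources for new trials
                     break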

  18. Better Solution: ASHA [Li2018]. Benefit: mitigates noisy initial performance through adaptive allocation. [Figure: accuracy vs. time]
