HyperSched: Deadline-aware Scheduler for Model Development
Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov
Data Science @ Boogle Inc.
Learning Rate? Momentum?? Network Size? Preprocessing Parameters??? Featurization?????
How to optimize? Try Random Search
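The idea above can be sketched in a few lines of Python. This is an illustrative sketch only, not HyperSched's code: the hyperparameter names and the `train_and_eval` callback are placeholders standing in for a real training job.

```python
import random

def sample_config():
    """Draw one random hyperparameter configuration (a 'trial')."""
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),  # log-uniform sample
        "momentum": random.uniform(0.8, 0.99),
        "network_size": random.choice([64, 128, 256]),
    }

def random_search(train_and_eval, num_trials=20):
    """Evaluate `num_trials` random configs; keep the best-scoring one.
    `train_and_eval` is a placeholder for training a model and returning
    its validation score."""
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config()
        score = train_and_eval(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Note that every trial here gets the same training budget, which is exactly the inefficiency the rest of the talk addresses.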
Trials (sets of hyperparameters to evaluate)
Terri is faced with the decision of choosing the right level of parallelism.
[Figure: accuracy of each trial over time; # GPUs allocated over time]
Scheduling Problem? DEADLINES EXIST
[Figure: accuracy over time; # GPUs over time, cut off at a deadline]
Scheduling Problem
Instead of improving DL cluster efficiency [OSDI 2018] or job completion time [NSDI 2019, EuroSys 2018], the goal is:
- Exploration problem: given finite time and compute resources, evaluate many random trials (configurations)…
- Exploitation problem: …to obtain the best trained model.
HyperSched is an application-level scheduler for model development.
- Balances explore and exploit by adaptively allocating resources based on:
  - Awareness of resource constraints (# GPUs, time)
  - Awareness of training objectives (accuracy over time)
Properties/Assumptions of model development workloads
Model development consists of evaluating many trials.
- Each trial is iterative and returns intermediate results.
- Trials can be checkpointed during training.
- All trials share the same objective; we care only about 1 model.
- Model training can be accelerated by parallelizing/distributing its workload (data parallelism).
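The first two assumptions can be made concrete with a minimal trial abstraction. This is a sketch of the interface the slides assume, not HyperSched's API; the accuracy-update rule inside `run_one_epoch` is a synthetic stand-in for real training.

```python
import copy

class Trial:
    """A trial: iterative training that reports intermediate results
    and can be checkpointed/restored at any epoch boundary."""

    def __init__(self, config):
        self.config = config
        self.iter = 0
        self.accuracy = 0.0

    def run_one_epoch(self):
        """One training iteration; returns an intermediate result.
        The update rule is synthetic: accuracy improves with
        diminishing returns, controlled by a placeholder 'lr' knob."""
        self.iter += 1
        self.accuracy += (1.0 - self.accuracy) * self.config["lr"]
        return self.accuracy

    def checkpoint(self):
        """Snapshot trial state (in practice: model weights + optimizer)."""
        return copy.deepcopy(self.__dict__)

    def restore(self, state):
        """Roll the trial back to a previous snapshot."""
        self.__dict__.update(copy.deepcopy(state))
```

Checkpointing is what lets a scheduler pause a middling trial, give its GPUs to a promising one, and resume the paused trial later without losing work.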
How to use allocation for exploration and exploitation?
Naive Approach: Static Space/Time Allocation
[Figure: # GPUs over time, split into an exploration phase followed by an exploitation phase]
Naive Approach: Static Space/Time Allocation
Problem: Initial performance is a weak proxy of final behavior.
(4-layer CNN on CIFAR-10 – Mukkamala, ICML 2017)
Naive Solution: Static Space/Time Allocation
Underallocate exploration…
… or underallocate exploitation.
Main problem: cannot rely on initial performance.
Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li 2018]
- Distributed hyperparameter tuning algorithm based on optimal resource allocation.
- State-of-the-art results over other existing algorithms.
- Deployed in many AutoML offerings today.
Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li 2018]
[Figure: simplified representation — accuracy and # GPUs over time, with promotions at epochs r, η·r, η²·r]
- r: min. epochs per trial
- R: max epochs per trial
- η (eta): balances explore/exploit
- Intuition: progressively allocate more resources to promising trials

Per-trial pseudocode:

    LIMIT = r
    while trial.iter < R:
        trial.run_one_epoch()
        if trial.iter == LIMIT:
            if is_top(trial, LIMIT, 1/η):
                LIMIT *= η
            else:  # allow new trials to start
                trial.pause(); break
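The slide's pseudocode can be fleshed out into a small runnable simulation of the promotion rule. This is a synchronous sketch for illustration only: real ASHA runs trials in parallel workers, and here the hidden `quality` field and the saturating score curve are synthetic stand-ins for actual training.

```python
import math

ETA = 3               # halving factor: keep the top 1/ETA at each rung
R_MIN, R_MAX = 1, 9   # min/max epochs per trial (r and R on the slide)

class Trial:
    def __init__(self, quality):
        self.quality = quality   # hidden "true" quality of this config
        self.iter = 0
        self.score = 0.0
        self.paused = False

    def run_one_epoch(self):
        # Synthetic learning curve: score saturates toward `quality`.
        self.iter += 1
        self.score = self.quality * (1 - math.exp(-self.iter))

def is_top(score, rung_scores, eta):
    """True if `score` is within the top 1/eta of scores seen at this rung."""
    k = max(1, len(rung_scores) // eta)
    return score >= sorted(rung_scores, reverse=True)[k - 1]

def run_asha_trial(trial, rung_history):
    """Run one trial following the slide's pseudocode. `rung_history`
    maps each rung (epoch limit) to scores recorded there so far."""
    limit = R_MIN
    while trial.iter < R_MAX:
        trial.run_one_epoch()
        if trial.iter == limit:
            scores = rung_history.setdefault(limit, [])
            scores.append(trial.score)
            if is_top(trial.score, scores, ETA):
                limit *= ETA            # promote: keep training to next rung
            else:                       # allow new trials to start
                trial.paused = True
                break
    return trial
```

Note the asynchronous flavor: the first trial is promoted unconditionally (it is trivially in the top 1/η of a one-element rung), so work starts immediately instead of waiting for a full bracket as in synchronous successive halving.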
Benefit: mitigates noisy initial performance through adaptive allocation.
[Figure: accuracy over time]