FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks
Hui Guan, Laxmikant Kishor Mokadam, Xipeng Shen, Seung-Hwan Lim, Robert Patton
Build an image classifier?
Training a Deep Neural Network (DNN) is a pipeline: images are read from storage, pre-processed on the CPU (decoding, rotation, cropping, ...), and then fed to training on the GPU. On top of that, the hyperparameters must be tuned:
• # layers
• # parameters in each layer
• learning rate scheduling
• ...
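To make the pipeline concrete, here is a minimal single-model sketch in PyTorch; it is an illustration added to this write-up, not code from the talk, and the dataset path, architecture, and hyperparameters are placeholders.

```python
# One training pipeline: storage -> CPU pre-processing -> GPU training.
import torch
import torchvision
from torchvision import transforms

preprocess = transforms.Compose([                    # runs in CPU worker processes
    transforms.RandomRotation(15),                   # rotation
    transforms.RandomCrop(224, pad_if_needed=True),  # cropping
    transforms.ToTensor(),                           # PIL image -> float tensor
])

# ImageFolder decodes each image file as it is loaded (placeholder path).
dataset = torchvision.datasets.ImageFolder("/path/to/images", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=8)

model = torchvision.models.resnet18(num_classes=10).cuda()  # one candidate architecture
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)     # one candidate learning rate
loss_fn = torch.nn.CrossEntropyLoss()

for images, labels in loader:                        # GPU side: train the DNN
    optimizer.zero_grad()
    loss = loss_fn(model(images.cuda()), labels.cuda())
    loss.backward()
    optimizer.step()
```

Hyperparameter tuning means repeating this whole pipeline for many such configurations, which is what motivates ensemble training.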
Ensemble Training
• Concurrently train a set of DNNs on a cluster of nodes.
• Each model (model 1 ... model N) has its own pre-processing and training pipeline reading from shared storage.
Preprocessing is redundant across the pipelines: every pipeline decodes and augments the same data from storage.
Pittman et al., 2018: eliminate pipeline redundancies in preprocessing through data sharing.
• Reduces CPU usage by 2-11X
• Achieves up to 10X speedups with 15% energy consumption
Pittman, Randall, et al. "Exploring flexible communications for streamlining DNN ensemble training pipelines." SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018.
Ensemble training with data sharing: a single pre-processing stage on the CPU feeds cached batches to all N training pipelines (model 1 ... model N) on the GPUs.
With data sharing, the training goes even slower!
Heterogeneous Ensemble: a set of DNNs with different architectures and configurations.
• Varying training rate
• Varying convergence speed
Varying training rate. Training rate: the compute throughput of the processing units used for training the DNN. Example: two DNNs in the ensemble train at 100 images/sec while a third trains at only 40 images/sec.
If a DNN consumes data more slowly (the 40 images/sec one in this example), the other DNNs have to wait for it before the current set of cached batches can be evicted. The slow DNN becomes the bottleneck.
Varying convergence speed. Due to differences in architectures and hyperparameters, some DNNs converge more slowly than others: for example, two DNNs converge in 40 epochs while a third needs 50 epochs.
Resources become under-utilized: a subset of the DNNs has already converged, while the shared preprocessing has to keep working for the remaining ones.
Our solution: FLEET
A flexible ensemble training framework for efficiently training a heterogeneous set of DNNs.
• Varying training rate → addressed with data-parallel distributed training
• Varying convergence speed → addressed with checkpointing
• 1.12 – 1.92X speedup
Contributions:
1. Optimal resource allocation
2. Greedy allocation algorithm
3. A set of techniques to solve challenges in implementing FLEET
Focus of This Talk
How FLEET allocates resources for a heterogeneous ensemble: the optimal resource allocation problem and the greedy allocation algorithm (contributions 1 and 2).
Resource Allocation Problem
Given the shared pipeline (one pre-processing stage on the CPU feeding DNN 1 ... DNN N on the GPUs):
• What is an optimal GPU allocation?
• Optimal CPU allocation: set the number of pre-processing processes to the smallest count that just meets the computing requirements of the training DNNs (a hedged sketch of this rule follows below).
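One hedged reading of the "just meets" rule, with illustrative names and rates (not FLEET's API): since data sharing pre-processes each image only once, the CPU side only has to keep up with the fastest consumer.

```python
import math

def num_preprocessing_processes(training_rates, images_per_sec_per_process):
    """Smallest number of CPU pre-processing processes whose combined throughput
    just meets the demand of the DNNs sharing the data (illustrative sketch)."""
    demand = max(training_rates)   # keep up with the fastest consumer so no GPU starves
    return max(1, math.ceil(demand / images_per_sec_per_process))

# e.g. DNNs training at 100, 80, 80, 40 images/sec and ~25 images/sec per CPU process:
print(num_preprocessing_processes([100, 80, 80, 40], 25))   # -> 4
```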
GPU Allocation
Example setup: four DNNs (DNN 1 - DNN 4) to train on two nodes with two GPUs each.
GPU Allocation: 1 GPU to 1 DNN
Training rates with one GPU each: DNN 1: 100 images/sec, DNN 2: 80 images/sec, DNN 3: 80 images/sec, DNN 4: 40 images/sec.
With data sharing, the slowest DNN determines the training rate of the ensemble training pipeline: here, 40 images/sec (see the one-line sketch below).
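The rule on this slide can be written as a one-liner (illustrative sketch):

```python
def pipeline_rate(training_rates):
    # With data sharing, every DNN consumes the same cached batches, so the
    # shared pipeline advances at the pace of the slowest member.
    return min(training_rates)

print(pipeline_rate([100, 80, 80, 40]))   # -> 40 images/sec for this allocation
```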
GPU Allocation: Different GPUs to Different DNNs
Another way to allocate GPUs: train only DNN 1 and DNN 4 together with data sharing, giving DNN 1 one GPU (100 images/sec) and DNN 4 three GPUs (105 images/sec).
• Reduces waiting time
• Increases utilization
Flotilla: a set of DNNs that are trained together with data sharing (e.g., DNN 1 and DNN 4).
We need to create a list of flotillas to train all the DNNs to convergence: e.g., flotilla 1 = {DNN 1, DNN 4}, flotilla 2 = {DNN 2, DNN 3}.
Optimal Resource Allocation
Given a set of DNNs to train and a cluster of nodes, find (1) the list of flotillas and (2) the GPU assignments within each flotilla such that the end-to-end ensemble training time is minimized. This problem is NP-hard.
Greedy Allocation Algorithm
Dynamically determine the list of flotillas, based on (1) whether each DNN has converged and (2) the training rate of each DNN. Once a flotilla is created, derive an optimal GPU assignment for it.
Greedy Allocation Algorithm (overview)
DNN ensemble → profile the training rates of each DNN on 1..m GPUs → create a new flotilla → assign GPUs to the DNNs in the flotilla → train the DNNs in the flotilla with data sharing → converged DNNs exit; repeat for the remaining DNNs (sketched below).
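The loop above can be sketched as follows; the four callables are assumed helpers standing in for profiling and Steps 1-3, not FLEET's actual API.

```python
def train_ensemble(dnns, num_gpus, profile, create_flotilla, assign_gpus, train):
    """Hedged sketch of the greedy outer loop (illustration only)."""
    # profile(dnn, num_gpus) is assumed to return images/sec on 1..num_gpus GPUs.
    rates = {dnn: profile(dnn, num_gpus) for dnn in dnns}
    remaining = set(dnns)
    while remaining:
        flotilla = create_flotilla(remaining, rates, num_gpus)  # Step 1: flotilla creation
        placement = assign_gpus(flotilla, rates)                # Step 2: GPU assignment
        converged = train(flotilla, placement)                  # Step 3: data-shared training;
        remaining -= converged    # train() is assumed to return the set of converged DNNs,
                                  # which exit; the rest form the next flotilla
```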
Greedy Allocation Algorithm: Profiling
Profile the training rates (images/sec) of each DNN when trained on 1 to 4 GPUs:

# GPUs     1     2     3     4
DNN 1    100   190   270   350
DNN 2     80   150   220   280
DNN 3     80   150   200   240
DNN 4     40    75   105   120
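For the sketches that follow, the profiled table can be kept as a simple lookup (the names RATES and rate are illustrative):

```python
# Profiled training rates (images/sec) for 1..4 GPUs, copied from the table above.
RATES = {
    "DNN1": [100, 190, 270, 350],
    "DNN2": [80, 150, 220, 280],
    "DNN3": [80, 150, 200, 240],
    "DNN4": [40, 75, 105, 120],
}

def rate(dnn, k):
    """Training rate of `dnn` when it runs data-parallel on k GPUs (k = 1..4)."""
    return RATES[dnn][k - 1]

print(rate("DNN4", 3))   # -> 105 images/sec, the figure used in the earlier example
```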
Greedy Allocation Algorithm: the three steps in the loop
• Create a new flotilla → Step 1: Flotilla Creation
• Assign GPUs to the DNNs in the flotilla → Step 2: GPU Assignment
• Train the DNNs in the flotilla with data sharing → Step 3: Model Training
Step 1: Flotilla Creation
Two guidelines:
#1: DNNs in the same flotilla should be able to reach a similar training rate when a proper number of GPUs is assigned to each DNN. → Reduces GPU waiting time.
#2: Pack as many DNNs as possible into one flotilla. → Avoids inefficiency due to sublinear scaling and allows more DNNs to share preprocessing.
Step 1: Flotilla Creation (example)
With 4 GPUs and the profiled rates above: DNN 1 gets 1 GPU (100 images/sec) and DNN 4 gets 3 GPUs (105 images/sec); the GPU budget goes 4 → 3 → 0, so the first flotilla is {DNN 1, DNN 4}. A sketch that reproduces this example follows below.
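Below is a hedged sketch of one way to implement the two guidelines so that it reproduces this example; it is an illustration, not FLEET's exact algorithm. The fastest single-GPU DNN sets a reference rate, the slower DNNs get the fewest GPUs that bring them close to that rate, and any leftover GPUs go to the currently slowest member.

```python
RATES = {  # same profiling table as in the earlier sketch (images/sec on 1..4 GPUs)
    "DNN1": [100, 190, 270, 350],
    "DNN2": [80, 150, 220, 280],
    "DNN3": [80, 150, 200, 240],
    "DNN4": [40, 75, 105, 120],
}

def create_flotilla(candidates, total_gpus, tolerance=0.9):
    """Return {dnn: gpu_count} for one flotilla (illustrative heuristic)."""
    by_rate = sorted(candidates, key=lambda d: RATES[d][0])   # slowest ... fastest
    reference = RATES[by_rate[-1]][0]                         # fastest single-GPU rate
    flotilla, gpus_left = {by_rate[-1]: 1}, total_gpus - 1    # reference DNN takes 1 GPU
    for dnn in by_rate[:-1]:                                  # try to match slow DNNs first
        need = next((k for k in range(1, gpus_left + 1)
                     if RATES[dnn][k - 1] >= tolerance * reference), None)
        if need is None:          # cannot catch up with the GPUs left -> later flotilla
            continue
        flotilla[dnn] = need
        gpus_left -= need
        if gpus_left == 0:
            break
    while gpus_left > 0:          # hand leftover GPUs to the currently slowest member
        slowest = min(flotilla, key=lambda d: RATES[d][flotilla[d] - 1])
        flotilla[slowest] += 1
        gpus_left -= 1
    return flotilla

print(create_flotilla(["DNN1", "DNN2", "DNN3", "DNN4"], total_gpus=4))
# -> DNN 1 with 1 GPU and DNN 4 with 3 GPUs, the flotilla shown on this slide
print(create_flotilla(["DNN2", "DNN3"], total_gpus=4))
# -> DNN 2 and DNN 3 with 2 GPUs each (the members of flotilla 2 from the earlier slide)
```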
Step 2: GPU Assignment
#1: When assigning multiple GPUs to a DNN, try to use GPUs in the same node.
#2: Try to assign DNNs that need a smaller number of GPUs to the same node.
→ Reduces the variation in communication latency.
Example: DNN 4's three GPUs fill both GPUs of one node plus one GPU of the other node; DNN 1 takes the remaining GPU (a sketch follows below).
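A hedged sketch of the two placement guidelines (illustration only, not FLEET's implementation): place the DNNs that need the most GPUs first so each spans as few nodes as possible, and let the small DNNs fill the remaining slots together.

```python
def assign_gpus(flotilla, gpus_per_node, num_nodes):
    """flotilla: {dnn: gpu_count}. Returns {dnn: [(node, gpu_slot), ...]}."""
    free = {node: list(range(gpus_per_node)) for node in range(num_nodes)}
    placement = {}
    for dnn, need in sorted(flotilla.items(), key=lambda kv: -kv[1]):  # biggest first
        slots = []
        # Prefer the node with the most free GPUs so one DNN spans as few nodes as possible.
        for node in sorted(free, key=lambda n: -len(free[n])):
            while free[node] and len(slots) < need:
                slots.append((node, free[node].pop()))
            if len(slots) == need:
                break
        placement[dnn] = slots
    return placement

# The example flotilla on two 2-GPU nodes: DNN 4 fills one node and one GPU of the
# other; DNN 1 takes the GPU that is left.
print(assign_gpus({"DNN1": 1, "DNN4": 3}, gpus_per_node=2, num_nodes=2))
```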
Steps 1 and 2 (flotilla creation and GPU assignment) rely on data-parallel distributed training, which addresses the varying training rates: a slow DNN can be given more GPUs to catch up with the rest of its flotilla.
Step 3: Model Training
Model training uses checkpointing, which addresses the varying convergence speed: when some DNNs in a flotilla converge, the unconverged ones are checkpointed and resume in a later flotilla, so the shared preprocessing does not keep serving finished models.
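As a hedged illustration of the checkpointing step (standard PyTorch save/load, not necessarily FLEET's exact mechanism): a DNN that has not converged when its flotilla is dissolved saves its state and resumes inside a later flotilla.

```python
import torch

def save_checkpoint(model, optimizer, epoch, path):
    """Persist a DNN's state when its flotilla is dissolved."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, path)

def load_checkpoint(model, optimizer, path):
    """Restore the DNN inside a later flotilla; returns the epoch to resume from."""
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]
```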