Scalable Distributed Training with Parameter Hub: a whirlwind tour
[Figure: the TVM stack. Optimization: AutoTVM, AutoVTA; IRs: High-Level Differentiable IR, Tensor Expression IR; backends: LLVM, CUDA, Metal, VTA; hardware: edge and cloud ASICs, FPGAs, and an FPGA fleet]
Active Topology Probing in your cloud: groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud or in your own cluster.
Parameter Hub: an optimized, topology-aware, and dynamic mechanism for inter-machine communication*
(*in the cloud-based training context)
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy
Deep learning constitutes an important workload in the cloud today. Major cloud providers all have an ecosystem for cloud-based learning.
Server demand for DL inference across data centers nearly quadrupled in less than 2 years. (Source: Facebook)
EC2 reclaims your GPU instances when it runs out of capacity.
Distributed Training
INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE
[Figure: timeline. Parameter server: A1 O1 A2 O2; Worker 1: F1 B1 F2 B2 F3 B3; Worker 2: F1 B1 F2 B2 F3 B3]
(F)orward pass, (B)ackward pass, (A)ggregation, (O)ptimization
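To make the exchange concrete, here is a minimal sketch of one synchronous data-parallel iteration with a central parameter server. The `ParameterServer` and `Worker` classes, the fake random gradients, and the plain SGD step are illustrative assumptions, not PHub's actual API.

```python
import numpy as np

class ParameterServer:
    def __init__(self, params):
        self.params = {k: v.copy() for k, v in params.items()}

    def aggregate_and_optimize(self, grads_per_worker, lr=0.1):
        # (A)ggregation: average each gradient across workers.
        # (O)ptimization: apply a plain SGD step.
        for key in self.params:
            avg = sum(g[key] for g in grads_per_worker) / len(grads_per_worker)
            self.params[key] -= lr * avg

class Worker:
    def compute_gradients(self, params, batch):
        # (F)orward and (B)ackward passes run independently on each
        # worker; a random array stands in for the real gradient here.
        return {k: np.random.randn(*v.shape) for k, v in params.items()}

params = {"w": np.zeros((4, 4))}
ps, workers = ParameterServer(params), [Worker(), Worker()]
for step in range(3):
    grads = [w.compute_gradients(ps.params, batch=None) for w in workers]
    ps.aggregate_and_optimize(grads)  # coordinated parameter exchange
```

The point of the timeline above is that the F/B stages are embarrassingly parallel across workers, while the A/O stages serialize on the network and the servers; the rest of this talk is about that second half.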
Distributed Training Today
IN THE CONTEXT OF THE CLOUD
[Figure: cluster topology: a network core above two ToR switches, with machines (some with GPUs) under each rack]
Distributed Training Today
FORWARD AND BACKWARD PASSES RUN ON THE WORKERS
[Figure: same topology; Worker 1 and PS 2 on one rack, PS 1 and Worker 2 on the other]
Distributed Training Today
AGGREGATION AND OPTIMIZATION RUN ON THE PARAMETER SERVERS
[Figure: same topology as above]
Distributed training is communication bound
[Figure: per-iteration time for ResNet-269 across GPU generations (GRID 520, 2012; K80, 2014; M60, 2015; V100, 2017), split into GPU-active time and GPU-idle time spent waiting on the network; y-axis 0 to 1.8 seconds]
- The problem gets worse over time: the bottleneck shifts from the GPU to the network.
- With modern GPUs, most of the time is spent on communication.
- Making GPUs faster will do little to increase throughput.
- Compute resources are wasted.
[Figure: the same trend holds for AlexNet, ResNet-269, GoogLeNet, and Inception V3]
Bottlenecks in DDNN training
MAPPING OF THE TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT
[Figure: cluster topology: network core, two ToR switches; Worker 1 and PS 2 share one rack, PS 1 and Worker 2 share the other]
Bottlenecks in DDNN training
FRAMEWORK BOTTLENECKS
[Figure: inside each worker, the training framework pipeline sits between the GPU and the network; same cluster topology as above]
[Figure: per-iteration time breakdown for ResNet-269, Inception, GoogLeNet, and AlexNet: compute; data copy and communication; aggregator; optimizer; synchronization and other overheads; x-axis 0 to 1.6 seconds]
Bottlenecks in DDNN training
BANDWIDTH BOTTLENECK
[Figure: same cluster topology; gradient traffic from both workers converges on the parameter servers]
Bottlenecks in cloud-based DDNN training
INSUFFICIENT BANDWIDTH
What is the minimum bandwidth required for each of the popular NNs so that communication does not bottleneck computation? (8 workers, GTX 1080 Ti, central parameter servers, MXNet)
- AlexNet: 1200 Gbps
- ResNet: 100 Gbps
- GoogLeNet / Inception: 40 Gbps
- Cloud bandwidth today: 10-25 Gbps
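As a sanity check on these numbers, a back-of-the-envelope sketch: with central parameter servers, every iteration each worker pushes its full gradient and pulls the full updated model, so the bandwidth needed to hide communication behind compute is roughly 2 × model size / compute time. The model sizes and iteration times below are rough illustrative assumptions, not the talk's measured values.

```python
# Rough lower bound on bandwidth so that pushing gradients and pulling
# the updated model can fully overlap with compute. Model sizes and
# per-iteration compute times are illustrative assumptions only.
def min_bandwidth_gbps(model_mb, compute_ms, transfers=2):
    bits = model_mb * 8e6 * transfers        # push + pull per iteration
    return bits / (compute_ms * 1e-3) / 1e9  # bits per second -> Gbps

for name, mb, ms in [("AlexNet", 240, 30), ("ResNet-269", 400, 300)]:
    print(f"{name}: ~{min_bandwidth_gbps(mb, ms):.0f} Gbps per worker")
```

Per-worker numbers on this order, multiplied by 8 workers converging on a central group of servers, are consistent with the aggregate figures above and far exceed the 10-25 Gbps the cloud provides.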
Bottlenecks in cloud-based DDNN training
DEPLOYMENT-RELATED OVERHEAD
[Figure: measured pairwise bandwidth matrix between 8 hosts, roughly 8.9 Gbps intra-rack vs 4.7 Gbps cross-rack; color scale 4 to 9 Gbps; Cluster 1: hosts 1, 3, 4, 5, 7; Cluster 2: hosts 2, 6, 8]
- Transient congestion, or oversubscription by design.
- Cross-rack communication cost is higher than intra-rack communication.
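To make "topology-aware" concrete before the next slide, here is a minimal sketch of the kind of placement decision these measurements motivate: prefer exchanging gradients with a server in the same rack, since intra-rack links (about 8.9 Gbps above) run roughly twice as fast as cross-rack ones (about 4.7 Gbps). The rack-map format and the greedy policy are illustrative assumptions, not PHub's actual placement algorithm.

```python
# Illustrative only: steer each worker's gradient traffic to a
# same-rack parameter server when one exists.
rack_of = {1: "A", 3: "A", 4: "A", 5: "A", 7: "A",  # cluster 1
           2: "B", 6: "B", 8: "B"}                   # cluster 2

def pick_server(worker, servers):
    same_rack = [s for s in servers if rack_of[s] == rack_of[worker]]
    return (same_rack or servers)[0]  # fall back to any server

print(pick_server(worker=3, servers=[7, 2]))  # -> 7, same rack as host 3
```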
Parameter Hub Optimizations
CODESIGNING SOFTWARE AND HARDWARE WITH THE CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING
[Figure: same cluster topology as above]
Eliminating framework bottlenecks
PHUB OPTIMIZATIONS: STREAMLINING THE DDNN TRAINING PIPELINE
[Figure: the per-gradient pipeline (GPU, data copy, aggregation, optimization, network) overlaid on the same cluster topology]
Software Optimizations
[Figure: on the parameter server, gradients flow from the network into memory and are processed by the CPU; same cluster topology as above]
Software Optimizations
GRADIENT AGGREGATION AND OPTIMIZATION
Four candidate aggregation schemes:
- Each core reads the input queues from different workers and writes to different locations in the output queue. Requires synchronization.
- For each input queue, launch a series of threads for aggregation; this is what MXNet uses (Wide Aggregation). Too much coherence traffic and synchronization.
- Sequentially aggregate the same portion of the gradients within each queue (Tall Aggregation). Great locality; no synchronization.
- Organize processors into a hierarchy (NUMA 0 / NUMA 1) and perform NUMA-aware tree reduction. Great locality; no synchronization.
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
- Chunk a gradient into a series of virtual gradients deterministically.
- A virtual gradient is mapped to a particular core on the server.
- Virtual gradients are transferred independently (see the sketch below).
[Figure: core mappings for the gradient array for key 0 from 8 workers]
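Here is a minimal sketch of tall aggregation under stated assumptions: gradients arrive as flat float32 arrays, the chunk-to-core mapping is a simple modulo, and the optimizer is plain SGD fused into the aggregation loop. It illustrates the scheme, not PHub's implementation.

```python
import numpy as np

CHUNK = 1024      # elements per virtual gradient (assumed size)
NUM_CORES = 4

def chunks_for_core(core, total_chunks):
    # Deterministic mapping: virtual gradient i belongs to core i % NUM_CORES.
    return [i for i in range(total_chunks) if i % NUM_CORES == core]

def tall_aggregate(worker_grads, params, lr=0.1):
    # Each core owns fixed slices and walks them sequentially across all
    # workers: great locality, and no cross-core synchronization because
    # no two cores ever touch the same slice.
    out = np.empty_like(params)
    total_chunks = -(-len(params) // CHUNK)   # ceiling division
    for core in range(NUM_CORES):             # each iteration runs on one core
        for i in chunks_for_core(core, total_chunks):
            sl = slice(i * CHUNK, (i + 1) * CHUNK)
            acc = sum(g[sl] for g in worker_grads) / len(worker_grads)
            out[sl] = params[sl] - lr * acc   # fused aggregation + SGD step
    return out

params = np.zeros(5000, dtype=np.float32)
grads = [np.random.randn(5000).astype(np.float32) for _ in range(8)]
params = tall_aggregate(grads, params)
```

Because each virtual gradient is complete on its own, a chunk can be aggregated and optimized as soon as it arrives from the network, independently of the rest of the gradient.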