Parameter Hub: A Rack-Scale Parameter Server for Efficient Cloud-based Distributed Deep Neural Network Training
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee and Arvind Krishnamurthy
- DNN training is computationally expensive, so models need to be trained in a distributed fashion.
- People use the cloud for DDNN training: major cloud providers all have an ecosystem for cloud-based DDNN training.
Distributed Training
INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE
Diagram: a timeline in which each worker runs its own (F)orward and (B)ackward passes, while the parameter server performs (A)ggregation and (O)ptimization between iterations.
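The timeline above can be summarized in a few lines of code. Below is a minimal sketch of one synchronous iteration of data-parallel training with a central parameter server; the class names, the toy least-squares gradient and the learning rate are illustrative assumptions, not the API of MxNet or any other framework.

```python
# Minimal sketch of one synchronous data-parallel iteration with a central
# parameter server. Names and the toy gradient are illustrative only.
import numpy as np

class ParameterServer:
    def __init__(self, weights, lr=0.01):
        self.weights = weights            # current model parameters
        self.lr = lr                      # assumed SGD learning rate

    def aggregate_and_optimize(self, gradients):
        avg = sum(gradients) / len(gradients)   # (A)ggregation across workers
        self.weights -= self.lr * avg           # (O)ptimization step
        return self.weights                     # sent back to the workers

class Worker:
    def compute_gradient(self, weights, batch):
        # (F)orward + (B)ackward pass; here a toy least-squares gradient.
        x, y = batch
        return x.T @ (x @ weights - y) / len(y)

dim, workers = 4, [Worker(), Worker()]
ps = ParameterServer(np.zeros(dim))
batches = [(np.random.randn(8, dim), np.random.randn(8)) for _ in workers]
grads = [w.compute_gradient(ps.weights, b) for w, b in zip(workers, batches)]
new_weights = ps.aggregate_and_optimize(grads)
```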
Cloud-based Distributed Training Today
IN THE CONTEXT OF THE CLOUD
Diagram: machines with GPUs and CPU-only machines sit under top-of-rack (ToR) switches that are connected through the network core.
Cloud-based Distributed Training Today
FORWARD AND BACKWARD PASSES IN THE WORKERS
Diagram: Worker 1 and Worker 2 compute forward and backward passes under one ToR; PS 1 and PS 2 sit under another ToR across the network core.
Cloud-based Distributed Training Today
AGGREGATION AND OPTIMIZATION IN THE PARAMETER SERVERS
Diagram: gradients flow from the workers across the network core to PS 1 and PS 2, which perform aggregation and optimization.
DDNN training is communication bound
- The problem gets worse over time: the bottleneck shifts from compute to communication.
- With modern GPUs, most of the time is spent on communication.
- Making GPUs faster will do little to increase throughput.
- Compute resources are wasted while GPUs sit idle waiting on the network.
Chart: per-iteration time (seconds) for ResNet 269 on GRID 520 (2012), K80 (2014), M60 (2015) and V100 (2017), split into time when the GPU is idle waiting on the network versus time when the GPU and network are active.
DDNN training is communication bound
Chart: the same per-iteration time breakdown for AlexNet, ResNet 269, GoogleNet and Inception V3.
Bottlenecks in Cloud-based DDNN training
MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT
Diagram: workers and parameter servers spread across racks, connected through ToR switches and the network core.
Bottlenecks in Cloud-based DDNN training
FRAMEWORK BOTTLENECKS
Diagram: on each machine, the training framework sits between the GPU and the network; this software path is one source of inefficiency.
Bottlenecks in Cloud-based DDNN training
FRAMEWORK BOTTLENECKS
Chart: per-iteration time (0 to 1.6 seconds) for AlexNet, GoogleNet, Inception and ResNet 269, broken down into Compute, Data Copy and Communication, Aggregator, Optimizer, and Synchronization and other Overheads.
Bottlenecks in Cloud-based DDNN training
BANDWIDTH BOTTLENECK
Diagram: all gradient traffic from the workers converges on the parameter servers' links through the network core.
Bottlenecks in Cloud-based DDNN training
INSUFFICIENT BANDWIDTH
What is the minimum bandwidth required for each of the popular NNs so that communication does not bottleneck computation? (8 workers, GTX 1080 Ti, central parameter servers, MxNet)
- AlexNet: ~1200 Gbps
- ResNet: ~100 Gbps
- GoogleNet / Inception: ~40 Gbps
- Cloud bandwidth today: 10-25 Gbps
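These figures can be sanity-checked with a back-of-envelope calculation: every iteration, each worker pushes its gradients and pulls fresh parameters, and with central parameter servers all of that traffic converges on the PS side. The model size and per-iteration compute time below are rough assumptions for illustration, not the measurements behind the chart.

```python
# Back-of-envelope bandwidth estimate; model size and iteration time are
# assumed values, not measurements from the slide.
def required_bandwidth_gbps(model_mb, compute_time_s, num_workers):
    # Per iteration, each worker sends gradients and receives parameters
    # (~2x the model size); with a central PS this traffic is concentrated there.
    bits_per_iteration = 2 * model_mb * 8e6 * num_workers
    return bits_per_iteration / compute_time_s / 1e9

# Assumed: AlexNet with ~240 MB of parameters and ~30 ms of compute per
# iteration on a GTX 1080 Ti, 8 workers -> on the order of 1000 Gbps.
print(required_bandwidth_gbps(model_mb=240, compute_time_s=0.03, num_workers=8))
```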
Bottlenecks in Cloud-based DDNN training
DEPLOYMENT-RELATED OVERHEAD
Diagram: workers and parameter servers are placed in different racks, so parameter exchange must cross the network core.
Bottlenecks in Cloud-based DDNN training
DEPLOYMENT-RELATED OVERHEAD
- Transient congestion, or oversubscription by design.
- Cross-rack communication cost is higher than intra-rack communication.
- Communication is bottlenecked by the slowest link.
Pairwise bandwidth (Gbps) between 8 hosts, with Cluster 1 = hosts 1, 3, 4, 5, 7 and Cluster 2 = hosts 2, 6, 8; intra-cluster pairs reach roughly 9 Gbps while cross-cluster pairs see roughly 4.7 Gbps (each row lists bandwidth to hosts 1, 2, ... in order):
Host 2: 4.7
Host 3: 8.9, 4.7
Host 4: 8.9, 4.7, 8.9
Host 5: 8.9, 4.7, 8.9, 8.9
Host 6: 4.7, 9.0, 4.7, 4.7, 4.7
Host 7: 8.9, 4.7, 9.0, 8.9, 9.0, 4.7
Host 8: 4.7, 9.0, 4.7, 4.7, 4.7, 9.0, 4.7
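Because the parameter exchange in synchronous training completes only when the last worker finishes, a single oversubscribed cross-rack link sets the communication time for the whole iteration. A small sketch of that effect, reusing the roughly 8.9 versus 4.7 Gbps figures from the matrix above (the model size is an assumption):

```python
# Sketch: synchronous parameter exchange waits for the slowest link.
intra_rack_gbps = 8.9          # roughly full line rate within a rack
cross_rack_gbps = 4.7          # oversubscribed cross-rack links

def link_comm_time(model_gbits, link_gbps):
    # One worker pushes gradients and pulls parameters over its link.
    return 2 * model_gbits / link_gbps

model_gbits = 0.8              # assumed ~100 MB model
links = [intra_rack_gbps] * 5 + [cross_rack_gbps] * 3   # 8 hosts, two racks
times = [link_comm_time(model_gbits, g) for g in links]
print(max(times))              # whole exchange is gated by the 4.7 Gbps links
```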
Parameter Hub Optimizations
CODESIGNING SOFTWARE, HARDWARE AND CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING
Eliminating framework bottlenecks
PHub software optimizations: streamlining the DDNN training pipeline.
Diagram: the data copy, aggregation and optimization stages between the GPU and the network are the targets of these optimizations.
Software Optimizations
GRADIENTS
Diagram: on the parameter server, gradients arrive from the network, land in memory and are processed by the CPU.
Software Optimizations
GRADIENT AGGREGATION AND OPTIMIZATION
Four ways to aggregate the gradient queues (Q) arriving from the workers:
- Each core reads the input Q from different workers and writes to different locations in the output queue: requires synchronization.
- For each input Q, launch a series of threads for aggregation; this is what MxNet uses (Wide Aggregation): too much coherence traffic and synchronization.
- Sequentially aggregate the same portion of gradients within each queue (Tall Aggregation): great locality, no synchronization.
- Organize processors into a hierarchy and perform a NUMA-aware tree reduction across NUMA 0 and NUMA 1: great locality, no synchronization.
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
- Chunk a gradient into a series of virtual gradients deterministically.
- A virtual gradient is mapped to a particular core on the server (core mappings).
- Virtual gradients are transferred independently.
- A chunk is only processed by a single core, maintaining maximum locality.
Diagram: the gradient array for key 0 from 8 workers is split into chunks, each aggregated by its assigned core.
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
When aggregation of a chunk is done:
- PHub optimizes the chunk with the same core that aggregated it.
- This allows overlapping of aggregation, optimization and gradient transmission.
Diagram: the aggregated and optimized arrays for key 0 from 8 workers.
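A rough sketch of tall aggregation in Python (the real PHub code is native and pins chunks to physical cores; the chunk size, core count and plain SGD update here are assumptions for illustration):

```python
# Sketch of tall aggregation + optimization: the gradient for one key is split
# deterministically into chunks ("virtual gradients"); each chunk is owned by
# exactly one core, which aggregates it across workers and then runs the
# optimizer on the same chunk, keeping the data hot in that core's cache.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024      # elements per virtual gradient (assumed)
CORES = 4         # aggregation/optimization cores on the server (assumed)
LR = 0.01         # assumed SGD learning rate

def chunk_to_core(chunk_idx):
    # Deterministic mapping agreed on by all workers and the server.
    return chunk_idx % CORES

def process_chunk(chunk_idx, worker_grads, params):
    lo, hi = chunk_idx * CHUNK, (chunk_idx + 1) * CHUNK
    acc = np.zeros(hi - lo)
    for g in worker_grads:                # aggregate this chunk from all workers
        acc += g[lo:hi]                   # no other core touches [lo, hi)
    params[lo:hi] -= LR * acc / len(worker_grads)   # optimize on the same core

def run_core(core_id, worker_grads, params, n_chunks):
    # Each core handles only the chunks mapped to it, one after another.
    for c in range(n_chunks):
        if chunk_to_core(c) == core_id:
            process_chunk(c, worker_grads, params)

n_chunks = 8
params = np.zeros(n_chunks * CHUNK)                        # parameters for key 0
worker_grads = [np.random.randn(n_chunks * CHUNK) for _ in range(8)]
with ThreadPoolExecutor(max_workers=CORES) as pool:
    for core in range(CORES):
        pool.submit(run_core, core, worker_grads, params, n_chunks)
```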
Software Optimizations
NOT ENOUGH ON THEIR OWN!
A typical server configuration is unbalanced.
Diagram: along the path from the network into the server, link capacities are roughly 10 / 800 / 1100 / 14000 Gbps, so the 10 Gbps network link is the bottleneck.
Eliminating bandwidth bottlenecks
PBox hardware: balanced computation and communication resources.
Hardware Optimization
THE PBOX
- Balanced computation and communication.
- Extends the balance and locality notion across NUMA domains and NICs.
Diagram (animated build): NICs are added to the parameter server so that its aggregate network bandwidth grows (10, 20, ... Gbps) toward the machine's internal path capacities (roughly 800 / 1100 / 14000 Gbps).
Eliminating deployment bottlenecks
PHub hierarchical reduction: reducing cross-rack traffic.
PBox Deployment
RACK-SCALE PARAMETER SERVICE
Diagram: a cluster manager (CM) oversees racks of Worker/PS machines, each rack under its own ToR switch.
PBox Deployment
RACK-SCALE PARAMETER SERVICE
Diagram: a PBox is added to each rack, serving the Worker/PS machines under that rack's ToR.
Two-Phase Hierarchical Aggregation
ADAPTING TO THE DATACENTER NETWORK TOPOLOGY
1. Intra-rack central aggregation: each rack's PBox aggregates the gradients of the workers in its own rack.
2. Inter-rack aggregation: only the per-rack results cross the cluster network.
Result: N times traffic reduction across racks!
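A minimal sketch of the two-phase reduction (plain numpy sums standing in for PHub's wire protocol; the rack sizes and gradient length are arbitrary assumptions):

```python
# Sketch of two-phase hierarchical aggregation: phase 1 sums gradients inside
# each rack at that rack's PBox, so only one aggregated gradient per rack
# crosses the network core in phase 2.
import numpy as np

def intra_rack_aggregate(worker_grads):
    # Phase 1: central aggregation under the rack's ToR switch.
    return sum(worker_grads)

def inter_rack_aggregate(rack_sums):
    # Phase 2: only per-rack sums traverse the cluster network.
    return sum(rack_sums)

racks = [
    [np.random.randn(1000) for _ in range(4)],   # rack 1: 4 workers
    [np.random.randn(1000) for _ in range(4)],   # rack 2: 4 workers
]
rack_sums = [intra_rack_aggregate(r) for r in racks]
total_workers = sum(len(r) for r in racks)
avg_gradient = inter_rack_aggregate(rack_sums) / total_workers
# Cross-rack traffic is one gradient per rack instead of one per worker.
```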
Up to 2.7x performance in a 10 Gbps cloud-like environment
Chart: per-network speedup of PBox over the InfiniBand-enhanced MxNet baseline, ranging from about 1.3x to 2.7x across AlexNet, VGG 11, VGG 19, GoogleNet, Inception V3, ResNet 18, ResNet 50, ResNet 269 and ResNext 269.
Setup: 8 workers, GTX 1080 Ti; batch size 64 for ResNext 269, 128 for ResNet 269, 256 for all others.
Framework Bottlenecks
- Data Copy
- Aggregation and Optimization
- Synchronization
Chart: per-iteration time breakdown (0 to 1.6 seconds) for AlexNet, GoogleNet, Inception and ResNet 269.
Scalability
LINEAR SCALING IN A COMMUNICATION-ONLY BENCHMARK
Chart: memory bandwidth (MB/s, up to ~100000) as the number of active workers grows from 1 to 16; PHub training tracks the microbenchmark limit.
Scalability
PCI-E TO MEMORY SUBSYSTEM BRIDGE
Chart: the same memory-bandwidth scaling plot (microbenchmark limit versus PHub training, 1 to 16 active workers), annotated with "120 machines training ResNet 50".