HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning
Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li
ACM SIGCOMM Workshop on Network Meets AI & ML (NetAI)
Background: Distributed Machine Learning = Computation + Communication
Background: Strong computation power (GPU & TPU)
Background: Communication challenge. TCP suffers from high latency, low throughput, kernel overheads, etc.; RDMA is a promising alternative to TCP.
Background: An MNIST benchmark with 1 million parameters
Background: RoCE/RDMA has a multi-vendor ecosystem, but many problems arise in Fat-Tree-based deployment
Background: Fat-Tree-based deployment
1. PFC pause frame storm [SIGCOMM'15, '16; NS-3 simulation]
2. Resilient RoCE: performance sacrifice [Chelsio-Tech]
3. Synchronization performance
Background: Server-centric networks
1. Fewer hops lead to fewer PFC pause frames
2. Servers prevent the cascading effect of PFC pause frames
Background: Synchronization algorithms
1. PS-based (Pull + Push)
2. Mesh-based (Diffuse + Collect)
3. Ring-based (Scatter + Gather); see the allreduce sketch after this list
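Ring-based Scatter + Gather is also the building block HiPS reuses inside each group. Below is a minimal NumPy sketch of a simulated ring allreduce; the chunk schedule and the in-memory "sends" are illustrative assumptions, not the paper's RDMA implementation.

import numpy as np

def ring_allreduce(grads):
    # Simulated ring allreduce over equal-length 1-D gradients, one per worker.
    # Phase 1 (scatter-reduce) leaves each worker with one fully reduced chunk;
    # phase 2 (allgather) circulates those chunks around the ring.
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Scatter-reduce: in step s, worker i sends chunk (i - s) mod n to its
    # ring successor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Allgather: in step s, worker i forwards chunk (i + 1 - s) mod n, which is
    # already fully reduced, to its successor, overwriting the stale copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    # Every worker now holds the element-wise sum of all gradients.
    return [np.concatenate(ch) for ch in chunks]

if __name__ == "__main__":
    workers = [np.arange(8) + 10 * r for r in range(4)]   # 4 workers, 8 parameters each
    out = ring_allreduce(workers)
    assert all(np.allclose(o, sum(workers)) for o in out)

Each worker transmits roughly 2(n-1)/n of its gradient volume in total, which is why ring-style synchronization scales well in bandwidth.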
HiPS Design: Map the logical view onto the physical structure
1. Flexible (topology-aware)
2. Hierarchical (efficient)
HiPS Design: HiPS in BCube (step-by-step example from Server <01>)
HiPS Design: HiPS in Torus (a simplified sketch of the hierarchical synchronization idea follows below)
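To make the hierarchical idea concrete, here is a much-simplified two-level sketch, illustrative only and not the exact HiPS/BML schedule for BCube or Torus: gradients are first reduced inside each level-0 group, then across groups among servers that share the same in-group index (roughly, servers under the same level-1 switch in BCube(n,1)). The allreduce helper is a stand-in for any group-level scheme, e.g. the ring sketch above.

import numpy as np

def allreduce(tensors):
    # Stand-in for any group-level allreduce (e.g. the ring sketch above):
    # every participant ends up with the element-wise sum.
    total = np.sum(tensors, axis=0)
    return [total.copy() for _ in tensors]

def hierarchical_sync(grads, group_size):
    # Illustrative two-level hierarchical synchronization. Workers are viewed
    # as a grid of shape (num_groups, group_size); group_size plays the role
    # of n in BCube(n, 1).
    n = len(grads)
    assert n % group_size == 0
    num_groups = n // group_size

    # Level 0: allreduce inside each group (servers under one level-0 switch).
    groups = [grads[g * group_size:(g + 1) * group_size] for g in range(num_groups)]
    level0 = [allreduce(g) for g in groups]

    # Level 1: allreduce across groups among servers with the same in-group
    # index (servers under one level-1 switch), then average over all workers.
    result = [None] * n
    for j in range(group_size):
        peers = [level0[g][j] for g in range(num_groups)]
        summed = allreduce(peers)
        for g in range(num_groups):
            result[g * group_size + j] = summed[g] / n
    return result

if __name__ == "__main__":
    workers = [np.full(4, float(r)) for r in range(9)]     # 9 workers, as in BCube(3, 1)
    synced = hierarchical_sync(workers, group_size=3)
    assert all(np.allclose(s, np.mean(workers, axis=0)) for s in synced)

Splitting synchronization into intra-group and inter-group phases keeps each phase on one level of the topology's links, which is the topology-aware property HiPS exploits.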
Theoretical Evaluation
Future Work: Conduct a further comparative study; integrate HiPS into DML systems
Simulation Evaluation: NS-3 simulation with a VGG workload
1. BCube: GST (global synchronization time) reduced by 37.5% ~ 61.9%
2. Torus: GST reduced by 49.6% ~ 66.4%
(Figures: GST comparison with RDMA in BCube and in Torus)
Testbed Evaluation: System instance of HiPS: BML
1. Add an OP in TensorFlow (see the hedged sketch below)
2. 9 servers, each equipped with 2 RNICs (BCube(3,1))
3. MNIST and VGG19 as benchmarks
4. Ring AllReduce in Ring and Mesh-based (P2P) sync in Fat-Tree as baselines
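As a rough illustration of item 1, a custom synchronization op is typically compiled into a shared library and loaded from Python. The library path and op name below are assumptions for illustration; the actual BML op is not named here.

import tensorflow as tf

# Hypothetical library path and op name; not the real BML artifact.
hips_lib = tf.load_op_library("./libhips_sync.so")

def apply_synced_gradients(optimizer, grads_and_vars):
    # Pass every gradient through the custom synchronization op before the
    # optimizer applies it; hips_all_reduce is assumed to return the gradient
    # synchronized across all workers.
    synced = [(hips_lib.hips_all_reduce(g), v) for g, v in grads_and_vars]
    return optimizer.apply_gradients(synced)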
Testbed Evaluation: 18.7% ~ 56.4% (results figures on the slides)
Ongoing Work: Conduct a further comparative study; optimize HiPS in DML systems; explore more cases of Network for AI
Thanks! NASP Research Group: https://nasp.cs.tsinghua.edu.cn/