
HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning (PowerPoint Presentation)



  1. HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning. Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li

  2. ACM SIGCOMM Workshop on NetAI

  3. Background: Distributed Machine Learning (Computation + Communication)

  4. Background: Strong Computation Power (GPU & TPU)

  5. Background: Communication Challenge. TCP: high latency & low throughput, kernel overheads, etc. RDMA: a promising alternative to TCP

  6. Background: An MNIST Benchmark with 1 million parameters

  7. Background: RoCE/RDMA: multi-vendor ecosystem; many problems in Fat-Tree based deployment

  8. Background: Fat-Tree based Deployment: (1) PFC pause frame storm [SIGCOMM'15, '16; NS-3 simulation]; (2) Resilient RoCE: performance sacrifice [Chelsio-Tech]; (3) Synchronization performance

  9. Background: Fat-Tree based Deployment: (1) PFC pause frame storm [SIGCOMM'15, '16]; (2) Resilient RoCE: performance sacrifice

  10. Background: Fat-Tree based Deployment: Synchronization performance

  11. Background: Server-Centric Networks: (1) Fewer hops lead to fewer PFC pause frames; (2) Servers prevent the cascading effect of PFC pause frames

  12. Background: Synchronization Algorithms: (1) PS-based; (2) Mesh-based; (3) Ring-based

  13. Background: Synchronization Algorithm: PS-based (Pull+Push)
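
The PS-based pattern on this slide can be illustrated with a minimal in-process sketch: workers pull the current parameters, compute gradients, and push them to a logical parameter server. All class and function names below are hypothetical illustrations, not code from the talk or from BML.

    import numpy as np

    class ParameterServer:
        """Toy stand-in for a logical parameter server (hypothetical names)."""
        def __init__(self, init_params, lr=0.01):
            self.params = init_params.copy()
            self.lr = lr

        def push(self, grad):
            # Workers push gradients; the server applies them to the global model.
            self.params -= self.lr * grad

        def pull(self):
            # Workers pull the latest global parameters before the next step.
            return self.params.copy()

    def worker_grad(local_batch, params):
        # Stand-in for a real gradient computation on one worker's mini-batch.
        return params - np.mean(local_batch, axis=0)

    ps = ParameterServer(init_params=np.zeros(4))
    batches = [np.random.randn(8, 4) for _ in range(3)]   # 3 toy workers
    for step in range(5):
        for batch in batches:
            grad = worker_grad(batch, ps.pull())           # pull, then compute
            ps.push(grad)                                  # push gradient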

  14. Background: Synchronization Algorithm: Mesh-based (Diffuse+Collect)
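
For the mesh-based (diffuse + collect) pattern, here is a sketch of the communication structure, simulated in-process: every worker diffuses one partition of its gradient to the peer that owns that partition, each owner aggregates its partition, and then every worker collects all aggregated partitions. The partitioning and naming are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def mesh_allreduce(grads):
        """Simulate diffuse + collect among len(grads) workers (illustrative)."""
        n = len(grads)
        parts = [np.array_split(g, n) for g in grads]   # each worker splits its gradient
        # Diffuse: worker j receives partition j from every peer and aggregates it.
        owned = [sum(parts[i][j] for i in range(n)) / n for j in range(n)]
        # Collect: every worker gathers all aggregated partitions back.
        return [np.concatenate(owned) for _ in range(n)]

    grads = [np.random.randn(12) for _ in range(4)]
    synced = mesh_allreduce(grads)
    assert np.allclose(synced[0], np.mean(grads, axis=0))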

  15. Background: Synchronization Algorithm: Ring-based (Scatter+Gather)

  16. Background: Synchronization Algorithm: Ring-based (Scatter+Gather)
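
Slides 15-16 show the ring-based (scatter + gather) pattern, i.e. ring allreduce. Below is a minimal in-process simulation of its two phases (scatter-reduce, then all-gather); the indexing scheme is one common formulation and is an illustrative sketch, not the talk's code.

    import numpy as np

    def ring_allreduce(grads):
        """Simulate scatter-reduce + all-gather on a logical ring (illustrative)."""
        n = len(grads)
        chunks = [list(np.array_split(g, n)) for g in grads]
        # Scatter-reduce: in step s, worker i forwards chunk (i - s) mod n to
        # worker i+1, which adds it to its own copy; after n-1 steps each
        # worker holds one fully reduced chunk.
        for s in range(n - 1):
            for i in range(n):
                dst, c = (i + 1) % n, (i - s) % n
                chunks[dst][c] = chunks[dst][c] + chunks[i][c]
        # All-gather: the fully reduced chunks travel around the ring so that
        # every worker ends up with the complete result.
        for s in range(n - 1):
            for i in range(n):
                dst, c = (i + 1) % n, (i + 1 - s) % n
                chunks[dst][c] = chunks[i][c]
        return [np.concatenate(ch) / n for ch in chunks]   # average over workers

    grads = [np.random.randn(10) for _ in range(4)]
    assert all(np.allclose(r, np.mean(grads, axis=0)) for r in ring_allreduce(grads))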

  17. HiPS Design: Map the Logical View onto the Physical Structure: (1) Flexible (Topology-Aware); (2) Hierarchical (Efficient)

  18. HiPS Design: HiPS in BCube

  19. HiPS Design: HiPS in BCube

  20. HiPS Design: HiPS in BCube

  21. HiPS Design: HiPS in BCube (Server <01>)

  22. HiPS Design: HiPS in BCube
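
The hierarchical idea behind HiPS in BCube (slides 18-22) can be sketched as a dimension-by-dimension allreduce: servers that share a level-0 switch synchronize first, then servers that share a level-1 switch synchronize the group results, which yields the global result. The grid indexing and the plain per-group sum below are illustrative assumptions about a BCube(n,1)-style logical view, not the paper's exact algorithm.

    import numpy as np

    def hierarchical_allreduce(grads):
        """grads[x1][x0] is the gradient at server <x1 x0> of a BCube(n, 1)-like grid."""
        n = len(grads)
        g = [[np.asarray(v, dtype=float) for v in row] for row in grads]
        # Level 0: servers sharing a level-0 switch (same x1) allreduce first.
        for x1 in range(n):
            s = sum(g[x1][x0] for x0 in range(n))
            for x0 in range(n):
                g[x1][x0] = s
        # Level 1: servers sharing a level-1 switch (same x0) allreduce next;
        # each peer already holds its level-0 group sum, so this gives the global sum.
        for x0 in range(n):
            s = sum(g[x1][x0] for x1 in range(n))
            for x1 in range(n):
                g[x1][x0] = s / (n * n)      # average over all n*n servers
        return g

    grads = [[np.random.randn(4) for _ in range(3)] for _ in range(3)]
    out = hierarchical_allreduce(grads)
    assert np.allclose(out[0][0], np.mean([v for row in grads for v in row], axis=0))

In the actual design each group would presumably use one of the synchronization patterns above (PS-, mesh-, or ring-based); the plain sum here only shows why composing per-dimension synchronizations yields a global one.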

  23. HiPS Design: HiPS in Torus

  24. Theoretical Evaluation

  25. Theoretical Evaluation

  26. Theoretical Evaluation

  27. Future Work: Conduct further comparative study; Integrate HiPS into DML systems

  28. Simulation Evaluation: NS-3 simulation with VGG workload. (1) BCube: GST (global synchronization time) reduced by 37.5% ~ 61.9%; (2) Torus: GST reduced by 49.6% ~ 66.4%. [Figures: GST comparison with RDMA in BCube; GST comparison with RDMA in Torus]

  29. Testbed Evaluation: System Instance of HiPS: BML. (1) Add an OP in TensorFlow; (2) 9 servers, each equipped with 2 RNICs (BCube(3,1)); (3) MNIST and VGG19 as benchmarks; (4) Ring Allreduce and Mesh-based (P2P) sync in Fat-Tree as baselines
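
As a side note on the testbed topology above: BCube(3,1) has 3^2 = 9 servers, each with 2 NICs (one per level), which matches the 9-server / 2-RNIC setup on this slide. The short sketch below only enumerates which servers share a switch at each level; it is an illustrative aid, not code from BML.

    # BCube(3, 1): n = 3 ports per switch, levels 0 and 1, 9 servers <x1 x0>.
    n = 3
    servers = [(x1, x0) for x1 in range(n) for x0 in range(n)]   # 9 servers

    # Level-0 groups: servers sharing a level-0 switch (same x1), reached via RNIC 0.
    level0 = {x1: [s for s in servers if s[0] == x1] for x1 in range(n)}
    # Level-1 groups: servers sharing a level-1 switch (same x0), reached via RNIC 1.
    level1 = {x0: [s for s in servers if s[1] == x0] for x0 in range(n)}

    print(level0[0])   # [(0, 0), (0, 1), (0, 2)]
    print(level1[0])   # [(0, 0), (1, 0), (2, 0)]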

  30. Testbed Evaluation

  31. Testbed Evaluation: 18.7% ~ 56.4%

  32. Ongoing Work: Conduct further comparative study; Optimize HiPS in DML systems; More cases of Network for AI

  33. Thanks! NASP Research Group, https://nasp.cs.tsinghua.edu.cn/
