  1. The Hebrew University of Jerusalem, Faculty of Computer Science
     TU Dresden, Institute of Systems Architecture, Operating Systems Group
     OVERHEAD OF A DECENTRALIZED GOSSIP ALGORITHM ON THE PERFORMANCE OF HPC APPLICATIONS
     Ely Levy, Amnon Barak, Amnon Shiloh, Matthias Lieber, Carsten Weinhold, Hermann Härtig

  2. MOTIVATION
     Management tasks in supercomputers:
     • Process placement
     • Load management
     • System monitoring
     Up-to-date information is required to make informed decisions.
     TU Dresden Overhead of a Decentralized Gossip Algorithm

  3. REQUIREMENTS
     • Low overhead on application performance
     • Scalability:
       • Decentralized information dissemination
       • Decentralized decision making
     • Fault tolerance

  4. RANDOMIZED GOSSIP
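
This slide originally showed only a diagram. As a rough illustration, here is a minimal simulation of randomized push gossip over age vectors. Everything in it (synchronous rounds, freshest-entries window selection, one uniformly random peer per round) is an assumed model for illustration, not the authors' exact protocol.

```python
import random

def gossip_round(vectors, window_frac=1.0):
    """One synchronous round of randomized push gossip.

    vectors[i][j] is node i's age (in rounds) of its newest
    information about node j. Each round every node ages all
    entries, resets its own entry to 0, then pushes a window of
    its freshest entries to one random peer, which keeps the
    younger copy of every received entry.
    """
    n = len(vectors)
    for v in vectors:
        for j in range(n):
            v[j] += 1
    for i in range(n):
        vectors[i][i] = 0
    # Window: the freshest fraction of entries (assumed selection policy).
    w = max(1, int(window_frac * n))
    for i in range(n):
        peer = random.choice([j for j in range(n) if j != i])
        window = sorted(range(n), key=lambda j: vectors[i][j])[:w]
        for j in window:
            # Received entries age by one round in transit.
            vectors[peer][j] = min(vectors[peer][j], vectors[i][j] + 1)
    return vectors
```

Running repeated rounds keeps every entry's age bounded even though no node ever talks to all the others, which is the point of the approach.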

  5. MERGING WINDOWS
     Example: node A holds the age vector A:0 B:12 C:2 D:4 E:11 and sends the window (A:0, C:2, D:4) to node E, whose vector is A:5 B:2 C:4 D:3 E:0. The entries arrive aged by one (A:1, C:3, D:5); for each node, E keeps the younger entry, so A:1 and C:3 replace E's stale values while the arriving D:5 loses to E's fresher D:3.
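
The merge step can be sketched as follows. The +1 transit aging and the younger-entry-wins rule are inferred from the slide's example (A:0 arriving as A:1, and the arriving D:5 losing to E's fresher D:3):

```python
def merge_window(local, window, transit_age=1):
    """Merge a received gossip window into the local age vector.

    local:  dict mapping node id -> age of newest info about that node
    window: subset of the sender's age vector
    Received ages are increased by transit_age, and for every node
    the younger of the two entries wins.
    """
    for node, age in window.items():
        aged = age + transit_age
        if node not in local or aged < local[node]:
            local[node] = aged
    return local

# Node E merges the window it received from node A (slide's example):
merged = merge_window({'A': 5, 'B': 2, 'C': 4, 'D': 3, 'E': 0},
                      {'A': 0, 'C': 2, 'D': 4})
# -> {'A': 1, 'B': 2, 'C': 3, 'D': 3, 'E': 0}
```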

  6. WINDOW SIZE
     How much data to send? Average age by window size (relative to node count):
     Window size   1024 nodes   2048 nodes
     10%           14.21        14.86
     20%           9.77         10.46
     30%           8.46         9.15
     40%           7.83         8.53
     50%           7.49         8.19
     60%           7.29         7.99
     70%           7.18         7.87
     80%           7.09         7.78
     90%           7.03         7.73
     100%          7.01         7.71
     • Small window sizes already yield a good average age
     • Diminishing returns for larger window sizes
     • Example: 20% of 1024 nodes with 1 KiB per node ➞ 200 KiB
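
The slide's payload example is simple arithmetic (the 1 KiB per-node record size is the slide's example figure):

```python
def window_bytes(node_count, window_frac, bytes_per_node=1024):
    """Payload of one gossip message: window entries x record size."""
    return int(window_frac * node_count) * bytes_per_node

# 20% of 1024 nodes at 1 KiB per node -> 204 KiB (the slide rounds to 200 KiB)
size_kib = window_bytes(1024, 0.20) / 1024
```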

  7. HARDWARE
     BlueGene/Q at Jülich (JUQUEEN)
     • 28,672 nodes total (used 1024–8192)
     • 16 cores per node (PowerPC A2 @ 1.6 GHz)
     • 5D torus network (10 links per node)
     • 2 GB/s per link, send + receive
     • Total bandwidth per node: 40 GB/s
     • 2.6 µs worst-case latency
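
The quoted 40 GB/s per node follows directly from the link figures:

```python
# Per-node bandwidth of the BG/Q 5D torus, from the slide's figures.
links_per_node = 10          # 5D torus: 2 links per dimension
gb_per_s_per_direction = 2   # each link: 2 GB/s send and 2 GB/s receive
total_gb_per_s = links_per_node * gb_per_s_per_direction * 2  # -> 40
```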

  8. GOSSIP ALGORITHM
     MPI-based implementation (MPI_Bsend)
     • Gossip algorithm runs on 1 core
     • Application uses the remaining 15 cores
     How to run two programs on BG/Q?
     • Gossip algorithm and application linked together
     • MPI communicators configured to hide every 16th core from the application
     • Wrapped all uses of MPI_COMM_WORLD
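
One plausible way to realize this split is an MPI_Comm_split-style coloring of MPI_COMM_WORLD. The sketch below models only the color function in plain Python; the node-by-node rank numbering is an assumption, not something the slides state.

```python
CORES_PER_NODE = 16

def comm_color(world_rank):
    """Color for an MPI_Comm_split-style partition of MPI_COMM_WORLD.

    Assuming ranks are numbered node-by-node, the last core of every
    node joins the gossip communicator; all other cores join the
    application communicator that replaces MPI_COMM_WORLD.
    """
    if world_rank % CORES_PER_NODE == CORES_PER_NODE - 1:
        return "gossip"
    return "app"

# e.g. with 2 nodes (32 ranks), ranks 15 and 31 would run gossip:
gossip_ranks = [r for r in range(32) if comm_color(r) == "gossip"]
```

In actual MPI code this color would be passed to MPI_Comm_split, and the wrapped MPI_COMM_WORLD would hand the application the "app" communicator.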

  9. BENCHMARKS
     Heavy network usage:
     • HPCC suite: MPI-FFT
     • HPCC suite: PTRANS
     Moderate network usage:
     • Application: COSMO-SPECS+FD4


 11. HPCC: MPI-FFT
     Benchmark: Fast Fourier Transform
     • All-to-all communication pattern
     • Stresses bisection bandwidth
     • 1024 nodes: 136 billion vector elements (2025 GiB)
     • 2048+ nodes: 544 billion vector elements (8100 GiB)
     Runtimes (each cell lists the two timing bars of the original chart; — : no bar shown):
     Interval         1024 nodes      2048 nodes      4096 nodes      8192 nodes
     Without gossip   12.2 / 19.0 s   36.4 / 50.2 s   32.9 / 40.0 s   24.0 / 27.8 s
     1024 ms          12.2 / 19.0 s   36.4 / 50.2 s   32.9 / 40.0 s   24.2 / 28.0 s
     256 ms           12.2 / 19.0 s   36.5 / 50.4 s   33.1 / 40.2 s   24.7 / 28.5 s
     64 ms            12.2 / 19.0 s   36.6 / 50.5 s   33.6 / 40.7 s   26.1 / 29.9 s
     16 ms            12.3 / 19.1 s   37.2 / 51.1 s   35.5 / 42.6 s   31.4 / 35.2 s
     8 ms             12.4 / 19.2 s   38.1 / 52.0 s   38.1 / 45.3 s   38.4 / 42.2 s
     4 ms             12.7 / 19.5 s   40.0 / 54.0 s   —               —
     2 ms             13.2 / 20.0 s   —               —               —

 12. COSMO-SPECS+FD4
     Benchmark: atmospheric simulation
     • COSMO: static, regular communication
     • SPECS: dynamic, irregular communication
     • Model coupling: dynamic, irregular, small volume
     • Partitioning: collectives
     • Migration: highly local, mostly between neighbors
     Runtimes (each cell lists the two timing bars of the original chart; — : no bar shown):
     Interval         1024 nodes     2048 nodes     4096 nodes     8192 nodes
     Without gossip   8.2 / 40.6 s   4.3 / 36.7 s   4.3 / 36.7 s   4.8 / 37.3 s
     1024 ms          8.2 / 40.6 s   4.2 / 36.7 s   4.3 / 36.7 s   4.8 / 37.3 s
     256 ms           8.2 / 40.6 s   4.2 / 36.7 s   4.4 / 36.7 s   4.9 / 37.3 s
     64 ms            8.2 / 40.6 s   4.2 / 36.7 s   4.4 / 36.8 s   5.0 / 37.5 s
     16 ms            8.2 / 40.6 s   4.3 / 36.8 s   4.8 / 37.3 s   5.3 / 37.9 s
     8 ms             8.2 / 40.6 s   4.4 / 36.9 s   5.1 / 37.6 s   5.6 / 38.2 s
     4 ms             8.3 / 40.7 s   4.7 / 37.2 s   6.1 / 38.7 s   —
     2 ms             8.4 / 40.8 s   5.5 / 38.0 s   —              —
     1 ms             8.6 / 41.1 s   —              —              —

 13. GOSSIP SCALABILITY
     Computational complexity: O(n·log(n))
     Gossip overhead (0–100% axis in the original chart; — : no bar shown):
     Interval         1024 nodes   2048 nodes   4096 nodes   8192 nodes
     Without gossip   0.0%         0.0%         0.0%         0.2%
     1024 ms          0.0%         0.3%         1.0%         2.4%
     256 ms           1.3%         2.6%         5.4%         11.0%
     64 ms            5.4%         10.7%        21.2%        42.7%
     16 ms            10.6%        20.9%        41.6%        80.5%
     8 ms             20.9%        41.3%        79.4%        —
     4 ms             41.4%        79.3%        —            —
     2 ms             80.3%        —            —            —

 14. RATE VS WIN SIZE
     Overhead by gossip interval and window size (shorter intervals and larger windows both mean more data; — : not measured):
     Interval   Window 10%   20%     40%     80%
     1 ms       3.8%         —       —       —
     2 ms       1.8%         4.7%    —       —
     4 ms       1.1%         2.6%    6.3%    17.2%
     8 ms       0.7%         1.4%    3.2%    8.7%
     16 ms      0.3%         0.7%    1.8%    4.5%
     32 ms      0.3%         0.4%    0.9%    2.6%
     64 ms      0.2%         0.3%    0.6%    1.5%
     Along a constant-data-rate diagonal (4 ms/10% ➞ 8 ms/20% ➞ 16 ms/40% ➞ 32 ms/80%) overhead still rises: a larger window costs more than a shorter interval at the same bandwidth.
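
This table supports the later conclusion that window size matters more than the interval: along a diagonal of constant data rate, overhead still grows. A small check against the measured values:

```python
# Measured overhead (%) from the table, keyed by (interval_ms, window_%).
overhead = {
    (4, 10): 1.1,
    (8, 20): 1.4,
    (16, 40): 1.8,
    (32, 80): 2.6,
}

# Along this diagonal window/interval is constant, i.e. the data rate
# is the same, yet overhead rises monotonically with window size.
diagonal = [overhead[k] for k in sorted(overhead)]
```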

 15. REACTION TIME?
     Gossip intervals and resulting average vector age:
     • 256 ms ➞ 2–3 s
     • 1024 ms ➞ 10 s
     Applicability for system services:
     • Global load information (allocation, …)
     • Local load balancing (MOSIX-like, …)
     • System monitoring (node health, …)
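
These reaction times are consistent with multiplying the average vector age in gossip rounds (about 9.8 for a 20% window on 1024 nodes, from the window-size slide; treating those age values as round counts is an assumption) by the gossip interval:

```python
def avg_info_age_s(interval_s, avg_age_rounds=9.77):
    """Average age of information in seconds, assuming the average age
    is measured in gossip rounds (1024 nodes, 20% window measurement)."""
    return interval_s * avg_age_rounds

# 256 ms interval -> ~2.5 s; 1024 ms -> ~10 s
# (matching the slide's 2-3 s and 10 s figures)
```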

 16. FUTURE WORK
     • Other types of network (InfiniBand, Cray, …)
     • Fault tolerance, loss of messages
     Must adapt for exascale systems:
     • Incomplete knowledge at each node
     • Groups of gossip nodes
     • Smaller vectors
     • Hierarchical gossip for a global view

 17. CONCLUSIONS
     • Gossip algorithm scales to thousands of nodes
     • Increasing the window size causes more overhead than decreasing the gossip interval
     • Collective MPI communication is the most sensitive
     • Gossip intervals of 256–1024 ms show no noticeable overhead (in most cases)
     • Average age of information at the nodes is on the order of 2–3 s with a gossip interval of 256 ms
     German Priority Programme 1648 "Software for Exascale Computing", FFMK project: ffmk.tudos.org
