Kolganov A.S., MSU. The BFS algorithm: Graph500 && GGraph500
  1. Kolganov A.S., MSU

  2. Outline: The BFS algorithm; Graph500 && GGraph500; Implementation of BFS on shared memory (CPU / GPU); Predicted scalability

  3. Outline: The BFS algorithm; Graph500 && GGraph500; Implementation of BFS on shared memory (CPU / GPU); Predicted scalability

  4. Breadth-first search is one of the most important and fundamental graph processing algorithms. Algorithmic difficulties of BFS: ◦ very few computations; ◦ irregular memory access.
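
A minimal level-by-level BFS sketch in plain C++ (an illustration on an adjacency-list graph, not the author's GPU code) shows both points: almost no arithmetic is done per edge, while the reads of levels[v] follow data-dependent neighbor indices and are therefore irregular.

    #include <cstdint>
    #include <queue>
    #include <vector>

    // Minimal serial BFS over an adjacency-list graph.
    // Almost no arithmetic per edge; the accesses to levels[v] are data-dependent,
    // which is exactly the irregular memory access pattern mentioned above.
    std::vector<int> bfs(const std::vector<std::vector<uint32_t>>& adj, uint32_t root) {
        std::vector<int> levels(adj.size(), -1);   // -1 means "not visited yet"
        std::queue<uint32_t> frontier;
        levels[root] = 0;
        frontier.push(root);
        while (!frontier.empty()) {
            uint32_t u = frontier.front();
            frontier.pop();
            for (uint32_t v : adj[u]) {            // irregular, data-dependent reads
                if (levels[v] == -1) {
                    levels[v] = levels[u] + 1;     // the only "computation" per edge
                    frontier.push(v);
                }
            }
        }
        return levels;
    }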

  5. Outline: The BFS algorithm; Graph500 && GGraph500; Implementation of BFS on shared memory (CPU / GPU); Predicted scalability

  6. The BFS algorithm is used for ranking supercomputers (TEPS: traversed edges per second); The MTEPS/W metric is used for ranking in the GreenGraph500 list of energy-efficient supercomputers; Both lists are not yet full: ◦ Graph500 (201 positions in the list); ◦ GreenGraph500 (63 positions in the list).
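
A hedged sketch of how the two metrics are computed (plain C++; the numeric values below are placeholders for illustration, not measured results):

    #include <cstdio>

    // TEPS = traversed edges / BFS time; MTEPS/W additionally divides by average power.
    // All numbers here are illustrative placeholders.
    int main() {
        double edges_traversed = 1.07e9;   // roughly 2^30 edges for SCALE 26 (illustrative)
        double bfs_seconds     = 8.4e-3;   // BFS time in seconds (illustrative)
        double avg_power_watts = 250.0;    // average power draw during the run (illustrative)

        double teps    = edges_traversed / bfs_seconds;   // edges per second
        double gteps   = teps / 1e9;                      // Graph500 metric
        double mteps_w = (teps / 1e6) / avg_power_watts;  // GreenGraph500 metric

        std::printf("GTEPS = %.2f, MTEPS/W = %.2f\n", gteps, mteps_w);
        return 0;
    }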

  7. Generating edges; Building a graph from the edges (timed, included in the table); Generating 64 random root vertices; For each root: ◦ running the BFS algorithm (timed, included in the rating); ◦ checking the result; Printing the resulting information.
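
A sketch of that benchmark flow in plain C++ (the toy size, the random edge generator and the helper build_graph are assumptions standing in for the reference Kronecker generator and graph construction; validation is omitted):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <utility>
    #include <vector>

    // Assumed helper, not the reference Graph500 code: build an undirected adjacency list.
    std::vector<std::vector<uint32_t>> build_graph(
            const std::vector<std::pair<uint32_t, uint32_t>>& edges, uint32_t n) {
        std::vector<std::vector<uint32_t>> adj(n);
        for (auto [u, v] : edges) { adj[u].push_back(v); adj[v].push_back(u); }
        return adj;
    }

    int main() {
        const uint32_t n = 1u << 16;               // toy size instead of SCALE 26..30
        std::mt19937 rng(1);
        std::uniform_int_distribution<uint32_t> pick(0, n - 1);

        // 1. Generate edges (random here, instead of the benchmark's Kronecker generator).
        std::vector<std::pair<uint32_t, uint32_t>> edges(16 * n);
        for (auto& e : edges) e = {pick(rng), pick(rng)};

        // 2. Build the graph (timed, included in the table).
        auto t0 = std::chrono::steady_clock::now();
        auto adj = build_graph(edges, n);
        std::chrono::duration<double> build = std::chrono::steady_clock::now() - t0;

        // 3./4. Pick 64 random roots; run and time BFS from each one.
        double total_bfs = 0.0;
        for (int i = 0; i < 64; ++i) {
            uint32_t root = pick(rng);
            auto t1 = std::chrono::steady_clock::now();
            // The real benchmark runs BFS from `root` here (see the sketch after slide 4)
            // and then checks the resulting parent tree.
            (void)root;
            total_bfs += std::chrono::duration<double>(
                             std::chrono::steady_clock::now() - t1).count();
        }

        // 5. Print the resulting information.
        std::printf("vertices: %zu, build: %.3f s, mean BFS slot: %.6f s\n",
                    adj.size(), build.count(), total_bfs / 64);
        return 0;
    }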

  8. [Chart: Graph500 list entries by nodes, cores and scale]

  9. [Chart: Graph500 list entries by nodes, cores and scale, annotated with power figures of 12.6 MW, 7.8 MW, 3.9 MW and 17.8 MW and ×2 markers]

  10. GreenGraph500 lists.

      Big DATA:
      Rank | MTEPS/W | Machine          | Scale | GTEPS  | Nodes | Cores
      1    | 62.93   | GraphCREST (CPU) | 30    | 31.33  | 1     | 32
      2    | 61.48   | GraphCREST (CPU) | 30    | 28.61  | 1     | 32
      3    | 51.95   | GraphCREST (CPU) | 32    | 59.9   | 1     | 60
      4    | 48.28   | GraphCREST (CPU) | 30    | 31.95  | 1     | 32
      5    | 44.42   | GraphCREST (CPU) | 32    | 55.74  | 1     | 60

      Small DATA:
      Rank | MTEPS/W | Machine          | Scale | GTEPS  | Nodes | Cores
      1    | 815.68  | TitanX (GPU)     | 26    | 132.14 | 1     | 28
      2    | 540.94  | Titan (GPU)      | 25    | 114.68 | 1     | 20
      3    | 445.92  | Colonial (GPU)   | 20    | 112.18 | 1     | 12
      4    | 243.42  | Monty Pi-thon    | 26    | 35.83  | 32    | 128
      5    | 235.15  | GraphCREST (ARM) | 20    | 1.03   | 1     | 4
      6    | 230.4   | GraphCREST (ARM) | 20    | 0.74   | 1     | 4
      7    | 204.38  | EBD              | 21    | 1.64   | 1     | 5

  11. Big DATA: scale up to 30 (256 GB for int64 and 128 GB for int32). (Big DATA and Small DATA tables as on slide 10.)

  12. Big DATA: scale up to 30 (256 GB for int64 and 128 GB for int32); GTX Titan X has 12 GB, Tesla K80 has 24 GB; Computing scale 30 requires ~192 GB; <GTX Titan X> x 16 = 192 GB, 4 kW peak! <Tesla K80> x 8 = 192 GB, 2.4 kW peak! (Small DATA table as on slide 10.)

  13. Outline: The BFS algorithm; Graph500 && GGraph500; Implementation of BFS on shared memory (CPU / GPU); Predicted scalability

  14. Phase 1: ◦ reconstruction and transformation of the graph; ◦ loading it into GPU memory. Phase 2: ◦ the main loop of the algorithm; ◦ hybrid BFS (Top-Down + Bottom-Up). The main ideas were taken from GraphCREST: «Fast and Energy-efficient Breadth-First Search on a Single NUMA System, 2014».

  15. Transformation from COO to CSR (compressed sparse rows). [Diagram: COO arrays (start vertex, final vertex, weights) are converted into CSR arrays (adj_ptr, adjacency, weights)]
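
A hedged sketch of that conversion in plain C++ (array names follow the diagram; this is an illustration, not the presentation's actual code):

    #include <cstdint>
    #include <vector>

    // Convert an edge list in COO form (start_v[i], final_v[i], weight[i]) into CSR:
    // adj_ptr[v]..adj_ptr[v+1] indexes the neighbors of v in adjacency[] / weights[].
    struct Csr {
        std::vector<uint32_t> adj_ptr;    // size n_vertices + 1
        std::vector<uint32_t> adjacency;  // size n_edges
        std::vector<float>    weights;    // size n_edges
    };

    Csr coo_to_csr(const std::vector<uint32_t>& start_v,
                   const std::vector<uint32_t>& final_v,
                   const std::vector<float>& weight,
                   uint32_t n_vertices) {
        Csr g;
        g.adj_ptr.assign(n_vertices + 1, 0);
        // 1. Count the out-degree of every start vertex.
        for (uint32_t u : start_v) ++g.adj_ptr[u + 1];
        // 2. A prefix sum turns the counts into row offsets.
        for (uint32_t v = 0; v < n_vertices; ++v) g.adj_ptr[v + 1] += g.adj_ptr[v];
        // 3. Scatter edges into their rows using a moving cursor per row.
        g.adjacency.resize(start_v.size());
        g.weights.resize(start_v.size());
        std::vector<uint32_t> cursor(g.adj_ptr.begin(), g.adj_ptr.end() - 1);
        for (size_t e = 0; e < start_v.size(); ++e) {
            uint32_t pos = cursor[start_v[e]]++;
            g.adjacency[pos] = final_v[e];
            g.weights[pos]   = weight[e];
        }
        return g;
    }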

  16. Global sorting of vertices by degree (number of incident edges).
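
One way such a global reordering can be realized (a sketch that assumes vertices are simply renumbered in order of decreasing degree; the presentation does not show its exact scheme):

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Build a permutation that renumbers vertices in order of decreasing degree.
    // new_id[old_id] gives the new label; applying it to the CSR arrays groups
    // the high-degree vertices together, improving locality of frontier accesses.
    std::vector<uint32_t> relabel_by_degree(const std::vector<uint32_t>& adj_ptr) {
        uint32_t n = static_cast<uint32_t>(adj_ptr.size()) - 1;
        std::vector<uint32_t> order(n);
        std::iota(order.begin(), order.end(), 0u);
        std::sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
            return (adj_ptr[a + 1] - adj_ptr[a]) > (adj_ptr[b + 1] - adj_ptr[b]);
        });
        std::vector<uint32_t> new_id(n);
        for (uint32_t i = 0; i < n; ++i) new_id[order[i]] = i;
        return new_id;
    }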

  17. Local sorting of neighbors by degree. [Diagram: the adjacency lists of vertices V1 ... Vn before and after sorting each list by neighbor degree]
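
A sketch of the local step under the same assumption: each CSR adjacency segment is sorted so that high-degree neighbors come first, which lets the Bottom-Up search find an already visited parent and break out of the loop earlier.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sort every CSR adjacency segment so that neighbors with larger degree come first.
    // (If edge weights are stored, a real implementation would permute them identically.)
    void sort_neighbors_by_degree(const std::vector<uint32_t>& adj_ptr,
                                  std::vector<uint32_t>& adjacency) {
        uint32_t n = static_cast<uint32_t>(adj_ptr.size()) - 1;
        auto degree = [&](uint32_t v) { return adj_ptr[v + 1] - adj_ptr[v]; };
        for (uint32_t v = 0; v < n; ++v) {
            std::sort(adjacency.begin() + adj_ptr[v],
                      adjacency.begin() + adj_ptr[v + 1],
                      [&](uint32_t a, uint32_t b) { return degree(a) > degree(b); });
        }
    }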

  18. Level-synchronized Top-Down (the current front at level K is expanded into the next-iteration front at level K+1):

      // Process one level; lvl is the level being assigned, levels[] == 0 means "not visited".
      foreach (i = [0, N)) {
          if (levels[i] != lvl - 1) continue;        // only vertices of the current front
          foreach (k = [rInd[i], rInd[i+1])) {
              unsigned v = endV[k];
              if (levels[v] == 0) {
                  levels[v] = lvl;
                  parents[v] = i;
              }
          }
      }

  19. Level-synchronized Bottom-Up (each unvisited vertex at level K+1 looks for a parent in the front at level K):

      foreach (i = [0, N)) {
          if (levels[i] == 0) {                      // vertex i is not visited yet
              foreach (k = [rInd[i], rInd[i+1])) {
                  unsigned endk = endV[k];
                  if (levels[endk] == lvl - 1) {     // a neighbor in the current front
                      parents[i] = endk;
                      levels[i] = lvl;
                      break;
                  }
              }
          }
      }

  20. Hybrid algorithm: Top-Down + Bottom-Up (direction optimization).
      The graph: SCALE 26, |V| = 2^26 (67,108,864), |E| = 2^30 (1,073,741,824).

      Edges viewed per level:
      Level | Top-Down        | Bottom-Up      | Hybrid
      0     | 2               | 2,103,840,895  | 2
      1     | 66,206          | 1,766,587,029  | 66,206
      2     | 346,918,235     | 52,677,691     | 52,677,691
      3     | 1,727,195,615   | 12,820,854     | 12,820,854
      4     | 29,557,400      | 103,184        | 103,184
      5     | 82,357          | 21,467         | 21,467
      6     | 221             | 21,240         | 221
      Total | 2,103,820,036   | 3,936,072,360  | 65,689,625
            | 100% (≈ 2×|E|)  | 187%           | 3.12%

      A significant decrease in the number of edges viewed.
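
A hedged sketch of how the switch between the two directions can be driven; the rule and the alpha/beta thresholds follow the widely used direction-optimizing heuristic of Beamer et al. and are an assumption here, since the presentation does not state which switching rule it uses.

    #include <cstdint>

    // Decide the direction for the next BFS level from the size of the frontier.
    // m_f: edges incident to the frontier, m_u: edges incident to still-unvisited
    // vertices, n_f: frontier vertices, n: all vertices. alpha/beta are tunable.
    enum class Direction { TopDown, BottomUp };

    Direction choose_direction(Direction current, uint64_t m_f, uint64_t m_u,
                               uint64_t n_f, uint64_t n,
                               double alpha = 14.0, double beta = 24.0) {
        if (current == Direction::TopDown) {
            // Switch down when the frontier touches a large share of the unexplored edges.
            return (m_f > m_u / alpha) ? Direction::BottomUp : Direction::TopDown;
        }
        // Switch back up when the frontier has become small again.
        return (n_f < n / beta) ? Direction::TopDown : Direction::BottomUp;
    }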

  21. Using CUDA Dynamic Parallelism for load balancing in Top-Down; Using vectorization in each thread; Using aligned reordering for better memory access in Bottom-Up; Using a queue in Bottom-Up at the last iterations (see the sketch below).
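
A sketch of the last item in the list above (an explicit queue of still-unvisited vertices, so that late Bottom-Up iterations do not rescan all N vertices); this serial C++ version is an assumed interpretation, not the GPU kernel.

    #include <cstdint>
    #include <vector>

    // One Bottom-Up level over an explicit queue of unvisited vertices.
    // Vertices that find a parent get the current level; the rest are kept for
    // the next iteration, so late levels touch only the small remainder.
    void bottom_up_level(const std::vector<uint32_t>& rInd,
                         const std::vector<uint32_t>& endV,
                         std::vector<int>& levels, std::vector<uint32_t>& parents,
                         std::vector<uint32_t>& unvisited, int lvl) {
        std::vector<uint32_t> still_unvisited;
        for (uint32_t i : unvisited) {
            bool found = false;
            for (uint32_t k = rInd[i]; k < rInd[i + 1]; ++k) {
                uint32_t v = endV[k];
                if (levels[v] == lvl - 1) {   // neighbor belongs to the current front
                    parents[i] = v;
                    levels[i] = lvl;
                    found = true;
                    break;
                }
            }
            if (!found) still_unvisited.push_back(i);
        }
        unvisited.swap(still_unvisited);
    }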

  22. The first position in GGraph500, Small DATA: GTX Titan X, 132 GTEPS, 815 MTEPS/W, SCALE 26; The second position in GGraph500, Small DATA: GTX Titan, 114 GTEPS, 540 MTEPS/W, SCALE 25; The 15th position in GGraph500, Small DATA: Intel Xeon E5, 10.6 GTEPS, 81 MTEPS/W, SCALE 27; Reached GPU memory bandwidth of 140-150 GB/s (50-60% of peak); Reached energy consumption at 50% of peak.

  23. Outline: The BFS algorithm; Graph500 && GGraph500; Implementation of BFS on shared memory (CPU / GPU); Predicted scalability

  24. The time of BFS on 1 GPU (GTX Titan X), SCALE 26: ~8.43 ms. [Diagram: a node with 16 GTX Titan X GPUs attached to two CPUs]

  25. All copying GPU->HOST: ~128 MB. [Diagram: a node with 16 GTX Titan X GPUs attached to two CPUs]

  26. All copying back HOST->GPU: ~2000 MB. [Diagram: a node with 16 GTX Titan X GPUs attached to two CPUs]

  27. The time of BFS on 16 GPUs, SCALE 30: ~9 ms; the total time of copying, SCALE 30: ~140 ms; 115 GTEPS, ~100 MTEPS/W (the current 1st position: 62.93 MTEPS/W). [Diagram: a node with 16 GTX Titan X GPUs attached to two CPUs]

  28. The time of BFS on 16 GPUs, SCALE 30: ~9 ms; the total time of copying over NVLink, SCALE 30: ~30 ms; 440 GTEPS, ~300 MTEPS/W. [Diagram: a node with 16 GTX Titan X GPUs attached to two CPUs via NVLink]
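
The projected figures follow from simple arithmetic on the measured times; a hedged sketch of that estimate (the SCALE 30 edge count uses the standard edge factor of 16, and the power value is an assumption for illustration, not a measured number):

    #include <cstdio>

    // Rough model of the predicted 16-GPU performance: total time is BFS time plus
    // host<->device copy time; GTEPS = traversed edges / total time.
    int main() {
        double edges       = double(1ull << 34);  // |E| = 16 * 2^30 edges for SCALE 30
        double bfs_s       = 9e-3;                // BFS compute time on 16 GPUs (from the slide)
        double copy_pcie   = 140e-3;              // copy time over PCIe (from the slide)
        double copy_nvlink = 30e-3;               // copy time assuming NVLink (from the slide)
        double power_w     = 1200.0;              // assumed average power of the 16-GPU node

        double gteps_pcie   = edges / (bfs_s + copy_pcie)   / 1e9;  // ~115 GTEPS
        double gteps_nvlink = edges / (bfs_s + copy_nvlink) / 1e9;  // ~440 GTEPS
        std::printf("PCIe:   %.0f GTEPS, %.0f MTEPS/W\n",
                    gteps_pcie, gteps_pcie * 1e3 / power_w);
        std::printf("NVLink: %.0f GTEPS, %.0f MTEPS/W\n",
                    gteps_nvlink, gteps_nvlink * 1e3 / power_w);
        return 0;
    }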

  29. Alexander Kolganov, MSU, alexander.k.s@mail.ru
