Parallel high-performance graph processing
Mikhail Chernoskutov
IMM UB RAS, IMCS UrFU, Yekaterinburg
E-mail: mach@imm.uran.ru
Graph algorithms
  ◦ Bioinformatics
  ◦ Social network analysis
  ◦ Business analytics
  ◦ Data mining
  ◦ City planning
  ◦ and others …
Graph algorithms
Breadth-first search
  ◦ Easy to understand
  ◦ Widespread
  ◦ Many ways to parallelize
Graph500
  ◦ www.graph500.org
  ◦ R. Murphy, K. Wheeler, B. Barrett, and J. Ang. Introducing the Graph 500. In Cray User's Group (CUG), 2010
  ◦ Parallel breadth-first search
  ◦ MPI and OpenMP implementations
  ◦ Designed for graphs with relatively small diameter and skewed degree distribution
  ◦ "Scale-free" graphs
Parallel breadth-first search
Level-synchronous algorithms
  ◦ Processing of level N+1 begins only after processing of level N has finished
[Figure: example graph with vertices labeled 0 to 6]
Obstacles to an efficient parallel implementation
  ◦ Data transfer problem
  ◦ Graph marking problem
Data transfer problem
Problem description
  ◦ Traversing real-world graphs produces irregular memory access patterns
  ◦ Graphs with small diameter incur large overheads from data transfer over the interconnect network
Suggested solution
  ◦ Combine different types of level-synchronous algorithms
Level-synchronous algorithms
Top-down traversal
  ◦ Traditional way to implement breadth-first search
  ◦ Each active vertex checks all of its neighbors
Bottom-up traversal
  ◦ Each inactive vertex looks for an active vertex in its neighbor list
  ◦ S. Beamer, K. Asanovic, D. A. Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.
Top-down breadth-first search

    for all u in dist
        dist[u] ← -1
    dist[s] ← 0
    level ← 0
    do
        parallel for each vert in V.this_node
            if dist[vert] = level
                for each neighb in vert.neighbors
                    if neighb in V.this_node
                        if dist[neighb] = -1
                            dist[neighb] ← level + 1
                            pred[neighb] ← vert
                    else
                        vert_batch_to_send.push(neighb)
        send(vert_batch_to_send)
        receive(vert_batch_to_receive)
        parallel for each vert in vert_batch_to_receive
            if dist[vert] = -1
                dist[vert] ← level + 1
                pred[vert] ← vert.pred
        level++
    while (!check_end())
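To make the control flow of the top-down pseudocode concrete, here is a minimal single-process sketch in Python. It is not the implementation from the talk: the send/receive step is omitted because every vertex is local, the graph is assumed to be stored as two CSR arrays, and the names `top_down_bfs`, `row_pointers` and `column_ids`, as well as the small example graph, are illustrative assumptions.

```python
def top_down_bfs(row_pointers, column_ids, source):
    """Level-synchronous top-down BFS on one process (no send/receive)."""
    n = len(row_pointers) - 1
    dist = [-1] * n          # -1 marks an unvisited vertex
    pred = [-1] * n          # the source keeps pred = -1 in this sketch
    dist[source] = 0
    level = 0
    while True:
        updated = False
        for vert in range(n):
            if dist[vert] != level:
                continue     # only vertices of the current level are active
            for neighb in column_ids[row_pointers[vert]:row_pointers[vert + 1]]:
                if dist[neighb] == -1:
                    dist[neighb] = level + 1
                    pred[neighb] = vert
                    updated = True
        if not updated:      # plays the role of check_end()
            break
        level += 1
    return dist, pred


# Hypothetical undirected example: edges 0-1, 0-2, 1-3, 2-3, 3-4.
row_pointers = [0, 2, 4, 6, 9, 10]
column_ids = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
dist, pred = top_down_bfs(row_pointers, column_ids, source=0)
```

In a distributed version, the inner `if dist[neighb] == -1` branch would apply only to locally owned neighbors, and remote ones would be batched and sent, as in the pseudocode.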
Top-down breadth-first search
Data transfers follow the data transfer matrix
  ◦ $b_{jk}$: amount of data (bytes) to transfer from process $j$ to process $k$
[Figure: example data transfer matrix between processes; the layout of its entries was lost in extraction]
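As an illustration of how such a matrix could arise in one top-down iteration, the sketch below counts the bytes each process would send for a given frontier. Everything specific here is an assumption made for the example: vertices are split into contiguous blocks across processes, each remote neighbor costs a fixed `item_bytes`, and the function name `transfer_matrix` is hypothetical.

```python
def transfer_matrix(row_pointers, column_ids, frontier, num_procs, item_bytes=8):
    """Return b, where b[j][k] is the number of bytes process j sends to k."""
    n = len(row_pointers) - 1
    block = -(-n // num_procs)            # ceil(n / num_procs) vertices per process
    b = [[0] * num_procs for _ in range(num_procs)]
    for vert in frontier:
        j = vert // block                 # owner of the active vertex
        for neighb in column_ids[row_pointers[vert]:row_pointers[vert + 1]]:
            k = neighb // block           # owner of the neighbor
            if k != j:                    # remote neighbor: must be sent
                b[j][k] += item_bytes
    return b


# Hypothetical five-vertex graph in CSR form, split over two processes:
# vertices 0-2 belong to process 0, vertices 3-4 to process 1.
row_pointers = [0, 3, 6, 7, 10, 12]
column_ids = [1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3]
first_level = transfer_matrix(row_pointers, column_ids, [0], num_procs=2)
second_level = transfer_matrix(row_pointers, column_ids, [1, 2, 3], num_procs=2)
```

The sender cannot know whether a remote neighbor has already been visited, so the sketch counts every remote neighbor, which is why top-down traffic grows with the size of the frontier.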
Top-down breadth-first search
Data transfer time for every iteration of the algorithm
  ◦ Graph with ~1 million vertices, parallelized across 4 nodes
  ◦ Total time spent on data transfer: 0.83 s
[Figure: data transfer time (s) versus iteration, iterations 1 to 7]
Bottom-up breadth-first search

    for all u in dist
        dist[u] ← -1
    dist[s] ← 0
    level ← 0
    do
        parallel for each vert in V.this_node
            if dist[vert] = -1
                for each neighb in vert.neighbors
                    if bitmap_current.neighb = 1
                        dist[vert] ← level + 1
                        pred[vert] ← neighb
                        bitmap_next.vert ← 1
                        break
        all_gather(bitmap_next)
        swap(bitmap_current, bitmap_next)
        level++
    while (!check_end())
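The bottom-up pseudocode can be sketched on a single process as follows. This is an illustrative assumption-laden version, not the author's code: the `all_gather` step disappears because there is only one process, the bitmaps are plain Python lists of booleans, the graph is assumed to be undirected so that a vertex's neighbor list also contains its potential parents, and the name `bottom_up_bfs` is hypothetical.

```python
def bottom_up_bfs(row_pointers, column_ids, source):
    """Level-synchronous bottom-up BFS on one process (no all_gather)."""
    n = len(row_pointers) - 1
    dist = [-1] * n
    pred = [-1] * n
    dist[source] = 0
    bitmap_current = [False] * n      # vertices discovered at the current level
    bitmap_current[source] = True
    level = 0
    while any(bitmap_current):
        bitmap_next = [False] * n
        for vert in range(n):
            if dist[vert] != -1:
                continue              # only unvisited vertices search for a parent
            for neighb in column_ids[row_pointers[vert]:row_pointers[vert + 1]]:
                if bitmap_current[neighb]:
                    dist[vert] = level + 1
                    pred[vert] = neighb
                    bitmap_next[vert] = True
                    break             # one active neighbor is enough
        bitmap_current = bitmap_next  # replaces swap() after all_gather()
        level += 1
    return dist, pred


# Hypothetical undirected example: edges 0-1, 0-2, 1-3, 2-3, 3-4.
row_pointers = [0, 2, 4, 6, 9, 10]
column_ids = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
dist, pred = bottom_up_bfs(row_pointers, column_ids, source=0)
```

The early `break` is what makes this direction cheap once most vertices are active: an unvisited vertex stops scanning as soon as it finds one parent.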
Bottom-up breadth-first search
Data synchronization through collective communications
[Figure: nodes 0 to 3 before and after the all_gather operation]
Bottom-up breadth-first search
Data transfer time for every iteration of the algorithm
  ◦ Graph with ~1 million vertices, parallelized across 4 nodes
  ◦ Total time spent on data transfer: 0.001 s
[Figure: data transfer time (s) versus iteration, iterations 1 to 7]
Data transfer problem
Suggested solution: hybrid graph traversal
  ◦ First two iterations: top-down
  ◦ Next three iterations: bottom-up
  ◦ All remaining iterations: top-down
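The fixed schedule above can be sketched as a single-process BFS that switches direction by iteration number. This is only an illustration under stated assumptions: iterations are counted from 0, the graph is undirected and stored in CSR arrays, the frontier is a Python set, and the names `direction_for` and `hybrid_bfs` are hypothetical.

```python
def direction_for(iteration, top_down_first=2, bottom_up_count=3):
    """Top-down for the first two iterations, bottom-up for the next three."""
    if top_down_first <= iteration < top_down_first + bottom_up_count:
        return "bottom-up"
    return "top-down"


def hybrid_bfs(row_pointers, column_ids, source):
    n = len(row_pointers) - 1
    dist = [-1] * n
    dist[source] = 0
    frontier = {source}
    level = 0
    directions = []                      # direction used at each iteration
    while frontier:
        direction = direction_for(level)
        directions.append(direction)
        next_frontier = set()
        if direction == "top-down":
            for vert in frontier:        # active vertices push to neighbors
                for neighb in column_ids[row_pointers[vert]:row_pointers[vert + 1]]:
                    if dist[neighb] == -1:
                        dist[neighb] = level + 1
                        next_frontier.add(neighb)
        else:
            for vert in range(n):        # unvisited vertices look for a parent
                if dist[vert] != -1:
                    continue
                for neighb in column_ids[row_pointers[vert]:row_pointers[vert + 1]]:
                    if neighb in frontier:
                        dist[vert] = level + 1
                        next_frontier.add(vert)
                        break
        frontier = next_frontier
        level += 1
    return dist, directions


# Hypothetical path graph 0-1-2-3-4-5-6, chosen so that BFS needs 7 iterations.
row_pointers = [0, 1, 3, 5, 7, 9, 11, 12]
column_ids = [1, 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5]
dist, directions = hybrid_bfs(row_pointers, column_ids, source=0)
```

Both directions produce the same distances, so the schedule only changes how much work and communication each iteration needs.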
Hybrid graph traversal
Data transfer time for every iteration of the algorithm
  ◦ Graph with ~1 million vertices, parallelized across 4 nodes
  ◦ Total time spent on data transfer: 0.0005 s
[Figure: data transfer time (s) versus iteration, iterations 1 to 7]
Graph marking problem
Skewed degree distribution
  ◦ Few vertices with a large number of incoming/outgoing edges
  ◦ Many vertices with a small number of incoming/outgoing edges
Graph marking problem
CSR is one of the most popular formats for storing graph data
[Figure: example graph with vertices 0 to 4]
  Row pointers: 0, 3, 6, 7, 10, 12
  Column ids: 1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3
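As a small illustration of how the two CSR arrays are read, the sketch below looks up neighbor lists and degrees for the example arrays on this slide. The helper names `neighbors` and `degree` are hypothetical and chosen only for this example.

```python
# CSR arrays from the slide: vertex v owns the slice
# column_ids[row_pointers[v] : row_pointers[v + 1]].
row_pointers = [0, 3, 6, 7, 10, 12]
column_ids = [1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3]


def neighbors(v):
    """Return the neighbor list of vertex v."""
    return column_ids[row_pointers[v]:row_pointers[v + 1]]


def degree(v):
    """Return the number of edges stored for vertex v."""
    return row_pointers[v + 1] - row_pointers[v]


degrees = [degree(v) for v in range(len(row_pointers) - 1)]
```

The degree of a vertex is known only after reading two adjacent row pointers, which is the source of the imbalance discussed next: a loop over vertices gives each worker a different number of edges.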
Graph marking problem
Using CSR means traversing the graph through the Row pointers array
  ◦ It is not known in advance how many edges are incident to a given vertex (i.e. how many elements of the Column ids array must be traversed)
The duration of every iteration of a level-synchronous algorithm is determined mostly by the time needed to process the "heavy-weight" vertices
  ◦ Workload imbalance
Graph marking problem
Suggested solution: workload balancing
  ◦ Divide the Column ids array into equal parts of max_edges elements each
  ◦ Map every part onto the Row pointers (Rowstarts) and Column ids arrays using an additional array, Part column
  ◦ Every thread then processes a number of edges known in advance, determined by the corresponding elements of the Part column array
Graph marking problem
[Figure: graph processing without workload balancing]
Graph marking problem
[Figure: graph processing with workload balancing, max_edges = 4]
Graph marking problem
Filling the Part column array

    parallel for i in V.this_node
        first ← V.this_node[i]
        last ← V.this_node[i+1]
        index ← round_up(first/max_edges)
        current ← index*max_edges
        while (current < last)
            part_column[index] ← i
            current += max_edges
            index++
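A sequential Python sketch of this filling step is shown below. It assumes that `first` and `last` are the CSR row pointers of vertex `i`, so that `part_column[p]` ends up holding the vertex that owns edge number `p * max_edges`; the function name `fill_part_column` is hypothetical.

```python
def fill_part_column(row_pointers, max_edges):
    """For every block of max_edges edges, record the vertex owning its first edge."""
    num_edges = row_pointers[-1]
    num_parts = -(-num_edges // max_edges)        # ceil(num_edges / max_edges)
    part_column = [-1] * num_parts
    for i in range(len(row_pointers) - 1):
        first = row_pointers[i]
        last = row_pointers[i + 1]
        index = -(-first // max_edges)            # round_up(first / max_edges)
        current = index * max_edges               # first block start at or after `first`
        while current < last:                     # block starts inside vertex i's edges
            part_column[index] = i
            current += max_edges
            index += 1
    return part_column


# Row pointers of the five-vertex CSR example (12 edges in total).
row_pointers = [0, 3, 6, 7, 10, 12]
parts_of_4 = fill_part_column(row_pointers, max_edges=4)
parts_of_2 = fill_part_column(row_pointers, max_edges=2)
```

With `max_edges = 4` the blocks start at edges 0, 4 and 8, which belong to vertices 0, 1 and 3. Each iteration of the outer loop writes to distinct entries of `part_column`, so the loop can run in parallel as in the pseudocode.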
Top-down breadth-first search
Main loop of level-synchronous breadth-first search, modified for workload balancing

    // preparation...
    parallel for i in part_column
        first_edge ← i*max_edges
        last_edge ← (i+1)*max_edges
        curr_vert ← part_column[i]
        for each edge in [first_edge; last_edge)
            if neighbors of curr_vert in [first_edge; last_edge)
                if dist[curr_vert] = level
                    for each k in neighbors of curr_vert
                        if dist[k] = -1
                            dist[k] ← level + 1
                            pred[k] ← curr_vert
            curr_vert++
    // data synchronization...
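One possible reading of this balanced loop is sketched below in Python. It is an interpretation, not the author's code: each block walks its own fixed-size slice of the Column ids array edge by edge and advances `curr_vert` whenever the slice crosses into the next vertex. The function name `balanced_top_down_step` is hypothetical, and the `part_column` values are the ones that correspond to `max_edges = 4` for this example graph.

```python
def balanced_top_down_step(row_pointers, column_ids, part_column,
                           max_edges, dist, pred, level):
    """One top-down iteration in which every block handles at most max_edges edges."""
    num_edges = row_pointers[-1]
    updated = False
    for part, start_vert in enumerate(part_column):   # one block per worker
        first_edge = part * max_edges
        last_edge = min(first_edge + max_edges, num_edges)
        curr_vert = start_vert
        for edge in range(first_edge, last_edge):
            while row_pointers[curr_vert + 1] <= edge:
                curr_vert += 1                        # move to the vertex owning this edge
            if dist[curr_vert] != level:
                continue                              # owner is not active at this level
            k = column_ids[edge]
            if dist[k] == -1:
                dist[k] = level + 1
                pred[k] = curr_vert
                updated = True
    return updated


# Five-vertex CSR example; part_column holds the owners of edges 0, 4 and 8.
row_pointers = [0, 3, 6, 7, 10, 12]
column_ids = [1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3]
part_column = [0, 1, 3]
dist = [-1] * 5
pred = [-1] * 5
dist[0] = 0
level = 0
while balanced_top_down_step(row_pointers, column_ids, part_column,
                             4, dist, pred, level):
    level += 1
```

Every block touches the same number of edges regardless of vertex degree, which is the point of the balancing method. A truly parallel version would also need to handle two blocks writing to the same `dist[k]`, which this sequential sketch does not address.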
Top-down breadth-first search
Time spent on graph traversal with and without workload balancing
  ◦ Graph with ~1 million vertices, parallelized across 4 nodes
[Figure: time (s) versus iteration, iterations 1 to 7; series: without balancing, with balancing]
Bottom-up breadth-first search
Main loop of level-synchronous breadth-first search, modified for workload balancing

    // preparation...
    parallel for i in part_column
        first_edge ← i*max_edges
        last_edge ← (i+1)*max_edges
        curr_vert ← part_column[i]
        for each edge in [first_edge; last_edge)
            if neighbors of curr_vert in [first_edge; last_edge)
                if dist[curr_vert] = -1
                    for each k in neighbors of curr_vert
                        if bitmap_current.k = 1
                            dist[curr_vert] ← level + 1
                            pred[curr_vert] ← k
                            bitmap_next.curr_vert ← 1
                            break
            curr_vert++
    // data synchronization...
Bottom-up breadth-first search
Time spent on graph traversal with and without workload balancing
  ◦ Graph with ~1 million vertices, parallelized across 4 nodes
[Figure: time (s) versus iteration, iterations 1 to 7; series: without balancing, with balancing]
Combining methods
The two methods can be used together to maximize the performance of breadth-first search
  ◦ Workload balancing: reduces the time spent on graph processing in each iteration
  ◦ Hybrid traversal: reduces data transfer overheads in the data synchronization step of each iteration
Benchmarking
All methods are integrated into a custom implementation of the Graph500 benchmark
Performance of the custom implementation is measured for various numbers of nodes
  ◦ 1, 2, 4, 8 nodes of the "Uran" supercomputer
  ◦ Intel Xeon X5675 CPU, 192 GB DRAM
  ◦ "Scale" varies from 20 to 25
The custom implementation is compared with the reference Graph500 implementations
  ◦ Simple implementation
  ◦ Replicated implementation
Performance metric: speed of graph traversal
  ◦ Measured in Traversed Edges Per Second (TEPS)
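For orientation, the sketch below shows how the benchmark parameters relate to graph size and to the TEPS metric. It assumes the Graph500 convention that a graph of scale S has 2^S vertices and, with the default edge factor of 16, 16 × 2^S edges; the helper names and the sample timing are hypothetical and are not results from the talk.

```python
def graph500_size(scale, edgefactor=16):
    """Return (number of vertices, number of edges) for a Graph500 scale."""
    num_vertices = 2 ** scale
    return num_vertices, edgefactor * num_vertices


def teps(traversed_edges, seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / seconds


vertices_20, edges_20 = graph500_size(20)
vertices_25, edges_25 = graph500_size(25)
# Made-up timing, only to show the arithmetic: 16,777,216 edges in 0.5 s.
example_mteps = teps(edges_20, 0.5) / 1e6
```

Scale 20 corresponds to about one million vertices, and each increment of the scale doubles the graph.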
Results (1 node)
[Figure: speed (MTEPS) versus scale, scales 20 to 25; series: custom, replicated, simple]
Results (2 nodes)
[Figure: speed (MTEPS) versus scale, scales 20 to 25; series: custom, replicated, simple]
Results (4 nodes)
[Figure: speed (MTEPS) versus scale, scales 20 to 25; series: custom, replicated, simple]
Results (8 nodes)
[Figure: speed (MTEPS) versus scale, scales 20 to 25; series: custom, replicated, simple]
Results
  ◦ Combining workload balancing with traversal hybridization improves the performance of parallel breadth-first search
  ◦ The custom implementation has potential for further parallelization
Conclusion
  ◦ Workload balancing helps to reduce the overheads of graph processing
  ◦ Traversal hybridization helps to reduce the data transfer overheads in every iteration
Future plans
  ◦ Investigate the scalability of the developed algorithm
  ◦ Adapt the custom implementation to accelerators and coprocessors
Questions?