GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems
Dipanjan Sengupta, Shuaiwen Leon Song, Kapil Agarwal, Karsten Schwan
Pacific Northwest National Lab | CERCS - Georgia Tech
Talk Outline
- Motivation
- Background on GAS
- Hybrid Programming Model
- GraphReduce Architecture
- Experimental Results
- Conclusion
- Future Work
Motivation
Why use GPUs?
- GPU-based frameworks are orders of magnitude faster
- Previous GPU-based graph processing doesn't handle datasets that don't fit in GPU memory
- The Yahoo-web graph with 1.4 billion vertices requires 6.6 GB of memory just to store its vertex values
Several challenges in large-scale graph processing:
- How to partition the graph?
- How and when to move the partitions between host and GPU?
- How to best extract multi-level parallelism in GPUs?
Background – GAS model
[Figure: the Gather, Apply, and Scatter phases acting on a vertex v and its neighboring vertices U1..U4]
- Gather phase: each vertex aggregates values associated with its incoming edges and source vertices
- Apply phase: each vertex updates its state using the gather result
- Scatter phase: each vertex updates the state of every outgoing edge
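As a hedged illustration (not the authors' GPU code), one GAS superstep can be sketched on a small in-memory graph. The single-float vertex state, edge weights, and the PageRank-style apply rule are assumptions made for concreteness; the real model is generic in the reduction and update functions:

```python
# Minimal sketch of one Gather-Apply-Scatter superstep.
# Assumptions: vertex state is one float, gather reduces by weighted sum,
# and apply uses a PageRank-style damping rule (0.15 + 0.85 * acc).

def gas_superstep(num_vertices, in_edges, out_edges, state):
    # Gather: each vertex aggregates values from its incoming edges
    # and their source vertices.
    acc = [0.0] * num_vertices
    for v in range(num_vertices):
        for (src, weight) in in_edges[v]:
            acc[v] += weight * state[src]
    # Apply: each vertex updates its state using the gather result.
    for v in range(num_vertices):
        state[v] = 0.15 + 0.85 * acc[v]
    # Scatter: each vertex updates the state of every outgoing edge.
    for v in range(num_vertices):
        for e in out_edges[v]:
            e["value"] = state[v]
    return state
```

For a two-vertex graph with a single edge 0 -> 1, vertex 1 gathers from vertex 0 while vertex 0 gathers nothing and falls back to the base value.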
Hybrid Programming Model

Vertex-centric GAS:
    vertex_scatter(vertex v)
        send updates over outgoing edges of v
    vertex_gather(vertex v)
        apply updates from inbound edges of v
    while not done
        for all vertices v that need to scatter updates
            vertex_scatter(v)
        for all vertices v that have updates
            vertex_gather(v)

Edge-centric GAS:
    edge_scatter(edge e)
        send update over e
    update_gather(update u)
        apply update u to u.destination
    while not done
        for all edges e
            edge_scatter(e)
        for all updates u
            update_gather(u)

- Existing systems choose either a vertex-centric or an edge-centric GAS programming model for graph execution
- Different processing phases have different types of parallelism and memory access characteristics
- GraphReduce adopts a hybrid model combining the vertex- and edge-centric models
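The two loop structures above can be contrasted in a hedged sketch. The dictionary-based vertex and edge representations are illustrative assumptions; the point is only that the outer loop runs over vertices in one model and over edges/updates in the other:

```python
# Vertex-centric GAS: the outer loops run over vertices.
def vertex_centric_step(vertices, scatter_set):
    updates = []
    for v in scatter_set:                 # vertices that need to scatter
        for e in v["out_edges"]:          # vertex_scatter(v)
            updates.append((e["dst"], v["value"]))
    for (dst, val) in updates:            # gather/apply the queued updates
        vertices[dst]["value"] += val
    return vertices

# Edge-centric GAS: the outer loops run over edges and updates directly,
# which yields more uniform work items for massively parallel hardware.
def edge_centric_step(vertices, edges):
    updates = [(e["dst"], vertices[e["src"]]["value"])
               for e in edges]            # edge_scatter(e)
    for (dst, val) in updates:            # update_gather(u)
        vertices[dst]["value"] += val
    return vertices
```

On the same input graph the two formulations compute the same result; they differ in which entity the parallel loop is keyed on, and hence in load balance and memory access pattern on a GPU.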
GraphReduce Architecture
GraphReduce Architecture Contd…
Three major components: Partition Engine, Data Movement Engine, Computation Engine
Partition Engine has two responsibilities:
- Load-balanced shard creation, such that each shard contains approximately the same number of edges
- Ordering the edges in a shard by their source or destination vertices for efficient data movement and memory access
Data Movement Engine has the following responsibilities:
- Moving shards in and out of limited GPU memory to process large-scale graphs
- Efficiently utilizing GPU hardware resources via CUDA streams and Hyper-Q to achieve high performance
- Saturating the data transfer bandwidth of the PCIe bus connecting the host and the GPUs
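As a hypothetical sketch of the Partition Engine's two responsibilities (the slides do not give the actual algorithm), shard creation could split the edge list into pieces with roughly equal edge counts and then sort each shard's edges by destination vertex:

```python
# Hypothetical sketch of load-balanced shard creation.
# Assumptions: edges are (src, dst) pairs, shards are balanced by a
# simple equal-size split of the edge list, and ordering is by
# destination vertex (source-ordered shards would sort by e[0]).

def make_shards(edges, num_shards):
    # Each shard gets approximately the same number of edges.
    per_shard = (len(edges) + num_shards - 1) // num_shards
    shards = []
    for i in range(0, len(edges), per_shard):
        # Order edges within the shard by destination vertex so that
        # updates to the same vertex land in adjacent memory locations.
        shard = sorted(edges[i:i + per_shard], key=lambda e: e[1])
        shards.append(shard)
    return shards
```

Balancing by edge count rather than vertex count matters because per-shard work in an edge-centric phase is proportional to the number of edges, not vertices.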
Compute Engine
Four phases of computation:
- Gather Map: fetches all the updates/messages along the in-edges
- Gather Reduce: reduces all the collected updates for each vertex
- Apply: applies the update to each vertex
- Scatter: distributes the updated states of the vertices along the out-edges
Combination of vertex- and edge-centric implementations:
- Gather Map – edge-centric
- Gather Reduce – vertex-centric
- Apply – vertex-centric
- Scatter – edge-centric
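The four phases above can be sketched end to end as a hedged illustration; which loops are keyed on edges versus vertices follows the slide, while the averaging apply rule and the message dictionary are assumptions made so the sketch is self-contained:

```python
# Hedged sketch of the four compute phases. Gather Map and Scatter
# iterate over edges (edge-centric); Gather Reduce and Apply iterate
# over vertices (vertex-centric). The 50/50 averaging apply rule is
# an illustrative assumption, not the framework's fixed semantics.

def compute_superstep(num_vertices, edges, state):
    # Gather Map (edge-centric): fetch the message along each in-edge.
    contributions = [(dst, state[src]) for (src, dst) in edges]
    # Gather Reduce (vertex-centric): reduce collected updates per vertex.
    acc = [0.0] * num_vertices
    for (dst, val) in contributions:
        acc[dst] += val
    # Apply (vertex-centric): update each vertex from its reduced value.
    state = [0.5 * state[v] + 0.5 * acc[v] for v in range(num_vertices)]
    # Scatter (edge-centric): push the new state along every out-edge.
    messages = {(src, dst): state[src] for (src, dst) in edges}
    return state, messages
```

Splitting gather into an edge-centric map and a vertex-centric reduce lets each phase use the loop shape with the better parallelism and memory access pattern for that step.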
Experimental Setup
Node configuration:
- Two Intel Xeon E5-2670 processors running at 2.6 GHz and 32 GB of RAM
- NVIDIA Tesla K20c GPU with 4.8 GB of DRAM
Benchmarks and Datasets
- Graph algorithms used are BFS and PageRank
- 9 real-world and synthetic graph datasets, as shown in the table
Results
Conclusions
- GraphReduce develops a graph processing framework for input datasets that may or may not fit in GPU memory
- Adopts a combination of both edge- and vertex-centric implementations of the GAS programming model
- Leverages CUDA streams and hardware support such as Hyper-Q to stream data in and out of the GPU for high performance
- Outperforms CPU-based out-of-core graph processing frameworks across a variety of real datasets
Future Work
- Extending the GraphReduce framework to multiple nodes in a cluster using communication models like MPI
- Addressing the limited on-node memory size through the use of SSDs and other storage devices
- Processing dynamically evolving graphs
- Understanding how dynamic profiling could be integrated into GraphReduce
Thank You!