Making Pull-Based Graph Processing Performant
Samuel Grossman¹, Heiner Litz², and Christos Kozyrakis¹
¹Stanford University, ²University of California, Santa Cruz
PPoPP 2018 · February 27, 2018
Graph Processing
• Problem modelled as objects (vertices) and connections between them (edges)
• Examples:
  • Internet (pages and hyperlinks)
  • Social network (people and friendships)
  • Roads and intersections
  • Products and ratings
Graph Processing
[Figure: example graph with vertices labeled A through L.]
Graph Processing
[Figure: the same example graph with updated vertex values A’ through L’.]
Repeat until convergence
Graph Processing
• Push: group by source vertex
• Pull: group by destination vertex
• Hybrid: dynamically select push or pull for each iteration
Graph Processing
foreach vertex v in graph.vertices
    foreach edge e in v.(in|out)edges
        // process the edge
        ...
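As a minimal sketch of the two groupings (not Grazelle's implementation), the nested loop above can be instantiated for one accumulation step on a small made-up graph. The edge list, the `contrib` values, and the function names `push_step` and `pull_step` are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical 4-vertex graph, given as (source, destination) pairs.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
NUM_VERTICES = 4


def push_step(contrib):
    """Group edges by source: each vertex writes to its out-neighbors."""
    out_edges = defaultdict(list)
    for src, dst in EDGES:
        out_edges[src].append(dst)
    acc = [0.0] * NUM_VERTICES
    for v in range(NUM_VERTICES):      # outer loop over vertices
        for dst in out_edges[v]:       # inner loop over out-edges
            acc[dst] += contrib[v]     # scattered write to a neighbor
    return acc


def pull_step(contrib):
    """Group edges by destination: each vertex reads from its in-neighbors."""
    in_edges = defaultdict(list)
    for src, dst in EDGES:
        in_edges[dst].append(src)
    acc = [0.0] * NUM_VERTICES
    for v in range(NUM_VERTICES):      # outer loop over vertices
        total = 0.0
        for src in in_edges[v]:        # inner loop over in-edges
            total += contrib[src]      # read from a neighbor
        acc[v] = total                 # single write to the vertex itself
    return acc


contrib = [1.0, 2.0, 4.0, 8.0]
push_result = push_step(contrib)
pull_result = pull_step(contrib)
```

Both functions produce the same accumulated values. They differ in where the writes land: push scatters writes across neighbors, while pull writes only to the vertex owned by the current outer-loop iteration.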
Parallelizing Graph Processing
• Outer loop parallelization
  • Between cores: assign entire vertices to threads
• Inner loop parallelization
  • Between cores: subdivide the edges within each vertex
  • Within one core: vectorize the loop
Parallelizing Graph Processing
[Figure: bar chart of speedup (axis 0 to 10) for PageRank and Breadth-First Search under four configurations: Push (Outer Loop), Push (Both Loops), Push (Both) + Pull (Outer), and Push (Both) + Pull (Both). Running Ligra on the twitter-2010 graph.]
Pull’s Performance Challenge
Serial Inner Loop
• One thread per vertex
• Updates are thread-private
Parallel Inner Loop
• Multiple threads per vertex
• Updates will conflict
Contribution #1: “Scheduler Awareness”
A technique that can be applied to the inner loop of a pull engine to parallelize it without introducing conflicts.
Pull’s Performance Opportunity
• Further gains possible using SIMD vectorization
  • Improve parallelism of the computation
  • Improve memory bandwidth utilization
• Data structure layout issues impede effective vectorization in existing work
Contribution #2: “Vector-Sparse”
A low-level modification to a data structure commonly used to represent graphs, intended to enhance vectorization.
Grazelle
• A hybrid graph processing framework that embodies both of our contributions
• Outperforms the state-of-the-art by over 10× in some cases
• Available for download at https://github.com/stanford-mast/Grazelle-PPoPP18
Scheduler Awareness
Contribution #1
Serial Inner Loop
[Figure: vertex data for vertices 0, 1, and 2 (outer loop) above their edge lists (inner loop). Vertex 0 has edges 0 to 7, vertex 1 has edges 0 to 2, and vertex 2 has edges 0 to 4. The work is divided into Chunk A, Chunk B, and Chunk C.]
Scheduler Un-Awareness
[Figure: vertex data for vertices 0, 1, and 2 (outer loop) above their edge lists (inner loop). Vertex 0 has edges 0 to 7, vertex 1 has edges 0 to 2, and vertex 2 has edges 0 to 4. The work is divided into Chunk A, Chunk B, and Chunk C.]
Scheduler Awareness
[Figure: vertex data and merge buffers for vertices 0, 1, and 2 (outer loop) above their edge lists (inner loop). Vertex 0 has edges 0 to 7, vertex 1 has edges 0 to 2, and vertex 2 has edges 0 to 4. The work is divided into Chunk A, Chunk B, and Chunk C.]
1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.
2. Writes in the middle of a chunk can be committed to shared state without synchronization.
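The two rules above can be sketched as follows. This is an illustrative sketch rather than Grazelle's code: the in-edge counts (8, 3, and 5) follow the slide's example, while the edge values, `CHUNK_SIZE = 6`, the summation operator, and the function names are assumptions. The slides do not state where the chunk boundaries fall.

```python
from concurrent.futures import ThreadPoolExecutor

# In-edge counts from the slide's example: vertex 0 has 8 edges,
# vertex 1 has 3, vertex 2 has 5. Edge values are made up.
DEGREES = [8, 3, 5]
EDGE_DST = [v for v, d in enumerate(DEGREES) for _ in range(d)]
EDGE_VAL = [float(i + 1) for i in range(len(EDGE_DST))]
CHUNK_SIZE = 6  # assumption: fixed-size chunks over the edge array


def process_chunk(start, end, vertex_data):
    """Process edges [start, end) and return this chunk's merge-buffer entry."""
    current = EDGE_DST[start]
    partial = 0.0
    for i in range(start, end):
        if EDGE_DST[i] != current:
            # Mid-chunk write: only the chunk that sees a vertex's edge list
            # end before the chunk does commits that vertex directly.
            vertex_data[current] = partial
            current, partial = EDGE_DST[i], 0.0
        partial += EDGE_VAL[i]
    # End-of-chunk write: the vertex may continue in the next chunk, so the
    # partial result is returned as a private merge-buffer entry instead.
    return current, partial


def pull_scheduler_aware(num_workers=3):
    vertex_data = [0.0] * len(DEGREES)
    bounds = [(s, min(s + CHUNK_SIZE, len(EDGE_DST)))
              for s in range(0, len(EDGE_DST), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(process_chunk, s, e, vertex_data)
                   for s, e in bounds]
        merge_buffers = [f.result() for f in futures]
    # Serial merge pass after all chunks finish: fold each per-chunk
    # partial into the shared vertex data.
    for vertex, partial in merge_buffers:
        vertex_data[vertex] += partial
    return vertex_data, merge_buffers


result, buffers = pull_scheduler_aware()
```

With these assumed boundaries, vertex 0 spans the first two chunks and vertex 2 spans the last two, so the merge pass combines one buffered partial for vertex 0 and two for vertex 2. Smaller chunks would produce more merge-buffer entries, which is the overhead the next slide refers to.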
Analyzing Scheduler Awareness
• Performance impact depends primarily on the scheduling granularity
• Scheduler Un-Awareness: trade-off between load balance and probability of write conflicts
• Scheduler Awareness: finer granularity leads to increased merge operation overhead
PageRank: Performance vs. Scheduling Granularity
[Figure: two plots of relative execution time (0.0 to 1.2) versus chunk size, comparing Scheduler Un-Aware and Scheduler-Aware. Left: dimacs-usa (low, even degree distribution), chunk sizes 100 to 10,000. Right: uk-2007 (extremely skewed), chunk sizes 1,000 to 100,000. The chunk-size axes are 10× different. Annotated ratios: 50×, 1.2×, 3.3×.]
PageRank: Performance vs. Number of Cores
[Figure: two plots of relative performance versus number of physical cores (0 to 56), comparing Scheduler Un-Aware and Scheduler-Aware, for dimacs-usa (low, even degree distribution) and uk-2007 (extremely skewed). Plot annotations: “Scaling enabled by Scheduler Awareness” and “Scaling improved by Scheduler Awareness”.]
Key Insights
• Huge improvement for extremely skewed graphs
• Still beneficial for evenly-distributed low-degree graphs
Vector-Sparse
Contribution #2
Compressed-Sparse
Vertex Index
  position:  [0]  [1]  [2]  [3]
  value:       0    3    7   10
Edges
  position:  [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]
  value:      23   10   50    4    0   53   62    1   78   50   23    4
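A short sketch of how the two arrays are read, using the values from this slide. The function name `neighbors` is an assumption, as is the convention that the last vertex's edge list runs to the end of the edge array.

```python
# Arrays copied from the slide: VERTEX_INDEX[v] is the position in EDGES
# where vertex v's edge list begins.
VERTEX_INDEX = [0, 3, 7, 10]
EDGES = [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4]


def neighbors(v):
    """Return the edge-list slice for vertex v."""
    start = VERTEX_INDEX[v]
    # Assumed convention: the last vertex's list runs to the end of EDGES.
    end = VERTEX_INDEX[v + 1] if v + 1 < len(VERTEX_INDEX) else len(EDGES)
    return EDGES[start:end]


all_lists = [neighbors(v) for v in range(len(VERTEX_INDEX))]
```

The edge lists have lengths 3, 4, 3, and 2, so only vertex 0's list begins at a position that is a multiple of four.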
Vectorizing Compressed-Sparse
[Figure: the edge array ([0] to [11]: 23 10 50 4 0 53 62 1 78 50 23 4) with the ranges of Vertex 0 and Vertex 1 marked; one animation step is marked ×.]
Per the vertex index, Vertex 0 occupies [0] to [2] and Vertex 1 occupies [3] to [6].
Vector-Sparse
  position:  [0]  [1]  [2]  [3] | [4]  [5]  [6]  [7] | [8]  [9] [10] [11] | [12] [13]
  edge:       23   10   50    × |   4    0   53   62 |   1   78   50    × |   23    4
  valid:       1    1    1    0 |   1    1    1    1 |   1    1    1    0 |    1    1
  vertex:          Vertex 0     |      Vertex 1      |      Vertex 2      |
Padding + “valid” bits + top-level vertex spread-encoding
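A minimal sketch of the padding and valid-bit idea, assuming four lanes per vector as the slide's padding suggests. The slides do not give the bit-level encoding, so this sketch stores the vertex ID and valid bits as separate fields; the function name `to_vector_sparse` and the dummy padding value 0 are also assumptions.

```python
VECTOR_WIDTH = 4  # assumption: 4 lanes per vector, matching the slide's padding


def to_vector_sparse(vertex_index, edges, width=VECTOR_WIDTH):
    """Pack each vertex's edges into whole vectors of `width` lanes.

    Returns a list of (vertex_id, lanes, valid_bits) tuples, one per vector.
    """
    vectors = []
    for v, start in enumerate(vertex_index):
        end = vertex_index[v + 1] if v + 1 < len(vertex_index) else len(edges)
        for base in range(start, end, width):
            lanes = edges[base:min(base + width, end)]
            valid = [1] * len(lanes)
            pad = width - len(lanes)
            # Padding lanes hold a dummy value and are marked invalid.
            vectors.append((v, lanes + [0] * pad, valid + [0] * pad))
    return vectors


VERTEX_INDEX = [0, 3, 7, 10]
EDGES = [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4]
packed = to_vector_sparse(VERTEX_INDEX, EDGES)
```

Because every vector now belongs to exactly one vertex and starts on a vector boundary, a vectorized inner loop can load whole vectors and use the valid bits to mask out padding lanes.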
Analyzing Vector-Sparse
[Figure, left: Packing Efficiency (0% to 100%) for dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007. Generally ≥ 75%.]
[Figure, right: Performance Impact, speedup (0.0 to 3.0) for PageRank, CC, and BFS, plus the average, on the same graphs. 1.5× to 2.5×.]
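The slides do not define packing efficiency; a plausible reading, assumed here, is the fraction of vector lanes that hold real edges rather than padding. Under that assumption and a 4-lane vector, it can be computed from the per-vertex degrees alone. The function name `packing_efficiency` is hypothetical.

```python
import math


def packing_efficiency(degrees, width=4):
    """Fraction of vector lanes that hold real edges rather than padding."""
    total_edges = sum(degrees)
    # Each vertex occupies a whole number of vectors.
    total_lanes = sum(math.ceil(d / width) * width for d in degrees)
    return total_edges / total_lanes if total_lanes else 1.0


# Degrees of the four vertices in the Compressed-Sparse example.
slide_example = packing_efficiency([3, 4, 3, 2])
```

For the slide's small example this gives 12 edges in 16 lanes. Vertices whose degree is a multiple of the vector width waste no lanes, while a degree-1 vertex fills only one lane in four.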
Performance Comparison
Putting it all together
Evaluation Scope
• Grazelle is compared with Ligra, Polymer, GraphMat, and X-Stream
• Three applications: PageRank, Connected Components, Breadth-First Search
• Running on a machine equipped with four Intel Xeon E7-4850 v3 processors
  • 14 physical cores / 28 logical cores per socket
PageRank: Peak Processing Throughput
[Figure: execution time in ms (logarithmic axis, 1E+0 to 1E+5) on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle-Pull, Grazelle-Push, Ligra-Pull, Ligra-Push, Polymer, GraphMat, and X-Stream. Annotated ratios: 15.2×, 1.4×, 2.3×, 2.3×, 3.6×.]
Connected Components: Dynamic Control Flow
[Figure: execution time in ms (logarithmic axis, 1E+0 to 1E+6) on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream. Annotated ratios: 4.9×, 21.1×, 1.5×, 1.6×.]
Breadth-First Search: Compatibility of Optimizations
[Figure: execution time in ms (logarithmic axis, 1E+0 to 1E+6) on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream.]