Making Pull-Based Graph Processing Performant

Samuel Grossman¹, Heiner Litz², and Christos Kozyrakis¹ (¹Stanford University, ²University of California, Santa Cruz). PPoPP 2018, February 27, 2018.


  1. Making Pull-Based Graph Processing Performant. Samuel Grossman¹, Heiner Litz², and Christos Kozyrakis¹. ¹Stanford University, ²University of California, Santa Cruz. PPoPP 2018 · February 27, 2018

  2. Graph Processing
     • Problem modelled as objects (vertices) and connections between them (edges)
     • Examples:
       • Internet (pages and hyperlinks)
       • Social network (people and friendships)
       • Roads and intersections
       • Products and ratings

  3. Graph Processing [Figure: example graph with vertices A through L]

  4. Graph Processing [Figure: the same graph with updated vertex values A’ through L’] Repeat until convergence
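
The "repeat until convergence" pattern can be sketched with Connected Components, one of the applications evaluated later in the deck. This is a minimal illustration, not Grazelle's implementation: the function name `connected_components` and the minimum-label propagation rule are assumptions chosen for brevity.

```python
def connected_components(num_vertices, edges):
    """Label each vertex with the smallest vertex ID in its component."""
    # Undirected adjacency lists (hypothetical layout for this sketch).
    neighbors = [[] for _ in range(num_vertices)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    labels = list(range(num_vertices))
    changed = True
    while changed:  # repeat until convergence
        changed = False
        new_labels = labels[:]
        for v in range(num_vertices):
            for u in neighbors[v]:
                # Each vertex pulls the smallest label seen among its neighbors.
                if labels[u] < new_labels[v]:
                    new_labels[v] = labels[u]
                    changed = True
        labels = new_labels
    return labels
```

Each pass produces a new set of vertex values from the previous one, and the loop stops at the first pass in which no value changes.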

  5. Graph Processing
     • Push: group by source vertex
     • Pull: group by destination vertex
     • Hybrid: dynamically select push or pull for each iteration

  6. Graph Processing
       foreach vertex v in graph.vertices
         foreach edge e in v.(in|out)edges
           // process the edge
           ...
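
The nested loop above can be made concrete for both engines. The sketch below is an assumption-laden illustration: `group_edges`, `push_step`, and `pull_step` are hypothetical names, and the per-edge work is a plain sum of source values standing in for whatever the application computes.

```python
def group_edges(num_vertices, edges):
    """Group (src, dst) pairs by source (for push) and by destination (for pull)."""
    out_edges = [[] for _ in range(num_vertices)]
    in_edges = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        out_edges[src].append(dst)
        in_edges[dst].append(src)
    return out_edges, in_edges


def push_step(values, out_edges):
    """Push: each source vertex scatters its value along its out-edges."""
    result = [0] * len(values)
    for v in range(len(values)):
        for dst in out_edges[v]:
            result[dst] += values[v]  # write lands on another vertex
    return result


def pull_step(values, in_edges):
    """Pull: each destination vertex gathers values along its in-edges."""
    result = [0] * len(values)
    for v in range(len(values)):
        for src in in_edges[v]:
            result[v] += values[src]  # write always lands on v itself
    return result
```

Both functions compute the same result. The difference is where the writes go: push writes to many destinations from one outer-loop iteration, while pull writes only to the vertex owned by the current outer-loop iteration.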

  7. Parallelizing Graph Processing
     • Outer loop parallelization
       • Between cores: assign entire vertices to threads
     • Inner loop parallelization
       • Between cores: subdivide the edges within each vertex
       • Within one core: vectorize the loop

  8. Parallelizing Graph Processing [Figure: bar chart of speedup (axis 0 to 10) for PageRank and Breadth-First Search, running Ligra on the twitter-2010 graph. Series: Push (Outer Loop); Push (Both Loops); Push (Both) + Pull (Outer); Push (Both) + Pull (Both)]

  11. Pull’s Performance Challenge
      • Serial inner loop: one thread per vertex; updates are thread-private
      • Parallel inner loop: multiple threads per vertex; updates will conflict
      Contribution #1: “Scheduler Awareness”. A technique that can be applied to the inner loop of a pull engine to parallelize it without introducing conflicts.

  12. Pull’s Performance Opportunity
      • Further gains possible using SIMD vectorization
        • Improve parallelism of the computation
        • Improve memory bandwidth utilization
      • Data structure layout issues impede effective vectorization in existing work
      Contribution #2: “Vector-Sparse”. A low-level modification to a data structure commonly used to represent graphs, intended to enhance vectorization.

  13. Grazelle
      • A hybrid graph processing framework that embodies both of our contributions
      • Outperforms the state-of-the-art by over 10× in some cases
      • Available for download at https://github.com/stanford-mast/Grazelle-PPoPP18

  14. Scheduler Awareness (Contribution #1)

  15. Serial Inner Loop [Figure: vertex data; outer loop over vertices 0, 1, 2; inner loop over their edges (0 to 7, 0 to 2, 0 to 4); edges divided into Chunk A, Chunk B, Chunk C]

  17. Scheduler Un-Awareness [Figure: vertex data; outer loop over vertices 0, 1, 2; inner loop over their edges (0 to 7, 0 to 2, 0 to 4); edges divided into Chunk A, Chunk B, Chunk C]

  19. Scheduler Awareness [Figure: vertex data; outer loop over vertices 0, 1, 2; inner loop over their edges (0 to 7, 0 to 2, 0 to 4); edges divided into Chunk A, Chunk B, Chunk C]

  20. Scheduler Awareness [Figure: vertex data; outer loop over vertices 0, 1, 2; inner loop over their edges; edges divided into Chunk A, Chunk B, Chunk C]
      1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.

  21. Scheduler Awareness [Figure: vertex data and merge buffers; outer loop over vertices 0, 1, 2; inner loop over their edges; edges divided into Chunk A, Chunk B, Chunk C]
      1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.

  22. Scheduler Awareness [Figure: vertex data and merge buffers; outer loop over vertices 0, 1, 2; inner loop over their edges; edges divided into Chunk A, Chunk B, Chunk C]
      1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.
      2. Writes in the middle of a chunk can be committed to shared state without synchronization.
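
The two rules can be sketched as follows. This is a simplified model under stated assumptions, not Grazelle's code: edges are stored sorted by destination, the per-edge work is an integer sum, and the names `process_chunk` and `pull_scheduler_aware` are hypothetical. Python threads are used only to show the structure; they do not demonstrate real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(dst, contrib, start, end, shared):
    """Sum contributions for edges [start, end) of a destination-sorted edge list.

    Returns the (vertex, partial_sum) entry for this chunk's merge buffer.
    """
    current, acc = dst[start], 0
    for i in range(start, end):
        if dst[i] != current:
            # Rule 2: `current` ends in the middle of this chunk, so no other
            # chunk writes it directly. Commit without synchronization.
            shared[current] = acc
            current, acc = dst[i], 0
        acc += contrib[i]
    # Rule 1: the last vertex may continue into the next chunk, so its
    # partial sum goes to the private merge buffer instead of shared state.
    return current, acc


def pull_scheduler_aware(dst, contrib, num_vertices, chunk_size, workers=4):
    shared = [0] * num_vertices
    bounds = [(s, min(s + chunk_size, len(dst)))
              for s in range(0, len(dst), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        merge_buffers = list(pool.map(
            lambda b: process_chunk(dst, contrib, b[0], b[1], shared), bounds))
    # Merge phase: fold one entry per chunk into shared state.
    for vertex, partial in merge_buffers:
        shared[vertex] += partial
    return shared
```

With the edge counts from the figure (8, 3, and 5 edges for vertices 0, 1, and 2), any chunk size gives the same totals as a serial pass. The merge phase handles exactly one entry per chunk, so smaller chunks mean more merge work.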

  24. Analyzing Scheduler Awareness
      • Performance impact depends primarily on the scheduling granularity
      • Scheduler Un-Awareness: trade-off between load balance and probability of write conflicts
      • Scheduler Awareness: finer granularity leads to increased merge operation overhead

  25. PageRank: Performance vs. Scheduling Granularity [Figure: relative execution time (axis 0.0 to 1.2) vs. chunk size for Scheduler Un-Aware and Scheduler-Aware. Panels: dimacs-usa (low, even degree distribution), chunk sizes 100 to 10,000; uk-2007 (extremely skewed), chunk sizes 1,000 to 100,000. Callouts: 50×, 1.2×, 3.3×, “10× Different”]

  26. PageRank: Performance vs. Number of Cores [Figure: relative performance vs. number of physical cores (0 to 56) for Scheduler Un-Aware and Scheduler-Aware. Panels: dimacs-usa (low, even degree distribution); uk-2007 (extremely skewed). Callouts: “Scaling enabled by Scheduler Awareness”, “Scaling improved by Scheduler Awareness”]
      Key Insights
      • Huge improvement for extremely skewed graphs
      • Still beneficial for evenly-distributed low-degree graphs

  27. Vector-Sparse (Contribution #2)

  28. Compressed-Sparse
      Vertex Index: [0]=0 [1]=3 [2]=7 [3]=10
      Edges: [0]=23 [1]=10 [2]=50 [3]=4 [4]=0 [5]=53 [6]=62 [7]=1 [8]=78 [9]=50 [10]=23 [11]=4
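
The layout above can be read as follows: entry `v` of the vertex index gives the position in the edge array where the edges of vertex `v` begin. A minimal sketch using the values from the slide; the helper name `neighbors` is hypothetical, and the last vertex is assumed to run to the end of the edge array.

```python
VERTEX_INDEX = [0, 3, 7, 10]
EDGES = [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4]


def neighbors(vertex_index, edges, v):
    """Return the slice of the edge array that belongs to vertex v."""
    start = vertex_index[v]
    # Assumption: the final vertex's edges extend to the end of the array.
    end = vertex_index[v + 1] if v + 1 < len(vertex_index) else len(edges)
    return edges[start:end]


for v in range(len(VERTEX_INDEX)):
    print(v, neighbors(VERTEX_INDEX, EDGES, v))
```

Vertex 0 owns three edges and vertex 1 owns four, so a fixed-width vector load starting at position 0 would straddle the boundary between them.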

  29. Vectorizing Compressed-Sparse [Figure: edge array [0] to [11] holding 23 10 50 4 0 53 62 1 78 50 23 4, with the spans for Vertex 0 and Vertex 1 labelled]

  30. Vectorizing Compressed-Sparse [Figure: edge array [0] to [11] holding 23 10 50 4 0 53 62 1 78 50 23 4, with the spans for Vertex 0 and Vertex 1 labelled and a × mark between them]

  33. Vector-Sparse: padding (× marks a padding slot)
      Vertex 0, [0] to [3]: 23 10 50 ×
      Vertex 1, [4] to [7]: 4 0 53 62
      Vertex 2, [8] to [11]: 1 78 50 ×
      [12] to [13]: 23 4

  34. Vector-Sparse: padding + “valid” bits
      Vertex 0, [0] to [3]: edges 23 10 50 ×; valid 1 1 1 0
      Vertex 1, [4] to [7]: edges 4 0 53 62; valid 1 1 1 1
      Vertex 2, [8] to [11]: edges 1 78 50 ×; valid 1 1 1 0
      [12] to [13]: edges 23 4; valid 1 1

  35. Vector-Sparse: padding + “valid” bits + top-level vertex spread-encoding
      Top-level vertex 0, [0] to [3]: edges 23 10 50 ×; valid 1 1 1 0
      Top-level vertex 1, [4] to [7]: edges 4 0 53 62; valid 1 1 1 1
      Top-level vertex 2, [8] to [11]: edges 1 78 50 ×; valid 1 1 1 0
      [12] to [13]: edges 23 4; valid 1 1
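
A rough sketch of the conversion from Compressed-Sparse to a Vector-Sparse-style layout, using the slide's data. Several things here are assumptions made for illustration: four lanes per vector, a padding value of 0, and a plain tuple that stores the top-level vertex ID next to each vector instead of the bit-level spread-encoding. The names `to_vector_sparse` and `packing_efficiency` are hypothetical.

```python
VECTOR_WIDTH = 4  # assumed number of lanes per vector


def to_vector_sparse(vertex_index, edges, width=VECTOR_WIDTH):
    """Return a list of (top_level_vertex, lanes, valid_bits) vectors."""
    vectors = []
    for v, start in enumerate(vertex_index):
        end = vertex_index[v + 1] if v + 1 < len(vertex_index) else len(edges)
        for base in range(start, end, width):
            lanes = edges[base:min(base + width, end)]
            pad = width - len(lanes)
            # Pad so that no vector ever spans two top-level vertices.
            vectors.append((v, lanes + [0] * pad, [1] * len(lanes) + [0] * pad))
    return vectors


def packing_efficiency(vectors):
    """Fraction of lanes that hold a real edge rather than padding."""
    total = sum(len(valid) for _, _, valid in vectors)
    return sum(sum(valid) for _, _, valid in vectors) / total if total else 1.0


vectors = to_vector_sparse([0, 3, 7, 10],
                           [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4])
print(vectors)
print(packing_efficiency(vectors))
```

Every vector now belongs to exactly one vertex, so an aligned vector load never mixes edges from two vertices, and the valid bits tell the inner loop which lanes to ignore. The cost is the padding, which the packing efficiency measures.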

  36. Analyzing Vector-Sparse
      • Packing Efficiency [Figure: efficiency (0% to 100%) for dimacs-usa, livejournal, twitter-2010, friendster, uk-2007]: generally ≥ 75%
      • Performance Impact [Figure: speedup (axis 0.0 to 3.0) for PageRank, CC, BFS, and Average on the same graphs]: 1.5× to 2.5×

  37. Performance Comparison: Putting it all together

  38. Evaluation Scope
      • Grazelle is compared with Ligra, Polymer, GraphMat, and X-Stream
      • Three applications: PageRank, Connected Components, Breadth-First Search
      • Running on a machine equipped with four Intel Xeon E7-4850 v3 processors
        • 14 physical cores / 28 logical cores per socket

  39. PageRank: Peak Processing Throughput [Figure: execution time in ms (logarithmic axis, 1E+0 to 1E+5) on dimacs-usa, livejournal, twitter-2010, friendster, uk-2007. Series: Grazelle-Pull, Grazelle-Push, Ligra-Pull, Ligra-Push, Polymer, GraphMat, X-Stream. Callouts: 15.2×, 1.4×, 2.3×, 2.3×, 3.6×; one entry marked ×]

  40. Connected Components: Dynamic Control Flow [Figure: execution time in ms (logarithmic axis, 1E+0 to 1E+6) on dimacs-usa, livejournal, twitter-2010, friendster, uk-2007. Series: Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, X-Stream. Callouts: 4.9×, 21.1×, 1.5×, 1.6×; one entry marked ×]

  41. Breadth-First Search: Compatibility of Optimizations [Figure: execution time in ms (logarithmic axis, 1E+0 to 1E+6) on dimacs-usa, livejournal, twitter-2010, friendster, uk-2007. Series: Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, X-Stream; one entry marked ×]
