  1. CACHE-GUIDED SCHEDULING: EXPLOITING CACHES TO MAXIMIZE LOCALITY IN GRAPH PROCESSING. Anurag Mukkara, Nathan Beckmann, Daniel Sanchez. 1st AGP, Toronto, Ontario, 24 June 2017

  2. Graph processing is memory-bound
- Irregular structure causes seemingly random memory references
- On-chip caches are too small to fit most real-world graphs
[Figure: PageRank energy per edge (nJ), split between Main Memory and Compute + Caches, on the hollywood, wikipedia, liveJournal, indochina, webbase, and uk inputs. On a general-purpose system, 50% of system energy is due to main memory; on a specialized accelerator, the memory bottleneck becomes even more critical.]

  3. Exploiting graph structure through caches
- Real-world graphs have strong community structure
  - Significant potential locality
  - Difficult to predict ahead of time
- Idea: let the cache guide scheduling!
  - The cache has information about the right vertices to process next: those which cause the fewest misses
- This work: a limit study on the benefits of cache-guided scheduling (CGS)
  - CGS reduces misses by up to 6x

  4. Impact of Scheduling on Locality

  5. Many graph algorithms allow flexibility in schedule
- Schedule: the order in which vertices of the graph are processed
- Many important algorithms are unordered: the schedule does not affect correctness
  - Ex.: PageRank, Collaborative Filtering, Label Propagation, Triangle Counting
- The schedule impacts locality significantly
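
The "unordered" property can be seen in a minimal pull-based PageRank iteration (a sketch with illustrative names, not the paper's code): each iteration reads only ranks from the previous iteration, so the vertex loop may visit vertices in any order without changing the result.

```python
def pagerank_iteration(in_neighbors, ranks, out_degree, damping=0.85):
    """One pull-based PageRank iteration; the vertex loop order is arbitrary."""
    n = len(ranks)
    new_ranks = [0.0] * n
    for v in range(n):  # any permutation of vertices gives the same new_ranks
        total = sum(ranks[u] / out_degree[u] for u in in_neighbors[v])
        new_ranks[v] = (1 - damping) / n + damping * total
    return new_ranks
```

Because `new_ranks` never feeds back into the same iteration, a scheduler is free to reorder this loop for locality.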

  6. Vertex-ordered schedule follows layout order
- Vertices are processed in the order of their id
- All edges of a vertex are processed consecutively
- Used by state-of-the-art graph processing frameworks
  - Ligra, GraphMat, etc.
- Simplifies scheduling and parallelism
- Poor locality
[Figure: CSR edge list with Offsets and Destinations arrays]
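
A minimal sketch of the vertex-ordered schedule over a CSR (offsets + destinations) edge list, as used by frameworks like Ligra and GraphMat (function and parameter names are ours):

```python
def vertex_ordered_traversal(offsets, destinations, process_edge):
    """Visit vertices in id order; all edges of a vertex are consecutive."""
    num_vertices = len(offsets) - 1
    for v in range(num_vertices):                # layout order: vertex id
        for e in range(offsets[v], offsets[v + 1]):
            process_edge(v, destinations[e])     # streams through the edge list
```

The edge list is streamed sequentially, but `destinations[e]` indexes vertex data in effectively random order, which is where the misses come from.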

  7. Layout order might not match community structure
[Figure: in-memory vertex layout; consecutive vertices in the layout are spread out across the graph]

  8. Access pattern of vertex-ordered schedule
[Figure: edge list accesses are streaming and cause low cache misses; vertex data accesses are random and cause high cache misses]

  9. Preprocessing changes layout for better order
[Figure: after preprocessing, vertices in each cluster map to consecutive vertex ids]

  10. Access pattern with preprocessed graph
[Figure: edge list accesses remain streaming and vertex data accesses now have good locality; both cause low cache misses]

  11. Preprocessing is often impractical
- Preprocessing is more expensive than the algorithm itself
- Impractical for many important use cases
[Figure: preprocessing runtime normalized to PageRank runtime on hol, wik, liv, ind, web, and uk; preprocessing costs 5x to 36x the algorithm itself]
Wei et al., "Speedup Graph Processing by Graph Ordering", SIGMOD'16

  12. Cache-guided scheduling finds good order at runtime
[Figure: edge list accesses become slightly irregular, causing moderate cache misses; vertex data accesses have good locality, causing low misses]

  13. Cache-Guided Scheduling Design

  14. High-level design
[Figure: cores issue loads and stores to a shared last-level cache backed by main memory; a Cache Engine probes the cache, queries its contents, receives event notifications, and supplies tasks to the cores]
- The Cache Engine maintains a list of tasks ranked based on a locality metric

  15. Costs, benefits, and idealizations
- Extra memory accesses to the edge list
  - Filling the worklist with tasks
  - Keeping task scores up to date
- Space overheads of the worklist and auxiliary metadata
  - Takes away some of the available cache capacity
- For this limit study, we ignore these costs
- Large reduction in memory accesses
  - Better energy efficiency and performance

  16. Cache-Guided Scheduling of Vertices (CGS-V)
- Ranks and schedules each vertex of the graph
- Vertices are ranked by the fraction of their neighbors that are cached
[Figure: cached vs. uncached vertices of an example graph]
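
A hypothetical sketch of the CGS-V ranking metric (names and the worklist interface are our assumptions; the paper's engine works on cache state, not a Python set): score each candidate vertex by the fraction of its neighbors currently cached, and schedule the highest-scoring one.

```python
def cgsv_rank(vertex, neighbors, cached):
    """Fraction of this vertex's neighbors whose data is currently cached."""
    nbrs = neighbors[vertex]
    if not nbrs:
        return 0.0
    return sum(1 for u in nbrs if u in cached) / len(nbrs)

def pick_next_vertex(worklist, neighbors, cached):
    # Schedule the vertex expected to cause the fewest misses.
    return max(worklist, key=lambda v: cgsv_rank(v, neighbors, cached))
```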

  17. Cache-Guided Scheduling of Vertices (CGS-V)
- Large locality benefits
- Tracks vertices only (not edges)
- Pitfall: real-world graphs have skewed degree distributions
  - Many high-degree vertices that are connected to most of the graph
- Processing a high-degree vertex
  - Flushes the cache and kills locality
  - Misses opportunities to process other beneficial regions

  18. Cache-Guided Scheduling of Edges (CGS-E)
- Ranks and schedules edges instead of vertices
- Better locality due to finer-grained scheduling
- Each edge causes exactly two cache accesses
  - Simpler ranking algorithm: the number of endpoints that are cached
- #Edges >> #Vertices → higher tracking overheads
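
Since each edge touches exactly two vertex objects, the CGS-E score reduces to counting cached endpoints; a sketch (names are illustrative):

```python
def cgse_rank(edge, cached):
    """Score an edge by how many of its two endpoints are cached: 0, 1, or 2."""
    src, dst = edge
    return (src in cached) + (dst in cached)
```

Edges with score 2 hit entirely in the cache, which is why scheduling them first reduces memory accesses.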

  19. Limit Study on Benefits of CGS

  20. Methodology
- Large real-world graphs with up to 100 million vertices and 1 billion edges

  Graph                 hol   wik   liv   ind   uk    web    nfl   yms
  Vertices (Millions)   1.1   3.5   4.8   7.4   19    118    0.5   0.5
  Edges (Millions)      113   45    69    194   298   1020   100   61

- Graph algorithms
  - PageRank: 16-byte vertex objects
  - Collaborative Filtering: 256-byte vertex objects
- Custom cache simulator to compute main-memory accesses
  - Single-core system
  - 2-level cache hierarchy with a 32KB L1 and an 8MB L2
- See paper for details
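
The flavor of such a simulator can be shown with a minimal single-level LRU model (a simplification of the paper's 2-level 32KB L1 / 8MB L2 hierarchy; the 64-byte line size and all names are our assumptions). Each miss corresponds to one main-memory access:

```python
from collections import OrderedDict

class LRUCache:
    """Single-level, fully associative LRU cache that counts misses."""
    def __init__(self, size_bytes, line_bytes=64):
        self.lines = OrderedDict()
        self.capacity = size_bytes // line_bytes
        self.line_bytes = line_bytes
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)        # hit: refresh LRU position
        else:
            self.misses += 1                    # miss: one main-memory access
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least-recently-used line
```

Feeding the address stream of a schedule through such a model is enough to compare schedules by miss count.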

  21. Large reduction in memory accesses for PageRank
[Figure: main-memory accesses of CGS-V and CGS-E, normalized to the vertex-ordered schedule, on hol, wik, liv, ind, web, and uk]
- CGS-V: 2.4x gmean reduction in memory accesses
- CGS-E: 4.6x gmean reduction in memory accesses

  22. Much larger benefits with Collaborative Filtering
[Figure: main-memory accesses of CGS-V and CGS-E, normalized to the vertex-ordered schedule, on nfl and yms]
- CGS-V: 1.5x gmean reduction; CGS-E: 12x gmean reduction
- Larger vertex data (256 bytes per vertex)
  - Edge list accesses are negligible (only 3%)
  - The finer-granularity scheduling of CGS-E becomes more important

  23. CGS benefits from better graph layout
[Figure: main-memory accesses of Preprocessing, CGS-V, CGS-V + Preprocessing, CGS-E, and CGS-E + Preprocessing on hol, wik, liv, ind, web, uk, and their gmean]

  24. Ongoing Work: CGS Hardware Implementation

  25. Reducing storage overheads
- Maintaining all vertices in the worklist is prohibitively expensive
- Can a small worklist capture most of the benefits?
  - The order in which the worklist is filled is crucial
- Adding vertices in order of their id is bad
  - Explores multiple disjoint regions of the graph simultaneously
- Insight: explore the graph in depth-first fashion to fill the worklist
  - A 100-element worklist gives 50% of the benefits of CGS-E

  26. Reducing processing overheads
- Processing each edge takes only a few instructions
  - Ex.: PageRank needs one floating-point addition per edge
  - Task scheduling logic must be cheap
- CGS-E gives much better locality than CGS-V, but has higher overheads
- Practical middle ground: each task processes a cache line of edges
  - Minimizes loss of spatial locality in edge list accesses
  - Sidesteps the issue of high-degree nodes
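
The middle ground above amounts to grouping consecutive edges into tasks of one cache line each; a sketch (the 64-byte line and 8-byte edge sizes, giving 8 edges per task, are our assumptions):

```python
def edges_per_task(line_bytes=64, edge_bytes=8):
    """How many edges fit in one cache line of the edge list."""
    return line_bytes // edge_bytes

def make_tasks(edge_list):
    """Split the edge list into cache-line-sized tasks, preserving layout order."""
    k = edges_per_task()
    return [edge_list[i:i + k] for i in range(0, len(edge_list), k)]
```

Each task streams one line of the edge list, so scheduling tasks instead of single edges does not sacrifice spatial locality there.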

  27. Conclusion
- Real-world graphs have abundant locality, but it is hard to predict
- The cache has rich information about which regions are best to process
- Cache-Guided Scheduling gives a large reduction in memory accesses

  28. THANKS FOR YOUR ATTENTION! QUESTIONS ARE WELCOME!
