Cache-Guided Scheduling: Exploiting Caches to Maximize Locality in Graph Processing
Anurag Mukkara, Nathan Beckmann, Daniel Sanchez
1st AGP – Toronto, Ontario – 24 June 2017
Graph processing is memory-bound
- Irregular structure causes seemingly random memory references
- On-chip caches are too small to fit most real-world graphs
[Figure: PageRank energy per edge (nJ), split into main-memory and compute + caches energy, on a general-purpose system and a specialized accelerator, for the hollywood, wikipedia, liveJournal, indochina, webbase, and uk graph inputs]
- General-purpose system: 50% of system energy is due to main memory
- Specialized accelerator: the memory bottleneck becomes even more critical
Exploiting graph structure through caches
- Real-world graphs have strong community structure
  - Significant potential locality
  - Difficult to predict ahead of time
- Idea: Let the cache guide scheduling!
  - The cache has information about the right vertices to process next – those that cause the fewest misses
- This work: A limit study on the benefits of cache-guided scheduling (CGS)
  - CGS reduces misses by up to 6x
Impact of Scheduling on Locality
Many graph algorithms allow flexibility in the schedule
- Schedule: the order in which the vertices of the graph are processed
- Many important algorithms are unordered – the schedule does not affect correctness
  - E.g., PageRank, Collaborative Filtering, Label Propagation, Triangle Counting
- The schedule impacts locality significantly
Vertex-ordered schedule follows layout order
- Vertices are processed in the order of their ids
- All edges of a vertex are processed consecutively
[Figure: CSR edge list, with per-vertex offsets indexing into an array of edge destinations]
- Used by state-of-the-art graph processing frameworks (Ligra, GraphMat, etc.)
- Simplifies scheduling and parallelism
- Poor locality
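The CSR edge list and vertex-ordered traversal on this slide can be sketched as follows. This is a minimal Python illustration; the array names `offsets` and `destinations` mirror the figure, not any particular framework's API.

```python
def build_csr(num_vertices, edges):
    """Build compressed sparse row (CSR) arrays from an (src, dst) edge list."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):          # prefix sum: offsets[v] is where
        offsets[v + 1] += offsets[v]       # vertex v's edges start
    destinations = [0] * len(edges)
    cursor = offsets[:-1]                  # slicing copies, safe to mutate
    for src, dst in edges:
        destinations[cursor[src]] = dst
        cursor[src] += 1
    return offsets, destinations

def vertex_ordered_schedule(offsets, destinations):
    """Yield edges in vertex-id order: all edges of vertex 0, then vertex 1, ..."""
    for v in range(len(offsets) - 1):
        for e in range(offsets[v], offsets[v + 1]):
            yield v, destinations[e]
```

Edge-list accesses here are sequential (streaming), while the `destinations[e]` values jump around the vertex-data array, which is exactly the locality problem the following slides examine.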
Layout order might not match community structure
[Figure: in-memory vertex layout overlaid on the graph – consecutive vertices in the layout are spread out across the graph]
Access pattern of a vertex-ordered schedule
[Figure: edge list accesses are streaming (low cache misses), while vertex data accesses are random (high cache misses)]
Preprocessing changes the layout for a better order
[Figure: after preprocessing, the vertices in each cluster map to consecutive vertex ids]
Access pattern with a preprocessed graph
[Figure: edge list accesses remain streaming (low cache misses), and vertex data accesses now have good locality (low cache misses)]
Preprocessing is often impractical
[Figure: runtime of preprocessing normalized to one PageRank run – up to 36x across hol, wik, liv, ind, web, and uk; preprocessing scheme from Wei et al., "Speedup Graph Processing by Graph Ordering," SIGMOD'16]
- Preprocessing is more expensive than the algorithm itself
- Impractical for many important use cases
Cache-guided scheduling finds a good order at runtime
[Figure: edge list accesses become slightly irregular (moderate cache misses), while vertex data accesses have good locality (low cache misses)]
Cache-Guided Scheduling Design
High-level design
[Figure: cores and a shared last-level cache connected to main memory, with a cache engine alongside the cache]
- The cache engine receives event notifications on loads and stores, probes the cache to query its contents, and supplies tasks to the cores
- It maintains a list of tasks ranked based on a locality metric
Costs, benefits, and idealizations
- Extra memory accesses to the edge list
  - Filling the worklist with tasks
  - Keeping task scores up to date
- Space overheads of the worklist and auxiliary metadata
  - Take away some of the available cache capacity
- For this limit study, we ignore these costs
- Large reduction in memory accesses
  - Better energy efficiency and performance
Cache-Guided Scheduling of Vertices (CGS-V)
- Ranks and schedules each vertex of the graph
- Vertices are ranked by the fraction of their neighbors that are cached
[Figure: graph partitioned into cached and uncached vertices]
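The CGS-V ranking metric can be sketched as follows. This is a software model for illustration: the cache is represented as a plain set of vertex ids, whereas the actual design queries the hardware cache; function names are illustrative.

```python
def cgsv_score(vertex, neighbors, cached):
    """Fraction of a vertex's neighbors whose data is currently cached."""
    nbrs = neighbors[vertex]
    if not nbrs:
        return 0.0
    return sum(1 for n in nbrs if n in cached) / len(nbrs)

def pick_next_vertex(pending, neighbors, cached):
    """Schedule the pending vertex that will cause the fewest vertex-data misses."""
    return max(pending, key=lambda v: cgsv_score(v, neighbors, cached))
```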
Cache-Guided Scheduling of Vertices (CGS-V)
- Large locality benefits
- Tracks vertices only (not edges)
- Pitfall: real-world graphs have skewed degree distributions
  - Many high-degree vertices that are connected to most of the graph
- Processing high-degree vertices
  - Flushes the cache and kills locality
  - Misses opportunities to process other beneficial regions
Cache-Guided Scheduling of Edges (CGS-E)
- Ranks and schedules edges instead of vertices
- Better locality due to finer-grained scheduling
- Each edge causes exactly two cache accesses
  - Simpler ranking metric: the number of endpoints that are cached
- #Edges >> #Vertices → higher tracking overheads
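Because each edge touches exactly two vertex objects, the CGS-E score is just the number of endpoints currently cached (0, 1, or 2); edges scoring 2 incur no vertex-data misses at all. A minimal sketch under the same set-based cache model as above:

```python
def cgse_score(edge, cached):
    """Number of the edge's endpoints whose vertex data is cached: 0, 1, or 2."""
    src, dst = edge
    return (src in cached) + (dst in cached)

def pick_next_edge(pending_edges, cached):
    """Schedule the edge with the most cached endpoints."""
    return max(pending_edges, key=lambda e: cgse_score(e, cached))
```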
Limit Study on Benefits of CGS
Methodology
- Large real-world graphs with up to ~100 million vertices and ~1 billion edges

  Graph                hol   wik   liv   ind    uk   web   nfl   yms
  Vertices (Millions)  1.1   3.5   4.8   7.4    19   118   0.5   0.5
  Edges (Millions)     113    45    69   194   298  1020   100    61

- Graph algorithms
  - PageRank – 16-byte vertex objects
  - Collaborative Filtering – 256-byte vertex objects
- Custom cache simulator to compute main-memory accesses
  - Single-core system
  - 2-level cache hierarchy with a 32KB L1 and an 8MB L2
- See the paper for details
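A toy version of the kind of cache model used to count main-memory accesses is sketched below. The sizes match the slide (32KB L1, 8MB L2); the 64-byte line size, full associativity, LRU replacement, and inclusive hierarchy are simplifying assumptions of this sketch, not details from the paper.

```python
from collections import OrderedDict

class LruCache:
    """Fully associative LRU cache of fixed-size lines (a simplification)."""
    def __init__(self, size_bytes, line_bytes=64):
        self.capacity = size_bytes // line_bytes
        self.line_bytes = line_bytes
        self.lines = OrderedDict()  # line address -> None, in LRU order

    def access(self, addr):
        """Return True on hit; update LRU state and fill the line on miss."""
        line = addr // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)   # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[line] = None
        return False

def count_memory_accesses(trace, l1_bytes=32 * 1024, l2_bytes=8 * 1024 * 1024):
    """Count accesses in a byte-address trace that miss both levels."""
    l1, l2 = LruCache(l1_bytes), LruCache(l2_bytes)
    misses = 0
    for addr in trace:
        if not l1.access(addr) and not l2.access(addr):
            misses += 1  # this access goes to main memory
    return misses
```

Feeding such a simulator the address traces of the vertex-ordered schedule versus CGS-V/CGS-E is what produces the reductions reported on the next slides.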
Large reduction in memory accesses for PageRank
[Figure: main-memory accesses of CGS-V and CGS-E, normalized to the vertex-ordered schedule, on hol, wik, liv, ind, web, and uk]
- CGS-V reduces memory accesses by 2.4x (gmean)
- CGS-E reduces memory accesses by 4.6x (gmean)
Much larger benefits with Collaborative Filtering
[Figure: main-memory accesses of CGS-V and CGS-E, normalized to the vertex-ordered schedule, on nfl and yms]
- CGS-V reduces memory accesses by 1.5x (gmean); CGS-E by 12x (gmean)
- Larger vertex data – 256 bytes per vertex
  - Edge list accesses become negligible (only 3%)
  - The finer-granularity scheduling of CGS-E becomes more important
CGS benefits from better graph layout
[Figure: main-memory accesses of preprocessing alone, CGS-V, CGS-V + preprocessing, CGS-E, and CGS-E + preprocessing on hol, wik, liv, ind, web, uk, and their gmean – combining CGS with preprocessing reduces accesses beyond either technique alone]
Ongoing Work: CGS Hardware Implementation
Reducing storage overheads
- Maintaining all vertices in the worklist is prohibitively expensive
- Can a small worklist capture most of the benefits?
  - The order in which the worklist is filled is crucial
- Adding vertices in order of their ids is bad
  - Explores multiple disjoint regions of the graph simultaneously
- Insight: explore the graph in depth-first fashion to fill the worklist
  - A 100-element worklist gives 50% of the benefits of CGS-E
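The depth-first worklist-filling insight above can be sketched as follows: a bounded worklist is filled by DFS from a seed vertex, so it holds one connected neighborhood rather than many disjoint regions. A minimal sketch; `neighbors` maps vertex id to adjacency list, and all names are illustrative.

```python
def fill_worklist_dfs(seed, neighbors, capacity):
    """Fill a bounded worklist by depth-first exploration from a seed vertex."""
    worklist, visited = [], set()
    stack = [seed]
    while stack and len(worklist) < capacity:
        v = stack.pop()
        if v in visited:
            continue
        visited.add(v)
        worklist.append(v)
        # Pushing v's neighbors keeps exploration within one graph region.
        stack.extend(n for n in neighbors[v] if n not in visited)
    return worklist
```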
Reducing processing overheads
- Processing each edge takes only a few instructions
  - E.g., PageRank: one floating-point addition per edge
  - Task scheduling logic must be cheap
- CGS-E gives much better locality than CGS-V, but has higher overheads
- Practical middle ground: each task processes a cache line of edges
  - Minimizes the loss of spatial locality in edge list accesses
  - Sidesteps the issue of high-degree vertices
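The cache-line-granularity middle ground can be sketched by partitioning the CSR edge array into line-sized tasks. The 64-byte line and 4-byte destination ids are assumptions of this sketch (giving 16 edges per task); a high-degree vertex then decomposes into many independent tasks instead of one cache-flushing task.

```python
LINE_BYTES = 64
EDGE_BYTES = 4  # assumed 32-bit destination ids
EDGES_PER_LINE = LINE_BYTES // EDGE_BYTES

def edge_line_tasks(num_edges):
    """Partition edge indices [0, num_edges) into cache-line-sized tasks."""
    return [
        range(start, min(start + EDGES_PER_LINE, num_edges))
        for start in range(0, num_edges, EDGES_PER_LINE)
    ]
```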
Conclusion
- Real-world graphs have abundant locality, but it is hard to predict
- The cache has rich information about which regions are best to process
- Cache-Guided Scheduling gives a large reduction in memory accesses
Thanks For Your Attention! Questions Are Welcome!