Cache-Guided Scheduling: Exploiting Caches to Maximize Locality in Graph Processing
Anurag Mukkara, Nathan Beckmann, Daniel Sanchez
1st AGP – Toronto, Ontario – 24 June 2017
Graph processing is memory-bound
- Irregular structure causes seemingly random memory references
- On-chip caches are too small to fit most real-world graphs
[Figure: PageRank energy per edge (nJ), split into main-memory and compute + caches energy, on a general-purpose system and a specialized accelerator, for the hollywood, wikipedia, liveJournal, indochina, webbase, and uk graph inputs]
- General-purpose system: 50% of system energy is due to main memory
- Specialized accelerator: the memory bottleneck becomes even more critical
Exploiting graph structure through caches
- Real-world graphs have strong community structure
  - Significant potential locality
  - Difficult to predict ahead of time
- Idea: Let the cache guide scheduling!
  - The cache has information about the right vertices to process next – those that cause the fewest misses
- This work: A limit study on the benefits of cache-guided scheduling (CGS)
  - CGS reduces misses by up to 6x
Impact of Scheduling on Locality
Many graph algorithms allow flexibility in the schedule
- Schedule: the order in which the vertices of the graph are processed
- Many important algorithms are unordered – the schedule does not affect correctness
  - E.g., PageRank, Collaborative Filtering, Label Propagation, Triangle Counting
- The schedule impacts locality significantly
Vertex-ordered schedule follows layout order
- Vertices are processed in the order of their ids
- All edges of a vertex are processed consecutively
[Figure: CSR edge list, with per-vertex offsets indexing into an array of edge destinations]
- Used by state-of-the-art graph processing frameworks (Ligra, GraphMat, etc.)
- Simplifies scheduling and parallelism
- Poor locality
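The CSR edge list and vertex-ordered traversal on this slide can be sketched as follows. This is a minimal Python illustration; the array names `offsets` and `destinations` mirror the figure, not any particular framework's API.

```python
def build_csr(num_vertices, edges):
    """Build compressed sparse row (CSR) arrays from an (src, dst) edge list."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):          # prefix sum: offsets[v] is where
        offsets[v + 1] += offsets[v]       # vertex v's edges start
    destinations = [0] * len(edges)
    cursor = offsets[:-1]                  # slicing copies, safe to mutate
    for src, dst in edges:
        destinations[cursor[src]] = dst
        cursor[src] += 1
    return offsets, destinations

def vertex_ordered_schedule(offsets, destinations):
    """Yield edges in vertex-id order: all edges of vertex 0, then vertex 1, ..."""
    for v in range(len(offsets) - 1):
        for e in range(offsets[v], offsets[v + 1]):
            yield v, destinations[e]
```

Edge-list accesses here are sequential (streaming), while the `destinations[e]` values jump around the vertex-data array, which is exactly the locality problem the following slides examine.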
Layout order might not match community structure
[Figure: in-memory vertex layout overlaid on the graph – consecutive vertices in the layout are spread out across the graph]
Access pattern of a vertex-ordered schedule
[Figure: edge list accesses are streaming (low cache misses), while vertex data accesses are random (high cache misses)]
Preprocessing changes the layout for a better order
[Figure: after preprocessing, the vertices in each cluster map to consecutive vertex ids]
Access pattern with a preprocessed graph
[Figure: edge list accesses remain streaming (low cache misses), and vertex data accesses now have good locality (low cache misses)]
Preprocessing is often impractical
[Figure: runtime of preprocessing normalized to one PageRank run – up to 36x across hol, wik, liv, ind, web, and uk; preprocessing scheme from Wei et al., "Speedup Graph Processing by Graph Ordering," SIGMOD'16]
- Preprocessing is more expensive than the algorithm itself
- Impractical for many important use cases
Cache-guided scheduling finds a good order at runtime
[Figure: edge list accesses become slightly irregular (moderate cache misses), while vertex data accesses have good locality (low cache misses)]
Cache-Guided Scheduling Design
High-level design
[Figure: cores and a shared last-level cache connected to main memory, with a cache engine alongside the cache]
- The cache engine receives event notifications on loads and stores, probes the cache to query its contents, and supplies tasks to the cores
- It maintains a list of tasks ranked based on a locality metric
Costs, benefits, and idealizations
- Extra memory accesses to the edge list
  - Filling the worklist with tasks
  - Keeping task scores up to date
- Space overheads of the worklist and auxiliary metadata
  - Take away some of the available cache capacity
- For this limit study, we ignore these costs
- Large reduction in memory accesses
  - Better energy efficiency and performance
Cache-Guided Scheduling of Vertices (CGS-V)
- Ranks and schedules each vertex of the graph
- Vertices are ranked by the fraction of their neighbors that are cached
[Figure: graph partitioned into cached and uncached vertices]
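The CGS-V ranking metric can be sketched as follows. This is a software model for illustration: the cache is represented as a plain set of vertex ids, whereas the actual design queries the hardware cache; function names are illustrative.

```python
def cgsv_score(vertex, neighbors, cached):
    """Fraction of a vertex's neighbors whose data is currently cached."""
    nbrs = neighbors[vertex]
    if not nbrs:
        return 0.0
    return sum(1 for n in nbrs if n in cached) / len(nbrs)

def pick_next_vertex(pending, neighbors, cached):
    """Schedule the pending vertex that will cause the fewest vertex-data misses."""
    return max(pending, key=lambda v: cgsv_score(v, neighbors, cached))
```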
Cache-Guided Scheduling of Vertices (CGS-V)
- Large locality benefits
- Tracks vertices only (not edges)
- Pitfall: real-world graphs have skewed degree distributions
  - Many high-degree vertices that are connected to most of the graph
- Processing high-degree vertices
  - Flushes the cache and kills locality
  - Misses opportunities to process other beneficial regions
Cache-Guided Scheduling of Edges (CGS-E)
- Ranks and schedules edges instead of vertices
- Better locality due to finer-grained scheduling
- Each edge causes exactly two cache accesses
  - Simpler ranking metric: the number of endpoints that are cached
- #Edges >> #Vertices → higher tracking overheads
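Because each edge touches exactly two vertex objects, the CGS-E score is just the number of endpoints currently cached (0, 1, or 2); edges scoring 2 incur no vertex-data misses at all. A minimal sketch under the same set-based cache model as above:

```python
def cgse_score(edge, cached):
    """Number of the edge's endpoints whose vertex data is cached: 0, 1, or 2."""
    src, dst = edge
    return (src in cached) + (dst in cached)

def pick_next_edge(pending_edges, cached):
    """Schedule the edge with the most cached endpoints."""
    return max(pending_edges, key=lambda e: cgse_score(e, cached))
```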
Limit Study on Benefits of CGS
Methodology
- Large real-world graphs with up to ~100 million vertices and ~1 billion edges

  Graph                hol   wik   liv   ind    uk   web   nfl   yms
  Vertices (Millions)  1.1   3.5   4.8   7.4    19   118   0.5   0.5
  Edges (Millions)     113    45    69   194   298  1020   100    61

- Graph algorithms
  - PageRank – 16-byte vertex objects
  - Collaborative Filtering – 256-byte vertex objects
- Custom cache simulator to compute main-memory accesses
  - Single-core system
  - 2-level cache hierarchy with a 32KB L1 and an 8MB L2
- See the paper for details
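A toy version of the kind of cache model used to count main-memory accesses is sketched below. The sizes match the slide (32KB L1, 8MB L2); the 64-byte line size, full associativity, LRU replacement, and inclusive hierarchy are simplifying assumptions of this sketch, not details from the paper.

```python
from collections import OrderedDict

class LruCache:
    """Fully associative LRU cache of fixed-size lines (a simplification)."""
    def __init__(self, size_bytes, line_bytes=64):
        self.capacity = size_bytes // line_bytes
        self.line_bytes = line_bytes
        self.lines = OrderedDict()  # line address -> None, in LRU order

    def access(self, addr):
        """Return True on hit; update LRU state and fill the line on miss."""
        line = addr // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)   # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[line] = None
        return False

def count_memory_accesses(trace, l1_bytes=32 * 1024, l2_bytes=8 * 1024 * 1024):
    """Count accesses in a byte-address trace that miss both levels."""
    l1, l2 = LruCache(l1_bytes), LruCache(l2_bytes)
    misses = 0
    for addr in trace:
        if not l1.access(addr) and not l2.access(addr):
            misses += 1  # this access goes to main memory
    return misses
```

Feeding such a simulator the address traces of the vertex-ordered schedule versus CGS-V/CGS-E is what produces the reductions reported on the next slides.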
Large reduction in memory accesses for PageRank
[Figure: main-memory accesses of CGS-V and CGS-E, normalized to the vertex-ordered schedule, on hol, wik, liv, ind, web, and uk]
- CGS-V reduces memory accesses by 2.4x (gmean)
- CGS-E reduces memory accesses by 4.6x (gmean)
Much larger benefits with Collaborative Filtering
[Figure: main-memory accesses of CGS-V and CGS-E, normalized to the vertex-ordered schedule, on nfl and yms]
- CGS-V reduces memory accesses by 1.5x (gmean); CGS-E by 12x (gmean)
- Larger vertex data – 256 bytes per vertex
  - Edge list accesses become negligible (only 3%)
  - The finer-granularity scheduling of CGS-E becomes more important
CGS benefits from better graph layout
[Figure: main-memory accesses of preprocessing alone, CGS-V, CGS-V + preprocessing, CGS-E, and CGS-E + preprocessing on hol, wik, liv, ind, web, uk, and their gmean – combining CGS with preprocessing reduces accesses beyond either technique alone]
Ongoing Work: CGS Hardware Implementation
Reducing storage overheads
- Maintaining all vertices in the worklist is prohibitively expensive
- Can a small worklist capture most of the benefits?
  - The order in which the worklist is filled is crucial
- Adding vertices in order of their ids is bad
  - Explores multiple disjoint regions of the graph simultaneously
- Insight: explore the graph in depth-first fashion to fill the worklist
  - A 100-element worklist gives 50% of the benefits of CGS-E
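The depth-first worklist-filling insight above can be sketched as follows: a bounded worklist is filled by DFS from a seed vertex, so it holds one connected neighborhood rather than many disjoint regions. A minimal sketch; `neighbors` maps vertex id to adjacency list, and all names are illustrative.

```python
def fill_worklist_dfs(seed, neighbors, capacity):
    """Fill a bounded worklist by depth-first exploration from a seed vertex."""
    worklist, visited = [], set()
    stack = [seed]
    while stack and len(worklist) < capacity:
        v = stack.pop()
        if v in visited:
            continue
        visited.add(v)
        worklist.append(v)
        # Pushing v's neighbors keeps exploration within one graph region.
        stack.extend(n for n in neighbors[v] if n not in visited)
    return worklist
```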
Reducing processing overheads
- Processing each edge takes only a few instructions
  - E.g., PageRank: one floating-point addition per edge
  - Task scheduling logic must be cheap
- CGS-E gives much better locality than CGS-V, but has higher overheads
- Practical middle ground: each task processes a cache line of edges
  - Minimizes the loss of spatial locality in edge list accesses
  - Sidesteps the issue of high-degree vertices
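The cache-line-granularity middle ground can be sketched by partitioning the CSR edge array into line-sized tasks. The 64-byte line and 4-byte destination ids are assumptions of this sketch (giving 16 edges per task); a high-degree vertex then decomposes into many independent tasks instead of one cache-flushing task.

```python
LINE_BYTES = 64
EDGE_BYTES = 4  # assumed 32-bit destination ids
EDGES_PER_LINE = LINE_BYTES // EDGE_BYTES

def edge_line_tasks(num_edges):
    """Partition edge indices [0, num_edges) into cache-line-sized tasks."""
    return [
        range(start, min(start + EDGES_PER_LINE, num_edges))
        for start in range(0, num_edges, EDGES_PER_LINE)
    ]
```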
Conclusion
- Real-world graphs have abundant locality, but it is hard to predict
- The cache has rich information about which regions are best to process
- Cache-Guided Scheduling gives a large reduction in memory accesses
Thanks For Your Attention! Questions Are Welcome!