
Making Pull-Based Graph Processing Performant
Samuel Grossman (1), Heiner Litz (2), and Christos Kozyrakis (1)
(1) Stanford University, (2) University of California, Santa Cruz
PPoPP 2018 · February 27, 2018

Graph Processing
• Problem modelled as objects (vertices) and connections between them (edges)
• Examples:
  • Internet (pages and hyperlinks)
  • Social network (people and friendships)
  • Roads and intersections
  • Products and ratings

Graph Processing
[Figure: example graph with vertices labelled A through L.]

Graph Processing
[Figure: the same graph with updated vertex values A’ through L’.]
Repeat until convergence

Graph Processing
• Push: group by source vertex
• Pull: group by destination vertex
• Hybrid: dynamically select push or pull for each iteration

Graph Processing
foreach vertex v in graph.vertices
    foreach edge e in v.(in|out)edges
        // process the edge
        ...
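The two-level loop can be made concrete with a small sketch. The toy graph, the values, and the function names `build_adjacency`, `push_step`, and `pull_step` below are illustrative assumptions, not part of the talk; the point is only to show where writes land in each style.

```python
# Hypothetical toy graph: (source, destination) edge pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
values = [1, 10, 100]


def build_adjacency(edges, num_vertices):
    """Group edges by source (out_edges) and by destination (in_edges)."""
    out_edges = [[] for _ in range(num_vertices)]
    in_edges = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        out_edges[src].append(dst)
        in_edges[dst].append(src)
    return out_edges, in_edges


def push_step(out_edges, values):
    """Push: iterate by source; each vertex scatters its value to neighbors."""
    result = [0] * len(values)
    for v, neighbors in enumerate(out_edges):
        for dst in neighbors:
            result[dst] += values[v]  # write lands on another vertex
    return result


def pull_step(in_edges, values):
    """Pull: iterate by destination; each vertex gathers from its neighbors."""
    result = [0] * len(values)
    for v, neighbors in enumerate(in_edges):
        acc = 0
        for src in neighbors:
            acc += values[src]  # read from another vertex
        result[v] = acc  # single write, to the vertex being processed
    return result


out_edges, in_edges = build_adjacency(edges, len(values))
print(push_step(out_edges, values))
print(pull_step(in_edges, values))
```

Both functions compute the same sums. In the push version many sources may write to the same destination, while in the pull version each destination is written only by the loop iteration that owns it.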

Parallelizing Graph Processing
• Outer loop parallelization
  • Between cores: assign entire vertices to threads
• Inner loop parallelization
  • Between cores: subdivide the edges within each vertex
  • Within one core: vectorize the loop

Parallelizing Graph Processing
[Figure: bar chart of speedup for PageRank and Breadth-First Search, running Ligra on the twitter-2010 graph. Series: Push (Outer Loop); Push (Both Loops); Push (Both) + Pull (Outer); Push (Both) + Pull (Both).]

Pull’s Performance Challenge
• Serial inner loop
  • One thread per vertex
  • Updates are thread-private
• Parallel inner loop
  • Multiple threads per vertex
  • Updates will conflict
Contribution #1: “Scheduler Awareness”. A technique that can be applied to the inner loop of a pull engine to parallelize it without introducing conflicts.

Pull’s Performance Opportunity
• Further gains possible using SIMD vectorization
  • Improve parallelism of the computation
  • Improve memory bandwidth utilization
• Data structure layout issues impede effective vectorization in existing work
Contribution #2: “Vector-Sparse”. A low-level modification to a data structure commonly used to represent graphs, intended to enhance vectorization.

Grazelle
• A hybrid graph processing framework that embodies both of our contributions
• Outperforms the state-of-the-art by over 10× in some cases
• Available for download at https://github.com/stanford-mast/Grazelle-PPoPP18

Scheduler Awareness
Contribution #1

Serial Inner Loop
[Figure: vertex data for vertices 0, 1, and 2 (outer loop) above the edge array (inner loop), in which vertex 0 has edges 0 to 7, vertex 1 has edges 0 to 2, and vertex 2 has edges 0 to 4. The work is divided into Chunk A, Chunk B, and Chunk C.]

Scheduler Un-Awareness
[Figure: the same vertex data and edge array, divided into Chunk A, Chunk B, and Chunk C.]

Scheduler Awareness
[Figure: the same vertex data and edge array, divided into Chunk A, Chunk B, and Chunk C, with merge buffers added alongside the vertex data.]
1. Writes at the end of a chunk are redirected to a private per-chunk merge buffer.
2. Writes in the middle of a chunk can be committed to shared state without synchronization.
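The two rules can be sketched as follows. This is a sequential simulation under stated assumptions, not Grazelle's implementation: the edge array is assumed to be sorted by destination, each chunk stands in for the work one thread would receive from the scheduler, and the names `pull_sum_naive` and `pull_sum_scheduler_aware` are hypothetical. The in-degrees 8, 3, and 5 follow the figure; the source IDs and values are made up.

```python
def pull_sum_naive(edge_dst, edge_src, values, num_vertices):
    """Reference result: sum the source values of every in-edge."""
    result = [0] * num_vertices
    for dst, src in zip(edge_dst, edge_src):
        result[dst] += values[src]
    return result


def pull_sum_scheduler_aware(edge_dst, edge_src, values, num_vertices,
                             chunk_size):
    """Process fixed-size chunks of a destination-sorted edge array.

    Each chunk accumulates privately. When the destination changes in the
    middle of a chunk, the finished sum is stored directly in shared state:
    no other chunk writes that vertex directly. The sum still pending at the
    end of the chunk goes to that chunk's merge buffer instead.
    """
    result = [0] * num_vertices   # shared state
    merge_buffers = []            # one (destination, partial sum) per chunk

    for start in range(0, len(edge_dst), chunk_size):
        end = min(start + chunk_size, len(edge_dst))
        current = edge_dst[start]
        acc = 0
        for i in range(start, end):
            if edge_dst[i] != current:
                result[current] = acc       # rule 2: mid-chunk write
                current = edge_dst[i]
                acc = 0
            acc += values[edge_src[i]]
        merge_buffers.append((current, acc))  # rule 1: end-of-chunk write

    # Merge pass, run after all chunks have finished.
    for dst, partial in merge_buffers:
        result[dst] += partial
    return result


# In-degrees 8, 3, and 5, as in the figure; sources and values are made up.
edge_dst = [0] * 8 + [1] * 3 + [2] * 5
edge_src = [i % 3 for i in range(len(edge_dst))]
values = [1, 10, 100]
print(pull_sum_scheduler_aware(edge_dst, edge_src, values, 3, chunk_size=6))
```

A smaller `chunk_size` produces more merge-buffer entries and therefore more work in the merge pass, which is the overhead the analysis slide attributes to finer scheduling granularity.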

Analyzing Scheduler Awareness
• Performance impact depends primarily on the scheduling granularity
• Scheduler Un-Awareness: trade-off between load balance and probability of write conflicts
• Scheduler Awareness: finer granularity leads to increased merge operation overhead

PageRank: Performance vs. Scheduling Granularity
[Figure: two panels plotting relative execution time against chunk size for Scheduler Un-Aware and Scheduler-Aware, one for dimacs-usa (low, even degree distribution) and one for uk-2007 (extremely skewed). Callouts on the plots: 50×, 1.2×, 3.3×, and “10× Different”.]

PageRank: Performance vs. Number of Cores
[Figure: two panels plotting relative performance against number of physical cores (0 to 56) for Scheduler Un-Aware and Scheduler-Aware, one for dimacs-usa (low, even degree distribution) and one for uk-2007 (extremely skewed). Callouts on the plots: “Scaling enabled by Scheduler Awareness” and “Scaling improved by Scheduler Awareness”.]
Key Insights
• Huge improvement for extremely skewed graphs
• Still beneficial for evenly-distributed low-degree graphs

Vector-Sparse
Contribution #2

Compressed-Sparse
Vertex Index  [0]=0  [1]=3  [2]=7  [3]=10
Edges         [0]=23  [1]=10  [2]=50  [3]=4  [4]=0  [5]=53  [6]=62  [7]=1  [8]=78  [9]=50  [10]=23  [11]=4
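As a minimal sketch of how this layout is read, the function below looks up the edge range of one vertex. The arrays are copied from the slide; the function name `neighbors` is an assumption, and because the slide's vertex index has no terminating entry, the last vertex is assumed to run to the end of the edge array.

```python
# Arrays copied from the slide.
vertex_index = [0, 3, 7, 10]
edges = [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4]


def neighbors(vertex_index, edges, v):
    """Return the slice of the edge array that belongs to vertex v."""
    start = vertex_index[v]
    # The last vertex has no following index entry, so it ends at len(edges).
    if v + 1 < len(vertex_index):
        end = vertex_index[v + 1]
    else:
        end = len(edges)
    return edges[start:end]


for v in range(len(vertex_index)):
    print(v, neighbors(vertex_index, edges, v))
```

Vertex 0 owns three edges and vertex 1 owns four, so a fixed-width group of four consecutive array slots starting at slot 0 would mix edges of two different vertices.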

Vectorizing Compressed-Sparse
[Figure: the same edge array, slots [0] to [11], with the edges of Vertex 0 and Vertex 1 marked and a × drawn between them.]

Vector-Sparse
Slot    [0] [1] [2] [3] | [4] [5] [6] [7] | [8] [9] [10] [11] | [12] [13]
Edge    23  10  50  ×   | 4   0   53  62  | 1   78  50   ×    | 23   4
Valid   1   1   1   0   | 1   1   1   1   | 1   1   1    0    | 1    1
Vertex  0               | 1               | 2                 |
• Padding
• + “valid” bits
• + top-level vertex spread-encoding
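A rough sketch of the conversion from Compressed-Sparse to this padded layout is shown below. It assumes four lanes per vector, as the figure suggests, and it stores the top-level vertex once per vector as a plain integer, which is a simplification: the spread-encoding named on the slide is not modeled. The function names, the padding value 0, and the definition of packing efficiency as valid lanes over total lanes are assumptions.

```python
VECTOR_WIDTH = 4  # assumed number of lanes per vector

# Compressed-Sparse arrays copied from the earlier slide.
vertex_index = [0, 3, 7, 10]
edges = [23, 10, 50, 4, 0, 53, 62, 1, 78, 50, 23, 4]


def to_vector_sparse(vertex_index, edges, width=VECTOR_WIDTH):
    """Pad each vertex's edge list to a whole number of vectors.

    Returns (slots, valid, top_level): the padded edge slots, one valid bit
    per slot, and the owning vertex of each vector.
    """
    slots, valid, top_level = [], [], []
    for v, start in enumerate(vertex_index):
        if v + 1 < len(vertex_index):
            end = vertex_index[v + 1]
        else:
            end = len(edges)
        for base in range(start, end, width):
            lanes = edges[base:min(base + width, end)]
            pad = width - len(lanes)
            slots.extend(lanes + [0] * pad)           # 0 is an arbitrary filler
            valid.extend([1] * len(lanes) + [0] * pad)
            top_level.append(v)                       # one entry per vector
    return slots, valid, top_level


def packing_efficiency(valid):
    """Fraction of lanes that hold a real edge (assumed definition)."""
    return sum(valid) / len(valid)


slots, valid, top_level = to_vector_sparse(vertex_index, edges)
print(slots)
print(valid)
print(top_level)
print(packing_efficiency(valid))
```

Because every vector now belongs to exactly one vertex, a vectorized inner loop never has to split a vector across a vertex boundary; the cost is the padding lanes, which the valid bits let the loop ignore.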

Analyzing Vector-Sparse
[Figure, two panels over the graphs dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007. Packing Efficiency: generally ≥ 75%. Performance Impact: speedup for PageRank, CC, BFS, and their average, 1.5× to 2.5×.]

Performance Comparison
Putting it all together

Evaluation Scope
• Grazelle is compared with Ligra, Polymer, GraphMat, and X-Stream
• Three applications: PageRank, Connected Components, Breadth-First Search
• Running on a machine equipped with four Intel Xeon E7-4850 v3 processors
  • 14 physical cores / 28 logical cores per socket

PageRank: Peak Processing Throughput
[Figure: execution time (ms, logarithmic scale) on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle-Pull, Grazelle-Push, Ligra-Pull, Ligra-Push, Polymer, GraphMat, and X-Stream. Callouts on the chart: 15.2×, 1.4×, 2.3×, 2.3×, 3.6×.]

Connected Components: Dynamic Control Flow
[Figure: execution time (ms, logarithmic scale) on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream. Callouts on the chart: 4.9×, 21.1×, 1.5×, 1.6×.]

Breadth-First Search: Compatibility of Optimizations
[Figure: execution time (ms, logarithmic scale) on dimacs-usa, livejournal, twitter-2010, friendster, and uk-2007 for Grazelle, Ligra, Ligra-Dense, Polymer, GraphMat, and X-Stream.]
