PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
Joshua Send, 24/10/2017, LSDPO Session 3
Intuition for Graph Processing Systems
Overall goal
• Efficiently compute over large graphs of data; the key is distributing the work
• Typical tasks: Single-Source Shortest Path (SSSP), PageRank, etc.
Approach
• Define the computation on the data graph itself, rather than passing the graph through a sequence of computation steps
Existing Systems – Pregel [1]
• Input: a data graph with directed edges
• Assign a computation to each vertex; vertices are assigned to instances
• Synchronous supersteps; vertices communicate by message passing along edges
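To make the model concrete, here is a minimal sketch of what SSSP looks like as a Pregel-style vertex program. This is illustrative Python, not Pregel's actual C++ API; the `vertex`, `send_message`, and `vote_to_halt` names are assumptions of the sketch.

```python
# Hypothetical Pregel-style vertex program for SSSP: each superstep a vertex
# consumes incoming messages, updates its value, sends messages along its
# out-edges, and then votes to halt.

def sssp_compute(vertex, messages, send_message, vote_to_halt):
    # vertex.value holds the best known distance (infinity initially, 0 at the source).
    best = min([vertex.value] + list(messages))
    if best < vertex.value:
        vertex.value = best
        # Propagate the improved distance along outgoing (directed) edges.
        for edge in vertex.out_edges:
            send_message(edge.target, best + edge.weight)
    # A halted vertex stays inactive until a new message reactivates it.
    vote_to_halt(vertex)
```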
Existing Systems – GraphLab [2]
• Also processes large data graphs and distributes graph vertices to instances
• No explicit message passing and no directed edges
• Asynchronous execution – no supersteps
Motivation
◦ Natural graphs have power-law degree distributions: P(d) ∝ d^(−α)
◦ E.g. social networks, the web (α ≈ 2)
◦ A few high-degree "hub" vertices touch a large share of the edges, which makes balanced vertex-based partitioning hard
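To make the skew concrete, the sketch below samples degrees from a power law with α ≈ 2 and reports how large a share of edge endpoints the top 1% of vertices account for. The sampler and constants are illustrative only, not taken from the paper.

```python
# Illustrative only: sample degrees from P(d) ∝ d^(-α) with α = 2 to show how
# a few hub vertices dominate the edge endpoints of a natural graph.
import random

def sample_power_law_degree(alpha=2.0, d_min=1):
    # Inverse-transform sampling of a continuous power law, rounded down to an int.
    u = random.random()
    return int(d_min * (1 - u) ** (-1 / (alpha - 1)))

degrees = sorted((sample_power_law_degree() for _ in range(100_000)), reverse=True)
top_share = sum(degrees[: len(degrees) // 100]) / sum(degrees)
print(f"top 1% of vertices hold {top_share:.0%} of all edge endpoints")
```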
Natural Graphs
Contributions
1. Generalized “vertex program”
2. Distribute the graph edge-by-edge rather than vertex-by-vertex
3. Practical parallel locking
Generalized Vertex Program
Gather
• Collect data and aggregate
• Commutative, associative operation
Apply
• Perform operation on gathered data
Scatter
• Disseminate to neighbors
• Activate their aggregator
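Stated as code, the Gather-Apply-Scatter abstraction might look like the interface sketch below. This is illustrative Python, not PowerGraph's actual C++ API; the key requirement is that `sum` is commutative and associative so partial gathers can be combined in any order.

```python
# Minimal sketch of a GAS vertex program interface (names are illustrative).

class VertexProgram:
    def gather(self, vertex, edge):
        # Return a partial accumulator computed from one adjacent edge.
        raise NotImplementedError

    def sum(self, acc_a, acc_b):
        # Commutative, associative combination of partial accumulators,
        # so replicas can pre-aggregate before sending to the master.
        raise NotImplementedError

    def apply(self, vertex, total_acc):
        # Update the vertex value from the gathered total (runs once, on the master).
        raise NotImplementedError

    def scatter(self, vertex, edge):
        # Optionally update edge data; in these sketches, return a neighbor
        # to activate, or None.
        raise NotImplementedError
```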
SSSP
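The SSSP example above, expressed as a GAS vertex program, might look like the following sketch (illustrative Python; the paper's implementation is C++ and differs in detail). `vertex.value` holds the best known distance.

```python
# SSSP in GAS form: gather candidate distances from in-edges, combine with min,
# apply the improvement, and scatter only if the distance actually changed.

class SSSP:
    def gather(self, vertex, edge):
        # Candidate distance to 'vertex' through this in-edge.
        return edge.source.value + edge.weight

    def sum(self, a, b):
        return min(a, b)                 # commutative and associative

    def apply(self, vertex, total):
        vertex.changed = total < vertex.value
        vertex.value = min(vertex.value, total)

    def scatter(self, vertex, edge):
        # Activate the neighbor only if our distance improved.
        return edge.target if vertex.changed else None
```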
Vertex Splitting
• Standard approach: assign each vertex of the graph to an instance; often requires ‘ghost’ copies of neighbors
• Idea: assign each edge to an instance instead
• A vertex can then appear on several instances
• Gathering and scattering “within” one vertex is parallelized, since its edges may live on different instances
• The set of instances containing a particular vertex is its set of replicas; one is randomly chosen as the master, the rest are mirrors
• The master receives partial aggregations, applies the vertex operation, and sends the updated value back to the edges for the scatter
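A rough sketch of how one gather/apply round works for a single split vertex under these rules: each replica aggregates only its local edges, mirrors send their partial results to the master, and the master applies the update. The helper names (`local_edges`, `broadcast_to_mirrors`) are hypothetical, not PowerGraph's API.

```python
# Vertex-cut evaluation of one gather phase for vertex v (illustrative sketch).

def gather_on_replica(program, v, local_edges):
    acc = None
    for edge in local_edges:                 # only the edges stored on this machine
        partial = program.gather(v, edge)
        acc = partial if acc is None else program.sum(acc, partial)
    return acc                               # mirrors send this to the master

def apply_on_master(program, v, partial_accs):
    total = None
    for acc in partial_accs:                 # one partial accumulator per replica
        if acc is not None:
            total = acc if total is None else program.sum(total, acc)
    program.apply(v, total)                  # runs exactly once, on the master
    broadcast_to_mirrors(v)                  # hypothetical: mirrors get the new value
```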
Master, Mirrors
How to Actually Distribute Edges
3 different strategies
1. Random
◦ Place each edge on an instance based on a hash
2. Greedy heuristic
◦ Reduces the number of replicas per vertex
◦ Requires an estimate of the replica set of each vertex
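A minimal sketch of the random strategy, assuming edges are placed by hashing their endpoints: it is simple and fast to load, but every machine that receives an edge of vertex v becomes a replica of v, so high-degree vertices end up replicated almost everywhere.

```python
# Random edge placement by hashing endpoints, plus the resulting replication
# factor (average number of machines holding each vertex). Illustrative only.
from collections import defaultdict

def random_partition(edges, num_machines):
    replicas = defaultdict(set)              # vertex -> machines holding a replica
    for src, dst in edges:
        m = hash((src, dst)) % num_machines
        replicas[src].add(m)
        replicas[dst].add(m)
    rep_factor = sum(len(s) for s in replicas.values()) / len(replicas)
    return replicas, rep_factor
```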
Heuristic Distribution
1. Oblivious
◦ Each loader estimates the replica sets from local information only
◦ Paper is unclear on how exactly this works
2. Coordinated
◦ Keep a distributed table of the replica sets per vertex
Tradeoff space: longer load time vs. fewer replicas and faster execution
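A simplified sketch of the greedy placement rule: for edge (u, v), prefer a machine that already holds both endpoints, then one that holds either endpoint, otherwise the least loaded machine. "Oblivious" keeps the replica sets locally per loader, while "coordinated" keeps them in a distributed table. The tie-breaking in the paper is more careful than this sketch.

```python
# Greedy edge placement (simplified). 'replicas' maps vertex -> set of machines,
# 'load' maps machine -> number of edges assigned so far.

def greedy_place(u, v, replicas, load, num_machines):
    both = replicas[u] & replicas[v]
    either = replicas[u] | replicas[v]
    candidates = both or either or set(range(num_machines))
    m = min(candidates, key=lambda machine: load[machine])  # least loaded candidate
    replicas[u].add(m)
    replicas[v].add(m)
    load[m] += 1
    return m
```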
Execution Strategies
Supports:
◦ Synchronous supersteps (à la Pregel)
◦ Asynchronous
◦ Asynchronous + serializable, using parallel locking
Tradeoff space: predictability/determinism vs. throughput vs. runtime/convergence speed
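For the synchronous case, the engine loop might look like the sketch below, with barriers between the gather, apply, and scatter phases; the asynchronous engines instead pull active vertices from a scheduler as resources free up. `gather_all` is a hypothetical helper that combines `program.gather` over a vertex's edges with `program.sum`; none of this is PowerGraph's real engine code.

```python
# Synchronous superstep loop over the GAS phases (illustrative sketch).

def run_synchronous(program, graph, active):
    while active:
        # Gather phase (barrier before apply).
        totals = {v: gather_all(program, graph, v) for v in active}
        # Apply phase (barrier before scatter).
        for v, total in totals.items():
            program.apply(v, total)
        # Scatter phase: collect vertices activated for the next superstep.
        next_active = set()
        for v in active:
            for edge in graph.out_edges(v):
                target = program.scatter(v, edge)   # returns a vertex to activate, or None
                if target is not None:
                    next_active.add(target)
        active = next_active
```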
Miscellaneous
Delta Caching
◦ Scatter can push a delta into a neighbor's cached accumulator rather than forcing a full re-gather; if there is no delta, the neighbor may not have to recompute
Fault Tolerance
◦ Checkpointing
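A small sketch of how delta caching might work under these assumptions: deltas are folded into a cached accumulator with the commutative/associative `sum`, and a cached value lets the engine skip the full gather. The `cache` dictionary and `gather_all` helper are assumptions of the sketch.

```python
# Delta caching sketch: reuse a cached accumulator instead of re-gathering.

def post_delta(program, cache, target_vertex, delta):
    # Only fold the delta into an existing cache entry; if there is none,
    # the next gather recomputes the accumulator from scratch anyway.
    if target_vertex in cache and delta is not None:
        cache[target_vertex] = program.sum(cache[target_vertex], delta)

def gather_or_cached(program, graph, cache, v):
    if v in cache:
        return cache[v]                   # skip the expensive gather entirely
    return gather_all(program, graph, v)  # hypothetical full-gather helper
```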
Results
Partitioning scheme
◦ Replication factor: random > oblivious > coordinated (coordinated yields the fewest replicas)
◦ All configurations are faster than Pregel/Piccolo and GraphLab on synthetic natural graphs
Execution strategy
◦ Synchronous: 3-8x faster per PageRank iteration than Spark
◦ Async: even faster (authors don't provide a direct comparison?)
◦ Async + serializable: lower throughput, but converges faster (less recomputation)
Remarks
• Paper's details are hard to understand
• Evaluation is a bit sloppy – missing some direct comparisons between execution strategies and between combinations of partitioning and execution
• Large tradeoff space, hard to navigate
◦ E.g. coordinated distribution can increase load times 4x
◦ Authors highlight 60 s vs. 240 s for random vs. coordinated partitioning
◦ Meanwhile, SSSP on 6.5B edges takes 65 s to run
Remarks
• Solid theoretical foundation for the partitioning heuristic
• Very solid gains over prior systems, especially on tasks with natural graphs!
References
1. G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. SIGMOD, 2010.
2. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. VLDB, 2012.
3. J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI, 2012.