Pregelix: Think Like a Vertex, Scale Like Spandex

Yingyi Bu (UC Irvine)
Work with: Vinayak Borkar (UC Irvine), Michael J. Carey (UC Irvine), Tyson Condie (Microsoft & UCLA)

Outline
● Introduction
● Programming Model
● Example Applications
● System Internals
● Experimental Results
● Related Work
● Conclusions

Introduction
● Big graphs are becoming common
  ○ web graph
  ○ social network
  ○ ......

Introduction
● How big are Big Graphs?
  ○ Web: 8.53 billion pages in 2012
  ○ Facebook active users: 1.01 billion
  ○ de Bruijn graph: 3 billion nodes
  ○ ......
● Weapons for mining Big Graphs
  ○ Hadoop/Hive (Facebook)
  ○ Pregel (Google)
  ○ Distributed GraphLab (CMU)

Programming Model (Pregel)
● Think like a vertex
  ○ receive messages
  ○ update states
  ○ send messages

Programming Model (BSP)
[Figure: an iteration consists of three steps per vertex: receive msgs, update states, send msgs]
● Bulk synchronized
  ○ A synchronization barrier between iterations
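The receive, update, send cycle and the barrier between iterations can be simulated in a few lines. The following is a minimal single-process sketch, not Pregelix code: the class name `BspMaxDemo`, the adjacency-array graph encoding, and the "keep the largest value seen" vertex logic are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy single-process BSP loop (illustrative only, not Pregelix code).
// Every vertex converges to the largest value reachable through its neighbors.
class BspMaxDemo {
    static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) boxes.add(new ArrayList<>());
        return boxes;
    }

    static int[] run(int[] initial, int[][] neighbors) {
        int n = initial.length;
        int[] state = initial.clone();
        List<List<Integer>> inbox = emptyBoxes(n);
        for (int iteration = 0; ; iteration++) {
            List<List<Integer>> outbox = emptyBoxes(n);
            boolean sent = false;
            for (int v = 0; v < n; v++) {
                // Steps 1 and 2: receive msgs and update the state.
                boolean changed = (iteration == 0);
                for (int m : inbox.get(v)) {
                    if (m > state[v]) { state[v] = m; changed = true; }
                }
                // Step 3: send msgs, only if this vertex has something new to say.
                if (changed) {
                    for (int u : neighbors[v]) { outbox.get(u).add(state[v]); sent = true; }
                }
            }
            // Synchronization barrier: messages become visible in the next iteration only.
            inbox = outbox;
            if (!sent) return state;
        }
    }

    public static void main(String[] args) {
        int[][] neighbors = {{1}, {0, 2}, {1}, {}};   // path 0-1-2 plus isolated vertex 3
        System.out.println(Arrays.toString(run(new int[]{3, 6, 2, 1}, neighbors)));
    }
}
```

Swapping `inbox` and `outbox` at the end of the loop body plays the role of the barrier: no vertex sees a message sent during the current iteration.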

Programming Model - API
● Vertex (a super class for all applications)

    public abstract class Vertex<I extends WritableComparable,
                                 V extends Writable,
                                 E extends Writable,
                                 M extends Writable> implements Writable {
        /**
         * @param msgIterator an iterator of incoming messages
         */
        public abstract void compute(Iterator<M> msgIterator);
        .......
    }

● Helper methods
  ○ sendMsg(I vertexId, M msg)
  ○ voteToHalt()
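To show how `compute`, `sendMsg`, and `voteToHalt` fit together, here is a hedged, self-contained sketch of a single-source shortest paths vertex. `ToyVertex`, `ShortestPathVertex`, and `ToyDriver` are hypothetical stand-ins written for this example: plain Java types replace the `Writable` bounds, and the in-memory driver is not how Pregelix executes jobs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the Vertex super class: same method names, plain Java types.
abstract class ToyVertex<M> {
    int id;
    boolean halted = false;
    Map<Integer, List<M>> outbox;   // set by the toy driver before each compute() call

    abstract void compute(Iterator<M> msgIterator);

    void sendMsg(int vertexId, M msg) {
        outbox.computeIfAbsent(vertexId, k -> new ArrayList<>()).add(msg);
    }

    void voteToHalt() { halted = true; }
}

class ShortestPathVertex extends ToyVertex<Double> {
    double dist = Double.POSITIVE_INFINITY;
    final Map<Integer, Double> outEdges = new HashMap<>();   // target id -> edge weight
    final boolean isSource;

    ShortestPathVertex(int id, boolean isSource) { this.id = id; this.isSource = isSource; }

    @Override
    void compute(Iterator<Double> msgIterator) {
        double best = isSource ? 0.0 : Double.POSITIVE_INFINITY;
        while (msgIterator.hasNext()) best = Math.min(best, msgIterator.next());
        if (best < dist) {
            dist = best;   // update state, then tell the neighbors
            for (Map.Entry<Integer, Double> e : outEdges.entrySet()) {
                sendMsg(e.getKey(), dist + e.getValue());
            }
        }
        voteToHalt();      // an incoming message reactivates this vertex
    }
}

class ToyDriver {
    static void run(Map<Integer, ShortestPathVertex> vertices) {
        Map<Integer, List<Double>> inbox = new HashMap<>();
        boolean first = true;
        while (first || !inbox.isEmpty()) {
            Map<Integer, List<Double>> outbox = new HashMap<>();
            for (ShortestPathVertex v : vertices.values()) {
                List<Double> msgs = inbox.getOrDefault(v.id, Collections.emptyList());
                if (v.halted && msgs.isEmpty()) continue;   // halted and no mail: skip
                v.halted = false;
                v.outbox = outbox;
                v.compute(msgs.iterator());
            }
            inbox = outbox;   // barrier between iterations
            first = false;
        }
    }

    static Map<Integer, ShortestPathVertex> demoGraph() {
        Map<Integer, ShortestPathVertex> g = new TreeMap<>();
        for (int id = 1; id <= 4; id++) g.put(id, new ShortestPathVertex(id, id == 1));
        g.get(1).outEdges.put(2, 1.0);
        g.get(1).outEdges.put(3, 4.0);
        g.get(2).outEdges.put(3, 2.0);
        return g;             // vertex 4 is unreachable from the source
    }

    public static void main(String[] args) {
        Map<Integer, ShortestPathVertex> g = demoGraph();
        run(g);
        for (ShortestPathVertex v : g.values()) System.out.println(v.id + " -> " + v.dist);
    }
}
```

The job ends when every vertex has voted to halt and no messages are in flight, which in this toy driver corresponds to an empty inbox after the barrier.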

Programming Model - Optional APIs
● Combiner
  ○ Combine messages
  ○ Reduce network traffic
● Global Aggregator
  ○ Aggregate statistics over all vertices
  ○ Done for each iteration
● Early Termination (not in standard Pregel)
  ○ Force the job to terminate
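A combiner pays off when the message function is associative and commutative, such as a sum for PageRank or a min for shortest paths. The sketch below illustrates the idea only; `CombinerDemo` and its `combine` method are hypothetical names, not the Pregelix combiner interface.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Illustrative combiner: collapses all messages bound for the same destination
// into one message before they would cross the network.
class CombinerDemo {
    static <M> Map<Long, M> combine(List<Map.Entry<Long, M>> outgoing, BinaryOperator<M> combiner) {
        Map<Long, M> combined = new TreeMap<>();
        for (Map.Entry<Long, M> msg : outgoing) {
            // merge() stores the first value and applies the combiner to later ones.
            combined.merge(msg.getKey(), msg.getValue(), combiner);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, Double>> msgs = new ArrayList<>();
        msgs.add(new AbstractMap.SimpleEntry<>(7L, 0.25));
        msgs.add(new AbstractMap.SimpleEntry<>(9L, 0.5));
        msgs.add(new AbstractMap.SimpleEntry<>(7L, 0.125));
        msgs.add(new AbstractMap.SimpleEntry<>(7L, 0.5));
        // Four outgoing messages shrink to two, one per destination.
        System.out.println(combine(msgs, Double::sum));
    }
}
```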

Example Applications
● PageRank
● ConnectedComponents
● Shortest Paths
● Reachability query
Start the Demo!

System Internals
[Figure: Pregel, GraphLab, Giraph, ...... each implement their own stack]
● Vertex/map/msg data structures
● Task scheduling
● Memory management
● Message delivery
● Network management
● Our philosophy
  ○ Stop building one-off systems like Pregel, GraphLab, and Giraph; instead, build them on a dataflow engine!

Pregelix System Internals
[Figure: Pregel semantics layered on a general purpose parallel dataflow engine]
● Pregel semantics: Msg, Vertex, dest_id group-by, UDAF (combine), UDF (compute), Barrier
● One-off system concerns: Vertex/map/msg data structures, Task scheduling, Memory management, Message delivery, Network management
● A general purpose parallel dataflow engine: Record/Index management, Task scheduling, Buffer management, Data exchanging, Connection management

System Internals - Runtime
● Runtime choice?
  ○ Hyracks
  ○ Hadoop
● The UCI Hyracks data-parallel execution engine
  ○ connection management
  ○ a set of operators: sorting, grouping, joining
  ○ task scheduling for jobs (a DAG of operators)
  ○ index support: B-tree, LSM B-tree, R-tree....

System Internals - Storage
[Figure: a Pregelix job between two DFS stages, shown for three parallel partitions. Load path per partition: DFS Read, Sorting, B-tree bulkload. Output path per partition: B-tree index scan, DFS Write.]

System Internals - Outer Join Execution Plan
[Figure: per-partition plan, shown for three parallel partitions. Each partition has Msg and Vertex B-tree inputs, a UDF (compute), a Barrier, and two dest_id group-by stages with UDAF (combine).]
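When both inputs arrive ordered by vertex id, the outer join of messages with vertices can be done in a single merging pass. The sketch below assumes that the vertex side is the preserved side, that messages are already combined to at most one per destination, and that a message without a matching vertex is simply dropped; `MergeJoinDemo` is a hypothetical name and arrays stand in for the Msg stream and the B-tree scan.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative merge join: every vertex appears in the output, paired with its
// message when one exists and with null otherwise.
class MergeJoinDemo {
    // msgs: rows of {destId, payload}, sorted by destId, at most one row per destId.
    // vertexIds: sorted ascending, the order an ordered index scan would return.
    static List<String> rightOuterMergeJoin(long[][] msgs, long[] vertexIds) {
        List<String> out = new ArrayList<>();
        int m = 0;
        for (long vid : vertexIds) {
            // Skip messages whose destination does not exist (assumption of this sketch).
            while (m < msgs.length && msgs[m][0] < vid) m++;
            if (m < msgs.length && msgs[m][0] == vid) {
                out.add(vid + ":" + msgs[m][1]);
                m++;
            } else {
                out.add(vid + ":null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] msgs = {{2, 20}, {5, 50}, {6, 60}};
        long[] vertexIds = {1, 2, 3, 6};
        System.out.println(rightOuterMergeJoin(msgs, vertexIds));
    }
}
```

Each input is read once and never rewound, which is why this style of join suits a full scan of every vertex in each iteration.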

System Internals - Inner Join Execution Plan
[Figure: per-partition plan, shown for three parallel partitions. Each partition has Msg, Live vertex IDs, and Vertex B-tree inputs, a UDF (compute), a Barrier, and two dest_id group-by stages with UDAF (combine).]
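One way to read the inner join plan is that only vertices that are still live or that have incoming messages need to be fetched, so the index is probed instead of scanned. The following sketch rests on that reading: `IndexProbeDemo` is a hypothetical name, a `TreeMap` stands in for the Vertex B-tree, and a `TreeSet` stands in for the union of live vertex IDs and message destinations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative index probing join: look up only the vertices that must run compute().
class IndexProbeDemo {
    static List<String> probe(TreeMap<Long, String> vertexIndex, long[] liveIds, long[] msgDestIds) {
        // Set union of live vertex IDs and message destinations, kept in sorted order.
        TreeSet<Long> toVisit = new TreeSet<>();
        for (long id : liveIds) toVisit.add(id);
        for (long id : msgDestIds) toVisit.add(id);

        List<String> out = new ArrayList<>();
        for (long id : toVisit) {
            String vertex = vertexIndex.get(id);   // point lookup instead of a full scan
            if (vertex != null) out.add(id + "=" + vertex);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> index = new TreeMap<>();
        for (long id = 1; id <= 6; id++) index.put(id, "v" + id);
        System.out.println(probe(index, new long[]{2, 5}, new long[]{5, 6}));
    }
}
```

Probing tends to win when few vertices are active, as in the later iterations of shortest paths; when most vertices are active, a sequential merge over the whole index is usually the cheaper choice.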

System Internals - Implementations
● Right-outer join
  ○ Index merging join
● Sender-side group-by
  ○ Sort + pre-clustered group-by
● Data redistribution
  ○ Hash merging repartitioning connector
  ○ Sender-side materialization pipelining
● Receiver-side group-by
  ○ Pre-clustered group-by
● Inner join
  ○ Index probing join
● Set Union
  ○ Index set union
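The "sort + pre-clustered group-by" entry can be illustrated as follows: once messages are sorted by dest_id, equal keys are adjacent, so a single pass with one running aggregate per group is enough. `PreclusteredGroupByDemo` is a hypothetical name, and the `long` payloads and in-memory sort are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.LongBinaryOperator;

// Illustrative sort followed by a pre-clustered (streaming) group-by on dest_id.
class PreclusteredGroupByDemo {
    // msgs: rows of {destId, value}. Returns one {destId, aggregate} row per destId.
    static List<long[]> sortThenGroup(long[][] msgs, LongBinaryOperator agg) {
        long[][] sorted = msgs.clone();
        Arrays.sort(sorted, Comparator.comparingLong(row -> row[0]));

        List<long[]> out = new ArrayList<>();
        for (long[] row : sorted) {
            long[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last[0] == row[0]) {
                last[1] = agg.applyAsLong(last[1], row[1]);   // same group: fold into aggregate
            } else {
                out.add(new long[]{row[0], row[1]});          // key changed: start a new group
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] msgs = {{5, 1}, {2, 10}, {5, 3}, {2, 4}, {9, 7}};
        for (long[] row : sortThenGroup(msgs, Long::sum)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

If the input is already ordered by dest_id, the sort can be skipped and only the single pass remains.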

System Internals
Spark, GraphLab, and HaLoop all have caches for this kind of iterative job. What do you do for caching?
● Iteration-aware (sticky) scheduling?
  ○ 1 LOC: location constraints
● Caching of invariant data?
  ○ B-tree buffer pool -- 1 LOC: never flush dirty pages
  ○ File system cache -- free

Experimental Results
● Setup
  ○ Machines: Yahoo! Research cluster, ~180 machines. Each has 8 cores, 12GB memory, 4 disk drives.
  ○ Dataset: Yahoo! webmap (1,413,511,393 vertices)

Experimental Results
● 10-iteration PageRank
● 1x webmap dataset

Experimental Results
● 10-iteration PageRank
● 1x webmap on 88 machines, 2x webmap on 175 machines

Related Work
● Spark [NSDI 2012]
  ○ OutOfMemoryError
● HaLoop [VLDB 2010]
  ○ Only a 1.8X speedup over Hadoop
● Giraph
  ○ OutOfMemoryError
● Mahout
  ○ OutOfMemoryError
● Distributed GraphLab [VLDB 2012]
  ○ Haven't tried yet (just published in September...)

Conclusions
● The vertex-oriented programming model is simple
● The dataflow implementation is neat and efficient
● We target Pregelix to be an open-source production system, rather than just a research prototype:
  ○ http://hyracks.org/projects/pregelix/
