HelP: High-level Primitives for Large-Scale Graph Processing
Semih Salihoglu (Stanford University)
Jennifer Widom (Stanford University)

Large-scale Graph Processing
- Graphs with tens or hundreds of billions of vertices and edges
- Distributed shared-nothing systems, e.g. Pregel and PowerGraph
[Figure: Machine 1, Machine 2, …, Machine k on top of distributed storage]

APIs of Existing Systems
- Specialized map()- and reduce()-type APIs
  - Pregel’s compute()
  - PowerGraph’s gather(), apply(), scatter()
- Vertex-centric / graph-parallel
- Message-passing
[Figure: Machine 1, Machine 2, …, Machine k on top of distributed storage]

Advantages
- Transparent parallelism
- Flexible: can express many graph algorithms
  - PageRank
  - HITS
  - Shortest Paths
  - Collaborative Filtering
  - Affinity Propagation
  - Loopy Belief Propagation
  - Weakly Connected Components
  - Triangle Counting
  - Strongly Connected Components
  - Betweenness Centrality
  - Minimum Spanning Tree
  - Diameter Estimation
  - …

Disadvantages
- Custom code for common operations, such as:
  - Initializing vertex values
  - Aggregating neighbor values
- Some programs are difficult to read and understand: complex UDFs hide the higher-level graph operations
    …
    graph = Pregel.compute(UDF1)
    graph = Pregel.compute(UDF2)
    graph = Pregel.compute(UDF3)
    …
- Too low-level for some operations
  - E.g., forming supervertices in a minimum spanning tree
  - Requires multiple rounds of complex messaging inside compute()

HelP Primitives
- Large-scale data processing: map(), reduce(). Higher-level alternative: Pig and Hive (join, group by, select, …)
- Large-scale graph processing: compute(), gather(), apply(), scatter(). Higher-level alternative: HelP (?)

Steps in Our Work
1. Implemented a wide suite of distributed graph algorithms
2. Identified the commonly appearing operations
3. Abstracted the operations into HelP primitives
4. Implemented HelP on GraphX
5. Reimplemented the suite of algorithms on GraphX

Graph Algorithms We Implemented
- PageRank
- HITS
- Conductance
- Approx. Betweenness Centrality
- Clustering Coefficient
- Semi-clustering
- Multi-level clustering
- Approx. Maximum Weight Matching
- Random Bipartite Matching
- Weakly Connected Components
- Strongly Connected Components
- Single Source Shortest Paths
- Graph Coloring
- Maximal Independent Set
- K-core
- Triangle Counting
- Diameter Estimation
- K-truss
- Minimum Spanning Forest

HelP Primitives
Primitive                                        Type of Operation
Aggregate Neighbor Values (ANV)                  Vertex-centric Update
Local Update of Vertices (LUV)                   Vertex-centric Update
Update Vertices Using One Other Vertex (UVUOV)   Vertex-centric Update
Filter                                           Topology Modification
Form Supervertices (FS)                          Topology Modification
Aggregate Global Value (AGV)                     Global Aggregation

Algorithms & HelP Primitives
(columns: Filter, ANV, LUV, UVUOV, FS, AGV; each x marks one primitive used by the algorithm)
Algorithm                         Primitives used
PageRank                          x x
HITS                              x x x
Conductance                       x x
Approx. Betweenness Centrality    x x x
Clustering Coefficient            x x
Semi-clustering                   x x x
Multi-level clustering            x x x
Approx. Maximum Weight Matching   x x
Random Bipartite Matching         x x x
Weakly Connected Components       x x
Strongly Connected Components     x x x x
Single Source Shortest Paths      x x
Graph Coloring                    x x x
Maximal Independent Set           x x x
K-core                            x x
Triangle Counting                 x
Diameter Estimation               x x x
K-truss                           x
Minimum Spanning Forest           x x x x

Example: Aggregate Neighbor Values
- Vertices aggregate some or all of their neighbors’ values
- Each vertex updates its own value with the aggregated value
Version 1: Non-iterative => aggregateNeighborValues
PageRank
    …
    for (i = 0; i < 10; ++i) {
      g.aggregateNeighborValues(
        v -> true /* aggregate all vertices */,
        nbr -> true /* which neighbors to aggregate */,
        nbr -> nbr.val.pr / nbr.degree,
        AggrFnc.SUM,
        (v, sumPr) -> { v.val.pr = 0.85 * sumPr + 0.15 / g.numV; });
    }

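To make the semantics of the non-iterative version concrete, here is a minimal single-machine sketch in plain Java. It is not the HelP or GraphX implementation: the `AnvSketch`, `Vertex`, and `Graph` classes are hypothetical, a `BinaryOperator` stands in for `AggrFnc.SUM`, and it assumes that values are aggregated over incoming edges and that a vertex with no qualifying neighbors is left unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for the aggregateNeighborValues primitive.
class AnvSketch {
    static class Vertex {
        double pr;                                      // PageRank value
        int degree;                                     // out-degree
        final List<Vertex> inNbrs = new ArrayList<>();  // sources of incoming edges
        Vertex(double pr) { this.pr = pr; }
    }

    static class Graph {
        final List<Vertex> vertices = new ArrayList<>();

        int numV() { return vertices.size(); }

        void addEdge(Vertex src, Vertex dst) {
            src.degree++;
            dst.inNbrs.add(src);
        }

        void aggregateNeighborValues(
                Predicate<Vertex> vertexPred,          // which vertices aggregate
                Predicate<Vertex> nbrPred,             // which neighbors contribute
                Function<Vertex, Double> nbrValue,     // value taken from each neighbor
                BinaryOperator<Double> aggr,           // how contributions are combined
                BiConsumer<Vertex, Double> update) {   // how a vertex applies the result
            // Phase 1: compute every aggregate from the old vertex values.
            List<Double> aggregated = new ArrayList<>();
            for (Vertex v : vertices) {
                Double acc = null;
                if (vertexPred.test(v)) {
                    for (Vertex nbr : v.inNbrs) {
                        if (!nbrPred.test(nbr)) continue;
                        Double x = nbrValue.apply(nbr);
                        acc = (acc == null) ? x : aggr.apply(acc, x);
                    }
                }
                aggregated.add(acc);
            }
            // Phase 2: apply the updates (assumption: no contributions means no update).
            for (int i = 0; i < vertices.size(); i++) {
                Double acc = aggregated.get(i);
                if (acc != null) update.accept(vertices.get(i), acc);
            }
        }
    }

    // PageRank loop from the slide, written against the sketch above.
    static void pageRank(Graph g, int iterations) {
        for (int i = 0; i < iterations; ++i) {
            g.aggregateNeighborValues(
                v -> true,
                nbr -> true,
                nbr -> nbr.pr / nbr.degree,
                Double::sum,
                (v, sumPr) -> { v.pr = 0.85 * sumPr + 0.15 / g.numV(); });
        }
    }

    // Small example graph: a -> b, a -> c, b -> c, c -> a, all starting at 1/3.
    static Graph example() {
        Graph g = new Graph();
        for (int i = 0; i < 3; i++) g.vertices.add(new Vertex(1.0 / 3));
        Vertex a = g.vertices.get(0), b = g.vertices.get(1), c = g.vertices.get(2);
        g.addEdge(a, b);
        g.addEdge(a, c);
        g.addEdge(b, c);
        g.addEdge(c, a);
        return g;
    }

    public static void main(String[] args) {
        Graph g = example();
        pageRank(g, 10);
        for (Vertex v : g.vertices) System.out.println(v.pr);
    }
}
```

Computing all aggregates before applying any update keeps each call synchronous, so every vertex sees its neighbors' values from the previous iteration.
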
Version 2: Iterative => propagateAndAggregate
- Continue aggregations until vertex values converge
- Ex: Weakly Connected Components
[Figure: wccID values of an example graph over successive iterations]
    g.propagateAndAggregate(
      EdgeDirection.BOTH,
      v -> true, /* start propagation from all */
      v -> v.val.wccID,
      AggrFnc.MAX,
      (v, aggrWCCID) -> { v.val.wccID = aggrWCCID; });

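The iterative version can be sketched on a single machine as a loop that keeps propagating values from vertices whose value changed in the previous round. This is an illustration under assumptions, not the HelP implementation: the `PropagateSketch` class and its helpers are hypothetical, edges are stored undirected (standing in for `EdgeDirection.BOTH`), `Integer::max` stands in for `AggrFnc.MAX`, and a vertex is assumed to combine the incoming aggregate with its own current value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for the propagateAndAggregate primitive.
class PropagateSketch {
    static class Vertex {
        int wccID;
        final List<Vertex> nbrs = new ArrayList<>();  // neighbors in both directions
        Vertex(int id) { this.wccID = id; }
    }

    static void connect(Vertex a, Vertex b) {
        a.nbrs.add(b);
        b.nbrs.add(a);
    }

    static void propagateAndAggregate(
            List<Vertex> vertices,
            Predicate<Vertex> start,                 // where propagation starts
            Function<Vertex, Integer> value,         // value a vertex propagates
            BinaryOperator<Integer> aggr,            // how propagated values are combined
            BiConsumer<Vertex, Integer> update) {    // how a vertex applies the result
        Set<Vertex> active = new HashSet<>();
        for (Vertex v : vertices) {
            if (start.test(v)) active.add(v);
        }
        // Terminates for monotone aggregations such as MAX over a finite set of values.
        while (!active.isEmpty()) {
            // Collect phase: active vertices send their old value to all neighbors.
            Map<Vertex, Integer> incoming = new HashMap<>();
            for (Vertex src : active) {
                for (Vertex dst : src.nbrs) {
                    incoming.merge(dst, value.apply(src), aggr);
                }
            }
            // Update phase: only vertices whose value changes stay active.
            Set<Vertex> changed = new HashSet<>();
            for (Map.Entry<Vertex, Integer> e : incoming.entrySet()) {
                Vertex v = e.getKey();
                Integer old = value.apply(v);
                Integer combined = aggr.apply(old, e.getValue());
                if (!combined.equals(old)) {
                    update.accept(v, combined);
                    changed.add(v);
                }
            }
            active = changed;
        }
    }

    // Weakly connected components, mirroring the call on the slide.
    static void weaklyConnectedComponents(List<Vertex> vertices) {
        propagateAndAggregate(
            vertices,
            v -> true,                 // start propagation from all vertices
            v -> v.wccID,
            Integer::max,
            (v, aggrWCCID) -> { v.wccID = aggrWCCID; });
    }

    // Two components: the path 0 - 1 - 2 and the edge 3 - 4.
    static List<Vertex> example() {
        List<Vertex> vs = new ArrayList<>();
        for (int i = 0; i < 5; i++) vs.add(new Vertex(i));
        connect(vs.get(0), vs.get(1));
        connect(vs.get(1), vs.get(2));
        connect(vs.get(3), vs.get(4));
        return vs;
    }

    public static void main(String[] args) {
        List<Vertex> vs = example();
        weaklyConnectedComponents(vs);
        for (Vertex v : vs) System.out.println(v.wccID);
    }
}
```

After convergence, every vertex holds the largest ID in its component, so two vertices are in the same component exactly when their `wccID` values match.
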
Related Work (see paper)
- Vertex-centric APIs
- MapReduce-based APIs
- Higher-Level Data Analysis Languages
- Domain-Specific Graph Languages
- MPI-based Libraries

GraphX Implementation, Limitations, Future Work
See our paper and poster!

Questions?

GraphX Implementation (Non-iterative Version)
1. Graph EdgesRDD, rows (v_i.ID, v_j.ID, e_k) -> mapReduceTriplets (join + map + reduceBy) -> MessagesRDD, rows (v_i.ID, aggrMsg_i)
2. VerticesRDD, rows (v_i.ID, v_i.val), join MessagesRDD -> VerticesMsgsRDD, rows (v_i.ID, v_i.val, aggrMsg_i)
3. VerticesMsgsRDD -> map -> NewVerticesRDD, rows (v_i.ID, v_i.newval)
4. Replace VerticesRDD with NewVerticesRDD.
[Figure: example with vertices v1 to v4 and edges v1->v2, v1->v3, v2->v3, v3->v1, v4->v2, v4->v1]

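The join + map + reduceBy pipeline can be imitated with ordinary Java maps standing in for the RDDs. This is only a sketch of the dataflow, not GraphX code: the `DataflowSketch` class, its method names, and the numeric vertex values are made up for illustration, messages are assumed to flow from edge source to edge destination, and a vertex that receives no message is assumed to keep its old value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

// Hypothetical single-machine imitation of the non-iterative dataflow.
class DataflowSketch {
    // Runs one round and returns the table that replaces the vertices table.
    static Map<Integer, Double> round(
            Map<Integer, Double> vertices,               // vertex ID -> value
            List<int[]> edges,                           // {source ID, destination ID}
            BinaryOperator<Double> reduce,               // combines messages per vertex
            BiFunction<Double, Double, Double> update) { // (old value, aggregate) -> new value
        // join + map + reduceBy: look up the source value for each edge,
        // emit it keyed by destination, and reduce per destination.
        Map<Integer, Double> messages = new HashMap<>();
        for (int[] e : edges) {
            double srcVal = vertices.get(e[0]);
            messages.merge(e[1], srcVal, reduce);
        }
        // join + map: pair each vertex with its aggregated message and compute the new value.
        Map<Integer, Double> newVertices = new HashMap<>();
        for (Map.Entry<Integer, Double> v : vertices.entrySet()) {
            Double msg = messages.get(v.getKey());
            Double newVal = (msg == null) ? v.getValue() : update.apply(v.getValue(), msg);
            newVertices.put(v.getKey(), newVal);
        }
        return newVertices;
    }

    // Edge list from the slide's figure, with made-up vertex values 1.0 to 4.0.
    static Map<Integer, Double> example() {
        Map<Integer, Double> vertices = new HashMap<>();
        for (int id = 1; id <= 4; id++) vertices.put(id, (double) id);
        List<int[]> edges = Arrays.asList(
            new int[] {1, 2}, new int[] {1, 3}, new int[] {2, 3},
            new int[] {3, 1}, new int[] {4, 2}, new int[] {4, 1});
        // Sum the neighbor values and overwrite the old value with the sum.
        return round(vertices, edges, Double::sum, (oldVal, sum) -> sum);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Returning a fresh map rather than mutating the input mirrors the last step on the slide, where the new vertices table replaces the old one.
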