Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri
Outline • Distributed Graph Processing • Gelly: Batch Graph Processing with Flink • Gelly-Stream: Continuous Graph Processing with Flink
WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
MISCONCEPTION #1 MY GRAPH IS SO BIG, IT DOESN'T FIT ON A SINGLE MACHINE — Big Data Ninja
A SOCIAL NETWORK
YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT
INTERMEDIATE DATA: THE OFTEN-IGNORED ISSUE
▸ Naive Who(m) to Follow:
  ▸ compute a friends-of-friends list per user
  ▸ exclude existing friends
  ▸ rank by common connections
MISCONCEPTION #2 DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE Data Science Rockstar
GRAPHS DON’T APPEAR OUT OF THIN AIR Expectation…
GRAPHS DON’T APPEAR OUT OF THIN AIR Reality!
HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
GRAPH APPLICATIONS ARE DIVERSE ▸ Iterative value propagation ▸ PageRank, Connected Components, Label Propagation ▸ Traversals and path exploration ▸ Shortest paths, centrality measures ▸ Ego-network analysis ▸ Personalized recommendations ▸ Pattern mining ▸ Finding frequent subgraphs
LINEAR ALGEBRA

Adjacency matrix of the example graph (rows = sources, columns = targets):

      1 2 3 4 5
  1 [ 0 0 1 1 0 ]
  2 [ 1 0 0 1 0 ]
  3 [ 0 0 0 0 0 ]
  4 [ 0 1 1 0 1 ]
  5 [ 0 0 1 0 0 ]

- Partition by rows, columns, or blocks
- Efficient representation of non-zero elements
- Algorithms expressed as vector-matrix multiplications
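To make the "efficient representation of non-zero elements" bullet concrete, here is a plain-Python sketch (illustrative only, not any particular system's storage format) that encodes the slide's adjacency matrix in compressed sparse row (CSR) form:

```python
# Illustrative CSR (compressed sparse row) encoding of the slide's 5x5
# adjacency matrix: only the column indices of non-zero entries are stored.
dense = [
    [0, 0, 1, 1, 0],  # vertex 1
    [1, 0, 0, 1, 0],  # vertex 2
    [0, 0, 0, 0, 0],  # vertex 3
    [0, 1, 1, 0, 1],  # vertex 4
    [0, 0, 1, 0, 0],  # vertex 5
]
col_indices, row_ptr = [], [0]
for row in dense:
    col_indices.extend(j for j, x in enumerate(row) if x)
    row_ptr.append(len(col_indices))
# row i's non-zero columns are col_indices[row_ptr[i]:row_ptr[i+1]]
print(col_indices, row_ptr)
```

The 25-entry dense matrix shrinks to 8 column indices plus 6 row offsets; CSR also makes partitioning by rows trivial, since each row is a contiguous slice.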
BREADTH-FIRST SEARCH

Using the adjacency matrix A above, BFS from vertex 1 is a sequence of vector-matrix products:

  Step 1: x = [1 0 0 0 0], xA = [0 0 1 1 0] → frontier {3, 4}
  Step 2: x = [0 0 1 1 0], xA = [0 1 1 0 1] → newly reached {2, 5}
  Step 3: visited = [1 1 1 1 1] → all vertices reached
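The steps above can be reproduced with a small Boolean vector-matrix sketch in plain Python — an illustration of the linear-algebra view, not any particular framework's API (indices 0..4 stand for vertices 1..5):

```python
# BFS expressed as repeated Boolean vector-matrix products on the
# slide's 5-vertex graph; indices 0..4 stand for vertices 1..5.
A = [
    [0, 0, 1, 1, 0],  # 1 -> 3, 4
    [1, 0, 0, 1, 0],  # 2 -> 1, 4
    [0, 0, 0, 0, 0],  # 3 has no out-edges
    [0, 1, 1, 0, 1],  # 4 -> 2, 3, 5
    [0, 0, 1, 0, 0],  # 5 -> 3
]

def bfs_levels(adj, source):
    n = len(adj)
    frontier = [1 if v == source else 0 for v in range(n)]
    visited = frontier[:]
    levels = [[v for v in range(n) if frontier[v]]]
    while any(frontier):
        # next frontier = frontier * A, restricted to unvisited vertices
        nxt = [0] * n
        for u in range(n):
            if frontier[u]:
                for v in range(n):
                    if adj[u][v] and not visited[v]:
                        nxt[v] = 1
        for v in range(n):
            visited[v] |= nxt[v]
        frontier = nxt
        if any(frontier):
            levels.append([v for v in range(n) if frontier[v]])
    return levels

print(bfs_levels(A, 0))  # -> [[0], [2, 3], [1, 4]]
```

Starting from vertex 1 (index 0), the frontier reaches {3, 4} and then {2, 5}, matching the two products on the slide.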
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY (2004–2015)
▸ Iterative value propagation: MapReduce, Pregel, Pegasus, Signal-Collect, PowerGraph, Giraph++
▸ Graph traversals: Tinkerpop
▸ Ego-network analysis: NScale
▸ Pattern mining: Arabesque
PREGEL: THINK LIKE A VERTEX
[figure: each vertex stores its own value and its adjacency list, e.g. vertex 1 → {3, 4}]
PREGEL: SUPERSTEPS

In each superstep, every vertex computes its new value and its outgoing messages from its current value and the messages it received:

(V_i+1, outbox) ← compute(V_i, inbox)

[figure: vertex states in superstep i and superstep i+1]
PAGERANK: THE WORD COUNT OF GRAPH PROCESSING

  VertexID | Out-degree | Transition probability
     1     |     2      |        1/2
     2     |     2      |        1/2
     3     |     0      |         -
     4     |     3      |        1/3
     5     |     1      |         1

PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)
PREGEL EXAMPLE: PAGERANK

void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for
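The compute() loop above can be simulated on a single machine. Below is a plain-Python sketch (dictionaries stand in for vertex state and message inboxes — not the Pregel or Gelly API) running the slide's 5-vertex graph to a fixed point; vertex IDs are shifted to 0-based, so slide vertex k is index k-1:

```python
# Plain-Python simulation of Pregel supersteps for PageRank on the slide's
# 5-vertex graph (IDs shifted to 0-based: slide vertex k is index k-1).
out_edges = {0: [2, 3], 1: [0, 3], 2: [], 3: [1, 2, 4], 4: [2]}
N = len(out_edges)
ranks = {v: 1.0 / N for v in out_edges}

for superstep in range(200):
    # message passing: each vertex sends rank/out-degree along its out-edges
    inbox = {v: [] for v in out_edges}
    for v, targets in out_edges.items():
        for t in targets:
            inbox[t].append(ranks[v] / len(targets))
    # compute(): setValue(0.15/numVertices() + 0.85*sum)
    ranks = {v: 0.15 / N + 0.85 * sum(inbox[v]) for v in out_edges}
```

At the fixed point, vertex 3 (index 2) satisfies the slide's formula: PR(3) = 0.15/N + 0.85 * (0.5*PR(1) + PR(4)/3 + PR(5)).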
SIGNAL-COLLECT

Each superstep is split into a signal phase and a collect phase:

outbox ← signal(V_i)
V_i+1 ← collect(inbox)

[figure: signal and collect phases across supersteps i and i+1]
SIGNAL-COLLECT EXAMPLE: PAGERANK

void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for

void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
GATHER-SUM-APPLY (POWERGRAPH)
[figure: a superstep split into Gather, Sum, and Apply phases]
GSA EXAMPLE: PAGERANK

// compute a partial rank
double gather(source, edge, target):
    return target.value() / target.numEdges()

// combine partial ranks
double sum(rank1, rank2):
    return rank1 + rank2

// update the rank
double apply(sum, currentRank):
    return 0.15 + 0.85*sum
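The same three functions can be sketched in plain Python — an illustrative single-machine stand-in, not the PowerGraph or Gelly API. `in_edges` lists each vertex's in-neighbors for the slide graph, and the apply step uses the 0.15/numVertices() form from the Pregel example for consistency:

```python
# A sketch of GSA supersteps for PageRank: gather a partial rank per
# in-edge, sum the partials with reduce, apply the new rank.
from functools import reduce

out_degree = {0: 2, 1: 2, 2: 0, 3: 3, 4: 1}
in_edges = {0: [1], 1: [3], 2: [0, 3, 4], 3: [0, 1], 4: [3]}

def gsa_superstep(ranks):
    def gather(src):               # partial rank from one in-neighbor
        return ranks[src] / out_degree[src]
    def sum_(a, b):                # combine partial ranks
        return a + b
    def apply(total, _current):    # new rank from the combined sum
        return 0.15 / len(ranks) + 0.85 * total
    return {v: apply(reduce(sum_, map(gather, srcs), 0.0), ranks[v])
            for v, srcs in in_edges.items()}

ranks = {v: 0.2 for v in range(5)}
for _ in range(200):
    ranks = gsa_superstep(ranks)
```

Note the structural difference from Pregel: here a vertex pulls partial values over its in-edges and combines them, instead of receiving pushed messages in an inbox.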
PROBLEMS WITH VERTEX-CENTRIC MODELS
▸ Excessive communication
▸ Worker load imbalance
▸ Global synchronization
▸ High memory requirements
  ▸ inbox/outbox can grow too large
  ▸ overhead for low-degree vertices in GSA
Vertex-Centric Connected Components
‣ Propagate the minimum value through the graph
‣ In each superstep, the value propagates one hop
‣ Requires diameter + 1 supersteps to converge
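A single-machine sketch of this propagation (plain Python, on an illustrative graph — a path of four vertices plus a separate pair) makes the superstep count visible:

```python
# Vertex-centric connected components: each vertex repeatedly adopts the
# minimum label seen in its neighborhood; labels travel one hop per superstep.
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]  # a path 0-1-2-3 and a pair 4-5
neighbors = {v: set() for v in range(6)}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

labels = {v: v for v in neighbors}
supersteps = 0
changed = True
while changed:
    changed = False
    new_labels = dict(labels)  # synchronous update, as in a superstep
    for v, nbrs in neighbors.items():
        m = min([labels[v]] + [labels[n] for n in nbrs])
        if m < new_labels[v]:
            new_labels[v] = m
            changed = True
    labels = new_labels
    supersteps += 1
```

The path 0-1-2-3 has diameter 3, and the loop (including the final no-change pass) runs exactly diameter + 1 = 4 supersteps, as the slide states.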
THINK LIKE A (SUB)GRAPH
- compute() runs on an entire partition
- Information flows freely inside each partition
- Network communication happens between partitions, not vertices
[figure: the example graph split into two partitions]
Subgraph-Centric Connected Components
‣ In each superstep, the value propagates throughout each subgraph
‣ Communication between partitions only
‣ Possibly requires fewer supersteps to converge
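The same idea in a plain-Python sketch (illustrative, with a synchronous boundary exchange) on a 6-vertex path shows the saving: the vertex-centric version would need diameter + 1 = 6 supersteps here, the subgraph-centric one needs 3:

```python
# Subgraph-centric connected components on a 6-vertex path split into two
# partitions: the minimum label reaches a fixpoint inside each partition per
# superstep; only labels on cut edges cross partition boundaries.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
partitions = [{0, 1, 2}, {3, 4, 5}]     # one cut edge: (2, 3)
neighbors = {v: set() for v in range(6)}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

labels = {v: v for v in neighbors}
supersteps = 0
changed = True
while changed:
    changed = False
    # phase 1: propagate to a local fixpoint inside each partition
    for part in partitions:
        local = True
        while local:
            local = False
            for v in part:
                m = min([labels[v]] + [labels[n] for n in neighbors[v] if n in part])
                if m < labels[v]:
                    labels[v] = m
                    local = changed = True
    # phase 2: synchronous exchange of labels across edges (only cut edges
    # can still change anything, since each partition is at a fixpoint)
    snapshot = dict(labels)
    for v in neighbors:
        m = min(snapshot[n] for n in neighbors[v])
        if m < labels[v]:
            labels[v] = m
            changed = True
    supersteps += 1
```

Within one superstep, label 0 floods its whole partition; only the hop across the cut edge (2, 3) costs an extra superstep.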
CAN WE HAVE IT ALL? ▸ Data pipeline integration: built on top of an efficient distributed processing engine ▸ Graph ETL: high-level API with abstractions and methods to transform graphs ▸ Familiar programming model: support popular programming abstractions
Gelly the Apache Flink Graph API
Flink Stack
[figure: Flink's layered stack — libraries (Gelly, Table, CEP, ML, SAMOA, Cascading, Hadoop M/R, Dataflow (WiP)) on top of the DataSet (Java/Scala) and DataStream (Java/Scala) APIs, over the streaming dataflow runtime, deployable locally, remotely, on YARN, or embedded]
Why Graph Processing with Apache Flink? • Native Iteration Operators • DataSet Optimizations • Ecosystem Integration
Flink Iteration Operators
[figure: the two native iteration operators — bulk iterations, where an update function computes a new result from the full iterative state in each pass, and delta iterations, which maintain a workset and a solution set and replace only the state that changed]
Optimization
• the runtime is aware of the iterative execution
• no scheduling overhead between iterations
• caching and state maintenance are handled automatically

Push work "out of the loop": cache loop-invariant data, maintain state as an index
Beyond Iterations • Performance & Scalability • Memory management • Efficient serialization framework • Operations on binary data • Automatic Optimizations • Choose best execution strategy • Cache invariant data
Meet Gelly • Java & Scala Graph APIs on top of Flink • graph transformations and utilities • iterative graph processing • library of graph algorithms • Can be seamlessly mixed with the DataSet Flink API to easily implement applications that use both record-based and graph-based analysis
Hello, Gelly!

Java:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
DataSet<Vertex<Long, Long>> verticesWithMinIds =
    graph.run(new ConnectedComponents(maxIterations));

Scala:

val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))
Graph Methods
• Graph properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getTriplets, getDegrees, ...
• Transformations: map, filter, join, subgraph, union, difference, reverse, undirected
• Mutations: add vertex/edge, remove vertex/edge
Example: mapVertices

// increment each vertex value by one
val graph = Graph.fromDataSet(...)
val updatedGraph = graph.mapVertices(v => v.getValue + 1)

[figure: vertex values 5, 3, 7, 1, 4 become 6, 4, 8, 2, 5]
Example: subgraph

val graph: Graph[Long, Long, Long] = ...

// keep only vertices with positive values
// and only edges with negative values
val subGraph = graph.subgraph(
    vertex => vertex.getValue > 0,
    edge => edge.getValue < 0
)
Neighborhood Methods • Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel graph.reduceOnNeighbors( new MinValue, EdgeDirection.OUT)
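The semantics of this call can be sketched in plain Python (hypothetical vertex values; `MinValue` with `EdgeDirection.OUT` reduces each vertex's out-neighbors' values with min):

```python
# Semantics sketch of graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT):
# per vertex, reduce (here: take the minimum of) its out-neighbors' values.
values = {0: 5, 1: 3, 2: 7, 3: 1, 4: 9}              # hypothetical vertex values
out_edges = {0: [2, 3], 1: [0, 3], 2: [], 3: [1, 2, 4], 4: [2]}

min_of_out_neighbors = {
    v: min((values[t] for t in targets), default=None)
    for v, targets in out_edges.items()
}
print(min_of_out_neighbors)  # vertex 2 has no out-neighbors -> None
```

Each vertex's reduction is independent of the others, which is what lets Gelly evaluate all neighborhoods in parallel.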
Iterative Graph Processing • Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations • vertex-centric • scatter-gather • gather-sum-apply • partition-centric*