
Batch & Stream Graph Processing with Apache Flink



  1. Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri

  2. Outline • Distributed Graph Processing • Gelly: Batch Graph Processing with Flink • Gelly-Stream: Continuous Graph Processing with Flink

  3. WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

  4. MISCONCEPTION #1: "MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE" (Big Data Ninja)

  5. A SOCIAL NETWORK

  6. YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT

  7. INTERMEDIATE DATA: THE OFTEN ▸ Naive Who(m) to Follow: ▸ compute a friends-of-friends list per user ▸ exclude existing friends ▸ rank by common connections
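The naive steps above can be sketched on a single machine as follows. This is only an illustration under assumptions: the class `WhoToFollow`, its methods, and the small follow graph in `example()` are hypothetical and not from the talk. `intermediatePairs` counts the friend-of-friend pairs produced before any filtering, which is the intermediate data that can outgrow the input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WhoToFollow {
    /** Hypothetical follow graph: user -> set of users they follow. */
    static Map<Integer, Set<Integer>> example() {
        int[][] edges = {{1, 3}, {1, 4}, {2, 1}, {2, 4}, {4, 2}, {4, 3}, {4, 5}, {5, 3}};
        Map<Integer, Set<Integer>> follows = new HashMap<>();
        for (int[] e : edges) {
            follows.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
        }
        return follows;
    }

    /** Number of (user, friend-of-friend) pairs generated before filtering or ranking. */
    static long intermediatePairs(Map<Integer, Set<Integer>> follows) {
        long pairs = 0;
        for (Set<Integer> friends : follows.values()) {
            for (int f : friends) {
                pairs += follows.getOrDefault(f, Collections.emptySet()).size();
            }
        }
        return pairs;
    }

    /** Friends-of-friends of one user, minus existing friends, ranked by common connections. */
    static List<Integer> recommend(Map<Integer, Set<Integer>> follows, int user) {
        Set<Integer> friends = follows.getOrDefault(user, Collections.emptySet());
        Map<Integer, Integer> common = new HashMap<>();
        for (int f : friends) {
            for (int fof : follows.getOrDefault(f, Collections.emptySet())) {
                if (fof != user && !friends.contains(fof)) {
                    common.merge(fof, 1, Integer::sum);
                }
            }
        }
        List<Integer> ranked = new ArrayList<>(common.keySet());
        // Most common connections first; ties broken by smaller user id.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(common.get(b), common.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return ranked;
    }
}
```

In this toy graph the 8 input edges already produce 11 intermediate pairs; on a graph with high-degree users the gap grows much faster, since each user contributes one pair per friend of each friend.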

  8. MISCONCEPTION #2: "DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE" (Data Science Rockstar)

  9. GRAPHS DON’T APPEAR OUT OF THIN AIR Expectation…

  10. GRAPHS DON’T APPEAR OUT OF THIN AIR Reality!

  11. HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

  12. GRAPH APPLICATIONS ARE DIVERSE ▸ Iterative value propagation ▸ PageRank, Connected Components, Label Propagation ▸ Traversals and path exploration ▸ Shortest paths, centrality measures ▸ Ego-network analysis ▸ Personalized recommendations ▸ Pattern mining ▸ Finding frequent subgraphs

  13. LINEAR ALGEBRA
      Adjacency Matrix
            1  2  3  4  5
         1  0  0  1  1  0
         2  1  0  0  1  0
         3  0  0  0  0  0
         4  0  1  1  0  1
         5  0  0  1  0  0
      - Partition by rows, columns, blocks
      - Efficient representation of non-zero elements
      - Algorithms expressed as vector-matrix multiplications
      [Figure: example graph with vertices 1-5]

  14. BREADTH-FIRST SEARCH
      [Figure: example graph with vertices 1-5 and its adjacency matrix]
            1  2  3  4  5
         1  0  0  1  1  0
         2  1  0  0  1  0
         3  0  0  0  0  0
         4  0  1  1  0  1
         5  0  0  1  0  0

  15. BREADTH-FIRST SEARCH [Figure: same graph and adjacency matrix, next animation step]

  16. BREADTH-FIRST SEARCH [Figure: same adjacency matrix in a vector-matrix multiplication; extracted vector entries: 1 0 0 0 0 0 1 1 0 0]

  17. BREADTH-FIRST SEARCH [Figure: same adjacency matrix in a vector-matrix multiplication; extracted vector entries: 0 0 0 0 0 0 0 1 1 1]

  18. BREADTH-FIRST SEARCH [Figure: same adjacency matrix in a vector-matrix multiplication; extracted vector entries: 0 0 0 0 0 1 1 1 1 1]
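The idea of running breadth-first search as repeated vector-matrix multiplications can be sketched as below. This is a minimal single-machine illustration, not the talk's implementation. It assumes that row i, column j of the slides' adjacency matrix marks an edge from vertex i to vertex j (the row sums agree with the out-degrees on the PageRank slides), and it uses 0-based indices, so slide vertex 1 is index 0. The class and method names are illustrative.

```java
import java.util.Arrays;

class MatrixBfs {
    // Adjacency matrix of the 5-vertex example: A[i][j] == 1 means an edge i -> j.
    static final int[][] A = {
        {0, 0, 1, 1, 0},
        {1, 0, 0, 1, 0},
        {0, 0, 0, 0, 0},
        {0, 1, 1, 0, 1},
        {0, 0, 1, 0, 0}
    };

    /** One BFS step: next = frontier x A over the boolean semiring, masked by visited. */
    static boolean[] step(int[][] a, boolean[] frontier, boolean[] visited) {
        int n = a.length;
        boolean[] next = new boolean[n];
        for (int i = 0; i < n; i++) {
            if (!frontier[i]) continue;
            for (int j = 0; j < n; j++) {
                if (a[i][j] != 0 && !visited[j]) next[j] = true;
            }
        }
        return next;
    }

    /** BFS level of every vertex from the source, or -1 if it is unreachable. */
    static int[] levels(int[][] a, int source) {
        int n = a.length;
        int[] level = new int[n];
        Arrays.fill(level, -1);
        boolean[] visited = new boolean[n];
        boolean[] frontier = new boolean[n];
        frontier[source] = true;
        visited[source] = true;
        level[source] = 0;
        for (int depth = 1; ; depth++) {
            boolean[] next = step(a, frontier, visited);
            boolean any = false;
            for (int j = 0; j < n; j++) {
                if (next[j]) {
                    visited[j] = true;
                    level[j] = depth;
                    any = true;
                }
            }
            if (!any) return level; // empty frontier: traversal is complete
            frontier = next;
        }
    }
}
```

Starting from slide vertex 1, the first multiplication reaches vertices 3 and 4, and the second reaches 2 and 5.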

  19. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [Timeline figure, 2004-2015 (marked years: 2004, 2009, 2010, 2012, 2013, 2014, 2015). Systems: MapReduce, Pegasus, Pregel, Signal-Collect, PowerGraph, Tinkerpop, Giraph++, NScale, Arabesque. Categories: Iterative value propagation, Graph Traversals, Ego-network analysis, Pattern Matching]

  20. PREGEL: THINK LIKE A VERTEX [Figure: the example graph next to per-vertex adjacency lists, e.g. 1: 3, 4; 2: 1, 4; ...; 5: 3]

  21. PREGEL: SUPERSTEPS (V_{i+1}, outbox) <- compute(V_i, inbox) [Figure: per-vertex adjacency lists in superstep i and superstep i+1]

  22. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING
      VertexID   Out-degree   Transition Probability
         1           2              1/2
         2           2              1/2
         3           0               -
         4           3              1/3
         5           1               1
      [Figure: example graph with vertices 1-5]

  23. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING
      VertexID   Out-degree   Transition Probability
         1           2              1/2
         2           2              1/2
         3           0               -
         4           3              1/3
         5           1               1
      PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)
      [Figure: example graph with vertices 1-5]

  24. PREGEL EXAMPLE: PAGERANK

        void compute(messages):
            // sum up received messages
            sum = 0.0
            for (m <- messages) do
                sum = sum + m
            end for
            // update vertex rank
            setValue(0.15/numVertices() + 0.85*sum)
            // distribute rank to neighbors
            for (edge <- getOutEdges()) do
                sendMessageTo(edge.target(), getValue()/numEdges)
            end for
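To make the superstep semantics of the pseudocode concrete, here is a minimal single-machine simulation. It is only a sketch under assumptions: the class name, the inbox/outbox lists, and the choice to skip the rank update in superstep 0 (when no messages exist yet) are illustrative and not taken from the talk or from any Pregel implementation. The graph is the 5-vertex example with 0-based ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PregelPageRank {
    // Out-edges of the 5-vertex example graph (0-based): 1->3,4; 2->1,4; 4->2,3,5; 5->3.
    static final int[][] OUT = { {2, 3}, {0, 3}, {}, {1, 2, 4}, {2} };

    static List<List<Double>> emptyBoxes(int n) {
        List<List<Double>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) boxes.add(new ArrayList<>());
        return boxes;
    }

    /** Runs the given number of supersteps and returns the resulting ranks. */
    static double[] run(int[][] out, int supersteps) {
        int n = out.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        List<List<Double>> inbox = emptyBoxes(n);
        for (int step = 0; step < supersteps; step++) {
            List<List<Double>> outbox = emptyBoxes(n);
            for (int v = 0; v < n; v++) {
                // compute(): sum up received messages and update the rank
                // (assumption: superstep 0 only sends, since no messages exist yet).
                if (step > 0) {
                    double sum = 0.0;
                    for (double m : inbox.get(v)) sum += m;
                    rank[v] = 0.15 / n + 0.85 * sum;
                }
                // Distribute the rank evenly over the out-edges.
                for (int target : out[v]) {
                    outbox.get(target).add(rank[v] / out[v].length);
                }
            }
            inbox = outbox; // barrier: messages are delivered in the next superstep
        }
        return rank;
    }
}
```

After two supersteps, the rank of slide vertex 3 equals 0.15/5 + 0.85 * (0.5*PR(1) + PR(4)/3 + PR(5)) with all initial ranks at 1/5, which mirrors the PR(3) formula above. Vertex 3 has no out-edges, so its rank is not redistributed in this sketch.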

  25. SIGNAL-COLLECT outbox <- signal(V_i); V_{i+1} <- collect(inbox) [Figure: superstep i split into a Signal phase and a Collect phase, followed by superstep i+1]

  26. SIGNAL-COLLECT EXAMPLE: PAGERANK

        void signal():
            // distribute rank to neighbors
            for (edge <- getOutEdges()) do
                sendMessageTo(edge.target(), getValue()/numEdges)
            end for

        void collect(messages):
            // sum up received messages
            sum = 0.0
            for (m <- messages) do
                sum = sum + m
            end for
            // update vertex rank
            setValue(0.15/numVertices() + 0.85*sum)

  27. GATHER-SUM-APPLY (POWERGRAPH) [Figure: superstep i split into Gather, Sum, and Apply phases, followed by the Gather phase of superstep i+1]

  28. GSA EXAMPLE: PAGERANK

        // compute partial rank
        double gather(source, edge, target):
            return target.value() / target.numEdges()

        // combine partial ranks
        double sum(rank1, rank2):
            return rank1 + rank2

        // update rank
        double apply(sum, currentRank):
            return 0.15 + 0.85*sum
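The three functions can be exercised with a small single-machine driver, sketched below. The driver (`iterate`) and the edge list are assumptions for illustration and not PowerGraph's or Gelly's API. In particular, the sketch assumes gather runs once per in-edge and reads the rank and out-degree of the neighbor at the other end (the slide's `target`), and it calls apply on every vertex, including those that gathered nothing.

```java
class GsaPageRank {
    // Edges (source, target) of the 5-vertex example graph, 0-based ids.
    static final int[][] EDGES = {
        {0, 2}, {0, 3}, {1, 0}, {1, 3}, {3, 1}, {3, 2}, {3, 4}, {4, 2}
    };

    /** Gather: partial rank contributed by one neighbor over one edge. */
    static double gather(double neighborRank, int neighborOutDegree) {
        return neighborRank / neighborOutDegree;
    }

    /** Sum: associative, commutative combination of two partial ranks. */
    static double sum(double rank1, double rank2) {
        return rank1 + rank2;
    }

    /** Apply: new rank from the combined sum (same constants as the slide). */
    static double apply(double sum, double currentRank) {
        return 0.15 + 0.85 * sum;
    }

    /** One GSA iteration over all edges (hypothetical driver). */
    static double[] iterate(int n, int[][] edges, double[] rank) {
        int[] outDegree = new int[n];
        for (int[] e : edges) outDegree[e[0]]++;
        double[] acc = new double[n]; // 0.0 is the identity element of sum()
        for (int[] e : edges) {
            acc[e[1]] = sum(acc[e[1]], gather(rank[e[0]], outDegree[e[0]]));
        }
        double[] next = new double[n];
        for (int v = 0; v < n; v++) {
            next[v] = apply(acc[v], rank[v]);
        }
        return next;
    }
}
```

Because sum is associative and commutative, the per-edge gather results can be combined in any order and on any worker, which is what allows the edges of a high-degree vertex to be processed in parallel.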

  29. PROBLEMS WITH VERTEX-CENTRIC MODELS ▸ Excessive communication ▸ Worker load imbalance ▸ Global synchronization ▸ High memory requirements ▸ inbox/outbox can grow too large ▸ overhead for low-degree vertices in GSA

  30. Vertex-Centric Connected Components ‣ Propagate the minimum value through the graph ‣ In each superstep, the value propagates one hop ‣ Requires diameter + 1 supersteps to converge
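The one-hop-per-superstep behaviour can be checked with a small synchronous simulation. This is a sketch under assumptions: labels start as the vertex ids, edges are treated as undirected, and the superstep count includes the final superstep in which nothing changes. The class and field names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class VertexCentricCC {
    /** Final component label per vertex and the number of supersteps executed. */
    static final class Result {
        final int[] labels;
        final int supersteps;

        Result(int[] labels, int supersteps) {
            this.labels = labels;
            this.supersteps = supersteps;
        }
    }

    static Result run(int n, int[][] undirectedEdges) {
        List<List<Integer>> nbrs = new ArrayList<>();
        for (int v = 0; v < n; v++) nbrs.add(new ArrayList<>());
        for (int[] e : undirectedEdges) {
            nbrs.get(e[0]).add(e[1]);
            nbrs.get(e[1]).add(e[0]);
        }
        int[] label = new int[n];
        for (int v = 0; v < n; v++) label[v] = v;

        int supersteps = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            supersteps++;
            // Synchronous update: every vertex reads only the labels of the previous superstep.
            int[] next = label.clone();
            for (int v = 0; v < n; v++) {
                for (int u : nbrs.get(v)) {
                    if (label[u] < next[v]) {
                        next[v] = label[u];
                        changed = true;
                    }
                }
            }
            label = next;
        }
        return new Result(label, supersteps);
    }
}
```

On the path 0-1-2-3-4 (diameter 4) the minimum label 0 needs four supersteps to reach vertex 4, and a fifth superstep detects that nothing changed, which gives diameter + 1.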

  31. THINK LIKE A (SUB)GRAPH - compute() on the entire partition - Information flows freely inside each partition - Network communication between partitions, not vertices [Figure: the example graph split into partitions]

  32. Subgraph-Centric Connected Components ‣ In each superstep, the value propagates throughout each subgraph ‣ Communication between partitions only ‣ Requires (possibly) fewer supersteps to converge
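A minimal single-machine sketch of the subgraph-centric variant follows. It is an illustration under assumptions, not the Giraph++ or Gelly implementation: the `partition` array assigning each vertex to a partition, the inbox representation, and the convention of counting the final no-change superstep are all hypothetical choices. Within a superstep, labels propagate along internal edges until the partition is locally stable; only cut edges carry messages to the next superstep.

```java
import java.util.Arrays;

class SubgraphCentricCC {
    /** Returns { labels, { supersteps } } for min-label propagation on an undirected graph. */
    static int[][] run(int n, int[][] edges, int[] partition) {
        int[] label = new int[n];
        for (int v = 0; v < n; v++) label[v] = v;
        int[] inbox = new int[n];
        Arrays.fill(inbox, Integer.MAX_VALUE);

        int supersteps = 0;
        boolean changed = true;
        while (changed) {
            supersteps++;
            changed = false;
            // 1. Deliver messages sent across partition boundaries in the previous superstep.
            for (int v = 0; v < n; v++) {
                if (inbox[v] < label[v]) {
                    label[v] = inbox[v];
                    changed = true;
                }
            }
            // 2. compute() on each partition: relax internal edges until locally stable.
            boolean localChange = true;
            while (localChange) {
                localChange = false;
                for (int[] e : edges) {
                    if (partition[e[0]] != partition[e[1]]) continue;
                    int min = Math.min(label[e[0]], label[e[1]]);
                    if (label[e[0]] != min || label[e[1]] != min) {
                        label[e[0]] = min;
                        label[e[1]] = min;
                        localChange = true;
                        changed = true;
                    }
                }
            }
            // 3. Send current labels over cut edges; they arrive in the next superstep.
            Arrays.fill(inbox, Integer.MAX_VALUE);
            for (int[] e : edges) {
                if (partition[e[0]] == partition[e[1]]) continue;
                inbox[e[1]] = Math.min(inbox[e[1]], label[e[0]]);
                inbox[e[0]] = Math.min(inbox[e[0]], label[e[1]]);
            }
        }
        return new int[][] { label, { supersteps } };
    }
}
```

On the path 0-1-2-3-4 split into partitions {0, 1, 2} and {3, 4}, this sketch finishes in 3 supersteps, while a synchronous vertex-centric propagation on the same path needs 5. With one vertex per partition the two behave the same, which is why the saving is only possible, not guaranteed.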

  33. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [Timeline figure, 2004-2015 (marked years: 2004, 2009, 2010, 2012, 2013, 2014, 2015). Systems: MapReduce, Pegasus, Pregel, Signal-Collect, PowerGraph, Tinkerpop, Giraph++, NScale, Arabesque. Categories: Iterative value propagation, Graph Traversals, Ego-network analysis, Pattern Matching]

  34. CAN WE HAVE IT ALL? ▸ Data pipeline integration: built on top of an efficient distributed processing engine ▸ Graph ETL: high-level API with abstractions and methods to transform graphs ▸ Familiar programming model: support popular programming abstractions

  35. Gelly: the Apache Flink Graph API

  36. Flink Stack [Figure: layered stack. Libraries and compatibility layers: Gelly, Table, ML, CEP, SAMOA, Cascading, Hadoop M/R, Dataflow (WiP). APIs: DataSet (Java/Scala), DataStream (Java/Scala). Engine: Streaming dataflow runtime. Deployment: Local, Remote, Yarn, Embedded]

  37. Why Graph Processing with Apache Flink? • Native Iteration Operators • DataSet Optimizations • Ecosystem Integration

  38. Flink Iteration Operators [Figure: two iteration diagrams. One with Input, Iterative Update Function, Replace, Result; one with Workset, Solution Set, State, Iterative Update Function, Result]

  39. Optimization • the runtime is aware of the iterative execution • no scheduling overhead between iterations • caching and state maintenance are handled automatically [Figure labels: Cache loop-invariant data; Push work “out of the loop”; Maintain state as index]

  40. Beyond Iterations • Performance & Scalability: memory management, efficient serialization framework, operations on binary data • Automatic Optimizations: choose best execution strategy, cache invariant data

  41. Meet Gelly • Java & Scala Graph APIs on top of Flink: graph transformations and utilities, iterative graph processing, library of graph algorithms • Can be seamlessly mixed with the Flink DataSet API to easily implement applications that use both record-based and graph-based analysis

  42. Hello, Gelly!

      Java

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
        Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
        DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
            new ConnectedComponents(maxIterations));

      Scala

        val env = ExecutionEnvironment.getExecutionEnvironment
        val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
        val graph = Graph.fromDataSet(edges, env)
        val components = graph.run(new ConnectedComponents(maxIterations))

  43. Graph Methods
      Transformations: map, filter, join; subgraph, union, difference; reverse, undirected; getTriplets
      Graph Properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getDegrees, ...
      Mutations: add vertex/edge, remove vertex/edge

  44. Example: mapVertices

        val graph = Graph.fromDataSet(...)
        // increment each vertex value by one
        val updatedGraph = graph.mapVertices(v => v.getValue + 1)

      [Figure: vertex values before and after the mapping, e.g. 3 becomes 4 and 7 becomes 8]

  45. Example: subGraph

        val graph: Graph[Long, Long, Long] = ...
        // keep only vertices with positive values
        // and only edges with negative values
        val subGraph = graph.subgraph(
          vertex => vertex.getValue > 0,
          edge => edge.getValue < 0)

  46. Neighborhood Methods • Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel

        graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT)

  47. Iterative Graph Processing • Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations: vertex-centric, scatter-gather, gather-sum-apply, partition-centric*
