
GraphX: Graph Processing in a Distributed Dataflow Framework, OSDI

GraphX: Graph Processing in a Distributed Dataflow Framework, OSDI 2014. Bidyut Hota. Agenda: analytics space background, motivation, goal, approach, optimizations, results, flaws/limitations, questions.


  1. GraphX: Graph Processing in a Distributed Dataflow Framework, OSDI 2014, Bidyut Hota

  2. Agenda • Analytics space background • Motivation • Goal • Approach • Optimizations • Results • Flaws/Limitations • Questions

  3. Real-life Analytics Pipeline • Raw data -> Link table -> PageRank -> Desired results • E.g., Google Knowledge Graph: 570M vertices, 18B edges (as of mid-2017)

  4. Real-life Analytics Pipeline • Raw data -> Link table -> PageRank -> Desired results • Tables

  5. Real-life Analytics Pipeline • Raw data -> Link table -> PageRank -> Desired results • Graphs

  6. Systems landscape

  7. Motivation • Currently, separate systems exist to compute on these data representations. • Ability to combine data sources. • Enhance dataflow frameworks to leverage their inherent positives.

  8. Current drawbacks of dataflow frameworks • Implementing iterative algorithms requires multiple stages of complex joins. • They do not exploit common patterns in graph algorithms, which leaves room for optimization. • Unlike Spark, they offer no fine-grained control over data partitioning.

  9. Current drawbacks of specialized systems • They lack the ability to combine graphs with unstructured or tabular data. • They favor snapshot recovery rather than the lineage-based fault tolerance found in Spark.

  10. What can we leverage? • Immutability of RDDs • Reusing indices across graph and collection views over iterations • Increase in performance

  11. Goal • A general-purpose distributed framework for graph computation • Performance comparable to specialized graph processing systems

  12. Approach • Unify the tabular view and the graph view • Adopt the best ideas from specialized systems • Represent graphs on dataflow frameworks • Optimizations • Develop the GraphX API on top of Spark

  13. Graph approach: PageRank example • E.g., the PageRank algorithm • Graph-parallel abstraction • Define a vertex program • Terminate when vertex programs vote to halt • Figure: PageRank in Pregel
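
The vertex-program idea on this slide can be sketched on a single machine. The following is a minimal Python illustration, not the code from the slide's figure: it assumes a delta-based PageRank variant (each message carries a rank change rather than a full rank), and the function name `pregel_pagerank` and its parameters are hypothetical. A vertex that receives no messages stays halted, and the loop ends once no messages are in flight.

```python
from collections import defaultdict

def pregel_pagerank(edges, num_vertices, damping=0.85, tol=1e-6, max_supersteps=1000):
    """Toy Pregel-style PageRank (delta variant, unnormalized ranks).

    Returns (ranks, supersteps_executed).
    """
    out_neighbors = defaultdict(list)
    for src, dst in edges:
        out_neighbors[src].append(dst)

    rank = [0.0] * num_vertices
    # Superstep 0: every vertex is active and applies the reset term.
    pending = {v: 1.0 - damping for v in range(num_vertices)}
    supersteps = 0

    while pending and supersteps < max_supersteps:
        outbox = defaultdict(float)  # message combiner: sum per destination
        for v, delta in pending.items():
            # Vertex program: apply the combined incoming change.
            rank[v] += delta
            neighbors = out_neighbors.get(v, [])
            if abs(delta) > tol:
                # Still changing: send the change along every out-edge.
                for dst in neighbors:
                    outbox[dst] += delta / len(neighbors)
            # Otherwise the vertex votes to halt by sending nothing.
        # Only vertices that received messages are active next superstep.
        pending = {v: damping * total for v, total in outbox.items()}
        supersteps += 1

    return rank, supersteps

if __name__ == "__main__":
    ranks, steps = pregel_pagerank([(0, 1), (1, 2), (2, 0)], 3)
    print(ranks, steps)
```

On a three-vertex cycle every vertex should converge to a rank close to 1.0, since each vertex has exactly one in-edge from a vertex with out-degree one.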

  14. Approach • GAS (Gather, Apply, Scatter) • How to apply this in dataflow frameworks? • Map, group-by, and join dataflow operators
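
One way to picture how a GAS step maps onto join, map, and group-by is the sketch below. It uses plain Python collections in place of distributed tables; the function `gas_step` and its `send`, `gather`, and `apply` callbacks are hypothetical names chosen for illustration and are not the GraphX API.

```python
def gas_step(vertices, edges, send, gather, apply):
    """One Gather-Apply-Scatter step built from join, map, and group-by.

    vertices: dict vertex_id -> attribute (the vertex table)
    edges:    list of (src, dst) pairs (the edge table)
    """
    # Join: attach the source vertex attribute to each edge.
    joined = [(src, dst, vertices[src]) for src, dst in edges]

    # Map (scatter): produce one message per edge, keyed by destination.
    messages = [(dst, send(src, src_attr)) for src, dst, src_attr in joined]

    # Group-by + reduce (gather): combine messages per destination vertex.
    gathered = {}
    for dst, msg in messages:
        gathered[dst] = gather(gathered[dst], msg) if dst in gathered else msg

    # Join back + map (apply): update each vertex with its combined message.
    return {v: apply(attr, gathered.get(v)) for v, attr in vertices.items()}

if __name__ == "__main__":
    # Example: propagate the minimum label along directed edges.
    labels = {1: 1, 2: 2, 3: 3}
    links = [(1, 2), (2, 3)]

    def keep_min(old, msg):
        return old if msg is None else min(old, msg)

    for _ in range(2):
        labels = gas_step(labels, links, lambda src, attr: attr, min, keep_min)
    print(labels)  # {1: 1, 2: 1, 3: 1}
```

Vertices that receive no message get `None` in the apply step, so the `apply` callback decides how to treat them.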

  15. Representing property graphs as tables • Never transfer edges

  16. GraphX API

  17. Using the dataflow operators • Logical representation: join of the vertices table with the edges table
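
The logical join on this slide can be illustrated with small in-memory tables. This is a sketch under assumptions: the helper names `triplets` and `older_follower_counts`, the tuple layout, and the sample data are all hypothetical, and a real dataflow engine would run the join over partitioned tables rather than a Python loop.

```python
from collections import Counter

def triplets(vertices, edges):
    """Logical view: edges JOIN vertices ON src JOIN vertices ON dst.

    vertices: dict vertex_id -> attribute
    edges:    list of (src, dst, edge_attr)
    Returns rows of (src, src_attr, dst, dst_attr, edge_attr).
    """
    return [
        (src, vertices[src], dst, vertices[dst], edge_attr)
        for src, dst, edge_attr in edges
    ]

def older_follower_counts(vertices, edges):
    """For each user, count followers who are older than that user."""
    counts = Counter()
    for src, src_age, dst, dst_age, _ in triplets(vertices, edges):
        if src_age > dst_age:
            counts[dst] += 1
    return dict(counts)

if __name__ == "__main__":
    ages = {1: 34, 2: 25, 3: 41}                    # vertex table: id -> age
    follows = [(1, 2, "follows"), (3, 2, "follows"), (2, 1, "follows")]
    print(older_follower_counts(ages, follows))    # {2: 2}
```

Because each row already carries both endpoint attributes, any per-edge computation becomes an ordinary map over the joined rows followed by a group-by on the destination.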

  18. Using the dataflow operators on vertex program • User-defined

  19. Optimizations • Remote caching • Specialized data structure • Vertex-cut partitioning • Active set tracking
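
To make vertex-cut partitioning concrete, here is a minimal sketch that places each edge in exactly one partition and records, per vertex, which partitions need a copy of it. The hash `(src * 31 + dst) % num_partitions` is an arbitrary illustrative choice, not a strategy taken from the slides, and the names `vertex_cut` and `replication_factor` are hypothetical.

```python
from collections import defaultdict

def vertex_cut(edges, num_partitions):
    """Assign every edge to one partition; vertices may span several.

    Returns (edge_partitions, routing_table), where routing_table maps
    each vertex to the set of partitions holding one of its edges.
    """
    edge_partitions = defaultdict(list)
    routing_table = defaultdict(set)
    for src, dst in edges:
        # Illustrative hash of the edge; an assumption, not a real strategy.
        p = (src * 31 + dst) % num_partitions
        edge_partitions[p].append((src, dst))
        routing_table[src].add(p)
        routing_table[dst].add(p)
    return dict(edge_partitions), dict(routing_table)

def replication_factor(routing_table):
    """Average number of partitions that hold a copy of each vertex."""
    return sum(len(parts) for parts in routing_table.values()) / len(routing_table)

if __name__ == "__main__":
    parts, routing = vertex_cut([(0, 1), (0, 2), (0, 3), (1, 2)], 2)
    print(parts)
    print(routing)
    print(replication_factor(routing))
```

A lower replication factor means fewer vertex copies to keep in sync, which is one way to reason about the communication cost of a partitioning choice.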

  20. Implementing Optimizations • Reusable hash index • Sequential scan or clustered scan based on active set (dynamic) • Incremental updates • Automatic join elimination • Additional optimizations: memory-based shuffle, batching and columnar structure, variable-length integer encoding
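
As an illustration of the integer-encoding item, the sketch below shows a common 7-bits-per-byte scheme (often called LEB128 or varint) for unsigned integers. The slides do not say which encoding is used, so this particular scheme and the function names `encode_varint` and `decode_varint` are assumptions.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 payload bits per byte.

    The high bit of each byte signals that more bytes follow.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte
            return bytes(out)

def decode_varint(data, offset=0):
    """Decode one varint from data starting at offset.

    Returns (value, next_offset).
    """
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

if __name__ == "__main__":
    ids = [3, 127, 128, 300, 570_000_000]
    payload = b"".join(encode_varint(i) for i in ids)
    print(len(payload), "bytes instead of", 8 * len(ids))
```

Small values such as local indices or small vertex IDs fit in one or two bytes, which is why this kind of encoding can shrink shuffled data.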

  21. Results

  22. Results • Scaling for PageRank on Twitter dataset • Effect of partitioning on communication

  23. Current Flaws • Is not optimized for dynamic graphs. • Requires incremental updates to routing table. • Is not designed for streaming applications. • Asynchronous graph computation not available. This is where Naiad will outperform.

  24. Questions
