graph analytics using vertica relational database

Graph Analytics using Vertica Relational Database Meichun Hsu - PowerPoint PPT Presentation



  1. Graph Analytics using Vertica Relational Database • Alekh Jindal* (Microsoft), Samuel Madden (MIT), Malú Castellanos (Vertica), Meichun Hsu (Vertica) • * work done while at MIT

  2. Motivation for graphs on DB
 • Data is already in a DB anyway
 - avoid expensive copying
 - end-to-end data analysis
 - leverage other DB features
 • Processing involves full scans and joins
 - relational engines could run them efficiently
 - particularly suited for column stores
 • Relational algebra/SQL offers powerful declarative syntax
 - in fact, we could express Giraph as an operator DAG
 - can even express more complex graph analytics

  3. 5-point Agenda
 • From graph queries to SQL: how do we make the translation?
 • Graph query optimization: can we leverage decades of relational wisdom?
 • Column store backends: why are they a good choice?
 • Comparison with specialized graph systems: how do the numbers look?
 • Extending column stores: can we do better?

  4. 1. From Graph to SQL

  5. Vertex-centric Graph Queries • Popular language for graph analytics • Vertex programs that run in supersteps and communicate via message passing [Figure: example graph with vertices 1 to 5]

  6. Vertex-centric Graph Queries (same text as slide 5) [Figure: animation frame of the example graph; vertices annotated with values 0 and inf]

  7. Vertex-centric Graph Queries (same text as slide 5) [Figure: animation frame of the example graph; vertex values partly updated]

  8. Vertex-centric Graph Queries (same text as slide 5) [Figure: animation frame of the example graph; vertex values partly updated]

  9. Vertex-centric Graph Queries (same text as slide 5) [Figure: animation frame of the example graph; vertex values 0, 1, 1, 2, 2]

  10. Vertex-centric Graph Queries • Popular language for graph analytics • Vertex programs that run in supersteps and communicate via message passing • Programmer only specifies a vertex program • System takes care of running it in parallel [Figure: example graph with final vertex values]
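
A minimal Python sketch of the vertex-centric model described above, using unit-weight single-source shortest paths as the vertex program. The edge list, the names `vertex_compute` and `run_supersteps`, and the single-process message loop are assumptions made for illustration; a system such as Giraph would run the vertex program in parallel across workers.

```python
import math
from collections import defaultdict

# Hypothetical 5-vertex directed graph; the edge list is an assumption,
# it is not recoverable from the slides.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]


def vertex_compute(value, messages):
    """Vertex program for unit-weight SSSP: keep the smallest distance seen."""
    return min([value] + messages)


def run_supersteps(edges, source):
    """Run the vertex program in supersteps until no messages are in flight."""
    out_edges = defaultdict(list)
    vertices = set()
    for src, dst in edges:
        out_edges[src].append(dst)
        vertices.update((src, dst))

    values = {v: math.inf for v in vertices}
    inbox = {source: [0]}  # the source starts with distance 0
    supersteps = 0
    while inbox:
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            new_value = vertex_compute(values[v], messages)
            if new_value < values[v]:
                values[v] = new_value
                # Message passing: tell each out-neighbour the candidate distance.
                for dst in out_edges[v]:
                    outbox[dst].append(new_value + 1)
        inbox = outbox  # messages are delivered in the next superstep
        supersteps += 1
    return values, supersteps


distances, steps = run_supersteps(EDGES, source=1)
print(distances, steps)
```

Only `vertex_compute` is specific to the algorithm; the loop around it plays the role of the system.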

  11. The Giraph Plan • Giraph: a popular, open-source graph analytics system on Hadoop

  12. The Giraph Plan • Giraph: a popular, open-source graph analytics system on Hadoop • The Giraph physical plan: hard coded physical execution pipeline [Figure: Giraph physical plan. Input superstep: the graph G=(V,E) is read from HDFS (split, scan, RecRead) and shuffled to workers W1 to W4. Supersteps: each worker holds server data (partition store, edge store, message store) and runs vertexCompute, followed by a shuffle and master synchronization. Output superstep: cleanup and store of G'=(V',E') to HDFS.]

  13. The Giraph Plan • Giraph: a popular, open-source graph analytics system on Hadoop • The Giraph physical plan: hard coded physical execution pipeline • Giraph logical query plan using relational operators [Figure: logical plan. Vertices V are joined with edges E (V.id=E.from) and messages M (V.id=M.to), grouped (γ) and passed to vertexCompute, which produces modified vertices V' and new messages M' (V' ∪ M').]

  14. Rewriting Logical Giraph Plan
 • Plan 1: Giraph logical query plan
 • Plan 2: Pushing down the vertexCompute UDF
 • Plan 3: Replacing M by V ⋈ E
 [Figure: three operator trees over V, E and M built from joins (V.id=E.from, V.id=M.to; in plan 3, V1.id=E.to and V2.id=E.from), group-by γ and vertexCompute, producing V' ∪ M']

  15. Rewriting Logical Giraph Plan
 • Plan 1: Giraph logical query plan
 • Plan 2: Pushing down the vertexCompute UDF
 • Plan 3: Replacing M by V ⋈ E
 • Example: Single Source Shortest Path
 - join V1 ⋈ E ⋈ V2 on V1.id=E.to and V2.id=E.from
 - aggregate Γ: d'=min(V2.d+1)
 - select σ: d'<V1.d
 [Figure: the three operator trees of slide 14 plus the SSSP-specific plan]
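
The SSSP plan above (join, aggregate d'=min(V2.d+1), select d'<V1.d) can be sketched as one SQL query per iteration. This is a minimal illustration using Python's built-in `sqlite3` module rather than Vertica; the table and column names, the example edge list and the `INF` sentinel for unreached vertices are assumptions of the sketch.

```python
import sqlite3

INF = 10**9  # sentinel for "not reached yet" (assumption of this sketch)
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]  # hypothetical example graph

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, d INTEGER)")
conn.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
conn.executemany(
    "INSERT INTO vertex VALUES (?, ?)",
    [(v, 0 if v == 1 else INF) for v in range(1, 6)],  # vertex 1 is the source
)
conn.executemany("INSERT INTO edge VALUES (?, ?)", EDGES)

# Join V1 with E and V2, aggregate min(V2.d + 1) per V1, keep improvements only.
SSSP_STEP = """
SELECT v1.id, MIN(v2.d + 1) AS d_new
FROM vertex AS v1
JOIN edge AS e ON v1.id = e.to_node
JOIN vertex AS v2 ON v2.id = e.from_node
GROUP BY v1.id, v1.d
HAVING MIN(v2.d + 1) < v1.d
"""


def sssp(conn):
    """Repeat the relational step until no vertex improves; return iteration count."""
    iterations = 0
    while True:
        updates = conn.execute(SSSP_STEP).fetchall()
        if not updates:
            return iterations
        conn.executemany(
            "UPDATE vertex SET d = ? WHERE id = ?", [(d, v) for v, d in updates]
        )
        iterations += 1


iterations = sssp(conn)
distances = dict(conn.execute("SELECT id, d FROM vertex").fetchall())
print(distances, iterations)
```

The `HAVING` clause plays the role of the selection σ, so each iteration returns only the vertices whose distance actually shrinks.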

  16. Rewriting Logical Giraph Plan
 • Plan 1: Giraph logical query plan
 • Plan 2: Pushing down the vertexCompute UDF
 • Plan 3: Replacing M by V ⋈ E
 • Example: Connected Components
 - join V1 ⋈ E ⋈ V2 on V1.id=E.to and V2.id=E.from
 - aggregate Γ: cc'=min(V2.id)
 - select σ: cc'<V1.cc
 [Figure: the three operator trees of slide 14 plus the connected-components plan]
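
A minimal sketch of the connected-components plan as min-label propagation, again using Python's `sqlite3` module as a stand-in for the database. Two assumptions differ from the slide text and are labeled in the code: the aggregate propagates the neighbour's current label `v2.cc` (initialized to the vertex id) so that labels can travel more than one hop, and every undirected edge is stored in both directions. Table names and the example graph are hypothetical.

```python
import sqlite3

# Hypothetical undirected graph with two components: {1, 2, 3} and {4, 5}.
UNDIRECTED_EDGES = [(1, 2), (2, 3), (4, 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, cc INTEGER)")
conn.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
# Assumption: each vertex starts in its own component, labeled by its id.
conn.executemany("INSERT INTO vertex VALUES (?, ?)", [(v, v) for v in range(1, 6)])
# Assumption: store both directions so labels flow across undirected edges.
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    UNDIRECTED_EDGES + [(b, a) for a, b in UNDIRECTED_EDGES],
)

# Aggregate the smallest neighbouring label and keep it only if it is smaller.
CC_STEP = """
SELECT v1.id, MIN(v2.cc) AS cc_new
FROM vertex AS v1
JOIN edge AS e ON v1.id = e.to_node
JOIN vertex AS v2 ON v2.id = e.from_node
GROUP BY v1.id, v1.cc
HAVING MIN(v2.cc) < v1.cc
"""


def connected_components(conn):
    """Propagate minimum labels until a fixpoint; return the number of rounds."""
    rounds = 0
    while True:
        updates = conn.execute(CC_STEP).fetchall()
        if not updates:
            return rounds
        conn.executemany(
            "UPDATE vertex SET cc = ? WHERE id = ?", [(cc, v) for v, cc in updates]
        )
        rounds += 1


rounds = connected_components(conn)
components = dict(conn.execute("SELECT id, cc FROM vertex").fetchall())
print(components, rounds)
```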

  17. Rewriting Logical Giraph Plan
 • Plan 1: Giraph logical query plan
 • Plan 2: Pushing down the vertexCompute UDF
 • Plan 3: Replacing M by V ⋈ E
 • Example: PageRank
 - join V1 ⋈ E ⋈ V2 on V1.id=E.to and V2.id=E.from
 - aggregate Γ: V1.r=0.15/n+0.85*sum(V2.r/V2.outD)
 [Figure: the three operator trees of slide 14 plus the PageRank plan]
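
A minimal sketch of the PageRank aggregate V1.r=0.15/n+0.85*sum(V2.r/V2.outD) as a repeated SQL query, using Python's `sqlite3` module in place of the database. The table names, the three-vertex example graph and the fixed iteration count are assumptions; the sketch also assumes every vertex has at least one in-edge and one out-edge, since the inner joins would otherwise drop vertices.

```python
import sqlite3

# Hypothetical graph in which every vertex has in-edges and out-edges.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 1)]
N = 3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, r REAL, outd INTEGER)")
conn.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
conn.executemany("INSERT INTO edge VALUES (?, ?)", EDGES)
for v in range(1, N + 1):
    out_degree = sum(1 for src, _ in EDGES if src == v)
    conn.execute("INSERT INTO vertex VALUES (?, ?, ?)", (v, 1.0 / N, out_degree))

# One PageRank iteration: sum the rank shares arriving over incoming edges.
PAGERANK_STEP = """
SELECT v1.id, 0.15 / :n + 0.85 * SUM(v2.r / v2.outd) AS r_new
FROM vertex AS v1
JOIN edge AS e ON v1.id = e.to_node
JOIN vertex AS v2 ON v2.id = e.from_node
GROUP BY v1.id
"""


def pagerank(conn, n, iterations):
    """Apply the relational PageRank step a fixed number of times."""
    for _ in range(iterations):
        # Read all new ranks first, then write them, so updates are synchronous.
        new_ranks = conn.execute(PAGERANK_STEP, {"n": n}).fetchall()
        conn.executemany(
            "UPDATE vertex SET r = ? WHERE id = ?", [(r, v) for v, r in new_ranks]
        )
    return dict(conn.execute("SELECT id, r FROM vertex").fetchall())


ranks = pagerank(conn, N, iterations=100)
print(ranks)
```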

  18. Rewriting Logical Giraph Plan
 • Panels: Giraph logical query plan; Pushing down the vertexCompute UDF; Replacing M by V ⋈ E; vertexCompute UDF as Table UDF; Replacing join with union
 • The Table UDF variants run an unmodified vertex compute program, e.g. SGD (captions: "Unmodified Vertex Compute Program, e.g. SGD" and "Optimized Unmodified Vertex Compute Program")
 • PageRank aggregate shown for comparison: V1.r=0.15/n+0.85*sum(V2.r/V2.outD)
 [Figure: five operator trees; the Table UDF plans sort their input on V.pid before calling vertexCompute, and the union variant combines V, E and M with a union instead of joins]

  19. 2. Graph Query Optimization

  20. Leveraging Relational Query Optimizers
 • Multiple rule- or cost-based query rewritings are possible; pick the best one using an optimizer
 • No hard-coded physical execution plan
 • Several new optimizations proposed:
 - update vs replace
 - incremental evaluation
 - join elimination
 - alternate direction graph exploration

  21. Inner Join Update
 • Flow: Input → SSSP → Output → Inner Join → Updated Input
 • Output (node: value): 2: 1, 3: 1
 • Updated input (node: value): 1: 0, 2: 1, 3: 1, 4: inf, 5: inf
 • Good for small number of updates!
 [Figure: example graph with vertices 1 to 5]

  22. Outer Join Replace
 • Flow: Input → SSSP → Output → Outer Join → New Input
 • Input (node: value): 1: 0, 2: inf, 3: inf, 4: inf, 5: inf
 • Output (node: value): 2: 1, 3: 1
 • New input (node: value): 1: 0, 2: 1, 3: 1, 4: inf, 5: inf
 • Good for bulk updates!
 [Figure: example graph with vertices 1 to 5]
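
A minimal sketch contrasting the two strategies on the tables from slides 21 and 22, using Python's `sqlite3` module. The table names, the `INF` sentinel and the helper function names are assumptions. The in-place variant is written as a correlated `UPDATE`, which touches the same rows an inner join with the output would match; the replace variant builds a whole new table with a left outer join.

```python
import sqlite3

INF = 10**9  # sentinel standing in for "inf" (assumption of this sketch)


def make_db():
    """Create the input table and the SSSP output from slides 21 and 22."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.execute("CREATE TABLE output (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(1, 0), (2, INF), (3, INF), (4, INF), (5, INF)],
    )
    conn.executemany("INSERT INTO output VALUES (?, ?)", [(2, 1), (3, 1)])
    return conn


def update_in_place(conn):
    """Update only the rows that have a match in the output (few changes)."""
    conn.execute(
        """UPDATE node
           SET value = (SELECT o.value FROM output AS o WHERE o.id = node.id)
           WHERE id IN (SELECT id FROM output)"""
    )
    return conn.execute("SELECT id, value FROM node ORDER BY id").fetchall()


def replace_table(conn):
    """Rebuild the whole table with an outer join (bulk changes)."""
    conn.execute(
        """CREATE TABLE node_new AS
           SELECT n.id, COALESCE(o.value, n.value) AS value
           FROM node AS n LEFT JOIN output AS o ON n.id = o.id"""
    )
    conn.execute("DROP TABLE node")
    conn.execute("ALTER TABLE node_new RENAME TO node")
    return conn.execute("SELECT id, value FROM node ORDER BY id").fetchall()


updated = update_in_place(make_db())
replaced = replace_table(make_db())
print(updated == replaced)
```

Both paths end in the same table; which one is cheaper depends on how many rows change per iteration.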

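  23. Incremental Computation
 • Flow: Inc. Input → SSSP → Output, which becomes the New Inc. Input; Input and Output → Outer Join → New Input
 • Inc. input (node: value): 1: 0
 • Output and new inc. input (node: value): 2: 1, 3: 1
 • Input (node: value): 1: 0, 2: inf, 3: inf, 4: inf, 5: inf
 • New input (node: value): 1: 0, 2: 1, 3: 1, 4: inf, 5: inf
 [Figure: example graph with vertices 1 to 5]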

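  24. Incremental Computation
 • Flow: Inc. Input → SSSP → Output, which becomes the New Inc. Input; Input and Output → Outer Join → New Input
 • Inc. input (node: value): 2: 1, 3: 1
 • Output and new inc. input (node: value): 4: 2, 5: 2
 • Input (node: value): 1: 0, 2: 1, 3: 1, 4: inf, 5: inf
 • New input (node: value): 1: 0, 2: 1, 3: 1, 4: 2, 5: 2
 • Faster Iteration Runtime!
 [Figure: example graph with vertices 1 to 5]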
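
A minimal sketch of incremental evaluation for SSSP, using Python's `sqlite3` module: each iteration joins only the vertices that changed in the previous iteration (the "Inc. Input", here a `delta` table) with the edges, instead of scanning all vertices. The table names, the `INF` sentinel and the example edge list are assumptions, and the merge into the full table is done with a plain `UPDATE` for brevity.

```python
import sqlite3

INF = 10**9  # sentinel standing in for "inf" (assumption of this sketch)
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]  # hypothetical example graph

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, d INTEGER)")
conn.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
conn.execute("CREATE TABLE delta (id INTEGER PRIMARY KEY, d INTEGER)")
conn.executemany(
    "INSERT INTO vertex VALUES (?, ?)",
    [(v, 0 if v == 1 else INF) for v in range(1, 6)],
)
conn.executemany("INSERT INTO edge VALUES (?, ?)", EDGES)
conn.execute("INSERT INTO delta VALUES (1, 0)")  # initially only the source changed

# Only out-edges of recently changed vertices can produce new improvements.
INCREMENTAL_STEP = """
SELECT e.to_node, MIN(dl.d + 1) AS d_new
FROM delta AS dl
JOIN edge AS e ON dl.id = e.from_node
JOIN vertex AS v ON v.id = e.to_node
GROUP BY e.to_node, v.d
HAVING MIN(dl.d + 1) < v.d
ORDER BY e.to_node
"""


def run_incremental(conn):
    """Iterate on the delta table only; return the output of every iteration."""
    history = []
    while True:
        output = conn.execute(INCREMENTAL_STEP).fetchall()
        if not output:
            return history
        history.append(output)
        conn.executemany(
            "UPDATE vertex SET d = ? WHERE id = ?", [(d, v) for v, d in output]
        )
        # The output of this iteration becomes the next incremental input.
        conn.execute("DELETE FROM delta")
        conn.executemany("INSERT INTO delta VALUES (?, ?)", output)


history = run_incremental(conn)
final = dict(conn.execute("SELECT id, d FROM vertex").fetchall())
print(history, final)
```

On this example graph the per-iteration outputs match the "Output" tables on slides 23 and 24.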

  25. 3. Column Store Backends

  26. Why could column stores be a good choice?
 • Modern column stores provide several features
 - physical design
 - join optimizations
 - query pipelining
 - intra-query parallelism
 • For more details, pick your favorite column store papers:
 - MonetDB [Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct, Peter A. Boncz et al., PVLDB 2009.]
 - C-Store [C-Store: A Column-oriented DBMS, Mike Stonebraker et al., VLDB 2005.]
 - Vertica [The Vertica Analytic Database: C-Store 7 Years Later, Andrew Lamb et al., VLDB 2012.]

  27. Illustration: Vertica Query Plan for SSSP
 • Early filtering using sideways information passing
 • Fully pipelined query execution
 • Picks the right join strategies, e.g. broadcast
 [Figure: Vertica query plan on node[3]=node3 (executor). Operators as labeled in the plan: Root OutBlk=[UncTuple(2)]; NewEENode OutBlk=[UncTuple(2)]; ExprEval: e.to_node, <SVAR>; Recv from: node0,node1,node2,node3; Send to: node0; FilterStep: (<SVAR> < <SVAR>); GroupByPipe: 1 keys, Aggs: min((n1.value + 1)), min(n2.value) (appears twice); StorageMergeStep: twitter_edge, 1 sorted; ExprEval: e.to_node, (n1.value + 1), n2.value; Join: Merge-Join using previous join and twitter_node_b0; Join: Hash-Join using twitter_edge and twitter_node_b0; StorageMergeStep: twitter_node_b0, 1 sorted; ScanStep: twitter_edge with SIP2(HashJoin): e.from_node and SIP1(MergeJoin): e.to_node, columns to_node (not emitted), from_node; ScanStep: twitter_node_b0, columns id, value; Recv from: node0,node1,node2,node3; Send to: node0,node1,node2,node3; StorageUnionStep: twitter_node_b0; ScanStep: twitter_node_b0, columns id, value]

  28. 4. Comparison with Specialized Graph Systems
