mit and they are growing bigger x x
play

MIT And they are growing bigger X X -> DATA DATABASE - PowerPoint PPT Presentation

Graphs On Databases at Talking on NEDB 2014 Alekh Jindal collaborate Supervisors Amol Deshpande Sam Madden work sabbatical Mike Stonebraker work University work of Maryland MIT And they are growing bigger X X -> DATA


  1. Graphs On Databases at Talking on NEDB 2014 Alekh Jindal collaborate Supervisors Amol Deshpande Sam Madden work sabbatical Mike Stonebraker work University work of Maryland MIT

  2. And they are growing bigger

  3. X X -> DATA DATABASE

  4. DATABASE Relational -> DATA Relational

  5. DATABASE Streaming -> DATA Streaming

  6. XML XML -> DATA DATABASE

  7. RDF RDF -> DATA DATABASE

  8. Graph Graph -> DATA DATABASE

  9. DATABASE DATA

  10. APPLICATIONS Logical Data Independence DATABASE Physical Data Independence DATA

  11. Barriers to “Graphs on Databases” • Graphs in relational model • Graph operations in SQL • Expressing iterative graph queries • Efficient graph analytics performance • Ease-of-use

  12. Graphs in Relational Model id value Nodes 1 1 4 2 1 5 3 1 4 1 2 5 1 1 fromId toId weight 3 Edges t 1 2 1 2 4 1 2 5 1 3 2 1 5 1 1

  13. Graphs Operations in SQL • Node access 
 4 Select * From Nodes Where Id=ID 5 • Neighborhood access 
 Select * From Edges Where fromId=ID 2 1 • Parallel neighborhood access 
 3 Select * From Edges Group By fromId • 1-hop neighbors 
 Select * From Edges e1,Edges e2 Where e1.toId=e2.fromId

  14. Example: Shortest Paths UPDATE Nodes AS node SET value= new_node.value FROM( SELECT e.toId AS Id, min(n1.value+1) AS value FROM Nodes AS n1, Edges AS e, Nodes AS n2 WHERE n1.Id=e.fromId AND n2.Id=e.toId GROUP BY e.toId, n2.value HAVING min(n1.value+1) < n2.value ) AS new_node WHERE node.Id = new_node.Id;

  15. Example: Shortest Paths UPDATE Nodes AS node SET value= new_node.value FROM( Nested Query SELECT e.toId AS Id, min(n1.value+1) AS value FROM Nodes AS n1, Edges AS e, Nodes AS n2 WHERE n1.Id=e.fromId AND n2.Id=e.toId Parallel GROUP BY e.toId, n2.value Graph Exploration HAVING min(n1.value+1) < n2.value ) AS new_node WHERE node.Id = new_node.Id; Sorting/Indexing

  16. Iterative Graph Queries • Driver program: 
 4 UDF / Stored Procedure 5 • Three Things: 
 2 - initialization 
 1 - actual graph query (in a loop) 
 3 - termination condition

  17. Example: Shortest Paths Initialization: 1. Set the value of start node to 0 2. Set the value of all other node to inf Loop: UPDATE Nodes AS node SET value=new_node.value FROM( SELECT e.toId AS Id, min(n1.value+1) AS value The shortest paths SQL FROM Nodes AS n1, Edges AS e, Nodes AS n2 WHERE n1.Id=e.fromId AND n2.Id=e.toId GROUP BY e.toId, n2.value HAVING min(n1.value+1) < n2.value ) AS new_node WHERE node.Id = new_node.Id; Termination Condition: No more nodes to Update

  18. Efficient Graph Analytics • Three SQL Databases: 
 - row store 
 - column store 
 - main-memory store • Two Graph Databases: 
 - transactional graph database 
 - graph analytics system • Two queries: PageRank, Shortest Paths • Social network dataset from snap.stanford.edu/data

  19. PageRank 10000 Graph Database 4,172.4 Main-memory Database Row Store Database Apache Giraph Column Store Database 1000 589.0 Time (seconds) 218.1 101.5 100 53.5 47.0 29.4 28.0 17.4 10 4.2 3.3 1 Twitter GPlus LiveJournal

  20. Shortest Paths 100000 Graph Database Main-memory Database 18,702.2 Row Store Database Apache Giraph 10000 Column Store Database Time (seconds) 1000 492.9 395.6 135.1 115.5 100 74.5 50.8 43.7 29.1 9.2 10 6.7 1 Twitter GPlus LiveJournal

  21. Ease-of-Use SQL Pregel void compute(vector<float> messages){ UPDATE Nodes AS node SET value=new_node.value // get the minimum distance FROM( SELECT e.toId AS Id, min(n1.value+1) AS value float mindist = id==START_NODE ? 0 : DBL_MAX; FROM Nodes AS n1, Edges AS e, Nodes AS n2 for(vector<float>::iterator it = messages.begin(); WHERE n1.Id=e.fromId AND n2.Id=e.toId it != messages.end(); ++it) GROUP BY e.toId, n2.value mindist = min(mindist,*it); HAVING min(n1.value+1) < n2.value // send messages to all edges if new minimum is found ) AS new_node float vvalue = getVertexValue(); WHERE node.Id = new_node.Id; if(mindist < vvalue){ modifyVertexValue(mindist); vector<int> edges = getOutEdges(); for(vector<int>::iterator it = edges.begin(); it != edges.end(); ++it) sendMessage(*it, mindist+1); } // halt voteToHalt(); }

  22. Ease-of-Use SQL Pregel 4 Nodes Edges 5 fromId toId weight id value 1 2 1 1 1 2 1 2 4 1 2 2 5 1 3 1 3 2 1 4 1 1 5 1 5 1 1 3

  23. Vertex-centric Query Interface Vertex Programs APPLICATION Pregel-style API: - getMessages() 
 - getEdges() 
 - sendMessages() - voteToHalt(), etc. Logical Data Independence Vertex UDF Invokes the vertex program if: 
 - the vertex is active, or DATABASE - the vertex has incoming messages Coordinator Synchronizes supersteps Redistributes Messages Physical Data Independence DATA Vertex (V), Edge (E), Message (M)

  24. Vertex-centric Query Interface Vertex Programs APPLICATION Pregel-style API: - getMessages() 
 - getEdges() 
 - sendMessages() - voteToHalt(), etc. Logical Data Independence Vertex UDF Invokes the vertex program if: 
 Batching - the vertex is active, or DATABASE - the vertex has incoming messages Coordinator No in- place Synchronizes supersteps Updates Redistributes Messages Physical Data Independence DATA Vertex (V), Edge (E), Message (M) Union

  25. PageRank (Vertex) 100000 Main-memory Database Apache Giraph Column Store Database 10000 2,071.0 Time (seconds) 1000 421.5 335.5 218.1 100 53.5 47.7 47.0 10.9 10 1 Twitter GPlus LiveJournal

  26. Shortest Paths (Vertex) 10000 7,950.1 Main-memory Database Apache Giraph Column Store Database 1000 712.2 Time (seconds) 146.3 121.0 115.5 100 50.8 43.7 23.8 10.6 10 1 Twitter GPlus LiveJournal

  27. Vertex-centric interface allows… • Connected Components • Random Walks with Restart • Stochastic Gradient Descent • Or, other message Passing Algorithms …. right within the database system!

  28. Advantages of “Graphs on Databases” • Running arbitrary SQL queries • Pre- and post- processing of data • Updates are trivial • ACID for free • Don’t need to deal with Yet-Another-System!

  29. Summary • Graph analytics can be mapped to relational queries (plus UDFs) • SQL systems can offer very good performance over relational queries • We can extend SQL systems to provide more graph-natural query interfaces

Recommend


More recommend