graphs on databases
play

Graphs On Databases Alekh Jindal Sam Madden Mike Stonebraker - PowerPoint PPT Presentation

Graphs On Databases Alekh Jindal Sam Madden Mike Stonebraker CSAIL, MIT + = Jena FlockDB AllegeroGraph TAO Pegasus Neo4j DEX HypergraphDB Pregel Titan GraphBase Twister Giraph Trinity HaLoop GraphLab PrItr SQL ? ? Can we


  1. Graphs On Databases Alekh Jindal Sam Madden Mike Stonebraker CSAIL, MIT

  2. +

  3. =

  4. Jena FlockDB AllegeroGraph TAO Pegasus Neo4j DEX HypergraphDB Pregel Titan GraphBase Twister Giraph Trinity HaLoop GraphLab PrItr

  5. SQL ?

  6. ?

  7. Can we have efficient graph analytics within a SQL Database?

  8. Graph Analytics on SQL Databases • Graph: set of nodes, set of edges • Node table: nodeId and associated metadata • Edge table: (from,to) nodeIds and associated metadata • Undirected graph: two tuples per edge • Node/Edge access: selection, projection on node and edge tables • Graph traversal: successive joins between node and edge tables

  9. Optimizations • Number of Joins 
 Parallel graph exploration • Number of queries; round trips 
 Nested queries; database handles the optimizations • Data movement between server and client 
 UDFs, Stored Procedures • Database connections 
 Keep connections alive between iterations • SQL query performance 
 Sort orders, indexes

  10. How does the performance look like?

  11. PageRank 10000 Graph Database SQL: Main-memory Store SQL: Row Store 1000 SQL: Column Store Time (seconds) 100 10 1 Facebook Twitter GPlus LiveJournal Nodes 4K 76K 107K 4.8M Edges 88K 2.4M 30M 69M

  12. Shortest Paths 100000 Graph Database 18,702.2 SQL: Main-memory Store 10000 SQL: Row Store SQL: Column Store Time (seconds) 1,231.7 1000 428.4 395.6 212.7 168.1 100 21.3 20.2 8.7 10 4.7 4.4 3.2 1 Facebook Twitter GPlus LiveJournal Nodes 4K 76K 107K 4.8M Edges 88K 2.4M 30M 69M

  13. What about the query interface? Is SQL the right choice for graph queries?

  14. Shortest Path in SQL UPDATE NNodes AS nnode SET Estimate = new_nnode.Estimate, Predecessor = new_nnode.Predecessor FROM ( SELECT temp.Id, temp.Estimate, edge.from_node AS Predecessor FROM NNodes AS nn, edge, ( SELECT e.to_node AS Id, min(n1.Estimate+1) AS Estimate FROM NNodes AS n1, edge AS e, NNodes AS n2 WHERE n1.Id=e.from_node AND n2.Id=e.to_node GROUP BY e.to_node, n2.Estimate Tables !!! HAVING min(n1.Estimate+1) < n2.Estimate ) AS temp WHERE nn.Id=edge.from_node AND edge.to_node=temp.Id AND nn.estimate=temp.estimate-1 ) AS new_nnode WHERE nnode.Id = new_nnode.Id;

  15. Shortest Path in Pregel void compute(vector<vfloat> messages){ � // get the minimum distance vfloat mindist = id==START_NODE ? 0 : DBL_MAX; for(vector<vfloat>::iterator it = messages.begin(); it != messages.end(); ++it) mindist = min(mindist,*it); � // send messages to all edges if new minimum is found vfloat vvalue = getVertexValue(); Graph !!! if(mindist < vvalue){ modifyVertexValue(mindist); vector<vint> edges = getOutEdges(); for(vector<vint>::iterator it = edges.begin(); it != edges.end(); ++it) sendMessage(*it, mindist+1); } � // halt voteToHalt(); }

  16. What about having a vertex-centric interface in a SQL Database?

  17. Vertex-centric Interface in SQL Databases • Idea: Map vertex-centric program execution to SQL queries in a SQL database • The programmer specifies what happens on each vertex in the graph • Vertices are executed as long as they are in active state or if they have an incoming message • Exact same API as in Pregel, e.g. getting incoming messages, vertex value, vertex edges, etc.

  18. Implementation Details • Vertex (V), Edge (E), Message (M) • The vertex programs are executed in parallel (super step) as UDFs in the SQL database • Vertex Input: (V,E,M) for the vertex • Vertex Output: outgoing M from the vertex • A coordinator synchronizes between super steps, i.e. redistributes the messages from one super step to the next • The coordinator stops when there are no more messages

  19. Optimizations • 3-way join, instead of 2-way 
 Table unions in place of joins • UDF call overheads 
 Batching several vertices in each UDF call • Too many new messages in each super step 
 Replace messages table, no in-place updates • SQL query performance 
 Sort orders, indexes

  20. How does the performance look like?

  21. Shortest Paths 1000 SQL: Column Store 439.8 Vertex-centric: Column Store 212.7 Vertex-centric: Apache Giraph 100.9 Time (seconds) 92.5 100 47.0 35.0 28.2 21.3 20.2 18.4 10 5.7 4.4 1 Facebook Twitter GPlus LiveJournal Nodes 4K 76K 107K 4.8M Edges 88K 2.4M 30M 69M

  22. Vertex-centric interface allows… • Connected Components • Random Walks with Restart • Stochastic Gradient Descent • Message Passing Algorithms • Or, any other vertex-centric algorithm …. right within the SQL database system!

  23. Summary • Graph analytics can be mapped to relational queries (plus UDFs) • SQL systems can offer very good performance over relational queries • We can extend SQL systems to provide more graph-natural query interfaces

  24. +

  25. =

  26. Team Members

  27. Non-faculty Members

  28. Non-faculty, non-postdoc …

  29. Graphs On Databases Alekh Jindal Sam Madden Mike Stonebraker CSAIL, MIT

Recommend


More recommend