Extending In-Memory Relational Database Engines with Native Graph Support. Mohamed S. Hassan (1), Tatiana Kuznetsova (1), Hyun Chai Jeong (1), Walid G. Aref (1), Mohammad Sadoghi (2). (1) Purdue University, West Lafayette, IN, USA; (2) University of California, Davis, CA, USA. EDBT’18
Graphs are Ubiquitous 2 Biological Network Road Network Social Network Datacenter Network
Specialized Graph Database Systems 3 ¨ Specialized graph databases can handle graph query-workloads ¤ Vital queries include shortest-path and reachability queries
Why Use Relational Database Systems to Support Graphs? 4 ¨ RDBMS technology is very mature and widely adopted ¨ Relational data can have latent graphs ¨ Can easily represent graphs using relational tables ¨ Many applications involve graph queries ¤ Queries that involve both relational and graph predicates n E.g., for each Patient P in a selected area, find the nearest hospital to P ¨ How can an RDBMS effectively and efficiently handle graph query workloads?
Graph Support in RDBMSs 5 ¨ Why is it challenging? ¤ There is an impedance mismatch between the relational model and the graph model ¨ Two common approaches for supporting graphs: ¤ Native Relational-Core ¤ Native Graph-Core ¨ A third approach: Native G+R Core (the proposed GRFusion system)
Native Relational-Core 6 ¨ Use a vanilla RDBMS ¨ Encode graphs in relational schemas ¨ Support limited graph queries ¨ Translate the supported graph queries into SQL or procedural SQL ¨ E.g., SQLGraph [SIGMOD’15], (SQL) Grail [CIDR’15] ¨ Pros: ¤ Use of very mature RDBMS technology ¨ Cons: ¤ Several graph queries are inefficient to evaluate using pure SQL ¤ Graphs are encoded in complex schemas [Figure labels: Graph Queries; SQL Translation Layer; Relational Queries; Relational Database (Relational Data, Graph Encoded into Relational Tables); Results]
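To make the translation idea concrete, here is a minimal sketch of a reachability query expressed in pure SQL, using a recursive common table expression in SQLite through Python's standard `sqlite3` module. The `edge(src, dst)` schema and the function name are assumptions made for illustration; this is not the translation scheme used by SQLGraph or Grail.

```python
import sqlite3

def reachable_via_sql(edges, source):
    """Return the set of vertex ids reachable from `source` (including itself)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE reach(v) AS (
            SELECT ?                      -- seed: the start vertex
            UNION                         -- UNION deduplicates, so cycles terminate
            SELECT e.dst FROM edge e JOIN reach r ON e.src = r.v
        )
        SELECT v FROM reach
        """,
        (source,),
    ).fetchall()
    conn.close()
    return {v for (v,) in rows}

# Hypothetical toy graph: a 3-cycle plus a separate edge.
sample_edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
reached = reachable_via_sql(sample_edges, 1)
```

Every hop of the traversal is a join against the edge table, which is the cost that the relational-core cons refer to.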
Native Graph-Core 7 ¨ Build on top of an RDBMS ¨ Extract graphs from the RDBMS ¨ Store graphs and process queries outside the realm of the RDBMS ¨ E.g., Ringo [SIGMOD’15], GraphGen [VLDB’15, SIGMOD’17] ¨ Pros: ¤ Native processing of graph operations ¨ Cons: ¤ Graph updates require re-extracting the graphs ¤ Queries cannot reference any non-extracted relational data [Figure labels: Graph Queries; Graph Extraction and Materialization Engine; Extracted Graphs; Graph Extraction Queries (SQL); Relational Database (Relational Data); Results]
The Relational Model vs. the Graph Model 8 ¨ Graph-core approach ¤ +ve: Queries involving graph traversals are efficiently handled in the graph model (e.g., shortest paths) ¤ -ve: Not as pervasive and mature as RDBMSs ¨ Relational-core approach ¤ +ve: Mature and pervasive ¤ -ve: Traversing a graph needs either many temporary inserts/deletes/updates or too many joins n Intermediate-result sizes grow and cardinality estimation becomes difficult ¨ Can the best of both worlds be combined? ¤ Support native graph processing inside an RDBMS
Proposed Approach: Native G+R Core 9 ¨ Assume that graphs have relational schemas ¨ Relational schemas describe the edges/nodes ¨ Enables graphs to be defined as native database objects ¨ Store graphs in non-relational structures optimized for graph operations ¨ Extend the SQL language ¤ Queries can compose relational and graph operations ¨ Cross-Data-Model QEPs (Query Evaluation Plans) ¨ Graph updates are supported [Figure labels: Graph-Relational Queries (SQL); Graph and Relational Operators in the Same QEP (π, ⋈, σ, GraphOp); Graph Views (Topology + Tuple Pointers); Relational Data; Graph Construction; Relational Database; Results]
GRFusion: Realizing the G+R Approach 10 ¨ We realize the G+R approach inside VoltDB ¤ An open-source in-memory RDBMS ¤ GRFusion: Our realization of the G+R approach in VoltDB ¤ A demo of GRFusion will appear in SIGMOD 2018 [Figure labels: Declarative Graph-Relational Queries; Graph-Relational Query Engine (Query Parser, Query Optimizer, Plan Executor); In-Memory Relational Database (Relational Data, Graph Views)]
Create Graph View 11 ¨ Create-Graph-View statement ¤ Create a named graph database object that can be referenced in queries ¤ Define the relational sources of the graph’s vertexes/edges ¤ Materialize the topology of the graph in main memory as a singleton graph structure
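The materialization step can be pictured with a small Python sketch: the graph view holds only adjacency lists plus pointers (here, row indexes) back into the relational sources, while attribute values stay in the tables. The class, function, and column names (`id`, `src`, `dst`) are hypothetical; this is an illustration of the idea under those assumptions, not GRFusion's internal data structure.

```python
from dataclasses import dataclass, field

@dataclass
class GraphView:
    """Topology plus tuple pointers; attribute values stay in the relational tables."""
    vertex_row: dict = field(default_factory=dict)  # vertex id -> row index in the vertex table
    edge_row: dict = field(default_factory=dict)    # edge id -> row index in the edge table
    out_edges: dict = field(default_factory=dict)   # vertex id -> [(edge id, end vertex id)]
    in_edges: dict = field(default_factory=dict)    # vertex id -> [(edge id, start vertex id)]

def create_graph_view(vertex_table, edge_table):
    """Build the in-memory topology from the two relational sources."""
    view = GraphView()
    for i, row in enumerate(vertex_table):
        view.vertex_row[row["id"]] = i
        view.out_edges[row["id"]] = []
        view.in_edges[row["id"]] = []
    for i, row in enumerate(edge_table):
        view.edge_row[row["id"]] = i
        view.out_edges[row["src"]].append((row["id"], row["dst"]))
        view.in_edges[row["dst"]].append((row["id"], row["src"]))
    return view

# Hypothetical relational sources of a small social network.
users = [
    {"id": 1, "lname": "Smith"},
    {"id": 2, "lname": "Jones"},
    {"id": 3, "lname": "Lee"},
]
friendships = [
    {"id": 10, "src": 1, "dst": 2, "since": "2001-05-01"},
    {"id": 11, "src": 2, "dst": 3, "since": "2003-02-10"},
]
social = create_graph_view(users, friendships)

# Follow a tuple pointer from the topology back to the relational row.
edge_id, friend_id = social.out_edges[1][0]
friend_lname = users[social.vertex_row[friend_id]]["lname"]
```

Keeping attributes out of the topology is what lets a traversal touch only small adjacency entries and dereference a row only when a query needs an attribute.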
Graph-View of a Social Network 12
Graph-View Structure [Traversal Index] 13
Graph-View Structure [Traversal Index] 14
The VERTEXES Construct 15 ¨ Appears in the FROM clause and references a graph view ¤ Select … From MyGraphView.VERTEXES v ¨ VERTEXES represents the vertexes of a graph view ¨ A vertex is a tuple with the following properties: ¤ Id ¤ FanIn ¤ FanOut ¤ Property for each vertex attribute
The EDGES Construct 16 ¨ Appears in the FROM clause and references a graph view ¤ Select … From MyGraphView.EDGES v ¨ EDGES represents the edges of a graph view ¨ An edge is a tuple with the following properties: ¤ Id ¤ StartVertexId ¤ EndVertexId ¤ Property for each edge attribute
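As a rough illustration of what the VERTEXES and EDGES constructs expose, the sketch below produces one tuple (a Python dict) per vertex and per edge, with the built-in properties followed by the user-defined attributes. The source column names (`id`, `src`, `dst`) and the sample filter on FanOut are assumptions for this example only.

```python
def vertexes(vertex_table, edge_table):
    """Yield one tuple per vertex: Id, FanIn, FanOut, then the vertex attributes."""
    fan_in, fan_out = {}, {}
    for e in edge_table:
        fan_out[e["src"]] = fan_out.get(e["src"], 0) + 1
        fan_in[e["dst"]] = fan_in.get(e["dst"], 0) + 1
    for v in vertex_table:
        attrs = {k: val for k, val in v.items() if k != "id"}
        yield {"Id": v["id"], "FanIn": fan_in.get(v["id"], 0),
               "FanOut": fan_out.get(v["id"], 0), **attrs}

def edges(edge_table):
    """Yield one tuple per edge: Id, StartVertexId, EndVertexId, then the edge attributes."""
    for e in edge_table:
        attrs = {k: val for k, val in e.items() if k not in ("id", "src", "dst")}
        yield {"Id": e["id"], "StartVertexId": e["src"], "EndVertexId": e["dst"], **attrs}

# Hypothetical relational sources.
users = [{"id": 1, "lname": "Smith"}, {"id": 2, "lname": "Jones"}, {"id": 3, "lname": "Lee"}]
friendships = [{"id": 10, "src": 1, "dst": 2, "since": "2001-05-01"},
               {"id": 11, "src": 1, "dst": 3, "since": "2003-02-10"}]

# Rough analogue of a query over MyGraphView.VERTEXES that filters on FanOut.
hubs = [v["Id"] for v in vertexes(users, friendships) if v["FanOut"] >= 2]
```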
The PATHS Construct – Extended SQL 17 ¨ Appears in the FROM clause and references a graph view ¤ Select … From MyGraphView.PATHS P ¨ PATHS represents a set of lazily-evaluated paths ¨ A path is a sequence of consecutive edges ¨ Each edge has two endpoint vertexes ¤ E.g., (V:attributes) –(:E:attributes)→ (V:attributes) … ¨ A path is a tuple with the following properties: ¤ Length ¤ StartVertex ¤ EndVertex ¤ Vertexes ¤ Edges
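Lazy evaluation of paths maps naturally onto a Python generator: no path is built until the consumer asks for the next one. The sketch below is an assumption-laden illustration (breadth-first order, no edge reused within a path, a hypothetical `max_length` bound), not GRFusion's implementation; the yielded dict mirrors the path properties listed above.

```python
from collections import deque
from itertools import islice

def paths(out_edges, start, max_length):
    """Lazily yield paths from `start` in breadth-first order.

    out_edges maps a vertex id to a list of (edge id, end vertex id) pairs.
    """
    queue = deque([([start], [])])  # (vertex ids so far, edge ids so far)
    while queue:
        vs, es = queue.popleft()
        if es:
            yield {"Length": len(es), "StartVertex": vs[0], "EndVertex": vs[-1],
                   "Vertexes": vs, "Edges": es}
        if len(es) < max_length:
            for eid, nxt in out_edges.get(vs[-1], []):
                if eid not in es:  # do not reuse an edge within one path
                    queue.append((vs + [nxt], es + [eid]))

# Hypothetical 3-cycle: 1 -> 2 -> 3 -> 1.
cycle = {1: [("e1", 2)], 2: [("e2", 3)], 3: [("e3", 1)]}

# Only the first two paths are ever constructed.
first_two = list(islice(paths(cycle, 1, 5), 2))
```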
Declarative Graph-Relational Queries 18
The PathScan Operator 19 ¨ PathScan is a logical operator that acts on a graph view ¤ Has three corresponding physical operators: BFScan, DFScan, SPScan ¨ The output of PathScan is a tuple ¤ Extends the standard relational tuple ¤ PathScan output can be consumed by other relational operators in the QEP ¨ PathScan optionally accepts the id of the vertex to start the traversal from ¤ If no start vertex is given, all the vertexes are considered as start vertexes ¨ Filters can be pushed as hints into the PathScan operator ¤ E.g., P.PathLength = 2
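The behavior described for PathScan can be sketched as a single generator with an optional start vertex, a pushed-down length hint, and a breadth-first or depth-first strategy. The function name, its parameters, and the restriction to simple paths are assumptions for illustration; the shortest-path variant (SPScan) is not covered here.

```python
from collections import deque

def path_scan(out_edges, start=None, length=None, strategy="bfs"):
    """Yield paths as tuples of vertex ids.

    start:    traverse from this vertex only; None means every vertex is a start vertex.
    length:   pushed-down hint; only paths with exactly this many hops are produced,
              and the traversal is never expanded beyond it.
    strategy: "bfs" or "dfs" traversal order.
    """
    starts = [start] if start is not None else list(out_edges)
    for s in starts:
        frontier = deque([(s,)])
        while frontier:
            path = frontier.popleft() if strategy == "bfs" else frontier.pop()
            hops = len(path) - 1
            if hops > 0 and (length is None or hops == length):
                yield path
            if length is not None and hops >= length:
                continue  # the hint prunes the traversal here
            for nxt in out_edges.get(path[-1], []):
                if nxt not in path:  # simple paths only, so cycles terminate
                    frontier.append(path + (nxt,))

# Hypothetical diamond-shaped graph.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
two_hops_from_a = list(path_scan(diamond, start="a", length=2))
```

Pushing the length filter into the scan means paths longer than two hops are never generated, rather than being generated and then discarded by a later selection.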
Friends-of-Friends Query Example 20 ¨ For all the users working as lawyers, retrieve the last name of their friends of friends, where the friendships happened after 1/1/2000
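A minimal Python sketch of this query's logic follows: a relational selection on the job, a two-hop traversal restricted by the edge predicate, and a projection of the last name. The table layouts and column names are hypothetical, friendships are assumed to be undirected, and the starting user is excluded from their own result; none of this is GRFusion's SQL syntax.

```python
def friends_of_friends(users, friendships, job, after):
    """Return (user id, last name of a friend of a friend) pairs."""
    # Graph side: keep only the edges that satisfy the date predicate.
    adj = {}
    for f in friendships:
        if f["since"] > after:  # ISO-8601 date strings compare chronologically
            adj.setdefault(f["u1"], []).append(f["u2"])
            adj.setdefault(f["u2"], []).append(f["u1"])  # assumed undirected
    result = set()
    for uid, user in users.items():
        if user["job"] != job:          # relational selection
            continue
        for friend in adj.get(uid, []):         # hop 1
            for fof in adj.get(friend, []):     # hop 2
                if fof != uid:
                    result.add((uid, users[fof]["lname"]))  # projection
    return result

# Hypothetical data.
users = {
    1: {"lname": "Smith", "job": "Lawyer"},
    2: {"lname": "Jones", "job": "Doctor"},
    3: {"lname": "Lee", "job": "Teacher"},
    4: {"lname": "Kim", "job": "Lawyer"},
}
friendships = [
    {"u1": 1, "u2": 2, "since": "2001-05-01"},
    {"u1": 2, "u2": 3, "since": "2003-02-10"},
    {"u1": 3, "u2": 4, "since": "1999-07-04"},
]
lawyer_fofs = friends_of_friends(users, friendships, "Lawyer", "2000-01-01")
```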
QEP of the Friends-of-Friends Query 21
Reachability Query Example 22 ¨ Check if Protein X interacts directly (i.e., by an edge) or indirectly (i.e., by a path) with Protein Y through either a covalent or a stable interaction type.
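The reachability check with an edge-type filter can be sketched as a breadth-first search that only follows qualifying edges. The edge layout `(start, end, type)` and the function name are assumptions, and edges are treated as directed in this illustration.

```python
from collections import deque

def interacts(edges, x, y, allowed_types):
    """Return True if y is reachable from x using only edges of an allowed type."""
    adj = {}
    for a, b, kind in edges:
        if kind in allowed_types:  # filter applied while building the traversal view
            adj.setdefault(a, []).append(b)
    seen, queue = {x}, deque([x])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, []):
            if w == y:
                return True  # stop at the first qualifying path
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

# Hypothetical protein-interaction edges.
ppi = [
    ("X", "A", "covalent"),
    ("A", "Y", "stable"),
    ("X", "B", "transient"),
    ("B", "Z", "covalent"),
]
x_reaches_y = interacts(ppi, "X", "Y", {"covalent", "stable"})
```

Only the vertex ids in `seen` and `queue` are kept during the search, and it ends as soon as one path is found, since a reachability query needs a yes/no answer rather than all paths.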
Shortest-Path Queries with Relational Predicates 23
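A shortest-path query with a relational predicate can be sketched as Dijkstra's algorithm run over only the edges that satisfy the predicate. The edge columns (`src`, `dst`, `weight`, `toll`) and the `edge_ok` parameter are hypothetical, and edge weights are assumed to be non-negative.

```python
import heapq

def shortest_path(edges, source, target, edge_ok=lambda e: True):
    """Return (distance, vertex list) for the cheapest qualifying path, or None."""
    adj = {}
    for e in edges:
        if edge_ok(e):  # relational predicate on edge attributes
            adj.setdefault(e["src"], []).append((e["dst"], e["weight"]))
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        if v == target:
            break
        for w, cost in adj.get(v, []):
            nd = d + cost
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                prev[w] = v
                heapq.heappush(heap, (nd, w))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical road segments.
roads = [
    {"src": "A", "dst": "B", "weight": 1, "toll": False},
    {"src": "B", "dst": "C", "weight": 1, "toll": True},
    {"src": "A", "dst": "C", "weight": 5, "toll": False},
]
any_road = shortest_path(roads, "A", "C")
toll_free = shortest_path(roads, "A", "C", edge_ok=lambda e: not e["toll"])
```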
Evaluating The Native G+R Approach 24 ¨ Realized a centralized version of GRFusion (Native G+R Core approach) inside VoltDB Version 6.7 ¨ Single node running Linux kernel Version 3.17.7 n 32 cores of Intel Xeon 2.90 GHz n 384 GB of RAM ¨ Comparing against: ¤ Native Relational-Core: n SQLGraph [SIGMOD’15], Grail [CIDR’15] ¤ Native Graph-Core Systems: n Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]
Experimental Setup 25 ¨ Native relational-core approach ¤ SQLGraph [SIGMOD’15] n Represents path traversal using recursive relational joins n Commercial system (code not available) n Implemented its techniques in in-memory VoltDB ¤ Grail [CIDR’15] n Implemented Grail in VoltDB n Also evaluated Grail in Hekaton n Reached similar conclusions (the Hekaton results are not reported here)
Experimental Setup (Cont’d) 26 ¨ Native Graph Approach ¤ Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan] n Native graph-cores (specialized graph systems) n Disk-based systems n Titan: configured to use the in-memory storage configuration n Neo4j: run on RamDisk to mitigate the disk I/O cost ¤ GRFusion uses simple graph algorithms (e.g., Dijkstra’s algorithm for single-source shortest paths) n The goal is to investigate the performance gains, if any, of the G+R approach in contrast to the native relational-core
Evaluating GRFusion 27 ¨ Graph queries ¤ Reachability queries ¤ Reachability queries with filtering predicates ¤ Shortest path queries ¤ Subgraph queries (e.g., count triangles) ¨ Datasets
Reachability Queries (DBLP Dataset) 28 ¨ Performance of GRFusion, Neo4j, and Titan is more stable than that of SQLGraph ¤ They avoid the overheads of recursive relational joins ¨ GRFusion performs better than Neo4j and Titan ¤ VoltDB is optimized for main memory ¤ Titan and Neo4j are disk-based designs (although Neo4j runs on RamDisk) and are not optimized for main memory ¤ Graph views in GRFusion are more compact n Encode only the topology within the graph n No vertex/edge attributes in the topology n Thus, GRFusion makes better use of caching ¤ GRFusion/VoltDB are C++-based ¤ Neo4j and Titan are Java-based n Overheads from the automatic memory management of Java
Reachability Queries (String Dataset) 29 ¨ String dataset: ~0.5B edges (much larger than DBLP) ¨ SQLGraph (based on VoltDB): ¤ Materializes the join results at each intermediate stage ¤ Explosion in the size of intermediate results (performs more than 11 joins) ¨ SQLGraph and GRFusion follow BFS evaluation ¨ GRFusion follows an iterative model: ¤ Evaluates one path at a time ¤ Also, only the vertex Ids are stored in the BFS queue ¤ More efficient storage-wise than storing the tuples of the relational joins as intermediate results
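The storage contrast can be illustrated with a toy comparison: a join-style evaluation that materializes one widening tuple per partial path at every hop, against a BFS that keeps only vertex ids in its queue. Both functions, the `peak` counters, and the sample graph are hypothetical and heavily simplified; they are not the SQLGraph or GRFusion implementations and say nothing about the measured results.

```python
from collections import deque

def join_based_reach(edges, source, target, max_hops):
    """Join the frontier with the edge list at each hop, materializing every tuple."""
    frontier = [(source,)]
    peak = len(frontier)  # largest number of materialized intermediate tuples
    for _ in range(max_hops):
        frontier = [row + (dst,) for row in frontier
                    for (src, dst) in edges if src == row[-1]]
        peak = max(peak, len(frontier))
        if any(row[-1] == target for row in frontier):
            return True, peak
        if not frontier:
            break
    return False, peak

def bfs_reach(adj, source, target):
    """Iterative BFS that stores only vertex ids in the queue."""
    seen, queue, peak = {source}, deque([source]), 1
    while queue:
        v = queue.popleft()
        for w in adj.get(v, []):
            if w == target:
                return True, peak
            if w not in seen:
                seen.add(w)
                queue.append(w)
                peak = max(peak, len(queue))  # largest queue length seen
    return False, peak

# Hypothetical layered graph: 0 -> {1, 2} -> {3, 4} -> 5.
layered = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 5), (4, 5)]
adjacency = {}
for s, d in layered:
    adjacency.setdefault(s, []).append(d)

join_result = join_based_reach(layered, 0, 5, max_hops=11)
bfs_result = bfs_reach(adjacency, 0, 5)
```

In the join-style version the number of tuples can grow with the number of distinct partial paths and each tuple widens by one column per hop, whereas the BFS queue holds each vertex id at most once.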
Reachability Queries (Twitter Dataset) 30 ¨ Twitter dataset: 1.4B edges ¨ Fan-out is also a factor in the performance ¤ But we did not study the effect of fan-out ¤ Would require synthetic datasets ¤ The current study focuses on real datasets