GraphGen: Adaptive Graph Processing using Relational Databases
Department of Computer Science, University of Maryland
Graph Analytics / Querying
Graph datasets can provide value in many domains: protein interaction networks, email networks, social networks, stock trading networks.
Many different ways to manage graph data:
● Graph databases (Neo4j, OrientDB, RDF stores)
● Distributed batch analytics systems (Giraph, GraphX, GraphLab)
● In-memory systems (Ligra, Green-Marl, X-Stream)
● Many research prototypes / custom indexes
RDBMS-based Graph Systems vs GraphGen DECLARATIVE
Example: TPC-H
Which customers bought the same item?
[Figure: the tables Customer(c_key, name), Orders(order_key, customer_key), and LineItem(order_key, part_key). Joining Orders with LineItem on order_key answers "Which customer bought which product?" and yields (c_key, p_key) pairs. Joining that result with itself on p_key yields (cust1, cust2) pairs, which form the edges of the customer graph.]
Example: TPC-H
Many other graphs of potential interest:
● Suppliers that sell a common item
● Employees working under the same manager
● Parts that were ordered together
● Bipartite graph between Part and Supplier
● ...
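The extraction in this example amounts to two joins. The following is a minimal sketch using Python's built-in sqlite3 module rather than GraphGen itself. The table and column names follow the slide, while the rows and the function name `co_purchase_edges` are hypothetical and chosen only for illustration.

```python
import sqlite3

def co_purchase_edges(orders, lineitems):
    """Return sorted (cust1, cust2) pairs of customers who bought a common part."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Orders (order_key TEXT, customer_key TEXT)")
    conn.execute("CREATE TABLE LineItem (order_key TEXT, part_key TEXT)")
    conn.executemany("INSERT INTO Orders VALUES (?, ?)", orders)
    conn.executemany("INSERT INTO LineItem VALUES (?, ?)", lineitems)
    rows = conn.execute(
        """
        WITH Bought AS (
            -- join on order_key: which customer bought which product
            SELECT DISTINCT o.customer_key AS c_key, l.part_key AS p_key
            FROM Orders o JOIN LineItem l ON o.order_key = l.order_key
        )
        -- self-join on p_key: customers that share a product
        SELECT DISTINCT b1.c_key, b2.c_key
        FROM Bought b1 JOIN Bought b2 ON b1.p_key = b2.p_key
        WHERE b1.c_key < b2.c_key   -- assumption: undirected edges, no self-loops
        ORDER BY 1, 2
        """
    ).fetchall()
    conn.close()
    return rows

# Hypothetical toy rows, not the data from the slide.
orders = [("o1", "c1"), ("o2", "c2"), ("o3", "c3")]
lineitems = [("o1", "p1"), ("o1", "p2"), ("o2", "p1"), ("o2", "p3"), ("o3", "p2")]
print(co_purchase_edges(orders, lineitems))  # [('c1', 'c2'), ('c1', 'c3')]
```

Here c1 and c2 share p1 and c1 and c3 share p2, so the graph has two edges. The other graphs listed above would follow the same join pattern over different tables.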
GraphGen
[Architecture diagram. Inputs: graph definition queries, graph analysis queries, and direct graph access from a vertex-centric Java program. GraphGen components: DSL parser + optimizer, in-memory engine. GraphGen sends SQL queries to the backend relational DBMS and returns graph results.]
GraphGenDL - Definition Language
● Definition of a GraphView over the database
  ○ User specifies how to construct the Nodes and Edges

    CREATE GRAPHVIEW CoAuthors AS
    Nodes(ID, name) :- Author(ID, name).
    Edges(ID1, ID2, wt=$COUNT(pub)) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).

    Edge property (wt): number of publications

● Definition of a collection of graphs (Multi-Graph View) over the database
  ○ Can enable many optimizations
  ○ Example: extract all ego-graphs

    CREATE GRAPHVIEW AuthorEgoNetworks(X) WHERE Author(X) AS
    Nodes(X, name) :- Author(X, name).
    Nodes(ID, name) :- AuthorPub(X, pub), AuthorPub(ID, pub), Author(ID, name).
    Edges(ID1, ID2) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).
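One way to read the CoAuthors definition is as a pair of SQL queries over the Author and AuthorPub tables. The sketch below is an illustration under assumptions, not the SQL that GraphGen actually generates: the column names (`id`, `name`, `pub`), the function name `coauthor_view`, the sample rows, and the exclusion of self-loops are all hypothetical choices.

```python
import sqlite3

def coauthor_view(authors, author_pubs):
    """Materialize Nodes and weighted Edges for a CoAuthors-style view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Author (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE AuthorPub (id INTEGER, pub TEXT)")
    conn.executemany("INSERT INTO Author VALUES (?, ?)", authors)
    conn.executemany("INSERT INTO AuthorPub VALUES (?, ?)", author_pubs)
    # Nodes(ID, name) :- Author(ID, name).
    nodes = conn.execute("SELECT id, name FROM Author ORDER BY id").fetchall()
    # Edges(ID1, ID2, wt=$COUNT(pub)) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).
    edges = conn.execute(
        """
        SELECT ap1.id, ap2.id, COUNT(*) AS wt
        FROM AuthorPub ap1 JOIN AuthorPub ap2 ON ap1.pub = ap2.pub
        WHERE ap1.id <> ap2.id      -- assumption: drop self-loops
        GROUP BY ap1.id, ap2.id
        ORDER BY ap1.id, ap2.id
        """
    ).fetchall()
    conn.close()
    return nodes, edges

# Hypothetical sample data: authors 1 and 2 share two publications.
authors = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
author_pubs = [(1, "x"), (2, "x"), (1, "y"), (2, "y"), (3, "y")]
nodes, edges = coauthor_view(authors, author_pubs)
print(edges)
```

With this data the edge (1, 2) carries weight 2 and every edge involving author 3 carries weight 1, because the join produces one row per shared publication before grouping.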
GraphGenQL - Query Language
● Specifying graph queries over GraphViews
● Support for subgraph pattern matching languages like SPARQL, Cypher, PGQL, etc.
● Datalog is a natural fit for expressing recursive computation over the Edges view

Find triangles of authors whose areas follow: "ML" -> "DB" -> "AL"

    USING GRAPHVIEW CoAuthors
    Triangle(X, Y, Z) :- Nodes(X, _, "ML"), Nodes(Y, _, "DB"), Nodes(Z, _, "AL"),
                         Edges(X, Y), Edges(Y, Z), Edges(X, Z).
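To make the semantics of the Triangle rule concrete, here is a small in-memory evaluation in plain Python. It is a sketch only and does not reflect how GraphGen evaluates such queries. For brevity, nodes are assumed to be a mapping from ID to research area, and the function name `find_triangles` and the sample graph are hypothetical.

```python
def find_triangles(nodes, edges, areas=("ML", "DB", "AL")):
    """Evaluate Triangle(X, Y, Z) over directed edges.

    nodes: dict mapping node ID -> area label
    edges: iterable of (src, dst) pairs
    """
    edge_set = set(edges)
    adjacency = {}
    for src, dst in edge_set:
        adjacency.setdefault(src, set()).add(dst)

    area_x, area_y, area_z = areas
    result = []
    for x, area in nodes.items():
        if area != area_x:
            continue
        for y in adjacency.get(x, ()):          # Edges(X, Y)
            if nodes.get(y) != area_y:
                continue
            for z in adjacency.get(y, ()):      # Edges(Y, Z)
                # Edges(X, Z) closes the triangle
                if nodes.get(z) == area_z and (x, z) in edge_set:
                    result.append((x, y, z))
    return sorted(result)

# Hypothetical sample graph.
nodes = {1: "ML", 2: "DB", 3: "AL", 4: "DB", 5: "AL"}
edges = [(1, 2), (2, 3), (1, 3), (1, 4), (4, 3), (2, 5)]
print(find_triangles(nodes, edges))  # [(1, 2, 3), (1, 4, 3)]
```

The path 1 -> 2 -> 5 is not reported because the closing edge (1, 5) is missing.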
GraphGen
● Goal: We want to adapt the execution based on the query/analysis.
● What are some of the challenges here?
1. Where to execute Queries / Tasks
● Depends on workload, rate of updates, rate of queries, …
● Options: in-memory execution, in-database execution

Dataset   DBS1      DBS2
Small     0.899 s   0.22
Large     4.25 s    NA

Triangle Pattern Matching:

Dataset   In-memory   ETL       MySQL    PostgreSQL
Small     0.001 s     2.05 s    0.8 s    0.1 s
Large     0.015 s     17.52 s   4.26     0.704 s

● Key Challenge: Develop accurate cost models, tools, techniques. Decide what to compute where.
● Other issues: Large-output joins [SIGMOD '17], and the selectivity estimation errors associated with them.
2. Query Rewriting
● Assume the execution is to be pushed to the database
● Many different ways to construct equivalent SQL queries
● Auto-generated SQL can be verbose → challenging to optimize

1) WITH vs. VIEW

    WITH Nodes AS (...)
    WITH Edges AS (...)
    (SQL for answering query)

    CREATE VIEW Nodes AS (...)
    CREATE VIEW Edges AS (...)
    (SQL for answering query)

2) Duplicate elimination (DISTINCT)
● The costly duplicate removal might even be unnecessary if the query / analysis doesn't care about duplicates!
2. Query Rewriting (continued)
[Chart: time for query to finish, in seconds]
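The two rewriting choices on this slide can be tried out on a toy database. The sketch below uses Python's sqlite3 module and makes no claim about the SQL GraphGen emits or about performance. The schema, the helper names (`make_db`, `run_with_cte`, `run_with_view`), and the two analysis queries are hypothetical. It shows that the WITH and VIEW forms return the same answer, that a degree count changes when DISTINCT is dropped, and that a duplicate-insensitive aggregate such as MAX does not.

```python
import sqlite3

# Edge definition shared by both forms; {d} is either "DISTINCT" or empty.
EDGES_BODY = (
    "SELECT {d} a1.id AS id1, a2.id AS id2 "
    "FROM AuthorPub a1 JOIN AuthorPub a2 ON a1.pub = a2.pub "
    "WHERE a1.id <> a2.id"
)

def make_db(author_pubs):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE AuthorPub (id INTEGER, pub TEXT)")
    conn.executemany("INSERT INTO AuthorPub VALUES (?, ?)", author_pubs)
    return conn

def run_with_cte(conn, query, distinct):
    """Inline the Edges definition as a common table expression."""
    body = EDGES_BODY.format(d="DISTINCT" if distinct else "")
    return conn.execute(f"WITH Edges AS ({body}) {query}").fetchall()

def run_with_view(conn, query, distinct):
    """Define Edges as a view, run the query, then drop the view."""
    body = EDGES_BODY.format(d="DISTINCT" if distinct else "")
    conn.execute(f"CREATE VIEW Edges AS {body}")
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.execute("DROP VIEW Edges")

# Degree is sensitive to duplicate edges; MAX over neighbors is not.
DEGREE = "SELECT id1, COUNT(*) FROM Edges GROUP BY id1 ORDER BY id1"
MAX_NEIGHBOR = "SELECT id1, MAX(id2) FROM Edges GROUP BY id1 ORDER BY id1"

conn = make_db([(1, "x"), (2, "x"), (1, "y"), (2, "y"), (3, "y")])
print(run_with_cte(conn, DEGREE, distinct=True))         # [(1, 2), (2, 2), (3, 2)]
print(run_with_cte(conn, DEGREE, distinct=False))        # [(1, 3), (2, 3), (3, 2)]
print(run_with_cte(conn, MAX_NEIGHBOR, distinct=False))  # [(1, 3), (2, 3), (3, 2)]
```

Authors 1 and 2 share two publications, so without DISTINCT their edge appears twice and inflates the degree, while the MAX query returns the same rows either way.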
3. Optimizing Multi-Graph Views
● Ego-graph analysis, graph snapshot analysis
● Ability to refer to each graph independently → significant savings
● Opportunity: Overlap computation and storage over collections of graphs
● Key Challenge: Develop a systematic approach to optimizing the extraction of and execution against such multi-graph views.

Snapshots:

    CREATE GRAPHVIEW CoAuthorsSnapshot(X) WHERE X IN RANGE(1950, 2017, 1)
    Nodes(ID, name) :- Author(ID, name).
    Edges(ID1, ID2) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub), Publication(pub, _, Y), Y <= X.

E.g. Ego-Graph Analysis (please see full paper):
● Naive: Generate a separate SQL query for each distinct graph.
● Result-Tagging: We can extract all graphs with a single query!
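For the snapshot view, the naive strategy of one SQL query per graph could look roughly like the sketch below. It is written against sqlite3 with a hypothetical schema: the column names `id`, `pub`, `title`, and `year`, the function name `extract_snapshots_naive`, and the sample rows are all assumptions, and only one orientation of each edge is kept for brevity.

```python
import sqlite3

SNAPSHOT_EDGES_SQL = """
    SELECT DISTINCT a1.id, a2.id
    FROM AuthorPub a1
    JOIN AuthorPub a2 ON a1.pub = a2.pub
    JOIN Publication p ON p.pub = a1.pub
    WHERE a1.id < a2.id      -- assumption: keep one orientation per pair
      AND p.year <= ?        -- the Y <= X condition of the view
    ORDER BY 1, 2
"""

def extract_snapshots_naive(conn, years):
    """Run one SQL query per snapshot year X and return {X: edge list}."""
    return {x: conn.execute(SNAPSHOT_EDGES_SQL, (x,)).fetchall() for x in years}

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AuthorPub (id INTEGER, pub TEXT)")
conn.execute("CREATE TABLE Publication (pub TEXT, title TEXT, year INTEGER)")
conn.executemany("INSERT INTO AuthorPub VALUES (?, ?)",
                 [(1, "x"), (2, "x"), (2, "y"), (3, "y")])
conn.executemany("INSERT INTO Publication VALUES (?, ?, ?)",
                 [("x", "t1", 2000), ("y", "t2", 2002)])

snapshots = extract_snapshots_naive(conn, range(2000, 2003))
print(snapshots)
```

Each snapshot repeats the work of every earlier one: the edge (1, 2) is recomputed for 2000, 2001, and 2002. That repeated work is the overlap in computation the slide points to as an opportunity.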
Result-Tagging
Tags show which ego-graphs involve the edge.

Tagged Edges table:

aid1  aid2  tag
a1    a2    a1
a1    a5    a1
a1    a6    a1
a6    a7    a6
a7    a8    a7
a5    a3    a5
a3    a4    a3
a2    a3    a2

Find the edges 1-hop away for the source (tag), joining on e1.aid2 = e2.aid1:

aid1  aid2  tag
a2    a3    a1
a5    a3    a1
a6    a7    a1
a7    a8    a6
a3    a4    a5
a3    a4    a2

Union the result with the initial Tagged Edges table, then aggregate the tags:

aid1  aid2  tags[]
a1    a2    [a1]
a1    a5    [a1]
a1    a6    [a1]
a6    a7    [a1,a6]
a7    a8    [a6,a7]
a5    a3    [a5,a1]
a3    a4    [a2,a3,a5]
a2    a3    [a1,a2]
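The result-tagging steps on this slide (tag each edge with its source, join on e1.aid2 = e2.aid1 to reach edges one hop away, union, then aggregate tags) can be reproduced with a single SQL query. This is a minimal sketch using sqlite3 and the edge list from the slide, not the query GraphGen generates. The function name `tag_edges` is hypothetical, and tags are sorted alphabetically, so their order within a list may differ from the slide.

```python
import sqlite3
from collections import defaultdict

def tag_edges(edges):
    """Return {(aid1, aid2): sorted list of ego-graph tags} for 1-hop ego-graphs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Edges (aid1 TEXT, aid2 TEXT)")
    conn.executemany("INSERT INTO Edges VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH Tagged AS (               -- initial table: tag = source of the edge
            SELECT aid1, aid2, aid1 AS tag FROM Edges
        )
        SELECT aid1, aid2, tag FROM Tagged
        UNION                          -- union with the edges one hop away
        SELECT e2.aid1, e2.aid2, e1.tag
        FROM Tagged e1 JOIN Tagged e2 ON e1.aid2 = e2.aid1
        """
    ).fetchall()
    conn.close()
    # Tag aggregation: collect all tags per edge.
    tags = defaultdict(set)
    for aid1, aid2, tag in rows:
        tags[(aid1, aid2)].add(tag)
    return {edge: sorted(t) for edge, t in tags.items()}

# Edge list as shown on the slide.
edges = [("a1", "a2"), ("a1", "a5"), ("a1", "a6"), ("a6", "a7"),
         ("a7", "a8"), ("a5", "a3"), ("a3", "a4"), ("a2", "a3")]
tagged = tag_edges(edges)
print(tagged[("a3", "a4")])  # ['a2', 'a3', 'a5']
```

All ego-graphs come out of one query. The edges of a single ego-graph, say that of a1, are the ones whose tag list contains a1.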
Take Aways
● Need for a unified framework for extraction and analysis of graphs stored implicitly in a structured data store.
● We propose a declarative Datalog-based DSL for specifying:
  ○ GraphViews over relational schemas
  ○ Declarative graph queries
● Expose a series of APIs for defining complex graph analytics over GraphViews

There is a variety of challenges & opportunities here in terms of:
● Deciding where to execute graph queries
● Handling large-output joins and inaccuracies of query optimizers
● Rewriting SQL queries pushed to the database
● Optimizing across collections of graphs (Multi-Graph Views)

Thank you! Questions?