Foundational aspects of Graph Data Management
Wim Martens, University of Bayreuth
EPIT Spring School on Theoretical Computer Science, Luminy, 2019
Outline
- Graph Data Model
- Queries
- Graph Query Evaluation
- Graph Query Containment
- Graphs vs Trees
- "Real Queries"
- Data Value Comparisons
Notation
Notation and Basic Principles
If n ∈ ℕ, we use [n] to denote the set {1, ..., n}
Finite Automata
We denote a nondeterministic finite automaton (NFA) as N = (S, A, 𝜀, I, F) where
- S is the finite set of states
- A is the finite alphabet
- 𝜀 ⊆ S ⨉ A ⨉ S is the transition relation
- I ⊆ S is the set of initial states
- F ⊆ S is the set of accepting states
The language of N is denoted L(N)
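The NFA definition above can be illustrated with a minimal Python sketch. The class name `NFA`, its field names, and the example automaton for (ab)* are assumptions made for illustration; they do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NFA:
    states: frozenset       # S
    alphabet: frozenset     # A
    transitions: frozenset  # transition relation, a subset of S x A x S
    initial: frozenset      # I
    accepting: frozenset    # F

    def accepts(self, word):
        """Return True iff word (a sequence of symbols) is in L(N)."""
        # Track the set of states reachable after reading each symbol.
        current = set(self.initial)
        for symbol in word:
            current = {t for (s, a, t) in self.transitions
                       if s in current and a == symbol}
        return bool(current & self.accepting)


# Hypothetical example (not from the slides): an NFA for the language (ab)*.
example = NFA(
    states=frozenset({0, 1}),
    alphabet=frozenset({"a", "b"}),
    transitions=frozenset({(0, "a", 1), (1, "b", 0)}),
    initial=frozenset({0}),
    accepting=frozenset({0}),
)
```

With this sketch, `example.accepts("abab")` holds while `example.accepts("aba")` does not.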
Notation and Basic Principles
Regular Expressions
Operators:
(1) Kleene star (denoted *)
(2) concatenation (omitted in notation)
(3) disjunction (denoted +)
Priorities of operators: first (1), then (2), then (3)
Example: ab + cd* is read as (ab) + (c(d*))
The language of regular expression r is denoted L(r)
We use r^n to abbreviate the n-fold concatenation of r
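The operator priorities can be checked with Python's `re` module, which uses the same precedence (star, then concatenation, then disjunction) but writes disjunction as `|` instead of `+`. This is only a sketch; the helper names `in_language` and `power` are made up for illustration.

```python
import re

# The slide's example "ab + cd*", written in Python syntax ("+" becomes "|").
pattern = re.compile(r"ab|cd*")


def in_language(word):
    """Return True iff the whole word belongs to L(ab + cd*)."""
    return pattern.fullmatch(word) is not None


def power(r, n):
    """Build r^n, the n-fold concatenation of r.

    Each copy is wrapped in a non-capturing group so that a disjunction
    inside r does not leak into the neighbouring copies.
    """
    return f"(?:{r})" * n
```

For instance, `in_language("cddd")` holds, while `in_language("abd")` does not, because the star applies to `d` only and the disjunction separates `ab` from `cd*`.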
Motivation
Why Graph Databases?
- Graph databases are becoming more and more standard in industry [Neo4j, Tigergraph, Oracle, ...]
- They bring "reasoning about connectedness" to the masses (*)
...and they are a nice source of theory problems
(*) I heard this pitch from Hassan Chafi, Oracle
Wikidata: "US artists who died of poisoning" (*)
SELECT ?x WHERE {
  ?x wdt:occupation/wdt:subclassof* wd:artist .
  ?x wdt:citizenship wd:United_States .
  ?x wdt:cause_of_death ?y .
  ?y wdt:subclass_of* wd:poisoning
}
(*) Original Wikidata query: politicians who died of cancer
https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#Politicians_who_died_of_cancer_.28of_any_type.29
Graph Queries By Example
Wikidata: "US artists who died of poisoning"
[Figure: the query pattern for ?x (edges occupation, subclassof*, citizenship, cause of death) next to an excerpt of the Wikidata graph with the nodes Jimi Hendrix, Marilyn Monroe, River Phoenix, guitarist, instrumentalist, singer, musician, actor, artist, United States, barbiturate overdose, drug overdose and poisoning, connected by edges labeled occupation, subclassof, citizenship and cause of death]
Data Model
What are Graph Databases?
Currently, two main data models:
- Property Graph-like Databases
- RDF-like Databases
Property Graph Data Model
[Figure: a property graph with two person nodes (first name: Richard, last name: Burton; first name: Liz, last name: Taylor) and a profession node (name: film actor). Two hasprofession edges carry the properties from: 1943 and from: 1942; two spouse edges carry from: 15.03.1964, until: 26.06.1974 and from: 10.10.1975, until: 29.07.1976]
Labels L: person, profession, spouse
Values V: Liz, Taylor, 10.10.1975
Properties P: first name, last name
More formally, this is
- a set of node identifiers N
- a set of edge identifiers E
- a function that maps E to N ⨉ N
- a function from N ∪ E to (subsets of) labels L
- a function from (N ∪ E) ⨉ P to (subsets of) values V
The G-Core model also directly incorporates a third set, containing paths [Angles et al., SIGMOD'18]
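The five components of the formal definition can be written down directly as Python sets and dictionaries. This is a sketch only: the identifiers `n1`, `e1`, and so on are invented, the direction of the spouse edge is an assumption, and so is the choice of which person the "from: 1942" hasprofession edge belongs to, since the extracted figure does not preserve either.

```python
# Node identifiers N and edge identifiers E (identifier names are made up).
nodes = {"n1", "n2", "n3"}
edges = {"e1", "e2"}

# Function mapping E to N x N. The directions are assumptions.
endpoints = {"e1": ("n1", "n2"), "e2": ("n2", "n3")}

# Function from N ∪ E to subsets of the labels L.
labels = {
    "n1": {"person"},
    "n2": {"person"},
    "n3": {"profession"},
    "e1": {"spouse"},
    "e2": {"hasprofession"},
}

# Function from (N ∪ E) x P to subsets of the values V.
properties = {
    ("n1", "first name"): {"Richard"},
    ("n1", "last name"): {"Burton"},
    ("n2", "first name"): {"Liz"},
    ("n2", "last name"): {"Taylor"},
    ("n3", "name"): {"film actor"},
    ("e1", "from"): {"15.03.1964"},
    ("e1", "until"): {"26.06.1974"},
    ("e2", "from"): {"1942"},  # assumed to be the edge leaving n2
}


def is_well_formed():
    """Check that the functions only mention declared nodes and edges."""
    objects = nodes | edges
    return (
        set(endpoints) == edges
        and all(u in nodes and v in nodes for (u, v) in endpoints.values())
        and set(labels) <= objects
        and all(obj in objects for (obj, _) in properties)
    )
```

Note that, unlike in the edge-labeled graphs considered later, edges here have their own identity, so two spouse edges between the same pair of nodes can carry different properties.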
RDF Data Model
[Figure: an RDF graph with the nodes Q34851, Q151973, person, Richard, Liz, Taylor, Burton and stage actor, connected by edges labeled instance of, first name, last name, spouse and profession]
More formally, this is a set of triples from I ⨉ I ⨉ (I ∪ L) where
- I is the set of Internationalized Resource Identifiers (IRIs)
- L is the set of literals (constants)
These triples (s, p, o) are referred to as subject / predicate / object triples
(There are also blank nodes)
RDF Data Model
[Figure: "RDF-like" graph database with the nodes Liz Taylor, stage actor, film actor, actor and artist and edges labeled profession and subclass of, alongside the headings Profession and Subclass of]
RDF Data Model
[Figure: "RDF-like" graph database with the nodes Liz Taylor, stage actor, film actor, actor, artist, profession, property for items about people and http://d-nb.info/standards/elementset/gnd#fieldOfActivity, and edges labeled profession, subclass of, instance of and equivalent property]
What We Consider Today
Edge-labeled, directed graphs
[Figure: a graph with the nodes Q34851, Q151973, person, Richard, Liz, Taylor, Burton and stage actor and edges labeled instance of, first name, last name, spouse and profession]
Graph Database
We assume that Σ is a countably infinite set of labels
Definition
A graph database (over Σ) is a pair G = (V, E) where
- V is a finite set of nodes
- E ⊆ V ⨉ Σ ⨉ V is a finite set of edges
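A graph database in this sense is little more than a finite set of triples. The following Python sketch shows one possible encoding; the names `GraphDB`, `graph_from_edges` and `successors` are hypothetical, and the three example edges (including their directions) are assumptions loosely based on the node names in the slides.

```python
from typing import FrozenSet, NamedTuple, Tuple

Edge = Tuple[str, str, str]  # (source node, label from Σ, target node)


class GraphDB(NamedTuple):
    nodes: FrozenSet[str]   # V
    edges: FrozenSet[Edge]  # E, a subset of V x Σ x V


def graph_from_edges(edges, extra_nodes=()):
    """Build G = (V, E); extra_nodes allows nodes without any edge."""
    edge_set = frozenset(edges)
    nodes = set(extra_nodes)
    for (u, _, v) in edge_set:
        nodes.update((u, v))
    return GraphDB(frozenset(nodes), edge_set)


def successors(graph, node, label):
    """All nodes reachable from node via one edge with the given label."""
    return {v for (u, a, v) in graph.edges if u == node and a == label}


# Illustrative graph (edge directions are assumed, not taken from the figure).
g = graph_from_edges([
    ("Q34851", "spouse", "Q151973"),
    ("Q151973", "spouse", "Q34851"),
    ("Q34851", "profession", "stage actor"),
])
```

Because E is a set of triples, two nodes can be connected by several edges only if the labels differ.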
Queries
Plan
- Conjunctive Queries (CQs)
- Regular Path Queries (RPQs)
- Conjunctive Regular Path Queries (CRPQs)
Conjunctive Queries (CQs)
Intuition
Not much different from CQs in relational DBs
Example (CQ on binary relations)
R(x, y) ∧ S(x, a) ∧ S(y, a)
(uses variables x, y and constant a)
Example (CQ in graph databases)
x R y ∧ x S a ∧ y S a
More visual notation
[Figure: the same query drawn as a graph, with an R-labeled edge from x to y and S-labeled edges from x to a and from y to a; a second drawing shows the same shape on the nodes Q3, Q15 and a]
Conjunctive Queries
Definition (Conjunctive Query over Graphs)
A conjunctive query over graphs (CQ) is an expression of the form
$\exists z\, ((x_1\, a_1\, y_1) \wedge \cdots \wedge (x_n\, a_n\, y_n))$
where
- $z$ is a tuple of variables from $\{x_1, \ldots, x_n, y_1, \ldots, y_n\}$ and
- $\{a_1, \ldots, a_n\} \subseteq \Sigma$
Main technical difference with CQs over relations: we only use binary relations here
Conjunctive Queries
By
$Q(o) = \exists z\, ((x_1\, a_1\, y_1) \wedge \cdots \wedge (x_n\, a_n\, y_n))$
we denote that
$Q = \exists z\, ((x_1\, a_1\, y_1) \wedge \cdots \wedge (x_n\, a_n\, y_n))$
is a conjunctive query and that $o \subseteq \{x_1, \ldots, x_n, y_1, \ldots, y_n\}$ is the tuple of free variables (or output variables)
Conjunctive Queries: Example
[Figure: graph database with the nodes Q3, Q15, Richard, Liz, Taylor, Burton and stage actor and edges labeled S, F, L and P; next to it, the query drawn as a graph on x, y and z]
Example (CQ on binary relations)
Q(x) = (x S y) ∧ (x P z) ∧ (y P z)
Returns: {Q3, Q15}
Homomorphism h_1: { x ↦ Q3, y ↦ Q15, z ↦ stage actor }
Homomorphism h_2: { x ↦ Q15, y ↦ Q3, z ↦ stage actor }
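The example can be reproduced with a brute-force evaluator that tries every assignment of nodes to variables and keeps those that are homomorphisms. This is a naive sketch, not an efficient algorithm, and the function name `evaluate_cq` is made up. The graph below contains only the S and P edges implied by h_1 and h_2; the F and L edges of the figure are left out because their endpoints cannot be recovered from the extracted text.

```python
from itertools import product


def evaluate_cq(atoms, output_vars, nodes, edges):
    """Evaluate a CQ whose atoms are (variable, label, variable) triples.

    Returns the set of tuples h(output_vars) over all homomorphisms h.
    """
    variables = sorted({v for (x, _, y) in atoms for v in (x, y)})
    answers = set()
    for values in product(sorted(nodes), repeat=len(variables)):
        h = dict(zip(variables, values))
        # h is a homomorphism iff every atom is mapped onto an edge.
        if all((h[x], a, h[y]) in edges for (x, a, y) in atoms):
            answers.add(tuple(h[v] for v in output_vars))
    return answers


# Only the edges that h_1 and h_2 require (the rest of the figure is omitted).
edges = {
    ("Q3", "S", "Q15"),
    ("Q15", "S", "Q3"),
    ("Q3", "P", "stage actor"),
    ("Q15", "P", "stage actor"),
}
nodes = {u for (u, _, _) in edges} | {v for (_, _, v) in edges}

# Q(x) = (x S y) ∧ (x P z) ∧ (y P z)
atoms = [("x", "S", "y"), ("x", "P", "z"), ("y", "P", "z")]
answers = evaluate_cq(atoms, ("x",), nodes, edges)
```

On this graph, `answers` consists of the one-element tuples for Q3 and Q15, matching the result stated on the slide. The running time is exponential in the number of variables, which is acceptable only for toy inputs.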
Regular Path Queries
Why regular path queries?
Conjunctive queries (and even first-order queries) on graphs are limited: they can only express "local" properties [Gaifman 1982, Hanf 1965]
Regular path queries overcome this, using regular expressions to query paths
Definition
A path in graph G is a sequence
$p = (v_0, a_1, v_1)(v_1, a_2, v_2) \cdots (v_{n-1}, a_n, v_n)$
of edges of G
Regular Path Queries
Definition
A regular path query (RPQ) is an expression of the form x r y, where x and y are variables and r is a regular expression over Σ
(Notice that r can only mention a finite subset of Σ)
Semantics
There are different semantics of RPQs in the literature!
- every path
- trail
- simple path
- shortest path
The differences between these are important
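The slide names the semantics without defining them. The sketch below assumes the usual graph-theoretic readings: a trail never repeats an edge, and a simple path never repeats a node (some authors additionally allow the first and last node to coincide, which is not done here). Every path semantics places no such restriction. The function names and the example paths are made up for illustration.

```python
def is_path(path):
    """Consecutive edges must connect: (v0,a1,v1)(v1,a2,v2)..."""
    return all(path[i][2] == path[i + 1][0] for i in range(len(path) - 1))


def is_trail(path):
    """Assumed definition: a path in which no edge occurs twice."""
    return is_path(path) and len(set(path)) == len(path)


def is_simple_path(path):
    """Assumed definition: a path in which no node occurs twice."""
    if not path:
        return True
    if not is_path(path):
        return False
    visited = [path[0][0]] + [edge[2] for edge in path]
    return len(set(visited)) == len(visited)


# Hypothetical paths over nodes 1..4 with labels a, b, c, d.
p_simple = [(1, "a", 2), (2, "d", 4)]
p_trail = [(1, "a", 2), (2, "b", 3), (3, "c", 2), (2, "d", 4)]
p_walk = [(1, "a", 2), (2, "b", 1), (1, "a", 2)]
```

Here `p_simple` qualifies under all three readings, `p_trail` revisits node 2 and is therefore a trail but not a simple path, and `p_walk` reuses the edge (1, a, 2) and is admitted only under every path semantics.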
Semantics of RPQs
Why will we consider these different semantics?
Each of these semantics is important:
- Every path semantics has been studied most in the literature
- (A variant of) simple path semantics was standard in SPARQL for a while
- Trail semantics is the default in Neo4j Cypher
- Simple path semantics was the first that was studied [Cruz, Mendelzon, Wood 1987]
Members of the OpenCypher project were discussing recently which of the semantics to use for Cypher (www.opencypher.org)
Consensus seems to be: "All should be supported"
Semantics of RPQs
Matching Paths
Let r be a regular expression and G be a graph
A path $p = (v_0, a_1, v_1)(v_1, a_2, v_2) \cdots (v_{n-1}, a_n, v_n)$ in G matches r if $a_1 a_2 \cdots a_n \in L(r)$
Semantics of RPQs (every path semantics)
Let Q = (x r y) be a regular path query and G be a graph
The semantics of Q on G = (V, E) is
$[\![Q]\!]_G = \{ (u, v) \in V \times V \mid \text{there exists a path } p \text{ from } u \text{ to } v \text{ in } G \text{ that matches } r \}$
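Under every path semantics, an RPQ can be evaluated by searching the product of the graph with an NFA for r. The following Python sketch uses this standard technique, which the slide itself does not spell out. It assumes the NFA is already given as a triple (transitions, initial states, accepting states), it treats the empty path from u to u as matching whenever the empty word is in L(r), and the small graph is a made-up example loosely modeled on the Wikidata query.

```python
from collections import deque


def evaluate_rpq(nodes, edges, nfa):
    """Return all (u, v) such that some path from u to v matches the NFA."""
    transitions, initial, accepting = nfa
    # Index graph edges by source node.
    out_edges = {}
    for (u, a, v) in edges:
        out_edges.setdefault(u, []).append((a, v))
    # Index NFA transitions by (state, symbol).
    step = {}
    for (s, a, t) in transitions:
        step.setdefault((s, a), set()).add(t)

    answers = set()
    for start in nodes:
        # Breadth-first search over pairs (graph node, NFA state).
        seen = {(start, s) for s in initial}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in accepting:
                answers.add((start, node))
            for (a, target) in out_edges.get(node, []):
                for next_state in step.get((state, a), ()):
                    pair = (target, next_state)
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return answers


# Hypothetical graph and a hand-built NFA for r = occupation subclassof*.
edges = {
    ("Jimi Hendrix", "occupation", "guitarist"),
    ("guitarist", "subclassof", "musician"),
    ("musician", "subclassof", "artist"),
}
nodes = {u for (u, _, _) in edges} | {v for (_, _, v) in edges}
nfa = (
    {(0, "occupation", 1), (1, "subclassof", 1)},  # transitions
    {0},                                           # initial states
    {1},                                           # accepting states
)
answers = evaluate_rpq(nodes, edges, nfa)
```

Each search visits every (node, state) pair at most once, so the search terminates even on cyclic graphs, where the number of matching paths can be infinite.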
Semantics of RPQs
RPQ Q = (x r y)
(u, v) is returned iff there is a path from u to v that matches r
[Figure: a graph G containing a path from node u to node v that matches r, marked with ✔]