Experimental Study of Context-Free Path Query Evaluation Methods Jochem Kuijpers Fifth openCypher Implementers Meeting Berlin 2019
Introduction ● MSc student CS & Eng. at TU/e ● Academic internship at Neo4j ● Supervised by: George Fletcher and Nikolay Yakovets (TU/e Database Group), Tobias Lindaaker (Neo4j) ● We implemented and evaluated four methods for computing context-free path query results
Context-Free Grammars ● Example: the language of even-length palindromes over {a, b} = { ε, a a, b b, a a a a, a b b a, b a a b, … } ● A grammar that generates this language: S ⇒ a S a, S ⇒ b S b, S ⇒ ε
Context-Free Grammars ● Example derivation of the string a b b a: S ⇒ a S a ⇒ a b S b a ⇒ a b b a (apply S ⇒ a S a, then S ⇒ b S b, then S ⇒ ε)
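As a minimal sketch of how the three rules of the palindrome grammar drive a derivation, the following recognizer applies them in reverse: it peels matching outer symbols off the string until only ε is left. The function name `derives` is made up for this illustration and is not part of the evaluated implementations.

```python
def derives(s: str) -> bool:
    """Return True if S derives s under S => a S a | b S b | ε."""
    if s == "":
        return True                      # S => ε
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "ab":
        return derives(s[1:-1])          # S => a S a  or  S => b S b
    return False                         # no rule applies


for word in ["", "aa", "abba", "baab", "aba", "ab"]:
    print(repr(word), derives(word))
```

Each recursive call corresponds to one derivation step, so "abba" is accepted after two peeling steps followed by the ε rule.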
Context-Free Path Query ● A query is a context-free grammar ● The terminals of the grammar are edge labels ● Find paths whose sequence of edge labels forms a word accepted by the grammar
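To make the definition concrete, here is a brute-force sketch that enumerates every path up to a fixed length and tests its label word against a recognizer. It is exponential and is not one of the four evaluated methods. The edge encoding `(source, label, target)`, the bound `max_len`, and the function names are assumptions made for this illustration only.

```python
def is_even_palindrome(word):
    """Recognizer for the even-length palindrome language (any label sequence)."""
    return len(word) % 2 == 0 and word == word[::-1]


def matching_pairs(edges, accepts, max_len):
    """Vertex pairs (u, v) joined by a path of at most max_len edges
    whose label word is accepted by `accepts`."""
    vertices = {u for u, _, _ in edges} | {v for _, _, v in edges}
    result = set()
    # Each frontier entry is (start vertex, current vertex, labels so far).
    frontier = [(v, v, ()) for v in vertices]
    for _ in range(max_len + 1):
        next_frontier = []
        for start, cur, word in frontier:
            if accepts(word):
                result.add((start, cur))
            for u, label, v in edges:
                if u == cur:
                    next_frontier.append((start, v, word + (label,)))
        frontier = next_frontier
    return result


edges = [(1, "a", 2), (2, "b", 3), (3, "b", 4), (4, "a", 5)]
pairs = matching_pairs(edges, is_even_palindrome, max_len=4)
print(sorted(pairs))
```

On this chain graph the result contains (2, 4) via the word b b and (1, 5) via a b b a, plus every pair (v, v) because the empty path spells ε.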
Context-Free Path Query ● Why? ● Increased expressiveness w.r.t. regular expressions (regular path query) ● Use-cases in ○ biological data analysis ○ static code analysis ○ …
Our work ● We implemented four context-free path query evaluation methods ● Used Neo4j components ○ Graph store (vertices and edges) ○ PageCache ● Query evaluation is separately implemented on top of these components ○ (not integrated into Cypher)
The evaluated methods
1. Annotating the context-free grammar
   Hellings, Jelle. "Path results for context-free grammar queries on graphs." arXiv preprint arXiv:1502.02242 (2015).
2. Matrix multiplication (GPGPU)
   Azimov, Rustam, and Semyon Grigorev. "Context-free path querying by matrix multiplication." Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). ACM, 2018.
3. Adapted GLR (Tomita) parser
   Santos, Fred C., Umberto S. Costa, and Martin A. Musicante. "A Bottom-Up Algorithm for Answering Context-Free Path Queries in Graph Databases." International Conference on Web Engineering. Springer, Cham, 2018.
4. Adapted Earley parser
   Sevon, Petteri, and Lauri Eronen. "Subgraph queries by context-free grammars." Journal of Integrative Bioinformatics 5.2 (2008): 157-172.
1. Annotating the grammar
● Grammar in Chomsky Normal Form: S ⇒ A B, A ⇒ a, B ⇒ b
● Annotate the grammar: A[u,v] ⇔ there exists an A-path from u to v (likewise for every other non-terminal)
● The terminal rules annotate the edges of the example graph: A[1,4], A[2,1], A[3,4] and B[2,3], B[4,2]
● Combining with S ⇒ A B: S[1,2] (from A[1,4] and B[4,2]) and S[3,2] (from A[3,4] and B[4,2])
● ⇒ (1,2) and (3,2) are the vertex pairs matching the grammar
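The annotation step can be sketched as a naive fixpoint computation. This is a simplified illustration, not the worklist algorithm from the cited paper: it assumes a grammar in Chomsky Normal Form without ε rules, and the function name `annotate` and the tuple encodings of rules and edges are invented for this example. The edge list is inferred from the annotations A[1,4], A[2,1], A[3,4], B[2,3], B[4,2].

```python
from collections import defaultdict


def annotate(edges, terminal_rules, binary_rules):
    """Compute facts[X] = {(u, v) : there is an X-path from u to v}.

    edges:          iterable of (u, label, v)
    terminal_rules: iterable of (X, label)   for rules X => label
    binary_rules:   iterable of (X, Y, Z)    for rules X => Y Z
    """
    facts = defaultdict(set)
    # Terminal rules: every matching edge yields an annotation.
    for u, label, v in edges:
        for head, term in terminal_rules:
            if term == label:
                facts[head].add((u, v))
    # Binary rules: join Y[u,v] with Z[v,w] until nothing new appears.
    changed = True
    while changed:
        changed = False
        for head, left, right in binary_rules:
            new = {(u, w)
                   for (u, v) in facts[left]
                   for (v2, w) in facts[right]
                   if v == v2} - facts[head]
            if new:
                facts[head] |= new
                changed = True
    return facts


edges = [(1, "a", 4), (2, "a", 1), (3, "a", 4), (2, "b", 3), (4, "b", 2)]
facts = annotate(edges,
                 terminal_rules=[("A", "a"), ("B", "b")],
                 binary_rules=[("S", "A", "B")])
print(sorted(facts["S"]))
```

The loop terminates because each non-terminal can hold at most |V|² pairs and every pass either adds a pair or stops.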
2. Matrix Multiplication
● Relation matrix representation of the annotated grammar method
● Cell (u,v) of the matrix stores the non-terminals X for which X[u,v] holds
● The step of combining X ⇒ Y Z is implemented as a “multiplication”
● Can be implemented on GPU
Initial matrix (terminal rules only):
      1  2  3  4
  1   .  .  .  A
  2   A  .  B  .
  3   .  .  .  A
  4   .  B  .  .
After applying S ⇒ A B:
      1  2  3  4
  1   .  S  .  A
  2   A  .  B  .
  3   .  S  .  A
  4   .  B  .  .
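The "multiplication" can be illustrated on the CPU with one Boolean matrix per non-terminal. This is a sketch under assumptions: it is plain Python rather than the GPU implementation, it assumes a Chomsky Normal Form grammar without ε rules and vertices numbered 1..n, and the names `bool_matmul` and `cfpq_matrix` are hypothetical.

```python
def bool_matmul(Y, Z):
    """Boolean matrix product: result[i][j] is True iff some k has Y[i][k] and Z[k][j]."""
    n = len(Y)
    return [[any(Y[i][k] and Z[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cfpq_matrix(n, edges, terminal_rules, binary_rules):
    """Return {X: n x n Boolean matrix} with M[X][u-1][v-1] True iff X[u,v] holds."""
    nonterminals = ({head for head, _ in terminal_rules}
                    | {x for rule in binary_rules for x in rule})
    M = {X: [[False] * n for _ in range(n)] for X in nonterminals}
    for u, label, v in edges:
        for head, term in terminal_rules:
            if term == label:
                M[head][u - 1][v - 1] = True
    changed = True
    while changed:
        changed = False
        for head, left, right in binary_rules:
            product = bool_matmul(M[left], M[right])   # the X => Y Z step
            for i in range(n):
                for j in range(n):
                    if product[i][j] and not M[head][i][j]:
                        M[head][i][j] = True
                        changed = True
    return M


edges = [(1, "a", 4), (2, "a", 1), (3, "a", 4), (2, "b", 3), (4, "b", 2)]
M = cfpq_matrix(4, edges, [("A", "a"), ("B", "b")], [("S", "A", "B")])
s_pairs = {(i + 1, j + 1) for i in range(4) for j in range(4) if M["S"][i][j]}
print(sorted(s_pairs))
```

Keeping a separate Boolean matrix per non-terminal is one possible encoding of the set-valued cells; it turns each rule X ⇒ Y Z into an ordinary Boolean product, which is the part that maps well to a GPU.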
3. Adapted GLR (Tomita) parser ● GLR is a generalization of LR parsers ● Uses a context-free grammar to parse input strings ● Whenever the parser has multiple options, the parse state is duplicated and each option is explored separately ● If at least one of these options leads to acceptance, the input is accepted ● A shared data structure (the graph-structured stack) reduces duplicate work
3. Adapted GLR (Tomita) parser ● Adaptations for graph parsing instead of string parsing: ● A separate parse state is initialized for each vertex ● Consumes edges instead of string symbols ● When a parse state accepts at vertex w, it is traced back to the vertex v where parsing started ○ Emits result (v,w) ● The data structure helps keep duplicate work low ● Under some conditions this algorithm terminates too early ○ It then fails to produce some results
4. Subgraph Parsing ● Similar to the previous method, this is a string parser (Earley parser) adapted for graph input ● Upon acceptance at vertex v, backtracking is used to find all paths that are accepted at v; these paths are added to a new graph ● Query result is the induced subgraph of accepted paths! ● Termination problem ○ This algorithm depends on a maximum length parameter to stop ○ This makes it unsuitable for matching paths of arbitrary length ○ Further: under some conditions it misses results or returns no results at all
Results ● Grammar 1: S ⇒ A B C, A ⇒ a a, A ⇒ a⁻¹ a⁻¹, B ⇒ b B, B ⇒ b, C ⇒ c C c⁻¹, C ⇒ D, D ⇒ d
Results ● Grammar 2: S ⇒ a X a⁻¹, X ⇒ b X b⁻¹, X ⇒ c X c⁻¹, X ⇒ d
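The slides do not spell out how the inverse labels a⁻¹, b⁻¹, c⁻¹ are evaluated. A common convention, assumed here, is that x⁻¹ means traversing an x-labeled edge against its direction. Under that assumption a graph can be prepared by adding one reversed edge per original edge, as in the sketch below. The helper name `with_inverse_edges` and the textual suffix "^-1" are choices made for this illustration.

```python
def with_inverse_edges(edges):
    """Return the edges plus, for each (u, label, v), a reversed edge (v, label + "^-1", u)."""
    return list(edges) + [(v, label + "^-1", u) for u, label, v in edges]


edges = [(1, "a", 2), (2, "d", 3), (4, "a", 3)]
extended = with_inverse_edges(edges)
for edge in extended:
    print(edge)
```

In the extended graph the path 1 → 2 → 3 → 4 spells a d a⁻¹, which Grammar 2 derives through S ⇒ a X a⁻¹ and X ⇒ d, so (1, 4) would be a matching pair under this reading.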
Results
● Highly ambiguous grammar: S ⇒ X, X ⇒ X X, X ⇒ a, X ⇒ b
● Tested on a small (a,b)-labeled graph of just 50 vertices

  Method                           Time (s)   Memory (MB)
  GLR (list)                        2,798.6          3.15
  GLR (matrix)                        372.0          2.36
  Annotated grammar (relational)        0.7          0.31
  Annotated grammar (arbitrary)         0.7          0.48
  Annotated grammar (shortest)          3.7          1.55
  Annotated grammar (all-path)          2.8          9.09
  Matrix multiplication                 0.1        < 0.01
Conclusions ● Context-free path query (CFPQ) evaluation is not real-time ○ For a graph of 15,000 vertices, run time typically exceeds 1 hour ● Requires large amounts of memory ○ Grammar 2 at 5,000 vertices required multiple gigabytes of memory for most methods ● Annotating the grammar seems most promising ○ Robust, can handle ambiguous grammars well ○ Many possible query semantics ○ Running time: arbitrary path ≈ all-path
Future work ● Specialized methods for more restrictive grammars could be much faster ● The annotated grammar and the matrix representation could serve as a path index or reachability index respectively ○ Related to path index work being done at Neo4j