Graph Databases Marco Serafini COMPSCI 532 Lecture 10
Graph DB Use cases • Social network queries • E.g. Facebook stores the entire metadata in a social graph • Network security • Find sequence of steps that lead to intrusion • Fraud detection • Find fraud rings • Knowledge bases • Answer questions, language models
Resource Description Framework • World Wide Web Consortium specification • Used for the Semantic Web • Web pages define human-readable content • Goal: add machine-readable metadata describing how pages relate • Format to reuse and share data across the Web • Examples • Wikipedia, census, life sciences, DBpedia • Directed labeled multi-graph
RDF Format • Graph is a set of triples = (Subject, Predicate, Object) • Subject and predicate are resources • Associated with Uniform Resource Identifiers (URIs) • Object can be a resource or a literal (string) From S. Decker et al., “Framework for the Semantic Web: An RDF Tutorial”
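To make the triple model concrete, here is a minimal Python sketch of an RDF-style graph held as a set of (subject, predicate, object) tuples. The URIs, the user identifiers, and the `objects` helper are hypothetical and chosen only for illustration; a real RDF store would index the triples rather than scan them.

```python
# Hypothetical namespace and resources, for illustration only.
SN = "http://socialnetwork.com/ontology/"

# A graph is a set of (subject, predicate, object) triples.
triples = {
    (SN + "user1", SN + "hasName", "alice01"),        # object is a literal
    (SN + "user1", SN + "isFriendOf", SN + "user2"),  # object is a resource
    (SN + "user2", SN + "livesIn", SN + "Paris"),     # object is a resource
}


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


friends_of_user1 = objects(SN + "user1", SN + "isFriendOf")
```

Because the same pair of resources may be linked by several predicates, the set of triples naturally represents a directed labeled multi-graph.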
Query Language: SPARQL • Declarative • Defines a query graph • RDF store must find all instances in data graph • Example • “Return friends of user alice01 who live in Paris” PREFIX sn: <http://socialnetwork.com/ontology/> SELECT ?friend WHERE { ?user sn:hasName "alice01"; sn:isFriendOf ?friend. ?friend sn:livesIn sn:Paris. }
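The idea of finding all instances of a query graph in the data graph can be sketched as a naive pattern matcher in Python. This is an assumption-laden illustration, not how a SPARQL engine is implemented: `match_pattern`, the abbreviated `sn:` strings, and the sample data are all hypothetical, and the matcher scans every triple instead of using indexes.

```python
def match_pattern(triples, patterns):
    """Return all variable bindings that satisfy every triple pattern.

    Terms starting with '?' are variables; all other terms are constants.
    """
    bindings = [{}]  # start with one empty partial match
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = dict(binding)
                ok = True
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # Bind the variable, or check it against an earlier binding.
                        if candidate.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        bindings = extended
    return bindings


# Hypothetical data, using "sn:" as a shorthand for the ontology prefix.
data = {
    ("sn:u1", "sn:hasName", "alice01"),
    ("sn:u1", "sn:isFriendOf", "sn:u2"),
    ("sn:u1", "sn:isFriendOf", "sn:u3"),
    ("sn:u2", "sn:livesIn", "sn:Paris"),
    ("sn:u3", "sn:livesIn", "sn:Rome"),
}

query = [
    ("?user", "sn:hasName", "alice01"),
    ("?user", "sn:isFriendOf", "?friend"),
    ("?friend", "sn:livesIn", "sn:Paris"),
]

friends = {b["?friend"] for b in match_pattern(data, query)}
```

Each triple pattern extends the set of partial bindings, which is the same "extend by one edge" strategy that a relational engine would express as a join.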
Property Graph Format • Vertices and edges can have associated properties • Key-value pairs • Vertices can be grouped by label • Similar to tables, e.g., employees • Properties are similar to columns of a table • Not a “global” format: no URIs required • Typically more compact than RDF • Common in NoSQL graph databases
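A property graph can be sketched in Python with two small record types. The class names, field names, and sample data below are hypothetical; actual graph databases use their own storage layouts.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    label: str                                  # groups vertices, like a table name
    props: dict = field(default_factory=dict)   # key-value pairs, like columns


@dataclass
class Edge:
    src: int
    dst: int
    label: str
    props: dict = field(default_factory=dict)


# Hypothetical sample data: local integer ids, no URIs required.
vertices = {
    1: Vertex(1, "User", {"name": "Alice"}),
    2: Vertex(2, "User", {"name": "Bob"}),
    3: Vertex(3, "City", {"name": "Paris"}),
}
edges = [
    Edge(1, 2, "isFriend", {"since": 2015}),
    Edge(2, 3, "livesIn"),
]


def by_label(label):
    """Return all vertices carrying the given label."""
    return [v for v in vertices.values() if v.label == label]


user_names = [v.props["name"] for v in by_label("User")]
```

Identifiers are local integers rather than global URIs, which is one reason this format tends to be more compact than RDF.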
Query Languages • Cypher • Originally used by Neo4j • Linear queries • Previous example in Cypher MATCH (u:User)-[:isFriend]->(f:User)-[:livesIn]->(:City {name: 'Paris'}) WHERE (u.name = 'Alice') RETURN f.name
Relational Representation of Graphs • Graphs in a relational DBMS • Vertex table, edge table • Sometimes edges as triples • Pattern matching • Maintain a set of partial matches • Extend by edge: self-join on edge table
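The "extend by edge" step can be illustrated with Python's built-in `sqlite3` module. The table name `edge`, its columns, and the sample edges are assumptions made for this sketch; each additional edge in the pattern adds one more self-join on the edge table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 1), (3, 4)],  # hypothetical directed edges
)

# Partial matches of length two: one self-join on the edge table.
two_hop = conn.execute(
    """
    SELECT e1.src, e1.dst, e2.dst
    FROM edge e1 JOIN edge e2 ON e1.dst = e2.src
    ORDER BY 1, 2, 3
    """
).fetchall()

# Directed triangles: extend each two-hop match by a closing edge.
triangles = conn.execute(
    """
    SELECT e1.src, e2.src, e3.src
    FROM edge e1
    JOIN edge e2 ON e1.dst = e2.src
    JOIN edge e3 ON e2.dst = e3.src AND e3.dst = e1.src
    ORDER BY 1, 2, 3
    """
).fetchall()
```

The single directed triangle in the sample data shows up three times, once per rotation, which hints at why intermediate results and duplicate matches become a concern for larger patterns.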
Why are Graph Workloads Hard? • Many joins: difficult to estimate cardinality • Joins require random access • Cardinality estimation gets harder at every join • Skew: few vertices have very high degree • Indexing • Adjacency list scans are very frequent • Graph-aware databases optimize these • Some queries have very low selectivity • E.g., triangle closure (potential friends)
Worst-Case Optimal Joins • Worst-Case Optimality • O(intermediate results) <= O(final results) • Edge-at-a-time approach is not worst-case optimal • Number of triangles: $O(|E|^{3/2})$ • Number of wedges: $O(|E|^{2})$ • Vertex-at-a-time (multi-way) joins are WCO • $(v_1, v_2)$, $(v_1, v_2, v_3)$, $(v_1, v_2, v_3, v_4)$, … • Will not materialize all wedges
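The gap between the two strategies can be seen on triangle enumeration. The Python sketch below is a simplified illustration under stated assumptions (undirected graph, in-memory adjacency sets, hypothetical function names); it is not a full worst-case optimal join algorithm. The edge-at-a-time version materializes every wedge before checking the closing edge, while the vertex-at-a-time version extends $(v_1, v_2)$ to $(v_1, v_2, v_3)$ by intersecting adjacency sets.

```python
from collections import defaultdict


def build_adj(edges):
    """Build undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def triangles_edge_at_a_time(adj):
    """Materialize all wedges (u, v, w) centered at v, then filter by the closing edge."""
    wedges = [(u, v, w) for v in adj for u in adj[v] for w in adj[v] if u < w]
    tris = {tuple(sorted(t)) for t in wedges if t[2] in adj[t[0]]}
    return tris, len(wedges)


def triangles_vertex_at_a_time(adj):
    """Extend (v1) -> (v1, v2) -> (v1, v2, v3) via adjacency-set intersection."""
    tris = set()
    for v1 in adj:
        for v2 in adj[v1]:
            if v1 < v2:
                for v3 in adj[v1] & adj[v2]:
                    if v2 < v3:
                        tris.add((v1, v2, v3))
    return tris


# Hypothetical skewed graph: a hub with 100 neighbors plus one extra edge.
edges = [(0, i) for i in range(1, 101)] + [(1, 2)]
adj = build_adj(edges)

tris_edge, num_wedges = triangles_edge_at_a_time(adj)
tris_vertex = triangles_vertex_at_a_time(adj)
```

On this graph both functions find the single triangle, but the edge-at-a-time version builds thousands of wedges around the hub first, which is the skew problem in miniature.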
Subgraph Isomorphism (TurboISO) • SubTask 1: match spanning tree from one starting vertex • SubTask 2: match cross-edges [Figure: example contrasting lightweight and heavyweight matching subtasks by their number of edge lookups]
TurboISO: Flexible Join Order
Hard to Parallelize [Figure: running time (ms)]
Subgraph Enumeration • Count all instances of an unlabeled pattern • E.g. triangles, squares, cliques • Important to rule out permutations
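Why permutations matter can be shown with a brute-force triangle count in Python. This is a didactic sketch with hypothetical function names, not an efficient enumeration algorithm: the naive version visits every ordered triple of vertices, so each triangle is found 3! = 6 times, while the canonical version only considers triples in increasing vertex order.

```python
from itertools import combinations, permutations


def is_triangle(adj, a, b, c):
    """Check that all three edges of the triangle exist."""
    return b in adj[a] and c in adj[b] and a in adj[c]


def count_triangles_naive(adj):
    """Counts every ordered triple, so each triangle is counted 6 times."""
    return sum(1 for a, b, c in permutations(adj, 3) if is_triangle(adj, a, b, c))


def count_triangles_canonical(adj):
    """Counts only triples with a < b < c, so each triangle is counted once."""
    return sum(
        1 for a, b, c in combinations(sorted(adj), 3) if is_triangle(adj, a, b, c)
    )


# Hypothetical example: the complete graph on 4 vertices has 4 triangles.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}

naive = count_triangles_naive(k4)
canonical = count_triangles_canonical(k4)
```

Imposing an order on the matched vertices is one simple form of symmetry breaking; larger patterns need ordering constraints derived from the pattern's automorphisms.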
Reachability Queries • Given two vertices v and u • Find (and/or rank) paths connecting them • Simplest approach: parallel BFS from both vertices • Expensive
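One reading of "BFS from both vertices" is a bidirectional breadth-first search, sketched below in Python for an undirected, unweighted graph. The function name and the dictionary-of-lists graph representation are assumptions for illustration. The sketch expands the smaller frontier one level at a time and returns the shortest path length once the two searches meet.

```python
def bidirectional_bfs(adj, source, target):
    """Return the shortest path length (in edges) between source and target, or None."""
    if source == target:
        return 0
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = [source], [target]
    while frontier_s and frontier_t:
        # Expand the smaller frontier to limit the number of visited vertices.
        expand_s = len(frontier_s) <= len(frontier_t)
        frontier = frontier_s if expand_s else frontier_t
        dist, other = (dist_s, dist_t) if expand_s else (dist_t, dist_s)
        next_frontier = []
        best = None
        for u in frontier:
            for v in adj.get(u, ()):
                if v in other:
                    # The two searches meet at edge (u, v).
                    length = dist[u] + 1 + other[v]
                    best = length if best is None else min(best, length)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        if best is not None:
            return best  # minimum over the whole level is the shortest length
        if expand_s:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None  # one side ran out of vertices: not reachable


# Hypothetical path graph 0 - 1 - 2 - 3 - 4 plus an isolated vertex 5.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
```

Even with two frontiers, a high-degree vertex can make a single level expansion touch a large part of the graph, which is why the approach remains expensive.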
Dynamic Graphs • Temporal analysis → Deal with multiple snapshots • Real-time analytics → Work on live graph data • Storage implications • Transactional system (updates): dynamic data structure + transactions, e.g., B-Tree, LSMT • Analytical system (load, results): read-only data structure, no transactions, e.g., CSR
Graph Storage for RT Analytics • Sequential adjacency list scan is important • CSR: Sequential scan but read-only • TEL: log-based adjacency list [Figure: three panels comparing TEL, B+Tree, LSMT, and Linked List as graph scale V grows from 2^20 to 2^26: cache misses (cache miss/edge), seek time (µs/vertex), edge scan (ns/edge)]
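The CSR (compressed sparse row) layout mentioned above can be sketched in Python with two flat lists. The function names and the sample edges are hypothetical; the sketch assumes vertices are numbered from 0 to `num_vertices - 1`.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from a list of directed (src, dst) edges."""
    # Count the out-degree of each vertex, shifted by one position.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix sum: offsets[v] is where the adjacency list of v starts.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Fill the targets array, tracking the next free slot per vertex.
    targets = [0] * len(edges)
    cursor = offsets[:-1]  # slicing creates a new list
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets, targets, v):
    """Adjacency list of v is one contiguous slice: a sequential scan."""
    return targets[offsets[v]:offsets[v + 1]]


# Hypothetical graph with 3 vertices and 4 directed edges.
offsets, targets = build_csr(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```

Each adjacency list is a contiguous slice, which makes scans sequential. Inserting an edge, however, would require shifting every later entry in `targets` and updating the offsets, which is why CSR is treated as a read-only structure.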
Open Issues • Graph analytics algorithms are diverse • Still looking for good APIs • There is no “SQL for graphs” • Hard to leverage hardware characteristics • Scale out to distributed systems: Hard because of edge cut • SIMD: hard because of skew and random access • Caching: hard because of random access