Graph Indexing: Tree + Δ ≥ Graph
Peixian Zhao, Jeffrey Xu Yu, Philip S. Yu
The Chinese University of Hong Kong, {pxzhao, yu}@se.cuhk.edu.hk
IBM Watson Research Center, psyu@us.ibm.com
An Overview
• Graph containment query
• The framework and query cost model
• Some existing path/graph based solutions
• A new tree-based approach
• Experimental studies
• Conclusion
Graph Containment Query
• Given a graph database G = { g_1, g_2, …, g_N } and a query graph q, find the set sup(q) = { g_i | q ⊆ g_i, g_i ∈ G }.
• It is infeasible to check subgraph isomorphism for every g_i in G, because subgraph isomorphism is NP-complete.
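To make the brute-force baseline concrete, here is a minimal sketch using networkx's VF2 matcher for the NP-complete test. It assumes each graph is a networkx.Graph whose nodes carry a 'label' attribute; the helper name sup mirrors the slide's notation and is not from the paper.

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

def sup(q, G_db):
    """Brute-force sup(q): scan every g_i in G_db with the exact
    (NP-complete) subgraph test -- precisely what indexing avoids."""
    node_match = lambda a, b: a.get("label") == b.get("label")
    return [i for i, g in enumerate(G_db)
            if GraphMatcher(g, q, node_match=node_match).subgraph_is_monomorphic()]

# toy usage: a one-edge database graph and a one-node query
g = nx.Graph(); g.add_node(0, label="C"); g.add_node(1, label="N"); g.add_edge(0, 1)
q = nx.Graph(); q.add_node(0, label="C")
print(sup(q, [g]))   # -> [0]
```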
The Framework
• Index construction generates a set of features, F, from the graph database G. Each feature, f, maintains the set of ids of the graphs in G that contain f, sup(f).
• Query processing is a filtering-verification process (sketched below).
  • The filtering phase uses the indexed features contained in the query graph, q, to compute the candidate set C_q = ∩ sup(f) over all indexed features f ⊆ q.
  • The verification phase checks subgraph isomorphism for every graph in C_q, pruning the false positives.
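A minimal sketch of the filtering-verification loop. It assumes an inverted index `index` mapping each feature to sup(f) as a set of graph ids, and a hypothetical `features_in(q)` that enumerates the indexed features contained in q; the exact test reuses the VF2 matcher from the previous sketch.

```python
from networkx.algorithms.isomorphism import GraphMatcher

def contains(g, q):
    """Exact (NP-complete) verification: does g contain q?"""
    node_match = lambda a, b: a.get("label") == b.get("label")
    return GraphMatcher(g, q, node_match=node_match).subgraph_is_monomorphic()

def query(q, G_db, index, features_in):
    # Filtering: C_q is the intersection of sup(f) over indexed features f in q.
    C_q = set(range(len(G_db)))
    for f in features_in(q):
        C_q &= index[f]                      # each feature shrinks C_q
    # Verification: the exact test prunes the false positives left in C_q.
    return [i for i in sorted(C_q) if contains(G_db[i], q)]
```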
Query Cost Model
• The cost of processing a graph containment query q upon G is modeled as cost(q) = C_f + |C_q| · C_v, where
  • C_f is the filtering cost, and
  • C_v is the per-candidate verification cost (NP-complete).
• Several facts:
  • Improving query performance is largely a matter of minimizing |C_q| (illustrated below).
  • The selected feature set F has a great impact on both C_f and |C_q|.
  • There is also an index construction cost: the cost of discovering the feature set F.
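A back-of-the-envelope instantiation of the cost model with made-up numbers (C_f = 1, C_v = 100, in arbitrary time units), showing why shrinking |C_q| dominates total cost:

```latex
\mathrm{cost}(q) = C_f + |C_q|\cdot C_v
\quad\Rightarrow\quad
\underbrace{1 + 500\cdot 100}_{|C_q|=500} = 50{,}001
\quad\text{vs.}\quad
\underbrace{1 + 50\cdot 100}_{|C_q|=50} = 5{,}001
```

A 10x smaller candidate set yields a nearly 10x faster query, so a more discriminative feature set F quickly pays for itself.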
Existing Solutions: Paths vs. Graphs
• Path-based indexing: GraphGrep (PODS'02)
  • All paths up to a certain length l_p are enumerated as indexing features (see the sketch below)
    – An efficient index construction process
    – Index size is determined by l_p
    – Limited pruning power, because the structural information is lost
• Graph-based indexing: gIndex (SIGMOD'04)
  • Discriminative frequent subgraphs are mined from G as indexing features
    – A costly index construction process
    – Compact index structure
    – Great pruning power, because structural information is well-preserved
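A sketch of GraphGrep-style path-feature enumeration. The function name and the 'label' node attribute are assumptions for illustration; the real GraphGrep index additionally stores per-graph path occurrence counts.

```python
import networkx as nx

def label_paths(g, max_len):
    """All label sequences of simple paths in g with at most max_len edges,
    canonicalized so a path and its reverse map to a single key."""
    feats = set()
    def dfs(path):
        labels = tuple(g.nodes[v]["label"] for v in path)
        feats.add(min(labels, labels[::-1]))
        if len(path) - 1 == max_len:
            return
        for w in g.neighbors(path[-1]):
            if w not in path:                 # simple paths only
                dfs(path + [w])
    for v in g.nodes:
        dfs([v])                              # includes length-0 paths (labels)
    return feats
```

Index construction is then just a map from each path feature to its sup-set over all graphs, which is cheap to build; but, as the slide notes, many non-isomorphic graphs share the same path sets, which is exactly why the pruning power is limited.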
Tree Features?
• Regarding paths and graphs as index features:
  • The cost of generating path features is small, but the candidate set can be large.
  • The cost of generating frequent graph features is high, but the candidate set can be small.
• The key observation: the majority of frequent graph-features (more than 95%) are trees.
• How well can tree features do?
A New Approach: Tree+Δ
• To explore the indexability of paths, trees, and graphs.
• A new approach, Tree+Δ:
  • Select frequent tree features.
  • Select a small number of discriminative graph-features that can prune graphs effectively, on demand, without costly graph mining.
Indexability of Path, Tree and Graph
• We consider three main factors to assess indexability:
  • The frequent feature set size: |F|
  • The feature selection (mining) cost: C_FS
  • The candidate set size: |C_q|
The Frequent Feature Set Size: |F|
• 95% of frequent graph features are trees. Why?
• Consider two non-tree frequent graph features g and g′.
  • By the Apriori principle, all of g's subtrees t_1, t_2, …, t_n are frequent.
  • Because of structural diversity and the variety of vertex/edge labels, there is little chance that the subtrees of g coincide with those of g′.
  • Each frequent graph-feature therefore contributes many distinct frequent subtrees, so trees dominate F.
Frequent Feature Distributions
[Figure: frequent feature distributions on the real dataset (AIDS antivirus screen dataset), N = 1,000, σ = 0.1]
The Feature Selection Cost: C_FS
• Given a graph database G and a minimum support threshold σ, discover the frequent feature set F from G.
• Graphs: two prohibitive operations are unavoidable
  – Subgraph isomorphism
  – Graph isomorphism
• Trees: one prohibitive operation is unavoidable
  – Tree-in-graph testing
• Paths: polynomial time
The Candidate Set Size: |C_q|
• The pruning power of a frequent feature f: power(f) = 1 − |sup(f)| / |G|.
• The pruning power of a frequent feature set S = { f_1, f_2, …, f_n }: power(S) = 1 − |sup(f_1) ∩ … ∩ sup(f_n)| / |G| (computed in the sketch below).
• Let the frequent subtree feature set of a graph g be T(g) = { t_1, t_2, …, t_n }. Then power(g) ≥ power(T(g)).
• Let the frequent subpath feature set of a tree t be P(t) = { p_1, p_2, …, p_n }. Then power(t) ≥ power(P(t)).
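A direct translation of these definitions, assuming the formulas stated above and representing each sup-set as a Python set of graph ids:

```python
def power(sup_f, db_size):
    """power(f) = 1 - |sup(f)| / |G|."""
    return 1.0 - len(sup_f) / db_size

def power_set(sup_sets, db_size):
    """power(S) = 1 - |sup(f1) ∩ ... ∩ sup(fn)| / |G|.
    An empty feature set prunes nothing."""
    if not sup_sets:
        return 0.0
    return 1.0 - len(set.intersection(*sup_sets)) / db_size
```

The two inequalities on the slide hold because sup(g) ⊆ ∩_t sup(t) over the subtrees t of g: any graph containing g necessarily contains every subtree of g, so g can only prune at least as much as T(g).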
The Pruning Power
[Figure: pruning power comparison on the real dataset (AIDS antivirus screen dataset), N = 1,000, σ = 0.1]
Indexability of Tree
• The frequent tree-feature set dominates (95%).
• Discovering frequent tree-features can be done much more efficiently than mining frequent general graph-features.
• Frequent tree features can contribute pruning power similar to that of frequent graph features.
Add Graph Features On Demand
• Consider a query graph q that contains a subgraph g.
  • If power(T(g)) ≈ power(g), there is no need to index the graph-feature g.
  • If power(g) >> power(T(g)), g should be selected as an index feature, because g is more discriminative than T(g) in terms of pruning.
• Select discriminative graph-features on demand, without mining the whole set of frequent graph-features from G (sketched below).
• The selected graph features form an additional index, denoted Δ, kept for later reuse.
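A sketch of the on-demand selection idea, not the paper's exact algorithm: `nontree_subgraphs`, `subtrees`, `sup`, and the `gap` threshold are hypothetical helpers/parameters, and sup(g) is evaluated only for query subgraphs rather than mined globally up front.

```python
def select_delta(q, G_db, nontree_subgraphs, subtrees, sup, gap=0.1):
    """Keep a non-tree subgraph g of q as an extra feature in Delta only
    if its pruning power clearly beats that of its subtree set T(g)."""
    n = len(G_db)
    delta = []
    for g in nontree_subgraphs(q):
        p_g = 1.0 - len(sup(g)) / n                       # power(g)
        t_sups = [sup(t) for t in subtrees(g)]
        p_T = (1.0 - len(set.intersection(*t_sups)) / n   # power(T(g))
               if t_sups else 0.0)
        if p_g - p_T > gap:                               # power(g) >> power(T(g))
            delta.append(g)                               # cache in Delta for reuse
    return delta
```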
Discriminative Ratio
• A discriminative ratio, ε(g), is defined to measure the similarity of pruning power between a graph-feature g and its subtrees T(g).
• A non-tree graph feature g is discriminative if ε(g) ≥ ε_0.
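One plausible formalization of ε(g), consistent with the pruning-power definitions above; this is an assumption for illustration, not necessarily the paper's exact definition:

```latex
% assumed reconstruction, not verbatim from the paper
\varepsilon(g) \;=\; \frac{\mathrm{power}(g)}{\mathrm{power}(T(g))} \;\ge\; 1,
\qquad g \text{ is discriminative iff } \varepsilon(g) \ge \varepsilon_0 \;(\varepsilon_0 > 1).
```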
Discriminative Graph Selection (1)
• Consider two graphs g and g′, where g ⊆ g′.
• If the gap between power(g′) and power(g) is large, reclaim g′ from G. Otherwise, do not reclaim g′ in the presence of g.
• Approximate the discriminative ratio between g′ and g in the presence of the frequent tree-features already discovered.
Discriminative Graph Selection (2)
• The occurrence probability of g in the graph database: Pr(g) = |sup(g)| / |G|.
• The conditional occurrence probability of g′ w.r.t. g: Pr(g′|g) = Pr(g′) / Pr(g) = |sup(g′)| / |sup(g)|, since g ⊆ g′ implies sup(g′) ⊆ sup(g).
• When Pr(g′|g) is small, g′ has a higher probability of being discriminative w.r.t. g (see the worked example below).
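A small numeric example with made-up support sizes: let |G| = 1,000, |sup(g)| = 100, and |sup(g′)| = 5 (legal, since g ⊆ g′ forces sup(g′) ⊆ sup(g)). Then:

```latex
\Pr(g) = \tfrac{100}{1000} = 0.1,
\qquad
\Pr(g' \mid g) = \frac{|\sup(g')|}{|\sup(g)|} = \tfrac{5}{100} = 0.05
```

So indexing g′ on top of g would shrink the candidate set a further 20x, and g′ is very likely discriminative w.r.t. g.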
Discriminative Graph Selection (3)
• Upper and lower bounds on Pr(g′|g) follow, because ε(g) ≥ ε_0 and ε(g′) ≥ ε_0.
• Recall: σ_x = |sup(x)| / |G|.
Discriminative Graph Selection (4)
• Because 0 ≤ Pr(g′|g) ≤ 1, the conditional occurrence probability Pr(g′|g) is solely upper-bounded by T(g′).