Leiden University
Efficient Frequent Query Discovery in FARMER
Siegfried Nijssen and Joost N. Kok
ECML/PKDD-2003, Cavtat

Introduction
• Frequent structure mining: given a set of complex structures (molecules, access logs, graphs, (free) trees, ...), find substructures that occur frequently
• Frequent structure mining approaches:
  – Specialized: efficient algorithms for sequences, trees (Freqt, uFreqT) and graphs (gSpan, FSG)
  – General: ILP algorithms (WARMR), biased graph mining algorithms (B-AGM)
September 25, 2003, Cavtat ECML/PKDD-2003

Introduction
• [Yan, SIGKDD'2003] Comparison between gSpan and WARMR on confirmed active AIDS molecules:
  – WARMR: 6400 s
  – gSpan: 2 s
• Our goal: to build an efficient WARMR-like algorithm

Overview
• Problem description
• Optimizations:
  – Use a bias for tight problem specifications
  – Perform a depth-first search
  – Use efficient data structures in a new complete enumeration strategy which combines pruning with candidate generation
  – Speed up evaluation by storing intermediate evaluation results; construct low-cost queries
• Experiments & conclusions

Problem description
• The task of the algorithm is: given a database of Datalog facts, find a set of queries that occur frequently

Database of Facts
[Figure: two edge-labelled graphs, g1 (nodes n1 to n5) and g2 (nodes n6 and n7)]
• { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
    e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,c),
    e(g2,n6,n7,b) }

Queries
[Figure: the query drawn as a graph over the nodes N1 to N5 with edge labels a and b]
• k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)

Queries - Bias
• For a fixed set of predicates, many kinds of queries are possible:
  – k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)
  – k(G) ← e(G,N1,N2,L), e(G,N2,N3,L), e(G,N1,N4,L), e(G,N4,N5,L)
• Our algorithm requires the user to specify a mode bias with types, primary keys, atom variable constraints, ...

Occurrence of Queries
• Database D:
  { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
    e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
    e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b) }
• Query Q: k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)
• Example substitution: θ = { G/g1, N1/n2, N2/n1, N3/n2, N4/n3, N5/n1 }
• (WARMR) θ-subsumption: D subsumes Q iff there is a substitution θ such that (Qθ) ⊆ D

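The θ-subsumption test on this slide can be illustrated with a small backtracking search. This is a minimal sketch, not the WARMR implementation: the tuple encoding of facts and the helper names `is_var`, `match`, `apply_theta` and `theta_subsumes` are assumptions made for illustration.

```python
# Facts and query atoms are tuples of strings. Following the Datalog
# convention, terms starting with an uppercase letter are variables.
D = [
    ("e", "g1", "n1", "n2", "a"), ("e", "g1", "n2", "n1", "a"),
    ("e", "g1", "n2", "n3", "a"), ("e", "g1", "n3", "n1", "b"),
    ("e", "g1", "n3", "n4", "b"), ("e", "g1", "n3", "n5", "a"),
    ("e", "g1", "n4", "n2", "a"), ("e", "g1", "n4", "n5", "b"),
    ("e", "g2", "n6", "n7", "b"),
]
Q = [
    ("e", "G", "N1", "N2", "a"), ("e", "G", "N2", "N3", "a"),
    ("e", "G", "N1", "N4", "a"), ("e", "G", "N4", "N5", "b"),
]

def is_var(term):
    return term[0].isupper()

def apply_theta(atom, theta):
    """Replace every bound variable in atom by its value under theta."""
    return tuple(theta.get(t, t) for t in atom)

def match(atom, fact, theta):
    """Return theta extended so that atom maps onto fact, or None."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    theta = dict(theta)  # work on a copy so failed attempts leave no trace
    for t, c in zip(atom[1:], fact[1:]):
        if is_var(t):
            if theta.setdefault(t, c) != c:
                return None
        elif t != c:
            return None
    return theta

def theta_subsumes(db, body, theta=None):
    """Return some theta with every atom of body*theta in db, or None."""
    if theta is None:
        theta = {}
    if not body:
        return theta
    for fact in db:
        extended = match(body[0], fact, theta)
        if extended is not None:
            result = theta_subsumes(db, body[1:], extended)
            if result is not None:
                return result
    return None

# The substitution given on the slide; note that N1 and N3 share a node.
slide_theta = {"G": "g1", "N1": "n2", "N2": "n1",
               "N3": "n2", "N4": "n3", "N5": "n1"}
```

Applying `slide_theta` to each atom of `Q` yields facts of `D`, even though two pairs of query variables are mapped to the same database node.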
Occurrence of Queries
[Figure: the database graphs g1 and g2, with the query graph (nodes N1 to N5) mapped onto g1]
Counterintuitive!

Occurrence of Queries
Equivalent:
  – k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N3,N2,a), e(G,N3,N4,a)
  – k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N3,N2,a)
Counterintuitive!

Occurrence of Queries
• (FARMER here) OI-subsumption: D subsumes Q iff there is a substitution θ such that (Qθ) ⊆ D and:
  – θ is injective
  – θ does not map to constants in Q
• Advantages over θ-subsumption:
  – in many situations (e.g. graphs) more intuitive
  – if queries are equivalent, they are alphabetic variants; mode refinement is easier (proper)
• Disadvantages?

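The two extra conditions of OI-subsumption can be added to a backtracking matcher as follows. This is a minimal sketch under assumptions: the tuple encoding and the names `oi_match` and `oi_subsumes` are illustrative and not taken from FARMER. The database is the nine-fact example used in the preceding slides.

```python
D = [
    ("e", "g1", "n1", "n2", "a"), ("e", "g1", "n2", "n1", "a"),
    ("e", "g1", "n2", "n3", "a"), ("e", "g1", "n3", "n1", "b"),
    ("e", "g1", "n3", "n4", "b"), ("e", "g1", "n3", "n5", "a"),
    ("e", "g1", "n4", "n2", "a"), ("e", "g1", "n4", "n5", "b"),
    ("e", "g2", "n6", "n7", "b"),
]

def is_var(term):
    # Uppercase initial letter marks a variable (Datalog convention).
    return term[0].isupper()

def oi_match(atom, fact, theta, forbidden):
    """Extend theta so atom maps onto fact while keeping theta injective
    and never mapping a variable to a constant of the query."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    theta = dict(theta)
    for t, c in zip(atom[1:], fact[1:]):
        if not is_var(t):
            if t != c:
                return None
        elif c in forbidden:
            return None          # would map a variable to a query constant
        elif t in theta:
            if theta[t] != c:
                return None
        elif c in theta.values():
            return None          # value already used: not injective
        else:
            theta[t] = c
    return theta

def oi_subsumes(db, body, theta=None, forbidden=None):
    """Return an OI substitution for body in db, or None."""
    if theta is None:
        theta = {}
    if forbidden is None:
        forbidden = {t for atom in body for t in atom[1:] if not is_var(t)}
    if not body:
        return theta
    for fact in db:
        extended = oi_match(body[0], fact, theta, forbidden)
        if extended is not None:
            result = oi_subsumes(db, body[1:], extended, forbidden)
            if result is not None:
                return result
    return None

Q = [("e", "G", "N1", "N2", "a"), ("e", "G", "N2", "N3", "a"),
     ("e", "G", "N1", "N4", "a"), ("e", "G", "N4", "N5", "b")]
PATH = [("e", "G", "N1", "N2", "a"), ("e", "G", "N2", "N3", "a")]
```

With this database, `oi_subsumes(D, Q)` finds no substitution, because every match of `Q` in g1 needs N1 and N3 to share a node, while the shorter query `PATH` still has an injective match.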
Frequency
• Database D:
  { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
    e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
    e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b) }
• Query Q: k(G) ← e(G,N1,N2,a)
• Frequency freq(Q): the number of different values for G for which the body is subsumed by the database.

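The frequency definition can be sketched by enumerating substitutions and collecting the distinct values bound to the key variable G. The code below is an illustrative assumption, not the FARMER evaluation procedure: it uses OI-subsumption, enumerates every substitution rather than stopping at the first one per key value, and the names `oi_substitutions` and `frequency` are hypothetical.

```python
D = [
    ("e", "g1", "n1", "n2", "a"), ("e", "g1", "n2", "n1", "a"),
    ("e", "g1", "n2", "n3", "a"), ("e", "g1", "n3", "n1", "b"),
    ("e", "g1", "n3", "n4", "b"), ("e", "g1", "n3", "n5", "a"),
    ("e", "g1", "n4", "n2", "a"), ("e", "g1", "n4", "n5", "b"),
    ("e", "g2", "n6", "n7", "b"),
]

def is_var(term):
    return term[0].isupper()

def oi_match(atom, fact, theta, forbidden):
    """Extend theta injectively so that atom maps onto fact, or None."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    theta = dict(theta)
    for t, c in zip(atom[1:], fact[1:]):
        if not is_var(t):
            if t != c:
                return None
        elif c in forbidden:
            return None
        elif t in theta:
            if theta[t] != c:
                return None
        elif c in theta.values():
            return None
        else:
            theta[t] = c
    return theta

def oi_substitutions(db, body, theta, forbidden):
    """Yield every OI substitution that extends theta and covers body."""
    if not body:
        yield theta
        return
    for fact in db:
        extended = oi_match(body[0], fact, theta, forbidden)
        if extended is not None:
            yield from oi_substitutions(db, body[1:], extended, forbidden)

def frequency(db, body, key="G"):
    """Count the distinct values of the key variable over all matches."""
    forbidden = {t for atom in body for t in atom[1:] if not is_var(t)}
    return len({th[key] for th in oi_substitutions(db, body, {}, forbidden)})
```

For the query on this slide, `frequency(D, [("e", "G", "N1", "N2", "a")])` counts only g1, since g2 has no edge labelled a; replacing the label by b counts both graphs.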
Monotonicity
• Frequent: freq(Q) ≥ minsup, for a predefined threshold value minsup
• Monotonicity: if Q2 OI-subsumes Q1, then freq(Q1) ≥ freq(Q2)
  ⇒ if a query is infrequent, it should not be refined
  ⇒ if a query is subsumed by an infrequent query, it should not be considered

FARMER
FARMER(Query Q):
  1. determine refinements of Q
  2. compute frequency of refinements
  3. sort refinements
  4. for each frequent refinement Q' do FARMER(Q')

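The recursion above can be written as a short depth-first skeleton. This is only a sketch: `refine` and `freq` are hypothetical callables standing in for FARMER's refinement operator and frequency computation, the sort key (descending frequency) is an assumption because the slide does not state one, and the demonstration uses plain itemsets as a stand-in for Datalog queries.

```python
def farmer(query, refine, freq, minsup, output):
    """Depth-first search over refinements, following the four steps."""
    refinements = refine(query)                        # 1. determine refinements
    counted = [(freq(r), r) for r in refinements]      # 2. compute frequencies
    counted.sort(key=lambda fr: fr[0], reverse=True)   # 3. sort (assumed order)
    for f, r in counted:                               # 4. recurse if frequent
        if f >= minsup:
            output.append((r, f))
            farmer(r, refine, freq, minsup, output)
    return output

# Stand-in problem: itemsets instead of queries (illustration only).
TRANSACTIONS = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
ITEMS = ["a", "b", "c"]

def refine_itemset(itemset):
    # Append only items larger than the last one, so each set is built once.
    return [itemset + (i,) for i in ITEMS if not itemset or i > itemset[-1]]

def freq_itemset(itemset):
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)

found = farmer((), refine_itemset, freq_itemset, minsup=2, output=[])
```

Infrequent refinements are never passed to the recursive call, which is where the monotonicity property saves work.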
Determine Refinements
• Only one variant of each query should be counted and output
• Main problem: query equivalency under OI has graph isomorphism complexity
• Our approach:
  – use ordered tree-based heuristics
  – use efficient data structures to determine equivalency
  – also perform other pruning during the exponential search

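To make the equivalency problem concrete, the sketch below tests whether two queries are alphabetic variants by trying every bijective renaming of variables. It is a brute-force illustration, not the tree-based heuristics of FARMER, and the names `variables` and `alphabetic_variants` are assumptions. Each query body is assumed to be a list of distinct atoms.

```python
from itertools import permutations

def is_var(term):
    return term[0].isupper()

def variables(query):
    """Variables of a query in order of first appearance."""
    seen = []
    for atom in query:
        for t in atom[1:]:
            if is_var(t) and t not in seen:
                seen.append(t)
    return seen

def alphabetic_variants(q1, q2):
    """True if some bijective variable renaming turns q1 into q2."""
    v1, v2 = variables(q1), variables(q2)
    if len(q1) != len(q2) or len(v1) != len(v2):
        return False
    target = set(q2)
    for perm in permutations(v2):          # factorial number of renamings
        renaming = dict(zip(v1, perm))
        renamed = {tuple(renaming.get(t, t) for t in atom) for atom in q1}
        if renamed == target:
            return True
    return False

Q1 = [("e", "G", "N1", "N2", "a"), ("e", "G", "N2", "N3", "b")]
Q2 = [("e", "G", "N2", "N3", "a"), ("e", "G", "N3", "N1", "b")]
Q3 = [("e", "G", "N1", "N2", "a"), ("e", "G", "N1", "N3", "b")]
```

`Q1` and `Q2` differ only in variable names, while `Q3` has a different shape. The factorial loop shows why a naive check does not scale and why more efficient data structures are needed.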
Determine Refinements
• [IJCAI'01]
[Figure: query tree whose nodes are labelled with the atoms e(G,N1,N2,a), e(G,N1,N2,b), e(G,N3,N4,b), e(G,N1,N3,a), e(G,N2,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,a)]

Determine Refinements
[Figure: query tree with nodes e(G,N1,N2,a), e(G,N1,N2,b), e(G,N1,N3,a), e(G,N3,N4,a); the atoms e(G,N2,N3,a), e(G,N2,N3,b) and e(G,N1,N3,b) each appear at two nodes]

Determine Refinements
• (In the paper) we prove that
  – Refinement with this strategy is complete: of every frequent query defined by the bias, at least one variant is found
  – The order of siblings does not matter for completeness (but they must have some order)

Determine Refinements
• Incrementally generate variants
• Search for the variant (under construction) in the existing part of the query tree
• To optimize this search, siblings are stored in a tree-like hash structure
• If a query is found that is infrequent ⇒ query Q is pruned (monotonicity constraint!)

Frequency Computation
• Main problem: finding an OI substitution is as hard as subgraph isomorphism, and is therefore NP-complete
• Our approach: avoid, as far as possible, performing the same (exponential) computation twice

Frequency Computation
• D = { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
        e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
        e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b) }
• Q = k(G) ← e(G,N1,N2,b)
• For each value of G for which the database subsumes the query, the 'first' substitution is stored

Frequency Computation
• Once a query is refined, the first subsuming substitution has to be determined for each refinement
• This computation is performed in one backtracking procedure for all refinements together (like query packs)
• This search starts from the substitution of the original query

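The idea of reusing stored substitutions can be illustrated as follows. This is a simplification and not the procedure described above: it handles one refinement at a time, first tries to extend the stored substitution with the new atom, and otherwise falls back to a fresh search for that value of G, whereas the slides describe a single shared backtracking procedure for all refinements. All function names are hypothetical.

```python
D = [
    ("e", "g1", "n1", "n2", "a"), ("e", "g1", "n2", "n1", "a"),
    ("e", "g1", "n2", "n3", "a"), ("e", "g1", "n3", "n1", "b"),
    ("e", "g1", "n3", "n4", "b"), ("e", "g1", "n3", "n5", "a"),
    ("e", "g1", "n4", "n2", "a"), ("e", "g1", "n4", "n5", "b"),
    ("e", "g2", "n6", "n7", "b"),
]

def is_var(term):
    return term[0].isupper()

def constants(body):
    return {t for atom in body for t in atom[1:] if not is_var(t)}

def oi_match(atom, fact, theta, forbidden):
    """Extend theta injectively so that atom maps onto fact, or None."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    theta = dict(theta)
    for t, c in zip(atom[1:], fact[1:]):
        if not is_var(t):
            if t != c:
                return None
        elif c in forbidden:
            return None
        elif t in theta:
            if theta[t] != c:
                return None
        elif c in theta.values():
            return None
        else:
            theta[t] = c
    return theta

def oi_substitutions(db, body, theta, forbidden):
    """Yield OI substitutions extending theta, in database order."""
    if not body:
        yield theta
        return
    for fact in db:
        extended = oi_match(body[0], fact, theta, forbidden)
        if extended is not None:
            yield from oi_substitutions(db, body[1:], extended, forbidden)

def first_substitutions(db, body, key="G"):
    """Store, per value of the key variable, the first substitution found."""
    firsts = {}
    for theta in oi_substitutions(db, body, {}, constants(body)):
        firsts.setdefault(theta[key], theta)
    return firsts

def evaluate_refinement(db, body, new_atom, stored, key="G"):
    """First substitutions of body + [new_atom], reusing stored ones."""
    refined = body + [new_atom]
    forbidden = constants(refined)
    result = {}
    for g, theta in stored.items():   # by monotonicity, no other G can match
        found = None
        if forbidden.isdisjoint(theta.values()):
            # Cheap attempt: keep the stored bindings, match only the new atom.
            found = next(oi_substitutions(db, [new_atom], theta, forbidden), None)
        if found is None:
            # Fallback: search the whole refined body for this value of G.
            found = next(oi_substitutions(db, refined, {key: g}, forbidden), None)
        if found is not None:
            result[g] = found
    return result

Q = [("e", "G", "N1", "N2", "b")]
REFINEMENTS = [("e", "G", "N2", "N3", "a"), ("e", "G", "N2", "N3", "b"),
               ("e", "G", "N1", "N3", "b"), ("e", "G", "N3", "N4", "b")]
stored = first_substitutions(D, Q)
```

Only the values of G that matched the original query are revisited, so a refinement is never evaluated against graphs that the parent query already failed on.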
Frequency Computation
• D = { e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
        e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
        e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b) }
• Q = k(G) ← e(G,N1,N2,b)
• Refinement atoms shown with Q: e(G,N2,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,b)
