Mind the Gap: Large-Scale Frequent Sequence Mining
Iris Miliaraki, Klaus Berberich, Rainer Gemulla, Spyros Zoupanos
Max Planck Institute for Informatics, Saarbrücken, Germany
27th June 2013, New York, SIGMOD 2013
Why are sequences interesting?
• Various applications
Sequences with gaps
• Generalization of n-grams to sequences with gaps
 – sunny [...] New York
 – rainy [...] New York
• Exposes more structure
Example: "Central Park is the best place to be on a sunny day in New York." / "It was a sunny, beautiful New York City afternoon."
More applications...
• Text analysis (e.g., linguistics or sociology)
• Language modeling (e.g., query completion)
• Information extraction (e.g., relation extraction)
• Also: web usage mining, spam detection, ...
Challenges
• Huge collections of sequences
• Computationally intensive problem
 – O(n²) n-grams for a sequence S with |S| = n
 – O(2ⁿ) subsequences for a sequence S with |S| = n and gap γ ≥ n
• Sequences with small support can be interesting
• Potentially many output patterns
How can we perform frequent sequence mining at such large scales?
Outline
• Motivation & challenges
• Problem statement
• The MG-FSM algorithm
• Experimental evaluation
• Conclusion
Gap-constrained frequent sequence mining
Input: sequence database
Output: frequent subsequences that
• occur in at least σ sequences (support threshold)
• have length at most λ (length threshold)
• have gap at most γ between consecutive items (gap threshold)

Example database:
• Central Park is the best place to be on a sunny day in New York.
• Monday was a sunny day in New York.
• It was a sunny, beautiful New York City afternoon.

For σ = 2, γ = 0, λ = 5: frequent n-gram "sunny day in New York"
For σ = 3, γ = 0, λ = 2: frequent n-gram "New York"
For σ = 3, γ ≥ 2, λ = 3: frequent subsequence "sunny New York"
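The problem definition above can be sketched as follows (a minimal single-machine illustration of the semantics, not the MG-FSM algorithm; function names are mine). A pattern occurs in a sequence if consecutive matched items are at most γ positions apart, and it is frequent if it occurs in at least σ database sequences:

```python
def occurs(pattern, sequence, gamma):
    """Check whether `pattern` occurs in `sequence` with at most
    `gamma` non-matched items between consecutive pattern items."""
    def match(p_idx, start):
        if p_idx == len(pattern):
            return True
        # the next pattern item must appear within gamma + 1 positions
        for s_idx in range(start, min(start + gamma + 1, len(sequence))):
            if sequence[s_idx] == pattern[p_idx] and match(p_idx + 1, s_idx + 1):
                return True
        return False
    # the first pattern item may start anywhere in the sequence
    return any(sequence[i] == pattern[0] and match(1, i + 1)
               for i in range(len(sequence)))

def support(pattern, database, gamma):
    """Number of database sequences containing `pattern`."""
    return sum(occurs(pattern, seq, gamma) for seq in database)

db = [
    "central park is the best place to be on a sunny day in new york".split(),
    "monday was a sunny day in new york".split(),
    "it was a sunny beautiful new york city afternoon".split(),
]
print(support("sunny day in new york".split(), db, 0))
print(support("sunny new york".split(), db, 2))
```

With γ = 0 this reduces to contiguous n-gram matching; allowing γ = 2 lets "sunny new york" also match the third sentence across the gap word "beautiful".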
Outline
• Motivation & challenges
• Problem statement
• The MG-FSM algorithm
• Experimental evaluation
• Conclusion
Parallel frequent sequence mining
1. Divide the sequence database D into potentially overlapping partitions D1, ..., Dk
2. Mine each partition independently (FSM), producing result sets F1, ..., Fk
3. Filter and combine the results into the set of frequent sequences F
Using item-based partitioning
1. Order items by descending frequency: a > ... > k
2. Partition by item a, b, ... (called the pivot item)
3. Mine each partition
4. Filter: keep only sequences containing the pivot but no less frequent item (the a-partition contributes sequences with a but not b, ..., k; the b-partition those with b but not c, ..., k; the k-partition those with k)
→ Disjoint subsequence sets, computed in parallel and independently
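The partition-and-filter scheme above can be sketched as follows (a simplified single-machine illustration; function names are mine, not from the talk). Each sequence is replicated to the partition of every distinct item it contains, and after mining, each partition keeps only those patterns whose least frequent item is the partition's pivot, which makes the output sets disjoint:

```python
from collections import Counter, defaultdict

def item_order(database):
    """Rank items by descending frequency (ties broken lexically);
    smaller rank = more frequent."""
    freq = Counter(item for seq in database for item in seq)
    return {item: rank for rank, (item, _) in
            enumerate(sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])))}

def partition_by_pivot(database):
    """Replicate each sequence to the partition of each distinct item
    it contains; partitions may therefore overlap."""
    partitions = defaultdict(list)
    for seq in database:
        for item in set(seq):
            partitions[item].append(seq)
    return partitions

def keep_in_partition(pattern, pivot, order):
    """A pattern belongs only to the partition of its least frequent
    item, so no pattern is reported by two partitions."""
    return max(pattern, key=lambda item: order[item]) == pivot
```

For example, with item frequencies A > B > C, the pattern (A, B) is mined in both the A- and B-partitions but kept only in the B-partition.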
Example: Naive partitioning (σ = 2, γ = 1, λ = 3)
Database: short sequences over items A, B, C, D with frequencies A:6, B:4, C:4, D:2 (figure in slides).
Each partition holds all input sequences containing its pivot item and is mined in full; the result is then filtered:
• Pivot A: mines A, B, C, D, AA, AB, AC, AD, BA, BC, CA → keeps patterns with A but not B, C, D
• Pivot B: mines A, B, C, AA, AB, AC, BA, BC → keeps patterns with B but not C, D
• Pivot C: mines A, B, C, AC, BC, CA → keeps patterns with C but not D
• Pivot D: mines A, D, AD → keeps patterns with D
Problems: high communication cost, redundant computation
Improving the partitioning
Traditional approach:
• Derive a partitioning rule ("projection")
• Prove correctness of the partitioning rule
MG-FSM approach:
• Use any partitioning satisfying a correctness condition
• Rewrite the input sequences, ensuring each w-partition generates the set of pivot sequences for w
Which is the optimal partition?
Example: rewriting C B D B C for pivot C (max. gap γ = 0, max. length λ = 2) can yield many short sequences (e.g., C B and B C, each with frequency 1) or few long sequences.
The optimal partition is not clear: there is a cost/gain trade-off.
→ Aim for a "good" partition using inexpensive rewrites
Rewriting partitions
Partition C: frequent sequences with C but not D (γ = 1, λ = 3)
Input: D A C B D, D B C A, B C A D D B D, A D D C D
1. Replace irrelevant items (i.e., items less frequent than the pivot) by a special blank symbol (_):
 – D A C B D → _ A C B _
 – D B C A → _ B C A
 – B C A D D B D → B C A _ _ B _
 – A D D C D → A _ _ C _
2. Drop irrelevant sequences (i.e., sequences that cannot generate any pivot sequence); e.g., A _ _ C _ can be dropped, since the gap between A and C exceeds γ and no non-trivial pivot sequence remains.
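The two rewrite steps can be sketched as follows (a simplified illustration of the idea; the full MG-FSM rewrites are more involved, and all names here are mine). Items less frequent than the pivot are blanked, and a sequence is kept only if the pivot can still reach some non-blank neighbour within the gap constraint, i.e. it can still generate a pivot sequence of length at least 2:

```python
BLANK = "_"

def relevant(blanked, pivot, gamma):
    """True iff some pivot occurrence has a non-blank neighbour within
    gamma positions, so a pattern of length >= 2 is still generable."""
    for i, item in enumerate(blanked):
        if item != pivot:
            continue
        window = (blanked[max(0, i - gamma - 1): i]
                  + blanked[i + 1: i + gamma + 2])
        if any(x != BLANK for x in window):
            return True
    return False

def rewrite(sequence, pivot, order, gamma):
    """Blank out items less frequent than the pivot; drop the sequence
    (return None) if it is no longer relevant for this partition."""
    blanked = [item if order[item] <= order[pivot] else BLANK
               for item in sequence]
    return blanked if relevant(blanked, pivot, gamma) else None

# Example from the talk: partition C, gamma = 1, item order A > B > C > D
order = {"A": 0, "B": 1, "C": 2, "D": 3}
for seq in (["D", "A", "C", "B", "D"], ["A", "D", "D", "C", "D"]):
    print(rewrite(seq, "C", order, gamma=1))
```

On the talk's example, D A C B D is rewritten to _ A C B _, while A _ _ C _ is dropped because both neighbours of C within distance γ + 1 are blank.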