

  1. Mind the Gap: Large-Scale Frequent Sequence Mining. Iris Miliaraki, Klaus Berberich, Rainer Gemulla, Spyros Zoupanos. Max Planck Institute for Informatics, Saarbrücken, Germany. SIGMOD 2013, New York, 27th June 2013

  2. Why are sequences interesting? Various applications

  3. Why are sequences interesting?

  4. Sequences with gaps
  • Generalization of n-grams to sequences with gaps
    – sunny [...] New York
    – rainy [...] New York
  • Exposes more structure:
    – "Central Park is the best place to be on a sunny day in New York."
    – "It was a sunny, beautiful New York City afternoon."
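To make the gap notion concrete, here is a small sketch (not from the talk; the function name and tokenization are my own) that checks whether a gapped pattern such as "sunny [...] New York" occurs in a tokenized sentence, with the number of skipped tokens between consecutive pattern items bounded by a parameter `gamma`:

```python
def matches_with_gap(tokens, pattern, gamma):
    """Return True if `pattern` occurs in `tokens` as a subsequence
    with at most `gamma` tokens between consecutive pattern items."""
    def rec(start, pi):
        if pi == len(pattern):
            return True
        # the next pattern item may skip at most `gamma` tokens
        for t in range(start, min(start + gamma + 1, len(tokens))):
            if tokens[t] == pattern[pi] and rec(t + 1, pi + 1):
                return True
        return False
    return any(tokens[i] == pattern[0] and rec(i + 1, 1)
               for i in range(len(tokens)))

sentence = "central park is the best place to be on a sunny day in new york"
matches_with_gap(sentence.split(), ["sunny", "new", "york"], gamma=2)  # True
```

With `gamma=1` the same call returns False, since "day in" separates "sunny" from "new" by two tokens.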

  5. More applications...
  • Text analysis (e.g., linguistics or sociology)
  • Language modeling (e.g., query completion)
  • Information extraction (e.g., relation extraction)
  • Also: web usage mining, spam detection, ...

  6. Challenges
  • Huge collections of sequences
  • Computationally intensive problem:
    – O(n^2) n-grams for a sequence S with |S| = n
    – O(2^n) subsequences for a sequence S with |S| = n and gap > n
  • Sequences with small support can be interesting
  • Potentially many output patterns
  How can we perform frequent sequence mining at such large scales?
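The two complexity claims above follow from exact closed-form counts (a back-of-the-envelope sketch, not from the talk): a length-n sequence has n(n+1)/2 non-empty contiguous n-grams, and 2^n - 1 non-empty subsequences once the gap constraint never binds:

```python
def count_ngrams(n):
    """Non-empty contiguous subsequences (n-grams) of a length-n
    sequence: n + (n-1) + ... + 1 = n(n+1)/2, i.e. O(n^2)."""
    return n * (n + 1) // 2

def count_subsequences(n):
    """Non-empty index subsets of a length-n sequence when the gap
    constraint never binds (gap > n): 2^n - 1, i.e. O(2^n)."""
    return 2 ** n - 1

print(count_ngrams(20))        # 210
print(count_subsequences(20))  # 1048575
```

Even a 20-item sequence already contributes over a million candidate subsequences, which is why the gap (γ) and length (λ) constraints matter at scale.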

  7. Outline
  • Motivation & challenges
  • Problem statement
  • The MG-FSM algorithm
  • Experimental evaluation
  • Conclusion

  8. Gap-constrained frequent sequence mining
  Input: a sequence database
  Output: frequent subsequences that
  • occur in at least σ sequences (support threshold)
  • have length at most λ (length threshold)
  • have gap at most γ between consecutive items (gap threshold)
  Example database:
    "Central Park is the best place to be on a sunny day in New York."
    "Monday was a sunny day in New York."
    "It was a sunny, beautiful New York City afternoon."
  For σ = 2, γ = 0, λ = 5: frequent n-gram "sunny day in New York"

  9. Gap-constrained frequent sequence mining (cont.)
  For σ = 3, γ = 0, λ = 2: frequent n-gram "New York"

  10. Gap-constrained frequent sequence mining (cont.)
  For σ = 3, γ ≥ 2, λ = 3: frequent subsequence "sunny New York"
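The problem statement can be pinned down with a brute-force reference miner (my own sketch of the definition, not the MG-FSM algorithm; exponential in λ, so only usable on toy data) run on the three example sentences:

```python
from collections import Counter
from itertools import combinations

def mine(db, sigma, gamma, lam):
    """Return all subsequences of length <= lam, with at most `gamma`
    items between consecutive picks, occurring in >= sigma sequences."""
    counts = Counter()
    for seq in db:
        subs = set()  # count each subsequence once per input sequence
        for k in range(1, lam + 1):
            for idx in combinations(range(len(seq)), k):
                if all(b - a - 1 <= gamma for a, b in zip(idx, idx[1:])):
                    subs.add(tuple(seq[i] for i in idx))
        counts.update(subs)
    return {s for s, c in counts.items() if c >= sigma}

db = ["central park is the best place to be on a sunny day in new york",
      "monday was a sunny day in new york",
      "it was a sunny beautiful new york city afternoon"]
db = [s.split() for s in db]
# sigma=2, gamma=0, lambda=5 recovers the contiguous n-gram:
("sunny", "day", "in", "new", "york") in mine(db, 2, 0, 5)  # True
# sigma=3, gamma=2, lambda=3 additionally finds the gapped pattern:
("sunny", "new", "york") in mine(db, 3, 2, 3)  # True
```

Note that support counts sequences, not occurrences: a subsequence appearing twice in one sentence still contributes 1, hence the per-sequence `set`.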

  11. Outline
  • Motivation & challenges
  • Problem statement
  • The MG-FSM algorithm
  • Experimental evaluation
  • Conclusion

  12. Parallel frequent sequence mining
  1. Divide the data into potentially overlapping partitions
  2. Mine each partition
  3. Filter and combine the results
  [Figure: sequence database D → partitioning → D_1, D_2, ..., D_k → FSM mining per partition → F_1, F_2, ..., F_k → filtering → frequent sequences F]

  13. Using item-based partitioning
  1. Order items by descending frequency: a > ... > k
  2. Partition by item a, b, ... (called the pivot item)
  3. Mine each partition
  4. Filter: keep only sequences with no item less frequent than the pivot
     (the a-partition output includes a but not b, ..., k; the b-partition output includes b but not c, ..., k; and so on)
  The resulting disjoint subsequence sets are computed in parallel and independently.
  [Figure: sequence database D → partitioning by pivot item → D_1, D_2, ..., D_k → FSM mining → F_1, F_2, ..., F_k → filtering → frequent sequences F]
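The four steps above can be sketched on a single machine (my own toy sketch of naive item-based partitioning, without MG-FSM's rewrites; `local_mine` is a brute-force reference miner and all names are mine). The filter keeps only "pivot sequences", i.e. sequences whose least frequent item is the pivot, so the per-partition outputs are disjoint and their union equals the global result:

```python
from collections import Counter
from itertools import combinations

def local_mine(db, sigma, gamma, lam):
    """Brute-force gap-constrained miner (support = number of
    input sequences containing the subsequence)."""
    counts = Counter()
    for seq in db:
        subs = set()
        for k in range(1, lam + 1):
            for idx in combinations(range(len(seq)), k):
                if all(b - a - 1 <= gamma for a, b in zip(idx, idx[1:])):
                    subs.add(tuple(seq[i] for i in idx))
        counts.update(subs)
    return {s for s, c in counts.items() if c >= sigma}

def mine_partitioned(db, sigma, gamma, lam):
    """Order items by descending frequency, build one partition per
    pivot item (every sequence containing it), mine each partition,
    then keep only sequences whose least frequent item is the pivot."""
    freq = Counter(w for seq in db for w in seq)
    rank = {w: i for i, (w, _) in enumerate(freq.most_common())}
    result = set()
    for pivot in freq:                       # each partition is independent
        part = [seq for seq in db if pivot in seq]
        for s in local_mine(part, sigma, gamma, lam):
            if max(rank[w] for w in s) == rank[pivot]:   # filter step
                result.add(s)
    return result

db = [list("ABAB"), list("ABCCB"), list("CAD"), list("BACA")]
assert mine_partitioned(db, 2, 1, 3) == local_mine(db, 2, 1, 3)
```

Correctness rests on one observation: any input sequence containing a pattern also contains the pattern's least frequent item, so the pattern's full support is visible inside that one partition.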

  14.-16. Example: naive partitioning
  [Figure: an example sequence database with item frequencies A:6, B:4, C:4, D:2; thresholds: support σ = 2, max. gap γ = 1, max. length λ = 3. Locally frequent sequences per partition: pivot A yields A, B, C, D, AA, AB, AC, AD, BA, BC, CA; pivot B yields A, B, C, AA, AB, AC, BA, BC; pivot C yields A, B, C, AC, BC, CA; pivot D yields A, D, AD. The filter then keeps, per partition, only sequences with the pivot but no less frequent item: with A but not B, C, D; with B but not C, D; with C but not D; with D.]
  Drawbacks:
  • High communication cost
  • Redundant computation cost

  17. Improving the partitioning
  Traditional approach:
  • Derive a partitioning rule ("projection")
  • Prove correctness of the partitioning rule
  MG-FSM approach:
  • Use any partitioning satisfying correctness
  • Rewrite the input sequences, ensuring each w-partition generates the set of pivot sequences for w

  18. Which is the optimal partition?
  [Figure: alternative rewrites of the same input for pivot C under max. gap γ = 0 and max. length λ = 2: many short sequences vs. few long sequences.]
  The optimal partition is not clear: there is a cost/gain trade-off. Aim for a "good" partition using inexpensive rewrites.

  19.-23. Rewriting partitions
  Partition C holds the frequent sequences with C but not D (γ = 1, λ = 3).
  1. Replace irrelevant items (i.e., items less frequent than the pivot) by a special blank symbol (_).
  2. Drop irrelevant sequences (i.e., sequences that cannot generate any pivot sequence).
  [Example for pivot C: D A C B D → _ A C B _; B C A D D B → B C A _ _ B; A D D C D → A _ _ C _; sequences without C are dropped.]
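The two rewrite steps above can be sketched directly (a minimal sketch following only what the slides state; the function name and the rank encoding, 0 = most frequent, are my own, and the drop rule is given in its simplest form):

```python
def rewrite_partition(db, pivot, rank):
    """Rewrite the input sequences for the partition of `pivot`:
    1. replace irrelevant items (less frequent than the pivot,
       i.e. with a larger rank) by the blank symbol '_';
    2. drop irrelevant sequences -- here, in the simplest form,
       sequences that do not contain the pivot at all."""
    out = []
    for seq in db:
        if pivot not in seq:
            continue                   # cannot generate a pivot sequence
        out.append(['_' if rank[w] > rank[pivot] else w for w in seq])
    return out

# Item frequencies A:6, B:4, C:4, D:2 give ranks A < B < C < D.
rank = {"A": 0, "B": 1, "C": 2, "D": 3}
db = [list("DACBD"), list("ABAB"), list("ADDCD")]
rewrite_partition(db, "C", rank)
# → [['_', 'A', 'C', 'B', '_'], ['A', '_', '_', 'C', '_']]
```

Blanking keeps positions (and hence gaps) intact while shrinking the effective alphabet of each partition, which is what makes the rewritten partitions cheap to ship and mine.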
