Mind the Gap: Large-Scale Frequent Sequence Mining
Iris Miliaraki, Klaus Berberich, Rainer Gemulla, Spyros Zoupanos
Max Planck Institute for Informatics, Saarbrücken, Germany
27th June 2013, New York, SIGMOD 2013
Why are sequences interesting?
• Various applications
Sequences with gaps
• Generalization of n-grams to sequences with gaps
 – sunny [...] New York
 – rainy [...] New York
• Exposes more structure
Example: "Central Park is the best place to be on a sunny day in New York." / "It was a sunny, beautiful New York City afternoon."
More applications...
• Text analysis (e.g., linguistics or sociology)
• Language modeling (e.g., query completion)
• Information extraction (e.g., relation extraction)
• Also: web usage mining, spam detection, ...
Challenges
• Huge collections of sequences
• Computationally intensive problem
 – O(n²) n-grams for a sequence S with |S| = n
 – O(2ⁿ) subsequences for a sequence S with |S| = n and gap γ ≥ n
• Sequences with small support can be interesting
• Potentially many output patterns
How can we perform frequent sequence mining at such large scales?
Outline
• Motivation & challenges
• Problem statement
• The MG-FSM algorithm
• Experimental evaluation
• Conclusion
Gap-constrained frequent sequence mining
Input: sequence database
Output: frequent subsequences that
• occur in at least σ sequences (support threshold)
• have length at most λ (length threshold)
• have gap at most γ between consecutive items (gap threshold)

Example database:
• Central Park is the best place to be on a sunny day in New York.
• Monday was a sunny day in New York.
• It was a sunny, beautiful New York City afternoon.

For σ = 2, γ = 0, λ = 5: frequent n-gram "sunny day in New York"
For σ = 3, γ = 0, λ = 2: frequent n-gram "New York"
For σ = 3, γ ≥ 2, λ = 3: frequent subsequence "sunny New York"
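The problem definition above can be sketched as follows (a minimal single-machine illustration of the semantics, not the MG-FSM algorithm; function names are mine). A pattern occurs in a sequence if consecutive matched items are at most γ positions apart, and it is frequent if it occurs in at least σ database sequences:

```python
def occurs(pattern, sequence, gamma):
    """Check whether `pattern` occurs in `sequence` with at most
    `gamma` non-matched items between consecutive pattern items."""
    def match(p_idx, start):
        if p_idx == len(pattern):
            return True
        # the next pattern item must appear within gamma + 1 positions
        for s_idx in range(start, min(start + gamma + 1, len(sequence))):
            if sequence[s_idx] == pattern[p_idx] and match(p_idx + 1, s_idx + 1):
                return True
        return False
    # the first pattern item may start anywhere in the sequence
    return any(sequence[i] == pattern[0] and match(1, i + 1)
               for i in range(len(sequence)))

def support(pattern, database, gamma):
    """Number of database sequences containing `pattern`."""
    return sum(occurs(pattern, seq, gamma) for seq in database)

db = [
    "central park is the best place to be on a sunny day in new york".split(),
    "monday was a sunny day in new york".split(),
    "it was a sunny beautiful new york city afternoon".split(),
]
print(support("sunny day in new york".split(), db, 0))
print(support("sunny new york".split(), db, 2))
```

With γ = 0 this reduces to contiguous n-gram matching; allowing γ = 2 lets "sunny new york" also match the third sentence across the gap word "beautiful".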
Outline
• Motivation & challenges
• Problem statement
• The MG-FSM algorithm
• Experimental evaluation
• Conclusion
Parallel frequent sequence mining
1. Divide the sequence database D into potentially overlapping partitions D1, ..., Dk
2. Mine each partition independently (FSM), producing result sets F1, ..., Fk
3. Filter and combine the results into the set of frequent sequences F
Using item-based partitioning
1. Order items by descending frequency: a > ... > k
2. Partition by item a, b, ... (called the pivot item)
3. Mine each partition
4. Filter: keep only sequences containing the pivot but no less frequent item (the a-partition contributes sequences with a but not b, ..., k; the b-partition those with b but not c, ..., k; the k-partition those with k)
→ Disjoint subsequence sets, computed in parallel and independently
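The partition-and-filter scheme above can be sketched as follows (a simplified single-machine illustration; function names are mine, not from the talk). Each sequence is replicated to the partition of every distinct item it contains, and after mining, each partition keeps only those patterns whose least frequent item is the partition's pivot, which makes the output sets disjoint:

```python
from collections import Counter, defaultdict

def item_order(database):
    """Rank items by descending frequency (ties broken lexically);
    smaller rank = more frequent."""
    freq = Counter(item for seq in database for item in seq)
    return {item: rank for rank, (item, _) in
            enumerate(sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])))}

def partition_by_pivot(database):
    """Replicate each sequence to the partition of each distinct item
    it contains; partitions may therefore overlap."""
    partitions = defaultdict(list)
    for seq in database:
        for item in set(seq):
            partitions[item].append(seq)
    return partitions

def keep_in_partition(pattern, pivot, order):
    """A pattern belongs only to the partition of its least frequent
    item, so no pattern is reported by two partitions."""
    return max(pattern, key=lambda item: order[item]) == pivot
```

For example, with item frequencies A > B > C, the pattern (A, B) is mined in both the A- and B-partitions but kept only in the B-partition.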
Example: Naive partitioning (σ = 2, γ = 1, λ = 3)
Database: short sequences over items A, B, C, D with frequencies A:6, B:4, C:4, D:2 (figure in slides).
Each partition holds all input sequences containing its pivot item and is mined in full; the result is then filtered:
• Pivot A: mines A, B, C, D, AA, AB, AC, AD, BA, BC, CA → keeps patterns with A but not B, C, D
• Pivot B: mines A, B, C, AA, AB, AC, BA, BC → keeps patterns with B but not C, D
• Pivot C: mines A, B, C, AC, BC, CA → keeps patterns with C but not D
• Pivot D: mines A, D, AD → keeps patterns with D
Problems: high communication cost, redundant computation
Improving the partitioning
Traditional approach:
• Derive a partitioning rule ("projection")
• Prove correctness of the partitioning rule
MG-FSM approach:
• Use any partitioning satisfying a correctness condition
• Rewrite the input sequences, ensuring each w-partition generates the set of pivot sequences for w
Which is the optimal partition?
Example: rewriting C B D B C for pivot C (max. gap γ = 0, max. length λ = 2) can yield many short sequences (e.g., C B and B C, each with frequency 1) or few long sequences.
The optimal partition is not clear: there is a cost/gain trade-off.
→ Aim for a "good" partition using inexpensive rewrites
Rewriting partitions
Partition C: frequent sequences with C but not D (γ = 1, λ = 3)
Input: D A C B D, D B C A, B C A D D B D, A D D C D
1. Replace irrelevant items (i.e., items less frequent than the pivot) by a special blank symbol (_):
 – D A C B D → _ A C B _
 – D B C A → _ B C A
 – B C A D D B D → B C A _ _ B _
 – A D D C D → A _ _ C _
2. Drop irrelevant sequences (i.e., sequences that cannot generate any pivot sequence); e.g., A _ _ C _ can be dropped, since the gap between A and C exceeds γ and no non-trivial pivot sequence remains.
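The two rewrite steps can be sketched as follows (a simplified illustration of the idea; the full MG-FSM rewrites are more involved, and all names here are mine). Items less frequent than the pivot are blanked, and a sequence is kept only if the pivot can still reach some non-blank neighbour within the gap constraint, i.e. it can still generate a pivot sequence of length at least 2:

```python
BLANK = "_"

def relevant(blanked, pivot, gamma):
    """True iff some pivot occurrence has a non-blank neighbour within
    gamma positions, so a pattern of length >= 2 is still generable."""
    for i, item in enumerate(blanked):
        if item != pivot:
            continue
        window = (blanked[max(0, i - gamma - 1): i]
                  + blanked[i + 1: i + gamma + 2])
        if any(x != BLANK for x in window):
            return True
    return False

def rewrite(sequence, pivot, order, gamma):
    """Blank out items less frequent than the pivot; drop the sequence
    (return None) if it is no longer relevant for this partition."""
    blanked = [item if order[item] <= order[pivot] else BLANK
               for item in sequence]
    return blanked if relevant(blanked, pivot, gamma) else None

# Example from the talk: partition C, gamma = 1, item order A > B > C > D
order = {"A": 0, "B": 1, "C": 2, "D": 3}
for seq in (["D", "A", "C", "B", "D"], ["A", "D", "D", "C", "D"]):
    print(rewrite(seq, "C", order, gamma=1))
```

On the talk's example, D A C B D is rewritten to _ A C B _, while A _ _ C _ is dropped because both neighbours of C within distance γ + 1 are blank.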