Scalable Frequent Sequence Mining With Flexible Subsequence Constraints


  1. Scalable Frequent Sequence Mining With Flexible Subsequence Constraints
     Alexander Renz-Wieland (1), Matthias Bertsch (2), Rainer Gemulla (2)
     (1) Technische Universität Berlin, (2) Universität Mannheim
     ICDE 2019, Macau, China, April 11th, 2019

  2. Frequent Sequence Mining (FSM)
     Fundamental task in data mining
     ◮ Data modeled as sequences of items or events
     ◮ Often items are arranged in a hierarchy
     ◮ Goal is to discover frequent subsequences
     Example (market-basket data)
     ◮ Sequence = purchases of a customer over time
     ◮ Item = product + product hierarchy
     ◮ Example subsequence = DSLR Camera → Tripod → Flash
     Applications
     ◮ Natural language processing
     ◮ Information extraction
     ◮ Web usage analysis
     ◮ . . .
     [Figure: example product hierarchy. Photography has children Tripod and DSLR Camera; DSLR Camera has children Cannon5D, Nikon5100, . . .]
     A. Renz-Wieland, M. Bertsch, R. Gemulla Scalable Frequent Sequence Mining With Flexible Subsequence Constraints 2 / 15
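
The notion of a frequent subsequence can be made concrete with a small sketch. It assumes that a pattern occurs in a sequence if its items appear in the same order with arbitrary gaps, and it ignores the item hierarchy. The function names and the toy purchase data are made up for illustration and are not taken from the talk.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so the order of items is respected.
    return all(item in remaining for item in pattern)


def support(pattern: Sequence[str], database: List[Sequence[str]]) -> int:
    """Count the input sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database)


# Toy database (hypothetical): one purchase sequence per customer.
database = [
    ["DSLR Camera", "SD Card", "Tripod", "Flash"],
    ["DSLR Camera", "Tripod", "Flash"],
    ["Tripod", "DSLR Camera"],
]
pattern = ["DSLR Camera", "Tripod", "Flash"]
pattern_support = support(pattern, database)
```

A pattern is then called frequent if its support reaches a user-chosen minimum support threshold.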

  3. Challenge: Flexibility
     ◮ Unconstrained FSM outputs a multitude of frequent subsequences:
       a bell (302392), had never used (23202), become president (234311), PER be professor (1582), graduated from (3962), large enough to be (12083), why so many of us (234), who VERB also (22 223), lives in (4322), of the (220125), going to (12897), great artist (2394), . . .
     ◮ Typically, only a few of them are interesting to a specific application
       – E.g., only relational phrases between entities are of interest
     ◮ Flexible methods (that can be tailored to applications) are essential

  4. Goal: flexible and scalable FSM
     ◮ Common approach: flexible subsequence constraints
     ◮ Problem: existing FSM algorithms are either flexible or scalable, but not both
     ◮ Our paper: both flexible and scalable

  5. Outline
     1. Frequent Sequence Mining
     2. Flexibility
     3. Scalability
     4. Conclusion

  6. Flexible FSM with DESQ
     ◮ We adopt the unified FSM framework DESQ [ICDM ’16, TODS ’19]
       – Applications can describe flexible subsequence constraints in an intuitive, declarative way
       – Alleviates the need for customized mining algorithms
     ◮ Provides a pattern expression language to specify subsequence constraints
       – Syntax like regular expressions
       – Supports capture groups and hierarchies

  7. Example pattern expressions for applications
     (1) ([ADJ|NOUN] NOUN)
         Noun modified by adjective or noun
         E.g., big country (110), research scientist (473)
     (2) ENTITY (VERB+ NOUN+? PREP?) ENTITY
         Relational phrase between entities
         E.g., is being advised by (15), has coached (10)
     (3) DigitalCamera[.{0,3}(.↑)]{1,4}
         Products bought after a digital camera
         E.g., Camera Lenses, Tripods & Monopods (11), Camera Batteries, SD & SDHC Cards (12)
     (4) ([S|T]).*(.).*([R|T])
         Amino acid sequences that match [S|T].[R|T]
         E.g., S L R (103093), T A K (102941)

  8. Example pattern expressions for traditional constraints
     (1) 3-grams: (...)
     (2) 3-, 4-, and 5-grams: (.){3,5}
     (3) Skip 3-grams with gap 1: (.).(.).(.)
     (4) All subsequences: [.*(.)]+
     (5) Length 3–5 subsequences: [.*(.)]{3,5}
     (6) Bounded gap of 0–3: (.)[.{0,3}(.)]+
     (7) Serial episodes of length 3, window 5: (.)[.?.?(.)|.?(.).?|(.).?.?](.)
     (8) Generalized 5-grams: (.↑){5}
     (9) Subsequences matching regex [a|b]c*d: (a|b)[.*(c)]*.*(d)
     (10) . . .
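
To illustrate what two of these expressions select, the following sketch enumerates their candidates by hand. It is not the DESQ engine; it assumes that an expression may match at any position of the input sequence and that captured items (those in parentheses) form the output. The function names are hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(seq: List[str], n_min: int, n_max: int) -> Set[Tuple[str, ...]]:
    """Consecutive subsequences of length n_min..n_max, as for (.){n_min,n_max}."""
    out = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(seq) - n + 1):
            out.add(tuple(seq[i:i + n]))
    return out


def skip_3grams_gap1(seq: List[str]) -> Set[Tuple[str, ...]]:
    """Three captured items with exactly one uncaptured item between them, as for (.).(.).(.)."""
    return {(seq[i], seq[i + 2], seq[i + 4]) for i in range(len(seq) - 4)}


sequence = list("acdcb")
three_grams = ngrams(sequence, 3, 3)
three_to_five_grams = ngrams(sequence, 3, 5)
skip_grams = skip_3grams_gap1(sequence)
```

For the input sequence acdcb, the 3-grams are acd, cdc, and dcb, and the only skip 3-gram with gap 1 is adb.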

  9. Outline
     1. Frequent Sequence Mining
     2. Flexibility
     3. Scalability
        3.1 General framework
        3.2 Communicate inputs
        3.3 Communicate candidates
        3.4 Experimental study
     4. Conclusion

  10. A general framework for distributed FSM
      ◮ Bulk synchronous parallel with 1 round of communication:
        (1) Local preprocessing (map) → (2) Communication (shuffle) → (3) Local mining (reduce)
      ◮ Item-based partitioning [SIGMOD ’00, PPoPP ’07, SIGMOD ’13]
        Input sequence: acdcb
        Candidate subsequences acdcb, acdb, acb, adcb, accb: relevant for partition c
        Candidate subsequences adb, ab: relevant for partition a
        (not relevant for partitions b, d)
      ◮ Key challenges
        – How to distribute computation
        – What to communicate
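
The partition assignment in the example can be reproduced with a short sketch. The slide does not state the assignment rule; the sketch assumes that each candidate belongs to the partition of its least frequent item (often called the pivot item), which is one common choice in item-based partitioning. The item frequencies are hypothetical and chosen so that the result matches the slide.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical item frequencies (assumption): c is the rarest item, then a, d, b.
FREQ = {"b": 5, "d": 4, "a": 3, "c": 2}


def pivot(candidate: str) -> str:
    """Return the least frequent item of the candidate; ties are broken by item name."""
    return min(candidate, key=lambda item: (FREQ[item], item))


def partition(candidates: List[str]) -> Dict[str, List[str]]:
    """Group candidate subsequences by their pivot item."""
    parts = defaultdict(list)
    for cand in candidates:
        parts[pivot(cand)].append(cand)
    return dict(parts)


candidates = ["acdcb", "acdb", "acb", "adcb", "accb", "adb", "ab"]
partitions = partition(candidates)
```

Under these assumptions every candidate that contains c lands in partition c, the remaining candidates land in partition a, and partitions b and d receive nothing from this input sequence.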

  11. Communicate inputs
      ◮ Send each input sequence to all partitions to which it can contribute:
        (1) Determine partitions, rewrite input sequences → (2) Send rewritten input sequences → (3) Run local FSM algorithm
      ◮ Often sufficient to send parts of the input sequence
      ◮ Example: if the e's are not relevant for the mining task, don't send them: e e e a c d c b becomes a c d c b
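
The rewriting step can be sketched as follows. This is a minimal illustration, assuming the set of relevant items is already known; it only trims irrelevant items at both ends of the sequence, because removing items from the middle would change the gaps between the remaining items. The function name is hypothetical.

```python
from typing import List, Set


def trim_irrelevant(sequence: List[str], relevant: Set[str]) -> List[str]:
    """Drop leading and trailing items that are not relevant for the mining task."""
    start, end = 0, len(sequence)
    while start < end and sequence[start] not in relevant:
        start += 1
    while end > start and sequence[end - 1] not in relevant:
        end -= 1
    return sequence[start:end]


relevant_items = set("abcd")  # assumption: e is irrelevant for the task
rewritten = trim_irrelevant(list("eeeacdcb"), relevant_items)
```

With this input the three leading e's are dropped, so only a c d c b needs to be communicated.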

  12. Communicate candidates
      ◮ Send each candidate subsequence to its corresponding partition:
        (1) Generate and compress candidates → (2) Send compressed candidates → (3) Count candidates
      ◮ Important optimization: compress candidates
      [Figure: candidate subsequences acdcb, acdb, acb, adcb, accb of input sequence a c d c b, shown uncompressed and in compressed form]
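
Why compression pays off can be seen with a simple stand-in. The slide does not specify the encoding, so the sketch below uses a plain prefix trie, which is an assumption chosen for illustration and not necessarily the representation used by the authors. Candidates that share a prefix share the corresponding trie nodes.

```python
from typing import Dict, List

END = "$"  # end-of-candidate marker; assumed not to be used as an item


def build_trie(candidates: List[str]) -> Dict:
    """Insert all candidates into a nested-dict prefix trie."""
    root: Dict = {}
    for cand in candidates:
        node = root
        for item in cand:
            node = node.setdefault(item, {})
        node[END] = True
    return root


def count_nodes(trie: Dict) -> int:
    """Count the item nodes in the trie (end markers are not counted)."""
    return sum(1 + count_nodes(child) for key, child in trie.items() if key != END)


candidates = ["acdcb", "acdb", "acb", "adcb", "accb"]
trie = build_trie(candidates)
uncompressed_items = sum(len(cand) for cand in candidates)
compressed_items = count_nodes(trie)
```

For the five candidates of partition c, the uncompressed representation holds 20 items while the trie holds 12 nodes. Sharing common suffixes as well would reduce this further.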

  13. Experimental study: key results
      ◮ Up to 50x faster than naïve approaches, up to 100x less communication
      [Figure: total time (in seconds) per subsequence constraint for Naïve, SemiNaïve, D-SEQ, and D-CAND; some runs are n/a (OOM). (a) New York Times data, constraints N1 (10), N2 (100), N3 (10), N4 (1k), N5 (1k). (b) Amazon Review data, constraints A1 (500), A2 (100), A3 (100), A4 (100).]
      ◮ Sending candidates is up to 5x faster for selective constraints
      ◮ 1-4x generalization overhead over specialized, less general approaches
      ◮ Both approaches scale nearly linearly with the number of input sequences

  14. Outline
      1. Frequent Sequence Mining
      2. Flexibility
      3. Scalability
      4. Conclusion

  15. Conclusion
      ◮ Existing algorithms: flexible or scalable. Ours: both
      ◮ Adopt DESQ: a framework to tailor FSM to applications
      ◮ Distributed mining via item-based partitioning
        1. Communicate inputs
        2. Communicate candidates
      ◮ Available as open source Apache Spark library at https://github.com/rgemulla/desq/tree/distributed
      References
      ◮ G. Buehrer et al. Toward terabyte pattern mining: An architecture-conscious solution. PPoPP ’07.
      ◮ K. Beedkar and R. Gemulla. DESQ: Frequent sequence mining with subsequence constraints. ICDM ’16.
      ◮ K. Beedkar, R. Gemulla, and W. Martens. A unified framework for frequent sequence mining with subsequence constraints. To appear in Transactions on Database Systems, 2019.
      ◮ J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD ’00.
      ◮ I. Miliaraki et al. Mind the gap: Large-scale frequent sequence mining. SIGMOD ’13.
