Distributed frequent sequence mining with declarative subsequence constraints
Alexander Renz-Wieland
April 26, 2017
• Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules

  Example input sequences:
  1: Obama lives in Washington
  2: Gates lives in Medina
  3: The IMF is based in Washington

• Goal: find frequent sequences
  → lives in (2), in Washington (2), lives (2), in (2), Washington (2)

• Item hierarchy
  [Figure: item hierarchy tree: ENTITY generalizes PERSON (Obama, Gates) and LOCATION (Washington, Medina); VERB generalizes live, which generalizes lives; PREP generalizes in]
  → with the hierarchy also: PERSON lives in LOCATION (2), ...

• Subsequences
  Subsequences of input sequence 1: Obama, lives, in, Washington, Obama lives, Obama in, Obama Washington, lives in, lives Washington, in Washington, Obama lives in, Obama lives Washington, Obama in Washington, lives in Washington, Obama lives in Washington (15 subsequences; with hierarchy: 190)

• Subsequence constraints: item constraint, gap constraint, length constraint, ...

• Declarative constraints: “relational phrases between entities” (Beedkar and Gemulla, 2016) → lives in (2)

• Scalable algorithms
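The subsequence count is easy to verify: a sequence of four distinct items has 2^4 - 1 = 15 non-empty subsequences. A minimal sketch in plain Python (the helper name is mine, not from the slides):

```python
from itertools import combinations

def subsequences(seq):
    """All non-empty subsequences: order-preserving, arbitrary gaps allowed."""
    for k in range(1, len(seq) + 1):
        for idx in combinations(range(len(seq)), k):
            yield tuple(seq[i] for i in idx)

sentence = ("Obama", "lives", "in", "Washington")
print(len(set(subsequences(sentence))))  # 15 = 2^4 - 1
```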
Outline
• Preliminaries
• Naïve approach
• Proposed algorithm
  • Partitioning
  • Shuffle
  • Local mining
• Experimental evaluation
Problem definition
• Given:
  • Input sequences
  • Item hierarchy
  • Constraint π
  • Minimum support threshold σ
• Candidate sequences of an input sequence T: the subsequences of T that conform with constraint π
• Find the frequent sequences: every sequence that is a candidate sequence of at least σ input sequences
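To make the support semantics concrete: a candidate is counted at most once per input sequence, and a sequence is frequent once its support reaches σ. A small sketch over the example sentences (helper names are mine):

```python
inputs = [
    ["Obama", "lives", "in", "Washington"],
    ["Gates", "lives", "in", "Medina"],
    ["The", "IMF", "is", "based", "in", "Washington"],
]

def is_subsequence(s, T):
    """True if s occurs in T in order, possibly with gaps."""
    it = iter(T)
    return all(item in it for item in s)  # 'in' advances the iterator

# Support = number of input sequences containing the candidate (once each).
support = sum(is_subsequence(["lives", "in"], T) for T in inputs)
print(support)  # 2 -> frequent for sigma = 2
```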
Related work
• Sequential algorithms: DESQ-COUNT and DESQ-DFS (Beedkar and Gemulla, 2016)
• Two distributed algorithms for Hadoop MapReduce:
  • MG-FSM (Miliaraki et al., 2013; Beedkar et al., 2015)
    • Maximum gap and maximum length constraints
    • No hierarchies
  • LASH (Beedkar and Gemulla, 2015)
    • Maximum gap and maximum length constraints
    • Hierarchies
Naïve approach
• “Word count”: generate candidate sequences → count → filter
• Can improve by using single-item frequencies
• Problem: a sequence of length n has O(2^n) subsequences (without considering the hierarchy)
  • Typically fewer due to constraints, but still a problem
→ Need a better approach
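For reference, a single-machine sketch of this baseline, reusing `inputs` and `subsequences` from the sketches above as a stand-in for π-constrained candidate generation (an assumption; the real candidates depend on the constraint and the hierarchy):

```python
from collections import Counter

def naive_mine(inputs, sigma, candidates_of):
    """'Word count' baseline: generate -> count -> filter."""
    # Improvement from the slide: an item occurring in fewer than sigma input
    # sequences cannot be part of a frequent sequence (ignoring the hierarchy).
    item_freq = Counter(item for T in inputs for item in set(T))
    frequent_items = {i for i, f in item_freq.items() if f >= sigma}

    support = Counter()
    for T in inputs:
        for cand in candidates_of(T):               # up to O(2^n) candidates
            if all(item in frequent_items for item in cand):
                support[cand] += 1
    return {s: f for s, f in support.items() if f >= sigma}

print(naive_mine(inputs, 2, lambda T: set(subsequences(T))))
```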
Overview
• Two main stages: partition the candidate sequences, then mine each partition locally
• Similar approach used in MG-FSM and LASH

[Figure: dataflow across nodes 1 to n: input sequences → (stage 1: process input sequences) → intermediary information → (stage 2: shuffle) → partitions → (stage 3: local mining) → frequent sequences]
Partitioning
• Partition the candidate sequences: item-based partitioning assigns each candidate sequence to the partition of one of its items, its pivot item
• Pivot item choices, for T: abcd with candidate sequences ab, abc, abcd, abd, b, bc, bcd, bd:
  • First item
    P_a: ab, abc, abd, abcd
    P_b: b, bc, bd, bcd
  • Least frequent item, with f(a) > f(b) > f(c) > f(d)
    P_b: ab, b
    P_c: abc, bc
    P_d: abd, abcd, bd, bcd
    → reduces variance in partition sizes
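A sketch of least-frequent-item partitioning on the slide's example (function names are mine):

```python
def pivot(candidate, freq):
    """Pivot item = the least frequent item in the candidate sequence."""
    return min(candidate, key=lambda item: freq[item])

def partition(candidates, freq):
    parts = {}
    for cand in candidates:
        parts.setdefault(pivot(cand, freq), []).append(cand)
    return parts

freq = {"a": 4, "b": 3, "c": 2, "d": 1}   # f(a) > f(b) > f(c) > f(d)
cands = ["ab", "abc", "abcd", "abd", "b", "bc", "bcd", "bd"]
print(partition(cands, freq))
# {'b': ['ab', 'b'], 'c': ['abc', 'bc'], 'd': ['abcd', 'abd', 'bcd', 'bd']}
```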
Overview
[Figure: the same three-stage dataflow: process input sequences → shuffle → local mining]
• One partition per pivot item
• An input sequence is relevant for zero or more partitions
• Next: what to shuffle?
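Before answering that question, the stage structure itself can be pinned down. A single-machine skeleton of the three stages, with the shuffled payload left abstract since the next section decides what to send (my sketch, not the thesis code):

```python
from collections import defaultdict

def run(inputs, encode, count_partition, sigma):
    # Stage 1: process input sequences, emitting (pivot item, payload) pairs.
    # Stage 2: shuffle, i.e., group the payloads by pivot item.
    partitions = defaultdict(list)
    for T in inputs:
        for pivot_item, payload in encode(T):
            partitions[pivot_item].append(payload)
    # Stage 3: mine each partition locally and independently
    # (on real nodes the partitions are processed in parallel).
    frequent = {}
    for pivot_item, payloads in partitions.items():
        frequent.update(count_partition(pivot_item, payloads, sigma))
    return frequent
```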
Outline Preliminaries Naïve approach Proposed algorithm Partitioning Shuffle Local mining Experimental evaluation 14
Shuffle
• Goal: from an input sequence, communicate candidate sequences to the relevant partitions
• Two main options:
  • Send the input sequence
    + compact when there are many candidate sequences
    - need to compute candidate sequences twice
  • Send the candidate sequences
    + compact when candidate sequences are short and few per partition
→ Focus on sending candidate sequences
→ Try to represent them compactly
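The two options correspond to two encode functions for the skeleton above, reusing pivot() and a candidates_of generator (again a sketch with hypothetical names):

```python
def encode_input_sequence(T, candidates_of, freq):
    """Option 1: ship T itself; receivers must re-derive its candidates."""
    for p in {pivot(c, freq) for c in candidates_of(T)}:
        yield (p, tuple(T))

def encode_candidates(T, candidates_of, freq):
    """Option 2: ship each candidate to its pivot partition directly."""
    for c in candidates_of(T):
        yield (pivot(c, freq), c)
```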
A compact representation for candidate sequences
• Goal: compactly represent a set of candidate sequences
• Trick: exploit shared structure
  { caabe, caaBe, caAbe, caABe, cAabe, cAaBe, cAAbe, cAABe, cbe, cBe }
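One way to exploit shared structure is prefix sharing via a trie: the ten candidates above collapse into paths of the form c(a|A)(a|A)(b|B)e and c(b|B)e. This is only an illustrative sketch; a richer representation (for example, an automaton that also shares suffixes, such as the common ending e) compresses further:

```python
def build_trie(sequences):
    """Nested-dict trie; '$' marks the end of a stored sequence."""
    root = {}
    for seq in sequences:
        node = root
        for item in seq:
            node = node.setdefault(item, {})
        node["$"] = True
    return root

cands = ["caabe", "caaBe", "caAbe", "caABe", "cAabe",
         "cAaBe", "cAAbe", "cAABe", "cbe", "cBe"]
trie = build_trie(cands)
print(list(trie))          # ['c'] -- all ten candidates share the first item
print(sorted(trie["c"]))   # ['A', 'B', 'a', 'b'] -- branches under the shared prefix
```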