Toon Calders — Discovery Science, October 30th 2012, Lyon
Frequent Itemset Mining
Pattern Explosion Problem
Condensed Representations
▪ Closed itemsets
▪ Non-Derivable Itemsets
Recent Approaches Towards Non-Redundant Pattern Mining
Relations Between the Approaches
Minsup = 60%, Minconf = 80%

TID | Items          set | support        rules (conf ≥ 80%)
 1  | A,B,C,D         A  | 2              B,D → C  100%
 2  | B,C,D           B  | 4              C → D     80%
 3  | A,C,D           C  | 5              D → C    100%
 4  | B,C,D           D  | 4              C → B     80%
 5  | B,C             BC | 4              B → C    100%
                      BD | 3
                      CD | 4
                      BCD| 3
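The supports and confidences on this slide can be checked mechanically. The sketch below is not from the talk; `supp` and `conf` are illustrative helper names for absolute support and rule confidence on the 5-transaction toy database.

```python
# Toy database from the slide (5 transactions over items A, B, C, D).
db = [{'A','B','C','D'}, {'B','C','D'}, {'A','C','D'}, {'B','C','D'}, {'B','C'}]

def supp(items):
    """Absolute support: number of transactions containing all the items."""
    return sum(1 for t in db if set(items) <= t)

def conf(lhs, rhs):
    """Confidence of the association rule lhs -> rhs."""
    return supp(set(lhs) | set(rhs)) / supp(lhs)

# Supports listed on the slide (minsup = 60%, i.e. at least 3 of 5 transactions).
assert [supp(s) for s in ('A','B','C','D','BC','BD','CD','BCD')] == [2,4,5,4,4,3,4,3]

# Rules reaching minconf = 80%.
assert conf('D','C') == 1.0 and conf('B','C') == 1.0 and conf('BD','C') == 1.0
assert conf('C','D') == 0.8 and conf('C','B') == 0.8
```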
[Diagram: the KDD process — gather data into a warehouse, mine it, use the discovered patterns]
Association rules gaining popularity
Literally hundreds of algorithms: AIS, Apriori, AprioriTID, AprioriHybrid, FPGrowth, FPGrowth*, Eclat, dEclat, Pincer-search, ABS, DCI, kDCI, LCM, AIM, PIE, ARMOR, AFOPT, COFI, Patricia, MAXMINER, MAFIA, …
Mushroom has 8124 transactions, and a transaction length of 23
Over 50 000 patterns
Over 10 000 000 patterns
[Diagram: a small dataset producing an explosion of patterns]
Frequent itemset / Association rule mining = find all itemsets / ARs satisfying thresholds
Many are redundant:
▪ smoker → lung cancer; smoker, bald → lung cancer
▪ pregnant → woman; pregnant, smoker → woman, lung cancer
Frequent Itemset Mining
Pattern Explosion Problem
Condensed Representations
▪ Closed itemsets
▪ Non-Derivable Itemsets
Recent Approaches Towards Non-Redundant Pattern Mining
Relations Between the Approaches
A1 A2 A3 B1 B2 B3 C1 C2 C3
 1  1  1  0  0  0  0  0  0
 1  1  1  0  0  0  0  0  0
 0  0  0  1  1  1  0  0  0
 0  0  0  1  1  1  0  0  0
 0  0  0  0  0  0  1  1  1
 0  0  0  0  0  0  1  1  1

Number of frequent itemsets = 21
Need a compact representation
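The count of 21 can be verified by brute force. A small sketch, not from the talk, assuming an absolute minsup of 2 (each block of items occurs in exactly two transactions): only the non-empty subsets of one block are frequent, giving 3 × (2³ − 1) = 21 itemsets.

```python
from itertools import combinations

# Six transactions: two identical transactions per block of items.
rows = [{'A1','A2','A3'}] * 2 + [{'B1','B2','B3'}] * 2 + [{'C1','C2','C3'}] * 2
items = sorted(set().union(*rows))
minsup = 2  # assumed absolute threshold

# Enumerate every non-empty candidate itemset and keep the frequent ones.
frequent = [set(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sum(1 for t in rows if set(c) <= t) >= minsup]

assert len(frequent) == 21  # 3 blocks x (2^3 - 1) non-empty subsets per block
```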
Condensed Representation: “Compressed” version of the collection of all frequent itemsets (usually a subset) that allows for lossless regeneration of the complete collection.

Closed Itemsets (Pasquier et al., ICDT 1999)
Free Itemsets (Boulicaut et al., PKDD 2000)
Disjunction-Free Itemsets (Bykowski and Rigotti, PODS 2001)
How do supports interact?
What information about unknown supports can we derive from known supports?
Concise representation: only store the relevant part of the supports
Agrawal et al. (Monotonicity):
Supp(AX) ≤ Supp(A)

Lakhal et al. (Closed sets), Boulicaut et al. (Free sets):
If Supp(A) = Supp(AB)
then Supp(AX) = Supp(AXB)
Bayardo (MAXMINER):
Supp(ABX) ≥ Supp(AX) − drop(X, B), where drop(X, B) = Supp(X) − Supp(BX)

Bykowski, Rigotti (Disjunction-free sets):
if Supp(ABC) = Supp(AB) + Supp(AC) − Supp(A)
then Supp(ABCX) = Supp(ABX) + Supp(ACX) − Supp(AX)
General problem: Given some supports, what can be derived for the supports of other itemsets?
Example:
supp(AB) = 0.7
supp(BC) = 0.5
supp(ABC) ∈ [?, ?]
General problem: Given some supports, what can be derived for the supports of other itemsets?
Example:
supp(AB) = 0.7
supp(BC) = 0.5
supp(ABC) ∈ [0.2, 0.5]
The problem of finding tight bounds is hard to solve in general.
Theorem. The following problem is NP-complete:
Given itemsets I1, …, In and supports s1, …, sn, does there exist a database D such that for j = 1…n, supp(Ij) = sj?
Can be translated into a linear program.
Introduce a variable X_J for every itemset J:
X_J = fraction of transactions whose set of items is exactly J

TID | Items
 1  | A
 2  | C
 3  | C
 4  | A,B
 5  | A,B,C
 6  | A,B,C
For this database:

TID | Items
 1  | A
 2  | C
 3  | C
 4  | A,B
 5  | A,B,C
 6  | A,B,C

X_{} = 0, X_A = 1/6, X_B = 0, X_C = 2/6,
X_AB = 1/6, X_AC = 0, X_BC = 0, X_ABC = 2/6
Give bounds on ABC:
Minimize/maximize X_ABC
s.t. (for a database D)
X_{} + X_A + X_B + X_C + X_AB + X_AC + X_BC + X_ABC = 1
X_{}, X_A, X_B, X_C, …, X_ABC ≥ 0
in which
X_AB + X_ABC = 0.7   (supp(AB) = 0.7)
X_BC + X_ABC = 0.5   (supp(BC) = 0.5)
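For this particular two-constraint LP the optimum has a closed form: maximizing X_ABC gives min(supp(AB), supp(BC)) by monotonicity, and minimizing it gives supp(AB) + supp(BC) − 1, since X_AB + X_BC + X_ABC ≤ 1. A minimal sketch, not from the talk (`abc_bounds` is a hypothetical helper), using exact rationals to avoid floating-point noise:

```python
from fractions import Fraction

def abc_bounds(s_ab, s_bc):
    """Tight LP bounds on supp(ABC) given supp(AB) and supp(BC)."""
    # Lower bound: (s_ab - X_ABC) + (s_bc - X_ABC) + X_ABC <= 1.
    lower = max(Fraction(0), s_ab + s_bc - 1)
    # Upper bound: ABC contains both AB and BC, so supp(ABC) <= both supports.
    upper = min(s_ab, s_bc)
    return lower, upper

lo, hi = abc_bounds(Fraction(7, 10), Fraction(1, 2))
assert (lo, hi) == (Fraction(1, 5), Fraction(1, 2))  # i.e. [0.2, 0.5]
```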
Given: Supp(I) for all I ⊊ J
Give tight [l, u] for Supp(J)
Can be computed efficiently
Without counting: Supp(J) ∈ [l, u]
J is a derivable itemset (DI) iff l = u:
we know Supp(J) exactly without counting!
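These bounds can be computed with the deduction rules for non-derivable itemsets: for each X ⊆ J, the inclusion-exclusion sum σ_X(J) = Σ_{X ⊆ I ⊊ J} (−1)^(|J\I|+1) supp(I) is an upper bound on supp(J) when |J \ X| is odd and a lower bound when it is even. A sketch, not from the talk, recomputed on the earlier 5-transaction toy database (function names are illustrative):

```python
from itertools import combinations

def subsets(s):
    s = sorted(s)
    for r in range(len(s) + 1):
        for c in combinations(s, r):
            yield frozenset(c)

def supp(db, itemset):
    return sum(1 for t in db if itemset <= t)

def ndi_bounds(db, J):
    """Tight bounds [l, u] on supp(J) from the supports of all proper subsets of J."""
    J = frozenset(J)
    lows, ups = [0], []
    for X in subsets(J):
        if X == J:
            continue
        sigma = sum((-1) ** (len(J - I) + 1) * supp(db, I)
                    for I in subsets(J) if X <= I and I != J)
        if len(J - X) % 2 == 1:
            ups.append(sigma)   # odd |J \ X|: upper bound
        else:
            lows.append(sigma)  # even |J \ X|: lower bound
    return max(lows), min(ups)

db = [{'A','B','C','D'}, {'B','C','D'}, {'A','C','D'}, {'B','C','D'}, {'B','C'}]
l, u = ndi_bounds(db, {'A','B','C'})
assert l == u == 1 == supp(db, frozenset('ABC'))  # ABC is derivable here
```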
Considerably smaller than all frequent itemsets
Many redundancies removed
There exist efficient algorithms for mining them
Yet, still way too many patterns generated:
supp(A) = 90%, supp(B) = 20%
supp(AB) ∈ [10%, 20%]
yet supp(AB) = 18% is not interesting
Frequent Itemset Mining
Recent Approaches Towards Non-Redundant Pattern Mining
▪ Statistically based
▪ Compression based
Relations Between the Approaches
We have background knowledge:
▪ supports of some itemsets
▪ column/row marginals
Influences our “expectation” of the database: not every database is equally likely.
Surprisingness: how does the real support correspond to the expectation?
[Diagram: background knowledge (row marginals, column marginals, supports, density of tiles, …) is mapped to a statistical model — either one database or a distribution over databases. The model’s predicted statistic is compared with the statistic of the actual database (support/tile/…); if they differ, the pattern is reported as surprising.]
Types of background knowledge:
▪ supports, marginals, densities of regions
Mapping background knowledge to a statistical model:
▪ a distribution over databases, or one distribution representing the database
A way of computing surprisingness
Row and column marginals:

A B C | Row marginals
0 0 0 | 0
0 1 1 | 2
0 1 1 | 2
1 1 0 | 2
1 0 0 | 1
1 1 1 | 3
Column marginals: 3 4 3
Row and column marginals (entries hidden — only the marginals are known):

A B C | Row marginals
? ? ? | 0
? ? ? | 2
? ? ? | 2
? ? ? | 2
? ? ? | 1
? ? ? | 3
Column marginals: 3 4 3
Density of tiles:

A B C
0 0 0
0 1 1
0 1 1
1 1 0
1 0 0
1 1 1
Density of tiles (entries hidden — only the densities of two highlighted tiles are known):

A B C
? ? ?
? ? ?
? ? ?
? ? ?
? ? ?
? ? ?

One tile has density 1; another has density 6/8.
Consider all databases that satisfy the constraints; take the uniform distribution over these databases.
▪ Gionis et al.: row and column marginals
▪ Hanhijärvi et al.: extension to supports

A. Gionis, H. Mannila, T. Mielikäinen, P. Tsaparas: Assessing data mining results via swap randomization. TKDD 1(3) (2007)
S. Hanhijärvi, M. Ojala, N. Vuokko, K. Puolamäki, N. Tatti, H. Mannila: Tell Me Something I Don’t Know: Randomization Strategies for Iterative Data Mining. ACM SIGKDD (2009)
A B C | Row marginals
1 1 1 | 3
1 1 1 | 3
0 1 1 | 2
1 0 0 | 1
0 1 0 | 1
Column marginals: 3 4 3

supp(BC) = 60%
Is this support surprising given the marginals?
[Figure: several sample databases with the same row and column marginals; in some of them supp(BC) = 60%, in others supp(BC) = 40%]
A B C
1 1 1
1 1 1
0 1 1
1 0 0
0 1 0

supp(BC) = 60%
Is this support surprising given the marginals?
No! p-value = P(supp(BC) ≥ 60% | marginals) = 60%
E[supp(BC)] = 60% × 60% + 40% × 40% = 52%
Estimation of the p-value via simulation (MC).
Uniform sampling from databases with the same marginals is non-trivial → MCMC.
[Figure: two databases with identical marginals, one obtained from the other by a single swap]
[Figure: an MCMC random walk — a chain of databases, each obtained from the previous one by a swap, all sharing the same row and column marginals]
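The MCMC step above is a swap: pick two rows and two columns whose 2×2 submatrix is [[1,0],[0,1]] (or its mirror) and exchange the ones and zeros, which leaves every row and column marginal unchanged. A minimal sketch of this swap-randomization step, not from the talk (`swap_randomize` is an illustrative name):

```python
import random

def swap_randomize(M, n_steps, rng=None):
    """Random walk over 0/1 matrices: each accepted step swaps a 2x2
    submatrix [[1,0],[0,1]] <-> [[0,1],[1,0]], preserving all marginals."""
    rng = rng or random.Random()
    M = [row[:] for row in M]
    n, m = len(M), len(M[0])
    for _ in range(n_steps):
        i, j = rng.randrange(n), rng.randrange(n)
        k, l = rng.randrange(m), rng.randrange(m)
        if M[i][k] == M[j][l] == 1 and M[i][l] == M[j][k] == 0:
            M[i][k] = M[j][l] = 0
            M[i][l] = M[j][k] = 1
    return M

M = [[1,1,1], [1,1,1], [0,1,1], [1,0,0], [0,1,0]]
R = swap_randomize(M, 1000, random.Random(0))
# Every swap preserves the row and column marginals.
assert [sum(row) for row in R] == [sum(row) for row in M]
assert [sum(col) for col in zip(*R)] == [sum(col) for col in zip(*M)]
```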
No explicit model created.
[Diagram: background knowledge = row and column marginals (supports) → uniform distribution over all satisfying databases → predicted statistic obtained by simulation/MCMC → compared with the database statistic → p-value → surprising? Any statistic can be used.]
Database → probability distribution:
p(t=X) = |{ t ∈ D | t=X }| / |D|
Pick the one with maximal entropy:
H(p) = −Σ_X p(t=X) log(p(t=X))

Example: supp(A) = 90%, supp(B) = 20%

A B | prob      A B | prob      A B | prob
0 0 | 10%       0 0 |  0%       0 0 |  8%
0 1 |  0%       0 1 | 10%       0 1 |  2%
1 0 | 70%       1 0 | 80%       1 0 | 72%
1 1 | 20%       1 1 | 10%       1 1 | 18%
H = 1.157       H = 0.922      H = 1.19
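The entropies of the three candidate distributions can be recomputed directly (in bits). A sketch, not from the talk; the third distribution is the independence model p(A)·p(B), which has the largest entropy among the three:

```python
import math

def H(p):
    """Shannon entropy in bits; 0-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Probabilities over (A,B) in the order (0,0), (0,1), (1,0), (1,1),
# all consistent with supp(A) = 90%, supp(B) = 20%.
p1 = [0.10, 0.00, 0.70, 0.20]
p2 = [0.00, 0.10, 0.80, 0.10]
p3 = [0.08, 0.02, 0.72, 0.18]  # independence model: 0.9*0.2 = 0.18, etc.

assert round(H(p1), 3) == 1.157
assert round(H(p3), 2) == 1.19
assert H(p3) > H(p1) > H(p2)   # the independence model maximizes entropy
```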
H(p) = −Σ_X p(t=X) log(p(t=X))
▪ −log(p(t=X)) is the space required to encode X under an optimal Shannon encoding for the distribution p; it characterizes the information content of X
▪ p(t=X) is the probability that the event t=X occurs
H(p) = expected number of bits needed to encode transactions