
Toon Calders, Discovery Science, October 30th 2012, Lyon


  1. Toon Calders, Discovery Science, October 30th 2012, Lyon

  2. - Frequent Itemset Mining
     - Pattern Explosion Problem
     - Condensed Representations
       - Closed itemsets
       - Non-Derivable Itemsets
     - Recent Approaches Towards Non-Redundant Pattern Mining
     - Relations Between the Approaches

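  3. Minsup = 60%, Minconf = 80%

     Transactions:

     | TID | Items      |
     |-----|------------|
     | 1   | A, B, C, D |
     | 2   | B, C, D    |
     | 3   | A, C, D    |
     | 4   | B, C, D    |
     | 5   | B, C       |

     Supports:

     | Set | Support |
     |-----|---------|
     | A   | 2       |
     | B   | 4       |
     | C   | 5       |
     | D   | 4       |
     | BC  | 4       |
     | BD  | 3       |
     | CD  | 4       |
     | BCD | 3       |

     Association rules:

     | Rule   | Confidence |
     |--------|------------|
     | BD → C | 100%       |
     | C → D  | 80%        |
     | D → C  | 100%       |
     | C → B  | 80%        |
     | B → C  | 100%       |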
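
The example on this slide can be recomputed with a short brute-force sketch. It enumerates every itemset over the four items, which only works for toy data; the function names are illustrative and not taken from the slides.

```python
from itertools import combinations

# Transactions from the slide (TID -> items).
TRANSACTIONS = {
    1: {"A", "B", "C", "D"},
    2: {"B", "C", "D"},
    3: {"A", "C", "D"},
    4: {"B", "C", "D"},
    5: {"B", "C"},
}
MINSUP_PCT = 60   # minimum support, in percent
MINCONF_PCT = 80  # minimum confidence, in percent


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS.values() if itemset <= t)


def frequent_itemsets():
    """Brute-force enumeration of all itemsets meeting the support threshold."""
    items = sorted(set().union(*TRANSACTIONS.values()))
    n = len(TRANSACTIONS)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = support_count(set(combo))
            if count * 100 >= MINSUP_PCT * n:
                result[frozenset(combo)] = count
    return result


def association_rules(frequent):
    """Rules X -> Y with X and Y non-empty and disjoint, and X ∪ Y frequent."""
    rules = []
    for itemset, count in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                # confidence = supp(X ∪ Y) / supp(X), compared in integers
                if count * 100 >= MINCONF_PCT * frequent[lhs]:
                    rules.append((lhs, itemset - lhs, count / frequent[lhs]))
    return rules


if __name__ == "__main__":
    freq = frequent_itemsets()
    for lhs, rhs, conf in association_rules(freq):
        print("".join(sorted(lhs)), "->", "".join(sorted(rhs)), f"{conf:.0%}")
```

Integer comparisons are used for both thresholds so that borderline values such as 80% are not affected by floating-point rounding.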

  4. [Figure: diagram with the labels Data, gather, Warehouse, mine, use]

  5. - Association rules gaining popularity
     - Literally hundreds of algorithms: AIS, Apriori, AprioriTID, AprioriHybrid, FPGrowth, FPGrowth*, Eclat, dEclat, Pincer-search, ABS, DCI, kDCI, LCM, AIM, PIE, ARMOR, AFOPT, COFI, Patricia, MAXMINER, MAFIA, …

  6. Mushroom has 8124 transactions and a transaction length of 23
     - Over 50 000 patterns
     - Over 10 000 000 patterns

  7. [Figure: diagram with the labels "patterns" and "Data"]

  8. - Frequent itemset / association rule mining = find all itemsets / ARs satisfying thresholds
     - Many are redundant:
       - smoker → lung cancer
       - smoker, bald → lung cancer
       - pregnant → woman
       - pregnant, smoker → woman, lung cancer

  9. - Frequent Itemset Mining
     - Pattern Explosion Problem
     - Condensed Representations
       - Closed itemsets
       - Non-Derivable Itemsets
     - Recent Approaches Towards Non-Redundant Pattern Mining
     - Relations Between the Approaches

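  10. Example database:

      | A1 | A2 | A3 | B1 | B2 | B3 | C1 | C2 | C3 |
      |----|----|----|----|----|----|----|----|----|
      | 1  | 1  | 1  | 0  | 0  | 0  | 0  | 0  | 0  |
      | 1  | 1  | 1  | 0  | 0  | 0  | 0  | 0  | 0  |
      | 0  | 0  | 0  | 1  | 1  | 1  | 0  | 0  | 0  |
      | 0  | 0  | 0  | 1  | 1  | 1  | 0  | 0  | 0  |
      | 0  | 0  | 0  | 0  | 0  | 0  | 1  | 1  | 1  |
      | 0  | 0  | 0  | 0  | 0  | 0  | 1  | 1  | 1  |

      - Number of frequent itemsets = 21
      - Need a compact representation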
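
To see where the count of 21 comes from, and how much a condensed representation can save, the following sketch enumerates the frequent itemsets of this database and then keeps only the closed ones. The minimum support count of 2 is an assumption, since the slide does not state a threshold, and the function names are illustrative.

```python
from itertools import combinations

ITEMS = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"]
# Six transactions: each block of three items occurs together twice.
DATABASE = [
    {"A1", "A2", "A3"}, {"A1", "A2", "A3"},
    {"B1", "B2", "B3"}, {"B1", "B2", "B3"},
    {"C1", "C2", "C3"}, {"C1", "C2", "C3"},
]
MIN_COUNT = 2  # assumed threshold; not given on the slide


def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in DATABASE if itemset <= t)


def frequent_itemsets():
    """All non-empty itemsets with support >= MIN_COUNT (brute force)."""
    found = {}
    for k in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            s = support(set(combo))
            if s >= MIN_COUNT:
                found[frozenset(combo)] = s
    return found


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        x: s for x, s in frequent.items()
        if not any(x < y and s == t for y, t in frequent.items())
    }


if __name__ == "__main__":
    freq = frequent_itemsets()
    closed = closed_itemsets(freq)
    print(len(freq), "frequent itemsets,", len(closed), "closed itemsets")
```

Each block of three items contributes 2^3 - 1 = 7 non-empty frequent itemsets, but only the full block is closed.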

  11. - Condensed representation: "compressed" version of the collection of all frequent itemsets (usually a subset) that allows for lossless regeneration of the complete collection
      - Closed Itemsets (Pasquier et al., ICDT 1999)
      - Free Itemsets (Boulicaut et al., PKDD 2000)
      - Disjunction-Free Itemsets (Bykowski and Rigotti, PODS 2001)

  12. - How do supports interact?
      - What information about unknown supports can we derive from known supports?
      - Concise representation: only store the relevant part of the supports

  13. - Agrawal et al. (Monotonicity): Supp(AX) ≤ Supp(A)
      - Lakhal et al. (Closed sets), Boulicaut et al. (Free sets): if Supp(A) = Supp(AB), then Supp(AX) = Supp(AXB)

  14. - Bayardo (MAXMINER): Supp(ABX) ≥ Supp(AX) - (Supp(X) - Supp(BX)), where Supp(X) - Supp(BX) is drop(X, B)
      - Bykowski, Rigotti (Disjunction-free sets): if Supp(ABC) = Supp(AB) + Supp(AC) - Supp(A), then Supp(ABCX) = Supp(ABX) + Supp(ACX) - Supp(AX)
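
The four deduction rules from this slide and the previous one can be sanity-checked on random data. The sketch below is only a check, not a proof: it treats A, B, C, and X as single items (on the slides X may be any itemset), and the random database generator is a hypothetical helper.

```python
import random


def supp(db, items):
    """Number of transactions in `db` containing all of `items`."""
    wanted = set(items)
    return sum(1 for t in db if wanted <= t)


def random_db(rng, n_transactions=8, items="ABCX", p=0.5):
    """Hypothetical generator: each item is present with probability p."""
    return [{i for i in items if rng.random() < p} for _ in range(n_transactions)]


def rules_hold(db):
    """True if all four rules hold on `db`; conditional rules are
    checked as 'premise fails or conclusion holds'."""
    def s(items):
        return supp(db, items)

    monotone = s("AX") <= s("A")
    maxminer = s("ABX") >= s("AX") - (s("X") - s("BX"))
    free = s("A") != s("AB") or s("AX") == s("AXB")
    disjunction_free = (
        s("ABC") != s("AB") + s("AC") - s("A")
        or s("ABCX") == s("ABX") + s("ACX") - s("AX")
    )
    return monotone and maxminer and free and disjunction_free


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(all(rules_hold(random_db(rng)) for _ in range(1000)))
```

Small databases are used on purpose, so that the premises of the two conditional rules are met reasonably often.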

  15. - General problem: given some supports, what can be derived for the supports of other itemsets?
      - Example: supp(AB) = 0.7, supp(BC) = 0.5; supp(ABC) ∈ [?, ?]

  16. - General problem: given some supports, what can be derived for the supports of other itemsets?
      - Example: supp(AB) = 0.7, supp(BC) = 0.5; supp(ABC) ∈ [0.2, 0.5]

  17. - The problem of finding tight bounds is hard to solve in general
      - Theorem: the following problem is NP-complete: given itemsets I1, …, In and supports s1, …, sn, does there exist a database D such that supp(Ij) = sj for j = 1…n?

  18. - Can be translated into a linear program
      - Introduce a variable X_J for every itemset J
      - X_J = fraction of transactions with items = J

      | TID | Items   |
      |-----|---------|
      | 1   | A       |
      | 2   | C       |
      | 3   | C       |
      | 4   | A, B    |
      | 5   | A, B, C |
      | 6   | A, B, C |

  19. - Can be translated into a linear program
      - Introduce a variable X_J for every itemset J
      - X_J = fraction of transactions with items = J

      | TID | Items   |
      |-----|---------|
      | 1   | A       |
      | 2   | C       |
      | 3   | C       |
      | 4   | A, B    |
      | 5   | A, B, C |
      | 6   | A, B, C |

      For this database: X_{} = 0, X_A = 1/6, X_B = 0, X_C = 2/6, X_AB = 1/6, X_AC = 0, X_BC = 0, X_ABC = 2/6

  20. Give bounds on ABC: minimize/maximize X_ABC subject to
      - For a database D:
        - X_{} + X_A + X_B + X_C + X_AB + X_AC + X_BC + X_ABC = 1
        - X_{}, X_A, X_B, X_C, …, X_ABC ≥ 0
      - In which supp(AB) = 0.7 and supp(BC) = 0.5:
        - X_AB + X_ABC = 0.7
        - X_BC + X_ABC = 0.5
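
The bounds of this linear program can be cross-checked without an LP solver. The sketch below is a brute-force stand-in, not the LP itself: it assumes a database of exactly 10 transactions (so that 0.7 and 0.5 become the counts 7 and 5), enumerates every multiset of transactions over A, B, C, and records the smallest and largest count of ABC among the consistent ones.

```python
from itertools import combinations_with_replacement

# The eight possible transactions over items A, B, C ("" is the empty one).
TYPES = ["", "A", "B", "C", "AB", "AC", "BC", "ABC"]
N = 10  # assumed database size; chosen so the supports are whole counts


def abc_bounds(count_ab=7, count_bc=5, n=N):
    """Min and max number of ABC transactions over all databases of size n
    with supp(AB) = count_ab and supp(BC) = count_bc (as counts)."""
    feasible = []
    for db in combinations_with_replacement(TYPES, n):
        ab = sum(1 for t in db if "A" in t and "B" in t)
        bc = sum(1 for t in db if "B" in t and "C" in t)
        if ab == count_ab and bc == count_bc:
            # Only transactions equal to ABC contain the itemset ABC.
            feasible.append(db.count("ABC"))
    return min(feasible), max(feasible)


if __name__ == "__main__":
    low, high = abc_bounds()
    print(f"supp(ABC) in [{low / N}, {high / N}]")
```

For larger alphabets this enumeration is hopeless, which is where the linear program (or the deduction rules on the following slides) comes in.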

  21. - Given: Supp(I) for all I ⊂ J
      - Give tight [l, u] for J; can be computed efficiently
      - Without counting: Supp(J) ∈ [l, u]
      - J is a derivable itemset (DI) iff l = u
      - Then we know Supp(J) exactly without counting!

  22. - Considerably smaller than all frequent itemsets
      - Many redundancies removed
      - There exist efficient algorithms for mining them
      - Yet, still way too many patterns generated
      - Example: supp(A) = 90%, supp(B) = 20% gives supp(AB) ∈ [10%, 20%]; yet supp(AB) = 18% is not interesting (18% = 90% × 20%, which is what independence of A and B would predict)
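
The interval [l, u] for an itemset J can be computed from the supports of its proper subsets. The slides do not spell out the formula, so the sketch below assumes the inclusion-exclusion deduction rules from the non-derivable itemsets literature: for every X ⊆ J, the sum σ_X(J) over X ⊆ I ⊂ J of (-1)^(|J \ I| + 1) · supp(I) is an upper bound on supp(J) when |J \ X| is odd and a lower bound when it is even. The function names are illustrative.

```python
from itertools import combinations


def subsets(items):
    """All subsets of `items`, as frozensets."""
    items = sorted(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def ndi_bounds(itemset, supp):
    """Bounds (l, u) on supp(itemset) from inclusion-exclusion rules.

    `supp` maps every proper subset of `itemset` to its support; the
    empty set maps to the total number (or fraction) of transactions.
    """
    j = frozenset(itemset)
    lower, upper = [], []
    for x in subsets(j):
        # sigma_X(J): signed sum over all I with X <= I < J.
        sigma = sum(
            (-1) ** (len(j - i) + 1) * supp[i]
            for i in subsets(j) if x <= i and i != j
        )
        if len(j - x) % 2 == 1:
            upper.append(sigma)
        else:
            lower.append(sigma)
    return max(lower), min(upper)


if __name__ == "__main__":
    # Supports in percent, as in the example on the slide.
    known = {frozenset(): 100, frozenset("A"): 90, frozenset("B"): 20}
    print(ndi_bounds("AB", known))
```

With these inputs the two bounds differ, so AB is not derivable. If supp(A) were 100%, both bounds would collapse to 20% and supp(AB) would be known without counting.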

  23. - Frequent Itemset Mining
      - Recent Approaches Towards Non-Redundant Pattern Mining
        - Statistically based
        - Compression based
      - Relations Between the Approaches

  24. - We have background knowledge
        - Supports of some itemsets
        - Column/row marginals
      - This influences our "expectation" of the database
        - Not every database is equally likely
      - Surprisingness
        - How does the real support correspond to the expectation?

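  25. [Figure: flowchart. Background knowledge (row marginals, column marginals, supports, density of tiles, …) updates a statistical model, which is either one database or a distribution over databases. The model yields a prediction for a statistic (support, tile, …) of the database; if the observed statistic is surprising, it is reported and the model is updated.]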

  26. - Types of background knowledge
        - Supports, marginals, densities of regions
      - Mapping background knowledge to a statistical model
        - Distribution over databases; one distribution representing a database
      - Way of computing surprisingness

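  27. Row and column marginals

      |                  | A | B | C | Row marginal |
      |------------------|---|---|---|--------------|
      |                  | 0 | 0 | 0 | 0            |
      |                  | 0 | 1 | 1 | 2            |
      |                  | 0 | 1 | 1 | 2            |
      |                  | 1 | 1 | 0 | 2            |
      |                  | 1 | 0 | 0 | 1            |
      |                  | 1 | 1 | 1 | 3            |
      | Column marginals | 3 | 4 | 3 |              |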

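  28. Row and column marginals

      |                  | A | B | C | Row marginal |
      |------------------|---|---|---|--------------|
      |                  | ? | ? | ? | 0            |
      |                  | ? | ? | ? | 2            |
      |                  | ? | ? | ? | 2            |
      |                  | ? | ? | ? | 2            |
      |                  | ? | ? | ? | 1            |
      |                  | ? | ? | ? | 3            |
      | Column marginals | 3 | 4 | 3 |              |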

  29. Density of tiles

      | A | B | C |
      |---|---|---|
      | 0 | 0 | 0 |
      | 0 | 1 | 1 |
      | 0 | 1 | 1 |
      | 1 | 1 | 0 |
      | 1 | 0 | 0 |
      | 1 | 1 | 1 |

  30. Density of tiles
      [Figure: the same 6 × 3 database over A, B, C with all entries unknown (?); two tiles are marked, one with density 1 and one with density 6/8]

  31. - Consider all databases that satisfy the constraints
      - Uniform distribution over these databases
      - Gionis et al.: row and column marginals
      - Hanhijärvi et al.: extension to supports

      A. Gionis, H. Mannila, T. Mielikäinen, P. Tsaparas: Assessing data mining results via swap randomization. TKDD 1(3) (2007)

      S. Hanhijärvi, M. Ojala, N. Vuokko, K. Puolamäki, N. Tatti, H. Mannila: Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining. ACM SIGKDD (2009)

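  32. supp(BC) = 60%

      |                  | A | B | C | Row marginal |
      |------------------|---|---|---|--------------|
      |                  | 1 | 1 | 1 | 3            |
      |                  | 1 | 1 | 1 | 3            |
      |                  | 0 | 1 | 1 | 2            |
      |                  | 1 | 0 | 0 | 1            |
      |                  | 0 | 1 | 0 | 1            |
      | Column marginals | 3 | 4 | 3 |              |

      - Is this support surprising given the marginals?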

  33. [Figure: five databases with the same row and column marginals; their values of supp(BC) are 60%, 40%, 60%, 60%, and 40%]

  34. [Figure: the database from slide 32, with supp(BC) = 60%]
      - Is this support surprising given the marginals? No!
      - p-value = P(supp(BC) ≥ 60% | marginals) = 60%
      - E[supp(BC)] = 60% × 60% + 40% × 40% = 52%

  35. - Estimation of the p-value via simulation (Monte Carlo)
      - Uniform sampling from databases with the same marginals is non-trivial
      - MCMC

      [Figure: 0/1 databases over A, B, C with the same marginals]
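
A minimal sketch of the MCMC idea follows, assuming a simplified swap step rather than the exact procedure of Gionis et al.: pick two rows and two columns at random, and if the 2 × 2 submatrix is a "checkerboard" (1 0 / 0 1 or 0 1 / 1 0), flip it. Every such flip keeps all row and column marginals unchanged. The example matrix, the statistic, and all parameter values are hypothetical.

```python
import random


def swap_randomize(matrix, attempts, rng):
    """Return a copy of a 0/1 matrix after `attempts` attempted swaps."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(attempts):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        checkerboard = (
            m[r1][c1] == m[r2][c2]
            and m[r1][c2] == m[r2][c1]
            and m[r1][c1] != m[r1][c2]
        )
        if checkerboard:
            # Flip the 2x2 submatrix; row and column sums are preserved.
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m


def empirical_p_value(matrix, statistic, samples=500, attempts=50, seed=0):
    """Share of randomized matrices whose statistic is at least as large
    as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = statistic(matrix)
    current, hits = matrix, 0
    for _ in range(samples):
        current = swap_randomize(current, attempts, rng)
        if statistic(current) >= observed:
            hits += 1
    return (hits + 1) / (samples + 1)


def support_of_columns(cols):
    """Statistic: fraction of rows that have a 1 in all of `cols`."""
    return lambda m: sum(all(row[c] for c in cols) for row in m) / len(m)


if __name__ == "__main__":
    data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1], [1, 0, 0, 1]]
    print(empirical_p_value(data, support_of_columns([0, 1])))
```

Attempts that do not hit a checkerboard leave the matrix as it is, which keeps the chain's transition probabilities symmetric; how many attempts are needed for good mixing is not addressed here.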

  36. [Figure: several 0/1 databases over A, B, C, continuing the illustration from the previous slide]

  37. [Figure: the flowchart from slide 25, instantiated for randomization. Background knowledge: row marginals, column marginals (supports). Statistical model: uniform over all satisfying databases; no explicit model is created. Prediction: p-value via simulation (MCMC). Statistic: any statistic of the database. If surprising, it is reported and the model is updated.]

  38. - Database → probability distribution: p(t=X) = |{ t ∈ D | t = X }| / |D|
      - Pick the one with maximal entropy: H(p) = -Σ_X p(t=X) log(p(t=X))
      - Example: supp(A) = 90%, supp(B) = 20%

      | A | B | prob (1) | prob (2) | prob (3) |
      |---|---|----------|----------|----------|
      | 0 | 0 | 10%      | 0%       | 8%       |
      | 0 | 1 | 0%       | 10%      | 2%       |
      | 1 | 0 | 70%      | 80%      | 72%      |
      | 1 | 1 | 20%      | 10%      | 18%      |
      | H |   | 1.157    | 0.922    | 1.19     |
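
The entropies of the three candidate distributions can be recomputed directly. This sketch assumes base-2 logarithms (entropy in bits) and the convention 0 · log 0 = 0; the dictionary keys and function names are illustrative.

```python
from math import log2

# Candidate distributions over (A, B) in the order 00, 01, 10, 11.
CANDIDATES = {
    "first": [0.10, 0.00, 0.70, 0.20],
    "second": [0.00, 0.10, 0.80, 0.10],
    "third": [0.08, 0.02, 0.72, 0.18],
}


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def marginals(probabilities):
    """(supp(A), supp(B)) implied by a distribution over 00, 01, 10, 11."""
    p00, p01, p10, p11 = probabilities
    return p10 + p11, p01 + p11


if __name__ == "__main__":
    for name, dist in CANDIDATES.items():
        supp_a, supp_b = marginals(dist)
        print(f"{name}: supp(A)={supp_a:.2f} supp(B)={supp_b:.2f} "
              f"H={entropy(dist):.3f}")
    best = max(CANDIDATES, key=lambda name: entropy(CANDIDATES[name]))
    print("maximal entropy:", best)
```

All three candidates satisfy the same two support constraints; the one in which A and B are independent has the highest entropy of the three.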

  39. - H(p) = -Σ_X p(t=X) log(p(t=X))
      - -log(p(t=X)) denotes the space required to encode X, given an optimal Shannon encoding for the distribution p; it characterizes the information content of X
      - p(t=X) denotes the probability that the event t=X occurs
      - H(p) = expected number of bits needed to encode transactions

