Toon Calders — Discovery Science, October 30th 2012, Lyon
Frequent Itemset Mining
Pattern Explosion Problem
Condensed Representations
▪ Closed itemsets
▪ Non-Derivable Itemsets
Recent Approaches Towards Non-Redundant Pattern Mining
Relations Between the Approaches
Minsup = 60%, Minconf = 80%

TID | Items          set | support        rules (conf ≥ 80%)
 1  | A,B,C,D         A  | 2              B,D → C  100%
 2  | B,C,D           B  | 4              C → D     80%
 3  | A,C,D           C  | 5              D → C    100%
 4  | B,C,D           D  | 4              C → B     80%
 5  | B,C             BC | 4              B → C    100%
                      BD | 3
                      CD | 4
                      BCD| 3
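The supports and confidences on this slide can be checked mechanically. The sketch below is not from the talk; `supp` and `conf` are illustrative helper names for absolute support and rule confidence on the 5-transaction toy database.

```python
# Toy database from the slide (5 transactions over items A, B, C, D).
db = [{'A','B','C','D'}, {'B','C','D'}, {'A','C','D'}, {'B','C','D'}, {'B','C'}]

def supp(items):
    """Absolute support: number of transactions containing all the items."""
    return sum(1 for t in db if set(items) <= t)

def conf(lhs, rhs):
    """Confidence of the association rule lhs -> rhs."""
    return supp(set(lhs) | set(rhs)) / supp(lhs)

# Supports listed on the slide (minsup = 60%, i.e. at least 3 of 5 transactions).
assert [supp(s) for s in ('A','B','C','D','BC','BD','CD','BCD')] == [2,4,5,4,4,3,4,3]

# Rules reaching minconf = 80%.
assert conf('D','C') == 1.0 and conf('B','C') == 1.0 and conf('BD','C') == 1.0
assert conf('C','D') == 0.8 and conf('C','B') == 0.8
```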
[Diagram: the KDD process — gather data into a warehouse, mine it, use the discovered patterns]
Association rules gaining popularity
Literally hundreds of algorithms: AIS, Apriori, AprioriTID, AprioriHybrid, FPGrowth, FPGrowth*, Eclat, dEclat, Pincer-search, ABS, DCI, kDCI, LCM, AIM, PIE, ARMOR, AFOPT, COFI, Patricia, MAXMINER, MAFIA, …
Mushroom has 8124 transactions, and a transaction length of 23
Over 50 000 patterns
Over 10 000 000 patterns
[Diagram: a small dataset producing an explosion of patterns]
Frequent itemset / Association rule mining = find all itemsets / ARs satisfying thresholds
Many are redundant:
▪ smoker → lung cancer; smoker, bald → lung cancer
▪ pregnant → woman; pregnant, smoker → woman, lung cancer
Frequent Itemset Mining
Pattern Explosion Problem
Condensed Representations
▪ Closed itemsets
▪ Non-Derivable Itemsets
Recent Approaches Towards Non-Redundant Pattern Mining
Relations Between the Approaches
A1 A2 A3 B1 B2 B3 C1 C2 C3
 1  1  1  0  0  0  0  0  0
 1  1  1  0  0  0  0  0  0
 0  0  0  1  1  1  0  0  0
 0  0  0  1  1  1  0  0  0
 0  0  0  0  0  0  1  1  1
 0  0  0  0  0  0  1  1  1

Number of frequent itemsets = 21
Need a compact representation
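The count of 21 can be verified by brute force. A small sketch, not from the talk, assuming an absolute minsup of 2 (each block of items occurs in exactly two transactions): only the non-empty subsets of one block are frequent, giving 3 × (2³ − 1) = 21 itemsets.

```python
from itertools import combinations

# Six transactions: two identical transactions per block of items.
rows = [{'A1','A2','A3'}] * 2 + [{'B1','B2','B3'}] * 2 + [{'C1','C2','C3'}] * 2
items = sorted(set().union(*rows))
minsup = 2  # assumed absolute threshold

# Enumerate every non-empty candidate itemset and keep the frequent ones.
frequent = [set(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sum(1 for t in rows if set(c) <= t) >= minsup]

assert len(frequent) == 21  # 3 blocks x (2^3 - 1) non-empty subsets per block
```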
Condensed Representation: “Compressed” version of the collection of all frequent itemsets (usually a subset) that allows for lossless regeneration of the complete collection.

Closed Itemsets (Pasquier et al., ICDT 1999)
Free Itemsets (Boulicaut et al., PKDD 2000)
Disjunction-Free Itemsets (Bykowski and Rigotti, PODS 2001)
How do supports interact?
What information about unknown supports can we derive from known supports?
Concise representation: only store the relevant part of the supports
Agrawal et al. (Monotonicity):
Supp(AX) ≤ Supp(A)

Lakhal et al. (Closed sets), Boulicaut et al. (Free sets):
If Supp(A) = Supp(AB)
then Supp(AX) = Supp(AXB)
Bayardo (MAXMINER):
Supp(ABX) ≥ Supp(AX) − drop(X, B), where drop(X, B) = Supp(X) − Supp(BX)

Bykowski, Rigotti (Disjunction-free sets):
if Supp(ABC) = Supp(AB) + Supp(AC) − Supp(A)
then Supp(ABCX) = Supp(ABX) + Supp(ACX) − Supp(AX)
General problem: Given some supports, what can be derived for the supports of other itemsets?
Example:
supp(AB) = 0.7
supp(BC) = 0.5
supp(ABC) ∈ [?, ?]
General problem: Given some supports, what can be derived for the supports of other itemsets?
Example:
supp(AB) = 0.7
supp(BC) = 0.5
supp(ABC) ∈ [0.2, 0.5]
The problem of finding tight bounds is hard to solve in general.
Theorem. The following problem is NP-complete:
Given itemsets I1, …, In and supports s1, …, sn, does there exist a database D such that for j = 1…n, supp(Ij) = sj?
Can be translated into a linear program.
Introduce a variable X_J for every itemset J:
X_J = fraction of transactions whose set of items is exactly J

TID | Items
 1  | A
 2  | C
 3  | C
 4  | A,B
 5  | A,B,C
 6  | A,B,C
For this database:

TID | Items
 1  | A
 2  | C
 3  | C
 4  | A,B
 5  | A,B,C
 6  | A,B,C

X_{} = 0, X_A = 1/6, X_B = 0, X_C = 2/6,
X_AB = 1/6, X_AC = 0, X_BC = 0, X_ABC = 2/6
Give bounds on ABC:
Minimize/maximize X_ABC
s.t. (for a database D)
X_{} + X_A + X_B + X_C + X_AB + X_AC + X_BC + X_ABC = 1
X_{}, X_A, X_B, X_C, …, X_ABC ≥ 0
in which
X_AB + X_ABC = 0.7   (supp(AB) = 0.7)
X_BC + X_ABC = 0.5   (supp(BC) = 0.5)
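For this particular two-constraint LP the optimum has a closed form: maximizing X_ABC gives min(supp(AB), supp(BC)) by monotonicity, and minimizing it gives supp(AB) + supp(BC) − 1, since X_AB + X_BC + X_ABC ≤ 1. A minimal sketch, not from the talk (`abc_bounds` is a hypothetical helper), using exact rationals to avoid floating-point noise:

```python
from fractions import Fraction

def abc_bounds(s_ab, s_bc):
    """Tight LP bounds on supp(ABC) given supp(AB) and supp(BC)."""
    # Lower bound: (s_ab - X_ABC) + (s_bc - X_ABC) + X_ABC <= 1.
    lower = max(Fraction(0), s_ab + s_bc - 1)
    # Upper bound: ABC contains both AB and BC, so supp(ABC) <= both supports.
    upper = min(s_ab, s_bc)
    return lower, upper

lo, hi = abc_bounds(Fraction(7, 10), Fraction(1, 2))
assert (lo, hi) == (Fraction(1, 5), Fraction(1, 2))  # i.e. [0.2, 0.5]
```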
Given: Supp(I) for all I ⊊ J
Give tight [l, u] for Supp(J)
Can be computed efficiently
Without counting: Supp(J) ∈ [l, u]
J is a derivable itemset (DI) iff l = u:
we know Supp(J) exactly without counting!
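These bounds can be computed with the deduction rules for non-derivable itemsets: for each X ⊆ J, the inclusion-exclusion sum σ_X(J) = Σ_{X ⊆ I ⊊ J} (−1)^(|J\I|+1) supp(I) is an upper bound on supp(J) when |J \ X| is odd and a lower bound when it is even. A sketch, not from the talk, recomputed on the earlier 5-transaction toy database (function names are illustrative):

```python
from itertools import combinations

def subsets(s):
    s = sorted(s)
    for r in range(len(s) + 1):
        for c in combinations(s, r):
            yield frozenset(c)

def supp(db, itemset):
    return sum(1 for t in db if itemset <= t)

def ndi_bounds(db, J):
    """Tight bounds [l, u] on supp(J) from the supports of all proper subsets of J."""
    J = frozenset(J)
    lows, ups = [0], []
    for X in subsets(J):
        if X == J:
            continue
        sigma = sum((-1) ** (len(J - I) + 1) * supp(db, I)
                    for I in subsets(J) if X <= I and I != J)
        if len(J - X) % 2 == 1:
            ups.append(sigma)   # odd |J \ X|: upper bound
        else:
            lows.append(sigma)  # even |J \ X|: lower bound
    return max(lows), min(ups)

db = [{'A','B','C','D'}, {'B','C','D'}, {'A','C','D'}, {'B','C','D'}, {'B','C'}]
l, u = ndi_bounds(db, {'A','B','C'})
assert l == u == 1 == supp(db, frozenset('ABC'))  # ABC is derivable here
```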
Considerably smaller than all frequent itemsets
Many redundancies removed
There exist efficient algorithms for mining them
Yet, still way too many patterns generated:
supp(A) = 90%, supp(B) = 20%
supp(AB) ∈ [10%, 20%]
yet supp(AB) = 18% is not interesting
Frequent Itemset Mining
Recent Approaches Towards Non-Redundant Pattern Mining
▪ Statistically based
▪ Compression based
Relations Between the Approaches
We have background knowledge:
▪ supports of some itemsets
▪ column/row marginals
Influences our “expectation” of the database: not every database is equally likely.
Surprisingness: how does the real support correspond to the expectation?
[Diagram: background knowledge (row marginals, column marginals, supports, density of tiles, …) is mapped to a statistical model — either one database or a distribution over databases. The model’s predicted statistic is compared with the statistic of the actual database (support/tile/…); if they differ, the pattern is reported as surprising.]
Types of background knowledge:
▪ supports, marginals, densities of regions
Mapping background knowledge to a statistical model:
▪ a distribution over databases, or one distribution representing the database
A way of computing surprisingness
Row and column marginals:

A B C | Row marginals
0 0 0 | 0
0 1 1 | 2
0 1 1 | 2
1 1 0 | 2
1 0 0 | 1
1 1 1 | 3
Column marginals: 3 4 3
Row and column marginals (entries hidden — only the marginals are known):

A B C | Row marginals
? ? ? | 0
? ? ? | 2
? ? ? | 2
? ? ? | 2
? ? ? | 1
? ? ? | 3
Column marginals: 3 4 3
Density of tiles:

A B C
0 0 0
0 1 1
0 1 1
1 1 0
1 0 0
1 1 1
Density of tiles (entries hidden — only the densities of two highlighted tiles are known):

A B C
? ? ?
? ? ?
? ? ?
? ? ?
? ? ?
? ? ?

One tile has density 1; another has density 6/8.
Consider all databases that satisfy the constraints; take the uniform distribution over these databases.
▪ Gionis et al.: row and column marginals
▪ Hanhijärvi et al.: extension to supports

A. Gionis, H. Mannila, T. Mielikäinen, P. Tsaparas: Assessing data mining results via swap randomization. TKDD 1(3) (2007)
S. Hanhijärvi, M. Ojala, N. Vuokko, K. Puolamäki, N. Tatti, H. Mannila: Tell Me Something I Don’t Know: Randomization Strategies for Iterative Data Mining. ACM SIGKDD (2009)
A B C | Row marginals
1 1 1 | 3
1 1 1 | 3
0 1 1 | 2
1 0 0 | 1
0 1 0 | 1
Column marginals: 3 4 3

supp(BC) = 60%
Is this support surprising given the marginals?
[Figure: several sample databases with the same row and column marginals; in some of them supp(BC) = 60%, in others supp(BC) = 40%]
A B C
1 1 1
1 1 1
0 1 1
1 0 0
0 1 0

supp(BC) = 60%
Is this support surprising given the marginals?
No! p-value = P(supp(BC) ≥ 60% | marginals) = 60%
E[supp(BC)] = 60% × 60% + 40% × 40% = 52%
Estimation of the p-value via simulation (MC).
Uniform sampling from databases with the same marginals is non-trivial → MCMC.
[Figure: two databases with identical marginals, one obtained from the other by a single swap]
[Figure: an MCMC random walk — a chain of databases, each obtained from the previous one by a swap, all sharing the same row and column marginals]
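The MCMC step above is a swap: pick two rows and two columns whose 2×2 submatrix is [[1,0],[0,1]] (or its mirror) and exchange the ones and zeros, which leaves every row and column marginal unchanged. A minimal sketch of this swap-randomization step, not from the talk (`swap_randomize` is an illustrative name):

```python
import random

def swap_randomize(M, n_steps, rng=None):
    """Random walk over 0/1 matrices: each accepted step swaps a 2x2
    submatrix [[1,0],[0,1]] <-> [[0,1],[1,0]], preserving all marginals."""
    rng = rng or random.Random()
    M = [row[:] for row in M]
    n, m = len(M), len(M[0])
    for _ in range(n_steps):
        i, j = rng.randrange(n), rng.randrange(n)
        k, l = rng.randrange(m), rng.randrange(m)
        if M[i][k] == M[j][l] == 1 and M[i][l] == M[j][k] == 0:
            M[i][k] = M[j][l] = 0
            M[i][l] = M[j][k] = 1
    return M

M = [[1,1,1], [1,1,1], [0,1,1], [1,0,0], [0,1,0]]
R = swap_randomize(M, 1000, random.Random(0))
# Every swap preserves the row and column marginals.
assert [sum(row) for row in R] == [sum(row) for row in M]
assert [sum(col) for col in zip(*R)] == [sum(col) for col in zip(*M)]
```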
No explicit model created.
[Diagram: background knowledge = row and column marginals (supports) → uniform distribution over all satisfying databases → predicted statistic obtained by simulation/MCMC → compared with the database statistic → p-value → surprising? Any statistic can be used.]
Database → probability distribution:
p(t=X) = |{ t ∈ D | t=X }| / |D|
Pick the one with maximal entropy:
H(p) = −Σ_X p(t=X) log(p(t=X))

Example: supp(A) = 90%, supp(B) = 20%

A B | prob      A B | prob      A B | prob
0 0 | 10%       0 0 |  0%       0 0 |  8%
0 1 |  0%       0 1 | 10%       0 1 |  2%
1 0 | 70%       1 0 | 80%       1 0 | 72%
1 1 | 20%       1 1 | 10%       1 1 | 18%
H = 1.157       H = 0.922      H = 1.19
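The entropies of the three candidate distributions can be recomputed directly (in bits). A sketch, not from the talk; the third distribution is the independence model p(A)·p(B), which has the largest entropy among the three:

```python
import math

def H(p):
    """Shannon entropy in bits; 0-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Probabilities over (A,B) in the order (0,0), (0,1), (1,0), (1,1),
# all consistent with supp(A) = 90%, supp(B) = 20%.
p1 = [0.10, 0.00, 0.70, 0.20]
p2 = [0.00, 0.10, 0.80, 0.10]
p3 = [0.08, 0.02, 0.72, 0.18]  # independence model: 0.9*0.2 = 0.18, etc.

assert round(H(p1), 3) == 1.157
assert round(H(p3), 2) == 1.19
assert H(p3) > H(p1) > H(p2)   # the independence model maximizes entropy
```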
H(p) = −Σ_X p(t=X) log(p(t=X))
▪ −log(p(t=X)) is the space required to encode X under an optimal Shannon encoding for the distribution p; it characterizes the information content of X
▪ p(t=X) is the probability that the event t=X occurs
H(p) = expected number of bits needed to encode transactions