A New Framework for Itemset Generation
Charu C. Aggarwal, Philip S. Yu
IBM T. J. Watson Research Center
August 10, 1998
Association Rules
(1) Identify the presence of one set of items implying the presence of another set of items in a transaction, e.g. diaper ⇒ beer.
(2) Applications:
– Market basket analysis
– Attached mailing in direct marketing
– Department store floor/shelf planning
– Internet surfing patterns
Generation of Association Rules X ⇒ Y
(1) The support of the rule X ⇒ Y is the fraction of transactions which contain both the sets of items X and Y.
(2) The confidence of the rule X ⇒ Y is the fraction of the transactions containing X which also contain Y.
(3) The traditional approach to association rule mining:
– first finds all the large itemsets which have sufficient support, using large itemset generation algorithms,
– then uses them to generate all the rules with sufficient confidence.
(4) The Apriori method works by
– generating all potential large (k+1)-itemsets from large k-itemsets using joins on the large k-itemsets,
– and then validating them against the database.
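The support and confidence definitions above can be sketched directly in code. This is a minimal illustration, not the paper's implementation; the toy transaction data below is made up.

```python
# Support and confidence of a candidate rule X => Y, computed directly
# from a list of transactions (each transaction is a set of items).

def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical toy database, echoing the diaper => beer example.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

print(support(transactions, {"diaper", "beer"}))       # 0.5
print(confidence(transactions, {"diaper"}, {"beer"}))  # ~0.667
```

The Apriori step would then keep only itemsets whose support clears a threshold before any rules are generated.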
Weaknesses of the large itemset method
(1) The large itemset model works very well when the data is sparse.
(2) When the data loses its sparse property, the large itemset method breaks down.
(3) The method does not address the significance of a rule (relative to the assumption of statistical independence).
– Generalizing Association Rules to Correlations (SIGMOD '97), Brin, Motwani and Silverstein
Example
(1) Consider the following example: a retailer of breakfast cereal surveys 5000 students on their morning activities. The data shows that
– 3000 students play basketball,
– 3750 eat cereal, and
– 2000 students both play basketball and eat cereal.
(2) Consider the following rule at 40% support and 60% confidence: play basketball ⇒ eat cereal.
(3) This association rule is misleading, because the overall percentage of students eating cereal is 75%, which is even larger than 60%.
(4) The rule play basketball ⇒ (not) eat cereal has both lower confidence and lower support than the rule implying positive association.
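Working through the survey numbers makes the problem concrete: the rule clears both thresholds, yet its confidence is below the unconditional cereal rate. A small arithmetic check, using only the counts given on the slide:

```python
# Survey counts from the slide.
n = 5000
basketball, cereal, both = 3000, 3750, 2000

support_rule = both / n               # 0.4   -> clears 40% support
confidence_rule = both / basketball   # ~0.667 -> clears 60% confidence
overall_cereal = cereal / n           # 0.75  -> higher than the confidence!

print(support_rule, confidence_rule, overall_cereal)
```

Since 0.667 < 0.75, playing basketball actually lowers the chance of eating cereal, despite the rule passing both thresholds.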
Another example
(1) Consider the following example:

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Table 1: The base data

Rule     Support   Confidence
X ⇒ Y    25%       50%
X ⇒ Z    37.5%     75%

Table 2: Corresponding support and confidence

• The coefficient of correlation between the items X and Y is 0.577, while the coefficient of correlation between X and Z is −0.378.
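The correlation coefficients quoted on the slide can be reproduced from Table 1. A short sketch (the `pearson` helper is my own, not from the paper):

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length 0/1 columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = sqrt(sum((x - ma) ** 2 for x in a) / n)
    sb = sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb)

# The columns of Table 1.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

print(round(pearson(X, Y), 3))  # 0.577
print(round(pearson(X, Z), 3))  # -0.378
```

So X ⇒ Z has the higher support and confidence even though X and Z are negatively correlated, while X ⇒ Y, the positively correlated pair, looks weaker.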
The basic problems
• Spuriousness in itemset generation, as illustrated by the last few examples.
• Need to deal with dense data sets: how to set the support level.
• Inability to find negative association rules: too much bias in favor of the presence of items as opposed to the absence of items. We need to treat the presence and absence of an item in a symmetric way.
• Data in which the different attributes have widely varying densities.
Interest Measure
• The use of the interest measure is an attempt to remove itemsets whose presence is consistent with statistical independence.
• An itemset is said to be R-interesting if its presence is R times the expected presence based on the assumption of statistical independence.
Use of interest measures
• The use of interest measures (which were proposed by Srikant et al.) is useful in pruning away those rules which are rendered uninteresting.
• As the basketball-cereal example illustrates, so long as interest is used as a postprocessing operator, either the user has to set the support value low enough so as not to lose any interesting rules in the output, or risk losing useful rules. The former may not always be computationally feasible.
• The interest measure does not normalize uniformly with respect to dense or sparse data.
• For two items with perfect positive correlation and base density of 0.9 each, the interest level is 0.9/(0.9)² = 1.11, while for two items with perfect positive statistical correlation and base density of 0.1 each, the interest level is 0.1/(0.1)² = 10.
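The non-uniformity complaint in the last bullet is easy to verify numerically. For a perfectly correlated pair the joint density equals the common base density, so interest = p / p² = 1/p:

```python
# Interest of a 2-itemset: actual joint density divided by the joint
# density expected under independence.
def interest(p_joint, p_a, p_b):
    return p_joint / (p_a * p_b)

print(interest(0.9, 0.9, 0.9))  # ~1.11  (dense, perfectly correlated items)
print(interest(0.1, 0.1, 0.1))  # ~10.0  (sparse, perfectly correlated items)
```

Identical (perfect) correlation structure thus yields interest values an order of magnitude apart, purely because of density.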
The notion of collective strength
• Let I be an itemset.
• An itemset I is said to be violated in a transaction if some of its items take on the value of 0, while others take on the value of 1 in that transaction.
• Let v(I) be the fraction of violations. We have E[v(I)] = 1 − ∏_{i ∈ I} p_i − ∏_{i ∈ I} (1 − p_i).
• Let A(I) be the fraction of agreements: A(I) = 1 − v(I). Also we have E[A(I)] = 1 − E[v(I)].
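The violation ratio and its expectation can be checked on the Table 1 data. A minimal sketch (the encoding of the table as per-transaction item sets is mine):

```python
from math import prod  # Python 3.8+

# Table 1 encoded as one item set per transaction (column).
transactions = [{"X", "Y"}, {"X", "Y", "Z"}, {"X", "Z"}, {"X", "Z"},
                {"Z"}, {"Z"}, {"Z"}, {"Z"}]
items = {"X", "Y"}
n = len(transactions)

# Item densities p_i.
p = [sum(1 for t in transactions if i in t) / n for i in items]

# A transaction violates the itemset when it contains some, but not all,
# of its items.
v = sum(1 for t in transactions if 0 < len(items & t) < len(items)) / n

# Expected violation ratio under independence: 1 - prod(p_i) - prod(1 - p_i).
exp_v = 1 - prod(p) - prod(1 - q for q in p)

print(v)      # 0.25
print(exp_v)  # 0.5
```

Here {X, Y} is violated half as often as independence would predict, a first hint that the pair is positively associated.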
Collective Strength
• The collective strength of an itemset is the agreement ratio, normalized by its expectation, times the expected violation ratio divided by the actual violation ratio:

C(I) = [(1 − v(I)) / (1 − E[v(I)])] · [E[v(I)] / v(I)]    (1)

• Another way of looking at collective strength:

C(I) = (Good Events / E[Good Events]) · (E[Bad Events] / Bad Events)    (2)

• When there is perfect negative correlation among the items, the collective strength is 0; when there is perfect positive correlation, the collective strength is ∞.
• A collective strength of 1 is the break-even point.
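Equation (1) translates directly into code. A sketch, again using the Table 1 data as per-transaction item sets (my encoding, not the paper's):

```python
from math import prod

def collective_strength(transactions, items):
    """Collective strength C(I) per equation (1), computed from transactions."""
    items = set(items)
    n = len(transactions)
    p = [sum(1 for t in transactions if i in t) / n for i in items]
    # Violation: some but not all items of I appear in the transaction.
    v = sum(1 for t in transactions
            if 0 < len(items & t) < len(items)) / n
    exp_v = 1 - prod(p) - prod(1 - q for q in p)
    return ((1 - v) / (1 - exp_v)) * (exp_v / v)

transactions = [{"X", "Y"}, {"X", "Y", "Z"}, {"X", "Z"}, {"X", "Z"},
                {"Z"}, {"Z"}, {"Z"}, {"Z"}]

print(collective_strength(transactions, {"X", "Y"}))  # 3.0
print(collective_strength(transactions, {"X", "Z"}))  # ~0.6
print(collective_strength(transactions, {"Y", "Z"}))  # ~0.31
```

These reproduce the collective strength column of the next slide: the positively correlated pair lands above the break-even point of 1, the negatively correlated pairs below it.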
Application to previous examples
• Basketball-cereal example: 5000 people, 3000 play basketball, 3750 eat cereal, 2000 both play basketball and eat cereal.

Itemset                              Support   Collective Strength
Play basketball, eat cereal          40%       0.67
Play basketball, (not) eat cereal    20%       1/0.67 = 1.49

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Table 3: The base data

Itemset   Support   Statistical Correlation   Collective Strength
X, Y      25%        0.577                    3
X, Z      37.5%     −0.378                    0.6
Y, Z      12.5%     −0.655                    0.31
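The basketball-cereal entries can be verified from the four contingency counts alone. A sketch for the 2-itemset case (the `cs_from_counts` helper is mine):

```python
# Collective strength of a 2-itemset from contingency counts.
def cs_from_counts(n, n_a, n_b, n_both):
    p_a, p_b = n_a / n, n_b / n
    neither = n - n_a - n_b + n_both      # transactions containing neither item
    agree = (n_both + neither) / n        # both present, or both absent
    exp_v = 1 - p_a * p_b - (1 - p_a) * (1 - p_b)
    return (agree / (1 - exp_v)) * (exp_v / (1 - agree))

# Basketball (3000) with cereal (3750), co-occurring 2000 times.
print(round(cs_from_counts(5000, 3000, 3750, 2000), 2))  # 0.67
# Basketball with NOT-cereal (1250), co-occurring 1000 times.
print(round(cs_from_counts(5000, 3000, 1250, 1000), 2))  # 1.49
```

The positive rule scores below 1 and its negation above 1, matching the table: collective strength flags the association's true (negative) direction where support-confidence did not.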
Closure Property
• Suppose that the items {Milk, Bread} are closely correlated, and similarly for the items {Diaper, Beer}.
• This will result in {Milk, Bread, Diaper, Beer} having high collective strength:
– {Milk, Bread} and {Diaper, Beer} are independent,
– items within each set are perfectly correlated (support 10%),
– collective strength: [(0.1² + 0.9²) / (0.1⁴ + 0.9⁴)] · [(1 − (0.1⁴ + 0.9⁴)) / (1 − (0.1² + 0.9²))] ≈ 2.39.
• The closure property forces all subsets to be closely correlated.
• An itemset I is said to be strongly collective at level K if it satisfies the following properties:
– The collective strength C(I) of the itemset I is at least K.
– Closure Property: the collective strength of every subset J of I is at least K.
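The ≈ 2.39 figure follows from the slide's formula: each pair is all-1 with probability 0.1 and all-0 with probability 0.9, independently, so the 4-itemset agrees exactly when both pairs fall the same way. A quick check:

```python
# Two independent, perfectly correlated pairs, each with support p = 0.1.
p = 0.1
agree = p**2 + (1 - p)**2       # both pairs all-1, or both pairs all-0: 0.82
exp_agree = p**4 + (1 - p)**4   # expectation if all four items were independent

c = (agree / exp_agree) * ((1 - exp_agree) / (1 - agree))
print(round(c, 2))  # ~2.39
```

So the 4-itemset looks strongly collective even though the two halves are unrelated, which is exactly why the closure condition on all subsets is imposed.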
Generating the strongly collective baskets
• Let k₀ be a number which is larger than 1. Consider an itemset B of size n ≥ 2. Suppose that all 2-subsets of B have collective strength larger than k₀. Then the itemset B is highly likely to have collective strength larger than k₀.
• The following results can be proved for the 2-to-3 case:
– Let I = {i₁, i₂, i₃} be a 3-itemset. Suppose that for every 2-subset of I the violation ratio is at most δ < 1. Then it must also be the case that the violation ratio of the itemset I is at most 3δ/2 (a transaction which violates I violates exactly two of its three 2-subsets).
– A similar result can be proved for the agreement ratio.
• When the above two results are used in conjunction, bounds on the collective strength may be inferred.
Algorithm for finding itemsets with collective strength
• Find all two-itemsets with the appropriate collective strength. Let us call this set P_2.
• Perform joins to find P_{k+1} from P_k.
• Remove all those (k+1)-itemsets from P_{k+1} such that some k-subset is not included in P_k.
• Continue the process for increasing k, until P_k is empty.
• Perform a pass over the transaction database in order to remove any false itemsets in P_k for each k.
• Validating against the database is efficient because of the property discussed earlier.
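The levelwise loop above can be sketched as follows. This is a small in-memory illustration, not the paper's implementation: the validation pass is collapsed into a direct collective-strength computation over the transaction list, and all helper names are mine.

```python
from itertools import combinations
from math import prod

def collective_strength(transactions, items):
    """C(I) per equation (1); returns inf for degenerate (never-violated) sets."""
    items = set(items)
    n = len(transactions)
    p = [sum(1 for t in transactions if i in t) / n for i in items]
    v = sum(1 for t in transactions
            if 0 < len(items & t) < len(items)) / n
    exp_v = 1 - prod(p) - prod(1 - q for q in p)
    if v == 0 or exp_v in (0, 1):
        return float("inf")
    return ((1 - v) / (1 - exp_v)) * (exp_v / v)

def strongly_collective(transactions, items, K):
    # P_2: all 2-itemsets with collective strength at least K.
    level = {frozenset(c) for c in combinations(items, 2)
             if collective_strength(transactions, c) >= K}
    result = set(level)
    k = 2
    while level:
        # Join step: candidate (k+1)-itemsets from pairs of k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be in P_k.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Validation pass over the database removes false candidates.
        level = {c for c in candidates
                 if collective_strength(transactions, c) >= K}
        result |= level
        k += 1
    return result

# Table 1 data: only {X, Y} survives at level K = 2.
transactions = [{"X", "Y"}, {"X", "Y", "Z"}, {"X", "Z"}, {"X", "Z"},
                {"Z"}, {"Z"}, {"Z"}, {"Z"}]
print(strongly_collective(transactions, ["X", "Y", "Z"], 2))
```

A real implementation would compute the per-level violation counts in one database scan per level rather than rescanning inside `collective_strength`; the pruning step is what the subset-bound results of the previous slide justify.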