Association Rule Mining
What Is Association Rule Mining?
Association rule mining is the task of finding frequent patterns or associations among sets of items or objects, usually in transactional data. Applications include market basket analysis, cross-marketing, and catalog design.
Association Mining Examples
Rule form: "Body → Head [support, confidence]".
buys(x, "diapers") → buys(x, "beer") [0.5%, 60%]
buys(x, "bread") → buys(x, "milk") [0.6%, 65%]
major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
age(X, 30-45) ∧ income(X, 50K-75K) → buys(X, SUVcar)
Equivalently, with attribute-value pairs: age = "30-45", income = "50K-75K" → car = "SUV"
Market-basket Analysis & Finding Associations
Do items occur together? Association rule mining was proposed by Agrawal et al. in 1993. It is an important data mining model that has been studied extensively by the database and data mining community. It assumes all data are categorical. It was initially used for market basket analysis, to find how items purchased by customers are related.
Example: Bread → Milk [sup = 5%, conf = 100%]
Association Rule: Basic Concepts
Given: (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in a visit).
Find: all rules that correlate the presence of one set of items with that of another set of items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
Applications:
• * → Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
• Home Electronics → * (what other products should the store stock up on?)
• Detecting "ping-pong"ing of patients, faulty "collisions"
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality! An itemset is simply a set of items.
Examples from a Supermarket
Can you think of association rules from a supermarket? Suppose you identify association rules from a supermarket: how might you exploit them? That is, if you are the store manager, how might you make money? Assume you have a rule of the form X → Y.
Supermarket examples
If you have a rule X → Y, you could:
• Run a sale on X if you want to increase sales of Y
• Locate the two items near each other
• Locate the two items far from each other to make the shopper walk through the store
• Print out a coupon on checkout for Y if the shopper bought X but not Y
Association "rules": standard format
Rule format: If {set of items} Then {set of items}. (A set can consist of just a single item.)
The If part is the Condition; the Then part is the Results. Condition implies Results.
Example: If {Diapers, Baby Food} Then {Beer, Chips}
The right side very often is a single item. Rules do not imply causality.
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both.]
What is an Interesting Association?
Requires domain-knowledge validation: actionable, non-trivial, understandable. Algorithms provide a first pass based on statistics on how "unexpected" an association is.
Some standard statistics used for a rule C → R (Condition → Results):
• support ≈ p(R & C): percent of "baskets" where the rule holds
• confidence ≈ p(R | C): percent of times R holds when C holds
Support and Confidence
Find all the rules X → Y with minimum confidence and support.
Support = probability that a transaction contains {X, Y}, i.e., the ratio of transactions in which X and Y occur together to all transactions in the DB.
Confidence = conditional probability that a transaction having X also contains Y, i.e., the ratio of transactions in which X and Y occur together to those in which X occurs.
The confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of LHS:
Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both.]
Definition: Frequent Itemset
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper}. A k-itemset is an itemset with k items.
Support count (σ): frequency count of occurrence of an itemset. E.g., σ({Milk, Bread, Diaper}) = 2.
Support (s): fraction of transactions containing the itemset. E.g., s({Milk, Bread, Diaper}) = 2/5.
Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.
Support and Confidence Calculations
Given association rule: {Milk, Diaper} → {Beer}
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Rule evaluation metrics for a rule X → Y:
• Support (s): fraction of transactions that contain both X and Y
• Confidence (c): measures how often items in Y appear in transactions that contain X
Now compute these two metrics for {Milk, Diaper} → {Beer}:
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
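The calculation above can be checked with a few lines of Python. This is only a minimal sketch: the helper names (support_count, support, confidence) are illustrative choices, not part of the slides, and transactions are modeled as plain Python sets.

```python
from fractions import Fraction

# The five market-basket transactions from the slide, one set per TID.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions containing the itemset."""
    return Fraction(support_count(itemset, transactions), len(transactions))

def confidence(lhs, rhs, transactions):
    """c(lhs -> rhs) = sigma(lhs U rhs) / sigma(lhs).

    Assumes lhs occurs in at least one transaction; otherwise the
    division is undefined and Fraction raises ZeroDivisionError.
    """
    return Fraction(support_count(lhs | rhs, transactions),
                    support_count(lhs, transactions))

lhs, rhs = {"Milk", "Diaper"}, {"Beer"}
s = support(lhs | rhs, TRANSACTIONS)
c = confidence(lhs, rhs, TRANSACTIONS)
print(f"s = {s} = {float(s):.2f}")  # s = 2/5 = 0.40
print(f"c = {c} = {float(c):.2f}")  # c = 2/3 = 0.67
```

Using Fraction keeps the ratios exact (2/5 and 2/3), so the rounded values are only produced when printing.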
Support and Confidence: 2nd Example
Transaction ID  Items Bought
1001            A, B, C
1002            A, C
1003            A, D
1004            B, E, F
1005            A, D, F
Itemset {A, C} has a support of 2/5 = 40%.
Rule {A} ==> {C} has a confidence of 50%.
Rule {C} ==> {A} has a confidence of 100%.
Support for {A, C, E}?
Support for {A, D, F}?
Confidence for {A, D} ==> {F}?
Confidence for {A} ==> {D, F}?
Goal: find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
Example
Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Assume: minsup = 30%, minconf = 80%.
An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
Rules from the itemset are partitions of the items. Association rules from the above itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support (by definition) but may have different confidence values
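The six rules can be reproduced by enumerating every binary partition of {Milk, Diaper, Beer}. The sketch below is illustrative only: rules_from_itemset is a hypothetical helper name, and it assumes the itemset occurs in at least one transaction so that no left-hand side has a zero support count.

```python
from itertools import combinations

# The five market-basket transactions from the slide.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(itemset, transactions):
    """Yield (lhs, rhs, support, confidence) for each binary partition.

    Every non-empty proper subset of the itemset serves once as the
    left-hand side; the remaining items form the right-hand side.
    """
    itemset = frozenset(itemset)
    n = len(transactions)
    whole = support_count(itemset, transactions)
    for k in range(1, len(itemset)):
        for lhs_items in combinations(sorted(itemset), k):
            lhs = frozenset(lhs_items)
            rhs = itemset - lhs
            # Support is the same for every rule; only confidence varies.
            yield lhs, rhs, whole / n, whole / support_count(lhs, transactions)

rules = list(rules_from_itemset({"Milk", "Diaper", "Beer"}, TRANSACTIONS))
for lhs, rhs, s, c in rules:
    print(f"{sorted(lhs)} -> {sorted(rhs)}  (s={s:.1f}, c={c:.2f})")
```

An itemset with k items yields 2^k - 2 such rules, which is 6 for k = 3.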
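Drawback of Confidence
        Coffee  ¬Coffee  Total
Tea       15       5       20
¬Tea      75       5       80
Total     90      10      100
Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 90/100 = 0.9.
Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375, so people who do not drink tea are even more likely to drink coffee.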
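The three probabilities can be recomputed directly from the counts in the contingency table. This is a small sketch; the variable names are illustrative.

```python
# Counts from the slide's contingency table (100 people in total).
tea_coffee, tea_no_coffee = 15, 5
no_tea_coffee, no_tea_no_coffee = 75, 5

total = tea_coffee + tea_no_coffee + no_tea_coffee + no_tea_no_coffee

# Confidence of Tea -> Coffee.
p_coffee_given_tea = tea_coffee / (tea_coffee + tea_no_coffee)
# Baseline: how often Coffee occurs regardless of Tea.
p_coffee = (tea_coffee + no_tea_coffee) / total
# How often Coffee occurs among people who do not drink Tea.
p_coffee_given_no_tea = no_tea_coffee / (no_tea_coffee + no_tea_no_coffee)

print(p_coffee_given_tea)     # 0.75
print(p_coffee)               # 0.9
print(p_coffee_given_no_tea)  # 0.9375

# The rule looks strong in isolation, yet knowing that someone drinks
# tea lowers the chance of coffee relative to the baseline.
print(p_coffee_given_tea < p_coffee)  # True
```

Comparing a rule's confidence against the plain frequency of its right-hand side is a quick sanity check before acting on the rule.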
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup.
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
Frequent itemset generation is still computationally expensive.
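To see why frequent itemset generation is expensive, consider the naive approach of counting every possible itemset. The sketch below does this for the five market-basket transactions used earlier; the function name brute_force_frequent_itemsets and the threshold minsup = 3/5 are assumptions chosen for illustration, not values from the slides.

```python
from fractions import Fraction
from itertools import combinations

# The five market-basket transactions used in the earlier slides.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def brute_force_frequent_itemsets(transactions, minsup):
    """Count every non-empty itemset; keep those with support >= minsup.

    Returns (frequent, candidates_checked), where frequent maps each
    frequent itemset to its support count.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    candidates_checked = 0
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            candidates_checked += 1
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if Fraction(count, n) >= minsup:
                frequent[cset] = count
    return frequent, candidates_checked

# minsup = 3/5 is an illustrative threshold, not taken from the slides.
frequent, checked = brute_force_frequent_itemsets(TRANSACTIONS, Fraction(3, 5))
print(f"{checked} candidates checked, {len(frequent)} frequent itemsets")
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With d distinct items there are 2^d - 1 non-empty candidates (63 for the 6 items here), and each one requires a pass over the transactions, so this approach does not scale to realistic item catalogs.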
Transaction data representation
A simplistic view of "shopping baskets". Some important information is not considered:
• the quantity of each item purchased
• the price paid
Many mining algorithms
There are a large number of them. They use different strategies and data structures, but their resulting sets of rules are all the same. Given a transaction data set T, a minimum support, and a minimum confidence, the set of association rules existing in T is uniquely determined. Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may be different.
We study only one: the Apriori algorithm.
The Apriori algorithm
The best-known algorithm. Two steps:
1. Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
2. Use frequent itemsets to generate rules.
E.g., a frequent itemset {Chicken, Clothes, Milk} [sup = 3/7] and one rule from the frequent itemset: Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Step 1: Mining all Frequent Itemsets
A frequent itemset is an itemset whose support is ≥ minsup.
Key idea: the Apriori property (downward closure property): every subset of a frequent itemset is also a frequent itemset.
[Figure: itemset lattice over the items A, B, C, D]
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
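The downward closure property suggests a level-wise search: a (k+1)-itemset only needs to be counted if all of its k-item subsets are already known to be frequent. The following is a simplified sketch of that idea, not a faithful reproduction of the published algorithm: the function name apriori is illustrative, and candidates are formed by taking unions of any two frequent k-itemsets rather than by a prefix-based join. It runs on the t1 to t7 transaction data from the earlier example with minsup = 30%.

```python
from fractions import Fraction
from itertools import combinations

# Transactions t1..t7 from the earlier example slide.
TRANSACTIONS = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def apriori(transactions, minsup):
    """Level-wise frequent itemset search; returns {itemset: support count}."""
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def is_frequent(itemset):
        return Fraction(count(itemset), n) >= minsup

    # Level 1: frequent single items.
    items = set().union(*transactions)
    level = {frozenset([i]) for i in items if is_frequent(frozenset([i]))}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = count(itemset)
        # Candidate generation: unions of two frequent k-itemsets
        # that form a (k+1)-itemset.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Pruning (downward closure): drop any candidate that has an
        # infrequent k-item subset, without counting it.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        # Only the surviving candidates are counted against the data.
        level = {c for c in candidates if is_frequent(c)}
        k += 1
    return frequent

frequent = apriori(TRANSACTIONS, Fraction(30, 100))
for itemset, cnt in sorted(frequent.items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"sup = {cnt}/7")
```

On this data the pruning step discards candidates such as {Beef, Chicken, Milk} without scanning the transactions, because its subset {Beef, Milk} is not frequent.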