CISC 4631 Data Mining
Lecture 10: Association Rule Mining

These slides are based on the slides by
• Tan, Steinbach and Kumar (textbook authors)
• Prof. F. Provost (Stern, NYU)
• Prof. B. Liu, UIC
What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
– Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Mining?
• Examples.
– Rule form: "Body → Head [support, confidence]".
– buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
– buys(x, "bread") → buys(x, "milk") [0.6%, 65%]
– major(x, "CS") /\ takes(x, "DB") → grade(x, "A") [1%, 75%]
– age(X, 30-45) /\ income(X, 50K-75K) → buys(X, SUVcar)
– age = "30-45", income = "50K-75K" → car = "SUV"
Market-basket analysis and finding associations
• Do items occur together? (more than I might expect)
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database and data mining community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how items purchased by customers are related.
Bread → Milk [sup = 5%, conf = 100%]
Association Rule: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
– * → Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
– Home Electronics → * (What other products should the store stock up on?)
– Detecting "ping-pong"ing of patients, faulty "collisions"
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
An itemset is simply a set of items.
Association Rule Mining
– We are interested in rules that are
• non-trivial (and possibly unexpected)
• actionable
• easily explainable
Examples from a Supermarket
• Can you think of association rules from a supermarket?
• Let's say you identify association rules from a supermarket, how might you exploit them?
– That is, if you are the store manager, how might you make money?
• Assume you have a rule of the form X → Y
Supermarket examples
• If you have a rule X → Y, you could:
– Run a sale on X if you want to increase sales of Y
– Locate the two items near each other
– Locate the two items far from each other to make the shopper walk through the store
– Print out a coupon on checkout for Y if the shopper bought X but not Y
Association "rules" – standard format
Rule format (a set can consist of just a single item):
If {set of items} Then {set of items}
   (Condition)       (Results)

Example:
If {Diapers, Baby Food} Then {Beer, Chips}

Condition implies Results.
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
What is an interesting association?
• Requires domain-knowledge validation
– actionable vs. trivial vs. inexplicable
• Algorithms provide a first pass based on statistics on how "unexpected" an association is
• Some standard statistics used, for a rule C → R:
– support ≈ p(R & C)
• percent of "baskets" where the rule holds
– confidence ≈ p(R | C)
• percent of times R holds when C holds
Support and Confidence
• Find all the rules X → Y with minimum confidence and support
– Support = probability that a transaction contains {X, Y}
• i.e., the ratio of transactions in which X and Y occur together to all transactions in the database.
– Confidence = conditional probability that a transaction having X also contains Y
• i.e., the ratio of transactions in which X and Y occur together to those in which X occurs.
In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of the LHS:
Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
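To make these formulas concrete, here is a minimal Python sketch (not part of the original slides; the helper names are illustrative) that computes support and confidence directly from the market-basket transactions used throughout these slides. The two printed values match the s = 0.4 and c ≈ 0.67 reported for {Milk, Diaper} → {Beer} on the following slides.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence(LHS => RHS) = Support(LHS u RHS) / Support(LHS)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# The market-basket transactions shown on these slides
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```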
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X
• Example: {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Support and Confidence - Example

Transaction ID  Items Bought
1001            A, B, C
1002            A, C
1003            A, D
1004            B, E, F
1005            A, D, F

• Itemset {A, C} has a support of 2/5 = 40%
• Rule {A} ==> {C} has confidence of 50%
• Rule {C} ==> {A} has confidence of 100%
• Support for {A, C, E}?
• Support for {A, D, F}?
• Confidence for {A, D} ==> {F}?
• Confidence for {A} ==> {D, F}?

Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
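The four open questions above can be checked with the same kind of computation; the short sketch below (illustrative only, not from the original slides) evaluates them over the five transactions in the table.

```python
transactions = {
    1001: {"A", "B", "C"},
    1002: {"A", "C"},
    1003: {"A", "D"},
    1004: {"B", "E", "F"},
    1005: {"A", "D", "F"},
}

def support(itemset):
    return sum(1 for t in transactions.values() if set(itemset) <= t) / len(transactions)

def confidence(lhs, rhs):
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C", "E"}))        # 0.0  -> support 0%
print(support({"A", "D", "F"}))        # 0.2  -> support 20%
print(confidence({"A", "D"}, {"F"}))   # 0.5  -> confidence 50%
print(confidence({"A"}, {"D", "F"}))   # 0.25 -> confidence 25%
```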
Example
• Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
Mining Association Rules

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Drawback of Confidence

         Coffee  Not Coffee  Total
Tea        15        5         20
Not Tea    75        5         80
Total      90       10        100

Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
Although the confidence is high, the rule is misleading: P(Coffee | Not Tea) = 75/80 = 0.9375
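As a quick check of the numbers behind this slide, the small illustrative sketch below (not part of the original slides) recomputes the three probabilities directly from the contingency table.

```python
# Counts copied from the contingency table above
tea_and_coffee, tea_no_coffee = 15, 5
no_tea_and_coffee, no_tea_no_coffee = 75, 5
total = 100

p_coffee_given_tea = tea_and_coffee / (tea_and_coffee + tea_no_coffee)              # 0.75
p_coffee = (tea_and_coffee + no_tea_and_coffee) / total                             # 0.90
p_coffee_given_no_tea = no_tea_and_coffee / (no_tea_and_coffee + no_tea_no_coffee)  # 0.9375

# Tea drinkers are actually *less* likely to buy coffee than non-tea drinkers,
# even though the rule Tea -> Coffee has 75% confidence.
print(p_coffee_given_tea, p_coffee, p_coffee_given_no_tea)
```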
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
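As a rough illustration of the two-step approach, here is a naive Python sketch (not the course's algorithm; the names are illustrative). Step 1 enumerates and counts every possible candidate itemset, which is exactly why frequent itemset generation is computationally expensive and why smarter algorithms are needed.

```python
from itertools import chain, combinations

def frequent_itemsets(transactions, minsup):
    """Step 1 (brute force): test every possible itemset against minsup."""
    transactions = [set(t) for t in transactions]
    items = sorted(set(chain.from_iterable(transactions)))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= t) / n
            if sup >= minsup:
                frequent[frozenset(cand)] = sup
    return frequent

def generate_rules(frequent, minconf):
    """Step 2: split each frequent itemset into LHS -> RHS and keep high-confidence rules."""
    rules = []
    for itemset, sup in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(itemset, k):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]  # every subset of a frequent itemset is also frequent
                if conf >= minconf:
                    rules.append((set(lhs), set(itemset - lhs), sup, conf))
    return rules
```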
Transaction data representation
• A simplistic view of shopping baskets.
• Some important information not considered, e.g.:
– the quantity of each item purchased, and
– the price paid.
Many mining algorithms
• There are a large number of them!!
• They use different strategies and data structures.
• Their resulting sets of rules are all the same.
– Given a transaction data set T, a minimum support, and a minimum confidence, the set of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may be different.
• We study only one: the Apriori Algorithm
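For orientation before the Apriori material, here is a compact sketch of the level-wise idea behind Apriori (a simplified textbook-style formulation, not necessarily the exact pseudocode used in this course; it omits the subset-based candidate pruning step and re-scans the data for each candidate).

```python
def apriori(transactions, minsup):
    """Level-wise frequent itemset mining (simplified Apriori sketch)."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    frequent = {s: sup(s) for s in level}

    k = 2
    while level:
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Keep only candidates that meet minsup (a full Apriori would also prune
        # candidates with an infrequent (k-1)-subset before counting).
        level = {c for c in candidates if sup(c) >= minsup}
        frequent.update({c: sup(c) for c in level})
        k += 1
    return frequent

# The transaction data from the earlier example slide, minsup = 30%:
data = [
    ["Beef", "Chicken", "Milk"], ["Beef", "Cheese"], ["Cheese", "Boots"],
    ["Beef", "Chicken", "Cheese"], ["Beef", "Chicken", "Clothes", "Cheese", "Milk"],
    ["Chicken", "Clothes", "Milk"], ["Chicken", "Milk", "Clothes"],
]
print(apriori(data, 0.3))  # includes frozenset({'Chicken', 'Clothes', 'Milk'}) with support 3/7
```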