CISC 4631 Data Mining
Lecture 10: Association Rule Mining

These slides are based on the slides by
• Tan, Steinbach and Kumar (textbook authors)
• Prof. F. Provost (Stern, NYU)
• Prof. B. Liu, UIC
What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
– Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Mining?
• Examples.
– Rule form: "Body → Head [support, confidence]".
– buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
– buys(x, "bread") → buys(x, "milk") [0.6%, 65%]
– major(x, "CS") /\ takes(x, "DB") → grade(x, "A") [1%, 75%]
– age(X, 30-45) /\ income(X, 50K-75K) → buys(X, SUVcar)
– age = "30-45", income = "50K-75K" → car = "SUV"
Market-basket analysis and finding associations
• Do items occur together? (more than I might expect)
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database and data mining community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how items purchased by customers are related.
Bread → Milk [sup = 5%, conf = 100%]
Association Rule: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
– * → Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
– Home Electronics → * (What other products should the store stock up on?)
– Detecting "ping-pong"ing of patients, faulty "collisions"
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
An itemset is simply a set of items.
Association Rule Mining
– We are interested in rules that are
• non-trivial (and possibly unexpected)
• actionable
• easily explainable
Examples from a Supermarket
• Can you think of association rules from a supermarket?
• Let's say you identify association rules from a supermarket, how might you exploit them?
– That is, if you are the store manager, how might you make money?
• Assume you have a rule of the form X → Y
Supermarket examples
• If you have a rule X → Y, you could:
– Run a sale on X if you want to increase sales of Y
– Locate the two items near each other
– Locate the two items far from each other to make the shopper walk through the store
– Print out a coupon on checkout for Y if the shopper bought X but not Y
Association "rules" – standard format
Rule format (a set can consist of just a single item):
If {set of items} Then {set of items}
   (Condition)       (Results)

Example:
If {Diapers, Baby Food} Then {Beer, Chips}

Condition implies Results.
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
What is an interesting association?
• Requires domain-knowledge validation
– actionable vs. trivial vs. inexplicable
• Algorithms provide a first pass based on statistics on how "unexpected" an association is
• Some standard statistics used, for a rule C → R:
– support ≈ p(R & C)
• percent of "baskets" where the rule holds
– confidence ≈ p(R | C)
• percent of times R holds when C holds
Support and Confidence
• Find all the rules X → Y with minimum confidence and support
– Support = probability that a transaction contains {X, Y}
• i.e., the ratio of transactions in which X and Y occur together to all transactions in the database.
– Confidence = conditional probability that a transaction having X also contains Y
• i.e., the ratio of transactions in which X and Y occur together to those in which X occurs.
In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of the LHS:
Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
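To make these formulas concrete, here is a minimal Python sketch (not part of the original slides; the helper names are illustrative) that computes support and confidence directly from the market-basket transactions used throughout these slides. The two printed values match the s = 0.4 and c ≈ 0.67 reported for {Milk, Diaper} → {Beer} on the following slides.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence(LHS => RHS) = Support(LHS u RHS) / Support(LHS)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# The market-basket transactions shown on these slides
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```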
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X
• Example: {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Support and Confidence - Example

Transaction ID  Items Bought
1001            A, B, C
1002            A, C
1003            A, D
1004            B, E, F
1005            A, D, F

• Itemset {A, C} has a support of 2/5 = 40%
• Rule {A} ==> {C} has confidence of 50%
• Rule {C} ==> {A} has confidence of 100%
• Support for {A, C, E}?
• Support for {A, D, F}?
• Confidence for {A, D} ==> {F}?
• Confidence for {A} ==> {D, F}?

Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
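The four open questions above can be checked with the same kind of computation; the short sketch below (illustrative only, not from the original slides) evaluates them over the five transactions in the table.

```python
transactions = {
    1001: {"A", "B", "C"},
    1002: {"A", "C"},
    1003: {"A", "D"},
    1004: {"B", "E", "F"},
    1005: {"A", "D", "F"},
}

def support(itemset):
    return sum(1 for t in transactions.values() if set(itemset) <= t) / len(transactions)

def confidence(lhs, rhs):
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C", "E"}))        # 0.0  -> support 0%
print(support({"A", "D", "F"}))        # 0.2  -> support 20%
print(confidence({"A", "D"}, {"F"}))   # 0.5  -> confidence 50%
print(confidence({"A"}, {"D", "F"}))   # 0.25 -> confidence 25%
```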
Example
• Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
Mining Association Rules

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Drawback of Confidence

         Coffee  Not Coffee  Total
Tea        15        5         20
Not Tea    75        5         80
Total      90       10        100

Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
Although the confidence is high, the rule is misleading: P(Coffee | Not Tea) = 75/80 = 0.9375
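As a quick check of the numbers behind this slide, the small illustrative sketch below (not part of the original slides) recomputes the three probabilities directly from the contingency table.

```python
# Counts copied from the contingency table above
tea_and_coffee, tea_no_coffee = 15, 5
no_tea_and_coffee, no_tea_no_coffee = 75, 5
total = 100

p_coffee_given_tea = tea_and_coffee / (tea_and_coffee + tea_no_coffee)              # 0.75
p_coffee = (tea_and_coffee + no_tea_and_coffee) / total                             # 0.90
p_coffee_given_no_tea = no_tea_and_coffee / (no_tea_and_coffee + no_tea_no_coffee)  # 0.9375

# Tea drinkers are actually *less* likely to buy coffee than non-tea drinkers,
# even though the rule Tea -> Coffee has 75% confidence.
print(p_coffee_given_tea, p_coffee, p_coffee_given_no_tea)
```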
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
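As a rough illustration of the two-step approach, here is a naive Python sketch (not the course's algorithm; the names are illustrative). Step 1 enumerates and counts every possible candidate itemset, which is exactly why frequent itemset generation is computationally expensive and why smarter algorithms are needed.

```python
from itertools import chain, combinations

def frequent_itemsets(transactions, minsup):
    """Step 1 (brute force): test every possible itemset against minsup."""
    transactions = [set(t) for t in transactions]
    items = sorted(set(chain.from_iterable(transactions)))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= t) / n
            if sup >= minsup:
                frequent[frozenset(cand)] = sup
    return frequent

def generate_rules(frequent, minconf):
    """Step 2: split each frequent itemset into LHS -> RHS and keep high-confidence rules."""
    rules = []
    for itemset, sup in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(itemset, k):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]  # every subset of a frequent itemset is also frequent
                if conf >= minconf:
                    rules.append((set(lhs), set(itemset - lhs), sup, conf))
    return rules
```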
Transaction data representation
• A simplistic view of shopping baskets.
• Some important information not considered, e.g.:
– the quantity of each item purchased, and
– the price paid.
Many mining algorithms
• There are a large number of them!!
• They use different strategies and data structures.
• Their resulting sets of rules are all the same.
– Given a transaction data set T, a minimum support, and a minimum confidence, the set of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may be different.
• We study only one: the Apriori Algorithm
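For orientation before the Apriori material, here is a compact sketch of the level-wise idea behind Apriori (a simplified textbook-style formulation, not necessarily the exact pseudocode used in this course; it omits the subset-based candidate pruning step and re-scans the data for each candidate).

```python
def apriori(transactions, minsup):
    """Level-wise frequent itemset mining (simplified Apriori sketch)."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    frequent = {s: sup(s) for s in level}

    k = 2
    while level:
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Keep only candidates that meet minsup (a full Apriori would also prune
        # candidates with an infrequent (k-1)-subset before counting).
        level = {c for c in candidates if sup(c) >= minsup}
        frequent.update({c: sup(c) for c in level})
        k += 1
    return frequent

# The transaction data from the earlier example slide, minsup = 30%:
data = [
    ["Beef", "Chicken", "Milk"], ["Beef", "Cheese"], ["Cheese", "Boots"],
    ["Beef", "Chicken", "Cheese"], ["Beef", "Chicken", "Clothes", "Cheese", "Milk"],
    ["Chicken", "Clothes", "Milk"], ["Chicken", "Milk", "Clothes"],
]
print(apriori(data, 0.3))  # includes frozenset({'Chicken', 'Clothes', 'Milk'}) with support 3/7
```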