  1. A Probabilistic Approach to Association Rule Mining CSE Colloquium Department of Computer Science and Engineering Southern Methodist University Dr. Michael Hahsler Marketing Research and e-Business Adviser Hall Financial Group, Frisco, Texas, U.S.A. Dallas, October 10, 2008.

  2. Outline
1. Motivation
2. Introduction to Association Rules
   • Support-confidence framework
3. Probabilistic Interpretation, Weaknesses and Enhancements
   • Probabilistic Interpretation of Support and Confidence
   • Weaknesses of the Support-confidence Framework
   • Lift and Chi-Square Test for Independence
4. Probabilistic Model
   • Independence Model
   • Applications
     – Comparison of Simulated and Real World Data
     – NB-Frequent Itemsets
     – Hyper-Confidence
5. Conclusion

  3. Motivation

  4. Motivation The amount of collected data is constantly growing. For example:
• Transaction data: retailers (point-of-sale systems, loyalty card programs) and e-commerce
• Web navigation data: web analytics, search engines, digital libraries, wikis, etc.
• Gene expression data: DNA microarrays
Typical sizes of data sets:
• Typical retailer: 10–500 product groups and 500–10,000 products
• Amazon: approx. 3 million books/CDs (1998)
• Wikipedia: approx. 2.5 million articles (2008)
• Google: approx. 8 billion pages (est. 70% of the web) in index (2005)
• Human Genome Project: approx. 20,000–25,000 genes in human DNA with 3 billion chemical base pairs
• Typically 10,000–10 million transactions (shopping baskets, user sessions, observations, etc.)

  5. Motivation The aim of association analysis is to find ‘interesting’ relationships between items (products, documents, etc.).
Example (‘purchase relationship’): milk, flour and eggs are frequently bought together; or: if someone purchases milk and flour, then the person often also purchases eggs.
Applications of the discovered relationships:
• Retail: product placement, promotion campaigns, product assortment decisions, etc. → exploratory market basket analysis (Russell et al., 1997; Berry and Linoff, 1997; Schnedlitz et al., 2001; Reutterer et al., 2007).
• E-commerce, digital libraries, search engines: personalization, mass customization → recommender systems, item-based collaborative filtering (Sarwar et al., 2001; Linden et al., 2003; Geyer-Schulz and Hahsler, 2003).

  6. Motivation Problem: For k items (products) we have 2^k − k − 1 possible relationships between items.
Example: power set for k = 4 items (represented as a lattice):
{beer, eggs, flour, milk}
{beer, eggs, flour}  {beer, eggs, milk}  {beer, flour, milk}  {eggs, flour, milk}
{beer, eggs}  {beer, flour}  {beer, milk}  {eggs, flour}  {eggs, milk}  {flour, milk}
{beer}  {eggs}  {flour}  {milk}
{}
For k = 100 the number of possible relationships exceeds 10^30! → Data mining: find frequent itemsets and association rules.
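The count is easy to verify directly. A small illustrative sketch (not part of the slides; the helper name is my own):

```python
# Number of possible relationships among k items: every subset of the
# item set with at least two items, i.e. 2^k - k - 1 (all subsets minus
# the empty set and the k singletons).
def num_relationships(k: int) -> int:
    return 2**k - k - 1

print(num_relationships(4))              # 11 non-trivial subsets, as in the lattice above
print(num_relationships(100) > 10**30)   # True: the search space explodes for k = 100
```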

  7. Introduction to Association Rules

  8. Transaction Data Definition: Let I = {i_1, i_2, ..., i_k} be a set of items. Let D = {Tr_1, Tr_2, ..., Tr_n} be a set of transactions called a database. Each transaction in D contains a subset of I and has a unique transaction identifier.
Represented as a binary purchase incidence matrix:

Transaction ID  beer  eggs  flour  milk
            1      0     1      1     1
            2      1     1      1     0
            3      0     1      0     1
            4      0     1      1     1
            5      0     0      0     1

  9. Association Rules A rule takes the form X → Y with X, Y ⊆ I and X ∩ Y = ∅. X and Y are called itemsets. X is the rule’s antecedent (left-hand side) and Y is the rule’s consequent (right-hand side).
To select ‘interesting’ association rules from the set of all possible rules, two measures are used (Agrawal et al., 1993):
1. Support of an itemset Z is defined as supp(Z) = n_Z / n → the share of transactions in the database that contain Z.
2. Confidence of a rule X → Y is defined as conf(X → Y) = supp(X ∪ Y) / supp(X) → the share of transactions containing Y among all transactions containing X.
Each association rule X → Y has to satisfy the following restrictions:
supp(X ∪ Y) ≥ σ and conf(X → Y) ≥ γ
→ called the support-confidence framework.
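The two measures can be computed directly from the example database of the transaction-data slide. A minimal sketch (my own illustration; function names `supp` and `conf` mirror the notation):

```python
# Toy database from the slides, as Python sets (one set per transaction).
DB = [
    {"eggs", "flour", "milk"},   # Tr1
    {"beer", "eggs", "flour"},   # Tr2
    {"eggs", "milk"},            # Tr3
    {"eggs", "flour", "milk"},   # Tr4
    {"milk"},                    # Tr5
]

def supp(itemset, db):
    """supp(Z) = n_Z / n: share of transactions containing every item of Z."""
    return sum(itemset <= tr for tr in db) / len(db)

def conf(X, Y, db):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return supp(X | Y, db) / supp(X, db)

print(supp({"eggs", "flour"}, DB))     # 0.6
print(conf({"flour"}, {"eggs"}, DB))   # 1.0
```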

  10. Minimum Support Idea: Set a user-defined threshold for support, since more frequent itemsets are typically more important. E.g., frequently purchased products generally generate more revenue. Apriori property (Agrawal and Srikant, 1994): The support of an itemset cannot increase by adding an item. Example: σ = .4 (support count ≥ 2):

Transaction ID  beer  eggs  flour  milk
            1      0     1      1     1
            2      1     1      1     0
            3      0     1      0     1
            4      0     1      1     1
            5      0     0      0     1

Support counts:
{beer, eggs, flour, milk} 0
{beer, eggs, flour} 1   {beer, eggs, milk} 0   {beer, flour, milk} 0   {eggs, flour, milk} 2
{beer, eggs} 1   {beer, flour} 1   {beer, milk} 0   {eggs, flour} 3   {eggs, milk} 2   {flour, milk} 2
{beer} 1   {eggs} 4   {flour} 3   {milk} 4

The itemsets meeting the support threshold are the ‘frequent itemsets’ → basis for efficient algorithms (Apriori, Eclat).
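The level-wise search that the Apriori property enables can be sketched in a few lines. This is a simplified illustration of the idea, not the optimized algorithm; all names are my own:

```python
from itertools import combinations

# Toy database from the slides.
DB = [
    {"eggs", "flour", "milk"},   # Tr1
    {"beer", "eggs", "flour"},   # Tr2
    {"eggs", "milk"},            # Tr3
    {"eggs", "flour", "milk"},   # Tr4
    {"milk"},                    # Tr5
]

def apriori(db, min_count):
    """Level-wise search for itemsets with support count >= min_count.

    Returns a dict mapping frozenset -> support count."""
    count = lambda s: sum(s <= tr for tr in db)
    items = sorted({i for tr in db for i in tr})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        level = [s for s in level if count(s) >= min_count]
        frequent.update({s: count(s) for s in level})
        # Join step: combine frequent k-itemsets into (k+1)-candidates ...
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # ... and prune candidates with an infrequent subset: by the Apriori
        # property, a superset of an infrequent itemset cannot be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent
                        for s in combinations(c, len(c) - 1))]
    return frequent

fi = apriori(DB, 2)   # sigma = .4 on 5 transactions -> support count >= 2
```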

  11. Minimum Confidence From the set of frequent itemsets, all rules which satisfy the threshold for confidence, conf(X → Y) = supp(X ∪ Y) / supp(X) ≥ γ, are generated.

Rule                        Confidence
{eggs} → {flour}            3/4 = 0.75
{flour} → {eggs}            3/3 = 1
{eggs} → {milk}             2/4 = 0.5
{milk} → {eggs}             2/4 = 0.5
{flour} → {milk}            2/3 = 0.67
{milk} → {flour}            2/4 = 0.5
{eggs, flour} → {milk}      2/3 = 0.67
{eggs, milk} → {flour}      2/2 = 1
{flour, milk} → {eggs}      2/2 = 1
{eggs} → {flour, milk}      2/4 = 0.5
{flour} → {eggs, milk}      2/3 = 0.67
{milk} → {eggs, flour}      2/4 = 0.5

At γ = 0.7 the following set of rules is generated:

Rule                        Support      Confidence
{eggs} → {flour}            3/5 = 0.6    3/4 = 0.75
{flour} → {eggs}            3/5 = 0.6    3/3 = 1
{eggs, milk} → {flour}      2/5 = 0.4    2/2 = 1
{flour, milk} → {eggs}      2/5 = 0.4    2/2 = 1
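Rule generation from frequent itemsets can be sketched as follows, using the support counts listed on the frequent-itemset slide (an illustration of mine, not code from the talk):

```python
from itertools import combinations

# Support counts of the frequent itemsets from the slides (n = 5 transactions).
supports = {
    frozenset({"eggs"}): 4, frozenset({"flour"}): 3, frozenset({"milk"}): 4,
    frozenset({"eggs", "flour"}): 3, frozenset({"eggs", "milk"}): 2,
    frozenset({"flour", "milk"}): 2, frozenset({"eggs", "flour", "milk"}): 2,
}

def gen_rules(supports, n, gamma):
    """Split each frequent itemset Z into X -> (Z \\ X); keep conf >= gamma.

    Returns tuples (antecedent, consequent, support, confidence)."""
    out = []
    for Z, n_Z in supports.items():
        for r in range(1, len(Z)):
            for X in map(frozenset, combinations(Z, r)):
                conf = n_Z / supports[X]   # supp(Z) / supp(X)
                if conf >= gamma:
                    out.append((X, Z - X, n_Z / n, conf))
    return out

rules = gen_rules(supports, 5, 0.7)   # yields the four rules on the slide
```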

  12. Probabilistic Interpretation, Weaknesses and Enhancements

  13. Probabilistic Interpretation of Support and Confidence
• Support supp(Z) = n_Z / n corresponds to an estimate of P(E_Z), the probability of the event that itemset Z is contained in a transaction.
• Confidence can be interpreted as an estimate of the conditional probability P(E_Y | E_X) = P(E_X ∩ E_Y) / P(E_X).
This follows directly from the definition of confidence:
conf(X → Y) = supp(X ∪ Y) / supp(X) = (n_{X∪Y} / n) / (n_X / n) = n_{X∪Y} / n_X

  14. Weaknesses of Support and Confidence
• Support suffers from the ‘rare item problem’ (Liu et al., 1999a): infrequent items not meeting minimum support are ignored, which is problematic if rare items are important, e.g., rarely sold products which account for a large part of revenue or profit.
[Figure: typical support distribution for retail point-of-sale data with 169 items — histogram of support (x-axis, 0.00–0.25) against number of items (y-axis).]
• Support falls rapidly with itemset size. A threshold on support favors short itemsets (Seno and Karypis, 2005).

  15. Weaknesses of Support and Confidence
• Confidence ignores the frequency of Y (Aggarwal and Yu, 1998; Silverstein et al., 1998).

        X=0   X=1     Σ
Y=0       5     5    10
Y=1      70    20    90
Σ        75    25   100

conf(X → Y) = n_{X∪Y} / n_X = 20/25 = .8 = P̂(E_Y | E_X)
The confidence of the rule is relatively high, but the unconditional probability P̂(E_Y) = n_Y / n = 90/100 = .9 is even higher!
• The thresholds for support and confidence are user-defined. In practice, the values are chosen to produce a ‘manageable’ number of frequent itemsets or rules. → What are the risk and cost attached to using spurious rules in an application?
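The point can be reproduced from the raw counts of the contingency table (a small sketch of mine; the variable names are illustrative):

```python
# Counts from the 2x2 contingency table above.
n, n_X, n_Y, n_XY = 100, 25, 90, 20

conf = n_XY / n_X   # estimate of P(E_Y | E_X): confidence of X -> Y
p_Y = n_Y / n       # unconditional estimate of P(E_Y)

# 0.8 vs 0.9: the rule looks strong, yet Y is even more frequent overall,
# so X actually makes Y *less* likely.
print(conf, p_Y)
```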

  16. Lift The measure lift (also called interest; Brin et al., 1997) is defined as
lift(X → Y) = conf(X → Y) / supp(Y) = supp(X ∪ Y) / (supp(X) · supp(Y))
and can be interpreted as an estimate of P(E_X ∩ E_Y) / (P(E_X) · P(E_Y)).
→ A measure of the deviation from stochastic independence, P(E_X ∩ E_Y) = P(E_X) · P(E_Y).
In marketing, values of lift are interpreted as follows (Betancourt and Gautschi, 1990; Hruschka et al., 1999):
• lift(X → Y) = 1 ... X and Y are independent
• lift(X → Y) > 1 ... complementary effects between X and Y
• lift(X → Y) < 1 ... substitution effects between X and Y
Example:

        X=0   X=1     Σ
Y=0       5     5    10
Y=1      70    20    90
Σ        75    25   100

lift(X → Y) = .2 / (.25 · .9) = .89
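Computed from raw counts, the example works out as follows (my own sketch; the helper name is illustrative):

```python
def lift(n_X, n_Y, n_XY, n):
    """lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y)), from raw counts."""
    return (n_XY / n) / ((n_X / n) * (n_Y / n))

# Counts from the contingency table above: lift < 1 suggests substitution.
print(round(lift(25, 90, 20, 100), 2))   # 0.89
```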

  17. Chi-Square Test for Independence Tests for significant deviations from stochastic independence (Silverstein et al., 1998; Liu et al., 1999b).
Example: 2 × 2 contingency table (l = 2 dimensions) for rule X → Y:

        X=0   X=1     Σ
Y=0       5     5    10
Y=1      70    20    90
Σ        75    25   100

Null hypothesis: P(E_X ∩ E_Y) = P(E_X) · P(E_Y)
The test statistic
X² = Σ_i Σ_j (n_ij − E(n_ij))² / E(n_ij)   with   E(n_ij) = n_i· · n_·j / n
asymptotically approaches a χ² distribution with 2^l − l − 1 degrees of freedom.
The result of the test for the contingency table above: X² = 3.7037, df = 1, p-value = 0.05429 → The null hypothesis (independence) cannot be rejected at α = 0.05.
The test can also be used to check independence between all l items in an itemset, using an l-dimensional contingency table.
Weaknesses: bad approximation for E(n_ij) < 5; multiple testing.
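The test statistic for the 2 × 2 case can be computed directly from the table (an illustrative sketch of mine; the p-value on the slide additionally requires the χ² distribution):

```python
def chi_square(table):
    """Pearson X^2 statistic for a contingency table given as nested lists."""
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    x2 = 0.0
    for i, r in enumerate(table):
        for j, n_ij in enumerate(r):
            e_ij = row_tot[i] * col_tot[j] / n   # expected count under independence
            x2 += (n_ij - e_ij) ** 2 / e_ij
    return x2

# Rows: Y=0, Y=1; columns: X=0, X=1 (the table from the slide).
x2 = chi_square([[5, 5], [70, 20]])
print(round(x2, 4))   # 3.7037, as on the slide
```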

  18. Probabilistic Model
