Data Mining
Chapter 5: Association Analysis: Basic Concepts
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

  TID | Items
  ----|---------------------------
  1   | Bread, Milk
  2   | Bread, Diaper, Beer, Eggs
  3   | Milk, Diaper, Beer, Coke
  4   | Bread, Milk, Diaper, Beer
  5   | Bread, Milk, Diaper, Coke

Example of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
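To make these definitions concrete, here is a minimal Python sketch (the function names are ours, not from the text) that computes support count, support, and confidence over the five market-basket transactions above:

```python
# Minimal sketch of the support and confidence calculations.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```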
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules (choose k items for the antecedent, then j of the remaining d−k items for the consequent):

  R = Σ_{k=1}^{d−1} [ C(d,k) × Σ_{j=1}^{d−k} C(d−k,j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.
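As a sanity check on the formula, a short Python sketch (ours, using only the standard library) enumerates the rule count directly and compares it against the closed form:

```python
from math import comb

def rule_count(d):
    """Count all association rules over d items: pick k antecedent items,
    then j consequent items from the remaining d - k."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(rule_count(d))          # 602
print(3**d - 2**(d+1) + 1)    # 602, the closed form
```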
Mining Association Rules

Example of rules from the market-basket data:
  {Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch below)

Frequent itemset generation is still computationally expensive.
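The rule-generation step follows directly from the observation above: enumerate every binary partition of a frequent itemset and keep the high-confidence rules. A hypothetical helper, reusing support_count and transactions from the earlier sketch, might look like:

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """Enumerate every binary partition X -> Y of a frequent itemset and
    keep the rules whose confidence meets minconf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):  # antecedent sizes 1 .. |itemset| - 1
        for X in map(frozenset, combinations(itemset, r)):
            Y = itemset - X
            c = support_count(itemset, transactions) / support_count(X, transactions)
            if c >= minconf:
                rules.append((set(X), set(Y), c))
    return rules

# All six rules from {Milk, Diaper, Beer} share support 0.4; confidence varies:
for X, Y, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, 0.0):
    print(X, "->", Y, f"c={c:.2f}")
```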
Frequent Itemset Generation

[Figure: the itemset lattice over items {A, B, C, D, E}, from the null set at the top through all 1-, 2-, 3-, and 4-itemsets down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw), for N transactions, M = 2^d candidates, and maximum transaction width w ⇒ expensive since M = 2^d!
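A brute-force sketch in Python makes the O(NMw) cost visible: every one of the M = 2^d − 1 nonempty itemsets in the lattice is matched against all N transactions (function names are illustrative, not from the book):

```python
from itertools import chain, combinations

def brute_force_frequent(transactions, minsup):
    """Enumerate every nonempty candidate itemset in the lattice and count
    its support with a full database scan: O(N * M * w) work overall."""
    items = sorted(set().union(*transactions))
    candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    frequent = {}
    for cand in map(frozenset, candidates):                  # M = 2^d - 1 candidates
        count = sum(1 for t in transactions if cand <= t)    # scan N transactions
        if count / len(transactions) >= minsup:
            frequent[cand] = count
    return frequent

print(brute_force_frequent(transactions, minsup=0.6))
```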
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
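The anti-monotone property is what makes level-wise pruning sound: a (k+1)-itemset can only be frequent if all of its k-subsets are. Below is a minimal Apriori-style sketch (ours, not the book's pseudocode) that builds each new level only from frequent itemsets and prunes any candidate with an infrequent subset:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset generation with the Apriori prune."""
    items = sorted(set().union(*transactions))
    frequent = {}
    level, k = [frozenset([i]) for i in items], 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(current)
        # Join step: union two frequent k-itemsets into a (k+1)-candidate,
        # then prune it unless every k-subset is frequent (anti-monotonicity).
        next_level = set()
        for a, b in combinations(current, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                    frozenset(s) in current for s in combinations(cand, k)):
                next_level.add(cand)
        level, k = sorted(next_level, key=sorted), k + 1
    return frequent

print(apriori(transactions, minsup_count=3))
```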
Illustrating Apriori Principle

[Figure: itemset lattice in which one node is found to be infrequent, so all of its supersets are pruned.]

Illustrating Apriori Principle

  TID | Items
  ----|---------------------------
  1   | Bread, Milk
  2   | Beer, Bread, Diaper, Eggs
  3   | Beer, Coke, Diaper, Milk
  4   | Beer, Bread, Diaper, Milk
  5   | Bread, Coke, Diaper, Milk

Minimum support count = 3

Items (1-itemsets):

  Item   | Count
  -------|------
  Bread  | 4
  Coke   | 2
  Milk   | 4
  Beer   | 3
  Diaper | 4
  Eggs   | 1

Coke and Eggs are infrequent, so there is no need to generate candidates involving Coke or Eggs.

Pairs (2-itemsets):

  Itemset         | Count
  ----------------|------
  {Bread, Milk}   | 3
  {Beer, Bread}   | 2
  {Bread, Diaper} | 3
  {Beer, Milk}    | 2
  {Diaper, Milk}  | 3
  {Beer, Diaper}  | 3

Triplets (3-itemsets), generated from the four frequent items:
  {Beer, Bread, Diaper}
  {Beer, Bread, Milk}
  {Beer, Diaper, Milk}
  {Bread, Diaper, Milk}

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 4 = 16 candidates.
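The candidate counts above can be checked with a couple of lines of Python (math.comb is the standard-library binomial coefficient):

```python
from math import comb

# Candidates up to size 3 if every subset of the 6 items is considered:
print(comb(6, 1) + comb(6, 2) + comb(6, 3))   # 6 + 15 + 20 = 41

# After dropping the infrequent items Coke and Eggs, only the 4 remaining
# items generate pair and triplet candidates:
print(6 + comb(4, 2) + comb(4, 3))            # 6 + 6 + 4 = 16
```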