Data Mining Techniques, CS 6220 - Section 2 - Spring 2017, Lecture 2. Jan-Willem van de Meent (credit: Tan et al., Leskovec et al.)
Frequent Itemsets & Association Rules (a.k.a. counting co-occurrences)
The Market-Basket Model

Input (transactions):
TID 1: Bread, Coke, Milk
TID 2: Beer, Bread
TID 3: Beer, Coke, Diaper, Milk
TID 4: Beer, Bread, Diaper, Milk
TID 5: Coke, Diaper, Milk

Output (rules discovered): {Milk} → {Coke}, {Diaper, Milk} → {Beer}

• Baskets = sets of purchases; Items = products
• Brick and mortar: track purchasing habits
  - Chain stores have TBs of transaction data
  - Tie-in "tricks", e.g., sale on diapers + raise price of beer
  - Need the rule to occur frequently, or no $$'s
• Online: "People who bought X also bought Y"

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Examples: Plagiarism, Side-Effects

• Baskets = sentences; Items = documents containing those sentences
  - Items that appear together too often could represent plagiarism
  - Notice that items do not have to be "in" baskets
• Baskets = patients; Items = drugs & side-effects
  - Has been used to detect combinations of drugs that result in particular side-effects
  - Requires an extension: the absence of an item needs to be observed as well as its presence

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: Voting Records

Association rules and their confidence:
• {budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {Republican}  (91.0%)
• {budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {Democrat}  (97.5%)
• {crime = yes, right-to-sue = yes, physician fee freeze = yes} → {Republican}  (93.5%)
• {crime = no, right-to-sue = no, physician fee freeze = no} → {Democrat}  (100%)

• Baskets = politicians; Items = party & votes
• Can extract the set of votes most associated with each party (or a faction within a party)

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Frequent Itemsets

• Simplest question: find sets of items that appear together "frequently" in baskets
• Support σ(X) for itemset X: the number of baskets containing all items in X
  - Often expressed as a fraction of the total number of baskets
• Given a support threshold σ_min, the itemsets X with σ(X) ≥ σ_min are called frequent itemsets

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: Frequent Itemsets

• Items = {milk, coke, pepsi, beer, juice}
• Baskets:
  B1 = {m, c, b}      B2 = {m, p, j}      B3 = {m, b}      B4 = {c, j}
  B5 = {m, c, b}      B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}
• Frequent itemsets (σ(X) ≥ 3):
  {m}:5, {c}:6, {b}:6, {j}:4, {m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3
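To make the example concrete, here is a minimal brute-force sketch in Python (not yet the Apriori algorithm): it enumerates every candidate itemset over the toy baskets and keeps those with support count ≥ 3. The names `baskets`, `support`, and `frequent` are illustrative choices, not notation from the slides.

```python
from itertools import combinations

# The eight toy baskets from this slide.
baskets = [{'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
           {'m', 'c', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'}]
items = set().union(*baskets)

def support(itemset, baskets):
    """Support count: number of baskets containing every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

# Enumerate all non-empty itemsets and keep the frequent ones.
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(sorted(items), k):
        count = support(set(candidate), baskets)
        if count >= 3:
            frequent[candidate] = count

print(frequent)  # {('b',): 6, ('c',): 6, ..., ('b', 'c', 'm'): 3}
```

This reproduces the nine frequent itemsets listed above, but at the cost of scoring every one of the 2^5 − 1 = 31 candidate sets; the Apriori algorithm later in the lecture avoids exactly this blow-up.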
Association Rules

• If-then rules about the contents of baskets
• {a1, a2, …, ak} → b means: "if a basket contains all of a1, …, ak then it is likely to contain b"
• In practice there are many rules; we want to find the significant/interesting ones!
• The confidence of an association rule is the probability of B = {b} given A = {a1, …, ak}:

  Support:     s(X → Y) = σ(X ∪ Y) / N
  Confidence:  c(X → Y) = σ(X ∪ Y) / σ(X)

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Interest of Association Rules

• Not all high-confidence rules are interesting
  - The rule A → milk may have high confidence simply because milk is purchased very often (independently of A)
• Interest Factor (or Lift) of a rule A → B:

  I(A, B) = s(A, B) / (s(A) × s(B)) = c(A → B) / s(B)

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Confidence and Interest

B1 = {m, c, b}      B2 = {m, p, j}      B3 = {m, b}      B4 = {c, j}
B5 = {m, c, b}      B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}

• Association rule: {m} → b
• Confidence = 4/5
• Interest Factor = (4/5) / (6/8) = 16/15 ≈ 1.07
• Item b appears in 6/8 of the baskets, so the rule's confidence is barely above b's baseline frequency
• Rule is not very interesting!

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
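As a quick numerical check of this slide, using the `support` helper and `baskets` list from the earlier brute-force sketch (illustrative code, not a library API):

```python
N = len(baskets)                                               # 8 baskets
conf = support({'m', 'b'}, baskets) / support({'m'}, baskets)  # 4/5 = 0.8
lift = conf / (support({'b'}, baskets) / N)                    # (4/5) / (6/8) = 16/15
print(conf, lift)  # 0.8 1.0666... -> lift is close to 1, so not very interesting
```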
Many Measures of Interest

For a rule A → B, let f11 be the number of baskets containing both A and B, f10 the number containing A but not B, f01 the number containing B but not A, and f00 the number containing neither; f1+, f0+, f+1, f+0 are the corresponding row and column sums of this 2×2 contingency table, and N the total number of baskets. Some measures from the literature:

• Goodman-Kruskal (λ): (Σj maxk fjk − maxk f+k) / (N − maxk f+k)
• Mutual Information (M): [Σi Σj (fij/N) log(N fij / (fi+ f+j))] / [−Σi (fi+/N) log(fi+/N)]
• J-Measure (J): (f11/N) log(N f11 / (f1+ f+1)) + (f10/N) log(N f10 / (f1+ f+0))
• Gini index (G): (f1+/N) × [(f11/f1+)² + (f10/f1+)²] − (f+1/N)² + (f0+/N) × [(f01/f0+)² + (f00/f0+)²] − (f+0/N)²
• Laplace (L): (f11 + 1) / (f1+ + 2)
• Conviction (V): (f1+ f+0) / (N f10)
• Certainty factor (F): (f11/f1+ − f+1/N) / (1 − f+1/N)
• Added Value (AV): f11/f1+ − f+1/N

adapted from: Tan, Steinbach & Kumar, "Introduction to Data Mining", http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
Mining Association Rules • Problem: Find all association rules with support ≥ s and confidence ≥ c • Note: Support of an association rule is the support of the set of items on the left side • Hard part: Finding the frequent itemsets! • If { i 1 , i 2 ,…, i k } → j has high support and confidence, then both { i 1 , i 2 ,…, i k } and { i 1 , i 2 ,…,i k , j } will be “frequent” adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
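Once the frequent itemsets and their counts are known, rules of the form {i1, …, ik} → j can be read off without any further passes over the data. A small sketch, assuming the `frequent` dict (itemset tuple → support count) from the brute-force example earlier; `rules` is an illustrative name:

```python
def rules(frequent, min_conf):
    """Yield (lhs, rhs, confidence) for single-consequent rules."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:  # try each item as the consequent j
            lhs = tuple(sorted(set(itemset) - {item}))
            conf = count / frequent[lhs]  # sigma(X U {j}) / sigma(X)
            if conf >= min_conf:
                yield lhs, item, conf

for lhs, rhs, conf in rules(frequent, 0.75):
    print(f"{set(lhs)} -> {rhs}  (confidence {conf:.2f})")
# e.g. {'c', 'm'} -> b has confidence 3/3 = 1.00
```

Note that `frequent[lhs]` is always defined: by the a-priori principle introduced below, every subset of a frequent itemset is itself frequent.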
Finding Frequent Itemsets

Given k products, how many possible itemsets are there?

[Figure: lattice of all itemsets over items {a, b, c, d, e}, from the empty set (null) at the top down to {a,b,c,d,e}]

adapted from: Tan, Steinbach & Kumar, "Introduction to Data Mining", http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
Finding Frequent Itemsets

Answer: 2^k − 1 non-empty itemsets → we cannot enumerate all possible sets for realistic k

[Figure: the same itemset lattice over {a, b, c, d, e}]

adapted from: Tan, Steinbach & Kumar, "Introduction to Data Mining", http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
Observation: A-priori Principle

Subsets of a frequent itemset are also frequent.

[Figure: itemset lattice with a frequent itemset and all of its subsets highlighted]
Corollary: Pruning of Candidates

If we know that a subset is not frequent, then we can ignore all of its supersets.

[Figure: itemset lattice with an infrequent itemset and all of its pruned supersets highlighted]
A-priori Algorithm

Frequent itemset generation of the Apriori algorithm (Algorithm 6.1):

  k = 1
  F_k = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }        {find all frequent 1-itemsets}
  repeat
      k = k + 1
      C_k = apriori-gen(F_{k−1})                    {generate candidate itemsets}
      for each transaction t ∈ T do
          C_t = subset(C_k, t)                      {identify all candidates contained in t}
          for each candidate itemset c ∈ C_t do
              σ(c) = σ(c) + 1                       {increment support count}
          end for
      end for
      F_k = { c | c ∈ C_k ∧ σ(c) ≥ N × minsup }     {extract the frequent k-itemsets}
  until F_k = ∅
  Result = ∪_k F_k

adapted from: Tan, Steinbach & Kumar, "Introduction to Data Mining", http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
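A compact Python rendering of this pseudocode, reusing the toy `baskets` list from earlier. This is a sketch under assumed names (`apriori`, `apriori_gen`), not a library implementation; frozensets stand in for itemsets so they can serve as dictionary keys.

```python
from collections import defaultdict
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Self-join F_{k-1} with itself, then prune candidates that
    have an infrequent (k-1)-subset (the a-priori principle)."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k:  # a and b differ by exactly one element
                candidates.add(union)
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}

def apriori(baskets, minsup_count):
    # F_1: all frequent 1-itemsets.
    counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            counts[frozenset([item])] += 1
    frequent = {s: n for s, n in counts.items() if n >= minsup_count}
    result, k = dict(frequent), 2
    while frequent:
        candidates = apriori_gen(set(frequent), k)
        counts = defaultdict(int)
        for basket in baskets:            # one pass over the data per level k
            for c in candidates:
                if c <= basket:           # candidate contained in this basket
                    counts[c] += 1
        frequent = {s: n for s, n in counts.items() if n >= minsup_count}
        result.update(frequent)           # Result = union of all F_k
        k += 1
    return result

print(apriori(baskets, 3))  # same nine itemsets as the brute-force sketch
```

Unlike the brute-force version, only candidates whose (k−1)-subsets all survived the previous level are ever counted against the data.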
Generating Candidates C_k

1. Self-joining: find pairs of sets in F_{k−1} that differ by exactly one element and take their union
2. Pruning: remove all candidates that have an infrequent (k−1)-subset
Example: Generating Candidates C_3

B1 = {m, c, b}      B2 = {m, p, j}      B3 = {m, b}      B4 = {c, j}
B5 = {m, c, b}      B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}

• Frequent itemsets of size 2: {m,b}:4, {m,c}:3, {c,b}:5, {c,j}:3
• Self-joining: {m,b,c}, {m,c,j}, {b,c,j}
• Pruning: {m,c,j} since {m,j} is not frequent, and {b,c,j} since {b,j} is not frequent
• Surviving candidate: {m,b,c}
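The `apriori_gen` sketch above reproduces this example (illustrative code, matching the slide's numbers):

```python
F2 = {frozenset(p) for p in [('m', 'b'), ('m', 'c'), ('c', 'b'), ('c', 'j')]}
print(apriori_gen(F2, 3))  # {frozenset({'m', 'c', 'b'})}: the joins {m,c,j}
                           # and {b,c,j} are pruned via their infrequent subsets
```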
Compacting the Output

• To reduce the number of itemsets we report, we can post-process them and only output:
  - Maximal frequent itemsets: no immediate superset is frequent
    (gives more pruning)
  - Closed itemsets: no immediate superset has the same count (> 0)
    (stores not only which itemsets are frequent, but also their exact counts)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: Maximal vs Closed

B1 = {m, c, b}      B2 = {m, p, j}      B3 = {m, b}      B4 = {c, j}
B5 = {m, c, b}      B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}

Frequent itemsets (σ(X) ≥ 3): {m}:5, {c}:6, {b}:6, {j}:4, {m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3
• Closed: all of the above except {m,c}, whose immediate superset {m,c,b} has the same count (3)
• Maximal: {c,j} and {m,c,b}, the only frequent itemsets with no frequent immediate superset
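A sketch that recovers this classification programmatically, assuming the `frequent` dict and the `support`/`baskets` helpers from the earlier sketches (`immediate_supersets` is an illustrative helper, not a standard API):

```python
def immediate_supersets(itemset, items):
    """All itemsets obtained by adding exactly one new item."""
    return [tuple(sorted(set(itemset) | {i})) for i in items if i not in itemset]

all_items = {'m', 'c', 'b', 'p', 'j'}
for itemset, count in sorted(frequent.items(), key=lambda kv: len(kv[0])):
    supersets = immediate_supersets(itemset, all_items)
    is_closed = all(support(set(s), baskets) < count for s in supersets)
    is_maximal = all(s not in frequent for s in supersets)
    tags = [t for t, flag in [('closed', is_closed), ('maximal', is_maximal)] if flag]
    print(set(itemset), count, ', '.join(tags))
# Every frequent itemset except {m,c} prints "closed";
# only {c,j} and {m,c,b} also print "maximal".
```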
Example: Maximal vs Closed

[Figure: Venn diagram of the nesting: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets]
Subset Matching

Given a transaction t = {1, 2, 3, 5, 6} (items are sorted), what are the possible subsets of size 3?

[Figure: prefix enumeration tree. Level 1 fixes the first item (1, 2, or 3), level 2 the second, and level 3 lists the ten complete 3-item subsets, from {1,2,3} through {3,5,6}]

adapted from: Tan, Steinbach & Kumar, "Introduction to Data Mining", http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
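The enumeration in the figure corresponds directly to lexicographic generation of combinations; in Python, for example:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)             # transaction with items sorted
for subset in combinations(t, 3):
    print(subset)
# (1, 2, 3), (1, 2, 5), (1, 2, 6), (1, 3, 5), ..., (3, 5, 6) -- ten in total
```

During support counting, Apriori checks which of these subsets are candidate itemsets (the `subset(C_k, t)` step in the pseudocode above, typically implemented with a hash tree rather than full enumeration).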