
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 - Lecture 2



  1. Data Mining Techniques CS 6220 - Section 2 - Spring 2017 - Lecture 2 Jan-Willem van de Meent (credit: Tan et al., Leskovec et al.)

  2. Frequent Itemsets & 
 Association Rules (a.k.a. counting co-occurrences)

  3. The Market-Basket Model
 Input (transactions):
 TID  Items
 1    Bread, Coke, Milk
 2    Beer, Bread
 3    Beer, Coke, Diaper, Milk
 4    Beer, Bread, Diaper, Milk
 5    Coke, Diaper, Milk
 Output (rules discovered):
 {Milk} --> {Coke}
 {Diaper, Milk} --> {Beer}
 • Baskets = sets of purchases; Items = products
 • Brick and mortar: Track purchasing habits
 • Chain stores have TBs of transaction data
 • Tie-in “tricks”, e.g., sale on diapers + raise price of beer
 • Need the rule to occur frequently, or no $$’s
 • Online: People who bought X also bought Y
 adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  4. Examples: Plagiarism, Side-Effects • Baskets = sentences; 
 Items = documents containing those sentences • Items that appear together too often 
 could represent plagiarism • Notice items do not have to be “in” baskets • Baskets = patients; 
 Items = drugs & side-effects • Has been used to detect combinations 
 of drugs that result in particular side-effects • Requires extension: Absence of an item 
 needs to be observed as well as presence adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  5. Example: Voting records
 • { budget resolution = no, MX-missile = no, aid to El Salvador = yes } → { Republican } (confidence 91.0%)
 • { budget resolution = yes, MX-missile = yes, aid to El Salvador = no } → { Democrat } (confidence 97.5%)
 • { crime = yes, right-to-sue = yes, physician fee freeze = yes } → { Republican } (confidence 93.5%)
 • { crime = no, right-to-sue = no, physician fee freeze = no } → { Democrat } (confidence 100%)
 • Baskets = politicians; Items = party & votes
 • Can extract the set of votes most associated with each party (or faction within a party)
 adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  6. Frequent Itemsets
 • Simplest question: Find sets of items that appear together “frequently” in baskets
 • Support σ(X) for itemset X: the number of baskets containing all items in X
 • (Often expressed as a fraction of the total number of baskets)
 • Given a support threshold σ_min, itemsets X with σ(X) ≥ σ_min are called frequent itemsets
 adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
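 A minimal counting sketch of this support definition (my own illustration in Python, assuming each basket is a set; the toy baskets B1–B8 from the next slide are used as data):

    def support(itemset, baskets):
        """Number of baskets that contain every item in `itemset`."""
        return sum(1 for basket in baskets if itemset <= basket)

    baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
               {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
    print(support({"m", "c"}, baskets))   # 3 -> frequent for sigma_min = 3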

  7. Example: Frequent Itemsets
 • Items = {milk, coke, pepsi, beer, juice}
 • Baskets: B1 = {m, c, b}, B2 = {m, p, j}, B3 = {m, b}, B4 = {c, j}, B5 = {m, c, b}, B6 = {m, c, b, j}, B7 = {c, b, j}, B8 = {b, c}
 • Frequent itemsets (σ(X) ≥ 3): {m}:5, {c}:6, {b}:6, {j}:4, {m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3

  8. Association Rules
 • If-then rules about the contents of baskets
 • { a1, a2, …, ak } → b means: “if a basket contains all of a1, …, ak then it is likely to contain b”
 • In practice there are many rules, want to find significant/interesting ones!
 • Confidence of this association rule is the probability of B = { b } given A = { a1, …, ak }
 • Support: s(X → Y) = σ(X ∪ Y) / N
 • Confidence: c(X → Y) = σ(X ∪ Y) / σ(X)
 adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
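 A small sketch of these two formulas (not from the slides; baskets as Python sets, reusing the toy baskets of slide 7):

    def rule_support_confidence(X, Y, baskets):
        """Support s(X -> Y) = sigma(X ∪ Y) / N and confidence c(X -> Y) = sigma(X ∪ Y) / sigma(X)."""
        n = len(baskets)
        sigma_xy = sum(1 for b in baskets if X | Y <= b)   # baskets containing all of X ∪ Y
        sigma_x = sum(1 for b in baskets if X <= b)        # baskets containing all of X
        return sigma_xy / n, sigma_xy / sigma_x

    baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
               {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
    print(rule_support_confidence({"m"}, {"b"}, baskets))  # (0.5, 0.8)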

  9. Interest of Association Rules
 • Not all high-confidence rules are interesting
 • The rule A → milk may have high confidence because milk is just purchased very often (independent of A)
 • Interest Factor (or Lift) of a rule A → B:
 I(A, B) = s(A, B) / ( s(A) × s(B) ) = c(A → B) / s(B)
 adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  10. Confidence and Interest
 B1 = {m, c, b}  B2 = {m, p, j}  B3 = {m, b}  B4 = {c, j}
 B5 = {m, c, b}  B6 = {m, c, b, j}  B7 = {c, b, j}  B8 = {b, c}
 • Association rule: {m} → b
 • Confidence = 4/5
 • Item b appears in 6/8 of the baskets, so s(b) = 6/8
 • Interest factor (lift) = (4/5) / (6/8) = 16/15 ≈ 1.07, barely above 1
 • Rule is not very interesting!
 adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
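 A quick check of these numbers using the lift definition from the previous slide (a sketch, not part of the original deck):

    baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
               {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
    n = len(baskets)
    s_mb = sum(1 for b in baskets if {"m", "b"} <= b) / n   # s({m, b}) = 4/8
    s_m  = sum(1 for b in baskets if "m" in b) / n          # s({m})    = 5/8
    s_b  = sum(1 for b in baskets if "b" in b) / n          # s({b})    = 6/8
    confidence = s_mb / s_m                                 # 0.8
    lift = confidence / s_b                                 # 16/15 ≈ 1.07, close to 1
    print(confidence, lift)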

  11. Many measures of interest
 For a rule A → B, write the 2×2 contingency counts as f11, f10, f01, f00, with row sums f1+, f0+, column sums f+1, f+0, and total N:
            B      not B
   A        f11    f10    | f1+
   not A    f01    f00    | f0+
            f+1    f+0    | N
 Measure (Symbol)         Definition
 Goodman-Kruskal (λ)      ( Σj maxk fjk − maxk f+k ) / ( N − maxk f+k )
 Mutual Information (M)   [ Σi Σj (fij / N) log( N fij / (fi+ f+j) ) ] / [ − Σi (fi+ / N) log( fi+ / N ) ]
 J-Measure (J)            (f11 / N) log( N f11 / (f1+ f+1) ) + (f10 / N) log( N f10 / (f1+ f+0) )
 Gini index (G)           (f1+ / N) [ (f11 / f1+)^2 + (f10 / f1+)^2 ] − (f+1 / N)^2 + (f0+ / N) [ (f01 / f0+)^2 + (f00 / f0+)^2 ] − (f+0 / N)^2
 Laplace (L)              (f11 + 1) / (f1+ + 2)
 Conviction (V)           (f1+ f+0) / (N f10)
 Certainty factor (F)     (f11 / f1+ − f+1 / N) / (1 − f+1 / N)
 Added Value (AV)         f11 / f1+ − f+1 / N
 adapted from : Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
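 A sketch computing a few of the simpler measures from the contingency counts (my own illustration, not from the slides; the counts f11=4, f10=1, f01=2, f00=1 correspond to the rule {m} → b over the toy baskets):

    import math

    def measures(f11, f10, f01, f00):
        """A few of the measures above, from the 2x2 contingency counts of a rule A -> B."""
        n = f11 + f10 + f01 + f00
        f1p, fp1, fp0 = f11 + f10, f11 + f01, f10 + f00
        return {
            "laplace":     (f11 + 1) / (f1p + 2),
            "conviction":  (f1p * fp0) / (n * f10) if f10 else math.inf,
            "certainty":   (f11 / f1p - fp1 / n) / (1 - fp1 / n),
            "added_value": f11 / f1p - fp1 / n,
        }

    print(measures(4, 1, 2, 1))   # {m} -> b: Laplace 5/7, conviction 1.25, certainty 0.2, added value 0.05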

  12. Mining Association Rules • Problem: Find all association rules with support ≥ s and confidence ≥ c • Note: Support of an association rule is the support of the set of items on the left side • Hard part: Finding the frequent itemsets! • If { i 1 , i 2 ,…, i k } → j has high support and confidence, then both { i 1 , i 2 ,…, i k } and 
 { i 1 , i 2 ,…,i k , j } will be “frequent” adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
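 Once the frequent itemsets and their support counts are known, generating the rules themselves is straightforward; a sketch (my own illustration, assuming a dict of support counts as in the running toy example):

    from itertools import combinations

    def rules_from_itemset(itemset, counts, min_conf):
        """Emit rules X -> Y with X ∪ Y = itemset and confidence >= min_conf."""
        rules = []
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = counts[itemset] / counts[lhs]   # sigma(X ∪ Y) / sigma(X)
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
        return rules

    counts = {frozenset(k): v for k, v in {
        ("m",): 5, ("c",): 6, ("b",): 6, ("m", "c"): 3, ("m", "b"): 4,
        ("c", "b"): 5, ("m", "c", "b"): 3}.items()}
    print(rules_from_itemset(frozenset({"m", "c", "b"}), counts, 0.75))
    # rules {m,b} -> {c} (confidence 0.75) and {m,c} -> {b} (confidence 1.0)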

  13. Finding Frequent Item Sets
 Given k products, how many possible item sets are there?
 [Figure: the itemset lattice over items {a, b, c, d, e}, from the empty set up to {a, b, c, d, e}]
 adapted from : Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

  14. Finding Frequent Item Sets
 Answer: 2^k − 1 non-empty item sets -> cannot enumerate all possible sets
 [Figure: the same itemset lattice over {a, b, c, d, e}]
 adapted from : Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

  15. Observation: A-priori Principle
 Subsets of a frequent item set are also frequent.
 [Figure: itemset lattice with a frequent itemset highlighted together with all of its subsets]

  16. Corollary: Pruning of Candidates
 If we know that a subset is not frequent, then we can ignore all of its supersets.
 [Figure: itemset lattice with an infrequent itemset marked and all of its supersets pruned]

  17. A-priori Algorithm
 Algorithm 6.1: Frequent itemset generation of the Apriori algorithm.
 1:  k = 1
 2:  F_k = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }   { Find all frequent 1-itemsets }
 3:  repeat
 4:    k = k + 1
 5:    C_k = apriori-gen(F_{k−1})   { Generate candidate itemsets }
 6:    for each transaction t ∈ T do
 7:      C_t = subset(C_k, t)   { Identify all candidates that belong to t }
 8:      for each candidate itemset c ∈ C_t do
 9:        σ(c) = σ(c) + 1   { Increment support count }
 10:     end for
 11:   end for
 12:   F_k = { c | c ∈ C_k ∧ σ(c) ≥ N × minsup }   { Extract the frequent k-itemsets }
 13: until F_k = ∅
 14: Result = ∪_k F_k
 adapted from : Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
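 A compact Python sketch of this level-wise loop (my own illustration, assuming baskets are sets; the candidate join below unions pairs of frequent (k−1)-itemsets rather than using the sorted prefix join, which yields the same candidates once the subset-pruning step is applied):

    from itertools import combinations

    def apriori(baskets, minsup_count):
        """Level-wise frequent itemset mining in the spirit of Algorithm 6.1."""
        # F_1: count single items and keep the frequent ones
        counts = {}
        for basket in baskets:
            for item in basket:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        F = {s: c for s, c in counts.items() if c >= minsup_count}
        result = dict(F)
        k = 1
        while F:
            k += 1
            # Candidate generation: join frequent (k-1)-itemsets, then prune
            # candidates that contain an infrequent (k-1)-subset
            prev = list(F)
            candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
            candidates = {c for c in candidates
                          if all(frozenset(s) in F for s in combinations(c, k - 1))}
            # Support counting: one pass over the transactions
            counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
            F = {s: n for s, n in counts.items() if n >= minsup_count}
            result.update(F)
        return result

    baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
               {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
    print(apriori(baskets, 3))   # the frequent itemsets from slide 7, with their counts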

  18. Generating Candidates C_k
 1. Self-joining: Merge pairs of frequent (k−1)-itemsets in F_{k−1} that agree on all but their last element (items kept in a fixed order)
 2. Pruning: Remove all candidates that have an infrequent (k−1)-subset

  19. Example: Generating Candidates C_k
 B1 = {m, c, b}  B2 = {m, p, j}  B3 = {m, b}  B4 = {c, j}
 B5 = {m, c, b}  B6 = {m, c, b, j}  B7 = {c, b, j}  B8 = {b, c}
 • Frequent itemsets of size 2: {m,b}:4, {m,c}:3, {c,b}:5, {c,j}:3
 • Self-joining: {m,c,b}, {c,b,j}
 • Pruning: {c,b,j} is removed since {b,j} is not frequent (a code sketch follows below)
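 A focused sketch of just the self-join and prune step, reproducing this example (the explicit item order m < c < b < j < p is my assumption to match the ordering used on the slide; any fixed order works):

    from itertools import combinations

    ORDER = {item: i for i, item in enumerate("mcbjp")}   # fixed item order (assumed)

    def canon(itemset):
        """Itemset as a tuple in the fixed item order."""
        return tuple(sorted(itemset, key=ORDER.get))

    def apriori_gen(freq_prev):
        """Self-join frequent (k-1)-itemsets sharing their first k-2 items,
        then prune candidates that have an infrequent (k-1)-subset."""
        prev = sorted((canon(s) for s in freq_prev), key=lambda t: [ORDER[i] for i in t])
        k = len(prev[0]) + 1
        joined = [a + (b[-1],) for a, b in combinations(prev, 2) if a[:-1] == b[:-1]]
        frequent = set(prev)
        return [c for c in joined
                if all(canon(s) in frequent for s in combinations(c, k - 1))]

    freq_2 = [{"m", "c"}, {"m", "b"}, {"c", "b"}, {"c", "j"}]
    print(apriori_gen(freq_2))   # join gives (m,c,b) and (c,b,j); (c,b,j) is pruned -> [('m', 'c', 'b')]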

  20. Compacting the Output
 • To reduce the number of frequent itemsets reported, we can post-process them and only output:
 • Maximal frequent itemsets: no immediate superset is frequent (gives more pruning)
 • Closed itemsets: no immediate superset has the same count (> 0) (stores not only which itemsets are frequent, but also their exact counts)
 adapted from : J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
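 A sketch of both definitions as code (my own illustration; for a frequent itemset, any immediate superset with the same count is necessarily frequent too, so it suffices to check supersets within the frequent collection):

    def maximal_and_closed(freq_counts):
        """Given {frozenset: support count} for all frequent itemsets,
        return the maximal frequent itemsets and the closed (frequent) itemsets."""
        maximal, closed = [], []
        for s, count in freq_counts.items():
            supersets = [t for t in freq_counts if len(t) == len(s) + 1 and s < t]
            if not supersets:                                    # no frequent immediate superset
                maximal.append(s)
            if all(freq_counts[t] != count for t in supersets):  # no superset with the same count
                closed.append(s)
        return maximal, closed

    freq = {frozenset(k): v for k, v in {("m",): 5, ("c",): 6, ("b",): 6, ("j",): 4,
            ("m", "c"): 3, ("m", "b"): 4, ("c", "b"): 5, ("c", "j"): 3, ("m", "c", "b"): 3}.items()}
    print(maximal_and_closed(freq))   # maximal: {m,c,b} and {c,j}; closed: all except {m,c}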

  21. Example: Maximal vs Closed
 B1 = {m, c, b}  B2 = {m, p, j}  B3 = {m, b}  B4 = {c, j}
 B5 = {m, c, b}  B6 = {m, c, b, j}  B7 = {c, b, j}  B8 = {b, c}
 Frequent itemsets (σ ≥ 3): {m}:5, {c}:6, {b}:6, {j}:4, {m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3
 • Closed: all of the above except {m,c}, whose immediate superset {m,c,b} has the same count (3)
 • Maximal: {m,c,b} and {c,j} (neither has a frequent immediate superset)

  22. Example: Maximal vs Closed
 [Figure: itemset lattice with nested regions labelled Frequent Itemsets ⊇ Closed Frequent Itemsets ⊇ Maximal Frequent Itemsets]

  23. Subset Matching
 Given a transaction t = {1, 2, 3, 5, 6} (items are sorted), what are the possible subsets of size 3?
 [Figure: level-wise enumeration tree of all size-3 subsets of t, branching on the first, second, and third item]
 adapted from : Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
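 The level-wise enumeration in the figure is what itertools.combinations produces; a minimal sketch of the size-3 subsets that the subset(C_k, t) step of Algorithm 6.1 would have to match against the candidates for k = 3:

    from itertools import combinations

    t = [1, 2, 3, 5, 6]               # transaction, items sorted
    print(list(combinations(t, 3)))   # all C(5, 3) = 10 subsets of size 3, in order:
    # (1,2,3) (1,2,5) (1,2,6) (1,3,5) (1,3,6) (1,5,6) (2,3,5) (2,3,6) (2,5,6) (3,5,6)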
