Introduction to Data Mining: Frequent Pattern Mining and Association Analysis


  1. Introduction to Data Mining: Frequent Pattern Mining and Association Analysis. Li Xiong. Slide credits: Jiawei Han and Micheline Kamber; George Kollios.

  2. Mining Frequent Patterns, Association and Correlations
     - Basic concepts
     - Frequent itemset mining methods
     - Mining association rules
     - Association mining to correlation analysis
     - Constraint-based association mining

  3. What Is Frequent Pattern Analysis?
     - Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
       - Frequent sequential pattern
       - Frequent structured pattern
     - Motivation: finding inherent regularities in data
       - What products were often purchased together? Beer and diapers?!
       - What are the subsequent purchases after buying a PC?
       - What kinds of DNA are sensitive to this new drug?
     - Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

  4. Frequent Itemset Mining
     - Frequent itemset mining: finding frequent sets of items in a transaction data set
     - Agrawal, Imielinski, and Swami, SIGMOD 1993
     - SIGMOD Test of Time Award 2003: "This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community ... even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper."
     - R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93.
     - Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.

  5. Basic Concepts: Transaction Dataset

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  6. Record Data
     - Data that consists of a collection of records, each of which consists of a fixed set of attributes
     - Points in a multi-dimensional space, where each dimension represents a distinct attribute
     - Represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

     Tid | Refund | Marital Status | Taxable Income | Cheat
     ----+--------+----------------+----------------+------
     1   | Yes    | Single         | 125K           | No
     2   | No     | Married        | 100K           | No
     3   | No     | Single         | 70K            | No
     4   | Yes    | Married        | 120K           | No
     5   | No     | Divorced       | 95K            | Yes
     6   | No     | Married        | 60K            | No
     7   | Yes    | Divorced       | 220K           | No
     8   | No     | Single         | 85K            | Yes
     9   | No     | Married        | 75K            | No
     10  | No     | Single         | 90K            | Yes
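As a small illustrative sketch (not part of the slides), such record data can be held as an m-by-n table in Python, one row per object and one column per attribute; the rows below copy the first entries of the table above.

    # A minimal sketch: record data as an m-by-n structure.
    attributes = ["Tid", "Refund", "Marital Status", "Taxable Income", "Cheat"]
    records = [
        [1, "Yes", "Single",  125_000, "No"],
        [2, "No",  "Married", 100_000, "No"],
        [3, "No",  "Single",   70_000, "No"],   # ... remaining rows omitted
    ]
    m, n = len(records), len(attributes)         # m objects, n attributes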

  7. Transaction Data
     - A special type of record data, where each record (transaction) involves a set of items.
     - For example, the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

     TID | Items
     ----+--------------------------
     1   | Bread, Coke, Milk
     2   | Beer, Bread
     3   | Beer, Coke, Diaper, Milk
     4   | Beer, Bread, Diaper, Milk
     5   | Coke, Diaper, Milk
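A minimal sketch of the same idea in Python (data copied from the table above): each transaction is a set of items keyed by its TID, and an itemset's support count is the number of transactions that contain it.

    # Transaction data as a mapping from TID to a set of items.
    transactions = {
        1: {"Bread", "Coke", "Milk"},
        2: {"Beer", "Bread"},
        3: {"Beer", "Coke", "Diaper", "Milk"},
        4: {"Beer", "Bread", "Diaper", "Milk"},
        5: {"Coke", "Diaper", "Milk"},
    }
    # Support count of {Diaper, Milk} = number of transactions containing both.
    print(sum(1 for t in transactions.values() if {"Diaper", "Milk"} <= t))   # 3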

  8. Document Data
     - Each document becomes a 'term' vector:
       - each term is a component (attribute) of the vector,
       - the value of each component is the number of times the corresponding term occurs in the document.

     Document   | team | coach | play | ball | score | game | win | lost | timeout | season
     -----------+------+-------+------+------+-------+------+-----+------+---------+-------
     Document 1 |  3   |  0    |  5   |  0   |  2    |  6   |  0  |  2   |  0      |  2
     Document 2 |  0   |  7    |  0   |  2   |  1    |  0   |  0  |  3   |  0      |  0
     Document 3 |  0   |  1    |  0   |  0   |  1    |  2   |  2  |  0   |  3      |  0
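As an illustrative sketch (the sample sentence below is made up, not from the slides), a document can be turned into such a term vector by counting how often each vocabulary term occurs in it.

    from collections import Counter

    # Count occurrences of each vocabulary term in one (made-up) document.
    vocabulary = ["team", "coach", "play", "ball", "score", "game",
                  "win", "lost", "timeout", "season"]
    doc = "the team lost the game after a timeout late in the season"
    counts = Counter(doc.lower().split())
    term_vector = [counts.get(term, 0) for term in vocabulary]
    print(term_vector)   # [1, 0, 0, 0, 0, 1, 0, 1, 1, 1]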

  9. Basic Concepts: Transaction Dataset

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  10. Basic Concepts: Frequent Patterns and Association Rules
     - Itemset: X = {x1, ..., xk} (k-itemset)

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  11. Basic Concepts: Frequent Patterns and Association Rules
     - Itemset: X = {x1, ..., xk} (k-itemset)
     - Frequent itemset: X with minimum support count
     - Support count (absolute support): count of transactions containing X

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  12. Basic Concepts: Frequent Patterns and Association Rules
     - Itemset: X = {x1, ..., xk} (k-itemset)
     - Frequent itemset: X with minimum support count
     - Support count (absolute support): count of transactions containing X
     - Association rule: A => B with minimum support and confidence
       - Support: probability that a transaction contains both A and B, s = P(A ∪ B)
       - Confidence: conditional probability that a transaction having A also contains B, c = P(B | A)

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

     [Figure: two overlapping circles, "Customer buys beer" and "Customer buys diaper"; the intersection "Customer buys both" corresponds to transactions containing A ∪ B.]

  13. Illustration of Frequent Itemsets and Association Rules

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

     - Frequent itemsets (minimum support count = 3)?
     - Association rules (minimum support = 50%, minimum confidence = 50%)?

  14. Illustration of Frequent Itemsets and Association Rules

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

     - Frequent itemsets (minimum support count = 3): {A:3, B:3, D:4, E:3, AD:3}
     - Association rules (minimum support = 50%, minimum confidence = 50%):
       A => D (support 60%, confidence 100%)
       D => A (support 60%, confidence 75%)
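The numbers above can be checked with a small Python sketch (toy data copied from the table; only the rule A => D is shown here).

    # Verify support and confidence of the rule A => D on the toy dataset.
    transactions = [
        {"A", "B", "D"},            # 10
        {"A", "C", "D"},            # 20
        {"A", "D", "E"},            # 30
        {"B", "E", "F"},            # 40
        {"B", "C", "D", "E", "F"},  # 50
    ]
    n = len(transactions)
    count_A  = sum(1 for t in transactions if {"A"} <= t)        # 3
    count_AD = sum(1 for t in transactions if {"A", "D"} <= t)   # 3
    support    = count_AD / n          # 3/5 = 60%
    confidence = count_AD / count_A    # 3/3 = 100%
    print(f"A => D: support={support:.0%}, confidence={confidence:.0%}")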

  15. Mining Frequent Patterns, Association and Correlations
     - Basic concepts
     - Frequent itemset mining methods
     - Mining association rules
     - Association mining to correlation analysis
     - Constraint-based association mining

  16. Scalable Methods for Mining Frequent Patterns
     - Frequent itemset mining methods
       - Apriori (Agrawal & Srikant @VLDB'94) and variations
       - Frequent pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD'00)
     - Closed and maximal patterns and their mining methods
     - FIMI Workshop and implementation repository

  17. Frequent Itemset Mining
     - Brute force approach
     - Itemset: X = {x1, ..., xk} (k-itemset)
     - Frequent itemset: X with minimum support count

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  18. Frequent Itemset Mining
     - Brute force approach:
       - Set enumeration tree for all possible itemsets
       - Tree search (a brute-force counting sketch follows below)

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F
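A minimal brute-force sketch (illustrative, not the slide's tree-search implementation): enumerate every possible itemset over the observed items and count its support. The cost is exponential in the number of distinct items, which is what motivates Apriori's pruning.

    from itertools import combinations

    transactions = [
        {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
        {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
    ]
    min_count = 3
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):          # all possible k-itemsets
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                frequent[candidate] = count
    print(frequent)
    # {('A',): 3, ('B',): 3, ('D',): 4, ('E',): 3, ('A', 'D'): 3}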

  19. Apriori: the Apriori Property
     - The Apriori property of frequent patterns: any nonempty subset of a frequent itemset must be frequent
       - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
     - Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested! (see the sketch below)
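The pruning test amounts to a one-line check in Python; the frequent 2-itemsets below are assumed purely for illustration.

    from itertools import combinations

    # Prune a candidate without scanning the DB if any (k-1)-subset is not frequent.
    def has_infrequent_subset(candidate, frequent_prev):
        """candidate: sorted tuple of items; frequent_prev: set of frequent (k-1)-itemsets."""
        k = len(candidate)
        return any(sub not in frequent_prev for sub in combinations(candidate, k - 1))

    # Assumed: {beer, diaper} turned out infrequent, so it is absent from L2.
    L2 = {("diaper", "nuts"), ("beer", "nuts")}
    print(has_infrequent_subset(("beer", "diaper", "nuts"), L2))   # True -> prune the 3-itemset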

  20. Apriori: Level-Wise Search Method
     - Level-wise search method (BFS):
       - Initially, scan the DB once to get the frequent 1-itemsets
       - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
       - Test the candidates against the DB
       - Terminate when no frequent or candidate set can be generated

  21. The Apriori Algorithm
     Pseudo-code (Ck: candidate k-itemsets, Lk: frequent k-itemsets):

       L1 = frequent 1-itemsets;
       for (k = 2; Lk-1 != ∅; k++) {
           Ck = generate candidate set from Lk-1;
           for each transaction t in database
               find all candidates in Ck that are subsets of t and increment their count;
           Lk = candidates in Ck with min_support;
       }
       return ∪k Lk;
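A minimal Python sketch of this pseudo-code (candidate generation follows the self-join/prune scheme detailed on slide 24). Run on the earlier A-F toy dataset with a minimum support count of 3, it reproduces the frequent itemsets from slide 14.

    from itertools import combinations
    from collections import defaultdict

    def generate_candidates(prev_frequent, k):
        """Self-join sorted (k-1)-itemsets on their first k-2 items, then prune."""
        prev = set(prev_frequent)
        candidates = set()
        for a in prev_frequent:
            for b in prev_frequent:
                if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                    cand = a + (b[k - 2],)
                    # Apriori pruning: every (k-1)-subset must already be frequent
                    if all(sub in prev for sub in combinations(cand, k - 1)):
                        candidates.add(cand)
        return candidates

    def apriori(transactions, min_count):
        counts = defaultdict(int)
        for t in transactions:                     # first scan: frequent 1-itemsets
            for item in t:
                counts[(item,)] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_count}
        frequent = dict(Lk)
        k = 2
        while Lk:
            Ck = generate_candidates(sorted(Lk), k)
            counts = defaultdict(int)
            for t in transactions:                 # one DB scan per level
                for cand in Ck:
                    if set(cand) <= t:
                        counts[cand] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_count}
            frequent.update(Lk)
            k += 1
        return frequent

    transactions = [
        {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
        {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
    ]
    print(sorted(apriori(transactions, min_count=3).items()))
    # [(('A',), 3), (('A', 'D'), 3), (('B',), 3), (('D',), 4), (('E',), 3)]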

  22. The Apriori Algorithm: An Example (Sup_min = 2)

     Transaction DB:
     Tid | Items
     ----+-----------
     10  | A, C, D
     20  | B, C, E
     30  | A, B, C, E
     40  | B, E

     1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
     L1: {A}:2, {B}:3, {C}:3, {E}:3

     C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
     2nd scan -> C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
     L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

     C3 (from L2): {B,C,E}
     3rd scan -> {B,C,E}:2
     L3: {B,C,E}:2

  23. Important Details of Apriori
     - How to generate candidate sets?
     - How to count supports for candidate sets?

  24. Candidate Set Generation
     Ck = generate candidate set from Lk-1:
     - Step 1: self-joining Lk-1: assuming items and itemsets are sorted in order, two (k-1)-itemsets are joinable only if their first k-2 items are in common
     - Step 2: pruning: prune a candidate if it has an infrequent subset

     Example: generate C4 from L3 = {abc, abd, acd, ace, bcd}
     - Step 1: self-joining L3*L3 gives abcd (from abc and abd) and acde (from acd and ace)
     - Step 2: pruning removes acde because ade is not in L3
     - C4 = {abcd}
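The two steps map directly onto a couple of set comprehensions; this minimal, self-contained sketch reproduces the C4 example above.

    from itertools import combinations

    # Generate C4 from L3 = {abc, abd, acd, ace, bcd}.
    L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
    L3_set = set(L3)
    k = 4

    # Step 1: self-join -- two sorted (k-1)-itemsets join if they share their first k-2 items.
    joined = {x + (y[k - 2],)
              for x in L3 for y in L3
              if x[:k - 2] == y[:k - 2] and x[k - 2] < y[k - 2]}
    print(sorted(joined))   # [('a','b','c','d'), ('a','c','d','e')]

    # Step 2: prune -- drop a candidate if any of its (k-1)-subsets is not in L3.
    C4 = {c for c in joined
          if all(sub in L3_set for sub in combinations(c, k - 1))}
    print(sorted(C4))       # [('a','b','c','d')]  -- acde removed: ade is not in L3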

  25. How to Count Supports of Candidates?
       for each transaction t in database
           find all candidates in Ck that are subsets of t and increment their count;
     - Why is counting supports of candidates a problem?
       - The total number of candidates can be very large
       - One transaction may contain many candidates
     - For each subset s of t, check whether s is in Ck

  26. How to Count Supports of Candidates?
       for each transaction t in database
           find all candidates in Ck that are subsets of t and increment their count;
     - For each subset s of t, check whether s is in Ck:
       - Linear search
       - Prefix tree
       - Hash-tree (prefix tree with a hash function at interior nodes)
       - Hash-table
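A minimal sketch of the hash-table option: keep Ck in a hash set, enumerate each transaction's k-subsets, and look each one up. The candidate list C2 below is assumed purely for illustration. Note that for long transactions the number of k-subsets explodes, which is what the prefix-tree and hash-tree variants are designed to avoid.

    from itertools import combinations
    from collections import defaultdict

    def count_supports(transactions, Ck, k):
        counts = defaultdict(int)
        candidates = set(Ck)                       # hash table of candidate k-itemsets
        for t in transactions:
            for s in combinations(sorted(t), k):   # every k-subset of the transaction
                if s in candidates:
                    counts[s] += 1
        return dict(counts)

    transactions = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"},
                    {"B","E","F"}, {"B","C","D","E","F"}]
    C2 = [("A","B"), ("A","D"), ("B","D"), ("D","E")]   # assumed candidate 2-itemsets
    print(count_supports(transactions, C2, 2))
    # {('A', 'B'): 1, ('A', 'D'): 3, ('B', 'D'): 2, ('D', 'E'): 2}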

  27. Example: Hash-Tree
     Hash function on an item: 1, 4, 7 hash to the first branch; 2, 5, 8 to the second; 3, 6, 9 to the third.
     Transaction: 2 3 5 6 7
     [Figure: a hash tree whose leaves store the candidate 3-itemsets; the transaction's items are hashed level by level to locate the leaves whose candidates must be checked against the transaction.]
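A compact Python sketch of the hash-tree idea, using the slide's mod-3 hash function. The leaf-capacity MAX_LEAF, the split policy, and the exact candidate list C3 are assumptions (the slide's figure is only partially recoverable). Candidates are routed into leaves by hashing successive items; to count a transaction, each remaining item is hashed at every interior node, so only a few leaves are checked for subset containment.

    # Hash function from the slide: 1,4,7 -> branch 0; 2,5,8 -> branch 1; 3,6,9 -> branch 2.
    MAX_LEAF = 3   # assumed leaf capacity before splitting

    def bucket(item):
        return (item - 1) % 3

    class Node:
        def __init__(self):
            self.children = {}   # branch -> Node (interior nodes)
            self.itemsets = []   # candidate itemsets (leaf nodes)
            self.leaf = True

    def insert(node, itemset, depth=0):
        if node.leaf:
            node.itemsets.append(itemset)
            if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
                node.leaf = False                  # split: push candidates one level down
                for s in node.itemsets:
                    insert(node.children.setdefault(bucket(s[depth]), Node()), s, depth + 1)
                node.itemsets = []
        else:
            insert(node.children.setdefault(bucket(itemset[depth]), Node()), itemset, depth + 1)

    def matches(node, t, start, found):
        """Collect candidates in the tree that are subsets of the sorted transaction t."""
        if node.leaf:
            tset = set(t)
            found.update(s for s in node.itemsets if set(s) <= tset)
            return
        for i in range(start, len(t)):             # hash each remaining transaction item
            child = node.children.get(bucket(t[i]))
            if child is not None:
                matches(child, t, i + 1, found)

    # Illustrative candidate 3-itemsets (assumed, in the spirit of the slide's figure).
    C3 = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (1, 3, 6),
          (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
    root = Node()
    for c in C3:
        insert(root, c)

    found = set()
    matches(root, (2, 3, 5, 6, 7), 0, found)
    print(sorted(found))   # [(3, 5, 6), (3, 5, 7), (3, 6, 7), (5, 6, 7)]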
