Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
Data Mining: Mining Frequent Patterns
Jay Urbain, PhD
Credits: Nazli Goharian, Jiawei Han, Micheline Kamber, and Jian Pei

Outline
- Basic Concepts
- Frequent Itemset Mining Methods
- Which Patterns Are Interesting?—Pattern Evaluation Methods
- Summary

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?
  - What are the subsequent purchases after buying a PC?
  - What DNA sequences are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Why Is Frequent Pattern Mining Important?
- A frequent pattern is an intrinsic and important property of a dataset
- Foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: discriminative frequent pattern analysis
  - Cluster analysis: frequent pattern-based clustering
  - Data warehousing: iceberg cubes and cube gradients
  - Semantic data compression: fascicles
- Broad applications
Basic Concepts: Frequent Patterns
- itemset: a set of one or more items
- k-itemset: X = {x1, ..., xk}
- absolute support (support count) of X: the frequency or number of occurrences of itemset X
- relative support s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Basic Concepts: Association Rules
- Find all the rules X → Y with minimum support and confidence
  - support s: the probability that a transaction contains both X and Y, i.e., P(X, Y)
  - confidence c: the conditional probability that a transaction containing X also contains Y, i.e., P(Y|X)
- Let minsup = 50%, minconf = 50%
- Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
- Association rules (many more exist):
  - Beer → Diaper (support 3/5 = 60%, confidence 3/3 = 100%)
  - Diaper → Beer (support 3/5 = 60%, confidence 3/4 = 75%)
- Problems?
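The support and confidence computations above can be sketched in a few lines of Python. This is a minimal illustration over the slide's five-transaction table; the function names are my own, not from the slides.

```python
# Support/confidence over the slide's five-transaction table.
transactions = [
    {"Beer", "Nuts", "Diaper"},                     # Tid 10
    {"Beer", "Coffee", "Diaper"},                   # Tid 20
    {"Beer", "Diaper", "Eggs"},                     # Tid 30
    {"Nuts", "Eggs", "Milk"},                       # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},   # Tid 50
]

def support(itemset):
    """Relative support: fraction of transactions containing every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs U rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Beer", "Diaper"}))       # 0.6, i.e. frequent at minsup = 50%
print(confidence({"Beer"}, {"Diaper"}))  # 1.0
print(confidence({"Diaper"}, {"Beer"}))  # ~0.75
```

Note that both rules share the same support (they cover the same itemset {Beer, Diaper}); only their confidences differ, which is exactly why the rule direction matters.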
Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, ..., a100} contains (100 choose 1) + (100 choose 2) + ... + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X (Y a proper superset of X) with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
- Closed patterns are a lossless compression of the frequent patterns, reducing the number of patterns and rules

Closed Patterns and Max-Patterns: Example
- DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 1
- The set of closed itemsets: <a1, ..., a100>: 1 and <a1, ..., a50>: 2
- The set of max-patterns: <a1, ..., a100>: 1
- The set of all frequent patterns: 2^100 − 1 itemsets

Computational Complexity of Frequent Itemset Mining
- How many itemsets are potentially generated in the worst case?
  - The number of frequent itemsets to be generated is sensitive to the minsup threshold
  - When minsup is low, there can be an exponential number of frequent itemsets
  - Worst case: M^N, where M = # distinct items and N = max length of transactions
- Worst-case complexity vs. expected probability
  - Ex.: suppose Walmart sells 10^4 kinds of products
  - The chance of picking one particular product: 10^-4
  - The chance of picking a particular set of 10 products: ~10^-40
  - What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions?

Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
- Basic Concepts
- Frequent Itemset Mining Methods
- Which Patterns Are Interesting?—Pattern Evaluation Methods
- Summary
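The closed/max distinction can be checked by brute force on a scaled-down analog of the slide's example. Since 2^100 itemsets are obviously not enumerable, this hypothetical sketch uses DB = {<a1..a4>, <a1, a2>} instead; the variable names are my own.

```python
from itertools import combinations

# Scaled-down analog of the slide's DB = {<a1..a100>, <a1..a50>}, min_sup = 1.
db = [frozenset(f"a{i}" for i in range(1, 5)),   # <a1, a2, a3, a4>
      frozenset(f"a{i}" for i in range(1, 3))]   # <a1, a2>
min_sup = 1

items = sorted(set().union(*db))

def sup(s):
    """Absolute support: number of transactions containing s."""
    return sum(s <= t for t in db)

# All non-empty frequent itemsets, by exhaustive enumeration
# (fine only for tiny toy examples).
frequent = {frozenset(c): sup(frozenset(c))
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sup(frozenset(c)) >= min_sup}

# Closed: no frequent proper superset has the same support.
closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]

# Max: no frequent proper superset at all.
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(len(frequent))  # 15 frequent itemsets (2^4 - 1)
print(sorted(sorted(c) for c in closed))   # only {a1,a2} and {a1,a2,a3,a4}
print(sorted(sorted(m) for m in maximal))  # only {a1,a2,a3,a4}
```

Every max-pattern is closed, but not vice versa: {a1, a2} is closed (its supersets all have smaller support) yet not maximal, mirroring the <a1..a50> row of the slide's answer.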
The Downward Closure Property and Scalable Mining Methods
- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so are {beer, diaper}, {beer, nuts}, and {diaper, nuts}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FPGrowth: Han, Pei & Yin @ SIGMOD'00)
  - Vertical data format approach (Charm: Zaki & Hsiao @ SDM'02)

Scalable Frequent Itemset Mining Methods
- Apriori: A Candidate Generation-and-Test Approach
- Improving the Efficiency of Apriori
- FPGrowth: A Frequent Pattern-Growth Approach
- ECLAT: Frequent Pattern Mining with Vertical Data Format
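The downward closure (anti-monotone) property can be sanity-checked directly: every subset of an itemset occurs in at least as many transactions as the itemset itself. The toy transactions below are my own, not from the slides.

```python
from itertools import combinations

# Toy transactions (hypothetical), with min_sup = 2.
db = [{"beer", "diaper", "nuts"},
      {"beer", "diaper"},
      {"beer", "nuts", "diaper"}]

def sup(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(set(itemset) <= t for t in db)

full = {"beer", "diaper", "nuts"}
assert sup(full) >= 2  # {beer, diaper, nuts} is frequent...

# ...so every non-empty subset must have support >= sup(full):
for r in (1, 2):
    for subset in combinations(full, r):
        assert sup(subset) >= sup(full)

print("downward closure holds on this DB")
```

Apriori exploits the contrapositive: once a k-itemset is known to be infrequent, no superset of it can be frequent, so none needs to be generated or counted.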
Apriori: A Candidate Generation & Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example (min_sup = 2)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
          L1 = {A}:2, {B}:3, {C}:3, {E}:3   ({D} pruned: support 1 < 2)
2nd scan: C2 = {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
          L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
3rd scan: C3 = {B,C,E}
          L3 = {B,C,E}:2

Implementation of Apriori
- How to generate candidates?
  - Step 1: self-join Lk with itself
  - Step 2: prune
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because its subset ade is not in L3
  - C4 = {abcd}

The Apriori Algorithm (Pseudo-code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != {}; k++) do begin
    Ck+1 = candidates generated from Lk;
        // do not generate Ck+1 candidates that have a subset not in Lk
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
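The pseudo-code translates almost line for line into Python. The sketch below uses my own helper names, performs the join as pairwise unions rather than the textbook prefix-based self-join (the resulting candidate set is the same), and reproduces the L1/L2/L3 results of the worked example.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {frequent itemset: support count}, following the slide's pseudo-code."""
    db = [frozenset(t) for t in db]

    # One scan for the frequent 1-itemsets (L1).
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    result = dict(L)

    k = 1
    while L:
        prev = set(L)
        # Join: unions of frequent k-itemsets that yield (k+1)-itemsets.
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune: drop candidates with any infrequent k-subset (downward closure).
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        # Count candidate supports in one scan of the DB.
        counts = {c: sum(c <= t for t in db) for c in cands}
        L = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(L)
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(tdb, min_sup=2)
print(freq[frozenset({"B", "C", "E"})])  # 2, matching L3 in the example
```

On the TDB example this yields exactly the nine frequent itemsets of the slide: four in L1, four in L2, and {B, C, E} in L3; {A, B} and {A, E} are counted in C2 but fail min_sup, and {A, B, C} / {A, C, E} are pruned before counting.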
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are typically stored in a hash structure along with their counts
  - Different implementations exist, e.g., hash map, hash tree, etc.

Scalable Frequent Itemset Mining Methods
- Apriori: A Candidate Generation-and-Test Approach
- Improving the Efficiency of Apriori
- FPGrowth: A Frequent Pattern-Growth Approach
- ECLAT: Frequent Pattern Mining with Vertical Data Format
- Mining Closed Frequent Patterns and Max-Patterns
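The hash-based counting idea can be sketched with an ordinary hash map standing in for the hash tree. This is a simplification of my own: it enumerates every k-subset of each transaction and keeps only those that are real candidates, whereas a true hash tree also avoids materializing most of those subsets.

```python
from itertools import combinations
from collections import Counter

def count_supports(db, candidates, k):
    """Count support of each candidate k-itemset via hashed subset lookup."""
    cand_set = set(candidates)          # hash lookup: O(1) average per subset
    counts = Counter()
    for t in db:
        # Enumerate the transaction's k-subsets; increment only actual candidates.
        for sub in combinations(sorted(t), k):
            s = frozenset(sub)
            if s in cand_set:
                counts[s] += 1
    return counts

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
c2 = [frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]]
counts = count_supports(db, c2, 2)
print(counts[frozenset({"B", "E"})])  # 3
```

Compared with testing every candidate against every transaction, this touches each transaction once and pays only for the subsets a transaction actually contains, which is the point of hashing candidates in the first place.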