Data Mining: Associative Pattern Mining
Hamid Beigy, Sharif University of Technology, Fall 1396
Outline
1. Introduction
2. Frequent pattern mining model
3. Frequent itemset mining algorithms
   - Brute force frequent itemset mining algorithm
   - Apriori algorithm
   - Frequent pattern growth (FP-growth)
   - Mining frequent itemsets using vertical data format
4. Summarizing itemsets
   - Mining maximal itemsets
   - Mining closed itemsets
5. Sequence mining
6. Graph mining
7. Pattern and rule assessment
Introduction

The classical problem of associative pattern mining is defined in the context of supermarkets: the sets of items bought together by customers are referred to as transactions. The goal is to determine associations between groups of items bought by customers.

The most popular model for associative pattern mining uses the frequencies of sets of items as the quantification of the level of association. The discovered sets of items are referred to as large itemsets, frequent itemsets, or frequent patterns.

The motivating question: which items are frequently purchased together by customers?

[Figure: shopping baskets of customers 1 through n (e.g., bread and milk; bread, milk, and cereal; sugar, eggs, and butter) being analyzed by a market analyst.]
Applications of associative pattern mining

Associative pattern mining has a wide variety of applications:

Supermarket data: The supermarket application was the original motivating scenario in which the frequent pattern mining problem was proposed. The goal is to mine the sets of items that are frequently bought together by analyzing customer shopping transactions.

Text mining: Text data is often represented in the bag-of-words model, so frequent pattern mining can help identify co-occurring terms and keywords. Such co-occurring terms have numerous text-mining applications.

Web mining: A web site logs all incoming traffic in the form of records containing the source and destination pages requested by a user, the time of the request, and the return code. We are interested in finding whether there are sets of web pages that many users tend to browse whenever they visit the site.

Generalization to dependency-oriented data types: The original frequent pattern mining model has been generalized, with a few modifications, to many dependency-oriented data types, such as time-series data, sequential data, spatial data, and graph data. Such models are useful in applications such as web log analysis, software bug detection, and spatiotemporal event detection.

Other major data mining problems: Frequent pattern mining can be used as a subroutine to provide effective solutions to many data mining problems, such as clustering, classification, and outlier analysis.
Association rules

Frequent itemsets can be used to generate association rules of the form X ⇒ Y, where X and Y are sets of items. For example, if the supermarket owner discovers the rule

{Eggs, Milk} ⇒ {Yogurt}

then she/he can promote Yogurt to customers who often buy Eggs and Milk.

The frequency-based model for associative pattern mining is very popular due to its simplicity. However, the raw frequency of a pattern is not the same as the statistical significance of the underlying correlations. Therefore, several models based on statistical significance have been proposed; the patterns they discover are referred to as interesting patterns.
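As a minimal sketch of how a rule like {Eggs, Milk} ⇒ {Yogurt} can be evaluated on transactions (the toy database and variable names here are my own, not from the slides), one can measure how often the consequent appears among transactions containing the antecedent — the usual confidence measure, which the slides do not define here:

```python
# Hypothetical toy database, matching the supermarket example in these slides
T = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

X, Y = {"Eggs", "Milk"}, {"Yogurt"}

# Transactions that contain the antecedent X as a subset
matching = [t for t in T if X <= t]
# Among those, transactions that also contain the consequent Y
holding = [t for t in matching if Y <= t]

strength = len(holding) / len(matching)
print(strength)  # 2 of the 3 Eggs-and-Milk transactions also contain Yogurt
```

A rule is typically reported only when this ratio (and the support of X ∪ Y) exceeds user-chosen thresholds.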
Frequent pattern mining model

Assume that the database T contains n transactions T1, T2, ..., Tn. Each transaction has a unique identifier, referred to as the transaction identifier or tid. Each transaction Ti is drawn on the universe of items U.

tid | Set of Items                | Binary Representation
----+-----------------------------+----------------------
 1  | {Bread, Butter, Milk}       | 110010
 2  | {Eggs, Milk, Yogurt}        | 000111
 3  | {Bread, Cheese, Eggs, Milk} | 101110
 4  | {Eggs, Milk, Yogurt}        | 000111
 5  | {Cheese, Milk, Yogurt}      | 001011

An itemset is a set of items. A k-itemset is an itemset that contains exactly k items. The fraction of transactions in T = {T1, T2, ..., Tn} in which an itemset occurs as a subset is known as the support of the itemset.

Definition (Support): The support of an itemset I, denoted sup(I), is defined as the fraction of the transactions in the database T = {T1, T2, ..., Tn} that contain I as a subset.

Items that are correlated will have high support.
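The support definition above translates directly into code. A minimal sketch (function and variable names are my own), using Python's subset operator `<=` on sets:

```python
def support(itemset, T):
    """Fraction of transactions in T that contain `itemset` as a subset."""
    return sum(1 for t in T if itemset <= t) / len(T)

# The five-transaction toy database from the slide
T = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

print(support({"Milk"}, T))           # Milk occurs in all 5 transactions -> 1.0
print(support({"Bread", "Milk"}, T))  # transactions 1 and 3 -> 0.4
```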
Frequent pattern mining model (cont.)

The goal of the frequent pattern mining model is to determine itemsets whose support is at least a minimum level, denoted by minsup.

Definition (Frequent itemset mining): Given a set of transactions T = {T1, T2, ..., Tn}, where each transaction Ti is a subset of items from U, determine all itemsets I that occur as a subset of at least a predefined fraction minsup of the transactions in T.

Consider the following database:

tid | Set of Items
----+----------------------------
 1  | {Bread, Butter, Milk}
 2  | {Eggs, Milk, Yogurt}
 3  | {Bread, Cheese, Eggs, Milk}
 4  | {Eggs, Milk, Yogurt}
 5  | {Cheese, Milk, Yogurt}

The universe of items is U = {Bread, Butter, Cheese, Eggs, Milk, Yogurt}.
sup({Bread, Milk}) = 2/5 = 0.4.
sup({Cheese, Yogurt}) = 1/5 = 0.2.

The number of frequent itemsets is generally very sensitive to the value of minsup. Therefore, an appropriate choice of minsup is crucial for discovering a set of frequent patterns of meaningful size.
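The definition above can be realized by the simplest possible algorithm: enumerate every itemset over U and keep those with support at least minsup. This is a sketch of the brute-force approach mentioned in the outline (names are my own; exponential in |U|, so for illustration only):

```python
from itertools import combinations

def frequent_itemsets(T, minsup):
    """Brute force: test every candidate itemset over the universe U."""
    U = sorted(set().union(*T))
    result = {}
    for k in range(1, len(U) + 1):
        for items in combinations(U, k):
            I = set(items)
            sup = sum(1 for t in T if I <= t) / len(T)
            if sup >= minsup:
                result[frozenset(I)] = sup
    return result

T = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

F = frequent_itemsets(T, minsup=0.3)
print(F[frozenset({"Eggs", "Milk", "Yogurt"})])  # 0.4
```

At minsup = 0.3, {Cheese, Yogurt} (support 0.2) is correctly excluded.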
Frequent pattern mining model (cont.)

When an itemset I is contained in a transaction, all of its subsets are also contained in that transaction. Therefore, the support of any subset J of I is always at least equal to that of I. This is referred to as the support monotonicity property.

Property (Support monotonicity): The support of every subset J of I is at least equal to the support of itemset I:
sup(J) ≥ sup(I) for all J ⊆ I.

This implies that every subset of a frequent itemset is also frequent, which is referred to as the downward closure property.

Property (Downward closure): Every subset of a frequent itemset is also frequent.

The downward closure property of frequent patterns is algorithmically very convenient, because it provides an important constraint on the inherent structure of frequent patterns.
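The monotonicity property can be checked directly on the toy database. A minimal sketch (database and helper names are my own, not from the slides):

```python
from itertools import combinations

T = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

def support(I, T):
    """Fraction of transactions containing I as a subset."""
    return sum(1 for t in T if I <= t) / len(T)

I = {"Eggs", "Milk", "Yogurt"}
# Monotonicity: every nonempty proper subset J of I has sup(J) >= sup(I)
for k in range(1, len(I)):
    for J in combinations(sorted(I), k):
        assert support(set(J), T) >= support(I, T)
print("sup(J) >= sup(I) holds for every subset J of", sorted(I))
```

For instance, sup({Milk, Yogurt}) = 0.6 ≥ sup({Eggs, Milk, Yogurt}) = 0.4, as the property requires.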
Frequent pattern mining model (cont.)

The downward closure property can be used to create concise representations of frequent patterns, wherein only the maximal frequent itemsets are retained.

Definition (Maximal frequent itemset): A frequent itemset is maximal at a given minimum support level minsup if it is frequent and no superset of it is frequent.

Consider the following database:

tid | Set of Items
----+----------------------------
 1  | {Bread, Butter, Milk}
 2  | {Eggs, Milk, Yogurt}
 3  | {Bread, Cheese, Eggs, Milk}
 4  | {Eggs, Milk, Yogurt}
 5  | {Cheese, Milk, Yogurt}

The itemset {Eggs, Milk, Yogurt} is a maximal frequent itemset at minsup = 0.3. The itemset {Eggs, Milk} is not maximal, because it has a superset that is also frequent.

All frequent itemsets can be derived from the maximal patterns by enumerating the subsets of the maximal frequent patterns. The maximal patterns can therefore be considered a condensed representation of the frequent patterns.
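Given the set of frequent itemsets, the maximal ones are simply those with no frequent proper superset. A sketch on the slide's database at minsup = 0.3 (brute-force enumeration for illustration; names are my own):

```python
from itertools import combinations

T = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]
minsup = 0.3  # i.e., at least 2 of the 5 transactions

U = sorted(set().union(*T))
support = lambda I: sum(1 for t in T if I <= t) / len(T)

# All frequent itemsets (brute force, for illustration only)
frequent = [frozenset(c)
            for k in range(1, len(U) + 1)
            for c in combinations(U, k)
            if support(set(c)) >= minsup]

# Maximal = frequent itemsets with no frequent proper superset
maximal = [I for I in frequent if not any(I < J for J in frequent)]
print(sorted(map(sorted, maximal)))
```

On this database the maximal patterns are {Bread, Milk}, {Cheese, Milk}, and {Eggs, Milk, Yogurt}; every other frequent itemset is a subset of one of these three.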
Frequent pattern mining model (cont.)

The maximal patterns can be considered a condensed representation of the frequent patterns. However, this representation does not retain information about the support values of the subsets. For example, sup({Eggs, Milk, Yogurt}) = 0.4 does not imply sup({Milk, Yogurt}) = 0.6. A different representation, called the closed frequent itemset, is able to retain the support information of the subsets (to be discussed later).

An interesting property of itemsets is that they can be conceptually arranged in the form of a lattice of itemsets. This lattice contains one node for each subset of U, and neighboring nodes differ by exactly one item. All frequent pattern mining algorithms, implicitly or explicitly, traverse this search space to determine the frequent patterns.
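The lattice traversal mentioned above can be sketched as a level-wise search: start from the frequent 1-itemsets and repeatedly move to neighboring lattice nodes (one item larger), pruning any node below minsup, which downward closure makes safe. This is my own illustrative sketch in the spirit of Apriori, not the slides' algorithm:

```python
def lattice_bfs(T, minsup):
    """Level-wise traversal of the itemset lattice, pruning by support."""
    U = sorted(set().union(*T))
    sup = lambda I: sum(1 for t in T if I <= t) / len(T)
    # Level 1: frequent single items
    level = [frozenset([x]) for x in U if sup(frozenset([x])) >= minsup]
    frequent = list(level)
    while level:
        # Neighboring lattice nodes differ by exactly one item
        candidates = {I | {x} for I in level for x in U if x not in I}
        level = [c for c in candidates if sup(c) >= minsup]
        frequent.extend(level)
    return frequent

T = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

F = lattice_bfs(T, minsup=0.3)
print(len(F))  # 11 frequent itemsets on this database at minsup = 0.3
```

Extending only frequent nodes means entire infrequent sublattices are never visited, which is exactly the pruning opportunity the downward closure property provides.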