δ-Tolerance Closed Frequent Itemsets

James Cheng    Yiping Ke    Wilfred Ng
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
{csjames, keyiping, wilfred}@cse.ust.hk

Abstract

In this paper, we study an inherent problem of mining Frequent Itemsets (FIs): the number of FIs mined is often too large. The large number of FIs not only affects the mining performance, but also severely thwarts the application of FI mining. In the literature, Closed FIs (CFIs) and Maximal FIs (MFIs) are proposed as concise representations of FIs. However, the number of CFIs is still too large in many cases, while MFIs lose information about the frequency of the FIs. To address this problem, we relax the restrictive definition of CFIs and propose the δ-Tolerance CFIs (δ-TCFIs). Mining δ-TCFIs recursively removes all subsets of a δ-TCFI that fall within a frequency distance bounded by δ. We propose two algorithms, CFI2TCFI and MineTCFI, to mine δ-TCFIs. CFI2TCFI achieves very high accuracy on the estimated frequency of the recovered FIs but is less efficient when the number of CFIs is large, since it is based on CFI mining. MineTCFI is significantly faster and consumes less memory than the algorithms of the state-of-the-art concise representations of FIs, while the accuracy of MineTCFI is only slightly lower than that of CFI2TCFI.

1 Introduction

Frequent Itemset (FI) mining [1, 2] is fundamental to many important data mining tasks such as associations [1], correlations [6], sequences [3], episodes [13], emerging patterns [8], indexing [17] and caching [18], etc. Over the last decade, a huge amount of research has been conducted on improving the efficiency of mining FIs and many fast algorithms [9] have been proposed. However, the mining operation can easily return an explosive number of FIs, which not only severely thwarts the application of FIs, but also directly affects the mining efficiency.

To address this problem, Maximal Frequent Itemsets (MFIs) [4] and Closed Frequent Itemsets (CFIs) [14] are proposed as concise representations of FIs. MFIs are also FIs, but none of their proper supersets is an FI. Since an FI of size n has (2^n − 1) non-empty subset FIs, mining MFIs effectively addresses the problem of too many FIs. However, most applications are not only interested in the patterns represented by the FIs, but also require their occurrence frequency in the database for further analysis. For example, we need the frequency of the FIs to compute the support and confidence of association rules. MFIs, however, lose the frequency information of most FIs.

On the contrary, the set of CFIs is a lossless representation of FIs. CFIs are FIs that have no proper superset with the same frequency. Thus, we can retrieve the frequency of the non-closed FIs from their closed supersets. However, the definition of the closure of CFIs is too restrictive, since a CFI covers its subset only if the CFI appears in every transaction that its subset appears in. This is unusual when the database is large, especially for a sparse dataset.

In this paper, we investigate the relationship between the frequency of an itemset and its superset, and propose a relaxation of the rigid definition of CFIs. We motivate our approach by the following example.

[Figure 1. FIs and Their Frequency: a lattice of 15 FIs with their frequencies (abcd:100; abc:103, abd:103, acd:104, bcd:107; ab:106, ac:108, ad:107, bc:110, bd:130, cd:111; a:111, b:139, c:115, d:134), where each edge is labelled with the frequency distance between an itemset and its superset, e.g., 1 − 108/111 = 0.027 on the edge from a to ac.]

Example 1  Figure 1 shows 15 FIs (nodes) obtained from a retail dataset, where abcd is an abbreviation for the itemset {a, b, c, d} and the number following ":" is the frequency of abcd. Although we have only 1 MFI, i.e., abcd, the best estimation for the frequency of the 14 proper subsets of abcd is that they have frequency at least 100, which is the frequency of abcd. However, we are certainly interested in the knowledge that the FIs b, d and bd have a frequency significantly greater than that of the other FIs. On the contrary, CFIs preserve the frequency information, but all 15 FIs are CFIs, even though the frequency of many FIs differs only slightly from that of their supersets.

We investigate the relationship between the frequency of the FIs. In Figure 1, the number on each edge is computed as δ = (1 − freq(Y)/freq(X)), where Y is X's smallest superset that has the greatest frequency. For CFIs, if we want to remove X from the mining result, δ has to be equal to 0, which is a restrictive condition in most cases. However, if we relax this equality condition to allow a small tolerance, say δ ≤ 0.04, we can immediately prune 11 FIs and retain only abcd, bcd, bd and b (i.e., the bold nodes in Figure 1). The frequency of the pruned FIs can be accurately estimated as the average frequency of the pruned FIs that are of the same size and covered by the same superset. For example, ab, ac and ad are of the same size and covered by the same superset abcd; thus, their frequency is estimated as (106 + 108 + 107)/3 = 107. ✷
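To make the arithmetic of Example 1 concrete, the following sketch (Python; the frequencies are copied from Figure 1) recomputes the edge values and the frequency estimate. Note that it applies the tolerance check in a single pass, which happens to suffice for this example; the precise, recursive definition of δ-TCFIs is given in Section 3.

    # Frequencies of the 15 FIs in Figure 1 (itemsets written as strings).
    freq = {
        "abcd": 100,
        "abc": 103, "abd": 103, "acd": 104, "bcd": 107,
        "ab": 106, "ac": 108, "ad": 107, "bc": 110, "bd": 130, "cd": 111,
        "a": 111, "b": 139, "c": 115, "d": 134,
    }

    def edge(x):
        # 1 - freq(Y)/freq(X), where Y is X's smallest superset
        # (|Y| = |X| + 1) with the greatest frequency.
        sups = [y for y in freq if len(y) == len(x) + 1 and set(x) < set(y)]
        return 1.0 if not sups else 1 - max(freq[y] for y in sups) / freq[x]

    delta = 0.04
    print(sorted(x for x in freq if edge(x) > delta))
    # ['abcd', 'b', 'bcd', 'bd'] -- the bold nodes in Figure 1

    # ab, ac and ad have the same size and the same covering superset
    # abcd, so their frequency is estimated by their average:
    print((freq["ab"] + freq["ac"] + freq["ad"]) / 3)  # 107.0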

We find that a majority of the FIs mined from most of the well-known real datasets [9], as well as from the prevalently used synthetic datasets [12], exhibit the above characteristic in their frequency. Therefore, we propose to allow tolerance, bounded by a threshold δ, in the condition for the closure of CFIs, and define a new concise representation of FIs called the δ-Tolerance CFIs (δ-TCFIs). The notion of δ-tolerance greatly alleviates the restrictive definition of CFIs, as illustrated in the above example.

We propose two algorithms to mine δ-TCFIs. Our first algorithm, CFI2TCFI, is based on the fact that the set of CFIs is a lossless representation of FIs. CFI2TCFI first obtains the CFIs and then generates the δ-TCFIs by checking the condition of δ-tolerance on the CFIs. However, CFI2TCFI becomes inefficient when the number of CFIs is large.

We study the closure of the δ-TCFIs and propose another algorithm, MineTCFI, which makes use of the δ-tolerance in the closure to perform greater pruning on the mining space. Since the pruning condition is a relaxation of the pruning condition of mining CFIs, MineTCFI is always more efficient than CFI2TCFI. The effectiveness of the pruning can also be inferred from Example 1, as the majority of the itemsets can be pruned when the closure definition of CFIs is relaxed.

We compare our algorithms with FPclose [10], NDI [7], MinEx [5] and RPlocal [16], which are the state-of-the-art algorithms for mining the four respective concise representations of FIs. Our experimental results on real datasets [9] show that the number of δ-TCFIs is many times (up to orders of magnitude) smaller than the number of itemsets obtained by the other algorithms. We also measure the error rate of the estimated frequency of the FIs that are recovered from the δ-TCFIs. In all cases, the error rate of CFI2TCFI is significantly lower than δ, while that of MineTCFI is also considerably lower than δ. Most importantly, MineTCFI is significantly faster than all the other algorithms, while the memory consumption of MineTCFI is also small and in most cases smaller than that of the other algorithms.

Another important finding of mining δ-TCFIs is that, when δ increases, the error rate increases at a much slower rate. Thus, we can further reduce the number of δ-TCFIs by using a larger δ, while still attaining high accuracy.

Organization. Section 2 gives the preliminaries. Then, Section 3 defines the notion of δ-TCFIs and Section 4 presents the algorithms CFI2TCFI and MineTCFI. Section 5 reports the experimental results. Section 6 discusses related work and Section 7 concludes the paper.

2 Preliminaries

Let I = {x1, x2, ..., xN} be a set of items. An itemset (also called a pattern) is a subset of I. A transaction is an itemset. We say that a transaction Y supports an itemset X if Y ⊇ X. For brevity, an itemset {xk1, xk2, ..., xkm} is written as xk1xk2...xkm in this paper.

Let D be a database of transactions. The frequency of an itemset X, denoted as freq(X), is the number of transactions in D that support X. X is called a Frequent Itemset (FI) if freq(X) ≥ σ|D|, where σ (0 ≤ σ ≤ 1) is a user-specified minimum support threshold. X is called a Maximal Frequent Itemset (MFI) if X is an FI and there exists no FI Y such that Y ⊃ X. X is called a Closed Frequent Itemset (CFI) if X is an FI and there exists no FI Y such that Y ⊃ X and freq(Y) = freq(X).
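As a quick illustration of these definitions, here is a small brute-force sketch that enumerates the FIs, MFIs and CFIs of a toy database; the five transactions are invented for illustration, and a real miner would of course not enumerate all subsets.

    from itertools import combinations

    # A toy database D of five transactions over I = {a, b, c}.
    D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    sigma = 0.4  # minimum support threshold

    def freq(X):
        # Number of transactions in D that support X, i.e., T >= X.
        return sum(1 for T in D if X <= T)

    items = sorted(set().union(*D))
    subsets = [frozenset(c) for n in range(1, len(items) + 1)
               for c in combinations(items, n)]

    fis = [X for X in subsets if freq(X) >= sigma * len(D)]
    mfis = [X for X in fis if not any(Y > X for Y in fis)]
    cfis = [X for X in fis
            if not any(Y > X and freq(Y) == freq(X) for Y in fis)]
    print(len(fis), len(mfis), len(cfis))  # 7 1 7

On this toy database every FI is closed (7 FIs, 7 CFIs, 1 MFI), which mirrors the problem raised in the Introduction: the set of CFIs can be almost as large as the set of FIs itself.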
3 δ-Tolerance Closed Frequent Itemsets

In this section, we first define the notion of δ-TCFIs. Then, we discuss how we estimate the frequency of the FIs that are recovered from the δ-TCFIs. Finally, we give an analysis of the error bound of the estimated frequency of the recovered FIs.

3.1 The Notion of δ-TCFIs

Definition 1 (δ-Tolerance Closed Frequent Itemset)  An itemset X is a δ-tolerance closed frequent itemset (δ-TCFI) if and only if X is an FI and there exists no FI Y such that Y ⊃ X, |Y| = |X| + 1, and freq(Y) ≥ ((1 − δ) · freq(X)), where δ (0 ≤ δ ≤ 1) is a user-specified frequency tolerance factor.

We can define CFIs and MFIs by our δ-TCFIs as follows.

Lemma 1  An itemset X is a CFI if and only if X is a 0-TCFI.

Lemma 2  An itemset X is an MFI if and only if X is a 1-TCFI.
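A direct, if naive, transcription of Definition 1 might read as follows; it is only a sketch, assuming `fis` maps every FI (as a frozenset) to its frequency, such as the output of the brute-force enumeration sketched in Section 2.

    def is_delta_tcfi(X, delta, fis):
        # X (assumed to be an FI, i.e., a key of fis) is a delta-TCFI iff
        # no FI Y with Y ⊃ X and |Y| = |X| + 1 has
        # freq(Y) >= (1 - delta) * freq(X).
        return all(fis[Y] < (1 - delta) * fis[X]
                   for Y in fis if Y > X and len(Y) == len(X) + 1)

    # delta = 0: the condition forbids an immediate superset with
    # freq(Y) >= freq(X); since freq(Y) <= freq(X) whenever Y ⊃ X, this is
    # exactly an equal-frequency superset, giving the CFIs (Lemma 1).
    # delta = 1: (1 - delta) * freq(X) = 0, so any immediate frequent
    # superset disqualifies X, giving the MFIs (Lemma 2).

Both lemmas rest on the anti-monotonicity of freq: if any proper superset of X violates the closure (respectively, maximality) condition, then some superset just one item larger already does, so restricting the check in Definition 1 to supersets of size |X| + 1 loses nothing.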
