CARPENTER: Find Closed Patterns in Long Biological Datasets
Zhiyu Wang
Knowledge Discovery and Data Mining
Dr. Osmar Zaiane
Department of Computing Science, University of Alberta

Biological Datasets
• Gene expression
  – Consists of a large number of genes

Overview
• Motivation
• Problem statement
• Preliminaries
• CARPENTER algorithm
  – Transpose table
  – Row enumeration tree
  – Prune methods
• Performance
• Comments and Conclusion

Biological Datasets
• Lung Cancer dataset (gene expression)
  – 181 samples
  – Each sample is described by 12533 genes
• How can we find frequent patterns in such a dataset? CARPENTER

Motivation
• Running time of most existing algorithms increases exponentially with increasing average row length
  – For example, a dataset has $2^{i}$ potential frequent itemsets, where $i$ is the maximum row size.
  – What if $i = 12533$? $2^{12533} \approx 6.44 \times 10^{3772}$ (huge search space)

Motivation
• The challenge is to find closed patterns in biological datasets that contain a large number of columns and a small number of rows
  – For example, 10,000 – 100,000 columns with 100 – 1,000 rows

Problem Statement
• Efficiently discover all frequent closed patterns, with respect to a user-specified support threshold, in such biological datasets.

Preliminaries
• Features $f_i$
  – Items in the dataset
• Feature support set $R(F')$
  – Maximal set of rows that contain a set of features $F'$

  i | r_i
  1 | a, b, c
  2 | b, c, d
  3 | b, c, d
  4 | d

  Features: {a, b, c, d}
  Feature support set: $F' = \{b, c\}$, then $R(F') = \{1, 2, 3\}$
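
The two calculations on the Motivation and Preliminaries slides can be checked with a short sketch. The names `TABLE`, `feature_support_set` and `itemset_count_magnitude` are illustrative and do not come from the slides; the table is the four-row toy example shown above.

```python
from math import log10

# Toy table from the Preliminaries slide: row id -> set of features.
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},
    4: {"d"},
}


def feature_support_set(table, features):
    """R(F'): every row that contains all features in `features` (a set)."""
    return {row_id for row_id, items in table.items() if features <= items}


def itemset_count_magnitude(max_row_size):
    """Return (mantissa, exponent) such that 2**max_row_size ~= mantissa * 10**exponent."""
    log_value = max_row_size * log10(2)
    exponent = int(log_value)
    return 10 ** (log_value - exponent), exponent


if __name__ == "__main__":
    print(sorted(feature_support_set(TABLE, {"b", "c"})))
    mantissa, exponent = itemset_count_magnitude(12533)
    print(f"2^12533 ~= {mantissa:.2f} x 10^{exponent}")
```

Working with the base-10 logarithm avoids building the 3773-digit integer just to read off its order of magnitude.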

Preliminaries
• Row support set $F(R')$
  – Maximal set of features common to a set of rows $R'$
• Frequent closed pattern
  – There is no superset with the same support value

  i | r_i
  1 | a, b, c
  2 | b, c, d
  3 | b, c, d
  4 | d

  Row support set: $R' = \{1, 2\}$, then $F(R') = \{b, c\}$
  Frequent closed patterns: {b,c}, {d}, {b,c,d}, …

CARPENTER algorithm
• Proposed by A. K. H. Tung et al. in ACM SIGKDD 2003.
• The main idea is to find frequent closed patterns by depth-first, row-wise enumeration.
• Assumption: the dataset satisfies the condition $|R| \ll |F|$ (far fewer rows than features)

CARPENTER
• There are two phases:
  1. Transpose the dataset
  2. Row enumeration tree
     – Recursively search in conditional transposed tables

Transpose table
[Figure: the original table is transposed into the transposed table; projection on {2, 3} gives the 23-conditional transposed table]
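
The row support set, the transpose step and the projection can be sketched on the toy table from the Preliminaries slides (the table in the figure itself is not recoverable from the slides). All function names are illustrative. Keeping only rows greater than the largest row of the projection is an assumption about how the conditional table is defined, not something the slides state.

```python
# Toy table from the Preliminaries slides: row id -> set of features.
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},
    4: {"d"},
}


def row_support_set(table, rows):
    """F(R'): features common to every row in `rows` (must be non-empty)."""
    return set.intersection(*(table[r] for r in rows))


def transpose(table):
    """Transposed table: feature -> sorted list of the rows containing it."""
    transposed = {}
    for row_id in sorted(table):
        for feature in table[row_id]:
            transposed.setdefault(feature, []).append(row_id)
    return transposed


def conditional_table(transposed, rows):
    """X-conditional transposed table for the row set `rows`.

    Keeps the tuples (features) whose row list contains every row of X and,
    by assumption, only the rows that come after max(X) in each tuple.
    """
    last = max(rows)
    return {
        feature: [r for r in row_list if r > last]
        for feature, row_list in transposed.items()
        if set(rows) <= set(row_list)
    }


if __name__ == "__main__":
    print(sorted(row_support_set(TABLE, {1, 2})))
    tt = transpose(TABLE)
    print(tt)
    print(conditional_table(tt, {2, 3}))
```

The features that survive in the {2, 3}-conditional table are exactly $F(\{2, 3\})$, which is why each node of the enumeration can be identified with its conditional table.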

Row enumeration tree
• The bottom-up row enumeration tree is based on conditional tables.
• Each node is a conditional table.
  – The 23-conditional table represents node 23.

[Figure: row enumeration tree. Note on slide: not a real tree structure]

CARPENTER
• Recursively generate conditional transposed tables, performing a depth-first traversal of the row enumeration tree in order to find the frequent closed patterns.

Example
• Without pruning strategies, minsup=3
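
A minimal sketch of the unpruned depth-first traversal follows. It assumes the toy table from the Preliminaries slides rather than the example in the figure, and computes each node's features directly from the original table instead of materialising conditional transposed tables. The name `enumerate_rows` is illustrative. The only cut-off is the trivial one: once a row set has no common feature, none of its supersets can have one.

```python
# Toy table from the Preliminaries slides: row id -> set of features.
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},
    4: {"d"},
}


def enumerate_rows(table, minsup):
    """Visit row sets depth-first in increasing row order.

    Returns {pattern: support} for every pattern F(X) whose support
    (number of rows containing it) is at least `minsup`.
    """
    row_ids = sorted(table)
    found = {}

    def visit(current, start):
        for index in range(start, len(row_ids)):
            node = current + [row_ids[index]]
            # F(X): features shared by all rows of this node.
            pattern = set.intersection(*(table[r] for r in node))
            if not pattern:
                continue  # supersets of this row set share no feature either
            # Support is |R(F(X))|, which may include rows outside the node.
            support = sum(1 for items in table.values() if pattern <= items)
            if support >= minsup:
                found[frozenset(pattern)] = support
            visit(node, index + 1)

    visit([], 0)
    return found


if __name__ == "__main__":
    for pattern, support in enumerate_rows(TABLE, 2).items():
        print(sorted(pattern), support)
```

Every $F(X)$ is closed by construction, since adding any further feature would lose at least one row of $X$; the traversal therefore only has to filter by support.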

Example
• Frequent closed patterns (minsup=3)

  Pattern | Rows
  a       | 1, 2, 3, 4
  l       | 1, 2, 5
  aeh     | 2, 3, 4

Prune methods
• A complete traversal of the row enumeration tree is clearly not efficient.
• CARPENTER proposes 3 prune methods.

Prune method 1
• Prune any branch that can never generate a closed pattern meeting the minsup threshold.

[Figure: row enumeration tree] If minsup=4, then these branches will be pruned.
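
The idea behind prune method 1 can be illustrated with a deliberately simplified bound; this is a stand-in for the slide's rule, not the exact test used by CARPENTER. A node is skipped when its row set, even extended by every later row, would contain fewer than minsup rows. Any frequent closed pattern is still reached at the node whose row set is exactly its supporting rows, so the output is unchanged. The function `mine` and its `prune` flag are hypothetical names, and the table is again the toy example from the Preliminaries slides.

```python
# Toy table from the Preliminaries slides: row id -> set of features.
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},
    4: {"d"},
}


def mine(table, minsup, prune=True):
    """Depth-first row enumeration; returns ({pattern: support}, nodes visited)."""
    row_ids = sorted(table)
    found = {}
    visited = 0

    def visit(current, start):
        nonlocal visited
        for index in range(start, len(row_ids)):
            node = current + [row_ids[index]]
            rows_left = len(row_ids) - (index + 1)
            # Simplified stand-in for prune method 1: even if every later row
            # joined this node, its row set would stay below minsup. Later
            # siblings have even fewer rows left, so the loop can stop.
            if prune and len(node) + rows_left < minsup:
                break
            visited += 1
            pattern = set.intersection(*(table[r] for r in node))
            if not pattern:
                continue  # no common feature here or in any descendant
            support = sum(1 for items in table.values() if pattern <= items)
            if support >= minsup:
                found[frozenset(pattern)] = support
            visit(node, index + 1)

    visit([], 0)
    return found, visited


if __name__ == "__main__":
    full, full_nodes = mine(TABLE, 3, prune=False)
    pruned, pruned_nodes = mine(TABLE, 3, prune=True)
    print(full == pruned, full_nodes, pruned_nodes)
```

Comparing the two runs on the same table is a cheap sanity check that a pruning rule removes work without removing patterns.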

Prune method 2
• If some rows appear in all tuples of the conditional transposed table, then such a branch needs to be pruned and reconstructed.

Prune method 3
• In each node, if the corresponding support features have already been found, prune the branch.

Performance
• CARPENTER is compared with CHARM and CLOSET
  – Both CHARM and CLOSET use a column enumeration approach
• Uses the lung cancer dataset
  – 181 samples with 12533 features
• Two parameters: minsup and length ratio
  – Length ratio is the percentage of columns taken from the original dataset

Performance
• Length ratio = 60%, varying minsup

[Plot: runtime (sec., log scale from 0.1 to 100000) versus minsup (4 to 10) for CARPENTER, CHARM and CLOSET]
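
Returning to prune method 2: the test it relies on, finding rows that occur in every tuple of a conditional transposed table, is easy to sketch. If a row appears in every tuple, every feature of the node also occurs in that row, so adding the row to the node leaves its feature set unchanged. The helper names below are illustrative, and the input is the {2}-conditional table of the toy table from the Preliminaries slides, written out by hand.

```python
def rows_in_every_tuple(cond_table):
    """Rows that occur in all tuples of a conditional transposed table."""
    row_sets = [set(rows) for rows in cond_table.values()]
    if not row_sets:
        return set()
    return set.intersection(*row_sets)


def absorb_common_rows(node_rows, cond_table):
    """Move rows shared by all tuples into the node and drop them from the table."""
    common = rows_in_every_tuple(cond_table)
    merged = sorted(set(node_rows) | common)
    reduced = {
        feature: [r for r in rows if r not in common]
        for feature, rows in cond_table.items()
    }
    return merged, reduced


if __name__ == "__main__":
    # {2}-conditional table of the toy table (rows after row 2 only).
    cond_2 = {"b": [3], "c": [3], "d": [3, 4]}
    print(absorb_common_rows([2], cond_2))
```

On this input, row 3 is absorbed into the node, and the reduced table coincides with the {2, 3}-conditional table, so the separate node for {2, 3} no longer needs to be explored on its own.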

Performance
• Minsup=4%, varying length ratio

[Plot: runtime (sec., log scale from 1 to 100000) versus length ratio (0.3 to 1) for CARPENTER, CHARM and CLOSET]

Comments
• The bottom-up approach of CARPENTER is not efficient.

[Figure: example with minsup=3]

Comments
• TD-Close uses a top-down approach.

[Figure: example with minsup=3]

Conclusion
• CARPENTER is used to find frequent closed patterns in biological datasets.
• CARPENTER uses row enumeration instead of column enumeration to overcome the high dimensionality of biological datasets.
• However, its bottom-up traversal is not very efficient.

References
• F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. J. Zaki. CARPENTER: Finding closed patterns in long biological datasets. In Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003.

Thank you!
Questions?