
CARPENTER: Find Closed Patterns in Long Biological Datasets

CARPENTER: Find Closed Patterns in Long Biological Datasets. Gene expression data consists of a large number of genes. Zhiyu Wang, Knowledge Discovery and Data Mining, Dr. Osmar Zaiane, Department of Computing Science.


  1. CARPENTER: Find Closed Patterns in Long Biological Datasets
     Zhiyu Wang, Knowledge Discovery and Data Mining, Dr. Osmar Zaiane, Department of Computing Science, University of Alberta
     Biological Datasets
     • Gene expression
       - Consists of a large number of genes
     Overview
     • Motivation
     • Problem statement
     • Preliminaries
     • CARPENTER algorithm
       - Transpose table
       - Row enumeration tree
       - Prune methods
     • Performance
     • Comments and Conclusion
     Biological Datasets
     • Lung Cancer dataset (gene expression)
       - 181 samples
       - Each sample is described by 12533 genes
     • How can we find frequent patterns in such a dataset? CARPENTER

  2. Motivation
     • The challenge is to find the closed patterns in biological datasets that contain a large number of columns and a small number of rows
       - For example, 10,000 to 100,000 columns with 100 to 1,000 rows
     Motivation
     • The running time of most existing algorithms increases exponentially with increasing average row length
       - For example, a dataset can have up to $2^i$ potential frequent itemsets, where $i$ is the maximum row size.
       - What if $i = 12533$? Then $2^{12533} \approx 6.44 \times 10^{3772}$ (a huge search space)
     Problem Statement
     • Efficiently discover all the frequent closed patterns in such biological datasets with respect to a user-specified support threshold.
     Preliminaries
     • Features $f_i$
       - Items in the dataset
     • Feature support set $R(F')$
       - Maximal set of rows that contain a set of features $F'$
     • Example table:
         i | r_i
         1 | a, b, c
         2 | b, c, d
         3 | b, c, d
         4 | d
       - Features: {a, b, c, d}
       - Feature support set: if $F'$ = {b, c}, then $R(F')$ = {1, 2, 3}
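The search-space figure and the feature support set on these slides can be checked with a short Python sketch. It uses the toy table from the Preliminaries slide; the function names `feature_support_set` and `itemset_space_magnitude` are illustrative choices, not names from the presentation.

```python
import math

# Toy table from the Preliminaries slide: row id -> set of features (items).
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},
    4: {"d"},
}


def feature_support_set(table, features):
    """R(F'): the maximal set of rows that contain every feature in F'."""
    return {row for row, items in table.items() if features <= items}


def itemset_space_magnitude(max_row_size):
    """Return (mantissa, exponent) with 2**max_row_size ~ mantissa * 10**exponent."""
    log10_value = max_row_size * math.log10(2)
    exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return mantissa, exponent


print(feature_support_set(TABLE, {"b", "c"}))
mantissa, exponent = itemset_space_magnitude(12533)
print(f"2^12533 is about {mantissa:.2f} x 10^{exponent}")
```

Working in logarithms avoids formatting a 3773-digit integer and makes the order of magnitude explicit.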

  3. Preliminaries
     • Row support set $F(R')$
       - Maximal set of features common to a set of rows $R'$
     • Frequent closed pattern
       - There is no superset with the same support value
     • Example table:
         i | r_i
         1 | a, b, c
         2 | b, c, d
         3 | b, c, d
         4 | d
       - Row support set: if $R'$ = {1, 2}, then $F(R')$ = {b, c}
       - Frequent closed patterns: {b, c}, {d}, {b, c, d}, ...
     CARPENTER algorithm
     • Proposed by A. K. H. Tung et al. in ACM SIGKDD 2003.
     • The main idea is to find frequent closed patterns by depth-first, row-wise enumeration.
     • Assumption: the dataset satisfies the condition |R| << |F| (far fewer rows than features)
     CARPENTER
     • There are two phases:
       1. Transpose the dataset
       2. Row enumeration tree
          - Recursively search in the conditional transposed table
     Transpose table
     • [Figure: original table, transposed into the transposed table; projection on {2, 3} gives the 23-conditional transposed table]
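The row support set and the transpose step can be sketched in Python on the same toy table. The helper names are hypothetical, and `conditional_table` assumes the usual reading of an X-conditional transposed table: keep only the tuples that contain every row in X, and within them only the row ids larger than the largest row in X.

```python
# Toy table from the Preliminaries slide: row id -> set of features.
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},
    4: {"d"},
}


def row_support_set(table, rows):
    """F(R'): the maximal set of features common to every row in R'."""
    row_items = [table[r] for r in rows]
    return set.intersection(*row_items) if row_items else set()


def transpose(table):
    """Transposed table: feature -> set of row ids that contain it."""
    transposed = {}
    for row, items in table.items():
        for item in items:
            transposed.setdefault(item, set()).add(row)
    return transposed


def conditional_table(transposed, rows):
    """X-conditional transposed table (assumed definition, see lead-in)."""
    rows = set(rows)
    last = max(rows)
    return {
        feature: {r for r in row_ids if r > last}
        for feature, row_ids in transposed.items()
        if rows <= row_ids
    }


transposed = transpose(TABLE)
print(row_support_set(TABLE, [1, 2]))
print(conditional_table(transposed, {2, 3}))
```

On the toy table, the features that survive in the {2, 3}-conditional table are exactly $F(\{2, 3\})$ = {b, c, d}, which is why each node of the row enumeration tree can be represented by its conditional table.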

  4. Row enumeration tree
     • The bottom-up row enumeration tree is based on conditional tables.
     • Each node is a conditional table.
       - The 23-conditional table represents node 23.
     • Not a real tree structure
     • [Figure: row enumeration tree with numbered nodes]
     CARPENTER
     • Recursively generate conditional transposed tables, performing a depth-first traversal of the row enumeration tree in order to find the frequent closed patterns.
     Example
     • Without pruning strategies, minsup=3
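A minimal sketch of the unpruned depth-first row enumeration follows. It runs on the toy table from the Preliminaries slides rather than on the example table of this slide, which did not survive in the text, and the function name is illustrative. Each node is a row set R'; the pattern at that node is F(R'), and its support is the number of rows containing it.

```python
# Toy table from the Preliminaries slides: row id -> set of features.
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},
    4: {"d"},
}


def closed_patterns_by_row_enumeration(table, minsup):
    """Visit every non-empty row set depth-first and collect frequent closed patterns."""
    rows = sorted(table)
    found = {}  # frozenset(pattern) -> support

    def visit(start, common):
        # `common` is F(R') for the row set on the current path (None at the root).
        for idx in range(start, len(rows)):
            items = table[rows[idx]]
            new_common = set(items) if common is None else common & items
            if new_common:
                # Support of F(R') is the size of its feature support set.
                support = sum(1 for r in rows if new_common <= table[r])
                if support >= minsup:
                    found[frozenset(new_common)] = support
            visit(idx + 1, new_common)

    visit(0, None)
    return found


for pattern, support in closed_patterns_by_row_enumeration(TABLE, 2).items():
    print(sorted(pattern), support)
```

Because F(R') is the intersection of the rows in R', every pattern recorded this way is closed. With 4 rows the traversal visits 15 nodes, and with n rows it visits 2^n - 1, which is the cost the prune methods are meant to reduce.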

  5. Example
     • Frequent closed patterns (minsup=3):
         pattern | rows
         a       | 1, 2, 3, 4
         l       | 1, 2, 5
         aeh     | 2, 3, 4
     Prune methods
     • A complete traversal of the row enumeration tree is clearly not efficient.
     • CARPENTER proposes 3 prune methods.
     Prune method 1
     • Prune any branch that can never generate a closed pattern meeting the minsup threshold
     • [Figure: if minsup=4, the marked branches are pruned]
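Prune method 1 can be illustrated by adding a simple upper bound to a depth-first row enumeration. The bound used here is an assumption made for the sketch, not necessarily the exact test in CARPENTER: a branch is cut when the rows already on the path plus all rows that could still be appended number fewer than minsup. The function name and the toy table are illustrative.

```python
# Toy table from the Preliminaries slides: row id -> set of features.
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},
    4: {"d"},
}


def mine_with_minsup_pruning(table, minsup):
    """Row enumeration that skips branches unable to reach minsup rows."""
    rows = sorted(table)
    found = {}  # frozenset(pattern) -> support
    visited = 0  # number of tree nodes actually expanded

    def visit(start, depth, common):
        nonlocal visited
        for idx in range(start, len(rows)):
            # Rows on the path after adding rows[idx], plus every later row.
            max_reachable = (depth + 1) + (len(rows) - idx - 1)
            if max_reachable < minsup:
                break  # later siblings have an even smaller bound
            visited += 1
            items = table[rows[idx]]
            new_common = set(items) if common is None else common & items
            if new_common:
                support = sum(1 for r in rows if new_common <= table[r])
                if support >= minsup:
                    found[frozenset(new_common)] = support
            visit(idx + 1, depth + 1, new_common)

    visit(0, 0, None)
    return found, visited


patterns, visited = mine_with_minsup_pruning(TABLE, 3)
print(patterns, visited)
```

The bound is safe because a frequent closed pattern is still reported at the node whose row set equals its full feature support set, and every prefix of that row set passes the test.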

  6. Prune method 2
     • If rows appear in all tuples of the conditional transposed table, then the branch needs to be pruned and reconstructed
     Prune method 3
     • In each node, if the corresponding support features have already been found, prune the branch.
     Performance
     • CARPENTER is compared with CHARM and CLOSET
       - Both CHARM and CLOSET use a column enumeration approach
     • Uses the lung cancer dataset
       - 181 samples with 12533 features
     • Two parameters: minsup and length ratio
       - Length ratio is the percentage of columns taken from the original dataset
     Performance
     • Length ratio = 60%, varying minsup
     • [Figure: runtime (sec., log scale) versus minsup from 4 to 10 for CARPENTER, CHARM and CLOSET]
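One reading of prune method 2 is sketched below, as an interpretation rather than the presentation's exact procedure: rows that occur in every tuple of a conditional transposed table are moved into the current row set and deleted from the tuples, so the branches that would add them one at a time are never enumerated. The function name and the input table are hypothetical; the input is the {2}-conditional table of the toy dataset from the Preliminaries slides.

```python
def absorb_common_rows(row_set, conditional):
    """Move rows shared by all tuples of the conditional table into the row set.

    conditional: feature -> set of candidate row ids (all larger than max(row_set)).
    Returns the enlarged row set and the reduced conditional table.
    """
    if not conditional:
        return set(row_set), {}
    common_rows = set.intersection(*conditional.values())
    enlarged = set(row_set) | common_rows
    reduced = {feature: rows - common_rows for feature, rows in conditional.items()}
    return enlarged, reduced


# {2}-conditional table of the toy dataset: row 3 appears in every tuple.
conditional_2 = {"b": {3}, "c": {3}, "d": {3, 4}}
print(absorb_common_rows({2}, conditional_2))
```

Here node {2} collapses into node {2, 3}: both share the same pattern {b, c, d}, so exploring them separately would only repeat work.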

  7. Performance
     • Minsup=4%, varying length ratio
     • [Figure: runtime (sec., log scale) versus length ratio from 0.3 to 1 for CARPENTER, CHARM and CLOSET]
     Comments
     • The bottom-up approach of CARPENTER is not efficient.
     • [Figure: example with minsup=3]
     Comments
     • TD-Close uses a top-down approach.
     • [Figure: example with minsup=3]
     Conclusion
     • CARPENTER is used to find frequent closed patterns in biological datasets.
     • CARPENTER uses row enumeration instead of column enumeration to overcome the high dimensionality of biological datasets.
     • However, it is still not very efficient.

  8. References
     • F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. J. Zaki. CARPENTER: Finding closed patterns in long biological datasets. In Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003.
     Thank you! Questions?
