2007-11-26

High Confidence Rule Mining for Microarray Analysis
Kang Deng, University of Alberta

Outline
• Introduction
• Row Enumeration
• Confidence-based Prune Strategy
• MAXCONF Algorithm
• Evaluation
• References

The Paper
• "High Confidence Rule Mining for Microarray Analysis", by Tara McIntosh, Sanjay Chawla, 2006
• 27 pages, 14 definitions, 2 lemmas, 4 tables
• Figures, formulas, etc.

What is a Microarray?
• A DNA microarray is a collection of microscopic DNA spots, commonly representing single genes, arrayed on a solid surface by covalent attachment to a chemical matrix.
• In rule-mining terms: genes are the items, samples are the transactions.
Our Task
• One main objective of molecular biology is to develop a deeper understanding of how genes are functionally related.
• We do not mine association rules, but confidence rules.

Traditional Dataset vs. Microarray Dataset
• Traditional dataset: short transactions (about 12 items each), many of them (about 10,000).
• Microarray dataset: few transactions (fewer than 500 samples), very wide ones (more than 6,000 genes).

Explosive Increase of Candidates
• Column (item) enumeration generates candidates by length: 1-length, 2-length, 3-length, ...
• The length of the mined patterns is much less than the average number of items in one transaction.
• Minimum Support = 30%, Minimum Confidence = 80%
• With thousands of items per transaction, column enumeration explodes.

How can we make the wide microarray table look like the traditional one? Row Enumeration: transpose the table.

Transposed Table
Item | Transactions
A    | 1, 2, 5, 6
B    | 1, 4, 8
C    | 1, 2, 3, 4, 5, 8
D    | 1, 2, 3, 4, 6, 7, 8
E    | 1, 2, 3, 4, 5
F    | 3
G    | 1, 2, 3, 4, 8
H    | 3
I    | 3, 5, 6, 7
J    | 7
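The transposed table and the itemset attached to a set of rows can be sketched in Python (a minimal sketch; the helper name `itemset_of` is mine, not from the paper):

```python
# The slides' transposed table: item -> set of supporting transactions.
TT = {
    "A": {1, 2, 5, 6}, "B": {1, 4, 8}, "C": {1, 2, 3, 4, 5, 8},
    "D": {1, 2, 3, 4, 6, 7, 8}, "E": {1, 2, 3, 4, 5}, "F": {3},
    "G": {1, 2, 3, 4, 8}, "H": {3}, "I": {3, 5, 6, 7}, "J": {7},
}

def itemset_of(rows):
    """Items contained in every transaction of `rows` (the itemset of a
    row-enumeration node labelled by that row set)."""
    return {item for item, txns in TT.items() if rows <= txns}

print(sorted(itemset_of({1, 2, 3, 4})))  # ['C', 'D', 'E', 'G']
```

Intersecting this way reproduces the node labels used later in the talk, e.g. node (1234) carries the itemset {CDEG}.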
Row Enumeration Tree
• If the transaction set of the current parent node n is completely contained within that of a sibling node, the child node is not constructed. For example, node 2.
• RER II ("Mining frequent closed patterns in microarray data", G. Cong, K.-L. Tan, A. Tung, F. Pan, 2004) uses a support-based pruning strategy: Minimum Support = 30%, Minimum Confidence = 80%.

Confidence-based Strategy
• In biology, we care about confidence rules, not support.

Prune #1
• σ_max(5) = 1 + 2 = 3
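The row-enumeration walk can be sketched as a depth-first traversal over row sets (a hedged sketch under my own names; the slides' sibling-containment pruning is omitted for brevity):

```python
# Depth-first row enumeration over the slides' transposed table: each node
# is a set of rows (transactions); its itemset is every item whose
# transaction list contains all of those rows. Branches whose itemset is
# empty cannot yield rules and are cut.
TT = {
    "A": {1, 2, 5, 6}, "B": {1, 4, 8}, "C": {1, 2, 3, 4, 5, 8},
    "D": {1, 2, 3, 4, 6, 7, 8}, "E": {1, 2, 3, 4, 5}, "F": {3},
    "G": {1, 2, 3, 4, 8}, "H": {3}, "I": {3, 5, 6, 7}, "J": {7},
}
ROWS = sorted(set().union(*TT.values()))

def itemset_of(rows):
    return {item for item, txns in TT.items() if rows <= txns}

def enumerate_nodes(rows=frozenset(), candidates=None):
    """Yield (row set, itemset) for every node with a non-empty itemset."""
    if candidates is None:
        candidates = ROWS
    for i, r in enumerate(candidates):
        new_rows = rows | {r}
        items = itemset_of(new_rows)
        if not items:          # dead branch: no item shared by all rows
            continue
        yield new_rows, items
        yield from enumerate_nodes(new_rows, candidates[i + 1:])
```

For instance, the node for rows {1, 2, 3, 4} comes out with itemset {C, D, E, G}, matching the tree in the slides.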
Prune #1: Minimum Features
• The rule (I) → (ACEG) has the highest confidence. What about (AI) → (CEG)?
• In the itemset {A, B, C}: Support(A) ≤ Support(B) and Support(A) ≤ Support(C), so A is the minimum feature of {A, B, C}.
• As an itemset grows larger, its support stays the same or becomes smaller: σ(AI) ≤ σ(I).
• confidence = σ(itemset) / σ(antecedent)
• (A) → (B, C) is an I-spanning rule; (B, C) → (A) is not.

Prune #1: Pruning by Maximum Confidence
• σ_max(5) = 1 + 2 = 3 is the maximum support of node 5.
• The minimum feature in this itemset is I, with σ(I) = 4.
• Maximum confidence of node 5: conf_max(5) = σ_max(5) / σ(I) = 3/4.
• If the minimum confidence is 4/5, the child of node 5 will be pruned.

Prune #2: Maximum Features
• Itemset {CDEG} yields the rules C → DEG, E → CDG, G → CDE.
• The maximum feature set of CDEG is CEG.
• Prune Strategy #2: if the maximum feature set M of the itemset at node n is not empty, we can prune all child nodes of n whose itemsets are subsets of M.
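The confidence formula can be checked directly against the transposed table (a minimal sketch; the function names are mine, not from the paper):

```python
from fractions import Fraction

# confidence(X -> Y) = support(X ∪ Y) / support(X), computed from the
# slides' transposed table (item -> set of supporting transactions).
TT = {
    "A": {1, 2, 5, 6}, "B": {1, 4, 8}, "C": {1, 2, 3, 4, 5, 8},
    "D": {1, 2, 3, 4, 6, 7, 8}, "E": {1, 2, 3, 4, 5}, "F": {3},
    "G": {1, 2, 3, 4, 8}, "H": {3}, "I": {3, 5, 6, 7}, "J": {7},
}

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return len(set.intersection(*(TT[i] for i in itemset)))

def confidence(antecedent, consequent):
    """confidence(X -> Y) = support(X ∪ Y) / support(X)."""
    return Fraction(support(antecedent | consequent), support(antecedent))

# Rule C -> DEG from the slides' itemset {CDEG}:
print(confidence({"C"}, {"D", "E", "G"}))  # 2/3
```

Using `Fraction` keeps the ratios exact, which matters when comparing against a threshold such as 4/5.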
Prune #2: Example
• Itemset at node (1234): {CDEG}, with rules C → DEG, E → CDG, G → CDE.
• The maximum feature set of CDEG is {CEG}, which is the itemset of a child node.
• Node (12345) generates C → EG, E → CG, G → CE: sub-rules of the ones above, so that child can be pruned.

MAXCONF Algorithm
• MAXCONF walks the row enumeration tree applying both prunings:
• Pruning #1: the maximum-confidence bound, e.g. conf_max(5) = σ_max(5) / σ(I) = 3/4.
• Pruning #2: maximum feature sets, e.g. node (1234) with itemset {CDEG} and rules C → DEG, E → CDG, G → CDE.
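The Prune #1 bound reduces to a simple threshold check (hedged sketch: `can_prune` is my own name, and σ_max is taken as given from the slides' example, where σ_max(5) = 1 + 2 = 3):

```python
from fractions import Fraction

def can_prune(sigma_max, sigma_min_feature, min_conf):
    """Prune a node when the upper bound on any rule's confidence below it,
    conf_max(n) = sigma_max(n) / sigma(minimum feature), is under the
    minimum-confidence threshold."""
    return Fraction(sigma_max, sigma_min_feature) < min_conf

# Slide example: conf_max(5) = 3/4 < 4/5, so node 5's child is pruned.
print(can_prune(3, 4, Fraction(4, 5)))  # True
```

The bound is safe because no rule found below the node can exceed conf_max(n), so a branch failing the check cannot contribute any high-confidence rule.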
Evaluation
• Two aspects: 1. Rule generation 2. Scalability

Scalability: MAXCONF vs. RER II
• The performance of RER II is not affected by the minimum confidence.
• In most cases MAXCONF is faster than RER II; RER II only outperforms MAXCONF when the minimum support is higher than 40%.

Rule Generation
• When the minimum support is 0, RER II runs out of memory.
• MAXCONF generates more rules than RER II.

References
• MAXCONF: T. McIntosh, S. Chawla, "High Confidence Rule Mining for Microarray Analysis", 2006.
• RER II: G. Cong, K.-L. Tan, A. Tung, F. Pan, "Mining frequent closed patterns in microarray data", 2004.
Any Questions?
Thanks for your attention.