
Novel Computational and Integrative Tools for the Analysis of Gene Co-Expression Data



  1. Novel Computational and Integrative Tools for the Analysis of Gene Co-Expression Data
     Michael A. Langston
     Department of Computer Science, University of Tennessee
     (currently on leave to the Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA)
     UT-ORNL, 22 June 2005, DIMACS
     Graph Algorithms Research Laboratory, University of Tennessee

  2. Technology Mapping
     [Diagram with the headings "Biological Knowledge" and "Analysis Tools". Labels: Protein Structure, Ontology, Gene Regulatory Networks, Cis-Regulatory Elements, Sequence Homology, Quantitative Trait Loci, Protein Function, Combinatorial Algorithms, Cell Physiology, Bayesian Networks.]

  3. Many Network Actions
     • Cis and trans (direct and indirect) regulation
     • Post-transcriptional regulation (e.g., alternative splicing)
     • µRNA (e.g., functional RNA, RNAi and gene silencing)
     • All are forms of co-regulation.
     • Not to be confused with mere differential expression.
     • Thus the central problem is clique.
     • But it’s NP-complete to decide clique.
     • In fact it’s NP-hard even to approximate clique!
     • Nevertheless, with new mathematical tools (fixed-parameter tractability, FPT) we can solve clique optimally using vertex cover.
     • This confines the “combinatorial explosion” to the parameter.

  4. A Little Complexity Theory
     • The Classic View
     [Diagram: nested complexity classes P, NP, Σ₂P, …, PSPACE, annotated “easy”, “hard”, and “fuggetaboutit”.]

  5. Parameter Sensitivity: Instance(n, k)
     • Suppose our problem is, say, NP-complete.
     • Consider an algorithm with a time bound such as O(2^(k+n)).
     • And now one with a time bound more like O(2^k + n).
     • Both are exponential in parameter value(s).
     • But what happens when k is fixed?
     • FPT confines superpolynomial behavior to the parameter.
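To make the contrast between the two bounds concrete, here is a minimal sketch that simply evaluates both expressions for a fixed k and a growing n. The function names are illustrative and not from the talk; the numbers are operation counts implied by the formulas, not measured running times.

```python
def multiplicative_bound(n: int, k: int) -> int:
    """Step count of the form 2^(k+n): exponential in the input size n too."""
    return 2 ** (k + n)


def additive_bound(n: int, k: int) -> int:
    """Step count of the form 2^k + n: exponential in the parameter k only."""
    return 2 ** k + n


if __name__ == "__main__":
    k = 10  # fixed parameter value, chosen arbitrarily for illustration
    for n in (10, 100, 1000):
        # Print the additive bound exactly and the multiplicative bound
        # as a bit length, since the latter quickly becomes enormous.
        print(n, additive_bound(n, k), multiplicative_bound(n, k).bit_length())
```

With k held fixed, the second bound grows only linearly in n, which is the behavior the FPT bullet refers to.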

  6. A Little Complexity Theory
     • The Parameterized View
     [Diagram: nested parameterized classes FPT, W[1], W[2], …, XP, annotated “solvable” (even if NP-hard!), “heuristics only”, and “fuggetaboutit”.]

  7. On Solving Clique
     • Clique is a central problem all right, but it’s not FPT (unless the W hierarchy collapses).
     • Fortunately, Vertex Cover is FPT.
     • And Vertex Cover is a complementary dual to Clique.
     [Figure: a graph G and its complement Ḡ.]
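The duality rests on a standard fact: a vertex set C is a clique in G exactly when the remaining vertices V \ C form a vertex cover of the complement graph. A minimal sketch of that reduction follows. The brute-force cover search is only a stand-in to keep the example self-contained (it is not the FPT algorithm of the talk), and all function names are illustrative.

```python
from itertools import combinations


def complement(n, edges):
    """Edges of the complement of a simple graph on vertices 0..n-1."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(range(n), 2)
            if frozenset((u, v)) not in present]


def min_vertex_cover(n, edges):
    """Smallest vertex cover by exhaustive search (tiny graphs only)."""
    for size in range(n + 1):
        for candidate in combinations(range(n), size):
            chosen = set(candidate)
            if all(u in chosen or v in chosen for u, v in edges):
                return chosen
    return set(range(n))  # not reached: the full vertex set always covers


def max_clique_via_cover(n, edges):
    """Maximum clique of G = vertices outside a minimum cover of the complement."""
    cover = min_vertex_cover(n, complement(n, edges))
    return set(range(n)) - cover


if __name__ == "__main__":
    # A triangle 0-1-2 with a path 2-3-4 attached.
    g = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
    print(max_clique_via_cover(5, g))
```

In particular, the maximum clique size equals the number of vertices minus the minimum cover size of the complement.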

  8. Solving Vertex Cover
     [Diagram: four areas surrounding a central "Clique" box, flanked by "Intellectual Property" and "Available Technologies".]
     • COMPLEXITY THEORY: Problem Classification, Algorithm Selection
     • PARALLELISM AND GRIDS: Speedup, Collaboration
     • GRAPH ALGORITHMS: Modeling, Optimization
     • RECONFIGURATION: Hardware Acceleration, Fast Prototyping

  9. The Vertex Cover Project
     • use preprocessing via degree structures
     • then kernelize to reduce to a computational core
     • employ branching to explore the core
     • finally, interleave all three
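The slide does not name the specific rules used, so the following is only a sketch under stated assumptions: preprocessing is illustrated with the textbook degree-0 and degree-1 rules, kernelization with the high-degree rule and the k² edge bound usually attributed to Buss, and branching with the choice "take v, or take all of v's neighbors". Because the function recurses on the reduced graph, the three phases are interleaved. The name `vertex_cover` and the adjacency-dict representation are illustrative choices.

```python
def vertex_cover(adj, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    adj maps each vertex to the set of its neighbors (simple undirected graph).
    """
    g = {v: set(ns) for v, ns in adj.items()}  # private copy
    cover = set()

    def take(v):
        # Put v in the cover and delete it from the working graph.
        for u in g.pop(v):
            g[u].discard(v)
        cover.add(v)

    # Preprocessing: apply the degree rules until nothing changes.
    changed = True
    while changed:
        changed = False
        for v in list(g):
            if v not in g:
                continue
            budget = k - len(cover)
            if budget < 0:
                return None
            degree = len(g[v])
            if degree == 0:
                del g[v]                 # isolated vertex: never needed
            elif degree == 1:
                take(next(iter(g[v])))   # pendant vertex: take its neighbor
            elif degree > budget:
                take(v)                  # too many edges to cover otherwise
            else:
                continue
            changed = True

    budget = k - len(cover)
    if budget < 0:
        return None
    edge_count = sum(len(ns) for ns in g.values()) // 2
    if edge_count == 0:
        return cover
    # Kernel check: every degree is now <= budget, so a cover of size
    # <= budget can account for at most budget * budget edges.
    if edge_count > budget * budget:
        return None

    # Branching: either v is in the cover, or all of its neighbors are.
    v = max(g, key=lambda x: len(g[x]))
    for chosen in ({v}, set(g[v])):
        rest = {u: ns - chosen for u, ns in g.items() if u not in chosen}
        sub = vertex_cover(rest, budget - len(chosen))
        if sub is not None:
            return cover | chosen | sub
    return None
```

For example, on a 5-cycle the call with k = 2 is rejected by the kernel check alone (5 edges exceed 2²), while k = 3 succeeds after one branching step followed by pendant-vertex reductions.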

  10. Sample Grid Architecture
      [Diagram of a NetSolve-based grid. Labels: NetSolve Client, NetSolve Agent, NetSolve Servers, Distributed Storage, Middleware (NetSolve), Compute Resources (Grid Service Clusters), Foundational Fabric (Switches and Depots). Key: NetSolve’s program description file facility.]

  11. Hardware Acceleration
      • Reconfigurable devices
      • Very different algorithms
      • VHDL versus C
      • I/O is often the most critical resource
      • With current implementations, we are able to solve sub-instances:
        - of size 512 or less,
        - and with speedups north of about 125
      [Figure: FPGA chip]

  12. The Clique Compute Engine
      [Diagram. Labels: Pre-Processed Graph; Parametric Tuning, Decomposition, and Refinement; Cliques for Post-Processing; Highly Parallel Computation (an array of PEs); Recalcitrant Sub-problem; Reconfigurable Technology (FPGAs).]

  13. Vertex Cover Driver
      A simple mechanism. (Sometimes too simple.)
      [Diagram. Labels: Job List, Splitter, Job Scheduler, Open Socket, Handle Machine, Initialize Branching, ssh, Branching Processor 1, Branching Processor 2, …, Branching Processor N.]

  14. Distributed Subtree Splitting
      [Diagram: a search tree divided among processors 1 to 4.]
      • Processors 1, 2, and 3 are still active.
      • Pruning is needed at processor 4.
      • Send a subtree to the job queue.

  15. Sample Results on Protein Sequence Data

      | Graph Name | Graph Size | Cover Size | Instance Type | Sequential Kernelization | Sequential Branching | Parallel Branching | Dynamic Decomposition |
      |------------|------------|------------|---------------|--------------------------|----------------------|--------------------|-----------------------|
      | SH2-5      | 839        | 399        | Yes           | 34 seconds               | 7 seconds            | Not needed         | Not needed            |
      | SH2-5      | 839        | 398        | No            | 34 seconds               | 141 minutes          | 82 minutes         | 20 minutes            |
      | SH3-10     | 2466       | 2044       | Yes           | 203 minutes              | ~ 5 days             | ~ 5 days           | 140 minutes           |
      | SH3-10     | 2466       | 2043       | No            | 203 minutes              | 6+ days              | 6+ days            | 620 minutes           |

      • 32 PEs @ 500 MHz.
      • Load balancing is critical.
      • The hardest computations.
      • So the clique size for SH3-10 is 422 (2466 − 2044).
      • “No” is harder than “yes.”

  16. A Toolchain for Microarray Analysis
      [Flow diagram, top to bottom:]
      • cDNA or mRNA Microarrays
      • Raw Data
      • Normalization
      • Gene Expression Profiles
      • Compute Spearman’s Rank Coefficients
      • Edge-Weighted Graph
      • Filter With Threshold Value
      • Unweighted Graph
      • Pre-Processing Tools (e.g., Graph Separators and Partitioning)
      • Clique Extraction
      • Clique-Centric Toolkit (e.g., Maximum Clique*, All Maximal Cliques)
      • Genes of Interest
      • Post-Processing Tools (e.g., Neighborhood Search, Subgraph Expansion)
      • Validation**
      *NP-complete
      **Putative and Experimental
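The middle of this toolchain (profiles, Spearman coefficients, threshold filter, unweighted graph) can be sketched in a few lines of standard-library Python. Several things here are assumptions rather than details from the talk: the filter keeps an edge when the absolute value of the coefficient meets the threshold, every profile is assumed to be non-constant, and the gene names in the usage example are made up.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def coexpression_graph(profiles, threshold):
    """Unweighted edge set: gene pairs with |coefficient| >= threshold (assumed rule)."""
    edges = set()
    for g, h in combinations(sorted(profiles), 2):
        if abs(spearman(profiles[g], profiles[h])) >= threshold:
            edges.add((g, h))
    return edges


if __name__ == "__main__":
    # Hypothetical expression profiles across five samples.
    profiles = {
        "geneA": [1, 2, 3, 4, 5],
        "geneB": [2, 4, 5, 9, 20],
        "geneC": [5, 3, 4, 1, 2],
    }
    print(coexpression_graph(profiles, 0.85))
```

The all-pairs loop is quadratic in the number of genes, which is the step that produces an edge-weighted graph with a very large number of edges before filtering.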

  17. A Sample Study
      • Data acquisition depends on:
        - organism and tissue type
        - independent variable (e.g., time course, life stages)
        - chip technologies/vendors; cDNA vs mRNA
        - normalization methods, coefficient computations
      • In this particular study:
        - 32 Mus musculus RI strains
        - brain tissue
        - Affymetrix U74Av2 mRNA Arrays
        - MAS5.0 package, Spearman rank order

  18. Computational Experience
      • 12,422 probe set IDs (genes, vertices)
      • Over 100M edges
      • Employed a variety of thresholds
      • Many days of highly parallel CPU time
      • With the threshold set at 0.5:
        - the maximum clique size is 369
        - density made this a difficult computation
      • But we could do it via FPT:
        - contrast with brute force

  19. Zeroing in on Biological Relevance
      • Clique versus clustering
      • Too low a threshold produces large cliques, which can be hard to evaluate
      • Too high a threshold produces small cliques, which can exaggerate noise
      • Iterating, we settle on a threshold of 0.85:
        - maximum clique size is 17
        - there are 5227 maximal cliques
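The two figures reported for the 0.85 threshold (maximum clique size and number of maximal cliques) correspond to two different queries on the filtered graph. As an illustration only, the sketch below enumerates maximal cliques with the classic Bron-Kerbosch procedure with pivoting. This is a swapped-in textbook method, not the clique-centric toolkit described in the talk, and the toy graph is made up.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of a simple undirected graph.

    adj maps each vertex to the set of its neighbors.
    Classic Bron-Kerbosch recursion with a pivot to prune branches.
    """
    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored
        if not p and not x:
            yield set(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


if __name__ == "__main__":
    # Toy graph: a triangle 0-1-2 with a path 2-3-4 attached.
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
    cliques = list(maximal_cliques(adj))
    print(len(cliques), "maximal cliques")
    print("maximum clique size:", max(len(c) for c in cliques))
```

Counting the yielded sets gives the number of maximal cliques, and the largest of them is a maximum clique; on dense graphs the enumeration can be exponential, which is why the threshold choice matters.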
