
Challenge of bioinformatics: knowledge discovery
Jaak Vilo, vilo@ut.ee, biit.cs.ut.ee


  1. Challenge of bioinformatics: knowledge discovery
     Jaak Vilo, vilo@ut.ee, biit.cs.ut.ee
     20.09.2008

  2. Bioinformatics
     ■ Management, analysis and interpretation of biological data
     ■ Goal: gain new insights into the biology
     4 Sep 2008

  3. EMBL nucleotide DB
     Total nucleotides (current 216,455,190,745)
     Number of entries (current 136,401,022)
     http://www.ebi.ac.uk/embl/Services/DBStats/

  4. Many data types
     Sequence (DNA, RNA, Protein…)
     Structure
     Variation
     Gene Expression: 6340 experiments, 192031 assays available
     Protein expression
     Metabolism
     Interactions
     Regulation
     Imaging
     …
     1 assay = 10-100MB “image”, converted into values for all genes on assay…

  5. Computer science and bioinformatics (Jacques Cohen, CACM 2005)
     Communications of the ACM, Volume 48, Issue 3 (March 2005), The disappearing computer, Pages: 72 - 78
     ■ In barely half a century computer science has grown from infancy to maturity.
     ■ Computer scientists should be encouraged to learn biology and biologists computer science to prepare themselves for an intellectually stimulating and financially rewarding future in bioinformatics.

  6. Computer Literacy Interview With Donald Knuth, by Dan Doernberg, December 7th, 1993
     CLB: If you were a soon-to-graduate college senior or Ph.D. and you didn't have any "baggage", what kind of research would you want to do? Or would you even choose research again?
     Knuth: I think the most exciting computer research now is partly in robotics, and partly in applications to biochemistry. Robotics, for example, that's terrific. Making devices that actually move around and communicate with each other. Stanford has a big robotics lab now, and our plan is for a new building that will have a hundred robots walking the corridors, to stimulate the students. It'll be two or three years until we move in to the building. Just seeing robots there, you'll think of neat projects. These projects also suggest a lot of good mathematical and theoretical questions. And high level graphical tools, there's a tremendous amount of great stuff in that area too. Yeah, I'd love to do that... only one life, you know, but...
     CLB: Why do you mention biochemistry?
     Knuth: There's millions and millions of unsolved problems. Biology is so digital, and incredibly complicated, but incredibly useful. The trouble with biology is that, if you have to work as a biologist, it's boring. Your experiments take you three years and then, one night, the electricity goes off and all the things die! You start over. In computers we can create our own worlds. Biologists deserve a lot of credit for being able to slug it through. It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at peoples' fingertips, that it won't be pretty much working on refinements of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can't predict an unending growth. I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it's at that level.
     [Figure: levels of DNA structure, Level 0 (ATCGCTGAATTCCAATGTG) to Level 6] A eukaryotic genome can be thought of as six Levels of DNA structure. The loops at Level 4 range from 0.5kb to 100kb in length. If these loops were stabilized then the genes inside the loop would not be expressed.

  7. A simple gene
     DNA, gene, RNA, protein, gene regulation, …
     [Figure, panels A and B: a gene on double-stranded DNA (ATCGAAAT / TAGCTTTA) with upstream/downstream promoter regions; DNA: + modifications]

  8. From parts list to a system
     [Figure: network with nodes L, H, K, J, S]
     Undirected: $5+4+3+2+1 = 15$
     Directed graph: $5^2 = 25$
     Connection/not: $2^{15} = 32768$
     Activate/repress: $3^{15} = 14348907$
     20: 1M or 3400M
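The counts on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes that the 15 undirected links include self-links (which is what 5+4+3+2+1 suggests); the function name `network_counts` is made up for illustration.

```python
def network_counts(n):
    """Count possible wirings among n nodes (self-links allowed)."""
    pairs = n * (n + 1) // 2      # unordered pairs incl. self-links: 5+4+3+2+1 for n = 5
    ordered = n * n               # ordered pairs, i.e. possible directed links
    return {
        "undirected_links": pairs,
        "directed_links": ordered,
        "connection_or_not": 2 ** pairs,   # each link present or absent
        "activate_repress": 3 ** pairs,    # each link: activate, repress or absent
    }

counts = network_counts(5)
```

Even for five nodes the number of candidate networks runs into the millions, which is why exhaustive search over wirings is not an option for real gene sets.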

  9. Models and parameters
     [Figure: network with nodes L, H, K, J, S]
     Logical switches on nodes
     Boolean or continuous
     Firing thresholds, growth functions?
     Problem:
     ■ We have parts (?)
     ■ We may have partial info on wirings
       – Scientific literature
     ■ We have some observations under some conditions at some timepoints
     ■ What’s the content of the “black box”?
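As a minimal sketch of "logical switches on nodes" with firing thresholds, the code below runs a synchronous Boolean threshold network. The wiring, weights and thresholds are invented for illustration and are not taken from the slides; only the node names L, H, K are borrowed from the figure.

```python
def step(state, weights, thresholds):
    """One synchronous update: a node is on if its weighted input sum reaches its threshold."""
    return {
        node: sum(w * state[src] for src, w in weights.get(node, {}).items()) >= thresholds[node]
        for node in state
    }

# Hypothetical wiring: L sustains itself, L activates H and K, H represses K.
weights = {
    "L": {"L": 1},
    "H": {"L": 1},
    "K": {"L": 1, "H": -1},
}
thresholds = {"L": 1, "H": 1, "K": 1}

s0 = {"L": True, "H": False, "K": False}
s1 = step(s0, weights, thresholds)
s2 = step(s1, weights, thresholds)
s3 = step(s2, weights, thresholds)
```

Here K switches on briefly and is then shut off once its repressor H is active, after which the state no longer changes. A continuous model would replace the hard threshold with a growth function, at the cost of more parameters to estimate.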

  10. Reality
      ■ ~25,000 genes
      ■ ~1,000,000 proteins
      ■ ~300 + 10,000 cell types
      ■ Infinite nr of conditions
      ■ Other levels of control:
        – Micro-RNA
        – Chromosome level effects
        – Cell-cell signaling
        – …
      Mid-term Review: Embryonic Stem Cell (ES) key regulators
      Collaboration between James Adjaye, Jaak Vilo and Ioannis Xenarios
      OCT4, SOX2, NANOG

  11. siRNA knockdown of SOX2
      Identify positively and negatively affected gene lists
      [Figure: OCT4, SOX2, NANOG; gene lists SOX2 -> and SOX2 -|]
      Network reconstruction using gene expression
      [Figure: OCT4, SOX2, NANOG; gene lists SOX2 ->, SOX2 -|, OCT4 ->, OCT4 -|, NANOG ->, NANOG -|]
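One simple way to obtain the two gene lists is to threshold the log2 fold change between knockdown and control expression. This is a sketch under that assumption, not necessarily the procedure used in the project; the gene names and values below are placeholders. Genes that drop after the knockdown are candidates for the "SOX2 ->" list, genes that rise are candidates for "SOX2 -|".

```python
import math

def affected_genes(control, knockdown, min_abs_log2fc=1.0):
    """Split genes into those that go down and those that go up after the knockdown."""
    down, up = [], []
    for gene, base in control.items():
        log2fc = math.log2(knockdown[gene] / base)
        if log2fc <= -min_abs_log2fc:
            down.append(gene)   # lower without the regulator: candidate activation target
        elif log2fc >= min_abs_log2fc:
            up.append(gene)     # higher without the regulator: candidate repression target
    return sorted(down), sorted(up)

# Placeholder expression values (arbitrary units), not real measurements.
control = {"geneA": 8.0, "geneB": 2.0, "geneC": 5.0}
knockdown = {"geneA": 2.0, "geneB": 8.0, "geneC": 5.5}
activated, repressed = affected_genes(control, knockdown)
```

The lists contain both direct and indirect targets; separating the two needs additional evidence such as binding-site or motif data.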

  12. Network reconstruction using gene expression
      http://www.biology.emory.edu/research/Lucchesi/html/research.html

  13. Expression Profiler (2002): Pattern + Sequence + Expression data combined view
      Gene Expression (Chapter 3)
      $g_P \approx \alpha\,P_{m_B}\,t_X + \beta\,P_{m_C}\,t_Y + \gamma\,P_{m_D}\,t_Y$

  14. Can we model gene expression? (Chapter 3)
      $G_{ij} \approx \sum_{l,k} \alpha_{lk}\, M_{il}\, T_{kj}$
      Linear regression (Chapter 3)
      $G \approx MAT$
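The model above is linear in the coefficients $\alpha_{lk}$, so they can be estimated by ordinary least squares. The sketch below assumes that M is a gene-by-motif matrix, T a factor-by-condition matrix and A the matrix of $\alpha_{lk}$; these readings, the function names and the toy numbers are assumptions for illustration. It uses plain Python and the normal equations to stay dependency-free.

```python
from itertools import product

def predict(M, A, T):
    """G[i][j] = sum over l, k of A[l][k] * M[i][l] * T[k][j], i.e. G = M A T."""
    return [[sum(A[l][k] * M[i][l] * T[k][j]
                 for l in range(len(A)) for k in range(len(T)))
             for j in range(len(T[0]))] for i in range(len(M))]

def solve(S, b):
    """Solve S x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[r]] for r, row in enumerate(S)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[r][n] / aug[r][r] for r in range(n)]

def fit(M, T, G):
    """Least-squares estimate of A in G ≈ M A T via the normal equations."""
    pairs = list(product(range(len(M[0])), range(len(T))))
    # One regression row per (gene i, condition j); one feature per (motif l, factor k).
    X = [[M[i][l] * T[k][j] for l, k in pairs]
         for i in range(len(M)) for j in range(len(T[0]))]
    y = [G[i][j] for i in range(len(M)) for j in range(len(T[0]))]
    S = [[sum(r[a] * r[b] for r in X) for b in range(len(pairs))] for a in range(len(pairs))]
    v = [sum(r[a] * t for r, t in zip(X, y)) for a in range(len(pairs))]
    coef = solve(S, v)
    A = [[0.0] * len(T) for _ in range(len(M[0]))]
    for (l, k), c in zip(pairs, coef):
        A[l][k] = c
    return A

# Toy data: 3 genes x 2 motifs, 2 factors x 3 conditions (made-up numbers).
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
T = [[1.0, 2.0, 0.5], [0.0, 1.0, 3.0]]
A_true = [[2.0, -1.0], [0.5, 1.5]]
A_hat = fit(M, T, predict(M, A_true, T))
```

On noise-free toy data the fit recovers the coefficients up to rounding. With real data the normal equations can be ill-conditioned, so a regularized or QR-based solver would be the safer choice.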

  15. cc ~ expression+motifs
      [Figure: cell-cycle phases M/G1, G1, S, S/G2, G2/M, each annotated with knockout expression (KOexpr) and motif evidence for the regulators Mbp1, Swi4, Swi6, Ace2, Swi5, Mcm1, Fkh1, Fkh2; knockout data]
      Predict new knowledge
      ■ Provided some knowledge of elements on pathways (Reactome)
      ■ Predict missing elements and links using all available knowledge
        – Collaborations with biologists: verification

  16. Ongoing EU projects
      • Systems biology
        – Embryonic stem cell regulation
        – Pathway reconstruction (LKB1, TGFB, …)
        – Dry-lab and wet-lab connection
      • Cancer diagnostics
        – Patient data entry and management
        – Biomarker identification
      • Stem cell based toxicology profiling
        – Data management
        – Analysis
      http://biit.cs.ut.ee/software
      • Published (in 1 year)
      • Ongoing…

  17. Research Focus
      • Algorithms (Data Mining & Bioinformatics)
      • Tools (web based)
      • Databases & information systems
      • Gene regulation & Systems Biology
      • Cancer; Stem Cells
      • Microarray & other high-throughput data
      Fast Approximate Hierarchical Clustering using Similarity Heuristics
      Hierarchical clustering is applied in gene expression data analysis; the number of genes can be 20000+.
      Hierarchical clustering: each subtree is a cluster. The hierarchy is built by iteratively joining the two most similar clusters into a larger one.
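The iterative joining described above can be sketched as a naive agglomerative procedure. This is the textbook baseline, not the fast approximate method named in the slide title; single linkage and Euclidean distance are assumptions chosen to keep the example short, and the points are made up.

```python
import math
from itertools import combinations

def single_linkage(a, b, points):
    """Distance between two clusters: the closest pair of members."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points):
    """Return the merge history [(cluster_a, cluster_b, distance), ...]."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Join the two most similar clusters into a larger one.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], points))
        merges.append((a, b, single_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
merges = agglomerate(points)
```

Each merge corresponds to one internal node of the tree. This version recomputes cluster distances from scratch in every round, which is far too slow for 20000+ genes and motivates the approximate approach.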

  18. Fast Hierarchical Clustering
      Avoid calculating all $O(n^2)$ distances:
      – Estimate distances
      – Use pivots
      – Find close objects
      – Cluster with partial information
      Meelis Kull
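One common way to estimate distances with pivots relies on the triangle inequality: for any pivot p, |d(x,p) - d(y,p)| <= d(x,y) <= d(x,p) + d(y,p). The sketch below illustrates that general idea only; whether the method on the slide uses exactly these bounds is an assumption, and the function names and points are invented.

```python
import math

def pivot_table(points, pivots):
    """Distances from every point to every pivot: n * k values instead of n * n."""
    return [[math.dist(x, p) for p in pivots] for x in points]

def bounds(row_x, row_y):
    """Lower and upper bound on d(x, y) from the two points' pivot distances."""
    lower = max(abs(a - b) for a, b in zip(row_x, row_y))
    upper = min(a + b for a, b in zip(row_x, row_y))
    return lower, upper

points = [(0.0, 0.0), (1.0, 0.5), (4.0, 4.0), (9.0, 1.0)]
pivots = [points[0], points[3]]          # pivots picked arbitrarily for the example
table = pivot_table(points, pivots)
lo_12, hi_12 = bounds(table[1], table[2])
```

Pairs whose lower bound is already large can be skipped, so exact distances only need to be computed for candidate close objects.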

  19. MEM

  20. GraphWeb: mining biological networks for submodules with functional significance
      Jüri Reimand: GraphWeb. Genome Informatics, CSHL. Nov 1 2007
      • Genes as nodes
      • -omics define edges
        – expression correlation
        – protein-protein interactions
        – literature co-occurrence
        – regulation
        – binding site discovery

  21. Gene modules
      • Integrate data sources as graph layers
      • Find well-connected subgraphs
      • Combine evidence to infer knowledge about regulation and function
      [Figure annotations: GO: cell cycle, regulation, growth. KEGG: Alzheimer’s disease]
      Data as graphs .. everything is interconnected
      Public datasets for H.sapiens:
      • IntAct: Protein interactions (PPI), 18773 interactions
      • IntAct: PPI via orthologs from IntAct, 6705 interactions
      • MEM: gene expression similarity over 89 tumor datasets, 46286 interactions
      • Transfac: gene regulation data, 5183 interactions
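The idea of layering data sources and looking for well-connected subgraphs can be sketched as follows. This is not GraphWeb's algorithm: it simply sums hypothetical per-layer weights for each edge and takes connected components above a score threshold as a crude stand-in for module detection. Gene names, layer weights and edges are placeholders.

```python
from collections import defaultdict

def merge_layers(layers):
    """Sum the layer weight of every undirected edge across all layers."""
    combined = defaultdict(float)
    for weight, edges in layers:
        for u, v in edges:
            combined[tuple(sorted((u, v)))] += weight
    return combined

def modules(combined, threshold):
    """Connected components of the graph restricted to edges scoring >= threshold."""
    adj = defaultdict(set)
    for (u, v), score in combined.items():
        if score >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        result.append(comp)
    return result

# Placeholder layers: (weight, edge list), e.g. PPI, expression similarity, regulation.
layers = [
    (1.0, [("g1", "g2"), ("g2", "g3")]),
    (1.0, [("g2", "g1"), ("g4", "g5")]),
    (0.5, [("g2", "g3"), ("g4", "g5")]),
]
combined = merge_layers(layers)
found = sorted(sorted(m) for m in modules(combined, threshold=1.5))
```

Edges supported by several layers score higher, so raising the threshold keeps only links with combined evidence. A real tool would then test each module for enriched annotations such as GO or KEGG terms.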
