
Comparative Genomics: Sequence, Structure, and Networks - PowerPoint PPT Presentation


  1. Comparative Genomics: Sequence, Structure, and Networks. Bonnie Berger, MIT

  2. Comparative Genomics: Look at the same kind of data across species with the hope that areas of high correlation correspond to functional parts or modules of the genome.

  3. Biology in One Slide (figure labels: Protein, Function)

  4. Comparative Genomics of DNA (figure labels: Function, Protein)

  5. Multiple Species Comparison: Look at multiple species simultaneously

  6. Application to Regulatory Motif Discovery. Species compared: S. cer, S. par, S. mik, S. bay. Evaluate conservation within:
       (1) All intergenic regions: Gal4 22%, controls 5%
       (2) Intergenic : coding: Gal4 4:1, controls 1:3
       (3) Upstream : downstream: Gal4 12:1, controls 1:1
     Together these form a signature for regulatory motifs.
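
The conservation test on this slide can be illustrated with a minimal sketch. It assumes gap-free aligned sequences of equal length and uses the commonly cited Gal4 consensus CGG-N11-CCG as the motif. The function name `conserved_fraction` and the toy sequences are hypothetical and are not taken from the talk.

```python
import re

# Commonly cited Gal4 binding-site consensus: CGG, any 11 bases, CCG.
GAL4 = re.compile(r"CGG[ACGT]{11}CCG")

def conserved_fraction(aligned_seqs, motif=GAL4):
    """Fraction of motif instances in the first (reference) sequence that
    also match at the same position in every other aligned sequence.

    Assumption: sequences are already aligned and contain no gaps, so a
    shared index means a shared alignment column.
    """
    reference, others = aligned_seqs[0], aligned_seqs[1:]
    starts = [m.start() for m in motif.finditer(reference)]
    if not starts:
        return 0.0
    conserved = sum(
        all(motif.match(seq, pos) is not None for seq in others)
        for pos in starts
    )
    return conserved / len(starts)

# Toy data (hypothetical): one intact site and one site with a point mutation.
site = "CGG" + "A" * 11 + "CCG"
broken = "CGA" + "A" * 11 + "CCG"
reference = "TT" + site + "GG" + site
diverged = "TT" + site + "GG" + broken
fraction = conserved_fraction([reference, reference, diverged, reference])
```

Here the first site is intact in all four toy sequences and the second is disrupted in one of them, so half of the reference instances count as conserved. Running the same measurement on intergenic and coding regions, or upstream and downstream regions, gives ratios of the kind listed on the slide.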

  7. Result Highlights [Kellis, Patterson, Birren, Berger*, Lander* (2004). RECOMB, 157-166; J. Comp Biol special issue, 11:2-3, 319-355; Kellis et al. Nature (2003)]
     • Identify gene correspondence across species for > 90% of genes in yeast.
       – 99.9% sensitivity and 99% specificity on 4000 known genes.
       – Refine boundaries of hundreds of genes (5700 genes total).
     • Identify most previously known and 41 novel regulatory motifs.
       – Genome-wide, unbiased search.
       – No previous knowledge necessary.

  8. Comparative Genomics of RNA (figure labels: Function, Protein)

  9. RNA Secondary Structure Detection. Problem: identify biologically significant RNA secondary structure. Challenge: any given single sequence will have a plausible secondary structure. Figure labels: hairpin loops, interior loops, stems, multi-branched loop, bulge loop; base pairs A-U, G-C, G-U.

  10. Compensatory Mutations. Given K orthologous aligned RNA sequences: if the i-th and j-th positions are base-paired in many organisms, then their nucleotides must covary.

  11. Compensatory Mutations. Given K orthologous aligned RNA sequences: if the i-th and j-th positions are base-paired in many organisms, then their nucleotides must covary.
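
One standard way to quantify covariation between two alignment columns is mutual information. The sketch below is illustrative only: the slides do not say which statistic MSARi uses, and the function name and toy alignment are assumptions.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information, in bits, between columns i and j of an alignment.

    `alignment` is a list of K equal-length aligned sequences.
    """
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    n = len(alignment)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        joint = count / n
        mi += joint * math.log2(joint / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Toy alignment (hypothetical): columns 0 and 4 change together and stay
# complementary (G-C, C-G, A-U, U-A); column 1 never varies.
toy = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
paired = mutual_information(toy, 0, 4)
unpaired = mutual_information(toy, 0, 1)
```

In this toy case the covarying pair scores 2 bits, the maximum for four equally frequent nucleotides, and the pair involving the invariant column scores 0. Mutual information alone does not check that the covarying bases are complementary, so a real detector would need an additional test.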

  12. Approaches to Secondary Structure Detection
      • Statistical
        – Stochastic context-free grammars for 2-species comparison (QRNA)
        – Machine learning (RNAGenie)
        – Our approach: statistical significance across multiple species (MSARi)
      • Homology
        – Train on a particular RNA secondary structure and try to predict that structure

  13. Result Highlights [Coventry, Kleitman, Berger (2004), PNAS]
      • Identifies RNA secondary structure with 90% sensitivity at 98% specificity.
        – No previous knowledge necessary.
      • Used to identify functionally significant RNA secondary structure in mRNA.
      • Can be used to scan multiple genomes for RNA secondary structure.
      • Benchmarks:
        – QRNA: 19.8% sensitivity at 98% specificity
        – ddbRNA: 68% sensitivity at 97.7% specificity
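
For reference, the sensitivity and specificity figures quoted in these results follow the standard definitions sketched below. The counts are hypothetical, chosen only to reproduce the 90% and 98% values, and do not come from the paper.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real structures that the method detects."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-structures that the method correctly rejects."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts: 100 true structures and 100 decoys.
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=98, false_pos=2)
```

Comparing methods at a matched specificity, as the benchmark lines do, keeps the false-positive rate fixed so that the sensitivity values are directly comparable.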

  14. Comparative Genomics of Proteins (figure labels: Function, Protein)

  15. Protein Structure: The Protein Folding Problem. Given an amino acid sequence, e.g., MDPNCSCAAAGDSCTCANSCTCLACKCTSCK, how will it fold in 3D? Proteins must fold to function. Some diseases are caused by misfolding, e.g., mad cow disease.

  16. Protein Folding by Comparative Modeling
      • Similar protein sequences → similar structures
      • Use known structures to predict a new one
      • About 40,000 protein structures have been solved using experimental techniques and stored in the Protein Data Bank (PDB); ~1000 are unique structural folds
      Figure panels: same structural folds; different structural folds

  17. Protein Threading. Query sequence: DRVYIHPF A. Figure: the best match. Threading = match between a string and a 3D object

  18. Result Highlights
      • RAPTOR: threading as Linear-Programming (Jinbo Xu)
        $$\min\; E = \sum a_{i,l}\, x_{i,l} + \sum b_{(i,l)(j,k)}\, y_{(i,l)(j,k)}$$
        subject to
        $$x_{i,l} = \sum_{k \in R[i,j,l]} y_{(i,l)(j,k)}, \quad \forall\, l \in D[i] \qquad (1)$$
        $$x_{j,k} = \sum_{l \in R[j,k,i]} y_{(i,l)(j,k)}, \quad \forall\, k \in D[j] \qquad (2)$$
        $$\sum_{l \in D[i]} x_{i,l} = 1$$
        $$x_{i,l},\; y_{(i,l)(j,k)} \in \{0, 1\}$$
        Figure: a structural template with numbered positions, threaded with an input sequence (… TNLAKYETL).
      • RAPTOR was the best performing algorithm at CAFASP, a worldwide competition

  19. Threading Protein Complexes. Figure: sequences A (RGPPQLIK…) and B (EGAATQY…) are the inputs to DBLRAP.
      • DBLRAP is our extension of RAPTOR for joint homology modeling of two structures (PSB 06)
      • Extend LP formulation to score interfaces between two structures as well
      • DBLRAP was able to predict interactions for 8% of proteins in the yeast genome (cf. 5% previously)

  20. Structure Alignment

  21. Protein Structure Alignment. Problem: find the optimal alignment between two protein structures

  22. Contact Map Alignment. Goal: find the maximum common subgraph. Contact: distance ≤ Du (5 Å to 7.5 Å)
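
A minimal sketch of the two ingredients on this slide: building a contact map from 3D coordinates with a distance cutoff Du, and scoring a given alignment by the number of contacts it preserves (the quantity a maximum common subgraph search would maximize). It assumes one representative point per residue; the function names and toy coordinates are hypothetical.

```python
import math

def contact_map(coords, d_u=7.5):
    """Set of residue index pairs (i, j), i < j, whose distance is at most
    d_u angstroms. `coords` holds one (x, y, z) point per residue."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= d_u:
                contacts.add((i, j))
    return contacts

def shared_contacts(map_a, map_b, alignment):
    """Number of contacts in map_a that map onto contacts in map_b under
    `alignment`, a dict from residue index in A to residue index in B."""
    shared = 0
    for i, j in map_a:
        if i in alignment and j in alignment:
            pair = tuple(sorted((alignment[i], alignment[j])))
            if pair in map_b:
                shared += 1
    return shared

# Toy structures (hypothetical): residues spaced 4 angstroms apart in a line.
map_a = contact_map([(0, 0, 0), (4, 0, 0), (8, 0, 0), (20, 0, 0)], d_u=5.0)
map_b = contact_map([(0, 0, 0), (0, 4, 0), (0, 8, 0)], d_u=5.0)
score = shared_contacts(map_a, map_b, {0: 0, 1: 1, 2: 2})
```

Scoring one alignment is cheap. The difficulty lies in searching over all alignments for the one with the highest score.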

  23. State of the Art: Contact Map Alignment
      • History: more than 20 years, many programs based on heuristic algorithms
      • NP-hard and hard to approximate when measured by Maximum Common Subgraph (Goldman, Papadimitriou & Istrail, FOCS 99)
      • Lagrangian relaxation (Caprara & Lancia, Recomb 2002)
      • Integer programming (Caprara et al, JCB 2004)

  24. Tree-Decomposition for Protein Structure Alignment
      Method: tree-decomposition of one protein structure into small pieces to exploit the geometric characteristics of a structure.
      Results: there is a polynomial-time approximation scheme (PTAS) that finds an alignment scoring at least (1-1/k) of the best. Its time complexity is expressed in terms of k, poly(N), Δ, ε, the tree width tw, and the parameters D, Dc, and Dl.
      The parameters D, Dc, and Dl are small constants, and so is D/Dl. Therefore, this problem admits a PTAS, the best that we can achieve since this problem is NP-hard.

  25. Biological Applications of Tree Decomposition
      • Sidechain packing (Xu & Berger, Recomb ’05, JACM ’06)
      • Protein threading (Xu, Jiao, & Berger, CSB ’05)
      • Network motif search (Dost et al, Recomb ’07)
      • RNA secondary structure alignment (Song et al, CSB ’05)
      • De novo sequencing (Liu et al, PSB ’06)
      • Protein structure alignment (Xu, Jiao, & Berger, Recomb ’06, JCB ’06; Xu, CDC ’07)

  26. Comparative Genomics of Networks (figure labels: Protein, Function)

  27. Why understanding function-level differences is important
      • Increased complexity (function) is not explained simply by variations in gene (or protein) count
      Estimated number of genes, one value per pictured species: 6600, 21000, 14000, 24500, 23000
      Estimated number of proteins, same species order: 6600, 27000, 19000, 32000, 49000
      Numbers from http://www.ensembl.org

  28. Protein-Protein Interactions (PPIs)
      • Often, proteins interact with other proteins to perform their functions
      • Many cellular activities are a result of protein interactions
      Figure: MAPK Signaling Cascade. Image from: http://focosi.altervista.org/mapkmap2.html

  29. Modeling PPIs
      • Traditional perspective: low-throughput, structural
      • New perspective: high-throughput, network-based
      Figure: G-protein complex (GDP, Gα, Gβ, Gγ) shown from the traditional perspective and from the new systems-level perspective. Image from www.rcsb.org

  30. Protein-Protein Interaction (PPI) Network. Figure: yeast 2-hybrid method ("X + Y = ?") and the yeast PPI network. Cusick et al. Hum Med Gen, 05; http://internal.binf.ku.dk

  31. Motivation behind Network Comparison
      • Compare PPI networks at the species level
      • Transfer annotation from one species to another
        – More feasible, cheaper and easier than in humans
        – Error detection
      • Compute functional orthologs
        – Functional orthologs: proteins which perform the same function across species

  32. The Problem: Given two protein-protein interaction networks, find, for a piece of one network, something that has a comparable structure in the other network. Our approach: match neighborhood topologies.

  33. Algorithm: IsoRank. Figure: two networks with nodes a1 to a8 and b1 to b9, annotated with scores for node pairings.
      Sequence similarity: a5-b7 1e-2; a5-b1 2e-8; a5-b3 1e-7; a5-b9 1e-4; a3-b1 5e-4; …
      Functional similarity for each possible node pairing: a5-b7 2.1; a5-b9 1.5; a3-b2 3.4
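
The idea of matching neighborhood topologies can be sketched as a small power iteration: a node pairing scores highly when the neighbors of the two nodes also pair up well. The slides do not give the update rule, so the code below is a toy in the spirit of IsoRank as published, and its details are assumptions: the mixing weight `alpha`, the uniform prior used when no sequence scores are given, and the greedy pairing step. It also assumes higher-is-better similarity scores, so E-values like those in the figure would need converting first.

```python
def isorank_scores(adj_a, adj_b, seq_sim=None, alpha=0.8, iters=100):
    """Score every node pairing (i, j) across two undirected networks.

    adj_a, adj_b: dict mapping node -> list of neighbors (symmetric).
    seq_sim: optional dict (i, j) -> non-negative similarity with at least
             one positive entry; a uniform prior is used if it is omitted.
    alpha: weight on topology versus the prior; assumed to be < 1.
    """
    pairs = [(i, j) for i in sorted(adj_a) for j in sorted(adj_b)]
    raw = {p: (seq_sim.get(p, 0.0) if seq_sim else 1.0) for p in pairs}
    total = sum(raw.values())
    prior = {p: w / total for p, w in raw.items()}
    scores = dict(prior)
    for _ in range(iters):
        updated = {}
        for i, j in pairs:
            # Each neighbor pairing (u, v) spreads its score evenly over
            # all pairings of u's neighbors with v's neighbors.
            topo = sum(
                scores[(u, v)] / (len(adj_a[u]) * len(adj_b[v]))
                for u in adj_a[i] for v in adj_b[j]
            )
            updated[(i, j)] = alpha * topo + (1 - alpha) * prior[(i, j)]
        norm = sum(updated.values())
        scores = {p: s / norm for p, s in updated.items()}
    return scores

def greedy_match(scores):
    """Pair nodes greedily, taking the highest-scoring free pairing first."""
    mapping, used_b = {}, set()
    for (i, j), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if i not in mapping and j not in used_b:
            mapping[i] = j
            used_b.add(j)
    return mapping

# Toy networks (hypothetical): two three-node paths.
net_a = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2"]}
net_b = {"b1": ["b2"], "b2": ["b1", "b3"], "b3": ["b2"]}
scores = isorank_scores(net_a, net_b)
mapping = greedy_match(scores)
```

With no sequence information the two path centers, a2 and b2, receive the highest score, because they are the only nodes whose neighborhoods match. The end nodes tie, which is where sequence similarity would break the symmetry.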
