Comparative Genomics: Sequence, Structure, and Networks
Bonnie Berger, MIT
Comparative Genomics
Look at the same kind of data across species, with the hope that areas of high correlation correspond to functional parts or modules of the genome.
Biology in One Slide
[Figure labels: Protein, Function]
Comparative Genomics of DNA
[Figure labels: Function, Protein]
Multiple Species Comparison
Look at multiple species simultaneously.
Application to Regulatory Motif Discovery
Species compared: S. cer, S. par, S. mik, S. bay
Evaluate conservation within:      Gal4    Controls
(1) All intergenic regions         22%     5%
(2) Intergenic : coding            4:1     1:3
(3) Upstream : downstream          12:1    1:1
A signature for regulatory motifs
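The conservation measurement above can be illustrated with a minimal sketch. It assumes a gap-free multiple alignment with the reference species first and an exact-match motif. The function name `conserved_fraction`, the motif "CGG", and the toy alignments are hypothetical and are not taken from the slides.

```python
def conserved_fraction(alignment, motif):
    """Count motif hits in the reference (first row) and how many of
    those hits are identical in every other aligned species.

    Assumption: rows have equal length and contain no gaps.
    Returns (hits, conserved).
    """
    reference = alignment[0]
    hits = 0
    conserved = 0
    start = reference.find(motif)
    while start != -1:
        hits += 1
        end = start + len(motif)
        # A hit is conserved if all other species match at the same columns.
        if all(row[start:end] == motif for row in alignment[1:]):
            conserved += 1
        start = reference.find(motif, start + 1)
    return hits, conserved


# Toy alignments for four species (made-up sequences).
intergenic = ["ACGGTTCGGA", "ACGGTTCGGA", "ACGGTACGGA", "ACGGTTCGCA"]
coding = ["TTCGGAACGG", "TTCGAAACGG", "TTCGGAACAG", "TTCGGAACGG"]

intergenic_counts = conserved_fraction(intergenic, "CGG")
coding_counts = conserved_fraction(coding, "CGG")
print("intergenic (hits, conserved):", intergenic_counts)
print("coding (hits, conserved):", coding_counts)
```

Comparing the two pairs gives an intergenic : coding conservation contrast of the kind tabulated on the slide. A real analysis would also need to handle alignment gaps and degenerate motifs.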
Result Highlights
[Kellis, Patterson, Birren, Berger*, Lander* (2004). RECOMB, 157-166; J. Comp Biol special issue, 11:2-3, 319-355; Kellis et al. Nature (2003)]
• Identify gene correspondence across species for > 90% of genes in yeast.
  – 99.9% sensitivity and 99% specificity on 4000 known genes.
  – Refine boundaries of hundreds of genes (5700 genes total).
• Identify most previously known and 41 novel regulatory motifs.
  – Genome-wide, unbiased search.
  – No previous knowledge necessary.
Comparative Genomics of RNA
[Figure labels: Function, Protein]
RNA Secondary Structure Detection
Problem: Identify biologically significant RNA secondary structure.
Challenge: Any given single sequence will have a plausible secondary structure.
[Figure labels: hairpin loops, interior loops, stems, multi-branched loop, bulge loop; base pairs A-U, G-C, G-U]
Compensatory Mutations
Given K orthologous aligned RNA sequences: if the i-th and j-th positions are base-paired in many organisms, then their nucleotides must covary.
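One common way to quantify covariation between two alignment columns is mutual information. The sketch below shows that generic measure on a made-up alignment. It is not the MSARi statistic, and the function name `mutual_information` and the toy sequences are assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    Assumption: equal-length, gap-free rows. High values mean the two
    columns change together, as expected for base-paired positions.
    """
    n = len(alignment)
    count_i = Counter(row[i] for row in alignment)
    count_j = Counter(row[j] for row in alignment)
    count_ij = Counter((row[i], row[j]) for row in alignment)
    total = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        total += p_ab * math.log2(p_ab / (p_a * p_b))
    return total


# Toy alignment: columns 0 and 4 always form a Watson-Crick pair,
# while column 1 never changes.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
paired = mutual_information(toy_alignment, 0, 4)
unpaired = mutual_information(toy_alignment, 0, 1)
print("columns 0 and 4:", paired)
print("columns 0 and 1:", unpaired)
```

In the toy data the paired columns carry the maximum information for four equally frequent pairs, while the invariant column carries none. A perfectly conserved pair would also score zero, which is one reason real methods add further statistics.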
Approaches to Secondary Structure Detection
• Statistical
  – Stochastic context-free grammars for 2-species comparison (QRNA)
  – Machine learning (RNAGenie)
  – Our approach: statistical significance across multiple species (MSARi)
• Homology
  – Train on a particular RNA secondary structure and try to predict that structure
Result Highlights
[Coventry, Kleitman, Berger (2004), PNAS]
• Identifies RNA secondary structure with 90% sensitivity at 98% specificity.
  – No previous knowledge necessary.
• Used to identify functionally significant RNA secondary structure in mRNA.
• Can be used to scan multiple genomes for RNA secondary structure.
• Benchmarks:
  – QRNA: 19.8% sensitivity at 98% specificity
  – ddbRNA: 68% sensitivity at 97.7% specificity
Comparative Genomics of Proteins
[Figure labels: Function, Protein]
Protein Structure: The Protein Folding Problem
Given an amino acid sequence, e.g., MDPNCSCAAAGDSCTCANSCTCLACKCTSCK, how will it fold in 3D?
• Proteins must fold to function.
• Some diseases are caused by misfolding, e.g., mad cow disease.
Protein Folding by Comparative Modeling
• Similar protein sequences ⇒ similar structures
• Use known structures to predict a new one
• About 40,000 protein structures have been solved using experimental techniques and stored in the Protein Data Bank (PDB); ~1000 are unique structural folds
[Figure panels: Same structural folds; Different structural folds]
Protein Threading
Threading = match between a string and a 3D object
[Figure: query sequence DRVYIHPF and its best match]
Result Highlights
• RAPTOR: threading as linear programming (Jinbo Xu)

Minimize
$$E = \sum a_{i,l}\, x_{i,l} + \sum b_{(i,l)(j,k)}\, y_{(i,l)(j,k)}$$
subject to
$$x_{i,l} = \sum_{k \in R[i,j,l]} y_{(i,l)(j,k)}, \quad \forall\, l \in D[i] \qquad (1)$$
$$x_{j,k} = \sum_{l \in R[j,k,i]} y_{(i,l)(j,k)}, \quad \forall\, k \in D[j] \qquad (2)$$
$$\sum_{l \in D[i]} x_{i,l} = 1$$
$$x_{i,l},\; y_{(i,l)(j,k)} \in \{0, 1\}$$

[Figure: a structural template with numbered positions aligned to the input sequence …TNLAKYETL…]

RAPTOR was the best-performing algorithm at CAFASP, a worldwide competition.
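To make the objective concrete, here is a minimal brute-force sketch on a toy instance. It enumerates assignments instead of solving a linear program, so it is not RAPTOR's solver. It assumes the usual threading reading of the variables, which the slide does not spell out: x_{i,l} = 1 when template segment i is placed at sequence position l, and y_{(i,l)(j,k)} = 1 when both placements are chosen. The function name `thread_brute_force` and all scores are made up.

```python
from itertools import product


def thread_brute_force(domains, singleton, pairwise, edges):
    """Minimize E = sum of singleton scores + sum of pairwise scores.

    domains[i]         -- allowed sequence positions D[i] for segment i
    singleton[i][l]    -- score a_{i,l}
    pairwise[(i, j)]   -- dict mapping (l, k) to score b_{(i,l)(j,k)}
    edges              -- interacting segment pairs (i, j)

    Picking exactly one position per segment plays the role of the
    constraint sum_l x_{i,l} = 1; segments must stay in sequence order.
    Returns (best_energy, best_positions).
    """
    best = None
    for positions in product(*domains):
        if any(p >= q for p, q in zip(positions, positions[1:])):
            continue  # segments would overlap or be out of order
        energy = sum(singleton[i][l] for i, l in enumerate(positions))
        energy += sum(pairwise[(i, j)][(positions[i], positions[j])]
                      for i, j in edges)
        if best is None or energy < best[0]:
            best = (energy, positions)
    return best


# Toy instance: two segments, two candidate positions each (integer scores).
domains = [[0, 1], [2, 3]]
singleton = [{0: 10, 1: 5}, {2: 2, 3: 4}]
pairwise = {(0, 1): {(0, 2): -10, (0, 3): 0, (1, 2): 3, (1, 3): -5}}
edges = [(0, 1)]

result = thread_brute_force(domains, singleton, pairwise, edges)
print(result)
```

In this toy instance the best singleton choices alone would place the segments at positions 1 and 2, but the pairwise term makes positions 0 and 2 the overall minimum. That interaction is what the y variables capture in the formulation above.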
Threading Protein Complexes
[Figure: sequences A (RGPPQLIK…) and B (EGAATQY…) are threaded together by DBLRAP]
• DBLRAP is our extension of RAPTOR for joint homology modeling of two structures (PSB ’06)
  – Extends the LP formulation to also score interfaces between the two structures
• DBLRAP was able to predict interactions for 8% of proteins in the yeast genome (cf. 5% previously)
Structure Alignment
Protein Structure Alignment
Problem: find the optimal alignment between two protein structures.
Contact Map Alignment
Goal: find the maximum common subgraph of the two contact maps.
Two residues are in contact when their distance is ≤ Du (5 Å to 7.5 Å).
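As a minimal sketch of the objects involved, the code below builds a contact map from 3D coordinates and counts the contacts two maps share under a given residue mapping. It only scores a candidate alignment and does not search for the optimal one, which is the hard part. The function names, coordinates, and threshold value are hypothetical.

```python
import math


def contact_map(coords, du):
    """Return the set of residue pairs (i, j), i < j, within distance du."""
    n = len(coords)
    return {(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) <= du}


def shared_contacts(contacts_a, contacts_b, mapping):
    """Count contacts of A whose mapped endpoints are also in contact in B.

    mapping -- dict from residue index in A to residue index in B;
               unmapped residues are treated as unaligned.
    """
    count = 0
    for i, j in contacts_a:
        if i in mapping and j in mapping:
            image = tuple(sorted((mapping[i], mapping[j])))
            if image in contacts_b:
                count += 1
    return count


# Toy coordinates in angstroms (made up), with Du = 5.
protein_a = [(0, 0, 0), (4, 0, 0), (8, 0, 0)]
protein_b = [(0, 0, 0), (0, 4, 0), (0, 4, 4)]
map_a = contact_map(protein_a, 5.0)
map_b = contact_map(protein_b, 5.0)

identity_score = shared_contacts(map_a, map_b, {0: 0, 1: 1, 2: 2})
swapped_score = shared_contacts(map_a, map_b, {0: 0, 1: 2, 2: 1})
print(identity_score, swapped_score)
```

The maximum common subgraph goal corresponds to choosing the mapping that maximizes this count.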
State of the Art: Contact Map Alignment
• History: more than 20 years, many programs based on heuristic algorithms
• NP-hard and hard to approximate when measured by Maximum Common Subgraph (Goldman, Papadimitriou & Istrail, FOCS 99)
• Lagrangian relaxation (Caprara & Lancia, Recomb 2002)
• Integer programming (Caprara et al, JCB 2004)
Tree-Decomposition for Protein Structure Alignment
Method: tree-decomposition of one protein structure into small pieces, to exploit the geometric characteristics of a structure.
Results: there is a polynomial-time approximation scheme (PTAS) that finds an alignment scoring at least (1 − 1/k) of the best.
[Equations: the time-complexity bound, polynomial in N, with the parameters Δ and tw given in terms of ε, k, D, Dc, and Dl]
The parameters D, Dc, and Dl are small constants, and so is D/Dl. Therefore, this problem admits a PTAS, the best that we can achieve since this problem is NP-hard.
Biological Applications of Tree Decomposition
• Sidechain packing (Xu & Berger, Recomb ’05, JACM ’06)
• Protein threading (Xu, Jiao, & Berger, CSB ’05)
• Network motif search (Dost et al, Recomb ’07)
• RNA secondary structure alignment (Song et al, CSB ’05)
• De novo sequencing (Liu et al, PSB ’06)
• Protein structure alignment (Xu, Jiao, & Berger, Recomb ’06, JCB ’06; Xu, CDC ’07)
Comparative Genomics of Networks
[Figure labels: Protein, Function]
Why understanding function-level differences is important
• Increased complexity (function) is not explained simply by variations in gene (or protein) count
Estimated number of genes:       6600    21000    14000    24500    23000
Estimated number of proteins:    6600    27000    19000    32000    49000
[One column per species; the species were shown as images]
Numbers from http://www.ensembl.org
Protein-Protein Interactions (PPIs)
• Often, proteins interact with other proteins to perform their functions
• Many cellular activities are a result of protein interactions
[Figure: MAPK Signaling Cascade. Image from: http://focosi.altervista.org/mapkmap2.html]
Modeling PPIs
• Traditional perspective: low-throughput, structural
• New perspective: high-throughput, network-based
[Figure: the G-protein complex (Gα, Gβ, Gγ, GDP) shown from the traditional perspective and from the new systems-level perspective. Image from www.rcsb.org]
Protein-Protein Interaction (PPI) Network
[Figure: yeast two-hybrid method (X + Y = ?) and the yeast PPI network. Image sources: Cusick et al. Hum Med Gen, 05; http://internal.binf.ku.dk]
Motivation behind Network Comparison
• Compare PPI networks at the species level
• Transfer annotation from one species to another
  – More feasible, cheaper and easier than in humans
  – Error detection
• Compute functional orthologs
  – Functional orthologs: proteins which perform the same function across species
The Problem
Given two protein-protein interaction networks, find, for a piece of one network, a piece of the other network that has a comparable structure.
Our approach: match neighborhood topologies.
Algorithm: IsoRank
[Figure: two PPI networks, one with nodes a1 to a8 and one with nodes b1 to b9]
Sequence similarity:
a5  b7  1e-2
a5  b1  2e-8
a5  b3  1e-7
a5  b9  1e-4
a3  b1  5e-4
a3  b6  3e-9
…
Functional similarity for each possible node pairing:
a5  b7  2.1
a5  b9  1.5
a3  b2  3.4
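The slides do not give the update rule, so the following is only a sketch of a neighborhood-matching iteration in the spirit of IsoRank. It assumes that a node pair scores highly when its neighbors pair up with high scores, blended with a normalized sequence-similarity prior through a weight `alpha`. The function name `neighborhood_similarity`, the value of `alpha`, the toy networks, and the use of similarity scores (higher is better) rather than e-values are all assumptions.

```python
def neighborhood_similarity(adj_a, adj_b, seq_sim, alpha=0.6, iters=50):
    """Iteratively score every node pairing (i, j) across two networks.

    adj_a, adj_b -- dict: node -> list of neighbors (undirected networks)
    seq_sim      -- dict: (i, j) -> nonnegative sequence-similarity score;
                    at least one entry must be positive
    alpha        -- weight on network topology versus sequence similarity
    """
    pairs = [(i, j) for i in adj_a for j in adj_b]
    total = sum(seq_sim.get(p, 0.0) for p in pairs)
    prior = {p: seq_sim.get(p, 0.0) / total for p in pairs}
    scores = {p: 1.0 / len(pairs) for p in pairs}
    for _ in range(iters):
        updated = {}
        for i, j in pairs:
            # Each neighboring pair (u, v) spreads its score evenly over
            # the pairings of its own neighborhoods.
            topo = sum(scores[(u, v)] / (len(adj_a[u]) * len(adj_b[v]))
                       for u in adj_a[i] for v in adj_b[j])
            updated[(i, j)] = alpha * topo + (1 - alpha) * prior[(i, j)]
        norm = sum(updated.values())
        scores = {p: s / norm for p, s in updated.items()}
    return scores


# Toy networks: two three-node paths (made up).
net_a = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2"]}
net_b = {"b1": ["b2"], "b2": ["b1", "b3"], "b3": ["b2"]}
similar = {("a1", "b1"): 1.0, ("a2", "b2"): 1.0, ("a3", "b3"): 1.0}

scores = neighborhood_similarity(net_a, net_b, similar)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

On the toy paths the two hub nodes pair most strongly, because both topology and sequence similarity support that pairing. Turning the pairwise scores into a one-to-one mapping is a separate matching step not shown here.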