The relation between indel length and functional divergence A formal study Alexander Schönhuth Raheleh Salari, Fereydoun Hormozdiari, Artem Cherkasov, Cenk Sahinalp Computational Biology Lab School of Computing Science Simon Fraser University, Canada Division of Infectious Diseases Faculty of Medicine University of British Columbia September 17, 2008
Guideline
Goals and Motivation • Develop computational solutions to identify truly evolutionary insertions and deletions (indels) in alignments.
Goals and Motivation • Develop computational solutions to identify truly evolutionary insertions and deletions (indels) in alignments. • Examine whether significant indels are correlated to more severe functional divergence between homologous proteins. • Improved assessment of functional similarity of homologous proteins.
Indel Facts Evolutionary processes behind insertions and deletions are not well understood. Truely evolutionary distributions of indel length: • Mixtures of exponentials [Qian and Goldstein, 2003], [Pang et. al, 2005] • Zipfian distribution [Chang and Benner, 2004]
Indel Facts Evolutionary processes behind insertions and deletions are not well understood. Truely evolutionary distributions of indel length: • Mixtures of exponentials [Qian and Goldstein, 2003], [Pang et. al, 2005] • Zipfian distribution [Chang and Benner, 2004] Classical alignment procedures with affine gap penalties: • Geometric distribution.
Indel Facts Evolutionary processes behind Moreover, from small-scale insertions and deletions are not studies: well understood. • Indels often occur in the proteins’ Truely evolutionary distributions of loop regions and cause significant indel length: structural changes [Fechteler et • Mixtures of exponentials [Qian al., 1995] and Goldstein, 2003], [Pang et. al, • Indels occur in disease-causing 2005] mutational hot spots [Kondrashov • Zipfian distribution [Chang and et al., 2004] Benner, 2004] • Thanks to the structural changes: Classical alignment procedures novel approaches to antibacterial drug design [Cherkasov et al. with affine gap penalties: 2005, 2006], [Nandan et al. 2007] • Geometric distribution.
Basic Idea and Workflow Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins. 1. Compute all paralogous protein pairs in an organism. 2. Collect “indel” and “non-indel” pairs. 3. Compare functional similarity of “indel” and “non-indel” protein pairs.
Basic Idea and Workflow Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins. 1. Interesting organism and sound, but efficient alignment 1. Compute all paralogous protein method needed. pairs in an organism. 2. Collect “indel” and “non-indel” pairs. 3. Compare functional similarity of “indel” and “non-indel” protein pairs.
Basic Idea and Workflow Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins. 1. Interesting organism and sound, but efficient alignment 1. Compute all paralogous protein method needed. pairs in an organism. 2. How to identify true indels? 2. Collect “indel” and “non-indel” Idea: Identify alignments that pairs. contain statistically significant 3. Compare functional similarity of indels, neglect “alignment noise”. “indel” and “non-indel” protein Sound indel statistics needed. pairs.
Basic Idea and Workflow Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins. 1. Interesting organism and sound, but efficient alignment 1. Compute all paralogous protein method needed. pairs in an organism. 2. How to identify true indels? 2. Collect “indel” and “non-indel” Idea: Identify alignments that pairs. contain statistically significant 3. Compare functional similarity of indels, neglect “alignment noise”. “indel” and “non-indel” protein Sound indel statistics needed. pairs. 3. Definition of functional similarity needed.
Data and Methods
Organism: E.coli K12 • E.coli K12 is a well known organism. • In prokaryotes: horizontal gene transfer, insertion of genetic material from other prokaryotic organisms. • In bacteria: genomic islands, regions that accumulate inserted genetic material. • Novel classes of antibacterial drug targets needed.
Paralogs • Paralogous proteins (paralogs) can assume different functions in the organism [Gerlt and Babbitt, 2000]. • Global alignments: Needleman-Wunsch algorithm with affine gap penalties. Tool: GGSEARCH [Pearson and Lipman, 1988] • Paralogous pairs: E-value below 10 − 6 , at least 50% sequence and 20% sequence identity.
Paralogs • Paralogous proteins (paralogs) can assume different functions in the organism [Gerlt and Babbitt, 2000]. • Global alignments: Needleman-Wunsch algorithm with affine gap penalties. Tool: GGSEARCH [Pearson and Lipman, 1988] • Paralogous pairs: E-value below 10 − 6 , at least 50% sequence and 20% sequence identity. • Above 40% sequence identity: aligned proteins are structurally similar [Rost, 1999]. • Between 20% and 40% sequence identity: Twilight zone, structural similarity cannot straightforwardly be inferred.
GO based Computation of Functional Similarity • Gene Ontology (GO): Structured description of functional annotation. • Three subcategories: • Molecular Function • Biological Process • Cellular Component • Organized as directed acyclic graph (DAG).
GO based Computation of Functional Similarity • Gene Ontology (GO): Structured description of functional annotation. • Proteins are identified with subsets of • Three subcategories: nodes in a DAG. • Molecular Function • Protein similarity can be measured based • Biological Process on reasonable comparison of “subDAGs”. • Cellular Component • Organized as directed acyclic graph (DAG).
GO based Computation of Functional Similarity • Gene Ontology (GO): Structured description of functional annotation. • Proteins are identified with subsets of • Three subcategories: nodes in a DAG. • Molecular Function • Protein similarity can be measured based • Biological Process on reasonable comparison of “subDAGs”. • Cellular Component • Organized as directed acyclic graph (DAG). Method of choice: Extension of the semantic similarity measure by [Resnik, 1999] to protein similarity, as described by [Schlicker, 2006], suggested by Francisco Couto and Daniel Faria.
Indel Statistics: Problem Definition Definition • If A is an alignment algorithm, let L A ( x , y ) resp. I A ( x , y ) be the length of the alignment resp. the length of the largest indel in the alignment of x = x 1 ... x m , y = y 1 ... y n , as computed by A .
Indel Statistics: Problem Definition Definition • If A is an alignment algorithm, let L A ( x , y ) resp. I A ( x , y ) be the length of the alignment resp. the length of the largest indel in the alignment of x = x 1 ... x m , y = y 1 ... y n , as computed by A .
Indel Statistics: Problem Definition Definition • If A is an alignment algorithm, let L A ( x , y ) resp. I A ( x , y ) be the length of the alignment resp. the length of the largest indel in the alignment of x = x 1 ... x m , y = y 1 ... y n , as computed by A . • If k is an integer, let P n , T ( I A ( x , y ) ≥ k ) := P ( I A ( x , y ) ≥ k | L A ( x , y ) = n , ( x , y ) ∈ T ) be the probability that the largest indel in the alignment of x and y ((x,y) drawn from a pool of pairs T such that L A ( x , y ) = n ) is greater than k .
Problem Definition Indel Length Probability (ILP) Problem Computation of the probabilities P n , T ( I A ( x , y ) ≥ k ) . Input : A pair of sequences ( x , y ) , a pool T with ( x , y ) ∈ T , an alignment algorithm A and an integer k . Output: P n , T ( I A ( x , y ) ≥ k ) . Remark • Replacing I A ( x , y ) by S A ( x , y ) , the score of an alignment of x and y is the classical problem of score statistics. • Exact solution for A the Smith-Waterman algorithm for ungapped local alignments given by the Altschul-Dembo-Karlin statistics [Karlin and Altschul, 1990], [Dembo and Karlin, 1991].
Solution Strategy: Pair HMMs q X p q x i 1-2p 1-q M End Begin Px y i j p Y q 1-q q y j P n , T ( I A ( x , y ) ≥ k ) correspond to probabilities that Viterbi paths contain a consecutive run of either ’X’ or ’Y’ states of length at least k . Hard problem! 2p Match Indel q 1-2p 1-q
Solution Strategy: Pair HMMs • We write I for ’Indel’ and M for ’Match’. q X p q x i • Let 1-2p 1-q C n , k M End Begin Px y be the set of sequences over the alphabet i j p M , I of length n that contain a consecutive I stretch of length at least k . Y q 1-q q y j P n , T ( I A ( x , y ) ≥ k ) correspond to probabilities that Viterbi paths contain a consecutive run of either ’X’ or ’Y’ states of length at least k . Hard problem! 2p Match Indel q 1-2p 1-q
Recommend
More recommend