The relation between indel length and functional divergence A - PowerPoint PPT Presentation

The relation between indel length and functional divergence A formal study Alexander Schönhuth Raheleh Salari, Fereydoun Hormozdiari, Artem Cherkasov, Cenk Sahinalp Computational Biology Lab School of Computing Science Simon Fraser University, Canada Division of Infectious Diseases Faculty of Medicine University of British Columbia September 17, 2008

Guideline

Goals and Motivation • Develop computational solutions to identify truly evolutionary insertions and deletions (indels) in alignments.

Goals and Motivation • Develop computational solutions to identify truly evolutionary insertions and deletions (indels) in alignments. • Examine whether significant indels are correlated to more severe functional divergence between homologous proteins. • Improved assessment of functional similarity of homologous proteins.

Indel Facts Evolutionary processes behind insertions and deletions are not well understood. Truely evolutionary distributions of indel length: • Mixtures of exponentials [Qian and Goldstein, 2003], [Pang et. al, 2005] • Zipfian distribution [Chang and Benner, 2004]

Indel Facts Evolutionary processes behind insertions and deletions are not well understood. Truely evolutionary distributions of indel length: • Mixtures of exponentials [Qian and Goldstein, 2003], [Pang et. al, 2005] • Zipfian distribution [Chang and Benner, 2004] Classical alignment procedures with affine gap penalties: • Geometric distribution.

Indel Facts Evolutionary processes behind Moreover, from small-scale insertions and deletions are not studies: well understood. • Indels often occur in the proteins’ Truely evolutionary distributions of loop regions and cause significant indel length: structural changes [Fechteler et • Mixtures of exponentials [Qian al., 1995] and Goldstein, 2003], [Pang et. al, • Indels occur in disease-causing 2005] mutational hot spots [Kondrashov • Zipfian distribution [Chang and et al., 2004] Benner, 2004] • Thanks to the structural changes: Classical alignment procedures novel approaches to antibacterial drug design [Cherkasov et al. with affine gap penalties: 2005, 2006], [Nandan et al. 2007] • Geometric distribution.

Basic Idea and Workflow Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins. 1. Compute all paralogous protein pairs in an organism. 2. Collect “indel” and “non-indel” pairs. 3. Compare functional similarity of “indel” and “non-indel” protein pairs.

Basic Idea and Workflow Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins. 1. Interesting organism and sound, but efficient alignment 1. Compute all paralogous protein method needed. pairs in an organism. 2. Collect “indel” and “non-indel” pairs. 3. Compare functional similarity of “indel” and “non-indel” protein pairs.

Basic Idea and Workflow Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins. 1. Interesting organism and sound, but efficient alignment 1. Compute all paralogous protein method needed. pairs in an organism. 2. How to identify true indels? 2. Collect “indel” and “non-indel” Idea: Identify alignments that pairs. contain statistically significant 3. Compare functional similarity of indels, neglect “alignment noise”. “indel” and “non-indel” protein Sound indel statistics needed. pairs.

Basic Idea and Workflow Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins. 1. Interesting organism and sound, but efficient alignment 1. Compute all paralogous protein method needed. pairs in an organism. 2. How to identify true indels? 2. Collect “indel” and “non-indel” Idea: Identify alignments that pairs. contain statistically significant 3. Compare functional similarity of indels, neglect “alignment noise”. “indel” and “non-indel” protein Sound indel statistics needed. pairs. 3. Definition of functional similarity needed.

Data and Methods

Organism: E.coli K12 • E.coli K12 is a well known organism. • In prokaryotes: horizontal gene transfer, insertion of genetic material from other prokaryotic organisms. • In bacteria: genomic islands, regions that accumulate inserted genetic material. • Novel classes of antibacterial drug targets needed.

Paralogs • Paralogous proteins (paralogs) can assume different functions in the organism [Gerlt and Babbitt, 2000]. • Global alignments: Needleman-Wunsch algorithm with affine gap penalties. Tool: GGSEARCH [Pearson and Lipman, 1988] • Paralogous pairs: E-value below 10 − 6 , at least 50% sequence and 20% sequence identity.

Paralogs • Paralogous proteins (paralogs) can assume different functions in the organism [Gerlt and Babbitt, 2000]. • Global alignments: Needleman-Wunsch algorithm with affine gap penalties. Tool: GGSEARCH [Pearson and Lipman, 1988] • Paralogous pairs: E-value below 10 − 6 , at least 50% sequence and 20% sequence identity. • Above 40% sequence identity: aligned proteins are structurally similar [Rost, 1999]. • Between 20% and 40% sequence identity: Twilight zone, structural similarity cannot straightforwardly be inferred.

GO based Computation of Functional Similarity • Gene Ontology (GO): Structured description of functional annotation. • Three subcategories: • Molecular Function • Biological Process • Cellular Component • Organized as directed acyclic graph (DAG).

GO based Computation of Functional Similarity • Gene Ontology (GO): Structured description of functional annotation. • Proteins are identified with subsets of • Three subcategories: nodes in a DAG. • Molecular Function • Protein similarity can be measured based • Biological Process on reasonable comparison of “subDAGs”. • Cellular Component • Organized as directed acyclic graph (DAG).

GO based Computation of Functional Similarity • Gene Ontology (GO): Structured description of functional annotation. • Proteins are identified with subsets of • Three subcategories: nodes in a DAG. • Molecular Function • Protein similarity can be measured based • Biological Process on reasonable comparison of “subDAGs”. • Cellular Component • Organized as directed acyclic graph (DAG). Method of choice: Extension of the semantic similarity measure by [Resnik, 1999] to protein similarity, as described by [Schlicker, 2006], suggested by Francisco Couto and Daniel Faria.

Indel Statistics: Problem Definition Definition • If A is an alignment algorithm, let L A ( x , y ) resp. I A ( x , y ) be the length of the alignment resp. the length of the largest indel in the alignment of x = x 1 ... x m , y = y 1 ... y n , as computed by A .

Indel Statistics: Problem Definition Definition • If A is an alignment algorithm, let L A ( x , y ) resp. I A ( x , y ) be the length of the alignment resp. the length of the largest indel in the alignment of x = x 1 ... x m , y = y 1 ... y n , as computed by A . • If k is an integer, let P n , T ( I A ( x , y ) ≥ k ) := P ( I A ( x , y ) ≥ k | L A ( x , y ) = n , ( x , y ) ∈ T ) be the probability that the largest indel in the alignment of x and y ((x,y) drawn from a pool of pairs T such that L A ( x , y ) = n ) is greater than k .

Problem Definition Indel Length Probability (ILP) Problem Computation of the probabilities P n , T ( I A ( x , y ) ≥ k ) . Input : A pair of sequences ( x , y ) , a pool T with ( x , y ) ∈ T , an alignment algorithm A and an integer k . Output: P n , T ( I A ( x , y ) ≥ k ) . Remark • Replacing I A ( x , y ) by S A ( x , y ) , the score of an alignment of x and y is the classical problem of score statistics. • Exact solution for A the Smith-Waterman algorithm for ungapped local alignments given by the Altschul-Dembo-Karlin statistics [Karlin and Altschul, 1990], [Dembo and Karlin, 1991].

Solution Strategy: Pair HMMs q X p q x i 1-2p 1-q M End Begin Px y i j p Y q 1-q q y j P n , T ( I A ( x , y ) ≥ k ) correspond to probabilities that Viterbi paths contain a consecutive run of either ’X’ or ’Y’ states of length at least k . Hard problem! 2p Match Indel q 1-2p 1-q

Solution Strategy: Pair HMMs • We write I for ’Indel’ and M for ’Match’. q X p q x i • Let 1-2p 1-q C n , k M End Begin Px y be the set of sequences over the alphabet i j p M , I of length n that contain a consecutive I stretch of length at least k . Y q 1-q q y j P n , T ( I A ( x , y ) ≥ k ) correspond to probabilities that Viterbi paths contain a consecutive run of either ’X’ or ’Y’ states of length at least k . Hard problem! 2p Match Indel q 1-2p 1-q

The relation between indel length and functional divergence A - PowerPoint PPT Presentation

The relation between indel length and functional divergence A formal study Alexander Schnhuth Raheleh Salari, Fereydoun Hormozdiari, Artem Cherkasov, Cenk Sahinalp Computational Biology Lab School of Computing Science Simon Fraser

New Drawer 30 PRODUCT INFORMATION No. 2018-02 update October 201 8 Marc Van Aperen Indel Webasto

Relation between things vs. a relation between people Lenin: Where the bourgeois economists

Govt. of Gujarat Gujarat Coastline Zone Accretion Erosion length Stable length Total length

Relation between sets 2/16 A relation R between sets A and B is a predicate on A B . R ( x, y )

FFR Guided Functional FFR Guided Functional FFR Guided Functional FFR Guided Functional

Reframing the situation Implementing penalties for frameshifting mutations in a sequence

For Friday Read Chapter 10, sections 1 and 2 Prolog Handout 4 Length of a List

Verification of Security Protocols with Lists: from Length One to Unbounded Length Miriam Paiola

Part I: Soil Mechanics Volume-Volume relation Mass-Mass relation Mass-Volume relation

Relation Schema Given domains D 1 , D 2 , . D n a relation r is a subset of D 1 x D 2 x

+ f(x) = Python Functional Programming Python Functional Programming Functional Programming by

Functional Linear Models 1 66 / 181 Functional Linear Models Statistical Models So far we have

Functional Programming in 40 minutes @russolsen Functional Programming in 40 minutes

Multicentre study to evaluate the relation study to evaluate the relation Multicentre between

Functional Dependencies 1 Functional Dependencies X Y is an assertion about a relation

Mohamed Thahir Traditional and Open Relation Extraction Read the Web Relation Extraction

Interspecies gene function prediction using semantic similarity Guoxian Yu*, Wei Luo, Guangyuan

INAF-Astronomical Observatory of Padova III. Evolution of the ejecta Luca Zampieri - Supernovae,

Fundamentals of Evolution Session 6 - 2018 Bayesian phylogenetics & big trees 1 Recap of

Shortest Non-trivial Cycles in Directed and Undirected Surface Graphs Kyle Fox University of

1 Gregor Mendel: Traits endure, they do not blend Augsutinian monk interested in plant

Sophisticated models in Bio++ Julien Dutheil, Bastien Boussau Birc, Aarhus; LBBE, Lyon Friday,

A Way Forward CONVERSATIONS ABOUT A WAY FORWARD FOR THE UNITED METHODIST CHURCH The Mission of

Pre-Stonewall 1895 Oscar Wilde trial; prosecuted for gross indecency 1903 First raid

Sambuz

Useful Links

Newsletter

Mail Us

The relation between indel length and functional divergence A - PowerPoint PPT Presentation

The relation between indel length and functional divergence A formal study Alexander Schnhuth Raheleh Salari, Fereydoun Hormozdiari, Artem Cherkasov, Cenk Sahinalp Computational Biology Lab School of Computing Science Simon Fraser

New Drawer 30 PRODUCT INFORMATION No. 2018-02 update October 201 8 Marc Van Aperen Indel Webasto

Relation between things vs. a relation between people Lenin: Where the bourgeois economists

Govt. of Gujarat Gujarat Coastline Zone Accretion Erosion length Stable length Total length

Relation between sets 2/16 A relation R between sets A and B is a predicate on A B . R ( x, y )

FFR Guided Functional FFR Guided Functional FFR Guided Functional FFR Guided Functional

Reframing the situation Implementing penalties for frameshifting mutations in a sequence

For Friday Read Chapter 10, sections 1 and 2 Prolog Handout 4 Length of a List

Verification of Security Protocols with Lists: from Length One to Unbounded Length Miriam Paiola

Part I: Soil Mechanics Volume-Volume relation Mass-Mass relation Mass-Volume relation

Relation Schema Given domains D 1 , D 2 , . D n a relation r is a subset of D 1 x D 2 x

+ f(x) = Python Functional Programming Python Functional Programming Functional Programming by

Functional Linear Models 1 66 / 181 Functional Linear Models Statistical Models So far we have

Functional Programming in 40 minutes @russolsen Functional Programming in 40 minutes

Multicentre study to evaluate the relation study to evaluate the relation Multicentre between

Functional Dependencies 1 Functional Dependencies X Y is an assertion about a relation

Mohamed Thahir Traditional and Open Relation Extraction Read the Web Relation Extraction

Interspecies gene function prediction using semantic similarity Guoxian Yu*, Wei Luo, Guangyuan

INAF-Astronomical Observatory of Padova III. Evolution of the ejecta Luca Zampieri - Supernovae,

Fundamentals of Evolution Session 6 - 2018 Bayesian phylogenetics &amp; big trees 1 Recap of

Shortest Non-trivial Cycles in Directed and Undirected Surface Graphs Kyle Fox University of

1 Gregor Mendel: Traits endure, they do not blend Augsutinian monk interested in plant

Sophisticated models in Bio++ Julien Dutheil, Bastien Boussau Birc, Aarhus; LBBE, Lyon Friday,

A Way Forward CONVERSATIONS ABOUT A WAY FORWARD FOR THE UNITED METHODIST CHURCH The Mission of

Pre-Stonewall 1895 Oscar Wilde trial; prosecuted for gross indecency 1903 First raid

Sambuz

Useful Links

Newsletter

Mail Us

Fundamentals of Evolution Session 6 - 2018 Bayesian phylogenetics & big trees 1 Recap of