STRINGS   EVOLUTIONARY MODELS   INFERRING PHYLOGENIES

EVOLUTIONARY DISTANCES
FROM STRINGS TO TREES

Luca Bortolussi
Dipartimento di Matematica ed Informatica, Università degli studi di Trieste
luca@dmi.units.it
Trieste, 14th November 2007

OUTLINE
1. STRINGS: DISTANCES AND EVOLUTION
2. EVOLUTIONARY MODELS
   Examples
3. INFERRING PHYLOGENIES

DIGITAL MOLECULES

DNA
DNA can be considered as a very long string over an alphabet of 4 bases (A, C, G, T). This encyclopedia stores genetic information in volumes (chromosomes), with interesting chapters (genes), reading instructions (regulatory elements) and less interesting material (junk DNA).

GENES ENCODE PROTEINS
The same gene can be present in different organisms, but with variations: the same chapter can be written in French, in Italian, in English, ...

HOW CAN WE MEASURE THE DISTANCE BETWEEN TWO GENES?
Genes are strings of DNA: we can count the differences (Hamming distance).

A C C T G T T A G C
A A C T G G T A C C

Actually, we should use edit distance and construct an alignment between strings.
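
A minimal sketch of the two distances mentioned on this slide. It assumes the twenty example letters are two rows of ten bases each (a reading of the flattened slide layout, not something the source states), and the function names are illustrative only.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]


# Assumed reading of the slide's example: two rows of ten bases.
s1 = "ACCTGTTAGC"
s2 = "AACTGGTACC"

print(hamming_distance(s1, s2))            # 3
print(hamming_distance(s1, s2) / len(s1))  # 0.3 (differences per site)
print(edit_distance(s1, s2))
```

For strings of equal length the edit distance can never exceed the Hamming distance, since substituting every mismatching position is one valid edit script.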

HOW DOES EVOLUTION ACT ON DNA?

EVOLUTIONARY EVENTS
Evolution can modify DNA in several ways:
- Local pointwise mutations can substitute, delete or insert a base somewhere.
- Entire DNA fragments can be deleted or duplicated, possibly reversed in their order.
- Bigger pieces of DNA can be swapped or inverted.
- Entire genomes can be duplicated.
MUTATIONS HAPPEN RANDOMLY!

OUR FOCUS
For simplicity, we focus only on pointwise substitution events.

OUR SCENARIO
The scenario is the following: consider two species (human and chimp) that evolved from a common ancestor (some old primate). As the ancestor evolved into human or chimp, its DNA mutated pointwise in some positions, chosen randomly.

evolutionary distance = number of mutations

DOES HAMMING DISTANCE COUNT THE NUMBER OF MUTATIONS?
Consider the following situation:
A → C ;  A → G → C ;  A → C → G → C
The same observation (A, C) can correspond to different evolutionary histories. Hamming distance ignores multiple substitutions at a site.
Moreover:
A → G → A ;  A → C → G → A
Hamming distance ignores back-mutations! It underestimates the number of mutations.

CORRECTING DISTANCES
The strategy is to develop a stochastic model of DNA evolution, and use it to correct the observed distance to account for multiple substitutions at a site.
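
The underestimate can be seen in a small simulation. This is only a sketch under simplifying assumptions that are not in the slides: a single ancestor-to-descendant lineage, a hypothetical random sequence of 200 bases, and substitutions that hit a uniformly chosen site and replace the base with a different one chosen uniformly.

```python
import random

BASES = "ACGT"


def mutate(seq: str, n_events: int, rng: random.Random) -> str:
    """Apply n_events pointwise substitutions at randomly chosen sites."""
    sites = list(seq)
    for _ in range(n_events):
        pos = rng.randrange(len(sites))
        # a substitution always changes the base currently at that site
        sites[pos] = rng.choice([b for b in BASES if b != sites[pos]])
    return "".join(sites)


def hamming_distance(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)                                   # fixed seed, repeatable run
ancestor = "".join(rng.choice(BASES) for _ in range(200))

for n_events in (20, 100, 400):
    descendant = mutate(ancestor, n_events, rng)
    observed = hamming_distance(ancestor, descendant)
    print(f"actual substitutions: {n_events:3d}   observed differences: {observed:3d}")
```

The observed count can never exceed the actual one, because every differing site has been hit at least once, and the gap widens as more events land on sites that were already changed.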

A SIMPLE MODEL OF NUCLEOTIDE EVOLUTION

HYPOTHESES
- Time evolves continuously.
- Each site can be substituted independently.
- The rate of substitutions (expected frequency per unit of time) does not change in time (homogeneity).
- The rate of change from base i to base j does not depend on the mutation history of the site (memoryless property).

CONSEQUENCES
- The waiting time until a single mutation event is modeled by an exponential distribution.
- The number of mutations in a given time interval is modeled by a Poisson process.
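
The link between the two consequences can be checked numerically. The sketch below draws exponential waiting times and counts how many events fall in an interval; the rate, the interval length and the number of runs are arbitrary illustrative values. For a Poisson count, both the mean and the variance should be close to rate × t.

```python
import random


def count_events(rate: float, t: float, rng: random.Random) -> int:
    """Count events in [0, t] when waiting times are Exponential(rate)."""
    elapsed, n = 0.0, 0
    while True:
        elapsed += rng.expovariate(rate)   # waiting time until the next event
        if elapsed > t:
            return n
        n += 1


rng = random.Random(1)
rate, t = 2.0, 3.0                          # hypothetical values
counts = [count_events(rate, t, rng) for _ in range(20000)]

mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(f"rate * t = {rate * t}, sample mean = {mean:.3f}, sample variance = {var:.3f}")
```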

MARKOV PROCESSES
If we consider all possible mutations (from A to C, G, T and so on), we end up with a matrix of rates and with a time-homogeneous continuous-time Markov chain.

FURTHER SIMPLIFYING HYPOTHESES
- Frequencies are in equilibrium: $\pi_A, \pi_C, \pi_G, \pi_T$ (stationary chain).
- The process is time reversible: $\pi_i P_{ij}(t) = \pi_j P_{ji}(t)$.

RATE MATRIX
Under the previous hypotheses, the off-diagonal entries of the Q-matrix decompose as
$q_{ij} = R_{ij}\,\pi_j$ for $i \neq j$,
while each diagonal entry is $q_{ii} = -\sum_{j \neq i} q_{ij}$, so that every row sums to zero.
- R is a symmetric matrix.
- $\pi$ are the stationary frequencies (solution of $\pi Q = 0$, with $\pi$ a row vector and $\sum_i \pi_i = 1$).
For nucleotide substitution models, we have 6+3 parameters to set.
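
A short sketch of this decomposition, using made-up values for R and π (they are not taken from the slides). It builds Q from a symmetric R and frequencies π, then checks the three properties the slide relies on: rows summing to zero, stationarity in the row-vector form πQ = 0, and detailed balance at the level of rates.

```python
def rate_matrix(R, pi):
    """Build Q with q_ij = R_ij * pi_j off the diagonal and rows summing to zero."""
    n = len(pi)
    Q = [[R[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])
    return Q


# Hypothetical parameters, bases ordered A, C, G, T.
pi = [0.1, 0.2, 0.3, 0.4]
R = [[0, 1, 2, 1],        # symmetric: only the 6 off-diagonal values matter
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]

Q = rate_matrix(R, pi)

row_sums = [sum(row) for row in Q]
pi_Q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]
balanced = all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12
               for i in range(4) for j in range(4))

print("row sums:", row_sums)        # all (numerically) zero
print("pi Q:    ", pi_Q)            # all (numerically) zero
print("detailed balance holds:", balanced)
```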

EXPECTATIONS

TOTAL RATE OF CHANGE
$\mu = -\sum_i q_{ii}\,\pi_i$

EXPECTED NUMBER OF CHANGES AFTER TIME t
$d = \mu t$

PROBABILITY OF OBSERVING A SUBSTITUTION AFTER TIME t
$p = 1 - \sum_i \pi_i\,P_{ii}(t)$

p is also the expected number of observed substitutions per site.
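
As a check on how p and d relate, here is a first-order expansion for short times. It assumes the standard relation $P(t) = e^{Qt}$ for a continuous-time Markov chain, which the slides use implicitly, together with $\sum_i \pi_i = 1$.

```latex
\begin{aligned}
P(t) &= e^{Qt} = I + Qt + O(t^2), \\
p &= 1 - \sum_i \pi_i P_{ii}(t)
   = 1 - \sum_i \pi_i \left(1 + q_{ii}\,t\right) + O(t^2) \\
  &= -t \sum_i \pi_i\, q_{ii} + O(t^2)
   = \mu t + O(t^2) = d + O(t^2).
\end{aligned}
```

So the observed and the actual number of substitutions per site agree when few changes have occurred. For larger t the higher-order terms matter and p falls below d.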

CORRECTING HAMMING DISTANCE
1. Estimate p as $\hat{p} = \dfrac{\text{Hamming distance}}{\text{total length}}$.
2. From $d = \mu t$ and $p = 1 - \sum_i \pi_i P_{ii}(t)$ deduce
   $p = 1 - \sum_i \pi_i P_{ii}\!\left(\dfrac{d}{\mu}\right)$.
3. Solve the previous formula for d and use the estimate $\hat{p}$ of p to compute the estimate $\hat{d}$.
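
When no closed form is at hand, step 3 can be carried out numerically. The sketch below is one possible approach, not the author's method: it assumes $P(t) = e^{Qt}$, computes the matrix exponential with a hand-written scaling-and-squaring Taylor series, and solves for d by bisection, which works because p increases with d for reversible models. The example matrix is the Jukes-Cantor one from the slides; all function names and tolerances are illustrative.

```python
import math


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, squarings=10, terms=20):
    """exp(M) via a truncated Taylor series of M / 2**squarings, then repeated squaring."""
    n = len(M)
    scaled = [[x / 2 ** squarings for x in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def expected_p(Q, pi, d):
    """p = 1 - sum_i pi_i P_ii(d / mu), with mu = -sum_i pi_i q_ii."""
    n = len(pi)
    mu = -sum(pi[i] * Q[i][i] for i in range(n))
    P = mat_exp([[q * d / mu for q in row] for row in Q])
    return 1.0 - sum(pi[i] * P[i][i] for i in range(n))


def corrected_distance(Q, pi, p_hat, d_max=10.0, tol=1e-10):
    """Solve expected_p(d) = p_hat for d by bisection on [0, d_max]."""
    if p_hat >= expected_p(Q, pi, d_max):
        raise ValueError("p_hat is too close to saturation for this model")
    lo, hi = 0.0, d_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_p(Q, pi, mid) < p_hat:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Jukes-Cantor rate matrix: off-diagonal 1/4, diagonal -3/4, uniform frequencies.
pi = [0.25] * 4
Q = [[-0.75 if i == j else 0.25 for j in range(4)] for i in range(4)]

p_hat = 0.10                      # hypothetical observed fraction of differing sites
d_hat = corrected_distance(Q, pi, p_hat)
closed_form = -0.75 * math.log(1 - 4 * p_hat / 3)
print(d_hat, closed_form)         # the two values should agree closely
```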

DIFFERENT EVOLUTIONARY MODELS
There are 6 parameters to fix the rate matrix R (the off-diagonal entries of a symmetric 4 × 4 matrix) and 3 to fix the equilibrium frequencies π (four frequencies that must sum to 1).
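
THE JUKES-CANTOR MODEL
The Jukes-Cantor model was published in 1969. It is the simplest model of evolution, assuming $R_{ij} = 1$ and $\pi_i = \frac{1}{4}$.

$$
Q = \begin{pmatrix}
-\frac{3}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\
\frac{1}{4} & -\frac{3}{4} & \frac{1}{4} & \frac{1}{4} \\
\frac{1}{4} & \frac{1}{4} & -\frac{3}{4} & \frac{1}{4} \\
\frac{1}{4} & \frac{1}{4} & \frac{1}{4} & -\frac{3}{4}
\end{pmatrix}
$$

SOLUTION FOR P
$P(t) = \frac{1}{4} - Q\,e^{-t}$, where $\frac{1}{4}$ stands for the 4 × 4 matrix with all entries equal to $\frac{1}{4}$. Entry by entry: $P_{ii}(t) = \frac{1}{4} + \frac{3}{4}e^{-t}$ and $P_{ij}(t) = \frac{1}{4} - \frac{1}{4}e^{-t}$ for $i \neq j$.

CORRECTION FOR THE DISTANCE
$\hat{d} = -\dfrac{3}{4}\ln\!\left(1 - \dfrac{4}{3}\hat{p}\right)$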
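
The Jukes-Cantor correction is a one-line function. The sketch below also includes the forward relation, obtained by substituting $P_{ii}(t) = \frac{1}{4} + \frac{3}{4}e^{-t}$ and $t = d/\mu$ with $\mu = \frac{3}{4}$ into the formula for p. The function names and the sample values of p̂ are illustrative.

```python
import math


def jukes_cantor_distance(p_hat: float) -> float:
    """Corrected distance d = -3/4 * ln(1 - 4/3 * p_hat)."""
    if not 0.0 <= p_hat < 0.75:
        # p saturates at 3/4 under Jukes-Cantor, so the logarithm is undefined beyond it
        raise ValueError("p_hat must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


def jukes_cantor_p(d: float) -> float:
    """Expected fraction of differing sites after d substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


for p_hat in (0.01, 0.10, 0.30, 0.60):
    print(f"p_hat = {p_hat:.2f}   corrected d = {jukes_cantor_distance(p_hat):.4f}")
```

For small p̂ the correction is negligible. As p̂ approaches 3/4 the corrected distance grows without bound, which reflects saturation: two sequences that differ at three sites out of four look like unrelated random strings.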

RECONSTRUCTING THE HISTORY OF LIFE

WHAT DOES “PHYLOGENETIC INFERENCE” MEAN?
All species on Earth come from a common ancestor. If we have data from a pool of species, we wish to reconstruct the history of speciation events that led to their emergence: we want to find the phylogenetic tree giving this information!
This is a hard task, because data is often incomplete (we lack information about most of the ancestor species) and noisy.

METHODS TO INFER PHYLOGENY

APPROACHES TO PHYLOGENY
- Distance-based methods
- Parsimony methods
- Likelihood methods
- Bayesian inference methods

DISTANCE-BASED METHODS
Given a matrix of pairwise distances, find the tree that best explains it. Several algorithms:
- UPGMA (clustering methods)
- Neighbor Joining
- Fitch-Margoliash (sum of squares methods)

AN EXAMPLE: PRIMATES

DNA FROM PRIMATES
Tarsius         AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCT...
Lemur           AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCAT...
Homo Sapiens    AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT...
Chimp           AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT...
Gorilla         AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCAT...
Pongo           AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCT...
Hylobates       AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTT...
Macaco Fuscata  AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTT...

DISTANCE MATRIX
(columns follow the same order as the rows)
Tarsius         0.00  0.29  0.40  0.39  0.38  0.34  0.38  0.37
Lemur           0.29  0.00  0.37  0.38  0.35  0.33  0.36  0.34
Homo Sapiens    0.40  0.37  0.00  0.10  0.11  0.15  0.21  0.24
Chimp           0.39  0.38  0.10  0.00  0.12  0.17  0.21  0.24
Gorilla         0.38  0.35  0.11  0.12  0.00  0.16  0.21  0.26
Pongo           0.34  0.33  0.15  0.17  0.16  0.00  0.22  0.24
Hylobates       0.38  0.36  0.21  0.21  0.21  0.22  0.00  0.26
Macaco Fuscata  0.37  0.34  0.24  0.24  0.26  0.24  0.26  0.00
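
To show how a distance-based method turns such a matrix into a tree, here is a minimal sketch of UPGMA, one of the algorithms named among the distance-based methods. It is a plain textbook version, not necessarily the procedure used for the talk: at each step it merges the two closest clusters and sets the distance from the new cluster to every other one as the size-weighted average. The nested-tuple tree representation is an illustrative choice.

```python
def upgma(names, D):
    """Cluster by UPGMA; return (nested-tuple tree, list of merges with node heights)."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}           # id -> (subtree, size)
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n
    while len(clusters) > 1:
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])  # closest pair
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merges.append((tree_a, tree_b, d_ab / 2))               # node height = d/2
        # size-weighted average distance from the merged cluster to the others
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        for c, d in new_dist.items():
            dist[(c, next_id)] = d                              # next_id exceeds every old id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree, merges


names = ["Tarsius", "Lemur", "Homo Sapiens", "Chimp",
         "Gorilla", "Pongo", "Hylobates", "Macaco Fuscata"]
D = [[0.00, 0.29, 0.40, 0.39, 0.38, 0.34, 0.38, 0.37],
     [0.29, 0.00, 0.37, 0.38, 0.35, 0.33, 0.36, 0.34],
     [0.40, 0.37, 0.00, 0.10, 0.11, 0.15, 0.21, 0.24],
     [0.39, 0.38, 0.10, 0.00, 0.12, 0.17, 0.21, 0.24],
     [0.38, 0.35, 0.11, 0.12, 0.00, 0.16, 0.21, 0.26],
     [0.34, 0.33, 0.15, 0.17, 0.16, 0.00, 0.22, 0.24],
     [0.38, 0.36, 0.21, 0.21, 0.21, 0.22, 0.00, 0.26],
     [0.37, 0.34, 0.24, 0.24, 0.26, 0.24, 0.26, 0.00]]

tree, merges = upgma(names, D)
for left, right, height in merges:
    print(f"merge at height {height:.3f}: {left} + {right}")
print(tree)
```

The first merge joins Homo Sapiens and Chimp, the closest pair in the matrix (0.10). UPGMA implicitly assumes that all lineages evolve at the same rate; Neighbor Joining drops that assumption.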