Similarity vs homology
• Sequence similarity is not sequence homology
  − If the two sequences g_B and g_C have accumulated enough mutations, the similarity between them is likely to be low

#mutations  sequence
0           agtgtccgttaagtgcgttc
1           agtgtccgttatagtgcgttc
2           agtgtccgcttatagtgcgttc
4           agtgtccgcttaagggcgttc
8           agtgtccgcttcaaggggcgt
16          gggccgttcatgggggt
32          gcagggcgtcactgagggct
64          acagtccgttcgggctattg
128         cagagcactaccgc
256         cacgagtaagatatagct
512         taatcgtgata
1024        acccttatctacttcctggagtt
2048        agcgacctgcccaa
4096        caaac

• Homology is more difficult to detect over greater evolutionary distances.
Introduction to bioinformatics, Autumn 2006
Similarity vs homology (2)
• Sequence similarity can occur by chance
  − Similarity does not imply homology
• Similarity is an expected consequence of homology
Orthologs and paralogs
• We distinguish between two types of homology
  − Orthologs: homologs from two different species
  − Paralogs: homologs within a species
[Figure: gene g_A of organism A and its descendants g_B (organism B) and g_C (organism C); gene A is copied within organism A, giving g_A and g_A'.]
Orthologs and paralogs (2)
• Orthologs typically retain the original function
• In paralogs, one copy is free to mutate and acquire a new function (no selective pressure)
[Figure: gene g_A of organism A and its descendants g_B (organism B) and g_C (organism C); gene A is copied within organism A, giving g_A and g_A'.]
Sequence alignment
• Alignment specifies which positions in two sequences match

acgtctag      acgtctag      acgtctag
||               |||||      || |||||
actctag-      -actctag      ac-tctag
2 matches     5 matches     7 matches
5 mismatches  2 mismatches  0 mismatches
1 not aligned 1 not aligned 1 not aligned
Mutations: Insertions, deletions and substitutions

acgtctag
   |||||
-actctag

• Indel: insertion or deletion of a base with respect to the ancestor sequence
• Mismatch: substitution (point mutation) of a single base
• Insertions and/or deletions are called indels
  − We can't tell whether the ancestor sequence had a base or not at an indel position
Problems
• What sorts of alignments should be considered?
• How to score alignments?
• How to find optimal or good scoring alignments?
• How to evaluate the statistical significance of scores?
In this course, we discuss the first three problems. The course Biological sequence analysis tackles all four in depth.
Sequence Alignment (chapter 6)
• The biological problem
• Global alignment
• Local alignment
• Multiple alignment
Global alignment
• Problem: find the optimal scoring alignment between two sequences (Needleman & Wunsch 1970)
• We give a score for each position in the alignment
  − Identity (match): +1
  − Substitution (mismatch): −µ
  − Indel: −δ

WHAT
||
WH-Y

S(WHAT/WH-Y) = 1 + 1 − δ − µ
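The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` and the example values µ = 1, δ = 2 are assumptions, not part of the slide.

```python
def alignment_score(top: str, bottom: str, mu: float, delta: float, gap: str = "-") -> float:
    """Sum the per-column scores of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            total -= delta      # indel
        elif x == y:
            total += 1          # identity (match)
        else:
            total -= mu         # substitution (mismatch)
    return total


# mu = 1 and delta = 2 are example values chosen for illustration.
print(alignment_score("WHAT", "WH-Y", mu=1, delta=2))  # 1 + 1 - 2 - 1 = -1
```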
Representing alignments and scores

WHAT
||
WH-Y

The alignment as a path (X) through the matrix:

      -   W   H   A   T
  -
  W       X
  H           X   X
  Y                   X
Representing alignments and scores

WHAT
||
WH-Y

Scores along the path:

      -   W   H   A     T
  -   0
  W       1
  H           2   2−δ
  Y                     2−δ−µ

Global alignment score S_{3,4} = 2 − δ − µ
Dynamic programming
• How to find the optimal alignment?
• We use previous solutions for optimal alignments of smaller subsequences
• This general approach is known as dynamic programming
Filling the alignment matrix
[Figure: alignment matrix with columns - W H A T and rows - W H Y; the square in row H, column H is shaded, with arrows for Cases 1, 2 and 3 entering it.]
Consider the alignment process at the shaded square.
− Case 1. Align H against H (match or substitution).
− Case 2. Align H in WHY against – (indel) in WHAT.
− Case 3. Align H in WHAT against – (indel) in WHY.
Filling the alignment matrix (2)
[Figure: alignment matrix with columns - W H A T and rows - W H Y; arrows for Cases 1, 2 and 3 enter the square in row H, column H.]
Scoring the alternatives.
− Case 1. S_{2,2} = S_{1,1} + s(2, 2)
− Case 2. S_{2,2} = S_{1,2} − δ
− Case 3. S_{2,2} = S_{2,1} − δ
s(i, j) = 1 for matching positions, s(i, j) = −µ for substitutions.
Choose the case (path) that yields the maximum score. Keep track of path choices.
Global alignment: formal development
A = a_1 a_2 a_3 … a_n, B = b_1 b_2 b_3 … b_m

Example alignment:
b_1  b_2  b_3  b_4  -
-    a_1  -    a_2  a_3

         0    1    2    3    4
         -    b_1  b_2  b_3  b_4
0  -
1  a_1
2  a_2
3  a_3

• Any alignment can be written as a unique path through the matrix
• Score for aligning A and B up to positions i and j:
  S_{i,j} = S(a_1 a_2 a_3 … a_i, b_1 b_2 b_3 … b_j)
Scoring partial alignments
• Alignment of A = a_1 a_2 a_3 … a_n with B = b_1 b_2 b_3 … b_m can end in three ways
  − Case 1: (a_1 a_2 … a_{i-1}) a_i
            (b_1 b_2 … b_{j-1}) b_j
  − Case 2: (a_1 a_2 … a_{i-1}) a_i
            (b_1 b_2 … b_j)     –
  − Case 3: (a_1 a_2 … a_i)     –
            (b_1 b_2 … b_{j-1}) b_j
Scoring alignments
• Scores for each case:
  − Case 1: (a_1 a_2 … a_{i-1}) a_i
            (b_1 b_2 … b_{j-1}) b_j
    s(a_i, b_j) = +1 if a_i = b_j, −µ otherwise
  − Case 2: (a_1 a_2 … a_{i-1}) a_i
            (b_1 b_2 … b_j)     –
  − Case 3: (a_1 a_2 … a_i)     –
            (b_1 b_2 … b_{j-1}) b_j
    For Cases 2 and 3: s(a_i, –) = s(–, b_j) = −δ
Scoring alignments (2)
• First row and first column correspond to initial alignment against indels:
  S(i, 0) = −iδ, S(0, j) = −jδ

         0     1     2     3     4
         -     b_1   b_2   b_3   b_4
0  -     0     −δ    −2δ   −3δ   −4δ
1  a_1   −δ
2  a_2   −2δ
3  a_3   −3δ

• Optimal global alignment score S(A, B) = S_{n,m}
Algorithm for global alignment

Input sequences A, B, n = |A|, m = |B|
Set S_{i,0} := −δi for all i
Set S_{0,j} := −δj for all j
for i := 1 to n
    for j := 1 to m
        S_{i,j} := max{ S_{i-1,j} − δ, S_{i-1,j-1} + s(a_i, b_j), S_{i,j-1} − δ }
    end
end

Algorithm takes O(nm) time and space.
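The pseudocode above translates almost line by line into Python. The sketch below is a minimal illustration, assuming match score +1, mismatch −µ and indel −δ as on the earlier slides; the function name `global_alignment_matrix` is made up for this example.

```python
def global_alignment_matrix(a: str, b: str, mu: float = 1, delta: float = 2):
    """Fill the (n+1) x (m+1) matrix S of optimal global alignment scores."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: a_1..a_i against indels
        S[i][0] = -delta * i
    for j in range(1, m + 1):          # first row: b_1..b_j against indels
        S[0][j] = -delta * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else -mu
            S[i][j] = max(S[i - 1][j] - delta,      # a_i against an indel
                          S[i - 1][j - 1] + s,      # a_i against b_j
                          S[i][j - 1] - delta)      # b_j against an indel
    return S


S = global_alignment_matrix("ATCGT", "TGGTG", mu=1, delta=2)
print(S[-1][-1])  # optimal global alignment score S_{n,m}
```

Both the time and the memory use are proportional to the number of matrix cells, which matches the O(nm) bound stated on the slide.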
Global alignment: example
µ = 1, δ = 2

       -    T    G    G    T    G
  -    0   -2   -4   -6   -8  -10
  A   -2
  T   -4
  C   -6
  G   -8
  T  -10                        ?
Global alignment: example (2)
µ = 1, δ = 2

       -    T    G    G    T    G
  -    0   -2   -4   -6   -8  -10
  A   -2   -1   -3   -5   -7   -9
  T   -4   -1   -2   -4   -4   -6
  C   -6   -3   -2   -3   -5   -5
  G   -8   -5   -2   -1   -3   -4
  T  -10   -7   -4   -3    0   -2

ATCGT-
 | ||
-TGGTG
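The alignment shown with the filled matrix is recovered by following the path choices backwards from the bottom-right cell. The following Python sketch fills the matrix and then backtracks. It assumes the same scoring as the example (µ = 1, δ = 2), and the tie-breaking order (diagonal, then up, then left) is an arbitrary choice of this illustration.

```python
def global_align(a: str, b: str, mu: float = 1, delta: float = 2):
    """Return (score, aligned a, aligned b) for an optimal global alignment."""
    n, m = len(a), len(b)

    def s(x, y):
        return 1 if x == y else -mu

    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i
    for j in range(1, m + 1):
        S[0][j] = -delta * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j] - delta,
                          S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i][j - 1] - delta)

    # Backtrack from S[n][m] to S[0][0], re-deriving which case produced each cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - delta:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = global_align("ATCGT", "TGGTG")
print(score)
print(row_a)
print(row_b)
```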
Local alignment: rationale
• Otherwise dissimilar proteins may have local regions of similarity -> Proteins may share a function
Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles a 291 aa region in TGF-β receptor (right). The shared function here is protein kinase.
Local alignment: rationale
[Figure: sequences A and B with their regions of similarity marked.]
• Global alignment would be inadequate
• Problem: find the highest scoring local alignment between two sequences
• Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981)
From global to local alignment
• Modifications to the global alignment algorithm
  − Look for the highest-scoring path in the alignment matrix (the path need not span the whole matrix)
  − Allow preceding and trailing indels without penalty
Scoring local alignments
A = a_1 a_2 a_3 … a_n, B = b_1 b_2 b_3 … b_m
Let I and J be intervals (substrings) of A and B, respectively: I ⊂ A, J ⊂ B
Best local alignment score:
  M(A, B) = max { S(I, J) : I ⊂ A, J ⊂ B }
where S(I, J) is the score for substrings I and J.
Allowing preceding and trailing indels
• First row and column initialised to zero: M_{i,0} = M_{0,j} = 0

         0    1    2    3    4
         -    b_1  b_2  b_3  b_4
0  -     0    0    0    0    0
1  a_1   0
2  a_2   0
3  a_3   0

Example:
b_1  b_2  b_3
-    -    a_1
Recursion for local alignment
• M_{i,j} = max {
    M_{i-1,j-1} + s(a_i, b_j),
    M_{i-1,j} − δ,
    M_{i,j-1} − δ,
    0
  }

      -  T  G  G  T  G
  -   0  0  0  0  0  0
  A   0  0  0  0  0  0
  T   0  1  0  0  1  0
  C   0  0  0  0  0  0
  G   0  0  1  1  0  1
  T   0  1  0  0  2  0
Finding best local alignment
• Optimal score is the highest value in the matrix: max_{i,j} M_{i,j}
• Best local alignment can be found by backtracking from the highest value in M

      -  T  G  G  T  G
  -   0  0  0  0  0  0
  A   0  0  0  0  0  0
  T   0  1  0  0  1  0
  C   0  0  0  0  0  0
  G   0  0  1  1  0  1
  T   0  1  0  0  2  0
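The local alignment recursion and the backtracking step can be sketched as follows. This is an illustrative Python version, not a reference implementation. The function name `local_align` and the default scores (match +1, mismatch −1, indel −2, which reproduce the small matrix above) are assumptions, and the backtrack stops at the first cell with value 0.

```python
def local_align(a: str, b: str, match: float = 1, mismatch: float = -1, indel: float = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0,
                          M[i - 1][j - 1] + s,
                          M[i - 1][j] + indel,
                          M[i][j - 1] + indel)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Backtrack from the highest value until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("ATCGT", "TGGTG"))
print(local_align("ACCTAAGG", "GGCTCAATCA", match=2, mismatch=-1, indel=-2))
```

If several cells share the highest value, this sketch reports only the first one it meets in row order.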
Local alignment: example

         0  1  2  3  4  5  6  7  8  9 10
         -  G  G  C  T  C  A  A  T  C  A
   0  -  0  0  0  0  0  0  0  0  0  0  0
   1  A  0
   2  C  0
   3  C  0
   4  T  0
   5  A  0
   6  A  0
   7  G  0
   8  G  0
Local alignment: example
Scoring: Match: +2, Mismatch: -1, Indel: -2

         0  1  2  3  4  5  6  7  8  9 10
         -  G  G  C  T  C  A  A  T  C  A
   0  -  0  0  0  0  0  0  0  0  0  0  0
   1  A  0  0  0  0  0  0  2  2  0  0  2
   2  C  0  0  0  2  0  2  0  1  1  2  0
   3  C  0  0  0  2  1  2  1  0  0  3  1
   4  T  0  0  0  0  4  2  1  0  2  1  2
   5  A  0  0  0  0  2  3  4  3  1  1  3
   6  A  0  0  0  0  0  1  5  6  4  2  3
   7  G  0  2  2  0  0  0  3  4  5  3  1
   8  G  0  2  4  2  0  0  1  2  3  4  2

Best local alignment (score 6):
C T – A A
C T C A A
Non-uniform mismatch penalties
• We used a uniform penalty for mismatches:
  s('A', 'C') = s('A', 'G') = … = s('G', 'T') = −µ
• Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent as transversions (e.g. A->T, T->A, A->C, G->T)
  − use non-uniform mismatch penalties

        A     C     G     T
  A     1    -1   -0.5   -1
  C    -1     1    -1   -0.5
  G   -0.5   -1     1    -1
  T    -1   -0.5   -1     1
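One way to use such a table in code is a dictionary keyed by pairs of bases. The Python sketch below copies the values from the slide; the names `SUBST` and `is_transition` are invented for this illustration.

```python
BASES = "ACGT"
ROWS = [
    [1, -1, -0.5, -1],     # A
    [-1, 1, -1, -0.5],     # C
    [-0.5, -1, 1, -1],     # G
    [-1, -0.5, -1, 1],     # T
]
# SUBST[(x, y)] is the score s(x, y) for aligning base x against base y.
SUBST = {(x, y): ROWS[i][j] for i, x in enumerate(BASES) for j, y in enumerate(BASES)}


def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine (A, G) or pyrimidine<->pyrimidine (C, T) changes."""
    return {x, y} in ({"A", "G"}, {"C", "T"})


print(SUBST[("A", "G")], SUBST[("A", "C")])
```

In a dynamic programming routine, a lookup such as `SUBST[(a_i, b_j)]` would take the place of the uniform +1 / −µ rule.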
Gaps in alignment
• Gap is a succession of indels in alignment

C T – – – A A
C T C G C A A

• Previous model scored a length k gap as w(k) = −kδ
• Replication processes may produce longer stretches of insertions or deletions
  − In coding regions, insertions or deletions of codons may preserve functionality
Gap open and extension penalties (2)
• We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it:
  w(k) = −α − β(k – 1)
• Gap open cost α, gap extension cost β
• Our previous algorithm can be extended to use w(k) (not discussed on this course)
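The difference between the two gap models is easy to see numerically. In this small Python sketch the parameter names `alpha` (gap open cost) and `beta` (gap extension cost) and all numeric values are assumptions chosen for illustration.

```python
def linear_gap(k: int, delta: float) -> float:
    """Score of a length-k gap when every indel costs delta."""
    return -k * delta


def affine_gap(k: int, alpha: float, beta: float) -> float:
    """Score of a length-k gap: opening costs alpha, each further position costs beta."""
    return -alpha - beta * (k - 1)


# Example values only: a long gap is penalised less under the affine model.
for k in (1, 3, 6):
    print(k, linear_gap(k, delta=2), affine_gap(k, alpha=2, beta=0.5))
```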
Multiple alignment
• Consider a set of n sequences:
  aggcgagctgcgagtgcta
  cgttagattgacgctgac
  ttccggctgcgac
  gacacggcgaacgga
  agtgtgcccgacgagcgaggac
  gcgggctgtgagcgcta
  aagcggcctgtgtgcccta
  atgctgctgccagtgta
  agtcgagccccgagtgc
  agtccgagtcc
  actcggtgc
  – Orthologous sequences from different organisms
  – Paralogs from multiple duplications
• How can we study relationships between these sequences?
Optimal alignment of three sequences
• Alignment of A = a_1 a_2 … a_i and B = b_1 b_2 … b_j can end either in (-, b_j), (a_i, b_j) or (a_i, -)
• 2^2 – 1 = 3 alternatives
• Alignment of A, B and C = c_1 c_2 … c_k can end in 2^3 – 1 = 7 ways: (a_i, -, -), (-, b_j, -), (-, -, c_k), (-, b_j, c_k), (a_i, -, c_k), (a_i, b_j, -) or (a_i, b_j, c_k)
• Solve the recursion using a three-dimensional dynamic programming matrix: O(n^3) time and space
• Generalizes to more sequences (one matrix dimension per sequence) but is impractical even with a moderate number of sequences
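The count of possible last columns can be checked by enumeration: each sequence contributes either its last residue or an indel, and the all-indel column is excluded. A small Python sketch follows; the function name `column_endings` is an assumption.

```python
from itertools import product


def column_endings(k: int):
    """All possible last columns for k sequences.

    Each entry is a tuple of booleans: True means the sequence contributes
    its last residue, False means an indel. The all-indel column is excluded.
    """
    return [col for col in product((True, False), repeat=k) if any(col)]


for k in (2, 3, 4):
    print(k, len(column_endings(k)))  # 2**k - 1 alternatives
```

Every cell of the k-dimensional matrix has to consider all of these alternatives, which is one reason the exact method scales poorly.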
Multiple alignment in practice
• In practice, real-world multiple alignment problems are usually solved with heuristics
• Progressive multiple alignment
  − Choose two sequences and align them
  − Choose a third sequence w.r.t. the two previous sequences and align the third against them
  − Repeat until all sequences have been aligned
  − There are different options for how to choose sequences and score alignments
Multiple alignment in practice
• Profile-based progressive multiple alignment: CLUSTALW
  − Construct a distance matrix of all pairs of sequences using dynamic programming
  − Progressively align pairs in order of decreasing similarity
  − CLUSTALW uses various heuristics to improve accuracy
Additional material
• R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
• Course Biological sequence analysis in Spring 2007
Inferring the Past: Phylogenetic Trees (chapter 12)
• The biological problem
• Parsimony and distance methods
• Models for mutations and estimation of distances
• Maximum likelihood methods
Phylogeny
• We want to study ancestor-descendant relationships, or phylogeny, among groups of organisms
• Groups are called taxa (singular: taxon)
• Organisms are usually called operational taxonomic units or OTUs in the context of phylogeny
Phylogenetic trees
• Leaves (external nodes) ~ species, observed (OTUs)
• Internal nodes ~ ancestral species/divergence events, not observed
• Unrooted tree does not specify ancestor-descendant relationships beyond the observation "leaves are not ancestors"
[Figure: unrooted tree with 5 leaves (nodes 1 to 5) and 3 internal nodes (nodes 6 to 8).]
Is node 7 an ancestor of node 6?
Phylogenetic trees
• Rooting a tree specifies all ancestor-descendant relationships in the tree
• Root is the ancestor to the other species
• There are n-1 ways to root a tree with n nodes
[Figure: the unrooted tree with two candidate root positions R_1 and R_2, and the two resulting rooted trees root(R_1) and root(R_2) over leaves 1 to 5 and internal nodes 6 to 8.]
Questions
• Can we enumerate all possible phylogenetic trees for n species (or sequences)?
• How to score a phylogenetic tree with respect to data?
• How to find the best phylogenetic tree given data?
Finding the best phylogenetic tree: naive method
• How can we find the phylogenetic tree that best represents the data?
• Naive method: enumerate all possible trees
• How many different trees are there of n species?
• Denote this number by b_n
Enumerating unordered trees
• Start with the only unordered tree with 3 leaves (b_3 = 1)
• Consider all ways to add a leaf node to this tree
• Fourth node can be added to 3 different branches (edges), creating 1 new internal branch
• Total number of branches is n external and n – 3 internal branches
• Unrooted tree with n leaves has 2n – 3 branches
[Figure: the tree with leaves 1, 2, 3 and the three trees obtained by attaching leaf 4 to each of its branches.]
Enumerating unordered trees
• Thus, we get the number of unrooted trees
  b_n = (2(n – 1) – 3) b_{n-1} = (2n – 5) b_{n-1}
      = (2n – 5) · (2n – 7) · … · 3 · 1
      = (2n – 5)! / ((n – 3)! 2^{n-3}),  n > 2
• Number of rooted trees b'_n is
  b'_n = (2n – 3) b_n = (2n – 3)! / ((n – 2)! 2^{n-2}),  n > 2
  that is, the number of unrooted trees times the number of branches in the trees
Number of possible rooted and unrooted trees

   n   b_n (unrooted)   b'_n (rooted)
   3   1                3
   4   3                15
   5   15               105
   6   105              945
   7   945              10395
   8   10395            135135
   9   135135           2027025
  10   2027025          34459425
  20   2.22E+020        8.20E+021
  30   8.69E+036        4.95E+038
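The table entries follow directly from the recursion b_n = (2n – 5) b_{n-1} with b_3 = 1 and from b'_n = (2n – 3) b_n. A short Python sketch to reproduce them follows; the function names are invented for this example.

```python
from math import factorial


def unrooted_trees(n: int) -> int:
    """b_n: number of unrooted binary trees with n >= 3 labelled leaves."""
    count = 1                      # b_3 = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5         # leaf k can be attached to any of 2(k-1)-3 branches
    return count


def rooted_trees(n: int) -> int:
    """b'_n: a root can be placed on any of the 2n-3 branches of an unrooted tree."""
    return (2 * n - 3) * unrooted_trees(n)


for n in (3, 4, 5, 6, 7, 8, 9, 10):
    print(n, unrooted_trees(n), rooted_trees(n))

# Cross-check against the closed form (2n-5)! / ((n-3)! 2^(n-3)).
n = 10
print(unrooted_trees(n) == factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3)))
```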
Too many trees?
• We can't construct and evaluate every phylogenetic tree even for a smallish number of species
• Better alternative is to
  − Devise a way to evaluate an individual tree against the data
  − Guide the search using the evaluation criteria to reduce the search space
Parsimony method
• The parsimony method finds the tree that explains the observed sequences with a minimal number of substitutions
• Method has two steps
  − Compute the smallest number of substitutions for a given tree with a parsimony algorithm
  − Search for the tree with the minimal number of substitutions
Parsimony: an example
• Consider the following short sequences
  1 ACTTT
  2 ACATT
  3 AACGT
  4 AATGT
  5 AATTT
• There are 105 possible rooted trees for 5 sequences
• Example: which of the following trees explains the sequences with the least number of substitutions?
[Figure: rooted tree; the substitution on the branch leading to a node is given in parentheses.]

9 AATTT
├─ 7 AATTT
│  ├─ 6 AATGT (T->G)
│  │  ├─ 3 AACGT (T->C)
│  │  └─ 4 AATGT
│  └─ 5 AATTT
└─ 8 ACTTT (A->C)
   ├─ 2 ACATT (T->A)
   └─ 1 ACTTT

This tree explains the sequences with 4 substitutions
First tree: 4 substitutions… First tree is more parsimonious!

9 AATTT
├─ 7 AATTT
│  ├─ 6 AATGT (T->G)
│  │  ├─ 3 AACGT (T->C)
│  │  └─ 4 AATGT
│  └─ 5 AATTT
└─ 8 ACTTT (A->C)
   ├─ 2 ACATT (T->A)
   └─ 1 ACTTT

Second tree: 6 substitutions…

9 AATTT
├─ 8 AATGT (T->G)
│  ├─ 7 AACGT (T->C)
│  │  ├─ 6 ACCTT (A->C, G->T)
│  │  │  ├─ 1 ACTTT (C->T)
│  │  │  └─ 2 ACATT (C->A)
│  │  └─ 3 AACGT
│  └─ 4 AATGT
└─ 5 AATTT
Computing parsimony
• Parsimony treats each site (position in a sequence) independently
• Total parsimony cost is the sum of parsimony costs of each site
• We can compute the minimal parsimony cost for a given tree by
  − First finding out possible assignments at each node, starting from leaves and proceeding towards the root
  − Then, starting from the root, assigning a letter at each node, proceeding towards leaves
Labelling tree nodes
• An unrooted tree with n leaves contains 2n-2 nodes altogether
• Assign the following labels to nodes in a rooted tree
  − leaf nodes: 1, 2, …, n
  − internal nodes: n+1, n+2, …, 2n-1
  − root node: 2n-1
• The label of a child node is always smaller than the label of the parent node
[Figure: rooted tree with leaves 1 to 5, internal nodes 6, 7 and 8, and root 9.]
Parsimony algorithm: first phase
• Find out possible assignments at every node for each site independently. Denote site u in sequence i by s_{i,u}

For i := 1, …, n do
    F_i := {s_{i,u}}    % possible assignments at node i
    L_i := 0            % number of substitutions up to node i
For i := n+1, …, 2n-1 do
    Let j and k be the children of node i
    If F_j ∩ F_k = ∅
    then L_i := L_j + L_k + 1, F_i := F_j ∪ F_k
    else L_i := L_j + L_k, F_i := F_j ∩ F_k
Parsimony algorithm: first phase
Choose u = 3 (for example)
F_1 := {T}, L_1 := 0
F_2 := {A}, L_2 := 0
F_3 := {C}, L_3 := 0
F_4 := {T}, L_4 := 0
F_5 := {T}, L_5 := 0
F_8 := F_1 ∪ F_2 = {A, T}
L_8 := L_1 + L_2 + 1 = 1
[Figure: the example tree with root 9, internal nodes 7, 6 and 8, and leaves 3 AACGT, 4 AATGT, 5 AATTT, 2 ACATT, 1 ACTTT; site 3 is highlighted.]
Parsimony algorithm: first phase
F_6 := F_3 ∪ F_4 = {C, T}
L_6 := L_3 + L_4 + 1 = 1
F_7 := F_5 ∩ F_6 = {T}
L_7 := L_5 + L_6 = 1
F_8 := F_1 ∪ F_2 = {A, T}
L_8 := L_1 + L_2 + 1 = 1
F_9 := F_7 ∩ F_8 = {T}
L_9 := L_7 + L_8 = 2
So the parsimony cost for site 3 is 2
[Figure: the example tree with nodes 9, 7 and 8 labelled T and node 6 labelled {C, T}; leaves 3 AACGT, 4 AATGT, 5 AATTT, 2 ACATT, 1 ACTTT.]
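The first phase can be written compactly with Python sets. The sketch below encodes the example tree as a dictionary from each internal node to its two children; that encoding, the names `SEQS`, `CHILDREN` and `fitch_first_phase`, and the 0-based site index are choices of this illustration.

```python
SEQS = {1: "ACTTT", 2: "ACATT", 3: "AACGT", 4: "AATGT", 5: "AATTT"}
# Internal node -> (child, child); child labels are always smaller than the parent's.
CHILDREN = {6: (3, 4), 7: (5, 6), 8: (1, 2), 9: (7, 8)}


def fitch_first_phase(seqs, children, u):
    """Return (F, L) for site u (0-based): possible letters and substitution counts."""
    F = {i: {s[u]} for i, s in seqs.items()}
    L = {i: 0 for i in seqs}
    for i in sorted(children):                 # children are processed before parents
        j, k = children[i]
        common = F[j] & F[k]
        if common:                             # intersection non-empty: no new substitution
            F[i], L[i] = common, L[j] + L[k]
        else:                                  # empty intersection: one more substitution
            F[i], L[i] = F[j] | F[k], L[j] + L[k] + 1
    return F, L


F, L = fitch_first_phase(SEQS, CHILDREN, u=2)  # site 3 on the slide
print(F[9], L[9])
```

Summing `L[9]` over all five sites gives the total parsimony cost of this tree.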
Parsimony algorithm: second phase
• Backtrack from the root and assign x ∈ F_i at each node
• If we assigned y at the parent of node i and y ∈ F_i, then assign y
• Else assign x ∈ F_i at random
Parsimony algorithm: second phase
At node 6, the algorithm assigns T because T was assigned to parent node 7 and T ∈ F_6
The other nodes have only one possible letter to assign
[Figure: the example tree with nodes 9, 7 and 8 labelled T and node 6 labelled {C, T}; leaves 3 AACGT, 4 AATGT, 5 AATTT, 2 ACATT, 1 ACTTT.]
Parsimony algorithm
First and second phase are repeated for each site in the sequences, summing the parsimony costs at each site
[Figure: the example tree with nodes 9, 7, 6 and 8 all labelled T at site 3; leaves 3 AACGT, 4 AATGT, 5 AATTT, 2 ACATT, 1 ACTTT.]
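Both phases, repeated over all sites, fit into one short routine. The Python sketch below is an illustration under some assumptions: the tree is encoded as a dictionary from internal node to its two children, and where the slides choose a letter at random the sketch uses a `choose` function that defaults to `min`, so that the result is reproducible.

```python
SEQS = {1: "ACTTT", 2: "ACATT", 3: "AACGT", 4: "AATGT", 5: "AATTT"}
CHILDREN = {6: (3, 4), 7: (5, 6), 8: (1, 2), 9: (7, 8)}


def parsimony(seqs, children, choose=min):
    """Return (total cost, {node: assigned sequence}) for a rooted binary tree."""
    parent = {c: p for p, kids in children.items() for c in kids}
    n_sites = len(next(iter(seqs.values())))
    letters = {i: [] for i in list(seqs) + list(children)}
    total = 0
    for u in range(n_sites):
        # First phase: from the leaves towards the root.
        F = {i: {s[u]} for i, s in seqs.items()}
        for i in sorted(children):
            j, k = children[i]
            common = F[j] & F[k]
            if common:
                F[i] = common
            else:
                F[i] = F[j] | F[k]
                total += 1
        # Second phase: from the root towards the leaves (decreasing labels).
        assigned = {}
        for i in sorted(F, reverse=True):
            y = assigned.get(parent.get(i))    # None at the root
            assigned[i] = y if y in F[i] else choose(F[i])
            letters[i].append(assigned[i])
    return total, {i: "".join(chars) for i, chars in letters.items()}


cost, labels = parsimony(SEQS, CHILDREN)
print(cost)
for node in sorted(labels, reverse=True):
    print(node, labels[node])
```

Passing a different `choose`, for example one based on `random.choice` over a sorted list, would follow the random rule more literally; the cost is the same either way.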
Properties of parsimony algorithm
• Parsimony algorithm requires that the sequences are of the same length
  − First align the sequences against each other and remove indels
  − Then compute parsimony for the resulting sequences
• Is the most parsimonious tree the correct tree?
  − Not necessarily, but it explains the sequences with the least number of substitutions
  − We can assume that the probability of having fewer mutations is higher than that of having many mutations
Finding the most parsimonious tree
• Parsimony algorithm calculates the parsimony cost for a given tree…
• …but we still have the problem of finding the tree with the lowest cost
• Exhaustive search (enumerating all trees) is in general infeasible
• More efficient methods exist, for example
  − Probabilistic search
  − Branch and bound
Branch and bound in parsimony
• We can exploit the fact that adding edges to a tree can never decrease the parsimony cost
[Figure: at the highlighted site, a two-leaf tree (AATGT, AATTT) has node set {T} and cost 0; the three-leaf tree (AACGT, AATGT, AATTT) has node sets {C, T} and {T} and cost 1.]
Branch and bound in parsimony
Branch and bound is a general search strategy where
• Each solution is potentially generated
• Track is kept of the best solution found
• If a partial solution cannot achieve a better score, we abandon the current search path
In parsimony…
• Start from a tree with 1 sequence
• Add a sequence to the tree and calculate parsimony cost
• If the tree is complete, check if we found the best tree so far
• If the tree is not complete and cost exceeds best tree cost, do not continue adding edges to this tree
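The strategy can be sketched in Python as follows. This is a simplified illustration, not an optimised search: trees are rooted and written as nested tuples of 0-based leaf indices, every way of attaching the next sequence is tried in a fixed order, and a partial tree is abandoned as soon as its cost is at least the best complete cost (equal cost cannot improve the result either). All names are invented for this example.

```python
def site_sets(node, seqs, u):
    """First-phase result at site u: (possible letters, substitutions) for `node`."""
    if isinstance(node, int):                       # leaf: index into seqs
        return {seqs[node][u]}, 0
    (fl, cl), (fr, cr) = site_sets(node[0], seqs, u), site_sets(node[1], seqs, u)
    common = fl & fr
    return (common, cl + cr) if common else (fl | fr, cl + cr + 1)


def parsimony_cost(tree, seqs):
    return sum(site_sets(tree, seqs, u)[1] for u in range(len(seqs[0])))


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` on one branch of `tree`."""
    yield (tree, leaf)                              # above the current (sub)tree
    if not isinstance(tree, int):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def branch_and_bound(seqs):
    best_cost, best_tree, visited = float("inf"), None, 0

    def extend(tree, next_leaf):
        nonlocal best_cost, best_tree, visited
        visited += 1
        cost = parsimony_cost(tree, seqs)
        if cost >= best_cost:                       # bound: cannot beat the best tree
            return
        if next_leaf == len(seqs):                  # complete tree: store as best
            best_cost, best_tree = cost, tree
            return
        for bigger in insertions(tree, next_leaf):  # branch: add the next sequence
            extend(bigger, next_leaf + 1)

    extend(0, 1)                                    # start from a tree with 1 sequence
    return best_cost, best_tree, visited


SEQS = ["ACTTT", "ACATT", "AACGT", "AATGT", "AATTT"]
cost, tree, visited = branch_and_bound(SEQS)
print(cost, tree, visited)
```

The `visited` counter shows how many partial and complete trees were actually scored; without the bound, all 1 + 1 + 3 + 15 + 105 = 125 of them would be.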
Branch and bound graphically
[Figure: search tree of partial and complete phylogenetic trees, grown by adding sequences 1, 2, 3, 4, …; legend:]
• Partial tree, no best complete tree constructed yet
• Complete tree: calculate parsimony cost and store
• Partial tree, cost exceeds the cost of the best tree so far
Distance methods
• The parsimony method works on sequence (character string) data
• We can also build phylogenetic trees in a more general setting
• Distance methods work on a set of pairwise distances d_ij for the data
• Distances can be obtained from phenotypes as well as from genotypes (sequences)
Distances in a phylogenetic tree
• Distance matrix D = (d_ij) gives pairwise distances for leaves of the phylogenetic tree
• In addition, the phylogenetic tree will now specify distances between leaves and internal nodes
  − Denote these with d_ij as well
[Figure: rooted tree with leaves 1 to 5 and internal nodes 6, 7 and 8.]
Distance d_ij states how far apart species i and j are evolutionarily (e.g., number of mismatches in aligned sequences)
Distances in evolutionary context
• Distances d_ij in evolutionary context satisfy the following conditions
  − Symmetry: d_ij = d_ji for each i, j
  − Distinguishability: d_ij ≠ 0 if and only if i ≠ j
  − Triangle inequality: d_ij ≤ d_ik + d_kj for each i, j, k
• Distances satisfying these conditions are called metric
• In addition, evolutionary mechanisms may impose additional constraints on the distances -> additive and ultrametric distances
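The three conditions can be checked mechanically for a given distance matrix. In the Python sketch below, the function name `is_metric` and both example matrices are illustrative assumptions.

```python
from itertools import product


def is_metric(d) -> bool:
    """Check symmetry, distinguishability and the triangle inequality for matrix d."""
    idx = range(len(d))
    symmetric = all(d[i][j] == d[j][i] for i, j in product(idx, repeat=2))
    distinguishable = all((d[i][j] != 0) == (i != j) for i, j in product(idx, repeat=2))
    triangle = all(d[i][j] <= d[i][k] + d[k][j] for i, j, k in product(idx, repeat=3))
    return symmetric and distinguishable and triangle


# Hypothetical example matrices.
D_OK = [[0, 2, 4, 4],
        [2, 0, 4, 4],
        [4, 4, 0, 2],
        [4, 4, 2, 0]]
D_BAD = [[0, 1, 5],
         [1, 0, 1],
         [5, 1, 0]]   # 5 > 1 + 1 violates the triangle inequality

print(is_metric(D_OK), is_metric(D_BAD))
```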
Additive trees
• A tree is called additive, if the distance between any pair of leaves (i, j) is the sum of the distances between the leaves and the first node k that they share in the tree
  d_ij = d_ik + d_jk
• "Follow the path from the leaf i to the leaf j to find the exact distance d_ij between the leaves."
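Additive trees: example

      A  B  C  D
  A   0  2  4  4
  B   2  0  4  4
  C   4  4  0  2
  D   4  4  2  0

[Figure: unrooted tree in which leaves A and B join one internal node and leaves C and D join another, each by an edge of length 1; the edge between the two internal nodes has length 2.]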
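That the matrix is additive for this tree can be verified by summing edge lengths along the path between each pair of leaves. In the Python sketch below, the internal node names `x` and `y` and the edge-list encoding are assumptions made for illustration.

```python
# Edge lengths of the example tree; "x" and "y" are the two internal nodes.
EDGES = {("A", "x"): 1, ("B", "x"): 1, ("x", "y"): 2, ("C", "y"): 1, ("D", "y"): 1}

D = {
    "A": {"A": 0, "B": 2, "C": 4, "D": 4},
    "B": {"A": 2, "B": 0, "C": 4, "D": 4},
    "C": {"A": 4, "B": 4, "C": 0, "D": 2},
    "D": {"A": 4, "B": 4, "C": 2, "D": 0},
}


def path_length(edges, start, goal):
    """Length of the unique path between two nodes of a tree."""
    adjacent = {}
    for (u, v), w in edges.items():
        adjacent.setdefault(u, []).append((v, w))
        adjacent.setdefault(v, []).append((u, w))
    stack = [(start, None, 0)]
    while stack:
        node, previous, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adjacent[node]:
            if nxt != previous:               # never walk back along the same edge
                stack.append((nxt, node, dist + w))
    raise ValueError("nodes are not connected")


print(all(path_length(EDGES, i, j) == D[i][j] for i in D for j in D))
```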
Ultrametric trees
• A rooted additive tree is called an ultrametric tree, if the distances between any two leaves i and j, and their common ancestor k are equal
  d_ik = d_jk
• Edge length d_ij corresponds to the time elapsed since divergence of i and j from the common parent
• In other words, edge lengths are measured by a molecular clock with a constant rate
Identifying ultrametric data
• We can identify distances to be ultrametric by the three-point condition:
  D corresponds to an ultrametric tree if and only if for any three sequences i, j and k, the distances satisfy d_ij ≤ max(d_ik, d_kj)
• If we find out that the data is ultrametric, we can utilise a simple algorithm to find the corresponding tree
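The three-point condition is straightforward to test by checking every ordered triple of distinct sequences. The Python sketch below is illustrative: the function name `is_ultrametric` and both example matrices are assumptions.

```python
from itertools import permutations


def is_ultrametric(d) -> bool:
    """Three-point condition: d_ij <= max(d_ik, d_kj) for all distinct i, j, k."""
    return all(d[i][j] <= max(d[i][k], d[k][j])
               for i, j, k in permutations(range(len(d)), 3))


# Hypothetical example matrices.
D_ULTRA = [[0, 2, 4, 4],
           [2, 0, 4, 4],
           [4, 4, 0, 2],
           [4, 4, 2, 0]]
D_NOT = [[0, 2, 5],
         [2, 0, 4],
         [5, 4, 0]]    # 5 > max(2, 4)

print(is_ultrametric(D_ULTRA), is_ultrametric(D_NOT))
```

Checking all triples takes time cubic in the number of sequences, which is fine for small data sets.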