582606 introduction to bioinformatics

582606 Introduction to bioinformatics, Autumn 2006, Esa Pitkänen

582606 Introduction to bioinformatics, Autumn 2006. Esa Pitkänen, Master's Degree Programme in Bioinformatics (MBI), Department of Computer Science, University of Helsinki. http://www.cs.helsinki.fi/mbi/courses/06-07/itb/ Introduction to


  1. Similarity vs homology
     • Sequence similarity is not sequence homology
       − If the two sequences g_B and g_C have accumulated enough mutations, the similarity between them is likely to be low

     #mutations  sequence
              0  agtgtccgttaagtgcgttc
              1  agtgtccgttatagtgcgttc
              2  agtgtccgcttatagtgcgttc
              4  agtgtccgcttaagggcgttc
              8  agtgtccgcttcaaggggcgt
             16  gggccgttcatgggggt
             32  gcagggcgtcactgagggct
             64  acagtccgttcgggctattg
            128  cagagcactaccgc
            256  cacgagtaagatatagct
            512  taatcgtgata
           1024  acccttatctacttcctggagtt
           2048  agcgacctgcccaa
           4096  caaac

     • Homology is more difficult to detect over greater evolutionary distances.
     Introduction to bioinformatics, Autumn 2006

  2. Similarity vs homology (2)
     • Sequence similarity can occur by chance
       − Similarity does not imply homology
     • Similarity is an expected consequence of homology

  3. Orthologs and paralogs
     • We distinguish between two types of homology
       − Orthologs: homologs from two different species
       − Paralogs: homologs within a species
     [Figure: two panels. In both, organism A carries gene g_A, and organisms B and C carry g_B and g_C. In the second panel, gene A is first copied within organism A, giving g_A and g_A'.]

  4. Orthologs and paralogs (2)
     • Orthologs typically retain the original function
     • In paralogs, one copy is free to mutate and acquire a new function (no selective pressure)
     [Figure: the same two panels as on the previous slide.]

  5. Sequence alignment
     • Alignment specifies which positions in two sequences match

       acgtctag         acgtctag         acgtctag
       ||                  |||||         || |||||
       actctag-         -actctag         ac-tctag
       2 matches        5 matches        7 matches
       5 mismatches     2 mismatches     0 mismatches
       1 not aligned    1 not aligned    1 not aligned
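
The three tallies on this slide can be reproduced with a few lines of Python. This is only a minimal sketch added for illustration, not part of the original slides: the function name `alignment_counts` is hypothetical, and `-` is assumed to mark a position that is not aligned.

```python
def alignment_counts(top, bottom):
    """Count matches, mismatches and unaligned positions of two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = unaligned = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            unaligned += 1      # one row has an indel here
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, unaligned


# The three candidate alignments of acgtctag shown on the slide.
for bottom in ("actctag-", "-actctag", "ac-tctag"):
    print(bottom, alignment_counts("acgtctag", bottom))
```

Each printed triple is (matches, mismatches, not aligned) for one of the three alignments.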

  6. Mutations: Insertions, deletions and substitutions

       acgtctag
          |||||
       -actctag

     • Indel: insertion or deletion of a base with respect to the ancestor sequence
     • Mismatch: substitution (point mutation) of a single base
     • Insertions and/or deletions are called indels
       − We can't tell whether the ancestor sequence had a base or not at an indel position

  7. Problems
     • What sorts of alignments should be considered?
     • How to score alignments?
     • How to find optimal or good scoring alignments?
     • How to evaluate the statistical significance of scores?
     In this course, we discuss the first three problems. The course Biological sequence analysis tackles all four in depth.

  8. Sequence Alignment (chapter 6)
     • The biological problem
     • Global alignment
     • Local alignment
     • Multiple alignment

  9. Global alignment
     • Problem: find the optimal scoring alignment between two sequences (Needleman & Wunsch 1970)
     • We give a score for each position in the alignment
       − Identity (match): +1
       − Substitution (mismatch): −µ
       − Indel: −δ
     • Example:

       WHAT
       ||
       WH-Y

       S(WHAT/WH-Y) = 1 + 1 − δ − µ

  10. Representing alignments and scores
     The alignment WHAT / WH-Y is a path (X) through the matrix:

            -   W   H   A   T
        -
        W       X
        H           X   X
        Y                   X

  11. Representing alignments and scores
     The same path with the partial scores filled in:

            -   W   H   A     T
        -   0
        W       1
        H           2   2−δ
        Y                     2−δ−µ

     Global alignment score S_{3,4} = 2 − δ − µ

  12. Dynamic programming
     • How to find the optimal alignment?
     • We use previous solutions for optimal alignments of smaller subsequences
     • This general approach is known as dynamic programming

  13. Filling the alignment matrix
     Consider the alignment process at the shaded square (row H of WHY, column H of WHAT).
     • Case 1. Align H against H (match or substitution).
     • Case 2. Align H in WHY against − (indel) in WHAT.
     • Case 3. Align H in WHAT against − (indel) in WHY.
     [Figure: alignment matrix with columns -, W, H, A, T and rows -, W, H, Y; arrows labelled Case 1, Case 2 and Case 3 lead into the shaded square.]

  14. Filling the alignment matrix (2)
     Scoring the alternatives:
     • Case 1. S_{2,2} = S_{1,1} + s(2, 2)
     • Case 2. S_{2,2} = S_{1,2} − δ
     • Case 3. S_{2,2} = S_{2,1} − δ
     s(i, j) = 1 for matching positions, s(i, j) = −µ for substitutions.
     Choose the case (path) that yields the maximum score. Keep track of path choices.

  15. Global alignment: formal development
     A = a_1 a_2 a_3 … a_n, B = b_1 b_2 b_3 … b_m
     • Any alignment can be written as a unique path through the matrix
     • Score for aligning A and B up to positions i and j:
       S_{i,j} = S(a_1 a_2 a_3 … a_i, b_1 b_2 b_3 … b_j)
     [Figure: alignment matrix with columns 0 to 4 labelled -, b_1, …, b_4 and rows 0 to 3 labelled -, a_1, a_2, a_3, with an example alignment drawn as a path.]

  16. Scoring partial alignments
     • Alignment of A = a_1 a_2 a_3 … a_n with B = b_1 b_2 b_3 … b_m can end in three ways
       − Case 1: (a_1 a_2 … a_{i-1}) a_i
                 (b_1 b_2 … b_{j-1}) b_j
       − Case 2: (a_1 a_2 … a_{i-1}) a_i
                 (b_1 b_2 … b_j)     −
       − Case 3: (a_1 a_2 … a_i)     −
                 (b_1 b_2 … b_{j-1}) b_j

  17. Scoring alignments
     • Scores for each case:
       − Case 1: a_i aligned with b_j:
                 s(a_i, b_j) = +1 if a_i = b_j, −µ otherwise
       − Case 2: a_i aligned with an indel: s(a_i, −) = −δ
       − Case 3: b_j aligned with an indel: s(−, b_j) = −δ

  18. Scoring alignments (2)
     • First row and first column correspond to initial alignment against indels:
       S_{i,0} = −iδ, S_{0,j} = −jδ
     • Optimal global alignment score S(A, B) = S_{n,m}

                  0     1     2     3     4
                  -    b_1   b_2   b_3   b_4
        0   -     0    −δ   −2δ   −3δ   −4δ
        1  a_1   −δ
        2  a_2  −2δ
        3  a_3  −3δ

  19. Algorithm for global alignment
     Input sequences A, B, n = |A|, m = |B|
     Set S_{i,0} := −δi for all i
     Set S_{0,j} := −δj for all j
     for i := 1 to n
         for j := 1 to m
             S_{i,j} := max{ S_{i-1,j} − δ, S_{i-1,j-1} + s(a_i, b_j), S_{i,j-1} − δ }
         end
     end
     Algorithm takes O(nm) time and space.

  20. Global alignment: example
     µ = 1, δ = 2

              -    T    G    G    T    G
         -    0   -2   -4   -6   -8  -10
         A   -2
         T   -4
         C   -6
         G   -8
         T  -10                        ?

  21. Global alignment: example (2)
     µ = 1, δ = 2

              -    T    G    G    T    G
         -    0   -2   -4   -6   -8  -10
         A   -2   -1   -3   -5   -7   -9
         T   -4   -1   -2   -4   -4   -6
         C   -6   -3   -2   -3   -5   -5
         G   -8   -5   -2   -1   -3   -4
         T  -10   -7   -4   -3    0   -2

     Optimal alignment (score −2):

       ATCGT-
        | ||
       -TGGTG
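
The matrix above can be recomputed from the recursion of the global alignment algorithm. The following is a minimal Python sketch under the slide's scoring (match +1, mismatch penalty µ, indel penalty written δ here); the function name `global_alignment_matrix` and its parameter names are illustrative assumptions, not taken from the course material, and no traceback is performed.

```python
def global_alignment_matrix(a, b, mu=1, delta=2):
    """Fill the global alignment score matrix S (rows follow a, columns follow b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * delta              # a_1..a_i aligned against indels only
    for j in range(1, m + 1):
        S[0][j] = -j * delta              # b_1..b_j aligned against indels only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else -mu
            S[i][j] = max(S[i - 1][j] - delta,      # a_i against an indel
                          S[i - 1][j - 1] + s,      # a_i against b_j
                          S[i][j - 1] - delta)      # b_j against an indel
    return S


S = global_alignment_matrix("ATCGT", "TGGTG", mu=1, delta=2)
for row in S:
    print(row)
```

The bottom-right entry `S[-1][-1]` is the optimal global alignment score; recovering the alignment itself would additionally require storing the path choices.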

  22. Sequence Alignment (chapter 6)
     • The biological problem
     • Global alignment
     • Local alignment
     • Multiple alignment

  23. Local alignment: rationale
     • Otherwise dissimilar proteins may have local regions of similarity -> proteins may share a function
     [Figure: Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles a 291 aa region in TGF-β receptor (right). The shared function here is protein kinase.]

  24. Local alignment: rationale
     [Figure: sequences A and B with their regions of similarity marked.]
     • Global alignment would be inadequate
     • Problem: find the highest scoring local alignment between two sequences
     • Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981)

  25. From global to local alignment
     • Modifications to the global alignment algorithm
       − Look for the highest-scoring path in the alignment matrix (not necessarily through the matrix)
       − Allow preceding and trailing indels without penalty

  26. Scoring local alignments
     A = a_1 a_2 a_3 … a_n, B = b_1 b_2 b_3 … b_m
     Let I and J be intervals (substrings) of A and B, respectively.
     Best local alignment score: the maximum of S(I, J) over all such I and J, where S(I, J) is the score for substrings I and J.

  27. Allowing preceding and trailing indels
     • First row and column initialised to zero: M_{i,0} = M_{0,j} = 0

                  0     1     2     3     4
                  -    b_1   b_2   b_3   b_4
        0   -     0     0     0     0     0
        1  a_1    0
        2  a_2    0
        3  a_3    0

     Example with free preceding indels:

       b_1 b_2 b_3
        -   -  a_1

  28. Recursion for local alignment
     • M_{i,j} = max { M_{i-1,j-1} + s(a_i, b_j),
                       M_{i-1,j} − δ,
                       M_{i,j-1} − δ,
                       0 }

              -   T   G   G   T   G
         -    0   0   0   0   0   0
         A    0   0   0   0   0   0
         T    0   1   0   0   1   0
         C    0   0   0   0   0   0
         G    0   0   1   1   0   1
         T    0   1   0   0   2   0

  29. Finding best local alignment
     • Optimal score is the highest value in the matrix, max_{i,j} M_{i,j}
     • Best local alignment can be found by backtracking from the highest value in M

              -   T   G   G   T   G
         -    0   0   0   0   0   0
         A    0   0   0   0   0   0
         T    0   1   0   0   1   0
         C    0   0   0   0   0   0
         G    0   0   1   1   0   1
         T    0   1   0   0   2   0

  30. Local alignment: example

                0   1   2   3   4   5   6   7   8   9  10
                -   G   G   C   T   C   A   A   T   C   A
         0  -   0   0   0   0   0   0   0   0   0   0   0
         1  A   0
         2  C   0
         3  C   0
         4  T   0
         5  A   0
         6  A   0
         7  G   0
         8  G   0

  31. Local alignment: example
     Scoring: match +2, mismatch −1, indel −2

                0   1   2   3   4   5   6   7   8   9  10
                -   G   G   C   T   C   A   A   T   C   A
         0  -   0   0   0   0   0   0   0   0   0   0   0
         1  A   0   0   0   0   0   0   2   2   0   0   2
         2  C   0   0   0   2   0   2   0   1   1   2   0
         3  C   0   0   0   2   1   2   1   0   0   3   1
         4  T   0   0   0   0   4   2   1   0   2   1   2
         5  A   0   0   0   0   2   3   4   3   1   1   3
         6  A   0   0   0   0   0   1   5   6   4   2   3
         7  G   0   2   2   0   0   0   3   4   5   3   1
         8  G   0   2   4   2   0   0   1   2   3   4   2

     Best local alignment (score 6):

       C T - A A
       C T C A A
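
The example can be checked with a short Python sketch of the local alignment recursion followed by backtracking from the highest matrix entry. This is an illustration under the slide's scoring (match +2, mismatch −1, indel −2); the function name `local_alignment` and the tie-breaking order in the backtracking step (diagonal first, then up, then left) are assumptions made here, not choices taken from the slides.

```python
def local_alignment(a, b, match=2, mismatch=-1, indel=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,
                          M[i - 1][j] + indel,
                          M[i][j - 1] + indel,
                          0)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Backtrack from the highest value until a zero entry is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = local_alignment("ACCTAAGG", "GGCTCAATCA")
print(score)
print(row_a)
print(row_b)
```

For the two sequences of this example the sketch reports score 6 with the rows `CT-AA` and `CTCAA`, in agreement with the slide.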

  32. Non-uniform mismatch penalties
     • We used a uniform penalty for mismatches:
       s('A', 'C') = s('A', 'G') = … = s('G', 'T') = −µ
     • Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent as transversions (e.g., A->T, T->A, A->C, G->T)
       − use non-uniform mismatch penalties

               A      C      G      T
         A     1     -1    -0.5    -1
         C    -1      1     -1    -0.5
         G   -0.5    -1      1     -1
         T    -1    -0.5    -1      1
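
The score table above follows one rule: +1 for identity, −0.5 for a transition (purine to purine or pyrimidine to pyrimidine) and −1 for a transversion. A small Python sketch of that rule, with the hypothetical helper name `substitution_score`, could look like this:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y):
    """Score of aligning bases x and y using the table on this slide."""
    if x == y:
        return 1.0
    pair = {x, y}
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    return -0.5 if is_transition else -1.0


# Print the full 4 x 4 table row by row.
for x in "ACGT":
    print(x, [substitution_score(x, y) for y in "ACGT"])
```

Such a function can replace the uniform s(a_i, b_j) in either alignment recursion without any other change.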

  33. Gaps in alignment
     • Gap is a succession of indels in alignment

       C T - - - A A
       C T C G C A A

     • Previous model scored a length k gap as w(k) = −kδ
     • Replication processes may produce longer stretches of insertions or deletions
       − In coding regions, insertions or deletions of codons may preserve functionality

  34. Gap open and extension penalties (2)
     • We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it:
       w(k) = −α − β(k − 1)
     • Gap open cost α, gap extension cost β
     • Our previous algorithm can be extended to use w(k) (not discussed on this course)
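
To see how the two gap models differ, the sketch below evaluates both for a few gap lengths. It is an illustration only: the names `linear_gap` and `affine_gap`, the parameter names `alpha` (gap open cost) and `beta` (gap extension cost), and the numeric values are assumptions chosen here, not values from the slides.

```python
def linear_gap(k, delta):
    """Previous model: every indel in a gap of length k costs delta."""
    return -k * delta


def affine_gap(k, alpha, beta):
    """Gap open cost alpha for the first indel, extension cost beta for each further one."""
    return -alpha - beta * (k - 1)


# Hypothetical parameter values, chosen only to show the effect.
for k in (1, 2, 3, 6):
    print(k, linear_gap(k, delta=2), affine_gap(k, alpha=3, beta=1))
```

With these values a single indel is more expensive under the second model, while a gap of length 3 or more is cheaper, which favours one long gap over many scattered indels.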

  35. Sequence Alignment (chapter 6)
     • The biological problem
     • Global alignment
     • Local alignment
     • Multiple alignment

  36. Multiple alignment
     • Consider a set of n sequences such as those listed below
       − Orthologous sequences from different organisms
       − Paralogs from multiple duplications
     • How can we study relationships between these sequences?

       aggcgagctgcgagtgcta
       cgttagattgacgctgac
       ttccggctgcgac
       gacacggcgaacgga
       agtgtgcccgacgagcgaggac
       gcgggctgtgagcgcta
       aagcggcctgtgtgcccta
       atgctgctgccagtgta
       agtcgagccccgagtgc
       agtccgagtcc
       actcggtgc

  37. Optimal alignment of three sequences
     • Alignment of A = a_1 a_2 … a_i and B = b_1 b_2 … b_j can end either in (−, b_j), (a_i, b_j) or (a_i, −)
     • 2^2 − 1 = 3 alternatives
     • Alignment of A, B and C = c_1 c_2 … c_k can end in 2^3 − 1 = 7 ways: (a_i, −, −), (−, b_j, −), (−, −, c_k), (−, b_j, c_k), (a_i, −, c_k), (a_i, b_j, −) or (a_i, b_j, c_k)
     • Solve the recursion using a three-dimensional dynamic programming matrix: O(n^3) time and space
     • Generalizes to n sequences but is impractical even with a moderate number of sequences
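
The count of possible last columns grows as 2^k − 1 for k sequences, because every sequence contributes either its last residue or an indel, and the all-indel column is excluded. A minimal Python sketch (the function name `column_endings` is hypothetical) enumerates them:

```python
from itertools import product


def column_endings(k):
    """All possible last columns for k sequences: True = residue, False = indel."""
    return [column for column in product((True, False), repeat=k) if any(column)]


for k in (2, 3, 4):
    print(k, len(column_endings(k)))
```

Each cell of the k-dimensional matrix has to examine all of these endings, which is one reason exact dynamic programming becomes impractical as the number of sequences grows.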

  38. Multiple alignment in practice
     • In practice, real-world multiple alignment problems are usually solved with heuristics
     • Progressive multiple alignment
       − Choose two sequences and align them
       − Choose a third sequence w.r.t. the two previous sequences and align the third against them
       − Repeat until all sequences have been aligned
       − There are different options for how to choose sequences and score alignments

  39. Multiple alignment in practice
     • Profile-based progressive multiple alignment: CLUSTALW
       − Construct a distance matrix of all pairs of sequences using dynamic programming
       − Progressively align pairs in order of decreasing similarity
       − CLUSTALW uses various heuristics to contribute to accuracy

  40. Additional material
     • R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
     • Course Biological sequence analysis in Spring 2007

  41. Inferring the Past: Phylogenetic Trees (chapter 12)
     • The biological problem
     • Parsimony and distance methods
     • Models for mutations and estimation of distances
     • Maximum likelihood methods

  42. Phylogeny
     • We want to study ancestor-descendant relationships, or phylogeny, among groups of organisms
     • Groups are called taxa (singular: taxon)
     • Organisms are usually called operational taxonomic units or OTUs in the context of phylogeny

  43. Phylogenetic trees
     • Leaves (external nodes) ~ species, observed (OTUs)
     • Internal nodes ~ ancestral species/divergence events, not observed
     • Unrooted tree does not specify ancestor-descendant relationships beyond the observation "leaves are not ancestors"
     [Figure: unrooted tree with 5 leaves (1 to 5) and 3 internal nodes (6 to 8). Is node 7 ancestor of node 6?]

  44. Phylogenetic trees
     • Rooting a tree specifies all ancestor-descendant relationships in the tree
     • Root is the ancestor to the other species
     • There are n−1 ways to root a tree with n nodes
     [Figure: the unrooted tree with leaves 1 to 5 and internal nodes 6 to 8, two candidate root positions R_1 and R_2, and the resulting rooted trees root(R_1) and root(R_2).]

  45. Questions
     • Can we enumerate all possible phylogenetic trees for n species (or sequences)?
     • How to score a phylogenetic tree with respect to data?
     • How to find the best phylogenetic tree given data?

  46. Finding the best phylogenetic tree: naive method
     • How can we find the phylogenetic tree that best represents the data?
     • Naive method: enumerate all possible trees
     • How many different trees are there of n species?
     • Denote this number by b_n

  47. Enumerating unordered trees
     • Start with the only unordered tree with 3 leaves (b_3 = 1)
     • Consider all ways to add a leaf node to this tree
     • Fourth node can be added to 3 different branches (edges), creating 1 new internal branch
     • Total number of branches is n external and n − 3 internal branches
     • Unrooted tree with n leaves has 2n − 3 branches
     [Figure: the tree with leaves 1, 2, 3 and the three trees obtained by attaching leaf 4 to each of its branches.]

  48. Enumerating unordered trees
     • Thus, we get the number of unrooted trees
       b_n = (2(n − 1) − 3) b_{n-1} = (2n − 5) b_{n-1}
           = (2n − 5) * (2n − 7) * … * 3 * 1
           = (2n − 5)! / ((n − 3)! 2^{n-3}), n > 2
     • Number of rooted trees b'_n is
       b'_n = (2n − 3) b_n = (2n − 3)! / ((n − 2)! 2^{n-2}), n > 2
       that is, the number of unrooted trees times the number of branches in the trees

  49. Number of possible rooted and unrooted trees

          n         b_n        b'_n
          3           1           3
          4           3          15
          5          15         105
          6         105         945
          7         945       10395
          8       10395      135135
          9      135135     2027025
         10     2027025    34459425
         20    2.22E+20    8.20E+21
         30    8.69E+36    4.95E+38
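
The table entries follow directly from the recursion b_n = (2n − 5) b_{n-1} with b_3 = 1 and from b'_n = (2n − 3) b_n. A short Python sketch (function names `unrooted_trees` and `rooted_trees` are hypothetical) recomputes them with exact integer arithmetic:

```python
def unrooted_trees(n):
    """Number of unrooted trees with n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("n must be at least 3")
    count = 1                       # b_3 = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5          # the k-th leaf can join any of 2(k-1) - 3 branches
    return count


def rooted_trees(n):
    """Number of rooted trees: one root position per branch of each unrooted tree."""
    return (2 * n - 3) * unrooted_trees(n)


for n in (3, 4, 5, 6, 7, 8, 9, 10, 20, 30):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Python integers do not overflow, so the rows for n = 20 and n = 30 are printed exactly rather than in floating-point notation.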

  50. Too many trees?
     • We can't construct and evaluate every phylogenetic tree even for a smallish number of species
     • Better alternative is to
       − Devise a way to evaluate an individual tree against the data
       − Guide the search using the evaluation criteria to reduce the search space

  51. Inferring the Past: Phylogenetic Trees (chapter 12)
     • The biological problem
     • Parsimony and distance methods
     • Models for mutations and estimation of distances
     • Maximum likelihood methods

  52. Parsimony method
     • The parsimony method finds the tree that explains the observed sequences with a minimal number of substitutions
     • Method has two steps
       − Compute smallest number of substitutions for a given tree with a parsimony algorithm
       − Search for the tree with the minimal number of substitutions

  53. Parsimony: an example
     • Consider the following short sequences
         1 ACTTT
         2 ACATT
         3 AACGT
         4 AATGT
         5 AATTT
     • There are 105 possible rooted trees for 5 sequences
     • Example: which of the following trees explains the sequences with the least number of substitutions?

  54. (Parsimony example: first tree)

       9 AATTT (root)
         7 AATTT
           6 AATGT   (T->G)
             3 AACGT   (T->C)
             4 AATGT
           5 AATTT
         8 ACTTT   (A->C)
           2 ACATT   (T->A)
           1 ACTTT

     This tree explains the sequences with 4 substitutions

  55. (Parsimony example: comparing two trees)
     • First tree: internal nodes 9 AATTT (root), 7 AATTT, 6 AATGT, 8 ACTTT; leaves in the order 3, 4, 5, 2, 1; substitutions A->C, T->G, T->A, T->C: 4 substitutions
     • Second tree: internal nodes 9 AATTT (root), 8 AATGT, 7 AACGT, 6 ACCTT; leaves in the order 1, 2, 3, 4, 5; substitutions T->G, T->C, A->C, G->T, C->T, C->A: 6 substitutions
     First tree is more parsimonious!

  56. Computing parsimony
     • Parsimony treats each site (position in a sequence) independently
     • Total parsimony cost is the sum of parsimony costs of each site
     • We can compute the minimal parsimony cost for a given tree by
       − First finding out possible assignments at each node, starting from leaves and proceeding towards the root
       − Then, starting from the root, assigning a letter at each node, proceeding towards leaves

  57. Labelling tree nodes
     • An unrooted tree with n leaves contains 2n−2 nodes altogether
     • Assign the following labels to nodes in a rooted tree
       − leaf nodes: 1, 2, …, n
       − internal nodes: n+1, n+2, …, 2n−1
       − root node: 2n−1
     • The label of a child node is always smaller than the label of the parent node
     [Figure: example rooted tree with leaves 1 to 5, internal nodes 6 to 8 and root 9.]

  58. Parsimony algorithm: first phase
     • Find out possible assignments at every node for each site independently. Denote site u in sequence i by s_{i,u}

     For i := 1, …, n do
         F_i := {s_{i,u}}   % possible assignments at node i
         L_i := 0           % number of substitutions up to node i
     For i := n+1, …, 2n−1 do
         Let j and k be the children of node i
         If F_j ∩ F_k = ∅
             then L_i := L_j + L_k + 1, F_i := F_j ∪ F_k
             else L_i := L_j + L_k, F_i := F_j ∩ F_k

  59. Parsimony algorithm: first phase
     Choose u = 3 (for example)
       F_1 := {T}, L_1 := 0
       F_2 := {A}, L_2 := 0
       F_3 := {C}, L_3 := 0
       F_4 := {T}, L_4 := 0
       F_5 := {T}, L_5 := 0
       F_8 := F_1 ∪ F_2 = {A, T}, L_8 := L_1 + L_2 + 1 = 1
     [Figure: the first tree (root 9, internal nodes 6, 7, 8) with leaves 3 AACGT, 4 AATGT, 5 AATTT, 2 ACATT, 1 ACTTT; site 3 is highlighted.]

  60. Parsimony algorithm: first phase
       F_6 := F_3 ∪ F_4 = {C, T}, L_6 := L_3 + L_4 + 1 = 1
       F_7 := F_5 ∩ F_6 = {T}, L_7 := L_5 + L_6 = 1
       F_8 := F_1 ∪ F_2 = {A, T}, L_8 := L_1 + L_2 + 1 = 1
       F_9 := F_7 ∩ F_8 = {T}, L_9 := L_7 + L_8 = 2
     → Parsimony cost for site 3 is 2
     [Figure: the same tree with site 3 labels 9: T, 7: T, 6: {C, T}, 8: T above leaves 3 AACGT, 4 AATGT, 5 AATTT, 2 ACATT, 1 ACTTT.]
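
The first phase can be written out as a short Python sketch. It assumes the node labelling introduced earlier (leaves 1 to n, internal nodes n+1 to 2n−1, children always smaller than their parent) and the tree topology read off the example computations (6 = (3, 4), 7 = (5, 6), 8 = (1, 2), 9 = (7, 8)). The names `first_phase` and `parsimony_cost` and the dictionary-based tree encoding are illustrative choices, not the course's own code.

```python
def first_phase(leaf_letters, children, n):
    """Compute the sets F_i and costs L_i for one site of a rooted binary tree."""
    F = {i: {leaf_letters[i]} for i in range(1, n + 1)}
    L = {i: 0 for i in range(1, n + 1)}
    for i in range(n + 1, 2 * n):          # internal nodes n+1 .. 2n-1, children first
        j, k = children[i]
        common = F[j] & F[k]
        if common:                         # intersection non-empty: no new substitution
            F[i], L[i] = common, L[j] + L[k]
        else:                              # empty intersection: one more substitution
            F[i], L[i] = F[j] | F[k], L[j] + L[k] + 1
    return F, L


def parsimony_cost(sequences, children):
    """Sum the per-site costs at the root over all sites."""
    n = len(sequences)
    total = 0
    for u in range(len(sequences[1])):
        letters = {i: seq[u] for i, seq in sequences.items()}
        _, L = first_phase(letters, children, n)
        total += L[2 * n - 1]
    return total


sequences = {1: "ACTTT", 2: "ACATT", 3: "AACGT", 4: "AATGT", 5: "AATTT"}
children = {6: (3, 4), 7: (5, 6), 8: (1, 2), 9: (7, 8)}

F, L = first_phase({i: s[2] for i, s in sequences.items()}, children, 5)
print(F[6], F[9], L[9])                    # site 3 (index 2 when counting from 0)
print(parsimony_cost(sequences, children))
```

For site 3 the sketch yields the same sets and cost as the hand computation above, and summing over all five sites gives the cost of the whole tree.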

  61. Parsimony algorithm: second phase
     • Backtrack from the root and assign x ∈ F_i at each node
     • If we assigned y at parent of node i and y ∈ F_i, then assign y
     • Else assign x ∈ F_i at random

  62. Parsimony algorithm: second phase
     • At node 6, the algorithm assigns T because T was assigned to parent node 7 and T ∈ F_6
     • The other nodes have only one possible letter to assign
     [Figure: the same tree with site 3 labels 9: T, 7: T, 6: {C, T} with T chosen, 8: T.]

  63. Parsimony algorithm
     • First and second phase are repeated for each site in the sequences, summing the parsimony costs at each site
     [Figure: the same tree with site 3 assigned T at nodes 9, 7, 6 and 8.]

  64. Properties of parsimony algorithm
     • Parsimony algorithm requires that the sequences are of same length
       − First align the sequences against each other and remove indels
       − Then compute parsimony for the resulting sequences
     • Is the most parsimonious tree the correct tree?
       − Not necessarily, but it explains the sequences with the least number of substitutions
       − We can assume that the probability of having fewer mutations is higher than having many mutations

  65. Finding the most parsimonious tree
     • Parsimony algorithm calculates the parsimony cost for a given tree…
     • …but we still have the problem of finding the tree with the lowest cost
     • Exhaustive search (enumerating all trees) is in general impossible
     • More efficient methods exist, for example
       − Probabilistic search
       − Branch and bound

  66. Branch and bound in parsimony
     • We can exploit the fact that adding edges to a tree can only increase the parsimony cost (it never decreases it)
     [Figure: site 3 of two partial trees. The tree with the two leaves AATGT and AATTT has set {T} and cost 0; the tree with the three leaves AACGT, AATGT and AATTT has sets {C, T} and {T} and cost 1.]

  67. Branch and bound in parsimony
     Branch and bound is a general search strategy where
     • Each solution is potentially generated
     • Track is kept of the best solution found
     • If a partial solution cannot achieve a better score, we abandon the current search path
     In parsimony…
     • Start from a tree with 1 sequence
     • Add a sequence to the tree and calculate parsimony cost
     • If the tree is complete, check if we found the best tree so far
     • If the tree is not complete and its cost exceeds the best tree cost, do not continue adding edges to this tree

  68. Branch and bound graphically
     [Figure: search tree over partial phylogenetic trees, with three kinds of nodes:
       − Partial tree, no best complete tree constructed yet
       − Complete tree: calculate parsimony cost and store
       − Partial tree, cost exceeds the cost of the best tree this far]

  69. Distance methods
     • The parsimony method works on sequence (character string) data
     • We can also build phylogenetic trees in a more general setting
     • Distance methods work on a set of pairwise distances d_ij for the data
     • Distances can be obtained from phenotypes as well as from genotypes (sequences)

  70. Distances in a phylogenetic tree
     • Distance matrix D = (d_ij) gives pairwise distances for leaves of the phylogenetic tree
     • In addition, the phylogenetic tree will now specify distances between leaves and internal nodes
       − Denote these with d_ij as well
     • Distance d_ij states how far apart species i and j are evolutionarily (e.g., number of mismatches in aligned sequences)
     [Figure: rooted tree with leaves 1 to 5 and internal nodes 6 to 8.]

  71. Distances in evolutionary context
     • Distances d_ij in evolutionary context satisfy the following conditions
       − Symmetry: d_ij = d_ji for each i, j
       − Distinguishability: d_ij ≠ 0 if and only if i ≠ j
       − Triangle inequality: d_ij ≤ d_ik + d_kj for each i, j, k
     • Distances satisfying these conditions are called metric
     • In addition, evolutionary mechanisms may impose additional constraints on the distances → additive and ultrametric distances

  72. Additive trees
     • A tree is called additive, if the distance between any pair of leaves (i, j) is the sum of the distances between the leaves and the first node k that they share in the tree
       d_ij = d_ik + d_jk
     • "Follow the path from the leaf i to the leaf j to find the exact distance d_ij between the leaves."

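  73. Additive trees: example

              A   B   C   D
         A    0   2   4   4
         B    2   0   4   4
         C    4   4   0   2
         D    4   4   2   0

     [Figure: unrooted tree in which leaves A and B each hang on an edge of length 1 from one internal node, leaves C and D each hang on an edge of length 1 from a second internal node, and the two internal nodes are joined by an edge of length 2.]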
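
That this tree is additive for the matrix can be checked by summing edge lengths along the path between each pair of leaves. The Python sketch below does so for the example; the internal node names `u` and `v` and the function name `tree_distance` are hypothetical labels introduced here, since the slide does not name the internal nodes.

```python
# Edge lengths of the example tree; "u" joins A and B, "v" joins C and D.
edges = {("A", "u"): 1, ("B", "u"): 1, ("u", "v"): 2, ("C", "v"): 1, ("D", "v"): 1}


def tree_distance(edges, start, goal):
    """Sum of edge lengths on the unique path between two nodes of a tree."""
    adjacency = {}
    for (x, y), w in edges.items():
        adjacency.setdefault(x, []).append((y, w))
        adjacency.setdefault(y, []).append((x, w))
    stack = [(start, None, 0)]             # (node, node we came from, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adjacency[node]:
            if nxt != parent:              # never walk back along the same edge
                stack.append((nxt, node, dist + w))
    raise ValueError("nodes are not connected")


leaves = "ABCD"
for i in leaves:
    print(i, [tree_distance(edges, i, j) for j in leaves])
```

The printed rows reproduce the distance matrix of the example, row by row.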

  74. Ultrametric trees
     • A rooted additive tree is called an ultrametric tree, if the distances between any two leaves i and j, and their common ancestor k are equal
       d_ik = d_jk
     • Edge length d_ij corresponds to the time elapsed since divergence of i and j from the common parent
     • In other words, edge lengths are measured by a molecular clock with a constant rate

  75. Identifying ultrametric data
     • We can identify distances to be ultrametric by the three-point condition:
       D corresponds to an ultrametric tree if and only if for any three sequences i, j and k, the distances satisfy d_ij ≤ max(d_ik, d_kj)
     • If we find out that the data is ultrametric, we can utilise a simple algorithm to find the corresponding tree
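
The three-point condition translates directly into a check over all ordered triples of distinct indices. The sketch below is a straightforward O(n^3) illustration; the function name `is_ultrametric` is hypothetical, the first matrix is the one from the additive tree example, and the second matrix is made up here purely to show a failing case.

```python
from itertools import permutations


def is_ultrametric(D):
    """Three-point condition: d_ij <= max(d_ik, d_kj) for all distinct i, j, k."""
    n = len(D)
    return all(D[i][j] <= max(D[i][k], D[k][j])
               for i, j, k in permutations(range(n), 3))


# Distance matrix of the additive tree example (leaves A, B, C, D).
D_example = [[0, 2, 4, 4],
             [2, 0, 4, 4],
             [4, 4, 0, 2],
             [4, 4, 2, 0]]

# Made-up counterexample: 5 > max(2, 4) violates the condition.
D_counter = [[0, 2, 5],
             [2, 0, 4],
             [5, 4, 0]]

print(is_ultrametric(D_example))
print(is_ultrametric(D_counter))
```

Equivalently, in every triple the two largest distances must be equal, which is often the easier form to check by hand.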
