582606 Introduction to bioinformatics, Autumn 2006, Esa Pitkänen

Similarity vs homology
• Sequence similarity is not sequence homology
  − If the two sequences $g_B$ and $g_C$ have accumulated enough mutations, the similarity between them is likely to be low
#mutations  sequence
0           agtgtccgttaagtgcgttc
1           agtgtccgttatagtgcgttc
2           agtgtccgcttatagtgcgttc
4           agtgtccgcttaagggcgttc
8           agtgtccgcttcaaggggcgt
16          gggccgttcatgggggt
32          gcagggcgtcactgagggct
64          acagtccgttcgggctattg
128         cagagcactaccgc
256         cacgagtaagatatagct
512         taatcgtgata
1024        acccttatctacttcctggagtt
2048        agcgacctgcccaa
4096        caaac
Homology is more difficult to detect over greater evolutionary distances.
Introduction to bioinformatics, Autumn 2006 26

Similarity vs homology (2)
• Sequence similarity can occur by chance
  − Similarity does not imply homology
• Similarity is an expected consequence of homology

Orthologs and paralogs
• We distinguish between two types of homology
  − Orthologs: homologs from two different species
  − Paralogs: homologs within a species
[Figure: left, gene $g_A$ of organism A with descendant genes $g_B$ and $g_C$ in organisms B and C; right, gene A is copied within organism A, giving $g_A$ and $g_{A'}$.]

Orthologs and paralogs (2)
• Orthologs typically retain the original function
• In paralogs, one copy is free to mutate and acquire a new function (no selective pressure)
[Figure: left, gene $g_A$ of organism A with descendant genes $g_B$ and $g_C$ in organisms B and C; right, gene A is copied within organism A, giving $g_A$ and $g_{A'}$.]

Sequence alignment
• Alignment specifies which positions in two sequences match
acgtctag         acgtctag         acgtctag
||                  |||||         || |||||
actctag-         -actctag         ac-tctag
2 matches        5 matches        7 matches
5 mismatches     2 mismatches     0 mismatches
1 not aligned    1 not aligned    1 not aligned

Mutations: Insertions, deletions and substitutions
acgtctag
   |||||
-actctag
• Indel: insertion or deletion of a base with respect to the ancestor sequence
• Mismatch: substitution (point mutation) of a single base
• Insertions and/or deletions are called indels
  − We can't tell whether the ancestor sequence had a base or not at the indel position

Problems
• What sorts of alignments should be considered?
• How to score alignments?
• How to find optimal or good scoring alignments?
• How to evaluate the statistical significance of scores?
In this course, we discuss the first three problems. The course Biological sequence analysis tackles all four in depth.

Sequence Alignment (chapter 6)
• The biological problem
• Global alignment
• Local alignment
• Multiple alignment

Global alignment
• Problem: find the optimal scoring alignment between two sequences (Needleman & Wunsch 1970)
• We give a score for each position in the alignment
  − Identity (match): $+1$
  − Substitution (mismatch): $-\mu$
  − Indel: $-\delta$
• Example:
  WHAT
  ||
  WH-Y
  $S(\mathrm{WHAT}/\mathrm{WH\text{-}Y}) = 1 + 1 - \delta - \mu$
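The scoring scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` is made up here, and the default penalties (mismatch penalty 1, indel penalty 2) are borrowed from the worked global alignment example later in the slides, not fixed by this slide.

```python
def alignment_score(top: str, bottom: str, mu: float = 1.0, delta: float = 2.0) -> float:
    """Score a gapped pairwise alignment column by column.

    `top` and `bottom` are the two rows of the alignment, with '-' marking
    an indel. `mu` (mismatch penalty) and `delta` (indel penalty) are
    assumed example values.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score -= delta   # indel
        elif x == y:
            score += 1.0     # identity (match)
        else:
            score -= mu      # substitution (mismatch)
    return score


# WHAT / WH-Y: two matches, one indel, one mismatch = 1 + 1 - delta - mu
print(alignment_score("WHAT", "WH-Y"))
```

With the assumed values the example evaluates to 1 + 1 - 2 - 1 = -1.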

Representing alignments and scores
WHAT
||
WH-Y
The alignment corresponds to a path (marked X) through the matrix:
      -   W   H   A   T
  -
  W       X
  H           X   X
  Y                   X

Representing alignments and scores
WHAT
||
WH-Y
Cumulative scores along the path:
      -   W   H   A     T
  -   0
  W       1
  H           2   2-δ
  Y                     2-δ-µ
Global alignment score $S_{3,4} = 2 - \delta - \mu$

Dynamic programming
• How to find the optimal alignment?
• We use previous solutions for optimal alignments of smaller subsequences
• This general approach is known as dynamic programming

Filling the alignment matrix
Consider the alignment process at the shaded square (row H of WHY, column H of WHAT).
• Case 1. Align H against H (match or substitution).
• Case 2. Align H in WHY against - (indel) in WHAT.
• Case 3. Align H in WHAT against - (indel) in WHY.

Filling the alignment matrix (2)
Scoring the alternatives:
• Case 1. $S_{2,2} = S_{1,1} + s(2, 2)$
• Case 2. $S_{2,2} = S_{1,2} - \delta$
• Case 3. $S_{2,2} = S_{2,1} - \delta$
$s(i, j) = 1$ for matching positions, $s(i, j) = -\mu$ for substitutions.
Choose the case (path) that yields the maximum score. Keep track of path choices.

Global alignment: formal development
$A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
• Any alignment can be written as a unique path through the matrix
• Score for aligning A and B up to positions i and j: $S_{i,j} = S(a_1 a_2 a_3 \ldots a_i,\ b_1 b_2 b_3 \ldots b_j)$
[Figure: alignment matrix with rows 0 to 3 labelled -, $a_1$, $a_2$, $a_3$ and columns 0 to 4 labelled -, $b_1$, $b_2$, $b_3$, $b_4$, with an example alignment drawn as a path.]

Scoring partial alignments
• Alignment of $A = a_1 a_2 a_3 \ldots a_n$ with $B = b_1 b_2 b_3 \ldots b_m$ can end in three ways
  − Case 1: $(a_1 a_2 \ldots a_{i-1})\ a_i$ over $(b_1 b_2 \ldots b_{j-1})\ b_j$
  − Case 2: $(a_1 a_2 \ldots a_{i-1})\ a_i$ over $(b_1 b_2 \ldots b_j)\ -$
  − Case 3: $(a_1 a_2 \ldots a_i)\ -$ over $(b_1 b_2 \ldots b_{j-1})\ b_j$

Scoring alignments
• Scores for each case:
  − Case 1: $(a_1 a_2 \ldots a_{i-1})\ a_i$ over $(b_1 b_2 \ldots b_{j-1})\ b_j$, scored $s(a_i, b_j) = +1$ if $a_i = b_j$ and $-\mu$ otherwise
  − Case 2: $(a_1 a_2 \ldots a_{i-1})\ a_i$ over $(b_1 b_2 \ldots b_j)\ -$, scored $s(a_i, -) = -\delta$
  − Case 3: $(a_1 a_2 \ldots a_i)\ -$ over $(b_1 b_2 \ldots b_{j-1})\ b_j$, scored $s(-, b_j) = -\delta$

Scoring alignments (2)
• First row and first column correspond to initial alignment against indels: $S(i, 0) = -i\delta$, $S(0, j) = -j\delta$
• Optimal global alignment score $S(A, B) = S_{n,m}$
          0     1     2     3     4
          -     b1    b2    b3    b4
  0  -    0     -δ    -2δ   -3δ   -4δ
  1  a1   -δ
  2  a2   -2δ
  3  a3   -3δ

Algorithm for global alignment
Input sequences A, B, n = |A|, m = |B|
Set S_{i,0} := -δi for all i
Set S_{0,j} := -δj for all j
for i := 1 to n
    for j := 1 to m
        S_{i,j} := max{S_{i-1,j} - δ, S_{i-1,j-1} + s(a_i, b_j), S_{i,j-1} - δ}
    end
end
Algorithm takes O(nm) time and space.
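The pseudocode above translates almost line by line into Python. The following is a minimal sketch, not the course's reference implementation: the function name `global_alignment_matrix` is invented here, and the defaults (mismatch penalty 1, indel penalty 2) and the sequences ATCGT and TGGTG are taken from the worked example in the slides.

```python
def global_alignment_matrix(a: str, b: str, mu: int = 1, delta: int = 2) -> list[list[int]]:
    """Fill the global alignment score matrix S for sequences a and b.

    S[i][j] is the best score for aligning a[:i] with b[:j], using
    match +1, mismatch -mu and indel -delta.
    """
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i          # a[:i] aligned against indels only
    for j in range(1, m + 1):
        S[0][j] = -delta * j          # b[:j] aligned against indels only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else -mu
            S[i][j] = max(
                S[i - 1][j] - delta,      # a_i against an indel
                S[i - 1][j - 1] + s,      # a_i against b_j
                S[i][j - 1] - delta,      # b_j against an indel
            )
    return S


S = global_alignment_matrix("ATCGT", "TGGTG")
for row in S:
    print(row)
print("global alignment score:", S[-1][-1])
```

The bottom-right cell holds the global alignment score. Recovering the alignment itself would additionally require storing or recomputing the path choices, which this sketch leaves out.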

Global alignment: example
µ = 1, δ = 2
        -    T    G    G    T    G
  -     0   -2   -4   -6   -8  -10
  A    -2
  T    -4
  C    -6
  G    -8
  T   -10                        ?

Global alignment: example (2)
µ = 1, δ = 2
        -    T    G    G    T    G
  -     0   -2   -4   -6   -8  -10
  A    -2   -1   -3   -5   -7   -9
  T    -4   -1   -2   -4   -4   -6
  C    -6   -3   -2   -3   -5   -5
  G    -8   -5   -2   -1   -3   -4
  T   -10   -7   -4   -3    0   -2
Resulting alignment:
  ATCGT-
   | ||
  -TGGTG

Local alignment: rationale
• Otherwise dissimilar proteins may have local regions of similarity -> proteins may share a function
[Figure: human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles a 291 aa region in TGF-β receptor (right). The shared function here is protein kinase.]

Local alignment: rationale
[Figure: sequences A and B with their regions of similarity marked.]
• Global alignment would be inadequate
• Problem: find the highest scoring local alignment between two sequences
• Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981)

From global to local alignment
• Modifications to the global alignment algorithm
  − Look for the highest-scoring path in the alignment matrix (not necessarily through the whole matrix)
  − Allow preceding and trailing indels without penalty

Scoring local alignments
$A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
Let I and J be intervals (substrings) of A and B, respectively: $I \subseteq A$, $J \subseteq B$
Best local alignment score: $M(A, B) = \max\{\, S(I, J) : I \subseteq A,\ J \subseteq B \,\}$
where $S(I, J)$ is the score for substrings I and J.

Allowing preceding and trailing indels
• First row and column initialised to zero: $M_{i,0} = M_{0,j} = 0$
          0    1    2    3    4
          -    b1   b2   b3   b4
  0  -    0    0    0    0    0
  1  a1   0
  2  a2   0
  3  a3   0
[Figure: example alignment of b1 b2 b3 over - - a1, where the preceding indels carry no penalty.]

Recursion for local alignment
• $M_{i,j} = \max\{\, M_{i-1,j-1} + s(a_i, b_j),\ M_{i-1,j} - \delta,\ M_{i,j-1} - \delta,\ 0 \,\}$
       -  T  G  G  T  G
  -    0  0  0  0  0  0
  A    0  0  0  0  0  0
  T    0  1  0  0  1  0
  C    0  0  0  0  0  0
  G    0  0  1  1  0  1
  T    0  1  0  0  2  0

Finding best local alignment
• Optimal score is the highest value in the matrix: $\max_{i,j} M_{i,j}$
• Best local alignment can be found by backtracking from the highest value in M
       -  T  G  G  T  G
  -    0  0  0  0  0  0
  A    0  0  0  0  0  0
  T    0  1  0  0  1  0
  C    0  0  0  0  0  0
  G    0  0  1  1  0  1
  T    0  1  0  0  2  0
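The local alignment recursion and the backtracking step can be sketched as follows. This is an illustrative sketch: the function name `local_alignment` and the tie-breaking order during backtracking (diagonal first, then up, then left) are choices made here, and other equally good alignments may exist. The default scores are those of the small ATCGT / TGGTG example (match +1, mismatch -1, indel -2).

```python
def local_alignment(a: str, b: str, match: int = 1, mismatch: int = -1, indel: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + s, M[i - 1][j] + indel, M[i][j - 1] + indel)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Backtrack from the highest value until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("ATCGT", "TGGTG"))
print(local_alignment("ACCTAAGG", "GGCTCAATCA", match=2, mismatch=-1, indel=-2))
```

The second call uses the sequences and scores of the larger local alignment example in the slides.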

Local alignment: example
           0  1  2  3  4  5  6  7  8  9  10
           -  G  G  C  T  C  A  A  T  C  A
  0  -     0  0  0  0  0  0  0  0  0  0  0
  1  A     0
  2  C     0
  3  C     0
  4  T     0
  5  A     0
  6  A     0
  7  G     0
  8  G     0

Local alignment: example
Scoring: match +2, mismatch -1, indel -2
           0  1  2  3  4  5  6  7  8  9  10
           -  G  G  C  T  C  A  A  T  C  A
  0  -     0  0  0  0  0  0  0  0  0  0  0
  1  A     0  0  0  0  0  0  2  2  0  0  2
  2  C     0  0  0  2  0  2  0  1  1  2  0
  3  C     0  0  0  2  1  2  1  0  0  3  1
  4  T     0  0  0  0  4  2  1  0  2  1  2
  5  A     0  0  0  0  2  3  4  3  1  1  3
  6  A     0  0  0  0  0  1  5  6  4  2  3
  7  G     0  2  2  0  0  0  3  4  5  3  1
  8  G     0  2  4  2  0  0  1  2  3  4  2
Best local alignment (score 6):
  C T - A A
  C T C A A

Non-uniform mismatch penalties
• We used a uniform penalty for mismatches: $s(\mathrm{A}, \mathrm{C}) = s(\mathrm{A}, \mathrm{G}) = \ldots = s(\mathrm{G}, \mathrm{T}) = -\mu$
• Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent as transversions (e.g., A->T, T->A, A->C, G->T)
  − use non-uniform mismatch penalties
        A      C      G      T
  A     1     -1     -0.5   -1
  C    -1      1     -1     -0.5
  G    -0.5   -1      1     -1
  T    -1     -0.5   -1      1
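A scoring function with the non-uniform penalties from the table above could look like the sketch below. The names `TRANSITIONS` and `substitution_score` are made up for this illustration; the values 1, -0.5 and -1 are the ones in the table.

```python
# Purine <-> purine and pyrimidine <-> pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def substitution_score(x: str, y: str) -> float:
    """Score aligning base x against base y with non-uniform mismatch penalties."""
    if x == y:
        return 1.0          # match
    if (x, y) in TRANSITIONS:
        return -0.5         # transition: milder penalty
    return -1.0             # transversion


# Print the full 4x4 score matrix in the order A, C, G, T.
bases = "ACGT"
for x in bases:
    print(x, [substitution_score(x, y) for y in bases])
```

In the dynamic programming recursions, such a function would simply take the place of the uniform score s(a_i, b_j).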

Gaps in alignment
• Gap is a succession of indels in alignment
  C T - - - A A
  C T C G C A A
• Previous model scored a length k gap as $w(k) = -k\delta$
• Replication processes may produce longer stretches of insertions or deletions
  − In coding regions, insertions or deletions of codons may preserve functionality

Gap open and extension penalties (2)
• We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it: $w(k) = -\alpha - \beta(k - 1)$
• Gap open cost $\alpha$, gap extension cost $\beta$
• Our previous algorithm can be extended to use $w(k)$ (not discussed on this course)
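To see how the two gap models differ, the short sketch below compares the earlier linear gap score with a gap open/extension score. The parameter names `alpha` (gap open cost) and `beta` (gap extension cost) and all numeric values are assumptions chosen purely for illustration.

```python
def linear_gap(k: int, delta: float = 2.0) -> float:
    """Previous model: every indel in a length-k gap costs delta."""
    return -k * delta


def affine_gap(k: int, alpha: float = 4.0, beta: float = 1.0) -> float:
    """Gap open/extension model: the first indel costs alpha, each further one beta."""
    return -alpha - beta * (k - 1)


for k in (1, 2, 3, 6):
    print(k, linear_gap(k), affine_gap(k))
```

With these example values a single indel is penalised more heavily under the open/extension model, while long gaps become cheaper than under the linear model.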

Multiple alignment
• Consider a set of n sequences such as the ones listed below
  − Orthologous sequences from different organisms
  − Paralogs from multiple duplications
• How can we study relationships between these sequences?
  aggcgagctgcgagtgcta
  cgttagattgacgctgac
  ttccggctgcgac
  gacacggcgaacgga
  agtgtgcccgacgagcgaggac
  gcgggctgtgagcgcta
  aagcggcctgtgtgcccta
  atgctgctgccagtgta
  agtcgagccccgagtgc
  agtccgagtcc
  actcggtgc

Optimal alignment of three sequences
• Alignment of $A = a_1 a_2 \ldots a_i$ and $B = b_1 b_2 \ldots b_j$ can end either in $(-, b_j)$, $(a_i, b_j)$ or $(a_i, -)$
• $2^2 - 1 = 3$ alternatives
• Alignment of A, B and $C = c_1 c_2 \ldots c_k$ can end in $2^3 - 1 = 7$ ways: $(a_i, -, -)$, $(-, b_j, -)$, $(-, -, c_k)$, $(-, b_j, c_k)$, $(a_i, -, c_k)$, $(a_i, b_j, -)$ or $(a_i, b_j, c_k)$
• Solve the recursion using a three-dimensional dynamic programming matrix: $O(n^3)$ time and space
• Generalizes to n sequences but is impractical even with a moderate number of sequences
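The count of possible last columns can be checked by enumeration. In this sketch (the function name `end_columns` is hypothetical) each column is a tuple of booleans, where True means the sequence contributes its last residue and False means it contributes an indel; the all-indel column is excluded.

```python
from itertools import product


def end_columns(num_sequences: int) -> list[tuple[bool, ...]]:
    """List all ways an alignment of num_sequences sequences can end."""
    return [
        column
        for column in product((True, False), repeat=num_sequences)
        if any(column)   # at least one sequence must contribute a residue
    ]


for k in (2, 3, 4):
    print(k, len(end_columns(k)))   # 2**k - 1 alternatives
```

Each cell of the k-dimensional dynamic programming matrix has to consider all of these alternatives, which is one reason the exact approach does not scale to many sequences.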

Multiple alignment in practice
• In practice, real-world multiple alignment problems are usually solved with heuristics
• Progressive multiple alignment
  − Choose two sequences and align them
  − Choose a third sequence w.r.t. the two previous sequences and align the third against them
  − Repeat until all sequences have been aligned
  − There are different options for how to choose sequences and score alignments

Multiple alignment in practice
• Profile-based progressive multiple alignment: CLUSTALW
  − Construct a distance matrix of all pairs of sequences using dynamic programming
  − Progressively align pairs in order of decreasing similarity
  − CLUSTALW uses various heuristics to contribute to accuracy

Additional material
• R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
• Course Biological sequence analysis in Spring 2007

Inferring the Past: Phylogenetic Trees (chapter 12)
• The biological problem
• Parsimony and distance methods
• Models for mutations and estimation of distances
• Maximum likelihood methods

Phylogeny
• We want to study ancestor-descendant relationships, or phylogeny, among groups of organisms
• Groups are called taxa (singular: taxon)
• Organisms are usually called operational taxonomic units or OTUs in the context of phylogeny

Phylogenetic trees
• Leaves (external nodes) ~ species, observed (OTUs)
• Internal nodes ~ ancestral species/divergence events, not observed
• Unrooted tree does not specify ancestor-descendant relationships beyond the observation "leaves are not ancestors"
[Figure: unrooted tree with 5 leaves (1 to 5) and 3 internal nodes (6 to 8). Is node 7 an ancestor of node 6?]

Phylogenetic trees
• Rooting a tree specifies all ancestor-descendant relationships in the tree
• Root is the ancestor to the other species
• There are n-1 ways to root a tree with n nodes
[Figure: the unrooted tree with leaves 1 to 5 and internal nodes 6 to 8, two candidate root positions $R_1$ and $R_2$, and the two rooted trees obtained from root($R_1$) and root($R_2$).]

Questions
• Can we enumerate all possible phylogenetic trees for n species (or sequences)?
• How to score a phylogenetic tree with respect to data?
• How to find the best phylogenetic tree given data?

Finding the best phylogenetic tree: naive method
• How can we find the phylogenetic tree that best represents the data?
• Naive method: enumerate all possible trees
• How many different trees are there of n species?
• Denote this number by $b_n$

Enumerating unordered trees
• Start with the only unrooted tree with 3 leaves ($b_3 = 1$)
• Consider all ways to add a leaf node to this tree
• The fourth node can be added to 3 different branches (edges), creating 1 new internal branch
• The total number of branches is n external and n - 3 internal branches
• An unrooted tree with n leaves has 2n - 3 branches
[Figure: the unrooted tree with leaves 1, 2, 3 and the three trees obtained by attaching leaf 4 to each of its branches.]

Enumerating unordered trees
• Thus, we get the number of unrooted trees
  $b_n = (2(n - 1) - 3)\, b_{n-1} = (2n - 5)\, b_{n-1} = (2n - 5)(2n - 7) \cdots 3 \cdot 1 = \frac{(2n - 5)!}{(n - 3)!\, 2^{n-3}}, \quad n > 2$
• Number of rooted trees $b'_n$ is
  $b'_n = (2n - 3)\, b_n = \frac{(2n - 3)!}{(n - 2)!\, 2^{n-2}}, \quad n > 2$
  that is, the number of unrooted trees times the number of branches in the trees
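The two counting formulas are easy to check numerically. The sketch below (function names are invented here) computes the number of unrooted trees both from the recurrence and from the closed form, and derives the number of rooted trees from it.

```python
from math import factorial


def unrooted_trees(n: int) -> int:
    """Number of unrooted trees with n leaves, via b_n = (2n - 5) * b_(n-1), b_3 = 1."""
    if n < 3:
        raise ValueError("the formula is stated for n > 2")
    count = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


def unrooted_trees_closed_form(n: int) -> int:
    """Same number from the closed form (2n - 5)! / ((n - 3)! * 2^(n - 3))."""
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))


def rooted_trees(n: int) -> int:
    """Number of rooted trees: one root position per branch, (2n - 3) * b_n."""
    return (2 * n - 3) * unrooted_trees(n)


for n in (3, 4, 5, 6, 7, 8, 9, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Python integers have arbitrary precision, so the same functions also work for n = 20 or n = 30, where the counts are far too large to enumerate.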

Number of possible rooted and unrooted trees
  n     b_n          b'_n
  3     1            3
  4     3            15
  5     15           105
  6     105          945
  7     945          10395
  8     10395        135135
  9     135135       2027025
  10    2027025      34459425
  20    2.22E+020    8.20E+021
  30    8.69E+036    4.95E+038

Too many trees?
• We can't construct and evaluate every phylogenetic tree even for a smallish number of species
• A better alternative is to
  − Devise a way to evaluate an individual tree against the data
  − Guide the search using the evaluation criteria to reduce the search space

Inferring the Past: Phylogenetic Trees (chapter 12)
• The biological problem
• Parsimony and distance methods
• Models for mutations and estimation of distances
• Maximum likelihood methods

Parsimony method
• The parsimony method finds the tree that explains the observed sequences with a minimal number of substitutions
• Method has two steps
  − Compute the smallest number of substitutions for a given tree with a parsimony algorithm
  − Search for the tree with the minimal number of substitutions

Parsimony: an example
• Consider the following short sequences
  1 ACTTT
  2 ACATT
  3 AACGT
  4 AATGT
  5 AATTT
• There are 105 possible rooted trees for 5 sequences
• Example: which of the following trees explains the sequences with the least number of substitutions?

[Figure: rooted tree, shown here as an indented listing with the substitution on the branch leading to each node.]
  9 AATTT (root)
    7 AATTT
      6 AATGT (T->G)
        3 AACGT (T->C)
        4 AATGT
      5 AATTT
    8 ACTTT (A->C)
      2 ACATT (T->A)
      1 ACTTT
This tree explains the sequences with 4 substitutions

First tree: 4 substitutions
  9 AATTT (root)
    7 AATTT
      6 AATGT (T->G)
        3 AACGT (T->C)
        4 AATGT
      5 AATTT
    8 ACTTT (A->C)
      2 ACATT (T->A)
      1 ACTTT
Second tree: 6 substitutions
  9 AATTT (root)
    8 AATGT (T->G)
      7 AACGT (T->C)
        6 ACCTT (A->C, G->T)
          1 ACTTT (C->T)
          2 ACATT (C->A)
        3 AACGT
      4 AATGT
    5 AATTT
First tree is more parsimonious!

Computing parsimony
• Parsimony treats each site (position in a sequence) independently
• Total parsimony cost is the sum of parsimony costs of each site
• We can compute the minimal parsimony cost for a given tree by
  − First finding out possible assignments at each node, starting from leaves and proceeding towards the root
  − Then, starting from the root, assigning a letter at each node, proceeding towards leaves

Labelling tree nodes
• An unrooted tree with n leaves contains 2n-2 nodes altogether
• Assign the following labels to nodes in a rooted tree
  − leaf nodes: 1, 2, …, n
  − internal nodes: n+1, n+2, …, 2n-1
  − root node: 2n-1
• The label of a child node is always smaller than the label of the parent node
[Figure: rooted tree with leaves 1 to 5, internal nodes 6, 7, 8 and root 9.]

Parsimony algorithm: first phase
• Find out possible assignments at every node for each site independently. Denote site u in sequence i by $s_{i,u}$
For i := 1, …, n do
    F_i := {s_{i,u}}      % possible assignments at node i
    L_i := 0              % number of substitutions up to node i
For i := n+1, …, 2n-1 do
    Let j and k be the children of node i
    If F_j ∩ F_k = ∅
    then L_i := L_j + L_k + 1, F_i := F_j ∪ F_k
    else L_i := L_j + L_k, F_i := F_j ∩ F_k
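The first phase can be written compactly with Python sets. This is a sketch under stated assumptions: the function names and the `children` dictionary layout are invented here, node labels follow the convention that a child's label is smaller than its parent's (so processing internal nodes in increasing order visits children first), and the tree and sequences are the five-sequence example from the slides.

```python
def fitch_site_cost(children: dict, leaf_letters: dict):
    """First phase for one site: return (cost at the root, candidate sets F)."""
    F = {leaf: {letter} for leaf, letter in leaf_letters.items()}
    L = {leaf: 0 for leaf in leaf_letters}
    for i in sorted(children):              # children are labelled lower than parents
        j, k = children[i]
        common = F[j] & F[k]
        if common:                          # intersection non-empty: no new substitution
            F[i], L[i] = common, L[j] + L[k]
        else:                               # empty intersection: one more substitution
            F[i], L[i] = F[j] | F[k], L[j] + L[k] + 1
    root = max(children)
    return L[root], F


def parsimony_cost(children: dict, leaf_seqs: dict) -> int:
    """Sum the per-site costs over all sites of equal-length sequences."""
    length = len(next(iter(leaf_seqs.values())))
    return sum(
        fitch_site_cost(children, {leaf: seq[u] for leaf, seq in leaf_seqs.items()})[0]
        for u in range(length)
    )


# Example tree: node 6 joins leaves 3 and 4, node 7 joins 5 and 6,
# node 8 joins 1 and 2, and the root 9 joins 7 and 8.
children = {6: (3, 4), 7: (5, 6), 8: (1, 2), 9: (7, 8)}
leaf_seqs = {1: "ACTTT", 2: "ACATT", 3: "AACGT", 4: "AATGT", 5: "AATTT"}

site3 = {leaf: seq[2] for leaf, seq in leaf_seqs.items()}   # site u = 3 (0-based index 2)
print(fitch_site_cost(children, site3)[0])
print(parsimony_cost(children, leaf_seqs))
```

The sketch covers only the first phase. The second phase would walk back down from the root and pick one letter from each candidate set.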

Parsimony algorithm: first phase
Choose u = 3 (for example)
  $F_1 := \{T\}$, $L_1 := 0$
  $F_2 := \{A\}$, $L_2 := 0$
  $F_3 := \{C\}$, $L_3 := 0$
  $F_4 := \{T\}$, $L_4 := 0$
  $F_5 := \{T\}$, $L_5 := 0$
  $F_8 := F_1 \cup F_2 = \{A, T\}$, $L_8 := L_1 + L_2 + 1 = 1$
[Figure: the example tree with leaves 3 AACGT, 4 AATGT, 5 AATTT, 2 ACATT, 1 ACTTT and internal nodes 6 to 9; site 3 is highlighted.]

Parsimony algorithm: first phase
  $F_6 := F_3 \cup F_4 = \{C, T\}$, $L_6 := L_3 + L_4 + 1 = 1$
  $F_7 := F_5 \cap F_6 = \{T\}$, $L_7 := L_5 + L_6 = 1$
  $F_8 := F_1 \cup F_2 = \{A, T\}$, $L_8 := L_1 + L_2 + 1 = 1$
  $F_9 := F_7 \cap F_8 = \{T\}$, $L_9 := L_7 + L_8 = 2$
  => Parsimony cost for site 3 is 2
[Figure: the example tree with the first-phase sets at the internal nodes for site 3: node 6 {C, T}, node 7 T, node 8 T, node 9 T.]

Parsimony algorithm: second phase
• Backtrack from the root and assign $x \in F_i$ at each node
• If we assigned y at the parent of node i and $y \in F_i$, then assign y
• Else assign $x \in F_i$ at random

Parsimony algorithm: second phase
• At node 6, the algorithm assigns T because T was assigned to parent node 7 and $T \in F_6$
• The other nodes have only one possible letter to assign
[Figure: the example tree for site 3 with node 9 T, node 7 T, node 8 T and node 6 {C, T}, where T is chosen.]

Parsimony algorithm
• First and second phase are repeated for each site in the sequences, summing the parsimony costs at each site
[Figure: the example tree for site 3 with T assigned to all internal nodes 6 to 9.]

Properties of parsimony algorithm
• Parsimony algorithm requires that the sequences are of the same length
  − First align the sequences against each other and remove indels
  − Then compute parsimony for the resulting sequences
• Is the most parsimonious tree the correct tree?
  − Not necessarily, but it explains the sequences with the least number of substitutions
  − We can assume that the probability of having fewer mutations is higher than having many mutations

Finding the most parsimonious tree
• Parsimony algorithm calculates the parsimony cost for a given tree…
• …but we still have the problem of finding the tree with the lowest cost
• Exhaustive search (enumerating all trees) is in general impossible
• More efficient methods exist, for example
  − Probabilistic search
  − Branch and bound

Branch and bound in parsimony
• We can exploit the fact that adding edges to a tree can only increase the parsimony cost
[Figure: for site 3, the tree with leaves 1 AATGT and 2 AATTT has set {T} and cost 0; the tree with leaves 1 AACGT, 2 AATGT and 3 AATTT has sets {C, T} and {T} and cost 1.]

Branch and bound in parsimony
Branch and bound is a general search strategy where
• Each solution is potentially generated
• Track is kept of the best solution found
• If a partial solution cannot achieve a better score, we abandon the current search path
In parsimony…
• Start from a tree with 1 sequence
• Add a sequence to the tree and calculate the parsimony cost
• If the tree is complete, check whether it is the best tree found so far
• If the tree is not complete and its cost exceeds the best tree cost, do not continue adding edges to this tree

Branch and bound graphically
[Figure: search tree over partial phylogenetic trees, with three kinds of nodes: partial tree, no best complete tree constructed yet; complete tree, calculate parsimony cost and store; partial tree, cost exceeds the cost of the best tree this far.]

Distance methods
• The parsimony method works on sequence (character string) data
• We can also build phylogenetic trees in a more general setting
• Distance methods work on a set of pairwise distances $d_{ij}$ for the data
• Distances can be obtained from phenotypes as well as from genotypes (sequences)

Distances in a phylogenetic tree
• Distance matrix $D = (d_{ij})$ gives pairwise distances for leaves of the phylogenetic tree
• In addition, the phylogenetic tree will now specify distances between leaves and internal nodes
  − Denote these with $d_{ij}$ as well
• Distance $d_{ij}$ states how far apart species i and j are evolutionarily (e.g., number of mismatches in aligned sequences)
[Figure: rooted tree with leaves 1 to 5 and internal nodes 6 to 8.]

Distances in evolutionary context
• Distances $d_{ij}$ in evolutionary context satisfy the following conditions
  − Symmetry: $d_{ij} = d_{ji}$ for each i, j
  − Distinguishability: $d_{ij} \neq 0$ if and only if $i \neq j$
  − Triangle inequality: $d_{ij} \leq d_{ik} + d_{kj}$ for each i, j, k
• Distances satisfying these conditions are called metric
• In addition, evolutionary mechanisms may impose additional constraints on the distances -> additive and ultrametric distances
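The three conditions can be checked directly on a distance matrix. The sketch below is a brute-force check (cubic in the number of species); the function name `is_metric` and the example matrices are chosen here for illustration only.

```python
def is_metric(d: list[list[float]]) -> bool:
    """Check symmetry, distinguishability and the triangle inequality."""
    n = len(d)
    for i in range(n):
        for j in range(n):
            if d[i][j] != d[j][i]:
                return False                      # symmetry violated
            if (d[i][j] != 0) != (i != j):
                return False                      # distinguishability violated
            for k in range(n):
                if d[i][j] > d[i][k] + d[k][j]:
                    return False                  # triangle inequality violated
    return True


good = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]           # 5 > 1 + 1
print(is_metric(good), is_metric(bad))
```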

Additive trees
• A tree is called additive, if the distance between any pair of leaves (i, j) is the sum of the distances between the leaves and the first node k that they share in the tree: $d_{ij} = d_{ik} + d_{jk}$
• "Follow the path from the leaf i to the leaf j to find the exact distance $d_{ij}$ between the leaves."

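Additive trees: example
        A   B   C   D
  A     0   2   4   4
  B     2   0   4   4
  C     4   4   0   2
  D     4   4   2   0
[Figure: unrooted tree in which leaves A and B each join one internal node by an edge of length 1, leaves C and D each join a second internal node by an edge of length 1, and the two internal nodes are joined by an edge of length 2.]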
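The example can be verified by summing edge lengths along the path between each pair of leaves. The sketch below assumes a particular reading of the figure: two internal nodes, called `x` and `y` here (hypothetical names), with A and B attached to `x`, C and D attached to `y`, all leaf edges of length 1 and the internal edge of length 2.

```python
def path_distances(edges: dict, leaves: list) -> list[list[int]]:
    """Leaf-to-leaf distances in a tree given as {(u, v): length}."""
    adjacency = {}
    for (u, v), w in edges.items():
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))

    def distances_from(source):
        # Depth-first traversal; in a tree every node is reached by exactly one path.
        dist = {source: 0}
        stack = [source]
        while stack:
            u = stack.pop()
            for v, w in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    return [[distances_from(a)[b] for b in leaves] for a in leaves]


# Assumed tree shape, see the lead-in above.
edges = {("A", "x"): 1, ("B", "x"): 1, ("x", "y"): 2, ("C", "y"): 1, ("D", "y"): 1}
for row in path_distances(edges, ["A", "B", "C", "D"]):
    print(row)
```

Under this reading the path sums reproduce the distance matrix of the example, which is what additivity requires.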

Ultrametric trees
• A rooted additive tree is called an ultrametric tree, if the distances between any two leaves i and j, and their common ancestor k are equal: $d_{ik} = d_{jk}$
• Edge length $d_{ij}$ corresponds to the time elapsed since divergence of i and j from the common parent
• In other words, edge lengths are measured by a molecular clock with a constant rate

Identifying ultrametric data
• We can identify distances to be ultrametric by the three-point condition: D corresponds to an ultrametric tree if and only if for any three sequences i, j and k, the distances satisfy $d_{ij} \leq \max(d_{ik}, d_{kj})$
• If we find out that the data is ultrametric, we can utilise a simple algorithm to find the corresponding tree
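The three-point condition is straightforward to test by brute force over all ordered triples. In this sketch the function name `is_ultrametric` is invented, the first matrix is the one from the additive tree example, and the second is a made-up counterexample.

```python
from itertools import permutations


def is_ultrametric(d: list[list[float]]) -> bool:
    """Three-point condition: d[i][j] <= max(d[i][k], d[k][j]) for all distinct i, j, k."""
    n = len(d)
    return all(
        d[i][j] <= max(d[i][k], d[k][j])
        for i, j, k in permutations(range(n), 3)
    )


example = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
counterexample = [[0, 2, 5], [2, 0, 4], [5, 4, 0]]   # 5 > max(2, 4)
print(is_ultrametric(example), is_ultrametric(counterexample))
```

The check costs cubic time in the number of sequences, which is fine for small distance matrices.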

582606 Introduction to bioinformatics, Autumn 2006. Esa Pitkänen, Master's Degree Programme in Bioinformatics (MBI), Department of Computer Science, University of Helsinki. http://www.cs.helsinki.fi/mbi/courses/06-07/itb/
