sequence alignment
play

Sequence alignment Alignment specifies which positions in two - PowerPoint PPT Presentation

Sequence alignment Alignment specifies which positions in two sequences l match acgtctag acgtctag acgtctag || ||||| || ||||| actctag- -actctag ac-tctag 2 matches 5 matches 7 matches 5 mismatches 2 mismatches 0 mismatches 1 not


  1. Sequence alignment Alignment specifies which positions in two sequences l match acgtctag acgtctag acgtctag || ||||| || ||||| actctag- -actctag ac-tctag 2 matches 5 matches 7 matches 5 mismatches 2 mismatches 0 mismatches 1 not aligned 1 not aligned 1 not aligned Introduction to bioinformatics, Autumn 2007 41

  2. Mutations: Insertions, deletions and substitutions acgtctag Indel: insertion or Mismatch: substitution ||||| deletion of a base (point mutation) of with respect to the a single base -actctag ancestor sequence Insertions and/or deletions are called indels l − We can’t tell whether the ancestor sequence had a base or not at indel position Introduction to bioinformatics, Autumn 2007 42

  3. Problems What sorts of alignments should be considered? l How to score alignments? l How to find optimal or good scoring alignments? l How to evaluate the statistical significance of scores? l In this course, we discuss each of these problems briefly. Course Biological sequence analysis tackles all four in- depth. Introduction to bioinformatics, Autumn 2007 43

  4. Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l Multiple alignment l Introduction to bioinformatics, Autumn 2007 44

  5. Global alignment Problem: find optimal scoring alignment between two l sequences (Needleman & Wunsch 1970) Every position in both sequences is included in the alignment l We give score for each position in alignment l WHAT − Identity (match) +1 − Substitution (mismatch) -µ || �� − Indel WH-Y Total score: sum of position scores l S(WHAT/WH-Y) = 1 + 1 – � – µ Introduction to bioinformatics, Autumn 2007 45

  6. Dynamic programming How to find the optimal alignment? l We use previous solutions for optimal alignments of l smaller subsequences This general approach is known as dynamic l programming Introduction to bioinformatics, Autumn 2007 46

  7. Introduction to dynamic programming: the money change problem Suppose you buy a pen for 4.23€ and pay for with a l 5€ note You get 77 cents in change – what coins is the cashier l going to give you if he or she tries to minimise the number of coins? The usual algorithm: start with largest coin l (denominator), proceed to smaller coins until no change is left: − 50, 20, 5 and 2 cents This greedy algorithm is incorrect , in the sense that it l does not always give you the correct answer Introduction to bioinformatics, Autumn 2007 47

  8. The money change problem • How else to compute the 77 change? 50 20 5 • We could consider all possible 27 57 72 ways to reduce the amount of change • Suppose we have 77 cents 7 22 7 37 52 22 52 67 change, and the following coins: 50, 20, 5 cents … • We can compute the change • Many values are computed with recursion more than once! • Figure shows the recursion • This leads to a correct but tree for the example very inefficient algorithm Introduction to bioinformatics, Autumn 2007 48

  9. The money change problem We can speed the computation up by solving the l change problem for all i � n − Example: solve the problem for 9 cents with available coins being 1, 2 and 5 cents Solve the problem in steps, first for 1 cent, then 2 l cents, and so on In each step, utilise the solutions from the previous l steps Introduction to bioinformatics, Autumn 2007 49

  10. The money change problem Amount of 0 1 2 3 4 5 6 7 8 9 change left Algorithm runs in time proportional to Md, where M is l the amount of change and d is the number of coin types The same technique of storing solutions of l subproblems can be utilised in aligning sequences Introduction to bioinformatics, Autumn 2007 50

  11. Representing alignments and scores Alignments can be represented in the - W H A T following tabular form. Each alignment - corresponds to a path through the table. W X WHAT H X X || Y X WH-Y Introduction to bioinformatics, Autumn 2007 51

  12. Representing alignments and scores WHAT - W H A T || - 0 WH-Y W 1 H 2 2- � Global alignment Y 2- � -µ score S 3,4 = 2- � -µ Introduction to bioinformatics, Autumn 2007 52

  13. Filling the alignment matrix - W H A T Consider the alignment process at shaded square. - Case 1. Align H against H (match or substitution). W Case 1 Case 2. Align H in WHY against Case 2 – (indel) in WHAT. H Case 3 Case 3. Align H in WHAT against – (indel) in WHY. Y Introduction to bioinformatics, Autumn 2007 53

  14. Filling the alignment matrix (2) - W H A T Scoring the alternatives. Case 1. S 2,2 = S 1,1 + s(2, 2) - Case 2. S 2,2 = S 1,2 � � W Case 3. S 2,2 = S 2,1 � � Case 1 Case 2 s(i, j) = 1 for matching positions, H s(i, j) = - µ for substitutions. Case 3 Y Choose the case (path) that yields the maximum score. Keep track of path choices. Introduction to bioinformatics, Autumn 2007 54

  15. Global alignment: formal development A = a 1 a 2 a 3 …a n , 0 1 2 3 4 B = b 1 b 2 b 3 …b m - b 1 b 2 b 3 b 4 b 1 b 2 b 3 b 4 - 0 - - a 1 - a 2 a 3 l Any alignment can be written 1 a 1 as a unique path through the matrix 2 a 2 l Score for aligning A and B up to positions i and j: 3 a 3 S i,j = S(a 1 a 2 a 3 …a i , b 1 b 2 b 3 …b j ) Introduction to bioinformatics, Autumn 2007 55

  16. Scoring partial alignments Alignment of A = a 1 a 2 a 3 …a n with B = b 1 b 2 b 3 …b m can end in l three ways − Case 1: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j-1 ) b j − Case 2: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j ) - − Case 3: (a 1 a 2 …a i ) – (b 1 b 2 …b j-1 ) b j Introduction to bioinformatics, Autumn 2007 56

  17. Scoring alignments Scores for each case: l +1 if a i = b j s(a i , b j ) = { -µ otherwise − Case 1: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j-1 ) b j − Case 2: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j ) – s(a i , -) = s(-, b j ) = - � − Case 3: (a 1 a 2 …a i ) – (b 1 b 2 …b j-1 ) b j Introduction to bioinformatics, Autumn 2007 57

  18. Scoring alignments (2) • First row and first column 0 1 2 3 4 correspond to initial alignment against indels: - b 1 b 2 b 3 b 4 S(i, 0) = -i � S(0, j) = -j � �� -2 � -3 � -4 � 0 0 - • Optimal global alignment �� 1 a 1 score S(A, B) = S n,m -2 � 2 a 2 -3 � 3 a 3 Introduction to bioinformatics, Autumn 2007 58

  19. Algorithm for global alignment I nput sequences A, B, n = | A|, m = |B| Set S i,0 := - � i f or all i Set S 0,j := - � j f or all j f or i := 1 t o n f or j := 1 t o m S i,j := max{S i-1,j – � , S i-1,j -1 + s(a i ,b j ), S i,j -1 – � } end end Algorithm takes O(nm) time and space. Introduction to bioinformatics, Autumn 2007 59

  20. Global alignment: example - T G G T G µ = 1 - 0 -2 -4 -6 -8 -10 � = 2 A -2 T -4 C -6 G -8 T -10 ? Introduction to bioinformatics, Autumn 2007 60

  21. Global alignment: example (2) - T G G T G µ = 1 - 0 -2 -4 -6 -8 -10 � = 2 A -2 -1 -3 -5 -7 -9 T -4 -1 -2 -4 -4 -6 C -6 -3 -2 -3 -5 -5 ATCGT- G -8 -5 -2 -1 -3 -4 | || T -10 -7 -4 -3 0 -2 -TGGTG Introduction to bioinformatics, Autumn 2007 61

  22. Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l Multiple alignment l Introduction to bioinformatics, Autumn 2007 62

  23. Local alignment: rationale • Otherwise dissimilar proteins may have local regions of similarity -> Proteins may share a function Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles 291 aa region in TGF- � receptor (right). The shared function here is protein kinase. Introduction to bioinformatics, Autumn 2007 63

  24. Local alignment: rationale A B Regions of similarity • Global alignment would be inadequate • Problem: find the highest scoring local alignment between two sequences • Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981) Introduction to bioinformatics, Autumn 2007 64

  25. From global to local alignment Modifications to the global alignment algorithm l − Look for the highest-scoring path in the alignment matrix (not necessarily through the matrix), or in other words: − Allow preceding and trailing indels without penalty Introduction to bioinformatics, Autumn 2007 65

  26. Scoring local alignments A = a 1 a 2 a 3 …a n , B = b 1 b 2 b 3 …b m Let I and J be intervals (substrings) of A and B, respectively: , Best local alignment score: where S(I, J) is the score for substrings I and J. Introduction to bioinformatics, Autumn 2007 66

  27. Allowing preceding and trailing indels • First row and column 0 1 2 3 4 initialised to zero: - b 1 b 2 b 3 b 4 M i,0 = M 0,j = 0 0 0 0 0 0 0 - 0 1 a 1 b1 b2 b3 2 a 2 0 - - a1 0 3 a 3 Introduction to bioinformatics, Autumn 2007 67

  28. Recursion for local alignment • M i,j = max { - T G G T G M i-1,j-1 + s(a i , b i ), - 0 0 0 0 0 0 M i-1,j � � , A 0 0 0 0 0 0 M i,j-1 � � , 0 T 0 1 0 0 1 0 } C 0 0 0 0 0 0 G 0 0 1 1 0 1 T 0 1 0 0 2 0 Introduction to bioinformatics, Autumn 2007 68

Recommend


More recommend