Sequence Alignment (chapter 6)



  1. Sequence Alignment (chapter 6)
     - The biological problem
     - Global alignment
     - Local alignment
     - Multiple alignment

  2. Local alignment: rationale
     - Otherwise dissimilar proteins may have local regions of similarity, which suggests that the proteins may share a function
     - Human bone morphogenetic protein receptor type II precursor (left) has a 300 aa region that resembles a 291 aa region in the TGF-β receptor (right). The shared function here is protein kinase.

  3. Local alignment: rationale
     [Figure: sequences A and B with their regions of similarity marked]
     - Global alignment would be inadequate
     - Problem: find the highest-scoring local alignment between two sequences
     - The previous algorithm with minor modifications solves this problem (Smith & Waterman 1981)

  4. From global to local alignment
     - Modifications to the global alignment algorithm:
       - Look for the highest-scoring path in the alignment matrix (not necessarily one that runs through the whole matrix), or in other words:
       - Allow preceding and trailing indels without penalty

  5. Scoring local alignments
     Let $A = a_1 a_2 a_3 \ldots a_n$ and $B = b_1 b_2 b_3 \ldots b_m$, and let $I$ and $J$ be intervals (substrings) of $A$ and $B$, respectively. The best local alignment score is
     $$\max_{I \subseteq A,\; J \subseteq B} S(I, J),$$
     where $S(I, J)$ is the alignment score for substrings $I$ and $J$.

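  6. Allowing preceding and trailing indels
     - First row and column initialised to zero: $M_{i,0} = M_{0,j} = 0$

     |    | -  | b1 | b2 | b3 | b4 |
     |----|----|----|----|----|----|
     | -  | 0  | 0  | 0  | 0  | 0  |
     | a1 | 0  |    |    |    |    |
     | a2 | 0  |    |    |    |    |
     | a3 | 0  |    |    |    |    |

     - Example: an alignment may begin with unpenalised indels, such as b1 b2 b3 aligned over - - a1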

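  7. Recursion for local alignment
     - $M_{i,j} = \max\{\, M_{i-1,j-1} + s(a_i, b_j),\; M_{i-1,j} - \delta,\; M_{i,j-1} - \delta,\; 0 \,\}$
     - The option 0 allows the alignment to start anywhere in the sequences

     |   | - | T | G | G | T | G |
     |---|---|---|---|---|---|---|
     | - | 0 | 0 | 0 | 0 | 0 | 0 |
     | A | 0 | 0 | 0 | 0 | 0 | 0 |
     | T | 0 | 1 | 0 | 0 | 1 | 0 |
     | C | 0 | 0 | 0 | 0 | 0 | 0 |
     | G | 0 | 0 | 1 | 1 | 0 | 1 |
     | T | 0 | 1 | 0 | 0 | 2 | 0 |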

  8. Finding best local alignment
     - The optimal score is the highest value in the matrix: $\max_{i,j} M_{i,j}$
     - The best local alignment can be found by backtracking from the highest value in $M$ until a cell with value 0 is reached
     - What is the best local alignment in the example matrix of slide 7 (rows ATCGT, columns TGGTG)?

  9. Local alignment: example
     - Sequences: ACCTAAGG (rows, $i = 1, \ldots, 8$) and GGCTCAATCA (columns, $j = 1, \ldots, 10$)
     - Recursion: $M_{i,j} = \max\{\, M_{i-1,j-1} + s(a_i, b_j),\; M_{i-1,j} - \delta,\; M_{i,j-1} - \delta,\; 0 \,\}$
     - Scoring (for example): match +2, mismatch -1, indel -2
     - Start: row 0 and column 0 are all zeros, and the first computed cell is $M_{1,1} = 0$

  10. Local alignment: example
      - Same sequences, recursion and scoring as on slide 9 (match +2, mismatch -1, indel -2)
      - Continuing along row 1 (A): $M_{1,1} = \ldots = M_{1,5} = 0$ and $M_{1,6} = 2$, the first match (A against A)

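  11. Local alignment: example
      - Scoring (for example): match +2, mismatch -1, indel -2
      - Optimal local alignment:

            C T - A A
            C T C A A

      |   |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
      |---|---|---|---|---|---|---|---|---|---|---|---|----|
      |   |   | - | G | G | C | T | C | A | A | T | C | A  |
      | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  |
      | 1 | A | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 2  |
      | 2 | C | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 1 | 1 | 2 | 0  |
      | 3 | C | 0 | 0 | 0 | 2 | 1 | 2 | 1 | 0 | 0 | 3 | 1  |
      | 4 | T | 0 | 0 | 0 | 0 | 4 | 2 | 1 | 0 | 2 | 1 | 2  |
      | 5 | A | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 3 | 1 | 1 | 3  |
      | 6 | A | 0 | 0 | 0 | 0 | 0 | 1 | 5 | 6 | 4 | 2 | 3  |
      | 7 | G | 0 | 2 | 2 | 0 | 0 | 0 | 3 | 4 | 5 | 3 | 1  |
      | 8 | G | 0 | 2 | 4 | 2 | 0 | 0 | 1 | 2 | 3 | 4 | 2  |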
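
The example on this slide can be reproduced with a short script. The following is a minimal sketch of the local alignment recursion and backtracking, not the course's own implementation; the function name `smith_waterman` and its parameter names are assumptions made for illustration, and the default scores are the ones given on the slide (match +2, mismatch -1, indel -2).

```python
def smith_waterman(a, b, match=2, mismatch=-1, indel=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero: leading indels cost nothing.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # align a_i with b_j
                          M[i - 1][j] + indel,   # a_i against a gap
                          M[i][j - 1] + indel,   # b_j against a gap
                          0)                     # start a new alignment here
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Backtrack from the highest cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + indel:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("ACCTAAGG", "GGCTCAATCA")
print(score)   # 6
print(top)     # CT-AA
print(bottom)  # CTCAA
```

The sketch reports only one optimal alignment: ties in the matrix are resolved by taking the first maximum found, and ties during backtracking by preferring the diagonal move.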

  12. Multiple optimal alignments; non-optimal, good-scoring alignments
      In the filled matrix of slide 11 (ACCTAAGG against GGCTCAATCA), how can you find
      1. Optimal alignments, if more than one exists?
      2. Non-optimal, good-scoring alignments?

  13. Overlap alignment
      - The overlap matrix used by the Overlap-Layout-Consensus algorithm can be computed with dynamic programming
      - Initialization: $O_{i,0} = O_{0,j} = 0$ for all $i, j$
      - Recursion: $O_{i,j} = \max\{\, O_{i-1,j-1} + s(a_i, b_j),\; O_{i-1,j} - \delta,\; O_{i,j-1} - \delta \,\}$
      - Best overlap: the maximum value in the rightmost column and bottom row
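
As an illustration of the overlap recursion, here is a minimal sketch. The function name `overlap_alignment` and the scores (match +1, mismatch -1, indel -2) are assumptions chosen for the example; the slide does not fix them. Unlike local alignment, the recursion has no 0 option, so scores inside the matrix can go negative.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, indel=-2):
    """Return (best score, i, j) taken from the bottom row and rightmost column."""
    n, m = len(a), len(b)
    # Zero initialisation: an overlap may start anywhere in either sequence.
    O = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            O[i][j] = max(O[i - 1][j - 1] + s,
                          O[i - 1][j] + indel,
                          O[i][j - 1] + indel)
    # The overlap must run to the end of at least one sequence.
    candidates = [(O[n][j], n, j) for j in range(m + 1)]
    candidates += [(O[i][m], i, m) for i in range(n + 1)]
    return max(candidates)


# The suffix ACG of the first read matches the prefix ACG of the second.
print(overlap_alignment("TTACG", "ACGGG"))  # (3, 5, 3)
```

The returned indices say that the best overlap ends at the last symbol of the first sequence (i = 5) and the third symbol of the second (j = 3).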

  14. Non-uniform mismatch penalties
      - We used a uniform penalty for mismatches: $s(\mathrm{A}, \mathrm{C}) = s(\mathrm{A}, \mathrm{G}) = \ldots = s(\mathrm{G}, \mathrm{T}) = \mu$
      - Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent as transversions (such as A->T, T->A, A->C, G->T)
        - Use non-uniform mismatch penalties collected into a substitution matrix:

      |   | A    | C    | G    | T    |
      |---|------|------|------|------|
      | A | 1    | -1   | -0.5 | -1   |
      | C | -1   | 1    | -1   | -0.5 |
      | G | -0.5 | -1   | 1    | -1   |
      | T | -1   | -0.5 | -1   | 1    |
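
The substitution matrix on this slide follows a simple rule: identical bases score 1, transitions (purine to purine or pyrimidine to pyrimidine) score -0.5, and transversions score -1. A minimal sketch of that rule follows; the names `substitution_score`, `SUBSTITUTION` and `ungapped_score` are made up for this illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y):
    """Score one aligned pair of bases using the values from the slide."""
    if x == y:
        return 1.0
    # A transition stays within the purines or within the pyrimidines.
    is_transition = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return -0.5 if is_transition else -1.0


# The full 4 x 4 table, keyed by (row base, column base).
SUBSTITUTION = {(x, y): substitution_score(x, y)
                for x in "ACGT" for y in "ACGT"}


def ungapped_score(a, b):
    """Sum the substitution scores of two equally long, gap-free sequences."""
    return sum(SUBSTITUTION[x, y] for x, y in zip(a, b))


print(SUBSTITUTION["A", "G"])          # -0.5 (transition)
print(SUBSTITUTION["A", "C"])          # -1.0 (transversion)
print(ungapped_score("ACGT", "GCGA"))  # 0.5
```

In a dynamic programming recursion, a lookup in such a table would take the place of the fixed match and mismatch scores in $s(a_i, b_j)$.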

  15. Gaps in alignment
      - A gap is a succession of indels in an alignment:

            C T - - - A A
            C T C G C A A

      - The previous model scored a gap of length $k$ as $w(k) = -k\delta$
      - Replication processes may produce longer stretches of insertions or deletions
        - In coding regions, insertions or deletions of codons may preserve functionality

  16. Gap open and extension penalties (2)
      - We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it: $w(k) = -\alpha - \beta(k - 1)$
      - Gap open cost $\alpha$, gap extension cost $\beta$
      - Alignment algorithms can be extended to use $w(k)$ (not discussed in this course)
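
To see how the two gap models differ, the sketch below evaluates both for a few gap lengths. The function names and the parameter values (indel cost 2, gap open cost 3, gap extension cost 1) are hypothetical and chosen only for illustration; `alpha` and `beta` stand for the gap open and gap extension costs.

```python
def linear_gap(k, delta):
    """Previous model: every indel in a gap of length k costs delta."""
    return -k * delta


def affine_gap(k, alpha, beta):
    """Open/extend model: alpha for the first indel, beta for each further one."""
    return -alpha - beta * (k - 1)


for k in (1, 2, 5, 10):
    print(k, linear_gap(k, delta=2), affine_gap(k, alpha=3, beta=1))
# 1 -2 -3
# 2 -4 -4
# 5 -10 -7
# 10 -20 -12
```

With these values a single indel is penalised more heavily under the open/extend model, while long gaps become cheaper, which matches the observation that insertions and deletions often occur in longer stretches.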

  17. Amino acid sequences
      - We have discussed mainly DNA sequences
      - Amino acid sequences can be aligned as well
      - However, the design of the substitution matrix is more involved because of the larger alphabet
      - More on the topic in the course Biological sequence analysis

  18. Demonstration of the EBI web site
      - The European Bioinformatics Institute (EBI) offers many biological databases and bioinformatics tools at http://www.ebi.ac.uk/
        - Sequence alignment: Tools -> Sequence Analysis -> Align

  19. Sequence Alignment (chapter 6)
      - The biological problem
      - Global alignment
      - Local alignment
      - Multiple alignment

  20. Multiple alignment
      - Consider the set of sequences listed below
        - Orthologous sequences from different organisms
        - Paralogs from multiple duplications
      - How can we study relationships between these sequences?

            aggcgagctgcgagtgcta
            cgttagattgacgctgac
            ttccggctgcgac
            gacacggcgaacgga
            agtgtgcccgacgagcgaggac
            gcgggctgtgagcgcta
            aagcggcctgtgtgcccta
            atgctgctgccagtgta
            agtcgagccccgagtgc
            agtccgagtcc
            actcggtgc

  21. Optimal alignment of three sequences
      - An alignment of $A = a_1 a_2 \ldots a_i$ and $B = b_1 b_2 \ldots b_j$ can end in $(-, b_j)$, $(a_i, b_j)$ or $(a_i, -)$: $2^2 - 1 = 3$ alternatives
      - An alignment of $A$, $B$ and $C = c_1 c_2 \ldots c_k$ can end in $2^3 - 1 = 7$ ways: $(a_i, -, -)$, $(-, b_j, -)$, $(-, -, c_k)$, $(-, b_j, c_k)$, $(a_i, -, c_k)$, $(a_i, b_j, -)$ or $(a_i, b_j, c_k)$
      - Solve the recursion using a three-dimensional dynamic programming matrix: $O(n^3)$ time and space for sequences of length $n$
      - Generalizes to any number of sequences, but is impractical with even a moderate number of them, since the matrix gains one dimension per sequence
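
The counts 3 and 7 come from letting each sequence contribute either its last symbol or a gap to the final column, and discarding the column that consists of gaps only. A small sketch that enumerates these columns is shown here; the function name `column_endings` and the string labels are illustrative only.

```python
from itertools import product


def column_endings(last_symbols):
    """List every possible last alignment column for the given sequences.

    Each sequence contributes its last symbol or a gap; the all-gap
    column is excluded because it aligns nothing.
    """
    choices = [(symbol, "-") for symbol in last_symbols]
    return [column for column in product(*choices)
            if any(entry != "-" for entry in column)]


print(len(column_endings(["a_i", "b_j"])))         # 3
print(len(column_endings(["a_i", "b_j", "c_k"])))  # 7
for column in column_endings(["a_i", "b_j", "c_k"]):
    print(column)
```

For $k$ sequences the list has $2^k - 1$ entries, and each cell of the $k$-dimensional matrix takes a maximum over that many predecessors.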

  22. Multiple alignment in practice
      - In practice, real-world multiple alignment problems are usually solved with heuristics
      - Progressive multiple alignment:
        - Choose two sequences and align them
        - Choose a third sequence with respect to the two previous sequences and align it against them
        - Repeat until all sequences have been aligned
        - There are different options for how to choose sequences and score alignments
        - Note the similarity to Overlap-Layout-Consensus

  23. Multiple alignment in practice
      - Profile-based progressive multiple alignment: CLUSTALW
        - Construct a distance matrix of all pairs of sequences using dynamic programming
        - Progressively align pairs in order of decreasing similarity
        - CLUSTALW uses various heuristics to improve accuracy
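
The first two steps can be sketched as follows. This is not CLUSTALW itself: it is a simplified illustration that uses a plain global alignment score (with assumed scores match +1, mismatch -1, indel -2) as the similarity measure and then sorts the sequence pairs by decreasing similarity. The function names are hypothetical.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, indel=-2):
    """Global alignment score by dynamic programming, keeping one row at a time."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,
                            prev[j] + indel,
                            curr[j - 1] + indel))
        prev = curr
    return prev[-1]


def pairs_by_similarity(seqs):
    """Return (score, i, j) for all pairs, most similar pair first."""
    scored = [(global_score(seqs[i], seqs[j]), i, j)
              for i, j in combinations(range(len(seqs)), 2)]
    return sorted(scored, key=lambda entry: -entry[0])


seqs = ["ACGT", "ACGA", "TTTT"]
for score, i, j in pairs_by_similarity(seqs):
    print(seqs[i], seqs[j], score)
# ACGT ACGA 2
# ACGT TTTT -2
# ACGA TTTT -4
```

A progressive aligner would start from the pair at the top of this list and add the remaining sequences one at a time.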

  24. Additional material
      - R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
      - N. C. Jones, P. A. Pevzner: An introduction to bioinformatics algorithms
      - Course Biological sequence analysis in period II, 2008

  25. Rapid alignment methods: FASTA and BLAST
      - The biological problem
      - Search strategies
      - FASTA
      - BLAST

  26. The biological problem
      - Global and local alignment algorithms are slow in practice
      - Consider the scenario of aligning a query sequence against a large database of sequences
        - New sequence with unknown function
        - NCBI GenBank size in January 2007 was 65 369 091 950 bases (61 132 599 sequences)
        - Feb 2008: 85 759 586 764 bases (82 853 685 sequences)
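
A back-of-the-envelope calculation shows why this is slow. Dynamic programming fills one matrix cell per pair of query and database positions, so the work is roughly the query length times the database size. In the sketch below the database size is the Feb 2008 figure from the slide, while the query length of 1000 bases and the rate of one billion cells per second are assumptions picked purely for illustration.

```python
DATABASE_BASES = 85_759_586_764  # GenBank, Feb 2008 (from the slide)
QUERY_LENGTH = 1_000             # assumed query length in bases
CELLS_PER_SECOND = 1e9           # assumed speed of the matrix fill

cells = QUERY_LENGTH * DATABASE_BASES
hours = cells / CELLS_PER_SECOND / 3600

print(f"{cells:.2e} cells, about {hours:.1f} hours")
```

Under these assumptions a single query takes on the order of a day, which is the motivation for the faster heuristic methods that follow.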

  27. Problem with large numbers of sequences
      - Exponential growth in both the number and the total length of sequences
      - Possible solution: compare against model organisms only
      - With a large number of sequences, chances are that some matches occur by chance
        - Need for statistical analysis

  28. Rapid alignment methods: FASTA and BLAST
      - The biological problem
      - Search strategies
      - FASTA
      - BLAST

  29. FASTA
      - FASTA is a multistep algorithm for sequence alignment (Wilbur and Lipman, 1983)
      - The sequence file format used by the FASTA software is widely used by other sequence analysis software
      - Main idea:
        - Choose regions of the two sequences (query and database) that look promising (have some degree of similarity)
        - Compute local alignment using dynamic programming in these regions
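
The slide does not say how the promising regions are chosen. One way to do it, which is how FASTA's first step is usually described, is to look for short exact word matches (k-tuples) and count how many of them fall on the same diagonal of the alignment matrix. The sketch below illustrates that idea only; the function name `diagonal_hits`, the word length `k = 3` and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, target, k=3):
    """Count shared k-tuples per diagonal (target position minus query position)."""
    # Index every k-tuple of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Each shared k-tuple is a hit on the diagonal j - i.
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[j - i] += 1
    return hits


hits = diagonal_hits("ACGTAC", "TTACGTACGG")
print(hits.most_common(1))  # [(2, 4)]
```

Diagonal 2 collects four hits here, so a dynamic programming alignment would be restricted to a band around that diagonal instead of the full matrix.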
