Sequence Alignment (chapter 6)
- The biological problem
- Global alignment
- Local alignment
- Multiple alignment

Local alignment: rationale
- Otherwise dissimilar proteins may have local regions of similarity -> the proteins may share a function
- Example: human bone morphogenic protein receptor type II precursor has a 300 aa region that resembles a 291 aa region in TGF-β receptor. The shared function here is protein kinase.

Local alignment: rationale
[Figure: sequences A and B with their regions of similarity marked]
- Global alignment would be inadequate
- Problem: find the highest-scoring local alignment between two sequences
- The previous (global alignment) algorithm with minor modifications solves this problem (Smith & Waterman 1981)

From global to local alignment
- Modifications to the global alignment algorithm
  - Look for the highest-scoring path in the alignment matrix (not necessarily through the whole matrix), or in other words:
  - Allow preceding and trailing indels without penalty

Scoring local alignments
- $A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
- Let I and J be intervals (substrings) of A and B, respectively
- Best local alignment score: $\max \{ S(I, J) : I \subseteq A,\ J \subseteq B \}$, where S(I, J) is the alignment score for substrings I and J

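Allowing preceding and trailing indels
- First row and column initialised to zero: $M_{i,0} = M_{0,j} = 0$
[Figure: alignment matrix with columns -, b_1, ..., b_4 and rows -, a_1, a_2, a_3; the first row and first column are all zero. The illustrated alignment places b_1 b_2 b_3 over - - a_1, so the preceding indels carry no penalty.]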

Recursion for local alignment
- $M_{i,j} = \max \{ M_{i-1,j-1} + s(a_i, b_j),\ M_{i-1,j} - \delta,\ M_{i,j-1} - \delta,\ 0 \}$
- The 0 term allows the alignment to start anywhere in the sequences

       -  T  G  G  T  G
    -  0  0  0  0  0  0
    A  0  0  0  0  0  0
    T  0  1  0  0  1  0
    C  0  0  0  0  0  0
    G  0  0  1  1  0  1
    T  0  1  0  0  2  0

Finding best local alignment
- Optimal score is the highest value in the matrix: $\max_{i,j} M_{i,j}$
- Best local alignment can be found by backtracking from the highest value in M
- What is the best local alignment in this example?

       -  T  G  G  T  G
    -  0  0  0  0  0  0
    A  0  0  0  0  0  0
    T  0  1  0  0  1  0
    C  0  0  0  0  0  0
    G  0  0  1  1  0  1
    T  0  1  0  0  2  0

Local alignment: example
- Sequences: A = ACCTAAGG (rows 1-8), B = GGCTCAATCA (columns 1-10)
- $M_{i,j} = \max \{ M_{i-1,j-1} + s(a_i, b_j),\ M_{i-1,j} - \delta,\ M_{i,j-1} - \delta,\ 0 \}$
- Scoring (for example): match +2, mismatch -1, indel -2
- Row 0 and column 0 are initialised to zero; the remaining cells are filled in row by row

Local alignment: example (filling in row 1)

          0  1  2  3  4  5  6  7  8  9 10
          -  G  G  C  T  C  A  A  T  C  A
    0  -  0  0  0  0  0  0  0  0  0  0  0
    1  A  0  0  0  0  0  0  2
    2  C  0
    3  C  0
    4  T  0
    5  A  0
    6  A  0
    7  G  0
    8  G  0

Local alignment: example
- Scoring (for example): match +2, mismatch -1, indel -2
- Completed matrix:

          0  1  2  3  4  5  6  7  8  9 10
          -  G  G  C  T  C  A  A  T  C  A
    0  -  0  0  0  0  0  0  0  0  0  0  0
    1  A  0  0  0  0  0  0  2  2  0  0  2
    2  C  0  0  0  2  0  2  0  1  1  2  0
    3  C  0  0  0  2  1  2  1  0  0  3  1
    4  T  0  0  0  0  4  2  1  0  2  1  2
    5  A  0  0  0  0  2  3  4  3  1  1  3
    6  A  0  0  0  0  0  1  5  6  4  2  3
    7  G  0  2  2  0  0  0  3  4  5  3  1
    8  G  0  2  4  2  0  0  1  2  3  4  2

- Optimal local alignment (score 6, the highest value in the matrix):

    C T - A A
    C T C A A
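
The example above can be reproduced with a short program. The following is a minimal Python sketch of the local alignment recursion and backtracking, assuming the example scoring (match +2, mismatch -1, indel penalty 2). The function name `local_align` and its parameters are illustrative and do not come from the slides. Ties during backtracking are resolved in favour of the diagonal move, which is an arbitrary choice.

```python
def local_align(a, b, match=2, mismatch=-1, delta=2):
    """Return (best score, aligned a, aligned b, matrix M) for a local alignment."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero: leading indels are free.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # align a_i with b_j
                          M[i - 1][j] - delta,   # a_i against an indel
                          M[i][j - 1] - delta,   # b_j against an indel
                          0)                     # start a new alignment here
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Backtrack from the highest value until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] - delta:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom)), M


score, top, bottom, M = local_align("ACCTAAGG", "GGCTCAATCA")
print(score)   # highest value in the matrix
print(top)
print(bottom)
```

Because the matrix is returned as well, individual rows can be compared against the table in the slide.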

Multiple optimal alignments; non-optimal, good-scoring alignments
How can you find
1. Optimal alignments if more than one exist?
2. Non-optimal, good-scoring alignments?
[Matrix: the completed local alignment matrix for A = ACCTAAGG and B = GGCTCAATCA, as in the example above]

Overlap alignment
- The overlap matrix used by the Overlap-Layout-Consensus algorithm can be computed with dynamic programming
- Initialization: $O_{i,0} = O_{0,j} = 0$ for all i, j
- Recursion: $O_{i,j} = \max \{ O_{i-1,j-1} + s(a_i, b_j),\ O_{i-1,j} - \delta,\ O_{i,j-1} - \delta \}$
- Best overlap: maximum value from the rightmost column and the bottom row
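
A minimal Python sketch of the overlap recursion follows. It differs from the local alignment recursion in two ways: there is no 0 term in the maximum, and the best score is read from the bottom row and rightmost column only. The function name `overlap_score`, the scoring values and the two test strings are assumptions made for illustration.

```python
def overlap_score(a, b, match=2, mismatch=-1, delta=2):
    """Return (best overlap score, i, j) taken from the last row or last column."""
    n, m = len(a), len(b)
    # O[i][0] = O[0][j] = 0: an overlap may start anywhere on the border.
    O = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            O[i][j] = max(O[i - 1][j - 1] + s,
                          O[i - 1][j] - delta,
                          O[i][j - 1] - delta)
    # Candidate end points: bottom row (all of a consumed) and
    # rightmost column (all of b consumed).
    candidates = [(O[n][j], n, j) for j in range(m + 1)]
    candidates += [(O[i][m], i, m) for i in range(n + 1)]
    return max(candidates, key=lambda c: c[0])


# The suffix "GTT" of the first string overlaps the prefix "GTT" of the second.
print(overlap_score("ACGTT", "GTTCA"))
```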

Non-uniform mismatch penalties
- We used a uniform penalty for mismatches: s('A', 'C') = s('A', 'G') = … = s('G', 'T') = µ
- Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent as transversions (A->T, T->A, A->C, G->T)
  - Use non-uniform mismatch penalties collected into a substitution matrix:

          A     C     G     T
    A     1    -1    -0.5  -1
    C    -1     1    -1    -0.5
    G    -0.5  -1     1    -1
    T    -1    -0.5  -1     1
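
One way to use such a matrix in code is a nested lookup table. The sketch below encodes the substitution matrix from the slide and scores a gap-free pairing of two equal-length strings; the names `SUBST`, `s` and `ungapped_score` and the example strings are assumptions for illustration.

```python
# Substitution matrix from the slide: transitions (A<->G, C<->T) cost -0.5,
# transversions cost -1, matches score 1.
SUBST = {
    "A": {"A": 1,    "C": -1,   "G": -0.5, "T": -1},
    "C": {"A": -1,   "C": 1,    "G": -1,   "T": -0.5},
    "G": {"A": -0.5, "C": -1,   "G": 1,    "T": -1},
    "T": {"A": -1,   "C": -0.5, "G": -1,   "T": 1},
}


def s(x, y):
    """Score for aligning nucleotide x with nucleotide y."""
    return SUBST[x][y]


def ungapped_score(a, b):
    """Sum of substitution scores for two equal-length sequences, no indels."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(s(x, y) for x, y in zip(a, b))


print(s("A", "G"), s("A", "C"))        # transition vs. transversion
print(ungapped_score("ACGT", "GCGC"))  # -0.5 + 1 + 1 - 0.5
```

The function `s` can replace the match/mismatch test in any of the dynamic programming recursions above.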

Gaps in alignment
- A gap is a succession of indels in an alignment:

    C T - - - A A
    C T C G C A A

- The previous model scored a gap of length k as $w(k) = -k\delta$
- Replication processes may produce longer stretches of insertions or deletions
  - In coding regions, insertions or deletions of codons may preserve functionality

Gap open and extension penalties (2)
- We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it: $w(k) = -\alpha - \beta(k - 1)$
- Gap open cost $\alpha$, gap extension cost $\beta$
- Alignment algorithms can be extended to use w(k) (not discussed on this course)
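
The two gap models can be compared directly. In the sketch below, `linear_gap` implements w(k) = -kδ and `affine_gap` implements w(k) = -α - β(k - 1); the function names and the numeric values of δ, α and β are assumptions chosen only to show the effect.

```python
def linear_gap(k, delta):
    """Score of a length-k gap when every indel costs delta."""
    return -k * delta


def affine_gap(k, alpha, beta):
    """Score of a length-k gap (k >= 1): open cost alpha, extension cost beta."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -alpha - beta * (k - 1)


# Assumed values: delta = 2, alpha = 4, beta = 1.
for k in (1, 3, 6):
    print(k, linear_gap(k, 2), affine_gap(k, 4, 1))
```

With these assumed values a single indel is more expensive under the affine model, while a long gap is cheaper than the same number of separate indels, which is the behaviour the slide motivates.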

Amino acid sequences
- We have discussed mainly DNA sequences
- Amino acid sequences can be aligned as well
- However, the design of the substitution matrix is more involved because of the larger alphabet
- More on the topic in the course Biological sequence analysis

Demonstration of the EBI web site
- European Bioinformatics Institute (EBI) offers many biological databases and bioinformatics tools at http://www.ebi.ac.uk/
  - Sequence alignment: Tools -> Sequence Analysis -> Align

Sequence Alignment (chapter 6)
- The biological problem
- Global alignment
- Local alignment
- Multiple alignment

Multiple alignment
- Consider a set of n sequences (listed below)
  - Orthologous sequences from different organisms
  - Paralogs from multiple duplications
- How can we study relationships between these sequences?

    aggcgagctgcgagtgcta
    cgttagattgacgctgac
    ttccggctgcgac
    gacacggcgaacgga
    agtgtgcccgacgagcgaggac
    gcgggctgtgagcgcta
    aagcggcctgtgtgcccta
    atgctgctgccagtgta
    agtcgagccccgagtgc
    agtccgagtcc
    actcggtgc

Optimal alignment of three sequences
- Alignment of $A = a_1 a_2 \ldots a_i$ and $B = b_1 b_2 \ldots b_j$ can end either in $(-, b_j)$, $(a_i, b_j)$ or $(a_i, -)$
- $2^2 - 1 = 3$ alternatives
- Alignment of A, B and $C = c_1 c_2 \ldots c_k$ can end in $2^3 - 1 = 7$ ways: $(a_i, -, -)$, $(-, b_j, -)$, $(-, -, c_k)$, $(-, b_j, c_k)$, $(a_i, -, c_k)$, $(a_i, b_j, -)$ or $(a_i, b_j, c_k)$
- Solve the recursion using a three-dimensional dynamic programming matrix: $O(n^3)$ time and space
- Generalizes to n sequences but is impractical with even a moderate number of sequences
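
The count of possible last columns can be checked by enumeration: each sequence either contributes its last symbol or an indel, and the all-indel column is excluded. The sketch below is illustrative; the function name `last_column_choices` is an assumption.

```python
from itertools import product


def last_column_choices(symbols):
    """All ways an alignment of len(symbols) sequences can end.

    Each position holds either the sequence's last symbol or "-",
    and at least one position must hold a symbol.
    """
    columns = []
    for mask in product([True, False], repeat=len(symbols)):
        if any(mask):  # skip the column consisting only of indels
            columns.append(tuple(sym if use else "-"
                                 for sym, use in zip(symbols, mask)))
    return columns


for k in range(2, 7):
    names = ["x%d" % t for t in range(k)]
    print(k, len(last_column_choices(names)))  # 2**k - 1 alternatives

print(last_column_choices(["a_i", "b_j", "c_k"]))
```

Each cell of the k-dimensional matrix takes a maximum over these 2^k - 1 predecessors, which adds to the cost of the matrix itself growing with every additional sequence.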

Multiple alignment in practice
- In practice, real-world multiple alignment problems are usually solved with heuristics
- Progressive multiple alignment
  - Choose two sequences and align them
  - Choose a third sequence w.r.t. the two previous sequences and align the third against them
  - Repeat until all sequences have been aligned
  - There are different options for how to choose sequences and score alignments
  - Note the similarity to Overlap-Layout-Consensus

Multiple alignment in practice
- Profile-based progressive multiple alignment: CLUSTALW
  - Construct a distance matrix of all pairs of sequences using dynamic programming
  - Progressively align pairs in order of decreasing similarity
  - CLUSTALW uses various heuristics to improve accuracy
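
The first two steps can be sketched as follows. This is not CLUSTALW itself: it assumes plain edit distance (computed with dynamic programming) as the pairwise distance and simply sorts the pairs, whereas the real program uses its own scoring and heuristics. The function names and the three short test sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (x != y),  # match or substitution
                           prev[j] + 1,             # deletion
                           cur[j - 1] + 1))         # insertion
        prev = cur
    return prev[-1]


def pairs_by_similarity(seqs):
    """All index pairs (distance, i, j), most similar (smallest distance) first."""
    pairs = [(edit_distance(seqs[i], seqs[j]), i, j)
             for i, j in combinations(range(len(seqs)), 2)]
    return sorted(pairs)


seqs = ["ACGT", "ACGA", "TTTT"]
for dist, i, j in pairs_by_similarity(seqs):
    print(seqs[i], seqs[j], dist)
```

The pair at the head of the list would be aligned first, and the remaining sequences added in order of decreasing similarity.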

Additional material
- R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
- N. C. Jones, P. A. Pevzner: An introduction to bioinformatics algorithms
- Course Biological sequence analysis in period II, 2008

Rapid alignment methods: FASTA and BLAST
- The biological problem
- Search strategies
- FASTA
- BLAST

The biological problem
- Global and local alignment algorithms are slow in practice
- Consider the scenario of aligning a query sequence against a large database of sequences
  - New sequence with unknown function
  - NCBI GenBank size in January 2007 was 65 369 091 950 bases (61 132 599 sequences)
  - Feb 2008: 85 759 586 764 bases (82 853 685 sequences)

Problem with a large number of sequences
- Exponential growth in both the number and the total length of sequences
- Possible solution: compare against model organisms only
- With a large number of sequences, chances are that matches occur by chance
  - Need for statistical analysis

Rapid alignment methods: FASTA and BLAST
- The biological problem
- Search strategies
- FASTA
- BLAST

FASTA
- FASTA is a multistep algorithm for sequence alignment (Wilbur and Lipman, 1983)
- The sequence file format used by the FASTA software is widely used by other sequence analysis software
- Main idea:
  - Choose regions of the two sequences (query and database) that look promising (have some degree of similarity)
  - Compute local alignment using dynamic programming in these regions
