Sequence comparison: Dynamic programming Genome 559: Introduction - PowerPoint PPT Presentation

Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas http://faculty.washington.edu/jht/GS559_2013/

Sequence comparison overview • Problem: Find the “best” alignment between a query sequence and a target sequence. • To solve this problem, we need – a method for scoring alignments, and – an algorithm for finding the alignment with the best score. • The alignment score is calculated using – a substitution matrix – gap penalties. • The algorithm for finding the best alignment is dynamic programming (DP).

A simple alignment problem • Problem: find the best pairwise alignment of GAATC and CATAC . • Use a linear gap penalty of -4. • Use the following substitution matrix: A C G T A 10 -5 0 -5 C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10

How many possibilities? GAATC GAAT-C -GAAT-C CATAC C-ATAC C-A-TAC GAATC- GAAT-C GA-ATC CA-TAC CA-TAC CATA-C • How many different possible alignments of two sequences of length n exist?

How many possibilities? GAATC GAAT-C -GAAT-C CATAC C-ATAC C-A-TAC GAATC- GAAT-C GA-ATC CA-TAC CA-TAC CATA-C • How many different alignments of two sequences of length n exist? 5 2.5x10 2 2 n 2 n ! 10 1.8x10 5 FYI for two sequences of length m and n, possible 20 1.4x10 11 2 alignments number: n n ! 30 1.2x10 17 40 1.1x10 23 mn ( mn )! 2n choose n - the 2 min( , ) m n (min( , )!) m n binomial coefficient

A C G T GA DP matrix A 10 -5 0 -5 CA C -5 10 -5 0 G 0 -5 10 -5 j 0 1 2 3 etc. T -5 0 -5 10 i G A A T C 0 C 1 A 5 2 T 3 A 4 Value at ( i,j ) will be the score of the best alignment of the first i characters 5 C of one sequence to the first j characters of the other sequence. init. row and column

A C G T DP matrix GAA A 10 -5 0 -5 C -5 10 -5 0 CA- G 0 -5 10 -5 T -5 0 -5 10 G A A T C C A 5 1 T Moving horizontally in the A matrix introduces a gap in the sequence along the left edge. C

A C G T GA- DP matrix A 10 -5 0 -5 CAT C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C Moving vertically in the matrix introduces a gap in C the sequence along the top edge. A 5 T 1 A C

A C G T GAA DP matrix A 10 -5 0 -5 CAT C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C Moving diagonally in the C matrix aligns two residues A 5 T 0 A C

A C G T Initialization A 10 -5 0 -5 Start at top C -5 10 -5 0 left and move G 0 -5 10 -5 progressively T -5 0 -5 10 G A A T C 0 C A T A C

A C G T G Introducing a gap A 10 -5 0 -5 - C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 C A T A C

A C G T - Introducing a gap A 10 -5 0 -5 C C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 C -4 A T A C

Complete first row A C G T ----- A 10 -5 0 -5 and column CATAC C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 A -8 T -12 A -16 C -20

A C G T Three ways to get G- A 10 -5 0 -5 to i=1 , j=1 C -5 10 -5 0 -C G 0 -5 10 -5 j 0 1 2 3 etc. T -5 0 -5 10 i G A A T C 0 -4 0 C -8 1 A 2 T 3 A 4 5 C

Three ways to get A C G T -G A 10 -5 0 -5 to i=1 , j=1 C- C -5 10 -5 0 G 0 -5 10 -5 j 0 1 2 3 etc. T -5 0 -5 10 i G A A T C 0 0 C -4 -8 1 A 2 T 3 A 4 5 C

Three ways to get A C G T G A 10 -5 0 -5 to i=1 , j=1 C C -5 10 -5 0 G 0 -5 10 -5 j 0 1 2 3 etc. T -5 0 -5 10 i G A A T C 0 0 C -5 1 A 2 T 3 A 4 5 C

A C G T Accept the highest scoring A 10 -5 0 -5 C -5 10 -5 0 of the three G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 A -8 T -12 A -16 C -20 Then simply repeat the same rule progressively across the matrix

A C G T DP matrix A 10 -5 0 -5 C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 A -8 ? T -12 A -16 C -20

A C G T -G G- --G DP matrix A 10 -5 0 -5 CA CA CA- C -5 10 -5 0 -4 G 0 -5 10 -5 -9 -12 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 0 -4 -4 A -8 ? T -12 A -16 C -20

A C G T -G G- --G DP matrix A 10 -5 0 -5 CA CA CA- C -5 10 -5 0 -4 G 0 -5 10 -5 -9 -12 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 0 -4 -4 A -8 -4 T -12 A -16 C -20

A C G T DP matrix A 10 -5 0 -5 C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 A -8 -4 T -12 ? A -16 ? C -20 ?

A C G T DP matrix A 10 -5 0 -5 C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 A -8 -4 T -12 -8 A -16 -12 C -20 -16

A C G T DP matrix A 10 -5 0 -5 C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 ? C -4 -5 A -8 -4 ? T -12 -8 ? A -16 -12 ? C -20 -16 ?

A C G T Traceback A 10 -5 0 -5 C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 -9 C -4 -5 A -8 -4 5 T -12 -8 1 What is the alignment associated with this entry? A -16 -12 2 C -20 -16 -2 (just follow the arrows back - this is called the traceback)

A C G T DP matrix A 10 -5 0 -5 C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 -9 C -4 -5 A -8 -4 5 -G-A T -12 -8 1 CATA A -16 -12 2 C -20 -16 -2

A C G T DP matrix A 10 -5 0 -5 C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 -9 A -8 -4 5 Continue and find the optimal global T -12 -8 1 alignment, and its score. A -16 -12 2 C -20 -16 -2 ?

A C G T Best alignment starts at bottom right A 10 -5 0 -5 and follows traceback arrows to top left C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 -9 -13 -12 -6 A -8 -4 5 1 -3 -7 T -12 -8 1 0 11 7 A -16 -12 2 11 7 6 C -20 -16 -2 7 11 17

A C G T GA-ATC One best traceback A 10 -5 0 -5 C -5 10 -5 0 CATA-C G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 -9 -13 -12 -6 A -8 -4 5 1 -3 -7 T -12 -8 1 0 11 7 A -16 -12 2 11 7 6 C -20 -16 -2 7 11 17

A C G T GAAT-C A 10 -5 0 -5 -CATAC Another best traceback C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 -9 -13 -12 -6 A -8 -4 5 1 -3 -7 T -12 -8 1 0 11 7 A -16 -12 2 11 7 6 C -20 -16 -2 7 11 17

A C G T GAAT-C GA-ATC A 10 -5 0 -5 -CATAC CATA-C C -5 10 -5 0 G 0 -5 10 -5 T -5 0 -5 10 G A A T C 0 -4 -8 -12 -16 -20 C -4 -5 -9 -13 -12 -6 A -8 -4 5 1 -3 -7 T -12 -8 1 0 11 7 A -16 -12 2 11 7 6 C -20 -16 -2 7 11 17

Multiple solutions GA-ATC • When a program returns a single CATA-C sequence alignment, it may not be the only best alignment but it GAAT-C is guaranteed to be one of them. CA-TAC GAAT-C • In our example, all of the C-ATAC alignments at the left have equal scores. GAAT-C -CATAC

DP in equation form • Align sequence x and y . • F is the DP matrix; s is the substitution matrix; d is the linear gap penalty. F 0 , 0 0 F i 1 , j 1 s x , y i j F i , j max F i 1 , j d F i , j 1 d

DP equation graphically F i , j 1 F i 1 j , 1 d s x i y , j F i 1 , j d F , i j take the max of these three

Dynamic programming • Yes, it’s a weird name. • DP is closely related to recursion and to mathematical induction. • We can prove that the resulting score is optimal.

What you should know • Scoring a pairwise alignment requires a substitution matrix and gap penalties. • Dynamic programming (DP) is an efficient algorithm for finding an optimal alignment. • Entry ( i,j ) in the DP matrix stores the score of the best-scoring alignment up to that position. • DP iteratively fills in the matrix using a simple mathematical rule. • How to use DP to find an alignment.

Practice problem: find a best pairwise alignment of GAATC and AATTC A C G T G A A T C A 10 -5 0 -5 C -5 10 -5 0 0 G 0 -5 10 -5 T -5 0 -5 10 A d = -4 A T T C

Sequence comparison: Dynamic programming Genome 559: Introduction - PowerPoint PPT Presentation

Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas http://faculty.washington.edu/jht/GS559_2013/ Sequence comparison overview Problem: Find the best

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1

61A Lecture 30 Announcements Efficient Sequence Processing Sequence Operations 4 Sequence

Introduction to sequence to sequence models N ATURAL LAN GUAGE GEN ERATION IN P YTH ON

Sequence-to-Sequence Learning with Neural Networks Ilya Sutskever, Oriol Vinyals, Quoc V. Le,

Assignment 3: Sequence Comparison Part 1: Running BLAST Step 1: Obtain Gene Sequence Obtain

Sequence comparison: Sequence comparison: Significance of alignment scores

Asynchronous sequence circuits An asynchronous sequence machine is a sequence circuit without

SEQ 3 : Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive

Sequence 7 January 2019 OSU CSE 1 Sequence The Sequence component family allows you to

Sequence-to-Sequence Models Can Directly Translate Foreign Speech Ron J. Weiss, Jan Chorowski ,

COVID-19 NEVADA UPDATE GOVERNOR STEVE SISOLAK April 14, 2020 1 I. Situation Report II.

Sequence Comparison: Dynamic Programming Genome 373 Genomic Informatics Elhanan Borenstein

Bioinformatics Algorithms (Fundamental Algorithms, module 2) Zsuzsanna Lipt ak Masters in

Compact Modelling Techniques in Thin Film SOI MOSFETs Benjamin Iiguez Denis Flandre*

Debug info in optimized code how far can we go? Improving LLVM debug info with function entry

RCRA Updates G. Alan Farmer, Director Resource Conservation and Restoration Division EPA, Region

2016-17 BOT Approved Operating Budget 2016-17 Percent Area Budget of Total Educational &

Carbon-Based Electronics: Will there be a carbon age to follow the silicon age? Jeffrey Bokor

Sambuz

Useful Links

Newsletter

Mail Us

Sequence comparison: Dynamic programming Genome 559: Introduction - PowerPoint PPT Presentation

Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas http://faculty.washington.edu/jht/GS559_2013/ Sequence comparison overview Problem: Find the best

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1

61A Lecture 30 Announcements Efficient Sequence Processing Sequence Operations 4 Sequence

Introduction to sequence to sequence models N ATURAL LAN GUAGE GEN ERATION IN P YTH ON

Sequence-to-Sequence Learning with Neural Networks Ilya Sutskever, Oriol Vinyals, Quoc V. Le,

Assignment 3: Sequence Comparison Part 1: Running BLAST Step 1: Obtain Gene Sequence Obtain

Sequence comparison: Sequence comparison: Significance of alignment scores

Asynchronous sequence circuits An asynchronous sequence machine is a sequence circuit without

SEQ 3 : Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive

Sequence 7 January 2019 OSU CSE 1 Sequence The Sequence component family allows you to

Sequence-to-Sequence Models Can Directly Translate Foreign Speech Ron J. Weiss, Jan Chorowski ,

COVID-19 NEVADA UPDATE GOVERNOR STEVE SISOLAK April 14, 2020 1 I. Situation Report II.

Sequence Comparison: Dynamic Programming Genome 373 Genomic Informatics Elhanan Borenstein

Bioinformatics Algorithms (Fundamental Algorithms, module 2) Zsuzsanna Lipt ak Masters in

Compact Modelling Techniques in Thin Film SOI MOSFETs Benjamin Iiguez Denis Flandre*

Debug info in optimized code how far can we go? Improving LLVM debug info with function entry

RCRA Updates G. Alan Farmer, Director Resource Conservation and Restoration Division EPA, Region

2016-17 BOT Approved Operating Budget 2016-17 Percent Area Budget of Total Educational &amp;

Carbon-Based Electronics: Will there be a carbon age to follow the silicon age? Jeffrey Bokor

Sambuz

Useful Links

Newsletter

Mail Us

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

2016-17 BOT Approved Operating Budget 2016-17 Percent Area Budget of Total Educational &