Intro to Alignment Algorithms: Global and Local
Algorithmic Functions of Computational Biology, Professor Istrail
Sequence Comparison
Biomolecular sequences:
• DNA sequences (strings over the 4-letter alphabet {A, C, G, T})
• RNA sequences (strings over the 4-letter alphabet {A, C, G, U})
• Protein sequences (strings over the 20-letter alphabet of amino acids)
Sequence similarity helps in the discovery of genes and in the prediction of the structure and function of proteins.
The Basic Similarity Analysis Algorithm
Global Similarity
• Scoring Schemes
• Edit Graphs
• Alignment = Path in the Edit Graph
• The Principle of Optimality
• The Dynamic Programming Algorithm
• The Traceback
The Sequence Alignment Problem
Input: two sequences over the same alphabet and a scoring scheme
Output: an alignment of the two sequences of maximum score
Example:
GCGCATTTGAGCGA
TGCGTTAGGGTGACCA
A possible alignment:
-GCGCATTTGAGCGA---
TGCG--TTAGGGTGACCA
Each column is a match (e.g. G over G), a mismatch (e.g. T over A), or an indel (a letter over the gap symbol "-").
Mismatch, Deletion, Insertion
Mismatch:
TCAGGGGGCTATT
AGTCCTCCGATAA
Deletion (in template):
TCAGGGGGCTATT
AGTCC-CCGATAA
Insertion (in template):
TCAGGGGG-CTATT
AGTCCCCCCGATAA
CSCI2820 - Class 4
Consider two sequences
$X = x_1 x_2 \ldots x_n$
$Y = y_1 y_2 \ldots y_m$
where $x_i, y_j$ belong to $\Sigma$, over the alphabet $\Sigma = \{A, C, G, T\}$
Scoring Schemes
Unit-score: 1 for two identical letters, 0 for everything else (including any pairing with the gap symbol "-")

δ   A  C  G  T  -
A   1  0  0  0  0
C   0  1  0  0  0
G   0  0  1  0  0
T   0  0  0  1  0
-   0  0  0  0  0
Alignment
A C G
| | |
A G G
A is aligned with A, C is aligned with G, G is aligned with G.
Under the unit-score scheme: Score = δ(A,A) + δ(C,G) + δ(G,G) = 1 + 0 + 1 = 2
Gaps
"-" is the gap symbol.

Without gaps:     With gaps:
ACATGGAAT         ACATGG-AAT
ACAGGAAAT         ACA-GGAAAT
Score 7           Score 8

AAAGGG            ---AAAGGG
GGGAAA            GGGAAA---
Score 0           Score 3

The gapped alignments on the right are the optimal alignments.
δ(x,y) = the score for aligning x with y
δ(x,-) = the score for aligning x with -
δ(-,y) = the score for aligning - with y
Alignment
A-CG-G
ATCGTG
Score = δ(A,A) + δ(-,T) + δ(C,C) + δ(G,G) + δ(-,T) + δ(G,G)
The score of an alignment is the sum of the scores of the pairwise aligned symbols.
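The column-by-column sum can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the names `unit_score` and `alignment_score` are hypothetical, and the scoring function assumes the unit-score scheme in which any pairing with a gap scores 0.

```python
GAP = "-"

def unit_score(a, b):
    """Unit-score scheme (assumed): 1 for two identical letters, 0 otherwise."""
    return 1 if a == b and a != GAP else 0

def alignment_score(row_x, row_y, delta=unit_score):
    """Sum delta over the columns of an alignment given as two gapped rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    return sum(delta(a, b) for a, b in zip(row_x, row_y))

print(alignment_score("A-CG-G", "ATCGTG"))          # 4
print(alignment_score("ACATGG-AAT", "ACA-GGAAAT"))  # 8
```

Passing a different `delta` (for example a lookup into a PAM matrix) changes the scheme without touching the summation.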
Margaret Dayhoff & PAM Similarity Matrices
ARTEMIS Summer 2008, Professor Istrail
Dr. Margaret Oakley Dayhoff
The Mother & Father of Bioinformatics
Scoring Scheme
Dayhoff PAM scoring matrices
Aligning any residue with the gap symbol "-" scores -8. The first residue row of the matrix (the remaining rows are cut off on the slide, apart from the diagonal entries R/R = 6 and N/N = 4):

δ   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   3  -3   0   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0
...

Partial alignment for Monkey and Trout somatotropin proteins:
PTIPLSRLFDNAMLRAHRLHQ
SAIENQRLFNIAVSRVQHLHL
Scoring Functions
Mutations = substitutions, insertions, deletions
Scoring function = a sum of terms, one for each pair of aligned residues and one for each gap
Meaning = the log of the relative likelihood that the sequences are related, compared to being unrelated
Identities and conservative substitutions contribute positive terms
Non-conservative substitutions contribute negative terms
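One common way to write the log-likelihood reading down is sketched here. The symbols are notation assumed for illustration, not taken from the slides: $q_{ab}$ is the probability of seeing residues $a$ and $b$ aligned in related sequences, and $p_a$, $p_b$ are background residue frequencies.

```latex
\delta(a,b) = \log \frac{q_{ab}}{p_a \, p_b},
\qquad
\text{Score} = \sum_{\text{aligned pairs } (a,b)} \delta(a,b)
```

Under this reading, a pair seen more often in related sequences than by chance gets a positive term, and a pair seen less often gets a negative term. Gap terms are added separately.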
Global alignment problem
• Input: sequences X and Y of length n and m respectively, and a similarity matrix
• Output: an optimal global alignment of X and Y
  - Global alignments require that all bases in both sequences be aligned
Local alignment problem
• Input: sequences X and Y of length n and m respectively, and a similarity matrix
• Output: an optimal local alignment of X and Y
  - Local alignments do not require using all bases of either sequence in the alignment
• Applicable when looking for regions of similarity shared by the two sequences
The Edit Graph
Suppose that we want to align AGT with AT.
We are going to construct a graph in which alignments of the two sequences correspond to paths between the Begin and End nodes of the graph.
This is the Edit Graph.
The sequence AGT has length 3 (positions 0 to 3).
The sequence AT has length 2 (positions 0 to 2).
The Edit Graph has (3+1)*(2+1) = 12 nodes.
AGT indexes the columns, and AT indexes the rows of this "table".
[Figure: grid of nodes with columns 0 to 3 labeled A, G, T and rows 0 to 2 labeled A, T; Begin is the top-left node and End is the bottom-right node.]
The graph is directed. The nodes (i,j) will hold values.
[Figure: the edit graph grid for AGT and AT, drawn with directed edges.]
Directed edges get as labels pairs of aligned letters.
[Figure: the edit graph for AGT and AT with labeled edges. Horizontal edges pair a letter of AGT with a gap, vertical edges pair a gap with a letter of AT, and diagonal edges pair a letter of AGT with a letter of AT.]
Alignment = Path in the Edit Graph
[Figure: the labeled edit graph for AGT and AT, with the path for the following alignment highlighted.]
AGT
A-T
Every path from Begin to End corresponds to an alignment.
Every alignment corresponds to a path from Begin to End.
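The path/alignment correspondence can be checked by brute force on this small example. The sketch below is illustrative only (the function name `all_alignments` is hypothetical): each recursive call follows one outgoing edge of the edit graph, so every returned pair of gapped rows is one Begin-to-End path.

```python
def all_alignments(x, y):
    """Return every alignment of x and y as a (row_x, row_y) pair of gapped strings."""
    if not x and not y:
        return [("", "")]  # reached End: exactly one (empty) path remains
    results = []
    if x and y:  # diagonal edge: x[0] aligned with y[0]
        results += [(x[0] + a, y[0] + b) for a, b in all_alignments(x[1:], y[1:])]
    if x:        # gap edge: x[0] aligned with "-"
        results += [(x[0] + a, "-" + b) for a, b in all_alignments(x[1:], y)]
    if y:        # gap edge: "-" aligned with y[0]
        results += [("-" + a, y[0] + b) for a, b in all_alignments(x, y[1:])]
    return results

paths = all_alignments("AGT", "AT")
print((len("AGT") + 1) * (len("AT") + 1))  # 12 nodes in the edit graph
print(len(paths))                          # 25 paths, i.e. 25 alignments
print(("AGT", "A-T") in paths)             # True
```

The number of paths grows very quickly with sequence length, which is why the optimal path is found by dynamic programming rather than by enumeration.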
The Principle of Optimality
The optimal answer to a problem is expressed in terms of the optimal answers to its sub-problems.
Dynamic Programming
Given: two sequences X and Y
Find: an optimal alignment of X with Y
Part 1: Compute the optimal alignment score
Part 2: Construct an optimal alignment
We are looking for the optimal alignment = the maximum-score path in the Edit Graph from the Begin vertex to the End vertex.
The DP Matrix S(i,j)
[Figure: the grid for AGT (columns 0 to 3) and AT (rows 0 to 2), with the entries S(1,0) and S(2,1) marked as examples.]
The DP Matrix
Matrix $S = [S(i,j)]$
$S(i,j)$ = the score of the maximum-score path from the Begin vertex to the vertex $(i,j)$
The optimal path to $(i,j)$ must pass through one of the vertices $(i-1,j-1)$, $(i-1,j)$, $(i,j-1)$.
Optimal path through $(i-1,j)$
$S(i-1,j) + \delta(x_i, -)$ = score of the optimal path to $(i-1,j)$, plus the score of the edge into $(i,j)$ that aligns $x_i$ with a gap
Optimal path through $(i-1,j-1)$
$S(i-1,j-1) + \delta(x_i, y_j)$ = score of the optimal path to $(i-1,j-1)$, plus the score of the edge into $(i,j)$ that aligns $x_i$ with $y_j$
Optimal path through $(i,j-1)$
$S(i,j-1) + \delta(-, y_j)$ = score of the optimal path to $(i,j-1)$, plus the score of the edge into $(i,j)$ that aligns a gap with $y_j$
The Basic ALGORITHM
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \delta(x_i, y_j) \\ S(i-1,j) + \delta(x_i, -) \\ S(i,j-1) + \delta(-, y_j) \end{cases}$$
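A minimal Python sketch of this recurrence follows. It is an illustration under stated assumptions, not the lecture's implementation: the names `unit_score` and `global_score_matrix` are hypothetical, the first row and column are filled with accumulated gap scores, and the unit-score scheme (gaps score 0) is used as the default.

```python
def unit_score(a, b):
    """Unit-score scheme (assumed): 1 for two identical letters, 0 otherwise."""
    return 1 if a == b and a != "-" else 0

def global_score_matrix(x, y, delta=unit_score):
    """Fill S so that S[i][j] is the best score for aligning x[:i] with y[:j]."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: x[:i] against gaps only
        S[i][0] = S[i - 1][0] + delta(x[i - 1], "-")
    for j in range(1, m + 1):            # first row: gaps only against y[:j]
        S[0][j] = S[0][j - 1] + delta("-", y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                S[i - 1][j] + delta(x[i - 1], "-"),           # x_i aligned with a gap
                S[i][j - 1] + delta("-", y[j - 1]),           # gap aligned with y_j
            )
    return S

# Rows indexed by AT and columns by AGT, as in the edit graph of the slides.
for row in global_score_matrix("AT", "AGT"):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 1, 1]
# [0, 1, 1, 2]
```

The bottom-right entry, 2, is the optimal global alignment score. The table takes (n+1)(m+1) entries and constant work per entry.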
Optimal Alignment and Traceback
DP matrix under the unit-score scheme (AGT indexes the columns, AT indexes the rows):

       A  G  T
    0  0  0  0
A   0  1  1  1
T   0  1  1  2

Optimal alignment (score 2):
AGT
A-T
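The traceback can be sketched as a walk from the End entry back to Begin, at each step following an edge that explains the stored value. The code below is a hedged, self-contained illustration (the names `unit_score` and `global_align` are hypothetical). When several edges tie, it prefers the diagonal, then the gap in the second sequence; other tie-breaking rules return other, equally optimal alignments.

```python
def unit_score(a, b):
    """Unit-score scheme (assumed): 1 for two identical letters, 0 otherwise."""
    return 1 if a == b and a != "-" else 0

def global_align(x, y, delta=unit_score):
    """Return (score, row_x, row_y) for one optimal global alignment of x and y."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + delta(x[i - 1], "-")
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + delta("-", y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),
                          S[i - 1][j] + delta(x[i - 1], "-"),
                          S[i][j - 1] + delta("-", y[j - 1]))
    # Traceback: rebuild the columns from the last one to the first.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + delta(x[i - 1], "-"):
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:  # the only remaining possibility: a gap aligned with y_j
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))

print(global_align("AGT", "AT"))  # (2, 'AGT', 'A-T')
```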
The Basic ALGORITHM: Local Similarity
We add the term 0 to the maximum:
$$S(i,j) = \max \begin{cases} 0 \\ S(i-1,j-1) + \delta(x_i, y_j) \\ S(i-1,j) + \delta(x_i, -) \\ S(i,j-1) + \delta(-, y_j) \end{cases}$$
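The effect of the extra 0 is easiest to see with a scheme that has negative terms, since a path can then restart wherever its running score would drop below zero. The sketch below is illustrative only: the scheme (match +1, mismatch -1, gap -1) and the names `simple_score` and `local_score` are assumptions, not taken from the slides. Taking the best entry anywhere in the matrix, rather than the bottom-right one, is the usual companion to the 0 term and is also assumed here.

```python
def simple_score(a, b, match=1, mismatch=-1, gap=-1):
    """Hypothetical scheme with negative terms, chosen only for illustration."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def local_score(x, y, delta=simple_score):
    """Best score over all alignments of a substring of x with a substring of y."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                0,                                            # start a new local alignment here
                S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),
                S[i - 1][j] + delta(x[i - 1], "-"),
                S[i][j - 1] + delta("-", y[j - 1]),
            )
            best = max(best, S[i][j])
    return best

print(local_score("AAAGGG", "GGGAAA"))  # 3, from AAA aligned with AAA (or GGG with GGG)
```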
Protein global alignment
X = hlsek
Y = nlsak
• X and Y represent a protein subsequence from the BRCA2 (early onset) protein in human and chimpanzee
• Global alignments are used when the two sequences being compared represent a similar biological sequence
Margaret Dayhoff's PAM 100 similarity matrix (partial)

     A   N   E   H   L   K   S   *
A    4  -1   0  -3  -3  -3   1  -9
N   -1   5   1   2  -4   1   1  -9
E    0   1   5  -1  -5  -1  -1  -9
H   -3   2  -1   7  -3  -2  -2  -9
L   -3  -4  -5  -3   6  -4  -4  -9
K   -3   1  -1  -2  -4   5  -1  -9
S    1   1  -1  -2  -4  -1   4  -9
*   -9  -9  -9  -9  -9  -9  -9   1
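As a hedged sketch, the partial matrix above can be plugged into the global recurrence to score X = hlsek against Y = nlsak. Two things are assumptions here rather than statements from the slides: that "*" plays the role of the gap symbol (so each gap column costs -9), and the names `PAM100` and `global_score`, which are hypothetical.

```python
LETTERS = "ANEHLKS*"
ROWS = [
    [4, -1, 0, -3, -3, -3, 1, -9],      # A
    [-1, 5, 1, 2, -4, 1, 1, -9],        # N
    [0, 1, 5, -1, -5, -1, -1, -9],      # E
    [-3, 2, -1, 7, -3, -2, -2, -9],     # H
    [-3, -4, -5, -3, 6, -4, -4, -9],    # L
    [-3, 1, -1, -2, -4, 5, -1, -9],     # K
    [1, 1, -1, -2, -4, -1, 4, -9],      # S
    [-9, -9, -9, -9, -9, -9, -9, 1],    # * (assumed to be the gap symbol)
]
PAM100 = {(a, b): ROWS[i][j]
          for i, a in enumerate(LETTERS) for j, b in enumerate(LETTERS)}

def global_score(x, y, gap="*"):
    """Optimal global alignment score of x and y under the partial PAM 100 table."""
    x, y = x.upper(), y.upper()
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + PAM100[(x[i - 1], gap)]
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + PAM100[(gap, y[j - 1])]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + PAM100[(x[i - 1], y[j - 1])],
                          S[i - 1][j] + PAM100[(x[i - 1], gap)],
                          S[i][j - 1] + PAM100[(gap, y[j - 1])])
    return S[n][m]

print(global_score("hlsek", "nlsak"))  # 17 = 2 + 6 + 4 + 0 + 5, the gap-free alignment
```

With gaps this expensive, the optimum pairs the residues position by position: h/n, l/l, s/s, e/a, k/k.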