CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu edu/classes/fa05/cse182 /classes/fa05/cse182 www. www.cse cse. .ucsd ucsd. .edu edu/~ /~vbafna vbafna www. FA05 CSE182
Searching Sequence databases http://www.ncbi.nlm.nih.gov/BLAST/ FA05 CSE182
Query: >gi|26339572|dbj|BAC33457.1| unnamed protein product [Mus musculus] MSSTKLEDSLSRRNWSSASELNETQEPFLNPTDYDDEEFLRYLWREYLHPKEYEWVLIAGYIIVFVVA LIGNVLVCVAVWKNHHMRTVTNYFIVNLSLADVLVTITCLPATLVVDITETWFFGQSLCKVIPYLQTV SVSVSVLTLSCIALDRWYAICHPLMFKSTAKRARNSIVVIWIVSCIIMIPQAIVMECSSMLPGLANKT TLFTVCDEHWGGEVYPKMYHICFFLVTYMAPLCLMILAYLQIFRKLWCRQIPGTSSVVQRKWKQQQPV SQPRGSGQQSKARISAVAAEIKQIRARRKTARMLMVVLLVFAICYLPISILNVLKRVFGMFTHTEDRE TVYAWFTFSHWLVYANSAANPIIYNFLSGKFREEFKAAFSCCLGVHHRQGDRLARGRTSTESRKSLTT QISNFDNVSKLSEHVVLTSISTLPAANGAGPLQNWYLQQGVPSSLLSTWLEV • What is the function of this sequence? • Is there a human homolog? • Which organelle does it work in? (Secreted/membrane bound) • Idea: Search a database of known proteins to see if you can find similar sequences which have a known function FA05 CSE182
Querying with Blast FA05 CSE182
Blast Output • The output (Blastp query) is a series of protein sequences, ranked according to similarity with the query • Each database hit is aligned to a subsequence of the query FA05 CSE182
Blast Output Schematic 26 422 query db 19 405 FA05 CSE182
Blast Output S Id Q beg S beg Q end S end FA05 CSE182
Interpreting Blast results • How do we interpret these results? – Similar sequence in the 3 species implies that the common ancestor of the 3 had that sequence. – The sequence accumulates mutations over time. These mutations may be indels, or substitutions. – Hum and mus diverged more recently and so the sequences are more likely to be similar. hum hummus? mus dros FA05 CSE182
How do we measure similarity between sequences? • Percent identity? – Hard to compute without indels. A T C A A T C G A T C A A T C G T C A A T G G T T C A A T G G T Number of sequence edit operations? Number of sequence edit operations? Implies a notion of alignment. Implies a notion of alignment. FA05 CSE182
Computing alignments • What is an alignment? 2Xm table. • Each sequence is a row, with interspersed gaps • Columns describe the edit operations • A A - T C G G A A C T C G - A What is the score of an alignment? • What is the score of an alignment? • Score every column, and sum up the scores. Let C be the score • Score every column, and sum up the scores. Let C be the score • function for the column function for the column How do we compute the alignment with the best score? • How do we compute the alignment with the best score? • FA05 CSE182
Optimum scoring alignments, and score of optimum alignment • Instead of computing an optimum scoring alignment, we attempt to compute the score of an optimal alignment. • Later, we will show that the two are equivalent FA05 CSE182
Computing Optimal Alignment score 1 1 2 k t s • Observations: The optimum alignment has nice recursive properties: – The alignment score is the sum of the scores of columns. – If we break off at cell k, the left part and right part must be optimal sub-alignments. – The left part contains prefixes s[1..i], and t[1..j] FA05 CSE182
Optimum prefix alignments 1 k s t • Consider an optimum alignment of the prefix s[1..i], and t[1..j] • Look at the last cell. It can only have 3 possibilities. FA05 CSE182
3 possibilities for rightmost cell Optimum alignment of s[1..i-1], and t[1..j-1] s[i] 1. s[i] is aligned to t[j] t[j] Optimum alignment of s[1..i-1], and t[1..j] 2. s[i] is aligned to ‘-’ s[i] 3. t[j] is aligned to ‘-’ t[j] Optimum alignment of s[1..i], and t[1..j-1] FA05 CSE182
Optimal score of an alignment Optimum alignment of s[1..i-1], and t[1..j-1] s[i] S[i,j] = C(s i ,t j )+S(i-1,j-1) t[j] Optimum alignment of s[1..i-1], and t[1..j] s[i] S[i,j] = C(s i ,-)+S(i-1,j) - Optimum alignment of s[1..i], and t[1..j-1] - S[i,j] = C(-,t j )+S(i,j-1) t[j] • Let S[i,j] be the score of an optimal alignment of the prefix s[1..i], and t[1..j]. It must be one of 3 possibilities. FA05 CSE182
Optimal alignment score S [ i − 1, j − 1] + C ( s i , t j ) S [ i , j ] = max S [ i − 1, j ] + C ( s i , − ) S [ i , j − 1] + C ( − , t j ) • Which prefix pairs (i,j) should we use? For now, simply use all. • If the strings are of length m, and n, respectively, what is the score of the optimal alignment? FA05 CSE182
An O(nm) algorithm for score computation For i = 1 to n For j = 1 to m S [ i − 1, j − 1] + C ( s i , t j ) S [ i , j ] = max S [ i − 1, j ] + C ( s i , − ) S [ i , j − 1] + C ( − , t j ) • The iteration ensures that all values on the right are computed in earlier steps. FA05 CSE182
Initialization S [0,0] = 0 S [ i ,0] = C ( s i , − ) + S [ i − 1,0] ∀ i S [0, j ] = C ( − , s j ) + S [0, j − 1] ∀ j FA05 CSE182
A tableaux approach t 1 j n 1 s M S[i-1,j] S[i-1,j-1] S [ i − 1, j − 1] + C ( s i , t j ) S [ i , j ] = max S [ i − 1, j ] + C ( s i , − ) S [ i , j − 1] + C ( − , t j ) i S[i,j-1] S[i,j] n Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells FA05 CSE182
An Example T C A T - T C A T T G C A A T G C A A A1 A2 • Align s=TCAT with t=TGCAA • Match Score = 1 • Mismatch score = -1, Indel Score = -1 • Score A1?, Score A2? FA05 CSE182
Alignment Table T G C A A 0 -1 -2 -3 -4 -5 -1 1 0 -1 -2 -3 T -2 0 0 1 0 -1 C -3 -1 -1 0 2 1 A -4 -2 -2 -1 1 1 T FA05 CSE182
Alignment Table • S[4,5] = 1 is the score of an optimum T G C A A alignment 0 -1 -2 -3 -4 -5 • Therefore, A2 is an -1 1 0 -1 -2 -3 optimum alignment T • We know how to -2 0 0 1 0 -1 C obtain the optimum -3 -1 -1 0 2 1 A Score. How do we get -4 -2 -2 -1 1 1 the best alignment? T FA05 CSE182
Computing Optimum Alignment • At each cell, we have 3 choices • We maintain additional information to record the choice at each step. For i = 1 to n For j = 1 to m S [ i − 1, j − 1] + C ( s i , t j ) S [ i , j ] = max S [ i − 1, j ] + C ( s i , − ) S [ i , j − 1] + C ( − , t j ) j-1 j If (S[i,j]= S[i-1,j-1] + C(s i ,t j )) M[i,j] = i-1 If (S[i,j]= S[i-1,j] + C(s i ,-)) M[i,j] = If (S[i,j]= S[i,j-1] + C(-,t j ) ) M[i,j] = i FA05 CSE182
Computing Optimal Alignments T G C A A 0 -1 -2 -3 -4 -5 -1 1 0 -1 -2 -3 T -2 0 0 1 0 -1 C -3 -1 -1 0 2 1 A -4 -2 -2 -1 1 1 T FA05 CSE182
Retrieving Opt.Alignment • M[4,5]= 1 2 3 4 5 Implies that T G C A A S[4,5]=S[3,4]+C( A,T ) 0 -1 -2 -3 -4 -5 or A -1 1 0 -1 -2 -3 1 T T -2 0 0 1 0 -1 2 C M[3,4]= -3 -1 -1 0 2 1 3 A Implies that S[3,4]=S[2,3] +C( A,A ) -4 -2 -2 -1 1 1 4 T or A A A T FA05 CSE182
Retrieving Opt.Alignment • M[2,3]= Implies that T G C A A S[2,3]=S[1,2]+C( C,C ) 0 -1 -2 -3 -4 -5 or C A A -1 1 0 -1 -2 -3 T C A T -2 0 0 1 0 -1 C M[1,2]= -3 -1 -1 0 2 1 A Implies that S[1,2]=S[1,1] +C (-,G ) -4 -2 -2 -1 1 1 T or T - C A A T G C A T FA05 CSE182
Algorithm to retrieve optimal alignment RetrieveAl(i,j) if (M[i,j] == `\’) s i t j return (RetrieveAl (i-1,j-1) . ) else if (M[i,j] == `|’) s i - return (RetrieveAl (i-1,j) . ) else if (M[i,j] == `--’ ’) ) else if (M[i,j] == `-- - return (RetrieveAl RetrieveAl (i,j-1) . (i,j-1) . ) ) return ( t j FA05 CSE182
Summary • An optimal alignment of strings of length n and m can be computed in O(nm) time • There is a tight connection between computation of optimal score, and computation of opt. Alignment – True for all DP based solutions FA05 CSE182
Global versus Local Alignment Consider s = ACCACCCCTT t = ATCCCCACAT ACCACCCCTT ACCACCCCT T ACCACCCCTT ACCACCCCT T ATCCCCACAT A TCCCCACAT ATCCCCACAT A TCCCCACAT FA05 CSE182
Blast Outputs Local Alignment Schematic 26 422 query db 19 405 FA05 CSE182
Local Alignment • Compute maximum scoring interval over all sub-intervals (a,b), and (a’,b’) a • How can we compute this efficiently? b a’ b’ FA05 CSE182
Local Alignment Trick 0 S [ i − 1, j − 1] + C ( s i , t j ) S [ i , j ] = max S [ i − 1, j ] + C ( s i , − ) S [ i , j − 1] + C ( − , t j ) How can we compute the local alignment itself? FA05 CSE182
Generalizing Gap Cost • It is more likely for gaps to be contiguous • The penalty for a gap of length l should be go + ge ∗ l FA05 CSE182
Recommend
More recommend