Substitution Matrices Michael Schroeder Biotechnology Center TU Dresden
Contents § Why to compare and align sequences? § How to judge an alignment? § Z-score, E-value, P-value, structure and function § How to compare and align sequences? § Levensthein distance, scoring schemes, longest common subsequence, global and local alignment , substitution matrix , § How to compute an alignment? § Dynamic programming § How to compute an alignment fast? § Blast § How to align many sequences § Multiple sequence alignment, phylogenetic trees § Alignments and structure § How to predict protein structure from protein sequence 2
Scoring Scheme § Consider the alignments: A T G C TTATACGTAA A 2 -2 0 -2 TTA-CC-AAA T -2 2 -2 0 vs. G 0 -2 2 -2 TTATACGTAA C -2 0 -2 2 TTAC-CA-AA § with s m = 1, s r = s g = -1. § Are these alignments equally good? 3
Properties of Amino Acids Hydrophobic Polar A S T G Acidic P D N Q E I L Y V C K W H M R Basic F Aromatic 4
BLOSUM: Blocks Substitution Matrix Blosum62, Henikoff, 1992 5
How to choose data? § BLOSUM: Blocks Substitution matrix, Henikoff 1992 § BLOSUM62: Maximal sequence identity is 62% § 3000 Blocks of 800 families § > 2300 occurrences of each substitution § General problem: How to choose representative data? 6
Log Odds Ratio § How to define a score for mutation x-y : § Many substitutions x-y : high score x-y § Many occurrences of x : low score for x-y § Many occurrences of y : low score for x-y § Log odds ratio: P ( x, y ) log 2 P ( x ) P ( y ) § P ( x , y ) is probability for a substitution x ➞ y § P ( x , y ) = P ( y , x ) § P ( x ) is the probability for finding x § Where else could log odds ratios be useful? 7
How to Count Substitutions? ACGCTAFKI ACGCTAFKL ASGCTAFKL ACACTAFKL GCGCTAFKI GCGCTGFKI GCGCTLFKI How many A ➞ G? 8
Substitutions with and without Evolutionary Model § How many A ➞ G? ACGCTAFKI ACGCTAFKL ASGCTAFKL § BLOSUM: ACACTAFKL § 4*3+1*6+5*1 = 23 GCGCTAFKI GCGCTGFKI § PAM: 3 GCGCTLFKI 9
PAM: Evolutionary Model ACGCTAFKI ACGCTAFKL ASGCTAFKL ACACTAFKL GCGCTAFKI ACGCTAFKI GCGCTGFKI GCGCTLFKI A ➞ G I ➞ L GCGCTAFKI ACGCTAFKL A ➞ G A ➞ L C ➞ S G ➞ A GCGCTGFKI GCGCTLFKI ASGCTAFKL ACACTAFKL 10
How to Measure Quality of Substitution Matrices? 11
Styczynski 2008: Error in BLOSUM62! § But: Revised BLOSUM does worse! Red +1 of Blosum, green -1 12
Do Substitution Matrices Reflect Physicochemical Properties? 13
Needleman-Wunsch Algorithm Global alignment of string a to b Let a = a 1 . . . a m and b = b 1 . . . b n be strings. Then needle a,b = needle a,b ( m, n ) is the global alignment score of a and b , where 8 if j = 0 , is g > > > if i = 0 , js g > > > < 8 needle a,b ( i � 1 , j ) + s g needle a,b ( i, j ) = > < > needle a,b ( i, j � 1) + s g max otherwise, > > > > > > needle a,b ( i � 1 , j � 1) + s m ( ai = bj ) : : for 0 i m , 0 j n , gap penalty s g < 0 , mismatch penalty s r < 0 , match reward s m > 0 , and ( s m if ( a i = b j ) , s m ( ai = bj ) := if ( a i 6 = b j ) s r 14
Needleman-Wunsch Algorithm with Substitution Matrix Global alignment of string a to b Let a = a 1 . . . a m and b = b 1 . . . b n be strings. Then needle a,b = needle a,b ( m, n ) is the global alignment score of a and b with substitution matrix, where if j = 0 , is g if i = 0 , js g needle a,b ( i, j ) = needle a,b ( i − 1 , j ) + s g max needle a,b ( i, j − 1) + s g otherwise, needle a,b ( i − 1 , j − 1) + d s ( a i , b j ) for 0 ≤ i ≤ m and 0 ≤ j ≤ n , substitution matrix d s ( a i , b j ) , and gap penalty s g < 0 . 15
Global Alignment with Dynamic Programming needle(a,b): let d be a matrix of size m+1 × n+1 for 0 ≤ i ≤ m: d[i,0] = i * s g for 1 ≤ j ≤ n: d[0,j] = j * s g for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: if a i == b j : s = s m else: s = s r d[i,j] = max(d[i-1,j ] + s g , d[i ,j-1] + s g , d[i-1,j-1] + s) return d[m,n] 16
Global Alignment with Substitution Matrix and Dynamic Programming needle(a,b,d s ): let d be a matrix of size m+1 × n+1 for 0 ≤ i ≤ m: d[i,0] = i * s g for 1 ≤ j ≤ n: d[0,j] = j * s g for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: d[i,j] = max(d[i-1,j ] + s g , d[i ,j-1] + s g , d[i-1,j-1] + d s [a i ,b j ]) return d[m,n] 17
Global Alignment with SM and DP: Mark the Path needle(a,b,d s ): let d be the distance matrix of size m+1 × n+1 let d b be the backtracking matrix of size m+1 × n+1 d[0,0], d b [0,0] = 0, ∅ for 1 ≤ i ≤ m: d[i,0] = i * s g d b [i,0] = {'N'} for 1 ≤ j ≤ n: d[0,j] = j * s g d b [0,j] = {'W'} for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: d[i,j], d b [i,j] = max_dir(d[i-1,j ] + s g , d[i ,j-1] + s g , d[i-1,j-1] + d s [a i ,b j ]) return d[m,n], d b Legend: red=new code 18
Global Alignment with DP: Determine Directions max_dir(d N ,d W ,d NW ): dir = {} if d N == max(d N ,d W ,d NW ): dir.add('N') if d W == max(d N ,d W ,d NW ): dir.add('W') if d NW == max(d N ,d W ,d NW ): dir.add('NW') return max(d N ,d W ,d NW ), dir 19
Global Alignment with DP: Printing the Alignment print-needle(a,b,d b ): return print-needle(a,b,m,n,d b ) print-needle(a,b,i,j,d b ): if 'NW' in d b [i,j]: print-needle(a,i-1,j-1,d b ) print a i , b j else if 'W' in d b [i,j]: print-needle(a,i,j-1,d b ) print '-', b j else if 'N' in d b [i,j]: print-needle(a,i-1,j,d b ) print a i , '-' 20
Global Alignment with Dynamic Programming i \ j p e t e r p e d r o 21
Build a Substitution Matrix 22
Summary § BLOSUM (and PAM) § Log odds ratio § Role of evolutionary model § How to assess substitution matrices 23
Recommend
More recommend