Sequence comparison: Score matrices http://faculty.washington.edu/jht/GS559_2014/ Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
FYI - informal inductive proof of best alignment path Consider the last step in the best alignment path to node α below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node α by three possible paths: an A-B match, a gap in sequence A or a gap in sequence B: seq A The best-scoring path to Y X α is the maximum of: match gap seq B X + match Y + gap Z Z + gap α gap BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED.
Local alignment - review A C G T A 2 -7 -5 -7 C -7 2 -7 -5 A A G G -5 -7 2 -7 T -7 -5 -7 2 0 0 0 0 d = -5 2 A 0 2 0 0 G 0 0 0 4 ( ) ( ) , − − − F i j 1 F i 1 , j 1 ( ) C 0 0 0 0 d s x i y , j ( ) ( ) − (no arrow means no preceding alignment) F i 1 , j d F , i j
Local alignment - review • Two differences from global alignment: – If a score is negative, replace with 0. – Traceback from the highest score in the matrix and continue until you reach 0. • Global alignment algorithm: Needleman- Wunsch . • Local alignment algorithm: Smith- Waterman .
dot plot of two DNA overlay of the global sequences DP alignment path
Protein score matrices • Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time. • All possible amino acid changes are represented (matrix of size at least 20 x 20). • Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence. • DNA score matrices are simpler (and conceptually similar).
BLOSUM62 Score Matrix regular 20 amino acids # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Cluster Percentage: >= 62 ambiguity codes and stop
Amino acid structures Hydrophobic Polar Charged phenylalanine F
BLOSUM62 Score Matrix Good scores – chemically similar Bad scores – chemically dissimilar
Amino acid structures Hydrophobic Polar Charged . C . . glycine G CH H . N N C SH + N . CH C . cysteine C C N histidine H CH . . . alanine A CH CH 3 C OH NH 3+ N N serine S CH . C C CH 3 N . . lysine K CH valine V C OH CH . N . CH threonine T N CH 3 . C NH N NH 2+ C CH 3 . arginine R CH C CH C C H 2 N leucine L N . tyrosine Y CH C OH O N CH 3 . . . N O - C CH 3 O C C . C NH 2 aspartate D isoleucine I CH CH . . CH 3 CH O - N N asparagine N . . NH 2 . N . C C glutamate E C CH C C S CH 3 methionine M CH O glutamine Q CH O N N . N C CH proline P N . H N . C tryptophan W CH N
Deriving BLOSUM scores • Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment). • Measure how often various amino acid pairs occur in the alignments. • Normalize this to the expected frequency of such pairs randomly in the same set of alignments. • Derive a log-odds score for aligned vs. random.
Example of alignment block (the BLO part of BLOSUM) 31 positions (columns) 61 sequences (rows) • Thousands of such blocks go into computing a single BLOSUM matrix. • Represent full diversity of sequences. • Results are summed over all columns of all blocks.
Pair frequency vs. expectation Sample column from an Actual aligned pair frequency: alignment block: = ∑ 1 this is called the sum D q c of pairs (the SUM E ij ij part of BLOSUM) T D etc. N where c is the count of pairs ij ij D and is the total pair count. T D 6 D-D pairs 4 D-E pairs Randomly expected pair frequency: 4 D-N pairs 1 E-N pair = e p p aa a a = + = 2 (a multiple alignment of N e p p p p p p ab a b b a a b sequences is the equivalent of all the where p and p are the overall probabilities pairwise alignments, a b which number (N)(N-1)/2 .) (frequencies) of specific residues and . a b
Log-odds score calculation (so adding scores == multiplying probabilities) q counted pair frequency = ij log s 2 ij e expected random pair frequency ij For computational speed often rounded to nearest integer and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a “half-bit” score: q ij = (rounded) 2log matrixScore 2 e ij (computers can add integers faster than floats)
BLOSUM62 matrix (half-bit scores) ( 9 half-bits = 4.5 bits ) Frequency of C residue over all proteins: 0.0162 (you have to look this up) Reverse calculation of aligned C-C pair frequency in BLOSUM data set: = ∗ = q ( ) = = e 0 . 0162 0 . 0162 0 . 000262 4 . 5 cc 2 22 . 63 C-C cc e cc = ∗ = thus q 22 . 63 0 . 000262 0 . 00594 cc (in words, C-C pairs are 22.6 times as frequent as you would expect)
Constructing Blocks • Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long. • Cluster the members of each block according to their percent identity. • Make pair counts and score matrix from a large collection of similarly clustered blocks. • Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).
Randomly Distributed Gaps = (probability of a gap at each position in the sequence) if p k g = = = then 2 n P ( g ) k , P ( g ) k ,..., P ( g ) k 1 2 n [note - the slope of the line in this plot will vary according to the frequency of gaps, but it will always be linear]
Distribution of real alignment gap lengths in a large set of X-ray structure-aligned proteins log-linear plot Nowhere near linear - hence the use of affine gap penalties (there ideally would be several levels of decreasing affine penalties)
What you should know • How a score matrix is derived • What the scores mean probablistically • Why gap penalties should be affine
Recommend
More recommend