Sequence comparison: Score matrices - PowerPoint PPT Presentation

Sequence comparison: Score matrices http://faculty.washington.edu/jht/GS559_2014/ Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

FYI - informal inductive proof of best alignment path Consider the last step in the best alignment path to node α below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node α by three possible paths: an A-B match, a gap in sequence A or a gap in sequence B: seq A The best-scoring path to Y X α is the maximum of: match gap seq B X + match Y + gap Z Z + gap α gap BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED.

Local alignment - review A C G T A 2 -7 -5 -7 C -7 2 -7 -5 A A G G -5 -7 2 -7 T -7 -5 -7 2 0 0 0 0 d = -5 2 A 0 2 0 0 G 0 0 0 4 ( ) ( ) , − − − F i j 1 F i 1 , j 1 ( ) C 0 0 0 0 d s x i y , j ( ) ( ) − (no arrow means no preceding alignment) F i 1 , j d F , i j

Local alignment - review • Two differences from global alignment: – If a score is negative, replace with 0. – Traceback from the highest score in the matrix and continue until you reach 0. • Global alignment algorithm: Needleman- Wunsch . • Local alignment algorithm: Smith- Waterman .

dot plot of two DNA overlay of the global sequences DP alignment path

Protein score matrices • Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time. • All possible amino acid changes are represented (matrix of size at least 20 x 20). • Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence. • DNA score matrices are simpler (and conceptually similar).

BLOSUM62 Score Matrix regular 20 amino acids # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Cluster Percentage: >= 62 ambiguity codes and stop

Amino acid structures Hydrophobic Polar Charged phenylalanine F

BLOSUM62 Score Matrix Good scores – chemically similar Bad scores – chemically dissimilar

Amino acid structures Hydrophobic Polar Charged . C . . glycine G CH H . N N C SH + N . CH C . cysteine C C N histidine H CH . . . alanine A CH CH 3 C OH NH 3+ N N serine S CH . C C CH 3 N . . lysine K CH valine V C OH CH . N . CH threonine T N CH 3 . C NH N NH 2+ C CH 3 . arginine R CH C CH C C H 2 N leucine L N . tyrosine Y CH C OH O N CH 3 . . . N O - C CH 3 O C C . C NH 2 aspartate D isoleucine I CH CH . . CH 3 CH O - N N asparagine N . . NH 2 . N . C C glutamate E C CH C C S CH 3 methionine M CH O glutamine Q CH O N N . N C CH proline P N . H N . C tryptophan W CH N

Deriving BLOSUM scores • Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment). • Measure how often various amino acid pairs occur in the alignments. • Normalize this to the expected frequency of such pairs randomly in the same set of alignments. • Derive a log-odds score for aligned vs. random.

Example of alignment block (the BLO part of BLOSUM) 31 positions (columns) 61 sequences (rows) • Thousands of such blocks go into computing a single BLOSUM matrix. • Represent full diversity of sequences. • Results are summed over all columns of all blocks.

Pair frequency vs. expectation Sample column from an Actual aligned pair frequency: alignment block: = ∑ 1 this is called the sum D q c of pairs (the SUM E ij ij part of BLOSUM) T D etc. N where c is the count of pairs ij ij D and is the total pair count. T D 6 D-D pairs 4 D-E pairs Randomly expected pair frequency: 4 D-N pairs 1 E-N pair = e p p aa a a = + = 2 (a multiple alignment of N e p p p p p p ab a b b a a b sequences is the equivalent of all the where p and p are the overall probabilities pairwise alignments, a b which number (N)(N-1)/2 .) (frequencies) of specific residues and . a b

Log-odds score calculation (so adding scores == multiplying probabilities) q counted pair frequency = ij log s 2 ij e expected random pair frequency ij For computational speed often rounded to nearest integer and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a “half-bit” score: q ij = (rounded) 2log matrixScore 2 e ij (computers can add integers faster than floats)

BLOSUM62 matrix (half-bit scores) ( 9 half-bits = 4.5 bits ) Frequency of C residue over all proteins: 0.0162 (you have to look this up) Reverse calculation of aligned C-C pair frequency in BLOSUM data set: = ∗ = q ( ) = = e 0 . 0162 0 . 0162 0 . 000262 4 . 5 cc 2 22 . 63 C-C cc e cc = ∗ = thus q 22 . 63 0 . 000262 0 . 00594 cc (in words, C-C pairs are 22.6 times as frequent as you would expect)

Constructing Blocks • Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long. • Cluster the members of each block according to their percent identity. • Make pair counts and score matrix from a large collection of similarly clustered blocks. • Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).

Randomly Distributed Gaps = (probability of a gap at each position in the sequence) if p k g = = = then 2 n P ( g ) k , P ( g ) k ,..., P ( g ) k 1 2 n [note - the slope of the line in this plot will vary according to the frequency of gaps, but it will always be linear]

Distribution of real alignment gap lengths in a large set of X-ray structure-aligned proteins log-linear plot Nowhere near linear - hence the use of affine gap penalties (there ideally would be several levels of decreasing affine penalties)

What you should know • How a score matrix is derived • What the scores mean probablistically • Why gap penalties should be affine

Sequence comparison: Score matrices - PowerPoint PPT Presentation

Sequence comparison: Score matrices http://faculty.washington.edu/jht/GS559_2014/ Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas FYI - informal inductive proof of best alignment path Consider the last

Results for different matrices and comparisons Dense Matrices Rectangular Matrices

MATHEMATICS 1 CONTENTS Matrices Special matrices Operations with matrices Matrix

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

MARC Fall Meeting 09/24/17 MARC Fall Meeting 09/24/17 SCORE Presentation SCORE

Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment A teacher's

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices & quadratic forms)

One Step Mutation (OSM) matrices joint work with Sequence Evolution 1 Sequence Evolution

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1

61A Lecture 30 Announcements Efficient Sequence Processing Sequence Operations 4 Sequence

Joint spectral radius Constrained matrix products Victor Kozyakin kozyakin@iitp.ru Institute

AMath 483/583 Lecture 22 Outline: MPI MasterWorker paradigm Linear algebra

( ) = 2 I i + 1 ( ) B X " ; I i # I f | " i > 2 L + 1 & ) ) = L [(2 L + 1)!!] 2

Eigencircles of 2 2 matrices Graham Farr Faculty of IT Monash University

Optimization and Dynamical Systems: Variational, Hamiltonian, and Symplectic Perspectives

Improving Word Alignment With Bridge Languages Shankar Kumar Joint Work with Franz Och and

Congruences for Fishburn numbers modulo prime powers Partitions, q -series, and modular forms AMS

Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm Algorithm

Sambuz

Useful Links

Newsletter

Mail Us

Sequence comparison: Score matrices - PowerPoint PPT Presentation

Sequence comparison: Score matrices http://faculty.washington.edu/jht/GS559_2014/ Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas FYI - informal inductive proof of best alignment path Consider the last

Results for different matrices and comparisons Dense Matrices Rectangular Matrices

MATHEMATICS 1 CONTENTS Matrices Special matrices Operations with matrices Matrix

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

MARC Fall Meeting 09/24/17 MARC Fall Meeting 09/24/17 SCORE Presentation SCORE

Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment A teacher's

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices &amp; quadratic forms)

One Step Mutation (OSM) matrices joint work with Sequence Evolution 1 Sequence Evolution

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1

61A Lecture 30 Announcements Efficient Sequence Processing Sequence Operations 4 Sequence

Joint spectral radius Constrained matrix products Victor Kozyakin kozyakin@iitp.ru Institute

AMath 483/583 Lecture 22 Outline: MPI MasterWorker paradigm Linear algebra

( ) = 2 I i + 1 ( ) B X &quot; ; I i # I f | &quot; i &gt; 2 L + 1 &amp; ) ) = L [(2 L + 1)!!] 2

Eigencircles of 2 2 matrices Graham Farr Faculty of IT Monash University

Optimization and Dynamical Systems: Variational, Hamiltonian, and Symplectic Perspectives

Improving Word Alignment With Bridge Languages Shankar Kumar Joint Work with Franz Och and

Congruences for Fishburn numbers modulo prime powers Partitions, q -series, and modular forms AMS

Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm Algorithm

Sambuz

Useful Links

Newsletter

Mail Us

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices & quadratic forms)

( ) = 2 I i + 1 ( ) B X " ; I i # I f | " i > 2 L + 1 & ) ) = L [(2 L + 1)!!] 2