CSE182-L5: Scoring matrices Dictionary Matching October 09 CSE 182

Expectation?
• Some quantities can be reasonably estimated from a statistical sample, others cannot.
  – Average weight of a group of 100 people
  – Average height of a group of 100 people
  – Average grade on a test
• Give an example of a quantity that cannot.
• When the distribution and its expectation are known, it is easy to recognize a significant observation.
• If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn.

Scoring proteins
• Scoring protein sequence alignments is a much more complex task than scoring DNA alignments.
  – Not all substitutions are equal.
• The problem was first worked on by Pauling and collaborators.
• In the 1970s, Margaret Dayhoff created the first similarity matrices.
  – "One size does not fit all."
  – Homologous proteins that are evolutionarily close should be scored differently from proteins that are evolutionarily distant.
  – Different proteins might evolve at different rates, and we need to normalize for that.

Frequency based scoring
• Our goal is to score each column of the alignment (residue A aligned with residue B).
• Comparing against expectation:
  – Think about alignments of pairs of random sequences, and compute the probability $P_R(A,B)$ that A and B appear together just by chance.
  – Compute the probability $P_O(A,B)$ of A and B appearing together in the alignment of related sequences (orthologs).
• A good score function? $\log \left( \frac{P_O(A,B)}{P_R(A,B)} \right)$
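A minimal sketch of the log-odds idea on this slide. The probabilities are made-up illustrative values, not measured frequencies, and base-10 logarithms are an assumption (the slide does not name a base here).

```python
import math

def log_odds(p_related, p_random):
    """Log of the ratio between observed and chance co-occurrence probability."""
    return math.log10(p_related / p_random)

# Hypothetical numbers, chosen only for illustration.
p_r = 0.005   # P_R(A,B): pair frequency in alignments of random sequences
p_o = 0.02    # P_O(A,B): pair frequency in alignments of related sequences

score_enriched = log_odds(p_o, p_r)     # pair seen more often than chance: positive
score_depleted = log_odds(0.001, p_r)   # pair seen less often than chance: negative
print(score_enriched, score_depleted)
```

A pair that occurs more often in related sequences than by chance gets a positive score, and a pair that occurs less often gets a negative one.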

Log-odds scoring
• The log-odds score makes sense.
• It is also sensitive to evolution.
• However, to compute a log-odds score function you need good alignments.
• To get good alignments of sequences, you need a (log-odds) score function.

PAM 1 distance
• Define: Two sequences are 1 PAM apart if they differ in 1% of the residues (1% mismatch).
• PAM1(a,b) = Pr[residue a substitutes residue b, when the sequences are 1 PAM apart]

PAM1 matrix
• Align many proteins that are very similar.
  – Is this a problem?
• 1 PAM evolutionary distance represents the time in which 1% of the residues have changed.
• Estimate the frequency $P_{b|a}$ of residue a being substituted by residue b.
• PAM1(a,b) = $P_{a|b}$ = Pr(b will mutate to an a after 1 PAM evolutionary distance)
• Scoring matrix
  – $S(a,b) = \log_{10} \frac{P_{ab}}{P_a P_b} = \log_{10} \frac{P_{b|a}}{P_b}$
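The two forms of the scoring formula agree because $P_{b|a} = P_{ab} / P_a$. The sketch below checks this numerically on a two-letter alphabet; the joint probabilities are hypothetical values picked for illustration, and the function names are not from the lecture.

```python
import math

# Hypothetical symmetric joint probabilities P_ab over a two-letter alphabet.
P = {("A", "A"): 0.55, ("A", "V"): 0.05,
     ("V", "A"): 0.05, ("V", "V"): 0.35}
letters = ["A", "V"]

# Marginal P_a = sum over b of P_ab.
P_marg = {a: sum(P[(a, b)] for b in letters) for a in letters}

def score_joint(a, b):
    """S(a,b) = log10(P_ab / (P_a * P_b))."""
    return math.log10(P[(a, b)] / (P_marg[a] * P_marg[b]))

def score_conditional(a, b):
    """S(a,b) = log10(P_{b|a} / P_b), with P_{b|a} = P_ab / P_a."""
    p_b_given_a = P[(a, b)] / P_marg[a]
    return math.log10(p_b_given_a / P_marg[b])

for a in letters:
    for b in letters:
        print(a, b, round(score_joint(a, b), 4), round(score_conditional(a, b), 4))
```

With these toy numbers the identical pair (A, A) scores above zero and the mismatched pair (A, V) scores below zero.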

PAM 1
• The top row shows the original residue, and the left column shows the replacement residue: each entry is PAM1(a,b) = Pr(a|b).

• For closely related sequences (1 PAM apart), we can make a set of alignments and use them to compute an appropriate evolutionary distance.
• What do we do for higher PAM sequences?

PAM distance
• Two sequences are 1 PAM apart when they differ in 1% of the residues.
• When are 2 sequences 2 PAMs apart?
[Figure: two successive 1 PAM steps spanning a 2 PAM distance]

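Generating Higher PAMs
• $\mathrm{PAM}_2(a,b) = \sum_c \mathrm{PAM}_1(a,c) \cdot \mathrm{PAM}_1(c,b)$
• $\mathrm{PAM}_2 = \mathrm{PAM}_1 \cdot \mathrm{PAM}_1$ (matrix multiplication)
• $\mathrm{PAM}_{250}$
  – $= \mathrm{PAM}_1 \cdot \mathrm{PAM}_{249}$
  – $= \mathrm{PAM}_1^{250}$
[Figure: a 2 PAM step between residues a and b decomposed into two 1 PAM steps through an intermediate residue c]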
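The matrix-power construction can be sketched in plain Python. The 3x3 matrix is a toy stand-in for the real 20x20 PAM1 (its values are invented), laid out so that entry [a][b] = Pr(b mutates to a) and every column sums to 1.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, n):
    """Return M multiplied by itself n times (n >= 1)."""
    result = M
    for _ in range(n - 1):
        result = mat_mul(M, result)
    return result

# Toy "PAM1" over three residues; hypothetical values, columns sum to 1.
pam1 = [[0.990, 0.004, 0.002],
        [0.006, 0.990, 0.008],
        [0.004, 0.006, 0.990]]

pam2 = mat_pow(pam1, 2)       # PAM2 = PAM1 * PAM1
pam250 = mat_pow(pam1, 250)   # PAM250 = PAM1^250
print(pam2[0][1], pam250[0][0])
```

Each power is still a matrix of conditional probabilities, so its columns continue to sum to 1, while the diagonal entries shrink as more substitutions accumulate.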

Note: This is not the score matrix. What happens as you keep increasing the power?
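One way to explore the question is to square a toy matrix repeatedly and watch how much its columns differ. The matrix below is invented for illustration (entry [a][b] = Pr(b mutates to a), columns summing to 1), and `column_spread` is a hypothetical helper, not part of the lecture.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def column_spread(M):
    """Largest gap between two entries of the same row, i.e. how far
    the columns of M are from being identical."""
    return max(max(row) - min(row) for row in M)

# Toy "PAM1" over three residues; hypothetical values, columns sum to 1.
pam1 = [[0.990, 0.004, 0.002],
        [0.006, 0.990, 0.008],
        [0.004, 0.006, 0.990]]

power = pam1
spreads = [column_spread(power)]
for _ in range(13):                 # 13 squarings: PAM1 -> PAM(2**13)
    power = mat_mul(power, power)
    spreads.append(column_spread(power))

print(spreads[0], spreads[8], spreads[-1])
```

In this toy case the columns become indistinguishable at high powers: the probability of ending at residue a no longer depends on the starting residue b. Under that reading, a very high PAM matrix carries almost no information about the original sequence.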

Scoring using PAM matrices
• Suppose we know that two sequences are 250 PAMs apart.
• $S(a,b) = \log_{10} \frac{P_{ab}}{P_a P_b} = \log_{10} \frac{P_{a|b}}{P_a} = \log_{10} \frac{\mathrm{PAM}_{250}(a,b)}{P_a}$
• How does it help?
  – $S_{250}(A,V) \gg S_1(A,V)$
  – Scoring of hum vs. dros should use a higher PAM matrix than scoring hum vs. mus.
  – An alignment with a smaller % identity could still have a higher score and be more significant.
[Figure: tree relating hum, mus, and dros]
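A small numerical sketch of why the mismatch score rises with PAM distance. It assumes a two-residue alphabet (A, V) with invented background frequencies and an invented PAM1 whose expected change is 1% per step; none of these numbers come from real data.

```python
import math

def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, n):
    """Return M multiplied by itself n times (n >= 1)."""
    result = M
    for _ in range(n - 1):
        result = mat_mul(M, result)
    return result

A, V = 0, 1
background = [0.6, 0.4]          # hypothetical P_A and P_V

# Toy PAM1: entry [a][b] = Pr(b mutates to a); columns sum to 1.
pam1 = [[1 - 1/120, 1/80],
        [1/120,     1 - 1/80]]

def score(pam_n, a, b):
    """S_n(a,b) = log10(PAM_n(a,b) / P_a)."""
    return math.log10(pam_n[a][b] / background[a])

pam250 = mat_pow(pam1, 250)
s1_av = score(pam1, A, V)        # strongly negative: A/V mismatch is rare at 1 PAM
s250_av = score(pam250, A, V)    # close to zero: mismatch is unsurprising at 250 PAM
print(s1_av, s250_av)
```

At 1 PAM an A/V mismatch is heavily penalized, while at 250 PAMs the same mismatch costs almost nothing, which is why distant pairs should be scored with a higher PAM matrix.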

PAM250 based scoring matrix
• $S_{250}(a,b) = \log_{10} \frac{P_{ab}}{P_a P_b} = \log_{10} \frac{\mathrm{PAM}_{250}(a,b)}{P_a}$

BLOSUM series of Matrices
• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions.
• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
• BLOSUM60: Merge all proteins that have greater than 60% identity. Then, compute the substitution probability.
  – In practice BLOSUM62 seems to work very well.
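A simplified sketch of the "compute the substitution probability" step on one ungapped block. The block is a made-up toy, the merging of sequences above the identity threshold is skipped, and scores are left as unrounded base-2 log-odds, so this only illustrates the counting idea rather than reproducing a published BLOSUM matrix.

```python
import math
from collections import Counter
from itertools import combinations

# Toy ungapped block: one aligned sequence per row (hypothetical data).
block = ["AVA", "AVA", "AIA", "VVA"]

# Count every unordered residue pair within each column.
pair_counts = Counter()
for column in zip(*block):
    for x, y in combinations(column, 2):
        pair_counts[tuple(sorted((x, y)))] += 1

total_pairs = sum(pair_counts.values())
q = {pair: n / total_pairs for pair, n in pair_counts.items()}   # observed pair frequency

# Residue frequency implied by the pairs: p_a = q_aa + sum over b != a of q_ab / 2.
p = Counter()
for (x, y), freq in q.items():
    if x == y:
        p[x] += freq
    else:
        p[x] += freq / 2
        p[y] += freq / 2

def blosum_like_score(x, y):
    """log2(observed / expected) for an unordered pair; None if never observed."""
    pair = tuple(sorted((x, y)))
    if pair not in q:
        return None
    expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
    return math.log2(q[pair] / expected)

print(blosum_like_score("A", "A"), blosum_like_score("A", "V"), blosum_like_score("I", "V"))
```

Pairs that never occur in the toy block return None here; a real matrix needs enough data (or some smoothing) to score every pair.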

PAM vs. BLOSUM
• What is the correspondence?
• PAM: PAM1, PAM2, PAM250
• BLOSUM: Blosum1, Blosum2, Blosum62, Blosum100

P-value computation
• BLAST: The matching regions are expanded into alignments, which are scored using Smith-Waterman (SW) and an appropriate scoring matrix.
• The results are presented in order of decreasing score.
• The score is just a number.
• How significant is the top-scoring hit if it has a score S?
• Expect/E-value (score S) = number of times we would expect to see a random query generate a score of S or better.
• How can we compute the E-value?
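Read literally, the definition suggests a brute-force estimate: run many random queries, record the scores of their hits, and average how many reach S or better. The sketch below does only that counting step on invented scores; `empirical_e_value` is a hypothetical helper and this is not how BLAST itself computes E-values.

```python
def empirical_e_value(hit_scores_per_random_query, s):
    """Average number of hits per random query with score >= s."""
    n_queries = len(hit_scores_per_random_query)
    n_hits = sum(1 for scores in hit_scores_per_random_query
                 for x in scores if x >= s)
    return n_hits / n_queries

# Hypothetical hit scores from four random queries against one database.
random_runs = [[12, 30, 8], [25, 41], [9], [33, 15, 27, 11]]

e_30 = empirical_e_value(random_runs, 30)   # three hits (30, 41, 33) over four queries
e_40 = empirical_e_value(random_runs, 40)   # one hit (41) over four queries
print(e_30, e_40)
```

The estimate drops as S rises, so a high-scoring hit is one that random queries rarely match.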

What is a distribution function?
• Given a collection of numbers (scores)
  – 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, …
• Plot its distribution as follows:
  – X-axis = each number
  – Y-axis = count/frequency/probability of seeing that number
  – More generally, the x-axis can be a range to accommodate real numbers.
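The plot described above can be sketched as a text histogram. It uses only the scores listed on the slide (the elided remainder is left out); `bin_counts` is a hypothetical helper showing the range-based x-axis for real-valued scores.

```python
from collections import Counter

# The scores listed on the slide (the elided tail is not reconstructed).
scores = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]

counts = Counter(scores)                     # y-axis as a count
total = len(scores)
distribution = {x: counts[x] / total for x in sorted(counts)}   # y-axis as a probability

for x, prob in distribution.items():
    print(f"{x}: {'#' * counts[x]} ({prob:.2f})")

def bin_counts(values, width):
    """Count values per half-open bin [k*width, (k+1)*width), keyed by the bin's left edge."""
    return Counter(int(v // width) * width for v in values)

print(bin_counts([0.2, 0.7, 1.1, 1.9], 1.0))
```

Dividing each count by the total turns the histogram into an empirical probability distribution whose values sum to 1.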

• End of L5
