Sequence comparison: Score matrices Genome 559: Introduction to - PowerPoint PPT Presentation

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

FYI - informal inductive proof of best alignment path Consider the last step in the best alignment path to node a below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node a by three possible paths: an A-B match, a gap in sequence A or a gap in sequence B: seq A The best-scoring path to Y X a is the maximum of: match gap seq B X + match Y + gap Z Z + gap a gap BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED.

Local alignment - review A C G T A 2 -7 -5 -7 C -7 2 -7 -5 A A G G -5 -7 2 -7 T -7 -5 -7 2 0 0 0 0 d = -5 2 A 0 2 0 0 G 0 0 0 4     ,  1   F i j F i 1 , j 1   C 0 0 0 0 d s x i y , j      (no arrow means no preceding alignment) F i 1 , j d F , i j

Local alignment • Two differences from global alignment: – If a score is negative, replace with 0. – Traceback from the highest score in the matrix and continue until you reach 0. • Global alignment algorithm: Needleman- Wunsch . • Local alignment algorithm: Smith- Waterman .

dot plot of two DNA overlay of the global sequences DP alignment path

Protein score matrices • Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time. • All possible amino acid changes are represented (matrix of size at least 20 x 20). • Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence. • DNA score matrices are much simpler (and conceptually similar).

BLOSUM62 Score Matrix ambiguity codes and stop # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Cluster Percentage: >= 62 regular 20 amino acids

Amino acid structures Hydrophobic Polar Charged phenylalanine F

BLOSUM62 Score Matrix Good scores – chemically similar Bad scores – chemically dissimilar

Amino acid structures Hydrophobic Polar Charged . C . . glycine G CH H . N N C SH + . N CH C . cysteine C C N histidine H CH . . . alanine A CH CH 3 C OH NH 3+ N N serine S CH . C C CH 3 N . . lysine K CH valine V C OH CH . N . CH threonine T N CH 3 . C NH NH 2+ N C CH 3 arginine R CH . C CH C C H 2 N leucine L N . tyrosine Y CH C OH O N CH 3 . . . N O - C CH 3 O C C . C NH 2 aspartate D isoleucine I CH CH . . CH 3 CH O - N asparagine N . N . . NH 2 N . C C glutamate E C CH C C S CH 3 methionine M CH O glutamine Q CH O N N . N C CH proline P N . H N . C tryptophan W CH N

Deriving BLOSUM scores • Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment). • Measure how often various amino acid pairs occur in the alignments. • Normalize this to the expected frequency of such pairs randomly in the same set of alignments. • Derive a log-odds score for aligned vs. random.

Example of alignment block (the BLO part of BLOSUM) 31 positions (columns) 61 sequences (rows) • Thousands of such blocks go into computing a single BLOSUM matrix. • Represent full diversity of sequences. • Results are summed over all columns of all blocks.

Pair frequency vs. expectation Sample column from an Actual aligned pair frequency: alignment block:   1 this is called the sum D q c of pairs (the SUM E ij ij part of BLOSUM) T D etc. N where c is the count of pairs ij D ij and is the total pair count. T D 6 D-D pairs 4 D-E pairs Randomly expected pair frequency: 4 D-N pairs 1 E-N pair  e p p aa a a    e p p p p 2 p p (a multiple alignment of N ab a b b a a b sequences is the equivalent of all the where p and p are the overall probabilities pairwise alignments, a b which number (N)(N-1)/2 .) (frequencies) of specific residues and . a b

Log-odds score calculation (so adding scores == multiplying probabilities) q  ij s log 2 ij e ij For computational speed often rounded to nearest integer and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a “half - bit” score: q  ij (rounded) 2log matrixScore 2 e ij (computers can add integers faster than floats)

BLOSUM62 matrix (half-bit scores) ( 9 half-bits = 4.5 bits ) Frequency of C residue over all proteins: 0.0162 (you have to look this up) Reverse calculation of aligned C-C pair frequency in BLOSUM data set:    q     e 0 . 0162 0 . 0162 0 . 000262 C-C 4 . 5 cc 2 22 . 63 cc e cc    thus q 22 . 63 0 . 000262 0 . 00594 cc

Constructing Blocks • Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long. • Cluster the members of each block according to their percent identity. • Make pair counts and score matrix from a large collection of similarly clustered blocks. • Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).

Probabilistic Interpretation of Scores (ungapped) q ij  (BLOSUM62) (rounded) 2log matrixScore 2 e ij • By converting scores back to probabilities, we can give a probabilistic interpretation to an alignment score. • this alignment has a score of 16 (6+2+1+7) by FIAP BLOSUM 62, meaning an alignment with this score FLSP or more is 2 8 (256) times more likely than expected from a random alignment. • this 15 amino acid alignment has a VHRDLKPENLLLASK score of 75, meaning that it is ~ 10 11 VHRDLKPENLLLASK times more likely to be seen in a real (4+8+5+6+4+5+7+5+6+4+4+4+4+4+5) alignment than in a random alignment(!!).

Randomly Distributed Gaps  (probability of a gap at each position in the sequence) if p k g    then 2 n P ( g ) k , P ( g ) k ,..., P ( g ) k 1 2 n [note - the slope of the line on a log-linear plot will vary according to the frequency of gaps, but it will always be linear]

Distribution of real alignment gap lengths in large set of structurally-aligned proteins log-linear plot Nowhere near linear - hence the use of affine gap penalties (there ideally would be several levels of decreasing affine penalties)

Summary • How a score matrix is derived • What the scores mean probablistically • Why gap penalties should be affine

Sequence comparison: Score matrices Genome 559: Introduction to - PowerPoint PPT Presentation

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas FYI - informal inductive proof of best alignment path Consider the last step in the best alignment path to node a

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1

61A Lecture 30 Announcements Efficient Sequence Processing Sequence Operations 4 Sequence

Introduction to sequence to sequence models N ATURAL LAN GUAGE GEN ERATION IN P YTH ON

Sequence-to-Sequence Learning with Neural Networks Ilya Sutskever, Oriol Vinyals, Quoc V. Le,

Assignment 3: Sequence Comparison Part 1: Running BLAST Step 1: Obtain Gene Sequence Obtain

Sequence comparison: Sequence comparison: Significance of alignment scores

Asynchronous sequence circuits An asynchronous sequence machine is a sequence circuit without

SEQ 3 : Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive

Sequence 7 January 2019 OSU CSE 1 Sequence The Sequence component family allows you to

Sequence-to-Sequence Models Can Directly Translate Foreign Speech Ron J. Weiss, Jan Chorowski ,

Schizophrenia Genetics Quest for the Holy Grail Nancy Buccola MSN, APRN, PMHCNS, CNE Louisiana

2/16/2014 How do you use ketamine in your practice of emergency medicine? Low Dose

Applying Pharmacy Scientific Principles to the Laws Associated with Synthetic Drug of Abuse O N

VALUES' ( INTEGRITY& HONESTY&& ! HAPPINESS(is(based(on(circumstances((EXTERNAL)( (

Origins: Is It Reasonable To Believe in God in This Scientific Age? I s It Reasonable to Believe

Enhanced Sampling and Free Energy Applications in Biomolecular Modeling Emad Tajkhorshid NIH

Processing Heterogeneity Nikolaus Grigorieff Larson, The Far Side Heterogeneity and Biology

RAMADAN IMPLICATIONS FOR YOUTH ATHLETES Nur Adilah NYSI Performance Analyst The Ramadan fast

Sequence comparison: Score matrices Genome 559: Introduction to - PowerPoint PPT Presentation

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas FYI - informal inductive proof of best alignment path Consider the last step in the best alignment path to node a

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1

61A Lecture 30 Announcements Efficient Sequence Processing Sequence Operations 4 Sequence

Introduction to sequence to sequence models N ATURAL LAN GUAGE GEN ERATION IN P YTH ON

Sequence-to-Sequence Learning with Neural Networks Ilya Sutskever, Oriol Vinyals, Quoc V. Le,

Assignment 3: Sequence Comparison Part 1: Running BLAST Step 1: Obtain Gene Sequence Obtain

Sequence comparison: Sequence comparison: Significance of alignment scores

Asynchronous sequence circuits An asynchronous sequence machine is a sequence circuit without

SEQ 3 : Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive

Sequence 7 January 2019 OSU CSE 1 Sequence The Sequence component family allows you to

Sequence-to-Sequence Models Can Directly Translate Foreign Speech Ron J. Weiss, Jan Chorowski ,

Schizophrenia Genetics Quest for the Holy Grail Nancy Buccola MSN, APRN, PMHCNS, CNE Louisiana

2/16/2014 How do you use ketamine in your practice of emergency medicine? Low Dose

Applying Pharmacy Scientific Principles to the Laws Associated with Synthetic Drug of Abuse O N

VALUES' ( INTEGRITY&amp; HONESTY&amp;&amp; ! HAPPINESS(is(based(on(circumstances((EXTERNAL)( (

Origins: Is It Reasonable To Believe in God in This Scientific Age? I s It Reasonable to Believe

Enhanced Sampling and Free Energy Applications in Biomolecular Modeling Emad Tajkhorshid NIH

Processing Heterogeneity Nikolaus Grigorieff Larson, The Far Side Heterogeneity and Biology

RAMADAN IMPLICATIONS FOR YOUTH ATHLETES Nur Adilah NYSI Performance Analyst The Ramadan fast

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

VALUES' ( INTEGRITY& HONESTY&& ! HAPPINESS(is(based(on(circumstances((EXTERNAL)( (