CSCE 471/871 Lecture 2: Pairwise Alignments
Stephen Scott
sscott@cse.unl.edu
Outline
What is a sequence alignment?
Why should we care?
How do we do it?
  Scoring matrices
  Algorithms for finding optimal alignments
  Statistically validating alignments
What is a Sequence Alignment?
Given two nucleotide or amino acid (AA) sequences, determine if they are related (descended from a common ancestor)
Technically, we can align any two sequences, but not always in a meaningful way
In this lecture, we’ll focus on AA sequences, but the same alignment principles hold for DNA sequences
What is a Sequence Alignment? (cont’d)
HIGHLY RELATED:
HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
           G+ +VK+HGKKV  A+++++AH+D++ +++++LS+LH  KL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
RELATED:
HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
           ++ ++++H+ KV + +A ++ +L+ L+++H+ K
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
SPURIOUS ALIGNMENT:
HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
           GS+ + G + +D L ++ H+ D+ A +AL D ++AH+
F11G11.2   GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
How do we filter out the last one and pick up the second?
Why Should We Care?
Fragment assembly in DNA sequencing
  Experimental determination of nucleotide sequences is only reliable up to about 500-800 base pairs (bp) at a time
  But a genome can be millions of bp long!
  If fragments overlap, they can be assembled:
    ...AAGTACAATCA
    CAATTACTCGGA...
  Need to align to detect overlap
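As a rough illustration of overlap detection, the sketch below finds the longest exact suffix-prefix overlap between two fragments. This is a simplification and not the method of the lecture: real reads contain sequencing errors, which is why alignment rather than exact matching is needed. The function name `longest_overlap` is hypothetical.

```python
def longest_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Exact matching only (an assumption made for brevity); tolerating
    mismatches would require an alignment algorithm.
    """
    # Try the longest candidate overlap first and shrink until one matches.
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


# The two fragments from the slide share only the exact overlap "CA".
overlap = longest_overlap("AAGTACAATCA", "CAATTACTCGGA")
print(overlap)
```

Once the overlap length is known, the fragments could be merged as `left + right[overlap:]`.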
Why Should We Care? (cont’d)
Finding homologous proteins and genes
  I.e., evolutionarily related (common ancestor)
  Structure and function are often similar, but this is reliable only if they are evolutionarily related
  Thus want to avoid spurious alignments such as the HBA_HUMAN / F11G11.2 example shown earlier
How do we do it?
Choose a scoring scheme
Choose an algorithm to find the optimal alignment with respect to the scoring scheme
Statistically validate the alignment
Scoring Schemes
Since the goal is to find related sequences, want an evolution-based scoring scheme
Mutations occur often at the genomic level, but their rates of acceptance by natural selection vary depending on the mutation
  E.g., changing an AA to one with similar properties is more likely to be accepted
Assume that all changes occur independently of each other and are Markovian
  ⇒ Changes occurring now are independent of those in the past
  ⇒ Makes working with probabilities easier
Scoring Schemes (cont’d)
If AA $a_i$ is aligned with $a_j$, then $a_j$ was substituted for $a_i$
  ...KALM...
  ...KVLM...
Was this due to an accepted mutation or simply by chance?
If A or V is likely in general, then there is less evidence that this is a mutation
Want the score $s_{ij}$ to be higher if a mutation is more likely
Take the ratio of mutation prob. to prob. of the AA appearing at random
Generally, if $a_j$ is similar to $a_i$ in property, then an accepted mutation is more likely and $s_{ij}$ is higher
Scoring Schemes (cont’d)
Only consider immediate mutations $a_i \to a_j$, not $a_i \to a_k \to a_j$
Mutations are undirected
  ⇒ scoring matrix is symmetric
The PAM Transition Matrices
Dayhoff et al. started with several hundred manual alignments between very closely related proteins (≥ 85% similar in sequence), and manually generated evolutionary trees
Computed the frequency with which each AA is changed into each other AA over a short evolutionary distance (short enough that only 1% of AAs change)
1 PAM = 1 point accepted mutation per 100 AAs
Becomes our measure of evolutionary “time”
The PAM Transition Matrices (cont’d)
Estimate $p_i$ with the frequency of AA $a_i$ over both sequences, i.e., (number of $a_i$'s) / (number of AAs)
Let $f_{ij} = f_{ji}$ = number of $a_i \leftrightarrow a_j$ changes in the data set, $f_i = \sum_{j \neq i} f_{ij}$ = number of changes involving $a_i$, and $f = \sum_i f_i$ = total number of changes (each change is counted once for each of the two AAs it involves)
Define the scale to be the amount of evolution needed to change 1 in 100 AAs (on average) [1 PAM distance]
Relative mutability of $a_i$ is the ratio of the number of mutations to the total exposure to mutation:
$$m_i = \frac{f_i}{100 \, f \, p_i}$$
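The counting step above can be sketched as follows. This is a minimal illustration that assumes the data set is a list of gap-free aligned sequence pairs; Dayhoff et al. also used evolutionary trees, which this sketch ignores. The function name `pam_counts` and the one-pair toy data set (the KALM / KVLM fragment from an earlier slide) are chosen for illustration only.

```python
from collections import Counter


def pam_counts(pairs):
    """Estimate p_i, f_ij, f_i, f and m_i from gap-free aligned pairs.

    Assumption: each pair is two equal-length strings with no gaps.
    """
    aa_counts = Counter()   # occurrences of each AA over both sequences
    f_pair = Counter()      # symmetric change counts f_ij = f_ji
    for x, y in pairs:
        aa_counts.update(x)
        aa_counts.update(y)
        for a, b in zip(x, y):
            if a != b:
                f_pair[(a, b)] += 1
                f_pair[(b, a)] += 1
    total = sum(aa_counts.values())
    p = {a: c / total for a, c in aa_counts.items()}
    f_i = Counter()
    for (a, _b), c in f_pair.items():
        f_i[a] += c         # f_i = sum over j != i of f_ij
    f = sum(f_i.values())
    if f == 0:
        raise ValueError("no changes observed; mutabilities are undefined")
    # Relative mutability m_i = f_i / (100 f p_i)
    m = {a: f_i[a] / (100 * f * p[a]) for a in p}
    return p, f_pair, f_i, f, m


p, f_pair, f_i, f, m = pam_counts([("KALM", "KVLM")])
print(p["A"], f, m["A"])
```

By construction, the average mutability $\sum_i p_i m_i$ comes out to 1/100, which is the 1 PAM scale.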
The PAM Transition Matrices (cont’d)
If $m_i$ is the probability of a mutation for $a_i$, then $M_{ii} = 1 - m_i$ is the prob. of no change
$a_i \to a_j$ if and only if $a_i$ changes and $a_i \to a_j$ given that $a_i$ changes, so
$$M_{ij} = \Pr(a_i \to a_j) = \Pr(a_i \to a_j \mid a_i \text{ changed}) \, \Pr(a_i \text{ changed}) = \frac{f_{ij}}{f_i} \, m_i = \frac{f_{ij}}{100 \, f \, p_i}$$
The 1 PAM transition matrix consists of the $M_{ij}$ and gives the probabilities of mutations from $a_i$ to $a_j$
Properties of PAM Transition Matrices
$$\sum_j M_{ij} = \sum_{j \neq i} M_{ij} + M_{ii} = \frac{1}{100 \, f \, p_i} \sum_{j \neq i} f_{ij} + \left(1 - \frac{f_i}{100 \, f \, p_i}\right) = \frac{f_i}{100 \, f \, p_i} + 1 - \frac{f_i}{100 \, f \, p_i} = 1$$
[sum of probabilities of changes to an AA + prob. of no change = 1]
$$\sum_i p_i M_{ii} = \sum_i p_i - \sum_i \frac{f_i}{100 \, f} = 1 - \frac{f}{100 \, f} = 0.99$$
[prob. of no change to any AA is 99/100]
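To make the construction concrete, here is a small sketch that builds a 1 PAM matrix from change counts and checks both properties numerically. The three-letter alphabet, the counts, and the frequencies are made-up toy values, and `pam1_matrix` is a hypothetical helper name; nothing here comes from Dayhoff's data.

```python
def pam1_matrix(counts, p):
    """Build the 1 PAM transition matrix M from symmetric counts f_ij.

    counts[i][j] = f_ij (diagonal ignored), p[i] = frequency of AA i.
    """
    n = len(p)
    f_i = [sum(counts[i][j] for j in range(n) if j != i) for i in range(n)]
    f = sum(f_i)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                M[i][j] = counts[i][j] / (100 * f * p[i])   # M_ij
        M[i][i] = 1.0 - f_i[i] / (100 * f * p[i])           # M_ii = 1 - m_i
    return M


# Toy values (assumptions for illustration only).
p = [0.3, 0.5, 0.2]
counts = [[0, 2, 1],
          [2, 0, 3],
          [1, 3, 0]]

M = pam1_matrix(counts, p)
row_sums = [sum(row) for row in M]
unchanged = sum(p[i] * M[i][i] for i in range(len(p)))
print(row_sums, unchanged)
```

Each row sum should come out as 1 and `unchanged` as 0.99, up to floating-point rounding.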
What About 2 PAM?
How about the probability that $a_i \to a_j$ in two evolutionary steps?
It’s the prob. that $a_i \to a_k$ (for any $k$) in step 1, and $a_k \to a_j$ in step 2. This is
$$\sum_k M_{ik} M_{kj} = (M^2)_{ij}$$
k PAM Transition Matrix
In general, the probability that $a_i \to a_j$ in $k$ evolutionary steps is $(M^k)_{ij}$
As $k \to \infty$, the rows of $M^k$ tend to be identical, with the $i$th entry of each row equal to $p_i$
  A result of our Markovian assumption of mutation
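The matrix-power view can be checked numerically. The sketch below uses plain Python lists and a made-up three-letter example (the counts, frequencies, and helper names are assumptions for illustration) to compute the 2 PAM matrix and a very large power of $M$.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, k):
    """M raised to the k-th power by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = M
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


# Toy 1 PAM matrix built as M_ij = f_ij / (100 f p_i) from made-up values.
p = [0.3, 0.5, 0.2]
counts = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
n = len(p)
f_i = [sum(row) for row in counts]   # diagonal counts are zero
f = sum(f_i)
M = [[counts[i][j] / (100 * f * p[i]) if i != j
      else 1.0 - f_i[i] / (100 * f * p[i])
      for j in range(n)] for i in range(n)]

M2 = mat_mul(M, M)           # 2 PAM: (M^2)_ij = sum_k M_ik M_kj
M_far = mat_pow(M, 2 ** 14)  # a very large evolutionary distance
print(M_far)
```

For this toy example every row of `M_far` should be close to `p`, matching the limiting behaviour described above.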
Building a Scoring Matrix
When aligning different AAs in two sequences, want to differentiate mutations and random events
Thus, interested in the ratio of transition probability to prob. of randomly seeing the new AA
$$\frac{M_{ij}}{p_j} = \frac{f_{ij}}{100 \, f \, p_i \, p_j} = \frac{M_{ji}}{p_i} \quad \text{(symmetric)}$$
Ratio > 1 if and only if mutation is more likely than a random event
Building a Scoring Matrix (cont’d)
When aligning several AA positions, the ratio of probs. for the whole alignment = product of the per-position ratios:
$$\begin{pmatrix} a_i & a_k & a_n & \cdots \\ a_j & a_\ell & a_m & \cdots \end{pmatrix} \longrightarrow \left(\frac{M_{ij}}{p_j}\right) \left(\frac{M_{k\ell}}{p_\ell}\right) \left(\frac{M_{nm}}{p_m}\right) \cdots$$
Taking logs will let us use sums rather than products
  ⇒ “Log odds”
  ⇒ Avoid underflow issues
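A minimal sketch of the log-odds idea follows, again on a made-up three-letter alphabet (the letters, counts, frequencies, and function names are assumptions for illustration). It uses natural logs and the 1 PAM matrix directly; the scaling and rounding used in published scoring matrices are not reproduced here.

```python
import math

# Toy data (assumptions): alphabet, frequencies p_i, symmetric counts f_ij.
alphabet = "ABC"
index = {a: i for i, a in enumerate(alphabet)}
p = [0.3, 0.5, 0.2]
counts = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
n = len(p)
f_i = [sum(row) for row in counts]   # diagonal counts are zero
f = sum(f_i)
M = [[counts[i][j] / (100 * f * p[i]) if i != j
      else 1.0 - f_i[i] / (100 * f * p[i])
      for j in range(n)] for i in range(n)]


def log_odds(M, p):
    """Scoring matrix s_ij = log(M_ij / p_j); requires every M_ij > 0."""
    size = len(p)
    return [[math.log(M[i][j] / p[j]) for j in range(size)]
            for i in range(size)]


def alignment_score(S, x, y):
    """Sum of per-position log-odds scores for a gap-free alignment."""
    return sum(S[index[a]][index[b]] for a, b in zip(x, y))


S = log_odds(M, p)
x, y = "ABCA", "ABBA"
score = alignment_score(S, x, y)
print(score)
```

The summed score equals the log of the product of per-position ratios, so ranking alignments by `score` gives the same order as ranking by the product while avoiding very small numbers.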