CSCE 471/871 Lecture 6: Multiple Sequence Alignments
Stephen D. Scott


Introduction: Multiple Alignments

• Start with a set of sequences
• In each column, residues are homologous
  – Residues occupy similar positions in 3D structure
  – Residues diverge from a common ancestral residue
  – Figure 6.1, p. 137
• Can be done manually, but this requires expertise and is very tedious
• Often there is no single, unequivocally “correct” alignment
  – Problems arise from low sequence identity & structural evolution

Outline

• Scoring a multiple alignment
  – Minimum entropy scoring
  – Sum of pairs (SP) scoring
• Multidimensional dynamic programming
• Progressive alignment methods
• Multiple alignment via profile HMMs

Scoring a Multiple Alignment

• Ideally, scoring is based in evolution, as in e.g. the PAM and BLOSUM matrices
• Contrasts with pairwise alignments:
  1. Position-specific scoring (some positions are more conserved than others)
  2. Ideally, need to consider the entire phylogenetic tree to explain the evolution of the entire family
     – I.e. build a complete probabilistic model of evolution
     – Not enough data to parameterize such a model ⇒ use approximations
• Assume columns are statistically independent: S(m) = G + Σ_i S(m_i)
  – m_i is column i of the multiple alignment m; G is the (affine) score of the gaps in m

Minimum Entropy Scoring

• m_i^j = symbol in column i of sequence j; c_ia = observed count of residue a in column i
• Assume the sequences are statistically independent, i.e. residues are independent within columns
• Then the probability of column m_i is P(m_i) = Π_a p_ia^c_ia, where p_ia = probability of a in column i

Minimum Entropy Scoring (cont'd)

• Set the score to S(m_i) = −log P(m_i) = −Σ_a c_ia log p_ia
  – Proportional to the Shannon entropy of the column
  – Define the optimal alignment as m* = argmin_m Σ_{m_i ∈ m} S(m_i)
• The independence assumption is valid only if all evolutionary subfamilies are represented equally; otherwise the bias skews the results
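To make the minimum entropy score concrete, here is a small Python sketch (not part of the slides) that scores one alignment column as S(m_i) = −Σ_a c_ia log p_ia, estimating p_ia directly from the observed counts; in practice the p_ia would typically come from a family- or background-specific model.

    import math
    from collections import Counter

    def min_entropy_score(column):
        """Minimum-entropy score of one alignment column:
        S(m_i) = -sum_a c_ia * log(p_ia), with p_ia estimated from
        the observed counts c_ia; gap characters are ignored."""
        counts = Counter(r for r in column if r != '-')
        n = sum(counts.values())
        return sum(-c * math.log(c / n) for c in counts.values())

    # A perfectly conserved column scores 0; mixed columns score higher.
    print(min_entropy_score("LLLLL"))   # 0.0
    print(min_entropy_score("LLLGG"))   # ~3.37

A fully conserved column minimizes the score, which is why the optimal alignment is defined as the argmin of the summed column scores.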

Sum of Pairs (SP) Scores

• Treat the multiple alignment as C(N, 2) = N(N−1)/2 pairwise alignments
• If s(a, b) is a substitution score from e.g. PAM or BLOSUM:
  S(m_i) = Σ_{k<ℓ} s(m_i^k, m_i^ℓ)
• Caveat: s(a, b) was derived for pairwise comparisons, not N-way comparisons
  – E.g. for a column with residues a, b, c, the “correct” log-odds score is log(p_abc / (q_a q_b q_c)), whereas SP scoring gives log(p_ab / (q_a q_b)) + log(p_bc / (q_b q_c)) + log(p_ac / (q_a q_c)) = log(p_ab p_bc p_ac / (q_a² q_b² q_c²))

Example of a Problem

• Given an alignment with only “L” in column i, using BLOSUM50 (s(L, L) = 5) yields an SP score of S_1 = 5 · C(N, 2) = 5N(N−1)/2
• If one “L” is replaced with “G” (s(L, G) = −4), then the SP score is S_2 = S_1 − 9(N−1)
• Problem:
  S_2 / S_1 = 1 − 9(N−1) / (5N(N−1)/2) = 1 − 18/(5N),
  i.e. as N increases, S_2/S_1 → 1
  – But a larger N should give more support for “L” in m_i relative to S_2, not less (i.e. S_2/S_1 should be decreasing); see the numeric sketch after the MSA slide below

Outline (revisited)

• Scoring a multiple alignment
• Multidimensional dynamic programming
  – Standard MDP algorithm
  – MSA
• Progressive alignment methods
• Multiple alignment via profile HMMs

Multidimensional Dynamic Programming

• Generalization of DP for pairwise alignments
• Assume statistical independence of columns and a linear gap penalty (can also handle affine gap penalties)
• S(m) = Σ_i S(m_i), and α_{i1, i2, ..., iN} = max score of an alignment of the subsequences x^1_{1..i1}, x^2_{1..i2}, ..., x^N_{1..iN}
• Recurrence: α_{i1, i2, ..., iN} = max over the cases
  – α_{i1−1, i2−1, i3−1, ..., iN−1} + S(x^1_{i1}, x^2_{i2}, x^3_{i3}, ..., x^N_{iN})
  – α_{i1, i2−1, i3−1, ..., iN−1} + S(−, x^2_{i2}, x^3_{i3}, ..., x^N_{iN})
  – α_{i1−1, i2, i3−1, ..., iN−1} + S(x^1_{i1}, −, x^3_{i3}, ..., x^N_{iN})
  – ...
  – α_{i1−1, i2−1, i3−1, ..., iN} + S(x^1_{i1}, x^2_{i2}, x^3_{i3}, ..., −)
  – α_{i1, i2, i3−1, ..., iN−1} + S(−, −, x^3_{i3}, ..., x^N_{iN})
  – ...
• In each column, take all gap–residue combinations except 100% gaps (see the brute-force sketch after the MSA slide below)

Multidimensional Dynamic Programming (cont'd)

• Assume all N sequences are of length L
• Space complexity = Θ(L^N)
• Time complexity = Θ(2^N L^N) column evaluations
• Is it practical?

MSA [Carrillo & Lipman 88; Lipman et al. 89]

• Uses MDP, but eliminates many entries from consideration to save time
• Can optimally solve problems with L = 300 and N = 7 (old numbers); L = 150 and N = 50, L = 500 and N = 25, and L = 1000 and N = 10 (newer numbers)
• Uses SP scoring: S(a) = Σ_{k<ℓ} S(a^{kℓ}), where a is the MA and a^{kℓ} is the PA between x^k and x^ℓ induced by a
• If â^{kℓ} is the optimal PA between x^k and x^ℓ (easily computed), then S(a^{kℓ}) ≤ S(â^{kℓ}) for all k and ℓ
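A few lines of Python (a sketch, not from the slides) reproduce the SP pathology from the “Example of a Problem” slide above, using the BLOSUM50 values s(L, L) = 5 and s(L, G) = −4 implied by the 9(N − 1) drop:

    def sp_conserved_vs_mutated(N, match=5, mismatch=-4):
        """SP score of a column of N 'L's (S1) vs. the same column with
        one 'L' replaced by 'G' (S2)."""
        s1 = match * N * (N - 1) // 2            # all C(N,2) pairs score `match`
        s2 = s1 - (match - mismatch) * (N - 1)   # N-1 pairs drop from 5 to -4
        return s1, s2, s2 / s1

    for N in (5, 10, 100, 1000):
        print(N, sp_conserved_vs_mutated(N))     # S2/S1 -> 1 as N grows

The ratio climbs from 0.28 at N = 5 toward 1, illustrating that SP scoring penalizes the substitution relatively less as more sequences support the conserved residue.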
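The recurrence above can be written almost verbatim as a brute-force Python sketch; `sp_column_score`, the gap penalty, and the toy substitution dictionary below are illustrative assumptions rather than part of the lecture, and the Θ(2^N L^N) cost limits this to tiny inputs.

    import itertools

    def sp_column_score(column, subst, gap=-8):
        """Sum-of-pairs score of one alignment column (gaps written as '-')."""
        total = 0
        for a, b in itertools.combinations(column, 2):
            if a == '-' and b == '-':
                continue                      # s(-,-) = 0
            elif a == '-' or b == '-':
                total += gap                  # linear gap penalty
            else:
                total += subst[(a, b)] if (a, b) in subst else subst[(b, a)]
        return total

    def msa_dp(seqs, subst, gap=-8):
        """Exact multidimensional DP for MSA with SP column scoring.
        alpha[idx] = best score of aligning the prefixes seqs[k][:idx[k]]."""
        N = len(seqs)
        alpha = {(0,) * N: 0}
        for idx in itertools.product(*(range(len(s) + 1) for s in seqs)):
            if idx == (0,) * N:
                continue
            best = float('-inf')
            # every residue/gap combination except the all-gap column
            for moves in itertools.product((0, 1), repeat=N):
                if not any(moves):
                    continue
                prev = tuple(i - m for i, m in zip(idx, moves))
                if min(prev) < 0:
                    continue
                col = [seqs[k][idx[k] - 1] if moves[k] else '-' for k in range(N)]
                best = max(best, alpha[prev] + sp_column_score(col, subst, gap))
            alpha[idx] = best
        return alpha[tuple(len(s) for s in seqs)]

    # Toy usage with a made-up substitution "matrix" over two residues.
    subst = {('L', 'L'): 5, ('G', 'G'): 8, ('L', 'G'): -4}
    print(msa_dp(["LGL", "LL", "GLL"], subst))

Because itertools.product enumerates index vectors in lexicographic order, every predecessor cell is filled before it is needed, mirroring the fill order of the usual pairwise DP matrix.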

MSA (cont'd)

• Assume we have a lower bound σ(a*) on the score of the optimal alignment a*:
  σ(a*) ≤ S(a*) = Σ_{k<ℓ} S(a*^{kℓ}) = S(a*^{kℓ}) + Σ_{(k',ℓ') ≠ (k,ℓ), k'<ℓ'} S(a*^{k'ℓ'}) ≤ S(a*^{kℓ}) + Σ_{(k',ℓ') ≠ (k,ℓ), k'<ℓ'} S(â^{k'ℓ'})
• Thus S(a*^{kℓ}) ≥ β_{kℓ} = σ(a*) − Σ_{(k',ℓ') ≠ (k,ℓ), k'<ℓ'} S(â^{k'ℓ'})
• When filling in the matrix, only need to consider PAs of x^k and x^ℓ that score at least β_{kℓ} (Figure 6.3, p. 144); see the worked sketch at the end of this section
• Can get σ(a*) from other (heuristic) alignment methods

Outline (revisited)

• Scoring a multiple alignment
• Multidimensional dynamic programming
• Progressive alignment methods
  – Feng-Doolittle
  – Profile alignment
  – CLUSTALW
  – Iterative refinement
• Multiple alignment via profile HMMs

Progressive Alignment Methods

• Repeatedly perform pairwise alignments until all sequences are aligned
• Start by aligning the most similar pairs of sequences (most reliable)
  – Often start with a “guide tree”
• Heuristic method (suboptimal), though generally pretty good
• Differences among the methods:
  1. The order in which the alignments are done
  2. Whether sequences are aligned to alignments, or sequences are aligned to sequences and then alignments are aligned to alignments
  3. The methods used to score and build the alignments

Feng-Doolittle

1. Compute a distance matrix by aligning all pairs of sequences
   – Convert each pairwise alignment score to a distance (see the conversion sketch at the end of this section):
     D = −log((S_obs − S_rand) / (S_max − S_rand))
   – S_obs = observed alignment score between the two sequences; S_max = average score of aligning each of the two sequences to itself; S_rand = expected score of aligning two random sequences of the same composition and length
2. Use a hierarchical clustering algorithm [Fitch & Margoliash 67] to build a guide tree based on the distance matrix

Feng-Doolittle (cont'd)

3. Build the multiple alignment in the order that nodes were added to the guide tree in Step 2
   – Goes from the most similar to the least similar pairs
   – Aligning two sequences is done with DP
   – Aligning a sequence x with an existing alignment a is done by pairwise aligning x to each sequence in a
     * The highest-scoring PA determines how to align x with a
   – Aligning an existing alignment a with an existing alignment a′ is done by pairwise aligning each sequence in a to each sequence in a′
     * The highest-scoring PA determines how to align a with a′
   – After each alignment is formed, replace gaps with an “X” character that scores 0 against all other symbols and gaps
     * “Once a gap, always a gap”
     * Ensures consistency between the PAs and the corresponding MAs

Profile Alignment

• Allows for position-specific scoring, e.g.:
  – Penalize gaps more in a non-gap column than in a gap-heavy column
  – Penalize mismatches more in a highly conserved column than in a heterogeneous column
• If the gap penalty is linear, can use the SP score with s(−, a) = s(a, −) = −g and s(−, −) = 0
• Given two MAs (profiles) a1 (over x^1, ..., x^n) and a2 (over x^{n+1}, ..., x^N), align a1 with a2 without altering the fundamental structure of either
  – Insert gaps into entire columns of a1 and a2
  – s(−, −) = 0 implies that this doesn't affect the score of a1 or a2
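As a small illustration of the Carrillo-Lipman pruning bound β_kℓ from the “MSA (cont'd)” slide above, the following Python sketch (with made-up numbers) computes the bounds from a dictionary of optimal pairwise scores S(â^{kℓ}) and a heuristic lower bound σ(a*):

    def carrillo_lipman_bounds(pairwise_opt, lower_bound):
        """beta_kl = sigma(a*) - sum over all other pairs (k',l') of S(a_hat^{k'l'}).
        Any pairwise alignment of x^k and x^l scoring below beta_kl cannot be
        part of an optimal multiple alignment, so the MDP can skip those cells."""
        total = sum(pairwise_opt.values())
        return {pair: lower_bound - (total - s) for pair, s in pairwise_opt.items()}

    # Toy example: 3 sequences, optimal pairwise scores, heuristic bound sigma = 95.
    opt = {(1, 2): 40, (1, 3): 35, (2, 3): 30}
    print(carrillo_lipman_bounds(opt, lower_bound=95))
    # {(1, 2): 30, (1, 3): 25, (2, 3): 20}

The tighter (larger) each β_kℓ is, the more of the L^N dynamic-programming volume MSA can discard.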
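Finally, the Feng-Doolittle score-to-distance conversion from the slides above is a one-liner in Python; the numeric values in the usage example are invented.

    import math

    def feng_doolittle_distance(s_obs, s_max, s_rand):
        """D = -log((S_obs - S_rand) / (S_max - S_rand)).
        The ratio is the effective similarity: near 1 (D near 0) for
        very similar sequences, near 0 (D large) for unrelated ones."""
        return -math.log((s_obs - s_rand) / (s_max - s_rand))

    print(feng_doolittle_distance(s_obs=180, s_max=200, s_rand=20))   # ~0.12
    print(feng_doolittle_distance(s_obs=45,  s_max=200, s_rand=20))   # ~1.97

These distances feed the Fitch-Margoliash clustering that builds the guide tree in Step 2.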
