CSE182-L9 Protein domain analysis via HMMs Gene finding November 09

QUIZ!
• Question: your ‘friend’ likes to gamble.
• She tosses a coin: HEADS, she gives you a dollar. TAILS, you give her a dollar.
• Usually she uses a fair coin, but ‘once in a while’ she uses a loaded coin.
• Can you say what fraction of the time she uses the loaded coin?

Representation 2: Profiles
• Profiles versus regular expressions
– Regular expressions do not tolerate an occasional mismatch.
– The union operation (I+V+L) does not quantify the relative importance of I, V, and L. It could be that V occurs in 80% of the family members.
– Profiles capture some of these ideas.

Profiles
• Start with an alignment of strings of length m, over an alphabet A.
• Build an |A| × m matrix $F = (f_{ki})$.
• Each entry $f_{ki}$ is the frequency of symbol k in position i of the alignment.
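The construction of the frequency matrix can be sketched in a few lines of Python. This is a minimal illustration, assuming a gapless alignment. The function name `build_profile`, the toy alignment, and the reduced alphabet are hypothetical and not taken from the lecture.

```python
from collections import Counter

def build_profile(alignment, alphabet):
    """Return F[k][i]: the frequency of symbol k in column i of a gapless alignment."""
    m = len(alignment[0])
    assert all(len(row) == m for row in alignment), "all rows must have length m"
    profile = {k: [0.0] * m for k in alphabet}
    for i in range(m):
        counts = Counter(row[i] for row in alignment)  # symbol counts in column i
        for k in alphabet:
            profile[k][i] = counts[k] / len(alignment)
    return profile

# Hypothetical toy alignment over a reduced alphabet (assumption, for illustration only).
alignment = ["AIL", "AVL", "AVL", "GVL"]
F = build_profile(alignment, alphabet="AGILV")
print(F["V"][1])  # frequency of V in the second column
```

Each column of `F` sums to 1, so a column can be read as a probability distribution over the alphabet.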

Scoring matrices
• Given a sequence s, does it belong to the family described by a profile?
• We align the sequence to the profile, and score the alignment.
• Let S(i,j) be the score of aligning position i of the profile to residue $s_j$.
• The score of an alignment is the sum of its column scores.

Scoring Profiles
$$S(i,j) = \sum_k f_{ki}\, M[r_k, s_j]$$
• M is the scoring matrix, $r_k$ is the k-th symbol of the alphabet, and $f_{ki}$ is its frequency in profile column i.
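As a sketch of how this column score could be computed, the Python below evaluates S(i,j) and sums it over a gapless alignment of a sequence to a profile. The profile values and the +1/-1 scoring matrix are hypothetical stand-ins (a real application would use a substitution matrix such as BLOSUM), and the function names are assumptions.

```python
def column_score(profile, i, residue, M):
    """S(i, j) = sum over k of f_ki * M[r_k, s_j], for profile column i and residue s_j."""
    return sum(freqs[i] * M[(k, residue)] for k, freqs in profile.items())

def gapless_score(profile, seq, M):
    """Score of aligning seq to the profile position by position, with no gaps."""
    return sum(column_score(profile, i, s, M) for i, s in enumerate(seq))

# Hypothetical 3-column profile: profile[k][i] is the frequency of symbol k in column i.
profile = {
    "A": [0.75, 0.00, 0.0],
    "G": [0.25, 0.00, 0.0],
    "I": [0.00, 0.25, 0.0],
    "V": [0.00, 0.75, 0.0],
    "L": [0.00, 0.00, 1.0],
}
# Hypothetical scoring matrix: +1 for identical residues, -1 otherwise.
M = {(a, b): (1.0 if a == b else -1.0) for a in profile for b in profile}

print(gapless_score(profile, "AVL", M))
print(gapless_score(profile, "AIL", M))
```

The sequence AVL scores higher than AIL because V is the more frequent residue in the second column, which is the kind of weighting a regular expression cannot express.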

Domain analysis via profiles
• Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high-scoring ones to functionally characterize our sequences.
• What if the sequence matches some other sequences weakly (using BLAST), but does not match any known profile?

Psi-BLAST idea
• Iterate:
– Find homologs using BLAST on the query.
– Discard very similar homologs.
– Align, make a profile, search with the profile.
– Why is this more sensitive?
[Figure: a query sequence searched against the database. In the next iteration, the red sequence will be thrown out. It matches the query in non-essential residues.]

Representation 3: HMMs
• Building good profiles relies upon good alignments.
– Difficult if there are gaps in the alignment.
– Psi-BLAST, BLOCKS, etc. work with gapless alignments.
• An HMM representation of profiles puts alignment construction and membership queries in a uniform framework.
• It also allows for position-specific gap scoring.

The generative model
• Think of each column in the alignment as generating symbols according to a distribution.
• For each column, build a node that outputs an amino acid with the appropriate probability, e.g. Pr[F]=0.71, Pr[Y]=0.14.

A simple Profile HMM
• Connect the nodes for each column into a chain. This chain generates random sequences.
• What is the probability of generating FKVVGQVILD?
• In this representation
– Prob[new sequence S belongs to a family] = Prob[HMM generates sequence S]
• What is the difference with Profiles?
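For a gapless chain, the probability of generating a sequence is the product of the per-column emission probabilities. The sketch below illustrates this for a hypothetical three-column chain. Only Pr[F]=0.71 and Pr[Y]=0.14 come from the slides. Every other number, and the name `chain_probability`, is an assumption.

```python
def chain_probability(emissions, seq):
    """Probability that a gapless chain of column nodes emits seq, one symbol per node."""
    if len(seq) != len(emissions):
        return 0.0  # a chain without insert/delete states emits exactly one symbol per column
    p = 1.0
    for column, symbol in zip(emissions, seq):
        p *= column.get(symbol, 0.0)  # symbols never seen in a column get probability 0
    return p

# Hypothetical chain; the first node uses Pr[F]=0.71 and Pr[Y]=0.14 from the slide.
emissions = [
    {"F": 0.71, "Y": 0.14, "W": 0.15},
    {"K": 0.5, "R": 0.5},
    {"V": 1.0},
]
print(chain_probability(emissions, "FKV"))
print(chain_probability(emissions, "YRV"))
```

For long chains the product underflows quickly, so practical implementations usually sum log-probabilities instead.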

Profile HMMs can handle gaps
• The match states are the same as on the previous page.
• Insertion and deletion states help introduce gaps.
– When in an insert state, generate any amino acid.
– When in a delete state, generate a gap symbol (-).
– A sequence may be generated using different paths.

Example
A L - L
A I V L
A I - L
• What is the probability that ALIL is part of the family?
• Note that multiple paths can generate this sequence.
Path 1:
1 Go to M1 and generate A
2 Go to I1 and generate L
3 Go to M2 and generate I
4 Go to M3 and generate L
OR
Path 2:
1 Go to M1 and generate A
2 Go to M2 and generate L
3 Go to I2 and generate I
4 Go to M3 and generate L

Example
A L - L
A I V L
A I - L
• What is the probability that ALIL is part of the family?
• Note that multiple paths can generate this sequence.
– $M_1 I_1 M_2 M_3$
– $M_1 M_2 I_2 M_3$
• In order to compute the probabilities, we must assign probabilities to the transitions between states.

Profile HMMs
• Directed automaton M with nodes and edges.
– Nodes emit symbols according to ‘emission probabilities’.
– Transitions from node to node are governed by ‘transition probabilities’.
• Joint probability of seeing a sequence S and a path P:
– $\Pr[S, P \mid M] = \Pr[S \mid P, M]\,\Pr[P \mid M]$
– $\Pr[\text{ALIL}, M_1 I_1 M_2 M_3 \mid M] = \Pr[\text{ALIL} \mid M_1 I_1 M_2 M_3, M]\,\Pr[M_1 I_1 M_2 M_3 \mid M]$
• $\Pr[\text{ALIL} \mid M] = ?$

Formally
• The emitted sequence is $S = S_1 S_2 \ldots S_m$.
• The path traversed is $P = P_1 P_2 P_3 \ldots$
• $e_j(s)$ = emission probability of symbol s in state j.
• $T[j,k]$ = probability of transitioning from state j to state k.
• $\Pr(P, S \mid M) = e_{P_1}(S_1)\, T[P_1,P_2]\, e_{P_2}(S_2) \cdots$
• What is $\Pr(S \mid M)$?
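The joint probability of a sequence and a path can be evaluated directly from this product. The sketch below does so for the two ALIL paths from the example. The match-state emissions are read off the example alignment (columns 1, 2 and 4), while the insert-state emissions and all transition probabilities are hypothetical values chosen for illustration. The path is assumed to start in its first state with probability 1.

```python
# Emission probabilities e_j(s). Match states follow the example alignment;
# insert states are assumed uniform over the toy alphabet {A, I, L, V}.
EMIT = {
    "M1": {"A": 1.0},
    "M2": {"I": 2 / 3, "L": 1 / 3},
    "M3": {"L": 1.0},
    "I1": {"A": 0.25, "I": 0.25, "L": 0.25, "V": 0.25},
    "I2": {"A": 0.25, "I": 0.25, "L": 0.25, "V": 0.25},
}
# Transition probabilities T[j][k] (hypothetical values).
TRANS = {
    "M1": {"M2": 0.9, "I1": 0.1},
    "I1": {"M2": 1.0},
    "M2": {"M3": 2 / 3, "I2": 1 / 3},
    "I2": {"M3": 1.0},
    "M3": {},
}

def joint_probability(seq, path, emit=EMIT, trans=TRANS):
    """Pr(P, S | M) = e_P1(S1) * T[P1, P2] * e_P2(S2) * ..."""
    assert len(seq) == len(path), "every state on the path emits exactly one symbol"
    p = emit[path[0]].get(seq[0], 0.0)
    for prev, state, symbol in zip(path, path[1:], seq[1:]):
        p *= trans[prev].get(state, 0.0) * emit[state].get(symbol, 0.0)
    return p

print(joint_probability("ALIL", ["M1", "I1", "M2", "M3"]))
print(joint_probability("ALIL", ["M1", "M2", "I2", "M3"]))
```

Under these assumed parameters the second path is the more probable of the two, and Pr(S|M) depends on how the paths are combined.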

Two solutions
• An unknown (hidden) path is traversed to produce (emit) the sequence S.
• The probability that M emits S can be taken to be either
– the sum of the joint probabilities over all paths: $\Pr(S \mid M) = \sum_P \Pr(S, P \mid M)$
– OR the probability of the most likely path: $\Pr(S \mid M) = \max_P \Pr(S, P \mid M)$
• Both are appropriate ways to model, and they are solved by similar algorithms.

Viterbi Algorithm for HMM
A L - L
A I V L
A I - L
• Let $P_{\max}(i, j \mid M)$ be the probability of the most likely solution that emits $S_1 \ldots S_i$ and ends in state j. (Is it sufficient to compute this?)
• $P_{\max}(i, j \mid M) = \max_k \left[ P_{\max}(i-1, k \mid M)\, T[k,j] \right] e_j(S_i)$ (Viterbi)
• $P_{\text{sum}}(i, j \mid M) = \sum_k \left[ P_{\text{sum}}(i-1, k \mid M)\, T[k,j] \right] e_j(S_i)$
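The two recurrences differ only in whether the predecessors are combined with max or with a sum, so a single dynamic program can serve both. The sketch below is one way to write it, under stated assumptions: every state emits exactly one symbol per step (silent delete states are not modeled), the path starts according to a `start` distribution, and the answer is read off at an assumed end state M3. The insert-state emissions and all transition probabilities are hypothetical.

```python
STATES = ["M1", "I1", "M2", "I2", "M3"]
START = {"M1": 1.0}  # assumption: every path starts in M1
EMIT = {
    "M1": {"A": 1.0},
    "M2": {"I": 2 / 3, "L": 1 / 3},
    "M3": {"L": 1.0},
    "I1": {"A": 0.25, "I": 0.25, "L": 0.25, "V": 0.25},
    "I2": {"A": 0.25, "I": 0.25, "L": 0.25, "V": 0.25},
}
TRANS = {  # hypothetical transition probabilities
    "M1": {"M2": 0.9, "I1": 0.1},
    "I1": {"M2": 1.0},
    "M2": {"M3": 2 / 3, "I2": 1 / 3},
    "I2": {"M3": 1.0},
    "M3": {},
}

def dp_probability(seq, combine, states=STATES, start=START, emit=EMIT, trans=TRANS):
    """Return {j: P(len(seq), j)}; combine=max gives P_max, combine=sum gives P_sum."""
    # Base case: the first symbol is emitted by the starting state.
    prev = {j: start.get(j, 0.0) * emit[j].get(seq[0], 0.0) for j in states}
    for symbol in seq[1:]:
        # P(i, j) = combine over k of [P(i-1, k) * T[k, j]], times e_j(S_i)
        prev = {
            j: combine(prev[k] * trans[k].get(j, 0.0) for k in states)
            * emit[j].get(symbol, 0.0)
            for j in states
        }
    return prev

p_max = dp_probability("ALIL", max)["M3"]  # most likely single path ending in M3
p_sum = dp_probability("ALIL", sum)["M3"]  # total over all paths ending in M3
print(p_max, p_sum)
```

Each step costs time proportional to the square of the number of states, so the whole table is filled in O(m · |states|²) time rather than by enumerating every path.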

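Viterbi illustration
• Let $P_{\max}(i, j \mid M)$ be the probability of the most likely solution that emits $S_1 \ldots S_i$ and ends in state j. (Is it sufficient to compute this?)
• $P_{\max}(i, j \mid M) = \max_k \left[ P_{\max}(i-1, k \mid M)\, T[k,j] \right] e_j(S_i)$
[Figure: state k at position i-1 leads to state j with transition probability T[k,j]; state j emits $S_i$ with probability $e_j(S_i)$.]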

Profile HMM membership
A L - L
A I V L
A I - L
• Query sequence: A L I L. Path: $M_1 M_2 I_2 M_3$
• We can use the Viterbi/Sum algorithm to compute the probability that the sequence belongs to the family.
• Backtracking can be used to recover the path, which gives an alignment of the sequence to the family.
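Backtracking can be sketched by storing, for every position and state, the predecessor that achieved the maximum. The Python below is a minimal illustration on the ALIL example, assuming each state emits exactly one symbol, a start in M1, and an end in M3. The insert-state emissions and all transition probabilities are hypothetical, so the recovered path reflects those assumed numbers rather than trained parameters.

```python
STATES = ["M1", "I1", "M2", "I2", "M3"]
START = {"M1": 1.0}  # assumption: every path starts in M1
EMIT = {
    "M1": {"A": 1.0},
    "M2": {"I": 2 / 3, "L": 1 / 3},
    "M3": {"L": 1.0},
    "I1": {"A": 0.25, "I": 0.25, "L": 0.25, "V": 0.25},
    "I2": {"A": 0.25, "I": 0.25, "L": 0.25, "V": 0.25},
}
TRANS = {  # hypothetical transition probabilities
    "M1": {"M2": 0.9, "I1": 0.1},
    "I1": {"M2": 1.0},
    "M2": {"M3": 2 / 3, "I2": 1 / 3},
    "I2": {"M3": 1.0},
    "M3": {},
}

def viterbi_path(seq, end_state, states=STATES, start=START, emit=EMIT, trans=TRANS):
    """Return (probability, path) of the most likely path emitting seq and ending in end_state."""
    score = {j: start.get(j, 0.0) * emit[j].get(seq[0], 0.0) for j in states}
    back = []  # back[i][j] = best predecessor of state j at step i + 1
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for j in states:
            best_k = max(states, key=lambda k: score[k] * trans[k].get(j, 0.0))
            pointers[j] = best_k
            new_score[j] = score[best_k] * trans[best_k].get(j, 0.0) * emit[j].get(symbol, 0.0)
        score = new_score
        back.append(pointers)
    # Follow the stored pointers from the end state back to the start.
    path = [end_state]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[end_state], path

prob, path = viterbi_path("ALIL", end_state="M3")
print(prob, path)
```

Reading the path against the sequence gives the alignment: a symbol emitted by a match state is placed in that match column, and a symbol emitted by an insert state is placed between columns.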

Summary
• HMMs allow us to model position-specific gap penalties, and allow for automated training to get a good alignment.
• Patterns/Profiles/HMMs allow us to represent families and focus on key residues.
• Each has its advantages and disadvantages, and needs special algorithms to query efficiently.

Protein Domain databases
• A number of databases capture proteins (domains) using various representations.
• Each domain is also associated with structure/function information, parsed from the literature.
• Each database has specific query mechanisms that allow us to compare our sequences against them, and assign function.

Biology
• In our discussion of BLAST, we alternated between looking at DNA and protein sequences, treating them as strings.
– DNA, RNA, and proteins are the 3 important molecules.
• What is the relation between the three?


Transcription and translation
• We define a gene as a location on the genome that codes for proteins.
• The genic information is used to manufacture proteins through transcription and translation.
• Each triplet (codon) maps to exactly one amino acid, although several codons can encode the same amino acid.

Translation
• The ribosomal machinery reads mRNA.
• Each triplet is translated into a unique amino acid until the STOP codon is encountered.
• There is also a special signal where translation starts, usually at the ATG (M) codon.
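The reading process can be sketched as a short Python function that finds the first start codon and translates triplets until a STOP codon. This is a simplified illustration that uses the standard genetic code and ignores biological details such as alternative start sites. The function name `translate` and the example mRNA string are hypothetical. In mRNA the start codon ATG is written AUG.

```python
BASES = "UCAG"
# Standard genetic code, listed in UCAG order for each codon position; '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna):
    """Translate from the first AUG until a STOP codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start < 0:
        return ""  # no start codon found
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break  # STOP codon ends translation
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical mRNA: GG | AUG GCU UUA UAA | GC
print(translate("GGAUGGCUUUAUAAGC"))  # prints MAL
```

The table has 64 codons but only 20 amino acids plus STOP, which is why the codon-to-amino-acid mapping is unique in one direction only.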

End of L9
