slide-1
SLIDE 1

Hidden Markov Models and Applications

Fall 2019, Oct 1 & 3, 2019

slide-2
SLIDE 2

Gene finding in prokaryotes

slide-3
SLIDE 3

Reading frames

  • A protein is coded by groups of three nucleotides (codons):
    …ACGTACGTACGTACGT…
    …ACG-TAC-GTA-CGT-ACG-T…
    …TYVRT…
  • There are two other ways in which this sequence can be decomposed into codons:
    …A-CGT-ACG-TAC-GTA-CGT…
    …AC-GTA-CGT-ACG-TAC-GT…
  • These are the three reading frames
  • The complementary strand has three additional reading frames

slide-4
SLIDE 4

Coding for a protein

  • Three non-overlapping nucleotides (a codon) code for an amino acid
  • Each amino acid has more than one codon that codes for it
  • The code is almost universal across organisms
  • Codons which code for the same amino acid are similar
  • Six reading frames (see the sketch below)
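A short sketch of how the six reading frames can be enumerated in code (Python; the helper names are illustrative, not from the slides):

```python
# Minimal sketch: enumerate the six reading frames of a DNA sequence.
def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def codons(seq, offset):
    """Split seq into codons starting at the given offset (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_reading_frames(seq):
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):          # three frames per strand
            frames.append(codons(strand, offset))
    return frames

if __name__ == "__main__":
    for frame in six_reading_frames("ACGTACGTACGTACGT"):
        print("-".join(frame))
```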
slide-5
SLIDE 5

Coding for a protein

  • Every gene starts with the start codon ATG (AUG in mRNA). This specifies the reading frame and the start of the translation site.
  • The protein sequence continues until a stop codon (UGA, UAA, UAG) is encountered.
  • DNA (template strand): TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT
  • RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
  • Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
slide-6
SLIDE 6

Open Reading Frames (ORFs)

  • An open reading frame (ORF) is a sequence whose length is a multiple of 3, starts with the start codon, and ends with a stop codon, with no stop codon in the middle.
  • How do we determine whether an ORF is a protein-coding gene? If we see a long run of non-stop codons after a start codon, it has a low probability of arising by chance (sketched below).
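As a rough illustration of this idea, a naive ORF scanner for one reading frame (Python; the threshold and function names are assumptions for the example, not from the slides):

```python
# Minimal sketch: find ORFs of at least min_codons codons in one reading frame.
# min_codons plays the role of the threshold k discussed on the next slides.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, offset, min_codons=62):
    """Yield (start, end) indices of ORFs in the reading frame given by offset."""
    i = offset
    while i + 3 <= len(seq):
        if seq[i:i + 3] == START:
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOPS:
                j += 3
            if j + 3 <= len(seq) and (j + 3 - i) // 3 >= min_codons:
                yield (i, j + 3)        # ORF including the stop codon
            i = j + 3
        else:
            i += 3

if __name__ == "__main__":
    seq = "ATG" + "GCT" * 70 + "TAA"    # toy sequence: start, 70 codons, stop
    print(list(orfs_in_frame(seq, 0)))
```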

slide-7
SLIDE 7

A model¹ for DNA sequences

  • We need a probabilistic model assigning a probability to a DNA sequence.
  • Simplest model: nucleotides are independent and identically distributed. Probability of a sequence s = s1s2…sn:
    P(s) = P(s1)P(s2)…P(sn)
  • The distribution is characterized by the numbers (pA, pC, pG, pT) that satisfy pA + pC + pG + pT = 1 (multinomial distribution)

¹ “All models are wrong. Some are useful.” – G.E.P. Box
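A one-function sketch of this i.i.d. model (Python; the example frequencies are placeholders, not estimates from data):

```python
import math

# Minimal sketch of the i.i.d. (multinomial) sequence model.
P = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # must sum to 1

def log_prob(seq, p=P):
    """log P(s) = sum_i log P(s_i) under the independence assumption."""
    return sum(math.log(p[b]) for b in seq)

print(log_prob("ACGTACGT"))   # 8 * log(0.25) ≈ -11.09
```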

slide-8
SLIDE 8

A method for gene finding

  • Under our model, assuming that the four nucleotides have equal frequencies:
    P(run of k non-stop codons) = (61/64)^k
  • Choose k such that P(run of k non-stop codons) is smaller than the rate of false positives we are willing to accept. At the 5% level we get k = 62 (see the computation below).
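The cutoff can be checked directly; a minimal computation (Python), which comes out to roughly the k = 62 quoted above:

```python
import math

# Smallest k with (61/64)**k below the 5% false-positive level.
alpha = 0.05
k = math.log(alpha) / math.log(61 / 64)
print(k)                      # ≈ 62.4, i.e. about 62 codons
print((61 / 64) ** 62)        # ≈ 0.051
```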

slide-9
SLIDE 9

Information we ignored

  • Coding regions have different nucleotide usage
  • Different statistics of di-nucleotides and 3-mers in coding vs. non-coding regions
  • The promoter region contains sequence signals for binding of TFs and RNA polymerase
  • Need better models!
  • But this works surprisingly well in prokaryotes
  • Need more detailed models for eukaryotes (introns/exons)

slide-10
SLIDE 10

Hidden Markov Models

slide-11
SLIDE 11

Example: The Dishonest Casino

Game:

  • 1. You bet $1
  • 2. You roll
  • 3. Casino player rolls
  • 4. Highest number wins $2

The casino has two dice:
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The casino player switches between the fair and loaded die (not too often, and not for too long)

slide-12
SLIDE 12

The dishonest casino model

FAIR ⇄ LOADED
  P(stay FAIR) = 0.95, P(FAIR → LOADED) = 0.05
  P(stay LOADED) = 0.9, P(LOADED → FAIR) = 0.1
Fair die: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
Loaded die: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
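For concreteness, the model above can be written out as plain Python dictionaries (a sketch; the variable names are ours, and the initial probabilities ½/½ are taken from slide 21). These dictionaries are reused in the algorithm sketches later in these notes:

```python
# The dishonest-casino HMM as plain dictionaries.
STATES = ("F", "L")

START = {"F": 0.5, "L": 0.5}                                # a0,Fair = a0,Loaded = 1/2

TRANS = {
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}

EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},                   # fair die
    "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 0.5},      # loaded die
}
```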

slide-13
SLIDE 13

Question # 1 – Evaluation

GIVEN:

A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION:

How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs

Prob = 1.3 × 10^-35

slide-14
SLIDE 14

Question # 2 – Decoding

GIVEN:

A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION:

What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs

FAIR … LOADED … FAIR (the answer, as marked on the slide)

slide-15
SLIDE 15

Question # 3 – Learning

GIVEN:

A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION:

How does the casino player work: How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs

slide-16
SLIDE 16

The dishonest casino model

FAIR ⇄ LOADED
  P(stay FAIR) = 0.95, P(FAIR → LOADED) = 0.05
  P(stay LOADED) = 0.9, P(LOADED → FAIR) = 0.1
Fair die: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
Loaded die: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2

slide-17
SLIDE 17

Definition of a hidden Markov model

  • Alphabet Σ = { b1, b2, …, bM }
  • Set of states Q = { 1, …, K } (K = |Q|)
  • Transition probabilities between any two states:
    aij = transition probability from state i to state j
    ai1 + … + aiK = 1, for all states i
  • Initial probabilities a0i:
    a01 + … + a0K = 1
  • Emission probabilities within each state:
    ek(b) = P(xi = b | πi = k)
    ek(b1) + … + ek(bM) = 1

slide-18
SLIDE 18

Hidden states and observed sequence

At time step t, πt denotes the (hidden) state of the Markov chain, and xt denotes the symbol emitted in state πt.

A path of length N is: π1, π2, …, πN
An observed sequence of length N is: x1, x2, …, xN
slide-19
SLIDE 19

A parse of a sequence

Given a sequence x = x1……xN, A parse of x is a sequence of states π = π1, ……, πN

[Figure: trellis with K possible states at each position; the observed symbols x1, x2, x3, … run along the bottom and one state is chosen per position]

slide-20
SLIDE 20

Likelihood of a parse

Given a sequence x = x1…xN and a parse π = π1…πN, how likely is the parse (given our HMM)?

P(x, π) = P(x1, …, xN, π1, …, πN)
= P(xN, πN | x1…xN-1, π1, …, πN-1) P(x1…xN-1, π1, …, πN-1)
= P(xN, πN | πN-1) P(x1…xN-1, π1, …, πN-1)
= …
= P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) … P(x2, π2 | π1) P(x1, π1)
= P(xN | πN) P(πN | πN-1) … P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)
= a0π1 aπ1π2 … aπN-1πN eπ1(x1) … eπN(xN)
= ∏i=1..N aπi-1πi eπi(xi)   (with π0 = 0, the begin state)

[Figure: trellis of K states over positions x1 … xN]

slide-21
SLIDE 21

Example: the dishonest casino

What is the probability of a sequence of rolls x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4 and the parse

π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?

(say initial probs a0,Fair = ½, a0,Loaded = ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair)
= ½ × (1/6)^10 × (0.95)^9
= 5.2 × 10^-9

slide-22
SLIDE 22

Example: the dishonest casino

So, the likelihood that the die is fair for this whole run is 5.2 × 10^-9.
What about π = Loaded, Loaded, …, Loaded (all ten rolls)?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded)
= ½ × (1/10)^8 × (1/2)^2 × (0.9)^9
= 4.8 × 10^-10
Therefore, it is more likely that the die is fair all the way than loaded all the way.

slide-23
SLIDE 23

Example: the dishonest casino

Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
And let's consider π = F, F, …, F:
P(x, π) = ½ × (1/6)^10 × (0.95)^9 = 5.2 × 10^-9 (same as before)
And for π = L, L, …, L:
P(x, π) = ½ × (1/10)^4 × (1/2)^6 × (0.9)^9 = 3.02 × 10^-7
So, the observed sequence is ~100 times more likely if a loaded die is used.
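These joint probabilities follow directly from the product formula on slide 20. A minimal sketch (Python) that reproduces the numbers above; the dictionaries repeat the casino parameters written out after slide 12 so the snippet runs on its own:

```python
# Joint probability P(x, pi) = a0,pi1 * prod_i a_{pi(i-1),pi(i)} * e_{pi(i)}(x_i).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 0.5}}

def joint_prob(x, pi, start=START, trans=TRANS, emit=EMIT):
    p = start[pi[0]] * emit[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[pi[i - 1]][pi[i]] * emit[pi[i]][x[i]]
    return p

x = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
print(joint_prob(x, ["F"] * 10))   # ≈ 5.2e-9
print(joint_prob(x, ["L"] * 10))   # ≈ 3.0e-7
```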

slide-24
SLIDE 24

What we know

Given a sequence x = x1,…,xN and a parse π = π1,…,πN, we know how to compute P(x, π)


slide-25
SLIDE 25

What we would like to know

  • 1. Evaluation

GIVEN HMM M, and a sequence x, FIND Prob[ x | M ]

  • 2. Decoding

GIVEN HMM M, and a sequence x, FIND the sequence π of states that maximizes P[ x, π | M ]

  • 3. Learning

GIVEN HMM M, with unspecified transition/emission probs., and a sequence x, FIND parameters θ = (ei(.), aij) that maximize P[ x | θ ]

slide-26
SLIDE 26

Problem 2: Decoding – finding the best parse of a sequence

slide-27
SLIDE 27

Decoding

GIVEN x = x1x2…xN, we want to find π = π1, …, πN such that P[x, π] is maximized:
π* = argmaxπ P[x, π]

Maximize a0π1 eπ1(x1) aπ1π2 … aπN-1πN eπN(xN)

We can use dynamic programming!
Let Vk(i) = max{π1,…,πi-1} P[x1…xi-1, π1, …, πi-1, xi, πi = k]
= probability of the maximum-probability path ending in state πi = k

[Figure: trellis of K states over positions x1 … xN]

slide-28
SLIDE 28

Decoding – main idea

Inductive assumption: given
Vk(i) = max{π1,…,πi-1} P[x1…xi-1, π1, …, πi-1, xi, πi = k],
what is Vr(i+1)?

Vr(i+1) = max{π1,…,πi} P[x1…xi, π1, …, πi, xi+1, πi+1 = r]
= max{π1,…,πi} P(xi+1, πi+1 = r | x1…xi, π1, …, πi) P[x1…xi, π1, …, πi]
= max{π1,…,πi} P(xi+1, πi+1 = r | πi) P[x1…xi-1, π1, …, πi-1, xi, πi]
= maxk [ P(xi+1, πi+1 = r | πi = k) max{π1,…,πi-1} P[x1…xi-1, π1, …, πi-1, xi, πi = k] ]
= maxk er(xi+1) akr Vk(i)
= er(xi+1) maxk akr Vk(i)

slide-29
SLIDE 29

The Viterbi Algorithm

Input: x = x1,…,xN
Initialization:
  V0(0) = 1 (0 is the imaginary first position)
  Vk(0) = 0, for all k > 0
Iteration:
  Vj(i) = ej(xi) × maxk akj Vk(i-1)
  Ptrj(i) = argmaxk akj Vk(i-1)
Termination:
  P(x, π*) = maxk Vk(N)
Traceback:
  πN* = argmaxk Vk(N)
  πi-1* = Ptrπi*(i)

slide-30
SLIDE 30

The Viterbi Algorithm

Input: x = x1,…,xN
Initialization:
  V0(0) = 1 (0 is the imaginary first position)
  Vk(0) = 0, for all k > 0
Iteration:
  Vj(i) = ej(xi) × maxk akj Vk(i-1)
  Ptrj(i) = argmaxk akj Vk(i-1)
Termination:
  P(x, π*) = maxk ak0 Vk(N) (with an end state)
Traceback:
  πN* = argmaxk Vk(N)
  πi-1* = Ptrπi*(i)
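A compact sketch of this recursion (Python; dictionary-based, no end state; the function and variable names are ours, not from the slides):

```python
# Minimal Viterbi sketch following slides 29-30 (probability space, no end state).
def viterbi(x, states, start, trans, emit):
    V = [{k: start[k] * emit[k][x[0]] for k in states}]      # position 1
    ptr = [{}]
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for j in states:
            best = max(states, key=lambda k: V[i - 1][k] * trans[k][j])
            V[i][j] = emit[j][x[i]] * trans[best][j] * V[i - 1][best]
            ptr[i][j] = best
    # Traceback of the best path.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return V[-1][last], path[::-1]

# Example with the casino parameters from the sketch after slide 12:
# p, path = viterbi([1, 2, 1, 6, 6], ("F", "L"), START, TRANS, EMIT)
```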

slide-31
SLIDE 31

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence

[Figure: K × N dynamic-programming matrix; rows are the states 1…K, columns are the positions x1…xN; cell (j, i) holds Vj(i)]

slide-32
SLIDE 32

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence
Time: O(K²N)   Space: O(KN)

slide-33
SLIDE 33

Viterbi Algorithm – a practical detail

Underflows are a significant problem:
P[x1,…,xi, π1, …, πi] = a0π1 aπ1π2 … aπi-1πi eπ1(x1) … eπi(xi)
These numbers become extremely small – underflow.
Solution: take the logs of all values:
Vr(i) = log er(xi) + maxk [ Vk(i-1) + log akr ]
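In the Viterbi sketch above this only changes the inner update; a minimal log-space version of that step (Python):

```python
import math

# Log-space Viterbi update: V_r(i) = log e_r(x_i) + max_k [ V_k(i-1) + log a_kr ]
def log_viterbi_update(logV_prev, r, xi, states, trans, emit):
    return math.log(emit[r][xi]) + max(logV_prev[k] + math.log(trans[k][r])
                                       for k in states)
```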

slide-34
SLIDE 34

Example

Let x be a sequence with a portion in which 6 appears with probability 1/6, followed by a portion in which 6 appears with probability 1/2:
x = 123456123456…12345 6626364656…1626364656
It is easy to convince yourself that the optimal parse is:
FFF…………………...F LLL………………………...L

slide-35
SLIDE 35

Example

Observed sequence: x = 1, 2, 1, 6, 6
P(x) = 0.0002195337
Best 8 paths:
  LLLLL 0.0001018
  FFFFF 0.0000524
  FFFLL 0.0000248
  FFLLL 0.0000149
  FLLLL 0.0000089
  FFFFL 0.0000083
  LLLLF 0.0000018
  LFFFF 0.0000017

slide-36
SLIDE 36

Problem 1: Evaluation – finding the probability that a sequence is generated by the model

slide-37
SLIDE 37

Generating a sequence by the model

Given a HMM, we can generate a sequence of length n as follows:

  • 1. Start at state π1 according to probability a0π1
  • 2. Emit letter x1 according to probability eπ1(x1)
  • 3. Go to state π2 according to probability aπ1π2
  • 4. … until emitting xn

[Figure: trellis; the labels a02 and e2(x1) illustrate starting in state 2 with probability a02 and emitting x1 from it with probability e2(x1)]

slide-38
SLIDE 38

The Forward Algorithm

We want to calculate P(x) = the probability of x given the HMM, where x = x1,…,xN.
Sum over all possible ways of generating x:
P(x) = Σall paths π P(x, π)
To avoid summing over an exponential number of paths π, define
fk(i) = P(x1,…,xi, πi = k)   (the forward probability)

slide-39
SLIDE 39

The Forward Algorithm – derivation

The forward probability:
fk(i) = P(x1…xi, πi = k)
= Σπ1…πi-1 P(x1…xi-1, π1, …, πi-1, πi = k) ek(xi)
= Σr Σπ1…πi-2 P(x1…xi-1, π1, …, πi-2, πi-1 = r) ark ek(xi)
= Σr P(x1…xi-1, πi-1 = r) ark ek(xi)
= ek(xi) Σr fr(i-1) ark

slide-40
SLIDE 40

The Forward Algorithm

A dynamic programming algorithm:
Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0
Iteration:
  fk(i) = ek(xi) Σr fr(i-1) ark
Termination:
  P(x) = Σk fk(N)
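A direct transcription of this recursion (Python; same dictionary-based parameters as in the earlier sketches, no end state):

```python
# Minimal forward-algorithm sketch following slide 40 (no end state).
def forward(x, states, start, trans, emit):
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][r] * trans[r][k] for r in states)
                  for k in states})
    return sum(f[-1][k] for k in states), f   # P(x) and the forward table

# Example with the casino parameters from the sketch after slide 12:
# P_x, f = forward([1, 2, 1, 6, 6], ("F", "L"), START, TRANS, EMIT)
# P_x is the total probability of the observed rolls under the model.
```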

slide-41
SLIDE 41

The Forward Algorithm

If our model has an “end” state:
Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0
Iteration:
  fk(i) = ek(xi) Σr fr(i-1) ark
Termination:
  P(x) = Σk fk(N) ak0
where ak0 is the probability that the terminating state is k (usually = a0k)

slide-42
SLIDE 42

Relation between Forward and Viterbi

VITERBI

Initialization: V0(0) = 1 Vk(0) = 0, for all k > 0 Iteration: Vj(i) = ej(xi) maxk Vk(i-1) akj Termination: P(x, π*) = maxk Vk(N)

FORWARD

Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fl(i) = el(xi) Σk fk(i-1) akl Termination: P(x) = Σk fk(N)

slide-43
SLIDE 43

The most likely state

Given a sequence x, what is the most likely state that emitted xi? In other words, we want to compute P(πi = k | x).

Example: the dishonest casino
Say x = 12341623162616364616234161221341
Most likely path: π = FF……F
However: the positions marked on the slide (the runs of 6's) are individually more likely to be L

slide-44
SLIDE 44

Motivation for the Backward Algorithm

We want to compute P(πi = k | x). We start by computing P(πi = k, x):
P(πi = k, x) = P(x1…xi, πi = k, xi+1…xN)
= P(x1…xi, πi = k) P(xi+1…xN | x1…xi, πi = k)
= P(x1…xi, πi = k) P(xi+1…xN | πi = k)
Then:
P(πi = k | x) = P(πi = k, x) / P(x) = fk(i) bk(i) / P(x)

Forward: fk(i)    Backward: bk(i)

slide-45
SLIDE 45

The Backward Algorithm – derivation

Define the backward probability:
bk(i) = P(xi+1…xN | πi = k)
= Σπi+1…πN P(xi+1, xi+2, …, xN, πi+1, …, πN | πi = k)
= Σr Σπi+2…πN P(xi+1, xi+2, …, xN, πi+1 = r, πi+2, …, πN | πi = k)
= Σr er(xi+1) akr Σπi+2…πN P(xi+2, …, xN, πi+2, …, πN | πi+1 = r)
= Σr er(xi+1) akr br(i+1)

slide-46
SLIDE 46

The Backward Algorithm

A dynamic programming algorithm for bk(i):
Initialization:
  bk(N) = 1, for all k
Iteration:
  bk(i) = Σr er(xi+1) akr br(i+1)
Termination:
  P(x) = Σr a0r er(x1) br(1)
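A mirror-image sketch of the forward code (Python; same dictionary-based parameters, no end state):

```python
# Minimal backward-algorithm sketch following slide 46 (no end state).
def backward(x, states, start, trans, emit):
    N = len(x)
    b = [dict() for _ in range(N)]
    b[N - 1] = {k: 1.0 for k in states}                       # initialization
    for i in range(N - 2, -1, -1):
        for k in states:
            b[i][k] = sum(emit[r][x[i + 1]] * trans[k][r] * b[i + 1][r]
                          for r in states)
    p_x = sum(start[r] * emit[r][x[0]] * b[0][r] for r in states)   # termination
    return p_x, b
```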

slide-47
SLIDE 47

The Backward Algorithm

In case of an “end” state:
Initialization:
  bk(N) = ak0, for all k
Iteration:
  bk(i) = Σr er(xi+1) akr br(i+1)
Termination:
  P(x) = Σr a0r er(x1) br(1)

slide-48
SLIDE 48

Example

Observed Sequence: x = 1,2,1,6,6 P(x) = 0.0002195337

Conditional Probabilities given x

Position   P(πi = F | x)   P(πi = L | x)
   1         0.5029927       0.4970073
   2         0.4752192       0.5247808
   3         0.4116375       0.5883625
   4         0.2945388       0.7054612
   5         0.2665376       0.7334624

slide-49
SLIDE 49

Computational Complexity

What are the running time and space required for the Forward and Backward algorithms?
Time: O(K²N)   Space: O(KN)
A useful implementation technique to avoid underflows:

  • Rescaling at each position by multiplying by a constant
slide-50
SLIDE 50

Scaling

  • Define:
  • Recursion:
  • Choosing scaling factors such that

Leads to: Scaling for the backward probabilities:
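The slide's formulas are shown as images; as a sketch of one standard choice (not necessarily the exact notation used on the slide), each forward column can be rescaled to sum to 1 and the logs of the scaling factors accumulated:

```python
import math

# Scaled forward pass: at each position divide by s_i = sum_k f~_k(i),
# so the scaled values sum to 1 and log P(x) = sum_i log s_i.
def forward_scaled(x, states, start, trans, emit):
    scaled = {k: start[k] * emit[k][x[0]] for k in states}
    s = sum(scaled.values())                          # s_1
    scaled = {k: v / s for k, v in scaled.items()}
    log_px = math.log(s)
    for i in range(1, len(x)):
        f = {k: emit[k][x[i]] * sum(scaled[r] * trans[r][k] for r in states)
             for k in states}
        s = sum(f.values())                           # s_i
        scaled = {k: v / s for k, v in f.items()}
        log_px += math.log(s)
    return log_px                                     # log P(x)
```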

slide-51
SLIDE 51

Posterior Decoding

We can now calculate
P(πi = k | x) = fk(i) bk(i) / P(x)
and ask: what is the most likely state at position i of sequence x?
Using posterior decoding we can now define:
π̂i = argmaxk P(πi = k | x)
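Combining the forward and backward sketches from the previous slides gives the posterior directly (a sketch; forward() and backward() are the functions defined earlier):

```python
# Posterior probability of each state at each position:
# P(pi_i = k | x) = f_k(i) * b_k(i) / P(x)
def posteriors(x, states, start, trans, emit):
    p_x, f = forward(x, states, start, trans, emit)
    _, b = backward(x, states, start, trans, emit)
    return [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(len(x))]

def posterior_path(x, states, start, trans, emit):
    """Posterior decoding: the most likely state at each position."""
    return [max(p, key=p.get) for p in posteriors(x, states, start, trans, emit)]

# For x = [1, 2, 1, 6, 6] and the casino parameters this gives, per position,
# P(F | x) and P(L | x), analogous to the table on slide 48.
```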

slide-52
SLIDE 52

Posterior Decoding

  • For each state, posterior decoding gives us a curve of the probability of that state at each position, given the sequence x
  • That is sometimes more informative than the Viterbi path π*
  • Posterior decoding may give an invalid sequence of states
  • Why?
slide-53
SLIDE 53

Viterbi vs. posterior decoding

  • A class takes a multiple-choice test.
  • How does the lazy professor construct the answer key?
  • Viterbi approach: use the answers of the best student
  • Posterior decoding: majority vote on each question
slide-54
SLIDE 54

Viterbi, Forward, Backward

VITERBI

Initialization: V0(0) = 1 Vk(0) = 0, for all k > 0 Iteration: Vr(i) = er(xi) maxk Vk(i-1) akr Termination: P(x, π*) = maxk Vk(N)

FORWARD

Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fr(i) = er(xi) Σk fk(i-1) akr Termination: P(x) = Σk fk(N) ak0

BACKWARD

Initialization: bk(N) = ak0, for all k Iteration: br(i) = Σk ek(xi+1) ark bk(i+1) Termination: P(x) = Σk a0k ek(x1) bk(1)

slide-55
SLIDE 55

A modeling example: CpG islands in DNA sequences

[Figure: model with (+) states A+, C+, G+, T+ and (–) states A–, C–, G–, T–]

slide-56
SLIDE 56

Methylation & Silencing

  • Methylation: addition of a CH3 group to C nucleotides
  • Silences genes in the region
  • CG (denoted CpG) often mutates to TG when methylated
  • Methylation is inherited during cell division

slide-57
SLIDE 57

CpG Islands

CpG dinucleotides in the genome are frequently methylated:
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands

slide-58
SLIDE 58

CpG Islands

  • In CpG islands:
  • CG is more frequent
  • Other dinucleotides (AA, AG, AT…) have different frequencies

Problem: Detect CpG islands

slide-59
SLIDE 59

A model of CpG Islands – Architecture

[Figure: architecture with (+) states A+, C+, G+, T+ (CpG island) and (–) states A–, C–, G–, T– (non-CpG island)]

slide-60
SLIDE 60

A model of CpG Islands – Transitions

How do we estimate the parameters of the model?
Emission probabilities: 1/0 (each state emits only its corresponding nucleotide).
1. Transition probabilities within CpG islands: established from known CpG islands.
2. Transition probabilities within non-CpG islands: established from non-CpG islands.

(+) model (within CpG islands):
        A      C      G      T
  A   .180   .274   .426   .120
  C   .171   .368   .274   .188
  G   .161   .339   .375   .125
  T   .079   .355   .384   .182

(–) model (outside CpG islands):
        A      C      G      T
  A   .300   .205   .285   .210
  C   .233   .298   .078   .302
  G   .248   .246   .298   .208
  T   .177   .239   .292   .292

slide-61
SLIDE 61

A model of CpG Islands – Transitions

  • What about transitions between (+) and (-) states?
  • Their probabilities affect
  • Avg. length of CpG island
  • Avg. separation between two CpG islands

[Figure: two states X and Y; X stays in X with probability p and moves to Y with probability 1–p; Y stays in Y with probability q and moves to X with probability 1–q]

Length distribution of the X region:
P[LX = 1] = 1–p
P[LX = 2] = p(1–p)
…
P[LX = k] = p^(k-1)(1–p)
E[LX] = 1/(1–p)
Geometric distribution, with mean 1/(1–p)

slide-62
SLIDE 62

A model of CpG Islands – Transitions

No reason to favor exiting/entering the (+) and (–) regions at a particular nucleotide.

  • Estimate the average length LCpG of a CpG island:
    LCpG = 1/(1–p) ⇒ p = 1 – 1/LCpG
  • For each pair of (+) states k, r: a+kr ← p × a+kr
  • For each (+) state k and (–) state r: akr = (1–p) · a0r
  • Do the same for the (–) states (using q)

A problem with this model: CpG islands don't have a geometric length distribution. This is a defect of HMMs – a price we pay for ease of analysis & efficient computation.

[Figure: the eight-state model (A+, C+, G+, T+, A–, C–, G–, T–) with cross-group transition probabilities 1–p and 1–q]
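A small sketch of this rescaling step (Python; the state labels, transition table, and average length are placeholders for illustration, not values from the lecture):

```python
# Sketch: add switching transitions to the (+) block of the CpG model.
# Within-(+) transitions are scaled by p; the remaining mass (1-p) goes to
# the (-) states in proportion to their initial probabilities a0r.
def add_switching(trans_plus, init_minus, avg_island_len):
    p = 1.0 - 1.0 / avg_island_len            # from E[L] = 1/(1-p)
    out = {}
    for k, row in trans_plus.items():
        out[k] = {r: p * a for r, a in row.items()}                       # (+) -> (+)
        out[k].update({r: (1.0 - p) * a0 for r, a0 in init_minus.items()})  # (+) -> (-)
    return out

# The (-) block is handled symmetrically with q = 1 - 1/avg_separation.
```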

slide-63
SLIDE 63

Using the model

Given a DNA sequence x, the Viterbi algorithm predicts the locations of CpG islands.
Given a nucleotide xi (say xi = A), the Viterbi parse tells whether xi is in a CpG island in the most likely parse.
Using the Forward/Backward algorithms we can calculate
P(xi is in a CpG island) = P(πi = A+ | x)
Posterior decoding can assign locally optimal predictions of CpG islands:
π̂i = argmaxk P(πi = k | x)

slide-64
SLIDE 64

Posterior decoding

Results of applying posterior decoding to a part of human chromosome 22

slide-65
SLIDE 65

Posterior decoding

slide-66
SLIDE 66

Viterbi decoding

slide-67
SLIDE 67

Sliding window (size = 100)

slide-68
SLIDE 68

Sliding window (size = 200)

slide-69
SLIDE 69

Sliding window (size = 300)

slide-70
SLIDE 70

Sliding window (size = 400)

slide-71
SLIDE 71

Sliding window (size = 600)

slide-72
SLIDE 72

Sliding window (size = 1000)

slide-73
SLIDE 73

Modeling CpG islands with silent states

slide-74
SLIDE 74

What if a new genome comes?

  • Suppose we just sequenced the porcupine genome
  • We know CpG islands play the same role in this genome
  • However, we have no known CpG islands for porcupines
  • We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model? LEARNING

slide-75
SLIDE 75

Two learning scenarios

  • Estimation when the “right answer” is known
    Examples:
    GIVEN: a genomic region x = x1…xN where we have good (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening as he changes the dice and produces 10,000 rolls
  • Estimation when the “right answer” is unknown
    Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

GOAL: Update the parameters θ of the model to maximize P(x | θ)

slide-76
SLIDE 76

When the right answer is known

Given x = x1…xN for which π = π1…πN is known, define:

Akr = # of times the k → r transition occurs in π
Ek(b) = # of times state k in π emits b in x

The maximum likelihood estimates of the parameters are:

akr = Akr / Σi Aki
ek(b) = Ek(b) / Σc Ek(c)

slide-77
SLIDE 77

When the right answer is known

Intuition: when we know the underlying states, the best estimate is the average frequency of transitions & emissions that occur in the training data.

Drawback: given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable. Zero probabilities – VERY BAD.

Example: suppose we observe
x = 2 1 5 6 1 2 3 6 2 3 5 3 4 2 1 2 1 6 3 6
π = F F F F F F F F F F F F F F F L L L L L
Then:
aFF = 14/15, aFL = 1/15, aLL = 1, aLF = 0, eL(4) = 0
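A sketch of these counts in code (Python); on the example above it reproduces aFF = 14/15, aFL = 1/15, aLL = 1, aLF = 0 and eL(4) = 0:

```python
from collections import defaultdict

# Maximum-likelihood estimation when the state path is known (slide 76).
# Pseudocounts (slide 78) could be added to A and E before normalizing.
def ml_estimate(x, pi, states, symbols):
    A = {k: defaultdict(float) for k in states}   # transition counts A_kr
    E = {k: defaultdict(float) for k in states}   # emission counts   E_k(b)
    for i in range(len(x)):
        E[pi[i]][x[i]] += 1
        if i + 1 < len(x):
            A[pi[i]][pi[i + 1]] += 1
    a = {k: {r: A[k][r] / sum(A[k].values()) for r in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a, e

x  = [2,1,5,6,1,2,3,6,2,3,5,3,4,2,1,2,1,6,3,6]
pi = ["F"] * 15 + ["L"] * 5
a, e = ml_estimate(x, pi, ("F", "L"), range(1, 7))
print(a["F"]["F"], a["F"]["L"], a["L"]["L"], a["L"]["F"], e["L"][4])
```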

slide-78
SLIDE 78

Pseudocounts

Solution for small training sets: add pseudocounts.
Akr = (# of times the k→r transition occurs in π) + tkr
Ek(b) = (# of times state k emits symbol b in x) + tk(b)
tkr and tk(b) are pseudocounts representing our prior belief.
Larger pseudocounts ⇒ strong prior belief.
Small pseudocounts (ε < 1): just to avoid 0 probabilities.

slide-79
SLIDE 79

Pseudocounts

akr = (Akr + tkr) / Σi (Aki + tki)
ek(b) = (Ek(b) + tk(b)) / Σc (Ek(c) + tk(c))

slide-80
SLIDE 80

Pseudocounts

Example: the dishonest casino.
We will observe the player for one day, 600 rolls. Reasonable pseudocounts:
t0F = t0L = tF0 = tL0 = 1
tFL = tLF = tFF = tLL = 1
tF(1) = tF(2) = … = tF(6) = 20 (strong belief that fair is fair)
tL(1) = tL(2) = … = tL(6) = 5 (wait and see for loaded)
The numbers above are arbitrary – assigning priors is an art.

slide-81
SLIDE 81

When the right answer is unknown

We don’t know the actual Akr, Ek(b) Idea:

  • Initialize the model
  • Compute Akr, Ek(b)
  • Update the parameters of the model, based on Akr, Ek(b)
  • Repeat until convergence

Two algorithms: Baum-Welch, Viterbi training
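A skeleton of the Viterbi-training variant of this loop (a sketch; viterbi() and ml_estimate() are the functions sketched earlier, and the convergence test is simplistic). Baum-Welch follows the same loop but replaces the hard Viterbi counts with expected counts from the forward-backward probabilities:

```python
# Viterbi training: decode with the current parameters, re-estimate from the
# decoded path, and repeat until the parameters stop changing.
def viterbi_training(x, states, symbols, start, trans, emit, n_iter=20):
    for _ in range(n_iter):
        _, path = viterbi(x, states, start, trans, emit)
        new_trans, new_emit = ml_estimate(x, path, states, symbols)
        if new_trans == trans and new_emit == emit:
            break                                # converged
        trans, emit = new_trans, new_emit
    return trans, emit
```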

slide-82
SLIDE 82

Gene finding with HMMs

slide-83
SLIDE 83
slide-84
SLIDE 84
slide-85
SLIDE 85

Example: genes in prokaryotes

  • EasyGene: a prokaryotic gene finder (Larsen TS, Krogh A)
  • Codons are modeled with 3 looped triplets of states (this converts the length distribution to a negative binomial)
slide-86
SLIDE 86

Using the gene finder

  • Apply the gene finder to both strands
  • Training using annotated genes
slide-87
SLIDE 87

A simple eukaryotic gene finder