slide-1
SLIDE 1

Hidden Markov Models and Applications

Fall 2019, Oct 1 & 3, 2019

slide-2
SLIDE 2

Gene finding in prokaryotes

slide-3
SLIDE 3

Reading frames

  • A protein is coded by groups of three nucleotides (codons):
    …ACGTACGTACGTACGT…
    …ACG-TAC-GTA-CGT-ACG-T…
    …TYVRT…
  • There are two other ways in which this sequence can be decomposed into codons:
    …A-CGT-ACG-TAC-GTA-CGT…
    …AC-GTA-CGT-ACG-TAC-GT…
  • These are the three reading frames
  • The complementary strand has three additional reading frames

slide-4
SLIDE 4

Coding for a protein

  • Three non-overlapping nucleotides (a codon) code for an amino acid
  • Each amino acid has more than one codon that codes for it
  • The code is almost universal across organisms
  • Codons which code for the same amino acid are similar
  • Six reading frames (see the sketch below)
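A short sketch of how the six reading frames can be enumerated in code (Python; the helper names are illustrative, not from the slides):

```python
# Minimal sketch: enumerate the six reading frames of a DNA sequence.
def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def codons(seq, offset):
    """Split seq into codons starting at the given offset (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_reading_frames(seq):
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):          # three frames per strand
            frames.append(codons(strand, offset))
    return frames

if __name__ == "__main__":
    for frame in six_reading_frames("ACGTACGTACGTACGT"):
        print("-".join(frame))
```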
slide-5
SLIDE 5

Coding for a protein

  • Every gene starts with the start codon ATG (AUG in mRNA). This specifies the reading frame and the start of the translation site.
  • The protein sequence continues until a stop codon (UGA, UAA, UAG) is encountered.
  • DNA (template strand): TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT
  • RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
  • Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
slide-6
SLIDE 6

Open Reading Frames (ORFs)

  • An open reading frame (ORF) is a sequence whose length is a multiple of 3, starts with the start codon, and ends with a stop codon, with no stop codon in the middle.
  • How do we determine whether an ORF is a protein-coding gene? If we see a long run of non-stop codons after a start codon, it has a low probability of arising by chance (sketched below).
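As a rough illustration of this idea, a naive ORF scanner for one reading frame (Python; the threshold and function names are assumptions for the example, not from the slides):

```python
# Minimal sketch: find ORFs of at least min_codons codons in one reading frame.
# min_codons plays the role of the threshold k discussed on the next slides.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, offset, min_codons=62):
    """Yield (start, end) indices of ORFs in the reading frame given by offset."""
    i = offset
    while i + 3 <= len(seq):
        if seq[i:i + 3] == START:
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOPS:
                j += 3
            if j + 3 <= len(seq) and (j + 3 - i) // 3 >= min_codons:
                yield (i, j + 3)        # ORF including the stop codon
            i = j + 3
        else:
            i += 3

if __name__ == "__main__":
    seq = "ATG" + "GCT" * 70 + "TAA"    # toy sequence: start, 70 codons, stop
    print(list(orfs_in_frame(seq, 0)))
```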

slide-7
SLIDE 7

A model¹ for DNA sequences

  • We need a probabilistic model assigning a probability to a DNA sequence.
  • Simplest model: nucleotides are independent and identically distributed. Probability of a sequence s = s1s2…sn:
    P(s) = P(s1)P(s2)…P(sn)
  • The distribution is characterized by the numbers (pA, pC, pG, pT) that satisfy pA + pC + pG + pT = 1 (multinomial distribution)

¹ “All models are wrong. Some are useful.” – G.E.P. Box
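A one-function sketch of this i.i.d. model (Python; the example frequencies are placeholders, not estimates from data):

```python
import math

# Minimal sketch of the i.i.d. (multinomial) sequence model.
P = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # must sum to 1

def log_prob(seq, p=P):
    """log P(s) = sum_i log P(s_i) under the independence assumption."""
    return sum(math.log(p[b]) for b in seq)

print(log_prob("ACGTACGT"))   # 8 * log(0.25) ≈ -11.09
```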

slide-8
SLIDE 8

A method for gene finding

  • Under our model, assuming that the four nucleotides have equal frequencies:
    P(run of k non-stop codons) = (61/64)^k
  • Choose k such that P(run of k non-stop codons) is smaller than the rate of false positives we are willing to accept. At the 5% level we get k = 62 (see the computation below).
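The cutoff can be checked directly; a minimal computation (Python), which comes out to roughly the k = 62 quoted above:

```python
import math

# Smallest k with (61/64)**k below the 5% false-positive level.
alpha = 0.05
k = math.log(alpha) / math.log(61 / 64)
print(k)                      # ≈ 62.4, i.e. about 62 codons
print((61 / 64) ** 62)        # ≈ 0.051
```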

slide-9
SLIDE 9

Information we ignored

  • Coding regions have different nucleotide usage
  • Different statistics of di-nucleotides and 3-mers in coding vs. non-coding regions
  • The promoter region contains sequence signals for binding of TFs and RNA polymerase
  • Need better models!
  • But this works surprisingly well in prokaryotes
  • Need more detailed models for eukaryotes (introns/exons)

slide-10
SLIDE 10

Hidden Markov Models

slide-11
SLIDE 11

Example: The Dishonest Casino

Game:

  • 1. You bet $1
  • 2. You roll
  • 3. Casino player rolls
  • 4. Highest number wins $2

The casino has two dice:
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The casino player switches between the fair and loaded die (not too often, and not for too long)

slide-12
SLIDE 12

The dishonest casino model

FAIR ⇄ LOADED
  P(stay FAIR) = 0.95, P(FAIR → LOADED) = 0.05
  P(stay LOADED) = 0.9, P(LOADED → FAIR) = 0.1
Fair die: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
Loaded die: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
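For concreteness, the model above can be written out as plain Python dictionaries (a sketch; the variable names are ours, and the initial probabilities ½/½ are taken from slide 21). These dictionaries are reused in the algorithm sketches later in these notes:

```python
# The dishonest-casino HMM as plain dictionaries.
STATES = ("F", "L")

START = {"F": 0.5, "L": 0.5}                                # a0,Fair = a0,Loaded = 1/2

TRANS = {
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}

EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},                   # fair die
    "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 0.5},      # loaded die
}
```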

slide-13
SLIDE 13

Question # 1 – Evaluation

GIVEN:

A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION:

How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs

Prob = 1.3 × 10^-35

slide-14
SLIDE 14

Question # 2 – Decoding

GIVEN:

A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION:

What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs

FAIR … LOADED … FAIR (the answer, as marked on the slide)

slide-15
SLIDE 15

Question # 3 – Learning

GIVEN:

A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION:

How does the casino player work: How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs

slide-16
SLIDE 16

The dishonest casino model

FAIR ⇄ LOADED
  P(stay FAIR) = 0.95, P(FAIR → LOADED) = 0.05
  P(stay LOADED) = 0.9, P(LOADED → FAIR) = 0.1
Fair die: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
Loaded die: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2

slide-17
SLIDE 17

Definition of a hidden Markov model

  • Alphabet Σ = { b1, b2, …, bM }
  • Set of states Q = { 1, …, K } (K = |Q|)
  • Transition probabilities between any two states:
    aij = transition probability from state i to state j
    ai1 + … + aiK = 1, for all states i
  • Initial probabilities a0i:
    a01 + … + a0K = 1
  • Emission probabilities within each state:
    ek(b) = P(xi = b | πi = k)
    ek(b1) + … + ek(bM) = 1

slide-18
SLIDE 18

Hidden states and observed sequence

At time step t, πt denotes the (hidden) state of the Markov chain, and xt denotes the symbol emitted in state πt.

A path of length N is: π1, π2, …, πN
An observed sequence of length N is: x1, x2, …, xN
slide-19
SLIDE 19

A parse of a sequence

Given a sequence x = x1……xN, A parse of x is a sequence of states π = π1, ……, πN

[Figure: trellis with K possible states at each position; the observed symbols x1, x2, x3, … run along the bottom and one state is chosen per position]

slide-20
SLIDE 20

Likelihood of a parse

Given a sequence x = x1…xN and a parse π = π1…πN, how likely is the parse (given our HMM)?

P(x, π) = P(x1, …, xN, π1, …, πN)
= P(xN, πN | x1…xN-1, π1, …, πN-1) P(x1…xN-1, π1, …, πN-1)
= P(xN, πN | πN-1) P(x1…xN-1, π1, …, πN-1)
= …
= P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) … P(x2, π2 | π1) P(x1, π1)
= P(xN | πN) P(πN | πN-1) … P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)
= a0π1 aπ1π2 … aπN-1πN eπ1(x1) … eπN(xN)
= ∏i=1..N aπi-1πi eπi(xi)   (with π0 = 0, the begin state)

[Figure: trellis of K states over positions x1 … xN]

slide-21
SLIDE 21

Example: the dishonest casino

What is the probability of a sequence of rolls x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4 and the parse

π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?

(say initial probs a0,Fair = ½, a0,Loaded = ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair)
= ½ × (1/6)^10 × (0.95)^9
= 5.2 × 10^-9

slide-22
SLIDE 22

Example: the dishonest casino

So, the likelihood that the die is fair for this whole run is 5.2 × 10^-9.
What about π = Loaded, Loaded, …, Loaded (all ten rolls)?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded)
= ½ × (1/10)^8 × (1/2)^2 × (0.9)^9
= 4.8 × 10^-10
Therefore, it is more likely that the die is fair all the way than loaded all the way.

slide-23
SLIDE 23

Example: the dishonest casino

Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
And let's consider π = F, F, …, F:
P(x, π) = ½ × (1/6)^10 × (0.95)^9 = 5.2 × 10^-9 (same as before)
And for π = L, L, …, L:
P(x, π) = ½ × (1/10)^4 × (1/2)^6 × (0.9)^9 = 3.02 × 10^-7
So, the observed sequence is ~100 times more likely if a loaded die is used.
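These joint probabilities follow directly from the product formula on slide 20. A minimal sketch (Python) that reproduces the numbers above; the dictionaries repeat the casino parameters written out after slide 12 so the snippet runs on its own:

```python
# Joint probability P(x, pi) = a0,pi1 * prod_i a_{pi(i-1),pi(i)} * e_{pi(i)}(x_i).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 0.5}}

def joint_prob(x, pi, start=START, trans=TRANS, emit=EMIT):
    p = start[pi[0]] * emit[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[pi[i - 1]][pi[i]] * emit[pi[i]][x[i]]
    return p

x = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
print(joint_prob(x, ["F"] * 10))   # ≈ 5.2e-9
print(joint_prob(x, ["L"] * 10))   # ≈ 3.0e-7
```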

slide-24
SLIDE 24

What we know

Given a sequence x = x1,…,xN and a parse π = π1,…,πN, we know how to compute P(x, π)


slide-25
SLIDE 25

What we would like to know

  • 1. Evaluation

GIVEN HMM M, and a sequence x, FIND Prob[ x | M ]

  • 2. Decoding

GIVEN HMM M, and a sequence x, FIND the sequence π of states that maximizes P[ x, π | M ]

  • 3. Learning

GIVEN HMM M, with unspecified transition/emission probs., and a sequence x, FIND parameters θ = (ei(.), aij) that maximize P[ x | θ ]

slide-26
SLIDE 26

Problem 2: Decoding – finding the best parse of a sequence

slide-27
SLIDE 27

Decoding

GIVEN x = x1x2…xN, we want to find π = π1, …, πN such that P[x, π] is maximized:
π* = argmaxπ P[x, π]

Maximize a0π1 eπ1(x1) aπ1π2 … aπN-1πN eπN(xN)

We can use dynamic programming!
Let Vk(i) = max{π1,…,πi-1} P[x1…xi-1, π1, …, πi-1, xi, πi = k]
= probability of the maximum-probability path ending in state πi = k

[Figure: trellis of K states over positions x1 … xN]

slide-28
SLIDE 28

Decoding – main idea

Inductive assumption: given
Vk(i) = max{π1,…,πi-1} P[x1…xi-1, π1, …, πi-1, xi, πi = k],
what is Vr(i+1)?

Vr(i+1) = max{π1,…,πi} P[x1…xi, π1, …, πi, xi+1, πi+1 = r]
= max{π1,…,πi} P(xi+1, πi+1 = r | x1…xi, π1, …, πi) P[x1…xi, π1, …, πi]
= max{π1,…,πi} P(xi+1, πi+1 = r | πi) P[x1…xi-1, π1, …, πi-1, xi, πi]
= maxk [ P(xi+1, πi+1 = r | πi = k) max{π1,…,πi-1} P[x1…xi-1, π1, …, πi-1, xi, πi = k] ]
= maxk er(xi+1) akr Vk(i)
= er(xi+1) maxk akr Vk(i)

slide-29
SLIDE 29

The Viterbi Algorithm

Input: x = x1,…,xN
Initialization:
  V0(0) = 1 (0 is the imaginary first position)
  Vk(0) = 0, for all k > 0
Iteration:
  Vj(i) = ej(xi) × maxk akj Vk(i-1)
  Ptrj(i) = argmaxk akj Vk(i-1)
Termination:
  P(x, π*) = maxk Vk(N)
Traceback:
  πN* = argmaxk Vk(N)
  πi-1* = Ptrπi*(i)

slide-30
SLIDE 30

The Viterbi Algorithm

Input: x = x1,…,xN
Initialization:
  V0(0) = 1 (0 is the imaginary first position)
  Vk(0) = 0, for all k > 0
Iteration:
  Vj(i) = ej(xi) × maxk akj Vk(i-1)
  Ptrj(i) = argmaxk akj Vk(i-1)
Termination:
  P(x, π*) = maxk ak0 Vk(N) (with an end state)
Traceback:
  πN* = argmaxk Vk(N)
  πi-1* = Ptrπi*(i)
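A compact sketch of this recursion (Python; dictionary-based, no end state; the function and variable names are ours, not from the slides):

```python
# Minimal Viterbi sketch following slides 29-30 (probability space, no end state).
def viterbi(x, states, start, trans, emit):
    V = [{k: start[k] * emit[k][x[0]] for k in states}]      # position 1
    ptr = [{}]
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for j in states:
            best = max(states, key=lambda k: V[i - 1][k] * trans[k][j])
            V[i][j] = emit[j][x[i]] * trans[best][j] * V[i - 1][best]
            ptr[i][j] = best
    # Traceback of the best path.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return V[-1][last], path[::-1]

# Example with the casino parameters from the sketch after slide 12:
# p, path = viterbi([1, 2, 1, 6, 6], ("F", "L"), START, TRANS, EMIT)
```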

slide-31
SLIDE 31

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence

[Figure: K × N dynamic-programming matrix; rows are the states 1…K, columns are the positions x1…xN; cell (j, i) holds Vj(i)]

slide-32
SLIDE 32

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence
Time: O(K²N)   Space: O(KN)

slide-33
SLIDE 33

Viterbi Algorithm – a practical detail

Underflows are a significant problem:
P[x1,…,xi, π1, …, πi] = a0π1 aπ1π2 … aπi-1πi eπ1(x1) … eπi(xi)
These numbers become extremely small – underflow.
Solution: take the logs of all values:
Vr(i) = log er(xi) + maxk [ Vk(i-1) + log akr ]
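In the Viterbi sketch above this only changes the inner update; a minimal log-space version of that step (Python):

```python
import math

# Log-space Viterbi update: V_r(i) = log e_r(x_i) + max_k [ V_k(i-1) + log a_kr ]
def log_viterbi_update(logV_prev, r, xi, states, trans, emit):
    return math.log(emit[r][xi]) + max(logV_prev[k] + math.log(trans[k][r])
                                       for k in states)
```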

slide-34
SLIDE 34

Example

Let x be a sequence with a portion in which 6 appears with probability 1/6, followed by a portion in which 6 appears with probability 1/2:
x = 123456123456…12345 6626364656…1626364656
It is easy to convince yourself that the optimal parse is:
FFF…………………...F LLL………………………...L

slide-35
SLIDE 35

Example

Observed sequence: x = 1, 2, 1, 6, 6
P(x) = 0.0002195337
Best 8 paths:
  LLLLL 0.0001018
  FFFFF 0.0000524
  FFFLL 0.0000248
  FFLLL 0.0000149
  FLLLL 0.0000089
  FFFFL 0.0000083
  LLLLF 0.0000018
  LFFFF 0.0000017

slide-36
SLIDE 36

Problem 1: Evaluation – finding the probability that a sequence is generated by the model

slide-37
SLIDE 37

Generating a sequence by the model

Given a HMM, we can generate a sequence of length n as follows:

  • 1. Start at state π1 according to probability a0π1
  • 2. Emit letter x1 according to probability eπ1(x1)
  • 3. Go to state π2 according to probability aπ1π2
  • 4. … until emitting xn

[Figure: trellis; the labels a02 and e2(x1) illustrate starting in state 2 with probability a02 and emitting x1 from it with probability e2(x1)]

slide-38
SLIDE 38

The Forward Algorithm

We want to calculate P(x) = the probability of x given the HMM, where x = x1,…,xN.
Sum over all possible ways of generating x:
P(x) = Σall paths π P(x, π)
To avoid summing over an exponential number of paths π, define
fk(i) = P(x1,…,xi, πi = k)   (the forward probability)

slide-39
SLIDE 39

The Forward Algorithm – derivation

The forward probability:
fk(i) = P(x1…xi, πi = k)
= Σπ1…πi-1 P(x1…xi-1, π1, …, πi-1, πi = k) ek(xi)
= Σr Σπ1…πi-2 P(x1…xi-1, π1, …, πi-2, πi-1 = r) ark ek(xi)
= Σr P(x1…xi-1, πi-1 = r) ark ek(xi)
= ek(xi) Σr fr(i-1) ark

slide-40
SLIDE 40

The Forward Algorithm

A dynamic programming algorithm:
Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0
Iteration:
  fk(i) = ek(xi) Σr fr(i-1) ark
Termination:
  P(x) = Σk fk(N)
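A direct transcription of this recursion (Python; same dictionary-based parameters as in the earlier sketches, no end state):

```python
# Minimal forward-algorithm sketch following slide 40 (no end state).
def forward(x, states, start, trans, emit):
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][r] * trans[r][k] for r in states)
                  for k in states})
    return sum(f[-1][k] for k in states), f   # P(x) and the forward table

# Example with the casino parameters from the sketch after slide 12:
# P_x, f = forward([1, 2, 1, 6, 6], ("F", "L"), START, TRANS, EMIT)
# P_x is the total probability of the observed rolls under the model.
```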

slide-41
SLIDE 41

The Forward Algorithm

If our model has an “end” state:
Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0
Iteration:
  fk(i) = ek(xi) Σr fr(i-1) ark
Termination:
  P(x) = Σk fk(N) ak0
where ak0 is the probability that the terminating state is k (usually = a0k)

slide-42
SLIDE 42

Relation between Forward and Viterbi

VITERBI

Initialization: V0(0) = 1 Vk(0) = 0, for all k > 0 Iteration: Vj(i) = ej(xi) maxk Vk(i-1) akj Termination: P(x, π*) = maxk Vk(N)

FORWARD

Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fl(i) = el(xi) Σk fk(i-1) akl Termination: P(x) = Σk fk(N)

slide-43
SLIDE 43

The most likely state

Given a sequence x, what is the most likely state that emitted xi? In other words, we want to compute P(πi = k | x).

Example: the dishonest casino
Say x = 12341623162616364616234161221341
Most likely path: π = FF……F
However: the positions marked on the slide (the runs of 6's) are individually more likely to be L

slide-44
SLIDE 44

Motivation for the Backward Algorithm

We want to compute P(πi = k | x). We start by computing P(πi = k, x):
P(πi = k, x) = P(x1…xi, πi = k, xi+1…xN)
= P(x1…xi, πi = k) P(xi+1…xN | x1…xi, πi = k)
= P(x1…xi, πi = k) P(xi+1…xN | πi = k)
Then:
P(πi = k | x) = P(πi = k, x) / P(x) = fk(i) bk(i) / P(x)

Forward: fk(i)    Backward: bk(i)

slide-45
SLIDE 45

The Backward Algorithm – derivation

Define the backward probability:
bk(i) = P(xi+1…xN | πi = k)
= Σπi+1…πN P(xi+1, xi+2, …, xN, πi+1, …, πN | πi = k)
= Σr Σπi+2…πN P(xi+1, xi+2, …, xN, πi+1 = r, πi+2, …, πN | πi = k)
= Σr er(xi+1) akr Σπi+2…πN P(xi+2, …, xN, πi+2, …, πN | πi+1 = r)
= Σr er(xi+1) akr br(i+1)

slide-46
SLIDE 46

The Backward Algorithm

A dynamic programming algorithm for bk(i):
Initialization:
  bk(N) = 1, for all k
Iteration:
  bk(i) = Σr er(xi+1) akr br(i+1)
Termination:
  P(x) = Σr a0r er(x1) br(1)
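A mirror-image sketch of the forward code (Python; same dictionary-based parameters, no end state):

```python
# Minimal backward-algorithm sketch following slide 46 (no end state).
def backward(x, states, start, trans, emit):
    N = len(x)
    b = [dict() for _ in range(N)]
    b[N - 1] = {k: 1.0 for k in states}                       # initialization
    for i in range(N - 2, -1, -1):
        for k in states:
            b[i][k] = sum(emit[r][x[i + 1]] * trans[k][r] * b[i + 1][r]
                          for r in states)
    p_x = sum(start[r] * emit[r][x[0]] * b[0][r] for r in states)   # termination
    return p_x, b
```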

slide-47
SLIDE 47

The Backward Algorithm

In case of an “end” state:
Initialization:
  bk(N) = ak0, for all k
Iteration:
  bk(i) = Σr er(xi+1) akr br(i+1)
Termination:
  P(x) = Σr a0r er(x1) br(1)

slide-48
SLIDE 48

Example

Observed Sequence: x = 1,2,1,6,6 P(x) = 0.0002195337

Conditional Probabilities given x

Position   P(πi = F | x)   P(πi = L | x)
   1         0.5029927       0.4970073
   2         0.4752192       0.5247808
   3         0.4116375       0.5883625
   4         0.2945388       0.7054612
   5         0.2665376       0.7334624

slide-49
SLIDE 49

Computational Complexity

What are the running time and space required for the Forward and Backward algorithms?
Time: O(K²N)   Space: O(KN)
A useful implementation technique to avoid underflows:

  • Rescaling at each position by multiplying by a constant
slide-50
SLIDE 50

Scaling

  • Define:
  • Recursion:
  • Choosing scaling factors such that

Leads to: Scaling for the backward probabilities:
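The slide's formulas are shown as images; as a sketch of one standard choice (not necessarily the exact notation used on the slide), each forward column can be rescaled to sum to 1 and the logs of the scaling factors accumulated:

```python
import math

# Scaled forward pass: at each position divide by s_i = sum_k f~_k(i),
# so the scaled values sum to 1 and log P(x) = sum_i log s_i.
def forward_scaled(x, states, start, trans, emit):
    scaled = {k: start[k] * emit[k][x[0]] for k in states}
    s = sum(scaled.values())                          # s_1
    scaled = {k: v / s for k, v in scaled.items()}
    log_px = math.log(s)
    for i in range(1, len(x)):
        f = {k: emit[k][x[i]] * sum(scaled[r] * trans[r][k] for r in states)
             for k in states}
        s = sum(f.values())                           # s_i
        scaled = {k: v / s for k, v in f.items()}
        log_px += math.log(s)
    return log_px                                     # log P(x)
```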

slide-51
SLIDE 51

Posterior Decoding

We can now calculate
P(πi = k | x) = fk(i) bk(i) / P(x)
and ask: what is the most likely state at position i of sequence x?
Using posterior decoding we can now define:
π̂i = argmaxk P(πi = k | x)
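Combining the forward and backward sketches from the previous slides gives the posterior directly (a sketch; forward() and backward() are the functions defined earlier):

```python
# Posterior probability of each state at each position:
# P(pi_i = k | x) = f_k(i) * b_k(i) / P(x)
def posteriors(x, states, start, trans, emit):
    p_x, f = forward(x, states, start, trans, emit)
    _, b = backward(x, states, start, trans, emit)
    return [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(len(x))]

def posterior_path(x, states, start, trans, emit):
    """Posterior decoding: the most likely state at each position."""
    return [max(p, key=p.get) for p in posteriors(x, states, start, trans, emit)]

# For x = [1, 2, 1, 6, 6] and the casino parameters this gives, per position,
# P(F | x) and P(L | x), analogous to the table on slide 48.
```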

slide-52
SLIDE 52

Posterior Decoding

  • For each state, posterior decoding gives us a curve of the probability of that state at each position, given the sequence x
  • That is sometimes more informative than the Viterbi path π*
  • Posterior decoding may give an invalid sequence of states
  • Why?
slide-53
SLIDE 53

Viterbi vs. posterior decoding

  • A class takes a multiple-choice test.
  • How does the lazy professor construct the answer key?
  • Viterbi approach: use the answers of the best student
  • Posterior decoding: majority vote on each question
slide-54
SLIDE 54

Viterbi, Forward, Backward

VITERBI

Initialization: V0(0) = 1 Vk(0) = 0, for all k > 0 Iteration: Vr(i) = er(xi) maxk Vk(i-1) akr Termination: P(x, π*) = maxk Vk(N)

FORWARD

Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fr(i) = er(xi) Σk fk(i-1) akr Termination: P(x) = Σk fk(N) ak0

BACKWARD

Initialization: bk(N) = ak0, for all k Iteration: br(i) = Σk ek(xi+1) ark bk(i+1) Termination: P(x) = Σk a0k ek(x1) bk(1)

slide-55
SLIDE 55

A modeling example: CpG islands in DNA sequences

[Figure: model with (+) states A+, C+, G+, T+ and (–) states A–, C–, G–, T–]

slide-56
SLIDE 56

Methylation & Silencing

  • Methylation: addition of a CH3 group to C nucleotides
  • Silences genes in the region
  • CG (denoted CpG) often mutates to TG when methylated
  • Methylation is inherited during cell division

slide-57
SLIDE 57

CpG Islands

CpG dinucleotides in the genome are frequently methylated:
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands

slide-58
SLIDE 58

CpG Islands

  • In CpG islands:
  • CG is more frequent
  • Other dinucleotides (AA, AG, AT…) have different frequencies

Problem: Detect CpG islands

slide-59
SLIDE 59

A model of CpG Islands – Architecture

[Figure: architecture with (+) states A+, C+, G+, T+ (CpG island) and (–) states A–, C–, G–, T– (non-CpG island)]

slide-60
SLIDE 60

A model of CpG Islands – Transitions

How do we estimate the parameters of the model?
Emission probabilities: 1/0 (each state emits only its corresponding nucleotide).
1. Transition probabilities within CpG islands: established from known CpG islands.
2. Transition probabilities within non-CpG islands: established from non-CpG islands.

(+) model (within CpG islands):
        A      C      G      T
  A   .180   .274   .426   .120
  C   .171   .368   .274   .188
  G   .161   .339   .375   .125
  T   .079   .355   .384   .182

(–) model (outside CpG islands):
        A      C      G      T
  A   .300   .205   .285   .210
  C   .233   .298   .078   .302
  G   .248   .246   .298   .208
  T   .177   .239   .292   .292

slide-61
SLIDE 61

A model of CpG Islands – Transitions

  • What about transitions between (+) and (-) states?
  • Their probabilities affect
  • Avg. length of CpG island
  • Avg. separation between two CpG islands

[Figure: two states X and Y; X stays in X with probability p and moves to Y with probability 1–p; Y stays in Y with probability q and moves to X with probability 1–q]

Length distribution of the X region:
P[LX = 1] = 1–p
P[LX = 2] = p(1–p)
…
P[LX = k] = p^(k-1)(1–p)
E[LX] = 1/(1–p)
Geometric distribution, with mean 1/(1–p)

slide-62
SLIDE 62

A model of CpG Islands – Transitions

No reason to favor exiting/entering the (+) and (–) regions at a particular nucleotide.

  • Estimate the average length LCpG of a CpG island:
    LCpG = 1/(1–p) ⇒ p = 1 – 1/LCpG
  • For each pair of (+) states k, r: a+kr ← p × a+kr
  • For each (+) state k and (–) state r: akr = (1–p) · a0r
  • Do the same for the (–) states (using q)

A problem with this model: CpG islands don't have a geometric length distribution. This is a defect of HMMs – a price we pay for ease of analysis & efficient computation.

[Figure: the eight-state model (A+, C+, G+, T+, A–, C–, G–, T–) with cross-group transition probabilities 1–p and 1–q]
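A small sketch of this rescaling step (Python; the state labels, transition table, and average length are placeholders for illustration, not values from the lecture):

```python
# Sketch: add switching transitions to the (+) block of the CpG model.
# Within-(+) transitions are scaled by p; the remaining mass (1-p) goes to
# the (-) states in proportion to their initial probabilities a0r.
def add_switching(trans_plus, init_minus, avg_island_len):
    p = 1.0 - 1.0 / avg_island_len            # from E[L] = 1/(1-p)
    out = {}
    for k, row in trans_plus.items():
        out[k] = {r: p * a for r, a in row.items()}                       # (+) -> (+)
        out[k].update({r: (1.0 - p) * a0 for r, a0 in init_minus.items()})  # (+) -> (-)
    return out

# The (-) block is handled symmetrically with q = 1 - 1/avg_separation.
```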

slide-63
SLIDE 63

Using the model

Given a DNA sequence x, the Viterbi algorithm predicts the locations of CpG islands.
Given a nucleotide xi (say xi = A), the Viterbi parse tells whether xi is in a CpG island in the most likely parse.
Using the Forward/Backward algorithms we can calculate
P(xi is in a CpG island) = P(πi = A+ | x)
Posterior decoding can assign locally optimal predictions of CpG islands:
π̂i = argmaxk P(πi = k | x)

slide-64
SLIDE 64

Posterior decoding

Results of applying posterior decoding to a part of human chromosome 22

slide-65
SLIDE 65

Posterior decoding

slide-66
SLIDE 66

Viterbi decoding

slide-67
SLIDE 67

Sliding window (size = 100)

slide-68
SLIDE 68

Sliding window (size = 200)

slide-69
SLIDE 69

Sliding window (size = 300)

slide-70
SLIDE 70

Sliding window (size = 400)

slide-71
SLIDE 71

Sliding window (size = 600)

slide-72
SLIDE 72

Sliding window (size = 1000)

slide-73
SLIDE 73

Modeling CpG islands with silent states

slide-74
SLIDE 74

What if a new genome comes?

  • Suppose we just sequenced the porcupine genome
  • We know CpG islands play the same role in this genome
  • However, we have no known CpG islands for porcupines
  • We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model? LEARNING

slide-75
SLIDE 75

Two learning scenarios

  • Estimation when the “right answer” is known
    Examples:
    GIVEN: a genomic region x = x1…xN where we have good (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening as he changes the dice and produces 10,000 rolls
  • Estimation when the “right answer” is unknown
    Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

GOAL: Update the parameters θ of the model to maximize P(x | θ)

slide-76
SLIDE 76

When the right answer is known

Given x = x1…xN for which π = π1…πN is known, define:

Akr = # of times the k → r transition occurs in π
Ek(b) = # of times state k in π emits b in x

The maximum likelihood estimates of the parameters are:

akr = Akr / Σi Aki
ek(b) = Ek(b) / Σc Ek(c)

slide-77
SLIDE 77

When the right answer is known

Intuition: when we know the underlying states, the best estimate is the average frequency of transitions & emissions that occur in the training data.

Drawback: given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable. Zero probabilities – VERY BAD.

Example: suppose we observe
x = 2 1 5 6 1 2 3 6 2 3 5 3 4 2 1 2 1 6 3 6
π = F F F F F F F F F F F F F F F L L L L L
Then:
aFF = 14/15, aFL = 1/15, aLL = 1, aLF = 0, eL(4) = 0
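A sketch of these counts in code (Python); on the example above it reproduces aFF = 14/15, aFL = 1/15, aLL = 1, aLF = 0 and eL(4) = 0:

```python
from collections import defaultdict

# Maximum-likelihood estimation when the state path is known (slide 76).
# Pseudocounts (slide 78) could be added to A and E before normalizing.
def ml_estimate(x, pi, states, symbols):
    A = {k: defaultdict(float) for k in states}   # transition counts A_kr
    E = {k: defaultdict(float) for k in states}   # emission counts   E_k(b)
    for i in range(len(x)):
        E[pi[i]][x[i]] += 1
        if i + 1 < len(x):
            A[pi[i]][pi[i + 1]] += 1
    a = {k: {r: A[k][r] / sum(A[k].values()) for r in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a, e

x  = [2,1,5,6,1,2,3,6,2,3,5,3,4,2,1,2,1,6,3,6]
pi = ["F"] * 15 + ["L"] * 5
a, e = ml_estimate(x, pi, ("F", "L"), range(1, 7))
print(a["F"]["F"], a["F"]["L"], a["L"]["L"], a["L"]["F"], e["L"][4])
```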

slide-78
SLIDE 78

Pseudocounts

Solution for small training sets: add pseudocounts.
Akr = (# of times the k→r transition occurs in π) + tkr
Ek(b) = (# of times state k emits symbol b in x) + tk(b)
tkr and tk(b) are pseudocounts representing our prior belief.
Larger pseudocounts ⇒ strong prior belief.
Small pseudocounts (ε < 1): just to avoid 0 probabilities.

slide-79
SLIDE 79

Pseudocounts

akr = (Akr + tkr) / Σi (Aki + tki)
ek(b) = (Ek(b) + tk(b)) / Σc (Ek(c) + tk(c))

slide-80
SLIDE 80

Pseudocounts

Example: the dishonest casino.
We will observe the player for one day, 600 rolls. Reasonable pseudocounts:
t0F = t0L = tF0 = tL0 = 1
tFL = tLF = tFF = tLL = 1
tF(1) = tF(2) = … = tF(6) = 20 (strong belief that fair is fair)
tL(1) = tL(2) = … = tL(6) = 5 (wait and see for loaded)
The numbers above are arbitrary – assigning priors is an art.

slide-81
SLIDE 81

When the right answer is unknown

We don’t know the actual Akr, Ek(b) Idea:

  • Initialize the model
  • Compute Akr, Ek(b)
  • Update the parameters of the model, based on Akr, Ek(b)
  • Repeat until convergence

Two algorithms: Baum-Welch, Viterbi training
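A skeleton of the Viterbi-training variant of this loop (a sketch; viterbi() and ml_estimate() are the functions sketched earlier, and the convergence test is simplistic). Baum-Welch follows the same loop but replaces the hard Viterbi counts with expected counts from the forward-backward probabilities:

```python
# Viterbi training: decode with the current parameters, re-estimate from the
# decoded path, and repeat until the parameters stop changing.
def viterbi_training(x, states, symbols, start, trans, emit, n_iter=20):
    for _ in range(n_iter):
        _, path = viterbi(x, states, start, trans, emit)
        new_trans, new_emit = ml_estimate(x, path, states, symbols)
        if new_trans == trans and new_emit == emit:
            break                                # converged
        trans, emit = new_trans, new_emit
    return trans, emit
```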

slide-82
SLIDE 82

Gene finding with HMMs

slide-83
SLIDE 83
slide-84
SLIDE 84
slide-85
SLIDE 85

Example: genes in prokaryotes

  • EasyGene: a prokaryotic gene finder (Larsen TS, Krogh A)
  • Codons are modeled with 3 looped triplets of states (this converts the length distribution to a negative binomial)
slide-86
SLIDE 86

Using the gene finder

  • Apply the gene finder to both strands
  • Training using annotated genes
slide-87
SLIDE 87

A simple eukaryotic gene finder