CS481: Bioinformatics Algorithms - Can Alkan, EA224 - PowerPoint PPT Presentation

SLIDE 1

CS481: Bioinformatics Algorithms

Can Alkan EA224 calkan@cs.bilkent.edu.tr

http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/

SLIDE 2

HMM for Fair Bet Casino (cont’d)

HMM model for the Fair Bet Casino Problem

SLIDE 3

Hidden Paths

A path π = π1…πn in the HMM is defined as a sequence of states.

Consider path π = FFFBBBBBFFF and sequence x = 01011101001

x            0    1    0    1    1    1    0    1    0    0    1
π            F    F    F    B    B    B    B    B    F    F    F
P(xi|πi)     ½    ½    ½    ¾    ¾    ¾    ¼    ¾    ½    ½    ½
P(πi-1→πi)   ½    9/10 9/10 1/10 9/10 9/10 9/10 9/10 1/10 9/10 9/10

P(πi-1→πi): transition probability from state πi-1 to state πi

P(xi|πi): probability that xi was emitted from state πi

SLIDE 4

P(x|π) Calculation

P(x|π): probability that sequence x was generated by the path π:

P(x|π) = P(π0 → π1) · Π_{i=1..n} P(xi|πi) · P(πi → πi+1)
       = a_{π0,π1} · Π_{i=1..n} e_{πi}(xi) · a_{πi,πi+1}
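As a concrete check of this formula, the computation can be sketched in Python. The emission and transition values are the Fair Bet Casino parameters from the earlier slides; the initial probabilities of ½ and the absence of an explicit end state are assumptions of this sketch:

```python
# Fair Bet Casino HMM: F (fair coin) emits 0/1 with 1/2; B (biased) emits 1 with 3/4.
emit = {('F', '0'): 0.5, ('F', '1'): 0.5, ('B', '0'): 0.25, ('B', '1'): 0.75}
trans = {('F', 'F'): 0.9, ('F', 'B'): 0.1, ('B', 'B'): 0.9, ('B', 'F'): 0.1}
init = {'F': 0.5, 'B': 0.5}  # assumed P(begin -> k)

def path_probability(x, pi):
    """P(x|pi) = P(pi0 -> pi1) * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}."""
    p = init[pi[0]] * emit[(pi[0], x[0])]
    for i in range(1, len(x)):
        p *= trans[(pi[i - 1], pi[i])] * emit[(pi[i], x[i])]
    return p

# the example path and sequence from the slides
p = path_probability('01011101001', 'FFFBBBBBFFF')
```

Multiplying the two table rows of SLIDE 3 by hand gives the same number, which is a useful sanity check when implementing the later algorithms.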

SLIDE 5

Decoding Problem

Goal: Find an optimal hidden path of states given observations.

Input: Sequence of observations x = x1…xn generated by an HMM M(Σ, Q, A, E).

Output: A path that maximizes P(x|π) over all possible paths π.

SLIDE 6

Building Manhattan for Decoding Problem

Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.

Every choice of π = π1…πn corresponds to a path in the graph.

The only valid direction in the graph is eastward.

This graph has |Q|²(n-1) edges.

SLIDE 7

Edit Graph for Decoding Problem

SLIDE 8

Decoding Problem vs. Alignment Problem

Valid directions in the alignment problem. Valid directions in the decoding problem.

SLIDE 9

Decoding Problem as Finding a Longest Path in a DAG

The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.

Note: the length of the path is defined as the product of its edges’ weights, not the sum.

SLIDE 10

Decoding Problem (cont’d)

Every path in the graph has the probability P(x|π).

The Viterbi algorithm finds the path that maximizes P(x|π) among all possible paths.

The Viterbi algorithm runs in O(n|Q|²) time.

SLIDE 11

Decoding Problem: weights of edges

(k, i) → (l, i+1)

The weight w is given by: ???

SLIDE 12

Decoding Problem: weights of edges

(k, i) → (l, i+1)

The weight w is given by: ??

P(x|π) = Π_{i=0..n-1} e_{πi+1}(xi+1) · a_{πi,πi+1}

SLIDE 13

Decoding Problem: weights of edges

(k, i) → (l, i+1)

The weight w is given by: ?

i-th term = e_{πi+1}(xi+1) · a_{πi,πi+1}

SLIDE 14

Decoding Problem: weights of edges

(k, i) → (l, i+1)

The weight w = el(xi+1) · akl

i-th term = e_{πi+1}(xi+1) · a_{πi,πi+1} = el(xi+1) · akl for πi = k, πi+1 = l

SLIDE 15

Decoding Problem and Dynamic Programming

sl,i+1 = maxk Є Q {sk,i · (weight of edge between (k,i) and (l,i+1))}
       = maxk Є Q {sk,i · akl · el(xi+1)}
       = el(xi+1) · maxk Є Q {sk,i · akl}

SLIDE 16

Decoding Problem (cont’d)

Initialization:
  sbegin,0 = 1
  sk,0 = 0 for k ≠ begin.

Let π* be the optimal path. Then, P(x|π*) = maxk Є Q {sk,n · ak,end}
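A minimal sketch of the full algorithm with this initialization, in Python. It keeps an explicit traceback so the optimal path π* can be recovered; treating a_{k,end} as 1 when no end-state probabilities are supplied is an assumption of this sketch, as are the ½ initial probabilities of the casino model:

```python
def viterbi(x, states, init, trans, emit, end=None):
    """Probability-space Viterbi: s_{l,i+1} = e_l(x_{i+1}) * max_k s_{k,i} * a_{kl}.
    Returns (P(x|pi*), pi*). 'end' maps state -> a_{k,end}; None treats it as 1."""
    s = {k: init[k] * emit[(k, x[0])] for k in states}   # scores after the first symbol
    back = []                                            # back[i][l] = best predecessor of l
    for c in x[1:]:
        choice = {l: max(states, key=lambda k: s[k] * trans[(k, l)]) for l in states}
        s = {l: emit[(l, c)] * s[choice[l]] * trans[(choice[l], l)] for l in states}
        back.append(choice)
    a_end = end if end else {k: 1.0 for k in states}
    last = max(states, key=lambda k: s[k] * a_end[k])
    best, path = s[last] * a_end[last], [last]
    for choice in reversed(back):                        # trace the optimal path back
        path.append(choice[path[-1]])
    return best, ''.join(reversed(path))

# Fair Bet Casino parameters (initial probabilities of 1/2 assumed)
emit = {('F', '0'): 0.5, ('F', '1'): 0.5, ('B', '0'): 0.25, ('B', '1'): 0.75}
trans = {('F', 'F'): 0.9, ('F', 'B'): 0.1, ('B', 'B'): 0.9, ('B', 'F'): 0.1}
init = {'F': 0.5, 'B': 0.5}
prob, path = viterbi('01011101001', ['F', 'B'], init, trans, emit)
```

On a sequence this short the result can be verified against brute-force enumeration of all 2¹¹ paths.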

SLIDE 17

Decoding Problem (cont’d)

Initialization:
  sbegin,0 = 1
  sk,0 = 0 for k ≠ begin.

Let π* be the optimal path. Then, P(x|π*) = maxk Є Q {sk,n · ak,end}

Is there a problem here?

SLIDE 18

Viterbi Algorithm

The value of the product can become extremely small, which leads to underflow.

SLIDE 19

Viterbi Algorithm

The value of the product can become extremely small, which leads to underflow.

To avoid underflow, use log values instead:

sk,i+1 = log el(xi+1) + maxk Є Q {sk,i + log(akl)}
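In log space the same computation becomes sums of logarithms; only the recurrence line changes. This sketch maps probability 0 to -inf and, as before, assumes the casino parameters and ½ initial probabilities:

```python
from math import log, inf

def viterbi_log_score(x, states, init, trans, emit):
    """Log-space Viterbi score: s_{l,i+1} = log e_l(x_{i+1}) + max_k (s_{k,i} + log a_{kl}).
    Returns max_k s_{k,n}, i.e. log P(x|pi*) with a_{k,end} treated as 1."""
    lg = lambda p: log(p) if p > 0 else -inf   # log 0 -> -inf keeps the max well-defined
    s = {k: lg(init[k]) + lg(emit[(k, x[0])]) for k in states}
    for c in x[1:]:
        s = {l: lg(emit[(l, c)]) + max(s[k] + lg(trans[(k, l)]) for k in states)
             for l in states}
    return max(s.values())

emit = {('F', '0'): 0.5, ('F', '1'): 0.5, ('B', '0'): 0.25, ('B', '1'): 0.75}
trans = {('F', 'F'): 0.9, ('F', 'B'): 0.1, ('B', 'B'): 0.9, ('B', 'F'): 0.1}
init = {'F': 0.5, 'B': 0.5}
score = viterbi_log_score('01011101001', ['F', 'B'], init, trans, emit)
```

Because sums of logs do not shrink toward zero the way products of probabilities do, this variant stays numerically stable even for sequences thousands of symbols long.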

SLIDE 20

FORWARD/BACKWARD

SLIDE 21

Forward-Backward Problem

Given: a sequence of coin tosses generated by an HMM.

Goal: find the probability that the dealer was using a biased coin at a particular time.

SLIDE 22

Forward Algorithm

Define fk,i (forward probability) as the probability of emitting the prefix x1…xi and reaching the state πi = k.

The recurrence for the forward algorithm:

fk,i = ek(xi) · Σl Є Q fl,i-1 · alk
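The recurrence translates directly into code: replacing Viterbi's max with a sum turns the best-path score into the total probability P(x) = Σk fk,n. The casino parameters and ½ initial probabilities are again assumptions of this sketch:

```python
def forward_probability(x, states, init, trans, emit):
    """f_{k,i} = e_k(x_i) * sum_l f_{l,i-1} * a_{lk}; returns P(x) = sum_k f_{k,n}."""
    f = {k: init[k] * emit[(k, x[0])] for k in states}   # f_{k,1}
    for c in x[1:]:
        f = {k: emit[(k, c)] * sum(f[l] * trans[(l, k)] for l in states)
             for k in states}
    return sum(f.values())

emit = {('F', '0'): 0.5, ('F', '1'): 0.5, ('B', '0'): 0.25, ('B', '1'): 0.75}
trans = {('F', 'F'): 0.9, ('F', 'B'): 0.1, ('B', 'B'): 0.9, ('B', 'F'): 0.1}
init = {'F': 0.5, 'B': 0.5}
px = forward_probability('01011101001', ['F', 'B'], init, trans, emit)
```

Note that the forward algorithm sums P(x, π) over all |Q|ⁿ paths in O(n|Q|²) time, which is why it is preferred over enumeration.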

SLIDE 23

Backward Algorithm

However, forward probability is not the only factor affecting P(πi = k|x).

The sequence of transitions and emissions that the HMM undergoes between πi+1 and πn also affects P(πi = k|x).

SLIDE 24

Backward Algorithm (cont’d)

Define backward probability bk,i as the probability of being in state πi = k and emitting the suffix xi+1…xn.

The recurrence for the backward algorithm:

bk,i = Σl Є Q el(xi+1) · bl,i+1 · akl
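The backward table is filled right to left, starting from bk,n = 1. Assuming no explicit end state (so bk,n = 1 exactly) and the usual casino parameters, a sketch:

```python
def backward_table(x, states, trans, emit):
    """b_{k,i} = sum_l e_l(x_{i+1}) * b_{l,i+1} * a_{kl}, with b_{k,n} = 1.
    Returns a list where b[i][k] = b_{k,i+1} (0-based positions)."""
    b = [{k: 1.0 for k in states}]                     # b_{k,n} = 1
    for i in range(len(x) - 1, 0, -1):                 # x[i] is x_{i+1} in 1-based terms
        b.insert(0, {k: sum(emit[(l, x[i])] * b[0][l] * trans[(k, l)]
                            for l in states) for k in states})
    return b

emit = {('F', '0'): 0.5, ('F', '1'): 0.5, ('B', '0'): 0.25, ('B', '1'): 0.75}
trans = {('F', 'F'): 0.9, ('F', 'B'): 0.1, ('B', 'B'): 0.9, ('B', 'F'): 0.1}
init = {'F': 0.5, 'B': 0.5}
b = backward_table('01011101001', ['F', 'B'], trans, emit)
```

As a consistency check, P(x) can be recovered from the first column as Σk init[k] · ek(x1) · bk,1, and it matches the forward algorithm's answer.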

SLIDE 25

Forward-Backward Algorithm

The probability that the dealer used a biased coin at any moment i:

P(πi = k|x) = P(x, πi = k) / P(x) = fk(i) · bk(i) / P(x)

P(x) is the sum of P(x, πi = k) over all k.
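Putting the two tables together gives the posterior at every position. This sketch recomputes both tables internally so it is self-contained; the casino parameters and ½ initial probabilities remain assumptions:

```python
def posterior(x, states, init, trans, emit):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x) at every position i."""
    F = [{k: init[k] * emit[(k, x[0])] for k in states}]          # forward table
    for c in x[1:]:
        F.append({k: emit[(k, c)] * sum(F[-1][l] * trans[(l, k)] for l in states)
                  for k in states})
    B = [{k: 1.0 for k in states}]                                # backward table
    for i in range(len(x) - 1, 0, -1):
        B.insert(0, {k: sum(emit[(l, x[i])] * B[0][l] * trans[(k, l)]
                            for l in states) for k in states})
    px = sum(F[-1].values())                                      # P(x)
    return [{k: F[i][k] * B[i][k] / px for k in states} for i in range(len(x))]

emit = {('F', '0'): 0.5, ('F', '1'): 0.5, ('B', '0'): 0.25, ('B', '1'): 0.75}
trans = {('F', 'F'): 0.9, ('F', 'B'): 0.1, ('B', 'B'): 0.9, ('B', 'F'): 0.1}
init = {'F': 0.5, 'B': 0.5}
post = posterior('01011101001', ['F', 'B'], init, trans, emit)
# post[i]['B'] is the probability the dealer used the biased coin at toss i+1
```

A quick invariant: at every position the posteriors over all states sum to 1, since conditioning on x normalizes by P(x).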

SLIDE 26

PROFILE HMM

SLIDE 27

Finding Distant Members of a Protein Family

A distant cousin of functionally related sequences in a protein family may have weak pairwise similarities with each member of the family and thus fail the significance test.

However, it may have weak similarities with many members of the family.

The goal is to align a sequence to all members of the family at once.

A family of related proteins can be represented by their multiple alignment and the corresponding profile.

SLIDE 28

Profile Representation of Protein Families

Aligned DNA sequences can be represented by a 4·n profile matrix reflecting the frequencies of nucleotides in every aligned position.

A protein family can be represented by a 20·n profile representing frequencies of amino acids.
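A profile matrix is just column-wise frequency counting. A minimal sketch for the 4·n DNA case; the toy alignment is hypothetical and gap characters are not handled here:

```python
from collections import Counter

def profile_matrix(alignment, alphabet='ACGT'):
    """4 x n matrix of column-wise nucleotide frequencies for aligned DNA sequences."""
    n = len(alignment[0])
    cols = [Counter(seq[j] for seq in alignment) for j in range(n)]
    return {a: [cols[j][a] / len(alignment) for j in range(n)] for a in alphabet}

P = profile_matrix(['ATGC', 'ATGA', 'TTGC'])  # hypothetical aligned sequences
```

The 20·n protein version is identical apart from the alphabet.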

SLIDE 29

Profiles and HMMs

HMMs can also be used for aligning a sequence against a profile representing a protein family.

A 20·n profile P corresponds to n sequentially linked match states M1,…,Mn in the profile HMM of P.

SLIDE 30

Multiple Alignments and Protein Family Classification

Multiple alignment of a protein family shows variations in conservation along the length of a protein.

Example: after aligning many globin proteins, biologists recognized that the helix regions in globins are more conserved than others.
SLIDE 31

What are Profile HMMs?

A Profile HMM is a probabilistic representation of a multiple alignment.

A given multiple alignment (of a protein family) is used to build a profile HMM.

This model then may be used to find and score less obvious potential matches of new protein sequences.

SLIDE 32

Profile HMM

A profile HMM

SLIDE 33

Building a profile HMM

Multiple alignment is used to construct the HMM model.

Assign each column to a Match state in the HMM. Add Insertion and Deletion states.

Estimate the emission probabilities according to amino acid counts in the column. Different positions in the protein will have different emission probabilities.

Estimate the transition probabilities between Match, Deletion, and Insertion states.

The HMM model gets trained to derive the optimal parameters.
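The emission estimate for the Match states can be sketched as column counts turned into frequencies. The pseudocount (added so unseen residues do not get probability zero) and the tiny toy alphabet are assumptions of this sketch; '-' gap characters are simply skipped:

```python
from collections import Counter

def match_emissions(alignment, alphabet, pseudocount=1.0):
    """Estimate e_{Mj}(a) from residue counts in alignment column j."""
    probs = []
    for j in range(len(alignment[0])):
        counts = Counter(seq[j] for seq in alignment if seq[j] != '-')
        total = sum(counts.values()) + pseudocount * len(alphabet)
        probs.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return probs

e = match_emissions(['AG-', 'AGC', 'TGC'], 'AGCT')  # hypothetical toy alignment
```

Each e[j] is a distribution over the alphabet, so different columns of the alignment yield different emission probabilities, as the slide states.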

SLIDE 34

States of Profile HMM

Match states M1…Mn (plus begin/end states)

Insertion states I0, I1, …, In

Deletion states D1…Dn

SLIDE 35

Transition Probabilities in Profile HMM

log(aMI) + log(aIM) = gap initiation penalty

log(aII) = gap extension penalty

SLIDE 36

Emission Probabilities in Profile HMM

Probability of emitting a symbol a at an insertion state Ij: eIj(a) = p(a), where p(a) is the frequency of occurrence of the symbol a in all the sequences.

SLIDE 37

Profile HMM Alignment

Define vMj(i) as the logarithmic likelihood score of the best path for matching x1..xi to the profile HMM, ending with xi emitted by the state Mj.

vIj(i) and vDj(i) are defined similarly.

SLIDE 38

Profile HMM Alignment: Dynamic Programming

vMj(i) = log(eMj(xi)/p(xi)) + max { vMj-1(i-1) + log(aMj-1,Mj),
                                    vIj-1(i-1) + log(aIj-1,Mj),
                                    vDj-1(i-1) + log(aDj-1,Mj) }

vIj(i) = log(eIj(xi)/p(xi)) + max { vMj(i-1) + log(aMj,Ij),
                                    vIj(i-1) + log(aIj,Ij),
                                    vDj(i-1) + log(aDj,Ij) }
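A single cell of this recurrence can be sketched as a small helper that takes the three predecessor scores with their transition probabilities; the numbers in the example are made up for illustration:

```python
from math import log

def v_match_cell(e_match, p_bg, preds):
    """One cell of the profile-HMM DP:
    vM_j(i) = log(e_Mj(x_i) / p(x_i)) + max over predecessor (score, transition) pairs,
    i.e. vM_{j-1}(i-1) + log a_{Mj-1,Mj}, vI_{j-1}(i-1) + log a_{Ij-1,Mj},
    and  vD_{j-1}(i-1) + log a_{Dj-1,Mj}."""
    return log(e_match / p_bg) + max(v + log(a) for v, a in preds)

# hypothetical values: emission 0.5 vs background 0.25, three predecessor cells
v = v_match_cell(0.5, 0.25, [(-1.0, 0.5), (-2.0, 0.25), (-3.0, 0.25)])
```

The vI (and vD) cells follow the same shape with their own emission ratio (or none, for deletions) and their own three incoming transitions, so a full implementation is three such maximizations per (i, j) cell.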