Labelled Sequences

◮ We are interested in the probability of sequences like:

      flies like the wind          flies like the wind
      nns   vb   dt  nn     or     vbz   p    dt  nn

◮ In normal text, we see the words, but not the tags.
◮ Consider the POS tags to be the underlying skeleton of the sentence: unseen, but influencing the sentence's shape.
◮ A structure like this, consisting of a hidden state sequence and a related observation sequence, can be modelled as a Hidden Markov Model.
Hidden Markov Models

The generative story:

      ⟨S⟩ → DT → NN → VBZ → NNS → ⟨/S⟩        (hidden tag sequence)
             ↓    ↓     ↓     ↓
            the  cat  eats  mice               (observed words)

  P(S, O) = P(DT|⟨S⟩) P(the|DT) P(NN|DT) P(cat|NN) P(VBZ|NN) P(eats|VBZ) P(NNS|VBZ) P(mice|NNS) P(⟨/S⟩|NNS)
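The generative story can be run forwards as ancestral sampling: draw a tag conditioned on the previous tag, draw a word conditioned on that tag, and stop when ⟨/S⟩ is drawn. Below is a minimal Python sketch; the toy probability tables are illustrative assumptions, not estimates from any corpus.

import random

def sample_from(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    r, total = random.random(), 0.0
    for outcome, p in dist.items():
        total += p
        if r < total:
            return outcome
    return outcome  # guard against floating-point rounding

def generate(transitions, emissions, start="<S>", stop="</S>"):
    """Sample (tags, words): tag_i ~ P(. | tag_{i-1}), word_i ~ P(. | tag_i)."""
    tags, words = [], []
    tag = sample_from(transitions[start])
    while tag != stop:
        tags.append(tag)
        words.append(sample_from(emissions[tag]))
        tag = sample_from(transitions[tag])
    return tags, words

# Toy tables (assumed values for illustration only)
transitions = {
    "<S>": {"DT": 1.0},
    "DT":  {"NN": 1.0},
    "NN":  {"VBZ": 0.5, "</S>": 0.5},
    "VBZ": {"NNS": 1.0},
    "NNS": {"</S>": 1.0},
}
emissions = {
    "DT":  {"the": 1.0},
    "NN":  {"cat": 1.0},
    "VBZ": {"eats": 1.0},
    "NNS": {"mice": 1.0},
}

print(generate(transitions, emissions))
# e.g. (['DT', 'NN', 'VBZ', 'NNS'], ['the', 'cat', 'eats', 'mice'])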
Hidden Markov Models

For a bigram HMM, with observations O = o_1 ... o_N:

  P(S, O) = ∏_{i=1}^{N+1} P(s_i | s_{i−1}) P(o_i | s_i)        where s_0 = ⟨S⟩, s_{N+1} = ⟨/S⟩

◮ The transition probabilities model the probabilities of moving from state to state.
◮ The emission probabilities model the probability that a state emits a particular observation.
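A direct transcription of this product, as a sketch: transition and emission probabilities are assumed to live in nested dictionaries trans[prev][next] and emit[tag][word] (that storage format and the function name are my choices, not the slides'); the final ⟨/S⟩ step contributes only a transition factor.

def joint_probability(states, observations, trans, emit, start="<S>", stop="</S>"):
    """P(S, O) = prod_i P(s_i | s_{i-1}) * P(o_i | s_i), with s_0 = <S> and a final transition to </S>."""
    p = 1.0
    prev = start
    for s, o in zip(states, observations):
        p *= trans[prev][s] * emit[s][o]
        prev = s
    return p * trans[prev][stop]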
Using HMMs

The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:

◮ P(S, O), given S and O
◮ P(O), given O
◮ the S that maximises P(S|O), given O
◮ We can also learn the model parameters, given a set of observations.

Our observations will be words (w_i), and our states PoS tags (t_i).
Estimation

As so often in NLP, we learn an HMM from labelled data:

Transition probabilities
Based on a training corpus of previously tagged text, with tags as our states, the MLE can be computed from the counts of observed tags:

  P(t_i | t_{i−1}) = C(t_{i−1}, t_i) / C(t_{i−1})

Emission probabilities
Computed from relative frequencies in the same way, with the words as observations:

  P(w_i | t_i) = C(t_i, w_i) / C(t_i)
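As a sketch, these relative-frequency estimates can be read off a tagged corpus in one pass; here the corpus is assumed to be a list of sentences, each a list of (word, tag) pairs (that input format is my assumption, not the slides').

from collections import defaultdict

def estimate_hmm(tagged_sentences, start="<S>", stop="</S>"):
    """MLE: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1});  P(w_i | t_i) = C(t_i, w_i) / C(t_i)."""
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
        trans_counts[prev][stop] += 1          # close the sentence with </S>
    trans = {p: {t: c / sum(following.values()) for t, c in following.items()}
             for p, following in trans_counts.items()}
    emit = {t: {w: c / sum(words.values()) for w, c in words.items()}
            for t, words in emit_counts.items()}
    return trans, emit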
Implementation Issues

  P(S, O) = P(s_1|⟨S⟩) P(o_1|s_1) P(s_2|s_1) P(o_2|s_2) P(s_3|s_2) P(o_3|s_3) ...
          = 0.0429 × 0.0031 × 0.0044 × 0.0001 × 0.0072 × ...

◮ Multiplying many small probabilities → underflow
◮ Solution: work in logarithmic space:
  ◮ log(AB) = log(A) + log(B)
  ◮ hence P(A) P(B) = exp(log P(A) + log P(B))
  ◮ log(P(S, O)) = −1.368 + −2.509 + −2.357 + −4 + −2.143 + ...

The issues related to MLE / smoothing that we discussed for n-gram models also apply here ...
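A log-space variant of the earlier joint_probability sketch (natural logs here, whereas the numbers on the slide are base-10):

import math

def log_joint_probability(states, observations, trans, emit, start="<S>", stop="</S>"):
    """Accumulate a sum of log-probabilities instead of a product, to avoid underflow."""
    logp = 0.0
    prev = start
    for s, o in zip(states, observations):
        logp += math.log(trans[prev][s]) + math.log(emit[s][o])
        prev = s
    return logp + math.log(trans[prev][stop])
# math.exp(logp) recovers the probability, though for long sequences that may still underflow.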
Ice Cream and Global Warming

Missing records of the weather in Baltimore for the summer of 2007:

◮ Jason likes to eat ice cream.
◮ He records his daily ice cream consumption in his diary.
◮ The number of ice creams he ate was influenced, but not entirely determined, by the weather.
◮ Today's weather is partially predictable from yesterday's.

A Hidden Markov Model! With:
◮ Hidden states: {H, C} (plus pseudo-states ⟨S⟩ and ⟨/S⟩)
◮ Observations: {1, 2, 3}
Ice Cream and Global Warming

Transition probabilities:                  Emission probabilities:
  P(H|⟨S⟩)  = 0.8    P(C|⟨S⟩)  = 0.2         P(1|H) = 0.2    P(1|C) = 0.5
  P(H|H)    = 0.6    P(C|H)    = 0.2         P(2|H) = 0.4    P(2|C) = 0.4
  P(H|C)    = 0.3    P(C|C)    = 0.5         P(3|H) = 0.4    P(3|C) = 0.1
  P(⟨/S⟩|H) = 0.2    P(⟨/S⟩|C) = 0.2
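Written out as tables in the dictionary format used by the sketches above (the numbers are exactly those on the slide):

# Hidden states H (hot) and C (cold); observations 1, 2 or 3 ice creams per day.
trans = {
    "<S>": {"H": 0.8, "C": 0.2},
    "H":   {"H": 0.6, "C": 0.2, "</S>": 0.2},
    "C":   {"H": 0.3, "C": 0.5, "</S>": 0.2},
}
emit = {
    "H": {1: 0.2, 2: 0.4, 3: 0.4},
    "C": {1: 0.5, 2: 0.4, 3: 0.1},
}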
Using HMMs

The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:

◮ P(S, O), given S and O
◮ P(O), given O
◮ the S that maximises P(S|O), given O
◮ P(s_x | O), given O
◮ We can also learn the model parameters, given a set of observations.
Part-of-Speech Tagging

We want to find the tag sequence, given a word sequence. With tags as our states and words as our observations, we know:

  P(S, O) = ∏_{i=1}^{N+1} P(s_i | s_{i−1}) P(o_i | s_i)

We want:

  P(S | O) = P(S, O) / P(O)

Actually, we want the state sequence Ŝ that maximises P(S|O):

  Ŝ = argmax_S P(S, O) / P(O)

Since P(O) is the same for every candidate S, we can drop the denominator:

  Ŝ = argmax_S P(S, O)
Decoding

Task: What is the most likely state sequence S, given an observation sequence O and an HMM?

HMM
  P(H|⟨S⟩)  = 0.8    P(C|⟨S⟩)  = 0.2
  P(H|H)    = 0.6    P(C|H)    = 0.2
  P(H|C)    = 0.3    P(C|C)    = 0.5
  P(⟨/S⟩|H) = 0.2    P(⟨/S⟩|C) = 0.2
  P(1|H) = 0.2       P(1|C) = 0.5
  P(2|H) = 0.4       P(2|C) = 0.4
  P(3|H) = 0.4       P(3|C) = 0.1

If O = 3 1 3:
  ⟨S⟩ H H H ⟨/S⟩    0.0018432
  ⟨S⟩ H H C ⟨/S⟩    0.0001536
  ⟨S⟩ H C H ⟨/S⟩    0.0007680
  ⟨S⟩ H C C ⟨/S⟩    0.0003200
  ⟨S⟩ C H H ⟨/S⟩    0.0000576
  ⟨S⟩ C H C ⟨/S⟩    0.0000048
  ⟨S⟩ C C H ⟨/S⟩    0.0001200
  ⟨S⟩ C C C ⟨/S⟩    0.0000500

The most likely sequence is H H H.
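The table above can be reproduced by scoring every one of the L^N state sequences with the joint_probability sketch from earlier, using the trans and emit tables defined above; this is only feasible for tiny examples, which is the point of the next slides.

from itertools import product

def brute_force_decode(observations, trans, emit, states=("H", "C")):
    """Enumerate all state sequences and return the most probable one with its joint probability."""
    best_seq, best_p = None, 0.0
    for seq in product(states, repeat=len(observations)):
        p = joint_probability(seq, observations, trans, emit)
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq, best_p

print(brute_force_decode([3, 1, 3], trans, emit))
# (('H', 'H', 'H'), 0.0018432)  -- matching the table above, up to floating-point rounding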
Dynamic Programming

For (only) two states and a (short) observation sequence of length three, comparing all possible sequences is workable, but ...

◮ for N observations and L states, there are L^N sequences
◮ we do the same partial calculations over and over again

Dynamic Programming:
◮ records sub-problem solutions for further re-use
◮ useful when a complex problem can be described recursively
◮ examples: Dijkstra's shortest path, minimum edit distance, longest common subsequence, Viterbi algorithm
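To get a sense of the scale (an illustrative calculation, not from the slides): with the 45 tags of the Penn Treebank tagset and a 20-word sentence, there are 45^20 ≈ 10^33 candidate tag sequences, whereas the dynamic-programming approach below computes only about 20 × 45^2 ≈ 40,500 partial scores.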
Viterbi Algorithm

Recall our problem: maximise

  P(s_1 ... s_n | o_1 ... o_n) ∝ P(s_1|s_0) P(o_1|s_1) P(s_2|s_1) P(o_2|s_2) ...        (s_0 = ⟨S⟩)

Our recursive sub-problem:

  v_i(x) = max_{k=1..L} [ v_{i−1}(k) · P(x|k) · P(o_i|x) ]

with base case v_1(x) = P(x|⟨S⟩) · P(o_1|x). The variable v_i(x) represents the maximum probability that the i-th state is x, given that we have seen o_1 ... o_i.

At each step, we record backpointers showing which previous state led to the maximum probability.
An Example of the Viterbi Algorithm

Observations: o_1 = 3, o_2 = 1, o_3 = 3

  v_1(H) = P(H|⟨S⟩) · P(3|H) = 0.8 × 0.4 = 0.32
  v_1(C) = P(C|⟨S⟩) · P(3|C) = 0.2 × 0.1 = 0.02

  v_2(H) = max(v_1(H) · P(H|H) P(1|H), v_1(C) · P(H|C) P(1|H)) = max(.32 × .12, .02 × .06) = .0384
  v_2(C) = max(v_1(H) · P(C|H) P(1|C), v_1(C) · P(C|C) P(1|C)) = max(.32 × .1,  .02 × .25) = .032

  v_3(H) = max(v_2(H) · P(H|H) P(3|H), v_2(C) · P(H|C) P(3|H)) = max(.0384 × .24, .032 × .12) = .009216
  v_3(C) = max(v_2(H) · P(C|H) P(3|C), v_2(C) · P(C|C) P(3|C)) = max(.0384 × .02, .032 × .05) = .0016

  v_f(⟨/S⟩) = max(v_3(H) · P(⟨/S⟩|H), v_3(C) · P(⟨/S⟩|C)) = max(.009216 × .2, .0016 × .2) = .0018432

Following the backpointers from ⟨/S⟩ gives the most likely state sequence: H H H.
Pseudocode for the Viterbi Algorithm

Input: observations of length N, state set of size L
Output: best-path

create a path probability matrix viterbi[N, L+2]
create a path backpointer matrix backpointer[N, L+2]
for each state s from 1 to L do
    viterbi[1, s] ← trans(⟨S⟩, s) × emit(o_1, s)
    backpointer[1, s] ← 0
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        viterbi[i, s] ← max_{s′=1..L} viterbi[i−1, s′] × trans(s′, s) × emit(o_i, s)
        backpointer[i, s] ← argmax_{s′=1..L} viterbi[i−1, s′] × trans(s′, s)
    end
end
viterbi[N, L+1] ← max_{s=1..L} viterbi[N, s] × trans(s, ⟨/S⟩)
backpointer[N, L+1] ← argmax_{s=1..L} viterbi[N, s] × trans(s, ⟨/S⟩)
return the path by following backpointers from backpointer[N, L+1]
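A runnable Python rendering of this pseudocode, as a sketch: it keeps the same initialisation / recursion / termination structure but stores the trellis as a list of dictionaries rather than an N × (L + 2) matrix, and reuses the trans / emit dictionary format from the earlier ice-cream example.

def viterbi(observations, trans, emit, states, start="<S>", stop="</S>"):
    """Return (best_path, best_probability) for an observation sequence under the given HMM."""
    N = len(observations)
    v = [{} for _ in range(N)]            # v[i][s]: best probability of any path ending in s at step i
    backpointer = [{} for _ in range(N)]

    # Initialisation: v_1(s) = P(s | <S>) * P(o_1 | s)
    for s in states:
        v[0][s] = trans[start].get(s, 0.0) * emit[s].get(observations[0], 0.0)
        backpointer[0][s] = start

    # Recursion: v_i(s) = max_k v_{i-1}(k) * P(s | k) * P(o_i | s)
    for i in range(1, N):
        for s in states:
            best_prev, best_p = None, 0.0
            for k in states:
                p = v[i - 1][k] * trans[k].get(s, 0.0) * emit[s].get(observations[i], 0.0)
                if p > best_p:
                    best_prev, best_p = k, p
            v[i][s], backpointer[i][s] = best_p, best_prev

    # Termination: best transition into </S>
    best_last, best_p = None, 0.0
    for s in states:
        p = v[N - 1][s] * trans[s].get(stop, 0.0)
        if p > best_p:
            best_last, best_p = s, p

    # Follow the backpointers from the best final state
    path = [best_last]
    for i in range(N - 1, 0, -1):
        path.append(backpointer[i][path[-1]])
    path.reverse()
    return path, best_p

print(viterbi([3, 1, 3], trans, emit, ["H", "C"]))
# (['H', 'H', 'H'], ≈0.0018432), as in the worked example above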
Diversion: Complexity and O(N)

Big-O notation describes the complexity of an algorithm:
◮ it describes the worst-case order of growth in terms of the size of the input
◮ only the largest order term is represented
◮ constant factors are ignored
◮ determined by looking at loops in the code
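Applied to the Viterbi pseudocode above (my gloss, not a separate slide): the outer loop runs over the N time steps and, for each of the L states, the max ranges over L possible predecessors, so decoding takes O(N · L^2) operations, versus the O(L^N) sequences examined by the brute-force enumeration.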