Lecture 10: Introduction to POS Tagging (Dynamic Programming for HMMs)
CS447 Natural Language Processing (J. Hockenmaier)
https://courses.grainger.illinois.edu/cs447/
HMM decoding (Viterbi)
We are given a sentence w = w^(1) … w^(N), e.g.
  w = "she promised to back the bill"
We want to use an HMM tagger to find its POS tags t:
  t* = argmax_t P(w, t) = argmax_t P(t^(1)) · P(w^(1) | t^(1)) · P(t^(2) | t^(1)) · … · P(w^(N) | t^(N))
But: with T tags, w has O(T^N) possible tag sequences!
To do this efficiently (in O(T^2 N) time), we will use a dynamic programming technique called the Viterbi algorithm, which exploits the independence assumptions in the HMM.
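To get a feel for these numbers, here is a quick back-of-the-envelope sketch (not from the slides; the 45-tag tag set and the 20-word sentence length are assumed for illustration):

```python
# Exhaustive search scores every tag sequence; Viterbi only needs T^2 * N steps.
T, N = 45, 20          # assumed: a 45-tag tag set and a 20-word sentence
print(T ** N)          # ~1.16e33 candidate tag sequences to enumerate
print(T ** 2 * N)      # 40,500 Viterbi steps
```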
Dynamic programming
Dynamic programming is a general technique to solve certain complex search problems by memoization.
1.) Recursively decompose the large search problem into smaller subproblems that can be solved efficiently
  – There is only a polynomial number of subproblems.
2.) Store (memoize) the solutions of each subproblem in a common data structure
  – Processing this data structure takes polynomial time.
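As a generic illustration of memoization (my own example, not from the lecture), the classic recursive Fibonacci function goes from exponentially many calls to a linear number once each subproblem's solution is stored:

```python
from functools import lru_cache

@lru_cache(maxsize=None)            # the "common data structure": a cache keyed by n
def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)  # each subproblem is now solved only once

print(fib(80))   # instant; the unmemoized recursion would make exponentially many calls
```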
The Viterbi algorithm
A dynamic programming algorithm which finds the best (= most probable) tag sequence t* for an input sentence w:
  t* = argmax_t P(w | t) P(t)
Complexity: linear in the sentence length. With a bigram HMM, Viterbi runs in O(T^2 N) steps for an input sentence with N words and a tag set of T tags.
The independence assumptions of the HMM tell us how to break up the big search problem (find t* = argmax_t P(w | t) P(t)) into smaller subproblems.
The data structure used to store the solutions of these subproblems is the trellis.
Bookkeeping: the trellis
[Figure: an N × T table whose columns are the words w^(1) … w^(N) ("time steps") and whose rows are the tags t_1 … t_T (states); the cell in row j of column i represents word w^(i) having tag t_j]
We use an N × T table ("trellis") to keep track of the HMM. The HMM can assign one of the T tags to each of the N words.
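A minimal sketch of this bookkeeping in Python (the names Cell and make_trellis are my own; the cell fields anticipate the Viterbi probabilities and backpointers introduced below, and Python's 0-based indexing differs from the slides, which count words from 1):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cell:
    viterbi: float = 0.0               # probability of the best tag sequence ending in this cell
    backpointer: Optional[int] = None  # row index of the best predecessor in the previous column

def make_trellis(n_words: int, n_tags: int) -> List[List[Cell]]:
    """Build the N x T table: trellis[i][j] = word i tagged with tag j (0-indexed)."""
    return [[Cell() for _ in range(n_tags)] for _ in range(n_words)]

trellis = make_trellis(n_words=6, n_tags=3)   # e.g. "she promised to back the bill" with 3 toy tags
```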
Computing P(t, w) for one tag sequence
[Figure: a single path through the trellis. The path starts with the initial probability P(t^(1)), picks up an emission probability P(w^(i) | t^(i)) at each cell it visits, and a transition probability P(t^(i) | t^(i-1)) along each edge it follows.]
One path through the trellis = one tag sequence.
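A sketch of scoring one such path, assuming the HMM parameters are stored in plain dictionaries (all names and toy numbers here are my own, not the lecture's):

```python
init  = {"DT": 0.5, "NN": 0.3, "VB": 0.2}                                 # P(t^(1))
trans = {("DT", "NN"): 0.8, ("NN", "VB"): 0.4}                            # P(t^(i) | t^(i-1))
emit  = {"DT": {"the": 0.6}, "NN": {"dog": 0.1}, "VB": {"barks": 0.05}}   # P(w^(i) | t^(i))

def path_prob(words, tags):
    """P(w, t) = P(t^(1)) P(w^(1)|t^(1)) * prod_{i>=2} P(t^(i)|t^(i-1)) P(w^(i)|t^(i))."""
    p = init.get(tags[0], 0.0) * emit.get(tags[0], {}).get(words[0], 0.0)
    for i in range(1, len(words)):
        p *= trans.get((tags[i - 1], tags[i]), 0.0)        # transition along the edge
        p *= emit.get(tags[i], {}).get(words[i], 0.0)      # emission at the cell
    return p

print(path_prob(["the", "dog", "barks"], ["DT", "NN", "VB"]))   # 0.5*0.6*0.8*0.1*0.4*0.05 = 0.00048
```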
Viterbi: Basic Idea
Task: Find the tag sequence t^(1) … t^(N) that maximizes the joint probability
  π(t^(1)) P(w^(1) | t^(1)) ∏_{i=2..N} P(t^(i) | t^(i-1)) P(w^(i) | t^(i))
The choice of t^(1) affects the probability of t^(2), which in turn affects the probability of t^(3), etc.:
  π(t^(1)) P(w^(1) | t^(1)) P(t^(2) | t^(1)) P(w^(2) | t^(2)) P(t^(3) | t^(2)) …
→ We cannot fix t^(1) (or any tag) until the end of the sentence!
Exploiting the independence assumptions
You want to find the best tag sequence t^(1) t^(2) t^(3) … = t_i t_j t_k …:
  argmax_{t_i, t_j, t_k, …} π(t^(1)=t_i) P(w^(1) | t^(1)=t_i) · P(t^(2)=t_j | t^(1)=t_i) P(w^(2) | t^(2)=t_j) · P(t^(3)=t_k | t^(2)=t_j) P(w^(3) | t^(3)=t_k) · …
Each factor depends on only one or two adjacent tags: the first depends only on the choice of t^(1)=t_i, the second only on the choices of t^(1)=t_i and t^(2)=t_j, the third only on the choices of t^(2)=t_j and t^(3)=t_k, etc.
So we can work through the sentence word by word (for all words i = 1..N) and, for all tags t_j in the tag set (j = 1..T), find the best tag sequence t^(1)...(i) that ends in t^(i) = t_j:
Step 1: For any particular choice of t^(1) = t_i for w^(1), compute
  π(t^(1)=t_i) P(w^(1) | t^(1)=t_i)
Step 2: For any particular choice of t^(2) = t_j for w^(2), pick the tag t_i for w^(1) that gives the highest probability:
  argmax_{t_i} π(t^(1)=t_i) P(w^(1) | t^(1)=t_i) P(t^(2)=t_j | t^(1)=t_i) P(w^(2) | t^(2)=t_j)
Step 3: For any particular choice of t^(3) = t_k for w^(3), pick the tag t_j for w^(2) that gives the highest probability:
  argmax_{t_j} π(t^(1)=t_i) P(w^(1) | t^(1)=t_i) P(t^(2)=t_j | t^(1)=t_i) P(w^(2) | t^(2)=t_j) P(t^(3)=t_k | t^(2)=t_j)
Viterbi: Basic Idea
Assume we knew (for any tag t_j) the maximum probability of any complete sequence t^(1) … t^(N) that ends in that tag, t^(N) = t_j [N: last word in w].
Call that probability the Viterbi probability of tag t_j at position N, and store it as trellis[N][j].viterbi.
Then the probability of the best tag sequence (i.e. the maximum probability of any complete sequence t^(1) … t^(N)) for our sentence is
  max_{k ∈ {1,…,T}} trellis[N][k].viterbi
Viterbi: Basic Idea
Viterbi probability of tag t_j for word w^(i): trellis[i][j].viterbi
The highest probability P(w^(1)...(i), t^(1)...(i)) of the prefix w^(1)...(i) and any tag sequence t^(1)...(i) ending in t^(i) = t_j:
  trellis[i][j].viterbi = max P(w^(1) … w^(i), t^(1) …, t^(i) = t_j)
The probability of the best tag sequence overall is given by
  max_k trellis[N][k].viterbi
(the largest entry in the last column of the trellis).
The Viterbi probability trellis[i][j].viterbi (for any cell in the trellis) can easily be computed based on the cells in the preceding column, trellis[i-1][k].viterbi.
Viterbi: Basic Idea
Viterbi probability of tag t_j for word w^(i): trellis[i][j].viterbi
The highest probability P(w^(1)...(i), t^(1)...(i)) of the prefix w^(1)...(i) and any tag sequence ending in t^(i) = t_j.
Base case: the first word in the sentence, w^(1):
  trellis[1][j].viterbi = π(t_j) P(w^(1) | t_j)
  (initial probability for tag t_j × emission probability for w^(1))
Recurrence: any other word w^(i) in the sentence:
  trellis[i][j].viterbi = max_k ( trellis[i-1][k].viterbi × P(t_j | t_k) P(w^(i) | t_j) )
  (Viterbi probability of tag t_k for the preceding word w^(i-1) × transition probability for t_j given t_k × emission probability for w^(i) given t_j)
Initialization
For a bigram HMM: Given an N-word sentence w^(1) … w^(N) and a tag set consisting of T tags, create a trellis of size N × T.
In the first column, initialize each cell trellis[1][k] as
  trellis[1][k] := π(t_k) P(w^(1) | t_k)
(there is only a single tag sequence for the first word that assigns a particular tag to that word)
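A sketch of this initialization step, assuming the same dictionary-shaped toy parameters as in the earlier sketches (names and numbers are illustrative; SimpleNamespace cells mirror the slides' trellis[i][j].viterbi notation, and Python's 0-based indexing means trellis[0] is the slides' first column):

```python
from types import SimpleNamespace

def init_trellis(words, tags, init_p, emit_p):
    """Create the N x T trellis and fill its first column with pi(t_k) * P(w^(1) | t_k)."""
    trellis = [[SimpleNamespace(viterbi=0.0, backpointer=None) for _ in tags]
               for _ in words]
    for k, tag in enumerate(tags):
        trellis[0][k].viterbi = init_p.get(tag, 0.0) * emit_p.get(tag, {}).get(words[0], 0.0)
    return trellis

tags    = ["DT", "NNS", "VBZ"]
init_p  = {"DT": 0.5, "NNS": 0.3, "VBZ": 0.2}
emit_p  = {"DT": {"the": 0.6}, "NNS": {"bills": 0.01}}
trellis = init_trellis(["the", "bills"], tags, init_p, emit_p)
print([cell.viterbi for cell in trellis[0]])   # [0.3, 0.0, 0.0]
```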
Viterbi: filling in the first column
[Figure: the first column of the trellis for word w^(1), one cell per tag, e.g. π(DT) × P(w^(1) | DT), π(NNS) × P(w^(1) | NNS), π(VBZ) × P(w^(1) | VBZ). π(DT) is the probability that a sentence starts with DT; P(w^(1) | DT) is the probability that tag DT emits word w^(1).]
We want to find the best (most likely) tag sequence for the entire sentence.
Each cell trellis[i][j] (corresponding to word w^(i) with tag t_j) contains:
– trellis[i][j].viterbi: the probability of the best tag sequence for w^(1) … w^(i) ending in t_j
– trellis[i][j].backpointer: a pointer to the cell k in the previous column that corresponds to the best tag sequence ending in t_j
At any internal cell
– For each cell in the preceding column: multiply its Viterbi probability with the transition probability to the current cell.
– Keep a single backpointer to the best (highest scoring) cell in the preceding column.
– Multiply this score with the emission probability of the current word.
[Figure: every cell in column n-1 (holding max P(w^(1..n-1), t^(n-1) = t_j)) feeds into the cell for tag t_i in column n:
  trellis[n][i].viterbi = P(w^(n) | t_i) · max_j ( trellis[n-1][j].viterbi · P(t_i | t_j) ) ]
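A sketch of filling one such internal cell, assuming the previous column's Viterbi probabilities are already known (all names and toy numbers are my own):

```python
from types import SimpleNamespace

def fill_cell(prev_column, tags, tag_i, word, trans_p, emit_p):
    """trellis[n][i].viterbi = max_j( trellis[n-1][j].viterbi * P(t_i|t_j) ) * P(w^(n)|t_i),
    plus a backpointer to the best predecessor j."""
    scores = [prev_column[j].viterbi * trans_p.get((tags[j], tag_i), 0.0)
              for j in range(len(tags))]
    best_j = max(range(len(tags)), key=lambda j: scores[j])
    viterbi = scores[best_j] * emit_p.get(tag_i, {}).get(word, 0.0)
    return SimpleNamespace(viterbi=viterbi, backpointer=best_j)

# Toy demo: previous column for "the" (DT clearly best); current word "bills" tagged NNS.
tags    = ["DT", "NNS", "VBZ"]
prev    = [SimpleNamespace(viterbi=0.3, backpointer=None),
           SimpleNamespace(viterbi=0.001, backpointer=None),
           SimpleNamespace(viterbi=0.0, backpointer=None)]
trans_p = {("DT", "NNS"): 0.7, ("NNS", "NNS"): 0.1}
emit_p  = {"NNS": {"bills": 0.02}}
cell = fill_cell(prev, tags, "NNS", "bills", trans_p, emit_p)
print(cell.viterbi, cell.backpointer)   # ~0.0042 0  (best predecessor is DT)
```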
At the end of the sentence
In the last column (i.e. at the end of the sentence), pick the cell with the highest entry, and trace back the backpointers to the first word in the sentence.
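Putting the pieces together, here is a complete minimal Viterbi decoder sketch: initialization, recurrence with backpointers, and the final backtrace. All parameter names and toy numbers are assumptions for illustration, not the lecture's own code:

```python
from types import SimpleNamespace

def viterbi_decode(words, tags, init_p, trans_p, emit_p):
    def emit(tag, word):
        return emit_p.get(tag, {}).get(word, 0.0)

    # Initialization: trellis[0][j].viterbi = pi(t_j) * P(w^(1) | t_j)
    trellis = [[SimpleNamespace(viterbi=0.0, backpointer=None) for _ in tags]
               for _ in words]
    for j, tag in enumerate(tags):
        trellis[0][j].viterbi = init_p.get(tag, 0.0) * emit(tag, words[0])

    # Recurrence: trellis[i][j].viterbi = max_k( trellis[i-1][k].viterbi * P(t_j|t_k) ) * P(w^(i)|t_j)
    for i in range(1, len(words)):
        for j, tag in enumerate(tags):
            scores = [trellis[i - 1][k].viterbi * trans_p.get((tags[k], tag), 0.0)
                      for k in range(len(tags))]
            best_k = max(range(len(tags)), key=lambda k: scores[k])
            trellis[i][j].viterbi = scores[best_k] * emit(tag, words[i])
            trellis[i][j].backpointer = best_k

    # Termination: pick the best cell in the last column and trace the backpointers.
    best = max(range(len(tags)), key=lambda k: trellis[-1][k].viterbi)
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(trellis[i][path[-1]].backpointer)
    path.reverse()
    return [tags[j] for j in path]

tags    = ["DT", "NN", "VB"]
init_p  = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {("DT", "NN"): 0.8, ("NN", "VB"): 0.5, ("NN", "NN"): 0.2}
emit_p  = {"DT": {"the": 0.6}, "NN": {"dog": 0.1, "barks": 0.01}, "VB": {"barks": 0.2}}
print(viterbi_decode(["the", "dog", "barks"], tags, init_p, trans_p, emit_p))  # ['DT', 'NN', 'VB']
```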