ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2019
Recap: HMM
• Elements of HMM:
  – Set of states (tags)
  – Output alphabet (word types)
  – Start state (beginning of sentence)
  – State transition probabilities
  – Output probabilities from each state
More general notation
• Previous lecture:
  – Sequence of tags T = t_1 ... t_n
  – Sequence of words S = w_1 ... w_n
• This lecture:
  – Sequence of states Q = q_1 ... q_T
  – Sequence of outputs O = o_1 ... o_T
  – So t is now a time step, not a tag! And T is the sequence length.
Recap: HMM
• Given a sentence O = o_1 ... o_T with tags Q = q_1 ... q_T, compute P(O,Q) as:

    $P(O, Q) = \prod_{t=1}^{T} P(o_t \mid q_t)\, P(q_t \mid q_{t-1})$

• But we want to find $\arg\max_Q P(Q \mid O)$ without enumerating all possible Q
  – Use Viterbi algorithm to store partial computations.
Today’s lecture
• What algorithms can we use to
  – Efficiently compute the most probable tag sequence for a given word sequence?
  – Efficiently compute the likelihood for an HMM (probability it outputs a given sequence s)?
  – Learn the parameters of an HMM given unlabelled training data?
• What are the properties of these algorithms (complexity, convergence, etc.)?
Tagging example

  Words:          <s>   one   dog   bit   </s>
  Possible tags:  <s>   CD    NN    NN    </s>
  (ordered by           NN    VB    VBD
  frequency for         PRP
  each word)
Tagging example (same words and candidate tags as above)
• Choosing the best tag for each word independently gives the wrong answer (<s> CD NN NN </s>).
• P(VBD|bit) < P(NN|bit), but VBD may yield a better sequence (<s> CD NN VBD </s>)
  – because P(VBD|NN) and P(</s>|VBD) are high.
Viterbi: intuition (same example as above)
• Suppose we have already computed
  a) The best tag sequence for <s> ... bit that ends in NN.
  b) The best tag sequence for <s> ... bit that ends in VBD.
• Then, the best full sequence would be either
  – sequence (a) extended to include </s>, or
  – sequence (b) extended to include </s>.
Viterbi: intuition (same example as above)
• But similarly, to get
  a) The best tag sequence for <s> ... bit that ends in NN.
• We could extend one of:
  – The best tag sequence for <s> ... dog that ends in NN.
  – The best tag sequence for <s> ... dog that ends in VB.
• And so on…
Viterbi: high-level picture
• Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. (t is now a time step, not a tag.) So,
  – Find the best path of length t-1 to each state.
  – Consider extending each of those by 1 step, to state q.
  – Take the best of those options as the best path to state q.
Notation
• Sequence of observations over time: o_1, o_2, ..., o_T
  – here, words in sentence
• Vocabulary size V of possible observations
• Set of possible states q_1, q_2, ..., q_N (see note next slide)
  – here, tags
• A, an NxN matrix of transition probabilities
  – a_ij: the prob of transitioning from state i to j. (JM3 Fig 8.7)
• B, an NxV matrix of output probabilities
  – b_i(o_t): the prob of emitting o_t from state i. (JM3 Fig 8.8)
Note on notation
• J&M use q_1, q_2, ..., q_N for the set of states, but also use q_1, q_2, ..., q_T for the state sequence over time.
  – So, just seeing q_1 is ambiguous (though usually disambiguated from context).
  – I’ll instead use q_i for state names, and q_t for the state at time t.
  – So we could have q_t = q_i, meaning: the state we’re in at time t is q_i.
HMM example w/ new notation
[State-transition diagram: Start → q_1; q_1 → q_1 (.7), q_1 → q_2 (.3), q_2 → q_1 (.5), q_2 → q_2 (.5); q_1 emits x/y/z with probs .6/.1/.3, q_2 emits x/y/z with probs .1/.7/.2]
• States {q_1, q_2} (or {<s>, q_1, q_2})
• Output alphabet {x, y, z}
Adapted from Manning & Schuetze, Fig 9.2
Transition and Output Probabilities
• Transition matrix A:  a_ij = P(q_j | q_i)

          q_1   q_2
   <s>    1     0
   q_1    .7    .3
   q_2    .5    .5

• Output matrix B:  b_i(o) = P(o | q_i) for output o

          x     y     z
   q_1    .6    .1    .3
   q_2    .1    .7    .2
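As a concrete reference (a sketch added here, not part of the original slides), these two tables can be written down as plain Python dictionaries. The names `states`, `A`, and `B` are my own and are reused in the later sketches.

```python
# A minimal encoding of the example HMM from the tables above.
# '<s>' is the start state; q_1 and q_2 are the two emitting states.
states = ['q1', 'q2']

# Transition probabilities a_ij = P(q_j | q_i)
A = {
    '<s>': {'q1': 1.0, 'q2': 0.0},
    'q1':  {'q1': 0.7, 'q2': 0.3},
    'q2':  {'q1': 0.5, 'q2': 0.5},
}

# Output probabilities b_i(o) = P(o | q_i)
B = {
    'q1': {'x': 0.6, 'y': 0.1, 'z': 0.3},
    'q2': {'x': 0.1, 'y': 0.7, 'z': 0.2},
}
```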
Joint probability of (states, outputs)
• Let λ = (A, B) be the parameters of our HMM.
• Using our new notation, given state sequence Q = (q_1 ... q_T) and output sequence O = (o_1 ... o_T), we have:

    $P(O, Q \mid \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t)\, P(q_t \mid q_{t-1})$

• Or:

    $P(O, Q \mid \lambda) = \prod_{t=1}^{T} b_{q_t}(o_t)\, a_{q_{t-1} q_t}$

• Example:

    $P(O = (y, z), Q = (q_1, q_1) \mid \lambda) = b_1(y) \cdot b_1(z) \cdot a_{\langle s \rangle 1} \cdot a_{11} = (.1)(.3)(1)(.7)$
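A minimal sketch of this product, assuming the `A`/`B` dictionaries from the earlier sketch: it walks the two sequences in parallel, multiplying one transition and one emission probability per time step.

```python
def joint_prob(outputs, state_seq, A, B):
    """P(O, Q | lambda): product over t of a_{q_{t-1} q_t} * b_{q_t}(o_t)."""
    prob = 1.0
    prev = '<s>'                       # q_0 is the start state
    for o, q in zip(outputs, state_seq):
        prob *= A[prev][q] * B[q][o]   # transition into q, then emit o
        prev = q
    return prob

# The example from the slide: O = (y, z), Q = (q1, q1)
# = b_1(y) * b_1(z) * a_{<s>,1} * a_{11} = (.1)(.3)(1)(.7) = 0.021
print(joint_prob(['y', 'z'], ['q1', 'q1'], A, B))   # ~0.021
```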
Viterbi: high-level picture
• Want to find $\arg\max_Q P(Q \mid O)$
• Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. So,
  – Find the best path of length t-1 to each state.
  – Consider extending each of those by 1 step, to state q.
  – Take the best of those options as the best path to state q.
Viterbi algorithm
• Use a chart to store partial results as we go
  – NxT table, where v(j,t) is the probability* of the best state sequence for o_1 ... o_t that ends in state j.
• Fill in columns from left to right, with

    $v(j, t) = \max_{i=1}^{N} v(i, t-1) \cdot a_{ij} \cdot b_j(o_t)$

• Store a backtrace to show, for each cell, which state at t-1 we came from.

*Specifically, v(j,t) stores the max of the joint probability P(o_1 ... o_t, q_1 ... q_{t-1}, q_t = j | λ)
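Putting the chart, the recurrence, and the backtrace together, here is a minimal Python sketch (not from the slides; it assumes the `A`/`B` dictionaries defined earlier, omits any </s> transition, and uses raw rather than log probabilities, so it would underflow on long sequences).

```python
def viterbi(outputs, states, A, B):
    """Return (best state sequence, chart) for the given output sequence.

    v[t][j] holds max P(o_1..o_t, q_1..q_{t-1}, q_t = j | lambda);
    back[t][j] records which state at time t-1 that maximum came from.
    """
    T = len(outputs)
    v = [{} for _ in range(T)]
    back = [{} for _ in range(T)]

    # First column: transition out of the start state, then emit o_1.
    for j in states:
        v[0][j] = A['<s>'][j] * B[j][outputs[0]]
        back[0][j] = '<s>'

    # Remaining columns, filled in from left to right.
    for t in range(1, T):
        for j in states:
            best_i = max(states, key=lambda i: v[t-1][i] * A[i][j])
            v[t][j] = v[t-1][best_i] * A[best_i][j] * B[j][outputs[t]]
            back[t][j] = best_i

    # Follow the backpointers from the best final state.
    last = max(states, key=lambda j: v[T-1][j])
    seq = [last]
    for t in range(T - 1, 0, -1):
        seq.append(back[t][seq[-1]])
    return list(reversed(seq)), v
```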
Example
• Suppose O = xzy. Our initially empty table:

          o_1=x   o_2=z   o_3=y
   q_1
   q_2
Filling the first column

          o_1=x   o_2=z   o_3=y
   q_1    .6
   q_2    0

    $v(1,1) = a_{\langle s \rangle 1} \cdot b_1(x) = (1)(.6) = .6$
    $v(2,1) = a_{\langle s \rangle 2} \cdot b_2(x) = (0)(.1) = 0$
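These two cells can be checked directly against the parameter dictionaries from the earlier sketch:

```python
# First column of the chart: transition out of <s>, then emit o_1 = x.
print(A['<s>']['q1'] * B['q1']['x'])   # v(1,1) = (1)(.6) = 0.6
print(A['<s>']['q2'] * B['q2']['x'])   # v(2,1) = (0)(.1) = 0.0
```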
Starting the second column

          o_1=x   o_2=z   o_3=y
   q_1    .6      .126
   q_2    0

    $v(1,2) = \max_{i=1}^{N} v(i,1) \cdot a_{i1} \cdot b_1(z)$
           $= \max\{\, v(1,1) \cdot a_{11} \cdot b_1(z) = (.6)(.7)(.3),\;\; v(2,1) \cdot a_{21} \cdot b_1(z) = (0)(.5)(.3) \,\} = .126$
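Running the Viterbi sketch from above on O = xzy reproduces these cells and, by the same recurrence, fills in the rest of the chart (the remaining values are my own continuation of the example, shown up to floating-point rounding):

```python
path, v = viterbi(['x', 'z', 'y'], states, A, B)
print(path)        # ['q1', 'q1', 'q2']
print(v[0])        # {'q1': 0.6, 'q2': 0.0}          -- first column above
print(v[1]['q1'])  # 0.126                            -- the cell just computed
print(v[2])        # {'q1': ~0.00882, 'q2': ~0.02646} -- last column of the chart
```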