HMM, MEMM and CRF
Probabilistic Graphical Models
Sharif University of Technology
Spring 2017
Soleymani
Sequence labeling
- Taking collectively a set of interrelated instances $x_1, \dots, x_T$ and jointly labeling them.
- We get as input a sequence of observations $X = x_{1:T}$ and need to label it with some joint label $Y = y_{1:T}$.
Generalization of mixture models for sequential data [Jordan]
Y: states (latent variables), X: observations
[Figure: first-order chain of states $Y_1 \to Y_2 \to \dots \to Y_{T-1} \to Y_T$, each state emitting an observation $X_1, X_2, \dots, X_{T-1}, X_T$]
HMM examples
- Some applications of HMM: speech recognition, NLP, activity recognition
- Part-of-speech tagging:
  Students/NNS are/VBP expected/VBN to/TO study/VB
HMM: probabilistic model
- Transition probabilities: transition probabilities between states
  $A_{ij} \equiv P(y_t = j \mid y_{t-1} = i)$
- Initial state distribution: start probabilities in different states
  $\pi_i \equiv P(y_1 = i)$
- Observation model: emission probabilities associated with each state
  $P(x_t \mid y_t, \theta)$
HMM: probabilistic model
Y: states (latent variables), X: observations
- Transition probabilities: transition probabilities between states
  $P(y_t \mid y_{t-1} = j) = \mathrm{Mult}(y_t \mid A_{j1}, \dots, A_{jK}) \quad \forall j \in \text{states}$
- Initial state distribution: start probabilities in different states
  $P(y_1) = \mathrm{Mult}(y_1 \mid \pi_1, \dots, \pi_K)$
- Observation model: emission probabilities associated with each state
  - Discrete observations: $P(x_t \mid y_t = k) = \mathrm{Mult}(x_t \mid B_{k,1}, \dots, B_{k,L}) \quad \forall k \in \text{states}$
  - General: $P(x_t \mid y_t = k) = f(\cdot \mid \theta_k)$
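As a concrete companion to this parameterization, here is a minimal NumPy sketch (not part of the original slides) of a discrete-observation HMM. The array names pi, A, B, the sizes K and L, and the sample_hmm helper are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

K, L = 3, 4                              # number of states / observation symbols (illustrative)
pi = np.full(K, 1.0 / K)                 # initial distribution, pi[i] = P(y_1 = i)
A = rng.dirichlet(np.ones(K), size=K)    # transition matrix, A[i, j] = P(y_t = j | y_{t-1} = i)
B = rng.dirichlet(np.ones(L), size=K)    # emission matrix,   B[k, m] = P(x_t = m | y_t = k)

def sample_hmm(T):
    """Sample a state sequence y and an observation sequence x of length T."""
    y = np.empty(T, dtype=int)
    x = np.empty(T, dtype=int)
    y[0] = rng.choice(K, p=pi)
    x[0] = rng.choice(L, p=B[y[0]])
    for t in range(1, T):
        y[t] = rng.choice(K, p=A[y[t - 1]])   # y_t ~ Mult(A[y_{t-1}, :])
        x[t] = rng.choice(L, p=B[y[t]])       # x_t ~ Mult(B[y_t, :])
    return y, x

y, x = sample_hmm(10)
```

Each row of A and B is a normalized distribution, matching the Mult(·) forms above.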
Inference problems in sequential data
- Decoding: $\operatorname{argmax}_{y_1, \dots, y_T}\; P(y_1, \dots, y_T \mid x_1, \dots, x_T)$
- Evaluation:
  - Filtering: $P(y_t \mid x_1, \dots, x_t)$
  - Smoothing: $P(y_{t'} \mid x_1, \dots, x_t)$ for $t' < t$
  - Prediction: $P(y_{t'} \mid x_1, \dots, x_t)$ for $t' > t$
Some questions
- Inference:
  - $P(y_t \mid x_1, \dots, x_t) = ?$
  - $P(x_1, \dots, x_T) = ?$
  - $P(y_t \mid x_1, \dots, x_T) = ?$
- Learning: how do we adjust the HMM parameters?
  - Complete data: each training sample includes a state sequence and the corresponding observation sequence
  - Incomplete data: each training sample includes only an observation sequence
Forward algorithm
$\alpha_t(k) = P(x_1, \dots, x_t, y_t = k), \quad i, k = 1, \dots, K$
- Initialization:
  $\alpha_1(k) = P(x_1, y_1 = k) = P(x_1 \mid y_1 = k)\, P(y_1 = k)$
- Iterations: $t = 2$ to $T$
  $\alpha_t(k) = \Big[\sum_i \alpha_{t-1}(i)\, P(y_t = k \mid y_{t-1} = i)\Big]\, P(x_t \mid y_t = k)$
[Figure: the state chain annotated with the forward messages $\alpha_1(\cdot), \alpha_2(\cdot), \dots, \alpha_{T-1}(\cdot), \alpha_T(\cdot)$; $\alpha_t(k)$ is the message $m_{t-1 \to t}(k)$]
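A direct NumPy transcription of this recursion might look as follows. It is only a sketch, assuming the pi, A, B arrays of the earlier parameterization sketch and a discrete observation sequence x; for long sequences one would normalize each alpha_t or work in log space to avoid underflow.

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, k] = P(x_1..x_{t+1}, y_{t+1} = k) with 0-based time index t."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = B[:, x[0]] * pi                      # P(x_1 | y_1 = k) P(y_1 = k)
    for t in range(1, T):
        # sum_i alpha_{t-1}(i) P(y_t = k | y_{t-1} = i), times the emission term
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha
```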
Backward algorithm
$\beta_t(k) = m_{t \to t-1}(k) = P(x_{t+1}, \dots, x_T \mid y_t = k), \quad i, k \in \text{states}$
- Initialization:
  $\beta_T(k) = 1$
- Iterations: $t = T$ down to 2
  $\beta_{t-1}(i) = \sum_k \beta_t(k)\, P(y_t = k \mid y_{t-1} = i)\, P(x_t \mid y_t = k)$
[Figure: the state chain annotated with the backward messages $\beta_1(\cdot), \beta_2(\cdot), \dots, \beta_{T-1}(\cdot), \beta_T(\cdot) = 1$]
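The backward recursion admits an analogous sketch, again assuming the same A and B arrays; note that the last row of beta is initialized to all ones.

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = P(x_{t+2}..x_T | y_{t+1} = k) with 0-based time index t."""
    T, K = len(x), A.shape[0]
    beta = np.ones((T, K))                            # beta_T(k) = 1
    for t in range(T - 2, -1, -1):
        # beta_{t}(i) = sum_k A[i, k] P(x_{t+1} | y_{t+1} = k) beta_{t+1}(k)
        beta[t] = A @ (beta[t + 1] * B[:, x[t + 1]])
    return beta
```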
Forward-backward algorithm
$\alpha_t(k) \equiv P(x_1, x_2, \dots, x_t, y_t = k)$
$\beta_t(k) \equiv P(x_{t+1}, x_{t+2}, \dots, x_T \mid y_t = k)$
$\alpha_t(k) = \Big[\sum_i \alpha_{t-1}(i)\, P(y_t = k \mid y_{t-1} = i)\Big]\, P(x_t \mid y_t = k), \qquad \alpha_1(k) = P(x_1, y_1 = k) = P(x_1 \mid y_1 = k)\, P(y_1 = k)$
$\beta_{t-1}(i) = \sum_k \beta_t(k)\, P(y_t = k \mid y_{t-1} = i)\, P(x_t \mid y_t = k), \qquad \beta_T(k) = 1$
$P(x_1, x_2, \dots, x_T) = \sum_k \alpha_T(k)\, \beta_T(k) = \sum_k \alpha_T(k)$
$P(y_t = k \mid x_1, x_2, \dots, x_T) = \dfrac{\alpha_t(k)\, \beta_t(k)}{\sum_j \alpha_T(j)}$
Forward-backward algorithm
- This will also be used in the E-step of the EM algorithm to train an HMM:
  $P(y_t = k \mid x_1, \dots, x_T) = \dfrac{P(x_1, \dots, x_T, y_t = k)}{P(x_1, \dots, x_T)} = \dfrac{\alpha_t(k)\, \beta_t(k)}{\sum_{j=1}^{K} \alpha_T(j)}$
[Figure: the state chain annotated with both the forward messages $\alpha_1(\cdot), \dots, \alpha_T(\cdot)$ and the backward messages $\beta_1(\cdot), \dots, \beta_T(\cdot) = 1$]
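Combining the two sketches above gives the smoothed posteriors and the sequence likelihood; this is again an illustrative, unscaled implementation that reuses the forward and backward functions defined earlier.

```python
import numpy as np

def smooth(x, pi, A, B):
    """Return P(y_t = k | x_1..x_T) for all t, and the sequence likelihood P(x_1..x_T)."""
    alpha = forward(x, pi, A, B)         # from the forward-algorithm sketch above
    beta = backward(x, A, B)             # from the backward-algorithm sketch above
    likelihood = alpha[-1].sum()         # P(x_1..x_T) = sum_j alpha_T(j)
    gamma = alpha * beta / likelihood    # gamma[t, k] = alpha_t(k) beta_t(k) / P(x)
    return gamma, likelihood
```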
Decoding Problem
- Choose the most probable state sequence given the observations:
  $\operatorname{argmax}_{y_1, \dots, y_t}\; P(y_1, \dots, y_t \mid x_1, \dots, x_t)$
- Viterbi algorithm:
  - Define the auxiliary variable $\delta$:
    $\delta_t(k) = \max_{y_1, \dots, y_{t-1}} P(y_1, y_2, \dots, y_{t-1}, y_t = k, x_1, x_2, \dots, x_t)$
  - $\delta_t(k)$: probability of the most probable path ending in state $y_t = k$
  - Recursive relation:
    $\delta_t(k) = \Big[\max_{i=1,\dots,K} \delta_{t-1}(i)\, P(y_t = k \mid y_{t-1} = i)\Big]\, P(x_t \mid y_t = k)$
Decoding Problem: Viterbi algorithm
- Initialization: $k = 1, \dots, K$
  $\delta_1(k) = P(x_1 \mid y_1 = k)\, P(y_1 = k)$
  $\psi_1(k) = 0$
- Iterations: $t = 2, \dots, T$ and $k = 1, \dots, K$
  $\delta_t(k) = \Big[\max_i \delta_{t-1}(i)\, P(y_t = k \mid y_{t-1} = i)\Big]\, P(x_t \mid y_t = k)$
  $\psi_t(k) = \operatorname{argmax}_i\; \delta_{t-1}(i)\, P(y_t = k \mid y_{t-1} = i)$
- Final computation:
  $P^* = \max_{k=1,\dots,K} \delta_T(k)$
  $y_T^* = \operatorname{argmax}_{k=1,\dots,K} \delta_T(k)$
- Traceback of the state sequence: $t = T-1$ down to 1
  $y_t^* = \psi_{t+1}(y_{t+1}^*)$
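Putting the initialization, recursion, and traceback together, a compact NumPy sketch of Viterbi decoding could look as follows (assuming the same pi, A, B arrays as before; a production version would use log probabilities to avoid underflow).

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most probable state sequence under the HMM (pi, A, B) for observations x."""
    T, K = len(x), len(pi)
    delta = np.zeros((T, K))             # delta[t, k]: prob. of the best path ending in state k
    psi = np.zeros((T, K), dtype=int)    # psi[t, k]: back-pointer to the best previous state
    delta[0] = B[:, x[0]] * pi
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, k] = delta_{t-1}(i) A[i, k]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                   # y_T^* = argmax_k delta_T(k)
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]           # traceback: y_t^* = psi_{t+1}(y_{t+1}^*)
    return path
```

On the chain, this recursion is exactly max-product message passing, as the next slide notes.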
Max-product algorithm
$m_{ji}^{\max}(x_i) = \max_{x_j}\Big(\varphi(x_j)\, \psi(x_i, x_j) \times \prod_{k \in \mathcal{N}(j) \setminus i} m_{kj}^{\max}(x_j)\Big)$
- For the HMM chain, the Viterbi quantities are max-product messages: $\delta_t(k) = m_{t-1,t}^{\max}(k)$
HMM Learning
- Supervised learning: when we have a set of data samples, each of them containing a pair of sequences (one is the observation sequence and the other is the state sequence)
- Unsupervised learning: when we have a set of data samples, each of them containing a sequence of observations
HMM supervised learning by MLE
[Figure: the state chain $Y_1 \to Y_2 \to \dots \to Y_T$ with transition parameters $\mathbf{A}$ and emission parameters $\mathbf{B}$ generating $X_1, X_2, \dots, X_T$]
- Initial state probability: $\pi_i = P(y_1 = i), \quad 1 \le i \le K$
- State transition probability: $A_{ij} = P(y_{t+1} = j \mid y_t = i), \quad 1 \le i, j \le K$
- Emission probability (discrete observations): $B_{jk} = P(x_t = k \mid y_t = j), \quad 1 \le k \le L$
HMM: supervised parameter learning by MLE
$P(\mathbf{X}, \mathbf{Y} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} P(y_1^{(n)}) \prod_{t=2}^{T} P(y_t^{(n)} \mid y_{t-1}^{(n)}, \mathbf{A}) \prod_{t=1}^{T} P(x_t^{(n)} \mid y_t^{(n)}, \mathbf{B})$
$\hat{A}_{ij} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} \delta\big(y_{t-1}^{(n)} = i,\; y_t^{(n)} = j\big)}{\sum_{n=1}^{N} \sum_{t=2}^{T} \delta\big(y_{t-1}^{(n)} = i\big)}$
$\hat{\pi}_i = \dfrac{\sum_{n=1}^{N} \delta\big(y_1^{(n)} = i\big)}{N}$
$\hat{B}_{jk} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \delta\big(y_t^{(n)} = j,\; x_t^{(n)} = k\big)}{\sum_{n=1}^{N} \sum_{t=1}^{T} \delta\big(y_t^{(n)} = j\big)} \quad \text{(discrete observations)}$
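These closed-form estimates are just normalized counts, as the following sketch illustrates; the sequences argument and the optional eps (add-epsilon smoothing) are illustrative choices, not from the slides.

```python
import numpy as np

def mle_supervised(sequences, K, L, eps=0.0):
    """MLE of (pi, A, B) from fully observed (state_seq, obs_seq) pairs by counting."""
    pi = np.zeros(K)
    A = np.zeros((K, K))
    B = np.zeros((K, L))
    for y, x in sequences:               # y: state sequence, x: observation sequence
        pi[y[0]] += 1
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1       # count transitions i -> j
        for t in range(len(y)):
            B[y[t], x[t]] += 1           # count emissions of symbol x_t in state y_t
    pi = (pi + eps) / (pi + eps).sum()
    A = (A + eps) / (A + eps).sum(axis=1, keepdims=True)
    B = (B + eps) / (B + eps).sum(axis=1, keepdims=True)
    return pi, A, B
```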
Learning
- Problem: how to construct an HMM given only observations?
  - Find $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ maximizing $P(\mathbf{x}_1, \dots, \mathbf{x}_N \mid \boldsymbol{\theta})$
  - Incomplete data
  - EM algorithm
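The slides only name EM at this point; for completeness, a standard Baum-Welch instantiation of that EM procedure is sketched below, reusing the forward and backward functions from the earlier sketches. It is an unscaled illustration (it will underflow on long sequences) rather than the course's reference implementation.

```python
import numpy as np

def baum_welch(seqs, K, L, n_iter=20, seed=0):
    """EM (Baum-Welch) sketch for a discrete HMM given only observation sequences."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(K))
    A = rng.dirichlet(np.ones(K), size=K)
    B = rng.dirichlet(np.ones(L), size=K)
    for _ in range(n_iter):
        pi_acc, A_acc, B_acc = np.zeros(K), np.zeros((K, K)), np.zeros((K, L))
        for x in seqs:
            alpha = forward(x, pi, A, B)          # from the earlier sketches
            beta = backward(x, A, B)
            px = alpha[-1].sum()                  # sequence likelihood P(x_1..x_T)
            gamma = alpha * beta / px             # E-step: P(y_t = k | x)
            pi_acc += gamma[0]
            for t in range(len(x) - 1):
                # xi[i, j] = P(y_t = i, y_{t+1} = j | x)
                xi = alpha[t][:, None] * A * B[:, x[t + 1]] * beta[t + 1] / px
                A_acc += xi
            for t, xt in enumerate(x):
                B_acc[:, xt] += gamma[t]
        pi = pi_acc / pi_acc.sum()                # M-step: normalized expected counts
        A = A_acc / A_acc.sum(axis=1, keepdims=True)
        B = B_acc / B_acc.sum(axis=1, keepdims=True)
    return pi, A, B
```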