

  1. HMM, MEMM and CRF. Probabilistic Graphical Models, Sharif University of Technology, Spring 2017. Soleymani

  2. Sequence labeling  Taking a set of interrelated instances x_1, …, x_T and labeling them collectively  We get as input a sequence of observations X = x_{1:T} and need to label them with some joint label Y = y_{1:T}

  3. Generalization of mixture models for sequential data [Jordan]  Y: states (latent variables), X: observations  Chain structure: Y_1 → Y_2 → … → Y_{T-1} → Y_T, with each state Y_t emitting an observation X_t

  4. HMM examples  Some applications of HMM  Speech recognition, NLP, activity recognition  Part-of-speech tagging, e.g. assigning tags (NNP, VBN, TO, VB, …) to the words of "Students are expected to study"

  5. HMM: probabilistic model  Transition probabilities: transition probabilities between states  A_ij ≡ P(Y_t = j | Y_{t-1} = i)  Initial state distribution: start probabilities in the different states  π_i ≡ P(Y_1 = i)  Observation model: emission probabilities associated with each state  P(X_t | Y_t, Φ)

  6. HMM: probabilistic model  Y: states (latent variables), X: observations  Transition probabilities: transition probabilities between states  P(Y_t | Y_{t-1} = i) = Mult(Y_t | A_{i1}, …, A_{iM}) ∀i ∈ states  Initial state distribution: start probabilities in the different states  P(Y_1) = Mult(Y_1 | π_1, …, π_M)  Observation model: emission probabilities associated with each state  Discrete observations: P(X_t | Y_t = i) = Mult(X_t | B_{i,1}, …, B_{i,K}) ∀i ∈ states  General: P(X_t | Y_t = i) = f(· | θ_i)
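The generative process above (draw Y_1 from π, each Y_t from the row of A indexed by the previous state, and each X_t from the emission row of Y_t) can be sketched in Python; the 2-state, 2-symbol parameter values below are illustrative assumptions, not taken from the slides.

```python
import random

def sample_hmm(pi, A, B, T, seed=0):
    """Draw a (states, observations) pair of length T from a discrete HMM."""
    rng = random.Random(seed)
    draw = lambda probs: rng.choices(range(len(probs)), weights=probs)[0]
    y = [draw(pi)]                       # Y_1 ~ Mult(pi)
    for _ in range(T - 1):
        y.append(draw(A[y[-1]]))         # Y_t | Y_{t-1} = i ~ Mult(A_i1, ..., A_iM)
    x = [draw(B[s]) for s in y]          # X_t | Y_t = i ~ Mult(B_i1, ..., B_iK)
    return y, x

pi = [0.6, 0.4]                          # toy initial distribution (illustrative)
A  = [[0.7, 0.3], [0.4, 0.6]]            # A[i][j] = P(Y_t = j | Y_{t-1} = i)
B  = [[0.9, 0.1], [0.2, 0.8]]            # B[i][l] = P(X_t = l | Y_t = i)
y, x = sample_hmm(pi, A, B, T=5)
```

Fixing the seed makes the draw reproducible, which is convenient when checking inference code against sampled data.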

  7. Inference problems in sequential data  Decoding: argmax_{y_1,…,y_T} P(y_1, …, y_T | x_1, …, x_T)  Evaluation  Filtering: P(y_t | x_1, …, x_t)  Smoothing: t′ < t, P(y_{t′} | x_1, …, x_t)  Prediction: t′ > t, P(y_{t′} | x_1, …, x_t)

  8. Some questions  Inference  P(y_t | x_1, …, x_t) = ?  P(x_1, …, x_T) = ?  P(y_t | x_1, …, x_T) = ?  Learning: how do we adjust the HMM parameters?  Complete data: each training sample includes a state sequence and the corresponding observation sequence  Incomplete data: each training sample includes only an observation sequence

  9. Forward algorithm (j, k = 1, …, M)  α_t(j) = P(x_1, …, x_t, Y_t = j)  Initialization  α_1(j) = P(x_1, Y_1 = j) = P(x_1 | Y_1 = j) P(Y_1 = j)  Iterations: t = 2 to T  α_t(j) = [Σ_k α_{t-1}(k) P(Y_t = j | Y_{t-1} = k)] P(x_t | Y_t = j)  α_t(j) is the message m_{t-1→t}(j) passed forward along the chain Y_1 → Y_2 → … → Y_T
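A minimal Python sketch of the forward recursion above, on a toy 2-state, 2-symbol HMM (the numeric values of pi, A, B are illustrative assumptions, not from the slides):

```python
def forward(pi, A, B, obs):
    """Forward table: alpha[t][j] = P(x_1..x_{t+1}, Y_{t+1} = j), 0-indexed t."""
    M = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(M)]]         # initialization
    for x in obs[1:]:                                          # iterations t = 2..T
        alpha.append([sum(alpha[-1][k] * A[k][j] for k in range(M)) * B[j][x]
                      for j in range(M)])
    return alpha

pi = [0.6, 0.4]                      # initial state distribution (toy values)
A  = [[0.7, 0.3], [0.4, 0.6]]        # A[k][j] = P(Y_t = j | Y_{t-1} = k)
B  = [[0.9, 0.1], [0.2, 0.8]]        # B[j][x] = P(X_t = x | Y_t = j)
alpha = forward(pi, A, B, [0, 1, 0])
evidence = sum(alpha[-1])            # evaluation: P(x_1..x_T) = sum_j alpha_T(j)
```

Summing the last column solves the evaluation problem in O(T M^2) time, versus O(M^T) for brute-force enumeration of all state paths.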

  10. Backward algorithm (j, k ∈ states)  β_t(j) = m_{t→t-1}(j) = P(x_{t+1}, …, x_T | Y_t = j)  Initialization  β_T(j) = 1  Iterations: t = T down to 2  β_{t-1}(j) = Σ_k β_t(k) P(Y_t = k | Y_{t-1} = j) P(x_t | Y_t = k)
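The backward recursion can be sketched the same way; the toy parameters below are the same illustrative values as in the forward example, and the evaluation identity P(x_1..x_T) = Σ_j π_j P(x_1 | Y_1 = j) β_1(j) serves as a cross-check.

```python
def backward(A, B, obs):
    """Backward table: beta[t][j] = P(x_{t+2}..x_T | Y_{t+1} = j), 0-indexed t."""
    M, T = len(A), len(obs)
    beta = [[1.0] * M]                          # initialization: beta_T(j) = 1
    for t in range(T - 1, 0, -1):               # iterations: t = T down to 2
        beta.insert(0, [sum(A[j][k] * B[k][obs[t]] * beta[0][k] for k in range(M))
                        for j in range(M)])
    return beta

pi = [0.6, 0.4]                                 # toy parameters (illustrative)
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
beta = backward(A, B, obs)
# Evaluation from the backward pass alone:
p = sum(pi[j] * B[j][obs[0]] * beta[0][j] for j in range(len(pi)))
```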

  11. Forward-backward algorithm  α_t(j) ≡ P(x_1, x_2, …, x_t, Y_t = j)  β_t(j) ≡ P(x_{t+1}, x_{t+2}, …, x_T | Y_t = j)  α_1(j) = P(x_1, Y_1 = j) = P(x_1 | Y_1 = j) P(Y_1 = j)  α_t(j) = [Σ_k α_{t-1}(k) P(Y_t = j | Y_{t-1} = k)] P(x_t | Y_t = j)  β_T(j) = 1  β_{t-1}(j) = Σ_k β_t(k) P(Y_t = k | Y_{t-1} = j) P(x_t | Y_t = k)  P(x_1, x_2, …, x_T) = Σ_j α_T(j) β_T(j) = Σ_j α_T(j)  P(Y_t = j | x_1, x_2, …, x_T) = α_t(j) β_t(j) / Σ_k α_T(k)

  12. Forward-backward algorithm  This will also be used in the E-step of the EM algorithm to train an HMM  P(Y_t = j | x_1, …, x_T) = P(x_1, …, x_T, Y_t = j) / P(x_1, …, x_T) = α_t(j) β_t(j) / Σ_{k=1}^{M} α_T(k)
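Combining the two passes gives the smoothed posteriors γ_t(j) = P(Y_t = j | x_1, …, x_T). A self-contained sketch, again on the illustrative toy model:

```python
def forward(pi, A, B, obs):
    M = len(pi)
    al = [[pi[j] * B[j][obs[0]] for j in range(M)]]
    for x in obs[1:]:
        al.append([sum(al[-1][k] * A[k][j] for k in range(M)) * B[j][x]
                   for j in range(M)])
    return al

def backward(A, B, obs):
    M, T = len(A), len(obs)
    be = [[1.0] * M]
    for t in range(T - 1, 0, -1):
        be.insert(0, [sum(A[j][k] * B[k][obs[t]] * be[0][k] for k in range(M))
                      for j in range(M)])
    return be

pi = [0.6, 0.4]; A = [[0.7, 0.3], [0.4, 0.6]]; B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
al, be = forward(pi, A, B, obs), backward(A, B, obs)
px = sum(al[-1])                                  # P(x_1..x_T) = sum_k alpha_T(k)
# Smoothed posterior gamma_t(j) = alpha_t(j) beta_t(j) / P(x_1..x_T)
gamma = [[al[t][j] * be[t][j] / px for j in range(len(pi))]
         for t in range(len(obs))]
```

Each row of gamma is a proper distribution over states, which is a useful sanity check on any forward-backward implementation.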

  13. Decoding problem  Choose the state sequence that best explains the observations  argmax_{y_1,…,y_t} P(y_1, …, y_t | x_1, …, x_t)  Viterbi algorithm  Define the auxiliary variable δ  δ_t(j) = max_{y_1,…,y_{t-1}} P(y_1, y_2, …, y_{t-1}, Y_t = j, x_1, x_2, …, x_t)  δ_t(j): probability of the most probable path ending in state Y_t = j  Recursive relation  δ_t(j) = [max_{k=1,…,M} δ_{t-1}(k) P(Y_t = j | Y_{t-1} = k)] P(x_t | Y_t = j)

  14. Decoding problem: Viterbi algorithm  Initialization: j = 1, …, M  δ_1(j) = P(x_1 | Y_1 = j) P(Y_1 = j)  ψ_1(j) = 0  Iterations: t = 2, …, T; j = 1, …, M  δ_t(j) = [max_k δ_{t-1}(k) P(Y_t = j | Y_{t-1} = k)] P(x_t | Y_t = j)  ψ_t(j) = argmax_k δ_{t-1}(k) P(Y_t = j | Y_{t-1} = k)  Final computation  P* = max_{k=1,…,M} δ_T(k)  y*_T = argmax_{k=1,…,M} δ_T(k)  Traceback of the state sequence: t = T − 1 down to 1  y*_t = ψ_{t+1}(y*_{t+1})
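The δ/ψ recursion and traceback above can be sketched compactly in Python; the 2-state parameters are the same illustrative toy values used in the earlier examples, not from the slides.

```python
def viterbi(pi, A, B, obs):
    """Return the most probable state path and its joint probability with obs."""
    M = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(M)]          # delta_1(j)
    psi = []                                                  # backpointers psi_t(j)
    for x in obs[1:]:
        back = [max(range(M), key=lambda k: delta[k] * A[k][j]) for j in range(M)]
        delta = [delta[back[j]] * A[back[j]][j] * B[j][x] for j in range(M)]
        psi.append(back)
    path = [max(range(M), key=lambda k: delta[k])]            # y*_T
    for back in reversed(psi):                                # traceback
        path.append(back[path[-1]])
    path.reverse()
    return path, max(delta)

pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
path, prob = viterbi(pi, A, B, [0, 1, 0])
```

Note the max replaces the sum of the forward algorithm; everything else, including the O(T M^2) cost, is identical.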

  15. Max-product algorithm  m^max_{j→i}(y_i) = max_{y_j} φ(y_j) φ(y_i, y_j) ∏_{l∈N(j)\i} m^max_{l→j}(y_j)  For the HMM chain, the Viterbi quantity is the max-product message, δ_t(j) = m^max_{t-1→t}(j), with the emission term playing the role of the local potential φ(y_j)

  16. HMM learning  Supervised learning: we have a set of data samples, each containing a pair of sequences (an observation sequence and the corresponding state sequence)  Unsupervised learning: we have a set of data samples, each containing only a sequence of observations

  17. HMM supervised learning by MLE  Parameters: π (initial), A (transitions), B (emissions)  Initial state probability: π_j = P(Y_1 = j), 1 ≤ j ≤ M  State transition probability: A_{kj} = P(Y_{t+1} = j | Y_t = k), 1 ≤ j, k ≤ M  Emission probability (discrete observations): B_{jl} = P(X_t = l | Y_t = j), 1 ≤ l ≤ K

  18. HMM: supervised parameter learning by MLE  P(D | θ) = ∏_{n=1}^{N} [ P(y_1^{(n)} | π) ∏_{t=2}^{T} P(y_t^{(n)} | y_{t-1}^{(n)}, A) ∏_{t=1}^{T} P(x_t^{(n)} | y_t^{(n)}, B) ]  A_{jk} = Σ_{n=1}^{N} Σ_{t=2}^{T} I(y_{t-1}^{(n)} = j, y_t^{(n)} = k) / Σ_{n=1}^{N} Σ_{t=2}^{T} I(y_{t-1}^{(n)} = j)  π_j = Σ_{n=1}^{N} I(y_1^{(n)} = j) / N  Discrete observations: B_{jl} = Σ_{n=1}^{N} Σ_{t=1}^{T} I(y_t^{(n)} = j, x_t^{(n)} = l) / Σ_{n=1}^{N} Σ_{t=1}^{T} I(y_t^{(n)} = j)
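The MLE estimates above are just normalized counts, so the supervised case reduces to counting starts, transitions, and emissions. A sketch, with the sequence data below as made-up example input:

```python
def mle_hmm(state_seqs, obs_seqs, M, K):
    """Closed-form MLE for (pi, A, B): normalized counts over labeled sequences."""
    pi = [0.0] * M
    A = [[0.0] * M for _ in range(M)]
    B = [[0.0] * K for _ in range(M)]
    for y, x in zip(state_seqs, obs_seqs):
        pi[y[0]] += 1                         # count I(y_1 = j)
        for a, b in zip(y, y[1:]):
            A[a][b] += 1                      # count I(y_{t-1} = j, y_t = k)
        for s, o in zip(y, x):
            B[s][o] += 1                      # count I(y_t = j, x_t = l)
    def norm(row):
        s = sum(row)
        return [c / s for c in row] if s else row
    return norm(pi), [norm(r) for r in A], [norm(r) for r in B]

# Two toy labeled sequences (states, then the aligned observations):
pi_hat, A_hat, B_hat = mle_hmm([[0, 0, 1], [1, 0, 0]], [[0, 1, 1], [1, 0, 0]], 2, 2)
```

In practice one adds pseudo-counts (Laplace smoothing) before normalizing, so that unseen transitions or emissions do not get probability zero.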

  19. Learning  Problem: how to construct an HMM given only observations?  Find θ = (A, B, π) maximizing P(x_1, …, x_T | θ)  Incomplete data  EM algorithm
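One EM (Baum-Welch) iteration can be sketched by combining the forward-backward posteriors (E-step) with the count-ratio updates of slide 18, using expected rather than observed counts (M-step). The toy parameters and sequences below are illustrative assumptions.

```python
def forward(pi, A, B, x):
    M = len(pi)
    al = [[pi[j] * B[j][x[0]] for j in range(M)]]
    for o in x[1:]:
        al.append([sum(al[-1][k] * A[k][j] for k in range(M)) * B[j][o]
                   for j in range(M)])
    return al

def backward(A, B, x):
    M, T = len(A), len(x)
    be = [[1.0] * M]
    for t in range(T - 1, 0, -1):
        be.insert(0, [sum(A[j][k] * B[k][x[t]] * be[0][k] for k in range(M))
                      for j in range(M)])
    return be

def baum_welch_step(pi, A, B, seqs):
    """One EM iteration: accumulate expected counts, then normalize."""
    M, K = len(pi), len(B[0])
    pi_c = [0.0] * M
    A_c = [[0.0] * M for _ in range(M)]
    B_c = [[0.0] * K for _ in range(M)]
    for x in seqs:
        al, be = forward(pi, A, B, x), backward(A, B, x)
        px = sum(al[-1])
        for t in range(len(x)):
            for j in range(M):
                g = al[t][j] * be[t][j] / px          # gamma_t(j)
                B_c[j][x[t]] += g
                if t == 0:
                    pi_c[j] += g
                if t < len(x) - 1:                    # xi_t(j, k)
                    for k in range(M):
                        A_c[j][k] += al[t][j] * A[j][k] * B[k][x[t + 1]] * be[t + 1][k] / px
    norm = lambda r: [v / sum(r) for v in r]
    return norm(pi_c), [norm(r) for r in A_c], [norm(r) for r in B_c]

pi = [0.6, 0.4]; A = [[0.7, 0.3], [0.4, 0.6]]; B = [[0.9, 0.1], [0.2, 0.8]]
seqs = [[0, 1, 0], [1, 1, 0]]
pi2, A2, B2 = baum_welch_step(pi, A, B, seqs)
```

Each such step is guaranteed not to decrease the data log-likelihood, so iterating to convergence yields a local maximum of P(x_1, …, x_T | θ).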
