Introduction to Natural Language Processing



  1. Introduction to Natural Language Processing, a course taught as B4M36NLP at Open Informatics by members of the Institute of Formal and Applied Linguistics. Today: Week 3 lecture. Today's topic: Markov Models. Today's teacher: Jan Hajič. E-mail: hajic@ufal.mff.cuni.cz, WWW: http://ufal.mff.cuni.cz/jan-hajic

  2. Review: Markov Process
  • Bayes formula (chain rule):
    P(W) = P(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_1, w_2, ..., w_{i-1})
  • n-gram language models approximate this by a Markov process (chain) of order n-1:
    P(W) = P(w_1, w_2, ..., w_T) ≈ ∏_{i=1..T} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
    using just one distribution (e.g., trigram model: p(w_i | w_{i-2}, w_{i-1})).
  • Example (positions 1..16):
    My car broke down , and within hours Bob 's car broke down , too .
    p(, | broke down) = p(w_5 | w_3, w_4) = p(w_14 | w_12, w_13)
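To make the trigram factorization concrete, here is a minimal sketch in Python (not from the slides; the probability table is invented for illustration, a real model would be estimated from a corpus) that scores a sentence with a trigram model, padding the history with sentence-boundary markers.

```python
# Minimal trigram language model scorer (illustrative toy example).
TRIGRAM_P = {
    ("<s>", "<s>", "My"): 0.1,
    ("<s>", "My", "car"): 0.4,
    ("My", "car", "broke"): 0.3,
    ("car", "broke", "down"): 0.9,
    ("broke", "down", ","): 0.5,
}

def trigram_prob(words, p=TRIGRAM_P):
    """P(W) = prod_i p(w_i | w_{i-2}, w_{i-1}), with the history padded by <s>."""
    h1, h2 = "<s>", "<s>"
    prob = 1.0
    for w in words:
        prob *= p.get((h1, h2, w), 1e-6)  # small floor for unseen trigrams
        h1, h2 = h2, w
    return prob

print(trigram_prob(["My", "car", "broke", "down", ","]))
```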

  3. Markov Properties
  • Generalize to any process (not just words/LM):
    – sequence of random variables: X = (X_1, X_2, ..., X_T)
    – sample space S (states), size N: S = {s_0, s_1, s_2, ..., s_N}
  1. Limited History (Context, Horizon):
     ∀ i ∈ 1..T: P(X_i | X_1, ..., X_{i-1}) = P(X_i | X_{i-1})
     (Slide illustration: in the digit sequence 1 7 3 7 9 0 6 7 3 4 5 ..., only the immediately preceding symbol matters for predicting the next one.)
  2. Time invariance (the Markov chain is stationary, homogeneous):
     ∀ i ∈ 1..T, ∀ y, x ∈ S: P(X_i = y | X_{i-1} = x) = p(y|x)
     (the same distribution is used at every position)
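A short sketch (my own illustration, with invented transition probabilities) of what the two properties mean operationally: one time-invariant table p(y|x) is consulted at every step, and only the previous state is remembered.

```python
import random

# One time-invariant transition distribution p(y | x): the same table is used at
# every position (homogeneity), and only the previous state is consulted
# (limited history). The numbers are made up for illustration.
P = {
    "7": {"3": 0.5, "9": 0.3, "2": 0.2},
    "3": {"7": 0.6, "4": 0.4},
    "9": {"0": 1.0}, "0": {"6": 1.0}, "6": {"7": 1.0},
    "4": {"5": 1.0}, "5": {"7": 1.0}, "2": {"7": 1.0},
}

def sample_chain(start, length):
    state, out = start, [start]
    for _ in range(length - 1):
        state = random.choices(list(P[state]), weights=P[state].values())[0]
        out.append(state)  # only the last state is remembered for the next step
    return out

print(" ".join(sample_chain("7", 11)))
```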

  4. Long History Possible
  • What if we want trigrams, i.e., a history of two symbols?
  • Formally, use a transformation: define new variables Q_i such that X_i = (Q_{i-1}, Q_i). Then
    P(X_i | X_{i-1}) = P(Q_{i-1}, Q_i | Q_{i-2}, Q_{i-1}) = P(Q_i | Q_{i-2}, Q_{i-1})
  • (Slide illustration: in the digit sequence 1 7 3 7 9 0 6 7 3 4 5 ..., the predicted variable X_i is the pair of the last two digits and the history X_{i-1} = (Q_{i-2}, Q_{i-1}) is the preceding pair, so the two pairs overlap in one digit.)
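A minimal sketch (my own, with invented probabilities) of this transformation: a second-order model over symbols becomes a first-order chain over overlapping symbol pairs.

```python
# Turn a trigram (order-2) model over symbols into a first-order chain over
# pairs of symbols: state X_i = (Q_{i-1}, Q_i). The probabilities are invented.
TRIGRAM = {
    ("t", "o"): {"e": 0.9, "o": 0.1},
    ("o", "e"): {"s": 1.0},
}

def pair_transition(prev_pair, new_pair):
    """P(X_i = new_pair | X_{i-1} = prev_pair) for the pair-state chain."""
    # The pairs must overlap in one symbol, otherwise the transition is impossible.
    if prev_pair[1] != new_pair[0]:
        return 0.0
    # The transition then reduces to P(Q_i | Q_{i-2}, Q_{i-1}).
    return TRIGRAM.get(prev_pair, {}).get(new_pair[1], 0.0)

print(pair_transition(("t", "o"), ("o", "e")))  # 0.9
print(pair_transition(("t", "o"), ("e", "s")))  # 0.0 (pairs do not overlap)
```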

  5. Graph Representation: State Diagram
  • S = {s_0, s_1, s_2, ..., s_N}: states
  • Distribution P(X_i | X_{i-1}): transitions (as arcs) with probabilities attached to them; the sum of outgoing probabilities at each state = 1.
  • Bigram case (slide figure): states for the symbols t, e, o, a plus a start state ("enter here"); arcs labelled with probabilities such as .6, .88, .12, .4, .3, .2, 1, and p(o|a) = .1.
  • p(toe) = .6 × .88 × 1 = .528
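A small sketch (illustrative; the transition table below only fills in the arcs needed for the example, using the probabilities quoted on the slide) of how p(toe) is read off the state diagram by multiplying arc probabilities along the path.

```python
# Bigram (first-order) Markov chain as a transition table; only the arcs used
# by the example are included.
P = {
    ("<start>", "t"): 0.6,
    ("t", "o"): 0.88,
    ("o", "e"): 1.0,
}

def word_prob(word, p=P):
    """Multiply arc probabilities along the unique path spelling the word."""
    prob, state = 1.0, "<start>"
    for symbol in word:
        prob *= p.get((state, symbol), 0.0)
        state = symbol
    return prob

print(word_prob("toe"))  # 0.6 * 0.88 * 1 = 0.528
```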

  6. The Trigram Case
  • S = {s_0, s_1, s_2, ..., s_N}: states are now pairs s_i = (x, y)
  • Distribution P(X_i | X_{i-1}): (the r.v. X generates pairs s_i)
  • (Slide figure: a state diagram over pair-states such as (t,e), (t,o), (o,e), (o,n), (n,e); arc probabilities .6, .4, .12, .88, .07, .93, 1; "enter here"; not all pair-states are possible.)
  • p(toe) = .6 × .88 × .07 ≈ .037; p(one) = ?

  7. Finite State Automaton
  • States ~ symbols of the [input/output] alphabet
    – pairs (or more): last element of the n-tuple
  • Arcs ~ transitions (sequences of states)
  • [Classical FSA: alphabet symbols on arcs; transformation: arcs ↔ nodes]
  • Possible thanks to the "limited history" Markov Property
  • So far: Visible Markov Models (VMM)

  8. Hidden Markov Models
  • The simplest HMM: states generate [observable] output (using the "data" alphabet) but themselves remain "invisible":
  • (Slide figure: the earlier state diagram with the states renamed 1, 2, 3, 4 and the output symbols t, e, o, a attached to them; arc probabilities .6, .12, .88, .4, .3, .2, 1, p(4|3) = .1; "enter here".)
  • p(toe) = .6 × .88 × 1 = .528

  9. Added Flexibility
  • So far, no change; but different states may generate the same output (why not?):
  • (Slide figure: as before, except that two different states now emit t, so a second state path can also generate "toe".)
  • p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568
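Once several state sequences can produce the same output, the observation probability becomes a sum over paths. A brute-force sketch (my own illustration; the toy model loosely follows the slide's numbers, with two different states emitting t, but the exact topology is a guess):

```python
from itertools import product

TRANS = {          # P(next_state | state); invented where the slide is not explicit
    "S": {"1": 0.6, "4": 0.4},
    "1": {"2": 0.88, "3": 0.12},
    "4": {"2": 0.1, "3": 0.9},
    "2": {"3": 1.0},
    "3": {"2": 1.0},
}
EMIT = {"1": "t", "4": "t", "2": "o", "3": "e"}   # deterministic output per state

def output_prob(output):
    """Brute force: sum path probabilities over all state sequences of the right length."""
    total = 0.0
    for path in product(EMIT, repeat=len(output)):
        if any(EMIT[s] != o for s, o in zip(path, output)):
            continue
        prob, prev = 1.0, "S"
        for s in path:
            prob *= TRANS.get(prev, {}).get(s, 0.0)
            prev = s
        total += prob
    return total

print(output_prob("toe"))  # 0.6*0.88*1 + 0.4*0.1*1 = 0.568
```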

  10. Output from Arcs...
  • Added flexibility: generate output from the arcs, not the states:
  • (Slide figure: the same diagram, with output symbols t, o, e attached to the arcs, so several arc sequences can generate "toe".)
  • p(toe) = .6 × .88 × 1 + .4 × .1 × 1 + .4 × .2 × .3 + .4 × .2 × .4 = .624

  11. ... and Finally, Add Output Probabilities
  • Maximum flexibility: a [unigram] distribution (sample space: the output alphabet) attached to each output arc (simplified on the slide):
  • (Slide figure: each arc carries its transition probability plus an output distribution over {t, o, e}, e.g., p(t)=.8, p(o)=.1, p(e)=.1 on one arc; the others shown are p(t)=.1, p(o)=.7, p(e)=.2; p(t)=0, p(o)=0, p(e)=1; p(t)=0, p(o)=.4, p(e)=.6; p(t)=0, p(o)=1, p(e)=0; p(t)=.5, p(o)=.2, p(e)=.3.)
  • p(toe) = Σ over state paths of the product of (transition probability × output probability) at every step ≈ .237 (the path-by-path decomposition is on the next slide)

  12. Slightly Different View
  • Allow for multiple arcs from s_i to s_j, mark each arc with an output symbol and a single combined probability, and get rid of separate output distributions:
  • (Slide figure: arcs labelled t,.48; o,.616; e,.176; t,.088; e,.12; o,.06; e,.06; o,.08; t,.2; o,.4; e,.6; o,1.)
  • p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≈ .237
  • In the future, we will use whichever view is more convenient for the problem at hand.
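A sketch of how this arc-labelled view is evaluated: enumerate all arc sequences that spell the output and sum their probabilities. The multigraph below is a partial, best-effort reconstruction from the numbers quoted on the slide; only the arcs needed for the "toe" example are included, and arc destinations I could not recover are guesses (they do not affect the result), so treat it as illustrative.

```python
# Arcs of the multigraph: (from_state, symbol) -> list of (to_state, probability).
ARCS = {
    ("x", "t"): [("1", 0.48), ("3", 0.2)],
    ("1", "o"): [("4", 0.616)],
    ("1", "e"): [("4", 0.176), ("2", 0.12)],
    ("3", "o"): [("1", 1.0)],
    ("4", "e"): [("2", 0.6)],   # destination guessed; it does not change p(toe)
}

def output_prob(output, state="x", prob=1.0):
    """Sum of path probabilities over all arc sequences emitting `output`."""
    if not output:
        return prob
    total = 0.0
    for to_state, p in ARCS.get((state, output[0]), []):
        total += output_prob(output[1:], to_state, prob * p)
    return total

print(round(output_prob("toe"), 3))  # ≈ 0.237
```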

  13. Formalization
  • HMM (the most general case): a five-tuple (S, s_0, Y, P_S, P_Y), where:
    • S = {s_0, s_1, s_2, ..., s_T} is the set of states, s_0 is the initial state,
    • Y = {y_1, y_2, ..., y_V} is the output alphabet,
    • P_S(s_j|s_i) is the set of probability distributions of transitions (size of P_S: |S|^2),
    • P_Y(y_k|s_i,s_j) is the set of output (emission) probability distributions (size of P_Y: |S|^2 × |Y|).
  • Example:
    – S = {x, 1, 2, 3, 4}, s_0 = x
    – Y = {t, o, e}

  14. Formalization - Example
  • Example:
    – S = {x, 1, 2, 3, 4}, s_0 = x
    – Y = {e, o, t}
  • P_S (transition probabilities; rows = from-state, columns = to-state, each row sums to 1):
         x    1    2    3    4
    x    0   .6    0   .4    0
    1    0    0  .12    0  .88
    2    0    0    0    0    1
    3    0    1    0    0    0
    4    0    0    1    0    0
  • P_Y (emission probabilities): one distribution over {e, o, t} per transition (s_i, s_j), each summing to 1; on the slide these appear as three |S| × |S| tables, one per output symbol (e.g., on the x→1 transition: P_Y(t|x,1) = .8, P_Y(o|x,1) = .1, P_Y(e|x,1) = .1).
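The same formalization can be written down directly as matrices. This sketch (my own; the emission tensor is filled in only for the one transition whose distribution is quoted above, the rest is left at zero as a placeholder) shows the data structures rather than the full model from the slide.

```python
import numpy as np

states = ["x", "1", "2", "3", "4"]        # s_0 = "x" is the initial state
alphabet = ["e", "o", "t"]

# P_S[i, j] = P(next state = states[j] | current state = states[i]); shape |S| x |S|
P_S = np.array([
    [0, 0.6, 0,    0.4, 0   ],
    [0, 0,   0.12, 0,   0.88],
    [0, 0,   0,    0,   1   ],
    [0, 1,   0,    0,   0   ],
    [0, 0,   1,    0,   0   ],
])

# P_Y[i, j, k] = P(output = alphabet[k] | transition states[i] -> states[j]); |S| x |S| x |Y|
P_Y = np.zeros((5, 5, 3))
P_Y[0, 1] = [0.1, 0.1, 0.8]   # x -> 1 emits e/o/t with .1/.1/.8 (the only distribution quoted)

assert np.allclose(P_S.sum(axis=1), 1.0)  # every row of P_S is a probability distribution
```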

  15. Using the HMM
  • The generation algorithm (of limited value :-); a runnable sketch follows after this slide):
    1. Start in s = s_0.
    2. Move from s to s' with probability P_S(s'|s).
    3. Output (emit) symbol y_k with probability P_Y(y_k|s,s').
    4. Set s = s' and repeat from step 2 (until somebody says enough).
  • More interesting usage:
    – Given an output sequence Y = {y_1, y_2, ..., y_k}, compute its probability.
    – Given an output sequence Y = {y_1, y_2, ..., y_k}, compute the most likely sequence of states which has generated it.
    – ...plus variations: e.g., n best state sequences.
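A minimal sketch of the generation algorithm (my own code; the transition table follows the example model above, while the emission distributions are partly taken from the quoted numbers and partly invented so the sampler is fully specified):

```python
import random

P_S = {"x": {"1": 0.6, "3": 0.4}, "1": {"2": 0.12, "4": 0.88},
       "2": {"4": 1.0}, "3": {"1": 1.0}, "4": {"2": 1.0}}
P_Y = {("x", "1"): {"t": 0.8, "o": 0.1, "e": 0.1},
       ("x", "3"): {"t": 0.5, "o": 0.2, "e": 0.3},
       ("1", "2"): {"e": 1.0},
       ("1", "4"): {"t": 0.1, "o": 0.7, "e": 0.2},
       ("2", "4"): {"o": 0.4, "e": 0.6},   # invented
       ("3", "1"): {"o": 1.0},
       ("4", "2"): {"o": 0.4, "e": 0.6}}   # invented

def pick(dist):
    return random.choices(list(dist), weights=dist.values())[0]

def generate(length, s="x"):
    """Steps 2-4 of the generation algorithm: move, emit, repeat."""
    out = []
    for _ in range(length):
        s_next = pick(P_S[s])               # move from s to s' with P_S(s'|s)
        out.append(pick(P_Y[s, s_next]))     # emit y with P_Y(y|s,s')
        s = s_next
    return "".join(out)

print(generate(6))
```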

  16. HMM Algorithms: Trellis and Viterbi

  17. HMM: The Two Tasks
  • HMM (the general case): a five-tuple (S, s_0, Y, P_S, P_Y), where:
    • S = {s_1, s_2, ..., s_T} is the set of states, s_0 is the initial state,
    • Y = {y_1, y_2, ..., y_V} is the output alphabet,
    • P_S(s_j|s_i) is the set of probability distributions of transitions,
    • P_Y(y_k|s_i,s_j) is the set of output (emission) probability distributions.
  • Given an HMM & an output sequence Y = {y_1, y_2, ..., y_k}:
    (Task 1) compute the probability of Y;
    (Task 2) compute the most likely sequence of states which has generated Y.
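Before the trellis and Viterbi algorithms are introduced, both tasks can be stated as brute-force computations over all state paths. This sketch (my own toy arc-emission HMM with invented numbers; it is exponential in the output length, so purely illustrative) makes the two definitions concrete.

```python
from itertools import product

STATES = ["A", "B"]   # plus the start state "0"
P_S = {("0", "A"): 0.6, ("0", "B"): 0.4,
       ("A", "A"): 0.3, ("A", "B"): 0.7,
       ("B", "A"): 0.5, ("B", "B"): 0.5}
P_Y = {("0", "A"): {"t": 0.9, "o": 0.1}, ("0", "B"): {"t": 0.2, "o": 0.8},
       ("A", "A"): {"t": 0.5, "o": 0.5}, ("A", "B"): {"t": 0.1, "o": 0.9},
       ("B", "A"): {"t": 0.6, "o": 0.4}, ("B", "B"): {"t": 0.3, "o": 0.7}}

def path_prob(path, output):
    prob, prev = 1.0, "0"
    for s, y in zip(path, output):
        prob *= P_S[prev, s] * P_Y[prev, s].get(y, 0.0)
        prev = s
    return prob

def two_tasks(output):
    paths = list(product(STATES, repeat=len(output)))
    total = sum(path_prob(p, output) for p in paths)        # Task 1: P(Y)
    best = max(paths, key=lambda p: path_prob(p, output))   # Task 2: most likely path
    return total, best

print(two_tasks("tot"))
```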

  18. Trellis - Deterministic Output
  • HMM (slide figure): a four-state HMM with deterministic output per state (A and C emit t, D emits o, B emits e); relevant arcs: s_0→A .6, s_0→C .4, A→D .88, C→D .1, D→B 1; "enter here".
  • Trellis: a "rollout" of the HMM over time/positions t = 0, 1, 2, 3, 4, ...
    – a trellis state is a pair (HMM state, position)
    – each trellis state holds one number (a probability) α
    – the probability of Y is the α in the last stage
  • For Y = t o e: α(s_0, 0) = 1; α(A, 1) = .6, α(C, 1) = .4; α(D, 2) = .568; α(B, 3) = .568
  • p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568
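A compact sketch of the trellis computation for the deterministic-output case (my own code; the state names and the arc probabilities follow the example above, while the transitions not visible in the figure, A→B, C→B and B→D, are invented so every row sums to 1 and do not affect p(toe)):

```python
# Forward (trellis) computation for an HMM with deterministic output per state.
TRANS = {"s0": {"A": 0.6, "C": 0.4}, "A": {"D": 0.88, "B": 0.12},
         "C": {"D": 0.1, "B": 0.9}, "D": {"B": 1.0}, "B": {"D": 1.0}}
EMIT = {"A": "t", "C": "t", "D": "o", "B": "e"}

def trellis(output):
    """alpha[s] = total probability of all state paths emitting the prefix and ending in s."""
    alpha = {"s0": 1.0}                        # stage 0: all mass in the start state
    for symbol in output:
        new_alpha = {}
        for s, prob in alpha.items():
            for s_next, p in TRANS.get(s, {}).items():
                if EMIT[s_next] == symbol:     # keep only states that generate this symbol
                    new_alpha[s_next] = new_alpha.get(s_next, 0.0) + prob * p
        alpha = new_alpha                      # ...and forget the previous stage
    return sum(alpha.values())                 # probability of the whole output

print(trellis("toe"))  # 0.6*0.88*1 + 0.4*0.1*1 = 0.568
```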

  19. Creating the Trellis: The Start
  • Start in the start state (s_0) and set its α(s_0, 0) to 1.
  • Create the first stage:
    – get the first "output" symbol y_1
    – create the first stage (column), but only those trellis states which generate y_1
    – set their α(state, 1) to P_S(state | s_0) × α(s_0, 0)
  • ...and forget about the 0-th stage.
  • (Slide figure, for y_1 = t: α(s_0, 0) = 1; α(A, 1) = .6, α(C, 1) = .4.)
