
Sequential Data Modeling - Conditional Random Fields Graham Neubig - PowerPoint PPT Presentation



  1. Sequential Data Modeling - Conditional Random Fields
     Graham Neubig, Nara Institute of Science and Technology (NAIST)

  2. Prediction Problems
     Given x, predict y

  3. Prediction Problems
     Given x, predict y

     Given x       | Predict y           | Type of prediction                          | Example x → y
     A book review | Is it positive?     | Binary prediction (2 choices)               | “Oh, man I love this book!” → yes
                   |                     |                                             | “This book is so boring...” → no
     A tweet       | Its language        | Multi-class prediction (several choices)    | “On the way to the park!” → English
                   |                     |                                             | “公園に行くなう!” → Japanese
     A sentence    | Its parts-of-speech | Structured prediction (millions of choices) | “I read a book” → N VBD DET NN

  4. Logistic Regression

  5. Example we will use:
     ● Given an introductory sentence from Wikipedia
     ● Predict whether the article is about a person
         Given: “Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.”
         Predict: Yes!
         Given: “Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.”
         Predict: No!
     ● This is binary classification (of course!)

  6. Review: Linear Prediction Model
     ● Each element that helps us predict is a feature
         contains “priest”
         contains “(<#>-<#>)”
         contains “site”
         contains “Kyoto Prefecture”
     ● Each feature has a weight, positive if it indicates “yes”, and negative if it indicates “no”
         w_{contains “priest”} = 2
         w_{contains “(<#>-<#>)”} = 1
         w_{contains “site”} = -3
         w_{contains “Kyoto Prefecture”} = -1
     ● For a new example, sum the weights
         “Kuya (903-972) was a priest born in Kyoto Prefecture.”
         2 + -1 + 1 = 2
     ● If the sum is at least 0: “yes”, otherwise: “no”
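
The weighted-sum rule on this slide can be sketched in a few lines of Python. The weights are copied from the slide; the feature-detection rules (substring checks and a regular expression for the "(<#>-<#>)" pattern) and the function names are assumptions made for illustration, not the author's implementation.

```python
import re

# Weights copied from the slide.
WEIGHTS = {
    "contains 'priest'": 2,
    "contains '(<#>-<#>)'": 1,
    "contains 'site'": -3,
    "contains 'Kyoto Prefecture'": -1,
}

def create_features(sentence):
    """Return the names of the features that fire (assumed detection rules)."""
    features = set()
    if "priest" in sentence:
        features.add("contains 'priest'")
    if re.search(r"\(\d+-\d+\)", sentence):  # e.g. "(903-972)"
        features.add("contains '(<#>-<#>)'")
    if "site" in sentence:
        features.add("contains 'site'")
    if "Kyoto Prefecture" in sentence:
        features.add("contains 'Kyoto Prefecture'")
    return features

def score(sentence):
    """Sum the weights of all features that fire."""
    return sum(WEIGHTS[name] for name in create_features(sentence))

def predict(sentence):
    """'yes' if the sum is at least 0, otherwise 'no'."""
    return "yes" if score(sentence) >= 0 else "no"

kuya = "Kuya (903-972) was a priest born in Kyoto Prefecture."
site = ("Shichikuzan Chigogataki Fudomyoo is a historical site located at "
        "Magura, Maizuru City, Kyoto Prefecture.")
print(score(kuya), predict(kuya))
print(score(site), predict(site))
```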

  7. Review: Mathematical Formulation
     $$y = \operatorname{sign}(w \cdot \phi(x)) = \operatorname{sign}\left(\sum_{i=1}^{I} w_i \cdot \phi_i(x)\right)$$
     ● x: the input
     ● φ(x): vector of feature functions {φ_1(x), φ_2(x), …, φ_I(x)}
     ● w: the weight vector {w_1, w_2, …, w_I}
     ● y: the prediction, +1 if “yes”, -1 if “no”
     ● (sign(v) is +1 if v >= 0, -1 otherwise)

  8. Perceptron and Probabilities
     ● Sometimes we want the probability P(y|x)
         ● Estimating confidence in predictions
         ● Combining with other systems
     ● However, perceptron only gives us a prediction
         $$y = \operatorname{sign}(w \cdot \phi(x))$$
       In other words:
         $$P(y=1 \mid x) = 1 \ \text{if} \ w \cdot \phi(x) \geq 0$$
         $$P(y=1 \mid x) = 0 \ \text{if} \ w \cdot \phi(x) < 0$$
     [Figure: step function; horizontal axis w*phi(x) from -10 to 10, vertical axis p(y|x) from 0 to 1]

  9. The Logistic Function
     ● The logistic function is a “softened” version of the function used in the perceptron
         $$P(y=1 \mid x) = \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}$$
     [Figure: two panels, “Perceptron” (step function) and “Logistic Function” (smooth S-shaped curve); horizontal axis w*phi(x) from -10 to 10, vertical axis p(y|x) from 0 to 1]
     ● Can account for uncertainty
     ● Differentiable
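
To make the contrast between the two curves concrete, here is a minimal sketch that evaluates both functions at the tick values of the plot. The function names are illustrative; the branch on the sign of the score is only a standard trick to avoid overflow in `math.exp` and does not change the formula.

```python
import math

def perceptron_prob(score):
    """Hard decision: probability 1 if the score is at least 0, else 0."""
    return 1.0 if score >= 0 else 0.0

def logistic_prob(score):
    """P(y=1|x) = e^s / (1 + e^s), evaluated without overflow for large |s|."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

for s in (-10, -5, 0, 5, 10):
    print(s, perceptron_prob(s), round(logistic_prob(s), 4))
```

Unlike the step function, the logistic value changes smoothly with the score, which is what makes gradient-based training possible.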

  10. Logistic Regression
      ● Train based on conditional likelihood
      ● Find the parameters w that maximize the conditional likelihood of all answers y_i given the example x_i
          $$\hat{w} = \operatorname*{argmax}_{w} \prod_{i} P(y_i \mid x_i; w)$$
      ● How do we solve this?

  11. Review: Perceptron Training Algorithm
        create map w
        for I iterations
          for each labeled pair x, y in the data
            phi = create_features(x)
            y' = predict_one(w, phi)
            if y' != y
              w += y * phi
      ● In other words
          ● Try to classify each training example
          ● Every time we make a mistake, update the weights
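
A runnable version of the perceptron pseudocode might look like the following. The names `create_features` and `predict_one` come from the slide, but their bodies are assumptions: features are taken to be unigram counts, and the two training sentences are borrowed from the worked examples later in the deck.

```python
from collections import defaultdict

def create_features(x):
    """Unigram count features (an assumed feature set)."""
    phi = defaultdict(int)
    for word in x.split():
        phi["UNI:" + word] += 1
    return phi

def predict_one(w, phi):
    """Return +1 if w . phi >= 0, otherwise -1."""
    score = sum(w[name] * value for name, value in phi.items())
    return 1 if score >= 0 else -1

def train_perceptron(data, iterations):
    w = defaultdict(float)                      # create map w
    for _ in range(iterations):                 # for I iterations
        for x, y in data:                       # each labeled pair
            phi = create_features(x)
            if predict_one(w, phi) != y:        # mistake: update the weights
                for name, value in phi.items():
                    w[name] += y * value        # w += y * phi
    return w

data = [
    ("A site , located in Maizuru , Kyoto", -1),
    ("Shoken , monk born in Kyoto", 1),
]
w = train_perceptron(data, iterations=5)
print([predict_one(w, create_features(x)) for x, _ in data])
```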

  12. Stochastic Gradient Descent
      ● Online training algorithm for probabilistic models (including logistic regression)
        create map w
        for I iterations
          for each labeled pair x, y in the data
            w += α * dP(y|x)/dw
      ● In other words
          ● For every training example, calculate the gradient (the direction that will increase the probability of y)
          ● Move in that direction, multiplied by learning rate α
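
The loop above can be sketched for logistic regression as follows. This follows the slide in stepping along the gradient of P(y|x) itself; it uses the identity e^s / (1 + e^s)^2 = p(1 - p), where p = P(y=1|x). The unigram features, function names, and toy data are assumptions for illustration.

```python
import math
from collections import defaultdict

def create_features(x):
    """Unigram count features (an assumed feature set)."""
    phi = defaultdict(int)
    for word in x.split():
        phi[word] += 1
    return phi

def prob_positive(w, phi):
    """P(y=1|x) under the logistic function."""
    score = sum(w[k] * v for k, v in phi.items())
    return 1.0 / (1.0 + math.exp(-score))

def sgd_train(data, iterations, alpha):
    w = defaultdict(float)                    # create map w
    for _ in range(iterations):
        for x, y in data:
            phi = create_features(x)
            p = prob_positive(w, phi)
            # dP(y|x)/dw = y * p * (1 - p) * phi(x) for y in {+1, -1}
            step = alpha * y * p * (1.0 - p)
            for k, v in phi.items():
                w[k] += step * v              # w += alpha * dP(y|x)/dw
    return w

data = [
    ("A site , located in Maizuru , Kyoto", -1),
    ("Shoken , monk born in Kyoto", 1),
]
w = sgd_train(data, iterations=20, alpha=1.0)
for x, y in data:
    print(y, round(prob_positive(w, create_features(x)), 3))
```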

  13. Gradient of the Logistic Function
      ● Take the derivative of the probability
          $$\frac{d}{dw} P(y=1 \mid x) = \frac{d}{dw} \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}} = \phi(x) \frac{e^{w \cdot \phi(x)}}{(1 + e^{w \cdot \phi(x)})^2}$$
          $$\frac{d}{dw} P(y=-1 \mid x) = \frac{d}{dw} \left(1 - \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}\right) = -\phi(x) \frac{e^{w \cdot \phi(x)}}{(1 + e^{w \cdot \phi(x)})^2}$$
      [Figure: derivative dp(y|x)/d(w*phi(x)) plotted against w*phi(x) from -10 to 10; vertical axis from 0 to 0.4]
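
One way to sanity-check the two derivatives is to compare them against a central finite difference. The sketch below does that for an arbitrary weight vector and feature vector; the specific numbers and function names are illustrative assumptions.

```python
import math

def prob(y, w, phi):
    """P(y|x) for y in {+1, -1} under logistic regression."""
    score = sum(wi * fi for wi, fi in zip(w, phi))
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return p_pos if y == 1 else 1.0 - p_pos

def analytic_gradient(y, w, phi):
    """y * phi(x) * e^s / (1 + e^s)^2, the closed form from the slide."""
    score = sum(wi * fi for wi, fi in zip(w, phi))
    e = math.exp(score)
    scale = y * e / (1.0 + e) ** 2
    return [scale * fi for fi in phi]

def numeric_gradient(y, w, phi, eps=1e-6):
    """Central finite-difference approximation of dP(y|x)/dw."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((prob(y, up, phi) - prob(y, down, phi)) / (2 * eps))
    return grad

w = [0.5, -1.0, 0.25]    # arbitrary illustrative weights
phi = [1.0, 2.0, 0.0]    # arbitrary illustrative feature values
for y in (1, -1):
    print(y, analytic_gradient(y, w, phi), numeric_gradient(y, w, phi))
```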

  14. Example: Initial Update
      ● Set α=1, initialize w = 0
          x = A site , located in Maizuru , Kyoto
          y = -1
          $$w \cdot \phi(x) = 0$$
          $$\frac{d}{dw} P(y=-1 \mid x) = -\frac{e^{0}}{(1 + e^{0})^2} \phi(x) = -0.25\,\phi(x)$$
          $$w \leftarrow w + (-0.25)\,\phi(x)$$
      Resulting weights:
          w_{unigram “A”} = -0.25
          w_{unigram “site”} = -0.25
          w_{unigram “,”} = -0.5
          w_{unigram “located”} = -0.25
          w_{unigram “in”} = -0.25
          w_{unigram “Maizuru”} = -0.25
          w_{unigram “Kyoto”} = -0.25

  15. Example: Second Update
          x = Shoken , monk born in Kyoto
          y = 1
          Current weights of the words in x: “,” = -0.5, “in” = -0.25, “Kyoto” = -0.25
          $$w \cdot \phi(x) = -1$$
          $$\frac{d}{dw} P(y=1 \mid x) = \frac{e^{-1}}{(1 + e^{-1})^2} \phi(x) = 0.196\,\phi(x)$$
          $$w \leftarrow w + 0.196\,\phi(x)$$
      Resulting weights:
          w_{unigram “A”} = -0.25
          w_{unigram “site”} = -0.25
          w_{unigram “,”} = -0.304
          w_{unigram “located”} = -0.25
          w_{unigram “in”} = -0.054
          w_{unigram “Maizuru”} = -0.25
          w_{unigram “Kyoto”} = -0.054
          w_{unigram “Shoken”} = 0.196
          w_{unigram “monk”} = 0.196
          w_{unigram “born”} = 0.196
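
The two updates can be replayed with a short script, assuming unigram count features and α = 1 as on the slides. The helper name `sgd_update` is hypothetical. Because the script keeps the step at full precision (about 0.1966) instead of rounding it to 0.196 first, the printed weights can differ from the slide values in the third decimal place.

```python
import math
from collections import Counter, defaultdict

def sgd_update(w, sentence, y, alpha=1.0):
    """Apply one update w += alpha * dP(y|x)/dw; return (score, step)."""
    phi = Counter(sentence.split())                      # unigram counts
    score = sum(w[word] * count for word, count in phi.items())
    e = math.exp(score)
    step = alpha * y * e / (1.0 + e) ** 2                # gradient scale factor
    for word, count in phi.items():
        w[word] += step * count
    return score, step

w = defaultdict(float)                                   # initialize w = 0
first = sgd_update(w, "A site , located in Maizuru , Kyoto", -1)
second = sgd_update(w, "Shoken , monk born in Kyoto", 1)

print("first  (score, step):", first)
print("second (score, step):", second)
for word in sorted(w):
    print(word, round(w[word], 3))
```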

  16. Calculating Optimal Sequences, Probabilities

  17. Sequence Likelihood
      ● Logistic regression considered the probability P(y|x) of y ∈ {−1, +1}
      ● What if we want to consider the probability of a sequence, P(Y|X)?
          X_i:  I    visited  Nara
          Y_i:  PRN  VBD      NNP

  18. Calculating Multi-class Probabilities
      ● Each sequence has its own feature vector
          φ(time/N flies/V): φ_{T,<S>,N}=1  φ_{T,N,V}=1  φ_{T,V,</S>}=1  φ_{E,N,time}=1  φ_{E,V,flies}=1
          φ(time/V flies/N): φ_{T,<S>,V}=1  φ_{T,V,N}=1  φ_{T,N,</S>}=1  φ_{E,V,time}=1  φ_{E,N,flies}=1
          φ(time/N flies/N): φ_{T,<S>,N}=1  φ_{T,N,N}=1  φ_{T,N,</S>}=1  φ_{E,N,time}=1  φ_{E,N,flies}=1
          φ(time/V flies/V): φ_{T,<S>,V}=1  φ_{T,V,V}=1  φ_{T,V,</S>}=1  φ_{E,V,time}=1  φ_{E,V,flies}=1
      ● Use weights for each feature to calculate scores
          w_{T,<S>,N}=1  w_{T,V,</S>}=1  w_{E,N,time}=1  (the scores below imply all other weights are 0)
          φ(time/N flies/V) * w = 3
          φ(time/V flies/N) * w = 0
          φ(time/N flies/N) * w = 2
          φ(time/V flies/V) * w = 1
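
The four scores can be reproduced with a small sketch. Feature names are represented as tuples such as `("T", "<S>", "N")` for transitions and `("E", "N", "time")` for emissions, which is an assumed encoding; weights not listed on the slide are treated as 0.

```python
def sequence_features(words, tags):
    """Transition (T) and emission (E) features for one tagged sequence."""
    feats = []
    prev = "<S>"
    for word, tag in zip(words, tags):
        feats.append(("T", prev, tag))
        feats.append(("E", tag, word))
        prev = tag
    feats.append(("T", prev, "</S>"))
    return feats

# The three non-zero weights given on the slide.
WEIGHTS = {("T", "<S>", "N"): 1, ("T", "V", "</S>"): 1, ("E", "N", "time"): 1}

def sequence_score(words, tags, weights=WEIGHTS):
    """w . phi(Y, X): sum the weights of the features that fire."""
    return sum(weights.get(f, 0) for f in sequence_features(words, tags))

words = ["time", "flies"]
candidates = [("N", "V"), ("V", "N"), ("N", "N"), ("V", "V")]
scores = {tags: sequence_score(words, tags) for tags in candidates}
for tags, s in scores.items():
    print(" ".join(tags), s)
```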

  19. The Softmax Function
      ● Turn into probabilities by taking exponent and normalizing (the Softmax function)
          $$P(Y \mid X) = \frac{e^{w \cdot \phi(Y, X)}}{\sum_{\tilde{Y}} e^{w \cdot \phi(\tilde{Y}, X)}}$$
      ● Take the exponent and normalize
          exp(φ(time/N flies/V) * w) = 20.08    P(N V | time flies) = .6437
          exp(φ(time/V flies/N) * w) = 1.00     P(V N | time flies) = .0320
          exp(φ(time/N flies/N) * w) = 7.39     P(N N | time flies) = .2369
          exp(φ(time/V flies/V) * w) = 2.72     P(V V | time flies) = .0872
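
A minimal softmax over the four sequence scores (3, 0, 2, 1) is sketched below. Subtracting the maximum score before exponentiating is a standard stability trick, not something stated on the slide, and it does not change the result. Values computed at full precision may differ from the slide in the last printed digit, since the slide appears to normalize the already-rounded exponentials.

```python
import math

def softmax(scores):
    """Map {label: score} to {label: probability} by exponentiating and normalizing."""
    m = max(scores.values())                       # shift for numerical stability
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    z = sum(exps.values())                         # normalizer
    return {label: e / z for label, e in exps.items()}

scores = {"N V": 3, "V N": 0, "N N": 2, "V V": 1}
probs = softmax(scores)
for label, p in probs.items():
    print(label, round(p, 4))
```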

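  20. Calculating Edge Features
      ● Like perceptron, can calculate features for each edge
      [Figure: lattice for “time flies” with nodes <S>, N/V at “time”, N/V at “flies”, </S>]
          Edge                          | Features
          <S> → N (time)                | φ_{T,<S>,N}=1  φ_{E,N,time}=1
          <S> → V (time)                | φ_{T,<S>,V}=1  φ_{E,V,time}=1
          N (time) → N (flies)          | φ_{T,N,N}=1  φ_{E,N,flies}=1
          N (time) → V (flies)          | φ_{T,N,V}=1  φ_{E,V,flies}=1
          V (time) → N (flies)          | φ_{T,V,N}=1  φ_{E,N,flies}=1
          V (time) → V (flies)          | φ_{T,V,V}=1  φ_{E,V,flies}=1
          N (flies) → </S>              | φ_{T,N,</S>}=1
          V (flies) → </S>              | φ_{T,V,</S>}=1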

  21. Calculating Edge Probabilities
      ● Calculate scores, and take exponent
      [Figure: the “time flies” lattice, each edge labeled with e^{w*φ} and its probability P]
          Edge                          | e^{w*φ} | P
          <S> → N (time)                | 7.39    | .881
          <S> → V (time)                | 1.00    | .119
          N (time) → N (flies)          | 1.00    | .237
          N (time) → V (flies)          | 1.00    | .644
          V (time) → N (flies)          | 1.00    | .032
          V (time) → V (flies)          | 1.00    | .087
          N (flies) → </S>              | 1.00    | .269
          V (flies) → </S>              | 2.72    | .731
      ● This is now the same form as the HMM
          ● Can use the Viterbi algorithm
          ● Calculate probabilities using forward-backward
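
The edge probabilities on this slide are consistent with edge marginals computed by forward-backward over the lattice. The sketch below is one way to compute them, assuming a two-tag set {N, V}, the three non-zero weights from the earlier slide, and a hypothetical key layout `(position, previous_tag, tag)` for the result.

```python
import math

TAGS = ["N", "V"]
WEIGHTS = {("T", "<S>", "N"): 1, ("T", "V", "</S>"): 1, ("E", "N", "time"): 1}

def edge_potential(prev, tag, word=None):
    """e^{w . phi} for one edge; word is None for the edge into </S>."""
    s = WEIGHTS.get(("T", prev, tag), 0)
    if word is not None:
        s += WEIGHTS.get(("E", tag, word), 0)
    return math.exp(s)

def edge_marginals(words):
    n = len(words)
    # forward[i][t]: summed potential of all paths from <S> ending in tag t at i
    forward = [{t: edge_potential("<S>", t, words[0]) for t in TAGS}]
    for i in range(1, n):
        forward.append({
            t: sum(forward[i - 1][p] * edge_potential(p, t, words[i]) for p in TAGS)
            for t in TAGS
        })
    # backward[i][t]: summed potential of all paths from tag t at i to </S>
    backward = [None] * n
    backward[n - 1] = {t: edge_potential(t, "</S>") for t in TAGS}
    for i in range(n - 2, -1, -1):
        backward[i] = {
            t: sum(edge_potential(t, q, words[i + 1]) * backward[i + 1][q] for q in TAGS)
            for t in TAGS
        }
    z = sum(forward[n - 1][t] * backward[n - 1][t] for t in TAGS)  # normalizer
    marginals = {}
    for t in TAGS:
        marginals[(0, "<S>", t)] = forward[0][t] * backward[0][t] / z
        marginals[(n, t, "</S>")] = forward[n - 1][t] * backward[n - 1][t] / z
    for i in range(1, n):
        for p in TAGS:
            for t in TAGS:
                pot = edge_potential(p, t, words[i])
                marginals[(i, p, t)] = forward[i - 1][p] * pot * backward[i][t] / z
    return marginals

marginals = edge_marginals(["time", "flies"])
for edge in sorted(marginals):
    print(edge, round(marginals[edge], 3))
```

Replacing the sums in the forward pass with maxima (and keeping back-pointers) would give the Viterbi algorithm over the same edge potentials.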

  22. Conditional Random Fields

  23. Maximizing CRF Likelihood
      ● Want to maximize the likelihood for sequences
          $$\hat{w} = \operatorname*{argmax}_{w} \prod_{i} P(Y_i \mid X_i; w)$$
          $$P(Y \mid X) = \frac{e^{w \cdot \phi(Y, X)}}{\sum_{\tilde{Y}} e^{w \cdot \phi(\tilde{Y}, X)}}$$
      ● For convenience, we consider the log likelihood
          $$\log P(Y \mid X) = w \cdot \phi(Y, X) - \log \sum_{\tilde{Y}} e^{w \cdot \phi(\tilde{Y}, X)}$$
      ● Want to find gradient for stochastic gradient descent
          $$\frac{d}{dw} \log P(Y \mid X)$$
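
For a sentence as short as "time flies", the log likelihood can be evaluated by brute-force enumeration of all tag sequences, as in the sketch below. The gradient it returns uses the standard result that differentiating the log likelihood gives the gold feature vector minus the expected feature vector under the model; that formula is not shown in this part of the deck and is included here as an assumption. The tag set, weights, and function names are illustrative.

```python
import math
from collections import Counter
from itertools import product

TAGS = ["N", "V"]
WEIGHTS = {("T", "<S>", "N"): 1.0, ("T", "V", "</S>"): 1.0, ("E", "N", "time"): 1.0}

def features(words, tags):
    """phi(Y, X): counts of transition (T) and emission (E) features."""
    phi = Counter()
    prev = "<S>"
    for word, tag in zip(words, tags):
        phi[("T", prev, tag)] += 1
        phi[("E", tag, word)] += 1
        prev = tag
    phi[("T", prev, "</S>")] += 1
    return phi

def score(w, phi):
    """w . phi(Y, X), treating unlisted weights as 0."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def log_likelihood_and_gradient(w, words, gold):
    candidates = list(product(TAGS, repeat=len(words)))   # every tag sequence
    scores = [score(w, features(words, c)) for c in candidates]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # log-sum-exp
    gold_phi = features(words, gold)
    log_p = score(w, gold_phi) - log_z
    grad = Counter(gold_phi)                              # phi(Y, X) ...
    for c, s in zip(candidates, scores):
        p = math.exp(s - log_z)                           # P(candidate | X)
        for f, v in features(words, c).items():
            grad[f] -= p * v                              # ... minus expected phi
    return log_p, dict(grad)

log_p, grad = log_likelihood_and_gradient(WEIGHTS, ["time", "flies"], ("N", "V"))
print(round(math.exp(log_p), 4))
print(round(grad[("T", "<S>", "N")], 3))
```

Enumeration grows exponentially with sentence length, so a practical implementation would obtain the same expected feature counts from forward-backward edge probabilities instead.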
