

  1. Conditional Random Fields Dietrich Klakow

  2. Overview • Sequence Labeling • Bayesian Networks • Markov Random Fields • Conditional Random Fields • Software example

  3. Sequence Labeling Tasks

  4. Sequence: a sentence Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

  5. POS Labels Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.

  6. Chunking Task: find phrase boundaries:

  7. Chunking Pierre/B-NP Vinken/I-NP ,/O 61/B-NP years/I-NP old/B-ADJP ,/O will/B-VP join/I-VP the/B-NP board/I-NP as/B-PP a/B-NP nonexecutive/I-NP director/I-NP Nov./B-NP 29/I-NP ./O
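
The B/I/O labels above encode phrase spans: B- opens a phrase, I- continues it, and O marks tokens outside any phrase. A minimal Python sketch (not from the slides) that recovers the spans from this example sentence:

```python
# Recover phrase spans from BIO chunk labels (tokens/labels from the slide).
tokens = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join",
          "the", "board", "as", "a", "nonexecutive", "director", "Nov.", "29", "."]
labels = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "B-ADJP", "O", "B-VP", "I-VP",
          "B-NP", "I-NP", "B-PP", "B-NP", "I-NP", "I-NP", "B-NP", "I-NP", "O"]

def bio_to_spans(labels):
    """Turn a BIO label sequence into (start, end, phrase_type) spans."""
    spans, start, kind = [], None, None
    for i, lab in enumerate(labels):
        # A B- tag, an O tag, or a type change closes the current span.
        if lab.startswith("B-") or lab == "O" or (kind and lab[2:] != kind):
            if start is not None:
                spans.append((start, i, kind))
                start, kind = None, None
        if lab.startswith("B-"):
            start, kind = i, lab[2:]
    if start is not None:
        spans.append((start, len(labels), kind))
    return spans

for s, e, k in bio_to_spans(labels):
    print(k, tokens[s:e])   # NP ['Pierre', 'Vinken'], NP ['61', 'years'], ...
```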

  8. Named Entity Tagging Pierre/B-PERSON Vinken/I-PERSON ,/O 61/B-DATE:AGE years/I-DATE:AGE old/I-DATE:AGE ,/O will/O join/O the/O board/B-ORG_DESC:OTHER as/O a/O nonexecutive/O director/B-PER_DESC Nov./B-DATE:DATE 29/I-DATE:DATE ./O

  9. Supertagging Pierre|N/N Vinken|N ,|, 61|N/N years|N old|(S[adj]\NP)\NP ,|, will|(S[dcl]\NP)/(S[b]\NP) join|((S[b]\NP)/PP)/NP the|NP[nb]/N board|N as|PP/NP a|NP[nb]/N nonexecutive|N/N director|N Nov.|((S\NP)\(S\NP))/N[num] 29|N[num] .|.

  10. Hidden Markov Model

  11. HMM: just an Application of a Bayes Classifier
$$(\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_N) = \arg\max_{\pi_1, \pi_2, \ldots, \pi_N} P(x_1, x_2, \ldots, x_N, \pi_1, \pi_2, \ldots, \pi_N)$$

  12. Decomposition of Probabilities
$$P(x_1, \ldots, x_N, \pi_1, \ldots, \pi_N) = \prod_{i=1}^{N} P(x_i \mid \pi_i)\, P(\pi_i \mid \pi_{i-1})$$
$P(\pi_i \mid \pi_{i-1})$: transition probability; $P(x_i \mid \pi_i)$: emission probability
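
A minimal sketch of this factorization in Python (not from the slides; all probabilities are toy values): it scores a word/tag sequence as a product of transition and emission terms, and recovers the arg max of slide 11 by brute force, which is only viable for toy-sized inputs.

```python
import itertools
import math

# Toy HMM: two states and two words; probabilities are made up.
states = ["NNP", "VB"]
trans = {("<s>", "NNP"): 0.7, ("<s>", "VB"): 0.3,
         ("NNP", "NNP"): 0.5, ("NNP", "VB"): 0.5,
         ("VB", "NNP"): 0.6, ("VB", "VB"): 0.4}
emit = {("NNP", "Vinken"): 0.8, ("NNP", "join"): 0.2,
        ("VB", "Vinken"): 0.1, ("VB", "join"): 0.9}

def joint_log_prob(words, tags):
    """log P(x_1..x_N, pi_1..pi_N) under the HMM decomposition."""
    lp, prev = 0.0, "<s>"
    for w, t in zip(words, tags):
        lp += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return lp

def best_tags(words):
    """Brute-force the argmax of slide 11 (Viterbi does this efficiently)."""
    return max(itertools.product(states, repeat=len(words)),
               key=lambda tags: joint_log_prob(words, tags))

print(best_tags(["Vinken", "join"]))   # ('NNP', 'VB')
```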

  13. Graphical View of the HMM (figure): the observation sequence $x_1, x_2, \ldots, x_N$, each $x_i$ emitted by the corresponding state of the label sequence $\pi_1, \pi_2, \ldots, \pi_N$.

  14. Criticism • HMMs model only limited dependencies → come up with more flexible models → come up with a graphical description

  15. Bayesian Networks

  16. Example for a Bayesian Network From Russell and Norvig 1995, AI: A Modern Approach (C = Cloudy, S = Sprinkler, R = Rain, W = WetGrass). Corresponding joint distribution: $$P(C, S, R, W) = P(W \mid S, R)\, P(S \mid C)\, P(R \mid C)\, P(C)$$
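
A small sketch of this factorization (the CPT values below are the ones commonly quoted for this textbook example; treat them as illustrative):

```python
from itertools import product

P_C = 0.5                                    # P(Cloudy = True)
P_S = {True: 0.1, False: 0.5}                # P(Sprinkler = True | Cloudy = c)
P_R = {True: 0.8, False: 0.2}                # P(Rain = True | Cloudy = c)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}   # P(Wet = True | s, r)

def bern(p_true, value):
    """Probability that a Bernoulli variable takes `value`."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    """P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)."""
    return bern(P_W[(s, r)], w) * bern(P_S[c], s) * bern(P_R[c], r) * bern(P_C, c)

# Sanity check: the joint sums to 1 over all 16 assignments.
print(sum(joint(*v) for v in product([True, False], repeat=4)))
```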

  17. Naïve Bayes Observations $x_1, \ldots, x_D$ are assumed to be conditionally independent given the class $z$: $$P(x_1, \ldots, x_D \mid z) = \prod_{i=1}^{D} P(x_i \mid z)$$
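
A minimal sketch of the naive Bayes factorization (the class labels and probability table are made up for illustration):

```python
import math

# Toy table of per-feature conditionals P(x_i | z).
table = {("offer", "spam"): 0.7, ("offer", "ham"): 0.1,
         ("hello", "spam"): 0.2, ("hello", "ham"): 0.6}

def log_likelihood(x, z):
    """log P(x_1..x_D | z) = sum_i log P(x_i | z), by conditional independence."""
    return sum(math.log(table[(x_i, z)]) for x_i in x)

print(log_likelihood(["offer", "hello"], "spam"))
```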

  18. Markov Random Fields

  19. • Undirected graphical model • New term: clique in an undirected graph: a set of nodes such that every node is connected to every other node • Maximal clique: a clique to which no node can be added without destroying the clique property (see the sketch below)
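
Not from the slides, but handy for experimenting with these definitions: the `networkx` library (assumed available) enumerates maximal cliques directly.

```python
import networkx as nx

# Two triangles sharing node 3; find_cliques yields exactly the maximal
# cliques of the undirected graph (order may vary between runs).
G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)])
print(list(nx.find_cliques(G)))   # the two triangles [1, 2, 3] and [3, 4, 5]
```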

  20. Example (figure): several cliques are highlighted in green and blue; the blue one is the maximal clique.

  21. Factorization
Notation: $x$: all nodes $x_1, \ldots, x_N$; $x_C$: the nodes in clique $C$; $C_M$: the set of all maximal cliques; $\Psi_C(x_C)$: potential function, with $\Psi_C(x_C) \geq 0$.
Joint distribution described by the graph: $$p(x) = \frac{1}{Z} \prod_{C \in C_M} \Psi_C(x_C)$$
Normalization: $$Z = \sum_x \prod_{C \in C_M} \Psi_C(x_C)$$
$Z$ is sometimes called the partition function.
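
A brute-force sketch of this factorization for a tiny chain MRF (the graph and potential values are made up; enumerating all configurations only works for very small graphs):

```python
from itertools import product

# Chain MRF x0 - x1 - x2 with binary nodes.
cliques = [(0, 1), (1, 2)]            # the maximal cliques of the chain

def psi(a, b):
    """Toy potential that favours agreeing neighbours."""
    return 2.0 if a == b else 1.0

def unnorm(x):
    """Product of clique potentials, before normalization."""
    p = 1.0
    for i, j in cliques:
        p *= psi(x[i], x[j])
    return p

Z = sum(unnorm(x) for x in product([0, 1], repeat=3))   # partition function
p = {x: unnorm(x) / Z for x in product([0, 1], repeat=3)}
print(Z, p[(0, 0, 0)])   # 18.0 0.2222...
```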

  22. Example (figure: an undirected graph over the nodes $x_1, x_2, x_3, x_4, x_5$). What are the maximal cliques? Write down the joint probability described by this graph → white board

  23. Energy Function
Define $$\Psi_C(x_C) = e^{-E_C(x_C)}$$
Insert into the joint distribution: $$p(x) = \frac{1}{Z}\, e^{-\sum_{C \in C_M} E_C(x_C)}$$
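
Continuing the toy chain from the previous sketch: the same potential can be written as an energy, with low energy corresponding to high potential.

```python
import math

def energy(a, b):
    """E such that exp(-E) equals the toy potential psi from the sketch above."""
    return -math.log(2.0) if a == b else 0.0

# exp(-E) recovers the potential value 2.0 for agreeing neighbours.
assert abs(math.exp(-energy(0, 0)) - 2.0) < 1e-12
```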

  24. Conditional Random Fields

  25. Definition A Markov random field where each random variable $y_i$ is conditioned on the complete input sequence $x = (x_1, \ldots, x_n)$; the label sequence is $y = (y_1, \ldots, y_n)$. (Figure: a chain $y_1, y_2, y_3, \ldots, y_{n-1}, y_n$ over the labels, with every $y_i$ connected to the full observation $x$.)

  26. Distribution
$$p(y \mid x) = \frac{1}{Z(x)}\, e^{-\sum_{i=1}^{n} \sum_{j=1}^{N} \lambda_j f_j(y_{i-1}, y_i, x, i)}$$
$\lambda_j$: parameters to be trained; $f_j(y_{i-1}, y_i, x, i)$: feature function (see maximum entropy models)

  27. Example Feature Functions
Modeling transitions: $$f_1(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_{i-1} = \text{IN} \text{ and } y_i = \text{NNP} \\ 0 & \text{else} \end{cases}$$
Modeling emissions: $$f_2(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_i = \text{NNP} \text{ and } x_i = \text{September} \\ 0 & \text{else} \end{cases}$$
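
A sketch of these two feature functions plus the unnormalized CRF score from slide 26 (the weights are made up; the minus sign follows the slides' energy-style convention):

```python
import math

def f1(y_prev, y, x, i):
    """Transition feature: fires when IN is followed by NNP."""
    return 1.0 if y_prev == "IN" and y == "NNP" else 0.0

def f2(y_prev, y, x, i):
    """Emission feature: fires when NNP labels the word 'September'."""
    return 1.0 if y == "NNP" and x[i] == "September" else 0.0

features, lambdas = [f1, f2], [-1.5, -2.0]   # toy weights

def unnorm_score(y, x):
    """exp(-sum_i sum_j lambda_j f_j(y_{i-1}, y_i, x, i)), with y_0 = START."""
    total, prev = 0.0, "START"
    for i, y_i in enumerate(y):
        total += sum(lam * f(prev, y_i, x, i)
                     for lam, f in zip(lambdas, features))
        prev = y_i
    return math.exp(-total)

print(unnorm_score(["IN", "NNP"], ["in", "September"]))   # exp(3.5)
```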

  28. Training • Like in maximum entropy models: generalized iterative scaling • Convergence: $p(y \mid x)$ is a convex function → unique maximum • Convergence is slow → improved algorithms exist

  29. Decoding: Auxiliary Matrix
Define an additional start symbol $y_0 = \text{START}$ and a stop symbol $y_{n+1} = \text{STOP}$.
Define the matrix $M_i(x)$ such that $$\left[ M_i(x) \right]_{y_{i-1} y_i} = M_i(x)_{y_{i-1} y_i} = e^{-\sum_{j=1}^{N} \lambda_j f_j(y_{i-1}, y_i, x, i)}$$

  30. Reformulate Probability
With that definition we have $$p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M_i(x)_{y_{i-1} y_i}$$
with $$Z(x) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_n} M_1(x)_{y_0 y_1}\, M_2(x)_{y_1 y_2} \cdots M_{n+1}(x)_{y_n y_{n+1}}$$

  31. Use Matrix Properties
Using the matrix product $$\sum_{y_1} M_1(x)_{y_0 y_1}\, M_2(x)_{y_1 y_2} = \left[ M_1(x)\, M_2(x) \right]_{y_0 y_2}$$
we obtain $$Z(x) = \left[ M_1(x)\, M_2(x) \cdots M_{n+1}(x) \right]_{y_0 = \text{START},\; y_{n+1} = \text{STOP}}$$
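
A standalone sketch of this matrix computation of $Z(x)$, reusing the two toy features from slide 27 (the label set, weights, and the handling of the final position are assumptions made for illustration, not part of the slides):

```python
import numpy as np

f1 = lambda yp, y, x, i: 1.0 if yp == "IN" and y == "NNP" else 0.0
f2 = lambda yp, y, x, i: 1.0 if y == "NNP" and x[i] == "September" else 0.0
features, lambdas = [f1, f2], [-1.5, -2.0]

states = ["START", "IN", "NNP", "STOP"]   # labels plus the two artificial symbols
x = ["in", "September"]
n = len(x)

def M(i):
    """M_i(x)[y_prev, y] = exp(-sum_j lambda_j f_j(y_prev, y, x, i)).
    Position n+1 is the STOP step; it reuses the last observation, which is
    harmless here because neither feature fires on transitions into STOP."""
    m = np.empty((len(states), len(states)))
    for a, yp in enumerate(states):
        for b, y in enumerate(states):
            s = sum(lam * f(yp, y, x, min(i, n) - 1)
                    for lam, f in zip(lambdas, features))
            m[a, b] = np.exp(-s)
    return m

# Z(x) = [M_1(x) M_2(x) ... M_{n+1}(x)]_{START, STOP}; the matrix product
# carries out the sums over the intermediate label variables y_1 .. y_n.
Z = np.linalg.multi_dot([M(i) for i in range(1, n + 2)])
print(Z[states.index("START"), states.index("STOP")])
```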

  32. Software

  33. CRF++ • See http://crfpp.sourceforge.net/

  34. Summary • Sequence labeling problems • CRFs are: • flexible • expensive to train • fast to decode
