Semi-Markov Conditional Random Fields for Information Extraction
Sunita Sarawagi and William Cohen, NIPS 2004
Presented by: Dinesh Khandelwal
Slides are adopted from Daniel Khashabi
Beyond Classification Learning
The standard classification problem assumes individual cases are disconnected and independent (i.i.d.: independently and identically distributed).
Many NLP problems do not satisfy this assumption: they involve making many connected decisions, each resolving a different ambiguity, but all mutually dependent.
More sophisticated learning and inference techniques are needed to handle such situations in general.
Sequence Labeling Problem
Many NLP problems can be viewed as sequence labeling: each token in a sequence is assigned a label, and the labels of tokens depend on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.).
Named Entity Recognition
Example: labeling each token of "My review of Fermat's last theorem by S. Singh".

i:  1      2       3      4         5      6        7      8       9
x:  My     review  of     Fermat's  last   theorem  by     S.      Singh
y:  Other  Other   Other  Title     Title  Title    Other  Author  Author
Problem Description
Relational structure occurs in many applications: NLP, computer vision, signal processing, ….
Traditionally, graphical models factor the joint distribution as $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})$.
Modeling the joint distribution can lead to difficulties: rich local features occur in relational data, and these features may have complex dependencies, so constructing a probability distribution $p(\mathbf{x})$ over them is difficult.
Solution: directly model the conditional $p(\mathbf{y} \mid \mathbf{x})$, which is sufficient for classification!
A CRF is simply a conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ with an associated graphical structure.
Log-linear representation of CRFs
$$P(\mathbf{y} \mid \mathbf{x}, W) = \frac{1}{Z(\mathbf{x})} \exp\!\big(W^\top \mathbf{G}(\mathbf{x}, \mathbf{y})\big), \qquad \mathbf{G}(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{|\mathbf{x}|} \mathbf{g}(j, \mathbf{x}, \mathbf{y})$$
where $\mathbf{g} = (g_1, \dots, g_K)$ is a vector of local feature functions, each $g_k(j, \mathbf{x}, \mathbf{y}) \in \mathbb{R}$, $W$ is the vector of parameters to be estimated, and $Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\big(W^\top \mathbf{G}(\mathbf{x}, \mathbf{y}')\big)$ is the normalizer.
Linear Chain CRF
[Figure: linear-chain graphical model; shaded nodes = observable tokens $\mathbf{x}$, unshaded nodes = unobservable labels $\mathbf{y}$.]
In a linear-chain CRF, each local feature depends only on the current and previous labels:
$$g_k(j, \mathbf{x}, \mathbf{y}) = g'_k(j, \mathbf{x}, y_j, y_{j-1})$$
Features
The kinds of features used in NLP-oriented machine learning systems typically involve:
Binary values: think of a feature as being on or off rather than as a feature with a real value.
Values that are relative to an object/class pair rather than being a function of the object alone.
Systems typically have lots and lots of features (100,000s of features isn't unusual).
Features
$$g_1(j, \mathbf{x}, \mathbf{y}) = \begin{cases} 1 & \text{if } y_j = \text{DT and } y_{j-1} = \text{V} \\ 0 & \text{otherwise} \end{cases}$$
$$g_2(j, \mathbf{x}, \mathbf{y}) = \begin{cases} 1 & \text{if } x_j = \text{``the'' and } y_j = \text{DT} \\ 0 & \text{otherwise} \end{cases}$$
$$g_3(j, \mathbf{x}, \mathbf{y}) = \begin{cases} 1 & \text{if } x_j \text{ has suffix ``ing'' and } y_j = \text{V} \\ 0 & \text{otherwise} \end{cases}$$
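A minimal Python sketch (not from the slides) of how such binary feature functions, and the log-linear score from the earlier slide, might be coded; the toy sentence and weights are illustrative assumptions:

```python
import math

# Indicator feature functions g_k(j, x, y): each returns 1 when its
# (token, label, previous-label) configuration holds at position j, else 0.
def g1(j, x, y):
    return 1 if j > 0 and y[j] == "DT" and y[j - 1] == "V" else 0

def g2(j, x, y):
    return 1 if x[j] == "the" and y[j] == "DT" else 0

def g3(j, x, y):
    return 1 if x[j].endswith("ing") and y[j] == "V" else 0

FEATURES = [g1, g2, g3]

def global_features(x, y):
    """G(x, y): sum the local feature vector g(j, x, y) over all positions j."""
    return [sum(g(j, x, y) for j in range(len(x))) for g in FEATURES]

def unnormalized_score(W, x, y):
    """exp(W . G(x, y)), the numerator of the log-linear CRF P(y | x, W)."""
    return math.exp(sum(w * f for w, f in zip(W, global_features(x, y))))

# Toy usage with made-up weights:
x = ["I", "like", "eating", "the", "cake"]
y = ["PRP", "V", "V", "DT", "NN"]
print(global_features(x, y))                      # [1, 1, 1]
print(unnormalized_score([0.5, 1.0, 0.3], x, y))  # exp(1.8)
```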
Segmentation models (Semi-CRFs)
A CRF assigns a label to every word; features describe a single word: $g'_k(j, \mathbf{x}, y_j, y_{j-1})$.

i:  1  2     3       4     5         6        7   8        9
x:  I  went  skiing  with  Fernando  Pereira  in  British  Columbia
y:  O  O     O       O     I         I        O   I        I

A semi-CRF assigns a label to every segment; features describe the whole segment from $u_k$ to $v_k$: $h_k(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$.

k:           1      2      3      4      5                  6      7
segment:     I      went   skiing with   Fernando Pereira   in     British Columbia
(u_k, v_k):  (1,1)  (2,2)  (3,3)  (4,4)  (5,6)              (7,7)  (8,9)
y_k:         O      O      O      O      I                  O      I
Semi-CRF
$\mathbf{s} = \langle s_1, \dots, s_p \rangle$ denotes a segmentation of $\mathbf{x}$.
Segment $s_k = (u_k, v_k, y_k)$ consists of a start position $u_k$, an end position $v_k$, and a label $y_k$.
Segments are contiguous and non-overlapping: $u_{k+1} = v_k + 1$ and $1 \le u_k \le v_k \le |\mathbf{x}|$.
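As a concrete illustration (an assumption of this write-up, not from the paper), a segmentation can be represented as a list of (u_k, v_k, y_k) tuples and checked against the constraints above:

```python
# The example segmentation from the previous slide, 1-indexed.
segmentation = [(1, 1, "O"), (2, 2, "O"), (3, 3, "O"), (4, 4, "O"),
                (5, 6, "I"), (7, 7, "O"), (8, 9, "I")]

def is_valid_segmentation(segments, n):
    """Check u_{k+1} = v_k + 1, 1 <= u_k <= v_k <= n, and full coverage of x."""
    if not segments or segments[0][0] != 1 or segments[-1][1] != n:
        return False
    for (u, v, _), (u_next, _, _) in zip(segments, segments[1:]):
        if not (1 <= u <= v) or u_next != v + 1:
            return False
    return segments[-1][0] <= segments[-1][1] <= n

print(is_valid_segmentation(segmentation, 9))  # True
```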
Semi-CRF
[Figure: semi-Markov chain over segments; shaded nodes = observable $\mathbf{x}$, unshaded nodes = unobservable segment labels.]
$$P(\mathbf{s} \mid \mathbf{x}, W) = \frac{1}{Z(\mathbf{x})} \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s})\big), \qquad Z(\mathbf{x}) = \sum_{\mathbf{s}'} \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}')\big)$$
$$\mathbf{H}(\mathbf{x}, \mathbf{s}) = \sum_{k=1}^{p} \mathbf{h}(k, \mathbf{x}, \mathbf{s}), \qquad h_\ell(k, \mathbf{x}, \mathbf{s}) = h'_\ell(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$$
$\mathbf{h}$ is a vector of segment-level feature functions.
MAP Inference: Semi-CRF
$$\mathbf{s}^* = \operatorname*{argmax}_{\mathbf{s}} P(\mathbf{s} \mid \mathbf{x}, W) = \operatorname*{argmax}_{\mathbf{s}} W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}) = \operatorname*{argmax}_{\mathbf{s}} \sum_k W^\top \mathbf{h}(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$$
$\mathbf{h}$ is a vector of segment-level feature functions.
Viterbi algorithm for Semi-CRF
Goal: compute $\max_{\mathbf{s}} \sum_k W^\top \mathbf{h}(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$.
Let $L$ be an upper bound on segment length, and let $\mathbf{s}_{j:y}$ denote the set of all partial segmentations of positions $1$ to $j$ such that the last segment ends at position $j$ and has label $y$.
$$V(j, y) = \max_{y',\, d} \; \max_{\mathbf{s}' \in \mathbf{s}_{j-d:y'}} \Big[ \sum_{k} W^\top \mathbf{h}(y_k, y_{k-1}, \mathbf{x}, u_k, v_k) \;+\; W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j) \Big]$$
Viterbi algorithm for Semi-CRF
The inner maximization over $\mathbf{s}' \in \mathbf{s}_{j-d:y'}$ is itself $V(j-d, y')$:
$$V(j-d, y') = \max_{\mathbf{s}' \in \mathbf{s}_{j-d:y'}} \sum_{k} W^\top \mathbf{h}(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$$
so the recurrence simplifies to
$$V(j, y) = \max_{y',\, d} \; V(j-d, y') + W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j)$$
Viterbi algorithm for Semi-CRF
$$V(j, y) = \begin{cases} \displaystyle\max_{y',\, d = 1 \dots L} \; V(j-d, y') + W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j) & \text{if } j > 0 \\ 0 & \text{if } j = 0 \\ -\infty & \text{if } j < 0 \end{cases}$$
The optimal segmentation corresponds to the path traced back from $\max_y V(|\mathbf{x}|, y)$.
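A sketch of this dynamic program in Python (illustrative, not the authors' code; it assumes a user-supplied `score(y, y_prev, x, u, v)` that returns $W^\top \mathbf{h}(y, y', \mathbf{x}, u, v)$ for the segment covering positions u..v, with `y_prev=None` for the first segment):

```python
def semi_crf_viterbi(x, labels, score, L):
    """Best segmentation under V(j,y) = max_{y',d<=L} V(j-d,y') + score(...)."""
    n = len(x)
    NEG = float("-inf")
    # V[j][y]: best score over partial segmentations of 1..j whose last
    # segment ends at j with label y; back[j][y] = (previous end, previous label).
    V = [dict.fromkeys(labels, NEG) for _ in range(n + 1)]
    back = [dict.fromkeys(labels) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for y in labels:
            for d in range(1, min(L, j) + 1):
                # Base case V(0, .) = 0 with a dummy "start" previous label.
                prev_labels = labels if j > d else [None]
                for y_prev in prev_labels:
                    prev = V[j - d][y_prev] if j > d else 0.0
                    s = prev + score(y, y_prev, x, j - d + 1, j)
                    if s > V[j][y]:
                        V[j][y] = s
                        back[j][y] = (j - d, y_prev)
    # The optimum is max_y V(n, y); follow back-pointers to recover segments.
    best_y = max(labels, key=lambda lab: V[n][lab])
    segments, j, y = [], n, best_y
    while j > 0:
        j_prev, y_prev = back[j][y]
        segments.append((j_prev + 1, j, y))
        j, y = j_prev, y_prev
    return V[n][best_y], segments[::-1]
```

The three nested loops make the cost $O(nL|\mathcal{Y}|^2)$, which is exactly the linear-in-$L$ overhead discussed on the next slide.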
Semi-Markov CRFs vs conventional CRFs
Since conventional CRFs need not maximize over possible segment lengths $d$, inference for semi-CRFs is more expensive. However, the additional cost is only linear in $L$.
Semi-CRFs have more expressive power: a major advantage is that they allow features which measure properties of entire segments, rather than of individual elements.
Semi-Markov CRFs vs higher-order CRFs
Semi-CRFs are no more expressive than order-$L$ CRFs. For order-$L$ CRFs, however, the additional computational cost is exponential in $L$.
Semi-CRFs only consider sequences in which the same label is assigned to all $L$ positions of a segment, rather than all $|\mathcal{Y}|^L$ length-$L$ label sequences. This is a useful restriction, as it leads to faster inference.
Parameter Learning: Semi-CRF
Given training data $\{(\mathbf{x}_m, \mathbf{s}_m)\}_{m=1}^{N}$, we wish to learn the parameters $W$ of the model. We express the log-likelihood over the training sequences as
$$L(W) = \sum_m \log P(\mathbf{s}_m \mid \mathbf{x}_m, W) = \sum_m \big( W^\top \mathbf{H}(\mathbf{x}_m, \mathbf{s}_m) - \log Z(\mathbf{x}_m) \big)$$
$L(W)$ is concave, and can thus be maximized by gradient ascent or one of many related methods. (The paper uses a limited-memory quasi-Newton method.)
$$\nabla L(W) = \sum_m \big( \mathbf{H}(\mathbf{x}_m, \mathbf{s}_m) - \mathbb{E}_{P(\mathbf{s}' \mid \mathbf{x}_m, W)}\, \mathbf{H}(\mathbf{x}_m, \mathbf{s}') \big)$$
i.e. observed feature counts minus expected feature counts.
Parameter Learning: Semi-CRF
Writing the expectation out,
$$\nabla L(W) = \sum_m \Big( \mathbf{H}(\mathbf{x}_m, \mathbf{s}_m) - \frac{\sum_{\mathbf{s}'} \mathbf{H}(\mathbf{x}_m, \mathbf{s}') \exp\!\big(W^\top \mathbf{H}(\mathbf{x}_m, \mathbf{s}')\big)}{Z(\mathbf{x}_m)} \Big)$$
The Markov property of $\mathbf{H}$ and dynamic programming enable fast computation of the expected feature values $\mathbb{E}_{P(\mathbf{s}' \mid \mathbf{x}, W)}\, \mathbf{H}(\mathbf{x}, \mathbf{s}')$ under the current weight vector. Define
$$\beta(j, y) = \sum_{\mathbf{s}' \in \mathbf{s}_{j:y}} \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}')\big), \qquad Z(\mathbf{x}) = \sum_y \beta(|\mathbf{x}|, y)$$
where $\mathbf{s}_{j:y}$ denotes all partial segmentations of $1$ to $j$ ending at $j$ and labeled $y$.
Parameter Learning: Semi-CRF
$$\beta(j, y) = \begin{cases} \displaystyle\sum_{y' \in \mathcal{Y}} \sum_{d=1}^{L} \beta(j-d, y') \exp\!\big(W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j)\big) & \text{if } j > 0 \\ 1 & \text{if } j = 0 \\ 0 & \text{if } j < 0 \end{cases}$$
A similar approach can be used to compute the expectation $\sum_{\mathbf{s}'} \mathbf{H}(\mathbf{x}, \mathbf{s}') \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}')\big)$. Define
$$\eta_\ell(j, y) = \sum_{\mathbf{s}' \in \mathbf{s}_{j:y}} H_\ell(\mathbf{x}, \mathbf{s}') \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}')\big)$$
i.e. the value of the $\ell$-th feature, restricted to partial segmentations ending at position $j$ with label $y$. It satisfies the recurrence
$$\eta_\ell(j, y) = \sum_{y' \in \mathcal{Y}} \sum_{d=1}^{L} \big( \eta_\ell(j-d, y') + \beta(j-d, y')\, h'_\ell(y, y', \mathbf{x}, j-d+1, j) \big) \exp\!\big(W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j)\big)$$
Parameter Learning: Semi-CRF
Finally, the expected feature counts are obtained from the $\eta$ values at the end of the sequence:
$$\mathbb{E}_{P(\mathbf{s}' \mid \mathbf{x}, W)}\, H_\ell(\mathbf{x}, \mathbf{s}') = \frac{1}{Z(\mathbf{x})} \sum_y \eta_\ell(|\mathbf{x}|, y)$$
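For concreteness, a sketch of the $\beta$ recursion in the same style as the Viterbi sketch above (illustrative; it reuses the assumed `score(y, y_prev, x, u, v)` with `y_prev=None` for the first segment, and a production implementation would work in log space to avoid overflow and compute the $\eta$ recursion alongside):

```python
import math

def semi_crf_partition(x, labels, score, L):
    """Compute beta(j, y) and Z(x) by the forward recursion above.

    beta(j, y) sums exp(W . H(x, s')) over partial segmentations of 1..j
    whose last segment ends at j with label y; Z(x) = sum_y beta(n, y).
    """
    n = len(x)
    beta = [dict.fromkeys(labels, 0.0) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for y in labels:
            total = 0.0
            for d in range(1, min(L, j) + 1):
                if j == d:
                    # Base case beta(0, .) = 1 with a dummy "start" label.
                    total += math.exp(score(y, None, x, 1, j))
                else:
                    for y_prev in labels:
                        total += beta[j - d][y_prev] * math.exp(
                            score(y, y_prev, x, j - d + 1, j))
            beta[j][y] = total
    Z = sum(beta[n][y] for y in labels)
    return beta, Z
```

Dividing $\sum_y \eta_\ell(|\mathbf{x}|, y)$ by the $Z$ returned here yields the expected feature counts needed for the gradient.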