Semi-Markov Conditional Random Fields for Information Extraction
Sunita Sarawagi and William Cohen, NIPS 2004
Presented by: Dinesh Khandelwal
Slides are adopted from Daniel Khashabi
Beyond Classification Learning
The standard classification problem assumes individual cases are disconnected and independent (i.i.d.: independently and identically distributed).
Many NLP problems do not satisfy this assumption: they involve making many connected decisions, each resolving a different ambiguity, but all mutually dependent.
More sophisticated learning and inference techniques are needed to handle such situations in general.
Sequence Labeling Problem
Many NLP problems can be viewed as sequence labeling: each token in a sequence is assigned a label, and the labels of tokens depend on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.).
Named Entity Recognition
Example: labeling each token of "My review of Fermat's last theorem by S. Singh".

i:  1      2       3      4         5      6        7      8       9
x:  My     review  of     Fermat's  last   theorem  by     S.      Singh
y:  Other  Other   Other  Title     Title  Title    Other  Author  Author
Problem Description
Relational structure occurs in many applications: NLP, computer vision, signal processing, ….
Traditionally, graphical models factor the joint distribution as $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})$.
Modeling the joint distribution can lead to difficulties: rich local features occur in relational data, and these features may have complex dependencies, so constructing a probability distribution $p(\mathbf{x})$ over them is difficult.
Solution: directly model the conditional $p(\mathbf{y} \mid \mathbf{x})$, which is sufficient for classification!
A CRF is simply a conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ with an associated graphical structure.
Log-linear representation of CRFs
$$P(\mathbf{y} \mid \mathbf{x}, W) = \frac{1}{Z(\mathbf{x})} \exp\!\big(W^\top \mathbf{G}(\mathbf{x}, \mathbf{y})\big), \qquad \mathbf{G}(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{|\mathbf{x}|} \mathbf{g}(j, \mathbf{x}, \mathbf{y})$$
where $\mathbf{g} = (g_1, \dots, g_K)$ is a vector of local feature functions, each $g_k(j, \mathbf{x}, \mathbf{y}) \in \mathbb{R}$, $W$ is the vector of parameters to be estimated, and $Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\big(W^\top \mathbf{G}(\mathbf{x}, \mathbf{y}')\big)$ is the normalizer.
Linear Chain CRF
[Figure: linear-chain graphical model; shaded nodes = observable tokens $\mathbf{x}$, unshaded nodes = unobservable labels $\mathbf{y}$.]
In a linear-chain CRF, each local feature depends only on the current and previous labels:
$$g_k(j, \mathbf{x}, \mathbf{y}) = g'_k(j, \mathbf{x}, y_j, y_{j-1})$$
Features
The kinds of features used in NLP-oriented machine learning systems typically involve:
Binary values: think of a feature as being on or off rather than as a feature with a real value.
Values that are relative to an object/class pair rather than being a function of the object alone.
Systems typically have lots and lots of features (100,000s of features isn't unusual).
Features
$$g_1(j, \mathbf{x}, \mathbf{y}) = \begin{cases} 1 & \text{if } y_j = \text{DT and } y_{j-1} = \text{V} \\ 0 & \text{otherwise} \end{cases}$$
$$g_2(j, \mathbf{x}, \mathbf{y}) = \begin{cases} 1 & \text{if } x_j = \text{``the'' and } y_j = \text{DT} \\ 0 & \text{otherwise} \end{cases}$$
$$g_3(j, \mathbf{x}, \mathbf{y}) = \begin{cases} 1 & \text{if } x_j \text{ has suffix ``ing'' and } y_j = \text{V} \\ 0 & \text{otherwise} \end{cases}$$
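A minimal Python sketch (not from the slides) of how such binary feature functions, and the log-linear score from the earlier slide, might be coded; the toy sentence and weights are illustrative assumptions:

```python
import math

# Indicator feature functions g_k(j, x, y): each returns 1 when its
# (token, label, previous-label) configuration holds at position j, else 0.
def g1(j, x, y):
    return 1 if j > 0 and y[j] == "DT" and y[j - 1] == "V" else 0

def g2(j, x, y):
    return 1 if x[j] == "the" and y[j] == "DT" else 0

def g3(j, x, y):
    return 1 if x[j].endswith("ing") and y[j] == "V" else 0

FEATURES = [g1, g2, g3]

def global_features(x, y):
    """G(x, y): sum the local feature vector g(j, x, y) over all positions j."""
    return [sum(g(j, x, y) for j in range(len(x))) for g in FEATURES]

def unnormalized_score(W, x, y):
    """exp(W . G(x, y)), the numerator of the log-linear CRF P(y | x, W)."""
    return math.exp(sum(w * f for w, f in zip(W, global_features(x, y))))

# Toy usage with made-up weights:
x = ["I", "like", "eating", "the", "cake"]
y = ["PRP", "V", "V", "DT", "NN"]
print(global_features(x, y))                      # [1, 1, 1]
print(unnormalized_score([0.5, 1.0, 0.3], x, y))  # exp(1.8)
```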
Segmentation models (Semi-CRFs)
A CRF assigns a label to every word; features describe a single word: $g'_k(j, \mathbf{x}, y_j, y_{j-1})$.

i:  1  2     3       4     5         6        7   8        9
x:  I  went  skiing  with  Fernando  Pereira  in  British  Columbia
y:  O  O     O       O     I         I        O   I        I

A semi-CRF assigns a label to every segment; features describe the whole segment from $u_k$ to $v_k$: $h_k(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$.

k:           1      2      3      4      5                  6      7
segment:     I      went   skiing with   Fernando Pereira   in     British Columbia
(u_k, v_k):  (1,1)  (2,2)  (3,3)  (4,4)  (5,6)              (7,7)  (8,9)
y_k:         O      O      O      O      I                  O      I
Semi-CRF
$\mathbf{s} = \langle s_1, \dots, s_p \rangle$ denotes a segmentation of $\mathbf{x}$.
Segment $s_k = (u_k, v_k, y_k)$ consists of a start position $u_k$, an end position $v_k$, and a label $y_k$.
Segments are contiguous and non-overlapping: $u_{k+1} = v_k + 1$ and $1 \le u_k \le v_k \le |\mathbf{x}|$.
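As a concrete illustration (an assumption of this write-up, not from the paper), a segmentation can be represented as a list of (u_k, v_k, y_k) tuples and checked against the constraints above:

```python
# The example segmentation from the previous slide, 1-indexed.
segmentation = [(1, 1, "O"), (2, 2, "O"), (3, 3, "O"), (4, 4, "O"),
                (5, 6, "I"), (7, 7, "O"), (8, 9, "I")]

def is_valid_segmentation(segments, n):
    """Check u_{k+1} = v_k + 1, 1 <= u_k <= v_k <= n, and full coverage of x."""
    if not segments or segments[0][0] != 1 or segments[-1][1] != n:
        return False
    for (u, v, _), (u_next, _, _) in zip(segments, segments[1:]):
        if not (1 <= u <= v) or u_next != v + 1:
            return False
    return segments[-1][0] <= segments[-1][1] <= n

print(is_valid_segmentation(segmentation, 9))  # True
```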
Semi-CRF
[Figure: semi-Markov chain over segments; shaded nodes = observable $\mathbf{x}$, unshaded nodes = unobservable segment labels.]
$$P(\mathbf{s} \mid \mathbf{x}, W) = \frac{1}{Z(\mathbf{x})} \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s})\big), \qquad Z(\mathbf{x}) = \sum_{\mathbf{s}'} \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}')\big)$$
$$\mathbf{H}(\mathbf{x}, \mathbf{s}) = \sum_{k=1}^{p} \mathbf{h}(k, \mathbf{x}, \mathbf{s}), \qquad h_\ell(k, \mathbf{x}, \mathbf{s}) = h'_\ell(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$$
$\mathbf{h}$ is a vector of segment-level feature functions.
MAP Inference: Semi-CRF
$$\mathbf{s}^* = \operatorname*{argmax}_{\mathbf{s}} P(\mathbf{s} \mid \mathbf{x}, W) = \operatorname*{argmax}_{\mathbf{s}} W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}) = \operatorname*{argmax}_{\mathbf{s}} \sum_k W^\top \mathbf{h}(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$$
$\mathbf{h}$ is a vector of segment-level feature functions.
Viterbi algorithm for Semi-CRF
Goal: compute $\max_{\mathbf{s}} \sum_k W^\top \mathbf{h}(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$.
Let $L$ be an upper bound on segment length, and let $\mathbf{s}_{j:y}$ denote the set of all partial segmentations of positions $1$ to $j$ such that the last segment ends at position $j$ and has label $y$.
$$V(j, y) = \max_{y',\, d} \; \max_{\mathbf{s}' \in \mathbf{s}_{j-d:y'}} \Big[ \sum_{k} W^\top \mathbf{h}(y_k, y_{k-1}, \mathbf{x}, u_k, v_k) \;+\; W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j) \Big]$$
Viterbi algorithm for Semi-CRF
The inner maximization over $\mathbf{s}' \in \mathbf{s}_{j-d:y'}$ is itself $V(j-d, y')$:
$$V(j-d, y') = \max_{\mathbf{s}' \in \mathbf{s}_{j-d:y'}} \sum_{k} W^\top \mathbf{h}(y_k, y_{k-1}, \mathbf{x}, u_k, v_k)$$
so the recurrence simplifies to
$$V(j, y) = \max_{y',\, d} \; V(j-d, y') + W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j)$$
Viterbi algorithm for Semi-CRF
$$V(j, y) = \begin{cases} \displaystyle\max_{y',\, d = 1 \dots L} \; V(j-d, y') + W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j) & \text{if } j > 0 \\ 0 & \text{if } j = 0 \\ -\infty & \text{if } j < 0 \end{cases}$$
The optimal segmentation corresponds to the path traced back from $\max_y V(|\mathbf{x}|, y)$.
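A sketch of this dynamic program in Python (illustrative, not the authors' code; it assumes a user-supplied `score(y, y_prev, x, u, v)` that returns $W^\top \mathbf{h}(y, y', \mathbf{x}, u, v)$ for the segment covering positions u..v, with `y_prev=None` for the first segment):

```python
def semi_crf_viterbi(x, labels, score, L):
    """Best segmentation under V(j,y) = max_{y',d<=L} V(j-d,y') + score(...)."""
    n = len(x)
    NEG = float("-inf")
    # V[j][y]: best score over partial segmentations of 1..j whose last
    # segment ends at j with label y; back[j][y] = (previous end, previous label).
    V = [dict.fromkeys(labels, NEG) for _ in range(n + 1)]
    back = [dict.fromkeys(labels) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for y in labels:
            for d in range(1, min(L, j) + 1):
                # Base case V(0, .) = 0 with a dummy "start" previous label.
                prev_labels = labels if j > d else [None]
                for y_prev in prev_labels:
                    prev = V[j - d][y_prev] if j > d else 0.0
                    s = prev + score(y, y_prev, x, j - d + 1, j)
                    if s > V[j][y]:
                        V[j][y] = s
                        back[j][y] = (j - d, y_prev)
    # The optimum is max_y V(n, y); follow back-pointers to recover segments.
    best_y = max(labels, key=lambda lab: V[n][lab])
    segments, j, y = [], n, best_y
    while j > 0:
        j_prev, y_prev = back[j][y]
        segments.append((j_prev + 1, j, y))
        j, y = j_prev, y_prev
    return V[n][best_y], segments[::-1]
```

The three nested loops make the cost $O(nL|\mathcal{Y}|^2)$, which is exactly the linear-in-$L$ overhead discussed on the next slide.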
Semi-Markov CRFs vs conventional CRFs
Since conventional CRFs need not maximize over possible segment lengths $d$, inference for semi-CRFs is more expensive. However, the additional cost is only linear in $L$.
Semi-CRFs have more expressive power: a major advantage is that they allow features which measure properties of entire segments, rather than of individual elements.
Semi-Markov CRFs vs higher-order CRFs
Semi-CRFs are no more expressive than order-$L$ CRFs. For order-$L$ CRFs, however, the additional computational cost is exponential in $L$.
Semi-CRFs only consider sequences in which the same label is assigned to all $L$ positions of a segment, rather than all $|\mathcal{Y}|^L$ length-$L$ label sequences. This is a useful restriction, as it leads to faster inference.
Parameter Learning: Semi-CRF
Given training data $\{(\mathbf{x}_m, \mathbf{s}_m)\}_{m=1}^{N}$, we wish to learn the parameters $W$ of the model. We express the log-likelihood over the training sequences as
$$L(W) = \sum_m \log P(\mathbf{s}_m \mid \mathbf{x}_m, W) = \sum_m \big( W^\top \mathbf{H}(\mathbf{x}_m, \mathbf{s}_m) - \log Z(\mathbf{x}_m) \big)$$
$L(W)$ is concave, and can thus be maximized by gradient ascent or one of many related methods. (The paper uses a limited-memory quasi-Newton method.)
$$\nabla L(W) = \sum_m \big( \mathbf{H}(\mathbf{x}_m, \mathbf{s}_m) - \mathbb{E}_{P(\mathbf{s}' \mid \mathbf{x}_m, W)}\, \mathbf{H}(\mathbf{x}_m, \mathbf{s}') \big)$$
i.e. observed feature counts minus expected feature counts.
Parameter Learning: Semi-CRF
Writing the expectation out,
$$\nabla L(W) = \sum_m \Big( \mathbf{H}(\mathbf{x}_m, \mathbf{s}_m) - \frac{\sum_{\mathbf{s}'} \mathbf{H}(\mathbf{x}_m, \mathbf{s}') \exp\!\big(W^\top \mathbf{H}(\mathbf{x}_m, \mathbf{s}')\big)}{Z(\mathbf{x}_m)} \Big)$$
The Markov property of $\mathbf{H}$ and dynamic programming enable fast computation of the expected feature values $\mathbb{E}_{P(\mathbf{s}' \mid \mathbf{x}, W)}\, \mathbf{H}(\mathbf{x}, \mathbf{s}')$ under the current weight vector. Define
$$\beta(j, y) = \sum_{\mathbf{s}' \in \mathbf{s}_{j:y}} \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}')\big), \qquad Z(\mathbf{x}) = \sum_y \beta(|\mathbf{x}|, y)$$
where $\mathbf{s}_{j:y}$ denotes all partial segmentations of $1$ to $j$ ending at $j$ and labeled $y$.
Parameter Learning: Semi-CRF
$$\beta(j, y) = \begin{cases} \displaystyle\sum_{y' \in \mathcal{Y}} \sum_{d=1}^{L} \beta(j-d, y') \exp\!\big(W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j)\big) & \text{if } j > 0 \\ 1 & \text{if } j = 0 \\ 0 & \text{if } j < 0 \end{cases}$$
A similar approach can be used to compute the expectation $\sum_{\mathbf{s}'} \mathbf{H}(\mathbf{x}, \mathbf{s}') \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}')\big)$. Define
$$\eta_\ell(j, y) = \sum_{\mathbf{s}' \in \mathbf{s}_{j:y}} H_\ell(\mathbf{x}, \mathbf{s}') \exp\!\big(W^\top \mathbf{H}(\mathbf{x}, \mathbf{s}')\big)$$
i.e. the value of the $\ell$-th feature, restricted to partial segmentations ending at position $j$ with label $y$. It satisfies the recurrence
$$\eta_\ell(j, y) = \sum_{y' \in \mathcal{Y}} \sum_{d=1}^{L} \big( \eta_\ell(j-d, y') + \beta(j-d, y')\, h'_\ell(y, y', \mathbf{x}, j-d+1, j) \big) \exp\!\big(W^\top \mathbf{h}(y, y', \mathbf{x}, j-d+1, j)\big)$$
Parameter Learning: Semi-CRF
Finally, the expected feature counts are obtained from the $\eta$ values at the end of the sequence:
$$\mathbb{E}_{P(\mathbf{s}' \mid \mathbf{x}, W)}\, H_\ell(\mathbf{x}, \mathbf{s}') = \frac{1}{Z(\mathbf{x})} \sum_y \eta_\ell(|\mathbf{x}|, y)$$
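For concreteness, a sketch of the $\beta$ recursion in the same style as the Viterbi sketch above (illustrative; it reuses the assumed `score(y, y_prev, x, u, v)` with `y_prev=None` for the first segment, and a production implementation would work in log space to avoid overflow and compute the $\eta$ recursion alongside):

```python
import math

def semi_crf_partition(x, labels, score, L):
    """Compute beta(j, y) and Z(x) by the forward recursion above.

    beta(j, y) sums exp(W . H(x, s')) over partial segmentations of 1..j
    whose last segment ends at j with label y; Z(x) = sum_y beta(n, y).
    """
    n = len(x)
    beta = [dict.fromkeys(labels, 0.0) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for y in labels:
            total = 0.0
            for d in range(1, min(L, j) + 1):
                if j == d:
                    # Base case beta(0, .) = 1 with a dummy "start" label.
                    total += math.exp(score(y, None, x, 1, j))
                else:
                    for y_prev in labels:
                        total += beta[j - d][y_prev] * math.exp(
                            score(y, y_prev, x, j - d + 1, j))
            beta[j][y] = total
    Z = sum(beta[n][y] for y in labels)
    return beta, Z
```

Dividing $\sum_y \eta_\ell(|\mathbf{x}|, y)$ by the $Z$ returned here yields the expected feature counts needed for the gradient.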