Inference and Representation
David Sontag
New York University
Lecture 13, Dec. 8, 2015
Conditional random fields (CRFs)
Conditional random fields are undirected graphical models of conditional distributions p(Y | X):
• Y is a set of target variables
• X is a set of observed variables
• We typically show the graphical model using just the Y variables
• Potentials are a function of X and Y
Formal definition
A CRF is a Markov network on variables X ∪ Y, which specifies the conditional distribution

    P(y | x) = (1 / Z(x)) ∏_{c∈C} φ_c(x_c, y_c)

with partition function

    Z(x) = ∑_{ŷ} ∏_{c∈C} φ_c(x_c, ŷ_c).

As before, two variables in the graph are connected with an undirected edge if they appear together in the scope of some factor.
The only difference with a standard Markov network is the normalization term: before we marginalized over X and Y, now only over Y.
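To make the definition concrete, here is a minimal sketch (not part of the lecture) that computes P(y | x) by brute-force enumeration over Y. The factor representation and the toy potentials are illustrative assumptions, and the enumeration is only feasible when Y is tiny.

```python
# Minimal sketch: P(y | x) for a tiny CRF by enumerating all assignments to Y.
import itertools
import math

def crf_conditional(x, y, factors, y_domains):
    """factors: list of (scope, phi), where scope indexes into y and
    phi(x, y_scope) returns a positive potential value."""
    def unnorm(y_assign):
        # Unnormalized score: product of clique potentials phi_c(x_c, y_c)
        p = 1.0
        for scope, phi in factors:
            p *= phi(x, tuple(y_assign[i] for i in scope))
        return p

    # Partition function Z(x): sum over all assignments to Y only (x stays fixed)
    Z = sum(unnorm(y_hat) for y_hat in itertools.product(*y_domains))
    return unnorm(y) / Z

# Toy example: Y = (Y0, Y1) binary, one pairwise factor and two unary factors,
# with potentials that depend on the observation x.
pairwise = lambda x, yc: math.exp(2.0 if yc[0] == yc[1] else 0.0)
unary = lambda x, yc: math.exp(1.0 if yc[0] == x else 0.0)
factors = [((0, 1), pairwise), ((0,), unary), ((1,), unary)]
print(crf_conditional(1, (1, 1), factors, [(0, 1), (0, 1)]))
```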
Application: named-entity recognition
Given a sentence, determine the people and organizations involved and the relevant locations:
“Mrs. Green spoke today in New York. Green chairs the finance committee.”
• Entities sometimes span multiple words
• The entity of a word is not obvious without considering its context
• The CRF has one target variable Y_i for each word, which encodes the possible labels of that word
• The labels are, for example, “B-person, I-person, B-location, I-location, B-organization, I-organization”
• Having beginning (B) and within (I) labels allows the model to segment adjacent entities
Application: named-entity recognition
The graphical model is called a skip-chain CRF. There are three types of potentials:
• φ_1(Y_t, Y_{t+1}) represents dependencies between neighboring target variables [analogous to the transition distribution in an HMM]
• φ_2(Y_t, Y_{t′}) for all pairs t, t′ such that x_t = x_{t′}, because if a word appears twice, it is likely to be the same entity
• φ_3(Y_t, X_1, ..., X_T) for dependencies between an entity and the word sequence [e.g., may have features taking capitalization into consideration]
Notice that the graph structure changes depending on the sentence! (A small structure-building sketch follows below.)
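As an illustration of that last point (my own sketch, not from the slides), the following builds the sentence-dependent skip-chain edge set: chain edges between neighbors for the φ_1 potentials, plus skip edges connecting positions that hold the same word for the φ_2 potentials.

```python
# Sketch: sentence-dependent skip-chain structure (illustrative only).
def skip_chain_edges(words):
    # phi_1: edges between neighboring target variables Y_t, Y_{t+1}
    edges = [(t, t + 1) for t in range(len(words) - 1)]
    # phi_2: skip edges between non-adjacent positions holding the same word
    # (in practice one would exclude punctuation and function words)
    for t in range(len(words)):
        for t2 in range(t + 2, len(words)):
            if words[t].lower() == words[t2].lower():
                edges.append((t, t2))
    return edges

sentence = "Mrs. Green spoke today in New York . Green chairs the finance committee .".split()
print(skip_chain_edges(sentence))  # includes a skip edge linking the two occurrences of "Green"
```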
Application: Part-of-speech tagging
[Figure: the sentence “United flies some large jet” (words indexed 1–5) tagged with the sequence N V D A N]
Graphical model formulation of POS tagging
Given:
• a sentence of length n and a tag set T
• one variable for each word, taking values in T
• edge potentials θ_{i−1,i}(t′, t) for all i ∈ {2, ..., n} and t, t′ ∈ T
Example: “United_1 flies_2 some_3 large_4 jet_5” with T = {A, D, N, V}
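MAP inference on this chain can be done with the Viterbi recursion; below is a minimal sketch of my own (not from the slides). It assumes helper functions node_score(i, t) and edge_score(i, t′, t) that return log-potential scores; the node scores anticipate the features introduced on the next slide.

```python
# Sketch of Viterbi MAP inference for the chain model. node_score(i, t) and
# edge_score(i, t_prev, t) are assumed helpers returning log-potential scores.
def viterbi(n, tags, node_score, edge_score):
    # pi[i][t] = best score of a tag sequence for words 0..i that ends in tag t
    pi = [{t: node_score(0, t) for t in tags}]
    back = []
    for i in range(1, n):
        scores, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda tp: pi[-1][tp] + edge_score(i, tp, t))
            scores[t] = pi[-1][best_prev] + edge_score(i, best_prev, t) + node_score(i, t)
            pointers[t] = best_prev
        pi.append(scores)
        back.append(pointers)
    # Trace the best final tag back through the stored pointers
    best = max(tags, key=lambda t: pi[-1][t])
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Usage with dummy scores (replace with real log-potentials):
print(viterbi(5, ["A", "D", "N", "V"], lambda i, t: 0.0, lambda i, tp, t: 0.0))
```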
Features for POS tagging
Parameterization as a log-linear model: weights w ∈ R^d, feature vectors f_c(x, y_c) ∈ R^d, and

    φ_c(x, y_c; w) = exp(w · f_c(x, y_c))

• Edge potentials: fully parameterize (|T| × |T| features and weights), i.e. θ_{i−1,i}(t′, t) = w^T_{t′,t}, where the superscript “T” denotes that these are the weights for the transitions
• Node potentials: introduce features for the presence or absence of certain attributes of each word (e.g., initial letter capitalized, suffix is “ing”), for each possible tag (|T| × #attributes features and weights). This part is conditional on the input sentence!
The edge potential is the same for all edges, and likewise the node potentials are shared across all positions.
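A minimal sketch of this parameterization follows; the feature and weight names are my own illustrative assumptions, not the lecture's. Transition weights are indexed by tag pairs, and node features pair simple word attributes with the candidate tag.

```python
# Sketch of the log-linear parameterization phi_c(x, y_c; w) = exp(w . f_c(x, y_c)).
import math

def word_attributes(word):
    # A few hypothetical attribute extractors; real taggers use many more.
    attrs = ["word=" + word.lower()]
    if word[0].isupper():
        attrs.append("init-cap")
    if word.endswith("ing"):
        attrs.append("suffix-ing")
    return attrs

def node_potential(w, sentence, i, tag):
    # exp( sum of weights of the active (attribute, tag) features ) -- conditional on x
    return math.exp(sum(w.get((a, tag), 0.0) for a in word_attributes(sentence[i])))

def edge_potential(w, prev_tag, tag):
    # Fully parameterized transitions: one weight per (prev_tag, tag) pair,
    # shared across all edges of the chain
    return math.exp(w.get(("trans", prev_tag, tag), 0.0))

w = {("init-cap", "N"): 1.5, ("trans", "D", "N"): 0.8}
print(node_potential(w, ["United", "flies"], 0, "N"), edge_potential(w, "D", "N"))
```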
Density estimation for CRFs
Suppose we want to predict a set of variables Y given some others X, e.g., stereo vision (input: two images; output: a disparity map) or part-of-speech tagging (“Once upon a time in a land” tagged RB IN DT NN IN DT NN).
We concentrate on predicting p(Y | X), and use a conditional loss function

    loss(x, y, M̂) = −log p̂(y | x).

Since the loss function only depends on p̂(y | x), it suffices to estimate the conditional distribution, not the joint.
Density estimation for CRFs
CRF:

    p(y | x) = (1 / Z(x)) ∏_{c∈C} φ_c(x, y_c),    Z(x) = ∑_{ŷ} ∏_{c∈C} φ_c(x, ŷ_c)

Empirical risk minimization with CRFs, i.e. min_{M̂} E_D[ loss(x, y, M̂) ]:

    w_ML = argmin_w (1 / |D|) ∑_{(x,y)∈D} −log p(y | x; w)
         = argmax_w ∑_{(x,y)∈D} [ ∑_c log φ_c(x, y_c; w) − log Z(x; w) ]
         = argmax_w ∑_{(x,y)∈D} ∑_c w · f_c(x, y_c) − ∑_{(x,y)∈D} log Z(x; w)

What if prediction is only done with MAP inference? Then the partition function is irrelevant. Is there a way to train that takes advantage of this?
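The objective can be evaluated directly for small problems; below is a brute-force sketch of my own (feasible only for tiny label spaces) of the average negative conditional log-likelihood. It assumes a helper feats(x, y) returning the total feature counts ∑_c f_c(x, y_c) as a dict.

```python
# Brute-force sketch of the CRF training objective: average -log p(y | x; w).
# feats(x, y) is an assumed helper returning a dict of total feature counts.
import itertools
import math

def log_score(w, feats, x, y):
    # w . sum_c f_c(x, y_c)
    return sum(w.get(k, 0.0) * v for k, v in feats(x, y).items())

def neg_log_likelihood(w, feats, data, y_domain):
    nll = 0.0
    for x, y in data:
        # log Z(x; w): enumerate every labeling of the same length as x
        log_Z = math.log(sum(math.exp(log_score(w, feats, x, y_hat))
                             for y_hat in itertools.product(y_domain, repeat=len(x))))
        nll -= log_score(w, feats, x, tuple(y)) - log_Z
    return nll / len(data)
```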
Goal of learning
The goal of learning is to return a model M̂ that precisely captures the distribution p* from which our data was sampled.
This is in general not achievable because of
• computational reasons
• limited data, which only provides a rough approximation of the true underlying distribution
We need to select M̂ to construct the “best” approximation to M*.
What is “best”?
What notion of “best” should learning be optimizing?
This depends on what we want to do:
1. Density estimation: we are interested in the full distribution (so later we can compute whatever conditional probabilities we want)
2. Specific prediction tasks: we are using the distribution to make a prediction
3. Structure or knowledge discovery: we are interested in the model itself
Structured prediction
Often we learn a model for the purpose of structured prediction, in which given x we predict y by finding the MAP assignment:

    ŷ = argmax_y p̂(y | x)

Rather than learning with the log-loss (density estimation), we use a loss function better suited to the specific task. One reasonable choice would be the classification error:

    E_{(x,y)∼p*}[ 1{∃ y′ ≠ y s.t. p̂(y′ | x) ≥ p̂(y | x)} ]

which is the probability, over (x, y) pairs sampled from p*, that our classifier fails to select the right labels.
If p* is in the model family, training with log-loss (density estimation) and with classification error would perform similarly (given sufficient data). Otherwise, it is better to directly optimize what we care about (classification error).
Structured prediction
Consider the empirical risk for the 0-1 loss (classification error):

    (1 / |D|) ∑_{(x,y)∈D} 1{∃ y′ ≠ y s.t. p̂(y′ | x) ≥ p̂(y | x)}

Each constraint p̂(y′ | x) ≥ p̂(y | x) is equivalent to

    ∑_c w · f_c(x, y′_c) − log Z(x; w) ≥ ∑_c w · f_c(x, y_c) − log Z(x; w)

The log-partition function cancels out on both sides. Re-arranging, we have:

    w · [ ∑_c f_c(x, y′_c) − ∑_c f_c(x, y_c) ] ≥ 0

Said differently, the empirical risk is zero when ∀ (x, y) ∈ D and y′ ≠ y,

    w · [ ∑_c f_c(x, y_c) − ∑_c f_c(x, y′_c) ] > 0.
Structured prediction
The empirical risk is zero when ∀ (x, y) ∈ D and y′ ≠ y,

    w · [ ∑_c f_c(x, y_c) − ∑_c f_c(x, y′_c) ] > 0.

In the simplest setting, learning corresponds to finding a weight vector w that satisfies all of these constraints (when possible).
This is a linear program (LP)! How many constraints does it have? |D| · |Y|: exponentially many!
Thus, we must avoid explicitly representing this LP; a brute-force check of the constraints is sketched below.
This lecture is about algorithms for solving this LP (or some variant) in a tractable manner.
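To see why the LP cannot be written down explicitly, here is a brute-force sketch (my own, reusing the same assumed feats(x, y) feature-count helper as above) that checks the constraints for a given w by enumerating every alternative labeling y′; this enumeration is exactly what the algorithms in the rest of the lecture avoid.

```python
# Brute-force check of the constraints w . (f(x,y) - f(x,y')) > 0 for every
# training pair and every wrong labeling y' (exponentially many; illustration only).
import itertools

def score(w, feats, x, y):
    return sum(w.get(k, 0.0) * v for k, v in feats(x, y).items())

def zero_one_empirical_risk(w, feats, data, y_domain):
    errors = 0
    for x, y in data:
        gold = score(w, feats, x, tuple(y))
        # An error occurs if some y' != y scores at least as high as the true y
        if any(y_hat != tuple(y) and score(w, feats, x, y_hat) >= gold
               for y_hat in itertools.product(y_domain, repeat=len(x))):
            errors += 1
    return errors / len(data)
```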
Structured perceptron algorithm
Input: training examples D = {(x^m, y^m)}
Let f(x, y) = ∑_c f_c(x, y_c). Then the constraints that we want to satisfy are

    w · ( f(x^m, y^m) − f(x^m, y) ) > 0,    ∀ y ≠ y^m

The perceptron algorithm uses MAP inference in its inner loop:

    MAP(x^m; w) = argmax_{y∈Y} w · f(x^m, y)

The maximization can often be performed efficiently by using the structure!
The perceptron algorithm is then:
1. Start with w = 0
2. While the weight vector is still changing:
3.   For m = 1, ..., |D|:
4.     y ← MAP(x^m; w)
5.     w ← w + f(x^m, y^m) − f(x^m, y)
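A minimal sketch of this loop is given below (not the lecture's code). It reuses the assumed feats(x, y) feature-count helper, and performs MAP inference by brute-force enumeration purely for clarity; in practice one exploits the structure, e.g. with Viterbi on a chain.

```python
# Sketch of the structured perceptron. MAP inference is brute force here for
# clarity; real implementations exploit the graph structure (e.g. Viterbi).
import itertools
from collections import defaultdict

def map_inference(w, feats, x, y_domain):
    # arg max_y  w . f(x, y)  (enumeration is exponential; illustration only)
    return max(itertools.product(y_domain, repeat=len(x)),
               key=lambda y: sum(w.get(k, 0.0) * v for k, v in feats(x, y).items()))

def structured_perceptron(data, feats, y_domain, max_epochs=50):
    w = defaultdict(float)                                    # 1. start with w = 0
    for _ in range(max_epochs):                               # 2. until weights stop changing
        changed = False
        for x, y_gold in data:                                # 3. for each training example
            y_pred = map_inference(w, feats, x, y_domain)     # 4. y <- MAP(x^m; w)
            if y_pred != tuple(y_gold):
                # 5. w <- w + f(x^m, y^m) - f(x^m, y)
                for k, v in feats(x, tuple(y_gold)).items():
                    w[k] += v
                for k, v in feats(x, y_pred).items():
                    w[k] -= v
                changed = True
        if not changed:
            break
    return dict(w)
```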