  1. Conditional Random Fields
     Andrea Passerini (passerini@disi.unitn.it)
     Statistical relational learning

  2. Generative vs discriminative models

     Joint distributions
     - Traditional graphical models (both BN and MN) model joint probability distributions p(x, y).
     - In many situations we know in advance which variables will be observed and which will need to be predicted (i.e. x vs y).
     - Hidden Markov Models (as a special case of BN) also model joint probabilities of states and observations, even though they are often used to estimate the most probable sequence of states y given the observations x.
     - A problem with joint distributions is that they need to explicitly model the probability of x, which can be quite complex (e.g. a textual document).

  3. Generative vs discriminative models

     [Figure: graphical structure of a Naive Bayes classifier (single output y connected to inputs x_1, ..., x_n) and of a Hidden Markov Model (chain of outputs y_1, ..., y_{n+1} with corresponding observations x_1, ..., x_{n+1})]

     Generative models
     - Directed graphical models are called generative when the joint probability decouples as p(x, y) = p(x | y) p(y).
     - The dependencies between input and output are only from the latter to the former: the output generates the input.
     - Naive Bayes classifiers and Hidden Markov Models are both generative models.

  4. Generative vs discriminative models

     Discriminative models
     - If the purpose is choosing the most probable configuration of the output variables, we can directly model the conditional probability of the output given the input: p(y | x).
     - The parameters of this distribution have more freedom than those of the full p(x, y), as p(x) is not modelled.
     - This makes it possible to effectively exploit the structure of x without modelling the interactions between its parts, but only their interactions with the output.
     - Such models are called discriminative as they aim at modelling the discrimination between different outputs.

  5. Conditional Random Fields (CRF, Lafferty et al. 2001)

     Definition
     Conditional random fields are conditional Markov networks:

       p(y | x) = \frac{1}{Z(x)} \prod_{(x,y)_C} \exp\left(-E((x,y)_C)\right)

     The partition function Z(x) is summed only over y, to provide a proper conditional probability:

       Z(x) = \sum_{y'} \prod_{(x,y')_C} \exp\left(-E((x,y')_C)\right)

  6. Conditional Random Fields

     Feature functions

       p(y | x) = \frac{1}{Z(x)} \prod_{(x,y)_C} \exp\left( \sum_{k=1}^{K} \lambda_k f_k((x,y)_C) \right)

     - The negated energy function is often written simply as a weighted sum of real-valued feature functions.
     - Each feature function should capture a certain characteristic of the clique variables.

  7. Linear chain CRF

     Description (simple form)

       p(y | x) = \frac{1}{Z(x)} \exp\left( \sum_t \left[ \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}) + \sum_{h=1}^{H} \mu_h f_h(x_t, y_t) \right] \right)

     - Models the relation between an input and an output sequence.
     - Output sequences are modelled as a linear chain, with a link between each pair of consecutive output elements.
     - Each output element is connected to the corresponding input.

  8. Linear chain CRF

     Description (more generic form)

       p(y | x) = \frac{1}{Z(x)} \exp\left( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)

     - The linear chain CRF can model arbitrary features of the input, not only the identity of the current observation (as in HMMs).
     - We can think of x_t as a vector containing the input information relevant for position t, possibly including inputs at previous or following positions.
     - We can easily make the transition scores (between consecutive outputs y_{t-1}, y_t) depend also on the current input x_t, as in the sketch below.
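
A minimal sketch (not from the slides) of what such feature functions and the resulting unnormalized score might look like; the label set, the toy binary features, the dummy START label and all names are hypothetical, chosen only for illustration.

```python
# Toy linear-chain CRF score: sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t).
# Everything here is a made-up example, not the author's implementation.
import numpy as np

LABELS = ["B", "I", "O"]  # toy output alphabet

def features(y_t, y_prev, x_t):
    """Feature vector (f_1, ..., f_K) for one position of the chain."""
    return np.array([
        1.0 if y_t == "B" and x_t[0].isupper() else 0.0,  # input/output feature
        1.0 if y_prev == "B" and y_t == "I" else 0.0,     # transition feature
        1.0 if y_t == "O" else 0.0,                       # label bias feature
    ])

def log_score(y, x, lam):
    """Unnormalized log score of labeling y for input x; y_0 is a dummy START."""
    total, y_prev = 0.0, "START"
    for y_t, x_t in zip(y, x):
        total += lam @ features(y_t, y_prev, x_t)
        y_prev = y_t
    return total

lam = np.array([1.0, 0.5, 0.1])  # arbitrary example weights lambda_k
print(log_score(["B", "I", "O"], ["New", "York", "is"], lam))
```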

  9. Linear chain CRF

     Parameter estimation
     - The parameters λ_k of the feature functions need to be estimated from data.
     - We estimate them from a training set of i.i.d. input/output sequence pairs D = {(x^{(i)}, y^{(i)})}, i = 1, ..., N.
     - Each example (x^{(i)}, y^{(i)}) consists of a sequence of inputs and a corresponding sequence of outputs:

         x^{(i)} = \{x_1^{(i)}, \ldots, x_T^{(i)}\}, \qquad y^{(i)} = \{y_1^{(i)}, \ldots, y_T^{(i)}\}

     Note
     For simplicity of notation we assume each training sequence has the same length T. The generic form would replace T with T^{(i)}.
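
As an illustration, a toy training set with this structure could be represented as follows; the sequences and labels are made up.

```python
# Hypothetical toy training set D = {(x^(i), y^(i))}: each example pairs an
# input sequence with an output sequence of the same length T.
D = [
    (["New", "York", "is", "big"],  ["B", "I", "O", "O"]),   # (x^(1), y^(1))
    (["Rome", "is", "in", "Italy"], ["B", "O", "O", "B"]),   # (x^(2), y^(2))
]
assert all(len(x) == len(y) for x, y in D)
```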

  10. Parameter estimation

     Maximum likelihood estimation
     - Parameter estimation is performed by maximizing the likelihood of the data D given the parameters θ = {λ_1, ..., λ_K}.
     - As usual, to simplify derivations we equivalently maximize the log-likelihood.
     - Since CRFs model a conditional probability, we maximize the conditional log-likelihood:

         \ell(\theta) = \log \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)})

  11. Parameter estimation

     Maximum likelihood estimation
     Replacing the equation for the conditional probability we obtain:

       \ell(\theta) = \sum_{i=1}^{N} \log \frac{1}{Z(x^{(i)})} \exp\left( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) \right)
                    = \sum_{i=1}^{N} \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)})
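
A sketch of this conditional log-likelihood for toy sequences follows; it is not the author's code, and Z(x) is computed by brute-force enumeration over all labelings, which is feasible only for very short sequences (the forward procedure introduced later computes it efficiently).

```python
# Conditional log-likelihood l(theta) = sum_i [score(y^(i), x^(i)) - log Z(x^(i))],
# with toy features/labels and brute-force Z(x); everything is hypothetical.
import itertools
import numpy as np

LABELS = ["B", "I", "O"]

def features(y_t, y_prev, x_t):
    return np.array([
        1.0 if y_t == "B" and x_t[0].isupper() else 0.0,
        1.0 if y_prev == "B" and y_t == "I" else 0.0,
        1.0 if y_t == "O" else 0.0,
    ])

def log_score(y, x, lam):
    """sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t), with dummy START label."""
    total, y_prev = 0.0, "START"
    for y_t, x_t in zip(y, x):
        total += lam @ features(y_t, y_prev, x_t)
        y_prev = y_t
    return total

def log_Z(x, lam):
    """log of the sum over all labelings y' of exp(log_score(y', x))."""
    scores = [log_score(list(y), x, lam)
              for y in itertools.product(LABELS, repeat=len(x))]
    return np.logaddexp.reduce(scores)

def conditional_log_likelihood(D, lam):
    return sum(log_score(y, x, lam) - log_Z(x, lam) for x, y in D)
```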

  12. Gradient of the likelihood

       \frac{\partial \ell(\theta)}{\partial \lambda_k}
         = \underbrace{\sum_{i=1}^{N} \sum_t f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)})}_{\tilde{E}[f_k]}
         - \underbrace{\sum_{i=1}^{N} \sum_t \sum_{y, y'} f_k(y, y', x_t^{(i)}) \, p_\theta(y, y' \mid x^{(i)})}_{E_\theta[f_k]}

     Interpretation
     - \tilde{E}[f_k] is the expected value of f_k under the empirical distribution \tilde{p}(y, x) represented by the training examples.
     - E_\theta[f_k] is the expected value of f_k under the distribution represented by the model with the current value of the parameters: p_\theta(y | x) \tilde{p}(x) (where \tilde{p}(x) is the empirical distribution of x).

  13. Gradient of the likelihood

     Interpretation

       \frac{\partial \ell(\theta)}{\partial \lambda_k} = \tilde{E}[f_k] - E_\theta[f_k]

     - The gradient measures the difference between the expected value of the feature under the empirical and model distributions.
     - The gradient is zero when the model adheres to the empirical observations.
     - This highlights the risk of overfitting the training examples.
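
A brute-force sketch of this gradient for a single training pair is shown below; the toy features and all names are hypothetical, and the model expectation is computed by enumerating all labelings rather than by the dynamic programming used in practice.

```python
# Gradient dl/dlambda_k = E~[f_k] - E_theta[f_k] for one pair (x, y),
# computed by brute-force enumeration; illustrative toy example only.
import itertools
import numpy as np

LABELS = ["B", "I", "O"]

def features(y_t, y_prev, x_t):
    return np.array([
        1.0 if y_t == "B" and x_t[0].isupper() else 0.0,
        1.0 if y_prev == "B" and y_t == "I" else 0.0,
        1.0 if y_t == "O" else 0.0,
    ])

def feature_sum(y, x):
    """sum_t f(y_t, y_{t-1}, x_t): total feature counts of one labeling."""
    total, y_prev = np.zeros(3), "START"
    for y_t, x_t in zip(y, x):
        total += features(y_t, y_prev, x_t)
        y_prev = y_t
    return total

def gradient(x, y, lam):
    # empirical expectation: feature counts of the observed labeling
    empirical = feature_sum(y, x)
    # model expectation: average feature counts under p_theta(y' | x)
    labelings = [list(c) for c in itertools.product(LABELS, repeat=len(x))]
    scores = np.array([lam @ feature_sum(yp, x) for yp in labelings])
    probs = np.exp(scores - np.logaddexp.reduce(scores))  # softmax over labelings
    model = sum(p * feature_sum(yp, x) for p, yp in zip(probs, labelings))
    return empirical - model
```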

  14. Parameter estimation

     Adding regularization
     - CRFs often have a large number of parameters, to account for different characteristics of the inputs.
     - Many parameters mean a risk of overfitting the training data.
     - To reduce the risk of overfitting, we penalize parameters with too large a norm.

  15. Parameter estimation

     Zero-mean Gaussian prior
     - A common choice is assuming a Gaussian prior over the parameters, with zero mean and covariance σ²I (where I is the identity matrix):

         p(\theta) \propto \exp\left( \frac{-\|\theta\|^2}{2\sigma^2} \right)

     - The Gaussian normalization coefficient can be ignored as it is independent of θ.
     - σ² is a free parameter determining how much to penalize feature weights that move away from zero.
     - The log probability becomes:

         \log p(\theta) \propto \frac{-\|\theta\|^2}{2\sigma^2} = -\sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}

  16. Parameter estimation

     Maximum a-posteriori estimation
     We can now estimate the maximum a-posteriori parameters:

       \theta^* = \arg\max_\theta \; \ell(\theta) + \log p(\theta) = \arg\max_\theta \; \ell_r(\theta)

     where the regularized likelihood \ell_r(\theta) is:

       \ell_r(\theta) = \sum_{i=1}^{N} \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}
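
A small sketch of how the penalty term and its gradient contribution might be added on top of an existing (hypothetical) log-likelihood computation, such as the brute-force sketches above:

```python
# Regularized objective and gradient; `ll` is the unregularized conditional
# log-likelihood value and `grad` its gradient vector (hypothetical inputs).
import numpy as np

def regularized_log_likelihood(ll, lam, sigma2):
    """l_r(theta) = l(theta) - sum_k lambda_k^2 / (2 sigma^2)"""
    return ll - np.sum(lam ** 2) / (2.0 * sigma2)

def regularized_gradient(grad, lam, sigma2):
    """d l_r / d lambda_k = d l / d lambda_k - lambda_k / sigma^2"""
    return grad - lam / sigma2
```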

  17. Parameter estimation

     Optimizing the regularized likelihood
     - Gradient ascent → usually too slow.
     - Newton's method (uses the Hessian, the matrix of all second-order derivatives) → too expensive to compute the Hessian.
     - Quasi-Newton methods are often employed: they compute an approximation of the Hessian using only first derivatives (e.g. BFGS).
     - Limited-memory versions exist that avoid storing the full approximate Hessian (whose size is quadratic in the number of parameters).
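
As an illustrative sketch (not part of the slides), the optimization could be delegated to SciPy's limited-memory BFGS implementation, assuming hypothetical functions `neg_reg_ll` and `neg_reg_grad` that return the negative regularized log-likelihood and its gradient (SciPy minimizes, so the objective is negated).

```python
# Parameter estimation via a limited-memory quasi-Newton method (L-BFGS-B).
import numpy as np
from scipy.optimize import minimize

def fit(neg_reg_ll, neg_reg_grad, K):
    lam0 = np.zeros(K)  # initial parameters lambda_k
    result = minimize(neg_reg_ll, lam0, jac=neg_reg_grad, method="L-BFGS-B")
    return result.x     # estimated parameters
```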

  18. Inference

     Inference problems
     - Computing the gradient requires computing the marginal distribution for each edge, p_θ(y, y' | x^{(i)}).
     - This has to be recomputed at each gradient step, as the set of parameters θ changes in the direction of the gradient.
     - Computing the likelihood requires computing the partition function Z(x).
     - During testing, finding the most likely labeling requires solving:

         y^* = \arg\max_y \; p(y \mid x)

     Inference algorithms
     All such tasks can be performed efficiently by dynamic programming algorithms similar to those used for HMMs.
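
A possible sketch of the Viterbi-style dynamic program for y* = argmax_y p(y | x) is given below; `log_psi(i, j, t)` is a hypothetical callable returning the log clique potential Σ_k λ_k f_k(y_t = i, y_{t-1} = j, x_t) introduced on the next slide, and the dummy START label is an assumption for the first position.

```python
# Viterbi decoding for a linear-chain CRF over T positions (illustrative sketch).
def viterbi(T, labels, log_psi, start="START"):
    # delta[i] = best log score of a labeling of positions 1..t ending with label i
    delta = {i: log_psi(i, start, 0) for i in labels}
    backptr = []
    for t in range(1, T):
        new_delta, ptrs = {}, {}
        for i in labels:
            best_j = max(labels, key=lambda j: delta[j] + log_psi(i, j, t))
            new_delta[i] = delta[best_j] + log_psi(i, best_j, t)
            ptrs[i] = best_j
        delta, backptr = new_delta, backptr + [ptrs]
    y_last = max(labels, key=lambda i: delta[i])  # best final label
    path = [y_last]
    for ptrs in reversed(backptr):                # follow back-pointers
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# e.g., with the toy features/weights from the earlier sketches:
# viterbi(len(x), LABELS, lambda i, j, t: lam @ features(i, j, x[t]))
```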

  19. Inference algorithms

     Analogy to HMM
     - Inference algorithms rely on forward, backward and Viterbi procedures analogous to those for HMMs.
     - To simplify notation and highlight the analogy to HMMs, we will use the formulation of the CRF with clique potentials:

         p(y | x) = \frac{1}{Z(x)} \prod_t \Psi_t(y_t, y_{t-1}, x_t)

       where the clique potentials are:

         \Psi_t(y_t, y_{t-1}, x_t) = \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)
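
For illustration, one position's clique potential can be materialized as an |S| × |S| matrix; the toy labels and features below are hypothetical, as in the earlier sketches.

```python
# Psi_t as a matrix: Psi[a, b] = exp(sum_k lambda_k f_k(y_t = i_a, y_{t-1} = j_b, x_t)).
import numpy as np

LABELS = ["B", "I", "O"]

def features(y_t, y_prev, x_t):
    return np.array([
        1.0 if y_t == "B" and x_t[0].isupper() else 0.0,
        1.0 if y_prev == "B" and y_t == "I" else 0.0,
        1.0 if y_t == "O" else 0.0,
    ])

def psi_matrix(x_t, lam):
    K = len(LABELS)
    Psi = np.zeros((K, K))
    for a, i in enumerate(LABELS):
        for b, j in enumerate(LABELS):
            Psi[a, b] = np.exp(lam @ features(i, j, x_t))
    return Psi
```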

  20. Inference algorithms

     Forward procedure
     - The forward variable α_t(i) collects the unnormalized probability of output y_t = i and the sequence of inputs {x_1, ..., x_t}:

         \alpha_t(i) \propto p(x_1, \ldots, x_t, y_t = i)

     - As for HMMs, it is computed recursively:

         \alpha_t(i) = \sum_{j \in S} \Psi_t(i, j, x_t) \, \alpha_{t-1}(j)

       where S is the set of possible values of the output variable.
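
A compact sketch of this forward recursion, assuming a hypothetical `psi(i, j, x_t)` potential (such as an entry of the matrix sketch above) and a dummy START label for the first position; summing the final α over all labels yields Z(x). In practice the recursion is usually carried out in log space to avoid numerical underflow.

```python
# Forward algorithm: alpha_t(i) = sum_{j in S} psi(i, j, x_t) * alpha_{t-1}(j).
def forward(x, labels, psi, start="START"):
    # alpha_1(i): potentials from the dummy start state to the first output
    alpha = {i: psi(i, start, x[0]) for i in labels}
    for x_t in x[1:]:
        alpha = {i: sum(psi(i, j, x_t) * alpha[j] for j in labels) for i in labels}
    Z = sum(alpha.values())  # partition function Z(x)
    return alpha, Z
```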
