Structured Models in Computer Vision
Part 4: Conditional Random Fields
Sebastian Nowozin and Christoph H. Lampert
Colorado Springs, 25th June 2011
Problem (Probabilistic Learning)
Let d(y|x) be the (unknown) true conditional distribution. Let D = {(x^1, y^1), ..., (x^N, y^N)} be i.i.d. samples from d(x, y).
◮ Find a distribution p(y|x) that we can use as a proxy for d(y|x), or
◮ given a parametrized family of distributions, p(y|x, w), find the parameter w* making p(y|x, w) closest to d(y|x).
Open questions:
◮ What do we mean by "closest"?
◮ What is a good candidate for p(y|x, w)?
◮ How do we actually find w*,
  ◮ conceptually, and
  ◮ numerically?
Principle of Parsimony (Parsimony, a.k.a. Occam's razor)
"Pluralitas non est ponenda sine necessitate." – William of Ockham
"We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." – Isaac Newton
"Make everything as simple as possible, but not simpler." – Albert Einstein
"Use the simplest explanation that covers all the facts." – what we'll use
◮ 1) Define what aspects we consider relevant facts about the data.
◮ 2) Pick the simplest distribution reflecting them.

Definition (Simplicity ≡ Entropy)
The simplicity of a distribution p is given by its entropy (a small numerical sketch follows below):

    H(p) = - \sum_{z \in Z} p(z) \log p(z)

Definition (Relevant Facts ≡ Feature Functions)
By φ_i : Z → R for i = 1, ..., D we denote a set of feature functions that express everything we want to be able to model about our data. For example:
◮ the gray value of a pixel,
◮ a bag-of-words histogram of an image,
◮ the time of day an image was taken,
◮ a flag indicating whether a pixel is darker than half of its neighbors.
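To make the entropy definition concrete, here is a minimal numerical sketch (plain NumPy, not from the slides) that evaluates H(p) for a distribution over a finite set Z:

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = -sum_z p(z) log p(z) of a discrete distribution.

    p is a 1-D array of probabilities over a finite set Z;
    zero-probability entries contribute nothing to the sum.
    """
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log(4) ~ 1.386
print(entropy([0.7, 0.1, 0.1, 0.1]))      # more "committed": smaller entropy
```

As expected, the uniform distribution attains the largest entropy, which matches the "as simple as possible" reading used below.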
Principle (Maximum Entropy Principle)
Let z^1, ..., z^N be samples from a distribution d(z). Let φ_1, ..., φ_D be feature functions, and denote by μ_i := (1/N) Σ_n φ_i(z^n) their means over the sample set. The maximum entropy distribution p is the solution to

    \max_{p \text{ a prob. distr.}} H(p)   subject to   E_{z \sim p(z)}[\phi_i(z)] = \mu_i

(maximize the entropy: be as simple as possible; match the feature means: be faithful to what we know).

Theorem (Exponential Family Distribution)
Under some very reasonable conditions, the maximum entropy distribution has the form

    p(z) = \frac{1}{Z} \exp\Big( \sum_i w_i \phi_i(z) \Big)

for some parameter vector w = (w_1, ..., w_D) and normalizing constant Z.
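The theorem only fixes the functional form of p. The following sketch, under the assumption that Z is small enough to enumerate and that the feature values are stored row-wise in a matrix Phi (both names are illustrative), evaluates that form and the resulting feature expectations E_{z∼p}[φ_i(z)]:

```python
import numpy as np

def expfam_distribution(Phi, w):
    """Exponential family distribution over a finite set Z.

    Phi: array of shape (|Z|, D), row z holds (phi_1(z), ..., phi_D(z)).
    w:   parameter vector of length D.
    Returns p with p[z] = exp(<w, Phi[z]>) / Z.
    """
    scores = Phi @ w
    scores -= scores.max()        # numerical stability; cancels in the ratio
    p = np.exp(scores)
    return p / p.sum()

# Illustrative features over a 4-element Z; p @ Phi gives the feature
# expectations, which equal the sample means mu at the max-entropy solution.
Phi = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
w = np.array([0.5, -1.0])
p = expfam_distribution(Phi, w)
print(p, p @ Phi)
```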
Example:
◮ Let Z = R, φ_1(z) = z, φ_2(z) = z².
◮ The exponential family distribution is

    p(z) = \frac{1}{Z(w)} \exp( w_1 z + w_2 z^2 )
         = \frac{1}{Z(a,b)} \exp( -b (z - a)^2 )    for a = -w_1 / (2 w_2),  b = -w_2.

  It's a Gaussian!
◮ Given examples z^1, ..., z^N, we can compute a and b, and derive w (see the sketch below).
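For instance, with the identities a = -w_1/(2 w_2) and b = -w_2 from above, the natural parameters follow directly from the empirical mean and variance. A small sketch, assuming the samples are held in a NumPy array z:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=0.5, size=10_000)   # samples z^1, ..., z^N (illustrative)

# Moment matching: the empirical mean and variance determine the Gaussian.
mu = z.mean()              # a, the mean
var = z.var()              # variance = -1 / (2 w_2)

# Natural parameters of the exponential family form exp(w_1 z + w_2 z^2):
w2 = -1.0 / (2.0 * var)    # must be negative for a proper distribution
w1 = mu / var              # since a = -w_1 / (2 w_2)

print(w1, w2)              # ~ 8.0 and -2.0 for mean 2 and std 0.5
```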
Example:
◮ Let Z = {1, ..., K}, φ_k(z) = [z = k] for k = 1, ..., K.
◮ The exponential family distribution is

    p(z) = \frac{1}{Z(w)} \exp\Big( \sum_k w_k \phi_k(z) \Big)
         = \begin{cases} \exp(w_1)/Z & \text{for } z = 1, \\ \;\;\vdots & \\ \exp(w_K)/Z & \text{for } z = K, \end{cases}

  with Z = exp(w_1) + ... + exp(w_K). It's a Multinomial!
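For this multinomial case, one valid parameter choice is w_k = log of the empirical class frequency; any common additive shift gives the same distribution because Z absorbs it. A minimal sketch with purely illustrative data:

```python
import numpy as np

z = np.array([1, 3, 3, 2, 3, 1, 2, 3])   # samples from Z = {1, ..., K} (illustrative)
K = 3

counts = np.bincount(z, minlength=K + 1)[1:]   # class counts for k = 1..K
mu = counts / counts.sum()                     # empirical means of phi_k

# One valid parameter choice (assumes every class was observed at least once).
w = np.log(mu)

# Recover the distribution from the exponential family form.
p = np.exp(w) / np.exp(w).sum()
print(p)        # equals mu
```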
Example:
◮ Let Z = {0, 1}^{N×M} (an image grid), φ_i(y) := y_i for each pixel i, and φ_{NM+1}(y) := Σ_{i∼j} [y_i ≠ y_j], summing over all 4-neighbor pairs.
◮ The exponential family distribution is

    p(y) = \frac{1}{Z(w)} \exp\Big( \langle w, \phi(y) \rangle + \tilde{w} \sum_{i \sim j} [y_i \neq y_j] \Big)

  with w̃ the weight of the pairwise feature. It's a (binary) Markov Random Field!
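For such grid models the normalizer Z(w) is intractable to compute exactly at realistic image sizes, but the exponent itself is cheap to evaluate. A sketch of that evaluation, with hypothetical per-pixel weights w and a single pairwise weight w_tilde:

```python
import numpy as np

def mrf_score(y, w, w_tilde):
    """Exponent <w, phi(y)> + w_tilde * sum_{i~j} [y_i != y_j] of the binary MRF.

    y:        binary image of shape (N, M) with entries in {0, 1}
    w:        per-pixel unary weights, same shape as y
    w_tilde:  single weight on 4-neighbor disagreements
    """
    unary = np.sum(w * y)
    # Count disagreeing 4-neighbor pairs (horizontal + vertical edges).
    disagree = np.sum(y[:, :-1] != y[:, 1:]) + np.sum(y[:-1, :] != y[1:, :])
    return unary + w_tilde * disagree

y = (np.random.default_rng(0).random((4, 5)) > 0.5).astype(int)
print(mrf_score(y, w=np.zeros((4, 5)), w_tilde=-1.0))  # negative w_tilde favors smooth labelings
```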
Conditional Random Field Learning
Assume:
◮ a set of i.i.d. samples D = {(x^n, y^n)}_{n=1,...,N}, with (x^n, y^n) ∼ d(x, y),
◮ feature functions φ(x, y) := (φ_1(x, y), ..., φ_D(x, y)),
◮ a parametrized family p(y|x, w) = (1/Z(x, w)) exp(⟨w, φ(x, y)⟩) (a small evaluation sketch follows below).
Task:
◮ adjust w of p(y|x, w) based on D.
Many possible techniques to do so:
◮ Expectation Matching
◮ Maximum Likelihood
◮ Best Approximation
◮ MAP estimation of w
Punchline: they all turn out to be (almost) the same!
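Referenced from the model bullet above: a minimal sketch of the parametrized family p(y|x, w), with Z(x, w) computed by brute-force enumeration over a small, purely illustrative label set Y (real structured output spaces are far too large for enumeration, which is why dedicated inference is needed):

```python
import numpy as np

def crf_conditional(phi, x, labels, w):
    """p(y|x, w) = exp(<w, phi(x, y)>) / Z(x, w) for a small, enumerable Y.

    phi:    joint feature function, phi(x, y) -> array of length D
    labels: list of all labels y in Y (feasible only for tiny Y)
    w:      parameter vector of length D
    """
    scores = np.array([w @ phi(x, y) for y in labels])
    scores -= scores.max()        # stabilize the exponentials
    p = np.exp(scores)
    return p / p.sum()            # normalization by Z(x, w)

# Toy example: binary label y, two hypothetical features.
phi = lambda x, y: np.array([y * x, (1 - y) * x])
print(crf_conditional(phi, x=1.5, labels=[0, 1], w=np.array([2.0, -1.0])))
```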
Maximum Likelihood Parameter Estimation
Idea: maximize the conditional likelihood of observing outputs y^1, ..., y^N for inputs x^1, ..., x^N:

    w^* = \operatorname{argmax}_{w \in R^D} p(y^1, ..., y^N | x^1, ..., x^N, w)
        = \operatorname{argmax}_{w \in R^D} \prod_{n=1}^{N} p(y^n | x^n, w)              (i.i.d.)
        = \operatorname{argmin}_{w \in R^D} -\sum_{n=1}^{N} \log p(y^n | x^n, w)         (apply -log(·))

The last expression is the negative conditional log-likelihood (of D).
Best Approximation
Idea: find the p(y|x, w) that is closest to d(y|x).

Definition (Similarity between conditional distributions)
For fixed x ∈ X, compare the two conditionals by the KL divergence:

    KL_cond(p||d)(x) := \sum_{y \in Y} d(y|x) \log \frac{d(y|x)}{p(y|x,w)}

For x ∼ d(x), compute the expectation:

    KL_tot(p||d) := E_{x \sim d(x)}[ KL_cond(p||d)(x) ]
                 = \sum_{x \in X} \sum_{y \in Y} d(x,y) \log \frac{d(y|x)}{p(y|x,w)}
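A small numerical sketch of both quantities for finite X and Y, assuming d(x, y) is given as a joint probability table and p(y|x, w) as a row-stochastic array (all names and numbers illustrative):

```python
import numpy as np

def kl_cond(d_y_given_x, p_y_given_x):
    """KL_cond(p||d)(x) = sum_y d(y|x) log( d(y|x) / p(y|x,w) ) for one fixed x."""
    d = np.asarray(d_y_given_x, float)
    p = np.asarray(p_y_given_x, float)
    nz = d > 0
    return np.sum(d[nz] * np.log(d[nz] / p[nz]))

def kl_tot(d_xy, p_y_given_x):
    """E_{x ~ d(x)} [ KL_cond(p||d)(x) ], with d_xy[x, y] = d(x, y)."""
    d_x = d_xy.sum(axis=1)                      # marginal d(x)
    d_y_given_x = d_xy / d_x[:, None]           # conditional d(y|x)
    return sum(d_x[x] * kl_cond(d_y_given_x[x], p_y_given_x[x])
               for x in range(d_xy.shape[0]))

d_xy = np.array([[0.2, 0.1], [0.3, 0.4]])       # toy joint distribution d(x, y)
p = np.array([[0.5, 0.5], [0.4, 0.6]])          # model conditionals p(y|x, w)
print(kl_tot(d_xy, p))
```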
Best Approximation
Idea: find the p(y|x, w) of minimal KL_tot-distance to d(y|x):

    w^* = \operatorname{argmin}_{w \in R^D} \sum_{x \in X} \sum_{y \in Y} d(x,y) \log \frac{d(y|x)}{p(y|x,w)}
        = \operatorname{argmin}_{w \in R^D} -\sum_{(x,y) \in X \times Y} d(x,y) \log p(y|x,w)       (drop terms constant in w)
        ≈ \operatorname{argmin}_{w \in R^D} -\sum_{n=1}^{N} \log p(y^n|x^n, w)                       (using (x^n, y^n) ∼ d(x,y))

Again the negative conditional log-likelihood (of D).
MAP Estimation of w
Idea: treat w as a random variable and maximize the posterior probability p(w|D):

    p(w|D) = \frac{p(x^1, y^1, ..., x^N, y^N | w) \, p(w)}{p(D)}        (Bayes' rule)
           = p(w) \prod_{n=1}^{N} \frac{p(y^n|x^n, w)}{p(y^n|x^n)}      (i.i.d.)

p(w): prior belief on w (cannot be estimated from data).

    w^* = \operatorname{argmax}_{w \in R^D} p(w|D) = \operatorname{argmin}_{w \in R^D} [ -\log p(w|D) ]
        = \operatorname{argmin}_{w \in R^D} [ -\log p(w) - \sum_{n=1}^{N} \log p(y^n|x^n, w) + \sum_{n=1}^{N} \log p(y^n|x^n) ]
          (the last sum is independent of w)
        = \operatorname{argmin}_{w \in R^D} [ -\log p(w) - \sum_{n=1}^{N} \log p(y^n|x^n, w) ]
    w^* = \operatorname{argmin}_{w \in R^D} [ -\log p(w) - \sum_{n=1}^{N} \log p(y^n|x^n, w) ]

Choices for p(w):
◮ p(w) :≡ const. (uniform; in R^D not really a distribution)

    w^* = \operatorname{argmin}_{w \in R^D} [ -\sum_{n=1}^{N} \log p(y^n|x^n, w) + const. ]

  → the negative conditional log-likelihood
◮ p(w) := const. · exp( -\frac{1}{2\sigma^2} \|w\|^2 ) (Gaussian)

    w^* = \operatorname{argmin}_{w \in R^D} [ \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^{N} \log p(y^n|x^n, w) + const. ]

  → the regularized negative conditional log-likelihood
Probabilistic Models for Structured Prediction – Summary
Negative (regularized) conditional log-likelihood (of D):

    L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^{N} \Big[ \langle w, \phi(x^n, y^n) \rangle - \log \sum_{y \in Y} e^{\langle w, \phi(x^n, y) \rangle} \Big]

(σ² → ∞ makes it unregularized.)
Probabilistic parameter estimation or training means solving

    w^* = \operatorname{argmin}_{w \in R^D} L(w).

This is the same optimization problem as for multi-class logistic regression.
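To make that connection explicit, here is a minimal sketch of L(w) and its gradient for the multi-class logistic regression special case, i.e. φ(x, y) places a copy of x in the block belonging to class y and zeros elsewhere; the data, step size, and helper name are purely illustrative:

```python
import numpy as np

def neg_cond_log_likelihood(W, X, y, sigma2):
    """Regularized negative conditional log-likelihood L(W) and its gradient
    for multi-class logistic regression (one weight vector per class).

    W: (K, D) weights, X: (N, D) inputs, y: (N,) labels in {0, ..., K-1}.
    """
    scores = X @ W.T                                 # <w, phi(x^n, y)> for all y
    scores -= scores.max(axis=1, keepdims=True)      # stabilize the log-sum-exp
    log_Z = np.log(np.exp(scores).sum(axis=1))       # log Z(x^n, w)
    N = X.shape[0]

    L = np.sum(W * W) / (2 * sigma2) \
        - np.sum(scores[np.arange(N), y] - log_Z)

    # Gradient: W/sigma^2 - sum_n ( phi(x^n, y^n) - E_{y ~ p(y|x^n,w)} phi(x^n, y) )
    P = np.exp(scores - log_Z[:, None])              # p(y|x^n, w), shape (N, K)
    Y = np.zeros_like(P)
    Y[np.arange(N), y] = 1.0
    grad = W / sigma2 - (Y - P).T @ X
    return L, grad

# Tiny gradient-descent fit on random data (illustration only).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.integers(0, 4, size=50)
W = np.zeros((4, 3))
for _ in range(200):
    L, g = neg_cond_log_likelihood(W, X, y, sigma2=10.0)
    W -= 0.05 * g
print(L)
```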