Part 4: Conditional Random Fields

  1. Part 4: Conditional Random Fields
     Sebastian Nowozin and Christoph H. Lampert
     Structured Models in Computer Vision
     Colorado Springs, 25th June 2011

  2. Problem (Probabilistic Learning)
     Let d(y|x) be the (unknown) true conditional distribution. Let D = {(x^1, y^1), ..., (x^N, y^N)} be i.i.d. samples from d(x, y).
     - Find a distribution p(y|x) that we can use as a proxy for d(y|x), or
     - given a parametrized family of distributions p(y|x, w), find the parameter w* making p(y|x, w) closest to d(y|x).
     Open questions:
     - What do we mean by "closest"?
     - What is a good candidate for p(y|x, w)?
     - How do we actually find w*, conceptually and numerically?

  3. Principle of Parsimony (a.k.a. Occam's razor)
     "Pluralitas non est ponenda sine necessitate." (William of Ockham)
     "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." (Isaac Newton)
     "Make everything as simple as possible, but not simpler." (Albert Einstein)
     "Use the simplest explanation that covers all the facts." (what we'll use)

  4. 1) Define what aspects we consider relevant facts about the data.
     2) Pick the simplest distribution reflecting that.

     Definition (Simplicity ≡ Entropy). The simplicity of a distribution p is given by its entropy:
         H(p) = -\sum_{z \in \mathcal{Z}} p(z) \log p(z)

     Definition (Relevant Facts ≡ Feature Functions). By φ_i : Z → R, for i = 1, ..., D, we denote a set of feature functions that express everything we want to be able to model about our data. For example:
     - the gray value of a pixel,
     - a bag-of-words histogram of an image,
     - the time of day an image was taken,
     - a flag indicating whether a pixel is darker than half of its neighbors.
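A tiny numeric sketch (not from the slides) of the entropy definition above, on made-up discrete distributions:

```python
import numpy as np

# Entropy H(p) = -sum_z p(z) log p(z) of a discrete distribution,
# illustrating that more "spread out" distributions are less simple.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # 0 * log 0 is taken to be 0
    return float(-np.sum(p * np.log(p)))

print(entropy([1.0, 0.0, 0.0, 0.0]))          # 0.0       (maximally simple)
print(entropy([0.7, 0.1, 0.1, 0.1]))          # ~0.94
print(entropy([0.25, 0.25, 0.25, 0.25]))      # log 4 ~ 1.39 (maximum entropy)
```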

  5. Principle (Maximum Entropy Principle)
     Let z^1, ..., z^N be samples from a distribution d(z). Let φ_1, ..., φ_D be feature functions, and denote by μ_i := (1/N) \sum_{n=1}^N φ_i(z^n) their means over the sample set. The maximum entropy distribution p is the solution to
         \max_{p \text{ a prob. distr.}} H(p)                    (be as simple as possible)
         subject to  E_{z \sim p(z)}[\phi_i(z)] = \mu_i          (be faithful to what we know)

     Theorem (Exponential Family Distribution)
     Under some very reasonable conditions, the maximum entropy distribution has the form
         p(z) = \frac{1}{Z} \exp\Big( \sum_i w_i \phi_i(z) \Big)
     for some parameter vector w = (w_1, ..., w_D) and constant Z.
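A minimal sketch (not from the slides) of this principle on a toy finite Z with two made-up feature functions: gradient ascent on the concave dual drives the model feature expectations towards the empirical means, and the resulting distribution has exactly the exponential-family form stated in the theorem.

```python
import numpy as np

Z = np.arange(6)                                   # toy sample space {0, ..., 5}
phi = np.stack([Z / 5.0, (Z / 5.0) ** 2], axis=1)  # two toy feature functions

samples = np.array([1, 2, 2, 3, 3, 3, 4])          # toy data drawn from some d(z)
mu = phi[samples].mean(axis=0)                     # empirical feature means mu_i

w = np.zeros(2)
for _ in range(20000):
    scores = phi @ w
    p = np.exp(scores - scores.max())
    p /= p.sum()                                   # p_w(z) = exp(<w, phi(z)>) / Z(w)
    w += 0.5 * (mu - phi.T @ p)                    # move model means towards mu

print(np.round(p, 3))                              # maximum-entropy distribution on Z
print(phi.T @ p, mu)                               # model feature means ~ empirical means
```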

  6. Example:
     - Let Z = R, φ_1(z) = z, φ_2(z) = z².
     - The exponential family distribution is
           p(z) = \frac{1}{Z(w)} \exp( w_1 z + w_2 z^2 )
                = \frac{1}{Z(a, b)} \exp\big( a (z - b)^2 \big)    for a = w_2, b = -\frac{w_1}{2 w_2}
       (with w_2 < 0 so the density is normalizable). It's a Gaussian!
     - Given examples z^1, ..., z^N, we can compute a and b, and derive w.
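A small sketch (not from the slides) connecting the natural parameters w = (w_1, w_2) of this exponential family to the familiar Gaussian mean and variance; the sample data is made up:

```python
import numpy as np

# For phi_1(z) = z, phi_2(z) = z^2 the log-density is
#   w_1 z + w_2 z^2 + const = -(z - mean)^2 / (2 var) + const,
# so w_2 = -1 / (2 var) and w_1 = mean / var.
z = np.random.normal(loc=2.0, scale=1.5, size=10_000)   # toy samples

mean, var = z.mean(), z.var()
w2 = -1.0 / (2.0 * var)
w1 = mean / var

print(w1, w2)                            # natural parameters of the fitted Gaussian
print(-w1 / (2 * w2), -1.0 / (2 * w2))   # recover mean and variance back from w
```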

  7. Example:
     - Let Z = {1, ..., K}, φ_k(z) = [z = k] for k = 1, ..., K.
     - The exponential family distribution is
           p(z) = \frac{1}{Z(w)} \exp\Big( \sum_k w_k \phi_k(z) \Big) = \frac{\exp(w_z)}{Z}   for z = 1, ..., K,
       with Z = \exp(w_1) + \cdots + \exp(w_K). It's a Multinomial!
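A minimal sketch (not from the slides) of this case: with indicator features, the exponential family is just the softmax over K states.

```python
import numpy as np

# p(z = k) = exp(w_k) / Z with Z = sum_j exp(w_j): the softmax function.
def multinomial_from_w(w):
    e = np.exp(w - w.max())      # subtract the max for numerical stability
    return e / e.sum()

w = np.array([0.5, -1.0, 2.0])   # toy parameter vector, K = 3
p = multinomial_from_w(w)
print(p, p.sum())                # a valid probability vector, sums to 1
```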

  8. Example:
     - Let Z = {0, 1}^{N×M} be binary labelings of an N×M image grid, with a feature φ_i(y) := y_i for each pixel i, plus one pairwise feature \sum_{i \sim j} [y_i \neq y_j] summing over all 4-neighbor pairs.
     - The exponential family distribution is
           p(y) = \frac{1}{Z(w)} \exp\Big( \langle w, \phi(y) \rangle + \tilde{w} \sum_{i \sim j} [y_i \neq y_j] \Big)
       It's a (binary) Markov Random Field!
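A toy sketch (not from the slides) of such a binary MRF on a 2x2 grid; the weights are made up, and the partition function Z(w) is computed by brute-force enumeration, which is only possible at this toy size:

```python
import itertools
import numpy as np

H, W = 2, 2
w_unary = np.full(H * W, 0.3)   # weight on each per-pixel feature y_i
w_pair = -1.0                   # weight on the pairwise disagreement feature

# all 4-neighbor pairs of the H x W grid
edges = [((r, c), (r, c + 1)) for r in range(H) for c in range(W - 1)] + \
        [((r, c), (r + 1, c)) for r in range(H - 1) for c in range(W)]

def score(y_flat):              # <w, phi(y)> + w_pair * sum_{i~j} [y_i != y_j]
    y = np.array(y_flat).reshape(H, W)
    unary = w_unary @ y.ravel()
    disagreements = sum(int(y[a] != y[b]) for a, b in edges)
    return unary + w_pair * disagreements

states = list(itertools.product([0, 1], repeat=H * W))
Z = sum(np.exp(score(y)) for y in states)        # brute-force partition function
p = {y: np.exp(score(y)) / Z for y in states}    # p(y) for every labeling
print(sum(p.values()))                           # sanity check: sums to 1
```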

  9. Conditional Random Field Learning
     Assume:
     - a set of i.i.d. samples D = {(x^n, y^n)}_{n=1,...,N}, with (x^n, y^n) ~ d(x, y),
     - feature functions φ(x, y) := (φ_1(x, y), ..., φ_D(x, y)),
     - a parametrized family p(y|x, w) = \frac{1}{Z(x, w)} \exp( \langle w, \phi(x, y) \rangle ).
     Task: adjust w of p(y|x, w) based on D.
     Many possible techniques to do so:
     - Expectation Matching
     - Maximum Likelihood
     - Best Approximation
     - MAP estimation of w
     Punchline: they all turn out to be (almost) the same!

  10. Maximum Likelihood Parameter Estimation
      Idea: maximize the conditional likelihood of observing outputs y^1, ..., y^N for inputs x^1, ..., x^N:
          w^* = \operatorname*{argmax}_{w \in \mathbb{R}^D} p(y^1, ..., y^N \mid x^1, ..., x^N, w)
              = \operatorname*{argmax}_{w \in \mathbb{R}^D} \prod_{n=1}^N p(y^n \mid x^n, w)           (i.i.d.)
              = \operatorname*{argmin}_{w \in \mathbb{R}^D} -\sum_{n=1}^N \log p(y^n \mid x^n, w)      (apply -log(·))
      The last expression is the negative conditional log-likelihood (of D).
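A sketch (not from the slides) of this objective for a model p(y|x, w) = exp(<w, φ(x, y)>) / Z(x, w); here phi and label_set are hypothetical stand-ins, and Z(x, w) is computed by enumerating the label set, which only works when Y is small:

```python
import numpy as np

def neg_cond_log_likelihood(w, data, phi, label_set):
    """data: list of (x, y) pairs; phi(x, y) returns a feature vector in R^D."""
    nll = 0.0
    for x, y in data:
        scores = np.array([w @ phi(x, y_) for y_ in label_set])
        log_Z = np.logaddexp.reduce(scores)      # log Z(x, w), computed stably
        nll -= w @ phi(x, y) - log_Z             # accumulate -log p(y | x, w)
    return nll
```

For structured output spaces Y is exponentially large, so in practice the log-partition function has to come from the factorization of the model rather than from enumeration.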

  11. Best Approximation
      Idea: find p(y|x, w) that is closest to d(y|x).
      Definition (Similarity between conditional distributions). For fixed x ∈ X, the KL divergence measures similarity:
          KL_{cond}(p \,\|\, d)(x) := \sum_{y \in \mathcal{Y}} d(y|x) \log \frac{d(y|x)}{p(y|x, w)}
      For x ~ d(x), compute the expectation:
          KL_{tot}(p \,\|\, d) := E_{x \sim d(x)}\big[ KL_{cond}(p \,\|\, d)(x) \big]
                               = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} d(x, y) \log \frac{d(y|x)}{p(y|x, w)}
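A small numeric sketch (not from the slides) of the conditional KL term for one fixed x, with made-up distributions over a two-element Y:

```python
import numpy as np

# KL_cond(x) = sum_y d(y|x) log( d(y|x) / p(y|x,w) ) for discrete distributions.
def kl(d, p):
    d, p = np.asarray(d, float), np.asarray(p, float)
    mask = d > 0                      # terms with d(y|x) = 0 contribute nothing
    return float(np.sum(d[mask] * np.log(d[mask] / p[mask])))

print(kl([0.5, 0.5], [0.5, 0.5]))     # 0.0: the model matches d exactly
print(kl([0.9, 0.1], [0.5, 0.5]))     # > 0: price paid for the mismatch
```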

  12. Best Approximation
      Idea: find p(y|x, w) of minimal KL_tot-distance to d(y|x):
          w^* = \operatorname*{argmin}_{w \in \mathbb{R}^D} \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} d(x, y) \log \frac{d(y|x)}{p(y|x, w)}
              = \operatorname*{argmin}_{w \in \mathbb{R}^D} -\sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} d(x, y) \log p(y|x, w)      (drop terms constant in w)
              \approx \operatorname*{argmin}_{w \in \mathbb{R}^D} -\sum_{n=1}^N \log p(y^n \mid x^n, w)                                   ((x^n, y^n) ~ d(x, y))
      Again the negative conditional log-likelihood (of D).

  13. MAP Estimation of w
      Idea: treat w as a random variable; maximize the posterior probability p(w|D):
          p(w|D) = \frac{p(x^1, y^1, ..., x^N, y^N \mid w) \, p(w)}{p(D)}                    (Bayes)
                 = p(w) \prod_{n=1}^N \frac{p(y^n \mid x^n, w)}{p(y^n \mid x^n)}             (i.i.d.)
      p(w): prior belief on w (cannot be estimated from data).
          w^* = \operatorname*{argmax}_{w \in \mathbb{R}^D} p(w|D) = \operatorname*{argmin}_{w \in \mathbb{R}^D} \big[ -\log p(w|D) \big]
              = \operatorname*{argmin}_{w \in \mathbb{R}^D} \Big[ -\log p(w) - \sum_{n=1}^N \log p(y^n \mid x^n, w) + \sum_{n=1}^N \log p(y^n \mid x^n) \Big]     (last sum indep. of w)
              = \operatorname*{argmin}_{w \in \mathbb{R}^D} \Big[ -\log p(w) - \sum_{n=1}^N \log p(y^n \mid x^n, w) \Big]

  14. w^* = \operatorname*{argmin}_{w \in \mathbb{R}^D} \Big[ -\log p(w) - \sum_{n=1}^N \log p(y^n \mid x^n, w) \Big]
      Choices for p(w):
      - p(w) :≡ const.   (uniform; in R^D not really a distribution)
            w^* = \operatorname*{argmin}_{w \in \mathbb{R}^D} -\sum_{n=1}^N \log p(y^n \mid x^n, w) + const.
        the negative conditional log-likelihood.
      - p(w) := const. \cdot e^{-\frac{1}{2\sigma^2} \|w\|^2}   (Gaussian)
            w^* = \operatorname*{argmin}_{w \in \mathbb{R}^D} \frac{1}{2\sigma^2} \|w\|^2 - \sum_{n=1}^N \log p(y^n \mid x^n, w) + const.
        the regularized negative conditional log-likelihood.

  15. Probabilistic Models for Structured Prediction: Summary
      Negative (Regularized) Conditional Log-Likelihood (of D):
          L(w) = \frac{1}{2\sigma^2} \|w\|^2 - \sum_{n=1}^N \Big[ \langle w, \phi(x^n, y^n) \rangle - \log \sum_{y \in \mathcal{Y}} e^{\langle w, \phi(x^n, y) \rangle} \Big]
      (σ² → ∞ makes it unregularized.)
      Probabilistic parameter estimation or training means solving
          w^* = \operatorname*{argmin}_{w \in \mathbb{R}^D} L(w).
      This is the same optimization problem as for multi-class logistic regression.
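A sketch (not from the slides) of L(w) and its gradient in the same toy setting as before: phi and label_set are hypothetical stand-ins, and the label set is assumed small enough to enumerate.

```python
import numpy as np

def objective_and_grad(w, data, phi, label_set, sigma2=1.0):
    """Regularized negative conditional log-likelihood L(w) and its gradient."""
    L = (w @ w) / (2.0 * sigma2)
    grad = w / sigma2
    for x, y in data:
        feats = np.stack([phi(x, y_) for y_ in label_set])   # |Y| x D matrix
        scores = feats @ w
        log_Z = np.logaddexp.reduce(scores)                   # log Z(x, w)
        probs = np.exp(scores - log_Z)                        # p(y' | x, w)
        L -= w @ phi(x, y) - log_Z
        grad -= phi(x, y) - probs @ feats                     # phi(x,y) - E_p[phi(x,.)]
    return L, grad
```

Per example the gradient is φ(x^n, y^n) minus the model expectation E_{p(y|x^n, w)}[φ(x^n, y)] (plus the regularizer), so at a stationary point the model's feature expectations match the empirical ones: this is the expectation-matching view listed among the techniques on slide 9.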
