CSE574 - Administrivia • No class on Fri 01/25 (Ski Day)
Last Wednesday
• HMMs
  – Most likely individual state at time t (forward)
  – Most likely sequence of states (Viterbi)
  – Learning using EM
• Generative vs. Discriminative Learning
  – Model p(y,x) vs. p(y|x)
  – p(y|x): don't bother about p(x) if we only want to do classification
Today
• Markov Networks
  – Inference: Gibbs sampling / MCMC, belief propagation
  – Learning weights and structure
• CRFs
  – Linear-chain CRFs: forward/backward and Viterbi inference
  – Learning by maximizing conditional likelihood
[Figure by Sutton & McCallum: generative directed models and their conditional counterparts — Naïve Bayes vs. Logistic Regression (single class), HMMs vs. linear-chain CRFs (sequence), general directed models vs. general CRFs (general graphs).]
Graphical Models
• Family of probability distributions that factorize in a certain way
• Directed (Bayes Nets): a node is independent of its non-descendants given its parents
  p(x) = ∏_{i=1}^{K} p(x_i | Parents(x_i)),   x = (x_1, x_2, ..., x_K)
• Undirected (Markov Random Field): a node is independent of all other nodes given its neighbors
  p(x) = (1/Z) ∏_C Ψ_C(x_C),   C ⊆ {x_1, ..., x_K} a clique,   Ψ_C a potential function
• Factor Graphs
  p(x) = (1/Z) ∏_A Ψ_A(x_A),   A ⊆ {x_1, ..., x_K},   Ψ_A a factor function
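A minimal sketch (not from the slides; all numbers are invented) contrasting the two factorizations on three binary variables: a directed chain whose conditional probability tables already make the joint sum to 1, and an undirected chain of arbitrary positive potentials that must be divided by the partition function Z.

```python
import itertools

# Directed chain x1 -> x2 -> x3 with conditional probability tables.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p(x2 | x1)
p_x3_given_x2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}   # p(x3 | x2)

def p_directed(x1, x2, x3):
    # Product of conditionals: p(x) = prod_i p(x_i | Parents(x_i))
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Undirected chain: arbitrary positive potentials on cliques {x1,x2} and {x2,x3}.
psi_12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 3.0}
psi_23 = {(0, 0): 1.5, (0, 1): 0.2, (1, 0): 0.7, (1, 1): 4.0}

def unnormalized(x1, x2, x3):
    # Product of clique potentials: p(x) proportional to prod_C Psi_C(x_C)
    return psi_12[(x1, x2)] * psi_23[(x2, x3)]

states = list(itertools.product([0, 1], repeat=3))
print(sum(p_directed(*x) for x in states))         # 1.0 -- no normalization needed
Z = sum(unnormalized(*x) for x in states)          # partition function
print(sum(unnormalized(*x) / Z for x in states))   # 1.0 -- only after dividing by Z
```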
Markov Networks
• Undirected graphical models (example: nodes A, B, C, D with cliques {A,B} and {B,C,D})
• Potential functions defined over cliques
  P(X) = (1/Z) ∏_c Φ_c(X_c),   Z = ∑_X ∏_c Φ_c(X_c)
  Φ(A,B) = 3.7 if A and B; 2.1 if A and ¬B; 0.7 otherwise
  Φ(B,C,D) = 2.3 if B and C and D; 5.1 otherwise
Slide by Domingos
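A small sketch (not from the slides) that enumerates all assignments of A, B, C, D, multiplies the two clique potentials above, and normalizes by Z; the "A and ¬B" reading of the second case of Φ(A,B) is our assumption about the original slide.

```python
import itertools

def phi_AB(a, b):
    # Clique potential over {A, B}; the "A and not B" branch is our reading of the slide.
    if a and b:
        return 3.7
    if a and not b:
        return 2.1
    return 0.7

def phi_BCD(b, c, d):
    # Clique potential over {B, C, D}.
    return 2.3 if (b and c and d) else 5.1

assignments = list(itertools.product([False, True], repeat=4))
Z = sum(phi_AB(a, b) * phi_BCD(b, c, d) for a, b, c, d in assignments)

def P(a, b, c, d):
    return phi_AB(a, b) * phi_BCD(b, c, d) / Z

print(Z)
print(P(True, True, True, True))
print(sum(P(*x) for x in assignments))   # sanity check: sums to 1
```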
Markov Networks
• Undirected graphical models (same example: A, B, C, D)
• Potential functions rewritten as weighted features
  P(X) = (1/Z) exp( ∑_i w_i f_i(X) ),   Z = ∑_X exp( ∑_i w_i f_i(X) )
  w_i: weight of feature i,   f_i: feature i
  f(A,B) = 1 if A and B, 0 otherwise
  f(B,C,D) = 1 if B and C and D, 0 otherwise
Slide by Domingos
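A short sketch of the log-linear form (the weights below are arbitrary, invented for illustration): with indicator features, each weight w_i multiplies the unnormalized score by e^{w_i} exactly when its feature fires.

```python
import itertools, math

features = [
    lambda a, b, c, d: 1.0 if (a and b) else 0.0,          # f(A, B)
    lambda a, b, c, d: 1.0 if (b and c and d) else 0.0,    # f(B, C, D)
]
weights = [1.3, -0.8]   # example weights, one per feature

def score(x):
    # Unnormalized log-linear score: exp(sum_i w_i f_i(X))
    return math.exp(sum(w * f(*x) for w, f in zip(weights, features)))

assignments = list(itertools.product([False, True], repeat=4))
Z = sum(score(x) for x in assignments)
P = {x: score(x) / Z for x in assignments}
print(P[(True, True, True, True)])
```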
Hammersley-Clifford Theorem
If the distribution is strictly positive (P(x) > 0)
and the graph encodes its conditional independences,
then the distribution is a product of potentials over the cliques of the graph.
The converse also holds.
Slide by Domingos
Markov Nets vs. Bayes Nets
Property          Markov Nets           Bayes Nets
Form              Prod. of potentials   Prod. of potentials
Potentials        Arbitrary             Conditional probabilities
Cycles            Allowed               Forbidden
Partition func.   Z = ?                 Z = 1
Indep. check      Graph separation      D-separation
Indep. props.     Some                  Some
Inference         MCMC, BP, etc.        Convert to Markov net
Slide by Domingos
Inference in Markov Networks
• Goal: compute marginals & conditionals of
  P(X) = (1/Z) exp( ∑_i w_i f_i(X) ),   Z = ∑_X exp( ∑_i w_i f_i(X) )
• Exact inference is #P-complete. E.g.: what is P(x_i)?
• Conditioning on the Markov blanket is easy: what is P(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_N)?
  P(x_i | MB(x_i)) = exp( ∑_j w_j f_j(x) ) / [ exp( ∑_j w_j f_j(x)|_{x_i=0} ) + exp( ∑_j w_j f_j(x)|_{x_i=1} ) ]
  (only the features that mention x_i matter; the rest cancel)
• Gibbs sampling exploits this
Slide by Domingos
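A minimal sketch (reusing the A, B, C, D potentials from the earlier slide, with our "¬B" reading) showing that the conditional of one node given its Markov blanket needs only the potentials that mention that node, and agrees with conditioning on the full joint.

```python
def phi_AB(a, b):
    return 3.7 if (a and b) else (2.1 if (a and not b) else 0.7)

def phi_BCD(b, c, d):
    return 2.3 if (b and c and d) else 5.1

def p_B_given_blanket(a, c, d):
    # Only the potentials containing B are needed; everything else cancels in the ratio.
    s1 = phi_AB(a, True) * phi_BCD(True, c, d)    # unnormalized term with B = 1
    s0 = phi_AB(a, False) * phi_BCD(False, c, d)  # unnormalized term with B = 0
    return s1 / (s0 + s1)

# Sanity check against brute-force conditioning on the full joint.
def joint(a, b, c, d):
    return phi_AB(a, b) * phi_BCD(b, c, d)

a, c, d = True, True, False
num = joint(a, True, c, d)
den = joint(a, False, c, d) + joint(a, True, c, d)
print(p_B_given_blanket(a, c, d), num / den)   # the two values agree
```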
Markov Chain Monte Carlo
• Idea:
  – create a chain of samples x^(1), x^(2), ..., where x^(i+1) depends on x^(i)
  – the set of samples x^(1), x^(2), ... is used to approximate p(x)
• Each sample is a joint assignment to all variables, e.g. for X_1, ..., X_5:
  x^(1) = (X_1 = x_1^(1), X_2 = x_2^(1), ..., X_5 = x_5^(1))
  x^(2) = (X_1 = x_1^(2), X_2 = x_2^(2), ..., X_5 = x_5^(2))
  x^(3) = (X_1 = x_1^(3), X_2 = x_2^(3), ..., X_5 = x_5^(3))
Slide by Domingos
Markov Chain Monte Carlo
• Gibbs Sampler
  1. Start with an initial assignment to nodes
  2. One node at a time, sample that node given the others
  3. Repeat
  4. Use the samples to compute P(X)
• Convergence: burn-in + mixing time
  – Burn-in: iterations required to move away from the particular initial condition
  – Mixing time: iterations required to be close to the stationary distribution
• Many modes ⇒ multiple chains
Slide by Domingos
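A Gibbs-sampler sketch (not from the slides) for the A, B, C, D network above: each node is resampled from its conditional given the others, and the post-burn-in samples estimate a marginal, here P(B = 1), which is compared to exact enumeration.

```python
import itertools, random

def phi_AB(a, b):
    return 3.7 if (a and b) else (2.1 if (a and not b) else 0.7)

def phi_BCD(b, c, d):
    return 2.3 if (b and c and d) else 5.1

def joint(x):
    a, b, c, d = x
    return phi_AB(a, b) * phi_BCD(b, c, d)

def gibbs(num_samples=50000, burn_in=1000, seed=0):
    rng = random.Random(seed)
    x = [rng.random() < 0.5 for _ in range(4)]    # step 1: random initial assignment
    count_b = 0
    for it in range(burn_in + num_samples):
        for i in range(4):                        # step 2: resample one node at a time
            x[i] = True
            p1 = joint(x)
            x[i] = False
            p0 = joint(x)
            x[i] = rng.random() < p1 / (p0 + p1)  # sample node i given all the others
        if it >= burn_in:                         # step 4: use post-burn-in samples
            count_b += x[1]
    return count_b / num_samples

# Compare the Gibbs estimate of P(B = 1) with exact enumeration.
states = list(itertools.product([False, True], repeat=4))
Z = sum(joint(x) for x in states)
exact = sum(joint(x) for x in states if x[1]) / Z
print(gibbs(), exact)
```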
Other Inference Methods • Belief propagation (sum-product) • Mean field / Variational approximations Slide by Domingos
Learning
• Learning weights
  – Maximize likelihood
  – Convex optimization: gradient ascent, quasi-Newton methods, etc.
  – Requires inference at each step (slow!)
• Learning structure
  – Feature search
  – Evaluation using likelihood, ...
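To make the "requires inference at each step" point concrete: for the log-linear form P(X) = (1/Z) exp(∑_i w_i f_i(X)), the gradient of the log-likelihood of a dataset D with respect to each weight is (a standard result, not spelled out on the slide)

  ∂/∂w_i  ∑_{x ∈ D} log P(x)  =  ∑_{x ∈ D} f_i(x)  −  |D| · E_P[f_i(X)]

i.e. observed feature counts minus expected feature counts under the current model. The expectation E_P[f_i] is exactly the inference step that must be (approximately) recomputed after every weight update.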
Back to CRFs • CRFs are conditionally trained Markov Networks
Linear-Chain Conditional Random Fields
• From HMMs to CRFs
  p(y, x) = ∏_{t=1}^{T} p(y_t | y_{t-1}) p(x_t | y_t)
  can also be written as
  p(y, x) = (1/Z) exp( ∑_t ∑_{i,j ∈ S} λ_ij 1{y_t = i} 1{y_{t-1} = j} + ∑_t ∑_{i ∈ S} ∑_{o ∈ O} μ_oi 1{y_t = i} 1{x_t = o} )
  (set λ_ij := log p(y_t = i | y_{t-1} = j), ...)
• We let the new parameters vary freely, so we need a normalization constant Z.
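A small numerical check (not from the slides; the transition/emission numbers are invented): with λ_ij = log p(y_t = i | y_{t-1} = j) and μ_oi = log p(x_t = o | y_t = i), plus an initial-state term not shown on the slide, the exponentiated sum of indicator features reproduces the HMM joint exactly, with Z = 1.

```python
import math
from itertools import product

S = ["A", "B"]           # hidden states
O = ["u", "v"]           # observations
init  = {"A": 0.6, "B": 0.4}                                      # p(y_1)
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.2, "B": 0.8}}    # p(y_t | y_{t-1})
emit  = {"A": {"u": 0.9, "v": 0.1}, "B": {"u": 0.3, "v": 0.7}}    # p(x_t | y_t)

def hmm_joint(y, x):
    p = init[y[0]] * emit[y[0]][x[0]]
    for t in range(1, len(y)):
        p *= trans[y[t - 1]][y[t]] * emit[y[t]][x[t]]
    return p

# Log-linear rewrite: the weights are log-probabilities, the features are indicators.
lam = {(i, j): math.log(trans[j][i]) for i in S for j in S}   # lambda_ij = log p(y_t=i | y_{t-1}=j)
mu  = {(o, i): math.log(emit[i][o]) for i in S for o in O}    # mu_oi = log p(x_t=o | y_t=i)
lam0 = {i: math.log(init[i]) for i in S}                      # initial-state term

def loglinear_joint(y, x):
    score = lam0[y[0]] + mu[(x[0], y[0])]
    for t in range(1, len(y)):
        score += lam[(y[t], y[t - 1])] + mu[(x[t], y[t])]
    return math.exp(score)   # here Z = 1, since the weights come from a real HMM

x = ["u", "v", "u"]
for y in product(S, repeat=3):
    assert abs(hmm_joint(y, x) - loglinear_joint(y, x)) < 1e-12
print("HMM joint and log-linear rewrite agree")
```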
Linear-Chain Conditional Random Fields
  p(y, x) = (1/Z) exp( ∑_t ∑_{i,j ∈ S} λ_ij 1{y_t = i} 1{y_{t-1} = j} + ∑_t ∑_{i ∈ S} ∑_{o ∈ O} μ_oi 1{y_t = i} 1{x_t = o} )
  (This is a linear-chain CRF, but it includes only the current word's identity as a feature.)
• Introduce feature functions f_k(y_t, y_{t-1}, x_t):
  – one feature per transition: f_ij(y, y', x_t) := 1{y = i} 1{y' = j}
  – one feature per state-observation pair: f_io(y, y', x_t) := 1{y = i} 1{x_t = o}
  p(y, x) = (1/Z) exp( ∑_t ∑_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
• Then the conditional distribution is
  p(y | x) = p(y, x) / ∑_{y'} p(y', x) = exp( ∑_t ∑_k λ_k f_k(y_t, y_{t-1}, x_t) ) / ∑_{y'} exp( ∑_t ∑_k λ_k f_k(y'_t, y'_{t-1}, x_t) )
Linear-Chain Conditional Random Fields • The conditional p(y|x) that follows from the joint p(y,x) of an HMM is a linear-chain CRF with particular feature functions!
Linear-Chain Conditional Random Fields
• Definition: a linear-chain CRF is a distribution of the form
  p(y | x) = (1/Z(x)) exp( ∑_{t=1}^{T} ∑_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
  with parameters λ_k and feature functions f_k, where Z(x) is a normalization function
  Z(x) = ∑_y exp( ∑_{t=1}^{T} ∑_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
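A brute-force sketch of this definition (not from the slides; the label set, observations, feature functions, and weights are all invented): it scores a label sequence with ∑_t ∑_k λ_k f_k, normalizes by summing over every possible labeling, and picks the most probable one by enumeration.

```python
import math
from itertools import product

S = ["N", "V"]                      # toy label set
x = ["dogs", "bark", "loudly"]      # toy observation sequence

def features(y_t, y_prev, x_t):
    # A few hand-made feature functions f_k(y_t, y_{t-1}, x_t); entirely illustrative.
    return [
        1.0 if (y_prev == "N" and y_t == "V") else 0.0,     # transition N -> V
        1.0 if (y_t == "N" and x_t.endswith("s")) else 0.0,
        1.0 if (y_t == "V" and x_t == "bark") else 0.0,
    ]

weights = [1.5, 0.8, 2.0]   # lambda_k, arbitrary

def score(y, x):
    # sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t); y_0 is a dummy START label
    total, prev = 0.0, "START"
    for y_t, x_t in zip(y, x):
        total += sum(w * f for w, f in zip(weights, features(y_t, prev, x_t)))
        prev = y_t
    return total

# Brute-force normalization Z(x): sum over all |S|^T label sequences (exponential!).
Z = sum(math.exp(score(y, x)) for y in product(S, repeat=len(x)))

def p(y, x):
    return math.exp(score(y, x)) / Z

best = max(product(S, repeat=len(x)), key=lambda y: p(y, x))
print(best, p(best, x))
```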
Linear-Chain Conditional Random Fields
• [Figure: an HMM-like linear-chain CRF — factors link (y_{t-1}, y_t) and (y_t, x_t) separately.]
• [Figure: a linear-chain CRF in which the transition score depends on the current observation — each factor links (y_{t-1}, y_t, x_t) jointly.]
Questions
• #1 – Inference: Given observations x_1 ... x_N and a CRF θ, what is P(y_t, y_{t-1} | x), and what is Z(x)? (needed for learning)
• #2 – Inference: Given observations x_1 ... x_N and a CRF θ, what is the most likely (Viterbi) labeling y* = argmax_y p(y | x)?
• #3 – Learning: Given i.i.d. training data D = {x^(i), y^(i)}, i = 1..N, how do we estimate the parameters θ = {λ_k} of a linear-chain CRF?
Solutions to #1 and #2
• Forward/backward and Viterbi algorithms, similar to the versions for HMMs
• HMM definition:
  p(y, x) = ∏_{t=1}^{T} p(y_t | y_{t-1}) p(x_t | y_t)
• HMM as a factor graph:
  p(y, x) = ∏_{t=1}^{T} Ψ_t(y_t, y_{t-1}, x_t),   Ψ_t(j, i, x) := p(y_t = j | y_{t-1} = i) p(x_t = x | y_t = j)
• Then (see the sketch below)
  forward recursion:   α_t(j) = ∑_{i ∈ S} Ψ_t(j, i, x_t) α_{t-1}(i)
  backward recursion:  β_t(i) = ∑_{j ∈ S} Ψ_{t+1}(j, i, x_{t+1}) β_{t+1}(j)
  Viterbi recursion:   δ_t(j) = max_{i ∈ S} Ψ_t(j, i, x_t) δ_{t-1}(i)
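A minimal Viterbi sketch for the factor form above (toy HMM numbers invented here; the first-step factor uses the initial distribution, an assumption since the slide leaves the t = 1 case implicit).

```python
S = ["A", "B"]
init  = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.2, "B": 0.8}}
emit  = {"A": {"u": 0.9, "v": 0.1}, "B": {"u": 0.3, "v": 0.7}}

def psi(t, j, i, x_t):
    # Psi_t(j, i, x) = p(y_t = j | y_{t-1} = i) * p(x_t = x | y_t = j); t = 1 uses the initial dist.
    return (init[j] if t == 1 else trans[i][j]) * emit[j][x_t]

def viterbi(x):
    # delta_t(j) = max_i Psi_t(j, i, x_t) * delta_{t-1}(i), with backpointers to recover y*.
    delta = {j: psi(1, j, None, x[0]) for j in S}
    back = []
    for t in range(2, len(x) + 1):
        new_delta, ptr = {}, {}
        for j in S:
            best_i = max(S, key=lambda i: psi(t, j, i, x[t - 1]) * delta[i])
            new_delta[j] = psi(t, j, best_i, x[t - 1]) * delta[best_i]
            ptr[j] = best_i
        delta, back = new_delta, back + [ptr]
    # Trace back the most likely state sequence.
    y = [max(S, key=lambda j: delta[j])]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    return list(reversed(y))

print(viterbi(["u", "v", "u"]))
```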
Forward/Backward for linear-chain CRFs
• ... identical to the HMM version except for the factor functions Ψ_t(j, i, x_t)
• CRF definition:
  p(y | x) = (1/Z(x)) exp( ∑_t ∑_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
• The CRF can be written as
  p(y | x) = (1/Z(x)) ∏_{t=1}^{T} Ψ_t(y_t, y_{t-1}, x_t),   Ψ_t(y_t, y_{t-1}, x_t) := exp( ∑_k λ_k f_k(y_t, y_{t-1}, x_t) )
• Same recursions as before (see the sketch below):
  forward:   α_t(j) = ∑_{i ∈ S} Ψ_t(j, i, x_t) α_{t-1}(i)
  backward:  β_t(i) = ∑_{j ∈ S} Ψ_{t+1}(j, i, x_{t+1}) β_{t+1}(j)
  Viterbi:   δ_t(j) = max_{i ∈ S} Ψ_t(j, i, x_t) δ_{t-1}(i)
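A sketch of the forward recursion for the CRF factors (reusing the invented toy features and weights from the earlier sketch, with a dummy START label at t = 1): it computes Z(x) in O(K²T) and checks the result against exponential-time enumeration.

```python
import math
from itertools import product

S = ["N", "V"]
x = ["dogs", "bark", "loudly"]
weights = [1.5, 0.8, 2.0]   # lambda_k for the toy features below

def features(y_t, y_prev, x_t):
    return [
        1.0 if (y_prev == "N" and y_t == "V") else 0.0,
        1.0 if (y_t == "N" and x_t.endswith("s")) else 0.0,
        1.0 if (y_t == "V" and x_t == "bark") else 0.0,
    ]

def psi(t, j, i, x):
    # Psi_t(j, i, x_t) = exp(sum_k lambda_k f_k(j, i, x_t)); at t = 1 the previous label is START.
    prev = "START" if t == 1 else i
    return math.exp(sum(w * f for w, f in zip(weights, features(j, prev, x[t - 1]))))

def forward_Z(x):
    # alpha_t(j) = sum_i Psi_t(j, i, x_t) alpha_{t-1}(i);  Z(x) = sum_j alpha_T(j).
    alpha = {j: psi(1, j, None, x) for j in S}
    for t in range(2, len(x) + 1):
        alpha = {j: sum(psi(t, j, i, x) * alpha[i] for i in S) for j in S}
    return sum(alpha.values())

def brute_Z(x):
    # Exponential-time check: sum over all label sequences.
    total = 0.0
    for y in product(S, repeat=len(x)):
        prev, s = "START", 0.0
        for y_t, x_t in zip(y, x):
            s += sum(w * f for w, f in zip(weights, features(y_t, prev, x_t)))
            prev = y_t
        total += math.exp(s)
    return total

print(forward_Z(x), brute_Z(x))   # the two values agree
```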
Forward/Backward for linear-chain CRFs
• Complexity is the same as for HMMs (K = |S| states, N = length of sequence)
  Time: O(K²N)   Space: O(KN)
• Linear in the length of the sequence!
Solution to #3 - Learning
• Want to maximize the conditional log likelihood
  l(θ) = ∑_{i=1}^{N} log p(y^(i) | x^(i))
• Substituting in the CRF model (and adding a regularizer, last term):
  l(θ) = ∑_{i=1}^{N} ∑_{t=1}^{T} ∑_{k=1}^{K} λ_k f_k(y_t^(i), y_{t-1}^(i), x_t^(i)) − ∑_{i=1}^{N} log Z(x^(i)) − ∑_{k=1}^{K} λ_k² / (2σ²)
• CRFs are typically learned using numerical optimization of this likelihood. (This is also possible for HMMs, but we only discussed EM.)
• Regularizer: there is often a large number of parameters, so we need to avoid overfitting. A sketch of the objective follows below.
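A sketch of the regularized objective for a single toy training pair (same invented label set, features, and weights as above; σ is an arbitrary choice): in practice log Z(x) comes from the forward recursion and the objective is fed to a gradient-based optimizer, while here it is evaluated by brute force.

```python
import math
from itertools import product

S = ["N", "V"]
sigma = 10.0
# One toy training pair (x^(i), y^(i)); everything here is invented for illustration.
train = [(["dogs", "bark", "loudly"], ["N", "V", "V"])]
weights = [1.5, 0.8, 2.0]

def features(y_t, y_prev, x_t):
    return [
        1.0 if (y_prev == "N" and y_t == "V") else 0.0,
        1.0 if (y_t == "N" and x_t.endswith("s")) else 0.0,
        1.0 if (y_t == "V" and x_t == "bark") else 0.0,
    ]

def score(y, x):
    prev, s = "START", 0.0
    for y_t, x_t in zip(y, x):
        s += sum(w * f for w, f in zip(weights, features(y_t, prev, x_t)))
        prev = y_t
    return s

def log_Z(x):
    # Brute force here; in practice computed by the forward recursion.
    return math.log(sum(math.exp(score(y, x)) for y in product(S, repeat=len(x))))

def objective():
    # l(theta) = sum_i [ score(y^(i), x^(i)) - log Z(x^(i)) ] - sum_k lambda_k^2 / (2 sigma^2)
    ll = sum(score(y, x) - log_Z(x) for x, y in train)
    return ll - sum(w * w for w in weights) / (2 * sigma ** 2)

print(objective())
```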
Regularization
• Commonly the l2-norm (Euclidean) is used
  – corresponds to a Gaussian prior over the parameters
  penalty: − ∑_{k=1}^{K} λ_k² / (2σ²)
• An alternative is the l1-norm
  – corresponds to an exponential prior over the parameters
  – encourages sparsity
  penalty: − ∑_{k=1}^{K} |λ_k| / σ
• The accuracy of the final model is not sensitive to σ