Probabilistic Graphical Models
Parameter learning in undirected models

Siamak Ravanbakhsh, Fall 2019
Learning objectives

- the form of the likelihood for undirected models
- why is it difficult to optimize?
- conditional likelihood in undirected models
- different approximations for parameter learning:
  - MAP inference and regularization
  - pseudo-likelihood
  - pseudo-moment matching
  - contrastive learning
Likelihood in MRFs: example

probability distribution over binary $A, B, C$ (a chain $A - B - C$ with one indicator feature per edge):

$p(A, B, C; \theta) = \frac{1}{Z(\theta)} \exp\big(\theta_1 I(A{=}1, B{=}1) + \theta_2 I(B{=}1, C{=}1)\big)$

observations: $|D| = 100$, with empirical expectations
$E_D[I(A{=}1, B{=}1)] = .4$ and $E_D[I(B{=}1, C{=}1)] = .4$

log-likelihood:

$\log p(D; \theta) = \sum_{(a,b,c) \in D} \big(\theta_1 I(a{=}1, b{=}1) + \theta_2 I(b{=}1, c{=}1)\big) - 100 \log Z(\theta) = 40\,\theta_1 + 40\,\theta_2 - 100 \log Z(\theta)$

because of the partition function, the likelihood does not decompose

[figure: contour plot of the log-likelihood as a function of $\theta_1$ and $\theta_2$]
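Since this example has only three binary variables, everything above can be checked by brute force. A minimal sketch (the function names are mine, not from the slides) that evaluates $Z(\theta)$ by enumerating all $2^3$ joint states and plugs it into $40\theta_1 + 40\theta_2 - 100\log Z(\theta)$:

```python
import itertools
import math

def log_Z(theta1, theta2):
    # Z(theta) sums exp(theta1*I(A=1,B=1) + theta2*I(B=1,C=1))
    # over all 2^3 joint states of (A, B, C).
    total = 0.0
    for a, b, c in itertools.product([0, 1], repeat=3):
        total += math.exp(theta1 * (a == 1 and b == 1)
                          + theta2 * (b == 1 and c == 1))
    return math.log(total)

def log_likelihood(theta1, theta2):
    # 100 observations with empirical feature averages 0.4 and 0.4.
    return 40 * theta1 + 40 * theta2 - 100 * log_Z(theta1, theta2)

print(log_likelihood(0.0, 0.0))  # = -100 * log(8), since Z(0) = 8
```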
Likelihood in linear exponential family (log-linear models)

probability distribution:

$p(x; \theta) = \frac{1}{Z(\theta)} \exp(\langle \theta, \phi(x) \rangle)$, where $\phi(x)$ is the vector of sufficient statistics

log-likelihood of $D$:

$\ell(D, \theta) = \log p(D; \theta) = \sum_{x \in D} \langle \theta, \phi(x) \rangle - |D| \log Z(\theta)$

$\ell(D, \theta) = |D| \big( \langle \theta, \mu_D \rangle - \log Z(\theta) \big)$, where $\mu_D = E_D[\phi(x)]$ are the expected sufficient statistics
example: for a pair of binary variables $(X_1, X_2)$ with one indicator feature per joint assignment, the expected sufficient statistics are exactly the empirical pairwise marginals:

params            expected sufficient statistics
$\theta_{1,2,0,0}$:  $E_D[I(X_1{=}0, X_2{=}0)] = P_D(X_1{=}0, X_2{=}0)$
$\theta_{1,2,1,0}$:  $E_D[I(X_1{=}1, X_2{=}0)] = P_D(X_1{=}1, X_2{=}0)$
$\theta_{1,2,0,1}$:  $E_D[I(X_1{=}0, X_2{=}1)] = P_D(X_1{=}0, X_2{=}1)$
$\theta_{1,2,1,1}$:  $E_D[I(X_1{=}1, X_2{=}1)] = P_D(X_1{=}1, X_2{=}1)$

image: Michael Jordan's draft
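As a small illustration of these indicator sufficient statistics (a sketch with names of my choosing, not code from the course), the feature vector for one pair of binary variables is a one-hot encoding of the joint assignment, so its empirical average is precisely the table of empirical marginals $P_D(X_1, X_2)$:

```python
import numpy as np

def phi(x1, x2):
    # One-hot vector indexed by (x1, x2) in {0,1}^2:
    # [I(x1=0,x2=0), I(x1=1,x2=0), I(x1=0,x2=1), I(x1=1,x2=1)]
    f = np.zeros(4)
    f[2 * x2 + x1] = 1.0
    return f

# Empirical expectation of phi = empirical pairwise marginals.
data = [(0, 0), (1, 0), (1, 1), (1, 1)]
print(np.mean([phi(x1, x2) for x1, x2 in data], axis=0))
# -> [0.25, 0.25, 0.0, 0.5]
```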
$\log Z(\theta)$ has interesting properties:

its gradient is the model's expected sufficient statistics:

$\frac{\partial}{\partial \theta_i} \log Z(\theta) = \frac{1}{Z(\theta)} \frac{\partial}{\partial \theta_i} \sum_x \exp(\langle \theta, \phi(x) \rangle) = \frac{1}{Z(\theta)} \sum_x \phi_i(x) \exp(\langle \theta, \phi(x) \rangle) = E_{p_\theta}[\phi_i(x)]$

so $\nabla_\theta \log Z(\theta) = E_{p_\theta}[\phi(x)]$

its Hessian is the covariance of the sufficient statistics:

$\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log Z(\theta) = E[\phi_i(x)\phi_j(x)] - E[\phi_i(x)]\,E[\phi_j(x)] = \mathrm{Cov}(\phi_i, \phi_j)$

a covariance matrix is positive semidefinite, so the Hessian is positive semidefinite and $\log Z(\theta)$ is convex
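The gradient identity is easy to verify numerically on the small example. The sketch below (assuming the same two features as before; the check itself is mine, not from the slides) compares a finite-difference derivative of $\log Z$ with the expectation $E_{p_\theta}[\phi(x)]$ computed by enumeration:

```python
import itertools
import numpy as np

def features(x):
    a, b, c = x
    return np.array([float(a == 1 and b == 1), float(b == 1 and c == 1)])

STATES = list(itertools.product([0, 1], repeat=3))

def log_Z(theta):
    return np.log(sum(np.exp(theta @ features(x)) for x in STATES))

def model_expectation(theta):
    # E_{p_theta}[phi(x)] by summing over all joint states.
    w = np.array([np.exp(theta @ features(x)) for x in STATES])
    p = w / w.sum()
    return sum(pi * features(x) for pi, x in zip(p, STATES))

theta = np.array([0.3, -0.7])
eps = 1e-6
fd = np.array([(log_Z(theta + eps * e) - log_Z(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(fd, model_expectation(theta))  # the two should agree
```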
Likelihood in linear exponential family (log-linear models)

$\ell(D, \theta) = |D| \big( \langle \theta, E_D[\phi(x)] \rangle - \log Z(\theta) \big)$

the first term is linear in $\theta$; $\log Z(\theta)$ is convex, so $-\log Z(\theta)$ is concave: the log-likelihood is concave

should be easy to maximize? NO!

- estimating $Z(\theta)$ is a difficult inference problem
- how about just using the gradient info? $\nabla_\theta \log Z(\theta) = E_{p_\theta}[\phi(x)]$ involves inference as well

learning undirected models therefore needs some combination of inference and gradient-based optimization
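The difficulty is simply that the sum defining $Z(\theta)$ has one term per joint assignment. A back-of-the-envelope count makes the blow-up concrete:

```python
# One term in Z(theta) per joint state: exponential in the number of
# binary variables, so exact evaluation quickly becomes infeasible.
for n in (3, 10, 30, 100):
    print(f"{n} binary variables -> 2^{n} = {2**n} terms in Z(theta)")
```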
Moment matching for linear exponential family

$\ell(D, \theta) = |D| \big( \langle \theta, E_D[\phi(x)] \rangle - \log Z(\theta) \big)$ is concave, so set its derivative to zero:

$\nabla_\theta \ell(D, \theta) = |D| \big( E_D[\phi(x)] - E_{p_\theta}[\phi(x)] \big) = 0 \;\Rightarrow\; E_D[\phi(x)] = E_{p_\theta}[\phi(x)]$

for indicator features this means matching marginals, e.g. $p(X_1{=}0, X_2{=}1; \theta) = P_D(X_1{=}0, X_2{=}1)$

maximum likelihood finds the parameters $\theta$ that give the same expected sufficient statistics as the data
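On the running three-variable example, the moment-matching condition can be reached by plain gradient ascent. A minimal sketch, using exact enumeration for the model expectation (the step size and iteration count are arbitrary choices of mine):

```python
import itertools
import numpy as np

def features(x):
    a, b, c = x
    return np.array([float(a == 1 and b == 1), float(b == 1 and c == 1)])

STATES = list(itertools.product([0, 1], repeat=3))

def model_expectation(theta):
    w = np.array([np.exp(theta @ features(x)) for x in STATES])
    p = w / w.sum()
    return sum(pi * features(x) for pi, x in zip(p, STATES))

mu_D = np.array([0.4, 0.4])  # empirical expected sufficient statistics
theta = np.zeros(2)
for _ in range(2000):
    # Per-example gradient of the log-likelihood: E_D[phi] - E_p[phi].
    theta += 0.5 * (mu_D - model_expectation(theta))

print(model_expectation(theta))  # ~ [0.4, 0.4]: moments are matched
```

At convergence the model's expected sufficient statistics equal the empirical ones, $(0.4, 0.4)$, exactly as the optimality condition demands.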
Learning needs inference in an inner loop

maximizing the likelihood: $\arg\max_\theta \log p(D \mid \theta)$

gradient: $\propto E_D[\phi(x)] - E_{p_\theta}[\phi(x)]$

optimality condition: $E_D[\phi(x)] = E_{p_\theta}[\phi(x)]$

the first term is easy to calculate from the data; the second requires inference in the graphical model
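When exact inference is out of reach, the inner expectation can be approximated by sampling, which is the idea behind the contrastive-learning methods listed in the objectives. A rough sketch of my own construction (not from the slides), replacing the exact expectation with a short Gibbs-sampling estimate on the same example:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    a, b, c = x
    return np.array([float(a == 1 and b == 1), float(b == 1 and c == 1)])

def gibbs_expectation(theta, n_samples):
    # Approximate inference: resample each variable from its full
    # conditional, then average the sufficient statistics
    # (no burn-in, for brevity).
    x = [0, 0, 0]
    total = np.zeros(2)
    for _ in range(n_samples):
        for i in range(3):
            x[i] = 1
            p1 = np.exp(theta @ features(x))
            x[i] = 0
            p0 = np.exp(theta @ features(x))
            x[i] = int(rng.random() < p1 / (p0 + p1))
        total += features(x)
    return total / n_samples

mu_D = np.array([0.4, 0.4])
theta = np.zeros(2)
for _ in range(200):
    # Stochastic estimate of the gradient E_D[phi] - E_p[phi].
    theta += 0.1 * (mu_D - gibbs_expectation(theta, n_samples=200))
print(theta)
```

The gradient estimate is noisy, but each step still pushes $E_{p_\theta}[\phi(x)]$ toward $E_D[\phi(x)]$, i.e. toward the moment-matching solution.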