Discrete Markov Random Fields: The Inference Story
Pradeep Ravikumar
Graphical Models: The History
How do we model stochastic processes in the world? "I want to model the world, and I like graphs..."
History
Mid-to-late twentieth century: pioneering work of conspiracy theorists. "The System, it is all connected..."
History
Late twentieth century: people realize that the existing scientific literature offers a marriage between probability theory and graph theory, which can be used to model the world.
History
Common misconception: they are called graphical models after Grafici Modeles, a sculptor protégé of Da Vinci. In fact, they are called graphical models because they model stochastic systems using graphs.
Graphical Models
[Figure: an undirected graph on nodes $X_1, X_2, X_3, X_4$.]
Separating set: $(X_2, X_3)$ disconnects $X_1$ and $X_4$.
Global Markov property: $X_1 \perp X_4 \mid (X_2, X_3)$.
Graphical Models
$\mathrm{MP}(G)$: the set of all Markov properties, obtained by ranging over separating sets of $G$.
$P$ is represented by $G$ iff $P$ satisfies $\mathrm{MP}(G)$.
[Figure: distributions $P_1(X), P_2(X), P_3(X), P_4(X)$ surrounding the graph $G(X)$.]
Hammersley and Clifford Theorem
A positive distribution $P$ over $X$ satisfies $\mathrm{MP}(G)$ iff $P$ factorizes according to the cliques $\mathcal{C}$ of $G$:
$$P(X) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C)$$
A specific member of the family is specified by weights over the cliques.
[Figure: distributions $P_1(X), \dots, P_4(X)$ surrounding $G(X)$ with clique potentials $\{\psi_C(X_C)\}$.]
Exponential Family
$$p(X) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C) = \exp\Big( \sum_{C \in \mathcal{C}} \log \psi_C(X_C) - \log Z \Big)$$
Exponential family: $p(X; \theta) = \exp\big( \sum_{\alpha} \theta_\alpha \phi_\alpha(X) - \Psi(\theta) \big)$
• $\{\phi_\alpha\}$: features
• $\{\theta_\alpha\}$: parameters
• $\Psi(\theta)$: log partition function
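As a sanity check, here is a minimal Python sketch (the 3-node chain and its potential tables are my own toy example, not from the slides) that evaluates the same distribution both ways: as a normalized product of clique potentials, and in exponential-family form with $\theta_\alpha = \log \psi_C$ paired with indicator features, confirming $\Psi(\theta) = \log Z$.

```python
import numpy as np
from itertools import product

# Hypothetical 3-node binary chain MRF with pairwise potentials psi_12, psi_23.
psi = {(0, 1): np.array([[1.0, 2.0], [3.0, 1.0]]),
       (1, 2): np.array([[2.0, 1.0], [1.0, 4.0]])}

# Clique-product form: p(x) = (1/Z) * prod_C psi_C(x_C)
def unnorm(x):
    return np.prod([psi[(s, t)][x[s], x[t]] for (s, t) in psi])

Z = sum(unnorm(x) for x in product([0, 1], repeat=3))

# Exponential-family form: theta_alpha = log psi_C(x_C) paired with
# indicator features, so sum_alpha theta_alpha phi_alpha(x) = sum_C log psi_C(x_C).
def log_unnorm(x):
    return np.sum([np.log(psi[(s, t)][x[s], x[t]]) for (s, t) in psi])

Psi = np.log(sum(np.exp(log_unnorm(x)) for x in product([0, 1], repeat=3)))
assert np.isclose(Psi, np.log(Z))  # Psi(theta) = log Z
```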
Inference
Answering queries about the graphical model probability distribution.
Inference
For an undirected model $p(x; \theta) = \exp\big( \sum_{\alpha \in I} \theta_\alpha \phi_\alpha(x) - \Psi(\theta) \big)$, the key inference problems are:
⊲ compute the log partition function (normalization constant) $\Psi(\theta)$
⊲ marginals $p(x_A) = \sum_{x_v, \, v \notin A} p(x)$
⊲ most probable configurations $x^* = \arg\max_x p(x \mid x_L)$
These problems are intractable in full generality.
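For concreteness, here is a brute-force reference implementation of all three queries on the same toy chain as above (again my own example): exact but exponential in the number of variables, which is precisely what the rest of the talk tries to avoid.

```python
import numpy as np
from itertools import product

# Same toy chain as above; brute force is the ground truth that the
# variational machinery below tries to approximate efficiently.
psi = {(0, 1): np.array([[1.0, 2.0], [3.0, 1.0]]),
       (1, 2): np.array([[2.0, 1.0], [1.0, 4.0]])}
states = list(product([0, 1], repeat=3))
p_unnorm = np.array([np.prod([psi[e][x[e[0]], x[e[1]]] for e in psi])
                     for x in states])

# (1) log partition function Psi(theta)
Z = p_unnorm.sum()
print("Psi =", np.log(Z))

# (2) marginal p(x_0), summing out all other variables
p = p_unnorm / Z
marg0 = np.array([p[[i for i, x in enumerate(states) if x[0] == v]].sum()
                  for v in (0, 1)])
print("p(x_0) =", marg0)

# (3) most probable configuration
print("x* =", states[int(np.argmax(p))])
```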
Log Partition Function
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
Variable Elimination
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
$$\sum_x \prod_\alpha \psi_\alpha(x_\alpha) = \sum_{\{x_{j \neq i}\}} \sum_{x_i} \prod_\alpha \psi_\alpha(x_\alpha) = \sum_{\{x_{j \neq i}\}} \prod_{\alpha \in C \setminus i} \psi_\alpha(x_\alpha) \sum_{x_i} \prod_{\alpha \in C_i} \psi_\alpha(x_\alpha) = \sum_{\{x_{j \neq i}\}} \prod_{\alpha \in C \setminus i} \psi_\alpha(x_\alpha) \, g(x_{j \neq i})$$
Here $C_i$ denotes the factors involving $x_i$, and $C \setminus i$ the rest.
Variable Elimination
$$Z = \sum_{\{x_{j \neq i}\}} \prod_{\alpha \in C \setminus i} \psi_\alpha(x_\alpha) \, g(x_{j \neq i})$$
Continue to "eliminate" the other variables $x_j$. Is this a linear-time method, then? Not in general: $g(x_{j \neq i})$ depends on all variables $j$ that share a factor with $i$, so elimination can create ever-larger intermediate factors.
Variable Elimination
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
[Figure: tree-structured graph with leaves $x_1, x_2, x_3, x_4$, internal nodes $x_5, x_6$, root $x_7$, and edge factors $\psi_{15}, \psi_{25}, \psi_{36}, \psi_{46}, \psi_{57}, \psi_{67}$.]
Variable Elimination
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
[Figure: after eliminating the leaves, only $x_5, x_6, x_7$ remain with factors $\psi_{57}, \psi_{67}$; the eliminated subtrees are summarized by messages $m_5$ and $m_6$.]
Variable Elimination
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
For the tree example, elimination computes
$$Z = \sum_{x_7} \Big[ \sum_{x_5} \psi_{57} \Big( \sum_{x_1} \psi_{15} \Big) \Big( \sum_{x_2} \psi_{25} \Big) \Big] \Big[ \sum_{x_6} \psi_{67} \Big( \sum_{x_3} \psi_{36} \Big) \Big( \sum_{x_4} \psi_{46} \Big) \Big]$$
In general, the cost of variable elimination is exponential in the tree-width of the graph.
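A sketch of this elimination order on the slide's 7-node tree, with randomly drawn binary potentials (my own instantiation), verified against brute-force enumeration:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
k = 2  # binary variables
# Tree from the slide: leaves x1..x4, internal nodes x5, x6, root x7.
edges = [(1, 5), (2, 5), (3, 6), (4, 6), (5, 7), (6, 7)]
psi = {e: rng.uniform(0.5, 2.0, size=(k, k)) for e in edges}

# Eliminating a leaf x_s turns psi_{s,t} into a message
# m_{s->t}(x_t) = sum_{x_s} psi_{s,t}(x_s, x_t).
def msg(e):
    return psi[e].sum(axis=0)

m5 = msg((1, 5)) * msg((2, 5))            # product of messages into x5
m6 = msg((3, 6)) * msg((4, 6))            # product of messages into x6
# Eliminate x5 and x6, then sum out the root x7:
m7 = (psi[(5, 7)] * m5[:, None]).sum(axis=0) * \
     (psi[(6, 7)] * m6[:, None]).sum(axis=0)
Z_elim = m7.sum()

# Brute-force check over all 2^7 configurations.
Z_brute = sum(np.prod([psi[(s, t)][x[s-1], x[t-1]] for (s, t) in edges])
              for x in product(range(k), repeat=7))
assert np.isclose(Z_elim, Z_brute)
```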
Inference
$$p(x; \theta) = \exp\big( \theta^\top \phi(x) - A(\theta) \big)$$
$A(\theta)$: log partition function
Inference
$$A(\theta) = \log \sum_x \exp\big( \theta^\top \phi(x) \big)$$
Introduce a "variational" parameter $\lambda$ and seek bounds $B(\theta, \lambda)$ and $C(\theta, \lambda)$ with
$$A(\theta) \leq \inf_\lambda B(\theta, \lambda), \qquad A(\theta) \geq \sup_\lambda C(\theta, \lambda)$$
Summing over configurations becomes optimization!
Inference
But... (there's always a but!) Is there a principled way to obtain such "parametrized" bounds $B(\theta, \lambda)$ and $C(\theta, \lambda)$?
Fenchel Duality
$f(x)$: a concave function. Define $f^*(\lambda) = \min_x \{ \lambda^\top x - f(x) \}$.
The slope of the tangent at $x_\lambda$ is $\lambda \implies f'(x_\lambda) = \lambda$.
Tangent line: $\lambda^\top x - (\text{intercept}) \implies f(x_\lambda) = \lambda^\top x_\lambda - (\text{intercept})$
$\implies f^*(\lambda)$ is the intercept of the line with slope $\lambda$ tangent to $f(x)$.
Fenchel Duality
Tangent: $\lambda^\top x - f^*(\lambda)$, and $f(x) = \min_\lambda \{ \lambda^\top x - f^*(\lambda) \}$.
Thus $g(x, \lambda) = \lambda^\top x - f^*(\lambda)$ is an upper bound on $f(x)$ for every $\lambda$!
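A tiny numeric check of this bound, using the slide's sign convention for concave functions. The toy function $f(x) = -x^2$ is my own choice (not from the slides); for it, $f^*(\lambda) = \min_x(\lambda x + x^2) = -\lambda^2/4$, attained at $x = -\lambda/2$.

```python
import numpy as np

# Slide's convention for concave f: f*(lam) = min_x { lam*x - f(x) }.
f      = lambda x: -x**2
f_star = lambda lam: -lam**2 / 4.0

x = 1.3
for lam in np.linspace(-5, 1, 7):
    g = lam * x - f_star(lam)          # upper bound on f(x) for every lam
    assert g >= f(x) - 1e-12
# The bound is tight at the slope of the tangent at x: lam = f'(x) = -2x.
assert np.isclose(-2*x * x - f_star(-2*x), f(x))
```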
Fenchel Duality
Let us apply Fenchel duality to the log partition function! $A(\theta)$ is convex, so
$$A^*(\mu) = \sup_\theta \big( \theta^\top \mu - A(\theta) \big), \qquad A(\theta) = \sup_\mu \big( \theta^\top \mu - A^*(\mu) \big)$$
Log Partition Function
Define the "marginal polytope"
$$\mathcal{M} = \Big\{ \mu \in \mathbb{R}^d \;\Big|\; \exists \, p(\cdot) \text{ s.t. } \sum_x \phi(x) \, p(x) = \mu \Big\}$$
$\mathcal{M}$ is the convex hull of $\{ \phi(x) \}$.
Mean Parameter Mapping
Consider the mapping $\Lambda : \Theta \to \mathcal{M}$,
$$\Lambda(\theta) := \mathbb{E}_\theta[\phi(x)] = \sum_x \phi(x) \, p(x; \theta)$$
The mapping associates $\theta$ with "mean parameters" $\mu := \Lambda(\theta) \in \mathcal{M}$.
Conversely, for $\mu \in \mathrm{Int}(\mathcal{M})$, there exists $\theta = \Lambda^{-1}(\mu)$ (unique if the exponential family is minimal).
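The forward mapping $\Lambda(\theta)$ is easy to write down by enumeration on a small model. A sketch on a tiny Ising-style example (the feature choice and parameter values are mine, not from the slides):

```python
import numpy as np
from itertools import product

# Forward mapping Lambda(theta) = E_theta[phi(x)], by enumeration.
# Toy features: phi(x) = (x1, x2, x3, x1*x2, x2*x3) on x in {0,1}^3.
def phi(x):
    return np.array([x[0], x[1], x[2], x[0]*x[1], x[1]*x[2]], float)

theta = np.array([0.5, -0.2, 0.1, 1.0, -0.7])
X = [np.array(x) for x in product([0, 1], repeat=3)]
w = np.exp([theta @ phi(x) for x in X])
p = w / w.sum()                                # p(x; theta)

mu = sum(pi * phi(x) for pi, x in zip(p, X))   # mean parameters, a point in M
print("mu =", mu)
```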
Partition Function Conjugate
$$A^*(\mu) = \sup_\theta \big( \theta^\top \mu - A(\theta) \big), \qquad A(\theta) = \sup_\mu \big( \theta^\top \mu - A^*(\mu) \big)$$
The optima are attained at $\theta_\mu = \Lambda^{-1}(\mu)$ and $\mu_\theta = \Lambda(\theta)$.
Partition Function Conjugate
Properties of the Fenchel conjugate $A^*(\mu) = \sup_\theta \big( \theta^\top \mu - A(\theta) \big)$:
⊲ $A^*(\mu)$ is finite only for $\mu \in \mathcal{M}$.
⊲ $A^*(\mu)$ is the negative entropy of the graphical model distribution with "mean parameters" $\mu$, or equivalently with parameters $\Lambda^{-1}(\mu)$!
Partition Function
$$A(\theta) = \sup_{\mu \in \mathcal{M}} \; \theta^\top \mu - A^*(\mu)$$
"Hardness" is due to two bottlenecks:
• $\mathcal{M}$: a polytope with exponentially many vertices and no compact representation
• $A^*(\mu)$: entropy computation
Approximate either or both!
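A numerical check of this variational representation on the same toy model as above: the supremum is attained at $\mu = \Lambda(\theta)$, where $A^*(\mu)$ equals the negative entropy of $p(\cdot\,; \theta)$, so $\theta^\top \mu - A^*(\mu)$ recovers $A(\theta)$ exactly.

```python
import numpy as np
from itertools import product

def phi(x):
    return np.array([x[0], x[1], x[2], x[0]*x[1], x[1]*x[2]], float)

theta = np.array([0.5, -0.2, 0.1, 1.0, -0.7])
X = [np.array(x) for x in product([0, 1], repeat=3)]
scores = np.array([theta @ phi(x) for x in X])
A = np.log(np.exp(scores).sum())               # A(theta), computed directly

p = np.exp(scores - A)                         # p(x; theta)
mu = sum(pi * phi(x) for pi, x in zip(p, X))   # Lambda(theta)
A_star = np.sum(p * np.log(p))                 # negative entropy, -H(p)
assert np.isclose(theta @ mu - A_star, A)      # duality holds with equality
```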
Pairwise Graphical Models
[Figure: two nodes $x_1, x_2$ joined by an edge with potential $\theta_{12} \, \phi_{12}(x_1, x_2)$.]
Overcomplete potentials:
$$\mathbb{I}_j(x_s) = \begin{cases} 1 & x_s = j \\ 0 & \text{otherwise} \end{cases} \qquad \mathbb{I}_{j,k}(x_s, x_t) = \begin{cases} 1 & x_s = j \text{ and } x_t = k \\ 0 & \text{otherwise} \end{cases}$$
$$p(x \mid \theta) = \exp\Big( \sum_{s, j} \theta_{s;j} \, \mathbb{I}_j(x_s) + \sum_{s,t; j,k} \theta_{s,t;j,k} \, \mathbb{I}_{j,k}(x_s, x_t) - \Psi(\theta) \Big)$$
Overcomplete Representation: Mean Parameters
$$\mu_{s;j} := \mathbb{E}_\theta[\mathbb{I}_j(x_s)] = p(x_s = j; \theta), \qquad \mu_{s,t;j,k} := \mathbb{E}_\theta[\mathbb{I}_{j,k}(x_s, x_t)] = p(x_s = j, x_t = k; \theta)$$
Mean parameters are marginals! Define the functional forms
$$\mu_s(x_s) = \sum_j \mu_{s;j} \, \mathbb{I}_j(x_s), \qquad \mu_{st}(x_s, x_t) = \sum_{j,k} \mu_{s,t;j,k} \, \mathbb{I}_{j,k}(x_s, x_t)$$
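A quick sketch of this on a two-node model with three states per variable (random parameters of my own choosing): under the overcomplete indicator features, the mean parameters read off directly as the node and edge marginals.

```python
import numpy as np

# Two-node pairwise model, k = 3 states each, overcomplete parameterization.
k = 3
rng = np.random.default_rng(1)
theta_s = rng.normal(size=(2, k))      # theta_{s;j} for s in {0, 1}
theta_st = rng.normal(size=(k, k))     # theta_{s,t;j,k} for the single edge

score = theta_s[0][:, None] + theta_s[1][None, :] + theta_st
p = np.exp(score)
p /= p.sum()                           # joint p(x_0, x_1)

mu_0 = p.sum(axis=1)                   # mu_{0;j}     = p(x_0 = j)
mu_01 = p                              # mu_{0,1;j,k} = p(x_0 = j, x_1 = k)
assert np.isclose(mu_0.sum(), 1.0)
```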
Outer Polytope Approximations
$$\mathrm{LOCAL}(G) := \Big\{ \mu \geq 0 \;\Big|\; \sum_{x_s} \mu_s(x_s) = 1, \; \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \Big\}$$
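The $\mathrm{LOCAL}(G)$ constraints are easy to check mechanically; here is a minimal membership test for one edge (helper name `in_local` is mine):

```python
import numpy as np

# LOCAL(G) constraints on one edge: nonnegativity, node normalization,
# and edge-to-node marginalization consistency.
def in_local(mu_s, mu_t, mu_st, tol=1e-9):
    return (mu_st.min() >= -tol
            and np.isclose(mu_s.sum(), 1) and np.isclose(mu_t.sum(), 1)
            and np.allclose(mu_st.sum(axis=1), mu_s)    # sum_{x_t} mu_st = mu_s
            and np.allclose(mu_st.sum(axis=0), mu_t))   # sum_{x_s} mu_st = mu_t

mu_s = np.array([0.5, 0.5]); mu_t = np.array([0.5, 0.5])
mu_st = np.array([[0.0, 0.5], [0.5, 0.0]])   # locally consistent
print(in_local(mu_s, mu_t, mu_st))           # True
```

A classic illustration of why this is only an outer approximation: on a 3-cycle, placing this perfectly anti-correlated edge marginal on all three edges (with uniform node marginals) satisfies every local constraint yet is realizable by no joint distribution, so $\mathrm{LOCAL}(G)$ strictly contains $\mathcal{M}(G)$ on graphs with cycles.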
Inner Polytope Approximations
For the given graph $G$ and a subgraph $H$, let
$$E(H) = \{ \theta' \mid \theta'_{st} = \theta_{st} \mathbf{1}_{(s,t) \in H} \}, \qquad \mathcal{M}(G; H) = \{ \mu \mid \mu = \mathbb{E}_\theta[\phi(x)] \text{ for some } \theta \in E(H) \}$$
Then $\mathcal{M}(G; H) \subseteq \mathcal{M}(G)$.
Entropy Approximations
Tree-structured distributions factorize as
$$p(x; \mu) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \, \mu_t(x_t)}$$
Define
$$H_s(\mu_s) := -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s), \qquad I_{st}(\mu_{st}) := \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \, \mu_t(x_t)}$$
Tree-structured (negative) entropy:
$$A^*_{\text{tree}}(\mu) = -\sum_{s \in V} H_s(\mu_s) + \sum_{(s,t) \in E} I_{st}(\mu_{st})$$
This has a compact representation and can be used as an approximation.
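A sketch of this formula in code, checked on the smallest possible tree (two nodes, one edge), where the tree decomposition is exact: $-H(X,Y) = -H_s - H_t + I_{st}$. The edge marginal table is my own toy example.

```python
import numpy as np

# Tree-structured negative entropy:
# A*_tree(mu) = - sum_s H_s(mu_s) + sum_(s,t) I_st(mu_st)
def H(mu_s):
    return -np.sum(mu_s * np.log(mu_s))        # node entropy

def I(mu_st):
    mu_s = mu_st.sum(axis=1); mu_t = mu_st.sum(axis=0)
    return np.sum(mu_st * np.log(mu_st / np.outer(mu_s, mu_t)))  # mutual info

mu_st = np.array([[0.3, 0.2], [0.1, 0.4]])     # joint over one edge
A_star = -(H(mu_st.sum(axis=1)) + H(mu_st.sum(axis=0))) + I(mu_st)
exact = np.sum(mu_st * np.log(mu_st))          # negative entropy of the joint
assert np.isclose(A_star, exact)
```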
Approximate Inference Techniques
Belief Propagation: polytope $\mathrm{LOCAL}(G)$, entropy approximated by the tree-structured entropy!
Structured Mean Field: polytope $\mathcal{M}(G; H)$, entropy approximated by the $H$-structured entropy.
(Naive) Mean Field: $H = H_0$, the fully disconnected graph in which all variables are independent.
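Below is a minimal sum-product (loopy) belief propagation sketch: a generic implementation of the standard message updates, not code from the talk. The potential names `phi`/`psi` and the 3-cycle usage example are my own assumptions.

```python
import numpy as np

# Loopy BP for a pairwise MRF with unary potentials phi[s] and
# edge potentials psi[(s, t)] (stored once per edge, s < t).
def loopy_bp(phi, psi, n_iter=50):
    nodes = list(phi)
    nbrs = {s: [] for s in nodes}
    for (s, t) in psi:
        nbrs[s].append(t); nbrs[t].append(s)
    # messages m[(s, t)](x_t), initialized uniform, in both directions
    m = {(s, t): np.ones(len(phi[t])) for (s, t) in psi}
    m.update({(t, s): np.ones(len(phi[s])) for (s, t) in psi})

    def edge_pot(s, t):  # matrix indexed [x_s, x_t]
        return psi[(s, t)] if (s, t) in psi else psi[(t, s)].T

    for _ in range(n_iter):
        for (s, t) in list(m):
            # m_{s->t}(x_t) = sum_{x_s} phi_s psi_st prod_{u != t} m_{u->s}
            prod_in = phi[s].copy()
            for u in nbrs[s]:
                if u != t:
                    prod_in = prod_in * m[(u, s)]
            new = edge_pot(s, t).T @ prod_in
            m[(s, t)] = new / new.sum()

    beliefs = {}  # approximate node marginals
    for s in nodes:
        b = phi[s].copy()
        for u in nbrs[s]:
            b = b * m[(u, s)]
        beliefs[s] = b / b.sum()
    return beliefs

# Usage on a 3-cycle of binary variables:
phi = {s: np.array([1.0, 2.0]) for s in range(3)}
psi = {(0, 1): np.eye(2) + 1, (1, 2): np.eye(2) + 1, (0, 2): np.eye(2) + 1}
print(loopy_bp(phi, psi))
```

On trees these beliefs are exact marginals; on loopy graphs they are the fixed points of the Bethe approximation described above.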
Divergence Measure View
Given $p(x; \theta) \propto \exp\big( \theta^\top \phi(x) \big)$, we would like a more "manageable" surrogate distribution $q \in \mathcal{Q}$:
$$\min_{q \in \mathcal{Q}} D\big( q(x) \,\|\, p(x; \theta) \big)$$
Divergence Measure View
$$\min_{q \in \mathcal{Q}} D\big( q(x) \,\|\, p(x; \theta) \big)$$
$D(q \| p) = \mathrm{KL}(q \| p)$: structured mean field, belief propagation.
$D(p \| q) = \mathrm{KL}(p \| q)$: expectation propagation (look out for the talk on continuous Markov random fields!).
Typically the KL measure is itself approximated with "energy approximations" (Bethe free energy, Kikuchi free energy).
(Ravikumar and Lafferty, 2005; preconditioner approximations) Optimizing a minimax criterion reduces the task to a generalized linear systems problem!
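To make the $\mathrm{KL}(q \| p)$ view concrete, here is a naive mean field sketch: coordinate ascent over a fully factorized $q(x) = \prod_s q_s(x_s)$, where each update is the exact minimizer of $\mathrm{KL}(q \| p)$ in $q_s$ with the other factors fixed. The function name and the 3-cycle example are my own, not from the talk.

```python
import numpy as np

# Naive mean field for a pairwise MRF in log-potential form:
# p(x) prop. exp( sum_s theta_s(x_s) + sum_(s,t) theta_st(x_s, x_t) ).
def mean_field(theta_s, theta_st, n_iter=100):
    q = {s: np.ones_like(t) / len(t) for s, t in theta_s.items()}
    for _ in range(n_iter):
        for s in q:
            field = theta_s[s].copy()
            for (a, b), th in theta_st.items():   # th indexed [x_a, x_b]
                if a == s:
                    field = field + th @ q[b]     # E_{q_b}[theta_st(x_s, .)]
                elif b == s:
                    field = field + th.T @ q[a]
            q[s] = np.exp(field - field.max())    # softmax update for q_s
            q[s] /= q[s].sum()
    return q

# Usage on a 3-cycle of binary variables:
theta_s = {s: np.log(np.array([1.0, 2.0])) for s in range(3)}
theta_st = {(0, 1): np.log(np.eye(2) + 1), (1, 2): np.log(np.eye(2) + 1),
            (0, 2): np.log(np.eye(2) + 1)}
print(mean_field(theta_s, theta_st))
```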
Bounds on Event Probabilities
Doctor: So what is the lower bound on the diagnosis probability?
Graphical model: I don't know, but here is an "approximate" value.
Doctor: =(
Can we get upper and lower bounds on $p(X \in C; \theta)$, instead of just "approximate" values?