Discrete Markov Random Fields: The Inference Story
Pradeep Ravikumar
Graphical Models: The History
How do we model stochastic processes in the world? "I want to model the world, and I like graphs..."
History
Mid-to-late twentieth century: pioneering work of conspiracy theorists. "The System, it is all connected..."
History
Late twentieth century: people realize that the existing scientific literature offers a marriage between probability theory and graph theory, which can be used to model the world.
History
Common misconception: they are called graphical models after Grafici Modeles, a sculptor protégé of Da Vinci. In fact, they are called graphical models because they model stochastic systems using graphs.
Graphical Models
[Figure: an undirected graph on nodes $X_1, X_2, X_3, X_4$.]
Separating set: $(X_2, X_3)$ disconnects $X_1$ and $X_4$.
Global Markov property: $X_1 \perp X_4 \mid (X_2, X_3)$.
Graphical Models
$\mathrm{MP}(G)$: the set of all Markov properties, obtained by ranging over separating sets of $G$.
$P$ is represented by $G$ iff $P$ satisfies $\mathrm{MP}(G)$.
[Figure: distributions $P_1(X), P_2(X), P_3(X), P_4(X)$ surrounding the graph $G(X)$.]
Hammersley and Clifford Theorem
A positive distribution $P$ over $X$ satisfies $\mathrm{MP}(G)$ iff $P$ factorizes according to the cliques $\mathcal{C}$ of $G$:
$$P(X) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C)$$
A specific member of the family is specified by weights over the cliques.
[Figure: distributions $P_1(X), \dots, P_4(X)$ surrounding $G(X)$ with clique potentials $\{\psi_C(X_C)\}$.]
Exponential Family
$$p(X) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C) = \exp\Big( \sum_{C \in \mathcal{C}} \log \psi_C(X_C) - \log Z \Big)$$
Exponential family: $p(X; \theta) = \exp\big( \sum_{\alpha} \theta_\alpha \phi_\alpha(X) - \Psi(\theta) \big)$
• $\{\phi_\alpha\}$: features
• $\{\theta_\alpha\}$: parameters
• $\Psi(\theta)$: log partition function
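As a sanity check, here is a minimal Python sketch (the 3-node chain and its potential tables are my own toy example, not from the slides) that evaluates the same distribution both ways: as a normalized product of clique potentials, and in exponential-family form with $\theta_\alpha = \log \psi_C$ paired with indicator features, confirming $\Psi(\theta) = \log Z$.

```python
import numpy as np
from itertools import product

# Hypothetical 3-node binary chain MRF with pairwise potentials psi_12, psi_23.
psi = {(0, 1): np.array([[1.0, 2.0], [3.0, 1.0]]),
       (1, 2): np.array([[2.0, 1.0], [1.0, 4.0]])}

# Clique-product form: p(x) = (1/Z) * prod_C psi_C(x_C)
def unnorm(x):
    return np.prod([psi[(s, t)][x[s], x[t]] for (s, t) in psi])

Z = sum(unnorm(x) for x in product([0, 1], repeat=3))

# Exponential-family form: theta_alpha = log psi_C(x_C) paired with
# indicator features, so sum_alpha theta_alpha phi_alpha(x) = sum_C log psi_C(x_C).
def log_unnorm(x):
    return np.sum([np.log(psi[(s, t)][x[s], x[t]]) for (s, t) in psi])

Psi = np.log(sum(np.exp(log_unnorm(x)) for x in product([0, 1], repeat=3)))
assert np.isclose(Psi, np.log(Z))  # Psi(theta) = log Z
```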
Inference
Answering queries about the graphical model probability distribution.
Inference
For an undirected model $p(x; \theta) = \exp\big( \sum_{\alpha \in I} \theta_\alpha \phi_\alpha(x) - \Psi(\theta) \big)$, the key inference problems are:
⊲ compute the log partition function (normalization constant) $\Psi(\theta)$
⊲ marginals $p(x_A) = \sum_{x_v, \, v \notin A} p(x)$
⊲ most probable configurations $x^* = \arg\max_x p(x \mid x_L)$
These problems are intractable in full generality.
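For concreteness, here is a brute-force reference implementation of all three queries on the same toy chain as above (again my own example): exact but exponential in the number of variables, which is precisely what the rest of the talk tries to avoid.

```python
import numpy as np
from itertools import product

# Same toy chain as above; brute force is the ground truth that the
# variational machinery below tries to approximate efficiently.
psi = {(0, 1): np.array([[1.0, 2.0], [3.0, 1.0]]),
       (1, 2): np.array([[2.0, 1.0], [1.0, 4.0]])}
states = list(product([0, 1], repeat=3))
p_unnorm = np.array([np.prod([psi[e][x[e[0]], x[e[1]]] for e in psi])
                     for x in states])

# (1) log partition function Psi(theta)
Z = p_unnorm.sum()
print("Psi =", np.log(Z))

# (2) marginal p(x_0), summing out all other variables
p = p_unnorm / Z
marg0 = np.array([p[[i for i, x in enumerate(states) if x[0] == v]].sum()
                  for v in (0, 1)])
print("p(x_0) =", marg0)

# (3) most probable configuration
print("x* =", states[int(np.argmax(p))])
```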
Log Partition Function
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
Variable Elimination
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
$$\sum_x \prod_\alpha \psi_\alpha(x_\alpha) = \sum_{\{x_{j \neq i}\}} \sum_{x_i} \prod_\alpha \psi_\alpha(x_\alpha) = \sum_{\{x_{j \neq i}\}} \prod_{\alpha \in C \setminus i} \psi_\alpha(x_\alpha) \sum_{x_i} \prod_{\alpha \in C_i} \psi_\alpha(x_\alpha) = \sum_{\{x_{j \neq i}\}} \prod_{\alpha \in C \setminus i} \psi_\alpha(x_\alpha) \, g(x_{j \neq i})$$
Here $C_i$ denotes the factors involving $x_i$, and $C \setminus i$ the rest.
Variable Elimination
$$Z = \sum_{\{x_{j \neq i}\}} \prod_{\alpha \in C \setminus i} \psi_\alpha(x_\alpha) \, g(x_{j \neq i})$$
Continue to "eliminate" the other variables $x_j$. Is this a linear-time method, then? Not in general: $g(x_{j \neq i})$ depends on all variables $j$ that share a factor with $i$, so elimination can create ever-larger intermediate factors.
Variable Elimination
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
[Figure: tree-structured graph with leaves $x_1, x_2, x_3, x_4$, internal nodes $x_5, x_6$, root $x_7$, and edge factors $\psi_{15}, \psi_{25}, \psi_{36}, \psi_{46}, \psi_{57}, \psi_{67}$.]
Variable Elimination
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
[Figure: after eliminating the leaves, only $x_5, x_6, x_7$ remain with factors $\psi_{57}, \psi_{67}$; the eliminated subtrees are summarized by messages $m_5$ and $m_6$.]
Variable Elimination
$$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$
For the tree example, elimination computes
$$Z = \sum_{x_7} \Big[ \sum_{x_5} \psi_{57} \Big( \sum_{x_1} \psi_{15} \Big) \Big( \sum_{x_2} \psi_{25} \Big) \Big] \Big[ \sum_{x_6} \psi_{67} \Big( \sum_{x_3} \psi_{36} \Big) \Big( \sum_{x_4} \psi_{46} \Big) \Big]$$
In general, the cost of variable elimination is exponential in the tree-width of the graph.
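A sketch of this elimination order on the slide's 7-node tree, with randomly drawn binary potentials (my own instantiation), verified against brute-force enumeration:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
k = 2  # binary variables
# Tree from the slide: leaves x1..x4, internal nodes x5, x6, root x7.
edges = [(1, 5), (2, 5), (3, 6), (4, 6), (5, 7), (6, 7)]
psi = {e: rng.uniform(0.5, 2.0, size=(k, k)) for e in edges}

# Eliminating a leaf x_s turns psi_{s,t} into a message
# m_{s->t}(x_t) = sum_{x_s} psi_{s,t}(x_s, x_t).
def msg(e):
    return psi[e].sum(axis=0)

m5 = msg((1, 5)) * msg((2, 5))            # product of messages into x5
m6 = msg((3, 6)) * msg((4, 6))            # product of messages into x6
# Eliminate x5 and x6, then sum out the root x7:
m7 = (psi[(5, 7)] * m5[:, None]).sum(axis=0) * \
     (psi[(6, 7)] * m6[:, None]).sum(axis=0)
Z_elim = m7.sum()

# Brute-force check over all 2^7 configurations.
Z_brute = sum(np.prod([psi[(s, t)][x[s-1], x[t-1]] for (s, t) in edges])
              for x in product(range(k), repeat=7))
assert np.isclose(Z_elim, Z_brute)
```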
Inference
$$p(x; \theta) = \exp\big( \theta^\top \phi(x) - A(\theta) \big)$$
$A(\theta)$: log partition function
Inference
$$A(\theta) = \log \sum_x \exp\big( \theta^\top \phi(x) \big)$$
Introduce a "variational" parameter $\lambda$ and seek bounds $B(\theta, \lambda)$ and $C(\theta, \lambda)$ with
$$A(\theta) \leq \inf_\lambda B(\theta, \lambda), \qquad A(\theta) \geq \sup_\lambda C(\theta, \lambda)$$
Summing over configurations becomes optimization!
Inference
But... (there's always a but!) Is there a principled way to obtain such "parametrized" bounds $B(\theta, \lambda)$ and $C(\theta, \lambda)$?
Fenchel Duality
$f(x)$: a concave function. Define $f^*(\lambda) = \min_x \{ \lambda^\top x - f(x) \}$.
The slope of the tangent at $x_\lambda$ is $\lambda \implies f'(x_\lambda) = \lambda$.
Tangent line: $\lambda^\top x - (\text{intercept}) \implies f(x_\lambda) = \lambda^\top x_\lambda - (\text{intercept})$
$\implies f^*(\lambda)$ is the intercept of the line with slope $\lambda$ tangent to $f(x)$.
Fenchel Duality
Tangent: $\lambda^\top x - f^*(\lambda)$, and $f(x) = \min_\lambda \{ \lambda^\top x - f^*(\lambda) \}$.
Thus $g(x, \lambda) = \lambda^\top x - f^*(\lambda)$ is an upper bound on $f(x)$ for every $\lambda$!
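A tiny numeric check of this bound, using the slide's sign convention for concave functions. The toy function $f(x) = -x^2$ is my own choice (not from the slides); for it, $f^*(\lambda) = \min_x(\lambda x + x^2) = -\lambda^2/4$, attained at $x = -\lambda/2$.

```python
import numpy as np

# Slide's convention for concave f: f*(lam) = min_x { lam*x - f(x) }.
f      = lambda x: -x**2
f_star = lambda lam: -lam**2 / 4.0

x = 1.3
for lam in np.linspace(-5, 1, 7):
    g = lam * x - f_star(lam)          # upper bound on f(x) for every lam
    assert g >= f(x) - 1e-12
# The bound is tight at the slope of the tangent at x: lam = f'(x) = -2x.
assert np.isclose(-2*x * x - f_star(-2*x), f(x))
```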
Fenchel Duality
Let us apply Fenchel duality to the log partition function! $A(\theta)$ is convex, so
$$A^*(\mu) = \sup_\theta \big( \theta^\top \mu - A(\theta) \big), \qquad A(\theta) = \sup_\mu \big( \theta^\top \mu - A^*(\mu) \big)$$
Log Partition Function
Define the "marginal polytope"
$$\mathcal{M} = \Big\{ \mu \in \mathbb{R}^d \;\Big|\; \exists \, p(\cdot) \text{ s.t. } \sum_x \phi(x) \, p(x) = \mu \Big\}$$
$\mathcal{M}$ is the convex hull of $\{ \phi(x) \}$.
Mean Parameter Mapping
Consider the mapping $\Lambda : \Theta \to \mathcal{M}$,
$$\Lambda(\theta) := \mathbb{E}_\theta[\phi(x)] = \sum_x \phi(x) \, p(x; \theta)$$
The mapping associates $\theta$ with "mean parameters" $\mu := \Lambda(\theta) \in \mathcal{M}$.
Conversely, for $\mu \in \mathrm{Int}(\mathcal{M})$, there exists $\theta = \Lambda^{-1}(\mu)$ (unique if the exponential family is minimal).
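The forward mapping $\Lambda(\theta)$ is easy to write down by enumeration on a small model. A sketch on a tiny Ising-style example (the feature choice and parameter values are mine, not from the slides):

```python
import numpy as np
from itertools import product

# Forward mapping Lambda(theta) = E_theta[phi(x)], by enumeration.
# Toy features: phi(x) = (x1, x2, x3, x1*x2, x2*x3) on x in {0,1}^3.
def phi(x):
    return np.array([x[0], x[1], x[2], x[0]*x[1], x[1]*x[2]], float)

theta = np.array([0.5, -0.2, 0.1, 1.0, -0.7])
X = [np.array(x) for x in product([0, 1], repeat=3)]
w = np.exp([theta @ phi(x) for x in X])
p = w / w.sum()                                # p(x; theta)

mu = sum(pi * phi(x) for pi, x in zip(p, X))   # mean parameters, a point in M
print("mu =", mu)
```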
Partition Function Conjugate
$$A^*(\mu) = \sup_\theta \big( \theta^\top \mu - A(\theta) \big), \qquad A(\theta) = \sup_\mu \big( \theta^\top \mu - A^*(\mu) \big)$$
The optima are attained at $\theta_\mu = \Lambda^{-1}(\mu)$ and $\mu_\theta = \Lambda(\theta)$.
Partition Function Conjugate
Properties of the Fenchel conjugate $A^*(\mu) = \sup_\theta \big( \theta^\top \mu - A(\theta) \big)$:
⊲ $A^*(\mu)$ is finite only for $\mu \in \mathcal{M}$.
⊲ $A^*(\mu)$ is the negative entropy of the graphical model distribution with "mean parameters" $\mu$, or equivalently with parameters $\Lambda^{-1}(\mu)$!
Partition Function
$$A(\theta) = \sup_{\mu \in \mathcal{M}} \; \theta^\top \mu - A^*(\mu)$$
"Hardness" is due to two bottlenecks:
• $\mathcal{M}$: a polytope with exponentially many vertices and no compact representation
• $A^*(\mu)$: entropy computation
Approximate either or both!
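A numerical check of this variational representation on the same toy model as above: the supremum is attained at $\mu = \Lambda(\theta)$, where $A^*(\mu)$ equals the negative entropy of $p(\cdot\,; \theta)$, so $\theta^\top \mu - A^*(\mu)$ recovers $A(\theta)$ exactly.

```python
import numpy as np
from itertools import product

def phi(x):
    return np.array([x[0], x[1], x[2], x[0]*x[1], x[1]*x[2]], float)

theta = np.array([0.5, -0.2, 0.1, 1.0, -0.7])
X = [np.array(x) for x in product([0, 1], repeat=3)]
scores = np.array([theta @ phi(x) for x in X])
A = np.log(np.exp(scores).sum())               # A(theta), computed directly

p = np.exp(scores - A)                         # p(x; theta)
mu = sum(pi * phi(x) for pi, x in zip(p, X))   # Lambda(theta)
A_star = np.sum(p * np.log(p))                 # negative entropy, -H(p)
assert np.isclose(theta @ mu - A_star, A)      # duality holds with equality
```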
Pairwise Graphical Models
[Figure: two nodes $x_1, x_2$ joined by an edge with potential $\theta_{12} \, \phi_{12}(x_1, x_2)$.]
Overcomplete potentials:
$$\mathbb{I}_j(x_s) = \begin{cases} 1 & x_s = j \\ 0 & \text{otherwise} \end{cases} \qquad \mathbb{I}_{j,k}(x_s, x_t) = \begin{cases} 1 & x_s = j \text{ and } x_t = k \\ 0 & \text{otherwise} \end{cases}$$
$$p(x \mid \theta) = \exp\Big( \sum_{s, j} \theta_{s;j} \, \mathbb{I}_j(x_s) + \sum_{s,t; j,k} \theta_{s,t;j,k} \, \mathbb{I}_{j,k}(x_s, x_t) - \Psi(\theta) \Big)$$
Overcomplete Representation: Mean Parameters
$$\mu_{s;j} := \mathbb{E}_\theta[\mathbb{I}_j(x_s)] = p(x_s = j; \theta), \qquad \mu_{s,t;j,k} := \mathbb{E}_\theta[\mathbb{I}_{j,k}(x_s, x_t)] = p(x_s = j, x_t = k; \theta)$$
Mean parameters are marginals! Define the functional forms
$$\mu_s(x_s) = \sum_j \mu_{s;j} \, \mathbb{I}_j(x_s), \qquad \mu_{st}(x_s, x_t) = \sum_{j,k} \mu_{s,t;j,k} \, \mathbb{I}_{j,k}(x_s, x_t)$$
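A quick sketch of this on a two-node model with three states per variable (random parameters of my own choosing): under the overcomplete indicator features, the mean parameters read off directly as the node and edge marginals.

```python
import numpy as np

# Two-node pairwise model, k = 3 states each, overcomplete parameterization.
k = 3
rng = np.random.default_rng(1)
theta_s = rng.normal(size=(2, k))      # theta_{s;j} for s in {0, 1}
theta_st = rng.normal(size=(k, k))     # theta_{s,t;j,k} for the single edge

score = theta_s[0][:, None] + theta_s[1][None, :] + theta_st
p = np.exp(score)
p /= p.sum()                           # joint p(x_0, x_1)

mu_0 = p.sum(axis=1)                   # mu_{0;j}     = p(x_0 = j)
mu_01 = p                              # mu_{0,1;j,k} = p(x_0 = j, x_1 = k)
assert np.isclose(mu_0.sum(), 1.0)
```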
Outer Polytope Approximations
$$\mathrm{LOCAL}(G) := \Big\{ \mu \geq 0 \;\Big|\; \sum_{x_s} \mu_s(x_s) = 1, \; \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \Big\}$$
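The $\mathrm{LOCAL}(G)$ constraints are easy to check mechanically; here is a minimal membership test for one edge (helper name `in_local` is mine):

```python
import numpy as np

# LOCAL(G) constraints on one edge: nonnegativity, node normalization,
# and edge-to-node marginalization consistency.
def in_local(mu_s, mu_t, mu_st, tol=1e-9):
    return (mu_st.min() >= -tol
            and np.isclose(mu_s.sum(), 1) and np.isclose(mu_t.sum(), 1)
            and np.allclose(mu_st.sum(axis=1), mu_s)    # sum_{x_t} mu_st = mu_s
            and np.allclose(mu_st.sum(axis=0), mu_t))   # sum_{x_s} mu_st = mu_t

mu_s = np.array([0.5, 0.5]); mu_t = np.array([0.5, 0.5])
mu_st = np.array([[0.0, 0.5], [0.5, 0.0]])   # locally consistent
print(in_local(mu_s, mu_t, mu_st))           # True
```

A classic illustration of why this is only an outer approximation: on a 3-cycle, placing this perfectly anti-correlated edge marginal on all three edges (with uniform node marginals) satisfies every local constraint yet is realizable by no joint distribution, so $\mathrm{LOCAL}(G)$ strictly contains $\mathcal{M}(G)$ on graphs with cycles.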
Inner Polytope Approximations
For the given graph $G$ and a subgraph $H$, let
$$E(H) = \{ \theta' \mid \theta'_{st} = \theta_{st} \mathbf{1}_{(s,t) \in H} \}, \qquad \mathcal{M}(G; H) = \{ \mu \mid \mu = \mathbb{E}_\theta[\phi(x)] \text{ for some } \theta \in E(H) \}$$
Then $\mathcal{M}(G; H) \subseteq \mathcal{M}(G)$.
Entropy Approximations
Tree-structured distributions factorize as
$$p(x; \mu) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \, \mu_t(x_t)}$$
Define
$$H_s(\mu_s) := -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s), \qquad I_{st}(\mu_{st}) := \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \, \mu_t(x_t)}$$
Tree-structured (negative) entropy:
$$A^*_{\text{tree}}(\mu) = -\sum_{s \in V} H_s(\mu_s) + \sum_{(s,t) \in E} I_{st}(\mu_{st})$$
This has a compact representation and can be used as an approximation.
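A sketch of this formula in code, checked on the smallest possible tree (two nodes, one edge), where the tree decomposition is exact: $-H(X,Y) = -H_s - H_t + I_{st}$. The edge marginal table is my own toy example.

```python
import numpy as np

# Tree-structured negative entropy:
# A*_tree(mu) = - sum_s H_s(mu_s) + sum_(s,t) I_st(mu_st)
def H(mu_s):
    return -np.sum(mu_s * np.log(mu_s))        # node entropy

def I(mu_st):
    mu_s = mu_st.sum(axis=1); mu_t = mu_st.sum(axis=0)
    return np.sum(mu_st * np.log(mu_st / np.outer(mu_s, mu_t)))  # mutual info

mu_st = np.array([[0.3, 0.2], [0.1, 0.4]])     # joint over one edge
A_star = -(H(mu_st.sum(axis=1)) + H(mu_st.sum(axis=0))) + I(mu_st)
exact = np.sum(mu_st * np.log(mu_st))          # negative entropy of the joint
assert np.isclose(A_star, exact)
```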
Approximate Inference Techniques
Belief Propagation: polytope $\mathrm{LOCAL}(G)$, entropy approximated by the tree-structured entropy!
Structured Mean Field: polytope $\mathcal{M}(G; H)$, entropy approximated by the $H$-structured entropy.
(Naive) Mean Field: $H = H_0$, the fully disconnected graph in which all variables are independent.
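Below is a minimal sum-product (loopy) belief propagation sketch: a generic implementation of the standard message updates, not code from the talk. The potential names `phi`/`psi` and the 3-cycle usage example are my own assumptions.

```python
import numpy as np

# Loopy BP for a pairwise MRF with unary potentials phi[s] and
# edge potentials psi[(s, t)] (stored once per edge, s < t).
def loopy_bp(phi, psi, n_iter=50):
    nodes = list(phi)
    nbrs = {s: [] for s in nodes}
    for (s, t) in psi:
        nbrs[s].append(t); nbrs[t].append(s)
    # messages m[(s, t)](x_t), initialized uniform, in both directions
    m = {(s, t): np.ones(len(phi[t])) for (s, t) in psi}
    m.update({(t, s): np.ones(len(phi[s])) for (s, t) in psi})

    def edge_pot(s, t):  # matrix indexed [x_s, x_t]
        return psi[(s, t)] if (s, t) in psi else psi[(t, s)].T

    for _ in range(n_iter):
        for (s, t) in list(m):
            # m_{s->t}(x_t) = sum_{x_s} phi_s psi_st prod_{u != t} m_{u->s}
            prod_in = phi[s].copy()
            for u in nbrs[s]:
                if u != t:
                    prod_in = prod_in * m[(u, s)]
            new = edge_pot(s, t).T @ prod_in
            m[(s, t)] = new / new.sum()

    beliefs = {}  # approximate node marginals
    for s in nodes:
        b = phi[s].copy()
        for u in nbrs[s]:
            b = b * m[(u, s)]
        beliefs[s] = b / b.sum()
    return beliefs

# Usage on a 3-cycle of binary variables:
phi = {s: np.array([1.0, 2.0]) for s in range(3)}
psi = {(0, 1): np.eye(2) + 1, (1, 2): np.eye(2) + 1, (0, 2): np.eye(2) + 1}
print(loopy_bp(phi, psi))
```

On trees these beliefs are exact marginals; on loopy graphs they are the fixed points of the Bethe approximation described above.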
Divergence Measure View
Given $p(x; \theta) \propto \exp\big( \theta^\top \phi(x) \big)$, we would like a more "manageable" surrogate distribution $q \in \mathcal{Q}$:
$$\min_{q \in \mathcal{Q}} D\big( q(x) \,\|\, p(x; \theta) \big)$$
Divergence Measure View
$$\min_{q \in \mathcal{Q}} D\big( q(x) \,\|\, p(x; \theta) \big)$$
$D(q \| p) = \mathrm{KL}(q \| p)$: structured mean field, belief propagation.
$D(p \| q) = \mathrm{KL}(p \| q)$: expectation propagation (look out for the talk on continuous Markov random fields!).
Typically the KL measure is itself approximated with "energy approximations" (Bethe free energy, Kikuchi free energy).
(Ravikumar and Lafferty, 2005; preconditioner approximations) Optimizing a minimax criterion reduces the task to a generalized linear systems problem!
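To make the $\mathrm{KL}(q \| p)$ view concrete, here is a naive mean field sketch: coordinate ascent over a fully factorized $q(x) = \prod_s q_s(x_s)$, where each update is the exact minimizer of $\mathrm{KL}(q \| p)$ in $q_s$ with the other factors fixed. The function name and the 3-cycle example are my own, not from the talk.

```python
import numpy as np

# Naive mean field for a pairwise MRF in log-potential form:
# p(x) prop. exp( sum_s theta_s(x_s) + sum_(s,t) theta_st(x_s, x_t) ).
def mean_field(theta_s, theta_st, n_iter=100):
    q = {s: np.ones_like(t) / len(t) for s, t in theta_s.items()}
    for _ in range(n_iter):
        for s in q:
            field = theta_s[s].copy()
            for (a, b), th in theta_st.items():   # th indexed [x_a, x_b]
                if a == s:
                    field = field + th @ q[b]     # E_{q_b}[theta_st(x_s, .)]
                elif b == s:
                    field = field + th.T @ q[a]
            q[s] = np.exp(field - field.max())    # softmax update for q_s
            q[s] /= q[s].sum()
    return q

# Usage on a 3-cycle of binary variables:
theta_s = {s: np.log(np.array([1.0, 2.0])) for s in range(3)}
theta_st = {(0, 1): np.log(np.eye(2) + 1), (1, 2): np.log(np.eye(2) + 1),
            (0, 2): np.log(np.eye(2) + 1)}
print(mean_field(theta_s, theta_st))
```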
Bounds on Event Probabilities
Doctor: So what is the lower bound on the diagnosis probability?
Graphical model: I don't know, but here is an "approximate" value.
Doctor: =(
Can we get upper and lower bounds on $p(X \in C; \theta)$, instead of just "approximate" values?