Graphical Models: Exponential Family & Variational Inference I
Siamak Ravanbakhsh, Winter 2018
Learning objectives
- entropy
- the exponential family distribution
- duality in the exponential family
- the relationship between the two parametrizations
- inference and learning as mappings between the two
- relative entropy and two types of projections
A measure of information
a measure of information I(X = x) should satisfy:
- observing a less probable event gives more information
- information is non-negative, and I(X = x) = 0 ⇔ P(X = x) = 1
- information from independent events is additive: A = a ⊥ B = b ⇒ I(A = a, B = b) = I(A = a) + I(B = b)
the definition follows from these characteristics:
I(X = x) ≜ log(1 / P(X = x)) = −log(P(X = x))
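As a quick illustration of these properties, here is a minimal Python sketch (the function name `self_information` is my own, not from the slides) evaluating I(X = x) = −log₂ P(X = x) for a few probabilities:

```python
# A minimal sketch (not from the slides): self-information in bits,
# checking non-negativity and additivity for independent events.
import math

def self_information(p):
    """Information (in bits) of observing an event with probability p."""
    return -math.log2(p)

p_a, p_b = 0.5, 0.25                 # two independent events
print(self_information(p_a))         # 1.0 bit
print(self_information(p_b))         # 2.0 bits
print(self_information(p_a * p_b))   # 3.0 bits = 1.0 + 2.0 (additive)
print(self_information(1.0))         # 0.0: a certain event carries no information
```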
Entropy: information theory
- information in observing X = x: I(X = x) ≜ −log(P(X = x))
- entropy: the expected amount of information
  H(P) ≜ E[I(X)] = −∑_{x ∈ Val(X)} P(X = x) log(P(X = x))
- the expected code length when transmitting X (repeatedly), e.g., using Huffman coding
- achieves its maximum for the uniform distribution: 0 ≤ H(P) ≤ log(|Val(X)|)
Entropy: example
Val(X) = {a, b, c, d, e, f}
P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16, P(e) = P(f) = 1/32
an optimal code for transmitting X:
a → 0, b → 10, c → 110, d → 1110, e → 11110, f → 11111
average length?
H(P) = −(1/2)log(1/2) − (1/4)log(1/4) − (1/8)log(1/8) − (1/16)log(1/16) − (1/16)log(1/32) = 1 15/16 bits
the per-symbol contributions to the average length are 1/2, 1/2, 3/8, 1/4, 5/16 (e.g., 1/2 is the contribution from X = a)
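The claim that entropy equals the expected code length can be checked directly; the following sketch (my own, assuming base-2 logarithms) computes both quantities for this example:

```python
# Sketch: the entropy of the example distribution equals the expected
# length of the code a -> 0, b -> 10, c -> 110, d -> 1110, e -> 11110, f -> 11111.
import math

P = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/16, 'e': 1/32, 'f': 1/32}
code_len = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5}

H = -sum(p * math.log2(p) for p in P.values())
avg_len = sum(P[x] * code_len[x] for x in P)

print(H, avg_len)   # both are 1.9375 = 1 + 15/16 bits
```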
Relative entropy: information theory
what if we used a code designed for q?
- the average code length when transmitting X ∼ p is the cross entropy
  H(p, q) ≜ −∑_{x ∈ Val(X)} p(x) log(q(x))
  where −log(q(x)) is the optimal code length for X = x according to q
- the extra amount of information transmitted is the Kullback-Leibler divergence, or relative entropy:
  D(p ∥ q) ≜ ∑_{x ∈ Val(X)} p(x)(log(p(x)) − log(q(x)))
Relative entropy: information theory
Kullback-Leibler divergence:
D(p ∥ q) ≜ ∑_{x ∈ Val(X)} p(x)(log(p(x)) − log(q(x)))
some properties:
- non-negative, and zero iff p = q
- asymmetric
- relative to the uniform distribution u(x) = 1/N:
  D(p ∥ u) = ∑_x p(x)(log(p(x)) − log(1/N)) = log(N) − H(p)
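A small sketch (my own, in base-2 logs) computing the cross entropy and the KL divergence, and checking both the identity D(p ∥ u) = log(N) − H(p) and the asymmetry:

```python
# Sketch: cross entropy, KL divergence, and D(p || u) = log N - H(p).
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)    # D(p||q) = H(p,q) - H(p)

p = [0.7, 0.1, 0.1, 0.1]
u = [0.25, 0.25, 0.25, 0.25]                   # uniform over N = 4 values

print(kl(p, u), math.log2(4) - entropy(p))     # equal: log N - H(p)
print(kl(p, u), kl(u, p))                      # different values: KL is asymmetric
```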
Entropy: physics
- 16 microstates: the positions of 4 particles in a top/bottom box
- 5 macrostates: indistinguishable states, assuming exchangeable particles
- with Val(X) = {top, bottom}, each of the 5 macrostates corresponds to a different distribution (macrostate ≡ distribution)
- entropy of a macrostate: (normalized) log number of its microstates
Entropy: physics
entropy of a macrostate: normalized log #microstates
assume a large number of particles N, with N_t in the top box and N_b in the bottom box:
H = (1/N) ln(N! / (N_t! N_b!)) = (1/N)(ln(N!) − ln(N_t!) − ln(N_b!))
using Stirling's approximation ln(N!) ≃ N ln(N) − N:
H = −(N_t/N) ln(N_t/N) − (N_b/N) ln(N_b/N)
this is the entropy of P(X = top) = N_t/N, in nats instead of bits
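This approximation is easy to check numerically; here is a sketch (my own, using scipy's gammaln for log-factorials) comparing the normalized log microstate count with the two-outcome entropy in nats:

```python
# Sketch: for large N, (1/N) ln( N! / (N_t! N_b!) ) approaches the entropy
# (in nats) of P(X = top) = N_t / N.
import math
from scipy.special import gammaln

def log_factorial(n):
    return gammaln(n + 1)                      # ln(n!) via the gamma function

N, frac_top = 10_000, 0.3
N_t = int(frac_top * N)
N_b = N - N_t

H_micro = (log_factorial(N) - log_factorial(N_t) - log_factorial(N_b)) / N
H_macro = -frac_top * math.log(frac_top) - (1 - frac_top) * math.log(1 - frac_top)

print(H_micro, H_macro)                        # both roughly 0.611 nats
```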
Differential entropy (continuous domains)
divide the domain Val(X) into small bins of width Δ; for each bin there is an x_i ∈ (iΔ, (i+1)Δ) with
∫_{iΔ}^{(i+1)Δ} p(x) dx = p(x_i)Δ
H(p_Δ) = −∑_i p(x_i)Δ ln(p(x_i)Δ) = −ln(Δ) − ∑_i p(x_i)Δ ln(p(x_i))
ignore the −ln(Δ) term and take the limit Δ → 0 to get the differential entropy
H(p) ≜ −∫_{Val(X)} p(x) ln(p(x)) dx
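As a sanity check (my own sketch, not from the slides), the binned sum above can be compared against the known differential entropy of a Gaussian, (1/2) ln(2πeσ²):

```python
# Sketch: approximate the differential entropy of a Gaussian by binning,
# -sum_i p(x_i) * delta * ln(p(x_i)), after dropping the -ln(delta) term.
import numpy as np

sigma = 2.0
delta = 1e-3
x = np.arange(-10 * sigma, 10 * sigma, delta)          # bin representatives
p = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

H_binned = -np.sum(p * delta * np.log(p))
H_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # closed form for a Gaussian

print(H_binned, H_exact)                               # both about 2.112 nats
```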
Max-entropy distribution
maximize the entropy subject to constraints:
arg max_p H(p)
subject to
- p(x) > 0 ∀x
- ∫_{Val(X)} p(x) dx = 1
- E_p[ϕ_k(X)] = μ_k ∀k
using Lagrange multipliers, the solution has the form p(x) ∝ exp(∑_k θ_k ϕ_k(x))
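To see the Lagrange-multiplier result concretely, a small numerical sketch (my own illustration; the support {0,…,5} and the mean value 2 are arbitrary choices) maximizes the entropy of a discrete distribution under a mean constraint and checks that log p(x) comes out affine in x, i.e. p(x) ∝ exp(θ x):

```python
# Sketch: maximum entropy over {0,...,5} with E[X] = 2 fixed.
import numpy as np
from scipy.optimize import minimize

xs = np.arange(6)

def neg_entropy(p):
    return np.sum(p * np.log(p))                            # minimize -H(p)

constraints = [
    {'type': 'eq', 'fun': lambda p: np.sum(p) - 1.0},       # normalization
    {'type': 'eq', 'fun': lambda p: np.sum(p * xs) - 2.0},  # mean constraint
]

p0 = np.full(6, 1 / 6)                                      # start from uniform
res = minimize(neg_entropy, p0, method='SLSQP',
               bounds=[(1e-9, 1.0)] * 6, constraints=constraints)

# log p(x) should be affine in x, so its differences are (nearly) constant
print(np.diff(np.log(res.x)))
```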
Exponential family
an exponential family has the following form:
p(x; θ) = h(x) exp(⟨η(θ), ϕ(x)⟩ − A(θ))
- h(x): base measure
- ϕ(x): sufficient statistics
- ⟨·,·⟩: the inner product of two vectors
- A(θ): log-partition function
  A(θ) = ln(∫_{Val(X)} h(x) exp(⟨η(θ), ϕ(x)⟩) dx)
with a convex parameter space θ ∈ Θ = {θ ∈ ℜ^D | A(θ) < ∞}
Example: univariate Gaussian
moment form: p(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)² / (2σ²))
exponential-family form p(x; μ, σ²) = h(x) exp(⟨η(θ), ϕ(x)⟩ − A(θ)) with
- ϕ(x) = [x, x²]
- η(μ, σ²) = [μ/σ², −1/(2σ²)]
- A = (1/2)(ln(2πσ²) + μ²/σ²)
- h(x) = 1
for (μ, σ²) ∈ ℜ × ℜ⁺
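A quick numerical check of this rewriting (my own sketch) confirms that the moment form and the exponential-family form give the same density:

```python
# Sketch: univariate Gaussian, moment form vs exponential-family form with
# eta = [mu/sigma^2, -1/(2 sigma^2)], phi(x) = [x, x^2],
# A = (1/2)(ln(2 pi sigma^2) + mu^2/sigma^2), h(x) = 1.
import numpy as np

mu, sigma2 = 1.5, 0.8
x = np.linspace(-3.0, 5.0, 7)

moment_form = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

eta = np.array([mu / sigma2, -1 / (2 * sigma2)])
phi = np.stack([x, x**2])                                  # sufficient statistics
A = 0.5 * (np.log(2 * np.pi * sigma2) + mu**2 / sigma2)    # log-partition

exp_family_form = np.exp(eta @ phi - A)

print(np.allclose(moment_form, exp_family_form))           # True
```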
Example: Bernoulli
conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x)
exponential-family form p(x; μ) = h(x) exp(⟨η(θ), ϕ(x)⟩ − A(θ)) with
- ϕ(x) = [I(x = 1), I(x = 0)]
- η(μ) = [ln(μ), ln(1 − μ)]
- h(x) = 1, A = 0
for μ ∈ (0, 1)
Linear exponential family
when using natural parameters:
p(x; θ) = h(x) exp(⟨θ, ϕ(x)⟩ − A(θ))
what about the natural parameters η(θ)? simply define θ to be the new natural parameters
the base measure h(x) can be absorbed as an extra sufficient statistic ln h(x) whose natural parameter is fixed to θ = 1
the natural parameter space needs to be convex: θ ∈ Θ = {θ ∈ ℜ^D | A(θ) < ∞}
Example: univariate Gaussian, take 2
natural parameters in the univariate Gaussian:
p(x; θ) = exp(⟨θ, ϕ(x)⟩ − A(θ)) with
- ϕ(x) = [x, x²]
- θ = [μ/σ², −1/(2σ²)]
- A(θ) = −θ₁²/(4θ₂) + (1/2) ln(−π/θ₂)
where θ ∈ Θ = ℜ × ℜ⁻ is a convex set
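The same check with the log-partition written directly in terms of the natural parameters (my own sketch; the expression for A(θ) follows by substituting θ₁ = μ/σ² and θ₂ = −1/(2σ²)):

```python
# Sketch: Gaussian with natural parameters theta = [mu/sigma^2, -1/(2 sigma^2)]
# and A(theta) = -theta_1^2/(4 theta_2) + (1/2) ln(-pi/theta_2).
import numpy as np

mu, sigma2 = 1.5, 0.8
theta = np.array([mu / sigma2, -1 / (2 * sigma2)])        # theta_2 < 0

def A(theta):
    return -theta[0]**2 / (4 * theta[1]) + 0.5 * np.log(-np.pi / theta[1])

x = np.linspace(-3.0, 5.0, 7)
p_natural = np.exp(theta[0] * x + theta[1] * x**2 - A(theta))
p_moment = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(np.allclose(p_natural, p_moment))                   # True
```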
Example: Bernoulli, take 2
conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x)
exponential-family form p(x; θ) = h(x) exp(⟨θ, ϕ(x)⟩ − A(θ)) with
- ϕ(x) = [I(x = 1), I(x = 0)]
- θ = [ln(μ), ln(1 − μ)]
however Θ is not a convex set (the valid parameters lie on the curve e^{θ₁} + e^{θ₂} = 1)
Example: Bernoulli, take 3
conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x)
exponential-family form p(x; θ) = h(x) exp(⟨θ, ϕ(x)⟩ − A(θ)) with
- ϕ(x) = [I(x = 1), I(x = 0)]
- θ ∈ ℜ²
this parametrization is redundant or overcomplete:
p(x; [θ₁, θ₂]) = p(x; [θ₁ + c, θ₂ + c])
a parametrization is redundant iff there is a nonzero θ and a constant c such that ⟨θ, ϕ(x)⟩ = c for all x
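The overcompleteness is easy to verify numerically; in the sketch below (my own; the slide leaves the log-partition implicit, here taken as A(θ) = log(e^{θ₁} + e^{θ₂})), shifting both components of θ by the same constant leaves the probabilities unchanged:

```python
# Sketch: overcomplete Bernoulli with phi(x) = [I(x=1), I(x=0)] and
# A(theta) = log(exp(theta_1) + exp(theta_2)); adding the same constant c
# to both components of theta gives the same distribution.
import numpy as np

def bernoulli_overcomplete(x, theta):
    phi = np.array([x == 1, x == 0], dtype=float)
    A = np.log(np.exp(theta).sum())
    return np.exp(theta @ phi - A)

theta = np.array([0.3, -1.2])
for c in [0.0, 5.0, -7.0]:
    shifted = theta + c
    print(bernoulli_overcomplete(1, shifted), bernoulli_overcomplete(0, shifted))
# identical probabilities for every shift c
```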
Example: Bernoulli, take 4
conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x)
exponential-family form p(x; θ) = h(x) exp(⟨θ, ϕ(x)⟩ − A(θ)) with
- ϕ(x) = [I(x = 1)]
- θ = ln(μ/(1 − μ))
- A(θ) = log(1 + e^θ)
Θ = ℜ is convex and this parametrization is minimal
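A final sketch (my own) checking that this minimal parametrization, with θ the log-odds and A(θ) = log(1 + e^θ), reproduces the mean parametrization:

```python
# Sketch: minimal Bernoulli with phi(x) = [I(x=1)], theta = ln(mu/(1-mu)),
# A(theta) = log(1 + exp(theta)).
import numpy as np

mu = 0.3
theta = np.log(mu / (1 - mu))            # natural parameter (log-odds)
A = np.log(1 + np.exp(theta))            # log-partition function

for x in (0, 1):
    p_natural = np.exp(theta * (x == 1) - A)
    p_mean = mu**x * (1 - mu)**(1 - x)
    print(p_natural, p_mean)             # equal: 0.7 for x = 0, 0.3 for x = 1
```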