
Probabilistic Graphical Models: Exponential Family & Variational Inference I



  1. Probabilistic Graphical Models: Exponential Family & Variational Inference I. Siamak Ravanbakhsh, Fall 2019.

  2. Learning objectives: entropy; the exponential family distribution; duality in the exponential family; the relationship between its two parametrizations; inference and learning as mappings between the two; relative entropy and two types of projections.

  3. A measure of information: a measure of information I(X = x) should have these characteristics: observing a less probable event gives more information; information is non-negative, and I(X = x) = 0 ⇔ P(X = x) = 1; information from independent events is additive, A = a ⊥ B = b ⇒ I(A = a, B = b) = I(A = a) + I(B = b). The definition follows from these characteristics: I(X = x) ≜ log(1 / P(X = x)) = −log(P(X = x)).
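A quick numeric check of these characteristics (a minimal sketch of my own; the probabilities below are arbitrary examples, and the helper info is not from the slides):

```python
import numpy as np

def info(p):
    """Information content I(X = x) = -log2 P(X = x), in bits."""
    return -np.log2(p)

# A less probable event carries more information.
print(info(0.5), info(0.01))     # 1.0 bits vs ~6.64 bits

# A certain event carries no information.
print(info(1.0))                 # 0.0

# Additivity for independent events: I(A = a, B = b) = I(A = a) + I(B = b).
p_a, p_b = 0.25, 0.1
print(np.isclose(info(p_a * p_b), info(p_a) + info(p_b)))  # True
```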

  4. Entropy (information theory): the information in observing X = x is I(X = x) ≜ −log(P(X = x)); entropy is the expected amount of information, H(P) ≜ E[I(X)] = −∑_{x ∈ Val(X)} P(X = x) log(P(X = x)). It achieves its maximum for the uniform distribution: 0 ≤ H(P) ≤ log(|Val(X)|).
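A small sketch of the definition and the bounds (the example distributions are mine, not the slides'):

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x); outcomes with P(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

point   = [1.0, 0.0, 0.0, 0.0]   # deterministic outcome
skewed  = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25, 0.25, 0.25, 0.25]

print(entropy(point))    # 0.0: lower bound
print(entropy(skewed))   # ~1.36 bits
print(entropy(uniform))  # 2.0 = log2(4): upper bound, with |Val(X)| = 4
```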

  5. Entropy (information theory), alternatively: the expected (optimal) message length in reporting the observed X, e.g., using Huffman coding. Take Val(X) = {a, b, c, d, e, f} with P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16, P(e) = P(f) = 1/32. An optimal code for transmitting X is a → 0, b → 10, c → 110, d → 1110, e → 11110, f → 11111. Its average length is 1·(1/2) + 2·(1/4) + 3·(1/8) + 4·(1/16) + 5·(1/32) + 5·(1/32) = 1 15/16 (the first term, 1/2, is the contribution to the average length from X = a), which equals H(P) = −(1/2)log(1/2) − (1/4)log(1/4) − (1/8)log(1/8) − (1/16)log(1/16) − (1/16)log(1/32) = 1 15/16.
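The code table above can be reproduced mechanically; below is a minimal sketch of Huffman's algorithm (the helper huffman_lengths is mine, not the slides') checked against this distribution, where the average code length matches H(P) exactly because the probabilities are powers of 1/2:

```python
import heapq
import numpy as np

def huffman_lengths(probs):
    """Return the Huffman code length for each symbol of a distribution."""
    # Heap entries: (subtree probability, tie-breaker, symbols in the subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:     # each merge adds one bit to these symbols' codes
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
        tie += 1
    return lengths

P = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/16, 'e': 1/32, 'f': 1/32}
L = huffman_lengths(P)                        # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5}
avg_len = sum(P[s] * L[s] for s in P)         # 1.9375 = 1 15/16
H = -sum(p * np.log2(p) for p in P.values())  # 1.9375 as well
print(L, avg_len, H)
```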

  6. Relative entropy (information theory): what if we used a code designed for q? The average code length when transmitting X ∼ p is the cross entropy, H(p, q) ≜ −∑_{x ∈ Val(X)} p(x) log(q(x)), where −log(q(x)) is the optimal code length for X = x according to q. The extra amount of information transmitted is the Kullback-Leibler divergence, or relative entropy: D(p ∥ q) ≜ ∑_{x ∈ Val(X)} p(x)(log(p(x)) − log(q(x))).
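A direct translation of the two definitions (a sketch; the example distributions p and q are mine):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average code length under q's code."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q))

def kl(p, q):
    """D(p || q) = sum_x p(x) (log2 p(x) - log2 q(x)) = H(p, q) - H(p)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * (np.log2(p) - np.log2(q)))

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])
H_p = -np.sum(p * np.log2(p))                            # 1.75 bits
print(cross_entropy(p, q))                               # 2.0 bits under q's code
print(kl(p, q))                                          # 0.25 bits of extra length
print(np.isclose(cross_entropy(p, q), H_p + kl(p, q)))   # True
```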

  7. Relative entropy (information theory): the Kullback-Leibler divergence D(p ∥ q) ≜ ∑_{x ∈ Val(X)} p(x)(log(p(x)) − log(q(x))). Some properties: it is non-negative, and zero iff p = q; it is asymmetric; against the uniform distribution u over N values, D(p ∥ u) = ∑_x p(x)(log(p(x)) − log(1/N)) = log(N) − H(p).
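The asymmetry and the uniform-reference identity can be checked numerically (a sketch reusing the kl helper from the previous snippet; the example p is mine):

```python
import numpy as np

def kl(p, q):
    """D(p || q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * (np.log2(p) - np.log2(q)))

p = np.array([0.8, 0.1, 0.1])
u = np.ones(3) / 3                      # uniform distribution over N = 3 values

# Asymmetric: the two directions generally differ.
print(kl(p, u), kl(u, p))               # ~0.663 vs ~0.737

# D(p || u) = log N - H(p).
H_p = -np.sum(p * np.log2(p))
print(np.isclose(kl(p, u), np.log2(3) - H_p))   # True
```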

  8. Entropy (physics): 16 microstates, the positions of 4 particles in a top/bottom box; 5 macrostates, the indistinguishable states assuming exchangeable particles. With Val(X) = {top, bottom} we can identify the 5 macrostates with 5 different distributions, p(top) ∈ {0, 1/4, 1/2, 3/4, 1}; each macrostate is a distribution. Which distribution is more likely? The entropy of a macrostate is the (normalized) log of the number of its microstates.
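This toy system is small enough to enumerate directly (a sketch of my own):

```python
from itertools import product
from collections import Counter

# Each of the 4 exchangeable particles sits in the top or bottom half of the box.
microstates = list(product(['top', 'bottom'], repeat=4))   # 2**4 = 16 microstates

# A macrostate only records the fraction of particles on top, i.e. p(top).
macrostates = Counter(s.count('top') / 4 for s in microstates)

# {0.0: 1, 0.25: 4, 0.5: 6, 0.75: 4, 1.0: 1}: the macrostate p(top) = 1/2 has
# the most microstates, so it is the most likely one.
print(dict(sorted(macrostates.items())))
```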

  9. Entropy (physics): the entropy of a macrostate is the (normalized) log of the number of its microstates. Assume a large number of particles N, with N_t in the top box and N_b in the bottom box: H_macrostate = (1/N) ln(N! / (N_t! N_b!)) = (1/N)(ln(N!) − ln(N_t!) − ln(N_b!)). Using Stirling's approximation ln(N!) ≃ N ln(N) − N, this becomes c − (N_t/N) ln(N_t/N) − (N_b/N) ln(N_b/N) = −∑_{x ∈ {top, bottom}} p(x) ln(p(x)), with p(top) = N_t/N (and c a constant).
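A numeric check that the exact normalized log-count converges to −∑ p(x) ln p(x) as N grows (a sketch; ln(n!) is computed as math.lgamma(n + 1)):

```python
import math

def exact_macrostate_entropy(N, p_top):
    """(1/N) * ln( N! / (N_t! N_b!) ), computed exactly via log-gamma."""
    Nt = round(p_top * N)
    Nb = N - Nt
    return (math.lgamma(N + 1) - math.lgamma(Nt + 1) - math.lgamma(Nb + 1)) / N

def plugin_entropy(p_top):
    """-sum_x p(x) ln p(x) for the two-outcome distribution {top, bottom}."""
    if p_top in (0.0, 1.0):
        return 0.0
    return -(p_top * math.log(p_top) + (1 - p_top) * math.log(1 - p_top))

for N in (4, 100, 10_000, 1_000_000):
    print(N, exact_macrostate_entropy(N, 0.25), plugin_entropy(0.25))
# The exact value approaches -0.25 ln(0.25) - 0.75 ln(0.75) ≈ 0.5623 as N grows.
```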

  10. Differential entropy, for continuous domains: divide the domain Val(X) into small bins of width Δ; for each bin, ∃ x_i ∈ (iΔ, (i + 1)Δ) with ∫_{iΔ}^{(i+1)Δ} p(x) dx = p(x_i)Δ. Then H_Δ(p) = −∑_i p(x_i)Δ ln(p(x_i)Δ) = −ln(Δ) − ∑_i p(x_i)Δ ln(p(x_i)). Ignoring the −ln(Δ) term and taking the limit Δ → 0 gives H(p) ≜ −∫_{Val(X)} p(x) ln(p(x)) dx.
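A sketch of the binned construction for a standard Gaussian (my example, not the slides'): the discrete entropy of the bins minus the −ln(Δ) term approaches the closed-form differential entropy ½ ln(2πe).

```python
import numpy as np

def gaussian_pdf(x):
    """Standard normal density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

delta = 1e-3
x = np.arange(-10, 10, delta)          # bin centres covering (effectively) the support
p = gaussian_pdf(x)

# Discrete entropy of the binned distribution, then drop the -ln(delta) part.
H_binned = -np.sum(p * delta * np.log(p * delta))
H_diff = H_binned + np.log(delta)

print(H_diff)                          # ~1.4189
print(0.5 * np.log(2 * np.pi * np.e))  # closed form: 1.41894...
```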

  11. Max-entropy distribution: a high-entropy distribution means more information in observing X ∼ p, it is a more likely "macrostate", and it makes the least amount of assumptions about p. So when optimizing for p(x) subject to constraints, maximize the entropy: argmax_p H(p) subject to p(x) > 0 ∀x, ∫_{Val(X)} p(x) dx = 1, and E_p[ϕ_k(X)] = μ_k ∀k. Solving with Lagrange multipliers gives p(x) ∝ exp(∑_k θ_k ϕ_k(x)).
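The Lagrange-multiplier step is only named on the slide; here is a sketch of the standard derivation (the positivity constraint is inactive at the solution and is dropped):

```latex
% Lagrangian: entropy plus normalization and moment constraints
\mathcal{L}(p, \lambda, \theta)
  = -\int p(x) \ln p(x)\, dx
  + \lambda \Big( \int p(x)\, dx - 1 \Big)
  + \sum_k \theta_k \Big( \int p(x)\, \phi_k(x)\, dx - \mu_k \Big)

% Stationarity in p(x):
\frac{\delta \mathcal{L}}{\delta p(x)}
  = -\ln p(x) - 1 + \lambda + \sum_k \theta_k \phi_k(x) = 0
\quad \Rightarrow \quad
p(x) = \exp\!\Big( \lambda - 1 + \sum_k \theta_k \phi_k(x) \Big)
     \propto \exp\!\Big( \sum_k \theta_k \phi_k(x) \Big)
```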

  12. Exponential family: an exponential family has the following form: p(x; θ) = h(x) exp(⟨η(θ), ϕ(x)⟩ − A(θ)), where h(x) is the base measure, ϕ(x) the sufficient statistics, ⟨·, ·⟩ the inner product of two vectors, and A(θ) the log-partition function, A(θ) = ln(∫_{Val(X)} h(x) exp(∑_k θ_k ϕ_k(x)) dx), with a convex parameter space θ ∈ Θ = {θ ∈ ℝ^D : A(θ) < ∞}.
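As a concrete instance (my example, not the slides'): the Bernoulli distribution written in this form, with ϕ(x) = x, h(x) = 1, natural parameter θ = log(μ / (1 − μ)), and A(θ) = log(1 + e^θ):

```python
import numpy as np

def bernoulli_expfam(x, theta):
    """p(x; theta) = h(x) exp(theta * phi(x) - A(theta)) with phi(x) = x, h(x) = 1."""
    A = np.log1p(np.exp(theta))       # log-partition function A(theta) = log(1 + e^theta)
    return np.exp(theta * x - A)

mu = 0.3
theta = np.log(mu / (1 - mu))         # natural parameter corresponding to mean mu

print(bernoulli_expfam(1, theta))     # 0.3  = mu
print(bernoulli_expfam(0, theta))     # 0.7  = 1 - mu
print(bernoulli_expfam(0, theta) + bernoulli_expfam(1, theta))   # 1.0: normalized
```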
