  1. Exponential Family Distributions CMSC 691 UMBC

  2. Exponential Family Form p_η(x) = h(x) exp(η^T f(x) − A(η))

  3. Exponential Family Form Support function h(x) • Formally necessary, often irrelevant (e.g., Gaussian distributions), except when it isn’t (e.g., Dirichlet distributions)

  4. Exponential Family Form Distribution Parameters η • Natural parameters • Feature weights

  5. Exponential Family Form Sufficient statistics f(x) • Feature function(s)

  6. Exponential Family Form Log-normalizer A(η)

  7. Exponential Family Form Log-normalizer
     A(η) = log Σ_{x′} h(x′) exp(η^T f(x′))        (discrete x)
     A(η) = log ∫ h(x′) exp(η^T f(x′)) dx′         (continuous x)
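The discrete-case log-normalizer can be sketched numerically. This is an illustrative sketch (not from the slides), assuming the categorical running example with h(x) = 1 and one-hot f(x), so η^T f(x) picks out a single component and the sum becomes a log-sum-exp:

```python
import math

# Log-normalizer A(eta) = log sum_{x'} h(x') exp(eta^T f(x')) for a
# discrete variable. Assumed example: a categorical over L outcomes
# with h(x) = 1 and one-hot f(x), so eta^T f(x) = eta[x].
def log_normalizer(eta):
    m = max(eta)  # subtract the max before exponentiating, for stability
    return m + math.log(sum(math.exp(e - m) for e in eta))

# If eta = log(theta) for a theta that already sums to 1, then A(eta) = 0:
# the distribution needs no extra normalization.
eta = [math.log(0.2), math.log(0.3), math.log(0.5)]
print(log_normalizer(eta))
```

Subtracting the max before exponentiating is the standard log-sum-exp trick; without it, large natural parameters overflow `exp`.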

  8. Why Bother with This? • A common form for common distributions • “Easily” compute gradients of likelihood wrt parameters • “Easily” compute expectations, especially entropy and KL divergence • “Easy” posterior inference via conjugate distributions

  9. Why? Capture Common Distributions Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma, … These can all be written in this “common” form (different h, f, and A functions):
     p_η(x) = h(x) exp(η^T f(x) − A(η))
     See a good stats book, or https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions
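As one concrete instance from this list (a sketch, not worked on the slides): the Poisson pmf λ^x e^{−λ}/x! factors into the common form with h(x) = 1/x!, η = log λ, f(x) = x, and A(η) = e^η = λ, which can be verified numerically:

```python
import math

# Poisson in exponential family form: h(x) = 1/x!, eta = log(lam),
# f(x) = x, A(eta) = exp(eta) = lam. Checked against the usual pmf.
def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_exp_family(x, lam):
    eta = math.log(lam)
    h = 1.0 / math.factorial(x)
    return h * math.exp(eta * x - math.exp(eta))

for x in range(10):
    assert abs(poisson_pmf(x, 3.5) - poisson_exp_family(x, 3.5)) < 1e-12
```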

  10. Why? Capture Common Distributions Discrete/Categorical (Finite distributions)
      “Traditional” Form: p_θ(X = i) = Π_k θ_k^{1[i = k]}, where 1[d] = 1 if d is true, 0 if d is false
      Exponential Family Form: η = ???, f(x) = ???, h(x) = ???

  11. Why? Capture Common Distributions Discrete/Categorical (Finite distributions) How do we find this? ab = exp(log a + log b)
      “Traditional” Form: p_θ(X = i) = Π_k θ_k^{1[i = k]}, where 1[d] = 1 if d is true, 0 if d is false

  12. Why? Capture Common Distributions Discrete/Categorical (Finite distributions) How do we find this? ab = exp(log a + log b)
      “Traditional” Form: p_θ(X = i) = Π_k θ_k^{1[i = k]}, where 1[d] = 1 if d is true, 0 if d is false
      p_θ(X = i) = Π_k θ_k^{1[i = k]} = exp(Σ_k 1[i = k] · log θ_k)

  13. Why? Capture Common Distributions Discrete/Categorical (Finite distributions) How do we find this? ab = exp(log a + log b)
      “Traditional” Form: p_θ(X = i) = Π_k θ_k^{1[i = k]}, where 1[d] = 1 if d is true, 0 if d is false
      p_θ(X = i) = Π_k θ_k^{1[i = k]} = exp(Σ_k 1[i = k] · log θ_k)
                 = exp((log θ)^T (1[i = 1], …, 1[i = L]))

  14. Why? Capture Common Distributions Discrete/Categorical (Finite distributions)
      “Traditional” Form: p_θ(X = i) = Π_k θ_k^{1[i = k]}, where 1[d] = 1 if d is true, 0 if d is false
      Exponential Family Form: η = (log θ_1, …, log θ_L), f(x) = (1[x = 1], …, 1[x = L]), h(x) = 1
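The two categorical forms can be checked against each other numerically. A minimal sketch; θ here is an arbitrary example parameter vector, not from the slides:

```python
import math

# Check that the "traditional" and exponential-family forms of the
# categorical agree pointwise.
theta = [0.2, 0.3, 0.5]
L = len(theta)

def p_traditional(x):
    # prod_k theta_k^{1[x = k]}: every factor is 1 except theta_x
    return math.prod(theta[k] ** (1 if x == k else 0) for k in range(L))

def p_exp_family(x):
    eta = [math.log(t) for t in theta]               # natural parameters
    f = [1.0 if x == k else 0.0 for k in range(L)]   # one-hot sufficient stats
    A = 0.0  # log sum_k exp(eta_k) = log(1), because theta sums to 1
    return math.exp(sum(e * fk for e, fk in zip(eta, f)) - A)  # h(x) = 1

for x in range(L):
    assert abs(p_traditional(x) - p_exp_family(x)) < 1e-12
```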

  15. Why? Capture Common Distributions Gaussian “Traditional” Form / Exponential Family Form: h(x) = 1

  16. Why? Capture Common Distributions Dirichlet “Traditional” Form / Exponential Family Form:
      If we assume x ∈ Δ^{L−1}: h(x) = 1
      If we explicitly enforce x ∈ Δ^{L−1}: h(x) = 1[Σ_l x_l = 1]

  17. Why? Capture Common Distributions Discrete (Finite distributions); Dirichlet (Distributions over (finite) distributions); Gaussian; Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, …

  18. Why? “Easy” Gradients Gradient of likelihood

  19. Why? “Easy” Gradients Gradient of likelihood:
      observed sufficient statistics (feature counts) − expected sufficient statistics (feature counts)

  20. Why? “Easy” Gradients Gradient of likelihood:
      observed sufficient statistics (“count” w.r.t. the empirical distribution) − expected sufficient statistics (“count” w.r.t. current model parameters)
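This "observed minus expected" gradient can be verified by finite differences. A sketch assuming the categorical running example (one-hot f, h = 1); the particular η and x are arbitrary illustrations:

```python
import math

# Finite-difference check that d/d eta_k log p(x | eta)
# equals f_k(x) - E[f_k(x)]  (observed minus expected sufficient stats).
def A(eta):
    m = max(eta)
    return m + math.log(sum(math.exp(e - m) for e in eta))

def log_p(x, eta):
    return eta[x] - A(eta)   # eta^T f(x) = eta[x] for one-hot f, h(x) = 1

eta = [0.1, -0.4, 0.7]
x = 1
probs = [math.exp(e - A(eta)) for e in eta]  # expected sufficient statistics

eps = 1e-6
for k in range(len(eta)):
    bumped = list(eta)
    bumped[k] += eps
    numeric = (log_p(x, bumped) - log_p(x, eta)) / eps
    analytic = (1.0 if k == x else 0.0) - probs[k]  # observed - expected
    assert abs(numeric - analytic) < 1e-4
```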

  21. Why? “Easy” Expectations expectation of the sufficient statistics = gradient of the log-normalizer: E_η[f(x)] = ∇_η A(η)
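The identity E[f(x)] = ∇_η A(η) can likewise be checked by finite differences, again assuming the categorical example (one-hot f, h = 1; η is an arbitrary illustration):

```python
import math

# Finite-difference check of E[f(x)] = grad_eta A(eta) for a categorical
# with one-hot sufficient statistics.
def A(eta):
    m = max(eta)
    return m + math.log(sum(math.exp(e - m) for e in eta))

eta = [0.1, -0.4, 0.7]
expected_f = [math.exp(e - A(eta)) for e in eta]  # E[f_k] = p(x = k)

eps = 1e-6
for k in range(len(eta)):
    bumped = list(eta)
    bumped[k] += eps
    assert abs((A(bumped) - A(eta)) / eps - expected_f[k]) < 1e-4
```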

  22. Why Bother with This? • A common form for common distributions • “Easily” compute gradients of likelihood wrt parameters • “Easily” compute expectations, especially entropy and KL divergence • “Easy” posterior inference via conjugate distributions

  23. Conjugate Distributions • Let θ ∼ p, and let x|θ ∼ q • If p is the conjugate prior for q, then the posterior distribution p(θ|x) is of the same type/family as the prior p(θ)

  24. Why? “Easy” Posterior Inference

  25. Why? “Easy” Posterior Inference p is the conjugate prior for q

  26. Why? “Easy” Posterior Inference Posterior p has same form as prior p p is the conjugate prior for q

  27. Why? “Easy” Posterior Inference Posterior p has same form as prior p p is the conjugate prior for q All exponential family models have a conjugate prior (in theory)

  28. Why? “Easy” Posterior Inference Posterior p has same form as prior p; p is the conjugate prior for q
      Posterior        | Likelihood           | Prior
      Dirichlet (Beta) | Discrete (Bernoulli) | Dirichlet (Beta)
      Normal           | Normal (fixed var.)  | Normal
      Gamma            | Exponential          | Gamma
      …

  29. Conjugate Prior Example • p(θ) = Dir(α), q(x_i | θ) = Cat(θ) i.i.d. • Let f(x) be the Cat sufficient statistic function • p(θ | x) = Dir(α + Σ_i f(x_i))
      p(θ | x_1, …, x_N) ∝ q(x_1, …, x_N | θ) p(θ)

  30. Conjugate Prior Example • p(θ) = Dir(α), q(x_i | θ) = Cat(θ) i.i.d. • Let f(x) be the Cat sufficient statistic function • p(θ | x) = Dir(α + Σ_i f(x_i))
      p(θ | x_1, …, x_N) ∝ q(x_1, …, x_N | θ) p(θ)
      = [Π_i exp((log θ)^T f(x_i))] · exp((α − 1)^T log θ − A(α))   (rewrite q and p with exponential family forms)

  31. Conjugate Prior Example • p(θ) = Dir(α), q(x_i | θ) = Cat(θ) i.i.d. • Let f(x) be the Cat sufficient statistic function • p(θ | x) = Dir(α + Σ_i f(x_i))
      p(θ | x_1, …, x_N) ∝ q(x_1, …, x_N | θ) p(θ)
      = [Π_i exp((log θ)^T f(x_i))] · exp((α − 1)^T log θ − A(α))
      = exp(Σ_i (log θ)^T f(x_i) + (α − 1)^T log θ − A(α))   (replace with specific natural parameters and sufficient statistic functions)

  32. Conjugate Prior Example • p(θ) = Dir(α), q(x_i | θ) = Cat(θ) i.i.d. • Let f(x) be the Cat sufficient statistic function • p(θ | x) = Dir(α + Σ_i f(x_i))
      p(θ | x_1, …, x_N) ∝ q(x_1, …, x_N | θ) p(θ)
      = [Π_i exp((log θ)^T f(x_i))] · exp((α − 1)^T log θ − A(α))
      = exp(Σ_i (log θ)^T f(x_i) + (α − 1)^T log θ − A(α))   (notice common terms that can be simplified together)

  33. Conjugate Prior Example • p(θ) = Dir(α), q(x_i | θ) = Cat(θ) i.i.d. • Let f(x) be the Cat sufficient statistic function • p(θ | x) = Dir(α + Σ_i f(x_i))
      p(θ | x_1, …, x_N) ∝ q(x_1, …, x_N | θ) p(θ)
      = [Π_i exp((log θ)^T f(x_i))] · exp((α − 1)^T log θ − A(α))
      = exp(Σ_i (log θ)^T f(x_i) + (α − 1)^T log θ − A(α))
      = exp((α − 1 + Σ_i f(x_i))^T log θ − A(α))   (group common terms)

  34. Conjugate Prior Example • p(θ) = Dir(α), q(x_i | θ) = Cat(θ) i.i.d. • Let f(x) be the Cat sufficient statistic function • p(θ | x) = Dir(α + Σ_i f(x_i))
      p(θ | x_1, …, x_N) ∝ q(x_1, …, x_N | θ) p(θ)
      = [Π_i exp((log θ)^T f(x_i))] · exp((α − 1)^T log θ − A(α))
      = exp(Σ_i (log θ)^T f(x_i) + (α − 1)^T log θ − A(α))
      = exp((α − 1 + Σ_i f(x_i))^T log θ − A(α))   (Notice: this is the form of a Dirichlet.)
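The Dirichlet-Categorical derivation reduces to a one-line update in code. A sketch of that update: with prior Dir(α) and i.i.d. Cat(θ) observations, the posterior is Dir(α + Σ_i f(x_i)), and for the one-hot sufficient statistic that sum is just a vector of counts (the α and data below are illustrative):

```python
# Conjugate update: posterior = Dir(alpha + counts), because the Cat
# sufficient statistic f(x) is one-hot, so sum_i f(x_i) is a count vector.
def dirichlet_posterior(alpha, observations):
    counts = [0] * len(alpha)
    for x in observations:
        counts[x] += 1            # accumulate sum_i f(x_i)
    return [a + c for a, c in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]           # symmetric example prior
data = [0, 2, 2, 1, 2]
print(dirichlet_posterior(alpha, data))  # [2.0, 2.0, 4.0]
```

This is why conjugacy makes posterior inference "easy": no integration is required, only adding observed feature counts to the prior's parameters.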

  35. Why Bother with This? • A common form for common distributions • “Easily” compute gradients of likelihood wrt parameters • “Easily” compute expectations, especially entropy and KL divergence • “Easy” posterior inference via conjugate distributions
