Exponential Family Distributions CMSC 691 UMBC
Exponential Family Form
p_η(x) = h(x) exp(η^T f(x) − A(η))
Exponential Family Form: Support function h(x)
• Formally necessary, often irrelevant (e.g., Gaussian distributions), except when it isn’t (e.g., Dirichlet distributions)
Exponential Family Form: Distribution parameters η
• Natural parameters
• Feature weights
Exponential Family Form: Sufficient statistics f(x)
• Feature function(s)
Exponential Family Form: Log-normalizer A(η)
Discrete x:   A(η) = log Σ_{x'} h(x') exp(η^T f(x'))
Continuous x: A(η) = log ∫ h(x') exp(η^T f(x')) dx'
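For a finite support, the log-normalizer is just a log-sum-exp over the support. A minimal sketch (assuming a categorical distribution with one-hot f(x), so η^T f(x) = η[x], and h(x) = 1; the η values are hypothetical):

```python
import math

def log_normalizer(eta):
    """A(eta) = log sum_x h(x) * exp(eta^T f(x)) for a categorical
    distribution, where f(x) is one-hot and h(x) = 1.
    Computed with the log-sum-exp trick for numerical stability."""
    m = max(eta)
    return m + math.log(sum(math.exp(e - m) for e in eta))

# exp(eta[x] - A(eta)) is then a properly normalized distribution.
eta = [0.5, -1.2, 2.0]   # hypothetical natural parameters
A = log_normalizer(eta)
probs = [math.exp(e - A) for e in eta]
print(sum(probs))        # sums to 1
```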
Why Bother with This? • A common form for common distributions • “Easily” compute gradients of likelihood wrt parameters • “Easily” compute expectations, especially entropy and KL divergence • “Easy” posterior inference via conjugate distributions
Why? Capture Common Distributions
Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma, … can all be written in this “common” form (with different h, f, and A functions):
p_η(x) = h(x) exp(η^T f(x) − A(η))
See a good stats book, or https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions
Why? Capture Common Distributions: Discrete/Categorical (finite distributions)
“Traditional” form: p_θ(X = x) = ∏_k θ_k^{1[x = k]}, where 1[a] = 1 if a is true and 0 if a is false
Exponential family form: η = ???, f(x) = ???, h(x) = ???
Why? Capture Common Distributions: Discrete/Categorical (finite distributions)
How do we find this? Use ab = exp(log a + log b):
p_θ(X = x) = ∏_k θ_k^{1[x = k]}
           = exp(Σ_k 1[x = k] · log θ_k)
           = exp([1[x = 1], …, 1[x = L]]^T log θ)
Why? Capture Common Distributions: Discrete/Categorical (finite distributions)
Exponential family form:
η = (log θ_1, …, log θ_L)
f(x) = (1[x = 1], …, 1[x = L])
h(x) = 1
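This mapping is easy to check numerically. A sketch (with hypothetical θ values) comparing the traditional form θ_x against h(x) exp(η^T f(x) − A(η)):

```python
import math

theta = [0.2, 0.5, 0.3]             # hypothetical categorical parameters
eta = [math.log(t) for t in theta]  # natural parameters: log theta

def f(x, L):
    """Sufficient statistics: one-hot indicator vector."""
    return [1.0 if x == k else 0.0 for k in range(L)]

def A(eta):
    """Log-normalizer: log sum_x exp(eta[x]); equals 0 when theta sums to 1."""
    m = max(eta)
    return m + math.log(sum(math.exp(e - m) for e in eta))

for x in range(3):
    exp_family = math.exp(sum(e * fx for e, fx in zip(eta, f(x, 3))) - A(eta))
    print(x, theta[x], exp_family)   # the two columns agree
```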
Why? Capture Common Distributions: Gaussian
“Traditional” form: p(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
Exponential family form:
η = (μ/σ², −1/(2σ²))
f(x) = (x, x²)
h(x) = 1
A(η) = μ²/(2σ²) + ½ log(2πσ²)
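The standard Gaussian decomposition (η = (μ/σ², −1/(2σ²)), f(x) = (x, x²), h(x) = 1, A(η) = μ²/(2σ²) + ½ log(2πσ²)) can also be verified numerically. A sketch with hypothetical μ and σ²:

```python
import math

mu, sigma2 = 1.5, 0.8   # hypothetical mean and variance

def gauss_pdf(x):
    """Gaussian density in its "traditional" form."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Exponential family pieces (with h(x) = 1):
eta = (mu / sigma2, -1.0 / (2 * sigma2))                           # natural parameters
A = mu ** 2 / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)  # log-normalizer

for x in (-1.0, 0.0, 2.3):
    ef = math.exp(eta[0] * x + eta[1] * x * x - A)  # f(x) = (x, x^2)
    print(x, gauss_pdf(x), ef)                      # the two forms agree
```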
Why? Capture Common Distributions: Dirichlet
“Traditional” form: p_α(x) = (1/B(α)) ∏_l x_l^{α_l − 1}
Exponential family form (η = α − 1, f(x) = (log x_1, …, log x_L)):
• If we assume x ∈ Δ^{L−1}: h(x) = 1
• If we explicitly enforce x ∈ Δ^{L−1}: h(x) = 1[Σ_l x_l = 1]
Why? Capture Common Distributions
• Discrete (finite distributions)
• Dirichlet (distributions over (finite) distributions)
• Gaussian
• Gamma, Exponential, Poisson, Negative Binomial, Laplace, log-Normal, …
Why? “Easy” Gradients
Gradient of the log-likelihood:
∇_η log p_η(x) = f(x) − E_{p_η}[f(x)]
• f(x): observed sufficient statistics (feature counts); a “count” w.r.t. the empirical distribution
• E_{p_η}[f(x)]: expected sufficient statistics (feature counts); a “count” w.r.t. the current model parameters
Why? “Easy” Expectations
The expectation of the sufficient statistics is the gradient of the log-normalizer:
E_{p_η}[f(x)] = ∇_η A(η)
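This identity is easy to check numerically. A sketch for the categorical case (hypothetical η), comparing a finite-difference gradient of A(η) against the exact expected sufficient statistics:

```python
import math

def A(eta):
    """Categorical log-normalizer via stable log-sum-exp."""
    m = max(eta)
    return m + math.log(sum(math.exp(e - m) for e in eta))

eta = [0.3, -0.7, 1.1]   # hypothetical natural parameters
probs = [math.exp(e - A(eta)) for e in eta]

# For one-hot f(x), E[f(x)] is just the probability vector itself.
expected_stats = probs

# Finite-difference gradient of A w.r.t. each eta_k.
eps = 1e-6
grad_A = []
for k in range(len(eta)):
    bumped = list(eta)
    bumped[k] += eps
    grad_A.append((A(bumped) - A(eta)) / eps)

print(expected_stats)
print(grad_A)   # matches expected_stats up to finite-difference error
```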
Why Bother with This? • A common form for common distributions • “Easily” compute gradients of likelihood wrt parameters • “Easily” compute expectations, especially entropy and KL divergence • “Easy” posterior inference via conjugate distributions
Conjugate Distributions
• Let θ ∼ p, and let x | θ ∼ q
• If p is the conjugate prior for q, then the posterior distribution p(θ | x) is of the same type/family as the prior p(θ)
Why? “Easy” Posterior Inference
• p is the conjugate prior for q: the posterior p(θ | x) has the same form as the prior p(θ)
• All exponential family models have a conjugate prior (in theory)
Why? “Easy” Posterior Inference
Posterior        | Likelihood           | Prior
Dirichlet (Beta) | Discrete (Bernoulli) | Dirichlet (Beta)
Normal           | Normal (fixed var.)  | Normal
Gamma            | Exponential          | Gamma
…
Conjugate Prior Example
• p(θ) = Dir(α); x_i | θ ∼ Cat(θ), i.i.d.
• Let f(x) be the Cat sufficient statistic function
• p(θ | x) = Dir(α + Σ_i f(x_i))
p(θ | x_1, …, x_N) ∝ q(x_1, …, x_N | θ) p(θ)
Conjugate Prior Example: derivation
p(θ | x_1, …, x_N) ∝ q(x_1, …, x_N | θ) p(θ)
Rewrite q and p with their exponential family forms, using the specific natural parameters and sufficient statistic functions (Cat: η = log θ; Dirichlet: η = α − 1, f(θ) = log θ):
= [∏_i exp((log θ)^T f(x_i))] · exp((α − 1)^T log θ − A(α))
Notice the common terms that can be grouped together:
= exp(Σ_i (log θ)^T f(x_i) + (α − 1)^T log θ − A(α))
= exp((α − 1 + Σ_i f(x_i))^T log θ − A(α))
Notice: this is the form of a Dirichlet, with parameters α + Σ_i f(x_i)
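Computationally, the whole derivation collapses to “add the observed counts to α”. A minimal sketch of the Dirichlet-Categorical update (hypothetical α and observations):

```python
alpha = [1.0, 2.0, 1.0]   # hypothetical Dirichlet prior parameters
data = [0, 2, 2, 1, 2]    # hypothetical observed categories x_1..x_N

def f(x, L):
    """Cat sufficient statistics: one-hot indicator vector."""
    return [1.0 if x == k else 0.0 for k in range(L)]

# Posterior Dirichlet parameters: alpha + sum_i f(x_i),
# i.e. prior pseudo-counts plus observed counts.
posterior = list(alpha)
for x in data:
    for k, fk in enumerate(f(x, len(alpha))):
        posterior[k] += fk

print(posterior)   # [2.0, 3.0, 4.0]
```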
Why Bother with This? • A common form for common distributions • “Easily” compute gradients of likelihood wrt parameters • “Easily” compute expectations, especially entropy and KL divergence • “Easy” posterior inference via conjugate distributions