
Exponential Family & Generalized Linear Models (GLIMs) - Probabilistic Graphical Models (PowerPoint presentation)



  1. Exponential Family & Generalized Linear Models (GLIMs), Probabilistic Graphical Models, Sharif University of Technology, Spring 2018, Soleymani

  2. Outline
     - Exponential family
       - Many standard distributions belong to this family.
       - Learning algorithms for different models in this family share strong similarities:
         - ML estimation has a simple form for exponential families: moment matching of the sufficient statistics.
         - Bayesian learning is simplest for exponential families.
     - GLIMs: a way to parameterize conditional distributions in which a variable has an exponential-family distribution for each value of its parents.

  3. Exponential family: canonical parameterization
     - $p(\boldsymbol{x} \mid \boldsymbol{\eta}) = \frac{1}{Z(\boldsymbol{\eta})}\, h(\boldsymbol{x}) \exp\{\boldsymbol{\eta}^\top T(\boldsymbol{x})\}$, with $Z(\boldsymbol{\eta}) = \int h(\boldsymbol{x}) \exp\{\boldsymbol{\eta}^\top T(\boldsymbol{x})\}\, d\boldsymbol{x}$
     - Equivalently, $p(\boldsymbol{x} \mid \boldsymbol{\eta}) = h(\boldsymbol{x}) \exp\{\boldsymbol{\eta}^\top T(\boldsymbol{x}) - A(\boldsymbol{\eta})\}$, where $A(\boldsymbol{\eta}) = \ln Z(\boldsymbol{\eta})$ is the log partition function
     - $T : \mathcal{X} \to \mathbb{R}^K$: sufficient statistics function
     - $\boldsymbol{\eta}$: natural or canonical parameters
     - $h : \mathcal{X} \to \mathbb{R}^+$: reference measure, independent of the parameters
     - $Z$: normalization factor or partition function ($0 < Z(\boldsymbol{\eta}) < \infty$)
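As a concrete illustration of the canonical form above, here is a minimal NumPy sketch (the function name `expfam_density` and the Bernoulli instance are illustrative, not from the slides) that evaluates $h(x)\exp\{\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta})\}$ for user-supplied $h$, $T$, and $A$:

```python
import numpy as np

def expfam_density(x, eta, h, T, A):
    """Evaluate p(x | eta) = h(x) * exp(eta^T T(x) - A(eta)) for an
    exponential-family member specified by the callables h, T and A."""
    eta = np.atleast_1d(eta)
    return h(x) * np.exp(float(np.dot(eta, np.atleast_1d(T(x)))) - A(eta))

# Bernoulli instance: h(x) = 1, T(x) = x, A(eta) = ln(1 + e^eta)
bern = lambda x, eta: expfam_density(x, eta,
                                     h=lambda x: 1.0,
                                     T=lambda x: x,
                                     A=lambda e: np.log1p(np.exp(e[0])))
print(bern(1, 0.8), bern(0, 0.8))   # probabilities of x = 1 and x = 0; they sum to 1
```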

  4. Example: Bernoulli
     - $p(x \mid \mu) = \mu^x (1-\mu)^{1-x} = \exp\left\{ x \ln\frac{\mu}{1-\mu} + \ln(1-\mu) \right\}$
     - $\eta = \ln\frac{\mu}{1-\mu} \;\Rightarrow\; \mu = \frac{e^{\eta}}{e^{\eta}+1} = \frac{1}{1+e^{-\eta}}$
     - $T(x) = x$
     - $A(\eta) = -\ln(1-\mu) = \ln(1+e^{\eta})$
     - $h(x) = 1$
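A quick numerical sanity check of the Bernoulli reduction above (NumPy assumed; the value $\mu = 0.3$ is arbitrary and only for illustration):

```python
import numpy as np

mu = 0.3                                  # Bernoulli mean parameter (illustrative value)
eta = np.log(mu / (1 - mu))               # natural parameter eta = ln(mu / (1 - mu))
A = np.log1p(np.exp(eta))                 # log partition A(eta) = ln(1 + e^eta)

for x in (0, 1):
    standard = mu**x * (1 - mu)**(1 - x)  # mu^x (1 - mu)^(1 - x)
    canonical = np.exp(eta * x - A)       # h(x) = 1, T(x) = x
    assert np.isclose(standard, canonical)
```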

  5. Example: Gaussian
     - $p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}$
     - $\boldsymbol{\eta} = \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix} \;\Rightarrow\; \mu = -\frac{\eta_1}{2\eta_2}, \quad \sigma^2 = -\frac{1}{2\eta_2}$
     - $T(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$
     - $A(\boldsymbol{\eta}) = \frac{\mu^2}{2\sigma^2} + \ln(\sqrt{2\pi}\,\sigma) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\ln(-2\eta_2) + \frac{1}{2}\ln(2\pi)$
     - $h(x) = 1$
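The Gaussian mapping can be checked the same way; the following sketch (illustrative values, NumPy assumed) confirms that the canonical form with $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/(2\sigma^2)$ and the $A(\boldsymbol{\eta})$ above reproduces the usual density:

```python
import numpy as np

mu, sigma2 = 1.5, 0.7                             # illustrative mean and variance
eta1, eta2 = mu / sigma2, -1.0 / (2 * sigma2)     # natural parameters
A = -eta1**2 / (4 * eta2) - 0.5 * np.log(-2 * eta2) + 0.5 * np.log(2 * np.pi)

x = 0.8
standard = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
canonical = np.exp(eta1 * x + eta2 * x**2 - A)    # h(x) = 1, T(x) = (x, x^2)
assert np.isclose(standard, canonical)
```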

  6. Example: Multinomial
     - $p(\boldsymbol{x} \mid \boldsymbol{\pi}) = \prod_{k=1}^{K} \pi_k^{x_k}$, with $\sum_{k=1}^{K} \pi_k = 1$
     - $p(\boldsymbol{x} \mid \boldsymbol{\pi}) = \exp\left\{ \sum_{k=1}^{K} x_k \ln \pi_k \right\} = \exp\left\{ \sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right) \ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right\}$
     - $\boldsymbol{\eta} = [\eta_1, \ldots, \eta_{K-1}]^\top = \left[\ln\frac{\pi_1}{\pi_K}, \ldots, \ln\frac{\pi_{K-1}}{\pi_K}\right]^\top$ with $\pi_K = 1 - \sum_{k=1}^{K-1} \pi_k$ $\;\Rightarrow\; \pi_k = \frac{e^{\eta_k}}{\sum_{j=1}^{K} e^{\eta_j}}$ (taking $\eta_K = 0$)
     - $T(\boldsymbol{x}) = [x_1, \ldots, x_{K-1}]^\top$
     - $A(\boldsymbol{\eta}) = -\ln \pi_K = -\ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) = \ln \sum_{k=1}^{K} e^{\eta_k}$
     - $h(\boldsymbol{x}) = 1$
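A small sketch of the multinomial case (probabilities chosen arbitrarily; NumPy assumed). For convenience it carries the $K$-th component with $\eta_K = 0$ rather than the $(K-1)$-dimensional representation on the slide, which is equivalent:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])            # multinomial probabilities (illustrative)
eta = np.log(pi / pi[-1])                 # natural parameters; eta_K = 0 by construction

recovered = np.exp(eta) / np.exp(eta).sum()   # the softmax inverts the mapping
assert np.allclose(recovered, pi)

A = np.log(np.exp(eta).sum())             # log partition A(eta) = ln sum_k e^{eta_k}
assert np.isclose(A, -np.log(pi[-1]))     # equals -ln(pi_K)
```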

  7. Well-behaved parameter space
     - Multiple exponential families may encode the same set of distributions.
     - We want the parameter space $\{\boldsymbol{\eta} : 0 < Z(\boldsymbol{\eta}) < \infty\}$ to be:
       - a convex set;
       - non-redundant: $\boldsymbol{\eta} \neq \boldsymbol{\eta}' \Rightarrow p(\boldsymbol{x} \mid \boldsymbol{\eta}) \neq p(\boldsymbol{x} \mid \boldsymbol{\eta}')$;
       - such that the function from the moment parameters to $\boldsymbol{\eta}$ is invertible.
     - Example: the invertible map from $\mu$ to $\eta$ in the Bernoulli example, $\mu = \frac{1}{1+e^{-\eta}}$.

  8. Examples of distributions that are not in the exponential family
     - Uniform (with unknown support)
     - Laplace
     - Student's t-distribution

  9. Moments
     - $A(\boldsymbol{\eta}) = \ln Z(\boldsymbol{\eta})$, where $Z(\boldsymbol{\eta}) = \int h(\boldsymbol{x}) \exp\{\boldsymbol{\eta}^\top T(\boldsymbol{x})\}\, d\boldsymbol{x}$
     - $\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \frac{\nabla_{\boldsymbol{\eta}} Z(\boldsymbol{\eta})}{Z(\boldsymbol{\eta})} = \frac{\int T(\boldsymbol{x})\, h(\boldsymbol{x}) \exp\{\boldsymbol{\eta}^\top T(\boldsymbol{x})\}\, d\boldsymbol{x}}{Z(\boldsymbol{\eta})} = \mathbb{E}_{p(\boldsymbol{x} \mid \boldsymbol{\eta})}[T(\boldsymbol{x})]$
     - The first derivative of $A(\boldsymbol{\eta})$ is the mean of the sufficient statistics: $\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}_{\boldsymbol{\eta}}[T(\boldsymbol{x})]$.
     - $\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}_{\boldsymbol{\eta}}[T(\boldsymbol{x}) T(\boldsymbol{x})^\top] - \mathbb{E}_{\boldsymbol{\eta}}[T(\boldsymbol{x})]\, \mathbb{E}_{\boldsymbol{\eta}}[T(\boldsymbol{x})]^\top = \mathrm{Cov}_{\boldsymbol{\eta}}[T(\boldsymbol{x})]$
     - More generally, the $i$-th derivative of $A(\boldsymbol{\eta})$ gives the $i$-th cumulant of the sufficient statistics (the mean and covariance for $i = 1, 2$).
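The identities $\nabla_{\eta} A = \mathbb{E}[T(x)]$ and $\nabla^2_{\eta} A = \mathrm{Cov}[T(x)]$ can be verified numerically for the Bernoulli case with finite differences (a rough sketch, NumPy assumed; the point $\eta = 0.4$ is arbitrary):

```python
import numpy as np

A = lambda eta: np.log1p(np.exp(eta))        # Bernoulli log partition function
eta, eps = 0.4, 1e-4

mean_T = 1.0 / (1.0 + np.exp(-eta))          # E[T(x)] = E[x] = sigmoid(eta)
var_T = mean_T * (1.0 - mean_T)              # Cov[T(x)] = Var[x]

dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)              # numerical 1st derivative
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2   # numerical 2nd derivative

assert np.isclose(dA, mean_T, atol=1e-6)
assert np.isclose(d2A, var_T, atol=1e-5)
```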

  10. Properties
     - The moment parameters $\boldsymbol{\mu}$ can be derived as a function of the natural (canonical) parameters: $\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}_{\boldsymbol{\eta}}[T(\boldsymbol{x})]$. For many distributions we have $\boldsymbol{\mu} \equiv \mathbb{E}_{\boldsymbol{\eta}}[T(\boldsymbol{x})]$, so $\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \boldsymbol{\mu}$.
     - $A(\boldsymbol{\eta})$ is convex, since $\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathrm{Cov}_{\boldsymbol{\eta}}[T(\boldsymbol{x})] \succeq 0$.
     - A covariance matrix is always positive semi-definite, so the Hessian $\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta})$ is positive semi-definite, and hence $A(\boldsymbol{\eta}) = \ln Z(\boldsymbol{\eta})$ is a convex function of $\boldsymbol{\eta}$.

  11. Exponential family: moment parameterization
     - A distribution in the exponential family can also be parameterized by its moment parameters $\boldsymbol{\mu}$:
       $p(\boldsymbol{x} \mid \boldsymbol{\mu}) = \frac{1}{Z(\boldsymbol{\mu})}\, h(\boldsymbol{x}) \exp\{\psi(\boldsymbol{\mu})^\top T(\boldsymbol{x})\}$, where $\boldsymbol{\eta} = \psi(\boldsymbol{\mu})$ and $Z(\boldsymbol{\mu}) = \int h(\boldsymbol{x}) \exp\{\psi(\boldsymbol{\mu})^\top T(\boldsymbol{x})\}\, d\boldsymbol{x}$.
     - $\psi$ maps the moment parameters $\boldsymbol{\mu} \equiv \mathbb{E}_{\boldsymbol{\eta}}[T(\boldsymbol{x})] = \nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta})$ to the canonical parameters.
     - If $\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) \succ 0$, then $\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \psi^{-1}(\boldsymbol{\eta}) = \boldsymbol{\mu}$ is strictly increasing and thus one-to-one.
     - Hence the mapping between the moment and the canonical parameters is invertible (a one-to-one relationship): $\boldsymbol{\eta} = \psi(\boldsymbol{\mu})$.
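For the Bernoulli family, $\psi$ is the logit and $\psi^{-1}$ the logistic sigmoid; a tiny round-trip sketch (NumPy assumed, illustrative values):

```python
import numpy as np

psi = lambda mu: np.log(mu / (1.0 - mu))           # moment -> canonical (logit)
psi_inv = lambda eta: 1.0 / (1.0 + np.exp(-eta))   # canonical -> moment (sigmoid)

for mu in (0.1, 0.5, 0.9):
    assert np.isclose(psi_inv(psi(mu)), mu)        # the mapping is one-to-one
```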

  12. Sufficiency
     - A statistic is a function of a random variable.
     - Suppose that the distribution of $X$ depends on a parameter $\theta$.
     - "$T(X)$ is a sufficient statistic for $\theta$ if there is no information in $X$ regarding $\theta$ beyond that in $T(X)$."
     - Sufficiency in both the frequentist and the Bayesian frameworks implies a factorization of $p(x \mid \theta)$ (Neyman factorization theorem):
       $p(x, T(x), \theta) = g(T(x), \theta)\, h(x, T(x))$
       $p(x, \theta) = g(T(x), \theta)\, h(x, T(x))$
       $p(x \mid \theta) = g'(T(x), \theta)\, h(x, T(x))$

  13. Sufficient statistic
     - Sufficient statistic and the exponential family: $p(\boldsymbol{x} \mid \boldsymbol{\eta}) = h(\boldsymbol{x}) \exp\{\boldsymbol{\eta}^\top T(\boldsymbol{x}) - A(\boldsymbol{\eta})\}$
     - In the case of i.i.d. sampling, the sufficient statistic of a set $\mathcal{D}$ of $N$ observations from the distribution is obtained easily:
       $p(\mathcal{D} \mid \boldsymbol{\eta}) = \prod_{n=1}^{N} h(\boldsymbol{x}^{(n)}) \exp\{\boldsymbol{\eta}^\top T(\boldsymbol{x}^{(n)}) - A(\boldsymbol{\eta})\} = \left(\prod_{n=1}^{N} h(\boldsymbol{x}^{(n)})\right) \exp\left\{\boldsymbol{\eta}^\top \sum_{n=1}^{N} T(\boldsymbol{x}^{(n)}) - N A(\boldsymbol{\eta})\right\}$
     - $\mathcal{D}$ itself has an exponential-family distribution with sufficient statistic $\sum_{n=1}^{N} T(\boldsymbol{x}^{(n)})$.
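The practical consequence is that the likelihood depends on the data only through $\sum_n T(\boldsymbol{x}^{(n)})$. A Bernoulli sketch (illustrative data, NumPy assumed): two datasets with the same sufficient statistic and the same $N$ yield identical log-likelihoods:

```python
import numpy as np

def bernoulli_loglik(data, eta):
    """Log-likelihood ln p(D | eta) = eta * sum_n x_n - N * A(eta), with h = 1."""
    data = np.asarray(data)
    return eta * data.sum() - len(data) * np.log1p(np.exp(eta))

eta = 0.3
d1 = [1, 0, 1, 1, 0, 0, 1, 0]   # sum = 4
d2 = [0, 1, 0, 0, 1, 1, 0, 1]   # different ordering, same sum and same N
assert np.isclose(bernoulli_loglik(d1, eta), bernoulli_loglik(d2, eta))
```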

  14. MLE for the exponential family
     - $\ell(\boldsymbol{\eta}; \mathcal{D}) = \ln p(\mathcal{D} \mid \boldsymbol{\eta}) = \ln \prod_{n=1}^{N} h(\boldsymbol{x}^{(n)}) \exp\{\boldsymbol{\eta}^\top T(\boldsymbol{x}^{(n)}) - A(\boldsymbol{\eta})\} = \sum_{n=1}^{N} \ln h(\boldsymbol{x}^{(n)}) + \boldsymbol{\eta}^\top \sum_{n=1}^{N} T(\boldsymbol{x}^{(n)}) - N A(\boldsymbol{\eta})$ (a concave function of $\boldsymbol{\eta}$)
     - $\nabla_{\boldsymbol{\eta}}\, \ell(\boldsymbol{\eta}; \mathcal{D}) = 0 \;\Rightarrow\; \sum_{n=1}^{N} T(\boldsymbol{x}^{(n)}) - N \nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = 0$
     - $\Rightarrow\; \nabla_{\boldsymbol{\eta}} A(\hat{\boldsymbol{\eta}}) = \frac{1}{N} \sum_{n=1}^{N} T(\boldsymbol{x}^{(n)}) \;\Rightarrow\; \mathbb{E}_{\hat{\boldsymbol{\eta}}}[T(\boldsymbol{x})] = \frac{1}{N} \sum_{n=1}^{N} T(\boldsymbol{x}^{(n)})$: moment matching
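A sketch of moment matching for the Gaussian case (NumPy assumed, synthetic data): setting $\mathbb{E}[T(x)] = (\mathbb{E}[x], \mathbb{E}[x^2])$ equal to the empirical averages recovers the familiar MLEs, namely the sample mean and the (biased) sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

# Moment matching: E[T(x)] = (E[x], E[x^2]) is set to the empirical averages.
m1, m2 = data.mean(), (data**2).mean()
mu_hat = m1                      # recovered mean
var_hat = m2 - m1**2             # recovered variance (the biased MLE estimate)

assert np.isclose(mu_hat, data.mean())
assert np.isclose(var_hat, data.var())   # np.var uses the 1/N (MLE) convention by default
```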

  15. Exponential family: summary
     - Many well-known distributions are in the exponential family.
     - Important properties for learning with exponential families:
       - The gradient of the log partition function gives the expected sufficient statistics, i.e., the moments.
       - The moments of any distribution in the exponential family can be computed easily by taking derivatives of the log normalizer.
       - The Hessian of the log partition function is positive semi-definite, so the log partition function is convex.
     - Exponential families are also important for modeling the distributions of Markov networks.

  16. Generalized linear models (GLIMs)
     - Model the conditional relationship between $y$ and $\boldsymbol{x}$.
     - Examples:
       - Linear regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}, \sigma^2) = \mathcal{N}(y \mid \boldsymbol{w}^\top \boldsymbol{x}, \sigma^2)$
       - Discriminative linear classifiers (two-class):
         - Logistic regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y \mid \sigma(\boldsymbol{w}^\top \boldsymbol{x}))$, where $\sigma$ is the logistic sigmoid
         - Probit regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y \mid \Phi(\boldsymbol{w}^\top \boldsymbol{x}))$, where $\Phi$ is the CDF of $\mathcal{N}(0,1)$
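The three examples differ only in the distribution placed on $y$ and in the function applied to the linear predictor $\boldsymbol{w}^\top \boldsymbol{x}$. A sketch of the resulting conditional means (illustrative weights and input; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.stats import norm      # Phi, the standard normal CDF, for probit

w = np.array([0.5, -1.0])         # illustrative weight vector
x = np.array([2.0, 1.0])          # illustrative input
xi = w @ x                        # linear predictor w^T x

mean_linear = xi                              # linear regression: E[y|x] = w^T x
mean_logistic = 1.0 / (1.0 + np.exp(-xi))     # logistic regression: P(y=1|x) = sigma(w^T x)
mean_probit = norm.cdf(xi)                    # probit regression: P(y=1|x) = Phi(w^T x)
print(mean_linear, mean_logistic, mean_probit)
```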

  17. Generalized linear models (GLIMs)
     - $p(y \mid \boldsymbol{x})$ is a generalized linear model if:
       - $\boldsymbol{x}$ enters the model only via the linear combination $\xi = \boldsymbol{w}^\top \boldsymbol{x}$;
       - the conditional mean of $p(y \mid \boldsymbol{x})$ is expressed as $f(\boldsymbol{w}^\top \boldsymbol{x})$, where $f$ is called the response function: $\mu = \mathbb{E}[y \mid \boldsymbol{x}] = f(\boldsymbol{w}^\top \boldsymbol{x})$;
       - the distribution of $y$ is characterized by an exponential-family distribution (with conditional mean $f(\boldsymbol{w}^\top \boldsymbol{x})$).
     - We have two choices in the specification of a GLIM:
       - the choice of the exponential-family distribution, usually constrained by the nature of $y$;
       - the choice of the response function $f$, the principal degree of freedom in the specification of a GLIM. We still need to impose constraints on this function (e.g., $f$ must take values in $[0,1]$ for a Bernoulli distribution on $y$).

  18. The relation between the variables in a GLIM (figure slide)

  19. Canonical response function
     - Canonical response function: $f(\cdot) = \psi^{-1}(\cdot)$, i.e., $\eta = \xi$.
     - In this case, the choice of the exponential-family density completely determines the GLIM.
     - The constraints on the range of $f$ are automatically satisfied: the values $\mu = f(\eta)$ are guaranteed to be possible values of the conditional expectation, since $f(\eta) = \psi^{-1}(\eta) = \frac{dA(\eta)}{d\eta} = \mathbb{E}[y \mid \eta]$.
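For the Bernoulli family the canonical response is the logistic sigmoid, and $f(\eta) = dA/d\eta$ guarantees the predicted mean stays in $(0,1)$. A small numerical check (NumPy assumed, arbitrary test points):

```python
import numpy as np

A = lambda eta: np.log1p(np.exp(eta))         # Bernoulli log partition
f = lambda xi: 1.0 / (1.0 + np.exp(-xi))      # canonical response = sigmoid
eps = 1e-4

for xi in (-5.0, 0.0, 3.0):
    dA = (A(xi + eps) - A(xi - eps)) / (2 * eps)  # numerical dA/deta at eta = xi
    assert np.isclose(f(xi), dA, atol=1e-6)
    assert 0.0 < f(xi) < 1.0                      # range constraint holds automatically
```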

  20. Log-likelihood for GLIMs
     - $\ell(\boldsymbol{w}; \mathcal{D}) = \ln p(\mathcal{D} \mid \boldsymbol{w}) = \ln \prod_{n=1}^{N} h(y^{(n)}) \exp\{\eta^{(n)} y^{(n)} - A(\eta^{(n)})\} = \sum_{n=1}^{N} \ln h(y^{(n)}) + \sum_{n=1}^{N} \left[ \eta^{(n)} y^{(n)} - A(\eta^{(n)}) \right]$
     - $\eta^{(n)} = \psi(\mu^{(n)})$ and $\mu^{(n)} = f(\boldsymbol{w}^\top \boldsymbol{x}^{(n)})$
     - In the case of the canonical response function, $\eta^{(n)} = \boldsymbol{w}^\top \boldsymbol{x}^{(n)}$, so
       $\ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \ln h(y^{(n)}) + \boldsymbol{w}^\top \sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)} - \sum_{n=1}^{N} A(\boldsymbol{w}^\top \boldsymbol{x}^{(n)})$
     - $\sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)}$ is the sufficient statistic for $\boldsymbol{w}$.
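With the canonical response and a Bernoulli output, this log-likelihood is exactly that of logistic regression, and its gradient is $\sum_n (y^{(n)} - \mu^{(n)})\, \boldsymbol{x}^{(n)}$. A self-contained sketch (synthetic data, NumPy assumed; plain gradient ascent rather than the IRLS updates often used in practice):

```python
import numpy as np

def loglik_and_grad(w, X, y):
    """Bernoulli GLIM with canonical response (logistic regression):
    l(w) = sum_n [ y_n * w^T x_n - A(w^T x_n) ],  A(eta) = ln(1 + e^eta)."""
    xi = X @ w
    mu = 1.0 / (1.0 + np.exp(-xi))            # mu_n = f(w^T x_n)
    ll = np.sum(y * xi - np.log1p(np.exp(xi)))
    grad = X.T @ (y - mu)                     # sum_n (y_n - mu_n) x_n
    return ll, grad

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(3)
for _ in range(500):                          # gradient ascent on the concave l(w)
    ll, g = loglik_and_grad(w, X, y)
    w += 0.01 * g
print("estimated weights:", w)
```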
