Exponential family & Generalized Linear Models (GLIMs)
Probabilistic Graphical Models
Sharif University of Technology, Spring 2016
Soleymani
Outline

- Exponential family
  - Many standard distributions are in this family
  - Similarities among the learning algorithms for the different models in this family:
    - ML estimation has a simple form for exponential families: moment matching of the sufficient statistics
    - Bayesian learning is simplest for exponential families
  - They have a maximum entropy interpretation
- GLIMs: used to parameterize conditional distributions that place an exponential-family distribution on a variable for each value of its parents
Exponential family: canonical parameterization

$$p(\mathbf{y} \mid \boldsymbol\theta) = \frac{1}{Z(\boldsymbol\theta)}\, h(\mathbf{y}) \exp\{\boldsymbol\theta^T T(\mathbf{y})\}, \qquad Z(\boldsymbol\theta) = \int h(\mathbf{y}) \exp\{\boldsymbol\theta^T T(\mathbf{y})\}\, d\mathbf{y}$$

Equivalently,

$$p(\mathbf{y} \mid \boldsymbol\theta) = h(\mathbf{y}) \exp\{\boldsymbol\theta^T T(\mathbf{y}) - A(\boldsymbol\theta)\}, \qquad A(\boldsymbol\theta) = \ln Z(\boldsymbol\theta)$$

- $A(\boldsymbol\theta) = \ln Z(\boldsymbol\theta)$: log partition function
- $T: \mathcal{Y} \to \mathbb{R}^k$: sufficient statistics function
- $\boldsymbol\theta$: natural or canonical parameters
- $h: \mathcal{Y} \to \mathbb{R}^+$: reference measure, independent of the parameters
- $Z$: normalization factor or partition function ($0 < Z(\boldsymbol\theta) < \infty$)
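The canonical form maps directly onto code. The following is a minimal NumPy sketch (our own illustration, not from the slides; `exp_family_pdf` and its arguments are hypothetical names, and the partition function is approximated on a grid rather than computed in closed form):

```python
import numpy as np

def exp_family_pdf(y_grid, h, T, theta):
    """Evaluate p(y | theta) = h(y) exp(theta^T T(y) - A(theta)) on a 1-D grid,
    approximating ln Z(theta) by a Riemann sum over the grid."""
    log_unnorm = np.log(h(y_grid)) + T(y_grid) @ theta      # ln h(y) + theta^T T(y)
    dy = y_grid[1] - y_grid[0]
    A = np.log(np.sum(np.exp(log_unnorm)) * dy)             # ln Z(theta), numerically
    return np.exp(log_unnorm - A)

# Example: theta = (mu/sigma^2, -1/(2 sigma^2)) = (0, -0.5) should recover N(0, 1)
y = np.linspace(-6, 6, 2001)
p = exp_family_pdf(y,
                   h=lambda y: np.full_like(y, 1 / np.sqrt(2 * np.pi)),
                   T=lambda y: np.stack([y, y**2], axis=1),
                   theta=np.array([0.0, -0.5]))
assert np.allclose(p, np.exp(-y**2 / 2) / np.sqrt(2 * np.pi), atol=1e-5)
```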
Example: Bernoulli

$$p(y \mid \mu) = \mu^y (1-\mu)^{1-y} = \exp\left\{ y \ln\frac{\mu}{1-\mu} + \ln(1-\mu) \right\}$$

- $\theta = \ln\dfrac{\mu}{1-\mu} \;\Rightarrow\; \mu = \dfrac{e^\theta}{e^\theta + 1} = \dfrac{1}{1+e^{-\theta}}$
- $T(y) = y$
- $A(\theta) = -\ln(1-\mu) = \ln(1+e^\theta)$
- $h(y) = 1$
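A quick numeric check of these identities (a sketch of ours, not from the slides):

```python
import numpy as np

mu = 0.3
theta = np.log(mu / (1 - mu))            # natural parameter
A = np.log1p(np.exp(theta))              # log partition function ln(1 + e^theta)

for y in (0, 1):
    canonical = np.exp(theta * y - A)    # h(y) exp(theta T(y) - A(theta)), h = 1
    standard = mu**y * (1 - mu)**(1 - y)
    assert np.isclose(canonical, standard)
```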
Example: Gaussian

$$p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\}$$

- $\boldsymbol\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix} \;\Rightarrow\; \mu = -\dfrac{\theta_1}{2\theta_2}, \quad \sigma^2 = -\dfrac{1}{2\theta_2}$
- $T(y) = \begin{pmatrix} y \\ y^2 \end{pmatrix}$
- $A(\boldsymbol\theta) = \dfrac{\mu^2}{2\sigma^2} + \ln\left(\sqrt{2\pi}\,\sigma\right) = -\dfrac{\theta_1^2}{4\theta_2} - \dfrac{1}{2}\ln(-2\theta_2) + \dfrac{1}{2}\ln(2\pi)$
- $h(y) = 1$
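The same kind of check for the Gaussian mapping between $(\mu, \sigma^2)$ and $(\theta_1, \theta_2)$ (again our own sketch):

```python
import numpy as np

mu, sigma2 = 1.5, 0.8
theta = np.array([mu / sigma2, -1 / (2 * sigma2)])
A = -theta[0]**2 / (4 * theta[1]) - 0.5 * np.log(-2 * theta[1]) + 0.5 * np.log(2 * np.pi)

y = np.linspace(-3, 5, 9)
canonical = np.exp(theta[0] * y + theta[1] * y**2 - A)            # h(y) = 1
standard = np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
assert np.allclose(canonical, standard)
```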
Example: Multinomial (one trial, $\mathbf{y}$ a one-hot vector)

$$p(\mathbf{y} \mid \boldsymbol\pi) = \prod_{k=1}^{K} \pi_k^{y_k}, \qquad \sum_{k=1}^{K} \pi_k = 1$$

$$p(\mathbf{y} \mid \boldsymbol\pi) = \exp\left\{ \sum_{k=1}^{K} y_k \ln \pi_k \right\} = \exp\left\{ \sum_{k=1}^{K-1} y_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} y_k\right) \ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right\}$$

- $\boldsymbol\theta = (\theta_1, \ldots, \theta_{K-1})^T = \left( \ln\dfrac{\pi_1}{1 - \sum_{k=1}^{K-1}\pi_k}, \ldots, \ln\dfrac{\pi_{K-1}}{1 - \sum_{k=1}^{K-1}\pi_k} \right)^T = \left( \ln\dfrac{\pi_1}{\pi_K}, \ldots, \ln\dfrac{\pi_{K-1}}{\pi_K} \right)^T$
  $\Rightarrow\; \pi_k = \dfrac{e^{\theta_k}}{\sum_{j=1}^{K} e^{\theta_j}}$ (with the convention $\theta_K = 0$)
- $T(\mathbf{y}) = (y_1, \ldots, y_{K-1})^T$
- $A(\boldsymbol\theta) = -\ln \pi_K = -\ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) = \ln \sum_{k=1}^{K} e^{\theta_k}$
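A short sketch (ours) confirming that the softmax inverts the natural-parameter map under the $\theta_K = 0$ convention:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])                 # K = 3
theta = np.log(pi[:-1] / pi[-1])               # theta_k = ln(pi_k / pi_K)
theta_full = np.append(theta, 0.0)             # convention: theta_K = 0

A = np.log(np.sum(np.exp(theta_full)))         # log partition function = -ln(pi_K)
pi_back = np.exp(theta_full) / np.sum(np.exp(theta_full))   # softmax

assert np.allclose(pi_back, pi)
assert np.isclose(A, -np.log(pi[-1]))
```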
Well-behaved parameter space

- Multiple representations in the exponential family may encode the same set of distributions
- However, we want the parameter space $\Theta = \{\boldsymbol\theta : 0 < Z(\boldsymbol\theta) < \infty\}$ to be:
  - A convex set
  - Non-redundant: $\boldsymbol\theta \neq \boldsymbol\theta' \Rightarrow p(\mathbf{y} \mid \boldsymbol\theta) \neq p(\mathbf{y} \mid \boldsymbol\theta')$
  - Such that the function from the moment parameters $\boldsymbol\mu$ to $\boldsymbol\theta$ is invertible
- Example: the function from $\mu$ to $\theta$ in the Bernoulli example is invertible: $\mu = \dfrac{1}{1+e^{-\theta}}$
Examples of non-exponential distributions

- Uniform (the support depends on the parameters)
- Laplace (with an unknown location parameter)
- Student's t-distribution
Moments

$$A(\boldsymbol\theta) = \ln Z(\boldsymbol\theta), \qquad Z(\boldsymbol\theta) = \int h(\mathbf{y}) \exp\{\boldsymbol\theta^T T(\mathbf{y})\}\, d\mathbf{y}$$

$$\nabla_{\boldsymbol\theta} A(\boldsymbol\theta) = \frac{\nabla_{\boldsymbol\theta} Z(\boldsymbol\theta)}{Z(\boldsymbol\theta)} = \frac{\int T(\mathbf{y})\, h(\mathbf{y}) \exp\{\boldsymbol\theta^T T(\mathbf{y})\}\, d\mathbf{y}}{Z(\boldsymbol\theta)} = \int T(\mathbf{y})\, \frac{h(\mathbf{y}) \exp\{\boldsymbol\theta^T T(\mathbf{y})\}}{Z(\boldsymbol\theta)}\, d\mathbf{y} = E_{p(\mathbf{y}|\boldsymbol\theta)}[T(\mathbf{y})]$$

- $\nabla_{\boldsymbol\theta} A(\boldsymbol\theta) = E_{\boldsymbol\theta}[T(\mathbf{y})]$: the first derivative of $A(\boldsymbol\theta)$ is the mean of the sufficient statistics
- $\nabla^2_{\boldsymbol\theta} A(\boldsymbol\theta) = E_{\boldsymbol\theta}[T(\mathbf{y}) T(\mathbf{y})^T] - E_{\boldsymbol\theta}[T(\mathbf{y})]\, E_{\boldsymbol\theta}[T(\mathbf{y})]^T = \mathrm{Cov}_{\boldsymbol\theta}[T(\mathbf{y})]$
- More generally, the $i$-th derivative of $A(\boldsymbol\theta)$ gives the $i$-th cumulant of the sufficient statistics (for $i = 1, 2$: the mean and the covariance)
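These identities are easy to verify numerically; the sketch below (ours) compares finite-difference derivatives of the Bernoulli log partition function $A(\theta) = \ln(1+e^\theta)$ against the mean and variance of $T(y) = y$:

```python
import numpy as np

A = lambda t: np.log1p(np.exp(t))     # Bernoulli log partition function
theta = 0.7
mu = 1 / (1 + np.exp(-theta))         # E[T(y)] = E[y] = mu

eps = 1e-4
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)               # ~ dA/dtheta
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2  # ~ d^2A/dtheta^2

assert np.isclose(dA, mu, atol=1e-6)               # mean of the sufficient statistic
assert np.isclose(d2A, mu * (1 - mu), atol=1e-4)   # its variance
```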
Properties

- The moment parameters $\boldsymbol\mu$ can be derived as a function of the natural (canonical) parameters:
  $$\nabla_{\boldsymbol\theta} A(\boldsymbol\theta) = E_{\boldsymbol\theta}[T(\mathbf{y})]$$
  For many distributions $\boldsymbol\mu \equiv E_{\boldsymbol\theta}[T(\mathbf{y})]$, so $\nabla_{\boldsymbol\theta} A(\boldsymbol\theta) = \boldsymbol\mu$
- $A(\boldsymbol\theta)$ is convex, since $\nabla^2_{\boldsymbol\theta} A(\boldsymbol\theta) = \mathrm{Cov}_{\boldsymbol\theta}[T(\mathbf{y})] \succeq 0$
  - A covariance matrix is always positive semi-definite, so the Hessian $\nabla^2_{\boldsymbol\theta} A(\boldsymbol\theta)$ is positive semi-definite, and hence $A(\boldsymbol\theta) = \ln Z(\boldsymbol\theta)$ is a convex function of $\boldsymbol\theta$
Exponential family: moment parameterization

- A distribution in the exponential family can also be parameterized by the moment parameterization:
  $$p(\mathbf{y} \mid \boldsymbol\mu) = \frac{1}{Z(\boldsymbol\mu)}\, h(\mathbf{y}) \exp\{\psi(\boldsymbol\mu)^T T(\mathbf{y})\}, \qquad \boldsymbol\theta = \psi(\boldsymbol\mu), \qquad Z(\boldsymbol\mu) = \int h(\mathbf{y}) \exp\{\psi(\boldsymbol\mu)^T T(\mathbf{y})\}\, d\mathbf{y}$$
- $\psi$ maps the moment parameters $\boldsymbol\mu$ to the space of canonical parameters, where $\boldsymbol\mu \equiv E_{\boldsymbol\theta}[T(\mathbf{y})] = \nabla_{\boldsymbol\theta} A(\boldsymbol\theta)$
- If $\nabla^2_{\boldsymbol\theta} A(\boldsymbol\theta) \succ 0$, then $\psi^{-1}(\boldsymbol\theta) = \boldsymbol\mu = \nabla_{\boldsymbol\theta} A(\boldsymbol\theta)$ is strictly ascending, and thus $\psi^{-1}$ is one-to-one
- Hence the mapping from the moments to the canonical parameters is invertible (a one-to-one relationship): $\boldsymbol\theta = \psi(\boldsymbol\mu)$
Sufficiency

- A statistic is a function of a random variable (i.e., of the data)
- Suppose that the distribution of $Y$ depends on a parameter $\theta$
- "$T(Y)$ is a sufficient statistic for $\theta$ if there is no information in $Y$ regarding $\theta$ beyond that in $T(Y)$"
- Sufficiency in both the frequentist and Bayesian frameworks implies a factorization of $p(y \mid \theta)$ (Neyman factorization theorem):
  $$p(y, T(y), \theta) = \phi_1(T(y), \theta)\, \phi_2(y, T(y)) \;\Rightarrow\; p(y \mid \theta) = g(T(y), \theta)\, h(y, T(y))$$
Sufficient statistic

- Sufficient statistics and the exponential family:
  $$p(\mathbf{y} \mid \boldsymbol\theta) = h(\mathbf{y}) \exp\{\boldsymbol\theta^T T(\mathbf{y}) - A(\boldsymbol\theta)\}$$
- Under i.i.d. sampling, the sufficient statistic for a set of $N$ observations is obtained easily:
  $$p(\mathcal{D} \mid \boldsymbol\theta) = \prod_{n=1}^{N} h(\mathbf{y}^{(n)}) \exp\{\boldsymbol\theta^T T(\mathbf{y}^{(n)}) - A(\boldsymbol\theta)\} = \left( \prod_{n=1}^{N} h(\mathbf{y}^{(n)}) \right) \exp\left\{ \boldsymbol\theta^T \sum_{n=1}^{N} T(\mathbf{y}^{(n)}) - N A(\boldsymbol\theta) \right\}$$
- This is itself an exponential-family distribution, with sufficient statistic $\sum_{n=1}^{N} T(\mathbf{y}^{(n)})$
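To make the point concrete (our own sketch): two different Gaussian samples with the same pooled statistics $(\sum y, \sum y^2)$ are indistinguishable to the likelihood, whatever the parameters:

```python
import numpy as np

def gauss_loglik(data, mu, sigma2):
    """Gaussian log-likelihood, written so it touches the data only
    through N and the pooled statistics (sum y, sum y^2)."""
    N = len(data)
    suff = np.array([data.sum(), (data**2).sum()])
    theta = np.array([mu / sigma2, -1 / (2 * sigma2)])
    A = mu**2 / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2)
    return theta @ suff - N * A

# Two different samples engineered to share the same pooled statistics
d1 = np.array([0.0, 1.0, 2.0])                       # sum = 3, sum of squares = 5
d2 = np.array([1.5, *np.roots([1.0, -1.5, -0.25])])  # also sum = 3, sum of squares = 5

for mu, s2 in [(0.0, 1.0), (1.0, 0.5), (-2.0, 3.0)]:
    assert np.isclose(gauss_loglik(d1, mu, s2), gauss_loglik(d2, mu, s2))
```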
MLE for exponential family

$$\ell(\boldsymbol\theta; \mathcal{D}) = \ln p(\mathcal{D} \mid \boldsymbol\theta) = \ln \prod_{n=1}^{N} h(\mathbf{y}^{(n)}) \exp\{\boldsymbol\theta^T T(\mathbf{y}^{(n)}) - A(\boldsymbol\theta)\} = \sum_{n=1}^{N} \ln h(\mathbf{y}^{(n)}) + \boldsymbol\theta^T \sum_{n=1}^{N} T(\mathbf{y}^{(n)}) - N A(\boldsymbol\theta)$$

- This is a concave function of $\boldsymbol\theta$

$$\nabla_{\boldsymbol\theta}\, \ell(\boldsymbol\theta; \mathcal{D}) = 0 \;\Rightarrow\; \sum_{n=1}^{N} T(\mathbf{y}^{(n)}) - N \nabla_{\boldsymbol\theta} A(\boldsymbol\theta) = 0$$

$$\Rightarrow\; \nabla_{\boldsymbol\theta} A(\widehat{\boldsymbol\theta}) = \frac{1}{N} \sum_{n=1}^{N} T(\mathbf{y}^{(n)}) \;\Rightarrow\; E_{\widehat{\boldsymbol\theta}}[T(\mathbf{y})] = \frac{1}{N} \sum_{n=1}^{N} T(\mathbf{y}^{(n)})$$

- Moment matching: the model moments are set equal to the empirical moments of the sufficient statistics
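As a concrete sketch (ours): for the Bernoulli model, moment matching gives $E_{\hat\theta}[y] = \bar{y}$, so $\hat\theta = \ln(\bar{y}/(1-\bar{y}))$; the check below confirms this stationary point maximizes the concave log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
y = (rng.random(5000) < 0.3).astype(float)   # Bernoulli(0.3) sample

ybar = y.mean()
theta_hat = np.log(ybar / (1 - ybar))        # moment matching: sigmoid(theta) = ybar

def loglik(theta):
    # theta * sum T(y) - N * A(theta), with T(y) = y and A(theta) = ln(1 + e^theta)
    return theta * y.sum() - len(y) * np.log1p(np.exp(theta))

# The matched theta beats nearby values of the concave log-likelihood
assert all(loglik(theta_hat) >= loglik(theta_hat + d) for d in (-0.1, 0.1))
```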
Maximum entropy models

- Among all distributions with certain moments of interest, the exponential family is the most random: it makes the fewest assumptions and imposes the least structure
- Out of all distributions which reproduce the observed sufficient statistics, the exponential family distribution (roughly) makes the fewest additional assumptions
- The unique distribution maximizing the entropy, subject to the constraint that these moments are exactly matched, is an exponential family distribution
Maximum entropy

- Constraints, with $f_k(\mathbf{y})$ an arbitrary function and $F_k$ a constant:
  $$E[f_k] = \sum_{\mathbf{y}} f_k(\mathbf{y})\, p(\mathbf{y}) = F_k$$
- Maximum entropy (maxent): pick the distribution with maximum entropy subject to the constraints
  $$L(p, \boldsymbol\lambda) = -\sum_{\mathbf{y}} p(\mathbf{y}) \log p(\mathbf{y}) + \lambda_0 \left(1 - \sum_{\mathbf{y}} p(\mathbf{y})\right) + \sum_k \lambda_k \left(F_k - \sum_{\mathbf{y}} f_k(\mathbf{y})\, p(\mathbf{y})\right)$$
  $$\frac{\partial L}{\partial p(\mathbf{y})} = 0 \;\Rightarrow\; p(\mathbf{y}) = \frac{1}{Z} \exp\left\{ -\sum_k \lambda_k f_k(\mathbf{y}) \right\}, \qquad Z = \sum_{\mathbf{y}} \exp\left\{ -\sum_k \lambda_k f_k(\mathbf{y}) \right\}$$
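On a finite alphabet the multipliers can be found numerically. Below is a minimal sketch of ours (with the sign of $\lambda$ absorbed into the exponent) that matches a single moment $E[y] = 1.5$ over $y \in \{0, \ldots, 5\}$ by gradient descent on the convex dual $\ln Z(\lambda) - \lambda F$:

```python
import numpy as np

ys = np.arange(6.0)            # support {0, ..., 5}
f = ys                         # one feature: f(y) = y
F = 1.5                        # target moment E[f(y)] = 1.5

lam = 0.0
for _ in range(2000):
    logits = lam * f
    p = np.exp(logits - logits.max())
    p /= p.sum()               # p(y) = exp(lam f(y)) / Z(lam)
    grad = (p * f).sum() - F   # gradient of ln Z(lam) - lam F
    lam -= 0.1 * grad          # descend the convex dual

assert np.isclose((p * f).sum(), F, atol=1e-6)   # moment matched
```

The gradient of the dual is exactly $E_\lambda[f] - F$, so the solver stops precisely when the moment constraint holds; this is the same moment-matching condition as the MLE slide, seen from the maxent side.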
Maximum entropy: constraints

- Constants in the constraints: $F_k$ measure the empirical counts on the training data:
  $$F_k = \frac{1}{N} \sum_{n=1}^{N} f_k(\mathbf{y}^{(n)})$$
- These constraints also ensure consistency automatically
Exponential family: summary

- Many famous distributions are in the exponential family
- Important properties for learning with exponential families:
  - The gradient of the log partition function gives the expected sufficient statistics, or moments; the moments of any distribution in the exponential family can be computed by taking derivatives of the log normalizer
  - The Hessian of the log partition function is positive semi-definite, so the log partition function is convex
  - Among all distributions with certain moments of interest, the exponential family has the highest entropy
- Exponential families are important for modeling the distributions of Markov networks
Generalized linear models (GLIMs)

- Models for the conditional relationship between $y$ and $\mathbf{x}$
- We can use GLIMs to model the CPDs of Bayesian networks
- Examples:
  - Linear regression: $p(y \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \mathcal{N}(y \mid \mathbf{w}^T \mathbf{x}, \sigma^2)$
  - Discriminative linear classifiers (two-class):
    - Logistic regression: $p(y \mid \mathbf{x}, \mathbf{w}) = \mathrm{Ber}(y \mid \sigma(\mathbf{w}^T \mathbf{x}))$, where $\sigma$ is the sigmoid function
    - Probit regression: $p(y \mid \mathbf{x}, \mathbf{w}) = \mathrm{Ber}(y \mid \Phi(\mathbf{w}^T \mathbf{x}))$, where $\Phi$ is the cdf of $\mathcal{N}(0, 1)$
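A minimal sketch (our own, plain NumPy) of fitting the logistic-regression GLIM by gradient ascent on its concave log-likelihood; the data and learning rate are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(500, 2))
y = (rng.random(500) < sigmoid(X @ w_true)).astype(float)

# Gradient of the log-likelihood: X^T (y - sigmoid(Xw)) --
# again the mismatch between observed and expected sufficient statistics
w = np.zeros(2)
for _ in range(1000):
    w += 0.01 * X.T @ (y - sigmoid(X @ w))

# w should land near w_true, up to sampling noise
```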