Probabilistic & Unsupervised Learning Exponential families: convexity, duality and free energies Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London Term 1, Autumn 2018

Exponential families: the log partition function

Consider an exponential family distribution with sufficient statistic $s(X)$ and natural parameter $\theta$ (and no base factor in $X$ alone). We can write its probability or density function as

$$p(X \mid \theta) = \exp\left[\theta^{\mathsf T} s(X) - \Phi(\theta)\right]$$

where $\Phi(\theta)$ is the log partition function

$$\Phi(\theta) = \log \sum_x \exp\left[\theta^{\mathsf T} s(x)\right]$$

$\Phi(\theta)$ plays an important role in the theory of the exponential family. For example, it maps natural parameters to the moments of the sufficient statistics:

$$\frac{\partial}{\partial\theta}\Phi(\theta) = e^{-\Phi(\theta)} \sum_x s(x)\, e^{\theta^{\mathsf T} s(x)} = \mathbb{E}_\theta[s(X)] = \mu(\theta)$$

$$\frac{\partial^2}{\partial\theta^2}\Phi(\theta) = e^{-\Phi(\theta)} \sum_x s(x)^2\, e^{\theta^{\mathsf T} s(x)} - e^{-2\Phi(\theta)} \left[\sum_x s(x)\, e^{\theta^{\mathsf T} s(x)}\right]^2 = \mathbb{V}_\theta[s(X)]$$

The second derivative is thus positive semi-definite, and so $\Phi(\theta)$ is convex in $\theta$.
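The moment relations can be checked numerically on a small example. The sketch below is only an illustration under assumed choices that are not in the slides: a toy family on the states {0, 1, 2, 3} with scalar sufficient statistic s(x) = x, and central finite differences in place of analytic derivatives.

```python
import math

# Hypothetical toy family (an assumption for illustration):
# X takes values in {0, 1, 2, 3} and the sufficient statistic is s(x) = x.
STATES = [0, 1, 2, 3]


def s(x):
    return float(x)


def log_partition(theta):
    """Phi(theta) = log sum_x exp(theta * s(x))."""
    return math.log(sum(math.exp(theta * s(x)) for x in STATES))


def moments(theta):
    """Mean and variance of s(X) under p(x | theta), by direct enumeration."""
    phi = log_partition(theta)
    probs = [math.exp(theta * s(x) - phi) for x in STATES]
    mean = sum(p * s(x) for p, x in zip(probs, STATES))
    var = sum(p * (s(x) - mean) ** 2 for p, x in zip(probs, STATES))
    return mean, var


def finite_differences(theta, h=1e-4):
    """Central-difference estimates of the first and second derivative of Phi."""
    up = log_partition(theta + h)
    mid = log_partition(theta)
    down = log_partition(theta - h)
    return (up - down) / (2 * h), (up - 2 * mid + down) / (h * h)


theta = 0.3
mean, var = moments(theta)
grad, hess = finite_differences(theta)
print("E[s(X)] vs dPhi/dtheta:    ", mean, grad)
print("V[s(X)] vs d2Phi/dtheta2:  ", var, hess)
```

The two printed pairs should agree to within finite-difference error, and the second derivative should be non-negative, consistent with convexity of the log partition function.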

Exponential families: mean parameters and negative entropy

A (minimal) exponential family distribution can also be parameterised by the means of the sufficient statistics:

$$\mu(\theta) = \mathbb{E}_\theta[s(X)]$$

Consider the negative entropy of the distribution as a function of the mean parameter:

$$\Psi(\mu) = \mathbb{E}_\theta\left[\log p(X \mid \theta(\mu))\right] = \theta^{\mathsf T}\mu - \Phi(\theta)$$

so

$$\theta^{\mathsf T}\mu = \Phi(\theta) + \Psi(\mu)$$

The negative entropy is dual to the log-partition function. For example,

$$\frac{d}{d\mu}\Psi(\mu) = \frac{\partial}{\partial\mu}\left[\theta^{\mathsf T}\mu - \Phi(\theta)\right] + \frac{d\theta}{d\mu}\,\frac{\partial}{\partial\theta}\left[\theta^{\mathsf T}\mu - \Phi(\theta)\right] = \theta + \frac{d\theta}{d\mu}\,(\mu - \mu) = \theta$$

Exponential families: duality

In fact, the log partition function and negative entropy are Legendre dual or convex conjugate functions. Consider the KL divergence between distributions with natural parameters $\theta$ and $\theta'$:

$$\mathrm{KL}\left[\theta \,\|\, \theta'\right] = \mathrm{KL}\left[p(X \mid \theta) \,\|\, p(X \mid \theta')\right] = \mathbb{E}_\theta\left[-\log p(X \mid \theta') + \log p(X \mid \theta)\right]$$

$$= -\theta'^{\mathsf T}\mu + \Phi(\theta') + \Psi(\mu) \ge 0 \quad\Rightarrow\quad \Psi(\mu) \ge \theta'^{\mathsf T}\mu - \Phi(\theta')$$

where $\mu$ are the mean parameters corresponding to $\theta$. Now, the minimum KL divergence of zero is reached iff $\theta = \theta'$, so

$$\Psi(\mu) = \sup_{\theta'}\left[\theta'^{\mathsf T}\mu - \Phi(\theta')\right] \quad\text{and, if finite,}\quad \theta(\mu) = \operatorname*{argmax}_{\theta'}\left[\theta'^{\mathsf T}\mu - \Phi(\theta')\right]$$

The left-hand equation is the definition of the conjugate dual of a convex function. Continuous convex functions are reciprocally dual, so we also have:

$$\Phi(\theta) = \sup_{\mu'}\left[\theta^{\mathsf T}\mu' - \Psi(\mu')\right] \quad\text{and, if finite,}\quad \mu(\theta) = \operatorname*{argmax}_{\mu'}\left[\theta^{\mathsf T}\mu' - \Psi(\mu')\right]$$

Thus, duality gives us another relation between $\theta$ and $\mu$.
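As a concrete instance of the conjugate relation, consider a Bernoulli variable with $s(x) = x$, for which $\Phi(\theta) = \log(1 + e^\theta)$ and $\Psi(\mu) = \mu\log\mu + (1-\mu)\log(1-\mu)$. The sketch below is an illustrative check rather than part of the lecture: it approximates the supremum over $\theta'$ by a simple grid search (the grid range and resolution are arbitrary assumptions) and compares the result with the closed forms.

```python
import math


def phi(theta):
    """Bernoulli log partition function: Phi(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))


def psi(mu):
    """Bernoulli negative entropy as a function of the mean, for 0 < mu < 1."""
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)


def conjugate_of_phi(mu, lo=-10.0, hi=10.0, n=20001):
    """Grid approximation to sup over theta' of [theta' * mu - Phi(theta')].

    Returns the best value found and the grid point that attains it.
    """
    best_val, best_theta = -math.inf, None
    for i in range(n):
        t = lo + (hi - lo) * i / (n - 1)
        val = t * mu - phi(t)
        if val > best_val:
            best_val, best_theta = val, t
    return best_val, best_theta


mu = 0.7
sup_val, theta_hat = conjugate_of_phi(mu)
theta_exact = math.log(mu / (1 - mu))  # natural parameter with mean mu

print("sup value vs Psi(mu):   ", sup_val, psi(mu))
print("argmax vs logit(mu):    ", theta_hat, theta_exact)
```

The supremum should match the negative entropy and the maximiser should match the logit of $\mu$, up to the grid resolution. For any other $\theta'$ the bracketed expression is strictly smaller, which is the KL inequality in the text.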

Duality, inference and the free energy

Consider a joint exponential family distribution on observed $x$ and latent $z$:

$$p(x, z) = \exp\left[\theta^{\mathsf T} s(x, z) - \Phi_{XZ}(\theta)\right]$$

The posterior on $z$ is also in the exponential family, with the clamped sufficient statistic $s_Z(z; x) = s_{XZ}(x_{\text{obs}}, z)$; the same (now possibly redundant) natural parameter $\theta$; and partition function $\Phi_Z(\theta) = \log \sum_z \exp\left[\theta^{\mathsf T} s_Z(z; x)\right]$.

The likelihood is

$$L(\theta) = p(x \mid \theta) = \sum_z e^{\theta^{\mathsf T} s(x, z) - \Phi_{XZ}(\theta)} = \sum_z e^{\theta^{\mathsf T} s_Z(z; x)}\, e^{-\Phi_{XZ}(\theta)} = \exp\left[\Phi_Z(\theta) - \Phi_{XZ}(\theta)\right]$$

so $\ell(\theta) = \Phi_Z(\theta) - \Phi_{XZ}(\theta)$. Replacing $\Phi_Z(\theta)$ by its conjugate-dual expression as a supremum over mean parameters, we can write the log-likelihood as

$$\ell(\theta) = \sup_{\mu_Z}\Big[\underbrace{\theta^{\mathsf T}\mu_Z - \Phi_{XZ}(\theta)}_{\langle \log p(x, z)\rangle_q} - \underbrace{\Psi(\mu_Z)}_{-H[q]}\Big] = \sup_{\mu_Z} F(\theta, \mu_Z)$$

This is the familiar free energy, with $q(z)$ represented by its mean parameters $\mu_Z$!
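The identity between the log-likelihood and the maximised free energy can be verified by brute force on a tiny model. The following sketch assumes, purely for illustration, binary $x$ and $z$, sufficient statistics $s(x, z) = [x, z, xz]$, an arbitrary parameter vector `THETA`, and an observation $x = 1$; none of these values come from the slides. Here $q$ is summarised by the single number $m = q(z = 1)$.

```python
import math

# Hypothetical natural parameters and observation, chosen only for illustration.
THETA = (0.5, -1.0, 2.0)
X_OBS = 1


def suff_stats(x, z):
    """Joint sufficient statistics s(x, z) = [x, z, x*z]."""
    return (x, z, x * z)


def score(x, z):
    """theta^T s(x, z)."""
    return sum(t * s for t, s in zip(THETA, suff_stats(x, z)))


# Log partition functions of the joint and of the clamped (posterior) family.
phi_xz = math.log(sum(math.exp(score(x, z)) for x in (0, 1) for z in (0, 1)))
phi_z = math.log(sum(math.exp(score(X_OBS, z)) for z in (0, 1)))
log_lik = phi_z - phi_xz


def free_energy(m):
    """F(theta, mu_Z) for q(z = 1) = m with 0 < m < 1."""
    mu_z = (X_OBS, m, X_OBS * m)  # E_q[s(x_obs, z)], linear in m
    expected_log_joint = sum(t * u for t, u in zip(THETA, mu_z)) - phi_xz
    neg_entropy = m * math.log(m) + (1 - m) * math.log(1 - m)
    return expected_log_joint - neg_entropy


# Exact posterior probability of z = 1 given the observation.
posterior_m = math.exp(score(X_OBS, 1) - phi_z)

print("log-likelihood:          ", log_lik)
print("F at the true posterior: ", free_energy(posterior_m))
print("F at m = 0.2:            ", free_energy(0.2))
```

The free energy evaluated at the exact posterior mean should equal the log-likelihood, and any other value of $m$ should give a strictly smaller value, in line with the supremum in the slide.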

Inference with mean parameters

We have described inference in terms of the distribution $q$, approximating as needed, then computing expected sufficient statistics. Can we describe it instead as an optimisation over $\mu$ directly?

$$\mu_Z^* = \operatorname*{argmax}_{\mu_Z}\left[\theta^{\mathsf T}\mu_Z - \Psi(\mu_Z)\right]$$

Concave maximisation(!), but two complications:

◮ The optimum must be found over feasible means. Interdependence of the sufficient statistics may prevent arbitrary sets of mean sufficient statistics being achieved.
  ◮ Feasible means are convex combinations of all the single-configuration sufficient statistics:
    $$\mu = \sum_x \nu(x)\, s(x), \qquad \sum_x \nu(x) = 1, \qquad \nu(x) \ge 0$$
  ◮ Take a Boltzmann machine on two variables, $x_1, x_2$.
  ◮ The sufficient stats are $s(x) = [x_1, x_2, x_1 x_2]$.
  ◮ Clearly only the stats $S = \{[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 1]\}$ are possible.
  ◮ Thus $\mu \in \text{convex hull}(S)$.
  ◮ For a discrete distribution, this space of possible means is bounded by exponentially many hyperplanes connecting the discrete configuration stats: called the marginal polytope.
◮ Even when restricted to the marginal polytope, evaluating $\Psi(\mu)$ can be challenging.
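For the two-variable Boltzmann machine the marginal polytope is small enough to test membership directly. With four configurations and three mean parameters plus the normalisation constraint, the weights $\nu(x)$ are determined uniquely by $\mu$, so feasibility reduces to checking that they are all non-negative. The sketch below relies on that observation; the function names are hypothetical and the approach does not scale to larger models, where the number of configurations grows exponentially.

```python
def config_weights(mu):
    """Solve mu = sum_x nu(x) s(x), sum_x nu(x) = 1 for s(x) = [x1, x2, x1*x2].

    mu is a triple (E[x1], E[x2], E[x1*x2]). The result maps each
    configuration (x1, x2) to its weight nu(x), which may be negative
    if mu lies outside the marginal polytope.
    """
    m1, m2, m12 = mu
    return {
        (1, 1): m12,
        (1, 0): m1 - m12,
        (0, 1): m2 - m12,
        (0, 0): 1.0 - m1 - m2 + m12,
    }


def in_marginal_polytope(mu, tol=1e-12):
    """True if mu is a convex combination of the four configuration stats."""
    return all(w >= -tol for w in config_weights(mu).values())


def mean_from_weights(nu):
    """Recompute mu = sum_x nu(x) s(x) from configuration weights."""
    m1 = sum(w * x1 for (x1, x2), w in nu.items())
    m2 = sum(w * x2 for (x1, x2), w in nu.items())
    m12 = sum(w * x1 * x2 for (x1, x2), w in nu.items())
    return (m1, m2, m12)


independent = (0.5, 0.5, 0.25)   # two independent fair coins
too_correlated = (0.5, 0.5, 0.6)  # E[x1*x2] cannot exceed E[x1]
too_sparse = (0.9, 0.9, 0.7)      # would need negative weight on (0, 0)

for mu in (independent, too_correlated, too_sparse):
    print(mu, in_marginal_polytope(mu))
```

The first point is feasible, while the other two violate one of the facet inequalities even though each individual entry lies in $[0, 1]$.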

Convexity and undirected trees

◮ We can parametrise a discrete pairwise MRF as follows:

$$p(X) = \frac{1}{Z} \prod_i f_i(X_i) \prod_{(ij)} f_{ij}(X_i, X_j)$$

$$= \exp\left[\sum_i \sum_k \theta_i(k)\,\delta(X_i = k) + \sum_{(ij)} \sum_{k,l} \theta_{ij}(k, l)\,\delta(X_i = k)\,\delta(X_j = l) - \Phi(\theta)\right]$$

◮ So discrete MRFs are always exponential family, with natural and mean parameters:

$$\theta = \left[\theta_i(k),\ \theta_{ij}(k, l)\right] \quad \forall\, i, j, k, l$$

$$\mu = \left[p(X_i = k),\ p(X_i = k, X_j = l)\right] \quad \forall\, i, j, k, l$$

In particular, the mean parameters are just the singleton and pairwise probability tables.
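The correspondence between factor tables, natural parameters and marginal tables can be illustrated by enumeration on a very small model. The sketch below assumes a hypothetical three-node binary chain with made-up positive factor values, takes $\theta = \log f$, and computes the mean parameters as expectations of the indicator statistics. Enumeration is used only because the model is tiny.

```python
import itertools
import math

K = 2                      # number of states per variable
NODES = [0, 1, 2]
EDGES = [(0, 1), (1, 2)]   # a three-node chain

# Hypothetical positive factor tables, chosen only for illustration.
F_NODE = {0: [1.0, 2.0], 1: [0.5, 1.5], 2: [1.0, 1.0]}
F_EDGE = {
    (0, 1): [[2.0, 1.0], [1.0, 3.0]],
    (1, 2): [[1.0, 0.5], [0.5, 2.0]],
}


def unnormalised(x):
    """Product of singleton and pairwise factors for configuration x."""
    p = 1.0
    for i in NODES:
        p *= F_NODE[i][x[i]]
    for (i, j) in EDGES:
        p *= F_EDGE[(i, j)][x[i]][x[j]]
    return p


def log_linear_score(x):
    """theta^T s(x) with theta = log f: each indicator selects one table entry."""
    total = sum(math.log(F_NODE[i][x[i]]) for i in NODES)
    total += sum(math.log(F_EDGE[(i, j)][x[i]][x[j]]) for (i, j) in EDGES)
    return total


configs = list(itertools.product(range(K), repeat=len(NODES)))
Z = sum(unnormalised(x) for x in configs)
log_Z = math.log(Z)  # this is Phi(theta)
prob = {x: math.exp(log_linear_score(x) - log_Z) for x in configs}

# Mean parameters: expectations of the indicators are the marginal tables.
mu_node = {
    i: [sum(p for x, p in prob.items() if x[i] == k) for k in range(K)]
    for i in NODES
}
mu_edge = {
    (i, j): [
        [sum(p for x, p in prob.items() if x[i] == k and x[j] == l)
         for l in range(K)]
        for k in range(K)
    ]
    for (i, j) in EDGES
}

print("p(X_1 = k):       ", mu_node[1])
print("p(X_0 = k, X_1 = l):", mu_edge[(0, 1)])
```

Each singleton table sums to one, and summing a pairwise table over one variable recovers the singleton table of the other, which is the local consistency that any point in the marginal polytope must satisfy.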
