Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images, Parts III-IV

SLIDE 1

Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts III-IV

Aapo Hyvärinen

Gatsby Unit, University College London

SLIDE 2

Part III: Estimation of unnormalized models

◮ Often, in natural image statistics, the probabilistic models are unnormalized
◮ This is a major computational problem
◮ Here, we consider new methods to tackle this problem
◮ Later, we see applications to natural image statistics

SLIDE 3

Unnormalized models: Problem definition

◮ We want to estimate a parametric model of a multivariate random vector x ∈ R^n
◮ The density function f_norm is known only up to a multiplicative constant:
$$ f_{\text{norm}}(x; \theta) = \frac{1}{Z(\theta)}\, p_{\text{un}}(x; \theta), \qquad Z(\theta) = \int_{\xi \in \mathbb{R}^n} p_{\text{un}}(\xi; \theta)\, d\xi $$
◮ The functional form of p_un is known (can be easily computed)
◮ The partition function Z cannot be computed in reasonable computing time (numerical integration; illustrated below)
◮ Here: how to estimate the model while avoiding numerical integration?
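
To make the computational problem concrete, here is a minimal sketch (my own illustration; the toy model and its parameters are assumptions, not from the slides): brute-force numerical integration of Z(θ) on a grid with k points per dimension costs k^n density evaluations, which already hurts at n = 3 and is hopeless at image dimensionalities.

```python
# Illustration (my own, not from the slides): why Z(theta) is intractable.
# Toy ICA-like unnormalized model p_un(x; W) = exp(sum_i G(w_i^T x)) with
# G(u) = -|u|.  Grid integration of Z needs k**n density evaluations, so it
# is only feasible for very small n (natural image patches have n in the 100s).
import numpy as np

def p_un(x, W):
    """Unnormalized density exp(sum_i G(w_i^T x)) with G(u) = -|u|."""
    return np.exp(-np.sum(np.abs(W @ x)))

def Z_grid(W, n, k=51, lim=10.0):
    """Brute-force partition function by grid integration: k**n evaluations."""
    grid = np.linspace(-lim, lim, k)
    dx = grid[1] - grid[0]
    mesh = np.stack(np.meshgrid(*([grid] * n), indexing="ij"), axis=-1)
    vals = np.apply_along_axis(lambda xi: p_un(xi, W), -1, mesh)
    return vals.sum() * dx**n

rng = np.random.default_rng(0)
for n in (1, 2, 3):                 # already slow at n = 3, hopeless for images
    W = rng.standard_normal((n, n))
    print(n, Z_grid(W, n))
```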

SLIDE 4

Examples of unnormalized models related to ICA

◮ ICA with an overcomplete basis is given simply by
$$ f_{\text{norm}}(x; W) = \frac{1}{Z(W)} \exp\Big[ \sum_i G(w_i^T x) \Big] \qquad (1) $$
◮ Estimation of the second layer in ISA and topographic ICA:
$$ f_{\text{norm}}(x; W, M) = \frac{1}{Z(W, M)} \exp\Big[ \sum_i G\Big( \sum_j m_{ij}\, (w_j^T x)^2 \Big) \Big] \qquad (2) $$
◮ Non-Gaussian Markov Random Fields
◮ ... many more

SLIDE 5

Previous solutions

◮ Monte Carlo methods
  ◮ Consistent estimators (convergence to the true parameter values when sample size → ∞)
  ◮ Computation very slow (I think)
◮ Various approximations, e.g. variational methods
  ◮ Computation often fast
  ◮ Consistency not known, or proven inconsistent
◮ Pseudo-likelihood and contrastive divergence
  ◮ Presumably consistent
  ◮ Computation slow with continuous-valued variables: needs 1-D integration at every step, or sophisticated MCMC methods

SLIDE 6

Content of this talk

◮ We have proposed two methods for estimation of unnormalized models
◮ Both methods avoid numerical integration
◮ First: Score matching (Hyvärinen, JMLR, 2005)
  ◮ Take the derivative of the model log-density w.r.t. x, so the partition function disappears
  ◮ Fit this derivative to the same derivative of the data density
  ◮ Easy to compute due to a partial integration trick
  ◮ Closed-form solution for exponential families
◮ Second: Noise-contrastive estimation (Gutmann and Hyvärinen, JMLR, 2012)
  ◮ Learn to distinguish data from artificially generated noise: logistic regression learns the ratio of the pdfs of data and noise
  ◮ Since the noise pdf is known, we have in fact learnt the data pdf
  ◮ Consistent even in the unnormalized case

SLIDE 7

Definition of “score function” (in this talk)

◮ Define the model score function R^n → R^n as
$$ \psi(\xi; \theta) = \begin{pmatrix} \dfrac{\partial \log f_{\text{norm}}(\xi;\theta)}{\partial \xi_1} \\ \vdots \\ \dfrac{\partial \log f_{\text{norm}}(\xi;\theta)}{\partial \xi_n} \end{pmatrix} = \nabla_\xi \log f_{\text{norm}}(\xi; \theta) $$
  where f_norm is the normalized model density.
◮ Similarly, define the data score function as
$$ \psi_x(\xi) = \nabla_\xi \log p_x(\xi) $$
  where the observed data is assumed to follow p_x(.).
◮ In conventional terminology: the Fisher score with respect to a hypothetical location parameter in f_norm(x − θ), evaluated at θ = 0.

SLIDE 8

Score matching: definition of objective function

◮ Estimate by minimizing a distance between the model score function ψ(.; θ) and the score function of the observed data ψ_x(.):
$$ J(\theta) = \frac{1}{2} \int_{\xi \in \mathbb{R}^n} p_x(\xi)\, \| \psi(\xi; \theta) - \psi_x(\xi) \|^2\, d\xi \qquad (3) $$
$$ \hat{\theta} = \arg\min_\theta J(\theta) $$
◮ This gives a consistent estimator almost by construction
◮ ψ(ξ; θ) does not depend on Z(θ) because
$$ \psi(\xi; \theta) = \nabla_\xi \log p_{\text{un}}(\xi; \theta) - \nabla_\xi \log Z(\theta) = \nabla_\xi \log p_{\text{un}}(\xi; \theta) - 0 \qquad (4) $$
  (a tiny numeric check of this identity is sketched below)
◮ No need to compute the normalization constant Z; the unnormalized pdf p_un is enough
◮ Computation of J is quite simple due to the theorem below
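
As a sanity check of Eq. (4), here is a tiny numeric illustration (my own, not from the slides): for a 1-D Gaussian, the score computed from the unnormalized density by finite differences coincides with the one computed from the normalized density, because the ∇_ξ log Z(θ) term is zero.

```python
# Tiny numeric check (my own illustration, not from the slides) of Eq. (4):
# the score computed from the unnormalized density equals the one computed
# from the normalized density, since grad_xi log Z(theta) = 0.
import numpy as np

theta = 0.7                                   # precision of a 1-D Gaussian
Z = np.sqrt(2.0 * np.pi / theta)              # known analytically here, used only for the check
log_p_un = lambda x: -0.5 * theta * x**2
log_f_norm = lambda x: log_p_un(x) - np.log(Z)

xi, eps = 1.3, 1e-5                           # finite-difference derivative at xi
score_un = (log_p_un(xi + eps) - log_p_un(xi - eps)) / (2 * eps)
score_norm = (log_f_norm(xi + eps) - log_f_norm(xi - eps)) / (2 * eps)
print(score_un, score_norm, -theta * xi)      # all approximately -0.91
```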

SLIDE 9

A computational trick: central theorem of score matching

◮ The objective function contains the score function of the data distribution, ψ_x(.). How can we compute it?
◮ In fact, there is no need to compute it, because of the following

Theorem
Assume some regularity conditions, and smooth densities. Then, the score matching objective function J can be expressed as
$$ J(\theta) = \int_{\xi \in \mathbb{R}^n} p_x(\xi) \sum_{i=1}^{n} \Big[ \partial_i \psi_i(\xi; \theta) + \frac{1}{2}\, \psi_i(\xi; \theta)^2 \Big]\, d\xi + \text{const.} \qquad (5) $$
where the constant does not depend on θ, and
$$ \psi_i(\xi; \theta) = \frac{\partial \log p_{\text{un}}(\xi; \theta)}{\partial \xi_i}, \qquad \partial_i \psi_i(\xi; \theta) = \frac{\partial^2 \log p_{\text{un}}(\xi; \theta)}{\partial \xi_i^2} $$

SLIDE 10

Simple explanation of score matching trick

◮ Expand the objective function J(θ):
$$ J(\theta) = \frac{1}{2} \int p_x(\xi)\, \| \psi(\xi; \theta) \|^2\, d\xi \;-\; \int p_x(\xi)\, \psi_x(\xi)^T \psi(\xi; \theta)\, d\xi + \text{const.} $$
◮ The constant does not depend on θ, and the first term is easy to compute.
◮ The trick is to use partial integration on the second term. In one dimension:
$$ \int p_x(x)\, (\log p_x)'(x)\, \psi(x; \theta)\, dx = \int p_x(x)\, \frac{p_x'(x)}{p_x(x)}\, \psi(x; \theta)\, dx = \int p_x'(x)\, \psi(x; \theta)\, dx = 0 - \int p_x(x)\, \psi'(x; \theta)\, dx $$
◮ This is why the score function of the data distribution p_x(x) disappears!

SLIDE 11

Final method of score matching

◮ Replace the integration over the data density p_x(.) by a sample average
◮ Given T observations x(1), ..., x(T), minimize
$$ \tilde{J}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \Big[ \partial_i \psi_i(x(t); \theta) + \frac{1}{2}\, \psi_i(x(t); \theta)^2 \Big] \qquad (6) $$
  where ψ_i is a partial derivative of the unnormalized model log-density log p_un, and ∂_i ψ_i a second partial derivative
◮ Only requires evaluating some derivatives of the unnormalized log-density p_un, which are simple to compute (by assumption)
◮ Thus: a new computationally simple and statistically consistent method for parameter estimation (a small numerical sketch follows below)
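
A minimal sketch of Eq. (6) on a toy example (my own illustration; the model and data are assumptions, not from the slides): estimate the precision λ of a zero-mean Gaussian from the unnormalized density p_un(x; λ) = exp(−λx²/2). Here ψ(x; λ) = −λx and its derivative is −λ, so the sample objective can be written down directly.

```python
# Minimal sketch (my own toy example, not from the slides) of Eq. (6):
# estimate the precision lam of a zero-mean Gaussian from the unnormalized
# density p_un(x; lam) = exp(-lam * x**2 / 2).  Here
#   psi(x; lam)       = d/dx log p_un = -lam * x
#   d/dx psi(x; lam)  = -lam
# so J~(lam) = mean( -lam + 0.5 * lam**2 * x**2 ).
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=100_000)            # T observations

def J_tilde(lam, x):
    """Sample score matching objective, Eq. (6), for the toy model."""
    return np.mean(-lam + 0.5 * lam**2 * x**2)

lams = np.linspace(0.01, 2.0, 2000)                 # minimize over a grid
lam_hat_grid = lams[np.argmin([J_tilde(l, x) for l in lams])]
lam_hat_closed = 1.0 / np.mean(x**2)                # closed-form minimizer for this model

print("true precision        :", 1.0 / sigma**2)
print("score matching (grid) :", lam_hat_grid)
print("closed form           :", lam_hat_closed)
```

For this toy model the minimizer coincides with maximum likelihood (λ̂ = 1/mean(x²)); the point of score matching is that the same recipe works when the normalization constant is unknown.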

SLIDE 12

Closed-form solution in the exponential family

◮ Assume the pdf can be expressed in the form
$$ \log p_{\text{un}}(\xi; \theta) = \sum_{k=1}^{m} \theta_k F_k(\xi) - \log Z(\theta) \qquad (7) $$
◮ Define matrices of partial derivatives:
$$ K_{ki}(\xi) = \frac{\partial F_k}{\partial \xi_i}, \qquad H_{ki}(\xi) = \frac{\partial^2 F_k}{\partial \xi_i^2} \qquad (8) $$
◮ Then, the score matching estimator is given by
$$ \hat{\theta} = -\Big[ \hat{E}\{ K(x) K(x)^T \} \Big]^{-1} \Big( \sum_i \hat{E}\{ h_i(x) \} \Big) \qquad (9) $$
  where Ê denotes the sample average, and the vector h_i is the i-th column of the matrix H (a worked numerical example follows below).
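
Here is a small worked example of Eq. (9) (my own sketch; the choice of sufficient statistics is an assumption, not from the slides): a 1-D exponential family with F_1(x) = −x²/2 and F_2(x) = −x⁴/4. For Gaussian data the estimator should recover θ_1 ≈ 1/σ² and θ_2 ≈ 0.

```python
# Minimal sketch (my own example, not from the slides) of the closed-form
# estimator (9) for a 1-D exponential family with two sufficient statistics:
#   log p_un(x; theta) = theta_1 * (-x**2 / 2) + theta_2 * (-x**4 / 4)
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.5
x = rng.normal(0.0, sigma, size=200_000)              # T observations, n = 1

# K(x): first derivatives of the F_k;  H(x): second derivatives of the F_k.
# Both are stored with shape (m, n, T), here m = 2 statistics and n = 1.
K = np.stack([-x, -x**3], axis=0)[:, None, :]
H = np.stack([-np.ones_like(x), -3 * x**2], axis=0)[:, None, :]

m, n, T = K.shape
KKt = np.einsum("ait,bit->ab", K, K) / T              # Ehat{ K(x) K(x)^T }, shape (m, m)
h_sum = H.sum(axis=1).mean(axis=1)                    # sum_i Ehat{ h_i(x) }, shape (m,)

theta_hat = -np.linalg.solve(KKt, h_sum)              # Eq. (9)
print("theta_hat:", theta_hat, " (expect approx [", 1 / sigma**2, ", 0 ])")
```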

SLIDE 13

ICA with overcomplete basis

SLIDE 14

Second method: Noise-contrastive estimation (NCE)

◮ Train a nonlinear classifier to discriminate observed data from some artificial noise
◮ To be successful, the classifier must “discover structure” in the data
◮ For example, compare natural images with Gaussian noise

[Figure: example patches of natural images vs. Gaussian noise]

SLIDE 15

Definition of classifier in NCE

◮ Observed data set X = (x(1), ..., x(T)) with unknown pdf p_x
◮ Generate “noise” Y = (y(1), ..., y(T)) with known pdf p_y
◮ Define a nonlinear function (e.g. a multilayer perceptron) g(u; θ), which models the data log-density log p_x(u)
◮ We use logistic regression with the nonlinearity
$$ G(u; \theta) = g(u; \theta) - \log p_y(u) \qquad (10) $$
◮ Well-known developments lead to the objective (likelihood)
$$ J(\theta) = \sum_t \log h(x(t); \theta) + \log\big[ 1 - h(y(t); \theta) \big], \qquad \text{where } h(u; \theta) = \frac{1}{1 + \exp[-G(u; \theta)]} \qquad (11) $$
  (a small numerical sketch of Eqs. (10)-(11) follows below)
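
A minimal sketch of Eqs. (10)-(11) (my own illustration; the Laplacian toy model and the optimizer are assumptions, not from the slides). It also anticipates two later points: the log normalization constant is treated as a free parameter c, and the noise is Gaussian with the same mean and variance as the data.

```python
# Minimal sketch (not from the slides): NCE, Eqs. (10)-(11), for a toy
# Laplacian model with unnormalized log-density
#     g(u; theta) = -lam * |u| + c,    theta = (lam, c),
# where c plays the role of the (negative) log partition function.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
b = 1.5                                           # true Laplace scale
x = rng.laplace(0.0, b, size=50_000)              # observed data
noise_std = x.std()
y = rng.normal(0.0, noise_std, size=x.size)       # noise sample with known pdf

def log_py(u):                                    # log-pdf of the Gaussian noise
    return -0.5 * (u / noise_std) ** 2 - np.log(noise_std * np.sqrt(2.0 * np.pi))

def neg_J(theta):
    lam, c = theta
    G_x = (-lam * np.abs(x) + c) - log_py(x)      # Eq. (10) on data points
    G_y = (-lam * np.abs(y) + c) - log_py(y)      # Eq. (10) on noise points
    # Eq. (11): sum_t log h(x(t)) + log[1 - h(y(t))], computed stably using
    # log h(u) = -log(1 + exp(-G)) and log(1 - h(u)) = -log(1 + exp(G))
    J = -np.logaddexp(0.0, -G_x).sum() - np.logaddexp(0.0, G_y).sum()
    return -J                                     # minimize the negative objective

lam_hat, c_hat = minimize(neg_J, x0=np.array([1.0, -1.0]), method="Nelder-Mead").x
print("lam_hat:", lam_hat, "(true 1/b =", 1.0 / b, ")")
print("c_hat  :", c_hat, "(true -log 2b =", -np.log(2.0 * b), ")")
```

Because there is no normalization constraint on g, the estimate of c converges to the true negative log partition function, which is exactly the point made two slides below about unnormalized models.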

SLIDE 16

What does the classifying system do in NCE?

◮ Theorem:
  ◮ Assume our parametric model g(u; θ) (e.g. an MLP) can approximate any function.
  ◮ Then, the maximum of the classification objective is attained when
$$ g(u; \theta) = \log p_x(u) \qquad (12) $$
    where p_x(u) is the pdf of the observed data.
◮ Corollary: If the data is generated according to the model, i.e. log p_x(u) = g(u; θ*), we have a statistically consistent estimator.
◮ Supervised learning thus leads to unsupervised estimation of a probabilistic model given by the log-density g(u; θ).

SLIDE 17

The really important point: NCE estimates unnormalized models

◮ The maximum of the objective function is attained when g(u; θ) = log p_x(u), and there is no constraint on g in this optimization problem!
◮ In particular, there is no normalization constraint (such as ∫ exp(g(u; θ)) du = 1)
◮ Even if the family g(u; θ) is not normalized, the maximum is still attained for the properly normalized pdf
◮ In practice, the normalization constant (partition function) can be estimated like any other parameter
◮ For an unnormalized model, add a new parameter c:
$$ g(u; \theta) \;\to\; g(u; \theta) + c $$

SLIDE 18

Choice of noise distribution in NCE

◮ The noise distribution p_y is an important design parameter.
◮ We would like a p_y that fulfills the following:
  1. Easy to sample from
     ◮ But we only need to sample the noise once, off-line
  2. Has an analytical expression
     ◮ But we only need to, e.g., normalize it once
  3. Leads to a small mean-squared error of the estimator
     ◮ This can be analyzed, but the optimization is not simple
◮ In practice, we can take Gaussian noise with the same mean and covariance as the data (a short sketch follows below).
◮ Intuitively, the noise should be rather similar to the data, so that classification is not too easy.
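
A short sketch of that practical choice (my own helper, not from the slides): draw the noise sample from a Gaussian whose mean and covariance are estimated from the data.

```python
# Minimal sketch (assumed helper, not from the slides): Gaussian noise with
# the same mean and covariance as the data, as suggested above.
import numpy as np

def gaussian_noise_like(X, size, rng=None):
    """Sample `size` noise vectors from N(mean(X), cov(X)); X has shape (T, n)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    return rng.multivariate_normal(mean, cov, size=size)

# Usage: Y = gaussian_noise_like(X, size=X.shape[0])
```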

SLIDE 19

Comparison between score matching and NCE

Computation

◮ NCE needs an auxiliary noise distribution, while SM does not
◮ In some models (e.g. multilayer neural networks), SM is algebraically difficult; the complexity of NCE is similar to MLE of a normalized model
◮ In exponential families, SM is particularly simple

Statistics

◮ Both methods are consistent
◮ NCE is Fisher-efficient in the limit of an infinite noise sample
◮ SM is probably not Fisher-efficient, but can be shown to have some other optimality properties (Hyvärinen, 2008)
◮ Noise-contrastive estimation turns out to be closely related to importance sampling (Pihlaja et al., UAI, 2010)
◮ A general framework can be developed (Gutmann and Hirayama, UAI, 2011)

SLIDE 20

Comparative simulation: computation-statistics trade-off

◮ Assume a potentially infinite data set
◮ Estimation error is then limited by computation only
◮ Compute estimation error vs. computation time for each method
◮ In NCE, the noise sample size determines part of the trade-off: for an infinite noise sample, it is Fisher-efficient
◮ The trade-off depends strongly on the data and the model

[Figure: squared estimation error (log10 sqError) vs. time till convergence (log10 s) for NCE, IS, SM, and MLE; two panels]

SLIDE 21

Conclusion: Estimation of unnormalized models

◮ Unnormalized models are important in natural image statistics
◮ We presented two methods for estimating parameters in unnormalized models
◮ Unlike typical methods, we avoided numerical integration (or MC methods)
◮ In score matching, we match gradients of log-densities: the partition function (normalization constant) is completely avoided by taking a derivative
◮ In noise-contrastive estimation, we learn a logistic regression to discriminate data from artificial noise: the partition function can be estimated like any other parameter

SLIDE 22

Part IV: A three-layer model of natural images

◮ Deep learning is often a black box
◮ For neurophysiological modelling, we would prefer a network where
  ◮ The role of each unit is clear
  ◮ All cell responses model biological responses
◮ Instead of blindly stacking many layers on top of each other, we must think about what each layer is doing
◮ Here: fix a complex cell model, and estimate another layer by ICA

SLIDE 23

Going towards V2

◮ Compute fixed complex cell outputs for natural images
◮ Do ICA on the complex cell outputs
◮ A simple model of dependencies in complex cell outputs (a pipeline sketch follows below)
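
A hedged sketch of this two-stage pipeline (my own illustration; the Gabor filter bank, patch size, and the log nonlinearity before ICA are assumptions, not details given on the slide): compute energy-model complex cell outputs with quadrature Gabor pairs, then run ICA (here sklearn's FastICA) on those outputs.

```python
# Hedged sketch (assumptions: filter bank, patch size, nonlinearity) of the
# pipeline on this slide: fixed complex cells, then ICA on their outputs.
import numpy as np
from sklearn.decomposition import FastICA

def complex_cell_outputs(patches, freqs=(0.1, 0.2), n_orient=8):
    """Energy-model complex cells: squared responses of quadrature Gabor pairs.

    patches: array of shape (n_patches, side*side), e.g. 16x16 image patches.
    Returns an array of shape (n_patches, len(freqs) * n_orient).
    """
    side = int(np.sqrt(patches.shape[1]))
    yy, xx = np.mgrid[0:side, 0:side] - side // 2
    env = np.exp(-(xx**2 + yy**2) / (2.0 * (0.3 * side) ** 2))   # Gaussian envelope
    outs = []
    for f in freqs:
        for k in range(n_orient):
            th = np.pi * k / n_orient
            u = xx * np.cos(th) + yy * np.sin(th)
            even = (env * np.cos(2 * np.pi * f * u)).ravel()     # quadrature pair
            odd = (env * np.sin(2 * np.pi * f * u)).ravel()
            outs.append((patches @ even) ** 2 + (patches @ odd) ** 2)
    return np.stack(outs, axis=1)

# Usage (patches = whitened natural image patches, shape (n_patches, 256)):
#   cc  = complex_cell_outputs(patches)
#   ica = FastICA(n_components=20, max_iter=1000)
#   s   = ica.fit_transform(np.log(cc + 1e-3))   # higher-order features over complex cells
```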

SLIDE 24

Emergence of longer contours

◮ Hoyer and Hyvärinen (2002) considered a non-negative version of sparse coding
◮ Main finding: V2 integrates longer contours
◮ Bayesian inference in the model can model end-stopping etc.
◮ Cf. the “ultra-long” RFs found by Liu et al. (2016).

SLIDE 25

Emergence of integration over frequencies

◮ Hyvärinen, Gutmann, and Hoyer (2005) considered several frequency bands (using ordinary ICA)
◮ Each higher-order cell corresponds to 3 frequency displays
◮ The classic view (of V1) emphasizes separate frequency channels
◮ The integration could be related to sharp edges (Henriksson, Hyvärinen, Vanni, 2009)

SLIDE 26

Emergence of a variety of RF properties

◮ Hosoya and Hyvärinen (2015) used
  ◮ Denser sampling of orientations
  ◮ Strong PCA dimension reduction
  ◮ One of the simplest possible models of pooling: works as a simple V1 complex cell model (Hosoya and Hyvärinen, 2016)
  ◮ An overcomplete basis
◮ Extensive comparison with V2 experiments

SLIDE 27

Emergence of corner detectors (+ long contours, end-stopping)

Five principal classes were found by Hosoya and Hyvärinen (2015). Corner detectors (class e) are robust, not just a few random Gabors.

SLIDE 28

Best natural image patch stimuli

SLIDE 29

Model reproduces various results on V2

E.g. spatio-spectral receptive fields similar to Anzai et al. (2007)

SLIDE 30

Can we train all three layers?

◮ Training all layers (not fixing the complex cell model) was done by Gutmann and Hyvärinen (2013)
◮ An energy-based model trained by noise-contrastive estimation
◮ Training and interpretation are a lot more difficult
◮ Some receptive fields were visualized

SLIDE 31

Grand conclusion

◮ Visual features can be learned from natural images
◮ Key ingredients in the models:
  ◮ Measures of non-Gaussian structure: mainly sparsity
  ◮ Non-linearities in processing: invariances as in complex cells (by squaring), and further selectivity in the third layer
◮ We also need suitable methods for estimating the models
  ◮ Maximum likelihood may be computationally infeasible
  ◮ We used score matching and noise-contrastive estimation
◮ Features are often similar to those found in V1, or give meaningful predictions (third layer)
◮ Towards a predictive theory: new properties emerge (?)
