Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics



  1. Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images, Parts III-IV. Aapo Hyvärinen, Gatsby Unit, University College London

  2. Part III: Estimation of unnormalized models
  ◮ Often, in natural image statistics, the probabilistic models are unnormalized
  ◮ This is a major computational problem
  ◮ Here, we consider new methods to tackle this problem
  ◮ Later, we see applications to natural image statistics

  3. Unnormalized models: Problem definition
  ◮ We want to estimate a parametric model of a multivariate random vector $x \in \mathbb{R}^n$
  ◮ The density function $f_{\mathrm{norm}}$ is known only up to a multiplicative constant:
    $$f_{\mathrm{norm}}(x;\theta) = \frac{1}{Z(\theta)}\, p_{\mathrm{un}}(x;\theta), \qquad Z(\theta) = \int_{\xi \in \mathbb{R}^n} p_{\mathrm{un}}(\xi;\theta)\, d\xi$$
  ◮ The functional form of $p_{\mathrm{un}}$ is known (can be easily computed)
  ◮ The partition function $Z$ cannot be computed in reasonable computing time (numerical integration)
  ◮ Here: how to estimate the model while avoiding numerical integration?
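To make the computational problem concrete, here is a minimal Python sketch (an added illustration, not part of the lecture) using a hypothetical non-Gaussian unnormalized density $\log p_{\mathrm{un}}(x;\theta) = -\theta \sum_i x_i^4 - \sum_i x_i x_{i+1}$. Brute-force quadrature for $Z(\theta)$ needs $m^n$ grid points, so it is only feasible for very small $n$:

```python
import numpy as np

# Hypothetical unnormalized model (illustration only): non-Gaussian marginals
# plus a pairwise coupling, so the density does not factorize.
def log_p_un(x, theta):
    return -theta * np.sum(x**4) - np.sum(x[:-1] * x[1:])

theta = 0.5
n = 2                      # kept tiny on purpose: the grid below has m**n points
m = 200
grid = np.linspace(-5.0, 5.0, m)
dx = grid[1] - grid[0]
points = np.stack(np.meshgrid(*([grid] * n)), axis=-1).reshape(-1, n)

# Riemann-sum approximation of Z(theta); the cost grows exponentially in n,
# which is why this is hopeless for image models with n in the hundreds.
Z = np.sum([np.exp(log_p_un(p, theta)) for p in points]) * dx**n
print(f"n = {n}, grid points = {m**n}, Z(theta) ~ {Z:.4f}")
```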

  4. Examples of unnormalized models related to ICA
  ◮ ICA with overcomplete basis, given simply by
    $$f_{\mathrm{norm}}(x; W) = \frac{1}{Z(W)} \exp\Big[ \sum_i G(w_i^T x) \Big] \qquad (1)$$
  ◮ Estimation of the second layer in ISA and topographic ICA:
    $$f_{\mathrm{norm}}(x; W, M) = \frac{1}{Z(W, M)} \exp\Big[ \sum_i G\Big( \sum_j m_{ij} (w_j^T x)^2 \Big) \Big] \qquad (2)$$
  ◮ Non-Gaussian Markov Random Fields
  ◮ ... many more
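For readers who prefer code, the two unnormalized log-densities above can be written as follows. This is a sketch with $G(u) = -\log\cosh(u)$ and random parameters chosen purely for illustration (the lecture does not fix these choices here):

```python
import numpy as np

# Unnormalized log-densities for the two ICA-related models above, with
# G(u) = -log cosh(u) as a typical smooth sparsity-inducing nonlinearity.
def G(u):
    return -np.log(np.cosh(u))

def log_p_un_overcomplete_ica(x, W):
    # Eq. (1) without the -log Z(W) term: sum_i G(w_i^T x); W has more rows than dim(x)
    return np.sum(G(W @ x))

def log_p_un_topographic_ica(x, W, M):
    # Eq. (2) without the -log Z(W, M) term: sum_i G( sum_j m_ij (w_j^T x)^2 )
    return np.sum(G(M @ (W @ x) ** 2))

rng = np.random.default_rng(0)
n, k = 4, 6                              # data dimension and number of basis vectors (k > n: overcomplete)
x = rng.standard_normal(n)
W = rng.standard_normal((k, n))
M = rng.uniform(0.0, 1.0, size=(k, k))   # nonnegative second-layer weights
print(log_p_un_overcomplete_ica(x, W), log_p_un_topographic_ica(x, W, M))
```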

  5. Previous solutions
  ◮ Monte Carlo methods
    ◮ Consistent estimators (convergence to real parameter values when sample size → ∞)
    ◮ Computation very slow (I think)
  ◮ Various approximations, e.g. variational methods
    ◮ Computation often fast
    ◮ Consistency not known, or proven inconsistent
  ◮ Pseudo-likelihood and contrastive divergence
    ◮ Presumably consistent
    ◮ Computations slow with continuous-valued variables: needs 1-D integration at every step, or sophisticated MCMC methods

  6. Content of this talk
  ◮ We have proposed two methods for estimation of unnormalized models
  ◮ Both methods avoid numerical integration
  ◮ First: Score matching (Hyvärinen, JMLR, 2005)
    ◮ Take the derivative of the model log-density w.r.t. $x$, so the partition function disappears
    ◮ Fit this derivative to the same derivative of the data density
    ◮ Easy to compute due to a partial integration trick
    ◮ Closed-form solution for exponential families
  ◮ Second: Noise-contrastive estimation (Gutmann and Hyvärinen, JMLR, 2012)
    ◮ Learn to distinguish data from artificially generated noise: logistic regression learns the ratio of the pdf's of data and noise
    ◮ For a known noise pdf, we have in fact learnt the data pdf
    ◮ Consistent even in the unnormalized case

  7. Definition of “score function” (in this talk)
  ◮ Define the model score function $\psi: \mathbb{R}^n \to \mathbb{R}^n$ as
    $$\psi(\xi;\theta) = \begin{pmatrix} \frac{\partial \log f_{\mathrm{norm}}(\xi;\theta)}{\partial \xi_1} \\ \vdots \\ \frac{\partial \log f_{\mathrm{norm}}(\xi;\theta)}{\partial \xi_n} \end{pmatrix} = \nabla_\xi \log f_{\mathrm{norm}}(\xi;\theta)$$
    where $f_{\mathrm{norm}}$ is the normalized model density.
  ◮ Similarly, define the data score function as $\psi_x(\xi) = \nabla_\xi \log p_x(\xi)$, where the observed data is assumed to follow $p_x(\cdot)$.
  ◮ In conventional terminology: the Fisher score with respect to a hypothetical location parameter, i.e. of $f_{\mathrm{norm}}(x - \theta)$, evaluated at $\theta = 0$.
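As a concrete check of the definition (an added illustration, with $G(u) = -\log\cosh(u)$ and a random $W$ assumed for the example), the model score of the overcomplete ICA model (1) is $\psi(x;W) = \sum_i G'(w_i^T x)\, w_i = W^T g(Wx)$ with $g = G'$; since $Z(W)$ does not depend on $x$, the score can be computed from $p_{\mathrm{un}}$ alone, which is made explicit on the next slide:

```python
import numpy as np

def G(u):
    return -np.log(np.cosh(u))

def log_p_un(x, W):
    # unnormalized ICA log-density, eq. (1) without the -log Z(W) term
    return np.sum(G(W @ x))

def score(x, W):
    # model score psi(x; W) = grad_x sum_i G(w_i^T x) = W^T g(W x), with g = G' = -tanh
    return W.T @ (-np.tanh(W @ x))

# Finite-difference check of the analytic score; Z(W) does not depend on x,
# so the score of f_norm equals the score of p_un.
rng = np.random.default_rng(1)
n, k = 3, 5
x, W = rng.standard_normal(n), rng.standard_normal((k, n))
eps = 1e-6
num_grad = np.array([(log_p_un(x + eps * e, W) - log_p_un(x - eps * e, W)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(score(x, W), num_grad, atol=1e-5))   # expect True
```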

  8. Score matching: definition of objective function
  ◮ Estimate by minimizing a distance between the model score function $\psi(\cdot;\theta)$ and the score function of the observed data $\psi_x(\cdot)$:
    $$J(\theta) = \frac{1}{2} \int_{\xi \in \mathbb{R}^n} p_x(\xi)\, \| \psi(\xi;\theta) - \psi_x(\xi) \|^2 \, d\xi \qquad (3)$$
    $$\hat\theta = \arg\min_\theta J(\theta)$$
  ◮ This gives a consistent estimator almost by construction
  ◮ $\psi(\xi;\theta)$ does not depend on $Z(\theta)$ because
    $$\psi(\xi;\theta) = \nabla_\xi \log p_{\mathrm{un}}(\xi;\theta) - \nabla_\xi \log Z(\theta) = \nabla_\xi \log p_{\mathrm{un}}(\xi;\theta) - 0 \qquad (4)$$
  ◮ No need to compute the normalization constant $Z$; the non-normalized pdf $p_{\mathrm{un}}$ is enough.
  ◮ Computation of $J$ is quite simple due to the theorem below

  9. A computational trick: central theorem of score matching
  ◮ In the objective function we have the score function of the data distribution, $\psi_x(\cdot)$. How to compute it?
  ◮ In fact, there is no need to compute it, because of the following:
  Theorem. Assume some regularity conditions, and smooth densities. Then the score matching objective function $J$ can be expressed as
    $$J(\theta) = \int_{\xi \in \mathbb{R}^n} p_x(\xi) \sum_{i=1}^n \Big[ \partial_i \psi_i(\xi;\theta) + \frac{1}{2}\, \psi_i(\xi;\theta)^2 \Big]\, d\xi + \mathrm{const.} \qquad (5)$$
  where the constant does not depend on $\theta$, and
    $$\psi_i(\xi;\theta) = \frac{\partial \log p_{\mathrm{un}}(\xi;\theta)}{\partial \xi_i}, \qquad \partial_i \psi_i(\xi;\theta) = \frac{\partial^2 \log p_{\mathrm{un}}(\xi;\theta)}{\partial \xi_i^2}$$
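A worked one-dimensional example, added here for concreteness (the choice of data and model is mine, not from the slides): let the data follow $x \sim N(0,1)$, so $\psi_x(\xi) = -\xi$, and take the one-parameter model $p_{\mathrm{un}}(\xi;\theta) = \exp(-\theta\xi^2/2)$, so $\psi(\xi;\theta) = -\theta\xi$ and $\partial\psi(\xi;\theta) = -\theta$. Then the definition (3) and the theorem's expression (5) give
$$J(\theta) = \tfrac{1}{2}\,\mathbb{E}\big[(\psi(\xi;\theta)-\psi_x(\xi))^2\big] = \tfrac{1}{2}\,(1-\theta)^2\,\mathbb{E}[\xi^2] = \tfrac{1}{2}\,(1-\theta)^2$$
$$\mathbb{E}\big[\partial\psi(\xi;\theta)+\tfrac{1}{2}\psi(\xi;\theta)^2\big] = -\theta+\tfrac{1}{2}\theta^2 = \tfrac{1}{2}(1-\theta)^2-\tfrac{1}{2}$$
so the two expressions indeed differ only by a constant, and both are minimized at $\theta = 1$, the true precision, without ever using $Z(\theta)$ or $\psi_x$ in the second form.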

  10. Simple explanation of score matching trick
  ◮ Consider the objective function $J(\theta)$:
    $$\frac{1}{2} \int p_x(\xi)\, \| \psi(\xi;\theta) \|^2\, d\xi - \int p_x(\xi)\, \psi_x(\xi)^T \psi(\xi;\theta)\, d\xi + \mathrm{const.}$$
  ◮ The constant does not depend on $\theta$. The first term is easy to compute.
  ◮ The trick is to use partial integration on the second term. In one dimension:
    $$\int p_x(x)\,(\log p_x)'(x)\, \psi(x;\theta)\, dx = \int p_x(x)\, \frac{p_x'(x)}{p_x(x)}\, \psi(x;\theta)\, dx = \int p_x'(x)\, \psi(x;\theta)\, dx = 0 - \int p_x(x)\, \psi'(x;\theta)\, dx$$
  ◮ This is why the score function of the data distribution $p_x(x)$ disappears!

  11. Final method of score matching
  ◮ Replace the integration over the data density $p_x(\cdot)$ by a sample average
  ◮ Given $T$ observations $x(1), \ldots, x(T)$, minimize
    $$\tilde{J}(\theta) = \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^n \Big[ \partial_i \psi_i(x(t);\theta) + \frac{1}{2}\, \psi_i(x(t);\theta)^2 \Big] \qquad (6)$$
    where $\psi_i$ is a partial derivative of the non-normalized model log-density $\log p_{\mathrm{un}}$, and $\partial_i \psi_i$ a second partial derivative
  ◮ Only needs evaluation of some derivatives of the non-normalized (log-)density $p_{\mathrm{un}}$, which are simple to compute (by assumption)
  ◮ Thus: a new computationally simple and statistically consistent method for parameter estimation
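A minimal sketch of the final method under an assumed toy setup (not from the slides): the one-parameter model $\log p_{\mathrm{un}}(x;\theta) = -\theta x^2/2$, i.e. a zero-mean Gaussian whose normalization we pretend not to know, fitted to Gaussian data by minimizing the sample objective (6).

```python
import numpy as np

# Toy model: log p_un(x; theta) = -theta * x^2 / 2, so
# psi(x; theta) = -theta * x and d/dx psi(x; theta) = -theta.
def J_tilde(theta, data):
    # sample version of the score matching objective, eq. (6)
    return np.mean(-theta + 0.5 * theta**2 * data**2)

rng = np.random.default_rng(0)
sigma = 2.0
data = rng.normal(0.0, sigma, size=100_000)    # true precision = 1/sigma^2 = 0.25

# For this model, d/dtheta J_tilde = -1 + theta * mean(x^2) = 0 gives a closed form;
# in general one would minimize J_tilde numerically, using only p_un's derivatives.
theta_hat = 1.0 / np.mean(data**2)
print(theta_hat, J_tilde(theta_hat, data))     # theta_hat should be close to 0.25
```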

  12. Closed-form solution in the exponential family
  ◮ Assume the pdf can be expressed in the form
    $$\log p_{\mathrm{un}}(\xi;\theta) = \sum_{k=1}^m \theta_k F_k(\xi), \quad \text{i.e.} \quad \log f_{\mathrm{norm}}(\xi;\theta) = \sum_{k=1}^m \theta_k F_k(\xi) - \log Z(\theta) \qquad (7)$$
  ◮ Define matrices of partial derivatives:
    $$K_{ki}(\xi) = \frac{\partial F_k(\xi)}{\partial \xi_i}, \qquad H_{ki}(\xi) = \frac{\partial^2 F_k(\xi)}{\partial \xi_i^2} \qquad (8)$$
  ◮ Then the score matching estimator is given by
    $$\hat\theta = -\Big( \hat{E}\{ K(x) K(x)^T \} \Big)^{-1} \sum_i \hat{E}\{ h_i(x) \} \qquad (9)$$
    where $\hat{E}$ denotes the sample average, and the vector $h_i$ is the $i$-th column of the matrix $H$.
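A minimal sketch of the closed-form estimator (9) for an assumed toy exponential family in one dimension with two features, $F_1(x) = -x^2/2$ and $F_2(x) = -x^4/4$; if the data are Gaussian with variance $\sigma^2$, the true parameters are $\theta_1 = 1/\sigma^2$ and $\theta_2 = 0$, which the estimator should recover:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=200_000)

# K_{k1} = dF_k/dx and H_{k1} = d^2F_k/dx^2, evaluated at every sample
# (shape m x T, with m = 2 features and n = 1 dimension).
K = np.stack([-x, -x**3])                  # derivatives of F1, F2
H = np.stack([-np.ones_like(x), -3 * x**2])

KKt = (K @ K.T) / x.size                   # sample average of K(x) K(x)^T, an m x m matrix
sum_h = H.mean(axis=1)                     # sample average of h_1(x); n = 1, so the sum over i has one term
theta_hat = -np.linalg.solve(KKt, sum_h)   # eq. (9)
print(theta_hat)                           # expect approximately [1/sigma^2, 0] = [0.25, 0]
```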

  13. ICA with overcomplete basis

  14. Second method: Noise-contrastive estimation (NCE)
  ◮ Train a nonlinear classifier to discriminate observed data from some artificial noise
  ◮ To be successful, the classifier must “discover structure” in the data
  ◮ For example, compare natural images with Gaussian noise
  (Figure: patches of natural images vs. Gaussian noise)

  15. Definition of classifier in NCE
  ◮ Observed data set $X = (x(1), \ldots, x(T))$ with unknown pdf $p_x$
  ◮ Generate “noise” $Y = (y(1), \ldots, y(T))$ with known pdf $p_y$
  ◮ Define a nonlinear function (e.g. a multilayer perceptron) $g(u;\theta)$, which models the data log-density $\log p_x(u)$
  ◮ We use logistic regression with the nonlinear function
    $$G(u;\theta) = g(u;\theta) - \log p_y(u) \qquad (10)$$
  ◮ Well-known developments lead to the objective (likelihood)
    $$J(\theta) = \sum_t \log[h(x(t);\theta)] + \log[1 - h(y(t);\theta)], \qquad \text{where } h(u;\theta) = \frac{1}{1 + \exp[-G(u;\theta)]} \qquad (11)$$
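A minimal NCE sketch under assumed toy settings (the model parameterization, noise distribution, and sample sizes are my illustrative choices, not from the slides): the model is an unnormalized 1-D Gaussian $g(u;\theta) = -(u-m)^2/(2e^s) + c$ with $\theta = (m, s, c)$, where the free constant $c$ plays the role of the unknown negative log-normalizer, and the noise is $N(0, 2^2)$ with known pdf.

```python
import numpy as np
from scipy.optimize import minimize

def g(u, theta):
    m, s, c = theta
    return -(u - m) ** 2 / (2.0 * np.exp(s)) + c

def log_p_y(u, noise_sd=2.0):
    return -0.5 * (u / noise_sd) ** 2 - np.log(noise_sd * np.sqrt(2.0 * np.pi))

def neg_J(theta, x, y):
    # negative of the NCE objective (11), with G(u; theta) = g(u; theta) - log p_y(u);
    # log h and log(1 - h) are written via logaddexp for numerical stability
    Gx = g(x, theta) - log_p_y(x)
    Gy = g(y, theta) - log_p_y(y)
    return np.sum(np.logaddexp(0.0, -Gx)) + np.sum(np.logaddexp(0.0, Gy))

rng = np.random.default_rng(0)
T = 50_000
x = rng.normal(1.0, 1.0, size=T)          # data: N(1, 1)
y = rng.normal(0.0, 2.0, size=T)          # noise: N(0, 4), same sample size

res = minimize(neg_J, np.zeros(3), args=(x, y))
m_hat, s_hat, c_hat = res.x
print(m_hat, np.exp(s_hat), c_hat)        # expect roughly 1.0, 1.0, and c near -log(sqrt(2*pi))
```

Running this sketch, $c$ should approach $-\log\sqrt{2\pi} \approx -0.92$, i.e. NCE recovers the normalization constant as an ordinary parameter; that is the content of the consistency result on the next slide.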

  16. What does the classifying system do in NCE?
  ◮ Theorem:
    ◮ Assume our parametric model $g(u;\theta)$ (e.g. an MLP) can approximate any function.
    ◮ Then the maximum of the classification objective is attained when
      $$g(u;\theta) = \log p_x(u) \qquad (12)$$
      where $p_x(u)$ is the pdf of the observed data.
  ◮ Corollary: If the data are generated according to the model, i.e. $\log p_x(u) = g(u;\theta^*)$, we have a statistically consistent estimator.
  ◮ Supervised learning thus leads to unsupervised estimation of a probabilistic model given by the log-density $g(u;\theta)$.
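A short sketch of why this holds (my own summary of the standard argument, for equal numbers of data and noise samples): the Bayes-optimal classifier satisfies
$$h^*(u) = \frac{p_x(u)}{p_x(u) + p_y(u)} \;\;\Longrightarrow\;\; G^*(u) = \log\frac{p_x(u)}{p_y(u)} \;\;\Longrightarrow\;\; g^*(u) = G^*(u) + \log p_y(u) = \log p_x(u),$$
so the nonlinearity $g$ recovers the data log-density, including its normalization, because $\log p_y$ is known.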
