Gaussian Processes: Covariance Functions and Classification

Carl Edward Rasmussen
Max Planck Institute for Biological Cybernetics, Tübingen, Germany

Gaussian Processes in Practice, Bletchley Park, July 12th, 2006
Outline

Covariance functions encode structure. You can learn about them by
• sampling,
• optimizing the marginal likelihood.

GPs with various covariance functions are equivalent to many well-known models: large neural networks, splines, relevance vector machines, ...
• infinitely many Gaussian bumps: regression
• Rational Quadratic and Matérn

Quick two-page recap of GP regression.

Approximate inference for Gaussian process classification: replace the intractable non-Gaussian posterior by a Gaussian. Expectation Propagation.
From random functions to covariance functions

Consider the class of functions (sums of squared exponentials):
\[
f(x) = \lim_{n\to\infty} \frac{1}{n} \sum_i \gamma_i \exp\big(-(x - i/n)^2\big), \qquad \gamma_i \sim \mathcal{N}(0,1),\ \forall i
\]
\[
\phantom{f(x)} = \int_{-\infty}^{\infty} \gamma(u)\, \exp\big(-(x - u)^2\big)\, du, \qquad \gamma(u) \sim \mathcal{N}(0,1),\ \forall u.
\]
The mean function is:
\[
\mu(x) = \mathbb{E}[f(x)] = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \exp\big(-(x - u)^2\big)\, \gamma\, p(\gamma)\, d\gamma\, du = 0,
\]
and the covariance function:
\[
\mathbb{E}[f(x)f(x')] = \int \exp\big(-(x - u)^2 - (x' - u)^2\big)\, du
= \int \exp\Big(-2\big(u - \tfrac{x + x'}{2}\big)^2 + \tfrac{(x + x')^2}{2} - x^2 - x'^2\Big)\, du
\propto \exp\Big(-\tfrac{(x - x')^2}{2}\Big).
\]
Thus, the squared exponential covariance function is equivalent to regression using infinitely many Gaussian-shaped basis functions placed everywhere, not just at your training points!
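A small numeric sketch of this equivalence (not part of the original slides; assumes only numpy, with the basis-center grid and sample counts chosen arbitrarily). The empirical covariance of random sums of Gaussian bumps should match the theoretical value \(\sqrt{\pi/2}\,\exp(-(x-x')^2/2)\):

```python
import numpy as np

# Centers u on a dense grid approximate the integral over u; weights are iid N(0,1).
rng = np.random.default_rng(0)
u = np.linspace(-10, 10, 2000)   # basis function centers
du = u[1] - u[0]

def sample_f(x, rng):
    """One random function: a Riemann sum of Gaussian bumps with N(0,1) weights."""
    gamma = rng.standard_normal(len(u))
    return np.exp(-(x[:, None] - u[None, :])**2) @ gamma * np.sqrt(du)

x = np.array([0.0, 1.0])
samples = np.stack([sample_f(x, rng) for _ in range(20000)])
print(np.cov(samples.T)[0, 1])                # empirical E[f(0) f(1)]
print(np.sqrt(np.pi / 2) * np.exp(-0.5))      # theory: sqrt(pi/2) exp(-(x-x')^2/2)
```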
Why is it dangerous to use only finitely many basis functions?

[Figure: predictions of a model built from finitely many basis functions; far from the basis function centers the predictive variance shrinks, so the model is spuriously confident there, marked "?".]
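To see the danger concretely, here is a hedged sketch (not from the slides; the five centers in [−2, 2], unit weight prior, and test points are all illustrative choices). The prior variance of a finite Gaussian-bump model collapses to zero away from its centers, whereas the SE-kernel GP keeps unit prior variance everywhere:

```python
import numpy as np

# Finite model: f(x) = sum_j w_j exp(-(x - c_j)^2), w_j ~ N(0, 1).
# Its prior variance, sum_j exp(-(x - c_j)^2)^2, dies off away from the
# centers c_j -- the model is certain about regions it cannot represent.
centers = np.linspace(-2, 2, 5)          # only 5 basis functions
xs = np.array([0.0, 5.0, 10.0])

phi = np.exp(-(xs[:, None] - centers[None, :])**2)
prior_var_finite = (phi**2).sum(axis=1)  # phi(x)^T I phi(x)
prior_var_gp = np.ones_like(xs)          # SE kernel: k(x, x) = 1 everywhere

for x, vf, vg in zip(xs, prior_var_finite, prior_var_gp):
    print(f"x = {x:5.1f}   finite-basis var = {vf:.6f}   GP var = {vg:.1f}")
```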
Rational quadratic covariance function

The rational quadratic (RQ) covariance function
\[
k_{\mathrm{RQ}}(r) = \Big(1 + \frac{r^2}{2\alpha\ell^2}\Big)^{-\alpha}, \qquad \alpha, \ell > 0,
\]
can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales. Using \(\tau = \ell^{-2}\) and \(p(\tau\,|\,\alpha,\beta) \propto \tau^{\alpha-1}\exp(-\alpha\tau/\beta)\):
\[
k_{\mathrm{RQ}}(r) = \int p(\tau\,|\,\alpha,\beta)\, k_{\mathrm{SE}}(r\,|\,\tau)\, d\tau
\propto \int \tau^{\alpha-1} \exp\Big(-\frac{\alpha\tau}{\beta}\Big) \exp\Big(-\frac{\tau r^2}{2}\Big)\, d\tau
\propto \Big(1 + \frac{r^2}{2\alpha\ell^2}\Big)^{-\alpha}.
\]
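The scale-mixture identity is easy to verify numerically. A minimal sketch (not from the slides), assuming numpy/scipy, with the gamma prior parameterized so that E[τ] = 1/ℓ² (shape α, rate αℓ²):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

alpha, ell = 2.0, 1.0

def k_rq(r):
    return (1.0 + r**2 / (2.0 * alpha * ell**2))**(-alpha)

def k_rq_mixture(r):
    # Integrate SE kernels exp(-tau r^2 / 2) over a gamma prior on the
    # inverse squared length-scale tau (shape alpha, rate alpha * ell^2).
    prior = stats.gamma(a=alpha, scale=1.0 / (alpha * ell**2))
    val, _ = quad(lambda tau: prior.pdf(tau) * np.exp(-tau * r**2 / 2.0),
                  0.0, np.inf)
    return val

for r in [0.5, 1.0, 2.0]:
    print(r, k_rq(r), k_rq_mixture(r))   # the two columns agree
```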
Rational quadratic covariance function II

[Figure: left, the RQ covariance as a function of input distance for α = 1/2, α = 2 and α → ∞; right, corresponding sample functions f(x).]

The limit α → ∞ of the RQ covariance function is the SE.
Matérn covariance functions

Stationary covariance functions can be based on the Matérn form:
\[
k(x, x') = \frac{1}{\Gamma(\nu)\, 2^{\nu-1}} \Big(\frac{\sqrt{2\nu}}{\kappa}\, |x - x'|\Big)^{\nu} K_{\nu}\Big(\frac{\sqrt{2\nu}}{\kappa}\, |x - x'|\Big),
\]
where K_ν is the modified Bessel function of the second kind of order ν, and κ is the characteristic length scale.

Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable. Thus, the hyperparameter ν can control the degree of smoothness.
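The general form is direct to evaluate with scipy's modified Bessel function. A minimal sketch (not from the slides; the tiny floor on r is an implementation convenience to dodge the removable singularity at r = 0, where k(0) = 1 by continuity):

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, nu, kappa=1.0):
    """General Matern covariance; kappa is the characteristic length scale."""
    r = np.maximum(np.asarray(r, dtype=float), 1e-12)  # avoid kv's singularity at r = 0
    scaled = np.sqrt(2.0 * nu) * r / kappa
    return (2.0**(1.0 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)

print(matern([0.0, 0.5, 1.0, 2.0], nu=1.5))
```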
Matérn covariance functions II

Univariate Matérn covariance function with unit characteristic length scale and unit variance:

[Figure: left, the covariance as a function of input distance for ν = 1/2, ν = 1, ν = 2 and ν → ∞; right, corresponding sample functions f(x).]
Matérn covariance functions III

Perhaps the most interesting cases for machine learning are ν = 3/2 and ν = 5/2, for which
\[
k_{\nu=3/2}(r) = \Big(1 + \frac{\sqrt{3}\,r}{\ell}\Big)\exp\Big(-\frac{\sqrt{3}\,r}{\ell}\Big),
\qquad
k_{\nu=5/2}(r) = \Big(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5r^2}{3\ell^2}\Big)\exp\Big(-\frac{\sqrt{5}\,r}{\ell}\Big).
\]
Other special cases:
• ν = 1/2: Laplacian (exponential) covariance function; sample functions resemble Brownian motion (the stationary Ornstein-Uhlenbeck process)
• ν → ∞: Gaussian covariance function, with smooth (infinitely differentiable) sample functions
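The two closed forms can be checked against the general Bessel-function expression. A self-contained sketch (not from the slides; the half-integer identities for K_ν make the two agree exactly up to floating point):

```python
import numpy as np
from scipy.special import gamma, kv

def matern_bessel(r, nu, ell=1.0):
    s = np.sqrt(2.0 * nu) * np.maximum(r, 1e-12) / ell
    return (2.0**(1.0 - nu) / gamma(nu)) * s**nu * kv(nu, s)

def matern32(r, ell=1.0):
    s = np.sqrt(3.0) * r / ell
    return (1.0 + s) * np.exp(-s)

def matern52(r, ell=1.0):
    s = np.sqrt(5.0) * r / ell
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)   # s^2/3 = 5 r^2 / (3 ell^2)

r = np.linspace(0.1, 3.0, 5)
print(np.allclose(matern32(r), matern_bessel(r, 1.5)))   # True
print(np.allclose(matern52(r), matern_bessel(r, 2.5)))   # True
```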
A Comparison

[Figure: two GP regression fits to the same data. Left: SE covariance function, log marginal likelihood −15.6. Right: Matérn covariance function with ν = 3/2, log marginal likelihood −18.0.]
GP regression recap

We use a Gaussian process prior for the latent function:
\[
\mathbf{f}\,|\,X, \theta \sim \mathcal{N}(\mathbf{0}, K).
\]
The likelihood is a factorized Gaussian:
\[
\mathbf{y}\,|\,\mathbf{f} \sim \prod_{i=1}^{n} \mathcal{N}(y_i\,|\,f_i, \sigma_n^2).
\]
The posterior is Gaussian:
\[
p(\mathbf{f}\,|\,\mathcal{D}, \theta) = \frac{p(\mathbf{f}\,|\,X, \theta)\, p(\mathbf{y}\,|\,\mathbf{f})}{p(\mathcal{D}\,|\,\theta)}.
\]
The latent value at the test point, f(x*), is Gaussian:
\[
p(f_*\,|\,\mathcal{D}, \theta, x_*) = \int p(f_*\,|\,\mathbf{f}, X, \theta, x_*)\, p(\mathbf{f}\,|\,\mathcal{D}, \theta)\, d\mathbf{f},
\]
and the predictive distribution is Gaussian:
\[
p(y_*\,|\,\mathcal{D}, \theta, x_*) = \int p(y_*\,|\,f_*)\, p(f_*\,|\,\mathcal{D}, \theta, x_*)\, df_*.
\]
Prior and posterior

[Figure: left, sample functions drawn from the GP prior; right, sample functions from the posterior after conditioning on observations.]

Predictive distribution:
\[
p(y_*\,|\,x_*, \mathbf{x}, \mathbf{y}) \sim \mathcal{N}\Big(\mathbf{k}(x_*, \mathbf{x})^{\top}[K + \sigma^2_{\mathrm{noise}} I]^{-1}\mathbf{y},\;
k(x_*, x_*) + \sigma^2_{\mathrm{noise}} - \mathbf{k}(x_*, \mathbf{x})^{\top}[K + \sigma^2_{\mathrm{noise}} I]^{-1}\mathbf{k}(x_*, \mathbf{x})\Big)
\]
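These predictive equations translate directly into code. A minimal numpy sketch (not from the slides; the SE kernel, hyperparameter values, and toy data are illustrative, and the Cholesky factorization replaces the explicit matrix inverse for numerical stability):

```python
import numpy as np

def se_kernel(a, b, ell=1.0, sf=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def gp_predict(x_train, y, x_test, sigma_n=0.1):
    K = se_kernel(x_train, x_train) + sigma_n**2 * np.eye(len(x_train))
    L = np.linalg.cholesky(K)                        # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    k_star = se_kernel(x_train, x_test)
    mean = k_star.T @ alpha                          # k*^T [K + s^2 I]^-1 y
    v = np.linalg.solve(L, k_star)
    var = se_kernel(x_test, x_test).diagonal() + sigma_n**2 - (v**2).sum(axis=0)
    return mean, var

x = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
mu, var = gp_predict(x, np.sin(x), np.linspace(-5, 5, 7))
```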
The marginal likelihood

To choose between models M_1, M_2, ..., compare the posterior for the models:
\[
p(M_i\,|\,\mathcal{D}) = \frac{p(\mathbf{y}\,|\,\mathbf{x}, M_i)\, p(M_i)}{p(\mathcal{D})}.
\]
Log marginal likelihood:
\[
\log p(\mathbf{y}\,|\,\mathbf{x}, M_i) = -\tfrac{1}{2}\mathbf{y}^{\top}K^{-1}\mathbf{y} - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log(2\pi)
\]
is the combination of a data fit term and a complexity penalty. Occam's Razor is automatic.
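The same Cholesky factor computes both terms cheaply. A minimal sketch (not from the slides; here the noise variance is folded into K_y = K + σ_n²I, and log|K_y| comes from the diagonal of its Cholesky factor):

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma_n):
    """log p(y|x) = -1/2 y^T Ky^-1 y - 1/2 log|Ky| - n/2 log(2 pi)."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                 # penalizes misfit to the data
    complexity = -np.log(np.diag(L)).sum()      # -1/2 log|Ky|: penalizes complex models
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)
```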
Binary Gaussian Process Classification

[Figure: a latent function f(x) and the corresponding class probability π(x) obtained by squashing f through the probit.]

The class probability is related to the latent function through:
\[
p(y = 1\,|\,f(x)) = \pi(x) = \Phi\big(f(x)\big).
\]
Observations are independent given f, so the likelihood is
\[
p(\mathbf{y}\,|\,\mathbf{f}) = \prod_{i=1}^{n} p(y_i\,|\,f_i) = \prod_{i=1}^{n} \Phi(y_i f_i).
\]
Likelihood functions

The logistic (1 + exp(−y_i f_i))^{−1} and probit Φ(y_i f_i) likelihoods and their derivatives:

[Figure: log likelihood, first derivative and second derivative of the logistic (left) and probit (right) likelihoods, as functions of z_i = y_i f_i.]
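These derivatives are what approximate-inference schemes such as Laplace's method consume. A hedged sketch for the probit case (not from the slides; the logpdf − logcdf trick keeps the ratio N(z)/Φ(z) stable far into the tail):

```python
import numpy as np
from scipy.stats import norm

def probit_loglik_and_derivs(y, f):
    """log Phi(y f) and its first two derivatives w.r.t. f, for y in {-1, +1}."""
    z = y * f
    log_lik = norm.logcdf(z)
    ratio = np.exp(norm.logpdf(z) - norm.logcdf(z))  # N(z)/Phi(z), numerically stable
    d1 = y * ratio                                    # first derivative (y^2 = 1)
    d2 = -ratio**2 - z * ratio                        # second derivative
    return log_lik, d1, d2
```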
Exact expressions

We use a Gaussian process prior for the latent function:
\[
\mathbf{f}\,|\,X, \theta \sim \mathcal{N}(\mathbf{0}, K).
\]
The posterior becomes:
\[
p(\mathbf{f}\,|\,\mathcal{D}, \theta) = \frac{p(\mathbf{f}\,|\,X, \theta)\, p(\mathbf{y}\,|\,\mathbf{f})}{p(\mathcal{D}\,|\,\theta)}
= \frac{\mathcal{N}(\mathbf{f}\,|\,\mathbf{0}, K)}{p(\mathcal{D}\,|\,\theta)} \prod_{i=1}^{n} \Phi(y_i f_i),
\]
which is non-Gaussian. The latent value at the test point, f(x*), is
\[
p(f_*\,|\,\mathcal{D}, \theta, x_*) = \int p(f_*\,|\,\mathbf{f}, X, \theta, x_*)\, p(\mathbf{f}\,|\,\mathcal{D}, \theta)\, d\mathbf{f},
\]
and the predictive class probability becomes
\[
p(y_*\,|\,\mathcal{D}, \theta, x_*) = \int p(y_*\,|\,f_*)\, p(f_*\,|\,\mathcal{D}, \theta, x_*)\, df_*,
\]
both of which are intractable to compute.
Gaussian Approximation to the Posterior

We approximate the non-Gaussian posterior by a Gaussian:
\[
p(\mathbf{f}\,|\,\mathcal{D}, \theta) \simeq q(\mathbf{f}\,|\,\mathcal{D}, \theta) = \mathcal{N}(\mathbf{m}, A),
\]
then \(q(f_*\,|\,\mathcal{D}, \theta, x_*) = \mathcal{N}(f_*\,|\,\mu_*, \sigma_*^2)\), where
\[
\mu_* = \mathbf{k}_*^{\top} K^{-1} \mathbf{m}, \qquad
\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^{\top}\big(K^{-1} - K^{-1} A K^{-1}\big)\mathbf{k}_*.
\]
Using this approximation:
\[
q(y_* = 1\,|\,\mathcal{D}, \theta, x_*) = \int \Phi(f_*)\, \mathcal{N}(f_*\,|\,\mu_*, \sigma_*^2)\, df_*
= \Phi\Big(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\Big).
\]
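A minimal sketch of these three formulas in code (not from the slides; the explicit inverse is fine for illustration, though a Cholesky-based solve is preferable in practice):

```python
import numpy as np
from scipy.stats import norm

def predictive_prob(k_star, k_ss, K, m, A):
    """Approximate p(y* = 1 | D, x*) from a Gaussian posterior N(m, A) over f.

    k_star: vector k(X, x*);  k_ss: scalar k(x*, x*);  K: training covariance.
    """
    K_inv = np.linalg.inv(K)
    mu_star = k_star @ K_inv @ m
    var_star = k_ss - k_star @ (K_inv - K_inv @ A @ K_inv) @ k_star
    return norm.cdf(mu_star / np.sqrt(1.0 + var_star))
```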
What Gaussian? Some suggestions:

• local expansion around the posterior mode: Laplace's method (a minimal sketch follows this list)
• optimize a variational lower bound (using Jensen's inequality):
\[
\log p(\mathbf{y}\,|\,X) = \log \int p(\mathbf{y}\,|\,\mathbf{f})\, p(\mathbf{f})\, d\mathbf{f}
\ge \int q(\mathbf{f}) \log \frac{p(\mathbf{y}\,|\,\mathbf{f})\, p(\mathbf{f})}{q(\mathbf{f})}\, d\mathbf{f}
\]
• the Expectation Propagation (EP) algorithm
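For the first suggestion, the mode is found by Newton's method on the log posterior. A bare-bones sketch for the probit likelihood (not from the slides; this omits the numerically stabilized B = I + W^{1/2} K W^{1/2} formulation used in practice, e.g. GPML Algorithm 3.1):

```python
import numpy as np
from scipy.stats import norm

def laplace_mode(K, y, n_iter=20):
    """Newton iteration for the posterior mode with a probit likelihood."""
    f = np.zeros(len(y))
    for _ in range(n_iter):
        z = y * f
        ratio = np.exp(norm.logpdf(z) - norm.logcdf(z))  # N(z)/Phi(z)
        grad = y * ratio                   # d/df log p(y|f)
        W = np.diag(ratio**2 + z * ratio)  # negative Hessian: diagonal, positive
        # Newton step: f <- (K^-1 + W)^-1 (W f + grad)
        f = np.linalg.solve(np.linalg.inv(K) + W, W @ f + grad)
    return f   # mode m; the Gaussian covariance is A = (K^-1 + W)^-1
```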
Expectation Propagation

Posterior:
\[
p(\mathbf{f}\,|\,X, \mathbf{y}) = \frac{1}{Z}\, p(\mathbf{f}\,|\,X) \prod_{i=1}^{n} p(y_i\,|\,f_i),
\]
where the normalizing term is the marginal likelihood
\[
Z = p(\mathbf{y}\,|\,X) = \int p(\mathbf{f}\,|\,X) \prod_{i=1}^{n} p(y_i\,|\,f_i)\, d\mathbf{f}.
\]
The exact likelihood, p(y_i | f_i) = Φ(f_i y_i), makes inference intractable. In EP we use a local likelihood approximation
\[
p(y_i\,|\,f_i) \simeq t_i(f_i\,|\,\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \triangleq \tilde{Z}_i\, \mathcal{N}(f_i\,|\,\tilde{\mu}_i, \tilde{\sigma}_i^2),
\]
where the site parameters are \(\tilde{Z}_i\), \(\tilde{\mu}_i\) and \(\tilde{\sigma}_i^2\), such that:
\[
\prod_{i=1}^{n} t_i(f_i\,|\,\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2)
= \mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma}) \prod_{i=1}^{n} \tilde{Z}_i.
\]
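The heart of EP is the per-site moment match: combine the cavity (the approximate posterior with site i removed) with the exact probit likelihood, and compute the moments of that tilted distribution. A minimal sketch of the single-site step (not from the slides; these are the standard probit moment formulas, cf. GPML eq. 3.58 — the full algorithm derives the site parameters from these moments and sweeps over all sites until convergence):

```python
import numpy as np
from scipy.stats import norm

def probit_site_moments(y_i, mu_cav, var_cav):
    """Moments of the tilted distribution Phi(y_i f) N(f | mu_cav, var_cav).

    Returns log Z_hat, mu_hat, var_hat for one EP site.
    """
    z = y_i * mu_cav / np.sqrt(1.0 + var_cav)
    log_Z = norm.logcdf(z)
    ratio = np.exp(norm.logpdf(z) - log_Z)   # N(z)/Phi(z), stable in the tail
    mu_hat = mu_cav + y_i * var_cav * ratio / np.sqrt(1.0 + var_cav)
    var_hat = var_cav - var_cav**2 * ratio * (z + ratio) / (1.0 + var_cav)
    return log_Z, mu_hat, var_hat
```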