Gaussian Processes
Dan Cervone, NYU CDS
November 10, 2015
What are Gaussian processes?

GPs let us do Bayesian inference on functions. Using GPs we can:
- Interpolate spatial data
- Forecast time series
- Represent latent surfaces for classification, point processes, etc.
- Emulate likelihoods and complex, black-box functions
- Model cool stuff across many scientific disciplines!

[Image credits: https://pythonhosted.org/infpy/gps.html, http://becs.aalto.fi/en/research/bayes/mcmcstuff/traindata.jpg]
Preliminaries

The basic setup:
- Data set $\{(x_i, y_i),\ i = 1, \dots, n\}$.
- Inputs $x_i \in S \subset \mathbb{R}^D$.
- Outputs $y_i \in \mathbb{R}$.

$$x_i \sim p(x), \qquad y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2_\epsilon).$$

Definition. $f$ is a Gaussian process if for any collection $X = \{x_i \in S,\ i = 1, \dots, n\}$,

$$\begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix} \sim N\big(\mu(X),\ K(X, X)\big).$$
Mean, covariance functions

GPs are characterized by their mean and covariance functions:
- Mean function $\mu(x)$. WLOG, we can assume $\mu = 0$. (Why?)
- Covariance function $k$, where $[K(X, X)]_{ij} = k(x_i, x_j) = \operatorname{Cov}(f(x_i), f(x_j))$.

Example:

$$k(x_i, x_j) = \tau^2 \exp\left(\frac{-\|x_i - x_j\|^2}{2\ell^2}\right) \qquad \text{(squared exponential)}$$
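For concreteness, here is a minimal NumPy sketch of this covariance function; the function name and the (n, D)-array convention are choices made here, not part of the slides.

```python
import numpy as np

def squared_exponential(X1, X2, tau2=1.0, ell2=1.0):
    """k(xi, xj) = tau2 * exp(-||xi - xj||^2 / (2 * ell2)).

    X1: (n, D) array, X2: (m, D) array; returns the (n, m) matrix K(X1, X2).
    """
    # Pairwise squared Euclidean distances via broadcasting.
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return tau2 * np.exp(-d2 / (2.0 * ell2))
```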
GP regression (prediction)

Interpolation/prediction at target locations:
- (Noise-free observations) Observe $\{(x_i, f(x_i)),\ i = 1, \dots, n\}$.
- (Noisy observations) Observe $\{(x_i, y_i),\ i = 1, \dots, n\}$.
- Want to predict $f^* = \{f(x^*_1), \dots, f(x^*_k)\}$ at $x^*$.

Prediction with noise-free data:

$$\begin{pmatrix} f \\ f^* \end{pmatrix} \Big|\, X, X^* \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K(X, X) & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix}\right)$$

$$f^* \,|\, f, X, X^* \sim N\Big(K(X^*, X)[K(X, X)]^{-1} f,\ K(X^*, X^*) - K(X^*, X)[K(X, X)]^{-1} K(X, X^*)\Big)$$

Prediction with noisy data:

$$\begin{pmatrix} y \\ f^* \end{pmatrix} \Big|\, X, X^* \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma^2_\epsilon I_n & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix}\right)$$

$$f^* \,|\, y, X, X^* \sim N\Big(K(X^*, X)[K(X, X) + \sigma^2_\epsilon I_n]^{-1} y,\ K(X^*, X^*) - K(X^*, X)[K(X, X) + \sigma^2_\epsilon I_n]^{-1} K(X, X^*)\Big)$$
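These conditionals translate directly into a few lines of linear algebra. Below is a minimal sketch under the conventions of the earlier snippet; the name gp_predict and the Cholesky-based solves are implementation choices, not from the slides.

```python
import numpy as np

def gp_predict(X, y, X_star, kernel, sigma2_eps=0.0):
    """Posterior mean and covariance of f* given the data.

    With sigma2_eps > 0 this is the noisy-data formula; with sigma2_eps = 0
    (and y = f) it reduces to the noise-free interpolation formula.
    """
    K = kernel(X, X) + sigma2_eps * np.eye(X.shape[0])  # K(X,X) + sigma_eps^2 I_n
    K_s = kernel(X, X_star)                             # K(X, X*)
    K_ss = kernel(X_star, X_star)                       # K(X*, X*)

    # Cholesky-based solves are cheaper and more stable than explicit inverses.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # [K + s2*I]^{-1} y
    V = np.linalg.solve(L, K_s)                          # L^{-1} K(X, X*)

    mean = K_s.T @ alpha          # K(X*,X) [K + s2*I]^{-1} y
    cov = K_ss - V.T @ V          # K(X*,X*) - K(X*,X) [K + s2*I]^{-1} K(X,X*)
    return mean, cov
```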
GP regression (prediction)

Some cool things we've noticed:
- $f$, $f^*$, $y$, $y^*$ are all jointly Gaussian.
- GP regression gives us interval (distributional) predictions for free.
- Prediction using noise-free vs. noisy data: which situation is more likely in practice?
- The "nugget" $\sigma^2_\epsilon I_n$ arises due to measurement error or high-frequency behavior, and provides numerical stability and regularization.
Illustrating GP regression

[The original slides show a sequence of plots of $f(x)$ against $x$ on $[0, 10]$; only the panel titles survive extraction. A code sketch reproducing the sequence follows this list.]
- TRUTH: $\tau^2 = 1$, $\ell^2 = 1$, $\sigma^2_\epsilon = 0.01$.
- Sample $\{(x_i, y_i),\ i = 1, \dots, 20\}$.
- Posterior mean of $f^* | y$.
- 95% prediction interval for $f^* | y$.
- Fitting GP with $\ell^2 = 10$.
- Fitting GP with $\ell^2 = 0.1$.
- Fitting GP with $\sigma^2_\epsilon = 1$.
- Fitting GP with $\sigma^2_\epsilon = 0.0001$.
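A sketch of how this illustration could be reproduced, assuming the squared_exponential and gp_predict functions defined above; the seed, grid, and input design are arbitrary choices, not recovered from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
tau2, ell2, sigma2_eps = 1.0, 1.0, 0.01        # the "TRUTH" hyperparameters

kernel = lambda A, B: squared_exponential(A, B, tau2, ell2)

X = np.sort(rng.uniform(0, 10, size=(20, 1)), axis=0)    # 20 inputs on [0, 10]
f = rng.multivariate_normal(np.zeros(20), kernel(X, X))  # draw f from the GP prior
y = f + rng.normal(0.0, np.sqrt(sigma2_eps), size=20)    # noisy observations

X_star = np.linspace(0, 10, 200)[:, None]                # prediction grid
mean, cov = gp_predict(X, y, X_star, kernel, sigma2_eps)
sd = np.sqrt(np.clip(np.diag(cov), 0, None))             # clip tiny negative variances
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd        # 95% interval for f* | y
```

Refitting with the misspecified values $\ell^2 = 10$, $\ell^2 = 0.1$, $\sigma^2_\epsilon = 1$, or $\sigma^2_\epsilon = 0.0001$ amounts to changing the corresponding arguments above.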
GPs and Bayesian linear regression

Assume $f(x_i)$ is linear in a $p$-dimensional feature vector of $x_i$:

$$f(x_i) = \phi(x_i)' w = \phi_i' w$$

Usual Bayesian regression setup for $\phi$:

$$y_i \,|\, X \overset{\text{ind}}{\sim} N(\phi_i' w, \sigma^2_\epsilon) \qquad \text{(likelihood)}$$
$$w \sim N(0, \Sigma) \qquad \text{(prior)}$$
$$w \,|\, y, X \sim N(\hat{w}, A^{-1}) \qquad \text{(posterior)}$$
$$f^* \,|\, y, X, x^* \sim N\big((\phi^*)' \hat{w},\ (\phi^*)' A^{-1} \phi^*\big) \qquad \text{(posterior predictive)}$$

where
- $\hat{w} = A^{-1} \Phi y / \sigma^2_\epsilon$,
- $A = \Phi \Phi' / \sigma^2_\epsilon + \Sigma^{-1}$,
- $\Phi$ is the $p \times n$ matrix stacking $\phi_i$, $i = 1, \dots, n$, columnwise.
GPs and Bayesian linear regression

After some matrix algebra (Woodbury identity!), we can write this as:

$$f^* \,|\, y, X, x^* \sim N\Big((\phi^*)' \Sigma \Phi \,[\Phi' \Sigma \Phi + \sigma^2_\epsilon I]^{-1} y,\ (\phi^*)' \Sigma \phi^* - (\phi^*)' \Sigma \Phi \,[\Phi' \Sigma \Phi + \sigma^2_\epsilon I]^{-1} \Phi' \Sigma \phi^*\Big)$$

Taking $k(x_i, x_j) = \phi(x_i)' \Sigma \phi(x_j)$, we get the familiar GP prediction expression. Thus {Bayesian regression} ⊂ {Gaussian processes}. Is {Gaussian processes} ⊂ {Bayesian regression}?

"Kernel trick": feature vectors $\phi$ only enter as the inner products $\Phi' \Sigma \Phi$, $(\phi^*)' \Sigma \Phi$, or $(\phi^*)' \Sigma \phi^*$. The kernel (covariance function) $k(\cdot, \cdot)$ spares us from ever calculating $\phi(x)$. Where have we seen this before?
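The equivalence is easy to verify numerically. A minimal sketch with assumed toy features $\phi(x) = (1, x, x^2)'$ and prior $w \sim N(0, \Sigma)$; all names and values here are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
phi = lambda x: np.stack([np.ones_like(x), x, x**2])  # p x n feature matrix
Sigma = np.eye(3)
sigma2_eps = 0.1

x = rng.uniform(-1, 1, 10)
y = rng.normal(0.0, 1.0, 10)
x_star = np.array([0.5])

# Weight-space view: posterior predictive mean (phi*)' w_hat,
# with w_hat = A^{-1} Phi y / sigma2_eps and A = Phi Phi'/sigma2_eps + Sigma^{-1}.
Phi, phi_s = phi(x), phi(x_star)
A = Phi @ Phi.T / sigma2_eps + np.linalg.inv(Sigma)
mean_w = phi_s.T @ np.linalg.solve(A, Phi @ y) / sigma2_eps

# Function-space view: GP prediction with k(xi, xj) = phi(xi)' Sigma phi(xj).
K = Phi.T @ Sigma @ Phi
k_s = Phi.T @ Sigma @ phi_s
mean_f = k_s.T @ np.linalg.solve(K + sigma2_eps * np.eye(10), y)

print(np.allclose(mean_w, mean_f))  # True: the two views agree (Woodbury)
```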
Covariance functions

Common choices:

$$k(x_i, x_j) = \tau^2 \exp\left(\frac{-\|x_i - x_j\|}{2\ell}\right) \qquad \text{(exponential)}$$

$$k(x_i, x_j) = \tau^2 \exp\left(\frac{-\|x_i - x_j\|^2}{2\ell^2}\right) \qquad \text{(squared exponential)}$$

$$k(x_i, x_j) = \tau^2 \left(1 - \frac{3\|x_i - x_j\|}{2\theta} + \frac{\|x_i - x_j\|^3}{2\theta^3}\right) 1[\|x_i - x_j\| \le \theta] \qquad \text{(spherical)}$$

$$k(x_i, x_j) = \frac{\tau^2}{\Gamma(\nu)} \left(\frac{\|x_i - x_j\|}{2\phi}\right)^{\nu} B_\nu(\phi \|x_i - x_j\|) \qquad \text{(Matérn)}$$

$$k(x_i, x_j) = \sigma^2 + \tau^2 (x_i - c)'(x_j - c) \qquad \text{(linear)}$$
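For reference, direct NumPy/SciPy sketches of several of these kernels, with parameter names following the slide (the squared exponential was implemented earlier; the Matérn case is omitted since it needs the modified Bessel function, scipy.special.kv, with care at zero distance):

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_exponential(X1, X2, tau2=1.0, ell=1.0):
    d = cdist(X1, X2)                       # pairwise ||xi - xj||
    return tau2 * np.exp(-d / (2 * ell))

def k_spherical(X1, X2, tau2=1.0, theta=1.0):
    d = cdist(X1, X2)
    k = tau2 * (1 - 3 * d / (2 * theta) + d**3 / (2 * theta**3))
    return np.where(d <= theta, k, 0.0)     # compact support: zero beyond theta

def k_linear(X1, X2, sigma2=0.0, tau2=1.0, c=0.0):
    return sigma2 + tau2 * (X1 - c) @ (X2 - c).T
```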
Covariance functions: properties

Isotropy (stationarity): covariance depends only on distance, $k(x_i, x_j) = c(\|x_i - x_j\|)$. Common in many GP applications.