Gaussian Processes
Dan Cervone, NYU CDS
November 10, 2015
What are Gaussian processes?

GPs let us do Bayesian inference on functions. Using GPs we can:
- Interpolate spatial data
- Forecast time series
- Represent latent surfaces for classification, point processes, etc.
- Emulate likelihoods and complex, black-box functions
- Model cool stuff across many scientific disciplines!

[Image credits: https://pythonhosted.org/infpy/gps.html, http://becs.aalto.fi/en/research/bayes/mcmcstuff/traindata.jpg]
Preliminaries

The basic setup:
- Data set $\{(x_i, y_i),\ i = 1, \dots, n\}$.
- Inputs $x_i \in S \subset \mathbb{R}^D$.
- Outputs $y_i \in \mathbb{R}$.

$$x_i \sim p(x), \qquad y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2_\epsilon).$$

Definition. $f$ is a Gaussian process if for any collection $X = \{x_i \in S,\ i = 1, \dots, n\}$,

$$\begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix} \sim N\big(\mu(X),\ K(X, X)\big).$$
Mean, covariance functions

GPs are characterized by their mean and covariance functions:
- Mean function $\mu(x)$. WLOG, we can assume $\mu = 0$. (Why?)
- Covariance function $k$, where $[K(X, X)]_{ij} = k(x_i, x_j) = \operatorname{Cov}(f(x_i), f(x_j))$.

Example:

$$k(x_i, x_j) = \tau^2 \exp\left(\frac{-\|x_i - x_j\|^2}{2\ell^2}\right) \qquad \text{(squared exponential)}$$
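For concreteness, here is a minimal NumPy sketch of this covariance function; the function name and the (n, D)-array convention are choices made here, not part of the slides.

```python
import numpy as np

def squared_exponential(X1, X2, tau2=1.0, ell2=1.0):
    """k(xi, xj) = tau2 * exp(-||xi - xj||^2 / (2 * ell2)).

    X1: (n, D) array, X2: (m, D) array; returns the (n, m) matrix K(X1, X2).
    """
    # Pairwise squared Euclidean distances via broadcasting.
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return tau2 * np.exp(-d2 / (2.0 * ell2))
```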
GP regression (prediction)

Interpolation/prediction at target locations:
- (Noise-free observations) Observe $\{(x_i, f(x_i)),\ i = 1, \dots, n\}$.
- (Noisy observations) Observe $\{(x_i, y_i),\ i = 1, \dots, n\}$.
- Want to predict $f^* = \{f(x^*_1), \dots, f(x^*_k)\}$ at $x^*$.

Prediction with noise-free data:

$$\begin{pmatrix} f \\ f^* \end{pmatrix} \Big|\, X, X^* \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K(X, X) & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix}\right)$$

$$f^* \,|\, f, X, X^* \sim N\Big(K(X^*, X)[K(X, X)]^{-1} f,\ K(X^*, X^*) - K(X^*, X)[K(X, X)]^{-1} K(X, X^*)\Big)$$

Prediction with noisy data:

$$\begin{pmatrix} y \\ f^* \end{pmatrix} \Big|\, X, X^* \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma^2_\epsilon I_n & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix}\right)$$

$$f^* \,|\, y, X, X^* \sim N\Big(K(X^*, X)[K(X, X) + \sigma^2_\epsilon I_n]^{-1} y,\ K(X^*, X^*) - K(X^*, X)[K(X, X) + \sigma^2_\epsilon I_n]^{-1} K(X, X^*)\Big)$$
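These conditionals translate directly into a few lines of linear algebra. Below is a minimal sketch under the conventions of the earlier snippet; the name gp_predict and the Cholesky-based solves are implementation choices, not from the slides.

```python
import numpy as np

def gp_predict(X, y, X_star, kernel, sigma2_eps=0.0):
    """Posterior mean and covariance of f* given the data.

    With sigma2_eps > 0 this is the noisy-data formula; with sigma2_eps = 0
    (and y = f) it reduces to the noise-free interpolation formula.
    """
    K = kernel(X, X) + sigma2_eps * np.eye(X.shape[0])  # K(X,X) + sigma_eps^2 I_n
    K_s = kernel(X, X_star)                             # K(X, X*)
    K_ss = kernel(X_star, X_star)                       # K(X*, X*)

    # Cholesky-based solves are cheaper and more stable than explicit inverses.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # [K + s2*I]^{-1} y
    V = np.linalg.solve(L, K_s)                          # L^{-1} K(X, X*)

    mean = K_s.T @ alpha          # K(X*,X) [K + s2*I]^{-1} y
    cov = K_ss - V.T @ V          # K(X*,X*) - K(X*,X) [K + s2*I]^{-1} K(X,X*)
    return mean, cov
```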
GP regression (prediction)

Some cool things we've noticed:
- $f$, $f^*$, $y$, $y^*$ are all jointly Gaussian.
- GP regression gives us interval (distributional) predictions for free.
- Prediction using noise-free vs. noisy data: which situation is more likely in practice?
- The "nugget" $\sigma^2_\epsilon I_n$ arises due to measurement error or high-frequency behavior, and provides numerical stability and regularization.
Illustrating GP regression

[The original slides show a sequence of plots of $f(x)$ against $x$ on $[0, 10]$; only the panel titles survive extraction. A code sketch reproducing the sequence follows this list.]
- TRUTH: $\tau^2 = 1$, $\ell^2 = 1$, $\sigma^2_\epsilon = 0.01$.
- Sample $\{(x_i, y_i),\ i = 1, \dots, 20\}$.
- Posterior mean of $f^* | y$.
- 95% prediction interval for $f^* | y$.
- Fitting GP with $\ell^2 = 10$.
- Fitting GP with $\ell^2 = 0.1$.
- Fitting GP with $\sigma^2_\epsilon = 1$.
- Fitting GP with $\sigma^2_\epsilon = 0.0001$.
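A sketch of how this illustration could be reproduced, assuming the squared_exponential and gp_predict functions defined above; the seed, grid, and input design are arbitrary choices, not recovered from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
tau2, ell2, sigma2_eps = 1.0, 1.0, 0.01        # the "TRUTH" hyperparameters

kernel = lambda A, B: squared_exponential(A, B, tau2, ell2)

X = np.sort(rng.uniform(0, 10, size=(20, 1)), axis=0)    # 20 inputs on [0, 10]
f = rng.multivariate_normal(np.zeros(20), kernel(X, X))  # draw f from the GP prior
y = f + rng.normal(0.0, np.sqrt(sigma2_eps), size=20)    # noisy observations

X_star = np.linspace(0, 10, 200)[:, None]                # prediction grid
mean, cov = gp_predict(X, y, X_star, kernel, sigma2_eps)
sd = np.sqrt(np.clip(np.diag(cov), 0, None))             # clip tiny negative variances
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd        # 95% interval for f* | y
```

Refitting with the misspecified values $\ell^2 = 10$, $\ell^2 = 0.1$, $\sigma^2_\epsilon = 1$, or $\sigma^2_\epsilon = 0.0001$ amounts to changing the corresponding arguments above.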
GPs and Bayesian linear regression

Assume $f(x_i)$ is linear in a $p$-dimensional feature vector of $x_i$:

$$f(x_i) = \phi(x_i)' w = \phi_i' w$$

Usual Bayesian regression setup for $\phi$:

$$y_i \,|\, X \overset{\text{ind}}{\sim} N(\phi_i' w, \sigma^2_\epsilon) \qquad \text{(likelihood)}$$
$$w \sim N(0, \Sigma) \qquad \text{(prior)}$$
$$w \,|\, y, X \sim N(\hat{w}, A^{-1}) \qquad \text{(posterior)}$$
$$f^* \,|\, y, X, x^* \sim N\big((\phi^*)' \hat{w},\ (\phi^*)' A^{-1} \phi^*\big) \qquad \text{(posterior predictive)}$$

where
- $\hat{w} = A^{-1} \Phi y / \sigma^2_\epsilon$,
- $A = \Phi \Phi' / \sigma^2_\epsilon + \Sigma^{-1}$,
- $\Phi$ is the $p \times n$ matrix stacking $\phi_i$, $i = 1, \dots, n$, columnwise.
GPs and Bayesian linear regression

After some matrix algebra (Woodbury identity!), we can write this as:

$$f^* \,|\, y, X, x^* \sim N\Big((\phi^*)' \Sigma \Phi \,[\Phi' \Sigma \Phi + \sigma^2_\epsilon I]^{-1} y,\ (\phi^*)' \Sigma \phi^* - (\phi^*)' \Sigma \Phi \,[\Phi' \Sigma \Phi + \sigma^2_\epsilon I]^{-1} \Phi' \Sigma \phi^*\Big)$$

Taking $k(x_i, x_j) = \phi(x_i)' \Sigma \phi(x_j)$, we get the familiar GP prediction expression. Thus {Bayesian regression} ⊂ {Gaussian processes}. Is {Gaussian processes} ⊂ {Bayesian regression}?

"Kernel trick": feature vectors $\phi$ only enter as the inner products $\Phi' \Sigma \Phi$, $(\phi^*)' \Sigma \Phi$, or $(\phi^*)' \Sigma \phi^*$. The kernel (covariance function) $k(\cdot, \cdot)$ spares us from ever calculating $\phi(x)$. Where have we seen this before?
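The equivalence is easy to verify numerically. A minimal sketch with assumed toy features $\phi(x) = (1, x, x^2)'$ and prior $w \sim N(0, \Sigma)$; all names and values here are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
phi = lambda x: np.stack([np.ones_like(x), x, x**2])  # p x n feature matrix
Sigma = np.eye(3)
sigma2_eps = 0.1

x = rng.uniform(-1, 1, 10)
y = rng.normal(0.0, 1.0, 10)
x_star = np.array([0.5])

# Weight-space view: posterior predictive mean (phi*)' w_hat,
# with w_hat = A^{-1} Phi y / sigma2_eps and A = Phi Phi'/sigma2_eps + Sigma^{-1}.
Phi, phi_s = phi(x), phi(x_star)
A = Phi @ Phi.T / sigma2_eps + np.linalg.inv(Sigma)
mean_w = phi_s.T @ np.linalg.solve(A, Phi @ y) / sigma2_eps

# Function-space view: GP prediction with k(xi, xj) = phi(xi)' Sigma phi(xj).
K = Phi.T @ Sigma @ Phi
k_s = Phi.T @ Sigma @ phi_s
mean_f = k_s.T @ np.linalg.solve(K + sigma2_eps * np.eye(10), y)

print(np.allclose(mean_w, mean_f))  # True: the two views agree (Woodbury)
```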
Covariance functions

Common choices:

$$k(x_i, x_j) = \tau^2 \exp\left(\frac{-\|x_i - x_j\|}{2\ell}\right) \qquad \text{(exponential)}$$

$$k(x_i, x_j) = \tau^2 \exp\left(\frac{-\|x_i - x_j\|^2}{2\ell^2}\right) \qquad \text{(squared exponential)}$$

$$k(x_i, x_j) = \tau^2 \left(1 - \frac{3\|x_i - x_j\|}{2\theta} + \frac{\|x_i - x_j\|^3}{2\theta^3}\right) 1[\|x_i - x_j\| \le \theta] \qquad \text{(spherical)}$$

$$k(x_i, x_j) = \frac{\tau^2}{\Gamma(\nu)} \left(\frac{\|x_i - x_j\|}{2\phi}\right)^{\nu} B_\nu(\phi \|x_i - x_j\|) \qquad \text{(Matérn)}$$

$$k(x_i, x_j) = \sigma^2 + \tau^2 (x_i - c)'(x_j - c) \qquad \text{(linear)}$$
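For reference, direct NumPy/SciPy sketches of several of these kernels, with parameter names following the slide (the squared exponential was implemented earlier; the Matérn case is omitted since it needs the modified Bessel function, scipy.special.kv, with care at zero distance):

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_exponential(X1, X2, tau2=1.0, ell=1.0):
    d = cdist(X1, X2)                       # pairwise ||xi - xj||
    return tau2 * np.exp(-d / (2 * ell))

def k_spherical(X1, X2, tau2=1.0, theta=1.0):
    d = cdist(X1, X2)
    k = tau2 * (1 - 3 * d / (2 * theta) + d**3 / (2 * theta**3))
    return np.where(d <= theta, k, 0.0)     # compact support: zero beyond theta

def k_linear(X1, X2, sigma2=0.0, tau2=1.0, c=0.0):
    return sigma2 + tau2 * (X1 - c) @ (X2 - c).T
```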
Covariance functions: properties

Isotropy (stationarity): covariance depends only on distance, $k(x_i, x_j) = c(\|x_i - x_j\|)$. Common in many GP applications.