
Prediction with Gaussian Processes: Basic Ideas - PowerPoint PPT Presentation



  1. Prediction with Gaussian Processes: Basic Ideas
     Chris Williams, School of Informatics, University of Edinburgh, UK

     Overview
     • Bayesian Prediction
     • Gaussian Process Priors over Functions
     • GP regression
     • GP classification

     Bayesian prediction (Bayesian linear regression)
     • Define a prior over functions: f(x) = Σ_i w_i φ_i(x), with weights w ∼ N(0, Σ)
     • Observe data, obtain a posterior distribution over functions:
       P(f | D) ∝ P(f) P(D | f), i.e. posterior ∝ prior × likelihood
     • Make predictions by averaging predictions over the posterior P(f | D)
       (a small numerical sketch follows this slide)
     • Averaging mitigates overfitting

     [Figure: samples from the prior]
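To make the "posterior ∝ prior × likelihood" step concrete, here is a minimal NumPy sketch of Bayesian linear regression with a fixed set of basis functions. It is an illustration we add, not material from the talk: the Gaussian-bump basis, the toy data, and names such as `basis`, `prior_cov` and `noise_var` are assumptions; only the prior f(x) = Σ_i w_i φ_i(x) with w ∼ N(0, Σ) and the resulting Gaussian posterior over the weights follow the slide.

```python
# Minimal sketch of Bayesian linear regression with fixed basis functions
# (illustrative only; basis choice, data and names are ours, not from the slides).
import numpy as np

def basis(x, centres, width=0.2):
    """Gaussian bumps phi_i(x) as an example set of basis functions."""
    return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
centres = np.linspace(0, 1, 10)
prior_cov = np.eye(10)          # Sigma: prior covariance of the weights
noise_var = 0.1 ** 2            # sigma^2 of the Gaussian likelihood

# Toy training data
x_train = rng.uniform(0, 1, 5)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, np.sqrt(noise_var), 5)
Phi = basis(x_train, centres)   # n x N_F design matrix

# Posterior over weights: P(w|D) ∝ P(w) P(D|w) is Gaussian with
#   cov  = (Sigma^-1 + Phi^T Phi / sigma^2)^-1
#   mean = cov @ Phi^T y / sigma^2
post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + Phi.T @ Phi / noise_var)
post_mean = post_cov @ Phi.T @ y_train / noise_var

# Predictions average over the posterior: the mean prediction uses post_mean
x_test = np.linspace(0, 1, 100)
Phi_test = basis(x_test, centres)
f_mean = Phi_test @ post_mean
f_var = np.sum(Phi_test @ post_cov * Phi_test, axis=1)  # predictive variance of f
```

Drawing weight vectors from N(post_mean, post_cov) and mapping them through the basis gives posterior samples over functions; averaging those predictions is exactly the step that mitigates overfitting.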

  2. Gaussian Processes: priors over functions
     • For a stochastic process f(x), the mean function is μ(x) = E[f(x)]. Assume μ(x) ≡ 0 ∀x.
     • The covariance function is k(x, x') = E[f(x) f(x')].
     • Forget those weights! We should be thinking of defining priors over functions, not weights.
     • Priors over function-space can be defined directly by choosing a covariance function,
       e.g. k(x, x') = exp(−w |x − x'|).
     • Gaussian processes are stochastic processes defined by their mean and covariance functions.

     Examples of GP covariance functions
     • σ₀² + σ₁² x x'
     • exp(−|x − x'|)
     • exp(−(x − x')²)

     Connection to feature space
     • A Gaussian process prior over functions can be thought of as a Gaussian prior on the
       coefficients, w ∼ N(0, Λ), where
       f(x) = Σ_{i=1}^{N_F} w_i φ_i(x) = w · Φ(x),  with Φ(x) = (φ_1(x), φ_2(x), …, φ_{N_F}(x))ᵀ.
     • In many interesting cases, N_F = ∞.
     • Choose Φ(·) as the eigenfunctions of the kernel k(x, x') wrt p(x) (Mercer):
       ∫ k(x, y) p(x) φ_i(x) dx = λ_i φ_i(y)

     Gaussian process regression
     • Dataset D = (x_i, y_i), i = 1, …, n, with Gaussian likelihood p(y_i | f_i), noise ∼ N(0, σ²)
     • Predictive mean: f̄(x) = Σ_{i=1}^n α_i k(x, x_i), with α = (K + σ²I)⁻¹ y
     • Predictive variance: var(x) = k(x, x) − kᵀ(x) (K + σ²I)⁻¹ k(x),
       where k(x) = (k(x, x_1), …, k(x, x_n))ᵀ
     • Computed in time O(n³) (see the sketch after this slide)
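As a concrete companion to the regression equations above, the following NumPy sketch (ours, not from the slides) implements the predictive mean and variance using the exp(−(x − x')²) covariance listed among the examples; the toy data and variable names are assumptions.

```python
# Sketch of GP regression with the covariance k(x, x') = exp(-(x - x')^2)
# from the slide's examples; data and names are illustrative choices.
import numpy as np

def kernel(a, b):
    """k(x, x') = exp(-(x - x')^2) evaluated on all pairs of points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(1)
noise_var = 0.1 ** 2

# Dataset D = (x_i, y_i), i = 1..n
x_train = rng.uniform(0, 1, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, np.sqrt(noise_var), 8)
n = len(x_train)

K = kernel(x_train, x_train)                                   # n x n Gram matrix
alpha = np.linalg.solve(K + noise_var * np.eye(n), y_train)    # (K + sigma^2 I)^-1 y

x_test = np.linspace(0, 1, 50)
k_star = kernel(x_test, x_train)                               # k(x) for each test point

# Mean: f_bar(x) = sum_i alpha_i k(x, x_i)
f_mean = k_star @ alpha
# Variance: k(x, x) - k(x)^T (K + sigma^2 I)^-1 k(x)
v = np.linalg.solve(K + noise_var * np.eye(n), k_star.T)
f_var = kernel(x_test, x_test).diagonal() - np.sum(k_star * v.T, axis=1)
```

The linear solves against K + σ²I are what give the O(n³) cost quoted on the slide.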

  3. [Figures: GP posterior over Y(x) on x ∈ [0, 1], after 1 observation and after 2 observations]

     • Approximation methods can reduce O(n³) to O(n m²) for m ≪ n
     • GP regression is competitive with other kernel methods (e.g. SVMs)
     • Can use non-Gaussian likelihoods (e.g. Student-t)

     Adapting kernel parameters
     • For GPs, the marginal likelihood (aka Bayesian evidence) log P(y | θ) can be optimized
       wrt the kernel parameters θ = (v₀, w), e.g. for
       k(x_i, x_j) = v₀ exp( −½ Σ_{l=1}^d w_l (x_i^l − x_j^l)² )
     • For GP regression, log P(y | θ) can be computed exactly:
       log P(y | θ) = −½ log|K + σ²I| − ½ yᵀ (K + σ²I)⁻¹ y − (n/2) log 2π
       (a sketch of this computation follows this slide)

     [Figures: predictions with w_1 = 5.0, w_2 = 5.0 versus w_1 = 5.0, w_2 = 0.5]
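The exact marginal likelihood above is a single linear-algebra expression, so it can be sketched in a few lines. The code below is an illustration we add (using a Cholesky factorisation for the log-determinant and the solve); the ARD parameters (v₀, w) follow the slide, while the toy data and function names are assumptions.

```python
# Sketch of the exact GP-regression marginal likelihood
#   log P(y|theta) = -1/2 log|K + sigma^2 I| - 1/2 y^T (K + sigma^2 I)^-1 y - n/2 log(2 pi)
# with the ARD kernel k(x_i, x_j) = v0 exp(-1/2 sum_l w_l (x_i^l - x_j^l)^2).
import numpy as np

def ard_kernel(X1, X2, v0, w):
    """v0 * exp(-1/2 * sum_l w_l (x_l - x'_l)^2) for all pairs of rows."""
    diff = X1[:, None, :] - X2[None, :, :]          # n1 x n2 x d
    return v0 * np.exp(-0.5 * np.sum(w * diff ** 2, axis=-1))

def log_marginal_likelihood(X, y, v0, w, noise_var):
    n = len(y)
    Ky = ard_kernel(X, X, v0, w) + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)                      # stable log-det and solve
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-np.sum(np.log(np.diag(L)))             # -1/2 log|K + sigma^2 I|
            - 0.5 * y @ alpha                       # -1/2 y^T (K + sigma^2 I)^-1 y
            - 0.5 * n * np.log(2 * np.pi))

# Toy usage on 2-d inputs, echoing the slide's (w_1, w_2) settings.
# This quantity (or its negative) would be handed to an optimizer to adapt theta.
rng = np.random.default_rng(2)
X = rng.uniform(size=(20, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(20)
print(log_marginal_likelihood(X, y, v0=1.0, w=np.array([5.0, 0.5]), noise_var=0.01))
```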

  4. Regularization
     • f̄(x) is the (functional) minimum of
       J[f] = (1/2σ²) Σ_{i=1}^n (y_i − f(x_i))² + ½ ‖f‖²_H
       (1st term = −log-likelihood, 2nd term = −log-prior)
     • However, the regularization framework does not yield the predictive variance
       or the marginal likelihood.

     Previous work
     • Wiener-Kolmogorov prediction theory (1940s)
     • Splines (Kimeldorf and Wahba, 1971; Wahba, 1990)
     • ARMA models for time series
     • Kriging in geostatistics (for 2-d or 3-d spaces)
     • Regularization networks (Poggio and Girosi, 1989, 1990)
     • Design and Analysis of Computer Experiments (Sacks et al., 1989)
     • Infinite neural networks (Neal, 1995)

     GP prediction for classification problems
     • Squash the latent function through a logistic (or erf) function
     • Likelihood: −log P(y_i | f_i) = log(1 + exp(−y_i f_i))
     • The integrals can't be done analytically; options include:
       – finding the maximum a posteriori value of P(f | y) (Williams and Barber, 1997)
       – Expectation Propagation (Minka, 2001; Opper and Winther, 2000)
       – MCMC methods (Neal, 1997)

     [Figure: logistic squashing function π = σ(f), f ∈ [−3, 3]]

  5. MAP Gaussian process classification
     • To obtain the MAP approximation to the GPC solution, we find f̂ that maximizes the
       convex function
       Ψ(f) = −Σ_{i=1}^n log(1 + exp(−y_i f_i)) − ½ fᵀ K⁻¹ f + c
     • The optimization is carried out using the Newton-Raphson iteration
       f^new = K (I + W K)⁻¹ (W f + (t − π)),
       where W = diag(π_1(1 − π_1), …, π_n(1 − π_n)) and π_i = σ(f̂_i).
       Basic complexity is O(n³). (A sketch of this iteration follows this slide.)
     • For a test point x*, we compute f̄(x*) and the variance, and make the prediction as
       P(class 1 | x*, D) = ∫ σ(f*) p(f* | y) df*

     SVMs
     • The 1-norm soft margin classifier has the form
       f(x) = Σ_{i=1}^n y_i α*_i k(x, x_i) + w*_0,
       where y_i ∈ {−1, 1} and α* optimizes the quadratic form
       Q(α) = Σ_{i=1}^n α_i − ½ Σ_{i,j=1}^n y_i y_j α_i α_j k(x_i, x_j)
       subject to the constraints
       Σ_{i=1}^n y_i α_i = 0  and  C ≥ α_i ≥ 0, i = 1, …, n.
     • This is a quadratic programming problem. It can be solved in many ways, e.g. with
       interior-point methods or with special-purpose algorithms such as SMO.
       Basic complexity is O(n³).
     • Define g_σ(z) = log(1 + exp(−z)). The SVM classifier is similar to the GP classifier,
       but with g_σ replaced by g_SVM(z) = [1 − z]₊ (Wahba, 1999).
     • Note that the MAP solution using g_σ is not sparse, but it gives a probabilistic output.

     [Figure: log(1 + exp(−z)) versus max(1 − z, 0)]
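The Newton-Raphson update for the MAP latent values can be transcribed almost directly from the formula above. The sketch below is illustrative (the function names, the fixed iteration count, the jitter term and the toy data are our choices); it implements f^new = K (I + W K)⁻¹ (W f + (t − π)) with 0/1 targets t = (y + 1)/2 for labels y ∈ {−1, +1}.

```python
# Sketch of the Newton-Raphson iteration for MAP GP classification:
#   f_new = K (I + W K)^-1 (W f + (t - pi)),  W = diag(pi_i (1 - pi_i)),  pi_i = sigma(f_i),
# where t are 0/1 targets derived from labels y in {-1, +1}. Names are ours.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_gpc(K, y, n_iter=20):
    """Return the MAP latent values f_hat for labels y in {-1, +1}."""
    n = len(y)
    t = (y + 1) / 2.0                 # convert {-1, +1} labels to {0, 1} targets
    f = np.zeros(n)                   # start from the prior mean
    for _ in range(n_iter):
        pi = sigmoid(f)
        W = np.diag(pi * (1.0 - pi))
        # Newton step; each iteration costs O(n^3), matching the slide
        f = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + (t - pi))
    return f

# Tiny usage example with a squared-exponential kernel (toy data, our choice)
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 30)
y = np.where(x > 0, 1, -1)                        # labels in {-1, +1}
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
f_hat = map_gpc(K + 1e-6 * np.eye(30), y)         # small jitter for numerical stability
```

Given f̂, the predictive mean and variance at a test point follow as in GP regression, and P(class 1 | x*, D) is then obtained by averaging σ(f*) over the approximate Gaussian for f*.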
