Scalable Gaussian processes with a twist of Probabilistic Numerics

  1. Scalable Gaussian processes with a twist of Probabilistic Numerics. Kurt Cutajar, EURECOM, Sophia Antipolis, France. Data Science Meetup - October 30th 2017

  2. Agenda • Kernel Methods • Scalable Gaussian Processes (using Preconditioning) • Probabilistic Numerics

  3. Kernel Methods
  • Operate in a high-dimensional, implicit feature space
  • Rely on the construction of an n × n Gram matrix K
  • E.g. RBF: k(x_i, x_j) = σ² exp(−½ d²), where d² = (x_i − x_j)ᵀ Λ (x_i − x_j)
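
  For concreteness, here is a minimal NumPy sketch of the n × n Gram matrix for an ARD RBF kernel; the function and parameter names (rbf_gram, sigma2, lengthscales) are illustrative, not from the slides.

    import numpy as np

    def rbf_gram(X, sigma2=1.0, lengthscales=None):
        # K[i, j] = sigma2 * exp(-0.5 * d2), with d2 the lengthscale-rescaled squared distance
        d = X.shape[1]
        if lengthscales is None:
            lengthscales = np.ones(d)
        Xs = X / lengthscales                                # per-dimension rescaling (ARD)
        sq = np.sum(Xs ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Xs @ Xs.T     # pairwise squared distances
        return sigma2 * np.exp(-0.5 * np.maximum(d2, 0.0))   # clip tiny negatives from round-off

    K = rbf_gram(np.random.randn(500, 3))                    # a 500 x 500 Gram matrix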

  4. Kernel Methods • Wide variety of kernel functions available [Figure taken from David Duvenaud’s PhD Thesis]

  5. Kernel Methods • Choice is not always straightforward! [Figure taken from David Duvenaud’s PhD Thesis]

  6. All About that Bayes
  • posterior = likelihood × prior / marginal likelihood
  • p(par | X, y) = p(y | X, par) × p(par) / p(y | X)

  7. All About that Bayes - Making Predictions
  • We average over all possible parameter values, weighted by their posterior probability:
  • p(y* | x*, X, y) = ∫ p(y* | x*, par) p(par | X, y) d par = N(E[y*], V[y*])
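
  In practice this integral is usually approximated by Monte Carlo, averaging the per-sample predictive moments over posterior samples of the parameters. The sketch below assumes hypothetical posterior_samples and predict_given_par helpers; it is not the method of the talk, just an illustration of the averaging.

    import numpy as np

    def predictive_moments(x_star, posterior_samples, predict_given_par):
        # p(y* | x*, X, y) ~ (1/S) sum_s p(y* | x*, par_s), with par_s ~ p(par | X, y)
        means, variances = [], []
        for par in posterior_samples:
            m, v = predict_given_par(x_star, par)   # per-sample predictive mean and variance
            means.append(m)
            variances.append(v)
        means = np.asarray(means)
        variances = np.asarray(variances)
        e_y = means.mean(axis=0)                            # E[y*]
        v_y = variances.mean(axis=0) + means.var(axis=0)    # V[y*] via the law of total variance
        return e_y, v_y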

  8. Gaussian Processes

  9. Gaussian Processes - Prior Distribution over Functions
  [Figure: samples drawn from the GP prior (label vs. input), shown alongside the prior covariance matrix K∞]

  10. Gaussian Processes - Conditioned on Observations
  [Figure: GP prior samples shown together with the observed data points and the prior covariance matrix K∞]

  11. Gaussian Processes - Posterior Distribution over Functions
  [Figure: posterior samples conditioned on the observations, shown alongside the posterior covariance matrix K_y]

  12. Gaussian Processes - GP regression example
  [Figure: three panels showing the GP prior, the observed data, and the inference result, with the corresponding covariance matrices K∞ and K_y]
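
  For GP regression the prior-to-posterior update shown in these figures has a closed form; below is a minimal Cholesky-based sketch, assuming kernel(A, B) returns the cross-covariance matrix and lam is the noise variance (both names are illustrative).

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def gp_posterior(X, y, X_star, kernel, lam=0.1):
        K_y = kernel(X, X) + lam * np.eye(len(X))   # K_y = K + lam * I
        K_s = kernel(X, X_star)                     # n x n_star cross-covariances
        K_ss = kernel(X_star, X_star)
        fac = cho_factor(K_y, lower=True)
        alpha = cho_solve(fac, y)                   # K_y^{-1} y
        mean = K_s.T @ alpha                        # posterior mean at X_star
        cov = K_ss - K_s.T @ cho_solve(fac, K_s)    # posterior covariance at X_star
        return mean, cov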

  13. Bayesian Learning vs Deep Learning
  • Deep Learning
    + Scalable to very large datasets
    + Increased model flexibility/capacity
    - Frequentist approaches make only point estimates
    - Less robust to overfitting
  • Bayesian Learning
    + Incorporates uncertainty in predictions
    + Works well with smaller datasets
    - Lack of conjugacy necessitates approximation
    - Expensive computational and storage requirements

  14. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  • Deep probabilistic models
  • Composition of functions: f(x) = (h^(N_h−1)(·; θ^(N_h−1)) ∘ … ∘ h^(0)(·; θ^(0)))(x)

  15. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  • Inference requires calculating the marginal likelihood:
  p(Y | X, θ) = ∫ p(Y | F^(N_h), θ^(N_h)) × p(F^(N_h) | F^(N_h−1), θ^(N_h−1)) × … × p(F^(1) | X, θ^(0)) dF^(N_h) … dF^(1)
  • Very challenging!

  16. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  [Figure: graphical model of a two-layer deep GP with random feature expansions, X → Φ^(0) → F^(1) → Φ^(1) → F^(2) → Y, with spectral frequencies Ω^(l), weights W^(l) and covariance parameters θ^(l)]
  Cutajar et al., Random Feature Expansions for Deep Gaussian Processes, ICML 2017
  Yarin Gal, Bayesian Deep Learning, PhD Thesis

  17. Scalable Gaussian Processes

  18. Gaussian Processes
  • Marginal likelihood: log p(y | par) = −½ log |K_y| − ½ yᵀ K_y⁻¹ y + const.
  • Derivatives wrt par: ∂ log p(y | par) / ∂par_i = −½ Tr(K_y⁻¹ ∂K_y/∂par_i) + ½ yᵀ K_y⁻¹ (∂K_y/∂par_i) K_y⁻¹ y
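
  Both expressions only need solves against K_y and a log-determinant. A Cholesky-based sketch is below; dK_list holds hypothetical derivative matrices ∂K_y/∂par_i supplied by the caller.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def log_marginal_and_grads(K_y, y, dK_list):
        fac = cho_factor(K_y, lower=True)
        alpha = cho_solve(fac, y)                            # K_y^{-1} y
        logdet = 2.0 * np.sum(np.log(np.diag(fac[0])))       # log |K_y| from the Cholesky factor
        lml = -0.5 * logdet - 0.5 * y @ alpha                # + const.
        grads = []
        for dK in dK_list:
            trace_term = np.trace(cho_solve(fac, dK))        # Tr(K_y^{-1} dK)
            quad_term = alpha @ dK @ alpha                   # y^T K_y^{-1} dK K_y^{-1} y
            grads.append(-0.5 * trace_term + 0.5 * quad_term)
        return lml, np.array(grads)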

  19. Gaussian Processes - Stochastic Trace Estimation [Figure taken from Shakir Mohamed’s Machine Learning Blog]

  20. Gaussian Processes - Stochastic Gradients
  • Stochastic estimate of the trace - assuming E[r rᵀ] = I, then
  Tr(K_y⁻¹ ∂K_y/∂par_i) = Tr(K_y⁻¹ (∂K_y/∂par_i) E[r rᵀ]) = E[rᵀ K_y⁻¹ (∂K_y/∂par_i) r]
  • Stochastic gradient:
  −1/(2N_r) ∑_{j=1}^{N_r} r^(j)ᵀ K_y⁻¹ (∂K_y/∂par_i) r^(j) + ½ yᵀ K_y⁻¹ (∂K_y/∂par_i) K_y⁻¹ y
  • Linear systems only!
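
  A sketch of the trace estimator with Rademacher probe vectors; solve_Ky stands for any routine that applies K_y⁻¹ to a vector (e.g. a CG solve) and is an assumed helper, not part of the slides.

    import numpy as np

    def stochastic_trace(solve_Ky, dK, n, n_r=4, seed=0):
        # Estimate Tr(K_y^{-1} dK) as an average of r^T K_y^{-1} dK r with E[r r^T] = I
        rng = np.random.default_rng(seed)
        estimates = []
        for _ in range(n_r):
            r = rng.choice([-1.0, 1.0], size=n)      # Rademacher probe vector
            estimates.append(r @ solve_Ky(dK @ r))   # r^T K_y^{-1} (dK r)
        return np.mean(estimates)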

  22. Solving Linear Systems
  • Involve the solution of linear systems K z = v
  • Cholesky Decomposition
    • O(n²) space and O(n³) time - infeasible for large n
    • K must be stored in memory!
  • Conjugate Gradient
    • Numerical solution of linear systems
    • O(tn²) for t CG iterations - in theory t = n (possibly worse!)
  [Figure: CG iterates moving from the starting point z₀ towards the solution z]
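
  A quick SciPy sketch contrasting the two solvers on a stand-in symmetric positive-definite matrix (the matrix, sizes and constants are arbitrary, chosen only so the snippet runs):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.sparse.linalg import cg

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.standard_normal((n, n))
    K_y = A @ A.T / n + 0.1 * np.eye(n)        # stand-in SPD "kernel" matrix
    v = rng.standard_normal(n)

    z_chol = cho_solve(cho_factor(K_y), v)     # exact: O(n^3) time, needs K_y in memory
    z_cg, info = cg(K_y, v)                    # iterative: one matrix-vector product per iteration
    # info == 0 signals convergence; the two solutions agree up to the CG tolerance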

  24. Solving Linear Systems
  • Preconditioned Conjugate Gradient (henceforth PCG)
    • Transforms the linear system to be better conditioned, improving convergence
    • Yields a new linear system of the form P⁻¹ K z = P⁻¹ v
    • O(tn²) for t PCG iterations - in practice t ≪ n
  [Figure: convergence paths of CG and PCG from the starting point z₀ to the solution z]
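
  In SciPy, preconditioning amounts to passing an (approximate) P⁻¹ through the M argument of cg. A simple Jacobi (diagonal) preconditioner is used below purely as an illustration; it is not one of the preconditioners studied in the talk.

    import numpy as np
    from scipy.sparse.linalg import cg, LinearOperator

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.standard_normal((n, n))
    K_y = A @ A.T / n + 0.1 * np.eye(n)
    v = rng.standard_normal(n)

    d = np.diag(K_y)                                                 # Jacobi: P = diag(K_y)
    P_inv = LinearOperator((n, n), matvec=lambda x: x.ravel() / d)   # cheap application of P^{-1}
    z_pcg, info = cg(K_y, v, M=P_inv)                                # a good P means far fewer iterations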

  25. Preconditioning Approaches
  • Suppose we want to precondition K_y = K + λI
  • Our choice of preconditioner, P, should:
    • Approximate K_y as closely as possible
    • Be easy to invert
  • For low-rank preconditioners we employ the Woodbury inversion lemma:
    (A + U C Vᵀ)⁻¹ = A⁻¹ − A⁻¹ U (C⁻¹ + Vᵀ A⁻¹ U)⁻¹ Vᵀ A⁻¹
  • For other preconditioners we solve inner linear systems once again using CG!
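
  A sketch of applying P⁻¹ via the Woodbury identity when P = U C Uᵀ + λI is low-rank plus diagonal (for instance the Nyström-style choices listed on the next slide, with U = K_XU and C = K_UU⁻¹); each application then costs O(nm + m³) rather than O(n³). Variable names are mine.

    import numpy as np

    def woodbury_solve(U, C_inv, lam, v):
        # P^{-1} v for P = U C U^T + lam * I, with U of size n x m and C_inv = C^{-1} (m x m):
        # P^{-1} v = v/lam - (1/lam) U (C^{-1} + U^T U / lam)^{-1} (U^T v / lam)
        rhs = U.T @ v / lam                      # m-dimensional right-hand side
        inner = C_inv + (U.T @ U) / lam          # small m x m system
        return v / lam - (U @ np.linalg.solve(inner, rhs)) / lam

  Wrapping this in a LinearOperator and passing it as M to SciPy's cg gives the PCG solver sketched above.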

  28. Preconditioning Approaches
  • Nyström:        P = K_XU K_UU⁻¹ K_UX + λI,  where U ⊂ X
  • FITC:           P = K_XU K_UU⁻¹ K_UX + diag(K − K_XU K_UU⁻¹ K_UX) + λI
  • PITC:           P = K_XU K_UU⁻¹ K_UX + bldiag(K − K_XU K_UU⁻¹ K_UX) + λI
  • Spectral:       P_ij = (σ²/m) ∑_{r=1}^{m} cos[2π s_rᵀ (x_i − x_j)] + λ I_ij
  • Partial SVD:    K = A Λ Aᵀ  ⇒  P = A[·, 1:m] Λ[1:m, 1:m] Aᵀ[1:m, ·] + λI
  • Block Jacobi:   P = bldiag(K) + λI
  • SKI:            P = W K_UU Wᵀ + λI,  where K_UU is Kronecker-structured
  • Regularization: P = K + λI + δI
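
  As one concrete entry from this list, a block Jacobi preconditioner only needs the diagonal blocks of K. The sketch below (block size and function names are mine) factorizes each block once and exposes P⁻¹ as a LinearOperator that can be passed as M to a PCG solve.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.sparse.linalg import LinearOperator

    def block_jacobi_inverse(K, lam, block_size=200):
        # P = bldiag(K) + lam * I; applying P^{-1} means one small solve per diagonal block
        n = K.shape[0]
        blocks = []
        for start in range(0, n, block_size):
            stop = min(start + block_size, n)
            B = K[start:stop, start:stop] + lam * np.eye(stop - start)
            blocks.append((start, stop, cho_factor(B, lower=True)))   # factor each block once

        def apply_P_inv(v):
            v = np.asarray(v).ravel()
            out = np.empty_like(v, dtype=float)
            for start, stop, fac in blocks:
                out[start:stop] = cho_solve(fac, v[start:stop])
            return out

        return LinearOperator((n, n), matvec=apply_P_inv)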

  29. Comparison of Preconditioners vs CG

  30. Experimental Setup - GP Kernel Parameter Optimization
  • Exact gradient-based optimization using Cholesky decomposition (CHOL)
  • Stochastic gradient-based optimization
    • Linear systems solved with CG and PCG
  • GP Approximations
    • Variational learning of inducing variables (VAR)
    • Fully Independent Training Conditional (FITC)
    • Partially Independent Training Conditional (PITC)

  31. Results - ARD Kernel
  [Figure: test performance vs. training time (log10 seconds). Regression: RMSE and negative test log-likelihood on Protein (n = 45730, d = 9) and Power plant (n = 9568, d = 4). Classification: error rate and negative test log-likelihood on EEG (n = 14979, d = 14) and Spam (n = 4061, d = 57). Methods compared: PCG, CG, CHOL, FITC, PITC, VAR]

  32. Follow-up Work
  • Faster Kernel Ridge Regression Using Sketching and Preconditioning, Avron et al. (2017)
  • FALKON: An Optimal Large Scale Kernel Method, Rudi et al. (2017)
  • Large Linear Multi-output Gaussian Process Learning for Time Series, Feinberg et al. (2017)
  • Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes, Kim et al. (2017)

  33. Follow-up work ... but what’s left to do now?

  35. Probabilistic Numerics
