Sparse Gaussian Processes with Spherical Harmonic Features


  1. Sparse Gaussian Processes with Spherical Harmonic Features. Vincent Dutordoir¹, Nicolas Durrande¹ and James Hensman². ¹PROWLER.io, ²Amazon (work completed while JH was at PROWLER.io). International Conference on Machine Learning, 2020.

  2. Contribution. We improve the scaling of sparse GPs with the number of datapoints and the number of input dimensions.

Airline dataset: a regression problem with 6 × 10⁶ datapoints and 8 input dimensions. Setup: a single GTX 1070 GPU.

    Model    Wall-clock time (s)    NLPD (lower is better)
    SVGP*    918.77                 1.31
    VISH*    41.32                  1.29

  3. Variational Inference with Spherical Harmonics (VISH). Gist of the method: make the inputs (d+1)-dimensional by appending a bias, project the data radially onto S^d, run a fast SVGP on the sphere, and map predictions on S^d back to the original space. [Figure: 2D inputs (x, y) lifted by a bias coordinate and projected onto the sphere.] The efficiency of VISH comes from using spherical harmonics as inducing functions for the SVGP on the sphere.
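As an illustration, a minimal sketch of that input mapping, assuming plain NumPy; the helper name to_sphere and the default bias value are hypothetical choices, not from the talk:

    import numpy as np

    def to_sphere(X, bias=1.0):
        """Map N x d inputs to N x (d+1) points on the unit sphere S^d
        by appending a constant bias column and normalising each row."""
        Xb = np.concatenate([X, np.full((X.shape[0], 1), bias)], axis=1)
        return Xb / np.linalg.norm(Xb, axis=1, keepdims=True)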

  4. From inducing points to inducing features.

                      Inducing points    VISH
    feature           u_m = f(z_m)       u_m = ⟨f, φ_m⟩_H
    K_uu              dense              diagonal
    inverting K_uu    O(M³)              O(M)

Orthogonality of the basis functions φ leads to a diagonal K_uu and an O(M) inversion.
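To make the complexity gap concrete, a tiny NumPy sketch (placeholder values) contrasting the factorisation a dense K_uu needs with the elementwise inversion a diagonal K_uu allows:

    import numpy as np

    M = 512
    dense_Kuu = np.eye(M) + 0.1 * np.ones((M, M))   # generic dense K_uu (placeholder)
    diag_Kuu = np.linspace(1.0, 2.0, M)             # VISH: K_uu is diagonal

    chol = np.linalg.cholesky(dense_Kuu)            # O(M^3) factorisation needed
    diag_inv = 1.0 / diag_Kuu                       # O(M) elementwise inversion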

  5. Deep-dive

  6-9. Sparse Variational Gaussian processes: scalable and flexible.

Capture the GP by a set of inducing variables u = f(Z), at locations z_1, …, z_M.

Minimise the KL divergence from p(f(·) | y) to q(f(·)) = GP(μ(·), ν(·, ·′)):

    μ(·) = k_u(·)ᵀ K_uu⁻¹ m,
    ν(·, ·′) = k(·, ·′) − k_u(·)ᵀ K_uu⁻¹ (K_uu − S) K_uu⁻¹ k_u(·′),

where [K_uu]_{m,m′} = Cov(u_m, u_{m′}) and [k_u(·)]_m = Cov(u_m, f(·)).

This gives a more flexible (e.g. non-Gaussian likelihoods) and more scalable (e.g. mini-batching) model, at a cost of O(M³ + M²N).

Further speedups come from structure in the K_uu matrix (e.g. Hensman et al., 2017, VFF).
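A minimal sketch of these predictive equations, assuming NumPy/SciPy and that K_uu, K_uf, the prior variances kff, and the variational parameters (m, S) are already computed; the Cholesky solve is the O(M³) step that the structured K_uu of later slides avoids:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def svgp_predict(Kuu, Kuf, kff, m, S):
        """Kuu: M x M, Kuf: M x N, kff: N prior variances k(x, x),
        m: M variational mean, S: M x M variational covariance."""
        L = cho_factor(Kuu, lower=True)             # the O(M^3) step
        A = cho_solve(L, Kuf)                       # K_uu^{-1} k_u(x) for all N points
        mean = A.T @ m
        var = kff - np.einsum("mn,mk,kn->n", A, Kuu - S, A)
        return mean, var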

  10. Outline. Gaussian processes on the circle and the hypersphere; spherical harmonics as inducing features; a linear projection of the data onto the hypersphere.

  11. Gaussian processes on the circle.

    f(θ) = Σ_i ξ_i φ_i(θ), with ξ_i ~ N(0, λ_i),
    k(θ₁, θ₂) = Σ_{i=0}^∞ λ_i φ_i(θ₁) φ_i(θ₂),
    Φ(θ) = [cos(iθ), sin(iθ)]_{i=0}^∞.

[Figure: GP samples on the circle, shown both as functions of θ on [−π, π] and wrapped around the circle in 3D.]
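A short sketch of this Karhunen-Loève view, assuming NumPy only: a GP sample on the circle is drawn as a finite sum of Fourier features; the eigenvalue decay used here is a hypothetical stand-in for a particular zonal kernel:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.linspace(-np.pi, np.pi, 200)

    I = 20                                    # truncation level
    lam = (1.0 + np.arange(I) ** 2) ** -1.0   # assumed eigenvalue decay

    f = np.zeros_like(theta)
    for i in range(I):
        for phi in (np.cos(i * theta), np.sin(i * theta)):
            f += rng.normal(0.0, np.sqrt(lam[i])) * phi   # xi_i ~ N(0, lam_i)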

  12. Spherical Harmonics. An orthonormal basis on the hypersphere; eigenfunctions of the Laplace-Beltrami operator, Δ_{S^{d−1}} φ_i = λ_i φ_i; eigenfunctions of zonal kernels.
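The d = 1 case of the eigenfunction property can be checked numerically: on the circle the Laplace-Beltrami operator is d²/dθ², and cos(iθ) should satisfy φ″ = −i²φ. A NumPy finite-difference sketch:

    import numpy as np

    i, n = 3, 1000
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    h = theta[1] - theta[0]
    phi = np.cos(i * theta)

    # periodic second-order central difference for d^2/dtheta^2
    lap = (np.roll(phi, -1) - 2.0 * phi + np.roll(phi, 1)) / h**2

    print(np.max(np.abs(lap + i**2 * phi)))   # ~0 up to discretisation error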

  13-15. Mercer’s theorem for zonal kernels on the sphere.

Zonal kernels are the spherical counterpart of stationary kernels: k(x, x′) = k̃(dist(x, x′)), where the distance on the sphere depends only on the inner product xᵀx′. [Figure: two points x and x′ on the sphere, with the angle between them given by xᵀx′.]

Mercer’s decomposition: any zonal kernel k on the hypersphere can be decomposed as

    k(x, x′) = Σ_{i=0}^∞ λ_i φ_i(x) φ_i(x′).

Karhunen-Loève expansion: a GP f on the hypersphere with zonal covariance k can be written f = Σ_i ξ_i φ_i with ξ_i ~ N(0, λ_i).
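A quick Monte Carlo check, assuming NumPy and a hypothetical eigenvalue decay, that the Karhunen-Loève expansion reproduces the Mercer kernel: the empirical covariance of f = Σ_i ξ_i φ_i matches Σ_i λ_i φ_i(x) φ_i(x′):

    import numpy as np

    rng = np.random.default_rng(1)
    I, S = 10, 200_000                        # truncation level, MC samples
    lam = 2.0 ** -np.arange(I)                # assumed eigenvalue decay
    t1, t2 = 0.3, 1.2                         # two inputs on the circle

    i = np.arange(I)
    phi1 = np.concatenate([np.cos(i * t1), np.sin(i * t1)])
    phi2 = np.concatenate([np.cos(i * t2), np.sin(i * t2)])
    lam2 = np.concatenate([lam, lam])         # shared eigenvalue per cos/sin pair

    xi = rng.normal(size=(S, 2 * I)) * np.sqrt(lam2)   # xi_i ~ N(0, lam_i)
    print(np.mean((xi @ phi1) * (xi @ phi2)))  # empirical Cov(f(t1), f(t2))
    print(np.sum(lam2 * phi1 * phi2))          # Mercer sum: should agree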

  16-20. Spherical harmonics as inducing features in SVGPs.

Define the kernel’s RKHS H with the reproducing inner product ⟨k(x, ·), h(·)⟩_H = h(x).

The approximate posterior is constructed from the inducing features u_m = ⟨f, φ_m⟩_H.

⇒ Diagonal covariance matrix:

    [K_uu]_{m,m′} = Cov(u_m, u_{m′}) = ⟨φ_m, φ_{m′}⟩_H = λ_m⁻¹ δ_{mm′}.

⇒ Spherical harmonics as features:

    [k_u(·)]_m = Cov(u_m, f(·)) = φ_m(·).

⇒ An O(M²N) approximate GP:

    q(f(·)) = GP( Φ(·)ᵀ m ; k(·, ·′) − Φ(·)ᵀ (Λ − S) Φ(·′) ),

where Λ = diag(λ₁, …, λ_M) and Φ(·) = [φ₁(·), …, φ_M(·)].
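A minimal NumPy sketch of the predictive equations on this slide, taking the features Φ, eigenvalues λ, and variational parameters (m, S) as given; prior_var stands in for k(x, x) of some zonal kernel, and the parametrisation follows the slide as written:

    import numpy as np

    def vish_predict(Phi, lam, m, S, prior_var):
        """Phi: N x M features, lam: M eigenvalues, m: M variational mean,
        S: M x M variational covariance, prior_var: N prior variances."""
        mean = Phi @ m
        cov_term = np.einsum("nm,mk,nk->n", Phi, np.diag(lam) - S, Phi)
        return mean, prior_var - cov_term      # no M x M inversion needed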

  21. Linear mapping to the hypersphere.

Most datasets do not correspond to data on a hypersphere... The proposed solution is to augment the inputs with a constant variable (a bias) before projecting them radially onto the hypersphere. [Figure: 2D inputs (x, y) lifted by a bias coordinate onto the sphere.]

Although this construction may seem arbitrary, it is used implicitly in the arc-cosine kernel [Cho & Saul, 2009]:

    k(x, x′) = ‖x‖ ‖x′‖ (sin θ + (π − θ) cos θ),  with θ = arccos( xᵀx′ / (‖x‖ ‖x′‖) ),

where ‖x‖ ‖x′‖ is the radial part and (sin θ + (π − θ) cos θ) the angular part.
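For concreteness, a small NumPy sketch of the arc-cosine kernel as written on the slide; note that Cho & Saul’s original definition also carries a 1/π normalisation that the slide omits:

    import numpy as np

    def arc_cosine_kernel(x, xp):
        """First-order arc-cosine kernel, following the slide's formula.
        Cho & Saul (2009) additionally divide by pi."""
        nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
        theta = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))
        radial = nx * nxp
        angular = np.sin(theta) + (np.pi - theta) * np.cos(theta)
        return radial * angular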

  22. Experiment. Airline dataset: a 6,000,000-datapoint regression task, fitted in about 40 seconds on a single cheap GTX 1070 GPU.

    Model            Wall-clock time (s)    NLPD (lower is better)
    SVGP             918.77                 1.31
    Additive-VFF*    75.61                  1.32
    VISH*            41.32                  1.29

  23. Conclusion. Summary of the advantages:
- It is the fastest SVGP model to date ⇒ no need for expensive hardware.
- The natural ordering of spherical harmonics makes the model scale nicely with the input dimension ⇒ it does not suffer from the curse of dimensionality the way VFF does.
- Similarities with the arc-cosine kernel give it extrapolation properties similar to those of neural networks.

Reach out for a chat if you want to know more!
