Sparse Gaussian Processes with Spherical Harmonic Features
Vincent Dutordoir¹, Nicolas Durrande¹ and James Hensman²
¹PROWLER.io, ²Amazon (work completed while JH was at PROWLER.io)
International Conference on Machine Learning, 2020
Contribution

We improve the scaling of sparse GPs with the number of datapoints and the number of input dimensions.

Setup: the Airline dataset, a regression problem with $6 \cdot 10^6$ datapoints and 8 input dimensions, run on a GTX 1070 GPU.

Model         Wall-clock time (s)   NLPD (lower is better)
SVGP          918.77                1.31
VISH (ours)   41.32                 1.29
Variational Inference with Spherical Harmonics (VISH)

Gist of the method:
- make the inputs $(d+1)$-dimensional by appending a bias
- project the data radially onto $S^d$
- run a fast SVGP on the sphere
- map predictions on $S^d$ back to the original space

The efficiency of VISH comes from using spherical harmonics as inducing functions for the SVGP on the sphere.
From inducing points to inducing features

Inducing points:  $u_m = f(z_m)$,  inverting $K_{uu}$ is $O(M^3)$
VISH:             $u_m = \langle f, \varphi_m \rangle_{\mathcal{H}}$,  inverting $K_{uu}$ is $O(M)$

Orthogonality of the basis functions $\varphi$ leads to a diagonal $K_{uu}$ and an $O(M)$ inversion, as the sketch below illustrates.
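The practical consequence is easy to see numerically. Below is a minimal sketch (plain NumPy, illustrative sizes; not the paper's code) contrasting the cost of factorising a dense $K_{uu}$ with inverting a diagonal one:

```python
import time
import numpy as np

M = 4000
rng = np.random.default_rng(0)

# Dense K_uu, as with inducing points: an O(M^3) Cholesky factorisation.
A = rng.standard_normal((M, M))
K_dense = A @ A.T + M * np.eye(M)  # symmetric positive definite
t0 = time.perf_counter()
np.linalg.cholesky(K_dense)
t_dense = time.perf_counter() - t0

# Diagonal K_uu, as with orthogonal inducing features: an O(M) inversion.
k_diag = rng.uniform(1.0, 2.0, size=M)  # stands in for lambda_m^{-1}
t0 = time.perf_counter()
k_inv = 1.0 / k_diag
t_diag = time.perf_counter() - t0

print(f"dense Cholesky: {t_dense:.3f}s, diagonal inverse: {t_diag:.6f}s")
```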
Deep-dive
Sparse Variational Gaussian processes
Scalable and flexible

Capture the GP by a set of inducing variables $u = f(Z)$, at locations $z_1, \ldots, z_M$.

Minimise the KL divergence from $p(f(\cdot) \mid y)$ to $q(f(\cdot)) = \mathcal{GP}(\mu(\cdot), \nu(\cdot, \cdot'))$:

$\mu(\cdot) = k_u(\cdot)^\top K_{uu}^{-1} m$
$\nu(\cdot, \cdot') = k(\cdot, \cdot') - k_u(\cdot)^\top K_{uu}^{-1} (K_{uu} - S) K_{uu}^{-1} k_u(\cdot')$

where $[K_{uu}]_{m,m'} = \mathrm{Cov}(u_m, u_{m'})$ and $[k_u(\cdot)]_m = \mathrm{Cov}(u_m, f(\cdot))$.

A more flexible (e.g. non-Gaussian likelihoods) and scalable (e.g. mini-batching) model, at a cost of $O(M^3 + M^2 N)$.

Speedup comes from exploiting structure in the $K_{uu}$ matrix (e.g. Hensman et al., 2017, VFF).
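For concreteness, a minimal NumPy sketch of the predictive equations above; the RBF kernel and the random variational parameters $m$, $S$ are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
M, D = 20, 2
Z = rng.standard_normal((M, D))           # inducing locations z_1, ..., z_M
m = rng.standard_normal((M, 1))           # variational mean
L = np.tril(rng.standard_normal((M, M)))  # variational covariance S = L L^T
S = L @ L.T

Xs = rng.standard_normal((5, D))          # test inputs
Kuu = rbf(Z, Z) + 1e-6 * np.eye(M)        # [K_uu]_{m,m'} = Cov(u_m, u_{m'})
Kuf = rbf(Z, Xs)                          # [k_u(x)]_m = Cov(u_m, f(x))
Kff = rbf(Xs, Xs)

Kuu_inv = np.linalg.inv(Kuu)              # the O(M^3) bottleneck
mu = Kuf.T @ Kuu_inv @ m
nu = Kff - Kuf.T @ Kuu_inv @ (Kuu - S) @ Kuu_inv @ Kuf
```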
Outline
- Gaussian processes on the circle and hypersphere
- Spherical harmonics as inducing features
- Linear projection of the data onto the hypersphere
Gaussian processes on the circle

$f = \sum_i \xi_i \varphi_i(\theta)$, with $\xi_i \sim \mathcal{N}(0, \lambda_i)$

$k(\theta_1, \theta_2) = \sum_{i=0}^{\infty} \lambda_i \varphi_i(\theta_1) \varphi_i(\theta_2)$

$\Phi(\theta) = [\cos(i\theta), \sin(i\theta)]_{i=0}^{\infty}$

[Figure: a GP sample on the circle, shown as a function of the angle $\theta$ and embedded in 3D.]
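A minimal sketch of this construction (truncated Fourier basis; the spectrum $\lambda_i$ is an illustrative choice, not a specific kernel from the paper):

```python
import numpy as np

n_freq = 20
theta = np.linspace(-np.pi, np.pi, 400)

# Fourier basis on the circle: [1, cos(theta), sin(theta), cos(2*theta), ...]
feats, lambdas = [np.ones_like(theta)], [1.0]
for i in range(1, n_freq + 1):
    feats += [np.cos(i * theta), np.sin(i * theta)]
    lambdas += [np.exp(-0.1 * i**2)] * 2   # illustrative spectrum lambda_i
Phi = np.stack(feats)                      # (num_basis, num_points)
lam = np.array(lambdas)

# A GP sample: f = sum_i xi_i * phi_i(theta) with xi_i ~ N(0, lambda_i)
rng = np.random.default_rng(1)
xi = rng.standard_normal(lam.shape) * np.sqrt(lam)
f = xi @ Phi

# The implied kernel: k(t1, t2) = sum_i lambda_i * phi_i(t1) * phi_i(t2)
K = (Phi * lam[:, None]).T @ Phi
```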
Spherical Harmonics
- Orthonormal basis on the hypersphere
- Eigenfunctions of the Laplace–Beltrami operator: $\Delta_{S^{d-1}} \varphi_i = \lambda_i \varphi_i$
- Eigenfunctions of zonal kernels
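The orthonormality is easy to check numerically. A sketch using SciPy's spherical harmonics on $S^2$ with a crude grid quadrature (`sph_harm` follows the older SciPy API; very recent releases deprecate it in favour of a renamed variant):

```python
import numpy as np
from scipy.special import sph_harm  # Y_l^m; theta = azimuth, phi = polar angle

n = 400
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)  # azimuthal angle
phi = np.linspace(0, np.pi, n)                        # polar angle
T, P = np.meshgrid(theta, phi)
dA = np.sin(P) * (2 * np.pi / n) * (np.pi / (n - 1))  # surface element on S^2

def inner(l1, m1, l2, m2):
    """Grid-quadrature approximation of <Y_{l1}^{m1}, Y_{l2}^{m2}> on S^2."""
    Y1 = sph_harm(m1, l1, T, P)
    Y2 = sph_harm(m2, l2, T, P)
    return np.sum(Y1 * np.conj(Y2) * dA)

print(inner(2, 0, 2, 0))  # approximately 1: unit norm
print(inner(2, 0, 3, 1))  # approximately 0: orthogonal
```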
Mercer's theorem for zonal kernels on the sphere

Zonal kernels are the spherical counterpart of stationary kernels: $k(x, x') = k'(\mathrm{distance}(x, x'))$, where the distance on the sphere is a function of $x^\top x'$ only.

Mercer's decomposition: any zonal kernel $k$ on the hypersphere can be decomposed as

$k(x, x') = \sum_{i=0}^{\infty} \lambda_i \varphi_i(x) \varphi_i(x')$.

Karhunen–Loève expansion: a GP $f$ on the hypersphere with zonal covariance $k$ can be written $f = \sum_i \xi_i \varphi_i$ with $\xi_i \sim \mathcal{N}(0, \lambda_i)$.
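A quick numerical sanity check of the two views on the circle ($d = 1$), where the Mercer sum and the Karhunen–Loève samples must agree (illustrative spectrum again):

```python
import numpy as np

rng = np.random.default_rng(2)
n_freq, n_samples = 10, 20000
theta = np.linspace(-np.pi, np.pi, 50)

# Truncated basis and spectrum, as in the circle example above.
feats, lambdas = [np.ones_like(theta)], [1.0]
for i in range(1, n_freq + 1):
    feats += [np.cos(i * theta), np.sin(i * theta)]
    lambdas += [np.exp(-0.2 * i)] * 2
Phi, lam = np.stack(feats), np.array(lambdas)

# Mercer kernel vs the empirical covariance of Karhunen-Loeve samples.
K_mercer = (Phi * lam[:, None]).T @ Phi
xi = rng.standard_normal((n_samples, lam.size)) * np.sqrt(lam)
F = xi @ Phi                                 # each row is one GP sample
K_empirical = F.T @ F / n_samples

print(np.abs(K_mercer - K_empirical).max())  # small: the two views agree
```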
Spherical harmonics as inducing features in SVGPs

Define the kernel's RKHS $\mathcal{H}$ with the reproducing inner product $\langle k(x, \cdot), h(\cdot) \rangle_{\mathcal{H}} = h(x)$.

The approximate posterior is constructed out of inducing features $u_m = \langle f, \varphi_m \rangle_{\mathcal{H}}$.

⇒ Diagonal covariance matrix:
$[K_{uu}]_{m,m'} = \mathrm{Cov}(u_m, u_{m'}) = \langle \varphi_m, \varphi_{m'} \rangle_{\mathcal{H}} = \lambda_m^{-1} \delta_{mm'}$

⇒ Spherical harmonics as features:
$[k_u(\cdot)]_m = \mathrm{Cov}(u_m, f(\cdot)) = \varphi_m(\cdot)$

⇒ An $O(M^2 N)$ approximate GP:
$q(f(\cdot)) = \mathcal{GP}\big(\Phi^\top(\cdot)\, m,\; k(\cdot, \cdot') - \Phi^\top(\cdot)(\Lambda - S)\Phi(\cdot')\big)$,
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_M)$ and $\Phi(\cdot) = [\varphi_1(\cdot), \ldots, \varphi_M(\cdot)]$.
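A minimal sketch of these predictive equations on the circle ($d = 1$), where the spherical harmonics reduce to the Fourier basis; the spectrum and the variational parameters are illustrative stand-ins for optimised values:

```python
import numpy as np

rng = np.random.default_rng(3)
n_freq = 10
M = 2 * n_freq + 1                  # number of inducing features

def Phi(theta):
    """Fourier basis on the circle, the d = 1 analogue of spherical harmonics."""
    cols = [np.ones_like(theta)]
    for i in range(1, n_freq + 1):
        cols += [np.cos(i * theta), np.sin(i * theta)]
    return np.stack(cols)           # (M, num_points)

lam = np.concatenate([[1.0], np.repeat(np.exp(-0.2 * np.arange(1, n_freq + 1)), 2)])
Lam = np.diag(lam)

theta_s = np.linspace(-np.pi, np.pi, 100)
P = Phi(theta_s)
K_prior = P.T @ Lam @ P             # Mercer form of the zonal kernel

m = rng.standard_normal((M, 1))     # variational mean (illustrative values)
A = 0.1 * np.tril(rng.standard_normal((M, M)))
S = A @ A.T                         # variational covariance

# Predictive, as on the slide: no matrix inversion appears, only the
# diagonal Lambda, so the cost is O(M^2 N) rather than O(M^3).
mu = P.T @ m
nu = K_prior - P.T @ (Lam - S) @ P
```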
Linear mapping to the hypersphere

Most datasets do not correspond to data on a hypersphere. The proposed solution is to augment the inputs with a constant variable (a bias) before projecting them radially onto the hypersphere.

Although this construction may seem arbitrary, it is used implicitly in the arc-cosine kernel [Cho & Saul, 2009]:

$k(x, x') = \frac{1}{\pi} \underbrace{\|x\| \|x'\|}_{\text{radial}} \underbrace{\left(\sin\theta + (\pi - \theta)\cos\theta\right)}_{\text{angular}}$ with $\theta = \arccos \frac{x^\top x'}{\|x\| \|x'\|}$.
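A small sketch of both ingredients: the bias-augment-and-project map and the order-1 arc-cosine kernel (the bias value 1.0 is an illustrative choice):

```python
import numpy as np

def project_to_sphere(X, bias=1.0):
    """Append a constant bias column, then project each row radially onto S^d."""
    Xb = np.concatenate([X, np.full((X.shape[0], 1), bias)], axis=1)
    return Xb / np.linalg.norm(Xb, axis=1, keepdims=True)

def arccos_kernel(X, Y):
    """Order-1 arc-cosine kernel of Cho & Saul (2009)."""
    nx = np.linalg.norm(X, axis=1, keepdims=True)
    ny = np.linalg.norm(Y, axis=1, keepdims=True)
    cos_t = np.clip((X @ Y.T) / (nx * ny.T), -1.0, 1.0)
    t = np.arccos(cos_t)
    return (nx * ny.T) / np.pi * (np.sin(t) + (np.pi - t) * np.cos(t))

X = np.random.default_rng(4).standard_normal((5, 2))  # 2-d inputs
Xs = project_to_sphere(X)                             # now on S^2 in R^3
K = arccos_kernel(Xs, Xs)                             # zonal on the sphere
```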
Experiment

Airline dataset: a 6,000,000-datapoint regression task, fitted in about 40 seconds on a single cheap GTX 1070 GPU.

Model          Wall-clock time (s)   NLPD (lower is better)
SVGP           918.77                1.31
Additive-VFF   75.61                 1.32
VISH (ours)    41.32                 1.29
Conclusion

Summary of the advantages:
- It is the fastest SVGP model to date ⇒ no need for expensive hardware.
- The natural ordering of spherical harmonics makes our model scale nicely with the input dimension ⇒ it does not suffer from the curse of dimensionality the way VFF does.
- Its similarity to the arc-cosine kernel gives it extrapolation properties similar to those of neural networks.

Reach out to have a chat if you want to know more!