Constraining Gaussian Processes by Variational Fourier Features
Arno Solin, Aalto University
Joint work with Manon Kok (and earlier work with Nicolas Durrande, James Hensman, and Simo Särkkä)
September 12, 2019 · @arnosolin · arno.solin.fi
Outline
◮ Motivation
◮ Model
◮ Low-rank representation
◮ Non-Gaussian likelihoods
◮ Examples
◮ How this relates to SLAM
◮ Conclusion
Constraining Gaussian processes by variational Fourier features · Arno Solin · 2/35
The idea
What?
◮ Gaussian processes (GPs) provide a powerful framework for extrapolation, interpolation, and noise removal in regression and classification
◮ We constrain GPs to arbitrarily-shaped domains with boundary conditions
◮ Applications in, e.g., imaging, spatial analysis, robotics, and general ML tasks
Why is this non-trivial?
GPs provide convenient ways for model specification and inference, but . . .
◮ Issue #1: How to represent this prior?
◮ Issue #2: Limitations in scaling to large data sets
◮ Issue #3: Limitations in dealing with non-Gaussian likelihoods
Hilbert Space Methods for Reduced-Rank GPs
Problem formulation
◮ Gaussian process (GP) regression problem:
    f(x) ∼ GP(0, κ(x, x′)),   y_i = f(x_i) + ε_i.
◮ GP regression has cubic computational complexity O(n³) in the number of measurements.
◮ This results from the inversion of an n × n matrix:
    E[f(x∗)] = κ(x∗, x_{1:n}) (κ(x_{1:n}, x_{1:n}) + σ²_n I)⁻¹ y,
    V[f(x∗)] = κ(x∗, x∗) − κ(x∗, x_{1:n}) (κ(x_{1:n}, x_{1:n}) + σ²_n I)⁻¹ κ(x_{1:n}, x∗).
◮ Various sparse, reduced-rank, and related approximations have been developed for mitigating this problem.
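The posterior equations above fit in a few lines of code; the following is an illustrative sketch (function names, kernel, and hyperparameter values are our own, not from the talk) showing where the O(n³) cost enters — the solve against the n × n matrix.

```python
import numpy as np

def se_kernel(x1, x2, ell=0.2, s2=1.0):
    """Squared-exponential covariance k(x, x') = s2 * exp(-(x - x')^2 / (2 ell^2))."""
    d = x1[:, None] - x2[None, :]
    return s2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_train, y_train, x_star, noise_var=1e-4):
    """Exact GP regression; the n x n linear solve is the O(n^3) bottleneck."""
    n = len(x_train)
    K = se_kernel(x_train, x_train) + noise_var * np.eye(n)
    K_star = se_kernel(x_star, x_train)
    alpha = np.linalg.solve(K, y_train)       # (K + sigma_n^2 I)^{-1} y
    mean = K_star @ alpha
    v = np.linalg.solve(K, K_star.T)          # (K + sigma_n^2 I)^{-1} k(x_{1:n}, x_*)
    var = np.diag(se_kernel(x_star, x_star) - K_star @ v)
    return mean, var
```

With a small noise variance the posterior mean nearly interpolates the training data, which is the behaviour the reduced-rank method later has to reproduce cheaply.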
Covariance operator
◮ For a covariance function κ(x, x′) we can define the covariance operator
    K φ = ∫ κ(·, x′) φ(x′) dx′.
◮ For a stationary covariance function κ(x, x′) ≜ κ(‖r‖), r = x − x′, we get the spectral density
    S(ω) = ∫ κ(r) e^{−i ωᵀ r} dr.
◮ The transfer function corresponding to the operator K is S(ω) = F[K].
◮ The spectral density S(ω) also gives the approximate eigenvalues of the operator K.
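As a concrete instance (our own sketch, not part of the talk): for the 1-D squared-exponential covariance the spectral density has a closed form, and it can be checked against a numerical Fourier transform of κ, since κ is even and the transform reduces to a cosine integral.

```python
import numpy as np

ell, s2 = 1.0, 1.0

def se_cov(r):
    """Stationary SE covariance kappa(r) = s2 * exp(-r^2 / (2 ell^2))."""
    return s2 * np.exp(-0.5 * (r / ell) ** 2)

def se_spectral_density(w):
    """Closed-form 1-D transform: S(w) = s2 * sqrt(2 pi) * ell * exp(-(ell w)^2 / 2)."""
    return s2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * w) ** 2)

# Numerical check of S(w) = \int kappa(r) e^{-i w r} dr  (kappa even -> cosine transform)
r = np.linspace(-20.0, 20.0, 20001)
dr = r[1] - r[0]
w = 1.3
numeric = np.sum(se_cov(r) * np.cos(w * r)) * dr
```

The agreement (up to quadrature error) is the relation the slide uses to read off approximate eigenvalues of K from S(ω).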
Laplacian operator series
◮ In the isotropic case S(ω) ≜ S(‖ω‖), we can expand
    S(‖ω‖) = a₀ + a₁ ‖ω‖² + a₂ (‖ω‖²)² + a₃ (‖ω‖²)³ + · · ·
◮ The Fourier transform of the Laplace operator ∇² is −‖ω‖², i.e.,
    K = a₀ + a₁ (−∇²) + a₂ (−∇²)² + a₃ (−∇²)³ + · · ·
◮ This defines a pseudo-differential operator as a series of differential operators.
◮ Let us now approximate the Laplacian operators with a Hilbert method...
Series expansions of GPs
◮ Assume a covariance function κ(x, x′) and an inner product, say,
    ⟨f, g⟩ = ∫_Ω f(x) g(x) w(x) dx.
◮ The inner product induces a Hilbert space of (random) functions.
◮ If we fix a basis {φ_j(x)}, a Gaussian process f(x) can be expanded into a series
    f(x) = Σ_{j=1}^∞ f_j φ_j(x),
where the f_j are jointly Gaussian.
◮ If we select the φ_j to be the eigenfunctions of κ(x, x′) w.r.t. ⟨·, ·⟩, then this becomes a Karhunen–Loève series.
◮ In the Karhunen–Loève case the coefficients f_j are independent Gaussian.
Hilbert-space approximation of the Laplacian
◮ Consider the eigenvalue problem for the Laplacian operator:
    −∇² φ_j(x) = λ_j² φ_j(x),  x ∈ Ω,
    φ_j(x) = 0,  x ∈ ∂Ω.
◮ The eigenfunctions φ_j(·) are orthonormal w.r.t. the inner product
    ⟨f, g⟩ = ∫_Ω f(x) g(x) dx,  i.e.,  ∫_Ω φ_i(x) φ_j(x) dx = δ_ij.
◮ The negative Laplacian has the formal kernel
    ℓ(x, x′) = Σ_j λ_j² φ_j(x) φ_j(x′)
in the sense that
    −∇² f(x) = ∫ ℓ(x, x′) f(x′) dx′.
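On a simple 1-D domain Ω = [0, L] this eigenvalue problem has the familiar closed-form solution φ_j(x) = √(2/L) sin(π j x / L) with λ_j = π j / L. A small sketch (domain length and tolerances are our own choices) checks the boundary condition and orthonormality numerically:

```python
import numpy as np

L = 2.0  # illustrative domain Omega = [0, L]

def phi(j, x):
    """Dirichlet Laplacian eigenfunctions on [0, L]; they vanish on the boundary."""
    return np.sqrt(2.0 / L) * np.sin(np.pi * j * x / L)

def lam(j):
    """lambda_j, so that -phi_j'' = lambda_j^2 phi_j."""
    return np.pi * j / L

# Orthonormality check: int_0^L phi_i phi_j dx = delta_ij (uniform Riemann sum)
x = np.linspace(0.0, L, 20001)
dx = x[1] - x[0]
ip_11 = np.sum(phi(1, x) * phi(1, x)) * dx   # should be ~1
ip_12 = np.sum(phi(1, x) * phi(2, x)) * dx   # should be ~0
```

The key practical point is that the eigenpairs are fixed by the domain alone; the covariance function only enters later through S(λ_j).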
Approximation of the covariance function
◮ Recall that we have the expansion
    K = a₀ + a₁ (−∇²) + a₂ (−∇²)² + a₃ (−∇²)³ + · · ·
◮ Substituting the formal kernel gives
    κ(x, x′) ≈ a₀ + a₁ ℓ¹(x, x′) + a₂ ℓ²(x, x′) + a₃ ℓ³(x, x′) + · · ·
             = Σ_j (a₀ + a₁ λ_j² + a₂ λ_j⁴ + a₃ λ_j⁶ + · · ·) φ_j(x) φ_j(x′).
◮ Evaluating the spectral density series at ‖ω‖² = λ_j² gives
    S(λ_j) = a₀ + a₁ λ_j² + a₂ λ_j⁴ + a₃ λ_j⁶ + · · ·
◮ This leads to the final approximation
    κ(x, x′) ≈ Σ_j S(λ_j) φ_j(x) φ_j(x′).
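To make the final approximation concrete, here is a sketch under our own assumptions (1-D domain [0, L_b] with the interval eigenpairs and the SE spectral density): away from the boundary the truncated sum closely matches the exact covariance, while on the boundary it is exactly zero.

```python
import numpy as np

Lb, m = 10.0, 64      # approximation domain [0, Lb] and number of basis functions
ell, s2 = 1.0, 1.0    # SE kernel hyperparameters

j = np.arange(1, m + 1)
lam = np.pi * j / Lb  # lambda_j for the Dirichlet Laplacian on [0, Lb]

def S(w):
    """1-D SE spectral density."""
    return s2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * w) ** 2)

def phi(x):
    """Orthonormal eigenfunctions, evaluated for all j at once; note sin(pi j x / Lb) = sin(lam_j x)."""
    return np.sqrt(2.0 / Lb) * np.sin(lam * x)

def kappa_approx(x, xp):
    """kappa(x, x') ~= sum_j S(lam_j) phi_j(x) phi_j(x')."""
    return np.sum(S(lam) * phi(x) * phi(xp))

def kappa_exact(x, xp):
    return s2 * np.exp(-0.5 * ((x - xp) / ell) ** 2)
```

Near ∂Ω the approximation instead decays to zero, which is precisely the boundary-constraining behaviour exploited later in the talk.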
Accuracy of the approximation
[Figure: approximations with m = 12, 32, 64, and 128 basis functions for ν = 1/2, 3/2, 5/2, 7/2 and ν → ∞, compared against the exact covariance, as a function of the distance.]
Approximations to covariance functions of the Matérn class of various degrees of smoothness; ν = 1/2 corresponds to the exponential (Ornstein–Uhlenbeck) covariance function, and ν → ∞ to the squared exponential (exponentiated quadratic) covariance function.
Gaussian processes on a sphere
Easy to apply in simple domains (hyper-spheres, hyper-cubes, . . . )
Reduced-rank method for GP regression
◮ Recall the GP regression problem
    f(x) ∼ GP(0, κ(x, x′)),   y_i = f(x_i) + ε_i.
◮ Let us now approximate
    f(x) ≈ Σ_{j=1}^m f_j φ_j(x),  where f_j ∼ N(0, S(λ_j)).
◮ Via the matrix inversion lemma we then get
    E[f(x∗)] ≈ φ∗ᵀ (Φᵀ Φ + σ²_n Λ⁻¹)⁻¹ Φᵀ y,
    V[f(x∗)] ≈ σ²_n φ∗ᵀ (Φᵀ Φ + σ²_n Λ⁻¹)⁻¹ φ∗.
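Putting the pieces together, a minimal reduced-rank GP regression on an interval might look as follows (the interval domain [0, L_b], SE spectral density, and all hyperparameter values are our own illustrative choices, not from the talk). Note that only an m × m system is solved:

```python
import numpy as np

def reduced_rank_gp(x_train, y_train, x_star, m=16, Lb=5.0, ell=0.5, s2=1.0, noise=1e-2):
    """Hilbert-space reduced-rank GP regression sketch; cost O(n m^2 + m^3)."""
    lam = np.pi * np.arange(1, m + 1) / Lb
    S = s2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * lam) ** 2)  # prior variances S(lam_j)
    Phi = np.sqrt(2.0 / Lb) * np.sin(np.outer(x_train, lam))    # n x m basis matrix
    Phi_s = np.sqrt(2.0 / Lb) * np.sin(np.outer(x_star, lam))   # basis at test inputs
    A = Phi.T @ Phi + noise * np.diag(1.0 / S)                  # m x m system matrix
    mean = Phi_s @ np.linalg.solve(A, Phi.T @ y_train)          # E[f(x*)]
    var = noise * np.sum(Phi_s * np.linalg.solve(A, Phi_s.T).T, axis=1)  # V[f(x*)]
    return mean, var
```

With data kept away from the domain boundary (so the implicit f = 0 constraint at ∂Ω does not bite), the fit is essentially indistinguishable from the full GP.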
Computational complexity
◮ The computation of Φᵀ Φ takes O(nm²) operations.
◮ The covariance function parameters do not enter Φ, so Φᵀ Φ needs to be evaluated only once (convenient in parameter estimation).
◮ The scaling in input dimensionality can be quite bad, but depends on the chosen domain.
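The second point is the practical payoff: since Φ depends only on the inputs, the O(nm²) products can be formed once and reused, so each hyperparameter evaluation afterwards touches only an m × m system. A sketch with an illustrative setup of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
Lb, m = 5.0, 16
lam = np.pi * np.arange(1, m + 1) / Lb

x = rng.uniform(1.0, 4.0, 500)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(500)

# O(n m^2) work, done ONCE: Phi contains no kernel hyperparameters
Phi = np.sqrt(2.0 / Lb) * np.sin(np.outer(x, lam))
PhiT_Phi = Phi.T @ Phi
PhiT_y = Phi.T @ y

def posterior_weights(ell, s2, noise):
    """Per-hyperparameter-evaluation cost is only O(m^3); Phi is never recomputed."""
    S = s2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * lam) ** 2)
    return np.linalg.solve(PhiT_Phi + noise * np.diag(1.0 / S), PhiT_y)

w1 = posterior_weights(0.5, 1.0, 1e-2)
w2 = posterior_weights(0.8, 2.0, 1e-2)  # new hyperparameters, same precomputed factors
```

This is the structure that makes gradient-based hyperparameter learning cheap once n ≫ m.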
Airline delay example
◮ Every commercial flight in the US for 2008 (n ≈ 6M).
◮ Inputs, x: age of the aircraft, route distance, airtime, departure time, arrival time, day of the week, day of the month, and month.
◮ Target, y: delay at landing (in minutes).
◮ Additive model:
    f(x) ∼ GP(0, Σ_{d=1}^8 κ_se(x_d, x′_d)),
    y_i = f(x_i) + ε_i,  ε_i ∼ N(0, σ²_n).
Constraining Gaussian Processes by Variational Fourier Features
The model
In terms of a GP prior and a likelihood, this can be written as
    f(x) ∼ GP(0, κ(x, x′)),  x ∈ Ω,  s.t. f(x) = 0, x ∈ ∂Ω,
    y | f ∼ Π_{i=1}^n p(y_i | f(x_i)),
where (x_i, y_i) are the n input–output pairs.
Why is this non-trivial?
GPs provide convenient ways for model specification and inference, but . . .
◮ Issue #1: How to represent this prior?
◮ Issue #2: Limitations in scaling to large data sets
◮ Issue #3: Limitations in dealing with non-Gaussian likelihoods
Addressing the three issues
◮ As a pre-processing step, we solve a Fourier-like generalised harmonic feature representation of the GP prior in the domain of interest
◮ This both constrains the GP and attains a low-rank representation that is used for speeding up inference
◮ The method scales as O(nm²) in prediction and O(m³) in hyperparameter learning (n = number of data, m = number of features)
◮ A variational approach allows the method to deal with non-Gaussian likelihoods