Constraining Gaussian Processes by Variational Fourier Features
Arno Solin, Aalto University
Joint work with Manon Kok (and earlier work with Nicolas Durrande, James Hensman, and Simo Särkkä)
September 12, 2019 · @arnosolin · arno.solin.fi
Outline
◮ Motivation
◮ Model
◮ Low-rank representation
◮ Non-Gaussian likelihoods
◮ Examples
◮ How this relates to SLAM
◮ Conclusion
Constraining Gaussian processes by variational Fourier features · Arno Solin · 2/35
The idea
What?
◮ Gaussian processes (GPs) provide a powerful framework for extrapolation, interpolation, and noise removal in regression and classification
◮ We constrain GPs to arbitrarily-shaped domains with boundary conditions
◮ Applications in, e.g., imaging, spatial analysis, robotics, and general ML tasks
Why is this non-trivial?
GPs provide convenient ways for model specification and inference, but . . .
◮ Issue #1: How to represent this prior?
◮ Issue #2: Limitations in scaling to large data sets
◮ Issue #3: Limitations in dealing with non-Gaussian likelihoods
Hilbert Space Methods for Reduced-Rank GPs
Problem formulation
◮ Gaussian process (GP) regression problem:
    f(x) ∼ GP(0, κ(x, x′)),   y_i = f(x_i) + ε_i.
◮ GP regression has cubic computational complexity O(n³) in the number of measurements.
◮ This results from the inversion of an n × n matrix:
    E[f(x∗)] = κ(x∗, x_{1:n}) (κ(x_{1:n}, x_{1:n}) + σ²_n I)⁻¹ y,
    V[f(x∗)] = κ(x∗, x∗) − κ(x∗, x_{1:n}) (κ(x_{1:n}, x_{1:n}) + σ²_n I)⁻¹ κ(x_{1:n}, x∗).
◮ Various sparse, reduced-rank, and related approximations have been developed for mitigating this problem.
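The posterior equations above fit in a few lines of code; the following is an illustrative sketch (function names, kernel, and hyperparameter values are our own, not from the talk) showing where the O(n³) cost enters — the solve against the n × n matrix.

```python
import numpy as np

def se_kernel(x1, x2, ell=0.2, s2=1.0):
    """Squared-exponential covariance k(x, x') = s2 * exp(-(x - x')^2 / (2 ell^2))."""
    d = x1[:, None] - x2[None, :]
    return s2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_train, y_train, x_star, noise_var=1e-4):
    """Exact GP regression; the n x n linear solve is the O(n^3) bottleneck."""
    n = len(x_train)
    K = se_kernel(x_train, x_train) + noise_var * np.eye(n)
    K_star = se_kernel(x_star, x_train)
    alpha = np.linalg.solve(K, y_train)       # (K + sigma_n^2 I)^{-1} y
    mean = K_star @ alpha
    v = np.linalg.solve(K, K_star.T)          # (K + sigma_n^2 I)^{-1} k(x_{1:n}, x_*)
    var = np.diag(se_kernel(x_star, x_star) - K_star @ v)
    return mean, var
```

With a small noise variance the posterior mean nearly interpolates the training data, which is the behaviour the reduced-rank method later has to reproduce cheaply.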
Covariance operator
◮ For a covariance function κ(x, x′) we can define the covariance operator
    K φ = ∫ κ(·, x′) φ(x′) dx′.
◮ For a stationary covariance function κ(x, x′) ≜ κ(‖r‖), r = x − x′, we get the spectral density
    S(ω) = ∫ κ(r) e^{−i ωᵀ r} dr.
◮ The transfer function corresponding to the operator K is S(ω) = F[K].
◮ The spectral density S(ω) also gives the approximate eigenvalues of the operator K.
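As a concrete instance (our own sketch, not part of the talk): for the 1-D squared-exponential covariance the spectral density has a closed form, and it can be checked against a numerical Fourier transform of κ, since κ is even and the transform reduces to a cosine integral.

```python
import numpy as np

ell, s2 = 1.0, 1.0

def se_cov(r):
    """Stationary SE covariance kappa(r) = s2 * exp(-r^2 / (2 ell^2))."""
    return s2 * np.exp(-0.5 * (r / ell) ** 2)

def se_spectral_density(w):
    """Closed-form 1-D transform: S(w) = s2 * sqrt(2 pi) * ell * exp(-(ell w)^2 / 2)."""
    return s2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * w) ** 2)

# Numerical check of S(w) = \int kappa(r) e^{-i w r} dr  (kappa even -> cosine transform)
r = np.linspace(-20.0, 20.0, 20001)
dr = r[1] - r[0]
w = 1.3
numeric = np.sum(se_cov(r) * np.cos(w * r)) * dr
```

The agreement (up to quadrature error) is the relation the slide uses to read off approximate eigenvalues of K from S(ω).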
Laplacian operator series
◮ In the isotropic case S(ω) ≜ S(‖ω‖), we can expand
    S(‖ω‖) = a₀ + a₁ ‖ω‖² + a₂ (‖ω‖²)² + a₃ (‖ω‖²)³ + · · ·
◮ The Fourier transform of the Laplace operator ∇² is −‖ω‖², i.e.,
    K = a₀ + a₁ (−∇²) + a₂ (−∇²)² + a₃ (−∇²)³ + · · ·
◮ This defines a pseudo-differential operator as a series of differential operators.
◮ Let us now approximate the Laplacian operators with a Hilbert method...
Series expansions of GPs
◮ Assume a covariance function κ(x, x′) and an inner product, say,
    ⟨f, g⟩ = ∫_Ω f(x) g(x) w(x) dx.
◮ The inner product induces a Hilbert space of (random) functions.
◮ If we fix a basis {φ_j(x)}, a Gaussian process f(x) can be expanded into a series
    f(x) = Σ_{j=1}^∞ f_j φ_j(x),
where the f_j are jointly Gaussian.
◮ If we select the φ_j to be the eigenfunctions of κ(x, x′) w.r.t. ⟨·, ·⟩, then this becomes a Karhunen–Loève series.
◮ In the Karhunen–Loève case the coefficients f_j are independent Gaussian.
Hilbert-space approximation of the Laplacian
◮ Consider the eigenvalue problem for the Laplacian operator:
    −∇² φ_j(x) = λ_j² φ_j(x),  x ∈ Ω,
    φ_j(x) = 0,  x ∈ ∂Ω.
◮ The eigenfunctions φ_j(·) are orthonormal w.r.t. the inner product
    ⟨f, g⟩ = ∫_Ω f(x) g(x) dx,  i.e.,  ∫_Ω φ_i(x) φ_j(x) dx = δ_ij.
◮ The negative Laplacian has the formal kernel
    ℓ(x, x′) = Σ_j λ_j² φ_j(x) φ_j(x′)
in the sense that
    −∇² f(x) = ∫ ℓ(x, x′) f(x′) dx′.
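On a simple 1-D domain Ω = [0, L] this eigenvalue problem has the familiar closed-form solution φ_j(x) = √(2/L) sin(π j x / L) with λ_j = π j / L. A small sketch (domain length and tolerances are our own choices) checks the boundary condition and orthonormality numerically:

```python
import numpy as np

L = 2.0  # illustrative domain Omega = [0, L]

def phi(j, x):
    """Dirichlet Laplacian eigenfunctions on [0, L]; they vanish on the boundary."""
    return np.sqrt(2.0 / L) * np.sin(np.pi * j * x / L)

def lam(j):
    """lambda_j, so that -phi_j'' = lambda_j^2 phi_j."""
    return np.pi * j / L

# Orthonormality check: int_0^L phi_i phi_j dx = delta_ij (uniform Riemann sum)
x = np.linspace(0.0, L, 20001)
dx = x[1] - x[0]
ip_11 = np.sum(phi(1, x) * phi(1, x)) * dx   # should be ~1
ip_12 = np.sum(phi(1, x) * phi(2, x)) * dx   # should be ~0
```

The key practical point is that the eigenpairs are fixed by the domain alone; the covariance function only enters later through S(λ_j).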
Approximation of the covariance function
◮ Recall that we have the expansion
    K = a₀ + a₁ (−∇²) + a₂ (−∇²)² + a₃ (−∇²)³ + · · ·
◮ Substituting the formal kernel gives
    κ(x, x′) ≈ a₀ + a₁ ℓ¹(x, x′) + a₂ ℓ²(x, x′) + a₃ ℓ³(x, x′) + · · ·
             = Σ_j (a₀ + a₁ λ_j² + a₂ λ_j⁴ + a₃ λ_j⁶ + · · ·) φ_j(x) φ_j(x′).
◮ Evaluating the spectral density series at ‖ω‖² = λ_j² gives
    S(λ_j) = a₀ + a₁ λ_j² + a₂ λ_j⁴ + a₃ λ_j⁶ + · · ·
◮ This leads to the final approximation
    κ(x, x′) ≈ Σ_j S(λ_j) φ_j(x) φ_j(x′).
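To make the final approximation concrete, here is a sketch under our own assumptions (1-D domain [0, L_b] with the interval eigenpairs and the SE spectral density): away from the boundary the truncated sum closely matches the exact covariance, while on the boundary it is exactly zero.

```python
import numpy as np

Lb, m = 10.0, 64      # approximation domain [0, Lb] and number of basis functions
ell, s2 = 1.0, 1.0    # SE kernel hyperparameters

j = np.arange(1, m + 1)
lam = np.pi * j / Lb  # lambda_j for the Dirichlet Laplacian on [0, Lb]

def S(w):
    """1-D SE spectral density."""
    return s2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * w) ** 2)

def phi(x):
    """Orthonormal eigenfunctions, evaluated for all j at once; note sin(pi j x / Lb) = sin(lam_j x)."""
    return np.sqrt(2.0 / Lb) * np.sin(lam * x)

def kappa_approx(x, xp):
    """kappa(x, x') ~= sum_j S(lam_j) phi_j(x) phi_j(x')."""
    return np.sum(S(lam) * phi(x) * phi(xp))

def kappa_exact(x, xp):
    return s2 * np.exp(-0.5 * ((x - xp) / ell) ** 2)
```

Near ∂Ω the approximation instead decays to zero, which is precisely the boundary-constraining behaviour exploited later in the talk.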
Accuracy of the approximation
[Figure: approximations with m = 12, 32, 64, and 128 basis functions for ν = 1/2, 3/2, 5/2, 7/2 and ν → ∞, compared against the exact covariance, as a function of the distance.]
Approximations to covariance functions of the Matérn class of various degrees of smoothness; ν = 1/2 corresponds to the exponential (Ornstein–Uhlenbeck) covariance function, and ν → ∞ to the squared exponential (exponentiated quadratic) covariance function.
Gaussian processes on a sphere
Easy to apply in simple domains (hyper-spheres, hyper-cubes, . . . )
Reduced-rank method for GP regression
◮ Recall the GP regression problem
    f(x) ∼ GP(0, κ(x, x′)),   y_i = f(x_i) + ε_i.
◮ Let us now approximate
    f(x) ≈ Σ_{j=1}^m f_j φ_j(x),  where f_j ∼ N(0, S(λ_j)).
◮ Via the matrix inversion lemma we then get
    E[f(x∗)] ≈ φ∗ᵀ (Φᵀ Φ + σ²_n Λ⁻¹)⁻¹ Φᵀ y,
    V[f(x∗)] ≈ σ²_n φ∗ᵀ (Φᵀ Φ + σ²_n Λ⁻¹)⁻¹ φ∗.
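Putting the pieces together, a minimal reduced-rank GP regression on an interval might look as follows (the interval domain [0, L_b], SE spectral density, and all hyperparameter values are our own illustrative choices, not from the talk). Note that only an m × m system is solved:

```python
import numpy as np

def reduced_rank_gp(x_train, y_train, x_star, m=16, Lb=5.0, ell=0.5, s2=1.0, noise=1e-2):
    """Hilbert-space reduced-rank GP regression sketch; cost O(n m^2 + m^3)."""
    lam = np.pi * np.arange(1, m + 1) / Lb
    S = s2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * lam) ** 2)  # prior variances S(lam_j)
    Phi = np.sqrt(2.0 / Lb) * np.sin(np.outer(x_train, lam))    # n x m basis matrix
    Phi_s = np.sqrt(2.0 / Lb) * np.sin(np.outer(x_star, lam))   # basis at test inputs
    A = Phi.T @ Phi + noise * np.diag(1.0 / S)                  # m x m system matrix
    mean = Phi_s @ np.linalg.solve(A, Phi.T @ y_train)          # E[f(x*)]
    var = noise * np.sum(Phi_s * np.linalg.solve(A, Phi_s.T).T, axis=1)  # V[f(x*)]
    return mean, var
```

With data kept away from the domain boundary (so the implicit f = 0 constraint at ∂Ω does not bite), the fit is essentially indistinguishable from the full GP.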
Computational complexity
◮ The computation of Φᵀ Φ takes O(nm²) operations.
◮ The covariance function parameters do not enter Φ, so Φᵀ Φ needs to be evaluated only once (convenient in parameter estimation).
◮ The scaling in input dimensionality can be quite bad, but depends on the chosen domain.
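The second point is the practical payoff: since Φ depends only on the inputs, the O(nm²) products can be formed once and reused, so each hyperparameter evaluation afterwards touches only an m × m system. A sketch with an illustrative setup of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
Lb, m = 5.0, 16
lam = np.pi * np.arange(1, m + 1) / Lb

x = rng.uniform(1.0, 4.0, 500)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(500)

# O(n m^2) work, done ONCE: Phi contains no kernel hyperparameters
Phi = np.sqrt(2.0 / Lb) * np.sin(np.outer(x, lam))
PhiT_Phi = Phi.T @ Phi
PhiT_y = Phi.T @ y

def posterior_weights(ell, s2, noise):
    """Per-hyperparameter-evaluation cost is only O(m^3); Phi is never recomputed."""
    S = s2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * lam) ** 2)
    return np.linalg.solve(PhiT_Phi + noise * np.diag(1.0 / S), PhiT_y)

w1 = posterior_weights(0.5, 1.0, 1e-2)
w2 = posterior_weights(0.8, 2.0, 1e-2)  # new hyperparameters, same precomputed factors
```

This is the structure that makes gradient-based hyperparameter learning cheap once n ≫ m.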
Airline delay example
◮ Every commercial flight in the US for 2008 (n ≈ 6M).
◮ Inputs, x: age of the aircraft, route distance, airtime, departure time, arrival time, day of the week, day of the month, and month.
◮ Target, y: delay at landing (in minutes).
◮ Additive model:
    f(x) ∼ GP(0, Σ_{d=1}^8 κ_se(x_d, x′_d)),
    y_i = f(x_i) + ε_i,  ε_i ∼ N(0, σ²_n).
Constraining Gaussian Processes by Variational Fourier Features
The model
In terms of a GP prior and a likelihood, this can be written as
    f(x) ∼ GP(0, κ(x, x′)),  x ∈ Ω,  s.t. f(x) = 0, x ∈ ∂Ω,
    y | f ∼ Π_{i=1}^n p(y_i | f(x_i)),
where (x_i, y_i) are the n input–output pairs.
Why is this non-trivial?
GPs provide convenient ways for model specification and inference, but . . .
◮ Issue #1: How to represent this prior?
◮ Issue #2: Limitations in scaling to large data sets
◮ Issue #3: Limitations in dealing with non-Gaussian likelihoods
Addressing the three issues
◮ As a pre-processing step, we solve a Fourier-like generalised harmonic feature representation of the GP prior in the domain of interest
◮ This both constrains the GP and attains a low-rank representation that is used for speeding up inference
◮ The method scales as O(nm²) in prediction and O(m³) in hyperparameter learning (n = number of data, m = number of features)
◮ A variational approach allows the method to deal with non-Gaussian likelihoods