Deep Gaussian Processes with Importance-Weighted Variational Inference
Hugh Salimbeni, Vincent Dutordoir, James Hensman, Marc P. Deisenroth
Problem setting
• Bimodal density
• Changes with input
• Skewness (e.g. bus arrival times, confounding variables)
A possible approach
A neural network f_φ with a per-point latent variable w_n, concatenated with the inputs:
    y_n = N(f_φ([x_n, w_n]), σ²),    w_n ∼ N(0, 1)
[Plate diagram: x_n and w_n feed into f_φ, producing y_n; plate over the N data points. Plots: training data and test samples.]
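To make the generative model concrete, here is a minimal sketch (not the authors' code) of sampling y given x under this approach; the two-layer tanh MLP standing in for f_φ and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative MLP weights for f_phi (in practice phi is learned).
D_in, D_hidden = 1, 32
W1 = rng.standard_normal((D_in + 1, D_hidden))   # +1 input for w_n
b1 = np.zeros(D_hidden)
W2 = rng.standard_normal((D_hidden, 1))
sigma = 0.1                                      # observation noise

def f_phi(xw):
    """Deterministic network applied to the concatenated input [x_n, w_n]."""
    return np.tanh(xw @ W1 + b1) @ W2

def sample_y(x):
    """One draw from p(y | x): sample w_n ~ N(0, 1), then add Gaussian noise."""
    w = rng.standard_normal((x.shape[0], 1))     # per-point latent variable
    xw = np.concatenate([x, w], axis=1)          # concatenation with inputs
    return f_phi(xw) + sigma * rng.standard_normal((x.shape[0], 1))

x = np.linspace(-2, 2, 5).reshape(-1, 1)
print(sample_y(x).ravel())   # repeated calls trace out the density p(y | x)
```

Because w_n is resampled on every call, repeated draws at the same x trace out a full conditional density, which is how the model can capture bimodality and skewness.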
A possible approach: drawbacks
• Overfitting: only a small number of examples per input x_n
• Unreliable extrapolation: f_φ is a deterministic function
Another possible approach
Replace the neural network with a non-parametric prior (a Gaussian process):
    y_n = N(f([x_n, w_n]), σ²),    w_n ∼ N(0, 1),    f ∼ GP(µ, k)
• Better extrapolation
• But: underfitting
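A minimal sketch of one joint draw from this model, assuming a zero mean function and an RBF kernel for k; the lengthscale, noise level, and jitter are illustrative choices:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)

def rbf(A, B, lengthscale=1.0):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2))."""
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean") / lengthscale**2)

N = 50
x = np.linspace(-2, 2, N).reshape(-1, 1)
w = rng.standard_normal((N, 1))            # w_n ~ N(0, 1)
xw = np.concatenate([x, w], axis=1)        # GP input is the concatenation

K = rbf(xw, xw) + 1e-8 * np.eye(N)         # jitter for numerical stability
f = np.linalg.cholesky(K) @ rng.standard_normal(N)   # f(xw) ~ N(0, K)
y = f + 0.1 * rng.standard_normal(N)       # y_n = N(f([x_n, w_n]), sigma^2)
```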
Our model
A deep Gaussian process with the latent variable concatenated at the input:
    y_n = N(f(g([x_n, w_n])), σ²),    w_n ∼ N(0, 1)
    g ∼ GP(µ₂, k₂),    f ∼ GP(µ₁, k₁)
• Extrapolates gracefully
• Better data fit
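The same sampling sketch extends to the two-layer model: draw g at the concatenated inputs, then draw f at the resulting hidden values. Again the zero means, RBF kernels, and hyperparameters are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)

def rbf(A, B, lengthscale=1.0):
    """RBF kernel, as in the previous sketch."""
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean") / lengthscale**2)

N = 50
jitter = 1e-8 * np.eye(N)
x = np.linspace(-2, 2, N).reshape(-1, 1)
w = rng.standard_normal((N, 1))                   # w_n ~ N(0, 1)
xw = np.concatenate([x, w], axis=1)

# Inner layer: g ~ GP(0, k2), evaluated jointly at [x_n, w_n].
g = np.linalg.cholesky(rbf(xw, xw) + jitter) @ rng.standard_normal(N)

# Outer layer: f ~ GP(0, k1), evaluated at the hidden values g([x_n, w_n]).
h = g.reshape(-1, 1)
f = np.linalg.cholesky(rbf(h, h) + jitter) @ rng.standard_normal(N)

y = f + 0.1 * rng.standard_normal(N)              # observation noise
```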
Contributions
• New architecture: latent variables enter by concatenation, not addition (contrasted in the sketch below)
• Importance-weighted variational inference, exploiting analytic results
• An extensive empirical comparison on all 41 UCI regression datasets
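To make the first bullet concrete, here is an illustrative contrast between two ways a per-point latent variable can enter a layer; the tiny stand-in layer is hypothetical, and the addition variant shown is one common reading of prior latent-variable DGPs:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))     # inputs x_n
w = rng.standard_normal((4, 1))     # latent variables w_n ~ N(0, 1)

def layer(z):
    """Hypothetical stand-in for the first GP layer g."""
    return np.tanh(z @ np.ones((z.shape[1], 3)))

h_concat = layer(np.concatenate([x, w], axis=1))  # this paper: g([x_n, w_n])
h_add = layer(x) + w                              # addition: g(x_n) + w_n
```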
A few details
    y_n = N(f(g([x_n, w_n])), σ²)
• w_n ∼ N(0, 1): importance weighting, with a Gaussian proposal
• f ∼ GP(µ₁, k₁), g ∼ GP(µ₂, k₂): variational inference, with a sparse GP posterior
Our approach exploits analytic results, leading to a tighter bound.
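A minimal sketch of the K-sample importance-weighted bound on log p(y_n | x_n) with a Gaussian proposal q(w) = N(µ, s²). Here a toy log-joint stands in for the GP layers (which the actual method handles with sparse variational inference and analytic results); all specifics are illustrative:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(3)

def log_joint(y, w):
    """Toy log p(y, w | x) = log N(y; f(w), 0.1^2) + log N(w; 0, 1)."""
    f = np.sin(3 * w)                      # stand-in for the DGP layers
    return norm.logpdf(y, f, 0.1) + norm.logpdf(w, 0.0, 1.0)

def iw_bound(y, mu, s, K=50):
    """One Monte Carlo estimate of E_q[log (1/K) sum_k p(y, w_k)/q(w_k)]."""
    w = mu + s * rng.standard_normal(K)    # K samples from the proposal q
    log_ratio = log_joint(y, w) - norm.logpdf(w, mu, s)
    return logsumexp(log_ratio) - np.log(K)

y = 0.5
print(iw_bound(y, 0.0, 1.0, K=1))    # K = 1 recovers the standard ELBO term
print(iw_bound(y, 0.0, 1.0, K=100))  # larger K gives a tighter bound
```

With K = 1 this is the usual single-sample ELBO estimator; increasing K tightens the bound, which is the sense in which importance-weighted VI improves on plain VI.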
Results
• Latent variables in the DGP are highly beneficial
• Sometimes depth is enough; sometimes latent variables are enough; some datasets need both
• Importance-weighted VI outperforms VI
Thanks for listening! Poster #218
• New architecture • Importance-weighted VI • 41 datasets