Doubly Stochastic Inference for Deep Gaussian Processes
Hugh Salimbeni
Department of Computing, Imperial College London
29/5/2017
Motivation
§ DGPs promise much, but are difficult to train
§ Fully factorized VI doesn’t work well
§ We seek a variational approach that works and scales

Other recently proposed schemes [1, 2, 5] make additional approximations and require more machinery than VI.
Talk outline
1. Summary: Model, Inference, Results
2. Details: Model, Inference, Results
3. Questions
Model
We use the standard DGP model, with one addition:
§ We include a linear (identity) mean function for all the internal layers (1D example in [4])
Inference
§ We use the model conditioned on the inducing points as a conditional variational posterior
§ We impose Gaussians on the inducing points (independent between layers, but full rank within layers)
§ We use sampling to deal with the intractable expectation

We never compute N × N matrices, and we make no additional simplifications to the variational posterior; the resulting bound is sketched below.
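For orientation, a sketch of the bound this posterior yields (this is the standard sparse-VI form; the KL term decomposes over layers because the Gaussians on the inducing points are independent between layers):

$$
\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i^L)}\big[\log p(y_i \,|\, f_i^L)\big] \;-\; \sum_{l=1}^{L} \mathrm{KL}\big[\,\mathcal{N}(\mathbf{u}^l \,|\, \mathbf{m}^l, \mathbf{S}^l)\,\big\|\,p(\mathbf{u}^l; Z^{l-1})\,\big]
$$

The expectation is the intractable part: it is estimated with Monte Carlo samples drawn through the layers, and the sum over data points is subsampled in minibatches, hence ‘doubly’ stochastic.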
Results
§ We show significant improvement over single-layer models on large ($\sim 10^6$) and massive ($\sim 10^9$) data
§ Big jump in improvement over a single-layer GP, even one with $5\times$ the number of inducing points
§ On small data we never do worse than the single-layer model, and often better
§ We get 98.1% on MNIST with only 100 inducing points
§ We surpass all permutation-invariant methods on rectangles-images (a dataset designed to test deep vs shallow architectures)
§ Identical model/inference hyperparameters for all our models
Details: The Model
We use the standard DGP model, with a linear mean function for all the internal layers:
§ If the dimensions agree, use the identity; otherwise PCA (see the sketch below)
§ Sensible alternative: initialize the latents to the identity (but the linear mean function works better)
§ Not-so-sensible alternative: random initialization. Doesn’t work well (the posterior is very multimodal)
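A minimal sketch of how such a mean function might be built, assuming NumPy and a PCA-via-SVD construction (the helper name `make_linear_mean` and the details of the projection are illustrative assumptions, not the paper's code):

```python
import numpy as np

def make_linear_mean(X, dim_in, dim_out):
    """Linear mean function for an internal DGP layer.

    Identity when the layer preserves dimension; otherwise a fixed
    linear map W whose columns are the top principal directions of
    the layer inputs X (assumes dim_out <= dim_in).
    Returns a callable mapping (N, dim_in) arrays to (N, dim_out).
    """
    if dim_in == dim_out:
        W = np.eye(dim_in)
    else:
        Xc = X - X.mean(axis=0)                       # centre for PCA
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:dim_out].T                            # (dim_in, dim_out)
    return lambda H: H @ W
```

Here W is held fixed; the mean function only shifts the prior, so each layer's GP models the residual around this linear map.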
The DGP: Graphical Model
[Figure: graphical model of a three-layer DGP. Inputs X (with inducing inputs Z^0) enter layer 1; each layer l has function values f^l and inducing outputs u^l; noise ε gives the hidden outputs h^l, which, with inducing inputs Z^l, feed layer l+1; the final layer's f^3 produces y.]
The DGP: Density

$$
p\big(\mathbf{y}, \{\mathbf{h}^l, \mathbf{f}^l, \mathbf{u}^l\}_{l=1}^{L}\big) \;=\; \underbrace{\prod_{i=1}^{N} p\big(y_i \,\big|\, f_i^L\big)}_{\text{likelihood}} \;\times\; \underbrace{\prod_{l=1}^{L} p\big(\mathbf{h}^l \,\big|\, \mathbf{f}^l\big)\, p\big(\mathbf{f}^l \,\big|\, \mathbf{u}^l; \mathbf{h}^{l-1}, Z^{l-1}\big)\, p\big(\mathbf{u}^l; Z^{l-1}\big)}_{\text{DGP prior}}
$$

(with $\mathbf{h}^0 = X$)
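To make the factorization concrete, a hedged sketch of ancestral sampling from this prior (RBF kernel, identity mean function, equal layer widths, and all parameter values are assumptions for illustration; the inducing outputs u^l are implicitly marginalised):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix k(A, B)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_dgp_prior(X, n_layers=3, noise_var=1e-2, jitter=1e-6, seed=0):
    """One ancestral sample of the DGP prior: at each layer draw
    f^l ~ GP(identity mean, k) at the previous layer's outputs,
    then add noise between layers, h^l = f^l + eps."""
    rng = np.random.default_rng(seed)
    h = X
    for l in range(n_layers):
        K = rbf(h, h) + jitter * np.eye(len(h))
        L = np.linalg.cholesky(K)
        f = h + L @ rng.standard_normal(h.shape)   # identity mean function
        if l == n_layers - 1:
            return f                               # f^L feeds the likelihood
        h = f + np.sqrt(noise_var) * rng.standard_normal(f.shape)

# e.g. f_L = sample_dgp_prior(np.linspace(-1, 1, 50)[:, None])
```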
Factorised Variational Posterior
[Figure: the same graphical model under a fully factorized posterior: Gaussians $\mathcal{N}(\mathbf{u}^l \,|\, \mathbf{m}^l, \mathbf{S}^l)$ on the inducing outputs and independent Gaussians $\prod_i \mathcal{N}(h_i^l \,|\, \mu_i^l, (\sigma_i^l)^2)$ on the hidden outputs, which severs the coupling between layers.]
Our Variational Posterior
[Figure: the same graphical model, but the posterior keeps the model’s conditional structure: Gaussians $\mathcal{N}(\mathbf{u}^l \,|\, \mathbf{m}^l, \mathbf{S}^l)$ on the inducing outputs only, with $\mathbf{f}^l$ and $\mathbf{h}^l$ (including the between-layer noise ε) drawn from the model conditioned on $\mathbf{u}^l$.]
Recap: ‘GPs for Big Data’ [3]

$$q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \,|\, \mathbf{u}; X, Z)\, \mathcal{N}(\mathbf{u} \,|\, \mathbf{m}, \mathbf{S})$$

Marginalise $\mathbf{u}$ from the variational posterior:

$$\int p(\mathbf{f} \,|\, \mathbf{u}; X, Z)\, \mathcal{N}(\mathbf{u} \,|\, \mathbf{m}, \mathbf{S})\, d\mathbf{u} \;=\; \mathcal{N}(\mathbf{f} \,|\, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \;=:\; q(\mathbf{f} \,|\, \mathbf{m}, \mathbf{S}; X, Z) \quad (1)$$

Define the following mean and covariance functions:

$$\mu_{\mathbf{m}, Z}(x_i) = m(x_i) + \alpha(x_i)^T \big(\mathbf{m} - m(Z)\big),$$
$$\Sigma_{\mathbf{S}, Z}(x_i, x_j) = k(x_i, x_j) - \alpha(x_i)^T \big(k(Z, Z) - \mathbf{S}\big)\, \alpha(x_j),$$

where $\alpha(x_i) = k(Z, Z)^{-1} k(Z, x_i)$.
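A minimal NumPy sketch of these two functions, and of how posterior samples are then propagated through a DGP's layers (the RBF kernel, zero prior mean, 1-D hidden layers, and noise-free propagation are simplifying assumptions; note that only per-point marginals are needed, so no N × N matrix is ever formed):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def q_marginals(Xb, Z, m, S, jitter=1e-6):
    """Per-point mean and variance of q(f | m, S; Xb, Z), i.e. the
    diagonal of eq. (1), with prior mean m(.) = 0:
      mu(x_i)     = alpha(x_i)^T m
      sigma2(x_i) = k(x_i, x_i) - alpha(x_i)^T (k(Z,Z) - S) alpha(x_i)
    where alpha(x_i) = k(Z,Z)^{-1} k(Z, x_i)."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kzx = rbf(Z, Xb)                       # (M, Nb)
    alpha = np.linalg.solve(Kzz, Kzx)      # columns are alpha(x_i)
    mu = alpha.T @ m                       # (Nb, 1)
    kxx = np.ones(len(Xb))                 # k(x, x) = variance = 1 here
    var = kxx - np.einsum('mn,mk,kn->n', alpha, Kzz - S, alpha)
    return mu, var[:, None]

def propagate(Xb, layers, rng):
    """One doubly stochastic sample through the layers, using the
    reparameterization trick: h^l = mu^l + sqrt(var^l) * eps."""
    h = Xb
    for Z, m, S in layers:                 # (Z^{l-1}, m^l, S^l) per layer
        mu, var = q_marginals(h, Z, m, S)
        h = mu + np.sqrt(np.maximum(var, 0.0)) * rng.standard_normal(mu.shape)
    return h                               # samples of f^L for the likelihood

# e.g. rng = np.random.default_rng(0)
#      layers = [(Z0, m1, S1), (Z1, m2, S2)]   # Z1 one-dimensional here
#      fL = propagate(X_batch, layers, rng)
```

Because each data point's sample depends only on its own marginal, the estimator minibatches over data for free.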