Doubly Stochastic Inference for Deep Gaussian Processes
Hugh Salimbeni
Department of Computing, Imperial College London
29/5/2017
Motivation
§ DGPs promise much, but are difficult to train
§ Fully factorized VI doesn’t work well
§ We seek a variational approach that works and scales

Other recently proposed schemes [1, 2, 5] make additional approximations and require more machinery than VI.
Talk outline
1. Summary: Model, Inference, Results
2. Details: Model, Inference, Results
3. Questions
Model
We use the standard DGP model, with one addition:
§ We include a linear (identity) mean function for all the internal layers (1D example in [4])
Inference
§ We use the model conditioned on the inducing points as a conditional variational posterior
§ We impose Gaussians on the inducing points (independent between layers, but full rank within layers)
§ We use sampling to deal with the intractable expectation

We never compute N × N matrices, and we make no additional simplifications to the variational posterior; the resulting bound is sketched below.
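For orientation, a sketch of the bound this posterior yields (this is the standard sparse-VI form; the KL term decomposes over layers because the Gaussians on the inducing points are independent between layers):

$$
\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i^L)}\big[\log p(y_i \,|\, f_i^L)\big] \;-\; \sum_{l=1}^{L} \mathrm{KL}\big[\,\mathcal{N}(\mathbf{u}^l \,|\, \mathbf{m}^l, \mathbf{S}^l)\,\big\|\,p(\mathbf{u}^l; Z^{l-1})\,\big]
$$

The expectation is the intractable part: it is estimated with Monte Carlo samples drawn through the layers, and the sum over data points is subsampled in minibatches, hence ‘doubly’ stochastic.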
Results
§ We show significant improvement over single-layer models on large ($\sim 10^6$) and massive ($\sim 10^9$) data
§ Big jump in improvement over a single-layer GP, even one with $5\times$ the number of inducing points
§ On small data we never do worse than the single-layer model, and often better
§ We get 98.1% on MNIST with only 100 inducing points
§ We surpass all permutation-invariant methods on rectangles-images (a dataset designed to test deep vs shallow architectures)
§ Identical model/inference hyperparameters for all our models
Details: The Model
We use the standard DGP model, with a linear mean function for all the internal layers:
§ If the dimensions agree, use the identity; otherwise PCA (see the sketch below)
§ Sensible alternative: initialize the latents to the identity (but the linear mean function works better)
§ Not-so-sensible alternative: random initialization. Doesn’t work well (the posterior is very multimodal)
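A minimal sketch of how such a mean function might be built, assuming NumPy and a PCA-via-SVD construction (the helper name `make_linear_mean` and the details of the projection are illustrative assumptions, not the paper's code):

```python
import numpy as np

def make_linear_mean(X, dim_in, dim_out):
    """Linear mean function for an internal DGP layer.

    Identity when the layer preserves dimension; otherwise a fixed
    linear map W whose columns are the top principal directions of
    the layer inputs X (assumes dim_out <= dim_in).
    Returns a callable mapping (N, dim_in) arrays to (N, dim_out).
    """
    if dim_in == dim_out:
        W = np.eye(dim_in)
    else:
        Xc = X - X.mean(axis=0)                       # centre for PCA
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:dim_out].T                            # (dim_in, dim_out)
    return lambda H: H @ W
```

Here W is held fixed; the mean function only shifts the prior, so each layer's GP models the residual around this linear map.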
The DGP: Graphical Model
[Figure: graphical model of a three-layer DGP. Inputs X (with inducing inputs Z^0) enter layer 1; each layer l has function values f^l and inducing outputs u^l; noise ε gives the hidden outputs h^l, which, with inducing inputs Z^l, feed layer l+1; the final layer's f^3 produces y.]
The DGP: Density

$$
p\big(\mathbf{y}, \{\mathbf{h}^l, \mathbf{f}^l, \mathbf{u}^l\}_{l=1}^{L}\big) \;=\; \underbrace{\prod_{i=1}^{N} p\big(y_i \,\big|\, f_i^L\big)}_{\text{likelihood}} \;\times\; \underbrace{\prod_{l=1}^{L} p\big(\mathbf{h}^l \,\big|\, \mathbf{f}^l\big)\, p\big(\mathbf{f}^l \,\big|\, \mathbf{u}^l; \mathbf{h}^{l-1}, Z^{l-1}\big)\, p\big(\mathbf{u}^l; Z^{l-1}\big)}_{\text{DGP prior}}
$$

(with $\mathbf{h}^0 = X$)
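To make the factorization concrete, a hedged sketch of ancestral sampling from this prior (RBF kernel, identity mean function, equal layer widths, and all parameter values are assumptions for illustration; the inducing outputs u^l are implicitly marginalised):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix k(A, B)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_dgp_prior(X, n_layers=3, noise_var=1e-2, jitter=1e-6, seed=0):
    """One ancestral sample of the DGP prior: at each layer draw
    f^l ~ GP(identity mean, k) at the previous layer's outputs,
    then add noise between layers, h^l = f^l + eps."""
    rng = np.random.default_rng(seed)
    h = X
    for l in range(n_layers):
        K = rbf(h, h) + jitter * np.eye(len(h))
        L = np.linalg.cholesky(K)
        f = h + L @ rng.standard_normal(h.shape)   # identity mean function
        if l == n_layers - 1:
            return f                               # f^L feeds the likelihood
        h = f + np.sqrt(noise_var) * rng.standard_normal(f.shape)

# e.g. f_L = sample_dgp_prior(np.linspace(-1, 1, 50)[:, None])
```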
Factorised Variational Posterior
[Figure: the same graphical model under a fully factorized posterior: Gaussians $\mathcal{N}(\mathbf{u}^l \,|\, \mathbf{m}^l, \mathbf{S}^l)$ on the inducing outputs and independent Gaussians $\prod_i \mathcal{N}(h_i^l \,|\, \mu_i^l, (\sigma_i^l)^2)$ on the hidden outputs, which severs the coupling between layers.]
Our Variational Posterior
[Figure: the same graphical model, but the posterior keeps the model’s conditional structure: Gaussians $\mathcal{N}(\mathbf{u}^l \,|\, \mathbf{m}^l, \mathbf{S}^l)$ on the inducing outputs only, with $\mathbf{f}^l$ and $\mathbf{h}^l$ (including the between-layer noise ε) drawn from the model conditioned on $\mathbf{u}^l$.]
Recap: ‘GPs for Big Data’ [3]

$$q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \,|\, \mathbf{u}; X, Z)\, \mathcal{N}(\mathbf{u} \,|\, \mathbf{m}, \mathbf{S})$$

Marginalise $\mathbf{u}$ from the variational posterior:

$$\int p(\mathbf{f} \,|\, \mathbf{u}; X, Z)\, \mathcal{N}(\mathbf{u} \,|\, \mathbf{m}, \mathbf{S})\, d\mathbf{u} \;=\; \mathcal{N}(\mathbf{f} \,|\, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \;=:\; q(\mathbf{f} \,|\, \mathbf{m}, \mathbf{S}; X, Z) \quad (1)$$

Define the following mean and covariance functions:

$$\mu_{\mathbf{m}, Z}(x_i) = m(x_i) + \alpha(x_i)^T \big(\mathbf{m} - m(Z)\big),$$
$$\Sigma_{\mathbf{S}, Z}(x_i, x_j) = k(x_i, x_j) - \alpha(x_i)^T \big(k(Z, Z) - \mathbf{S}\big)\, \alpha(x_j),$$

where $\alpha(x_i) = k(Z, Z)^{-1} k(Z, x_i)$.
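A minimal NumPy sketch of these two functions, and of how posterior samples are then propagated through a DGP's layers (the RBF kernel, zero prior mean, 1-D hidden layers, and noise-free propagation are simplifying assumptions; note that only per-point marginals are needed, so no N × N matrix is ever formed):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def q_marginals(Xb, Z, m, S, jitter=1e-6):
    """Per-point mean and variance of q(f | m, S; Xb, Z), i.e. the
    diagonal of eq. (1), with prior mean m(.) = 0:
      mu(x_i)     = alpha(x_i)^T m
      sigma2(x_i) = k(x_i, x_i) - alpha(x_i)^T (k(Z,Z) - S) alpha(x_i)
    where alpha(x_i) = k(Z,Z)^{-1} k(Z, x_i)."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kzx = rbf(Z, Xb)                       # (M, Nb)
    alpha = np.linalg.solve(Kzz, Kzx)      # columns are alpha(x_i)
    mu = alpha.T @ m                       # (Nb, 1)
    kxx = np.ones(len(Xb))                 # k(x, x) = variance = 1 here
    var = kxx - np.einsum('mn,mk,kn->n', alpha, Kzz - S, alpha)
    return mu, var[:, None]

def propagate(Xb, layers, rng):
    """One doubly stochastic sample through the layers, using the
    reparameterization trick: h^l = mu^l + sqrt(var^l) * eps."""
    h = Xb
    for Z, m, S in layers:                 # (Z^{l-1}, m^l, S^l) per layer
        mu, var = q_marginals(h, Z, m, S)
        h = mu + np.sqrt(np.maximum(var, 0.0)) * rng.standard_normal(mu.shape)
    return h                               # samples of f^L for the likelihood

# e.g. rng = np.random.default_rng(0)
#      layers = [(Z0, m1, S1), (Z1, m2, S2)]   # Z1 one-dimensional here
#      fL = propagate(X_batch, layers, rng)
```

Because each data point's sample depends only on its own marginal, the estimator minibatches over data for free.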