Power Expectation Propagation for Deep Gaussian Processes
Dr. Richard E. Turner (ret26@cam.ac.uk)
Computational and Biological Learning Lab, Department of Engineering, University of Cambridge
with Thang Bui, Yingzhen Li, José Miguel Hernández-Lobato, Daniel Hernández-Lobato, Josiah Yan
Motivation: Gaussian Process regression
Given observed inputs and outputs, we want to infer the underlying function and learn the hyperparameters, but inference and learning involve both analytic and computational intractabilities.
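To make the computational side of this concrete, the sketch below does exact GP regression in Python with an assumed RBF kernel and toy data; the N x N linear solves are the O(N^3) cost that the pseudo-point approximations in the rest of the talk aim to avoid.

```python
import numpy as np

# Minimal exact GP regression sketch (standard textbook computation; kernel,
# data and noise level are illustrative assumptions, not from the talk).

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

np.random.seed(0)
x = np.linspace(-3, 3, 100)
y = np.sin(x) + 0.1 * np.random.randn(100)
x_star = np.linspace(-4, 4, 50)
noise_var = 0.01

K = rbf(x, x) + noise_var * np.eye(len(x))      # N x N Gram matrix
K_star = rbf(x_star, x)                          # test-train covariances
pred_mean = K_star @ np.linalg.solve(K, y)       # predictive mean, O(N^3) solve
v = np.linalg.solve(K, K_star.T)                 # K^{-1} K_*n^T, another O(N^3) solve
pred_var = rbf(x_star, x_star).diagonal() - np.einsum('ij,ji->i', K_star, v)
print(pred_mean[:3], pred_var[:3])
```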
EP pseudo-point approximation
The true posterior is the prior combined with one likelihood term per data point (the figure shows the marginal posterior and likelihood). The approximate posterior replaces each true likelihood with a pseudo-observation likelihood, so it is summarised by a small set of 'pseudo' data: their input locations together with their outputs and covariance.
EP algorithm
1. Take out one pseudo-observation likelihood, giving the cavity distribution.
2. Add in the corresponding true observation likelihood, giving the tilted distribution.
3. Project onto the approximating family by minimising the KL between unnormalised stochastic processes; at the minimum the moments are matched at the pseudo-inputs, and for Gaussian regression the moments match everywhere.
4. Update the pseudo-observation likelihood (a rank-1 update).
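To make the four steps concrete, the sketch below runs EP in a deliberately simple setting: a univariate Gaussian approximation to a standard-normal prior multiplied by probit likelihood factors, using the standard closed-form moment matching for that likelihood. It is only a minimal illustration of the cavity/tilted/project/update cycle, not the pseudo-point GP algorithm on the slide; the data and variable names are my own.

```python
import numpy as np
from scipy.stats import norm

# Approximate p(f | y) ∝ N(f; 0, 1) * prod_n Phi(y_n * f) with a Gaussian q(f),
# using one Gaussian "site" per likelihood term (illustrative toy example).

y = np.array([1.0, 1.0, -1.0, 1.0])   # +/-1 observations under a probit likelihood
N = len(y)
prior_var, prior_mean = 1.0, 0.0

tau = np.zeros(N)   # site precisions
nu = np.zeros(N)    # site precision-times-mean

for sweep in range(20):
    for n in range(N):
        # Current approximation q(f) = prior * all sites (natural parameters)
        q_tau = 1.0 / prior_var + tau.sum()
        q_nu = prior_mean / prior_var + nu.sum()

        # 1. take out one site -> cavity distribution
        cav_tau = q_tau - tau[n]
        cav_nu = q_nu - nu[n]
        cav_var, cav_mean = 1.0 / cav_tau, cav_nu / cav_tau

        # 2. add in the true probit likelihood -> tilted distribution, and
        # 3. project: match the tilted mean and variance (closed form for probit)
        z = y[n] * cav_mean / np.sqrt(1.0 + cav_var)
        ratio = norm.pdf(z) / norm.cdf(z)
        tilt_mean = cav_mean + y[n] * cav_var * ratio / np.sqrt(1.0 + cav_var)
        tilt_var = cav_var - cav_var**2 * ratio * (z + ratio) / (1.0 + cav_var)

        # 4. update the site so that cavity * site has the tilted moments
        tau[n] = 1.0 / tilt_var - cav_tau
        nu[n] = tilt_mean / tilt_var - cav_nu

q_var = 1.0 / (1.0 / prior_var + tau.sum())
q_mean = q_var * (prior_mean / prior_var + nu.sum())
print(f"approximate posterior: mean={q_mean:.3f}, var={q_var:.3f}")
```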
Fixed points of EP = FITC approximation
The fixed points of EP are equivalent to the FITC approximation (Csató & Opper, 2002; Qi, Abdel-Gawad & Minka, 2010).
This interpretation resolves philosophical issues with FITC (e.g. whether to increase M with N). Since FITC is known to overfit, it follows that EP over-estimates the marginal likelihood.
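For concreteness, here is a naive sketch of the FITC marginal likelihood that these EP fixed points correspond to, written in the plain O(N^3) form rather than the efficient O(NM^2) implementation; the RBF kernel, toy data, and pseudo-input locations are illustrative assumptions.

```python
import numpy as np

# FITC replaces the full covariance K_nn by Q_nn + diag(K_nn - Q_nn),
# where Q_nn = K_nm K_mm^{-1} K_mn (Nystrom part plus exact diagonal correction).

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def fitc_log_marginal(x, y, z, noise_var=0.1, jitter=1e-8):
    K_nn = rbf(x, x)
    K_nm = rbf(x, z)
    K_mm = rbf(z, z) + jitter * np.eye(len(z))
    Q_nn = K_nm @ np.linalg.solve(K_mm, K_nm.T)
    cov = Q_nn + np.diag(np.diag(K_nn - Q_nn)) + noise_var * np.eye(len(x))
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (y @ np.linalg.solve(cov, y) + logdet + len(x) * np.log(2 * np.pi))

np.random.seed(0)
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * np.random.randn(50)
z = np.linspace(-3, 3, 8)   # M = 8 pseudo-input locations
print("FITC log marginal likelihood:", fitc_log_marginal(x, y, z))
```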
Power EP algorithm (as tractable as EP)
1. Take out a fraction of one pseudo-observation likelihood, giving the cavity distribution.
2. Add in the same fraction of the corresponding true observation likelihood, giving the tilted distribution.
3. Project onto the approximating family by minimising the KL between unnormalised stochastic processes; at the minimum the moments are matched at the pseudo-inputs, and for Gaussian regression the moments match everywhere.
4. Update the pseudo-observation likelihood (a rank-1 update).
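Written out schematically (my own shorthand, with t_n denoting the pseudo-observation likelihood for point n, q the current approximate posterior, and proj[·] the moment-matching projection), one Power EP pass over site n is:

\[
\begin{aligned}
\text{cavity:}\quad  & q_{\setminus n}(f) \propto q(f) \,/\, t_n(f)^{\alpha} \\
\text{tilted:}\quad  & \tilde p_n(f) \propto q_{\setminus n}(f)\, p(y_n \mid f)^{\alpha} \\
\text{project:}\quad & q^{\mathrm{new}}(f) = \mathrm{proj}\big[\tilde p_n(f)\big] \\
\text{update:}\quad  & t_n^{\mathrm{new}}(f) \propto \big(q^{\mathrm{new}}(f) / q_{\setminus n}(f)\big)^{1/\alpha}
\end{aligned}
\]

Setting α = 1 recovers the EP steps above, while the α → 0 limit recovers the variational free energy approach, which is the unifying picture on the next slide.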
Power EP: a unifying framework
The power α interpolates between existing approximations: α = 1 gives EP, whose fixed points are FITC (Csató and Opper, 2002; Snelson and Ghahramani, 2005), while the limit α → 0 gives VFE (Titsias, 2009).
Power EP: a unifying framework
Approximate blocks of data, giving structured approximations: PITC / BCM (Schwaighofer & Tresp, 2002; Snelson, 2006) and VFE (Titsias, 2009).
Place the pseudo-data in a different space via interdomain transformations (a linear transform of the process), so the pseudo-data live in the new space (Figueiras-Vidal & Lázaro-Gredilla, 2009; Tobar et al., 2015; Matthews et al., 2016).
[Figure: taxonomy of pseudo-point approximations for GP regression and GP classification within the Power EP framework (VFE, PEP, EP), including structured and inter-domain variants; * = optimised pseudo-inputs, ** = structured versions of VFE recover VFE.]
[4] Quiñonero-Candela et al., 2005
[5] Snelson et al., 2005
[6] Snelson, 2006
[7] Schwaighofer, 2002
[8] Titsias, 2009
[9] Csató, 2002
[10] Csató et al., 2002
[11] Seeger et al., 2003
[12] Naish-Guzman et al., 2007
[13] Qi et al., 2010
[14] Hensman et al., 2015
[15] Hernández-Lobato et al., 2016
[16] Matthews et al., 2016
[17] Figueiras-Vidal et al., 2009
How should I set the power parameter α?
Experiments: 8 UCI regression datasets and 6 UCI classification datasets, 20 random splits each; M = 0-200 for regression and M = 10, 50, 100 for classification; hyperparameters and inducing inputs optimised.
[Figure: average-rank plots (SMSE and SMLL for regression; Error and MLL for classification), with CD bars marking significant differences.]
α = 0.5 does well overall.
Deep Gaussian processes
Each layer is a GP: $f_l \sim \mathcal{GP}(0, k(\cdot,\cdot))$, and the output is the composition $y_n = g(x_n) = f_L(f_{L-1}(\cdots f_2(f_1(x_n)))) + \epsilon_n$. Writing the hidden activations as $h_{L-1,n} := f_{L-1}(\cdots f_1(x_n))$, this is $y_n = f_L(h_{L-1,n}) + \epsilon_n$.
Deep GPs (Damianou and Lawrence, 2013, for unsupervised learning) are a multi-layer generalisation of Gaussian processes, equivalent to deep neural networks with infinitely wide hidden layers.
Questions: How can inference and learning be performed tractably? How do deep GPs compare to alternatives, e.g. Bayesian neural networks?
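As a concrete illustration of this generative model, the sketch below draws a single function from a two-layer deep GP prior by sampling the first layer at the inputs and the second layer at the resulting hidden values; the RBF kernel, lengthscales, and noise level are illustrative assumptions rather than settings from the talk, and it is not the inference scheme the talk is about.

```python
import numpy as np

# Sample from a 2-layer deep GP prior: h_{1,n} = f_1(x_n), y_n = f_2(h_{1,n}) + eps_n.

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sample_gp(inputs, lengthscale=1.0, jitter=1e-5):
    # Draw a GP sample evaluated at `inputs` via a Cholesky factor of the Gram matrix
    K = rbf(inputs, inputs, lengthscale) + jitter * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ np.random.randn(len(inputs))

np.random.seed(0)
x = np.linspace(-3, 3, 200)
h1 = sample_gp(x, lengthscale=1.5)              # hidden layer: h_{1,n} = f_1(x_n)
f2_at_h1 = sample_gp(h1, lengthscale=0.5)       # second layer evaluated at the warped inputs
y = f2_at_h1 + 0.05 * np.random.randn(len(x))   # observations: y_n = f_2(h_{1,n}) + eps_n
```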
Pros and cons of deep GPs
Why deep GPs? They are deep and nonparametric; they can discover useful input warpings and dimensionality compression or expansion, i.e. automatic, nonparametric Bayesian kernel design; and they give a non-Gaussian functional mapping g.