

  1. Some Concerns About Sparse Approximations for Gaussian Process Regression. Joaquin Quiñonero Candela, Max Planck Institute for Biological Cybernetics. Gaussian Process Round Table, Sheffield, June 9 and 10, 2005

  2. Menu • Concerns about the quality of the predictive distributions • Augmentation: a bit more expensive, but gooood ... • Dude, where’s my prior? • A short tale about sparse greedy support set selection

  3. The Regression Task
• Simplest case: additive independent Gaussian noise of variance σ², with a Gaussian process prior over functions: p(y | f) ~ N(f, σ² I), p(f) ~ N(0, K)
• Task: obtain the predictive distribution of f_* at the new input x_*: p(f_* | x_*, y) = ∫ p(f_* | x_*, f) p(f | y) df
• Need to compute the posterior distribution (expensive): p(f | y) ~ N( K (K + σ² I)⁻¹ y , σ² K (K + σ² I)⁻¹ )
• ... and integrate f out of the conditional distribution of f_*: p(f_* | x_*, f) ~ N( K_{*,·} K⁻¹ f , K_{*,*} − K_{*,·} K⁻¹ K_{*,·}ᵀ )
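As a concrete reference point for the approximations that follow, here is a minimal numpy sketch of these exact GP predictive equations. The rbf_kernel helper, its hyperparameters and the noise level sigma2 = 0.1 are assumptions made for this example, not taken from the talk.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential covariance between the rows of A (n x d) and B (m x d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, Xs, sigma2=0.1):
    # Exact GP predictive mean and variance of f_* at the test inputs Xs.
    n = X.shape[0]
    K = rbf_kernel(X, X)                               # n x n
    Ks = rbf_kernel(Xs, X)                             # n* x n
    kss = rbf_kernel(Xs, Xs).diagonal()                # prior variances at Xs
    L = np.linalg.cholesky(K + sigma2 * np.eye(n))     # O(n^3) training cost
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, Ks.T)                       # n x n*
    mean = Ks @ alpha                                  # K_{*,.} (K + sigma^2 I)^{-1} y
    var = kss - (V ** 2).sum(axis=0)                   # K_{*,*} - K_{*,.} (K + sigma^2 I)^{-1} K_{*,.}^T
    return mean, var
```

On a 1-D toy problem like the one in the later figures (inputs in [−15, 15]) this gives the exact predictive curve that the sparse methods below try to approximate.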

  4. Usual Reduced Set Approximations
• Consider some very common approximations:
  – Naïve process approximation on a subset of the data
  – Subset of Regressors (Wahba; Smola and Bartlett, ...)
  – Sparse online GPs (Csató and Opper)
  – Fast Sparse Projected Process Approximation (Seeger et al.)
  – Relevance Vector Machines (Tipping)
  – Augmented Reduced Rank GPs (Rasmussen and Quiñonero Candela)
• All are based on considering only a subset I of the latent variables: p(f_* | x_*, y) = ∫ p(f_* | x_*, f_I) p(f_I | y) df_I
• However, they differ in:
  – the way the support set I and the hyperparameters are learnt
  – the likelihood and/or predictive distribution approximations
• This has important consequences for the resulting predictive distribution:
  – risk of over-fitting
  – degenerate approximations with nonsense predictive uncertainties

  5. Naïve Process Approximation
• Extremely simple idea: throw away all the data outside I!
• The posterior only benefits from the information contained in y_I: p(f_I | y_I) ~ N( K_I (K_I + σ² I)⁻¹ y_I , σ² K_I (K_I + σ² I)⁻¹ )
• The model underfits and is under-confident: p(f_* | x_*, y_I) ~ N(μ_*, σ_*²) with
  μ_* = K_{*,I} (K_I + σ² I)⁻¹ y_I ,  σ_*² = K_{*,*} − K_{*,I} (K_I + σ² I)⁻¹ K_{*,I}ᵀ
• Training scales with m³; predicting scales with m (mean) and m² (variance)
• Baseline approximation: we want higher accuracy and confidence
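A minimal sketch of this subset-of-data baseline, reusing a kernel function such as the rbf_kernel sketched earlier; the sod_predict name and the default noise level are assumptions for illustration.

```python
import numpy as np

def sod_predict(X, y, I, Xs, kernel, sigma2=0.1):
    # Naive "subset of data": run exact GP regression on the m support points
    # indexed by I and simply discard everything else.
    XI, yI = X[I], y[I]
    m = len(I)
    KI = kernel(XI, XI)                                 # m x m
    KsI = kernel(Xs, XI)                                # n* x m
    kss = kernel(Xs, Xs).diagonal()
    L = np.linalg.cholesky(KI + sigma2 * np.eye(m))     # O(m^3) training
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, yI))
    V = np.linalg.solve(L, KsI.T)
    mean = KsI @ alpha                                  # O(m) per test point
    var = kss - (V ** 2).sum(axis=0)                    # O(m^2) per test point
    return mean, var
```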

  6. Subset of Regressors
• Finite linear model with a peculiar prior on the weights: f_* = K_{*,I} α_I , α_I ~ N(0, K_I⁻¹)  ⇒  f_* = K_{*,I} K_I⁻¹ f_I with f_I ~ N(0, K_I)
• The posterior now benefits from all of y:
  q(f_I | y) ∝ N(y | K_{I,·}ᵀ K_I⁻¹ f_I , σ² I) · N(f_I | 0, K_I)
            ~ N( K_I [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{I,·} y , σ² K_I [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_I )
• The conditional distribution of f_* is degenerate! p(f_* | f_I) ~ N( K_{*,I} K_I⁻¹ f_I , 0 )
• The predictive distribution produces nonsense errorbars:
  μ_* = K_{*,I} [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{I,·} y ,  σ_*² = σ² K_{*,I} [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{*,I}ᵀ
• Under the prior, only functions with m degrees of freedom
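A minimal numpy sketch of the Subset of Regressors predictive equations above; the function name, its arguments and the diagonal-extraction trick are assumptions of this example, not from the talk.

```python
import numpy as np

def sor_predict(X, y, I, Xs, kernel, sigma2=0.1):
    # Subset of Regressors: degenerate prior f_* = K_{*,I} alpha_I with
    # alpha_I ~ N(0, K_I^{-1}).  The mean is sensible, but the errorbars
    # collapse far away from the support set.
    XI = X[I]
    KI = kernel(XI, XI)                        # m x m
    KIn = kernel(XI, X)                        # m x n   (K_{I,.})
    KsI = kernel(Xs, XI)                       # n* x m  (K_{*,I})
    A = KIn @ KIn.T + sigma2 * KI              # m x m
    mean = KsI @ np.linalg.solve(A, KIn @ y)
    # diagonal of sigma^2 K_{*,I} A^{-1} K_{*,I}^T, one entry per test point
    var = sigma2 * np.einsum('ij,ij->i', KsI, np.linalg.solve(A, KsI.T).T)
    return mean, var
```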

  7. Projected Process (Seeger et al.)
• Basic principle: likelihood approximation p(y | f_I) ~ N( K_{I,·}ᵀ K_I⁻¹ f_I , σ² I )
• Leads to exactly the same posterior over f_I as the Subset of Regressors
• But the conditional distribution is now non-degenerate (a process approximation): p(f_* | f_I) ~ N( K_{*,I} K_I⁻¹ f_I , K_{*,*} − K_{*,I} K_I⁻¹ K_{*,I}ᵀ )
• Predictive distribution with the same mean as the Subset of Regressors, but with a way under-confident predictive variance!
  μ_* = K_{*,I} [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{I,·} y
  σ_*² = K_{*,*} − K_{*,I} K_I⁻¹ K_{*,I}ᵀ + σ² K_{*,I} [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{*,I}ᵀ
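A minimal sketch of the projected process prediction above, again with a kernel function passed in; pp_predict, the jitter and the defaults are assumptions for illustration.

```python
import numpy as np

def pp_predict(X, y, I, Xs, kernel, sigma2=0.1, jitter=1e-8):
    # Projected process (Seeger et al.): same posterior over f_I as SoR, but a
    # non-degenerate conditional p(f_*|f_I), so the SoR variance is inflated by
    # the left-over term K_{*,*} - K_{*,I} K_I^{-1} K_{I,*}.
    XI = X[I]
    m = len(I)
    KI = kernel(XI, XI) + jitter * np.eye(m)
    KIn = kernel(XI, X)                         # m x n
    KsI = kernel(Xs, XI)                        # n* x m
    kss = kernel(Xs, Xs).diagonal()
    A = KIn @ KIn.T + sigma2 * KI
    mean = KsI @ np.linalg.solve(A, KIn @ y)
    qss = np.einsum('ij,ij->i', KsI, np.linalg.solve(KI, KsI.T).T)       # K_{*,I} K_I^{-1} K_{I,*}
    sor_var = sigma2 * np.einsum('ij,ij->i', KsI, np.linalg.solve(A, KsI.T).T)
    var = kss - qss + sor_var
    return mean, var
```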

  8. Augmented Subset of Regressors
• For each x_*, augment f_I with f_*; the new active set is I_* = I ∪ {*}
• Augmented posterior: q( [f_I , f_*]ᵀ | y )
• ... at a cost of O(nm) per test case: need to compute K_{*,·} K_{I,·}ᵀ
• aSoR:
  μ_* = K_{*,·} [Q + v_* v_*ᵀ / c_*]⁻¹ y ,  σ_*² = K_{*,*} − K_{*,·} [Q + v_* v_*ᵀ / c_*]⁻¹ K_{*,·}ᵀ
• with the usual approximate covariance: Q = K_{I,·}ᵀ K_I⁻¹ K_{I,·} + σ² I
• with the difference between the actual and projected covariance of f_* and f: v_* = K_{*,·}ᵀ − K_{I,·}ᵀ K_I⁻¹ K_{I,*}
• and the difference between the prior variance of f_* and the projected one: c_* = K_{*,*} − K_{I,*}ᵀ K_I⁻¹ K_{I,*}
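A formula-literal sketch of the aSoR prediction above. It uses dense per-test-point solves for clarity only; the O(nm) cost quoted on the slide requires applying the matrix-inversion lemma instead. Function name, jitter and the small guard on c_* are assumptions.

```python
import numpy as np

def asor_predict(X, y, I, Xs, kernel, sigma2=0.1, jitter=1e-8):
    # Augmented SoR: each test input's own latent f_* is added to the active set.
    XI = X[I]
    n, m = X.shape[0], len(I)
    KI = kernel(XI, XI) + jitter * np.eye(m)
    KIn = kernel(XI, X)                                        # m x n  (K_{I,.})
    Q = KIn.T @ np.linalg.solve(KI, KIn) + sigma2 * np.eye(n)  # usual approximate covariance
    means, variances = [], []
    for x in Xs:
        xs = x[None, :]
        ksn = kernel(xs, X).ravel()                            # K_{*,.}, shape (n,)
        ksI = kernel(xs, XI).ravel()                           # K_{*,I}, shape (m,)
        kss = kernel(xs, xs).item()                            # K_{*,*}
        v = ksn - KIn.T @ np.linalg.solve(KI, ksI)             # actual minus projected covariance
        c = max(kss - ksI @ np.linalg.solve(KI, ksI), 1e-12)   # guard: c -> 0 at support inputs
        B = Q + np.outer(v, v) / c
        means.append(ksn @ np.linalg.solve(B, y))
        variances.append(kss - ksn @ np.linalg.solve(B, ksn))
    return np.array(means), np.array(variances)
```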

  9. Dude, where’s my prior?

  10. The Priors
The equivalent prior on [f , f_*]ᵀ is N(0, P), with Q = K_{I,·}ᵀ K_I⁻¹ K_{I,·} and blocks written row-wise as [top-left , top-right ; bottom-left , bottom-right]:
• Subset of Regressors: P = [ Q , K_{I,·}ᵀ K_I⁻¹ K_{I,*} ; K_{I,*}ᵀ K_I⁻¹ K_{I,·} , K_{I,*}ᵀ K_I⁻¹ K_{I,*} ]
• Projected Process: P = [ Q , K_{I,·}ᵀ K_I⁻¹ K_{I,*} ; K_{I,*}ᵀ K_I⁻¹ K_{I,·} , K_{*,*} ]
• Nyström (positive definiteness!): P = [ Q , K_{*,·}ᵀ ; K_{*,·} , K_{*,*} ]
• Ed and Zoubin's funky thing: P = [ Q + Λ , K_{I,·}ᵀ K_I⁻¹ K_{I,*} ; K_{I,*}ᵀ K_I⁻¹ K_{I,·} , K_{*,*} ] , with Λ = diag(K) − diag(Q)
• Augmented Subset of Regressors: P = [ Q + v_* v_*ᵀ / c_* , K_{*,·}ᵀ ; K_{*,·} , K_{*,*} ]
  with v_* = K_{*,·}ᵀ − K_{I,·}ᵀ K_I⁻¹ K_{I,*} and c_* = K_{*,*} − K_{I,*}ᵀ K_I⁻¹ K_{I,*}
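A small sketch that builds these implied joint prior covariances explicitly, so they can be inspected (eigenvalues, errorbar behaviour) on a toy problem. The function name and argument layout are assumptions; the block assignments follow the slide above.

```python
import numpy as np

def implied_priors(X, XI, xs, kernel, jitter=1e-8):
    # Joint prior covariance P over [f, f_*] implied by each approximation,
    # ordered as [training latents, test latent].
    m = XI.shape[0]
    KI = kernel(XI, XI) + jitter * np.eye(m)
    KIn = kernel(XI, X)                           # m x n
    KIs = kernel(XI, xs)                          # m x 1
    Kns = kernel(X, xs)                           # n x 1, exact cross-covariance K_{.,*}
    kss = kernel(xs, xs)                          # 1 x 1
    Q = KIn.T @ np.linalg.solve(KI, KIn)          # K_{I,.}^T K_I^{-1} K_{I,.}
    q_ns = KIn.T @ np.linalg.solve(KI, KIs)       # projected cross-covariance
    q_ss = KIs.T @ np.linalg.solve(KI, KIs)       # projected test variance

    P_sor = np.block([[Q, q_ns], [q_ns.T, q_ss]])
    P_pp = np.block([[Q, q_ns], [q_ns.T, kss]])
    P_nystrom = np.block([[Q, Kns], [Kns.T, kss]])     # not guaranteed positive (semi)definite
    Lam = np.diag(kernel(X, X).diagonal() - Q.diagonal())
    P_ez = np.block([[Q + Lam, q_ns], [q_ns.T, kss]])  # Lambda = diag(K) - diag(Q)
    v, c = Kns - q_ns, kss - q_ss
    P_asor = np.block([[Q + (v @ v.T) / c, Kns], [Kns.T, kss]])
    return P_sor, P_pp, P_nystrom, P_ez, P_asor
```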

  11. More on Ed and Zoubin's Method
• Here's a way of looking at it: the prior is itself a posterior process, f_* | f_I ~ N( K_{*,I} K_I⁻¹ f_I , K_{*,*} − K_{*,I} K_I⁻¹ K_{*,I}ᵀ ) ...
• ... well, almost: the conditional covariances between distinct latent function values given f_I are set to zero
• And then of course f_I ~ N(0, K_I)
• The corresponding prior on the training latents is p(f) ~ N( 0 , Q + diag(K) − diag(Q) ) , with Q = K_{I,·}ᵀ K_I⁻¹ K_{I,·}
• With a bit of algebra you recover the marginal likelihood and the predictive distribution
• I finished this 30 minutes ago, which is why I won't show figures on it! (well, I now may)
• but ...
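A minimal sketch of the prior covariance stated in the fourth bullet: the projected covariance with the exact prior variances restored on the diagonal. The helper name is an assumption.

```python
import numpy as np

def ed_zoubin_prior_cov(X, XI, kernel, jitter=1e-8):
    # Prior covariance over the training latents f implied by Ed and Zoubin's
    # method: Q + diag(K) - diag(Q).
    m = XI.shape[0]
    KI = kernel(XI, XI) + jitter * np.eye(m)
    KIn = kernel(XI, X)                            # m x n
    Q = KIn.T @ np.linalg.solve(KI, KIn)           # K_{I,.}^T K_I^{-1} K_{I,.}
    K_diag = kernel(X, X).diagonal()
    return Q + np.diag(K_diag - Q.diagonal())
```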

  12. Naïve Process Approximation [figure: 1-D toy regression example, x ∈ [−15, 15], y ∈ [−1.5, 1.5]]

  13. Subset of Regressors (degenerate) [figure: 1-D toy regression example, x ∈ [−15, 15], y ∈ [−1.5, 1.5]]

  14. Projected Process Approximation [figure: 1-D toy regression example, x ∈ [−15, 15], y ∈ [−1.5, 1.5]]

  15. Ed and Zoubin's Projected Process Method [figure: 1-D toy regression example, x ∈ [−15, 15], y ∈ [−1.5, 1.5]]

  16. Augmented SoR (prediction scales with nm) [figure: 1-D toy regression example, x ∈ [−15, 15], y ∈ [−1.5, 1.5]]

  17. Comparing the Predictive Uncertainties [figure: predictive uncertainties of the Naive, SR, Seeger, EdZoubin and Augm methods over x ∈ [−15, 15]; vertical axis from 0 to 0.7]

  18. Smola and Bartlett's Greedy Selection [two-panel figure, both panels against the size of the support set m on a logarithmic scale: top panel shows negative log evidence and test squared error, marking the minimum of the negative log evidence and the minimum of the negative log posterior; bottom panel shows the negative log posterior with its upper and lower bounds, gap = 0.025]
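For concreteness, here is a generic greedy forward-selection sketch for the support set: candidates are scored by the negative log evidence of the projected model y ~ N(0, Q + σ²I). This is not Smola and Bartlett's bound-based posterior criterion from the figure; the score, the random candidate subsampling and all names are assumptions of the sketch.

```python
import numpy as np

def approx_neg_log_evidence(X, y, I, kernel, sigma2, jitter=1e-8):
    # Negative log marginal likelihood under the projected prior
    # y ~ N(0, Q + sigma^2 I), Q = K_{.,I} K_I^{-1} K_{I,.} (dense, for clarity).
    XI = X[I]
    KI = kernel(XI, XI) + jitter * np.eye(len(I))
    KIn = kernel(XI, X)
    C = KIn.T @ np.linalg.solve(KI, KIn) + sigma2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + y @ np.linalg.solve(C, y) + len(X) * np.log(2 * np.pi))

def greedy_support_set(X, y, m, kernel, sigma2=0.1, n_candidates=20, rng=None):
    # Greedy forward selection: at each step, add the candidate training point
    # whose inclusion most decreases the score above, looking only at a random
    # subset of candidates per step.
    rng = np.random.default_rng() if rng is None else rng
    I, remaining = [], list(range(len(X)))
    for _ in range(m):
        cands = rng.choice(remaining, size=min(n_candidates, len(remaining)), replace=False)
        scores = [approx_neg_log_evidence(X, y, I + [int(c)], kernel, sigma2) for c in cands]
        best = int(cands[int(np.argmin(scores))])
        I.append(best)
        remaining.remove(best)
    return I
```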

  19. Wrap Up
• Training: from O(n³) to O(nm²)
• Predicting: from O(n²) to O(m²) (or O(nm))
• Be sparse if you must, but only then
• Beware of over-fitting prone greedy selection methods
• Do worry about the prior implied by the approximation!

  20. Appendix: Healing the RVM by Augmentation (joint work with Carl Rasmussen)

  21. Finite Linear Model [figure: 1-D toy example, x ∈ [0, 15], y ∈ [−2, 2]]

  22. A Bad Probabilistic Model [figure: 1-D toy example, x ∈ [0, 15], y ∈ [−2, 2]]

  23. The Healing: Augmentation [figure: 1-D toy example, x ∈ [0, 15], y ∈ [−2, 2]]

  24. Augmentation?
• Train your m-dimensional model once
• At each new test point, add a new basis function
• Update the (m+1)-dimensional model (update the posterior)
• Testing is now more expensive
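A minimal sketch of this test-time augmentation for a finite linear (RVM-style) model, assuming fixed, known weight prior variances (no ARD re-estimation). It redoes the full (m+1)-dimensional posterior for clarity rather than using a rank-one update; the prior variance of the new weight (new_prior_var) and all names are assumptions.

```python
import numpy as np

def augmented_prediction(X, y, centers, xstar, kernel, prior_var, sigma2=0.1, new_prior_var=1.0):
    # Finite linear model f(x) = sum_j w_j k(x, c_j) with w ~ N(0, diag(prior_var)).
    # Augmentation: at test time, add one extra basis function centred on the
    # test input itself, then recompute the (m+1)-dimensional weight posterior.
    Phi = kernel(X, centers)                          # n x m design matrix
    phi_new = kernel(X, xstar)                        # n x 1 column for the new basis function
    Phi_aug = np.hstack([Phi, phi_new])               # n x (m+1)
    A = np.diag(1.0 / np.concatenate([prior_var, [new_prior_var]]))  # weight precisions
    S = np.linalg.inv(A + Phi_aug.T @ Phi_aug / sigma2)              # posterior covariance of w
    mu_w = S @ Phi_aug.T @ y / sigma2                                # posterior mean of w
    phi_star = np.concatenate([kernel(xstar, centers).ravel(),
                               kernel(xstar, xstar).ravel()])        # features at x_*
    mean = phi_star @ mu_w
    var = phi_star @ S @ phi_star + sigma2
    return mean, var
```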

  25. Wait a minute ... I don’t care about probabilistic predictions!

  26. Another Symptom: Underfitting

Abalone            Squared error loss   Absolute error loss   − log test density loss
  RVM              0.138                0.259                 0.469
  RVM*             0.135                0.253                 0.408
  GP               0.092                0.209                 0.219
  p-values: RVM vs RVM*: not sig. / 0.07 / < 0.01; RVM vs GP: < 0.01 for all three losses; RVM* vs GP: 0.02 / < 0.01 / < 0.01

Robot Arm          Squared error loss   Absolute error loss   − log test density loss
  RVM              0.0043               0.0482                −1.2162
  RVM*             0.0040               0.0467                −1.3295
  GP               0.0024               0.0334                −1.7446
  p-values: all pairwise comparisons < 0.01 for all three losses

• GP (Gaussian Process): infinitely augmented linear model
• Beats finite linear models in all datasets I've looked at

  27. Interlude
None of this happens with non-localized basis functions

  28. Finite Linear Model [figure: 1-D toy example with non-localized basis functions, x ∈ [0, 15], y ∈ [−2, 2]]

  29. A Bad Probabilistic Model [figure: 1-D toy example with non-localized basis functions, x ∈ [0, 15], y ∈ [−2, 2]]

  30. The Healing: Augmentation [figure: 1-D toy example with non-localized basis functions, x ∈ [0, 15], y ∈ [−2, 2]]
