
Gaussian Process Regression with Mismatched Models & Can GP Regression Be Made Robust Against Model Mismatch?
Peter Sollich
NeurIPS 2002 & International Workshop on Deterministic and Statistical Methods in Machine Learning (2004)


  1. Gaussian Process Regression with Mismatched Models & Can GP Regression Be Made Robust Against Model Mismatch? Peter Sollich. NeurIPS 2002 & International Workshop on Deterministic and Statistical Methods in Machine Learning (2004)

  2. Learning curve
     Ideal learning curve:
     • Performance measured on the true input distribution (generalisation error)
     • Averaged over many training datasets, as a function of the training set size n
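
A minimal sketch of how such an averaged learning curve is estimated empirically (not from the slides; the polynomial ridge learner is only a stand-in for GP regression, and the target function, noise level and training-set sizes are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    def true_f(x):
        return np.sin(2 * np.pi * x)

    def draw_dataset(n, noise_std=0.1):
        # p(x): uniform on the line, noisy observations of the true function
        x = rng.uniform(0.0, 1.0, n)
        return x, true_f(x) + noise_std * rng.standard_normal(n)

    def test_error(x, y, x_test, y_test, degree=8, ridge=1e-3):
        # Placeholder learner: polynomial ridge regression instead of a GP
        Phi = np.vander(x, degree)
        w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(degree), Phi.T @ y)
        return np.mean((np.vander(x_test, degree) @ w - y_test) ** 2)

    x_test = np.linspace(0.0, 1.0, 2000)
    y_test = true_f(x_test)                      # noise-free targets for the error
    sizes = [5, 10, 20, 40, 80]
    # Average the generalisation error over 50 independent training sets per size
    curve = {n: np.mean([test_error(*draw_dataset(n), x_test, y_test)
                         for _ in range(50)]) for n in sizes}
    print(curve)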

  3. What is GP regression?
     y = f(x) + ε, ε ~ N(0, σ²); we want to estimate f.
     Put a GP prior on f:
     • Cov[f(x_i), f(x_j)] = K(x_i, x_j)
     • E[f(x)] = 0
     Why GP regression?
     • Posterior available analytically (requires O(n³) computation)
     • Error bars
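
For reference, a self-contained NumPy sketch of these posterior formulas (illustrative only; the RBF kernel, length scale, noise level and data are arbitrary choices, not the ones the talk discusses):

    import numpy as np

    def rbf_kernel(X1, X2, length_scale=0.2):
        # Stationary kernel K(x, x') = exp(-(x - x')^2 / (2 l^2))
        return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / length_scale ** 2)

    def gp_posterior(x, y, x_test, noise_var=0.01, length_scale=0.2):
        # Posterior mean and variance of f at x_test under a zero-mean GP prior
        n = len(x)
        K = rbf_kernel(x, x, length_scale) + noise_var * np.eye(n)
        L = np.linalg.cholesky(K)                # the O(n^3) step
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        K_s = rbf_kernel(x, x_test, length_scale)
        mean = K_s.T @ alpha
        v = np.linalg.solve(L, K_s)
        var = 1.0 - np.sum(v ** 2, axis=0)       # prior variance is 1 for this kernel
        return mean, var                         # var gives the error bars

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, 30)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
    mean, var = gp_posterior(x, y, np.linspace(0.0, 1.0, 5))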

  4. Mismatched model?
     Input to GP: kernel K, noise level σ².
     • What if we use the wrong ones?
     Setting:
     • Assume p(x) known: uniform on a line or on a hypercube
     • Assume a stationary kernel, K(x, x') = g(x − x')
     • Theory exact if d = ∞, otherwise all kinds of approximations
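
To make the setting concrete, a small sketch (my own illustration, not the paper's analysis): data are drawn from a GP with a "true" Ornstein-Uhlenbeck kernel and noise level, but the posterior mean is computed with a mismatched RBF kernel and a noise level set far too low:

    import numpy as np

    rng = np.random.default_rng(1)

    def ou_kernel(X1, X2, l=0.3):                # "true" kernel: Ornstein-Uhlenbeck
        return np.exp(-np.abs(X1[:, None] - X2[None, :]) / l)

    def rbf_kernel(X1, X2, l=0.3):               # "wrong" kernel used by the model
        return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / l ** 2)

    n, noise_true, noise_model = 100, 1e-2, 1e-4     # model noise set far too low
    x = rng.uniform(0.0, 1.0, n)                     # p(x): uniform on the line
    K_true = ou_kernel(x, x) + 1e-10 * np.eye(n)     # jitter for the Cholesky
    f = np.linalg.cholesky(K_true) @ rng.standard_normal(n)   # sample the true f
    y = f + np.sqrt(noise_true) * rng.standard_normal(n)

    # Mismatched GP posterior mean on a test grid
    x_test = np.linspace(0.0, 1.0, 200)
    K_model = rbf_kernel(x, x) + noise_model * np.eye(n)
    mean = rbf_kernel(x_test, x) @ np.linalg.solve(K_model, y)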

  5. Weird learning curves
     • Plateaus or an arbitrary number of overfitting maxima
     • 1D line, assumed noise level too low: the learning curve shows a plateau
     • Hypercube, d = 10, assumed noise level too small (1e-4, 1e-3, …) while the true value is 1

  6. Asymptotic problems
     If the true kernel (OU, MB2) is less smooth than the chosen kernel (RBF):
     • No asymptotic decay ε = O(1/n) as for parametric models; the decay is much slower (logarithmically slow)
     • The prior cannot be overwhelmed by the data (it is too strong)
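
For concreteness, the stationary covariance functions referred to here, in one common parameterisation (the exact constants and the MB2 form used in the paper may differ): OU sample paths are nowhere differentiable, MB2 (Matérn-type) paths are once differentiable, and RBF paths are infinitely differentiable, which is why an RBF prior is much smoother, hence stronger, than either true kernel:

    import numpy as np

    def ou(r, l=1.0):
        # Ornstein-Uhlenbeck: rough, non-differentiable sample paths
        return np.exp(-np.abs(r) / l)

    def mb2(r, l=1.0):
        # Matern-type (modified Bessel) form: once-differentiable sample paths
        a = np.abs(r) / l
        return (1.0 + a) * np.exp(-a)

    def rbf(r, l=1.0):
        # Squared exponential: infinitely smooth sample paths
        return np.exp(-0.5 * (r / l) ** 2)

    r = np.linspace(0.0, 3.0, 7)
    print(ou(r), mb2(r), rbf(r), sep="\n")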

  7. Fix?
     • But maybe we just chose very bad hyperparameters?
     • A true Bayesian would integrate them out, P(D) = ∫ P(D|θ) P(θ) dθ, but that is too expensive…
     • What about evidence maximization? Maximize the evidence P(D|θ) w.r.t. the hyperparameters θ.
     • Setting: the kernel family is still wrong, but we tune σ², a, l (noise level, kernel amplitude, length scale) using the evidence
     • All kinds of approximations are needed to make the analysis tractable…
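
A rough sketch of what tuning hyperparameters by evidence maximization means in practice (my illustration, assuming an RBF kernel and a simple grid search over amplitude and noise variance only; the talk refers to an analytical treatment, not this numerical procedure):

    import numpy as np

    def rbf_kernel(X1, X2, length_scale=0.3):
        return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / length_scale ** 2)

    def log_evidence(x, y, amplitude, noise_var, length_scale=0.3):
        # log p(y | x, hyperparameters) for a zero-mean GP
        n = len(x)
        K = amplitude * rbf_kernel(x, x, length_scale) + noise_var * np.eye(n)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return (-0.5 * y @ alpha
                - np.sum(np.log(np.diag(L)))
                - 0.5 * n * np.log(2.0 * np.pi))

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 1.0, 50)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

    # Grid search: keep the (amplitude, noise variance) with the highest evidence
    grid = [(a, s2) for a in np.logspace(-1, 1, 9) for s2 in np.logspace(-4, 0, 9)]
    best = max(grid, key=lambda p: log_evidence(x, y, *p))
    print("evidence-optimal (amplitude, noise variance):", best)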

  8. Hypercube analysis
     • If tuning the hyperparameters can reach Bayes-optimal performance, evidence maximization will find it.
     • If those hyperparameter values cannot be reached (for example, they would require l → ∞), convergence is still very slow
     • No overfitting maxima
     • No experiments???

  9. 1D case
     • True kernel = MB2
     • Kernel used = as indicated in the plot
     • No maxima or plateaus
     • Optimal rate achieved
