Gaussian Process Regression with Mismatched Models (NeurIPS 2002) and Can GP Regression Be Made Robust Against Model Mismatch? (International Workshop on Deterministic and Statistical Methods in Machine Learning, 2004), both by Peter Sollich
Learning curve
Ideal learning curve:
• Performance on the true input distribution
• Average over multiple training datasets (a sketch of this recipe follows below)
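A minimal sketch of this recipe on a toy 1D problem, assuming an RBF kernel and a sine target (all names and values here are illustrative choices, not the paper's): for each training-set size n, the test error of the GP posterior mean is averaged over many random training sets.

```python
# Sketch: estimate a learning curve by averaging test error over
# random training datasets, as described on the slide above.
import numpy as np

def rbf(X1, X2, length=0.2):
    # Squared-exponential kernel on 1D inputs
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_mean(Xtr, ytr, Xte, sigma2=0.01):
    # GP posterior mean at test points (sigma2 = assumed noise variance)
    K = rbf(Xtr, Xtr) + sigma2 * np.eye(len(Xtr))
    return rbf(Xte, Xtr) @ np.linalg.solve(K, ytr)

rng = np.random.default_rng(0)
Xte = rng.uniform(size=500)            # test inputs from the true p(x)
fte = np.sin(2 * np.pi * Xte)          # stand-in "true" function
for n in [4, 8, 16, 32, 64]:
    errs = []
    for _ in range(50):                # average over training datasets
        Xtr = rng.uniform(size=n)
        ytr = np.sin(2 * np.pi * Xtr) + 0.1 * rng.standard_normal(n)
        errs.append(np.mean((gp_mean(Xtr, ytr, Xte) - fte) ** 2))
    print(n, np.mean(errs))            # one point of the learning curve
```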
What is GP regression?
y = f(x) + ξ, ξ ~ N(0, σ²); we want to estimate f.
Put a GP prior on f:
• Cov(f(xᵢ), f(xⱼ)) = K(xᵢ, xⱼ)
• E[f(x)] = 0
Why GP regression?
• Posterior is available analytically (requires O(n³) computation); see the sketch below
• Error bars
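The closed form behind those two bullets, as a hedged sketch using the generic GP posterior formulas (function and argument names are mine, not the paper's): the single solve with the n×n matrix K + σ²I is the O(n³) cost, and the posterior variance supplies the error bars.

```python
# Sketch of the analytic GP posterior at m test points.
import numpy as np

def gp_posterior(K, k_star, k_ss, y, sigma2):
    """Posterior mean and variance.

    K      : (n, n) kernel matrix on training inputs
    k_star : (m, n) cross-kernel between test and training inputs
    k_ss   : (m,)   kernel diagonal at the test inputs
    sigma2 : assumed noise variance
    """
    A = K + sigma2 * np.eye(len(y))            # K + sigma^2 I
    alpha = np.linalg.solve(A, y)              # the O(n^3) step
    mean = k_star @ alpha
    # var_i = k_ss_i - k*_i A^{-1} k*_i^T  -> the error bars
    var = k_ss - np.sum(k_star * np.linalg.solve(A, k_star.T).T, axis=1)
    return mean, var
```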
Mismatched model?
Input to GP: kernel K, noise level σ²
• What if we use the wrong ones?
Setting:
• Assume p(x) is known: uniform on a line or on a hypercube
• Assume K(x, x′) = g(x − x′)
• Theory exact if d = ∞; otherwise all kinds of approximations
(A toy instance of this mismatch is sketched below.)
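A toy instance of the mismatched setting, as a sketch (the kernel forms and parameter values are illustrative assumptions, not the paper's exact choices): data are generated from a rough Ornstein-Uhlenbeck (OU) prior, then fit with a smooth RBF kernel and a noise level set far too low.

```python
# Sketch: generate data from a rough OU prior, fit with a smooth
# RBF kernel and an underestimated noise level.
import numpy as np

def ou(X1, X2, length=0.2):
    return np.exp(-np.abs(X1[:, None] - X2[None, :]) / length)

def rbf(X1, X2, length=0.2):
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

rng = np.random.default_rng(1)
n, sigma2_true = 100, 0.05
X = rng.uniform(size=n)                       # p(x) uniform on the line
K_true = ou(X, X)                             # true (rough) prior
f = np.linalg.cholesky(K_true + 1e-8 * np.eye(n)) @ rng.standard_normal(n)
y = f + np.sqrt(sigma2_true) * rng.standard_normal(n)

# Mismatched model: smooth RBF kernel, noise level set much too low
sigma2_model = 1e-4
A = rbf(X, X) + sigma2_model * np.eye(n)
f_hat = rbf(X, X) @ np.linalg.solve(A, y)     # posterior mean at train points
print("MSE vs latent f:", np.mean((f_hat - f) ** 2))
```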
Weird learning curves
• Plateaus, or an arbitrary number of overfitting maxima
• [Plot] 1D line, assumed noise level too low: the learning curve shows a plateau
• [Plot] Hypercube, d = 10, assumed noise level too small (1e-4, 1e-3, …) while the true level is 1
Asymptotic problems
• No asymptotic decay ε = O(1/n) as for parametric models; decay can be much slower (logarithmically slow), see the schematic below
• This happens if the true kernel (OU = Ornstein-Uhlenbeck, MB2 = modified Bessel) is less smooth than the chosen kernel (RBF)
• The prior cannot be overwhelmed by the data (it is too strong)
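A schematic contrast of the two regimes, in hedged form (the functional forms below only illustrate the slide's qualitative claim; the paper derives the precise mismatched rates):

```latex
\begin{align}
  \epsilon_{\text{parametric-like}}(n) &\propto \frac{1}{n}, \\
  \epsilon_{\text{mismatched}}(n) &\sim \frac{1}{\ln n}
  \quad \text{(logarithmically slow when the prior is too smooth)}.
\end{align}
```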
Fix?
• But maybe we just chose very bad hyperparameters?
• Maximize the evidence P(D) = ∫ P(D|f) P(f) df with respect to the hyperparameters
• A true Bayesian treatment (averaging over hyperparameters) is too expensive…
What about evidence maximization?
• Setting: assume the wrong kernel, but tune the hyperparameters (noise level and kernel parameters) using the evidence; a minimal sketch follows below
• All kinds of approximations are needed to make the analysis tractable…
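A minimal sketch of evidence maximization under these assumptions (RBF kernel; grid search over length scale and noise level; all names and values are mine, not the paper's): pick the hyperparameters that maximize the log marginal likelihood log P(D). Grid search just keeps the sketch short; gradient ascent on the evidence is the usual choice in practice.

```python
# Sketch: choose hyperparameters by maximizing the log evidence log P(D).
import numpy as np

def log_evidence(X, y, length, sigma2):
    # log P(D) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2 pi),
    # with K = K_rbf + sigma^2 I, computed via a Cholesky factor.
    d = X[:, None] - X[None, :]
    K = np.exp(-0.5 * (d / length) ** 2) + sigma2 * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))        # -1/2 log|K|
            - 0.5 * len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(2)
X = rng.uniform(size=40)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(40)

best = max(((l, s2) for l in [0.05, 0.1, 0.2, 0.5]
            for s2 in [1e-4, 1e-3, 1e-2, 1e-1]),
           key=lambda p: log_evidence(X, y, *p))
print("evidence-maximizing (length, sigma^2):", best)
```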
Hypercube analysis
• If we can tune the parameters to reach Bayes-optimal performance, evidence maximization will find them.
• If we cannot reach those parameter values (for example, when the required value goes to ∞), convergence is still very slow
• No overfitting maxima
• No experiments???
1D case
• True kernel: MB2
• Used kernel: shown in the plot
• No maxima or plateaus
• Optimal decay rate achieved