bayesian model comparison

Bayesian Model Comparison Roberto Trotta - www.robertotrotta.com - PowerPoint PPT Presentation



  1. @R_Trotta Bayesian Model Comparison Roberto Trotta - www.robertotrotta.com Analytics, Computation and Inference in Cosmology Cargese, Sept 2018

  2. Frequentist hypothesis testing
     • Warning: frequentist hypothesis testing (e.g., the likelihood ratio test) cannot be interpreted as a statement about the probability of the hypothesis!
     • Example: to test the null hypothesis $H_0: \theta = 0$, draw $n$ normally distributed points (with known variance $\sigma^2$). The $\chi^2$ is distributed as a chi-square distribution with $(n-1)$ degrees of freedom (dof). Pick a significance level $\alpha$ (a p-value threshold, e.g. $\alpha = 0.05$). If $P(\chi^2 > \chi^2_{\rm obs}) < \alpha$, reject the null hypothesis.
     • This is a statement about the probability of observing data as extreme as, or more extreme than, what has been measured, assuming the null hypothesis is correct.
     • It is not a statement about the probability of the null hypothesis itself and cannot be interpreted as such (or you'll make gross mistakes)!
     • The use of p-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred. (Jeffreys, 1961)

  3. Exercise: Is the coin fair?
     • Blue Team: N = 12 (the number of tosses) is fixed, H (the number of heads) is the random variable.
     • Red Team: H = 3 is fixed, N is the random variable.
     • Question: What is the p-value for the null hypothesis?
     • DATA: T T H T H T T T T T T H
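A minimal sketch of how the two teams' answers could be computed, assuming the null hypothesis is a fair coin (p = 0.5) and a one-sided test towards "too few heads". The function names are illustrative, not part of the slides. The Blue Team uses a binomial model (N fixed), the Red Team a negative binomial one (toss until the third head).

```python
from math import comb


def p_value_fixed_n(n_tosses: int, n_heads: int, p: float = 0.5) -> float:
    """Blue Team: P(H <= n_heads | N = n_tosses), binomial stopping rule."""
    return sum(
        comb(n_tosses, h) * p**h * (1 - p) ** (n_tosses - h)
        for h in range(n_heads + 1)
    )


def p_value_fixed_h(n_tosses: int, n_heads: int, p: float = 0.5) -> float:
    """Red Team: P(N >= n_tosses | H = n_heads), negative binomial stopping rule.

    Needing at least n_tosses tosses to see n_heads heads is the same event as
    seeing at most n_heads - 1 heads in the first n_tosses - 1 tosses.
    """
    m = n_tosses - 1
    return sum(comb(m, h) * p**h * (1 - p) ** (m - h) for h in range(n_heads))


data = "TTHTHTTTTTTH"
n_obs, h_obs = len(data), data.count("H")

p_blue = p_value_fixed_n(n_obs, h_obs)
p_red = p_value_fixed_h(n_obs, h_obs)
print(f"Blue Team (N fixed): p = {p_blue:.4f}")
print(f"Red Team  (H fixed): p = {p_red:.4f}")
```

Under these assumptions the same data give two different p-values, one above and one below 0.05, because the p-value depends on outcomes that were never observed.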

  4. The significance of significance
     • Important: a 2-sigma result does not wrongly reject the null hypothesis only 5% of the time: at least 29% of 2-sigma results are wrong!
     • Take an equal mixture of $H_0$, $H_1$.
     • Simulate data, perform hypothesis testing for $H_0$.
     • Select results rejecting $H_0$ at (or within a small range from) the $1-\alpha$ CL (this is the prescription by Fisher).
     • What fraction of those results actually came from $H_0$ ("true nulls", which should not have been rejected)?
     Recommended reading: Sellke, Bayarri & Berger, The American Statistician, 55, 1 (2001)
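The recipe above can be sketched as a small Monte Carlo experiment. Everything specific here is an assumption made for illustration: under H1 the true effect (in units of the measurement error) is drawn from a Gaussian of width `tau`, and "2-sigma results" are those with |z| between 1.96 and 2.06. The fraction of true nulls depends on the choice of H1, so this is one example rather than the general bound.

```python
import random


def fraction_true_nulls(
    n_trials: int = 400_000,
    z_lo: float = 1.96,
    z_hi: float = 2.06,
    tau: float = 3**0.5,  # assumed spread of true effects under H1
    seed: int = 1,
) -> float:
    """Fraction of ~2-sigma results that were generated under H0."""
    rng = random.Random(seed)
    from_h0 = 0
    from_h1 = 0
    for _ in range(n_trials):
        is_null = rng.random() < 0.5  # equal mixture of H0 and H1
        mu = 0.0 if is_null else rng.gauss(0.0, tau)
        z = rng.gauss(mu, 1.0)  # observed significance in sigma units
        if z_lo <= abs(z) <= z_hi:  # keep only results close to 2 sigma
            if is_null:
                from_h0 += 1
            else:
                from_h1 += 1
    return from_h0 / (from_h0 + from_h1)


frac = fraction_true_nulls()
print(f"Fraction of ~2-sigma results that are true nulls: {frac:.2f}")
```

With this particular H1 the fraction comes out at roughly 30%, far above the 5% that a naive reading of the significance level would suggest.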

  5. Bayesian model comparison

  6. The 3 levels of inference
     • LEVEL 1: I have selected a model M and prior $P(\theta|M)$. Parameter inference: what are the favourite values of the parameters? (assumes M is true)
       $P(\theta|d,M) = \frac{P(d|\theta,M)\,P(\theta|M)}{P(d|M)}$
     • LEVEL 2: Actually, there are several possible models: $M_0, M_1, \ldots$ Model comparison: what is the relative plausibility of $M_0, M_1, \ldots$ in light of the data?
       odds $= \frac{P(M_0|d)}{P(M_1|d)}$
     • LEVEL 3: None of the models is clearly the best. Model averaging: what is the inference on the parameters accounting for model uncertainty?
       $P(\theta|d) = \sum_i P(M_i|d)\,P(\theta|d,M_i)$

  7. Examples of model comparison questions
     • ASTROPARTICLE: Gravitational waves detection. Do cosmic rays correlate with AGNs? Which SUSY model is 'best'? Is there evidence for DM modulation? Is there a DM signal in gamma ray/neutrino data?
     • COSMOLOGY: Is the Universe flat? Does dark energy evolve? Are there anomalies in the CMB? Which inflationary model is 'best'? Is there evidence for modified gravity? Are the initial conditions adiabatic?
     • ASTROPHYSICS: Exoplanets detection. Is there a line in this spectrum? Is there a source in this image?
     Many scientific questions are of the model comparison type.

  8. Level 2: model comparison
     $P(\theta|d,M) = \frac{P(d|\theta,M)\,P(\theta|M)}{P(d|M)}$, where the denominator $P(d|M)$ is the Bayesian evidence or model likelihood.
     The evidence is the integral of the likelihood over the prior:
     $P(d|M) = \int_\Omega d\theta\, P(d|\theta,M)\,P(\theta|M)$
     Bayes' Theorem delivers the model's posterior:
     $P(M|d) = \frac{P(d|M)\,P(M)}{P(d)}$
     When we are comparing two models:
     $\frac{P(M_0|d)}{P(M_1|d)} = \frac{P(d|M_0)}{P(d|M_1)}\,\frac{P(M_0)}{P(M_1)}$
     The Bayes factor: $B_{01} \equiv \frac{P(d|M_0)}{P(d|M_1)}$
     Posterior odds = Bayes factor × prior odds
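As an illustration of these definitions, here is a sketch that applies them to a coin-toss data set with 3 heads in 12 tosses. The two models are assumptions chosen for this example: M0 fixes the heads probability at 0.5 (no free parameter), while M1 gives it a uniform prior on [0, 1]. The evidence for M1 is computed both analytically and by a simple midpoint-rule integration of likelihood times prior.

```python
from math import factorial


def likelihood(p: float, n: int, h: int) -> float:
    """Probability of one specific sequence with h heads in n tosses."""
    return p**h * (1 - p) ** (n - h)


def evidence_m1_numeric(n: int, h: int, steps: int = 10_000) -> float:
    """P(d|M1): integrate likelihood x uniform prior with the midpoint rule."""
    dp = 1.0 / steps
    return sum(likelihood((i + 0.5) * dp, n, h) * 1.0 * dp for i in range(steps))


def evidence_m1_exact(n: int, h: int) -> float:
    """Closed form of the same integral: h! (n-h)! / (n+1)!."""
    return factorial(h) * factorial(n - h) / factorial(n + 1)


n, h = 12, 3
evidence_m0 = likelihood(0.5, n, h)  # no free parameter, nothing to integrate
evidence_m1 = evidence_m1_numeric(n, h)

b01 = evidence_m0 / evidence_m1      # Bayes factor
prior_odds = 1.0                     # assumed: both models equally likely a priori
posterior_odds = b01 * prior_odds
prob_m0 = posterior_odds / (1.0 + posterior_odds)

print(f"B01 = {b01:.3f}, posterior odds = {posterior_odds:.3f}, P(M0|d) = {prob_m0:.3f}")
```

Under these assumptions the Bayes factor is close to 1, so the data barely discriminate between the fair coin and the more flexible model.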

  9. Scale for the strength of evidence
     • A (slightly modified) Jeffreys' scale to assess the strength of evidence:

       |lnB|    relative odds    favoured model's probability    Interpretation
       < 1.0    < 3:1            < 0.750                         not worth mentioning
       < 2.5    < 12:1           < 0.923                         weak
       < 5.0    < 150:1          < 0.993                         moderate
       > 5.0    > 150:1          > 0.993                         strong
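A small helper can turn a value of lnB into the quantities in this table. This is a sketch: the function name is made up, the thresholds are the ones listed above, and treating exactly |lnB| = 5.0 as "strong" is an assumption, since the table leaves that boundary open. The probability assumes two models with equal prior probabilities.

```python
from math import exp


def jeffreys_scale(ln_b: float) -> tuple[float, float, str]:
    """Return (relative odds, favoured model's probability, interpretation)."""
    strength = abs(ln_b)
    odds = exp(strength)               # odds in favour of the preferred model
    probability = odds / (1.0 + odds)  # assumes equal prior model probabilities
    if strength < 1.0:
        label = "not worth mentioning"
    elif strength < 2.5:
        label = "weak"
    elif strength < 5.0:
        label = "moderate"
    else:
        label = "strong"
    return odds, probability, label


for ln_b in (0.5, 2.0, 3.0, 6.0):
    odds, prob, label = jeffreys_scale(ln_b)
    print(f"|lnB| = {ln_b:.1f}: odds {odds:.0f}:1, probability {prob:.3f}, {label}")
```

The odds in the table are rounded values of exp(|lnB|), for example exp(2.5) ≈ 12 and exp(5.0) ≈ 148.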

  10. Bayesian model comparison of 193 models, with Higgs inflation as reference model (Martin, RT+14)
     [Figure: models ranked relative to the reference model, from disfavoured to favoured.]

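  11. An automatic Occam's razor
     • Bayes factor balances quality of fit vs extra model complexity.
     • It rewards highly predictive models, penalizing "wasted" parameter space.
     $P(d|M) = \int d\theta\, L(\theta)\,P(\theta|M) \approx P(\hat\theta)\,\delta\theta\,L(\hat\theta) \approx \frac{\delta\theta}{\Delta\theta}\,L(\hat\theta)$
     The ratio $\delta\theta/\Delta\theta$ is the Occam's factor.
     [Figure: likelihood of width $\delta\theta$ peaked at $\hat\theta$, inside a prior of width $\Delta\theta$.]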

  12. The evidence as predictive probability
     • The evidence can be understood as a function of d to give the predictive probability under the model M.
     [Figure: $P(d|M)$ as a function of $d$ for a simpler model $M_0$ and a more complex model $M_1$, with the observed value $d_{\rm obs}$ marked.]

  13. Simple example: nested models
     • This happens often in practice: we have a more complex model, $M_1$ with prior $P(\theta|M_1)$, which reduces to a simpler model ($M_0$) for a certain value of the parameter, e.g. $\theta = \theta^* = 0$ (nested models).
     • Is the extra complexity of $M_1$ warranted by the data?
     [Figure: likelihood of width $\delta\theta$ peaked at $\hat\theta$; prior of width $\Delta\theta$ centred on $\theta^* = 0$.]

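  14. Simple example: nested models
     Define: $\lambda \equiv \frac{\hat\theta - \theta^*}{\delta\theta}$
     For "informative" data:
     $\ln B_{01} \approx \ln\frac{\Delta\theta}{\delta\theta} - \frac{\lambda^2}{2}$
     • First term: wasted parameter space (favours simpler model).
     • Second term: mismatch of prediction with observed data (favours more complex model).
     [Figure: likelihood of width $\delta\theta$ peaked at $\hat\theta$; prior of width $\Delta\theta$ centred on $\theta^* = 0$.]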
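To get a feel for this approximation, the sketch below compares it with an exact result for one specific setup, which is an assumption of this example rather than something stated on the slide: a Gaussian likelihood of width δθ centred on θ̂ and a Gaussian prior under M1 of width Δθ centred on θ* = 0. In that case the Bayes factor has a closed form, and the approximation is its limit for Δθ much larger than δθ.

```python
from math import log


def ln_b01_approx(theta_hat: float, delta: float, prior_width: float) -> float:
    """ln B01 ~ ln(prior_width / delta) - lambda^2 / 2, with theta* = 0."""
    lam = theta_hat / delta
    return log(prior_width / delta) - 0.5 * lam**2


def ln_b01_exact_gaussian(theta_hat: float, delta: float, prior_width: float) -> float:
    """Exact ln B01 for a Gaussian likelihood and a Gaussian prior centred on 0."""
    lam = theta_hat / delta
    ratio_sq = (prior_width / delta) ** 2
    return 0.5 * log(1.0 + ratio_sq) - 0.5 * lam**2 / (1.0 + 1.0 / ratio_sq)


# Illustrative numbers: a 2-sigma offset, with a prior 100 times wider than the likelihood.
theta_hat, delta, prior_width = 0.2, 0.1, 10.0
approx = ln_b01_approx(theta_hat, delta, prior_width)
exact = ln_b01_exact_gaussian(theta_hat, delta, prior_width)
print(f"ln B01: approx = {approx:.3f}, exact (Gaussian prior) = {exact:.3f}")
```

With these numbers ln B01 is positive: even though the best fit sits 2 sigma away from θ*, the wide prior means the simpler model is still favoured.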

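  15. The rough guide to model comparison (Trotta 2008)
     [Figure: model comparison outcome as a function of $I_{10} \equiv \log_{10}\frac{\Delta\theta}{\delta\theta}$, where $\Delta\theta$ = prior width and $\delta\theta$ = likelihood width. Arrows indicate the directions "wider prior (fixed data)" and "larger sample (fixed prior and significance)"; marked points: WMAP1, WMAP3, Planck.]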

  16. "Prior-free" evidence bounds
     • What if we do not know how to set the prior? For nested models, we can still choose a prior that will maximise the support for the more complex model.
     [Figure: same plane as the rough guide, with directions "wider prior (fixed data)" and "larger sample (fixed prior and significance)", and the line of maximum evidence for Model 1.]

  17. Maximum evidence for a detection
     • The absolute upper bound: put all prior mass for the alternative onto the observed maximum likelihood value. Then the Bayes factor $B$ in favour of the alternative satisfies
       $B < \exp(-\Delta\chi^2/2)$
       where $\Delta\chi^2 \le 0$ is the change in $\chi^2$ at the best fit when the extra parameter is added (for an $n$-sigma result, $\ln B < n^2/2$).
     • More reasonable class of priors: symmetric and unimodal around $\Psi = 0$. Then ($\alpha$ = significance level)
       $B < \frac{-1}{\exp(1)\,\alpha \ln\alpha}$
     If the upper bound is small, no other choice of prior will make the extra parameter significant.
     Sellke, Bayarri & Berger, The American Statistician, 55, 1 (2001)

  18. How to interpret the "number of sigma's"

       α         sigma    Absolute bound on lnB (B)    "Reasonable" bound on lnB (B)
       0.05      2        2.0 (7:1), weak              0.9 (3:1), undecided
       0.003     3        4.5 (90:1), moderate         3.0 (21:1), moderate
       0.0003    3.6      6.48 (650:1), strong         5.0 (150:1), strong
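The entries in this table can be reproduced, up to rounding, with a few lines of code. This is a sketch with made-up function names: it takes the absolute bound as ln B < n²/2 for an n-sigma result and the "reasonable" bound as B < −1/(e α ln α), which applies for α < 1/e. The conversion from α to a number of sigma assumes a two-sided Gaussian test.

```python
from math import e, exp, log
from statistics import NormalDist


def sigma_from_alpha(alpha: float) -> float:
    """Number of sigma for a two-sided Gaussian p-value alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def absolute_bound_ln_b(n_sigma: float) -> float:
    """Upper bound on lnB when all prior mass sits on the best-fit value."""
    return n_sigma**2 / 2.0


def reasonable_bound_ln_b(alpha: float) -> float:
    """Upper bound on lnB for symmetric, unimodal priors (alpha < 1/e)."""
    return log(-1.0 / (e * alpha * log(alpha)))


for alpha in (0.05, 0.003, 0.0003):
    n_sigma = sigma_from_alpha(alpha)
    abs_ln_b = absolute_bound_ln_b(n_sigma)
    reas_ln_b = reasonable_bound_ln_b(alpha)
    print(
        f"alpha = {alpha}: {n_sigma:.2f} sigma, "
        f"absolute lnB < {abs_ln_b:.2f} ({exp(abs_ln_b):.0f}:1), "
        f"reasonable lnB < {reas_ln_b:.2f} ({exp(reas_ln_b):.0f}:1)"
    )
```

The small differences from the table come from the table rounding the number of sigma (for example, α = 0.05 corresponds to 1.96 sigma, not exactly 2).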

  19. How to assess p-values
     Rule of thumb: interpret an n-sigma result as an (n-1)-sigma result.
     Sellke, Bayarri & Berger, The American Statistician, 55, 1 (2001)

  20. Computing the model likelihood
     Model likelihood: $P(d|M) = \int_\Omega d\theta\, P(d|\theta,M)\,P(\theta|M)$
     Bayes factor: $B_{01} \equiv \frac{P(d|M_0)}{P(d|M_1)}$
     • Usually computationally demanding: it's a multi-dimensional integral, averaging the likelihood over the (possibly much wider) prior.
     • I'll present two methods used by cosmologists:
       • Savage-Dickey density ratio (Dickey 1971): gives the Bayes factor between nested models (under mild conditions). Can usually be derived from posterior samples of the larger (higher D) model.
       • Nested sampling (Skilling 2004): transforms the D-dimensional integral into a 1D integration. Can be used generally (within the limitations of the efficiency of the sampling method adopted).
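To make the nested sampling idea concrete, here is a deliberately simple toy, not a production implementation. All specifics are assumptions for illustration: a single parameter with a uniform prior on [-5, 5], a Gaussian likelihood of width 0.5 centred on zero, and the standard estimate that the enclosed prior mass shrinks as exp(-i / n_live) after i iterations. Because the likelihood is symmetric and unimodal, drawing a new point above the current likelihood threshold reduces to drawing uniformly with a smaller |θ|; real problems need a dedicated constrained sampler for this step.

```python
import math
import random

SIGMA = 0.5        # assumed likelihood width
HALF_WIDTH = 5.0   # assumed uniform prior on [-HALF_WIDTH, HALF_WIDTH]


def likelihood(theta: float) -> float:
    """Toy Gaussian likelihood, normalised to 1 at its peak."""
    return math.exp(-0.5 * (theta / SIGMA) ** 2)


def nested_sampling_evidence(n_live: int = 400, n_iter: int = 4000, seed: int = 0) -> float:
    rng = random.Random(seed)
    live = [rng.uniform(-HALF_WIDTH, HALF_WIDTH) for _ in range(n_live)]
    evidence = 0.0
    x_prev = 1.0  # prior mass enclosed by the current likelihood contour
    for i in range(1, n_iter + 1):
        # Lowest-likelihood live point = largest |theta| for this likelihood.
        worst = max(live, key=abs)
        x_i = math.exp(-i / n_live)  # expected shrinkage of the prior mass
        evidence += likelihood(worst) * (x_prev - x_i)  # 1D quadrature step
        x_prev = x_i
        # Replace it by a prior draw with higher likelihood, i.e. smaller |theta|.
        bound = abs(worst)
        live[live.index(worst)] = rng.uniform(-bound, bound)
    # Add the contribution of the remaining live points.
    evidence += x_prev * sum(likelihood(t) for t in live) / n_live
    return evidence


z_est = nested_sampling_evidence()
# Analytic evidence: Gaussian integral over the prior range times prior density.
z_true = (
    SIGMA * math.sqrt(2.0 * math.pi)
    * math.erf(HALF_WIDTH / (SIGMA * math.sqrt(2.0)))
    / (2.0 * HALF_WIDTH)
)
print(f"nested sampling Z = {z_est:.4f}, analytic Z = {z_true:.4f}")
```

The estimate carries a statistical scatter that shrinks as the number of live points grows, so agreement with the analytic value is expected only to within several percent here.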

  21. The Savage-Dickey density ratio (Dickey J. M., 1971, Ann. Math. Stat., 42, 204)
     • This method works for nested models and gives the Bayes factor analytically.
     • Assumptions:
       • Nested models: $M_1$ with parameters $(\Psi, \omega)$ reduces to $M_0$ for e.g. $\omega = \omega_\star$.
       • Separable priors: the prior under $M_1$ factorises as $\pi_1(\Psi, \omega|M_1) = \pi_1(\omega|M_1)\,\pi_0(\Psi|M_0)$.
     • Result: $B_{01} = \frac{p(\omega_\star|d)}{\pi_1(\omega_\star)}$, with the marginal posterior under $M_1$ in the numerator and the prior in the denominator.
     • The Bayes factor is the ratio of the normalised (1D) marginal posterior on the additional parameter in $M_1$ over its prior, evaluated at the value of the parameter for which $M_1$ reduces to $M_0$, $\omega = \omega_\star$.
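In practice the marginal posterior at ω⋆ is often estimated from posterior samples. The sketch below mimics that under stated assumptions: the "chain" is drawn directly from a known Gaussian posterior (standing in for MCMC output), the prior on ω is a unit Gaussian, and the posterior density at ω⋆ = 0 is estimated by counting samples in a narrow window. The function name, window width and all numerical values are illustrative choices.

```python
import random
from statistics import NormalDist


def sddr_from_samples(samples, prior: NormalDist, omega_star: float, half_width: float) -> float:
    """Estimate B01 = p(omega_star | d) / prior(omega_star) from posterior samples."""
    inside = sum(1 for s in samples if abs(s - omega_star) < half_width)
    posterior_density = inside / (len(samples) * 2.0 * half_width)
    return posterior_density / prior.pdf(omega_star)


# Assumed toy problem: prior omega ~ N(0, 1), one measurement 0.3 +/- 0.2.
prior = NormalDist(0.0, 1.0)
omega_hat, sigma_like = 0.3, 0.2

# Conjugate Gaussian update gives the exact marginal posterior on omega.
post_var = 1.0 / (1.0 / prior.variance + 1.0 / sigma_like**2)
post_mean = post_var * omega_hat / sigma_like**2
posterior = NormalDist(post_mean, post_var**0.5)

# Stand-in for an MCMC chain: independent draws from the posterior.
rng = random.Random(42)
samples = [rng.gauss(posterior.mean, posterior.stdev) for _ in range(200_000)]

b01_est = sddr_from_samples(samples, prior, omega_star=0.0, half_width=0.05)
b01_exact = posterior.pdf(0.0) / prior.pdf(0.0)
print(f"B01 from samples = {b01_est:.2f}, exact = {b01_exact:.2f}")
```

The window estimate becomes noisy when ω⋆ lies far in the tail of the posterior, because few samples fall there; that is one practical limitation of estimating the ratio from samples.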

  22. Derivation of the SDDR (RT, Mon.Not.Roy.Astron.Soc. 378 (2007) 72-82)
     $P(d|M_0) = \int d\Psi\, \pi_0(\Psi)\, p(d|\Psi, \omega_\star)$
     $P(d|M_1) = \int d\Psi\, d\omega\, \pi_1(\Psi, \omega)\, p(d|\Psi, \omega)$
     Divide and multiply $B_{01}$ by $p(\omega_\star|d) = \frac{p(\omega_\star, \Psi|d)}{p(\Psi|\omega_\star, d)}$:
     $B_{01} = p(\omega_\star|d) \int d\Psi\, \frac{\pi_0(\Psi)\, p(d|\Psi, \omega_\star)\, p(\Psi|\omega_\star, d)}{P(d|M_1)\, p(\omega_\star, \Psi|d)}$
     Since $p(\omega_\star, \Psi|d) = \frac{p(d|\omega_\star, \Psi)\, \pi_1(\omega_\star, \Psi)}{P(d|M_1)}$:
     $B_{01} = p(\omega_\star|d) \int d\Psi\, \frac{\pi_0(\Psi)\, p(\Psi|\omega_\star, d)}{\pi_1(\omega_\star, \Psi)}$
     Assuming separable priors, $\pi_1(\omega, \Psi) = \pi_1(\omega)\, \pi_0(\Psi)$:
     $B_{01} = \frac{p(\omega_\star|d)}{\pi_1(\omega_\star)} \int d\Psi\, p(\Psi|\omega_\star, d) = \frac{p(\omega_\star|d)}{\pi_1(\omega_\star)}$
