

  1. Testing for Tensions Between Datasets
     David Parkinson, University of Queensland
     In collaboration with Shahab Joudaki (Oxford)

  2. Outline
     • Introduction
     • Statistical Inference
     • Methods
     • Linear models
     • Example using WL and CMB data
     • Conclusions

  3. What is Probability?
     • In 1812 Laplace published his Analytic Theory of Probabilities
     • He suggested the computation of "the probability of causes and future events, derived from past events"
     • "Every event being determined by the general laws of the universe, there is only probability relative to us."
     • "Probability is relative, in part to [our] ignorance, in part to our knowledge."
     • So to Laplace, probability theory is applied to our level of knowledge
     [Image: portrait of Pierre-Simon Laplace]

  4. Comparing datasets
     • As there is only one Universe (setting aside the Multiverse), we make observations of un-repeatable 'experiments'
     • Therefore we have to proceed by inference
     • Furthermore we cannot check or probe for biases by repeating the experiment - we cannot 'restart the Universe' (however much we may want to)
     • If there is a tension (i.e. if two data sets don't agree), we can't take the data again; we need instead to make inferences with the data we have
     [Figure: σ8 vs Ωm constraints from KiDS-450, CFHTLenS (MID J16), WMAP9+ACT+SPT and Planck15]
     [Figure: fσ8(z) vs z, DR12 final consensus against Planck ΛCDM, with 6dFGS, SDSS MGS, GAMA, WiggleZ and VIPERS points, assuming Planck ΛCDM cosmology; Alam et al 2016]

  5. Rules of Probability
     • We define probability to have numerical value
     • We define the lower bound, for logical absurdities, to be zero: P(∅) = 0
     • We normalise it so the sum of the probabilities over all options is unity: Σᵢ P(Aᵢ) ≡ 1
     • Sum Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
     • Product Rule: P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
     [Diagram: Venn diagram of two overlapping sets A and B]
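
A quick numeric check of the two rules (numbers invented purely for illustration): with P(A) = 0.5, P(B) = 0.4 and P(A ∩ B) = 0.2, the sum rule gives P(A ∪ B) = 0.5 + 0.4 - 0.2 = 0.7, while the product rule gives P(B|A) = P(A ∩ B)/P(A) = 0.4.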

  6. Bayes Theorem
     • Bayes theorem is easily derived from the product rule:
       P(A|B) = P(B|A) P(A) / P(B)
     • We have some model M, with some unknown parameters θ, and want to test it with some data D (a toy sketch follows below):
       P(θ|D,M) = P(D|θ,M) P(θ|M) / P(D|M)
       (posterior = likelihood × prior / evidence)
     • Here we apply probability to models and parameters, as well as data
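
A minimal grid-based sketch of the theorem in practice (not from the talk; all names and numbers are invented): infer the mean mu of Gaussian data with known sigma.

    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.normal(loc=0.3, scale=1.0, size=20)   # simulated data D
    sigma = 1.0

    mu = np.linspace(-2.0, 2.0, 2001)                # parameter grid, theta = mu
    dmu = mu[1] - mu[0]
    prior = np.exp(-0.5 * mu**2)                     # unit-width Gaussian prior P(theta|M)
    prior /= prior.sum() * dmu                       # normalise on the grid

    loglike = -0.5 * ((data[None, :] - mu[:, None]) ** 2).sum(axis=1) / sigma**2
    like = np.exp(loglike - loglike.max())           # likelihood P(D|theta,M), constant dropped

    evidence = (like * prior).sum() * dmu            # P(D|M), same constant dropped
    posterior = like * prior / evidence              # Bayes theorem
    print("posterior mean:", (mu * posterior).sum() * dmu)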

  7. Model Selection
     • If we marginalise over the parameter uncertainties, we are left with the marginal likelihood, or evidence:
       E = P(D|M) = ∫ P(D|θ,M) P(θ|M) dθ
     • If we compare the evidences of two different models, we find the Bayes factor (a toy comparison follows below):
       P(M1|D) / P(M2|D) = [ P(D|M1) / P(D|M2) ] × [ P(M1) / P(M2) ]
       (model posterior ratio = evidence ratio × model prior ratio)
     • Bayes theorem provides a consistent framework for choosing between different models
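
Continuing the illustrative grid sketch above: compare M1 (mu free, with the evidence just computed) against a hypothetical M0 that fixes mu = 0 and so has no free parameters.

    # M0 has no free parameters, so its evidence is just its likelihood;
    # the same additive constant was dropped from both, so the ratio is consistent.
    loglike_M0 = -0.5 * np.sum(data**2) / sigma**2
    evidence_M0 = np.exp(loglike_M0 - loglike.max())
    print("ln Bayes factor (M1 vs M0):", np.log(evidence / evidence_M0))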

  8. Occam’s Razor
     E = ∫ dθ P(D|θ,M) P(θ|M) ≈ P(D|θ̂,M) × (δθ / Δθ)
       (best-fit likelihood × Occam factor)
     • The Occam factor rewards the model with the least amount of wasted parameter space ("most predictive")
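
As an invented one-dimensional illustration: a prior of width Δθ = 10 that is compressed to a posterior of width δθ = 0.1 contributes an Occam factor of 0.1/10 = 10⁻², i.e. it lowers ln E by ln 100 ≈ 4.6 relative to the best-fit likelihood alone, a penalty a model without the extra parameter would not pay.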

  9. Bayesian Model Comparison
     • Jeffreys' (1961) scale:

       Difference           Odds      Jeffreys (1961)   Trotta (2006)
       Δln(E) < 1           3:1       No evidence       No evidence
       1 < Δln(E) < 2.5     12:1      substantial       weak
       2.5 < Δln(E) < 5     150:1     strong            moderate
       Δln(E) > 5           >150:1    decisive          strong

     • If model priors are equal, the evidence ratio and the Bayes factor are the same
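
Note that the odds column is just the exponential of the log-evidence difference: e^1 ≈ 3, e^2.5 ≈ 12 and e^5 ≈ 148, which round to the 3:1, 12:1 and 150:1 thresholds above.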

  10. Information Criteria
     • Instead of using the evidence (which is difficult to calculate accurately) we can approximate it using an information criterion statistic
     • Ability to fit the data (chi-squared) is penalised by (lack of) predictivity
     • The smaller the value of the IC, the better the model
     • Bayesian Information Criterion:
       BIC = χ²(θ̂) + k ln N
       where k is the number of free parameters and N is the number of data points
     • Deviance Information Criterion (Spiegelhalter et al. 2002), sketched below:
       DIC = χ²(θ̂) + 2 C_b
       where C_b is the complexity, equal to the number of well-measured parameters
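
A minimal sketch of both criteria, assuming χ² = -2 ln L values are already in hand: chi2_best and chain_chi2 are placeholders for a best-fit value and per-sample values from an MCMC chain, and taking the point estimate θ̂ to be the posterior mean is one common choice.

    import numpy as np

    def bic(chi2_best, k, n_data):
        # BIC = chi2(theta_hat) + k ln N
        return chi2_best + k * np.log(n_data)

    def dic(chain_chi2, chi2_at_best):
        # DIC = chi2(theta_hat) + 2*C_b, with the complexity estimated as
        # C_b = <chi2> - chi2(theta_hat), averaging over the posterior chain
        c_b = np.mean(chain_chi2) - chi2_at_best
        return chi2_at_best + 2.0 * c_b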

  11. Complexity
     • The DIC penalises models based on the Bayesian complexity, the number of well-measured parameters
     • This can be computed through the information gain (KL divergence) between the prior and posterior, minus a point estimate:
       C_b = -2 ( D_KL[ P(θ|D,M) || P(θ|M) ] - D̂_KL )
     • For a simple Gaussian likelihood (toy check below), this is given by
       C_b = ⟨χ²(θ)⟩ - χ²(θ̄)
       where the average is over the posterior
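
A toy check (all numbers invented) that C_b counts well-measured parameters: for a 1-D Gaussian posterior matching a Gaussian likelihood, C_b should come out near 1.

    import numpy as np

    rng = np.random.default_rng(2)
    samples = rng.normal(loc=1.0, scale=0.2, size=100_000)   # posterior draws

    def chi2(theta):
        # hypothetical Gaussian likelihood centred where the posterior is
        return ((theta - 1.0) / 0.2) ** 2

    c_b = chi2(samples).mean() - chi2(samples.mean())
    print("C_b =", c_b)   # ~1.0 for one well-constrained parameter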

  12. Tensions
     • Tensions occur when two datasets have different preferred values (posterior distributions) for some common parameters
     • This can arise due to:
       • random chance
       • systematic errors
       • undiscovered physics
     [Figure: σ8 vs Ωm contours for KiDS-450, CFHTLenS (MID J16), WMAP9+ACT+SPT and Planck15]

  13. Diagnostic statistics
     • Need to diagnose not whether the model is correct, but whether the tension is significant
     • Simple test: χ² per degree of freedom
       • Equivalent to a p-value test on the data
       • Only a point estimate though
     • Raveri (2015): the evidence ratio
       C(D1, D2, M) = P(D1 ∪ D2 | M) / [ P(D1|M) P(D2|M) ]
     • Joudaki et al (2016): change in DIC (a code sketch follows below)
       ΔDIC = DIC(D1 ∪ D2) - DIC(D1) - DIC(D2)
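
A sketch of the ΔDIC statistic, reusing the hypothetical dic() helper from the slide 10 example and assuming separate runs on D1, D2 and the joint data set:

    # Positive values indicate tension between D1 and D2
    def delta_dic(dic_joint, dic_1, dic_2):
        return dic_joint - dic_1 - dic_2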

  14. Linear evidence
     P(D|M) = L₀ × [ |F|^(-1/2) / |Π|^(-1/2) ] × exp[ -½ ( θ̄_Lᵀ L θ̄_L + θ̄_πᵀ Π θ̄_π - θ̄ᵀ F θ̄ ) ]
              (1)        (2)                       (3)
     • Evidence in the linear case depends on (numerical check below):
       1. the likelihood normalisation
       2. the Occam factor (compression of prior into posterior)
       3. the displacement between prior and posterior
     • In the linear case, the final Fisher information matrix is the sum of prior and likelihood (F = L + Π)
     • If the prior is wide, Π is small (so the displacement is minimised), but the Occam factor is larger
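
A 1-D numerical check of this decomposition with invented numbers: Lp and Pi are the likelihood and prior precisions (the 1-D analogues of the matrices L and Π), and F = Lp + Pi.

    import numpy as np

    L0, Lp, Pi = 1.0, 4.0, 0.25
    tL, tpi = 1.2, 0.0                       # likelihood and prior means
    F = Lp + Pi                              # posterior (Fisher) precision
    tbar = (Lp * tL + Pi * tpi) / F          # posterior mean

    analytic = (L0 * np.sqrt(Pi / F)
                * np.exp(-0.5 * (Lp * tL**2 + Pi * tpi**2 - F * tbar**2)))

    theta = np.linspace(-20.0, 20.0, 200_001)   # brute-force integral for comparison
    dth = theta[1] - theta[0]
    like = L0 * np.exp(-0.5 * Lp * (theta - tL) ** 2)
    prior = np.sqrt(Pi / (2 * np.pi)) * np.exp(-0.5 * Pi * (theta - tpi) ** 2)
    print(analytic, (like * prior).sum() * dth)   # the two should agree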

  15. Simple linear model
     [Image credit: Tamara Davis]

  16. Diagnostics II: The Surprise
     • Seehars et al (2016): the 'Surprise' statistic, based on the cross entropy of two distributions
     • The cross entropy is given by the KL divergence between the original (D1) and updated (D2) dataset (a Gaussian example follows below):
       D_KL( P(θ|D2) || P(θ|D1) ) = ∫ P(θ|D2) log[ P(θ|D2) / P(θ|D1) ] dθ
     • The Surprise is the difference of the observed KL divergence relative to that expected, where the expectation assumes consistency:
       S ≡ D_KL( P(θ|D2) || P(θ|D1) ) - ⟨D⟩
     • One data set is assumed to be 'ground truth', and the information gain is considered in light of updating with the additional data
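
The KL divergence has a closed form between 1-D Gaussian posteriors, the ingredient of the Surprise; the numbers in the example call are invented.

    import numpy as np

    def kl_gauss(mu2, sig2, mu1, sig1):
        # D_KL( N(mu2, sig2^2) || N(mu1, sig1^2) ), in nats
        return (np.log(sig1 / sig2)
                + (sig2**2 + (mu2 - mu1) ** 2) / (2.0 * sig1**2) - 0.5)

    print(kl_gauss(0.80, 0.02, 0.84, 0.03))   # updated vs original constraint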

  17. Linear tension
     P(D_{1+2}|M) / [ P(D1|M) P(D2|M) ]
         = [ L₀^{1+2} / ( L₀^1 L₀^2 ) ] × [ |F_{1+2}|^(-1/2) / ( |F_1|^(-1/2) |F_2|^(-1/2) ) ] × displacement terms
     • The displacement terms are equivalent to the 'Surprise' - the relative entropy between the two distributions
     • The Occam factor (second term) is independent of tensions
     • Tensions manifest in the first and third terms - the best-fit likelihood and the displacement

  18. Linear DIC
     • The ΔDIC statistic has two components
     • Difference in mean parameter (best fit) likelihood:
       Δχ² = χ²_{1+2} - χ²_1 - χ²_2
     • Difference in penalty term (complexity):
       ΔC_b = C_b^{1+2} - C_b^1 - C_b^2
     • In the linear case, the final Fisher matrix is the sum of the individual matrices, so the complexity doesn't change
     • The tension statistic (in the linear case) is therefore driven entirely by the difference in best-fit likelihood

  19. Linear Surprise
     • The Surprise is the difference between the information gain (going from data set D1 to D2) and the expected information gain
     • In the linear case, the KL divergence can be written
       D_KL = -½ [ ⟨χ²_{1+2}(θ)⟩ - ⟨χ²_1(θ)⟩ ]
     • For the expectation of the information gain, we need to average over possible outcomes for the combined data set
     • But in the linear case, this corresponds to the maximum likelihood, where the information gain is evaluated at the posterior maximum:
       ⟨D⟩ = -½ [ χ²_{1+2}(θ̄) - χ²_1(θ̄) ]
     • This is not the same as the complexity change, even though it looks similar, as the averaging happens over the final posterior, not the individual ones (sketch below):
       S = D_KL - ⟨D⟩ = ½ [ χ²_{1+2}(θ̄) - χ²_1(θ̄) - ( ⟨χ²_{1+2}(θ)⟩ - ⟨χ²_1(θ)⟩ ) ]
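
A sketch of the linear-case Surprise as code: chi2_12 and chi2_1 are placeholder functions for the combined and original chi-squared, samples_12 are draws from the combined posterior, and theta_bar is its mean.

    import numpy as np

    def surprise(chi2_12, chi2_1, samples_12, theta_bar):
        at_mean = chi2_12(theta_bar) - chi2_1(theta_bar)
        avg = np.mean([chi2_12(t) - chi2_1(t) for t in samples_12])
        return 0.5 * (at_mean - avg)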

  20. Pros and Cons

     Approach                  Like ratio   Evidence   DIC   Surprise
     Average over parameters   No           Yes        Yes   Yes
     From MCMC chain           Yes          No         Yes   Yes
     Probabilistic             Yes          Yes        Yes   No
     Symmetric                 Yes          Yes        Yes   No

  21. DIC
     • Simple 5th-order polynomial model, with the second data set offset from the first
     • The complexity of each individual data set, and also of the combined data, is the same
     • Both measure the 5 free parameters well
     • The DIC only changes due to the worsening of χ²
     • The ΔDIC goes from negative (agreement) to positive (tension) as the offset increases
     • Odds ratio of agreement (worked example below):
       I(D1, D2) ≡ exp{ -ΔDIC(D1, D2) / 2 }
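
The odds-ratio reading of ΔDIC in code; the example value is the constant-w dark energy row from the KiDS table on slide 23.

    import numpy as np

    def agreement_odds(delta_dic):
        # I(D1, D2) = exp(-Delta DIC / 2); values >> 1 favour agreement
        return np.exp(-delta_dic / 2.0)

    print(agreement_odds(-1.98))   # ~2.7, i.e. roughly 2.7:1 odds of agreement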

  22. KiDS vs Planck
     • All tensions considered here are in light of a particular model
     • If the model is changed, the tension may be alleviated
     • This is not the same as model selection

  23. Application to lensing data
     • In Joudaki et al (2016) they compared the ΛCDM cosmological constraints from Planck CMB data with KiDS-450 weak lensing data
     • Including curvature worsened the tension, but allowing for dynamical dark energy improved the agreement

     Model                        T(S8)    ΔDIC     Interpretation
     fiducial systematics         2.1σ     1.26     Small tension
     extended systematics         1.8σ     1.4      Small tension
     large scales                 1.9σ     1.24     Small tension
     Neutrino mass                2.4σ     0.022    Marginal case
     Curvature                    3.5σ     3.4      Large tension
     Dark energy (constant w)     0.89σ    -1.98    Agreement
     Curvature + dark energy      2.1σ     -1.18    Agreement
