Safe Testing
Peter Grünwald
Centrum Wiskunde & Informatica – Amsterdam
Mathematisch Instituut, Universiteit Leiden
Talk at WADAPT 2016, December 2016
Partly based on joint work with Stéphanie van der Pas, Rianne de Heide, Wouter Koolen, Allard Hendriksen

Reproducibility Crisis
• Slate, Sep 10th: "yet another classic finding in psychology, that you can smile your way to happiness, just blew up…"
• Cover story of Economist (2013), Wall Street Journal, Science (2012)

J. Berger (2003, IMS Medallion Lecture): Could Neyman, Fisher and Jeffreys have agreed on testing?
• Jerzy Neyman: alternative exists, "inductive behaviour"
• Sir Ronald Fisher: test statistic rather than alternative, p-value indicates "unlikeliness"
• Sir Harold Jeffreys: Bayesian, alternative exists, inductive behaviour; compression interpretation

80 years and still unresolved...
• Standard method is still p-value-based null hypothesis significance testing
  • ...an amalgam of Neyman-Pearson's and Fisher's 1930s methods
• everybody in psychology and medical sciences does it...
• ...most statisticians agree it's not o.k....
• ...but still can't agree on what to do instead!

P-value Problem #1: Combining Independent Tests
• Suppose two different research groups tested the same new medication. How to combine their test results?
• You can't multiply p-values!
  • This will (wildly) overestimate evidence against the null hypothesis!
• Different valid p-value combination methods exist (Fisher's; Stouffer's) but give different results
• We will present a method in which evidences can be safely multiplied!

P-value Problem #2: Combining Dependent Tests
• Suppose research group A tests medication, gets 'almost significant' result.
• ...whence group B tries again on new data. How to combine their test results?
• Now Fisher's and Stouffer's methods don't work anymore – need complicated methods!
• In our method, despite dependence, evidences can still be safely multiplied
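The warning on the "Combining Independent Tests" slide can be checked numerically. The following is a minimal sketch and not part of the talk: the function names are hypothetical, the p-values are assumed independent and uniform under the null, and the closed form used for Fisher's method holds only for exactly two p-values (it is the survival function of a chi-square distribution with 4 degrees of freedom).

```python
from math import log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def naive_product_error(alpha: float) -> float:
    """P(U1 * U2 <= alpha) for independent uniform p-values U1, U2.

    This is the actual type-I error of the invalid rule
    'reject when the product of the two p-values is at most alpha'.
    """
    return alpha * (1.0 - log(alpha))


def fisher_two(p1: float, p2: float) -> float:
    """Fisher's method for two independent p-values.

    Under the null, -2 * ln(p1 * p2) is chi-square with 4 degrees of
    freedom; its survival function reduces to the expression below.
    """
    product = p1 * p2
    return product * (1.0 - log(product))


def stouffer_two(p1: float, p2: float) -> float:
    """Stouffer's method for two independent one-sided p-values."""
    z1 = STD_NORMAL.inv_cdf(1.0 - p1)
    z2 = STD_NORMAL.inv_cdf(1.0 - p2)
    return 1.0 - STD_NORMAL.cdf((z1 + z2) / sqrt(2.0))
```

Under these assumptions, rejecting when the raw product is below 0.05 has a true error rate of about 0.20. For p1 = p2 = 0.1, Fisher's method gives about 0.056 and Stouffer's about 0.035, so the two valid methods fall on opposite sides of the 0.05 threshold.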
P-value Problem #2b: Extending Your Test
• Suppose research group A tests medication, gets 'almost significant' result.
• Sometimes group A can't resist testing a few more subjects themselves...
• In a recent survey 55% of psychologists admit to having succumbed to this practice [L. John et al., Psychological Science, 23(5), 2012]
• But isn't this just cheating?
• Not clear: what if you submit a paper and the referee asks you to test a couple more subjects? Should you refuse because it invalidates your p-values!?
• In our method, despite dependence, evidences can still be safely multiplied

Should we be Bayesian?
• These and several other problems with p-values attracted a lot of attention in the 1960s and...
• ...caused several people to become Bayesian
• and right now there's a Bayesian revolution in psychology...
• As we will see though, Bayesian methods don't fully resolve the issues at hand
• We propose a new method that does: Safe Testing
  • for simple $H_0$, all Bayes factor tests are also Safe Tests
  • for composite $H_0$, Bayes factor tests are usually not safe (t-test, independence testing)

Safe (i.e. adaptive) Testing
• We aim for a 'safe' or adaptive method that better suits the real-life research world where obviously either you yourself or another research group wants to, and will, study more data given preliminary test results that are promising but inconclusive!

Earlier Work
• The simple $H_0$ case (and related developments) was essentially covered in work by Volodya Vovk and collaborators (1993, 2001, 2011, ...)
  • see esp. Shafer, Shen, Vereshchagin, Vovk: Test Martingales, Bayes Factors and p-values, 2011
• Also Jim Berger and collaborators have earlier ideas in this direction (1994, 2001, ...)
• Both Berger and Vovk inspired by the great Jack Kiefer
• The only thing that is really radically new here is the treatment of composite $H_0$ and its relation to reverse-information projection
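The "Extending Your Test" practice can be made concrete with a small simulation. This is a sketch under assumptions that are not in the talk: data are standard normal under the null, the test is a two-sided z-test with known variance 1, 'almost significant' means 0.05 < p <= 0.10, and the sample sizes, seed and function names are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(values: list[float]) -> float:
    """Two-sided z-test p-value for mean 0, assuming known variance 1."""
    z = sum(values) / sqrt(len(values))
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate(trials: int = 20000, n: int = 20, extra: int = 10,
             seed: int = 1) -> tuple[float, float]:
    """Return (fixed-n rate, extend-if-promising rate) of false rejections.

    All data are generated under the null, so every rejection is an error.
    """
    rng = random.Random(seed)
    fixed = extended = 0
    for _ in range(trials):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        p = two_sided_p(data)
        if p <= 0.05:
            # Significant at the planned sample size: both rules reject.
            fixed += 1
            extended += 1
        elif p <= 0.10:
            # 'Almost significant': test a few more subjects and look again.
            data += [rng.gauss(0.0, 1.0) for _ in range(extra)]
            if two_sided_p(data) <= 0.05:
                extended += 1
    return fixed / trials, extended / trials
```

Calling `simulate()` returns two rates. The first stays near the nominal 0.05, while the second is larger, because every rejection at the planned sample size is kept and the second look can only add more.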
Menu
1. Some of the problems with p-values
2. Safe Testing
  • ...solves the adaptivity problem
  • gambling interpretation
3. Safe Testing, simple (singleton) $H_0$
  • relation to Bayes
  • relation to MDL (data compression)
4. Safe Testing, Composite $H_0$
  • Magic: RIPr (Reverse Information Projection)
  • Examples: Safe t-Test, Safe Independence Test

Null Hypothesis Testing
• Let $H_0 = \{P_\theta \mid \theta \in \Theta_0\}$ represent the null hypothesis
• For simplicity, assume data $X_1, X_2, \ldots$ are i.i.d. under all $P \in H_0$.
• Let $H_1 = \{P_\theta \mid \theta \in \Theta_1\}$ represent alternative hypothesis
• Example (simple $H_0$): testing whether a coin is fair
  • Under $P_\theta$, data are i.i.d. Bernoulli($\theta$)
  • $\Theta_0 = \{\tfrac{1}{2}\}$, $\Theta_1 = [0,1] \setminus \{\tfrac{1}{2}\}$
  • Standard test would measure frequency of 1s
• Example (composite $H_0$): t-test (most used test world-wide)
  • $H_0$: $X_i \sim$ i.i.d. $N(0, \sigma^2)$ vs. $H_1$: $X_i \sim$ i.i.d. $N(\mu, \sigma^2)$ for some $\mu \neq 0$
  • $\sigma^2$ unknown ('nuisance') parameter
  • $H_0 = \{P_\sigma \mid \sigma \in (0,\infty)\}$
  • $H_1 = \{P_{\sigma,\mu} \mid \sigma \in (0,\infty),\ \mu \in \mathbb{R} \setminus \{0\}\}$
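To illustrate why the unknown variance in the t-test example is called a nuisance parameter, here is a minimal sketch (not from the talk; the function name and sample values are made up). The one-sample t statistic does not change when all observations are multiplied by a positive constant, so its behaviour under $H_0$ does not depend on which $\sigma$ generated the data.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(xs: list[float]) -> float:
    """One-sample t statistic for testing mean 0."""
    # Sample mean divided by its estimated standard error.
    return mean(xs) / (stdev(xs) / sqrt(len(xs)))


# Hypothetical measurements, and the same data on a scale 10 times larger.
sample = [0.8, -0.3, 1.9, 0.4, 1.1]
rescaled = [10.0 * x for x in sample]

# The two statistics agree up to floating-point rounding.
difference = abs(t_statistic(sample) - t_statistic(rescaled))
```

Rescaling multiplies both the sample mean and the sample standard deviation by the same factor, which cancels in the ratio.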
Safe Test: General Definition
• Let $H_0 = \{P_\theta \mid \theta \in \Theta_0\}$ represent the null hypothesis
• Assume data $X_1, X_2, \ldots$ are i.i.d. under all $P \in H_0$.
• Let $H_1 = \{P_\theta \mid \theta \in \Theta_1\}$ represent alternative hypothesis
• A test is a function [formula not preserved in this transcript]
• A safe test for sample size $n$ is a test such that for all $P_0 \in H_0$, we have [formula not preserved in this transcript]

General Definition
• Let $T$ be a positive-integer valued random variable
• A safe test for stopping time $T$ is a test such that for all $P_0 \in H_0$, we have [formula not preserved in this transcript]

First Interpretation: p-values
• Proposition: Let $M$ be a safe test. Then $M^{-1}(X^T)$ is a nonstrict p-value, i.e. a p-value with wiggle room:
  • for all $P \in H_0$, all $0 \le \alpha \le 1$, [formula not preserved in this transcript]
• Proof: just Markov's inequality!
• Hence if we reject $H_0$ iff $M^{-1}(X^T) < 0.05$, then we have Type-I Error Bound of 0.05

Safe Tests are Safe ('Adaptive')
• Suppose we observe data $(X_1, Y_1), (X_2, Y_2), \ldots$
  • $Y_i$: side information, independent of $X_i$'s
• Let $M_1, M_2, \ldots, M_k$ be an arbitrarily large collection of (potentially identical) safe tests for sample sizes $n_1, n_2, \ldots, n_k$ respectively.
• Suppose we first perform test $M_1$.
• If outcome is in certain range (e.g. promising but not conclusive) and $Y^{n_1}$ has certain values (e.g. 'boss has money to collect more data') then we perform test $M_2$; otherwise we stop.
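The displayed formulas on these slides did not survive in this transcript. The sketch below therefore rests on an assumption: that a safe test $M$ is a nonnegative function of the data with $E_{P_0}[M(X^n)] \le 1$ for all $P_0 \in H_0$. This reading is consistent with the one-line proof, since Markov's inequality then gives $P_0(M^{-1} \le \alpha) = P_0(M \ge 1/\alpha) \le \alpha \, E_{P_0}[M] \le \alpha$. As an illustration it uses the fair-coin null with a Bayes factor under a uniform prior on $\theta$ (an arbitrary choice; the talk only states that Bayes factor tests are safe for simple $H_0$). All function names are hypothetical, and everything is computed exactly by enumerating the number of ones.

```python
from fractions import Fraction
from math import comb


def bayes_factor(n: int, k: int) -> Fraction:
    """Uniform-prior Bayes factor against the fair coin, for a sequence
    of n tosses containing k ones.

    Marginal likelihood under the uniform prior: 1 / ((n + 1) * C(n, k)).
    Likelihood under theta = 1/2: 2 ** -n.
    """
    return Fraction(2 ** n, (n + 1) * comb(n, k))


def null_expectation(n: int) -> Fraction:
    """Exact expectation of the Bayes factor when the coin is fair."""
    return sum(Fraction(comb(n, k), 2 ** n) * bayes_factor(n, k)
               for k in range(n + 1))


def null_rejection_probability(n: int, alpha: Fraction) -> Fraction:
    """Exact null probability that 1 / M <= alpha, i.e. M >= 1 / alpha."""
    return sum(Fraction(comb(n, k), 2 ** n)
               for k in range(n + 1) if bayes_factor(n, k) >= 1 / alpha)


def continued_expectation(n1: int, n2: int,
                          low: Fraction, high: Fraction) -> Fraction:
    """Exact null expectation of the multiplied evidence when a second
    test on n2 fresh tosses is run only if the first lands in [low, high)."""
    total = Fraction(0)
    for k1 in range(n1 + 1):
        p1, m1 = Fraction(comb(n1, k1), 2 ** n1), bayes_factor(n1, k1)
        if low <= m1 < high:
            # Promising but inconclusive: collect more data and multiply.
            for k2 in range(n2 + 1):
                p2 = Fraction(comb(n2, k2), 2 ** n2)
                total += p1 * p2 * m1 * bayes_factor(n2, k2)
        else:
            total += p1 * m1
    return total
```

Under these assumptions `null_expectation(n)` equals 1 for every `n`, the rejection probability at `alpha = Fraction(1, 20)` stays below 0.05, and the expectation of the multiplied evidence remains 1 even though the decision to run the second test depends on the first outcome.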