Statistical testing in the era of big data (p < 0.05)
Dimitri Van De Ville
MIP:lab IBI-STI/CNP (EPFL) RADIO (UniGE)
http://miplab.epfl.ch/ @dvdevill
#CNP Retreat Feb 11-12, 2020
CNP Retreat 2020 — Stats Workshop Panic Dimitri Van De Ville 2
Big Data
Paradox?
Is big data destroying p-values?
[Figure labels: "Mount p<0.05", "Big Data"]
Roadmap of the workshop
▪ Contradictory tendencies
  ▪ Many (emotive) reports about a p-value crisis
  ▪ Reviewers even more picky on statistical significance
  ▪ Sufficient power, multiple comparisons, replication, …
  ▪ Adage: Never enough data
  ▪ Big data has arrived, and will become bigger
▪ Is classical hypothesis testing doomed? Should we all go into Bayesian statistics? Will machine-learning approaches be the only solution?
▪ Here, revisit basic statistical hypothesis testing
  ▪ to understand the core issue
  ▪ to solve it within the conventional framework
One-sample t-test in a nutshell
▪ Consider $N$ samples modeled to reflect a true effect $\mu$ with a random Gaussian* deviation $e_n \sim \mathcal{N}(0, \sigma^2)$: $x_n = \mu + e_n$, $n = 1, \dots, N$
▪ Estimator $\hat{\mu}$ of $\mu$ is the average
▪ Estimator of the uncertainty on $\mu$ is the standard deviation $\hat{\sigma}$
▪ We define $t = \sqrt{N}\,\hat{\mu} / \hat{\sigma}$
▪ Question: is there evidence from the data that the underlying $\mu \neq 0$?
* Popularity of Gaussian hypothesis? Central limit theorem!
One-sample t-test in a nutshell
▪ Null hypothesis $\mathcal{H}_0$: no effect, $\mu = 0$
▪ (Implicit) alternative hypothesis $\mathcal{H}_1$: $\mu \neq 0$
▪ Under the null, $t$ follows a known distribution (Student t-distribution with $N - 1$ degrees of freedom)
▪ $p$-value is the probability to mistakenly reject $\mathcal{H}_0$: $p = P(|t| > T \mid \mathcal{H}_0)$
▪ Result is considered significant if $p < 0.05$
"If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty or one in a hundred. Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level."
R.A. Fisher, "The arrangement of field experiments". Journal of the Ministry of Agriculture of Great Britain. 33:503-513, 1926
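The definitions of $t$ and $p$ above can be sketched in a few lines of Python. This is a minimal illustration, assuming SciPy is available. The function name `one_sample_t` and the sample values are made up for the example and do not come from the workshop.

```python
import math
import statistics
from scipy import stats

def one_sample_t(x):
    """Return (t, two-sided p) for H0: mu = 0, following the slide's definitions."""
    n = len(x)
    mu_hat = statistics.fmean(x)       # estimator of mu: the average
    sigma_hat = statistics.stdev(x)    # sample standard deviation (n - 1 denominator)
    t = math.sqrt(n) * mu_hat / sigma_hat
    # p = P(|t| > T | H0), using the Student t-distribution with n - 1 degrees of freedom
    p = 2.0 * float(stats.t.sf(abs(t), n - 1))
    return t, p

# Hypothetical data, for illustration only
samples = [0.3, -0.1, 0.8, 0.5, 0.2, 0.9, -0.4, 0.6, 0.7, 0.1]
t_value, p_value = one_sample_t(samples)
print(f"t = {t_value:.3f}, p = {p_value:.4f}")
```

The same numbers should come out of `scipy.stats.ttest_1samp(samples, 0.0)`, which is the more usual call in practice.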
One-sample t-test in a nutshell
▪ Thus, the $p$-value indicates the probability of a false positive (FP)
▪ Typically, no explicit $\mathcal{H}_1$:
  ▪ No control on false negatives; i.e., $P(|t| \leq T \mid \mathcal{H}_0 \text{ not true})$
  ▪ One can only control specificity (1 - FP rate), not sensitivity (1 - FN rate)
  ▪ No proof of no effect because there is no point of comparison
Fallacy of statistical testing
▪ Any true effect $\mu_0 \neq 0$ can become significant for sufficiently large $N$: $\sqrt{N} > T \sigma / \mu_0$, i.e., $N > T^2 \sigma^2 / \mu_0^2$
▪ "[$N$] must be big enough that an effect of such magnitude as to be of scientific significance will also be statistically significant. It is just as important, however, that the study not be too big, where an effect of little scientific importance is nevertheless statistically detectable" [Lenth, 2001]
▪ As $N$ increases, discriminability, as measured by classification accuracy, of individual samples becomes very small
▪ As $N$ increases, consistency, as measured by population prevalence, of the effect becomes very small
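As a worked example of the bound $N > T^2 \sigma^2 / \mu_0^2$, the sketch below computes the smallest such $N$ for a few standardized effects $d = \mu_0 / \sigma$. It assumes the two-sided normal critical value (about 1.96) as a large-$N$ stand-in for the Student threshold $T$, so the small-$N$ answers are only rough. The function name `min_sample_size` is illustrative.

```python
import math
from statistics import NormalDist

def min_sample_size(d, alpha=0.05):
    """Smallest integer N with sqrt(N) * d > T, i.e. N > (T / d)**2.

    Assumption: T is the two-sided normal critical value, a large-N
    approximation of the Student t threshold.
    """
    T = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.floor((T / d) ** 2) + 1

for d in (1.0, 0.5, 0.25, 0.125):
    print(f"d = {d}: N >= {min_sample_size(d)}")
```

Under this approximation a standardized effect of 1/8 crosses the threshold at a few hundred samples. Note that this is the point where the expected statistic $d\sqrt{N}$ exceeds $T$, not a guarantee of detection in any single study.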
Effect size
▪ Bottom line: $p$-values are relevant if the effect size is non-trivial!
▪ Standardized effect size: Cohen's $d = t / \sqrt{N}$; $R^2 = \mu^2 / (\mu^2 + \sigma^2)$; $\rho = \sqrt{R^2}$

Effect size   Cohen's d     Coefficient of determination R^2   Correlation   Classification accuracy   Population prevalence
Large         ~1            ~1/2 = 0.50                        ~0.71         ~70%                      ~50%
Medium        ~1/2 = 0.50   ~1/5 = 0.20                        ~0.45         ~60%                      ~20%
Small         ~1/4 = 0.25   ~1/17 = 0.06                       ~0.24         ~55%                      ~6%
Trivial       ~1/8 = 0.13   ~1/65 = 0.02                       ~0.12         ~52.5%                    ~1%
None          0             0                                  0             50%                       0%

▪ "… one should be cautious that extremely large studies may be more likely to find a formally statistical significant difference for a trivial effect that is not really meaningfully different from the null." (Ioannidis, 2005)
[Friston, NeuroImage, 2012]
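The $R^2$ and correlation columns follow directly from the formulas above with $d = \mu / \sigma$, which gives $R^2 = d^2 / (1 + d^2)$. The short sketch below recomputes them. The classification accuracy is computed here as $\Phi(d/2)$, which is an assumption on my part: it reproduces the tabulated values, but the slide does not state the formula. Population prevalence is not recomputed.

```python
import math
from statistics import NormalDist

def effect_size_measures(d):
    """Return (R^2, rho, accuracy) for a standardized effect size d = mu / sigma."""
    r2 = d**2 / (1.0 + d**2)              # R^2 = mu^2 / (mu^2 + sigma^2)
    rho = math.sqrt(r2)                   # correlation
    accuracy = NormalDist().cdf(d / 2.0)  # assumed formula, matches the table
    return r2, rho, accuracy

for label, d in [("Large", 1.0), ("Medium", 0.5), ("Small", 0.25), ("Trivial", 0.125)]:
    r2, rho, acc = effect_size_measures(d)
    print(f"{label:8s} d = {d:<6} R^2 = {r2:.3f}  rho = {rho:.2f}  accuracy = {acc:.1%}")
```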
Sample size and sensitivity
▪ Consider now fixed specificity, $\alpha = 0.05$, then we have
  $\alpha = \int_{u(\alpha)}^{\infty} T(t; N-1)\,dt$
[Figure: density of $T \mid \mathcal{H}_0$ with threshold $u(\alpha)$ and tail area $\alpha$]
[Friston, NeuroImage, 2012]
Sample size and sensitivity
▪ Consider now fixed specificity, $\alpha = 0.05$, then we have
  $\alpha = \int_{u(\alpha)}^{\infty} T(t; N-1)\,dt$
▪ Under the assumption of a true effect size $d$, we can compute sensitivity as
  $1 - \beta(d) = \int_{u(\alpha)}^{\infty} T(t; N-1, d\sqrt{N})\,dt$
  where $T(t; K, \delta)$ is the non-central t-distribution with $K$ degrees of freedom and non-centrality parameter $\delta$
▪ Sensitivity depends on sample size ($N$) and effect size ($d$)
[Figure: densities of $T \mid \mathcal{H}_0$ and $T \mid \mathcal{H}_1$ with threshold $u(\alpha)$; tail areas $\alpha$ and $1 - \beta$]
[Friston, NeuroImage, 2012]
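The two integrals above map onto SciPy's central and non-central t-distributions. The following is a minimal sketch, assuming SciPy is available and using the one-sided threshold $u(\alpha)$ exactly as written on the slide. The function name `sensitivity` is illustrative.

```python
import math
from scipy import stats

def sensitivity(d, n, alpha=0.05):
    """1 - beta(d): probability that t exceeds u(alpha) when the true effect size is d."""
    df = n - 1
    u = stats.t.ppf(1.0 - alpha, df)   # threshold with upper tail area alpha under H0
    # Upper tail of the non-central t with non-centrality parameter d * sqrt(N)
    return float(stats.nct.sf(u, df, d * math.sqrt(n)))

for n in (8, 16, 32, 64):
    print(n, round(sensitivity(1.0, n), 3), round(sensitivity(0.125, n), 3))
```

Printing a large effect ($d = 1$) next to a trivial one ($d = 1/8$) shows both dependencies at once: sensitivity grows with $N$ for any non-zero $d$, and far faster for the large effect.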
Under-powered?
▪ Sensitivity depends on sample size ($N$) and effect size ($d$)
▪ A significant effect with a small sample size is likely to be caused by a large effect size!
▪ If you are criticized for an under-powered study, a possible reply:
  "The fact that we have demonstrated a significant result in a relatively under-powered study suggests that the effect size is large. This means, quantitatively, our result is stronger than if we had used a larger sample-size."
▪ The criticism is a conflation of significance and power
[Figure: sensitivity $1 - \beta(d)$, scale 0% to 100%]
[Friston, NeuroImage, 2012]
Over-powered?
▪ Sensitivity depends on sample size ($N$) and effect size ($d$)
▪ Sensitivity to trivial effect sizes increases with sample size!
▪ Ultimately, with very large sample sizes, sensitivity will reach 100% for every non-null effect size
▪ Explains a lot about the crisis!
▪ More is not better
[Figure: sensitivity (0 to 1) versus sample size (10 to 100)]
[Friston, NeuroImage, 2012]
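To see the sensitivity to a trivial effect approach 100%, it helps to look well beyond the 10 to 100 range of the plot. The sketch below uses a large-$N$ normal approximation, $1 - \beta(d) \approx 1 - \Phi(z_{1-\alpha} - d\sqrt{N})$, in place of the non-central t-distribution. That substitution is my assumption, made to keep the example dependency-free, and it is slightly optimistic for small $N$.

```python
import math
from statistics import NormalDist

def approx_sensitivity(d, n, alpha=0.05):
    """Large-N normal approximation of the one-sided sensitivity 1 - beta(d)."""
    z = NormalDist()
    return 1.0 - z.cdf(z.inv_cdf(1.0 - alpha) - d * math.sqrt(n))

# Trivial effect size d = 1/8 at increasingly large sample sizes
for n in (10, 100, 1_000, 10_000):
    print(n, round(approx_sensitivity(0.125, n), 3))
```

For $d = 0$ the function returns $\alpha$ regardless of $N$, which is the flat false-positive rate. For any $d > 0$ it tends to 1 as $N$ grows.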
Loss-function analysis
▪ Let us define a simple loss function $l$:
  ▪ Cost of $+1$ for detecting a trivial effect size of $1/8$ [bad]
  ▪ Cost of $-1$ for detecting a large effect size of $1$ [good]
▪ Expected loss:
  $l = (1 - \beta(1/8)) - (1 - \beta(1)) = \beta(1) - \beta(1/8)$
▪ Optimal sample size at minimal loss
▪ Does not increase dramatically even if significance needs to be (much) stronger (e.g., due to multiple comparisons)
[Figure: two panels of loss (0 to -1) versus sample size (10 to 100)]
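The expected loss can be evaluated numerically over a range of sample sizes to locate its minimum. This is a sketch under stated assumptions: SciPy is available, the threshold is one-sided as in the sensitivity formula, and the search range of 4 to 100 and the stricter level $\alpha = 0.001$ are illustrative choices rather than values from the slides.

```python
import math
from scipy import stats

def sensitivity(d, n, alpha):
    """1 - beta(d) from the non-central t-distribution, one-sided threshold u(alpha)."""
    df = n - 1
    u = stats.t.ppf(1.0 - alpha, df)
    return float(stats.nct.sf(u, df, d * math.sqrt(n)))

def expected_loss(n, alpha=0.05, trivial=0.125, large=1.0):
    """l = (1 - beta(1/8)) - (1 - beta(1)): +1 per trivial detection, -1 per large one."""
    return sensitivity(trivial, n, alpha) - sensitivity(large, n, alpha)

sizes = range(4, 101)
best_n = min(sizes, key=expected_loss)
strict_n = min(sizes, key=lambda n: expected_loss(n, alpha=0.001))
print("alpha = 0.05 :", best_n, round(expected_loss(best_n), 3))
print("alpha = 0.001:", strict_n, round(expected_loss(strict_n, alpha=0.001), 3))
```

Comparing the two printed optima gives a feel for the last bullet: a much stricter significance level moves the optimal sample size, but it stays within the same order of magnitude.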
Protected inference
▪ Inference is based on controlling the FP rate under $\mathcal{H}_0$, which translates into a flat sensitivity at $\alpha$ for no effect:
  ▪ FP rate $\alpha$ (1 - specificity) = sensitivity to null effects
▪ So let us suppress sensitivity to trivial effects instead!
  $1 - \beta(d) = \int_{u(\alpha)}^{\infty} T(t; N-1, d\sqrt{N})\,dt$
  where this time we use
  $\alpha(d') = \int_{u(\alpha)}^{\infty} T(t; N-1, d'\sqrt{N})\,dt$ with $d' = 1/8$
[Figure: two panels of sensitivity (0 to 1) versus sample size (10 to 100); the flat curve is labeled "specificity"]
[Friston, NeuroImage, 2012]
Protected inference
▪ Protection fixes the sensitivity to trivial effects, $1 - \beta(1/8) = 0.05$, and thus increasing $N$ becomes harmless
▪ Concretely, the threshold to be applied to $t$-values is penalized
[Figure: two panels versus sample size (10 to 100): T threshold (1 to 10) with curves "protection" and "no protection", and loss (0 to -1)]
[Friston, NeuroImage, 2012]
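The penalized threshold can be obtained by inverting the non-central t-distribution at the trivial effect size $d' = 1/8$ instead of at zero. The sketch below assumes SciPy is available. The function names `thresholds` and `protected_sensitivity` are illustrative and not taken from the workshop or from [Friston, NeuroImage, 2012].

```python
import math
from scipy import stats

def thresholds(n, alpha=0.05, d_trivial=0.125):
    """Return (unprotected, protected) thresholds u(alpha) for sample size n."""
    df = n - 1
    u_plain = stats.t.ppf(1.0 - alpha, df)
    # Protected: tail area alpha under the trivial effect d' rather than under d = 0
    u_protected = stats.nct.ppf(1.0 - alpha, df, d_trivial * math.sqrt(n))
    return float(u_plain), float(u_protected)

def protected_sensitivity(d, n, alpha=0.05, d_trivial=0.125):
    """Sensitivity to effect size d when the protected threshold is applied."""
    _, u = thresholds(n, alpha, d_trivial)
    return float(stats.nct.sf(u, n - 1, d * math.sqrt(n)))

for n in (10, 50, 100):
    u_plain, u_prot = thresholds(n)
    print(n, round(u_plain, 2), round(u_prot, 2), round(protected_sensitivity(1.0, n), 3))
```

By construction, `protected_sensitivity(0.125, n)` stays at $\alpha$ for every $n$, while the protected threshold itself grows with $N$. That growth is the penalty the second bullet refers to.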
A note on non-parametric testing
▪ Consider $N$ samples modeled to reflect a true effect $\mu$ with a random deviation of unknown, but symmetric distribution: $x_n = \mu + e_n$, $n = 1, \dots, N$
▪ Estimator $\hat{\mu}$ of $\mu$ is the average (could also be the median, etc.)
▪ Null hypothesis $\mathcal{H}_0$: no effect, $\mu = 0$
▪ In that case, we can randomly flip or permute the signs of $x_n$ and recompute our measure of interest under the null as $\hat{\mu}_k^{(0)}$, $k = 1, \dots, K$
▪ If $\hat{\mu} > \max_k \hat{\mu}_k^{(0)}$ or $\hat{\mu} < \min_k \hat{\mu}_k^{(0)}$, then $\mathcal{H}_0$ is rejected with $p = 2/(K+1)$
▪ Use $K = 39$ randomizations to be able to assess 0.05 significance
▪ Fewer assumptions about the distribution, but essentially the same problem: trivial effects will be picked up as $N$ increases
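The sign-flipping procedure above can be sketched with the standard library alone. This is a minimal illustration, assuming independent random sign flips per sample. The function name `sign_flip_test`, the fixed seed and the toy data are hypothetical.

```python
import random
from statistics import fmean

def sign_flip_test(x, k=39, seed=0):
    """Sign-flip randomization test of H0: mu = 0 for symmetric noise.

    Returns (mu_hat, rejected, p_level) where p_level = 2 / (k + 1).
    """
    rng = random.Random(seed)
    mu_hat = fmean(x)
    null_means = []
    for _ in range(k):
        # Under H0 with a symmetric distribution, each sign is equally likely
        flipped = [v if rng.random() < 0.5 else -v for v in x]
        null_means.append(fmean(flipped))
    rejected = mu_hat > max(null_means) or mu_hat < min(null_means)
    return mu_hat, rejected, 2.0 / (k + 1)

# Toy data: a clearly shifted sample and a perfectly balanced one
shifted = [0.8 + 0.01 * i for i in range(30)]
balanced = [-2.0, -1.0, 1.0, 2.0] * 5
print(sign_flip_test(shifted))
print(sign_flip_test(balanced))
```

With $K = 39$ the smallest attainable level is $2/40 = 0.05$, which is why that number of randomizations is just enough for a two-sided test at 0.05.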