Everything is Dangerous: A Controversy
S. Stanley Young, National Institute of Statistical Sciences, June 2008
28-Jul-07 Stan Young, www.NISS.org 1
We examine the statistical analysis strategies of epidemiologists and statisticians using an evaluation method taken from Thomas Kuhn. Kuhn says that it is relatively easy to understand the paradigm of a science by examining its papers, texts and journals. The epidemiology paradigm is to make no correction for multiple testing. The statistics paradigm is to protect against false discoveries that arise by chance. Let me say at the beginning that I think medical observational studies are important and can be analyzed in a manner that makes their claims dependable. Epidemiologists are well-versed in statistics and are capable of defending their paradigm.
Abstract
Some multiple testing mistakes are due to ignorance (how often are you asked to re-examine the data to see if something can be found?), but others are intentional, following a (faulty) scientific paradigm; over $1B of grant/tax money flows to institutions with reproducibility problems revolving around multiple testing. Statisticians need to understand other scientists’ paradigms. It serves neither society nor our profession to ignore multiple testing controversies. At a minimum we need to protect the integrity of our profession. We present evidence of a false discovery rate over 80%. We present a survey of journal editors on multiple testing whose responses support the epidemiology paradigm of making no correction for multiple testing and not sharing data sets.
The basic thesis is quite simple. The statistical analysis/scientific method paradigm of epidemiologists is not to correct for any multiple testing. Also, as part of their scientific paradigm, they ask many questions, often hundreds to thousands, of the same data set. Their position is that it is better to miss nothing real than to control the number of false claims they make. The statisticians’ paradigm is to control the probability of making a false claim. We have a clash of paradigms. Empirical evidence is that 80-90% of the claims made by epidemiologists are false; these claims do not replicate when retested under rigorous conditions.
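Here is a rough illustration of why asking hundreds of questions matters. Suppose each of k questions is tested at α = 0.05. As a simplifying assumption, the tests are independent and every null hypothesis is true. Then the chance of at least one false claim is 1 − (1 − α)^k. The function name below is hypothetical; this is a minimal sketch, not a model of any particular study.

```python
def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among k independent
    tests of true null hypotheses, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** k

# How the chance of at least one false claim grows with the number of questions.
for k in (1, 10, 100, 1000):
    print(f"{k:5d} questions: P(at least one false claim) = {familywise_error(k):.3f}")
```

With 100 questions the chance of at least one false claim already exceeds 0.99. Independence is an idealization here; the point is how quickly the chance grows with k.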
Epidemiology: Recent Claims that do not Replicate
“The reliability of results from observational studies has been called into question many times in the recent past, with several analyses showing that well over half of the reported findings are subsequently refuted.” JNCI, 2007
1. Calcium + vitamin D for bone fractures
2. Hormone replacement therapy for dementia, CHD, breast cancer, stroke
3. Vitamin E for CHD
4. Fluoride for vertebral fractures
5. Diuretic in diabetes patients for mortality
6. Low fat diet for colorectal cancer, CHD, breast cancer
7. Beta carotene for CHD
8. Growth hormone for mortality
9. Low dose aspirin for stroke, MI, and death
10. Knee surgery and pain
11. Statins for cancer and mortality
12. Wound dressing on healing speed
1/20, 5% !!
The NIH has funded a large number of randomized clinical trials (RCTs) testing claims that came from observational studies. Of 20 claims coming from observational studies, only one replicated when tested in an RCT. The overall picture is one of crisis.
Beginnings
What is the meaning of life? What is real? What is reproducible? Fooled by randomness?
We leave the meaning of life to the philosophers. Psychologists and physicists can ponder what is real. As scientists, we focus on which phenomena are reproducible. If I conduct an experiment and tell you how I did it, you should get roughly similar results when you conduct a similar experiment. The effects of randomness are subtle. Humans have to be very vigilant and work very hard not to be fooled by randomness. See two books by Nassim Taleb, Fooled by Randomness and The Black Swan. Some other time, it would be interesting to go into how humans use randomness to fool other humans.
Escaping the Bonferroni iron claw in ecological studies
“Lottery tickets should not be free. In such purely random and independent events as the lottery, the probability of having a winning number depends directly on the number of tickets you have purchased. When one evaluates the outcome of a scientific work, attention must be given not only to the potential interest of the ‘significant’ outcomes but also to the number of ‘lottery tickets’ the authors have ‘bought’. Those having many have a much higher chance of ‘winning a lottery prize’ than of getting a meaningful scientific result. It would be unfair not to distinguish between significant results of well-planned, powerful, sharply focused studies, and those from ‘fishing expeditions’ with a much higher probability of catching an old truck tyre than of a really big fish.”
Multiple testing is not just a problem of epidemiology. I use epidemiology as an example because epidemiologists do not correct for multiple testing as part of their scientific paradigm. They understand multiple testing; they are not doing what they do out of ignorance. See, for example, Vandenbroucke, PLoS Med (2008). Clinical trials and genetics/genomics are two sciences that take multiple testing seriously.
Non-randomized Studies Fail to Replicate
Ioannidis, JAMA 2005: ~80% (5/6) of efficacy findings based on non-randomized trials were already contradicted or found to be exaggerated by 2004. Even among highly-cited randomized trials, efficacy findings were already contradicted or found to be exaggerated for ~20% (9/39) of interventions. (Keep in mind power.) See also Pocock, BMJ 2004.
Ioannidis, in the Journal of the American Medical Association, examined highly cited medical trials, both non-randomized and randomized. He found that most claims coming from non-randomized trials either failed to replicate or showed a dramatically smaller effect when the claim was tested a second time. Ioannidis noted that claims coming from randomized medical trials failed to replicate about 20% of the time. Stuart Pocock, in BMJ, catalogues the current problems with the reporting of epidemiology studies. There are so many problems that it is difficult to say that multiple testing is the largest one. I think Pocock underestimates the multiple testing problem when he says the false discovery rate of epidemiology is on the order of 20%. On the other hand, the multiple testing problem is relatively easy to fix: copy the statistical strategies used in randomized clinical trials.
Outline
1. The question.
2. Two proofs.
3. Two paradigms.
4. Crisis?
The motivation of this lecture is that effects found in non-randomized (epidemiology) medical studies are failing to replicate when tested in randomized clinical trials. The false discovery rate of epidemiology studies, ~80-90%, might be considered excessive. We present two paradigms for the analysis of non-randomized studies. The epidemiology paradigm is to test many questions with no adjustment for multiple testing. The statistics paradigm is to correct the analysis for the number of questions asked, controlling the false positive rate at a fixed level, usually 5%. Is there a crisis? When an important science, epidemiology, has a false discovery rate of 80-90%, there appears to be one.
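The Bonferroni correction is one standard way to keep the chance of any false positive at 5% across m questions. It calls a result significant only if its p-value is at most α/m, or equivalently if m·p ≤ α. The slides do not prescribe a particular correction; Bonferroni is shown here only as the simplest example. The function name and the p-values below are hypothetical.

```python
from typing import List

def bonferroni_significant(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag each p-value as significant under a Bonferroni correction,
    which keeps the family-wise error rate at or below alpha."""
    m = len(p_values)
    threshold = alpha / m  # each individual test is run at level alpha / m
    return [p <= threshold for p in p_values]

# Hypothetical p-values from 10 questions asked of one data set.
pvals = [0.003, 0.02, 0.04, 0.30, 0.55, 0.61, 0.72, 0.81, 0.90, 0.97]
print([p <= 0.05 for p in pvals])      # unadjusted: three "findings"
print(bonferroni_significant(pvals))   # adjusted: threshold is 0.05 / 10 = 0.005
```

Under the unadjusted rule three of the ten questions look like findings. After the correction only the smallest p-value survives.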
Statistical Fun and Games
Example: 54 p-values, smallest is 0.003. 54 × 0.003 = 0.162. Over 51,000 Google hits!
The claim coming from this paper is not significant when multiple testing is taken into account. The paper was wildly popular in the press; it was even written up in the Economist. After much discussion, intervention by the editor, and the signing of a rather restrictive legal document, it appears that we will get the data set from the authors.
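The slide's arithmetic is the Bonferroni adjustment of the smallest p-value. The second line is an added check, not from the slide. It assumes the 54 tests are independent and all null hypotheses are true, and gives the exact chance that at least one of the 54 p-values reaches 0.003. That chance is slightly smaller than the Bonferroni bound, but either way the claim falls far short of 0.05:

```latex
\begin{aligned}
p_{\mathrm{Bonf}} &= \min(1,\, m\,p_{\min}) = \min(1,\, 54 \times 0.003) = 0.162 > 0.05,\\
\Pr\bigl(\min_i p_i \le 0.003 \mid \text{all nulls true}\bigr) &= 1 - (1 - 0.003)^{54} \approx 0.150 .
\end{aligned}
```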
Proof: Every study is positive
1. Bias
2. Multiple testing
Either or both lead to all observational studies being positive!
Unless the statistical analysis of an observational study is carefully and conscientiously conducted, every study will have one or more statistically significant effects. We choose to focus on two statistical issues, bias and multiple testing.
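A simulation can check the multiple-testing half of this argument. The sketch below builds a hypothetical null study in which treated and control values come from the same normal distribution, so no outcome has a real effect. Each of many outcomes is compared with a two-sample z-test at 0.05, assuming a known unit variance for simplicity. The function name, group size, number of outcomes and seed are illustrative choices, not taken from any real study.

```python
import math
import random
from statistics import NormalDist, fmean

def null_study_positives(n_outcomes: int, n_per_group: int, seed: int = 1) -> int:
    """Count outcomes with p < 0.05 in a simulated study where no effect is real."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    positives = 0
    for _ in range(n_outcomes):
        # Treated and control values are drawn from the same N(0, 1) distribution.
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # Two-sample z statistic, using the known unit variance.
        z = (fmean(treated) - fmean(control)) / math.sqrt(2.0 / n_per_group)
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p < 0.05:
            positives += 1
    return positives

print(null_study_positives(n_outcomes=200, n_per_group=50))
```

About 5% of the 200 outcomes, roughly ten, will typically come out "significant" even though nothing is going on. Reporting only those outcomes makes the study look positive.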
First, Bias
Consider a linear model for a treated individual and a control individual. Let X1t be the treatment indicator for a treated individual, taking the value 1, and X1c the indicator for a control individual, taking the value 0. The remaining X's are covariates. If we average over all the treated and all the control individuals and subtract the two resulting equations, we get a delta for the difference between treated and control individuals. If we then move all the known confounders to the left side of the equation, we remove the effect of the known confounders. Unknown confounders are still confounded with the treatment difference and can confuse the interpretation of the data.
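The derivation described in these notes can be written out as follows. The notation is assumed for illustration: coefficients β_j, errors ε, covariates X_2, …, X_k as the known confounders, and X_{k+1}, …, X_p as the unknown ones.

```latex
\begin{aligned}
Y_t &= \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + \dots + \beta_p X_{pt} + \varepsilon_t, \quad X_{1t} = 1,\\
Y_c &= \beta_0 + \beta_1 X_{1c} + \beta_2 X_{2c} + \dots + \beta_p X_{pc} + \varepsilon_c, \quad X_{1c} = 0,\\
\Delta = \bar{Y}_t - \bar{Y}_c &= \beta_1 + \sum_{j=2}^{p} \beta_j\,(\bar{X}_{jt} - \bar{X}_{jc}) + (\bar{\varepsilon}_t - \bar{\varepsilon}_c),\\
\Delta - \sum_{j=2}^{k} \beta_j\,(\bar{X}_{jt} - \bar{X}_{jc}) &= \beta_1 + \sum_{j=k+1}^{p} \beta_j\,(\bar{X}_{jt} - \bar{X}_{jc}) + (\bar{\varepsilon}_t - \bar{\varepsilon}_c).
\end{aligned}
```

The remaining sum over j = k+1, …, p is the bias from unknown confounders. It vanishes only if those covariates are balanced between the treated and control groups.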
Randomized Clinical Trial
[Figure: outcome distributions for control (C) and treated (T) patients; one panel shows C ≈ T, the other shows T shifted away from C.]
In an RCT, randomization largely, but not completely, removes the effects of bias. If there is no treatment effect, the two distributions lie on top of one another. If treatment has an effect, it moves the distribution of the treated patients (red) away from that of the control patients. If both the effect and the sample size are large enough, the treatment effect will be detected.
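This sketch gives a rough sense of how large is "large enough". It assumes a two-sided two-sample z-test at α = 0.05 with known unit standard deviation and n patients per arm. `delta` is the shift of the treated distribution in standard-deviation units. These are all assumptions for illustration; real trials use more careful power calculations, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def z_test_power(delta: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample z-test when the treated
    distribution is shifted by delta standard deviations."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = delta / math.sqrt(2.0 / n_per_arm)  # mean of the z statistic
    # Probability that |Z| exceeds the critical value when Z ~ N(shift, 1).
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Power for a shift of 0.3 standard deviations at several sample sizes.
for n in (20, 50, 100, 200):
    print(n, round(z_test_power(0.3, n), 3))
```

With delta = 0 the function returns α, the false positive rate, which corresponds to the two distributions lying on top of one another. Power then rises with both the size of the shift and the number of patients per arm.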