
Comparing the Performance of Randomization Tests and Traditional Tests: A Simulation Study

W.B.M.R.D. Wijesuriya 1, C.H.Magalla 2, D. Kasturiratna 3
1,2 Department of Statistics, University of Colombo, Sri Lanka
3 Department of Mathematics and Statistics, Northern Kentucky University, USA
1 rush,wijesuriya@gmail.com, 2 champa@stat.cmb.ac.lk, 3 Kasturirad1@nku.edu

Abstract

Being nonparametric in nature, randomization tests (RTs) differ from parametric statistical tests in many respects and are often assumed to be more robust than parametric tests when the parametric assumptions are violated. However, this belief lacks sufficient evidence, and the virtues of RTs continue to be debated in the literature, often with different conclusions. As a result, researchers are often reluctant to employ RTs, which depart from the status quo, and opt for the traditional tests regardless of the characteristics of their data. Hence this study compares the robustness, in terms of type I error rate and power, of the most widely used classical parametric tests (pooled t test, unpooled t test, paired t test and one way ANOVA F test) with their respective randomization counterparts, using simulations under several trial conditions. While highlighting the seldom recognised potential of RTs, the results showed that, although RTs are more robust in the presence of certain parametric assumption violations, this is not a general rule; RTs should therefore be used only under the conditions shown to be appropriate for each test.

Keywords: Randomization tests, Permutation tests, t test, ANOVA F test, Type I error, Power

I. INTRODUCTION

Parametric statistical tests such as the t test and F test assume that the variable in question has a known underlying distribution that can be defined. In addition, parametric tests make other assumptions about homogeneity of variances and independence of observations (Ludbrook & Dudley, 1998; Berry, Mielke, & Mielke Jr., 2001). These assumptions are indispensable for two reasons. First, they place constraints on the interpretation of the results of the test (Snijders, 2001). Second, the characteristics of the population sampled are used to draw inferences. Hence the parametric assumptions are important in deriving the optimal parametric test.

Randomization tests (RTs) became known through R.A. Fisher's (1935) demonstration that the assumption of normality is not a must for analyzing data (David, 2008). RTs make no reference to a population and hence do not require random sampling (Potvin & Roff, 1993; Ludbrook & Dudley, 1998). RTs only require random assignment of treatments to experimental units. Few experiments in behavioural sciences such as biology, education, psychology, medicine or any other field use randomly selected subjects (Edgington & Onghena, 2007; Huo, Noortgate, Heyvaert, & Onghena, 2010; Huo & Onghena, 2012). According to Hunter and May (1993), in most research the population model of inference enters statistical analysis not because the experimenter wishes to generalize the results to a population, but because the model is so common that many assume it is the only method available for hypothesis testing. RTs also make it possible to assess the statistical significance of nearly any parameter (Peres-Neto & Olden, 2001). In addition, the inferences of RTs refer only to the actual experimental units involved in the experiment or problem.

As many of the restrictions that were placed upon randomization tests are being resolved, more literature on randomization tests is timely so that students, researchers and statisticians in general are well versed in them. Therefore the main objective of this research is to compare the performance (type I error and power) of some of the most commonly used parametric tests (pooled t test, unpooled t test, paired t test and one way ANOVA F test) with the corresponding randomization tests, in order to identify the statistical conditions that favour each type of test.

II. THEORY AND SIMULATION PROCEDURE

The probability of type I error (α) of a test (its size, or the nominal power) is defined as the probability of the test rejecting the null hypothesis (H0) when H0 is true. This probability is often fixed by the statistician. Hence the observed probability should be at least in the proximity of the claimed value in order for the test to be relevant (Higgins, 2004). The value of α was fixed at 5% in this study. The power of a statistical test (1 - β) is defined as the probability of the test rejecting H0 when H0 is false (Mood, Graybill, & Boes, 1950).

RTs belong to a larger class of statistical tests called permutation tests. The procedure for an RT involves reshuffling (permuting) the data and calculating the test statistic for each permutation, in order to compile the sampling distribution of the test statistic. Hence there is no requirement that the test statistic used should conform to a mathematically definable probability distribution (Berry, Mielke, & Mielke Jr., 2001). This means that RTs can determine the p value directly from the data, without the use of reference tables based on probability distributions, unlike the t test or F test. FIG. 1 shows the basic steps followed in performing an RT.

FIG. 1 Flow chart for the basic steps followed in the simulation procedure

In this study, the number of permutations and the number of simulations considered were both 1000. In order to obtain the type I error rates for the RTs, samples were generated under H0 (Δ = 0) for each trial. The p value for each trial was calculated as the proportion of permutations that generated a test statistic equal to or bigger than the test statistic obtained for the original sample. This p value was used to reject or not reject H0 in each trial. The type I error rate for each of the tests was determined by dividing the number of mistakenly rejected H0s by the 1000 trials. To calculate the power, samples were generated under H1 (Δ > 0), the alternative hypothesis (H0 false). The power was then determined as the proportion of correctly rejected H0s in the 1000 trials. The procedure was carried out using R.

The simulation results for the size and power estimates for the unpooled t test, paired t test, F test and the RT under different trial conditions, such as sample size ratios, distribution shapes, effect sizes (Δ) and variance ratios, are summarised in Tables I, II, III and IV. Four different sample sizes (5, 10, 30 and 50) and four types of distributions were considered: Normal and Uniform (symmetric), and Exponential and Gamma (skewed). As the results for both skewed distributions were similar, only the results generated for one type of skewed distribution are included.

III. SIMULATION RESULTS

TABLE I
SIZE AND POWER VALUES (%) OF POOLED T TEST AND RT

n        5:5          10:10         30:30         50:50
Δ    t test     RT t test     RT t test     RT t test     RT

Normal (σ1 = σ2 = 10)
0       4.2    4.9    4.3    4.0    4.8    4.4    5.1    4.7
5      16.1   15.3   29.1   28.3   60.1   58.9   79.3   78.6
10     41.8   40.4   68.2   67.2   98.7   98.7  100.0  100.0

Uniform (σ1 = σ2 = 10): similar type I error and power values as the Normal case

Exponential (σ1 = σ2 = 10)
0       4.7    4.6    4.4    4.6    4.7    4.1    4.3    4.0
5      23.8   24.8   31.3   31.7   63.1   62.4   82.3   82.2
10     49.6   48.6   73.9   73.7   97.8   97.7   99.8   99.8

Normal (σ1 = 10, σ2 = 20)
0       5.3    4.8    5.4    5.4    5.0    4.9    5.0    4.9
5      12.6   11.7   16.7   16.5   35.6   35.1   47.3   47.4
10     26.6   26.0   40.6   39.6   79.3   78.1   94.4   94.0

Uniform (σ1 = 10, σ2 = 20): similar to the Normal case

Exponential (σ1 = 10, σ2 = 20)
0      11.8   11.4    9.6    9.6    9.4    9.0    6.1    6.1
5      22.7   21.9   24.5   24.6   38.5   38.3   52.0   52.0
10     38.3   37.0   46.3   46.0   76.4   76.1   89.2   88.9

n        5:10         10:20         30:60         50:100
Δ    t test     RT t test     RT t test     RT t test     RT

Normal (σ1 = σ2 = 10)
0       5.6    5.6    5.3    5.3    4.5    4.7    4.4    4.8
5      21.7   21.4   33.0   32.3   72.7   71.9   89.1   88.6
10     55.1   54.4   80.0   79.5   99.8   99.7  100.0  100.0

Uniform (σ1 = σ2 = 10): similar to the Normal case

Exponential (σ1 = σ2 = 10)
0       2.9    4.4    3.6    4.7    4.9    5.6    5.2    5.1
5      28.3   32.7   42.6   45.6   73.0   73.7   89.4   89.9
10     60.0   63.0   81.2   81.6   99.4   99.4  100.0  100.0

When the variances were homogeneous, both tests maintained the relevant type I error regardless of the other trial conditions. Although the two tests had very similar powers, the pooled t test was slightly more powerful except in small samples from asymmetric distributions. With unequal sample sizes (unbalanced designs), both tests were relevant across all other trial conditions only if the data were normal or symmetric. Whereas when the data was
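The procedure described in Section II (permute the data, compute the p value as the proportion of permuted statistics at least as large as the original one, then count rejections over repeated trials) can be sketched as follows. This is a minimal illustration in Python, not the authors' R code. The function names are hypothetical, and the test statistic (absolute difference in group means), the rejection rule (p value at most α) and the use of normal data only are assumptions made for the sketch.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(x, y, n_perm=1000, rng=None):
    """Permutation p value for a two-sample comparison.

    Assumption: the test statistic is the absolute difference in means.
    """
    rng = rng or random.Random()
    observed = abs(_mean(x) - _mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign observations to the two groups
        stat = abs(_mean(pooled[:n_x]) - _mean(pooled[n_x:]))
        if stat >= observed:  # equal to or bigger than the original statistic
            count += 1
    return count / n_perm


def rejection_rate(n1, n2, delta, sd1=10.0, sd2=10.0,
                   n_sim=1000, n_perm=1000, alpha=0.05, seed=0):
    """Proportion of simulated trials in which H0 is rejected.

    With delta == 0 this estimates the type I error rate; with
    delta > 0 it estimates the power. Samples are drawn from normal
    distributions only (an assumption of this sketch).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(delta, sd2) for _ in range(n2)]
        # Assumed rejection rule: reject H0 when the p value is at most alpha.
        if permutation_p_value(x, y, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sim
```

For example, `rejection_rate(5, 5, delta=0)` would estimate the size of the RT for the 5:5 Normal case with equal variances, and `rejection_rate(5, 5, delta=5)` the corresponding power. With the study's settings of 1000 permutations and 1000 simulations, a pure-Python run may take a while, so smaller values of `n_sim` and `n_perm` are convenient for a quick check. Estimates will differ from Table I because of simulation noise and the assumptions listed above.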
