Chapter 6 Hypothesis Testing
What is Hypothesis Testing? • … the use of statistical procedures to answer research questions • Typical research question (generic): • For hypothesis testing, research questions are statements: • This is the null hypothesis (assumption of “no difference”) • Statistical procedures seek to reject or fail to reject the null hypothesis (details to follow)
• Thus far: – You have generated a hypothesis (e.g., the mean of group A is different from the mean of group B) – You have collected some data (samples in group A, samples in group B) – Now you want to know if these data support your hypothesis – Formally: – H0 (null hypothesis): there is no difference in the mean values of group A and group B – H1 (experimental hypothesis): there is a difference in the mean values of group A and group B
A practitioner’s point of view • Test statistic – Inferential statistics tell us how likely the observed data would be if the null hypothesis were true → by computing a test statistic. – Typically, if the probability of obtaining a test statistic at least as extreme as the one observed is < 0.05, we reject the null hypothesis – Reported as “…significant effect of …” • Non-significant results – Do not mean that the null hypothesis is true – Interpreted to mean that the observed difference could be a chance finding • Significant result – Means that data like these would be highly unlikely if the null hypothesis were true
A practitioner’s point of view • Errors: – Type 1 error (false positive): we believe that there is an effect when there isn’t one – Type 2 error (false negative): we believe that there isn’t an effect when there is one – If we reject the null hypothesis only when p < 0.05, the probability of a Type 1 error when there is truly no effect is 5% (the alpha level) • Typically, we deal with two types of hypotheses – The mean of group A is different from the mean of group B (two-tailed test) – The mean of group A is larger than the mean of group B (one-tailed test)
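The link between the alpha level and Type 1 errors can be checked by simulation. The sketch below is illustrative only: it draws both groups from the same population (so the null hypothesis is true by construction) and counts how often a two-tailed test comes out "significant". For simplicity it uses a z-test with a known standard deviation rather than a t-test, and the sample size, number of trials, and seed are arbitrary assumptions.

```python
import random
from statistics import NormalDist, mean


def two_sided_z_p(sample_a, sample_b, sigma):
    """p-value of a two-tailed z-test for a difference in means (sigma known)."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(trials=2000, n=20, alpha=0.05, seed=1):
    """Share of experiments that reject H0 although H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same population: there is no real effect.
        group_a = [rng.gauss(0, 1) for _ in range(n)]
        group_b = [rng.gauss(0, 1) for _ in range(n)]
        if two_sided_z_p(group_a, group_b, sigma=1.0) < alpha:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"Share of 'significant' results with no real effect: {rate:.3f}")
```

With alpha set to 0.05, the printed share should land close to 5%, which is what the alpha level promises in the long run.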
Statistical Procedures • Two types: – Parametric • Data are assumed to come from a particular distribution, such as the normal distribution, t-distribution, etc. – Non-parametric • Data are not assumed to come from a particular distribution – Lots of debate on assumptions testing and what to do if assumptions are not met (avoided here, for the most part) – A reasonable basis for deciding on the most appropriate test is to match the type of test with the measurement scale of the data (next slide)
Measurement Scales vs. Statistical Tests • Examples – Nominal: M=Male, F=Female – Ordinal: preference ranking – Interval: Likert scale responses – Ratio: task completion time • Parametric tests most appropriate for… – Ratio data, interval data • Non-parametric tests most appropriate for… – Ordinal data, nominal data (although limited use for ratio and interval data)
Tests Presented Here • Parametric – T-test – Analysis of variance (ANOVA) – Most common statistical procedures in HCI research
T-test • Goal: To ascertain if the difference in the means of two groups is significant • Assumptions – Data are normally distributed (you checked for this by looking at the histograms, reporting the mean/median/standard deviation, and by running the Shapiro-Wilk test) – If data come from different groups of people → independent t-test (assumes scores are independent and variances in the populations are roughly equal … check your table of descriptive statistics) – If data come from the same group of people → dependent t-test • Practitioner’s point of view: When in doubt, consult a book! Let’s do an example in R
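As a rough illustration of the two variants, the following sketch computes the t statistic and its degrees of freedom for the dependent (paired) and independent (pooled-variance) cases. It is written in Python with only the standard library, the function names are made up for this example, and the completion times are hypothetical. Turning t into a p-value needs a t-distribution table or a statistics package, which is not shown.

```python
from math import sqrt
from statistics import mean, stdev, variance


def dependent_t(x, y):
    """Paired t statistic: the same participants measured under both conditions."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1  # statistic, degrees of freedom


def independent_t(x, y):
    """Pooled-variance t statistic: different participants in each group.

    Assumes roughly equal population variances, as noted on the slide.
    """
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # statistic, degrees of freedom


# Hypothetical task completion times in seconds (not data from the slides).
method_a = [5.3, 4.8, 6.1, 5.5, 4.9, 5.8]
method_b = [4.6, 4.5, 5.2, 5.1, 4.4, 5.0]

print("dependent:  t = %.2f, df = %d" % dependent_t(method_a, method_b))
print("independent: t = %.2f, df = %d" % independent_t(method_a, method_b))
```

Running both functions on the same numbers shows why the design matters: the paired version works on per-participant differences, so consistent differences give a larger t than the independent version reports.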
Tests Presented Here • Parametric – Analysis of variance (ANOVA) • Used for ratio data and interval data • Most common statistical procedure in HCI research • Non-parametric – Chi-square test • Used for nominal data – Mann-Whitney U, Wilcoxon Signed-Rank, Kruskal-Wallis, and Friedman tests • Used for ordinal data
Analysis of Variance • The analysis of variance (ANOVA) is the most widely used statistical test for hypothesis testing in factorial experiments • Goal → determine if an independent variable has a significant effect on a dependent variable • Remember, an independent variable has at least two levels (test conditions) • Goal (put another way) → determine if the test conditions yield different outcomes on the dependent variable (e.g., one of the test conditions is faster/slower than the other)
Why Analyse the Variance? • Seems odd that we analyse the variance, but the research question is concerned with the overall means: • Let’s explain through two simple examples (next slide)
Example #1 • “Significant” implies that in all likelihood the difference observed is due to the test conditions (Method A vs. Method B). Example #2 • “Not significant” implies that the difference observed is likely due to chance. File: 06-AnovaDemo.xlsx
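Why the variance matters can be sketched with a small single-factor, within-subjects ANOVA. The two data sets below are hypothetical (they are not the values in 06-AnovaDemo.xlsx) and are constructed so that both have the same condition means, 5.0 and 4.0. Only the consistency of the per-participant differences changes, and with it the F statistic.

```python
from statistics import mean


def within_subjects_anova(scores):
    """Single-factor, within-subjects ANOVA.

    scores[i][j] is participant i's score under test condition j.
    Returns (F, df_effect, df_residual).
    """
    m, n = len(scores), len(scores[0])  # participants, test conditions
    grand = mean(v for row in scores for v in row)
    condition_means = [mean(row[j] for row in scores) for j in range(n)]
    participant_means = [mean(row) for row in scores]

    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_effect = m * sum((c - grand) ** 2 for c in condition_means)
    ss_participants = n * sum((p - grand) ** 2 for p in participant_means)
    ss_residual = ss_total - ss_effect - ss_participants

    df_effect, df_residual = n - 1, (n - 1) * (m - 1)
    f = (ss_effect / df_effect) / (ss_residual / df_residual)
    return f, df_effect, df_residual


# Hypothetical scores for six participants under Method A and Method B.
method_a = [5, 6, 4, 5, 6, 4]
consistent_b = [4.2, 4.8, 3.0, 4.1, 4.9, 3.0]  # everyone improves by about 1
noisy_b = [2, 7, 2, 7, 2, 4]                   # same mean, erratic differences

f_consistent, df1, df2 = within_subjects_anova(list(zip(method_a, consistent_b)))
f_noisy, _, _ = within_subjects_anova(list(zip(method_a, noisy_b)))
print(f"consistent differences: F {df1},{df2} = {f_consistent:.2f}")
print(f"noisy differences:      F {df1},{df2} = {f_noisy:.2f}")
```

The difference in means is identical in both cases, yet the F statistic is far larger when the differences are consistent across participants. That ratio of effect variance to residual variance is what the ANOVA evaluates.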
Example #1 - Details • Note: Within-subjects design • Error bars show ±1 standard deviation • Note: SD is the square root of the variance
Example #1 – ANOVA • p: probability of obtaining the observed data if the null hypothesis is true • Thresholds for “p”: .05, .01, .005, .001, .0005, .0001 • Reported as … F 1,9 = 9.80, p < .05 • Note: ANOVA table created by StatView (now marketed as JMP, a product of SAS; www.sas.com)
How to Report an F -statistic • Notice in the parentheses – Uppercase for F – Lowercase for p – Italics for F and p – Space both sides of equal sign – Space after comma – Space on both sides of less-than sign – Degrees of freedom are subscript, plain, smaller font – Three significant figures for F statistic – No zero before the decimal point in the p statistic (except in Europe)
Example #2 - Details Error bars show ±1 standard deviation
Example #2 – ANOVA • p: probability of obtaining the observed data if the null hypothesis is true • Reported as … F 1,9 = 0.626, ns • Note: For non-significant effects, use “ns” if F < 1.0, or “p > .05” if F > 1.0.
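The reporting conventions from these slides (three significant figures for F, the strictest p threshold the result passes, no leading zero, and “ns” versus “p > .05”) can be collected into a small helper. This is a sketch with made-up function names. Plain text cannot show italics or subscripts, so the degrees of freedom are simply written after the F, and the exact p values passed in below are assumed inputs for illustration.

```python
from math import floor, log10

# Thresholds listed on the slide, strictest first, without a leading zero.
THRESHOLDS = [(0.0001, ".0001"), (0.0005, ".0005"), (0.001, ".001"),
              (0.005, ".005"), (0.01, ".01"), (0.05, ".05")]


def three_sig_figs(value):
    """Format a positive number with three significant figures."""
    decimals = max(0, 2 - floor(log10(value)))
    return f"{value:.{decimals}f}"


def report_f(f, df_effect, df_residual, p):
    """Build a report string in the style used on the slides."""
    head = f"F {df_effect},{df_residual} = {three_sig_figs(f)}"
    for limit, label in THRESHOLDS:
        if p < limit:
            return f"{head}, p < {label}"
    # Non-significant: "ns" if F < 1.0, otherwise "p > .05".
    return f"{head}, ns" if f < 1.0 else f"{head}, p > .05"


# The p values are assumed inputs, chosen only to exercise each branch.
print(report_f(9.80, 1, 9, 0.012))
print(report_f(0.626, 1, 9, 0.45))
print(report_f(1.50, 1, 9, 0.25))
```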
Example #2 - Reporting
More Than Two Test Conditions
ANOVA • There was a significant effect of Test Condition on the dependent variable ( F 3,45 = 4.95, p < .005) • Degrees of freedom – If n is the number of test conditions and m is the number of participants, the degrees of freedom are… – Effect → ( n – 1) – Residual → ( n – 1)( m – 1) – Note: single-factor, within-subjects design
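The degrees-of-freedom rule can be written out directly. In this sketch the function name is made up, and the participant count of 16 is an inference from the reported F 3,45 rather than something stated on the slide.

```python
def within_subjects_df(n_conditions, n_participants):
    """Degrees of freedom for a single-factor, within-subjects ANOVA."""
    effect = n_conditions - 1
    residual = (n_conditions - 1) * (n_participants - 1)
    return effect, residual


# F 3,45 is consistent with 4 test conditions and 16 participants
# (inferred from the degrees of freedom, not stated on the slide).
print(within_subjects_df(4, 16))

# The earlier two-condition examples report F 1,9, consistent with
# 2 test conditions and 10 participants.
print(within_subjects_df(2, 10))
```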
Post Hoc Comparisons Tests • A significant F -test means that at least one of the test conditions differed significantly from one other test condition • Does not indicate which test conditions differed significantly from one another • To determine which pairs differ significantly, a post hoc comparisons test is used • Examples: – Fisher PLSD, Bonferroni/Dunn, Dunnett, Tukey/Kramer, Games/Howell, Student-Newman-Keuls, orthogonal contrasts, Scheffé
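As one simple illustration of a post hoc procedure, the sketch below applies a basic Bonferroni correction: the alpha level is divided by the number of pairwise comparisons, and each pair is checked against the corrected threshold. The condition labels and uncorrected p-values are hypothetical, and the pairwise tests that would produce them are not shown. The other procedures listed above use different corrections.

```python
from itertools import combinations


def bonferroni(pairwise_p, alpha=0.05):
    """Flag pairs whose p-value passes a Bonferroni-corrected threshold."""
    corrected_alpha = alpha / len(pairwise_p)
    return {pair: p < corrected_alpha for pair, p in pairwise_p.items()}


conditions = ["A", "B", "C", "D"]
pairs = list(combinations(conditions, 2))  # 6 pairs for 4 test conditions

# Hypothetical uncorrected p-values, one per pair, in the order of `pairs`.
p_values = dict(zip(pairs, [0.004, 0.030, 0.0005, 0.200, 0.012, 0.600]))

significant = bonferroni(p_values)
for pair, flag in significant.items():
    print(pair, "differs significantly" if flag else "no significant difference")
```

With six comparisons the corrected threshold is 0.05 / 6, so a pair with an uncorrected p of 0.030 no longer counts as significantly different.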