15-388/688 - Practical Data Science: Hypothesis testing and experimental design


  1. 15-388/688 - Practical Data Science: Hypothesis testing and experimental design. J. Zico Kolter, Carnegie Mellon University, Fall 2019

  2. Outline
     Motivation
     Background: sample statistics and central limit theorem
     Basic hypothesis testing
     Experimental design

  3. Outline
     Motivation
     Background: sample statistics and central limit theorem
     Basic hypothesis testing
     Experimental design

  4. Motivating setting
     For a data science course, there has been very little "science" thus far…
     "Science," as I'm using it, roughly refers to "determining truth about the real world"

  5. Asking scientific questions
     Suppose you work for a company that is considering a redesign of its website; does the new design (design B) offer any statistical advantage over the current design (design A)?
     In linear regression, does a certain variable impact the response? (E.g., does energy consumption depend on whether or not a day is a weekday or weekend?)
     In both settings, we are concerned with making actual statements about the nature of the world

  6. Outline
     Motivation
     Background: sample statistics and central limit theorem
     Basic hypothesis testing
     Experimental design

  7. Sample statistics
     To be a bit more consistent with standard statistics notation, we'll introduce the notion of a population and a sample
     Mean: population $\mu = \mathbf{E}[Y]$; sample $\bar{y} = \frac{1}{m} \sum_{i=1}^m y_i$
     Variance: population $\sigma^2 = \mathbf{E}[(Y - \mu)^2]$; sample $s^2 = \frac{1}{m-1} \sum_{i=1}^m (y_i - \bar{y})^2$
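The sample statistics above can be computed directly in NumPy; this is a small illustrative sketch (the distribution, sample size, and variable names here are made up, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# a sample of m = 1000 draws from a population with mu = 2, sigma^2 = 9
y = rng.normal(loc=2.0, scale=3.0, size=1000)
m = len(y)

ybar = y.sum() / m                    # sample mean: (1/m) sum of y_i
s2 = ((y - ybar)**2).sum() / (m - 1)  # sample variance: note the m-1 divisor

print(ybar, s2)  # close to the population values 2 and 9
```

These match NumPy's built-ins `np.mean(y)` and `np.var(y, ddof=1)`, where `ddof=1` selects the $m-1$ divisor.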

  8. Sample mean as random variable
     The sample mean is an empirical average over $m$ independent samples from the distribution; it can also be considered as a random variable
     This new random variable has the mean and variance
     $$\mathbf{E}[\bar{y}] = \mathbf{E}\left[\frac{1}{m} \sum_{i=1}^m y_i\right] = \frac{1}{m} \sum_{i=1}^m \mathbf{E}[Y] = \mu$$
     $$\mathbf{Var}[\bar{y}] = \mathbf{Var}\left[\frac{1}{m} \sum_{i=1}^m y_i\right] = \frac{1}{m^2} \sum_{i=1}^m \mathbf{Var}[Y] = \frac{\sigma^2}{m}$$
     where we used the fact that for independent random variables $Y_1, Y_2$, $\mathbf{Var}[Y_1 + Y_2] = \mathbf{Var}[Y_1] + \mathbf{Var}[Y_2]$
     When estimating the variance of the sample mean, we use $s^2/m$ (the square root of this term is called the standard error)

  9. Central limit theorem
     The central limit theorem states further that $\bar{y}$ (for "reasonably sized" samples, in practice $m \geq 30$) actually has a Gaussian distribution regardless of the distribution of $Y$
     $$\bar{y} \rightarrow \mathcal{N}\left(\mu, \frac{\sigma^2}{m}\right) \quad \text{or equivalently} \quad \frac{\bar{y} - \mu}{\sigma/m^{1/2}} \rightarrow \mathcal{N}(0, 1)$$
     In practice, for $m < 30$ and for estimating $\sigma^2$ using the sample variance, we use a Student's t-distribution with $m-1$ degrees of freedom
     $$\frac{\bar{y} - \mu}{s/m^{1/2}} \rightarrow t_{m-1}, \quad p(x; \nu) \propto \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$
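As a quick sanity check (a simulation sketch, not from the slides), one can draw many samples of size $m = 30$ from a clearly non-Gaussian distribution and verify that their sample means behave like $\mathcal{N}(\mu, \sigma^2/m)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 30, 10000

# draw many samples from a decidedly non-Gaussian distribution:
# exponential with mean mu = 1 and variance sigma^2 = 1
samples = rng.exponential(scale=1.0, size=(trials, m))
ybars = samples.mean(axis=1)

# by the CLT, the sample means are approximately N(mu, sigma^2/m) = N(1, 1/30)
print(ybars.mean())  # close to 1.0
print(ybars.var())   # close to 1/30
```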

  10. Aside: why the $m-1$ scaling?
     We scale the sample variance by $m-1$ so that it is an unbiased estimate of the population variance
     $$\mathbf{E}\left[\sum_{i=1}^m (y_i - \bar{y})^2\right] = \mathbf{E}\left[\sum_{i=1}^m \left((y_i - \mu) - (\bar{y} - \mu)\right)^2\right]$$
     $$= \mathbf{E}\left[\sum_{i=1}^m (y_i - \mu)^2 - 2(\bar{y} - \mu)\sum_{i=1}^m (y_i - \mu) + m(\bar{y} - \mu)^2\right]$$
     $$= \mathbf{E}\left[\sum_{i=1}^m (y_i - \mu)^2\right] - m\,\mathbf{E}\left[(\bar{y} - \mu)^2\right]$$
     (using $\sum_{i=1}^m (y_i - \mu) = m(\bar{y} - \mu)$, so the last two terms combine to $-m(\bar{y} - \mu)^2$)
     $$= m\,\mathbf{Var}[Y] - m\,\mathbf{Var}[\bar{y}] = m\sigma^2 - m\frac{\sigma^2}{m} = (m-1)\,\sigma^2$$
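The algebra above can also be checked numerically (an illustrative sketch with made-up parameters): averaging both variance estimators over many simulated samples shows that the $1/(m-1)$ version is unbiased, while the $1/m$ version underestimates $\sigma^2$ by the factor $(m-1)/m$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 5, 200000
# population: normal with variance sigma^2 = 4
samples = rng.normal(scale=2.0, size=(trials, m))
ybar = samples.mean(axis=1, keepdims=True)

ss = np.sum((samples - ybar)**2, axis=1)
s2_unbiased = ss / (m - 1)  # divide by m-1
s2_biased = ss / m          # divide by m

print(s2_unbiased.mean())  # close to sigma^2 = 4
print(s2_biased.mean())    # close to (m-1)/m * sigma^2 = 3.2
```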

  11. Outline
     Motivation
     Background: sample statistics and central limit theorem
     Basic hypothesis testing
     Experimental design

  12. Hypothesis testing
     Using these basic statistical techniques, we can devise some tests to determine whether certain data gives evidence that some effect "really" occurs in the real world
     Fundamentally, this is evaluating whether things are (likely to be) true about the population (all the data) given a sample
     There are lots of caveats about the precise meaning of these terms, to the point that many people debate the usefulness of hypothesis testing at all
     But it is still incredibly common in practice, and important to understand

  13. Hypothesis testing basics
     Posit a null hypothesis $H_0$ and an alternative hypothesis $H_1$ (usually just that "$H_0$ is not true")
     Given some data $y$, we want to accept or reject the null hypothesis in favor of the alternative hypothesis

                       H_0 true                          H_1 true
     Accept H_0        Correct                           Type II error (false negative)
     Reject H_0        Type I error (false positive)     Correct

     $p(\text{reject } H_0 \mid H_0 \text{ true})$ = "significance of test"
     $p(\text{reject } H_0 \mid H_1 \text{ true})$ = "power of test"

  14. Basic approach to hypothesis testing
     Basic approach: compute the probability of observing the data under the null hypothesis (this is the p-value of the statistical test)
     $$p = p(\text{data} \mid H_0 \text{ is true})$$
     Reject the null hypothesis if the p-value is below the desired significance level (alternatively, just report the p-value itself, which is the lowest significance level we could use to reject the hypothesis)
     Important: the p-value is $p(\text{data} \mid H_0 \text{ is true})$, not $p(H_0 \text{ is true} \mid \text{data})$

  15. Poll: p-value hacking
     Suppose you adopt the following procedure. You test 100 patients to see if a drug has a statistically significant effect. If so, you stop the test and publish your current p-value. If not, you collect 100 additional patients, test the drug again, and publish that p-value (statistically significant or not). Is this a valid experimental design?
     1. Yes
     2. No
     3. Depends on what p-value you achieve
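A quick simulation (illustrative, with made-up parameters; the slide title already hints at the issue) shows why this optional-stopping procedure is problematic: even when the null hypothesis is true (the drug has no effect), it rejects $H_0$ more often than the nominal 5% significance level:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
trials, alpha = 5000, 0.05
rejections = 0

for _ in range(trials):
    # under the null hypothesis the drug has no effect: measurements have mean 0
    x = rng.normal(size=100)
    _, p = st.ttest_1samp(x, 0)
    if p >= alpha:
        # "not significant yet": collect 100 more patients and re-test
        x = np.concatenate([x, rng.normal(size=100)])
        _, p = st.ttest_1samp(x, 0)
    if p < alpha:
        rejections += 1

rate = rejections / trials
print(rate)  # noticeably above the nominal 0.05
```

Peeking at the data and continuing only when the result is not yet significant gives two chances to reject, which inflates the Type I error rate.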

  16. Canonical example: t-test
     Given a sample $y_1, \ldots, y_m \in \mathbb{R}$
     $H_0$: $\mu = 0$ (for population)
     $H_1$: $\mu \neq 0$
     By the central limit theorem, we know that $(\bar{y} - \mu)/(s/m^{1/2}) \sim t_{m-1}$ (Student's t-distribution with $m-1$ degrees of freedom)
     So we just compute $t = \bar{y}/(s/m^{1/2})$ (called the test statistic), then compute
     $$p = p(x > |t|) + p(x < -|t|) = F(-|t|) + 1 - F(|t|) = 2F(-|t|)$$
     (where $F$ is the cumulative distribution function of the Student's t-distribution)

  17. Visual example
     What we are doing fundamentally is modeling the distribution $p(\bar{y} \mid H_0)$ and then determining the probability of the observed $\bar{y}$ or a more extreme value
     [Figure: density of the test statistic under $H_0$; $p$ = shaded area in the tails]

  18. Code in Python
     Compute the $t$ statistic and $p$-value from data:

     import numpy as np
     import scipy.stats as st

     m = 100  # number of samples (assumed defined on the slide)
     x = np.random.randn(m)

     # compute t statistic and p value
     xbar = np.mean(x)
     s2 = np.sum((x - xbar)**2)/(m-1)
     std_err = np.sqrt(s2/m)
     t = xbar/std_err
     t_dist = st.t(m-1)
     p = 2*t_dist.cdf(-np.abs(t))

     # with scipy alone
     t, p = st.ttest_1samp(x, 0)

  19. Two-sided vs. one-sided tests
     The previous test considered deviation from the null hypothesis in both directions (a two-sided test); it is also possible to consider a one-sided test
     $H_0$: $\mu \geq 0$ (for population)
     $H_1$: $\mu < 0$
     Same $t$ statistic as before, but we only compute the area under the left side of the curve
     $$p = p(x < t) = F(t)$$
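A one-sided version of the earlier code slide might look like the following sketch (the data here is made up; the `alternative=` argument requires scipy >= 1.6):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
m = 100
x = rng.normal(loc=-0.3, size=m)  # data whose true mean is below 0

# same t statistic as in the two-sided test
xbar = np.mean(x)
s = np.std(x, ddof=1)
t = xbar / (s / np.sqrt(m))

# one-sided p-value: area under the left tail only, p = F(t)
p = st.t(m - 1).cdf(t)

# with scipy alone
t2, p2 = st.ttest_1samp(x, 0, alternative='less')
```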

  20. Confidence intervals
     We can also use the $t$ statistic to create confidence intervals for the mean
     Because $\bar{y}$ has mean $\mu$ and variance $s^2/m$, we know that $1 - \alpha$ of its probability mass must lie within the range
     $$\bar{y} = \mu \pm \frac{s}{m^{1/2}} \cdot F^{-1}\left(1 - \frac{\alpha}{2}\right) \equiv \mu \pm CI(s, m, \alpha) \iff \mu = \bar{y} \pm CI(s, m, \alpha)$$
     where $F^{-1}$ denotes the inverse CDF of the $t$-distribution with $m-1$ degrees of freedom

     # simple confidence interval computation
     CI = lambda s,m,a : s / np.sqrt(m) * st.t(m-1).ppf(1-a/2)
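The one-line helper from the slide in use, for a 95% interval ($\alpha = 0.05$); the data and sample size are illustrative:

```python
import numpy as np
import scipy.stats as st

# the one-line helper from the slide
CI = lambda s, m, a: s / np.sqrt(m) * st.t(m-1).ppf(1 - a/2)

rng = np.random.default_rng(0)
m = 50
x = rng.normal(loc=1.0, size=m)

xbar = np.mean(x)
s = np.std(x, ddof=1)
half_width = CI(s, m, 0.05)  # alpha = 0.05 gives a 95% interval

# 95% confidence interval for the population mean
print(xbar - half_width, xbar + half_width)
```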

  21. Outline
     Motivation
     Background: sample statistics and central limit theorem
     Basic hypothesis testing
     Experimental design

  22. Experimental design: A/B testing
     Up until now, we have assumed that the null hypothesis is given by some known mean, but in reality we may not know the mean that we want to compare to
     Example: we want to tell if some additional feature on our website makes users stay longer, so we need to estimate both how long users stay on the current site and how long they stay on the redesigned site
     The standard approach is A/B testing: create a control group (mean $\mu_1$) and a treatment group (mean $\mu_2$)
     $H_0$: $\mu_1 = \mu_2$ (or e.g. $\mu_1 \geq \mu_2$)
     $H_1$: $\mu_1 \neq \mu_2$ (or e.g. $\mu_1 < \mu_2$)

  23. Independent t-test (Welch's t-test)
     Collect samples (possibly different numbers) from both populations: $y_1^{(1)}, \ldots, y_{m_1}^{(1)}$ and $y_1^{(2)}, \ldots, y_{m_2}^{(2)}$
     For each group, compute the sample means $\bar{y}_1, \bar{y}_2$ and sample variances $s_1^2, s_2^2$
     Compute the test statistic
     $$t = \frac{\bar{y}_1 - \bar{y}_2}{\left(s_1^2/m_1 + s_2^2/m_2\right)^{1/2}}$$
     and evaluate using a t-distribution with degrees of freedom given by
     $$\frac{\left(s_1^2/m_1 + s_2^2/m_2\right)^2}{\left(s_1^2/m_1\right)^2/(m_1 - 1) + \left(s_2^2/m_2\right)^2/(m_2 - 1)}$$
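The formulas above can be computed by hand and checked against scipy, where `ttest_ind` with `equal_var=False` performs exactly Welch's t-test (the two groups below are simulated, illustrative data):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
y1 = rng.normal(loc=5.0, scale=2.0, size=80)   # control group
y2 = rng.normal(loc=5.5, scale=3.0, size=120)  # treatment group
m1, m2 = len(y1), len(y2)

# manual Welch test statistic and degrees of freedom
v1 = np.var(y1, ddof=1) / m1
v2 = np.var(y2, ddof=1) / m2
t = (np.mean(y1) - np.mean(y2)) / np.sqrt(v1 + v2)
dof = (v1 + v2)**2 / (v1**2 / (m1 - 1) + v2**2 / (m2 - 1))
p = 2 * st.t(dof).cdf(-np.abs(t))

# with scipy alone: equal_var=False selects Welch's t-test
t2, p2 = st.ttest_ind(y1, y2, equal_var=False)
```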

  24. Starting to seem a bit ad hoc?
     There are a huge number of different tests for different situations
     You probably won't need to remember these, and can just look up whatever test is most appropriate for your given situation
     But the basic idea in all cases is the same: you're trying to find the distribution of your test statistic under the null hypothesis, and then you are computing the probability of the observed test statistic or something more extreme
     All the different tests are really just about different distributions based upon your problem setup
