Central limit theorem: an example
[Figure: four histograms of dice rolls simulated with sample(1:6, 10000, replace = TRUE): a single draw, followed by sums of two or more such draws; x-axis: value of the sum, y-axis: Frequency]
Continuous Distribution: Normal (Gaussian) Distribution
[Figure: probability density dnorm(k, 0, 1) for k from -3 to 3]
• Bell shape
• Symmetrical
• Density converges to zero as k goes to ±infinity
Normal Distribution: cumulative distribution function
[Figure: pnorm(k, 0, 1) for k from -3 to 3]
• About 68% within ±1σ
• About 95% within ±2σ
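The coverage figures can be checked directly from the cumulative distribution function. This is a minimal sketch; the helper name `coverage` is illustrative and not part of the original slides.

```r
# Probability that a standard normal value lies within k standard deviations
# of the mean: P(-k < Z < k) = pnorm(k) - pnorm(-k).
coverage <- function(k) pnorm(k) - pnorm(-k)

cov_tab <- sapply(1:3, coverage)
names(cov_tab) <- paste0("+/-", 1:3, " sigma")
print(round(cov_tab, 4))
```

Because `pnorm()` takes `mean` and `sd` arguments, the same check works for any normal distribution after standardizing.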
Log-normal Distribution
[Figure: density dlnorm(k, 0, 1) for k from 0 to 10]
• Skewed shape
• Mean and mode differ
• Heavy tail
Exponential distribution and Gamma distribution
• Exponential distribution
  – Waiting time until an event that occurs at a constant rate
• Gamma distribution
  – Waiting time until k such events have occurred at the same constant rate
Gamma Distribution
• Shape parameter k
• Scale parameter θ
• Mean = k・θ
[Figure: histograms of rgamma(1e+05, 1), rgamma(1e+05, 2), rgamma(1e+05, 3) and rgamma(1e+05, 4), i.e. k = 1, 2, 3, 4 with θ = 1; y-axis: Frequency]
Exponential Distribution
[Figure: histograms of rexp(1e+05) and rexp(1e+05, 3); y-axis: Frequency]
• Density: λ exp(-λx) for x ≥ 0
• Monotonically decreasing
• Gamma distribution with shape = 1
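The relationship between the two distributions can be checked numerically. The sketch below assumes rate λ = 3 for the exponential and k = 3, θ = 1 for the gamma mean check; the variable names are illustrative.

```r
# An exponential distribution with rate lambda has the same density as a
# gamma distribution with shape 1 and scale 1 / lambda.
lambda <- 3
x <- seq(0.1, 5, by = 0.1)
max_diff <- max(abs(dexp(x, rate = lambda) -
                    dgamma(x, shape = 1, scale = 1 / lambda)))
print(max_diff)  # numerically zero

# The mean of a gamma distribution is shape * scale (k * theta).
set.seed(1)
k <- 3
theta <- 1
sim_mean <- mean(rgamma(1e5, shape = k, scale = theta))
print(c(simulated = sim_mean, theoretical = k * theta))
```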
Generating random numbers
• R
  – rnorm(n, mean, s.d.)
  – > hist(rnorm(1e5, 0, 1))
  – > hist(rnorm(1e5, 0, 2))
• Excel 2010
  – NORM.INV(RAND(), mean, s.d.) (NORM.DIST() returns a density or cumulative probability, not a random draw)
[Figure: histograms of rnorm(1e+05, 0, 1) and rnorm(1e+05, 0, 2); y-axis: Frequency]
Central Limit theorem exercise
• > hist(runif(1e5))
• > hist(runif(1e5) + runif(1e5))
• > hist(runif(1e5) + runif(1e5) + runif(1e5))
• > hist(runif(1e5) + runif(1e5) + runif(1e5) + runif(1e5))
[Figure: histograms of one uniform draw and of sums of two, three and four uniform draws; y-axis: Frequency]
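The exercise can be extended to any number of summands. The following is a sketch, not the slides' own code; `sum_unif` and `n_draws` are illustrative names, and 12 summands is an arbitrary choice that gives a theoretical variance of 1.

```r
set.seed(1)
n_draws <- 1e5

# Sum of m independent uniform(0, 1) draws, one sum per row of the matrix.
sum_unif <- function(m) rowSums(matrix(runif(n_draws * m), ncol = m))

s1  <- sum_unif(1)   # flat
s4  <- sum_unif(4)   # already close to bell-shaped
s12 <- sum_unif(12)  # theoretical mean 12/2 = 6, variance 12/12 = 1

print(c(mean = mean(s12), var = var(s12)))

# Draw the histograms only in an interactive session.
if (interactive()) {
  hist(s1, breaks = 50)
  hist(s12, breaks = 50)
}
```

A sum of m uniform draws has mean m/2 and variance m/12, so the simulated values can be compared with those targets.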
Chi-square distribution
[Figure: histogram of apply(r0, 1, var) * 3; y-axis: Frequency]
• Sample size = 4: degrees of freedom = 3
• Sample size = 10: degrees of freedom = 9
• Distribution of a sum of squared standard normal random numbers
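The histogram on this slide can be reproduced along the following lines. This is a sketch under the assumption that `r0` is a matrix of standard normal draws with one sample of size 4 per row; the slides do not show how `r0` was built.

```r
set.seed(1)
n <- 4                                  # sample size, so d.f. = n - 1 = 3
r0 <- matrix(rnorm(1e4 * n), ncol = n)  # assumed construction of r0

# (n - 1) * sample variance of N(0, 1) data follows a chi-square
# distribution with n - 1 degrees of freedom.
scaled_var <- apply(r0, 1, var) * (n - 1)

# A chi-square distribution with d.f. = 3 has mean 3 and variance 6.
print(c(mean = mean(scaled_var), var = var(scaled_var)))

if (interactive()) hist(scaled_var, breaks = 50)
```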
Exercise: Plant Growth
• > data(PlantGrowth)
• > PlantGrowth
      weight group
  1     4.17  ctrl
  2     5.58  ctrl
  3     5.18  ctrl
Exercise: Plotting a notched box plot
• > boxplot(weight ~ group, data = PlantGrowth, main = "PlantGrowth data", ylab = "Dried weight of plants", col = "lightgray", notch = TRUE, varwidth = TRUE)
[Figure: notched box plots of weight (3.5 to 6.0) for groups ctrl, trt1 and trt2]
Level of Measurement
• Quantitative Data
  – Ratio Data
  – Interval Data
• Qualitative Data
  – Ordinal Data: can be ranked
  – Categorical Data, including Binary Data
Descriptive Statistics: Bivariate Data
• Dependence index
  – Correlation: Pearson's chi-square, Kendall's τ, Spearman's ρ
• Cross-tabulation
  – Binary and binary
  – Binary and nominal
  – Nominal and nominal
• Scatterplots
  – Ordinal/Interval and ordinal/Interval
• Quantile-Quantile plots
  – Ordinal and ordinal
[Figure: scatterplot of c(rr1) against c(rr0), both axes from -4 to 4]
Statistical inference
• Drawing conclusions from data based on a model or assumption
• Data are independent and identically distributed
  – Random sampling from a population
  – Randomized experiment
• Set a model or assumption
• Estimate
  – Parameter (mean, proportion, variance)
• Interval
  – Confidence, Tolerance, Prediction
• Test of hypothesis
Types of statistical inference
• Point estimate
  – Obtain a single estimate
• Interval estimate
  – Interval of possible values
• Hypothesis testing
  – Making a decision from data
• Checking model assumptions
Point Estimation
• Obtain the best single value of a population parameter from a subset (sample)
• Unbiasedness
• Minimum variance
• Parametric distribution
  – Maximum likelihood estimator
  – Moment estimator
Unbiasedness
• True parameter: θ0
• Estimate: θ̂
• E[θ̂] = θ0
• Estimator of variance
  – Variance calculated from 5 normal samples
  – Left: mean = 0.8, right: mean = 1.0 (consistent with dividing by n = 5 on the left and by n - 1 = 4 on the right)
[Figure: two histograms of aa; y-axis: Frequency]
Unbiased variance
Test      Result   Deviance   Deviance^2
1         0.0394   -0.00103   1.06778E-06
2         0.0412   0.000767   5.87778E-07
3         0.0420   0.001567   2.45444E-06
4         0.0398   -0.00063   4.01111E-07
5         0.0407   0.000267   7.11111E-08
6         0.0395   -0.00093   8.71111E-07
Average   0.0404
• Biased estimate = (sum of squared deviances)/6 = 9.08889E-07
• Unbiased estimate = (sum of squared deviances)/(6-1)
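The table can be recomputed in a few lines. This sketch uses the six test results from the slide; the variable names are illustrative.

```r
# Six test results from the table.
result <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)

dev <- result - mean(result)        # deviance of each result from the average
n <- length(result)

biased   <- sum(dev^2) / n          # divides by n
unbiased <- sum(dev^2) / (n - 1)    # divides by n - 1

print(c(biased = biased, unbiased = unbiased))

# R's var() uses the n - 1 denominator, so it matches the unbiased estimate.
print(all.equal(unbiased, var(result)))
```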
Minimum variance
• Mean of a normal distribution
• 5 samples
• True mean = 0
[Figure: histograms of apply(matrix(rnorm(5e+05), ncol = 5), 1, mean) and apply(matrix(rnorm(5e+05), ncol = 5), 1, median), both on an x-axis from -2 to 2; y-axis: Frequency]
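The two histograms differ mainly in spread, which can be quantified by comparing the variances of the two estimators. A minimal sketch that applies both estimators to the same simulated samples (an assumption; the slide draws fresh samples for each histogram):

```r
set.seed(1)
# 1e5 samples of size 5 from N(0, 1), one sample per row.
samples <- matrix(rnorm(5e5), ncol = 5)

est_mean   <- apply(samples, 1, mean)
est_median <- apply(samples, 1, median)

# Both estimators are centered on the true mean 0, but the sample mean
# has the smaller variance (theoretical value 1/5 = 0.2).
v_mean   <- var(est_mean)
v_median <- var(est_median)
print(c(var_mean = v_mean, var_median = v_median))
```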
Goodness-of-fit Test
• Graphical method
  – Quantile-Quantile plot
Exercise: plotting a Q-Q plot
• Fit to a normal distribution
• > qqnorm(rnorm(1e2))
• > qqnorm(rlnorm(1e2))
[Figure: two Normal Q-Q plots; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
Pearson's Chi-square
Observed     10    2    7    9
Hypothesis    8    4    9    7
Diff          2   -2   -2    2
• Yates's correction
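The chi-square statistic for this table is the sum of (observed - expected)^2 / expected. A sketch using the counts from the slide, without Yates's correction; the choice of 4 - 1 = 3 degrees of freedom assumes the hypothesized counts were fixed in advance.

```r
observed <- c(10, 2, 7, 9)
expected <- c(8, 4, 9, 7)   # counts under the hypothesis

chi_sq <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(c(chi_sq = chi_sq, df = df, p_value = p_value))

# The built-in test gives the same statistic; the warning about small
# expected counts (one cell has 4) is suppressed here.
builtin <- suppressWarnings(chisq.test(observed, p = expected / sum(expected)))
print(unname(builtin$statistic))
```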
Other tests of fit
• Based on the empirical distribution function
  – Kolmogorov-Smirnov test
  – Anderson-Darling test
  – Lilliefors test
  – Cramér-von Mises test
• For normality
  – Jarque-Bera test
    • Based on skewness and kurtosis
  – Shapiro-Wilk test
    • Statistic based on variances and covariances of the ranked (ordered) data
Interval Estimation: Types of Interval
• Confidence
  – Covers the true parameter with probability 1 - α
  – Nominal and actual coverage probability
• Prediction
  – Another sample falls within the prediction interval with probability 1 - α
• Tolerance
  – N percent of the data fall within the interval with confidence level 1 - α
Confidence Intervals
[Figure: example of 95% confidence intervals]
Table of t-values
d.f.    t0.95   t0.975   t0.995
1       6.3     12.7     63.7
2       2.9     4.3      10
3       2.4     3.2      5.8
4       2.13    2.8      4.6
5       2.02    2.6      4.0
6       1.94    2.4      3.7
Z (∞)   1.645   1.960    2.576
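The table can be regenerated with the quantile function `qt()`, which avoids transcription errors. A minimal sketch; `t_table` is an illustrative name.

```r
df <- c(1:6, Inf)                 # Inf gives the normal (z) quantiles
probs <- c(0.95, 0.975, 0.995)

# One column per cumulative probability, one row per degree of freedom.
t_table <- sapply(probs, function(p) qt(p, df))
dimnames(t_table) <- list(df = df, quantile = paste0("t", probs))

print(round(t_table, 3))
```

The last row agrees with `qnorm(probs)`, which is the sense in which the z-test is the limiting case of the t-test.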
Statistical inference -Model, Assumption, Hypothesis-
• Parametric
  – Data generation process is parameterized
• Non-parametric
  – Data generation process is not parameterized
• Asymptotic
  – Commonly used
  – Critical values based on a table
• Exact
  – Computer intensive
  – Critical values based on the data
Statistical inference and error
• Type I error
  – False positive
  – α error
  – Rejecting a null hypothesis that should have been accepted
• Type II error
  – False negative
  – β error
  – Accepting a null hypothesis that should have been rejected
Statistical Test
• Test for location
• Test for dispersion
• Test for outliers
• P-value, error
• Detection power
• Uniformly most powerful test
Ratio Data
• Quantitative data
• Unlike interval data, it has a natural zero
• Can be multiplied or divided
• Age, length, etc.
Interval Data
• Quantitative data
• Can be added or subtracted
• Cannot be multiplied or divided
• Example: temperature
Z-test
• Critical value does not depend on sample size
• Standard deviation: known
• Exercise
  – Proficiency testing
  – Target: 700 μg/g
  – Standard deviation: 25 μg/g
  – One laboratory reports 640 μg/g; test whether this differs significantly from the target
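The exercise can be worked as follows. This is a sketch assuming a two-sided test at the 5% level, which the slide does not state explicitly; the variable names are illustrative.

```r
target   <- 700   # target value, ug/g
sigma    <- 25    # known standard deviation, ug/g
reported <- 640   # value reported by the laboratory, ug/g

z <- (reported - target) / sigma        # z score of the reported value
p_two_sided <- 2 * pnorm(-abs(z))       # two-sided p-value
z_crit <- qnorm(0.975)                  # critical value at the 5% level

print(c(z = z, p = p_two_sided, critical = z_crit))
print(abs(z) > z_crit)                  # TRUE means a significant difference
```

Here z = (640 - 700) / 25 = -2.4, which lies outside ±1.96, so the laboratory's result differs significantly from the target under these assumptions.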
Test for normal interval data: single set of samples
• One sample t-test
• What is tested
  – Whether the population mean differs from a hypothesized value (0 by default)
  – Standard deviation: unknown
    • C.f. z-test
    • Therefore, the s.d. is estimated from the data
    • The estimate includes error
• Exercise: test whether the mean of the data set (150, 120, 180, 130) differs significantly from 100
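The exercise can be run with `t.test()` and checked by hand. A minimal sketch; the hand calculation uses the usual one-sample formula t = (mean - μ0) / (s / √n).

```r
x <- c(150, 120, 180, 130)
mu0 <- 100                       # hypothesized mean

res <- t.test(x, mu = mu0)       # one-sample, two-sided by default
print(res)

# Same statistic by hand: standard deviation is estimated from the data.
t_by_hand <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
print(t_by_hand)
```

With 3 degrees of freedom the two-sided 5% critical value is about 3.2 (see the t-table), so the statistic of about 3.4 is significant at that level.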
Test for interval data: 2 levels
• t-test (paired or unpaired)
• Variance of the two groups
  – Equal: Student's t-test
  – Unequal: Welch (Aspin-Welch) test
T-distribution
• Distribution of the sample mean divided by its estimated standard error
• Normality assumed
• Degrees of freedom
• Probability
• If the standard deviation is known, or the degrees of freedom are infinite, the test becomes the z-test
T-distribution and normal distribution
[Figure: density curves plotted over seq(-10, 10, by = 0.01), including dnorm(seq(-10, 10, by = 0.01)); y-axis from 0.0 to 0.4]
• Green = degrees of freedom (d.f.) 2
• Blue = degrees of freedom (d.f.) 10
• Red = degrees of freedom (d.f.) +infinity
T-test assumptions
• Each of the two data sets follows a normal distribution
  – Especially important when the sample size is small
• The two data sets are sampled independently
• These assumptions are rarely met exactly, so care is needed in strict discussions
Robustness of inference
• How violations of the assumptions affect the test
• Outliers
  – Robustness against outliers
• Distribution
  – Mixture distribution of different s.d.
• The t-test is somewhat robust to some violations
• Some recommend preliminary tests to check whether the assumptions hold in the data, but opinions differ
Case of unequal variance, equal sample size
• n = 10, σ1/σ2 = 4
[Figure: two histograms of p-values p from 0 to 1, one for Student's t-test and one for the Aspin-Welch test; y-axis: Frequency]
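A simulation of this kind can be sketched as follows. The number of replications (5000), the seed and the helper `sim_p` are assumptions made for illustration; the slides do not show the code behind the histograms.

```r
set.seed(1)
n <- 10          # equal sample size in both groups
n_sim <- 5000    # number of simulated experiments (assumed)

# Both groups have true mean 0, so every rejection is a type I error.
# The standard deviations differ by a factor of 4.
sim_p <- function(var_equal) {
  replicate(n_sim,
            t.test(rnorm(n, sd = 4), rnorm(n, sd = 1),
                   var.equal = var_equal)$p.value)
}

p_student <- sim_p(TRUE)    # Student's t-test (pooled variance)
p_welch   <- sim_p(FALSE)   # Welch's t-test

rates <- c(student = mean(p_student < 0.05), welch = mean(p_welch < 0.05))
print(rates)

if (interactive()) {
  hist(p_student, breaks = 20)
  hist(p_welch, breaks = 20)
}
```

If a test holds its nominal level, the p-values are roughly uniform and the rejection rate at 0.05 is close to 5%.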
T-test exercise: Sample Data
• > ToothGrowth
       len supp dose
  1    4.2   VC  0.5
  2   11.5   VC  0.5
  3    7.3   VC  0.5
  4    5.8   VC  0.5
  5    6.4   VC  0.5
  6   10.0   VC  0.5
  7   11.2   VC  0.5
T-test exercise 2: Sample Data
• d0 <- ToothGrowth$len[1:10]
  – Vitamin C dose 0.5 mg
• d1 <- ToothGrowth$len[11:20]
  – Vitamin C dose 1.0 mg
T-test using R -exercise 3-
• Usage: t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
• > t.test(d0, d1)
    Welch Two Sample t-test
  data:  d0 and d1
  t = -7.4634, df = 17.862, p-value = 6.811e-07
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -11.265712  -6.314288
  sample estimates:
  mean of x mean of y
       7.98     16.77
T-test exercise using random numbers -alpha error-
• > t.test(rnorm(1e1), rnorm(1e1))
• Results of repeated runs:
    t = -0.9106, df = 17.085, p-value = 0.3752
    t = 0.7685, df = 17.982, p-value = 0.4522
    t = 0.8858, df = 12.341, p-value = 0.3927
    t = -1.0532, df = 17.886, p-value = 0.3063
    t = 0.0694, df = 17.496, p-value = 0.9455
    t = -0.0784, df = 13.86, p-value = 0.9386
    t = -0.606, df = 17.528, p-value = 0.5523
Summary
                  Actual   Nominal
P-value >= 0.5             50%
P-value < 0.5              50%
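The "Actual" column can be filled in by repeating the exercise many times. A minimal sketch; the 10000 replications and the seed are arbitrary choices.

```r
set.seed(1)
# Both samples come from the same N(0, 1) population, so the null
# hypothesis is true and the p-value should be roughly uniform on (0, 1).
p <- replicate(1e4, t.test(rnorm(10), rnorm(10))$p.value)

actual_below_half <- mean(p < 0.5)    # nominal value: 50%
actual_alpha      <- mean(p < 0.05)   # nominal alpha error: 5%
print(c("P(p < 0.5)" = actual_below_half, "P(p < 0.05)" = actual_alpha))
```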
Test for Proportions
• > prop.test(10, 20, p = 0.5)
    1-sample proportions test without continuity correction
  data:  10 out of 20, null probability 0.5
  X-squared = 0, df = 1, p-value = 1
  alternative hypothesis: true p is not equal to 0.5
  95 percent confidence interval:
   0.299298 0.700702
  sample estimates:
    p
  0.5
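One-sided (one-tailed) and two-sided (two-tailed) tests
• One sided
  – Null hypothesis μ = μ0
  – Alternative hypothesis μ > μ0
• Two sided
  – Null hypothesis μ = μ0
  – Alternative hypothesis μ ≠ μ0
• One-sided p-value = ½ × two-sided p-value (for a symmetric test statistic, when the observed effect lies in the direction of the alternative)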
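The halving rule can be seen with the ToothGrowth subsets from the t-test exercise. A sketch using the `alternative` argument of `t.test()`; since the mean of d0 is below the mean of d1, "less" is the direction in which the observed effect lies.

```r
# Same subsets as in the t-test exercise (ToothGrowth is a built-in data set).
d0 <- ToothGrowth$len[1:10]
d1 <- ToothGrowth$len[11:20]

p_two     <- t.test(d0, d1)$p.value
p_less    <- t.test(d0, d1, alternative = "less")$p.value
p_greater <- t.test(d0, d1, alternative = "greater")$p.value

print(c(two.sided = p_two, less = p_less, greater = p_greater))
print(all.equal(p_less, p_two / 2))   # one-sided p is half the two-sided p
```

In the opposite direction the one-sided p-value is 1 minus half the two-sided value, which is why the direction must be fixed before looking at the data.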
Balanced vs. Unbalanced
• Balanced
  – Equal numbers of samples or experiments assigned to each treatment
• Unbalanced
  – Unequal numbers of samples or experiments assigned to each treatment
Ordinal Data
• Several levels with an order
• Excellent – good – fair
• Ranks in a race
• Interval data are also ordinal data
• But ordinal data are not always interval data
What can be done with Ordinal data
• Wilcoxon rank sum test
  – Also known as the Mann-Whitney U test
  – Counterpart of the unpaired t-test for interval data
• Wilcoxon signed rank test
  – Counterpart of the paired t-test for interval data
Wilcoxon test -exercise-
• > wilcox.test(d0, d1)
    Wilcoxon rank sum test with continuity correction
  data:  d0 and d1
  W = 0, p-value = 0.0001796
  alternative hypothesis: true location shift is not equal to 0
  Warning message:
  In wilcox.test.default(d0, d1) : cannot compute exact p-value with ties
Wilcoxon test -unequal variance-
[Figure: two histograms of p-values p from 0 to 1; y-axis: Frequency]
• Equal variance: P(p<0.05) = 0.044
• Unequal variance: P(p<0.05) = 0.12
Comparison of t-test and Wilcoxon test
• Detection power
  – 98%
• Extremely small sample size and large difference
  – t-test
Detection power of t-test
• S.d. = 1
• μ1 - μ2 = 1, 0.5, 0.1
Detection power of Wilcoxon test
• S.d. = 1
• μ1 - μ2 = 1
[Figure: "norm, diff=1"; powvec (0.0 to 0.8) plotted against mvec (0 to 100)]
T-test Detection power
• > power.t.test(n = 10, delta = 1, sd = NULL, sig.level = 0.05, power = .5)
    Two-sample t test power calculation
              n = 10
          delta = 1
             sd = 1.079782
      sig.level = 0.05
          power = 0.5
    alternative = two.sided
  NOTE: n is number in *each* group
Calculating sample size - t-test
• > power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = .95)
    Two-sample t test power calculation
              n = 26.98922
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.95
    alternative = two.sided
  NOTE: n is number in *each* group
t.test detection power (β error) -exercise-
• > t.test(rnorm(1e1, sd = 1.079782), rnorm(1e1, sd = 1.079782))
    t = -1.0004, df = 10.752, p-value = 0.3391
    t = 0.1531, df = 17.229, p-value = 0.8801
• Count the runs with P > 0.05 and those with P <= 0.05
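The power calculation can be checked by simulation. This sketch assumes a true difference in means of 1 (the `delta` used in the `power.t.test()` call) and Student's pooled-variance test, which is what `power.t.test()` models; the 5000 replications and the seed are arbitrary.

```r
set.seed(1)
sd0 <- 1.079782   # s.d. returned by power.t.test() for power = 0.5
delta <- 1        # assumed true difference between the group means
n <- 10

p <- replicate(5000,
               t.test(rnorm(n, mean = 0, sd = sd0),
                      rnorm(n, mean = delta, sd = sd0),
                      var.equal = TRUE)$p.value)

power_hat <- mean(p <= 0.05)              # share of runs detecting the difference
beta_hat  <- 1 - power_hat                # estimated beta error
power_calc <- power.t.test(n = n, delta = delta, sd = sd0)$power

print(c(simulated = power_hat, calculated = power_calc, beta = beta_hat))
```

With both means left at 0, as in a plain `rnorm(1e1, sd = ...)` call, the same loop estimates the α error instead of the detection power.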
What is Nominal Data
• Categorically discrete
• Order of the categories is arbitrary
• Red, blue, green
• Origin (region)
Analysis of Nominal Data
• Contingency table
• Binomial test
• Chi-square test
• Fisher's exact test
• McNemar test