University of Copenhagen, Department of Biostatistics
Faculty of Health Sciences

Basic statistical concepts

Susanne Rosthøj
Section of Biostatistics, Department of Public Health, University of Copenhagen
sr@biostat.ku.dk

Statistical approaches

Descriptive statistics:
• Summarizing observations
• Represented
  • graphically
  • in tables
  • as summary statistics (single values)

Inferential statistics:
• Procedures allowing us to conclude and generalize
• Based on models, confidence intervals, hypotheses, tests
• Need mathematical assumptions and results

Male height from Sundby data

[Figure: histogram "Height distribution (males)"; x-axis: Height (cm), 150 to 200; y-axis: Frequency.]

Median 180, IQR 175-185.

Descriptive illustration - box plot

[Figure: box plot "Height (males)"; axis from 160 to 200; two outlying points marked.]

The normal distribution

The normal distribution is the most important distribution for describing continuous variables. Examples:
• Body temperature
• Male height
• Lung function indices

It is widely used in statistical inference because
• it has many mathematically convenient properties
• of the Central Limit Theorem: the average of a sufficiently large number of independent variables with the same distribution will be approximately normally distributed.

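The Central Limit Theorem can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the exponential distribution, the sample size of 50 and the 10,000 repetitions are choices made here, not taken from the slides. If the averages are approximately normal, about 95% of them should lie within 1.96 standard errors of the true mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Average of n independent draws from an exponential distribution (mean 1, SD 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 50           # assumed sample size
n_rep = 10_000   # assumed number of repeated samples
means = [sample_mean(n) for _ in range(n_rep)]

# The individual observations are clearly skewed, but their averages should be
# close to normal with mean 1 and standard error 1 / sqrt(n).
se = 1.0 / n ** 0.5
inside = sum(abs(m - 1.0) <= 1.96 * se for m in means) / n_rep
print(f"share of averages within 1.96 SE of the true mean: {inside:.3f}")
```

The printed share should be close to 0.95 even though the underlying observations are far from normally distributed.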
The 95% reference interval

Reference range for normally distributed data: $\mu \pm 1.96 \cdot \mathrm{SD}$

[Figure: normal density curve; x-axis: Height (cm), 150 to 200; y-axis: Density.]

Mean 179.9, SD = 7.8. Reference range 164.6 to 195.2 cm.

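The reference range quoted on this slide can be reproduced from the mean and SD. This is a minimal sketch using Python's standard library; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

mean, sd = 179.9, 7.8             # mean and SD of male height quoted on the slide
z = NormalDist().inv_cdf(0.975)   # 97.5% quantile of the standard normal, about 1.96

lower = mean - z * sd
upper = mean + z * sd
print(f"95% reference range: {lower:.1f} to {upper:.1f} cm")
```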
Mean and standard deviation of the sample mean

We observe n observations $X_1, \ldots, X_n$ drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$.

For the sample mean $\bar{X}$ the following holds:

$\mathrm{mean}(\bar{X}) = \mu$
$\mathrm{SD}(\bar{X}) = \sigma / \sqrt{n}$

This SD is also called the standard error of the mean (SE or SEM).

The sample mean thus has a distribution of its own.

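The two properties of the sample mean can be illustrated by simulation. In the sketch below the population values (mean 180, SD 8), the sample size 25 and the 5,000 repetitions are hypothetical choices made for illustration only.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

mu, sigma, n = 180.0, 8.0, 25   # hypothetical population mean, SD and sample size
n_rep = 5_000                   # hypothetical number of repeated samples

# Draw many samples of size n and keep the mean of each one.
means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(n_rep)
]

mean_of_means = statistics.fmean(means)   # should be close to mu
sd_of_means = statistics.stdev(means)     # should be close to sigma / sqrt(n)
theoretical_se = sigma / n ** 0.5

print(f"mean of sample means: {mean_of_means:.2f} (mu = {mu})")
print(f"SD of sample means:   {sd_of_means:.2f} (sigma / sqrt(n) = {theoretical_se:.2f})")
```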
The distribution of the sample mean

According to the CLT, the sample mean $\bar{X}$ (approximately) follows a normal distribution:

[Figure: normal density centred at $\mu$; the central 95% lies between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$, with 2.5% in each tail.]

The 95% confidence interval

[Figure: the same density, centred at $\mu$ with 95% of the area between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$ and 2.5% in each tail; two example values of $\bar{X}$ are marked.]

Understanding confidence intervals

The population mean $\mu$ is a fixed unknown number. The confidence intervals vary between samples:

[Figure: "Mean and 95% confidence interval" for samples 1 to 20; vertical axis from 21 to 27.]

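The idea that the intervals vary while $\mu$ stays fixed can be sketched with a simulation. The population values below (mean 24, SD 4), the sample size 30 and the number of repetitions are hypothetical, and the SD is treated as known to keep the example short.

```python
import random
import statistics
from statistics import NormalDist

random.seed(3)  # fixed seed so the sketch is reproducible

mu, sigma, n = 24.0, 4.0, 30      # hypothetical population mean, SD and sample size
n_rep = 5_000                     # hypothetical number of repeated samples
z = NormalDist().inv_cdf(0.975)   # about 1.96
se = sigma / n ** 0.5             # standard error, with sigma treated as known

covered = 0
for _ in range(n_rep):
    xbar = statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    # The interval changes from sample to sample; mu does not.
    if xbar - z * se <= mu <= xbar + z * se:
        covered += 1

coverage = covered / n_rep
print(f"share of intervals containing mu: {coverage:.3f}")
```

Roughly 95% of the simulated intervals should contain $\mu$; any single interval either contains it or does not.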
Interpretation of CI

The 95% CI for mean male height ranges from 179 to 181 cm. Which of the following statements are true?

A. There is a 95% probability that the population mean lies between 179 and 181 cm.
B. 95% of males are between 179 and 181 cm tall.
C. We are 95% confident that the interval from 179 to 181 cm contains the population mean.
D. If we were to repeat the experiment over and over, then 95% of the time the population mean falls between 179 and 181 cm.

Why do we need confidence intervals?

We want to estimate a parameter, e.g.
• the mean height for males
• the mean difference in lung function for boys and girls

Based on a sample we suggest a qualified guess (estimate)
• we are uncertain about the guess and suggest an interval of plausible values
• the interval has to be narrow
• we want a large probability (95%) of guessing right.

Small sample confidence intervals

For small samples ($n \le 60$) the CIs are better approximated by the t-distribution with df = $n - 1$.

The 95% CI for $\mu$ is

$\bar{X} \pm z' \cdot \mathrm{SE}$

with $z'$ being the 97.5% quantile of the t-distribution with df = $n - 1$ (by symmetry, this is minus the lower 2.5% quantile).

Find a selection of quantiles in KS table A3 or calculate quantiles in R with qt(p=0.975, df=n-1).

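As a worked illustration of how much the t-quantile matters, consider a hypothetical sample (values invented for this example, not taken from the Sundby data) with $n = 10$, $\bar{X} = 180$ and sample SD $s = 8$. The 97.5% quantile of the t-distribution with 9 degrees of freedom is about 2.262, compared with 1.960 for the normal distribution:

```latex
\[
\mathrm{SE} = \frac{s}{\sqrt{n}} = \frac{8}{\sqrt{10}} \approx 2.53
\]
\[
\text{t-based 95\% CI:}\quad 180 \pm 2.262 \cdot 2.53 \approx 180 \pm 5.72
\quad\Rightarrow\quad (174.3,\ 185.7)
\]
\[
\text{normal-based 95\% CI:}\quad 180 \pm 1.960 \cdot 2.53 \approx 180 \pm 4.96
\quad\Rightarrow\quad (175.0,\ 185.0)
\]
```

The t-based interval is wider, reflecting the extra uncertainty from estimating the SD in a small sample.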
How to make conclusions based on data?

The purpose of most experiments is to prove or disprove a hypothesis.

This is done by collecting data, analyzing it and drawing a conclusion.

The original hypothesis is tested against the data to find out whether or not it is right.

Example of a hypothesis

636 children from Peru had their lung capacity examined.

Response: FEV (Forced Expiratory Volume, L/1 s).

Scientific question: Do boys and girls have different lung capacity?

Hypothesis: $H_0$: There is no difference in lung capacity for boys and girls.

We observe:
Girls: mean(FEV) = 1.54
Boys: mean(FEV) = 1.66
Observed difference = 0.12.

What can we conclude?

Formulation of a hypothesis

We always formulate hypotheses as no difference or no association.

Comparison of two populations (two groups):
$H_0$: The means are equal (i.e. $\mu_1 - \mu_0 = 0$)
$H_A$: The means are not equal.

If there is sufficient evidence against the hypothesis, we reject $H_0$.

Test statistics

We use test statistics to find evidence against the hypothesis.

Often test statistics are given by

$\dfrac{\text{estimate} - \text{hypothetical value}}{\mathrm{SD}(\text{estimate})}$

We expect the test statistic to be
• numerically small (close to 0) if the hypothesis is true
• numerically large if the hypothesis is false.

Example: Lung capacity

Let $X_i$ denote FEV for child $i$, $i = 1, \ldots, n = 636$.

Assume $X_i$ is normally distributed with mean $\mu_0$ for girls, mean $\mu_1$ for boys and variance $\sigma^2$.

Do boys and girls have different lung capacity?

Hypothesis: $H_0$: $\mu_0 = \mu_1$.

$\mu_1 - \mu_0$ is the parameter we investigate. 0 is the hypothetical value.

Two sample t-test

Can be used when data are normally distributed*, arise from two groups, the variances in the two groups are equal and all observations are independent.

Summary data:
Girls: $n_0$, $\bar{X}_0$, $\mathrm{SD}_0$
Boys: $n_1$, $\bar{X}_1$, $\mathrm{SD}_1$

Test statistic:

$T = \dfrac{(\bar{X}_1 - \bar{X}_0) - 0}{\mathrm{SD}(\bar{X}_1 - \bar{X}_0)}$

* can be relaxed when n is large (roughly $n \ge 40$).

Example: Lung capacity

        n     mean    SD
Girls   335   1.538   0.291
Boys    301   1.657   0.308

An estimate of the difference: $\bar{X}_1 - \bar{X}_0 = 0.119$.

The test statistic (formulas in KS Ch 7.4):

$T = \dfrac{0.119 - 0}{0.299 \times \sqrt{\frac{1}{335} + \frac{1}{301}}} = 5.01$

Small or large?

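The numbers on this slide can be reproduced from the summary table. The sketch below assumes that the value 0.299 is the pooled standard deviation of the two groups, which is the usual choice for the equal-variance two sample t-test; the slide itself refers to KS Ch 7.4 for the formulas.

```python
from math import sqrt

# Summary data from the slide
n0, mean0, sd0 = 335, 1.538, 0.291   # girls
n1, mean1, sd1 = 301, 1.657, 0.308   # boys

# Pooled SD (assumed here to be the source of the 0.299 on the slide)
pooled_var = ((n0 - 1) * sd0**2 + (n1 - 1) * sd1**2) / (n0 + n1 - 2)
pooled_sd = sqrt(pooled_var)

# Standard error of the difference in means under equal variances
se_diff = pooled_sd * sqrt(1 / n0 + 1 / n1)

diff = mean1 - mean0            # estimate of mu_1 - mu_0
t_stat = (diff - 0) / se_diff   # hypothetical value is 0

print(f"pooled SD = {pooled_sd:.3f}, difference = {diff:.3f}, T = {t_stat:.2f}")
```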
P values

We use p values to assess the size of test statistics.

If the hypothesis is true and we replicate the sampling many times: How often will we obtain a test statistic numerically larger than the observed test statistic?

The p-value

$P(|\text{test statistic}| > |\text{observed test statistic}|)$

is calculated assuming that the hypothesis is true.

A small p-value corresponds to the observed test statistic being unlikely if the hypothesis is true.

Example: Lung capacity

If $H_0$ is true, $T$ follows a t-distribution with df = $n_0 + n_1 - 2$.

P-value:

$P(|T| > 5.01) = P(T < -5.01) + P(T > 5.01) = 2 \cdot 3.54 \times 10^{-7} = 7.09 \times 10^{-7}$

If there is no difference in the mean lung function for boys and girls, the observed test statistic of 5.01 is unlikely.

We reject $H_0$ and conclude that boys and girls have different lung function.

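The size of this p-value can be sanity-checked with a normal approximation. This is a rough sketch, not the calculation used on the slide: Python's standard library has no t-distribution, so the standard normal is used instead, which is reasonable here only because df = 634 is large. The result should be of the same order of magnitude as the t-based value, but slightly smaller.

```python
from statistics import NormalDist

t_obs = 5.01   # observed test statistic from the slide

# Two-sided p-value under a standard normal approximation to the t-distribution.
# By symmetry, P(|T| > t) = 2 * P(T < -t).
p_two_sided = 2 * NormalDist().cdf(-t_obs)

print(f"approximate two-sided p-value: {p_two_sided:.2e}")
```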