SAMPLING, THE CLT, AND THE STANDARD ERROR
Business Statistics
CONTENTS
▪ Sampling
▪ The central limit theorem
▪ Point and interval estimates for $\mu$
▪ Confidence intervals for $\mu$
▪ Old exam question
▪ Further study
SAMPLING
Suppose you're a scissors manufacturer in the UK
▪ What proportion of your production should be left-handed?
▪ Three strategies
  ▪ look at Wikipedia ("Studies suggest that 70–90% of the world population is right-handed.[4][5]")
  ▪ ask all persons in the UK (~63 million)
  ▪ ask a sample of persons (100?) in the UK
SAMPLING
Sampling is the process of collecting data about a sample (a subset of the population), with the aim of representing the entire population
▪ Arguments for sampling: probing the entire population is
  ▪ too costly
  ▪ too time-consuming
  ▪ too dangerous
  ▪ too destructive
  ▪ etc.
▪ Arguments against sampling
  ▪ limited accuracy → confidence intervals (later in this course)
  ▪ not representative → design of experiments (not in this course)
SAMPLING
A sample should be representative
▪ e.g., don't ask people at Schiphol if they're afraid of flying
A sample should be large enough
▪ cf. the $\sqrt{n}$ law later on
Choice in sampling
▪ with replacement or without replacement
▪ this has consequences for the probability model
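The difference between the two sampling schemes can be sketched with Python's standard `random` module. The ten-person population below is hypothetical and only serves as an illustration; `random.choices` draws with replacement and `random.sample` draws without replacement.

```python
import random

# Hypothetical toy population: 1 = left-handed, 0 = right-handed (illustrative only)
population = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# With replacement: the same person can be drawn more than once,
# so the draws are independent of each other
with_repl = rng.choices(population, k=5)

# Without replacement: each person is drawn at most once,
# so each draw changes the composition of what is left
without_repl = rng.sample(population, k=5)

def proportion(sample):
    """Sample proportion of left-handed persons."""
    return sum(sample) / len(sample)

print("with replacement:   ", with_repl, proportion(with_repl))
print("without replacement:", without_repl, proportion(without_repl))
```

With replacement, the number of left-handed persons in the sample follows a binomial model; without replacement, it follows a hypergeometric model. For a small sample from a very large population the two are nearly the same.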
SAMPLING
Population | Sample
unknown | known
we would like to know | irrelevant
parameter | statistic
mostly Greek letters (e.g. $\sigma$) | mostly Roman letters (e.g. $s$)
some deviating notations | some deviating notations (e.g. $\bar{x}$)
THE CENTRAL LIMIT THEOREM
▪ Let $X_1, X_2, \ldots, X_n$ be a random sample from a population $X$ with mean $\mu_X$ and variance $\sigma_X^2$
  ▪ e.g., body heights of $n$ persons
  ▪ waiting times of $n$ customers
  ▪ failure rates of $n$ cars, ...
  (Capital $X$, because it is a random variable!)
▪ Then, for $n$ sufficiently large, the mean $\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
  1. is (approximately) normally distributed
  2. with mean $\mu_{\bar{X}} = \mu_X$
  3. and variance $\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$
  (Capital $\bar{X}$, because this is also a random variable!)
THE CENTRAL LIMIT THEOREM
So for large $n$: $\bar{X} \sim N\left(\mu_{\bar{X}} = \mu_X,\ \sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}\right)$
▪ or for short $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$
▪ This holds regardless of the distribution of $X$!
  ▪ so that's why the normal distribution is called "normal"
  ▪ this fact is called the central limit theorem (CLT)
  ▪ it is one of the most important results of statistics
  ▪ it holds for "sufficiently large" $n$
THE CENTRAL LIMIT THEOREM
The CLT for a fair die
Distribution of $\bar{X}$ for
▪ $n = 1$
▪ $n = 2$
▪ $n = 5$
▪ $n = 20$
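For a fair die, the distribution of the mean can be computed exactly rather than simulated. The sketch below is one possible way to do this, by repeatedly convolving the distribution of a single roll; the function name `mean_distribution` is an illustrative choice, not part of the course material.

```python
from fractions import Fraction

def mean_distribution(pmf, n):
    """Exact distribution of the mean of n independent rolls.

    pmf maps each face value to its probability. The result maps each
    possible value of the mean to its probability.
    """
    total = {0: Fraction(1)}  # distribution of the running sum, starting at 0
    for _ in range(n):
        nxt = {}
        for s, p in total.items():
            for face, q in pmf.items():
                nxt[s + face] = nxt.get(s + face, 0) + p * q
        total = nxt
    # Dividing every possible sum by n gives the possible values of the mean
    return {Fraction(s, n): p for s, p in total.items()}

fair = {face: Fraction(1, 6) for face in range(1, 7)}

for n in (1, 2, 5, 20):
    dist = mean_distribution(fair, n)
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    print(f"n={n:2d}  mean={float(mu):.3f}  variance={float(var):.4f}")
```

The mean stays at 3.5 for every $n$, while the variance equals $\frac{35}{12} / n$, in line with $\sigma_{\bar{X}}^2 = \sigma_X^2 / n$.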
THE CENTRAL LIMIT THEOREM
The CLT for a loaded (unfair) die
Distribution of $\bar{X}$ for
▪ $n = 1$
▪ $n = 2$
▪ $n = 5$
▪ $n = 20$
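The loaded-die case can be explored by simulation. The loading used below (face 6 comes up half of the time) is a made-up assumption, since the slide does not say how the die is loaded. The sketch compares the simulated mean and variance of $\bar{X}$ for $n = 20$ with the values the CLT predicts.

```python
import random
import statistics

faces = [1, 2, 3, 4, 5, 6]
weights = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]  # hypothetical loading, not from the slides

# Population mean and variance of a single roll of this loaded die
mu = sum(f * w for f, w in zip(faces, weights))
var = sum((f - mu) ** 2 * w for f, w in zip(faces, weights))

def simulate_means(n, reps=10_000, seed=1):
    """Simulate reps sample means, each based on n rolls of the loaded die."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(faces, weights=weights, k=n))
        for _ in range(reps)
    ]

n = 20
means = simulate_means(n)
print("predicted mean:    ", mu, " simulated:", statistics.fmean(means))
print("predicted variance:", var / n, " simulated:", statistics.pvariance(means))
```

A histogram of `means` should look roughly bell-shaped even though a single roll is strongly skewed towards 6.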
EXERCISE 1
We roll a die 100 times. The outcomes are $X = (X_1, X_2, \ldots, X_{100})$. How is $\bar{X}$ distributed?
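One possible line of reasoning, assuming the die is fair and six-sided (the exercise does not say so explicitly): first find the mean and variance of a single roll, then apply the CLT with $n = 100$.

```latex
\mu_X = \frac{1 + 2 + \cdots + 6}{6} = 3.5
\qquad
\sigma_X^2 = \frac{1^2 + 2^2 + \cdots + 6^2}{6} - 3.5^2
           = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92

\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)
        = N\left(3.5, \frac{35}{1200}\right) \approx N(3.5,\ 0.029)
\qquad
\sigma_{\bar{X}} = \sqrt{\frac{35}{1200}} \approx 0.17
```

Because $n = 100$ is well above 30, the normal approximation should be good here.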
THE CENTRAL LIMIT THEOREM
A "proof" of the theorem (for normal populations)
▪ Recall the additive property of the normal distribution:
  ▪ if $X_1 \sim N(\mu_X, \sigma_X^2)$ and $X_2 \sim N(\mu_X, \sigma_X^2)$, then $X_1 + X_2 \sim N(2\mu_X, 2\sigma_X^2)$ (provided $X_1$ and $X_2$ are independent)
▪ Also recall that if $X \sim N(\mu_X, \sigma_X^2)$ then $aX \sim N(a\mu_X, a^2\sigma_X^2)$
▪ So, if $X_1 + X_2 \sim N(2\mu_X, 2\sigma_X^2)$ then $\frac{X_1 + X_2}{2} \sim N\left(\mu_X, \frac{\sigma_X^2}{2}\right)$
▪ and more generally: $\frac{X_1 + \cdots + X_n}{n} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$
▪ or equivalently: $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$
  (You don't need to reproduce such proofs, but it may help)
▪ This proof works for normal populations and all $n$, but the CLT is valid for all populations and "large" $n$
THE CENTRAL LIMIT THEOREM
Some consequences of the CLT
▪ $\bar{X}$ is an estimator of $\mu_X$
  ▪ and $\bar{x}$ is the best estimate of $\mu_X$
▪ $\bar{X}$ will be a better estimator for large $n$
  ▪ because $\sigma_{\bar{X}}$ decreases with $n$
▪ we can use the distribution of $\bar{X}$ to construct a confidence interval for $\mu$
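How quickly the spread of the sample mean shrinks can be seen from a few numbers. The sketch below uses an assumed population standard deviation of 2; the helper name `standard_error` is illustrative.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)

sigma = 2.0  # assumed population standard deviation (illustrative)
for n in (1, 4, 25, 100, 400):
    print(f"n={n:3d}  standard deviation of the mean = {standard_error(sigma, n):.3f}")
```

Because of the square root, the sample size must be quadrupled to halve the spread of the sample mean.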
THE CENTRAL LIMIT THEOREM
The CLT holds for $n$ "sufficiently" large
▪ More specifically:
  ▪ if $X$ is normally distributed, the CLT holds for all sample sizes $n$
  ▪ if the distribution of $X$ is fairly symmetric without extreme outliers, for sample sizes $n \ge 15$ the CLT gives a pretty good approximation of the distribution of $\bar{X}$
  ▪ for any distribution of $X$ and a sample size $n \ge 30$, the CLT gives a pretty good approximation of the distribution of $\bar{X}$
THE CENTRAL LIMIT THEOREM
The effect of asymmetry vs. sample size
POINT AND INTERVAL ESTIMATES FOR $\mu$
A statistic is a function of the (randomly sampled) data
▪ important example: the statistic $\bar{X}$
  ▪ defined by $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
  ▪ in a concrete case, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the best possible estimate of the parameter $\mu$
  ▪ so the sample mean $\bar{x}$ is the best possible estimate of the population mean $\mu$
  ▪ because it is just one value, it is a point estimate
POINT AND INTERVAL ESTIMATES FOR $\mu$
Due to sampling variation, $\bar{x}$ will be different in each sample
▪ and there will be a distribution of $\bar{x}$-values, the distribution of $\bar{X}$
▪ the true value of $\mu$ may be different from the value of $\bar{x}$ obtained
▪ however, keep in mind that the value of $\bar{x}$ obtained cannot be "too" wrong
  ▪ we know that $\bar{X} \sim N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$, so it follows that a specific value $\bar{x}$ must be within $[\mu_{\bar{X}} - 1.96\sigma_{\bar{X}},\ \mu_{\bar{X}} + 1.96\sigma_{\bar{X}}]$ with 95% probability
POINT AND INTERVAL ESTIMATES FOR $\mu$
Conversely, the population value $\mu_{\bar{X}}$ must be within $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ with 95% probability
▪ and because $\mu_{\bar{X}} = \mu_X$, the population value $\mu_X$ must be within $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ with 95% probability
  ▪ this is an interval estimate for $\mu_X$
  ▪ we say that $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ is a 95% confidence interval for $\mu_X$
POINT AND INTERVAL ESTIMATES FOR $\mu$
So:
▪ we estimate $\mu_X$ by $\bar{x}$
▪ and we know with 95% probability that $\bar{x} - 1.96\sigma_{\bar{X}} \le \mu_X \le \bar{x} + 1.96\sigma_{\bar{X}}$
▪ the quantity $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$ is the standard deviation of the distribution of the mean $\bar{X}$
  ▪ it is so important that we give it a special name: the standard error of the mean
  ▪ sometimes (unfortunately!) abbreviated as the standard error
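The "95%" can be checked empirically: if many samples are drawn and an interval is built from each, roughly 95% of those intervals should contain $\mu_X$. The sketch below assumes a normal population with known $\sigma_X$; the values $\mu_X = 10$, $\sigma_X = 2$ and $n = 25$, and the function names, are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, fmean

def z_interval(sample, sigma, level=0.95):
    """Confidence interval for the mean when sigma is known."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    se = sigma / math.sqrt(n)                  # standard error of the mean
    xbar = fmean(sample)
    return xbar - z * se, xbar + z * se

def coverage(mu, sigma, n, reps=5_000, seed=0):
    """Fraction of simulated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma)
        hits += lo <= mu <= hi
    return hits / reps

print(coverage(mu=10.0, sigma=2.0, n=25))  # should be close to 0.95
```

Note that it is the interval that varies from sample to sample, while $\mu_X$ stays fixed.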
EXERCISE 2
We sample ($n = 25$) from a normal population $X$ with unknown $\mu_X$ and known $\sigma_X^2 = 4$. We find $\bar{x} = 3$.
a. Give a point estimate for $\mu_X$.
b. Find the standard error of the mean, $\sigma_{\bar{X}}$.
c. Give a 95% confidence interval for $\mu_X$.
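A possible worked solution, using only the formulas from the preceding slides (with $\sigma_X = \sqrt{4} = 2$):

```latex
\text{a. } \hat{\mu}_X = \bar{x} = 3

\text{b. } \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{2}{\sqrt{25}} = 0.4

\text{c. } \bar{x} \pm 1.96\,\sigma_{\bar{X}} = 3 \pm 1.96 \times 0.4 = 3 \pm 0.784
\quad\Rightarrow\quad [2.216,\ 3.784]
```

Because the population is normal, the interval is valid even though $n = 25$ is below 30.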
CONCEPTS AND SYMBOLS
▪ Carefully distinguish:
  ▪ $\mu_X$ (a value, often unknown)
  ▪ $\bar{x}$ (a value from observations)
  ▪ $\bar{X}$ (a distribution, not a value)
  ▪ and its two parameters $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}^2$ (both are values, often unknown)
  ▪ and the CLT claims that $\mu_{\bar{X}} = \mu_X$ and $\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$
▪ Later on, we will follow a similar logic, e.g.
  ▪ $\sigma_X^2$
  ▪ $s_X^2$
  ▪ $S_X^2$
  ▪ and its two parameters
OLD EXAM QUESTION
23 March 2015, Q1h
FURTHER STUDY
▪ Doane & Seward 5/E, 8.1-8.3
▪ Tutorial exercises week 2
▪ sampling distribution
▪ central limit theorem
▪ standard error