Statistics, inference and ordinary least squares
Frank Venmans
Statistics
Conditional probability
• Consider 2 events on one throw of a die:
  • A: die shows 1, 3 or 5 => $P(A) = 3/6$
  • B: die shows 3 or 6 => $P(B) = 2/6$
• $A \cap B$: A and B occur: die shows 3 => $P(A \cap B) = 1/6$
• $A \cup B$: A or B occurs: die shows 1, 3, 5 or 6 => $P(A \cup B) = 4/6$
• Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (~ Venn diagram)
• Conditional probability: $P(A|B) = \frac{P(A \cap B)}{P(B)}$ (~ Venn diagram)
• $P(A|B)$: probability of event A given that B occurs = 1/2
• $P(B|A)$: probability of event B given that A occurs = 1/3
• Bayes' law: $P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)$
• An event can be any set of outcomes. Example:
  • A: random draw from the Belgian population with income > 30,000
  • B: random draw from the Belgian population with education > 12 years
  • $P(A|B) \neq P(A)$
[Figure: Venn diagrams of the die outcomes 1 to 6 and of the events Income > 30,000 and Education > 12]
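The die example can be checked by brute-force counting. The sketch below is only an illustration: the helper `prob` and the variable names are made up for this example, and exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # fair die: every face has probability 1/6
A = {1, 3, 5}                   # event A from the slide
B = {3, 6}                      # event B from the slide

def prob(event):
    """Probability of an event (a set of faces) on one throw of a fair die."""
    return Fraction(len(event & outcomes), len(outcomes))

p_a_and_b = prob(A & B)              # intersection: both A and B occur
p_a_or_b = prob(A | B)               # union: A or B occurs
p_a_given_b = p_a_and_b / prob(B)    # P(A|B) = P(A and B) / P(B)
p_b_given_a = p_a_and_b / prob(A)    # P(B|A) = P(A and B) / P(A)

# Addition rule and Bayes' law as stated on the slide
addition_rule_holds = p_a_or_b == prob(A) + prob(B) - p_a_and_b
bayes_holds = p_a_given_b * prob(B) == p_b_given_a * prob(A)

print(p_a_given_b, p_b_given_a)      # 1/2 1/3
print(addition_rule_holds, bayes_holds)
```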
Independence
• 2 events A and B: $P(A|B) = P(A) \Leftrightarrow P(B|A) = P(B) \Leftrightarrow$ A and B are independent
• Two variables X and Y: $f(x|y) = f(x) \Leftrightarrow f(y|x) = f(y) \Leftrightarrow$ X and Y are independent
• X and Y are independent if the conditional distribution of X given Y is the same as the unconditional distribution of X.
• Variables without any causal link do not necessarily have zero correlation (statistically independent variables do).
  • Example: the height of my son and Indian GDP are correlated (both are affected by time)
• Dependent variables may have zero correlation in exceptional cases.
  • Example: selection bias may offset a causal effect (see further)
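A textbook illustration of dependence with zero correlation (a different case from the selection-bias example on the slide, and not taken from the slides): let X take the values -1, 0, 1 with equal probability and set Y = X². Y is fully determined by X, yet the covariance is exactly zero. The variable names are illustrative.

```python
from fractions import Fraction

p = Fraction(1, 3)                            # each value of X is equally likely
pairs = [(x, x * x) for x in (-1, 0, 1)]      # (x, y) with y = x squared

e_x = sum(p * x for x, _ in pairs)
e_y = sum(p * y for _, y in pairs)
cov_xy = sum(p * (x - e_x) * (y - e_y) for x, y in pairs)

# Dependence: the conditional distribution of Y given X differs from the marginal
p_y1 = sum(p for _, y in pairs if y == 1)                     # P(Y = 1)
p_y1_given_x1 = (sum(p for x, y in pairs if x == 1 and y == 1)
                 / sum(p for x, _ in pairs if x == 1))        # P(Y = 1 | X = 1)

print(cov_xy)                 # 0
print(p_y1, p_y1_given_x1)    # 2/3 1
```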
Cumulative Distribution Function (CDF) and Probability Density Function (PDF)
• Notation:
  • Random variables X, Y: e.g. yearly earnings and level of education
    • Discrete if earnings are multiples of 100 € and education is in years
    • ~Continuous if earnings are expressed in eurocents and education in seconds
  • Specific values of random variables: a, b or x, y
• Cumulative Distribution Function:
  • probability that X is smaller than or equal to a
  • $F(a) = P(X \le a)$
• Probability Density Function:
  • For discrete variables: $f(a) = P(X = a)$
  • For continuous variables: $f(a) = \frac{dF(a)}{da} \Leftrightarrow F(a) = \int_{-\infty}^{a} f(X)\,dX$
• Area under the pdf = 1 because $F(\infty) = 1$
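The link between pdf and CDF for a continuous variable can be checked numerically. The sketch below assumes a standard normal variable purely for illustration and uses a simple trapezoid rule; `integrate_pdf` is a hypothetical helper, and -10 stands in for minus infinity.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)   # assumed example distribution

def integrate_pdf(upper, lower=-10.0, steps=20000):
    """Trapezoid-rule approximation of the integral of the pdf from lower to upper."""
    h = (upper - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(upper))
    for k in range(1, steps):
        total += dist.pdf(lower + k * h)
    return total * h

a = 1.0
approx_cdf = integrate_pdf(a)      # integral of f up to a
exact_cdf = dist.cdf(a)            # F(a) = P(X <= a)
total_area = integrate_pdf(10.0)   # area under the whole pdf, should be close to 1

print(approx_cdf, exact_cdf, total_area)
```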
Joint Cumulative Distribution Function
• Assume Y is yearly earnings and X is level of education
• $F(x, y) = P(X \le x \;\&\; Y \le y)$
Density function
• Joint Density Function
  • Discrete variables: $f(x, y) = P(X = x \;\&\; Y = y)$
  • Continuous variables: $f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y}$
• Marginal Density Function
  • Discrete variables: $f(x) = P(X = x)$, disregarding y
  • Continuous variables: $f(x) = \int_{y=-\infty}^{y=\infty} f(x, y)\,dy$
  • (red and blue line)
• Conditional Density Function
  • Discrete variables: $f(x|y) = P(X = x \,|\, Y = y)$
  • $f(x|y) = \frac{f(x, y)}{f(y)}$
  • (intersections through the joint density function)
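For discrete variables, the marginal and conditional densities are just sums and ratios over the joint table. The joint probabilities below are invented for illustration (X = years of education, Y = earnings bracket); they are not data from the slides.

```python
from fractions import Fraction

# Hypothetical joint pmf f(x, y): x = years of education, y = earnings bracket
joint = {
    (12, "low"): Fraction(3, 10), (12, "high"): Fraction(1, 10),
    (16, "low"): Fraction(2, 10), (16, "high"): Fraction(4, 10),
}

def marginal_x(x):
    """f(x): sum the joint pmf over all values of y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    """f(y): sum the joint pmf over all values of x."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

def conditional_x_given_y(x, y):
    """f(x|y) = f(x, y) / f(y)."""
    return joint[(x, y)] / marginal_y(y)

print(marginal_x(16))                      # 3/5
print(conditional_x_given_y(16, "high"))   # 4/5
```

In this made-up table the conditional probability of 16 years of education given high earnings (4/5) differs from the marginal probability (3/5), so X and Y are not independent.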
Regression as a conditional density function
Expected value
• Unconditional expected value
  • For a discrete random variable: $E[X] = \sum_i x_i P(x_i) = \mu$
  • For a continuous random variable: $E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \mu$
• Conditional expected value (in finance many expectations are conditional on the information set at time t)
  • $E[X|Y] = E_Y[X] = \sum_i x_i P(x_i|Y)$
  • $E[X|Y] = \int_{-\infty}^{\infty} x f(x|y)\,dx$
• Variance $= \sigma^2 = E[(X - \mu)^2]$
• Covariance between X and Y $= \sigma_{X,Y} = E[(X - \mu_X)(Y - \mu_Y)]$
• Skewness $= E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$
• Kurtosis $= E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]$
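As a small worked case, the moment definitions can be applied to a fair die (an example chosen here, reusing the die from the first slide). The symmetric distribution gives zero skewness, and the flat shape gives a kurtosis well below the value 3 of the normal distribution.

```python
faces = [1, 2, 3, 4, 5, 6]
p = 1 / 6                       # probability of each face of a fair die

mu = sum(x * p for x in faces)                                # E[X]
var = sum((x - mu) ** 2 * p for x in faces)                   # E[(X - mu)^2]
sigma = var ** 0.5
skewness = sum(((x - mu) / sigma) ** 3 * p for x in faces)    # third standardized moment
kurtosis = sum(((x - mu) / sigma) ** 4 * p for x in faces)    # fourth standardized moment

print(mu, var, skewness, kurtosis)
```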
Normal distribution
• $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)$
• Notation: $X \sim N(\mu, \sigma^2)$
• Skewness = 0
• Kurtosis = 3
• Jarque-Bera test for normality: tests whether skewness and kurtosis are close to 0 and 3.
• Any linear combination of jointly normally distributed variables (correlated or not) is normally distributed.
• Central limit theorem: the suitably standardized sum of a large number of independent random variables (with finite variance, whatever their distribution) is approximately normally distributed, and exactly so in the limit.
Chi square distribution
• $Y = \sum_{i=1}^{n} X_i^2$ with $X_i \sim N(0,1)$ and all $X_i$ independent follows a $\chi^2$ distribution with n degrees of freedom.
• $Y \sim \chi^2_n$
Student t distribution
• $Z = \frac{X}{\sqrt{Y/n}}$ with $X \sim N(0,1)$ and $Y \sim \chi^2_n$ and X independent from Y follows a Student or t-distribution with n degrees of freedom
• $Z \sim t_n$
• Higher variance and kurtosis than the standardized normal distribution
• Converges to the normal distribution for large n: $t_\infty = N(0,1)$
F distribution
• $Z = \frac{X/n}{Y/m}$ with $X \sim \chi^2_n$ and $Y \sim \chi^2_m$ and X independent from Y follows an F distribution with n and m degrees of freedom.
• $Z \sim F_{n,m}$
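The three constructions (chi square, t and F) can be reproduced by simulation from independent standard normal draws. This is a rough sketch with arbitrarily chosen degrees of freedom (n = 5, m = 10) and a fixed seed; it relies on the standard results that a chi square variable with n degrees of freedom has mean n, a t variable with m degrees of freedom has variance m/(m-2), and an F variable with n and m degrees of freedom has mean m/(m-2).

```python
import random
from statistics import fmean, variance

random.seed(0)
n, m, draws = 5, 10, 20000      # degrees of freedom and number of simulated values

def chi2(df):
    """Sum of df squared independent N(0,1) draws."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

# Chi square with n degrees of freedom
chi2_draws = [chi2(n) for _ in range(draws)]
# t with m degrees of freedom: N(0,1) divided by sqrt(chi2_m / m)
t_draws = [random.gauss(0.0, 1.0) / (chi2(m) / m) ** 0.5 for _ in range(draws)]
# F with n and m degrees of freedom: ratio of two scaled chi squares
f_draws = [(chi2(n) / n) / (chi2(m) / m) for _ in range(draws)]

print(fmean(chi2_draws))    # close to n
print(variance(t_draws))    # above 1, the variance of N(0,1)
print(fmean(f_draws))       # close to m / (m - 2)
```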
Inference
Statistical inference
• Try to say something about the real distribution of a random variable based on a sample.
• The real distribution corresponds to an infinitely repeated event (e.g. throwing dice), the entire population, the entire set of possible 'states of the world' in a future period, etc.
3 types of inference
• Point estimator:
  • Ex: sample mean, sample variance, marginal effect in a linear regression (beta), correlation…
  • Concept of repeated sampling: every sample gives another estimate $\hat\theta$ => $\hat\theta$ will follow a probability distribution
  • Unbiased: the expected value of the estimator corresponds to the real parameter, $E[\hat\theta] = \theta$
  • Consistent: the estimator can get arbitrarily close to the real parameter by increasing the sample size, $\operatorname{plim}_{n \to \infty} \hat\theta = \theta$
    • Ex: the sample variance estimator $s^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2$ is a biased but consistent estimator of the variance
  • Efficient estimator: $\operatorname{var}(\hat\theta)$ is small
• Interval estimation:
  • Ex: given the observed sample, the real mean lies between 1 and 3 with 95% probability
• Hypothesis testing:
  • Ex: if the null hypothesis is true ($\mu = 2$), what is the probability that a random sample has a more extreme (less likely) outcome than the observed sample mean of 4 and sample variance of 2?
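The bias of the 1/n variance estimator shows up clearly under repeated sampling. The simulation below is a sketch under assumed settings (normal data with true variance 1, samples of size 5, fixed seed): the 1/n estimator averages to about (n-1)/n times the true variance, while dividing by n-1 removes the bias.

```python
import random
from statistics import fmean

random.seed(1)
n, reps, true_var = 5, 20000, 1.0     # assumed sample size, repetitions, variance

biased, unbiased = [], []
for _ in range(reps):
    y = [random.gauss(0.0, true_var ** 0.5) for _ in range(n)]
    ybar = fmean(y)
    ss = sum((yi - ybar) ** 2 for yi in y)   # sum of squared deviations
    biased.append(ss / n)                    # estimator from the slide (divides by n)
    unbiased.append(ss / (n - 1))            # divides by n - 1 instead

mean_biased = fmean(biased)        # near (n - 1) / n * true_var = 0.8
mean_unbiased = fmean(unbiased)    # near true_var = 1.0

print(mean_biased, mean_unbiased)
```

Consistency follows because the factor (n-1)/n tends to 1 as the sample size grows.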
Example: Sample mean
• Income of Belgian households: a random variable following a distribution with mean $\mu$ and variance $\sigma^2$ (the distribution is skewed, not normal)
• You have a sample of n individuals. You want to say something about $\mu$ and $\sigma^2$
• Estimator of $\mu$: sample mean $\bar{y} = \frac{y_1 + y_2 + y_3 + \dots + y_n}{n}$
• The estimate will be different each time you draw a different sample => the sample mean follows a distribution, which is different from the distribution of y.
• Central limit theorem => the distribution of the sample mean converges to a normal distribution even if y does not follow a normal distribution.
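Repeated sampling from a skewed distribution illustrates the point. The sketch below uses an exponential distribution with mean 1 as a stand-in for skewed incomes (an assumption made only for this example, with a fixed seed): the sample means are centred on the true mean, have variance close to the population variance divided by n, and are far less skewed than the raw data.

```python
import random
from statistics import fmean, pvariance

random.seed(2)
n, reps = 50, 5000      # sample size and number of repeated samples

def skewness(values):
    """Third standardized moment of a list of numbers."""
    m = fmean(values)
    s = pvariance(values, m) ** 0.5
    return fmean([((v - m) / s) ** 3 for v in values])

# Raw draws from a skewed distribution (exponential, mean 1, variance 1)
raw = [random.expovariate(1.0) for _ in range(reps)]
# One sample mean per repeated sample of size n
means = [fmean([random.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]

print(fmean(means), pvariance(means))    # close to 1 and to 1 / n
print(skewness(raw), skewness(means))    # strongly skewed vs. nearly symmetric
```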
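Sample mean: variance known
• $\bar{y} \overset{\text{asymptotic}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \Rightarrow \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} \overset{\text{asymptotic}}{\sim} N(0,1)$
• This allows us to determine a 95% confidence interval:
  $P\left(-1.96 < \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95 \Leftrightarrow P\left(\bar{y} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{y} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$
• When the interval includes zero, we say that the sample mean is not significantly different from zero at the 5% significance level.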
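The interval with known variance is a one-line computation. In the sketch below, `mean_ci_known_sigma` is a hypothetical helper and the inputs (sample mean 2, sigma 1, n = 100) are invented; the critical value comes from the standard normal quantile function rather than being hard-coded as 1.96.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_known_sigma(ybar, sigma, n, level=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for level = 0.95
    half_width = z * sigma / sqrt(n)
    return ybar - half_width, ybar + half_width

# Hypothetical numbers, for illustration only
lower, upper = mean_ci_known_sigma(ybar=2.0, sigma=1.0, n=100)
includes_zero = lower < 0.0 < upper    # if True: not significantly different from zero

print(lower, upper, includes_zero)
```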
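Sample mean: variance unknown and y normally distributed
• Both the mean and the variance need to be estimated.
• Estimator for the variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$
• If Y follows a normal distribution $\Rightarrow \frac{(n-1)s^2}{\sigma^2} = \sum_{i=1}^{n} \left(\frac{y_i - \bar{y}}{\sigma}\right)^2 \sim \chi^2_{n-1}$ (no proof but intuitive)
• $\frac{\bar{y} - \mu}{s/\sqrt{n}} = \frac{(\bar{y} - \mu)/(\sigma/\sqrt{n})}{\sqrt{s^2/\sigma^2}} = \frac{(\bar{y} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\frac{(n-1)s^2}{\sigma^2}\big/(n-1)}} \sim \frac{N(0,1)}{\sqrt{\chi^2_{n-1}/(n-1)}} = t_{n-1}$
• This allows us to determine a 95% confidence interval (ex. n = 21):
  $P\left(-2.086 < \frac{\bar{y} - \mu}{s/\sqrt{n}} < 2.086\right) = 0.95 \Leftrightarrow P\left(\bar{y} - 2.086\frac{s}{\sqrt{n}} < \mu < \bar{y} + 2.086\frac{s}{\sqrt{n}}\right) = 0.95$
• For large n, the t distribution converges to the normal distribution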
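A sketch of the n = 21 case: the 21 observations below are made up for illustration, and the critical value 2.086 (t distribution, 20 degrees of freedom) is taken from the slide because the Python standard library has no t quantile function.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_20DF = 2.086    # 97.5% quantile of t with 20 degrees of freedom (value from the slide)

# Hypothetical sample of n = 21 observations
sample = [2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 2.2, 3.8, 2.7, 1.9, 3.0,
          2.6, 2.4, 3.3, 2.8, 2.0, 3.5, 2.3, 2.9, 3.2, 2.6]

n = len(sample)
ybar = mean(sample)
s = stdev(sample)                          # sample standard deviation, divides by n - 1
half_width = T_CRIT_20DF * s / sqrt(n)
ci = (ybar - half_width, ybar + half_width)

print(n, ybar, s, ci)
```

With the normal critical value 1.96 instead of 2.086 the interval would be narrower, which is the price paid for estimating the variance from a small sample.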
Hypothesis testing
• Null hypothesis $H_0: \theta = \theta_0$, ex: $H_0: \theta = 0$
• One-sided test $H_A: \theta > \theta_0$ (or $\theta < \theta_0$), ex: $H_A: \theta > 0$
• Two-sided test $H_A: \theta \neq \theta_0$, ex: $H_A: \theta \neq 0$
• 2 regions:
  • If the observed data (test statistic) fall in the rejection region => reject $H_0$
  • If the observed data (test statistic) fall in the acceptance region => accept (do not reject) $H_0$
• Imagine you have 10 months of data, you observe a mean monthly return on Apple stock of 0.8%, and you want to test whether this mean is different from a zero return.
• Assume the standard deviation of the return is observed to be 1.58%, so the standard error of the mean is $\frac{1.58\%}{\sqrt{10}} = 0.5\%$
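One way the Apple example could be carried through is a two-sided t test. This is a sketch, not necessarily the procedure followed on the later slides: it assumes normally distributed returns, so that the test statistic follows a t distribution with 9 degrees of freedom, and it takes the 5% two-sided critical value 2.262 from a standard t table.

```python
from math import sqrt

n = 10                  # months of data
mean_return = 0.8       # observed mean monthly return, in percent
sd_return = 1.58        # observed standard deviation of returns, in percent
theta_0 = 0.0           # value of the mean under H0

se_mean = sd_return / sqrt(n)                 # standard error of the mean, about 0.5
t_stat = (mean_return - theta_0) / se_mean    # about 1.6

T_CRIT_9DF = 2.262      # assumed: two-sided 5% critical value, t with 9 df (standard table)
reject_h0 = abs(t_stat) > T_CRIT_9DF

print(se_mean, t_stat, reject_h0)
```

Under these assumptions the test statistic falls in the acceptance region, so a zero mean return is not rejected at the 5% level.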