SIMPLE REGRESSION ANALYSIS Business Statistics
CONTENTS
▪ Ordinary least squares (recap for some)
▪ Statistical formulation of the regression model
▪ Assessing the regression model
▪ Testing the regression coefficients
▪ The ANOVA table
▪ Old exam question
▪ Further study
ORDINARY LEAST SQUARES
Idea of “curve fitting” in a scatterplot
▪ linear fit: y = a + bx (x = floor area of house, y = price of house)
ORDINARY LEAST SQUARES
You find the “best” line
▪ by minimizing the “misfit” (eᵢ) between observed value (yᵢ) and modelled/estimated value (ŷᵢ = a + bxᵢ)
▪ eᵢ = yᵢ − ŷᵢ (the hat ^ is our symbol for the estimate)
▪ in fact by minimizing the sum of squares of the misfit: Σᵢ₌₁ⁿ eᵢ²
▪ this is OLS regression
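The minimization has a well-known closed-form solution, b = Sxy/Sxx and a = ȳ − b·x̄, which the sketch below computes directly. The data points are made up for illustration; they are not the house-price data used in the slides.

```python
# Minimal OLS sketch: fit y = a + b*x by minimizing the sum of squared misfits.
# The data points below are invented for illustration.
xs = [50.0, 70.0, 90.0, 110.0, 130.0]     # e.g. floor area (m^2)
ys = [310.0, 420.0, 560.0, 650.0, 790.0]  # e.g. price (in thousands)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form OLS estimates: b = Sxy / Sxx, a = y_bar - b * x_bar
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

# The misfits e_i = y_i - (a + b*x_i); with OLS they always sum to zero
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(a, b)
```

Any other choice of a and b would give a strictly larger sum of squared residuals; that is what “best” means here.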
STATISTICAL FORMULATION OF THE REGRESSION MODEL
Rephrasing the model y = a + bx as a statistical model
Assumptions and notation
▪ we assume a linear relation of the form of the population regression model Yᵢ = β₀ + β₁Xᵢ + εᵢ
▪ or Y = β₀ + β₁X + ε
▪ we prefer to use β₀ instead of a for the constant, and β₁ instead of b for the slope
▪ β₀ is the intercept or constant
▪ β₁ is the slope or slope coefficient
▪ the random variable εᵢ is the error or residual, the “unexplained part”
STATISTICAL FORMULATION OF THE REGRESSION MODEL
Estimation of the model coefficients
▪ we assume that εᵢ ~ N(0, σ²)
▪ based on a sample of n paired data points (xᵢ, yᵢ), i = 1, …, n
▪ use OLS to estimate the best line through the estimated regression model Ŷ = b₀ + b₁X, or ŷᵢ = b₀ + b₁xᵢ
▪ the estimated coefficients (b₀ for β₀ and b₁ for β₁) and the estimated error (eᵢ for εᵢ) correspond to yᵢ = b₀ + b₁xᵢ + eᵢ
STATISTICAL FORMULATION OF THE REGRESSION MODEL
[Scatterplot: fitted line Ŷ = b₀ + b₁X with intercept b₀ and slope b₁; the residual eᵢ is the vertical distance between the data point (xᵢ, yᵢ) and the fitted point (xᵢ, ŷᵢ)]
STATISTICAL FORMULATION OF THE REGRESSION MODEL
So
▪ b₀ is the estimated value of β₀
▪ the intercept or constant of the regression line
▪ b₁ is the estimated value of β₁
▪ the slope or slope coefficient of the regression line
▪ eᵢ is the estimated residual or error for observation i
▪ the “misfit”
EXERCISE 1
Look back at the house prices
▪ where we found the line ŷ = −264700 + 6152x
a. Give the theoretical model
b. Give the estimated model
ASSESSING THE REGRESSION MODEL
OLS will always give an estimate for β₀ and β₁
▪ the line of “best fit”
But is “best” also “good enough” to make good predictions?
▪ can we do a statistical test on the quality of the model?
We have minimized the sum of squares (SS) of the error:
SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
We would like to compare this with:
▪ the “total” sum of squares SST
▪ the “explained” sum of squares SSR (“R” stands for “regression”)
ASSESSING THE REGRESSION MODEL
Total sum of squares:
SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
So SST is the total variation around the mean ȳ
ASSESSING THE REGRESSION MODEL
Regression sum of squares:
SSR = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²
So SSR is the variation around the mean ȳ that is explained by the model
▪ So,
▪ the data has a total variability SST
▪ the regression model explains a variability SSR
▪ and the residual variability is SSE
▪ and SST = SSR + SSE
Coefficient of determination (“R-square”): R² = SSR/SST = 1 − SSE/SST
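The decomposition SST = SSR + SSE and the resulting R² can be checked numerically. The sketch below uses a small invented sample (not the slides’ house-price data) and the closed-form OLS fit.

```python
# Sketch of the sum-of-squares decomposition SST = SSR + SSE and of R^2,
# using a small invented sample and the closed-form OLS estimates.
xs = [50.0, 70.0, 90.0, 110.0, 130.0]
ys = [310.0, 420.0, 560.0, 650.0, 790.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)              # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)         # explained by the model
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat)) # residual variation

r2 = ssr / sst  # equivalently 1 - sse/sst
print(sst, ssr, sse, r2)
```

For OLS fits with an intercept the identity SST = SSR + SSE holds exactly (up to floating-point rounding), which is what makes R² interpretable as a proportion.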
ASSESSING THE REGRESSION MODEL
R² is a measure of the usefulness of the model: R² = 1 − SSE/SST
▪ Properties
▪ 0 ≤ R² ≤ 1
▪ R² = 0 means the model doesn’t explain anything
▪ R² = 1 means the model explains everything
▪ in between, the model explains R² × 100% of the variance of Y
ASSESSING THE REGRESSION MODEL
If R² > 0, the regression model explains “something”
▪ but in a random sample, R² may be non-zero due to chance
▪ when is R² “significantly” different from 0?
Finding a test statistic
▪ look at the variances associated with SSR and SSE
▪ so define the mean sums of squares (MS) (variances!)
▪ MST = SST/(n−1); MSR = SSR/1; MSE = SSE/(n−2)
▪ use MSR/MSE = (SSR/1)/(SSE/(n−2)) as a ratio of two variances
ASSESSING THE REGRESSION MODEL
Statistical test:
▪ H₀: the independent variable (X) does not explain the variation in the dependent variable (Y)
▪ i.e., H₀: β₁ = 0 versus H₁: β₁ ≠ 0
▪ sample statistic: F = MSR/MSE; reject for large values
▪ under H₀: F ~ F(1, n−2); assumptions: see model
▪ compare F_calc = MSR/MSE with F_crit = F(1, n−2; α)
▪ or compute the p-value as the probability of obtaining F_calc or more extreme if H₀ is true
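Computing F_calc itself is simple arithmetic on the sums of squares. The SSR, SSE and n values below are illustrative numbers, not taken from the slides’ dataset; the critical value F(1, n−2; α) would come from a table or statistical software.

```python
# Sketch of the model F-test: F = MSR / MSE with df (1, n-2).
# The SSR, SSE and n values here are illustrative, not the slides' data.
ssr, sse, n = 141610.0, 510.0, 5

msr = ssr / 1        # regression mean square, df = 1
mse = sse / (n - 2)  # error mean square, df = n - 2
f_calc = msr / mse
print(f_calc)  # compare with F_crit = F(1, n-2; alpha) from a table
```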
ASSESSING THE REGRESSION MODEL
Using SPSS, three types of output
Model summary
▪ R²
Variance decomposition (ANOVA)
▪ SSR, SSE, SST
▪ MSR, MSE
▪ F_calc
▪ p-value
Regression coefficients
▪ b₀ and b₁
ASSESSING THE REGRESSION MODEL
The model is Y = β₀ + β₁X + ε
▪ OLS extracts estimates from the data: b₀ and b₁
▪ but how accurate are these estimates?
We can also find the distribution of B₀ and B₁
▪ so we can find confidence intervals and perform hypothesis tests
B₀ and B₁ are t-distributed:
▪ (B₀ − β₀)/S_B₀ ~ t(n−2)
▪ (B₁ − β₁)/S_B₁ ~ t(n−2)
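The standard error of the slope that feeds this t-ratio is the usual OLS formula s_b1 = √(MSE / Sxx). The sketch below computes it for an invented sample (the b1, SSE and x values are illustrative, not the slides’ data).

```python
# Sketch: standard error of the slope and its t-statistic,
# s_b1 = sqrt(MSE / Sxx), with T = (B1 - beta1) / S_B1 ~ t(n-2) under the model.
# All numbers are illustrative, not the slides' dataset.
import math

xs = [50.0, 70.0, 90.0, 110.0, 130.0]
b1, sse, n = 5.95, 510.0, 5

x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)

mse = sse / (n - 2)             # error variance estimate
se_b1 = math.sqrt(mse / s_xx)   # estimated standard deviation of B1
t_calc = (b1 - 0) / se_b1       # test statistic for H0: beta1 = 0
print(se_b1, t_calc)
```

A larger spread in the x-values (larger Sxx) makes s_b1 smaller, so the slope is estimated more precisely.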
ASSESSING THE REGRESSION MODEL
Mind the notation, like before:
▪ mean
▪ population value μ_X
▪ sample estimate x̄
▪ sampling distribution of random variable X̄
▪ regression slope
▪ population value β₁
▪ sample estimate b₁
▪ sampling distribution of random variable B₁
When you’re careless with this, it all gets mixed up in one big abracadabra trickery!
EXERCISE 2
a. Is the model significant?
b. Does the model have practical relevance?
TESTING THE REGRESSION COEFFICIENTS
▪ Testing β₀ is usually not interesting
▪ but testing β₁ is!
▪ in particular, the hypothesis β₁ = 0 is often interesting
▪ i.e., the hypothesis that there is no relation between X and Y
▪ or: that knowledge of X doesn’t tell you anything about Y
▪ This test requires the standard deviation of B₁
▪ it is calculated from the data; see computer output
▪ here s_B₁ = 347.578
TESTING THE REGRESSION COEFFICIENTS
So: t_calc = (b₁ − β₁)/s_B₁ = (6151.670 − 0)/347.578 = 17.699
▪ which has to be compared to t_crit = ±t(0.025; 69)
▪ reject H₀: β₁ = 0, because t_calc > t_crit
▪ or with the p-value: p = 0.000 ≪ 0.05
▪ and conclude that the slope differs significantly from zero
▪ post-hoc conclusion: it is larger than zero
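The arithmetic of this test can be checked directly from the reported output values (b₁ = 6151.670 and s_B₁ = 347.578 are the numbers quoted above):

```python
# Recomputing the slides' two-sided test of H0: beta1 = 0
# from the reported output values b1 = 6151.670 and s_b1 = 347.578.
b1, se_b1 = 6151.670, 347.578
t_calc = (b1 - 0) / se_b1
print(round(t_calc, 3))
```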
TESTING THE REGRESSION COEFFICIENTS
Testing the regression model
▪ on the basis of MSR/MSE ~ F(1, n−2)
Testing the regression coefficient b₁
▪ on the basis of (B₁ − 0)/S_B₁ ~ t(n−2)
The two approaches are equivalent
▪ they have the same null hypothesis: H₀: β₁ = 0
▪ they lead to the same conclusion (rejection or no rejection)
▪ they lead to the same p-value
▪ when we do multiple regression with several explanatory variables this is not the case! See later.
TESTING THE REGRESSION COEFFICIENTS
We can also perform other tests than H₀: β₁ = 0
▪ Case 1: different test values for β₁
▪ for example H₀: β₁ = 2
▪ t_calc = (b₁ − 2)/s_B₁
▪ not in SPSS, but easily calculated using s_B₁
▪ Case 2: one-sided tests
▪ for example H₀: β₁ ≥ 0
▪ t_calc as before, but now tested with a different t_crit
▪ not in SPSS, but also easily calculated using the 2-sided p-value
▪ Case 3: combination of case 1 and case 2
▪ for example H₀: β₁ ≥ 2
▪ Try all! (see tutorials)
TESTING THE REGRESSION COEFFICIENTS
Example of case 3:
▪ is there evidence that the price per square meter is larger than €5500?
▪ H₀: β₁ ≤ 5500; H₁: β₁ > 5500; α = 0.05
▪ one-sided critical value, with α, not α/2
▪ t_calc = (6151.670 − 5500)/347.578 = 1.875 > t_crit ≈ 1.7
▪ reject H₀
▪ conclude that the price per m² is significantly larger than €5500
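The case 3 computation above is again plain arithmetic on the reported estimates, only with the test value 5500 in place of 0:

```python
# Recomputing the slides' one-sided test of H0: beta1 <= 5500
# from the reported values b1 = 6151.670 and s_b1 = 347.578.
b1, se_b1, test_value = 6151.670, 347.578, 5500.0
t_calc = (b1 - test_value) / se_b1
print(round(t_calc, 3))  # compare with the one-sided t_crit of about 1.7
```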
TESTING THE REGRESSION COEFFICIENTS
One may also test β₀ in exactly the same way
▪ however, this is hardly ever useful
Overall significance of the F-test only depends on T_B₁, not on T_B₀
▪ that is because the slope explains variation
▪ while the intercept is only a vertical shift
THE ANOVA TABLE
One of the regression results is the ANOVA table
ANOVA = analysis of variance
▪ Excel
▪ SPSS