MULTIPLE REGRESSION ANALYSIS AND OTHER ISSUES Business Statistics
CONTENTS
▪ Multiple regression
▪ Dummy regressors
▪ Assumptions of regression analysis
▪ Predicting with regression analysis
▪ Old exam question
▪ Further study
MULTIPLE REGRESSION
The regression model so far has one dependent variable ($Y$) and one independent (explanatory) variable ($X$)
▪ There are many cases where several explanatory variables might play a role
▪ ... might "explain" the dependent variable $Y$
▪ Example: house prices depend on
  ▪ floor area
  ▪ ground area (first floor + garden)
  ▪ number of rooms
  ▪ age of the house
  ▪ etc.
MULTIPLE REGRESSION
Generalize the simple regression model
▪ from $Y = \beta_0 + \beta_1 X_1 + \varepsilon$
▪ to $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
▪ or even to $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$
Now you'll understand why we used a subscript 0 for the constant in $\beta_0$ ...
Multiple regression
▪ a quite obvious extension
▪ we can reuse much of the theory of simple regression
▪ still based on OLS, $R^2$, $F$-test, and $t$-test
MULTIPLE REGRESSION
SPSS output
Estimated model: $\hat{Y} = -217603 + 5347 X_1 + 225 X_2$
MULTIPLE REGRESSION
"Step 0" (statistical model): $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$
Step 1:
▪ $H_0: \beta_1 = \beta_2 = 0$; $H_1$: at least one of these is not 0
▪ mind that the null hypothesis does not include the constant (intercept) $\beta_0$
Step 2:
▪ Sample statistic: $F = \frac{MSR}{MSE}$; reject for "too large" values
Step 3:
▪ Under $H_0$: $F \sim F_{2,\,n-3}$; assumption: see model (step 0)
Step 4:
▪ $F_{\text{calc}} = \cdots$; $F_{\text{crit}} = F_{2,\,n-3;\,\alpha}$
▪ with $k$ regressors: $df_1 = k$, $df_2 = n - k - 1$
Step 5:
▪ reject/not reject $H_0$
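The calculation in steps 2 and 4 can be sketched in a few lines of Python. This is only an illustration: the function name `overall_f_statistic` and the sums of squares are made up, and it assumes the usual definitions $MSR = SSR/k$ and $MSE = SSE/(n-k-1)$. The critical value $F_{\text{crit}}$ still has to be looked up in an $F$-table (or software) with the returned degrees of freedom.

```python
def overall_f_statistic(ssr, sse, n, k):
    """Return (F, df1, df2) for the overall test H0: beta_1 = ... = beta_k = 0.

    ssr: regression (explained) sum of squares
    sse: error (residual) sum of squares
    n:   number of observations
    k:   number of regressors (the intercept is not counted)
    """
    df1 = k
    df2 = n - k - 1
    if df2 <= 0:
        raise ValueError("need more than k + 1 observations")
    msr = ssr / df1  # mean square regression
    mse = sse / df2  # mean square error
    return msr / mse, df1, df2


# Hypothetical sums of squares, for illustration only
f_calc, df1, df2 = overall_f_statistic(ssr=900.0, sse=100.0, n=23, k=2)
print(f_calc, df1, df2)  # 90.0 2 20
```

Reject $H_0$ when `f_calc` exceeds the critical value for (`df1`, `df2`) at the chosen $\alpha$.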
MULTIPLE REGRESSION
Rejecting the null hypothesis of the overall $F$-test in multiple regression means:
▪ at least one of the slope coefficients differs from 0
▪ "not $\beta_1 = \beta_2 = 0$"
▪ which one differs (or which ones differ) from 0 must be investigated by separate $t$-tests
So,
▪ while in simple regression the overall $F$-test and the $t$-test for $\beta_1$ do exactly the same thing ...
▪ ... the two tests have a complementary role in multiple regression
▪ first look at the overall $F$, then go to the individual $t$s
MULTIPLE REGRESSION
First, overall model test, using the $F$-test
Next, test each slope coefficient, using $k$ times a $t$-test
Annotation on the output: "not interesting"
EXERCISE 1
What does it mean when in multiple regression
a. the overall $F$-test yields a significant result?
b. a $t$-test of an individual coefficient $\beta_3$ yields a significant result?
MULTIPLE REGRESSION
Example:
▪ overall $F$-test: highly significant
▪ both regression slopes: highly significant
▪ coefficient of determination ($R^2$): very high (90%)
▪ a very useful model
▪ in fact: better than the simple regression model with $R^2 = 82\%$
MULTIPLE REGRESSION
Observe:
▪ including more explanatory variables will in general improve the fit of the model
▪ $R^2$ will increase (or at least not decrease), even if we include "nonsense" variables (e.g., street number of the house)
▪ $R^2_{\text{adj}}$ ("R-square-adjusted") penalizes for including "too many" regressors
▪ $R^2_{\text{adj}} = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$ while $R^2 = 1 - \frac{SSE}{SST}$
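The penalty can be made concrete with a small numerical sketch. The sums of squares below are hypothetical: a third, "nonsense" regressor is assumed to lower $SSE$ only marginally (from 100 to 99.5), so $R^2$ creeps up while $R^2_{\text{adj}}$ goes down.

```python
def r_squared(sse, sst):
    """Coefficient of determination: R^2 = 1 - SSE/SST."""
    return 1.0 - sse / sst


def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R^2 = 1 - (SSE/(n-k-1)) / (SST/(n-1))."""
    return 1.0 - (sse / (n - k - 1)) / (sst / (n - 1))


# Hypothetical values, for illustration only
n, sst = 23, 1000.0

# Model with k = 2 regressors
r2_two = r_squared(100.0, sst)
adj_two = adjusted_r_squared(100.0, sst, n, k=2)

# Same model plus one "nonsense" regressor (k = 3): SSE barely drops
r2_three = r_squared(99.5, sst)
adj_three = adjusted_r_squared(99.5, sst, n, k=3)

print(round(r2_two, 4), round(adj_two, 4))      # 0.9 0.89
print(round(r2_three, 4), round(adj_three, 4))  # 0.9005 0.8848
```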
DUMMY REGRESSORS
House prices (numerical) depend on:
▪ numerical variables (floor area, ground area, etc.)
▪ binary categorical variables (with/without garage, etc.)
▪ other categorical variables (no/free/paid parking, etc.)
However:
▪ regression is for numerical $X$ and numerical $Y$
▪ ANOVA is for categorical $X$ and numerical $Y$
So, how to combine numerical $X_1$ and categorical $X_2$?
Solution: dummy variables for the categorical variable
▪ dummy regressors/dummy regression
DUMMY REGRESSORS
We can include dummy variables in multiple regression
▪ Coding a binary variable as a single 0/1 dummy
  ▪ original variable: garage = no/yes
  ▪ garage: 0=no; 1=yes
  ▪ omitted variable: no_garage (redundant): garage=0
▪ Splitting a non-binary variable into several binary dummies
  ▪ original variable: parking = no/free/paid
  ▪ free_parking: 0=no; 1=yes
  ▪ paid_parking: 0=no; 1=yes
  ▪ omitted variable: no_parking (redundant): free=0, paid=0
▪ Dummy variables only for independent ($X$) variables
  ▪ never for the dependent ($Y$) variable
  ▪ $Y$ must be numerical (think about $\varepsilon \sim N$)
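As a rough sketch of this coding scheme: the helper `make_dummies` below is a hypothetical name, not something from the course material. It turns one categorical variable into 0/1 columns and leaves out the chosen reference category, which is then represented by all dummies being 0.

```python
def make_dummies(values, reference):
    """Code a categorical variable as 0/1 dummies, omitting the reference level.

    Returns a dict mapping each non-reference level to its 0/1 column.
    """
    # Levels in order of first appearance, without the reference category
    levels = [lvl for lvl in dict.fromkeys(values) if lvl != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}


# Hypothetical sample of five houses
parking = ["no", "free", "paid", "free", "no"]
dummies = make_dummies(parking, reference="no")
print(dummies)  # {'free': [0, 1, 0, 1, 0], 'paid': [0, 0, 1, 0, 0]}
```

A variable with three categories thus yields two dummy columns; houses without parking have 0 in both.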
DUMMY REGRESSORS
Example
▪ House price ($Y$) as a function of
  ▪ floor area ($X_1$)
  ▪ dummy for garden ($X_2$; 0=No, 1=Yes)
▪ $\text{Price} = -261741 + 6040 \cdot \text{FloorArea} + 21825 \cdot \text{Garden}$
▪ meaning 21825 € extra when there is a garden (whatever the size)
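To see how the dummy coefficient shifts the prediction, the estimated equation can be evaluated with and without a garden. The function name and the 85 m² input are chosen only for illustration; the coefficients are the rounded ones from the example.

```python
def predicted_price(floor_area, garden):
    """Estimated price (in euro) from the example model.

    floor_area: floor area in square metres
    garden:     dummy, 1 if the house has a garden, 0 otherwise
    """
    return -261741 + 6040 * floor_area + 21825 * garden


without_garden = predicted_price(85, garden=0)
with_garden = predicted_price(85, garden=1)
print(without_garden, with_garden)   # 251659 273484
print(with_garden - without_garden)  # 21825
```

The difference equals the dummy coefficient for every floor area, because the dummy shifts the intercept and leaves the slope unchanged.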
DUMMY REGRESSORS
▪ Use dummy variables only for the independent (explanatory) variables
  ▪ not for the dependent variable (that would be logistic regression, not in this course!)
▪ It is quite common to indicate dummy explanatory variables with a $D$ instead of an $X$
  ▪ for instance: $Y = \beta_0 + \beta_1 X_1 + \beta_2 D_2 + \beta_3 D_3 + \varepsilon$
EXERCISE 2 We want to explain car prices in terms of 1) engine power 2) number of seats 3) gas/diesel/electric. What is the theoretical model?
ASSUMPTIONS OF REGRESSION ANALYSIS
The OLS equations always find coefficients $b_0, b_1, \ldots$ that minimize the residual sum of squares ($SSE$)
▪ so no assumptions are required for that part
But when testing the model (and when testing the coefficients $\beta_1, \beta_2, \ldots$)
▪ we need to assume a statistical model with $\varepsilon \sim N(0, \sigma^2)$:
  ▪ the residual terms should be normally distributed
  ▪ the residual terms should come from a distribution with constant variance
  ▪ the residual terms should be independent of each other
  ▪ there should be a linear relationship between the $X$-variable(s) and $Y$
ASSUMPTIONS OF REGRESSION ANALYSIS
A final word on the residual $\varepsilon \sim N(0, \sigma^2)$
▪ Theoretical regression model
  ▪ $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$
▪ Estimated regression model
  ▪ $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$
▪ Observations
  ▪ $Y_i = b_0 + b_1 X_{1,i} + b_2 X_{2,i} + \cdots + b_k X_{k,i} + e_i$
▪ And the standard deviation of the residual term, $\sigma = \sqrt{\sigma^2}$,
  ▪ is estimated by $s = \sqrt{\frac{SSE}{n-k-1}} = \sqrt{MSE}$
  ▪ is known as the standard error of the regression or standard error of the estimate
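The chain from estimated coefficients to residuals to $s$ can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the procedure SPSS uses: OLS is solved here through the normal equations $(X'X)b = X'y$ with a small Gaussian elimination, the helper names (`solve`, `ols`, `standard_error`) are invented for this example, and the six houses are made-up data.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(xs, y):
    """Return [b0, b1, ..., bk] minimizing SSE; xs holds one row [x1, ..., xk] per observation."""
    design = [[1.0] + list(row) for row in xs]  # prepend the constant
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def standard_error(xs, y, b):
    """Standard error of the regression: s = sqrt(SSE / (n - k - 1))."""
    n, k = len(y), len(b) - 1
    fitted = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)) for row in xs]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return math.sqrt(sse / (n - k - 1))


# Made-up data: [floor area, ground area] in m2, price in thousand euro
xs = [[60, 100], [75, 150], [85, 120], [100, 300], [120, 250], [140, 400]]
y = [150, 210, 240, 330, 390, 480]

b = ols(xs, y)
s = standard_error(xs, y, b)
print("coefficients:", [round(v, 3) for v in b])
print("standard error of the regression:", round(s, 3))
```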
PREDICTION WITH REGRESSION ANALYSIS
Given a sample of data $x_{1i}, x_{2i}, \ldots, y_i$ with $i = 1, \ldots, n$
▪ we can use OLS to estimate the regression model $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots$
▪ subsequently, given the floor area, we can estimate the price of the house
Now, a new "incomplete" observation arrives
▪ for instance, a new house with known floor area ($x_{n+1}$), but with unknown price (no $y_{n+1}$)
We can use the regression model to estimate the house price
▪ so to predict $\hat{y}_{n+1}$
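PREDICTION WITH REGRESSION ANALYSIS
Example:
▪ $\hat{Y} = -264749 + 6152X$
▪ a house with floor area $x = 85$ m² has an estimated price $\hat{y} = -264749 + 6152 \times 85 \approx 258142$ (€)
▪ note: the coefficients shown are rounded; plugging them in directly gives 258171, so the reported 258142 presumably comes from the unrounded estimates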
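A point prediction is just the estimated equation evaluated at the new $x$-value(s). The sketch below uses a hypothetical helper `predict` and the rounded coefficients $b_0 = -264749$ and $b_1 = 6152$ from the estimated model; with these rounded values the result is 258171, slightly above the 258142 reported on the slide, a gap that presumably stems from rounding of the coefficients.

```python
def predict(coefficients, x_values):
    """Evaluate y_hat = b0 + b1*x1 + ... + bk*xk.

    coefficients: [b0, b1, ..., bk]
    x_values:     [x1, ..., xk] for the new observation
    """
    b0, *slopes = coefficients
    if len(slopes) != len(x_values):
        raise ValueError("need exactly one x-value per slope coefficient")
    return b0 + sum(b * x for b, x in zip(slopes, x_values))


# Rounded coefficients from the example; floor area 85 m2
print(predict([-264749, 6152], [85]))  # 258171
```

The same function works for multiple regression by passing one $x$-value per regressor.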
PREDICTION WITH REGRESSION ANALYSIS
So, we can predict a value $\hat{y}$
▪ for a given $x$ (or $x_1, x_2, \ldots$)
▪ and given estimated regression coefficients ($b_0, b_1, \ldots$)
The quality of this estimate obviously depends on the quality of the regression model
▪ try to find a confidence interval for the estimated $\hat{y}$-value
▪ two types:
  ▪ the confidence interval for the average price of houses of 85 m²
  ▪ the confidence interval for a particular house of 85 m² (usually called a prediction interval)
PREDICTION WITH REGRESSION ANALYSIS
Price ($Y$) unknown, area ($X$) known
Point prediction: 258142
Case 1: confidence interval (95%) for prediction of the mean price
▪ [212866, 303419]
Case 2: confidence interval (95%) for individual prediction
▪ [−96372, 612658]
Individual predictions are always less accurate → wider confidence interval (this one even includes 0)
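Why the individual interval is always wider can be seen from the standard textbook formulas for simple regression, which are not shown on the slides and are assumed here: the half-width is $t \cdot s\sqrt{\tfrac{1}{n} + \tfrac{(x - \bar{x})^2}{SS_{xx}}}$ for the mean and $t \cdot s\sqrt{1 + \tfrac{1}{n} + \tfrac{(x - \bar{x})^2}{SS_{xx}}}$ for an individual observation. The sketch below uses made-up data and a hypothetical function name; the critical value 2.776 is the two-sided 95% $t$-value for $n - 2 = 4$ degrees of freedom, taken from a $t$-table.

```python
import math


def intervals(x, y, x_new, t_crit):
    """Point prediction, interval for the mean, and interval for an individual.

    Simple regression only (one regressor); t_crit is the t-value for
    n - 2 degrees of freedom at the desired confidence level.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # standard error of the regression
    y_hat = b0 + b1 * x_new
    core = 1 / n + (x_new - x_bar) ** 2 / ss_xx
    half_mean = t_crit * s * math.sqrt(core)            # mean response
    half_individual = t_crit * s * math.sqrt(1 + core)  # single new observation
    return (
        y_hat,
        (y_hat - half_mean, y_hat + half_mean),
        (y_hat - half_individual, y_hat + half_individual),
    )


# Made-up data: floor area (m2) and price (euro) of six houses
areas = [60, 75, 85, 100, 120, 140]
prices = [110000, 190000, 270000, 340000, 480000, 590000]

y_hat, mean_interval, individual_interval = intervals(areas, prices, 85, t_crit=2.776)
print(round(y_hat))
print([round(v) for v in mean_interval])
print([round(v) for v in individual_interval])
```

Both intervals are centred on the same point prediction; the extra "1" under the square root accounts for the scatter of an individual house around the mean.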
OLD EXAM QUESTION 26 March 2015, Q3a
FURTHER STUDY
Doane & Seward 5/E 12.7, 13.1-13.5
Tutorial exercises week 4
▪ multiple regression
▪ dummy regression
▪ prediction interval