2/22/2007
219323 Probability and Statistics for Software and Knowledge Engineers
Lecture 13: Simple Linear Regression and Correlation
Monchai Sopitkamon, Ph.D.

Outline
• The Simple Linear Regression Model (12.1)
• Fitting the Regression Line (12.2)
• The Analysis of Variance Table (12.6)
• Residual Analysis (12.7)
• Correlation Analysis (12.9)
The Simple Linear Regression Model I (12.1)
• Purpose of regression analysis: predict the value of a dependent or response variable from the values of at least one explanatory or independent variable (also called predictors or factors).
• Purpose of correlation analysis: measure the strength of the correlation between two variables.

The Simple Linear Regression Model II (12.1)
Simple linear regression model:
    y_i = β0 + β1 x_i + ε_i,  so that  Y_i ~ N(β0 + β1 x_i, σ²)
where β0 is the intercept parameter and β1 is the slope parameter.
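As a sketch of what the model asserts, each response Y_i can be simulated as a normal draw centered on the line β0 + β1·x_i. The parameter values and x-grid below are illustrative assumptions, not values from the lecture:

```python
import random

# Illustrative (assumed) parameter values -- not taken from the lecture
beta0, beta1, sigma = 0.4, 0.5, 0.2
n = 12
random.seed(1)

# Explanatory values on an arbitrary grid
x = [3.0 + 0.3 * i for i in range(n)]

# Each Y_i is drawn from N(beta0 + beta1 * x_i, sigma^2)
y = [random.gauss(beta0 + beta1 * xi, sigma) for xi in x]
print(len(y))
```

The simulated points scatter around the true line with spread governed by σ, which is exactly the role the error variance σ² plays in the model.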
The Simple Linear Regression Model III (12.1)
[Figure: interpretation of the error variance σ²]

The Simple Linear Regression Model IV (12.1)
• β1 > 0 ⇒ positive relationship
• β1 = 0 ⇒ no relationship
• β1 < 0 ⇒ negative relationship
• The SLR model is not appropriate for a nonlinear relationship
[Figure: scatter plot showing a nonlinear relationship]
The Simple Linear Regression Model V (12.1)
• Ex.67 pg.536: Car Plant Electricity Usage
[Figure: scatter plot of electricity usage vs. production] (Excel sheet)
Fitting the Regression Line I (12.2)
• Selecting the "best" line: the least squares fit minimizes the errors between the observed and estimated values of y.
[Figure: the least squares fit]

Fitting the Regression Line II (12.2)
    ŷ_i = β̂0 + β̂1 x_i
where ŷ_i is the predicted value of y for observation i, and x_i is the value of the explanatory variable for observation i.
β̂0 and β̂1 are chosen to minimize
    SSE = Σ e_i² = Σ (y_i − ŷ_i)² = Σ [y_i − (β̂0 + β̂1 x_i)]²   (sums over i = 1, …, n)
The least squares solution automatically satisfies Σ e_i = 0.
Fitting the Regression Line III (12.2)
• Method of least squares:
    β̂1 = (Σ x_i y_i − n x̄ ȳ) / (Σ x_i² − n x̄²)
    β̂0 = ȳ − β̂1 x̄
• Variance of errors: σ̂² = SSE / (n − 2)
  (n − 2 since the two regression parameters must be estimated first)

Fitting the Regression Line IV (12.2)
• Ex.67 pg.545: Car Plant Electricity Usage
    n = 12, x̄ = 4.885, ȳ = 2.846, Σ x_i² = 291.231, Σ x_i y_i = 169.253
    β̂1 = (169.253 − 12 × 4.885 × 2.846) / (291.231 − 12 × 4.885²) = 0.4988
    β̂0 = 2.846 − 0.4988 × 4.885 = 0.409
    ∴ ŷ = 0.409 + 0.499x
(Excel sheet)
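The estimates above can be reproduced directly from the summary statistics quoted on the slide; a minimal Python sketch follows. Note that because x̄ and ȳ are given already rounded, the last digits drift slightly from the textbook values computed from the unrounded sums:

```python
# Summary statistics for Ex.67 (Car Plant Electricity Usage), as given on the slide
n = 12
x_bar, y_bar = 4.885, 2.846
sum_x2, sum_xy = 291.231, 169.253

# Least squares estimates
beta1_hat = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(round(beta1_hat, 2), round(beta0_hat, 2))
```

Both estimates land near the fitted line ŷ = 0.409 + 0.499x, up to the rounding of the inputs.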
Fitting the Regression Line V (12.2)
• Ex.67 pg.545: Car Plant Electricity Usage
[Figure: scatter plot of electricity usage vs. production with fitted line y = 0.499x + 0.409, R² = 0.802]
The Analysis of Variance Table: Sum of Squares Decomposition I (12.6.1)
• Apply an ANOVA approach similar to the one used for the one-factor layout in Chapter 11
• Consider the variability in the dependent variable y
• Hypothesis test: H0: β1 = 0

The Analysis of Variance Table: Sum of Squares Decomposition II (12.6.1)
    SST = Σ (y_i − ȳ)²
    SSE = Σ (y_i − ŷ_i)²
    SSR = Σ (ŷ_i − ȳ)² = SST − SSE
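The decomposition SST = SSR + SSE can be verified numerically on any least-squares fit; the small data set below is made up purely for illustration:

```python
# Verify SST = SSR + SSE on a small illustrative data set (made up, not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.6, 4.4, 4.9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# The three sums of squares
SST = sum((yi - y_bar) ** 2 for yi in y)
SSE = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
SSR = sum((fi - y_bar) ** 2 for fi in fitted)
print(abs(SST - (SSR + SSE)) < 1e-9)   # the decomposition holds exactly
```

The identity holds only for the least-squares line, because the residuals are orthogonal to the fitted values; for any other line the cross term does not vanish.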
The Analysis of Variance Table: Sum of Squares Decomposition III (12.6.1)
[Figure: the sum of squares for a simple linear regression]

The Analysis of Variance Table: Sum of Squares Decomposition IV (12.6.1)
[Table: the analysis of variance table for a simple linear regression analysis]
• Hypothesis test: H0: β1 = 0
• The two-sided p-value is p-value = P(X > F), where X is a random variable that has an F(1, n−2) distribution
The Analysis of Variance Table: Sum of Squares Decomposition V (12.6.1)
• Coefficient of determination (R²): the fraction of the variation explained by the regression
    R² = SSR / SST = (SST − SSE) / SST = 1 − SSE / SST   (0 ≤ R² ≤ 1)
• The closer R² is to one, the better the regression model.

The Analysis of Variance Table: Sum of Squares Decomposition VI (12.6.1)
[Figure: the coefficient of determination R² is larger in scenario II than in scenario I]
The Analysis of Variance Table: Sum of Squares Decomposition VII (12.6.1)
• Ex.67 pg.572: Car Plant Electricity Usage
    F = MSR / MSE = 1.2124 / 0.0299 = 40.53
    R² = SSR / SST = 1.2124 / 1.5115 = 0.802
• The higher the value of R², the better the regression.
(Excel sheet)
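Plugging the slide's SSR and SST into the ANOVA formulas (1 degree of freedom for the regression, n − 2 = 10 for the error) reproduces the F statistic and R²:

```python
# ANOVA quantities for Ex.67, using SSR and SST from the slide
n = 12
SSR, SST = 1.2124, 1.5115
SSE = SST - SSR

MSR = SSR / 1          # regression mean square (1 degree of freedom)
MSE = SSE / (n - 2)    # error mean square (n - 2 degrees of freedom)
F = MSR / MSE
R2 = SSR / SST
print(round(F, 1), round(R2, 3))   # -> 40.5 0.802
```

The slide's F = 40.53 comes from the rounded MSE = 0.0299; carrying full precision gives F ≈ 40.5 either way, far beyond the F(1, 10) critical values, so H0: β1 = 0 is rejected.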
Residual Analysis Methods I (12.7.1)
• Residuals: the differences between the observed values of the dependent variable and the corresponding predicted (fitted) values:
    e_i = y_i − ŷ_i,  1 ≤ i ≤ n
• Residual analysis can be used to:
  – Identify outliers
  – Check whether the fitted model is good
  – Check whether the variance of the errors is constant
  – Check whether the error terms are normally distributed
(Excel sheet)

Residual Analysis Methods II (12.7.1)
• Plot the residuals e_i against the values of the explanatory variable x_i
• A random scatter plot indicates no problem with the obtained regression model
• If the standardized residual |e_i / σ̂| > 3, data point i is an outlier
• If there are outliers, they should be removed and the regression line should be fitted again
(Excel sheet)
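The outlier rule above can be sketched as follows. The data set is made up, with one planted outlier at index 5; note that σ̂ is itself inflated by the outlier, so only fairly gross departures exceed the threshold of 3:

```python
import math

# Made-up data: y = x exactly, except one planted outlier at index 5 (x = 6)
x = list(range(1, 13))
y = [float(xi) for xi in x]
y[5] = 12.0   # outlier: 6 units above the underlying line

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residuals and the estimated error standard deviation (SSE / (n - 2))
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma_hat = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Flag points whose standardized residual exceeds 3 in absolute value
outliers = [i for i, e in enumerate(residuals) if abs(e / sigma_hat) > 3]
print(outliers)   # -> [5]
```

After removing the flagged point, the line should be refitted, as the slide recommends; the refit here would recover b1 = 1, b0 = 0 exactly.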
Residual Analysis Methods III (12.7.1)
[Figure: residual plot indicating points that may be outliers]

Residual Analysis Methods IV (12.7.1)
• If the residual plot shows positive and negative residuals grouped together, a linear model is not suitable
[Figure: a grouping of positive and negative residuals indicates that the linear model is inappropriate]
Residual Analysis Methods V (12.7.1)
• If the residual plot shows a "funnel shape", the variance of the errors (σ²) is not constant, conflicting with the model assumption
[Figure: a funnel shape in the residual plot indicates a non-constant error variance]

Residual Analysis Methods VI (12.7.1)
• A normal probability plot (normal scores plot) of the residuals can be used to check whether the error terms ε_i are normally distributed
[Figure: a normal scores plot of a simulated sample from a normal distribution, which shows the points lying approximately on a straight line]
Residual Analysis Methods VII (12.7.1)
• A nonlinear pattern in the normal scores plot exhibits a non-normal distribution of the residuals, in which case the linear modeling approach may not be appropriate
[Figure: normal scores plots of simulated samples from non-normal distributions, which show nonlinear patterns]
The Sample Correlation Coefficient I (12.9.1)
• From the correlation equation in Section 2.5.4,
    ρ = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
  which measures the strength of the linear association between two jointly distributed random variables X and Y
• The sample correlation coefficient r for a set of paired data observations (x_i, y_i) is
    r = Σ (x_i − x̄)(y_i − ȳ) / √(Σ (x_i − x̄)² · Σ (y_i − ȳ)²)
      = (Σ x_i y_i − n x̄ ȳ) / √((Σ x_i² − n x̄²)(Σ y_i² − n ȳ²))   (−1 ≤ r ≤ 1)

The Sample Correlation Coefficient II (12.9.1)
• r = 0 ⇒ no linear association
• r < 0 ⇒ negative linear association
• r > 0 ⇒ positive linear association
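The sample correlation coefficient is straightforward to compute from its definition; the paired observations below are made up for illustration and chosen to lie close to a line with positive slope:

```python
import math

# Illustrative paired observations (made up, nearly linear with positive slope)
x = [3.1, 3.8, 4.2, 4.9, 5.5, 6.2]
y = [2.1, 2.5, 2.6, 3.0, 3.2, 3.6]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# r = S_xy / sqrt(S_xx * S_yy)
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) *
                sum((yi - y_bar) ** 2 for yi in y))
r = num / den
print(r)   # close to 1: strong positive linear association
```

For a simple linear regression on the same data, r² equals the coefficient of determination R², so a value of r near ±1 corresponds to a regression that explains most of the variation in y.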