Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

Chapter 27: Regression Evaluation
Univariate Regression

$$Y = f(X) + \varepsilon = \beta + \omega \cdot X + \varepsilon$$

where ω is the slope of the best-fitting line, β is its intercept, and ε is a random error variable that follows a normal distribution with mean µ = 0 and variance σ². The true parameters β, ω, and σ² are all unknown and have to be estimated from the training data D comprising n points x_i and corresponding response values y_i, for i = 1, 2, ..., n. Let b and w denote the estimated bias and weight terms; we can then make predictions for any given value x_i as follows:

$$\hat{y}_i = b + w \cdot x_i$$

The estimated bias b and weight w are obtained by minimizing the sum of squared errors (SSE), given as

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b - w \cdot x_i)^2$$
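To make the estimation step concrete, here is a minimal Python/NumPy sketch (synthetic data invented purely for illustration) that computes the closed-form least squares estimates of w and b and evaluates the SSE of the fitted line.

```python
import numpy as np

# Synthetic data for illustration: points from a known line plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=50)   # true beta = 1.5, omega = 0.8, sigma = 0.5

# Closed-form least squares estimates of the weight w and bias b.
mu_x, mu_y = x.mean(), y.mean()
w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
b = mu_y - w * mu_x

# Predictions and sum of squared errors.
y_hat = b + w * x
sse = np.sum((y - y_hat) ** 2)
print(f"w = {w:.4f}, b = {b:.4f}, SSE = {sse:.4f}")
```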
Univariate Regression

According to our model, the variance in prediction is entirely due to the random error term ε. We can estimate this variance by considering the predicted value ŷ_i and its deviation from the true response y_i, that is, by looking at the residual error ǫ_i = y_i − ŷ_i. The estimated variance σ̂² is given as

$$\hat{\sigma}^2 = \mathrm{var}(\epsilon_i) = \frac{1}{n-2} \sum_{i=1}^{n} \left(\epsilon_i - E[\epsilon_i]\right)^2 = \frac{1}{n-2} \sum_{i=1}^{n} \epsilon_i^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Thus, the estimated variance is

$$\hat{\sigma}^2 = \frac{\mathrm{SSE}}{n-2} \qquad (1)$$

We divide by n − 2 to get an unbiased estimate, since n − 2 is the number of degrees of freedom for estimating SSE.
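As a sketch, the unbiased variance estimate of Equation (1) can be packaged as a small helper (the function name is hypothetical; NumPy assumed).

```python
import numpy as np

def estimate_error_variance(y, y_hat):
    """Unbiased estimate of the error variance: SSE divided by the n - 2 degrees of freedom."""
    residuals = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    sse = np.sum(residuals ** 2)
    return sse / (len(residuals) - 2)
```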
Univariate Regression

The SSE value gives an indication of how much of the variation in Y cannot be explained by our linear model. We can compare this value with the total scatter, also called total sum of squares, for the dependent variable Y, defined as

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \mu_Y)^2$$

Notice that in TSS we compute the squared deviations of the true response from the true mean for Y, whereas in SSE we compute the squared deviations of the true response from the predicted response.
Univariate Regression

The total scatter can be decomposed into two components by adding and subtracting ŷ_i as follows

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \mu_Y)^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i + \hat{y}_i - \mu_Y)^2$$
$$= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \mu_Y)^2 + 2 \sum_{i=1}^{n} (y_i - \hat{y}_i) \cdot (\hat{y}_i - \mu_Y)$$
$$= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \mu_Y)^2 = \mathrm{SSE} + \mathrm{RSS}$$

where we use the fact that $\sum_{i=1}^{n} (y_i - \hat{y}_i) \cdot (\hat{y}_i - \mu_Y) = 0$, and

$$\mathrm{RSS} = \sum_{i=1}^{n} (\hat{y}_i - \mu_Y)^2$$

is a new term called the regression sum of squares, which measures the squared deviation of the predictions from the true mean.
Univariate Regression

TSS can thus be decomposed into two parts: SSE, which is the amount of variation not explained by the model, and RSS, which is the amount of variation explained by the model. Therefore, the fraction of the variation left unexplained by the model is given by the ratio SSE/TSS. Conversely, the fraction of the variation that is explained by the model, called the coefficient of determination or simply the R² statistic, is given as

$$R^2 = \frac{\mathrm{TSS} - \mathrm{SSE}}{\mathrm{TSS}} = 1 - \frac{\mathrm{SSE}}{\mathrm{TSS}} = \frac{\mathrm{RSS}}{\mathrm{TSS}} \qquad (2)$$

The higher the R² statistic, the better the estimated model, with R² ∈ [0, 1].
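The sketch below shows one way to compute the three sums of squares and the R² of Equation (2) (illustrative helper, NumPy assumed); the check that TSS = SSE + RSS relies on ŷ coming from a least squares fit that includes an intercept.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination R^2 = 1 - SSE/TSS = RSS/TSS."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)           # variation not explained by the model
    tss = np.sum((y - y.mean()) ** 2)        # total scatter of the response
    rss = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the model
    # The decomposition TSS = SSE + RSS holds for least squares fits with an intercept.
    assert np.isclose(tss, sse + rss)
    return 1.0 - sse / tss
```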
Variance and Goodness of Fit

Consider the regression of petal length (X; the predictor variable) on petal width (Y; the response variable) for the Iris dataset; the accompanying figure shows the scatterplot of the two attributes. There are a total of n = 150 data points. The least squares estimates for the bias and regression coefficients are as follows:

w = 0.4164    b = −0.3665

The SSE value is given as

$$\mathrm{SSE} = \sum_{i=1}^{150} \epsilon_i^2 = \sum_{i=1}^{150} (y_i - \hat{y}_i)^2 = 6.343$$

Thus, the estimated variance and standard error of regression are given as

$$\hat{\sigma}^2 = \frac{\mathrm{SSE}}{n-2} = \frac{6.343}{148} = 4.286 \times 10^{-2}$$
$$\hat{\sigma} = \sqrt{\frac{\mathrm{SSE}}{n-2}} = \sqrt{4.286 \times 10^{-2}} = 0.207$$
Variance and Goodness of Fit

For the bivariate Iris data, the values of TSS and RSS are given as

TSS = 86.78    RSS = 80.436

We can observe that TSS = SSE + RSS. The fraction of variance explained by the model, that is, the R² value, is given as

$$R^2 = \frac{\mathrm{RSS}}{\mathrm{TSS}} = \frac{80.436}{86.78} = 0.927$$

This indicates a very good fit of the linear model.
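These figures can be approximately reproduced, for instance, with the copy of the Iris data bundled in scikit-learn (assumed here; the exact digits may differ slightly between dataset versions, so treat this as a sketch rather than the authors' own code).

```python
import numpy as np
from sklearn.datasets import load_iris  # assumes scikit-learn's bundled Iris data

iris = load_iris()
x = iris.data[:, 2]   # petal length (predictor X)
y = iris.data[:, 3]   # petal width (response Y)

# Closed-form least squares fit.
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()
y_hat = b + w * x

# Sums of squares and goodness of fit.
sse = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y_hat - y.mean()) ** 2)
print(f"w={w:.4f}  b={b:.4f}  SSE={sse:.3f}  TSS={tss:.2f}  RSS={rss:.3f}  R^2={rss / tss:.3f}")
```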
Inference about Regression Coefficient and Bias Term

The estimated values of the bias and regression coefficient, b and w, are only point estimates for the true parameters β and ω. To obtain confidence intervals for these parameters, we treat each y_i as a random variable for the response given the corresponding fixed value x_i. These random variables are all independent and identically distributed as Y, with expected value β + ω · x_i and variance σ². On the other hand, the x_i values are fixed a priori, and therefore µ_X and σ²_X are also fixed values. We can now treat b and w as random variables, with

$$b = \mu_Y - w \cdot \mu_X$$
$$w = \frac{\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)}{\sum_{i=1}^{n} (x_i - \mu_X)^2} = \frac{1}{s_X} \sum_{i=1}^{n} (x_i - \mu_X) \cdot y_i = \sum_{i=1}^{n} c_i \cdot y_i$$

where c_i is a constant (since x_i is fixed), given as

$$c_i = \frac{x_i - \mu_X}{s_X} \qquad (3)$$

and $s_X = \sum_{i=1}^{n} (x_i - \mu_X)^2$ is the total scatter for X, defined as the sum of squared deviations of x_i from its mean µ_X.
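The identity expressing w as a fixed linear combination of the y_i values is easy to verify numerically; the snippet below (synthetic data, purely illustrative) compares it with the usual covariance-over-scatter formula.

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=30)
y = 2.0 - 0.7 * x + rng.normal(0, 0.3, size=30)

s_x = np.sum((x - x.mean()) ** 2)        # total scatter of X
c = (x - x.mean()) / s_x                 # the fixed constants c_i
w_linear = np.sum(c * y)                 # w as a linear combination of the y_i
w_ratio = np.sum((x - x.mean()) * (y - y.mean())) / s_x  # covariance-over-scatter form
print(np.isclose(w_linear, w_ratio))     # True: the two expressions agree
```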
Mean and Variance of Regression Coefficient

The expected value of w is given as

$$E[w] = E\left[\sum_{i=1}^{n} c_i y_i\right] = \sum_{i=1}^{n} c_i \cdot E[y_i] = \sum_{i=1}^{n} c_i (\beta + \omega \cdot x_i)$$
$$= \beta \sum_{i=1}^{n} c_i + \omega \cdot \sum_{i=1}^{n} c_i \cdot x_i = \omega \cdot \sum_{i=1}^{n} \frac{(x_i - \mu_X) \cdot x_i}{s_X} = \omega \cdot \frac{s_X}{s_X} = \omega$$

which follows from the observation that $\sum_{i=1}^{n} c_i = 0$, and further

$$s_X = \sum_{i=1}^{n} (x_i - \mu_X)^2 = \sum_{i=1}^{n} x_i^2 - n \cdot \mu_X^2 = \sum_{i=1}^{n} (x_i - \mu_X) \cdot x_i$$

Thus, w is an unbiased estimator for the true parameter ω.
Mean and Variance of Regression Coefficient

Using the fact that the variables y_i are independent and identically distributed as Y, we can compute the variance of w as follows

$$\mathrm{var}(w) = \mathrm{var}\left(\sum_{i=1}^{n} c_i \cdot y_i\right) = \sum_{i=1}^{n} c_i^2 \cdot \mathrm{var}(y_i) = \sigma^2 \cdot \sum_{i=1}^{n} c_i^2 = \frac{\sigma^2}{s_X} \qquad (4)$$

since c_i is a constant, var(y_i) = σ², and further

$$\sum_{i=1}^{n} c_i^2 = \sum_{i=1}^{n} \frac{(x_i - \mu_X)^2}{s_X^2} = \frac{s_X}{s_X^2} = \frac{1}{s_X}$$

The standard deviation of w, also called the standard error of w, is given as

$$\mathrm{se}(w) = \sqrt{\mathrm{var}(w)} = \frac{\sigma}{\sqrt{s_X}} \qquad (5)$$
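A quick Monte Carlo check of these results: under assumed true parameters (chosen only for illustration), the sampling mean of w should be close to ω and its sampling variance close to σ²/s_X.

```python
import numpy as np

# Assumed true parameters, for illustration only.
rng = np.random.default_rng(2)
beta, omega, sigma = 1.0, 0.5, 0.4
x = np.linspace(0, 10, 40)               # fixed predictor values
s_x = np.sum((x - x.mean()) ** 2)

# Repeatedly draw responses and re-estimate the slope w.
w_estimates = []
for _ in range(10_000):
    y = beta + omega * x + rng.normal(0, sigma, size=x.size)
    w = np.sum((x - x.mean()) * (y - y.mean())) / s_x
    w_estimates.append(w)

print(f"mean of w: {np.mean(w_estimates):.4f}  (omega = {omega})")
print(f"var of w:  {np.var(w_estimates):.6f}  (sigma^2 / s_X = {sigma ** 2 / s_x:.6f})")
```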
Mean and Variance of Bias Term

The expected value of b is given as

$$E[b] = E[\mu_Y - w \cdot \mu_X] = E\left[\frac{1}{n} \sum_{i=1}^{n} y_i - w \cdot \mu_X\right]$$
$$= \frac{1}{n} \sum_{i=1}^{n} E[y_i] - \mu_X \cdot E[w] = \frac{1}{n} \left(\sum_{i=1}^{n} (\beta + \omega \cdot x_i)\right) - \omega \cdot \mu_X$$
$$= \beta + \omega \cdot \mu_X - \omega \cdot \mu_X = \beta$$

Thus, b is an unbiased estimator for the true parameter β. Using the observation that all y_i are independent, the variance of the bias term can be computed as follows.
Mean and Variance of Bias Term

$$\mathrm{var}(b) = \mathrm{var}(\mu_Y - w \cdot \mu_X) = \mathrm{var}\left(\frac{1}{n} \sum_{i=1}^{n} y_i\right) + \mathrm{var}(\mu_X \cdot w)$$
$$= \frac{1}{n^2} \cdot n \sigma^2 + \mu_X^2 \cdot \mathrm{var}(w) = \frac{1}{n} \cdot \sigma^2 + \mu_X^2 \cdot \frac{\sigma^2}{s_X}$$
$$= \left(\frac{1}{n} + \frac{\mu_X^2}{s_X}\right) \cdot \sigma^2$$

where we used the fact that for two uncorrelated (in particular, independent) random variables A and B, we have var(A − B) = var(A) + var(B). That is, the variances of A and B add, even though we are computing the variance of A − B. The standard deviation of b, also called the standard error of b, is given as

$$\mathrm{se}(b) = \sqrt{\mathrm{var}(b)} = \sigma \cdot \sqrt{\frac{1}{n} + \frac{\mu_X^2}{s_X}} \qquad (6)$$
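In practice σ is unknown, so the standard errors in Equations (5) and (6) are computed by plugging in the estimate σ̂ from Equation (1). The helper below (hypothetical name, NumPy assumed) sketches that computation.

```python
import numpy as np

def standard_errors(x, y):
    """Plug-in standard errors of the slope w and bias b, per Equations (5) and (6),
    using the residual-based estimate sigma_hat in place of the unknown sigma."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    s_x = np.sum((x - x.mean()) ** 2)                    # total scatter of X
    w = np.sum((x - x.mean()) * (y - y.mean())) / s_x    # least squares slope
    b = y.mean() - w * x.mean()                          # least squares bias
    residuals = y - (b + w * x)
    sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    se_w = sigma_hat / np.sqrt(s_x)                            # Equation (5) with sigma_hat
    se_b = sigma_hat * np.sqrt(1.0 / n + x.mean() ** 2 / s_x)  # Equation (6) with sigma_hat
    return se_w, se_b
```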