8.4.3 Linear Regression
Prof. Tesler, Math 283, Fall 2019
Regression

Given n points (x₁, y₁), (x₂, y₂), ..., we want to determine a function y = f(x) that is close to them.

[Figure: scatter plot of the data (x, y).]
Regression

Based on knowledge of the underlying problem or on plotting the data, you have an idea of the general form of the function, such as:

Line: y = β₀ + β₁x
Polynomial: y = β₀ + β₁x + β₂x² + β₃x³
Exponential decay: y = A e^(−Bx)
Logistic curve: y = A / (1 + B/C^x)

[Figure: example plots of each of these four model forms.]

Goal: Compute the parameters (β₀, β₁, ... or A, B, C, ...) that give a "best fit" to the data in some sense (least squares or MLEs).
Regression

The methods we consider require the parameters to occur linearly. It is fine if (x, y) do not occur linearly.

E.g., plugging (x, y) = (2, 3) into y = β₀ + β₁x + β₂x² + β₃x³ gives 3 = β₀ + 2β₁ + 4β₂ + 8β₃.

For exponential decay, y = A e^(−Bx), parameter B does not occur linearly. Transform the equation to

    ln y = ln(A) − Bx = A′ − Bx.

When we plug in (x, y) values, the parameters A′, B occur linearly.

Transform the logistic curve y = A / (1 + B/C^x) to

    ln(A/y − 1) = ln(B) − x ln(C) = B′ + C′x,

where A is determined from A = lim_(x→∞) y(x). Now B′, C′ occur linearly.
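As a concrete illustration of the log transform (not from the slides; the data array is made up), here is a short numpy sketch that fits y = A e^(−Bx) by a degree-1 least squares fit to (x, ln y):

    import numpy as np

    # Hypothetical data roughly following y = 5 * exp(-0.3 x) plus noise.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([5.1, 3.6, 2.8, 2.0, 1.5, 1.2, 0.8])

    # Transform: ln y = ln(A) - B x, so the parameters ln(A) and -B occur linearly.
    slope, intercept = np.polyfit(x, np.log(y), 1)   # degree-1 least squares fit
    A = np.exp(intercept)
    B = -slope
    print(f"A = {A:.3f}, B = {B:.3f}")               # estimates of A and B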
Least squares fit to a line

[Figure: scatter plot of the data (x, y).]

Given n points (x₁, y₁), (x₂, y₂), ..., we will fit them to a line ŷ = β₀ + β₁x:

Independent variable: x. We assume the x's are known exactly or have negligible measurement errors.
Dependent variable: y. We assume the y's depend on the x's but fluctuate due to a random process. We do not have y = f(x), but instead, y = f(x) + error.
Least squares fit to a line

[Figure: scatter plot of the data with the fitted line.]

Given n points (x₁, y₁), (x₂, y₂), ..., we will fit them to a line ŷ = β₀ + β₁x:

Predicted y value (on the line): ŷᵢ = β₀ + β₁xᵢ
Actual data (•): yᵢ = β₀ + β₁xᵢ + εᵢ
Residual (actual y minus prediction): εᵢ = yᵢ − ŷᵢ = yᵢ − (β₀ + β₁xᵢ)
Least squares fit to a line

[Figure: scatter plot of the data with the fitted line.]

We will use the least squares method: pick parameters β₀, β₁ that minimize the sum of squares of the residuals,

    L = Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))².
Least squares fit to a line

    L = Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))²

To find β₀, β₁ that minimize this, solve ∇L = (∂L/∂β₀, ∂L/∂β₁) = (0, 0):

    ∂L/∂β₀ = −2 Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ)) = 0    ⇒  nβ₀ + β₁ Σᵢ xᵢ = Σᵢ yᵢ
    ∂L/∂β₁ = −2 Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ)) xᵢ = 0  ⇒  β₀ Σᵢ xᵢ + β₁ Σᵢ xᵢ² = Σᵢ xᵢyᵢ

which has solution (all sums are i = 1 to n)

    β₁ = [n Σᵢ xᵢyᵢ − (Σᵢ xᵢ)(Σᵢ yᵢ)] / [n Σᵢ xᵢ² − (Σᵢ xᵢ)²] = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
    β₀ = ȳ − β₁ x̄

Not shown: use 2nd derivatives to confirm it's a minimum rather than a maximum or saddle point.
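A minimal numpy sketch of these closed-form formulas (my own illustration; the function name fit_line and the sample data are assumptions):

    import numpy as np

    def fit_line(x, y):
        """Least squares fit y ≈ beta0 + beta1*x using the closed-form solution."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        xbar, ybar = x.mean(), y.mean()
        beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
        beta0 = ybar - beta1 * xbar
        return beta0, beta1

    # Example: noisy points near y = 2 + 3x.
    x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 4.9, 8.2, 10.8, 14.1, 17.2])
    beta0, beta1 = fit_line(x, y)
    print(beta0, beta1)   # roughly 2 and 3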
Best fitting line

[Figure: the same data fit two ways. Fitting y = β₀ + β₁x + ε gives y = 24.9494 + 0.6180x (slope 0.6180); fitting x = α₀ + α₁y + ε gives x = −28.2067 + 1.1501y (slope 0.8695 when drawn in the (x, y) plane).]

The best fits for y = β₀ + β₁x + error and for x = α₀ + α₁y + error give different lines!

y = β₀ + β₁x + error assumes the x's are known exactly with no errors, while the y's have errors.
x = α₀ + α₁y + error is the other way around.
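A quick numerical sketch of this asymmetry (my own example; the data are made up to roughly resemble the scatter plot): fit y on x, then fit x on y and rewrite that line in the (x, y) plane to compare slopes.

    import numpy as np

    x = np.array([-20, -10, 0, 5, 10, 15, 20, 25, 30], dtype=float)
    y = np.array([10, 22, 19, 28, 30, 25, 38, 35, 48], dtype=float)

    b1, b0 = np.polyfit(x, y, 1)       # y ≈ b0 + b1*x
    a1, a0 = np.polyfit(y, x, 1)       # x ≈ a0 + a1*y
    # Rewriting x = a0 + a1*y as y = -a0/a1 + (1/a1)*x shows that its slope in
    # the (x, y) plane is 1/a1, which generally differs from b1.
    print("slope of y-on-x line:", b1)
    print("slope of x-on-y line:", 1 / a1)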
Total Least Squares / Principal Components Analysis

[Figure: four panels of the same data. Two panels repeat the fits from the previous slide: y = β₀ + β₁x + ε (y = 24.9494 + 0.6180x, slope 0.6180) and x = α₀ + α₁y + ε (x = −28.2067 + 1.1501y, slope 0.8695). A third panel shows the first principal component of the centered data: slope 0.6934274, through the point (x̄, ȳ) = (1.685727, 25.99114). The fourth panel, "All three," overlays the three lines.]
Least squares vs. PCA

Errors in data:
Least squares: y = β₀ + β₁x + error assumes x's have no errors while y's have errors.
PCA: assumes all coordinates have errors.

For (xᵢ, yᵢ) data, we minimize the sum of ...
Least squares: squared vertical distances from points to the line.
PCA: squared orthogonal distances from points to the line.

Due to centering data, the lines all go through (x̄, ȳ). For multivariate data, lines are replaced by planes, etc.

Different units/scaling on inputs (x) and outputs (y):
Least squares gives equivalent solutions if you change units or scaling, while PCA is sensitive to changes in these.
Example: (a) x in seconds, y in cm vs. (b) x in seconds, y in mm give equivalent results for least squares, inequivalent for PCA.
For PCA, a workaround is to convert coordinates to Z-scores.
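A sketch of the PCA / total least squares line (my own illustration with made-up data): the line passes through (x̄, ȳ) and points along the first right singular vector of the centered data, which minimizes the sum of squared orthogonal distances.

    import numpy as np

    x = np.array([-20, -10, 0, 5, 10, 15, 20, 25, 30], dtype=float)
    y = np.array([10, 22, 19, 28, 30, 25, 38, 35, 48], dtype=float)

    pts = np.column_stack([x, y])
    center = pts.mean(axis=0)                  # (xbar, ybar); the line passes through it
    _, _, vt = np.linalg.svd(pts - center)     # SVD of the centered data
    direction = vt[0]                          # first principal component direction
    slope = direction[1] / direction[0]
    intercept = center[1] - slope * center[0]
    print("TLS/PCA line: y =", intercept, "+", slope, "* x")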
Distribution of values at each x

[Figure: two panels of simulated data. (a) Homoscedastic: the spread of y is the same at every x. (b) Heteroscedastic: the spread of y changes with x.]

On repeated trials, at each x we get a distribution of values of y rather than a single value.

In (a), the error term is a normal distribution with the same variance for every x. This is the case we will study. Assume the errors are independent of x and have a normal distribution with mean 0, SD σ.

In (b), the variance changes for different values of x. Use a generalization called Weighted Least Squares.
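A minimal sketch of weighted least squares for case (b) (my own illustration; the per-point standard deviations sigma are assumed known and the data are made up), using weights wᵢ = 1/σᵢ² in the weighted normal equations:

    import numpy as np

    x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 4.9, 8.2, 10.8, 14.1, 17.2])
    sigma = np.array([0.5, 0.5, 1.0, 1.0, 2.0, 2.0])    # assumed known SDs at each x

    # Weighted least squares: minimize sum_i w_i * (y_i - (b0 + b1*x_i))^2
    # with w_i = 1/sigma_i^2, by solving (X^T W X) beta = X^T W y.
    w = 1.0 / sigma**2
    X = np.column_stack([np.ones_like(x), x])           # design matrix [1, x]
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    print("b0, b1 =", beta)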
Maximum Likelihood Estimate for best fitting line

The method of least squares uses a geometrical perspective. Now we'll assume the data has certain statistical properties.

Simple linear model: Y = β₀ + β₁x + E

Assume the x's are known (so lowercase) and E is Gaussian with mean 0 and standard deviation σ, making E, Y random variables.

At each x, there is a distribution of possible y's, giving a conditional distribution: f_{Y|X=x}(y). Assume the conditional distributions for different x's are independent.

The means of these conditional distributions form a line y = E(Y | X = x) = β₀ + β₁x.

Denote the MLE values by β̂₀, β̂₁, σ̂² to distinguish them from the true (hidden) values.
Maximum Likelihood Estimate for best fitting line

Given data (x₁, y₁), ..., (xₙ, yₙ), we have yᵢ = β₀ + β₁xᵢ + εᵢ, where εᵢ = yᵢ − (β₀ + β₁xᵢ) has a normal distribution with mean 0 and standard deviation σ.

The likelihood of the data is the product of the pdf of the normal distribution at εᵢ over all i:

    L = (1 / (√(2π) σ)ⁿ) exp( −Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))² / (2σ²) )

Finding β₀, β₁ that maximize L (or log L) is equivalent to minimizing

    Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))²

so we get the same answer as using least squares!
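A numerical check of this equivalence (my own sketch with made-up data): maximizing the log-likelihood over (β₀, β₁) with scipy gives the same coefficients as the closed-form least squares fit.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 4.9, 8.2, 10.8, 14.1, 17.2])

    def neg_log_likelihood(params, sigma=1.0):
        b0, b1 = params
        # Sum of log N(0, sigma^2) densities of the residuals; the value of
        # sigma does not change which (b0, b1) maximizes it.
        return -np.sum(norm.logpdf(y - (b0 + b1 * x), scale=sigma))

    mle = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x
    lsq = np.polyfit(x, y, 1)[::-1]            # closed-form least squares (b0, b1)
    print("MLE:          ", mle)
    print("Least squares:", lsq)               # essentially identical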