

  1. Linear Regression. Machine Learning. Hamid Beigy, Sharif University of Technology, Fall 1393.

  2. Introduction
  1. Linear regression
  2. Model selection
  3. Sample size
  4. Maximum likelihood and least squares
  5. Over-fitting
  6. Regularization
  7. Maximum a posteriori and regularization
  8. Geometric interpretation
  9.
  10. Sequential learning
  11. Multiple outputs regression
  12. Bias-variance trade-off

  3. Introduction
  In regression, $c(x)$ is a continuous function. Hence the training set has the form $S = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}$ with $t_k \in \mathbb{R}$.
  If there is no noise, the task is interpolation and our goal is to find a function $f(x)$ that passes through these points, so that $t_k = f(x_k)$ for all $k = 1, 2, \ldots, N$.
  In polynomial interpolation, given $N$ points, we find an $(N-1)$st-degree polynomial to predict the output for any $x$. If $x$ is outside the range of the training set, the task is called extrapolation.
  In regression, noise is added to the output of the unknown function: $t_k = f(x_k) + \epsilon$ for all $k = 1, 2, \ldots, N$, where $f(x_k) \in \mathbb{R}$ is the unknown function and $\epsilon$ is random noise.
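
  The interpolation/regression distinction can be illustrated in a few lines of NumPy; this is a hedged sketch on assumed toy data (the sine target and the sample size are illustrative, not from the slides):

```python
import numpy as np

# With N noise-free points, an (N-1)st-degree polynomial passes through all of them.
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 1.0, size=5))        # N = 5 inputs
t = np.sin(2 * np.pi * x)                         # noise-free targets t_k = f(x_k)

coeffs = np.polyfit(x, t, deg=len(x) - 1)         # degree N - 1 = 4 polynomial
print(np.allclose(np.polyval(coeffs, x), t))      # True: exact fit at the training points

# With noise, t_k = f(x_k) + eps, and an exact fit is no longer the goal (regression).
t_noisy = t + rng.normal(scale=0.1, size=t.shape)
```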

  4. Introduction (cont.)
  In regression, noise is added to the output of the unknown function: $t_k = f(x_k) + \epsilon$ for all $k = 1, 2, \ldots, N$.
  The explanation for the noise is that there are extra hidden variables that we cannot observe: $t_k = f^*(x_k, z_k) + \epsilon$ for all $k = 1, 2, \ldots, N$, where $z_k$ denotes the hidden variables.

  5. Linear regression
  Our goal is to approximate the output by a function $g(x)$. The empirical error on the training set $S$ is measured using a loss/error/cost function:
  Squared error from target: $E_E(g(x_i) \mid S) = (t_i - g(x_i))^2$.
  Linear (absolute) error from target: $E_E(g(x_i) \mid S) = |t_i - g(x_i)|$.
  Mean square error from target: $E_E(g(x) \mid S) = \frac{1}{N} \sum_{i=1}^{N} (t_i - g(x_i))^2$.
  Sum-of-squares error from target: $E_E(g(x) \mid S) = \frac{1}{2} \sum_{i=1}^{N} (t_i - g(x_i))^2$.
  The aim is to find $g(\cdot)$ that minimizes the empirical error. We assume a hypothesis class for $g(\cdot)$ with a small set of parameters, and take $g(x)$ to be linear: $g(x) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$.
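
  For concreteness, a minimal sketch of the four error measures above on assumed toy data (the arrays and weights below are illustrative only):

```python
import numpy as np

# Assumed toy data and a candidate linear model g(x) = w0 + w1*x.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 0.9, 2.2, 2.8])
w0, w1 = 0.0, 1.0
g = w0 + w1 * x                          # model predictions g(x_i)

squared = (t - g) ** 2                   # squared error per example
absolute = np.abs(t - g)                 # linear (absolute) error per example
mse = np.mean((t - g) ** 2)              # mean square error: (1/N) * sum
sse = 0.5 * np.sum((t - g) ** 2)         # sum-of-squares error: (1/2) * sum
```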

  6. Linear regression (cont.)
  In linear regression with $D = 1$, $g(x) = w_0 + w_1 x$. The parameters $w_0$ and $w_1$ should minimize the empirical error
  $E_E(w_0, w_1 \mid S) = E_E(g(x) \mid S) = \frac{1}{2} \sum_{k=1}^{N} [t_k - (w_0 + w_1 x_k)]^2$.
  This error function is a quadratic function of $W$ and its derivative is linear in $W$, so its minimization has a unique solution, denoted by $W^*$. Taking the derivative of the error with respect to $w_0$ and $w_1$ and setting it equal to zero gives
  $w_0 = \bar{t} - w_1 \bar{x}$, $\quad w_1 = \dfrac{\sum_k t_k x_k - N \bar{x} \bar{t}}{\sum_k x_k^2 - N \bar{x}^2}$, $\quad \bar{t} = \dfrac{\sum_{k=1}^{N} t_k}{N}$, $\quad \bar{x} = \dfrac{\sum_{k=1}^{N} x_k}{N}$.
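
  A minimal NumPy sketch of this closed-form $D = 1$ solution; the helper name fit_simple_linear and the toy data are assumptions for illustration:

```python
import numpy as np

# Closed-form least-squares fit for g(x) = w0 + w1*x, following the formulas above.
def fit_simple_linear(x, t):
    n = len(x)
    x_bar, t_bar = x.mean(), t.mean()
    w1 = (np.sum(t * x) - n * x_bar * t_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
    w0 = t_bar - w1 * x_bar
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.1, 4.9, 7.2])
w0, w1 = fit_simple_linear(x, t)          # roughly w0 ~ 1.0, w1 ~ 2.0
```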

  7. Linear Regression (cont.)
  When the input variables form a $D$-dimensional vector, the linear regression model is $g(x) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$. The parameters $w_0, w_1, \ldots, w_D$ should minimize the empirical error
  $E_E(g(x) \mid S) = \frac{1}{2} \sum_{i=1}^{N} (t_i - g(x_i))^2$.
  Take the derivative of the error with respect to each $w_j$ and set it equal to zero.

  8. Linear Regression (cont.)
  Taking the derivative of the error with respect to each $w_j$ and setting it equal to zero gives
  $\sum_{k=1}^{N} t_k = N w_0 + w_1 \sum_{k=1}^{N} x_{k1} + w_2 \sum_{k=1}^{N} x_{k2} + \ldots + w_D \sum_{k=1}^{N} x_{kD}$
  $\sum_{k=1}^{N} x_{k1} t_k = w_0 \sum_{k=1}^{N} x_{k1} + w_1 \sum_{k=1}^{N} x_{k1}^2 + w_2 \sum_{k=1}^{N} x_{k1} x_{k2} + \ldots + w_D \sum_{k=1}^{N} x_{k1} x_{kD}$
  $\sum_{k=1}^{N} x_{k2} t_k = w_0 \sum_{k=1}^{N} x_{k2} + w_1 \sum_{k=1}^{N} x_{k1} x_{k2} + w_2 \sum_{k=1}^{N} x_{k2}^2 + \ldots + w_D \sum_{k=1}^{N} x_{k2} x_{kD}$
  $\vdots$
  $\sum_{k=1}^{N} x_{kD} t_k = w_0 \sum_{k=1}^{N} x_{kD} + w_1 \sum_{k=1}^{N} x_{k1} x_{kD} + w_2 \sum_{k=1}^{N} x_{k2} x_{kD} + \ldots + w_D \sum_{k=1}^{N} x_{kD}^2$
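
  This system can be assembled directly from the sums; a sketch in NumPy, where the variable names and toy data are assumptions for illustration:

```python
import numpy as np

# Accumulate the sums above and solve the resulting (D+1) x (D+1) system.
rng = np.random.default_rng(0)
N, D = 200, 2
x = rng.normal(size=(N, D))
t = 1.5 + x @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=N)

# Augment each input with a leading 1 so the sums over x_k0 = 1 give the w0 equation.
xa = np.hstack([np.ones((N, 1)), x])               # shape (N, D+1)

A = np.zeros((D + 1, D + 1))
b = np.zeros(D + 1)
for k in range(N):
    A += np.outer(xa[k], xa[k])                    # entries like sum_k x_ki * x_kj
    b += t[k] * xa[k]                              # entries like sum_k x_ki * t_k

w = np.linalg.solve(A, b)                          # (w0, w1, ..., wD), roughly (1.5, 2.0, -1.0)
```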

  9. Linear Regression (cont.)
  Define the following vectors and matrix.
  Data matrix:
  $X = \begin{pmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1D} \\ 1 & x_{21} & x_{22} & \ldots & x_{2D} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \ldots & x_{ND} \end{pmatrix}$
  The $k$th input vector: $X_k = (1, x_{k1}, x_{k2}, \ldots, x_{kD})^T$.
  The weight vector: $W = (w_0, w_1, w_2, \ldots, w_D)^T$.
  The target vector: $t = (t_1, t_2, t_3, \ldots, t_N)^T$.

  10. Linear Regression (cont.)
  The empirical error is
  $E_E(g(x) \mid S) = \frac{1}{2} \sum_{k=1}^{N} \left( t_k - W^T X_k \right)^2$.
  The gradient of $E_E(g(x) \mid S)$ is
  $\nabla_W E_E(g(x) \mid S) = \sum_{k=1}^{N} \left( t_k - W^T X_k \right) X_k^T = \sum_{k=1}^{N} t_k X_k^T - W^T \sum_{k=1}^{N} X_k X_k^T = 0$.
  Solving for $W$, we obtain $W^* = \left( X^T X \right)^{-1} X^T t$.
  If $X^T X$ is invertible, the problem has a unique solution. If $X^T X$ is not invertible, the pseudo-inverse is used and the problem has several solutions.
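
  The same solution in matrix form, as a sketch; the toy data are assumed, and the bias $w_0$ is handled by a column of ones in $X$:

```python
import numpy as np

# Least-squares solution W* = (X^T X)^{-1} X^T t in matrix form.
rng = np.random.default_rng(1)
N, D = 100, 3
X_raw = rng.normal(size=(N, D))
t = 2.0 + X_raw @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=N)

X = np.hstack([np.ones((N, 1)), X_raw])       # first column of ones corresponds to w0
W_star = np.linalg.solve(X.T @ X, X.T @ t)    # unique solution when X^T X is invertible
W_pinv = np.linalg.pinv(X) @ t                # pseudo-inverse: usable even when X^T X is singular
```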

  11. Regression (cont.)
  If the linear model is too simple, the model can be a polynomial (a more complex hypothesis set): $g(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$.
  $M$ is the order of the polynomial, and choosing the right value of $M$ is called model selection.
  For $M = 1$ we have a too general model; for $M = 9$ we have a too specific model.
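
  Because the polynomial model is still linear in the weights, it can be fitted with the same least-squares machinery by treating the powers of $x$ as inputs; a sketch on assumed data:

```python
import numpy as np

# Polynomial regression as linear regression on powers of x.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

M = 3                                               # order of the polynomial (model selection picks M)
X = np.vander(x, M + 1, increasing=True)            # columns 1, x, x^2, ..., x^M
w = np.linalg.lstsq(X, t, rcond=None)[0]            # (w0, w1, ..., wM)
```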

  12. Regression (model selection)
  The goal of model selection is to achieve good generalization by making accurate predictions for new data. The generalization ability of a model is measured on a separate test set generated using exactly the same process that generated the training data. The model itself is chosen using a validation data set.
  Two models are sometimes compared using the root-mean-square (RMS) error
  $E_{RMS} = \sqrt{2 E_E(W^* \mid S) / N}$,
  which allows comparison across data sets of different sizes.
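
  A model-selection sketch: fit polynomials of a few orders $M$ and compare $E_{RMS}$ on the training set and on a separately generated test set. The sine-plus-noise generating process and the sample sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

def e_rms(w, x, t):
    sse = 0.5 * np.sum((t - np.polyval(w, x)) ** 2)   # sum-of-squares error E(W|S)
    return np.sqrt(2.0 * sse / len(x))                # E_RMS = sqrt(2 E(W|S) / N)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in (1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)
    print(M, e_rms(w, x_train, t_train), e_rms(w, x_test, t_test))
```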

  13. Regression (sample size)
  For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.

  14. Linear Regression (cont.)
  We can extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form
  $g(x) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)$.
  The $\phi_j(x)$ are known as basis functions, $M$ is the total number of parameters, and $w_0$ is called the bias parameter. Usually a dummy basis function $\phi_0(x) = 1$ is used, so that
  $g(x) = \sum_{j=0}^{M-1} w_j \phi_j(x) = W^T \Phi(x)$,
  where $W = (w_0, w_1, \ldots, w_{M-1})^T$ and $\Phi = (\phi_0, \phi_1, \ldots, \phi_{M-1})^T$.

  15. Linear Regression (cont.)
  In a pre-processing phase, the features can be expressed in terms of the basis functions $\{\phi_j(x)\}$.
  Examples of basis functions: polynomial basis functions $\phi_j(x) = x^j$.

  16. Linear Regression (cont.)
  Examples of basis functions: Gaussian basis functions
  $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$,
  where $\mu_j$ is the location of the basis function and $s$ is its spatial scale.
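
  A small sketch of Gaussian basis functions in NumPy; the centers and scale are assumed values for illustration:

```python
import numpy as np

# Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).
def gaussian_basis(x, centers, s):
    # x: shape (N,), centers: shape (M-1,)  ->  returns shape (N, M-1)
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2))

x = np.linspace(0.0, 1.0, 5)
centers = np.linspace(0.0, 1.0, 4)    # the locations mu_j
phi = gaussian_basis(x, centers, s=0.2)
```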

  17. Linear Regression (cont.)
  Examples of basis functions: logistic basis functions
  $\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$, with $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
  Other examples are Fourier basis functions and wavelet basis functions.

  18. Linear Regression (cont.)
  The empirical error is
  $E_E(g(x) \mid S) = \frac{1}{2} \sum_{k=1}^{N} \left( t_k - W^T \Phi(X_k) \right)^2$.
  The gradient of $E_E(g(x) \mid S)$ is
  $\nabla_W E_E(g(x) \mid S) = \sum_{k=1}^{N} \left( t_k - W^T \Phi(X_k) \right) \Phi(X_k)^T = \sum_{k=1}^{N} t_k \Phi(X_k)^T - W^T \sum_{k=1}^{N} \Phi(X_k) \Phi(X_k)^T = 0$.
  Solving for $W$, we obtain $W^* = \left( \Phi^T \Phi \right)^{-1} \Phi^T t$, where
  $\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \phi_2(x_1) & \ldots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \phi_2(x_2) & \ldots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \phi_2(x_N) & \ldots & \phi_{M-1}(x_N) \end{pmatrix}$
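
  Putting the pieces together, a sketch that builds the design matrix $\Phi$ with the dummy basis $\phi_0(x) = 1$ plus Gaussian basis functions and solves for $W^*$; the data and basis settings are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)

centers, s = np.linspace(0.0, 1.0, 9), 0.2
Phi = np.hstack([
    np.ones((len(x), 1)),                                               # phi_0(x) = 1 column
    np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2)),   # Gaussian basis columns
])

W_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # least-squares weights
g = Phi @ W_star                                   # g(x_k) = W^T Phi(x_k) at the training inputs
```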

  19. Linear Regression (cont.)
  The quantity $\Phi^{\dagger} = \left( \Phi^T \Phi \right)^{-1} \Phi^T$ is known as the Moore-Penrose pseudo-inverse.
  The bias value $w_0$ is
  $w_0 = \frac{1}{N} \sum_{k=1}^{N} \left( t_k - \sum_{j=1}^{M-1} w_j \phi_j(X_k) \right)$,
  so the bias compensates for the difference between the average target value and the weighted sum of the averages of the basis function values.
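
  A sketch of the pseudo-inverse solution together with a numerical check of the bias identity above; it rebuilds the same assumed $\Phi$ and $t$ as the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)
centers, s = np.linspace(0.0, 1.0, 9), 0.2
Phi = np.hstack([
    np.ones((len(x), 1)),
    np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2)),
])

W_star = np.linalg.pinv(Phi) @ t                     # W* = Phi^dagger t
w0_from_formula = np.mean(t - Phi[:, 1:] @ W_star[1:])
print(np.isclose(w0_from_formula, W_star[0]))        # True at the least-squares optimum
```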
