  1. Linear models
     Oliver Stegle and Karsten Borgwardt
     Machine Learning and Computational Biology Research Group, Max Planck Institute for Biological Cybernetics and Max Planck Institute for Developmental Biology, Tübingen
     (Lecture: Computational Approaches for Analysing Complex Biological Systems)

  2. Motivation: Curve fitting
     Tasks we are interested in:
     - Making predictions (e.g. the target value y at a new input x*).
     - Comparison of alternative models.
     [Figure: data points in the X-Y plane, with a query point x* whose target value is unknown.]

  3. Motivation: Further reading, useful material
     - Christopher M. Bishop: Pattern Recognition and Machine Learning.
     - Good background; covers most of the course material and much more.
     - This lecture is largely inspired by chapter 3 of the book.

  4. Outline

  5. Outline
     - Motivation
     - Linear Regression
     - Bayesian linear regression
     - Model comparison and hypothesis testing
     - Summary

  6. Linear Regression: Noise model and likelihood
     - Given a dataset D = \{x_n, y_n\}_{n=1}^N, where x_n = (x_{n,1}, \dots, x_{n,D}) is D-dimensional, fit the parameters \theta of a regressor f with added Gaussian noise:
       y_n = f(x_n; \theta) + \epsilon_n, \quad \text{where } p(\epsilon \mid \sigma^2) = \mathcal{N}(\epsilon \mid 0, \sigma^2).
     - Equivalent likelihood formulation:
       p(y \mid X) = \prod_{n=1}^N \mathcal{N}(y_n \mid f(x_n), \sigma^2)
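Note: as a concrete illustration of this noise model (not part of the original slides), the following Python/NumPy sketch draws data from y_n = w·x_n + ε_n and evaluates the corresponding Gaussian log-likelihood; the dimensions, true weights and noise level are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Arbitrary example settings (not taken from the lecture).
    N, D = 50, 3                           # number of samples, input dimension
    w_true = np.array([1.5, -2.0, 0.5])    # "true" weights of the linear regressor
    sigma = 0.3                            # noise standard deviation

    # Generate data: y_n = w_true . x_n + eps_n, with eps_n ~ N(0, sigma^2)
    X = rng.normal(size=(N, D))
    y = X @ w_true + rng.normal(scale=sigma, size=N)

    def log_likelihood(w, X, y, sigma):
        """ln p(y | w, X, sigma^2) = sum_n ln N(y_n | w . x_n, sigma^2)."""
        resid = y - X @ w
        return -0.5 * len(y) * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

    print(log_likelihood(w_true, X, y, sigma))       # high for the generating parameters
    print(log_likelihood(np.zeros(D), X, y, sigma))  # much lower for a poor choice of w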

  7. Linear Regression: Choosing a regressor
     - Choose f to be linear:
       p(y \mid X) = \prod_{n=1}^N \mathcal{N}(y_n \mid w^T x_n + c, \sigma^2)
     - Consider the bias-free case, c = 0; otherwise include an additional column of ones in each x_n.
     [Figure: the equivalent graphical model.]

  8. Linear Regression: Maximum likelihood
     - Taking the logarithm, we obtain
       \ln p(y \mid w, X, \sigma^2) = \sum_{n=1}^N \ln \mathcal{N}(y_n \mid w^T x_n, \sigma^2)
                                    = -\frac{N}{2} \ln 2\pi\sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w^T x_n)^2,
       where the second term is the sum of squares.
     - The likelihood is maximized when the squared error is minimized.
     - Least squares and maximum likelihood are equivalent.
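Note: a short numerical check of this equivalence (my own illustration, not from the slides). For every weight vector the negative log-likelihood is the same increasing affine function of the sum of squared errors, so both criteria are minimized by the same w; the data and candidate weights below are invented.

    import numpy as np

    rng = np.random.default_rng(1)
    N, D, sigma = 40, 2, 0.5
    X = rng.normal(size=(N, D))
    y = X @ np.array([0.7, -1.2]) + rng.normal(scale=sigma, size=N)

    def neg_log_lik(w):
        resid = y - X @ w
        return 0.5 * N * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

    def sum_of_squares(w):
        return np.sum((y - X @ w) ** 2)

    for w in [np.array([0.7, -1.2]), np.zeros(D), rng.normal(size=D)]:
        # Identical affine relationship for every w, so the minimizers coincide.
        assert np.isclose(neg_log_lik(w),
                          0.5 * N * np.log(2 * np.pi * sigma**2) + sum_of_squares(w) / (2 * sigma**2))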

  9. Linear Regression and Least Squares
     [Figure: data points y_n and a fitted function f(x_n, w), with the vertical errors between them highlighted (C.M. Bishop, Pattern Recognition and Machine Learning).]
       E(w) = \frac{1}{2} \sum_{n=1}^N (y_n - w^T x_n)^2

  10. Linear Regression and Least Squares
      - Derivative w.r.t. a single weight entry w_i:
        \frac{d}{dw_i} \ln p(y \mid w, \sigma^2) = \frac{d}{dw_i} \left[ -\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w \cdot x_n)^2 \right]
                                                 = \frac{1}{\sigma^2} \sum_{n=1}^N (y_n - w \cdot x_n) x_{n,i}
      - Set the gradient w.r.t. w to zero:
        \nabla_w \ln p(y \mid w, \sigma^2) = \frac{1}{\sigma^2} \sum_{n=1}^N (y_n - w \cdot x_n) x_n^T = 0
        \Rightarrow w_{ML} = (X^T X)^{-1} X^T y,
        where (X^T X)^{-1} X^T is the pseudo-inverse.
      - Here, the matrix X is defined as X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,D} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,D} \end{pmatrix}.
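Note: a minimal sketch of the closed-form maximum-likelihood solution (assuming NumPy; the synthetic data are invented for the example), compared against NumPy's built-in least-squares solvers.

    import numpy as np

    rng = np.random.default_rng(2)
    N, D = 100, 4
    X = rng.normal(size=(N, D))
    w_true = rng.normal(size=D)
    y = X @ w_true + 0.1 * rng.normal(size=N)

    # Closed form: w_ML = (X^T X)^{-1} X^T y
    w_ml = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent, numerically more robust routes via the pseudo-inverse / lstsq:
    w_pinv = np.linalg.pinv(X) @ y
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(w_ml, w_pinv), np.allclose(w_ml, w_lstsq))  # both True up to numerics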

  11. Linear Regression: Polynomial Curve Fitting
      - Use polynomials up to degree K to construct new features from x:
        f(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_K x^K = w^T \phi(x),
        where we defined \phi(x) = (1, x, x^2, \dots, x^K).
      - Similarly, \phi can be any feature mapping.
      - Possible to show: the feature map \phi can be expressed in terms of kernels (kernel trick).
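Note: to make the feature-map idea concrete, here is a small sketch (my own example, assuming NumPy) that builds φ(x) = (1, x, x², ..., x^K) and reuses ordinary least squares in the feature space; the noisy sine target and the degree K are arbitrary.

    import numpy as np

    def poly_features(x, K):
        """phi(x) = (1, x, x^2, ..., x^K) for a 1-D array x; returns an (N, K+1) design matrix."""
        return np.vander(x, K + 1, increasing=True)

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 1, size=30)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)   # invented noisy target

    K = 3
    Phi = poly_features(x, K)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least squares, now in feature space

    x_new = np.array([0.25, 0.5, 0.75])
    print(poly_features(x_new, K) @ w)            # predictions f(x_new, w) = w^T phi(x_new)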

  12. Polynomial Curve Fitting: Overfitting
      - The degree of the polynomial is crucial to avoid under- and overfitting.
      [Figures: polynomial fits of degree M = 0, 1, 3 and 9 to the same data set (C.M. Bishop, Pattern Recognition and Machine Learning).]
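Note: the figures themselves are not reproduced here, but the effect is easy to see numerically. The sketch below (my own example, with an invented sine-based data set) fits polynomials of the same degrees and compares the error on the training points with the error on held-out data: low degrees underfit both, while degree 9 drives the training error towards zero and typically makes the held-out error much worse.

    import numpy as np

    rng = np.random.default_rng(4)

    def make_data(n):
        x = rng.uniform(0, 1, size=n)
        return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

    x_train, y_train = make_data(10)
    x_test, y_test = make_data(100)

    def fit_poly(x, y, M):
        Phi = np.vander(x, M + 1, increasing=True)     # features (1, x, ..., x^M)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return w

    def rmse(w, x, y):
        Phi = np.vander(x, len(w), increasing=True)
        return np.sqrt(np.mean((y - Phi @ w) ** 2))

    for M in (0, 1, 3, 9):
        w = fit_poly(x_train, y_train, M)
        print(f"M={M}: train RMSE {rmse(w, x_train, y_train):.3f}, "
              f"test RMSE {rmse(w, x_test, y_test):.3f}")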

  13. Linear Regression: Regularized Least Squares
      - Solutions to avoid overfitting:
        - Intelligently choose K.
        - Regularize the regression weights w.
      - Construct a smoothed error function (squared error plus regularizer):
        E(w) = \frac{1}{2} \sum_{n=1}^N \left( y_n - w^T \phi(x_n) \right)^2 + \frac{\lambda}{2} w^T w
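Note: the slide only states the penalized error function; setting its gradient to zero gives the standard ridge-regression closed form w = (Φ^T Φ + λI)^{-1} Φ^T y. A minimal sketch of that solution on an invented data set, assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 1, size=10)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)

    M, lam = 9, 1e-3                                   # high-degree features, small penalty
    Phi = np.vander(x, M + 1, increasing=True)

    # Minimizer of 1/2 * ||y - Phi w||^2 + lambda/2 * w^T w:
    w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)

    w_unreg, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # unregularized fit for comparison
    print("||w|| regularized:  ", np.linalg.norm(w_ridge))
    print("||w|| unregularized:", np.linalg.norm(w_unreg))   # typically much larger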

  14. Regularized Least Squares: More general regularizers
      - A more general regularization approach (squared error plus regularizer):
        E(w) = \frac{1}{2} \sum_{n=1}^N \left( y_n - w^T \phi(x_n) \right)^2 + \frac{\lambda}{2} \sum_{d=1}^D |w_d|^q
      [Figure: contours of the regularization term for q = 0.5, q = 1 (Lasso), q = 2 (quadratic) and q = 4; values q ≤ 1 encourage sparse solutions (C.M. Bishop, Pattern Recognition and Machine Learning).]
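Note: as a rough illustration of why q = 1 favours sparse solutions (my own example; SciPy, the data and the penalty strength are assumptions, not part of the slides), the sketch below minimizes the penalized error for q = 1 and q = 2 with a generic derivative-free optimizer. A general-purpose optimizer does not return exact zeros, but the weights of the irrelevant features typically end up much closer to zero under the q = 1 penalty; dedicated solvers (e.g. for the Lasso) exploit the structure of the problem and do produce exact zeros.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(6)
    N, D = 30, 6
    X = rng.normal(size=(N, D))
    w_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])   # only two relevant features
    y = X @ w_true + 0.5 * rng.normal(size=N)

    def penalized_error(w, q, lam=10.0):
        """E(w) = 1/2 * sum_n (y_n - w . x_n)^2 + lambda/2 * sum_d |w_d|^q."""
        return 0.5 * np.sum((y - X @ w) ** 2) + 0.5 * lam * np.sum(np.abs(w) ** q)

    for q in (1, 2):
        res = minimize(penalized_error, x0=np.zeros(D), args=(q,), method="Powell")
        print(f"q={q}:", np.round(res.x, 3))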

  15. Linear Regression: Loss functions and other methods
      - Even more general: vary the loss function
        E(w) = \frac{1}{2} \sum_{n=1}^N L\left( y_n - w^T \phi(x_n) \right) + \frac{\lambda}{2} \sum_{d=1}^D |w_d|^q
        (the first term is the loss, the second the regularizer).
      - Many state-of-the-art machine learning methods can be expressed within this framework:
        - Linear Regression: squared loss, squared regularizer.
        - Support Vector Machine: hinge loss, squared regularizer.
        - Lasso: squared loss, L1 regularizer.
      - Inference: minimize the cost function E(w), yielding a point estimate for w.
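Note: to show how this unifying view translates into code, here is a small sketch (my own construction, not from the lecture; SciPy, the absolute-value loss and the toy data are assumptions) in which the loss on the residual and the regularizer are interchangeable functions and a generic optimizer yields the point estimate. The SVM's hinge loss acts on the classification margin rather than on a residual, so it is not covered by this residual-based toy version.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    N, D = 80, 3
    X = rng.normal(size=(N, D))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)
    y[::10] += 5.0                        # a few outliers, to distinguish the two losses

    # Interchangeable building blocks: a loss on the residual and a regularizer on w.
    squared_loss  = lambda r: r ** 2
    absolute_loss = lambda r: np.abs(r)   # more robust to outliers
    l2_reg = lambda w: np.sum(w ** 2)
    l1_reg = lambda w: np.sum(np.abs(w))

    def fit(loss, reg, lam=1.0):
        """Minimize E(w) = 1/2 * sum_n loss(y_n - w . x_n) + lambda/2 * reg(w)."""
        E = lambda w: 0.5 * np.sum(loss(y - X @ w)) + 0.5 * lam * reg(w)
        return minimize(E, x0=np.zeros(D), method="Powell").x

    print("squared loss + L2 regularizer :", np.round(fit(squared_loss, l2_reg), 2))
    print("absolute loss + L1 regularizer:", np.round(fit(absolute_loss, l1_reg), 2))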
