
Lecture 5: Linear Regression (cont'd.), Regularization, ML Methodology, Learning Theory



  1. Lecture 5: − Linear regression (cont’d.) − Regularization − ML Methodology − Learning theory Aykut Erdem October 2017 Hacettepe University

  2. About class projects • This semester the theme is machine learning and the city. • To be done in groups of 3 people. • Deliverables: Proposal, blog posts, progress report, project presentations (classroom + video (new) presentations), final report and code • For more details please check the project webpage: 
 http://web.cs.hacettepe.edu.tr/~aykut/classes/fall2017/bbm406/project.html

  3. Recall from last time… Linear Regression
 y(x) = w_0 + w_1 x,   w = (w_0, w_1)
 Loss: ℓ(w) = Σ_{n=1}^{N} [ t^(n) − (w_0 + w_1 x^(n)) ]^2
 Gradient Descent Update Rule: w ← w + 2λ Σ_{n=1}^{N} ( t^(n) − y(x^(n)) ) x^(n)
 Closed Form Solution: w = (X^T X)^{−1} X^T t
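A minimal NumPy sketch of both recipes, assuming a synthetic 1-D data set (the data, step size, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

# Hypothetical toy data: N noisy points around a line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
t = 1.5 + 2.0 * x + 0.1 * rng.standard_normal(50)

# Gradient descent on l(w) = sum_n (t^(n) - (w_0 + w_1 x^(n)))^2.
w0, w1, lam = 0.0, 0.0, 0.05        # lam plays the role of the step size lambda
for _ in range(2000):
    y = w0 + w1 * x
    w0 += 2 * lam * np.mean(t - y)          # averaged over N instead of summed, for a stable step
    w1 += 2 * lam * np.mean((t - y) * x)

# Closed form: w = (X^T X)^{-1} X^T t, with X holding a column of ones and the inputs x.
X = np.column_stack([np.ones_like(x), x])
w_closed = np.linalg.solve(X.T @ X, X.T @ t)
print(w0, w1, w_closed)   # the two estimates should roughly agree
```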

  4. 
 Today • Linear regression (cont’d.) • Regularization • Machine Learning Methodology - validation - cross-validation (k-fold, leave-one-out) - model selection 
 • Learning theory 

  5. 
 
  Multi-dimensional Inputs
 • One method of extending the model is to consider other input dimensions:
 y(x) = w_0 + w_1 x_1 + w_2 x_2
 • In the Boston housing example, we can look at the number of rooms. (slide by Sanja Fidler)

  6. Linear Regression with Multi-dimensional Inputs
 • Imagine now we want to predict the median house price from these multi-dimensional observations
 • Each house is a data point n, with observations indexed by j: x^(n) = ( x_1^(n), …, x_j^(n), …, x_d^(n) )
 • We can incorporate the bias w_0 into w by using x_0 = 1; then y(x) = w_0 + Σ_{j=1}^{d} w_j x_j = w^T x
 • We can then solve for w = (w_0, w_1, …, w_d). How?
 • We can use gradient descent to solve for each coefficient, or compute w analytically (how does the solution change?) (slide by Sanja Fidler)
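A small sketch of the bias trick on synthetic stand-in data (not the Boston housing set); the feature values and true weights below are arbitrary:

```python
import numpy as np

# Hypothetical d-dimensional inputs: N houses, d features each.
rng = np.random.default_rng(1)
N, d = 100, 3
X = rng.standard_normal((N, d))
t = 4.0 + X @ np.array([3.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(N)

# Bias trick: prepend x_0 = 1 so that y(x) = w^T x with w = (w_0, w_1, ..., w_d).
X_aug = np.column_stack([np.ones(N), X])

# Same closed-form solution as in the 1-D case, now with d+1 weights.
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ t)
print(w)   # w[0] ≈ 4.0 (bias), w[1:] ≈ (3.0, -1.0, 0.5)
```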

  7. More Powerful Models?
 • What if our linear model is not good? How can we create a more complicated model? (slide by Sanja Fidler)

  8. 
 
 
 
  Fitting a Polynomial
 • What if our linear model is not good? How can we create a more complicated model?
 • We can create a more complicated model by defining input variables that are combinations of components of x
 • Example: an M-th order polynomial function of a one-dimensional feature x:
 y(x, w) = w_0 + Σ_{j=1}^{M} w_j x^j
 where x^j is the j-th power of x
 • We can use the same approach to optimize for the weights w
 • How do we do that? (slide by Sanja Fidler)
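A sketch of fitting such a polynomial by expanding x into its powers and reusing ordinary least squares; the noisy-sine data and the name poly_design are illustrative choices, not from the slide:

```python
import numpy as np

def poly_design(x, M):
    """Design matrix with columns x^0, x^1, ..., x^M (x^0 = 1 handles w_0)."""
    return np.column_stack([x ** j for j in range(M + 1)])

# Hypothetical noisy data from a smooth curve.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)

M = 3
w = np.linalg.lstsq(poly_design(x, M), t, rcond=None)[0]   # same least-squares machinery as before
print(w)   # w[0] = w_0, w[j] multiplies x^j
```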

  9. Some types of basis functions in 1-D
 Polynomials: φ_j(x) = x^j
 Gaussians: φ_j(x) = exp( −(x − μ_j)^2 / (2 s^2) )
 Sigmoids: φ_j(x) = σ( (x − μ_j) / s ), where σ(a) = 1 / (1 + exp(−a))
 (slide by Erik Sudderth)
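A minimal sketch of the three basis families; the centers μ_j and scale s are design choices the modeller picks, not values given in the lecture:

```python
import numpy as np

def poly_basis(x, j):
    # j-th power of the scalar input
    return x ** j

def gaussian_basis(x, mu_j, s):
    # bump centered at mu_j with width s
    return np.exp(-((x - mu_j) ** 2) / (2 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    # smooth step located at mu_j with slope controlled by s
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    return sigma((x - mu_j) / s)
```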

  10. Two types of linear model that are equivalent with respect to learning
 y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + … = w^T x    (w_0 is the bias)
 y(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … = w^T Φ(x)
 • The first model has the same number of adaptive coefficients as the dimensionality of the data + 1.
 • The second model has the same number of adaptive coefficients as the number of basis functions + 1.
 • Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick). (slide by Erik Sudderth)
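A sketch of that equivalence, assuming hypothetical Gaussian basis functions with arbitrarily chosen centers; once the design matrix is built, the fitting problem is unchanged:

```python
import numpy as np

def design_matrix(x, basis_fns):
    """Columns: phi_0(x) = 1, then phi_1(x), ..., phi_m(x)."""
    return np.column_stack([np.ones_like(x)] + [phi(x) for phi in basis_fns])

# Example: three Gaussian bumps as basis functions (centers and width chosen arbitrarily).
bases = [lambda x, m=m: np.exp(-((x - m) ** 2) / (2 * 0.2 ** 2)) for m in (0.25, 0.5, 0.75)]
x = np.linspace(0, 1, 20)
Phi = design_matrix(x, bases)
print(Phi.shape)   # (20, 4): one column per adaptive coefficient, including the bias
```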

  11. 
 
 
 
  General linear regression problem
 • Using our new notation, linear regression with basis functions can be written as
 y = Σ_{j=0}^{n} w_j φ_j(x)
 • where φ_j(x) can be either x_j for multivariate regression, or one of the nonlinear bases we defined
 • Once again we can use "least squares" to find the optimal solution. (slide by E. P. Xing)

  12. LMS for the general linear regression problem
 • Our goal is to minimize the following loss function:
 J(w) = Σ_i ( y_i − Σ_j w_j φ_j(x_i) )^2
 where w is a vector of dimension k+1, φ(x_i) is a vector of dimension k+1, and y_i is a scalar
 • Moving to vector notation we get:
 J(w) = Σ_i ( y_i − w^T φ(x_i) )^2
 • We take the derivative w.r.t. w:
 ∂/∂w Σ_i ( y_i − w^T φ(x_i) )^2 = −2 Σ_i ( y_i − w^T φ(x_i) ) φ(x_i)^T
 • Equating to 0 we get:
 −2 Σ_i ( y_i − w^T φ(x_i) ) φ(x_i)^T = 0   ⟹   w^T Σ_i φ(x_i) φ(x_i)^T = Σ_i y_i φ(x_i)^T
 (slide by E. P. Xing)
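A small sketch of that derivative in matrix form, dJ/dw = −2 Φ^T (y − Φw), checked against a finite difference on random placeholder data (the shapes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.standard_normal((20, 4))      # rows are phi(x_i)^T, so k + 1 = 4
y = rng.standard_normal(20)
w = rng.standard_normal(4)

J = lambda w: np.sum((y - Phi @ w) ** 2)   # J(w) = sum_i (y_i - w^T phi(x_i))^2
grad = -2 * Phi.T @ (y - Phi @ w)          # analytic gradient

# Finite-difference check of the first component.
eps = 1e-6
e0 = np.zeros(4); e0[0] = eps
print(grad[0], (J(w + e0) - J(w - e0)) / (2 * eps))   # the two numbers should match closely
```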

  13. LMS for the general linear regression problem
 J(w) = Σ_i ( y_i − w^T φ(x_i) )^2
 • Taking the derivative w.r.t. w and equating to 0 gives
 w^T Σ_i φ(x_i) φ(x_i)^T = Σ_i y_i φ(x_i)^T
 • Define the design matrix Φ, with one row per data point:
 Φ = [ φ_0(x_1)  φ_1(x_1)  …  φ_m(x_1)
       φ_0(x_2)  φ_1(x_2)  …  φ_m(x_2)
       …
       φ_0(x_n)  φ_1(x_n)  …  φ_m(x_n) ]
 • Then solving for w we get: w = (Φ^T Φ)^{−1} Φ^T y
 (slide by E. P. Xing)

  14. LMS for the general linear regression problem
 J(w) = Σ_i ( y_i − w^T φ(x_i) )^2
 • Solving for w we get: w = (Φ^T Φ)^{−1} Φ^T y
 where y is a vector with n entries, w is a vector with k+1 entries, and Φ is an n by k+1 matrix
 • This solution is also known as the 'pseudo-inverse'. (slide by E. P. Xing)
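A minimal sketch of the closed-form solution via the normal equations and via NumPy's pseudo-inverse, on random placeholder data (n and k below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 5
Phi = np.column_stack([np.ones(n), rng.standard_normal((n, k))])   # n by (k+1) design matrix
y = rng.standard_normal(n)

w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # (Phi^T Phi)^{-1} Phi^T y
w_pinv = np.linalg.pinv(Phi) @ y                     # pseudo-inverse, numerically safer
print(np.allclose(w_normal, w_pinv))                 # True (up to floating point)
```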

  15. 0th order polynomial (slide by Erik Sudderth)

  16. 1st order polynomial (slide by Erik Sudderth)

  17. 3rd order polynomial (slide by Erik Sudderth)

  18. 9th order polynomial (slide by Erik Sudderth)

  19. Which Fit is Best? (from Bishop; slide by Sanja Fidler)

  20. Root Mean Square (RMS) Error
 E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2
 E_RMS = sqrt( 2 E(w*) / N )
 • The division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t
 [Figure: polynomial fits of order M = 0, 1, 3, 9 on the same data set] (slide by Erik Sudderth)
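A small helper implementing that definition; Phi, w, and t are assumed to be the design matrix, fitted weights, and targets from the earlier sketches:

```python
import numpy as np

def rms_error(w, Phi, t):
    """E_RMS = sqrt(2 E(w*) / N), with E(w) = 1/2 sum_n (y(x_n, w) - t_n)^2."""
    E = 0.5 * np.sum((Phi @ w - t) ** 2)
    return np.sqrt(2 * E / len(t))
```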

  21. Root Mean Square (RMS) Error
 [Figure: training and test E_RMS as a function of the polynomial order M]
 Root-Mean-Square (RMS) Error:
 E(w) = (1/2) Σ_{n=1}^{N} ( t_n − φ(x_n)^T w )^2 = (1/2) ∥ t − Φ w ∥^2
 (slide by Erik Sudderth)
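A sketch reproducing the shape of that curve: polynomials of order M = 0..9 fitted on a small synthetic training set (a noisy sine, used here as a stand-in for the slide's data) and scored with E_RMS on a held-out test set:

```python
import numpy as np

rng = np.random.default_rng(5)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

def design(x, M):
    return np.column_stack([x ** j for j in range(M + 1)])   # includes x^0 = 1

def e_rms(w, Phi, t):
    return np.sqrt(np.sum((Phi @ w - t) ** 2) / len(t))      # sqrt(2 E / N)

x_tr, t_tr = make_data(10)    # small training set
x_te, t_te = make_data(100)   # held-out test set

for M in range(10):
    w = np.linalg.lstsq(design(x_tr, M), t_tr, rcond=None)[0]
    print(M, e_rms(w, design(x_tr, M), t_tr), e_rms(w, design(x_te, M), t_te))
# Training error keeps dropping with M; test error rises again for large M (overfitting).
```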

  22. Generalization
 • Generalization = model's ability to predict the held out data
 • What is happening?
 • Our model with M = 9 overfits the data (it also models the noise)
 [Figure: training and test E_RMS vs. M] (slide by Sanja Fidler)

  23. Generalization
 • Generalization = model's ability to predict the held out data
 • What is happening?
 • Our model with M = 9 overfits the data (it also models the noise)
 • This is not a problem if we have lots of training examples (slide by Sanja Fidler)

  24. Generalization
 • Generalization = model's ability to predict the held out data
 • What is happening?
 • Our model with M = 9 overfits the data (it also models the noise)
 • Let's look at the estimated weights for various M in the case of fewer examples (slide by Sanja Fidler)

  25. Generalization
 • Generalization = model's ability to predict the held out data
 • What is happening?
 • Our model with M = 9 overfits the data (it also models the noise)
 • Let's look at the estimated weights for various M in the case of fewer examples
 • The weights are becoming huge to compensate for the noise
 • One way of dealing with this is to encourage the weights to be small (this way no input dimension will have too much influence on prediction). This is called regularization. (slide by Sanja Fidler)

  26. 1-D regression illustrates key concepts
 • Data fits – is the linear model best (model selection)?
 − Simplest models do not capture all the important variations (signal) in the data: underfit
 − More complex models may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
 • One method of assessing fit: test generalization = model's ability to predict the held out data
 • Optimization is essential: stochastic and batch iterative approaches; analytic when available
 [Figure: training and test E_RMS vs. M] (slide by Richard Zemel)

  27. Regularized Least Squares
 • A technique to control the overfitting phenomenon
 • Add a penalty term to the error function in order to discourage the coefficients from reaching large values
 Ridge regression: E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2 + (λ/2) ∥w∥^2
 where ∥w∥^2 ≡ w^T w = w_0^2 + w_1^2 + … + w_M^2, and λ governs the relative importance of the regularization term compared with the sum-of-squares error,
 which is minimized by w = (λI + Φ^T Φ)^{−1} Φ^T t (slide by Erik Sudderth)
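A minimal sketch of that closed form, assuming Phi and t are a design matrix and target vector as in the earlier sketches; the function name ridge_fit is illustrative:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize 1/2 ||t - Phi w||^2 + lam/2 ||w||^2 via w = (lam I + Phi^T Phi)^{-1} Phi^T t."""
    k = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(k) + Phi.T @ Phi, Phi.T @ t)

# With lam = 0 this reduces to the pseudo-inverse solution; larger lam shrinks the
# weights toward zero, taming the huge coefficients seen for M = 9.
```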
