Lecture 5:
− Linear regression (cont'd.)
− Regularization
− ML Methodology
− Learning theory
Aykut Erdem, October 2017, Hacettepe University
About class projects
• This semester the theme is machine learning and the city.
• To be done in groups of 3 people.
• Deliverables: proposal, blog posts, progress report, project presentations (classroom + video (new)), final report and code.
• For more details please check the project webpage: http://web.cs.hacettepe.edu.tr/~aykut/classes/fall2017/bbm406/project.html
Recall from last time… Linear Regression
• Model: y(x) = w_0 + w_1 x, with parameters w = (w_0, w_1)
• Loss: \ell(w) = \sum_{n=1}^{N} \left[ t^{(n)} - (w_0 + w_1 x^{(n)}) \right]^2
• Gradient Descent Update Rule: w \leftarrow w + 2\lambda \sum_{n=1}^{N} \left( t^{(n)} - y(x^{(n)}) \right) x^{(n)}
• Closed Form Solution: w = (X^T X)^{-1} X^T t
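As a refresher, a minimal numpy sketch of both fitting procedures for the 1-D model above is given here; the toy data, learning rate and iteration count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def fit_closed_form(x, t):
    """Closed-form least squares: w = (X^T X)^{-1} X^T t, with a bias column of ones."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
    return np.linalg.solve(X.T @ X, X.T @ t)       # returns (w0, w1)

def fit_gradient_descent(x, t, lr=1e-3, n_iters=5000):
    """Batch gradient descent on l(w) = sum_n (t^(n) - (w0 + w1 x^(n)))^2."""
    w0, w1 = 0.0, 0.0
    for _ in range(n_iters):
        err = t - (w0 + w1 * x)                    # residuals t^(n) - y(x^(n))
        w0 += lr * 2.0 * err.sum()                 # w0 <- w0 + 2*lr * sum(err)
        w1 += lr * 2.0 * (err * x).sum()           # w1 <- w1 + 2*lr * sum(err * x)
    return np.array([w0, w1])

# toy data (assumed, for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
t = 1.5 + 2.0 * x + 0.1 * rng.normal(size=50)
print(fit_closed_form(x, t), fit_gradient_descent(x, t))
```

Both routines should agree closely on this toy data; the closed form is exact, while gradient descent approaches it iteratively.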
Today
• Linear regression (cont'd.)
• Regularization
• Machine Learning Methodology
  - validation
  - cross-validation (k-fold, leave-one-out)
  - model selection
• Learning theory
Multi-dimensional Inputs
• One method of extending the model is to consider other input dimensions:
  y(x) = w_0 + w_1 x_1 + w_2 x_2
• In the Boston housing example, we can look at the number of rooms.
(slide by Sanja Fidler)
Linear Regression with Multi-dimensional Inputs
• Imagine now we want to predict the median house price from these multi-dimensional observations.
• Each house is a data point n, with observations indexed by j:
  x^{(n)} = ( x_1^{(n)}, \ldots, x_j^{(n)}, \ldots, x_d^{(n)} )
• We can incorporate the bias w_0 into w by using x_0 = 1; then
  y(x) = w_0 + \sum_{j=1}^{d} w_j x_j = w^T x
• We can then solve for w = (w_0, w_1, …, w_d). How?
• We can use gradient descent to solve for each coefficient, or compute w analytically (how does the solution change?)
(slide by Sanja Fidler)
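A minimal sketch of the bias fold-in described above, assuming a data matrix X of shape (N, d) and a target vector t; the function name and the use of np.linalg.lstsq are illustrative choices, not part of the lecture.

```python
import numpy as np

def fit_multivariate(X, t):
    """Least squares with the bias w0 folded into w via a constant feature x_0 = 1."""
    N = X.shape[0]
    X_aug = np.hstack([np.ones((N, 1)), X])        # prepend x_0 = 1 to every row
    w, *_ = np.linalg.lstsq(X_aug, t, rcond=None)  # solves min ||X_aug w - t||^2
    return w                                       # w = (w_0, w_1, ..., w_d)
```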
More Powerful Models?
• What if our linear model is not good? How can we create a more complicated model?
(slide by Sanja Fidler)
Fitting a Polynomial
• What if our linear model is not good? How can we create a more complicated model?
• We can create a more complicated model by defining input variables that are combinations of components of x.
• Example: an M-th order polynomial function of a one-dimensional feature x:
  y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
  where x^j is the j-th power of x.
• We can use the same approach to optimize for the weights w.
• How do we do that?
(slide by Sanja Fidler)
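One way to make this concrete is to build the polynomial features explicitly and reuse the linear least-squares machinery; this is a sketch under the assumption of 1-D inputs x and targets t, with M the polynomial order.

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Columns [1, x, x^2, ..., x^M]: polynomial features of a 1-D input."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Fit y(x, w) = w_0 + sum_j w_j x^j by ordinary least squares on the expanded features."""
    Phi = polynomial_design_matrix(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w
```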
Some types of basis functions in 1-D
• Polynomials: \phi_j(x) = x^j
• Gaussians: \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
• Sigmoids: \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), where \sigma(a) = \frac{1}{1 + \exp(-a)}
(slide by Erik Sudderth)
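For reference, a small sketch of these three basis-function families; the centre mu_j and width s are assumed to be chosen by the user.

```python
import numpy as np

def polynomial_basis(x, j):
    """phi_j(x) = x^j"""
    return x ** j

def gaussian_basis(x, mu_j, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))"""
    return np.exp(-((x - mu_j) ** 2) / (2.0 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))"""
    a = (x - mu_j) / s
    return 1.0 / (1.0 + np.exp(-a))
```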
Two types of linear model that are equivalent with respect to learning
  y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + \ldots = w^T x
  y(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots = w^T \Phi(x)
• The first model has the same number of adaptive coefficients as the dimensionality of the data + 1.
• The second model has the same number of adaptive coefficients as the number of basis functions + 1.
• Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick).
(slide by Erik Sudderth)
General linear regression problem
• Using our new notation, basis-function linear regression can be written as
  y = \sum_{j=0}^{k} w_j \phi_j(x)
• where \phi_j(x) can be either x_j for multivariate regression, or one of the nonlinear basis functions we defined.
• Once again we can use "least squares" to find the optimal solution.
(slide by E. P. Xing)
LMS for the general linear regression problem
• Our goal is to minimize the following loss function:
  J(w) = \sum_i \Big( y_i - \sum_j w_j \phi_j(x_i) \Big)^2
  where w is a vector of dimension k+1, \phi(x_i) is a vector of dimension k+1, and y_i is a scalar.
• Moving to vector notation we get:
  J(w) = \sum_i \big( y_i - w^T \phi(x_i) \big)^2
• We take the derivative w.r.t. w:
  \frac{\partial}{\partial w} \sum_i \big( y_i - w^T \phi(x_i) \big)^2 = -2 \sum_i \big( y_i - w^T \phi(x_i) \big) \phi(x_i)^T
• Equating to 0 we get:
  w^T \Big( \sum_i \phi(x_i) \phi(x_i)^T \Big) = \sum_i y_i \phi(x_i)^T
(slide by E. P. Xing)
LMS for the general linear regression problem (cont'd.)
• J(w) = \sum_i \big( y_i - w^T \phi(x_i) \big)^2
• Taking the derivative w.r.t. w and equating it to 0 gives
  w^T \Big( \sum_i \phi(x_i) \phi(x_i)^T \Big) = \sum_i y_i \phi(x_i)^T
• Define the design matrix
  \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_m(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_m(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_m(x_n) \end{pmatrix}
• Then, solving for w, we get:
  w = (\Phi^T \Phi)^{-1} \Phi^T y
(slide by E. P. Xing)
LMS for the general linear regression problem (cont'd.)
• J(w) = \sum_i \big( y_i - w^T \phi(x_i) \big)^2
• Solving for w we get:
  w = (\Phi^T \Phi)^{-1} \Phi^T y
  where y is an n-entry vector, w is a (k+1)-entry vector, and \Phi is an n-by-(k+1) matrix.
• This solution is also known as the "pseudo-inverse" solution.
(slide by E. P. Xing)
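The closed form above translates directly into code. This is a sketch assuming 1-D inputs and a user-supplied list of basis functions; using np.linalg.pinv (or lstsq) is numerically safer than forming (Phi^T Phi)^{-1} explicitly.

```python
import numpy as np

def design_matrix(x, basis_fns):
    """Phi[i, j] = phi_j(x_i) for each basis function phi_j (phi_0 is the constant 1)."""
    return np.column_stack([phi(x) for phi in basis_fns])

def fit_general_linear(x, t, basis_fns):
    """w = (Phi^T Phi)^{-1} Phi^T t, computed via the pseudo-inverse."""
    Phi = design_matrix(x, basis_fns)
    return np.linalg.pinv(Phi) @ t

# example: cubic polynomial basis (an assumed choice, for illustration)
basis = [lambda x, j=j: x ** j for j in range(4)]   # phi_j(x) = x^j, j = 0..3
```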
0th order polynomial fit (figure)
1st order polynomial fit (figure)
3rd order polynomial fit (figure)
9th order polynomial fit (figure)
(slides by Erik Sudderth)
Which Fit is Best? (figure from Bishop)
(slide by Sanja Fidler)
Root Mean Square (RMS) Error
• E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
• E_{RMS} = \sqrt{ 2 E(w^\star) / N }
• The division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.
• (Figure: polynomial fits for M = 0, 1, 3, 9.)
(slide by Erik Sudderth)
Root Mean Square (RMS) Error
• E(w) = \frac{1}{2} \sum_{n=1}^{N} \big( t_n - \phi(x_n)^T w \big)^2 = \frac{1}{2} \| t - \Phi w \|_2^2
• (Figure: training and test E_RMS as a function of the polynomial order M.)
(slide by Erik Sudderth)
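A small sketch of how E_RMS could be computed for a fitted weight vector; Phi and t are assumed to be the design matrix and targets of whichever split (training or test) is being evaluated.

```python
import numpy as np

def sum_of_squares_error(Phi, t, w):
    """E(w) = 1/2 * ||t - Phi w||^2"""
    r = t - Phi @ w
    return 0.5 * r @ r

def rms_error(Phi, t, w):
    """E_RMS = sqrt(2 E(w) / N): comparable across data sets of different size."""
    N = len(t)
    return np.sqrt(2.0 * sum_of_squares_error(Phi, t, w) / N)
```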
Generalization
• Generalization = the model's ability to predict held-out data.
• What is happening?
• Our model with M = 9 overfits the data (it also models the noise).
• (Figure: training and test E_RMS vs. M.)
(slide by Sanja Fidler)
Generalization
• Generalization = the model's ability to predict held-out data.
• What is happening?
• Our model with M = 9 overfits the data (it also models the noise).
• This is not a problem if we have lots of training examples.
(slide by Sanja Fidler)
Generalization
• Generalization = the model's ability to predict held-out data.
• What is happening?
• Our model with M = 9 overfits the data (it also models the noise).
• Let's look at the estimated weights for various M in the case of fewer examples.
(slide by Sanja Fidler)
Generalization
• Generalization = the model's ability to predict held-out data.
• What is happening?
• Our model with M = 9 overfits the data (it also models the noise).
• Let's look at the estimated weights for various M in the case of fewer examples.
• The weights become huge to compensate for the noise.
• One way of dealing with this is to encourage the weights to be small (this way no input dimension will have too much influence on the prediction). This is called regularization.
(slide by Sanja Fidler)
1-D regression illustrates key concepts
• Data fits – is a linear model best (model selection)?
  − The simplest models do not capture all the important variations (signal) in the data: they underfit.
  − A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model.
• One method of assessing fit:
  − test generalization = the model's ability to predict held-out data.
• Optimization is essential: stochastic and batch iterative approaches; analytic when available.
• (Figure: training and test E_RMS vs. M.)
(slide by Richard Zemel)
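To make held-out evaluation concrete, here is a sketch that compares polynomial orders on a random train/test split; the split fraction, candidate orders and random seed are illustrative assumptions.

```python
import numpy as np

def compare_polynomial_orders(x, t, orders=(0, 1, 3, 9), train_frac=0.7, seed=0):
    """Fit each polynomial order on a training split and report train/test RMS errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train = int(train_frac * len(x))
    tr, te = idx[:n_train], idx[n_train:]
    results = {}
    for M in orders:
        Phi_tr = np.vander(x[tr], M + 1, increasing=True)   # [1, x, ..., x^M]
        Phi_te = np.vander(x[te], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[tr], rcond=None)  # least-squares fit on train split
        tr_rms = np.sqrt(np.mean((t[tr] - Phi_tr @ w) ** 2))
        te_rms = np.sqrt(np.mean((t[te] - Phi_te @ w) ** 2))
        results[M] = (tr_rms, te_rms)
    return results   # {M: (train RMS, test RMS)}; a large gap signals overfitting
```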
Regularized Least Squares
• A technique to control the overfitting phenomenon.
• Add a penalty term to the error function in order to discourage the coefficients from reaching large values.
• Ridge regression:
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \| w \|^2
  where \| w \|^2 \equiv w^T w = w_0^2 + w_1^2 + \ldots + w_M^2, and \lambda governs the relative importance of the regularization term compared with the sum-of-squares error.
• This regularized error is minimized by w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t.
(slide by Erik Sudderth)
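A sketch of ridge regression in closed form, following the penalized error above; lam stands for the regularization strength λ and Phi for the design matrix, both assumed to be supplied by the caller. All weights are penalized here, matching the formula on the slide.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Minimize 1/2 ||t - Phi w||^2 + lam/2 ||w||^2  =>  w = (lam I + Phi^T Phi)^{-1} Phi^T t."""
    k = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(k) + Phi.T @ Phi, Phi.T @ t)
```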