Lecture 4: − Linear Regression (cont'd.) − Optimization − Generalization − Model complexity − Regularization. Aykut Erdem, Hacettepe University, October 2017
Administrative • Assignment 1 is out! • It is due October 20 (i.e. in two weeks). • It includes − Pencil-and-paper derivations − Implementing kNN classifier − numpy/Python code
Classifying Bird Species (example image: Hooded Oriole, Icterus cucullatus) • Caltech-UCSD Birds 200 dataset (200 bird species) − 5033 train, 1000 test images • You may want to split the training set into train and validation (more on this next week) • Do not use test data for training or parameter tuning • Features: − Attributes − Color histogram − HOG features − Deep CNN features • Report performance on test data (adapted from Sanja Fidler)
Recall from last time… Nearest-Neighbor and Kernel Regression • Training data: N input-target pairs (x^(1), t^(1)), …, (x^(N), t^(N)), with targets t^(n) ∈ ℜ • 1-NN for regression: predict the target of the closest training point to the query • Weighted k-NN for regression: average the targets of the k closest points, weighted by a kernel • Distance metric: D = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p} • Kernel width: w_i = exp( −d(x_i, query)² / σ² ) (adapted from Sanja Fidler)
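As a refresher, here is a minimal numpy sketch of weighted k-NN regression using the distance metric (with p = 2, i.e. Euclidean) and the Gaussian kernel weights above; the toy training data, the query, k, and the kernel width σ are all made-up illustration values:

```python
import numpy as np

# A minimal sketch of weighted k-NN regression; the training data, query,
# k, and kernel width sigma below are made up for illustration.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
t_train = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

def weighted_knn_predict(query, k=3, sigma=1.0, p=2):
    # Minkowski distance D = (sum_i |x_i - y_i|^p)^(1/p) to every training point
    d = np.sum(np.abs(X_train - query)**p, axis=1)**(1.0 / p)
    nearest = np.argsort(d)[:k]                       # indices of the k closest points
    w = np.exp(-d[nearest]**2 / sigma**2)             # w_i = exp(-d^2 / sigma^2)
    return np.sum(w * t_train[nearest]) / np.sum(w)   # kernel-weighted average of targets

print(weighted_knn_predict(np.array([2.5])))
```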
Recall from last time… Least-Squares Regression: y(x) = function(x, w) (slide by Sanja Fidler)
Recall from last time… Least-Squares Regression • Define a model. Linear: y(x) = function(x, w) • Standard loss/cost/objective function measures the squared error between y and the true value t • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically? (slide by Sanja Fidler)
Recall from last time… Least-Squares Regression • Define a model. Linear: y(x) = w0 + w1 x • Standard loss/cost/objective function measures the squared error between y and the true value t: ℓ(w) = Σ_{n=1}^{N} [t^(n) − y(x^(n))]² • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically? (slide by Sanja Fidler)
Recall from last time… Least-Squares Regression • Define a model. Linear: y(x) = w0 + w1 x • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]² • The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of the green vertical lines) (slide by Sanja Fidler)
Recall from last time… Least-Squares Regression • Define a model. Linear: y(x) = w0 + w1 x • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]² • How do we obtain the weights w = (w0, w1)? (slide by Sanja Fidler)
Recall from last time… Least-Squares Regression • Define a model. Linear: y(x) = w0 + w1 x • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]² • How do we obtain the weights w = (w0, w1)? Find the w that minimizes the loss ℓ(w) (slide by Sanja Fidler)
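To make the objective concrete, here is a minimal numpy sketch of evaluating this loss for one candidate hypothesis; the toy data (x, t) and the candidate weights are made-up assumptions for illustration:

```python
import numpy as np

# A minimal sketch of evaluating the least-squares loss
# l(w) = sum_n [t^(n) - (w0 + w1 x^(n))]^2 for one candidate w.
# The toy data and the candidate weights are made up.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

w0, w1 = 0.5, 1.0                       # one particular hypothesis
loss = 0.0
for x_n, t_n in zip(x, t):
    loss += (t_n - (w0 + w1 * x_n))**2  # squared vertical error for example n
print(loss)
```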
Optimizing the Objective • One straightforward method: gradient descent − initialize w (e.g., randomly) − repeatedly update w based on the gradient: w ← w − λ ∂ℓ/∂w • λ is the learning rate • For a single training case, this gives the LMS update rule: w ← w + 2λ (t^(n) − y(x^(n))) x^(n), where (t^(n) − y(x^(n))) is the error • Note: as the error approaches zero, so does the update (w stops changing) (slide by Sanja Fidler)
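A minimal numpy sketch of this update, applied in batch over a made-up 1-D dataset; the data, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

# A minimal sketch of batch gradient descent for 1-D linear least squares.
# The data (x, t) and the settings (lr, number of iterations) are made up.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

w0, w1 = 0.0, 0.0          # initialize weights (e.g., at zero or randomly)
lr = 0.01                  # learning rate (lambda in the slides)

for _ in range(1000):
    y = w0 + w1 * x        # current predictions
    err = t - y            # per-example error t^(n) - y(x^(n))
    # Gradient of sum_n [t^(n) - (w0 + w1 x^(n))]^2 w.r.t. w0 and w1
    grad_w0 = -2.0 * np.sum(err)
    grad_w1 = -2.0 * np.sum(err * x)
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1

print(w0, w1)              # should approach the least-squares fit
```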
Optimizing the Objective [two figure-only slides; slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson]
Effect of learning rate λ [figure: ℓ(w) versus w0 for a large and a small learning rate] • Large λ => fast convergence, but larger residual error; also possible oscillations • Small λ => slow convergence, but small residual error (slide by Erik Sudderth)
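A toy numpy sketch of this trade-off on a made-up 1-D quadratic loss ℓ(w) = w² (not the regression loss itself); the step sizes are arbitrary assumptions:

```python
import numpy as np

# A toy illustration of the learning-rate trade-off on l(w) = w^2:
# a large step size overshoots and oscillates, a small one converges slowly.
def run_gd(lr, n_steps=20, w_init=5.0):
    w = w_init
    trajectory = [w]
    for _ in range(n_steps):
        grad = 2.0 * w          # gradient of l(w) = w^2
        w = w - lr * grad
        trajectory.append(w)
    return trajectory

print(run_gd(lr=0.9)[:6])   # large lr: iterates jump back and forth around 0
print(run_gd(lr=0.05)[:6])  # small lr: steady but slow approach to 0
```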
Optimizing Across the Training Set • Two ways to generalize this for all examples in the training set: 1. Batch updates: sum or average the updates across every example n, then change the parameter values: w ← w + 2λ Σ_n (t^(n) − y(x^(n))) x^(n) 2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients. Algorithm 1 Stochastic gradient descent: 1: Randomly shuffle examples in the training set 2: for i = 1 to N do 3: Update: w ← w + 2λ (t^(i) − y(x^(i))) x^(i) (update for a linear model) 4: end for (slide by Sanja Fidler)
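A minimal numpy sketch of Algorithm 1 for the linear model y(x) = w0 + w1 x; the toy data, learning rate, and number of epochs are made-up assumptions:

```python
import numpy as np

# A minimal sketch of stochastic gradient descent (Algorithm 1) for a
# linear model. The toy data, learning rate, and epoch count are made up.
rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

w = np.zeros(2)                      # w = (w0, w1)
lr = 0.02

for epoch in range(100):
    order = rng.permutation(len(x))  # 1: randomly shuffle the training set
    for i in order:                  # 2: for each training case in turn
        x_i = np.array([1.0, x[i]])  # prepend 1 so w0 acts as the bias
        err = t[i] - w @ x_i         # t^(i) - y(x^(i))
        w = w + 2 * lr * err * x_i   # 3: w <- w + 2*lambda*err*x^(i)

print(w)
```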
Optimizing Across the Training Set • Two ways to generalize this for all examples in the training set: 1. Batch updates: sum or average the updates across every example n, then change the parameter values: w ← w + 2λ Σ_n (t^(n) − y(x^(n))) x^(n) 2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients • Underlying assumption: the sample is independent and identically distributed (i.i.d.) (slide by Sanja Fidler)
Analytical Solution • For some objectives we can also find the optimal solution analytically • This is the case for linear least-squares regression • How? (slide by Sanja Fidler)
Vectorization • Consider our model: y(x) = w0 + w1 x • Let w = [w0, w1]^T and x^T = [1, x] • We can write the model in vectorized form as y(x) = w^T x
Vectorization • Consider our model with N instances: t = [t^(1), t^(2), …, t^(N)]^T ∈ ℜ^{N×1}, X = [[1, x^(1)]; [1, x^(2)]; …; [1, x^(N)]] ∈ ℜ^{N×2}, w = [w0, w1]^T ∈ ℜ^{2×1} • Then: ℓ(w) = Σ_{n=1}^{N} [w^T x^(n) − t^(n)]² = (Xw − t)^T (Xw − t), where (Xw − t)^T ∈ ℜ^{1×N} and (Xw − t) ∈ ℜ^{N×1} (slide by Sanja Fidler)
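A minimal numpy sketch of building X and t and evaluating ℓ(w) = (Xw − t)^T (Xw − t); the toy data and the candidate w are made up:

```python
import numpy as np

# A minimal sketch of the vectorized loss l(w) = (Xw - t)^T (Xw - t).
# X has a column of ones so that w0 acts as the bias; data and w are made up.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

X = np.column_stack([np.ones_like(x), x])   # shape (N, 2)
w = np.array([0.5, 1.0])                    # a candidate (w0, w1)

residual = X @ w - t                        # Xw - t, shape (N,)
loss = residual @ residual                  # (Xw - t)^T (Xw - t), a scalar
print(loss)
```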
Analytical Solution • Instead of using GD, solve for the optimal w analytically − Notice the solution is where ∂ℓ/∂w = 0 • Derivation: ℓ(w) = (Xw − t)^T (Xw − t) = w^T X^T X w − t^T X w − w^T X^T t + t^T t = w^T X^T X w − 2 w^T X^T t + t^T t (a 1×1 scalar) − Take the derivative, set it equal to 0, then solve for w: ∂/∂w [w^T X^T X w − 2 w^T X^T t + t^T t] = 0 ⇒ (X^T X) w − X^T t = 0 ⇒ (X^T X) w = X^T t • Closed-form solution: w = (X^T X)^{-1} X^T t • If X^T X is not invertible (i.e., singular), may need to: − Use the pseudo-inverse instead of the inverse (in Python, numpy.linalg.pinv(a)) − Remove redundant (not linearly independent) features − Remove extra features to ensure that d ≤ N
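A minimal numpy sketch of the closed-form solution on made-up data, together with the pseudo-inverse variant mentioned above:

```python
import numpy as np

# A minimal sketch of the closed-form least-squares solution
# w = (X^T X)^{-1} X^T t, on made-up data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X^T X) w = X^T t directly
w = np.linalg.solve(X.T @ X, X.T @ t)

# Equivalent, but robust when X^T X is singular: use the pseudo-inverse
w_pinv = np.linalg.pinv(X) @ t

print(w, w_pinv)
```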
Multi-dimensional Inputs • One method of extending the model is to consider other input dimensions: y(x) = w0 + w1 x1 + w2 x2 • In the Boston housing example, we can look at the number of rooms (slide by Sanja Fidler)
Linear Regression with Multi-dimensional Inputs • Imagine now we want to predict the median house price from these multi-dimensional observations • Each house is a data point n, with observations indexed by j: x^(n) = (x_1^(n), …, x_j^(n), …, x_d^(n)) • We can incorporate the bias w0 into w by using x_0 = 1; then y(x) = w0 + Σ_{j=1}^{d} w_j x_j = w^T x • We can then solve for w = (w0, w1, …, w_d). How? • We can use gradient descent to solve for each coefficient, or compute w analytically (how does the solution change?) Recall: w = (X^T X)^{-1} X^T t (slide by Sanja Fidler)
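A minimal numpy sketch suggesting the answer: the closed-form solution is unchanged, only X grows to shape N × (d + 1). The random features and the "true" weights below are made-up stand-ins for house attributes:

```python
import numpy as np

# A minimal sketch of the d-dimensional case. The feature matrix and the
# generating weights are made up; x_0 = 1 absorbs the bias as on the slide.
rng = np.random.default_rng(0)
N, d = 100, 3
features = rng.normal(size=(N, d))                  # x_1 ... x_d per house
true_w = np.array([2.0, 0.5, -1.0, 3.0])            # made-up (w0, w1, w2, w3)
t = np.column_stack([np.ones(N), features]) @ true_w + 0.1 * rng.normal(size=N)

X = np.column_stack([np.ones(N), features])         # shape (N, d+1)
w = np.linalg.pinv(X) @ t                           # same formula as in 1-D
print(w)                                            # should be close to true_w
```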
More Powerful Models? • What if our linear model is not good? How can we create a more complicated model? (slide by Sanja Fidler)
Fitting a Polynomial • What if our linear model is not good? How can we create a more complicated model? • We can create a more complicated model by defining input variables that are combinations of components of x • Example: an M-th order polynomial function of a one-dimensional feature x: y(x, w) = w0 + Σ_{j=1}^{M} w_j x^j, where x^j is the j-th power of x • We can use the same approach to optimize for the weights w • How do we do that? (slide by Sanja Fidler)
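One way to do it, as a minimal numpy sketch: treat the powers x^0, …, x^M as the columns of a design matrix and reuse the same least-squares machinery. The data, noise, and the choice M = 3 are made-up assumptions:

```python
import numpy as np

# A minimal sketch of fitting an M-th order polynomial by least squares.
# The synthetic data and the order M are made up for illustration.
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=x.size)

M = 3
Phi = np.column_stack([x**j for j in range(M + 1)])  # columns x^0, ..., x^M
w = np.linalg.pinv(Phi) @ t                          # least-squares weights

y = Phi @ w                                          # fitted values
print(w)
```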
Some types of basis functions in 1-D: Sigmoids, Gaussians, Polynomials • Sigmoid: φ_j(x) = σ( (x − µ_j) / s ), where σ(a) = 1 / (1 + exp(−a)) • Gaussian: φ_j(x) = exp( −(x − µ_j)² / (2 s²) ) (slide by Erik Sudderth)
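A minimal numpy sketch of these three basis-function families; the centers µ_j and width s are arbitrary illustration values:

```python
import numpy as np

# Three 1-D basis-function families; mu_j and s below are arbitrary choices.
def poly_basis(x, j):
    return x**j                                  # polynomial: x^j

def gaussian_basis(x, mu_j, s):
    return np.exp(-(x - mu_j)**2 / (2 * s**2))   # Gaussian bump centered at mu_j

def sigmoid_basis(x, mu_j, s):
    a = (x - mu_j) / s
    return 1.0 / (1.0 + np.exp(-a))              # sigma((x - mu_j) / s)

x = np.linspace(-1.0, 1.0, 5)
print(poly_basis(x, 2))
print(gaussian_basis(x, mu_j=0.0, s=0.3))
print(sigmoid_basis(x, mu_j=0.0, s=0.1))
```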
Two types of linear model that are equivalent with respect to learning: y(x, w) = w0 + w1 x1 + w2 x2 + … = w^T x and y(x, w) = w0 + w1 φ1(x) + w2 φ2(x) + … = w^T Φ(x), where w0 is the bias • The first model has the same number of adaptive coefficients as the dimensionality of the data + 1 • The second model has the same number of adaptive coefficients as the number of basis functions + 1 • Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick) (slide by Erik Sudderth)