illustration: detail from xkcd strip #2048 BBM406 Fundamentals of Machine Learning Lecture 4: Linear Regression, Optimization, Generalization, Model Complexity, Regularization Aykut Erdem // Hacettepe University // Fall 2019
Recall from last time… Kernel Regression, 1-NN for Regression, Weighted K-NN for Regression. Training data (x_1, y_1), …, (x_n, y_n) with x_i ∈ R^d, y_i ∈ R, and a distance D : R^d × R^d → R. [Figure: 1-NN for Regression predicts the target of the closest training point; Weighted K-NN averages neighbours with kernel weights.] Distance metrics: D = ( Σ_{i=1}^n |x_i − y_i|^p )^{1/p}. Kernel width: w_i = exp( −d(x_i, query)^2 / σ^2 ) 2
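As a quick refresher on the recap above, here is a minimal sketch of weighted K-NN regression with the Gaussian kernel weights w_i = exp(−d(x_i, query)^2 / σ^2). The function name, toy data, and the choice to weight only the k nearest neighbours are illustrative assumptions, not from the slides.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, query, k=5, sigma=1.0):
    """Weighted K-NN regression: average the targets of the k nearest
    neighbours, weighted by a Gaussian kernel of their distance."""
    d = np.linalg.norm(X_train - query, axis=1)   # Euclidean distances to query
    nn = np.argsort(d)[:k]                        # indices of the k nearest points
    w = np.exp(-d[nn] ** 2 / sigma ** 2)          # kernel weights
    return np.sum(w * y_train[nn]) / np.sum(w)    # weighted average of targets

# toy 1-D example
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.2, 1.9, 3.1])
print(weighted_knn_predict(X, y, np.array([1.5]), k=3, sigma=0.5))
```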
Linear Regression 3
Simple 1-D Regression • Circles are data points (i.e., training examples) that are given to us • The data points are uniform in x, but may be displaced in y: t(x) = f(x) + ε, with ε some noise • In green is the "true" curve that we don't know • Goal: We want to fit a curve to these points slide by Sanja Fidler 4
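To make the setup above concrete, here is a small sketch that generates data from a hypothetical "true" curve and adds noise. The choice f(x) = sin(2πx), the noise level, and the sample size are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)            # inputs uniform in x
f = np.sin(2 * np.pi * x)               # the "true" curve (unknown in practice)
t = f + rng.normal(scale=0.2, size=N)   # targets displaced in y by noise epsilon
```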
Simple 1-D Regression • Key Questions: – How do we parametrize the model (the curve)? – What loss (objective) function should we use to judge fit? – How do we optimize fit to unseen test data (generalization)? slide by Sanja Fidler 5
Example: Boston House Prices • Estimate median house price in a neighborhood based on neighborhood statistics • Look at the first (of 13) attributes: per capita crime rate • Use this to predict house prices in other neighborhoods • Is this a good input (attribute) to predict house prices? • https://archive.ics.uci.edu/ml/datasets/Housing slide by Sanja Fidler 6
Represent the data • Data described as pairs D = {(x^(1), t^(1)), (x^(2), t^(2)), …, (x^(N), t^(N))} – x is the input feature (per capita crime rate) – t is the target output (median house price) – (i) simply indicates the training examples (we have N in this case) • Here t is continuous, so this is a regression problem • Model outputs y, an estimate of t: y(x) = w_0 + w_1 x • What type of model did we choose? • Divide the dataset into training and testing examples – Use the training examples to construct hypothesis, or function approximator, that maps x to predicted y – Evaluate hypothesis on test set slide by Sanja Fidler 7
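A hedged sketch of the representation described above: data as pairs (x^(i), t^(i)), the linear model y(x) = w_0 + w_1 x, and a simple train/test split. The numeric values are placeholders, not the actual Boston housing data, and the split ratio is arbitrary.

```python
import numpy as np

def linear_model(x, w0, w1):
    """The chosen model: y(x) = w0 + w1 * x."""
    return w0 + w1 * x

# x: per-capita crime rate, t: median house price (placeholder values)
x = np.array([0.02, 0.09, 0.30, 1.20, 5.60, 8.90])
t = np.array([24.0, 21.6, 34.7, 20.4, 16.5, 13.9])

# divide the dataset into training and testing examples
n_train = 4
x_train, t_train = x[:n_train], t[:n_train]
x_test,  t_test  = x[n_train:], t[n_train:]
```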
Noise • A simple model typically does not exactly fit the data; the lack of fit can be considered noise • Sources of noise: – Imprecision in data attributes (input noise, e.g. noise in per-capita crime) – Errors in data targets (mislabeling, e.g. noise in house prices) – Additional attributes not taken into account by data attributes affect target values (latent variables). In the example, what else could affect house prices? – Model may be too simple to account for data targets slide by Sanja Fidler 8
Least-Squares Regression y(x) = function(x, w) slide by Sanja Fidler 9
Least-Squares Regression • Define a model. Linear: y(x) = function(x, w) • Standard loss/cost/objective function measures the squared error between y and the true value t • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically? slide by Sanja Fidler 10
Least-Squares Regression • Define a model. Linear: y(x) = w_0 + w_1 x • Standard loss/cost/objective function measures the squared error between y and the true value t • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically? slide by Sanja Fidler 11
Least-Squares Regression • Define a model. Linear: y(x) = w_0 + w_1 x • Standard loss/cost/objective function measures the squared error between y and the true value t: ℓ(w) = Σ_{n=1}^N [ t^(n) − y(x^(n)) ]^2 • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically? slide by Sanja Fidler 12
Least-Squares Regression • Define a model. Linear: y(x) = w_0 + w_1 x • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model: ℓ(w) = Σ_{n=1}^N [ t^(n) − (w_0 + w_1 x^(n)) ]^2 • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically? slide by Sanja Fidler 13
Least-Squares Regression • Define a model. Linear: y(x) = w_0 + w_1 x • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model: ℓ(w) = Σ_{n=1}^N [ t^(n) − (w_0 + w_1 x^(n)) ]^2 • The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of green vertical lines) slide by Sanja Fidler 15
Least-Squares Regression • Define a model. Linear: y(x) = w_0 + w_1 x • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model: ℓ(w) = Σ_{n=1}^N [ t^(n) − (w_0 + w_1 x^(n)) ]^2 • How do we obtain the weights w = (w_0, w_1)? slide by Sanja Fidler 16
Least-Squares Regression • Define a model. Linear: y(x) = w_0 + w_1 x • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model: ℓ(w) = Σ_{n=1}^N [ t^(n) − (w_0 + w_1 x^(n)) ]^2 • How do we obtain the weights w = (w_0, w_1)? Find the w that minimizes the loss ℓ(w) slide by Sanja Fidler 17
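A minimal sketch of the loss defined above for the linear model, ℓ(w) = Σ_n [ t^(n) − (w_0 + w_1 x^(n)) ]^2. Variable names are illustrative.

```python
import numpy as np

def squared_error_loss(w0, w1, x, t):
    """Sum of squared vertical errors between targets t and predictions."""
    y = w0 + w1 * x                  # model predictions for all N examples
    return np.sum((t - y) ** 2)      # ell(w) = sum_n [t^(n) - y(x^(n))]^2
```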
Optimizing the Objective • One straightforward method: gradient descent – initialize w (e.g., randomly) – repeatedly update w based on the gradient: w ← w − λ ∂ℓ/∂w • λ is the learning rate • For a single training case, this gives the LMS update rule: w ← w + 2λ ( t^(n) − y(x^(n)) ) x^(n), where ( t^(n) − y(x^(n)) ) is the error • Note: As the error approaches zero, so does the update (w stops changing) slide by Sanja Fidler 18
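A sketch of the LMS update above, w ← w + 2λ ( t^(n) − y(x^(n)) ) x^(n), for a single training case. The zero initialization, toy inputs, and learning rate are assumptions.

```python
import numpy as np

def lms_step(w, x_n, t_n, lam=0.01):
    """One LMS update from a single training case (x_n includes the bias term 1)."""
    error = t_n - w @ x_n            # as the error goes to 0, so does the update
    return w + 2 * lam * error * x_n

w = np.zeros(2)                      # w = (w0, w1), initialized (e.g.) at zero
x_n = np.array([1.0, 0.5])           # [1, x^(n)]
w = lms_step(w, x_n, t_n=1.3)
```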
Optimizing the Objective slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 19
Optimizing the Objective slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 20
Effect of learning rate λ [Figure: plots of ℓ(w) against w_0 for a large and a small learning rate] • Large λ => fast convergence but larger residual error; also possible oscillations • Small λ => slow convergence but small residual error slide by Erik Sudderth 21
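A toy sketch (data, learning rates, and step budget are all assumptions) illustrating the convergence-speed side of the trade-off above: with a fixed number of batch gradient steps, the larger λ gets much closer to the minimum, while the smaller λ is still far from converged. The oscillation and residual-error effects mentioned on the slide show up mainly with a too-large λ or with stochastic updates on noisy data.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 2.1, 2.9, 4.2])
X = np.stack([np.ones_like(x), x], axis=1)     # design matrix with a bias column

def run_gd(lam, steps=100):
    w = np.zeros(2)
    for _ in range(steps):
        grad = -2 * X.T @ (t - X @ w)          # gradient of the squared-error loss
        w = w - lam * grad
    return np.sum((t - X @ w) ** 2)

print(run_gd(lam=0.05), run_gd(lam=0.001))     # fast progress vs. slow progress
```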
Optimizing Across Training Set • Two ways to generalize this for all examples in the training set: 1. Batch updates: sum or average updates across every example n, then change the parameter values: w ← w + 2λ Σ_{n=1}^N ( t^(n) − y(x^(n)) ) x^(n) 2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients. Algorithm 1 Stochastic gradient descent: 1: Randomly shuffle examples in the training set 2: for i = 1 to N do 3: Update: w ← w + 2λ ( t^(i) − y(x^(i)) ) x^(i) (update for a linear model) 4: end for slide by Sanja Fidler 22
Optimizing Across Training Set • Two ways to generalize this for all examples in the training set: 1. Batch updates: sum or average updates across every example n, then change the parameter values: w ← w + 2λ Σ_{n=1}^N ( t^(n) − y(x^(n)) ) x^(n) 2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients • Underlying assumption: sample is independent and identically distributed (i.i.d.) slide by Sanja Fidler 23
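A sketch of the two strategies above for the linear model: one batch update versus one stochastic (online) epoch following Algorithm 1. The toy data, learning rate, and random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 2.1, 2.9, 4.2])
X = np.stack([np.ones_like(x), x], axis=1)     # rows are [1, x^(n)]

def batch_update(w, lam=0.01):
    # sum the per-example updates across all N examples, then change w once
    return w + 2 * lam * X.T @ (t - X @ w)

def sgd_epoch(w, lam=0.01):
    # Algorithm 1: shuffle, then update for each training case in turn
    for i in rng.permutation(len(t)):
        w = w + 2 * lam * (t[i] - X[i] @ w) * X[i]
    return w

w_batch = batch_update(np.zeros(2))
w_sgd   = sgd_epoch(np.zeros(2))
```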
Analytical Solution β’ For some objectives we can also find the optimal solution analytically β’ This is the case for linear least-squares regression β’ How? slide by Sanja Fidler 24
Vectorization • Consider our model: y(x) = w_0 + w_1 x • Let x^T = [ 1  x ] and w = [ w_0 ; w_1 ] • Can write the model in vectorized form as y(x) = w^T x 25
Vectorization • Consider our model with N instances: t = [ t^(1), t^(2), …, t^(N) ]^T ∈ R^{N×1}, X = [ 1, x^(1) ; 1, x^(2) ; … ; 1, x^(N) ] ∈ R^{N×2}, w = [ w_0 ; w_1 ] ∈ R^{2×1} • Then: ℓ(w) = Σ_{n=1}^N [ w^T x^(n) − t^(n) ]^2 = ( Xw − t )^T ( Xw − t ), where ( Xw − t )^T ∈ R^{1×N} and ( Xw − t ) ∈ R^{N×1} slide by Sanja Fidler 26
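A sketch of the vectorized form above, ℓ(w) = ( Xw − t )^T ( Xw − t ), together with the standard closed-form least-squares minimizer w* = (X^T X)^{-1} X^T t (the analytical solution the slides are building toward, not yet derived here). The numeric data are placeholders.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 2.1, 2.9, 4.2])
X = np.stack([np.ones_like(x), x], axis=1)   # N x 2 design matrix, rows [1, x^(n)]

def vectorized_loss(w):
    r = X @ w - t                            # residual vector Xw - t
    return r @ r                             # (Xw - t)^T (Xw - t)

# analytical least-squares solution via the normal equations
w_star = np.linalg.solve(X.T @ X, X.T @ t)
print(w_star, vectorized_loss(w_star))
```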