CS480/680 Machine Learning Lecture 3: January 14 th , 2020 Linear Regression Zahra Sheikhbahaee CS480/680 Winter 2020 Zahra Sheikhbahaee
First Assignment • Available 15 th of January • Deadline 23 th of January • Teacher assistants responsible for the first assignment • Gaurav Gupta g27gupta@uwaterloo.ca • Colin Michiel Vandenhof cm5vandenhof@uwaterloo.ca • Office hour for TA’s: 20 th of January at 1:00 pm in DC2584 University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 2
Outline • Linear regression • Ridge regression • Lasso • Bayesian linear regression University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 3
Linear Models for Regression Definition : Given a training set comprising 𝑂 observations {𝑦 $ , 𝑦 & , . . , 𝑦 ( } with corresponding target real-valued 𝑧 $ , 𝑧 & , . . , 𝑧 ( , the goal is to predict the value of 𝑧 (+$ for a new value of 𝑦 (+$ . The form of a linear regression model is ( 𝑔(𝒚) = 𝛾 1 + ∑ 45$ 𝑦 4 𝛾 4 𝛾 4 : unknown coefficients University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 4
Linear Models for Regression Definition : Given a training set comprising 𝑂 observations {𝑦 $ , 𝑦 & , . . , 𝑦 ( } with corresponding target real-valued 𝑧 $ , 𝑧 & , . . , 𝑧 ( , the goal is to predict the value of 𝑧 (+$ for a new value of 𝑦 (+$ . The form of a linear regression model is ( 𝑔(𝒚) = 𝛾 1 + ∑ 45$ 𝑦 4 𝛾 4 𝛾 4 : unknown coefficients ( (a polynomial regression) 7 , . . , 𝑦 ( = 𝑦 $ & , 𝑦 7 = 𝑦 $ • Basis expansions: 𝑦 & = 𝑦 $ • We can have interactions between variables, i.e. 𝑦 7 = 𝑦 $ . 𝑦 & ( • By using nonlinear basis functions like 𝑔(𝒚) = 𝛾 1 + ∑ 45$ 𝜚 4 (𝒚)𝛾 4 , we allow the function 𝑔 𝒚 to be a non-linear function of the input vector x but linear w.r.t 𝜸 . University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 5
Basis Function Choices • Polynomial 𝜚 4 𝑦 = 𝑦 4 • Gaussian (>?@ A ) B 𝜚 4 𝑦 = exp(− ) &C B • Sigmoidal >?@ A $ 𝜚 4 𝑦 = 𝜏( C ) with 𝜏 𝑦 = $+E FG University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 6
Minimizing the Residual Sum of Squares ( RSS ) • Assumption: The output variable 𝑧 is given by a deterministic function 𝑔(𝒚, 𝜸) with additive Gaussian noise 𝒛 = 𝒈 𝒚, 𝜸 + 𝝑 𝝑 :a zero-mean Gaussian random variable 𝜗 ∼ 𝑂(0, 𝜏 & ) ( W?$ 2𝜌𝜏 & exp[− 1 1 𝑄 𝒛 𝒈 𝒚, 𝜸 , 𝜏 & 𝕁 = P 𝛾 4 𝜚 4 (𝒚 Q )) & ] 2𝜏 & (𝑧 Q −𝛾 1 − V Q5$ 45$ Assuming these data points are drawn independent from the distribution ( (𝑧 Q −𝜸 [ 𝚾(𝒚 Q )) & = ∥ 𝜗 ∥ & & RSS 𝛾 = ∑ Q5$ & :square of the ℓ & norm ∥∥ & University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 7
Minimizing RSS • Compute the gradient of the log-likelihood function ( ∇ ` ln 𝑄 𝒛 𝜸, 𝜏 & 𝕁 = V {𝑧 Q − 𝜸 [ 𝜚(𝒚 Q )} 𝜚(𝒚 Q ) [ Q5$ • Setting the gradient to zero 𝜸 𝑵𝑴 = (𝚾 𝑼 𝚾) ?𝟐 𝚾 𝑼 𝒛 W?$ 𝛾 4 𝜚 4 , 𝜚 4 = $ ( 𝑧 − ∑ 45$ ( ∑ Q5$ 𝛾 1 = g 𝜚 4 (𝑦 Q ) 𝚾 : the design matrix 𝜚 1 𝒚 $ … 𝜚 W?$ (𝒚 $ ) 𝜚 W?$ (𝒚 & ) 𝜚 1 𝒚 & … 𝚾 = . . . . 𝜚 W?$ (𝒚 ( ) 𝜚 1 𝒚 ( … University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 8
9 Minimizing RSS • Solving ln 𝑄(𝒛│𝜸, 𝜏 & 𝕁) w.r.t the noise parameter 𝜏 & Squared-Error for linear regression model ( (𝑧 Q −𝛾 Wm [ 𝜚(𝑦 Q )) & j klB = $ $ ( ∑ Q5$ Gray line: The true noise-free function University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee
Bayesian view of Linear Regression • Bayes Rule: The posterior is proportional to likelihood times prior & ) 𝑄 𝛾 n 𝑧, 𝑦 ∝ 𝑄 n 𝑧 𝑦, 𝛾 𝑄(𝛾|𝜈 ` , 𝜏 ` Let’s use Gaussian prior for weight 𝛾 ` B & = 𝑂(0, 𝜏 $ & ) = 𝑄 𝛾 𝜈 ` , 𝜏 B exp(− B ) ` ` &j s &rj s The posterior is 2𝜏 & (𝑧 − 𝜸 [ 𝚾(𝒚)) & )exp(− 𝛾 & 𝑧, 𝑦 ∝ exp(− 1 𝑄 𝛾 n & ) 2𝜏 ` University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 10
Bayesian view of Linear Regression (Ridge Regression) The log posterior is 2𝜏 & (𝑧 − 𝜸 [ 𝚾(𝒚)) & − 𝛾 & 𝑧, 𝑦 ∝ − 1 log 𝑄 𝛾 n & 2𝜏 ` If we assume that 𝜏 & = 1 and 𝜇 = $ B then j s W?$ 𝛾 4 & ≤ 𝑑 & − w 𝑧, 𝑦 ∝ − $ & subject to ∑ 451 & ∥ 𝑧 − 𝜸 [ 𝚾(𝒚) ∥ & log 𝑄 𝛾 n & ∥ 𝛾 ∥ & 𝜸 𝑺𝒋𝒆𝒉𝒇 = (𝚾 𝑼 𝚾 + 𝝁𝕁) ?𝟐 𝚾 𝑼 𝒛 First term: the mean square error 𝜸 𝑺𝒋𝒆𝒉𝒇 :the posterior mode Second term: a complexity penalty 𝜇 ≥ 0 Goal: Regularization is the most common way to avoid overfitting and reduce the model complexity by shrinking coefficients close to zero. University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 11
12 Lasso Definition : The lasso cost function is defined if we replace square 𝜸 in the second term of ridge regression with | 𝜸| & − w 𝑧, 𝑦 ∝ − $ & ∥ 𝑧 − 𝜸 [ 𝚾(𝒚) ∥ & log 𝑄 𝛾 n & | 𝜸| W?$ 𝛾 4 < 𝑢 . Where for some 𝑢 > 0 , ∑ 451 For lasso the prior is a Laplace distribution 𝑄 𝑦 𝜈, 𝑐 = 1 2𝑐 exp(− |𝑦 − 𝜈| ) 𝑐 Goal: beside helping to reduce overfitting it can be used for variable selection and make it easier to eliminate some of input variable which are not contributing to the output . University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee
The Bias-Variance Decomposition • Let’s assume that 𝑍 = 𝑔 𝑦 + 𝜗 where 𝔽 𝜗 = 0 and Var 𝜗 = 𝜏 & . • The expected prediction error of the regression model Š 𝑔(𝑦) at the input point 𝑌 = 𝑦 1 using expected squared-error loss: & & 𝑍 − Š 𝑍 − 𝑔 𝑦 1 + 𝑔(𝑦 1 ) − Š Err 𝑦 1 = 𝔽 𝑔 𝑦 1 𝑌 = 𝑦 1 = 𝔽 𝑔 𝑦 1 𝑌 = 𝑦 1 = & 𝑌 = 𝑦 1 +2 𝔽 & (𝑔(𝑦 1 ) − Š 𝑔(𝑦 1 ) − Š 𝔽 𝑍 − 𝑔 𝑦 1 𝑍 − 𝑔 𝑦 1 𝑔 𝑦 1 ) 𝑌 = 𝑦 1 + 𝔽 𝑔 𝑦 1 𝑌 = 𝑦 1 = & Var 𝜗 +2 𝔽 𝜗 𝔽 𝑔(𝑦 1 ) − Š 𝑔(𝑦 1 ) − Š 𝑔 𝑦 1 𝑌 = 𝑦 1 + 𝔽 𝑔 𝑦 1 𝑌 = 𝑦 1 The first term can not be avoided because it is the variance of the target around its true mean. University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 13
The Bias-Variance Decomposition • Compute the expectation w.r.t the probability of given data set & & 𝑔(𝑦 1 ) − Š 𝑔 𝑦 1 − 𝔽[ Š 𝑔 𝑦 1 ] + 𝔽[ Š 𝑔 𝑦 1 ] − Š 𝔽 𝑔 𝑦 1 𝑌 = 𝑦 1 = 𝔽 𝑔 𝑦 1 𝑌 = 𝑦 1 = & 𝑌 = 𝑦 1 +2 𝔽 𝔽 Š 𝑔 𝑦 1 − 𝔽 Š 𝑌 = 𝑦 1 𝔽[𝔽[ Š 𝔽 𝑔 𝑦 1 − 𝑔(𝑦 1 ) 𝑔 𝑦 1 𝑔 𝑦 1 ] − & 𝑌 = 𝑦 1 = Bias & Š Š 𝔽 Š − Š + Var[ Š 𝑔 𝑦 1 | 𝑌 = 𝑦 1 ] + 𝔽 𝑔 𝑦 1 𝑔 𝑦 1 𝑔 𝑦 1 𝑔 𝑦 1 ]. The first term is the amount by the average of our estimate differs from its true mean. The second term is the expected squared deviation of Š 𝑔 𝑦 1 around its mean. More complex model → bias ↓ and variance ↑ University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 14
Bias-Variance Trade-Off • Flexible models: low bias, high variance • Rigid models: high bias, low variance 𝑀 = 100 data sets, 𝑂 = 25 data points in each set, ℎ 𝑦 = sin(2𝜌𝑦) , 𝑁 = 25 the total number of parameters, the corresponding average of the 100 fits (red) along with the sinusoidal function from which the data sets were generated (green) CS480/680 Winter 2020 Zahra Sheikhbahaee CS480/680 Winter 2020 Zahra Sheikhbahaee
Bayesian Linear Regression • Bayesian linear regression: avoid the over-fitting problem of maximum likelihood, automatically determines model complexity using the training data alone. 𝑄 𝛾 = 𝑂(𝛾|𝜈 ` , Σ ` ) 𝛾 :model parameters 𝑄(𝑧|𝛾) : the likelihood function 𝑄 𝛾 𝑧 = 𝑂(𝛾|𝜈, Σ) ∝ 𝑄 𝛾 𝑄(𝑧|𝛾) ?$ 𝜈 ` + $ j B Φ [ 𝑧 𝜈 = Σ Σ ` Σ ?$ = Σ ` $ ?$ + j B Φ [ Φ If data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point. University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 16
Bayesian Learning For a simple linear model: 𝑧 𝑦, 𝒙 = 𝑥 1 +𝑥 $ 𝑦 𝑄 𝒙 𝛽 = 𝑂 𝒙 0, 𝛽 ?$ 𝕁 True parameter: 𝑥 1 = −0.3 𝑥 $ = 0.5 𝛽 = 2.0 University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 17
Predictive Distribution We are interested to making prediction of 𝑧 ∗ for new value of 𝑦 ∗ 𝑄 𝑧 ∗ 𝑦 ∗ , 𝑧, 𝑦 = ž 𝑞 𝑧 ∗ 𝑦 ∗ , 𝛾 𝑞 𝛾 𝑦, 𝑧 d𝛾 ∝ ž exp(− 1 2 (𝛾 − 𝜈) [ Σ ?$ (𝛾 − 𝜈)) exp − 1 2𝜏 & 𝑧 ∗ − 𝛾 [ 𝜚(𝑦 ∗ ) & 𝑒𝛾 = 𝑂(𝑧 ∗ | 𝜏 ?& 𝜚(𝑦 ∗ ) [ Σ𝜚 𝑦 𝑧, 𝜏 & + 𝜚(𝑦 ∗ ) [ Σ 𝜚(𝑦 ∗ )) A model with nine Gaussian basis functions University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 18
Recommend
More recommend