Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 6 Jan-Willem van de Meent
Regression
Curve Fitting (according to XKCD) https://xkcd.com/2048/
Linear Regression Goal: approximate points with a line or hyperplane
Linear Regression
Assume f is a linear combination of D features:
y = f(x) + \epsilon = w^T x + \epsilon, \quad \epsilon \sim \mathrm{Norm}(0, \sigma^2)
For N points we write y_n = w^T x_n + \epsilon_n for n = 1, \dots, N.
Learning: estimate w. Prediction: estimate y' given x'.
Error Measure: Sum of Squares
Mean Squared Error (MSE):
E(w) = \frac{1}{N} \sum_{n=1}^N (w^T x_n - y_n)^2 = \frac{1}{N} \| Xw - y \|^2
where X is the matrix with rows x_1^T, x_2^T, \dots, x_N^T and y is the vector with entries y_1, y_2, \dots, y_N.
Minimizing the Error
E(w) = \frac{1}{N} \| Xw - y \|^2
\nabla E(w) = \frac{2}{N} X^T (Xw - y) = 0
\Rightarrow X^T X w = X^T y
\Rightarrow w = X^\dagger y, \quad \text{where } X^\dagger = (X^T X)^{-1} X^T \text{ is the 'pseudo-inverse' of } X
(See the Matrix Cookbook, on the course website, for the matrix derivatives.)
Ordinary Least Squares
Construct the matrix X and the vector y from the dataset {(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)} (each x includes x_0 = 1) as follows:
X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
Compute X^\dagger = (X^T X)^{-1} X^T
Return w = X^\dagger y
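The following is a minimal NumPy sketch of this procedure (not from the original slides); the synthetic data, noise level, and weights are assumptions made for illustration.

import numpy as np

# Generate synthetic data (assumed for illustration): N points, 3 raw features.
rng = np.random.default_rng(0)
N = 100
X_raw = rng.normal(size=(N, 3))
true_w = np.array([1.0, 2.0, -1.0, 0.5])             # bias followed by feature weights
y = true_w[0] + X_raw @ true_w[1:] + 0.1 * rng.normal(size=N)

# Construct X with the constant feature x_0 = 1 in the first column.
X = np.hstack([np.ones((N, 1)), X_raw])

# Pseudo-inverse solution w = (X^T X)^{-1} X^T y.
# In practice np.linalg.lstsq or np.linalg.pinv is numerically safer than forming X^T X.
w = np.linalg.solve(X.T @ X, X.T @ y)

y_pred = X @ w                                        # predictions for the training inputs
mse = np.mean((y_pred - y) ** 2)
print(w, mse)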
Basis Function Regression
Linear regression: y_n = w^T x_n + \epsilon_n
Basis function regression: y_n = w^T \phi(x_n) + \epsilon_n for N samples n = 1, \dots, N
Polynomial regression: \phi(x) = (1, x, x^2, \dots, x^M)^T
Polynomial Regression
[Figure: polynomial fits of degree M = 0, 1, 3, 9 to the same data, t against x. The M = 0 and M = 1 fits underfit; the M = 9 fit overfits.]
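As a sketch of how such fits can be produced (this code is not from the slides; the noisy-sine data, sample size, and degrees are assumptions chosen to mirror the figure):

import numpy as np

def poly_features(x, M):
    # Map scalar inputs x to the polynomial basis (1, x, x^2, ..., x^M).
    return np.vander(x, M + 1, increasing=True)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)    # noisy sine targets

for M in (0, 1, 3, 9):
    Phi = poly_features(x, M)
    # Least-squares fit in the polynomial basis; lstsq also handles the ill-conditioned M = 9 case.
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    train_mse = np.mean((Phi @ w - t) ** 2)
    print(f"M = {M}: training MSE = {train_mse:.4f}")

Low training error at M = 9 does not mean a good fit; the high-degree polynomial interpolates the noise, which is exactly the overfitting shown in the figure.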
Regularization
L2 regularization (ridge regression) minimizes:
E(w) = \frac{1}{N} \| Xw - y \|^2 + \lambda \| w \|^2, \quad \text{where } \lambda \ge 0 \text{ and } \| w \|^2 = w^T w
L1 regularization (LASSO) minimizes:
E(w) = \frac{1}{N} \| Xw - y \|^2 + \lambda | w |_1, \quad \text{where } \lambda \ge 0 \text{ and } | w |_1 = \sum_{i=1}^D | w_i |
Regularization
L2: closed-form solution w = (X^T X + \lambda I)^{-1} X^T y
L1: no closed-form solution. Use quadratic programming:
minimize \| Xw - y \|^2 subject to \| w \|_1 \le s
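A minimal sketch of the L2 closed-form solution (not from the slides; the data and the value of lam are assumptions, and the bias column is regularized here for simplicity even though it is often excluded in practice):

import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y.
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Assumed illustrative data.
rng = np.random.default_rng(2)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = X @ np.array([0.5, 1.0, -2.0, 0.0]) + 0.1 * rng.normal(size=50)
w_ridge = ridge_fit(X, y, lam=0.1)

For the L1 case one would typically rely on an iterative solver (for example, the coordinate-descent implementation behind scikit-learn's Lasso) rather than solving the quadratic program by hand.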
Maximum Likelihood
Regression: Probabilistic Interpretation
What is the probability p(y_n | x_n, w, \sigma) of observing y_n given x_n under the model y_n = w^T x_n + \epsilon_n?
Regression: Probabilistic Interpretation
Least squares objective: E(w) = \frac{1}{N} \| Xw - y \|^2
Likelihood: p(y | X, w, \sigma) = \prod_{n=1}^N \mathrm{Norm}(y_n | w^T x_n, \sigma^2)
Maximum Likelihood
Least squares objective: E(w) = \frac{1}{N} \| Xw - y \|^2
Log-likelihood: \log p(y | X, w, \sigma) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w^T x_n)^2
Maximizing the likelihood with respect to w minimizes the sum of squares.
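Spelling out the equivalence (a short derivation added for completeness, following the standard argument rather than the original slide):

\hat{w}_{\mathrm{ML}}
  = \arg\max_w \log p(y \mid X, w, \sigma)
  = \arg\max_w \Big[ -\tfrac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w^T x_n)^2 \Big]
  = \arg\min_w \sum_{n=1}^N (y_n - w^T x_n)^2
  = \arg\min_w E(w)

since the term -\tfrac{N}{2} \log(2\pi\sigma^2) does not depend on w.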
Maximum a Posteriori
Regression with Priors
Can we maximize p(w | X, y)? (i.e. can we perform MAP estimation?)
Regression with Priors
From Bayes' rule:
p(w | X, y) = \frac{p(y | X, w) \, p(w)}{p(y | X)} \propto p(y | X, w) \, p(w)
Maximum a Posteriori
With a Gaussian prior p(w) = \mathrm{Norm}(w | 0, \tau^2 I), maximizing
\log p(w | X, y) = \log p(y | X, w) + \log p(w) + \text{const}
is equivalent to ridge regression with \lambda = \sigma^2 / (N \tau^2):
maximum a posteriori estimation is equivalent to ridge regression.
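A small numerical check of this equivalence (an illustrative sketch, not from the slides; sigma, tau, and the data are assumptions):

import numpy as np

rng = np.random.default_rng(3)
N, sigma, tau = 50, 0.1, 1.0
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 3))])
y = X @ np.array([0.5, 1.0, -2.0, 0.3]) + sigma * rng.normal(size=N)
D = X.shape[1]

# MAP estimate under the Gaussian prior Norm(w | 0, tau^2 I):
# closed form w = (X^T X + (sigma^2 / tau^2) I)^{-1} X^T y.
w_map = np.linalg.solve(X.T @ X + (sigma**2 / tau**2) * np.eye(D), X.T @ y)

# Ridge estimate for the 1/N-scaled objective with lambda = sigma^2 / (N tau^2).
lam = sigma**2 / (N * tau**2)
w_ridge = np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)

print(np.allclose(w_map, w_ridge))   # True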