11-755 Machine Learning for Signal Processing
Regression and Prediction
Class 15, 23 Oct 2012
Instructor: Bhiksha Raj
Matrix Identities
$$\frac{df(\mathbf{x})}{d\mathbf{x}} = \begin{bmatrix} \dfrac{df}{dx_1} \\ \dfrac{df}{dx_2} \\ \vdots \\ \dfrac{df}{dx_D} \end{bmatrix} \qquad \text{for } \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix}$$
The derivative of a scalar function w.r.t. a vector is a vector.
The derivative w.r.t. a matrix is a matrix.
Matrix Identities
$$\frac{df(X)}{dX} = \begin{bmatrix} \dfrac{df}{dx_{11}} & \dfrac{df}{dx_{12}} & \cdots & \dfrac{df}{dx_{1D}} \\ \dfrac{df}{dx_{21}} & \dfrac{df}{dx_{22}} & \cdots & \dfrac{df}{dx_{2D}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{df}{dx_{D1}} & \dfrac{df}{dx_{D2}} & \cdots & \dfrac{df}{dx_{DD}} \end{bmatrix} \qquad \text{for } X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1D} \\ x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{D1} & x_{D2} & \cdots & x_{DD} \end{bmatrix}$$
The derivative of a scalar function w.r.t. a vector is a vector.
The derivative w.r.t. a matrix is a matrix.
Matrix Identities
$$\frac{d\mathbf{F}(\mathbf{x})}{d\mathbf{x}} = \begin{bmatrix} \dfrac{dF_1}{dx_1} & \dfrac{dF_2}{dx_1} & \cdots & \dfrac{dF_N}{dx_1} \\ \dfrac{dF_1}{dx_2} & \dfrac{dF_2}{dx_2} & \cdots & \dfrac{dF_N}{dx_2} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{dF_1}{dx_D} & \dfrac{dF_2}{dx_D} & \cdots & \dfrac{dF_N}{dx_D} \end{bmatrix} \qquad \text{for } \mathbf{F}(\mathbf{x}) = \begin{bmatrix} F_1(\mathbf{x}) \\ F_2(\mathbf{x}) \\ \vdots \\ F_N(\mathbf{x}) \end{bmatrix}$$
The derivative of a vector function w.r.t. a vector is a matrix.
Note the transposition of order.
Derivatives
Differentiating a scalar by a UxV argument gives a UxV derivative; differentiating an Nx1 function by a UxV argument gives an NxUxV (equivalently UxVxN) tensor.
In general: differentiating an MxN function by a UxV argument results in an MxNxUxV tensor derivative.
Matrix derivative identities
X is a matrix, $\mathbf{a}$ is a vector:
$$d(X\mathbf{a}) = X\,d\mathbf{a} \qquad d(\mathbf{a}^T X) = (d\mathbf{a})^T X \ \ \text{(may also be written in the transposed form } X^T d\mathbf{a}\text{)}$$
A is a matrix:
$$d(AX) = (dA)\,X \qquad d(XA) = X\,(dA)$$
$$d(\mathbf{a}^T X \mathbf{a}) = \mathbf{a}^T (X + X^T)\, d\mathbf{a} \quad\Rightarrow\quad \frac{d(\mathbf{a}^T X \mathbf{a})}{d\mathbf{a}} = (X + X^T)\,\mathbf{a}$$
$$d\,\mathrm{trace}(A^T X A) = d\,\mathrm{trace}(X A A^T) = d\,\mathrm{trace}(A A^T X) \quad\Rightarrow\quad \frac{d\,\mathrm{trace}(A^T X A)}{dA} = (X + X^T)\,A$$
Some basic linear and quadratic identities.
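A quick numerical sanity check on the quadratic identity, as an illustrative numpy sketch (not from the slides): compare the closed-form gradient $(X + X^T)\mathbf{a}$ against central finite differences.

```python
import numpy as np

# Sketch: numerically verify d(a^T X a)/da = (X + X^T) a with finite differences.
rng = np.random.default_rng(0)
D = 4
X = rng.standard_normal((D, D))
a = rng.standard_normal(D)

def f(a_vec):
    return a_vec @ X @ a_vec          # the scalar function a^T X a

analytic = (X + X.T) @ a              # closed-form gradient from the identity

eps = 1e-6
numeric = np.array([
    (f(a + eps * np.eye(D)[i]) - f(a - eps * np.eye(D)[i])) / (2 * eps)
    for i in range(D)
])

print(np.allclose(analytic, numeric, atol=1e-5))   # expected: True
```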
A Common Problem
Can you spot the glitches?
How to fix this problem?
- "Glitches" in audio must be detected. How? Then what?
- Glitches must be "fixed": delete the glitch, which results in a "hole", then fill in the hole. How?
Interpolation..
- "Extend" the curve on the left to "predict" the values in the "blank" region: forward prediction.
- Extend the blue curve on the right leftwards to predict the blank region: backward prediction.
- How? Regression analysis.. (a toy sketch of the idea follows)
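A toy sketch of the idea on a synthetic signal, with a low-order polynomial in time standing in for whatever predictor is actually used; all sizes and choices below are illustrative assumptions, not the lecture's method.

```python
import numpy as np

# Sketch: fill a "hole" left by a deleted glitch using forward and backward prediction.
rng = np.random.default_rng(1)
n = np.arange(200)
signal = np.sin(2 * np.pi * n / 40) + 0.05 * rng.standard_normal(n.size)
hole = slice(90, 110)                 # samples destroyed by the glitch

t = (n - 100) / 100.0                 # centred, scaled time index for numerical comfort
fwd = np.polyval(np.polyfit(t[60:90],   signal[60:90],   3), t[hole])  # forward prediction
bwd = np.polyval(np.polyfit(t[110:140], signal[110:140], 3), t[hole])  # backward prediction

w = np.linspace(1.0, 0.0, fwd.size)   # cross-fade: trust the forward fit near the left edge
signal[hole] = w * fwd + (1 - w) * bwd
```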
Detecting the Glitch
[Figure: two reconstructed segments, one labeled "OK" (reconstruction matches the signal) and one labeled "NOT OK" (it does not).]
- Regression-based reconstruction can be done anywhere.
- At a glitch, the reconstructed value will not match the actual value.
- A large reconstruction error identifies glitches.
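One illustrative way to build such a detector; the predictor order K and the thresholding rule below are arbitrary choices of mine, not the lecture's.

```python
import numpy as np

# Sketch: flag glitches as samples whose regression-based reconstruction error is large.
def glitch_mask(signal, K=8, n_sigmas=4.0):
    """Predict each sample from its K predecessors; flag samples with outlying error."""
    X = np.stack([signal[t - K:t] for t in range(K, len(signal))])  # (T-K, K) contexts
    y = signal[K:]                                                   # targets
    a, *_ = np.linalg.lstsq(X, y, rcond=None)                        # least-squares predictor
    err = (y - X @ a) ** 2                                           # squared reconstruction error
    thresh = err.mean() + n_sigmas * err.std()                       # crude, illustrative threshold
    mask = np.zeros(len(signal), dtype=bool)
    mask[K:] = err > thresh
    return mask
```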
What is a regression?
- Analyzing the relationship between variables.
- Expressed in many forms. Wikipedia: Linear regression, Simple regression, Ordinary least squares, Polynomial regression, General linear model, Generalized linear model, Discrete choice, Logistic regression, Multinomial logit, Mixed logit, Probit, Multinomial probit, ….
- Generally a tool to predict variables.
Regressions for prediction
$$\mathbf{y} = f(\mathbf{x}; \Theta) + e$$
Different possibilities:
- y is a scalar; y is real, or y is categorical (classification)
- y is a vector
- x is a vector; x is a set of real-valued variables, a set of categorical variables, or a combination of the two
- f(.) is a linear or affine function, a non-linear function, or a time-series model
A linear regression
Assumption: the relationship between the variables is linear; a linear trend may be found relating x and y.
- y = dependent variable
- x = explanatory variable
- Given x, y can be predicted as an affine function of x.
An imaginary regression..
http://pages.cs.wisc.edu/~kovar/hall.html
Check this shit out (Fig. 1). That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data possible. Now, let's look a bit more closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered.
Linear Regressions
$$\mathbf{y} = A\mathbf{x} + \mathbf{b} + \mathbf{e} \qquad \mathbf{e} = \text{prediction error}$$
Given a "training" set of $\{\mathbf{x}, \mathbf{y}\}$ values, estimate $A$ and $\mathbf{b}$:
$$\mathbf{y}_1 = A\mathbf{x}_1 + \mathbf{b} + \mathbf{e}_1, \quad \mathbf{y}_2 = A\mathbf{x}_2 + \mathbf{b} + \mathbf{e}_2, \quad \mathbf{y}_3 = A\mathbf{x}_3 + \mathbf{b} + \mathbf{e}_3, \ \ldots$$
If $A$ and $\mathbf{b}$ are well estimated, the prediction error will be small.
Linear Regression to a scalar
$$y_1 = \mathbf{a}^T \mathbf{x}_1 + b + e_1, \quad y_2 = \mathbf{a}^T \mathbf{x}_2 + b + e_2, \quad y_3 = \mathbf{a}^T \mathbf{x}_3 + b + e_3$$
Define:
$$\mathbf{y} = [\,y_1\ y_2\ y_3\ \ldots\,], \qquad X = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots \\ 1 & 1 & 1 & \cdots \end{bmatrix}, \qquad A = \begin{bmatrix} \mathbf{a} \\ b \end{bmatrix}, \qquad \mathbf{e} = [\,e_1\ e_2\ e_3\ \ldots\,]$$
Rewrite:
$$\mathbf{y} = A^T X + \mathbf{e}$$
Learning the parameters
$$\mathbf{y} = A^T X + \mathbf{e} \qquad \hat{\mathbf{y}} = A^T X \ \ \text{(assuming no error)}$$
Given training data (several $\mathbf{x}, y$ pairs), we can define a "divergence" $D(\mathbf{y}, \hat{\mathbf{y}})$:
- It measures how much $\hat{\mathbf{y}}$ differs from $\mathbf{y}$.
- Ideally, if the model is accurate, this should be small.
Estimate $A$, $b$ to minimize $D(\mathbf{y}, \hat{\mathbf{y}})$.
The prediction error as divergence
$$y_1 = \mathbf{a}^T \mathbf{x}_1 + b + e_1, \quad y_2 = \mathbf{a}^T \mathbf{x}_2 + b + e_2, \quad y_3 = \mathbf{a}^T \mathbf{x}_3 + b + e_3 \qquad\Longleftrightarrow\qquad \mathbf{y} = A^T X + \mathbf{e}$$
$$D(\mathbf{y}, \hat{\mathbf{y}}) = E = e_1^2 + e_2^2 + e_3^2 + \cdots = (y_1 - \mathbf{a}^T\mathbf{x}_1 - b)^2 + (y_2 - \mathbf{a}^T\mathbf{x}_2 - b)^2 + (y_3 - \mathbf{a}^T\mathbf{x}_3 - b)^2 + \cdots$$
$$E = \|\mathbf{y} - A^T X\|^2 = (\mathbf{y} - A^T X)(\mathbf{y} - A^T X)^T$$
Define the divergence as the sum of the squared errors in predicting $\mathbf{y}$.
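A small check, purely illustrative, that the summed squared errors and the matrix form $(\mathbf{y} - A^T X)(\mathbf{y} - A^T X)^T$ give the same number on toy data; the row of ones appended to X absorbs b, as in the definitions above.

```python
import numpy as np

# Sketch: the divergence as a sum of squared errors equals its matrix form.
rng = np.random.default_rng(2)
D, N = 3, 50
x = rng.standard_normal((D, N))
X = np.vstack([x, np.ones((1, N))])        # row of ones absorbs the bias b, so A holds [a; b]
y = rng.standard_normal((1, N))
A = rng.standard_normal((D + 1, 1))        # some arbitrary candidate predictor

sum_of_squares = ((y - A.T @ X) ** 2).sum()                  # e_1^2 + e_2^2 + ...
matrix_form = ((y - A.T @ X) @ (y - A.T @ X).T).item()       # (y - A^T X)(y - A^T X)^T
print(np.isclose(sum_of_squares, matrix_form))               # expected: True
```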
Prediction error as divergence
$$y = \mathbf{a}^T \mathbf{x} + e \qquad e = \text{prediction error}$$
Find the "slope" $\mathbf{a}$ such that the total squared length of the error lines is minimized.
Solving a linear regression
$$\mathbf{y} = A^T X + \mathbf{e}$$
Minimize the squared error:
$$E = \|\mathbf{y} - A^T X\|^2 = (\mathbf{y} - A^T X)(\mathbf{y} - A^T X)^T = \mathbf{y}\mathbf{y}^T + A^T X X^T A - 2\,\mathbf{y} X^T A$$
Differentiating w.r.t. $A$ and equating to 0:
$$dE = \left(2 A^T X X^T - 2\,\mathbf{y} X^T\right) dA = 0$$
$$\Rightarrow\quad A^T = \mathbf{y} X^T (X X^T)^{-1} = \mathbf{y}\,\mathrm{pinv}(X), \qquad A = (X X^T)^{-1} X \mathbf{y}^T$$
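A minimal numpy sketch of this closed-form solution on synthetic data (variable names and sizes are mine); `np.linalg.pinv(X.T)` computes $(XX^T)^{-1}X$ when X has full row rank.

```python
import numpy as np

# Sketch: closed-form least squares, A = (X X^T)^{-1} X y^T, via the pseudoinverse.
rng = np.random.default_rng(3)
D, N = 3, 200
a_true, b_true = np.array([1.5, -2.0, 0.5]), 0.7

x = rng.standard_normal((D, N))
y = a_true @ x + b_true + 0.1 * rng.standard_normal(N)   # y = a^T x + b + e

X = np.vstack([x, np.ones((1, N))])     # (D+1) x N; the row of ones absorbs b
A = np.linalg.pinv(X.T) @ y             # equals (X X^T)^{-1} X y^T for full row rank X

a_hat, b_hat = A[:-1], A[-1]
print(np.round(a_hat, 2), np.round(b_hat, 2))   # should be close to a_true, b_true
```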
An Aside
What happens if we minimize the perpendicular instead?
Regression in multiple dimensions
$$\mathbf{y}_1 = A^T \mathbf{x}_1 + \mathbf{b} + \mathbf{e}_1, \quad \mathbf{y}_2 = A^T \mathbf{x}_2 + \mathbf{b} + \mathbf{e}_2, \quad \mathbf{y}_3 = A^T \mathbf{x}_3 + \mathbf{b} + \mathbf{e}_3 \qquad (\mathbf{y}_i \text{ is a vector})$$
Also called multiple regression. Notation: $y_{ij}$ = $j$th component of vector $\mathbf{y}_i$; $\mathbf{a}_i$ = $i$th column of $A$; $b_i$ = $i$th component of $\mathbf{b}$.
Equivalent of saying:
$$\mathbf{y}_1 = A^T \mathbf{x}_1 + \mathbf{b} + \mathbf{e}_1 \quad\Longleftrightarrow\quad y_{11} = \mathbf{a}_1^T \mathbf{x}_1 + b_1 + e_{11}, \quad y_{12} = \mathbf{a}_2^T \mathbf{x}_1 + b_2 + e_{12}, \quad y_{13} = \mathbf{a}_3^T \mathbf{x}_1 + b_3 + e_{13}$$
Fundamentally no different from N separate single regressions, but we can use the relationship between the $\mathbf{y}$s to our benefit.
Multiple Regression
$$Y = [\,\mathbf{y}_1\ \mathbf{y}_2\ \mathbf{y}_3\ \ldots\,], \qquad X = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots \\ 1 & 1 & 1 & \cdots \end{bmatrix}, \qquad E = [\,\mathbf{e}_1\ \mathbf{e}_2\ \mathbf{e}_3\ \ldots\,]$$
(the appended ones fold the bias $\mathbf{b}$ into $A$, as in the scalar case)
$$Y = A^T X + E$$
$$\mathrm{DIV} = \sum_i \|\mathbf{y}_i - A^T \mathbf{x}_i - \mathbf{b}\|^2 = \mathrm{trace}\!\left((Y - A^T X)(Y - A^T X)^T\right)$$
Differentiating and equating to 0:
$$d\,\mathrm{DIV} = \left(2 A^T X X^T - 2\,Y X^T\right) dA = 0$$
$$\Rightarrow\quad A^T = Y X^T (X X^T)^{-1} = Y\,\mathrm{pinv}(X), \qquad A = (X X^T)^{-1} X Y^T$$
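The vector-output case is essentially the same computation; a hedged sketch with invented dimensions, where each column of the recovered A is exactly the solution of one single-output regression.

```python
import numpy as np

# Sketch: multiple regression Y = A^T X + E, solved as A = (X X^T)^{-1} X Y^T.
rng = np.random.default_rng(4)
D, M, N = 3, 2, 500                       # input dim, output dim, number of samples
A_true = rng.standard_normal((D + 1, M))  # last row plays the role of the bias b

x = rng.standard_normal((D, N))
X = np.vstack([x, np.ones((1, N))])       # augment with the row of ones
Y = A_true.T @ X + 0.05 * rng.standard_normal((M, N))

A = np.linalg.pinv(X.T) @ Y.T             # (D+1) x M, one column per output component
print(np.allclose(A, A_true, atol=0.05))  # expected: True (up to the noise level)
```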
A Different Perspective
$\mathbf{y}$ is a noisy reading of $A^T \mathbf{x}$:
$$\mathbf{y} = A^T \mathbf{x} + \mathbf{e}, \qquad \mathbf{e} \sim N(0, \sigma^2 I) \ \ \text{(Gaussian error)}$$
Estimate $A$ from
$$Y = [\,\mathbf{y}_1\ \mathbf{y}_2\ \ldots\ \mathbf{y}_N\,], \qquad X = [\,\mathbf{x}_1\ \mathbf{x}_2\ \ldots\ \mathbf{x}_N\,]$$
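A standard completion of this thought, not taken verbatim from the slides: under the Gaussian-noise model, maximizing the likelihood of the observations is the same as minimizing the squared-error divergence used earlier,
$$\log P(Y \mid X, A) = \sum_{i=1}^{N} \log N\!\left(\mathbf{y}_i;\, A^T\mathbf{x}_i,\, \sigma^2 I\right) = C - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left\|\mathbf{y}_i - A^T\mathbf{x}_i\right\|^2,$$
so the maximum-likelihood estimate of $A$ is the least-squares solution $A = (XX^T)^{-1} X Y^T$ already derived.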