  1. 11-755 Machine Learning for Signal Processing: Regression and Prediction. Class 15, 23 Oct 2012. Instructor: Bhiksha Raj

  2. Matrix Identities

$$df(\mathbf{x}) = \begin{bmatrix} \frac{df}{dx_1} & \frac{df}{dx_2} & \cdots & \frac{df}{dx_D} \end{bmatrix} \begin{bmatrix} dx_1 \\ dx_2 \\ \vdots \\ dx_D \end{bmatrix}$$

 • The derivative of a scalar function w.r.t. a vector is a vector
 • The derivative w.r.t. a matrix is a matrix

  3. Matrix Identities

For a scalar function of a matrix argument,

$$df(X) = \sum_{i,j} \frac{df}{dx_{ij}}\, dx_{ij}, \qquad \frac{df}{dX} = \begin{bmatrix} \frac{df}{dx_{11}} & \frac{df}{dx_{12}} & \cdots & \frac{df}{dx_{1D}} \\ \frac{df}{dx_{21}} & \frac{df}{dx_{22}} & \cdots & \frac{df}{dx_{2D}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{df}{dx_{D1}} & \frac{df}{dx_{D2}} & \cdots & \frac{df}{dx_{DD}} \end{bmatrix}$$

 • The derivative of a scalar function w.r.t. a vector is a vector
 • The derivative w.r.t. a matrix is a matrix

  4. Matrix Identities

For a vector function $F(\mathbf{x}) = [F_1\ F_2\ \cdots\ F_N]^T$,

$$\frac{dF}{d\mathbf{x}} = \begin{bmatrix} \frac{dF_1}{dx_1} & \frac{dF_2}{dx_1} & \cdots & \frac{dF_N}{dx_1} \\ \frac{dF_1}{dx_2} & \frac{dF_2}{dx_2} & \cdots & \frac{dF_N}{dx_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{dF_1}{dx_D} & \frac{dF_2}{dx_D} & \cdots & \frac{dF_N}{dx_D} \end{bmatrix}, \qquad dF = \left(\frac{dF}{d\mathbf{x}}\right)^T d\mathbf{x}$$

 • The derivative of a vector function w.r.t. a vector is a matrix
 • Note the transposition of order
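To make this layout concrete, here is a minimal numerical check, assuming NumPy (the helper name jacobian_fd is ours, not from the slides): for the linear map F(x) = Wx, the derivative in this convention is W^T.

```python
import numpy as np

def jacobian_fd(F, x, h=1e-6):
    """Finite-difference derivative dF/dx, laid out D x N:
    rows index components of x, columns index components of F."""
    F0 = F(x)
    J = np.zeros((x.size, F0.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += h
        J[i] = (F(xp) - F0) / h
    return J

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))        # F: R^3 -> R^4, F(x) = W x
x = rng.standard_normal(3)
J = jacobian_fd(lambda v: W @ v, x)
print(np.allclose(J, W.T, atol=1e-4))  # True: dF/dx = W^T in this layout
```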

  5. Derivatives

 • In general: differentiating an MxN function by a UxV argument results in an MxNxUxV tensor derivative
 • (The slide illustrates the shapes: an Nx1 function differentiated by a UxV argument gives an NxUxV, equivalently UxVxN, tensor)

  6. Matrix derivative identities

$X$ is a matrix, $\mathbf{a}$ is a vector:
$$d(X\mathbf{a}) = X\, d\mathbf{a}; \qquad d(\mathbf{a}^T X) = X\, d\mathbf{a} \ \text{(the solution may also be written } X^T\text{, depending on layout convention)}$$

$A$ is the matrix variable:
$$d(AX) = (dA)\,X; \qquad d(XA) = X\,(dA)$$

$$d(\mathbf{a}^T X \mathbf{a}) = \mathbf{a}^T (X^T + X)\, d\mathbf{a}$$

$$d\,\mathrm{trace}(A^T X A) = d\,\mathrm{trace}(X A A^T) = d\,\mathrm{trace}(A A^T X) = \mathrm{trace}\!\big(A^T (X^T + X)\, dA\big)$$

 • Some basic linear and quadratic identities
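These identities are easy to verify numerically. A small finite-difference check of the quadratic and trace identities, assuming NumPy; the sizes and the step h are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 4))
a = rng.standard_normal(4)
A = rng.standard_normal((4, 3))
h = 1e-6

# d(a^T X a)/da = (X^T + X) a
g_fd = np.array([((a + h * e) @ X @ (a + h * e) - a @ X @ a) / h
                 for e in np.eye(4)])
print(np.allclose(g_fd, (X.T + X) @ a, atol=1e-3))   # True

# d trace(A^T X A)/dA = (X + X^T) A
f = lambda M: np.trace(M.T @ X @ M)
G_fd = np.zeros_like(A)
for i in range(4):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = h
        G_fd[i, j] = (f(A + E) - f(A)) / h
print(np.allclose(G_fd, (X + X.T) @ A, atol=1e-3))   # True
```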

  7. A Common Problem

 • Can you spot the glitches?

  8. How to fix this problem?

 • “Glitches” in audio must first be detected. How?
 • Then what? Once detected, glitches must be “fixed”: delete the glitch, which leaves a “hole”
 • Fill in the hole. How?

  9. Interpolation..

 • “Extend” the curve on the left to “predict” the values in the “blank” region: forward prediction
 • Extend the blue curve on the right leftwards to predict the blank region: backward prediction
 • How? Regression analysis.. (a sketch follows below)
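As one concrete, hypothetical realization of forward prediction, the curve can be extended with a least-squares autoregressive model. A minimal sketch assuming NumPy; the order p, the toy sinusoid, and the helper names are illustrative, not the course's actual predictor:

```python
import numpy as np

def fit_ar(s, p):
    """Least-squares AR(p) fit: s[t] ~ sum_k a[k] * s[t-1-k]."""
    rows = np.array([s[t - p:t][::-1] for t in range(p, len(s))])
    a, *_ = np.linalg.lstsq(rows, s[p:], rcond=None)
    return a

def predict_forward(s, a, n):
    """Extend s by n samples by recursively applying the AR model."""
    out = list(s)
    p = len(a)
    for _ in range(n):
        out.append(np.dot(a, out[-1:-p - 1:-1]))
    return np.array(out[len(s):])

t = np.arange(200)
s = np.sin(0.1 * t)                        # toy "audio" signal
a = fit_ar(s[:150], p=10)                  # fit left of the "hole"
hole = predict_forward(s[:150], a, n=20)   # forward-predict into the hole
# Backward prediction: reverse the right-hand segment and do the same.
print(np.max(np.abs(hole - s[150:170])))   # small for this clean sinusoid
```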

  10. Detecting the Glitch

(The slide contrasts two reconstructions, labelled “OK” and “NOT OK”.)

 • Regression-based reconstruction can be done anywhere
 • The reconstructed value will not match the actual value
 • A large reconstruction error identifies glitches (see the sketch below)
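A self-contained sketch of this idea, under the same AR assumptions as above (NumPy; the injected glitch, the order p, and the 6x-median threshold are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(400)
s = np.sin(0.1 * t) + 0.01 * rng.standard_normal(400)  # toy signal
s[250] += 2.0                          # inject a glitch
p = 10

# Fit the AR predictor on a region believed to be clean (here s[:200]).
rows = np.array([s[k - p:k][::-1] for k in range(p, 200)])
a, *_ = np.linalg.lstsq(rows, s[p:200], rcond=None)

# Reconstruction error everywhere; a large error marks the glitch.
preds = np.array([np.dot(a, s[k - p:k][::-1]) for k in range(p, len(s))])
err = np.abs(s[p:] - preds)
thresh = 6.0 * np.median(err)          # robust, illustrative threshold
print(p + np.where(err > thresh)[0])   # indices at/near 250 get flagged
```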

  11. What is a regression

 • Analyzing the relationship between variables
 • Expressed in many forms; Wikipedia lists: linear regression, simple regression, ordinary least squares, polynomial regression, general linear model, generalized linear model, discrete choice, logistic regression, multinomial logit, mixed logit, probit, multinomial probit, …
 • Generally a tool to predict variables

  12. Regressions for prediction

$$y = f(\mathbf{x}; \Theta) + e$$

 • Different possibilities:
    • $y$ is a scalar: real-valued, or categorical (classification)
    • $y$ is a vector
    • $\mathbf{x}$ is a set of real-valued variables, a set of categorical variables, or a combination of the two
    • $f(\cdot)$ is a linear or affine function, a non-linear function, or a time-series model

  13. A linear regression

(Scatter plot of Y against X.)

 • Assumption: the relationship between the variables is linear; a linear trend may be found relating $x$ and $y$
 • $y$ = dependent variable; $x$ = explanatory variable
 • Given $x$, $y$ can be predicted as an affine function of $x$

  14. An imaginary regression..

 • http://pages.cs.wisc.edu/~kovar/hall.html

“Check this shit out (Fig. 1). That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data possible. Now, let's look a bit more closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered.”

  15. Linear Regressions

$$\mathbf{y} = A\mathbf{x} + \mathbf{b} + \mathbf{e}, \qquad \mathbf{e} = \text{prediction error}$$

 • Given a “training” set of $\{\mathbf{x}, \mathbf{y}\}$ values, estimate $A$ and $\mathbf{b}$:
$$\mathbf{y}_1 = A\mathbf{x}_1 + \mathbf{b} + \mathbf{e}_1, \quad \mathbf{y}_2 = A\mathbf{x}_2 + \mathbf{b} + \mathbf{e}_2, \quad \mathbf{y}_3 = A\mathbf{x}_3 + \mathbf{b} + \mathbf{e}_3, \ \ldots$$
 • If $A$ and $\mathbf{b}$ are well estimated, the prediction error will be small

  16. Linear Regression to a scalar

$$y_1 = \mathbf{a}^T\mathbf{x}_1 + b + e_1, \quad y_2 = \mathbf{a}^T\mathbf{x}_2 + b + e_2, \quad y_3 = \mathbf{a}^T\mathbf{x}_3 + b + e_3$$

 • Define:
$$\mathbf{y} = [y_1\ y_2\ y_3\ \ldots], \quad \mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots \\ 1 & 1 & 1 & \cdots \end{bmatrix}, \quad \mathbf{A} = \begin{bmatrix} \mathbf{a} \\ b \end{bmatrix}, \quad \mathbf{e} = [e_1\ e_2\ e_3\ \ldots]$$
 • Rewrite: $\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$
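A small sketch of this bookkeeping, assuming NumPy (all names and sizes are made up): appending the row of 1s to X folds the offset b into A.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 3, 100
xs = rng.standard_normal((D, N))            # columns are the x_i
X = np.vstack([xs, np.ones((1, N))])        # append the row of 1s: (D+1) x N
a_true = np.array([1.0, -2.0, 0.5])
b_true = 0.3
A = np.concatenate([a_true, [b_true]])      # A stacks a on top of b
y = A @ X + 0.01 * rng.standard_normal(N)   # y = A^T X + e
print(X.shape, A.shape, y.shape)            # (4, 100) (4,) (100,)
```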

  17. Learning the parameters

$$\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}, \qquad \hat{\mathbf{y}} = \mathbf{A}^T\mathbf{X} \ \text{(assuming no error)}$$

 • Given training data: several $(\mathbf{x}, y)$ pairs
 • Can define a “divergence” $D(\mathbf{y}, \hat{\mathbf{y}})$ that measures how much $\hat{\mathbf{y}}$ differs from $\mathbf{y}$
 • Ideally, if the model is accurate, this should be small
 • Estimate $\mathbf{A}$, $b$ to minimize $D(\mathbf{y}, \hat{\mathbf{y}})$

  18. The prediction error as divergence

$$y_i = \mathbf{a}^T\mathbf{x}_i + b + e_i, \qquad \mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$$

$$D(\mathbf{y}, \hat{\mathbf{y}}) = E = e_1^2 + e_2^2 + e_3^2 + \cdots = (y_1 - \mathbf{a}^T\mathbf{x}_1 - b)^2 + (y_2 - \mathbf{a}^T\mathbf{x}_2 - b)^2 + (y_3 - \mathbf{a}^T\mathbf{x}_3 - b)^2 + \cdots$$
$$= \|\mathbf{y} - \mathbf{A}^T\mathbf{X}\|^2 = (\mathbf{y} - \mathbf{A}^T\mathbf{X})(\mathbf{y} - \mathbf{A}^T\mathbf{X})^T$$

 • Define the divergence as the sum of the squared errors in predicting $\mathbf{y}$
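In code this divergence is a one-liner; a sketch in the NumPy conventions used above, where a 1-D array A plays the role of $\mathbf{A}$ (so A @ X computes $\mathbf{A}^T\mathbf{X}$):

```python
import numpy as np

def divergence(y, X, A):
    """Summed squared prediction error D(y, yhat) = ||y - A^T X||^2."""
    r = y - A @ X          # residuals e_i
    return float(r @ r)    # e_1^2 + e_2^2 + e_3^2 + ...
```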

  19. Prediction error as divergence

$$y = \mathbf{a}^T\mathbf{x} + e, \qquad e = \text{prediction error}$$

 • Find the “slope” $\mathbf{a}$ such that the total squared length of the error lines is minimized

  20. Solving a linear regression

$$\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$$

 • Minimize the squared error:
$$E = \|\mathbf{y}^T - \mathbf{X}^T\mathbf{A}\|^2 = (\mathbf{y} - \mathbf{A}^T\mathbf{X})(\mathbf{y} - \mathbf{A}^T\mathbf{X})^T = \mathbf{y}\mathbf{y}^T + \mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A} - 2\mathbf{y}\mathbf{X}^T\mathbf{A}$$
 • Differentiating w.r.t. $\mathbf{A}$ and equating to 0:
$$dE = \big(2\mathbf{A}^T\mathbf{X}\mathbf{X}^T - 2\mathbf{y}\mathbf{X}^T\big)\, d\mathbf{A} = 0$$
$$\Rightarrow\ \mathbf{A}^T = \mathbf{y}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1} = \mathbf{y}\,\mathrm{pinv}(\mathbf{X}), \qquad \mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}^T$$
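A self-contained sketch of the closed form, assuming NumPy; the synthetic data is illustrative. Both the pseudoinverse route and the normal equations give the same answer when X X^T is invertible:

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 3, 200
X = np.vstack([rng.standard_normal((D, N)), np.ones((1, N))])
A_true = rng.standard_normal(D + 1)
y = A_true @ X + 0.01 * rng.standard_normal(N)

A_hat = y @ np.linalg.pinv(X)            # A^T = y pinv(X)
A_ne = np.linalg.solve(X @ X.T, X @ y)   # normal equations: (X X^T) A = X y^T
print(np.allclose(A_hat, A_ne))          # same solution
print(np.allclose(A_hat, A_true, atol=0.05))
```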

  21. An Aside

 • What happens if we minimize the perpendicular distance instead?
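For reference, since the slide leaves the question open: minimizing the perpendicular distance is total least squares, and the best-fit direction is the principal component of the centered data. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(300)
y = 2.0 * x + 0.1 * rng.standard_normal(300)
P = np.stack([x, y])
P = P - P.mean(axis=1, keepdims=True)     # center the data

U, S, Vt = np.linalg.svd(P)               # U[:, 0]: direction of max variance
slope_tls = U[1, 0] / U[0, 0]
print(slope_tls)                          # near 2.0; differs from the OLS slope
```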

  22. Regression in multiple dimensions

$$\mathbf{y}_1 = A^T\mathbf{x}_1 + \mathbf{b} + \mathbf{e}_1, \quad \mathbf{y}_2 = A^T\mathbf{x}_2 + \mathbf{b} + \mathbf{e}_2, \quad \mathbf{y}_3 = A^T\mathbf{x}_3 + \mathbf{b} + \mathbf{e}_3 \qquad (\mathbf{y}_i \text{ is a vector})$$

 • Notation: $y_{ij}$ = $j$th component of vector $\mathbf{y}_i$; $\mathbf{a}_i$ = $i$th column of $A$; $b_i$ = $i$th component of $\mathbf{b}$
 • Also called multiple regression
 • Equivalent to saying:
$$y_{11} = \mathbf{a}_1^T\mathbf{x}_1 + b_1 + e_{11}, \quad y_{12} = \mathbf{a}_2^T\mathbf{x}_1 + b_2 + e_{12}, \quad y_{13} = \mathbf{a}_3^T\mathbf{x}_1 + b_3 + e_{13}$$
 • Fundamentally no different from $N$ separate single regressions
 • But we can use the relationship between the $\mathbf{y}$s to our benefit

  23. Multiple Regression

 • Define:
$$\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \mathbf{y}_3\ \ldots], \quad \mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots \\ 1 & 1 & 1 & \cdots \end{bmatrix}, \quad \mathbf{A} = \begin{bmatrix} A \\ \mathbf{b}^T \end{bmatrix}, \quad \mathbf{E} = [\mathbf{e}_1\ \mathbf{e}_2\ \mathbf{e}_3\ \ldots]$$
(the second row of $\mathbf{X}$ is a row of 1s, which folds $\mathbf{b}$ into $\mathbf{A}$)
$$\mathbf{Y} = \mathbf{A}^T\mathbf{X} + \mathbf{E}$$
$$\mathrm{DIV} = \sum_i \|\mathbf{y}_i - A^T\mathbf{x}_i - \mathbf{b}\|^2 = \mathrm{trace}\big((\mathbf{Y} - \mathbf{A}^T\mathbf{X})(\mathbf{Y} - \mathbf{A}^T\mathbf{X})^T\big)$$
 • Differentiating and equating to 0:
$$d\,\mathrm{Div} = \big(2\mathbf{A}^T\mathbf{X}\mathbf{X}^T - 2\mathbf{Y}\mathbf{X}^T\big)\, d\mathbf{A} = 0 \ \Rightarrow\ \mathbf{A}^T = \mathbf{Y}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1} = \mathbf{Y}\,\mathrm{pinv}(\mathbf{X}), \quad \mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$$
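The same recipe in matrix form; a self-contained sketch assuming NumPy, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(6)
D, M, N = 3, 2, 500                        # input dim, output dim, samples
A_true = rng.standard_normal((D + 1, M))   # last row acts as the offset b
X = np.vstack([rng.standard_normal((D, N)), np.ones((1, N))])
Y = A_true.T @ X + 0.01 * rng.standard_normal((M, N))

A_hat = (Y @ np.linalg.pinv(X)).T          # A^T = Y pinv(X)
print(np.allclose(A_hat, A_true, atol=0.01))
```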

  24. A Different Perspective

 • $\mathbf{y}$ is a noisy reading of $\mathbf{A}^T\mathbf{x}$:
$$\mathbf{y} = \mathbf{A}^T\mathbf{x} + \mathbf{e}$$
 • The error $\mathbf{e}$ is Gaussian: $\mathbf{e} \sim N(\mathbf{0}, \sigma^2 I)$
 • Estimate $\mathbf{A}$ from $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \ldots\ \mathbf{y}_N]$ and $\mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2\ \ldots\ \mathbf{x}_N]$
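Under this Gaussian-noise view the least-squares solution is also the maximum-likelihood estimate of A; a small simulation, assuming NumPy (no bias term here, matching the slide):

```python
import numpy as np

rng = np.random.default_rng(7)
D, M, N, sigma = 4, 3, 1000, 0.5
A = rng.standard_normal((D, M))
X = rng.standard_normal((D, N))
Y = A.T @ X + sigma * rng.standard_normal((M, N))    # e ~ N(0, sigma^2 I)

A_ml = (Y @ np.linalg.pinv(X)).T                     # ML estimate under Gaussian e
print(np.linalg.norm(A_ml - A) / np.linalg.norm(A))  # small relative error
```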
