  1. MLE & Regression Ken Kreutz-Delgado (Nuno Vasconcelos) UCSD – ECE 175A – Winter 2012

  2. Statistical Learning
  Goal: Given a relationship between a feature vector x and a vector y, and iid sample data (x_i, y_i), find an approximating function $\hat{y} = f(x) \approx y$.
  [Block diagram: x → f(·) → ŷ]
  This is called training or learning. Two major types of learning:
  • Unsupervised Classification (aka Clustering) or Regression (“blind” curve fitting): only X is known.
  • Supervised Classification or Regression: both X and target value Y are known during training, only X is known at test time.

  3. Supervised Learning & Regression
  • X can be anything, but the type of Y dictates the type of supervised learning problem:
    – Y in {0,1} is referred to as detection.
    – Y in {0, ..., M-1} is referred to as (M-ary) classification.
    – Y continuous is referred to as regression.
  • We have been dealing mostly with classification; now we will emphasize regression.
  • The regression problem provides a relatively easy setting in which to explain non-trivial MLE problems.

  4. The Standard Regression Model
  • The regression problem is usually modeled as follows:
    – There are two random vectors: the independent (regressor) variable X and the dependent (regressed) variable Y.
    – An iid dataset of training examples D = {(x_1, y_1), ..., (x_n, y_n)}.
    – An additive-noise parametric model of the form $Y = f(X; \theta) + E$, where $\theta \in \mathbb{R}^p$ is a deterministic parameter vector and E is an iid additive random vector that accounts for noise and model error.
  • Two fundamental types of regression problems:
    – Linear regression, where f(·) is linear in θ.
    – Nonlinear regression, otherwise.
    – What matters is linearity in the parameter θ, not in the data X!
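
  As a concrete illustration (not from the slides), here is a minimal Python sketch that simulates data from this additive-noise model. It assumes a scalar linear f(x;θ) = θ₀ + θ₁x and Gaussian noise; the parameter values and noise level are arbitrary choices made for the example.

      import numpy as np

      rng = np.random.default_rng(0)

      # Assumed example model: f(x; theta) = theta0 + theta1 * x  (linear in theta)
      theta_true = np.array([1.0, 2.0])   # [theta0, theta1], chosen arbitrarily
      sigma = 0.3                         # noise standard deviation (assumed)

      n = 50
      x = rng.uniform(-1.0, 1.0, size=n)           # regressor X
      e = rng.normal(0.0, sigma, size=n)           # iid additive noise E
      y = theta_true[0] + theta_true[1] * x + e    # Y = f(X; theta) + E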

  5. Example Regression Models
  • Linear Regression:
    – Line fitting: $f(x;\theta) = \theta_0 + \theta_1 x$
    – Polynomial fitting: $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i$
    – Truncated Fourier series: $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(i x)$
  • Nonlinear Regression:
    – Neural networks: $f(x;\theta) = \dfrac{1}{1 + e^{-\theta_1 x - \theta_0}}$
    – Sinusoidal decompositions: $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(\omega_i x)$, where the frequencies $\omega_i$ are among the unknown parameters (hence nonlinear in the full parameter set)
    – Etc.
  • We often assume that E is additive white Gaussian noise (AWGN).
  • We always assume that E and X are independent.
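
  The linear/nonlinear distinction can be made concrete in code. The sketch below (an illustration, not part of the slides) writes the linear-in-θ models as a feature map multiplied by θ, and contrasts them with a sigmoid model that is nonlinear in θ; the function names and parameter values are assumptions.

      import numpy as np

      # Models that are LINEAR in theta: f(x; theta) = phi(x) @ theta for a feature map phi.
      def poly_features(x, k):
          """Polynomial fitting: f(x;theta) = sum_i theta_i x^i."""
          return np.vander(np.asarray(x), N=k + 1, increasing=True)

      def fourier_features(x, k):
          """Truncated Fourier series: f(x;theta) = sum_i theta_i cos(i x)."""
          x = np.asarray(x)[:, None]
          return np.cos(x * np.arange(k + 1))

      # A model that is NONLINEAR in theta (sigmoid / "neural network" example).
      def sigmoid_model(x, theta):
          """f(x;theta) = 1 / (1 + exp(-(theta0 + theta1 x)))."""
          return 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * np.asarray(x))))

      x = np.linspace(-1, 1, 5)
      theta = np.array([0.5, -1.0, 2.0])
      print(poly_features(x, 2) @ theta)      # linear in theta
      print(fourier_features(x, 2) @ theta)   # linear in theta
      print(sigmoid_model(x, theta[:2]))      # nonlinear in theta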

  6. Probabilistic Model of Y Conditioned on X
  • A realization is X = x, E = e, Y = y, with $y = f(x;\theta) + e$.
    – x is (almost) always known; the goal is to predict y given x.
    – Thus, for each x, f(x;θ) is treated like a constant.
    – The realization E = e is added to f(x;θ) to form Y = y.
    – Hence, Y is conditionally distributed as E, but with a constant added.
    – This only changes the mean of the distribution of E, $P_E(\varepsilon;\theta)$, yielding
      $P_{Y|X}(y \mid x; \theta) = P_E\!\left(y - f(x;\theta); \theta\right)$
    – The conditional probability model for Y | X is determined from the distribution of the noise, $P_E(\varepsilon;\theta)$! Also note that the noise pdf, $P_E(\varepsilon;\theta)$, might itself depend on the unknown parameter vector θ.
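
  A small sketch of this construction (assuming Gaussian noise and a linear f(x;θ), as an illustration only): the conditional density of Y given X = x is the noise density evaluated at the residual y − f(x;θ).

      import numpy as np

      def gaussian_noise_pdf(e, sigma):
          """Assumed AWGN noise density P_E(e)."""
          return np.exp(-e**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

      def cond_pdf_y_given_x(y, x, theta, sigma):
          """P_{Y|X}(y|x;theta) = P_E(y - f(x;theta)): the noise pdf shifted to mean f(x;theta)."""
          f = theta[0] + theta[1] * x          # example linear f(x;theta) -- an assumption
          return gaussian_noise_pdf(y - f, sigma)

      print(cond_pdf_y_given_x(y=1.2, x=0.5, theta=np.array([1.0, 2.0]), sigma=0.3))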

  7. The (Conditional) Likelihood Function
  • Consider a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}. If we define X = {x_1, ..., x_n} and Y = {y_1, ..., y_n}, we have D = (X, Y).
  • Conditioned on X, the likelihood of θ given D is
      $P_{Y|X}(Y \mid X; \theta) = \prod_{i=1}^{n} P_{Y|X}(y_i \mid x_i; \theta) = \prod_{i=1}^{n} P_E\!\left(y_i - f(x_i;\theta); \theta\right)$
  • This is also the X-conditional likelihood of θ given Y.
  • Note: we have used the facts that the y_i are conditionally iid and that each y_i depends only on x_i (both facts being a consequence of our modeling assumptions).
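
  As an illustration (assuming Gaussian noise and a linear f, as in the earlier sketches), the X-conditional log-likelihood is a sum of per-sample terms:

      import numpy as np

      def log_likelihood(theta, x, y, sigma):
          """X-conditional log-likelihood: sum_i log P_E(y_i - f(x_i;theta)), Gaussian P_E assumed."""
          resid = y - (theta[0] + theta[1] * x)          # y_i - f(x_i;theta), linear f assumed
          return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, 50)
      y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 50)         # synthetic data (assumed model)
      print(log_likelihood(np.array([1.0, 2.0]), x, y, sigma=0.3))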

  8. Maximum Likelihood Estimation
  • This suggests that, given a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}, the natural procedure to estimate the parameter θ is ML estimation:
      $\hat{\theta}_{ML} = \arg\max_{\theta} \prod_i P_{Y|X}(y_i \mid x_i; \theta) = \arg\max_{\theta} \prod_i P_E\!\left(y_i - f(x_i;\theta); \theta\right)$
    Equivalently,
      $\hat{\theta}_{ML} = \arg\max_{\theta} \sum_i \log P_{Y|X}(y_i \mid x_i; \theta) = \arg\max_{\theta} \sum_i \log P_E\!\left(y_i - f(x_i;\theta); \theta\right)$
  • Note that the noise pdf, $P_E(\varepsilon;\theta)$, can also possibly depend on θ.
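
  A minimal sketch of ML estimation by brute force, under the same assumed linear-plus-Gaussian model: maximize the log-likelihood (equivalently, minimize its negative) over a grid of candidate θ values. The grid and data are illustrative assumptions, not part of the slides.

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, 200)
      y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 200)   # synthetic data, assumed linear model

      def neg_log_lik(theta, sigma=0.3):
          """Negative X-conditional log-likelihood under AWGN."""
          resid = y - (theta[0] + theta[1] * x)
          return np.sum(resid**2) / (2 * sigma**2) + len(x) * 0.5 * np.log(2 * np.pi * sigma**2)

      # Brute-force maximization of the likelihood over a grid of candidate thetas.
      grid = np.linspace(-3, 3, 121)
      candidates = [(a, b) for a in grid for b in grid]
      theta_ml = min(candidates, key=neg_log_lik)
      print(theta_ml)   # close to (1.0, 2.0)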

  9. AWGN MLE
  • One frequently used model is the scalar AWGN case, where the noise is zero-mean Gaussian with variance σ²:
      $P_E(\varepsilon) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\dfrac{\varepsilon^2}{2\sigma^2}\right)$
  • In this case the conditional pdf for Y | X is a Gaussian of mean f(x;θ) and variance σ²:
      $P_{Y|X}(y \mid x; \theta) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\dfrac{\left(y - f(x;\theta)\right)^2}{2\sigma^2}\right)$
  • If the variance σ² is unknown, it is included in θ.
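
  A quick Monte Carlo check of this statement (illustrative, with assumed parameter values): for a fixed x, repeated draws of Y = f(x;θ) + E have sample mean ≈ f(x;θ) and sample standard deviation ≈ σ.

      import numpy as np

      rng = np.random.default_rng(1)
      theta = np.array([1.0, 2.0])      # assumed linear model f(x;theta) = theta0 + theta1 x
      sigma, x0 = 0.3, 0.5

      # Draw many realizations of Y for the fixed input x0: Y = f(x0;theta) + E, E ~ N(0, sigma^2).
      ys = theta[0] + theta[1] * x0 + rng.normal(0.0, sigma, size=100_000)

      # Empirically, Y | X = x0 is Gaussian with mean f(x0;theta) and variance sigma^2.
      print(ys.mean(), theta[0] + theta[1] * x0)    # sample mean vs f(x0;theta)
      print(ys.std(), sigma)                        # sample std  vs sigma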

  10. AWGN MLE
  • Assume the variance σ² is known. Then the MLE is:
      $\hat{\theta}_{ML} = \arg\max_{\theta} \sum_i \log P_E\!\left(y_i - f(x_i;\theta)\right)$
      $\phantom{\hat{\theta}_{ML}} = \arg\min_{\theta} \sum_i \left[ \tfrac{1}{2}\log(2\pi\sigma^2) + \dfrac{\left(y_i - f(x_i;\theta)\right)^2}{2\sigma^2} \right]$
      $\phantom{\hat{\theta}_{ML}} = \arg\min_{\theta} \sum_i \left(y_i - f(x_i;\theta)\right)^2$
  • Since this minimizes the squared Euclidean distance of the estimation error (or prediction error), it is also known as least squares curve fitting.
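
  For a model that is linear in θ, this least-squares problem has a closed-form solution. A minimal sketch (assuming f(x;θ) = θ₀ + θ₁x and synthetic data) using a standard linear least-squares solver:

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, 200)
      y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 200)      # synthetic data, assumed model

      # For f(x;theta) = theta0 + theta1 x, the AWGN MLE is the least-squares solution
      # argmin_theta sum_i (y_i - f(x_i;theta))^2, computed here with a linear solver.
      Phi = np.column_stack([np.ones_like(x), x])      # design matrix
      theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)
      print(theta_ls)    # close to [1.0, 2.0]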

  11. MLE & Optimal Regression
  • The above development can be framed in our initial formulation of optimizing the loss of the learning system.
    [Block diagram: x → f(·) → ŷ = f(x;θ), compared with y through the loss L(y, ŷ)]
  • For a regression problem this still applies; the interpretation of f(·) as a predictor becomes even more intuitive.
  • Solving by ML is equivalent to picking a loss identical to the negative of the log of the noise probability density.

  12. Loss for Scalar Noise with Known PDF
  • Additive error pdf $P_E(\varepsilon)$ and the corresponding loss, with $\varepsilon = y - f(x;\theta)$:
    – Gaussian (AWGN case): $P_E(\varepsilon) = \dfrac{1}{\sqrt{2\pi}\,\sigma} e^{-\varepsilon^2 / 2\sigma^2}$  →  L2 distance: $L(f(x;\theta), y) = \left(y - f(x;\theta)\right)^2$
    – Laplacian: $P_E(\varepsilon) = \dfrac{1}{2\sigma} e^{-|\varepsilon| / \sigma}$  →  L1 distance: $L(f(x;\theta), y) = \left|y - f(x;\theta)\right|$
    – Rayleigh: $P_E(\varepsilon) = \dfrac{\varepsilon}{\sigma^2} e^{-\varepsilon^2 / 2\sigma^2}$  →  Rayleigh distance: $L(f(x;\theta), y) = \left(y - f(x;\theta)\right)^2 - 2\sigma^2 \log\!\left(y - f(x;\theta)\right)$
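
  A small sketch of these three losses as Python functions (illustrative only; additive constants and positive scale factors are dropped or assumed, and the Rayleigh case is only defined for positive residuals):

      import numpy as np

      # Each loss is (up to additive constants and positive scaling) the negative log of the
      # corresponding noise density, evaluated at the residual eps = y - f(x;theta).
      def l2_loss(eps):                 # Gaussian noise  ->  squared error
          return eps**2

      def l1_loss(eps):                 # Laplacian noise ->  absolute error
          return np.abs(eps)

      def rayleigh_loss(eps, sigma=1.0):
          # Rayleigh noise (support eps > 0): -log P_E(eps) = eps^2/(2 sigma^2) - log(eps) + const
          return eps**2 / (2 * sigma**2) - np.log(eps)

      eps = np.array([0.5, 1.0, 2.0])
      print(l2_loss(eps), l1_loss(eps), rayleigh_loss(eps))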

  13. Maximum Likelihood Estimation
  • How do we find the optimal parameters?
  • Recall that to obtain the MLE we need to solve
      $\theta^{*} = \arg\max_{\theta} P_D(D; \theta)$
  • The locally optimal solutions are the parameter values $\hat{\theta}$ such that
      $\nabla_{\theta} P_D(D; \hat{\theta}) = 0$  and  $t^{T}\, \nabla_{\theta}^{2} P_D(D; \hat{\theta})\, t < 0,\ \forall\, t \neq 0$
  • Note that you always have to check the second-order Hessian condition!
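
  A toy numerical check of these two conditions (an assumed example, not from the slides): for the MLE of a Gaussian mean with unit variance, the first derivative of the log-likelihood vanishes at the sample mean and the second derivative is negative, confirming a maximum.

      import numpy as np

      # Toy example: L(mu) = sum_i log N(d_i; mu, 1), maximized over the scalar parameter mu.
      rng = np.random.default_rng(0)
      d = rng.normal(3.0, 1.0, 1000)

      mu_hat = d.mean()                       # candidate MLE

      grad = np.sum(d - mu_hat)               # dL/dmu   = sum_i (d_i - mu);  ~0 at mu_hat
      hess = -len(d)                          # d2L/dmu2 = -n < 0  (negative definite -> maximum)
      print(grad, hess)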

  14. Maximum Likelihood Estimation
  Recall some important results:
  • FACT: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:
    i) $x^T A x > 0,\ \forall\, x \neq 0$
    ii) All eigenvalues of A are real and satisfy $\lambda_i > 0$
    iii) All upper-left submatrices $A_k$ have strictly positive determinant (strictly positive leading principal minors).
    iv) There exists a matrix R with independent rows such that $A = R R^T$. Equivalently, there exists a matrix Q with independent columns such that $A = Q^T Q$.
  • Definition of the upper-left submatrices:
      $A_1 = a_{1,1}, \quad A_2 = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}, \quad A_3 = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix}, \quad \ldots$
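
  A minimal sketch that checks conditions (ii)–(iv) numerically for a given symmetric matrix (illustrative; the test matrix is an arbitrary assumption):

      import numpy as np

      def is_pos_def(A):
          """Check the equivalent positive-definiteness tests listed above (symmetric A assumed)."""
          eigs_ok = np.all(np.linalg.eigvalsh(A) > 0)                       # (ii) eigenvalues > 0
          minors_ok = all(np.linalg.det(A[:k, :k]) > 0                      # (iii) leading principal
                          for k in range(1, A.shape[0] + 1))                #       minors > 0
          try:                                                              # (iv) A = R R^T exists
              np.linalg.cholesky(A)
              chol_ok = True
          except np.linalg.LinAlgError:
              chol_ok = False
          return eigs_ok, minors_ok, chol_ok

      A = np.array([[2.0, -1.0], [-1.0, 2.0]])
      print(is_pos_def(A))          # (True, True, True)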

  15. Vector Derivatives
  • To compute the gradient and Hessian it is useful to rely on vector derivatives (defined as row operators).
  • Some important identities that we will use:
      $\dfrac{\partial}{\partial\theta}\left(A\theta\right) = A$
      $\dfrac{\partial}{\partial\theta}\left(\theta^T A \theta\right) = \theta^T (A + A^T)$
      $\dfrac{\partial}{\partial\theta}\,\|b - A\theta\|^2 = -2\,(b - A\theta)^T A$
  • To find equivalent Cartesian gradient identities, merely transpose the above.
  • There are various lists of the most popular formulas. Just Google “vector derivatives” or “matrix derivatives”.
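
  The second identity can be sanity-checked numerically with finite differences, using the row-vector convention stated above (an illustrative sketch; the matrix and evaluation point are random assumptions):

      import numpy as np

      # Finite-difference check of d(theta^T A theta)/d theta = theta^T (A + A^T).
      rng = np.random.default_rng(0)
      A = rng.normal(size=(3, 3))
      theta = rng.normal(size=3)

      def q(t):
          return t @ A @ t

      h = 1e-6
      numeric = np.array([(q(theta + h * e) - q(theta - h * e)) / (2 * h)
                          for e in np.eye(3)])          # numerical row gradient
      analytic = theta @ (A + A.T)
      print(np.allclose(numeric, analytic, atol=1e-5))  # True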

  16. MLE for Known Additive Noise PDF
  • For regression this becomes
      $\hat{\theta}_{ML} = \arg\max_{\theta} \sum_i \log P_{Y|X}(y_i \mid x_i; \theta) = \arg\min_{\theta} \sum_i L(y_i, x_i; \theta)$
  • The locally optimal solutions, $\hat{\theta}$, are given by
      $\dfrac{\partial}{\partial\theta} \sum_i L(y_i, x_i; \theta)\Big|_{\hat{\theta}} = 0$  and  $t^T \left[\dfrac{\partial^2}{\partial\theta^2} \sum_i L(y_i, x_i; \theta)\Big|_{\hat{\theta}}\right] t > 0,\ \forall\, t \neq 0$

  17. MLE and Regression
  • Noting that the vector derivative and the Hessian are linear operators (because derivatives are linear operators), these conditions can be written as
      $\sum_i \dfrac{\partial}{\partial\theta} L(y_i, x_i; \theta)\Big|_{\hat{\theta}} = 0$
    and
      $t^T \left[\sum_i \dfrac{\partial^2}{\partial\theta^2} L(y_i, x_i; \theta)\Big|_{\hat{\theta}}\right] t > 0,\ \forall\, t \neq 0$
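
  For the linear AWGN case, setting the summed gradient of the squared-error loss to zero yields the normal equations $\Phi^T \Phi \theta = \Phi^T y$, and the summed Hessian $2\Phi^T\Phi$ is positive definite whenever the design matrix has full column rank. A hedged sketch under those assumptions (synthetic data, f(x;θ) = θ₀ + θ₁x):

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, 200)
      y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 200)       # synthetic data, assumed linear model

      # For f(x;theta) = Phi(x) theta with L2 loss, summing the per-sample gradients and
      # setting the total to zero gives the normal equations  Phi^T Phi theta = Phi^T y.
      Phi = np.column_stack([np.ones_like(x), x])
      H = 2 * Phi.T @ Phi                                # Hessian of the summed loss
      theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

      print(theta_hat)                                   # close to [1.0, 2.0]
      print(np.all(np.linalg.eigvalsh(H) > 0))           # Hessian positive definite -> a minimum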
