MLE & Regression
Ken Kreutz-Delgado (Nuno Vasconcelos)
UCSD – ECE 175A – Winter 2012
Statistical Learning

• Goal: Given a relationship between a feature vector x and a vector y, and iid sample data (x_i, y_i), find an approximating function $\hat{f}(x) \approx y$:

      x → f(·) → ŷ = f(x) ≈ y

• This is called training or learning.
• Two major types of learning:
  – Unsupervised classification (aka clustering) or regression ("blind" curve fitting): only X is known.
  – Supervised classification or regression: both X and the target value Y are known during training; only X is known at test time.
Supervised Learning & Regression

• X can be anything, but the type of Y dictates the type of supervised learning problem:
  – Y in {0, 1} is referred to as detection.
  – Y in {0, ..., M-1} is referred to as (M-ary) classification.
  – Y continuous is referred to as regression.
• We have been dealing mostly with classification; now we will emphasize regression.
• The regression problem provides a relatively easy setting in which to explain non-trivial MLE problems.
The Standard Regression Model

• The regression problem is usually modeled as follows:
  – There are two random vectors: the independent (regressor) variable X and the dependent (regressed) variable Y.
  – An iid dataset of training examples D = {(x_1, y_1), ..., (x_n, y_n)}.
  – An additive-noise parametric model of the form

      $$Y = f(X;\theta) + E$$

    where $\theta \in \mathbb{R}^p$ is a deterministic parameter vector and E is an iid additive random vector that accounts for noise and model error.
• Two fundamental types of regression problems:
  – Linear regression, where f(·;θ) is linear in θ.
  – Nonlinear regression, otherwise.
  – What matters is linearity in the parameter θ, not in the data X!
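A minimal simulation sketch of this additive-noise model (not from the slides): the specific choice of a scalar line model, the parameter values, and the Gaussian noise are illustrative assumptions only.

```python
import numpy as np

# Simulate the additive-noise model Y = f(X; theta) + E.
# Here f is a line, f(x; theta) = theta_0 + theta_1 * x, and E is zero-mean
# Gaussian noise independent of X; all numbers are illustrative.
rng = np.random.default_rng(0)

theta_true = np.array([1.0, 2.0])      # [theta_0, theta_1]
sigma = 0.5                            # noise standard deviation

n = 100
x = rng.uniform(-1.0, 1.0, size=n)     # iid regressor samples
e = rng.normal(0.0, sigma, size=n)     # iid noise, independent of X
y = theta_true[0] + theta_true[1] * x + e   # dependent variable

# The dataset D = {(x_i, y_i)} is what the learner actually observes.
print(x[:3], y[:3])
```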
Example Regression Models

• Linear regression (linear in θ):
  – Line fitting: $f(x;\theta) = \theta_0 + \theta_1 x$
  – Polynomial fitting: $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i$
  – Truncated Fourier series: $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(ix)$
  – Etc.
• Nonlinear regression:
  – Neural networks, e.g. the sigmoidal unit $f(x;\theta) = \dfrac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}$
  – Sinusoidal decompositions with unknown frequencies, e.g. $f(x;\theta) = \sum_{i=0}^{k} \alpha_i \cos(\omega_i x)$ with $\theta = (\alpha_0, \omega_0, \ldots, \alpha_k, \omega_k)$
• We often assume that E is additive white Gaussian noise (AWGN).
• We always assume that E and X are independent.
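To make "linear in θ, not in x" concrete: for the polynomial and truncated-Fourier models above, $f(x;\theta) = \Phi(x)\,\theta$ for a feature (design) matrix $\Phi$ that depends only on x. A short sketch, with the basis sizes and test values chosen arbitrarily:

```python
import numpy as np

def design_matrix(x, k, basis="poly"):
    """Feature matrix Phi such that f(x; theta) = Phi @ theta.

    Both bases are linear in theta even though they are nonlinear in x,
    which is why polynomial and truncated-Fourier fitting are *linear*
    regression problems.
    """
    x = np.asarray(x, dtype=float)
    if basis == "poly":
        cols = [x**i for i in range(k + 1)]           # 1, x, x^2, ...
    elif basis == "fourier":
        cols = [np.cos(i * x) for i in range(k + 1)]  # cos(0), cos(x), cos(2x), ...
    else:
        raise ValueError("unknown basis")
    return np.stack(cols, axis=1)

x = np.linspace(0.0, 1.0, 5)
theta = np.array([1.0, -0.5, 0.25])
Phi = design_matrix(x, k=2, basis="poly")
print(Phi @ theta)   # f(x; theta) evaluated at all sample points
```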
Probabilistic Model of Y Conditioned on X

• A realization is X = x, E = e, Y = y, with

      $$y = f(x;\theta) + e$$

  – x is (almost) always known; the goal is to predict y given x.
  – Thus, for each x, f(x;θ) is treated like a constant.
  – The realization E = e is added to f(x;θ) to form Y = y.
  – Hence, Y is conditionally distributed as E, but with a constant added.
  – This only changes the mean of the distribution of E, $P_E(\epsilon;\theta)$, yielding

      $$P_{Y|X}(y \mid x;\theta) = P_E\big(y - f(x;\theta);\theta\big)$$

  – The conditional probability model for Y | X is determined by the distribution of the noise, $P_E(\epsilon;\theta)$!
• Also note that the noise pdf, $P_E(\epsilon;\theta)$, might itself depend on the unknown parameter vector θ.
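A quick numerical check of $P_{Y|X}(y \mid x;\theta) = P_E(y - f(x;\theta))$ for the Gaussian-noise case; the line model and the particular numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

# Verify that evaluating the Gaussian Y|X pdf at y equals evaluating the
# noise pdf at the residual e = y - f(x; theta). All values are illustrative.
theta = np.array([1.0, 2.0])
sigma = 0.5
f = lambda x: theta[0] + theta[1] * x

x, y = 0.3, 1.9
residual = y - f(x)

lhs = norm.pdf(y, loc=f(x), scale=sigma)         # Y | X=x ~ N(f(x; theta), sigma^2)
rhs = norm.pdf(residual, loc=0.0, scale=sigma)   # noise pdf at the residual
print(lhs, rhs)   # identical up to floating point
```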
The (Conditional) Likelihood Function

• Consider a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}. If we define X = {x_1, ..., x_n} and Y = {y_1, ..., y_n}, we have D = (X, Y).
• Conditioned on X, the likelihood of θ given D is

  $$P_{Y|X}(Y \mid X;\theta) = \prod_{i=1}^{n} P_{Y|X}(y_i \mid x_i;\theta) = \prod_{i=1}^{n} P_E\big(y_i - f(x_i;\theta);\theta\big)$$

• This is also the X-conditional likelihood of θ given Y.
• Note: we have used the facts that the y_i are conditionally iid and that each y_i depends only on x_i (both facts being a consequence of our modeling assumptions).
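As a sketch of how this factorization is used in practice, the log-likelihood of θ is just a sum of log noise-pdf values at the residuals. The Gaussian noise, known σ, and line model below are illustrative assumptions, not part of the slides.

```python
import numpy as np

def log_likelihood(theta, x, y, sigma, f):
    """X-conditional log-likelihood: sum_i log P_E(y_i - f(x_i; theta)).

    Assumes iid zero-mean Gaussian noise with known sigma; f(x, theta) is the
    regression function (a line below, purely for illustration).
    """
    e = y - f(x, theta)   # residuals y_i - f(x_i; theta)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - e**2 / (2 * sigma**2))

f = lambda x, th: th[0] + th[1] * x
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 50)

print(log_likelihood(np.array([1.0, 2.0]), x, y, 0.5, f))   # near the truth: high
print(log_likelihood(np.array([0.0, 0.0]), x, y, 0.5, f))   # far away: much lower
```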
Maximum Likelihood Estimation

• This suggests that, given a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}, the natural procedure to estimate the parameter θ is ML estimation:

  $$\hat\theta_{ML} = \arg\max_\theta \prod_i P_{Y|X}(y_i \mid x_i;\theta) = \arg\max_\theta \prod_i P_E\big(y_i - f(x_i;\theta);\theta\big)$$

  Equivalently,

  $$\hat\theta_{ML} = \arg\max_\theta \sum_i \log P_{Y|X}(y_i \mid x_i;\theta) = \arg\max_\theta \sum_i \log P_E\big(y_i - f(x_i;\theta);\theta\big)$$

• Note that the noise pdf, $P_E(\epsilon;\theta)$, can also possibly depend on θ.
AWGN MLE

• One frequently used model is the scalar AWGN case, where the noise is zero-mean with variance $\sigma^2$:

  $$P_E(e) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{e^2}{2\sigma^2}\right\}$$

• In this case the conditional pdf for Y | X is a Gaussian of mean $f(x;\theta)$ and variance $\sigma^2$:

  $$P_{Y|X}(y \mid x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{\big(y - f(x;\theta)\big)^2}{2\sigma^2}\right\}$$

• If the variance $\sigma^2$ is unknown, it is included in θ.
AWGN MLE

• Assume the variance $\sigma^2$ is known. Then the MLE is:

  $$\begin{aligned}
  \hat\theta_{ML} &= \arg\max_\theta \sum_i \log P_E\big(y_i - f(x_i;\theta)\big) \\
  &= \arg\min_\theta \sum_i \left[ \frac{\big(y_i - f(x_i;\theta)\big)^2}{2\sigma^2} + \frac{1}{2}\log\big(2\pi\sigma^2\big) \right] \\
  &= \arg\min_\theta \sum_i \big(y_i - f(x_i;\theta)\big)^2
  \end{aligned}$$

• Since this minimizes the squared Euclidean distance of the estimation error (or prediction error), it is also known as least squares curve fitting.
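For a model that is linear in θ, this least-squares problem can be solved directly with a standard linear-algebra routine. A sketch under the same illustrative line model and synthetic data used earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)

# For f(x; theta) = theta_0 + theta_1 x, the AWGN MLE is the least-squares fit
#   theta_hat = argmin_theta sum_i (y_i - Phi_i theta)^2
Phi = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares solution
print(theta_hat)   # close to the generating values [1.0, 2.0]
```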
MLE & Optimal Regression

• The above development can be framed in our initial formulation of optimizing the loss of the learning system:

      x → f(·) → ŷ = f(x;θ),  with loss L(y, ŷ)

• For a regression problem this still applies; the interpretation of f(·) as a predictor even becomes more intuitive.
• Solving by ML is equivalent to picking a loss identical to the negative of the log of the noise probability density.
Loss for Scalar Noise with Known PDF

• With error $\epsilon = y - f(x;\theta)$, the additive-error pdf determines the loss:
  – Gaussian (AWGN case): $P_E(e) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{e^2}{2\sigma^2}}$  →  L2 distance: $L\big(f(x;\theta), y\big) = \big(y - f(x;\theta)\big)^2$
  – Laplacian: $P_E(e) = \frac{1}{2\sigma}\, e^{-\frac{|e|}{\sigma}}$  →  L1 distance: $L\big(f(x;\theta), y\big) = \big|y - f(x;\theta)\big|$
  – Rayleigh: $P_E(e) = \frac{e}{\sigma^2}\, e^{-\frac{e^2}{2\sigma^2}}$, $e \ge 0$  →  "Rayleigh distance": $L\big(f(x;\theta), y\big) = \frac{\big(y - f(x;\theta)\big)^2}{2\sigma^2} - \log\big(y - f(x;\theta)\big)$
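A small sketch (not from the slides) of how the choice of noise model, via its loss, changes the fit: the same line is fitted under the L2 (Gaussian) and L1 (Laplacian) losses on data with a few artificial outliers; the L1 fit is typically much less sensitive to them. The data, outliers, and optimizer settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Fit a line under the L2 loss (Gaussian noise assumption) and the L1 loss
# (Laplacian noise assumption) on data contaminated by a few gross outliers.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)
y[:5] += 10.0                                   # a few gross outliers

f = lambda th, x: th[0] + th[1] * x
l2 = lambda th: np.sum((y - f(th, x)) ** 2)     # Gaussian  -> squared error
l1 = lambda th: np.sum(np.abs(y - f(th, x)))    # Laplacian -> absolute error

theta_l2 = minimize(l2, x0=np.zeros(2)).x
theta_l1 = minimize(l1, x0=np.zeros(2), method="Nelder-Mead").x
print(theta_l2)   # pulled toward the outliers
print(theta_l1)   # much closer to the generating values [1.0, 2.0]
```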
Maximum Likelihood Estimation

• How do we find the optimal parameters?
• Recall that to obtain the MLE we need to solve

  $$\theta^* = \arg\max_\theta P(D;\theta)$$

• The locally optimal solutions are the parameter values $\hat\theta$ such that

  $$\nabla_\theta P(D;\hat\theta) = 0, \qquad \xi^T \big[\nabla_\theta^2 P(D;\hat\theta)\big]\, \xi < 0, \quad \forall \xi \neq 0$$

• Note that you always have to check the second-order Hessian condition!
Maximum Likelihood Estimation

• Recall some important results.
• FACT: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:
  i) $x^T A x > 0, \ \forall x \neq 0$
  ii) All eigenvalues of A are real and satisfy $\lambda_i > 0$.
  iii) All upper-left submatrices $A_k$ have strictly positive determinant (strictly positive leading principal minors).
  iv) There exists a matrix R with independent rows such that $A = R R^T$. Equivalently, there exists a matrix Q with independent columns such that $A = Q^T Q$.
• Definition of the upper-left submatrices:

  $$A_1 = a_{1,1}, \qquad A_2 = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}, \qquad A_3 = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix}, \qquad \ldots$$
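A small numerical sketch (not from the slides) that checks positive definiteness using two of the equivalent conditions above; the helper name and the test matrices are made up for illustration.

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Check conditions (ii) and (iv) from the list above for a symmetric A."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False
    # (ii) all eigenvalues strictly positive (eigvalsh handles symmetric A)
    if np.min(np.linalg.eigvalsh(A)) <= tol:
        return False
    # (iv) a full-rank factor R with A = R R^T exists; the Cholesky
    # factorization succeeds exactly when A is positive definite.
    try:
        np.linalg.cholesky(A)
    except np.linalg.LinAlgError:
        return False
    return True

print(is_positive_definite([[2.0, -1.0], [-1.0, 2.0]]))   # True  (eigenvalues 1, 3)
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))     # False (eigenvalue -1)
```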
Vector Derivatives

• To compute the gradient and Hessian it is useful to rely on vector derivatives (defined here as row operators).
• Some important identities that we will use:

  $$\frac{\partial}{\partial\theta}\big(A\theta\big) = A, \qquad \frac{\partial}{\partial\theta}\big(\theta^T A \theta\big) = \theta^T\big(A + A^T\big), \qquad \frac{\partial}{\partial\theta}\,\big\|b - A\theta\big\|^2 = -2\,(b - A\theta)^T A$$

• To find the equivalent Cartesian (column) gradient identities, merely transpose the above.
• There are various lists of the most popular formulas. Just Google "vector derivatives" or "matrix derivatives".
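As a sanity check on the last identity, the analytic row gradient can be compared against central finite differences. This is a sketch only; the matrices and step size below are arbitrary.

```python
import numpy as np

# Finite-difference check of the row-gradient identity
#   d/dtheta ||b - A theta||^2 = -2 (b - A theta)^T A
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
theta = rng.normal(size=3)

g = lambda th: np.sum((b - A @ th) ** 2)

analytic = -2.0 * (b - A @ theta) @ A          # length-3 gradient (row form)

eps = 1e-6
numeric = np.array([(g(theta + eps * np.eye(3)[i]) - g(theta - eps * np.eye(3)[i]))
                    / (2 * eps) for i in range(3)])
print(analytic)
print(numeric)   # agrees to roughly 6 decimal places
```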
MLE for Known Additive Noise PDF

• For regression this becomes

  $$\hat\theta_{ML} = \arg\max_\theta \sum_i \log P_{Y|X}(y_i \mid x_i;\theta) = \arg\min_\theta \sum_i L(y_i, x_i;\theta)$$

• The locally optimal solutions, $\hat\theta$, are given by

  $$\frac{\partial}{\partial\theta} \sum_i L(y_i, x_i;\hat\theta) = 0, \qquad \xi^T \left[\frac{\partial^2}{\partial\theta^2} \sum_i L(y_i, x_i;\hat\theta)\right] \xi > 0, \quad \forall \xi \neq 0$$
MLE and Regression

• Noting that the vector derivative and Hessian are linear operators (because derivatives are linear operators), these conditions can be written as

  $$\sum_i \frac{\partial}{\partial\theta} L(y_i, x_i;\hat\theta) = 0$$

  and

  $$\xi^T \left[\sum_i \frac{\partial^2}{\partial\theta^2} L(y_i, x_i;\hat\theta)\right] \xi > 0, \quad \forall \xi \neq 0$$
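A concrete sketch of these two conditions for the special case of a model linear in θ with squared loss (an illustrative choice): the stationarity condition reduces to the normal equations, and the Hessian $2\Phi^T\Phi$ is positive definite whenever the design matrix has independent columns, which matches condition (iv) on the earlier slide.

```python
import numpy as np

# For f(x; theta) = Phi(x) theta with squared loss, the first-order condition
#   sum_i dL/dtheta = 0   gives the normal equations  Phi^T Phi theta = Phi^T y,
# and the Hessian is 2 Phi^T Phi. Data and model below are illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)

Phi = np.column_stack([np.ones_like(x), x])
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # normal equations

grad = -2.0 * Phi.T @ (y - Phi @ theta_hat)           # first-order condition
hess = 2.0 * Phi.T @ Phi                              # second-order condition
print(theta_hat)
print(np.max(np.abs(grad)))        # ~0: a stationary point
print(np.linalg.eigvalsh(hess))    # all eigenvalues > 0: a minimum
```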