MLE & Regression
Ken Kreutz-Delgado (Nuno Vasconcelos)
UCSD – ECE 175A – Winter 2012
Statistical Learning

• Goal: Given a relationship between a feature vector x and a vector y, and iid sample data (x_i, y_i), find an approximating function $\hat{f}(x) \approx y$:

      x → f(·) → ŷ = f(x) ≈ y

• This is called training or learning.
• Two major types of learning:
  – Unsupervised classification (aka clustering) or regression ("blind" curve fitting): only X is known.
  – Supervised classification or regression: both X and the target value Y are known during training; only X is known at test time.
Supervised Learning & Regression

• X can be anything, but the type of Y dictates the type of supervised learning problem:
  – Y in {0, 1} is referred to as detection.
  – Y in {0, ..., M-1} is referred to as (M-ary) classification.
  – Y continuous is referred to as regression.
• We have been dealing mostly with classification; now we will emphasize regression.
• The regression problem provides a relatively easy setting in which to explain non-trivial MLE problems.
The Standard Regression Model

• The regression problem is usually modeled as follows:
  – There are two random vectors: the independent (regressor) variable X and the dependent (regressed) variable Y.
  – An iid dataset of training examples D = {(x_1, y_1), ..., (x_n, y_n)}.
  – An additive-noise parametric model of the form

      $$Y = f(X;\theta) + E$$

    where $\theta \in \mathbb{R}^p$ is a deterministic parameter vector and E is an iid additive random vector that accounts for noise and model error.
• Two fundamental types of regression problems:
  – Linear regression, where f(·;θ) is linear in θ.
  – Nonlinear regression, otherwise.
  – What matters is linearity in the parameter θ, not in the data X!
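A minimal simulation sketch of this additive-noise model (not from the slides): the specific choice of a scalar line model, the parameter values, and the Gaussian noise are illustrative assumptions only.

```python
import numpy as np

# Simulate the additive-noise model Y = f(X; theta) + E.
# Here f is a line, f(x; theta) = theta_0 + theta_1 * x, and E is zero-mean
# Gaussian noise independent of X; all numbers are illustrative.
rng = np.random.default_rng(0)

theta_true = np.array([1.0, 2.0])      # [theta_0, theta_1]
sigma = 0.5                            # noise standard deviation

n = 100
x = rng.uniform(-1.0, 1.0, size=n)     # iid regressor samples
e = rng.normal(0.0, sigma, size=n)     # iid noise, independent of X
y = theta_true[0] + theta_true[1] * x + e   # dependent variable

# The dataset D = {(x_i, y_i)} is what the learner actually observes.
print(x[:3], y[:3])
```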
Example Regression Models

• Linear regression (linear in θ):
  – Line fitting: $f(x;\theta) = \theta_0 + \theta_1 x$
  – Polynomial fitting: $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i$
  – Truncated Fourier series: $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(ix)$
  – Etc.
• Nonlinear regression:
  – Neural networks, e.g. the sigmoidal unit $f(x;\theta) = \dfrac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}$
  – Sinusoidal decompositions with unknown frequencies, e.g. $f(x;\theta) = \sum_{i=0}^{k} \alpha_i \cos(\omega_i x)$ with $\theta = (\alpha_0, \omega_0, \ldots, \alpha_k, \omega_k)$
• We often assume that E is additive white Gaussian noise (AWGN).
• We always assume that E and X are independent.
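To make "linear in θ, not in x" concrete: for the polynomial and truncated-Fourier models above, $f(x;\theta) = \Phi(x)\,\theta$ for a feature (design) matrix $\Phi$ that depends only on x. A short sketch, with the basis sizes and test values chosen arbitrarily:

```python
import numpy as np

def design_matrix(x, k, basis="poly"):
    """Feature matrix Phi such that f(x; theta) = Phi @ theta.

    Both bases are linear in theta even though they are nonlinear in x,
    which is why polynomial and truncated-Fourier fitting are *linear*
    regression problems.
    """
    x = np.asarray(x, dtype=float)
    if basis == "poly":
        cols = [x**i for i in range(k + 1)]           # 1, x, x^2, ...
    elif basis == "fourier":
        cols = [np.cos(i * x) for i in range(k + 1)]  # cos(0), cos(x), cos(2x), ...
    else:
        raise ValueError("unknown basis")
    return np.stack(cols, axis=1)

x = np.linspace(0.0, 1.0, 5)
theta = np.array([1.0, -0.5, 0.25])
Phi = design_matrix(x, k=2, basis="poly")
print(Phi @ theta)   # f(x; theta) evaluated at all sample points
```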
Probabilistic Model of Y Conditioned on X

• A realization is X = x, E = e, Y = y, with

      $$y = f(x;\theta) + e$$

  – x is (almost) always known; the goal is to predict y given x.
  – Thus, for each x, f(x;θ) is treated like a constant.
  – The realization E = e is added to f(x;θ) to form Y = y.
  – Hence, Y is conditionally distributed as E, but with a constant added.
  – This only changes the mean of the distribution of E, $P_E(\epsilon;\theta)$, yielding

      $$P_{Y|X}(y \mid x;\theta) = P_E\big(y - f(x;\theta);\theta\big)$$

  – The conditional probability model for Y | X is determined by the distribution of the noise, $P_E(\epsilon;\theta)$!
• Also note that the noise pdf, $P_E(\epsilon;\theta)$, might itself depend on the unknown parameter vector θ.
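A quick numerical check of $P_{Y|X}(y \mid x;\theta) = P_E(y - f(x;\theta))$ for the Gaussian-noise case; the line model and the particular numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

# Verify that evaluating the Gaussian Y|X pdf at y equals evaluating the
# noise pdf at the residual e = y - f(x; theta). All values are illustrative.
theta = np.array([1.0, 2.0])
sigma = 0.5
f = lambda x: theta[0] + theta[1] * x

x, y = 0.3, 1.9
residual = y - f(x)

lhs = norm.pdf(y, loc=f(x), scale=sigma)         # Y | X=x ~ N(f(x; theta), sigma^2)
rhs = norm.pdf(residual, loc=0.0, scale=sigma)   # noise pdf at the residual
print(lhs, rhs)   # identical up to floating point
```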
The (Conditional) Likelihood Function

• Consider a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}. If we define X = {x_1, ..., x_n} and Y = {y_1, ..., y_n}, we have D = (X, Y).
• Conditioned on X, the likelihood of θ given D is

  $$P_{Y|X}(Y \mid X;\theta) = \prod_{i=1}^{n} P_{Y|X}(y_i \mid x_i;\theta) = \prod_{i=1}^{n} P_E\big(y_i - f(x_i;\theta);\theta\big)$$

• This is also the X-conditional likelihood of θ given Y.
• Note: we have used the facts that the y_i are conditionally iid and that each y_i depends only on x_i (both facts being a consequence of our modeling assumptions).
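As a sketch of how this factorization is used in practice, the log-likelihood of θ is just a sum of log noise-pdf values at the residuals. The Gaussian noise, known σ, and line model below are illustrative assumptions, not part of the slides.

```python
import numpy as np

def log_likelihood(theta, x, y, sigma, f):
    """X-conditional log-likelihood: sum_i log P_E(y_i - f(x_i; theta)).

    Assumes iid zero-mean Gaussian noise with known sigma; f(x, theta) is the
    regression function (a line below, purely for illustration).
    """
    e = y - f(x, theta)   # residuals y_i - f(x_i; theta)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - e**2 / (2 * sigma**2))

f = lambda x, th: th[0] + th[1] * x
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 50)

print(log_likelihood(np.array([1.0, 2.0]), x, y, 0.5, f))   # near the truth: high
print(log_likelihood(np.array([0.0, 0.0]), x, y, 0.5, f))   # far away: much lower
```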
Maximum Likelihood Estimation

• This suggests that, given a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}, the natural procedure to estimate the parameter θ is ML estimation:

  $$\hat\theta_{ML} = \arg\max_\theta \prod_i P_{Y|X}(y_i \mid x_i;\theta) = \arg\max_\theta \prod_i P_E\big(y_i - f(x_i;\theta);\theta\big)$$

  Equivalently,

  $$\hat\theta_{ML} = \arg\max_\theta \sum_i \log P_{Y|X}(y_i \mid x_i;\theta) = \arg\max_\theta \sum_i \log P_E\big(y_i - f(x_i;\theta);\theta\big)$$

• Note that the noise pdf, $P_E(\epsilon;\theta)$, can also possibly depend on θ.
AWGN MLE

• One frequently used model is the scalar AWGN case, where the noise is zero-mean with variance $\sigma^2$:

  $$P_E(e) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{e^2}{2\sigma^2}\right\}$$

• In this case the conditional pdf for Y | X is a Gaussian of mean $f(x;\theta)$ and variance $\sigma^2$:

  $$P_{Y|X}(y \mid x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{\big(y - f(x;\theta)\big)^2}{2\sigma^2}\right\}$$

• If the variance $\sigma^2$ is unknown, it is included in θ.
AWGN MLE

• Assume the variance $\sigma^2$ is known. Then the MLE is:

  $$\begin{aligned}
  \hat\theta_{ML} &= \arg\max_\theta \sum_i \log P_E\big(y_i - f(x_i;\theta)\big) \\
  &= \arg\min_\theta \sum_i \left[ \frac{\big(y_i - f(x_i;\theta)\big)^2}{2\sigma^2} + \frac{1}{2}\log\big(2\pi\sigma^2\big) \right] \\
  &= \arg\min_\theta \sum_i \big(y_i - f(x_i;\theta)\big)^2
  \end{aligned}$$

• Since this minimizes the squared Euclidean distance of the estimation error (or prediction error), it is also known as least squares curve fitting.
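For a model that is linear in θ, this least-squares problem can be solved directly with a standard linear-algebra routine. A sketch under the same illustrative line model and synthetic data used earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)

# For f(x; theta) = theta_0 + theta_1 x, the AWGN MLE is the least-squares fit
#   theta_hat = argmin_theta sum_i (y_i - Phi_i theta)^2
Phi = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares solution
print(theta_hat)   # close to the generating values [1.0, 2.0]
```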
MLE & Optimal Regression

• The above development can be framed in our initial formulation of optimizing the loss of the learning system:

      x → f(·) → ŷ = f(x;θ),  with loss L(y, ŷ)

• For a regression problem this still applies; the interpretation of f(·) as a predictor even becomes more intuitive.
• Solving by ML is equivalent to picking a loss identical to the negative of the log of the noise probability density.
Loss for Scalar Noise with Known PDF

• With error $\epsilon = y - f(x;\theta)$, the additive-error pdf determines the loss:
  – Gaussian (AWGN case): $P_E(e) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{e^2}{2\sigma^2}}$  →  L2 distance: $L\big(f(x;\theta), y\big) = \big(y - f(x;\theta)\big)^2$
  – Laplacian: $P_E(e) = \frac{1}{2\sigma}\, e^{-\frac{|e|}{\sigma}}$  →  L1 distance: $L\big(f(x;\theta), y\big) = \big|y - f(x;\theta)\big|$
  – Rayleigh: $P_E(e) = \frac{e}{\sigma^2}\, e^{-\frac{e^2}{2\sigma^2}}$, $e \ge 0$  →  "Rayleigh distance": $L\big(f(x;\theta), y\big) = \frac{\big(y - f(x;\theta)\big)^2}{2\sigma^2} - \log\big(y - f(x;\theta)\big)$
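A small sketch (not from the slides) of how the choice of noise model, via its loss, changes the fit: the same line is fitted under the L2 (Gaussian) and L1 (Laplacian) losses on data with a few artificial outliers; the L1 fit is typically much less sensitive to them. The data, outliers, and optimizer settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Fit a line under the L2 loss (Gaussian noise assumption) and the L1 loss
# (Laplacian noise assumption) on data contaminated by a few gross outliers.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)
y[:5] += 10.0                                   # a few gross outliers

f = lambda th, x: th[0] + th[1] * x
l2 = lambda th: np.sum((y - f(th, x)) ** 2)     # Gaussian  -> squared error
l1 = lambda th: np.sum(np.abs(y - f(th, x)))    # Laplacian -> absolute error

theta_l2 = minimize(l2, x0=np.zeros(2)).x
theta_l1 = minimize(l1, x0=np.zeros(2), method="Nelder-Mead").x
print(theta_l2)   # pulled toward the outliers
print(theta_l1)   # much closer to the generating values [1.0, 2.0]
```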
Maximum Likelihood Estimation

• How do we find the optimal parameters?
• Recall that to obtain the MLE we need to solve

  $$\theta^* = \arg\max_\theta P(D;\theta)$$

• The locally optimal solutions are the parameter values $\hat\theta$ such that

  $$\nabla_\theta P(D;\hat\theta) = 0, \qquad \xi^T \big[\nabla_\theta^2 P(D;\hat\theta)\big]\, \xi < 0, \quad \forall \xi \neq 0$$

• Note that you always have to check the second-order Hessian condition!
Maximum Likelihood Estimation

• Recall some important results.
• FACT: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:
  i) $x^T A x > 0, \ \forall x \neq 0$
  ii) All eigenvalues of A are real and satisfy $\lambda_i > 0$.
  iii) All upper-left submatrices $A_k$ have strictly positive determinant (strictly positive leading principal minors).
  iv) There exists a matrix R with independent rows such that $A = R R^T$. Equivalently, there exists a matrix Q with independent columns such that $A = Q^T Q$.
• Definition of the upper-left submatrices:

  $$A_1 = a_{1,1}, \qquad A_2 = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}, \qquad A_3 = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix}, \qquad \ldots$$
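A small numerical sketch (not from the slides) that checks positive definiteness using two of the equivalent conditions above; the helper name and the test matrices are made up for illustration.

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Check conditions (ii) and (iv) from the list above for a symmetric A."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False
    # (ii) all eigenvalues strictly positive (eigvalsh handles symmetric A)
    if np.min(np.linalg.eigvalsh(A)) <= tol:
        return False
    # (iv) a full-rank factor R with A = R R^T exists; the Cholesky
    # factorization succeeds exactly when A is positive definite.
    try:
        np.linalg.cholesky(A)
    except np.linalg.LinAlgError:
        return False
    return True

print(is_positive_definite([[2.0, -1.0], [-1.0, 2.0]]))   # True  (eigenvalues 1, 3)
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))     # False (eigenvalue -1)
```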
Vector Derivatives

• To compute the gradient and Hessian it is useful to rely on vector derivatives (defined here as row operators).
• Some important identities that we will use:

  $$\frac{\partial}{\partial\theta}\big(A\theta\big) = A, \qquad \frac{\partial}{\partial\theta}\big(\theta^T A \theta\big) = \theta^T\big(A + A^T\big), \qquad \frac{\partial}{\partial\theta}\,\big\|b - A\theta\big\|^2 = -2\,(b - A\theta)^T A$$

• To find the equivalent Cartesian (column) gradient identities, merely transpose the above.
• There are various lists of the most popular formulas. Just Google "vector derivatives" or "matrix derivatives".
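As a sanity check on the last identity, the analytic row gradient can be compared against central finite differences. This is a sketch only; the matrices and step size below are arbitrary.

```python
import numpy as np

# Finite-difference check of the row-gradient identity
#   d/dtheta ||b - A theta||^2 = -2 (b - A theta)^T A
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
theta = rng.normal(size=3)

g = lambda th: np.sum((b - A @ th) ** 2)

analytic = -2.0 * (b - A @ theta) @ A          # length-3 gradient (row form)

eps = 1e-6
numeric = np.array([(g(theta + eps * np.eye(3)[i]) - g(theta - eps * np.eye(3)[i]))
                    / (2 * eps) for i in range(3)])
print(analytic)
print(numeric)   # agrees to roughly 6 decimal places
```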
MLE for Known Additive Noise PDF

• For regression this becomes

  $$\hat\theta_{ML} = \arg\max_\theta \sum_i \log P_{Y|X}(y_i \mid x_i;\theta) = \arg\min_\theta \sum_i L(y_i, x_i;\theta)$$

• The locally optimal solutions, $\hat\theta$, are given by

  $$\frac{\partial}{\partial\theta} \sum_i L(y_i, x_i;\hat\theta) = 0, \qquad \xi^T \left[\frac{\partial^2}{\partial\theta^2} \sum_i L(y_i, x_i;\hat\theta)\right] \xi > 0, \quad \forall \xi \neq 0$$
MLE and Regression

• Noting that the vector derivative and Hessian are linear operators (because derivatives are linear operators), these conditions can be written as

  $$\sum_i \frac{\partial}{\partial\theta} L(y_i, x_i;\hat\theta) = 0$$

  and

  $$\xi^T \left[\sum_i \frac{\partial^2}{\partial\theta^2} L(y_i, x_i;\hat\theta)\right] \xi > 0, \quad \forall \xi \neq 0$$
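A concrete sketch of these two conditions for the special case of a model linear in θ with squared loss (an illustrative choice): the stationarity condition reduces to the normal equations, and the Hessian $2\Phi^T\Phi$ is positive definite whenever the design matrix has independent columns, which matches condition (iv) on the earlier slide.

```python
import numpy as np

# For f(x; theta) = Phi(x) theta with squared loss, the first-order condition
#   sum_i dL/dtheta = 0   gives the normal equations  Phi^T Phi theta = Phi^T y,
# and the Hessian is 2 Phi^T Phi. Data and model below are illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)

Phi = np.column_stack([np.ones_like(x), x])
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # normal equations

grad = -2.0 * Phi.T @ (y - Phi @ theta_hat)           # first-order condition
hess = 2.0 * Phi.T @ Phi                              # second-order condition
print(theta_hat)
print(np.max(np.abs(grad)))        # ~0: a stationary point
print(np.linalg.eigvalsh(hess))    # all eigenvalues > 0: a minimum
```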