Introduction to General and Generalized Linear Models
The Likelihood Principle - part II

Henrik Madsen, Poul Thyregod
Informatics and Mathematical Modelling
Technical University of Denmark
DK-2800 Kgs. Lyngby

October 2010
This lecture

- The maximum likelihood estimate (MLE)
- Distribution of the ML estimator
- Model selection
- Dealing with nuisance parameters
The Maximum Likelihood Estimate (MLE)

Definition (Maximum Likelihood Estimate (MLE))
Given the observations y = (y_1, y_2, ..., y_n), the maximum likelihood estimate (MLE) is a function θ̂(y) such that

L(θ̂; y) = sup_{θ ∈ Θ} L(θ; y)

The function θ̂(Y) over the sample space of observations is called an ML estimator.

In practice it is convenient to work with the log-likelihood function l(θ; y).
The Maximum Likelihood Estimate (MLE)

The score function can be used to obtain the estimate, since the MLE can be found as a solution to

l'_θ(θ; y) = 0

These equations are called the estimation equations for the ML estimator, or simply the ML equations.

It is common practice, especially when plotting, to normalize the likelihood function to have unit maximum and the log-likelihood to have zero maximum.
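As a minimal sketch (not from the original slides), the ML equation can also be solved numerically. The example below assumes a binomial log-likelihood with hypothetical counts n = 10 and y = 3, matching the thumbtack example used later, and it also normalizes the log-likelihood to have zero maximum:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: y successes out of n binary trials (binomial likelihood).
n, y = 10, 3

def negative_log_likelihood(theta):
    # l(theta; y) = y*log(theta) + (n - y)*log(1 - theta), up to an additive constant
    return -(y * np.log(theta) + (n - y) * np.log(1.0 - theta))

# Maximize l by minimizing -l on the open interval (0, 1).
result = minimize_scalar(negative_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
theta_hat = result.x                      # numerical MLE, close to y/n = 0.3

# Normalized log-likelihood: subtract the maximum so the curve has zero maximum.
theta_grid = np.linspace(0.01, 0.99, 99)
log_lik = y * np.log(theta_grid) + (n - y) * np.log(1.0 - theta_grid)
normalized_log_lik = log_lik - log_lik.max()
print(theta_hat, normalized_log_lik.max())
```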
Invariance property

Theorem (Invariance property)
Assume that θ̂ is a maximum likelihood estimator for θ, and let ψ = ψ(θ) denote a one-to-one mapping of Ω ⊂ R^k onto Ψ ⊂ R^k. Then the estimator ψ(θ̂) is a maximum likelihood estimator for the parameter ψ(θ).

The principle is easily generalized to the case where the mapping is not one-to-one.
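For example (an illustration not on the slide, reusing the thumbtack estimate θ̂ = 0.3 from later): if ψ(θ) = θ/(1 − θ) denotes the odds, the invariance property gives the MLE of the odds directly as ψ(θ̂) = 0.3/0.7 ≈ 0.43, with no need to re-maximize the likelihood in the ψ-parametrization.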
Distribution of the ML estimator

Theorem (Distribution of the ML estimator)
We assume that θ̂ is consistent. Then, under some regularity conditions,

θ̂ − θ → N(0, i(θ)^{-1})

where i(θ) is the expected information, also called the information matrix.

The result can be used for inference under very general conditions. The price for this generality is that the result is only asymptotically valid.

Asymptotically, the variance of the estimator equals the Cramér-Rao lower bound for unbiased estimators. The practical significance is that, for large data sets, the MLE makes efficient use of the available data.
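A minimal simulation sketch (hypothetical settings, not from the slides) illustrating the theorem in the binomial case, where the expected information is i(θ) = n/(θ(1 − θ)):

```python
import numpy as np

rng = np.random.default_rng(0)   # hypothetical simulation settings
n, theta, n_rep = 100, 0.3, 10_000

# For a binomial(n, theta) sample, the MLE is y/n and the expected information is
# i(theta) = n / (theta * (1 - theta)), so the asymptotic variance is theta*(1-theta)/n.
y = rng.binomial(n, theta, size=n_rep)
theta_hat = y / n

print("empirical variance of the MLE:", theta_hat.var())
print("asymptotic variance i(theta)^(-1):", theta * (1 - theta) / n)
```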
Distribution of the ML estimator

In practice, we would use

θ̂ ∼ N(θ, j(θ̂)^{-1})

where j(θ̂) is the observed (Fisher) information. This means that asymptotically

i) E[θ̂] = θ
ii) D[θ̂] = j(θ̂)^{-1}
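A minimal sketch of these quantities for the binomial thumbtack example used later in the slides (n = 10, y = 3); the observed information and its inverse give the asymptotic variance and standard error of θ̂:

```python
import numpy as np

# Thumbtack/binomial example from later in these slides: n = 10, y = 3, theta_hat = 0.3.
n, y = 10, 3
theta_hat = y / n

# Observed information j(theta) = -l''(theta) = y/theta^2 + (n - y)/(1 - theta)^2
j = y / theta_hat**2 + (n - y) / (1.0 - theta_hat) ** 2   # ≈ 47.6

var_theta_hat = 1.0 / j                  # asymptotic variance j(theta_hat)^(-1) ≈ 0.021
se_theta_hat = np.sqrt(var_theta_hat)    # standard error ≈ 0.14
print(j, var_theta_hat, se_theta_hat)
```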
Distribution of the ML estimator

The standard error of θ̂_i is given by

σ̂_{θ̂_i} = sqrt(Var_ii[θ̂])

where Var_ii[θ̂] is the i'th diagonal term of j(θ̂)^{-1}.

Hence an estimate of the dispersion (variance-covariance matrix) of the estimator is

D[θ̂] = j(θ̂)^{-1}

An estimate of the uncertainty of the individual parameter estimates is obtained by decomposing the dispersion matrix as

D[θ̂] = σ̂_θ̂ R σ̂_θ̂

into σ̂_θ̂, a diagonal matrix of the standard deviations of the individual parameter estimates, and R, the corresponding correlation matrix. The value R_ij is thus the estimated correlation between θ̂_i and θ̂_j.
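A small sketch of the decomposition D[θ̂] = σ̂_θ̂ R σ̂_θ̂, using a purely illustrative 2×2 dispersion matrix (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical 2x2 dispersion (variance-covariance) matrix j(theta_hat)^(-1)
# for two parameter estimates; the numbers are purely illustrative.
D = np.array([[0.25, 0.10],
              [0.10, 0.16]])

sigma = np.sqrt(np.diag(D))          # standard errors of the individual estimates
R = D / np.outer(sigma, sigma)       # correlation matrix, R[i, j] = corr(theta_i, theta_j)

# Check the decomposition D = diag(sigma) @ R @ diag(sigma)
assert np.allclose(np.diag(sigma) @ R @ np.diag(sigma), D)
print(sigma)   # [0.5 0.4]
print(R)       # off-diagonal: 0.10 / (0.5 * 0.4) = 0.5
```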
The Wald Statistic

A test of an individual parameter, H_0: θ_i = θ_{i,0}, is given by the Wald statistic:

Z_i = (θ̂_i − θ_{i,0}) / σ̂_{θ̂_i}

which under H_0 is approximately N(0, 1)-distributed.
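A small sketch of the Wald test, reusing the thumbtack numbers (θ̂ = 0.3, σ̂_θ̂ ≈ 0.145) together with a hypothetical null value θ_0 = 0.5:

```python
from scipy.stats import norm

# Wald test of H0: theta = theta_0; theta_0 = 0.5 is a hypothetical null value.
theta_hat, se_theta_hat, theta_0 = 0.3, 0.145, 0.5

z = (theta_hat - theta_0) / se_theta_hat     # Wald statistic, approximately N(0, 1) under H0
p_value = 2 * norm.sf(abs(z))                # two-sided p-value
print(z, p_value)                            # z ≈ -1.38, p ≈ 0.17
```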
Quadratic approximation of the log-likelihood

A second-order Taylor expansion around θ̂ provides a quadratic approximation of the normalized log-likelihood around the MLE:

l(θ) ≈ l(θ̂) + l'(θ̂)(θ − θ̂) − (1/2) j(θ̂)(θ − θ̂)²

and since l'(θ̂) = 0,

log (L(θ)/L(θ̂)) ≈ −(1/2) j(θ̂)(θ − θ̂)²

In the case of normality the approximation is exact, which means that a quadratic approximation of the log-likelihood corresponds to a normal approximation of the estimator θ̂(Y).
Example: Quadratic approximation of the log-likelihood

Consider again the thumbtack example. The log-likelihood function is

l(θ) = y log θ + (n − y) log(1 − θ) + const

the score function is

l'(θ) = y/θ − (n − y)/(1 − θ)

and the observed information is

j(θ) = y/θ² + (n − y)/(1 − θ)²

For n = 10, y = 3 and θ̂ = 0.3 we obtain j(θ̂) = 47.6. The quadratic approximation is poor in this case. Increasing the sample size to n = 100, still with θ̂ = 0.3, makes the approximation much better.
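A minimal sketch (assuming the binomial log-likelihood above) that quantifies how the quadratic approximation improves with sample size by comparing the two curves one standard error away from the MLE:

```python
import numpy as np

def normalized_log_lik(theta, n, y):
    """Binomial log-likelihood normalized to have zero maximum (thumbtack example)."""
    theta_hat = y / n
    return (y * np.log(theta) + (n - y) * np.log(1 - theta)
            - y * np.log(theta_hat) - (n - y) * np.log(1 - theta_hat))

for n, y in [(10, 3), (100, 30)]:
    theta_hat = y / n
    j_hat = y / theta_hat**2 + (n - y) / (1 - theta_hat) ** 2   # observed information
    se = 1 / np.sqrt(j_hat)
    # Compare the true curve with the quadratic approximation -0.5*j*(theta - theta_hat)^2
    # one standard error from the MLE; the gap shrinks as n grows.
    for theta in (theta_hat - se, theta_hat + se):
        gap = abs(normalized_log_lik(theta, n, y) - (-0.5 * j_hat * (theta - theta_hat) ** 2))
        print(f"n={n:3d}  theta={theta:.3f}  |true - quadratic| = {gap:.3f}")
```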
Example: Quadratic approximation of the log-likelihood

Figure: Quadratic approximation of the log-likelihood function (true curve vs. approximation; log-likelihood against θ). (a) n = 10, y = 3. (b) n = 100, y = 30.
Likelihood ratio tests

Methods for testing hypotheses using the likelihood function. The basic idea is to determine the maximum likelihood estimates under both a null and an alternative hypothesis.

It is assumed that a sufficient model with θ ∈ Ω exists. Then consider some theory or assumption about the parameters

H_0: θ ∈ Ω_0

where Ω_0 ⊂ Ω, dim(Ω_0) = r and dim(Ω) = k.

The purpose of the testing is to analyze whether the observations provide sufficient evidence to reject this theory or assumption. If not, we accept the null hypothesis.
Likelihood ratio tests

The evidence against H_0 is measured by the p-value. The p-value is the probability, under H_0, of observing a value of the test statistic equal to or more extreme than the actually observed test statistic.

Hence, a small p-value (say ≤ 0.05) is strong evidence against H_0, and H_0 is then said to be rejected. Likewise, we retain H_0 unless there is strong evidence against this hypothesis.

Rejecting H_0 when H_0 is true is called a Type I error, while retaining H_0 when the truth is actually H_1 is called a Type II error.
Likelihood ratio tests

Definition (Likelihood ratio)
Consider the hypothesis H_0: θ ∈ Ω_0 against the alternative H_1: θ ∈ Ω \ Ω_0 (Ω_0 ⊆ Ω), where dim(Ω_0) = r and dim(Ω) = k. For given observations y_1, y_2, ..., y_n the likelihood ratio is defined as

λ(y) = sup_{θ ∈ Ω_0} L(θ; y) / sup_{θ ∈ Ω} L(θ; y)

If λ is small, the data are more plausible under the alternative hypothesis than under the null hypothesis. Hence the hypothesis H_0 is rejected for small values of λ.
Likelihood ratio tests

It is sometimes possible to transform the likelihood ratio into a statistic whose exact distribution is known under H_0. This is for instance the case for the General Linear Model for Gaussian data. In most cases, however, we must use the following important result on the asymptotic behavior.

Theorem (Wilks' likelihood ratio test)
Under the null hypothesis H_0, the random variable −2 log λ(Y) converges in law to a χ² random variable with (k − r) degrees of freedom, i.e.,

−2 log λ(Y) → χ²(k − r) under H_0
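A minimal sketch of the likelihood ratio test for the thumbtack example, testing a hypothetical null hypothesis H_0: θ = 0.5; here k − r = 1 degree of freedom:

```python
import numpy as np
from scipy.stats import chi2

# Thumbtack example: n = 10, y = 3; hypothetical null value theta_0 = 0.5.
n, y = 10, 3
theta_hat, theta_0 = y / n, 0.5

def log_lik(theta):
    return y * np.log(theta) + (n - y) * np.log(1 - theta)

# -2 log lambda = 2 * (l(theta_hat) - l(theta_0)); here k - r = 1 degree of freedom.
lr_statistic = 2 * (log_lik(theta_hat) - log_lik(theta_0))
p_value = chi2.sf(lr_statistic, df=1)
print(lr_statistic, p_value)   # ≈ 1.65 and ≈ 0.20: no strong evidence against H0
```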
Null model and full model

The null model, Ω_null = R (dim(Ω_null) = 1), is a model with only one parameter.

The full model, Ω_full = R^n (dim(Ω_full) = n), is a model whose dimension equals the number of observations, and hence the model fits each observation perfectly.
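A small sketch contrasting the two extremes, using hypothetical Poisson counts (not from the slides); the full (saturated) model fits each observation exactly and therefore attains the largest possible log-likelihood:

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical Poisson counts used only to illustrate the two extreme models.
y = np.array([2, 5, 1, 4, 3])

# Null model: a single common mean for all observations (dim = 1).
mu_null = y.mean()
log_lik_null = poisson.logpmf(y, mu_null).sum()

# Full (saturated) model: one parameter per observation, fitting each y_i exactly (dim = n).
log_lik_full = poisson.logpmf(y, y).sum()

print(log_lik_null, log_lik_full)   # the full model attains the larger log-likelihood
```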