Maximum Likelihood Theory
Max Turgeon
STAT 4690–Applied Multivariate Analysis
Sufficient Statistics i

• We saw in the previous lecture that the multivariate normal distribution is completely determined by its mean vector µ ∈ R^p and its covariance matrix Σ.
• Therefore, given a sample Y_1, ..., Y_n ~ N_p(µ, Σ) (n > p), we only need to estimate (µ, Σ).
• Obvious candidates: the sample mean Ȳ and the sample covariance S_n.
Sufficient Statistics ii

• Write down the likelihood:

  L = \prod_{i=1}^n \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left(-\frac{1}{2}(y_i - \mu)^T \Sigma^{-1} (y_i - \mu)\right)
    = \frac{1}{(2\pi)^{np/2} |\Sigma|^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu)\right).

• If we take the (natural) logarithm of L and drop any term that does not depend on (µ, Σ), we get

  \ell = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu).
Sufficient Statistics iii

• If we can re-express the second summand in terms of Ȳ and S_n, then by the Fisher-Neyman factorization theorem we will know that (Ȳ, S_n) is jointly sufficient for (µ, Σ).
• First, we have:
Sufficient Statistics iv

  \sum_{i=1}^n (y_i - \mu)(y_i - \mu)^T
    = \sum_{i=1}^n (y_i - \bar{y} + \bar{y} - \mu)(y_i - \bar{y} + \bar{y} - \mu)^T
    = \sum_{i=1}^n \left( (y_i - \bar{y})(y_i - \bar{y})^T + (y_i - \bar{y})(\bar{y} - \mu)^T + (\bar{y} - \mu)(y_i - \bar{y})^T + (\bar{y} - \mu)(\bar{y} - \mu)^T \right)
    = \sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T + n(\bar{y} - \mu)(\bar{y} - \mu)^T
    = (n - 1) S_n + n(\bar{y} - \mu)(\bar{y} - \mu)^T.

  (The cross terms vanish since \sum_{i=1}^n (y_i - \bar{y}) = 0.)
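This decomposition is easy to check numerically. The following is a minimal sketch in R; the data, the value of mu, and the sample size are simulated purely for illustration and are not part of the original slides.

set.seed(1)
n <- 20; p <- 3
Y <- matrix(rnorm(n * p), ncol = p)
mu <- rep(0.5, p)                      # an arbitrary fixed value of mu
y_bar <- colMeans(Y)
Sn <- cov(Y)

# Left-hand side: sum of outer products around mu
lhs <- Reduce(`+`, lapply(seq_len(n), function(i) tcrossprod(Y[i, ] - mu)))

# Right-hand side: (n - 1) S_n + n (y_bar - mu)(y_bar - mu)^T
rhs <- (n - 1) * Sn + n * tcrossprod(y_bar - mu)

all.equal(lhs, rhs)                    # TRUE up to numerical error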
Sufficient Statistics v

• Next, using the fact that tr(ABC) = tr(BCA), we have:
Sufficient Statistics vi

  \sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu)
    = \mathrm{tr}\left(\sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu)\right)
    = \mathrm{tr}\left(\sum_{i=1}^n \Sigma^{-1} (y_i - \mu)(y_i - \mu)^T\right)
    = \mathrm{tr}\left(\Sigma^{-1} \sum_{i=1}^n (y_i - \mu)(y_i - \mu)^T\right)
    = (n - 1)\,\mathrm{tr}\left(\Sigma^{-1} S_n\right) + n\,\mathrm{tr}\left(\Sigma^{-1} (\bar{y} - \mu)(\bar{y} - \mu)^T\right)
    = (n - 1)\,\mathrm{tr}\left(\Sigma^{-1} S_n\right) + n\,(\bar{y} - \mu)^T \Sigma^{-1} (\bar{y} - \mu).
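As a sanity check, this trace identity can also be verified numerically. A minimal sketch follows; Sigma, mu, and the simulated data are illustrative choices, not values from the slides.

set.seed(2)
n <- 20; p <- 3
Sigma <- diag(p) + 0.3                 # an arbitrary positive definite matrix
mu <- rep(1, p)
Y <- mvtnorm::rmvnorm(n, mean = mu, sigma = Sigma)
y_bar <- colMeans(Y); Sn <- cov(Y); Sigma_inv <- solve(Sigma)

# Left-hand side: sum of quadratic forms
lhs <- sum(apply(Y, 1, function(y) t(y - mu) %*% Sigma_inv %*% (y - mu)))
# Right-hand side: trace term plus quadratic form in the sample mean
rhs <- (n - 1) * sum(diag(Sigma_inv %*% Sn)) +
  n * drop(t(y_bar - mu) %*% Sigma_inv %*% (y_bar - mu))

all.equal(lhs, rhs)                    # TRUE up to numerical error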
Maximum Likelihood Estimators

• Going back to the log-likelihood, we get:

  \ell = -\frac{n}{2}\log|\Sigma| - \frac{n - 1}{2}\,\mathrm{tr}\left(\Sigma^{-1} S_n\right) - \frac{n}{2}(\bar{y} - \mu)^T \Sigma^{-1} (\bar{y} - \mu).

• Since Σ^{-1} is positive definite, for Σ fixed, the log-likelihood is maximised at µ̂ = ȳ.
• With extra effort, it can be shown that -\log|\Sigma| - \frac{n-1}{n}\,\mathrm{tr}\left(\Sigma^{-1} S_n\right) is maximised at

  \hat{\Sigma} = \frac{n - 1}{n} S_n = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T.

• In other words: (Ȳ, Σ̂) are the maximum likelihood estimators for (µ, Σ).
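In R, these estimates are a one-liner each once we have a data matrix. This is a minimal sketch; the simulated data mirror the visualization example later in these slides and are purely illustrative.

set.seed(123)
n <- 50; p <- 2
Y <- mvtnorm::rmvnorm(n, mean = c(1, 2),
                      sigma = matrix(c(1, 0.5, 0.5, 1), ncol = p))

mu_hat <- colMeans(Y)                  # MLE of mu: the sample mean
Sigma_hat <- (n - 1) / n * cov(Y)      # MLE of Sigma: rescaled sample covariance

mu_hat
Sigma_hat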
Maximum Likelihood Estimators

• Since the multivariate normal density is "well-behaved", we can deduce the usual properties:
  • Consistency: (Ȳ, Σ̂) converges in probability to (µ, Σ).
  • Efficiency: Asymptotically, the covariance of (Ȳ, Σ̂) achieves the Cramér-Rao lower bound.
  • Invariance: For any transformation (g(µ), G(Σ)) of (µ, Σ), its MLE is (g(Ȳ), G(Σ̂)).
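For example, by invariance, the MLE of the correlation between the first two components is obtained by plugging Σ̂ into the usual formula. A minimal, self-contained sketch (the simulated data are illustrative):

set.seed(123)
n <- 50
Y <- mvtnorm::rmvnorm(n, mean = c(1, 2),
                      sigma = matrix(c(1, 0.5, 0.5, 1), ncol = 2))
Sigma_hat <- (n - 1) / n * cov(Y)

# MLE of rho = Sigma[1,2] / sqrt(Sigma[1,1] * Sigma[2,2]) by invariance
rho_hat <- Sigma_hat[1, 2] / sqrt(Sigma_hat[1, 1] * Sigma_hat[2, 2])
rho_hat    # equals cor(Y)[1, 2], since the (n - 1)/n factors cancel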
Visualizing the likelihood

library(mvtnorm)
set.seed(123)

n <- 50; p <- 2
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = p)
Y <- rmvnorm(n, mean = mu, sigma = Sigma)
Visualizing the likelihood

loglik <- function(mu, sigma, data = Y) {
  # Compute summary statistics
  n <- nrow(data)
  y_bar <- colMeans(data)
  Sn <- cov(data)
  Sigma_inv <- solve(sigma)
  # Compute quadratic form
  quad_form <- drop(t(y_bar - mu) %*% Sigma_inv %*% (y_bar - mu))
  -0.5 * n * log(det(sigma)) -
    0.5 * (n - 1) * sum(diag(Sigma_inv %*% Sn)) -
    0.5 * n * quad_form
}
grid_xy <- expand.grid(seq(0.5, 1.5, length.out = 32),
                       seq(1, 3, length.out = 32))

contours <- purrr::map_df(seq_len(nrow(grid_xy)), function(i) {
  # Where we will evaluate loglik
  mu_obs <- as.numeric(grid_xy[i, ])
  # Evaluate at the pop covariance
  z <- loglik(mu_obs, sigma = Sigma)
  # Output data.frame
  data.frame(x = mu_obs[1], y = mu_obs[2], z = z)
})
Visualizing the likelihood i

library(tidyverse)
library(ggrepel)

# Create df with pop and sample means
data_means <- data.frame(x = c(mu[1], mean(Y[, 1])),
                         y = c(mu[2], mean(Y[, 2])),
                         label = c("Pop.", "Sample"))
Visualizing the likelihood ii

contours %>%
  ggplot(aes(x, y)) +
  geom_contour(aes(z = z)) +
  geom_point(data = data_means) +
  geom_label_repel(data = data_means, aes(label = label))
Visualizing the likelihood iii

[Contour plot of the log-likelihood over the grid of mean values (x, y), with the population mean ("Pop.") and sample mean ("Sample") labelled.]
Visualizing the likelihood iv

library(scatterplot3d)
with(contours, scatterplot3d(x, y, z))
Visualizing the likelihood v

[3D scatterplot of the log-likelihood values z over the grid of mean values (x, y).]
Sampling Distributions

• Recall the univariate case:
  • X̄ ~ N(µ, σ²/n);
  • (n - 1)s²/σ² ~ χ²(n - 1);
  • X̄ and s² are independent.
• In the multivariate case, we have similar results:
  • Ȳ ~ N_p(µ, (1/n)Σ);
  • (n - 1)S_n = nΣ̂ follows a Wishart distribution with n - 1 degrees of freedom;
  • Ȳ and S_n are independent.
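The first multivariate result is easy to check by Monte Carlo: the covariance of Ȳ across many replicated samples should be close to Σ/n. A minimal sketch; the number of replicates and the choice of µ and Σ are illustrative.

set.seed(42)
n <- 50
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)

# One row per replicate, containing the sample mean of that replicate
ybars <- t(replicate(5000, colMeans(mvtnorm::rmvnorm(n, mean = mu, sigma = Sigma))))

cov(ybars)     # should be close to Sigma / n
Sigma / n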
Wishart Distribution

• Suppose Z_1, ..., Z_n ~ N_p(0, Σ) are independently distributed. Then we say that

  W = \sum_{i=1}^n Z_i Z_i^T

  follows a Wishart distribution W_n(Σ) with n degrees of freedom.
• Note that since E(Z_i Z_i^T) = Σ, we have E(W) = nΣ.
• From the previous slide: \sum_{i=1}^n (Y_i - Ȳ)(Y_i - Ȳ)^T has the same distribution as \sum_{i=1}^{n-1} Z_i Z_i^T for some choice of Z_1, ..., Z_{n-1} ~ N_p(0, Σ).
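The defining construction can be mimicked directly in R by summing outer products of normal draws, and base R also provides stats::rWishart for sampling. A minimal sketch checking E(W) = nΣ by simulation; the degrees of freedom, Σ, and the number of replicates are illustrative.

set.seed(42)
p <- 2; n_df <- 10
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = p)

rwish_manual <- function() {
  Z <- mvtnorm::rmvnorm(n_df, mean = rep(0, p), sigma = Sigma)
  crossprod(Z)               # t(Z) %*% Z = sum of Z_i Z_i^T
}

draws <- replicate(2000, rwish_manual())
apply(draws, c(1, 2), mean)  # should be close to n_df * Sigma
n_df * Sigma

# Base R equivalent for a single draw: rWishart(1, df = n_df, Sigma = Sigma)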
Useful Properties

• If W_1 ~ W_{n_1}(Σ) and W_2 ~ W_{n_2}(Σ) are independent, then

  W_1 + W_2 ~ W_{n_1 + n_2}(Σ).

• If W ~ W_n(Σ) and C is q × p, then

  CWC^T ~ W_n(CΣC^T).
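The second property can be partially checked through expectations: if CWC^T ~ W_n(CΣC^T), then E(CWC^T) = n CΣC^T. A minimal sketch (it only compares means, not the full distributions; all values below are illustrative).

set.seed(42)
p <- 3; q <- 2; n_df <- 8
Sigma <- diag(p) + 0.3
C <- matrix(rnorm(q * p), nrow = q)

W_draws <- rWishart(5000, df = n_df, Sigma = Sigma)
# Each column of CWC is a flattened q x q matrix C %*% W %*% t(C)
CWC <- apply(W_draws, 3, function(W) C %*% W %*% t(C))

matrix(rowMeans(CWC), nrow = q)   # empirical mean of C W C^T
n_df * C %*% Sigma %*% t(C)       # theoretical mean under W_n(C Sigma C^T)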
Density function

• Let Σ be a fixed p × p positive definite matrix. The density of the Wishart distribution with n degrees of freedom, with n ≥ p, is given by

  w_n(A; \Sigma) = \frac{|A|^{(n-p-1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)\right)}{2^{np/2}\,\pi^{p(p-1)/4}\,|\Sigma|^{n/2}\,\prod_{i=1}^p \Gamma\left(\frac{1}{2}(n - i + 1)\right)},

  where A ranges over all p × p positive definite matrices.
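The formula translates directly into a log-density function. This is a minimal sketch implementing the expression above, not a packaged routine; the evaluation at the end uses illustrative values.

# Log-density of the Wishart distribution, following the formula above
dwishart_log <- function(A, n, Sigma) {
  p <- nrow(Sigma)
  log_det_A <- determinant(A, logarithm = TRUE)$modulus
  log_det_S <- determinant(Sigma, logarithm = TRUE)$modulus
  as.numeric(
    0.5 * (n - p - 1) * log_det_A -
      0.5 * sum(diag(solve(Sigma, A))) -        # -1/2 tr(Sigma^{-1} A)
      0.5 * n * p * log(2) -
      0.25 * p * (p - 1) * log(pi) -
      0.5 * n * log_det_S -
      sum(lgamma(0.5 * (n - seq_len(p) + 1)))
  )
}

# Illustrative evaluation at a random positive definite matrix
Sigma <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
A <- rWishart(1, df = 10, Sigma = Sigma)[, , 1]
dwishart_log(A, n = 10, Sigma = Sigma)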
Eigenvalue density function

• For a random matrix A ~ W_n(I_p) with n ≥ p, the joint distribution of its eigenvalues λ_1 ≥ ⋯ ≥ λ_p has density

  C_{n,p} \exp\left(-\frac{1}{2}\sum_{i=1}^p \lambda_i\right) \prod_{i=1}^p \lambda_i^{(n-p-1)/2} \prod_{i<j} |\lambda_i - \lambda_j|,

  for some constant C_{n,p}.
• We will study this distribution in STAT 7200–Multivariate Analysis I.
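One can already get a feel for this density by simulation: the |λ_i - λ_j| factor makes the ordered eigenvalues "repel" each other, which shows up as a gap along the diagonal λ_1 = λ_2. A minimal sketch; the degrees of freedom and number of replicates are illustrative.

set.seed(42)
p <- 2; n_df <- 10
# Sample eigenvalues (in decreasing order) from many W_n(I_2) draws
eigs <- t(apply(rWishart(5000, df = n_df, Sigma = diag(p)), 3,
                function(W) eigen(W, symmetric = TRUE)$values))
colnames(eigs) <- c("lambda1", "lambda2")

summary(eigs)
plot(eigs, pch = ".", main = "Ordered eigenvalues of W_n(I_2)")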