Multivariate Linear Regression
Max Turgeon
STAT 4690 – Applied Multivariate Analysis
Multivariate Linear Regression model

• We are interested in the relationship between p outcomes Y_1, ..., Y_p and q covariates X_1, ..., X_q.
• We will write Y = (Y_1, ..., Y_p) and X = (1, X_1, ..., X_q).
• We will assume a linear relationship: E(Y | X) = B^T X, where B is a (q + 1) × p matrix of regression coefficients.
• We will also assume homoscedasticity: Cov(Y | X) = Σ, where Σ is positive-definite.
  • In other words, the (conditional) covariance of Y does not depend on X.
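To make the setup concrete, here is a minimal simulation sketch (not part of the slides; the sample size, coefficient matrix B_sim, and covariance Sigma_sim are made up, and normal errors are used only for convenience, even though the model itself does not assume normality):

set.seed(4690)
n <- 100; p <- 2; q <- 2
X_sim <- cbind(1, matrix(rnorm(n * q), ncol = q))   # n x (q + 1), intercept in first column
B_sim <- matrix(c(1, 0.5, -0.3,
                  2, 0.0,  1.2), ncol = p)          # (q + 1) x p coefficient matrix
Sigma_sim <- matrix(c(1.0, 0.5,
                      0.5, 2.0), ncol = p)          # positive-definite covariance
Y_sim <- X_sim %*% B_sim +
  MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma_sim)  # E(Y | X) = XB, Cov(Y | X) = Sigma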
Relationship with Univariate regression i

• Let σ_i^2 be the i-th diagonal element of Σ.
• Let β_i be the i-th column of B.
• From the model above, we get p univariate regressions:
  • E(Y_i | X) = X^T β_i;
  • Var(Y_i | X) = σ_i^2.
• This follows from the assumption that each component Y_i is linearly associated with the same covariates X.
• However, we will use the correlation between outcomes for hypothesis testing.
Relationship with Univariate regression ii

• If we assumed a different set of covariates X_i for each outcome Y_i and still wanted to use the correlation between the outcomes, we would get the Seemingly Unrelated Regressions (SUR) model.
• This model is sometimes used by econometricians.
Least-Squares Estimation i

• Let Y_1, ..., Y_n be a random sample of size n, and let X_1, ..., X_n be the corresponding sample of covariates.
• We will write Y and X for the matrices whose i-th row is Y_i and X_i, respectively.
• We can then write E(Y | X) = XB.
• For Least-Squares Estimation, we will be looking for the estimator B̂ of B that minimises a least-squares criterion:

  LS(B) = tr[(Y − XB)^T (Y − XB)].

• Note: This criterion is also known as the (squared) Frobenius norm; i.e. LS(B) = ‖Y − XB‖²_F.
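As a quick translation into code, here is a small helper (a sketch, not from the original slides) that evaluates this criterion for a given response matrix Y, design matrix X, and candidate coefficient matrix B:

# Least-squares criterion LS(B) = tr[(Y - XB)^T (Y - XB)]
LS <- function(Y, X, B) {
  E <- Y - X %*% B          # matrix of residuals for this choice of B
  sum(diag(crossprod(E)))   # trace of E^T E
}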
Least-Squares Estimation ii

• Note 2: If you expand the matrix product and look at the diagonal, you can see that the Frobenius norm is equivalent to the sum of the squared entries.
• To minimise LS(B), we could use matrix derivatives…
• Or, we can expand the matrix product along the diagonal and compute the trace.
• Let Y_(j) be the j-th column of Y.
Least-Squares Estimation iii

• In other words, Y_(j) = (Y_1j, ..., Y_nj) contains the n values for the outcome Y_j. We then have

  LS(B) = tr[(Y − XB)^T (Y − XB)]
        = ∑_{j=1}^p (Y_(j) − Xβ_j)^T (Y_(j) − Xβ_j)
        = ∑_{j=1}^p ∑_{i=1}^n (Y_ij − β_j^T X_i)².
Least-Squares Estimation iv

• For each j, the sum ∑_{i=1}^n (Y_ij − β_j^T X_i)² is simply the least-squares criterion for the corresponding univariate linear regression.
  • β̂_j = (X^T X)^{-1} X^T Y_(j).
• But since LS(B) is a sum of p positive terms, each minimised at β̂_j, the whole sum is minimised at B̂ = (β̂_1 ··· β̂_p).
• Or put another way: B̂ = (X^T X)^{-1} X^T Y.
Comments i

• We do not need to assume normality to derive the least-squares estimator.
  • We still have not made any distributional assumptions on Y.
• The least-squares estimator is unbiased:

  E(B̂ | X) = (X^T X)^{-1} X^T E(Y | X) = (X^T X)^{-1} X^T XB = B.
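A small Monte Carlo check of this claim (a sketch, reusing the simulated design X_sim, B_sim, and Sigma_sim assumed earlier rather than the Plastic data): averaging B̂ over many simulated responses should come close to B_sim, up to simulation error.

set.seed(1)
n_rep <- 500
B_hat_sum <- matrix(0, nrow = nrow(B_sim), ncol = ncol(B_sim))
for (s in seq_len(n_rep)) {
  Y_s <- X_sim %*% B_sim + MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma_sim)
  B_hat_sum <- B_hat_sum + solve(crossprod(X_sim)) %*% t(X_sim) %*% Y_s
}
round(B_hat_sum / n_rep - B_sim, 2)   # entries should be close to zero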
Comments ii

• We did not use the covariance matrix Σ anywhere in the estimation process. But note that:

  Cov(β̂_i, β̂_j) = Cov((X^T X)^{-1} X^T Y_(i), (X^T X)^{-1} X^T Y_(j))
                 = (X^T X)^{-1} X^T Cov(Y_(i), Y_(j)) ((X^T X)^{-1} X^T)^T
                 = (X^T X)^{-1} X^T (σ_ij I_n) X (X^T X)^{-1}
                 = σ_ij (X^T X)^{-1},

  where σ_ij is the (i, j)-th entry of Σ.
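In code, this covariance could be estimated as follows (a sketch; the residual-based estimator Σ̂ = Ê^T Ê / (n − q − 1) used here is an assumption, since the slides have not yet introduced an estimator of Σ):

cov_beta_hat <- function(X, Y, i, j) {
  XtX_inv <- solve(crossprod(X))
  B_hat <- XtX_inv %*% t(X) %*% Y
  E_hat <- Y - X %*% B_hat                             # residual matrix
  Sigma_hat <- crossprod(E_hat) / (nrow(X) - ncol(X))  # assumed estimator of Sigma
  Sigma_hat[i, j] * XtX_inv                            # estimate of Cov(beta_hat_i, beta_hat_j)
}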
Example i

# Let's revisit the plastic film data
library(heplots)
library(tidyverse)

Y <- Plastic %>%
  select(tear, gloss, opacity) %>%
  as.matrix
X <- model.matrix(~ rate, data = Plastic)

head(X)
Example ii

##   (Intercept) rateHigh
## 1           1        0
## 2           1        0
## 3           1        0
## 4           1        0
## 5           1        0
## 6           1        0

(B_hat <- solve(crossprod(X)) %*% t(X) %*% Y)
Example iii

##             tear gloss opacity
## (Intercept) 6.49  9.57    3.79
## rateHigh    0.59 -0.51    0.29

# Compare with lm output
fit <- lm(cbind(tear, gloss, opacity) ~ rate,
          data = Plastic)
coef(fit)

##             tear gloss opacity
## (Intercept) 6.49  9.57    3.79
## rateHigh    0.59 -0.51    0.29
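As a quick sanity check (a sketch added here, not part of the original example), the trace criterion evaluated at B_hat equals the squared Frobenius norm of the residual matrix, which in turn is the sum of the p univariate residual sums of squares:

E_hat <- Y - X %*% B_hat
c(trace      = sum(diag(crossprod(E_hat))),
  frobenius  = norm(E_hat, type = "F")^2,
  univariate = sum(colSums(E_hat^2)))   # all three should agree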
Geometry of LS i

• Let P = X(X^T X)^{-1} X^T.
• P is symmetric and idempotent:

  P² = X(X^T X)^{-1} X^T X(X^T X)^{-1} X^T = X(X^T X)^{-1} X^T = P.

• Let Ŷ = XB̂ be the fitted values, and Ê = Y − Ŷ the residuals.
• We have Ŷ = PY.
• We also have Ê = (I − P)Y.
Geometry of LS ii

• Putting all this together, we get

  Ŷ^T Ê = (PY)^T (I − P)Y = Y^T P(I − P)Y = Y^T (P − P²)Y = 0.

• In other words, the fitted values and the residuals are orthogonal.
• Similarly, we can see that X^T Ê = 0 and PX = X.
Geometry of LS iii

• Interpretation: Ŷ is the orthogonal projection of Y onto the column space of X.
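These properties are easy to verify numerically (a small sketch, reusing the Plastic design matrix X and the fit from the example above):

P <- X %*% solve(crossprod(X)) %*% t(X)
all.equal(P, t(P))                                          # symmetric
all.equal(P, P %*% P)                                       # idempotent
all.equal(P %*% X, X, check.attributes = FALSE)             # P X = X
all.equal(P %*% Y, fitted(fit), check.attributes = FALSE)   # Y_hat = P Y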
Example (cont’d) i

residuals <- residuals(fit)
Y_hat <- fitted(fit)

crossprod(Y_hat, residuals)

##                  tear        gloss       opacity
## tear    -9.489298e-16 2.959810e-15 -4.720135e-15
## gloss   -1.424461e-15 1.109357e-15 -1.150262e-14
## opacity -7.268852e-16 1.211209e-15  1.648459e-16
Example (cont’d) ii

crossprod(X, residuals)

##             tear        gloss       opacity
## (Intercept)    0 5.828671e-16 -4.440892e-16
## rateHigh       0 1.387779e-16  4.440892e-16
Example (cont’d) iii

# Is this really zero?
isZero <- function(mat) {
  all.equal(mat, matrix(0, ncol = ncol(mat), nrow = nrow(mat)),
            check.attributes = FALSE)
}

isZero(crossprod(Y_hat, residuals))

## [1] TRUE
Example (cont’d) iv

isZero(crossprod(X, residuals))

## [1] TRUE
Bootstrapped Confidence Intervals i

• We still have not made any assumption about the distribution of Y, beyond the conditional mean and covariance function.
• Let’s see how much further we can go.
• We will use bootstrap to derive confidence intervals for our quantities of interest.
• Bootstrap is a resampling technique for estimating the sampling distribution of an estimator of interest.
• Particularly useful when we think the usual assumptions may not hold, or when the sampling distribution would be difficult to derive.
Bootstrapped Confidence Intervals ii

• Let’s say we want to estimate the sampling distribution of the correlation coefficient.
• We have a sample of pairs (U_1, V_1), ..., (U_n, V_n), from which we estimated the correlation ρ̂.
• The idea is to resample with replacement from our sample to mimic the process of “repeating the experiment”.
Bootstrapped Confidence Intervals iii

• For each bootstrap sample (U_1^(b), V_1^(b)), ..., (U_n^(b), V_n^(b)), we compute the sample correlation ρ̂^(b).
• We now have a whole sample of correlation coefficients ρ̂^(1), ..., ρ̂^(B).
• From its quantiles, we can derive a confidence interval for ρ.
Example i

library(candisc)
dataset <- HSB[, c("math", "sci")]

(corr_est <- cor(dataset)[1,2])

## [1] 0.6495261
Example ii

# Choose a number of bootstrap samples
B <- 5000
corr_boot <- replicate(B, {
  samp_boot <- sample(nrow(dataset),
                      replace = TRUE)
  dataset_boot <- dataset[samp_boot,]
  cor(dataset_boot)[1,2]
})

quantile(corr_boot, probs = c(0.025, 0.975))
Example iii

##      2.5%     97.5%
## 0.6037029 0.6924364

hist(corr_boot, breaks = 50)
abline(v = corr_est, col = 'red', lty = 2)
Example iv

[Figure: histogram of corr_boot (Frequency vs. corr_boot), with a dashed red line marking the observed correlation corr_est.]
Bootstrapped Confidence Intervals (cont’d) i

• Going back to our multivariate linear regression setting, we can bootstrap our estimate of the matrix of regression coefficients!
• We will sample with replacement the rows of the Y and X matrices.
  • It’s important to sample the same rows in both matrices. We want to keep the relationship between Y and X intact.
• For each bootstrap sample, we can compute the estimate B̂^(b).
• From these samples, we can compute confidence intervals for each entry in B.
Bootstrapped Confidence Intervals (cont’d) ii

• We can also technically compute confidence regions for multiple entries in B.
  • E.g. a whole column or a whole row.
• But multivariate quantiles are tricky…
Example (cont’d) i

B_boot <- replicate(B, {
  samp_boot <- sample(nrow(Y),
                      replace = TRUE)
  X_boot <- X[samp_boot,]
  Y_boot <- Y[samp_boot,]
  solve(crossprod(X_boot)) %*% t(X_boot) %*% Y_boot
})

# The output is a 3-dim array
dim(B_boot)
Example (cont’d) ii

## [1]    2    3 5000

B_boot[,,1]

##                  tear      gloss   opacity
## (Intercept) 6.5545455  9.5090909 3.7818182
## rateHigh    0.7515152 -0.1535354 0.4787879

# CI for effect of rate on tear
quantile(B_boot["rateHigh", "tear",],
         probs = c(0.025, 0.975))
Example (cont’d) iii

##      2.5%     97.5%
## 0.2738049 0.9125000

# CI for effect of rate on gloss
quantile(B_boot["rateHigh", "gloss",],
         probs = c(0.025, 0.975))

##       2.5%      97.5%
## -0.8967100 -0.1040152
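To get percentile intervals for every entry of B at once, one option (a sketch extending the example, not from the original slides) is to apply quantile() over the first two dimensions of the bootstrap array:

# Result is a 2 x 2 x 3 array: (lower, upper) x coefficients x outcomes
apply(B_boot, c(1, 2), quantile, probs = c(0.025, 0.975))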