Where do Multivariate Normal Samples Come from?
Paul E. Johnson
Department of Political Science & Center for Research Methods and Data Analysis, University of Kansas
2017
What I learned on my Winter Vacation
These notes accompany the longer essay. That essay uses some LyX/LaTeX tricks that were new to me, such as the bold upright symbol for matrices in math. If you want to know how that's done, I'll share the LyX document with you. The next thing I want to learn is how to make transpose symbols look nicer.
Overview: The 5-step process for manufacturing MVN
The paper says a 5-step algorithm receives the user's requested mean vector µ and variance-covariance matrix Σ.
1. Calculate the eigen decomposition of Σ.
2. Check that Σ is positive definite by inspecting the eigenvalues.
   a. If an eigenvalue is intolerably negative, terminate with an error message.
   b. Tolerably negative eigenvalues are reset to 0.
3. Create a scaling matrix, S. The two programs differ in this stage: R uses the eigen decomposition, while Stata uses Cholesky roots.
4. Create a candidate n × p matrix of random vectors by drawing from N(0, 1).
5. Apply y = µ + Sx to rescale the candidate random draws.
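To make the five steps concrete, here is a minimal R sketch of the whole pipeline. The function name rmvn_sketch and the tolerance default are my own choices; the actual R and Stata routines differ in detail.

    rmvn_sketch <- function(n, mu, Sigma, tol = 1e-6) {
      # Step 1: eigen decomposition of the requested covariance matrix
      es <- eigen(Sigma, symmetric = TRUE)
      # Step 2: PD check, judged relative to the largest eigenvalue
      if (any(es$values < -tol * abs(es$values[1])))
        stop("Sigma is not positive semidefinite")
      lambda <- pmax(es$values, 0)              # reset tolerably negative values to 0
      # Step 3: scaling matrix from the eigen decomposition (the R-style choice)
      S <- es$vectors %*% diag(sqrt(lambda), length(lambda))
      # Step 4: candidate draws, an n x p matrix of N(0, 1) values
      x <- matrix(rnorm(n * length(mu)), nrow = n)
      # Step 5: rescale and shift, y = mu + S x (each row of the result is one draw)
      t(mu + S %*% t(x))
    }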
MVN
x ∼ MVN(µ, Σ) is a draw from the multivariate normal distribution:

\[
\mathbf{x} =
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}
\sim MVN(\boldsymbol{\mu}, \boldsymbol{\Sigma})
= MVN\left(
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix},
\begin{bmatrix}
\sigma^2_{x_1} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{12} & \sigma^2_{x_2} & \cdots & \sigma_{2p} \\
\vdots & & \ddots & \vdots \\
\sigma_{1p} & \sigma_{2p} & \cdots & \sigma^2_{x_p}
\end{bmatrix}
\right) \quad (1)
\]

The output from one draw is a vector of values, not a single number:

(1.1, 2.4, −3, 2.2, . . .)^T  (2)

A data set might be a collection of those draws, each one representing a person's answers on a survey.
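For instance (an example of mine, not from the slides), a small "data set" of such draws can be produced with MASS::mvrnorm in R:

    library(MASS)
    mu <- c(0, 50, 100)
    Sigma <- diag(c(1, 4, 9))                # independent variables with variances 1, 4, 9
    mvrnorm(n = 5, mu = mu, Sigma = Sigma)   # 5 rows, one simulated "respondent" per row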
Draw a sample from N(µ, σ²)
A univariate "standard" normal draw is x_i ∼ N(0, 1). Most programs can draw that. I could teach you where that comes from, but let's not worry about it now.
If you want a draw from N(5, 100), you calculate

y_i = 5 + 10 × x_i  (3)

In words: multiply by the square root of the variance and add the desired expected value. For N(µ, σ²),

y = µ + σ · x.  (4)

Little Puzzle. Suppose you want draws that appear as if they are N(0, 9). Is it correct to write

x_i = 0 + √9 u_i ?  (5)

Hint: the square root of 9 is not unique.
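As a quick sanity check (my own snippet), the rescaling rule in (3) is one line of R:

    x <- rnorm(10000)        # standard normal draws, N(0, 1)
    y <- 5 + 10 * x          # approximately N(5, 100): multiply by sd = 10, add mean = 5
    c(mean(y), var(y))       # should come out close to 5 and 100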
Generalize that to MVN(µ, Σ)
The process is going to be the same, except it is scaled up to draw a vector of candidate values filled with N(0, 1) draws:

y = µ + Sx.  (6)

The p candidate N(0, 1) values are in x; multiply them by something, add something:

\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix}
=
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix}
+
\begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}
\]

The big puzzle is where you get that matrix full of s's so that the resulting random draw has variance Σ, and what properties it should have.
Generalize that to MVN(µ, Σ) ...
Since E[x] = 0,

E[y] = E[µ + Sx] = µ + S E[x] = µ.  (7)

And the variance matrix of y is

Var[y] = S Var(x) S^T.  (8)

Because Var(x) = I,

Var[y] = S S^T.  (9)

So if our goal is to generate y ∼ MVN(µ, Σ), then we need Σ = S S^T. In a sense, S is thus one kind of "square root" of a matrix.
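A quick numerical check of the Σ = S S^T requirement (my own illustration, using the lower Cholesky factor introduced later as the scaling matrix):

    Sigma <- matrix(c(4, 2, 2, 3), 2, 2)
    S <- t(chol(Sigma))      # lower-triangular scaling matrix
    S %*% t(S)               # reproduces Sigma
    t(S) %*% S               # generally does NOT: the order of the product matters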
If we are given a user's requested Σ, can we use that to create the matrix we need?
Not always. There may be no matrix "square root". The user might supply a Σ that's not actually a covariance matrix.
What do we know about Σ?
Symmetric: σ_ij = σ_ji.
Positive Definite (PD) or Positive Semidefinite (PSD). This is either esoteric and difficult to understand or simple and obvious. Here is my simple and obvious explanation.
Here's the formula for the MVN:

\[
f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}}
e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})} \quad (10)
\]
If we are given a user's requested Σ, can we use that to create the matrix we need? ...
The basic idea of the normal is that as a point moves away from the mean, as x − µ grows greater, we are less and less likely to observe that point. For this to be true, we need the quadratic form to be positive; otherwise, probability would grow as x − µ grows:

(x − µ)^T Σ^{-1} (x − µ) > 0.  (11)

This seems obvious to me. Suppose Σ = I, so this reduces to the (Pythagorean) distance between x and µ:

(x − µ)^T (x − µ) = (x_1 − µ_1)² + (x_2 − µ_2)² + . . . + (x_p − µ_p)²  (12)

That's a sum of squared things, so it has to be 0 or greater. The Σ^{-1} matrix puts weights on the deviations, but it should not be able to turn the distance between x and µ into a negative number. The closest two things can be is the same location. There's a theorem that says if Σ is positive definite, then Σ^{-1} is also PD.
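To see why this fails for an invalid Σ, here is a small example of my own: a symmetric matrix that cannot be a covariance matrix, and a deviation vector for which the quadratic form goes negative.

    BadSigma <- matrix(c(1, 2, 2, 1), 2, 2)   # "covariance" of 2 with variances of 1 is impossible
    eigen(BadSigma)$values                    # 3 and -1: one eigenvalue is negative
    d <- c(1, -1)                             # a deviation vector (x - mu)
    t(d) %*% solve(BadSigma) %*% d            # -2: the "distance" comes out negative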
Check if a matrix is PD: Eigen Decomposition
We insist that Σ is symmetric. We need a way to find out if it is PD. This turns out to be tricky because of numerical rounding error.
In pencil-and-paper math, there is a theorem that says a matrix is PD if and only if all of its eigenvalues are positive. Recognizing that eigenvalues are approximated in computers, R and Stata use the same approach. A tolerance value is used as follows:

λ_j < −tol |λ_1|.  (13)

If λ_j is so far negative that it is below −tol |λ_1|, then we conclude the matrix is not PD.
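Here is that tolerance test written out in R (my own sketch; the published routines word the check slightly differently):

    is_psd <- function(Sigma, tol = 1e-6) {
      ev <- eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values
      # eigen() returns the eigenvalues in decreasing order, so ev[1] is the largest
      all(ev >= -tol * abs(ev[1]))
    }
    is_psd(matrix(c(1, 2, 2, 1), 2, 2))   # FALSE: the eigenvalue -1 fails the test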
Background info on the eigenvalues and eigenvectors
The eigenvalues λ_j and eigenvectors v_j of Σ are paired in this weird way:

Σ v_j = λ_j v_j  (14)

The vector v_j can be thought of as a thing that comes out of Σ merely rescaled: the effect of Σ on v_j equals a simple multiplicative rescaling by λ_j.
Eigenvector matrix: collect the eigenvectors, side by side:

\[
\mathbf{V} =
\begin{bmatrix}
v_{11} & \cdots & v_{1p} \\
v_{21} & \cdots & v_{2p} \\
\vdots & & \vdots \\
v_{p1} & \cdots & v_{pp}
\end{bmatrix} \quad (15)
\]
Background info on the eigenvalues and eigenvectors ...
Interesting property: the eigenvectors are "orthogonal" to one another:

v_1^T · v_2 = 0  (16)

We usually scale the eigenvectors so that v_i^T · v_i = 1, which implies this interesting property:

\[
\mathbf{V}^T \mathbf{V} = \mathbf{I} =
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix} \quad (17)
\]

And thus the inverse of the orthogonal matrix V is very easy to calculate: it is the transpose,

V^T = V^{-1}  (18)
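A quick check of (17) and (18) in R (my own snippet):

    Sigma <- matrix(c(4, 2, 2, 3), 2, 2)
    V <- eigen(Sigma, symmetric = TRUE)$vectors
    round(t(V) %*% V, 10)       # the identity matrix: the columns are orthonormal
    round(t(V) - solve(V), 10)  # a matrix of zeros: the transpose is the inverse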
Background info on the eigenvalues and eigenvectors ...
Anyway, given the vector of eigenvalues (λ_1, λ_2, . . . , λ_p): in pencil-and-paper math, λ_j > 0 for all j. Numerically, if λ_j is just a little bit negative, we act as if that is roundoff error and re-assign it as 0.
If all eigenvalues are greater than 0, Cholesky root
Cholesky works for SQUARE, PD matrices only.

\[
\boldsymbol{\Sigma} = \mathbf{R}^T \times \mathbf{R} =
\begin{bmatrix}
r_{11} & 0 & 0 & \cdots & 0 \\
r_{12} & r_{22} & 0 & \cdots & 0 \\
r_{13} & r_{23} & r_{33} & & 0 \\
\vdots & & & \ddots & \vdots \\
r_{1p} & r_{2p} & r_{3p} & \cdots & r_{pp}
\end{bmatrix}
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & \cdots & r_{1p} \\
0 & r_{22} & r_{23} & \cdots & r_{2p} \\
0 & 0 & r_{33} & & r_{3p} \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & r_{pp}
\end{bmatrix} \quad (19)
\]

If you require that the diagonal elements be positive, this is a UNIQUE triangular square root.
We usually see people using the lower triangular part of the Cholesky as the scaling matrix:

S = R^T  (20)

Other square roots exist, as we see next.
Recall the Cholesky root is not guaranteed to exist if Σ is not strictly PD; being merely positive semidefinite is not sufficient.
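In R, the pieces of (19) and (20) look like this (my own example; note that chol() returns the upper-triangular factor, so the lower-triangular scaling matrix is its transpose):

    Sigma <- matrix(c(4, 2, 2, 3), 2, 2)
    R <- chol(Sigma)                 # upper-triangular R with Sigma = R^T R
    S <- t(R)                        # lower-triangular scaling matrix
    all.equal(S %*% t(S), Sigma)     # TRUE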
If an eigenvalue is equal to 0, it is not PD, must use Eigen decomposition
Start from Σ V = V diag(λ), post-multiply by V^T, and use V V^T = I:

Σ V = V diag(λ)
Σ V V^T = V diag(λ) V^T
Σ = V diag(λ) V^T.  (21)

The function diag places a vector's values along the diagonal:

\[
diag(\lambda) =
\begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_p
\end{bmatrix} \quad (22)
\]
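The decomposition in (21) is easy to verify numerically (my own snippet):

    Sigma <- matrix(c(4, 2, 2, 3), 2, 2)
    es <- eigen(Sigma, symmetric = TRUE)
    all.equal(es$vectors %*% diag(es$values) %*% t(es$vectors), Sigma)   # TRUE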
If an eigenvalue is equal to 0, it is not PD, must use Eigen decomposition ...
Since λ_j ≥ 0, a real-valued square root of each eigenvalue exists, and we can write diag(λ) as a product using diag(λ)^{1/2}:

\[
diag(\lambda) =
\begin{bmatrix}
\sqrt{\lambda_1} & 0 & \cdots & 0 \\
0 & \sqrt{\lambda_2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{\lambda_p}
\end{bmatrix}
\begin{bmatrix}
\sqrt{\lambda_1} & 0 & \cdots & 0 \\
0 & \sqrt{\lambda_2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{\lambda_p}
\end{bmatrix}
= diag(\lambda)^{1/2} diag(\lambda)^{1/2} \quad (23)
\]

This allows us to revise (21) into a format that helps us to see that we have a square root of Σ:

Σ = V diag(λ)^{1/2} diag(λ)^{1/2} V^T = V diag(λ)^{1/2} (V diag(λ)^{1/2})^T.  (24)
If an eigenvalue is equal to 0, it is not PD, must use Eigen decomposition ...
The scaling matrix will be

S = V diag(λ)^{1/2}  (25)

because S S^T = V diag(λ)^{1/2} diag(λ)^{1/2} V^T = V diag(λ) V^T = Σ.
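Here is (25) in R (my own snippet), showing that the eigen-based scaling matrix still reproduces Σ when one eigenvalue is exactly 0:

    Sigma <- matrix(c(1, 1, 1, 1), 2, 2)     # rank 1: eigenvalues 2 and 0, so Sigma is only PSD
    es <- eigen(Sigma, symmetric = TRUE)
    lambda <- pmax(es$values, 0)             # guard against tiny negative roundoff
    S <- es$vectors %*% diag(sqrt(lambda), 2)
    all.equal(S %*% t(S), Sigma)             # TRUE: S S^T recovers Sigma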