
Stat 5102 Lecture Slides: Deck 1. Empirical Distributions, Exact Sampling Distributions, Asymptotic Sampling Distributions. Charles J. Geyer, School of Statistics, University of Minnesota.


1. Empirical Distribution Calculations in R (cont.) R also has a function quantile that calculates quantiles of the empirical distribution. As we mentioned, there is no widely accepted notion of the best way to calculate quantiles. The definition we gave is simple and theoretically correct, but arguments can be given for other notions, and the quantile function can calculate no less than nine different notions of "quantile" (the one we want is type 1). quantile(x, type = 1) calculates a bunch of quantiles. Other quantiles can be specified: quantile(x, probs = 1 / 3, type = 1). 29
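As an illustration, here is a minimal sketch (not from the slides) comparing quantile(x, type = 1) with a direct computation from the sorted data; the data vector x and the probability p = 1/3 are made up for the example.

```r
## A minimal sketch: quantile(type = 1) versus the direct "inverse
## empirical CDF" computation from the sorted data.
x <- c(3, 1, 4, 1, 5, 9, 2, 6)        # illustrative data
quantile(x, type = 1)                 # default set of quantiles
quantile(x, probs = 1 / 3, type = 1)  # a specific quantile

## direct computation for 0 < p <= 1: the smallest order statistic
## x_(k) with k / n >= p, that is, k = ceiling(n * p)
p <- 1 / 3
sort(x)[ceiling(length(x) * p)]
```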

2. Little x to Big X We now do something really tricky. So far we have just been reviewing finite probability spaces. The numbers $x_1, \ldots, x_n$ are just numbers. Now we want to make the numbers $X_1, \ldots, X_n$ that determine the empirical distribution IID random variables. In one sense the change is trivial: capitalize all the $x$'s you see. In another sense the change is profound: now all the thingummies of interest (mean, variance, other moments, median, quantiles, and DF of the empirical distribution) are random variables. 30

3. Little x to Big X (cont.) For example,
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$$
(the mean of the empirical distribution) is a random variable. What is the distribution of this random variable? It is determined somehow by the distribution of the $X_i$. When the distribution of $\bar{X}_n$ is not a brand-name distribution but the distribution of
$$n \bar{X}_n = \sum_{i=1}^n X_i$$
is a brand-name distribution, then we refer to that. 31

4. Sampling Distribution of the Empirical Mean The distribution of $n\bar{X}_n$ is given by what the brand-name distribution handout calls "addition rules". If each $X_i$ is $\text{Ber}(p)$, then $n\bar{X}_n$ is $\text{Bin}(n, p)$. If each $X_i$ is $\text{Geo}(p)$, then $n\bar{X}_n$ is $\text{NegBin}(n, p)$. If each $X_i$ is $\text{Poi}(\mu)$, then $n\bar{X}_n$ is $\text{Poi}(n\mu)$. If each $X_i$ is $\text{Exp}(\lambda)$, then $n\bar{X}_n$ is $\text{Gam}(n, \lambda)$. If each $X_i$ is $N(\mu, \sigma^2)$, then $n\bar{X}_n$ is $N(n\mu, n\sigma^2)$. 32
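Here is a minimal sketch (not from the slides) checking one of these addition rules by simulation; the particular values of n, lambda, and the number of simulations are arbitrary choices for the example.

```r
## A minimal sketch: if each X_i is Exp(lambda), then n * Xbar_n, the
## sum of the X_i, should be Gam(n, lambda) (shape n, rate lambda).
set.seed(42)
n <- 5
lambda <- 2
nsim <- 1e5
sums <- replicate(nsim, sum(rexp(n, rate = lambda)))

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
round(rbind(simulated   = quantile(sums, probs),
            theoretical = qgamma(probs, shape = n, rate = lambda)), 3)
```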

5. Sampling Distribution of the Empirical Mean (cont.) In the latter two cases, we can apply the change-of-variable theorem to the linear transformation $y = x/n$, obtaining
$$f_Y(y) = n f_X(ny)$$
If each $X_i$ is $\text{Exp}(\lambda)$, then $\bar{X}_n$ is $\text{Gam}(n, n\lambda)$. If each $X_i$ is $N(\mu, \sigma^2)$, then $\bar{X}_n$ is $N(\mu, \sigma^2/n)$. 33

6. Sampling Distribution of the Empirical Mean (cont.) For most distributions of the $X_i$ we cannot calculate the exact sampling distribution of $n\bar{X}_n$ or of $\bar{X}_n$. The central limit theorem (CLT), however, gives an approximation of the sampling distribution when $n$ is large. If each $X_i$ has mean $\mu$ and variance $\sigma^2$, then $\bar{X}_n$ is approximately $N(\mu, \sigma^2/n)$. The CLT is not applicable if the $X_i$ do not have finite variance. 34
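A minimal sketch (not from the slides) of how good this approximation is in one case where the exact sampling distribution is known; the sample size n = 30 is an arbitrary choice for the example.

```r
## A minimal sketch: for Exp(1) data the exact sampling distribution of
## Xbar_n is Gam(n, n) (previous slide); the CLT approximation is
## N(1, 1 / n).  Compare the two CDFs at a few points.
n <- 30
x <- seq(0.6, 1.4, by = 0.2)
round(rbind(exact = pgamma(x, shape = n, rate = n),
            clt   = pnorm(x, mean = 1, sd = 1 / sqrt(n))), 3)
```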

7. Sampling Distributions The same game can be played with any of the other quantities, the empirical median, for example. Much more can be said about the empirical mean, because we have the addition rules to work with. The distribution of the empirical median is not brand-name unless the $X_i$ are $\text{Unif}(0, 1)$ and $n$ is odd. There is a large $n$ approximation, but the argument is long and complicated. We will do both, but not right away. 35

8. Sampling Distributions (cont.) The important point to understand for now is that any random variable has a distribution (whether we can name it or otherwise describe it), hence these quantities related to the empirical distribution have probability distributions, called their sampling distributions, and we can sometimes describe them exactly, sometimes give large $n$ approximations, and sometimes not even that. But they always exist, whether we can describe them or not, and we can refer to them in theoretical arguments. 36

9. Sampling Distributions (cont.) Why the "sample" in "sampling distribution"? Suppose $X_1, \ldots, X_n$ are a sample with replacement from a finite population. Then we say the distribution of each $X_i$ is the population distribution, and we say $X_1, \ldots, X_n$ are a random sample from this population, and we say the distribution of $\bar{X}_n$ is its sampling distribution because its randomness comes from $X_1, \ldots, X_n$ being a random sample. This is the story that introduces sampling distributions in most intro stats courses. It is also the language that statisticians use in talking to people who haven't had a theory course like this one. 37

10. Sampling Distributions (cont.) This language becomes only a vague metaphor when $X_1, \ldots, X_n$ are IID but their distribution does not have a finite sample space, so they cannot, strictly speaking, be considered a sample from a finite population. They can be considered a sample from an infinite population in a vague metaphorical way, but when we try to formalize this notion we cannot. Strictly speaking it is nonsense. And strictly speaking, the "sampling" in "sampling distribution" is redundant. The "sampling distribution" of $\bar{X}_n$ is the distribution of $\bar{X}_n$. Every random variable has a probability distribution. $\bar{X}_n$ is a random variable, so it has a probability distribution, which doesn't need the adjective "sampling" attached to it any more than any other probability distribution does (i.e., not at all). 38

11. Sampling Distributions (cont.) So why do statisticians, who are serious people, persist in using this rather silly language? The phrase "sampling distribution" alerts the listener that we are not talking about the "population distribution" and that the distribution of $\bar{X}_n$ or $\widetilde{X}_n$ (the empirical median) or whatever quantity related to the empirical distribution is under discussion is not the same as the distribution of each $X_i$. Of course, no one theoretically sophisticated (like all of you) would think for a second that the distribution of $\bar{X}_n$ is the same as the distribution of the $X_i$, but, probability being hard for less sophisticated audiences, the stress in "sampling distribution", redundant though it may be, is perhaps useful. 39

12. Chi-Square Distribution Recall that for any real number $\nu > 0$ the chi-square distribution having $\nu$ degrees of freedom, abbreviated $\text{chi}^2(\nu)$, is another name for the $\text{Gam}\left(\frac{\nu}{2}, \frac{1}{2}\right)$ distribution. 40
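A minimal sketch (not from the slides) checking this identity numerically; the degrees of freedom and evaluation points are arbitrary.

```r
## A minimal sketch: chi^2(nu) is the Gam(nu / 2, 1 / 2) distribution,
## so the two density functions in R should agree.
nu <- 7
x <- c(0.5, 1, 2, 5, 10)
all.equal(dchisq(x, df = nu), dgamma(x, shape = nu / 2, rate = 1 / 2))
```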

13. Student's T Distribution Now we come to a new brand-name distribution whose name is the single letter $t$ (not very good terminology). It is sometimes called "Student's $t$ distribution" because it was invented by W. S. Gosset, who published under the pseudonym "Student". Suppose $Z$ and $Y$ are independent random variables,
$$Z \sim N(0, 1) \qquad Y \sim \text{chi}^2(\nu)$$
then
$$T = \frac{Z}{\sqrt{Y/\nu}}$$
is said to have Student's $t$ distribution with $\nu$ degrees of freedom, abbreviated $t(\nu)$. 41
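A minimal sketch (not from the slides) of this definition as a simulation recipe; the degrees of freedom and simulation size are arbitrary choices.

```r
## A minimal sketch: simulate T = Z / sqrt(Y / nu) with Z standard
## normal and Y chi^2(nu) independent, and compare its quantiles with
## those of the t(nu) distribution.
set.seed(42)
nu <- 4
nsim <- 1e5
z <- rnorm(nsim)
y <- rchisq(nsim, df = nu)
t.sim <- z / sqrt(y / nu)

probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(rbind(simulated   = quantile(t.sim, probs),
            theoretical = qt(probs, df = nu)), 3)
```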

14. Student's T Distribution (cont.) The PDF of the $t(\nu)$ distribution is
$$f_\nu(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \cdot \frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{(\nu + 1)/2}}, \qquad -\infty < x < +\infty$$
Because $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$ (5101 Slide 158, Deck 3),
$$\frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} = \frac{1}{\sqrt{\nu}\, B\left(\frac{\nu}{2}, \frac{1}{2}\right)}$$
where the beta function $B\left(\frac{\nu}{2}, \frac{1}{2}\right)$ is the normalizing constant of the beta distribution defined in the brand name distributions handout. 42
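A minimal sketch (not from the slides) checking this formula against R's built-in density dt(); the degrees of freedom and evaluation points are arbitrary.

```r
## A minimal sketch: the t(nu) density formula from this slide should
## agree with dt().
nu <- 5
x <- seq(-3, 3, by = 1)
f <- gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) *
  (1 + x^2 / nu)^(-(nu + 1) / 2)
all.equal(f, dt(x, df = nu))
```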

15. Student's T Distribution (cont.) The joint distribution of $Z$ and $Y$ in the definition is
$$f(z, y) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \cdot \frac{\left(\frac{1}{2}\right)^{\nu/2}}{\Gamma(\nu/2)} y^{\nu/2 - 1} e^{-y/2}$$
Make the change of variables $t = z/\sqrt{y/\nu}$ and $u = y$, which has inverse transformation
$$z = t\sqrt{u/\nu} \qquad y = u$$
and Jacobian
$$\det \begin{pmatrix} \sqrt{u/\nu} & \dfrac{t}{2\sqrt{u\nu}} \\ 0 & 1 \end{pmatrix} = \sqrt{u/\nu}$$
43

16. Student's T Distribution (cont.) The joint distribution of $T$ and $U$ given by the multivariate change-of-variable formula (5101, Slides 121–122 and 128–136, Deck 3) is
$$\begin{aligned}
f(t, u) &= \frac{1}{\sqrt{2\pi}} e^{-(t\sqrt{u/\nu})^2/2} \cdot \frac{\left(\frac{1}{2}\right)^{\nu/2}}{\Gamma(\nu/2)} u^{\nu/2 - 1} e^{-u/2} \cdot \sqrt{u/\nu} \\
&= \frac{1}{\sqrt{2\pi}\sqrt{\nu}} \cdot \frac{\left(\frac{1}{2}\right)^{\nu/2}}{\Gamma(\nu/2)} u^{(\nu + 1)/2 - 1} \exp\left\{ -\frac{u}{2}\left(1 + \frac{t^2}{\nu}\right) \right\}
\end{aligned}$$
Thought of as a function of $u$ for fixed $t$, this is proportional to a gamma density with shape parameter $(\nu + 1)/2$ and rate parameter $\frac{1}{2}\left(1 + \frac{t^2}{\nu}\right)$. 44

17. Student's T Distribution (cont.) The "recognize the unnormalized density trick", which is equivalent to using the "theorem" for the gamma distribution, allows us to integrate out $u$, getting the marginal of $t$
$$f(t) = \frac{1}{\sqrt{2\pi}\sqrt{\nu}} \cdot \frac{\left(\frac{1}{2}\right)^{\nu/2}}{\Gamma(\nu/2)} \cdot \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\left[\frac{1}{2}\left(1 + \frac{t^2}{\nu}\right)\right]^{(\nu + 1)/2}}$$
which, after changing $t$ to $x$, simplifies to the form given on slide 42. 45

18. Student's T Distribution: Moments The $t$ distribution is symmetric about zero, hence the mean is zero if the mean exists. Hence central moments are equal to ordinary moments. Hence every odd ordinary moment is zero if it exists. For the $t(\nu)$ distribution and $k > 0$, the ordinary moment $E(|X|^k)$ exists if and only if $k < \nu$. 46

19. Student's T Distribution: Moments (cont.) The PDF is bounded, so the question of whether moments exist only involves behavior of the PDF at $\pm\infty$. Since the $t$ distribution is symmetric about zero, we only need to check the behavior at $+\infty$. When does
$$\int_0^\infty x^k f_\nu(x)\, dx$$
exist? We have
$$\lim_{x \to \infty} \frac{x^k f_\nu(x)}{x^\alpha} = c$$
for some constant $c > 0$ when $\alpha = k - (\nu + 1)$. The comparison theorem (5101 Slide 9, Deck 6) says the integral exists if and only if
$$k - (\nu + 1) = \alpha < -1$$
which is equivalent to $k < \nu$. 47

20. Student's T Distribution: Moments (cont.) If $X$ has the $t(\nu)$ distribution and $\nu > 1$, then
$$E(X) = 0$$
Otherwise the mean does not exist. (Proof: symmetry.) If $X$ has the $t(\nu)$ distribution and $\nu > 2$, then
$$\operatorname{var}(X) = \frac{\nu}{\nu - 2}$$
Otherwise the variance does not exist. (Proof: homework.) 48

21. Student's T Distribution and Cauchy Distribution Plugging $\nu = 1$ into the formula for the PDF of the $t(\nu)$ distribution on slide 42 gives the PDF of the standard Cauchy distribution. In short, $t(1) = \text{Cauchy}(0, 1)$. Hence if $Z_1$ and $Z_2$ are independent $N(0, 1)$ random variables, then
$$T = \frac{Z_1}{Z_2}$$
has the $\text{Cauchy}(0, 1)$ distribution. 49
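A minimal sketch (not from the slides) of this fact as a simulation; the simulation size is arbitrary.

```r
## A minimal sketch: the ratio of two independent standard normals
## should be Cauchy(0, 1), that is, t(1).
set.seed(42)
nsim <- 1e5
ratio <- rnorm(nsim) / rnorm(nsim)

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
round(rbind(simulated = quantile(ratio, probs),
            cauchy    = qcauchy(probs)), 3)
```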

22. Student's T Distribution and Normal Distribution If $Y_\nu$ is $\text{chi}^2(\nu) = \text{Gam}\left(\frac{\nu}{2}, \frac{1}{2}\right)$, then $U_\nu = Y_\nu/\nu$ is $\text{Gam}\left(\frac{\nu}{2}, \frac{\nu}{2}\right)$, and
$$E(U_\nu) = 1 \qquad \operatorname{var}(U_\nu) = \frac{2}{\nu}$$
Hence
$$U_\nu \xrightarrow{P} 1, \qquad \text{as } \nu \to \infty$$
by Chebyshev's inequality. Hence if $Z$ is a standard normal random variable independent of $Y_\nu$,
$$\frac{Z}{\sqrt{Y_\nu/\nu}} \xrightarrow{D} Z, \qquad \text{as } \nu \to \infty$$
by Slutsky's theorem. In short, the $t(\nu)$ distribution converges to the $N(0, 1)$ distribution as $\nu \to \infty$. 50

23. Snedecor's F Distribution If $X$ and $Y$ are independent random variables and
$$X \sim \text{chi}^2(\nu_1) \qquad Y \sim \text{chi}^2(\nu_2)$$
then
$$W = \frac{X/\nu_1}{Y/\nu_2}$$
has the $F$ distribution with $\nu_1$ numerator degrees of freedom and $\nu_2$ denominator degrees of freedom. 51
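A minimal sketch (not from the slides) of this definition as a simulation recipe; the degrees of freedom and simulation size are arbitrary choices.

```r
## A minimal sketch: simulate W = (X / nu1) / (Y / nu2) with X and Y
## independent chi-squared and compare with the F(nu1, nu2) distribution.
set.seed(42)
nu1 <- 3
nu2 <- 10
nsim <- 1e5
w <- (rchisq(nsim, df = nu1) / nu1) / (rchisq(nsim, df = nu2) / nu2)

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
round(rbind(simulated   = quantile(w, probs),
            theoretical = qf(probs, nu1, nu2)), 3)
```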

24. Snedecor's F Distribution (cont.) The "F" is for R. A. Fisher, who introduced a function of this random variable into statistical inference. This particular random variable was introduced by G. Snedecor. Hardly anyone knows this history or uses the eponyms. This is our second brand-name distribution whose name is a single Roman letter (we also have two, beta and gamma, whose names are single Greek letters). It is abbreviated $F(\nu_1, \nu_2)$. 52

25. Snedecor's F Distribution (cont.) The theorem on slides 128–137, 5101 Deck 3 says that if $X$ and $Y$ are independent random variables and
$$X \sim \text{Gam}(\alpha_1, \lambda) \qquad Y \sim \text{Gam}(\alpha_2, \lambda)$$
then
$$V = \frac{X}{X + Y}$$
has the $\text{Beta}(\alpha_1, \alpha_2)$ distribution. 53

26. Snedecor's F Distribution (cont.) Hence, if $X$ and $Y$ are independent random variables and
$$X \sim \text{chi}^2(\nu_1) \qquad Y \sim \text{chi}^2(\nu_2)$$
then
$$V = \frac{X}{X + Y}$$
has the $\text{Beta}\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)$ distribution. 54

27. Snedecor's F Distribution (cont.) Since
$$\frac{X}{Y} = \frac{V}{1 - V}$$
we have
$$W = \frac{\nu_2}{\nu_1} \cdot \frac{V}{1 - V}$$
and
$$V = \frac{\nu_1 W / \nu_2}{1 + \nu_1 W / \nu_2}$$
This gives the relationship between the $F(\nu_1, \nu_2)$ distribution of $W$ and the $\text{Beta}\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)$ distribution of $V$. 55
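A minimal sketch (not from the slides) checking this relationship numerically through the two CDFs; the degrees of freedom and evaluation points are arbitrary.

```r
## A minimal sketch: P(W <= w) for W ~ F(nu1, nu2) should equal
## P(V <= v) for V ~ Beta(nu1 / 2, nu2 / 2), where
## v = (nu1 * w / nu2) / (1 + nu1 * w / nu2).
nu1 <- 3
nu2 <- 10
w <- c(0.5, 1, 2, 4)
v <- (nu1 * w / nu2) / (1 + nu1 * w / nu2)
all.equal(pf(w, nu1, nu2), pbeta(v, nu1 / 2, nu2 / 2))
```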

28. Snedecor's F Distribution (cont.) The PDF of the $F$ distribution can be derived from the PDF of the beta distribution using the change-of-variable formula. It is given in the brand name distributions handout, but is not very useful. If one wants moments of the $F$ distribution, for example,
$$E(W) = \frac{\nu_2}{\nu_2 - 2}$$
when $\nu_2 > 2$, write $W$ as a function of $V$ and calculate the moment that way. 56

29. Snedecor's F Distribution (cont.) The same argument used to show
$$t(\nu) \xrightarrow{D} N(0, 1), \qquad \text{as } \nu \to \infty$$
shows
$$F(\nu_1, \nu_2) \xrightarrow{P} 1, \qquad \text{as } \nu_1 \to \infty \text{ and } \nu_2 \to \infty$$
So an $F$ random variable is close to 1 when both degrees of freedom are large. 57

30. Sampling Distributions for Normal Populations Suppose $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$ and
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \qquad V_n = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2$$
are the mean and variance of the empirical distribution. Then $\bar{X}_n$ and $V_n$ are independent random variables and
$$\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right) \qquad \frac{n V_n}{\sigma^2} \sim \text{chi}^2(n - 1)$$
58
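A minimal sketch (not from the slides) checking the chi-square part of this theorem by simulation; the sample size and parameter values are arbitrary choices.

```r
## A minimal sketch: for IID normal data, n * V_n / sigma^2 should
## behave like chi^2(n - 1), where V_n is the empirical variance
## (denominator n, not n - 1).
set.seed(42)
n <- 10
mu <- 5
sigma <- 2
nsim <- 1e5
vn <- replicate(nsim, {
  x <- rnorm(n, mu, sigma)
  mean((x - mean(x))^2)
})

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
round(rbind(simulated   = quantile(n * vn / sigma^2, probs),
            theoretical = qchisq(probs, df = n - 1)), 3)
```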

31. Sampling Distributions for Normal Populations (cont.) It is traditional to name the distribution of
$$\frac{n V_n}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \bar{X}_n)^2$$
rather than of $V_n$ itself. But, of course, if
$$\frac{n V_n}{\sigma^2} \sim \text{chi}^2(n - 1)$$
then
$$V_n \sim \text{Gam}\left(\frac{n - 1}{2}, \frac{n}{2\sigma^2}\right)$$
by the change-of-variable theorem. 59

32. Sampling Distributions for Normal Populations (cont.) Strictly speaking, the "populations" in the heading should be in scare quotes, because infinite populations are vague metaphorical nonsense. Less pedantically, it is important to remember that the theorem on slide 58 has no analog for non-normal populations. In general, $\bar{X}_n$ and $V_n$ are not independent. In general, the sampling distribution of $\bar{X}_n$ is not exactly $N\left(\mu, \frac{\sigma^2}{n}\right)$, although it is approximately so when $n$ is large. In general, the sampling distribution of $V_n$ is not exactly gamma. 60

33. Empirical Variance and Sample Variance Those who have been exposed to an introductory statistics course may be wondering why we keep saying "empirical mean" rather than "sample mean", which everyone else says. The answer is that the "empirical variance" $V_n$ is not what everyone else calls the "sample variance". In general, we do not know the distribution of $V_n$. It is not brand-name and is hard or impossible to describe explicitly. However, we always have
$$E(V_n) = \frac{n - 1}{n} \cdot \sigma^2$$
61

34. Empirical Variance and Sample Variance (cont.) Define
$$V_n^* = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2$$
where $\mu = E(X_i)$. Then $E(V_n^*) = \sigma^2$, because $E\{(X_i - \mu)^2\} = \sigma^2$. The empirical analog of the mean square error formula (derived on slide 7) is
$$E_n\{(X - a)^2\} = \operatorname{var}_n(X) + (a - \bar{X}_n)^2$$
and plugging in $\mu$ for $a$ gives
$$V_n^* = E_n\{(X - \mu)^2\} = \operatorname{var}_n(X) + (\mu - \bar{X}_n)^2 = V_n + (\mu - \bar{X}_n)^2$$
62

35. Empirical Variance and Sample Variance (cont.) But since $E(\bar{X}_n) = \mu$ (5101, Slide 90, Deck 2),
$$E\{(\mu - \bar{X}_n)^2\} = \operatorname{var}(\bar{X}_n)$$
In summary,
$$E(V_n^*) = E(V_n) + \operatorname{var}(\bar{X}_n)$$
and we know $\operatorname{var}(\bar{X}_n) = \sigma^2/n$ (5101, Slide 90, Deck 2), so
$$E(V_n) = E(V_n^*) - \operatorname{var}(\bar{X}_n) = \sigma^2 - \frac{\sigma^2}{n} = \frac{n - 1}{n} \cdot \sigma^2$$
63

36. Empirical Variance and Sample Variance (cont.) The factor $(n - 1)/n$ is deemed to be unsightly, so
$$S_n^2 = \frac{n}{n - 1} \cdot V_n = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$$
which has the simpler property
$$E(S_n^2) = \sigma^2$$
is usually called the sample variance, and $S_n$ is usually called the sample standard deviation. In cookbook applied statistics the fact that these are not the variance and standard deviation of the empirical distribution does no harm. But it does mess up the theory. So we do not take $S_n^2$ as being the obvious quantity to study and look at $V_n$ too. 64
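A minimal sketch (not from the slides) of how this plays out in R, where the data vector is just an illustration: R's var() and sd() compute $S_n^2$ and $S_n$, not the empirical variance and standard deviation.

```r
## A minimal sketch: var() and sd() use denominator n - 1 (sample
## variance S_n^2 and sample standard deviation S_n), so the empirical
## variance V_n needs the correction factor (n - 1) / n.
x <- c(3, 1, 4, 1, 5, 9, 2, 6)   # illustrative data
n <- length(x)

s2 <- var(x)               # sample variance S_n^2
vn <- (n - 1) / n * s2     # empirical variance V_n
all.equal(vn, mean((x - mean(x))^2))
```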

37. Sampling Distributions for Normal Populations (cont.) We now prove the theorem stated on slide 58. The random vector $(\bar{X}_n, X_1 - \bar{X}_n, \ldots, X_n - \bar{X}_n)$, being a linear function of a multivariate normal, is multivariate normal. We claim the first component $\bar{X}_n$ is independent of the other components $X_i - \bar{X}_n$, $i = 1, \ldots, n$. Since uncorrelated implies independent for multivariate normal (5101, Deck 5, Slides 130–135), it is enough to verify
$$\operatorname{cov}(\bar{X}_n, X_i - \bar{X}_n) = 0$$
65

38. Sampling Distributions for Normal Populations (cont.)
$$\begin{aligned}
\operatorname{cov}(\bar{X}_n, X_i - \bar{X}_n) &= \operatorname{cov}(\bar{X}_n, X_i) - \operatorname{var}(\bar{X}_n) \\
&= \operatorname{cov}(\bar{X}_n, X_i) - \frac{\sigma^2}{n} \\
&= \operatorname{cov}\left(\frac{1}{n} \sum_{j=1}^n X_j,\ X_i\right) - \frac{\sigma^2}{n} \\
&= \frac{1}{n} \sum_{j=1}^n \operatorname{cov}(X_j, X_i) - \frac{\sigma^2}{n} \\
&= 0
\end{aligned}$$
by linearity of expectation (5101 homework problem 4-1), by $\operatorname{cov}(X_j, X_i) = 0$ when $i \neq j$, and by $\operatorname{cov}(X_i, X_i) = \operatorname{var}(X_i) = \sigma^2$. 66

39. Sampling Distributions for Normal Populations (cont.) That finishes the proof that $\bar{X}_n$ and $V_n$ are independent random variables, because $V_n$ is a function of $X_i - \bar{X}_n$, $i = 1, \ldots, n$. That
$$\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
we already knew. It comes from the addition rule for the normal distribution. Establishing the sampling distribution of $V_n$ is more complicated. 67

40. Orthonormal Bases and Orthogonal Matrices A set of vectors $U$ is orthonormal if each has length one,
$$u^T u = 1, \qquad u \in U$$
and each pair is orthogonal,
$$u^T v = 0, \qquad u, v \in U \text{ and } u \neq v$$
An orthonormal set of $d$ vectors in $d$-dimensional space is called an orthonormal basis (plural orthonormal bases, pronounced like "base ease"). 68

41. Orthonormal Bases and Orthogonal Matrices (cont.) A square matrix whose columns form an orthonormal basis is called orthogonal. If $O$ is orthogonal, then the orthonormality property expressed in matrix notation is
$$O^T O = I$$
where $I$ is the identity matrix. This implies $O^T = O^{-1}$ and
$$O O^T = I$$
Hence the rows of $O$ also form an orthonormal basis. Orthogonal matrices have appeared before in the spectral decomposition (5101 Deck 5, Slides 103–110). 69

42. Orthonormal Bases and Orthogonal Matrices (cont.) It is a theorem of linear algebra, which we shall not prove, that any orthonormal set of vectors can be extended to an orthonormal basis (the Gram-Schmidt orthogonalization process can be used to do this). The unit vector
$$u = \frac{1}{\sqrt{n}} (1, 1, \ldots, 1)$$
all of whose components are the same, forms an orthonormal set $\{u\}$ of size one. Hence there exists an orthogonal matrix $O$ whose first column is $u$. 70
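A minimal sketch (not from the slides) of such an extension in R, using the QR decomposition as a stand-in for Gram-Schmidt; the dimension n = 5 and the padding with standard basis vectors are arbitrary choices for the example.

```r
## A minimal sketch: extend the orthonormal set {u}, where
## u = (1, ..., 1) / sqrt(n), to an orthonormal basis via the QR
## decomposition, giving an orthogonal matrix whose first column is u.
n <- 5
u <- rep(1, n) / sqrt(n)
O <- qr.Q(qr(cbind(u, diag(n)[, -1])))
if (O[1, 1] < 0) O <- -O   # QR may return -u as the first column; flip signs

all.equal(as.vector(O[, 1]), u)    # first column is u
all.equal(crossprod(O), diag(n))   # O^T O = I, so O is orthogonal
```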

43. Sampling Distributions for Normal Populations (cont.) Any orthogonal matrix $O$ maps standard normal random vectors to standard normal random vectors. If $Z$ is standard normal and $Y = O^T Z$, then
$$E(Y) = O^T E(Z) = 0$$
$$\operatorname{var}(Y) = O^T \operatorname{var}(Z) O = O^T O = I$$
71

44. Sampling Distributions for Normal Populations (cont.) Also
$$\sum_{i=1}^n Y_i^2 = \|Y\|^2 = Y^T Y = Z^T O O^T Z = Z^T Z = \sum_{i=1}^n Z_i^2$$
72

45. Sampling Distributions for Normal Populations (cont.) In the particular case where $u$ is the first column of $O$,
$$\sum_{i=1}^n Y_i^2 = Y_1^2 + \sum_{i=2}^n Y_i^2 = n \bar{Z}_n^2 + \sum_{i=2}^n Y_i^2$$
because
$$Y_1 = u^T Z = \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i = \sqrt{n}\, \bar{Z}_n$$
73

46. Sampling Distributions for Normal Populations (cont.) Hence
$$\begin{aligned}
\sum_{i=2}^n Y_i^2 &= \sum_{i=1}^n Y_i^2 - n \bar{Z}_n^2 \\
&= \sum_{i=1}^n Z_i^2 - n \bar{Z}_n^2 \\
&= n \left( \frac{1}{n} \sum_{i=1}^n Z_i^2 - \bar{Z}_n^2 \right) \\
&= n \operatorname{var}_n(Z)
\end{aligned}$$
This establishes the theorem in the special case $\mu = 0$ and $\sigma^2 = 1$, because the components of $Y$ are IID standard normal; hence $n$ times the empirical variance of $Z_1, \ldots, Z_n$ has the chi-square distribution with $n - 1$ degrees of freedom. 74

47. Sampling Distributions for Normal Populations (cont.) To finish the proof of the theorem, notice that if $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$, then
$$Z_i = \frac{X_i - \mu}{\sigma}, \qquad i = 1, \ldots, n$$
are IID standard normal. Hence
$$n \operatorname{var}_n(Z) = \frac{n \operatorname{var}_n(X)}{\sigma^2} = \frac{n V_n}{\sigma^2}$$
has the chi-square distribution with $n - 1$ degrees of freedom. That finishes the proof of the theorem stated on slide 58. 75

48. Sampling Distributions for Normal Populations (cont.) The theorem can be stated with $S_n^2$ replacing $V_n$. If $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$ and
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \qquad S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$$
then $\bar{X}_n$ and $S_n^2$ are independent random variables and
$$\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right) \qquad \frac{(n - 1) S_n^2}{\sigma^2} \sim \text{chi}^2(n - 1)$$
76

49. Sampling Distributions for Normal Populations (cont.) An important consequence uses the theorem as restated using $S_n^2$ and the definition of a $t(n - 1)$ random variable. If $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$, then
$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \sim N(0, 1) \qquad \frac{(n - 1) S_n^2}{\sigma^2} \sim \text{chi}^2(n - 1)$$
Hence
$$T = \frac{(\bar{X}_n - \mu) / (\sigma / \sqrt{n})}{\sqrt{[(n - 1) S_n^2 / \sigma^2] / (n - 1)}} = \frac{\bar{X}_n - \mu}{S_n / \sqrt{n}}$$
has the $t(n - 1)$ distribution. 77
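A minimal sketch (not from the slides) checking this consequence by simulation; the sample size and parameter values are arbitrary choices.

```r
## A minimal sketch: simulate T = (Xbar_n - mu) / (S_n / sqrt(n)) for
## IID normal data and compare with the t(n - 1) distribution.
set.seed(42)
n <- 8
mu <- 10
sigma <- 3
nsim <- 1e5
t.stat <- replicate(nsim, {
  x <- rnorm(n, mu, sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(rbind(simulated   = quantile(t.stat, probs),
            theoretical = qt(probs, df = n - 1)), 3)
```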

50. Asymptotic Sampling Distributions When the data $X_1, \ldots, X_n$ are IID from a distribution that is not normal, we have no result like the theorem just discussed for the normal distribution. Even when the data are IID normal, we have no exact sampling distribution for moments other than the mean and variance. We have to make do with asymptotic (large $n$, approximate) results. 78

51. Asymptotic Sampling Distributions (cont.) The ordinary and central moments of the distribution of the data were defined on 5101 Deck 3, Slides 151–152. The ordinary moments, if they exist, are denoted
$$\alpha_k = E(X_i^k)$$
(they are the same for all $i$ because the data are IID). The first ordinary moment is the mean, $\mu = \alpha_1$. The central moments, if they exist, are denoted
$$\mu_k = E\{(X_i - \mu)^k\}$$
(they are the same for all $i$ because the data are IID). The first central moment is always zero, $\mu_1 = 0$. The second central moment is the variance, $\mu_2 = \sigma^2$. 79

52. Asymptotic Sampling Distributions (cont.) The ordinary and central moments of the empirical distribution are defined in the same way. The ordinary moments are denoted
$$A_{k,n} = E_n(X^k) = \frac{1}{n} \sum_{i=1}^n X_i^k$$
The first ordinary moment is the empirical mean, $\bar{X}_n = A_{1,n}$. The central moments are denoted
$$M_{k,n} = E_n\{(X - \bar{X}_n)^k\} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^k$$
The first central moment is always zero, $M_{1,n} = 0$. The second central moment is the empirical variance, $M_{2,n} = V_n$. 80

53. Asymptotic Sampling Distributions (cont.) The asymptotic joint distribution of the ordinary empirical moments was done on 5101 Deck 7, Slides 93–95, although we hadn't introduced the empirical distribution yet so didn't describe it this way. 81

54. Asymptotic Sampling Distributions (cont.) Define random vectors
$$Y_i = \begin{pmatrix} X_i \\ X_i^2 \\ \vdots \\ X_i^k \end{pmatrix}$$
Then
$$\bar{Y}_n = \frac{1}{n} \sum_{i=1}^n Y_i = \begin{pmatrix} A_{1,n} \\ A_{2,n} \\ \vdots \\ A_{k,n} \end{pmatrix}$$
82

55. Asymptotic Sampling Distributions (cont.)
$$E(Y_i) = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_k \end{pmatrix}$$
$$\operatorname{var}(Y_i) = \begin{pmatrix}
\alpha_2 - \alpha_1^2 & \alpha_3 - \alpha_1 \alpha_2 & \cdots & \alpha_{k+1} - \alpha_1 \alpha_k \\
\alpha_3 - \alpha_1 \alpha_2 & \alpha_4 - \alpha_2^2 & \cdots & \alpha_{k+2} - \alpha_2 \alpha_k \\
\vdots & \vdots & \ddots & \vdots \\
\alpha_{k+1} - \alpha_1 \alpha_k & \alpha_{k+2} - \alpha_2 \alpha_k & \cdots & \alpha_{2k} - \alpha_k^2
\end{pmatrix}$$
(they are the same for all $i$ because the data are IID). Details of the variance calculation are on 5101 Deck 7, Slide 94. 83

56. Asymptotic Sampling Distributions (cont.) Write
$$E(Y_i) = \mu_{\text{ordinary}} \qquad \operatorname{var}(Y_i) = M_{\text{ordinary}}$$
($\mu_{\text{ordinary}}$ is a vector and $M_{\text{ordinary}}$ is a matrix). Then the multivariate CLT (5101 Deck 7, Slides 90–91) says
$$\bar{Y}_n \approx N\left(\mu_{\text{ordinary}}, \frac{M_{\text{ordinary}}}{n}\right)$$
Since the components of $\bar{Y}_n$ are the empirical ordinary moments up to order $k$, this gives the asymptotic (large $n$, approximate) joint distribution of the empirical ordinary moments up to order $k$. Since $M_{\text{ordinary}}$ contains population moments up to order $2k$, we need to assume those exist. 84

57. Asymptotic Sampling Distributions (cont.) All of this about empirical ordinary moments is simple, a straightforward application of the multivariate CLT, compared to the analogous theory for empirical central moments. The problem is that
$$M_{k,n} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^k$$
is not an empirical mean of the form
$$E_n\{g(X)\} = \frac{1}{n} \sum_{i=1}^n g(X_i)$$
for any function $g$. 85

58. Asymptotic Sampling Distributions (cont.) We would have a simple theory, analogous to the theory for empirical ordinary moments, if we studied instead
$$M_{k,n}^* = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^k$$
which are empirical moments but are not functions of data only, so not as interesting. It turns out that the asymptotic joint distribution of the $M_{k,n}^*$ is theoretically useful as a step on the way to the asymptotic joint distribution of the $M_{k,n}$, so let's do it. 86

59. Asymptotic Sampling Distributions (cont.) Define random vectors
$$Z_i^* = \begin{pmatrix} X_i - \mu \\ (X_i - \mu)^2 \\ \vdots \\ (X_i - \mu)^k \end{pmatrix}$$
Then
$$\bar{Z}_n^* = \begin{pmatrix} M_{1,n}^* \\ M_{2,n}^* \\ \vdots \\ M_{k,n}^* \end{pmatrix}$$
87

60. Asymptotic Sampling Distributions (cont.)
$$E(Z_i^*) = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}$$
$$\operatorname{var}(Z_i^*) = \begin{pmatrix}
\mu_2 - \mu_1^2 & \mu_3 - \mu_1 \mu_2 & \cdots & \mu_{k+1} - \mu_1 \mu_k \\
\mu_3 - \mu_1 \mu_2 & \mu_4 - \mu_2^2 & \cdots & \mu_{k+2} - \mu_2 \mu_k \\
\vdots & \vdots & \ddots & \vdots \\
\mu_{k+1} - \mu_1 \mu_k & \mu_{k+2} - \mu_2 \mu_k & \cdots & \mu_{2k} - \mu_k^2
\end{pmatrix}$$
(they are the same for all $i$ because the data are IID). The variance calculation follows from the one for ordinary moments because central moments of $X_i$ are ordinary moments of $X_i - \mu$. 88

61. Asymptotic Sampling Distributions (cont.) Write
$$E(Z_i^*) = \mu_{\text{central}} \qquad \operatorname{var}(Z_i^*) = M_{\text{central}}$$
($\mu_{\text{central}}$ is a vector and $M_{\text{central}}$ is a matrix). Then the multivariate CLT (5101 Deck 7, Slides 90–91) says
$$\bar{Z}_n^* \approx N\left(\mu_{\text{central}}, \frac{M_{\text{central}}}{n}\right)$$
Since the components of $\bar{Z}_n^*$ are the $M_{i,n}^*$ up to order $k$, this gives the asymptotic (large $n$, approximate) joint distribution of the $M_{i,n}^*$ up to order $k$. Since $M_{\text{central}}$ contains population moments up to order $2k$, we need to assume those exist. 89

62. Asymptotic Sampling Distributions (cont.) These theorems imply the laws of large numbers (LLN)
$$A_{k,n} \xrightarrow{P} \alpha_k \qquad M_{k,n}^* \xrightarrow{P} \mu_k$$
for each $k$, but these LLN actually hold under the weaker conditions that the population moments on the right-hand side exist. The CLT for $A_{k,n}$ requires population moments up to order $2k$. The LLN for $A_{k,n}$ requires population moments up to order $k$. Similarly for $M_{k,n}^*$. 90

63. Asymptotic Sampling Distributions (cont.) By the binomial theorem
$$\begin{aligned}
M_{k,n} &= \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^k \\
&= \frac{1}{n} \sum_{i=1}^n \sum_{j=0}^k \binom{k}{j} (-1)^j (\bar{X}_n - \mu)^j (X_i - \mu)^{k-j} \\
&= \sum_{j=0}^k \binom{k}{j} (-1)^j (\bar{X}_n - \mu)^j \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^{k-j} \\
&= \sum_{j=0}^k \binom{k}{j} (-1)^j (\bar{X}_n - \mu)^j M_{k-j,n}^*
\end{aligned}$$
91

64. Asymptotic Sampling Distributions (cont.) By the LLN
$$\bar{X}_n \xrightarrow{P} \mu$$
so by the continuous mapping theorem
$$(\bar{X}_n - \mu)^j \xrightarrow{P} 0$$
for any positive integer $j$. Hence by Slutsky's theorem
$$\binom{k}{j} (-1)^j (\bar{X}_n - \mu)^j M_{k-j,n}^* \xrightarrow{P} 0$$
for any positive integer $j$. Hence by another application of Slutsky's theorem
$$M_{k,n} \xrightarrow{P} \mu_k$$
92

65. Asymptotic Sampling Distributions (cont.) Define random vectors
$$Z_i = \begin{pmatrix} X_i - \bar{X}_n \\ (X_i - \bar{X}_n)^2 \\ \vdots \\ (X_i - \bar{X}_n)^k \end{pmatrix}$$
Then
$$\bar{Z}_n = \begin{pmatrix} M_{1,n} \\ M_{2,n} \\ \vdots \\ M_{k,n} \end{pmatrix}$$
93

66. Asymptotic Sampling Distributions (cont.) Since convergence in probability to a constant of random vectors is merely convergence in probability to a constant of each component (5101, Deck 7, Slides 73–78), we can write these univariate LLN as multivariate LLN
$$\bar{Z}_n^* \xrightarrow{P} \mu_{\text{central}} \qquad \bar{Z}_n \xrightarrow{P} \mu_{\text{central}}$$
94

67. Asymptotic Sampling Distributions (cont.) Up to now we used the "sloppy" version of the multivariate CLT and it did no harm because we went immediately to the conclusion. Now we want to apply Slutsky's theorem, so we need the careful, pedantically correct version. The sloppy version was
$$\bar{Z}_n^* \approx N\left(\mu_{\text{central}}, \frac{M_{\text{central}}}{n}\right)$$
The careful version is
$$\sqrt{n}\left(\bar{Z}_n^* - \mu_{\text{central}}\right) \xrightarrow{D} N(0, M_{\text{central}})$$
The careful version has no $n$ in the limit (right-hand side), as must be the case for any limit as $n \to \infty$. The sloppy version does have an $n$ on the right-hand side, which consequently cannot be a mathematical limit. 95

68. Asymptotic Sampling Distributions (cont.)
$$\begin{aligned}
\sqrt{n}(M_{k,n} - \mu_k) &= \frac{1}{\sqrt{n}} \sum_{i=1}^n \left[ (X_i - \bar{X}_n)^k - \mu_k \right] \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^n \left[ \sum_{j=0}^k \binom{k}{j} (-1)^j (\bar{X}_n - \mu)^j (X_i - \mu)^{k-j} - \mu_k \right] \\
&= \sqrt{n}(M_{k,n}^* - \mu_k) + \frac{1}{\sqrt{n}} \sum_{i=1}^n \sum_{j=1}^k \binom{k}{j} (-1)^j (\bar{X}_n - \mu)^j (X_i - \mu)^{k-j} \\
&= \sqrt{n}(M_{k,n}^* - \mu_k) + \sum_{j=1}^k \binom{k}{j} (-1)^j \sqrt{n}(\bar{X}_n - \mu)^j M_{k-j,n}^*
\end{aligned}$$
96

69. Asymptotic Sampling Distributions (cont.) By the CLT
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} U$$
where $U \sim N(0, \sigma^2)$. Hence by the continuous mapping theorem
$$n^{j/2} (\bar{X}_n - \mu)^j \xrightarrow{D} U^j$$
but by Slutsky's theorem
$$\sqrt{n}(\bar{X}_n - \mu)^j \xrightarrow{D} 0, \qquad j = 2, 3, \ldots$$
97

70. Asymptotic Sampling Distributions (cont.) Hence only the $j = 0$ and $j = 1$ terms on slide 96 do not converge in probability to zero, that is,
$$\sqrt{n}(M_{k,n} - \mu_k) = \sqrt{n}(M_{k,n}^* - \mu_k) - k \sqrt{n}(\bar{X}_n - \mu) M_{k-1,n}^* + o_p(1)$$
where $o_p(1)$ means terms that converge in probability to zero. By Slutsky's theorem this converges to
$$W - k \mu_{k-1} U$$
where the bivariate random vector $(U, W)$ is multivariate normal with mean vector zero and variance matrix
$$M = \operatorname{var}\begin{pmatrix} X_i - \mu \\ (X_i - \mu)^k \end{pmatrix} = \begin{pmatrix} \mu_2 & \mu_{k+1} \\ \mu_{k+1} & \mu_{2k} - \mu_k^2 \end{pmatrix}$$
98

71. Asymptotic Sampling Distributions (cont.) Apply the multivariate delta method, which in this case says that the distribution of $W - k \mu_{k-1} U$ is univariate normal with mean zero and variance
$$\begin{pmatrix} -k \mu_{k-1} & 1 \end{pmatrix} \begin{pmatrix} \mu_2 & \mu_{k+1} \\ \mu_{k+1} & \mu_{2k} - \mu_k^2 \end{pmatrix} \begin{pmatrix} -k \mu_{k-1} \\ 1 \end{pmatrix} = \mu_{2k} - \mu_k^2 - 2 k \mu_{k-1} \mu_{k+1} + k^2 \mu_{k-1}^2 \mu_2$$
99

72. Asymptotic Sampling Distributions (cont.) Summary:
$$\sqrt{n}(M_{k,n} - \mu_k) \xrightarrow{D} N\left(0,\ \mu_{2k} - \mu_k^2 - 2 k \mu_{k-1} \mu_{k+1} + k^2 \mu_{k-1}^2 \mu_2\right)$$
We could work out the asymptotic joint distribution of all these empirical central moments but spare you the details. The $k = 2$ case is particularly simple. Recall $\mu_1 = 0$, $\mu_2 = \sigma^2$, and $M_{2,n} = V_n$, so the $k = 2$ case is
$$\sqrt{n}(V_n - \sigma^2) \xrightarrow{D} N(0, \mu_4 - \sigma^4)$$
100
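A minimal sketch (not from the slides) checking the $k = 2$ result by simulation; the choice of Exp(1) data (for which $\sigma^2 = 1$ and $\mu_4 = 9$) and the sample sizes are arbitrary.

```r
## A minimal sketch: for Exp(1) data, sqrt(n) * (V_n - sigma^2) should
## have approximate variance mu_4 - sigma^4 = 9 - 1 = 8 when n is large.
set.seed(42)
n <- 1000
nsim <- 1e4
vn <- replicate(nsim, {
  x <- rexp(n)
  mean((x - mean(x))^2)
})

var(sqrt(n) * (vn - 1))   # should be close to 8
```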
