Bootstrapping Spectral Statistics in High Dimensions
Miles Lopes, UC Davis
Random Matrices and Complex Data Analysis Workshop, Shanghai 2019

Bootstrap for sample covariance matrices

Let $X_1, \ldots, X_n \in \mathbb{R}^p$ be i.i.d. observations, and let $\widehat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$ denote the sample covariance matrix.


Linear Spectral Statistics (LSS)

A natural class of prototype statistics for investigating bootstrap consistency are linear spectral statistics, which have the form

$$T = \frac{1}{p} \sum_{j=1}^p f(\lambda_j(\widehat\Sigma)), \qquad (3)$$

where f is a smooth function on [0, ∞).

Examples:
- The choice f(x) = log(x) leads to (1/p) log det(Σ̂).
- The choice f(x) = x^k leads to (1/p) tr(Σ̂^k).
- The normal log-likelihood ratio statistic for testing sphericity is p log(tr(Σ̂)/p) − log det(Σ̂).
- Even some non-linear spectral statistics are "asymptotically equivalent" to transformations of LSS (cf. Dobriban 2017).
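To make (3) concrete, here is a minimal numerical sketch (not from the talk; the function name lss is illustrative):

```python
import numpy as np

def lss(X, f):
    """Linear spectral statistic T = (1/p) * sum_j f(lambda_j(Sigma_hat)),
    assuming the rows of X are centered."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n                  # sample covariance matrix
    lam = np.linalg.eigvalsh(Sigma_hat)      # eigenvalues of Sigma_hat
    return np.mean(f(lam))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 200))          # n = 500 > p = 200, so log is safe
print(lss(X, np.log))                        # (1/p) log det(Sigma_hat)
print(lss(X, lambda x: x))                   # (1/p) tr(Sigma_hat)
```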

Background ideas for developing a new bootstrap

Beginning with the fundamental works of Jonsson (1982) and Bai and Silverstein (2004), a substantial literature has developed increasingly general central limit theorems for LSS: e.g. Pan and Zhou (2008), Lytova and Pastur (2009), Bai, Wang and Zhou (2010), Shcherbina (2011), Zheng (2012), Wang and Yao (2013), Najim and Yao (2016), Li, Li and Yao (2018), Hu, Li, Liu and Zhou (2019), among others.

In particular, if E[Z_ij^4] = 3 and p/n → c ∈ (0, ∞), then under "standard assumptions" we have

$$p(T - E[T]) \Rightarrow N(0, \sigma^2),$$

where

$$\sigma^2 = -\frac{1}{2\pi^2} \oint\!\!\oint \frac{f(z_1)\, f(z_2)}{(m(z_1) - m(z_2))^2}\, \frac{dm(z_1)}{dz_1}\, \frac{dm(z_2)}{dz_2}\, dz_1\, dz_2.$$

Background ideas for developing a new bootstrap

$$p(T - E[T]) \Rightarrow N(0, \sigma^2)$$

Important property: Under conditions more general than E[Z_ij^4] = 3, the variance σ² is essentially determined by the limiting spectral distribution of Σ.

Roughly speaking, this means that under certain conditions, the limit laws of LSS are mainly governed by just (λ_1(Σ), ..., λ_p(Σ)), rather than the entire matrix Σ. This is a major reduction in complexity.

A "parametric bootstrap" approach

More good news: The eigenvalues Λ = diag(λ_1(Σ), ..., λ_p(Σ)) can be estimated well in high dimensions. In particular, for consistent estimation of the population LSD, it is not necessary to use sparsity and/or low-rank conditions. (cf. El Karoui (2008), Mestre (2008), Li and Yao (2014), Ledoit and Wolf (2015), Kong and Valiant (2017), among others)

Intuitive procedure: Generate a "new dataset" X* that nearly matches the observed data X with respect to Λ. Then, compute the statistic T* arising from the "new data" X*. (This is akin to a parametric bootstrap.)

One extra detail: The kurtosis κ = E[Z_ij^4] matters too, but it is estimable.

Proposed method: Spectral Bootstrap

Goal. Approximate the distribution of $T = \frac{1}{p} \sum_{j=1}^p f(\lambda_j(\widehat\Sigma))$.

Before resampling, first compute estimates κ̂ and Λ̂.

Algorithm. (Spectral Bootstrap)
For b = 1, ..., B:
1. Generate a random matrix Z* ∈ R^{n×p} whose entries Z*_ij are drawn i.i.d. from Pearson(0, 1, 0, κ̂). (Recall X_i = Σ^{1/2} Z_i.)
2. Compute $\widehat\Sigma^* = \frac{1}{n}\, \widehat\Lambda^{1/2} (Z^{*\top} Z^*)\, \widehat\Lambda^{1/2}$. (Note $\widehat\Sigma = \frac{1}{n}\, \Sigma^{1/2} Z^\top Z\, \Sigma^{1/2}$.)
3. Compute the eigenvalues of Σ̂*, and denote them by (λ*_1, ..., λ*_p).
4. Compute the statistic $T^*_b = \frac{1}{p} \sum_{j=1}^p f(\lambda^*_j)$.
Return: the empirical distribution of the values T*_1, ..., T*_B.
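A minimal Python sketch of the algorithm (not from the talk). The names sample_Z_star and spectral_bootstrap are illustrative, and the Pearson(0, 1, 0, κ̂) draw is replaced by a moment-matched stand-in, since a general Pearson-family sampler is not assumed available:

```python
import numpy as np

def sample_Z_star(n, p, kappa_hat, rng):
    """Stand-in for Pearson(0, 1, 0, kappa_hat) draws: a standardized
    Student-t matched to the kurtosis when kappa_hat > 3, else N(0, 1).
    (The talk's method draws from the Pearson family itself.)"""
    if kappa_hat > 3:
        df = 4 + 6 / (kappa_hat - 3)        # t_df has kurtosis 3 + 6/(df - 4)
        Z = rng.standard_t(df, size=(n, p))
        return Z / np.sqrt(df / (df - 2))   # rescale to unit variance
    return rng.standard_normal((n, p))

def spectral_bootstrap(lam_hat, kappa_hat, n, f, B=500, seed=0):
    """Bootstrap the LSS T = (1/p) * sum_j f(lambda_j(Sigma_hat)), given
    estimated population eigenvalues lam_hat (e.g., QuEST output)."""
    rng = np.random.default_rng(seed)
    p = len(lam_hat)
    T_star = np.empty(B)
    for b in range(B):
        Z = sample_Z_star(n, p, kappa_hat, rng)
        # (1/n) Lam^{1/2} Z'Z Lam^{1/2} = (1/n) X*'X* with X* = Z Lam^{1/2},
        # so it suffices to scale the columns of Z.
        X_star = Z * np.sqrt(lam_hat)
        lam_star = np.linalg.eigvalsh(X_star.T @ X_star / n)
        T_star[b] = np.mean(f(lam_star))
    return T_star                            # empirical distribution of T*_1..T*_B
```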

Generalizing to other spectral statistics

Let ψ: R^p → R be a generic (non-linear) function, and consider the statistic

$$T = \psi\big(\lambda_1(\widehat\Sigma), \ldots, \lambda_p(\widehat\Sigma)\big).$$

Key point: To bootstrap T, we only need to change the last step. (This is a distinct benefit of the bootstrap in relation to formulas.)

For b = 1, ..., B:
...
Compute the eigenvalues of Σ̂*, and denote them by (λ*_1, ..., λ*_p).
Compute the statistic T*_b = ψ(λ*_1, ..., λ*_p).
Return: the empirical distribution of the values T*_1, ..., T*_B.
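Reusing the helper sample_Z_star from the sketch above (again, illustrative names and a stand-in sampler, not the paper's implementation):

```python
def spectral_bootstrap_general(lam_hat, kappa_hat, n, psi, B=500, seed=0):
    """Spectral bootstrap where only the last step changes:
    T*_b = psi(lambda*_1, ..., lambda*_p) for a generic psi."""
    rng = np.random.default_rng(seed)
    p = len(lam_hat)
    T_star = np.empty(B)
    for b in range(B):
        Z = sample_Z_star(n, p, kappa_hat, rng)
        X_star = Z * np.sqrt(lam_hat)              # rows scaled by Lam_hat^{1/2}
        lam_star = np.linalg.eigvalsh(X_star.T @ X_star / n)
        T_star[b] = psi(lam_star)                  # <-- the only changed step
    return T_star

# Non-linear examples from later in the talk (eigvalsh sorts ascending):
T_max = lambda lam: lam[-1]                        # largest eigenvalue
T_gap = lambda lam: lam[-1] - lam[-2]              # top spectral gap
```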

Estimating kurtosis

Recall κ = E[Z_ij^4], and all row vectors satisfy X_i = Σ^{1/2} Z_i. Our estimate of κ is based on a general formula for the variance of a quadratic form:

$$\kappa = 3 + \frac{\operatorname{Var}(\|X_1\|_2^2) - 2\|\Sigma\|_F^2}{\sum_{j=1}^p \sigma_j^4},$$

where σ_j² denotes the j-th diagonal entry of Σ.

All the quantities on the right side have ratio-consistent estimators when p ≍ n under standard conditions.

The estimation of ‖Σ‖_F² was handled previously in Bai and Saranadasa (1996), but it seems that a consistent estimate for κ has not been available in the high-dimensional setting.

An estimate of κ may also be of independent interest as a diagnostic tool for checking if data are approximately Gaussian.
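A rough plug-in sketch of this formula (the specific estimators below, in particular the Bai-Saranadasa-style debiasing of ‖Σ‖_F², are illustrative choices and may differ from the talk's estimators):

```python
import numpy as np

def estimate_kurtosis(X):
    """Plug-in version of the kurtosis formula above, assuming centered rows."""
    n, p = X.shape
    S = X.T @ X / n                                   # sample covariance
    var_norm2 = np.var(np.sum(X**2, axis=1), ddof=1)  # Var(||X_1||^2) plug-in
    tr_S2 = np.sum(S**2)                              # tr(S^2) = ||S||_F^2
    # Bai-Saranadasa-style debiased estimate of tr(Sigma^2):
    frob2 = (n**2 / ((n + 2) * (n - 1))) * (tr_S2 - np.trace(S)**2 / n)
    sum_sigma4 = np.sum(np.diag(S)**2)                # plug-in for sum_j sigma_j^4
    return 3 + (var_norm2 - 2 * frob2) / sum_sigma4
```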

Estimating eigenvalues

We use the QuEST method (Ledoit and Wolf, 2015). For bootstrapping LSS, the essential issue is to use eigenvalue estimates λ̂_1, ..., λ̂_p that lead to a consistent estimate of the population LSD.

Let H_p denote the spectral distribution function associated with λ_1(Σ), ..., λ_p(Σ),

$$H_p(t) = \frac{1}{p} \sum_{j=1}^p 1\{\lambda_j(\Sigma) \le t\}.$$

Then, an estimate Ĥ_p may be formed by taking the estimates λ̂_1, ..., λ̂_p as the quantiles.

Consistency. If there is a population LSD H satisfying H_p ⇒ H, then the QuEST estimator Ĥ_p satisfies the following limit under standard assumptions:

$$\widehat H_p \Rightarrow H \quad \text{almost surely}.$$

Note: Other spectrum estimation methods are also compatible with the proposed bootstrap, provided that the above limit holds.
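For concreteness, forming Ĥ_p from estimated eigenvalues amounts to an empirical CDF (a minimal sketch; H_hat is an illustrative name):

```python
import numpy as np

def H_hat(lam_hat):
    """Estimated spectral CDF: H_hat(t) = (1/p) * #{j : lam_hat_j <= t},
    with lam_hat, e.g., the QuEST eigenvalue estimates."""
    lam_sorted = np.sort(np.asarray(lam_hat))
    p = len(lam_sorted)
    return lambda t: np.searchsorted(lam_sorted, t, side="right") / p
```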

Main result: bootstrap consistency

Assumptions in brief:
- p/n → c ∈ (0, ∞)
- λ_p(Σ) and λ_1(Σ) bounded away from 0 and ∞
- Finite 8th moment: E[Z_11^8] < ∞
- H_p ⇒ H
- Asymptotic "regularity" of population eigenvectors (more later)

Theorem 1 (LBA 2019). Let d_LP denote the Lévy-Prohorov metric. Then, under the stated assumptions, the following limit holds as (n, p) → ∞:

$$d_{LP}\Big( \mathcal{L}\big\{ p(T^* - E[T^* \,|\, X]) \,\big|\, X \big\}, \; \mathcal{L}\big\{ p(T - E[T]) \big\} \Big) \to 0 \quad \text{in probability}.$$

High-level comments on the proof

The proof draws substantially from the arguments in Najim and Yao (2016), based on the Helffer-Sjöstrand formula. This formula allows for the following distributional approximation as (n, p) → ∞:

$$\mathcal{L}\big\{ p(T - E[T]) \big\} \approx \mathcal{L}\{ \varphi_f(G_n) \},$$

where φ_f is a linear functional, and G_n = G_n(z) is a centered Gaussian process that arises from the empirical Stieltjes transform $\frac{1}{p}\operatorname{tr}\big( (\widehat\Sigma - zI_p)^{-1} \big)$.

In the "bootstrap world", there is a corresponding conditional approximation,

$$\mathcal{L}\big\{ p(T^* - E[T^* \,|\, X]) \,\big|\, X \big\} \approx \mathcal{L}\{ \varphi_f(G^*_n) \,|\, X \}.$$

Finally, the consistency of Ĥ and κ̂ is used to obtain the conditional approximation

$$\mathcal{L}\{ \varphi_f(G^*_n) \,|\, X \} \approx \mathcal{L}\{ \varphi_f(G_n) \},$$

by comparing the covariance functions of G*_n and G_n.

Regularity of eigenvectors

Let U be the matrix of eigenvectors of Σ, and consider the non-random quantity

$$K_p(z_1, z_2) := \frac{1}{p} \sum_{j=1}^p \big( U D_n(z_1) U^\top \big)_{jj} \big( U D_n(z_2) U^\top \big)_{jj},$$

where z_1, z_2 ∈ C \ R, and D_n(·) ∈ C^{p×p} is a diagonal matrix that only depends on the spectrum of Σ.

Also let

$$K'_p(z_1, z_2) := \frac{1}{p} \sum_{j=1}^p \big( D_n(z_1) \big)_{jj} \big( D_n(z_2) \big)_{jj}.$$

Regularity of eigenvectors. We say that the eigenvectors of Σ are regular if for any z_1, z_2 ∈ C \ R, as (n, p) → ∞,

$$K_p(z_1, z_2) = K'_p(z_1, z_2) + o(1).$$

Remarks. The papers Pan and Zhou (2008) and Najim and Yao (2016) show that unless κ = 3, a limit law for standardized LSS may not exist unless U has some regularity. Nevertheless, empirical results suggest that such regularity may be "typical".

Regularity of eigenvectors (cont.)

Example 1. (Rank-k perturbations, k → ∞). Suppose λ_1(Σ) is bounded away from ∞, and let Λ be otherwise unrestricted. If U is of the form

$$U = I_{p \times p} - 2\Pi,$$

where Π is any orthogonal projection matrix of rank k, and k = o(p), then the eigenvectors are regular. This is a fairly substantial perturbation from the diagonal case.

Regularity of eigenvectors (cont.)

Example 2. (Spiked covariance models). Suppose Λ is of the form

$$\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_k, 1, \ldots, 1),$$

where k = o(p), and λ_1 = λ_1(Σ) is bounded away from infinity. Then, any choice of U will be regular.

Simulations for LSS (κ > 3)

Recall X_i = Σ^{1/2} Z_i.
- Z_i generated with standardized i.i.d. t-distribution (df = 20), kurtosis κ ≈ 3.4
- decaying population spectrum: λ_j = j^{−1/2}
- population eigenvectors drawn uniformly from Haar measure

We tabulate the std. dev., 95th percentile, and 99th percentile of p(T − E[T]). In each (n,p) row, the first line gives the ground-truth values and the second line gives the bootstrap estimates (standard errors in parentheses).

                       f(x) = x                              f(x) = log(x)
(n,p)        std. dev.    95th         99th         std. dev.    95th         99th
(500,200)    0.16         0.27         0.36         1.07         1.82         2.41
             0.17 (0.01)  0.28 (0.03)  0.39 (0.06)  1.08 (0.08)  1.76 (0.20)  2.51 (0.35)
(500,400)    0.18         0.29         0.41         4.41         7.03         9.77
             0.18 (0.02)  0.30 (0.04)  0.42 (0.06)  4.27 (0.33)  6.72 (0.70)  9.29 (1.18)
(500,600)    0.17         0.29         0.40         -            -            -
             0.18 (0.02)  0.30 (0.04)  0.43 (0.07)  -            -            -

Simulations for LSS (κ < 3)

Recall X_i = Σ^{1/2} Z_i.
- Z_i generated with standardized i.i.d. Beta(6,6), kurtosis κ = 2.6
- decaying population spectrum: λ_j = j^{−1/2}
- population eigenvectors drawn uniformly from Haar measure

                       f(x) = x                              f(x) = log(x)
(n,p)        std. dev.    95th         99th         std. dev.    95th         99th
(500,200)    0.14         0.23         0.33         0.93         1.51         1.92
             0.14 (0.01)  0.23 (0.03)  0.32 (0.05)  0.93 (0.08)  1.52 (0.17)  2.15 (0.31)
(500,400)    0.15         0.25         0.34         1.65         2.64         3.64
             0.14 (0.01)  0.24 (0.03)  0.34 (0.05)  1.70 (0.13)  2.81 (0.31)  3.97 (0.56)
(500,600)    0.16         0.26         0.34         -            -            -
             0.15 (0.01)  0.25 (0.03)  0.35 (0.05)  -            -            -

What about other spectral statistics?

In principle, the proposed method can be applied to any spectral statistic. Below, we present simulation results for some non-linear statistics:
- T_max = λ_1(Σ̂)
- T_gap = λ_1(Σ̂) − λ_2(Σ̂)

Simulations for non-linear statistics (κ < 3)

Recall X_i = Σ^{1/2} Z_i.
- Z_i generated with standardized i.i.d. Beta(6,6), kurtosis κ = 2.6
- decaying population spectrum: λ_j = j^{−1/2}
- population eigenvectors drawn uniformly from Haar measure

As before, the first line of each (n,p) row gives the ground truth and the second the bootstrap estimates (standard errors in parentheses).

                       T_max − E[T_max]                      T_gap − E[T_gap]
(n,p)        std. dev.    95th         99th         std. dev.    95th         99th
(500,200)    0.06         0.11         0.15         0.08         0.13         0.17
             0.06 (0.01)  0.09 (0.01)  0.13 (0.02)  0.07 (0.01)  0.11 (0.01)  0.16 (0.03)
(500,400)    0.06         0.10         0.15         0.08         0.13         0.18
             0.06 (0.01)  0.09 (0.01)  0.13 (0.02)  0.07 (0.01)  0.11 (0.01)  0.16 (0.03)
(500,600)    0.06         0.11         0.14         0.07         0.13         0.17
             0.06 (0.01)  0.09 (0.01)  0.13 (0.02)  0.07 (0.01)  0.11 (0.02)  0.16 (0.03)

Simulations for non-linear statistics (κ > 3)

Recall X_i = Σ^{1/2} Z_i.
- Z_i generated with standardized i.i.d. t-distribution (df = 20), kurtosis κ ≈ 3.4
- decaying population spectrum: λ_j = j^{−1/2}
- population eigenvectors drawn uniformly from Haar measure

                       T_max − E[T_max]                      T_gap − E[T_gap]
(n,p)        std. dev.    95th         99th         std. dev.    95th         99th
(500,200)    0.06         0.10         0.15         0.07         0.12         0.17
             0.07 (0.01)  0.11 (0.02)  0.17 (0.03)  0.08 (0.01)  0.13 (0.02)  0.19 (0.03)
(500,400)    0.06         0.10         0.14         0.07         0.13         0.17
             0.07 (0.01)  0.11 (0.02)  0.17 (0.03)  0.08 (0.01)  0.13 (0.02)  0.19 (0.03)
(500,600)    0.06         0.11         0.16         0.08         0.13         0.18
             0.07 (0.01)  0.11 (0.02)  0.16 (0.03)  0.08 (0.01)  0.13 (0.02)  0.19 (0.03)

Part I summary: Bootstrap for spectral statistics

- LSS are a general class of statistics for which bootstrapping can succeed in high dimensions. This offers a general-purpose way to approximate the laws of LSS without relying on asymptotic formulas.
- The method is akin to the parametric bootstrap, using the fact that spectral statistics may depend on relatively few parameters of the full data-generating distribution.
- Numerical results are encouraging.
- The method appears to extend to some non-linear spectral statistics, for which asymptotic formulas are often unavailable. Further work on non-linear statistics is underway...


Part II: Bootstrapping the operator norm in high dimensions: Error estimation for covariance matrices and sketching

Motivations and background

Let X_1, ..., X_n be centered i.i.d. observations in R^p with Σ = E[X_1 X_1^⊤], and let Σ̂ denote the sample covariance matrix.

When p ≍ n or p ≫ n, there is a large literature on the problem of deriving high-probability non-asymptotic bounds on the operator norm error

$$T = \sqrt{n}\, \|\widehat\Sigma - \Sigma\|_{op}.$$

However, such bounds are typically only given up to unspecified constants.

In order to solve practical inference problems, such as constructing numerical error bounds for Σ̂, or confidence regions for Σ, we need to approximate the distribution of T.

The recent work of Han, Xu, and Zhou (2018) has explored bootstrap approximations for $\sup_{\|u\|_2 \le 1,\, \|u\|_0 \le s} \sqrt{n}\, \big| u^\top (\widehat\Sigma - \Sigma) u \big|$ when s ≪ n, but beyond this, not much is known about bootstrapping T in high dimensions.

Further motivations (RandNLA and sketching)

Randomized numerical linear algebra (RandNLA), or "matrix sketching", uses randomization to accelerate numerical linear algebra on huge matrices.

For instance, if A is a very tall (deterministic) matrix, one may try to do computations with a shorter "sketch" Ã = SA, where S is a random short matrix satisfying E[S^⊤S] = I.

[Schematic: the sketch Ã = SA is much shorter than A, and Ã^⊤Ã = (SA)^⊤(SA) approximates A^⊤A.]

In practice, it is necessary to estimate the algorithmic error

$$\|A^\top S^\top S A - A^\top A\|_{op}.$$

However, most theoretical work has focused on bounds that hold up to constants, and only a handful of papers have addressed error estimation in this context: e.g. Liberty et al. (2007), Woolfe et al. (2008), Halko, Martinsson and Tropp (2011), Lopes, Wang and Mahoney (2017, 2018).
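As a concrete toy example (illustrative; one common choice is a Gaussian S with i.i.d. N(0, 1/n) entries, which satisfies E[S^⊤S] = I):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 20000, 50, 500                      # very tall A; much shorter sketch
A = rng.standard_normal((d, p)) / np.sqrt(d)

S = rng.standard_normal((n, d)) / np.sqrt(n)  # E[S'S] = I_d
A_tilde = S @ A                               # sketch: n x p instead of d x p

err = np.linalg.norm(A_tilde.T @ A_tilde - A.T @ A, ord=2)
print(err)                                    # algorithmic error in operator norm
```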

A model with spectrum decay

Suppose X ∈ R^{n×p} is a data matrix with rows generated as X_i = Σ^{1/2} Z_i, where the vectors Z_1, ..., Z_n ∈ R^p are i.i.d. and have i.i.d. entries with E[Z_11] = 0, E[Z_11^2] = 1, E[Z_11^4] > 1, and ‖Z_11‖_{ψ₂} ≤ c_0 for some constant c_0 > 0 not depending on n.

There are constants β > 1/2 and c_1, c_2 > 0, not depending on n, such that for each j ∈ {1, ..., p},

$$c_1 j^{-2\beta} \le \lambda_j(\Sigma) \le c_2 j^{-2\beta}.$$

Main result (rate of bootstrap approximation)

Recall that we aim to approximate the law of T = √n ‖Σ̂ − Σ‖_op.

Let (X*_1, ..., X*_n) be drawn with replacement from (X_1, ..., X_n), and let

$$\widehat\Sigma^* = \frac{1}{n} \sum_{i=1}^n X^*_i (X^*_i)^\top.$$

Also, define the bootstrapped statistic T* = √n ‖Σ̂* − Σ̂‖_op.

Theorem 2 (LEM 2019). Let d_K denote the Kolmogorov metric. Then, under the stated model, there is a constant c > 0 not depending on n such that the event

$$d_K\big( \mathcal{L}(T),\, \mathcal{L}(T^* \,|\, X) \big) \le c\, n^{-\frac{\beta - 1/2}{6\beta + 4}} \log(n)^c$$

occurs with probability at least 1 − c/n.
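A minimal sketch of this row-resampling bootstrap (illustrative names; assumes the rows of X are centered):

```python
import numpy as np

def bootstrap_operator_norm(X, B=500, seed=0):
    """Bootstrap T = sqrt(n) * ||Sigma_hat - Sigma||_op by resampling rows
    of X and recomputing the operator-norm error around Sigma_hat."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    T_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # sample rows with replacement
        Xb = X[idx]
        Sigma_star = Xb.T @ Xb / n
        T_star[b] = np.sqrt(n) * np.linalg.norm(Sigma_star - Sigma_hat, ord=2)
    return T_star

# e.g., a bootstrap quantile estimate:
# q_hat = np.quantile(bootstrap_operator_norm(X), 0.95)
```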

Connections to other works

In recent years, there have been several influential works by Chernozhukov, Chetverikov, and Kato (CCK) (2013), (2014), (2017) on bootstrapping maxima of empirical processes

$$\sup_{f \in \mathcal{F}} G_n(f), \qquad \text{where } G_n(f) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( f(Z_i) - E[f(Z_i)] \big).$$

The statistic T = √n ‖Σ̂ − Σ‖_op can be represented in this form by taking f(Z_i) = ±⟨v, Z_i⟩², with v = Σ^{1/2}u for some unit vector u (so F corresponds to a signed ellipsoid).

However, the results of CCK are not directly applicable to T, because such results typically involve a "minimum variance condition", such as

$$\inf_{f \in \mathcal{F}} \operatorname{var}(G_n(f)) \ge c$$

for some c > 0 not depending on n, which fails for the ellipsoidal index set.
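To spell out this representation (a short verification from the definitions, using $\widehat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$ and $X_i = \Sigma^{1/2} Z_i$):

$$u^\top \widehat\Sigma\, u = \frac{1}{n} \sum_{i=1}^n \langle u, X_i \rangle^2 = \frac{1}{n} \sum_{i=1}^n \langle v, Z_i \rangle^2, \qquad v = \Sigma^{1/2} u,$$

and since $u^\top \Sigma u = E[\langle v, Z_i \rangle^2]$, we get $\sqrt{n}\, u^\top (\widehat\Sigma - \Sigma) u = G_n(f)$ with $f(Z_i) = \langle v, Z_i \rangle^2$. Taking the supremum over $\pm f$ with $\|u\|_2 = 1$ recovers $T = \sup_{f \in \mathcal{F}} G_n(f)$, since $\|M\|_{op} = \sup_{\|u\|_2 = 1} |u^\top M u|$ for symmetric M.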

Why β − 1/2?

Recall the rate of bootstrap approximation $n^{-\frac{\beta - 1/2}{6\beta + 4}} \log(n)^c$.

The role of β − 1/2 can be understood in terms of the error that comes from discretizing F,

$$\Delta_n(\epsilon) := \sup_{\operatorname{dist}(f, \tilde f) \le \epsilon} \big| G_n(f) - G_n(\tilde f) \big|.$$

In order for discrete approximation to work, we should have E[Δ_n(ε)] → 0 as ε → 0. This imposes an implicit constraint on the complexity of F (i.e. the complexity of Σ).

If we consider a simpler situation where G_n(f) is replaced by a linear Gaussian process indexed by the same F, say

$$G'_n(f) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \langle v, \zeta_i \rangle,$$

with ζ_1, ..., ζ_n i.i.d. N(0, I), then it follows from classical results that the associated discretization error satisfies the lower bound

$$E[\Delta'_n(\epsilon)] \ge c\, \epsilon^{(\beta - 1/2)/\beta}.$$

Hence, the condition β − 1/2 > 0 is needed even in the linear Gaussian case.

A new general-purpose error bound

To analyze the bootstrap, it was necessary to use dimension-free high-probability upper bounds on ‖Σ̂ − Σ‖_op.

Existing dimension-free bounds generally require either ‖X_i‖_2 ≤ c almost surely, or ‖⟨u, X_i⟩‖_{ψ₂} ≍ ‖⟨u, X_i⟩‖_{L₂} for all ‖u‖_2 = 1. (e.g. Rudelson and Vershynin (2007), Oliveira (2010), Hsu, Kakade and Zhang (2012), Koltchinskii and Lounici (2017), Minsker (2017))

However, the ℓ₂-boundedness condition is often restrictive, while the ψ₂-L₂ equivalence condition is not well-suited to the discrete distributions that arise from resampling.

As a way to streamline our analysis of both (X_1, ..., X_n) and the bootstrap samples (X*_1, ..., X*_n), it is of interest to develop a dimension-free bound that can be applied in a more general-purpose way.

A new general-purpose error bound

Proposition 1 (LEM 2019). Let ξ_1, ..., ξ_n ∈ R^p be i.i.d. random vectors, and for any q ≥ 1, define the quantity

$$r(q) = \frac{q\, \big( E[\|\xi_1\|_2^{2q}] \big)^{1/q}}{\big\| E[\xi_1 \xi_1^\top] \big\|_{op}}. \qquad (4)$$

Then, there is an absolute constant c > 0, such that for any q ≥ 3 ∨ log(n), the event

$$\bigg\| \frac{1}{n} \sum_{i=1}^n \Big( \xi_i \xi_i^\top - E[\xi_i \xi_i^\top] \Big) \bigg\|_{op} \le c \cdot \big\| E[\xi_1 \xi_1^\top] \big\|_{op} \cdot \bigg( \sqrt{\frac{r(q)}{n}} \vee \frac{r(q)}{n} \bigg)$$

holds with probability at least 1 − 1/n.

The proof extends an argument from Rudelson and Vershynin (2007) to the case of unbounded ξ_1, ..., ξ_n. The essential step is based on the non-commutative Khinchine inequality of Lust-Piquard (1986).

Coverage probabilities (error bounds or confidence regions)

Simulation settings:
- n ∈ {300, 500, 700} and p = 1,000
- Repeated leading eigenvalues: λ_1(Σ) = ··· = λ_5(Σ) = 1 and λ_j(Σ) = j^{−2β} for j ∈ {6, ..., p}
- True eigenvectors drawn from the Haar (uniform) distribution
- Entries Z_ij drawn from N(0,1) or standardized t_20

Simultaneous confidence intervals for true eigenvalues

It is of interest to construct confidence intervals I_1, ..., I_p that satisfy

$$P\bigg( \bigcap_{j=1}^p \big\{ \lambda_j(\Sigma) \in I_j \big\} \bigg) \ge 1 - \alpha. \qquad (*)$$

To do this, it is helpful to consider the (deterministic) Weyl inequality,

$$\max_{1 \le j \le p} \big| \lambda_j(\widehat\Sigma) - \lambda_j(\Sigma) \big| \le \| \widehat\Sigma - \Sigma \|_{op}.$$

If q_{1−α} denotes the (1 − α)-quantile of T, then Weyl's inequality implies that (*) must hold for I_j := [λ_j(Σ̂) ± q_{1−α}/√n]. Hence, we may construct approximate intervals by replacing q_{1−α} with the bootstrap estimate q̂_{1−α}.
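A sketch of the resulting intervals, reusing bootstrap_operator_norm from the earlier sketch (illustrative; assumes centered data):

```python
import numpy as np

def simultaneous_eigenvalue_intervals(X, alpha=0.05, B=500, seed=0):
    """Simultaneous intervals I_j = [lambda_j(Sigma_hat) -/+ q_hat / sqrt(n)],
    justified by Weyl's inequality and the bootstrap quantile q_hat of T."""
    n = X.shape[0]
    lam_hat = np.linalg.eigvalsh(X.T @ X / n)[::-1]    # descending eigenvalues
    q_hat = np.quantile(bootstrap_operator_norm(X, B=B, seed=seed), 1 - alpha)
    half = q_hat / np.sqrt(n)
    return lam_hat - half, lam_hat + half              # lower and upper endpoints
```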

Application to randomized numerical linear algebra

Recall the schematic for randomized matrix multiplication: Ã^⊤Ã = (SA)^⊤(SA) ≈ A^⊤A, where A ∈ R^{d×p} is deterministic with d ≥ p, and the sketching matrix S ∈ R^{n×d} is random with n ≪ d. Also, the rows of S are (usually) generated to be i.i.d. with E[S^⊤S] = I.

To bootstrap the algorithmic error ‖Ã^⊤Ã − A^⊤A‖_op, we may regard Ã as a "data matrix" and sample its rows with replacement. Note that the user generates the matrix S, which leaves no question about model assumptions!
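A minimal sketch of this row-resampling scheme for the sketching error (illustrative names):

```python
import numpy as np

def bootstrap_sketching_error(A_tilde, B=500, seed=0):
    """Bootstrap the error ||A~'A~ - A'A||_op by resampling rows of the
    sketch A~ = SA; no access to the full matrix A is needed."""
    rng = np.random.default_rng(seed)
    n = A_tilde.shape[0]
    G_hat = A_tilde.T @ A_tilde                # current estimate of A'A
    err_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # resample sketch rows
        Ab = A_tilde[idx]
        err_star[b] = np.linalg.norm(Ab.T @ Ab - G_hat, ord=2)
    return err_star

# e.g., a 95% error estimate:
# q95 = np.quantile(bootstrap_sketching_error(A_tilde), 0.95)
```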
