On corrections of classical multivariate tests for high-dimensional data

Jian-feng Yao, with Zhidong Bai, Dandan Jiang and Shurong Zheng
Overview

◮ Introduction
    High-dimensional data and new challenges in statistics
    A two-sample problem
◮ Sample covariance matrices
    Sample vs. population covariance matrices
    Marčenko–Pastur distributions
    Bai and Silverstein's CLT for linear spectral statistics
◮ Random Fisher matrices
◮ Testing covariance matrices I (simulation study I)
◮ Testing covariance matrices II (simulation study II)
◮ Multivariate regressions
◮ Conclusions
High-dimensional data and new challenges in statistics

High-dimensional data ≠ high-dimensional models:
◮ Nonparametric regression: a very high-dimensional model (indeed an infinite-dimensional one) but with one-dimensional data:
    y_i = f(x_i) + ε_i,  f : R → R,  i = 1, ..., n;
◮ High-dimensional data: observation vectors y_i ∈ R^p, with p relatively high w.r.t. the sample size n.
Some typical data dimensions:

    data                 dimension p    sample size n    ratio n/p
    portfolio            ~ 50           500              10
    climate survey       320            600              1.9
    speech analysis      a · 10²        b · 10²          ~ 1
    ORL face database    1440           320              0.2
    micro-arrays         2000           200              0.1

◮ Important: the data ratio n/p is not always large; it can be ≪ 1.
◮ Note: we also use the inverse data ratio y = p/n.
A two-sample problem: high-dimensional effects by an example

The two-sample problem:
◮ two independent samples:
    x_1, ..., x_{n_1} ~ (μ_1, Σ),  y_1, ..., y_{n_2} ~ (μ_2, Σ);
◮ we want to test H_0 : μ_1 = μ_2 against H_1 : μ_1 ≠ μ_2;
◮ classical approach: Hotelling's T² test,
    T² = (n_1 n_2 / n) (\bar{x} − \bar{y})' S_n^{−1} (\bar{x} − \bar{y}),
  where n = n_1 + n_2,
    \bar{x} = (1/n_1) \sum_{i=1}^{n_1} x_i,   \bar{y} = (1/n_2) \sum_{j=1}^{n_2} y_j,
    S_n = (1/(n−2)) [ \sum_{i=1}^{n_1} (x_i − \bar{x})(x_i − \bar{x})' + \sum_{j=1}^{n_2} (y_j − \bar{y})(y_j − \bar{y})' ].
  S_n is a (pooled) sample covariance matrix.
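As a concrete reference point, the T² statistic above can be sketched in a few lines of numpy. This is an illustration under our own naming and array layout, not the authors' code:

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 statistic (illustrative sketch).

    x : (n1, p) array, y : (n2, p) array.
    Returns T^2 = (n1*n2/n) * (xbar - ybar)' S_n^{-1} (xbar - ybar),
    with S_n the pooled sample covariance (divisor n - 2).
    """
    n1, _ = x.shape
    n2, _ = y.shape
    n = n1 + n2
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - xbar, y - ybar
    # pooled sample covariance with divisor n - 2
    s_n = (xc.T @ xc + yc.T @ yc) / (n - 2)
    d = xbar - ybar
    # solve S_n v = d instead of forming S_n^{-1} explicitly
    return (n1 * n2 / n) * d @ np.linalg.solve(s_n, d)
```

A quick check of the invariance property discussed next: replacing (x, y) by (xA, yA) for any invertible A leaves T² unchanged.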
Hotelling's T² test — nice properties:
◮ invariance under linear transformations;
◮ finite-sample optimality in the Gaussian case; asymptotic optimality otherwise.

Hotelling's T² test — bad news:
◮ low power even for moderate data dimensions;
◮ high numerical instability in computing S_n^{−1}, even for p = 40;
◮ very little is known in the non-Gaussian case;
◮ fatal deficiency: when p > n − 2, S_n is not invertible.
Dempster's non-exact test (NET) — Dempster A.P., '58, '60

◮ A reasonable test must be based on \bar{x} − \bar{y}, even when p > n − 2;
◮ choose a new basis of R^n and project the data so that
    1. axis 1 ∝ the grand mean (n_1 μ_1 + n_2 μ_2)/n,
    2. axis 2 ∝ \bar{x} − \bar{y};
◮ let X_{n×p} = (x_1, ..., x_{n_1}, y_1, ..., y_{n_2})' be the data matrix and H_n = (h_1, ..., h_n) the (orthonormal) base change, so that
    Z = (z_1, ..., z_n)' = H_n' X,  i.e.  X = H_n Z,
  with
    h_1 = (1/\sqrt{n}) 1_n,
    h_2 = ( \sqrt{n_2/(n n_1)} 1_{n_1}' , −\sqrt{n_1/(n n_2)} 1_{n_2}' )'.

Under normality, we have:
◮ the z_j are n independent N_p(·, Σ) vectors;
◮ E z_1 = (1/\sqrt{n}) (n_1 μ_1 + n_2 μ_2),  E z_2 = \sqrt{n_1 n_2 / n} (μ_1 − μ_2),  E z_j = 0 for j = 3, ..., n.
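The two special basis vectors h_1 and h_2 are explicit, so the first two projected rows can be computed directly. A small numpy sketch (naming is ours), which lets one verify the identity z_2 = \sqrt{n_1 n_2 / n} (\bar{x} − \bar{y}):

```python
import numpy as np

def net_projection(x, y):
    """First two rows of Dempster's projected data Z = H_n' X (sketch).

    Returns (z1, z2), where z1 = h1' X picks up the grand mean and
    z2 = h2' X picks up the mean difference.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    X = np.vstack([x, y])  # stacked (n, p) data matrix
    # h1 = (1/sqrt(n)) 1_n
    h1 = np.full(n, 1.0 / np.sqrt(n))
    # h2 = ( sqrt(n2/(n n1)) 1_{n1}', -sqrt(n1/(n n2)) 1_{n2}' )'
    h2 = np.concatenate([np.full(n1, np.sqrt(n2 / (n * n1))),
                         np.full(n2, -np.sqrt(n1 / (n * n2)))])
    return h1 @ X, h2 @ X
```

One can check that h_1 and h_2 are orthonormal and that z_1 = \sqrt{n} × (grand mean).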
Dempster's non-exact test (NET) — test statistic:

◮ F_D = (n − 2) ‖z_2‖² / ( ‖z_3‖² + ··· + ‖z_n‖² );
◮ under H_0,
    ‖z_j‖² ~ Q := \sum_{k=1}^{r} α_k χ²_1(k),
  where α_1 ≥ ··· ≥ α_r > 0 are the non-null eigenvalues of Σ;
◮ the exact distribution of F_D is complicated;
◮ approximations — hence the "non-exact" test — proceed as if Σ = I_p:
    1. Q ≈ m χ²_r;
    2. next, estimate r by some \hat{r};
◮ finally, under H_0, F_D ≈ F(\hat{r}, (n − 2)\hat{r}).
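F_D can be computed without ever building H_n, using two exact identities: ‖z_2‖² = (n_1 n_2 / n) ‖\bar{x} − \bar{y}‖², and ‖z_3‖² + ··· + ‖z_n‖² = (n − 2) tr S_n (the within-group sum of squares). A minimal numpy sketch (function name is ours):

```python
import numpy as np

def dempster_fd(x, y):
    """Dempster's NET statistic F_D (illustrative sketch).

    Uses ||z_2||^2 = (n1*n2/n) * ||xbar - ybar||^2 and
    sum_{j>=3} ||z_j||^2 = (n-2) * tr S_n, so the full orthonormal
    basis H_n is never constructed.  Works even when p > n - 2.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    z2_sq = n1 * n2 / n * np.sum((xbar - ybar) ** 2)
    # within-group sum of squares = (n - 2) * tr S_n
    within = np.sum((x - xbar) ** 2) + np.sum((y - ybar) ** 2)
    return (n - 2) * z2_sq / within
```

Under H_0 with Σ = I_p, F_D concentrates near 1, consistent with the F(\hat{r}, (n − 2)\hat{r}) approximation.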
Problems with the NET test:
◮ it is difficult to construct the full orthogonal transformation H_n = {h_j} for large n;
◮ even under Gaussianity, the exact power function depends on the choice of H_n.
Bai and Saranadasa's test (ANT) — Bai & Saranadasa, '96

◮ Consider directly the statistic
    M_n = ‖\bar{x} − \bar{y}‖² − (n / (n_1 n_2)) tr S_n;
◮ under very mild conditions (here RMT comes in!),
    M_n / σ_n ⇒ N(0, 1),   σ_n² := Var(M_n) = \frac{2 n² (n − 1)}{n_1² n_2² (n − 2)} tr Σ²;
◮ a ratio-consistent estimator:
    \hat{σ}_n² = \frac{2 n (n − 1)(n − 2)}{n_1² n_2² (n − 3)} [ tr S_n² − \frac{1}{n − 2} (tr S_n)² ],   \hat{σ}_n² / σ_n² →_P 1;
◮ finally, under H_0,
    Z_n = M_n / \hat{σ}_n ⇒ N(0, 1).

This is Bai and Saranadasa's asymptotic normal test (ANT).
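Putting the pieces together, the standardized ANT statistic can be sketched as follows (numpy assumed; naming is ours, not the authors'):

```python
import numpy as np

def bai_saranadasa_zn(x, y):
    """Bai-Saranadasa standardized statistic Z_n = M_n / sigma_hat_n (sketch)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - xbar, y - ybar
    s_n = (xc.T @ xc + yc.T @ yc) / (n - 2)  # pooled sample covariance
    m_n = np.sum((xbar - ybar) ** 2) - n / (n1 * n2) * np.trace(s_n)
    tr_s = np.trace(s_n)
    tr_s2 = np.sum(s_n * s_n)  # tr(S_n^2), since S_n is symmetric
    # ratio-consistent variance estimator
    var_hat = (2 * n * (n - 1) * (n - 2) / (n1**2 * n2**2 * (n - 3))
               * (tr_s2 - tr_s**2 / (n - 2)))
    return m_n / np.sqrt(var_hat)
```

Note that nothing is inverted: the statistic is well defined even when p > n − 2, which is exactly the regime where T² breaks down.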
Comparison between T², NET and ANT — power functions:

◮ assume p → ∞, n → ∞, p/n → y ∈ (0, 1), n_1/n → κ;
◮ for Hotelling's T² and for Dempster's NET / Bai–Saranadasa's ANT, with μ = μ_1 − μ_2:
    β_H(μ) = Φ( −ξ_α + \sqrt{ n (1 − y) / (2y) } κ(1 − κ) ‖Σ^{−1/2} μ‖² ) + o(1),
    β_D(μ) = Φ( −ξ_α + \frac{n κ(1 − κ) ‖μ‖²}{\sqrt{2 tr Σ²}} ) + o(1) = β_BS(μ),
  where α is the test size and ξ_α = Φ^{−1}(1 − α);
◮ important: because of the factor (1 − y), T² loses power as y increases, i.e. as p grows relative to n.
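The two limiting power functions are easy to evaluate numerically. The stdlib-only sketch below (function names are ours) makes the (1 − y) power loss visible in the simple case Σ = I_p, where ‖Σ^{−1/2}μ‖² = ‖μ‖² and tr Σ² = p:

```python
from math import sqrt
from statistics import NormalDist

_nd = NormalDist()  # standard normal, gives Phi and Phi^{-1}

def power_hotelling(n, y, kappa, mahalanobis_sq, alpha=0.05):
    """beta_H = Phi(-xi_a + sqrt(n(1-y)/(2y)) * kappa(1-kappa) * mu' Sigma^{-1} mu)."""
    xi = _nd.inv_cdf(1 - alpha)
    return _nd.cdf(-xi + sqrt(n * (1 - y) / (2 * y))
                   * kappa * (1 - kappa) * mahalanobis_sq)

def power_bs(n, kappa, mu_sq, tr_sigma2, alpha=0.05):
    """beta_D = beta_BS = Phi(-xi_a + n kappa(1-kappa) ||mu||^2 / sqrt(2 tr Sigma^2))."""
    xi = _nd.inv_cdf(1 - alpha)
    return _nd.cdf(-xi + n * kappa * (1 - kappa) * mu_sq / sqrt(2 * tr_sigma2))
```

With Σ = I_p, the two non-centrality terms differ exactly by the factor \sqrt{1 − y}, so β_BS ≥ β_H for every y ∈ (0, 1), with the gap widening as y → 1.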
Comparison between T², NET and ANT — simulation results 1 (Gaussian case):

◮ choice of covariance: Σ = (1 − ρ) I_p + ρ J_p, with J_p = 1_p 1_p';
◮ non-centrality parameter η = ‖μ_1 − μ_2‖² / \sqrt{tr Σ²}, with (n_1, n_2) = (25, 20), n = 45.
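This Gaussian setting is cheap to reproduce. The sketch below estimates the empirical size of the ANT at the nominal 5% level under H_0 (numpy assumed; all names and the replication count are our own choices, not the authors' protocol):

```python
import numpy as np

def simulate_ant_size(p=40, n1=25, n2=20, rho=0.5, reps=2000, seed=1):
    """Empirical size of the ANT at nominal 5% under H0, Gaussian data,
    Sigma = (1 - rho) I_p + rho J_p (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = n1 + n2
    sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    w, v = np.linalg.eigh(sigma)
    root = v @ np.diag(np.sqrt(w)) @ v.T  # Sigma^{1/2}
    rejections = 0
    for _ in range(reps):
        x = rng.normal(size=(n1, p)) @ root
        y = rng.normal(size=(n2, p)) @ root
        xbar, ybar = x.mean(axis=0), y.mean(axis=0)
        xc, yc = x - xbar, y - ybar
        s_n = (xc.T @ xc + yc.T @ yc) / (n - 2)
        m_n = np.sum((xbar - ybar) ** 2) - n / (n1 * n2) * np.trace(s_n)
        tr_s, tr_s2 = np.trace(s_n), np.sum(s_n * s_n)
        var_hat = (2 * n * (n - 1) * (n - 2)
                   / (n1**2 * n2**2 * (n - 3)) * (tr_s2 - tr_s**2 / (n - 2)))
        # one-sided rejection at the 5% normal cut-off Phi^{-1}(0.95)
        if m_n / np.sqrt(var_hat) > 1.6448536269514722:
            rejections += 1
    return rejections / reps
```

With n = 45 and p = 40, the empirical size should sit in the vicinity of the nominal 5%, which is the point of the normal approximation.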
A summary of the introduction:

◮ high-dimensional effects need to be taken into account;
◮ surprisingly, asymptotic methods based on RMT perform well even for small p (as low as p = 4);
◮ many classical multivariate analysis methods have to be re-examined with respect to high-dimensional effects.