Random projections, reweighting and half-sampling for high-dimensional statistical inference

Art B. Owen, Stanford University

Based on joint works with:
Dean Eckles, Facebook Inc.
Sarah Emerson, Oregon State University

MCQMC 2012, February 2012
About these slides

These are the slides I presented on February 15 at MCQMC 2012 in Sydney, Australia. I have corrected some typos and extended the presentation of the challenging integral over the Stiefel manifold. A few of these slides were skipped over in order to allow time for questions.

This talk covers two projects. The bootstrap work with Dean Eckles has now been accepted by the Annals of Applied Statistics. The projection work with Sarah Emerson is still in progress.
Monte Carlo methods for statistics

1) Markov chain Monte Carlo
2) Bootstrap resampling

The above are mainstays. We also use:

1) Random permutations
2) Random projections
3) Sample splitting

Probability/Statistics and Monte Carlo are closely intertwined.
Statistics and Monte Carlo

[Image: M. C. Escher (1948)]

This talk will show some uses of MC in statistics.
Some statistical notions

$X \sim F$: random vector $X$ has distribution $F$.

$X_i \overset{\text{iid}}{\sim} F$: the $X_i$ are statistically Independent and Identically Distributed (IID) from $F$.

$\mathcal{N}_d(\mu, \Sigma)$: the Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and variance-covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. $X \sim \mathcal{N}(\mu, \Sigma)$ means
$$\Pr(X \in A) = \int_A f(x)\,\mathrm{d}x \quad\text{where}\quad f(x) = \frac{\exp\bigl(-\tfrac12 (x-\mu)^\mathsf{T} \Sigma^{-1} (x-\mu)\bigr)}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}.$$

$p$-values: observe $T = t$ and compute $p = \Pr(T \geqslant t)$. If $p < 0.01$ then the observed value $t$ happens 1% or less of the time: evidence against the hypothesized distribution of $T$.
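As a quick numeric illustration of a $p$-value (a hypothetical sketch of mine, not from the slides), suppose the hypothesized distribution of $T$ is $\mathcal{N}(0,1)$ and we observe $t = 2.5$:

```python
from scipy import stats

# Hypothetical example: hypothesized null T ~ N(0, 1), observed t = 2.5.
t = 2.5
p = stats.norm.sf(t)  # Pr(T >= t), the upper-tail p-value
print(p)              # ~0.0062: the observed t is rare under the null
```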
Problem one

We have
$$X_1, \ldots, X_{n_x} \overset{\text{iid}}{\sim} F \text{ in } \mathbb{R}^d \quad\text{and}\quad Y_1, \ldots, Y_{n_y} \overset{\text{iid}}{\sim} G \text{ in } \mathbb{R}^d.$$
Is $F = G$? We might assume $F = \mathcal{N}(\mu_1, \Sigma)$ and $G = \mathcal{N}(\mu_2, \Sigma)$. Then we test $\mu_1 = \mu_2$.

This is an old problem, with revived interest when $d \gg n_x + n_y$:

DNA microarrays: expression levels of $d \approx 30{,}000$ genes on $n_x$ healthy and $n_y$ diseased individuals, with $n_x$, $n_y$ in the tens or hundreds.

Genome-wide association studies: $d \approx 2{,}000{,}000$ markers with $n_x$, $n_y$ in the thousands or more.

Also: fMRI, finance.
Illustration

[Figure: scatter plot of 50 red and 50 black points in $\mathbb{R}^2$.]

Black points normally distributed; red points shifted Northwest with respect to black.

$X_1$ not significantly different, $p = 0.47$
$X_2$ not significantly different, $p = 0.09$
$X_1 + X_2$ not significantly different, $p = 0.60$
$X_1 - X_2$ very significantly different, $p = 1.7 \times 10^{-4}$

So: how to find the interesting projection?
Hotelling's $T^2$

Find $\theta \in \mathbb{R}^d$ with $\theta^\mathsf{T}\theta = 1$ to maximize the apparent separation between
$$\tilde X_i = \theta^\mathsf{T} X_i \in \mathbb{R} \quad\text{and}\quad \tilde Y_i = \theta^\mathsf{T} Y_i \in \mathbb{R}.$$

The answer depends on
$$\bar X = \frac{1}{n_x}\sum_{i=1}^{n_x} X_i, \qquad S_x = \sum_{i=1}^{n_x} (X_i - \bar X)(X_i - \bar X)^\mathsf{T},$$
$$\bar Y = \frac{1}{n_y}\sum_{i=1}^{n_y} Y_i, \qquad S_y = \sum_{i=1}^{n_y} (Y_i - \bar Y)(Y_i - \bar Y)^\mathsf{T}.$$

Algebraically we get
$$T^2 = \frac{n_x n_y}{n_x + n_y} (\bar X - \bar Y)^\mathsf{T} S^{-1} (\bar X - \bar Y) \quad\text{where}\quad S = \frac{S_x + S_y}{n_x + n_y - 2}.$$
Hotelling (1931)

For the illustration we get $T^2 = 18.58$, with $\Pr(T^2 \geqslant 18.58) = 2.6 \times 10^{-4}$ ($p$-value).
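A minimal sketch of the statistic above (my own Python implementation, not the authors' code; the F-distribution calibration assumes Gaussian data and $d < n_x + n_y - 1$):

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, Y):
    """Two-sample Hotelling T^2; X is (n_x, d), Y is (n_y, d)."""
    nx, ny = X.shape[0], Y.shape[0]
    d = X.shape[1]
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    Sx = (X - xbar).T @ (X - xbar)       # sum of squares, x sample
    Sy = (Y - ybar).T @ (Y - ybar)       # sum of squares, y sample
    S = (Sx + Sy) / (nx + ny - 2)        # pooled covariance estimate
    diff = xbar - ybar
    t2 = nx * ny / (nx + ny) * diff @ np.linalg.solve(S, diff)
    # Under F = G with Gaussian data, a rescaled T^2 has an F distribution:
    f = (nx + ny - d - 1) / (d * (nx + ny - 2)) * t2
    pval = stats.f.sf(f, d, nx + ny - d - 1)
    return t2, pval
```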
In high dimensions

When $d \gg n_x + n_y$ the covariance matrix $S$ is not invertible, so we can't use
$$T^2 = \frac{n_x n_y}{n_x + n_y} (\bar X - \bar Y)^\mathsf{T} S^{-1} (\bar X - \bar Y).$$

Geometrically: some projection $\theta \in \mathbb{R}^d$ has $\theta^\mathsf{T} X_i = $ constant for $i = 1, \ldots, n_x$ and $\theta^\mathsf{T} Y_i = $ a different constant. That is, we will get perfect separation, even if $F = G$.

A classic remedy by Dempster (1958) takes
$$T^2_{\text{Dempster}} = \frac{n_x n_y}{n_x + n_y} \frac{\lVert \bar X - \bar Y \rVert^2}{\operatorname{tr}(S)} = \frac{n_x n_y}{n_x + n_y} \frac{\sum_{j=1}^d (\bar X_j - \bar Y_j)^2}{\sum_{j=1}^d S_{jj}}$$
but this makes no use of correlations.

Recent improvements by Bai, Saranadasa, Hall, Fan, Chen, Srivastava also don't use correlations.
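A minimal sketch of Dempster's statistic (a hypothetical helper of mine; note $\operatorname{tr}(S)$ needs only the diagonal, so the full $d \times d$ matrix is never formed):

```python
import numpy as np

def dempster_t2(X, Y):
    """Dempster's (1958) trace-normalized two-sample statistic."""
    nx, ny = X.shape[0], Y.shape[0]
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    diff = xbar - ybar
    # tr(S) is the sum of diagonal entries; avoid building the d x d matrix
    tr_S = (((X - xbar)**2).sum() + ((Y - ybar)**2).sum()) / (nx + ny - 2)
    return nx * ny / (nx + ny) * (diff @ diff) / tr_S
```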
Random projections

Lopes, Jacob, Wainwright (2011): choose a random $\Theta \in \mathbb{R}^{d \times k}$ with $\Theta^\mathsf{T}\Theta = I_k$. Put $\tilde X_i = \Theta^\mathsf{T} X_i$ and $\tilde Y_i = \Theta^\mathsf{T} Y_i$. Then use
$$\tilde T^2_\Theta = \frac{n_x n_y}{n_x + n_y} (\bar X - \bar Y)^\mathsf{T} \Theta (\Theta^\mathsf{T} S \Theta)^{-1} \Theta^\mathsf{T} (\bar X - \bar Y).$$
This exists if $k < n_x + n_y - 2$.

That is: project the data into a random $k$-dimensional subspace, then test the means of the projected data. This retains some of the correlations.
Uniform random projections

From $\mathbb{R}^d$ to $\mathbb{R}$: normalize a Gaussian vector,
$$\theta = \frac{Z}{\lVert Z \rVert}, \qquad Z \sim \mathcal{N}(0, I_d).$$

To project from $\mathbb{R}^d$ to $\mathbb{R}^k$:
$$Z = \begin{pmatrix} Z_{11} & Z_{12} & \cdots & Z_{1k} \\ Z_{21} & Z_{22} & \cdots & Z_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{d1} & Z_{d2} & \cdots & Z_{dk} \end{pmatrix} \in \mathbb{R}^{d \times k}, \qquad Z_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, 1).$$

Gram-Schmidt yields $Z = QR$. Take $\Theta = Q \in \mathbb{R}^{d \times k}$ and deliver the projection $\tilde X_i = \Theta^\mathsf{T} X_i$. Any QR decomposition with positive $R_{ii}$ will do.
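A minimal sketch of sampling $\Theta$ Haar-uniformly and computing the projected statistic (my own hypothetical helpers, assuming NumPy; the sign fix enforces the positive-diagonal convention for $R$):

```python
import numpy as np

def haar_projection(d, k, rng):
    """Theta uniform on the Stiefel manifold V_{d,k}: QR of a Gaussian matrix,
    with column signs flipped so that R has a positive diagonal."""
    Z = rng.standard_normal((d, k))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))       # d x k, Theta^T Theta = I_k

def projected_t2(X, Y, Theta):
    """Hotelling T^2 on the k-dim projected data (needs k < n_x + n_y - 2)."""
    Xp, Yp = X @ Theta, Y @ Theta
    nx, ny = Xp.shape[0], Yp.shape[0]
    xbar, ybar = Xp.mean(axis=0), Yp.mean(axis=0)
    S = ((Xp - xbar).T @ (Xp - xbar) +
         (Yp - ybar).T @ (Yp - ybar)) / (nx + ny - 2)   # k x k pooled covariance
    diff = xbar - ybar
    return nx * ny / (nx + ny) * diff @ np.linalg.solve(S, diff)
```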
Lopes et al., ctd.

They make just one random projection of the data, and find that $k \approx (n_x + n_y - 2)/2$ performs well.

Why just one? If your one projection is 'unlucky' then you might miss the pattern. But with just one projection the distribution of $\tilde T^2$ is known.

Multiple projections:
$$\bar T^2 = \frac{1}{M} \sum_{i=1}^M \tilde T^2_i, \qquad \tilde T^2_i \text{ based on } \Theta_i \in \mathbb{R}^{d \times k}.$$
Get some kind of 'average' luck. But the distribution of $\bar T^2$ is not known.
Multiple projections

Work with S. Emerson: average over $M$ independent random $\Theta_i \in \mathbb{R}^{d \times k}$,
$$\bar T^2 = \frac{1}{M}\sum_{i=1}^M \tilde T^2_i, \quad\text{where}\quad \tilde T^2_i = \frac{n_x n_y}{n_x + n_y} (\bar X - \bar Y)^\mathsf{T} \Theta_i (\Theta_i^\mathsf{T} S \Theta_i)^{-1} \Theta_i^\mathsf{T} (\bar X - \bar Y).$$

Easily:
1) $E(\bar T^2) = E(\tilde T^2_i)$
2) $\operatorname{Var}(\bar T^2) < \operatorname{Var}(\tilde T^2_i)$, unless both are infinite! (averaging reduces variance)

Less easily:
1) Finite variance requires $k \leqslant n_x + n_y - 6$
2) Finite mean requires $k \leqslant n_x + n_y - 4$

Unfortunately: the distribution of $\bar T^2$ is not known.
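Continuing the sketch, the averaged statistic is only a few lines (this reuses the hypothetical haar_projection and projected_t2 helpers from above, so it is not self-contained on its own):

```python
import numpy as np

def tbar2(X, Y, k, M, rng):
    """Average of M independent projected T^2 statistics, i.e. the statistic
    studied with S. Emerson; uses haar_projection and projected_t2 above."""
    d = X.shape[1]
    return np.mean([projected_t2(X, Y, haar_projection(d, k, rng))
                    for _ in range(M)])
```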
Separation

Simulate 2000 data sets:
$$X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \Sigma), \qquad Y_i \overset{\text{iid}}{\sim} \mathcal{N}(\delta, \Sigma), \qquad \delta \in \mathbb{R}^d.$$
Of these: 1000 null cases with $\lVert\delta\rVert = 0$ and 1000 non-null cases with $\lVert\delta\rVert > 0$.

Rank the 2000 $\bar T^2$ scores and see if the nulls get smaller $\bar T^2$ values.

The ROC* curve (shown later) shows how well the test separates the two cases.

*Receiver Operating Characteristic (don't ask)
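A minimal sketch of how such a separation score could be computed (a hypothetical helper of mine; the Mann-Whitney form of the AUC equals the area under the empirical ROC curve):

```python
import numpy as np

def roc_auc(null_scores, alt_scores):
    """Empirical AUC: the chance a random non-null score beats a random
    null score (Mann-Whitney form, counting ties as one half)."""
    a = np.asarray(alt_scores)[:, None]    # non-null scores as a column
    n = np.asarray(null_scores)[None, :]   # null scores as a row
    return ((a > n) + 0.5 * (a == n)).mean()
```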
Simulated case

$X_i, Y_i \in \mathbb{R}^{200}$ with $n_x = n_y = 50$. Pick $\lVert\delta\rVert = 3$, uniform on the 200-dimensional sphere. Pick $\Sigma = I_d \times 50/\sqrt{d}$.

Why these? A uniform $\delta$ means that the group separation is unrelated to the covariance structure. Debatable. We follow Lopes et al. in making this assumption.

WLOG, under uniformity,
$$\Sigma = \operatorname{diag}(\lambda_1, \ldots, \lambda_d), \qquad \lambda_1 \geqslant \lambda_2 \geqslant \cdots \geqslant \lambda_d.$$
Interesting cases are equal $\lambda_j$ and rapidly decreasing $\lambda_j$.
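A hypothetical sketch of generating one such data set under these settings (the $\Sigma$ scaling follows my reading of the slide, $\Sigma = I_d \cdot 50/\sqrt{d}$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, nx, ny = 200, 50, 50

# delta uniform on the sphere in R^d, scaled so that ||delta|| = 3
delta = rng.standard_normal(d)
delta *= 3.0 / np.linalg.norm(delta)

sigma2 = 50.0 / np.sqrt(d)               # Sigma = I_d * 50/sqrt(d), as read off the slide
X = np.sqrt(sigma2) * rng.standard_normal((nx, d))            # X_i ~ N(0, Sigma)
Y = delta + np.sqrt(sigma2) * rng.standard_normal((ny, d))    # Y_i ~ N(delta, Sigma)
```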
Multiple projections

[Figure: histograms of simulated $\bar T^2$ scores for $M = 1$ and $M = 32$, null vs. alternative cases.]

$n_x = n_y = 50$, $d = 200$, $k = 49$. Null: $\lVert\delta\rVert = 0$. Alt: $\lVert\delta\rVert = 3$.
The ROCs

[Figure: ROC curves (true positives vs. false positives) for $M = 1, 2, 4, 8, 16, 32$.]

Larger $M$ has greater area under the curve:

 M    AUC
 1    71.9
 2    80.6
 4    87.1
 8    91.4
16    94.3
32    95.7
Varying k

Lopes et al. prefer $k \approx (n_x + n_y - 2)/2$. That is not always optimal, but it may be a good default.

For the previous scenario: small $k$ does relatively poorly, while $32 \leqslant k \leqslant 56$ all gave AUC $\approx 0.95$ with $M = 32$.

Other scenarios: S. Emerson finds that the advantage of averaging persists under other decay rates for the eigenvalues of $\Sigma$.
Using $\bar T^2$

The usual $p$-value is $\Pr(\bar T^2 \geqslant t^2)$ where $t^2$ is the observed value on our data. We have no good approximation for this.

Even the moments of $\bar T^2$ involve difficult integrals over $\Theta \in V_{d,k}$, the Stiefel manifold, e.g.
$$\int_{\Theta \in V_{d,k}} \Theta (\Theta^\mathsf{T} S \Theta)^{-1} \Theta^\mathsf{T} \,\mathrm{d}U(\Theta) = (2\pi)^{-dk/2} \int_{Z \in \mathbb{R}^{d \times k}} Z (Z^\mathsf{T} S Z)^{-1} Z^\mathsf{T} e^{-\operatorname{tr}(Z^\mathsf{T} Z)/2} \,\mathrm{d}Z$$
for non-negative diagonal $S \in \mathbb{R}^{d \times d}$ with $n_x + n_y - 2$ positive entries; $U(\Theta)$ is the uniform (Haar) measure.

The above is the first moment. Closed forms for the first and second moments could lead to useful test statistics.
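Lacking a closed form, the matrix-valued first moment can at least be estimated by plain Monte Carlo. A minimal sketch of mine, using the Gaussian form of the integral, which holds because $Z(Z^\mathsf{T} S Z)^{-1} Z^\mathsf{T}$ is invariant under $Z \mapsto ZR$ for invertible $R$:

```python
import numpy as np

def stiefel_first_moment_mc(S, k, n_mc=10_000, rng=None):
    """Plain MC estimate of E[Theta (Theta^T S Theta)^{-1} Theta^T] under
    Haar-distributed Theta in V_{d,k}, via the equivalent Gaussian form."""
    rng = np.random.default_rng() if rng is None else rng
    d = S.shape[0]
    total = np.zeros((d, d))
    for _ in range(n_mc):
        Z = rng.standard_normal((d, k))
        M = Z.T @ (S @ Z)                      # k x k; a.s. invertible when S
        total += Z @ np.linalg.solve(M, Z.T)   # has at least k positive entries
    return total / n_mc
```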