Edgeworth and confidence interval correction in spiked PCA
Iain Johnstone & Jeha Yang
Statistics & Biomedical Data Science, Stanford & Two Sigma
Shanghai, December 10, 2019
Viral protein mutations and spiked models
Quadeer et al., PLOS Comp. Bio. 2018
A suggestive simulation on correlation matrices [David Morales, Matt McKay]
2nd eigenvalue; $\rho_1 = 0.2$; $\rho_2 = 0.1$
[Figure: two histograms of the sample eigenvalue (2nd leading eigenvalue)]
◮ Left panel: c = 0.2, N = 300, N1 = 10, simple spks = [2.8, 1.9], deg. spks = [0.8, 0.9]; mean = 2.305 [2.322], std = 0.053 [0.054] (0.89 xPaul)
◮ Right panel: c = 0.2, N = 300, N1 = 30, simple spks = [6.8, 3.9], deg. spks = [0.8, 0.9]; mean = 4.145 [4.169], std = 0.126 [0.125] (0.89 xPaul)
Theoretical variance is pretty accurate, but there seems to be a shift in the mean (similar to what we've seen before in the eigenvector projections of sample covariance when spikes were close to each other).
Outline
◮ Background on spiked covariance model
◮ Edgeworth correction - single spike
◮ Edgeworth for multiple spikes
◮ Explaining the repulsion correction
◮ Confidence intervals after selection
High dimensional spiked PCA model
◮ Data: $X = [x_1 \cdots x_n]'$ with i.i.d. $x_1, \dots, x_n \sim N_{p+1}(0, \Sigma)$
◮ Large dimensional asymptotic regime: as $n \to \infty$, $\gamma_n := p/n \to \gamma \in (0, \infty)$
◮ Spiked eigenstructure of $\Sigma$: for a fixed $r$, $\underbrace{\ell_1 > \cdots > \ell_r}_{\text{spikes}} > 1 = \ell_{r+1} = \cdots = \ell_{p+1}$
◮ Statistics: eigenvalues $\hat\rho_1 \ge \cdots \ge \hat\rho_{p+1}$ of the sample covariance matrix $X'X/n$
→ w.l.o.g. $\Sigma$ is diagonal
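The model above can be simulated directly. The following is a minimal sketch, assuming NumPy is available; the function name `sample_spiked_eigenvalues` and the chosen spike value are illustrative and not from the talk. It uses the diagonal form of $\Sigma$ permitted by the w.l.o.g. remark.

```python
import numpy as np

def sample_spiked_eigenvalues(n, p, spikes, rng):
    """Eigenvalues (descending) of X'X/n for rows x_i ~ N_{p+1}(0, Sigma).

    Sigma is taken diagonal with the given spikes followed by ones
    (hypothetical helper, for illustration only).
    """
    dim = p + 1
    ell = np.ones(dim)
    ell[: len(spikes)] = spikes            # population eigenvalues
    # Scale i.i.d. standard normal columns by sqrt(ell) to get covariance Sigma
    X = rng.standard_normal((n, dim)) * np.sqrt(ell)
    S = X.T @ X / n                        # sample covariance matrix
    return np.linalg.eigvalsh(S)[::-1]     # eigvalsh returns ascending order

rng = np.random.default_rng(0)
rho_hat = sample_spiked_eigenvalues(n=800, p=200, spikes=[2.0], rng=rng)
print(rho_hat[:3])  # largest three sample eigenvalues
```

Here $\ell_1 = 2$ corresponds to $h = 1$, a supercritical spike when $\gamma_n = 0.25$.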
Largest eigenvalue $\hat\rho_1$: numerical illustration
$p = 200$, $n = 800$ [i.e. $\gamma_n = p/n = 0.25$]
Spike $h = \ell - 1$: 0, 0.25 (subcritical), $h_+ = 0.5$ (critical), 0.75, 1 (supercritical)
[Figure: distribution of $\hat\rho_1$ for each spike value; horizontal axis from 2.1 to 2.9]
Finite rank model, K = 1: phase transition
$\Sigma = \mathrm{diag}(\ell_1, 1, \dots, 1)$, $p/n \to \gamma$.
Interior point transition at $\ell_1 = 1 + \sqrt{\gamma}$ [Baik–Ben Arous–Péché, 05]:
◮ Below the critical point: Tracy-Widom, $n^{-2/3}$ fluctuation of $\lambda_1$ at the bulk edge $(1+\sqrt{\gamma})^2$
◮ Above the critical point: Gaussian, $n^{-1/2}$ fluctuation of $\lambda_1$ about $\lambda(\ell_1)$, with bias
[Figure: $\lambda_1$ against $\ell_1$, marking the bulk edge $(1+\sqrt{\gamma})^2$ and the critical point $\ell_1 = 1+\sqrt{\gamma}$]
Largest eigenvalue $\hat\rho_1$: numerical illustration
$p = 200$, $n = 800$ [i.e. $\gamma_n = p/n = 0.25$]
Spike $h$ = 0, 0.25 (subcritical), $h_+ = 0.5$ (critical), 0.75, 1 (supercritical)
Edge: $(1 + \sqrt{\gamma_n})^2 = 2.25$
[Figure: distribution of $\hat\rho_1$ for each spike value; horizontal axis from 2.1 to 2.9]
Largest eigenvalue: phase transition
Different rates, limit distributions:
For $h < \sqrt{\gamma}$: $n^{2/3}\,\dfrac{\hat\rho_1 - \mu(\gamma_n)}{\tau(\gamma_n)} \overset{D}{\Rightarrow} TW_\beta$
For $h > \sqrt{\gamma}$: $n^{1/2}\,\dfrac{\hat\rho_1 - \rho(h, \gamma_n)}{\sigma(h, \gamma_n)} \overset{D}{\Rightarrow} N(0, 1)$
with $\rho(h, \gamma) = (1 + h)\left(1 + \dfrac{\gamma}{h}\right)$, $\sigma^2(h, \gamma) = 2(1 + h)^2\left(1 - \dfrac{\gamma}{h^2}\right)$
[Figure: $\rho(\ell)$ against $\ell = 1 + h$, marking the bulk edge $(1+\sqrt{\gamma})^2$, the critical point $1+\sqrt{\gamma}$, and the bias]
References: Statistical physics lit, 94-; Baik-Ben Arous-Péché (05); Baik-Silverstein (06); Paul (07); Bloemendal-Virág (11); Mo (11); Benaych-Georges-Guionnet-Maïda (11); Wang (12)
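As a worked check of the supercritical centering and scaling, here is a small sketch using only the Python standard library. The function names `rho`, `sigma2` and `bulk_edge` are chosen for illustration; the formulas are the ones stated on the slide.

```python
import math

def rho(h, gamma):
    """Centering rho(h, gamma) = (1 + h)(1 + gamma / h) for a supercritical spike."""
    return (1 + h) * (1 + gamma / h)

def sigma2(h, gamma):
    """Variance constant sigma^2(h, gamma) = 2 (1 + h)^2 (1 - gamma / h^2)."""
    return 2 * (1 + h) ** 2 * (1 - gamma / h ** 2)

def bulk_edge(gamma):
    """Upper edge (1 + sqrt(gamma))^2 of the bulk."""
    return (1 + math.sqrt(gamma)) ** 2

gamma_n = 200 / 800  # p / n from the numerical illustration
for h in (0.5, 0.75, 1.0):
    print(h, rho(h, gamma_n), sigma2(h, gamma_n), bulk_edge(gamma_n))
```

For $h = 1$ and $\gamma_n = 0.25$ these give $\rho = 2.5$ and $\sigma^2 = 6$. At the critical value $h = \sqrt{\gamma_n} = 0.5$ the centering coincides with the bulk edge 2.25 and $\sigma^2$ vanishes, which matches the change of regime.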
Normal approximation – multiple spikes
◮ Assume that all spikes are simple, supercritical: $\ell_1 > \cdots > \ell_r > 1 + \sqrt{\gamma}$
◮ Asymptotic mutual independence: with $\rho_{kn} := \rho(\ell_k, \gamma_n)$, $\sigma_{kn} := \sigma(\ell_k, \gamma_n)$,
$(\hat z_{kn})_{k=1,\dots,r} := \left( \dfrac{n^{1/2}(\hat\rho_k - \rho_{kn})}{\sigma_{kn}} \right)_{k=1,\dots,r} \Rightarrow N(0, I_r)$
Shi (2013)
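The standardization can be sketched as follows, assuming NumPy and assuming that $\rho(\ell_k, \gamma_n)$ and $\sigma(\ell_k, \gamma_n)$ mean the earlier formulas evaluated at $h_k = \ell_k - 1$. The helper name `standardized_spike_eigenvalues` is hypothetical.

```python
import numpy as np

def standardized_spike_eigenvalues(X, ells):
    """Return z_kn = sqrt(n) (rho_hat_k - rho_kn) / sigma_kn for the leading spikes.

    X is n x (p + 1); ells are the (known) supercritical population spikes.
    Assumes rho and sigma are the single-spike formulas at h_k = ell_k - 1.
    """
    n, dim = X.shape
    gamma_n = (dim - 1) / n
    h = np.asarray(ells, dtype=float) - 1.0
    rho = (1 + h) * (1 + gamma_n / h)
    sigma = np.sqrt(2 * (1 + h) ** 2 * (1 - gamma_n / h ** 2))
    # Leading sample eigenvalues in descending order
    rho_hat = np.linalg.eigvalsh(X.T @ X / n)[::-1][: len(h)]
    return np.sqrt(n) * (rho_hat - rho) / sigma

rng = np.random.default_rng(1)
n, p, ells = 400, 400, [3.2, 2.7]
pop = np.ones(p + 1)
pop[: len(ells)] = ells                       # diagonal Sigma with two spikes
X = rng.standard_normal((n, p + 1)) * np.sqrt(pop)
z = standardized_spike_eigenvalues(X, ells)
print(z)
```

Repeating the draw many times and plotting each coordinate of `z` against the standard normal density is one way to reproduce comparisons of the kind shown in the talk.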
Edgeworth approximations
Inaccuracy of approximations: $\hat z_{kn}$ associated with $\ell_k = 2.7$
[Figure: four panels comparing the density of $\hat z_{kn}$ with the Normal density; vertical axis: Density]
◮ $(n, \gamma_n, \ell) = (400, 1, (2.7))$, horizontal axis $\hat z_{1n}$
◮ $(n, \gamma_n, \ell) = (400, 1, (2.7, 2.2))$, horizontal axis $\hat z_{1n}$
◮ $(n, \gamma_n, \ell) = (400, 1, (3.2, 2.7))$, horizontal axis $\hat z_{2n}$
◮ $(n, \gamma_n, \ell) = (400, 1, (2.7, 2.4))$, horizontal axis $\hat z_{1n}$
Traditional Edgeworth
(Smooth function of) means model: Petrov, 1975; Hall, 1992
$S_n = \dfrac{1}{\sqrt{n \kappa_{2n}}} \sum_{i=1}^{n} X_{ni}$, with $X_{ni}$ independent, mean 0, $\in \mathbb{R}^d$, $d$ fixed
Moments: $\kappa_{jn} = \dfrac{1}{n} \sum_{i=1}^{n} E X_{ni}^j$
First order expansion:
$P(S_n \le x) = \Phi(x) + n^{-1/2} p(x) \phi(x) + o(n^{-1/2})$
$p(x) = -\dfrac{\kappa_{3n}}{6 \kappa_{2n}^{3/2}} H_2(x)$, $H_2(x) = x^2 - 1$ (skewness correction)
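To see the skewness correction at work, here is a sketch on a case chosen for illustration (not from the talk): sums of centered Exp(1) variables, for which $\kappa_2 = 1$, $\kappa_3 = 2$ and the exact distribution is available through the Erlang cdf. Only the standard library is used.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def edgeworth_cdf(x, n, kappa2, kappa3):
    """First order expansion Phi(x) + n^{-1/2} p(x) phi(x), p(x) = -kappa3 H_2(x) / (6 kappa2^{3/2})."""
    p = -kappa3 / (6 * kappa2 ** 1.5) * (x * x - 1)
    return norm_cdf(x) + p * norm_pdf(x) / math.sqrt(n)

def exact_cdf_exponential_sum(x, n):
    """Exact P(S_n <= x) for S_n = (E_1 + ... + E_n - n) / sqrt(n), E_i i.i.d. Exp(1)."""
    t = n + x * math.sqrt(n)
    if t <= 0:
        return 0.0
    # Erlang(n, 1) cdf: 1 - sum_{k < n} exp(-t) t^k / k!, terms computed on the log scale
    tail = sum(math.exp(-t + k * math.log(t) - math.lgamma(k + 1)) for k in range(n))
    return 1.0 - tail

n = 20
for x in (-1.5, 0.0, 2.0):
    print(x, norm_cdf(x), edgeworth_cdf(x, n, 1.0, 2.0), exact_cdf_exponential_sum(x, n))
```

The printed columns are the normal approximation, the first order Edgeworth approximation, and the exact value, so the effect of the $n^{-1/2}$ term can be read off directly.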
Single spike, first order expansion for $\hat\rho_1$
$\hat z_{1n} = n^{1/2}(\hat\rho_1 - \rho_{1n})/\sigma_{1n}$
Theorem. In the spiked model, with $h_1 = \ell_1 - 1 > \sqrt{\gamma}$ and $\gamma_n = p/n$,
$P(\hat z_{1n} \le x) = \Phi(x) + n^{-1/2} p_{1n}(x) \phi(x) + o(n^{-1/2})$,
uniformly in $x \in \mathbb{R}$, with
$p_{1n}(x) = -\alpha_{2n} H_2(x) - \alpha_{0n}$
$\alpha_{2n} = \alpha_2(h_1, \gamma_n) = \dfrac{\sqrt{2}}{3} \dfrac{h_1^3 + \gamma_n}{(h_1^2 - \gamma_n)^{3/2}}$, $\alpha_{0n} = \alpha_0(h_1, \gamma_n) = \dfrac{1}{\sqrt{2}} \dfrac{\gamma_n (h_1 + 1)}{(h_1^2 - \gamma_n)^{3/2}}$
Coefficients of Edgeworth expansion for single spike
$\alpha_2(h_1, \gamma_n) = \dfrac{\sqrt{2}}{3} \dfrac{h_1^3 + \gamma_n}{(h_1^2 - \gamma_n)^{3/2}}$, $\alpha_0(h_1, \gamma_n) = \dfrac{1}{\sqrt{2}} \dfrac{\gamma_n (h_1 + 1)}{(h_1^2 - \gamma_n)^{3/2}}$
◮ Larger for "harder" cases, i.e. larger $\gamma$ and smaller $h$ ($> \sqrt{\gamma}$)
◮ Larger than in the fixed $p$ case, i.e. $\gamma = 0$, $\alpha_2 = \sqrt{2}/3$, $\alpha_0 = 0$ [Muirhead-Chikuse (1975)]
◮ Empirically reasonable if $\dfrac{\alpha_2^2}{n} = \dfrac{2}{9} \dfrac{(h_1^3 + \gamma)^2}{n (h_1^2 - \gamma)^3} \le 0.2$
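The coefficients and the corrected distribution function can be evaluated with a few lines of standard-library Python. This is a sketch under the formulas as read from the slides; the function names are illustrative, and `alpha2_sq_over_n` is the quantity $\alpha_2^2/n$ that the rule of thumb compares with 0.2.

```python
import math

def alpha2(h, gamma):
    """alpha_2(h, gamma) = (sqrt(2)/3) (h^3 + gamma) / (h^2 - gamma)^{3/2}."""
    return math.sqrt(2) / 3 * (h ** 3 + gamma) / (h ** 2 - gamma) ** 1.5

def alpha0(h, gamma):
    """alpha_0(h, gamma) = gamma (h + 1) / (sqrt(2) (h^2 - gamma)^{3/2})."""
    return gamma * (h + 1) / (math.sqrt(2) * (h ** 2 - gamma) ** 1.5)

def edgeworth_cdf_single_spike(x, n, h, gamma):
    """Phi(x) + n^{-1/2} p_1n(x) phi(x) with p_1n(x) = -alpha_2 H_2(x) - alpha_0."""
    Phi = 0.5 * (1 + math.erf(x / math.sqrt(2)))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    p = -alpha2(h, gamma) * (x * x - 1) - alpha0(h, gamma)
    return Phi + p * phi / math.sqrt(n)

def alpha2_sq_over_n(n, h, gamma):
    """Diagnostic alpha_2^2 / n used in the empirical rule of thumb."""
    return alpha2(h, gamma) ** 2 / n

# Setting from the later simulations: n = 400, gamma_n = 1, ell = 2.7 (h = 1.7)
n, h, gamma_n = 400, 1.7, 1.0
print(alpha2(h, gamma_n), alpha0(h, gamma_n), alpha2_sq_over_n(n, h, gamma_n))
print(edgeworth_cdf_single_spike(0.0, n, h, gamma_n))
```

Setting `gamma = 0` recovers the fixed $p$ values $\alpha_2 = \sqrt{2}/3$ and $\alpha_0 = 0$ for any $h > 0$.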
Single spike simulation
[Figure: four panels, each showing the Edgeworth and Normal approximations and the Upper Edge; horizontal axis $\hat\ell$, vertical axis Density]
◮ (n, γ, l-factor) = (50, 0.1, 0.3)
◮ (n, γ, l-factor) = (50, 1, 0.3)
◮ (n, γ, l-factor) = (100, 0.1, 0.3)
◮ (n, γ, l-factor) = (100, 1, 0.3)
Edgeworth for multiple spikes
Eigenvalues are repulsive!
[Figure: three panels comparing the density of $\hat z_{kn}$ with the Normal density: $(n, \gamma_n, \ell) = (400, 1, (2.7, 2.2))$ for $\hat z_{1n}$; $(400, 1, (2.7, 2.4))$ for $\hat z_{1n}$; $(400, 1, (3.2, 2.7))$ for $\hat z_{2n}$]
◮ The joint density of $(\hat\rho_1, \dots, \hat\rho_{n \wedge (p+1)})$ has a Jacobian factor $\prod_{i<j} |\hat\rho_i - \hat\rho_j|$ → pushes eigenvalues apart
◮ But this is not visible at leading order (for supercritical spikes): $(\hat z_{kn})_{k=1,\dots,r} \Rightarrow N(0, I_r)$
Multi spike, first order expansion for $\hat\rho_k$
$\hat z_{kn} = n^{1/2}(\hat\rho_k - \rho_{kn})/\sigma_{kn}$
Theorem. In the spiked model, with $h_k = \ell_k - 1 > \sqrt{\gamma}$ and $\gamma_n = p/n$,
$P(\hat z_{kn} \le x) = \Phi(x) + n^{-1/2} p_{kn}(x) \phi(x) + o(n^{-1/2})$,
uniformly in $x \in \mathbb{R}$, with
$p_{kn}(x) = -\alpha_2(h_k, \gamma_n) H_2(x) - \alpha_{0,k}(h, \gamma_n)$
$\alpha_2(h_k, \gamma_n) = \dfrac{\sqrt{2}}{3} \dfrac{h_k^3 + \gamma_n}{(h_k^2 - \gamma_n)^{3/2}}$
$\alpha_{0,k}(h, \gamma) = \dfrac{1}{\sqrt{2}} \dfrac{h_k + 1}{(h_k^2 - \gamma)^{1/2}} \left( \dfrac{\gamma}{h_k^2 - \gamma} + \sum_{j \ne k} \dfrac{h_j}{h_k - h_j} \right)$
Interpretation
Edgeworth corrected density: $\phi + n^{-1/2}(\alpha_2 H_3 + \alpha_0 H_1)\phi$
Relative to the single spike case: $\alpha_2$ unchanged, but
$\Delta\alpha_0 = \alpha_{0,k}(h, \gamma_n) - \alpha_0(h_k, \gamma_n) = \dfrac{1}{\sqrt{2}} \dfrac{h_k + 1}{(h_k^2 - \gamma_n)^{1/2}} \sum_{j \ne k} \dfrac{h_j}{h_k - h_j}$
◮ $\Delta\alpha_0 > 0$, e.g. smaller spikes $h_j < h_k$, pushes the density to the right; conversely for $\Delta\alpha_0 < 0$
◮ closer spikes ⇒ larger effect
◮ additive in $\ell_j$, $j \ne k$
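The repulsion term can be checked numerically. Below is a sketch in standard-library Python, using the multi-spike formula for $\alpha_{0,k}$ as read from the slides; the function names are illustrative and `k` is a 0-based index into the list `h` (an implementation choice, not the talk's notation).

```python
import math

def alpha0_multi(k, h, gamma):
    """alpha_{0,k}(h, gamma) for spikes h = [h_1, ..., h_r]; k is 0-based."""
    hk = h[k]
    repulsion = sum(hj / (hk - hj) for j, hj in enumerate(h) if j != k)
    prefactor = (hk + 1) / (math.sqrt(2) * math.sqrt(hk ** 2 - gamma))
    return prefactor * (gamma / (hk ** 2 - gamma) + repulsion)

def delta_alpha0(k, h, gamma):
    """Difference between the multi-spike and single-spike alpha_0 for spike k."""
    return alpha0_multi(k, h, gamma) - alpha0_multi(0, [h[k]], gamma)

def corrected_density(x, n, a2, a0):
    """phi(x) (1 + n^{-1/2} (a2 H_3(x) + a0 H_1(x))), H_3 = x^3 - 3x, H_1 = x."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    return phi * (1 + (a2 * (x ** 3 - 3 * x) + a0 * x) / math.sqrt(n))

gamma_n = 1.0
# Spike ell_k = 2.7 (h_k = 1.7) with different neighbours, as in the simulations
print(delta_alpha0(0, [1.7, 1.2], gamma_n))   # smaller neighbour ell = 2.2
print(delta_alpha0(0, [1.7, 1.4], gamma_n))   # closer smaller neighbour ell = 2.4
print(delta_alpha0(1, [2.2, 1.7], gamma_n))   # larger neighbour ell = 3.2
```

A smaller neighbour gives a positive shift, a closer neighbour a larger one, and a larger neighbour a negative one, in line with the three bullets above.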
Repulsion example 1: $\hat z_{kn}$ associated with $\ell_k = 2.7$
[Figure: Density of $\hat z_{kn}$ associated with $\ell_k = 2.7$, with Normal and Edgeworth approximations overlaid; vertical axis: Density]
◮ $(n, \gamma_n, \ell) = (400, 1, (2.7))$, horizontal axis $\hat z_{1n}$
◮ $(n, \gamma_n, \ell) = (400, 1, (2.7, 2.2))$, horizontal axis $\hat z_{1n}$
◮ $(n, \gamma_n, \ell) = (400, 1, (3.2, 2.7))$, horizontal axis $\hat z_{2n}$
◮ $(n, \gamma_n, \ell) = (400, 1, (2.7, 2.4))$, horizontal axis $\hat z_{1n}$