Estimating Sparse Principal Components and Subspaces - Jing Lei (PowerPoint presentation)


  1. Estimating Sparse Principal Components and Subspaces Jing Lei Department of Statistics, CMU Joint work with V. Q. Vu (OSU), J. Cho, and K. Rohe (U. of Wisc.) July 1, 2013

  2. Outline • PCA in high dimensions. • Sparsity of principal components. • Consistent estimation and minimax theory. • Feasible algorithms using convex relaxation.

  3. Principal Components Analysis • I have i.i.d. data points X_1, ..., X_n on p variables. • p may be large, so I want to use principal components analysis (PCA) for dimension reduction.

  4. Principal Components Analysis [Scatter plot of the data in the (x, y) plane.]

  5. Principal Components Analysis [Scatter plot of the same data in the (x, y) plane.]

  6. Principal Components Analysis [Scatter plot with the first principal component direction, labeled pc1, overlaid.]

  7. Principal Components Analysis • Σ = E(XX^T) is the population covariance matrix (say EX = 0). • Eigen-decomposition: Σ = VDV^T = λ_1 v_1 v_1^T + λ_2 v_2 v_2^T + ... + λ_p v_p v_p^T, where D = diag(λ_1, λ_2, ..., λ_p) with λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0 (eigenvalues), and V V^T = I_p with V = (v_1, v_2, ..., v_p) (eigenvectors). • "Optimal" d-dimensional projection: X → Π_d X, where Π_d = V_d V_d^T is the d-dimensional projection matrix and V_d = (v_1, ..., v_d).

  8. Classical Estimator • Sample covariance matrix: Σ̂ = (X_1 X_1^T + ... + X_n X_n^T) / n. • Estimate (λ̂_j, v̂_j) by eigen-decomposition of Σ̂; set V̂_d = (v̂_1, ..., v̂_d) and Π̂_d = V̂_d V̂_d^T. • Standard theory for p fixed and n → ∞: Π̂_d → Π_d a.s. if λ_d − λ_{d+1} > 0.
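As a rough illustration, here is a minimal numerical sketch of this classical estimator (assuming NumPy; the function name and interface are illustrative, not part of the talk):

```python
import numpy as np

def classical_pca(X, d):
    """Classical PCA estimator: top-d eigenvectors of the sample covariance.

    X : (n, p) array of observations (rows X_i^T), assumed centered.
    Returns (V_hat, Pi_hat): the p x d matrix of leading eigenvectors and
    the rank-d projection matrix V_hat V_hat^T.
    """
    n, p = X.shape
    Sigma_hat = X.T @ X / n                   # sample covariance (mean taken as 0)
    evals, evecs = np.linalg.eigh(Sigma_hat)  # eigenvalues in ascending order
    V_hat = evecs[:, ::-1][:, :d]             # top-d eigenvectors
    Pi_hat = V_hat @ V_hat.T                  # estimated projection matrix
    return V_hat, Pi_hat
```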

  9. High-Dimensional PCA: Challenges • Estimation accuracy. Classical theory fails when p/n → c > 0: λ̂_1 → c′ > 1 and v̂_1^T v_1 ≈ 0 under a simple model (Johnstone & Lu 2009). • Interpretability. Π̂_d X may be hard to interpret when it involves linear combinations of many variables. • Sparsity is a possible solution.

  10. Sparsity for Principal Subspaces [Vu & L 2012b] • Identifiability. If λ_1 = λ_2 = ... = λ_d, then one cannot distinguish V_d from V_d Q on the basis of observed data, for any orthogonal Q. • Intuition: a good notion of sparsity must be rotation invariant. • Matrix (2, 0) norm: for any matrix V ∈ R^{p×d}, ||V||_{2,0} = number of non-zero rows of V. • Row sparsity: ||V_d||_{2,0} ≤ R_0 ≪ p, where V_d = (v_1, v_2, ..., v_d). • Loss function: ||Π̂_d − Π_d||_F^2 (||·||_F: the Frobenius norm). Recall: Π̂_d = V̂_d V̂_d^T, Π_d = V_d V_d^T.
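Both quantities are easy to state in code; here is a minimal sketch assuming NumPy (function names are illustrative). Note that both are unchanged when V is replaced by VQ for orthogonal Q, which is exactly the rotation invariance asked for above.

```python
import numpy as np

def row_sparsity(V, tol=1e-12):
    """Matrix (2,0) norm: the number of rows of V with non-zero Euclidean norm."""
    return int(np.sum(np.linalg.norm(V, axis=1) > tol))

def subspace_loss(V_hat, V):
    """Frobenius loss ||V_hat V_hat^T - V V^T||_F^2 between projection matrices."""
    Pi_hat = V_hat @ V_hat.T
    Pi = V @ V.T
    return np.linalg.norm(Pi_hat - Pi, "fro") ** 2
```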

  11. Two Sparse PCA Models 1. Spiked model: Σ = (λ_1 − λ_{d+1}) v_1 v_1^T + ... + (λ_d − λ_{d+1}) v_d v_d^T + λ_{d+1} I_p. 2. General model: Σ = λ_1 v_1 v_1^T + ... + λ_d v_d v_d^T + λ_{d+1} Σ′, where Σ′ ⪰ 0, ||Σ′|| = 1, and Σ′ v_j = 0 for all 1 ≤ j ≤ d.
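To make the spiked model concrete, here is a small construction sketch (assuming NumPy; the function name, the choice of support, and the parameter values are illustrative):

```python
import numpy as np

def spiked_covariance(p, d, R0, lams, lam_noise, seed=0):
    """Spiked covariance with row-sparse leading eigenvectors.

    The top-d eigenvectors V are orthonormal and supported on the first R0
    coordinates, so ||V||_{2,0} <= R0.  lams = (lambda_1, ..., lambda_d) are
    the leading eigenvalues and lam_noise is lambda_{d+1}.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros((p, d))
    V[:R0, :] = np.linalg.qr(rng.standard_normal((R0, d)))[0]  # orthonormal, row-sparse
    spikes = (V * (np.asarray(lams) - lam_noise)) @ V.T        # sum_j (lam_j - lam_noise) v_j v_j^T
    return spikes + lam_noise * np.eye(p), V
```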

  12. Spiked Model is a Special Case of General Model [Figure: 100 × 100 covariance patterns of the spiked model and the general model. Black cell: |Σ(i, j)| ≤ 0.01; white cell: |Σ(i, j)| > 0.01. In the spiked model, all black cells outside the upper 20 × 20 block are exactly 0.]
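A quick numerical check of this relationship, under the same illustrative assumptions as the sketch above: a spiked covariance also satisfies the general form with Σ′ = I_p − V_d V_d^T, which indeed has ||Σ′|| = 1 and Σ′ v_j = 0.

```python
import numpy as np

p, d, R0 = 100, 2, 20
lams, lam_noise = np.array([5.0, 3.0]), 1.0

rng = np.random.default_rng(0)
V = np.zeros((p, d))
V[:R0, :] = np.linalg.qr(rng.standard_normal((R0, d)))[0]   # row-sparse eigenvectors

# Spiked form: sum_j (lam_j - lam_noise) v_j v_j^T + lam_noise * I_p
Sigma_spiked = (V * (lams - lam_noise)) @ V.T + lam_noise * np.eye(p)

# General form with Sigma' = I_p - V V^T
Sigma_general = (V * lams) @ V.T + lam_noise * (np.eye(p) - V @ V.T)

assert np.allclose(Sigma_spiked, Sigma_general)
```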

  13. How Does Sparsity Help? • Question: how does sparsity help with estimation? 1. How well can we do if sparsity is assumed? 2. How do we estimate under the sparsity assumption? • Intuition: estimation is easy if 1. n is large, 2. p is small, 3. λ_{d+1} is close to 0, 4. λ_d − λ_{d+1} is bounded away from 0, 5. R_0 is small. • Under the spiked model, [Johnstone & Lu 2009] give a consistent estimator of v_1 when p/n → c > 0 with everything else fixed.

  14. A Minimax Framework Find f(n, p, R_0, λ) such that sup_Σ E ||Π̂_d − Π_d||_F^2 ≳ f(n, p, R_0, λ) for every estimator Π̂_d, and a particular estimator Π̂_d such that E ||Π̂_d − Π_d||_F^2 ≲ f(n, p, R_0, λ) for all Σ. The supremum over Σ is taken over all matrices in the sparse PCA model; λ here denotes the vector of eigenvalues.

  15. Answer to the Minimax Question Theorem (Minimax Error Rate of Estimating V_d, Vu and Lei 2012b): Under the general model, the minimax rate of estimating V_d V_d^T is f_d(n, p, R_0, λ) ≍ [λ_1 λ_{d+1} / (λ_d − λ_{d+1})^2] · R_0 (d + log p) / n, and it can be achieved by V̂_d = argmax { Tr(V_d^T Σ̂ V_d) : V_d^T V_d = I_d, ||V_d||_{2,0} ≤ R_0 }.
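The achieving estimator is a combinatorial search over row supports. The toy sketch below (assuming NumPy; the function name is illustrative, and the exhaustive search is only feasible for very small p) spells out what the constrained maximization does: for each candidate support S of size R_0, the best orthonormal V_d supported on S is given by the top-d eigenvectors of the submatrix Σ̂_{S,S}.

```python
import numpy as np
from itertools import combinations

def sparse_subspace_bruteforce(Sigma_hat, d, R0):
    """Row-sparse subspace estimator by exhaustive search over supports of size R0."""
    p = Sigma_hat.shape[0]
    best_val, best_V = -np.inf, None
    for S in combinations(range(p), R0):
        S = list(S)
        evals, evecs = np.linalg.eigh(Sigma_hat[np.ix_(S, S)])
        val = evals[-d:].sum()                  # sum of the top-d eigenvalues on support S
        if val > best_val:
            V = np.zeros((p, d))
            V[S, :] = evecs[:, ::-1][:, :d]     # embed the top-d eigenvectors into R^p
            best_val, best_V = val, V
    return best_V
```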

  16. About This Result • Good news: • Exact minimax error rate in (n, p, d, R_0, λ) for general models. • First consistency result for ℓ_1 constrained/penalized PCA (Jolliffe et al 2003, Zou et al 2006). • Price to pay: • Finding the global maximizer is computationally demanding. • Extensions: • Soft sparsity: ℓ_q-ball with q ∈ [0, 1] [Vu & L 2012a,b]. • Feasible algorithms [Vu, Cho, L, Rohe 2013].

  17. Related Work • When d = 1, [Birnbaum et al 2012; Ma 2013] established the minimax rate under the spiked model, where the estimator is obtained by a power method with thresholding. • For subspace estimation, the minimax rate was independently obtained by [Cai et al 2012] under a Gaussian spiked model.

  18. Feasible Algorithm Via Convex Relaxation • For d = 1, the optimal estimator (consider Z = v_1 v_1^T) is Ẑ = argmax_Z Tr(Σ̂ Z) − λ ||Z||_0, s.t. rank(Z) = 1, Z ⪰ 0, Tr(Z) = 1. • [d'Aspremont et al 2004] proposed an SDP relaxation: Ẑ = argmax_Z Tr(Σ̂ Z) − λ ||Z||_1, s.t. Z ⪰ 0, Tr(Z) = 1. • Ẑ gives consistent variable selection with the optimal rate under a stringent spiked model, provided that Ẑ has rank 1 [Amini & Wainwright 2009].
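A minimal sketch of this SDP relaxation for d = 1, assuming the cvxpy modeling package with the SCS solver (neither is mentioned in the talk, and the function name is illustrative):

```python
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma_hat, lam):
    """SDP relaxation of sparse PCA for d = 1, in the spirit of d'Aspremont et al (2004):
        maximize  Tr(Sigma_hat Z) - lam * sum_{i,j} |Z_ij|
        s.t.      Z is positive semidefinite,  Tr(Z) = 1.
    """
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    objective = cp.Maximize(cp.trace(Sigma_hat @ Z) - lam * cp.sum(cp.abs(Z)))
    constraints = [Z >> 0, cp.trace(Z) == 1]
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return Z.value
```

The leading eigenvector of the returned Z serves as the estimate of v_1; a choice of the penalty level λ is given on the next slide.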

  19. Preliminary Results for SDP Relaxation Theorem (Error Bound for SDP Relaxation, [VCLR 2013]): When d = 1 under the general model, assume ||v_1||_0 ≤ R_0 and choose λ ≍ [λ_1 / (λ_1 − λ_2)] √(log p / n) in the SDP relaxation. Then w.h.p. the global optimizer Ẑ satisfies ||Ẑ − v_1 v_1^T||_2^2 ≲ R_0^2 [λ_1^2 / (λ_1 − λ_2)^2] (log p / n).

  20. SDP Relaxation is *Near* Optimal • Recall the SDP rate and the minimax rate (d = 1, q = 0): R_0^2 [λ_1^2 / (λ_1 − λ_2)^2] (log p / n) vs. R_0 [λ_1 λ_2 / (λ_1 − λ_2)^2] (log p / n). • These are off by a factor of R_0 λ_1 / λ_2. • The R_0 factor is unavoidable for polynomial-time algorithms in a hypothesis testing context [Berthet & Rigollet 2013]. • The λ_1 / λ_2 factor may be removable using a finer analysis.

  21. Summary • Sparsity helps improve both estimation accuracy and interpretability of PCA in high dimensions. • Sparsity can be defined for principal subspaces. • Minimax error rates are established for general covariance models. • Convex relaxation using SDP is near-optimal.

  22. Ongoing Work • Statistical properties of the SDP relaxation under soft sparsity. • SDP relaxation for subspaces (d > 1). • Penalties other than ℓ_1, such as the group lasso penalty.

  23. Main References 1. V. Q. Vu and J. Lei (2012), "Minimax rates of estimation for sparse PCA in high dimensions", AISTATS 2012. 2. V. Q. Vu and J. Lei (2013), "Minimax sparse principal subspace estimation in high dimensions", revision submitted. 3. V. Q. Vu, J. Cho, J. Lei, and K. Rohe (2013), ongoing work.
