Estimating Sparse Principal Components and Subspaces - Jing Lei (PowerPoint presentation)


  1. Estimating Sparse Principal Components and Subspaces Jing Lei Department of Statistics, CMU Joint work with V. Q. Vu (OSU), J. Cho, and K. Rohe (U. of Wisc.) July 1, 2013

  2. Outline • PCA in high dimensions. • Sparsity of principal components. • Consistent estimation and minimax theory. • Feasible algorithms using convex relaxation.

  3. Principal Components Analysis • I have i.i.d. data points X_1, ..., X_n on p variables. • p may be large, so I want to use principal components analysis (PCA) for dimension reduction.

  4. Principal Components Analysis [Scatter plot of the data in the (x, y) plane.]

  5. Principal Components Analysis [Scatter plot of the same data in the (x, y) plane.]

  6. Principal Components Analysis [Scatter plot with the first principal component direction, labeled pc1, overlaid.]

  7. Principal Components Analysis • Σ = E(XX^T) is the population covariance matrix (say EX = 0). • Eigen-decomposition: Σ = VDV^T = λ_1 v_1 v_1^T + λ_2 v_2 v_2^T + ... + λ_p v_p v_p^T, where D = diag(λ_1, λ_2, ..., λ_p) with λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0 (eigenvalues), and V V^T = I_p with V = (v_1, v_2, ..., v_p) (eigenvectors). • "Optimal" d-dimensional projection: X → Π_d X, where Π_d = V_d V_d^T is the d-dimensional projection matrix and V_d = (v_1, ..., v_d).

  8. Classical Estimator • Sample covariance matrix: Σ̂ = (X_1 X_1^T + ... + X_n X_n^T) / n. • Estimate (λ̂_j, v̂_j) by eigen-decomposition of Σ̂; set V̂_d = (v̂_1, ..., v̂_d) and Π̂_d = V̂_d V̂_d^T. • Standard theory for p fixed and n → ∞: Π̂_d → Π_d a.s. if λ_d − λ_{d+1} > 0.
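As a rough illustration, here is a minimal numerical sketch of this classical estimator (assuming NumPy; the function name and interface are illustrative, not part of the talk):

```python
import numpy as np

def classical_pca(X, d):
    """Classical PCA estimator: top-d eigenvectors of the sample covariance.

    X : (n, p) array of observations (rows X_i^T), assumed centered.
    Returns (V_hat, Pi_hat): the p x d matrix of leading eigenvectors and
    the rank-d projection matrix V_hat V_hat^T.
    """
    n, p = X.shape
    Sigma_hat = X.T @ X / n                   # sample covariance (mean taken as 0)
    evals, evecs = np.linalg.eigh(Sigma_hat)  # eigenvalues in ascending order
    V_hat = evecs[:, ::-1][:, :d]             # top-d eigenvectors
    Pi_hat = V_hat @ V_hat.T                  # estimated projection matrix
    return V_hat, Pi_hat
```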

  9. High-Dimensional PCA: Challenges • Estimation accuracy. Classical theory fails when p/n → c > 0: λ̂_1 → c′ > 1 and v̂_1^T v_1 ≈ 0 under a simple model (Johnstone & Lu 2009). • Interpretability. Π̂_d X may be hard to interpret when it involves linear combinations of many variables. • Sparsity is a possible solution.

  10. Sparsity for Principal Subspaces [Vu & L 2012b] • Identifiability. If λ_1 = λ_2 = ... = λ_d, then one cannot distinguish V_d from V_d Q on the basis of observed data, for any orthogonal Q. • Intuition: a good notion of sparsity must be rotation invariant. • Matrix (2, 0) norm: for any matrix V ∈ R^{p×d}, ||V||_{2,0} = number of non-zero rows of V. • Row sparsity: ||V_d||_{2,0} ≤ R_0 ≪ p, where V_d = (v_1, v_2, ..., v_d). • Loss function: ||Π̂_d − Π_d||_F^2 (||·||_F: the Frobenius norm). Recall: Π̂_d = V̂_d V̂_d^T, Π_d = V_d V_d^T.
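Both quantities are easy to state in code; here is a minimal sketch assuming NumPy (function names are illustrative). Note that both are unchanged when V is replaced by VQ for orthogonal Q, which is exactly the rotation invariance asked for above.

```python
import numpy as np

def row_sparsity(V, tol=1e-12):
    """Matrix (2,0) norm: the number of rows of V with non-zero Euclidean norm."""
    return int(np.sum(np.linalg.norm(V, axis=1) > tol))

def subspace_loss(V_hat, V):
    """Frobenius loss ||V_hat V_hat^T - V V^T||_F^2 between projection matrices."""
    Pi_hat = V_hat @ V_hat.T
    Pi = V @ V.T
    return np.linalg.norm(Pi_hat - Pi, "fro") ** 2
```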

  11. Two Sparse PCA Models 1. Spiked model: Σ = (λ_1 − λ_{d+1}) v_1 v_1^T + ... + (λ_d − λ_{d+1}) v_d v_d^T + λ_{d+1} I_p. 2. General model: Σ = λ_1 v_1 v_1^T + ... + λ_d v_d v_d^T + λ_{d+1} Σ′, where Σ′ ⪰ 0, ||Σ′|| = 1, and Σ′ v_j = 0 for all 1 ≤ j ≤ d.
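To make the spiked model concrete, here is a small construction sketch (assuming NumPy; the function name, the choice of support, and the parameter values are illustrative):

```python
import numpy as np

def spiked_covariance(p, d, R0, lams, lam_noise, seed=0):
    """Spiked covariance with row-sparse leading eigenvectors.

    The top-d eigenvectors V are orthonormal and supported on the first R0
    coordinates, so ||V||_{2,0} <= R0.  lams = (lambda_1, ..., lambda_d) are
    the leading eigenvalues and lam_noise is lambda_{d+1}.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros((p, d))
    V[:R0, :] = np.linalg.qr(rng.standard_normal((R0, d)))[0]  # orthonormal, row-sparse
    spikes = (V * (np.asarray(lams) - lam_noise)) @ V.T        # sum_j (lam_j - lam_noise) v_j v_j^T
    return spikes + lam_noise * np.eye(p), V
```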

  12. Spiked Model is a Special Case of General Model [Figure: 100 × 100 covariance patterns of the spiked model and the general model. Black cell: |Σ(i, j)| ≤ 0.01; white cell: |Σ(i, j)| > 0.01. In the spiked model, all black cells outside the upper 20 × 20 block are exactly 0.]
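A quick numerical check of this relationship, under the same illustrative assumptions as the sketch above: a spiked covariance also satisfies the general form with Σ′ = I_p − V_d V_d^T, which indeed has ||Σ′|| = 1 and Σ′ v_j = 0.

```python
import numpy as np

p, d, R0 = 100, 2, 20
lams, lam_noise = np.array([5.0, 3.0]), 1.0

rng = np.random.default_rng(0)
V = np.zeros((p, d))
V[:R0, :] = np.linalg.qr(rng.standard_normal((R0, d)))[0]   # row-sparse eigenvectors

# Spiked form: sum_j (lam_j - lam_noise) v_j v_j^T + lam_noise * I_p
Sigma_spiked = (V * (lams - lam_noise)) @ V.T + lam_noise * np.eye(p)

# General form with Sigma' = I_p - V V^T
Sigma_general = (V * lams) @ V.T + lam_noise * (np.eye(p) - V @ V.T)

assert np.allclose(Sigma_spiked, Sigma_general)
```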

  13. How Does Sparsity Help? • Question: how does sparsity help with estimation? 1. How well can we do if sparsity is assumed? 2. How do we estimate under the sparsity assumption? • Intuition: estimation is easy if 1. n is large, 2. p is small, 3. λ_{d+1} is close to 0, 4. λ_d − λ_{d+1} is bounded away from 0, 5. R_0 is small. • Under the spiked model, [Johnstone & Lu 2009] give a consistent estimator of v_1 when p/n → c > 0 with everything else fixed.

  14. A Minimax Framework Find f(n, p, R_0, λ) such that sup_Σ E ||Π̂_d − Π_d||_F^2 ≳ f(n, p, R_0, λ) for every estimator Π̂_d, and a particular estimator Π̂_d such that E ||Π̂_d − Π_d||_F^2 ≲ f(n, p, R_0, λ) for all Σ. The supremum over Σ is taken over all matrices in the sparse PCA model; λ here denotes the vector of eigenvalues.

  15. Answer to the Minimax Question Theorem (Minimax Error Rate of Estimating V_d, Vu and Lei 2012b): Under the general model, the minimax rate of estimating V_d V_d^T is f_d(n, p, R_0, λ) ≍ [λ_1 λ_{d+1} / (λ_d − λ_{d+1})^2] · R_0 (d + log p) / n, and it can be achieved by V̂_d = argmax { Tr(V_d^T Σ̂ V_d) : V_d^T V_d = I_d, ||V_d||_{2,0} ≤ R_0 }.
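The achieving estimator is a combinatorial search over row supports. The toy sketch below (assuming NumPy; the function name is illustrative, and the exhaustive search is only feasible for very small p) spells out what the constrained maximization does: for each candidate support S of size R_0, the best orthonormal V_d supported on S is given by the top-d eigenvectors of the submatrix Σ̂_{S,S}.

```python
import numpy as np
from itertools import combinations

def sparse_subspace_bruteforce(Sigma_hat, d, R0):
    """Row-sparse subspace estimator by exhaustive search over supports of size R0."""
    p = Sigma_hat.shape[0]
    best_val, best_V = -np.inf, None
    for S in combinations(range(p), R0):
        S = list(S)
        evals, evecs = np.linalg.eigh(Sigma_hat[np.ix_(S, S)])
        val = evals[-d:].sum()                  # sum of the top-d eigenvalues on support S
        if val > best_val:
            V = np.zeros((p, d))
            V[S, :] = evecs[:, ::-1][:, :d]     # embed the top-d eigenvectors into R^p
            best_val, best_V = val, V
    return best_V
```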

  16. About This Result • Good news: • Exact minimax error rate in (n, p, d, R_0, λ) for general models. • First consistency result for ℓ_1 constrained/penalized PCA (Jolliffe et al 2003, Zou et al 2006). • Price to pay: • Finding the global maximizer is computationally demanding. • Extensions: • Soft sparsity: ℓ_q-ball with q ∈ [0, 1] [Vu & L 2012a,b]. • Feasible algorithms [Vu, Cho, L, Rohe 2013].

  17. Related Work • When d = 1, [Birnbaum et al 2012; Ma 2013] established the minimax rate under the spiked model, where the estimator is obtained by a power method with thresholding. • For subspace estimation, the minimax rate was independently obtained by [Cai et al 2012] under a Gaussian spiked model.

  18. Feasible Algorithm Via Convex Relaxation • For d = 1, the optimal estimator (consider Z = v_1 v_1^T) is Ẑ = argmax_Z Tr(Σ̂ Z) − λ ||Z||_0, s.t. rank(Z) = 1, Z ⪰ 0, Tr(Z) = 1. • [d'Aspremont et al 2004] proposed an SDP relaxation: Ẑ = argmax_Z Tr(Σ̂ Z) − λ ||Z||_1, s.t. Z ⪰ 0, Tr(Z) = 1. • Ẑ gives consistent variable selection with the optimal rate under a stringent spiked model, provided that Ẑ has rank 1 [Amini & Wainwright 2009].
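A minimal sketch of this SDP relaxation for d = 1, assuming the cvxpy modeling package with the SCS solver (neither is mentioned in the talk, and the function name is illustrative):

```python
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma_hat, lam):
    """SDP relaxation of sparse PCA for d = 1, in the spirit of d'Aspremont et al (2004):
        maximize  Tr(Sigma_hat Z) - lam * sum_{i,j} |Z_ij|
        s.t.      Z is positive semidefinite,  Tr(Z) = 1.
    """
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    objective = cp.Maximize(cp.trace(Sigma_hat @ Z) - lam * cp.sum(cp.abs(Z)))
    constraints = [Z >> 0, cp.trace(Z) == 1]
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return Z.value
```

The leading eigenvector of the returned Z serves as the estimate of v_1; a choice of the penalty level λ is given on the next slide.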

  19. Preliminary Results for SDP Relaxation Theorem (Error Bound for SDP Relaxation, [VCLR 2013]): When d = 1 under the general model, assume ||v_1||_0 ≤ R_0 and choose λ ≍ [λ_1 / (λ_1 − λ_2)] √(log p / n) in the SDP relaxation. Then w.h.p. the global optimizer Ẑ satisfies ||Ẑ − v_1 v_1^T||_2^2 ≲ R_0^2 [λ_1^2 / (λ_1 − λ_2)^2] (log p / n).

  20. SDP Relaxation is *Near* Optimal • Recall the SDP rate and the minimax rate (d = 1, q = 0): R_0^2 [λ_1^2 / (λ_1 − λ_2)^2] (log p / n) vs. R_0 [λ_1 λ_2 / (λ_1 − λ_2)^2] (log p / n). • These are off by a factor of R_0 λ_1 / λ_2. • The R_0 factor is unavoidable for polynomial-time algorithms in a hypothesis testing context [Berthet & Rigollet 2013]. • The λ_1 / λ_2 factor may be removable using a finer analysis.

  21. Summary • Sparsity helps improve both estimation accuracy and interpretability of PCA in high dimensions. • Sparsity can be defined for principal subspaces. • Minimax error rates are established for general covariance models. • Convex relaxation using SDP is near-optimal.

  22. Ongoing Work • Statistical properties of the SDP relaxation under soft sparsity. • SDP relaxation for subspaces (d > 1). • Penalties other than ℓ_1, such as the group lasso penalty.

  23. Main References 1. V. Q. Vu and J. Lei (2012), "Minimax rates of estimation for sparse PCA in high dimensions", AISTATS 2012. 2. V. Q. Vu and J. Lei (2013), "Minimax sparse principal subspace estimation in high dimensions", revision submitted. 3. V. Q. Vu, J. Cho, J. Lei, and K. Rohe (2013), ongoing work.
