

1. Methods for Robust High Dimensional Graphical Model Selection
Bala Rajaratnam, Department of Statistics, Stanford University
Fields Institute
Joint work with S. Oh (UC Berkeley) and K. Khare (U. Florida)

2. Motivation
• Availability of high-throughput data from various applications
• Need for methodology/tools for analyzing high-dimensional data
• Examples:
  ◮ Biology: gene expression data
  ◮ Environmental science: climate data on a spatial grid
  ◮ Finance: returns on thousands of stocks
  ◮ Retail: consumer behavior
• Common goals:
  ◮ Understand complex relationships & multivariate dependencies
  ◮ Formulate correct models & develop inferential procedures

3. Modeling relationships
• Correlation: a basic measure of linear pairwise relationships
• Covariance matrix Σ: the collection of all such relationships
• Estimates of Σ are required in procedures such as PCA, CCA, MANOVA, etc.
• Estimating (functions of) Σ and Ω = Σ⁻¹ is of statistical interest
• Estimating Σ is difficult in high dimensions

4. Sparse estimates
• A p × p matrix Σ or Ω has O(p²) elements
• Estimating O(p²) parameters with classical estimators is not viable, especially when n ≪ p
• Goal: reliably estimate a small number of parameters in Σ
• Model selection: recovery of the zero/non-zero structure
• This gives rise to sparse estimates of Σ or Ω
• The sparsity pattern can be represented by graphs/networks

5. Gaussian Graphical Models (GGM)
• Assume Y = (Y_1, ..., Y_p)′ has distribution N_p(0, Σ)
• Denote V = {1, 2, ..., p}
• The covariance matrix cov(Y) = Σ encodes marginal dependencies:
  Y_i ⊥⊥ Y_j  ⟺  cov(Y_i, Y_j) = [Σ]_ij = 0
• The inverse covariance matrix Ω = Σ⁻¹ encodes conditional dependencies given the rest:
  Y_i ⊥⊥ Y_j | Y_{V\{i,j}}  ⟺  [Ω]_ij = 0
  (conditional independence ⟺ zero matrix element)
• Also known as Markov Random Fields (MRF)

6. Gaussian Graphical Models (GGM)
• A graph summarizes the relationships, with node set V = {1, ..., p} and edge set E:
  [Ω]_ij = 0  ⟺  i ≁ j
  (matrix element ⟺ network/graph)
• Build a graph from a sparse Ω, e.g.

            A    B    C
      A  [ 1.0  0.2  0.3 ]
  Ω = B  [ 0.2  2.0  0.0 ]   ⟹   edges A–B and A–C, no edge B–C
      C  [ 0.3  0.0  1.2 ]
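To make the correspondence concrete, here is a minimal sketch (my own illustration, not material from the talk) that recovers the graph of the 3 × 3 example above from the sparsity pattern of Ω; it assumes numpy and networkx are available.

```python
import numpy as np
import networkx as nx

Omega = np.array([[1.0, 0.2, 0.3],
                  [0.2, 2.0, 0.0],
                  [0.3, 0.0, 1.2]])
labels = ["A", "B", "C"]

# An edge i ~ j is present iff the off-diagonal entry [Omega]_ij is non-zero.
nonzero = (np.abs(Omega) > 1e-12) & ~np.eye(3, dtype=bool)

G = nx.Graph()
G.add_nodes_from(labels)
G.add_edges_from((labels[i], labels[j])
                 for i in range(3) for j in range(i + 1, 3) if nonzero[i, j])

print(sorted(G.edges()))   # [('A', 'B'), ('A', 'C')] -- B and C are conditionally independent
```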

7. GGM estimation algorithms
• Regularized Gaussian likelihood methods
  ◮ Block coordinate descent (COVSEL) [Banerjee et al., 2008]
  ◮ Graphical lasso (GLASSO) [Friedman, Hastie & Tibshirani, 2008]
  ◮ Large-scale GLASSO [Mazumder & Hastie, 2012]
  ◮ QUIC [Hsieh et al., 2011]
  ◮ G-ISTA [Guillot, Rajaratnam et al., 2012]
  ◮ Graphical dual proximal gradient methods [Dalal & Rajaratnam, 2013]
  ◮ Others
• Bayesian methods
  ◮ Dawid & Lauritzen, 1993, Annals of Statistics
  ◮ Letac & Massam, 2007, Annals of Statistics
  ◮ Rajaratnam, Massam & Carvalho, 2008, Annals of Statistics
  ◮ Khare & Rajaratnam, 2011, Annals of Statistics
  ◮ Others
• Testing-based methods
  ◮ Hero & Rajaratnam, 2011, JASA
  ◮ Hero & Rajaratnam, 2012, IEEE Transactions on Information Theory

8. Regularized Gaussian likelihood graphical model selection
• All ℓ1-regularized Gaussian-likelihood methods solve (see the sketch after this slide)
  Ω̂ = argmax_{Ω ≻ 0} { log det(Ω) − tr(ΩS) − λ‖Ω‖₁ }
• S: sample covariance matrix
• Graphical lasso [Friedman, Hastie & Tibshirani, 2008]
• Ω̂ is computed by solving this optimization problem
• Adding the ℓ1-regularization term λ‖Ω‖₁ induces sparsity
• The penalty parameter λ controls the level of sparsity
• Dependence on Gaussianity:
  ◮ Parametric model
  ◮ Sensitivity to outliers
  ◮ Log-concave function
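As a concrete illustration of the problem above, the following hedged sketch fits the ℓ1-penalized Gaussian likelihood with scikit-learn's GraphicalLasso (an off-the-shelf glasso implementation). The simulated data and the value of alpha, which plays the role of λ, are illustrative choices, not taken from the slides.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(5),
                            cov=np.eye(5) + 0.4 * np.ones((5, 5)),
                            size=200)

model = GraphicalLasso(alpha=0.1)      # alpha plays the role of the penalty lambda
model.fit(X)

Omega_hat = model.precision_           # sparse estimate of Omega = Sigma^{-1}
support = np.abs(Omega_hat) > 1e-8     # zero/non-zero pattern = estimated graph
print((support.sum() - 5) // 2, "estimated edges")
```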

9. Regularized pseudo-likelihood graphical model selection
• Two main approaches:
  1. ℓ1-regularized likelihood methods
  2. ℓ1-regularized regression-based/pseudo-likelihood methods
• A series of linear regressions forms a pseudo-likelihood function
• The objective function is the ℓ1-penalized pseudo-likelihood
• The pseudo-likelihood assumes less about the distribution of the data
• Applicable to a wider range of data

10. Partial covariance and correlation
• The matrix Y ∈ R^{n×p} denotes iid observations of a random vector with mean zero and covariance Σ = Ω⁻¹
• Goal: estimate the partial correlation graph
• Partial correlation in terms of Ω = [ω_ij]:
  ρ_ij = −ω_ij / √(ω_ii ω_jj)
• Called "partial" because it is the correlation of residuals r_k, where
  r_k = Y_k − Y β̂^(k),   β̂^(k) = argmin_{β: β_k = 0} ‖Y_k − Yβ‖²₂
  The partial correlation is then ρ_ij = cor(r_i, r_j) (checked numerically below)
• It can be shown that β_ij = ρ_ij √(ω_jj / ω_ii)
• The zero/non-zero pattern of [ρ_ij] is identical to that of Ω
• The partial correlation graph is given by the sparsity pattern of Ω
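The residual interpretation can be verified numerically. The sketch below (my own illustration using only numpy, not code from the talk) compares ρ_ij computed from Ω with the correlation of the residuals of Y_i and Y_j after regressing each on the remaining p − 2 variables, which is the classical residual-based definition matching the sign convention above.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 100_000
Omega = np.array([[2.0, 0.5, 0.0, 0.0],
                  [0.5, 2.0, 0.5, 0.0],
                  [0.0, 0.5, 2.0, 0.5],
                  [0.0, 0.0, 0.5, 2.0]])
Y = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

# Definition via the precision matrix: rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
d = np.sqrt(np.diag(Omega))
rho = -Omega / np.outer(d, d)

# Residual-based definition: regress Y_i and Y_j on the other p - 2 variables
i, j = 0, 1
rest = [k for k in range(p) if k not in (i, j)]

def residual(col):
    beta, *_ = np.linalg.lstsq(Y[:, rest], Y[:, col], rcond=None)
    return Y[:, col] - Y[:, rest] @ beta

print(rho[i, j])                                    # -0.25 (population value)
print(np.corrcoef(residual(i), residual(j))[0, 1])  # close to -0.25
```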

11. Regularized regression-based graphical model selection
• Neighborhood selection (NS) [Meinshausen and Bühlmann, 2006] (a rough sketch in code follows this slide):
  ω̂^(i) = argmin_{β: β_i = 0} ‖Y_i − Yβ‖²₂ + λ‖β‖₁
• The neighborhood of i is defined as ne(i) = { k : ω̂^(i)_k ≠ 0 }
• MB (Meinshausen–Bühlmann) does not take into account the symmetry of Ω:
  j ∈ ne(i) does not imply i ∈ ne(j)
• Current state-of-the-art methods address this issue:
  ◮ SPACE [Peng et al., 2009]
  ◮ SYMLASSO [Friedman, Hastie & Tibshirani, 2010]
  ◮ SPLICE [Rocha et al., 2008]
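Below is a rough sketch of neighborhood selection using scikit-learn's Lasso, followed by a simple AND-rule symmetrization (one common fix for the asymmetry noted above). The data-generating covariance, the penalty value, and the symmetrization rule are illustrative assumptions; note also that sklearn's alpha uses a 1/(2n) scaling of the squared error, so it is not numerically identical to λ in the display above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
p, n = 10, 500
# AR(1) covariance: its precision matrix is tridiagonal, so the true graph is the chain 0-1-...-9
cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

lam = 0.1
B = np.zeros((p, p))                  # B[i, k]: coefficient of Y_k in the lasso regression of Y_i
for i in range(p):
    others = [k for k in range(p) if k != i]
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, others], X[:, i])
    B[i, others] = fit.coef_

ne = [set(np.flatnonzero(B[i]).tolist()) for i in range(p)]   # estimated neighborhoods
# Neighborhoods need not be symmetric; one common fix is the AND (or OR) rule:
edges = {(i, j) for i in range(p) for j in range(i + 1, p) if j in ne[i] and i in ne[j]}
print(sorted(edges))
```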

12. Sparse PArtial Correlation Estimation (SPACE) [Peng et al., 2009]
• SPACE objective function (with weights w_i = 1 or w_i = ω_ii):
  Q_spc(Ω) = −(n/2) Σ_{i=1}^p log ω_ii + (1/2) Σ_{i=1}^p w_i ‖Y_i − Σ_{j≠i} ρ_ij √(ω_jj/ω_ii) Y_j‖²₂ + λ Σ_{1≤i<j≤p} |ρ_ij|,
  where β_ij = ρ_ij √(ω_jj/ω_ii)
• Iterative algorithm (a sketch in code follows this slide):
  1. Update [ρ_ij] coordinate-wise (using the current estimates [ω̂_ii]):
     [ρ_ij] ← argmin { (1/2) Σ_{i=1}^p w_i ‖Y_i − Σ_{j≠i} ρ_ij √(ω̂_jj/ω̂_ii) Y_j‖²₂ + λ Σ_{1≤i<j≤p} |ρ_ij| }
  2. Update [ω_ii] (using the current estimates [ρ̂_ij] and [ω̂_ii]):
     ω̂_ii ← ( (1/n) ‖Y_i − Σ_{j≠i} ρ̂_ij √(ω̂_jj/ω̂_ii) Y_j‖²₂ )⁻¹
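The following is a hedged, purely illustrative sketch of one sweep of the alternating updates above, written directly from the objective Q_spc rather than from the authors' implementation; in particular the soft-thresholding form of the coordinate update and the 1/n scaling in the diagonal update are my own derivation/assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """S(x, t) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def space_sweep(Y, rho, omega, lam, w=None):
    """One sweep of coordinate-wise rho_ij updates followed by the omega_ii updates.

    rho:   p x p symmetric matrix of current partial correlations (zero diagonal)
    omega: length-p vector of current diagonal entries omega_ii
    """
    n, p = Y.shape
    w = np.ones(p) if w is None else w
    for i in range(p):
        for j in range(i + 1, p):
            s_ij = np.sqrt(omega[j] / omega[i])       # scale of Y_j in the i-th regression
            s_ji = np.sqrt(omega[i] / omega[j])
            # partial residuals of the i-th and j-th regressions with the (i, j) term removed
            b_i = rho[i] * np.sqrt(omega / omega[i]); b_i[[i, j]] = 0.0
            b_j = rho[j] * np.sqrt(omega / omega[j]); b_j[[i, j]] = 0.0
            e_i, e_j = Y[:, i] - Y @ b_i, Y[:, j] - Y @ b_j
            # as a function of rho_ij alone, Q_spc is quadratic plus lam * |rho_ij|
            a = w[i] * s_ij**2 * (Y[:, j] @ Y[:, j]) + w[j] * s_ji**2 * (Y[:, i] @ Y[:, i])
            c = w[i] * s_ij * (e_i @ Y[:, j]) + w[j] * s_ji * (e_j @ Y[:, i])
            rho[i, j] = rho[j, i] = soft_threshold(c, lam) / a
    for i in range(p):
        b_i = rho[i] * np.sqrt(omega / omega[i]); b_i[i] = 0.0
        resid = Y[:, i] - Y @ b_i
        omega[i] = 1.0 / np.mean(resid ** 2)          # reciprocal mean squared residual (assumed scaling)
    return rho, omega

# usage: iterate sweeps and monitor the change between successive updates
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 5))
rho, omega = np.zeros((5, 5)), np.ones(5)
for _ in range(30):
    rho_old = rho.copy()
    rho, omega = space_sweep(Y, rho, omega, lam=25.0)
    # one can monitor np.max(np.abs(rho - rho_old)), cf. the convergence plots on the next slide
```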

13. Non-converging example: p = 3 case
[Figure: absolute difference between successive updates (log scale) vs. iteration number (0–4000) for each estimated partial variance (pvar.1–3) and partial correlation (pcor.1–3), with the convergence threshold marked; panels: (a) SPACE (w_i = ω_ii), (b) SPACE (w_i = 1)]
Figure caption: Y^(i) ∼ N_3(0, Ω⁻¹), (left) n = 4, (right) n = 100

14. Non-convergence of SPACE
• Investigate the nature and extent of the convergence issues:
  1. Are such examples pathological? How widespread are they?
  2. When do they occur?
• Consider a sparse 100 × 100 matrix Ω with edge density 4% and condition number 100.
• Generate 100 multivariate Gaussian datasets (with n = 100), µ = 0 and Σ = Ω⁻¹ (one possible construction is sketched below).
• Record the number of times (out of 100) that SPACE1 (uniform weights) and SPACE2 (partial variance weights) do not converge within 1500 iterations.
• The original implementation of SPACE by [Peng et al., 2009] claims only 3 iterations are sufficient.
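Here is a hedged sketch of this kind of simulation design: one plausible way (the authors' exact construction is not given on the slide) to build a sparse 100 × 100 precision matrix with roughly 4% edge density, rescale its spectrum to a target condition number of 100, and draw n = 100 Gaussian observations.

```python
import numpy as np

def sparse_precision(p=100, density=0.04, cond=100.0, seed=0):
    """Sparse, positive definite precision matrix with a prescribed condition number."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((p, p)) < density, k=1)     # ~4% of off-diagonal pairs get an edge
    A = np.zeros((p, p))
    A[upper] = rng.uniform(0.5, 1.0, upper.sum()) * rng.choice([-1.0, 1.0], upper.sum())
    A = A + A.T
    A += np.diag(np.abs(A).sum(axis=1) + 1.0)              # diagonal dominance ensures positive definiteness
    # shift the diagonal so that lambda_max / lambda_min equals the target condition number
    lam_min, lam_max = np.linalg.eigvalsh(A)[[0, -1]]
    return A + (lam_max - cond * lam_min) / (cond - 1.0) * np.eye(p)

Omega = sparse_precision()
Y = np.random.default_rng(1).multivariate_normal(np.zeros(100), np.linalg.inv(Omega), size=100)
print(f"condition number: {np.linalg.cond(Omega):.1f}, data shape: {Y.shape}")
```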

15. Non-convergence of SPACE

  SPACE1 (w_i = 1)            SPACE2 (w_i = ω_ii)
  λ*      NZ      NC          λ*      NZ      NC
  0.026   60.9%    92         0.085   79.8%   100
  0.099   19.7%   100         0.160   28.3%     0
  0.163    7.6%   100         0.220   10.7%     0
  0.228    2.9%   100         0.280    4.8%     0
  0.614    0.4%     0         0.730    0.5%    97

Table: Number of simulations (out of 100) that do not converge within 1500 iterations (NC) for selected values of the penalty parameter (λ* = λ/n). The average percentage of non-zeros (NZ) in Ω̂ is also shown.
• SPACE exhibits extensive non-convergence behavior
• The problem is exacerbated when the condition number is high
• This is typical of high-dimensional, sample-starved settings

16. Symmetric Lasso and SPLICE
• SYMLASSO [Friedman, Hastie & Tibshirani, 2010]:
  Q_sym(α, Ω̆) = (1/2) Σ_{i=1}^p [ n log α_ii + (1/α_ii) ‖Y_i + Σ_{j≠i} ω_ij α_ii Y_j‖²₂ ] + λ Σ_{1≤i<j≤p} |ω_ij|,
  where α_ii = 1/ω_ii
• SPLICE [Rocha et al., 2008]:
  Q_spl(B, D) = (n/2) Σ_{i=1}^p log d²_ii + (1/2) Σ_{i=1}^p (1/d²_ii) ‖Y_i − Σ_{j≠i} β_ij Y_j‖²₂ + λ Σ_{i<j} |β_ij|,
  where d²_ii = 1/ω_ii
• Both also use alternating (off-diagonal vs. diagonal) iterative algorithms, with no convergence guarantees (code evaluating the two objectives is sketched below)
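To make the two displays concrete, here is a small sketch (my own, not the authors' code) that simply evaluates Q_sym and Q_spl for a given positive definite Ω (respectively, a coefficient matrix B and partial variances d²) and a data matrix Y; it is intended only as a check on how the formulas are read.

```python
import numpy as np

def q_symlasso(Omega, Y, lam):
    n, p = Y.shape
    alpha = 1.0 / np.diag(Omega)                 # alpha_ii = 1 / omega_ii
    val = 0.0
    for i in range(p):
        coef = Omega[i] * alpha[i]
        coef[i] = 0.0
        resid = Y[:, i] + Y @ coef               # Y_i + sum_{j != i} omega_ij * alpha_ii * Y_j
        val += 0.5 * (n * np.log(alpha[i]) + resid @ resid / alpha[i])
    return val + lam * np.abs(np.triu(Omega, k=1)).sum()

def q_splice(B, d2, Y, lam):
    n, p = Y.shape
    val = 0.0
    for i in range(p):
        coef = B[i].copy()
        coef[i] = 0.0
        resid = Y[:, i] - Y @ coef               # Y_i - sum_{j != i} beta_ij * Y_j
        val += 0.5 * n * np.log(d2[i]) + 0.5 * resid @ resid / d2[i]
    return val + lam * np.abs(np.triu(B, k=1)).sum()

# tiny usage example on simulated data
Omega = np.array([[2.0, 0.5], [0.5, 2.0]])
Y = np.random.default_rng(0).multivariate_normal(np.zeros(2), np.linalg.inv(Omega), size=50)
B = -Omega / np.diag(Omega)[:, None]             # regression coefficients beta_ij = -omega_ij / omega_ii
print(q_symlasso(Omega, Y, lam=1.0), q_splice(B, 1.0 / np.diag(Omega), Y, lam=1.0))
```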

17. Regularized regression-based graphical model selection
• Advantage: regression-based methods perform better model selection than likelihood-based methods in finite samples
• Advantage: regression-based methods are less restrictive than Gaussian likelihood-based methods
• Disadvantage: Ω̂ may not be positive definite (this can be fixed)
• Disadvantage: the solution may not be computable
• Cause: the iterative algorithms SPACE, SYMLASSO and SPLICE are not guaranteed to converge

18. Regression-based methods: summary

  Property                              SYMLASSO   SPLICE   SPACE    NS
  Symmetry                                  +         +        +
  Convergence guarantee                                               N/A
  Asymptotic consistency (n, p → ∞)                            +       +

How can we obtain all of the good properties simultaneously?
