High-dimensional covariance estimation based on Gaussian graphical models
Shuheng Zhou, Department of Statistics, The University of Michigan, Ann Arbor
IMA workshop on High Dimensional Phenomena, Sept. 26, 2011
Joint work with Philipp Rütimann, Min Xu, and Peter Bühlmann
Problem definition
Want to estimate the covariance matrix for Gaussian distributions: e.g., stock prices.
Take a random sample of vectors X^(1), ..., X^(n) i.i.d. ∼ N_p(0, Σ_0), where p is understood to depend on n.
Let Θ_0 := Σ_0^{-1} denote the concentration matrix.
Sparsity: certain elements of Θ_0 are assumed to be zero.
Task: use the sample to obtain a set of zeros, and then an estimator for Θ_0 (Σ_0) for the given pattern of zeros.
Show consistency in predictive risk and in estimating Θ_0 and Σ_0 when n, p → ∞.
Gaussian graphical model: representation
Let X be a p-dimensional Gaussian random vector, X = (X_1, ..., X_p) ∼ N(0, Σ_0), where Σ_0 = Θ_0^{-1}.
In the Gaussian graphical model G(V, E_0), where |V| = p:
a pair (i, j) is NOT contained in E_0 (θ_{0,ij} = 0) iff X_i ⊥ X_j | {X_k; k ∈ V \ {i, j}}.
Define the predictive risk for Σ ≻ 0 as
R(Σ) = tr(Σ^{-1} Σ_0) + log|Σ| ∝ −2 E_0(log f_Σ(X)),
where the Gaussian log-likelihood function using Σ ≻ 0 is
log f_Σ(X) = −(p/2) log 2π − (1/2) log|Σ| − (1/2) X^T Σ^{-1} X.
Penalized maximum likelihood estimators
To estimate a sparse model (i.e., |Θ_0|_0 is small), recent work has considered ℓ_1-penalized maximum likelihood estimators: let |Θ|_1 = ‖vec Θ‖_1 = Σ_i Σ_j |θ_ij|, and
Θ̂_n = arg min_{Θ ≻ 0} { tr(Θ Ŝ_n) − log|Θ| + λ_n |Θ|_1 },
where Ŝ_n = n^{-1} Σ_{r=1}^n X^(r) (X^(r))^T is the sample covariance.
The graph Ĝ_n is determined by the non-zeros of Θ̂_n.
References: Yuan-Lin 07, d'Aspremont-Banerjee-El Ghaoui 08, Friedman-Hastie-Tibshirani 08, Rothman et al. 08, Z-Lafferty-Wasserman 08, and Ravikumar et al. 08.
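As a concrete illustration of this estimator, here is a minimal sketch using scikit-learn's GraphicalLasso; the tridiagonal Θ_0, the sample size, and the penalty value alpha are arbitrary choices for the example, not values used in this work.

```python
# Sketch: l1-penalized Gaussian MLE (graphical lasso); the estimated graph is
# read off the non-zeros of the estimated precision (concentration) matrix.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p, n = 30, 200
# Toy sparse concentration matrix (tridiagonal) and its covariance.
Theta0 = np.eye(p) + 0.3 * np.diag(np.ones(p - 1), 1) + 0.3 * np.diag(np.ones(p - 1), -1)
Sigma0 = np.linalg.inv(Theta0)
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)

model = GraphicalLasso(alpha=0.1).fit(X)   # alpha plays the role of lambda_n (assumed value)
Theta_hat = model.precision_               # estimate of Theta_0
E_hat = np.abs(Theta_hat) > 1e-6           # estimated support / edge pattern
```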
Predictive risks
Fix a point of interest with f_0 = N(0, Σ_0).
For a given L_n, consider a constrained set of positive definite matrices: Γ_n = {Σ : Σ ≻ 0, ‖Σ^{-1}‖_1 ≤ L_n}.
Define the oracle estimator as Σ* = arg min_{Σ ∈ Γ_n} R(Σ); recall R(Σ) = tr(Σ^{-1} Σ_0) + log|Σ|.
Define Σ̂_n as the minimizer of R̂_n(Σ) subject to Σ ∈ Γ_n:
Σ̂_n = arg min_{Σ ∈ Γ_n} { tr(Σ^{-1} Ŝ_n) + log|Σ| }, where the bracketed quantity is R̂_n(Σ).
R̂_n(Σ) is the negative Gaussian log-likelihood function and Ŝ_n is the sample covariance.
Risk consistency
Persistence Theorem: Let p < n^ξ for some ξ > 0. Given
Γ_n = {Σ : Σ ≻ 0, ‖Σ^{-1}‖_1 ≤ L_n}, where L_n = o((n / log n)^{1/2}),
then R(Σ̂_n) − R(Σ*_n) → 0 in probability,
where R(Σ) = tr(Σ^{-1} Σ_0) + log|Σ| and Σ*_n = arg min_{Σ ∈ Γ_n} R(Σ).
[Figure: simulation illustration for n = 200, 400, 800.]
Persistence answers the asymptotic question: how large may the set Γ_n be, so that it is still possible to select empirically a predictor whose risk is close to that of the best predictor in the set? (See Greenshtein-Ritov 04.)
Non-edges act as the constraints
Suppose we obtain an edge set E such that E_0 ⊆ E. Define the estimator for the concentration matrix Θ_0 as
Θ̂_n(E) = arg min_{Θ ∈ M_E} { tr(Θ Ŝ_n) − log|Θ| },
where M_E = {Θ : Θ ≻ 0 and θ_ij = 0 ∀ (i, j) ∉ E, i ≠ j}.
Theorem. Assume that 0 < ϕ_min(Σ_0) < ϕ_max(Σ_0) < ∞. Suppose that E_0 ⊂ E and |E \ E_0| = O(S), where S = |E_0|. Then
‖Θ̂_n(E) − Θ_0‖_F = O_P(√((p + S) log max(n, p) / n)).
This is the same rate as Rothman et al. 08 for the ℓ_1-penalized likelihood estimate.
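A sketch of the edge-constrained estimator Θ̂_n(E) written as a convex program; cvxpy and the function name constrained_mle are assumed choices for illustration, not the authors' implementation:

```python
# Sketch: MLE of the concentration matrix with zeros imposed outside a given edge set E,
#   minimize  tr(Theta S_n) - log det(Theta)  over  Theta in M_E.
import cvxpy as cp

def constrained_mle(S, E, p):
    """S: p x p sample covariance (or correlation); E: set of pairs (i, j) with i != j."""
    Theta = cp.Variable((p, p), PSD=True)
    zero_constraints = [Theta[i, j] == 0
                        for i in range(p) for j in range(i + 1, p)
                        if (i, j) not in E and (j, i) not in E]
    objective = cp.Minimize(cp.trace(Theta @ S) - cp.log_det(Theta))
    cp.Problem(objective, zero_constraints).solve()
    return Theta.value
```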
Get rid of the dependency on p
Theorem. Assume that 0 < ϕ_min(Σ_0) < ϕ_max(Σ_0) < ∞. Assume that Σ_{0,ii} = 1 ∀ i. Suppose we obtain an edge set E such that E_0 ⊆ E and |E \ E_0| = O(S), where S := |E_0| = Σ_{i=1}^p s_i. Then
‖Θ̂_n(E) − Θ_0‖_F = O_P(√(S log max(n, p) / n)).
In the likelihood function, Ŝ_n will be replaced by the sample correlation matrix
Γ̂_n = diag(Ŝ_n)^{-1/2} Ŝ_n diag(Ŝ_n)^{-1/2}.
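The rescaling to the sample correlation matrix is a one-line computation; a small numpy sketch (the function name is hypothetical):

```python
import numpy as np

def sample_correlation(S):
    """Gamma_n = diag(S)^(-1/2) S diag(S)^(-1/2) for a sample covariance matrix S."""
    d = 1.0 / np.sqrt(np.diag(S))
    return S * np.outer(d, d)
```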
Main questions
How to select an edge set E so that we estimate Θ_0 well?
What assumptions do we need to impose on Σ_0 or Θ_0?
How does n scale with p, |E|, or the maximum node degree deg(G)?
What if some edges have very small weights? How to ensure that E \ E_0 is small?
How does the edge-constrained maximum likelihood estimate behave with respect to E_0 \ E and E \ E_0?
Outline
Introduction
The regression model
The method
Theoretical results
Conclusion
A Regression Model
We assume a multivariate Gaussian model X = (X_1, ..., X_p) ∼ N_p(0, Σ_0), where Σ_{0,ii} = 1.
Consider a regression formulation of the model: for all i = 1, ..., p,
X_i = Σ_{j ≠ i} β^i_j X_j + V_i, where β^i_j = −θ_{0,ij} / θ_{0,ii},
and V_i ∼ N(0, σ²_{V_i}) is independent of {X_j; j ≠ i}; we assume that there exists v² > 0 such that for all i, Var(V_i) = 1/θ_{0,ii} ≥ v².
Recall X_i ⊥ X_j | {X_k; k ∈ V \ {i, j}} ⟺ θ_{0,ij} = 0 ⟺ β^i_j = 0 and β^j_i = 0.
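The identity β^i_j = −θ_{0,ij} / θ_{0,ii} can be checked numerically; a small sketch with an arbitrary 3 × 3 concentration matrix (the toy numbers are only for illustration):

```python
import numpy as np

# Toy positive-definite concentration matrix.
Theta0 = np.array([[2.0, 0.5, 0.0],
                   [0.5, 2.0, 0.3],
                   [0.0, 0.3, 2.0]])
Sigma0 = np.linalg.inv(Theta0)

i, others = 0, [1, 2]
# Population regression coefficients of X_i on the remaining variables.
beta_i = np.linalg.solve(Sigma0[np.ix_(others, others)], Sigma0[others, i])
# These coincide with -theta_{0,ij} / theta_{0,ii}.
print(beta_i, -Theta0[others, i] / Theta0[i, i])
```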
Want to recover the support of β^i
Take a random sample of size n, and use the sample to estimate β^i for all i; that is, for each variable X_i we have
X_i = X_{·\i} β^i + ε,
where X_i ∈ R^n, X_{·\i} ∈ R^{n × (p − 1)}, β^i ∈ R^{p − 1}, and ε ∈ R^n, and where we assume p > n, that is, high-dimensional data X.
Lasso (Tibshirani 96), a.k.a. Basis Pursuit (Chen, Donoho, and Saunders 98, and others):
β̂^i = arg min_β ‖X_i − X_{·\i} β‖²_2 / (2n) + λ_n ‖β‖_1.
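A sketch of one such nodewise Lasso fit with scikit-learn (an assumed implementation choice, not the authors' code); note that scikit-learn's Lasso minimizes ‖y − Xβ‖²/(2n) + α‖β‖_1, matching the display above with α = λ_n:

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso(X, i, lam):
    """Regress X_i on the remaining columns of the n x p data matrix X;
    returns the (p - 1)-vector of Lasso coefficients for node i."""
    y = X[:, i]
    X_rest = np.delete(X, i, axis=1)
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X_rest, y)
    return fit.coef_
```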
Meinshausen and Bühlmann 06
Perform p regressions using the Lasso to obtain p vectors of regression coefficients β̂^1, ..., β̂^p, where for each i, β̂^i = {β̂^i_j; j ∈ {1, ..., p} \ {i}}.
Then estimate the edge set by the "OR" rule: estimate an edge between nodes i and j ⟺ β̂^i_j ≠ 0 or β̂^j_i ≠ 0.
Under sparsity and "Neighborhood Stability" conditions, they show P(Ê_n = E_0) → 1 as n → ∞.
Sparsity
At row i, define s^i_{0,n} as the smallest integer such that
Σ_{j = 1, j ≠ i}^p min{θ²_{0,ij}, λ² θ_{0,ii}} ≤ s^i_{0,n} λ² θ_{0,ii}.
The essential sparsity s^i_{0,n} at row i counts all (i, j) such that |θ_{0,ij}| ≳ λ √θ_{0,ii} ⟺ |β^i_j| ≳ λ σ_{V_i}.
Define S_{0,n} = Σ_{i=1}^p s^i_{0,n} as the essential sparsity of the graph, which counts all (i, j) such that
|θ_{0,ij}| ≳ λ √(min(θ_{0,ii}, θ_{0,jj})) ⟺ |β^i_j| ≳ λ σ_{V_i} or |β^j_i| ≳ λ σ_{V_j}.
Aim to keep ≍ 2 S_{0,n} edges in E.
Defining 2 s_0
Let 0 ≤ s_0 ≤ s be the smallest integer such that
Σ_{i=1}^{p−1} min(β²_i, λ² σ²) ≤ s_0 λ² σ², where λ = √(2 log p / n).
If we order the β_j's in decreasing order of magnitude, |β_1| ≥ |β_2| ≥ ... ≥ |β_{p−1}|, then |β_j| < λ σ for all j > s_0.
[Figure: sorted coefficient magnitudes for p = 512, n = 500, s = 96, σ = 1, with reference levels σ √(2 log p / n), σ √(log p / n), and σ / √n, and with s_0, 2 s_0, and s marked on the horizontal axis.]
This notion of sparsity has been used in linear regression (Candès-Tao 07, Z09, 10).
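Given the coefficient vector, s_0 is the ceiling of the capped sum divided by λ² σ²; a small numpy sketch assuming β, σ, and λ are known (hypothetical function name):

```python
import numpy as np

def essential_sparsity(beta, sigma, lam):
    """Smallest integer s0 with sum_i min(beta_i^2, lam^2 sigma^2) <= s0 * lam^2 * sigma^2."""
    capped = np.minimum(beta ** 2, (lam * sigma) ** 2)
    return int(np.ceil(capped.sum() / (lam * sigma) ** 2))
```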
Selection: individual neighborhood
We use the Lasso in combination with thresholding (Z09, Z10) for inferring the graph. Let λ = √(2 log p / n).
For each of the nodewise regressions, obtain an initial estimator β̂^i_init using the Lasso with penalty parameter λ_n ≍ λ:
β̂^i_init = arg min_{β^i} Σ_{r=1}^n (X^(r)_i − Σ_{j ≠ i} β^i_j X^(r)_j)² + λ_n Σ_{j ≠ i} |β^i_j|, ∀ i.
Threshold β̂^i_init with τ ≍ λ to get the "zero" set:
D^i = {j : j ≠ i, |β̂^i_{j,init}| < τ}.
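A sketch of the thresholding step, reusing the output of the hypothetical nodewise_lasso function from the earlier sketch (names are assumptions, not the authors' code):

```python
def zero_set(beta_init, i, tau):
    """D^i: node labels j != i whose initial Lasso coefficient is below tau in magnitude."""
    # beta_init has length p - 1 (node i excluded, as returned by nodewise_lasso);
    # map positions back to the original node labels.
    others = [j for j in range(len(beta_init) + 1) if j != i]
    return {j for j, b in zip(others, beta_init) if abs(b) < tau}
```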
Selection: joining the neighborhoods
Define the total "zeros" as
D = {(i, j), i ≠ j : j ∈ D^i and i ∈ D^j}.
Select the edge set E := {(i, j) : i, j = 1, ..., p, i ≠ j, (i, j) ∉ D};
that is, the edge set joins the (thresholded) neighborhoods across all nodes in the graph.
This reflects the idea that the essential sparsity S_{0,n} of the graph counts all (i, j) such that
|θ_{0,ij}| ≥ λ √(min(θ_{0,ii}, θ_{0,jj})).
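A sketch of the joining step under the same assumed naming: an edge (i, j) is kept unless both regressions threshold the pair out.

```python
def select_edges(D, p):
    """D: dict mapping each node i to its zero set D^i.
    Keep edge (i, j) unless j is in D[i] and i is in D[j]."""
    E = set()
    for i in range(p):
        for j in range(i + 1, p):
            if not (j in D[i] and i in D[j]):
                E.add((i, j))
    return E
```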
Example: a star graph
Construct Σ_0 from a model used in Ravikumar et al. 08:

Σ_0 =
  1    ρ    ρ    ρ    ...  0
  ρ    1    ρ²   ρ²   ...  0
  ρ    ρ²   1    ρ²   ...  0
  ρ    ρ²   ρ²   1    ...  0
  ...  ...  ...  ...  ...  ...
  0    ...  ...  ...  ...  1      (p × p)

That is, the hub node has correlation ρ with each of its neighbors, each pair of neighbors has correlation ρ², and the remaining nodes are uncorrelated with everything else.
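A sketch constructing this Σ_0, assuming node 0 is the hub with s leaf neighbors and the remaining p − s − 1 nodes are isolated (indexing, function name, and the call at the end are illustrative):

```python
import numpy as np

def star_sigma(p, s, rho):
    """Star-graph covariance: node 0 is the hub, nodes 1..s are its leaves,
    and the remaining p - s - 1 nodes are isolated with unit variance."""
    Sigma = np.eye(p)
    Sigma[0, 1:s + 1] = Sigma[1:s + 1, 0] = rho     # hub-leaf correlations
    Sigma[1:s + 1, 1:s + 1] = rho ** 2              # leaf-leaf correlations
    np.fill_diagonal(Sigma[1:s + 1, 1:s + 1], 1.0)  # restore unit variances
    return Sigma

Sigma0 = star_sigma(p=128, s=8, rho=0.5)            # parameters used on the next slide
```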
Example: original graph
p = 128, n = 96, s = 8, ρ = 0.5, λ_n = 2 √(2 log p / n), τ = 0.2 √(2 log p / n)
Example: estimated graph with n = 96
λ_n = 2 √(2 log p / n)
Example: estimated graph
λ_n = 2 √(2 log p / n)