Efficiency of Bayesian procedures in some high dimensional problems

Natesh S. Pillai
Dept. of Statistics, Harvard University
pillai@fas.harvard.edu

May 16, 2013, DIMACS Workshop

Joint Work: Collaborators

Anirban Bhattacharya, Debdeep Pati and David Dunson (Duke University and Florida State)
Christian Robert, Jean-Michel Marin, Judith Rousseau (Paris 9)
Jun Yin (University of Wisconsin)

Outline

Goal: Understand Bayesian methods in high dimensions.
Example 1: Covariance matrix estimation
Example 2: Bayesian model choice via ABC
Implications; the frequentist-Bayes connection in high dimensions.

Motivation: Conversation with Peter E. Huybers

Time variability in covariance patterns: stationarity?
Instrumental measurements exist only for the past n = 150 years.
Measurements on p = 2000 latitude-longitude points.
Estimating O(p^2) parameters requires judicious modeling.

Covariance Matrix Estimation: Why Shrinkage?

We observe i.i.d. y_1, \ldots, y_n \sim N_{p_n}(0, \Sigma_{0n}) and set y^{(n)} = (y_1, \ldots, y_n).
For p_n = p fixed, the sample covariance estimator
  \hat{\Sigma}_{sample} = \frac{1}{n} \sum_{i=1}^{n} y_i y_i^T
is consistent, and the sample eigenvalues \hat{\lambda}_i are consistent for the population eigenvalues:
  \sqrt{n}(\hat{\lambda}_i - \lambda_i) \Rightarrow N(0, V(\lambda_i)).

Covariance Matrix in High Dimensions

Simplest case: \Sigma_{0n} = I. Take p = p_n = cn, c \in (0, 1).
Let \hat{\lambda}_1, \hat{\lambda}_{p_n} be the largest and smallest (non-zero) eigenvalues of
  \hat{\Sigma}_{sample} = \frac{1}{n} \sum_{i=1}^{n} y_i y_i^T.
Then as n \to \infty (and thus p_n also grows), almost surely (Marcenko-Pastur, 1967):
  \lim_{n \to \infty} \hat{\lambda}_1 = (1 + \sqrt{c})^2,
  \lim_{n \to \infty} \hat{\lambda}_{p_n} = (1 - \sqrt{c})^2.
The MLE is not consistent!

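A minimal simulation sketch of this limit (not from the slides; the values n = 2000 and c = 1/4 are illustrative choices):

```python
# Simulate the Marcenko-Pastur edge for Sigma_0 = I with p = c*n, and
# compare the extreme sample eigenvalues to the limits (1 +/- sqrt(c))^2.
import numpy as np

rng = np.random.default_rng(0)
n, c = 2000, 0.25
p = int(c * n)

Y = rng.standard_normal((n, p))      # rows y_i ~ N_p(0, I)
S = Y.T @ Y / n                      # sample covariance matrix
eigs = np.linalg.eigvalsh(S)         # ascending order

print("largest eigenvalue :", eigs[-1], " limit:", (1 + np.sqrt(c)) ** 2)
print("smallest eigenvalue:", eigs[0],  " limit:", (1 - np.sqrt(c)) ** 2)
# Both extremes are biased away from the true value 1, so the MLE of
# Sigma_0 = I is inconsistent in operator norm when p grows with n.
```
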
Covariance Matrix in High Dimensions

\lim_{n \to \infty} \hat{\lambda}_1 = (1 + \sqrt{c})^2 = \lambda_+.
Confidence interval: n^{2/3}(\hat{\lambda}_1 - \lambda_+) \Rightarrow TW_1,
where TW_1 is the Tracy-Widom law (Johnstone, 2000).
Universality phenomenon: the results go beyond the Gaussian case (Tao and Vu, 2009; Pillai and Yin, 2011).

Correlation Matrix

Johnstone (2001): correlation matrices for PCA.

Theorem (Pillai and Yin, 2012, AoS)
The largest eigenvalue of the sample correlation matrix is still inconsistent.
All of the problems from covariance matrices persist.

Understanding Asymptotics

20th century: n \to \infty. Now: both p, n \to \infty.
Why should we bother? Because the above asymptotics are remarkably accurate for 'small' n and 'small' p!

[Figure: histogram of the max eigenvalue of the sample covariance matrix, n = 100, p = 25]

[Figure: histogram of the max eigenvalue of the sample covariance matrix, n = 500, p = 125]

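A sketch that reproduces histograms of this kind for the two (n, p) pairs above (the replicate count and bin choices are illustrative assumptions):

```python
# The distribution of the largest sample-covariance eigenvalue concentrates
# near (1 + sqrt(c))^2 = 2.25 (here c = 1/4) already for modest (n, p).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def max_eig(n, p, reps=1000):
    out = np.empty(reps)
    for r in range(reps):
        Y = rng.standard_normal((n, p))
        out[r] = np.linalg.eigvalsh(Y.T @ Y / n)[-1]
    return out

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (n, p) in zip(axes, [(100, 25), (500, 125)]):
    ax.hist(max_eig(n, p), bins=30)
    ax.axvline((1 + np.sqrt(p / n)) ** 2, color="red")  # MP edge
    ax.set_title(f"n={n}, p={p}")
    ax.set_xlabel("Max eigenvalue of sample covariance matrix")
plt.show()
```
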
Factor Models: Motivation

Interest in estimating dependence in high-dimensional observations, plus prediction and classification from high-dimensional correlated markers such as gene expression and SNPs.
Center the prior on a "sparse" structure, while allowing uncertainty and flexibility.
Latent factor methods (West, 2003; Lucas et al., 2006; Carvalho et al., 2008).
Huge range of applications (economics, finance, signal processing, ...).

Gaussian Factor Models

Explain dependence through shared dependence on fewer latent factors:
  y_i \sim N(0, \Sigma_{p \times p}), \quad 1 \le i \le n.
Focus on the case p = p_n \gg n.
Factor models assume the "decomposition"
  \Sigma = \Lambda \Lambda^T + \sigma^2 I_p,
where \Lambda is a p \times k matrix, k \ll n.

Gaussian Factor Models

Equivalently, in terms of latent factors:
  y_i = \mu + \Lambda \eta_i + \epsilon_i, \quad i = 1, \ldots, n,
where
  \mu \in R^p is a vector of means, taken here to be \mu = 0;
  \eta_i \in R^k are latent factors;
  \Lambda is a p \times k matrix of factor loadings, with k \ll p;
  \epsilon_i are i.i.d. N_p(0, \sigma^2 I_p).

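A sketch generating data from this model (the dimensions n, p, k and a dense Lambda are illustrative assumptions, not from the slides):

```python
# Gaussian factor model: y_i = Lambda eta_i + eps_i with
# eta_i ~ N_k(0, I) and eps_i ~ N_p(0, sigma^2 I), so that marginally
# y_i ~ N_p(0, Lambda Lambda^T + sigma^2 I).
import numpy as np

rng = np.random.default_rng(2)
n, p, k, sigma = 200, 1000, 5, 1.0

Lam = rng.standard_normal((p, k))         # factor loadings (dense here)
eta = rng.standard_normal((n, k))         # latent factors
eps = sigma * rng.standard_normal((n, p)) # idiosyncratic noise
Y = eta @ Lam.T + eps                     # n x p data matrix (mu = 0)

Sigma0 = Lam @ Lam.T + sigma ** 2 * np.eye(p)  # implied covariance
```
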
Factor Models for Covariance Estimation

An unstructured \Sigma has O(p^2) free elements.
Factor models: \Sigma = \Lambda \Lambda^T + \sigma^2 I_p.
Only O(p) elements to estimate!

High-Dimensional Covariance Estimation

'Frequentist' solution: the MLE doesn't work. Start with the sample covariance matrix
  \hat{\Sigma}_{sample} = \frac{1}{n} \sum_{i=1}^{n} y_i y_i^T.
Great interest in regularized estimation (Bickel & Levina, 2008a, b; Wu and Pourahmadi, 2010; Cai and Liu, 2011; ...).
A thresholding estimator achieves the 'minimax' rate:
  \hat{\Sigma}_{ij} = \hat{\Sigma}_{sample, ij} \, 1\{|\hat{\Sigma}_{sample, ij}| > t_n\}.
But it is unstable, and confidence intervals are difficult.

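A sketch of this hard-thresholding estimator (the default threshold below has the typical order \sqrt{\log p / n}, but the constant is an illustrative assumption, not the calibrated value from Bickel & Levina or Cai & Liu):

```python
# Hard-thresholding estimator: Sigma_hat_ij = S_ij * 1{|S_ij| > t_n}.
import numpy as np

def threshold_cov(Y, t=None):
    """Y is an n x p mean-zero data matrix; t is the threshold t_n."""
    n, p = Y.shape
    S = Y.T @ Y / n                   # sample covariance
    if t is None:
        t = np.sqrt(np.log(p) / n)   # typical order of the threshold
    return S * (np.abs(S) > t)       # zero out small entries
```
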
Sparse Factor Modeling

A natural Bayesian alternative: sparse factor modeling (West, 2003); also Lucas et al. (2006), Carvalho et al. (2008) and many others.
Allow zeros in the loadings through point-mass mixture priors: each \Lambda_{ij} is given a point-mass or shrinkage prior, so the prior assigns \Lambda_{ij} = 0 with non-zero probability.
Why care about this prior? It is a Bayesian analogue of thresholding.
Assume k is known (this is easy to relax).

Important Questions

Can Bayes methods produce estimators comparable to frequentist estimators?
Can one do the computation in reasonable time?
How should we address the trade-off between statistical efficiency and computational efficiency?

Our Objective

The Bayesian counterpart lacks a theoretical framework in terms of posterior convergence rates.
A prior \Pi(\Lambda) \otimes \Pi(\sigma^2) induces a prior distribution \Pi(\Sigma).
How does the posterior behave, assuming the data are sampled from a fixed truth?
There is a huge literature on frequentist properties of posterior distributions.

Questions to Be Addressed

Does the posterior measure concentrate around the truth increasingly with sample size?
What role does the prior play?
How does the dimensionality affect the rate of contraction?

Preliminaries

We consider the operator norm \| \cdot \|_2:
  \|A\|_2 = \sup_{\|x\|_2 = 1} \|Ax\|_2 = s_1(A),
the largest singular value of A; for symmetric A, this is the largest eigenvalue in absolute value.

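A quick numerical check of this definition (the matrix A is an arbitrary illustrative example):

```python
# The operator norm ||A||_2 is the largest singular value; for symmetric A
# it coincides with the largest eigenvalue in absolute value.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
op_norm = np.linalg.norm(A, 2)                     # largest singular value
sym_check = np.max(np.abs(np.linalg.eigvalsh(A)))  # same for symmetric A
assert np.isclose(op_norm, sym_check)
```
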
Setup

We observe i.i.d. y_1, \ldots, y_n \sim N_{p_n}(0, \Sigma_{0n}) and set y^{(n)} = (y_1, \ldots, y_n), with
  \Sigma_{0n} = \Lambda_0 \Lambda_0^T + \sigma^2 I_{p_n}.
We want to find a minimal sequence \epsilon_n \to 0 such that
  \lim_{n \to \infty} \Pi(\|\Sigma - \Sigma_{0n}\|_2 > \epsilon_n \mid y^{(n)}) = 0 in probability under the truth.
Can we find such an \epsilon_n even if p_n \gg n? What is the role of the prior?

Assumptions on the Truth

A "realistic assumption":
(A1) Sparsity: each column of \Lambda_{0n} has at most s_n non-zero entries, with s_n = O(\log p_n).

Prior Choice & a Key Result

Prior (PL): Let
  \Lambda_{ij} \sim (1 - \pi) \delta_0 + \pi g(\cdot), \quad \pi \sim Beta(1, p_n + 1),
where g(\cdot) has Laplace-like or heavier tails.

Theorem (Pati, Bhattacharya, Pillai and Dunson, 2012)
For the high-dimensional factor model, with r_n = \sqrt{\log^7(p_n)/n},
  \lim_{n \to \infty} \Pi(\|\Sigma - \Sigma_0\|_2 > r_n \mid y^{(n)}) = 0.

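A sketch drawing loadings from the prior (PL) (taking g to be a standard Laplace density is one illustrative choice satisfying the tail condition; the dimensions are assumptions):

```python
# Point-mass mixture prior (PL):
#   Lambda_ij ~ (1 - pi) delta_0 + pi g(.),  pi ~ Beta(1, p_n + 1).
import numpy as np

rng = np.random.default_rng(3)
p, k = 1000, 5

pi = rng.beta(1, p + 1)                   # sparsity weight, small w.h.p.
nonzero = rng.random((p, k)) < pi         # entries that escape delta_0
Lam = np.where(nonzero, rng.laplace(size=(p, k)), 0.0)

print("fraction of non-zero loadings:", nonzero.mean())
# Beta(1, p+1) puts most of its mass near 0, so draws of Lambda are
# very sparse, matching assumption (A1) on the truth.
```
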
Implication of the Result

Rate: \epsilon_n = \sqrt{\log^7(p_n)/n}.
We get consistency if
  \lim_{n \to \infty} \frac{\log^7 p_n}{n} = 0,
i.e. in ultra-high dimensions, up to p_n = e^{n^{1/7}}.

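A tiny numerical check of this boundary case (illustrative values of n):

```python
# At the boundary p_n = exp(n^(1/7)), log^7(p_n)/n equals 1 for every n,
# so any slower growth of p_n yields consistency.
import numpy as np

for n in [10**3, 10**6, 10**9]:
    p_boundary = np.exp(n ** (1 / 7))
    print(n, np.log(p_boundary) ** 7 / n)   # = 1.0 at the boundary
```
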
Important Implication for Asymptotics

The rate we get is similar to the minimax rate for related problems (Cai and Zhou, 2011), but not the same:
  r_n = minimax rate \times \sqrt{\log p_n}.
This phenomenon is similar to what happens in mixture modeling: Ghosal (2001) showed that Bayesian nonparametric modeling doesn't match frequentist rates.
If true, this has serious implications.

A Couple of Implications

Minimax theory will tell only half the story.
Heuristics based on Bayes. BIC?
Frequentist-Bayes agreement/disagreement?