Efficiency of Bayesian procedures in some high dimensional problems

Outline Example 1 key issues Dirichlet-Laplace prior Example 2
Efficiency of Bayesian procedures in some high dimensional problems
Natesh S. Pillai, Dept. of Statistics, Harvard University, pillai@fas.harvard.edu
May 16, 2013, DIMACS Workshop

Joint Work: Collaborators
Anirban Bhattacharya, Debdeep Pati and David Dunson (Duke University and Florida State)
Christian Robert, Jean-Michel Marin, Judith Rousseau (Paris 9)
Jun Yin (University of Wisconsin)

Outline
Goal: Understand Bayesian methods in high dimensions.
Example 1: Covariance matrix estimation
Example 2: Bayesian model choice via ABC
Implications, Frequentist-Bayes connection in high dimensions.

Conversation with Peter E. Huybers
Motivation: Time variability in covariance patterns: stationarity?
Instrumental measurements, only for the past $n = 150$ years.
Measurements on $p = 2000$ latitude-longitude points.
Estimate $O(p^2)$ parameters. Need judicious modeling.

Covariance Matrix Estimation: Why Shrinkage?
We observe i.i.d. $y_1, \dots, y_n \sim N_{p_n}(0, \Sigma_{0n})$ and set $y^{(n)} = (y_1, \dots, y_n)$.
For $p_n = p$ fixed, the sample covariance estimator $\hat\Sigma_{\text{sample}} = \frac{1}{n} \sum_{i=1}^{n} y_i y_i^T$ is consistent.
Its eigenvalues $\hat\lambda_i$ are consistent for the population eigenvalues: $\sqrt{n}\,(\hat\lambda_i - \lambda_i) \Rightarrow N(0, V(\lambda_i))$.

Covariance Matrix in high dimensions
Simplest case: $\Sigma_{0n} = I$. Take $p = p_n = cn$, $c \in (0, 1)$.
$\hat\lambda_1$, $\hat\lambda_{p_n}$: largest and smallest (non-zero) eigenvalues of $\hat\Sigma_{\text{sample}} = \frac{1}{n} \sum_{i=1}^{n} y_i y_i^T$.
Then as $n \to \infty$ (and thus $p_n$ also grows), almost surely (Marcenko-Pastur, 1967):
$\lim_{n \to \infty} \hat\lambda_1 = (1 + \sqrt{c})^2$
$\lim_{n \to \infty} \hat\lambda_{p_n} = (1 - \sqrt{c})^2$
MLE is not consistent!
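The inconsistency can be checked numerically. The following is a minimal sketch, assuming NumPy; the sizes `n = 400`, `c = 0.25`, the seed, and the helper name `extreme_sample_eigenvalues` are illustrative choices, not part of the talk. With $\Sigma_{0n} = I$ every population eigenvalue equals 1, yet the extreme sample eigenvalues sit near $(1 \pm \sqrt{c})^2$.

```python
import numpy as np

def extreme_sample_eigenvalues(n, p, seed=0):
    """Largest and smallest eigenvalues of the sample covariance when Sigma = I."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, p))      # rows are y_i ~ N_p(0, I)
    S = Y.T @ Y / n                      # (1/n) * sum_i y_i y_i^T
    eigvals = np.linalg.eigvalsh(S)      # returned in ascending order
    return eigvals[-1], eigvals[0]

n, c = 400, 0.25                         # illustrative sizes (assumption)
p = int(c * n)
lam_max, lam_min = extreme_sample_eigenvalues(n, p)
upper_edge = (1 + np.sqrt(c)) ** 2       # limit of the largest eigenvalue
lower_edge = (1 - np.sqrt(c)) ** 2       # limit of the smallest eigenvalue
print("largest :", lam_max, "limit:", upper_edge)
print("smallest:", lam_min, "limit:", lower_edge)
```

Both printed eigenvalues should land close to their limits and far from the true value 1.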

Covariance Matrix in high dimensions
$\lim_{n \to \infty} \hat\lambda_1 = (1 + \sqrt{c})^2 = \lambda_+$.
Confidence interval: $n^{2/3} (\hat\lambda_1 - \lambda_+) \Rightarrow TW_1$, where $TW_1$ is the Tracy-Widom law (Johnstone 2000).
Universality phenomenon: Results go beyond the Gaussian case (Tao and Vu, 2009; P. and Yin, 2011).

Correlation Matrix
Johnstone (2001): Correlation matrices for PCA.
Theorem (P. and Yin, 2012, AoS): The largest eigenvalue of sample correlation matrices is still inconsistent.
All of the problems from covariance matrices persist.

Understanding Asymptotics
20th century: $n \to \infty$.
Now: both $p, n \to \infty$.
Why should we bother? Because the above asymptotics are remarkably accurate for 'small' $n$, 'small' $p$!

Sample covariance matrix plot, n = 100, p = 25
[Figure: histogram of the max eigenvalue of the sample covariance matrix, n = 100, p = 25; x-axis: Max Eigenvalue of Sample Covariance Matrix, y-axis: Frequency]

Sample covariance matrix plot, n = 500, p = 125
[Figure: histogram of the max eigenvalue of the sample covariance matrix, n = 500, p = 125; x-axis: Max Eigenvalue of Sample Covariance Matrix, y-axis: Frequency]
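Histograms of this kind can be reproduced with a short Monte Carlo loop. This is a sketch under assumptions: the data are drawn with $\Sigma = I$, and the replication count `reps = 500`, the seed, and the function name `max_eigenvalue_draws` are hypothetical, since the slides do not state how the plots were generated.

```python
import numpy as np

def max_eigenvalue_draws(n, p, reps, seed=0):
    """Largest sample-covariance eigenvalue over repeated N_p(0, I) samples."""
    rng = np.random.default_rng(seed)
    draws = np.empty(reps)
    for r in range(reps):
        Y = rng.standard_normal((n, p))               # fresh sample each replication
        draws[r] = np.linalg.eigvalsh(Y.T @ Y / n)[-1]  # largest eigenvalue
    return draws

draws = max_eigenvalue_draws(n=100, p=25, reps=500)   # reps is an assumption
edge = (1 + np.sqrt(25 / 100)) ** 2                   # asymptotic limit for c = 0.25
counts, bin_edges = np.histogram(draws, bins=10)      # histogram counts and bin edges
print("mean:", draws.mean(), "std:", draws.std(), "limit:", edge)
```

Even at n = 100 and p = 25 the draws cluster around the asymptotic edge rather than around the true eigenvalue 1.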

Factor Models: Motivation
Interest in estimating dependence in high-dimensional observations, plus prediction and classification from high-dimensional correlated markers such as gene expression and SNPs.
Center the prior on a "sparse" structure, while allowing uncertainty and flexibility.
Latent factor methods (West, 2003; Lucas et al., 2006; Carvalho et al., 2008).
Wide range of applications (economics, finance, signal processing, ...).

Gaussian factor models
Explain dependence through shared dependence on fewer latent factors.
$y_i \sim N(0, \Sigma_{p \times p})$, $1 \le i \le n$. Focus on the case $p = p_n \gg n$.
Factor models assume the "decomposition" $\Sigma = \Lambda \Lambda^T + \sigma^2 I_p$.
$\Lambda$ is a $p \times k$ matrix, $k \ll n$.

Gaussian factor models
Explain dependence through shared dependence on fewer latent factors.
$y_i = \mu + \Lambda \eta_i + \epsilon_i$, $\epsilon_i \sim N_p(0, \sigma^2 I_p)$, $i = 1, \dots, n$.
$\mu \in R^p$, a vector of means, with $\mu = 0$.
$\eta_i \in R^k$, latent factors; $\Lambda$ a $p \times k$ matrix of factor loadings with $k \ll p$.
The entries of $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$.
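To see how this model produces the covariance $\Lambda \Lambda^T + \sigma^2 I_p$, here is a small simulation sketch. It assumes standard normal latent factors, $\eta_i \sim N_k(0, I_k)$, which the slide does not state explicitly; the sizes, the seed, and the randomly drawn loadings are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n, sigma = 6, 2, 50_000, 0.5           # illustrative sizes (assumption)

Lambda = rng.standard_normal((p, k))         # hypothetical factor loadings
eta = rng.standard_normal((n, k))            # latent factors, assumed N_k(0, I_k)
eps = sigma * rng.standard_normal((n, p))    # noise with i.i.d. N(0, sigma^2) entries
Y = eta @ Lambda.T + eps                     # rows are y_i = Lambda eta_i + eps_i, mu = 0

Sigma = Lambda @ Lambda.T + sigma**2 * np.eye(p)   # implied covariance
S = Y.T @ Y / n                                    # sample covariance
rel_err = np.linalg.norm(S - Sigma, 2) / np.linalg.norm(Sigma, 2)
print("relative operator-norm error:", rel_err)
```

Here p is small and n is large, so the sample covariance recovers the factor structure; the talk's interest is the opposite regime, $p \gg n$.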

Factor models for covariance estimation
Unstructured $\Sigma$ has $O(p^2)$ free elements.
Factor models: $\Sigma = \Lambda \Lambda^T + \sigma^2 I_p$.
Still $O(p)$ elements to estimate!

High-dimensional covariance estimation
'Frequentist' solution: MLE doesn't work.
Start with the sample covariance matrix: $\hat\Sigma_{\text{sample}} = \frac{1}{n} \sum_{i=1}^{n} y_i y_i^T$.
Great interest in regularized estimation (Bickel & Levina, 2008a, b; Wu and Pourahmadi, 2010; Cai and Liu, 2011; ...).
Estimator which achieves the 'minimax' rate: $\hat\Sigma_{ij} = (\hat\Sigma_{\text{sample}})_{ij} \, 1\{ |(\hat\Sigma_{\text{sample}})_{ij}| > t_n \}$.
Unstable; confidence intervals?
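The thresholding estimator above can be sketched in a few lines. The threshold $t_n = 2\sqrt{\log p / n}$ used here is an assumption for illustration (the slide leaves $t_n$ unspecified, and the constant 2 is arbitrary); the function name `threshold_covariance` is likewise hypothetical.

```python
import numpy as np

def threshold_covariance(S, t):
    """Hard thresholding: keep entries with |S_ij| > t, set the rest to zero."""
    return np.where(np.abs(S) > t, S, 0.0)

rng = np.random.default_rng(2)
n, p = 200, 50                            # illustrative sizes (assumption)
Y = rng.standard_normal((n, p))           # truth is Sigma = I, which is sparse
S = Y.T @ Y / n                           # sample covariance
t_n = 2.0 * np.sqrt(np.log(p) / n)        # assumed threshold level
S_thr = threshold_covariance(S, t_n)

err_sample = np.linalg.norm(S - np.eye(p), 2)     # operator-norm error, raw
err_thr = np.linalg.norm(S_thr - np.eye(p), 2)    # operator-norm error, thresholded
print("sample:", err_sample, "thresholded:", err_thr)
```

When the truth is sparse, zeroing the small entries removes most of the accumulated noise, so the thresholded error should be well below the raw one.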

Sparse factor modeling
A natural Bayesian alternative: sparse factor modeling (West, 2003); also Lucas et al. (2006), Carvalho et al. (2008) and many others.
Allow zeros in the loadings: $\Lambda_{ij}$ are given point mass mixture priors or shrinkage priors.
A point mass mixture prior assigns $\Lambda_{ij} = 0$ with non-zero probability.
Why care about this prior? It is the Bayesian analogue of thresholding.
Assume $k$ to be known (but this is easy to relax).

Important questions
Can Bayes methods produce estimators which are comparable to frequentist estimators?
Can one do the computation in reasonable time?
How to address the trade-off between statistical efficiency and computational efficiency?

Our objective
The Bayesian counterpart lacks a theoretical framework in terms of posterior convergence rates.
A prior $\Pi(\Lambda \otimes \sigma^2)$ induces a prior distribution $\Pi(\Omega)$.
How does the posterior behave assuming data sampled from a fixed truth?
Huge literature on frequentist properties of the posterior distribution.

Questions that need to be addressed
Does the posterior measure concentrate around the truth increasingly with sample size?
What role does the prior play?
How does the dimensionality affect the rate of contraction?

Preliminaries
We consider the operator norm $\| \cdot \|_2$:
$\| A \|_2 = \sup_{x \in S^{r-1}} \| A x \|_2 = s_{(1)}$, the largest singular value of $A$.
For symmetric $A$ this is the largest eigenvalue of $A$ in absolute value.
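As a quick check of these identities, the sketch below computes the operator norm three ways with NumPy on a small symmetric matrix chosen for illustration (the matrix itself is an assumption, picked so that the largest eigenvalue in absolute value is negative).

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, -3.0]])                           # symmetric, eigenvalues 2 and -3

op_norm = np.linalg.norm(A, 2)                        # operator (spectral) norm
largest_sv = np.linalg.svd(A, compute_uv=False)[0]    # singular values come sorted descending
largest_abs_eig = np.max(np.abs(np.linalg.eigvalsh(A)))  # valid because A is symmetric

print(op_norm, largest_sv, largest_abs_eig)
```

All three quantities agree and equal 3 for this matrix, whereas the largest signed eigenvalue is 2.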

Setup
We observe i.i.d. $y_1, \dots, y_n \sim N_{p_n}(0, \Sigma_{0n})$ and set $y^{(n)} = (y_1, \dots, y_n)$, with $\Sigma_{0n} = \Lambda_0 \Lambda_0^T + \sigma^2 I_{p_n \times p_n}$.
Want to find a minimal sequence $\epsilon_n \to 0$ such that
$\lim_{n \to \infty} P\left( \| \Sigma - \Sigma_{0n} \|_2 > \epsilon_n \mid y^{(n)} \right) = 0$.
Can we find such $\epsilon_n$ even if $p_n \gg n$?
What is the role of the prior?

Assumptions on truth
"Realistic Assumption":
(A1) Sparsity: Each column of $\Lambda_{0n}$ has at most $s_n$ non-zero entries, with $s_n = O(\log p_n)$.

Prior choice & a key result
Prior (PL): Let $\Lambda_{ij} \sim (1 - \pi) \delta_0 + \pi g(\cdot)$, $\pi \sim \text{Beta}(1, p_n + 1)$.
$g(\cdot)$ has Laplace-like or heavier tails.
Theorem (Pati, Bhattacharya, P. and Dunson, 2012): For the high-dimensional factor model, with $r_n = \sqrt{\log^7(p_n) / n}$,
$\lim_{n \to \infty} P\left( \| \Sigma - \Sigma_{0n} \|_2 > r_n \mid y^{(n)} \right) = 0$.
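A draw from a prior of the form (PL) can be sketched as follows. This assumes $g$ is a Laplace density with unit scale, one admissible choice among "Laplace-like or heavier" tails, and a single $\pi$ shared by all entries; the function name `sample_loadings` and the sizes are hypothetical.

```python
import numpy as np

def sample_loadings(p, k, scale=1.0, seed=0):
    """Draw a p x k loading matrix from a point-mass mixture prior with a Laplace slab."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(1.0, p + 1.0)                     # pi ~ Beta(1, p + 1)
    nonzero = rng.random((p, k)) < pi               # each entry is non-zero with probability pi
    slab = rng.laplace(0.0, scale, size=(p, k))     # g: Laplace slab (assumed choice)
    return np.where(nonzero, slab, 0.0), pi

Lambda, pi = sample_loadings(p=200, k=5)            # illustrative sizes (assumption)
print("pi:", pi)
print("fraction of exact zeros:", np.mean(Lambda == 0.0))
```

Because $\text{Beta}(1, p_n + 1)$ has mean $1 / (p_n + 2)$, most entries of a prior draw are exactly zero, which is the sense in which the prior mimics thresholding.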

Implication of the result
Rate $\epsilon_n = \sqrt{\log^7(p_n) / n}$.
We will get consistency if $\lim_{n \to \infty} \frac{\log^7 p_n}{n} = 0$.
Ultra-high dimensions: $p_n$ may grow almost as fast as $e^{n^{1/7}}$.
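To get a feel for the theorem's rate $\sqrt{\log^7(p_n) / n}$, the sketch below evaluates it for a few sizes. The function name `contraction_rate` is hypothetical, and the first call reuses $n = 150$, $p = 2000$ from the motivating climate example purely as an illustration.

```python
import math

def contraction_rate(n, p):
    """Evaluate sqrt(log(p)^7 / n), the rate r_n stated in the theorem."""
    return math.sqrt(math.log(p) ** 7 / n)

# Sizes from the motivating example: the rate is far from small here.
print(contraction_rate(150, 2000))

# On the boundary p_n = exp(n^(1/7)) the rate equals 1 and does not vanish:
# n = 128 gives n^(1/7) = 2, so log(p)^7 / n = 2^7 / 128 = 1.
print(contraction_rate(128, math.exp(2.0)))

# For fixed p the rate decreases as n grows.
print(contraction_rate(10**6, 2000), contraction_rate(10**8, 2000))
```

The rate is an asymptotic statement: it tends to zero only when $\log^7 p_n$ grows more slowly than $n$, which requires very large $n$ for moderate $p$.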

Important Implication for Asymptotics
The rate we get is similar to the minimax rate for similar problems (Cai and Zhou, 2011), but not the same!
$r_n = \text{minimax rate} \times \sqrt{\log p_n}$
The above phenomenon is similar to what happens in mixture modeling!
Ghosal (2001): Bayesian nonparametric modeling doesn't match frequentist rates.
If true: serious implications.

A couple of Implications
Minimax theory will tell only half the story.
Heuristics based on Bayes. BIC?
Frequentist-Bayes agreement/disagreement?
