Scalable MCMC for Bayes Shrinkage Priors

Paulo Orenstein, Stanford University
Joint work with James Johndrow and Anirban Bhattacharya
July 2, 2018
Introduction

◮ Consider the high-dimensional setting: predict a vector y ∈ R^n from a set of features X ∈ R^{n×p}, with p ≫ n.
◮ Assume a sparse Gaussian linear model

    y = Xβ + ε,    ε ∼ N(0, σ² I_n),

  with β_j = 0 for many j.
◮ How can we perform prediction and inference?
  - Lasso, but: it relies on a convex relaxation, and a single parameter controls both sparsity and shrinkage.
  - Point mass mixture prior, but: computation is prohibitive.
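To fix ideas, here is a minimal simulation of this sparse setup. This is a sketch: the sizes n and p and the sparsity level s are my own illustrative choices, not values from the talk.

```python
import numpy as np

# Simulate the sparse Gaussian linear model y = X beta + eps,
# with p >> n and only s nonzero coefficients in beta.
rng = np.random.default_rng(0)
n, p, s = 100, 1000, 10            # illustrative sizes, p >> n

X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 3.0 * rng.standard_normal(s)   # a few large signals
sigma = 1.0
y = X @ beta + sigma * rng.standard_normal(n)
```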
◮ Can we find a continuous prior that behaves like the point mass mixture prior?
◮ Desiderata:
  - adaptive to sparsity
  - easy to compute
  - good predictive performance
  - good frequentist properties
  - a decent compromise between statistical and computational goals
◮ Global-local priors can achieve this (with some qualifications).
◮ But... they are still slow. Lasso handles n ≈ 1,000, p ≈ 1,000,000; global-local priors only n ≈ 1,000, p ≈ 1,000.
Model

◮ The Horseshoe model [Carvalho et al., 2010]:

    y_i | β, λ, τ, σ²   ~ind   N(x_i β, σ²)
    β_j | λ_j, τ        ~ind   N(0, τ² λ_j²)
    λ_j                 ~ind   Cauchy⁺(0, 1)
    τ                   ~      Cauchy⁺(0, 1)
    σ²                  ~      InvGamma(a₀/2, b₀/2)
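A quick way to see why this continuous prior can mimic a point mass mixture is to draw coefficients from it: the half-Cauchy local scales produce both a spike of mass near zero and heavy tails. A minimal sketch, with τ fixed at 1 as an illustrative choice:

```python
import numpy as np

# Draws from the horseshoe prior on a single coefficient (tau = 1).
rng = np.random.default_rng(1)
tau = 1.0
lam = np.abs(rng.standard_cauchy(100_000))        # lambda_j ~ Cauchy+(0, 1)
beta = tau * lam * rng.standard_normal(100_000)   # beta_j | lambda_j ~ N(0, tau^2 lambda_j^2)

print(np.mean(np.abs(beta) < 0.1))   # large mass near zero (the "spike")
print(np.mean(np.abs(beta) > 10))    # non-negligible mass far out (heavy tails)
```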
◮ The Horseshoe has other good frequentist properties.
◮ It achieves the minimax-adaptive risk for squared error loss up to a constant.
◮ Suppose X = I and ‖β‖₀ = s_n. Then [van der Pas et al., 2014]

    sup_{β : ‖β‖₀ ≤ s_n} E_β ‖β̂_HS − β‖₂² ≤ 4σ² s_n log(n/s_n) · (1 + o(1)),

  while, for any estimator β̂, [Donoho et al., 1992] shows

    sup_{β : ‖β‖₀ ≤ s_n} E_β ‖β̂ − β‖₂² ≥ 2σ² s_n log(n/s_n) · (1 + o(1)).
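So the horseshoe's worst-case risk is within a factor of 2 of what any estimator can achieve. As a small numeric illustration (the values of σ, n, and s_n below are my own choices, not from the talk):

```python
import numpy as np

# The upper and lower bounds above, at sigma = 1, n = 1000, s_n = 10.
sigma, n, s_n = 1.0, 1000, 10
rate = s_n * np.log(n / s_n)
print(4 * sigma**2 * rate)   # horseshoe upper bound, ~184.2
print(2 * sigma**2 * rate)   # minimax lower bound,  ~92.1
```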
Computation

◮ State-of-the-art sampler: (i) draw τ | β, σ², λ; (ii) draw (β, σ²) | τ, λ as a block; (iii) slice sampling for λ. But...
◮ We scale the model with two ideas.
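To make step (ii) concrete: when p ≫ n, the β-draw in samplers of this kind is commonly done with the exact O(n²p) Gaussian sampling algorithm of Bhattacharya, Chakraborty, and Mallick (2016), which avoids forming the p × p posterior precision matrix. The sketch below is my own illustration of that routine; whether the talk's sampler uses exactly this implementation is an assumption.

```python
import numpy as np

# Hedged sketch (not necessarily the talk's code): draw
#   beta | y, lambda, tau, sigma^2 ~ N(A^{-1} Phi' alpha, A^{-1}),
# where A = Phi'Phi + D^{-1}, Phi = X / sigma, alpha = y / sigma,
# and D = tau^2 * diag(lambda^2), using only n x n linear algebra.
def sample_beta(X, y, lam, tau, sigma, rng):
    n, p = X.shape
    Phi = X / sigma
    alpha = y / sigma
    d = (tau * lam) ** 2                       # diagonal of D
    u = np.sqrt(d) * rng.standard_normal(p)    # u ~ N(0, D)
    delta = rng.standard_normal(n)             # delta ~ N(0, I_n)
    v = Phi @ u + delta
    # Solve the n x n system (Phi D Phi' + I_n) w = alpha - v.
    M = (Phi * d) @ Phi.T + np.eye(n)
    w = np.linalg.solve(M, alpha - v)
    return u + d * (Phi.T @ w)                 # beta = u + D Phi' w
```

Given such a β-draw, σ² can then be updated from its inverse-gamma conditional, completing the block.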