
Introduction Model Computation Results Conclusion

Scalable MCMC for Bayes Shrinkage Priors

Paulo Orenstein

July 2, 2018

Stanford University

Joint work with James Johndrow and Anirban Bhattacharya

Paulo Orenstein Scalable MCMC for Bayes Shrinkage Priors Stanford University 1 / 16

Introduction

◮ Consider the high-dimensional setting: predict a vector y ∈ R^n from a set of features X ∈ R^{n×p}, with p ≫ n.
◮ Assume a sparse Gaussian linear model y = Xβ + ε, ε ∼ N(0, σ²I_n), with β_j = 0 for many j.
◮ How can we perform prediction and inference?
  • Lasso, but: it is a convex relaxation, and a single parameter controls both sparsity and shrinkage.
  • Point-mass mixture prior, but: computation is prohibitive.

Introduction

◮ Can we find a continuous prior that behaves like the point-mass mixture prior?
◮ Desiderata:
  • adaptive to sparsity
  • easy to compute
  • good predictive performance
  • good frequentist properties
  • a decent compromise between statistical and computational goals
◮ Global-local priors can achieve this (with some qualifications).
◮ But... they are still slow: Lasso routinely handles n ≈ 1,000, p ≈ 1,000,000, while global-local priors handle n ≈ 1,000, p ≈ 1,000.

Model

◮ The Horseshoe model*:

    y_i | β, λ, τ, σ²  ~  N(x_i β, σ²),   independently over i
    β_j | λ_j, τ       ~  N(0, τ² λ_j²),  independently over j
    λ_j                ~  Cauchy⁺(0, 1),  independently over j
    τ                  ~  Cauchy⁺(0, 1)
    σ²                 ~  InvGamma(a₀/2, b₀/2)

*[Carvalho et al., 2010]
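To make the hierarchy concrete, here is a minimal NumPy sketch of one joint draw from the prior above (not the authors' code; the hyperparameter values `a0`, `b0` are illustrative, and half-Cauchy draws are taken as absolute values of standard Cauchy variates):

```python
import numpy as np

rng = np.random.default_rng(1)

def horseshoe_prior_draw(p, a0=1.0, b0=1.0):
    """One joint draw of (sigma^2, tau, lambda, beta) from the Horseshoe prior."""
    sigma2 = 1.0 / rng.gamma(a0 / 2.0, 2.0 / b0)   # InvGamma(a0/2, b0/2)
    tau = np.abs(rng.standard_cauchy())            # Cauchy+(0, 1), global scale
    lam = np.abs(rng.standard_cauchy(p))           # Cauchy+(0, 1), iid local scales
    beta = rng.normal(0.0, tau * lam)              # N(0, tau^2 * lambda_j^2)
    return sigma2, tau, lam, beta
```

The heavy Cauchy tails of the local scales are what let a few β_j escape shrinkage while the rest are pulled hard toward zero.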

Model

◮ Horseshoe has other good frequentist properties.
◮ It achieves the minimax-adaptive risk for squared-error loss up to a constant.
◮ Suppose X = I and ‖β‖₀ ≤ s_n. Then [van der Pas et al., 2014]

    sup_{β: ‖β‖₀ ≤ s_n} E‖β̂_HS − β‖₂² ≤ 4σ² s_n log(n/s_n) · (1 + o(1)),

while, for any estimator β̂, [Donoho et al., 1992] shows

    sup_{β: ‖β‖₀ ≤ s_n} E‖β̂ − β‖₂² ≥ 2σ² s_n log(n/s_n) · (1 + o(1)).

Computation

◮ State of the art: sample (i) τ | β, σ², λ; (ii) (β, σ²) | τ, λ; (iii) λ by slice sampling. But...
◮ We scale the model with two ideas.
◮ First idea: block (β, σ², τ) to improve mixing:
  1. sample (β, σ², τ) | λ by block sampling: τ | λ, then σ² | τ, λ, and finally β | σ², τ, λ;
  2. sample λ | β, σ² using slice sampling.
◮ Second idea: truncate some of the matrices involved to reduce the computational cost per step.

Gibbs sampling

Let M = X diag(ξη)⁻¹ Xᵀ + I, with ξ = τ⁻² and η_j = λ_j⁻², and block update:

◮ p(τ | λ, y) ∝ (1/(√ξ (1+ξ))) |M|^{-1/2} (yᵀM⁻¹y + b₀)^{-(n+a₀)/2}
◮ σ² | τ, λ, y ~ InvGamma((n+a₀)/2, (yᵀM⁻¹y + b₀)/2)
◮ β | σ², τ, λ, y ~ N((XᵀX + diag(ξη))⁻¹Xᵀy, σ²(XᵀX + diag(ξη))⁻¹)

Then perform slice sampling for λ | β, σ², τ, y:
  (i) u | η_j ~ Unif(0, 1/(1+η_j));
  (ii) η_j | u ∝ exp(−ξβ_j²η_j/(2σ²)) · 1[η_j < (1−u)/u].
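A minimal NumPy sketch of one pass of the blocked update above (not the authors' implementation; dense linear algebra for small n, p only; the τ | λ draw, a one-dimensional non-standard density, is omitted and ξ = τ⁻² is held fixed):

```python
import numpy as np

rng = np.random.default_rng(0)

def blocked_gibbs_step(X, y, eta, xi, a0=1.0, b0=1.0):
    """Sample sigma^2 | tau, lambda and beta | sigma^2, tau, lambda,
    then slice-sample eta_j = lambda_j^{-2}."""
    n, p = X.shape
    prior_prec = xi * eta                              # diag(xi * eta)
    M = X @ (X.T / prior_prec[:, None]) + np.eye(n)    # X diag(xi*eta)^{-1} X^T + I
    quad = y @ np.linalg.solve(M, y)                   # y^T M^{-1} y
    # sigma^2 | tau, lambda, y ~ InvGamma((n + a0)/2, (quad + b0)/2)
    sigma2 = 1.0 / rng.gamma((n + a0) / 2.0, 2.0 / (quad + b0))
    # beta | sigma^2, ... ~ N(A^{-1} X^T y, sigma^2 A^{-1}), A = X^T X + diag(xi*eta)
    A = X.T @ X + np.diag(prior_prec)
    L = np.linalg.cholesky(A)
    mean = np.linalg.solve(A, X.T @ y)
    beta = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))
    # slice sampler: u ~ Unif(0, 1/(1+eta_j)), then Exp(rate) truncated to (0, (1-u)/u)
    u = rng.uniform(0.0, 1.0 / (1.0 + eta))
    rate = xi * beta**2 / (2.0 * sigma2)
    upper = (1.0 - u) / u
    v = rng.uniform(size=p)
    eta = -np.log1p(-v * (1.0 - np.exp(-rate * upper))) / rate  # inverse-CDF draw
    return beta, sigma2, eta
```

Drawing β via a Cholesky solve against Aᵀ gives exactly the N(A⁻¹Xᵀy, σ²A⁻¹) law; at scale one would instead use fast samplers that exploit p ≫ n.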

Markov approximation

◮ We approximate M = X diag((ξη_j)⁻¹) Xᵀ + I with M_δ = X D_δ Xᵀ + I, where D_δ = diag((ξη_j)⁻¹ · 1[(ξ_max η_j)⁻¹ > δ]) for δ ≪ 1, and ξ_max is the maximum of the current and proposed ξ.
◮ This makes computation much faster.

Approximating Kernels

Let P_δ(x, ·) and P(x, ·) denote the Markov operators for the approximate and exact algorithms, with x = (β, σ², τ, λ) the entire state vector. Then, for sufficiently small δ > 0,

    sup_x ‖P_δ(x, ·) − P(x, ·)‖_TV ≤ √δ ‖X‖ ((a + n + a₀)/b₀ + (n/2)·‖y‖²/b₀) + O(δ).
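The truncation can be sketched as follows (a hypothetical helper, not the authors' code): coordinates whose scale falls below the threshold are dropped from D_δ, so only the surviving columns of X enter the product, and the cost per step scales with the active set rather than with p.

```python
import numpy as np

def M_delta(X, xi, eta, xi_max, delta):
    """Truncated approximation M_delta = X D_delta X^T + I."""
    active = 1.0 / (xi_max * eta) > delta    # indicator inside D_delta
    Xa = X[:, active]                        # surviving columns only
    d = 1.0 / (xi * eta[active])             # (xi * eta_j)^{-1} on the active set
    return Xa @ (Xa.T * d[:, None]) + np.eye(X.shape[0]), active
```

With δ = 0 this reproduces the exact M; as δ grows, more coordinates drop out and M_δ shrinks toward the identity.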

Simulation

◮ We simulate data as follows:

    x_i ~iid N_p(0, Σ),    y_i ~ N(x_i β, 4),
    β_j = 2^{−(j/4 − 9/4)} if j < 24,    β_j = 0 if j ≥ 24.

◮ There are nulls, clear non-nulls, and some subtle non-nulls.
◮ We consider both Σ = I (independent design) and Σ_ij = 0.9^{|i−j|} (correlated design).
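The design above can be sketched with a small generator (hypothetical helper `simulate`; small n, p for illustration, with β_j indexed from 1 as on the slide):

```python
import numpy as np

def simulate(n=100, p=200, rho=0.0, seed=0):
    """Draw (X, y, beta): x_i ~ N_p(0, Sigma), y_i ~ N(x_i beta, 4),
    beta_j = 2^{-(j/4 - 9/4)} for j < 24 and 0 otherwise."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :]) if rho else np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    j = np.arange(1, p + 1)                         # 1-indexed coordinates
    beta = np.where(j < 24, 2.0 ** (-(j / 4 - 9 / 4)), 0.0)
    y = X @ beta + rng.normal(0.0, 2.0, size=n)     # sd = 2, i.e. variance 4
    return X, y, beta
```

The signal decays geometrically from β₁ = 4 down to β₂₃ ≈ 0.09, which is what produces the clear, subtle, and null coordinates.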

Autocorrelation

[Figure: autocorrelation for log(ξ) = −2 log τ]

Effective samples per second

◮ The approximate algorithm is 50× more efficient with n = 2,000 and p = 20,000.

Accuracy

◮ Existing algorithms failed to converge due to numerical underflow.

[Figure: trace plots for −2 log(σ) and log(ξ) = −2 log(τ); truth in red]

Accuracy

◮ In terms of MSE, the approximation costs us little.

Dependence on p and n

◮ Effective sample sizes seem independent of n and p.

Real application: GWAS

◮ n = 2,267 observations, p = 98,385 SNPs in the genome of maize.
◮ X: maize seeds; y: growing degree days to silking ('growth cycle').

[Figure: bimodal posterior distribution for β | y; Lasso (red) shrinks more than Horseshoe (blue)]

Variable selection with Horseshoe

[Figure: number of variables with β̂_HS,j = E[β_j | y] > t or β̂_Lasso,j > t versus threshold t; both methods largely agree on the identities of the signals]
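The selection-count comparison in that figure amounts to counting coefficients above a threshold; a sketch (hypothetical helper `selection_curve`, where `beta_hat` would be Horseshoe posterior means or Lasso coefficients):

```python
import numpy as np

def selection_curve(beta_hat, thresholds):
    """Number of coordinates with |beta_hat_j| > t, for each threshold t."""
    beta_hat = np.abs(np.asarray(beta_hat, dtype=float))
    return np.array([(beta_hat > t).sum() for t in thresholds])
```

Plotting these counts for both methods against t gives the agreement curve shown on the slide.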

Conclusion

◮ There is no point in having a great model, like the Horseshoe, if it can't be computed.
◮ There is a need to scale more Bayesian models to the level of frequentist methods.
◮ We manage to do that for the Horseshoe prior with two ideas: blocking and truncation.
◮ We observed interesting and novel statistical phenomena, e.g., the bimodal posterior of β.
◮ There is likely more room for improvement.

References

◮ Bhattacharya, Anirban, et al. "Bayesian shrinkage." arXiv preprint arXiv:1212.6088 (2012).
◮ Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. "The horseshoe estimator for sparse signals." Biometrika 97.2 (2010): 465-480.
◮ Johndrow, James E., and Jonathan C. Mattingly. "Error bounds for approximations of Markov chains." arXiv preprint arXiv:1711.05382 (2017).
◮ Johndrow, James E., Paulo Orenstein, and Anirban Bhattacharya. "Scalable MCMC for Bayes Shrinkage Priors." arXiv preprint arXiv:1705.00841 (2018).
◮ Rudolf, Daniel, and Nikolaus Schweizer. "Perturbation theory for Markov chains via Wasserstein distance." Bernoulli 24.4A (2018): 2610-2639.
◮ Van der Pas, S. L., B. J. K. Kleijn, and A. W. van der Vaart. "The horseshoe estimator: Posterior concentration around nearly black vectors." Electronic Journal of Statistics 8.2 (2014): 2585-2618.

Extra slides

◮ More simulation results
◮ Why "Horseshoe"?

More simulations

◮ We let n = 1,000 and p = 20,000.
◮ The new algorithm leads to a significant improvement in the autocorrelation:

[Figures: additional simulation results]

Why "Horseshoe"?

◮ In the orthogonal case with n ≥ p and σ² = τ = 1, defining the shrinkage profile κ_j = 1/(1 + nλ_j²), we can write E[β_j | y] = (1 − E[κ_j | y]) β̂_j, where β̂_j is the least-squares estimate.
◮ Prior density for κ_j:

[Figure: prior density of κ_j, which is U-shaped (high near 0 and 1), like a horseshoe]
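For intuition behind the name (a sketch, assuming the simplest case with n = 1 so that κ_j = 1/(1 + λ_j²)): a change of variables shows the half-Cauchy prior on λ_j induces a Beta(1/2, 1/2) density on κ_j, whose U shape resembles a horseshoe.

```latex
% Change of variables from \lambda to \kappa = 1/(1+\lambda^2):
p(\lambda) = \frac{2}{\pi(1+\lambda^2)}, \qquad
\lambda(\kappa) = \sqrt{\frac{1-\kappa}{\kappa}}, \qquad
\left|\frac{d\lambda}{d\kappa}\right|
  = \frac{1}{2\kappa^2}\sqrt{\frac{\kappa}{1-\kappa}} .
% Substituting 1+\lambda^2 = 1/\kappa:
p(\kappa) = \frac{2\kappa}{\pi}\cdot\frac{1}{2\kappa^2}
            \sqrt{\frac{\kappa}{1-\kappa}}
          = \frac{1}{\pi\sqrt{\kappa(1-\kappa)}}
          = \mathrm{Beta}\!\left(\tfrac12,\tfrac12\right).
```

Mass piles up near κ = 0 (no shrinkage, signals survive) and near κ = 1 (total shrinkage, nulls vanish), which is exactly the behavior one wants from a sparsity prior.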