Basics of Bayesian Inference

- A frequentist thinks of unknown parameters as fixed.
- A Bayesian thinks of parameters as random, and thus as coming from distributions (just like the data).
- A Bayesian writes down a prior distribution for θ and combines it with the likelihood of the observed data Y to obtain the posterior distribution of θ. All statistical inference then follows from summarizing the posterior.
- This approach expands the class of candidate models and facilitates hierarchical modeling, where it is important to properly account for the various sources of uncertainty (e.g., spatial vs. nonspatial heterogeneity).
- In such settings the classical (frequentist) approach leads to awkward interpretations and struggles to account for all sources of uncertainty.
Basics of Bayesian Inference

- In the simplest form, we start with a model (distribution) for the data given the unknowns (parameters), f(y | θ).
- Since the data are observed, hence known, while θ is not, we equivalently view this as a function of θ given y and call it the likelihood, L(θ; y).
- We write the prior distribution for θ as π(θ). The joint model for the data and parameters is then f(y | θ) π(θ).
- Factoring in the opposite direction, the same joint model is π(θ | y) m(y). The first term is the posterior distribution of θ; the second is the marginal distribution of the data.
- Equating the two factorizations, we see that π(θ | y) ∝ f(y | θ) π(θ), with m(y) the normalizing constant.
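As a quick illustration of π(θ | y) ∝ f(y | θ) π(θ), here is a minimal sketch (not from the slides): it evaluates the unnormalized posterior on a grid and normalizes it numerically, which also makes the role of m(y) explicit. The Poisson likelihood, lognormal prior, and the hypothetical counts are my own illustrative choices.

```python
import numpy as np
from scipy import stats

# Minimal sketch: posterior = likelihood * prior, normalized numerically on a grid.
# The Poisson likelihood and lognormal prior are illustrative assumptions.
y = np.array([3, 5, 4, 6, 2])            # hypothetical count data
theta = np.linspace(0.01, 15.0, 2000)    # grid for the Poisson rate θ
dtheta = theta[1] - theta[0]

likelihood = np.array([stats.poisson.pmf(y, th).prod() for th in theta])  # f(y | θ)
prior = stats.lognorm.pdf(theta, s=1.0, scale=np.exp(1.0))                # π(θ)

unnorm = likelihood * prior              # f(y | θ) π(θ)
m_y = unnorm.sum() * dtheta              # marginal m(y): the normalizing constant
posterior = unnorm / m_y                 # π(θ | y)

print("posterior mean ≈", (theta * posterior).sum() * dtheta)
```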
Basics of Bayesian Inference

- More generally, we would have a prior distribution π(θ | λ), where λ is a vector of hyperparameters.
- In fact, we can think of θ even more generally as the "process" of interest, with some parts known and some parts unknown. Then we can write the joint model as

  f(y | process, θ) f(process, θ | λ) π(λ),  where f(process, θ | λ) = f(process | θ, λ) π(θ | λ),

  a hierarchical specification.
- If λ is known, the posterior distribution of θ is

  p(θ | y, λ) = p(y, θ | λ) / p(y | λ)
             = p(y, θ | λ) / ∫ p(y, θ | λ) dθ
             = f(y | θ) π(θ | λ) / ∫ f(y | θ) π(θ | λ) dθ
             = f(y | θ) π(θ | λ) / m(y | λ).
Basics of Bayesian Inference

- Since λ will typically not be known, a second-stage (hyperprior) distribution h(λ) is required, so that

  p(θ | y) = p(y, θ) / p(y) = ∫ f(y | θ) π(θ | λ) h(λ) dλ / ∫∫ f(y | θ) π(θ | λ) h(λ) dθ dλ.

- Alternatively, we might replace λ in p(θ | y, λ) by an estimate λ̂; this is called empirical Bayes analysis.
- In general, p(θ | y) ≠ π(θ). This change in the distribution of θ after observing the data is referred to as Bayesian learning.
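The integral over λ rarely has a closed form. A minimal sketch (my own construction, not from the slides) approximates it by Monte Carlo: draw λ from h(λ), average π(θ | λ) over the draws on a grid of θ values, multiply by the likelihood, and normalize. The normal likelihood, the zero-mean normal prior π(θ | λ) = N(θ | 0, λ), and the Exponential(1) hyperprior are all assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative assumptions: y | θ ~ N(θ, σ²), θ | λ ~ N(0, λ), λ ~ Exponential(1)
y, sigma = 2.5, 1.0
theta = np.linspace(-5.0, 8.0, 1000)
dtheta = theta[1] - theta[0]

lam = rng.exponential(scale=1.0, size=5000)           # draws from h(λ)

# For each grid point, average the conditional prior π(θ | λ) over the λ draws
prior_mix = np.array([stats.norm.pdf(t, 0.0, np.sqrt(lam)).mean() for t in theta])
unnorm = stats.norm.pdf(y, theta, sigma) * prior_mix  # f(y | θ) ∫ π(θ | λ) h(λ) dλ
posterior = unnorm / (unnorm.sum() * dtheta)          # p(θ | y)

print("posterior mean ≈", (theta * posterior).sum() * dtheta)
```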
Illustration of Bayes' Theorem

- Suppose f(y | θ) = N(y | θ, σ²), with θ ∈ ℜ and σ > 0 known.
- If we take π(θ | λ) = N(θ | µ, τ²), where λ = (µ, τ)′ is fixed and known, then it is easy to show that

  p(θ | y) = N( θ | (σ² / (σ² + τ²)) µ + (τ² / (σ² + τ²)) y ,  σ²τ² / (σ² + τ²) ).

- Note that:
  - the posterior mean E(θ | y) is a weighted average of the prior mean µ and the data value y, with weights determined by the relative uncertainty in each;
  - the posterior precision (reciprocal of the variance) equals 1/σ² + 1/τ², the sum of the likelihood and prior precisions.
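A minimal numerical check of the formula above (my own sketch, not from the slides): the closed-form posterior mean and variance agree with a brute-force grid computation; the particular values of y, σ, µ, and τ are assumptions.

```python
import numpy as np
from scipy import stats

# Assumed illustrative values; σ, µ, τ are treated as known
y, sigma, mu, tau = 6.0, 1.0, 2.0, 1.0

# Closed form: precision = 1/σ² + 1/τ², mean = precision-weighted average of µ and y
post_var = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
post_mean = post_var * (mu / tau**2 + y / sigma**2)

# Brute-force check on a grid
theta = np.linspace(-10.0, 15.0, 20001)
dt = theta[1] - theta[0]
unnorm = stats.norm.pdf(y, theta, sigma) * stats.norm.pdf(theta, mu, tau)
post = unnorm / (unnorm.sum() * dt)

grid_mean = (theta * post).sum() * dt
grid_var = (theta**2 * post).sum() * dt - grid_mean**2
print(post_mean, grid_mean)   # both ≈ 4.0
print(post_var, grid_var)     # both ≈ 0.5
```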
Illustration (continued)

- As a concrete example, let µ = 2, τ = 1, ȳ = 6, and σ = 1.

  [Figure: densities of the prior, the posterior with n = 1, and the posterior with n = 10, plotted against θ.]

- When n = 1, the prior and the likelihood receive equal weight.
- When n = 10, the data begin to dominate the prior.
- The posterior variance goes to zero as n → ∞.
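A short sketch (my own, not from the slides) reproducing the numbers behind the figure: with n observations and sample mean ȳ, the likelihood for θ is N(ȳ | θ, σ²/n), so the conjugate formula applies with σ² replaced by σ²/n.

```python
import numpy as np

# Slide values: µ = 2, τ = 1, ȳ = 6, σ = 1
mu, tau, ybar, sigma = 2.0, 1.0, 6.0, 1.0

def posterior(n):
    """Posterior mean and sd of θ given n observations with sample mean ȳ."""
    like_var = sigma**2 / n                          # variance of ȳ | θ
    post_var = 1.0 / (1.0 / like_var + 1.0 / tau**2)
    post_mean = post_var * (mu / tau**2 + ybar / like_var)
    return post_mean, np.sqrt(post_var)

for n in (1, 10):
    m, s = posterior(n)
    print(f"n = {n:2d}: posterior mean = {m:.2f}, sd = {s:.2f}")
# n = 1: mean 4.00 (prior and data weighted equally); n = 10: mean ≈ 5.64 (data dominate)
```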
Notes on priors

- The prior here is conjugate: it leads to a posterior distribution for θ that is a member of the same distributional family as the prior.
- Setting τ² = ∞ corresponds to an arbitrarily vague (noninformative) prior. The posterior is then

  p(θ | y) = N(θ | ȳ, σ²/n),

  the same as the likelihood!
- The limit of the conjugate (normal) prior here is a uniform (or "flat") prior, and thus the posterior is the normalized likelihood.
- The flat prior is improper, since ∫ p(θ) dθ = +∞. However, as long as the posterior is integrable, i.e., ∫_Θ f(y | θ) π(θ) dθ < ∞, an improper prior can be used!
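A small check of the flat-prior limit (assumed values, not from the slides): as τ² grows, the posterior mean and variance approach ȳ and σ²/n, the likelihood-only answer.

```python
# Assumed values for illustration: µ = 2, ȳ = 6, σ = 1, n = 10
mu, ybar, sigma, n = 2.0, 6.0, 1.0, 10

for tau in (1.0, 10.0, 1000.0):
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
    post_mean = post_var * (mu / tau**2 + n * ybar / sigma**2)
    print(f"tau = {tau:7.1f}: mean = {post_mean:.4f}, var = {post_var:.4f}")
# As tau grows, mean → ȳ = 6 and var → σ²/n = 0.1
```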
A linear model example

- Let Y be an n × 1 data vector, X an n × p matrix of covariates, and adopt the likelihood and prior structure

  Y | β ∼ N_n(Xβ, Σ)  and  β ∼ N_p(Aα, V).

- Then the posterior distribution of β | Y is

  β | Y ∼ N(Dd, D),  where  D⁻¹ = X′Σ⁻¹X + V⁻¹  and  d = X′Σ⁻¹Y + V⁻¹Aα.

- V⁻¹ = 0 delivers a "flat" prior; if, in addition, Σ = σ²I_n, we get

  β | Y ∼ N(β̂, σ²(X′X)⁻¹),  where  β̂ = (X′X)⁻¹X′Y,

  i.e., the usual likelihood analysis!
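A small sketch (my own construction; the data are simulated) computing the posterior N(Dd, D) above and confirming that with V⁻¹ = 0 and Σ = σ²I_n the posterior mean reduces to the least-squares estimate β̂ = (X′X)⁻¹X′Y. The prior mean, prior variance, and true coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (assumption for illustration)
n, p, sigma = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=sigma, size=n)

Sigma_inv = np.eye(n) / sigma**2

# Informative prior  β ~ N(Aα, V), with Aα = 0 and V = 10·I as assumed values
prior_mean = np.zeros(p)
V_inv = np.eye(p) / 10.0

D = np.linalg.inv(X.T @ Sigma_inv @ X + V_inv)       # posterior covariance D
d = X.T @ Sigma_inv @ y + V_inv @ prior_mean         # posterior "d" vector
print("posterior mean:", D @ d)

# Flat prior: V^{-1} = 0  =>  posterior mean equals the OLS estimate
D_flat = np.linalg.inv(X.T @ Sigma_inv @ X)
print("flat-prior mean:", D_flat @ (X.T @ Sigma_inv @ y))
print("OLS:            ", np.linalg.solve(X.T @ X, X.T @ y))
```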
More on priors

- How do we choose priors?
- Prior robustness; sensitivity to the prior
- Informative vs. noninformative priors
- Dangers with improper priors: appealing, but...
- There is always some prior information
- Prior elicitation
- Priors based upon previous experiments (previous posteriors can serve as current priors)
- Hyperpriors?
More on priors

- Back to conjugacy: if Y | µ ∼ N(µ, σ²) and µ ∼ N(µ₀, τ²), then marginally Y ∼ Normal and conditionally µ | y ∼ Normal.
- For vectors: if Y | µ ∼ N(µ, Σ) and µ ∼ N(µ₀, V), then marginally Y ∼ Normal and conditionally µ | y ∼ Normal.
- For variances: with Y | µ ∼ N(µ, σ²), if σ² ∼ IG(a, b), then σ² | y ∼ IG.
- Never use IG(ε, ε) for small ε: it is almost improper and, with variance components, leads to almost improper posteriors.
- A similar result holds for Σ, but with inverse Wishart distributions.
- Other conjugacies: Poisson with Gamma; Binomial with Beta.
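To make the "Binomial with Beta" conjugacy concrete, a minimal sketch (my own example, not from the slides): a Beta(a, b) prior combined with y successes in n Bernoulli trials gives a Beta(a + y, b + n − y) posterior. The values of a, b, n, and y are assumptions.

```python
from scipy import stats

# Assumed illustrative numbers
a, b = 2.0, 2.0          # Beta(a, b) prior on the success probability
n, y = 20, 14            # y successes in n trials

posterior = stats.beta(a + y, b + n - y)        # conjugate update
print("posterior mean:", posterior.mean())      # (a + y) / (a + b + n) = 16/24
print("95% credible interval:", posterior.interval(0.95))
```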
Bayesian updating

- Often referred to as "crossing bridges as you come to them"
- Simplifies sequential data collection
- Simplest version: Y₁, Y₂ independent given θ. The joint model is then

  p(y₂ | θ) p(y₁ | θ) π(θ) ∝ p(y₂ | θ) π(θ | y₁),

  i.e., Y₁ updates π(θ) to π(θ | y₁) before Y₂ arrives.
- Works for more than two updates, for updating in blocks, and for dependent as well as independent data
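A minimal check of the sequential-updating identity (assumed Beta-Binomial setup, not from the slides): updating with y₁ first and then y₂ gives exactly the same posterior as updating with all the data at once.

```python
# Assumed Beta-Binomial example: Beta(1, 1) prior on a success probability
a, b = 1.0, 1.0
n1, y1 = 10, 7                       # first batch of Bernoulli trials
n2, y2 = 15, 9                       # second batch

# Sequential: π(θ) -> π(θ | y1) -> π(θ | y1, y2)
a1, b1 = a + y1, b + n1 - y1         # posterior after y1, used as the prior for y2
a2, b2 = a1 + y2, b1 + n2 - y2       # posterior after y2

# All at once
a_all, b_all = a + y1 + y2, b + (n1 - y1) + (n2 - y2)

print((a2, b2) == (a_all, b_all))    # True: the same posterior either way
```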
CIHM: the conditionally independent hierarchical model

- Model: ∏_i p(y_i | θ_i) ∏_i p(θ_i | η) π(η)
- If η is known, the model is not interesting: it reduces to a separate model for each i. So take η unknown.
- There is lots of learning about η, but not much about each individual θ_i.
- The model implies the θ_i are exchangeable; learning about θ_i takes the form of shrinkage.
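A small sketch (my own, with η and the variances treated as known for simplicity; in a full analysis they would get priors) showing the shrinkage the slide refers to: in the normal-normal CIHM y_i | θ_i ∼ N(θ_i, σ²), θ_i | η ∼ N(η, τ²), each posterior mean E[θ_i | y] pulls y_i toward η.

```python
import numpy as np

# Assumed values for illustration
sigma2, tau2, eta = 4.0, 1.0, 10.0
y = np.array([4.0, 8.0, 11.0, 16.0])         # one observation per unit

w = tau2 / (tau2 + sigma2)                    # weight on the data
theta_post_mean = w * y + (1 - w) * eta       # shrinkage toward η

for yi, ti in zip(y, theta_post_mean):
    print(f"y_i = {yi:5.1f}  ->  E[θ_i | y] = {ti:5.2f}")
# Extreme y_i are pulled strongly toward the common mean η = 10
```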