The Bayesian approach to inverse problems
Youssef Marzouk
Department of Aeronautics and Astronautics, Center for Computational Engineering
Massachusetts Institute of Technology
ymarz@mit.edu, http://uqgroup.mit.edu
7 July 2015
Statistical inference

Why is a statistical perspective useful in inverse problems?
- To characterize uncertainty in the inverse solution
- To understand how this uncertainty depends on the number and quality of observations, features of the forward model, prior information, etc.
- To make probabilistic predictions
- To choose “good” observations or experiments
- To address questions of model error, model validity, and model selection
Bayesian inference
Bayes’ rule:

  p(θ | y) = p(y | θ) p(θ) / p(y)

Key idea: model parameters θ are treated as random variables. (For simplicity, we let our random variables have densities.)

Notation:
- θ are model parameters; y are the data; assume both to be finite-dimensional unless otherwise indicated
- p(θ) is the prior probability density
- L(θ) ≡ p(y | θ) is the likelihood function
- p(θ | y) is the posterior probability density
- p(y) is the evidence, or equivalently, the marginal likelihood
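As a concrete illustration (not from the slides), here is a minimal sketch of Bayes’ rule in a conjugate Gaussian setting, where the posterior is available in closed form. The prior parameters, noise level, and data below are assumed values chosen only for illustration.

```python
import numpy as np

# Bayes' rule in a conjugate Gaussian setting:
# prior theta ~ N(m0, s0^2), data y_i = theta + noise with noise ~ N(0, sigma^2).
# The posterior p(theta | y) is again Gaussian, so it can be computed in closed form.
m0, s0 = 0.0, 1.0                  # prior mean and standard deviation (assumed)
sigma = 0.5                        # known noise standard deviation (assumed)
y = np.array([0.9, 1.1, 1.3])      # hypothetical observations

n = y.size
post_var = 1.0 / (1.0 / s0**2 + n / sigma**2)
post_mean = post_var * (m0 / s0**2 + y.sum() / sigma**2)
print(post_mean, np.sqrt(post_var))    # posterior mean and standard deviation
```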
Bayesian inference
Summaries of the posterior distribution

What information to extract?
- Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
- Posterior covariance or higher moments of θ
- Quantiles
- Credible intervals: C(y) such that P[θ ∈ C(y) | y] = 1 − α. Credible intervals are not uniquely defined by this condition; thus consider, for example, the HPD (highest posterior density) region.
- Posterior realizations: for direct assessment, or to estimate posterior predictions or other posterior expectations
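Given posterior realizations (e.g., from an MCMC sampler), these summaries can be estimated empirically. The sketch below assumes a hypothetical array `samples` of shape (n_samples, n_params); it computes the mean, covariance, quantiles, and an equal-tailed credible interval, which is one of the non-unique choices mentioned above (it is not the HPD region).

```python
import numpy as np

# Hypothetical posterior samples of shape (n_samples, n_params); replace with real MCMC output.
rng = np.random.default_rng(0)
samples = rng.normal(size=(5000, 2))

post_mean = samples.mean(axis=0)               # posterior mean
post_cov = np.cov(samples, rowvar=False)       # posterior covariance
quantiles = np.quantile(samples, [0.05, 0.5, 0.95], axis=0)

alpha = 0.05
# Equal-tailed (1 - alpha) credible interval for each component of theta
ci_lower = np.quantile(samples, alpha / 2, axis=0)
ci_upper = np.quantile(samples, 1 - alpha / 2, axis=0)
```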
Bayesian and frequentist statistics
Understanding both perspectives is useful and important...

Key differences between these two statistical paradigms:
- Frequentists do not assign probabilities to unknown parameters θ. One can write likelihoods p_θ(y) ≡ p(y | θ) but not priors p(θ) or posteriors; θ is not a random variable.
- In the frequentist viewpoint, there is no single preferred methodology for inverting the relationship between parameters and data. Instead, consider various estimators θ̂(y) of θ.
- The estimator θ̂ is a random variable. Why? The frequentist paradigm considers y to result from a random and repeatable experiment.
Bayesian and frequentist statistics
Key differences (continued)
- Evaluate the quality of θ̂ through various criteria: bias, variance, mean-square error, consistency, efficiency, ...
- One common estimator is maximum likelihood: θ̂_ML = argmax_θ p(y | θ). Here p(y | θ) defines a family of distributions indexed by θ.
- Link to the Bayesian approach: the MAP estimate maximizes a “penalized likelihood.”
- What about Bayesian versus frequentist prediction of y_new ⊥ y | θ?
  - Frequentist: “plug-in” or other estimators of y_new
  - Bayesian: posterior prediction via integration
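The “penalized likelihood” link can be made concrete: finding the MAP estimate amounts to minimizing the negative log-likelihood plus a penalty given by the negative log-prior. The sketch below uses a toy linear forward model with a Gaussian prior, so the penalty is a Tikhonov-type term; the matrix, noise level, and prior variance are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
G = rng.normal(size=(20, 5))             # toy linear forward model (assumed)
theta_true = rng.normal(size=5)
sigma = 0.1                              # noise standard deviation (assumed)
y = G @ theta_true + sigma * rng.normal(size=20)
prior_var = 1.0                          # variance of the Gaussian prior on theta (assumed)

def neg_log_posterior(theta):
    # negative log-likelihood + negative log-prior, up to additive constants
    misfit = 0.5 * np.sum((y - G @ theta) ** 2) / sigma**2
    penalty = 0.5 * np.sum(theta**2) / prior_var
    return misfit + penalty

theta_map = minimize(neg_log_posterior, x0=np.zeros(5)).x   # MAP = penalized-likelihood estimate
```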
Bayesian inference
Likelihood functions
- In general, p(y | θ) is a probabilistic model for the data
- In the inverse problem or parameter estimation context, the likelihood function is where the forward model appears, along with a noise model and (if applicable) an expression for model discrepancy
- Contrasting example (but not really!): parametric density estimation, where the likelihood function results from the probability density itself

Selected examples of likelihood functions:
1. Bayesian linear regression
2. Nonlinear forward model g(θ) with additive Gaussian noise
3. Nonlinear forward model with noise + model discrepancy
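For example 2 above (nonlinear forward model g(θ) with additive Gaussian noise, i.e., y = g(θ) + ε with ε ∼ N(0, σ²I)), the log-likelihood has a simple closed form. The sketch below is a generic template; the forward model g and the noise level σ are placeholders to be supplied by the problem at hand.

```python
import numpy as np

def log_likelihood(theta, y, g, sigma):
    """Log of p(y | theta) for the additive Gaussian noise model
    y = g(theta) + eps, eps ~ N(0, sigma^2 I), where g is the (possibly nonlinear) forward model."""
    resid = y - g(theta)
    n = y.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma**2) - 0.5 * np.dot(resid, resid) / sigma**2
```

In example 1 (Bayesian linear regression), g(θ) = Gθ; with a Gaussian prior on θ the posterior is then Gaussian and available in closed form.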
Bayesian inference
Prior distributions
- In ill-posed parameter estimation problems, e.g., inverse problems, prior information plays a key role
- Intuitive idea: assign lower probability to values of θ that you don’t expect to see, higher probability to values of θ that you do expect to see

Examples:
1. Gaussian processes with specified covariance kernel
2. Gaussian Markov random fields
3. Gaussian priors derived from differential operators
4. Hierarchical priors
5. Besov space priors
6. Higher-level representations (objects, marked point processes)
Gaussian process priors
- Key idea: any finite-dimensional distribution of the stochastic process θ(x, ω) : D × Ω → R is multivariate normal.
- In other words: θ(x, ω) is a collection of jointly Gaussian random variables, indexed by x
- Specify via a mean function and a covariance function:
  E[θ(x)] = µ(x)
  E[(θ(x) − µ(x)) (θ(x′) − µ(x′))] = C(x, x′)
- Smoothness of the process is controlled by the behavior of the covariance function as x′ → x
- Restrictions: stationarity, isotropy, ...
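One way to see how the covariance function controls smoothness is to draw realizations on a grid. The sketch below uses a zero mean function and a squared-exponential kernel on D = [0, 1]; the kernel choice and hyperparameters are assumptions for illustration, and sampling is done via a Cholesky factorization with a small jitter for numerical stability.

```python
import numpy as np

def squared_exponential(x1, x2, variance=1.0, lengthscale=0.2):
    # C(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
mu = np.zeros_like(x)                                  # mean function mu(x) = 0 (assumed)
C = squared_exponential(x, x)
L = np.linalg.cholesky(C + 1e-10 * np.eye(x.size))     # jitter keeps the factorization stable
theta_draws = mu[:, None] + L @ rng.normal(size=(x.size, 3))   # three GP realizations, one per column
```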
Example: stationary Gaussian random fields
[Figure: two random-field realizations; left: exponential covariance kernel, right: Gaussian covariance kernel]
Both are θ(x, ω) : D × Ω → R, with D = [0, 1]².
Gaussian Markov random fields
- Key idea: discretize space and specify a sparse inverse covariance (“precision”) matrix W:
  p(θ) ∝ exp( −(γ/2) θᵀ W θ )
  where γ controls the scale
- Full conditionals p(θ_i | θ_∼i) are available analytically and may simplify dramatically. Represent as an undirected graphical model.
- Example: E[θ_i | θ_∼i] is just an average of site i’s nearest neighbors
- Quite flexible; even used to simulate textures
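A minimal sketch of this construction on a small 2D lattice: the precision matrix below is a standard discrete Laplacian built by a Kronecker sum (plus a tiny diagonal shift so it is invertible), γ sets the scale, and a realization is drawn through a Cholesky factor of the precision. The neighborhood structure and parameter values are assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp

n = 32                                     # lattice is n x n
gamma = 25.0                               # scale parameter (assumed)
# 1D second-difference matrix, then the 2D discrete Laplacian via a Kronecker sum
L1 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
W = sp.kronsum(L1, L1) + 1e-4 * sp.eye(n * n)      # small shift keeps W positive definite
Q = gamma * W                              # sparse precision matrix of theta

# If Q = L L^T, then theta = L^{-T} z with z ~ N(0, I) has covariance Q^{-1}
Lchol = np.linalg.cholesky(Q.toarray())    # dense factorization is fine at this small size
z = np.random.default_rng(3).normal(size=n * n)
theta = np.linalg.solve(Lchol.T, z).reshape(n, n)

# Full conditional mean of one site given the rest:
# E[theta_i | theta_{~i}] = -(1/Q_ii) * sum_{j != i} Q_ij theta_j  (an average of neighbors here)
```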
Priors through differential operators
- Key idea: return to the infinite-dimensional setting; again penalize roughness in θ(x)
- Stuart 2010: define the prior using fractional negative powers of the Laplacian A = −∆:
  θ ∼ N(θ₀, β A^(−α))
- Sufficiently large α (α > d/2), along with conditions on the likelihood, ensures that the posterior measure is well defined
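To make this concrete in a discretized setting, one can form a discrete (negative) Laplacian A on a grid, build the covariance β A^(−α) through an eigendecomposition, and draw samples. This is only a sketch of the idea on a 1D grid with assumed boundary conditions and parameter values, not the function-space construction itself.

```python
import numpy as np

n, alpha, beta = 200, 1.0, 1.0     # grid size and prior parameters (assumed)
h = 1.0 / (n + 1)
# Discrete negative Laplacian A = -Delta on (0, 1) with homogeneous Dirichlet boundary conditions
A = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)) / h**2

evals, evecs = np.linalg.eigh(A)                     # A is symmetric positive definite
cov_sqrt = evecs @ np.diag(np.sqrt(beta) * evals ** (-alpha / 2.0)) @ evecs.T
theta = cov_sqrt @ np.random.default_rng(4).normal(size=n)   # one draw from N(0, beta * A^{-alpha})
```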
GPs, GMRFs, and SPDEs
In fact, all three “types” of Gaussian priors just described are closely connected.
- Linear fractional SPDE:
  (κ² − ∆)^(β/2) θ(x) = W(x),  x ∈ R^d,  β = ν + d/2,  κ > 0,  ν > 0
  where W(x) is spatial Gaussian white noise
- Then θ(x) is a Gaussian field with Matérn covariance:
  C(x, x′) = σ² / (2^(ν−1) Γ(ν)) (κ ‖x − x′‖)^ν K_ν(κ ‖x − x′‖)
- The covariance kernel is the Green’s function of the differential operator:
  (κ² − ∆)^β C(x, x′) = δ(x − x′)
- ν = 1/2 is equivalent to the exponential covariance; ν → ∞ is equivalent to the squared exponential covariance
- Can construct a discrete GMRF that approximates the solution of the SPDE (see Lindgren, Rue, Lindström, JRSSB 2011)
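The Matérn covariance above can be evaluated directly with the modified Bessel function of the second kind; the r = 0 case uses the limit C(0) = σ². The sketch below uses placeholder parameter values, and the comment at the end checks the ν = 1/2 special case mentioned on the slide.

```python
import numpy as np
from scipy.special import gamma as gamma_fn, kv

def matern_covariance(r, sigma2=1.0, nu=1.5, kappa=10.0):
    """Matern covariance C(r) = sigma2 / (2^(nu-1) Gamma(nu)) * (kappa r)^nu * K_nu(kappa r),
    with C(0) = sigma2, as a function of the distance r = ||x - x'||."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    c = np.full(r.shape, sigma2)
    nz = r > 0
    kr = kappa * r[nz]
    c[nz] = sigma2 / (2.0 ** (nu - 1.0) * gamma_fn(nu)) * kr**nu * kv(nu, kr)
    return c

# With nu = 0.5 this reduces to the exponential covariance sigma2 * exp(-kappa * r).
```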
Hierarchical Gaussian priors
[Figure 1 of Calvetti & Somersalo: three realizations drawn from the prior (6) with constant variance θ_j = θ₀ (left), and from the corresponding prior where the variance is increased 100-fold at two points indicated by arrows (right).]
Calvetti & Somersalo, Inverse Problems 24 (2008) 034013.
Hierarchical Gaussian priors
[Figure 5 of Calvetti & Somersalo: approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3, and 7 iterations of the cyclic algorithm, using the CGLS method to compute the update of the image at each iteration step.]
Calvetti & Somersalo, Inverse Problems 24 (2008) 034013.
Non-Gaussian priors
Besov space B^s_{pq}(T):

  θ(x) = c₀ + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} w_{j,h} ψ_{j,h}(x)

and

  ‖θ‖_{B^s_{pq}(T)} := ( |c₀|^q + Σ_{j=0}^{∞} 2^{jq(s + 1/2 − 1/p)} ( Σ_{h=0}^{2^j − 1} |w_{j,h}|^p )^{q/p} )^{1/q} < ∞.

Consider p = q = s = 1:

  ‖θ‖_{B^1_{11}(T)} = |c₀| + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} 2^{j/2} |w_{j,h}|.

Then the distribution of θ is a Besov prior if α c₀ and α 2^{j/2} w_{j,h} are independent and Laplace(1).
Loosely, π(θ) = exp( −α ‖θ‖_{B^1_{11}(T)} ).
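A sketch of drawing one realization from this B^1_{11} prior using the Haar wavelet basis ψ_{j,h}(x) = 2^{j/2} ψ(2^j x − h) on T = [0, 1]: the coefficients are sampled so that α c₀ and α 2^{j/2} w_{j,h} are independent Laplace(1), and the expansion is truncated at a finite level J. The truncation level, grid, and choice of the Haar basis are assumptions for illustration.

```python
import numpy as np

def haar_mother(t):
    # Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere
    return np.where((t >= 0) & (t < 0.5), 1.0, 0.0) - np.where((t >= 0.5) & (t < 1.0), 1.0, 0.0)

def sample_besov_prior(alpha=1.0, J=8, n_grid=1024, rng=None):
    """One realization of the (truncated) B^1_11 Besov prior on [0, 1] in the Haar basis,
    with alpha * c0 and alpha * 2^(j/2) * w_{j,h} independent Laplace(1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    theta = rng.laplace(scale=1.0) / alpha * np.ones(n_grid)          # c0 term
    for j in range(J):
        for h in range(2**j):
            w = rng.laplace(scale=1.0) / (alpha * 2 ** (j / 2))       # coefficient w_{j,h}
            theta += w * 2 ** (j / 2) * haar_mother(2**j * x - h)     # psi_{j,h}(x) = 2^{j/2} psi(2^j x - h)
    return x, theta
```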
Higher-level representations
Marked point processes, and more: Rue & Hurn, Biometrika 86 (1999) 649–660.
Bayesian inference
Hierarchical modeling
- One of the key flexibilities of the Bayesian construction!
- Hierarchical modeling has important implications for the design of efficient MCMC samplers (later in the lecture)

Examples:
1. Unknown noise variance
2. Unknown variance of a Gaussian process prior (cf. choosing the regularization parameter)
3. Many more, as dictated by the physical models at hand
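As a concrete sketch of example 1 (unknown noise variance): treat σ² as an additional unknown with its own hyperprior and write the joint unnormalized log-posterior over (θ, σ²). The Gaussian prior on θ and the inverse-gamma hyperprior on σ² below are assumptions for illustration, as is the placeholder forward model g.

```python
import numpy as np

def log_posterior(theta, sigma2, y, g, a=2.0, b=1.0, prior_var=1.0):
    """Unnormalized log-posterior for a hierarchical model with unknown noise variance:
       y = g(theta) + eps,  eps ~ N(0, sigma2 * I)
       theta ~ N(0, prior_var * I)          (assumed prior)
       sigma2 ~ Inverse-Gamma(a, b)         (assumed hyperprior)
    g, a, b, and prior_var are placeholders to be replaced by the problem at hand."""
    if sigma2 <= 0:
        return -np.inf
    resid = y - g(theta)
    log_lik = -0.5 * y.size * np.log(sigma2) - 0.5 * np.dot(resid, resid) / sigma2
    log_prior_theta = -0.5 * np.dot(theta, theta) / prior_var
    log_prior_sigma2 = -(a + 1.0) * np.log(sigma2) - b / sigma2
    return log_lik + log_prior_theta + log_prior_sigma2
```

Such a log-posterior can be explored with a generic MCMC sampler; with this choice of hyperprior, σ² can also be updated in a Gibbs step, since its full conditional given θ is again inverse-gamma.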