Multilevel methods for fast Bayesian optimal experimental design

Raúl Tempone
Alexander von Humboldt Professor, RWTH Aachen, KAUST

Joint work with J. Beck (KAUST), B.M. Dia (KFUPM), and L. Espath (RWTH)

DCSE Fall School on ROM and UQ, November 4–8, 2019
Optimal Experimental Design (OED)

◮ Find the design that maximizes the success of an experiment for some desired experimental goal
◮ Predict the outcome without performing the real experiment
◮ Model-based approach: important when resources are limited
Motivating problem: Electrical Impedance Tomography

Quantities of interest: material properties
Measurements: electrical potentials at the electrodes
Goal: find the electrode placement (design) that maximizes the informativeness of the measurements
Data model setting

Data model assumption:
\[
  Y(\xi) = g(\theta, \xi) + \epsilon,
\]
◮ $Y = (y_1, y_2, \ldots, y_q) \in \mathbb{R}^q$, measurements
◮ $\epsilon \in \mathbb{R}^q$, measurement error
◮ $g \in \mathbb{R}^q$, model predictions (usually the solution of a mathematical model)
◮ $\theta \in \mathbb{R}^d$, quantities of interest
◮ $\xi = (\xi_1, \xi_2, \ldots, \xi_s)$, design parameters

Measurement error assumption: the measurement errors $\epsilon$ are zero-mean Gaussian with covariance matrix $\Sigma_\epsilon$.
Data model with repetitive experiments

Data model with repetitive experiments:
\[
  Y_i(\xi) = g(\theta, \xi) + \epsilon_i, \quad i = 1, \ldots, N_e,
\]
where $N_e$ is the number of repetitive experiments with design $\xi$.

The full set of data: $Y = \{Y_i\}_{i=1}^{N_e}$.
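This data model is straightforward to simulate. Below is a minimal Python sketch (the forward model g, the design value, and the noise level are hypothetical, chosen only for illustration) that generates the full data set {Y_i}:

```python
# A minimal sketch (hypothetical model g and values, not from the slides):
# generating N_e repeated measurements Y_i = g(theta, xi) + eps_i.
import numpy as np

def g(theta, xi):
    """Hypothetical forward model returning q = 2 predictions."""
    return np.array([theta * xi, theta * np.exp(-xi)])

rng = np.random.default_rng(0)
theta_true, xi = 0.7, 0.5        # assumed parameter and design values
Sigma_eps = 1e-3 * np.eye(2)     # noise covariance
N_e = 10                         # number of repeated experiments

Y = np.array([g(theta_true, xi)
              + rng.multivariate_normal(np.zeros(2), Sigma_eps)
              for _ in range(N_e)])   # full data set {Y_i}, shape (N_e, 2)
print(Y.shape)
```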
Simple OED example: a linear regression model

Linear regression model:
\[
  Y = X\theta + \epsilon,
\]
where
◮ $Y$, vector of observation values
◮ $X$, the design matrix
◮ $\epsilon$, the zero-mean error vector with known covariance matrix $\Sigma_\epsilon$
◮ $\theta$, an unknown parameter vector of interest

Inverse problem: estimate $\theta$.

Classical approach:
◮ Least-squares estimate ($\arg\min_\theta \epsilon^T \epsilon$): $\hat{\theta} = (X^T X)^{-1} X^T Y$
◮ $\mathrm{Cov}(\hat{\theta}) = (X^T X)^{-1} X^T \Sigma_\epsilon X (X^T X)^{-1}$

OED goal: find the design $X$ that minimizes $\mathrm{Cov}(\hat{\theta})$.
Alphabetic optimalities

What does it mean to minimize $\mathrm{Cov}(\hat{\theta}) \overset{\mathrm{def}}{=} \Lambda$?

◮ A-optimality: minimize the trace of the covariance matrix, $\mathrm{tr}(\Lambda)$
◮ C-optimality: minimize the variance of a predefined linear combination of the quantities of interest, $c^T \Lambda c$
◮ D-optimality: minimize the determinant of the covariance matrix, $\det(\Lambda)$
◮ E-optimality: minimize the maximum eigenvalue of the covariance matrix $\Lambda$

For nonlinear models the problem is much more challenging!

Entropy-based expected information gain in a Bayesian setting:
Lindley, D. V. (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4), 986–1005.
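These criteria are easy to evaluate once Cov(θ̂) is available. A minimal sketch (the candidate design matrices, the noise covariance, and the vector c are hypothetical) comparing two designs under the A-, C-, D-, and E-criteria:

```python
# A minimal sketch (not from the slides): alphabetic optimality criteria for
# the linear regression model with Cov(theta_hat) = (X^T X)^{-1} X^T Sigma X (X^T X)^{-1}.
import numpy as np

def covariance_of_estimator(X, Sigma_eps):
    """Covariance of the least-squares estimate for design matrix X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ X.T @ Sigma_eps @ X @ XtX_inv

def alphabetic_criteria(Lambda, c=None):
    """Return the A-, D-, E- (and optionally C-) optimality values of Lambda."""
    crit = {
        "A": np.trace(Lambda),                      # trace
        "D": np.linalg.det(Lambda),                 # determinant
        "E": np.max(np.linalg.eigvalsh(Lambda)),    # largest eigenvalue
    }
    if c is not None:
        crit["C"] = c @ Lambda @ c                  # variance of c^T theta_hat
    return crit

rng = np.random.default_rng(0)
Sigma_eps = 0.1 * np.eye(5)              # assumed error covariance
X1 = rng.normal(size=(5, 2))             # hypothetical candidate design 1
X2 = rng.normal(size=(5, 2))             # hypothetical candidate design 2
for name, X in [("X1", X1), ("X2", X2)]:
    Lambda = covariance_of_estimator(X, Sigma_eps)
    print(name, alphabetic_criteria(Lambda, c=np.array([1.0, -1.0])))
```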
Entropy-based OED criterion

1948, information entropy definition: Claude Shannon introduced the concept of information entropy,
\[
  \text{Entropy} = \sum_i p_i \log \frac{1}{p_i},
\]
for probabilities $p_i$. The differential entropy of a continuous random variable with probability density function (pdf) $p(x)$ and support $\mathcal{X}$ is
\[
  \text{Differential Entropy} = \int_{\mathcal{X}} \log\frac{1}{p(x)}\, p(x)\, \mathrm{d}x.
\]

1956, Shannon-entropy-based OED: Dennis V. Lindley proposed an entropy-based measure of the information provided by an experiment: the relative entropy of the posterior pdf $\pi(\theta \mid Y)$ with respect to the prior pdf $\pi(\theta)$,
\[
  \text{Relative Entropy}
  = \underbrace{\int \log\frac{1}{\pi(\theta)}\, \pi(\theta \mid Y)\, \mathrm{d}\theta}_{=\ \text{Cross Entropy}}
  \;-\;
  \underbrace{\int \log\frac{1}{\pi(\theta \mid Y)}\, \pi(\theta \mid Y)\, \mathrm{d}\theta}_{=\ \text{Differential Entropy}},
\]
taken in expectation over the data $Y$.
Information gain

The information gain (Lindley, 1956) is here defined by the Kullback–Leibler (KL) divergence between the prior and posterior pdfs,
\[
  D_{\mathrm{KL}}\big(\pi(\theta \mid Y, \xi) \,\|\, \pi(\theta)\big)
  \overset{\mathrm{def}}{=} \int_{\Theta} \log\!\left(\frac{\pi(\theta \mid Y, \xi)}{\pi(\theta)}\right) \pi(\theta \mid Y, \xi)\, \mathrm{d}\theta,
\]
where
◮ $\pi(\theta)$, the prior pdf of $\theta$
◮ $\pi(\theta \mid Y, \xi)$, the posterior pdf of $\theta$ given $Y$ for design $\xi$

How to obtain the posterior pdf? Use Bayes' rule:
\[
  \pi(\theta \mid Y, \xi) = \frac{p(Y \mid \theta, \xi)}{p(Y \mid \xi)}\, \pi(\theta),
\]
where
◮ $p(Y \mid \theta, \xi)$, the likelihood function of $Y$
◮ $p(Y \mid \xi)$, the evidence, i.e., $p(Y \mid \xi) = \int_{\Theta} p(Y \mid \theta, \xi)\, \pi(\theta)\, \mathrm{d}\theta$
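For intuition, in a conjugate 1D Gaussian setting both Bayes' rule and the KL divergence are available in closed form. A minimal sketch (the prior, noise level, and observation are hypothetical, for illustration only):

```python
# A minimal sketch (assumption, not from the slides): information gain in a
# 1D conjugate-Gaussian model y = theta + eps, where the posterior and the
# KL divergence have closed forms.
import numpy as np

mu0, sig0 = 0.0, 1.0      # prior N(mu0, sig0^2)
sig_eps = 0.3             # noise std of eps
y = 0.8                   # one observed measurement

# Posterior via Bayes' rule (Gaussian conjugacy)
sig_post2 = 1.0 / (1.0 / sig0**2 + 1.0 / sig_eps**2)
mu_post = sig_post2 * (mu0 / sig0**2 + y / sig_eps**2)

# KL( posterior || prior ) between two univariate Gaussians
kl = (np.log(sig0 / np.sqrt(sig_post2))
      + (sig_post2 + (mu_post - mu0)**2) / (2 * sig0**2) - 0.5)
print("information gain D_KL =", kl)
```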
Expected Information Gain (EIG) criterion

Information gain:
\[
  D_{\mathrm{KL}}\big(\pi(\theta \mid Y, \xi) \,\|\, \pi(\theta)\big)
  = \int_{\Theta} \log\!\left(\frac{p(Y \mid \theta, \xi)}{p(Y \mid \xi)}\right)
    \frac{p(Y \mid \theta, \xi)\, \pi(\theta)}{p(Y \mid \xi)}\, \mathrm{d}\theta.
\]
The information about $Y$ is given by the data model. The expected information gain (EIG) is
\[
  I(\xi) \overset{\mathrm{def}}{=} \int D_{\mathrm{KL}}\big(\pi(\theta \mid Y, \xi) \,\|\, \pi(\theta)\big)\, p(Y \mid \xi)\, \mathrm{d}Y
  = \int_{\Theta} \left( \int \log\!\left(\frac{p(Y \mid \theta, \xi)}{\int_{\Theta} p(Y \mid \theta', \xi)\, \pi(\theta')\, \mathrm{d}\theta'}\right) p(Y \mid \theta, \xi)\, \mathrm{d}Y \right) \pi(\theta)\, \mathrm{d}\theta
  = \mathbb{E}_{\theta}\!\left[ \mathbb{E}_{Y \mid \theta}\!\left[ \log\!\left( \frac{p(Y \mid \theta, \xi)}{\mathbb{E}_{\theta}\big[\, p(Y \mid \theta, \xi)\, \big]} \right) \right] \right],
\]
where, because of the assumed data model, the likelihood is multivariate normal,
\[
  p(Y \mid \theta, \xi) = \det(2\pi \Sigma_\epsilon)^{-N_e/2}
  \exp\!\left( -\frac{1}{2} \sum_{i=1}^{N_e} \| y_i - g(\theta, \xi) \|^2_{\Sigma_\epsilon^{-1}} \right).
\]
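Evaluating this Gaussian likelihood (or its logarithm) is the basic building block of all the estimators that follow. A minimal sketch, assuming a generic forward model g(theta, xi) passed in as a function:

```python
# A minimal sketch (not from the slides): the multivariate-normal
# log-likelihood log p(Y | theta, xi) implied by the data model, for N_e
# repeated measurements stacked in an array Y of shape (N_e, q).
import numpy as np

def log_likelihood(Y, theta, xi, Sigma_eps, g):
    """Return log p(Y | theta, xi) for the Gaussian data model."""
    N_e, q = Y.shape
    Sigma_inv = np.linalg.inv(Sigma_eps)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_eps)
    residuals = Y - g(theta, xi)                              # shape (N_e, q)
    quad = np.einsum('ij,jk,ik->', residuals, Sigma_inv, residuals)
    return -0.5 * N_e * logdet - 0.5 * quad
```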
EIG surface

[Figure: EIG surface]
Estimating the EIG

The expected information gain can be computed separately for each $\xi$.

EIG criterion:
\[
  I = \mathbb{E}_{\theta}\!\left[ \mathbb{E}_{Y \mid \theta}\!\left[ \log\!\left( \frac{p(Y \mid \theta)}{\mathbb{E}_{\theta}\big[\, p(Y \mid \theta)\, \big]} \right) \right] \right],
\]
where the likelihood is given by the data model:
\[
  p(Y \mid \theta) = \det(2\pi \Sigma_\epsilon)^{-N_e/2}
  \exp\!\left( -\frac{1}{2} \sum_{i=1}^{N_e} \| y_i - g(\theta) \|^2_{\Sigma_\epsilon^{-1}} \right).
\]
(The dependence on $\xi$ is dropped from the notation.)

Conventional choice: double-loop Monte Carlo sampling.
Double-loop Monte Carlo (DLMC) for estimating EIG

The DLMC estimator for the EIG:
\[
  I_{\mathrm{dl}} \overset{\mathrm{def}}{=} \frac{1}{N} \sum_{n=1}^{N}
  \log\!\left( \frac{p(Y_n \mid \theta_n)}{\frac{1}{M} \sum_{m=1}^{M} p(Y_n \mid \theta_{n,m})} \right),
\]
where $\theta_n, \theta_{n,m} \overset{\mathrm{i.i.d.}}{\sim} \pi(\theta)$ and $Y_n \sim p(Y \mid \theta_n)$.

Average work model: $W = N \times M$ likelihood evaluations!

Bias and variance can be modeled (Ryan, 2003):
Bias: $| \mathbb{E}[I_{\mathrm{dl}}] - I | \approx \frac{C_1}{M}$, for some constant $C_1 > 0$ (not unbiased!)
Variance: $\mathbb{V}[I_{\mathrm{dl}}] \approx \frac{C_2}{N} + \frac{C_3}{N M}$, for some constants $C_2, C_3 > 0$.
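A minimal Python sketch of the DLMC estimator (not the authors' implementation), using the nonlinear scalar model from the numerical demonstration later in these slides, with a uniform prior, noise variance 10^-3, and N_e = 1:

```python
# A minimal sketch of the DLMC estimator I_dl for a scalar model with N_e = 1.
import numpy as np

rng = np.random.default_rng(0)
sig_eps = np.sqrt(1e-3)                 # noise std, Sigma_eps = 1e-3

def g(theta, xi):
    """Scalar forward model from the numerical demonstration."""
    return theta**3 * xi**2 + theta * np.exp(-abs(0.2 - xi))

def likelihood(y, theta, xi):
    """Gaussian likelihood p(y | theta, xi) for a single scalar measurement."""
    r = y - g(theta, xi)
    return np.exp(-0.5 * (r / sig_eps)**2) / (np.sqrt(2 * np.pi) * sig_eps)

def dlmc_eig(xi, N, M):
    """Double-loop Monte Carlo estimate of the expected information gain."""
    total = 0.0
    for _ in range(N):
        theta_n = rng.uniform(0.0, 1.0)                      # outer prior sample
        y_n = g(theta_n, xi) + sig_eps * rng.normal()        # synthetic data
        theta_inner = rng.uniform(0.0, 1.0, size=M)          # inner prior samples
        evidence = np.mean(likelihood(y_n, theta_inner, xi)) # MC estimate of p(y_n)
        total += np.log(likelihood(y_n, theta_n, xi) / evidence)
    return total / N

print(dlmc_eig(xi=0.5, N=2000, M=200))
```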
How to control the estimator error?

Control the accuracy: choose $N$ and $M$ to control the estimator's accuracy.

Goal: for a specified error tolerance $\mathrm{TOL} > 0$ and a confidence level given by $0 < \alpha \ll 1$, select the values of $N$ and $M$ that minimize the computational work while satisfying
\[
  \mathbb{P}\big( |I_{\mathrm{dl}}(N, M) - I| \le \mathrm{TOL} \big) \ge 1 - \alpha.
\]
Total Error = Bias + Statistical Error

Split the total error of a random estimator $\mathcal{I}$ into a bias component and a statistical error component:
\[
  |I - \mathcal{I}| \le |I - \mathbb{E}[\mathcal{I}]| + |\mathbb{E}[\mathcal{I}] - \mathcal{I}|.
\]

Optimization problem: minimize the estimator's work $W$ subject to
\[
  |I - \mathbb{E}[\mathcal{I}]| \le (1 - \kappa)\,\mathrm{TOL}, \qquad
  |\mathbb{E}[\mathcal{I}] - \mathcal{I}| \le \kappa\,\mathrm{TOL},
\]
for $\kappa \in (0, 1)$, where the latter constraint should hold with probability $1 - \alpha$.

By the central limit theorem (CLT), the statistical error constraint is replaced by
\[
  C_\alpha \sqrt{\mathbb{V}[\mathcal{I}]} \le \kappa\,\mathrm{TOL},
\]
where $C_\alpha = \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)$ and $\Phi$ is the cumulative distribution function (CDF) of a standard normal random variable.
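The confidence constant is just an inverse standard-normal CDF evaluation; a small sketch (α = 0.05 is an assumed value):

```python
# A small sketch (not from the slides): C_alpha = Phi^{-1}(1 - alpha/2).
from scipy.stats import norm

alpha = 0.05
C_alpha = norm.ppf(1 - alpha / 2)   # inverse standard-normal CDF
print(C_alpha)                      # approx 1.96
```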
Optimal choice of DLMC parameters

Optimization problem for DLMC:
\[
  (N^*, M^*, \kappa^*) = \arg\min_{(N, M, \kappa)} N \times M
  \quad \text{subject to} \quad
  \frac{C_{\mathrm{dl},3}}{M} \le (1 - \kappa)\,\mathrm{TOL}, \qquad
  \frac{C_{\mathrm{dl},1}}{N} + \frac{C_{\mathrm{dl},2}}{N M} \le \left(\frac{\kappa}{C_\alpha}\right)^{2} \mathrm{TOL}^2.
\]

Optimal choice of the number of samples (Beck et al., 2018), to leading order as $\mathrm{TOL} \to 0$:
\[
  N^* = \left(\frac{C_\alpha}{\kappa^*}\right)^{2} C_{\mathrm{dl},1}\, \mathrm{TOL}^{-2}, \qquad
  M^* = \frac{C_{\mathrm{dl},3}}{1 - \kappa^*}\, \mathrm{TOL}^{-1}.
\]

Optimal average work: for a specified TOL with the specified probability of success,
\[
  \text{Optimal average work} = N^* M^* \propto \mathrm{TOL}^{-3}, \quad \text{as } \mathrm{TOL} \to 0.
\]
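A small sketch of how these choices scale with TOL (the constants C_dl,1, C_dl,3 and the split κ are hypothetical placeholders; only the TOL-scaling of the work matters):

```python
# A small sketch (hypothetical constants, for illustration only): leading-order
# optimal DLMC sample sizes and the resulting work W = N* M* ~ TOL^-3.
import numpy as np
from scipy.stats import norm

C_dl1, C_dl3 = 2.0, 1.0          # assumed variance and bias constants
kappa, alpha = 0.5, 0.05         # error split and confidence level
C_alpha = norm.ppf(1 - alpha / 2)

for TOL in [0.1, 0.05, 0.025]:
    N_star = int(np.ceil((C_alpha / kappa)**2 * C_dl1 * TOL**-2))
    M_star = int(np.ceil(C_dl3 / (1 - kappa) * TOL**-1))
    print(f"TOL={TOL:6.3f}  N*={N_star:8d}  M*={M_star:5d}  work={N_star * M_star:12d}")
```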
Improving the constant: Laplace-based importance sampling

Assumption: the posterior pdf of $\theta$ can be well approximated by a Gaussian.

Laplace approximation: approximate the posterior pdf $\pi(\theta \mid Y)$ by a multivariate normal pdf $\tilde{\pi}(\theta \mid Y)$ following $\mathcal{N}\big(\hat{\theta}(Y), \hat{\Sigma}(\hat{\theta}(Y))\big)$, i.e.,
\[
  \tilde{\pi}(\theta \mid Y) = (2\pi)^{-d/2} \det(\hat{\Sigma})^{-1/2}
  \exp\!\left( -\frac{1}{2} \| \theta - \hat{\theta} \|^2_{\hat{\Sigma}^{-1}} \right),
\]
with the maximum a posteriori (MAP) estimate
\[
  \hat{\theta}(Y) \overset{\mathrm{def}}{=} \arg\max_{\theta \in \Theta}
  \left( -\frac{1}{2} \sum_{i=1}^{N_e} \| Y_i - g(\theta) \|^2_{\Sigma_\epsilon^{-1}} + \log(\pi(\theta)) \right),
\]
and
\[
  \hat{\Sigma}^{-1}(\theta) = N_e\, J(\theta)^T \Sigma_\epsilon^{-1} J(\theta) - \nabla_\theta \nabla_\theta \log(\pi(\theta)) + \mathcal{O}_P\!\left(\sqrt{N_e}\right),
\]
where $J(\theta)$ is the Jacobian of $g$.

This optimization problem can typically be solved to sufficient accuracy with roughly 20-40 extra model solves per outer sample (indexed by $n$).
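A minimal 1D sketch of the Laplace fit (not the authors' implementation), assuming a hypothetical scalar model, a standard normal prior (so the log-prior Hessian is smooth and nonzero), and finite-difference derivatives:

```python
# A minimal 1D sketch of the Laplace approximation: numerical MAP estimate plus
# the Gauss-Newton-type curvature. Model, prior, and data are hypothetical.
import numpy as np
from scipy.optimize import minimize_scalar

sig_eps = np.sqrt(1e-3)
xi = 0.5
rng = np.random.default_rng(1)

def g(theta):
    return theta**3 * xi**2 + theta * np.exp(-abs(0.2 - xi))

def log_prior(theta):
    # Assumed standard normal prior
    return -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)

theta_true, N_e = 0.7, 10
Y = g(theta_true) + sig_eps * rng.normal(size=N_e)   # repeated measurements

def neg_log_post(theta):
    return 0.5 * np.sum((Y - g(theta))**2) / sig_eps**2 - log_prior(theta)

# MAP estimate: maximize the (unnormalized) log posterior
theta_map = minimize_scalar(neg_log_post, bounds=(-3, 3), method="bounded").x

# Curvature: Sigma_hat^{-1} = N_e J^T Sigma_eps^{-1} J - Hessian of log prior
h = 1e-5
J = (g(theta_map + h) - g(theta_map - h)) / (2 * h)      # dg/dtheta at the MAP
hess_log_prior = (log_prior(theta_map + h) - 2 * log_prior(theta_map)
                  + log_prior(theta_map - h)) / h**2
Sigma_hat = 1.0 / (N_e * J**2 / sig_eps**2 - hess_log_prior)
print("MAP:", theta_map, " Laplace variance:", Sigma_hat)
```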
DLMC with Laplace-based Importance Sampling (DLMCIS)

Laplace-based importance sampling for the inner expectation:
\[
  p(Y) = \int p(Y \mid \theta)\, \pi(\theta)\, \mathrm{d}\theta
       = \int \frac{p(Y \mid \theta)\, \pi(\theta)}{\tilde{\pi}(\theta \mid Y)}\, \tilde{\pi}(\theta \mid Y)\, \mathrm{d}\theta.
\]

DLMCIS estimator (Ryan et al., 2015; Beck et al., 2018):
\[
  I_{\mathrm{dlis}} \overset{\mathrm{def}}{=} \frac{1}{N} \sum_{n=1}^{N}
  \log\!\left( \frac{p(Y_n \mid \theta_n)}
  {\frac{1}{M} \sum_{m=1}^{M} \dfrac{p(Y_n \mid \tilde{\theta}_{n,m})\, \pi(\tilde{\theta}_{n,m})}{\tilde{\pi}(\tilde{\theta}_{n,m} \mid Y_n)}} \right),
\]
where $\tilde{\theta}_{n,m} \overset{\mathrm{i.i.d.}}{\sim} \tilde{\pi}(\theta \mid Y_n)$. (No additional bias!)

Optimal average work: $\propto \mathrm{TOL}^{-3}$ as $\mathrm{TOL} \to 0$. The proportionality constant is substantially smaller for DLMCIS than for DLMC! (variance reduction)
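A minimal 1D sketch of DLMCIS under the same assumptions as the Laplace sketch above (hypothetical scalar model, standard normal prior, N_e = 1); the inner samples are drawn from the Laplace approximation and reweighted:

```python
# A minimal 1D sketch of the DLMCIS estimator (not the authors' code).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
sig_eps = np.sqrt(1e-3)
xi = 0.5

def g(theta):
    return theta**3 * xi**2 + theta * np.exp(-abs(0.2 - xi))

def likelihood(y, theta):
    return norm.pdf(y, loc=g(theta), scale=sig_eps)

def laplace_fit(y):
    """Return the MAP estimate and Laplace variance of the posterior given y."""
    neg_log_post = lambda t: 0.5 * ((y - g(t)) / sig_eps)**2 + 0.5 * t**2
    t_map = minimize_scalar(neg_log_post, bounds=(-3, 3), method="bounded").x
    h = 1e-5
    J = (g(t_map + h) - g(t_map - h)) / (2 * h)      # dg/dtheta at the MAP
    return t_map, 1.0 / (J**2 / sig_eps**2 + 1.0)    # curvature of likelihood + prior

def dlmcis_eig(N, M):
    total = 0.0
    for _ in range(N):
        theta_n = rng.normal()                       # outer sample from the prior
        y_n = g(theta_n) + sig_eps * rng.normal()
        t_map, var = laplace_fit(y_n)
        t_tilde = t_map + np.sqrt(var) * rng.normal(size=M)   # importance samples
        weights = (likelihood(y_n, t_tilde) * norm.pdf(t_tilde)
                   / norm.pdf(t_tilde, loc=t_map, scale=np.sqrt(var)))
        total += np.log(likelihood(y_n, theta_n) / np.mean(weights))
    return total / N

print(dlmcis_eig(N=500, M=20))
```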
Numerical demonstration

Nonlinear scalar model:
\[
  y_i(\xi) = \theta^3 \xi^2 + \theta \exp(-|0.2 - \xi|) + \epsilon_i, \quad i = 1, 2, \ldots, N_e,
\]
with
◮ $\theta \sim \mathcal{U}(0, 1)$
◮ $\epsilon_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, 10^{-3})$
◮ $N_e = 1$ and $N_e = 10$
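To connect with the EIG figure that follows, a rough sketch of a DLMC sweep of the EIG over a grid of designs ξ for this model (modest sample sizes, N_e = 1; the estimator mirrors the earlier DLMC sketch and is for illustration only, not a reproduction of the slide's results):

```python
# A sketch: crude DLMC estimate of EIG(xi) on a grid of designs for the
# nonlinear scalar benchmark model with N_e = 1.
import numpy as np

rng = np.random.default_rng(0)
sig_eps = np.sqrt(1e-3)

def g(theta, xi):
    return theta**3 * xi**2 + theta * np.exp(-np.abs(0.2 - xi))

def likelihood(y, theta, xi):
    r = (y - g(theta, xi)) / sig_eps
    return np.exp(-0.5 * r**2) / (np.sqrt(2 * np.pi) * sig_eps)

def dlmc_eig(xi, N=1000, M=200):
    eig = 0.0
    for _ in range(N):
        theta_n = rng.uniform()
        y_n = g(theta_n, xi) + sig_eps * rng.normal()
        evidence = np.mean(likelihood(y_n, rng.uniform(size=M), xi))
        eig += np.log(likelihood(y_n, theta_n, xi) / evidence)
    return eig / N

for xi in np.linspace(0.0, 1.0, 6):
    print(f"xi = {xi:4.2f}   EIG (DLMC estimate) = {dlmc_eig(xi):.3f}")
```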
Expected information gain

[Figure: Expected information gain]
Computational work for a single design

[Figure: Average running time vs. TOL]