Multilevel methods for fast Bayesian optimal experimental design


Multilevel methods for fast Bayesian optimal experimental design. Raúl Tempone, Alexander von Humboldt Professor, RWTH Aachen and KAUST. Joint work with J. Beck (KAUST), B.M. Dia (KFUPM), and L. Espath (RWTH). DCSE Fall School on ROM and UQ, November 4–8, 2019.


  1. Multilevel methods for fast Bayesian optimal experimental design. Raúl Tempone, Alexander von Humboldt Professor, RWTH Aachen and KAUST. Joint work with J. Beck (KAUST), B.M. Dia (KFUPM), and L. Espath (RWTH). DCSE Fall School on ROM and UQ, November 4–8, 2019.

  2. Optimal Experimental Design (OED)
  ◮ Find the design that maximizes the success of an experiment for some desired experimental goal
  ◮ Predict the outcome without performing the real experiment
  ◮ Model-based approach: important when resources are limited

  3. Motivating problem: Electrical Impedance Tomography
  Quantities of interest: material properties
  Measurements: electrical potentials at electrodes
  Goal: find the electrode placement (design) that maximizes the informativeness of the measurements

  4. Data model setting
  Data model assumption: Y(ξ) = g(θ, ξ) + ε, where
  ◮ Y = (y_1, y_2, ..., y_q) ∈ R^q, measurements
  ◮ ε ∈ R^q, measurement error
  ◮ g ∈ R^q, model predictions (usually the solution of a mathematical model)
  ◮ θ ∈ R^d, quantities of interest
  ◮ ξ = (ξ_1, ξ_2, ..., ξ_s), design parameters
  Measurement error assumption: the measurement errors ε are zero-mean Gaussian with covariance matrix Σ_ε.

  5. Data model with repetitive experiments
  Data model with repetitive experiments: Y_i(ξ) = g(θ, ξ) + ε_i, i = 1, ..., N_e,
  where N_e is the number of repeated experiments with design ξ. The full set of data: Y = {Y_i}_{i=1}^{N_e}.
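
As a concrete illustration of this data model (my addition, not part of the slides), the following Python sketch draws N_e repeated measurements, assuming a user-supplied forward model g(θ, ξ) and noise covariance Σ_ε; the function name and interface are invented for the example.

import numpy as np

def simulate_data(g, theta, xi, Sigma_eps, N_e, rng=None):
    """Draw N_e repeated measurements Y_i = g(theta, xi) + eps_i, with eps_i ~ N(0, Sigma_eps)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = np.atleast_1d(g(theta, xi))          # model prediction, shape (q,)
    noise = rng.multivariate_normal(np.zeros(mean.size), Sigma_eps, size=N_e)
    return mean[None, :] + noise                # data array of shape (N_e, q)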

  6. Simple OED example: a linear regression model
  Linear regression model: Y = Xθ + ε, where
  ◮ Y, vector of observation values
  ◮ X, the design matrix
  ◮ ε, the zero-mean error vector with known covariance matrix Σ_ε
  ◮ θ, an unknown parameter vector of interest
  Inverse problem: estimate θ.
  Classical approach:
  ◮ Least-squares estimate (arg min_θ ε^T ε): θ̂ = (X^T X)^{-1} X^T Y
  ◮ Cov(θ̂) = (X^T X)^{-1} X^T Σ_ε X (X^T X)^{-1}
  OED goal: find the design X that minimizes Cov(θ̂).
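
The following sketch (my addition) computes the least-squares estimate and its covariance from the two formulas above; the quadratic design matrix and parameter values are invented purely for illustration.

import numpy as np

def lsq_estimate_and_cov(X, Y, Sigma_eps):
    """Least-squares estimate theta_hat = (X^T X)^{-1} X^T Y and its covariance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ Y
    cov = XtX_inv @ X.T @ Sigma_eps @ X @ XtX_inv
    return theta_hat, cov

# Invented example: quadratic regression at 20 design points xi
xi = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(xi), xi, xi**2])
Sigma_eps = 0.01 * np.eye(len(xi))
rng = np.random.default_rng(0)
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.multivariate_normal(np.zeros(len(xi)), Sigma_eps)
theta_hat, cov = lsq_estimate_and_cov(X, Y, Sigma_eps)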

  7. Alphabetic optimalities
  What does it mean to minimize Λ := Cov(θ̂)?
  ◮ A-optimality: minimize the trace of the covariance matrix, tr(Λ)
  ◮ C-optimality: minimize the variance of a predefined linear combination of the quantities of interest, c^T Λ c
  ◮ D-optimality: minimize the determinant of the covariance matrix, det(Λ)
  ◮ E-optimality: minimize the maximum eigenvalue of the covariance matrix Λ
  For nonlinear models the problem is much more challenging!
  Entropy-based expected information gain in a Bayesian setting:
  Lindley, D. V. (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4), 986–1005.
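
For a given covariance matrix Λ, the four criteria are straightforward to evaluate; the sketch below (my addition) does so with an arbitrary example matrix and vector c.

import numpy as np

def alphabetic_criteria(Lambda, c):
    """Return the A-, C-, D-, and E-optimality criteria for a covariance matrix Lambda."""
    return {
        "A": float(np.trace(Lambda)),              # trace
        "C": float(c @ Lambda @ c),                # variance of c^T theta_hat
        "D": float(np.linalg.det(Lambda)),         # determinant
        "E": float(np.linalg.eigvalsh(Lambda).max()),  # largest eigenvalue
    }

Lambda = np.array([[2.0, 0.3], [0.3, 0.5]])        # arbitrary example covariance
print(alphabetic_criteria(Lambda, c=np.array([1.0, -1.0])))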

  8. Entropy-based OED criterion
  1948, information entropy definition: Claude Shannon introduced the concept of information entropy,
      Entropy = Σ_i p_i log(1/p_i),
  for probabilities p_i. The differential entropy of a continuous random variable with probability density function (pdf) p(x) and support X is
      Differential Entropy = ∫_X log(1/p(x)) p(x) dx.
  1956, Shannon entropy-based OED: Dennis V. Lindley proposed an entropy-based measure of the information provided by an experiment, the relative entropy of the posterior pdf π(θ|Y) with respect to the prior pdf π(θ):
      Relative Entropy = ∫ log(1/π(θ)) π(θ|Y) dθ − ∫ log(1/π(θ|Y)) π(θ|Y) dθ,
  where the first integral is the cross entropy and the second is the differential entropy, both taken in expectation with respect to the posterior.
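
A small numerical check (my addition) of the discrete versions of these definitions, using arbitrary probability vectors; the relative entropy is computed as cross entropy minus entropy, mirroring the decomposition above.

import numpy as np

def entropy(p):
    """Shannon entropy: sum_i p_i log(1/p_i), in nats."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(1.0 / p)))

def relative_entropy(post, prior):
    """Relative entropy (KL divergence) of 'post' with respect to 'prior': cross entropy minus entropy."""
    post, prior = np.asarray(post, float), np.asarray(prior, float)
    cross = float(np.sum(post * np.log(1.0 / prior)))
    return cross - entropy(post)

prior = np.array([0.25, 0.25, 0.25, 0.25])   # arbitrary example distributions
post = np.array([0.7, 0.1, 0.1, 0.1])
print(entropy(post), relative_entropy(post, prior))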

  9. Information gain
  The information gain (Lindley, 1956) is here defined by the Kullback–Leibler (KL) divergence between the posterior and prior pdfs,
      D_KL(π(θ|Y, ξ) ‖ π(θ)) := ∫_Θ log( π(θ|Y, ξ) / π(θ) ) π(θ|Y, ξ) dθ,
  where
  ◮ π(θ) is the prior pdf for θ
  ◮ π(θ|Y, ξ) is the posterior pdf for θ given Y for design ξ
  How to obtain the posterior pdf? Use Bayes' rule:
      π(θ|Y, ξ) = p(Y|θ, ξ) π(θ) / p(Y|ξ),
  where
  ◮ p(Y|θ, ξ) is the likelihood function for Y
  ◮ p(Y|ξ) is the evidence, i.e., p(Y|ξ) = ∫_Θ p(Y|θ, ξ) π(θ) dθ

  10. Expected Information Gain (EIG) criterion
  Information gain:
      D_KL(π(θ|Y, ξ) ‖ π(θ)) = ∫_Θ log( p(Y|θ, ξ) / p(Y|ξ) ) ( p(Y|θ, ξ) π(θ) / p(Y|ξ) ) dθ.
  The information about Y is given by the data model. The expected information gain (EIG) is
      I(ξ) := ∫ D_KL(π(θ|Y, ξ) ‖ π(θ)) p(Y) dY
            = ∫_Θ [ ∫ log( p(Y|θ, ξ) / ∫_Θ p(Y|θ, ξ) π(θ) dθ ) p(Y|θ, ξ) dY ] π(θ) dθ
            = E_θ[ E_{Y|θ}[ log( p(Y|θ, ξ) / E_θ[p(Y|θ, ξ)] ) ] ],
  where, because of the assumed data model, the likelihood follows a multivariate normal distribution,
      p(Y|θ, ξ) = det(2π Σ_ε)^{-N_e/2} exp( -(1/2) Σ_{i=1}^{N_e} ‖y_i − g(θ, ξ)‖²_{Σ_ε^{-1}} ).
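
A minimal sketch (my addition) of evaluating this Gaussian likelihood for N_e repeated measurements; it assumes the data Y, the model prediction g(θ, ξ), and Σ_ε are supplied, and it works in log space, which is a standard numerical choice rather than something the slides prescribe.

import numpy as np

def log_likelihood(Y, g_val, Sigma_eps):
    """log p(Y | theta, xi) for data Y of shape (N_e, q) and model prediction g_val of shape (q,)."""
    N_e, q = Y.shape
    Sigma_inv = np.linalg.inv(Sigma_eps)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_eps)
    residuals = Y - g_val[None, :]
    # sum_i ||y_i - g||^2 in the Sigma_eps^{-1} weighted norm
    quad = np.einsum("iq,qr,ir->", residuals, Sigma_inv, residuals)
    return -0.5 * N_e * logdet - 0.5 * quad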

  11. EIG surface (figure)

  12. Estimating the EIG
  The expected information gain can be computed separately for each ξ. EIG criterion:
      I = E_θ[ E_{Y|θ}[ log( p(Y|θ) / E_θ[p(Y|θ)] ) ] ],
  where the likelihood is given by the data model:
      p(Y|θ) = det(2π Σ_ε)^{-N_e/2} exp( -(1/2) Σ_{i=1}^{N_e} ‖y_i − g(θ)‖²_{Σ_ε^{-1}} ).
  Conventional choice: double-loop Monte Carlo sampling.

  13. Double-loop Monte Carlo (DLMC) for estimating the EIG
  The DLMC estimator for the EIG:
      I_dl := (1/N) Σ_{n=1}^{N} log( p(Y_n|θ_n) / ( (1/M) Σ_{m=1}^{M} p(Y_n|θ_{n,m}) ) ),
  where θ_n, θ_{n,m} ~ π(θ) i.i.d. and Y_n ~ p(Y|θ_n).
  Average work model: W = N × M likelihood evaluations!
  Bias and variance can be modeled (Ryan, 2003):
  Bias: |E[I_dl] − I| ≈ C_1 / M, for some constant C_1 > 0 (not unbiased!)
  Variance: V[I_dl] ≈ C_2 / N + C_3 / (N M), for some constants C_2, C_3 > 0.
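
A compact sketch (my addition, not the authors' code) of the DLMC estimator above; sample_prior, simulate_data, and log_likelihood are hypothetical user-supplied callables for θ ~ π(θ), Y | θ, and log p(Y | θ), respectively.

import numpy as np
from scipy.special import logsumexp

def dlmc_eig(sample_prior, simulate_data, log_likelihood, N, M, rng):
    """Double-loop Monte Carlo estimate of the expected information gain."""
    total = 0.0
    for _ in range(N):
        theta_n = sample_prior(rng)                 # outer sample theta_n ~ pi(theta)
        Y_n = simulate_data(theta_n, rng)           # Y_n ~ p(Y | theta_n)
        ll_outer = log_likelihood(Y_n, theta_n)     # log p(Y_n | theta_n)
        # inner loop: Monte Carlo estimate of the evidence p(Y_n)
        ll_inner = np.array([log_likelihood(Y_n, sample_prior(rng)) for _ in range(M)])
        log_evidence = logsumexp(ll_inner) - np.log(M)
        total += ll_outer - log_evidence
    return total / N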

  14. How to control the estimator error?
  Control the accuracy: choose N and M to control the estimator's accuracy.
  Goal: for a specified error tolerance TOL > 0 and a confidence level given by 0 < α ≪ 1, select the values of N and M that minimize the computational work while satisfying
      P( |I_dl(N, M) − I| ≤ TOL ) ≥ 1 − α.

  15. Total error = bias + statistical error
  Split the total error of a random estimator ℐ of I into a bias component and a statistical error component:
      |I − ℐ| ≤ |I − E[ℐ]| + |E[ℐ] − ℐ|.
  Optimization problem: minimize the estimator's work, W, subject to
      |I − E[ℐ]| ≤ (1 − κ) TOL,
      |E[ℐ] − ℐ| ≤ κ TOL,
  for κ ∈ (0, 1), where the latter constraint should hold with probability 1 − α. Using the central limit theorem (CLT), the statistical error constraint is replaced by
      sqrt(V[ℐ]) ≤ κ TOL / C_α,
  with C_α = Φ^{-1}(1 − α/2), where Φ is the cumulative distribution function (CDF) of a standard normal random variable.

  16. Optimal choice of the DLMC parameters
  Optimization problem for DLMC:
      (N*, M*, κ*) = arg min_{(N, M, κ)} N × M
      subject to   C_dl,3 / M ≤ (1 − κ) TOL   and   C_dl,1 / N + C_dl,2 / (N M) ≤ (κ / C_α)² TOL².
  Optimal choice of the number of samples (Beck et al., 2018): both constraints are active at the optimum, which gives, to leading order as TOL → 0,
      N* ∝ (C_α / κ*)² C_dl,1 TOL^{-2}   and   M* ∝ TOL^{-1}.
  Optimal average work: for a specified TOL, with the specified probability of success,
      optimal average work = N* M* ∝ TOL^{-3}, as TOL → 0.
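
The sketch below (my addition) picks N and M by making both constraints active for a fixed splitting parameter κ; the constants C_dl,1, C_dl,2, C_dl,3 and κ are illustrative placeholders, and the slides' closed-form optimum additionally optimizes over κ.

import math
from scipy.stats import norm

def dlmc_parameters(TOL, C1, C2, C3, kappa=0.5, alpha=0.05):
    """N and M that make both DLMC constraints active for a fixed splitting parameter kappa."""
    C_alpha = norm.ppf(1.0 - alpha / 2.0)             # C_alpha = Phi^{-1}(1 - alpha/2)
    M = C3 / ((1.0 - kappa) * TOL)                    # bias constraint: C3 / M = (1 - kappa) TOL
    N = (C_alpha / (kappa * TOL))**2 * (C1 + C2 / M)  # variance constraint: (C1 + C2/M)/N = (kappa TOL / C_alpha)^2
    return math.ceil(N), math.ceil(M)

# Illustrative constants only; in practice C1, C2, C3 would be estimated from pilot runs.
print(dlmc_parameters(TOL=0.01, C1=1.0, C2=1.0, C3=1.0))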

  17. Improving the constant: Laplace-based importance sampling
  Assumption: the posterior pdf of θ can be well approximated by a Gaussian.
  Laplace approximation: approximate the posterior pdf π(θ|Y) by a multivariate normal pdf π̃(θ|Y) following N(θ̂(Y), Σ̂(θ̂(Y))), i.e.,
      π̃(θ|Y) = (2π)^{-d/2} det(Σ̂)^{-1/2} exp( -(1/2) ‖θ − θ̂‖²_{Σ̂^{-1}} ),
  with the maximum a posteriori (MAP) estimate
      θ̂(Y) := arg max_{θ ∈ Θ} { -(1/2) Σ_{i=1}^{N_e} ‖Y_i − g(θ)‖²_{Σ_ε^{-1}} + log(π(θ)) },
  and
      Σ̂^{-1}(θ) = N_e J(θ)^T Σ_ε^{-1} J(θ) − ∇_θ∇_θ log(π(θ)) + O_P(√N_e),
  where J(θ) is the Jacobian of g with respect to θ.
  This optimization problem can typically be solved to sufficient accuracy with roughly 20–40 extra model solves per outer sample (indexed by n).
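
A sketch (my addition) of the Laplace step under these assumptions: the MAP point is found with a generic optimizer, and the posterior precision uses the Gauss-Newton form N_e J^T Σ_ε^{-1} J − ∇∇ log π from the slide; all function names and interfaces are placeholders.

import numpy as np
from scipy.optimize import minimize

def laplace_approximation(Y, g, jac_g, log_prior, hess_log_prior, Sigma_eps, theta0):
    """MAP estimate and Laplace covariance for data Y of shape (N_e, q)."""
    Sigma_inv = np.linalg.inv(Sigma_eps)
    N_e = Y.shape[0]

    def neg_log_post(theta):
        r = Y - np.atleast_1d(g(theta))[None, :]        # residuals Y_i - g(theta)
        return 0.5 * np.einsum("iq,qr,ir->", r, Sigma_inv, r) - log_prior(theta)

    theta_map = minimize(neg_log_post, theta0, method="Nelder-Mead").x
    J = jac_g(theta_map)                                # Jacobian of g, shape (q, d)
    # Gauss-Newton approximation of the posterior precision (cf. the slide's formula)
    precision = N_e * J.T @ Sigma_inv @ J - hess_log_prior(theta_map)
    Sigma_hat = np.linalg.inv(precision)
    return theta_map, Sigma_hat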

  18. DLMC with Laplace-based Importance Sampling (DLMCIS)
  Laplace-based importance sampling for the inner expectation:
      p(Y) = ∫ p(Y|θ) π(θ) dθ = ∫ ( p(Y|θ) π(θ) / π̃(θ|Y) ) π̃(θ|Y) dθ.
  DLMCIS estimator (Ryan et al., 2015; Beck et al., 2018):
      I_dlis := (1/N) Σ_{n=1}^{N} log( p(Y_n|θ_n) / ( (1/M) Σ_{m=1}^{M} p(Y_n|θ̃_{n,m}) π(θ̃_{n,m}) / π̃(θ̃_{n,m}|Y_n) ) ),
  where θ̃_{n,m} ~ π̃(θ|Y_n) i.i.d. (no additional bias!)
  Optimal average work: ∝ TOL^{-3} as TOL → 0. The proportionality constant is substantially smaller for DLMCIS than for DLMC! (variance reduction)
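
A sketch (my addition) of the DLMCIS estimator, reusing the hypothetical helpers from the earlier sketches; laplace_fit is assumed to map data Y_n to (θ̂, Σ̂), e.g., by wrapping the laplace_approximation sketch above, and the resulting Gaussian serves as the importance density π̃(θ | Y_n) in the inner loop.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def dlmcis_eig(sample_prior, simulate_data, log_likelihood, log_prior,
               laplace_fit, N, M, rng):
    """DLMC estimate of the EIG with Laplace-based importance sampling in the inner loop."""
    total = 0.0
    for _ in range(N):
        theta_n = sample_prior(rng)
        Y_n = simulate_data(theta_n, rng)
        theta_map, Sigma_hat = laplace_fit(Y_n)          # Laplace approximation of pi(theta | Y_n)
        tilde = multivariate_normal(mean=theta_map, cov=Sigma_hat)
        theta_is = np.asarray(tilde.rvs(size=M, random_state=rng)).reshape(M, -1)
        # importance-weighted inner estimate of the evidence p(Y_n)
        log_terms = np.array([log_likelihood(Y_n, t) + log_prior(t) - tilde.logpdf(t)
                              for t in theta_is])
        log_evidence = logsumexp(log_terms) - np.log(M)
        total += log_likelihood(Y_n, theta_n) - log_evidence
    return total / N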

  19. Numerical demonstration
  Nonlinear scalar model:
      y_i(ξ) = θ³ ξ² + θ exp(−|0.2 − ξ|) + ε_i,   for i = 1, 2, ..., N_e,
  with
  ◮ θ ~ U(0, 1)
  ◮ ε_i ~ N(0, 10^{-3}) i.i.d.
  ◮ N_e = 1 and N_e = 10
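
The slide's scalar example written out as concrete helpers compatible with the DLMC sketch above (my wiring, not the authors' code); the design ξ is fixed to an arbitrary value for illustration.

import numpy as np

xi = 0.5                                   # arbitrary fixed design for illustration
N_e = 10
sigma2 = 1e-3                              # noise variance

def g(theta, xi=xi):
    return theta**3 * xi**2 + theta * np.exp(-abs(0.2 - xi))

def sample_prior(rng):
    return rng.uniform(0.0, 1.0)           # theta ~ U(0, 1)

def simulate_data(theta, rng):
    return g(theta) + rng.normal(0.0, np.sqrt(sigma2), size=N_e)   # y_i = g(theta) + eps_i

def log_likelihood(Y, theta):
    r = Y - g(theta)
    return -0.5 * N_e * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(r**2) / sigma2

# e.g. eig = dlmc_eig(sample_prior, simulate_data, log_likelihood, N=2000, M=200,
#                     rng=np.random.default_rng(0))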

  20. Expected information gain (figure: expected information gain)

  21. Computational work for a single design (figure: average running time vs. TOL)
