Outcome-weighted sampling for Bayesian analysis


  1. Outcome-weighted sampling for Bayesian analysis. Themis Sapsis and Antoine Blanchard, Department of Mechanical Engineering, Massachusetts Institute of Technology. Funding: ONR, AFOSR, Sloan. April 23, 2020.

  2. Problems and motivation. Risk quantification; optimization under uncertainty.

  3. Challenges. Challenge I: high-dimensional parameter spaces (intrinsic instabilities, stochastic loads, random parameters). Challenge II: need for expensive models (complex dynamics; hard to isolate dynamical mechanisms).

  4. The focus of this work. Goal: develop sampling strategies appropriate for expensive models and high-dimensional parameter spaces. Models in fluids: Navier-Stokes, nonlinear Schrödinger, Euler. The critical region of parameters is unknown; importance-sampling-based methods are too expensive; and input-space PCA focuses on subspaces, which is not sufficient.

  5. Risk quantification: problem setup. x ∈ R^m: uncertain parameters, with pdf f_x. y ∈ R^d: output or quantities of interest; expensive to compute. Risk quantification problem: compute the statistics of y with the minimum number of experiments, i.e. input parameters {x_1, x_2, ..., x_N}.

  6. A Bayesian approach. Employ a linear regression model in which an input vector x of length m is multiplied by a coefficient matrix A to produce an output vector y of length d, with Gaussian noise added: y = A x + e, e ~ N(0, V). We are given a data set of pairs D = {(y_1, x_1), (y_2, x_2), ..., (y_N, x_N)}, and we set Y = [y_1, y_2, ..., y_N] and X = [x_1, x_2, ..., x_N].
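As a rough, self-contained illustration of this model (not taken from the talk), the sketch below draws synthetic data from y = A x + e; the dimensions m, d, N, the matrix A_true, and the noise covariance V are made-up values, and a scalar output is used to keep the later sketches simple.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d, N = 3, 1, 50                  # assumed input dimension, output dimension, sample count
A_true = rng.normal(size=(d, m))    # hypothetical "true" coefficient matrix
V = 0.05 * np.eye(d)                # assumed observation-noise covariance

# Draw inputs x_i (here from a standard Gaussian f_x) and noisy outputs y_i = A x_i + e_i
X = rng.normal(size=(m, N))                               # columns are x_1, ..., x_N
E = rng.multivariate_normal(np.zeros(d), V, size=N).T     # noise samples e_i ~ N(0, V)
Y = A_true @ X + E                                        # columns are y_1, ..., y_N
```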

  7. A Bayesian approach. From Bayesian regression we obtain the predictive pdf at a new input x: p(y | x, D, V) = N( S_yx S_xx^{-1} x, V(1 + c) ), c = x^T S_xx^{-1} x, with S_xx = X X^T + K and S_yx = Y X^T. Question: how do we choose the next input point x_{N+1} = h?
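A minimal sketch of these predictive formulas, reusing the arrays from the previous snippet; the small ridge term standing in for K is my own assumption.

```python
K = 1e-6 * np.eye(m)        # assumed small regularization playing the role of K

S_xx = X @ X.T + K          # S_xx = X X^T + K
S_yx = Y @ X.T              # S_yx = Y X^T

def predict(x):
    """Predictive mean and covariance of p(y | x, D, V)."""
    w = np.linalg.solve(S_xx, x)        # S_xx^{-1} x
    mean = S_yx @ w                     # S_yx S_xx^{-1} x
    c = x @ w                           # c = x^T S_xx^{-1} x
    return mean, V * (1.0 + c)

mean_new, cov_new = predict(rng.normal(size=m))
```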

  8. 1. Minimizing the model uncertainty. Given a hypothetical input point x_{N+1} = h, we have at any x: p(y | x, D', V) = N( S'_yx S'^{-1}_xx x, V(1 + c) ), c = x^T S'^{-1}_xx x, where S'_yx S'^{-1}_xx x = S_yx S_xx^{-1} x, assuming y_{N+1} = S_yx S_xx^{-1} h. We minimize the model uncertainty by choosing h so that the distribution of c converges to zero (at least for the x we are interested in): μ_c(h) = E[ x^T S'^{-1}_xx x ] = tr[ S'^{-1}_xx R_xx ] = tr[ S'^{-1}_xx C_xx ] + μ_x^T S'^{-1}_xx μ_x (valid for any f_x).
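A sketch of the μ_c(h) criterion for a Gaussian input with assumed mean μ_x and covariance C_xx; the rank-one update S'_xx = S_xx + h h^T and the random candidate-set minimizer are my reading and choice, not prescribed by the slides.

```python
mu_x = np.zeros(m)          # assumed input mean
C_xx = np.eye(m)            # assumed input covariance

def mu_c(h):
    """mu_c(h) = tr[S'^{-1}_xx C_xx] + mu_x^T S'^{-1}_xx mu_x, with S'_xx = S_xx + h h^T."""
    S_inv = np.linalg.inv(S_xx + np.outer(h, h))
    return np.trace(S_inv @ C_xx) + mu_x @ S_inv @ mu_x

# Greedy choice of the next sample: minimize mu_c over a random candidate set
candidates = rng.normal(size=(200, m))
h_next = min(candidates, key=mu_c)
```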

  9. 1. Minimizing the model uncertainty: interpretation of the sampling process. 1. The selection of the new sample does not depend on Y. 2. We diagonalize R_xx; let x̂_i, i = 1, ..., m, be the principal directions, arranged according to the eigenvalues σ_i^2 + μ_{x̂_i}^2. To minimize μ_c(h) = tr[ S'^{-1}_xx R_xx ] = Σ_{i=1}^m (σ_i^2 + μ_{x̂_i}^2) [ S'^{-1}_{x̂x̂} ]_{ii}, h ∈ S^{m−1}, we need to sample in the directions with the largest σ_i^2 + μ_{x̂_i}^2. 3. After sufficient sampling in this direction, the scheme switches to the next most important direction, and so on. 4. Emphasis is placed on input directions with large uncertainty, even those that have zero effect on the output.
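To make point 2 concrete, here is a small sketch (same assumptions as above) that diagonalizes R_xx and orders the principal directions by their eigenvalues.

```python
# Rank the principal directions of R_xx = C_xx + mu_x mu_x^T; in the slides' notation,
# the eigenvalues correspond to sigma_i^2 + mu_{x_i}^2 in the diagonalizing basis.
R_xx = C_xx + np.outer(mu_x, mu_x)
eigvals, eigvecs = np.linalg.eigh(R_xx)
order = np.argsort(eigvals)[::-1]
principal_dirs = eigvecs[:, order]      # directions the mu_c-based scheme favors first
```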

  10. 2. Maximizing the x, y mutual information. Maximize the entropy transfer, or mutual information, between the input and output variables when a new sample is added: I(x, y | D') = E_x + E_{y|D'} − E_{x,y|D'}. We have E_{x,y}(h) = ∫∫ f_xy(y, x | D') log f_xy(y, x | D') dy dx = ∫ E_{y|x}(x | D') f_x(x) dx + ∫ f_x(x) log f_x(x) dx = E_x[ E_{y|x}(D') ] + E_x.

  11. 2. Maximizing the x, y mutual information. Given a new input point x_{N+1} = h, we have at any input x: p(y | x, D', V) = N( S_yx S_xx^{-1} x, V(1 + c) ), c = x^T S'^{-1}_xx x. Therefore I(x, y | D', V) = E_y(h) − (d/2) E_x[ log(1 + c(x; h)) ] − (1/2) log|2πeV|. Note 1: valid for any distribution f_x. Note 2: hard to compute in high dimensions.

  12. 2. Maximizing the x, y mutual information: Gaussian approximation. The Gaussian approximation of the entropy criterion is I_G(x, y | D', V) = (1/2) log| V(1 + μ_c(h)) + S_yx S_xx^{-1} C_xx S_xx^{-1} S_yx^T | − (1/2) log|V| − (d/2) E_x[ log(1 + c(x; h)) ]. Note 1: the effect of Y appears only through a single scalar/vector, with no coupling to the new point h. Note 2: asymptotically (i.e. for small σ_c^2) the criterion becomes I_G(x, y | D') = (1/2) log| I + V^{-1} S_yx S_xx^{-1} C_xx S_xx^{-1} S_yx^T | − (μ_c(h)/2) ( d − tr[ (V + S_yx S_xx^{-1} C_xx S_xx^{-1} S_yx^T)^{-1} V ] ) + O(μ_c^2).
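A sketch of the Gaussian-approximated criterion, estimating E_x[ log(1 + c(x; h)) ] by Monte Carlo and reusing mu_c and the arrays from the earlier snippets; the sample size is arbitrary.

```python
def I_gauss(h, n_mc=2000):
    """Gaussian approximation I_G(x, y | D', V) for a hypothetical new point h."""
    S_inv_new = np.linalg.inv(S_xx + np.outer(h, h))          # S'^{-1}_xx
    B = S_yx @ np.linalg.inv(S_xx)                            # S_yx S_xx^{-1}
    M = V * (1.0 + mu_c(h)) + B @ C_xx @ B.T                  # V(1 + mu_c) + S_yx S_xx^{-1} C_xx S_xx^{-1} S_yx^T
    x_mc = rng.multivariate_normal(mu_x, C_xx, size=n_mc)     # Monte Carlo samples from f_x
    c_mc = np.einsum('ij,jk,ik->i', x_mc, S_inv_new, x_mc)    # c(x; h) for each sample
    return (0.5 * np.linalg.slogdet(M)[1]
            - 0.5 * np.linalg.slogdet(V)[1]
            - 0.5 * d * np.mean(np.log1p(c_mc)))
```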

  13. 3. Output-weighted optimal sampling. Let y_0 be the random variable defined by the mean model, y_0 ≜ S_yx S_xx^{-1} x. We define the perturbed model y_+ ≜ S_yx S_xx^{-1} x + β r_V V(1 + x^T S'^{-1}_xx x), where β is a scaling factor to be chosen later and r_V is the most dominant eigenvector of V. We define the distance (Mohamad & Sapsis, PNAS, 2018) D_Log1(y_+ ∥ y_0; h) = ∫_{S_y} | log f_{y_+}(y; h) − log f_{y_0}(y) | dy, where S_y is a finite sub-domain of y.
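A rough numerical sketch of the D_Log1 distance for a scalar output, evaluated on a finite sub-domain S_y by a simple Riemann sum; the two Gaussian log-pdfs are placeholders, not the actual f_{y_0} and f_{y_+} of the talk.

```python
def d_log1(logpdf_plus, logpdf_0, y_grid):
    """D_Log1 = integral over S_y of |log f_{y+}(y) - log f_{y0}(y)| dy, via a Riemann sum."""
    dy = y_grid[1] - y_grid[0]
    return np.sum(np.abs(logpdf_plus(y_grid) - logpdf_0(y_grid))) * dy

def gauss_logpdf(y, mu, sig):
    return -0.5 * np.log(2.0 * np.pi * sig**2) - 0.5 * ((y - mu) / sig) ** 2

y_grid = np.linspace(-4.0, 4.0, 801)                       # assumed finite sub-domain S_y
dist = d_log1(lambda y: gauss_logpdf(y, 0.2, 1.1),         # stand-in for f_{y+}
              lambda y: gauss_logpdf(y, 0.0, 1.0),         # stand-in for f_{y0}
              y_grid)
```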

  14. 3. Output-weighted optimal sampling. We can show that for bounded pdfs D_KL(y_+ ∥ y_0; h) ≤ κ D_Log1(y_+ ∥ y_0; h), where κ is a constant. D_Log1 is therefore more conservative than the KL divergence, and it gives significantly improved performance in terms of convergence for f_y. However, the criterion D_Log1(y_+ ∥ y_0) is hard to compute and optimize directly.

  15. 3. Output-weighted optimal sampling. Under appropriate smoothness conditions, standard inequalities for derivatives of smooth functions give (Sapsis, Proc. Roy. Soc. A, 2020): lim_{β→0} D_Log1(y_+ ∥ y_0; h) ≤ κ_0 ∫ ( f_x(x) / f_{y_0}(y_0(x)) ) σ_y^2(x; h) dx.

  16. 3. Output-weighted optimal sampling. We define the output-weighted model-error criterion Q[h] ≜ ∫ ( f_x(x) / f_{y_0}(y_0(x)) ) σ_y^2(x; h) dx. 1. The model error is weighted according to the importance (probability) of the input. 2. The model error is inversely weighted according to the probability of the output: emphasis is given to outputs with low probability (rare events). A related criterion (Verdinelli & Kadane, 1992): U(D') = q_1 ∫ y_0(x) dx + q_2 E_{xy|D'}.
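A Monte Carlo sketch of Q[h], rewriting the integral as E_x[ σ_y^2(x; h) / f_{y_0}(y_0(x)) ]; the kernel-density estimate of f_{y_0} (and its bandwidth) is my own device for illustration, and a scalar output is assumed.

```python
def Q_criterion(h, n_mc=2000, bandwidth=0.25):
    """Monte Carlo estimate of Q[h] = E_x[ sigma_y^2(x; h) / f_{y0}(y_0(x)) ] for scalar y."""
    S_inv_new = np.linalg.inv(S_xx + np.outer(h, h))
    b = (S_yx @ np.linalg.inv(S_xx)).ravel()                   # mean-model coefficients S_yx S_xx^{-1}
    x_mc = rng.multivariate_normal(mu_x, C_xx, size=n_mc)
    y0 = x_mc @ b                                              # mean-model outputs y_0(x)
    # crude Gaussian kernel-density estimate of f_{y0} at the sampled outputs
    f_y0 = np.mean(np.exp(-0.5 * ((y0[:, None] - y0[None, :]) / bandwidth) ** 2), axis=1) \
           / (bandwidth * np.sqrt(2.0 * np.pi))
    sigma2_y = V[0, 0] * (1.0 + np.einsum('ij,jk,ik->i', x_mc, S_inv_new, x_mc))
    return np.mean(sigma2_y / f_y0)
```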

  17. 3. Output-weighted optimal sampling: approximation of the criterion. Q[σ_y^2] ≜ ∫ ( f_x(x) / f_{y_0}(y_0(x)) ) σ_y^2(x; h) dx. Denominator approximation in S_y, for symmetric f_y and scalar y: f_{y_0}^{-1}(y) ≃ p_1 + p_2 (y − μ_y)^2, where p_1, p_2 are constants chosen so that the mean-square error is minimized. Employing a Gaussian approximation for f_{y_0} (only for this step) over the interval S_y = [μ_y, μ_y + βσ_y], we obtain p_1 = √(2π) σ_y and p_2 = ( 5√(2π) / (β^5 σ_y) ) ( ∫_0^β z^2 e^{z^2/2} dz − β^3/3 ).
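A sketch of the p_1, p_2 constants as written above, using a trapezoidal rule for the integral; the values of β and σ_y are illustrative.

```python
def p_coeffs(beta, sigma_y, n_quad=2000):
    """p1 = sqrt(2 pi) sigma_y;  p2 = 5 sqrt(2 pi)/(beta^5 sigma_y) * (int_0^beta z^2 e^{z^2/2} dz - beta^3/3)."""
    z = np.linspace(0.0, beta, n_quad)
    f = z**2 * np.exp(z**2 / 2.0)
    integral = np.sum((f[1:] + f[:-1]) / 2.0) * (z[1] - z[0])   # trapezoidal rule
    p1 = np.sqrt(2.0 * np.pi) * sigma_y
    p2 = 5.0 * np.sqrt(2.0 * np.pi) / (beta**5 * sigma_y) * (integral - beta**3 / 3.0)
    return p1, p2

p1, p2 = p_coeffs(beta=2.0, sigma_y=1.0)    # beta and sigma_y values are illustrative
```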

  18. 3. Output-weighted optimal sampling: approximation of the criterion. Collecting all the computed terms, we obtain (for Gaussian x): (1/σ_V^2) Q_{βσ_y}(h) = p_1(β) ( 1 + tr[ S'^{-1}_xx C_xx ] + μ_x^T S'^{-1}_xx μ_x ) + p_2(β) c_0 ( 1 + μ_x^T S'^{-1}_xx μ_x − tr[ S'^{-1}_xx C_xx ] ) + 2 p_2 tr[ S_xx^{-1} S_yx^T S_yx S_xx^{-1} C_xx S'^{-1}_xx C_xx ]. For zero-mean input we have (1/σ_V^2) Q_{βσ_y}(h) = (p_1 − p_2 c_0) tr[ S'^{-1}_xx C_xx ] + 2 p_2 tr[ S'^{-1}_xx C_xx S_xx^{-1} S_yx^T S_yx S_xx^{-1} C_xx ] + const.
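A sketch of the zero-mean closed form, dropping the additive constant; the constant c_0 is taken as a given input because its definition is not reproduced in this transcript.

```python
def Q_zero_mean(h, c0, p1, p2):
    """(1/sigma_V^2) Q_{beta sigma_y}(h), up to the additive constant, for zero-mean Gaussian input.
    c0 is the constant from the talk's derivation (taken as given here)."""
    S_xx_inv = np.linalg.inv(S_xx)
    S_inv_new = np.linalg.inv(S_xx + np.outer(h, h))
    B = S_xx_inv @ S_yx.T @ S_yx @ S_xx_inv                    # S_xx^{-1} S_yx^T S_yx S_xx^{-1}
    return ((p1 - p2 * c0) * np.trace(S_inv_new @ C_xx)
            + 2.0 * p2 * np.trace(S_inv_new @ C_xx @ B @ C_xx))
```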

  19. 3. Output-weighted optimal sampling: gradient of the criterion. For general functions of the form λ[h] = tr[ S'^{-1}_xx C ], where C is a symmetric matrix, the gradient takes the form ∂λ/∂h_k = −2 [ h^T S'^{-1}_xx C S'^{-1}_xx ]_k.
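A sketch of this gradient together with a finite-difference check, again assuming S'_xx = S_xx + h h^T.

```python
def lam(h, C):
    """lambda[h] = tr[ S'^{-1}_xx C ] with S'_xx = S_xx + h h^T."""
    return np.trace(np.linalg.inv(S_xx + np.outer(h, h)) @ C)

def grad_lam(h, C):
    """Gradient of lambda[h]:  -2 S'^{-1}_xx C S'^{-1}_xx h  (column-vector form of the expression above)."""
    S_inv = np.linalg.inv(S_xx + np.outer(h, h))
    return -2.0 * S_inv @ C @ S_inv @ h

# finite-difference check of one component
h0, k, eps = rng.normal(size=m), 0, 1e-6
e_k = np.eye(m)[k]
fd = (lam(h0 + eps * e_k, C_xx) - lam(h0 - eps * e_k, C_xx)) / (2.0 * eps)
assert np.isclose(fd, grad_lam(h0, C_xx)[k], rtol=1e-3, atol=1e-8)
```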

  20. Example 1: two-dimensional input. y(x) = â_1 x_1 + â_2 x_2 + ε, where x ~ N(0, diag(σ_1^2, σ_2^2)) and V = 0.05. Case I: â_1 = 0.8, â_2 = 1.3, with σ_1^2 = 1.4, σ_2^2 = 0.6. Case II: â_1 = 0.01, â_2 = 2.0, with σ_1^2 = 2.0, σ_2^2 = 0.2.
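A sketch of the Case I setup (Case II only changes the numbers); the initial design size is my own choice.

```python
# Case I of the 2D example
a_hat = np.array([0.8, 1.3])
C_in = np.diag([1.4, 0.6])              # input covariance diag(sigma_1^2, sigma_2^2)
noise_var = 0.05                        # observation-noise variance V

def sample_case1(n):
    x = rng.multivariate_normal(np.zeros(2), C_in, size=n)
    y = x @ a_hat + np.sqrt(noise_var) * rng.normal(size=n)
    return x, y

x_init, y_init = sample_case1(10)       # a small initial design
```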

  21. Results for the 2D problem.

  22. Example 2: a 20-dimensional input. y(x) = Σ_{m=1}^{20} â_m x_m + ε, where x_m ~ N(0, σ_m^2), m = 1, ..., 20, with â_m = ( 1 + 40 (m/10)^3 ) × 10^{-3} and σ_m^2 = ( 1/4 + (1/128)(m − 10)^3 ) × 10^{-1}, m = 1, ..., 20. For the observation noise we consider two cases: Case I: σ_ε^2 = 0.05 (accurate observations); Case II: σ_ε^2 = 0.5 (noisy observations).

  23. Example 2: a 20-dimensional input. Coefficients â_m of the map y(x) (black curve), plotted together with the variance σ_m^2 of each input direction (red curve).

  24. Example 2: a 20-dimensional input. Performance of the two adaptive approaches, based on μ_c and Q_∞.

  25. Example 2: a 20-dimensional input. Energy of the different components of h with respect to the number of iterations N, for Case I of the high-dimensional problem.

  26. Optimal sampling for nonlinear regression. Let the input x ∈ X ⊂ R^m be expressed as a function of another variable z ∈ Z ⊂ R^s, where z has distribution f_z and Z is a compact set. We choose a set of basis functions and write x = φ(z). The distribution of the output values is then p(y | z, D, V) = N( S_yφ S_φφ^{-1} φ(z), V(1 + c) ), c = φ(z)^T S_φφ^{-1} φ(z), with S_φφ = Σ_{i=1}^N φ(z_i) φ(z_i)^T.
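A sketch of the basis-function version: build S_φφ and S_yφ from features φ(z_i) and evaluate the predictive distribution. The quadratic feature map, the placeholder outputs, and the noise level are assumptions for illustration.

```python
def phi(z):
    """A hypothetical feature map phi: R^s -> R^p (quadratic monomials; the basis choice is an assumption)."""
    z = np.atleast_1d(z)
    return np.concatenate(([1.0], z, z**2))

s, N_z = 2, 40
Z_train = rng.uniform(-1.0, 1.0, size=(N_z, s))           # training inputs z_i in a compact set Z
Phi = np.array([phi(z) for z in Z_train])                 # rows are phi(z_i)^T
y_train = rng.normal(size=N_z)                            # placeholder scalar outputs (not real data)
noise = 0.05                                              # assumed noise variance V

S_phiphi = Phi.T @ Phi + 1e-6 * np.eye(Phi.shape[1])      # S_phiphi = sum_i phi(z_i) phi(z_i)^T (+ small ridge)
S_yphi = y_train @ Phi                                    # S_yphi = sum_i y_i phi(z_i)^T

def predict_z(z):
    """Predictive mean and variance of p(y | z, D, V) in the basis-function model."""
    f = phi(z)
    w = np.linalg.solve(S_phiphi, f)
    return S_yphi @ w, noise * (1.0 + f @ w)

mean_z, var_z = predict_z(np.array([0.3, -0.2]))
```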
