Transport methods for sampling: low-dimensional structure and preconditioning

Youssef Marzouk
joint work with Daniele Bigoni, Matthew Parno, Alessio Spantini, & Olivier Zahm

Department of Aeronautics and Astronautics
Center for Computational Engineering
Statistics and Data Science Center
Massachusetts Institute of Technology
http://uqgroup.mit.edu

Support from AFOSR, DARPA, DOE

12 July 2019
Motivation: Bayesian inference in large-scale models

Observations $y$, parameters $x$:
$$\pi_{\mathrm{pos}}(x) := \pi(x \mid y) \;\propto\; \underbrace{\pi(y \mid x)\,\pi_{\mathrm{pr}}(x)}_{\text{Bayes' rule}}$$

◮ Characterize the posterior distribution (density $\pi_{\mathrm{pos}}$)
◮ This is a challenging task since:
  ◮ $x \in \mathbb{R}^n$ is typically high-dimensional (e.g., a discretized function)
  ◮ $\pi_{\mathrm{pos}}$ is non-Gaussian
  ◮ evaluations of the likelihood (hence of $\pi_{\mathrm{pos}}$) may be expensive
  ◮ $\pi_{\mathrm{pos}}$ can be evaluated only up to a normalizing constant
Computational challenges

◮ Extract information from the posterior (means, covariances, event probabilities, predictions) by evaluating posterior expectations:
$$\mathbb{E}_{\pi_{\mathrm{pos}}}[h(x)] = \int h(x)\,\pi_{\mathrm{pos}}(x)\,dx$$
◮ Key strategy for making this computationally tractable: efficient and structure-exploiting sampling schemes
◮ This talk: relate these to notions of coupling and transport. . .
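In practice such expectations are approximated by Monte Carlo averages over posterior samples; written out for reference (the standard estimator, not a formula from the slides):

$$\mathbb{E}_{\pi_{\mathrm{pos}}}[h(x)] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} h\big(x^{(i)}\big), \qquad x^{(i)} \sim \pi_{\mathrm{pos}},$$

which is why efficient generation of (ideally independent, unweighted) posterior samples is the key computational primitive.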
Deterministic couplings of probability measures

Core idea
◮ Choose a reference distribution $\eta$ (e.g., standard Gaussian)
◮ Seek a transport map $T : \mathbb{R}^n \to \mathbb{R}^n$ such that $T_\sharp \eta = \pi$
◮ Equivalently, find $S = T^{-1}$ such that $S_\sharp \pi = \eta$
◮ In principle, this enables exact (independent, unweighted) sampling!
◮ Satisfying these conditions only approximately can still be useful!
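To make the sampling mechanism concrete, here is a minimal sketch (an illustration, not from the slides) for the one setting where the map is known in closed form: a lower-triangular linear map pushing a standard Gaussian reference $\eta$ to a Gaussian target $\pi$. The specific mean, covariance, and test function are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: in the linear-Gaussian case the transport map is known in
# closed form. If eta = N(0, I_n) and pi = N(mu, Sigma), then T(x) = mu + L x,
# with L the (lower-triangular) Cholesky factor of Sigma, satisfies T_# eta = pi.
rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])                     # illustrative target mean
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                 # illustrative target covariance
L = np.linalg.cholesky(Sigma)

def T(x):
    """Push reference samples x ~ N(0, I) to target samples T(x) ~ N(mu, Sigma)."""
    return x @ L.T + mu

# Exact, independent, unweighted samples from pi:
samples = T(rng.standard_normal((10_000, 2)))

# Expectations E_pi[h(x)] then become plain Monte Carlo averages:
print(samples.mean(axis=0))                    # ~ mu
print(np.mean(samples[:, 0] ** 2))             # ~ Sigma[0,0] + mu[0]**2 = 3.0
```

For non-Gaussian targets no closed form exists, which is exactly what motivates the constructions that follow.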
Choice of transport map

A useful building block is the Knothe–Rosenblatt rearrangement:

$$T(x) = \begin{bmatrix} T^1(x_1) \\ T^2(x_1, x_2) \\ \vdots \\ T^n(x_1, x_2, \ldots, x_n) \end{bmatrix}$$

◮ Unique triangular and monotone map satisfying $T_\sharp \eta = \pi$ for absolutely continuous $\eta, \pi$ on $\mathbb{R}^n$
◮ Jacobian determinant easy to evaluate
◮ Monotonicity is essentially one-dimensional: $\partial_{x_k} T^k > 0$
◮ "Exposes" marginals, enables conditional sampling. . .
◮ Numerical approximations can employ a monotone parameterization guaranteeing $\partial_{x_k} T^k > 0$; for example (see the sketch below):

$$T^k(x_1, \ldots, x_k) = a_k(x_1, \ldots, x_{k-1}) + \int_0^{x_k} \exp\big(b_k(x_1, \ldots, x_{k-1}, w)\big)\, dw$$
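As an illustration of this integrated-exponential parameterization (an assumption-laden sketch, not the slides' implementation): take $a_k$ and $b_k$ to be affine and evaluate the integral with Gauss–Legendre quadrature. The basis choice, coefficient layout, and quadrature rule are all assumptions.

```python
import numpy as np

# Minimal sketch of one monotone map component
#   T^k(x_1..x_k) = a_k(x_1..x_{k-1}) + int_0^{x_k} exp(b_k(x_1..x_{k-1}, w)) dw,
# which is strictly increasing in x_k because the integrand exp(.) is positive.
# Here a_k and b_k are affine in their arguments -- an illustrative choice only.

def Tk(x_prev, xk, coeff_a, coeff_b, n_quad=32):
    """Evaluate one triangular map component at (x_prev, xk).

    x_prev  : array of shape (k-1,), the preceding variables x_1..x_{k-1}
    coeff_a : (k,) coefficients of affine a_k: [bias, weights on x_prev]
    coeff_b : (k+1,) coefficients of affine b_k: [bias, weights on x_prev, weight on w]
    """
    a = coeff_a[0] + coeff_a[1:] @ x_prev

    # Gauss-Legendre quadrature for int_0^{xk} exp(b_k) dw; the affine change
    # of variables w = (xk/2)(t+1) maps [-1,1] to [0, xk] (signed, so xk < 0 works).
    t, wts = np.polynomial.legendre.leggauss(n_quad)
    w = 0.5 * xk * (t + 1.0)
    b = coeff_b[0] + coeff_b[1:-1] @ x_prev + coeff_b[-1] * w
    return a + 0.5 * xk * np.sum(wts * np.exp(b))

# Example: a component T^2(x1, x2), monotone in x2 by construction.
print(Tk(np.array([0.3]), 1.0,
         coeff_a=np.array([0.1, 0.5]),
         coeff_b=np.array([0.0, 0.2, 0.4])))
```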
How to construct triangular maps?

Construction #1: "maps from densities," i.e., a variational characterization of the direct map $T$ [Moselhy & Marzouk 2012]

$$\min_{T \in \mathcal{T}^h_\triangle} D_{\mathrm{KL}}\big(T_\sharp \eta \,\|\, \pi\big) \;=\; \min_{T \in \mathcal{T}^h_\triangle} D_{\mathrm{KL}}\big(\eta \,\|\, T^{-1}_\sharp \pi\big)$$

◮ $\pi$ is the "target" density on $\mathbb{R}^n$; $\eta$ is, e.g., $\mathcal{N}(0, I_n)$
◮ $\mathcal{T}^h_\triangle$ is a set of monotone lower triangular maps; $\mathcal{T}^{h\to\infty}_\triangle$ contains the Knothe–Rosenblatt rearrangement
◮ The resulting expectation is with respect to the reference measure $\eta$; compute it via, e.g., Monte Carlo or sparse quadrature
◮ Uses unnormalized evaluations of $\pi$ and its gradients
◮ No MCMC or importance sampling required
◮ In general non-convex, unless $\pi$ is log-concave
Illustrative example

$$\min_{T}\ \mathbb{E}_\eta\Big[-\log \pi \circ T \;-\; \sum_k \log \partial_{x_k} T^k\Big]$$

◮ Parameterized map $T \in \mathcal{T}^h_\triangle \subset \mathcal{T}_\triangle$
◮ Optimize over coefficients of the parameterization (see the sketch after this example)
◮ Use gradient-based optimization
◮ The posterior is in the tail of the reference

[Figure: 2-D illustration of the pushforward $T_\sharp \eta$ being optimized toward the target density]
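A minimal sketch of this "maps from densities" optimization (illustrative assumptions throughout: a lower-triangular *linear* map, a correlated Gaussian as a stand-in for the unnormalized target, and BFGS with numerical gradients rather than the slides' setup):

```python
import numpy as np
from scipy.optimize import minimize

# Fit T(x) = c + L x, with L lower triangular and positive diagonal, by
# minimizing the sample-average KL objective
#   E_eta[ -log pi(T(x)) - sum_k log dT^k/dx_k ],
# using only *unnormalized* evaluations of the target density.

rng = np.random.default_rng(1)
xs = rng.standard_normal((2000, 2))      # fixed reference samples, eta = N(0, I)

def log_pi_bar(z):
    """Unnormalized log target: a correlated Gaussian stand-in for a posterior."""
    Sigma_inv = np.linalg.inv(np.array([[2.0, 0.8], [0.8, 1.0]]))
    d = z - np.array([1.0, -2.0])
    return -0.5 * np.einsum('ni,ij,nj->n', d, Sigma_inv, d)

def unpack(theta):
    c = theta[:2]
    # log-parameterize the diagonal so that dT^k/dx_k = exp(.) > 0
    L = np.array([[np.exp(theta[2]), 0.0],
                  [theta[3], np.exp(theta[4])]])
    return c, L

def objective(theta):
    c, L = unpack(theta)
    Tx = xs @ L.T + c
    log_det = theta[2] + theta[4]        # log of the (constant) Jacobian determinant
    return np.mean(-log_pi_bar(Tx)) - log_det

res = minimize(objective, x0=np.zeros(5), method='BFGS')
c, L = unpack(res.x)
print(c)            # ~ target mean
print(L @ L.T)      # ~ target covariance
```

Since $T_\sharp\eta = \mathcal{N}(c, LL^\top)$ here, the minimizer recovers the target's mean and covariance exactly; nonlinear parameterizations follow the same pattern.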
Useful features

◮ Move samples; don't just reweight them
◮ Independent and cheap samples: $x_i \sim \eta \Rightarrow T(x_i)$
◮ Clear convergence criterion, even with an unnormalized target density $\bar\pi$ (a numerical sketch follows below):

$$D_{\mathrm{KL}}\big(T_\sharp \eta \,\|\, \pi\big) \;\approx\; \frac{1}{2}\,\mathrm{Var}_\eta\!\left[\log \frac{T^{-1}_\sharp \bar{\pi}}{\eta}\right]$$

◮ Can either accept the bias or reduce it by:
  ◮ increasing the complexity of the map $T \in \mathcal{T}^h_\triangle$
  ◮ sampling the pullback $T^{-1}_\sharp \pi$ using MCMC or importance sampling (more on this later)
◮ Related transport constructions for inference and sampling: Stein variational gradient descent [Liu & Wang 2016; Detommaso et al. 2018], normalizing flows [Rezende & Mohamed 2015], SOS polynomial flow [Jaini et al. 2019], Gibbs flow [Heng et al. 2015], particle flow filter [Reich 2011], implicit sampling [Chorin et al. 2009–2015], etc.
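A minimal numerical sketch of the variance diagnostic (an illustration under the same linear-Gaussian assumptions as the earlier sketches; for an exact map the diagnostic is ≈ 0):

```python
import numpy as np

# Variance diagnostic, estimated from reference samples x_i ~ eta:
#   D_KL(T_# eta || pi) ~= 0.5 * Var_eta[ log( T^{-1}_# pibar / eta ) ],
# where the pullback density is T^{-1}_# pibar(x) = pibar(T(x)) |det DT(x)|.
# Only the *unnormalized* target pibar is needed: the unknown normalizing
# constant shifts the log-ratio by a constant and drops out of the variance.

def kl_diagnostic(xs, T, log_det_DT, log_pi_bar):
    log_eta = -0.5 * np.sum(xs**2, axis=1) - 0.5 * xs.shape[1] * np.log(2 * np.pi)
    log_ratio = log_pi_bar(T(xs)) + log_det_DT(xs) - log_eta
    return 0.5 * np.var(log_ratio)

# Check with the exact linear-Gaussian map from the earlier sketch:
rng = np.random.default_rng(2)
xs = rng.standard_normal((5000, 2))
mu = np.array([1.0, -2.0])
L = np.linalg.cholesky(np.array([[2.0, 0.8], [0.8, 1.0]]))
P = np.linalg.inv(L @ L.T)   # target precision matrix
log_pi_bar = lambda z: -0.5 * np.einsum('ni,ij,nj->n', z - mu, P, z - mu)
print(kl_diagnostic(xs,
                    T=lambda x: x @ L.T + mu,
                    log_det_DT=lambda x: np.full(len(x), np.log(np.linalg.det(L))),
                    log_pi_bar=log_pi_bar))   # ~ 0 for the exact map
```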
How to construct triangular maps?

Construction #2: "maps from samples"

$$\min_{S \in \mathcal{S}^h_\triangle} D_{\mathrm{KL}}\big(S_\sharp \pi \,\|\, \eta\big) \;=\; \min_{S \in \mathcal{S}^h_\triangle} D_{\mathrm{KL}}\big(\pi \,\|\, S^{-1}_\sharp \eta\big)$$

◮ Suppose we have Monte Carlo samples $\{x_i\}_{i=1}^M \sim \pi$
◮ For a standard Gaussian $\eta$, this problem is convex and separable
◮ This is density estimation via transport! (cf. Tabak & Turner 2013)
◮ Equivalent to maximum likelihood estimation of $S$:

$$\widehat{S} \in \arg\max_{S \in \mathcal{S}^h_\triangle} \frac{1}{M} \sum_{i=1}^M \log \underbrace{S^{-1}_\sharp \eta}_{\text{pullback}}(x_i), \qquad \eta = \mathcal{N}(0, I_n)$$

◮ Each component $\widehat{S}^k$ of $\widehat{S}$ can be computed separately, via smooth convex optimization (see the sketch below):

$$\widehat{S}^k \in \arg\min_{S^k \in \mathcal{S}^h_{\triangle,k}} \frac{1}{M} \sum_{i=1}^M \left[\frac{1}{2} S^k(x_i)^2 - \log \partial_k S^k(x_i)\right]$$
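A minimal sketch of one such single-component fit (illustrative: an affine first component $S^1(x_1) = t_0 + t_1 x_1$ with $t_1 > 0$, fit to synthetic samples; richer monotone parameterizations follow the same pattern):

```python
import numpy as np
from scipy.optimize import minimize

# Fit the first component S^1(x1) = t0 + t1*x1 of a triangular map from
# samples of pi, minimizing
#   (1/M) sum_i [ 0.5 * S^1(x_i)^2 - log(dS^1/dx1) ],   dS^1/dx1 = t1 > 0.
# The objective is smooth and convex in (t0, t1); t1 is log-parameterized
# here only so the optimization can run unconstrained.

rng = np.random.default_rng(3)
x1 = 1.0 + 2.0 * rng.standard_normal(5000)   # synthetic first marginal of pi

def objective(theta):
    t0, log_t1 = theta
    s = t0 + np.exp(log_t1) * x1
    return np.mean(0.5 * s**2) - log_t1

res = minimize(objective, x0=np.zeros(2), method='BFGS')
t0, t1 = res.x[0], np.exp(res.x[1])
# The fitted component standardizes the marginal, S^1(x1) ~ N(0, 1),
# so t1 ~ 1/std(x1) = 0.5 and t0 ~ -mean(x1)/std(x1) = -0.5.
print(t0, t1)
```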
Low-dimensional structure of transport maps

Underlying challenge: maps in high dimensions
◮ Major bottleneck: representation of the map, e.g., the cardinality of the map basis
◮ How can we make the construction/representation of high-dimensional transports tractable?