Meta Particle Flow for Sequential Bayesian Inference

Le Song
Associate Professor, CSE
Associate Director, Machine Learning Center
Georgia Institute of Technology

Joint work with Xinshi Chen and Hanjun Dai
Bayesian Inference

Infer the posterior distribution of an unknown parameter x given
• Prior distribution π(x)
• Likelihood function p(o|x)
• Observations o_1, o_2, …, o_t

[Figure: graphical model — parameter x generating observations o_1, o_2, …, o_t]

  p(x | o_{1:t}) = (1/Z) π(x) ∏_{i=1}^{t} p(o_i | x),   Z = ∫ π(x) ∏_{i=1}^{t} p(o_i | x) dx

Computing this posterior is a challenging computational problem for high-dimensional x.
Challenges in Bayesian Inference

Gaussian mixture model
• Prior: x_1, x_2 ~ π(x) = N(0, I)
• Observations: o | x_1, x_2 ~ p(o | x_1, x_2) = (1/2) N(x_1, 1) + (1/2) N(x_1 + x_2, 1)
• With (x_1, x_2) = (1, −2), the resulting posterior has two modes: (1, −2) and (−1, 2)
• Fitting even a single posterior p(x | o_{1:t}) is already not easy [results reported by Dai et al. (2016)]

[Figure: posterior density plots — (a) True posterior, (b) Stochastic Variational Inference, (c) Stochastic Gradient Langevin Dynamics, (d) Gibbs Sampling, (e) One-pass SMC]
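As a concrete illustration (not from the slides), a minimal NumPy/SciPy sketch that draws synthetic observations from this mixture likelihood and evaluates the unnormalized posterior on a 2-D grid; the sample size, grid range, and random seed are my own choices:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Ground-truth parameters and synthetic observations from the mixture likelihood.
x_true = np.array([1.0, -2.0])
t = 100
comp = rng.integers(0, 2, size=t)                               # mixture component per draw
means = np.where(comp == 0, x_true[0], x_true[0] + x_true[1])
obs = means + rng.normal(size=t)

# Unnormalized log-posterior on a 2-D grid: log prior + sum of log-likelihoods.
grid = np.linspace(-3, 3, 200)
X1, X2 = np.meshgrid(grid, grid)
log_post = multivariate_normal(np.zeros(2), np.eye(2)).logpdf(np.dstack([X1, X2]))
for o in obs:
    log_post += np.log(0.5 * norm.pdf(o, X1, 1.0) + 0.5 * norm.pdf(o, X1 + X2, 1.0))
post = np.exp(log_post - log_post.max())        # two modes near (1, -2) and (-1, 2)
```

The grid evaluation sidesteps Z only because x is 2-dimensional; in high dimensions neither the grid nor the normalizer is available, which is exactly the challenge the talk addresses.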
Fundamental Principle for Machine Learning

Lots of applications in machine learning:
• Hidden Markov models — true locations x_1, x_2, …, x_t; sensor measurements o_1, o_2, …, o_t
• Topic modeling — infer topics and topic assignments z from observed words w
• Uncertainty quantification — e.g., a nonlinear population model

  y_{t+1} = P y_{t−τ} exp(−y_{t−τ}/y_0) e_t + y_t exp(−δ ε_t),
  e_t ~ Γ(σ_p^{−2}, σ_p^2),  ε_t ~ Γ(σ_d^{−2}, σ_d^2),

  with parameters x = (P, y_0, σ_p, σ_d, τ, δ)
Sequential Bayesian Inference

Online Bayesian inference: observations o_1, o_2, …, o_t arrive sequentially.

An ideal algorithm should:
• Efficiently update p(x | o_{1:t}) to p(x | o_{1:t+1}) when o_{t+1} is observed
• Avoid storing all historical observations o_1, o_2, …, o_t

  p(x | o_{1:t}) ∝ p(x | o_{1:t−1}) · p(o_t | x)
  (updated posterior ∝ current posterior × likelihood)

[Figure: prior π(x) → p(x | o_1) → p(x | o_{1:2}) → … → p(x | o_{1:t}) as o_1, o_2, …, o_t arrive]
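A minimal sketch of this recursion for a 1-D parameter on a density grid (a toy of my own, assuming a Gaussian prior and likelihood and made-up observations); note that it stores only the current posterior, never the history:

```python
import numpy as np
from scipy.stats import norm

# Running posterior as a vector of densities on a 1-D grid.
grid = np.linspace(-5, 5, 1001)
dx = grid[1] - grid[0]
posterior = norm.pdf(grid, 0.0, 1.0)            # start from the prior pi(x) = N(0, 1)

def sequential_update(posterior, o, sigma=1.0):
    """One Bayes update: p(x|o_{1:t}) ∝ p(x|o_{1:t-1}) * p(o_t|x)."""
    unnorm = posterior * norm.pdf(o, grid, sigma)
    return unnorm / (unnorm.sum() * dx)         # renormalize on the grid

for o in [0.8, 1.1, 0.9]:                        # observations arrive one at a time
    posterior = sequential_update(posterior, o)
```

The grid representation only scales to a few dimensions, which motivates carrying a particle set instead.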
Related Work

• MCMC
  – requires a complete scan of the data
• Variational Inference (VI)
  – requires re-optimization for every new observation
• Stochastic approximate inference
  – prescribed algorithms to optimize the final posterior p(x | o_{1:t})
  – cannot exploit the structure of the sequential inference problem
Related Work

Sequential Monte Carlo (Doucet et al., 2001; Balakrishnan & Madigan, 2006)
  – the state of the art for online Bayesian inference
  – but suffers from the path degeneracy problem in high dimensions
  – rejuvenation steps can help but violate the online constraints (Canini et al., 2009)

Can we learn to perform efficient and effective sequential Bayesian updates?
Operator View

Kernel Bayes' Rule (Fukumizu et al., 2012)
  – represents the posterior as an embedding μ_t = E_{p(x|o_{1:t})}[φ(x)]
  – updates the current embedding to the new one: μ_{t+1} = K(μ_t, o_{t+1})
  – views the Bayes update as an operator in a reproducing kernel Hilbert space (RKHS)
  – conceptually nice but limited in practice
Our Approach: Bayesian Inference as Particle Flow

Particle flow
• Start with N particles X_0 = {x_0^1, …, x_0^N}, sampled i.i.d. from the prior π(x)
• Transport the particles to the next posterior via the solution of an initial value problem (IVP):

  dx/dτ = f(X_0, o_1, x(τ)),  ∀τ ∈ [0, T],  with x(0) = x_0^n

  ⟹ solution x_1^n = x(T) = x_0^n + ∫_0^T f(X_0, o_1, x(τ)) dτ

  X_0 = {x_0^1, …, x_0^N} ~ π(x)  ⟶  X_1 = {x_1^1, …, x_1^N} ~ p(x | o_1)
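A runnable sketch of the IVP view using SciPy's ODE solver. The learned velocity of the talk is replaced here by a hand-constructed linear f that provably transports a 1-D Gaussian prior to the corresponding Gaussian posterior, so the result can be checked in closed form; all constants are illustrative assumptions:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

# 1-D Gaussian toy: prior N(0, s0^2), likelihood N(x, s^2), one observation o.
s0, s, o = 2.0, 1.0, 1.5
post_var = 1.0 / (1.0 / s0**2 + 1.0 / s**2)      # closed-form Gaussian posterior
post_mean = post_var * o / s**2
post_sd = np.sqrt(post_var)

mu = lambda tau: tau * post_mean                 # interpolate mean: 0 -> post_mean
sd = lambda tau: s0 + tau * (post_sd - s0)       # interpolate std:  s0 -> post_sd

def velocity(tau, x):
    # Hand-made f(x(tau)): moves the whole Gaussian along the interpolation path,
    # so x(tau) ~ N(mu(tau), sd(tau)^2) for all tau in [0, 1].
    return post_mean + (post_sd - s0) * (x - mu(tau)) / sd(tau)

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, s0, size=200)               # particles sampled from the prior
sol = solve_ivp(velocity, (0.0, 1.0), X0, rtol=1e-8)
X1 = sol.y[:, -1]                                # particles now follow the posterior
print(X1.mean(), X1.std(), "vs", post_mean, post_sd)
```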
Flow Property

The continuity equation expresses the law of local conservation of mass:
  – mass can neither be created nor destroyed
  – nor can it "teleport" from one place to another

  ∂ρ(x, τ)/∂τ = −∇_x · (ρ f)

Theorem. If dx/dτ = f, then the change in log-density follows the differential equation

  d log ρ(x, τ)/dτ = −∇_x · f

Notation
  – dρ/dτ is the material derivative: the rate of change of ρ for a given particle as it moves along its trajectory x = x(τ)
  – ∂ρ/∂τ is the partial derivative: the rate of change of ρ at a fixed point x
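Continuing the Gaussian sketch above, the theorem can be checked numerically by integrating d log ρ/dτ = −∇_x · f alongside each particle; for the linear velocity above, ∇_x · f reduces to σ'(τ)/σ(τ):

```python
# Augment each particle with its log-density and integrate both jointly
# (reuses X0, velocity, sd, post_mean, post_sd, s0 from the previous sketch).
def aug_velocity(tau, state):
    n = state.size // 2
    x = state[:n]
    dlogp = -np.full(n, (post_sd - s0) / sd(tau))   # -div f along the trajectory
    return np.concatenate([velocity(tau, x), dlogp])

logp0 = norm.logpdf(X0, 0.0, s0)                     # log-density under the prior
sol = solve_ivp(aug_velocity, (0.0, 1.0), np.concatenate([X0, logp0]), rtol=1e-8)
x1, logp1 = sol.y[:200, -1], sol.y[200:, -1]         # 200 = number of particles
# Matches the exact posterior log-density up to solver tolerance:
print(np.abs(logp1 - norm.logpdf(x1, post_mean, post_sd)).max())
```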
Particle Flow for Sequential Bayesian Inference

[Figure: particle sets X_0 = {x_0^1, …, x_0^N} → X_1 = {x_1^1, …, x_1^N} → X_2 = {x_2^1, …, x_2^N} → …, transported by the flows ∫_0^T f(X_0, o_1, x(τ)) dτ, ∫_0^T f(X_1, o_2, x(τ)) dτ, ∫_0^T f(X_2, o_3, x(τ)) dτ to track π(x) → p(x | o_1) → p(x | o_{1:2}) → … → p(x | o_{1:t})]

Particle flow for sequential Bayesian inference:

  x_{m+1}^n = x_m^n + ∫_0^T f(X_m, o_{m+1}, x(τ)) dτ

  log p_{m+1}(x_{m+1}^n) = log p_m(x_m^n) − ∫_0^T ∇_x · f(X_m, o_{m+1}, x(τ)) dτ

• Other ODE approaches (e.g., the Neural ODE of Chen et al., 2018) are not designed for the sequential case.
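A sketch of the sequential loop. The learned f(X_m, o_{m+1}, x(τ)) of the talk is replaced by a hypothetical moment-matching velocity that reads the current posterior's mean and standard deviation off the particle set — exact only in a Gaussian toy, but it shows how the same f is reused for every incoming observation:

```python
import numpy as np
from scipy.integrate import solve_ivp

T, s = 1.0, 1.0          # flow horizon; known likelihood p(o|x) = N(x, s^2)

def flow_velocity(X_m, o, x, tau):
    # Moment-matching stand-in for f(X_m, o_{m+1}, x(tau)): estimate the current
    # posterior from the particle set, compute the next Gaussian posterior in
    # closed form, and flow along the interpolation between the two.
    m0, sd0 = X_m.mean(), X_m.std()
    var1 = 1.0 / (1.0 / sd0**2 + 1.0 / s**2)
    m1, sd1 = var1 * (m0 / sd0**2 + o / s**2), np.sqrt(var1)
    mu_tau = m0 + tau * (m1 - m0)
    sd_tau = sd0 + tau * (sd1 - sd0)
    return (m1 - m0) + (sd1 - sd0) * (x - mu_tau) / sd_tau

def sequential_particle_flow(X0, observations):
    X = X0.copy()
    for o in observations:              # o_{m+1} arrives: transport X_m -> X_{m+1}
        X_m = X.copy()                  # the velocity conditions on the whole set
        sol = solve_ivp(lambda tau, x: flow_velocity(X_m, o, x, tau), (0.0, T), X)
        X = sol.y[:, -1]
    return X

rng = np.random.default_rng(0)
X = sequential_particle_flow(rng.normal(0.0, 2.0, size=300), [0.8, 1.1, 0.9])
```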
Does a Shared Flow Velocity f Exist?

  x_0 ~ π(x)  ⟶  x_T = x_0 + ∫_0^T f(input) dτ,  x_T ~ p(x | o_1)

Does a shared flow velocity f exist for different Bayesian inference tasks involving different priors and different observations?

A simple Gaussian example
• Prior π(x) = N(0, Σ_0), likelihood p(o|x) = N(x, Σ), observation o = 0
  ⟹ posterior p(x | o = 0) = N(0, ΣΣ_0 / (Σ + Σ_0))
• Does a shared f exist for priors with different Σ_0? What form must it take?
  – E.g., f of the form f(o, x(τ)) cannot handle different Σ_0, as the sketch below illustrates.
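A few lines of arithmetic make the last point concrete: with o = 0 fixed, two priors with different Σ_0 demand different contractions of the very same (o, x) pairs, so the velocity must also see the prior — for instance through the particle set X_0:

```python
import numpy as np

# Posterior N(0, Sig*Sig0/(Sig+Sig0)): the flow must contract prior samples
# by post_sd / prior_sd, and that factor depends on Sig0.
Sig = 1.0
for Sig0 in (0.5, 4.0):
    post_sd = np.sqrt(Sig * Sig0 / (Sig + Sig0))
    print(f"Sigma_0 = {Sig0}: x -> {post_sd / np.sqrt(Sig0):.3f} * x")
# Sigma_0 = 0.5: x -> 0.816 * x;  Sigma_0 = 4.0: x -> 0.447 * x.
# A particle at x = 1.0 with o = 0 must move differently in the two tasks,
# so no velocity of the form f(o, x(tau)) can serve both.
```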
Existence: Connection to Stochastic Flow

Langevin dynamics is the stochastic process

  dx(τ) = ∇_x log [π(x) p(o|x)] dτ + √2 dW(τ),

where W(τ) is a standard Brownian motion.

Property. If the potential function Ψ(x) ≔ −log [π(x) p(o|x)] is smooth and e^{−Ψ} ∈ L^1(ℝ^d), the Fokker–Planck equation has a unique stationary solution in the form of a Gibbs distribution:

  ρ(x, ∞) = e^{−Ψ}/Z = π(x) p(o|x) / p(o) = p(x | o)
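An Euler–Maruyama simulation of this SDE on the 1-D Gaussian toy from earlier (step size and iteration count are my choices) shows the particle cloud settling at the closed-form posterior:

```python
import numpy as np

# Langevin dynamics for prior N(0, s0^2), likelihood N(x, s^2), observation o.
s0, s, o = 2.0, 1.0, 1.5

def grad_log_target(x):
    # gradient of log pi(x) + log p(o|x)
    return -x / s0**2 + (o - x) / s**2

rng = np.random.default_rng(0)
x = rng.normal(0.0, s0, size=2000)
dt = 1e-3
for _ in range(20000):
    x = x + grad_log_target(x) * dt + np.sqrt(2 * dt) * rng.normal(size=x.size)

post_var = 1.0 / (1.0 / s0**2 + 1.0 / s**2)
print(x.mean(), x.var(), "vs", post_var * o / s**2, post_var)   # ~ (1.2, 0.8)
```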
Existence: Connection to Stochastic Flow

The probability density ρ(x, τ) of x(τ) follows a deterministic evolution according to the Fokker–Planck equation

  ∂ρ/∂τ = −∇_x · (ρ ∇_x log [π(x) p(o|x)]) + Δ_x ρ
        = −∇_x · (ρ (∇_x log [π(x) p(o|x)] − ∇_x log ρ(x, τ))),

which has the form of the continuity equation.

Theorem. When the deterministic transformation of the random variable x(τ) follows

  dx/dτ = ∇_x log [π(x) p(o|x)] − ∇_x log ρ(x, τ),

its probability density ρ(x, τ) converges to the posterior p(x | o) as τ → ∞.
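The same toy with the Brownian noise replaced by −∇_x log ρ(x, τ), estimated from the particles themselves via a Gaussian kernel density estimate (the bandwidth is an assumption). This makes the closed-loop nature of the flow concrete: every step requires the current density ρ:

```python
# Deterministic counterpart of the Langevin simulation above (reuses
# grad_log_target, s0, rng): noise traded for -grad log rho, estimated by KDE.
def grad_log_kde(x, bw=0.25):
    # d/dx of log sum_j K((x - x_j)/bw) for a Gaussian kernel K
    diff = x[:, None] - x[None, :]               # (N, N) pairwise differences
    w = np.exp(-0.5 * (diff / bw) ** 2)
    return (w * (-diff / bw**2)).sum(axis=1) / w.sum(axis=1)

x = rng.normal(0.0, s0, size=500)
dt = 1e-2
for _ in range(2000):
    x = x + (grad_log_target(x) - grad_log_kde(x)) * dt

print(x.mean(), x.var())   # approaches the posterior moments (1.2, 0.8)
```

The KDE bandwidth biases the estimate of ∇_x log ρ, which is one practical reason to seek the open-loop velocity discussed on the next slide.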
Existence: Closed-Loop to Open-Loop Conversion

Closed loop to open loop
• The Fokker–Planck equation leads to a closed-loop flow, depending not just on π(x) and p(o|x), but also on the flow state ρ(x, τ).
• Is there an equivalent form, independent of ρ(x, τ), that achieves the same flow?

Optimization problem

  min_w  D( ρ(x, ∞), p(x | o) )   s.t.   dx/dτ = ∇_x log [π(x) p(o|x)] − w

• Positive answer: there exists a fixed and deterministic flow velocity f of the form

  dx/dτ = ∇_x log [π(x) p(o|x)] − w*(π(x), p(o|x), x, τ)