  1. Meta Particle Flow for Sequential Bayesian Inference
  Le Song, Associate Professor, CSE; Associate Director, Machine Learning Center, Georgia Institute of Technology.
  Joint work with Xinshi Chen and Hanjun Dai.

  2. Bayesian Inference
  Infer the posterior distribution of the unknown parameter θ given
  • a prior distribution π(θ),
  • a likelihood function p(x | θ),
  • observations x_1, x_2, …, x_n:
    p(θ | x_{1:n}) = (1/Z) π(θ) ∏_{m=1}^{n} p(x_m | θ),   Z = ∫ π(θ) ∏_{m=1}^{n} p(x_m | θ) dθ.
  This is a challenging computational problem for high-dimensional θ.
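To make the update concrete, here is a minimal sketch (a hypothetical 1-D Gaussian model, not from the slides) that evaluates the normalized posterior on a grid. The grid itself illustrates the difficulty: this is exactly the computation that stops scaling once θ is high-dimensional.

```python
# A minimal sketch (hypothetical 1-D Gaussian model, not from the slides):
# Bayes' rule evaluated on a grid over theta.
import numpy as np
from scipy.stats import norm

theta = np.linspace(-5, 5, 1001)            # grid over the unknown parameter theta
dtheta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=1.0) # pi(theta) = N(0, 1)
x_obs = np.array([0.8, 1.1, 0.5])           # hypothetical observations x_1, ..., x_n

# Unnormalized posterior: pi(theta) * prod_m p(x_m | theta)
unnorm = prior * np.prod(norm.pdf(x_obs[:, None], loc=theta, scale=1.0), axis=0)

Z = np.sum(unnorm) * dtheta                 # normalizing constant Z (numerical integral)
posterior = unnorm / Z
print("posterior mean:", np.sum(theta * posterior) * dtheta)
```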

  3. Challenges in Bayesian Inference
  Gaussian mixture model example:
  • prior: θ_1, θ_2 ~ π(θ) = N(0, I)
  • observations: x | θ_1, θ_2 ~ p(x | θ_1, θ_2) = ½ N(θ_1, 1) + ½ N(θ_1 + θ_2, 1)
  • With (θ_1, θ_2) = (1, −2), the resulting posterior has two modes: (1, −2) and (−1, 2).
  • Fitting even a single posterior p(θ | x_{1:n}) is already not easy. [Results reported by Dai et al. (2016)]
  [Figure: (a) true posterior, (b) stochastic variational inference, (c) stochastic gradient Langevin dynamics, (d) Gibbs sampling, (e) one-pass SMC]
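A small sketch of this example (assumed setup: 100 simulated observations and a coarse grid; the slide's actual experimental details are not reproduced here) that exposes the two posterior modes by evaluating the unnormalized log-posterior on a grid:

```python
# A small sketch (assumed setup) exposing the bimodal posterior of the mixture example
# by evaluating the unnormalized log-posterior on a 2-D grid.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])
comp = rng.integers(0, 2, size=100)                       # mixture component per draw
means = np.where(comp == 0, theta_true[0], theta_true[0] + theta_true[1])
x = rng.normal(means, 1.0)                                # 100 simulated observations

t1, t2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
log_post = norm.logpdf(t1) + norm.logpdf(t2)              # log prior N(0, I)
for xi in x:                                              # add each log-likelihood term
    log_post += np.log(0.5 * norm.pdf(xi, t1, 1.0) + 0.5 * norm.pdf(xi, t1 + t2, 1.0))
# log_post (unnormalized) now has two modes, near (1, -2) and (-1, 2)
```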

  4. Fundamental Principle for Machine Learning
  Bayesian inference has many applications in machine learning:
  • Hidden Markov models: hidden true locations θ_1, θ_2, …, θ_n observed through sensor measurements x_1, x_2, …, x_n.
  • Topic modeling: a plate model with topic parameters β, latent topic assignments z, and observed words (N words per document, D documents).
  • Uncertainty quantification in dynamical models, e.g. a blowfly-type population model
    N_{t+1} = P · N_{t−τ} · exp(−N_{t−τ}/N_0) · e_t + N_t · exp(−δ ε_t),
    with noise ε_t ~ Γ(σ_d^{−2}, σ_d^2), e_t ~ Γ(σ_p^{−2}, σ_p^2), and parameters θ = {P, N_0, σ_p, σ_d, τ, δ}.
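For the uncertainty-quantification example, a hedged simulator sketch of the blowfly-style population model written above; all parameter values below are hypothetical, chosen only to produce a plausible trajectory.

```python
# A hedged simulator sketch for the blowfly-style model above; all parameter values
# below are hypothetical, not taken from the slides.
import numpy as np

def simulate(P, N0, sigma_p, sigma_d, tau, delta, T=180, seed=0):
    rng = np.random.default_rng(seed)
    N = np.ones(T + tau)                               # population trajectory
    for t in range(tau, T + tau - 1):
        e_t = rng.gamma(sigma_p ** -2, sigma_p ** 2)   # e_t   ~ Gamma(sigma_p^-2, sigma_p^2)
        eps_t = rng.gamma(sigma_d ** -2, sigma_d ** 2) # eps_t ~ Gamma(sigma_d^-2, sigma_d^2)
        N[t + 1] = (P * N[t - tau] * np.exp(-N[t - tau] / N0) * e_t
                    + N[t] * np.exp(-delta * eps_t))
    return N[tau:]

series = simulate(P=6.5, N0=400.0, sigma_p=0.7, sigma_d=0.5, tau=14, delta=0.2)
```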

  5. Sequential Bayesian Inference
  Online Bayesian inference: observations x_1, x_2, …, x_n arrive sequentially.
  An ideal algorithm should:
  • efficiently update p(θ | x_{1:n}) to p(θ | x_{1:n+1}) when x_{n+1} is observed,
  • without storing all historical observations x_1, x_2, …, x_n.
    p(θ | x_{1:n}) ∝ p(θ | x_{1:n−1}) · p(x_n | θ)
    (updated posterior ∝ current posterior × likelihood)
  [Figure: graphical model, prior π(θ) → p(θ | x_1) → p(θ | x_{1:2}) → … → p(θ | x_{1:n}), with observations x_1, x_2, …, x_n]
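A minimal sketch of this recursion (assumed 1-D Gaussian model, grid representation of the posterior): the only state carried from step to step is the current posterior, not the observation history.

```python
# A minimal sketch (assumed 1-D Gaussian model, grid posterior) of the sequential update
# p(theta | x_{1:n}) ∝ p(theta | x_{1:n-1}) * p(x_n | theta): only the current posterior
# is carried forward; no observation history is stored.
import numpy as np
from scipy.stats import norm

theta = np.linspace(-5, 5, 1001)
dtheta = theta[1] - theta[0]
posterior = norm.pdf(theta, 0.0, 1.0)                  # start from the prior pi(theta)

def bayes_update(posterior, x_new):
    unnorm = posterior * norm.pdf(x_new, loc=theta, scale=1.0)  # times the new likelihood
    return unnorm / (np.sum(unnorm) * dtheta)                   # renormalize

for x_new in [0.8, 1.1, 0.5, 0.9]:                     # observations arriving one at a time
    posterior = bayes_update(posterior, x_new)
```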

  6. Related Work
  • MCMC – requires a complete scan of the data.
  • Variational inference (VI) – requires re-optimization for every new observation.
  • Stochastic approximate inference – prescribed algorithms that optimize the final posterior p(θ | x_{1:N}); they cannot exploit the structure of the sequential inference problem.

  7. Related Work
  • Sequential Monte Carlo (Doucet et al., 2001; Balakrishnan & Madigan, 2006)
    – the state of the art for online Bayesian inference,
    – but suffers from the path-degeneracy problem in high dimensions;
    – rejuvenation steps can help but violate the online constraints (Canini et al., 2009).
  Can we learn to perform efficient and effective sequential Bayesian updates?
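A generic sequential-importance-sampling sketch (not the slides' method) showing where degeneracy comes from: for a static parameter, repeatedly reweighting a fixed particle set by each new likelihood term collapses the effective sample size, and the collapse is far more severe in high dimensions, which is what motivates rejuvenation moves.

```python
# A generic sequential-importance-sampling sketch (not the slides' method): reweighting a
# fixed particle set by each new likelihood term collapses the effective sample size (ESS).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=2000)           # theta^i ~ pi(theta) = N(0, 1)
log_w = np.zeros_like(particles)

for x_new in rng.normal(1.0, 1.0, size=500):          # stream of observations
    log_w += norm.logpdf(x_new, loc=particles, scale=1.0)
w = np.exp(log_w - log_w.max())
w /= w.sum()
ess = 1.0 / np.sum(w ** 2)                            # effective sample size
print("ESS after 500 observations:", ess)             # well below the 2000 particles
```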

  8. Operator View
  • Kernel Bayes' rule (Fukumizu et al., 2012)
    – the posterior is represented as an embedding μ_n = E_{p(θ | x_{1:n})}[φ(θ)],
    – μ_{n+1} = T(μ_n, x_{n+1}) maps the current embedding to the updated embedding,
    – views the Bayes update as an operator in a reproducing kernel Hilbert space (RKHS),
    – conceptually nice but limited in practice.

  9. Our Approach: Bayesian Inference as Particle Flow
  Particle flow:
  • Start with N particles 𝒳_0 = {θ_0^1, …, θ_0^N}, sampled i.i.d. from the prior π(θ).
  • Transport the particles to the next posterior via the solution of an initial value problem (IVP):
    dθ(t)/dt = f(𝒳_0, x_1, θ(t)),  ∀t ∈ [0, T],  with θ(0) = θ_0^m
    ⟹ solution θ_1^m = θ(T) = θ_0^m + ∫_0^T f(𝒳_0, x_1, θ(t)) dt
  • 𝒳_0 = {θ_0^1, …, θ_0^N} ~ π(θ)  ⟹  𝒳_1 = {θ_1^1, …, θ_1^N} ~ p(θ | x_1)
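A minimal sketch of the transport step, integrating one IVP per particle with `scipy.integrate.solve_ivp`. The velocity field below is a hypothetical placeholder drift, not the learned flow from the slides; it only shows the mechanics of moving particles over t ∈ [0, T].

```python
# A minimal sketch of the transport step; f below is a hypothetical placeholder drift,
# NOT the learned flow velocity from the slides.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=(100, 1))       # X_0 ~ pi(theta), 1-D parameters
x1 = 0.5                                              # first observation

def f(t, theta, X0, x_obs):
    # placeholder for f(X_0, x_1, theta(t)); a hypothetical drift toward the observation
    return 0.5 * (x_obs - theta)

X1 = np.empty_like(particles)
for m, theta0 in enumerate(particles):                # one IVP per particle
    sol = solve_ivp(f, t_span=(0.0, 1.0), y0=theta0, args=(particles, x1))
    X1[m] = sol.y[:, -1]                              # theta_1^m = theta(T)
```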

  10. Flow Property
  • The continuity equation expresses the law of local conservation of mass: mass can neither be created nor destroyed, nor can it 'teleport' from one place to another:
    ∂q(θ, t)/∂t = −∇_θ · (q f)
  • Theorem. If dθ/dt = f, then the change in log-density follows the differential equation
    d log q(θ, t)/dt = −∇_θ · f
  • Notation:
    – dq/dt is the material derivative: the rate of change of q for a given particle as it moves along its trajectory θ = θ(t);
    – ∂q/∂t is the partial derivative: the rate of change of q at a fixed point θ.
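The theorem follows from the continuity equation in one short step, written out here for completeness (chain rule for the material derivative, then substitute the continuity equation):

```latex
\begin{align*}
\frac{d}{dt}\log q(\theta(t), t)
  &= \frac{\partial \log q}{\partial t}
   + \nabla_\theta \log q \cdot \frac{d\theta}{dt}
   = \frac{1}{q}\frac{\partial q}{\partial t} + \nabla_\theta \log q \cdot f \\
  &= -\frac{1}{q}\,\nabla_\theta \cdot (q f) + \nabla_\theta \log q \cdot f
   = -\nabla_\theta \cdot f - \nabla_\theta \log q \cdot f + \nabla_\theta \log q \cdot f
   = -\nabla_\theta \cdot f .
\end{align*}
```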

  11. Particle Flow for Sequential Bayesian Inference
  Apply the flow repeatedly as observations arrive:
  𝒳_0 = {θ_0^1, …, θ_0^N} → 𝒳_1 = {θ_1^1, …, θ_1^N} → 𝒳_2 = {θ_2^1, …, θ_2^N} → ……
  where each arrow applies the flow ∫_0^T f(𝒳_n, x_{n+1}, θ(t)) dt.
  Particle flow for sequential Bayesian inference:
    θ_{n+1}^m = θ_n^m + ∫_0^T f(𝒳_n, x_{n+1}, θ(t)) dt
    −log q_{n+1}^m = −log q_n^m + ∫_0^T ∇_θ · f(𝒳_n, x_{n+1}, θ(t)) dt
  • Other ODE approaches (e.g., the Neural ODE of Chen et al., 2018) are not designed for this sequential case.
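A minimal sketch of this recursion (again with a hypothetical placeholder velocity and its divergence, not the learned flow): each particle and its negative log-density are advanced jointly by one ODE solve per incoming observation.

```python
# A minimal sketch of the sequential recursion; f and its divergence are placeholders,
# NOT the learned flow from the slides.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=100)            # X_0 ~ pi(theta), 1-D
neg_logq = 0.5 * particles ** 2 + 0.5 * np.log(2 * np.pi)   # -log pi(theta_0^m)

def f(theta, X_n, x_new):                             # placeholder velocity field
    return 0.5 * (x_new - theta)

def div_f(theta, X_n, x_new):                         # its divergence, d f / d theta
    return -0.5 * np.ones_like(theta)

def rhs(t, state, X_n, x_new):
    theta_t, _ = np.split(state, 2)
    # d theta/dt = f,   d(-log q)/dt = div(f)
    return np.concatenate([f(theta_t, X_n, x_new), div_f(theta_t, X_n, x_new)])

for x_new in [0.8, 1.1, 0.5]:                         # observations arriving sequentially
    state0 = np.concatenate([particles, neg_logq])
    sol = solve_ivp(rhs, (0.0, 1.0), state0, args=(particles.copy(), x_new))
    particles, neg_logq = np.split(sol.y[:, -1], 2)
```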

  12. Does a Shared Flow Velocity f Exist?
  θ(0) ~ π(θ),  θ(T) ~ p(θ | x_1),  θ(T) = θ(0) + ∫_0^T f(input) dt
  Does a shared flow velocity f exist for different Bayesian inference tasks involving different priors and different observations?
  A simple Gaussian example:
  • Prior π(θ) = N(0, σ_0), likelihood p(x | θ) = N(θ, σ), observation x = 0
    ⟹ posterior p(θ | x = 0) = N(0, σ·σ_0 / (σ + σ_0))
  • Does a shared f exist for priors with different σ_0? What form must it take?
    – E.g., an f of the form f(x, θ(t)) cannot handle different σ_0.
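A worked check of the stated posterior (treating σ_0 and σ as variances), by completing the square:

```latex
\begin{align*}
\pi(\theta)\,p(x=0 \mid \theta)
  &\propto \exp\!\Big(-\frac{\theta^2}{2\sigma_0}\Big)
           \exp\!\Big(-\frac{(0-\theta)^2}{2\sigma}\Big)
   = \exp\!\Big(-\frac{\theta^2}{2}\Big(\frac{1}{\sigma_0}+\frac{1}{\sigma}\Big)\Big), \\
p(\theta \mid x=0)
  &= \mathcal{N}\!\Big(0,\ \Big(\frac{1}{\sigma_0}+\frac{1}{\sigma}\Big)^{-1}\Big)
   = \mathcal{N}\!\Big(0,\ \frac{\sigma\,\sigma_0}{\sigma+\sigma_0}\Big).
\end{align*}
```

A velocity of the form f(x, θ(t)) receives identical inputs for every σ_0, so it would trace identical trajectories and end with the same spread; that is why it cannot reproduce posteriors whose variances depend on σ_0.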

  13. Existence: Connection to Stochastic Flow
  • Langevin dynamics is a stochastic process
    dθ_t = ∇_θ log[π(θ) p(x | θ)] dt + √2 dw_t,
    where w(t) is a standard Brownian motion.
  • Property. If the potential function Ψ(θ) := −log[π(θ) p(x | θ)] is smooth and e^{−Ψ} ∈ L^1(ℝ^d), the Fokker–Planck equation has a unique stationary solution in the form of a Gibbs distribution:
    q(θ, ∞) = e^{−Ψ} / Z = π(θ) p(x | θ) / Z = p(θ | x)
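A minimal Euler–Maruyama sketch of these dynamics for the 1-D Gaussian example of slide 12 (step size and run length are assumptions): the samples settle near the known posterior variance.

```python
# A minimal Euler-Maruyama sketch of Langevin dynamics for the 1-D Gaussian example
# (assumed step size and run length); samples settle near the posterior p(theta | x).
import numpy as np

rng = np.random.default_rng(0)
x_obs, sigma0, sigma = 0.0, 1.0, 1.0                  # prior / likelihood variances

def grad_log_joint(theta):
    # gradient of log[ pi(theta) p(x | theta) ] for the Gaussian model
    return -theta / sigma0 + (x_obs - theta) / sigma

theta, dt = 3.0, 1e-3
samples = []
for _ in range(100_000):
    theta += grad_log_joint(theta) * dt + np.sqrt(2 * dt) * rng.normal()
    samples.append(theta)
print("empirical variance:", np.var(samples[20_000:]))  # ~ sigma*sigma0/(sigma+sigma0) = 0.5
```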

  14. Existence: Connection to Stochastic Flow
  • The probability density q(θ, t) of θ_t follows a deterministic evolution according to the Fokker–Planck equation
    ∂q/∂t = −∇_θ · (q ∇_θ log[π(θ) p(x | θ)]) + Δ_θ q(θ, t)
          = −∇_θ · ( q ( ∇_θ log[π(θ) p(x | θ)] − ∇_θ log q(θ, t) ) ),
    which is in the form of the continuity equation with flow velocity f = ∇_θ log[π(θ) p(x | θ)] − ∇_θ log q(θ, t).
  • Theorem. When the deterministic transformation of the random variable θ_t follows
    dθ/dt = ∇_θ log[π(θ) p(x | θ)] − ∇_θ log q(θ, t),
    its probability density q(θ, t) converges to the posterior p(θ | x) as t → ∞.
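A minimal sketch of this deterministic flow for the same 1-D Gaussian example, with the assumption that q(θ, t) stays Gaussian so ∇_θ log q can be estimated from the particle mean and variance:

```python
# A minimal sketch of the deterministic flow for the 1-D Gaussian example, assuming
# q(theta, t) stays Gaussian so grad log q can be estimated from the particle statistics.
import numpy as np

rng = np.random.default_rng(0)
x_obs, sigma0, sigma = 0.0, 1.0, 1.0
theta = rng.normal(0.0, np.sqrt(sigma0), size=5000)    # particles ~ pi(theta)

dt = 1e-2
for _ in range(2000):
    grad_log_joint = -theta / sigma0 + (x_obs - theta) / sigma
    grad_log_q = -(theta - theta.mean()) / theta.var() # closed-loop term, from particles
    theta = theta + dt * (grad_log_joint - grad_log_q)
print("particle variance:", theta.var())               # -> sigma*sigma0/(sigma+sigma0) = 0.5
```

The ∇_θ log q term is what makes this flow closed-loop: it has to be re-estimated from the current particle population at every step, which is exactly the dependence the next slide removes.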

  15. Existence: Closed-Loop to Open-Loop Conversion
  Closed loop to open loop:
  • The Fokker–Planck equation leads to a closed-loop flow, which depends not just on π(θ) and p(x | θ) but also on the flow state q(θ, t).
  • Is there an equivalent form, independent of q(θ, t), that can achieve the same flow?
  • Optimization problem:
    min_w d( q(θ, ∞), p(θ | x) )  s.t.  dθ/dt = ∇_θ log[π(θ) p(x | θ)] − w
  • Positive answer: there exists a fixed and deterministic flow velocity f of the form
    dθ/dt = ∇_θ log[π(θ) p(x | θ)] − w*(π(θ), p(x | θ), θ, t)
