Learning dynamical systems with particle stochastic approximation EM


  1. Learning dynamical systems with particle stochastic approximation EM. Fredrik Lindsten, Linköping University, 2019-04-11. Joint work with Andreas Lindholm, Uppsala University.

  2. Parametric state-space models. Dynamical system on state-space form,

      $x_{t+1} = f(x_t, u_t, v_t; \theta)$,
      $y_t = g(x_t, u_t, e_t; \theta)$.

      Properties:
      • Can handle nonlinear dynamics ($f(\cdot)$ and $g(\cdot)$)
      • Stochastic (process and measurement noise)
      • Noise variables can be non-Gaussian
      • Parametric: parameterized by $\theta \in \mathbb{R}^{n_\theta}$

      Note: the states $\{x_t\}$ are latent (unobserved) random variables.
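A model on this form can be written as a short simulator. The sketch below is a minimal illustration with toy choices of $f$ and $g$ (a quadratic measurement map and Gaussian noise; these are assumptions for illustration, not the models from the talk):

```python
import numpy as np

# Toy nonlinear state-space model (assumed for illustration):
#   x_{t+1} = theta * x_t + u_t + v_t,   v_t ~ N(0, 1)
#   y_t     = x_t**2 / 20 + e_t,         e_t ~ N(0, 0.1)
def simulate_ssm(theta, u, rng):
    """Simulate states x_{0:T} and observations y_{1:T} given inputs u_{1:T}."""
    T = len(u)
    x = np.zeros(T + 1)
    y = np.zeros(T)
    for t in range(T):
        y[t] = x[t] ** 2 / 20.0 + np.sqrt(0.1) * rng.standard_normal()
        x[t + 1] = theta * x[t] + u[t] + rng.standard_normal()
    return x, y

rng = np.random.default_rng(0)
u = np.ones(100)                  # known input signal
x, y = simulate_ssm(0.5, u, rng)  # theta is the parameter to be learned
```

Only $\{y_t\}$ and $\{u_t\}$ would be available to the learning algorithm; the state sequence $\{x_t\}$ is latent.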

  3. Maximum likelihood identification. We observe $y_{1:T} = (y_1, \dots, y_T)$ and $u_{1:T} = (u_1, \dots, u_T)$.

      Aim: compute the maximum likelihood estimate,
      $\hat{\theta}_{\text{ML}} = \arg\max_\theta \{ \log p_\theta(y_{1:T}) \}$.

      Desired properties of the learning algorithm:
      1. Convergent: if given enough time, a (local) maximum is found.
      2. Computationally efficient: the total runtime to recover a satisfactory solution should not be excessively large.

  4. Teaser: Cascaded water tanks. Cascaded water tank system from the nonlinear system identification benchmark: http://www.nonlinearbenchmark.org/

      M. Schoukens & J.P. Noël. Three Benchmarks Addressing Open Challenges in Nonlinear System Identification. 20th World Congress of the International Federation of Automatic Control, Toulouse, France, July 2017.

      Evaluated by computing simulation error on test data.

      Model                         | Error
      Initial model to PSAEM        | 2.85
      Estimated with PSAEM          | 0.29
      Best result from literature   | 0.34

  5. Teaser: Nonlinear toy model. [Plot: average relative error vs. computational time (seconds), comparing PSAEM (N=15) against PSEM with N=15, 50, 100, and 500.]

  6. A composite algorithm. Particle Stochastic Approximation EM (PSAEM) is an exercise in building a complicated algorithm from "simple" parts. [Diagram: Expectation Maximization + stochastic approximation → SAEM; particle filters + MCMC → PMCMC; SAEM + PMCMC → PSAEM.]

  7. MCMC + Stochastic Approximation. Particle Stochastic Approximation EM (PSAEM) is an example of how MCMC and stochastic approximation can come together! Algorithms involving both optimization and Monte Carlo sampling are common in machine learning and system identification, e.g. SGD, variational inference, SAEM, ... Often we assume unbiased estimates, but it is possible to generalize to Markovian dependencies ⇒ we can use Markov chain Monte Carlo sampling!

  8. Schematic outline. [Diagram as on slide 6; the focus in what follows is the SAEM branch: Expectation Maximization + stochastic approximation → SAEM.]

  9. Expectation Maximization

  10. Latent variable models. A latent variable model is described by a joint probability density function $p_\theta(x, y)$ where
      • $y$ is observed (the data),
      • $x$ is unobserved (the latent variable),
      • $\theta$ is an unknown parameter.

      ex) Probabilistic representation of a state-space model:
      $x_{t+1} = f(x_t, u_t, v_t; \theta)$, $y_t = g(x_t, u_t, e_t; \theta)$ $\iff$ $p_\theta(x_{t+1} \mid x_t)$, $p_\theta(y_t \mid x_t)$.

      Therefore,
      $p_\theta(x_{0:T}, y_{1:T}) = p(x_0) \prod_{t=1}^{T} p_\theta(x_t \mid x_{t-1})\, p_\theta(y_t \mid x_t)$.

  11. Maximum likelihood estimation. Aim: compute the maximum likelihood estimate,
      $\hat{\theta}_{\text{ML}} = \arg\max_\theta \{ \log p_\theta(y) \}$.

      Evaluating the likelihood $p_\theta(y)$ requires marginalizing the latent variable,
      $p_\theta(y) = \int p_\theta(x, y)\, dx$.

      Expectation maximization (EM) avoids explicit marginalization by iteratively optimizing lower bounds on the likelihood.

  12. Expectation Maximization [3]. The EM algorithm generates a sequence of iterates $\theta_0, \theta_1, \theta_2, \dots$

      Iterate:
      (E) $Q_k(\theta) \triangleq \int \log p_\theta(x, y)\, p_{\theta_{k-1}}(x \mid y)\, dx$,
      (M) $\theta_k = \arg\max_\theta Q_k(\theta)$.

      The iterates $\{\theta_k\}_{k \ge 0}$ converge (under weak conditions) to a stationary point of $p_\theta(y)$.
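The two steps can be made concrete on a toy latent-variable model where the E-step is tractable. The model below is an assumed illustration, not from the talk: $x \sim N(\theta, 1)$, $y \mid x \sim N(x, 1)$, with a single observation $y$. Here $Q_k(\theta)$ is maximized at the posterior mean of $x$, so the EM update reduces to a simple fixed-point iteration:

```python
# EM for the toy model x ~ N(theta, 1), y | x ~ N(x, 1) (an assumption for
# illustration). The E-step gives E[x | y, theta'] = (theta' + y) / 2, and
# maximizing Q_k(theta) sets theta to that posterior mean.
y_obs = 2.0
theta = 0.0
for k in range(60):
    e_x = (theta + y_obs) / 2.0   # (E) posterior mean of the latent variable
    theta = e_x                   # (M) maximizer of Q_k(theta)
# the fixed point is theta = y_obs, which is the ML estimate here,
# since p_theta(y) = N(y; theta, 2)
```

Each iteration halves the distance to the fixed point, illustrating the monotone climb toward a stationary point of the likelihood.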

  13. EM for nonlinear state-space models. ex) For a state-space model,
      $Q_k(\theta) = \int \log p_\theta(x_{0:T}, y_{1:T})\, p_{\theta_{k-1}}(x_{0:T} \mid y_{1:T})\, dx_{0:T}$.

      First hurdle: a high-dimensional integration problem!

  14. Stochastic Approximation EM

  15. Stochastic approximation of the Q-function.
      $Q_k(\theta) \triangleq \int \log p_\theta(x, y)\, p_{\theta_{k-1}}(x \mid y)\, dx$.

      If $\tilde{x} \sim p_{\theta_{k-1}}(x \mid y)$, then $\log p_\theta(\tilde{x}, y)$ is an unbiased estimate of $Q_k(\theta)$. Replace the E-step with:
      (S) Simulate $\tilde{x}_k \sim p_{\theta_{k-1}}(x \mid y)$,
      (E) $\hat{Q}_k(\theta) = \hat{Q}_{k-1}(\theta) + \gamma_k \left( \log p_\theta(\tilde{x}_k, y) - \hat{Q}_{k-1}(\theta) \right)$.

      Here $\sum_k \gamma_k = \infty$, $\sum_k \gamma_k^2 < \infty$, and $\hat{Q}_0(\theta) \equiv 0$.

      Intuitive interpretation: if $\theta_k$ converges, the $x$-values sampled at iterations $k, k+1, \dots$ will come from approximately the same distribution.
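The (E) update above is a Robbins-Monro recursion. A minimal sketch, tabulating $\hat{Q}$ on a small grid of $\theta$ values (the log-joint values are placeholders, not from any real model):

```python
import numpy as np

def sa_update(Q_prev, log_joint, gamma_k):
    """One step: Q_k = Q_{k-1} + gamma_k * (log p_theta(x_k, y) - Q_{k-1})."""
    return Q_prev + gamma_k * (log_joint - Q_prev)

gammas = [1.0 / k for k in range(1, 6)]     # sum gamma_k = inf, sum gamma_k^2 < inf
Q = np.zeros(3)                             # Q_0 ≡ 0, on a grid of 3 theta values
log_joint = np.array([-1.0, -0.5, -2.0])    # placeholder log p_theta(x_k, y) values
for g in gammas:
    Q = sa_update(Q, log_joint, g)
# with gamma_1 = 1 and a constant integrand, Q equals log_joint exactly
```

Because $\gamma_1 = 1$, the initialization $\hat{Q}_0 \equiv 0$ is forgotten after the first step; the decaying step sizes then average out the Monte Carlo noise in later iterations.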

  16. Representation of $\hat{Q}$. How is $\hat{Q}_k(\theta) = \hat{Q}_{k-1}(\theta) + \gamma_k \left( \log p_\theta(\tilde{x}_k, y) - \hat{Q}_{k-1}(\theta) \right)$ represented in practice?

      1. In general, as a sum, $\hat{Q}_k(\theta) = \sum_{j=1}^{k} \alpha_{jk} \log p_\theta(\tilde{x}_j, y)$. But this can be hard to maximize for large $k$.
      2. Exponential family models can be expressed using sufficient statistics, $\log p_\theta(x, y) = \langle \psi(\theta), s(x, y) \rangle - A(\theta)$. We can then compute
         (E.1) $S_k = S_{k-1} + \gamma_k (s(\tilde{x}_k, y) - S_{k-1})$,
         (E.2) $\hat{Q}_k(\theta) = \langle \psi(\theta), S_k \rangle - A(\theta)$.

  17. Stochastic Approximation EM [2]. The SAEM algorithm for exponential family models. Initialize $\theta_0$ and $S_0 = 0$. Iterate:
      (S) Simulate $\tilde{x}_k \sim p_{\theta_{k-1}}(x \mid y)$,
      (E.1) $S_k = S_{k-1} + \gamma_k (s(\tilde{x}_k, y) - S_{k-1})$,
      (E.2) $\hat{Q}_k(\theta) = \langle \psi(\theta), S_k \rangle - A(\theta)$,
      (M) $\theta_k = \arg\max_\theta \hat{Q}_k(\theta)$.

      The iterates $\{\theta_k\}_{k \ge 0}$ converge (under regularity conditions) to a local maximum of $p_\theta(y)$.
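The full loop can be sketched on a toy exponential-family model where the (S) step can sample exactly. The model is an assumed illustration, not from the talk: $x \sim N(\theta, 1)$, $y \mid x \sim N(x, 1)$, single observation $y$. Then $s(x, y) = x$, the M-step reduces to $\theta_k = S_k$, and the posterior is tractable, $p_\theta(x \mid y) = N((\theta + y)/2,\ 1/2)$:

```python
import numpy as np

# SAEM on a toy model (assumed): x ~ N(theta, 1), y | x ~ N(x, 1).
# log p_theta(x, y) = theta*x - theta**2/2 + const, so psi(theta) = theta,
# s(x, y) = x, A(theta) = theta**2/2, and argmax_theta Q_hat_k gives theta = S_k.
rng = np.random.default_rng(1)
y_obs = 2.0
theta, S = 0.0, 0.0
for k in range(1, 5001):
    gamma = 1.0 / k                                      # step sizes: sum = inf, sum of squares < inf
    x = rng.normal((theta + y_obs) / 2.0, np.sqrt(0.5))  # (S) exact posterior sample
    S = S + gamma * (x - S)                              # (E.1) sufficient-statistic update
    theta = S                                            # (M) maximize Q_hat_k
# theta approaches the ML estimate theta = y_obs
```

Only the running statistic $S_k$ needs to be stored between iterations, which is the practical payoff of the exponential-family representation.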

  18. SAEM for nonlinear state-space models. ex) For a state-space model, the simulation step involves
      $\tilde{x}_{0:T} \sim p_\theta(x_{0:T} \mid y_{1:T})$.

      Second hurdle: simulating from the joint smoothing distribution is not possible for a nonlinear/non-Gaussian state-space model.

  19. Combining SAEM and MCMC

  20. Markovian noise. We need to simulate from an intractable posterior $p_\theta(x \mid y)$. This is what Markov chain Monte Carlo is designed for!

      MCMC, key idea: an ergodic stochastic process has "the same behavior averaged over time as averaged over the space of all the system's states" (Wikipedia):
      $\int g(x)\, p_\theta(x \mid y)\, dx = \lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} g(\tilde{x}_k)$.

  21. MCMC basics. Requirements: the process $\{\tilde{x}_k\}_{k \ge 1}$ has to...
      1. ...have $p_\theta(x \mid y)$ as its stationary distribution,
      2. ...be ergodic ("sufficiently stochastic").

      MCMC: initialize $\tilde{x}_0$ and simulate $\tilde{x}_k \sim q_\theta(\tilde{x}_k \mid \tilde{x}_{k-1})$, $k = 1, 2, \dots$. Here $\{q_\theta(\tilde{x} \mid x) : \theta \in \mathbb{R}^{n_\theta}\}$ is a family of Markov kernels s.t.
      1. (Stationary distribution) $\int q_\theta(\tilde{x} \mid x)\, p_\theta(x \mid y)\, dx = p_\theta(\tilde{x} \mid y)$,
      2. (Ergodicity) Sufficient condition: $\exists \eta > 0$ s.t. $P_{q_\theta(\cdot \mid x)}[\tilde{x} \in A] \ge \eta\, P_{p_\theta(\cdot \mid y)}[\tilde{x} \in A]$.
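A minimal kernel satisfying both requirements is random-walk Metropolis-Hastings. The sketch below targets a standard normal distribution as a stand-in for $p_\theta(x \mid y)$ (the target choice is an assumption for illustration):

```python
import numpy as np

# Random-walk Metropolis-Hastings kernel q(x'|x) with a N(0, 1) target
# (a stand-in for the intractable posterior p_theta(x | y)).
def mh_kernel(x, rng, step=1.0):
    prop = x + step * rng.standard_normal()   # symmetric random-walk proposal
    log_ratio = -0.5 * prop**2 + 0.5 * x**2   # log target density ratio
    if np.log(rng.uniform()) < log_ratio:
        return prop                           # accept the proposed move
    return x                                  # reject: stay at the current state

rng = np.random.default_rng(2)
x = 0.0
chain = np.empty(20000)
for k in range(20000):
    x = mh_kernel(x, rng)
    chain[k] = x

# the ergodic average approximates the stationary expectation E[x] = 0
est_mean = chain[2000:].mean()
est_var = chain[2000:].var()
```

The accept/reject correction is what enforces the stationary-distribution requirement; the Gaussian proposal makes the chain ergodic, since it can reach any set with positive probability.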

  22. MCMC methods. 65+ years of research on MCMC has resulted in many clever ways of constructing $q_\theta(\tilde{x} \mid x)$:
      • Metropolis–Hastings
      • Gibbs sampling
      • Langevin algorithms
      • Hamiltonian Monte Carlo
      • Slice sampling
      • Particle Markov chain Monte Carlo (← what we use in PSAEM)
      • ...

  23. MCMC-SAEM [4]. Initialize $\theta_0$, $\tilde{x}_0$, and $S_0 = 0$. Iterate:
      (S) Simulate $\tilde{x}_k \sim q_{\theta_{k-1}}(\tilde{x} \mid \tilde{x}_{k-1})$,
      (E.1) $S_k = S_{k-1} + \gamma_k (s(\tilde{x}_k, y) - S_{k-1})$,
      (E.2) $\hat{Q}_k(\theta) = \langle \psi(\theta), S_k \rangle - A(\theta)$,
      (M) $\theta_k = \arg\max_\theta \hat{Q}_k(\theta)$.

      Under appropriate conditions the iterates $\{\theta_k\}_{k \ge 0}$ converge to a local maximum of $p_\theta(y)$. We require that $\{q_\theta(\tilde{x} \mid x) : \theta \in \mathbb{R}^{n_\theta}\}$...
      • ...has the correct stationary distribution and is ergodic for "all" $\theta$,
      • ...is smooth in $\theta$: $q_\theta(\tilde{x} \mid x) \approx q_{\theta'}(\tilde{x} \mid x)$ if $\theta \approx \theta'$.

  24. MCMC-SAEM for nonlinear state-space models. ex) For a state-space model, the MCMC step aims at simulating from the joint smoothing distribution $p_\theta(x_{0:T} \mid y_{1:T})$.

      Third hurdle: how can we construct an efficient MCMC method, with the required properties, to simulate from this distribution?

  25. Particle SAEM

  26. Particle MCMC [1]. We propose to use particle MCMC (PMCMC):
      1. Use a particle filter to approximate
         $p_\theta(x_{0:T} \mid y_{1:T}) \approx \hat{p}_\theta(x_{0:T} \mid y_{1:T}) \triangleq \sum_{i=1}^{N} w_T^i\, \delta_{x_{0:T}^i}(x_{0:T})$.
      2. Use the approximation to define an MCMC kernel on the space of state trajectories.
      3. Some "tricks" are used to ensure the correct stationary distribution for any $N \ge 1$.

  27. PMCMC – Part 1: the particle filter. Use importance sampling to approximate
      $p_\theta(x_0),\ p_\theta(x_{0:1} \mid y_1),\ p_\theta(x_{0:2} \mid y_{1:2}),\ \dots,\ p_\theta(x_{0:T} \mid y_{1:T})$,
      sequentially in time. Exploits the temporal structure of a dynamical system!
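A bootstrap particle filter is the simplest instance of this sequential importance sampling scheme. The sketch below uses a toy linear-Gaussian model (an assumption for illustration, not the model from the talk), propagating particles through the dynamics and weighting them by the measurement likelihood:

```python
import numpy as np

# Bootstrap particle filter for the toy model (assumed):
#   x_{t+1} = 0.9 x_t + v_t, v_t ~ N(0, 1);   y_t = x_t + e_t, e_t ~ N(0, 1).
def bootstrap_pf(y, N, rng):
    """Return a log-likelihood estimate and the final particle set."""
    x = rng.standard_normal(N)    # initial particles, x_0 ~ N(0, 1)
    loglik = 0.0
    for t in range(len(y)):
        logw = -0.5 * np.log(2 * np.pi) - 0.5 * (y[t] - x) ** 2  # log N(y_t; x_t, 1)
        c = logw.max()
        loglik += c + np.log(np.mean(np.exp(logw - c)))          # stable log-mean-exp
        w = np.exp(logw - c)
        idx = rng.choice(N, size=N, p=w / w.sum())               # multinomial resampling
        x = 0.9 * x[idx] + rng.standard_normal(N)                # propagate dynamics
    return loglik, x

rng = np.random.default_rng(3)
y_data = rng.standard_normal(30)                # stand-in observations
ll, particles = bootstrap_pf(y_data, N=500, rng=rng)
```

Resampling by the weights is what keeps the particle set concentrated on probable trajectories; tracking the ancestor indices `idx` over time would recover the weighted trajectory approximation $\hat{p}_\theta(x_{0:T} \mid y_{1:T})$ used by PMCMC.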

  28. PMCMC – Part 1. [Plot: particle filter output, state trajectories vs. time $t = 1, \dots, 25$.]
