Learning dynamical systems with particle stochastic approximation EM


  1. Learning dynamical systems with particle stochastic approximation EM. Fredrik Lindsten, Linköping University, 2019-04-11. Joint work with Andreas Lindholm, Uppsala University.

  2. Parametric state-space models. Dynamical system on state-space form,

      $x_{t+1} = f(x_t, u_t, v_t; \theta)$,
      $y_t = g(x_t, u_t, e_t; \theta)$.

      Properties:
      • Can handle nonlinear dynamics ($f(\cdot)$ and $g(\cdot)$)
      • Stochastic (process and measurement noise)
      • Noise variables can be non-Gaussian
      • Parametric: parameterized by $\theta \in \mathbb{R}^{n_\theta}$

      Note: the states $\{x_t\}$ are latent (unobserved) random variables.
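A model on this form can be written as a short simulator. The sketch below is a minimal illustration with toy choices of $f$ and $g$ (a quadratic measurement map and Gaussian noise; these are assumptions for illustration, not the models from the talk):

```python
import numpy as np

# Toy nonlinear state-space model (assumed for illustration):
#   x_{t+1} = theta * x_t + u_t + v_t,   v_t ~ N(0, 1)
#   y_t     = x_t**2 / 20 + e_t,         e_t ~ N(0, 0.1)
def simulate_ssm(theta, u, rng):
    """Simulate states x_{0:T} and observations y_{1:T} given inputs u_{1:T}."""
    T = len(u)
    x = np.zeros(T + 1)
    y = np.zeros(T)
    for t in range(T):
        y[t] = x[t] ** 2 / 20.0 + np.sqrt(0.1) * rng.standard_normal()
        x[t + 1] = theta * x[t] + u[t] + rng.standard_normal()
    return x, y

rng = np.random.default_rng(0)
u = np.ones(100)                  # known input signal
x, y = simulate_ssm(0.5, u, rng)  # theta is the parameter to be learned
```

Only $\{y_t\}$ and $\{u_t\}$ would be available to the learning algorithm; the state sequence $\{x_t\}$ is latent.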

  3. Maximum likelihood identification. We observe $y_{1:T} = (y_1, \dots, y_T)$ and $u_{1:T} = (u_1, \dots, u_T)$.

      Aim: compute the maximum likelihood estimate,
      $\hat{\theta}_{\text{ML}} = \arg\max_\theta \{ \log p_\theta(y_{1:T}) \}$.

      Desired properties of the learning algorithm:
      1. Convergent: if given enough time, a (local) maximum is found.
      2. Computationally efficient: the total runtime to recover a satisfactory solution should not be excessively large.

  4. Teaser: Cascaded water tanks. Cascaded water tank system from the nonlinear system identification benchmark: http://www.nonlinearbenchmark.org/

      M. Schoukens & J.P. Noël. Three Benchmarks Addressing Open Challenges in Nonlinear System Identification. 20th World Congress of the International Federation of Automatic Control, Toulouse, France, July 2017.

      Evaluated by computing simulation error on test data.

      Model                         | Error
      Initial model to PSAEM        | 2.85
      Estimated with PSAEM          | 0.29
      Best result from literature   | 0.34

  5. Teaser: Nonlinear toy model. [Plot: average relative error vs. computational time (seconds), comparing PSAEM (N=15) against PSEM with N=15, 50, 100, and 500.]

  6. A composite algorithm. Particle Stochastic Approximation EM (PSAEM) is an exercise in building a complicated algorithm from "simple" parts. [Diagram: Expectation Maximization + stochastic approximation → SAEM; particle filters + MCMC → PMCMC; SAEM + PMCMC → PSAEM.]

  7. MCMC + Stochastic Approximation. Particle Stochastic Approximation EM (PSAEM) is an example of how MCMC and stochastic approximation can come together! Algorithms involving both optimization and Monte Carlo sampling are common in machine learning and system identification, e.g. SGD, variational inference, SAEM, ... Often we assume unbiased estimates, but it is possible to generalize to Markovian dependencies ⇒ we can use Markov chain Monte Carlo sampling!

  8. Schematic outline. [Diagram as on slide 6; the focus in what follows is the SAEM branch: Expectation Maximization + stochastic approximation → SAEM.]

  9. Expectation Maximization

  10. Latent variable models. A latent variable model is described by a joint probability density function $p_\theta(x, y)$ where
      • $y$ is observed (the data),
      • $x$ is unobserved (the latent variable),
      • $\theta$ is an unknown parameter.

      ex) Probabilistic representation of a state-space model:
      $x_{t+1} = f(x_t, u_t, v_t; \theta)$, $y_t = g(x_t, u_t, e_t; \theta)$ $\iff$ $p_\theta(x_{t+1} \mid x_t)$, $p_\theta(y_t \mid x_t)$.

      Therefore,
      $p_\theta(x_{0:T}, y_{1:T}) = p(x_0) \prod_{t=1}^{T} p_\theta(x_t \mid x_{t-1})\, p_\theta(y_t \mid x_t)$.

  11. Maximum likelihood estimation. Aim: compute the maximum likelihood estimate,
      $\hat{\theta}_{\text{ML}} = \arg\max_\theta \{ \log p_\theta(y) \}$.

      Evaluating the likelihood $p_\theta(y)$ requires marginalizing the latent variable,
      $p_\theta(y) = \int p_\theta(x, y)\, dx$.

      Expectation maximization (EM) avoids explicit marginalization by iteratively optimizing lower bounds on the likelihood.

  12. Expectation Maximization [3]. The EM algorithm generates a sequence of iterates $\theta_0, \theta_1, \theta_2, \dots$

      Iterate:
      (E) $Q_k(\theta) \triangleq \int \log p_\theta(x, y)\, p_{\theta_{k-1}}(x \mid y)\, dx$,
      (M) $\theta_k = \arg\max_\theta Q_k(\theta)$.

      The iterates $\{\theta_k\}_{k \ge 0}$ converge (under weak conditions) to a stationary point of $p_\theta(y)$.
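The two steps can be made concrete on a toy latent-variable model where the E-step is tractable. The model below is an assumed illustration, not from the talk: $x \sim N(\theta, 1)$, $y \mid x \sim N(x, 1)$, with a single observation $y$. Here $Q_k(\theta)$ is maximized at the posterior mean of $x$, so the EM update reduces to a simple fixed-point iteration:

```python
# EM for the toy model x ~ N(theta, 1), y | x ~ N(x, 1) (an assumption for
# illustration). The E-step gives E[x | y, theta'] = (theta' + y) / 2, and
# maximizing Q_k(theta) sets theta to that posterior mean.
y_obs = 2.0
theta = 0.0
for k in range(60):
    e_x = (theta + y_obs) / 2.0   # (E) posterior mean of the latent variable
    theta = e_x                   # (M) maximizer of Q_k(theta)
# the fixed point is theta = y_obs, which is the ML estimate here,
# since p_theta(y) = N(y; theta, 2)
```

Each iteration halves the distance to the fixed point, illustrating the monotone climb toward a stationary point of the likelihood.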

  13. EM for nonlinear state-space models. ex) For a state-space model,
      $Q_k(\theta) = \int \log p_\theta(x_{0:T}, y_{1:T})\, p_{\theta_{k-1}}(x_{0:T} \mid y_{1:T})\, dx_{0:T}$.

      First hurdle: a high-dimensional integration problem!

  14. Stochastic Approximation EM

  15. Stochastic approximation of the Q-function.
      $Q_k(\theta) \triangleq \int \log p_\theta(x, y)\, p_{\theta_{k-1}}(x \mid y)\, dx$.

      If $\tilde{x} \sim p_{\theta_{k-1}}(x \mid y)$, then $\log p_\theta(\tilde{x}, y)$ is an unbiased estimate of $Q_k(\theta)$. Replace the E-step with:
      (S) Simulate $\tilde{x}_k \sim p_{\theta_{k-1}}(x \mid y)$,
      (E) $\hat{Q}_k(\theta) = \hat{Q}_{k-1}(\theta) + \gamma_k \left( \log p_\theta(\tilde{x}_k, y) - \hat{Q}_{k-1}(\theta) \right)$.

      Here $\sum_k \gamma_k = \infty$, $\sum_k \gamma_k^2 < \infty$, and $\hat{Q}_0(\theta) \equiv 0$.

      Intuitive interpretation: if $\theta_k$ converges, the $x$-values sampled at iterations $k, k+1, \dots$ will come from approximately the same distribution.
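The (E) update above is a Robbins-Monro recursion. A minimal sketch, tabulating $\hat{Q}$ on a small grid of $\theta$ values (the log-joint values are placeholders, not from any real model):

```python
import numpy as np

def sa_update(Q_prev, log_joint, gamma_k):
    """One step: Q_k = Q_{k-1} + gamma_k * (log p_theta(x_k, y) - Q_{k-1})."""
    return Q_prev + gamma_k * (log_joint - Q_prev)

gammas = [1.0 / k for k in range(1, 6)]     # sum gamma_k = inf, sum gamma_k^2 < inf
Q = np.zeros(3)                             # Q_0 ≡ 0, on a grid of 3 theta values
log_joint = np.array([-1.0, -0.5, -2.0])    # placeholder log p_theta(x_k, y) values
for g in gammas:
    Q = sa_update(Q, log_joint, g)
# with gamma_1 = 1 and a constant integrand, Q equals log_joint exactly
```

Because $\gamma_1 = 1$, the initialization $\hat{Q}_0 \equiv 0$ is forgotten after the first step; the decaying step sizes then average out the Monte Carlo noise in later iterations.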

  16. Representation of $\hat{Q}$. How is $\hat{Q}_k(\theta) = \hat{Q}_{k-1}(\theta) + \gamma_k \left( \log p_\theta(\tilde{x}_k, y) - \hat{Q}_{k-1}(\theta) \right)$ represented in practice?

      1. In general, as a sum, $\hat{Q}_k(\theta) = \sum_{j=1}^{k} \alpha_{jk} \log p_\theta(\tilde{x}_j, y)$. But this can be hard to maximize for large $k$.
      2. Exponential family models can be expressed using sufficient statistics, $\log p_\theta(x, y) = \langle \psi(\theta), s(x, y) \rangle - A(\theta)$. We can then compute
         (E.1) $S_k = S_{k-1} + \gamma_k (s(\tilde{x}_k, y) - S_{k-1})$,
         (E.2) $\hat{Q}_k(\theta) = \langle \psi(\theta), S_k \rangle - A(\theta)$.

  17. Stochastic Approximation EM [2]. The SAEM algorithm for exponential family models. Initialize $\theta_0$ and $S_0 = 0$. Iterate:
      (S) Simulate $\tilde{x}_k \sim p_{\theta_{k-1}}(x \mid y)$,
      (E.1) $S_k = S_{k-1} + \gamma_k (s(\tilde{x}_k, y) - S_{k-1})$,
      (E.2) $\hat{Q}_k(\theta) = \langle \psi(\theta), S_k \rangle - A(\theta)$,
      (M) $\theta_k = \arg\max_\theta \hat{Q}_k(\theta)$.

      The iterates $\{\theta_k\}_{k \ge 0}$ converge (under regularity conditions) to a local maximum of $p_\theta(y)$.
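The full loop can be sketched on a toy exponential-family model where the (S) step can sample exactly. The model is an assumed illustration, not from the talk: $x \sim N(\theta, 1)$, $y \mid x \sim N(x, 1)$, single observation $y$. Then $s(x, y) = x$, the M-step reduces to $\theta_k = S_k$, and the posterior is tractable, $p_\theta(x \mid y) = N((\theta + y)/2,\ 1/2)$:

```python
import numpy as np

# SAEM on a toy model (assumed): x ~ N(theta, 1), y | x ~ N(x, 1).
# log p_theta(x, y) = theta*x - theta**2/2 + const, so psi(theta) = theta,
# s(x, y) = x, A(theta) = theta**2/2, and argmax_theta Q_hat_k gives theta = S_k.
rng = np.random.default_rng(1)
y_obs = 2.0
theta, S = 0.0, 0.0
for k in range(1, 5001):
    gamma = 1.0 / k                                      # step sizes: sum = inf, sum of squares < inf
    x = rng.normal((theta + y_obs) / 2.0, np.sqrt(0.5))  # (S) exact posterior sample
    S = S + gamma * (x - S)                              # (E.1) sufficient-statistic update
    theta = S                                            # (M) maximize Q_hat_k
# theta approaches the ML estimate theta = y_obs
```

Only the running statistic $S_k$ needs to be stored between iterations, which is the practical payoff of the exponential-family representation.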

  18. SAEM for nonlinear state-space models. ex) For a state-space model, the simulation step involves
      $\tilde{x}_{0:T} \sim p_\theta(x_{0:T} \mid y_{1:T})$.

      Second hurdle: simulating from the joint smoothing distribution is not possible for a nonlinear/non-Gaussian state-space model.

  19. Combining SAEM and MCMC

  20. Markovian noise. We need to simulate from an intractable posterior $p_\theta(x \mid y)$. This is what Markov chain Monte Carlo is designed for!

      MCMC, key idea: an ergodic stochastic process has "the same behavior averaged over time as averaged over the space of all the system's states" (Wikipedia):
      $\int g(x)\, p_\theta(x \mid y)\, dx = \lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} g(\tilde{x}_k)$.

  21. MCMC basics. Requirements: the process $\{\tilde{x}_k\}_{k \ge 1}$ has to...
      1. ...have $p_\theta(x \mid y)$ as its stationary distribution,
      2. ...be ergodic ("sufficiently stochastic").

      MCMC: initialize $\tilde{x}_0$ and simulate $\tilde{x}_k \sim q_\theta(\tilde{x}_k \mid \tilde{x}_{k-1})$, $k = 1, 2, \dots$. Here $\{q_\theta(\tilde{x} \mid x) : \theta \in \mathbb{R}^{n_\theta}\}$ is a family of Markov kernels s.t.
      1. (Stationary distribution) $\int q_\theta(\tilde{x} \mid x)\, p_\theta(x \mid y)\, dx = p_\theta(\tilde{x} \mid y)$,
      2. (Ergodicity) Sufficient condition: $\exists \eta > 0$ s.t. $P_{q_\theta(\cdot \mid x)}[\tilde{x} \in A] \ge \eta\, P_{p_\theta(\cdot \mid y)}[\tilde{x} \in A]$.
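A minimal kernel satisfying both requirements is random-walk Metropolis-Hastings. The sketch below targets a standard normal distribution as a stand-in for $p_\theta(x \mid y)$ (the target choice is an assumption for illustration):

```python
import numpy as np

# Random-walk Metropolis-Hastings kernel q(x'|x) with a N(0, 1) target
# (a stand-in for the intractable posterior p_theta(x | y)).
def mh_kernel(x, rng, step=1.0):
    prop = x + step * rng.standard_normal()   # symmetric random-walk proposal
    log_ratio = -0.5 * prop**2 + 0.5 * x**2   # log target density ratio
    if np.log(rng.uniform()) < log_ratio:
        return prop                           # accept the proposed move
    return x                                  # reject: stay at the current state

rng = np.random.default_rng(2)
x = 0.0
chain = np.empty(20000)
for k in range(20000):
    x = mh_kernel(x, rng)
    chain[k] = x

# the ergodic average approximates the stationary expectation E[x] = 0
est_mean = chain[2000:].mean()
est_var = chain[2000:].var()
```

The accept/reject correction is what enforces the stationary-distribution requirement; the Gaussian proposal makes the chain ergodic, since it can reach any set with positive probability.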

  22. MCMC methods. 65+ years of research on MCMC has resulted in many clever ways of constructing $q_\theta(\tilde{x} \mid x)$:
      • Metropolis–Hastings
      • Gibbs sampling
      • Langevin algorithms
      • Hamiltonian Monte Carlo
      • Slice sampling
      • Particle Markov chain Monte Carlo (← what we use in PSAEM)
      • ...

  23. MCMC-SAEM [4]. Initialize $\theta_0$, $\tilde{x}_0$, and $S_0 = 0$. Iterate:
      (S) Simulate $\tilde{x}_k \sim q_{\theta_{k-1}}(\tilde{x} \mid \tilde{x}_{k-1})$,
      (E.1) $S_k = S_{k-1} + \gamma_k (s(\tilde{x}_k, y) - S_{k-1})$,
      (E.2) $\hat{Q}_k(\theta) = \langle \psi(\theta), S_k \rangle - A(\theta)$,
      (M) $\theta_k = \arg\max_\theta \hat{Q}_k(\theta)$.

      Under appropriate conditions the iterates $\{\theta_k\}_{k \ge 0}$ converge to a local maximum of $p_\theta(y)$. We require that $\{q_\theta(\tilde{x} \mid x) : \theta \in \mathbb{R}^{n_\theta}\}$...
      • ...has the correct stationary distribution and is ergodic for "all" $\theta$,
      • ...is smooth in $\theta$: $q_\theta(\tilde{x} \mid x) \approx q_{\theta'}(\tilde{x} \mid x)$ if $\theta \approx \theta'$.

  24. MCMC-SAEM for nonlinear state-space models. ex) For a state-space model, the MCMC step aims at simulating from the joint smoothing distribution $p_\theta(x_{0:T} \mid y_{1:T})$.

      Third hurdle: how can we construct an efficient MCMC method, with the required properties, to simulate from this distribution?

  25. Particle SAEM

  26. Particle MCMC [1]. We propose to use particle MCMC (PMCMC):
      1. Use a particle filter to approximate
         $p_\theta(x_{0:T} \mid y_{1:T}) \approx \hat{p}_\theta(x_{0:T} \mid y_{1:T}) \triangleq \sum_{i=1}^{N} w_T^i\, \delta_{x_{0:T}^i}(x_{0:T})$.
      2. Use the approximation to define an MCMC kernel on the space of state trajectories.
      3. Some "tricks" are used to ensure the correct stationary distribution for any $N \ge 1$.

  27. PMCMC – Part 1: the particle filter. Use importance sampling to approximate
      $p_\theta(x_0),\ p_\theta(x_{0:1} \mid y_1),\ p_\theta(x_{0:2} \mid y_{1:2}),\ \dots,\ p_\theta(x_{0:T} \mid y_{1:T})$,
      sequentially in time. Exploits the temporal structure of a dynamical system!
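A bootstrap particle filter is the simplest instance of this sequential importance sampling scheme. The sketch below uses a toy linear-Gaussian model (an assumption for illustration, not the model from the talk), propagating particles through the dynamics and weighting them by the measurement likelihood:

```python
import numpy as np

# Bootstrap particle filter for the toy model (assumed):
#   x_{t+1} = 0.9 x_t + v_t, v_t ~ N(0, 1);   y_t = x_t + e_t, e_t ~ N(0, 1).
def bootstrap_pf(y, N, rng):
    """Return a log-likelihood estimate and the final particle set."""
    x = rng.standard_normal(N)    # initial particles, x_0 ~ N(0, 1)
    loglik = 0.0
    for t in range(len(y)):
        logw = -0.5 * np.log(2 * np.pi) - 0.5 * (y[t] - x) ** 2  # log N(y_t; x_t, 1)
        c = logw.max()
        loglik += c + np.log(np.mean(np.exp(logw - c)))          # stable log-mean-exp
        w = np.exp(logw - c)
        idx = rng.choice(N, size=N, p=w / w.sum())               # multinomial resampling
        x = 0.9 * x[idx] + rng.standard_normal(N)                # propagate dynamics
    return loglik, x

rng = np.random.default_rng(3)
y_data = rng.standard_normal(30)                # stand-in observations
ll, particles = bootstrap_pf(y_data, N=500, rng=rng)
```

Resampling by the weights is what keeps the particle set concentrated on probable trajectories; tracking the ancestor indices `idx` over time would recover the weighted trajectory approximation $\hat{p}_\theta(x_{0:T} \mid y_{1:T})$ used by PMCMC.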

  28. PMCMC – Part 1. [Plot: particle filter output, state trajectories vs. time $t = 1, \dots, 25$.]
