Adaptive importance sampling for control and inference

Bert Kappen
SNN Donders Institute, Radboud University, Nijmegen
Gatsby Unit, UCL London

December 10, 2016

Joint work with Hans Ruiz, Dominik Thalmeier
Optimal control theory

Hard problems:
- a learning and exploration problem
- a stochastic optimal control computation
- a representation problem for u(x, t)
PICE: Integrating Control, Inference and Learning

Path integral control theory
- Express a control computation as an inference computation.
- Compute the optimal control using MC sampling.

Importance sampling
- Accelerate with importance sampling (= a state-feedback controller).
- The optimal importance sampler is the optimal control.

Learning
- Learn the controller from self-generated data.
- Use the Cross Entropy method for a parametrized controller.
PICE: Integrating control, inference and learning

Massively parallel computation. The Monte Carlo sampling serves two purposes:
- Planning: compute the control for the current state
- Learning: improve the sampler/controller for future control computations
Path integral control theory

The uncontrolled dynamics specifies a distribution q(τ|x, t) over trajectories τ starting from x, t.

The cost of a trajectory τ is

$$S(\tau|x,t) = \phi(x_T) + \int_t^T ds\, V(x_s, s)$$

Find the optimal distribution p(τ|x, t) that minimizes $E_p S$ and is 'close' to q(τ|x, t).
KL control

Find p* that minimizes

$$C(p) = KL(p|q) + E_p S, \qquad KL(p|q) = \int d\tau\, p(\tau|x,t) \log \frac{p(\tau|x,t)}{q(\tau|x,t)}$$

The optimal solution is given by

$$p^*(\tau|x,t) = \frac{1}{\psi(x,t)}\, q(\tau|x,t) \exp(-S(\tau|x,t))$$

$$\psi(x,t) = \int d\tau\, q(\tau|x,t) \exp(-S(\tau|x,t)) = E_q\, e^{-S}$$

The optimal cost is $C(p^*) = -\log \psi(x,t)$.
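The optimality of $p^* \propto q\, e^{-S}$ can be checked numerically on a small discrete "trajectory" space. This is a sketch with made-up values for q and S (the names `C`, `p_star`, `psi` are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small discrete trajectory space: 5 paths with prior q and costs S.
q = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
S = np.array([1.0, 0.5, 2.0, 0.3, 1.5])

def C(p):
    # C(p) = KL(p|q) + E_p S
    return np.sum(p * np.log(p / q)) + np.sum(p * S)

# Closed-form optimum: p* = q exp(-S) / psi, with psi = E_q exp(-S).
psi = np.sum(q * np.exp(-S))
p_star = q * np.exp(-S) / psi

# The optimal cost equals -log psi ...
assert np.isclose(C(p_star), -np.log(psi))

# ... and random alternative distributions all have higher cost.
for _ in range(100):
    p = rng.dirichlet(np.ones(5))
    assert C(p) >= C(p_star) - 1e-12
print("C(p*) =", C(p_star))
```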
Controlled diffusions

p(τ|x, t) is parametrized by a function u(x, t):

$$dX_t = f(X_t,t)\, dt + g(X_t,t)\big(u(X_t,t)\, dt + dW_t\big), \qquad E(dW_t^2) = dt$$

$$C(u|x,t) = E_u\left[ S(\tau|x,t) + \int_t^T ds\, \tfrac{1}{2} u(X_s,s)^2 \right]$$

q(τ|x, t) corresponds to u = 0. The goal is to find the function u(x, t) that minimizes C.
Solution

The optimal control problem is solved as a Feynman-Kac path integral. The optimal cost-to-go is

$$J(x,t) = -\log \int d\tau\, q(\tau|x,t)\, e^{-S(\tau|x,t)} = -\log E_q\left[ e^{-S} \right]$$

Optimal control:

$$u^*(x,t)\, dt = E_{p^*}(dW_t) = \frac{E_q\left[ dW\, e^{-S} \right]}{E_q\left[ e^{-S} \right]}$$

ψ and u* can be computed by forward sampling from q.
Sampling

[Figure: sample trajectories of the uncontrolled process]

Sample trajectories τ_i, i = 1, ..., N ∼ q(τ|x):

$$E_q\, e^{-S} \approx \frac{1}{N} \sum_{i=1}^N e^{-S(\tau_i|x,t)}$$

Sampling is unbiased but inefficient (large variance).
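The naive estimator can be tried on an assumed toy problem (not from the slides): f = 0, g = 1, V = 0, end cost φ(x) = x²/2, horizon T = 1. For this linear-Gaussian case $E_q e^{-S} = E\,e^{-X_T^2/2} = 1/\sqrt{1+T}$ with $X_T \sim N(0,T)$, which lets us check the Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem (assumed): f = 0, g = 1, V = 0, phi(x) = x^2 / 2, T = 1.
T, dt, N = 1.0, 0.01, 10_000
steps = int(T / dt)

x = np.zeros(N)                      # all trajectories start at x = 0
for _ in range(steps):
    x += np.sqrt(dt) * rng.standard_normal(N)   # uncontrolled dynamics q

S = 0.5 * x**2                       # S(tau) = phi(x_T), since V = 0
psi_hat = np.exp(-S).mean()          # E_q e^{-S} ~ (1/N) sum_i e^{-S_i}
J_hat = -np.log(psi_hat)             # estimate of the optimal cost-to-go

# Closed form for comparison: psi = 1/sqrt(1 + T), J = 0.5 * log(1 + T)
print(J_hat, 0.5 * np.log(1.0 + T))
```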
Importance sampling

Consider a simple 1-d sampling problem. Given q(x), compute

$$a = \mathrm{Prob}(x < 0) = \int_{-\infty}^{\infty} I(x)\, q(x)\, dx$$

with I(x) = 1 if x < 0 and I(x) = 0 if x > 0.

Naive method: generate N samples X_i ∼ q:

$$\hat{a} = \frac{1}{N} \sum_{i=1}^N I(X_i)$$
Importance sampling

Consider another distribution p(x). Then

$$a = \mathrm{Prob}(x < 0) = \int_{-\infty}^{\infty} \frac{I(x)\, q(x)}{p(x)}\, p(x)\, dx$$

Importance sampling: generate N samples X_i ∼ p:

$$\hat{a} = \frac{1}{N} \sum_{i=1}^N \frac{I(X_i)\, q(X_i)}{p(X_i)}$$

Unbiased (= correct) for any distribution p with p(x) > 0 wherever I(x)q(x) > 0!
Optimal importance sampling

The distribution

$$p^*(x) = \frac{q(x)\, I(x)}{a}$$

is the optimal importance sampler. One sample X ∼ p* is sufficient to estimate a:

$$\hat{a} = \frac{I(X)\, q(X)}{p^*(X)} = a$$
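The naive and importance-sampled estimators can be compared on an assumed concrete instance of the 1-d problem: q = N(2, 1), so a = Prob(x < 0) = Φ(−2), with sampler p = N(0, 1) (the choice of p is illustrative, not from the slides):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(2)
N = 100_000

# q = N(2, 1); estimate a = Prob(x < 0), the left tail of q.
mu_q = 2.0
a_true = 0.5 * (1 + erf(-mu_q / np.sqrt(2)))   # standard normal CDF at -2

def logpdf(x, mu):
    # log density of N(mu, 1), up to the shared normalisation constant
    return -0.5 * (x - mu) ** 2

# Naive: sample from q and count hits.
x = mu_q + rng.standard_normal(N)
a_naive = np.mean(x < 0)

# Importance sampling: sample from p = N(0, 1), reweight by q/p.
xp = rng.standard_normal(N)
w = np.exp(logpdf(xp, mu_q) - logpdf(xp, 0.0))
a_is = np.mean((xp < 0) * w)

print(a_true, a_naive, a_is)
```

Both estimates agree with the exact value; the importance-sampled one has visibly lower variance because p puts far more mass on the rare region x < 0.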
Importance sampling and control

In the case of control we must compute

$$J(x,t) = -\log E_q\, e^{-S}, \qquad u^*(x,t)\, dt = \frac{E_q\left[ dW\, e^{-S} \right]}{E_q\left[ e^{-S} \right]}$$

Instead of sampling from the uncontrolled dynamics q (u = 0), we sample with p (u ≠ 0):

$$E_q\, e^{-S} = E_p\left[ e^{-S}\, \frac{dq}{dp} \right] = E_p\left[ e^{-S - \int_t^T \frac{1}{2} u(x_s,s)^2\, dt - \int_t^T u(x_s,s)\, dW_s} \right] = E_p\, e^{-S_u}$$

We can choose any p, i.e. any sampling control u, to compute the expectation values.
Relation between optimal sampling and optimal control

Define

$$\alpha_i = \frac{e^{-S_u(\tau_i|x,t)}}{\sum_{j=1}^N e^{-S_u(\tau_j|x,t)}}, \qquad ESS = \frac{1}{\sum_{j=1}^N \alpha_j^2} \qquad (1 \le ESS \le N)$$

Thm:
1. A better u (in the sense of optimal control) provides a better sampler (in the sense of effective sample size).
2. The optimal u = u* (in the sense of optimal control) requires only one sample: α_i = 1/N and S_u(τ|x,t) is deterministic!

$$S_u(\tau|x,t) = S(\tau|x,t) + \int_t^T dt\, \tfrac{1}{2} u(x_s,s)^2 + \int_t^T u(x_s,s)\, dW_s$$
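The effective sample size of a batch of rollouts follows directly from the costs S_u(τ_i). A minimal sketch (the function name `ess` is illustrative):

```python
import numpy as np

def ess(S_u):
    """Effective sample size from costs S_u(tau_i) of N rollouts.

    alpha_i = e^{-S_u_i} / sum_j e^{-S_u_j},   ESS = 1 / sum_j alpha_j^2.
    """
    logw = -np.asarray(S_u, dtype=float)
    logw -= logw.max()                 # subtract max for numerical stability
    alpha = np.exp(logw)
    alpha /= alpha.sum()
    return 1.0 / np.sum(alpha ** 2)

# Equal costs (a perfect sampler): every alpha_i = 1/N, so ESS = N.
print(ess(np.ones(100)))              # -> 100.0
# Widely spread costs: one trajectory dominates and ESS collapses toward 1.
print(ess(np.array([0.0, 10.0, 10.0, 10.0])))
```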
So far

- Optimal control can be computed by MC sampling
- Sampling can be accelerated by using 'good' controls
- The optimal control for sampling is also the optimal control solution

How to learn a good controller?
The Cross-entropy method

Let p_u(x) be a family of probability density functions parametrized by u, and let h(x) be a positive function. Consider the expectation value

$$a = E_0\, h = \int dx\, p_0(x)\, h(x)$$

for the particular value u = 0. The optimal importance sampling distribution is $p^*(x) = h(x)\, p_0(x)/a$.

The cross entropy method minimises the KL divergence

$$KL(p^*|p_u) = \int dx\, p^*(x) \log \frac{p^*(x)}{p_u(x)} \propto -E_{p^*} \log p_u(X) \propto -E_0\, h(X) \log p_u(X) = -E_v\, \frac{h(X)\, p_0(X)}{p_v(X)} \log p_u(X)$$

iterating p_0 → p_1 → p_2 → ...
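The iteration p_0 → p_1 → p_2 → ... can be illustrated with the classic CE rare-event example (an assumed instance, not from the slides): h(x) = I(x ≥ γ), p_u = N(u, 1), where minimising the cross entropy over a Gaussian family reduces to a weighted mean. The elite-quantile fallback for iterations with no hits is a standard CE trick, also an assumption here:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(3)
gamma = 4.0                       # rare event threshold (assumed example)
a_true = 0.5 * (1 - erf(gamma / np.sqrt(2)))   # P(X >= 4) under N(0,1)

def logpdf(x, u):
    # log density of N(u, 1), up to the shared normalisation constant
    return -0.5 * (x - u) ** 2

N, u = 10_000, 0.0                # start sampling from p_0 itself
for _ in range(5):                # the iteration p_0 -> p_1 -> p_2 -> ...
    x = u + rng.standard_normal(N)
    w = (x >= gamma) * np.exp(logpdf(x, 0.0) - logpdf(x, u))
    if w.sum() == 0:              # no hits yet: move toward the best samples
        u = np.quantile(x, 0.99)
    else:
        u = (w * x).sum() / w.sum()   # CE update for a Gaussian family

# Final importance-sampling estimate with the learned sampler
x = u + rng.standard_normal(N)
w = (x >= gamma) * np.exp(logpdf(x, 0.0) - logpdf(x, u))
a_hat = w.mean()
print(u, a_hat, a_true)
```

Direct sampling from p_0 would need on the order of 1/a ≈ 30 000 samples per hit; the learned sampler concentrates near the rare region and estimates a accurately.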
The CE method for PI control

Sample p_u using

$$dX_t = f(X_t,t)\, dt + g(X_t,t)\big(u(X_t,t)\, dt + dW_t\big)$$

We wish to compute a close-to-optimal control u such that p_u is close to p*. Following the CE argument, we minimise

$$KL(p^*|p_u) = \frac{1}{\psi(t,x)}\, E_v\left[ e^{-S(t,x,v)} \int_t^T \frac{1}{2\, ds} \Big( u(X_s,s)\, ds - v(X_s,s)\, ds - dW_s \Big)^2 \right]$$

Here v is the importance sampling control. The expected value is independent of v, but the variance/accuracy depends on v.
The CE method for PI control

We parametrize the control u(x, t|θ). The gradient is given by:

$$\frac{\partial KL(p^*|p_u)}{\partial \theta} = \left\langle \int_t^T \big( u(X_s,s)\, ds - v(X_s,s)\, ds - dW_s \big)\, \frac{\partial u(X_s,s)}{\partial \theta} \right\rangle_v$$

For v = u this reduces to

$$\frac{\partial KL(p^*|p_u)}{\partial \theta} = -\left\langle \int_t^T \frac{\partial u(X_s,s)}{\partial \theta}\, dW_s \right\rangle_u$$

Gradient descent: $\theta := \theta - \epsilon\, \partial KL(p^*|p_u)/\partial \theta$.

We refer to the method as PICE (Path Integral Cross Entropy).
Model based motor learning

compute control
for k = 0, 1, ... do
    data_k = generate_data(model, u_k)      % Monte Carlo importance sampler
    u_{k+1} = learn_control(data_k, u_k)    % Deep or recurrent learning
end for

[Figure: sample trajectories before and after learning]
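The loop above can be sketched end-to-end on an assumed 1-d toy problem: dX = u dt + dW, X_0 = 1, end cost φ(x) = 2x², control cost (1/2)∫u² dt, with the deliberately crude linear controller u(x) = θx standing in for the deep networks used in the actual experiments. Each iteration generates weighted rollouts and takes a PICE gradient step at v = u; all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy problem: dX = u dt + dW, X_0 = 1, phi(x) = 2 x^2, T = 1.
T, dt, N = 1.0, 0.02, 2000
steps = int(T / dt)

def rollout(theta):
    x = np.ones(N)
    ctrl = np.zeros(N)        # int (1/2) u^2 dt
    girs = np.zeros(N)        # int u dW  (Girsanov term of S_u)
    grad_term = np.zeros(N)   # int x dW  (since du/dtheta = x here)
    for _ in range(steps):
        u = theta * x
        dW = np.sqrt(dt) * rng.standard_normal(N)
        ctrl += 0.5 * u**2 * dt
        girs += u * dW
        grad_term += x * dW
        x += u * dt + dW
    phi = 2.0 * x**2
    S_u = phi + ctrl + girs   # S_u = S + int (1/2) u^2 dt + int u dW
    return phi + ctrl, S_u, grad_term

theta, eps = 0.0, 0.1
costs = []
for k in range(100):
    cost, S_u, grad_term = rollout(theta)
    w = np.exp(-(S_u - S_u.min()))   # importance weights alpha_i
    w /= w.sum()
    costs.append(cost.mean())
    # PICE gradient at v = u:  dKL/dtheta = -< int x dW >_weighted
    theta += eps * np.sum(w * grad_term)
print(theta, costs[0], costs[-1])
```

The learned gain θ goes negative (a stabilizing feedback that pulls X toward 0) and the control cost drops well below its initial value, illustrating both roles of the sampler: planning and improving future sampling.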
Parallel implementation

Massively parallel sampling on CPUs; massively parallel gradient computation on GPUs.

Goal: provide a generic solver for any PI control problem to arbitrary precision.
Acrobot

2 DOF, second order, underactuated, continuous stochastic control problem. The task is to swing up from the down position and stabilize.
Acrobot (acrobot.mp4)

Neural network: 2 hidden layers, 50 neurons per layer. Input is position and velocity.
2000 iterations, with 30000 rollouts per iteration. 100 cores, 15 minutes.
More samples per iteration is better :)

Fraction ESS versus IS iteration:
- 100k samples (green, cyan)
- 300k samples (red, blue)
- 1000k samples (black, yellow)
Trust region

The initial gradient computation is too hard. Introduce a (KL) trust region.

Control cost vs. IS iteration:
- Blue line: small trust region (ESS ≈ 50%, 30k samples) (= video)
- Red line: intermediate trust region (ESS ≈ 1%, 100k samples)
- Green line: large trust region (ESS ≈ 0.1%, 300k samples)

Trade-off between speed and optimality.
Discussion

Continuous time SOC is very hard to compute.
- PI control: Control ↔ inference
- Better sampling (ESS) ↔ better control (control objective)
- IS: Learning the control solution also increases the efficiency of (future) control computations

Continuous time SOC is very hard to represent.
- CE for parameter estimation → deep neural network