

1. Adaptive importance sampling for control and inference

Bert Kappen
SNN Donders Institute, Radboud University, Nijmegen
Gatsby Unit, UCL London
December 10, 2016

Joint work with Hans Ruiz and Dominik Thalmeier

2. Optimal control theory

Hard problems:
- a learning and exploration problem
- a stochastic optimal control computation
- a representation problem for $u(x,t)$

3. PICE: integrating control, inference and learning

Path integral control theory: express a control computation as an inference computation; compute the optimal control using Monte Carlo (MC) sampling.

4. PICE: integrating control, inference and learning

Path integral control theory: express a control computation as an inference computation; compute the optimal control using MC sampling.

Importance sampling: accelerate with importance sampling (= a state-feedback controller). The optimal importance sampler is the optimal control.

5. PICE: integrating control, inference and learning

Path integral control theory: express a control computation as an inference computation; compute the optimal control using MC sampling.

Importance sampling: accelerate with importance sampling (= a state-feedback controller). The optimal importance sampler is the optimal control.

Learning: learn the controller from self-generated data, using the cross-entropy method for a parametrized controller.

6. PICE: integrating control, inference and learning

Massively parallel computation.

7. PICE: integrating control, inference and learning

Massively parallel computation. The Monte Carlo sampling serves two purposes:
• Planning: compute the control for the current state
• Learning: improve the sampler/controller for future control computations

8. Path integral control theory

The uncontrolled dynamics specifies a distribution $q(\tau|x,t)$ over trajectories $\tau$ starting from $x, t$.

The cost of a trajectory $\tau$ is
$$S(\tau|x,t) = \phi(x_T) + \int_t^T ds\, V(x_s, s)$$

Find the optimal distribution $p(\tau|x,t)$ that minimizes $E_p S$ and is 'close' to $q(\tau|x,t)$.

9. KL control

Find $p^*$ that minimizes
$$C(p) = KL(p|q) + E_p S, \qquad KL(p|q) = \int d\tau\, p(\tau|x,t) \log \frac{p(\tau|x,t)}{q(\tau|x,t)}$$
The optimal solution is given by
$$p^*(\tau|x,t) = \frac{1}{\psi(x,t)}\, q(\tau|x,t) \exp(-S(\tau|x,t)), \qquad \psi(x,t) = \int d\tau\, q(\tau|x,t) \exp(-S(\tau|x,t)) = E_q\, e^{-S}$$
The optimal cost is $C(p^*) = -\log \psi(x,t)$.
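The optimality of $p^*$ follows in one line from the definitions above; a short derivation, added here for completeness: since $q e^{-S} = \psi\, p^*$,
$$C(p) = \int d\tau\, p(\tau|x,t) \log \frac{p(\tau|x,t)}{q(\tau|x,t)\, e^{-S(\tau|x,t)}} = KL(p\,|\,p^*) - \log \psi(x,t),$$
which is minimized at $p = p^*$, where the KL term vanishes, giving $C(p^*) = -\log \psi(x,t)$.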

10. Controlled diffusions

$p(\tau|x,t)$ is parametrized by a function $u(x,t)$:
$$dX_t = f(X_t,t)\, dt + g(X_t,t)\,(u(X_t,t)\, dt + dW_t), \qquad E(dW_t^2) = dt$$
$$C(u|x,t) = E_u\left[ S(\tau|x,t) + \int_t^T ds\, \tfrac{1}{2} u(X_s,s)^2 \right]$$
$q(\tau|x,t)$ corresponds to $u = 0$. The goal is to find the function $u(x,t)$ that minimizes $C$.
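As a concrete reference, here is a minimal Euler-Maruyama discretization of this controlled diffusion (a sketch; the function handles `f`, `g`, `u` and the scalar-state assumption are illustrative, not from the slides):

```python
import numpy as np

def rollout(f, g, u, x0, t0, T, dt, rng):
    """Euler-Maruyama simulation of dX = f dt + g (u dt + dW).
    Returns the state path, the noise increments dW, and the
    accumulated control cost 0.5 * sum(u^2) * dt."""
    n = int(round((T - t0) / dt))
    xs, dws = [x0], []
    x, t, control_cost = x0, t0, 0.0
    for _ in range(n):
        dw = rng.normal(scale=np.sqrt(dt))
        ut = u(x, t)
        x = x + f(x, t) * dt + g(x, t) * (ut * dt + dw)
        t += dt
        control_cost += 0.5 * ut**2 * dt
        xs.append(x)
        dws.append(dw)
    return np.array(xs), np.array(dws), control_cost
```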

11. Solution

The optimal control problem is solved as a Feynman-Kac path integral. The optimal cost-to-go is
$$J(x,t) = -\log \int d\tau\, q(\tau|x,t)\, e^{-S(\tau|x,t)} = -\log E_q\left[e^{-S}\right]$$
Optimal control:
$$u^*(x,t)\, dt = E_{p^*}(dW_t) = \frac{E_q\left[dW\, e^{-S}\right]}{E_q\left[e^{-S}\right]}$$
$\psi$ and $u^*$ can be computed by forward sampling from $q$.
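In discretized form both formulas become weighted averages over rollouts; a sketch of the resulting estimator (the argument names are illustrative):

```python
import numpy as np

def optimal_control_estimate(dW0, S_vals, dt):
    """u*(x,t) ~ E_q[dW_0 e^{-S}] / (dt * E_q[e^{-S}]), where dW0[i]
    is the first noise increment of rollout i and S_vals[i] its cost."""
    w = np.exp(-(S_vals - S_vals.min()))  # shift exponent for stability
    return np.sum(w * dW0) / (np.sum(w) * dt)
```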

12. Sampling

[Figure: sample trajectories from the uncontrolled dynamics]

Sample trajectories $\tau_i,\ i = 1, \dots, N \sim q(\tau|x)$:
$$E_q\, e^{-S} \approx \frac{1}{N} \sum_{i=1}^N e^{-S(\tau_i|x,t)}$$
Sampling is unbiased but inefficient (large variance).
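A naive estimate of $\psi$ built on the `rollout` sketch above, with the control set to zero as the slide prescribes (`S_of_path` is an illustrative cost function):

```python
def psi_naive(S_of_path, f, g, x0, t0, T, dt, N, rng):
    """psi(x,t) = E_q[e^{-S}] estimated from N uncontrolled rollouts."""
    zero_u = lambda x, t: 0.0
    total = 0.0
    for _ in range(N):
        xs, _, _ = rollout(f, g, zero_u, x0, t0, T, dt, rng)
        total += np.exp(-S_of_path(xs))
    return total / N
```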

13. Importance sampling

[Figure: density q(x) and the indicator region x < 0]

Consider a simple 1-d sampling problem. Given $q(x)$, compute
$$a = \mathrm{Prob}(x < 0) = \int_{-\infty}^{\infty} I(x)\, q(x)\, dx$$
with $I(x) = 1$ if $x < 0$ and $I(x) = 0$ if $x > 0$.

Naive method: generate $N$ samples $X_i \sim q$:
$$\hat a = \frac{1}{N} \sum_{i=1}^N I(X_i)$$
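The 1-d example is easy to reproduce; a sketch taking $q = \mathcal{N}(1,1)$ as an illustrative choice (not specified on the slide), for which $a = \Phi(-1) \approx 0.159$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=1.0, size=100_000)  # X_i ~ q
a_naive = np.mean(X < 0.0)                        # (1/N) sum_i I(X_i)
```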

14. Importance sampling

[Figure: q(x) together with an alternative proposal p(x)]

Consider another distribution $p(x)$. Then
$$a = \mathrm{Prob}(x < 0) = \int_{-\infty}^{\infty} \frac{I(x)\, q(x)}{p(x)}\, p(x)\, dx$$
Importance sampling: generate $N$ samples $X_i \sim p$:
$$\hat a = \frac{1}{N} \sum_{i=1}^N \frac{I(X_i)\, q(X_i)}{p(X_i)}$$
Unbiased (= correct) for any distribution $p$ (with $p(x) > 0$ wherever $I(x) q(x) > 0$)!
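Continuing the snippet above, the same estimate with a proposal shifted toward the rare region; `scipy.stats.norm` supplies the densities for the weight $q/p$:

```python
from scipy.stats import norm

Xp = rng.normal(loc=-1.0, scale=1.0, size=100_000)  # X_i ~ p, shifted left
w = norm.pdf(Xp, loc=1.0) / norm.pdf(Xp, loc=-1.0)  # q(X_i) / p(X_i)
a_is = np.mean((Xp < 0.0) * w)  # unbiased, much lower variance here
```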

15. Optimal importance sampling

[Figure: the optimal proposal density]

The distribution
$$p^*(x) = \frac{q(x)\, I(x)}{a}$$
is the optimal importance sampler. One sample $X \sim p^*$ is sufficient to estimate $a$:
$$\hat a = \frac{I(X)\, q(X)}{p^*(X)} = a$$

16. Importance sampling and control

In the case of control we must compute
$$J(x,t) = -\log E_q\, e^{-S}, \qquad u^*(x,t)\, dt = \frac{E_q\left[dW\, e^{-S}\right]}{E_q\left[e^{-S}\right]}$$
Instead of samples from the uncontrolled dynamics $q$ ($u = 0$), we sample with $p$ ($u \neq 0$):
$$e^{-S}\, dq = e^{-S - \int_t^T \frac{1}{2} u(x_s,s)^2\, dt - \int_t^T u(x_s,s)\, dW_s}\, dp = e^{-S_u}\, dp, \qquad E_q\, e^{-S} = E_p\, e^{-S_u}$$
We can choose any $p$, i.e. any sampling control $u$, to compute the expectation values.
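Per rollout, the importance weight exponent is computed from quantities the simulation already produces; a sketch matching the formula above (the per-step arrays `u_vals` and `dW` are illustrative inputs):

```python
import numpy as np

def weight_exponent(S_path, u_vals, dW, dt):
    """S_u = S + int 0.5 u^2 dt + int u dW, discretized along one
    rollout that was sampled under the control u."""
    return S_path + 0.5 * np.sum(u_vals**2) * dt + np.sum(u_vals * dW)
```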

17. Relation between optimal sampling and optimal control

Define
$$\alpha_i = \frac{e^{-S_u(\tau_i|x,t)}}{\sum_{j=1}^N e^{-S_u(\tau_j|x,t)}}, \qquad ESS = \frac{1}{\sum_{j=1}^N \alpha_j^2} \quad (1 \le ESS \le N)$$
with
$$S_u(\tau|x,t) = S(\tau|x,t) + \int_t^T dt\, \tfrac{1}{2} u(x_s,s)^2 + \int_t^T u(x_s,s)\, dW_s$$
Theorem:
1. A better $u$ (in the sense of optimal control) provides a better sampler (in the sense of effective sample size).
2. The optimal $u = u^*$ (in the sense of optimal control) requires only one sample: $\alpha_i = 1/N$ and $S_u(\tau|x,t)$ is deterministic!
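The normalized weights and effective sample size, computed with a shifted exponent for numerical stability (a standard trick, not shown on the slide):

```python
import numpy as np

def effective_sample_size(S_u_vals):
    """alpha_i ~ exp(-S_u(tau_i)); ESS = 1 / sum_i alpha_i^2 in [1, N]."""
    logw = -(S_u_vals - np.min(S_u_vals))  # shift so max log-weight is 0
    alpha = np.exp(logw)
    alpha /= alpha.sum()
    return 1.0 / np.sum(alpha**2)
```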

18. So far

• Optimal control can be computed by MC sampling
• Sampling can be accelerated by using 'good' controls
• The optimal control for sampling is also the optimal control solution

How to learn a good controller?

19. The cross-entropy method

Let $p_u(x)$ be a family of probability density functions parametrized by $u$, and let $h(x)$ be a positive function. Consider the expectation value
$$a = E_0\, h = \int dx\, p_0(x)\, h(x)$$
for a particular value $u = 0$. The optimal importance sampling distribution is $p^*(x) = h(x)\, p_0(x) / a$.

The cross-entropy method minimizes the KL divergence
$$KL(p^*|p_u) = \int dx\, p^*(x) \log \frac{p^*(x)}{p_u(x)} \propto -E_{p^*} \log p_u(X) \propto -E_0\left[h(X) \log p_u(X)\right] = -E_v\left[h(X)\, \frac{p_0(X)}{p_v(X)} \log p_u(X)\right]$$
Iterating gives a sequence of proposals $p_0 \to p_1 \to p_2 \to \dots$
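For a concrete instance, if $p_u$ is the Gaussian family $\mathcal{N}(\mu, \sigma^2)$ (an illustrative choice, not from the slides), maximizing the weighted log-likelihood above has a closed-form CE update:

```python
import numpy as np

def ce_update_gaussian(X, h_vals, p0_pdf, pv_pdf):
    """One CE iteration: samples X ~ p_v, weights w = h(X) p0(X) / p_v(X);
    the w-weighted maximum-likelihood Gaussian is the new p_u."""
    w = h_vals * p0_pdf(X) / pv_pdf(X)
    mu = np.sum(w * X) / np.sum(w)
    sigma = np.sqrt(np.sum(w * (X - mu) ** 2) / np.sum(w))
    return mu, sigma
```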

20. The CE method for PI control

Sample $p_u$ using
$$dX_t = f(X_t,t)\, dt + g(X_t,t)\,(u(X_t,t)\, dt + dW_t)$$
We wish to compute a close-to-optimal control $u$ such that $p_u$ is close to $p^*$. Following the CE argument, we minimize
$$KL(p^*|p_u) = \frac{1}{\psi(t,x)}\, E_v\left[ e^{-S(t,x,v)} \int_t^T ds\, \frac{1}{2}\left( u(X_s,s) - v(X_s,s) - \frac{dW_s}{ds} \right)^2 \right]$$
$v$ is the importance sampling control. The expected value is independent of $v$, but the variance/accuracy depends on $v$.

21. The CE method for PI control

We parametrize the control $u(x,t|\theta)$. The gradient is given by:
$$\frac{\partial KL(p^*|p_u)}{\partial\theta} = \left\langle \int_t^T \left( u(X_s,s)\, ds - v(X_s,s)\, ds - dW_s \right) \frac{\partial u(X_s,s)}{\partial\theta} \right\rangle_v$$
For $v = u$ this reduces to
$$\frac{\partial KL(p^*|p_u)}{\partial\theta} = \left\langle \int_t^T \left( -dW_s \right) \frac{\partial u(X_s,s)}{\partial\theta} \right\rangle_u$$
Gradient descent: $\theta := \theta - \epsilon\, \frac{\partial KL(p^*|p_u)}{\partial\theta}$

We refer to the method as PICE (Path Integral Cross Entropy).
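A self-normalized Monte Carlo version of this gradient, assuming (as in the PICE setting) that rollouts are weighted by $e^{-S_u}$; the per-rollout dictionary layout is an illustrative assumption:

```python
import numpy as np

def pice_gradient(paths, dt):
    """Each path dict holds per-step arrays u_vals, v_vals, dW, an
    (n_steps, n_params) array grad_u = du/dtheta, and the scalar S_u."""
    S = np.array([p["S_u"] for p in paths])
    alpha = np.exp(-(S - S.min()))
    alpha /= alpha.sum()                  # normalized weights alpha_i
    g = 0.0
    for a, p in zip(alpha, paths):
        incr = (p["u_vals"] - p["v_vals"]) * dt - p["dW"]  # (u-v) ds - dW
        g = g + a * (incr @ p["grad_u"])  # sum over time steps
    return g  # update: theta := theta - eps * g
```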

22. Model-based motor learning

compute control:
for k = 0, ... do
    data_k = generate_data(model, u_k)    % Monte Carlo importance sampler
    u_{k+1} = learn_control(data_k, u_k)  % Deep or recurrent learning
end for

[Figure: sample trajectories before and after learning]
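The pseudocode above as a plain Python loop, reusing the hypothetical `pice_gradient` sketch; `sample_paths` stands in for the massively parallel importance sampler:

```python
def pice_learn(theta, sample_paths, n_iters, n_rollouts, eps, dt):
    """Model-based PICE: sample with the current controller (planning),
    then take a cross-entropy gradient step (learning)."""
    for k in range(n_iters):
        data = sample_paths(theta, n_rollouts)  # MC importance sampler
        theta = theta - eps * pice_gradient(data, dt)
    return theta
```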

23. Parallel implementation

Massively parallel sampling on CPUs; massively parallel gradient computation on GPU.

Goal: provide a generic solver for any PI control problem to arbitrary precision.

24. Acrobot

2 DOF, second-order, underactuated, continuous stochastic control problem. The task is to swing up from the down position and stabilize.

25. Acrobot (acrobot.mp4)

Neural network with 2 hidden layers, 50 neurons per layer. Input is position and velocity. 2000 iterations with 30000 rollouts per iteration; 100 cores; 15 minutes.

26. More samples per iteration is better :)

Fraction ESS versus IS iteration:
- 100k samples (green, cyan)
- 300k samples (red, blue)
- 1000k samples (black, yellow)

27. Trust region

The initial gradient computation is too hard; introduce a (KL) trust region.

Control cost vs. IS iteration:
- Blue line: small trust region (ESS ≈ 50%, 30k samples) (= video)
- Red line: intermediate trust region (ESS ≈ 1%, 100k samples)
- Green line: large trust region (ESS ≈ 0.1%, 300k samples)

Trade-off between speed and optimality.

28. Discussion

Continuous-time SOC is very hard to compute.
- PI control: control ↔ inference
- Better sampling (ESS) ↔ better control (control objective)
- IS: learning the control solution also increases the efficiency of (future) control computations

29. Discussion

Continuous-time SOC is very hard to compute.
- PI control: control ↔ inference
- Better sampling (ESS) ↔ better control (control objective)
- IS: learning the control solution also increases the efficiency of (future) control computations

Continuous-time SOC is very hard to represent.
- CE for parameter estimation → deep neural network
