Integrating control, inference and learning. Is it what robots should be doing? (PowerPoint presentation, Bert Kappen)


1. Integrating control, inference and learning. Is it what robots should be doing? Bert Kappen, SNN Donders Institute, Radboud University, Nijmegen, and Gatsby Unit, UCL London. July 18, 2016.

2. Optimal control theory. Given a current state and a future desired state, what is the best/cheapest/fastest way to get there?

3. Why stochastic optimal control?

4. Why stochastic optimal control? Exploration. Learning.

5. Optimal control theory. Hard problems:
- a learning and exploration problem
- a stochastic optimal control computation
- a representation problem for the control function $u(x, t)$

6. The idea: control, inference and learning.
Linear Bellman equation and path integral solution: express a control computation as an inference computation; compute the optimal control using MC sampling.

7. The idea: control, inference and learning.
Linear Bellman equation and path integral solution: express a control computation as an inference computation; compute the optimal control using MC sampling.
Importance sampling: accelerate with importance sampling (= a state-feedback controller); the optimal importance sampler is the optimal control.

8. The idea: control, inference and learning.
Linear Bellman equation and path integral solution: express a control computation as an inference computation; compute the optimal control using MC sampling.
Importance sampling: accelerate with importance sampling (= a state-feedback controller); the optimal importance sampler is the optimal control.
Learning: learn the controller from self-generated data; use the cross-entropy method for a parametrized controller.

9. Outline
• Review of path integral control theory
  – Some results
• Importance sampling
  – Relation between optimal sampling and optimal control
• Cross-entropy method for adaptive importance sampling (PICE)
  – A criterion for parametrized control optimization
  – Learning by gradient descent
• Some examples

10. Discrete time optimal control. Consider the control of a discrete time deterministic dynamical system:

$$x_{t+1} = x_t + f(x_t, u_t), \qquad t = 0, 1, \ldots, T-1$$

$x_t$ describes the state and $u_t$ specifies the control or action at time $t$. Given $x_0$ and $u_{0:T-1}$, we can compute $x_{1:T}$. Define a cost for each sequence of controls:

$$C(x_0, u_{0:T-1}) = \sum_{t=0}^{T-1} V(x_t, u_t)$$

Find the sequence $u_{0:T-1}$ that minimizes $C(x_0, u_{0:T-1})$.
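
To make this concrete, here is a minimal Python sketch of the rollout-and-cost computation; the integrator dynamics f and the quadratic cost V are toy assumptions, not from the talk.

```python
def rollout_cost(x0, us, f, V):
    """Roll out x_{t+1} = x_t + f(x_t, u_t) and accumulate C = sum_t V(x_t, u_t)."""
    x, cost = x0, 0.0
    for u in us:
        cost += V(x, u)
        x = x + f(x, u)
    return x, cost

# Toy example: 1-d integrator with quadratic state and control costs.
f = lambda x, u: u                        # x_{t+1} = x_t + u_t
V = lambda x, u: x**2 + 0.1 * u**2
xT, C = rollout_cost(x0=1.0, us=[-0.5, -0.3, -0.2], f=f, V=V)
```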

11. Dynamic programming. Find the minimal cost path from A to J.

$J(J) = 0$, $J(H) = 3$, $J(I) = 4$, $J(F) = \min(6 + J(H), 3 + J(I)) = 7$, $J(B) = \min(7 + J(E), 4 + J(F), 2 + J(G)) = \ldots$

The minimal cost at time $t$ is easily expressible in terms of the minimal cost at time $t + 1$.

12. Discrete time optimal control. Dynamic programming uses the concept of the optimal cost-to-go $J(t, x)$. One can recursively compute $J(t, x)$ from $J(t+1, x)$ for all $x$ in the following way:

$$J(t, x_t) = \min_{u_t} \left( V(x_t, u_t) + J(t+1, x_t + f(x_t, u_t)) \right)$$
$$J(T, x) = 0$$
$$J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})$$

This is called the Bellman equation. It computes $u_t(x)$ for all intermediate $t, x$.
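
A minimal sketch of this backward recursion on a discretized 1-d state space; the grids, dynamics and cost below are illustrative assumptions (real problems need a proper discretization and interpolation).

```python
import numpy as np

def bellman(xs, us, f, V, T):
    """Backward Bellman recursion: J[t, i] approximates the optimal
    cost-to-go from state xs[i] at time t, with J(T, x) = 0."""
    J = np.zeros((T + 1, len(xs)))
    policy = np.zeros((T, len(xs)))
    for t in range(T - 1, -1, -1):
        for i, x in enumerate(xs):
            # for each candidate control, snap the next state to the grid
            costs = [V(x, u) + J[t + 1, np.abs(xs - (x + f(x, u))).argmin()]
                     for u in us]
            k = int(np.argmin(costs))
            J[t, i], policy[t, i] = costs[k], us[k]
    return J, policy

xs = np.linspace(-2, 2, 81)               # state grid
us = np.linspace(-1, 1, 21)               # control grid
J, policy = bellman(xs, us, f=lambda x, u: u,
                    V=lambda x, u: x**2 + 0.1 * u**2, T=10)
```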

13. Stochastic optimal control. Consider a stochastic dynamical system

$$dX_t = f(X_t, u)\, dt + dW_t, \qquad E(dW_{t,i}\, dW_{t,j}) = \nu_{ij}\, dt$$

Given $X_0$, find the control function $u(x, t)$ that minimizes the expected future cost

$$C = E\left[ \phi(X_T) + \int_0^T dt\, V(X_t, u(X_t, t)) \right]$$

The expectation is over all trajectories given the control function $u(x, t)$.

$$-\partial_t J(t, x) = \min_u \left( V(x, u) + f(x, u)^T \nabla_x J(x, t) + \tfrac{1}{2} \nu \nabla_x^2 J(x, t) \right)$$

with $u = u(x, t)$ and boundary condition $J(x, T) = \phi(x)$. This is the HJB equation.

14. Computing the optimal control solution is hard:
- solve a Bellman equation, a PDE
- scales badly with dimension
Efficient solutions exist for:
- linear dynamical systems with quadratic costs (Gaussians)
- deterministic systems (no noise)

15. The idea. The uncontrolled dynamics specifies a distribution $q(\tau|x,t)$ over trajectories $\tau$ starting from $x, t$. The cost of a trajectory $\tau$ is

$$S(\tau|x,t) = \phi(x_T) + \int_t^T ds\, V(x_s, s)$$

Find the optimal distribution $p(\tau|x,t)$ that minimizes $E_p S$ and is 'close' to $q(\tau|x,t)$.

16. KL control. Find $p^*$ that minimizes

$$C(p) = KL(p|q) + E_p S, \qquad KL(p|q) = \int d\tau\, p(\tau|x,t) \log \frac{p(\tau|x,t)}{q(\tau|x,t)}$$

The optimal solution is given by

$$p^*(\tau|x,t) = \frac{1}{\psi(x,t)}\, q(\tau|x,t) \exp(-S(\tau|x,t))$$
$$\psi(x,t) = \int d\tau\, q(\tau|x,t) \exp(-S(\tau|x,t)) = E_q e^{-S}$$

The optimal cost is $C(p^*) = -\log \psi(x,t)$.

17. Controlled diffusions. $p(\tau|x,t)$ is parametrised by control functions $u(x,t)$:

$$dX_t = f(X_t, t)\, dt + g(X_t, t)\left( u(X_t, t)\, dt + dW_t \right), \qquad E(dW_t^2) = dt$$

$$C(u|x,t) = E_u\left[ S(\tau|x,t) + \int_t^T ds\, \tfrac{1}{2} u(X_s, s)^2 \right]$$

$q(\tau|x,t)$ corresponds to $u = 0$. The Bellman equation becomes a 'Schrödinger' equation with $J(x,t) = -\log \psi(x,t)$:

$$\partial_t \psi = \left( V - f^T \partial_x - \tfrac{1}{2} \partial_x^2 \right) \psi, \qquad \psi(x, T) = e^{-\phi(x)}$$

18. Controlled diffusions. The 'Schrödinger' equation can be solved formally as a Feynman-Kac path integral:

$$\psi(x,t) = \int d\tau\, q(\tau|x,t)\, e^{-S(\tau|x,t)} = E_q\left[ e^{-S} \right]$$

Optimal control:

$$u^*(x,t)\, dt = E_{p^*}(dW_t) = \frac{E_q\left[ dW\, e^{-S} \right]}{E_q\left[ e^{-S} \right]}$$

$\psi$ and $u^*$ can be computed by forward sampling from $q$.
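
A minimal Monte Carlo sketch of this forward-sampling computation, assuming $f = 0$, $g = 1$, $\nu = 1$ and a toy double-well end cost (all illustrative assumptions, not from the slides):

```python
import numpy as np

def psi_and_u(x, t, T, V, phi, dt=0.01, N=100_000, seed=0):
    """Estimate psi(x,t) = E_q[e^{-S}] and u*(x,t) by forward sampling N
    uncontrolled trajectories dX = dW."""
    rng = np.random.default_rng(seed)
    steps = int(round((T - t) / dt))
    X = np.full(N, float(x))
    S = np.zeros(N)
    dW0 = np.zeros(N)                     # first noise increment, enters u*(x,t)
    for k in range(steps):
        dW = rng.normal(0.0, np.sqrt(dt), N)
        if k == 0:
            dW0 = dW
        S += V(X) * dt                    # running cost contribution
        X += dW                           # uncontrolled dynamics q
    S += phi(X)                           # end cost
    w = np.exp(-S)
    # u* dt = E_q[dW0 e^{-S}] / E_q[e^{-S}]
    return w.mean(), (dW0 * w).mean() / (w.mean() * dt)

psi, u0 = psi_and_u(x=0.0, t=0.0, T=1.0,
                    V=lambda X: np.zeros_like(X),
                    phi=lambda X: (X**2 - 1)**2)
```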

19. Delayed choice. Time-to-go $T = 2 - t$.

[Figure: the optimal cost-to-go $J(x,t)$ as a function of $x$ for time-to-go $T = 2$, $T = 1$ and $T = 0.5$, together with sample trajectories on $0 \le t \le 2$.]

$$J(x,t) = -\nu \log E_q \exp(-\phi(X_2)/\nu)$$

The decision is made at time-to-go $T = \nu$.

20. Acrobot.

[Figure: panels over 100 iterations showing the mean and std of the final height, the increment, ss, the cost terms $J_u$ and $J_\phi$, the state trajectory ($x_2$ vs. $x_1$), and the control $u$ vs. $t$.]

100 iterations. At each iteration 50 trajectories were generated. The noise was lowered at each iteration. Top left: final height for each trajectory.

21. Acrobot (movie92.mp4). Result after 100 iterations, 50 samples per iteration.

22. Robotics. ≈ 100.000 trajectories per iteration, 3 iterations per second. Video at: http://www.snn.ru.nl/˜bertk/control_theory/PI_quadrotors.mp4. Theodorou et al., ICRA 2011; Gomez et al., ICAPS 2016.

23. Importance sampling and control.

[Figure: two panels of sampled trajectories on $t \in [0, 2]$.]

$$\psi(x,t) = E_q e^{-S}, \qquad S(\tau|x,t) = \phi(x_T) + \int_t^T ds\, V(x_s, s)$$

Sampling is 'correct' but inefficient.

24. Importance sampling.

[Figure: the density $q(x)$ and the tail region $x < 0$.]

Consider a simple 1-d sampling problem. Given $q(x)$, compute

$$a = \text{Prob}(x < 0) = \int_{-\infty}^{\infty} I(x)\, q(x)\, dx$$

with $I(x) = 0, 1$ if $x > 0$, $x < 0$, respectively. Naive method: generate $N$ samples $X_i \sim q$:

$$\hat{a} = \frac{1}{N} \sum_{i=1}^N I(X_i)$$

25. Importance sampling.

[Figure: the density $q(x)$ and a proposal density $p(x)$.]

Consider another distribution $p(x)$. Then

$$a = \text{Prob}(x < 0) = \int_{-\infty}^{\infty} I(x) \frac{q(x)}{p(x)}\, p(x)\, dx$$

Importance sampling: generate $N$ samples $X_i \sim p$:

$$\hat{a} = \frac{1}{N} \sum_{i=1}^N I(X_i) \frac{q(X_i)}{p(X_i)}$$

Unbiased (= correct) for any $p$!
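
A small Python sketch contrasting the two estimators; the target $q = N(2, 1)$ and the proposal $p = N(-1, 1)$ are illustrative assumptions chosen so that $x < 0$ is a rare event under $q$.

```python
import numpy as np
from math import sqrt, pi

rng = np.random.default_rng(0)
N = 10_000
q_pdf = lambda x: np.exp(-0.5 * (x - 2.0)**2) / sqrt(2 * pi)  # q = N(2, 1)
p_pdf = lambda x: np.exp(-0.5 * (x + 1.0)**2) / sqrt(2 * pi)  # p = N(-1, 1)

# Naive estimator: sample from q and average the indicator I(x < 0).
Xq = rng.normal(2.0, 1.0, N)
a_naive = np.mean(Xq < 0)

# Importance sampling estimator: sample from p, reweight by q/p.
Xp = rng.normal(-1.0, 1.0, N)
a_is = np.mean((Xp < 0) * q_pdf(Xp) / p_pdf(Xp))
```

The proposal puts its mass where the indicator is 1, so the importance-sampled estimate has far lower variance than the naive one at the same $N$.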

26. Optimal importance sampling.

[Figure: the density $q(x)$ and the optimal proposal $p^*(x)$ concentrated on $x < 0$.]

The distribution

$$p^*(x) = \frac{q(x)\, I(x)}{a}$$

is the optimal importance sampler. One sample $X \sim p^*$ is sufficient to estimate $a$:

$$\hat{a} = \frac{I(X)\, q(X)}{p^*(X)} = a$$

27. Importance sampling and control. In the case of control we must compute

$$u^*(x,t)\, dt = \frac{E_q\left[ dW\, e^{-S} \right]}{E_q\left[ e^{-S} \right]}, \qquad J(x,t) = -\log E_q e^{-S}$$

Instead of samples from the uncontrolled dynamics $q$ ($u = 0$), we sample with $p$ ($u \neq 0$):

$$E_q e^{-S} = E_p e^{-S_u}, \qquad e^{-S}\, dq = e^{-S_u}\, dp$$
$$S_u(\tau|x,t) = S(\tau|x,t) + \int_t^T ds\, \tfrac{1}{2} u(x_s, s)^2 + \int_t^T u(x_s, s)\, dW_s$$

We can choose any $p$, i.e. any sampling control $u$.
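
A sketch of this reweighted estimator under the same simplifying assumptions as before ($f = 0$, $g = 1$, $\nu = 1$); the feedback controller $u(x) = -x$ is a hypothetical choice of sampling control, not the optimal one.

```python
import numpy as np

def psi_controlled(x, t, T, V, phi, u, dt=0.01, N=1_000, seed=0):
    """Estimate psi = E_q[e^{-S}] = E_p[e^{-S_u}] by sampling controlled
    trajectories dX = u dt + dW and accumulating the Girsanov-corrected
    cost S_u = S + int 1/2 u^2 ds + int u dW."""
    rng = np.random.default_rng(seed)
    steps = int(round((T - t) / dt))
    X = np.full(N, float(x))
    S_u = np.zeros(N)
    s = t
    for _ in range(steps):
        dW = rng.normal(0.0, np.sqrt(dt), N)
        uu = u(X, s)
        S_u += V(X) * dt + 0.5 * uu**2 * dt + uu * dW
        X += uu * dt + dW                 # sample from the controlled process p
        s += dt
    S_u += phi(X)
    return np.exp(-S_u).mean(), S_u

# u = 0 recovers the naive sampler; any u leaves the estimate unbiased.
psi, S_u = psi_controlled(0.0, 0.0, 1.0,
                          V=lambda X: np.zeros_like(X),
                          phi=lambda X: (X**2 - 1)**2,
                          u=lambda X, s: -X)   # hypothetical feedback controller
```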

28. Relation between optimal sampling and optimal control. Draw $N$ trajectories $\tau^i$, $i = 1, \ldots, N$ from $p(\tau|x,t)$ using the control function $u$ and define

$$\alpha_i = \frac{e^{-S_u(\tau^i|x,t)}}{\sum_{j=1}^N e^{-S_u(\tau^j|x,t)}}, \qquad ESS = \frac{1}{\sum_{j=1}^N \alpha_j^2} \quad (1 \le ESS \le N)$$

Theorem:
1. A better $u$ (in the sense of optimal control) provides a better sampler (in the sense of effective sample size).
2. The optimal $u = u^*$ (in the sense of optimal control) requires only one sample: $\alpha_i = 1/N$ and $S_u(\tau|x,t)$ is deterministic.

$$S_u(\tau|x,t) = S(\tau|x,t) + \int_t^T ds\, \tfrac{1}{2} u(x_s, s)^T \nu^{-1} u(x_s, s) + \int_t^T u(x_s, s)^T \nu^{-1}\, dW_s$$
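
The weights and effective sample size follow directly from the sampled path costs; a minimal sketch, reusing S_u from the previous example:

```python
import numpy as np

def effective_sample_size(S_u):
    """Normalized importance weights alpha_i and ESS from path costs S_u."""
    w = np.exp(-(S_u - S_u.min()))        # subtract the min for numerical stability
    alpha = w / w.sum()
    return alpha, 1.0 / np.sum(alpha**2)

# A better sampling control drives the weights toward alpha_i = 1/N
# and the effective sample size toward N.
alpha, n_eff = effective_sample_size(S_u)
```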
