Control, inference and learning
Bert Kappen
SNN, Donders Institute, Radboud University, Nijmegen
Gatsby Unit, UCL, London
July 21, 2015
Why control theory?
A theory for intelligent behaviour:
- neuroscience
- robotics
Control theory
Given a current state and a future desired state, what is the best/cheapest/fastest way to get there?
Why stochastic control?
How to control?
Hard problems:
- a learning and exploration problem
- a stochastic optimal control computation
- a representation problem for the controller u(x, t)
The idea: Control, Inference and Learning
Linear Bellman equation and path integral solution
- Express a control computation as an inference computation.
- Compute the optimal control using MC sampling.
Importance sampling
- Accelerate with importance sampling, using a state-feedback controller.
- Learn the controller from self-generated data.
- The optimal importance sampler is the optimal control.
- Learn a good importance sampler using PICE.
Outline
• Introduction to control theory
• Link between control theory, inference and statistical physics
  – Schrödinger; Fleming & Mitter '82; Kappen '05; Todorov '06
• Importance sampling
  – Relation between optimal sampling and optimal control
• Cross entropy method for adaptive importance sampling (PICE)
  – A criterion for parametrized control optimization
  – Learning by gradient descent
• Some examples
Discrete time optimal control
Consider the control of a discrete time deterministic dynamical system:

    x_{t+1} = x_t + f(x_t, u_t),    t = 0, 1, ..., T-1

x_t describes the state and u_t specifies the control or action at time t. Given x_0 and u_{0:T-1}, we can compute x_{1:T}.

Define a cost for each sequence of controls:

    C(x_0, u_{0:T-1}) = Σ_{t=0}^{T-1} V(x_t, u_t)

Find the sequence u_{0:T-1} that minimizes C(x_0, u_{0:T-1}).
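As a concrete illustration of this problem statement, here is a minimal Python sketch (not from the slides) that rolls a trajectory forward from x_0 under a given control sequence and evaluates its cost; the dynamics f(x, u) = u and the quadratic cost V are illustrative assumptions.

```python
# Minimal sketch: roll out x_{t+1} = x_t + f(x_t, u_t) and accumulate
# C(x_0, u_{0:T-1}) = sum_t V(x_t, u_t). Dynamics and cost are assumed examples.
import numpy as np

def rollout_cost(x0, u_seq, f, V):
    """Return the state trajectory x_{0:T} and the total cost of a control sequence."""
    x, cost, xs = x0, 0.0, [x0]
    for u in u_seq:
        cost += V(x, u)        # accumulate V(x_t, u_t)
        x = x + f(x, u)        # x_{t+1} = x_t + f(x_t, u_t)
        xs.append(x)
    return np.array(xs), cost

# Example: drive the state towards 0 with a penalty on state and control.
f = lambda x, u: u
V = lambda x, u: x**2 + 0.1 * u**2
xs, C = rollout_cost(x0=2.0, u_seq=[-0.5] * 10, f=f, V=V)
print(C)
```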
Dynamic programming
Find the minimal cost path from A to J.
[Figure: stage-wise graph from A to J with edge costs, not shown.]

    C(F) = min(6 + C(H), 3 + C(I)) = 7

The minimal cost at time t is easily expressible in terms of the minimal cost at time t+1.
Discrete time optimal control
Dynamic programming uses the concept of the optimal cost-to-go J(t, x). One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

    J(t, x_t) = min_{u_{t:T-1}} Σ_{s=t}^{T-1} V(x_s, u_s)
              = min_{u_t} ( V(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) )
    J(T, x) = 0
    J(0, x) = min_{u_{0:T-1}} C(x, u_{0:T-1})

This is called the Bellman equation. It computes u_t(x) for all intermediate t, x.
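The backward recursion can be written down directly once the state is discretized. The sketch below is a minimal illustration under assumptions of mine (1-D state, grids for x and u, dynamics x_{t+1} = x_t + u_t, quadratic stage cost), not code from the slides.

```python
# Minimal sketch of the Bellman backward recursion
#   J(t, x) = min_u [ V(x, u) + J(t+1, x + f(x, u)) ],   J(T, x) = 0
# on a discretized 1-D state space; dynamics and cost are assumed examples.
import numpy as np

T = 10
xs = np.linspace(-3.0, 3.0, 121)            # state grid
us = np.linspace(-1.0, 1.0, 21)             # control grid
V = lambda x, u: x**2 + 0.1 * u**2          # stage cost V(x, u)

J = np.zeros((T + 1, xs.size))              # boundary condition J(T, x) = 0
policy = np.zeros((T, xs.size))             # u_t(x)
for t in reversed(range(T)):
    for i, x in enumerate(xs):
        x_next = x + us                               # candidate next states (f(x, u) = u)
        j_next = np.interp(x_next, xs, J[t + 1])      # J(t+1, x_next) by interpolation
        q = V(x, us) + j_next
        k = np.argmin(q)
        J[t, i], policy[t, i] = q[k], us[k]

print(J[0, xs.size // 2], policy[0, xs.size // 2])    # cost-to-go and control at x = 0, t = 0
```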
Stochastic optimal control
Consider a stochastic dynamical system

    dX_i = f_i(X_t, u) dt + dW_i,    E(dW_i dW_j) = ν_{ij} dt

Given x(0), find the control function u(x, t) that minimizes the expected future cost

    C = E[ φ(X_T) + ∫_0^T dt V(X_t, u(X_t, t)) ]

The expectation is over all trajectories given the control function.

    J(t, x) = min_u ( V(x, u) + E[ J(t + dt, x + dx) ] )
    -∂_t J(t, x) = min_u ( V(x, u) + f(x, u)^T ∇_x J(x, t) + ½ Tr( ν ∇_x² J(x, t) ) )

with u = u(x, t) and boundary condition J(x, T) = φ(x). This is the Hamilton-Jacobi-Bellman (HJB) equation.
Computing the optimal control solution is hard:
- it requires solving a Bellman equation, a PDE
- it scales badly with dimension
Efficient solutions exist for:
- linear dynamical systems with quadratic costs (Gaussians)
- deterministic systems (no noise)
Path integral control theory

    dX_t = f(X_t, t) dt + g(X_t, t)( u dt + dW_t )
    C = E[ φ(X_T) + ∫_t^T ds ( V(X_s, s) + ½ u(X_s, s)^T R u(X_s, s) ) ]

with E(dW_a dW_b) = ν_{ab} dt and R = λ ν^{-1}, λ > 0; f ∈ R^n, g ∈ R^{n×m}, u ∈ R^m.

The HJB equation becomes

    -∂_t J = min_u ( ½ u^T R u + V + (f + g u)^T ∇J + ½ Tr( g ν g^T ∇² J ) )

with boundary condition J(x, T) = φ(x).
Path integral control theory
Minimization wrt u yields:

    u(x, t) = -R^{-1} g^T(x, t) ∇J(x, t)
    -∂_t J = -½ (∇J)^T g R^{-1} g^T (∇J) + V + f^T ∇J + ½ Tr( g ν g^T ∇² J )

Define ψ(x, t) through J(x, t) = -λ log ψ(x, t). We obtain a linear HJB:

    ∂_t ψ = ( V/λ - f^T ∇ - ½ Tr( g ν g^T ∇² ) ) ψ
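A short sketch (my own filling-in, using only the definitions above) of why the log transform linearizes the equation; the key assumption is the stated relation R = λν^{-1}:

```latex
% Substituting J = -\lambda \log\psi, with R^{-1} = \nu/\lambda:
\begin{align*}
\nabla J &= -\lambda\,\frac{\nabla\psi}{\psi}, \qquad
\nabla^2 J = -\lambda\,\frac{\nabla^2\psi}{\psi}
           + \lambda\,\frac{(\nabla\psi)(\nabla\psi)^T}{\psi^2} \\[4pt]
-\tfrac{1}{2}(\nabla J)^T g R^{-1} g^T (\nabla J)
  &= -\tfrac{\lambda}{2}\,\frac{(\nabla\psi)^T g\,\nu\,g^T (\nabla\psi)}{\psi^2} \\[4pt]
\tfrac{1}{2}\operatorname{Tr}\!\bigl(g\nu g^T \nabla^2 J\bigr)
  &= -\tfrac{\lambda}{2\psi}\operatorname{Tr}\!\bigl(g\nu g^T \nabla^2\psi\bigr)
   + \tfrac{\lambda}{2}\,\frac{(\nabla\psi)^T g\nu g^T (\nabla\psi)}{\psi^2}
\end{align*}
% The two terms quadratic in \nabla\psi cancel; with -\partial_t J = \lambda\,\partial_t\psi/\psi,
% multiplying through by \psi/\lambda gives
%   \partial_t \psi = \bigl( V/\lambda - f^T\nabla - \tfrac{1}{2}\operatorname{Tr}(g\nu g^T\nabla^2) \bigr)\psi .
```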
Feynman-Kac formula
Denote by q(τ|x, t) the distribution over uncontrolled trajectories that start at x, t:

    dX_t = f(X_t, t) dt + g(X_t, t) dW_t

with τ a trajectory x(t → T). Then

    ψ(x, t) = ∫ dq(τ|x, t) exp( -S(τ)/λ ) = E_q[ e^{-S/λ} ]
    S(τ) = φ(x(T)) + ∫_t^T ds V(x(s), s)
Posterior distribution over optimal trajectories
ψ(x, t) is the partition sum for the distribution over paths under optimal control:

    p*(τ|x, t) = (1/ψ(x, t)) q(τ|x, t) exp( -S(τ)/λ )

The optimal cost-to-go is a free energy:

    J(x, t) = -λ log E_q[ e^{-S/λ} ]

The optimal control is an expectation wrt p*:

    u*(x, t) dt = E_{p*}[ dW_t ] = E_q[ dW_t e^{-S/λ} ] / E_q[ e^{-S/λ} ]

J and u* can be computed by forward sampling from q.
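A minimal Monte Carlo sketch of these two estimators, under assumptions of mine: 1-D state, f = 0, g = 1 (so the uncontrolled dynamics is a Brownian motion), V = 0, and an Euler-Maruyama discretization. It illustrates forward sampling from q; it is not the implementation behind the slides.

```python
# Monte Carlo estimates of
#   J(x,t)      = -lambda * log E_q[ exp(-S/lambda) ]
#   u*(x,t) dt  =  E_q[ dW_t exp(-S/lambda) ] / E_q[ exp(-S/lambda) ]
# by forward sampling from the uncontrolled dynamics dX = dW (f = 0, g = 1, V = 0).
import numpy as np

def path_integral_estimate(x0, t, T, phi, lam=1.0, nu=1.0, dt=0.01, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n_steps = int(round((T - t) / dt))
    x = np.full(n_samples, x0, dtype=float)
    dW0 = np.zeros(n_samples)
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(nu * dt), size=n_samples)
        if k == 0:
            dW0 = dW                       # first noise increment, needed for u*(x0, t)
        x = x + dW                         # uncontrolled dynamics
    S = phi(x)                             # path cost (only an end cost here, V = 0)
    w = np.exp(-(S - S.min()) / lam)       # shifted weights for numerical stability
    J = S.min() - lam * np.log(w.mean())
    u_star = (dW0 * w).sum() / (w.sum() * dt)
    return J, u_star

# End cost with two cheap targets at x = +1 and x = -1 (cf. the delayed choice example).
phi = lambda x: np.minimum((x - 1)**2, (x + 1)**2)
print(path_integral_estimate(x0=0.0, t=0.0, T=2.0, phi=phi))
```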
Delayed choice

    dX_t = u(X_t, t) dt + dW_t,    E(dW_t²) = ν dt
    C(p) = E_p[ φ(x_T) + ∫_0^2 dt ½ u(t)² ]

The cost encodes targets at t = 2.
[Figure: sample trajectories, t ∈ [0, 2] horizontally, x ∈ [-3, 3] vertically.]
Delayed choice
Time-to-go T = 2 - t.

    J(x, t) = -ν log E_q[ exp( -φ(X_2)/ν ) ]

[Figure: J(x, t) as a function of x for time-to-go T = 2, 1, 0.5, next to sample trajectories.]
The decision is made at T = 1ν.
"When the future is uncertain, delay your decisions."
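To make the picture concrete, here is a small numerical sketch (my assumptions: two quadratic wells at x = ±1 for φ, ν = 0.5, Gauss-Hermite quadrature for the Gaussian expectation) that evaluates J(x, t) = -ν log E_q[exp(-φ(X_2)/ν)] at different times-to-go: for large time-to-go the cost-to-go has a single minimum near x = 0, while for small time-to-go two minima appear near the targets.

```python
# J(x, t) = -nu * log E[ exp(-phi(X_2)/nu) ] with X_2 ~ N(x, nu * T_togo) under the
# uncontrolled dynamics; phi, nu and the quadrature order are assumed for illustration.
import numpy as np

nu = 0.5
phi = lambda y: np.minimum((y - 1)**2, (y + 1)**2)
nodes, weights = np.polynomial.hermite_e.hermegauss(80)   # quadrature for a standard normal

def J(x, T_togo):
    y = x + np.sqrt(nu * T_togo) * nodes                   # possible end states X_2
    return -nu * np.log(np.sum(weights * np.exp(-phi(y) / nu)) / np.sqrt(2 * np.pi))

xs = np.linspace(-2, 2, 9)
for T_togo in (2.0, 1.0, 0.5):
    print(T_togo, np.round([J(x, T_togo) for x in xs], 2))
# Large T_togo: single minimum near x = 0 (no commitment yet).
# Small T_togo: two minima near x = +/-1 (the decision has been made).
```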
KL control
The uncontrolled dynamics specifies a distribution q(τ|x, t) over trajectories τ from t → T.
The cost for a trajectory τ is S(τ) = φ(x_T) + ∫_t^T ds V(x_s, s).
Find the optimal distribution p(τ|x, t) that minimizes E_p[S] and is 'close' to q(τ|x, t).
KL control
Find p* that minimizes

    C(p) = KL(p|q) + E_p[S],    KL(p|q) = ∫ dτ p(τ|x, t) log( p(τ|x, t) / q(τ|x, t) )

The optimal solution is given by

    p*(τ|x, t) = (1/ψ(x, t)) q(τ|x, t) exp( -S(τ|x, t) )
    ψ(x, t) = ∫ dτ q(τ|x, t) exp( -S(τ|x, t) )

The optimal cost is C(p*) = -log ψ(x, t).
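A one-step verification (my own filling-in, using only the definitions above) that this p* is indeed the minimizer:

```latex
% For any normalized p, rewrite C(p) relative to the claimed optimum p^*:
\begin{align*}
C(p) &= \int d\tau\, p(\tau|x,t)\,\log\frac{p(\tau|x,t)}{q(\tau|x,t)\,e^{-S(\tau|x,t)}}
      = \int d\tau\, p(\tau|x,t)\,\log\frac{p(\tau|x,t)}{\psi(x,t)\,p^*(\tau|x,t)} \\
     &= \mathrm{KL}(p\,\|\,p^*) - \log\psi(x,t)
\end{align*}
% Since KL(p || p^*) >= 0 with equality iff p = p^*, the minimum is attained at p = p^*
% with optimal cost C(p^*) = -log psi(x,t).
```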
Controlled diffusions are a special case
In the case of controlled diffusions, p is parametrised by the control function u(x, t):

    dX_t = f(X_t, t) dt + g(X_t, t)( u(X_t, t) dt + dW_t ),    E(dW_i dW_j) = ν_{ij} dt
    C(p) = E_p[ φ(X_T) + ∫_t^T ds ( ½ u(X_s, s)^T ν^{-1} u(X_s, s) + V(X_s, s) ) ]

ψ(x, t) is the solution of the linear Bellman equation and J(x, t) = -log ψ(x, t) is the optimal cost-to-go.
Sampling efficiency
[Figure: uncontrolled sample trajectories, t ∈ [0, 2] horizontally, x ∈ [-10, 10] vertically.]
Sampling with the uncontrolled dynamics is theoretically correct, but inefficient in practice.
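This is where importance sampling enters: sample from a controlled proposal and correct the weights. The sketch below is a minimal illustration under assumptions of mine (1-D, f = 0, g = 1, λ = ν = 1, V = 0, Euler-Maruyama, and a hypothetical feedback proposal that pulls towards the nearest target); the reweighting uses the Girsanov-style correction so the estimate of J stays consistent.

```python
# Importance sampling for J(x,t) = -log E_q[exp(-S)]: draw paths from the controlled
# dynamics dX = u(x,t) dt + dW and reweight with the path-cost correction
#   S_u = phi(X_T) + sum_k ( 0.5 * u_k^2 * dt + u_k * dW_k ),
# so that J = -log E_{q^u}[exp(-S_u)].  Setup (f = 0, g = 1, lambda = nu = 1, V = 0) is assumed.
import numpy as np

def importance_sampled_J(x0, T, phi, u_fn, dt=0.01, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / dt))
    x = np.full(n_samples, x0, dtype=float)
    correction = np.zeros(n_samples)
    for k in range(n_steps):
        u = u_fn(x, k * dt)
        dW = rng.normal(0.0, np.sqrt(dt), size=n_samples)
        correction += 0.5 * u**2 * dt + u * dW      # Girsanov-style reweighting term
        x = x + u * dt + dW                         # controlled (proposal) dynamics
    S_u = phi(x) + correction
    m = S_u.min()
    return m - np.log(np.mean(np.exp(-(S_u - m))))  # J, with a log-sum-exp shift

phi = lambda x: np.minimum((x - 1)**2, (x + 1)**2)
uncontrolled = lambda x, t: np.zeros_like(x)
feedback = lambda x, t: -0.5 * (x - np.sign(x + 1e-12))   # hypothetical proposal: pull to nearest target
for name, u_fn in [("uncontrolled", uncontrolled), ("feedback proposal", feedback)]:
    print(name, importance_sampled_J(0.0, 2.0, phi, u_fn))
# Both target the same J; the controlled proposal typically gives a much lower-variance estimate.
```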