Reinforcement Methods for Autonomous Online Learning of Optimal Robot Behaviors - PowerPoint PPT Presentation



  1. Reinforcement Methods for Autonomous Online Learning of Optimal Robot Behaviors. F.L. Lewis, Dan Popa, Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington, USA; Guido Herrmann, Bristol Robotics Lab, University of Bristol, UK. Supported by: NSF - Paul Werbos; ARO - Sam Stanton; AFOSR - Fariba Fahroo. Talk available online at http://ARRI.uta.edu/acs

  2. Invited by Rolf Johansson. Outline: Optimal Control; Reinforcement Learning; Policy Iteration; Q Learning; Humanoid Robot Learning Control Using RL; Telerobotic Interface Learning Using RL.

  3. It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer. Man's task is to understand patterns in nature and society. The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls. Ibn Sina (Avicenna), 980-1037.

  4. Importance of Feedback Control. Darwin - FB and natural selection; Volterra - FB and fish population balance; Adam Smith - FB and international economy; James Watt - FB and the steam engine; FB and cell homeostasis. The resources available to most species for their survival are meager and limited; Nature uses optimal control.

  5. F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009. Also: IEEE Control Systems Magazine, to appear.

  6. Discrete-Time Optimal Control. System: $x_{k+1} = f(x_k) + g(x_k) u_k$. Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$, with, for example, $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$. Difference-equation equivalent: $V_h(x_k) = r(x_k, u_k) + \gamma \sum_{i=k+1}^{\infty} \gamma^{i-(k+1)} r(x_i, u_i)$. Control policy = the prescribed control input function $u_k = h(x_k)$; example: linear state-variable feedback $u_k = -K x_k$. Bellman equation: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, $V_h(0) = 0$, i.e. $V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + \gamma V_h(x_{k+1})$. Bellman's Principle gives the Bellman optimality equation = DT HJB: $V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$. Optimal control: $h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$, which for the input-affine system gives $u_k = -\tfrac{\gamma}{2} R^{-1} g^T(x_k) \, \partial V(x_{k+1}) / \partial x_{k+1}$. Off-line solution; the dynamics must be known.
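
To make the greedy step $h^*(x_k) = \arg\min_{u}\big(r(x_k,u) + \gamma V^*(x_{k+1})\big)$ concrete, here is a minimal Python sketch that numerically minimizes the one-step Bellman target for a scalar system. The dynamics f and g, the weights Q and R, the discount gamma, and the candidate value function V are illustrative assumptions, not quantities from the talk.

```python
# One-step greedy (Bellman-optimal) control for a scalar system
# x_{k+1} = f(x_k) + g(x_k) u_k, by direct numerical minimization of
# r(x,u) + gamma * V(f(x) + g(x) u).  All model choices below are assumed.
from scipy.optimize import minimize_scalar

Q, R, gamma = 1.0, 1.0, 0.9

def f(x):            # drift term (assumed example)
    return 0.9 * x

def g(x):            # input gain (assumed example)
    return 1.0

def V(x):            # candidate value function (assumed quadratic guess)
    return 2.0 * x**2

def r(x, u):         # stage cost r(x,u) = Q x^2 + R u^2
    return Q * x**2 + R * u**2

def greedy_control(x):
    """u = argmin_u [ r(x,u) + gamma * V(f(x) + g(x) u) ]."""
    return minimize_scalar(lambda u: r(x, u) + gamma * V(f(x) + g(x) * u)).x

print("greedy control at x = 1:", greedy_control(1.0))
```

For this quadratic choice of V the minimizer could also be written in closed form, as on the slide; the numerical version is shown only to make the arg min operation explicit.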

  7. DT Optimal Control - Linear Systems, Quadratic Cost (LQR). System: $x_{k+1} = A x_k + B u_k$. Cost: $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)$. Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix $P$. HJB = DT Riccati equation: $A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A = 0$. Optimal control: $u_k = -L x_k$ with $L = (R + B^T P B)^{-1} B^T P A$. Optimal cost: $V^*(x_k) = x_k^T P x_k$. Off-line solution; the dynamics must be known.
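
For reference, a minimal numerical sketch of this off-line design, assuming NumPy/SciPy are available and using small illustrative A, B, Q, R values (not taken from the talk): solve the DT Riccati equation, then form the gain L.

```python
# Off-line DT LQR solution: solve the discrete algebraic Riccati equation
# and form the optimal gain L = (R + B'PB)^{-1} B'PA.  A, B, Q, R below are
# illustrative values chosen only for the example.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_discrete_are(A, B, Q, R)                   # DT Riccati solution
L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)    # optimal feedback gain

x0 = np.array([[1.0], [0.0]])
print("optimal cost V*(x0) =", (x0.T @ P @ x0).item())   # x0' P x0
print("gain L =", L)
```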

  8. We want robot controllers that learn optimal control solutions online in real time. A synthesis of: computational intelligence, control systems, neurobiology. Different methods of learning: machine learning - the formal study of learning systems; supervised learning; unsupervised learning; reinforcement learning.

  9. Different methods of learning: reinforcement learning - we want OPTIMAL performance. Ivan Pavlov, 1890s. ADP - Approximate Dynamic Programming. Actor-Critic Learning. [Block diagram: a critic compares the desired performance with the learning system's outputs and produces a reinforcement signal that tunes the actor; the actor (adaptive system) generates the control inputs applied to the environment.]

  10. RL Policy Iteration to Solve the Optimal Control Problem. System: $x_{k+1} = f(x_k) + g(x_k) u_k$. Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$. Difference-equation equivalent: $V_h(x_k) = r(x_k, u_k) + \gamma \sum_{i=k+1}^{\infty} \gamma^{i-(k+1)} r(x_i, u_i)$. Bellman equation: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, $V_h(0) = 0$, i.e. $V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + \gamma V_h(x_{k+1})$. Bellman's Principle gives the Bellman optimality equation = DT HJB: $V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$. Optimal control: $h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$, i.e. $u_k = -\tfrac{\gamma}{2} R^{-1} g^T(x_k) \, \partial V(x_{k+1}) / \partial x_{k+1}$. Focus on these two equations.

  11. Bellman Equation: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$. It can be interpreted as a consistency equation that must be satisfied by the value function at each time stage. It expresses a relation between the current value of being in state $x_k$ and the value of being in the next state $x_{k+1}$, given that the policy $h$ is followed. It captures the action, observation, evaluation, and improvement mechanisms of reinforcement learning. Temporal Difference idea: $e_k = -V_h(x_k) + r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$.
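
A minimal sketch of how this temporal-difference error can be evaluated for a parameterized critic; the quadratic critic $V(x) = x^T P x$, the discount factor, and the sample transition below are illustrative assumptions, not data from the talk.

```python
# Temporal-difference error e_k = r(x_k, u_k) + gamma * V(x_{k+1}) - V(x_k)
# for a quadratic critic V(x) = x'Px.  P, gamma, and the sample transition
# are illustrative assumptions.
import numpy as np

gamma = 0.9
Q = np.eye(2)
R = np.array([[1.0]])
P = np.array([[2.0, 0.1],
              [0.1, 1.5]])          # current critic estimate (assumed)

def V(x):
    return (x.T @ P @ x).item()

def r(x, u):
    return (x.T @ Q @ x + u.T @ R @ u).item()

def td_error(x, u, x_next):
    """How far the current critic is from satisfying the Bellman equation."""
    return r(x, u) + gamma * V(x_next) - V(x)

x  = np.array([[1.0], [0.0]])
u  = np.array([[-0.2]])
xn = np.array([[0.9], [0.05]])      # observed next state (assumed sample)
print("TD error:", td_error(x, u, xn))
```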

  12. Policy Evaluation and Policy Improvement. Consider algorithms that repeatedly interleave the two procedures. Policy Evaluation by the Bellman equation: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$. Policy Improvement: $h'(x_k) = -\tfrac{\gamma}{2} R^{-1} g^T(x_k) \, \partial V_h(x_{k+1}) / \partial x_{k+1}$. Policy improvement makes $V_{h'}(x) \le V_h(x)$ (Bertsekas and Tsitsiklis 1996; Sutton and Barto 1998); the policy $h'(x_k)$ is said to be greedy with respect to the value function $V_h(x_k)$. At each step one obtains a policy that is no worse than the previous policy. Convergence to the optimal value and optimal policy can be proved under fairly mild conditions; most such proofs are based on the Banach Fixed Point Theorem, since one step is a contraction map. There is a large family of algorithms that implement the policy evaluation and policy improvement procedures in various ways.
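
The closed-form improvement step above holds for input-affine dynamics. Below is a minimal sketch assuming a quadratic critic $V(x) = x^T P x$ (so $\partial V / \partial x = 2Px$) and illustrative f, g, R, gamma; because $x_{k+1} = f(x_k) + g(x_k)u$ depends on the new control, the formula is implicit in $u$ and is simply iterated to a fixed point here.

```python
# Closed-form policy improvement for an input-affine system with a quadratic
# critic V(x) = x'Px, so dV/dx = 2Px and
#   h'(x_k) = -(gamma/2) R^{-1} g(x_k)' dV(x_{k+1})/dx_{k+1}.
# x_{k+1} = f(x_k) + g(x_k) u depends on the new control, so the formula is
# implicit in u and is iterated to a fixed point.  All model choices are assumed.
import numpy as np

gamma = 0.9
R = np.array([[1.0]])
P = np.array([[2.0, 0.1],
              [0.1, 1.5]])                 # quadratic critic (assumed)

def f(x):                                  # drift (assumed example)
    return np.array([[1.0, 0.1], [0.0, 0.9]]) @ x

def g(x):                                  # input gain (assumed example)
    return np.array([[0.0], [0.1]])

def improved_control(x, iters=20):
    u = np.zeros((1, 1))
    for _ in range(iters):                 # fixed-point iteration on the implicit formula
        x_next = f(x) + g(x) @ u
        u = -0.5 * gamma * np.linalg.solve(R, g(x).T @ (2.0 * P @ x_next))
    return u

x = np.array([[1.0], [0.0]])
print("improved control at x:", improved_control(x))
```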

  13. DT Policy Iteration to Solve the HJB Equation. The cost of any given control policy $h(x_k)$ satisfies the recursion (Bellman equation) $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$: a recursive form, solved recursively, and a consistency equation. Pick a stabilizing initial control. Policy Evaluation - solve the Bellman equation $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$; note that $f(\cdot)$ and $g(\cdot)$ do not appear. Policy Improvement: $h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$. Howard (1960) proved convergence for MDPs (Bertsekas and Tsitsiklis 1996; Sutton and Barto 1998). The policy $h_{j+1}(x_k)$ is said to be greedy with respect to the value function $V_{j+1}(x_k)$. At each step one obtains a policy that is no worse than the previous policy. Convergence to the optimal value and optimal policy can be proved under fairly mild conditions; most such proofs are based on the Banach Fixed Point Theorem, since one step is a contraction map.
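
For the finite-MDP setting in which Howard proved convergence, a minimal policy-iteration sketch is given below; the two-state, two-action transition model, rewards, and discount factor are illustrative assumptions, and it uses the MDP literature's reward-maximizing convention (arg max rather than arg min).

```python
# Policy iteration for a small finite MDP (the setting of Howard, 1960):
# alternate exact policy evaluation (solve the linear Bellman equations)
# with greedy policy improvement.  The 2-state, 2-action model is assumed.
import numpy as np

n_states, gamma = 2, 0.9
# P[a, s, s'] = transition probability, R[s, a] = expected reward
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

policy = np.zeros(n_states, dtype=int)     # initial policy: action 0 everywhere
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: greedy with respect to the evaluated V
    Q_sa = R + gamma * np.einsum('ast,t->sa', P, V)
    new_policy = np.argmax(Q_sa, axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal policy:", policy, "value:", V)
```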

  14. Methods to implement Policy Iteration: Exact computation - needs full system dynamics; Temporal Difference - for robot trajectory following; Monte Carlo learning - for learning episodic robot tasks.

  15. DT Policy Iteration - Linear Systems, Quadratic Cost (LQR). With policy $u_k = -L x_k$ the closed loop is $x_{k+1} = A x_k + B u_k = (A - BL) x_k$. For any stabilizing policy the cost is $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u^T(x_i) R u(x_i) \big)$, and the LQR value is quadratic: $V(x_k) = x_k^T P x_k$. DT policy iteration: $V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k) + V_{j+1}(x_{k+1})$ and $u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, d V_{j+1}(x_{k+1}) / d x_{k+1}$. This solves the Lyapunov equation without knowing $A$ and $B$. It is equivalent to an underlying problem, the DT LQR: the DT Lyapunov equation $(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$ and the gain update $L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$. Hewer proved convergence in 1971. Policy Iteration solves the Lyapunov equation WITHOUT knowing the system dynamics.
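
A minimal numerical sketch of the underlying model-based iteration (Hewer's algorithm), reusing the illustrative A, B, Q, R from the earlier LQR sketch; the online RL versions discussed in the talk replace the Lyapunov solve with data-based policy evaluation, whereas this sketch uses the model explicitly.

```python
# Model-based form of DT policy iteration for LQR (Hewer, 1971):
# repeatedly solve a DT Lyapunov equation for P_{j+1}, then update the gain
# L_{j+1} = (R + B'P_{j+1}B)^{-1} B'P_{j+1}A.  A, B, Q, R are illustrative.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

L = np.array([[1.0, 1.0]])            # initial stabilizing gain (assumed)
for j in range(50):
    Ac = A - B @ L                    # closed-loop matrix under current policy
    # Policy evaluation: solve Ac' P Ac - P + (Q + L'RL) = 0
    P = solve_discrete_lyapunov(Ac.T, Q + L.T @ R @ L)
    # Policy improvement: greedy gain with respect to P
    L_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.max(np.abs(L_new - L)) < 1e-10:
        break
    L = L_new

# Check against the direct off-line Riccati solution
P_opt = solve_discrete_are(A, B, Q, R)
L_opt = np.linalg.solve(R + B.T @ P_opt @ B, B.T @ P_opt @ A)
print("converged gain:", L)
print("matches DARE gain:", np.allclose(L, L_opt, atol=1e-8))
```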
