
  1. Reinforcement Learning – Policy Optimization
      Pieter Abbeel, UC Berkeley EECS

  2. Policy Optimization
      - Consider a control policy parameterized by parameter vector θ:
        \max_\theta \; \mathbb{E}\Big[ \sum_{t=0}^{H} R(s_t) \mid \pi_\theta \Big]
      - Often a stochastic policy class is used (smooths out the problem):
        \pi_\theta(u \mid s) = probability of taking action u in state s
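
To make the objective concrete, here is a minimal Python/numpy sketch (not from the slides) of a Gaussian stochastic policy π_θ(u|s) and a Monte Carlo estimate of the expected return; the environment interface (`env.reset`, `env.step`, `env.reward`) and the horizon `H` are hypothetical stand-ins.

```python
import numpy as np

def gaussian_policy(theta, s, rng):
    """Stochastic policy pi_theta(u|s): linear mean in s, fixed noise scale."""
    W, sigma = theta              # theta = (weight matrix, noise scale)
    mean = W @ s                  # mean action is a linear function of the state
    return mean + sigma * rng.standard_normal(mean.shape)

def estimate_U(theta, env, H, num_rollouts=20, seed=0):
    """Monte Carlo estimate of U(theta) = E[ sum_{t=0}^{H} R(s_t) | pi_theta ]."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(num_rollouts):
        s = env.reset()
        total = 0.0
        for t in range(H + 1):
            total += env.reward(s)              # R(s_t)
            u = gaussian_policy(theta, s, rng)  # u_t ~ pi_theta(. | s_t)
            s = env.step(u)                     # s_{t+1} = f(s_t, u_t)
        returns.append(total)
    return np.mean(returns)
```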

  3. Learning to Trot/Run
      After learning vs. before learning (hand-tuned).
      [Policy search was done through trials on the actual robot.]
      Kohl and Stone, ICRA 2004

  4. Learning to Trot/Run
      12 parameters define the Aibo's gait:
      - The front locus (3 parameters: height, x-pos., y-pos.)
      - The rear locus (3 parameters)
      - Locus length
      - Locus skew multiplier in the x-y plane (for turning)
      - The height of the front of the body
      - The height of the rear of the body
      - The time each foot takes to move through its locus
      - The fraction of time each foot spends on the ground
      Kohl and Stone, ICRA 2004

  5. [Policy search was done in simulation.] [Ng et al., ISER 2004]

  6. Learning to Hover

  7. Ball-In-A-Cup [Kober and Peters, 2009]

  8. Learning to Walk in 20 Minutes [Tedrake, Zhang, Seung 2005]

  9. Learning to Walk in 20 Minutes
      - Passive hip joint [1 DOF]
      - Arms: coupled to the opposite leg to reduce yaw moment
      - Freely swinging load [1 DOF]
      - 44 cm
      - 2 x 2 (roll, pitch) position-controlled servo motors [4 DOF]
      - 9 DOFs: 6 internal DOFs, plus 3 DOFs for the robot's orientation
        (always assumed in contact with the ground at a single point; absolute (x, y) ignored)
      - Natural gait down a 0.03-radian ramp: 0.8 Hz, 6.5 cm steps
      [Tedrake, Zhang, Seung 2005]

  10. Learning to Walk in 20 Minutes [Tedrake, Zhang, Seung 2005]

  11. Gradient-Free Methods
      \max_\theta U(\theta) = \max_\theta \; \mathbb{E}\Big[ \sum_{t=0}^{H} R(s_t) \mid \pi_\theta \Big]
      - Cross-Entropy Method (CEM) (sketched below)
      - Covariance Matrix Adaptation (CMA)
      - Dynamics model: stochastic: OK; unknown: OK
      - Policy class: stochastic: OK
      - Downside: gradient-free methods are slower than gradient-based methods → in practice OK if θ is low-dimensional and you are willing to do many runs
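
A minimal sketch (not from the slides) of the cross-entropy method for this objective, assuming a black-box return estimator `estimate_U(theta)` (for example, the Monte Carlo estimator above with the environment arguments bound): maintain a Gaussian over θ, sample a population, and refit the Gaussian to the top-scoring "elite" fraction.

```python
import numpy as np

def cross_entropy_method(estimate_U, dim, iters=50, pop_size=100,
                         elite_frac=0.2, init_std=1.0, seed=0):
    """Gradient-free policy search: fit a Gaussian over theta to the elites."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    std = init_std * np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        # Sample a population of candidate parameter vectors.
        thetas = mean + std * rng.standard_normal((pop_size, dim))
        returns = np.array([estimate_U(th) for th in thetas])
        # Keep the best-performing candidates and refit the Gaussian to them.
        elites = thetas[np.argsort(returns)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```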

  12. Gradient-Based Policy Optimization
      \max_\theta U(\theta) = \max_\theta \; \mathbb{E}\Big[ \sum_{t=0}^{H} R(s_t) \mid \pi_\theta \Big]
      [Computational graph: the dynamics f maps (s_t, u_t) to s_{t+1}; the policy π_θ maps s_t to u_t; the reward function R maps s_t to r_t.]

  13. Overview of Methods / Settings

                     Dynamics                     |   Policy
             D+K   D+U   S+K+R   S+K   S+U        |   D   S+R   S
      PD      +            +                      |   +    +
      LR      +     +      +      +     +         |        +    +

      D: deterministic; S: stochastic; K: known; U: unknown; R: reparameterizable;
      PD: path derivative (= perturbation analysis); LR: likelihood ratio (= score function)

  14. Questions
      - When more than one method is applicable, which one is best?
      - When the dynamics are only available as a black box and derivatives are not available: use finite-differences-based derivatives?
        - vs. applying finite differences / gradient-free methods directly to the policy
        - Note: finite differences are tricky (impractical?) when you cannot control the random seed
      - What if the model is unknown, but an estimate is available?

  15. Gradient Computation – Unknown Model – Finite Differences

  16. Noise Can Dominate

  17. Finite Differences and Noise
      - Solution 1: Average over many samples
      - Solution 2: Fix the randomness (if possible)
        - Intuition by example: wind influence on a helicopter is stochastic, but if we assume the same wind pattern across trials, this makes the different choices of θ more readily comparable
        - General instantiation: fix the random seed, and the result is a deterministic system
        - Ng & Jordan, 2000 provide a theoretical analysis of the gains from fixing randomness
      (See the finite-difference sketch below.)
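
A minimal numpy sketch (not from the slides) of central-difference gradient estimation of U(θ), with an option to reuse the same random seed for both perturbed evaluations ("common random numbers"), assuming a rollout-based estimator `estimate_U(theta, seed=...)` such as the one sketched earlier.

```python
import numpy as np

def finite_difference_gradient(estimate_U, theta, eps=1e-2, fix_seed=True, rng=None):
    """Central-difference estimate of dU/dtheta, optionally with common random numbers."""
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        # With a fixed seed, both perturbed evaluations see the same noise
        # realization, so their difference reflects the change in theta only.
        # With seed=None the estimator is assumed to draw fresh randomness.
        seed = int(rng.integers(2**31)) if fix_seed else None
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (estimate_U(theta + e, seed=seed)
                   - estimate_U(theta - e, seed=seed)) / (2 * eps)
    return grad
```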

  18. Path Derivative for Dynamics: D+K; Policy: D
      - Reminder of the optimization objective:
        \max_\theta U(\theta) = \max_\theta \; \mathbb{E}\Big[ \sum_{t=0}^{H} R(s_t) \mid \pi_\theta \Big]
      - Can compute a gradient estimate along the current roll-out:
        \frac{\partial U}{\partial \theta_i} = \sum_{t=0}^{H} \frac{\partial R}{\partial s}(s_t) \, \frac{\partial s_t}{\partial \theta_i}
        \frac{\partial s_t}{\partial \theta_i} = \frac{\partial f}{\partial s}(s_{t-1}, u_{t-1}) \, \frac{\partial s_{t-1}}{\partial \theta_i} + \frac{\partial f}{\partial u}(s_{t-1}, u_{t-1}) \, \frac{\partial u_{t-1}}{\partial \theta_i}
        \frac{\partial u_t}{\partial \theta_i} = \frac{\partial \pi_\theta}{\partial \theta_i}(s_t, \theta) + \frac{\partial \pi_\theta}{\partial s}(s_t, \theta) \, \frac{\partial s_t}{\partial \theta_i}
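
A minimal numpy sketch (not from the slides) of this recursion for a deterministic policy u_t = π(s_t, θ) and known deterministic dynamics s_{t+1} = f(s_t, u_t). The Jacobian functions `f_s`, `f_u`, `pi_th`, `pi_s`, `R_s` are assumed to be supplied (for example from an autodiff library or analytic derivation).

```python
import numpy as np

def path_derivative(theta, s0, H, f, pi, f_s, f_u, pi_th, pi_s, R_s):
    """Gradient of sum_{t=0}^{H} R(s_t) along one rollout, via the chain rule.

    f(s, u)      -> next state            f_s, f_u    -> Jacobians of f w.r.t. s, u
    pi(s, theta) -> action                pi_th, pi_s -> Jacobians of pi w.r.t. theta, s
    R_s(s)       -> gradient of the reward w.r.t. the state
    """
    n_theta = len(theta)
    s = s0
    ds_dth = np.zeros((len(s0), n_theta))   # d s_t / d theta; s_0 does not depend on theta
    grad = R_s(s0) @ ds_dth                 # contribution of t = 0 (zero)
    for t in range(H):
        u = pi(s, theta)
        # d u_t / d theta = dpi/dtheta + dpi/ds * d s_t / d theta
        du_dth = pi_th(s, theta) + pi_s(s, theta) @ ds_dth
        # d s_{t+1} / d theta = df/ds * d s_t/d theta + df/du * d u_t/d theta
        ds_dth = f_s(s, u) @ ds_dth + f_u(s, u) @ du_dth
        s = f(s, u)
        grad += R_s(s) @ ds_dth             # contribution of R(s_{t+1})
    return grad
```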

  19. Path Derivative for Dynamics: S+K+R; Policy: S+R
      - Reminder of the optimization objective:
        \max_\theta U(\theta) = \max_\theta \; \mathbb{E}\Big[ \sum_{t=0}^{H} R(s_t) \mid \pi_\theta \Big]
      - (draw the reparameterized graph on the board)
      - Then average over multiple samples (see the sketch below)
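
A minimal sketch (not from the slides) of the reparameterization idea: write the stochastic dynamics and policy as deterministic functions of θ and of noise variables that are sampled once and held fixed, so the path derivative of slide 18 applies to each sampled noise sequence, and the per-rollout gradients are averaged. The reparameterized forms s_{t+1} = f(s_t, u_t, w_t) and u_t = π(s_t, θ, z_t) are hypothetical stand-ins.

```python
import numpy as np

def reparameterized_rollout(theta, s0, H, f, pi, noise):
    """Rollout with the randomness pulled out into fixed noise variables.

    noise = [(z_0, w_0), ..., (z_{H-1}, w_{H-1})] is sampled once and then
    treated as a constant, so the rollout is a deterministic, differentiable
    function of theta and the slide-18 recursion applies.
    """
    states = [s0]
    for (z_t, w_t) in noise:
        u_t = pi(states[-1], theta, z_t)        # e.g. mean(s; theta) + sigma * z_t
        states.append(f(states[-1], u_t, w_t))  # e.g. f_det(s, u) + w_t
    return states

def reparameterized_gradient(theta, grad_along_path, sample_noise, num_samples, rng):
    """Average the per-rollout path derivative over several frozen noise sequences."""
    grads = [grad_along_path(theta, sample_noise(rng)) for _ in range(num_samples)]
    return np.mean(grads, axis=0)
```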

  20. Overview of Methods / Settings

                     Dynamics                     |   Policy
             D+K   D+U   S+K+R   S+K   S+U        |   D   S+R   S
      PD      +            +                      |   +    +
      LR      +     +      +      +     +         |        +    +

      D: deterministic; S: stochastic; K: known; U: unknown; R: reparameterizable;
      PD: path derivative (= perturbation analysis); LR: likelihood ratio (= score function)

  21. Gradient Computation – Unknown Model – Likelihood Ratio

  22. Likelihood Ratio Gradient
      [Note: can also be derived/generalized through an importance sampling derivation – Tang and Abbeel, 2011]
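
For reference, the standard likelihood ratio (score function) identity the slide title refers to, written in the trajectory notation P(τ; θ) used on the following slides (a math-only restatement, not copied from the slide itself):

```latex
% Score-function identity for U(\theta) = \mathbb{E}_{\tau \sim P(\cdot;\theta)}[R(\tau)]:
\nabla_\theta U(\theta)
  = \nabla_\theta \sum_\tau P(\tau;\theta)\, R(\tau)
  = \sum_\tau P(\tau;\theta)\, \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\, R(\tau)
  = \mathbb{E}_{\tau \sim P(\cdot;\theta)}\!\big[ \nabla_\theta \log P(\tau;\theta)\, R(\tau) \big]
```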

  23. Importance Sampling
      - On board.

  24. Likelihood Ratio Gradient Estimate

  25. Likelihood Ratio Gradient Estimate

  26. Likelihood Ratio Gradient Estimate
      - As formulated thus far: unbiased, but very noisy
      - Fixes that lead to real-world practicality:
        - Baseline
        - Temporal structure
        - Also: KL-divergence trust region / natural gradient (a general trick, equally applicable to perturbation analysis and finite differences)

  27. Likelihood Ratio with Baseline
      - Gradient estimate with baseline:
        \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta) \, \big( R(\tau^{(i)}) - b \big)
      - Crudely: increase the log-likelihood of paths with higher-than-baseline reward, and decrease the log-likelihood of paths with lower-than-baseline reward
      - Still unbiased? Yes!
        \mathbb{E}\Big[ \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta) \, b \Big] = 0

  28. Likelihood Ratio and Temporal Structure
      - Current estimate:
        \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta) \, \big( R(\tau^{(i)}) - b \big)
            = \frac{1}{m} \sum_{i=1}^{m} \Big( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big) \Big( \sum_{t=0}^{H-1} R(s_t^{(i)}, u_t^{(i)}) - b \Big)
      - Future actions do not depend on past rewards, hence we can lower the variance by instead using:
        \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b \Big)
      (See the sketch below.)
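
A minimal numpy sketch (not from the slides) of this estimator, assuming each trajectory is stored as per-step rewards and per-step score terms ∇_θ log π_θ(u_t | s_t) (for instance computed by an autodiff framework); `b` is a scalar baseline.

```python
import numpy as np

def likelihood_ratio_gradient(trajectories, b):
    """Policy gradient estimate with baseline and temporal structure (reward-to-go).

    trajectories: list of (grad_log_pis, rewards), where grad_log_pis[t] is
    grad_theta log pi_theta(u_t | s_t) as a flat array and rewards[t] = R(s_t, u_t).
    """
    grad = None
    for grad_log_pis, rewards in trajectories:
        # Reward-to-go: sum_{k >= t} R(s_k, u_k), via a reversed cumulative sum.
        rewards_to_go = np.cumsum(rewards[::-1])[::-1]
        for g_t, rtg_t in zip(grad_log_pis, rewards_to_go):
            term = g_t * (rtg_t - b)
            grad = term if grad is None else grad + term
    return grad / len(trajectories)
```

Setting `b` to the mean of the empirical returns is a simple common choice; as slide 27 notes, subtracting a baseline leaves the estimate unbiased.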

  29. Step-sizing and Trust Regions
      - Naïve step-sizing: line search
        - Step-sizing is necessary because the gradient is only a first-order approximation
        - Line search in the direction of the gradient
        - Simple, but expensive (evaluations along the line)
        - Naïve: ignores where the first-order approximation is good/poor

  30. Step-sizing and Trust Regions
      - Advanced step-sizing: trust regions
      - The first-order approximation from the gradient is good within a "trust region" → solve for the best point within the trust region:
        \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big( P(\tau;\theta) \,\|\, P(\tau;\theta+\delta\theta) \big) \le \varepsilon

  31. KL Trust Region (a.k.a. Natural Gradient)
      - Solve for the best point within the trust region:
        \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big( P(\tau;\theta) \,\|\, P(\tau;\theta+\delta\theta) \big) \le \varepsilon
      - The KL can be approximated efficiently with a 2nd-order expansion:
        KL\big( P(\tau;\theta) \,\|\, P(\tau;\theta+\delta\theta) \big) \approx \tfrac{1}{2}\, \delta\theta^\top G \, \delta\theta,   where G is the Fisher information matrix
      (See the sketch below.)
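
A minimal numpy sketch (not from the slides) of the resulting update: with the quadratic KL approximation, the constrained problem max_{δθ} ĝᵀδθ s.t. ½ δθᵀGδθ ≤ ε has the closed-form solution δθ = sqrt(2ε / (ĝᵀG⁻¹ĝ)) · G⁻¹ĝ, i.e. a scaled natural gradient step. `G` here stands for an estimate of the Fisher information matrix.

```python
import numpy as np

def natural_gradient_step(g_hat, G, epsilon):
    """Solve max_{dtheta} g_hat . dtheta  s.t.  0.5 * dtheta' G dtheta <= epsilon.

    g_hat: policy gradient estimate; G: estimated Fisher information matrix.
    Returns the KL-constrained (natural gradient) step dtheta.
    """
    nat_grad = np.linalg.solve(G, g_hat)                    # G^{-1} g_hat
    step_size = np.sqrt(2.0 * epsilon / (g_hat @ nat_grad))
    return step_size * nat_grad
```

Large-scale implementations typically avoid forming G explicitly and compute G⁻¹ĝ with conjugate gradient instead of a dense solve.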

  32. Experiments in Locomotion [Schulman, Levine, Abbeel, 2014]

  33. Actor-Critic Variant
      - Current estimate:
        \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b \Big)
        where the inner sum \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) is a sample-based estimate of Q(s_t^{(i)}, u_t^{(i)})
      - Actor-critic algorithms run, in parallel, an estimator for the Q-function, and substitute in the estimated Q value (see the sketch below)
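
A minimal sketch (not from the slides) of that substitution: the same estimator as in slide 28, but with the Monte Carlo reward-to-go replaced by a learned critic `Q_hat(s, u)`, a hypothetical function approximator trained in parallel (e.g. by regression on observed returns).

```python
import numpy as np

def actor_critic_gradient(trajectories, Q_hat, b=0.0):
    """Policy gradient with the reward-to-go replaced by a critic's Q estimate.

    trajectories: list of per-trajectory lists of (s_t, u_t, grad_log_pi_t).
    Q_hat(s, u): current Q-function estimate, trained separately from the actor.
    """
    grad = None
    for traj in trajectories:
        for s_t, u_t, grad_log_pi_t in traj:
            term = grad_log_pi_t * (Q_hat(s_t, u_t) - b)
            grad = term if grad is None else grad + term
    return grad / len(trajectories)
```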

  34. Learning Locomotion [Schulman, Moritz, Levine, Jordan, Abbeel, 2015]

  35. In Contrast: DARPA Robotics Challenge

  36. Thank you
