  1. CS 287 Lecture 19 (Fall 2019) Off-Policy, Model-Free RL: DQN, SoftQ, DDPG, SAC. Pieter Abbeel, UC Berkeley EECS

  2. Outline
     - Motivation
     - Q-learning
     - DQN + variants
     - Q-learning with continuous action spaces (SoftQ)
     - Deep Deterministic Policy Gradient (DDPG)
     - Soft Actor Critic (SAC)

  3. Story-line
     - TRPO, PPO: the importance-sampling surrogate loss allows doing more than a single gradient step, but updates are still very local.
     - Could we re-use samples more? Could we learn more globally / off-policy?
       - Yes! By leveraging the dynamic programming structure of the problem, breaking it down into 1-step pieces.
     - Q-learning, DQN: 1-step (sampled) off-policy Bellman back-ups → more sample re-use → more data-efficient learning, directly about the optimal policy.
     - Why not always Q-learning/DQN?
       - Often less stable
       - The data doesn't always support learning about the optimal policy (even if in principle it can learn fully off-policy)
     - DDPG, SAC: like Q-learning, but do off-policy learning about the current policy and how to locally improve it (vs. directly learning about the optimal policy).

  4. Outline
     - Motivation
     - Q-learning
     - DQN + variants
     - Q-learning with continuous action spaces (SoftQ)
     - Deep Deterministic Policy Gradient (DDPG)
     - Soft Actor Critic (SAC)

  5. Recap: Q-Values
     Q*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally
     Bellman equation:   Q*(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ]
     Q-value iteration:  Q_{k+1}(s, a) ← Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
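
     A minimal sketch of Q-value iteration on a known tabular MDP (not from the slides; the arrays P and R and their shapes are assumptions of this example):

```python
import numpy as np

# Hypothetical tabular MDP: P[s, a, s2] = P(s2 | s, a), R[s, a, s2] = R(s, a, s2).
def q_value_iteration(P, R, gamma=0.99, num_iters=1000):
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        V = Q.max(axis=1)  # V_k(s') = max_{a'} Q_k(s', a')
        # Q_{k+1}(s, a) = sum_{s'} P(s' | s, a) [ R(s, a, s') + gamma * V_k(s') ]
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q
```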

  6. (Tabular) Q-Learning
     - Q-value iteration:  Q_{k+1}(s, a) ← Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
     - Rewrite as an expectation:  Q_{k+1}(s, a) ← E_{s' ~ P(s' | s, a)} [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
     - (Tabular) Q-Learning: replace the expectation by samples
       - For a state-action pair (s, a), receive: s' ~ P(s' | s, a)
       - Consider your old estimate: Q_k(s, a)
       - Consider your new sample estimate: target(s') = R(s, a, s') + γ max_{a'} Q_k(s', a')
       - Incorporate the new estimate into a running average: Q_{k+1}(s, a) ← (1 − α) Q_k(s, a) + α [target(s')]

  7. (Tabular) Q-Learning Algorithm:
     Start with Q_0(s, a) for all s, a.
     Get initial state s
     For k = 1, 2, … till convergence:
       Sample action a, get next state s'
       If s' is terminal:
         target = R(s, a, s')
         Sample new initial state s'
       else:
         target = R(s, a, s') + γ max_{a'} Q_k(s', a')
       Q_{k+1}(s, a) ← (1 − α) Q_k(s, a) + α [target]
       s ← s'
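
     A minimal Python sketch of the loop above, assuming a gymnasium-style environment with discrete states and actions (the env API and hyperparameters are assumptions of the sketch, not part of the slides):

```python
import numpy as np

def tabular_q_learning(env, num_steps=100_000, gamma=0.99, alpha=0.1, eps=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    s, _ = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy action selection (see the next slide)
        a = env.action_space.sample() if np.random.rand() < eps else int(Q[s].argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        # target = R(s, a, s') at terminal states, else R(s, a, s') + gamma * max_{a'} Q(s', a')
        target = r if terminated else r + gamma * Q[s_next].max()
        # running average: Q(s, a) <- (1 - alpha) Q(s, a) + alpha * target
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        if terminated or truncated:
            s, _ = env.reset()   # sample a new initial state
        else:
            s = s_next
    return Q
```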

  8. How to sample actions?
     - Choose random actions?
     - Choose the action that maximizes Q_k(s, a) (i.e. act greedily)?
     - ɛ-Greedy: choose a random action with probability ɛ, otherwise choose the action greedily
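
     A one-function illustration of the ɛ-greedy rule (names are just for this sketch):

```python
import numpy as np

def epsilon_greedy(Q, s, num_actions, eps=0.1):
    # With probability eps pick a uniformly random action (explore),
    # otherwise pick the action with the highest current Q estimate (exploit).
    if np.random.rand() < eps:
        return np.random.randint(num_actions)
    return int(np.argmax(Q[s]))
```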

  9. Q-Learning Properties
     - Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
     - This is called off-policy learning
     - Caveats:
       - You have to explore enough
       - You have to eventually make the learning rate small enough
       - … but not decrease it too quickly

  10. Q-Learning Properties
     - Technical requirements:
       - All states and actions are visited infinitely often
         - Basically, in the limit, it doesn't matter how you select actions (!)
       - Learning rate schedule such that for all state-action pairs (s, a):
           Σ_{t=0}^{∞} α_t(s, a) = ∞   and   Σ_{t=0}^{∞} α_t(s, a)² < ∞
     For details, see: Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), November 1994.
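
     One standard schedule that satisfies both conditions (an illustrative example, not from the slides) decays harmonically in the visit count N_t(s, a) of each pair:

```latex
\alpha_t(s,a) = \frac{1}{N_t(s,a)}, \qquad
\sum_{t=0}^{\infty} \alpha_t(s,a) = \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad
\sum_{t=0}^{\infty} \alpha_t(s,a)^2 = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty .
```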

  11. Q-Learning Demo: Crawler
     - States: discretized values of the 2d state (arm angle, hand angle)
     - Actions: Cartesian product of {arm up, arm down} and {hand up, hand down}
     - Reward: speed in the forward direction

  12. Video of Demo Crawler Bot

  13. Video of Demo Q-Learning -- Crawler

  14. Outline
     - Motivation
     - Q-learning
     - DQN + variants
     - Q-learning with continuous action spaces (SoftQ)
     - Deep Deterministic Policy Gradient (DDPG)
     - Soft Actor Critic (SAC)

  15. Can tabular methods scale?
     - Discrete environments (number of states):
       - Gridworld: 10^1
       - Tetris: 10^60
       - Atari: 10^308 (ram), 10^16992 (pixels)

  16. Can tabular methods scale?
     - Continuous environments (by crude discretization):
       - Crawler: 10^2
       - Hopper: 10^10
       - Humanoid: 10^100

  17. Generalizing Across States
     - Basic Q-Learning keeps a table of all q-values
     - In realistic situations, we cannot possibly learn about every single state!
       - Too many states to visit them all in training
       - Too many states to hold the q-tables in memory
     - Instead, we want to generalize:
       - Learn about some small number of training states from experience
       - Generalize that experience to new, similar situations
       - This is a fundamental idea in machine learning

  18. Approximate Q-Learning
     - Instead of a table, we have a parametrized Q function: Q_θ(s, a)
       - Can be a linear function in features: Q_θ(s, a) = θ_0 f_0(s, a) + θ_1 f_1(s, a) + … + θ_n f_n(s, a)
       - Or a neural net, decision tree, etc.
     - Learning rule:
       - Remember: target(s') = R(s, a, s') + γ max_{a'} Q_{θ_k}(s', a')
       - Update: θ_{k+1} ← θ_k − α ∇_θ [ ½ ( Q_θ(s, a) − target(s') )² ] |_{θ = θ_k}
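
     A minimal numpy sketch of the update above for the linear-in-features case; the feature function f(s, a) and the action set are assumed given (illustrative names, not from the slides):

```python
import numpy as np

def approx_q_update(theta, f, s, a, r, s_next, actions, gamma=0.99, alpha=0.01):
    # Q_theta(s, a) = theta . f(s, a)
    q_sa = theta @ f(s, a)
    # target(s') = R(s, a, s') + gamma * max_{a'} Q_{theta_k}(s', a'); treated as a constant
    target = r + gamma * max(theta @ f(s_next, a_next) for a_next in actions)
    # gradient of 1/2 (Q_theta(s, a) - target)^2 w.r.t. theta is (Q_theta(s, a) - target) * f(s, a)
    return theta - alpha * (q_sa - target) * f(s, a)
```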

  19. Recall: Approximate Q-Learning
     - Instead of a table, we have a parametrized Q function Q_θ(s, a)
       - E.g. a neural net
     - Learning rule:
       - Compute target: target(s') = R(s, a, s') + γ max_{a'} Q_{θ_k}(s', a')
       - Update Q-network: θ_{k+1} ← θ_k − α ∇_θ [ ½ ( Q_θ(s, a) − target(s') )² ] |_{θ = θ_k}
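
     The DQN slides themselves are mostly figures in this transcript; as a hedged sketch, this is how the same update is commonly implemented for a neural-net Q-function with a replay buffer and a lagged target network (the standard DQN ingredients). PyTorch is assumed, and q_net, target_net, and the batch layout are illustrative names:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch                            # tensors sampled from a replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q_theta(s, a)
    with torch.no_grad():
        # target = r + gamma * max_{a'} Q_target(s', a'), with no bootstrap at terminal states
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```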

  20. See also
     - "Rainbow: Combining Improvements in Deep Reinforcement Learning," Matteo Hessel et al., 2017
       - Double DQN (DDQN)
       - Prioritized Replay DDQN
       - Dueling DQN
       - Distributional DQN
       - Noisy DQN

  21. Outline
     - Motivation
     - Q-learning
     - DQN + variants
     - Q-learning with continuous action spaces (SoftQ)
     - Deep Deterministic Policy Gradient (DDPG)
     - Soft Actor Critic (SAC)

  22. Soft Q-Learning
     → Use a sample estimate
     → Supervised learning
     → Stein variational gradient descent
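
     The equations on this slide are figures. As a rough illustration of the "use a sample estimate" step: soft Q-learning's soft value is a log-integral of exp(Q/α) over actions, which with continuous actions can be estimated from sampled actions via importance sampling. The uniform proposal, the q_net(s, a) call signature, and all names below are assumptions of this sketch:

```python
import math
import torch

def soft_value_estimate(q_net, s, action_dim, alpha=1.0, n=32):
    # V_soft(s) = alpha * log ∫ exp(Q(s, a) / alpha) da
    #           ≈ alpha * log( (1/n) * sum_i exp(Q(s, a_i) / alpha) / q(a_i) ),
    # with a_i drawn from a uniform proposal q on [-1, 1]^action_dim.
    actions = 2 * torch.rand(n, action_dim) - 1                 # n sampled actions
    q_vals = torch.stack([q_net(s, a) for a in actions])        # Q(s, a_i), shape (n,)
    log_q_density = -action_dim * math.log(2.0)                 # log density of the uniform proposal
    return alpha * (torch.logsumexp(q_vals / alpha - log_q_density, dim=0) - math.log(n))
```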

  23. Stein Variational Gradient Descent: Intuition
     - Q-function
     - Policy sampling network
     - Implicit density model
     D. Wang et al., Learning to draw samples: With application to amortized MLE for generative adversarial learning, 2016.

  24. Training time: 0 min, 12 min, 30 min, 2 hours
     sites.google.com/view/composing-real-world-policies/

  25. After 2 hours of training
     sites.google.com/view/composing-real-world-policies/

  26. Outline
     - Motivation
     - Q-learning
     - DQN + variants
     - Q-learning with continuous action spaces (SoftQ)
     - Deep Deterministic Policy Gradient (DDPG)
     - Soft Actor Critic (SAC)

  27. Deep Deterministic Policy Gradient (DDPG): Basic (= SVG(0))
     for iter = 1, 2, …
       - Roll-outs: execute roll-outs under the current policy (+ some noise for exploration)
       - Q-function update:  g ∝ ∇_φ Σ_t ( Q_φ(s_t, u_t) − Q̂(s_t, u_t) )²   with   Q̂(s_t, u_t) = r_t + γ Q_φ(s_{t+1}, u_{t+1})
       - Policy update: backprop through Q to compute gradient estimates for all t:  g ∝ ∇_θ Σ_t Q_φ(s_t, π_θ(s_t, v_t))
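
     A compact PyTorch-style sketch of the two updates on this slide; q_net, policy, and the optimizers are illustrative names, and this basic version bootstraps with the current networks (the lagged target copies come on the "Complete" slide below):

```python
import torch
import torch.nn.functional as F

def ddpg_basic_update(q_net, policy, q_opt, pi_opt, batch, gamma=0.99):
    s, u, r, s_next = batch
    # Q-function update: regress Q_phi(s_t, u_t) toward r_t + gamma * Q_phi(s_{t+1}, u_{t+1})
    with torch.no_grad():
        q_hat = r + gamma * q_net(s_next, policy(s_next))
    q_loss = F.mse_loss(q_net(s, u), q_hat)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Policy update: backprop through Q, i.e. ascend grad_theta Q_phi(s_t, pi_theta(s_t))
    pi_loss = -q_net(s, policy(s)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```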

  28. SVG(k)
     - Applied to 2-D robotics tasks
     - Different gradient estimators behave similarly

  29. SVG(k)

  30. Deep Deterministic Policy Gradient (DDPG): Complete
     - Add noise for exploration
     - Incorporate a replay buffer for off-policy learning
     - For increased stability, use lagged (Polyak-averaged) versions of Q_φ and π_θ for target values:
         Q̂_t = r_t + γ Q_{φ'}(s_{t+1}, π_{θ'}(s_{t+1}))   ← off-policy!
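
     A sketch of the lagged (Polyak-averaged) target networks mentioned above, assuming PyTorch modules; the target copies Q_{φ'} and π_{θ'} are then used to compute Q̂_t = r_t + γ Q_{φ'}(s_{t+1}, π_{θ'}(s_{t+1})):

```python
import copy
import torch

def make_target(net):
    # The target network starts as a frozen copy of the online network.
    target = copy.deepcopy(net)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

@torch.no_grad()
def polyak_update(net, target, tau=0.005):
    # phi' <- tau * phi + (1 - tau) * phi'   (slowly-moving copy used for target values)
    for p, p_targ in zip(net.parameters(), target.parameters()):
        p_targ.mul_(1 - tau).add_(tau * p)
```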

  31. DDPG
     - Applied to 2D and 3D robotics tasks and driving with pixel input

  32. DDPG

  33. DDPG
     + Very sample efficient thanks to off-policy updates
     − Often unstable → Soft Actor Critic (SAC), which adds the entropy of the policy to the objective, ensuring better exploration and less overfitting of the policy to any quirks in the Q-function

  34. Outline
     - Motivation
     - Q-learning
     - DQN + variants
     - Q-learning with continuous action spaces (SoftQ)
     - Deep Deterministic Policy Gradient (DDPG)
     - Soft Actor Critic (SAC)

  35. Soft Policy Iteration vs. Soft Actor-Critic
     Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML, 2018.
     Soft Policy Iteration:
       1. Soft policy evaluation: fix the policy, apply the soft Bellman backup until it converges. This converges to the policy's soft Q-function.
       2. Soft policy improvement: update the policy through information projection. For the new policy, we have Q^{π_new} ≥ Q^{π_old}.
       3. Repeat until convergence.
     Soft Actor-Critic:
       1. Take one stochastic gradient step to minimize the soft Bellman residual.
       2. Take one stochastic gradient step to minimize the KL divergence.
       3. Execute one action in the environment and repeat.

  36. Soft Actor Critic
     - Objective (maximum entropy RL): J(π) = Σ_t E_{(s_t, a_t) ~ ρ_π} [ r(s_t, a_t) + α H(π(· | s_t)) ]
     - Iterate:
       - Perform a roll-out from π, add the data to a replay buffer
       - Learn V, Q, π
     [see also: https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665]
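
     A condensed sketch of one SAC iteration under this objective, assuming a stochastic policy whose sample(s) method returns an action and its log-probability, and a single critic with a Polyak-averaged target (a common later variant that folds V into the critic target; the slide's version also learns a separate V network). All names are illustrative:

```python
import torch
import torch.nn.functional as F

def sac_update(policy, q_net, q_target, pi_opt, q_opt, batch, gamma=0.99, alpha=0.2, tau=0.005):
    s, a, r, s_next, done = batch
    # Critic: one gradient step on the soft Bellman residual (entropy bonus in the bootstrap term).
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)          # a' ~ pi(.|s'), log pi(a'|s')
        target = r + gamma * (1 - done) * (q_target(s_next, a_next) - alpha * logp_next)
    q_loss = F.mse_loss(q_net(s, a), target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: one gradient step on the KL objective, i.e. maximize E[ Q(s, a~pi) - alpha * log pi(a|s) ].
    a_new, logp_new = policy.sample(s)
    pi_loss = (alpha * logp_new - q_net(s, a_new)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Polyak-average the target critic (as on the DDPG slides).
    with torch.no_grad():
        for p, p_t in zip(q_net.parameters(), q_target.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```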

  37. Algorithms compared: Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), Soft Q-Learning (SQL)
     sites.google.com/view/soft-actor-critic

  38. sites.google.com/view/soft-actor-critic

  39. Real Robot Results

  40. Real Robot Results

  41. Real Robot Results
