1. NPFL122, Lecture 9: TD3, Monte Carlo Tree Search

Milan Straka, December 09, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

2. Continuous Action Space

Until now, the actions were discrete. However, many environments naturally accept actions from a continuous space. We now consider actions which come from a range $[a, b]$ for $a, b \in \mathbb{R}$, or more generally from a Cartesian product of several such ranges: $\prod_i [a_i, b_i]$.

A simple way to parametrize the action distribution is to choose the actions from a normal distribution. Given a mean $\mu$ and a variance $\sigma^2$, the probability density function of $\mathcal{N}(\mu, \sigma^2)$ is
$$p(x) \stackrel{\text{def}}{=} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

[Figure from Section 13.7 of "Reinforcement Learning: An Introduction, Second Edition".]
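As a concrete illustration of the density above, the following is a minimal sketch assuming NumPy; the bounds and parameters are only examples, and clipping the sampled value into the range is one common way of respecting the bounds, not something stated on the slide.

    import numpy as np

    def normal_pdf(x, mu, sigma):
        # p(x) = 1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

    # Sample an action from N(mu, sigma^2) and clip it into the allowed range [a, b].
    a, b = -1.0, 1.0
    mu, sigma = 0.2, 0.5
    action = np.clip(np.random.normal(mu, sigma), a, b)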

3. Continuous Action Space in Gradient Methods

Utilizing continuous action spaces in gradient-based methods is straightforward. Instead of the softmax distribution, we suitably parametrize the action distribution, usually using the normal distribution. Considering only one real-valued action, we therefore have
$$\pi(a \mid s; \theta) \stackrel{\text{def}}{=} P\big(a \sim \mathcal{N}(\mu(s; \theta), \sigma(s; \theta)^2)\big),$$
where $\mu(s; \theta)$ and $\sigma(s; \theta)$ are function approximations of the mean and the standard deviation of the action distribution.

The mean and the standard deviation are usually computed from a shared representation, with
- the mean being computed as a regular regression (i.e., one output neuron without activation);
- the standard deviation (which must be positive) being computed again as a regression, followed most commonly by either $\exp$ or $\operatorname{softplus}$, where $\operatorname{softplus}(x) \stackrel{\text{def}}{=} \log(1 + e^x)$.
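A minimal sketch of such a policy head, assuming PyTorch; the class name, layer names and sizes are illustrative, not taken from the lecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GaussianPolicyHead(nn.Module):
        """Produces a Normal action distribution from a shared representation."""
        def __init__(self, hidden_dim, action_dim):
            super().__init__()
            # Mean: plain regression, one output per action dimension, no activation.
            self.mu_layer = nn.Linear(hidden_dim, action_dim)
            # Standard deviation: a regression made positive by softplus (exp would work as well).
            self.sigma_layer = nn.Linear(hidden_dim, action_dim)

        def forward(self, shared_representation):
            mu = self.mu_layer(shared_representation)
            sigma = F.softplus(self.sigma_layer(shared_representation))  # softplus(x) = log(1 + e^x)
            return torch.distributions.Normal(mu, sigma)

The returned distribution object can then be used both to sample actions and to evaluate $\log \pi(a \mid s; \theta)$ for the policy gradient.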

4. Deterministic Policy Gradient Theorem

Combining continuous actions and Deep Q Networks is not straightforward. In order to do so, we need a different variant of the policy gradient theorem. Recall that in the policy gradient theorem,
$$\nabla_\theta J(\theta) \propto \sum_{s \in \mathcal{S}} \mu(s) \sum_{a \in \mathcal{A}} q_\pi(s, a)\, \nabla_\theta \pi(a \mid s; \theta).$$

Deterministic Policy Gradient Theorem. Assume that the policy $\pi(s; \theta)$ is deterministic and computes an action $a \in \mathbb{R}$. Then, under several assumptions about continuousness, the following holds:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim \mu(s)} \Big[ \nabla_\theta \pi(s; \theta)\, \nabla_a q_\pi(s, a) \big|_{a = \pi(s; \theta)} \Big].$$

The theorem was first proven in the paper Deterministic Policy Gradient Algorithms by David Silver et al.
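In practice this gradient is usually obtained by automatic differentiation of the composition $q(s, \pi(s; \theta))$: differentiating through the critic into the actor's output realizes exactly $\nabla_\theta \pi(s; \theta)\, \nabla_a q(s, a)\big|_{a = \pi(s;\theta)}$. A minimal sketch assuming PyTorch, with hypothetical actor and critic modules and an optimizer over the actor parameters only:

    # Hypothetical objects: actor, critic (nn.Module instances), actor_optimizer
    # (covering only the actor's parameters), and a batch of states.
    # Maximizing q(s, pi(s; theta)) w.r.t. theta: autodiff composes
    # grad_a q(s, a) evaluated at a = pi(s; theta) with grad_theta pi(s; theta).
    actor_loss = -critic(states, actor(states)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()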

5. Deep Deterministic Policy Gradients

Note that the formulation of the deterministic policy gradient theorem allows an off-policy algorithm, because the loss no longer depends on the sampled actions (similarly to how expected Sarsa is also an off-policy algorithm).

We therefore train function approximations for both $\pi(s; \theta)$ and $q(s, a; \theta)$, training $q(s, a; \theta)$ using a deterministic variant of the Bellman equation:
$$q(S_t, A_t; \theta) = \mathbb{E}_{R_{t+1}, S_{t+1}} \big[ R_{t+1} + \gamma\, q(S_{t+1}, \pi(S_{t+1}; \theta)) \big],$$
and $\pi(s; \theta)$ according to the deterministic policy gradient theorem.

The algorithm was first described in the paper Continuous Control with Deep Reinforcement Learning by Timothy P. Lillicrap et al. (2015). The authors utilize a replay buffer, a target network (updated by an exponential moving average with $\tau = 0.001$), batch normalization for CNNs, and perform exploration by adding normally distributed noise to the predicted actions. Training is performed by Adam with learning rates of 1e-4 and 1e-3 for the policy and the critic network, respectively.
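A condensed sketch of one DDPG update step under these choices, assuming PyTorch; the actor, critic, their target copies, the optimizers and the replay-buffer batch are hypothetical objects, and all constants except $\tau = 0.001$ are illustrative rather than taken from the slide.

    import torch
    import torch.nn.functional as F

    GAMMA = 0.99          # illustrative discount factor
    TAU = 0.001           # exponential-moving-average coefficient quoted above
    NOISE_SIGMA = 0.1     # illustrative scale of the exploration noise

    def ddpg_update(states, actions, rewards, next_states, dones):
        # Critic: deterministic Bellman target computed from the target networks.
        with torch.no_grad():
            target = rewards + GAMMA * (1 - dones) * critic_target(next_states, actor_target(next_states))
        critic_loss = F.mse_loss(critic(states, actions), target)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Actor: deterministic policy gradient, maximizing q(s, pi(s; theta)).
        actor_loss = -critic(states, actor(states)).mean()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        # Target networks: exponential moving average with tau = 0.001.
        for net, net_target in ((actor, actor_target), (critic, critic_target)):
            for p, p_target in zip(net.parameters(), net_target.parameters()):
                p_target.data.mul_(1 - TAU).add_(TAU * p.data)

    def act(state):
        # Exploration: add normally distributed noise to the predicted action.
        with torch.no_grad():
            action = actor(state)
        return action + NOISE_SIGMA * torch.randn_like(action)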

6. Deep Deterministic Policy Gradients

[Figure: Algorithm 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.]

7. Twin Delayed Deep Deterministic Policy Gradient

The paper Addressing Function Approximation Error in Actor-Critic Methods by Scott Fujimoto et al. from February 2018 proposes improvements to DDPG which
- decrease maximization bias by training two critics and choosing the minimum of their predictions;
- introduce several variance-lowering optimizations:
  - delayed policy updates;
  - target policy smoothing.
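A sketch of the corresponding update, assuming PyTorch; the twin critics, their target copies, the actor, the optimizers and the batch are hypothetical objects, and the smoothing and delay constants follow commonly cited TD3 defaults rather than values stated on this slide.

    import torch
    import torch.nn.functional as F

    GAMMA = 0.99                         # illustrative discount factor
    POLICY_DELAY = 2                     # delayed policy updates: one actor update per two critic updates
    NOISE_SIGMA, NOISE_CLIP = 0.2, 0.5   # target policy smoothing constants (common TD3 defaults)
    ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # illustrative action bounds

    def td3_update(step, states, actions, rewards, next_states, dones):
        # Clipped double Q-learning with target policy smoothing.
        with torch.no_grad():
            next_actions = actor_target(next_states)
            noise = (NOISE_SIGMA * torch.randn_like(next_actions)).clamp(-NOISE_CLIP, NOISE_CLIP)
            next_actions = (next_actions + noise).clamp(ACTION_LOW, ACTION_HIGH)
            # Decrease maximization bias: take the minimum of the two target critics.
            target_q = torch.min(critic1_target(next_states, next_actions),
                                 critic2_target(next_states, next_actions))
            target = rewards + GAMMA * (1 - dones) * target_q

        critic_loss = (F.mse_loss(critic1(states, actions), target)
                       + F.mse_loss(critic2(states, actions), target))
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Delayed policy updates: update the actor less frequently than the critics.
        if step % POLICY_DELAY == 0:
            actor_loss = -critic1(states, actor(states)).mean()
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()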

8. TD3 – Maximization Bias

Similarly to Q-learning, the DDPG algorithm suffers from maximization bias. In Q-learning, the maximization bias was caused by the explicit $\max$ operator. For DDPG methods, it can be caused by the gradient descent itself. Let $\theta_{\text{approx}}$ be the parameters maximizing the approximate $q_\theta$ and let $\theta_{\text{true}}$ be the hypothetical parameters which maximize the true $q_\pi$, and let $\pi_{\text{approx}}$ and $\pi_{\text{true}}$ denote the corresponding policies.

Because the gradient direction is a local maximizer, for a sufficiently small $\alpha < \varepsilon_1$ we have
$$\mathbb{E}\big[ q_\theta(s, \pi_{\text{approx}}) \big] \ge \mathbb{E}\big[ q_\theta(s, \pi_{\text{true}}) \big].$$

However, for the real $q_\pi$ and for a sufficiently small $\alpha < \varepsilon_2$, it holds that
$$\mathbb{E}\big[ q_\pi(s, \pi_{\text{true}}) \big] \ge \mathbb{E}\big[ q_\pi(s, \pi_{\text{approx}}) \big].$$

Therefore, if $\mathbb{E}\big[ q_\theta(s, \pi_{\text{true}}) \big] \ge \mathbb{E}\big[ q_\pi(s, \pi_{\text{true}}) \big]$, then for $\alpha < \min(\varepsilon_1, \varepsilon_2)$,
$$\mathbb{E}\big[ q_\theta(s, \pi_{\text{approx}}) \big] \ge \mathbb{E}\big[ q_\pi(s, \pi_{\text{approx}}) \big].$$
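Chaining the first inequality, the assumption, and the second inequality (for $\alpha < \min(\varepsilon_1, \varepsilon_2)$) makes the overestimation explicit:
$$\mathbb{E}\big[ q_\theta(s, \pi_{\text{approx}}) \big] \;\ge\; \mathbb{E}\big[ q_\theta(s, \pi_{\text{true}}) \big] \;\ge\; \mathbb{E}\big[ q_\pi(s, \pi_{\text{true}}) \big] \;\ge\; \mathbb{E}\big[ q_\pi(s, \pi_{\text{approx}}) \big],$$
i.e., the approximate critic overestimates the value of the very policy it induces, which is the bias the two critics of TD3 are designed to counteract.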
