Asynchronous Methods for Deep Reinforcement Learning
Dominik Winkelbauer
Notation: state s, action a, reward r, policy π, value V, action value q

Value function: V^π(s) = E[R_t | s_t = s]
Action value function: q^π(s, a) = E[R_t | s_t = s, a]

Example (tree diagram with action probabilities 0.8 / 0.2, transition probabilities and rewards -1, 2, 0, 1):
V^π(s) = 0.8 · 0.1 · (-1) + 0.8 · 0.9 · 2 + 0.2 · 0.5 · 0 + 0.2 · 0.5 · 1 = 1.46
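A minimal sketch (not from the slides; the dictionary layout and action names are assumptions) of how the expectations above can be evaluated for the example tree:

    # One-step expectation for the example tree:
    # the policy picks each action with probability pi(a|s),
    # the environment then moves to a successor with probability p and gives reward r.
    policy = {"a1": 0.8, "a2": 0.2}               # pi(a|s) from the example
    outcomes = {                                  # (transition probability, reward) pairs
        "a1": [(0.1, -1.0), (0.9, 2.0)],
        "a2": [(0.5, 0.0), (0.5, 1.0)],
    }

    # q(s, a): expected reward after committing to action a
    q = {a: sum(p * r for p, r in outcomes[a]) for a in policy}

    # V(s): expectation over the policy's action choice
    v = sum(policy[a] * q[a] for a in policy)
    print(q, v)                                   # roughly {'a1': 1.7, 'a2': 0.5} and 1.46

This reproduces the 1.46 from the example as well as the per-action values q = 1.7 and q = 0.5.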
Value function: V^π(s) = E[R_t | s_t = s]
Action value function: q^π(s, a) = E[R_t | s_t = s, a]
Optimal action value function: Q*(s, a) = max_π q^π(s, a)
=> Q*(s, a) implicitly describes an optimal policy
(Diagram: the same example tree with the unknown values marked "?")
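Since Q*(s, a) already ranks all actions in a state, the optimal policy it implicitly describes is just the greedy one. A minimal sketch, assuming the same kind of Q lookup table as in the later examples:

    def greedy_policy(Q_star, state, actions):
        """pi*(s) = argmax_a Q*(s, a): acting greedily w.r.t. Q* is an optimal policy."""
        return max(actions, key=lambda a: Q_star[(state, a)])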
Value-based algorithms:
- Try to approximate V*(s) or Q*(s, a)
- Implicitly learn the policy

Policy-based algorithms:
- Directly learn the policy
Q-Learning
• Try to iteratively calculate Q*(s, a):
  Q(s, a) ← r + γ · max_{a'} Q(s', a')
  (Diagram: q(s, a) = 0.5 is updated towards the best successor value max_{a'} Q(s', a') = 1.7, taken over the successor values 1.7, 1.2 and -1)
• Idea: Use a neural network to approximate Q:
  L(θ) = E[(r + γ · max_{a'} Q(s', a'; θ) − Q(s, a; θ))²]
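A minimal tabular sketch of the update rule above (the learning rate alpha for the usual incremental form and all hyperparameter values are assumptions, not from the slides):

    from collections import defaultdict

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        """One tabular Q-learning step: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
        target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    Q = defaultdict(float)    # Q*(s, a) estimates, initialised to 0

The neural-network variant replaces the table lookup by Q(s, a; θ) and minimises the squared error between Q(s, a; θ) and the same target.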
How to traverse through the environment
• We follow an ε-greedy policy with ε ∈ [0, 1]
• In every state:
  • Sample a random number n ∈ [0, 1]
  • If n > ε => choose the action with the maximum q value
  • else => choose a random action
  (see the sketch below)
• Exploration vs. exploitation
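A minimal sketch of the ε-greedy rule above (the Q table layout and the action list are assumptions):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """With probability 1 - epsilon pick the greedy action, otherwise a random one."""
        if random.random() > epsilon:                      # exploit
            return max(actions, key=lambda a: Q[(s, a)])
        return random.choice(actions)                      # explore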
Q-Learning with Neural Networks
• A neural network approximates Q*(s, a)
• The agent uses the network to traverse through the environment
• The network is trained with the generated data
=> Data is non-stationary
=> Training with a neural network is unstable
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D. & Riedmiller, M.: Playing Atari with Deep Reinforcement Learning
• A neural network approximates Q*(s, a)
• The agent uses the network to traverse through the environment
• New data is stored in a replay memory
• The network is trained with data randomly sampled from the replay memory
=> Data is stationary
=> Training with a neural network is stable
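A minimal sketch of the replay-memory idea (capacity and batch size are arbitrary assumptions):

    import random
    from collections import deque

    class ReplayMemory:
        """Stores transitions and returns uniformly sampled mini-batches."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

        def push(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            return random.sample(self.buffer, batch_size)

Sampling uniformly from a large buffer breaks the temporal correlation between consecutive transitions, which is what makes the training data (approximately) stationary.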
On-policy vs. off-policy
• On-policy: The data used to train the policy has to be generated by that exact same policy.
  => Example: REINFORCE
• Off-policy: The data used to train the policy may also be generated by another policy.
  => Example: Q-Learning
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D. & Kavukcuoglu, K. (ICML 2016): Asynchronous Methods for Deep Reinforcement Learning
• Alternative method to make RL work better together with neural networks
• Traditional way (one agent): data generation, gradient computation and weight update run sequentially in a single loop
• Asynchronous way: several agents (#1 to #4) generate data and compute gradients in parallel, and the weight updates are applied asynchronously
Asynchronous Q-Learning
• Combine this idea with Q-Learning
• The generated data is stationary
  => Training is stable
  => No replay memory necessary
  => Data can be used directly while training remains stable
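A minimal sketch of the asynchronous idea applied to one-step Q-learning (tabular for brevity; the environment interface with reset()/step(), the thread count and the hyperparameters are assumptions):

    import random
    import threading
    from collections import defaultdict

    Q = defaultdict(float)            # shared action-value estimates
    lock = threading.Lock()

    def worker(make_env, actions, steps=10_000, alpha=0.1, gamma=0.99, eps=0.1):
        env = make_env()              # each worker explores its own environment copy
        s = env.reset()
        for _ in range(steps):
            a = (max(actions, key=lambda x: Q[(s, x)])
                 if random.random() > eps else random.choice(actions))
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, x)] for x in actions)
            with lock:                # apply the update to the shared estimates
                Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = env.reset() if done else s_next

    # threads = [threading.Thread(target=worker, args=(make_env, ACTIONS)) for _ in range(4)]

Because several workers explore different parts of the environment at the same time, the stream of updates is much less correlated than the trajectory of a single agent, so no replay memory is needed.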
Value-based algorithms:
- Try to approximate V*(s) or Q*(s, a)
- Implicitly learn the policy

Policy-based algorithms:
- Directly learn the policy
REINFORCE:
∇_θ log π(a_t | s_t, θ) · R_t
Sample trajectories and reinforce actions which lead to high rewards.
(Diagram: the example tree from before)
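A minimal sketch of the corresponding update in PyTorch (the network interface, optimizer and trajectory format are assumptions; states is a float tensor [T, obs_dim], actions a long tensor [T], returns the discounted returns R_t as a float tensor [T]):

    import torch

    def reinforce_update(policy_net, optimizer, states, actions, returns):
        """Gradient ascent on E[ log pi(a_t | s_t, theta) * R_t ]."""
        log_probs = torch.log_softmax(policy_net(states), dim=-1)   # [T, num_actions]
        chosen = log_probs[torch.arange(len(actions)), actions]     # log pi(a_t | s_t)
        loss = -(chosen * returns).mean()                           # minimise the negative objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()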
REINFORCE: ∇_θ log π(a_t | s_t, θ) · R_t
Problem: high variance
Subtract a baseline:
∇_θ log π(a_t | s_t, θ) · (R_t − b_t(s_t))
(Example: three samples with ∇_θ log π terms 0.9, 0.2 and -0.3 and returns 500, 501 and 499, each drawn with probability ≈ 0.33, give gradient weights 450, 100.2 and -149.7: huge, noisy updates although all returns are nearly identical.)
Use the value function as baseline:
∇_θ log π(a_t | s_t, θ) · (R_t − V(s_t; θ_v))
(R_t − V(s_t; θ_v)) can be seen as an estimate of the advantage:
A(a_t, s_t) = Q(a_t, s_t) − V(s_t)
(Example continued: with baseline 500 the terms become 0.9 · 0 = 0, 0.2 · 1 = 0.2 and -0.3 · (-1) = 0.3, so the variance is much lower.)
Actor: policy network
Critic: value network
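A minimal sketch of the resulting actor-critic losses (network layout and data format are assumptions; states, actions and returns as in the REINFORCE sketch):

    import torch

    def actor_critic_losses(policy_net, value_net, states, actions, returns):
        """Actor: policy gradient weighted by the advantage. Critic: squared error on R_t."""
        values = value_net(states).squeeze(-1)           # V(s_t; theta_v), shape [T]
        advantages = returns - values.detach()           # advantage estimate, no gradient into the critic
        log_probs = torch.log_softmax(policy_net(states), dim=-1)
        chosen = log_probs[torch.arange(len(actions)), actions]
        policy_loss = -(chosen * advantages).mean()      # actor
        value_loss = (returns - values).pow(2).mean()    # critic
        return policy_loss, value_loss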
Update interval
(Diagram: REINFORCE updates only after a complete trajectory, whereas actor-critic with advantage can already update after every few steps.)
Asynchronous advantage actor-critic (A3C)
• Update the local parameters from the global shared parameters
• Explore the environment according to the policy π(a_t | s_t; θ) for N steps
• Compute gradients for every visited state (see the sketch below):
  • Policy network: ∇_θ log π(a_t | s_t, θ) · (R_t − V(s_t; θ_v))
  • Value network: ∂(R_t − V(s_t; θ_v))² / ∂θ_v
• Update the global shared parameters with the computed gradients
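A minimal sketch of one A3C worker following the steps above (the environment interface, the actor-critic network returning (logits, value), the n-step length and all hyperparameters are assumptions; a full implementation would also share the optimizer state between workers and add entropy regularisation):

    import torch

    def a3c_worker(global_net, make_env, make_local_net, updates=10_000,
                   n_steps=5, gamma=0.99, lr=1e-4):
        env = make_env()
        local_net = make_local_net()                 # actor-critic: state -> (logits, value)
        optimizer = torch.optim.Adam(global_net.parameters(), lr=lr)
        s = env.reset()
        for _ in range(updates):
            # 1) Synchronise local parameters with the global shared parameters
            local_net.load_state_dict(global_net.state_dict())

            # 2) Act for up to N steps with the current policy
            states, actions, rewards, done = [], [], [], False
            for _ in range(n_steps):
                logits, _ = local_net(torch.as_tensor(s, dtype=torch.float32))
                a = torch.distributions.Categorical(logits=logits).sample().item()
                s_next, r, done = env.step(a)
                states.append(s); actions.append(a); rewards.append(r)
                s = s_next
                if done:
                    break

            # 3) Bootstrap the n-step return with V(s) unless the episode ended
            with torch.no_grad():
                R = 0.0 if done else local_net(torch.as_tensor(s, dtype=torch.float32))[1].item()
            returns = []
            for r in reversed(rewards):
                R = r + gamma * R
                returns.append(R)
            returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

            # 4) Compute gradients for every visited state on the local network
            batch = torch.as_tensor(states, dtype=torch.float32)
            logits, values = local_net(batch)
            values = values.squeeze(-1)
            log_probs = torch.log_softmax(logits, dim=-1)
            chosen = log_probs[torch.arange(len(actions)), torch.as_tensor(actions)]
            advantage = returns - values
            loss = -(chosen * advantage.detach()).mean() + advantage.pow(2).mean()
            local_net.zero_grad()
            loss.backward()

            # 5) Apply the local gradients to the global shared parameters
            for gp, lp in zip(global_net.parameters(), local_net.parameters()):
                gp.grad = lp.grad.clone()            # copy so the networks do not share buffers
            optimizer.step()

            if done:
                s = env.reset()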
Disadvantage of A3C
(Diagram: agents #1 and #2 perform steps and compute gradients on their own schedule, so their gradients arrive at the global network at different times and may have been computed with parameters that other agents have already updated.)
Synchronous version of A3C => A2C
(Diagram: all agents perform their steps in parallel, then the gradients are computed and applied to the global network in one synchronized update before the next round starts.)
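A minimal sketch of how the synchronous variant differs (collect_rollout and compute_losses are hypothetical helpers standing in for the per-step logic of the A3C sketch above): instead of each worker pushing its own gradients, one process steps all environments in lock-step and performs a single update per round.

    import torch

    def a2c_round(net, optimizer, envs, states, n_steps=5):
        """One synchronous round: every environment advances n_steps, then one joint update."""
        rollouts = [collect_rollout(net, env, s, n_steps)          # hypothetical helper
                    for env, s in zip(envs, states)]
        policy_loss, value_loss = compute_losses(net, rollouts)    # hypothetical helper
        optimizer.zero_grad()
        (policy_loss + value_loss).backward()                      # one gradient for all workers
        optimizer.step()
        return [rollout[-1] for rollout in rollouts]               # last states for the next round

Because every gradient is now computed with the current parameters, the stale-gradient issue from the previous slide disappears, at the price of waiting for the slowest worker in every round.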
Advantages of "asynchronous methods"
• Simple extension
• Can be applied to a wide variety of algorithms
• Makes robust training of neural networks possible
• Linear speedup