Asynchronous Methods for Deep Reinforcement Learning
Dominik Winkelbauer
Notation: state s, action a, reward r, policy π, value V, action value q

Value function: V^π(s) = E[R_t | s_t = s]
Action value function: q^π(s, a) = E[R_t | s_t = s, a]

Example (tree diagram with action probabilities 0.8 / 0.2, transition probabilities and rewards -1, 2, 0, 1):
V^π(s) = 0.8 · 0.1 · (-1) + 0.8 · 0.9 · 2 + 0.2 · 0.5 · 0 + 0.2 · 0.5 · 1 = 1.46
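A minimal sketch (not from the slides; the dictionary layout and action names are assumptions) of how the expectations above can be evaluated for the example tree:

    # One-step expectation for the example tree:
    # the policy picks each action with probability pi(a|s),
    # the environment then moves to a successor with probability p and gives reward r.
    policy = {"a1": 0.8, "a2": 0.2}               # pi(a|s) from the example
    outcomes = {                                  # (transition probability, reward) pairs
        "a1": [(0.1, -1.0), (0.9, 2.0)],
        "a2": [(0.5, 0.0), (0.5, 1.0)],
    }

    # q(s, a): expected reward after committing to action a
    q = {a: sum(p * r for p, r in outcomes[a]) for a in policy}

    # V(s): expectation over the policy's action choice
    v = sum(policy[a] * q[a] for a in policy)
    print(q, v)                                   # roughly {'a1': 1.7, 'a2': 0.5} and 1.46

This reproduces the 1.46 from the example as well as the per-action values q = 1.7 and q = 0.5.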
Value function: V^π(s) = E[R_t | s_t = s]
Action value function: q^π(s, a) = E[R_t | s_t = s, a]
Optimal action value function: Q*(s, a) = max_π q^π(s, a)
=> Q*(s, a) implicitly describes an optimal policy
(Diagram: the same example tree with the unknown values marked "?")
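Since Q*(s, a) already ranks all actions in a state, the optimal policy it implicitly describes is just the greedy one. A minimal sketch, assuming the same kind of Q lookup table as in the later examples:

    def greedy_policy(Q_star, state, actions):
        """pi*(s) = argmax_a Q*(s, a): acting greedily w.r.t. Q* is an optimal policy."""
        return max(actions, key=lambda a: Q_star[(state, a)])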
Value-based algorithms:
- Try to approximate V*(s) or Q*(s, a)
- Implicitly learn the policy

Policy-based algorithms:
- Directly learn the policy
Q-Learning
• Try to iteratively calculate Q*(s, a):
  Q(s, a) ← r + γ · max_{a'} Q(s', a')
  (Diagram: q(s, a) = 0.5 is updated towards the best successor value max_{a'} Q(s', a') = 1.7, taken over the successor values 1.7, 1.2 and -1)
• Idea: Use a neural network to approximate Q:
  L(θ) = E[(r + γ · max_{a'} Q(s', a'; θ) − Q(s, a; θ))²]
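A minimal tabular sketch of the update rule above (the learning rate alpha for the usual incremental form and all hyperparameter values are assumptions, not from the slides):

    from collections import defaultdict

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        """One tabular Q-learning step: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
        target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    Q = defaultdict(float)    # Q*(s, a) estimates, initialised to 0

The neural-network variant replaces the table lookup by Q(s, a; θ) and minimises the squared error between Q(s, a; θ) and the same target.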
How to traverse through the environment
• We follow an ε-greedy policy with ε ∈ [0, 1]
• In every state:
  • Sample a random number n ∈ [0, 1]
  • If n > ε => choose the action with the maximum q value
  • else => choose a random action
  (see the sketch below)
• Exploration vs. exploitation
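A minimal sketch of the ε-greedy rule above (the Q table layout and the action list are assumptions):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """With probability 1 - epsilon pick the greedy action, otherwise a random one."""
        if random.random() > epsilon:                      # exploit
            return max(actions, key=lambda a: Q[(s, a)])
        return random.choice(actions)                      # explore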
Q-Learning with Neural Networks
• A neural network approximates Q*(s, a)
• The agent uses the network to traverse through the environment
• The network is trained with the generated data
=> Data is non-stationary
=> Training with a neural network is unstable
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D. & Riedmiller, M.: Playing Atari with Deep Reinforcement Learning
• A neural network approximates Q*(s, a)
• The agent uses the network to traverse through the environment
• New data is stored in a replay memory
• The network is trained with data randomly sampled from the replay memory
=> Data is stationary
=> Training with a neural network is stable
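A minimal sketch of the replay-memory idea (capacity and batch size are arbitrary assumptions):

    import random
    from collections import deque

    class ReplayMemory:
        """Stores transitions and returns uniformly sampled mini-batches."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

        def push(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            return random.sample(self.buffer, batch_size)

Sampling uniformly from a large buffer breaks the temporal correlation between consecutive transitions, which is what makes the training data (approximately) stationary.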
On-policy vs. off-policy
• On-policy: The data used to train the policy has to be generated by that exact same policy.
  => Example: REINFORCE
• Off-policy: The data used to train the policy may also be generated by another policy.
  => Example: Q-Learning
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D. & Kavukcuoglu, K. (ICML 2016): Asynchronous Methods for Deep Reinforcement Learning
• Alternative method to make RL work better together with neural networks
• Traditional way (one agent): data generation, gradient computation and weight update run sequentially in a single loop
• Asynchronous way: several agents (#1 to #4) generate data and compute gradients in parallel, and the weight updates are applied asynchronously
Asynchronous Q-Learning
• Combine this idea with Q-Learning
• The generated data is stationary
  => Training is stable
  => No replay memory necessary
  => Data can be used directly while training remains stable
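A minimal sketch of the asynchronous idea applied to one-step Q-learning (tabular for brevity; the environment interface with reset()/step(), the thread count and the hyperparameters are assumptions):

    import random
    import threading
    from collections import defaultdict

    Q = defaultdict(float)            # shared action-value estimates
    lock = threading.Lock()

    def worker(make_env, actions, steps=10_000, alpha=0.1, gamma=0.99, eps=0.1):
        env = make_env()              # each worker explores its own environment copy
        s = env.reset()
        for _ in range(steps):
            a = (max(actions, key=lambda x: Q[(s, x)])
                 if random.random() > eps else random.choice(actions))
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, x)] for x in actions)
            with lock:                # apply the update to the shared estimates
                Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = env.reset() if done else s_next

    # threads = [threading.Thread(target=worker, args=(make_env, ACTIONS)) for _ in range(4)]

Because several workers explore different parts of the environment at the same time, the stream of updates is much less correlated than the trajectory of a single agent, so no replay memory is needed.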
Value-based algorithms:
- Try to approximate V*(s) or Q*(s, a)
- Implicitly learn the policy

Policy-based algorithms:
- Directly learn the policy
REINFORCE:
∇_θ log π(a_t | s_t, θ) · R_t
Sample trajectories and reinforce actions which lead to high rewards.
(Diagram: the example tree from before)
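A minimal sketch of the corresponding update in PyTorch (the network interface, optimizer and trajectory format are assumptions; states is a float tensor [T, obs_dim], actions a long tensor [T], returns the discounted returns R_t as a float tensor [T]):

    import torch

    def reinforce_update(policy_net, optimizer, states, actions, returns):
        """Gradient ascent on E[ log pi(a_t | s_t, theta) * R_t ]."""
        log_probs = torch.log_softmax(policy_net(states), dim=-1)   # [T, num_actions]
        chosen = log_probs[torch.arange(len(actions)), actions]     # log pi(a_t | s_t)
        loss = -(chosen * returns).mean()                           # minimise the negative objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()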
REINFORCE: ∇_θ log π(a_t | s_t, θ) · R_t
Problem: high variance
Subtract a baseline:
∇_θ log π(a_t | s_t, θ) · (R_t − b_t(s_t))
(Example: three samples with ∇_θ log π terms 0.9, 0.2 and -0.3 and returns 500, 501 and 499, each drawn with probability ≈ 0.33, give gradient weights 450, 100.2 and -149.7: huge, noisy updates although all returns are nearly identical.)
Use the value function as baseline:
∇_θ log π(a_t | s_t, θ) · (R_t − V(s_t; θ_v))
(R_t − V(s_t; θ_v)) can be seen as an estimate of the advantage:
A(a_t, s_t) = Q(a_t, s_t) − V(s_t)
(Example continued: with baseline 500 the terms become 0.9 · 0 = 0, 0.2 · 1 = 0.2 and -0.3 · (-1) = 0.3, so the variance is much lower.)
Actor: policy network
Critic: value network
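A minimal sketch of the resulting actor-critic losses (network layout and data format are assumptions; states, actions and returns as in the REINFORCE sketch):

    import torch

    def actor_critic_losses(policy_net, value_net, states, actions, returns):
        """Actor: policy gradient weighted by the advantage. Critic: squared error on R_t."""
        values = value_net(states).squeeze(-1)           # V(s_t; theta_v), shape [T]
        advantages = returns - values.detach()           # advantage estimate, no gradient into the critic
        log_probs = torch.log_softmax(policy_net(states), dim=-1)
        chosen = log_probs[torch.arange(len(actions)), actions]
        policy_loss = -(chosen * advantages).mean()      # actor
        value_loss = (returns - values).pow(2).mean()    # critic
        return policy_loss, value_loss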
Update interval
(Diagram: REINFORCE updates only after a complete trajectory, whereas actor-critic with advantage can already update after every few steps.)
Asynchronous advantage actor-critic (A3C)
• Update the local parameters from the global shared parameters
• Explore the environment according to the policy π(a_t | s_t; θ) for N steps
• Compute gradients for every visited state (see the sketch below):
  • Policy network: ∇_θ log π(a_t | s_t, θ) · (R_t − V(s_t; θ_v))
  • Value network: ∂(R_t − V(s_t; θ_v))² / ∂θ_v
• Update the global shared parameters with the computed gradients
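A minimal sketch of one A3C worker following the steps above (the environment interface, the actor-critic network returning (logits, value), the n-step length and all hyperparameters are assumptions; a full implementation would also share the optimizer state between workers and add entropy regularisation):

    import torch

    def a3c_worker(global_net, make_env, make_local_net, updates=10_000,
                   n_steps=5, gamma=0.99, lr=1e-4):
        env = make_env()
        local_net = make_local_net()                 # actor-critic: state -> (logits, value)
        optimizer = torch.optim.Adam(global_net.parameters(), lr=lr)
        s = env.reset()
        for _ in range(updates):
            # 1) Synchronise local parameters with the global shared parameters
            local_net.load_state_dict(global_net.state_dict())

            # 2) Act for up to N steps with the current policy
            states, actions, rewards, done = [], [], [], False
            for _ in range(n_steps):
                logits, _ = local_net(torch.as_tensor(s, dtype=torch.float32))
                a = torch.distributions.Categorical(logits=logits).sample().item()
                s_next, r, done = env.step(a)
                states.append(s); actions.append(a); rewards.append(r)
                s = s_next
                if done:
                    break

            # 3) Bootstrap the n-step return with V(s) unless the episode ended
            with torch.no_grad():
                R = 0.0 if done else local_net(torch.as_tensor(s, dtype=torch.float32))[1].item()
            returns = []
            for r in reversed(rewards):
                R = r + gamma * R
                returns.append(R)
            returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

            # 4) Compute gradients for every visited state on the local network
            batch = torch.as_tensor(states, dtype=torch.float32)
            logits, values = local_net(batch)
            values = values.squeeze(-1)
            log_probs = torch.log_softmax(logits, dim=-1)
            chosen = log_probs[torch.arange(len(actions)), torch.as_tensor(actions)]
            advantage = returns - values
            loss = -(chosen * advantage.detach()).mean() + advantage.pow(2).mean()
            local_net.zero_grad()
            loss.backward()

            # 5) Apply the local gradients to the global shared parameters
            for gp, lp in zip(global_net.parameters(), local_net.parameters()):
                gp.grad = lp.grad.clone()            # copy so the networks do not share buffers
            optimizer.step()

            if done:
                s = env.reset()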
Disadvantage of A3C
(Diagram: agents #1 and #2 perform steps and compute gradients on their own schedule, so their gradients arrive at the global network at different times and may have been computed with parameters that other agents have already updated.)
Synchronous version of A3C => A2C
(Diagram: all agents perform their steps in parallel, then the gradients are computed and applied to the global network in one synchronized update before the next round starts.)
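A minimal sketch of how the synchronous variant differs (collect_rollout and compute_losses are hypothetical helpers standing in for the per-step logic of the A3C sketch above): instead of each worker pushing its own gradients, one process steps all environments in lock-step and performs a single update per round.

    import torch

    def a2c_round(net, optimizer, envs, states, n_steps=5):
        """One synchronous round: every environment advances n_steps, then one joint update."""
        rollouts = [collect_rollout(net, env, s, n_steps)          # hypothetical helper
                    for env, s in zip(envs, states)]
        policy_loss, value_loss = compute_losses(net, rollouts)    # hypothetical helper
        optimizer.zero_grad()
        (policy_loss + value_loss).backward()                      # one gradient for all workers
        optimizer.step()
        return [rollout[-1] for rollout in rollouts]               # last states for the next round

Because every gradient is now computed with the current parameters, the stale-gradient issue from the previous slide disappears, at the price of waiting for the slowest worker in every round.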
Advantages of "asynchronous methods"
• Simple extension
• Can be applied to a wide variety of algorithms
• Makes robust training of neural networks possible
• Linear speedup