

  1. Asynchronous Methods for Deep Reinforcement Learning Dominik Winkelbauer

  2. Notation: state s, action a, reward r, policy π, value v, action value q.
     Value function: V^π(s) = E[R_t | s_t = s]
     Action value function: Q^π(s, a) = E[R_t | s_t = s, a_t = a]
     Example: V^π(s) = 0.8 * 0.1 * (-1) + 0.8 * 0.9 * 2 + 0.2 * 0.5 * 0 + 0.2 * 0.5 * 1 = 1.46
     (Figure: decision tree with action probabilities 0.8 / 0.2, transition probabilities 0.1, 0.9, 0.5, 0.5 and rewards -1, 2, 0, 1.)
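
The expected-return example above can be checked with a few lines of Python; a minimal sketch, where the (policy probability, transition probability, reward) triples are read off the slide's tree:

```python
# Expected return of the start state in the slide's example tree.
# Each branch is (policy probability, transition probability, reward).
branches = [
    (0.8, 0.1, -1.0),
    (0.8, 0.9,  2.0),
    (0.2, 0.5,  0.0),
    (0.2, 0.5,  1.0),
]

v = sum(pi * p * r for pi, p, r in branches)
print(v)  # ~1.46, the value shown on the slide
```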

  3. Value function: V^π(s) = E[R_t | s_t = s]
     Action value function: Q^π(s, a) = E[R_t | s_t = s, a_t = a]
     Optimal action value function: Q*(s, a) = max_π Q^π(s, a)
     => Q*(s, a) implicitly describes an optimal policy
     (Figure: the same decision tree with the unknown values marked "?".)

  4. Value-based algorithms: try to approximate V*(s) or Q*(s, a) and thereby implicitly learn a policy. Policy-based algorithms: directly learn the policy.

  5. Q-Learning
     • Try to iteratively calculate Q*(s, a): Q(s, a) ⟵ r + γ max_{a'} Q(s', a')
     • Idea: use a neural network to approximate Q: L(θ) = E[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ))²]
     (Figure: transition from s to s'; the q-values at s' are 0.5, 1.2 and -1, so with r = 0.5 and γ = 1 the update gives Q(s, a) ⟵ 0.5 + 1.2 = 1.7.)
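
A minimal sketch of the tabular update rule from this slide, assuming a toy two-action problem; the learning rate, discount factor and state names are illustrative (the slide's version effectively uses a learning rate of 1):

```python
from collections import defaultdict

# Tabular Q-learning: Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
Q = defaultdict(float)            # all Q-values start at 0
alpha, gamma = 0.1, 0.99          # illustrative learning rate and discount factor
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# With the slide's numbers (r = 0.5, best successor value 1.2, gamma close to 1)
# the TD target is roughly 0.5 + 1.2 = 1.7.
Q[("s'", "left")], Q[("s'", "right")] = 0.5, 1.2
q_update("s", "left", 0.5, "s'")
```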

  6. How to traverse the environment
     • We follow an ε-greedy policy with ε ∈ [0, 1]
     • In every state:
       • Sample a random number p ∈ [0, 1]
       • If p > ε => choose the action with the maximum q-value
       • else => choose a random action
     • Exploration vs. exploitation
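
A minimal sketch of the ε-greedy rule described above, using a list of q-values for the current state (the values and ε are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])   # exploit

action = epsilon_greedy([0.5, 1.2, -1.0], epsilon=0.1)   # usually picks index 1
```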

  7. Q-Learning with Neural Networks
     Use the network to traverse the environment; train the network with the generated data.
     => The data is non-stationary => training with a NN is unstable.
     (Diagram: agent interacting with a neural network approximating Q*(s, a).)

  8. Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Graves, Alex; Antonoglou, Ioannis; Wierstra, Daan; Riedmiller, Martin: Playing Atari with Deep Reinforcement Learning.
     Use the network to traverse the environment; store new data in a replay memory; train the network with data sampled randomly from the memory.
     => The data is stationary => training with a NN is stable.
     (Diagram: agent, neural network approximating Q*(s, a), replay memory.)
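
A minimal sketch of such a replay memory: a fixed-size buffer that stores transitions and returns uniformly random minibatches (capacity and batch size are illustrative; this is not the authors' code):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of (s, a, r, s_next, done) transitions, sampled uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory()
memory.store(("s", "left", 0.5, "s'", False))
```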

  9. On-policy vs. off-policy
     • On-policy: the data used to train our policy has to be generated using the exact same policy. => Example: REINFORCE
     • Off-policy: the data used to train our policy can also be generated using another policy. => Example: Q-Learning

  10. Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. (ICML 2016): Asynchronous Methods for Deep Reinforcement Learning.
     • An alternative method to make RL work better together with neural networks.
     Traditional way (one agent): data generation -> gradient computation -> weight update, repeated sequentially.
     Asynchronous way: agents #1-#4 each generate data and compute gradients in parallel; weight updates are applied asynchronously to the shared parameters.
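
A toy illustration of the asynchronous pattern with Python threads: several workers share one parameter and each applies its own gradient as soon as it is computed, without waiting for the others. The quadratic toy objective is made up; the real method updates neural-network weights and, in A3C, does so lock-free:

```python
import random
import threading

lock = threading.Lock()
shared_w = [0.0]        # one shared parameter, standing in for the network weights
target = 3.0            # toy objective: minimise 0.5 * (w - target)^2

def worker(steps=1000):
    for _ in range(steps):
        noise = random.gauss(0.0, 0.1)          # stands in for each worker's own data generation
        grad = shared_w[0] - (target + noise)   # gradient of the noisy toy objective
        with lock:                              # apply the update without waiting for other workers
            shared_w[0] -= 0.01 * grad

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_w[0])      # close to 3.0
```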

  11. Asynchronous Q-Learning
     • Combine this idea with Q-Learning
     • The generated data is stationary => training is stable => no replay memory necessary => the data can be used directly while training remains stable

  12. Value-based algorithms: try to approximate V*(s) or Q*(s, a) and thereby implicitly learn a policy. Policy-based algorithms: directly learn the policy.

  13. REINFORCE: ∇_θ log π(a_t | s_t, θ) R_t
     Sample trajectories and reinforce actions which lead to high rewards.
     (Figure: decision tree with action probabilities 0.8 / 0.2, transition probabilities and rewards -1, 2, 0, 1.)
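
A minimal sketch of the REINFORCE gradient for a categorical policy whose parameters are the action logits of a single state; the numbers are illustrative:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_gradient(logits, action, ret):
    """grad_theta log pi(action) * return, where pi = softmax(logits):
    the gradient of log softmax w.r.t. logit i is (1 if i == action else 0) - pi_i."""
    pi = softmax(logits)
    return [((1.0 if i == action else 0.0) - p) * ret for i, p in enumerate(pi)]

# An action that led to a high return gets its log-probability pushed up,
# the probabilities of the other actions are pushed down.
print(reinforce_gradient([0.0, 0.0, 0.0], action=1, ret=2.0))
```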

  14.-15. (Slides 14 and 15 repeat the REINFORCE figure from slide 13.)

  16. REINFORCE: ∇_θ log π(a_t | s_t, θ) R_t
     Problem: high variance.
     Subtract a baseline: ∇_θ log π(a_t | s_t, θ) (R_t − b_t(s_t))
     (Figure: a policy assigning probability 0.33 to each of three actions; returns of 500, 501 and 499 combined with log-probability gradients 0.9, 0.2 and -0.3 give parameter updates of 450, 100.2 and -149.7.)

  17. REINFORCE: ∇_θ log π(a_t | s_t, θ) R_t
     Problem: high variance.
     Subtract a baseline: ∇_θ log π(a_t | s_t, θ) (R_t − b_t(s_t))
     Use the value function as baseline: ∇_θ log π(a_t | s_t, θ) (R_t − V(s_t; θ_v))
     (R_t − V(s_t; θ_v)) can be seen as an estimate of the advantage: A(a_t, s_t) = Q(a_t, s_t) − V(s_t)
     Actor: policy network. Critic: value network.
     (Figure: with a baseline of 500, the same returns 500, 501, 499 give advantages 0, 1, -1 and parameter updates 0, 0.2 and 0.3.)
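
Reusing the slide's numbers, subtracting the value-function baseline turns returns of about 500 into advantages of about ±1, which is what shrinks the gradient variance:

```python
returns = [500.0, 501.0, 499.0]   # returns R_t from the slide's example
baseline = 500.0                  # V(s_t; theta_v); here a constant for illustration

advantages = [r - baseline for r in returns]
print(advantages)                 # [0.0, 1.0, -1.0] -- the terms multiplying the
                                  # log-probability gradients drop from ~500 to ~1
```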

  18. Update interval: REINFORCE updates only after a complete episode, using the full return; actor-critic with advantage can update every few steps, bootstrapping the rest of the return from the value estimate.
     (Figure: example reward sequences 0, 0.1, 0, 1 and 0.6, 0.1, 0.)
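
A minimal sketch of the N-step return used when updating every few steps: the rewards of a short segment are discounted and the tail of the episode is summarised by the critic's value estimate (the reward numbers and γ are illustrative):

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns for a segment of N steps; the part of the episode after the
    segment is approximated by the value estimate of the last observed state."""
    returns = []
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    return list(reversed(returns))

# A 3-step segment; the rest of the episode is summarised by V(s_{t+3}) = 0.6.
print(n_step_returns([0.0, 0.1, 0.0], bootstrap_value=0.6))
```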

  19. Asynchronous advantage actor-critic (A3C)
     • Update local parameters from the global shared parameters
     • Explore the environment according to policy π(a_t | s_t; θ) for N steps
     • Compute gradients for every visited state:
       • Policy network: ∇_θ log π(a_t | s_t, θ) (R_t − V(s_t; θ_v))
       • Value network: ∇_{θ_v} (R_t − V(s_t; θ_v))²
     • Update the global shared parameters with the computed gradients
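
A schematic sketch of one A3C worker following the steps above. The environment, the two networks and the gradient application are passed in as hypothetical callables (env_reset, env_step, policy, value, apply_gradients), so only the control flow is spelled out; this is not the authors' implementation:

```python
GAMMA, N_STEPS = 0.99, 5   # illustrative discount factor and update interval

def a3c_worker(global_params, env_reset, env_step, policy, value, apply_gradients, rounds=100):
    s = env_reset()
    for _ in range(rounds):
        local_params = dict(global_params)              # 1. sync local copy with the global parameters
        segment, done = [], False
        for _ in range(N_STEPS):                        # 2. act for N steps with pi(a | s; theta)
            a = policy(local_params, s)
            s_next, r, done = env_step(s, a)
            segment.append((s, a, r))
            s = env_reset() if done else s_next
            if done:
                break
        ret = 0.0 if done else value(local_params, s)   # bootstrap with V(s) unless terminal
        grads = []
        for s_i, a_i, r_i in reversed(segment):         # 3. gradients for every visited state
            ret = r_i + GAMMA * ret
            adv = ret - value(local_params, s_i)
            grads.append(("policy", s_i, a_i, adv))     # for grad log pi(a_i | s_i) * advantage
            grads.append(("value", s_i, adv))           # for the gradient of (R - V(s_i))^2
        apply_gradients(global_params, grads)           # 4. update the global shared parameters
```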

  20. Disadvantage of A3C
     (Diagram: global network with agents #1 and #2; each agent alternates between performing steps and computing gradients on its own schedule, so gradients reach the global network at different times and are computed from different, possibly outdated versions of the parameters.)

  21. Synchronous version of A3C => A2C
     (Diagram: global network with agents #1 and #2; all agents perform their steps in parallel, then the gradients are computed and applied together before the next round starts.)
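
For contrast, a toy illustration of the synchronous pattern: in every round all workers compute a gradient for the same parameter snapshot and a single averaged update is applied (same made-up quadratic objective as in the asynchronous sketch above):

```python
import random

def a2c_round(w, n_workers=4, lr=0.01, target=3.0):
    """One synchronous round: every worker sees the same w, the gradients are averaged."""
    grads = [w - (target + random.gauss(0.0, 0.1)) for _ in range(n_workers)]
    return w - lr * sum(grads) / len(grads)

w = 0.0
for _ in range(1000):
    w = a2c_round(w)
print(w)   # close to 3.0
```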

  22. Advantages of "asynchronous methods"
     • Simple extension
     • Can be applied to a wide variety of algorithms
     • Makes robust NN training possible
     • Linear speedup

  23. (Slide 22 repeated, together with a plot of consumed data.)
