GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING
M. Babaeizadeh†,‡, I. Frosio‡, S. Tyree‡, J. Clemons‡, J. Kautz‡
†University of Illinois at Urbana-Champaign, USA; ‡NVIDIA, USA
An ICLR 2017 paper and a GitHub project


  1. GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING. M. Babaeizadeh†,‡, I. Frosio‡, S. Tyree‡, J. Clemons‡, J. Kautz‡. †University of Illinois at Urbana-Champaign, USA; ‡NVIDIA, USA. An ICLR 2017 paper, a GitHub project.

  4. GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

  5. Learning to accomplish a task. [Image from www.33rdsquare.com]

  6. Definitions.
     Environment: emits an observable status S_t and a reward R_t.
     RL agent: observes the status S_t, receives the reward R_t, and emits an action a_t.
     Policy: a_t = π(S_t).
     Goal: maximize the cumulative reward ~ΣR_t.
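
A minimal sketch of this interaction loop in Python; the environment, its dynamics, and the policy rule are all illustrative stand-ins, not anything from the paper:

```python
# a toy environment and policy; names and dynamics are illustrative
class Environment:
    def reset(self):
        return 0.0                        # initial status S_0

    def step(self, action):
        next_state = 0.5 * (action + 1)   # toy transition
        reward = 1.0 if action > 0 else 0.0
        return next_state, reward

def policy(state):
    return 1 if state >= 0 else -1        # a_t = pi(S_t)

env = Environment()
state = env.reset()
cumulative_reward = 0.0
for t in range(10):
    action = policy(state)                # agent picks a_t = pi(S_t)
    state, reward = env.step(action)      # environment returns S_{t+1}, R_t
    cumulative_reward += reward           # the quantity the agent maximizes
print(cumulative_reward)
```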

  7. Definitions: in deep RL the policy π is represented by a deep neural network; the loop is unchanged, with the agent mapping the status S_t and reward R_t to an action a_t = π(S_t).

  8. Definitions: the agent plays several steps, collecting rewards R_0, R_1, R_2, R_3, R_4, and uses them to compute an update Δπ(∙) of the network; actions are still chosen as a_t = π(S_t).

  9. [Image slide]

  10. Asynchronous Advantage Actor-Critic, A3C (Mnih et al., arXiv:1602.01783, 2016): 16 agents play in parallel. Each agent observes S_t and R_t from its own environment, acts by a_t = π(S_t), collects rewards R_0 … R_4 over a few steps, and computes an update Δπ(∙). The master applies these updates to the shared model and sends the refreshed policy π′(∙) back to the agents.
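
A threading sketch of this pattern, where a plain parameter vector stands in for the deep network and a random vector stands in for the gradient computed from a few steps of play; the lock only keeps this toy numerically safe (shared-memory A3C applies updates asynchronously, Hogwild-style):

```python
import threading

import numpy as np

master_weights = np.zeros(4)              # the shared master model
lock = threading.Lock()                   # keeps this toy numerically safe

def agent(agent_id, steps=100):
    rng = np.random.default_rng(agent_id)
    local = master_weights.copy()         # local copy of pi
    for _ in range(steps):
        # stand-in for the update Delta-pi computed from a few steps of
        # play (rewards R_0 ... R_4) under the local policy
        grad = 0.01 * local + rng.normal(size=4)
        with lock:
            master_weights[:] -= 0.001 * grad   # master applies Delta-pi
            local = master_weights.copy()       # agent pulls pi'

threads = [threading.Thread(target=agent, args=(i,)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```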

  11. [Image slide]

  12. The GPU

  13. LOW OCCUPANCY (33%)

  14. HIGH OCCUPANCY (100%), achieved with a larger batch size.

  15. HIGH OCCUPANCY (100%), LOW UTILIZATION (40%). [Timeline diagram]

  16. HIGH OCCUPANCY (100%), HIGH UTILIZATION (100%). [Timeline diagram]

  17. BANDWIDTH LIMITED. [Timeline diagram]

  18. The A3C loop again: agents 1 … 16 each observe S_t, R_t, act by a_t = π(S_t), collect rewards R_0 … R_4, compute updates Δπ(∙), and receive the refreshed policy π′(∙) from the master model.

  19. MAPPING DEEP PROBLEMS TO A GPU. In regression and classification, the data and labels (pear, pear, pear, …; fig, fig, fig, …; strawberry, strawberry, …; empty, empty, …) are all available up front, so large batches deliver 100% utilization and occupancy. In reinforcement learning, the input is a stream of statuses and rewards and the output is an action, produced one step at a time: how do we reach the same 100%?
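
The contrast above is ultimately about batching. A minimal numpy sketch, timing a matrix multiply as a stand-in for a GPU forward pass (sizes and names are illustrative), shows why processing states one at a time leaves throughput on the table:

```python
import time

import numpy as np

W = np.random.rand(256, 256).astype(np.float32)        # stand-in model weights
states = np.random.rand(128, 256).astype(np.float32)   # a batch of 128 states

t0 = time.perf_counter()
for s in states:                  # one call per state: overhead dominates
    _ = s @ W
t1 = time.perf_counter()
_ = states @ W                    # one batched call: overhead amortized
t2 = time.perf_counter()
print(f"per-state: {t1 - t0:.5f}s   batched: {t2 - t1:.5f}s")
```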

  20.–23. A3C, step by step: agents 1 … 16 each observe S_t, R_t from their own environment, act by a_t = π(S_t), and collect rewards R_0 … R_4; the master holds the shared model.

  24. A3C: each agent computes an update Δπ(∙) from its collected rewards and sends it to the master.

  25. A3C: the master applies the updates to the shared model and sends the refreshed policy π′(∙) back to the agents.

  26. GA3C: GPU-based A3C. [Image: El Capitan big wall, Yosemite Valley]

  27. GA3C (INFERENCE): each agent 1 … N submits its state S_t to a shared prediction queue; predictor threads drain the queue into batches {S_t}, run a single forward pass of the master's model on the GPU, and return the resulting actions {a_t} to the agents.
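
A sketch of this prediction path using Python's standard queue module; the queue layout, batch limit, and model stub are illustrative, not the NVlabs/GA3C code:

```python
import queue

import numpy as np

NUM_AGENTS = 4
prediction_queue = queue.Queue()
result_queues = {i: queue.Queue() for i in range(NUM_AGENTS)}  # one per agent

def model_predict(states):
    # stand-in for one batched GPU forward pass pi(S_t); returns one of
    # 6 possible actions per state
    return np.argmax(np.random.rand(len(states), 6), axis=1)

def predictor(max_batch=32):
    while True:
        agent_id, state = prediction_queue.get()        # block for first item
        ids, states = [agent_id], [state]
        while len(states) < max_batch and not prediction_queue.empty():
            agent_id, state = prediction_queue.get_nowait()
            ids.append(agent_id)
            states.append(state)
        for agent_id, action in zip(ids, model_predict(states)):
            result_queues[agent_id].put(action)          # a_t back to agent
```

An agent would call prediction_queue.put((agent_id, state)) and then block on result_queues[agent_id].get() for its action.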

  28. GA3C (TRAINING): agents 1 … N push experiences {S_t, R_t} (rewards R_0 … R_4 included) into a training queue; trainer threads dequeue batches and apply updates Δπ(∙) to the master's model on the GPU.
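
A matching sketch of the training path; again the queue layout, batch size, and training stub are illustrative:

```python
import queue

import numpy as np

training_queue = queue.Queue()

def model_train(states, rewards, actions):
    # stand-in for one batched GPU update Delta-pi of the master model
    pass

def trainer(batch_size=32):
    while True:
        batch = [training_queue.get() for _ in range(batch_size)]
        states, rewards, actions = (np.asarray(x) for x in zip(*batch))
        model_train(states, rewards, actions)            # one batched update
```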

  29. GA3C: the full architecture combines both paths, the prediction queue with predictors returning {a_t} and the training queue with trainers applying Δπ(∙) to the master's model.

  30. GA3C: GPU-based A3C. [Image: El Capitan big wall, Yosemite Valley]

  31. GA3C: learn how to balance. [Image: El Capitan big wall, Yosemite Valley]

  32. GA3C: PREDICTIONS PER SECOND (PPS), the throughput of the inference path (prediction queue, predictors, forward passes of the model).

  33. GA3C: TRAININGS PER SECOND (TPS), the throughput of the training path (training queue, trainers, updates Δπ(∙) to the model).

  34. AUTOMATIC SCHEDULING: balancing the system at run time, shown on ATARI Boxing and ATARI Pong. N_P = # predictors, N_T = # trainers, N_A = # agents, TPS = trainings per second.
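
The balancing idea can be sketched as an annealing-style search: propose a change to N_P or N_T and keep it only if the measured TPS improves. The measure_tps stand-in and all constants below are illustrative, not the GA3C scheduler:

```python
import random

n_predictors, n_trainers = 2, 2

def measure_tps(n_p, n_t):
    # stand-in for counting real training updates over a time window;
    # this toy peaks when predictors and trainers are balanced
    return 100.0 - (n_p - n_t) ** 2 + random.uniform(-1, 1)

best_tps = measure_tps(n_predictors, n_trainers)
for _ in range(50):
    old = (n_predictors, n_trainers)
    if random.random() < 0.5:
        n_predictors = max(1, n_predictors + random.choice([-1, 1]))
    else:
        n_trainers = max(1, n_trainers + random.choice([-1, 1]))
    tps = measure_tps(n_predictors, n_trainers)
    if tps >= best_tps:
        best_tps = tps                      # keep the change
    else:
        n_predictors, n_trainers = old      # revert changes that hurt TPS
print(n_predictors, n_trainers)
```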

  35. THE ADVANTAGE OF SPEED: more frames = faster convergence.

  36. LARGER DNNS, for real-world applications (e.g. robotics, automotive).
     A3C (ATARI): Conv, 16 filters 8×8, stride 4; Conv, 32 filters 4×4, stride 2; FC 256.
     Others (robotics): Conv, 32 filters 8×8, stride 1, 2, 3 or 4; Conv, 32 filters 4×4, stride 2; Conv, 64 filters 4×4, stride 2; FC 256.
     T. P. Lillicrap et al., Continuous control with deep reinforcement learning, International Conference on Learning Representations, 2016.
     S. Levine et al., End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 17:1-40, 2016.
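
The two architectures on this slide, sketched with Keras. The 84×84×4 input (stacked ATARI frames) and the softmax policy head plus scalar value head are standard actor-critic assumptions, not details given on the slide:

```python
import tensorflow as tf
from tensorflow.keras import layers

def small_a3c_net(num_actions=6):
    # the standard A3C network for ATARI
    x = inputs = tf.keras.Input(shape=(84, 84, 4))     # stacked frames
    x = layers.Conv2D(16, 8, strides=4, activation="relu")(x)
    x = layers.Conv2D(32, 4, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    policy = layers.Dense(num_actions, activation="softmax")(x)  # actor head
    value = layers.Dense(1)(x)                                   # critic head
    return tf.keras.Model(inputs, [policy, value])

def large_net(first_stride=1, num_actions=6):
    # the larger, robotics-style stack; the first stride is the knob
    # varied on the next two slides (4, 3, 2, 1)
    x = inputs = tf.keras.Input(shape=(84, 84, 4))
    x = layers.Conv2D(32, 8, strides=first_stride, activation="relu")(x)
    x = layers.Conv2D(32, 4, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 4, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    policy = layers.Dense(num_actions, activation="softmax")(x)
    value = layers.Dense(1)(x)
    return tf.keras.Model(inputs, [policy, value])
```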

  37. GA3C VS. A3C*: PREDICTIONS PER SECOND (*our TensorFlow implementation of A3C on a CPU). [Log-scale chart, PPS from 1 to 10000.] GA3C speedups: small DNN, 4×; large DNN with stride 4, 11×; stride 3, 12×; stride 2, 20×; stride 1, 45×.

  38. CPU & GPU UTILIZATION IN GA3C, for larger DNNs. [Bar chart, utilization (%) from 0 to 100: CPU % and GPU % for the small DNN and for the large DNN with stride 4, 3, 2 and 1.]

  39. GA3C POLICY LAG: playing and training are asynchronous, so the experiences waiting in the training queue may have been generated by an older copy of the network (DNN A) than the one currently being trained (DNN B).
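
One way to make the lag concrete is to tag every experience with the version of the model that generated it; the lag is then the number of updates applied between generation and consumption. A hypothetical sketch (names and batch size illustrative, not the GA3C implementation):

```python
import queue

model_version = 0                     # bumped by every master-model update
training_queue = queue.Queue()

def agent_submit(state, reward, action):
    # tag each experience with the policy version that produced it (DNN A)
    training_queue.put((state, reward, action, model_version))

def trainer_step(batch_size=32):
    global model_version
    batch = [training_queue.get() for _ in range(batch_size)]
    # policy lag: updates applied between generation and training (DNN B)
    lag = max(model_version - v for (_, _, _, v) in batch)
    model_version += 1                # one update for the whole batch
    return lag
```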

  40. STABILITY AND CONVERGENCE SPEED: reducing policy lag through a minimum training batch size.

  41. GA3C (45× faster): balancing computational resources, speed and stability. [Image: El Capitan big wall, Yosemite Valley]

  42. RESOURCES.
     Theory: M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl).
     Coding: GA3C, a GPU implementation of A3C (open source at https://github.com/NVlabs/GA3C); a general architecture to generate and consume training data.

  43. QUESTIONS: ATARI 2600, policy lag, multiple GPUs, why TensorFlow, replay memory, …
     GitHub: https://github.com/NVlabs/GA3C
     ICLR 2017: Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU.

  44. BACKUP SLIDES

  45. POLICY LAG IN GA3C: a potentially large time lag between training data generation and the network update.

  46. SCORES

  47. Training a larger DNN

  48. BALANCING COMPUTATIONAL RESOURCES. The actors: CPU, PCI-e and GPU. The signals: trainings per second, prediction queue size, training queue size.
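
A sketch of how those signals might be read in practice; the queue objects, capacity, and thresholds are hypothetical, not the GA3C code:

```python
import queue

QUEUE_CAPACITY = 100
prediction_queue = queue.Queue(maxsize=QUEUE_CAPACITY)
training_queue = queue.Queue(maxsize=QUEUE_CAPACITY)

def diagnose():
    # a persistently full queue means its consumers cannot keep up
    if prediction_queue.qsize() > 0.9 * QUEUE_CAPACITY:
        return "inference-bound: add predictors or grow prediction batches"
    if training_queue.qsize() > 0.9 * QUEUE_CAPACITY:
        return "training-bound: add trainers or grow training batches"
    return "simulation-bound: the CPU agents are the bottleneck"
```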
