GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING
M. Babaeizadeh†,‡, I. Frosio‡, S. Tyree‡, J. Clemons‡, J. Kautz‡
† University of Illinois at Urbana-Champaign, USA
‡ NVIDIA, USA
An ICLR 2017 paper and a GitHub project
Learning to accomplish a task
(Image from www.33rdsquare.com)
Definitions
- Environment: at each step it exposes an observable state S_t and a reward R_t.
- RL agent: observes S_t and R_t and emits an action a_t.
- Policy: a_t = π(S_t), the rule the agent uses to pick actions, with the goal of maximizing the cumulative reward ΣR_t.
Definitions
- Deep RL agent: the policy a_t = π(S_t) is represented by a deep neural network that maps the state S_t directly to an action.
Definitions
- Training: the agent plays for a few steps, accumulates the rewards R_0, R_1, ..., R_4, and uses them to compute an update Δπ(∙) to the network.
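As an illustration of the loop in these definition slides, here is a minimal Python sketch of the agent-environment interaction. The names are placeholders, not part of GA3C: `env` is assumed to expose a Gym-style reset/step API, and `policy` stands for any mapping from S_t to a_t (for example the deep network above).

```python
def run_episode(env, policy, max_steps=1000):
    """Play one episode and collect the rewards R_0, R_1, ... earned along the way."""
    state = env.reset()                  # initial observable state S_0
    rewards = []
    for _ in range(max_steps):
        action = policy(state)           # a_t = pi(S_t)
        state, reward, done, info = env.step(action)  # environment returns S_{t+1}, R_t
        rewards.append(reward)
        if done:                         # episode finished (e.g. game over)
            break
    return rewards
```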
Asynchronous Advantage Actor-Critic (A3C)
(Mnih et al., arXiv:1602.01783v2, 2016)
- 16 agents run in parallel, each in its own copy of the environment: every agent observes S_t and R_t and selects a_t = π(S_t).
- Over short rollouts each agent accumulates rewards R_0 ... R_4 and computes a gradient Δπ(∙).
- The gradients are sent asynchronously to a master, which updates the shared model to π'(∙).
The GPU
LOW OCCUPANCY (33%)
A small workload keeps only a fraction of the GPU busy.

HIGH OCCUPANCY (100%)
Occupancy increases with the batch size.

HIGH OCCUPANCY (100%), LOW UTILIZATION (40%)
Even at full occupancy, the GPU can sit idle for much of the time.

HIGH OCCUPANCY (100%), HIGH UTILIZATION (100%)

BANDWIDTH LIMITED
Throughput can also be bound by data-transfer bandwidth rather than by compute.
A3C + a GPU?
[Diagram: the A3C architecture again (16 agents, master model, gradients Δπ(∙), updated policy π'(∙)), posing the question of how to map it onto a GPU.]
MAPPING DEEP PROBLEMS TO A GPU
- Regression / classification: large batches of data and labels ("pear, pear, pear, ...", "fig, fig, fig, ...", "strawberry, strawberry, ...") are available up front, so the GPU can be driven at 100% utilization / occupancy.
- Reinforcement learning: the data are states and rewards, the "labels" are actions, and both are generated one step at a time - how do we keep the GPU busy?
A3C
- Each of the 16 agents interacts with its own environment on the CPU: it receives S_t and R_t and picks a_t = π(S_t).
- After a short rollout (rewards R_0 ... R_4), the agent computes a gradient Δπ(∙).
- The gradient is sent asynchronously to the master, which updates the shared model to π'(∙) and makes it available to the agents again.
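To make this data flow concrete, here is a rough sketch of a single A3C worker. It is not the authors' code: `local_model.act/value/compute_gradients`, `master.get_weights/apply_gradients`, and the Gym-style `env` are hypothetical helpers standing in for the real network and shared parameter server.

```python
import numpy as np

GAMMA = 0.99   # discount factor
T_MAX = 5      # rollout length (the R_0 ... R_4 window in the diagram)

def a3c_worker(env, local_model, master):
    state = env.reset()
    while True:
        local_model.set_weights(master.get_weights())      # sync local copy with the master
        states, actions, rewards = [], [], []
        done = False
        for _ in range(T_MAX):                             # play a short rollout
            action = local_model.act(state)                # a_t = pi(S_t)
            next_state, reward, done, _ = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = env.reset() if done else next_state
            if done:
                break
        # n-step discounted returns, bootstrapped with the critic's value estimate
        R = 0.0 if done else local_model.value(state)
        returns = []
        for r in reversed(rewards):
            R = r + GAMMA * R
            returns.append(R)
        returns.reverse()
        grads = local_model.compute_gradients(states, actions, np.asarray(returns))
        master.apply_gradients(grads)                      # asynchronous update of pi'(.)
```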
GA3C
GPU-based A3C
(Image: El Capitan big wall, Yosemite Valley)
GA3C (INFERENCE)
- Agents no longer hold their own copy of the model. To act, each agent pushes its state S_t into a shared prediction queue.
- Predictor threads dequeue batches of states {S_t}, run a single forward pass of the model on the GPU, and return the actions {a_t} to the corresponding agents.
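A minimal sketch of a predictor thread, only to illustrate the batching idea; `model.predict_batch` and the per-agent `wait_queues` are assumptions, and the open-source implementation differs in detail.

```python
import queue

def predictor_loop(prediction_queue, wait_queues, model, max_batch=128):
    """Batch pending (agent_id, S_t) requests and answer them with one GPU call."""
    while True:
        ids, states = [], []
        agent_id, state = prediction_queue.get()        # block until at least one request
        ids.append(agent_id)
        states.append(state)
        while len(states) < max_batch:                  # drain further requests without blocking
            try:
                agent_id, state = prediction_queue.get_nowait()
            except queue.Empty:
                break
            ids.append(agent_id)
            states.append(state)
        actions = model.predict_batch(states)           # one batched forward pass on the GPU
        for agent_id, action in zip(ids, actions):
            wait_queues[agent_id].put(action)           # return a_t to the agent that asked
```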
GA3C (TRAINING)
- Each agent pushes its experiences {S_t, R_t}, accumulated over short rollouts (R_0 ... R_4), into a shared training queue.
- Trainer threads dequeue batches of experiences, compute the update Δπ(∙), and apply it to the model on the GPU.
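And a matching sketch of a trainer thread; `model.train_batch` is again an assumed helper wrapping one gradient step on the GPU, and experiences are taken to be (S_t, a_t, R_t) tuples.

```python
def trainer_loop(training_queue, model, batch_size=32):
    """Gather a batch of experiences and apply one update to the shared model."""
    while True:
        batch = [training_queue.get() for _ in range(batch_size)]  # blocking gather
        states, actions, returns = zip(*batch)                     # unpack (S_t, a_t, R_t) tuples
        model.train_batch(states, actions, returns)                # one batched GPU update
```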
GA3C
- The full architecture combines both paths: agents feed the prediction queue (served by predictors) and the training queue (served by trainers), while a single copy of the model lives on the GPU.
GA3C
GPU-based A3C: learning how to balance the system
(Image: El Capitan big wall, Yosemite Valley)
GA3C: PREDICTIONS PER SECOND (PPS)
- PPS measures how many actions per second the predictors deliver to the agents, i.e. how fast the system plays.
GA3C: TRAININGS PER SECOND (TPS)
- TPS measures how many training batches per second the trainers push through the model, i.e. how fast the system learns.
AUTOMATIC SCHEDULING
Balancing the system at run time (ATARI Boxing, ATARI Pong)
N_P = # predictors, N_T = # trainers, N_A = # agents, TPS = trainings per second
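One way to read "balancing the system at run time" in code: periodically perturb one of N_P, N_T, N_A and keep the change only if the measured TPS improves. This is a sketch of that idea, not the paper's exact annealing procedure; `system.set_counts` and `measure_tps` are hypothetical hooks standing in for the real monitoring and threading code.

```python
import random

def auto_schedule(system, measure_tps, steps=100):
    counts = {"N_P": 2, "N_T": 2, "N_A": 16}      # initial numbers of predictors, trainers, agents
    best_tps = measure_tps()                      # e.g. TPS averaged over a few minutes
    for _ in range(steps):
        key = random.choice(list(counts))         # pick one of N_P, N_T, N_A
        delta = random.choice([-1, +1])
        if counts[key] + delta < 1:               # always keep at least one of each
            continue
        counts[key] += delta
        system.set_counts(**counts)               # spawn or retire threads accordingly
        tps = measure_tps()
        if tps >= best_tps:
            best_tps = tps                        # keep the change
        else:
            counts[key] -= delta                  # revert changes that hurt throughput
            system.set_counts(**counts)
    return counts
```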
THE ADVANTAGE OF SPEED
Processing more frames per second means faster convergence in wall-clock time.
LARGER DNNS
For real-world applications (e.g. robotics, automotive)

A3C (ATARI): Conv 16 8x8 filters, stride 4 -> Conv 32 4x4 filters, stride 2 -> FC 256
Others (robotics-scale): Conv 32 8x8 filters, stride 1/2/3/4 -> Conv 32 4x4 filters, stride 2 -> Conv 64 4x4 filters, stride 2 -> FC 256

T. P. Lillicrap et al., Continuous control with deep reinforcement learning, International Conference on Learning Representations, 2016.
S. Levine et al., End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 17:1-40, 2016.
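As a concrete rendering of the two architectures compared above, here is a Keras sketch. The input shape, the number of actions, and the single output head are placeholders (the actual A3C model has separate policy and value heads); only the conv/FC layout follows the slide.

```python
import tensorflow as tf
from tensorflow.keras import layers

def small_atari_net(num_actions, input_shape=(84, 84, 4)):
    # A3C (ATARI): Conv 16 8x8 /4 -> Conv 32 4x4 /2 -> FC 256
    return tf.keras.Sequential([
        layers.Conv2D(16, 8, strides=4, activation="relu", input_shape=input_shape),
        layers.Conv2D(32, 4, strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(num_actions),
    ])

def large_net(num_actions, first_stride=4, input_shape=(84, 84, 4)):
    # Robotics-scale net: Conv 32 8x8 (stride 1..4) -> Conv 32 4x4 /2 -> Conv 64 4x4 /2 -> FC 256
    return tf.keras.Sequential([
        layers.Conv2D(32, 8, strides=first_stride, activation="relu", input_shape=input_shape),
        layers.Conv2D(32, 4, strides=2, activation="relu"),
        layers.Conv2D(64, 4, strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(num_actions),
    ])
```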
GA3C VS. A3C*: PREDICTIONS PER SECOND
* Our TensorFlow implementation of A3C running on a CPU

GA3C speed-up in PPS over the CPU A3C (log-scale chart):
- Small DNN: 4x
- Large DNN, stride 4: 11x
- Large DNN, stride 3: 12x
- Large DNN, stride 2: 20x
- Large DNN, stride 1: 45x
CPU & GPU UTILIZATION IN GA3C
For larger DNNs
[Bar chart: CPU % and GPU % utilization for the small DNN and the large DNN at strides 4, 3, 2, and 1.]
GA3C POLICY LAG
Asynchronous playing and training
- Because playing and training run asynchronously, the experiences waiting in the queues may have been generated by an older version of the network (DNN A) than the one the trainers are currently updating (DNN B).
STABILITY AND CONVERGENCE SPEED
Reducing policy lag through a minimum training batch size
GA3C (45x faster)
GPU-based A3C: balancing computational resources, speed, and stability.
(Image: El Capitan big wall, Yosemite Valley)
RESOURCES

Theory: M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl).

Coding: GA3C, a GPU implementation of A3C (open source at https://github.com/NVlabs/GA3C); a general architecture to generate and consume training data.
QUESTIONS
Topics: ATARI 2600, policy lag, multiple GPUs, why TensorFlow, replay memory, ...
GitHub: https://github.com/NVlabs/GA3C
ICLR 2017: Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU
BACKUP SLIDES
POLICY LAG IN GA3C
Potentially large time lag between training data generation and network update
SCORES
[Table/plot of game scores; the values are not recoverable from the extracted text.]
Training a larger DNN
BALANCING COMPUTATIONAL RESOURCES
The actors: CPU, PCI-E & GPU
[Plots: trainings per second, prediction queue size, and training queue size over time.]