March 26-29, 2018 | Silicon Valley
CULE*: GPU-ACCELERATED RL*
*CULE: CUDA Learning Environment; RL: Reinforcement Learning
Steven Dalton, Iuri Frosio, Jared Hoberock, Jason Clemons
REINFORCEMENT LEARNING
A successful approach…
• Board games
• Video games
• Robotics
• Finance
• Automotive
• ML training – L2L (learn-to-learn)
• …
REINFORCEMENT LEARNING
A successful approach calling for more investigation
New RL algorithms:
• Development
• Debugging / testing
• Benchmarking
Alternative approaches:
• Evolutionary strategies
• Imitation learning
• …
REINFORCEMENT LEARNING
ALE (Arcade Learning Environment)
• Diverse set of tasks
• Established benchmark – the MNIST of RL?
CULE
CUDA Learning Environment
[Diagram: LEARNING ALGO ↔ CULE, frame production / consumption rate > 10K frames / s]
Democratize RL: more frames for less money
AGENDA
• RL training: CPU, GPU
• Limitations
• CuLE
• Performance analysis and new scenarios
RL TRAINING
The OpenAI ATARI interface: https://github.com/openai/atari-py (OpenAI Gym)
RL TRAINING
CPU-only training: DQN, A3C, …
Mnih V. et al., Human-level control through deep reinforcement learning, Nature, 2015
Mnih V. et al., Asynchronous Methods for Deep Reinforcement Learning, ICML 2016
RL TRAINING
Hybrid CPU/GPU training: DQN, GA3C, A3C, …
[Chart: predictions per second (PPS) for A3C vs. GA3C with a small DNN and large DNNs (stride 4, 3, 2, 1); GA3C speed-ups range from 4x to 45x]
Babaeizadeh M. et al., Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR, 2017
RL TRAINING
Clusters: ES (GA), A3C, A2C, IMPALA, …
[Diagram: from single-machine methods (DQN, GA3C, A3C, …) to cluster-scale methods]
Espeholt L. et al., IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018
RL TRAINING
DGX-1: policy gradient and Q-value methods, …
[Diagram: single-machine methods (DQN, GA3C, A3C, …) → cluster methods (ES (GA), A3C, A2C, IMPALA, …) → DGX-1]
Stooke A., Abbeel P., Accelerated Methods for Deep Reinforcement Learning, 2018
RL TRAINING
Limitations – single-machine methods (DQN, GA3C, A3C, …): long training TIME
RL TRAINING
Limitations – scaling out (ES (GA), A3C, A2C, IMPALA on clusters; policy gradient and Q-value methods on DGX-1, …): high cost ($$$)
Stooke A., Abbeel P., Accelerated Methods for Deep Reinforcement Learning, 2018
CULE
CUDA Learning Environment
[Diagram: LEARNING ALGO ↔ CULE, frame production / consumption rate > 10K frames / s]
Democratize RL: more frames for less money
AGENDA
• RL training: CPU, GPU
• Limitations
• CuLE
• Performance analysis and new scenarios
RL TRAINING (CPU SIMULATION)
Standard training scenario
[Diagram: CPU-side ATARI environments send states and rewards to the GPU learner; the learner sends actions (and weights) back; updates are computed on the GPU]
RL TRAINING (CPU SIMULATION)
Standard training scenario – limitations
• Limited CPU–GPU bandwidth
• Limited number of CPUs → low frames / second
RL TRAINING (CULE)
Porting ATARI to the GPU
[Diagram: environments run on the GPU; actions, states, rewards, and updates all stay in GPU memory]
RL TRAINING (GPU)
1-to-1 mapping of ALEs (ATARI simulator instances) to GPU threads
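To make the mapping concrete, here is a minimal CPU-side sketch of the idea: every index in a batch corresponds to one independent emulator instance, and in CuLE each such instance would be executed by its own CUDA thread. The toy step_one function and the 128-byte state are illustrative assumptions, not the real emulator.

    import numpy as np

    NUM_ENVS = 4096                      # in CuLE, one emulator instance per GPU thread

    # Toy per-environment state: in the real emulator this would be the full
    # ATARI machine state (RAM, registers, TIA, ...), one copy per instance.
    states = np.zeros((NUM_ENVS, 128), dtype=np.uint8)

    def step_one(state, action):
        """Placeholder for a single emulator step; returns (new_state, reward, done)."""
        state = (state + action) % 256   # stand-in for real emulation
        return state, 0.0, False

    def step_batch(states, actions):
        """Conceptual batched step: each index is one independent emulator.
        On the GPU, each iteration of this loop is executed by its own thread."""
        rewards = np.zeros(NUM_ENVS, dtype=np.float32)
        dones = np.zeros(NUM_ENVS, dtype=bool)
        for i in range(NUM_ENVS):        # this loop is the part that becomes parallel on the GPU
            states[i], rewards[i], dones[i] = step_one(states[i], actions[i])
        return states, rewards, dones

    actions = np.random.randint(0, 18, size=NUM_ENVS)   # ATARI has up to 18 actions
    states, rewards, dones = step_batch(states, actions)

The explicit Python loop is exactly what disappears on the GPU: every iteration is independent, so it maps naturally onto one thread per environment.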
AGENDA
• RL training: CPU, GPU
• Limitations
• CuLE
• Performance analysis and new scenarios
GYM COMPATIBLE (MOSTLY)

AtariPy:
    for agent in range(num_agents):
        action = actions[agent].cpu()                                    # transfer the action to the CPU
        observation, reward, done, info = envs[agent].step(action.numpy())  # execute one step
        observations[agent] = torch.as_tensor(observation).cuda()           # transfer back to the GPU
        rewards[agent] = torch.as_tensor(reward).cuda()

CuLE:
    # single parallel call to all agents; data never leaves the GPU
    observations, rewards, dones, infos = env.step(actions)
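For reference, the CPU path above corresponds to the standard Gym API; below is a sketch of both paths side by side. The Gym/atari-py calls are the real (classic, pre-0.26) interface, while the batched gpu_env.step call and its tensor layout are assumptions based on the slide.

    import gym
    import numpy as np

    # CPU path: one Gym/atari-py environment per agent, stepped serially.
    num_agents = 8
    envs = [gym.make('BreakoutNoFrameskip-v4') for _ in range(num_agents)]
    observations = [env.reset() for env in envs]

    actions = np.random.randint(0, envs[0].action_space.n, size=num_agents)
    for i, env in enumerate(envs):
        # classic 4-tuple Gym API (gym < 0.26)
        observations[i], reward, done, info = env.step(actions[i])
        if done:
            observations[i] = env.reset()

    # GPU path (CuLE, assumed interface from the slide): one call steps every agent,
    # and observations / rewards / dones stay in GPU memory.
    # observations, rewards, dones, infos = gpu_env.step(gpu_actions)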
FRAMES PER SECOND
Breakout, inference only (no training)
[Chart: frames per second and GPU occupancy for 1, 1024, 4096, and 32768 environments]
GYM COMPATIBLE (MOSTLY)

AtariPy:
    for agent in range(num_agents):
        action = actions[agent].cpu()                                    # transfer the action to the CPU
        observation, reward, done, info = envs[agent].step(action.numpy())  # execute one step
        observations[agent] = torch.as_tensor(observation).cuda()           # transfer back to the GPU
        rewards[agent] = torch.as_tensor(reward).cuda()
    train()

CuLE:
    # single parallel call to all agents
    observations, rewards, dones, infos = env.step(actions)
    train()
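A minimal sketch of how a batched, GPU-resident environment plugs into the train() step above. DummyBatchedEnv and its interface are assumptions standing in for CuLE, and the update is a bare REINFORCE-style surrogate rather than the A2C used in the experiments.

    import torch
    import torch.nn as nn

    NUM_ENVS, NUM_ACTIONS, OBS_DIM = 256, 4, 128
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    class DummyBatchedEnv:
        """Stand-in for a CuLE-style environment: step() consumes and produces GPU tensors."""
        def reset(self):
            return torch.zeros(NUM_ENVS, OBS_DIM, device=device)
        def step(self, actions):
            observations = torch.randn(NUM_ENVS, OBS_DIM, device=device)
            rewards = torch.rand(NUM_ENVS, device=device)
            dones = torch.zeros(NUM_ENVS, dtype=torch.bool, device=device)
            return observations, rewards, dones, {}

    policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS)).to(device)
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    env = DummyBatchedEnv()
    observations = env.reset()
    for step in range(10):
        logits = policy(observations)
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()                              # one action per environment, on the GPU
        observations, rewards, dones, _ = env.step(actions)  # single batched step, no CPU round-trip
        loss = -(dist.log_prob(actions) * rewards).mean()    # REINFORCE-style surrogate (no baseline)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()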
REINFORCEMENT LEARNING
Breakout – A2C (preliminary result)
[Plot: training curve]
AGENDA
• RL training: CPU, GPU
• Limitations
• CuLE
• Performance analysis and new scenarios
TRADE-OFF
Same amount of time: CuLE vs. non-CuLE
[Diagram, agents vs. frames: CuLE runs 1,000 ~ 100,000 agents and completes a few large-batch updates, while the traditional approach runs 10 ~ 100 agents and completes many small-batch updates (updates 1–6) in the same time]
Bandwidth vs. latency
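A back-of-the-envelope illustration of the trade-off; all numbers below are assumptions for illustration, not measured CuLE figures.

    # Assumed, illustrative numbers: not measurements.
    rollout_len = 5                               # steps collected per agent before an update

    cule_agents, cule_fps = 4096, 40_000          # many agents, high aggregate frame rate
    cpu_agents, cpu_fps   = 64,   10_000          # few agents, lower aggregate frame rate

    def updates_per_second(num_agents, frames_per_second, rollout_len):
        frames_per_update = num_agents * rollout_len
        return frames_per_second / frames_per_update

    print(updates_per_second(cule_agents, cule_fps, rollout_len))  # ~2 large-batch updates / s
    print(updates_per_second(cpu_agents,  cpu_fps,  rollout_len))  # ~31 small-batch updates / s

Under these assumptions CuLE gathers more experience per second (bandwidth) but performs fewer, larger updates (latency), which is exactly the question the new scenarios try to address.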
GYM COMPATIBLE (MOSTLY)

AtariPy / CuLE:
    while True:
        action = action.cpu()                                            # transfer to CPU
        observation, reward, done, info = cpu_env.step(action.numpy())   # execute on the CPU emulator
        cpu_state = cule.get_state()                                     # read the emulator state
        train()

CuLE:
    # seed GPU agents from the CPU state (cule::set_state(gpuState, cpuState))
    env.seed(cpu_state, first_agent=0, last_agent=100)
    # parallel call to all agents
    observations, rewards, dones, infos = env.step(actions)              # execute
    # …
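A self-contained toy of the seeding pattern: advance one CPU-side emulator, snapshot its state, and broadcast it to many parallel agent slots before stepping them all. The FakeEmulator and the state layout are assumptions used only to illustrate the pattern; the real path goes through cule.get_state / env.seed (cule::set_state) as on the slide.

    import numpy as np

    NUM_GPU_AGENTS = 1024
    STATE_BYTES = 128        # ATARI 2600 RAM size, used here as a stand-in for the full state

    class FakeEmulator:
        def __init__(self):
            self.state = np.zeros(STATE_BYTES, dtype=np.uint8)
        def step(self, action):
            self.state = (self.state + action).astype(np.uint8)   # stand-in for real emulation
            return self.state.copy(), 0.0, False, {}

    # 1. Advance a single CPU-side emulator along some trajectory.
    cpu_env = FakeEmulator()
    for _ in range(500):
        cpu_env.step(np.random.randint(0, 18))

    # 2. Snapshot its state and broadcast it to every parallel agent slot
    #    (in CuLE this would be the cule::set_state(gpuState, cpuState) path).
    cpu_state = cpu_env.state.copy()
    gpu_states = np.tile(cpu_state, (NUM_GPU_AGENTS, 1))

    # 3. All agents now start from the same point and explore different continuations.
    actions = np.random.randint(0, 18, size=NUM_GPU_AGENTS)
    gpu_states = (gpu_states + actions[:, None]).astype(np.uint8)  # stand-in for one batched step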
SEEDING
Same amount of time: CuLE vs. non-CuLE
[Diagram, agents vs. frames: the traditional approach (10 ~ 100 agents) produces a sequence of seed states (seeds 1–6) while CuLE (1,000 ~ 100,000 agents) is re-seeded from them and completes a large-batch update]
Bandwidth vs. latency
CONCLUSION
CuLE:
• More frames for less money (democratizing RL)
• New scenarios: how to use large batches? Seeding from the CPU, ES, …
• Soon released on https://github.com/NVlabs/
March 26-29, 2018 | Silicon Valley
THANK YOU
CULE (CUDA LEARNING ENVIRONMENT), SOON RELEASED: HTTPS://GITHUB.COM/NVLABS/
MOTIVATION
Democratizing RL research
[Diagram: Colab (Jupyter-like environment) with a K80 GPU running CuLE vs. a cluster or DGX]
ASYNCHRONOUS UPDATES
GA3C-like updates
[Diagram: environment groups finish rollouts at different times and contribute returns (R0, R1, R2, R4, …) asynchronously]
EXPERIENCE TRADE-OFF
Bandwidth vs. latency
[Diagram: low experience volume with high updates per second vs. high experience volume with low updates per second, over time]
GYM COMPATIBLE (MOSTLY)

AtariPy:
    action = action.cpu()                                        # transfer to CPU
    observation, reward, done, info = env.step(action.numpy())   # execute
    observation = observation.cuda()                             # transfer back to the GPU
    reward = reward.cuda()

CuLE:
    # note: after cule::set_state(gpuState, cpuState), the state variable contains CuLE (GPU) memory references
    observation, reward, done, info = env.step(action)
SYNCHRONOUS UPDATES
On-GPU updates
[Diagram: environment groups step in lockstep and contribute returns (R0, R1, R2, R4, …) synchronously]
RL TRAINING (GPU)
Standard training scenario
[Diagram: AtariPy vs. CuLE data flow]