Distributed Meta Optimization of Reinforcement Learning Agents - PowerPoint PPT Presentation



  1. Distributed Meta Optimization of Reinforcement Learning Agents Greg Heinrich, Iuri Frosio - GTC San Jose, March 2019

  2. AGENDA Contents: Introduction to Reinforcement Learning; Introduction to Metaoptimization (on distributed systems) / Maglev; Metaoptimization and Reinforcement Learning (on distributed systems); HyperTrick; Results; Conclusion 2

  3. GPU-Based A3C for Deep Reinforcement Learning (RL) keywords: GPU, A3C, RL M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl). Open source implementation: https://github.com/NVlabs/GA3C. 3

  4. GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Learning to accomplish a task Image from www.33rdsquare.com 4

  5. GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Definitions ✓ Environment ✓ Agent ✓ Observable status S_t ✓ Reward R_t ✓ Action a_t ✓ Policy a_t = π(S_t) [diagram: environment ↔ RL agent loop with states S_t, rewards R_t, ~R_t, and actions a_t = π(S_t)] 5

  6. GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Definitions [same diagram, with the agent box now labeled Deep RL agent: S_t, R_t in, a_t = π(S_t) out] 6

  7. GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Definitions [same diagram: the agent plays with a_t = π(S_t), collects rewards R_0 R_1 R_2 R_3 R_4, and computes a policy update Δπ(∙)] 7

  8. GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Objective: maximize the expected discounted reward. The value of a state is the expected discounted reward collected from that state onward. The role of 𝛿: a short- or far-sighted agent, with 0 < 𝛿 < 1, usually 0.99 8
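
The equations behind these definitions did not survive the slide export. Assuming the ~R_t that appears on the preceding diagrams denotes the discounted reward, the standard definitions the slide refers to, written with the deck's discount factor 𝛿, are:

  \tilde{R}_t = \sum_{k=0}^{\infty} \delta^{k} R_{t+k}, \qquad 0 < \delta < 1
  V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \tilde{R}_t \mid S_t = s \right]

A small 𝛿 makes rewards far in the future count for little (a short-sighted agent); a 𝛿 close to 1, such as the usual 0.99, makes the agent far-sighted.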

  9. GPU-Based A3C for Deep Reinforcement Learning (RL) keywords: GPU, A3C, RL M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl). Open source implementation: https://github.com/NVlabs/GA3C. 9

  10. GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Asynchronous Advantage Actor-Critic (Mnih et al., arXiv:1602.01783v2, 2015) [diagram: Agents 1 ... 16 each receive S_t, R_t, act with a_t = π(S_t), collect rewards R_0 ... R_4, and send policy updates Δπ(∙) to a Master holding the model, which returns the updated policy π'(∙)] 10

  11. GPU-Based A3C for Deep Reinforcement Learning (RL) keywords: GPU, A3C, RL M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl). Open source implementation: https://github.com/NVlabs/GA3C. 11

  12. MAPPING DEEP PROBLEMS TO A GPU REGRESSION, CLASSIFICATION, REINFORCEMENT LEARNING … [figure: in regression and classification, data and labels (pear, fig, strawberry, empty, …) arrive in large batches that can keep the GPU at 100% utilization / occupancy; in reinforcement learning, status, reward, and action arrive one step at a time] 12

  13. A3C [diagram: Agents 1 ... 16 each receive S_t, R_t from their environment and act with a_t = π(S_t), collecting rewards R_0 ... R_4 over a window of t_max steps; the Master holds the model] 13

  14. A3C [same diagram as the previous slide] 14

  15. A3C [same diagram: after t_max steps each agent computes a policy update Δπ(∙) and sends it to the Master] 15

  16. A3C [same diagram: the Master applies the updates Δπ(∙) and sends the updated policy π'(∙) back to the agents] 16

  17. GA3C (INFERENCE) [diagram: Agents 1 ... N push states {S_t} onto a prediction queue; predictor threads batch them through the Master's model and return the actions {a_t} to the agents] 17

  18. GA3C (TRAINING) [diagram: Agents 1 ... N push experiences {S_t, R_t} (rewards R_0 ... R_4) onto a training queue; trainer threads batch them and apply the updates Δπ(∙) to the Master's model] 18

  19. GA3C [diagram combining both paths: a prediction queue feeding predictor threads that return actions {a_t} for states {S_t}, and a training queue feeding trainer threads that apply updates Δπ(∙) to the Master's model] 19
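
Read as a producer/consumer system, the GA3C diagram maps onto two queues and a few threads. The sketch below is only a minimal single-process illustration of that layout with stubbed-out model calls; it is not the actual GA3C code, which lives at https://github.com/NVlabs/GA3C and runs a TensorFlow model on the GPU. Agents push states onto a prediction queue and finished t_max segments onto a training queue; predictor threads batch states through the model and return actions; trainer threads consume experience batches.

  import queue
  import random
  import threading

  N_AGENTS = 8
  PREDICTION_BATCH = 4
  T_MAX = 5

  prediction_queue = queue.Queue()   # agents -> predictor: (agent_id, state)
  training_queue = queue.Queue()     # agents -> trainer: (states, actions, rewards)
  action_queues = [queue.Queue() for _ in range(N_AGENTS)]  # predictor -> each agent

  def predict(states):
      # Stand-in for a batched forward pass of the policy network on the GPU.
      return [random.randrange(4) for _ in states]

  def train(batch):
      # Stand-in for one gradient update of the shared model on the GPU.
      pass

  def predictor():
      while True:
          batch = [prediction_queue.get()]
          while len(batch) < PREDICTION_BATCH and not prediction_queue.empty():
              batch.append(prediction_queue.get())
          for (agent_id, _), action in zip(batch, predict([s for _, s in batch])):
              action_queues[agent_id].put(action)

  def trainer():
      while True:
          train(training_queue.get())

  def agent(agent_id, episodes=3):
      for _ in range(episodes):
          states, actions, rewards = [], [], []
          state = 0.0
          for _ in range(T_MAX):
              prediction_queue.put((agent_id, state))
              action = action_queues[agent_id].get()  # wait for the predictor's answer
              reward = random.random()                # toy environment step
              states.append(state)
              actions.append(action)
              rewards.append(reward)
              state += 1.0
          training_queue.put((states, actions, rewards))

  threading.Thread(target=predictor, daemon=True).start()
  threading.Thread(target=trainer, daemon=True).start()
  agent_threads = [threading.Thread(target=agent, args=(i,)) for i in range(N_AGENTS)]
  for t in agent_threads:
      t.start()
  for t in agent_threads:
      t.join()
  print("all agent segments submitted")

Batching in the predictor is what lets a single GPU serve many agents at once, which is the point of moving from A3C to GA3C.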

  20. CPU & GPU UTILIZATION IN GA3C GPU for inference / training, CPU for environment simulation; for larger DNNs the system is bandwidth limited and does not scale to multiple GPUs! 20

  21. Role of t_max t_max = 4 [play to the end, Monte Carlo; example rewards 1, 4, -2, -6] No variance (collected rewards are real), high bias (we played only once), one update every t_max frames 21

  22. Role of t_max t_max = 2 [example: rewards 1 and 4 are collected, then the value network estimates the remaining return from that state at approximately 2.5] High variance (noisy value network), low bias (unbiased net, many agents), more updates per second. t_max affects bias, variance, and computational cost (number of updates per second, batch size) 22
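
As a concrete illustration of how a t_max segment turns into update targets, the small sketch below computes the discounted n-step returns, bootstrapping with the value network's estimate of the state reached after t_max steps; the discount is written delta to match the deck, and the numbers mirror the slide's example (rewards 1 and 4, then a value estimate of approximately 2.5).

  def n_step_returns(rewards, bootstrap_value, delta=0.99):
      """Discounted return target for each step of a t_max segment,
      bootstrapped with the value network's estimate of the final state."""
      returns = []
      running = bootstrap_value
      for r in reversed(rewards):
          running = r + delta * running
          returns.append(running)
      return list(reversed(returns))

  # t_max = 2 segment: rewards [1, 4], value network estimates ~2.5 afterwards.
  print(n_step_returns([1.0, 4.0], bootstrap_value=2.5))  # [7.41025, 6.475]

With larger t_max more of the target comes from real rewards (no variance from the value network, but higher bias from playing only once); with smaller t_max the value network contributes more (higher variance, lower bias) and updates happen more often.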

  23. Other parameters and stability Hyperparameter search in 2015: the search for the optimal learning rate, https://arxiv.org/pdf/1602.01783.pdf 23

  24. GA3C on distributed systems ● RL is unstable: metaoptimization for optimal hyperparameter search ● E.g. the learning rate may affect the stability and speed of convergence ● The discount factor 𝛿 affects the final aim (short- or far-sighted agent) ● The t_max factor affects the computational cost and stability* of GA3C ● GA3C does not scale to multiple GPUs (bandwidth limited), but we can run parallel instances of GA3C on a distributed system * See G. Heinrich, I. Frosio, Metaoptimization on a Distributed System for Deep Reinforcement Learning, https://arxiv.org/abs/1902.02725. 24

  25. AGENDA Contents: Introduction to Reinforcement Learning; Introduction to Metaoptimization (on distributed systems) / Maglev; Metaoptimization and Reinforcement Learning (on distributed systems); HyperTrick; Results; Conclusion 25

  26. META OPTIMIZATION It is as easy as flying a Concorde. [image: GA3C Agent / Concorde, source: Christian Kath] • Topology parameters ▪ Number of layers and their width ▪ Choice of activations • Training parameters ▪ Learning rate ▪ Reward decay rate (𝛅) ▪ Back-propagation window size (tmax) ▪ Choice of optimizer ▪ Number of training episodes • Data parameters ▪ Environment model. Exhaustive search is intractable. 26

  27. META OPTIMIZATION How does a standard optimization algorithm fare? Example: Tree of Parzen Estimators • Two parameters, one metric to minimize. • Optimization trade-offs: ▪ Exploitation vs. exploration. ▪ Wall time vs. resource efficiency. • Optimization packages start with a random search. • Tens of experiments are needed before historical records can be leveraged. Warm starts are needed to cut down complexity over time. 27
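
The deck does not say which optimization package it uses for this example; the sketch below uses the open-source hyperopt library, whose tpe.suggest algorithm is one widely used implementation of the Tree of Parzen Estimators, on a toy objective with two parameters and one metric to minimize.

  from hyperopt import Trials, fmin, hp, tpe

  def objective(params):
      # Stand-in metric to minimize; in the real setting this would launch a
      # full training run and return its (negated) score.
      x, y = params["x"], params["y"]
      return (x - 0.3) ** 2 + (y - 0.7) ** 2

  space = {
      "x": hp.uniform("x", 0.0, 1.0),
      "y": hp.uniform("y", 0.0, 1.0),
  }

  trials = Trials()
  # TPE begins with a random-search phase and only exploits the history of past
  # trials after a while, which is why tens of experiments are typically needed.
  best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
  print(best)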

  28. META OPTIMIZATION The Need for Diversity Metric Variance • Non-determinism makes individual experiments inconclusive. • A change can only be considered an improvement if it works under a variety of conditions. Meta Optimization should be part of data scientists’ daily routine. 28

  29. META OPTIMIZATION The Complexity of Evaluating Models Complex Pipelines • Evaluation cannot be reduced to a single Python function or Docker container. Meta Optimization must be independent of task scheduling. 29

  30. META OPTIMIZATION Project MagLev: Machine Learning Platform Architecture • Scalable Platform for Traceable Machine Learning Workflows • Self-Documented Experiments • Services can be used in isolation or combined for maximum traceability. 30

  31. META OPTIMIZATION MagLev Experiment Tracking [diagram: a knowledge base linking ExperimentSet, ParamSet, Job, Workflow, Metric, Model, and Dataset records via ids (exp_id, param_set_id, workflow_id, job_id, model_id, dataset_id) together with fields such as project, creator, and creation date] • Experiment data is fully connected. • Objects are searched through their relationships with others. • No information silo. 31

  32. META OPTIMIZATION Typical Setup Main SDK Features • All common parameter types. • Early-termination methods. • Standard + custom parameter picking methods. 32
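
The slide only names the feature categories. As an illustration of what an early-termination method does, here is a generic median-style stopping rule in Python; this is not MagLev's actual implementation, whose API the deck does not show.

  def should_stop(partial_curve, completed_curves, min_steps=5):
      """Stop a run whose metric so far is below the median of completed runs
      at the same step (assumes higher is better, e.g. episode reward)."""
      step = len(partial_curve)
      if step < min_steps:
          return False
      peers = sorted(c[step - 1] for c in completed_curves if len(c) >= step)
      if not peers:
          return False
      median = peers[len(peers) // 2]
      return partial_curve[-1] < median

  # Example: a run scoring 3.0 at step 5, while completed runs scored 5 and 6 there.
  done = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7]]
  print(should_stop([0.5, 1.0, 1.5, 2.0, 3.0], done))  # True: below the median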

  33. AGENDA Contents: Introduction to Reinforcement Learning; Introduction to Metaoptimization (on distributed systems) / Maglev; Metaoptimization and Reinforcement Learning (on distributed systems); HyperTrick; Results; Conclusion 33

  34. META OPTIMIZATION + GA3C Hyperparameters and preview of results. • Learning rate: log uniform distribution over the [1e-5, 1e-2] interval. • tmax: quantized (q=1) log uniform distribution over the [2, 100] interval. • 𝜹: one of {0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999} 34
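
Written as a search space for a TPE-style optimizer (the encoding below uses hyperopt's hp module as an assumption about tooling; only the ranges come from the deck), the three distributions look like this:

  import math
  from hyperopt import hp

  space = {
      # log-uniform over [1e-5, 1e-2]
      "learning_rate": hp.loguniform("learning_rate", math.log(1e-5), math.log(1e-2)),
      # quantized (q=1) log-uniform over [2, 100]
      "tmax": hp.qloguniform("tmax", math.log(2), math.log(100), 1),
      # discount factor: one of a fixed set of values
      "delta": hp.choice("delta", [0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999]),
  }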

  35. AGENDA Contents: Introduction to Reinforcement Learning; Introduction to Metaoptimization (on distributed systems) / Maglev; Metaoptimization and Reinforcement Learning (on distributed systems); HyperTrick; Results; Conclusion 35
