Evolution Strategies: Distributed deep reinforcement learning - PowerPoint PPT Presentation

  1. Evolution Strategies: Distributed deep reinforcement learning (blog.otoro.net). Steven Schmatz, November 21, 2017, @stevenschmatz

  2. Deep Reinforcement Learning: Evolution Strategies. Steven Schmatz, November 21, 2017, @stevenschmatz

  3. Agenda: 1. Why is deep reinforcement learning hard? 2. How do evolution strategies (ES) help? 3. Advice on applying ES to real-world problems.

  4. RL in a nutshell (reinforcement learning)

  5. Deep RL in a nutshell

  6. Deep CNNs are useful.

  7. Assumptions of supervised learning: stationary distribution, independence of examples, clear input-output relationship.

  8. RL violates these assumptions. 😮 (Stationary distribution, independence of examples, clear input-output relationship.)

  9. RL violates these assumptions. 😮 Stationary distribution: the training data changes as you act differently.

  10. RL violates these assumptions. 😮 Independence of examples: adjacent game frames are usually very similar.

  11. RL violates these assumptions. 😮 Clear input-output relationship: there can be a large delay between action and reward.

  12. Deep Q-Learning: model and training objective.
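
The slide's equations did not survive transcription. For reference, the standard DQN training objective (which the slide presumably shows in some form) minimizes the squared temporal-difference error against a target network:

```latex
% Q-network Q_\theta(s, a); target network Q_{\theta^-} with periodically copied weights.
L(\theta) = \mathbb{E}_{(s, a, r, s')}\!\left[
  \Big( r + \gamma \max_{a'} Q_{\theta^-}(s', a') \;-\; Q_\theta(s, a) \Big)^{2}
\right]
```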

  13. Policy gradients: our objective and our weight update.
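
The objective and update equations are likewise missing from the transcript; the standard policy-gradient (REINFORCE) form they correspond to is:

```latex
% Objective: expected return of policy \pi_\theta over trajectories \tau.
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
% Weight update: ascend a score-function estimate of the gradient.
\theta \leftarrow \theta + \alpha \, R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```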

  14. Policy gradients: What if our reward function is highly nonlinear? What if our reward is received much later? What if our policy is non-differentiable? How far should we step?

  15. Local optima

  16. Local optima

  17. Black-box optimization

  18. ES to the rescue! At each iteration: 1. Generate candidate solutions from old candidates by adding noise. 2. Evaluate a fitness function for each candidate. 3. Aggregate the results and discard bad candidates.
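
A minimal sketch of this loop (not from the talk; the function names, the placeholder fitness function, and all hyperparameters here are illustrative):

```python
import numpy as np

def fitness(x):
    # Placeholder objective: higher is better (negative squared distance from an arbitrary target).
    return -np.sum((x - 0.5) ** 2)

def es_loop(dim=10, pop_size=50, elite_frac=0.2, sigma=0.1, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    parent = rng.normal(size=dim)                     # current best guess
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iters):
        # 1. Generate candidates from the old candidate by adding noise.
        candidates = parent + sigma * rng.normal(size=(pop_size, dim))
        # 2. Evaluate a fitness function for each candidate.
        scores = np.array([fitness(c) for c in candidates])
        # 3. Aggregate the results and discard bad candidates.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        parent = elite.mean(axis=0)
    return parent
```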

  19. Simple ES Basic idea: Select the single best previous solution, and add Gaussian noise. (Keep standard deviation fixed.)

  20. Genetic ES Basic idea: Only keep the top-performing 10% of solutions. Randomly select two solutions and recombine them by randomly assigning each parameter value from either parent (and add fixed Gaussian noise). Example: combining (1, 2, 3) and (4, 5, 6) can yield (1, 5, 6) or (4, 2, 3).
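
One way the recombination step could look (a hypothetical helper, not code from the talk; `recombine` and its parameters are made up for illustration):

```python
import numpy as np

def recombine(parent_a, parent_b, sigma=0.1, rng=None):
    # Randomly take each parameter from either parent, then add fixed Gaussian noise.
    rng = rng or np.random.default_rng()
    mask = rng.random(parent_a.shape) < 0.5
    child = np.where(mask, parent_a, parent_b)
    return child + sigma * rng.normal(size=child.shape)

# Combining (1, 2, 3) and (4, 5, 6) might give (1, 5, 6) or (4, 2, 3) before the noise is added.
child = recombine(np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0]))
```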

  21. CMA-ES Basic idea: Select the best 25% of the population. Calculate a covariance matrix of these best 25% (it represents a promising area to search for new candidates). Generate new candidates using the per-parameter means and variances.

  22. CMA-ES Basic idea: Select the best 25% of the population. Calculate a covariance matrix of these best 25% (it represents a promising area to search for new candidates). Generate new candidates using the per-parameter means and variances. 😮 Problem: computing the covariance matrix scales quadratically with the number of parameters, so it becomes impractical for the thousands to millions of weights in a deep policy network.
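
A rough sketch of the simplified procedure the slide describes (fit a Gaussian to the elite set and sample from it); this is a toy approximation, not the full CMA-ES algorithm, and the function name and ridge term are assumptions for illustration:

```python
import numpy as np

def simplified_cma_step(candidates, scores, pop_size, elite_frac=0.25, rng=None):
    # Select the best 25% of the population.
    rng = rng or np.random.default_rng()
    n_elite = max(2, int(len(candidates) * elite_frac))
    elite = candidates[np.argsort(scores)[-n_elite:]]
    # The elites' mean and covariance describe a promising region to search next.
    mean = elite.mean(axis=0)
    cov = np.cov(elite, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])   # small ridge to keep the covariance well-conditioned
    # Generate the next generation of candidates from that Gaussian.
    return rng.multivariate_normal(mean, cov, size=pop_size)
```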

  23. Natural ES Basic idea: Treat the problem a bit differently, then use the gradient with your favorite SGD optimizer.
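
The slide's equations are not in the transcript; the standard Natural ES formulation they presumably correspond to is to maximize the expected fitness of a search distribution p_ψ over parameters and follow its score-function gradient:

```latex
% Search distribution p_\psi over policy parameters \theta; fitness F(\theta).
J(\psi) = \mathbb{E}_{\theta \sim p_\psi}\big[ F(\theta) \big],
\qquad
\nabla_\psi J(\psi) = \mathbb{E}_{\theta \sim p_\psi}\big[ F(\theta)\, \nabla_\psi \log p_\psi(\theta) \big]
```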

  24. OpenAI ES Basic idea: Similar to Natural ES, but with σ held constant.

  25. OpenAI ES Basic idea: Similar to Natural ES, but with σ held constant. Note: to parallelize, we only need to know the (perturbation, return) pairs!
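
For reference, the update from the OpenAI ES paper in its usual notation (the exact symbols on the slide are not in the transcript): sample perturbations ε_i ~ N(0, I), evaluate returns F_i = F(θ + σ ε_i), then

```latex
\theta \leftarrow \theta + \alpha \, \frac{1}{n\sigma} \sum_{i=1}^{n} F_i \, \varepsilon_i
```

The (ε_i, F_i) pairs are all this update needs, which is why sharing seeds plus one scalar return per worker is enough to parallelize it.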

  26–31. Parallelization (built up across slides 26–31; the figure shows a ring of seven worker nodes). Initialize: create a shared list of random seeds, one per worker, along with shared initial parameters. Repeat: 1. Each worker samples a noise vector. 2. Each worker evaluates the perturbed policy and records its return. 3. Communicate the scalar returns to all nodes. 4. Each node reconstructs every other worker's noise vector using the known random seeds. 5. Each node applies the same parameter update locally.
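
A single-process simulation of this scheme, with each "worker" reconstructing everyone's noise from the shared seeds (the function names, toy fitness, and hyperparameters are illustrative; a real setup runs one process per worker and evaluates full RL episodes):

```python
import numpy as np

def fitness(theta):
    # Toy stand-in for an RL episode return (higher is better).
    return -np.sum(theta ** 2)

def parallel_es(dim=20, n_workers=7, sigma=0.1, alpha=0.03, iters=200, master_seed=0):
    # Initialize: a shared list of per-worker seeds, plus shared initial parameters.
    seeds = np.random.default_rng(master_seed).integers(0, 2**31 - 1, size=n_workers)
    theta = np.zeros(dim)
    for t in range(iters):
        # 1-2. Each worker samples its noise and evaluates the perturbed policy;
        #      only the scalar return needs to be communicated (step 3).
        returns = np.empty(n_workers)
        for i, seed in enumerate(seeds):
            eps = np.random.default_rng([int(seed), t]).normal(size=dim)
            returns[i] = fitness(theta + sigma * eps)
        # 4. Every node reconstructs all workers' noise from the known seeds, and
        # 5. applies the identical parameter update locally.
        grad = np.zeros(dim)
        for i, seed in enumerate(seeds):
            eps = np.random.default_rng([int(seed), t]).normal(size=dim)
            grad += returns[i] * eps
        theta = theta + alpha * grad / (n_workers * sigma)
    return theta
```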

  32. Efficiency • The only information communicated at each iteration is a single scalar per machine. • Most distributed update mechanisms (A3C, Gorila) must communicate entire parameter vectors. • Result: linear horizontal parallelization.

  34. Benefits • Non-differentiable policies (e.g. hard attention)! • No backprop: roughly a 3x computation-time decrease, and much cheaper than GPUs! • Sparse rewards: learn long-term policies in hard environments!

  35. Drawbacks • Not useful for supervised learning (where good, reliable gradients are available). • Data inefficient: about 3–10x less data efficient.

  36. Bottom Line If you have a large number of CPU cores (>100), or if you have sparse rewards, evolution strategies may be a good bet.

  37. Evolution Strategies. Steven Schmatz, November 21, 2017, @stevenschmatz

  38. Appendix
