Evolution Strategies: Distributed deep reinforcement learning (blog.otoro.net). Steven Schmatz, November 21, 2017, @stevenschmatz
Deep Reinforcement Learning
Agenda
1. Why is deep reinforcement learning hard?
2. How do evolution strategies (ES) help?
3. Advice on applying ES to real-world problems
RL (reinforcement learning) in a nutshell
Deep RL in a nutshell
Deep CNNs are useful.
Assumptions of supervised learning: stationary distribution, independence of examples, clear input-output relationship
RL violates these assumptions. 😮 Stationary distribution, independence of examples, clear input-output relationship
RL violates these assumptions. 😮 Stationary distribution: the training data changes as you act differently.
RL violates these assumptions. 😮 Independence of examples: adjacent game frames are usually very similar.
RL violates these assumptions. 😮 Clear input-output relationship: there can be a large delay between action and reward.
Deep Q-Learning: model and training objective (the equations are reconstructed below)
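The model and training-objective equations on this slide were images and did not survive extraction. As a reference, a standard deep Q-learning formulation (a reconstruction, not necessarily the slide's exact notation) is:

```latex
% Q-network Q(s, a; \theta) approximates the optimal action-value function.
% Training objective: squared temporal-difference error against a target network \theta^-
L(\theta) = \mathbb{E}_{(s, a, r, s')}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]
```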
Policy gradients: our objective and our weight update (reconstructed below)
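The objective and weight-update formulas here were also images. The usual REINFORCE-style policy-gradient statement (again a reconstruction, so notation may differ from the original slide) is:

```latex
% Objective: expected return of trajectories sampled from the policy
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
% Weight update: step in the direction of the score-function gradient estimate
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big],
\qquad \theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```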
Policy gradients: What if our reward function is highly nonlinear? What if our reward is received much later? What if our policy is non-differentiable? How far should we step?
Local optima
Black-box optimization
ES to the rescue! At each iteration:
1. Generate candidate solutions from old candidates by adding noise.
2. Evaluate a fitness function for each candidate.
3. Aggregate the results and discard bad candidates.
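A minimal sketch of this loop in Python, with a toy quadratic fitness function standing in for an episode return (everything here is illustrative, not code from the talk):

```python
import numpy as np

def fitness(theta):
    # Toy stand-in for an episode return: higher is better, maximum at 0.
    return -np.sum(theta ** 2)

def es_iteration(candidates, pop_size=50, sigma=0.1, keep=10):
    # 1. Generate candidate solutions from old candidates by adding noise.
    parents = candidates[np.random.randint(len(candidates), size=pop_size)]
    children = parents + sigma * np.random.randn(*parents.shape)
    # 2. Evaluate a fitness function for each candidate.
    scores = np.array([fitness(c) for c in children])
    # 3. Aggregate the results and discard bad candidates.
    return children[np.argsort(scores)[-keep:]]

candidates = np.random.randn(10, 5)      # 10 random starting points in 5-D
for _ in range(100):
    candidates = es_iteration(candidates)
```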
Simple ES. Basic idea: Select the single best previous solution, and add Gaussian noise. (Keep standard deviation fixed.)
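Simple ES reduces the loop above to a single elite: the next generation is sampled around the one best solution with a fixed standard deviation. A sketch (the function names and defaults are mine, not the slide's):

```python
import numpy as np

def simple_es(fitness, theta0, pop_size=50, sigma=0.1, iters=200):
    best = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        # Sample the whole population around the single best previous solution.
        population = best + sigma * np.random.randn(pop_size, best.size)
        scores = np.array([fitness(p) for p in population])
        best = population[np.argmax(scores)]   # keep only the single best
    return best
```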
Genetic ES. Basic idea: Only keep the top-performing 10% of solutions. Randomly select two solutions and recombine them by randomly assigning each parameter value from either parent (and add fixed Gaussian noise). Example: combining (1, 2, 3) and (4, 5, 6) can give (1, 5, 6) or (4, 2, 3).
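A sketch of the recombination step described here, i.e. uniform crossover of two elite parents plus fixed Gaussian noise; the function name and the 10% / sigma defaults are illustrative assumptions:

```python
import numpy as np

def genetic_es_step(population, fitness, sigma=0.05, elite_frac=0.10):
    scores = np.array([fitness(p) for p in population])
    n_elite = max(2, int(len(population) * elite_frac))
    elite = population[np.argsort(scores)[-n_elite:]]        # top 10%
    children = np.empty_like(population)
    for i in range(len(population)):
        a, b = elite[np.random.choice(n_elite, size=2, replace=False)]
        mask = np.random.rand(a.size) < 0.5      # per-parameter coin flip
        children[i] = np.where(mask, a, b) + sigma * np.random.randn(a.size)
    return children

# Slide example: crossing (1, 2, 3) with (4, 5, 6) can produce children like
# (1, 5, 6) or (4, 2, 3), before the Gaussian noise is added.
```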
CMA-ES. Basic idea: Select the best 25% of the population. Calculate a covariance matrix of this best 25% (it represents a promising area to search for new candidates). Generate new candidates using the per-parameter means and variances.
😮 Problem: the covariance matrix is N x N in the number of parameters, so computing and storing it becomes impractical for neural network policies with more than a few thousand weights.
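A heavily simplified sketch of the idea on this slide (fit a Gaussian to the best 25% and sample the next generation from it); this is not the full CMA-ES algorithm, which also adapts step sizes and evolution paths:

```python
import numpy as np

def cma_like_step(population, fitness, elite_frac=0.25):
    scores = np.array([fitness(p) for p in population])
    n_elite = max(2, int(len(population) * elite_frac))
    elite = population[np.argsort(scores)[-n_elite:]]        # best 25%
    mean = elite.mean(axis=0)
    cov = np.cov(elite, rowvar=False)            # N x N covariance matrix
    return np.random.multivariate_normal(mean, cov, size=len(population))

# The N x N covariance is exactly the scaling problem noted above: a policy
# with one million weights would need a matrix with 10^12 entries.
```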
Natural ES. Basic idea: Treat the problem a bit differently, then use the resulting gradient with your favorite SGD optimizer (the two formulas are reconstructed below).
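The two formulas on this slide were images; the standard natural-ES formulation they most likely correspond to (a reconstruction, so the notation may differ from the original) is:

```latex
% Objective: expected fitness of parameters drawn from a search distribution p_\psi
J(\psi) = \mathbb{E}_{\theta \sim p_\psi}[F(\theta)]
% Score-function gradient, estimated from n sampled candidates, then fed to SGD/Adam
\nabla_\psi J(\psi) = \mathbb{E}_{\theta \sim p_\psi}\big[F(\theta)\, \nabla_\psi \log p_\psi(\theta)\big]
\approx \frac{1}{n} \sum_{i=1}^{n} F(\theta_i)\, \nabla_\psi \log p_\psi(\theta_i)
```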
OpenAI ES. Basic idea: Similar to Natural ES, but with σ held constant. Note: to parallelize, workers only need to exchange (random seed, scalar return) pairs!
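With an isotropic Gaussian of fixed σ, the gradient estimate collapses to a return-weighted sum of the noise vectors. A single-machine sketch (the `episode_return` function is a stand-in for one policy rollout, not something from the slides):

```python
import numpy as np

def openai_es_step(theta, episode_return, pop_size=100, sigma=0.1, alpha=0.01):
    noise = np.random.randn(pop_size, theta.size)                   # epsilon_i
    returns = np.array([episode_return(theta + sigma * eps) for eps in noise])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # normalize
    grad = noise.T @ returns / (pop_size * sigma)   # ~ (1 / n sigma) sum F_i eps_i
    return theta + alpha * grad
```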
Parallelization
Initialize:
1. Create a shared list of random seeds, one per worker, and synchronize the initial parameters.
Repeat:
1. Sample a perturbation from your own seed.
2. Evaluate the perturbed policy to get a scalar return.
3. Communicate the scalar return to all nodes.
4. Reconstruct the perturbations of all other nodes using the known random seeds.
5. Update the parameters; every node computes the identical update.
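A sketch of this seed trick. `all_gather_returns` is a placeholder for whatever communication layer is used (MPI, Redis, etc.), not a real library call; everything else is ordinary NumPy:

```python
import numpy as np

def worker_step(theta, my_rank, seeds, episode_return, sigma=0.1, alpha=0.01):
    # Sample my perturbation from my own seed and evaluate it (one scalar).
    my_noise = np.random.RandomState(seeds[my_rank]).randn(theta.size)
    my_return = episode_return(theta + sigma * my_noise)
    # Communicate only the scalar return; gather everyone else's scalars.
    returns = all_gather_returns(my_return)       # hypothetical collective op
    # Reconstruct every worker's noise from the shared seeds and combine.
    grad = np.zeros_like(theta)
    for rank, seed in enumerate(seeds):
        grad += returns[rank] * np.random.RandomState(seed).randn(theta.size)
    grad /= len(seeds) * sigma
    return theta + alpha * grad    # every node computes the identical update
```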
Efficiency
• The only information communicated at each iteration is a single scalar per machine.
• Most distributed update mechanisms (A3C, Gorila) must communicate entire parameter lists.
• Result: linear horizontal parallelization.
Benefits
• Non-differentiable policies! (hard attention!)
• No backprop! 3x computation time decrease! And much cheaper than GPUs!
• Sparse rewards! Learn long-term policies in hard environments!
Drawbacks
• Not useful for supervised learning (good, reliable gradients).
• Data inefficient: about 3–10x less data efficient.
Bottom Line: If you have a large number of CPU cores (>100), or if you have sparse rewards, evolution strategies may be a good bet.
Appendix