Introduction to Evolution Strategy Algorithms


  1. INTRODUCTION TO EVOLUTION STRATEGY ALGORITHMS James Gleeson Eric Langlois William Saunders

  2. REINFORCEMENT LEARNING CHALLENGES
     - g(θ) is a discrete function of theta... so how do we get a gradient ∇_θ g for backprop? Discrete g(θ), local minima, and a sparse reward signal all get in the way.
     - Credit assignment problem: Bob got a great bonus this year!... but what did Bob do to earn his bonus? Time horizon: 1 year. [+] Met all his deadlines. [+] Took an ML course 3 years ago.
     - IDEA: Let's just treat g(θ) like a black-box function when optimizing it.
     - Evolution strategy: "Try different θ's and see what works." If we find good θ's, keep them and discard the bad ones. Recombine θ_1 and θ_2 to form a new (possibly better) θ_3.

  3. EVOLUTION STRATEGY ALGORITHMS
     - The template (a minimal sketch of this loop follows below):
     - "Sample" a new generation: generate some parameter vectors for your neural networks (e.g. an MNIST ConvNet).
     - Fitness: evaluate how well each neural network performs on a training set.
     - "Prepare" to sample the new generation: given how well each "mutant" performed... natural selection! Keep the good ones; the ones that remain "recombine" to form the next generation.
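
A minimal sketch of this sample/evaluate/select/recombine template in Python (the function names, defaults, and fitness interface are illustrative assumptions, not the presenters' code):

```python
import numpy as np

def evolve(fitness_fn, dim, pop_size=50, elite_frac=0.25, sigma=0.1, generations=100):
    """Generic ES template: sample -> evaluate fitness -> select -> recombine."""
    mean = np.zeros(dim)                       # current "parent" parameter vector
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(generations):
        # "Sample" a new generation of parameter vectors (the mutants).
        population = mean + sigma * np.random.randn(pop_size, dim)
        # Evaluate the fitness of each mutant (e.g. accuracy of an MNIST ConvNet).
        scores = np.array([fitness_fn(p) for p in population])
        # Natural selection: keep only the best mutants.
        elite = population[np.argsort(scores)[-n_elite:]]
        # "Recombine" the survivors to form the next generation's parent.
        mean = elite.mean(axis=0)
    return mean
```

For example, `evolve(lambda p: -np.sum(p**2), dim=10)` climbs toward p = 0 without ever computing a gradient of the fitness.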

  4. SCARY "TEST FUNCTIONS" (1)
     - Rastrigin function (plotted twice: surface and contour view).
     - Lots of local optima; it will be difficult to optimize with backprop + SGD!
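
For reference, a common form of the Rastrigin function (the slide only shows the plot; this definition is the standard one, not taken from the slide):

```python
import numpy as np

def rastrigin(x):
    """Rastrigin test function: global minimum 0 at x = 0, surrounded by a
    regular grid of many local minima."""
    x = np.asarray(x, dtype=float)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))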

  5. SCARY "TEST FUNCTIONS" (2)
     - Schaffer function.
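
The slide does not say which Schaffer variant is plotted; Schaffer function N.2 is a common choice for this kind of demo, so it is sketched here as an assumption:

```python
import numpy as np

def schaffer_n2(x, y):
    """Schaffer function N.2: global minimum 0 at (0, 0), with rings of
    local optima around it."""
    num = np.sin(x**2 - y**2) ** 2 - 0.5
    den = (1 + 0.001 * (x**2 + y**2)) ** 2
    return 0.5 + num / den
```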

  6. WHAT WE WANT TO DO: "TRY DIFFERENT θ's"
     - Schaffer and Rastrigin test functions, optimized with the CMA-ES algorithm.

  7. CMA-ES: HIGH-LEVEL OVERVIEW
     - A sampling distribution P(θ) over the parameter space (a simplified sketch of one generation follows below).
     - Step 1: Calculate the fitness of the current generation.
     - Step 2: Natural selection! Keep the top 25% (the purple dots).
     - Step 3: Recombine to form the new generation. The discrepancy between the mean of the previous generation and the top 25% will cast a wider net!
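
A heavily simplified, illustrative sketch of one generation in this spirit; the real CMA-ES additionally adapts a global step size and uses evolution paths, which are omitted here:

```python
import numpy as np

def cma_es_step(fitness_fn, mean, cov, pop_size=100, elite_frac=0.25):
    """One simplified CMA-ES-style generation (no step-size adaptation or
    evolution paths, unlike the full algorithm)."""
    # Step 1: sample the current generation and calculate its fitness.
    population = np.random.multivariate_normal(mean, cov, size=pop_size)
    scores = np.array([fitness_fn(p) for p in population])
    # Step 2: natural selection -- keep the top 25% (the "purple dots").
    elite = population[np.argsort(scores)[-int(pop_size * elite_frac):]]
    # Step 3: recombine to form the next generation's search distribution.
    # Measuring the elites' spread around the *previous* mean (not the new one)
    # widens the covariance when the elites land far away: it "casts a wider net".
    new_mean = elite.mean(axis=0)
    centered = elite - mean
    new_cov = centered.T @ centered / len(elite)
    return new_mean, new_cov
```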

  8. ES: LESS COMPUTATIONALLY EXPENSIVE
     - IDEA: Sample the neural-network parameters from a multivariate Gaussian with a diagonal covariance matrix, θ ~ N(μ, Σ).
     - Update the sampling parameters (μ, Σ) using a REINFORCE gradient estimate (a sketch of one such update follows below).
     - θ: neural-network parameters. (μ, Σ): parameters for sampling the neural-network parameters, defining the distribution P(θ).
     - Adaptive σ and μ.
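
A hedged sketch of such an update. The per-parameter sampling parameters are written as plain `mu` and `sigma` arrays, and the fitness normalization is a common variance-reduction trick added for illustration, not something stated on the slide:

```python
import numpy as np

def reinforce_es_step(fitness_fn, mu, sigma, pop_size=100, lr=0.01):
    """Update the sampling parameters (mu, sigma) -- one mu and one sigma per
    network parameter -- with a REINFORCE (score-function) gradient estimate."""
    eps = np.random.randn(pop_size, mu.size)
    thetas = mu + sigma * eps                  # sampled network parameter vectors
    F = np.array([fitness_fn(t) for t in thetas])
    F = (F - F.mean()) / (F.std() + 1e-8)      # normalization to reduce variance
    # Score-function gradients of log N(theta | mu, diag(sigma^2)):
    #   d/dmu    log p = (theta - mu) / sigma^2        = eps / sigma
    #   d/dsigma log p = ((theta - mu)^2 - sigma^2) / sigma^3 = (eps^2 - 1) / sigma
    grad_mu = (F[:, None] * eps / sigma).mean(axis=0)
    grad_sigma = (F[:, None] * (eps**2 - 1) / sigma).mean(axis=0)
    new_mu = mu + lr * grad_mu
    new_sigma = np.maximum(sigma + lr * grad_sigma, 1e-3)   # keep sigma positive
    return new_mu, new_sigma
```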

  9. ES: EVEN LESS COMPUTATIONALLY EXPENSIVE
     - IDEA: Just use the same σ for every parameter → sample the neural-network parameters from an "isotropic Gaussian": θ ~ N(μ, σ²I) (see the contrast sketched below).
     - Seems suspiciously simple... but it can compete!
     - OpenAI ES paper: σ is a hyperparameter (constant σ and μ, in contrast to the adaptive σ and μ of the previous slide); 1 set of hyperparameters for Atari; 1 set of hyperparameters for Mujoco; competes with A3C and TRPO performance.
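
A small contrast of the two sampling schemes (array sizes and values are arbitrary placeholders):

```python
import numpy as np

dim = 1000
mu = np.zeros(dim)

# Diagonal-covariance ES: a separate, adapted sigma_j for every parameter.
sigma_vec = np.full(dim, 0.1)
theta_diag = mu + sigma_vec * np.random.randn(dim)

# Isotropic ES (OpenAI ES): one fixed scalar hyperparameter sigma shared by all
# parameters, i.e. theta ~ N(mu, sigma^2 * I).
sigma = 0.1
theta_iso = mu + sigma * np.random.randn(dim)
```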

  10. EVOLUTION STRATEGIES AS A SCALABLE ALTERNATIVE TO REINFORCEMENT LEARNING James Gleeson Eric Langlois William Saunders

  11. TODAY'S RL LANDSCAPE AND RECENT SUCCESS
     - Discrete action tasks: learning to play Atari from raw pixels; expert-level Go player.
     - Continuous action tasks: "hopping" locomotion.
     - Q-learning: learn the action-value function. Policy gradient (e.g. TRPO): learn the policy directly.
     - In both cases: approximate the function using a neural network, trained with gradients computed via backpropagation (i.e. the chain rule).

  12. MOTIVATION: PROBLEMS WITH BACKPROPAGATION
     - Backpropagation isn't perfect:
     - GPU memory requirements.
     - Difficult to parallelize.
     - Cannot be applied directly to non-differentiable functions, e.g. a discrete F(θ) (the topic of this course).
     - Exploding gradients (e.g. for RNNs).
     - You have a datacenter, and cycles to spend on an RL problem.

  13. AN ALTERNATIVE TO BACKPROPAGATION: EVOLUTION STRATEGY (ES)
     - Claim: we can estimate the gradient of the objective F(θ) using only evaluations of F(θ): no derivatives of F(θ), no chain rule / backprop required... and have it be embarrassingly parallel.
     - Proof sketch: 2nd-order Taylor series approximation of F(θ + σε); the F(θ) term is independent of ε, so it drops out in expectation (see the reconstruction below).
     - Relevant to our course: F(θ) could be a discrete function of θ.
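
A reconstruction of the claim and the proof sketch in LaTeX, following the notation of the OpenAI ES paper; the slide's own equations are not legible in this transcript:

```latex
% ES gradient estimator:
\[
  \nabla_\theta \, \mathbb{E}_{\epsilon \sim N(0,I)} \big[ F(\theta + \sigma \epsilon) \big]
  \;=\; \frac{1}{\sigma} \, \mathbb{E}_{\epsilon \sim N(0,I)} \big[ F(\theta + \sigma \epsilon)\, \epsilon \big].
\]
% Proof sketch via a 2nd-order Taylor expansion:
\[
  F(\theta + \sigma \epsilon) \;\approx\; F(\theta) + \sigma\, \epsilon^\top \nabla F(\theta)
    + \tfrac{\sigma^2}{2}\, \epsilon^\top H\, \epsilon .
\]
% Multiply by \epsilon and take the expectation over \epsilon: the F(\theta) term
% vanishes because F(\theta) is independent of \epsilon (and E[\epsilon] = 0), the
% Hessian term vanishes because odd Gaussian moments are zero, and the middle term
% leaves \sigma \nabla F(\theta). Only evaluations of F are needed: no derivatives
% of F, no chain rule, no backprop.
```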

  14. THE MAIN CONTRIBUTION OF THIS PAPER
     - Criticisms: evolution strategies aren't new! Common sense says the variance/bias of this gradient estimator will be too high, making the algorithm unstable on today's problems.
     - This paper aims to refute your common sense with comparisons against state-of-the-art RL algorithms:
     - Atari: half the games do better than a recent algorithm (A3C), half the games do worse.
     - Mujoco: can match state-of-the-art policy gradients on continuous action tasks.
     - Linear speedups with more compute nodes: 1 day with A3C → 1 hour with ES.

  15. FIRST ATTEMPT AT ES: THE SEQUENTIAL ALGORITHM
     - Gradient estimator needed for updating θ; sample perturbations ε_i ~ N(0, I).
     - In RL, the fitness F(θ) is defined as the total reward from running the policy parameterized by θ for one episode.
     - Generate n random perturbations of θ, sequentially run each mutant, compute the gradient estimate (sketched below).
     - Embarrassingly parallel! For each Worker_i, i = 1..n: Worker_i computes F_i in parallel.
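
A minimal sketch of the sequential algorithm, assuming a hypothetical `run_episode(theta)` helper that plays one episode with the policy parameterized by `theta` and returns its total reward:

```python
import numpy as np

def es_sequential(run_episode, theta, n=100, sigma=0.1, lr=0.01, iterations=200):
    """Sequential ES: perturb theta, run each mutant one after another, update."""
    for _ in range(iterations):
        # Generate n random perturbations of theta.
        eps = np.random.randn(n, theta.size)
        # Sequentially run each mutant (this inner loop is what gets distributed
        # across workers in the parallel version on the next slide).
        F = np.array([run_episode(theta + sigma * e) for e in eps])
        # Gradient estimate: (1 / (n * sigma)) * sum_i F_i * eps_i
        grad = (F[:, None] * eps).mean(axis=0) / sigma
        theta = theta + lr * grad
    return theta
```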

  16. SECOND ATTEMPT: THE PARALLEL ALGORITHM
     - Embarrassingly parallel!
     - KEY IDEA: minimize communication cost. Avoid sending the length-|θ| vectors ε_i; send the scalar returns F_i (length 1) instead.
     - How? Each worker reconstructs every random perturbation vector ε_i... how? Make the initial random seed of each Worker_i globally known.
     - With F_k and ε_k known by everyone, each worker computes the same gradient estimate. Tradeoff: redundant computation in exchange for tiny |θ|-independent messages (sketch below).
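
A sketch of one iteration of the seed-sharing scheme, simulated in a single process for clarity; in the real system each index `k` is a separate worker, and `seeds` plus the scalar returns `F[k]` are the only things ever communicated. The function name and learning-rate handling are assumptions:

```python
import numpy as np

def parallel_es_iteration(run_episode, theta, sigma, lr, seeds):
    """One iteration of the seed-sharing parallel ES scheme (simulated)."""
    n = len(seeds)

    # Phase 1 (distributed): worker k perturbs theta with its own seeded noise,
    # runs one episode, and broadcasts the single scalar F[k].
    F = np.empty(n)
    for k in range(n):
        eps_k = np.random.RandomState(seeds[k]).randn(theta.size)
        F[k] = run_episode(theta + sigma * eps_k)

    # Phase 2 (redundant on every worker): with all F[k] and all seeds known,
    # each worker regenerates every eps_k and computes the *same* gradient
    # estimate and the same new theta -- no length-|theta| messages required.
    grad = np.zeros_like(theta)
    for k in range(n):
        eps_k = np.random.RandomState(seeds[k]).randn(theta.size)
        grad += F[k] * eps_k
    return theta + lr * grad / (n * sigma)
```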

  17. EXPERIMENT: HOW WELL DOES IT SCALE?
     - Linearly! With diminishing returns, which are often inevitable.
     - Plot: actual speedup vs. ideal (perfectly linear) speedup; e.g. 200 cores, 60 minutes.
     - Criticism: are the diminishing returns due to (a) increased communication cost from more workers, or (b) less reduction in the variance of the gradient estimate from more workers?

  18. INTRINSIC DIMENSIONALITY OF THE PROBLEM
     - A single ES sample ≈ a finite difference in some random direction ε → does the number of update steps scale with |θ|?
     - Argument: the number of update steps in ES scales with the intrinsic dimensionality of θ needed for the problem, not with the length of θ.
     - Justification, e.g. simple linear regression with θ_1, θ_2 ~ N(μ, σ²): double |θ| → |θ'|. After adjusting η and σ, the update step has the same effect → the same number of update steps (a hedged reconstruction follows below).
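
A hedged reconstruction of the linear-regression justification; the algebra below is inferred from the slide fragments rather than quoted from the paper:

```latex
% Take linear regression y = \theta^\top x and double |\theta| by duplicating
% every feature:
\[
  y = \theta'^\top [x; x], \qquad |\theta'| = 2\,|\theta| .
\]
% An isotropic perturbation \epsilon' \sim N(0, I_{2d}) changes the prediction by
\[
  \sigma\, \epsilon'^\top [x; x] \;\sim\; N\!\big(0,\; 2\sigma^2 \lVert x \rVert^2 \big),
\]
% which is the same distribution the original d-dimensional model produces once
% \sigma is rescaled by 1/\sqrt{2}. After correspondingly adjusting the learning
% rate \eta, each ES update step has the same effect on the model's predictions,
% so the number of update steps is unchanged: it tracks the intrinsic
% dimensionality of the problem, not the length of \theta.
```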

  19. WHEN IS ES A BETTER CHOICE THAN POLICY GRADIENTS?
     - How do we compute gradients? Policy gradients: the policy network outputs a softmax of probabilities over discrete actions, and we sample an action randomly at every timestep. Evolution strategy (ES): we randomly perturb our parameters once, then select actions according to the perturbed policy.
     - ASIDE, in case you forget: for independent X and Y, Var(XY) = Var(X)Var(Y) + E[X]²Var(Y) + E[Y]²Var(X).
     - The variance of the policy-gradient estimate grows linearly with the length of the episode; the discount γ only fixes this for short-term returns. The ES estimate is independent of episode length (see the comparison below).
     - ES makes fewer (potentially incorrect) assumptions about the credit assignment problem.
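
The two estimators side by side, with the variance identity the slide alludes to; these are the standard forms, written here for reference rather than copied from the slide:

```latex
% Policy gradients (REINFORCE): one score-function term per timestep, so for a
% length-T episode the estimator
\[
  \nabla_\theta \, \mathbb{E}[R]
  = \mathbb{E}\Big[ R \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
\]
% sums T noisy terms and its variance grows roughly linearly with T; discounting
% by \gamma only suppresses the long-horizon terms, i.e. it only helps for
% short-term returns. ES perturbs the parameters once per episode,
\[
  \nabla_\theta \, \mathbb{E}_{\epsilon}\big[ F(\theta + \sigma \epsilon) \big]
  = \tfrac{1}{\sigma}\, \mathbb{E}_{\epsilon}\big[ F(\theta + \sigma \epsilon)\, \epsilon \big],
\]
% so the form of its estimator does not depend on T.
% Aside used on the slide: for independent X and Y,
\[
  \mathrm{Var}(XY) = \mathrm{Var}(X)\mathrm{Var}(Y)
    + \mathbb{E}[X]^2 \mathrm{Var}(Y) + \mathbb{E}[Y]^2 \mathrm{Var}(X).
\]
```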

  20. EXPERIMENT: ES ISN'T SENSITIVE TO THE LENGTH OF EPISODE τ
     - Frame-skip F: the agent can select an action every F frames of input pixels. E.g. F = 4: on frame 1 the agent selects an action; for the next 3 frames it is forced to take the Noop action.
     - IDEA: playing Pong with a different frame-skip artificially inflates the length of the episode τ while keeping ≈ the same policy.
     - Argument: since the ES algorithm doesn't make any assumption about the time horizon γ (decaying reward), it is less sensitive to long episodes τ (i.e. the credit assignment problem).

  21. EXPERIMENT: LEARNED PERFORMANCE
     - The authors looked at discrete action tasks (Atari) and continuous action tasks (Mujoco).

  22. EXPERIMENT: DISCRETE ACTION TASKS (ATARI)
     - Paper's claim: "Given the same amount of compute time as other algorithms, compared to A3C, ES does better on 21 games, worse on 29" (50 games in total).
     - A slightly misleading claim if you aren't reading carefully: tallying which algorithm achieves the best score on each game splits the 50 games 4 / 19 / 7 / 9 / 11 (8% / 38% / 14% / 18% / 22%) across the compared algorithms, and A3C still does better on most games → ES is still beaten by other algorithms even when it beats A3C.

  23. EXPERIMENT: CONTINUOUS ACTION TASKS (MUJOCO)
     - Sampling complexity: how many steps in the environment were needed to reach X% of policy-gradient performance?
     - Ratio = (# ES timesteps) / (# TRPO timesteps): < 1 → better sample complexity; > 1 → worse sample complexity.
     - Harder tasks: at most 10x more samples required. Simpler tasks: as few as 0.33x the samples required.

  24. SUMMARY: EVOLUTION STRATEGY
     - ES is a viable alternative to current RL algorithms (Q-learning: learn the action-value function; policy gradient, e.g. TRPO: learn the policy directly).
     - ES: treat the problem like a black box; perturb θ and evaluate the fitness F(θ).
     - No potentially incorrect assumptions about the credit assignment problem (e.g. time horizon γ).
     - No backprop required.
     - Embarrassingly parallel.
     - Lower GPU memory requirements.
