INTRODUCTION TO EVOLUTION STRATEGY ALGORITHMS
James Gleeson, Eric Langlois, William Saunders
REINFORCEMENT LEARNING CHALLENGES
Why is it hard to get a gradient ∇_θ E[F(θ)]?
- F(θ) is a discrete function of θ, so backprop does not apply directly.
- Local minima.
- Sparse reward signal F(θ).
- Credit assignment problem: "Bob got a great bonus this year!" ...but what did Bob do to earn his bonus? Time horizon: 1 year. [+] Met all his deadlines. [+] Took an ML course 3 years ago.
IDEA: Let's just treat F(θ) like a black-box function when optimizing it.
Evolution strategy: "Try different θ's, and see what works. If we find good θ's, keep them, discard the bad ones. Recombine θ_1 and θ_2 to form a new (possibly better) θ_3."
EVOLUTION STRATEGY ALGORITHMS
The template:
- "Sample" a new generation: generate some parameter vectors for your neural networks (e.g. the parameters of an MNIST ConvNet).
- Fitness: evaluate how well each neural network performs on a training set.
- "Prepare" to sample the new generation: given how well each "mutant" performed, apply natural selection! Keep the good ones; the ones that remain "recombine" to form the next generation.
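To make the template concrete, here is a minimal sketch of the loop described above (population sampling plus truncation selection). The names fitness_fn, pop_size, elite_frac and sigma are illustrative assumptions, not from the slides or the paper.

    import numpy as np

    def es_template(fitness_fn, dim, pop_size=50, elite_frac=0.25,
                    sigma=0.1, iterations=100, seed=0):
        # Toy version of the ES template: sample, evaluate, select, recombine.
        rng = np.random.default_rng(seed)
        mean = np.zeros(dim)  # current "parent" parameter vector
        for _ in range(iterations):
            # "Sample" a new generation of parameter vectors (mutants).
            population = mean + sigma * rng.standard_normal((pop_size, dim))
            # Fitness: evaluate how well each mutant performs.
            fitness = np.array([fitness_fn(theta) for theta in population])
            # Natural selection: keep the best elite_frac of the population.
            n_elite = max(1, int(pop_size * elite_frac))
            elite = population[np.argsort(fitness)[-n_elite:]]
            # "Recombine": the next generation is sampled around the elite mean.
            mean = elite.mean(axis=0)
        return mean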
SCARY "TEST FUNCTIONS" (1)
Rastrigin function: lots of local optima; will be difficult to optimize with backprop + SGD!
[Figure: surface and contour plots of the Rastrigin test function]
SCARY "TEST FUNCTIONS" (2)
Schaffer function
WHAT WE WANT TO DO: "TRY DIFFERENT θ's"
[Figures: CMA-ES on the Schaffer and Rastrigin test functions]
Algorithm: CMA-ES
CMA-ES: HIGH-LEVEL OVERVIEW
- Step 1: Calculate the fitness F of the current generation.
- Step 2: Natural selection! Keep the top 25% (purple dots).
- Step 3: Recombine to form the new generation: sample from a Gaussian fit to the survivors. The discrepancy between the mean of the previous generation and the top 25% will cast a wider net!
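A heavily simplified, CMA-ES-flavored sketch of the three steps above, assuming a refit-from-elites update. Real CMA-ES additionally uses evolution paths, rank-one/rank-mu updates and step-size control; this is only meant to illustrate why a large gap between the old mean and the survivors widens the search.

    import numpy as np

    def simplified_cmaes_step(fitness_fn, mean, cov, pop_size=100, rng=None):
        rng = rng or np.random.default_rng()
        # Step 1: sample the current generation and calculate its fitness.
        population = rng.multivariate_normal(mean, cov, size=pop_size)
        fitness = np.array([fitness_fn(x) for x in population])
        # Step 2: natural selection -- keep the top 25% (the "purple dots").
        elite = population[np.argsort(fitness)[-pop_size // 4:]]
        # Step 3: recombine -- refit the Gaussian. Centering on the *old* mean
        # means a large discrepancy between the old mean and the elites inflates
        # the new covariance, so the next generation casts a wider net.
        new_mean = elite.mean(axis=0)
        centered = elite - mean
        new_cov = centered.T @ centered / len(elite) + 1e-8 * np.eye(len(mean))
        return new_mean, new_cov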
ES: LESS COMPUTATIONALLY EXPENSIVE
IDEA: Sample neural-network parameters from a multivariate Gaussian with a diagonal covariance matrix. Update the sampling parameters ψ = [μ, Σ] of p(θ; ψ) using a REINFORCE gradient estimate.
- θ: neural-network parameters.
- ψ: parameters for sampling neural-network parameters.
- Adaptive σ and μ.
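A hedged sketch of this "adaptive σ and μ" idea, assuming a diagonal Gaussian and a plain score-function (REINFORCE) update. The fitness normalisation and the clipping of σ are my own assumptions; practical variants (e.g. PEPG, natural ES) differ in detail.

    import numpy as np

    def diagonal_gaussian_es_step(fitness_fn, mu, sigma, lr=0.01, n=100, rng=None):
        rng = rng or np.random.default_rng()
        eps = rng.standard_normal((n, mu.size))
        thetas = mu + sigma * eps                      # sampled network parameters
        F = np.array([fitness_fn(th) for th in thetas])
        F = (F - F.mean()) / (F.std() + 1e-8)          # normalisation (an assumption)
        # Score-function gradients of log N(theta; mu, diag(sigma^2)):
        #   d/d mu    log p = eps / sigma
        #   d/d sigma log p = (eps^2 - 1) / sigma
        grad_mu = (F[:, None] * eps / sigma).mean(axis=0)
        grad_sigma = (F[:, None] * (eps ** 2 - 1.0) / sigma).mean(axis=0)
        new_sigma = np.maximum(sigma + lr * grad_sigma, 1e-3)  # keep sigma positive
        return mu + lr * grad_mu, new_sigma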
ES: EVEN LESS COMPUTATIONALLY EXPENSIVE
- IDEA: Just use the same σ for every parameter → sample neural-network parameters from an isotropic Gaussian, N(θ, σ²I).
- Seems suspiciously simple... but it can compete!
- OpenAI ES paper:
  - σ is a hyperparameter.
  - 1 set of hyperparameters for Atari.
  - 1 set of hyperparameters for MuJoCo.
  - Competes with A3C and TRPO performance.
  - [Figure: sampling distribution with constant σ and μ, vs. the adaptive σ and μ of the previous slide]
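A minimal sketch of the resulting isotropic-Gaussian update: perturb θ with ε ∼ N(0, I), keep σ fixed, and move θ along the return-weighted average of the perturbations. The simple return standardisation below stands in for the paper's rank-based fitness shaping and is an assumption on my part.

    import numpy as np

    def isotropic_es_step(fitness_fn, theta, sigma=0.1, lr=0.01, n=100, rng=None):
        rng = rng or np.random.default_rng()
        eps = rng.standard_normal((n, theta.size))     # one perturbation direction per mutant
        returns = np.array([fitness_fn(theta + sigma * e) for e in eps])
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Gradient estimate of E_eps[F(theta + sigma * eps)]:
        grad = eps.T @ returns / (n * sigma)
        return theta + lr * grad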
EVOLUTION STRATEGIES AS A SCALABLE ALTERNATIVE TO REINFORCEMENT LEARNING
James Gleeson, Eric Langlois, William Saunders
TODAY'S RL LANDSCAPE AND RECENT SUCCESS
- Discrete action tasks: learning to play Atari from raw pixels; expert-level Go player.
- Continuous action tasks: "hopping" locomotion.
- Q-learning: learn the action-value function. Policy gradient (e.g. TRPO): learn the policy directly.
- In both cases: approximate the function with a neural network, trained using gradients computed via backpropagation (i.e. the chain rule).
MOTIVATION: PROBLEMS WITH BACKPROPAGATION
- Backpropagation isn't perfect:
  - GPU memory requirements.
  - Difficult to parallelize.
  - Cannot be applied directly to non-differentiable functions, e.g. discrete functions F(θ) (the topic of this course).
  - Exploding gradients (e.g. for RNNs).
- Setting: you have a datacenter, and cycles to spend on an RL problem.
AN ALTERNATIVE TO BACKPROPAGATION: EVOLUTION STRATEGY (ES)
Claim: we can estimate the gradient of the objective F(θ) with no derivatives of F(θ) and no chain rule / backprop required, and have it be embarrassingly parallel.
Proof sketch (2nd-order Taylor series): F(θ + σε) ≈ F(θ) + σ ε^T ∇F(θ) + (σ²/2) ε^T H ε, so
(1/σ) E_{ε∼N(0,I)}[F(θ + σε) ε] ≈ (1/σ) F(θ) E[ε] + E[ε ε^T] ∇F(θ) + (σ/2) E[(ε^T H ε) ε] = ∇F(θ),
since F(θ) is independent of ε (so the first term vanishes, E[ε] = 0), E[ε ε^T] = I, and the third (odd-moment) term is zero.
Relevant to our course: F(θ) could be a discrete function of θ.
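A small numerical sanity check of the claim (my own illustration, not from the slides): on a quadratic F whose true gradient is known in closed form, the estimator (1/(nσ)) Σ_i F(θ + σε_i) ε_i recovers ∇F(θ) without ever differentiating F. Subtracting the baseline F(θ) leaves the estimate unbiased (since E[ε] = 0) and only reduces variance.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.diag([1.0, 3.0, 0.5])
    F = lambda th: -0.5 * th @ A @ th          # black-box objective; no autodiff anywhere
    theta = np.array([1.0, -2.0, 0.5])

    sigma, n = 0.05, 20000
    eps = rng.standard_normal((n, 3))
    returns = np.array([F(theta + sigma * e) for e in eps]) - F(theta)  # baseline F(theta)
    estimate = eps.T @ returns / (n * sigma)

    print("ES estimate  :", estimate)          # close to the true gradient -A @ theta
    print("true gradient:", -A @ theta)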
THE MAIN CONTRIBUTION OF THIS PAPER
- Criticisms:
  - Evolution strategies aren't new!
  - Common sense: the variance/bias of this gradient estimator will be too high, making the algorithm unstable on today's problems!
- This paper aims to refute your common sense:
  - Comparison against state-of-the-art RL algorithms:
    - Atari: half the games do better than a recent algorithm (A3C), half the games do worse.
    - MuJoCo: can match state-of-the-art policy gradients on continuous action tasks.
  - Linear speedups with more compute nodes: 1 day with A3C → 1 hour with ES.
FIRST ATTEMPT AT ES: THE SEQUENTIAL ALGORITHM
- In RL, the fitness F(θ) is defined as the total reward of an episode played with policy parameters θ.
- Gradient estimator needed for updating θ: sample ε_1, ..., ε_n ∼ N(0, I), then
  θ ← θ + η (1 / (nσ)) Σ_{i=1..n} F(θ + σ ε_i) ε_i
- Generate n random perturbations of θ, sequentially run each mutant_i (i = 1..n) to compute F_i, then compute the gradient estimate.
- Running the mutants is embarrassingly parallel: each mutant_i can compute its F_i in parallel.
SECOND ATTEMPT: THE PARALLEL ALGORITHM
- Embarrassingly parallel: with all F_i and ε_i known by everyone, each worker computes the same gradient estimate.
- KEY IDEA: minimize communication cost. Avoid sending the perturbed parameters (message size |θ|); send only the scalar return (len(F_i) = 1) instead.
- How? Each worker reconstructs every random perturbation vector ε_i itself. ...How? Make the initial random seed of each mutant_i globally known.
- Tradeoff: redundant computation of size |θ| on every worker, in exchange for tiny messages.
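A sketch of the communication trick (hypothetical code, not the paper's implementation): each worker derives its perturbation from a globally known seed, so the only value that crosses the network is the scalar return F_i; every worker can then regenerate all ε_i locally and apply the identical update.

    import numpy as np

    def worker_return(worker_id, theta, sigma, fitness_fn):
        # The seed is a known function of worker_id, so anyone can regenerate eps_i.
        eps = np.random.default_rng(worker_id).standard_normal(theta.size)
        return fitness_fn(theta + sigma * eps)   # a single scalar is sent over the network

    def apply_update_everywhere(theta, sigma, lr, returns):
        # Run identically on every worker after an all-gather of the scalar returns.
        grad = np.zeros_like(theta)
        for worker_id, F_i in enumerate(returns):
            eps = np.random.default_rng(worker_id).standard_normal(theta.size)
            grad += F_i * eps                    # reconstruct eps_i instead of receiving it
        return theta + lr * grad / (len(returns) * sigma)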
EXPERIMENT: HOW WELL DOES IT SCALE?
- Linearly! With diminishing returns, which is often inevitable.
- [Figure: actual speedup vs. ideal (perfectly linear) speedup; roughly 200 cores, 60 minutes]
- Criticism: are the diminishing returns due to (a) increased communication cost from more workers, or (b) less reduction in the variance of the gradient estimate from more workers?
INTRINSIC DIMENSIONALITY OF THE PROBLEM
- Concern: the ES update is essentially a finite difference in some random direction ε → does the number of update steps scale with |θ|?
- Argument: the number of update steps in ES scales with the intrinsic dimensionality of θ needed for the problem, not with the length of θ.
- Justification, e.g. simple linear regression: double |θ| → |θ'| by duplicating each feature (x_1 ∼ x_2 ∼ N(μ, σ²)). After adjusting η and σ, the update step has the same effect → same number of update steps.
WHEN IS ES A BETTER CHOICE THAN POLICY GRADIENTS?
How do we compute gradients?
- Policy gradients: the policy network outputs a softmax over discrete actions, and we sample an action randomly at every timestep. The variance of the gradient estimate grows (potentially) linearly with the length of the episode: the credit assignment problem. A discount factor γ only fixes this for short-term returns!
- Evolution strategy (ES): we randomly perturb our parameters once per episode, θ' = θ + σε, then select actions according to π_θ'. The variance is independent of episode length, so ES makes fewer (potentially incorrect) assumptions.
- ASIDE, in case you forgot: for independent X and Y, Var(X + Y) = Var(X) + Var(Y); summing one noisy term per timestep is what makes the policy-gradient variance grow with episode length.
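A compact way to see the contrast (my own summary of the argument, using the slides' notation):

    % Policy gradient: one score-function term per timestep, so for roughly
    % independent terms the variance grows with the episode length T:
    \nabla_\theta \mathbb{E}[R]
      = \mathbb{E}\Big[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
    % ES: a single perturbation per episode, so the estimator does not
    % accumulate one noisy term per timestep:
    \nabla_\theta \, \mathbb{E}_{\epsilon \sim N(0, I)}\big[ F(\theta + \sigma\epsilon) \big]
      = \frac{1}{\sigma} \, \mathbb{E}_{\epsilon \sim N(0, I)}\big[ F(\theta + \sigma\epsilon)\, \epsilon \big]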
EXPERIMENT: ES ISN'T SENSITIVE TO THE LENGTH OF THE EPISODE τ
- Frame-skip F: the agent can select an action every F frames of input pixels (same policy). E.g. F = 4: on frame 1 the agent selects an action; on the next 3 frames it is forced to take the Noop action.
- IDEA: artificially inflate the length of an episode τ (playing Pong with frame-skip).
- Argument: since the ES algorithm doesn't make any assumptions about the time horizon (e.g. a decaying reward γ), it is less sensitive to long episodes τ (i.e. the credit assignment problem).
EXPERIMENT: LEARNED PERFORMANCE
The authors looked at:
- discrete action tasks (Atari)
- continuous action tasks (MuJoCo)
EXPERIMENT: DISCRETE ACTION TASKS (ATARI)
- Paper's claim: "Given the same amount of compute time as other algorithms, compared to A3C, ES does better on 21 games, worse on 29" (50 games in total).
- Slightly misleading if you aren't reading carefully: counting which algorithm achieves the best score per game across all algorithms (4 / 19 / 7 / 9 / 11 games, i.e. 8% / 38% / 14% / 18% / 22%), A3C still does better on most games → ES is still beaten by other algorithms even when it beats A3C.
EXPERIMENT: CONTINUOUS ACTION TASKS (MUJOCO)
- Sample complexity: how many steps in the environment were needed to reach X% of policy-gradient performance? Ratio = (# ES timesteps) / (# TRPO timesteps):
  - < 1 → better sample complexity
  - > 1 → worse sample complexity
- Harder tasks: at most 10x more samples required. Simpler tasks: as few as 0.33x the samples required.
SUMMARY: EVOLUTION STRATEGY
- ES is a viable alternative to current RL algorithms (Q-learning: learn the action-value function; policy gradient, e.g. TRPO: learn the policy directly).
- ES: treat the problem like a black box; perturb θ and evaluate the fitness F(θ).
  - No potentially incorrect assumptions about the credit assignment problem (e.g. a time horizon / decaying reward γ).
  - No backprop required.
  - Embarrassingly parallel.
  - Lower GPU memory requirements.