INTRODUCTION TO EVOLUTION STRATEGY ALGORITHMS
James Gleeson, Eric Langlois, William Saunders
REINFORCEMENT LEARNING CHALLENGES
Why is it hard to get a gradient ∇_θ E[F(θ)]?
- F(θ) is a discrete function of θ, so backprop does not apply directly.
- Local minima.
- Sparse reward signal F(θ).
- Credit assignment problem: "Bob got a great bonus this year!" ...but what did Bob do to earn his bonus? Time horizon: 1 year. [+] Met all his deadlines. [+] Took an ML course 3 years ago.
IDEA: Let's just treat F(θ) like a black-box function when optimizing it.
Evolution strategy: "Try different θ's, and see what works. If we find good θ's, keep them, discard the bad ones. Recombine θ_1 and θ_2 to form a new (possibly better) θ_3."
EVOLUTION STRATEGY ALGORITHMS
The template:
- "Sample" a new generation: generate some parameter vectors for your neural networks (e.g. the parameters of an MNIST ConvNet).
- Fitness: evaluate how well each neural network performs on a training set.
- "Prepare" to sample the new generation: given how well each "mutant" performed, apply natural selection! Keep the good ones; the ones that remain "recombine" to form the next generation.
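To make the template concrete, here is a minimal sketch of the loop described above (population sampling plus truncation selection). The names fitness_fn, pop_size, elite_frac and sigma are illustrative assumptions, not from the slides or the paper.

    import numpy as np

    def es_template(fitness_fn, dim, pop_size=50, elite_frac=0.25,
                    sigma=0.1, iterations=100, seed=0):
        # Toy version of the ES template: sample, evaluate, select, recombine.
        rng = np.random.default_rng(seed)
        mean = np.zeros(dim)  # current "parent" parameter vector
        for _ in range(iterations):
            # "Sample" a new generation of parameter vectors (mutants).
            population = mean + sigma * rng.standard_normal((pop_size, dim))
            # Fitness: evaluate how well each mutant performs.
            fitness = np.array([fitness_fn(theta) for theta in population])
            # Natural selection: keep the best elite_frac of the population.
            n_elite = max(1, int(pop_size * elite_frac))
            elite = population[np.argsort(fitness)[-n_elite:]]
            # "Recombine": the next generation is sampled around the elite mean.
            mean = elite.mean(axis=0)
        return mean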
SCARY "TEST FUNCTIONS" (1)
Rastrigin function: lots of local optima; will be difficult to optimize with backprop + SGD!
[Figure: surface and contour plots of the Rastrigin test function]
SCARY "TEST FUNCTIONS" (2)
Schaffer function
WHAT WE WANT TO DO: "TRY DIFFERENT θ's"
[Figures: CMA-ES on the Schaffer and Rastrigin test functions]
Algorithm: CMA-ES
CMA-ES: HIGH-LEVEL OVERVIEW
- Step 1: Calculate the fitness F of the current generation.
- Step 2: Natural selection! Keep the top 25% (purple dots).
- Step 3: Recombine to form the new generation: sample from a Gaussian fit to the survivors. The discrepancy between the mean of the previous generation and the top 25% will cast a wider net!
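A heavily simplified, CMA-ES-flavored sketch of the three steps above, assuming a refit-from-elites update. Real CMA-ES additionally uses evolution paths, rank-one/rank-mu updates and step-size control; this is only meant to illustrate why a large gap between the old mean and the survivors widens the search.

    import numpy as np

    def simplified_cmaes_step(fitness_fn, mean, cov, pop_size=100, rng=None):
        rng = rng or np.random.default_rng()
        # Step 1: sample the current generation and calculate its fitness.
        population = rng.multivariate_normal(mean, cov, size=pop_size)
        fitness = np.array([fitness_fn(x) for x in population])
        # Step 2: natural selection -- keep the top 25% (the "purple dots").
        elite = population[np.argsort(fitness)[-pop_size // 4:]]
        # Step 3: recombine -- refit the Gaussian. Centering on the *old* mean
        # means a large discrepancy between the old mean and the elites inflates
        # the new covariance, so the next generation casts a wider net.
        new_mean = elite.mean(axis=0)
        centered = elite - mean
        new_cov = centered.T @ centered / len(elite) + 1e-8 * np.eye(len(mean))
        return new_mean, new_cov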
ES: LESS COMPUTATIONALLY EXPENSIVE
IDEA: Sample neural-network parameters from a multivariate Gaussian with a diagonal covariance matrix. Update the sampling parameters ψ = [μ, Σ] of p(θ; ψ) using a REINFORCE gradient estimate.
- θ: neural-network parameters.
- ψ: parameters for sampling neural-network parameters.
- Adaptive σ and μ.
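A hedged sketch of this "adaptive σ and μ" idea, assuming a diagonal Gaussian and a plain score-function (REINFORCE) update. The fitness normalisation and the clipping of σ are my own assumptions; practical variants (e.g. PEPG, natural ES) differ in detail.

    import numpy as np

    def diagonal_gaussian_es_step(fitness_fn, mu, sigma, lr=0.01, n=100, rng=None):
        rng = rng or np.random.default_rng()
        eps = rng.standard_normal((n, mu.size))
        thetas = mu + sigma * eps                      # sampled network parameters
        F = np.array([fitness_fn(th) for th in thetas])
        F = (F - F.mean()) / (F.std() + 1e-8)          # normalisation (an assumption)
        # Score-function gradients of log N(theta; mu, diag(sigma^2)):
        #   d/d mu    log p = eps / sigma
        #   d/d sigma log p = (eps^2 - 1) / sigma
        grad_mu = (F[:, None] * eps / sigma).mean(axis=0)
        grad_sigma = (F[:, None] * (eps ** 2 - 1.0) / sigma).mean(axis=0)
        new_sigma = np.maximum(sigma + lr * grad_sigma, 1e-3)  # keep sigma positive
        return mu + lr * grad_mu, new_sigma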
ES: EVEN LESS COMPUTATIONALLY EXPENSIVE
- IDEA: Just use the same σ for every parameter → sample neural-network parameters from an isotropic Gaussian, N(θ, σ²I).
- Seems suspiciously simple... but it can compete!
- OpenAI ES paper:
  - σ is a hyperparameter.
  - 1 set of hyperparameters for Atari.
  - 1 set of hyperparameters for MuJoCo.
  - Competes with A3C and TRPO performance.
  - [Figure: sampling distribution with constant σ and μ, vs. the adaptive σ and μ of the previous slide]
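A minimal sketch of the resulting isotropic-Gaussian update: perturb θ with ε ∼ N(0, I), keep σ fixed, and move θ along the return-weighted average of the perturbations. The simple return standardisation below stands in for the paper's rank-based fitness shaping and is an assumption on my part.

    import numpy as np

    def isotropic_es_step(fitness_fn, theta, sigma=0.1, lr=0.01, n=100, rng=None):
        rng = rng or np.random.default_rng()
        eps = rng.standard_normal((n, theta.size))     # one perturbation direction per mutant
        returns = np.array([fitness_fn(theta + sigma * e) for e in eps])
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Gradient estimate of E_eps[F(theta + sigma * eps)]:
        grad = eps.T @ returns / (n * sigma)
        return theta + lr * grad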
EVOLUTION STRATEGIES AS A SCALABLE ALTERNATIVE TO REINFORCEMENT LEARNING
James Gleeson, Eric Langlois, William Saunders
TODAY'S RL LANDSCAPE AND RECENT SUCCESS
- Discrete action tasks: learning to play Atari from raw pixels; expert-level Go player.
- Continuous action tasks: "hopping" locomotion.
- Q-learning: learn the action-value function. Policy gradient (e.g. TRPO): learn the policy directly.
- In both cases: approximate the function with a neural network, trained using gradients computed via backpropagation (i.e. the chain rule).
MOTIVATION: PROBLEMS WITH BACKPROPAGATION
- Backpropagation isn't perfect:
  - GPU memory requirements.
  - Difficult to parallelize.
  - Cannot be applied directly to non-differentiable functions, e.g. discrete functions F(θ) (the topic of this course).
  - Exploding gradients (e.g. for RNNs).
- Setting: you have a datacenter, and cycles to spend on an RL problem.
AN ALTERNATIVE TO BACKPROPAGATION: EVOLUTION STRATEGY (ES)
Claim: we can estimate the gradient of the objective F(θ) with no derivatives of F(θ) and no chain rule / backprop required, and have it be embarrassingly parallel.
Proof sketch (2nd-order Taylor series): F(θ + σε) ≈ F(θ) + σ ε^T ∇F(θ) + (σ²/2) ε^T H ε, so
(1/σ) E_{ε∼N(0,I)}[F(θ + σε) ε] ≈ (1/σ) F(θ) E[ε] + E[ε ε^T] ∇F(θ) + (σ/2) E[(ε^T H ε) ε] = ∇F(θ),
since F(θ) is independent of ε (so the first term vanishes, E[ε] = 0), E[ε ε^T] = I, and the third (odd-moment) term is zero.
Relevant to our course: F(θ) could be a discrete function of θ.
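A small numerical sanity check of the claim (my own illustration, not from the slides): on a quadratic F whose true gradient is known in closed form, the estimator (1/(nσ)) Σ_i F(θ + σε_i) ε_i recovers ∇F(θ) without ever differentiating F. Subtracting the baseline F(θ) leaves the estimate unbiased (since E[ε] = 0) and only reduces variance.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.diag([1.0, 3.0, 0.5])
    F = lambda th: -0.5 * th @ A @ th          # black-box objective; no autodiff anywhere
    theta = np.array([1.0, -2.0, 0.5])

    sigma, n = 0.05, 20000
    eps = rng.standard_normal((n, 3))
    returns = np.array([F(theta + sigma * e) for e in eps]) - F(theta)  # baseline F(theta)
    estimate = eps.T @ returns / (n * sigma)

    print("ES estimate  :", estimate)          # close to the true gradient -A @ theta
    print("true gradient:", -A @ theta)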
THE MAIN CONTRIBUTION OF THIS PAPER
- Criticisms:
  - Evolution strategies aren't new!
  - Common sense: the variance/bias of this gradient estimator will be too high, making the algorithm unstable on today's problems!
- This paper aims to refute your common sense:
  - Comparison against state-of-the-art RL algorithms:
    - Atari: half the games do better than a recent algorithm (A3C), half the games do worse.
    - MuJoCo: can match state-of-the-art policy gradients on continuous action tasks.
  - Linear speedups with more compute nodes: 1 day with A3C → 1 hour with ES.
FIRST ATTEMPT AT ES: THE SEQUENTIAL ALGORITHM
- In RL, the fitness F(θ) is defined as the total reward of an episode played with policy parameters θ.
- Gradient estimator needed for updating θ: sample ε_1, ..., ε_n ∼ N(0, I), then
  θ ← θ + η (1 / (nσ)) Σ_{i=1..n} F(θ + σ ε_i) ε_i
- Generate n random perturbations of θ, sequentially run each mutant_i (i = 1..n) to compute F_i, then compute the gradient estimate.
- Running the mutants is embarrassingly parallel: each mutant_i can compute its F_i in parallel.
SECOND ATTEMPT: THE PARALLEL ALGORITHM
- Embarrassingly parallel: with all F_i and ε_i known by everyone, each worker computes the same gradient estimate.
- KEY IDEA: minimize communication cost. Avoid sending the perturbed parameters (message size |θ|); send only the scalar return (len(F_i) = 1) instead.
- How? Each worker reconstructs every random perturbation vector ε_i itself. ...How? Make the initial random seed of each mutant_i globally known.
- Tradeoff: redundant computation of size |θ| on every worker, in exchange for tiny messages.
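A sketch of the communication trick (hypothetical code, not the paper's implementation): each worker derives its perturbation from a globally known seed, so the only value that crosses the network is the scalar return F_i; every worker can then regenerate all ε_i locally and apply the identical update.

    import numpy as np

    def worker_return(worker_id, theta, sigma, fitness_fn):
        # The seed is a known function of worker_id, so anyone can regenerate eps_i.
        eps = np.random.default_rng(worker_id).standard_normal(theta.size)
        return fitness_fn(theta + sigma * eps)   # a single scalar is sent over the network

    def apply_update_everywhere(theta, sigma, lr, returns):
        # Run identically on every worker after an all-gather of the scalar returns.
        grad = np.zeros_like(theta)
        for worker_id, F_i in enumerate(returns):
            eps = np.random.default_rng(worker_id).standard_normal(theta.size)
            grad += F_i * eps                    # reconstruct eps_i instead of receiving it
        return theta + lr * grad / (len(returns) * sigma)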
EXPERIMENT: HOW WELL DOES IT SCALE?
- Linearly! With diminishing returns, which is often inevitable.
- [Figure: actual speedup vs. ideal (perfectly linear) speedup; roughly 200 cores, 60 minutes]
- Criticism: are the diminishing returns due to (a) increased communication cost from more workers, or (b) less reduction in the variance of the gradient estimate from more workers?
INTRINSIC DIMENSIONALITY OF THE PROBLEM
- Concern: the ES update is essentially a finite difference in some random direction ε → does the number of update steps scale with |θ|?
- Argument: the number of update steps in ES scales with the intrinsic dimensionality of θ needed for the problem, not with the length of θ.
- Justification, e.g. simple linear regression: double |θ| → |θ'| by duplicating each feature (x_1 ∼ x_2 ∼ N(μ, σ²)). After adjusting η and σ, the update step has the same effect → same number of update steps.
WHEN IS ES A BETTER CHOICE THAN POLICY GRADIENTS?
How do we compute gradients?
- Policy gradients: the policy network outputs a softmax over discrete actions, and we sample an action randomly at every timestep. The variance of the gradient estimate grows (potentially) linearly with the length of the episode: the credit assignment problem. A discount factor γ only fixes this for short-term returns!
- Evolution strategy (ES): we randomly perturb our parameters once per episode, θ' = θ + σε, then select actions according to π_θ'. The variance is independent of episode length, so ES makes fewer (potentially incorrect) assumptions.
- ASIDE, in case you forgot: for independent X and Y, Var(X + Y) = Var(X) + Var(Y); summing one noisy term per timestep is what makes the policy-gradient variance grow with episode length.
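A compact way to see the contrast (my own summary of the argument, using the slides' notation):

    % Policy gradient: one score-function term per timestep, so for roughly
    % independent terms the variance grows with the episode length T:
    \nabla_\theta \mathbb{E}[R]
      = \mathbb{E}\Big[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
    % ES: a single perturbation per episode, so the estimator does not
    % accumulate one noisy term per timestep:
    \nabla_\theta \, \mathbb{E}_{\epsilon \sim N(0, I)}\big[ F(\theta + \sigma\epsilon) \big]
      = \frac{1}{\sigma} \, \mathbb{E}_{\epsilon \sim N(0, I)}\big[ F(\theta + \sigma\epsilon)\, \epsilon \big]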
EXPERIMENT: ES ISN'T SENSITIVE TO THE LENGTH OF THE EPISODE τ
- Frame-skip F: the agent can select an action every F frames of input pixels (same policy). E.g. F = 4: on frame 1 the agent selects an action; on the next 3 frames it is forced to take the Noop action.
- IDEA: artificially inflate the length of an episode τ (playing Pong with frame-skip).
- Argument: since the ES algorithm doesn't make any assumptions about the time horizon (e.g. a decaying reward γ), it is less sensitive to long episodes τ (i.e. the credit assignment problem).
EXPERIMENT: LEARNED PERFORMANCE
The authors looked at:
- discrete action tasks (Atari)
- continuous action tasks (MuJoCo)
EXPERIMENT: DISCRETE ACTION TASKS (ATARI)
- Paper's claim: "Given the same amount of compute time as other algorithms, compared to A3C, ES does better on 21 games, worse on 29" (50 games in total).
- Slightly misleading if you aren't reading carefully: counting which algorithm achieves the best score per game across all algorithms (4 / 19 / 7 / 9 / 11 games, i.e. 8% / 38% / 14% / 18% / 22%), A3C still does better on most games → ES is still beaten by other algorithms even when it beats A3C.
EXPERIMENT: CONTINUOUS ACTION TASKS (MUJOCO)
- Sample complexity: how many steps in the environment were needed to reach X% of policy-gradient performance? Ratio = (# ES timesteps) / (# TRPO timesteps):
  - < 1 → better sample complexity
  - > 1 → worse sample complexity
- Harder tasks: at most 10x more samples required. Simpler tasks: as few as 0.33x the samples required.
SUMMARY: EVOLUTION STRATEGY
- ES is a viable alternative to current RL algorithms (Q-learning: learn the action-value function; policy gradient, e.g. TRPO: learn the policy directly).
- ES: treat the problem like a black box; perturb θ and evaluate the fitness F(θ).
  - No potentially incorrect assumptions about the credit assignment problem (e.g. a time horizon / decaying reward γ).
  - No backprop required.
  - Embarrassingly parallel.
  - Lower GPU memory requirements.