Deep Reinforcement Learning, John Schulman, MLSS, May 2016, Cadiz


  1. Deep Reinforcement Learning. John Schulman, MLSS, May 2016, Cadiz. Berkeley Artificial Intelligence Research Lab

  2. Agenda: Introduction and Overview; Markov Decision Processes; Reinforcement Learning via Black-Box Optimization; Policy Gradient Methods; Variance Reduction for Policy Gradients; Trust Region and Natural Gradient Methods; Open Problems. Course materials: goo.gl/5wsgbJ

  3. Introduction and Overview

  4. What is Reinforcement Learning? ◮ Branch of machine learning concerned with taking sequences of actions ◮ Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward (diagram: the agent sends an action to the environment and receives an observation and a reward in return)

  5. Motor Control and Robotics Robotics: ◮ Observations: camera images, joint angles ◮ Actions: joint torques ◮ Rewards: stay balanced, navigate to target locations, serve and protect humans

  6. Business Operations ◮ Inventory Management ◮ Observations: current inventory levels ◮ Actions: number of units of each item to purchase ◮ Rewards: profit ◮ Resource allocation: who to provide customer service to first ◮ Routing problems: in management of shipping fleet, which trucks / truckers to assign to which cargo

  7. Games A different kind of optimization problem (min-max) but still considered to be RL. ◮ Go (complete information, deterministic) – AlphaGo [2] ◮ Backgammon (complete information, stochastic) – TD-Gammon [3] ◮ Stratego (incomplete information, deterministic) ◮ Poker (incomplete information, stochastic) [2] David Silver, Aja Huang, et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489. [3] Gerald Tesauro. “Temporal difference learning and TD-Gammon”. In: Communications of the ACM 38.3 (1995), pp. 58–68.

  8. Approaches to RL (taxonomy diagram): on one side, Policy Optimization (DFO / Evolution, Policy Gradients); on the other, Dynamic Programming (Policy Iteration, Value Iteration, Q-Learning); Actor-Critic Methods and modified policy iteration combine the two families

  9. What is Deep RL? ◮ RL using nonlinear function approximators ◮ Usually, updating parameters with stochastic gradient descent

  10. What’s Deep RL? Whatever the front half of the cerebral cortex does (motor and executive cortices)

  11. Markov Decision Processes

  12. Definition ◮ Markov Decision Process (MDP) defined by (S, A, P), where ◮ S: state space ◮ A: action space ◮ P(r, s′ | s, a): a transition probability distribution ◮ Extra objects defined depending on problem setting ◮ µ: initial state distribution ◮ γ: discount factor
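To make the tuple (S, A, P) and the extra objects µ and γ concrete, here is a minimal sketch of a finite MDP with tabular dynamics; the class name, the reset/step interface, and the simplification that the reward is a deterministic function of (s, a) are illustrative choices, not part of the slides.

```python
import numpy as np

class TabularMDP:
    """Finite MDP: states and actions are integer indices 0..nS-1 and 0..nA-1."""
    def __init__(self, P, R, mu, gamma=0.99):
        self.P = P          # P[s, a] = vector of next-state probabilities, shape (nS, nA, nS)
        self.R = R          # R[s, a] = reward for taking action a in state s, shape (nS, nA)
        self.mu = mu        # initial state distribution, shape (nS,)
        self.gamma = gamma  # discount factor
        self.nS, self.nA = R.shape

    def reset(self, rng):
        return rng.choice(self.nS, p=self.mu)         # s0 ~ mu

    def step(self, s, a, rng):
        s_next = rng.choice(self.nS, p=self.P[s, a])  # s' ~ P(. | s, a)
        return s_next, self.R[s, a]                   # (next state, reward)
```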

  13. Episodic Setting ◮ In each episode, the initial state is sampled from µ , and the process proceeds until the terminal state is reached. For example: ◮ Taxi robot reaches its destination (termination = good) ◮ Waiter robot finishes a shift (fixed time) ◮ Walking robot falls over (termination = bad) ◮ Goal: maximize expected reward per episode

  14. Policies ◮ Deterministic policies: a = π ( s ) ◮ Stochastic policies: a ∼ π ( a | s ) ◮ Parameterized policies: π θ

  15. Episodic Setting
      s_0 ∼ µ(s_0)
      a_0 ∼ π(a_0 | s_0)
      s_1, r_0 ∼ P(s_1, r_0 | s_0, a_0)
      a_1 ∼ π(a_1 | s_1)
      s_2, r_1 ∼ P(s_2, r_1 | s_1, a_1)
      . . .
      a_{T−1} ∼ π(a_{T−1} | s_{T−1})
      s_T, r_{T−1} ∼ P(s_T, r_{T−1} | s_{T−1}, a_{T−1})
      Objective: maximize η(π), where η(π) = E[r_0 + r_1 + · · · + r_{T−1} | π]
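The sampling chain above translates directly into a rollout loop. A minimal sketch, assuming an environment with the same reset/step interface as the TabularMDP sketch earlier and a policy(s, rng) function that samples a_t ∼ π(a_t | s_t):

```python
def rollout(env, policy, rng, T):
    """Sample one fixed-horizon episode and return r_0 + r_1 + ... + r_{T-1}."""
    s = env.reset(rng)              # s_0 ~ mu
    total_reward = 0.0
    for t in range(T):
        a = policy(s, rng)          # a_t ~ pi(a_t | s_t)
        s, r = env.step(s, a, rng)  # s_{t+1}, r_t ~ P(. | s_t, a_t)
        total_reward += r
    return total_reward
```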

  16. Episodic Setting (diagram: the agent, with policy π, emits actions a_0, . . . , a_{T−1}; the environment, with dynamics P and initial state distribution µ_0, returns states s_0, . . . , s_T and rewards r_0, . . . , r_{T−1}). Objective: maximize η(π), where η(π) = E[r_0 + r_1 + · · · + r_{T−1} | π]

  17. Parameterized Policies ◮ A family of policies indexed by parameter vector θ ∈ R d ◮ Deterministic: a = π ( s , θ ) ◮ Stochastic: π ( a | s , θ ) ◮ Analogous to classification or regression with input s , output a . E.g. for neural network stochastic policies: ◮ Discrete action space: network outputs vector of probabilities ◮ Continuous action space: network outputs mean and diagonal covariance of Gaussian
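As a sketch of the two output conventions above, here are a categorical policy and a diagonal-Gaussian policy; for brevity the "network" is a single linear layer, and all names and shapes are illustrative rather than anything specified on the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def categorical_policy(s, theta, rng):
    """Discrete actions: theta has shape (state_dim, n_actions); the 'network'
    outputs a vector of action probabilities and we sample an action from it."""
    probs = softmax(s @ theta)
    return rng.choice(len(probs), p=probs)

def gaussian_policy(s, theta, rng):
    """Continuous actions: theta = (W, log_std); the 'network' outputs the mean,
    with a state-independent diagonal covariance exp(log_std)**2."""
    W, log_std = theta
    mean = s @ W
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)
```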

  18. Reinforcement Learning via Black-Box Optimization

  19. Derivative-Free Optimization Approach ◮ Objective: maximize E[R | π(·, θ)] ◮ View the whole mapping θ → (rollout) → R as a black box ◮ Ignore all information other than the R collected during the episode

  20. Cross-Entropy Method ◮ Evolutionary algorithm ◮ Works embarrassingly well István Szita and András Lőrincz. “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation 18.12 (2006), pp. 2936–2941. Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. “Approximate Dynamic Programming Finally Performs Well in the Game of Tetris”. In: Advances in Neural Information Processing Systems. 2013

  21. Cross-Entropy Method ◮ Evolutionary algorithm ◮ Works embarrassingly well ◮ A similar algorithm, Covariance Matrix Adaptation, has become standard in graphics:

  22. Cross-Entropy Method
      Initialize µ ∈ R^d, σ ∈ R^d
      for iteration = 1, 2, . . . do
          Collect n samples of θ_i ∼ N(µ, diag(σ))
          Perform a noisy evaluation R_i ∼ θ_i
          Select the top p% of samples (e.g. p = 20), which we'll call the elite set
          Fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new µ, σ
      end for
      Return the final µ
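A direct numpy translation of this pseudocode is sketched below; evaluate(theta) stands in for running the policy π(·, θ) for an episode and returning its noisy return R_i, and the default hyperparameters are arbitrary choices, not values from the slides.

```python
import numpy as np

def cross_entropy_method(evaluate, dim, n=100, p=20, n_iters=50, seed=0):
    """Maximize E[R | theta], treating theta -> R as a black box."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, n * p // 100)
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((n, dim))  # theta_i ~ N(mu, diag(sigma^2))
        returns = np.array([evaluate(th) for th in thetas])  # noisy evaluations R_i
        elite = thetas[np.argsort(returns)[-n_elite:]]       # top p% of samples
        mu = elite.mean(axis=0)                              # refit diagonal Gaussian
        sigma = elite.std(axis=0) + 1e-6                     # small floor to keep exploring
    return mu
```

For instance, cross_entropy_method(lambda th: -np.sum((th - 3.0) ** 2), dim=5) should drive µ close to the optimum at 3 within a few dozen iterations.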

  23. Cross-Entropy Method ◮ Analysis: a very similar algorithm is a minorization-maximization (MM) algorithm, guaranteed to monotonically increase the expected reward ◮ Recall that the Monte-Carlo EM algorithm collects samples, reweights them, and then maximizes their log-probability ◮ We can derive an MM algorithm in which each iteration maximizes Σ_i R_i log p(θ_i)
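For a Gaussian with diagonal covariance, the per-iteration maximization of Σ_i R_i log p(θ_i) has a closed form: a reward-weighted mean and a reward-weighted variance. A minimal sketch, assuming the weights R_i have already been shifted to be non-negative:

```python
import numpy as np

def reward_weighted_fit(thetas, returns):
    """argmax_{mu, sigma} sum_i R_i log N(theta_i; mu, diag(sigma^2)).
    thetas: (n, d) samples, returns: (n,) non-negative weights."""
    w = returns / returns.sum()
    mu = w @ thetas                    # reward-weighted mean
    var = w @ (thetas - mu) ** 2       # reward-weighted variance per coordinate
    return mu, np.sqrt(var) + 1e-6
```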

  24. Policy Gradient Methods

  25. Policy Gradient Methods: Overview Problem: maximize E [ R | π θ ] Intuitions: collect a bunch of trajectories, and ... 1. Make the good trajectories more probable 2. Make the good actions more probable (actor-critic, GAE) 3. Push the actions towards good actions (DPG, SVG)

  26. Score Function Gradient Estimator ◮ Consider an expectation E_{x∼p(x|θ)}[f(x)]. Want to compute the gradient with respect to θ:
      ∇_θ E_x[f(x)] = ∇_θ ∫ dx p(x|θ) f(x)
                    = ∫ dx ∇_θ p(x|θ) f(x)
                    = ∫ dx p(x|θ) (∇_θ p(x|θ) / p(x|θ)) f(x)
                    = ∫ dx p(x|θ) ∇_θ log p(x|θ) f(x)
                    = E_x[f(x) ∇_θ log p(x|θ)]
      ◮ The last expression gives us an unbiased gradient estimator: just sample x_i ∼ p(x|θ) and compute ĝ_i = f(x_i) ∇_θ log p(x_i|θ) ◮ Need to be able to compute and differentiate the density p(x|θ) with respect to θ
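A tiny numerical sketch of this estimator, choosing x ∼ N(θ, 1) so that the score ∇_θ log p(x | θ) = x − θ is available in closed form; the particular f and distribution are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.0, 100_000
f = lambda x: (x > 2.0).astype(float)   # f may be discontinuous -- only its values are used

x = theta + rng.standard_normal(n)      # x_i ~ N(theta, 1)
score = x - theta                       # grad_theta log N(x_i | theta, 1)
g_hat = np.mean(f(x) * score)           # score-function estimate of d/dtheta E[f(x)]

# Analytic check: E[f(x)] = P(x > 2) = Phi(theta - 2), so the true gradient is phi(2 - theta)
true_grad = np.exp(-(2.0 - theta) ** 2 / 2) / np.sqrt(2 * np.pi)
print(g_hat, true_grad)                 # the two numbers should agree to about two decimals
```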

  27. Derivation via Importance Sampling. An alternate derivation using importance sampling:
      E_{x∼θ}[f(x)] = E_{x∼θ_old}[ (p(x|θ) / p(x|θ_old)) f(x) ]
      ∇_θ E_{x∼θ}[f(x)] = E_{x∼θ_old}[ (∇_θ p(x|θ) / p(x|θ_old)) f(x) ]
      ∇_θ E_{x∼θ}[f(x)] |_{θ=θ_old} = E_{x∼θ_old}[ (∇_θ p(x|θ)|_{θ=θ_old} / p(x|θ_old)) f(x) ]
                                    = E_{x∼θ_old}[ ∇_θ log p(x|θ)|_{θ=θ_old} f(x) ]

  28. Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ) ◮ Let's say that f(x) measures how good the sample x is ◮ Moving in the direction ĝ_i pushes up the log-probability of the sample, in proportion to how good it is ◮ Valid even if f(x) is discontinuous or unknown, or the sample space (containing x) is a discrete set

  29.-30. Score Function Gradient Estimator: Intuition (illustration slides for ĝ_i = f(x_i) ∇_θ log p(x_i | θ))

  31. Score Function Gradient Estimator for Policies ◮ Now the random variable x is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . . , s_{T−1}, a_{T−1}, r_{T−1}, s_T)
      ∇_θ E_τ[R(τ)] = E_τ[∇_θ log p(τ|θ) R(τ)]
      ◮ Just need to write out p(τ|θ):
      p(τ|θ) = µ(s_0) ∏_{t=0}^{T−1} π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)
      log p(τ|θ) = log µ(s_0) + Σ_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]
      ∇_θ log p(τ|θ) = Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)
      ∇_θ E_τ[R] = E_τ[ R Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ]
      ◮ Interpretation: using good trajectories (high R) as supervised examples in classification / regression

  32. Policy Gradient: Slightly Better Formula ◮ Previous slide:
      ∇_θ E_τ[R] = E_τ[ (Σ_{t=0}^{T−1} r_t) (Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)) ]
      ◮ But we can cut the trajectory to t′ steps and derive a gradient estimator for a single reward term r_{t′}:
      ∇_θ E[r_{t′}] = E[ r_{t′} Σ_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ) ]
      ◮ Summing this formula over t′, we obtain
      ∇_θ E[R] = E[ Σ_{t′=0}^{T−1} r_{t′} Σ_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ) ]
               = E[ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Σ_{t′=t}^{T−1} r_{t′} ]
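A sketch of this reward-to-go estimator for a linear softmax policy (the same form as the categorical policy sketched earlier); the trajectory format (states, actions, rewards) and the function names are illustrative conventions, not fixed by the slides.

```python
import numpy as np

def grad_log_pi(s, a, theta):
    """grad_theta log pi(a | s, theta) for pi(. | s) = softmax(s @ theta)."""
    logits = s @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    onehot = np.zeros_like(probs)
    onehot[a] = 1.0
    return np.outer(s, onehot - probs)   # shape (state_dim, n_actions), same as theta

def policy_gradient(trajectories, theta):
    """Estimate grad_theta E[R] = E[ sum_t grad log pi(a_t|s_t,theta) * sum_{t'>=t} r_{t'} ]."""
    g = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        rewards_to_go = np.cumsum(np.asarray(rewards)[::-1])[::-1]  # sum_{t'=t}^{T-1} r_{t'}
        for s, a, rtg in zip(states, actions, rewards_to_go):
            g += grad_log_pi(s, a, theta) * rtg
    return g / len(trajectories)
```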

  33. Adding a Baseline ◮ Suppose f(x) ≥ 0 for all x ◮ Then for every x_i, the gradient estimator ĝ_i tries to push up its density ◮ We can derive a new unbiased estimator that avoids this problem, and only pushes up the density for better-than-average x_i:
      ∇_θ E_x[f(x)] = ∇_θ E_x[f(x) − b] = E_x[∇_θ log p(x|θ) (f(x) − b)]
      ◮ A near-optimal choice of b is always E[f(x)] (which must be estimated)
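In sample terms, adding the baseline just replaces f(x_i) by f(x_i) − b in the weight applied to each score. A minimal sketch, using the batch mean of f as the estimated baseline:

```python
import numpy as np

def baselined_gradient(grad_logps, fvals):
    """grad_logps[i] = grad_theta log p(x_i | theta); fvals[i] = f(x_i).
    With b = mean(fvals), only better-than-average samples have their density pushed up;
    the estimator stays unbiased because E[grad_theta log p(x|theta)] = 0."""
    b = np.mean(fvals)
    return np.mean([(f - b) * g for g, f in zip(grad_logps, fvals)], axis=0)
```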
