  1. Breakout Group Reinforcement Learning Fabian Ruehle (University of Oxford) String_Data 2017, Boston 12/01/2017

  2. Outline ‣ Theoretical introduction (30 minutes) ‣ Discussion of code (30 minutes) • Solve a version of grid world with SARSA ‣ Discussion of RL and its applications to String Theory (30 minutes)

  3. How to teach a machine ‣ Supervised Learning (SL): • provide a set of training tuples [(in_0, out_0), (in_1, out_1), ..., (in_n, out_n)] • after training, the machine predicts out_i from in_i ‣ Unsupervised Learning (UL): • only provide a set of training inputs [in_0, in_1, ..., in_n] • give the machine a task (e.g. cluster the input) without telling it exactly how to do it • after training, the machine performs the self-learned action on in_i ‣ Reinforcement Learning (RL): • in between SL and UL • the machine acts autonomously, but its actions are reinforced / punished

  4. Theoretical introduction

  5. Reinforcement Learning - Vocabulary ‣ Basic textbooks/literature [Barto, Sutton ’98, ’17] ‣ The “thing that learns” is called agent or worker ‣ The “thing that is explored” is called environment ‣ The “elements of the environment” are called states or observations ‣ The “things that take you from one state to another” are called actions ‣ The “thing that tells you how to select the next action” is called policy ‣ Actions are executed sequentially in a sequence called (time) steps ‣ The “reinforcement” the agent experiences is called reward ‣ The “accumulated reward” is called return ‣ In RL, an agent performs actions in an environment with the goal to maximize its long-term return

  6. Reinforcement Learning - Details ‣ We focus on discrete state and action spaces ‣ State space S = { states in environment } ‣ Action space • total: A = { actions to transition between states } • for s ∈ S: A(s) = { possible actions in state s } ‣ Policy π: selects the next action for a given state, π : S ↦ A, π(s) = a with a ∈ A(s) ‣ Reward R(s, a) ∈ ℝ: reward for taking action a in state s, R : S × A ↦ ℝ
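The vocabulary above maps directly onto simple data structures. The following is a minimal Python sketch (not the workshop code) of a discrete state space S, per-state action sets A(s), a reward function R(s, a), and a trivial policy π; the toy three-state chain and all names are illustrative assumptions.

```python
# Minimal sketch of the RL vocabulary for a toy 3-state chain environment.
# All names and the chain itself are illustrative assumptions.

S = [0, 1, 2]  # state space S: three states on a chain

def actions(s):
    """A(s): actions available in state s (you cannot step off the chain)."""
    if s == 0:
        return ["right"]
    if s == 2:
        return ["left"]
    return ["left", "right"]

def reward(s, a):
    """R(s, a): +1 for stepping right out of state 1, small penalty otherwise."""
    return 1.0 if (s == 1 and a == "right") else -0.1

def policy(s):
    """pi(s) = a with a in A(s); here simply the first available action."""
    return actions(s)[0]
```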

  7. Reinforcement Learning - Details ‣ Return: the accumulated reward from the current step t: G_t = Σ_{k=0}^∞ γ^k r_{t+k+1}, with γ ∈ (0, 1] ‣ State value function v_π(s): expected return for state s under policy π: v_π(s) = E[G_t | s = s_t] ‣ Action value function q_π(s, a): expected return for performing action a in state s under policy π: q_π(s, a) = E[G_t | s = s_t, a = a_t] ‣ Prediction problem: given π, predict v_π(s) or q_π(s, a) ‣ Control problem: find the optimal policy π* that maximizes v_π(s) or q_π(s, a)
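As a small illustration of the return formula above, the sketch below sums discounted rewards for a finite reward sequence; the function name and the truncation of the infinite sum to a finite list are assumptions made for the example.

```python
# Minimal sketch: discounted return G_t = sum_k gamma^k * r_{t+k+1}
# for a finite reward sequence (illustrative only).

def discounted_return(rewards, gamma=0.9):
    """Return G_t for a list of future rewards [r_{t+1}, r_{t+2}, ...]."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([-0.1, -0.1, 1.0], gamma=0.9))  # -0.1 - 0.09 + 0.81 = 0.62
```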

  8. Reinforcement Learning - Details ‣ Commonly used policies: • greedy: choose the action that maximizes the action value function: π′(s) = argmax_a q(s, a) • ε-greedy: explore different possibilities: π′(s) = greedy action in (1 − ε) of the cases, random action in ε of the cases ‣ We take ε-greedy policy improvement ‣ On-policy: update the policy you are following (e.g. always ε-greedy) ‣ Off-policy: use a different policy for choosing the next action a_{t+1} and for updating q(s_t, a_t)
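An ε-greedy policy as described above fits in a few lines. The sketch below assumes a tabular action value function stored as a Python dict keyed by (state, action); the names epsilon_greedy and q are illustrative, not taken from the workshop code.

```python
import random

def epsilon_greedy(q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one.

    q is a dict {(state, action): value}; unseen pairs default to 0.
    """
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: q.get((s, a), 0.0))   # exploit
```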

  9. Reinforcement Learning - SARSA ‣ Solving the control problem: ∆v(s_t) = α [G_t − v(s_t)] • α: learning rate (α = 0 means no update to v(s_t)) • one-step approximation: G_t = r + γ v(s_{t+1}) ‣ Similar for the action value function: ∆q(s_t, a_t) = α [G_t − q(s_t, a_t)] = α [r + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)] • the update depends on the tuple (s_t, a_t, r, s_{t+1}, a_{t+1}) • a_{t+1} is the currently best known action for state s_{t+1} ‣ Note: SARSA is on-policy
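A sketch of the tabular SARSA update for a single (s_t, a_t, r, s_{t+1}, a_{t+1}) tuple, again assuming a dict-based q table; this illustrates the update rule above and is not the workshop implementation.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply Δq(s, a) = α [r + γ q(s', a') − q(s, a)] in place."""
    target = r + gamma * q.get((s_next, a_next), 0.0)   # one-step return estimate
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
```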

  10. Reinforcement Learning - Q-Learning ‣ Very similar to SARSA ‣ Difference in the update: • SARSA: ∆q(s_t, a_t) = α [r + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)] • Q-Learning: ∆q(s_t, a_t) = α [r + γ max_{a′} q(s_{t+1}, a′) − q(s_t, a_t)] ‣ Note: this means that Q-Learning is off-policy ‣ SARSA is often found to perform better in practice ‣ Q-Learning is proven to converge to the optimal solution ‣ Combine with deep neural networks: Deep Q-Learning
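For comparison, a sketch of the corresponding Q-Learning update; the only change relative to the SARSA sketch above is the max over the actions available in the next state, which is what makes the update off-policy.

```python
def q_learning_update(q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Apply Δq(s, a) = α [r + γ max_a' q(s', a') − q(s, a)] in place."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in next_actions)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + gamma * best_next - q.get((s, a), 0.0))
```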

  11. Example - Gridworld [Figure: grid maze showing the worker (“Explorer”), pitfalls, the exit, and walls]

  12. Example - Gridworld ‣ We will look at a version of grid world: • Gridworld is a grid-like maze with walls, pitfalls, and an exit • Each state is a point on the grid of the maze • The actions are A = { up, down, left, right } • Goal: find the exit (strongly rewarded) • Each step is punished mildly (solve the maze quickly) • Pitfalls should be avoided (strongly punished) • Running into a wall does not change the state
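A minimal gridworld step function along these lines might look as follows; the grid layout, reward magnitudes, and names are illustrative assumptions, not the values used in the workshop code.

```python
# Minimal gridworld sketch: '#' = wall, 'P' = pitfall, 'E' = exit, '.' = empty.
GRID = ["#####",
        "#..E#",
        "#.#P#",
        "#...#",
        "#####"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done); running into a wall keeps the state."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if GRID[nr][nc] == "#":              # wall: no state change
        nr, nc = r, c
    cell = GRID[nr][nc]
    if cell == "E":
        return (nr, nc), 10.0, True      # exit: strongly rewarded
    if cell == "P":
        return (nr, nc), -10.0, True     # pitfall: strongly punished
    return (nr, nc), -0.1, False         # ordinary step: mild punishment
```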

  13. Gridworld vs String Landscape ‣ Walls = Boundaries of landscape (negative number of branes) ‣ Empty square = Consistent point in the landscape which does not correspond to our Universe ‣ Pitfalls = Mathematically / Physically inconsistent states (anomalies, tadpoles, …) ‣ Exit = Standard Model of Particle Physics

  14. Coding

  15. Discussion
