N-step bootstrapping


  1. N-step bootstrapping. Robert Platt, Northeastern University. "I'd love to use my experiences more efficiently..."

  2. Motivation Left: path taken by the agent in a grid world; the reward is zero everywhere except at the goal state, where it is positive. Middle: 1-step SARSA updates only the penultimate state/action pair. Problem: standard Q-learning/SARSA "propagates reward" only one state back per time step – n-step bootstrapping is one way to address this problem – we will see other ways in subsequent slide decks.

  3. TD and MC are two extremes of a continuum [Figure: spectrum of backup diagrams, from the 1-step TD backup through the full Monte Carlo backup]

  4. TD and MC are two extremes of a continuum What are these?

  5. TD and MC are two extremes of a continuum

  6. TD and MC are two extremes of a continuum Update equation (1-step TD): $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$

  7. TD and MC are two extremes of a continuum The quantity $R_{t+1} + \gamma V(S_{t+1})$ is called the target of the update. Update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$

  8. TD and MC are two extremes of a continuum What's the target for this one (the 3-step backup)? Update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[G - V(S_t)]$, where $G$ is the target.

  9. TD and MC are two extremes of a continuum What's the target for this one? Complete update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3}) - V(S_t)]$

  10. TD and MC are two extremes of a continuum What's the target for this one? Complete update equation, writing the 3-step target as $G_{t:t+3} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3})$: $V(S_t) \leftarrow V(S_t) + \alpha\,[G_{t:t+3} - V(S_t)]$

  11. TD and MC are two extremes of a continuum What's the target for this one? Complete update equation, for a general n-step return: $V(S_t) \leftarrow V(S_t) + \alpha\,[G_{t:t+n} - V(S_t)]$ with $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$

  12. TD and MC are two extremes of a continuum What's the target for this one? Complete update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[G_{t:t+3} - V(S_t)]$ Notice that you can't do this update until time step t+3 – the TD update happens on the next time step – the MC update happens at the end of the episode – the n-step TD update happens at time step t+n
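To make the targets above concrete, here is a minimal Python sketch (not from the slides) of computing an n-step return from a recorded episode. The function name `n_step_return` and the convention that rewards[k] holds R_{k+1} are assumptions for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_{t:t+n}: up to n discounted rewards plus a
    bootstrapped value estimate for the state reached at time t+n.

    Assumed convention: rewards[k] is R_{k+1}, the reward received
    on leaving step k; values[k] is the current estimate V(S_k).
    """
    T = len(rewards)            # the episode terminates at time T
    horizon = min(t + n, T)     # truncate the sum at the end of the episode
    G = 0.0
    for k in range(t, horizon):
        G += (gamma ** (k - t)) * rewards[k]
    if t + n < T:               # bootstrap only if S_{t+n} is non-terminal
        G += (gamma ** n) * values[t + n]
    return G
```

With n = 1 this reduces to the TD(0) target R_{t+1} + γV(S_{t+1}); with n ≥ T − t the bootstrap term vanishes and it reduces to the Monte Carlo return, matching the continuum above.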

  13. How well does this work? This comparison is for: – a 19-state random walk – n-step TD policy evaluation
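For reference, the random walk in that comparison is easy to simulate. This is a sketch under the usual setup for the 19-state version (start in the center; reward −1 for stepping off the left end, +1 off the right end, 0 otherwise); the function name and interface are assumptions, not code from the slides.

```python
import numpy as np

def random_walk_episode(n_states=19, rng=None):
    """Generate one episode of the n-state random walk: start in the
    center state and move left or right with equal probability until
    stepping off either end. Returns (states, rewards) with the same
    indexing convention as n_step_return above."""
    if rng is None:
        rng = np.random.default_rng()
    s = n_states // 2                 # center state
    states, rewards = [s], []
    while True:
        s += rng.choice([-1, 1])      # equiprobable left/right step
        if s < 0:
            rewards.append(-1.0)      # left terminal
            return states, rewards
        if s >= n_states:
            rewards.append(1.0)       # right terminal
            return states, rewards
        rewards.append(0.0)           # interior transitions: zero reward
        states.append(s)
```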

  14. n-step TD algorithm
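A minimal tabular implementation of the algorithm on this slide, following the standard n-step TD pseudocode (Sutton & Barto, ch. 7). The `env.reset()`/`env.step()` interface and the `policy` callable are assumed names for illustration.

```python
import numpy as np

def n_step_td(env, policy, n, alpha, gamma, num_episodes, num_states):
    """Tabular n-step TD prediction: estimate V for a fixed policy.

    Assumes (hypothetical interface): env.reset() -> state and
    env.step(action) -> (state, reward, done), with integer states;
    policy(state) -> action.
    """
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = [env.reset()]
        rewards = [0.0]              # placeholder so rewards[t] holds R_t
        T = float('inf')             # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                s, r, done = env.step(policy(states[t]))
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1          # the time whose estimate is updated now
            if tau >= 0:
                # n-step return: up to n rewards, then bootstrap if non-terminal
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

Note the bookkeeping: at time t the algorithm updates the estimate for time tau = t − n + 1, which is exactly the n-step delay noted on slide 12.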

  15. n-step SARSA Same idea as in n-step TD – how is this backup diagram different from that of n-step TD? – why is it different?

  16. n-step SARSA Why does the backup diagram start with a dot (an action node) rather than a circle (a state node)? Same idea as in n-step TD – how is this backup diagram different from that of n-step TD? – why is it different?

  17. n-step SARSA Left: path taken by the agent in a grid world; the reward is zero everywhere except at the goal state, where it is positive. Middle: 1-step SARSA updates only the penultimate state/action pair. Right: 10-step SARSA updates the last 10 state/action pairs.
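A matching sketch of tabular n-step SARSA for slides 15–17, again following the standard pseudocode; the ε-greedy exploration and the hypothetical env interface are the same assumptions as in the n-step TD sketch above.

```python
import numpy as np

def n_step_sarsa(env, n, alpha, gamma, epsilon, num_episodes,
                 num_states, num_actions, rng=None):
    """Tabular n-step SARSA: learn Q for an epsilon-greedy policy.
    Same assumed env interface as the n-step TD sketch above."""
    if rng is None:
        rng = np.random.default_rng()
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        states, actions, rewards = [s], [eps_greedy(s)], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s, r, done = env.step(actions[t])
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(s))   # select A_{t+1} on-policy
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    # SARSA bootstraps on the action actually selected at tau+n
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```

The only structural differences from n-step TD are that the trajectory stores actions and the bootstrap term is Q(S_{tau+n}, A_{tau+n}) rather than V(S_{tau+n}); that state-action pairing is why the backup diagram on slide 16 begins at an action node (the dot) instead of a state node (the circle).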
