N-step bootstrapping
Robert Platt, Northeastern University
"I'd love to use my experiences more efficiently..."
Motivation
Left: path taken by the agent in a grid world; reward is zero everywhere except at the goal state, where it is positive.
Middle: 1-step SARSA updates only the penultimate state-action pair.
Problem: standard Q-learning/SARSA propagates reward information only one step backward per update.
– n-step bootstrapping is one way to address this problem
– we will see other ways in subsequent slide decks
TD and MC are two extremes of a continuum
What are these? TD(0) updates from a single transition; MC waits for the full return at the end of the episode. The n-step methods fill in the continuum between them.

TD(0) update equation:
$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$
The quantity $R_{t+1} + \gamma V(S_{t+1})$ is called the target of the update.

What's the target for MC? The full return:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$

What's the target for the n-step case? The n-step return, which truncates after n rewards and bootstraps from the current value estimate:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
Complete update equation:
$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]$

Notice that you can't do an n-step update until time step t+n (e.g., a 3-step update can't happen until t+3):
– the TD update happens on the next time step
– the MC update happens at the end of the episode
– the n-step TD update happens on time step t+n
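To make the n-step target concrete, here is a minimal Python sketch (the helper name and argument layout are my own, not from the slides) that computes $G_{t:t+n}$ from a recorded trajectory:

```python
def n_step_return(rewards, v_along_traj, t, n, gamma, T):
    """Compute the n-step return G_{t:t+n}.

    rewards[k]      -- R_k, the reward received entering step k (rewards[0] unused)
    v_along_traj[k] -- current estimate V(S_k) at step k of the trajectory
    T               -- terminal time step of the episode
    """
    G = 0.0
    # Sum the discounted rewards R_{t+1} ... R_{min(t+n, T)}
    for k in range(t + 1, min(t + n, T) + 1):
        G += gamma ** (k - t - 1) * rewards[k]
    # Bootstrap from V(S_{t+n}) only if the episode hasn't ended by then
    if t + n < T:
        G += gamma ** n * v_along_traj[t + n]
    return G
```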
How well does this work? This comparison is for:
– a 19-state random walk
– n-step TD policy evaluation
(The figure plots RMS error against the step size α for several values of n; intermediate values of n perform best.)
n-step TD algorithm
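A minimal Python sketch of n-step TD policy evaluation, following the standard pseudocode from Sutton & Barto chapter 7. The environment interface (`env.reset()` returning an integer state, `env.step(a)` returning `(state, reward, done)`) and the function names are assumptions of this sketch, not part of the original slides:

```python
import numpy as np

def n_step_td(env, policy, num_states, n=4, alpha=0.1, gamma=1.0,
              num_episodes=500):
    """n-step TD policy evaluation (sketch after Sutton & Barto, ch. 7).

    Assumes integer-encoded states and a simple env interface:
    reset() -> state, step(a) -> (state, reward, done).
    """
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[k] pairs with R_k
        T = float('inf')                         # episode length, unknown yet
        t = 0
        while True:
            if t < T:
                s, r, done = env.step(policy(states[t]))
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # time whose estimate updates
            if tau >= 0:
                # n-step return: discounted rewards plus bootstrapped value
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```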
n-step SARSA
Same idea as in n-step TD:
– how is this backup diagram different from that of n-step TD?
– why is it different?
– why does the backup diagram start with a dot rather than a circle? (In these diagrams a dot is an action node and a circle is a state node: the SARSA backup starts and ends with state-action pairs because it updates Q(s, a) rather than V(s).)
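In code terms, the difference shows up in the bootstrap term: n-step SARSA bootstraps from Q(S_{t+n}, A_{t+n}) rather than V(S_{t+n}), so actions must be stored alongside states. A minimal sketch of a single update (names are mine, not from the slides):

```python
def n_step_sarsa_update(Q, states, actions, rewards, tau, n, T,
                        alpha, gamma):
    """One n-step SARSA update for the state-action pair at time tau.

    Q is indexed as Q[state][action]; states/actions/rewards hold the
    trajectory so far, with rewards[k] = R_k (rewards[0] unused).
    """
    # n-step return: discounted rewards, then bootstrap from Q, not V
    G = sum(gamma ** (k - tau - 1) * rewards[k]
            for k in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n]][actions[tau + n]]
    Q[states[tau]][actions[tau]] += alpha * (G - Q[states[tau]][actions[tau]])
```

The surrounding control loop matches the n-step TD sketch above, except that actions are chosen (e.g., ε-greedily) from Q and recorded so the bootstrap term can use them.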
n-step SARSA
Left: path taken by the agent in a grid world; reward is zero everywhere except at the goal state, where it is positive.
Middle: 1-step SARSA updates only the penultimate state-action pair.
Right: 10-step SARSA updates the last 10 state-action pairs.