
  1. Online Planning 3/1/17

  2. Q-Learning vs MCTS
     Q-Learning (dynamic programming):
     • Update depends on prior estimates for other states.
     • Q(s,a) ← α [R + γ V(s′)] + (1 − α) [old Q(s,a)]
     • Updates immediately: try action a in state s, then update Q(s,a).
     MCTS (backpropagation):
     • Update uses all rewards from a full rollout.
     • Q(s,a) ← average of Σ_{t=T}^{end} γ^(t−T) R_t and old Q(s,a)
     • Updates after rollout: save the path of (s,a) pairs, then update when all rewards are known.
     Both converge to correct Q(s,a) estimates!
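
In code, the two update rules look roughly like this. This is a minimal sketch: the dictionary Q-table, the `actions` argument, and the visit-count running average (one standard way to implement "average with the old value") are assumptions, not the course implementation.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning: blend a new one-step sample with the old estimate.
    Uses the current greedy value of the successor, V(s') = max_a' Q(s', a')."""
    v_next = max(Q.get((s_next, b), 0.0) for b in actions)
    sample = r + gamma * v_next
    Q[(s, a)] = alpha * sample + (1 - alpha) * Q.get((s, a), 0.0)

def mcts_backup(Q, N, path, rewards, gamma=0.9):
    """MCTS backpropagation: each (s, a) on the saved path gets the discounted
    sum of every reward from its time step to the end of the rollout, folded
    into its running estimate via a visit count."""
    for t, (s, a) in enumerate(path):
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
        N[(s, a)] = N.get((s, a), 0) + 1
        Q[(s, a)] = Q.get((s, a), 0.0) + (ret - Q.get((s, a), 0.0)) / N[(s, a)]
```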

  3. Demo: Q-Learning vs. MCTS

  4. What about expansion? • In MCTS for game playing, we only update values for nodes already in the tree. • On each rollout, we expand exactly one node. • In Q-learning, we update values for every node we encounter. Which method should we use in MCTS for MDPs? • Hint: either is appropriate under the right circumstances. What are those circumstances?

  5. Online vs. Offline Decision-Making Our approach to MDPs so far: learn the value model completely, then pick optimal actions. Alternative approach: learn the (local) value model well enough to find a good action for the current state, take that action, then continue learning. When is online reasoning a good idea? Note: online learning (taking actions while you’re still learning) comes up in many machine learning contexts.
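
The difference between the two approaches is the shape of the decision loop. A sketch, with hypothetical `solve_mdp`, `plan_from`, and `env` helpers passed in as parameters (none of these names come from the slides):

```python
def act_offline(env, mdp, solve_mdp, horizon):
    """Offline: learn the full value model first, then just execute the policy."""
    policy = solve_mdp(mdp)                 # e.g. value iteration run to convergence
    s = env.reset()
    for _ in range(horizon):
        s, r, done = env.step(policy[s])    # no more learning while acting
        if done:
            break

def act_online(env, model, plan_from, horizon, budget):
    """Online: plan only well enough to pick a good action for the current state."""
    s = env.reset()
    for _ in range(horizon):
        a = plan_from(s, model, budget)     # think, act, then think again next step
        s, r, done = env.step(a)
        if done:
            break
```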

  6. Simulated vs. Real World Actions So far, we’ve been blurring an important distinction. Does the agent: • take actions in the world and learn from the consequences, or • simulate the effect of possible actions before deciding how to act? Q-learning can be applied in either case. For online learning, we care about the difference.

  7. Model Simulations • Value iteration is great when we know the whole model (and can fit the value table in memory). • Q-learning is great when we don’t know anything. • Simulation is a middle ground. • We might want to use simulation when: • We know the MDP, but it’s huge. • We have a function that generates successor states, but don’t know the full set of possible states in advance.
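
A toy example of the "function that generates successor states" case: the planner only needs a sampling interface like the hypothetical one below, never an enumerated transition table. The gridworld dynamics, goal, and reward are made up for illustration.

```python
import random

class GridSimulator:
    """Hypothetical generative model: a large gridworld exposed only through
    sampling, so the planner never needs the full set of states up front."""

    MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

    def sample_next(self, state, action):
        x, y = state
        dx, dy = self.MOVES[action]
        if random.random() < 0.8:                     # intended move
            nxt = (x + dx, y + dy)
        else:                                         # slip in a random direction
            sdx, sdy = random.choice(list(self.MOVES.values()))
            nxt = (x + sdx, y + sdy)
        reward = 1.0 if nxt == (10, 10) else 0.0      # made-up goal and reward
        return nxt, reward
```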

  8. MCTS for Online Planning In the online planning setting, every time we need to choose an action, we stop and think about it first. “Thinking about it” means simulating future actions to learn from their consequences.
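
"Stop and think about it first" at the top level, as a sketch. Here `do_rollout` stands in for the selection/expansion/simulation/backpropagation cycle reviewed on the next slide, and all of the names are illustrative rather than taken from the course code.

```python
def mcts_plan(simulator, root_state, actions, do_rollout, num_rollouts=500):
    """Spend a fixed simulation budget from the current state, then commit to
    the action whose estimated value currently looks best."""
    Q, N = {}, {}     # (state, action) -> running value estimate / visit count
    for _ in range(num_rollouts):
        do_rollout(simulator, root_state, actions, Q, N)
    return max(actions, key=lambda a: Q.get((root_state, a), float("-inf")))
```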

  9. MCTS Review • Selection • Runs in the already-explored part of the state space. • Choose a random action, according to UCB weights. • Expansion • When we first encounter something unexplored. • Choose an unexplored action uniformly at random. • Simulation • After we’ve left the known region. • Select actions randomly according to the default policy. • Backpropagation • Update values for states visited in selection/expansion. • Average previous values with the value on the current rollout.
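
For the selection and expansion choices, here is a sketch of one common form of the rule. Note that the slide describes choosing randomly according to UCB weights; the version below uses the deterministic UCB1 argmax instead, which is a simplification, and the exploration constant c is arbitrary.

```python
import math
import random

def select_action(state, actions, Q, N, c=1.4):
    """Choice at one node: take an untried action uniformly at random if any
    exist (expansion case), otherwise pick by value plus an exploration bonus."""
    untried = [a for a in actions if N.get((state, a), 0) == 0]
    if untried:
        return random.choice(untried)
    total = sum(N[(state, a)] for a in actions)   # visits to this state
    return max(actions,
               key=lambda a: Q[(state, a)]
               + c * math.sqrt(math.log(total) / N[(state, a)]))
```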

  10. Differences from game-playing MCTS • Learning state/action values instead of state values. • The next state is non-deterministic. • Simulation may never reach a terminal state. • There is no longer a tree structure to the states. • Non-terminal states can have rewards.

  11. Online MCTS Value Backup Observe the sequence of (state, action) pairs and corresponding rewards. • Save (state, action, reward) during selection / expansion • Save only the reward during simulation Want to compute the value (on the current rollout) for each (s,a) pair, then average with the old values. states: [s0, s7, s3, s5] actions: [a0, a0, a2, a1] rewards: [0, -1, +2, 0, 0, +1, -1] γ = 0.9 Compute values for the current rollout.
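
A worked version of the exercise above (assuming the discount on the slide is the same γ used earlier in the deck; the rewards list is longer than the path because rewards from the simulation phase are included):

```python
states  = ["s0", "s7", "s3", "s5"]
actions = ["a0", "a0", "a2", "a1"]
rewards = [0, -1, +2, 0, 0, +1, -1]
gamma = 0.9

# Rollout value for the pair at step t: discounted sum of every remaining reward.
returns = [sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
           for t in range(len(states))]

for s, a, g in zip(states, actions, returns):
    print(f"rollout value for ({s}, {a}): {g:.3f}")
# (s0, a0): 0.779   (s7, a0): 0.866   (s3, a2): 2.073   (s5, a1): 0.081
# Each of these would then be averaged with the old Q(s, a) estimate.
```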

  12. Demo: Online MCTS
