CS 730/730W/830: Intro AI
MDP Wrap-Up, ADP, Q-Learning

handout: slides
project proposals are due

Wheeler Ruml (UNH), Lecture 18, CS 730
MDP Wrap-Up
Real-time Dynamic Programming
for a known MDP: which states to update?

■ initialize U to an upper bound
■ update U as we follow the greedy policy from s_0:

U(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')

■ updates the states the agent is likely to visit (nice anytime profile)
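To make the update order concrete, here is a minimal Python sketch of one RTDP trial, assuming hypothetical helpers: U is a dict of values initialized to an upper bound, T(s, a) returns a list of (next_state, probability) pairs, R(s) is the reward, and ACTIONS(s) lists the legal actions.

import random

def rtdp_trial(s0, U, T, R, ACTIONS, gamma, max_steps=1000):
    """Run one greedy trial from s0, backing up U along the way."""
    s = s0
    for _ in range(max_steps):
        acts = ACTIONS(s)
        if not acts:                       # terminal state: stop the trial
            break
        # Bellman backup at the current state.
        def q(a):
            return R(s) + gamma * sum(p * U[sp] for sp, p in T(s, a))
        best_a = max(acts, key=q)
        U[s] = q(best_a)
        # Follow the greedy action; sample the next state from the model.
        succs = T(s, best_a)
        s = random.choices([sp for sp, _ in succs],
                           weights=[p for _, p in succs])[0]
    return U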
Summary of MDP Solving
■ value iteration: compute U^{π*}
  ◆ prioritized sweeping
  ◆ RTDP
■ policy iteration: compute U^π using
  ◆ linear algebra (exact)
  ◆ simplified value iteration (exact and faster?)
  ◆ modified PI (a few updates, so inexact)
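For reference, a minimal value-iteration sketch under the same hypothetical MDP interface (states, ACTIONS, T, R, gamma); it assumes every state has at least one action and stops when the largest update falls below eps.

def value_iteration(states, ACTIONS, T, R, gamma, eps=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_u = R(s) + gamma * max(
                sum(p * U[sp] for sp, p in T(s, a)) for a in ACTIONS(s))
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta <= eps:          # stop once the largest update is small
            return U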
ADP: Model-based Reinforcement Learning
Adaptive Dynamic Programming
‘model-based’; active vs passive
learn T and R as we go, calculating π using MDP methods (eg, VI or PI):

until max-update ≤ loss-bound · (1 − γ)² / (2γ²):
  for each state s:
    U(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
π(s) = argmax_a Σ_{s'} T(s, a, s') U(s')
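A minimal model-based (ADP) sketch, assuming the value_iteration sketch above and the same hypothetical interface: the transition and reward estimates are maximum-likelihood counts, and the policy is recomputed from the learned model.

from collections import defaultdict

class ADPAgent:
    def __init__(self, states, actions, gamma):
        self.states, self.actions, self.gamma = states, actions, gamma
        self.n_sa = defaultdict(int)          # visits to (s, a)
        self.n_sas = defaultdict(int)         # transitions (s, a) -> s'
        self.r = {}                           # observed reward for each state

    def observe(self, s, a, s2, reward):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s2)] += 1
        self.r[s2] = reward

    def T(self, s, a):
        n = self.n_sa[(s, a)]
        if n == 0:
            return [(s, 1.0)]                 # unvisited: pretend a self-loop
        return [(sp, self.n_sas[(s, a, sp)] / n)
                for sp in self.states if self.n_sas[(s, a, sp)] > 0]

    def policy(self):
        U = value_iteration(self.states, lambda s: self.actions,
                            self.T, lambda s: self.r.get(s, 0.0), self.gamma)
        def pi(s):
            return max(self.actions, key=lambda a: sum(
                p * U[sp] for sp, p in self.T(s, a)))
        return pi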
Prioritized Sweeping
given an experience (s, a, s', r):
  update the model
  update s
  repeat k times: do the highest-priority update

to update state s with change δ in U(s):
  update U(s)
  priority of s ← 0
  for each predecessor s' of s:
    priority of s' ← max of current priority and max_a δ T̂(s', a, s)
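A rough prioritized-sweeping sketch built on the hypothetical ADPAgent above; for brevity it scans all state-action pairs to find predecessors, and it negates priorities because Python's heapq is a min-heap.

import heapq

def prioritized_sweep(agent, U, s, a, s2, r, k):
    agent.observe(s, a, s2, r)                       # update the learned model
    queue, best = [], {}                             # priority queue over states

    def backup(state):
        old = U.get(state, 0.0)
        U[state] = agent.r.get(state, 0.0) + agent.gamma * max(
            sum(p * U.get(sp, 0.0) for sp, p in agent.T(state, b))
            for b in agent.actions)
        return abs(U[state] - old)                   # delta, used for priorities

    def push_predecessors(state, delta):
        # Predecessors found by scanning all (s', a'); fine for a small sketch.
        for pred in agent.states:
            for b in agent.actions:
                t_hat = dict(agent.T(pred, b)).get(state, 0.0)
                p = delta * t_hat
                if p > best.get(pred, 0.0):
                    best[pred] = p
                    heapq.heappush(queue, (-p, pred))

    push_predecessors(s, backup(s))                  # update s itself first
    for _ in range(k):                               # then k highest-priority states
        if not queue:
            break
        _, state = heapq.heappop(queue)
        best[state] = 0.0                            # priority of state <- 0
        push_predecessors(state, backup(state))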
Policy Iteration
repeat until π doesn’t change:
  ■ given π, compute U^π(s) for all states
  ■ given U, calculate the policy by one-step look-ahead

If π doesn’t change, U doesn’t either.
We are at an equilibrium (= optimal π)!
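A minimal policy-iteration sketch under the same hypothetical interface (actions is a list of all actions, assumed legal everywhere); policy evaluation here uses repeated sweeps (modified-PI style) rather than solving the linear system exactly.

def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
    pi = {s: actions[0] for s in states}
    U = {s: 0.0 for s in states}
    while True:
        # Evaluate the current policy (approximately), by repeated sweeps.
        for _ in range(eval_sweeps):
            for s in states:
                U[s] = R(s) + gamma * sum(p * U[sp] for sp, p in T(s, pi[s]))
        # Improve the policy by one-step look-ahead.
        changed = False
        for s in states:
            best = max(actions, key=lambda a: sum(p * U[sp] for sp, p in T(s, a)))
            if best != pi[s]:
                pi[s], changed = best, True
        if not changed:                      # equilibrium: pi is optimal
            return pi, U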
Exploration vs Exploitation
problem: greedy (local minima)

U⁺(s) ← R(s) + γ max_a f( Σ_{s'} T(s, a, s') U⁺(s'), N(a, s) )

where f(u, n) = R_max if n < k, u otherwise
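A small sketch of that optimistic exploration function, with R_MAX and K as assumed (hypothetical) constants for the value upper bound and the visit threshold.

R_MAX = 1.0   # assumed upper bound on achievable value
K = 5         # assumed visit threshold before an action counts as "known"

def f(u, n):
    """Optimistic value: pretend under-explored actions are maximally good."""
    return R_MAX if n < K else u

def exploratory_backup(s, U_plus, N, T, R, ACTIONS, gamma):
    """One backup of U+ from the slide, so greedy action choice still explores."""
    return R(s) + gamma * max(
        f(sum(p * U_plus[sp] for sp, p in T(s, a)), N[(a, s)])
        for a in ACTIONS(s))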
Break
■ asst 4
■ final papers: writing-intensive
Q-Learning: Model-free Reinforcement Learning
Q-Learning

U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')

Q(s, a) = γ Σ_{s'} T(s, a, s') ( R(s') + max_{a'} Q(s', a') )

Given experience ⟨s, a, s', r⟩:

Q(s, a) ← Q(s, a) + α (error)
Q(s, a) ← Q(s, a) + α (sensed − predicted)
Q(s, a) ← Q(s, a) + α ( γ (r + max_{a'} Q(s', a')) − Q(s, a) )

α ≈ 1/N? policy: choose a random action with probability 1/N?
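A minimal tabular Q-learning sketch that follows the slide's update (note γ multiplying the whole target, as written above); the names and the 1/N schedules for α and for exploration are assumptions suggested by the last line.

import random
from collections import defaultdict

class QLearner:
    def __init__(self, actions, gamma):
        self.actions, self.gamma = actions, gamma
        self.Q = defaultdict(float)           # Q[(s, a)], default 0
        self.N = defaultdict(int)             # visit counts for (s, a)

    def choose(self, s):
        n = sum(self.N[(s, a)] for a in self.actions) + 1
        if random.random() < 1.0 / n:         # explore with probability ~1/N
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, s2, r):
        """Apply the slide's update for experience <s, a, s', r>."""
        self.N[(s, a)] += 1
        alpha = 1.0 / self.N[(s, a)]          # alpha ~ 1/N
        target = self.gamma * (r + max(self.Q[(s2, a2)] for a2 in self.actions))
        self.Q[(s, a)] += alpha * (target - self.Q[(s, a)])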
Summary
Model known (solving an MDP):
■ value iteration
■ policy iteration: compute U^π using
  ◆ linear algebra
  ◆ simplified value iteration
  ◆ a few updates (modified PI)
Model unknown (RL):
■ ADP using
  ◆ value iteration
  ◆ a few updates (eg, prioritized sweeping)
■ Q-learning
EOLQs
■ What question didn’t you get to ask today?
■ What’s still confusing?
■ What would you like to hear more about?

Please write down your most pressing question about AI and put it in the box on your way out. Thanks!