CS 730/730W/830: Intro AI
MDP Wrap-Up, ADP, Q-Learning

handout: slides
project proposals are due

Wheeler Ruml (UNH), Lecture 18, CS 730
MDP Wrap-Up
Real-time Dynamic Programming
for a known MDP: which states to update?

■ initialize U to an upper bound
■ update U as we follow the greedy policy from s_0:

U(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')

■ updates the states the agent is likely to visit (nice anytime profile)
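To make the update order concrete, here is a minimal Python sketch of one RTDP trial, assuming hypothetical helpers: U is a dict of values initialized to an upper bound, T(s, a) returns a list of (next_state, probability) pairs, R(s) is the reward, and ACTIONS(s) lists the legal actions.

import random

def rtdp_trial(s0, U, T, R, ACTIONS, gamma, max_steps=1000):
    """Run one greedy trial from s0, backing up U along the way."""
    s = s0
    for _ in range(max_steps):
        acts = ACTIONS(s)
        if not acts:                       # terminal state: stop the trial
            break
        # Bellman backup at the current state.
        def q(a):
            return R(s) + gamma * sum(p * U[sp] for sp, p in T(s, a))
        best_a = max(acts, key=q)
        U[s] = q(best_a)
        # Follow the greedy action; sample the next state from the model.
        succs = T(s, best_a)
        s = random.choices([sp for sp, _ in succs],
                           weights=[p for _, p in succs])[0]
    return U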
Summary of MDP Solving
■ value iteration: compute U^{π*}
  ◆ prioritized sweeping
  ◆ RTDP
■ policy iteration: compute U^π using
  ◆ linear algebra (exact)
  ◆ simplified value iteration (exact and faster?)
  ◆ modified PI (a few updates, so inexact)
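For reference, a minimal value-iteration sketch under the same hypothetical MDP interface (states, ACTIONS, T, R, gamma); it assumes every state has at least one action and stops when the largest update falls below eps.

def value_iteration(states, ACTIONS, T, R, gamma, eps=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_u = R(s) + gamma * max(
                sum(p * U[sp] for sp, p in T(s, a)) for a in ACTIONS(s))
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta <= eps:          # stop once the largest update is small
            return U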
ADP: Model-based Reinforcement Learning
Adaptive Dynamic Programming
‘model-based’; active vs passive
learn T and R as we go, calculating π using MDP methods (eg, VI or PI):

until max-update ≤ loss-bound · (1 − γ)² / (2γ²):
  for each state s:
    U(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
π(s) = argmax_a Σ_{s'} T(s, a, s') U(s')
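A minimal model-based (ADP) sketch, assuming the value_iteration sketch above and the same hypothetical interface: the transition and reward estimates are maximum-likelihood counts, and the policy is recomputed from the learned model.

from collections import defaultdict

class ADPAgent:
    def __init__(self, states, actions, gamma):
        self.states, self.actions, self.gamma = states, actions, gamma
        self.n_sa = defaultdict(int)          # visits to (s, a)
        self.n_sas = defaultdict(int)         # transitions (s, a) -> s'
        self.r = {}                           # observed reward for each state

    def observe(self, s, a, s2, reward):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s2)] += 1
        self.r[s2] = reward

    def T(self, s, a):
        n = self.n_sa[(s, a)]
        if n == 0:
            return [(s, 1.0)]                 # unvisited: pretend a self-loop
        return [(sp, self.n_sas[(s, a, sp)] / n)
                for sp in self.states if self.n_sas[(s, a, sp)] > 0]

    def policy(self):
        U = value_iteration(self.states, lambda s: self.actions,
                            self.T, lambda s: self.r.get(s, 0.0), self.gamma)
        def pi(s):
            return max(self.actions, key=lambda a: sum(
                p * U[sp] for sp, p in self.T(s, a)))
        return pi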
Prioritized Sweeping
given an experience (s, a, s', r):
  update the model
  update s
  repeat k times: do the highest-priority update

to update state s with change δ in U(s):
  update U(s)
  priority of s ← 0
  for each predecessor s' of s:
    priority of s' ← max of current priority and max_a δ T̂(s', a, s)
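A rough prioritized-sweeping sketch built on the hypothetical ADPAgent above; for brevity it scans all state-action pairs to find predecessors, and it negates priorities because Python's heapq is a min-heap.

import heapq

def prioritized_sweep(agent, U, s, a, s2, r, k):
    agent.observe(s, a, s2, r)                       # update the learned model
    queue, best = [], {}                             # priority queue over states

    def backup(state):
        old = U.get(state, 0.0)
        U[state] = agent.r.get(state, 0.0) + agent.gamma * max(
            sum(p * U.get(sp, 0.0) for sp, p in agent.T(state, b))
            for b in agent.actions)
        return abs(U[state] - old)                   # delta, used for priorities

    def push_predecessors(state, delta):
        # Predecessors found by scanning all (s', a'); fine for a small sketch.
        for pred in agent.states:
            for b in agent.actions:
                t_hat = dict(agent.T(pred, b)).get(state, 0.0)
                p = delta * t_hat
                if p > best.get(pred, 0.0):
                    best[pred] = p
                    heapq.heappush(queue, (-p, pred))

    push_predecessors(s, backup(s))                  # update s itself first
    for _ in range(k):                               # then k highest-priority states
        if not queue:
            break
        _, state = heapq.heappop(queue)
        best[state] = 0.0                            # priority of state <- 0
        push_predecessors(state, backup(state))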
Policy Iteration
repeat until π doesn’t change:
  ■ given π, compute U^π(s) for all states
  ■ given U, calculate the policy by one-step look-ahead

If π doesn’t change, U doesn’t either.
We are at an equilibrium (= optimal π)!
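A minimal policy-iteration sketch under the same hypothetical interface (actions is a list of all actions, assumed legal everywhere); policy evaluation here uses repeated sweeps (modified-PI style) rather than solving the linear system exactly.

def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
    pi = {s: actions[0] for s in states}
    U = {s: 0.0 for s in states}
    while True:
        # Evaluate the current policy (approximately), by repeated sweeps.
        for _ in range(eval_sweeps):
            for s in states:
                U[s] = R(s) + gamma * sum(p * U[sp] for sp, p in T(s, pi[s]))
        # Improve the policy by one-step look-ahead.
        changed = False
        for s in states:
            best = max(actions, key=lambda a: sum(p * U[sp] for sp, p in T(s, a)))
            if best != pi[s]:
                pi[s], changed = best, True
        if not changed:                      # equilibrium: pi is optimal
            return pi, U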
Exploration vs Exploitation
problem: greedy (local minima)

U⁺(s) ← R(s) + γ max_a f( Σ_{s'} T(s, a, s') U⁺(s'), N(a, s) )

where f(u, n) = R_max if n < k, u otherwise
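A small sketch of that optimistic exploration function, with R_MAX and K as assumed (hypothetical) constants for the value upper bound and the visit threshold.

R_MAX = 1.0   # assumed upper bound on achievable value
K = 5         # assumed visit threshold before an action counts as "known"

def f(u, n):
    """Optimistic value: pretend under-explored actions are maximally good."""
    return R_MAX if n < K else u

def exploratory_backup(s, U_plus, N, T, R, ACTIONS, gamma):
    """One backup of U+ from the slide, so greedy action choice still explores."""
    return R(s) + gamma * max(
        f(sum(p * U_plus[sp] for sp, p in T(s, a)), N[(a, s)])
        for a in ACTIONS(s))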
Break
■ asst 4
■ final papers: writing-intensive
Q-Learning: Model-free Reinforcement Learning
Q-Learning

U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')

Q(s, a) = γ Σ_{s'} T(s, a, s') ( R(s') + max_{a'} Q(s', a') )

Given experience ⟨s, a, s', r⟩:

Q(s, a) ← Q(s, a) + α (error)
Q(s, a) ← Q(s, a) + α (sensed − predicted)
Q(s, a) ← Q(s, a) + α ( γ (r + max_{a'} Q(s', a')) − Q(s, a) )

α ≈ 1/N? policy: choose a random action with probability 1/N?
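A minimal tabular Q-learning sketch that follows the slide's update (note γ multiplying the whole target, as written above); the names and the 1/N schedules for α and for exploration are assumptions suggested by the last line.

import random
from collections import defaultdict

class QLearner:
    def __init__(self, actions, gamma):
        self.actions, self.gamma = actions, gamma
        self.Q = defaultdict(float)           # Q[(s, a)], default 0
        self.N = defaultdict(int)             # visit counts for (s, a)

    def choose(self, s):
        n = sum(self.N[(s, a)] for a in self.actions) + 1
        if random.random() < 1.0 / n:         # explore with probability ~1/N
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, s2, r):
        """Apply the slide's update for experience <s, a, s', r>."""
        self.N[(s, a)] += 1
        alpha = 1.0 / self.N[(s, a)]          # alpha ~ 1/N
        target = self.gamma * (r + max(self.Q[(s2, a2)] for a2 in self.actions))
        self.Q[(s, a)] += alpha * (target - self.Q[(s, a)])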
Summary
Model known (solving an MDP):
■ value iteration
■ policy iteration: compute U^π using
  ◆ linear algebra
  ◆ simplified value iteration
  ◆ a few updates (modified PI)
Model unknown (RL):
■ ADP using
  ◆ value iteration
  ◆ a few updates (eg, prioritized sweeping)
■ Q-learning
EOLQs
■ What question didn’t you get to ask today?
■ What’s still confusing?
■ What would you like to hear more about?

Please write down your most pressing question about AI and put it in the box on your way out. Thanks!