1 Gridworld: Q* The Bellman Equa)ons How to be op)mal: - PDF document

Stochas)c ¡Planning: ¡MDPs ¡ CS ¡188: ¡Ar)ficial ¡Intelligence ¡ Static ¡ Markov ¡Decision ¡Processes ¡II ¡ Environment Fully Observable Stochastic What action next? Instantaneous Perfect Percepts Actions Instructors: ¡Dan ¡Klein ¡and ¡Pieter ¡Abbeel ¡-‑-‑-‑ ¡University ¡of ¡California, ¡Berkeley ¡ 3 [These ¡slides ¡were ¡created ¡by ¡Dan ¡Klein ¡and ¡Pieter ¡Abbeel ¡for ¡CS188 ¡Intro ¡to ¡AI ¡at ¡UC ¡Berkeley. ¡ ¡All ¡CS188 ¡materials ¡are ¡available ¡at ¡hKp://ai.berkeley.edu.] ¡ Recap: ¡MDPs ¡ Solving ¡MDPs ¡ § Markov ¡decision ¡processes: ¡ § States ¡S ¡ § Value ¡Itera)on ¡ § Ac)ons ¡A ¡ § Real-‑Time ¡Dynamic ¡Programming ¡ § Transi)ons ¡P(s’|s,a) ¡(or ¡T(s,a,s’)) ¡ § Rewards ¡R(s,a,s’) ¡(and ¡discount ¡ γ ) ¡ § Policy ¡Itera)on ¡ § Start ¡state ¡s 0 ¡ s § Quan))es: ¡ § Reinforcement ¡Learning ¡ a § Policy ¡= ¡map ¡of ¡states ¡to ¡ac)ons ¡ § U)lity ¡= ¡sum ¡of ¡discounted ¡rewards ¡ s, ¡a ¡ § Values ¡= ¡expected ¡future ¡u)lity ¡from ¡a ¡state ¡(max ¡node) ¡ § Q-‑Values ¡= ¡expected ¡future ¡u)lity ¡from ¡a ¡q-‑state ¡(chance ¡node) ¡ s,a,s ’ ¡ s ’ ¡ Op)mal ¡Quan))es ¡ Gridworld ¡Values ¡V* ¡ § The ¡value ¡(u)lity) ¡of ¡a ¡state ¡s: ¡ V * (s) ¡= ¡expected ¡u)lity ¡star)ng ¡in ¡s ¡and ¡ s is a s ac)ng ¡op)mally ¡ state a (s, a) is a § The ¡value ¡(u)lity) ¡of ¡a ¡q-‑state ¡(s,a): ¡ s, a q-state Q * (s,a) ¡= ¡expected ¡u)lity ¡star)ng ¡out ¡ having ¡taken ¡ac)on ¡a ¡from ¡state ¡s ¡and ¡ s,a,s’ (s,a,s’) is a (therea]er) ¡ac)ng ¡op)mally ¡ transition s’ ¡ § The ¡op)mal ¡policy: ¡ π * (s) ¡= ¡op)mal ¡ac)on ¡from ¡state ¡s ¡ [Demo: ¡ ¡gridworld ¡values ¡(L9D1)] ¡ 1

Gridworld: ¡Q* ¡ The ¡Bellman ¡Equa)ons ¡ How ¡to ¡be ¡op)mal: ¡ ¡ ¡ ¡ ¡ ¡Step ¡1: ¡Take ¡correct ¡first ¡ac)on ¡ ¡ ¡ ¡ ¡ ¡Step ¡2: ¡Keep ¡being ¡op)mal ¡ The ¡Bellman ¡Equa)ons ¡ Racing ¡Search ¡Tree ¡ § Defini)on ¡of ¡“op)mal ¡u)lity” ¡via ¡expec)max ¡recurrence ¡ s § We’re ¡doing ¡way ¡too ¡much ¡ work ¡with ¡expec)max! ¡ gives ¡a ¡simple ¡one-‑step ¡lookahead ¡rela)onship ¡ a amongst ¡op)mal ¡u)lity ¡values ¡ s, ¡a ¡ § Problem: ¡States ¡are ¡repeated ¡ ¡ § Idea: ¡Only ¡compute ¡needed ¡ s,a,s ’ ¡ quan))es ¡once ¡ s ’ ¡ § Problem: ¡Tree ¡goes ¡on ¡forever ¡ § Idea: ¡Do ¡a ¡depth-‑limited ¡ computa)on, ¡but ¡with ¡increasing ¡ depths ¡un)l ¡change ¡is ¡small ¡ § These ¡are ¡the ¡Bellman ¡equa)ons, ¡and ¡they ¡characterize ¡ § Note: ¡deep ¡parts ¡of ¡the ¡tree ¡ eventually ¡don’t ¡maKer ¡if ¡ γ ¡< ¡1 ¡ op)mal ¡values ¡in ¡a ¡way ¡we’ll ¡use ¡over ¡and ¡over ¡ ¡ ¡ Time-‑Limited ¡Values ¡ Time-‑Limited ¡Values: ¡Avoiding ¡Redundant ¡Computa)on ¡ § Key ¡idea: ¡)me-‑limited ¡values ¡ § Define ¡V k (s) ¡to ¡be ¡the ¡op)mal ¡value ¡of ¡s ¡if ¡the ¡game ¡ends ¡ in ¡k ¡more ¡)me ¡steps ¡ § Equivalently, ¡it’s ¡what ¡a ¡depth-‑k ¡expec)max ¡would ¡give ¡from ¡s ¡ [Demo ¡– ¡)me-‑limited ¡values ¡(L8D6)] ¡ 2

Value ¡Itera)on ¡ Example: ¡Value ¡Itera)on ¡ ¡ ¡3.5 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡2.5 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡0 ¡ ¡ ¡2 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡1 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡0 ¡ ¡ ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡0 ¡ Assume ¡no ¡discount ¡(gamma=1) ¡to ¡keep ¡math ¡simple! ¡ Value ¡Itera)on ¡ Called a Example: Bellman Backup “Bellman Backup” § Start ¡with ¡V 0 (s) ¡= ¡0: ¡ ¡ ¡ ¡ ¡ no ¡8me ¡steps ¡le9 ¡means ¡an ¡expected ¡reward ¡sum ¡of ¡zero ¡ Q 1 (s,a 1 ) = 2 + γ 0 § Given ¡vector ¡of ¡V k (s) ¡values, ¡do ¡one ¡ply ¡of ¡expec?max ¡from ¡each ¡state: ¡ s 1 2 a 1 V 0 = 0 V 1 = 6.5 ~ 2 V k+1 (s) ¡ 5 a Q 1 (s,a 2 ) = 5 + γ 0.9~ 1 s 0 a 2 s, ¡a ¡ 4 + γ 0.1~ 2 . 5 § Repeat ¡un?l ¡convergence ¡ ¡ 0 . 9 s 2 V 0 = 1 ~ 6.1 s,a,s ’ ¡ a greedy = a 3 0 (trust ¡me, ¡it ¡does) ¡ a 3 . 1 V k (s’) ¡ Q 1 (s,a 3 ) = 4.5 + γ 2 ~ 6.5 s 3 V 0 = 2 max k=0 ¡ k=1 ¡ If agent is in 4,3, it only has one legal action: get jewel. It gets a reward and the game is over. If agent is in the pit, it has only one legal action, die. It gets a penalty and the game is over. Agent does NOT get a reward for moving INTO 4,3. Noise ¡= ¡0.2 ¡ Noise ¡= ¡0.2 ¡ Discount ¡= ¡0.9 ¡ Discount ¡= ¡0.9 ¡ Living ¡reward ¡= ¡0 ¡ Living ¡reward ¡= ¡0 ¡ 3

k=2 ¡ k=3 ¡ Noise ¡= ¡0.2 ¡ Noise ¡= ¡0.2 ¡ Discount ¡= ¡0.9 ¡ Discount ¡= ¡0.9 ¡ Living ¡reward ¡= ¡0 ¡ Living ¡reward ¡= ¡0 ¡ k=4 ¡ k=5 ¡ Noise ¡= ¡0.2 ¡ Noise ¡= ¡0.2 ¡ Discount ¡= ¡0.9 ¡ Discount ¡= ¡0.9 ¡ Living ¡reward ¡= ¡0 ¡ Living ¡reward ¡= ¡0 ¡ k=6 ¡ k=7 ¡ Noise ¡= ¡0.2 ¡ Noise ¡= ¡0.2 ¡ Discount ¡= ¡0.9 ¡ Discount ¡= ¡0.9 ¡ Living ¡reward ¡= ¡0 ¡ Living ¡reward ¡= ¡0 ¡ 4

k=8 ¡ k=9 ¡ Noise ¡= ¡0.2 ¡ Noise ¡= ¡0.2 ¡ Discount ¡= ¡0.9 ¡ Discount ¡= ¡0.9 ¡ Living ¡reward ¡= ¡0 ¡ Living ¡reward ¡= ¡0 ¡ k=10 ¡ k=11 ¡ Noise ¡= ¡0.2 ¡ Noise ¡= ¡0.2 ¡ Discount ¡= ¡0.9 ¡ Discount ¡= ¡0.9 ¡ Living ¡reward ¡= ¡0 ¡ Living ¡reward ¡= ¡0 ¡ k=12 ¡ k=100 ¡ Noise ¡= ¡0.2 ¡ Noise ¡= ¡0.2 ¡ Discount ¡= ¡0.9 ¡ Discount ¡= ¡0.9 ¡ Living ¡reward ¡= ¡0 ¡ Living ¡reward ¡= ¡0 ¡ 5

Value ¡Itera)on ¡ Value ¡Itera)on ¡ § Start ¡with ¡V 0 (s) ¡= ¡0: ¡ § Bellman ¡equa)ons ¡characterize ¡the ¡op)mal ¡values: ¡ V(s) ¡ § Given ¡vector ¡of ¡V k (s) ¡values, ¡do ¡one ¡ply ¡of ¡expec?max ¡from ¡each ¡state: ¡ V k+1 (s) ¡ a s, ¡a ¡ a s, ¡a ¡ s,a,s ’ ¡ § Value ¡itera)on ¡computes ¡them: ¡ V(s’) ¡ s,a,s ’ ¡ § Repeat ¡un?l ¡convergence ¡ V k (s’) ¡ § Complexity ¡of ¡each ¡itera?on: ¡O(S 2 A) ¡ § Number ¡of ¡itera?ons: ¡poly(|S|, ¡|A|, ¡1/(1-‑g)) ¡ ¡ § Value ¡itera)on ¡is ¡just ¡a ¡fixed ¡point ¡solu)on ¡method ¡ § … ¡though ¡the ¡V k ¡vectors ¡are ¡also ¡interpretable ¡as ¡)me-‑limited ¡values ¡ § Theorem: ¡will ¡converge ¡to ¡unique ¡op?mal ¡values ¡ Convergence* ¡ Policy ¡Extrac)on ¡ § How ¡do ¡we ¡know ¡the ¡V k ¡vectors ¡will ¡converge? ¡ § Case ¡1: ¡If ¡the ¡tree ¡has ¡maximum ¡depth ¡M, ¡then ¡ V M ¡holds ¡the ¡actual ¡untruncated ¡values ¡ § Case ¡2: ¡If ¡the ¡discount ¡is ¡less ¡than ¡1 ¡ § Sketch: ¡For ¡any ¡state ¡V k ¡and ¡V k+1 ¡can ¡be ¡viewed ¡as ¡ depth ¡k+1 ¡expec)max ¡results ¡in ¡nearly ¡iden)cal ¡ search ¡trees ¡ § The ¡max ¡difference ¡happens ¡if ¡big ¡reward ¡at ¡k+1 ¡level ¡ § That ¡last ¡layer ¡is ¡at ¡best ¡all ¡R MAX ¡ ¡ § But ¡everything ¡is ¡discounted ¡by ¡γ k ¡that ¡far ¡out ¡ § So ¡V k ¡and ¡V k+1 ¡are ¡at ¡most ¡γ k ¡max|R| ¡different ¡ § So ¡as ¡k ¡increases, ¡the ¡values ¡converge ¡ Compu)ng ¡Ac)ons ¡from ¡Values ¡ Compu)ng ¡Ac)ons ¡from ¡Q-‑Values ¡ § Let’s ¡imagine ¡we ¡have ¡the ¡op)mal ¡values ¡V*(s) ¡ § Let’s ¡imagine ¡we ¡have ¡the ¡op)mal ¡q-‑values: ¡ § How ¡should ¡we ¡act? ¡ § How ¡should ¡we ¡act? ¡ § It’s ¡not ¡obvious! ¡ § Completely ¡trivial ¡to ¡decide! ¡ § We ¡need ¡to ¡do ¡a ¡mini-‑expec)max ¡(one ¡step) ¡ § This ¡is ¡called ¡policy ¡extrac)on, ¡since ¡it ¡gets ¡the ¡policy ¡implied ¡by ¡the ¡values ¡ § Important ¡lesson: ¡ac)ons ¡are ¡easier ¡to ¡select ¡from ¡q-‑values ¡than ¡values! ¡ 6

1 Gridworld: Q* The Bellman Equa)ons How to be op)mal: - PDF document

Stochas)c Planning: MDPs CS 188: Ar)ficial Intelligence Static Markov Decision Processes II Environment Fully Observable Stochastic What action next? Instantaneous Perfect Percepts

Privacy-preserving KYC on Ethereum Introduction A decentralized KYC-compliant identity Alex

So, I have all these containers! Now what? Image by Connie Zhou Developer View job hello_world

Python Evangelism 101 or Python Meets The Enterprise Peter Wang Teaching Languages

Locations and Neighborhoods in Urbanizing Africa Simon Franklin London School of Economics

CS70: Jean Walrand: Lecture 35. Continuous Probability 2 1. Review: CDF , PDF 2. Examples 3.

ECE 4524 Artificial Intelligence and Engineering Applications Lecture 16: Uncertainty and

Lecture 7: Shared memory programming David Bindel 20 Sep 2011 Logistics Still have a couple

Last time: GADTs a b 1/ 41 This time: monads (etc.) = > > 2/ 41 What do monads

Neil Mitchell www.cs.york.ac.uk/~ndm/ The Problem Count the number of lines in a file

Generation of Non-Uniform Random Numbers Generation of Non-Uniform Random Numbers Refs: Chapter 8

Upper and Lower Semimodularity of the Supercharacter Theory Lattices of Cyclic Groups Samuel

PVMD Olindo Isabella Delft University of Technology Learning objectives 1. PV module design

TEAM MANAGERS' BRIEFING & DRAW 15 th August 2019 @7pm STTA Conference Room Briefing Agenda

Ramsey-type colourings and Relation Algebras Tomasz Kowalski Department of Mathematics and

Prices may vary geographically Remember there is a network involved, and power has to flow...

Rectangle Free Coloring of Grids Stephen Fenner- U of SC William Gasarch- U of MD Charles

SANE BOUNDS ON SOME VDW-TYPE NUMBERS A collaboration spanning many papers and authors.

A new criterion for avoiding the propagation of linear relations through an Sbox Christina Boura

Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore

Algorithms for Lattice Transforms and 2348 2349 Lattice Transforms for 234 248 239

Automated Reasoning Rippling: Heuristic Guidance for Inductive Proof (II) Alan Bundy Automated

Fast polynomial reduction for generic bivariate ideals Joris van der Hoeven, Robin Larrieu

Composition of Monads Jeremy E. Dawson Logic & Computation Programme Automated Reasoning

On the Boomerang Uniformity of Cryptographic Sboxes Christina Boura and Anne Canteaut University

1 Gridworld: Q* The Bellman Equa)ons How to be op)mal: - PDF document

Stochas)c Planning: MDPs CS 188: Ar)ficial Intelligence Static Markov Decision Processes II Environment Fully Observable Stochastic What action next? Instantaneous Perfect Percepts

Privacy-preserving KYC on Ethereum Introduction A decentralized KYC-compliant identity Alex

So, I have all these containers! Now what? Image by Connie Zhou Developer View job hello_world

Python Evangelism 101 or Python Meets The Enterprise Peter Wang Teaching Languages

Locations and Neighborhoods in Urbanizing Africa Simon Franklin London School of Economics

CS70: Jean Walrand: Lecture 35. Continuous Probability 2 1. Review: CDF , PDF 2. Examples 3.

ECE 4524 Artificial Intelligence and Engineering Applications Lecture 16: Uncertainty and

Lecture 7: Shared memory programming David Bindel 20 Sep 2011 Logistics Still have a couple

Last time: GADTs a b 1/ 41 This time: monads (etc.) = &gt; &gt; 2/ 41 What do monads

Neil Mitchell www.cs.york.ac.uk/~ndm/ The Problem Count the number of lines in a file

Generation of Non-Uniform Random Numbers Generation of Non-Uniform Random Numbers Refs: Chapter 8

Upper and Lower Semimodularity of the Supercharacter Theory Lattices of Cyclic Groups Samuel

PVMD Olindo Isabella Delft University of Technology Learning objectives 1. PV module design

TEAM MANAGERS' BRIEFING &amp; DRAW 15 th August 2019 @7pm STTA Conference Room Briefing Agenda

Ramsey-type colourings and Relation Algebras Tomasz Kowalski Department of Mathematics and

Prices may vary geographically Remember there is a network involved, and power has to flow...

Rectangle Free Coloring of Grids Stephen Fenner- U of SC William Gasarch- U of MD Charles

SANE BOUNDS ON SOME VDW-TYPE NUMBERS A collaboration spanning many papers and authors.

A new criterion for avoiding the propagation of linear relations through an Sbox Christina Boura

Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore

Algorithms for Lattice Transforms and 2348 2349 Lattice Transforms for 234 248 239

Automated Reasoning Rippling: Heuristic Guidance for Inductive Proof (II) Alan Bundy Automated

Fast polynomial reduction for generic bivariate ideals Joris van der Hoeven, Robin Larrieu

Composition of Monads Jeremy E. Dawson Logic &amp; Computation Programme Automated Reasoning

On the Boomerang Uniformity of Cryptographic Sboxes Christina Boura and Anne Canteaut University

Last time: GADTs a b 1/ 41 This time: monads (etc.) = > > 2/ 41 What do monads

TEAM MANAGERS' BRIEFING & DRAW 15 th August 2019 @7pm STTA Conference Room Briefing Agenda

Composition of Monads Jeremy E. Dawson Logic & Computation Programme Automated Reasoning