  1. Logistics
• Reading: AIMA Ch 21 (Reinforcement Learning); Markov Decision Processes
• Project 1 due today: 2 printouts of the report; email Miao with "CSE 573" in the subject
  • Source code
  • Document in .doc or .pdf
• Project 2 description on the web
• New teams: by Monday 11/15, email Miao with your team + direction. Feel free to consider other ideas.

Idea 1: Spam Filter
• Decision Tree Learner?
• Ensemble of …?
• Naïve Bayes?
• Bag of Words representation
• Enhancement: augment the data set? ???????

Idea 2: Localization
• Placelab data
• Learn “places”: K-means clustering
• Predict movements between places: Markov model, or ….

Proto-idea 3: Captchas
• The problem of software robots
• Turing test is big business
• Break or create; non-vision based?

Proto-idea 4: Openmind.org
• Repository of knowledge in NLP
• What the heck can we do with it????

  2. Proto-idea 4: Wordnet (Openmind Animals)
• www.cogsci.princeton.edu/~wn/
• Giant graph of concepts; centrally controlled → semantics
• What to do? Integrate with FAQ lists, Openmind, ???

573 Topics
• Uncertainty
• Bayesian Networks
• Sequential Stochastic Processes
  • (Hidden) Markov Models
  • Dynamic Bayesian networks (DBNs)
  • Markov Decision Processes (MDPs)
• Reinforcement Learning

Where are We? (course layers)
• Reinforcement Learning; Supervised Learning
• Planning; Probabilistic STRIPS Representation
• Knowledge Representation & Inference (Logic-Based, Probabilistic)
• Search
• Problem Spaces
• Agency

An Example Bayes Net (see the enumeration sketch below)
• Nodes: Earthquake (E), Burglary (B), Radio, Alarm (A), Nbr1Calls, Nbr2Calls
• Pr(B=t) = 0.05, Pr(B=f) = 0.95
• Pr(A=t | E,B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99)

Planning under uncertainty
• Environment: static; actions: instantaneous, stochastic
• Fully observable; perfect percepts and actions
• What action next?
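
The alarm network above can be queried by brute-force enumeration. Below is a minimal Python sketch using the CPT values from the slide; the earthquake prior Pr(E=t) = 0.03 is an assumed placeholder, since the slide does not show it, and the function name is illustrative only.

```python
# Minimal sketch: P(Alarm=true) by enumeration over E and B.
# CPT values for B and A are taken from the slide; P_E is an assumed
# placeholder (the earthquake prior is not shown on the slide).

P_B = {True: 0.05, False: 0.95}           # Pr(B) from the slide
P_E = {True: 0.03, False: 0.97}           # assumed prior, NOT from the slide
P_A_GIVEN_EB = {                          # Pr(A=t | E, B) from the slide
    (True, True): 0.9,
    (True, False): 0.2,
    (False, True): 0.85,
    (False, False): 0.01,
}

def prob_alarm():
    """Marginalize out E and B: P(A=t) = sum_{e,b} P(e) P(b) P(A=t | e, b)."""
    return sum(P_E[e] * P_B[b] * P_A_GIVEN_EB[(e, b)]
               for e in (True, False)
               for b in (True, False))

print(f"P(Alarm=true) ~= {prob_alarm():.4f}")
```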

  3. Recap: Markov Models
• Q: set of states
• π: initial probability distribution
• A: transition probability distribution, ONE per ACTION
• Markov assumption; stationary model assumption

Models of Planning
                  Deterministic   Disjunctive   Probabilistic
Complete obs.     Classical       Contingent    MDP
Partial obs.      ???             Contingent    POMDP
No obs.           ???             Conformant    POMDP

Probabilistic “STRIPS”? A factored domain
• Variables: has_user_coffee (huc), has_robot_coffee (hrc), robot_is_wet (w), has_robot_umbrella (u), raining (r), robot_in_office (o)
• Actions: buy_coffee, deliver_coffee, get_umbrella, move
• Example: Move: office → cafe. Starting in the office, raining, with umbrella, the robot ends up not in the office; with probability < .1 it also gets wet.
• What is the number of states? (2^6 = 64)
• Can we succinctly represent transition probabilities in this case?

Dynamic Bayesian Nets: a DBN for Move (see the sketch below)
• Each variable at time t+1 (huc', hrc', w', u', r', o') depends on only a few variables at time t.
• Pr(w'=t | u,w): u,w: 1.0 (0); u,¬w: 0.1 (0.9); ¬u,w: 1.0 (0); ¬u,¬w: 1.0 (0)
• Pr(r'=t | r): 0.95 if r, 0.5 if ¬r
• Total values required to represent the transition probability table: 8 (huc') + 4 (hrc') + 16 (w') + 4 (u') + 2 (r') + 2 (o') = 36, vs. 4096 for a flat 64 × 64 matrix. (The slide notes the huc' table should actually have 16 entries.)
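
To make the compactness argument concrete, here is a minimal Python sketch of the Pr(w' | u, w) fragment shown above, with a sampler for the next value of w. The exact true/false pattern of the table is reconstructed from the slide, and the function name is illustrative, not from the slides.

```python
import random

# Minimal sketch of one factored CPT from the DBN for Move:
#   Pr(w'=true | u, w), values as reconstructed from the slide.
# Only 4 numbers are needed for this fragment; a flat 64x64 transition
# matrix over the whole 6-variable domain would need 4096 entries.

P_WET_NEXT = {                 # keys: (has_umbrella, is_wet_now)
    (True, True): 1.0,         # already wet -> stays wet
    (True, False): 0.1,        # umbrella: small chance of getting wet
    (False, True): 1.0,
    (False, False): 1.0,       # no umbrella: gets wet
}

def sample_wet_next(has_umbrella: bool, is_wet: bool) -> bool:
    """Sample w' from the factored CPT."""
    return random.random() < P_WET_NEXT[(has_umbrella, is_wet)]

# Example: robot carries an umbrella and is currently dry.
print(sample_wet_next(True, False))
```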

  4. Actions in DBN; Observability
• Last time: actions in a DBN, unrolling the network over time slices T, T+1, …; we don't need them explicitly today.
• Observability: full, partial, or none.

Reward / cost
• Each action has an associated cost.
• The agent may accrue rewards at different stages. A reward may depend on:
  • the current state,
  • the (current state, action) pair, or
  • the (current state, action, next state) triplet.
• Additivity assumption: costs and rewards are additive.
• Reward accumulated = R(s0) + R(s1) + R(s2) + …

Horizon (see the sketch below)
• Finite: plan till t stages. Reward = R(s0) + R(s1) + … + R(st).
• Infinite: the agent never dies. The reward R(s0) + R(s1) + R(s2) + … could be unbounded.
  • Discounted reward: R(s0) + γ R(s1) + γ² R(s2) + …
  • Average reward: lim_{n→∞} (1/n) Σ_i R(s_i)

Goal for an MDP
• Find a policy which maximizes expected discounted reward over an infinite horizon for a fully observable Markov decision process.
• Why shouldn't the planner find a plan?? What is a policy??

Optimal value of a state
• Define V*(s), the ‘value of a state’, as the maximum expected discounted reward achievable from that state.
• Value of the state if we force the agent to do action a right now, but let it act optimally afterwards:
  Q*(a,s) = R(s) + c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s')
• V* should satisfy the following equation:
  V*(s) = max_{a∈A} Q*(a,s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s') }
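
As a small illustration of the additive, discounted, and average reward criteria above, here is a minimal Python sketch; the reward sequence and discount factor are made up purely for the example.

```python
# Minimal sketch: finite-horizon, discounted, and average reward
# for a made-up reward sequence R(s0), R(s1), ..., R(st).

rewards = [0.0, 1.0, 0.0, 5.0, 1.0]   # illustrative values, not from the slides
gamma = 0.9                            # discount factor (assumed)

finite_horizon = sum(rewards)                                  # R(s0)+...+R(st)
discounted = sum(gamma**i * r for i, r in enumerate(rewards))  # R(s0)+γR(s1)+γ²R(s2)+...
average = sum(rewards) / len(rewards)                          # (1/n) Σ_i R(s_i)

print(finite_horizon, discounted, average)
```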

  5. Value iteration (see the code sketch below)
• Assign an arbitrary value to each state (or use an admissible heuristic).
• Iterate over the set of states, improving the value function in each iteration:
  V_{t+1}(s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V_t(s') }
  This update is a ‘Bellman backup’.
• Stop the iteration appropriately; V_t approaches V* as t increases.

Bellman Backup
(Figure: Q_{n+1}(s,a) is computed for each action a1, a2, a3 from the successors' V_n values; V_{n+1}(s) is the max over actions.)

Stopping condition
• ε-convergence: a value function is ε-optimal if the error (residue) at every state is less than ε.
• Residue(s) = |V_{t+1}(s) − V_t(s)|
• Stop when max_{s∈S} Residue(s) < ε.

Complexity of value iteration
• One iteration takes O(|S|² |A|) time.
• Number of iterations required: poly(|S|, |A|, 1/(1−γ)).
• Overall, the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.

Computation of the optimal policy
• Given the value function V*(s), do a Bellman backup at each state; the action which maximizes the inner term is the optimal action.
• The optimal policy is stationary (time-independent), which is intuitive for the infinite-horizon case.

Policy evaluation
• Given a policy Π: S → A, find the value of each state under this policy:
  V^Π(s) = R(s) + c(Π(s)) + γ Σ_{s'∈S} Pr(s'|Π(s),s) V^Π(s')
• This is a system of linear equations in |S| variables.
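
Below is a minimal Python sketch of value iteration following the slide's formulation (state reward R(s), action cost c(a), discount γ, ε-convergence test). The tiny two-state MDP and all names are made up for illustration only.

```python
# Minimal value-iteration sketch following the slide's update:
#   V_{t+1}(s) = R(s) + max_a { c(a) + gamma * sum_{s'} Pr(s'|a,s) * V_t(s') }
# The two-state MDP below is made up for illustration only.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

R = {"s0": 0.0, "s1": 1.0}             # state rewards
C = {"stay": 0.0, "go": -0.1}          # action costs (added, as on the slide)
T = {                                  # Pr(s' | a, s)
    ("stay", "s0"): {"s0": 1.0, "s1": 0.0},
    ("go",   "s0"): {"s0": 0.2, "s1": 0.8},
    ("stay", "s1"): {"s0": 0.0, "s1": 1.0},
    ("go",   "s1"): {"s0": 0.8, "s1": 0.2},
}

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in STATES}                      # arbitrary initial values
    while True:
        V_new = {}
        for s in STATES:
            q = [C[a] + GAMMA * sum(p * V[s2] for s2, p in T[(a, s)].items())
                 for a in ACTIONS]
            V_new[s] = R[s] + max(q)                  # Bellman backup
        residue = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if residue < eps:                             # epsilon-convergence test
            return V

print(value_iteration())
```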

  6. Bellman's principle of optimality
• A policy Π is optimal if V^Π(s) ≥ V^Π'(s) for all policies Π' and all states s ∈ S.
• Rather than finding the optimal value function, we can try to find the optimal policy directly by searching policy space.

Policy iteration (see the code sketch below)
• Start with any policy Π_0.
• Iterate:
  • Policy evaluation: for each state, compute V^{Π_i}(s).
  • Policy improvement: for each state s, find the action a* that maximizes Q^{Π_i}(a,s).
    If Q^{Π_i}(a*,s) > V^{Π_i}(s), let Π_{i+1}(s) = a*; else let Π_{i+1}(s) = Π_i(s).
• Stop when Π_{i+1} = Π_i.
• Converges in fewer iterations than value iteration, but the policy-evaluation step is more expensive.

Modified policy iteration
• Rather than evaluating the actual value of the policy by solving the system of equations, approximate it by running value iteration with the policy held fixed.

RTDP iteration
• Start with the initial belief and initialize the value of each belief to its heuristic value.
• For the current belief:
  • Save the action that minimizes the current state value in the current policy.
  • Update the value of the belief through a Bellman backup.
  • Apply the chosen (minimum) action, then randomly pick an observation.
  • Go to the next belief assuming that observation.
• Repeat until the goal is achieved.

Fast RTDP convergence
• What are the advantages of RTDP? What are the disadvantages? How can we speed up RTDP?

Other speedups
• Heuristics
• Aggregations
• Reachability analysis
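
For comparison, here is a minimal Python sketch of policy iteration in the same style, reusing a made-up two-state MDP. Policy evaluation solves the |S|-variable linear system directly (via numpy), matching the slide's description; all names are illustrative.

```python
import numpy as np

# Minimal policy-iteration sketch:
#   evaluation:  solve V^pi = R + c(pi(s)) + gamma * T_pi V^pi  (linear system)
#   improvement: pi'(s) = argmax_a Q^pi(a, s)
# The two-state MDP is made up for illustration only.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
IDX = {s: i for i, s in enumerate(STATES)}

R = {"s0": 0.0, "s1": 1.0}
C = {"stay": 0.0, "go": -0.1}
T = {
    ("stay", "s0"): {"s0": 1.0, "s1": 0.0},
    ("go",   "s0"): {"s0": 0.2, "s1": 0.8},
    ("stay", "s1"): {"s0": 0.0, "s1": 1.0},
    ("go",   "s1"): {"s0": 0.8, "s1": 0.2},
}

def evaluate(pi):
    """Policy evaluation: solve (I - gamma*T_pi) V = R + c(pi)."""
    n = len(STATES)
    A = np.eye(n)
    b = np.zeros(n)
    for s in STATES:
        i = IDX[s]
        b[i] = R[s] + C[pi[s]]
        for s2, p in T[(pi[s], s)].items():
            A[i, IDX[s2]] -= GAMMA * p
    return dict(zip(STATES, np.linalg.solve(A, b)))

def q_value(a, s, V):
    return R[s] + C[a] + GAMMA * sum(p * V[s2] for s2, p in T[(a, s)].items())

def policy_iteration():
    pi = {s: ACTIONS[0] for s in STATES}              # start with any policy
    while True:
        V = evaluate(pi)                              # policy evaluation
        new_pi = {s: max(ACTIONS, key=lambda a: q_value(a, s, V)) for s in STATES}
        if new_pi == pi:                              # no change -> optimal
            return pi, V
        pi = new_pi                                   # policy improvement

print(policy_iteration())
```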

  7. Going beyond full observability
• In the execution phase we are uncertain where we are, but we have some idea of where we can be.
• A belief state = ?

Models of Planning (recap)
                  Deterministic   Disjunctive   Probabilistic
Complete obs.     Classical       Contingent    MDP
Partial obs.      ???             Contingent    POMDP
No obs.           ???             Conformant    POMDP

Speedups
• Reachability analysis
• More informed heuristics

Mathematical modelling
• Search space: finite/infinite state space or belief space (belief state = some idea of where we are)
• Initial state / belief
• Actions; action transitions (state to state / belief to belief); action costs
• Feedback: zero / partial / total

Algorithms for search
• A*: works for sequential solutions.
• AO*: works for acyclic solutions.
• LAO*: works for cyclic solutions.
• RTDP: works for cyclic solutions.

Full observability
• Modelled as MDPs (also called fully observable MDPs).
• Output: a policy (State → Action).
• Bellman equation: V*(s) = max_{a∈A(s)} [ c(a) + Σ_{s'∈S} V*(s') P(s'|s,a) ]

  8. Partial Observability
• Modelled as POMDPs (partially observable MDPs); also called Probabilistic Contingent Planning.
• A belief is a probability distribution over states. What is the size of the belief space?
• Output: a policy (discretized belief → action).
• Bellman equation: V*(b) = max_{a∈A(b)} [ c(a) + Σ_{o∈O} P(b,a,o) V*(b_a^o) ], where b_a^o is the belief reached from b after taking action a and observing o (see the belief-update sketch below).

No observability
• Deterministic search in the belief space.
• Output ?
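
To make the belief-state idea concrete, here is a minimal Python sketch of the Bayesian belief update b_a^o(s') ∝ O(o | s', a) Σ_s P(s' | s, a) b(s). The two-state transition and observation models, and all names, are made up for illustration only.

```python
# Minimal sketch of a POMDP belief update:
#   b_a^o(s') proportional to O(o | s', a) * sum_s P(s' | s, a) * b(s)
# The two-state transition/observation models are made up for illustration.

STATES = ["office", "cafe"]

# P(s' | s, a) for a single illustrative action "move"
T = {
    "office": {"office": 0.1, "cafe": 0.9},
    "cafe":   {"office": 0.0, "cafe": 1.0},
}
# O(o=see_coffee_machine | s', a): illustrative observation likelihoods
OBS = {"office": 0.2, "cafe": 0.8}

def belief_update(b, observed_coffee_machine=True):
    """Predict through T, weight by the observation likelihood, renormalize."""
    predicted = {s2: sum(T[s][s2] * b[s] for s in STATES) for s2 in STATES}
    likelihood = {s2: (OBS[s2] if observed_coffee_machine else 1 - OBS[s2])
                  for s2 in STATES}
    unnorm = {s2: likelihood[s2] * predicted[s2] for s2 in STATES}
    z = sum(unnorm.values())
    return {s2: p / z for s2, p in unnorm.items()}

b0 = {"office": 0.5, "cafe": 0.5}          # initial belief
print(belief_update(b0))
```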
