Markov Decision Processes and Reinforcement Learning




  1. Lecture 14: Markov Decision Processes and Reinforcement Learning
Marco Chiarandini, Department of Mathematics & Computer Science, University of Southern Denmark
Slides by Stuart Russell and Peter Norvig

  2. Course Overview
✔ Introduction
✔ Artificial Intelligence
✔ Intelligent Agents
✔ Search
  ✔ Uninformed Search
  ✔ Heuristic Search
✔ Uncertain knowledge and Reasoning
  ✔ Probability and Bayesian approach
  ✔ Bayesian Networks
  ✔ Hidden Markov Chains
  ✔ Kalman Filters
Learning
  ✔ Supervised: Decision Trees, Neural Networks, Learning Bayesian Networks
  Unsupervised: EM Algorithm
  Reinforcement Learning
Games and Adversarial Search
  Minimax search and Alpha-beta pruning
  Multiagent search
Knowledge representation and Reasoning
  Propositional logic
  First order logic
  Inference
  Planning

  3. Recap
Supervised: given (x₁, y₁), (x₂, y₂), ..., learn y = f(x)
Unsupervised: given x₁, x₂, ..., learn Pr(X = x)
Reinforcement: given a sequence (s, a, s, a, s, ...) plus rewards at some states, learn a policy π(s)

  4. Reinforcement Learning
Consider chess: we wish to learn the correct move for each state, but no feedback is available on individual moves; the only feedback is a reward or reinforcement at the end of a sequence of moves, or at some intermediary states. Agents then learn a transition model.
Other examples: backgammon, helicopter flight, etc.
Recall: environments are categorized along several dimensions:
  fully observable vs. partially observable
  deterministic vs. stochastic
  episodic vs. sequential
  static vs. dynamic
  discrete vs. continuous
  single-agent vs. multi-agent

  5. Markov Decision Processes
Sequential decision problems: the outcome depends on a sequence of decisions. They include search and planning as special cases:
  search: problem solving in a state space (deterministic and fully observable)
  planning: interleaves planning and execution, gathering feedback from the environment (because of stochasticity, partial observability, multiple agents; belief state space)
  learning: uncertainty
Environment:        Deterministic      Stochastic
Fully observable    A∗, DFS, BFS       MDP

  6. Reinforcement Learning
MDP: fully observable environment, and the agent knows the reward function.
Now: fully observable environment, but no knowledge of how it works (reward function) and probabilistic actions.

  7. Outline
1. Markov Decision Processes
2. Reinforcement Learning

  8. Terminology and Notation
Sequential decision problem in a fully observable, stochastic environment with a Markovian transition model and additive rewards:
  s ∈ S: states
  a ∈ A(s): actions
  s₀: start state
  p(s′ | s, a): transition probability; the world is stochastic; Markovian assumption
  R(s) or R(s, a, s′): reward
  U([s₀, s₁, ..., sₙ]) or V(): utility function, depends on the sequence of states (sum of rewards)
A fixed action sequence is not good because of the probabilistic actions.
Policy π: specification of what to do in any state.
Optimal policy π∗: policy with highest expected utility.
[Figure: (a) the 4×3 grid world with START in the lower-left corner, a blocked square, and terminal rewards +1 and −1 in the rightmost column; (b) the transition model: the intended move succeeds with probability 0.8 and slips to either side with probability 0.1 each.]
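To make the notation concrete, here is a minimal Python sketch of the grid world in the figure. The names (STATES, ACTIONS, transition, reward), the 0-indexed (column, row) coordinates, and the living reward of 0 are assumptions for illustration, not part of the slides.

WALLS = {(1, 1)}                              # the blocked square, 0-indexed (column, row)
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}      # the +1 and -1 exit squares
STATES = [(x, y) for x in range(4) for y in range(3) if (x, y) not in WALLS]
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
LEFT_OF = {"N": "W", "W": "S", "S": "E", "E": "N"}     # 90-degree slips
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def move(s, a):
    # Deterministic effect of action a in state s; bumping into the blocked
    # square or the edge of the grid leaves the agent where it is.
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def transition(s, a):
    # p(s' | s, a): the intended move succeeds with probability 0.8 and slips
    # to either side with probability 0.1 each; exit squares are absorbing.
    if s in TERMINALS:
        return {s: 1.0}
    dist = {}
    for direction, prob in [(a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)]:
        s2 = move(s, direction)
        dist[s2] = dist.get(s2, 0.0) + prob
    return dist

def reward(s):
    # R(s): +1 / -1 in the exit squares, 0 elsewhere (matching the -r 0 flag in
    # the gridworld.py runs on the later slides; the textbook default is -0.04).
    return TERMINALS.get(s, 0.0)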

  9. Highest Expected Utility
U([s₀, s₁, ..., sₙ]) = R(s₀) + γ R(s₁) + γ² R(s₂) + ... + γⁿ R(sₙ)
U^π(s) = E_π[ Σ_{t=0}^{∞} γ^t R(s_t) ] = R(s) + γ Σ_{s′} Pr(s′ | s, π(s)) U^π(s′)
(looks onwards: the utility of a state depends on the utilities of its future neighbours)
Optimal policy:
U^{π∗}(s) = max_π U^π(s)
π∗(s) = argmax_π U^π(s)
Choose actions by maximum expected utility (Bellman equation):
U^{π∗}(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s′} Pr(s′ | s, a) U(s′)
π∗(s) = argmax_{a ∈ A(s)} [ R(s) + γ Σ_{s′} Pr(s′ | s, a) U(s′) ]
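As a check on the Bellman equation, here is a small sketch of a single backup, reusing the grid-world names assumed in the sketch above (STATES, ACTIONS, TERMINALS, transition, reward):

GAMMA = 0.9

def bellman_backup(s, U):
    # U_new(s) = R(s) + gamma * max_a sum_{s'} Pr(s' | s, a) U(s')
    if s in TERMINALS:
        return reward(s)                      # nothing follows an absorbing exit
    best = max(sum(p * U[s2] for s2, p in transition(s, a).items())
               for a in ACTIONS)
    return reward(s) + GAMMA * best

# Starting from U = 0 everywhere except the exits, the square just west of the
# +1 exit backs up to 0.9 * 0.8 * 1.0 = 0.72, the number that also appears in
# the gridworld.py Q-values on the later slides.
U = {s: reward(s) for s in STATES}
print(bellman_backup((2, 2), U))              # 0.72 (up to float rounding)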

  10. Value Iteration
1. Calculate the utility of each state using the iterative procedure below.
2. Use the state utilities to select an optimal action.
For step 1, use the following iterative algorithm (a sketch is given below):
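A compact sketch of the iterative procedure for step 1, again under the grid-world assumptions introduced earlier (STATES, ACTIONS, TERMINALS, transition, reward); the stopping test uses the standard bound guaranteeing the returned utilities are within EPSILON of U∗:

GAMMA, EPSILON = 0.9, 1e-6

def value_iteration():
    # Repeatedly apply Bellman backups to all states until the values converge.
    U = {s: 0.0 for s in STATES}
    while True:
        U_new, delta = {}, 0.0
        for s in STATES:
            if s in TERMINALS:
                U_new[s] = reward(s)
            else:
                U_new[s] = reward(s) + GAMMA * max(
                    sum(p * U[s2] for s2, p in transition(s, a).items())
                    for a in ACTIONS)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < EPSILON * (1 - GAMMA) / GAMMA:   # converged to within EPSILON
            return U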

  11. Q-Values
For step 2, once the optimal U∗ values have been calculated:
π∗(s) = argmax_{a ∈ A(s)} [ R(s) + γ Σ_{s′} Pr(s′ | s, a) U∗(s′) ]
Hence we would need to compute the sum for each a. Idea: save Q-values
Q∗(s, a) = R(s) + γ Σ_{s′} Pr(s′ | s, a) U∗(s′)
so actions are easier to select:
π∗(s) = argmax_{a ∈ A(s)} Q∗(s, a)
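A short sketch of caching Q-values and reading the policy off by an argmax, with the same assumed grid-world definitions and a value_iteration like the sketch on the previous slide:

GAMMA = 0.9

def q_value(s, a, U):
    # Q*(s, a) = R(s) + gamma * sum_{s'} Pr(s' | s, a) U*(s')
    return reward(s) + GAMMA * sum(p * U[s2] for s2, p in transition(s, a).items())

def extract_policy(U):
    # pi*(s) = argmax_a Q*(s, a); the exit squares need no action here.
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, U))
            for s in STATES if s not in TERMINALS}

# Usage: U_star = value_iteration(); policy = extract_policy(U_star)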

  12. Example
python gridworld.py -a value -i 1 --discount 0.9 --noise 0.2 -r 0 -k 1 -t
[Output: values and Q-values of the grid after 1 iteration. All state values are still 0 except the exit squares (+1.00 and −1.00); only the Q-values of the squares adjacent to the exits are non-zero (e.g. 0.72 for moving east toward the +1 exit, −0.72 toward the −1 exit, ±0.09 for the sideways slips).]

  13. Example
python gridworld.py -a value -i 2 --discount 0.9 --noise 0.2 -r 0 -k 1 -t
[Output: values and Q-values after 2 iterations. The value 0.72 has now propagated to the square west of the +1 exit, and its neighbours pick up non-zero Q-values in turn (e.g. 0.52, 0.78, 0.43, −0.66).]
