Lecture 14
Markov Decision Processes and Reinforcement Learning
Marco Chiarandini
Department of Mathematics & Computer Science University of Southern Denmark
Slides by Stuart Russell and Peter Norvig
Markov Decision Processes Reinforcement Learning
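[Figure: (a) the 4×3 grid world, columns 1 to 4 and rows 1 to 3, with a START state and terminal states +1 and −1; (b) the transition model: the intended move occurs with probability 0.8, and a move at right angles to it with probability 0.1 on each side.]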
[Equations lost in extraction; surviving symbols: π and a ∈ A(s) (twice).]
python gridworld.py -a value -i 1 --discount 0.9 --noise 0.2 -r 0 -k 1 -t

VALUES AFTER 1 ITERATIONS
(arrow = greedy action, S = start state, ##### = wall)
     |    0     |    1     |    2     |    3     |
  2  |  0.00 ^  |  0.00 ^  |  0.00 >  |   1.00   |
  1  |  0.00 ^  |  #####   |  0.00 <  |  -1.00   |
  0  | S 0.00 ^ |  0.00 ^  |  0.00 ^  |  0.00 v  |

Q-values per state, listed as north / west / east / south ("..." = cut off in the source):
     | 0                         | 1                         | 2                         | 3                         |
  2  | 0.00 / 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 / 0.00 | 0.09 / 0.00 / 0.72 / 0.09 | [ 1.00 ]                  |
  1  | 0.00 / 0.00 / 0.00 / 0.00 | #####                     | ... / 0.00 / ... / ...    | [ -1.00 ]                 |
  0  | 0.00 / 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 / 0.00 | ... / -0.09 / ... / 0.00  |
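The two non-zero Q-values of state (2, 2) after one iteration can be checked by hand. The worked arithmetic below is a sketch under stated assumptions: the noise of 0.2 is split evenly between the two directions perpendicular to the intended move (consistent with the 0.8 / 0.1 / 0.1 transition model), a move into the border leaves the agent in place, and V_1 is 1.00 in the +1 terminal cell and 0 in every other cell that appears here. The notation Q_1, V_1 is introduced only for this illustration.

```latex
\begin{align*}
Q_1\bigl((2,2), \text{east}\bigr)
  &= 0.8\,\gamma\,V_1(3,2) + 0.1\,\gamma\,V_1(2,2) + 0.1\,\gamma\,V_1(2,1) \\
  &= 0.8 \cdot 0.9 \cdot 1.00 + 0 + 0 = 0.72, \\
Q_1\bigl((2,2), \text{north}\bigr)
  &= 0.8\,\gamma\,V_1(2,2) + 0.1\,\gamma\,V_1(1,2) + 0.1\,\gamma\,V_1(3,2) \\
  &= 0 + 0 + 0.1 \cdot 0.9 \cdot 1.00 = 0.09.
\end{align*}
```

Both results agree with the 0.72 and 0.09 printed for that cell.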
python gridworld.py -a value -i 2 --discount 0.9 --noise 0.2 -r 0 -k 1 -t

VALUES AFTER 2 ITERATIONS
(arrow = greedy action, S = start state, ##### = wall)
     |    0     |    1     |    2     |    3     |
  2  |  0.00 ^  |  0.00 >  |  0.72 >  |   1.00   |
  1  |  0.00 ^  |  #####   |  0.00 ^  |  -1.00   |
  0  | S 0.00 ^ |  0.00 ^  |  0.00 ^  |  0.00 v  |

Q-values per state, listed as north / west / east / south ("..." = cut off in the source):
     | 0                         | 1                         | 2                         | 3                         |
  2  | 0.00 / 0.00 / 0.00 / 0.00 | 0.06 / 0.00 / 0.52 / 0.06 | 0.61 / 0.06 / 0.78 / 0.09 | [ 1.00 ]                  |
  1  | 0.00 / 0.00 / 0.00 / 0.00 | #####                     | 0.43 / 0.06 / ... / ...   | [ -1.00 ]                 |
  0  | 0.00 / 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 / 0.00 | ... / -0.09 / ... / 0.00  |
python gridworld.py -a value -i 3 --discount 0.9 --noise 0.2 -r 0 -k 1 -t

VALUES AFTER 3 ITERATIONS
(arrow = greedy action, S = start state, ##### = wall)
     |    0     |    1     |    2     |    3     |
  2  |  0.00 >  |  0.52 >  |  0.78 >  |   1.00   |
  1  |  0.00 ^  |  #####   |  0.43 ^  |  -1.00   |
  0  | S 0.00 ^ |  0.00 ^  |  0.00 ^  |  0.00 v  |

Q-values per state, listed as north / west / east / south ("..." = cut off in the source):
     | 0                         | 1                         | 2                         | 3                         |
  2  | 0.05 / 0.00 / 0.37 / 0.05 | 0.44 / 0.09 / 0.66 / 0.44 | 0.70 / 0.48 / 0.83 / 0.45 | [ 1.00 ]                  |
  1  | 0.00 / 0.00 / 0.00 / 0.00 | #####                     | 0.51 / 0.38 / ... / ...   | [ -1.00 ]                 |
  0  | 0.00 / 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 / 0.00 | 0.31 / 0.04 / 0.04 / 0.00 | ... / -0.09 / ... / 0.00  |
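The three runs above differ only in the iteration count passed with -i. The following is a minimal sketch of batch value iteration on the same 4×3 grid, not the code of gridworld.py itself. The layout constants, the helper names (step, q_value, value_iteration), and the treatment of terminal cells as states whose only action is to exit with reward +1 or −1 are assumptions made for illustration; discount, noise, and living reward follow the command line.

```python
# Minimal value-iteration sketch for the 4x3 grid world (assumed layout).
GAMMA, NOISE = 0.9, 0.2            # --discount 0.9, --noise 0.2
LIVING_REWARD = 0.0                # -r 0
COLS, ROWS = 4, 3
WALL = (1, 1)
EXITS = {(3, 2): 1.0, (3, 1): -1.0}    # terminal cells and their exit rewards
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# Directions the agent may slip to, at right angles to the intended move.
SLIPS = {"N": ("W", "E"), "S": ("W", "E"), "E": ("N", "S"), "W": ("N", "S")}
STATES = [(x, y) for x in range(COLS) for y in range(ROWS) if (x, y) != WALL]


def step(state, direction):
    """Deterministic move; bumping into the wall or the border stays put."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    return nxt if nxt in STATES else state


def q_value(values, state, action):
    """Expected discounted return of taking `action` in `state` under `values`."""
    if state in EXITS:
        return EXITS[state]        # the only option is to exit; the episode ends
    outcomes = [(1 - NOISE, action)] + [(NOISE / 2, s) for s in SLIPS[action]]
    return sum(p * (LIVING_REWARD + GAMMA * values[step(state, d)])
               for p, d in outcomes)


def value_iteration(iterations):
    """Run `iterations` batch sweeps starting from all-zero values."""
    values = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        # Batch update: every new value is computed from the previous sweep.
        values = {s: max(q_value(values, s, a) for a in MOVES) for s in STATES}
    return values


if __name__ == "__main__":
    for k in (1, 2, 3):
        v = value_iteration(k)
        print(k, {s: round(v[s], 2) for s in [(0, 2), (1, 2), (2, 2), (2, 1)]})
```

Under these assumptions, three sweeps give 0.78 at (2, 2), 0.52 at (1, 2), and 0.43 at (2, 1), the same values as in the third listing. The batch form (a fresh dictionary per sweep) is what makes information spread outward from the terminal cells by exactly one step per iteration.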
[Figure: five 4×3 grid worlds, each with terminal states +1 and −1; the contents of the remaining cells are not recoverable.]
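[Figure: two 4×3 grid worlds with terminal states +1 and −1. The second shows a utility for each state, arranged here as in the corresponding figure of Russell and Norvig:]
  3 | 0.812 | 0.868 | 0.918 |  +1
  2 | 0.762 |       | 0.660 |  −1
  1 | 0.705 | 0.655 | 0.611 | 0.388
    |   1   |   2   |   3   |   4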
[Figure: 4×3 grid world with terminal states +1 and −1, and a plot of RMS error and policy loss (y-axis, ticks 0.5 to 2) against the number of trials (x-axis, ticks 50 to 500).]
[Equations partly lost in extraction; surviving fragments: a ∈ A(s) (twice), a′, and an expression ending in … a′ Q*(s′, a′).]
… max_{a′} Q(s′, a′) − Q(s, a)]
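The surviving fragment above matches the tail of the usual tabular Q-learning update, in which Q(s, a) is moved toward the observed reward plus the discounted best Q-value of the successor state. The sketch below shows that standard update, not necessarily the exact form on the slide: the learning rate alpha, the discount gamma, the way the reward is passed in, and the function name q_learning_update are all assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha=0.5, gamma=0.9):
    """Apply one temporal-difference update to the table `q` in place.

    `alpha` (learning rate) and `gamma` (discount) are assumed values.
    """
    # Best Q-value reachable from the successor; 0 if it has no actions (terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    # The bracketed term: target minus current estimate.
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Tiny usage example with hypothetical states "s0" and "s1".
q = defaultdict(float)
q[("s1", "right")] = 1.0
new_value = q_learning_update(q, "s0", "right", reward=0.0,
                              next_state="s1", next_actions=["left", "right"])
```

Note that the update needs only the observed transition (s, a, reward, s′) and never the transition probabilities, which is why it can be applied while the agent acts.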