Department of Computer Science
CSCI 5622: Machine Learning
Chenhao Tan
Lecture 21: Reinforcement learning I
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
Administrivia
• Poster printing
• Email your poster to inkspot.umc@colorado.edu with subject "Tan Poster Project" by Thursday noon
• Poster size: A1
• Check Piazza for details
• Light refreshments will be provided; invite your friends
• Poster session: DLC 1B70 on Dec 13
Learning objectives
• Understand the formulation of reinforcement learning
• Understand the definition of a policy and the optimal policy
• Learn about value iteration
• Most of these two lectures are based on Richard S. Sutton and Andrew G. Barto's book
Supervised learning: data X, labels Y
Unsupervised learning: data X, latent structure Z
An agent learns to behave in an environment
Reinforcement learning examples
• Mnih et al. 2013
• https://www.youtube.com/watch?v=V1eYniJ0Rnk
Reinforcement learning
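Following Sutton and Barto's formulation of the agent-environment loop:
• At each time step t, the agent observes the state s_t and selects an action a_t
• The environment responds with a reward r_{t+1} and the next state s_{t+1}
• The agent's goal is to choose actions that maximize cumulative reward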
Markov decision processes
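A Markov decision process (MDP) is specified by:
• A set of states S
• A set of actions A
• Transition probabilities P(s' | s, a) = Pr(S_{t+1} = s' | S_t = s, A_t = a)
• A reward function R(s, a, s')
• A discount factor γ ∈ [0, 1)
The Markov property: the next state and reward depend only on the current state and action, not on the rest of the history.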
A few examples
• Grid world
A few examples
• Atari games (bonus: try a Google image search for "atari breakout")
A few examples
• Go
Goal
• Episodic tasks: end at a terminal state, e.g., one play of a game
• Continuing tasks: interaction goes on without limit, with an infinite number of steps
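In both cases the objective is to maximize the expected (discounted) return
G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
with discount factor γ ∈ [0, 1); for episodic tasks the sum stops at the terminal step T.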
Policy
• The agent's rule for selecting actions
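Formally, a policy π maps states to actions: deterministic, a = π(s), or stochastic, π(a | s) = Pr(A_t = a | S_t = s).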
Value function
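The state-value function of a policy π is the expected return from following π starting in state s:
V^π(s) = E_π[ G_t | S_t = s ] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]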
Action-value function (Q-function)
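The action-value function is the expected return from taking action a in state s and following π thereafter:
Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
The two are related by V^π(s) = Σ_a π(a | s) Q^π(s, a).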
Optimal policy and optimal value function
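• A policy π* is optimal if V^{π*}(s) ≥ V^π(s) for every state s and every policy π
• The optimal value functions satisfy the Bellman optimality equations:
  V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]
  Q*(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ]
• Given Q*, acting greedily is optimal: π*(s) = argmax_a Q*(s, a)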
A concrete grid example
• Grid world
A concrete grid example
• Rewards can be positive or negative
• Delayed reward: you might not receive any reward until you reach the goal
• You might receive negative reward at every step until you reach the goal
A concrete grid example
Take-away: the optimal policy depends heavily on the details of the reward function.
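For instance, in the classic 4×3 grid world of Russell and Norvig (not necessarily the grid in these slides), a small step cost such as -0.04 yields a cautious policy that takes the long route around the -1 pit, while a large step cost such as -2 makes the agent head straight for the nearest exit, even the bad one.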
Value Iteration
Punchline: discounting makes the infinite-horizon value function finite, which lets us actually compare the values of different action sequences.
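Concretely, if every reward is bounded, |R_t| ≤ R_max, then for γ ∈ [0, 1) the return is bounded by a geometric series:
|G_t| ≤ Σ_{k=0}^∞ γ^k R_max = R_max / (1 - γ) < ∞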
Value Iteration
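Value iteration turns the Bellman optimality equation into an iterative update, starting from an arbitrary V_0 and sweeping over all states until the values stop changing:
V_{k+1}(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]
The iteration converges to V* because the update is a γ-contraction. A minimal Python sketch on a deterministic grid world follows; the grid layout, step cost, discount, and stopping threshold are illustrative assumptions, not values from the slides.

import numpy as np

# Illustrative assumptions (not from the slides): a 3x4 deterministic
# grid world with a +1 terminal goal, a -1 terminal pit, one blocked
# cell, a small per-step cost, and discount gamma = 0.9.
ROWS, COLS = 3, 4
GOAL, PIT, WALL = (0, 3), (1, 3), (1, 1)
STEP_COST, GAMMA, THRESHOLD = -0.04, 0.9, 1e-6
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    # Deterministic transition: move one cell if legal, otherwise stay put.
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS and (nr, nc) != WALL:
        return (nr, nc)
    return state

def value_iteration():
    # Terminal states hold their fixed values; all others start at zero.
    V = np.zeros((ROWS, COLS))
    V[GOAL], V[PIT] = 1.0, -1.0
    while True:
        delta = 0.0
        for r in range(ROWS):
            for c in range(COLS):
                s = (r, c)
                if s in (GOAL, PIT, WALL):
                    continue
                # Bellman optimality backup: best one-step lookahead.
                best = max(STEP_COST + GAMMA * V[step(s, a)] for a in ACTIONS)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
        if delta < THRESHOLD:  # values have converged
            return V

def greedy_policy(V):
    # Extract the optimal policy by acting greedily with respect to V.
    return {
        (r, c): max(ACTIONS, key=lambda a: STEP_COST + GAMMA * V[step((r, c), a)])
        for r in range(ROWS) for c in range(COLS)
        if (r, c) not in (GOAL, PIT, WALL)
    }

V = value_iteration()
print(np.round(V, 3))
print(greedy_policy(V))

Once V has converged, the optimal policy is read off by acting greedily with respect to V, as greedy_policy does above.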