  1. Department of Computer Science CSCI 5622: Machine Learning Chenhao Tan Lecture 21: Reinforcement learning I Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

  2. Administrivia • Poster printing • Email your poster to inkspot.umc@colorado.edu with subject “Tan Poster Project” by Thursday noon • Poster size A1 • Check Piazza for details • Light refreshments will be provided, invite your friends • Poster session: DLC 1B70 on Dec 13

  3. Learning objectives • Understand the formulation of reinforcement learning • Understand the definition of a policy and the optimal policy • Learn about value iteration • Most of these two lectures are based on Richard S. Sutton and Andrew G. Barto’s book

  4. Supervised learning: Data X, Labels Y; Unsupervised learning: Data X, Latent structure Z

  5.

  6. An agent learns to behave in an environment
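
Slide 6 shows the agent-environment interaction: at each step the agent observes a state, selects an action, and the environment returns a reward and the next state. A minimal sketch of that loop in Python; the `env` object with `reset()` and `step()` methods is a hypothetical interface (loosely Gym-style), not something defined in the slides:

```python
# Minimal agent-environment interaction loop (hypothetical env interface).
def run_episode(env, policy, max_steps=1000):
    state = env.reset()                          # initial state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # agent picks A_t given S_t
        state, reward, done = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        total_reward += reward
        if done:                                 # terminal state: episode ends
            break
    return total_reward
```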

  7. Reinforcement learning examples • Mnih et al. 2013 • https://www.youtube.com/watch?v=V1eYniJ0Rnk

  8. Reinforcement learning examples

  9. Reinforcement learning

  10. Reinforcement learning

  11. Markov decision processes

  12. Markov decision processes

  13. Markov decision processes
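
The formal definition on slides 11-13 is not preserved in this transcript; the standard formulation, following Sutton and Barto, is sketched below.

```latex
% A finite Markov decision process (MDP) is a tuple (S, A, p, \gamma):
%   S: set of states,  A: set of actions,  \gamma: discount factor.
% One-step dynamics; the Markov property: only the current state and action matter.
\[
p(s', r \mid s, a) \;=\; \Pr\{\, S_{t+1} = s',\; R_{t+1} = r \mid S_t = s,\; A_t = a \,\}
\]
```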

  14. A few examples • Grid world

  15. A few examples • Atari game (Bonus: try Google image search “atari breakout”)

  16. A few examples • Go

  17. Goal • Episodes: end at a terminal state, e.g., a play of a game • Continuing tasks: the interaction goes on without a terminal state (infinitely many steps)
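
In both cases the objective is the expected return; with discounting, the return is well defined even for continuing (infinite-step) tasks. The standard definition, following Sutton and Barto:

```latex
\[
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad 0 \le \gamma \le 1 .
\]
```

For episodic tasks the sum stops at the terminal state; for continuing tasks, taking γ < 1 keeps it finite.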

  18. Policy • The agent’s action selection
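
Written out, a (stochastic) policy maps each state to a distribution over actions; a deterministic policy is the special case that picks a single action per state:

```latex
\[
\pi(a \mid s) \;=\; \Pr\{\, A_t = a \mid S_t = s \,\}
\]
```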

  19. Value function
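
The slide body is not in this transcript; the standard state-value function of a policy π is the expected return when starting in state s and following π thereafter:

```latex
\[
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right]
          \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \,\right]
\]
```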

  20. Action-value function (Q-function)
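
Similarly, the action-value (Q) function conditions on the first action as well, and relates back to the state-value function by averaging over the policy's action choices:

```latex
\[
Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s,\, A_t = a \,\right],
\qquad
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
\]
```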

  21. Optimal policy and optimal value function

  22. Optimal policy and optimal value function

  23. Optimal policy and optimal value function
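
Slides 21-23 develop the optimal policy and optimal value function; their content is not in this transcript, but the standard Bellman optimality equations (Sutton and Barto) are:

```latex
\begin{align*}
V^{*}(s)    &= \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{*}(s') \,\bigr] \\
Q^{*}(s, a) &= \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\bigr] \\
\pi^{*}(s)  &= \arg\max_{a}\, Q^{*}(s, a)
\end{align*}
```

A policy that acts greedily with respect to the optimal value function is itself optimal.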

  24. A concrete grid example • Grid world

  25. A concrete grid example • Rewards can be positive or negative • Delayed reward: you might not receive any reward until you reach the goal • You might receive negative rewards until you reach the goal

  26. A concrete grid example

  27. A concrete grid example

  28. A concrete grid example

  29. A concrete grid example

  30. A concrete grid example

  31. A concrete grid example

  32. A concrete grid example • Take-away: optimal policy highly dependent on details of reward
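
To make the take-away concrete, here is a small Python sketch of a deterministic grid world whose per-step reward is a parameter; the grid size, goal/pit locations, and reward values are illustrative assumptions, not the exact example from the slides:

```python
# Tiny deterministic grid world (illustrative; not the slides' exact grid).
class GridWorld:
    def __init__(self, rows=3, cols=4, goal=(0, 3), pit=(1, 3), step_reward=-0.04):
        self.rows, self.cols = rows, cols
        self.goal, self.pit = goal, pit
        self.step_reward = step_reward            # reward for every non-terminal move
        self.actions = {"up": (-1, 0), "down": (1, 0),
                        "left": (0, -1), "right": (0, 1)}

    def states(self):
        return [(r, c) for r in range(self.rows) for c in range(self.cols)]

    def is_terminal(self, s):
        return s == self.goal or s == self.pit

    def step(self, s, a):
        """Deterministic transition: returns (next_state, reward)."""
        if self.is_terminal(s):
            return s, 0.0
        dr, dc = self.actions[a]
        nr, nc = s[0] + dr, s[1] + dc
        if not (0 <= nr < self.rows and 0 <= nc < self.cols):
            nr, nc = s                            # bumping into a wall: stay in place
        s2 = (nr, nc)
        if s2 == self.goal:
            return s2, 1.0                        # delayed positive reward at the goal
        if s2 == self.pit:
            return s2, -1.0                       # negative reward at the pit
        return s2, self.step_reward
```

With a step reward near zero the optimal policy can afford long, safe detours; with a strongly negative step reward it heads straight for the goal, which is why the optimal policy depends so heavily on the reward details.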

  33. Value Iteration • Punchline: discounted reward renders an infinite-horizon value function finite. Great because we can actually compare the values of different sequences
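
The punchline follows from a geometric series: if every reward is bounded by R_max and γ < 1, the infinite-horizon return is bounded, so the values of different (even infinite) sequences can be compared:

```latex
\[
\left|\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\right|
\;\le\; \sum_{k=0}^{\infty} \gamma^{k} R_{\max}
\;=\; \frac{R_{\max}}{1 - \gamma}
\qquad \text{for } \gamma < 1 \text{ and } |R_t| \le R_{\max}.
\]
```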

  34. Value Iteration

  35. Value Iteration

  36. Value Iteration

  37. Value Iteration

  38. Value Iteration

  39. Value Iteration

  40. Value Iteration

  41. Value Iteration

  42. Value Iteration

  43. Value Iteration

  44. Value Iteration
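
Slides 34-44 step through value iteration on the grid example. Below is a compact sketch reusing the illustrative GridWorld class above; because those dynamics are deterministic, the Bellman backup's expectation over next states collapses to a single successor:

```python
def value_iteration(env, gamma=0.9, theta=1e-6):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in env.states()}
    while True:
        delta = 0.0
        for s in env.states():
            if env.is_terminal(s):
                continue
            # Deterministic dynamics, so the backup is max_a [ r + gamma * V(s') ].
            def backup(a):
                next_state, reward = env.step(s, a)
                return reward + gamma * V[next_state]
            best = max(backup(a) for a in env.actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                         # converged within tolerance theta
            break
    # Read the greedy (optimal) policy off the converged value function.
    policy = {}
    for s in env.states():
        if env.is_terminal(s):
            continue
        action_values = {}
        for a in env.actions:
            next_state, reward = env.step(s, a)
            action_values[a] = reward + gamma * V[next_state]
        policy[s] = max(action_values, key=action_values.get)
    return V, policy

# Example usage with the illustrative GridWorld sketched earlier:
env = GridWorld(step_reward=-0.04)                # try e.g. -2.0 and compare the policies
V, policy = value_iteration(env, gamma=0.9)
```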
