
Outline: General introduction • Basic settings • Tabular approach • Deep reinforcement learning • Challenges and opportunities • Appendix: selected applications

General Introduction


  1. Recall: Bellman Optimality Equation
  • Optimal value functions
    • Optimal state-value function: v*(s) = max_π v_π(s)
    • Optimal action-value function: q*(s, a) = max_π q_π(s, a)
  • Bellman optimality equations
    • v*(s) = max_a q*(s, a)
    • q*(s, a) = R_s^a + γ Σ_{s'} P_{ss'}^a v*(s')

  2. Planning (Optimal Control)
  • Given an exact model (i.e., reward function, transition probabilities)
  • Value iteration with the Bellman optimality equation:
    • Arbitrary initialization: q_0
    • For k = 0, 1, 2, …: for all s ∈ S, a ∈ A,
      q_{k+1}(s, a) = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) max_{a'} q_k(s', a')
    • Stopping criterion: max_{s∈S, a∈A} |q_{k+1}(s, a) − q_k(s, a)| ≤ ε
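To make the planning loop above concrete, here is a minimal Python sketch of tabular value iteration over Q, assuming a small MDP given as NumPy arrays R[s, a] and P[s, a, s']; these names and shapes are illustrative, not from the slides.

```python
# A minimal sketch of tabular value iteration over Q for a small, known MDP.
import numpy as np

def value_iteration(R, P, gamma=0.9, eps=1e-6):
    """R[s, a]: expected reward; P[s, a, s']: transition probabilities."""
    n_states, n_actions = R.shape
    q = np.zeros((n_states, n_actions))          # arbitrary initialization q_0
    while True:
        v = q.max(axis=1)                        # max_{a'} q_k(s', a') for every s'
        q_next = R + gamma * P @ v               # Bellman optimality backup
        if np.max(np.abs(q_next - q)) <= eps:    # stopping criterion
            return q_next
        q = q_next
```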

  3. Learning in MDPs
  • Have access to the real system but no model
  • Generate experience s_1, a_1, r_1, s_2, a_2, r_2, …, s_{T−1}, a_{T−1}, r_{T−1}, s_T
  • Two kinds of approaches
    • Model-free learning
    • Model-based learning

  4. Monte-Carlo Policy Evaluation
  • To evaluate state s
  • The first time-step t that state s is visited in an episode:
    • Increment counter N(s) ← N(s) + 1
    • Increment total return S(s) ← S(s) + G_t
  • Value is estimated by the mean return V(s) = S(s) / N(s)
  • By the law of large numbers, V(s) → v_π(s) as N(s) → ∞
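A minimal sketch of first-visit Monte-Carlo policy evaluation as described above. The sample_episode(policy) helper, returning one complete episode as a list of (state, reward) pairs, is a hypothetical stand-in for interaction with the environment.

```python
# A minimal sketch of first-visit Monte-Carlo policy evaluation.
from collections import defaultdict

def mc_evaluate(sample_episode, policy, n_episodes=1000, gamma=1.0):
    N = defaultdict(int)      # visit counters N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s)
    for _ in range(n_episodes):
        episode = sample_episode(policy)          # [(s_0, r_1), (s_1, r_2), ...]
        G, returns = 0.0, {}
        for s, r in reversed(episode):            # compute returns G_t backwards
            G = r + gamma * G
            returns[s] = G                        # keeps the first visit's return
        for s, G_t in returns.items():
            N[s] += 1
            S[s] += G_t
            V[s] = S[s] / N[s]                    # value = mean return
    return V
```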

  5. Incremental Monte-Carlo Update
  • The mean of x_1, …, x_k can be computed incrementally:
    μ_k = (1/k) Σ_{j=1}^{k} x_j = (1/k) (x_k + Σ_{j=1}^{k−1} x_j) = (1/k) (x_k + (k−1) μ_{k−1}) = μ_{k−1} + (1/k)(x_k − μ_{k−1})
  • For each state s with return G_t:
    • N(s) ← N(s) + 1
    • V(s) ← V(s) + (1/N(s)) (G_t − V(s))
  • Handle non-stationary problems: V(s) ← V(s) + α (G_t − V(s))

  6. Monte-Carlo Policy Evaluation
  • V(s_t) ← V(s_t) + α (G_t − V(s_t))
  • G_t is the actual long-term return following state s_t in a sampled trajectory

  7. Monte-Carlo Reinforcement Learning
  • MC methods learn directly from episodes of experience
  • MC is model-free: no knowledge of MDP transitions / rewards
  • MC learns from complete episodes
    • Values for each state or state-action pair are updated only based on the final return, not on estimates of neighboring states
  • MC uses the simplest possible idea: value = mean return
  • Caveat: MC can only be applied to episodic MDPs
    • All episodes must terminate

  8. Temporal-Difference Policy Evaluation
  • Monte-Carlo: V(s_t) ← V(s_t) + α (G_t − V(s_t))
  • TD: V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
  • r_{t+1} is the actual immediate reward following state s_t in a sampled step

  9. Temporal-Difference Policy Evaluation
  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions / rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess
  • Simplest temporal-difference learning algorithm: TD(0)
    • Update value V(s_t) toward the estimated return r_{t+1} + γ V(s_{t+1}):
      V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
    • r_{t+1} + γ V(s_{t+1}) is called the TD target
    • δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t) is called the TD error
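A minimal sketch of a single TD(0) update as defined above, with tabular values stored in a dictionary; the learning-rate and discount defaults are illustrative.

```python
# A minimal sketch of one TD(0) update step.
from collections import defaultdict

def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.99):
    """Move V(s) toward the TD target r_{t+1} + gamma * V(s_{t+1})."""
    td_target = r_next + gamma * V[s_next]
    td_error = td_target - V[s]          # delta_t
    V[s] += alpha * td_error
    return V

# Usage (illustrative): V = defaultdict(float); V = td0_update(V, s, r, s_next)
```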

  10. Comparisons: TD vs. MC vs. DP

  11. Policy Improvement

  12. Policy Iteration

  13. ε-greedy Exploration

  14. Monte-Carlo Policy Iteration

  15. Monte-Carlo Control

  16. MC vs TD Control
  • Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)
    • Lower variance
    • Online
    • Incomplete sequences
  • Natural idea: use TD instead of MC in our control loop
    • Apply TD to Q(S, A)
    • Use ε-greedy policy improvement
    • Update every time-step
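The control loop in this slide (ε-greedy improvement plus a TD update of Q(S, A) at every time-step) corresponds to SARSA-style learning. Below is a minimal sketch; the Gym-like env whose step(a) returns (next_state, reward, done) is an interface assumption, not something prescribed by the slides.

```python
# A minimal sketch of on-policy TD control (SARSA-style) with epsilon-greedy improvement.
import random
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

def td_control(env, n_actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)                     # assumed interface
            a_next = epsilon_greedy(Q, s_next, n_actions, eps)
            td_target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])      # update every time-step
            s, a = s_next, a_next
    return Q
```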

  17. Model-based Learning
  • Use experience data to estimate a model
  • Compute the optimal policy w.r.t. the estimated model
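A minimal tabular sketch of this idea: estimate transition probabilities and mean rewards from experience, then plan against the estimated model (for instance with the value-iteration sketch earlier). The array shapes and the transition-tuple format are assumptions.

```python
# A minimal sketch of tabular model estimation from experience.
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """transitions: list of (s, a, r, s') tuples collected from the real system."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform transition and zero reward.
    P_hat = np.divide(counts, visits,
                      out=np.full_like(counts, 1.0 / n_states), where=visits > 0)
    R_hat = np.divide(reward_sum, visits[:, :, 0],
                      out=np.zeros_like(reward_sum), where=visits[:, :, 0] > 0)
    return R_hat, P_hat   # plan with e.g. value_iteration(R_hat, P_hat)
```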

  18. Summary of RL
  • Planning (with a model)
    • Policy evaluation (for a fixed policy)
    • Optimal control (optimize the policy): value iteration, policy iteration
  • Model-free learning
    • Policy evaluation (for a fixed policy): Monte-Carlo, TD learning
    • Optimal control (optimize the policy)
  • Model-based learning

  19. Large Scale RL
  • So far we have represented the value function by a lookup table
    • Every state s has an entry V(s)
    • Or every state-action pair (s, a) has an entry Q(s, a)
  • Problems with large MDPs:
    • Too many states and/or actions to store in memory
    • Too slow to learn the value of each state (or state-action pair) individually
    • Backgammon: 10^20 states
    • Go: 10^170 states

  20. Solution: Function Approximation for RL
  • Estimate the value function with function approximation
    • v̂(s; w) ≈ v_π(s) or q̂(s, a; w) ≈ q_π(s, a)
  • Generalize from seen states to unseen states
  • Update parameters w using MC or TD learning
  • Policy function
  • Model transition function
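A minimal sketch of value-function approximation with a linear model and a semi-gradient TD(0) update; the feature map phi(s) is a hypothetical placeholder.

```python
# A minimal sketch of linear value-function approximation with semi-gradient TD(0).
import numpy as np

def v_hat(s, w, phi):
    return np.dot(w, phi(s))                 # v_hat(s; w) ~ v_pi(s)

def semi_gradient_td0_step(w, s, r_next, s_next, phi, alpha=0.01, gamma=0.99):
    td_target = r_next + gamma * v_hat(s_next, w, phi)
    td_error = td_target - v_hat(s, w, phi)
    return w + alpha * td_error * phi(s)     # gradient of v_hat(s; w) w.r.t. w is phi(s)
```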

  21. Deep Reinforcement Learning
  • Deep learning
  • Value based
  • Policy gradients / Actor-critic
  • Model based

  22. Deep Learning Is Making Breakthroughs!
  • In closed tests restricted to specific image categories, AI technology has already reached or surpassed human-level performance
  • In October 2016, Microsoft's speech recognition system reached a 5.9% word error rate on everyday conversational data, achieving human-parity recognition accuracy for the first time

  23. Deep Learning
  Deep learning (deep machine learning, deep structured learning, hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations.
  • 1958: Birth of the Perceptron and neural networks
  • 1974: Backpropagation
  • Late 1980s: Convolutional neural networks (CNN) and recurrent neural networks (RNN) trained using backpropagation
  • 1997: LSTM-RNN
  • 2006: Unsupervised pretraining for deep neural networks
  • 2012: Distributed deep learning (e.g., Google Brain)
  • 2013: DQN for deep reinforcement learning
  • 2015: Open source tools: MxNet, TensorFlow, CNTK

  24. Driving Power
  • Big data: web pages, search logs, social networks, and new mechanisms for data collection such as conversation and crowdsourcing
  • Deep models: 1000+ layers, tens of billions of parameters
  • Big computer clusters: CPU clusters, GPU clusters, FPGA farms, provided by Amazon, Azure, etc.

  25. Value based methods: estimate value function or Q-function of the optimal policy (no explicit policy)

  26. Nature 2015: Human-level Control Through Deep Reinforcement Learning

  27. Human-level Control Through Deep Reinforcement Learning: Representations of Atari Games
  • End-to-end learning of values Q(s, a) from pixels s
  • Input state s is a stack of raw pixels from the last 4 frames
  • Output is Q(s, a) for 18 joystick/button positions
  • Reward is the change in score for that step

  28. Value Iteration with Q-Learning
  • Represent the value function by a deep Q-network with weights w: Q(s, a; w) ≈ Q_π(s, a)
  • Define the objective function by the mean-squared error in Q-values:
    L(w) = E[(r + γ max_{a'} Q(s', a'; w) − Q(s, a; w))^2]
  • Leading to the following Q-learning gradient:
    ∂L(w)/∂w = E[(r + γ max_{a'} Q(s', a'; w) − Q(s, a; w)) ∂Q(s, a; w)/∂w]
  • Optimize the objective end-to-end by SGD
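A minimal sketch of the mean-squared Q-learning objective above, written with PyTorch (a framework assumption; the slides do not prescribe one). q_net is assumed to map a batch of states to per-action Q-values, and a is a tensor of integer action indices.

```python
# A minimal sketch of the Q-learning objective; the target is held constant, matching
# the gradient on the slide, which only differentiates Q(s, a; w).
import torch

def q_learning_loss(q_net, s, a, r, s_next, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; w)
    with torch.no_grad():                                       # target treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values    # r + gamma * max_a' Q(s', a'; w)
    return torch.mean((target - q_sa) ** 2)                     # mean-squared TD error

# loss.backward() followed by an SGD step optimizes the objective end-to-end.
```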

  29. Stability Issues with Deep RL
  Naive Q-learning oscillates or diverges with neural nets
  • Data is sequential
    • Successive samples are correlated, non-iid
  • Policy changes rapidly with slight changes to Q-values
    • Policy may oscillate
    • Distribution of data can swing from one extreme to another

  30. Deep Q-Networks
  • DQN provides a stable solution to deep value-based RL
  • Use experience replay
    • Break correlations in data, bring us back to the iid setting
    • Learn from all past policies, using off-policy Q-learning
  • Freeze the target Q-network
    • Avoid oscillations
    • Break correlations between the Q-network and the target

  31. Deep Q-Networks: Experience Replay
  To remove correlations, build a data-set from the agent's own experience
  • Take action a_t according to an ε-greedy policy
  • Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
  • Sample a random mini-batch of transitions (s, a, r, s') from D
  • Optimize MSE between the Q-network and the Q-learning targets, e.g.
    L(w) = E_{(s,a,r,s')∼D}[(r + γ max_{a'} Q(s', a'; w) − Q(s, a; w))^2]
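A minimal sketch of the replay memory described above; the capacity and batch size are illustrative assumptions.

```python
# A minimal sketch of a replay memory: store transitions, sample random mini-batches.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random mini-batch breaks correlations
        return list(zip(*batch))                         # tuples of states, actions, rewards, ...
```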

  32. Deep Q-Networks: Fixed Target Network
  To avoid oscillations, fix the parameters used in the Q-learning target
  • Compute the Q-learning targets w.r.t. old, fixed parameters w⁻:
    r + γ max_{a'} Q(s', a'; w⁻)
  • Optimize MSE between the Q-network and the Q-learning targets:
    L(w) = E_{(s,a,r,s')∼D}[(r + γ max_{a'} Q(s', a'; w⁻) − Q(s, a; w))^2]
  • Periodically update the fixed parameters: w⁻ ← w
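A minimal sketch of the fixed-target-network trick, again assuming PyTorch: targets come from a frozen copy whose parameters w⁻ are synced to w only periodically. The sync interval is an illustrative assumption.

```python
# A minimal sketch of DQN with a frozen target network.
import copy
import torch

def make_target_net(q_net):
    target_net = copy.deepcopy(q_net)
    target_net.eval()                          # w- is never trained directly
    return target_net

def dqn_loss(q_net, target_net, s, a, r, s_next, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values   # uses w-, not w
    return torch.mean((target - q_sa) ** 2)

def maybe_sync(step, q_net, target_net, sync_every=10_000):
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())               # w- <- w
```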

  33. Experiment
  Of 49 Atari games:
  • 43 games are better than state-of-the-art results
  • 29 games achieve 75% of the expert score

  34. Other Tricks
  • DQN clips the rewards to [−1, +1]
    • This prevents Q-values from becoming too large
    • Ensures gradients are well-conditioned
    • But it can't tell the difference between small and large rewards
  • Better approach: normalize the network output
    • e.g., via batch normalization
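The reward-clipping trick above is essentially a one-liner; a sketch assuming NumPy:

```python
# A minimal sketch of DQN-style reward clipping to [-1, +1].
import numpy as np

def clip_reward(r):
    return float(np.clip(r, -1.0, 1.0))   # keeps Q-values and gradients well-scaled
```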

  35. Extensions
  • Deep Recurrent Q-Learning for Partially Observable MDPs
    • Use CNN + LSTM instead of CNN to encode frames of images
  • Deep Attention Recurrent Q-Network
    • Use CNN + LSTM + attention model to encode frames of images

  36. Policy gradients: directly differentiate the objective

  37. Gradient Computation

  38. Policy Gradients
  • Optimization problem: find θ that maximizes the expected total reward
  • The gradient of a stochastic policy π_θ(a|s) is given by
    ∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a|s) Q^{π_θ}(s, a)]
  • The gradient of a deterministic policy a = μ_θ(s) is given by
    ∇_θ J(θ) = E[∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)}]
  • The gradient tries to
    • Increase the probability of paths with positive R
    • Decrease the probability of paths with negative R

  39. REINFORCE
  • We use the return v_t as an unbiased sample of Q
    • v_t = r_1 + r_2 + ⋯ + r_t
  • High variance
  • Limited to the stochastic-policy case
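A minimal sketch of the REINFORCE gradient estimate, assuming PyTorch and a policy(states) callable that returns per-action probabilities: the sampled return is used as an unbiased sample of Q, which is exactly where the high variance comes from.

```python
# A minimal sketch of the REINFORCE loss for one sampled episode.
import torch

def reinforce_loss(policy, states, actions, rewards, gamma=0.99):
    returns, G = [], 0.0
    for r in reversed(rewards):                      # returns computed backwards over the episode
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    probs = policy(states)                           # shape: (T, n_actions), an assumed interface
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    return -(log_probs * returns).mean()             # minimizing this ascends the PG objective
```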

  40. Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy

  41. Actor-Critic
  • We use a critic to estimate the action-value function
  • Actor-critic algorithms
    • Update the action-value function parameters (the critic)
    • Update the policy parameters θ in the direction suggested by the critic (the actor)
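A minimal sketch of one actor-critic update, assuming PyTorch; the policy, critic, and optimizer objects are illustrative assumptions. The critic is updated toward a TD(0) target, and the actor is updated in the direction the critic's TD error suggests.

```python
# A minimal sketch of a single actor-critic update step.
import torch

def actor_critic_step(policy, critic, actor_opt, critic_opt, s, a, r, s_next, gamma=0.99):
    # Critic: TD(0) update of the state-value estimate
    value = critic(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)
    critic_loss = ((td_target - value) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: push up the log-probability of the taken action, weighted by the TD error
    advantage = (td_target - critic(s)).detach()
    actor_loss = (-torch.log(policy(s)[a]) * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```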

  42. Review
  • Value Based
    • Learnt value function
    • Implicit policy (e.g., ε-greedy)
  • Policy Based
    • No value function
    • Learnt policy
  • Actor-Critic
    • Learnt value function
    • Learnt policy

  43. Model-based DRL
  • Learn a transition model of the environment/system P(r, s' | s, a)
    • Use a deep network to represent the model
    • Define a loss function for the model
    • Optimize the loss by SGD or its variants
  • Plan using the transition model
    • E.g., look ahead using the transition model to find optimal actions
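A minimal sketch of the model-learning step, assuming PyTorch: a network predicts (r, s') from (s, a) and is trained by minimizing a squared-error loss over observed transitions; the model and optimizer objects are assumptions.

```python
# A minimal sketch of training a deep transition model on observed transitions.
import torch

def model_loss(model, s, a, r, s_next):
    pred_r, pred_s_next = model(s, a)                    # deep network represents P(r, s' | s, a)
    return torch.mean((pred_r - r) ** 2) + torch.mean((pred_s_next - s_next) ** 2)

def train_step(model, optimizer, batch):
    s, a, r, s_next = batch
    loss = model_loss(model, s, a, r, s_next)
    optimizer.zero_grad(); loss.backward(); optimizer.step()   # SGD or its variants
    return loss.item()
```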

  44. Model-based DRL: Challenges
  • Errors in the transition model compound over the trajectory
    • By the end of a long trajectory, rewards can be totally wrong
  • Model-based RL has failed in Atari

  45. Challenges and Opportunities

  46. 1. Robustness – random seeds

  47. 1. Robustness – random seeds (Deep Reinforcement Learning that Matters, AAAI'18)

  48. 2. Robustness – across tasks (Deep Reinforcement Learning that Matters, AAAI'18)

  49. As a comparison: ResNet performs pretty well on various kinds of tasks
  • Object detection
  • Image segmentation
  • Go playing
  • Image generation
  • …

  50. 3. Learning – sample efficiency
  • Supervised learning: learning from an oracle
  • Reinforcement learning: learning from trial and error
  Rainbow: Combining Improvements in Deep Reinforcement Learning

  51. Multi-task / Transfer Learning
  • Humans can't learn individual complex tasks from scratch. Maybe our agents shouldn't either.
  • We ultimately want our agents to learn many tasks in many environments
    • Learn to learn new tasks quickly (Duan et al. '17, Wang et al. '17, Finn et al. ICML '17)
    • Share information across tasks in other ways (Rusu et al. NIPS '16, Andrychowicz et al. '17, Cabi et al. '17, Teh et al. '17)
    • Better exploration strategies

  52. 4. Optimization – local optima

  53. 5. No/Sparse Reward
  Real-world interaction:
  • Usually no (visible) immediate reward for each action
  • Maybe no (visible) explicit final reward for a sequence of actions
  • Don't know how to terminate a sequence
  Consequences:
  • Most DRL algorithms are for games or robotics
    • Reward information is defined by the video games in Atari and Go
    • Within controlled environments

  54. Scalar reward is an extremely sparse signal, while at the same time humans can learn without any external rewards
  • Self-supervision (Osband et al. NIPS '16, Houthooft et al. NIPS '16, Pathak et al. ICML '17, Fu*, Co-Reyes* et al. '17, Tang et al. ICLR '17, Plappert et al. '17)
  • Options & hierarchy (Kulkarni et al. NIPS '16, Vezhnevets et al. NIPS '16, Bacon et al. AAAI '16, Heess et al. '17, Vezhnevets et al. ICML '17, Tessler et al. AAAI '17)
  • Leveraging stochastic policies for better exploration (Florensa et al. ICLR '17, Haarnoja et al. ICML '17)
  • Auxiliary objectives (Jaderberg et al. '17, Shelhamer et al. '17, Mirowski et al. ICLR '17)

  55. 6. Is DRL a good choice for a task?

  56. 7. Imperfect-information Games and Multi-agent Games
  • No-limit heads-up Texas Hold'Em
    • Libratus (Brown et al., NIPS 2017)
    • DeepStack (Moravčík et al., 2017)
  • Refer to Prof. Bo An's talk

  57. Opportunities
  • Improve robustness (e.g., w.r.t. random seeds and across tasks)
  • Improve learning efficiency
  • Better optimization
  • Define reward in practical applications
  • Identify appropriate tasks
  • Imperfect-information and multi-agent games

  58. Applications

  59. Application areas: Game, Neuroscience, Music & Movie, Healthcare, NLP, Trading, Robotics, Education, Control

  60. Game
  • RL for games
    • Sequential decision making
    • Delayed reward
  • Examples: TD-Gammon, Atari games

  61. Game
  • Atari Games
    • Learned to play 49 games for the Atari 2600 game console, without labels or human input, from self-play and the score alone
    • Learned to play better than all previous algorithms and at human level for more than half the games
  Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.

  62. Game
  • AlphaGo: won 4-1 (CNN-based value network and policy network)
  • Master (AlphaGo++): won 60-0
  http://icml.cc/2016/tutorials/AlphaGo-tutorial-slides.pdf

  63. Application areas: Game, Neuroscience, Music & Movie, Healthcare, NLP, Trading, Robotics, Education, Control

  64. Neuroscience
  The world presents animals/humans with a huge reinforcement learning problem (or many such small problems)

  65. Neuroscience
  • How can the brain realize these? Can RL help us understand the brain's computations?
  • Reinforcement learning has revolutionized our understanding of learning in the brain in the last 20 years
  • A success story: dopamine and prediction errors
  Yael Niv. The Neuroscience of Reinforcement Learning. Princeton University. ICML'09 Tutorial

  66. What is dopamine?
  • Plays a major role in reward-motivated behavior as a "global reward signal"
  • Related topics: Parkinson's disease, gambling, regulating attention, pleasure

  67. Conditioning
  • Pavlov's dog
