Recall: Bellman Optimality Equation
• Optimal value functions
  • Optimal state-value function: v*(s) = max_π v_π(s)
  • Optimal action-value function: q*(s, a) = max_π q_π(s, a)
• Bellman optimality equation
  • v*(s) = max_a q*(s, a)
  • q*(s, a) = R_s^a + γ Σ_{s'} P_{ss'}^a v*(s')
Planning (Optimal Control)
• Given an exact model (i.e., reward function, transition probabilities)
• Value iteration with the Bellman optimality equation:
  • Arbitrary initialization: Q_0
  • For k = 0, 1, 2, …: for all s ∈ S, a ∈ A,
    Q_{k+1}(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q_k(s', a')
  • Stopping criterion: max_{s∈S, a∈A} |Q_{k+1}(s, a) − Q_k(s, a)| ≤ ε
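A minimal sketch of this Q-value iteration loop on a tabular MDP, assuming the model is given as a reward array R[s, a] and a transition array P[s, a, s'] (both array names are illustrative):

```python
import numpy as np

def q_value_iteration(R, P, gamma=0.9, eps=1e-6):
    """Tabular Q-value iteration. R: [S, A] rewards, P: [S, A, S] transition probabilities."""
    num_states, num_actions = R.shape
    Q = np.zeros((num_states, num_actions))        # arbitrary initialization Q_0
    while True:
        # Bellman optimality backup for every (s, a) at once
        Q_next = R + gamma * P @ Q.max(axis=1)     # (S, A) + (S, A, S) @ (S,) -> (S, A)
        if np.max(np.abs(Q_next - Q)) <= eps:      # stopping criterion
            return Q_next
        Q = Q_next
```

The greedy policy is then read off as Q.argmax(axis=1).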
Learning in MDPs
• Have access to the real system but no model
• Generate experience s_1, a_1, r_1, s_2, a_2, r_2, …, s_{T−1}, a_{T−1}, r_{T−1}, s_T
• Two kinds of approaches
  • Model-free learning
  • Model-based learning
Monte-Carlo Policy Evaluation
• To evaluate state s
• The first time-step t that state s is visited in an episode:
  • Increment counter N(s) ← N(s) + 1
  • Increment total return S(s) ← S(s) + G_t
• Value is estimated by the mean return V(s) = S(s) / N(s)
• By the law of large numbers, V(s) → v_π(s) as N(s) → ∞
Incremental Monte-Carlo Update
• Incremental mean:
  μ_k = (1/k) Σ_{j=1}^k x_j = (1/k) (x_k + (k−1) μ_{k−1}) = μ_{k−1} + (1/k) (x_k − μ_{k−1})
• For each state s with return G_t:
  • N(s) ← N(s) + 1
  • V(s) ← V(s) + (1/N(s)) (G_t − V(s))
• To handle non-stationary problems: V(s) ← V(s) + α (G_t − V(s))
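A minimal sketch of first-visit Monte-Carlo evaluation using the incremental update above; the episode format (a list of (state, reward) pairs) and variable names are illustrative assumptions:

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0):
    """First-visit Monte-Carlo policy evaluation with incremental mean updates."""
    N = defaultdict(int)      # visit counts N(s)
    V = defaultdict(float)    # value estimates V(s)
    for episode in episodes:  # episode: list of (state, reward) pairs
        # Compute the return following each time step, working backwards
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # First-visit: only the first occurrence of each state is updated
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            N[s] += 1
            V[s] += (returns[t] - V[s]) / N[s]   # incremental mean update
    return V
```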
Monte-Carlo Policy Evaluation
• V(s_t) ← V(s_t) + α (G_t − V(s_t))
• G_t is the actual long-term return following state s_t in a sampled trajectory
Monte-Carlo Reinforcement Learning
• MC methods learn directly from episodes of experience
• MC is model-free: no knowledge of MDP transitions / rewards
• MC learns from complete episodes
  • Values for each state or state-action pair are updated only from the observed return, not from estimates of neighboring states
• MC uses the simplest possible idea: value = mean return
• Caveat: MC can only be applied to episodic MDPs
  • All episodes must terminate
Temporal-Difference Policy Evaluation
• Monte-Carlo: V(s_t) ← V(s_t) + α (G_t − V(s_t))
• TD: V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
• r_{t+1} is the actual immediate reward following state s_t in a sampled step
Temporal-Difference Policy Evaluation
• TD methods learn directly from episodes of experience
• TD is model-free: no knowledge of MDP transitions / rewards
• TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess
• Simplest temporal-difference learning algorithm: TD(0)
  • Update value V(s_t) toward the estimated return r_{t+1} + γ V(s_{t+1}):
    V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
  • r_{t+1} + γ V(s_{t+1}) is called the TD target
  • δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t) is called the TD error
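A minimal tabular TD(0) sketch, assuming an environment object with reset() and step(a) returning (next_state, reward, done), and a fixed policy function (these interface names are assumptions, not a specific library's API):

```python
from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0) policy evaluation for a fixed policy."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            td_target = r + gamma * V[s_next] * (not done)  # bootstrap from V(s_{t+1})
            V[s] += alpha * (td_target - V[s])              # move V(s_t) toward the TD target
            s = s_next
    return V
```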
Comparisons: TD vs. MC vs. DP
Policy Improvement
Policy Iteration
ε-greedy Exploration
Monte-Carlo Policy Iteration
Monte-Carlo Control
MC vs TD Control
• Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)
  • Lower variance
  • Online
  • Works with incomplete sequences
• Natural idea: use TD instead of MC in our control loop
  • Apply TD to Q(S, A)
  • Use ε-greedy policy improvement
  • Update every time-step
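Putting these three ingredients together gives the SARSA algorithm; a minimal tabular sketch, again under the assumed reset()/step() environment interface from the TD(0) sketch:

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control (SARSA) with ε-greedy exploration."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)                 # explore
        return max(actions, key=lambda a: Q[(s, a)])      # exploit

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            td_target = r + gamma * Q[(s_next, a_next)] * (not done)
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])  # update every time-step
            s, a = s_next, a_next
    return Q
```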
Model-based Learning
• Use experience data to estimate a model
• Compute the optimal policy w.r.t. the estimated model
Summary of RL so far
• Planning (exact model given)
  • Policy evaluation (for a fixed policy)
  • Optimal control (optimize the policy): value iteration, policy iteration
• Model-free learning
  • Policy evaluation (for a fixed policy): Monte-Carlo, TD learning
  • Optimal control (optimize the policy)
• Model-based learning
Large Scale RL
• So far we have represented the value function by a lookup table
  • Every state s has an entry V(s)
  • Or every state-action pair (s, a) has an entry Q(s, a)
• Problem with large MDPs:
  • Too many states and/or actions to store in memory
  • Too slow to learn the value of each state (or state-action pair) individually
  • Backgammon: 10^20 states
  • Go: 10^170 states
Solution: Function Approximation for RL
• Estimate the value function with function approximation
  • v̂(s; w) ≈ v_π(s) or q̂(s, a; w) ≈ q_π(s, a)
  • Generalize from seen states to unseen states
  • Update parameter w using MC or TD learning
• Policy function
• Model (transition function)
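As an illustration of the value-function case, a minimal semi-gradient TD(0) sketch with a linear approximator v̂(s; w) = w · φ(s); the feature map phi and the environment interface are assumptions:

```python
import numpy as np

def semi_gradient_td0(env, policy, phi, num_features,
                      num_episodes=1000, alpha=0.01, gamma=0.99):
    """TD(0) with a linear value function v̂(s; w) = w · φ(s)."""
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            v, v_next = w @ phi(s), w @ phi(s_next)
            td_error = r + gamma * v_next * (not done) - v
            w += alpha * td_error * phi(s)   # semi-gradient update: ∇_w v̂(s; w) = φ(s)
            s = s_next
    return w
```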
Deep Reinforcement Learning
• Deep learning
• Value based
• Policy gradients
• Actor-critic
• Model based
Deep Learning Is Making Breakthroughs!
• AI technology has reached or surpassed human-level performance in closed tests on restricted image categories
• In October 2016, Microsoft's speech recognition system achieved a 5.9% word error rate on everyday conversational data, matching human recognition accuracy for the first time
Deep Learning
Deep learning (deep machine learning, or deep structured learning, or hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations.
• 1958: Birth of the Perceptron
• 1974: Backpropagation
• Late 1980s: Convolutional neural networks (CNN) and recurrent neural networks (RNN) trained using backpropagation
• 1997: LSTM-RNN
• 2006: Unsupervised pretraining for deep neural networks
• 2012: Distributed deep learning (e.g., Google Brain)
• 2013: DQN for deep reinforcement learning
• 2015: Open-source tools: MxNet, TensorFlow, CNTK
Driving Power
• Big data: web pages, search logs, social networks, and new mechanisms for data collection: conversation and crowdsourcing
• Deep models: 1000+ layers, tens of billions of parameters
• Big computer clusters: CPU clusters, GPU clusters, FPGA farms, provided by Amazon, Azure, etc.
Value-based methods: estimate the value function or Q-function of the optimal policy (no explicit policy)
Nature 2015: Human-level Control Through Deep Reinforcement Learning
Human-level Control Through Deep Reinforcement Learning: Representations of Atari Games
• End-to-end learning of values Q(s, a) from pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in score for that step
Value Iteration with Q-Learning
• Represent the value function by a deep Q-network with weights w:
  Q(s, a; w) ≈ q_π(s, a)
• Define the objective function by the mean-squared error in Q-values:
  L(w) = E[(r + γ max_{a'} Q(s', a'; w) − Q(s, a; w))^2]
• Leading to the following Q-learning gradient:
  ∂L(w)/∂w = E[(r + γ max_{a'} Q(s', a'; w) − Q(s, a; w)) ∂Q(s, a; w)/∂w]
• Optimize the objective end-to-end by SGD
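A minimal PyTorch-style sketch of this loss on one batch of transitions; the q_net module and the batch tensor layout are assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

def q_learning_loss(q_net, batch, gamma=0.99):
    """Mean-squared TD error for Q-learning on a batch of (s, a, r, s', done)."""
    s, a, r, s_next, done = batch   # assumed tensors: a is int64 of shape [B], done is float of shape [B]
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; w)
    with torch.no_grad():                                      # treat the target as a fixed label
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    return nn.functional.mse_loss(q_sa, target)
```

Calling loss.backward() followed by an optimizer step then gives the SGD update described above.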
Stability Issues with Deep RL
Naive Q-learning oscillates or diverges with neural nets
• Data is sequential
  • Successive samples are correlated, non-iid
• Policy changes rapidly with slight changes to Q-values
  • Policy may oscillate
  • Distribution of data can swing from one extreme to another
Deep Q-Networks
• DQN provides a stable solution to deep value-based RL
• Use experience replay
  • Break correlations in data, bring us back to the iid setting
  • Learn from all past policies
  • Using off-policy Q-learning
• Freeze the target Q-network
  • Avoid oscillations
  • Break correlations between the Q-network and the target
Deep Q-Networks: Experience Replay
To remove correlations, build a data-set from the agent's own experience
• Take action a_t according to an ε-greedy policy
• Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
• Sample a random mini-batch of transitions (s, a, r, s') from D
• Optimize MSE between the Q-network and the Q-learning targets, e.g.
  L(w) = E_{(s,a,r,s')∼D}[(r + γ max_{a'} Q(s', a'; w) − Q(s, a; w))^2]
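A minimal replay-memory sketch; the capacity, batch size, and transition layout are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size replay memory D of transitions (s, a, r, s', done)."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.memory, batch_size)   # random mini-batch breaks correlations
        return list(zip(*batch))                         # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.memory)
```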
Deep Q-Networks: Fixed Target Network
To avoid oscillations, fix the parameters used in the Q-learning target
• Compute Q-learning targets w.r.t. old, fixed parameters w⁻:
  r + γ max_{a'} Q(s', a'; w⁻)
• Optimize MSE between the Q-network and the Q-learning targets:
  L(w) = E_{(s,a,r,s')∼D}[(r + γ max_{a'} Q(s', a'; w⁻) − Q(s, a; w))^2]
• Periodically update the fixed parameters: w⁻ ← w
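A short sketch of how the frozen parameters w⁻ can be maintained, assuming a PyTorch online network q_net (the tiny architecture below is only illustrative):

```python
import copy
import torch.nn as nn

# A small online Q-network (architecture is illustrative)
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# Frozen copy whose parameters play the role of w⁻
target_net = copy.deepcopy(q_net)
for p in target_net.parameters():
    p.requires_grad_(False)

def sync_target(q_net, target_net):
    """Periodic hard update w⁻ ← w (e.g., every few thousand gradient steps)."""
    target_net.load_state_dict(q_net.state_dict())

# In the training loop, the target is computed with target_net instead of q_net:
#   target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
```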
Experiments
• Of 49 Atari games, 43 games are better than state-of-the-art results
• 29 games achieve at least 75% of the expert score
Other Tricks
• DQN clips the rewards to [−1, +1]
  • This prevents Q-values from becoming too large
  • Ensures gradients are well-conditioned
  • But: can't tell the difference between small and large rewards
• Better approach: normalize network output
  • e.g., via batch normalization
Extensions
• Deep Recurrent Q-Learning for Partially Observable MDPs
  • Use CNN + LSTM instead of CNN to encode frames of images
• Deep Attention Recurrent Q-Network
  • Use CNN + LSTM + Attention model to encode frames of images
Policy gradients: directly differentiate the objective
Gradient Computation
Policy Gradients
• Optimization problem: find θ that maximizes the expected total reward
• The gradient of a stochastic policy π_θ(a|s) is given by
  ∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a|s) Q^{π_θ}(s, a)]
• The gradient of a deterministic policy a = π_θ(s) is given by
  ∇_θ J(θ) = E[∇_θ π_θ(s) ∇_a Q^π(s, a)|_{a=π_θ(s)}]
• The gradient tries to
  • Increase the probability of paths with positive R
  • Decrease the probability of paths with negative R
REINFORCE
• We use the return v_t as an unbiased sample of Q
  • v_t = r_1 + r_2 + ⋯ + r_t
• High variance
• Limited to the stochastic-policy case
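A minimal PyTorch-style REINFORCE sketch, assuming a policy_net that outputs action logits and the same assumed reset()/step() environment interface as before:

```python
import torch

def reinforce_episode(env, policy_net, optimizer, gamma=0.99):
    """One REINFORCE update: Monte-Carlo returns weight the log-probability gradients."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)

    # Discounted return following each time step (a Monte-Carlo sample of Q)
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)

    loss = -(torch.stack(log_probs) * returns).sum()   # ascend E[∇ log π_θ(a|s) · return]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The high variance mentioned above comes from using raw returns; subtracting a baseline (or using a critic, next slide) reduces it.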
Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy
Actor-Critic
• We use a critic to estimate the action-value function
• Actor-critic algorithms maintain two sets of parameters
  • Critic: updates the action-value function parameters
  • Actor: updates the policy parameters θ, in the direction suggested by the critic
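A minimal one-step actor-critic sketch. For simplicity it uses a state-value critic and the TD error as the advantage signal (a common variant; the Q-critic version on the slide is analogous); the actor/critic modules and environment interface are assumptions:

```python
import torch
import torch.nn as nn

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        s, a, r, s_next, done, gamma=0.99):
    """One-step actor-critic: the critic's TD error scores the actor's action."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # Critic: TD(0) target for the state value V(s; w)
    v = critic(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (1.0 - float(done))
    td_error = (td_target - v).detach()              # advantage signal for the actor

    critic_loss = nn.functional.mse_loss(v, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: move policy parameters θ in the direction suggested by the critic
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -dist.log_prob(torch.as_tensor(a)) * td_error
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```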
Review
• Value Based
  • Learnt Value Function
  • Implicit policy (e.g., ε-greedy)
• Policy Based
  • No Value Function
  • Learnt Policy
• Actor-Critic
  • Learnt Value Function
  • Learnt Policy
Model-based DRL
• Learn a transition model of the environment/system: p(r, s' | s, a)
  • Use a deep network to represent the model
  • Define a loss function for the model
  • Optimize the loss by SGD or its variants
• Plan using the transition model
  • E.g., look ahead using the transition model to find optimal actions
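A rough sketch of this recipe, with a deterministic network standing in for p(r, s' | s, a) (a simplification) and a naive random-shooting lookahead for planning; all class and function names are illustrative:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Deterministic stand-in for p(r, s' | s, a): predicts next state and reward."""
    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim + num_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim + 1))          # outputs [next_state, reward]

    def forward(self, s, a):
        a_onehot = nn.functional.one_hot(a, self.num_actions).float()
        out = self.net(torch.cat([s, a_onehot], dim=-1))
        return out[..., :-1], out[..., -1]           # (s', r); trained by regression on real transitions

def plan_lookahead(model, s, num_actions, horizon=3, gamma=0.99):
    """Pick the first action of the best short random-shooting rollout under the model.
    s: float tensor of shape [obs_dim]."""
    best_a, best_ret = 0, -float("inf")
    for a0 in range(num_actions):
        s_t, a_t, ret, discount = s, a0, 0.0, 1.0
        for _ in range(horizon):
            s_t, r_t = model(s_t, torch.tensor(a_t))
            ret += discount * r_t.item()
            discount *= gamma
            a_t = torch.randint(num_actions, ()).item()   # random continuation of the rollout
        if ret > best_ret:
            best_a, best_ret = a0, ret
    return best_a
```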
Model-based DRL: Challenges
• Errors in the transition model compound over the trajectory
  • By the end of a long trajectory, rewards can be totally wrong
• Model-based RL has failed in Atari
Challenges and Opportunities
1. Robustness – random seeds
1. Robustness – random seeds
Deep Reinforcement Learning that Matters, AAAI'18
2. Robustness – across tasks
Deep Reinforcement Learning that Matters, AAAI'18
As a Comparison
• ResNet performs pretty well on various kinds of tasks
  • Object detection
  • Image segmentation
  • Go playing
  • Image generation
  • …
3. Learning – sample efficiency
• Supervised learning
  • Learning from an oracle
• Reinforcement learning
  • Learning from trial and error
Rainbow: Combining Improvements in Deep Reinforcement Learning
Multi-task / transfer learning
• Humans can't learn individual complex tasks from scratch
  • Maybe our agents shouldn't either
• We ultimately want our agents to learn many tasks in many environments
  • Learn to learn new tasks quickly (Duan et al. '17, Wang et al. '17, Finn et al. ICML '17)
  • Share information across tasks in other ways (Rusu et al. NIPS '16, Andrychowicz et al. '17, Cabi et al. '17, Teh et al. '17)
• Better exploration strategies
4. Optimization – local optima
5. No/sparse reward
Real-world interaction:
• Usually no (visible) immediate reward for each action
• Maybe no (visible) explicit final reward for a sequence of actions
• Don't know how to terminate a sequence
Consequences:
• Most DRL algorithms are for games or robotics
  • Reward information is defined by video games in Atari and Go
  • Within controlled environments
• Scalar reward is an extremely sparse signal, while at the same time humans can learn without any external rewards
  • Self-supervision (Osband et al. NIPS '16, Houthooft et al. NIPS '16, Pathak et al. ICML '17, Fu*, Co-Reyes* et al. '17, Tang et al. ICLR '17, Plappert et al. '17)
  • Options & hierarchy (Kulkarni et al. NIPS '16, Vezhnevets et al. NIPS '16, Bacon et al. AAAI '16, Heess et al. '17, Vezhnevets et al. ICML '17, Tessler et al. AAAI '17)
  • Leveraging stochastic policies for better exploration (Florensa et al. ICLR '17, Haarnoja et al. ICML '17)
  • Auxiliary objectives (Jaderberg et al. '17, Shelhamer et al. '17, Mirowski et al. ICLR '17)
6. Is DRL a good choice for a task?
7. Imperfect-information games and multi-agent games
• No-limit heads-up Texas Hold'Em
  • Libratus (Brown et al., NIPS 2017)
  • DeepStack (Moravčík et al., 2017)
• Refer to Prof. Bo An's talk
Opportunities
• Improve robustness (e.g., w.r.t. random seeds and across tasks)
• Improve learning efficiency
• Better optimization
• Define reward in practical applications
• Identify appropriate tasks
• Imperfect-information and multi-agent games
Applications
Applications: Game, Neuroscience, Music & Movie, Healthcare, NLP, Trading, Robotics, Education, Control
Game
• RL for games
  • Sequential decision making
  • Delayed reward
• TD-Gammon, Atari Games
Game
• Atari Games
  • Learned to play 49 games for the Atari 2600 game console, without labels or human input, from self-play and the score alone
  • Learned to play better than all previous algorithms and at human level for more than half the games
Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
Game
• AlphaGo: won 4-1 (vs. Lee Sedol)
• Master (AlphaGo++): won 60-0
• Built on CNNs: a value network and a policy network
http://icml.cc/2016/tutorials/AlphaGo-tutorial-slides.pdf
Applications: Game, Neuroscience, Music & Movie, Healthcare, NLP, Trading, Robotics, Education, Control
Neuroscience
The world presents animals/humans with a huge reinforcement learning problem (or many such small problems)
Neuroscience
• How can the brain realize these? Can RL help us understand the brain's computations?
• Reinforcement learning has revolutionized our understanding of learning in the brain in the last 20 years
• A success story: dopamine and prediction errors
Yael Niv. The Neuroscience of Reinforcement Learning. Princeton University. ICML'09 Tutorial
What is dopamine?
• Plays a major role in reward-motivated behavior as a "global reward signal"
• Associated with:
  • Parkinson's disease
  • Gambling
  • Regulating attention
  • Pleasure
Conditioning
• Pavlov's dog