CMP784 Deep Learning, Lecture #12 – Deep Reinforcement Learning. Aykut Erdem // Hacettepe University // Spring 2018. (Cover image: DeepLoco by X. B. Peng, G. Berseth & M. van de Panne)
Previously on CMP784: • Generative Adversarial Networks (GANs) • How do GANs work • Conditional GAN • Tips and Tricks • Applications (Image: Neural Face by Taehoon Kim) 2
Lecture overview • What is Reinforcement Learning? • Components of a RL problem • Markov Decision Processes • Value-Based Deep RL • Policy-Based Deep RL • Model-Based Deep RL Disclaimer: Much of the material and slides for this lecture were borrowed from — John Schulman’s talk on “Deep Reinforcement Learning: Policy Gradients and Q-Learning” — David Silver’s tutorial on “Deep Reinforcement Learning” — Lex Fridman’s MIT 6.S094 Deep Learning for Self-Driving Cars class 3
What is Reinforcement Learning? 4
What is Reinforcement Learning? [Diagram contrasting Supervised Learning, Unsupervised Learning, and Reinforcement Learning.] Slide credit: Razvan Pascanu 5
What is Reinforcement Learning? • Branch of machine learning concerned with taking sequences of actions • Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward [Diagram: the agent sends an action to the environment; the environment returns an observation and a reward.] 6
Motor Control and Robotics Robotics: • Observations: camera images, joint angles • Actions: joint torques • Rewards: stay balanced, navigate to target locations, serve and protect humans 7
Business Operations Inventory Management: • Observations: current inventory levels • Actions: number of units of each item to purchase • Rewards: profit 8
Image Captioning Hard Attention for Image Captioning: • Observations: current image window • Actions: where to look • Rewards: classification 9
Why is Go hard for computers to play? Game tree complexity = b^d. Brute force search is intractable: 1. The search space is huge 2. “Impossible” for computers to evaluate who is winning. Games are a different kind of optimization problem (min-max), but still considered to be RL: • Go (complete information, deterministic) – AlphaGo • Backgammon (complete information, stochastic) – TD-Gammon • Stratego (incomplete information, deterministic) • Poker (incomplete information, stochastic) References: Matej Moravčík et al. “DeepStack: Expert-level artificial intelligence in heads-up no-limit poker.” Science, 2 March 2017. David Silver, Aja Huang, et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529.7587 (2016), pp. 484–489. Gerald Tesauro. “Temporal difference learning and TD-Gammon.” Communications of the ACM 38.3 (1995), pp. 58–68. 10
How does RL relate to Supervised Learning? • Supervised learning: — Environment samples an input-output pair $(x_t, y_t) \sim \rho$ — Agent predicts $\hat{y}_t = f(x_t)$ — Agent receives loss $\ell(y_t, \hat{y}_t)$ — Environment asks the agent a question, and then tells her the right answer 11
How does RL relate to Supervised Learning? • Reinforcement learning: − Environment samples input $x_t \sim P(x_t \mid x_{t-1}, y_{t-1})$ § Input depends on your previous actions! − Agent predicts $\hat{y}_t = f(x_t)$ − Agent receives cost $c_t \sim P(c_t \mid x_t, \hat{y}_t)$, where $P$ is a probability distribution unknown to the agent 12
Reinforcement Learning in a nutshell RL is a general-purpose framework for decision-making: • RL is for an agent with the capacity to act • Each action influences the agent’s future state • Success is measured by a scalar reward signal • Goal: select actions to maximize future reward 14
Deep Learning in a nutshell DL is a general-purpose framework for representation learning: • Given an objective • Learn a representation that is required to achieve the objective • Directly from raw inputs • Using minimal domain knowledge 15
Deep Reinforcement Learning: AI = RL + DL We seek a single agent which can solve any human-level task • RL defines the objective • DL gives the mechanism • RL + DL = general intelligence • Examples: − Play games: Atari, poker, Go, ... − Explore worlds: 3D worlds, Labyrinth, ... − Control physical systems: manipulate, walk, swim, ... − Interact with users: recommend, optimize, personalize, ... 16
Agent and Environment • At each step t the agent: − Executes action a_t − Receives observation o_t − Receives scalar reward r_t • The environment: − Receives action a_t − Emits observation o_{t+1} − Emits scalar reward r_{t+1} 17
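To make this interaction loop concrete, here is a minimal Python sketch. The `env` and `agent` objects and their `reset`/`step`/`act` methods are assumed interfaces invented for the example (loosely Gym-style); the slide itself does not prescribe any particular API.

```python
# Minimal sketch of the agent-environment loop described above.
# Assumed interfaces: env.reset() returns an initial observation,
# env.step(action) returns (observation, reward, done), and
# agent.act(observation) returns an action.

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()                          # o_1
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)                # agent executes a_t
        observation, reward, done = env.step(action)   # env emits o_{t+1}, r_{t+1}
        total_reward += reward                         # success = cumulative scalar reward
        if done:
            break
    return total_reward
```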
Example Reinforcement Learning Problem • An agent operates in an environment: Atari Breakout • An agent has the capacity to act • Each action influences the agent’s future state • Success is measured by a reward signal • Goal is to select actions to maximize future reward 18
State • Experience is a sequence of observations, actions, rewards: $o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t$ • The state is a summary of experience: $s_t = f(o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t)$ • In a fully observed environment: $s_t = f(o_t)$ 19
Major Components of an RL Agent • An RL agent may include one or more of these components: − Policy: Agent’s behavior function − Value function: How good is each state and/or action − Model: Agent’s representation of the environment 20
Policy • A policy is the agent’s behavior • It is a map from state to action: − Deterministic policy: $a = \pi(s)$ − Stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$ 21
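As a small illustration of the two policy types, the sketch below implements a deterministic policy as a state-to-action lookup and a stochastic policy as sampling from a softmax over per-state preference scores. The states, actions, and scores are made up purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic policy: a = pi(s), here a simple state -> action lookup table.
deterministic_policy = {"low_battery": "recharge", "clear_path": "move_forward"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: pi(a|s) = P[a|s], here a softmax over per-state preference scores.
preferences = {"clear_path": {"move_forward": 2.0, "turn_left": 0.5, "turn_right": 0.5}}

def act_stochastic(state):
    actions = list(preferences[state].keys())
    scores = np.array([preferences[state][a] for a in actions])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # pi(.|s) is a proper distribution over actions
    return rng.choice(actions, p=probs)      # sample a ~ pi(.|s)
```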
Value Function • A value function is a prediction of future reward − “How much reward will I get from action a in state s?” • Q-value function gives expected total reward − from state s and action a − under policy $\pi$ − with discount factor $\gamma$: $$Q^{\pi}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a \right]$$ • Value functions decompose into a Bellman equation: $$Q^{\pi}(s, a) = \mathbb{E}_{s', a'}\left[ r + \gamma Q^{\pi}(s', a') \mid s, a \right]$$ 22
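A quick sketch of the quantity inside the expectation: the discounted return of one sampled reward sequence. Averaging this over many rollouts that start from (s, a) and then follow π would give a Monte Carlo estimate of Q^π(s, a); the rewards and discount below are arbitrary.

```python
# Discounted return G = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.9**3 * 5.0 = 4.645
```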
Optimal Value Functions • An optimal value function is the maximum achievable value: $$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a) = Q^{\pi^*}(s, a)$$ • Once we have $Q^*$ we can act optimally: $$\pi^*(s) = \arg\max_a Q^*(s, a)$$ • Optimal value maximizes over all decisions. Informally: $$Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \dots = r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$$ • Formally, optimal values decompose into a Bellman equation: $$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$ 23
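To see the Bellman optimality equation at work, here is a tabular value-iteration sketch on a tiny invented MDP whose transition probabilities and rewards are given explicitly. Real RL problems do not expose these tables, so this is only meant to illustrate the backup and the greedy policy extraction.

```python
import numpy as np

# Tiny made-up MDP: 3 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 1] = 1.0; P[0, 1, 2] = 1.0
P[1, 0, 0] = 1.0; P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0                      # state 2 is absorbing
R = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [0.0, 0.0]])

# Repeatedly apply the Bellman optimality backup:
# Q*(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q*(s',a')
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)
print("greedy policy pi*(s) = argmax_a Q*(s,a):", Q.argmax(axis=1))
```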
Model [Diagram: from the stream of observations o_t, actions a_t, and rewards r_t, the agent builds an internal model of the environment.] 24
Model • Model is learnt from experience • Acts as proxy for environment • Planner interacts with model • e.g. using lookahead search 25
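As an illustration of planning with a learned model, the sketch below performs one-step lookahead: a hypothetical learned model predicts the next state and reward for each candidate action, and the planner picks the action with the best predicted one-step return. Both `model.predict` and `value` are assumptions made for this example, not a specific published method.

```python
# One-step lookahead planning with a learned model (illustrative sketch).
# Assumed interfaces: model.predict(state, action) returns a predicted
# (next_state, reward) pair learned from experience, and value(state) is
# some estimate of how good a state is (e.g. from a value network).

def plan_one_step(model, value, state, actions, gamma=0.99):
    best_action, best_score = None, float("-inf")
    for action in actions:
        next_state, reward = model.predict(state, action)   # model acts as proxy for env
        score = reward + gamma * value(next_state)           # predicted one-step return
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```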
Approaches To Reinforcement Learning • Value-based RL − Estimate the optimal value function $Q^*(s, a)$ − This is the maximum value achievable under any policy • Policy-based RL − Search directly for the optimal policy $\pi^*$ − This is the policy achieving maximum future reward • Model-based RL − Build a model of the environment − Plan (e.g. by lookahead) using model 26
Deep Reinforcement Learning • Use deep neural networks to represent − Value function − Policy − Model • Optimize loss function by stochastic gradient descent 27
Value-Based Deep RL 28
Q-Networks • Represent the value function by a Q-network with weights w: $Q(s, a, \mathbf{w}) \approx Q^*(s, a)$ [Diagram: two architectures. One network takes (s, a) as input and outputs a single value Q(s, a, w); the other takes only s and outputs Q(s, a_1, w), ..., Q(s, a_m, w), one value per action.] 29
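A minimal PyTorch sketch of the second architecture in the diagram: the network takes a state vector and outputs one Q-value per action, so a single forward pass scores all m actions. The state dimensionality, layer sizes, and action count are placeholders.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action: s -> [Q(s,a_1,w), ..., Q(s,a_m,w)]."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),     # one output head per discrete action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)                     # a batch with one (made-up) state
q_values = q_net(state)                       # shape (1, 2): Q(s, a, w) for each action
greedy_action = q_values.argmax(dim=1)        # acting greedily w.r.t. the Q-network
```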
Q-Learning • Optimal Q-values should obey the Bellman equation: $$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$ • Treat the right-hand side $r + \gamma \max_{a'} Q(s', a', \mathbf{w})$ as a target • Minimize MSE loss by stochastic gradient descent: $$l = \left( r + \gamma \max_{a'} Q(s', a', \mathbf{w}) - Q(s, a, \mathbf{w}) \right)^2$$ • Converges to $Q^*$ using a table lookup representation • But diverges using neural networks due to: − Correlations between samples − Non-stationary targets 30
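The sketch below turns this loss into one stochastic gradient step on a single transition (s, a, r, s'), using a small stand-in Q-network. The Bellman target is computed without gradients so that only Q(s, a, w) is differentiated; all tensors and sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

gamma = 0.99
# Small stand-in Q-network: 4-dim state, 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One (made-up) transition (s, a, r, s').
s = torch.randn(1, 4)
a = torch.tensor([0])
r = torch.tensor([1.0])
s_next = torch.randn(1, 4)

# Bellman target r + gamma * max_a' Q(s', a', w), treated as a fixed regression target.
with torch.no_grad():
    target = r + gamma * q_net(s_next).max(dim=1).values

prediction = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a, w)
loss = F.mse_loss(prediction, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```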
Deep Q-Networks (DQN): Experience Replay • To remove correlations, build a data-set from the agent’s own experience: store the transitions $s_1, a_1, r_2, s_2$; $s_2, a_2, r_3, s_3$; $s_3, a_3, r_4, s_4$; ...; $s_t, a_t, r_{t+1}, s_{t+1}$ as tuples $(s, a, r, s')$ • Sample experiences from the data-set and apply the update: $$l = \left( r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-) - Q(s, a, \mathbf{w}) \right)^2$$ • To deal with non-stationarity, the target parameters $\mathbf{w}^-$ are held fixed 31
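A sketch of the two ingredients on this slide: a replay buffer of (s, a, r, s') tuples sampled uniformly to break correlations, and a separate target network whose parameters w^- are frozen between periodic syncs. Network sizes and hyperparameters are placeholders, not the values used in the DQN paper.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

gamma, batch_size, sync_every = 0.99, 32, 1000

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)             # parameters w^- held fixed between syncs
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)         # stores (s, a, r, s') tuples

def store(s, a, r, s_next):
    """s and s_next are 1-D state tensors, a is an int action, r a float reward."""
    replay_buffer.append((s, a, r, s_next))

def train_step(step):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)    # decorrelated minibatch
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])

    with torch.no_grad():                               # target uses frozen w^-
        target = r + gamma * target_net(s_next).max(dim=1).values

    prediction = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(prediction, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:                          # periodically refresh w^-
        target_net.load_state_dict(q_net.state_dict())
```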
Deep Reinforcement Learning in Atari [Diagram: the agent observes the game screen as state s_t, takes joystick action a_t, and receives the change in game score as reward r_t.] Human-level control through deep reinforcement learning, V. Mnih et al. Nature 518:529–533, 2015. 32
DQN in Atari • End-to-end learning of values Q(s, a) from pixels s • Input state s is a stack of raw pixels from the last 4 frames • Output is Q(s, a) for 18 joystick/button positions • Reward is the change in score for that step • Network architecture and hyperparameters fixed across all games 33
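Below is a sketch of a DQN-style convolutional Q-network for this setting, assuming the usual preprocessing to 4 stacked 84x84 grayscale frames. The layer shapes follow the architecture reported in the Nature paper, but treat this as an approximate illustration rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    """Q-network over a stack of the last 4 preprocessed 84x84 frames."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                              # Q(s,a) per joystick/button position
        )

    def forward(self, frames):      # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))

q_net = AtariDQN(n_actions=18)
frames = torch.rand(1, 4, 84, 84)   # a dummy stack of 4 preprocessed frames
print(q_net(frames).shape)          # torch.Size([1, 18])
```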
DQN Results in Atari 34