Introduction to Artificial Intelligence (人工智能引论), 2018, Luo Zhiling (罗智凌)

Solving for the optimal policy: Q-learning

Q-learning: use a function approximator to estimate the action-value function, Q(s, a; θ), where θ are the function parameters (weights). If the function approximator is a deep neural network => deep Q-learning (DQN).
Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation:

  Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Forward pass, loss function:

  L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))² ],  where  y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]

Iteratively try to make the Q-value close to the target value y_i it should have, if the Q-function corresponds to the optimal Q* (and the optimal policy π*).

Backward pass, gradient update (with respect to the Q-function parameters θ):

  ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]
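The forward/backward pass above can be sketched in code. A minimal illustration (not from the slides) using a linear function approximator Q(s, a; θ) = θ·φ(s, a); the feature map φ and the toy numbers are invented, and for simplicity the same θ is used in the target, whereas the slides freeze the previous parameters θ_{i−1}:

```python
GAMMA = 0.9   # discount factor
ALPHA = 0.1   # learning rate

def q_value(theta, phi):
    # Linear function approximator: Q(s, a; theta) = theta . phi(s, a)
    return sum(t * f for t, f in zip(theta, phi))

def q_learning_step(theta, phi_sa, reward, next_phis):
    """One forward/backward pass.

    Forward:  y = r + gamma * max_a' Q(s', a'; theta)
    Backward: theta <- theta + alpha * (y - Q(s, a; theta)) * grad_theta Q,
    where grad_theta Q = phi(s, a) for a linear Q.
    """
    y = reward + GAMMA * max(q_value(theta, p) for p in next_phis)
    td_error = y - q_value(theta, phi_sa)
    return [t + ALPHA * td_error * f for t, f in zip(theta, phi_sa)]

# Toy numbers: 2-dim features, two available actions in the next state.
theta = [0.0, 0.0]
phi_sa = [1.0, 0.0]                    # features of the (s, a) just taken
next_phis = [[0.0, 1.0], [1.0, 1.0]]   # features of (s', a') for each a'
theta = q_learning_step(theta, phi_sa, reward=1.0, next_phis=next_phis)
print(theta)  # TD error is 1.0, so theta moves to [0.1, 0.0]
```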
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Case Study: Playing Atari Games

Objective: complete the game with the highest score
State: raw pixel inputs of the game state
Action: game controls, e.g. left, right, up, down
Reward: score increase/decrease at each time step
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Q-network Architecture

Input, current state s_t: an 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping).

A neural network with weights θ:
- 16 8x8 conv filters, stride 4
- 32 4x4 conv filters, stride 2
- FC-256
- FC-4 (Q-values)

Familiar conv layers and FC layers. The last FC layer has a 4-dimensional output (if 4 actions), corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4). The number of actions is between 4 and 18, depending on the Atari game.

A single feedforward pass computes the Q-values for all actions from the current state => efficient!
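The layer sizes above can be checked with the usual valid-convolution formula, out = (in − kernel) / stride + 1. The helper below is illustrative arithmetic, not the paper's code:

```python
def conv_out(size, kernel, stride):
    # Spatial output size of a valid (no-padding) convolution.
    return (size - kernel) // stride + 1

# Input: 84x84x4 stack of preprocessed frames.
h = conv_out(84, kernel=8, stride=4)   # 16 filters, 8x8, stride 4
print(h)          # 20 -> feature maps are 20x20x16
h = conv_out(h, kernel=4, stride=2)    # 32 filters, 4x4, stride 2
print(h)          # 9  -> feature maps are 9x9x32
flat = h * h * 32
print(flat)       # 2592 units, flattened into FC-256, then FC-4 (Q-values)
```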
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Training the Q-network: Experience Replay

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- The current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:
- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples
- Each transition can also contribute to multiple weight updates => greater data efficiency
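A replay memory of this kind can be sketched in a few lines; the class name and capacity below are illustrative, not from the paper:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # A uniformly random minibatch breaks the correlation between
        # consecutive samples.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=100)
for t in range(150):                   # more pushes than capacity
    memory.push(t, 0, 0.0, t + 1)
batch = memory.sample(4)
print(len(memory.buffer), len(batch))  # 100 4
```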
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Putting it together: Deep Q-Learning with Experience Replay

- Initialize replay memory and Q-network
- Play M episodes (full games)
- Initialize the state (starting game-screen pixels) at the beginning of each episode
- For each timestep t of the game:
  - With small probability, select a random action (explore); otherwise select the greedy action from the current policy
  - Take the action a_t, and observe the reward r_t and next state s_{t+1}
  - Store the transition (s_t, a_t, r_t, s_{t+1}) in replay memory
  - Experience replay: sample a random minibatch of transitions from the replay memory and perform a gradient descent step
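The whole loop can be sketched on a toy problem. The sketch below uses a tabular Q-function on an invented 4-state chain (the slides use a deep network on Atari frames); the MDP, hyperparameters, and episode count are all made up for illustration:

```python
import random
from collections import deque

random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
N_STATES, GOAL = 4, 3

def step(s, a):
    # Toy chain MDP: action 1 moves right, action 0 moves left;
    # reward 1 for reaching the goal state (then the episode ends).
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == GOAL else 0.0), s_next == GOAL

Q = [[0.0, 0.0] for _ in range(N_STATES)]   # "Q-network" (here: a table)
memory = deque(maxlen=500)                  # replay memory

for episode in range(200):                  # play M episodes
    s, done = 0, False                      # initialize state
    while not done:                         # for each timestep t
        # epsilon-greedy: explore with small probability, else act greedily
        if random.random() < EPS:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s_next, r, done = step(s, a)
        memory.append((s, a, r, s_next, done))   # store the transition
        # experience replay: update on a random minibatch
        for ss, aa, rr, sn, dd in random.sample(memory, min(4, len(memory))):
            target = rr + (0.0 if dd else GAMMA * max(Q[sn]))
            Q[ss][aa] += ALPHA * (target - Q[ss][aa])
        s = s_next

greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(greedy)  # the learned greedy policy should move right, toward the goal
```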
Policy Gradients

What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair.

But the policy can be much simpler: just close your hand. Can we learn a policy directly, e.g. by finding the best policy from a collection of policies?
Policy Gradients

Formally, let's define a class of parametrized policies:  Π = { π_θ, θ ∈ ℝ^m }

For each policy, define its value:

  J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]

We want to find the optimal policy θ* = arg max_θ J(θ). How can we do this? Gradient ascent on policy parameters!
REINFORCE algorithm

Mathematically, we can write:

  J(θ) = E_{τ ~ p(τ; θ)}[ r(τ) ]

where r(τ) is the reward of a trajectory τ = (s_0, a_0, r_0, s_1, ...).
REINFORCE algorithm

Expected reward:

  J(θ) = E_{τ ~ p(τ; θ)}[ r(τ) ] = ∫_τ r(τ) p(τ; θ) dτ

Now let's differentiate this:

  ∇_θ J(θ) = ∫_τ r(τ) ∇_θ p(τ; θ) dτ

Intractable! The gradient of an expectation is problematic when p depends on θ.

However, we can use a nice trick:

  ∇_θ p(τ; θ) = p(τ; θ) (∇_θ p(τ; θ) / p(τ; θ)) = p(τ; θ) ∇_θ log p(τ; θ)

If we inject this back:

  ∇_θ J(θ) = ∫_τ ( r(τ) ∇_θ log p(τ; θ) ) p(τ; θ) dτ = E_{τ ~ p(τ; θ)}[ r(τ) ∇_θ log p(τ; θ) ]

which we can estimate with Monte Carlo sampling.
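The log-derivative trick can be sanity-checked numerically. Below, a one-step "trajectory" is a single Bernoulli draw x ~ p(x; θ) with p(1) = σ(θ) and reward r(x) = x, so J(θ) = σ(θ) has a known gradient; this setup is invented for illustration:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One-step "policy": x ~ Bernoulli(sigmoid(theta)), reward r(x) = x.
# Then J(theta) = E[r] = sigmoid(theta), with known gradient
# dJ/dtheta = sigmoid(theta) * (1 - sigmoid(theta)).
theta = 0.0
p1 = sigmoid(theta)
true_grad = p1 * (1.0 - p1)   # 0.25 at theta = 0

# Score-function estimator: grad J = E[ r(x) * d/dtheta log p(x; theta) ],
# and for this Bernoulli parametrization, d/dtheta log p(x; theta) = x - p1.
N = 200_000
total = 0.0
for _ in range(N):
    x = 1 if random.random() < p1 else 0   # sample from the policy
    total += x * (x - p1)                   # r(x) * score
est = total / N
print(round(est, 3))  # Monte Carlo estimate of the gradient, close to 0.25
```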
REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have:

  p(τ; θ) = Π_{t≥0} p(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)

Thus:

  log p(τ; θ) = Σ_{t≥0} [ log p(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]

And when differentiating:

  ∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t)

This doesn't depend on the transition probabilities! Therefore, when sampling a trajectory τ, we can estimate ∇_θ J(θ) with

  ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)
Intuition

Gradient estimator:

  ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen

It might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!
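A complete REINFORCE loop, on an invented two-armed bandit where each "trajectory" is a single action (all numbers are illustrative):

```python
import math
import random

random.seed(1)

# Two-armed bandit as one-step trajectories: arm 1 pays more on average,
# so REINFORCE should push up the probability of choosing it.
MEANS = [0.2, 0.8]          # expected reward of each arm (invented)
theta = [0.0, 0.0]          # softmax policy parameters, one per arm
ALPHA = 0.1                 # learning rate

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1        # sample a "trajectory"
    r = 1.0 if random.random() < MEANS[a] else 0.0    # its reward r(tau)
    # grad_theta log pi(a) for a softmax policy: one_hot(a) - probs
    for k in range(2):
        grad_log = (1.0 if k == a else 0.0) - probs[k]
        theta[k] += ALPHA * r * grad_log              # push up if r is high

print(round(softmax(theta)[1], 2))  # probability of the better arm grows
```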
Actor-Critic Algorithm

Problem: we don't know Q and V. Can we learn them? Yes, using Q-learning! We can combine policy gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).
- The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust
- This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy
- Can also incorporate Q-learning tricks, e.g. experience replay

1. The actor observes the game's current state and takes an action.
2. The critic scores the actor's performance, based on both the state and the action.
3. According to the critic's score, the actor adjusts its strategy (the actor network's parameters), trying to do better next time.
4. According to the reward given by the environment (which serves as ground truth), the critic adjusts its scoring strategy (the critic network's parameters).
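The interplay in steps 1-4 can be sketched on a toy two-armed bandit (all numbers are invented; a real actor-critic uses neural networks for both the actor and the critic):

```python
import math
import random

random.seed(2)

MEANS = [0.3, 0.7]   # expected reward of each "action" (invented numbers)
theta = [0.0, 0.0]   # actor: softmax policy parameters
Q = [0.0, 0.0]       # critic: estimated value of each action
ALPHA_ACTOR, ALPHA_CRITIC = 0.05, 0.05

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

for _ in range(3000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1       # actor picks an action
    r = 1.0 if random.random() < MEANS[a] else 0.0   # environment reward
    Q[a] += ALPHA_CRITIC * (r - Q[a])                # critic learns from r
    # The critic's score replaces the raw trajectory reward in the
    # policy-gradient update; subtracting V(s) gives an advantage.
    advantage = Q[a] - sum(p * q for p, q in zip(probs, Q))
    for k in range(2):
        grad_log = (1.0 if k == a else 0.0) - probs[k]
        theta[k] += ALPHA_ACTOR * advantage * grad_log

print(round(softmax(theta)[1], 2), [round(q, 2) for q in Q])
```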
[Mnih et al. 2014]
REINFORCE in action: Recurrent Attention Model (RAM)

Objective: image classification. Take a sequence of "glimpses" selectively focusing on regions of the image, to predict the class.
- Inspired by human perception and eye movements
- Saves computational resources => scalability
- Able to ignore clutter / irrelevant parts of the image

State: glimpses seen so far
Action: (x, y) coordinates (the center of the glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action.
REINFORCE in action: Recurrent Attention Model (RAM)

[Figure: at each step, the current glimpse of the input image is fed to the recurrent network ("NN"), which outputs the next glimpse location (x_1, y_1), (x_2, y_2), ..., (x_5, y_5); after the final glimpse, a softmax over classes produces the prediction, e.g. y = 2.] [Mnih et al. 2014]
REINFORCE in action: Recurrent Attention Model (RAM)

This approach has also been used in many other tasks, including fine-grained image recognition, image captioning, and visual question answering! [Mnih et al. 2014]
More policy gradients: AlphaGo  [Silver et al., Nature 2016]

Overview:
- A mix of supervised learning and reinforcement learning
- A mix of old methods (Monte Carlo Tree Search) and recent ones (deep RL)

How to beat the Go world champion:
- Featurize the board (stone color, move legality, bias, ...)
- Initialize the policy network with supervised training from professional Go games, then continue training using policy gradients (play against itself from random previous iterations; +1 / -1 reward for winning / losing)
- Also learn a value network (critic)
- Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search
Summary

- Policy gradients: very general, but suffer from high variance, so they require many samples. Challenge: sample efficiency
- Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration

Guarantees:
- Policy gradients: converge to a local optimum of J(θ), which is often good enough!
- Q-learning: no guarantees, since you are approximating the Bellman equation with a complicated function approximator
OUTLINE
• Intro on Reinforcement Learning
• Learning with Reward
  – Markov Decision Process (MDP)
  – Q-Learning
  – Policy Gradient
  – Actor-Critic Algorithm
• Learning without Reward
  – Inverse Reinforcement Learning
• AlphaGo
Motivation

- Dynamics model P_sa: the probability distribution over next states, given the current state and action
- Reward function R: describes the desirability of being in a state
- Controller/policy π: prescribes the action to take in each state; obtained from P_sa and R via reinforcement learning / optimal control

Key challenges:
- Providing a formal specification of the control task
- Building a good dynamics model
- Finding closed-loop controllers
Destination

• Inverse Reinforcement Learning algorithms
  – Leverage expert demonstrations to learn to perform a desired task
• Formal guarantees
  – Running time
  – Sample complexity
  – Performance of the resulting controller
• Enabled us to solve highly challenging, previously unsolved, real-world control problems in
  – Quadruped locomotion
  – Autonomous helicopter flight
Example task: driving
Problem setup

• Input:
  – Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
  – No reward function
  – Teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, ... (a trace of the teacher's policy π*)
• Desired output:
  – A policy π̃ which (ideally) has performance guarantees, i.e.

    E[ Σ_{t≥0} γ^t R*(s_t) | π̃ ] ≥ E[ Σ_{t≥0} γ^t R*(s_t) | π* ] − ε

  – Note: R* is unknown.
Prior work: behavioral cloning

• Formulate imitation as a standard machine learning problem:
  – Fix a policy class (e.g., support vector machine, neural network, decision tree, deep belief net, ...)
  – Estimate a policy from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), ...
• Limitations:
  – Fails to provide strong performance guarantees
  – Underlying assumption: policy simplicity
• E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
Main Idea

The reward function R is often fairly succinct, while the controller/policy π (which prescribes the action to take in each state) is typically very complex. So instead of imitating the policy directly, learn the reward function, and recover the policy from R and the dynamics model P_sa via reinforcement learning / optimal control.
Method

Idea: learn through reward functions rather than directly learning policies. Assume the reward is linear in features of the state, R_w(s) = wᵀφ(s).

• Initialize: pick some controller π_0.
• Iterate for i = 1, 2, ...:
  – "Guess" the reward function: find a reward function R_w such that the teacher maximally outperforms all previously found controllers.
  – Find the optimal control policy π_i for the current guess of the reward function R_w.
  – If the teacher's margin over π_i is small enough, exit the algorithm.
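One classic instantiation of this loop is the feature-matching (projection) algorithm of Abbeel & Ng (2004), sketched below on an invented toy: each candidate policy is summarized by its expected feature counts μ(π), the "guess" step picks w pointing from the current projection toward the expert's μ_E, and the "solve" step picks the best candidate for the reward w·φ. The candidate set, names, and numbers are all made up for illustration:

```python
# Toy candidate policies, each summarized by expected feature counts mu(pi).
CANDIDATES = {
    "fast_reckless": [1.0, 0.0],
    "slow_safe":     [0.0, 1.0],
    "expert_like":   [0.6, 0.9],
}
mu_E = [0.6, 0.9]   # expert's expected feature counts

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

found = ["fast_reckless"]                 # pi_0: arbitrary initial policy
mu_bar = CANDIDATES["fast_reckless"][:]   # running projection of mu_E
for i in range(10):
    w = sub(mu_E, mu_bar)                 # "guess" the reward weights
    if dot(w, w) ** 0.5 < 1e-6:           # teacher no longer outperforms
        break
    # "Solve": best policy in the candidate set for reward R_w(s) = w.phi(s)
    best = max(CANDIDATES, key=lambda name: dot(w, CANDIDATES[name]))
    found.append(best)
    # Project mu_E onto the segment from mu_bar toward mu(best).
    d = sub(CANDIDATES[best], mu_bar)
    step = dot(d, w) / dot(d, d)
    mu_bar = [m + step * di for m, di in zip(mu_bar, d)]

print(found[-1])  # the loop settles on the policy matching the expert
```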
Experiment: highway driving

Teacher in the training world / learned policy in the testing world.

• Input:
  – Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
  – Teacher's demonstration: 1 minute in the "training world"
  – Note: R* is unknown.
  – Reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to the presence of another car in the current lane at different distances
More driving examples

Driving demonstration / learned behavior (shown for two styles). In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.