Intro on Artificial Intelligence from the perspective of play


2018. Intro. On Artificial Intelligence from the perspective of probability theory. luozhiling@zju.edu.cn, College of Computer Science, Zhejiang University. http://www.bruceluo.net


Introduction to Artificial Intelligence (人工智能引论), 2018, Luo Zhiling

Solving for the optimal policy: Q-learning
Use a function approximator to estimate the action-value function: Q(s, a; θ) ≈ Q*(s, a), where θ are the function parameters (weights).
If the function approximator is a deep neural network => deep Q-learning! (DQN)

Solving for the optimal policy: Q-learning
Remember: we want to find a Q-function that satisfies the Bellman Equation:
Q*(s, a) = E[ r + γ max_a' Q*(s', a') | s, a ]
Forward pass. Loss function: L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))² ], where y_i = E_{s'}[ r + γ max_a' Q(s', a'; θ_{i−1}) | s, a ].
Iteratively try to make the Q-value close to the target value y_i it should have, if the Q-function corresponds to the optimal Q* (and the optimal policy π*).
Backward pass. Gradient update (with respect to Q-function parameters θ):
∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (y_i − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]
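The forward/backward pass above can be sketched numerically. This is a minimal sketch, assuming a linear function approximator (one weight column per action) instead of a deep network, and made-up dimensions and learning rate; `td_update` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, lr = 4, 2, 0.9, 0.1

def q_values(state, theta):
    # Linear function approximator: Q(s, a; theta) = s . theta[:, a]
    return state @ theta

def td_update(theta, s, a, r, s_next, done):
    # Forward pass: target y = r + gamma * max_a' Q(s', a'; theta)
    y = r if done else r + gamma * np.max(q_values(s_next, theta))
    td_error = y - q_values(s, theta)[a]
    # Backward pass: a gradient descent step on (y - Q(s, a; theta))^2
    # only touches the weight column of the action actually taken.
    theta = theta.copy()
    theta[:, a] += lr * td_error * s
    return theta, td_error

s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
theta0 = rng.normal(size=(n_features, n_actions))
theta1, err = td_update(theta0, s, a=0, r=1.0, s_next=s_next, done=False)
```

Note that the target y is treated as a constant during the backward pass (a "semi-gradient" step), exactly as in the update rule above.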

[Mnih et al. NIPS Workshop 2013; Nature 2015]
Case Study: Playing Atari Games
Objective: complete the game with the highest score.
State: raw pixel inputs of the game state. Action: game controls, e.g. Left, Right, Up, Down. Reward: score increase/decrease at each time step.

[Mnih et al. NIPS Workshop 2013; Nature 2015]
Q-network Architecture: a neural network with weights θ.
Input: current state s_t, an 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping).
Familiar conv layers followed by FC layers: 16 8x8 conv filters with stride 4, then 32 4x4 conv filters with stride 2, then FC-256, then FC-4 (Q-values).
The last FC layer has a 4-d output (if 4 actions), corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4). The number of actions is between 4 and 18, depending on the Atari game.
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
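As a sanity check on the architecture above, the spatial sizes can be worked out with the standard convolution output-size formula; this sketch assumes valid (no-padding) convolutions, which matches the sizes in the 2013 paper:

```python
def conv_out(size, kernel, stride, padding=0):
    # Output size of a convolution (no dilation; padding 0 by default).
    return (size + 2 * padding - kernel) // stride + 1

h1 = conv_out(84, kernel=8, stride=4)   # after the 16-filter 8x8, stride-4 conv
h2 = conv_out(h1, kernel=4, stride=2)   # after the 32-filter 4x4, stride-2 conv
flat = h2 * h2 * 32                     # flattened input size to the FC-256 layer
```

So the 84x84x4 input becomes 20x20x16, then 9x9x32, which is flattened to 2592 units before FC-256 and FC-4.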

[Mnih et al. NIPS Workshop 2013; Nature 2015]
Training the Q-network: Experience Replay
Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning.
- The current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops.
Address these problems using experience replay:
- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played.
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples.
Each transition can also contribute to multiple weight updates => greater data efficiency.
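The replay memory table described above can be sketched in a few lines; `ReplayMemory` and its capacity are hypothetical names/values for illustration:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s_t, a_t, r_t, s_{t+1}, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # A uniform random minibatch breaks the correlation between
        # consecutive frames and lets one transition feed many updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=50)
for t in range(100):
    memory.push(t, 0, 0.0, t + 1, False)   # dummy transitions
batch = memory.sample(8)
```

Because the deque has a fixed `maxlen`, the memory keeps only the most recent transitions, as in the paper.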

[Mnih et al. NIPS Workshop 2013; Nature 2015]
Putting it together: Deep Q-Learning with Experience Replay
- Initialize the replay memory and the Q-network.
- Play M episodes (full games); initialize the state (starting game screen pixels) at the beginning of each episode.
- For each timestep t of the game:
  - With small probability, select a random action (explore); otherwise select the greedy action from the current policy.
  - Take the action a_t, and observe the reward r_t and next state s_{t+1}.
  - Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay memory.
  - Experience replay: sample a random minibatch of transitions from the replay memory and perform a gradient descent step.
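The whole loop can be sketched end-to-end. This is a toy sketch, not the paper's implementation: a tiny chain MDP stands in for the Atari emulator, a linear (one-hot) Q-function stands in for the deep network, and all hyperparameters (annealing schedule, learning rate, batch size) are made up:

```python
import random
from collections import deque

import numpy as np

random.seed(0)
rng = np.random.default_rng(0)

# Toy chain MDP: states 0..4, action 1 moves right, action 0 moves left;
# reward 1 on reaching state 4 (terminal).
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9

def env_step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def phi(s):                          # one-hot state features: a linear
    x = np.zeros(N_STATES)           # Q-function stands in for the CNN
    x[s] = 1.0
    return x

theta = np.zeros((N_STATES, N_ACTIONS))     # initialize Q-network
memory = deque(maxlen=1000)                 # initialize replay memory

for episode in range(200):                  # play M episodes
    s = 0                                   # initialize state each episode
    eps = max(0.05, 1.0 - episode / 100)    # annealed exploration rate
    for t in range(50):                     # each timestep of the game
        if rng.random() < eps:              # explore with some probability,
            a = int(rng.integers(N_ACTIONS))
        else:                               # otherwise act greedily
            a = int(np.argmax(phi(s) @ theta))
        s_next, r, done = env_step(s, a)    # take action, observe r, s'
        memory.append((s, a, r, s_next, done))   # store the transition
        if len(memory) >= 32:               # experience replay: gradient
            for si, ai, ri, sni, di in random.sample(memory, 32):
                y = ri if di else ri + GAMMA * np.max(phi(sni) @ theta)
                theta[:, ai] += 0.05 * (y - (phi(si) @ theta)[ai]) * phi(si)
        s = s_next
        if done:
            break

greedy = [int(np.argmax(phi(s) @ theta)) for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right in every non-terminal state, and Q(3, "right") approaches the true value 1.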


Policy Gradients
What is a problem with Q-learning? The Q-function can be very complicated!
Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair.
But the policy can be much simpler: just close your hand.
Can we learn a policy directly, e.g. find the best policy from a collection of policies?

Policy Gradients
Formally, let's define a class of parametrized policies: Π = { π_θ : θ ∈ R^m }.
For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ].
We want to find the optimal policy θ* = arg max_θ J(θ). How can we do this? Gradient ascent on the policy parameters!

REINFORCE algorithm
Mathematically, we can write: J(θ) = E_{τ∼p(τ;θ)}[ r(τ) ], where r(τ) is the reward of a trajectory τ = (s_0, a_0, r_0, s_1, …).

REINFORCE algorithm
Expected reward: J(θ) = E_{τ∼p(τ;θ)}[ r(τ) ] = ∫_τ r(τ) p(τ; θ) dτ.
Now let's differentiate this: ∇_θ J(θ) = ∫_τ r(τ) ∇_θ p(τ; θ) dτ. Intractable! The gradient of an expectation is problematic when p depends on θ.
However, we can use a nice trick: ∇_θ p(τ; θ) = p(τ; θ) · (∇_θ p(τ; θ) / p(τ; θ)) = p(τ; θ) ∇_θ log p(τ; θ).
If we inject this back: ∇_θ J(θ) = ∫_τ ( r(τ) ∇_θ log p(τ; θ) ) p(τ; θ) dτ = E_{τ∼p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]. We can estimate this with Monte Carlo sampling.
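The log-derivative trick above can be verified numerically on a toy model. This sketch assumes a one-parameter Bernoulli "policy" p(x=1; θ) = sigmoid(θ) with a made-up reward r(x) = 2x, for which the exact gradient is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = 0.5
p = sigmoid(theta)                 # p(x=1; theta) for the Bernoulli model

def reward(x):
    return 2.0 * x                 # toy r(x): reward 2 for x=1, 0 for x=0

# Score-function (log-derivative) estimator:
#   grad_theta E[r(x)] = E[ r(x) * grad_theta log p(x; theta) ]
# For the Bernoulli-sigmoid model, grad_theta log p(x; theta) = x - p.
x = (rng.random(200_000) < p).astype(float)   # Monte Carlo samples
mc_grad = np.mean(reward(x) * (x - p))

exact_grad = 2.0 * p * (1.0 - p)   # d/dtheta of 2 * sigmoid(theta)
```

With enough samples the Monte Carlo estimate matches the analytic gradient, even though we never differentiated through the sampling itself.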

REINFORCE algorithm
Can we compute those quantities without knowing the transition probabilities?
We have: p(τ; θ) = Π_{t≥0} p(s_{t+1} | s_t, a_t) π_θ(a_t | s_t).
Thus: log p(τ; θ) = Σ_{t≥0} log p(s_{t+1} | s_t, a_t) + Σ_{t≥0} log π_θ(a_t | s_t).
And when differentiating: ∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t). This doesn't depend on the transition probabilities!
Therefore, when sampling a trajectory τ, we can estimate ∇_θ J(θ) with Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t).
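As the derivation shows, the update only needs ∇_θ log π_θ of the actions actually taken. A minimal sketch of REINFORCE, assuming a two-armed bandit (trajectories of length one, so no transition model at all), a softmax policy, and made-up rewards and learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)                    # policy logits for two actions
true_reward = np.array([1.0, 0.2])     # r(tau) for each one-step trajectory
lr = 0.1

def policy(theta):
    z = np.exp(theta - theta.max())    # numerically stable softmax
    return z / z.sum()

for _ in range(2000):
    pi = policy(theta)
    a = rng.choice(2, p=pi)            # sample a trajectory (one action)
    r = true_reward[a]
    # grad_theta log pi(a) for a softmax policy is onehot(a) - pi;
    # note the environment's dynamics never appear in the update.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi      # REINFORCE gradient ascent step

pi = policy(theta)
```

After training, the policy concentrates on the higher-reward action, using only sampled actions and rewards.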

Intuition
Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t).
Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen.
- If r(τ) is low, push down the probabilities of the actions seen.
It might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!

Actor-Critic Algorithm
Problem: we don't know Q and V. Can we learn them? Yes, using Q-learning! We can combine Policy Gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).
- The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust.
- This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy.
- We can also incorporate Q-learning tricks, e.g. experience replay.
1. The actor observes the game's current state and takes an action.
2. The critic scores the actor's performance just shown, based on both the state and the action.
3. The actor adjusts its strategy (the actor network's parameters) according to the critic's score, trying to do better next time.
4. The critic adjusts its scoring strategy (the critic network's parameters) according to the reward given by the environment (which acts as the ground truth).
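The four-step loop above can be sketched in miniature. This is a toy sketch, not a full actor-critic implementation: a two-armed bandit stands in for the game, the "critic" is a per-action running average rather than a neural network, and the learning rates and rewards are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)          # actor: softmax policy logits
q_hat = np.zeros(2)          # critic: estimated Q-value per action
actor_lr, critic_lr = 0.1, 0.2
true_reward = np.array([1.0, 0.2])

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for _ in range(2000):
    pi = policy(theta)
    a = rng.choice(2, p=pi)        # 1. actor sees the state, takes an action
    r = true_reward[a]             #    environment returns a reward
    score = q_hat[a]               # 2. critic scores the actor's action
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += actor_lr * score * grad_log_pi   # 3. actor adjusts by the score
    q_hat[a] += critic_lr * (r - q_hat[a])    # 4. critic adjusts by the reward

pi = policy(theta)
```

The actor never sees the raw reward directly; it only sees the critic's score, while the critic is trained against the environment's reward.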

REINFORCE in action: Recurrent Attention Model (RAM)
Objective: image classification. Take a sequence of "glimpses" selectively focusing on regions of the image, to predict the class.
- Inspiration from human perception and eye movements.
- Saves computational resources => scalability.
- Able to ignore clutter / irrelevant parts of the image.
State: glimpses seen so far. Action: (x, y) coordinates (the center of the glimpse) of where to look next in the image. Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise.
Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action. [Mnih et al. 2014]

REINFORCE in action: Recurrent Attention Model (RAM)
[Figure: the RNN unrolled over timesteps. At each step the input image and the current glimpse feed a neural network, which outputs the next glimpse location (x_1, y_1), (x_2, y_2), …, (x_5, y_5); after the final glimpse, a softmax outputs the class prediction, e.g. y = 2.] [Mnih et al. 2014]

REINFORCE in action: Recurrent Attention Model (RAM)
Has also been used in many other tasks, including fine-grained image recognition, image captioning, and visual question answering! [Mnih et al. 2014]

[Silver et al., Nature 2016]
More policy gradients: AlphaGo
Overview:
- A mix of supervised learning and reinforcement learning.
- A mix of old methods (Monte Carlo Tree Search) and recent ones (deep RL).
How to beat the Go world champion:
- Featurize the board (stone color, move legality, bias, …).
- Initialize the policy network with supervised training from professional Go games, then continue training using policy gradient (play against itself from random previous iterations; +1 / -1 reward for winning / losing).
- Also learn a value network (critic).
- Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search.

Summary
- Policy gradients: very general, but suffer from high variance, so they require a lot of samples. Challenge: sample-efficiency.
- Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration.
- Guarantees:
  - Policy gradients: converge to a local optimum of J(θ), often good enough!
  - Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator.

OUTLINE
• Intro on Reinforcement Learning
• Learning with Reward
  – Markov Decision Process (MDP)
  – Q-Learning
  – Policy Gradient
  – Actor-Critic Algorithm
• Learning without Reward
  – Inverse Reinforcement Learning
• AlphaGo


Motivation
- Dynamics Model P_sa: the probability distribution over next states given the current state and action.
- Reward Function R: describes the desirability of being in a state.
- Controller/Policy π: prescribes the action to take for each state; obtained from the dynamics model and reward function via Reinforcement Learning / Optimal Control.
Key challenges:
- Providing a formal specification of the control task.
- Building a good dynamics model.
- Finding closed-loop controllers.

Destination
• Inverse Reinforcement Learning algorithms
  – Leverage expert demonstrations to learn to perform a desired task.
• Formal guarantees
  – Running time
  – Sample complexity
  – Performance of the resulting controller
• Enabled us to solve highly challenging, previously unsolved, real-world control problems in
  – Quadruped locomotion
  – Autonomous helicopter flight

Example task: driving

Problem setup
• Input:
  – Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
  – No reward function
  – Teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, … (= a trace of the teacher's policy π*)
• Desired output:
  – A policy π̃ which (ideally) has performance guarantees, i.e., E[ Σ_t γ^t R*(s_t) | π̃ ] ≥ E[ Σ_t γ^t R*(s_t) | π* ] − ε
  – Note: R* is unknown.

Prior work: behavioral cloning
• Formulate as a standard machine learning problem:
  – Fix a policy class, e.g. support vector machine, neural network, decision tree, deep belief net, …
  – Estimate a policy from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), …
• Limitations:
  – Fails to provide strong performance guarantees
  – Underlying assumption: policy simplicity
• E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
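Behavioral cloning really is just supervised learning on (s_t, a_t) pairs, which a short sketch makes concrete. The teacher rule, feature dimension, and logistic policy class below are all hypothetical choices for illustration; no reward is ever used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teacher policy: take action 1 iff the first state feature
# is positive. Behavioral cloning only ever sees its (state, action) pairs.
def teacher(s):
    return int(s[0] > 0)

states = rng.normal(size=(500, 3))
actions = np.array([teacher(s) for s in states])

# Fix a policy class (logistic regression) and fit it to the demonstrations
# by gradient ascent on the log-likelihood.
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(states @ w)))
    w += 0.1 * states.T @ (actions - p) / len(states)

test_states = rng.normal(size=(200, 3))
pred = (1.0 / (1.0 + np.exp(-(test_states @ w))) > 0.5).astype(int)
truth = np.array([teacher(s) for s in test_states])
accuracy = (pred == truth).mean()
```

The cloned policy imitates the teacher well on states like those in the demonstrations, but, as the slide notes, this gives no guarantee about its actual control performance.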

Main Idea
- Dynamics Model P_sa.
- Reward Function R: often fairly succinct.
- Controller/Policy π: prescribes the action to take for each state; typically very complex.
- Reinforcement Learning / Optimal Control turns the dynamics model and reward function into a controller, so learning the succinct reward can be easier than learning the complex policy directly.

Method
• Assumption: learn through reward functions rather than directly learning policies.
• Initialize: pick some controller π_0.
• Iterate for i = 1, 2, …:
  – "Guess" the reward function: find a reward function R_w such that the teacher maximally outperforms all previously found controllers.
  – Find the optimal control policy π_i for the current guess of the reward function R_w.
  – If the teacher's remaining margin over the found controllers is sufficiently small, exit the algorithm.
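One common way to make "the teacher maximally outperforms" concrete, as in apprenticeship learning, is to compare discounted feature expectations under a linear reward guess R_w(s) = w · φ(s). This is a sketch under that assumption, with hypothetical lane-indicator features and made-up demonstration trajectories:

```python
import numpy as np

GAMMA = 0.9

def feature_expectations(trajectories, phi, gamma=GAMMA):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(s_t) ]."""
    mus = []
    for traj in trajectories:
        mu = sum((gamma ** t) * phi(s) for t, s in enumerate(traj))
        mus.append(mu)
    return np.mean(mus, axis=0)

# Hypothetical features: phi(s) is an indicator over 3 lanes.
def phi(s):
    x = np.zeros(3)
    x[s] = 1.0
    return x

# Teacher stays in lane 1; a previously found controller drifts between lanes.
teacher_demo = [[1, 1, 1, 1, 1]]
controller_rollouts = [[0, 1, 2, 1, 0], [2, 2, 1, 0, 0]]

mu_teacher = feature_expectations(teacher_demo, phi)
mu_pi = feature_expectations(controller_rollouts, phi)

# Under the reward guess R_w(s) = w . phi(s), the teacher's margin over the
# controller is w . (mu_teacher - mu_pi); the "guess the reward" step picks
# the w that makes this margin as large as possible.
w = np.array([-1.0, 1.0, -1.0])
margin = w @ (mu_teacher - mu_pi)
```

When the margin can no longer be made large for any admissible w, the found controllers already match the teacher's feature expectations, which is the exit condition in spirit.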

Experiment: highway driving
Teacher in the training world; learned policy in the testing world.
• Input:
  – Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
  – Teacher's demonstration: 1 minute in the "training world"
  – Note: R* is unknown.
  – Reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to the presence of another car in the current lane at different distances.

More driving examples
In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.
