Human-level control through deep reinforcement learning - Liia Butler
But first... A quote "The question of whether machines can think... is about as relevant as the question of whether submarines can swim" Edsger W. Dijkstra
Overview 1. Introduction 2. Reinforcement Learning 3. Deep neural networks 4. Markov Decision Process 5. Algorithm Breakdown 6. Evaluation and conclusions
Introduction Deep Q-network (DQN) - The agent - Reinforcement learning plus deep neural networks - Goal: general artificial intelligence - How little do we have to know to be intelligent? Can we solve a wide range of challenging tasks? - Only pixels and the game score as input
Reinforcement Learning - Theory of how software agents may optimize their control of the environment - Inspired by the psychological and neuroscientific perspectives on animal behavior - One of the three types of machine learning http://en.proft.me/media/science/ml_types.png
Space Invaders
Deep Neural Networks - An architecture in deep learning; a type of artificial neural network - Artificial neural network: a network of highly connected nodes (processing elements) that work together to solve specific problems, much like a biological nervous system - Multiple layers of nodes with increasing abstraction of the data - Extract high-level representations from raw data - DQN uses a "deep convolutional network" - 84 x 84 x 4 image produced by the preprocessing map - Three convolutional layers - Two fully connected layers - http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f4.jpg http://www.nature.com/nature/journal/v518/n7540/carousel/nature14236-f1.jpg
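As a concrete illustration, here is a minimal sketch of that network, assuming PyTorch; the slide only gives the input size and layer counts, so the filter sizes and the 512-unit hidden layer are taken from the Nature paper rather than from the slide.

```python
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Sketch of the deep convolutional network the slide describes.
    Input: a stack of 4 preprocessed 84x84 frames; output: one Q-value per action.
    Filter sizes follow the Nature paper (assumed here, not stated on the slide)."""
    def __init__(self, n_actions):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # first fully connected layer
            nn.Linear(512, n_actions),                               # second fully connected layer
        )

    def forward(self, x):
        return self.layers(x)
```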
Markov Decision Process - State - Action - Reward http://cse-wiki.unl.edu/wiki/images/5/58/ReinforJpeg.jpg
What these mean for DQN - State - What is going on? - To stay general, the state is represented by the raw screen pixels - Action - What can we do? - e.g. moving in a direction, pressing buttons - Reward - What's our motivation? - Points, lives, etc. http://www.retrogamer.net/wp-content/uploads/2014/07/Top-10-Atari-Jaguar-Games-616x410.png
How is DQN going to do this? - Preprocessing - reduce input dimensionality, take the max pixel value over consecutive frames to remove flickering - ε-greedy policy - choosing the action - Bellman equation - optimal control of the environment, action-value function - Using a function approximator to estimate the action-value function - Loss function and Q-learning gradient - Experience replay - building a data set from the agent's experience
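A rough sketch of the preprocessing idea, assuming NumPy and OpenCV; the function name `preprocess_frame` and the exact resize choice are illustrative, not taken from the slides.

```python
import numpy as np
import cv2

def preprocess_frame(frame, prev_frame):
    """Reduce one raw Atari frame (RGB) to an 84x84 grayscale image.
    Taking the pixel-wise max over the current and previous frame removes the
    flickering caused by sprites that are only drawn on alternate frames."""
    merged = np.maximum(frame, prev_frame)                 # max value for each pixel
    gray = cv2.cvtColor(merged, cv2.COLOR_RGB2GRAY)        # drop the color channels
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.uint8)

# The network input is the last 4 preprocessed frames stacked together, e.g.:
# state = np.stack([preprocess_frame(f, p) for f, p in last_four_frame_pairs], axis=0)
```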
Algorithm Breakdown Key
D = replay memory (the data set of experiences)
N = maximum number of experience tuples held in replay memory
Q = "quality" (action-value) function
Θ = weights of the function approximator
M = number of episodes
s = sequence / state
x = observation/image
Φ = preprocessing function applied to sequences
T = time-step at which the game terminates
ε = exploration probability in the ε-greedy policy
a = action
y = target
r = reward
γ = reward discount factor
C = number of steps between resets of the target network to Q
ε-greedy policy How to choose the action 'a' at time 't' - Exploration: with probability ε, pick a random action - Exploitation: otherwise, pick the best action according to the Q value
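A minimal sketch of that rule in Python; `q_network`, `state`, and `n_actions` are stand-ins for the approximator, the preprocessed input, and the game's action set, and PyTorch is assumed for the Q-value lookup.

```python
import random
import torch

def epsilon_greedy_action(q_network, state, n_actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit
    (the action with the largest estimated Q-value for this state)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # exploration
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))    # add a batch dimension
        return int(q_values.argmax(dim=1).item())   # exploitation
```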
Experience Replay 1. Take an action 2. Store the transition in memory 3. Sample a random minibatch of transitions from D 4. Update the Q-network with a gradient-descent step toward the target 'y'
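A small sketch of the replay memory D, assuming a fixed capacity N and uniform random sampling; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the last N transitions (state, action, reward, next_state, done)
    and hands back uniform random minibatches, which breaks up the correlation
    between consecutive experiences."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off the end

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```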
Optimizing the Q-Network - Bellman equation: Q*(s,a) = E[ r + γ max_a' Q*(s',a') | s,a ] - The loss function we have: L(Θ) = E[ (y - Q(s,a;Θ))^2 ] - From this, the target: y = r + γ max_a' Q(s',a';Θ⁻) - Gives us the Q-learning gradient: ∇_Θ L(Θ) = E[ (r + γ max_a' Q(s',a';Θ⁻) - Q(s,a;Θ)) ∇_Θ Q(s,a;Θ) ]
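A sketch of one optimization step on a sampled minibatch, assuming PyTorch; `q_network` and `target_network` stand for the current approximator and the periodically-copied target network (weights Θ⁻), and a plain squared-error loss is used here rather than the paper's clipped error term.

```python
import torch
import torch.nn.functional as F

def q_learning_step(q_network, target_network, optimizer, batch, gamma):
    """One gradient-descent step on the squared error between the target
    y = r + gamma * max_a' Q(s', a'; theta^-) and the estimate Q(s, a; theta)."""
    states, actions, rewards, next_states, dones = batch

    # Current estimate Q(s, a; theta) for the actions that were actually taken.
    q_sa = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y from the Bellman equation, computed with the frozen target network.
    with torch.no_grad():
        max_next_q = target_network(next_states).max(dim=1).values
        y = rewards + gamma * max_next_q * (1 - dones)   # no bootstrap on terminal states

    loss = F.mse_loss(q_sa, y)    # (y - Q(s, a; theta))^2 averaged over the minibatch
    optimizer.zero_grad()
    loss.backward()               # gradient of the loss, as on the slide
    optimizer.step()
    return loss.item()
```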
Breakout!
Evaluation and Conclusions - Agents vs. professional human game testers - The agent acts at 10 Hz (an action every 0.1 seconds, i.e. every 6th frame) - Acting at 60 Hz (every 0.017 seconds, i.e. every frame) changed performance by more than 5% in only 6 games - Humans played under controlled conditions - Out of the 49 games: 29 at human level or above, 20 below http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f2.jpg
Results figure: 29 out of 49 games at or above human level; 20 out of 49 below http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f3.jpg
Questions and Discussion - What do you think are some non-gaming applications of deep reinforcement learning? - Is comparing against a "professional human game tester" a sufficient evaluation? Is there a better way? - Should we even pursue a general AI, or are we better off with domain-specific AIs? - Are there other consequences besides a computer beating your high score? (Have we doomed society?)