Deep Reinforcement Learning
Axel Perschmann
Supervisor: Ahmed Abdulkadir
Seminar: Current Works in Computer Vision
Research Group: Pattern Recognition and Image Processing
Albert-Ludwigs-Universität Freiburg
7 July 2016
Contents
1 Reinforcement Learning: Basics; Learning Methods
2 Asynchronous Reinforcement Learning: Related Work; Multi-threaded Learning
3 Experiments: Benchmark Environments; Network Architecture; Results
4 Conclusion
Motivation: Learning by experience
The agent interacts with the environment over a number of discrete timesteps t:
- chooses an action based on the current situation
- receives feedback
- updates its future choice of actions
Image: de.slideshare.net/ckmarkohchang/language-understanding-for-textbased-games-using-deep-reinforcement-learning
Motivation: Learning by experience (Environment)
Markov decision process ⇒ 5-tuple $(T, S, A, f, c)$:
- discrete decision points $t \in T = \{0, 1, \dots, N\}$
- system state $s_t \in S$
- actions $a_t \in A$
- transition function $s_{t+1} = f(s_t, a_t, w_t)$, with transition probabilities $p_{ij}(a)$
- direct costs/rewards $c: S \times A \to \mathbb{R}$
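To make the 5-tuple concrete, here is a minimal sketch of a two-state MDP in Python; the states, action effects, and reward values are invented purely for illustration and do not come from the slides.

    import random

    class TinyMDP:
        """Illustrative MDP with states S = {0, 1} and actions A = {0, 1}.

        Action 1 in state 0 moves to the terminal state 1 with probability 0.8
        and yields a reward of +1 on arrival; everything else gives reward 0.
        """

        def __init__(self):
            self.state = 0                      # initial state s_0

        def step(self, action):
            """Transition function f(s_t, a_t, w_t): returns (s_{t+1}, reward, done)."""
            if self.state == 0 and action == 1 and random.random() < 0.8:
                self.state = 1                  # stochastic transition, cf. p_ij(a)
                return self.state, 1.0, True    # direct reward c(s, a)
            return self.state, 0.0, False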
Motivation: Learning by experience (Policy)
The agent's behavior is defined by its policy $\pi$:
- $\pi_t(s_t) = a_t$
- $\pi$ can be non-stationary, $\hat{\pi} = (\pi_1, \pi_2, \dots, \pi_N)$, or stationary, $\hat{\pi} = (\pi, \pi, \dots, \pi)$
Motivation: Learning by experience (Goal)
RL goal: maximize the long-term return
- learn the optimal policy $\pi^*$
- $\pi^*$ always selects the action $a$ that maximizes the long-term return
- estimate the value of states and actions
Value functions
State value function: expected return when following $\pi$ from state $s$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}[R_t \mid s_t = s] = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big]$$
Action value function: expected return when starting from state $s$, taking action $a$, then following $\pi$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_t \mid s_t = s, a_t = a] = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big]$$
where $0 \le \gamma \le 1$ is the discount factor.
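As a quick numerical example of the discounted return (the reward sequence and $\gamma = 0.9$ below are made up for illustration):

    def discounted_return(rewards, gamma=0.9):
        """Compute sum_k gamma^k * r_{t+k+1} for one finished episode."""
        ret = 0.0
        for k, r in enumerate(rewards):
            ret += (gamma ** k) * r
        return ret

    # Rewards 0, 0, 1 over three steps -> 0.9^2 * 1 = 0.81
    print(discounted_return([0.0, 0.0, 1.0]))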
Learning to choose 'good' actions
Simply extract the optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$
Problems:
1 A tabular $Q(s, a)$ is only feasible for small environments.
  ⇒ replace $Q(s, a)$ by a function approximator $Q(s, a; \theta)$, e.g. a neural network
2 The environment is unknown: unknown transition functions, unknown rewards ⇒ no model
  ⇒ learn the optimal $Q^*(s, a; \theta)$ by updating $\theta$
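A minimal sketch of extracting the greedy policy from a learned action-value table; the dictionary-based Q-table and the action names are assumptions for illustration only.

    # Hypothetical Q-table mapping (state, action) -> estimated value.
    Q = {(0, 'left'): 0.1, (0, 'right'): 0.7,
         (1, 'left'): 0.4, (1, 'right'): 0.2}
    actions = ['left', 'right']

    def greedy_policy(state):
        """pi*(s) = argmax_a Q*(s, a)."""
        return max(actions, key=lambda a: Q[(state, a)])

    print(greedy_policy(0))   # -> 'right'
    print(greedy_policy(1))   # -> 'left'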
Learn optimal value functions
'Trial and error' approach:
- take action $a_t$ according to the $\epsilon$-greedy policy
- receive the new state $s_{t+1}$ and the reward $r_t$
- continue until a terminal state is reached
- use the collected history to update $Q(s, a; \theta)$ or $V(s; \theta)$, then restart
Source: Reinforcement Learning Lecture, ue01.pdf
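A sketch of one pass through this trial-and-error loop with an $\epsilon$-greedy action choice; the `env.reset()`/`env.step()` interface and the way the history is stored are assumptions, not part of the original slides.

    import random

    def run_episode(env, Q, actions, epsilon=0.1):
        """Collect one episode of (s, a, r, s') transitions using an epsilon-greedy policy."""
        history = []
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:        # explore
                action = random.choice(actions)
            else:                                # exploit the current estimate
                action = max(actions, key=lambda a: Q.get((state, a), 0.0))
            next_state, reward, done = env.step(action)
            history.append((state, action, reward, next_state))
            state = next_state
        return history                           # used afterwards to update Q(s, a; theta)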
Value-based model-free reinforcement learning methods
Strategies to update the parameters $\theta$ iteratively:
- one-step Q-learning (off-policy):
$$L_i(\theta_i) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)^2\Big] \quad (1)$$
- one-step SARSA (on-policy):
$$L_i(\theta_i) = \mathbb{E}\Big[\big(r + \gamma Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)^2\Big] \quad (2)$$
⇒ with one-step methods, obtaining a reward only affects the $(s, a)$ pair that led to the reward
- n-step Q-learning (off-policy), which bootstraps from the n-step target:
$$r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} Q(s_{t+n}, a; \theta_i) \quad (3)$$
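For intuition, a tabular sketch of the two one-step update rules; the learning rate `alpha` and the dictionary-based Q-table are illustrative assumptions (the slides describe the loss for a function approximator, not this tabular form).

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        """Off-policy: bootstrap from the best action available in s_next."""
        target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        """On-policy: bootstrap from the action a_next actually taken in s_next."""
        target = r + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))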
Policy-based model-free reinforcement learning method
Alternative approach:
- parameterize the policy $\pi(s; \theta)$
- update the parameters $\theta$ in the direction of the gradient
$$\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t(s_t)) \quad (4)$$
- the gradient is scaled by an estimate of the advantage of taking action $a_t$ in state $s_t$:
  - $R_t$ is an estimate of $Q^{\pi}(s_t, a_t)$
  - $b_t(s_t)$ is an estimate of $V^{\pi}(s_t)$
⇒ actor-critic architecture
- Actor: the policy $\pi$
- Critic: the baseline $b_t(s_t) \approx V^{\pi}(s_t)$
Source: Reinforcement Learning Lecture, ue07.pdf
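A compact sketch of this update for a softmax policy with one logit per (state, action) pair, so that $\nabla_\theta \log \pi$ has a closed form; the tabular parameterization, learning rates, and state/action counts are assumptions for illustration.

    import numpy as np

    n_states, n_actions = 4, 2
    theta = np.zeros((n_states, n_actions))      # actor parameters (logits)
    baseline = np.zeros(n_states)                # critic: b_t(s) ~ V(s)

    def policy(s):
        """pi(. | s; theta) as a softmax over the logits of state s."""
        logits = theta[s]
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def actor_critic_update(s, a, R, lr_actor=0.1, lr_critic=0.1):
        """theta[s] += lr * (R - b(s)) * grad log pi(a|s); then move b(s) towards R."""
        advantage = R - baseline[s]
        grad_log_pi = -policy(s)
        grad_log_pi[a] += 1.0                    # d log pi(a|s) / d theta[s, :]
        theta[s] += lr_actor * advantage * grad_log_pi
        baseline[s] += lr_critic * (R - baseline[s])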
Related Work (1): Deep Q-Networks (DQN) [Mnih et al., 2015]
- deep neural network as a non-linear function approximator
- techniques to avoid divergence:
  - experience replay: perform Q-learning updates on random samples of past experience
  - target network: keep the network used for the bootstrap targets fixed for several thousand iterations before updating its weights
⇒ makes the training data less non-stationary and stabilizes training
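A rough sketch of the two stabilization techniques; the buffer size, batch size, sync interval, and dict-like parameter containers are illustrative assumptions, not values from the DQN paper.

    import random
    from collections import deque

    replay_buffer = deque(maxlen=100_000)        # experience replay memory

    def store(transition):
        """transition = (s, a, r, s_next, done)"""
        replay_buffer.append(transition)

    def sample_batch(batch_size=32):
        """Q-learning updates are performed on random samples of past experience."""
        return random.sample(list(replay_buffer), batch_size)

    def maybe_sync_target(online_params, target_params, step, sync_every=10_000):
        """Target network: copy the online weights only every few thousand steps."""
        if step % sync_every == 0:
            target_params.update(online_params)  # assuming dict-like parameter containers
        return target_params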