Variational Deep Q-Networks in Edward


  1. Variational Deep Q-Networks in Edward Harri Bell-Thomas R244: Open Source Project Presentation 19/11/2019

  2. Q-Learning. Q-learning is model-free reinforcement learning. Q is the action-value function defining the reward signal used for reinforcement, and it is this function that is learned. Conceptually,
     Q^{\pi}(s, a) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\; a_0 = a \right]
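
     As a concrete illustration of the update rule this definition leads to, here is a minimal tabular Q-learning sketch. It is not from the slides: the Gym-style environment interface (reset/step/action_space) and the hyperparameters are assumptions for illustration.

```python
# Minimal tabular Q-learning sketch (illustrative; not from the slides).
# Assumes a Gym-style discrete environment: env.reset() -> state index,
# env.step(a) -> (next_state, reward, done, info), env.action_space.sample().
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))            # tabular estimate of Q(s, a)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy over the current estimate
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            # move Q(s, a) towards the bootstrapped return r + gamma * max_a' Q(s', a')
            target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```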

  3. Q-Learning: Bellman Error. Express the value of Q^{\pi} at a point in time t in terms of the payoff from an initial choice a_t and the value of the remaining decision problem that results after that choice. The Bellman error J(\pi) penalises deviations from this:
     J(\pi) = \mathbb{E}_{s_0 \sim \rho,\; a_t \sim \pi(\cdot \mid s_t)}\!\left[ \left( Q^{\pi}(s_t, a_t) - \max_{a} \mathbb{E}\!\left[ r_t + \gamma\, Q^{\pi}(s_{t+1}, a) \right] \right)^{2} \right]
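
     For reference, the recursion whose violation this squared error measures is the standard Bellman (optimality) equation, with \pi the greedy policy as in Q-learning; this is textbook background rather than text recovered from the slide.

```latex
% Bellman recursion for the action-value function (greedy / optimality form)
Q^{\pi}(s_t, a_t) \;=\; \mathbb{E}\!\left[\, r_t + \gamma \max_{a} Q^{\pi}(s_{t+1}, a) \,\right]
```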

  4. Deep Q-Networks, Briefly. Approximate the action-value function Q^{\pi}(s, a) with a neural network Q_{\theta}(s, a); the (greedy) policy this represents is \pi_{\theta}. Discretise the expectation using K sample trajectories, each of length T, and use this to approximate J(\theta):
     \tilde{J}(\theta) = \frac{1}{KT} \sum_{i=1}^{K} \sum_{t=1}^{T} \left( Q^{(i)}_{\theta}(s^{(i)}_t, a^{(i)}_t) - \max_{a} \left( r^{(i)}_t + \gamma\, Q^{(i)}_{\theta}(s^{(i)}_{t+1}, a) \right) \right)^{2}
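
     A rough numpy sketch of this sampled objective, under assumed data layouts (arrays indexed by trajectory i and step t) and an assumed callable q(s) returning the vector Q_{\theta}(s, \cdot); none of these names come from the slides.

```python
# Sketch of the sampled DQN objective J~(theta): the mean squared Bellman
# error over K trajectories of length T (array shapes are assumptions).
import numpy as np

def dqn_objective(q, states, actions, rewards, next_states, gamma=0.99):
    # states, next_states: [K, T, state_dim]; actions, rewards: [K, T]
    K, T = rewards.shape
    total = 0.0
    for i in range(K):
        for t in range(T):
            q_sa = q(states[i, t])[actions[i, t]]                      # Q_theta(s_t, a_t)
            target = np.max(rewards[i, t] + gamma * q(next_states[i, t]))  # max_a (r + gamma Q(s', a))
            total += (q_sa - target) ** 2
    return total / (K * T)
```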

  5. Variational Inference. Main concepts:
     1. Solve an optimisation problem over a class of tractable distributions q, parameterised by \phi, to find the member most similar to the intractable posterior p(\theta \mid D).
     2. \min_{\phi} \mathrm{KL}\left( q_{\phi}(\theta) \,\|\, p(\theta \mid D) \right)
     3. Approximate this minimisation using gradient descent.
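
     A hedged numpy sketch of this optimisation: since p(\theta \mid D) is only known up to the normaliser p(D), minimising KL(q_{\phi} \| p(\cdot \mid D)) amounts to minimising E_q[\log q_{\phi}(\theta) - \log p(D, \theta)] (the negative ELBO), estimated here by Monte Carlo for a one-dimensional Gaussian q_{\phi}. The log_joint function and the Gaussian form of q_{\phi} are illustrative assumptions.

```python
# Monte Carlo estimate of the negative ELBO for a 1-D Gaussian q_phi.
# Minimising this over phi = (mu, log_sigma), e.g. by gradient descent with
# autodiff or finite differences, minimises KL(q_phi || p(. | D)).
import numpy as np

def neg_elbo(phi, log_joint, n_samples=64, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    mu, log_sigma = phi
    sigma = np.exp(log_sigma)
    theta = mu + sigma * rng.standard_normal(n_samples)       # theta ~ q_phi
    log_q = (-0.5 * ((theta - mu) / sigma) ** 2
             - np.log(sigma) - 0.5 * np.log(2 * np.pi))       # log q_phi(theta)
    # KL(q || p(.|D)) = E_q[log q_phi(theta) - log p(D, theta)] + log p(D)
    # log_joint(theta) is assumed to evaluate log p(D, theta) elementwise.
    return np.mean(log_q - log_joint(theta))
```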

  6. Variational Deep Q-Networks. Idea: for efficient exploration we need q_{\phi}(\theta) to be dispersed, with near-even coverage of the parameter space. Encourage this by adding an entropy bonus to the objective:
     \mathbb{E}_{\theta \sim q_{\phi}(\theta)}\!\left[ \left( Q_{\theta}(s_j, a_j) - \max_{a'} \mathbb{E}\!\left[ r_j + \gamma\, Q_{\theta}(s'_j, a') \right] \right)^{2} \right] - \lambda\, H(q_{\phi}(\theta))
     Assigning systematic randomness to Q enables efficient exploration of the policy space, and encouraging high entropy over the parameter distribution prevents premature convergence. tl;dr: a higher chance of finding maximal rewards, in less time, than standard DQNs.
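
     A minimal sketch of this entropy-regularised objective; sample_theta, entropy_q and bellman_error are assumed callables standing in for the variational distribution and the inner squared Bellman error, not names from the project.

```python
# Entropy-regularised VDQN-style objective: expected squared Bellman error
# under theta ~ q_phi(theta), minus lambda * H(q_phi). All callables are
# assumed placeholders for illustration.
import numpy as np

def vdqn_objective(sample_theta, entropy_q, bellman_error, transitions,
                   n_theta=8, lam=0.01):
    # sample_theta()                     -> one draw of network weights theta ~ q_phi
    # bellman_error(theta, transitions)  -> mean squared Bellman error under theta
    # entropy_q()                        -> (estimate of) the entropy H(q_phi)
    expected_error = np.mean([bellman_error(sample_theta(), transitions)
                              for _ in range(n_theta)])
    # lower is better; the entropy bonus keeps q_phi dispersed
    return expected_error - lam * entropy_q()
```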

  7. Algorithm Figure: VDQN Pseudocode.
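
     The pseudocode figure itself did not survive the export, so here is a rough Python-style sketch of a training loop consistent with the objective on the previous slide. Everything here (the Gym-style env, the replay buffer, sample_theta, q_values, update_phi) is an assumed placeholder, not the project's actual code.

```python
# Rough VDQN-style training loop (placeholder names; not the project's code).
# sample_theta(): draw weights theta ~ q_phi; q_values(theta, s): Q_theta(s, .);
# update_phi(batch): one gradient step on the entropy-regularised objective.
import random

def train_vdqn(env, sample_theta, q_values, update_phi,
               episodes=1000, batch_size=64):
    replay = []
    for _ in range(episodes):
        theta = sample_theta()                    # Thompson-style parameter draw
        s, done = env.reset(), False
        while not done:
            a = int(q_values(theta, s).argmax())  # act greedily under the sampled theta
            s_next, r, done, _ = env.step(a)
            replay.append((s, a, r, s_next, done))
            s = s_next
        if len(replay) >= batch_size:
            update_phi(random.sample(replay, batch_size))
```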

  8. Aims / Goals and Workplan

  9. Questions?
