Variational Deep Q-Networks in Edward
Harri Bell-Thomas
R244: Open Source Project Presentation, 19/11/2019
Q-Learning

Q-Learning is model-free reinforcement learning. Q is the action-value function defining the reward used for reinforcement; it is this function that is learned. Conceptually,

Q^{\pi}(s, a) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\; a_0 = a \right]
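Not part of the deck, but as a concrete illustration of the same idea, here is a minimal tabular Q-learning sketch. `env` is assumed to be a Gym-style environment with discrete state and action spaces; the learning rate, discount, and exploration parameters are illustrative.

```python
import numpy as np

# Minimal tabular Q-learning sketch (illustrative only; `env` is an assumed
# Gym-style environment with discrete observation and action spaces).
def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action from the current estimate of Q.
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # One-step temporal-difference update towards r + gamma * max_a' Q(s', a').
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
            s = s2
    return Q
```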
Q-Learning: Bellman Error

The Bellman equation gives the value of Q^{\pi} at a certain point in time, t, in terms of the payoff from an initial choice, a_t, and the value of the remaining decision problem that results after that choice. The Bellman error measures how far a candidate Q^{\pi} is from satisfying it:

J(\pi) = \mathbb{E}_{s \sim \rho,\; a_t \sim \pi(\cdot \mid s_t)} \left[ \Big( Q^{\pi}(s_t, a_t) - \max_{a} \mathbb{E}\big[ r_t + \gamma Q^{\pi}(s_{t+1}, a) \big] \Big)^{2} \right]
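As a reading aid (my reading of the slide's own formula, not text from the deck): the squared term above vanishes, so J(\pi) = 0, exactly at the Bellman fixed point.

```latex
% Fixed point of the Bellman error above: J(\pi) = 0 precisely when
Q^{\pi}(s_t, a_t) = \max_{a} \, \mathbb{E}\left[ r_t + \gamma \, Q^{\pi}(s_{t+1}, a) \right]
```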
Deep Q-Networks, Briefly

Approximate the action-value function Q^{\pi}(s, a) with a neural network Q_{\theta}(s, a). The (greedy) policy represented by this is \pi_{\theta}. Discretise the expectation using K sample trajectories, each with period T, and use this to approximate J(\theta):

\tilde{J}(\theta) = \frac{1}{KT} \sum_{i=1}^{K} \sum_{t=1}^{T} \Big( Q_{\theta}(s_t^{(i)}, a_t^{(i)}) - \max_{a} \big( r_t^{(i)} + \gamma Q_{\theta}(s_{t+1}^{(i)}, a) \big) \Big)^{2}
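A minimal sketch (not from the slides) of how the discretised objective \tilde{J}(\theta) could be evaluated. `q_net(theta, s)` is an assumed helper returning the vector of Q-values over actions, and `trajectories` is assumed to be K lists of (s, a, r, s') transitions of length T.

```python
import numpy as np

# Sketch of the discretised objective J~(theta) over K sampled trajectories
# of length T. `q_net(theta, s)` is an assumed function returning the array
# of Q_theta(s, a) values for every action a.
def bellman_objective(theta, trajectories, q_net, gamma=0.99):
    total, count = 0.0, 0
    for traj in trajectories:                      # K trajectories
        for (s, a, r, s_next) in traj:             # T transitions each
            # max_a (r + gamma * Q_theta(s', a)); r does not depend on a.
            target = r + gamma * np.max(q_net(theta, s_next))
            td = q_net(theta, s)[a] - target       # Bellman residual for (s_t, a_t)
            total += td ** 2
            count += 1
    return total / count                           # 1/(KT) * sum of squared residuals
```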
Variational Inference

Main Concepts:
1. Try to solve an optimisation problem over a class of tractable distributions, q_{\phi}, parameterised by \phi, in order to find the member most similar to the posterior p(\theta \mid \mathcal{D}).
2. \phi^{\ast} = \arg\min_{\phi} \mathrm{KL}\big( q_{\phi}(\theta) \,\|\, p(\theta \mid \mathcal{D}) \big)
3. Approximate this using gradient descent (a sketch follows below).
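A hedged sketch of step 3 under an assumed mean-field Gaussian q_{\phi}: minimising the KL divergence is equivalent to maximising the ELBO (the evidence log p(D) does not depend on \phi), and the ELBO can be estimated by Monte Carlo and optimised by gradient descent. `log_joint` is an assumed stand-in for the model's log p(D, \theta).

```python
import numpy as np

# Monte-Carlo estimate of the (negative) ELBO for a mean-field Gaussian q_phi
# with phi = (mu, log_sigma). Minimising KL(q_phi || p(theta | D)) is equivalent
# to maximising the ELBO. `log_joint(theta)` is an assumed function returning
# log p(D, theta) for a single parameter sample.
def neg_elbo(phi, log_joint, n_samples=32):
    mu, log_sigma = phi
    sigma = np.exp(log_sigma)
    eps = np.random.randn(n_samples, mu.shape[0])
    theta = mu + sigma * eps                       # reparameterised samples theta ~ q_phi
    # log q_phi(theta) for a diagonal Gaussian, per sample.
    log_q = -0.5 * np.sum(((theta - mu) / sigma) ** 2 + 2 * log_sigma + np.log(2 * np.pi), axis=1)
    elbo = np.mean([log_joint(t) for t in theta]) - np.mean(log_q)
    return -elbo                                   # minimised by gradient descent on phi
```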
Variational Deep Q-Networks

Idea: for efficient exploration we need q_{\phi}(\theta) to be dispersed, giving near-even coverage of the parameter space. Encourage this by adding an entropy bonus to the objective:

\mathbb{E}_{\theta \sim q_{\phi}(\theta)} \Big[ \big( Q_{\theta}(s_j, a_j) - \max_{a'} \mathbb{E}\big[ r_j + \gamma Q_{\theta}(s'_j, a') \big] \big)^{2} \Big] - \lambda H\big( q_{\phi}(\theta) \big)

Assigning systematic randomness to Q enables efficient exploration of the policy space. Further, encouraging high entropy over the parameter distribution prevents premature convergence.

tl;dr Higher chance of finding maximal rewards, and faster than standard DQNs.
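A sketch (not from the deck) of how the entropy-regularised objective above might be estimated, assuming a diagonal Gaussian q_{\phi} so that its entropy is analytic, and reusing `bellman_objective` from the DQN sketch earlier; \lambda and the sample count are illustrative.

```python
import numpy as np

# Sketch of the VDQN objective from the slide: expected squared Bellman error
# under theta ~ q_phi, minus an entropy bonus lambda * H(q_phi). Assumes
# phi = (mu, log_sigma) parameterises a diagonal Gaussian q_phi, and reuses
# bellman_objective(theta, trajectories, q_net, gamma) from the earlier sketch.
def vdqn_objective(phi, trajectories, q_net, lam=0.01, gamma=0.99, n_samples=8):
    mu, log_sigma = phi
    sigma = np.exp(log_sigma)
    errors = []
    for _ in range(n_samples):
        theta = mu + sigma * np.random.randn(*mu.shape)       # theta ~ q_phi(theta)
        errors.append(bellman_objective(theta, trajectories, q_net, gamma))
    # Entropy of a diagonal Gaussian: 0.5 * sum_i log(2 * pi * e * sigma_i^2).
    entropy = 0.5 * np.sum(2 * log_sigma + np.log(2 * np.pi * np.e))
    return np.mean(errors) - lam * entropy
```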
Algorithm

[Figure: VDQN pseudocode.]
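The pseudocode figure did not survive extraction. As a placeholder, here is a hedged sketch of the training loop suggested by the objective above, not the deck's actual pseudocode; `sample_trajectories` and `grad` (e.g. an autodiff library's gradient function) are assumed helpers.

```python
# Hedged sketch of a VDQN training loop implied by the objective above.
# `sample_trajectories`, `grad`, and `q_net` are assumed helpers; `phi`
# parameterises the variational distribution q_phi.
def train_vdqn(env, phi, q_net, iterations=1000, K=4, T=200, lr=1e-3, lam=0.01):
    for _ in range(iterations):
        # Sample theta ~ q_phi and act greedily w.r.t. Q_theta to collect K trajectories of length T.
        trajectories = sample_trajectories(env, phi, q_net, K, T)
        # Stochastic gradient step on the entropy-regularised Bellman objective.
        g = grad(lambda p: vdqn_objective(p, trajectories, q_net, lam))(phi)
        phi = tuple(p - lr * gp for p, gp in zip(phi, g))
    return phi
```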
Aim / Goals

Workplan
Questions?