Nonlinear Distributional Gradient Temporal Difference Learning
Chao Qu (Ant Financial Services Group), Shie Mannor (Faculty of Electrical Engineering, Technion), Huan Xu (H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech)
Distributional reinforcement learning has gained much attention recently [Bellemare et al., 2017]. It explicitly considers the stochastic nature of the long-term return $Z(s, a)$. The recursion of $Z(s, a)$ is described by the distributional Bellman equation
$$Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(s', a'),$$
where $\overset{D}{=}$ stands for "equal in distribution".
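As a side illustration of what "equal in distribution" means operationally (a minimal sketch of our own, not from the poster; all names are hypothetical), the right-hand side can be realized by shifting and scaling samples of $Z(s', a')$:

```python
import numpy as np

# A sample-based view of the distributional Bellman operator: if Z(s', a') is
# represented by return samples, R(s, a) + gamma * Z(s', a') is obtained by
# shifting and scaling those samples. All names here are illustrative.
def bellman_target_samples(reward, gamma, next_return_samples):
    """Samples of R(s, a) + gamma * Z(s', a')."""
    return reward + gamma * np.asarray(next_return_samples)

# Example: hypothetical samples of Z(s', a') and one observed reward.
z_next = np.random.normal(loc=5.0, scale=1.0, size=1000)
tz_samples = bellman_target_samples(reward=1.0, gamma=0.99, next_return_samples=z_next)
print(tz_samples.mean())  # roughly 1.0 + 0.99 * 5.0
```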
Distributional gradient temporal difference learning
We consider a distributional counterpart of Gradient Temporal Difference Learning [Sutton et al., 2008]. Properties:
• Convergence in the off-policy setting.
• Convergence with nonlinear function approximation.
• Includes the distributional nature of the long-term reward.
To measure the distance between the distributions $Z(s, a)$ and $\mathcal{T} Z(s, a)$, we need to introduce the Cramér distance. Suppose there are two distributions $P$ and $Q$ with cumulative distribution functions $F_P$ and $F_Q$ respectively; then the square root of the Cramér distance between $P$ and $Q$ is
$$\ell_2(P, Q) := \left( \int_{-\infty}^{\infty} \big(F_P(x) - F_Q(x)\big)^2 \, dx \right)^{1/2}.$$
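A minimal numerical sketch (our own, assuming discrete distributions on a shared grid of atoms, as used later): the Cramér distance then reduces to a Riemann sum of squared CDF differences.

```python
import numpy as np

def cramer_distance(p, q, atoms):
    """l_2(P, Q) for discrete distributions p, q supported on the same
    sorted grid of atoms (left Riemann approximation of the integral)."""
    F_p = np.cumsum(p)                      # CDF of P on the grid
    F_q = np.cumsum(q)                      # CDF of Q on the grid
    dz = np.diff(atoms, append=atoms[-1])   # atom spacing; last term is 0,
                                            # where both CDFs already equal 1
    return np.sqrt(np.sum((F_p - F_q) ** 2 * dz))

atoms = np.linspace(-10.0, 10.0, 51)        # e.g. [V_min, V_max] with m = 51 atoms
p = np.full(51, 1.0 / 51)                   # uniform distribution
q = np.zeros(51); q[25] = 1.0               # point mass at z = 0
print(cramer_distance(p, q, atoms))
```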
Denote the (cumulative) distribution function of $Z(s)$ as $F_\theta(s, z)$ and the distribution function of $\mathcal{T} Z(s)$ as $G_\theta(s, z)$. D-MSPBE:
$$\underset{\theta}{\text{minimize}} \quad J(\theta) := \left\| \Phi_\theta^T D (F_\theta - G_\theta) \right\|^2_{(\Phi_\theta^T D \Phi_\theta)^{-1}}.$$
• The value distribution $F_\theta(s, z)$ is discrete within the range $[V_{\min}, V_{\max}]$ with $m$ atoms.
• $\phi_\theta(s, z) = \frac{\partial F_\theta(s, z)}{\partial \theta}$ and $(\Phi_\theta)_{((i,j), l)} = \frac{\partial}{\partial \theta_l} F_\theta(s_i, z_j)$.
• Project onto the space spanned by $\Phi_\theta$ w.r.t. the Cramér distance and then obtain D-MSPBE.
• SGD and the weight-duplication trick to optimize it (see the numeric sketch below).
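To make the objective concrete, here is a small numeric sketch of our own for evaluating D-MSPBE once $\Phi_\theta$, the state weights $D$, and the two distribution functions are stacked into arrays; the shapes and names are assumptions, not the paper's code.

```python
import numpy as np

def d_mspbe(Phi, d_weights, F, G):
    """D-MSPBE = || Phi^T D (F - G) ||^2 in the (Phi^T D Phi)^{-1} norm.
    Phi has one row per (state, atom) pair (i, j); d_weights repeats each
    state's weight d(s_i) across its m atoms; F, G stack F_theta and G_theta."""
    D = np.diag(d_weights)
    g = Phi.T @ D @ (F - G)          # Phi_theta^T D (F_theta - G_theta)
    M = Phi.T @ D @ Phi              # Phi_theta^T D Phi_theta (assumed nonsingular)
    return g @ np.linalg.solve(M, g)

rng = np.random.default_rng(0)
n_s, m, d = 4, 5, 3                               # states, atoms, parameters
Phi = rng.normal(size=(n_s * m, d))               # rows: dF_theta(s_i, z_j)/dtheta
dw = np.repeat(np.full(n_s, 1.0 / n_s), m)        # state weights repeated over atoms
F = rng.uniform(size=n_s * m)                     # stand-ins for F_theta and G_theta
G = rng.uniform(size=n_s * m)
print(d_mspbe(Phi, dw, F, G))
```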
Distributional GTD2
Input: step size $\alpha_t$, step size $\beta_t$, policy $\pi$.
for $t = 0, 1, \ldots$ do
$$w_{t+1} = w_t + \beta_t \sum_{j=1}^{m} \left( -\phi_{\theta_t}^T(s_t, z_j)\, w_t + \delta_{\theta_t} \right) \phi_{\theta_t}(s_t, z_j)$$
$$\theta_{t+1} = \Gamma\Big[ \theta_t + \alpha_t \Big\{ \sum_{j=1}^{m} \Big( \phi_{\theta_t}(s_t, z_j) - \phi_{\theta_t}\big(s_{t+1}, \tfrac{z_j - r_t}{\gamma}\big) \Big) \phi_{\theta_t}^T(s_t, z_j)\, w_t - h_t \Big\} \Big]$$
$\Gamma: \mathbb{R}^d \to \mathbb{R}^d$ is a projection onto a compact set $C$ with a smooth boundary.
$h_t = \sum_{j=1}^{m} \big( \delta_{\theta_t} - w_t^T \phi_{\theta_t}(s_t, z_j) \big) \nabla^2 F_{\theta_t}(s_t, z_j)\, w_t$, where $\delta_{\theta_t} = F_{\theta_t}\big(s_{t+1}, \tfrac{z_j - r_t}{\gamma}\big) - F_{\theta_t}(s_t, z_j)$.
end for
Some remarks:
• Use the temporal distribution difference $\delta_{\theta_t}$ instead of the temporal difference in GTD2.
• Summation over $z_j$, which corresponds to the integral in the Cramér distance.
• $h_t$ results from the nonlinear function approximation and is zero in the linear case. It can be evaluated using forward and backward propagation. (A sketch of one update step follows below.)
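Below is a minimal sketch of one Distributional GTD2 update under linear function approximation, in which case $h_t = 0$ as noted above; the feature arrays, the precomputed TD errors, and the omission of the projection $\Gamma$ are our own simplifying assumptions.

```python
import numpy as np

def dgtd2_step(theta, w, phi_t, phi_tp1, delta, alpha, beta):
    """One Distributional GTD2 update with linear features (so h_t = 0).
    phi_t[j]   : phi(s_t, z_j)
    phi_tp1[j] : phi(s_{t+1}, (z_j - r_t) / gamma)
    delta[j]   : F(s_{t+1}, (z_j - r_t) / gamma) - F(s_t, z_j)
    The projection Gamma is omitted in this sketch."""
    # Auxiliary ("weight duplication") update for w.
    w_new = w + beta * np.sum((delta - phi_t @ w)[:, None] * phi_t, axis=0)
    # Main parameter update, summed over the m atoms (uses the old w, as in GTD2).
    grad = np.sum((phi_t - phi_tp1) * (phi_t @ w)[:, None], axis=0)
    theta_new = theta + alpha * grad
    return theta_new, w_new

# Illustrative shapes: m = 5 atoms, d = 3 parameters.
m, d = 5, 3
rng = np.random.default_rng(1)
theta, w = np.zeros(d), np.zeros(d)
theta, w = dgtd2_step(theta, w, rng.normal(size=(m, d)), rng.normal(size=(m, d)),
                      rng.normal(size=m), alpha=0.01, beta=0.1)
```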
Theoretical Result
Theorem. Let $(s_t, r_t, s'_t)_{t \geq 0}$ be a sequence of transitions. The positive step sizes in the algorithm satisfy $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \beta_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, $\sum_{t=0}^{\infty} \beta_t^2 < \infty$, and $\alpha_t / \beta_t \to 0$ as $t \to \infty$. Assume that for any $\theta \in C$ and $s \in S$ such that $d(s) > 0$, $F_\theta$ is three times continuously differentiable. Further assume that for each $\theta \in C$, $\mathbb{E}\big[ \sum_{j=1}^{m} \phi_\theta(s, z_j) \phi_\theta^T(s, z_j) \big]$ is nonsingular. Then the algorithm converges with probability one as $t \to \infty$.
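As a concrete check (a worked example of our own, not from the slides), the schedules $\alpha_t = (t+1)^{-1}$ and $\beta_t = (t+1)^{-2/3}$ satisfy all of the conditions: $\sum_t \alpha_t = \sum_t \beta_t = \infty$, $\sum_t \alpha_t^2 < \infty$, $\sum_t \beta_t^2 = \sum_t (t+1)^{-4/3} < \infty$, and $\alpha_t / \beta_t = (t+1)^{-1/3} \to 0$, so $\theta$ evolves on the slower timescale than $w$.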
Distributional Greedy-GQ
Input: step size $\alpha_t$, step size $\beta_t$, $0 \leq \eta \leq 1$.
for $t = 0, 1, \ldots$ do
$Q(s_{t+1}, a) = \sum_{j=1}^{m} z_j \, p_j(s_{t+1}, a)$, where $p_j(s_{t+1}, a)$ is the density function w.r.t. $F_{\theta}((s_{t+1}, a))$.
$a^* = \arg\max_a Q(s_{t+1}, a)$.
$$w_{t+1} = w_t + \beta_t \sum_{j=1}^{m} \left( -\phi_{\theta_t}^T((s_t, a_t), z_j)\, w_t + \delta_{\theta_t} \right) \phi_{\theta_t}((s_t, a_t), z_j).$$
$$\theta_{t+1} = \theta_t + \alpha_t \Big\{ \sum_{j=1}^{m} \Big[ \delta_{\theta_t} \phi_{\theta_t}((s_t, a_t), z_j) - \eta \, \phi_{\theta_t}\big((s_{t+1}, a^*), \tfrac{z_j - r_t}{\gamma}\big) \big( \phi_{\theta_t}^T((s_t, a_t), z_j)\, w_t \big) \Big] \Big\},$$
where $\delta_{\theta_t} = F_{\theta_t}\big((s_{t+1}, a^*), \tfrac{z_j - r_t}{\gamma}\big) - F_{\theta_t}((s_t, a_t), z_j)$.
end for
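As an illustration of only the greedy step (a sketch of our own; the array names are assumptions), computing $Q(s_{t+1}, a) = \sum_j z_j \, p_j(s_{t+1}, a)$ from the categorical value distribution and taking the argmax:

```python
import numpy as np

def greedy_action(probs, atoms):
    """probs[a, j]: mass p_j(s_{t+1}, a) on atom z_j; returns the greedy a*
    and Q(s_{t+1}, a*) = sum_j z_j p_j(s_{t+1}, a*)."""
    q_values = probs @ atoms            # expected return per action
    a_star = int(np.argmax(q_values))
    return a_star, q_values[a_star]

atoms = np.linspace(-10.0, 10.0, 51)                 # m = 51 atoms in [V_min, V_max]
probs = np.random.dirichlet(np.ones(51), size=4)     # 4 actions, illustrative masses
print(greedy_action(probs, atoms))
```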
Experimental Result
[Figures] D-MSPBE vs. time step for Distributional GTD2 and Distributional TDC; cumulative reward and kill counts vs. episodes for C51, DQN, and Distributional Greedy-GQ.
Thank you! Visit our poster today at Pacific Ballroom #33.