From Importance Sampling to Doubly Robust Policy Gradient
Jiawei Huang (UIUC), Nan Jiang (UIUC)
Basic Idea

Policy Gradient Estimators ↔ Off-Policy Evaluation Estimators

$$\nabla_\theta J(\pi_\theta) = \lim_{\Delta\theta \to 0} \frac{J(\pi_{\theta+\Delta\theta}) - J(\pi_\theta)}{\Delta\theta}$$

Plugging an off-policy evaluation (OPE) estimate of $J(\pi_{\theta+\Delta\theta})$ (with $\pi_\theta$ as the behavior policy) into this finite difference and taking the limit yields a policy gradient estimator. Each OPE estimator on the following slides therefore corresponds to a PG estimator.
Basic Idea

Off-Policy Evaluation Estimators → Policy Gradient Estimators:

• Trajectory-wise IS, $\rho_{[0:T]} \sum_{t=0}^{T} \gamma^t r_t$ → REINFORCE, $\sum_{t=0}^{T} \nabla_\theta \log \pi^t_\theta \sum_{t'=0}^{T} \gamma^{t'} r_{t'}$
• Step-wise IS, $\sum_{t=0}^{T} \gamma^t \rho_{[0:t]} r_t$ → Standard PG (Tang and Abbeel, 2010), $\sum_{t=0}^{T} \nabla_\theta \log \pi^t_\theta \sum_{t'=t}^{T} \gamma^{t'} r_{t'}$

Here $\rho_{[0:t]} = \prod_{t'=0}^{t} \pi^{t'}_{\theta+\Delta\theta} / \pi^{t'}_\theta$ is the cumulative importance weight.
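As a concrete (hypothetical) illustration of the step-wise pair, the sketch below computes both estimators from a single logged trajectory; the array names `rewards`, `logp_b` ($\log \pi^t_\theta$), `logp_e` ($\log \pi^t_{\theta+\Delta\theta}$), and `grad_logp` ($\nabla_\theta \log \pi^t_\theta$) are assumptions of this sketch, not notation from the paper.

```python
import numpy as np

def stepwise_is_ope(rewards, logp_b, logp_e, gamma):
    """Step-wise IS OPE: sum_t gamma^t * rho_[0:t] * r_t, with rho_[0:t] the cumulative ratio."""
    T1 = len(rewards)
    rho = np.exp(np.cumsum(logp_e - logp_b))      # rho_[0:t] for t = 0..T
    disc = gamma ** np.arange(T1)
    return float(np.sum(disc * rho * rewards))

def standard_pg(rewards, grad_logp, gamma):
    """Standard PG: sum_t grad log pi_theta^t * sum_{t'>=t} gamma^{t'} r_{t'}."""
    T1 = len(rewards)
    disc_r = (gamma ** np.arange(T1)) * rewards
    reward_to_go = np.cumsum(disc_r[::-1])[::-1]  # sum_{t'=t}^{T} gamma^{t'} r_{t'}
    return np.sum(grad_logp * reward_to_go[:, None], axis=0)
```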
Basic Idea

• OPE with state baselines, $b_0 + \sum_{t=0}^{T} \gamma^t \rho_{[0:t]} \big( r_t + \gamma b_{t+1} - b_t \big)$ → PG with state baselines, $\sum_{t=0}^{T} \nabla_\theta \log \pi^t_\theta \Big( \sum_{t'=t}^{T} \gamma^{t'} r_{t'} - \gamma^t b_t \Big)$
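A companion sketch for the baseline pair above, under the same assumed per-trajectory NumPy arrays plus a state-baseline array `baselines` ($b_t = b(s_t)$) and cumulative importance weights `rho`; the convention $b_{T+1} = 0$ at the terminal step is an assumption of the sketch.

```python
import numpy as np

def ope_with_state_baselines(rewards, rho, baselines, gamma):
    """b_0 + sum_t gamma^t rho_[0:t] (r_t + gamma b_{t+1} - b_t), taking b_{T+1} = 0."""
    T1 = len(rewards)
    disc = gamma ** np.arange(T1)
    b_next = np.append(baselines[1:], 0.0)
    return float(baselines[0] + np.sum(disc * rho * (rewards + gamma * b_next - baselines)))

def pg_with_state_baselines(rewards, grad_logp, baselines, gamma):
    """sum_t grad log pi_theta^t * (sum_{t'>=t} gamma^{t'} r_{t'} - gamma^t b_t)."""
    T1 = len(rewards)
    disc = gamma ** np.arange(T1)
    reward_to_go = np.cumsum((disc * rewards)[::-1])[::-1]
    return np.sum(grad_logp * (reward_to_go - disc * baselines)[:, None], axis=0)
```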
Basic Idea

• Doubly Robust OPE (with $\pi'$ the target policy), $\hat{V}^{\pi'}_0 + \sum_{t=0}^{T} \gamma^t \rho_{[0:t]} \big( r_t + \gamma \hat{V}^{\pi'}_{t+1} - \hat{Q}^{\pi'}_t \big)$ →
  Trajectory-wise CV (Cheng et al., 2019): $\sum_{t=0}^{T} \Big\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t'=t}^{T} \gamma^{t'} r_{t'} + \sum_{t'=t+1}^{T} \gamma^{t'} \big( \hat{V}^{\pi_\theta}_{t'} - \hat{Q}^{\pi_\theta}_{t'} \big) \Big] + \gamma^t \big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \big) \Big\}$
  DR-PG (Ours): $\sum_{t=0}^{T} \Big\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t'=t}^{T} \gamma^{t'} r_{t'} + \sum_{t'=t+1}^{T} \gamma^{t'} \big( \hat{V}^{\pi_\theta}_{t'} - \hat{Q}^{\pi_\theta}_{t'} \big) \Big] + \gamma^t \big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \big) \Big\}$
Preliminaries

MDP setting:
• Episodic RL with discount factor $\gamma$ and maximum episode length $T$;
• Fixed initial state distribution;
• A trajectory is defined as $s_0, a_0, r_0, s_1, \ldots, s_T, a_T, r_T$.

Frequently used notations:
• $\pi_\theta$: policy parameterized by $\theta$.
• $J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\big]$: expected discounted return of $\pi_\theta$.
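For concreteness, a minimal sketch of the Monte Carlo estimate of $J(\pi_\theta)$ implied by this definition; `reward_trajectories` is a hypothetical list of per-step reward sequences collected by rolling out $\pi_\theta$.

```python
def discounted_return(rewards, gamma):
    """sum_{t=0}^{T} gamma^t r_t for one trajectory's reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_J(reward_trajectories, gamma):
    """Average discounted return over trajectories rolled out with pi_theta."""
    return sum(discounted_return(rs, gamma) for rs in reward_trajectories) / len(reward_trajectories)
```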
A Concrete and Simple Example: From Step-wise IS OPE to Standard PG

Take $\pi_\theta$ as the behavior policy and $\pi_{\theta+\Delta\theta}$ as the target policy, and write $r_t = r(s_t, a_t)$ and $\pi^t_\theta = \pi_\theta(a_t \mid s_t)$. The step-wise IS estimate of $J(\pi_{\theta+\Delta\theta})$ is

$$
\hat{J}(\pi_{\theta+\Delta\theta}) = \sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^{t'}_{\theta+\Delta\theta}}{\pi^{t'}_{\theta}}
= \sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \Big( 1 + \frac{\nabla_\theta \pi^{t'}_{\theta}}{\pi^{t'}_{\theta}} \Delta\theta + o(\Delta\theta) \Big)
= \hat{J}(\pi_\theta) + \Big( \sum_{t=0}^{T} \gamma^t r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi^{t'}_{\theta} \Big) \Delta\theta + o(\Delta\theta).
$$

Then

$$
\lim_{\Delta\theta \to 0} \frac{\hat{J}(\pi_{\theta+\Delta\theta}) - \hat{J}(\pi_\theta)}{\Delta\theta}
= \sum_{t=0}^{T} \gamma^t r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi^{t'}_{\theta}
= \sum_{t=0}^{T} \nabla_\theta \log \pi^t_\theta \sum_{t'=t}^{T} \gamma^{t'} r_{t'},
$$

which (after exchanging the order of summation) is the standard PG estimator.
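The derivation can be sanity-checked numerically. The toy sketch below (not from the paper) uses a one-parameter sigmoid policy over two actions and verifies that the finite difference of the step-wise IS estimate as $\Delta\theta \to 0$ matches the standard PG estimate on the same trajectory; states are omitted since only $\pi_\theta(a_t)$ enters the computation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pi(theta, a):                     # pi_theta(a) for a in {0, 1}
    p1 = sigmoid(theta)
    return p1 if a == 1 else 1.0 - p1

def stepwise_is(theta_b, theta_e, actions, rewards, gamma):
    """Step-wise IS estimate of J(pi_{theta_e}) from a trajectory logged under pi_{theta_b}."""
    rho, total = 1.0, 0.0
    for t, (a, r) in enumerate(zip(actions, rewards)):
        rho *= pi(theta_e, a) / pi(theta_b, a)
        total += gamma**t * rho * r
    return total

theta, gamma, eps = 0.3, 0.9, 1e-5
actions = [1, 0, 1, 1]
rewards = [1.0, 0.5, -0.2, 2.0]

# Standard PG on this trajectory: sum_t (d/dtheta log pi(a_t)) * sum_{t'>=t} gamma^{t'} r_{t'}
scores = np.array([a - sigmoid(theta) for a in actions])   # d/dtheta log pi_theta(a)
disc_r = np.array([gamma**t * r for t, r in enumerate(rewards)])
reward_to_go = np.cumsum(disc_r[::-1])[::-1]
pg = float(np.dot(scores, reward_to_go))

# Finite difference of the step-wise IS estimate around Delta theta = 0
fd = (stepwise_is(theta, theta + eps, actions, rewards, gamma)
      - stepwise_is(theta, theta, actions, rewards, gamma)) / eps

print(pg, fd)   # the two numbers agree up to O(eps)
```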
Doubly-Robust Policy Gradient (DR-PG)

Definition: doubly-robust OPE estimator (unbiased) (Jiang and Li, 2016):

$$
\hat{J}(\pi_{\theta+\Delta\theta}) = \hat{V}^{\pi_{\theta+\Delta\theta}}_0 + \sum_{t=0}^{T} \gamma^t \prod_{t'=0}^{t} \frac{\pi^{t'}_{\theta+\Delta\theta}}{\pi^{t'}_{\theta}} \Big( r_t + \gamma \hat{V}^{\pi_{\theta+\Delta\theta}}_{t+1} - \hat{Q}^{\pi_{\theta+\Delta\theta}}_t \Big),
\quad \text{where } \hat{V}^{\pi_{\theta+\Delta\theta}} = \mathbb{E}_{a \sim \pi_{\theta+\Delta\theta}}\big[\hat{Q}^{\pi_{\theta+\Delta\theta}}\big].
$$

Theorem: given the DR-OPE estimator above, we can derive two unbiased policy gradient estimators.

• If $\hat{Q}^{\pi_{\theta+\Delta\theta}} = \hat{Q}^{\pi_\theta}$ for arbitrary $\Delta\theta$ [Traj-CV, (Cheng, Yan, and Boots, 2019)]:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \Big] + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$

• Else [DR-PG (Ours)]:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \Big] + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$

Remark 1: The definitions of $\nabla_\theta \hat{V}$ are different. In Traj-CV, $\nabla_\theta \hat{V} = \mathbb{E}_{\pi_\theta}\big[\hat{Q}^{\pi_\theta} \nabla_\theta \log \pi_\theta\big]$, while in DR-PG, $\nabla_\theta \hat{V} = \mathbb{E}_{\pi_\theta}\big[\hat{Q}^{\pi_\theta} \nabla_\theta \log \pi_\theta + \nabla_\theta \hat{Q}^{\pi_\theta}\big]$.

Remark 2: $\nabla_\theta \hat{Q}^{\pi_\theta}$ is not necessarily an exact gradient but just an approximation of $\nabla_\theta Q^{\pi_\theta}$.
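A minimal sketch of the DR-PG estimator on a single trajectory, assuming all per-step quantities are precomputed NumPy arrays; the interface and names below are assumptions of the sketch, not the authors' implementation.

```python
# Assumed per-step arrays for one trajectory of length T+1:
#   rewards[t]   : r_t
#   grad_logp[t] : nabla_theta log pi_theta(a_t|s_t), shape (T+1, d)
#   q_hat[t]     : \hat{Q}^{pi_theta}(s_t, a_t)
#   v_hat[t]     : \hat{V}^{pi_theta}(s_t) = E_{a ~ pi_theta}[\hat{Q}(s_t, a)]
#   grad_q[t]    : (approximate) nabla_theta \hat{Q}^{pi_theta}(s_t, a_t), shape (T+1, d)
#   grad_v[t]    : nabla_theta \hat{V}^{pi_theta}(s_t) in the DR-PG sense of Remark 1, shape (T+1, d)
import numpy as np

def dr_pg(rewards, grad_logp, q_hat, v_hat, grad_q, grad_v, gamma):
    T1 = len(rewards)                              # T + 1 steps, t = 0..T
    disc = gamma ** np.arange(T1)
    # sum_{t1 >= t} gamma^{t1} r_{t1}
    rtg = np.cumsum((disc * rewards)[::-1])[::-1]
    # sum_{t2 >= t+1} gamma^{t2} (V_hat_{t2} - Q_hat_{t2})
    vq = disc * (v_hat - q_hat)
    suffix = np.cumsum(vq[::-1])[::-1]             # sum_{t2 >= t}
    tail = np.concatenate([suffix[1:], [0.0]])     # shift to t2 >= t+1
    grad = np.zeros(grad_logp.shape[1])
    for t in range(T1):
        grad += grad_logp[t] * (rtg[t] + tail[t])
        grad += disc[t] * (grad_v[t] - grad_q[t] - q_hat[t] * grad_logp[t])
    return grad
```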
Special Cases of DR-PG

DR-PG:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \Big] + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$

↓ Use $\hat{Q}^{\pi'}$ invariant to $\pi'$

Traj-CV:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \Big] + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$

↓ $\mathbb{E}\big[ \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \,\big|\, s_{t+1} \big] = 0$, so this term can be dropped

PG with state-action baselines:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$
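As a usage note on the `dr_pg` sketch above (assumed to be in scope), the special cases on this slide correspond to zeroing out pieces of its inputs: Traj-CV passes $\nabla_\theta \hat{Q} = 0$ together with the Traj-CV definition of $\nabla_\theta \hat{V}$ (Remark 1), and the state-action-baseline form additionally zeroes the $\hat{V} - \hat{Q}$ correction tail by passing `v_hat = q_hat`. All arrays below are random placeholders, purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T1, d, gamma = 5, 3, 0.9
rewards   = rng.normal(size=T1)
grad_logp = rng.normal(size=(T1, d))
q_hat     = rng.normal(size=T1)
v_hat     = rng.normal(size=T1)
grad_q    = rng.normal(size=(T1, d))
grad_v    = rng.normal(size=(T1, d))       # DR-PG definition of grad V_hat (Remark 1)
grad_v_cv = rng.normal(size=(T1, d))       # Traj-CV definition: E_{a~pi}[Q_hat * grad log pi]

g_drpg = dr_pg(rewards, grad_logp, q_hat, v_hat, grad_q, grad_v, gamma)
g_trajcv = dr_pg(rewards, grad_logp, q_hat, v_hat,
                 np.zeros_like(grad_q), grad_v_cv, gamma)      # drop the grad Q_hat term
g_sa_baseline = dr_pg(rewards, grad_logp, q_hat, q_hat,        # v_hat = q_hat kills the tail
                      np.zeros_like(grad_q), grad_v_cv, gamma)
```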