From Importance Sampling to Doubly Robust Policy Gradient
Jiawei Huang (UIUC), Nan Jiang (UIUC)
Basic Idea

Policy Gradient Estimators ↔ Off-Policy Evaluation Estimators

$$\nabla_\theta J(\pi_\theta) = \lim_{\Delta\theta \to 0} \frac{J(\pi_{\theta+\Delta\theta}) - J(\pi_\theta)}{\Delta\theta}$$

Plugging an off-policy evaluation (OPE) estimate of $J(\pi_{\theta+\Delta\theta})$ (with $\pi_\theta$ as the behavior policy) into this finite difference and taking the limit yields a policy gradient estimator. Each OPE estimator on the following slides therefore corresponds to a PG estimator.
Basic Idea

Off-Policy Evaluation Estimators → Policy Gradient Estimators:

• Trajectory-wise IS, $\rho_{[0:T]} \sum_{t=0}^{T} \gamma^t r_t$ → REINFORCE, $\sum_{t=0}^{T} \nabla_\theta \log \pi^t_\theta \sum_{t'=0}^{T} \gamma^{t'} r_{t'}$
• Step-wise IS, $\sum_{t=0}^{T} \gamma^t \rho_{[0:t]} r_t$ → Standard PG (Tang and Abbeel, 2010), $\sum_{t=0}^{T} \nabla_\theta \log \pi^t_\theta \sum_{t'=t}^{T} \gamma^{t'} r_{t'}$

Here $\rho_{[0:t]} = \prod_{t'=0}^{t} \pi^{t'}_{\theta+\Delta\theta} / \pi^{t'}_\theta$ is the cumulative importance weight.
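As a concrete (hypothetical) illustration of the step-wise pair, the sketch below computes both estimators from a single logged trajectory; the array names `rewards`, `logp_b` ($\log \pi^t_\theta$), `logp_e` ($\log \pi^t_{\theta+\Delta\theta}$), and `grad_logp` ($\nabla_\theta \log \pi^t_\theta$) are assumptions of this sketch, not notation from the paper.

```python
import numpy as np

def stepwise_is_ope(rewards, logp_b, logp_e, gamma):
    """Step-wise IS OPE: sum_t gamma^t * rho_[0:t] * r_t, with rho_[0:t] the cumulative ratio."""
    T1 = len(rewards)
    rho = np.exp(np.cumsum(logp_e - logp_b))      # rho_[0:t] for t = 0..T
    disc = gamma ** np.arange(T1)
    return float(np.sum(disc * rho * rewards))

def standard_pg(rewards, grad_logp, gamma):
    """Standard PG: sum_t grad log pi_theta^t * sum_{t'>=t} gamma^{t'} r_{t'}."""
    T1 = len(rewards)
    disc_r = (gamma ** np.arange(T1)) * rewards
    reward_to_go = np.cumsum(disc_r[::-1])[::-1]  # sum_{t'=t}^{T} gamma^{t'} r_{t'}
    return np.sum(grad_logp * reward_to_go[:, None], axis=0)
```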
Basic Idea

• OPE with state baselines, $b_0 + \sum_{t=0}^{T} \gamma^t \rho_{[0:t]} \big( r_t + \gamma b_{t+1} - b_t \big)$ → PG with state baselines, $\sum_{t=0}^{T} \nabla_\theta \log \pi^t_\theta \Big( \sum_{t'=t}^{T} \gamma^{t'} r_{t'} - \gamma^t b_t \Big)$
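A companion sketch for the baseline pair above, under the same assumed per-trajectory NumPy arrays plus a state-baseline array `baselines` ($b_t = b(s_t)$) and cumulative importance weights `rho`; the convention $b_{T+1} = 0$ at the terminal step is an assumption of the sketch.

```python
import numpy as np

def ope_with_state_baselines(rewards, rho, baselines, gamma):
    """b_0 + sum_t gamma^t rho_[0:t] (r_t + gamma b_{t+1} - b_t), taking b_{T+1} = 0."""
    T1 = len(rewards)
    disc = gamma ** np.arange(T1)
    b_next = np.append(baselines[1:], 0.0)
    return float(baselines[0] + np.sum(disc * rho * (rewards + gamma * b_next - baselines)))

def pg_with_state_baselines(rewards, grad_logp, baselines, gamma):
    """sum_t grad log pi_theta^t * (sum_{t'>=t} gamma^{t'} r_{t'} - gamma^t b_t)."""
    T1 = len(rewards)
    disc = gamma ** np.arange(T1)
    reward_to_go = np.cumsum((disc * rewards)[::-1])[::-1]
    return np.sum(grad_logp * (reward_to_go - disc * baselines)[:, None], axis=0)
```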
Basic Idea

• Doubly Robust OPE (with $\pi'$ the target policy), $\hat{V}^{\pi'}_0 + \sum_{t=0}^{T} \gamma^t \rho_{[0:t]} \big( r_t + \gamma \hat{V}^{\pi'}_{t+1} - \hat{Q}^{\pi'}_t \big)$ →
  Trajectory-wise CV (Cheng et al., 2019): $\sum_{t=0}^{T} \Big\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t'=t}^{T} \gamma^{t'} r_{t'} + \sum_{t'=t+1}^{T} \gamma^{t'} \big( \hat{V}^{\pi_\theta}_{t'} - \hat{Q}^{\pi_\theta}_{t'} \big) \Big] + \gamma^t \big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \big) \Big\}$
  DR-PG (Ours): $\sum_{t=0}^{T} \Big\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t'=t}^{T} \gamma^{t'} r_{t'} + \sum_{t'=t+1}^{T} \gamma^{t'} \big( \hat{V}^{\pi_\theta}_{t'} - \hat{Q}^{\pi_\theta}_{t'} \big) \Big] + \gamma^t \big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \big) \Big\}$
Preliminaries

MDP setting:
• Episodic RL with discount factor $\gamma$ and maximum episode length $T$;
• Fixed initial state distribution;
• A trajectory is defined as $s_0, a_0, r_0, s_1, \ldots, s_T, a_T, r_T$.

Frequently used notations:
• $\pi_\theta$: policy parameterized by $\theta$.
• $J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\big]$: expected discounted return of $\pi_\theta$.
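For concreteness, a minimal sketch of the Monte Carlo estimate of $J(\pi_\theta)$ implied by this definition; `reward_trajectories` is a hypothetical list of per-step reward sequences collected by rolling out $\pi_\theta$.

```python
def discounted_return(rewards, gamma):
    """sum_{t=0}^{T} gamma^t r_t for one trajectory's reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_J(reward_trajectories, gamma):
    """Average discounted return over trajectories rolled out with pi_theta."""
    return sum(discounted_return(rs, gamma) for rs in reward_trajectories) / len(reward_trajectories)
```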
A Concrete and Simple Example: From Step-wise IS OPE to Standard PG

Take $\pi_\theta$ as the behavior policy and $\pi_{\theta+\Delta\theta}$ as the target policy, and write $r_t = r(s_t, a_t)$ and $\pi^t_\theta = \pi_\theta(a_t \mid s_t)$. The step-wise IS estimate of $J(\pi_{\theta+\Delta\theta})$ is

$$
\hat{J}(\pi_{\theta+\Delta\theta}) = \sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^{t'}_{\theta+\Delta\theta}}{\pi^{t'}_{\theta}}
= \sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \Big( 1 + \frac{\nabla_\theta \pi^{t'}_{\theta}}{\pi^{t'}_{\theta}} \Delta\theta + o(\Delta\theta) \Big)
= \hat{J}(\pi_\theta) + \Big( \sum_{t=0}^{T} \gamma^t r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi^{t'}_{\theta} \Big) \Delta\theta + o(\Delta\theta).
$$

Then

$$
\lim_{\Delta\theta \to 0} \frac{\hat{J}(\pi_{\theta+\Delta\theta}) - \hat{J}(\pi_\theta)}{\Delta\theta}
= \sum_{t=0}^{T} \gamma^t r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi^{t'}_{\theta}
= \sum_{t=0}^{T} \nabla_\theta \log \pi^t_\theta \sum_{t'=t}^{T} \gamma^{t'} r_{t'},
$$

which (after exchanging the order of summation) is the standard PG estimator.
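The derivation can be sanity-checked numerically. The toy sketch below (not from the paper) uses a one-parameter sigmoid policy over two actions and verifies that the finite difference of the step-wise IS estimate as $\Delta\theta \to 0$ matches the standard PG estimate on the same trajectory; states are omitted since only $\pi_\theta(a_t)$ enters the computation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pi(theta, a):                     # pi_theta(a) for a in {0, 1}
    p1 = sigmoid(theta)
    return p1 if a == 1 else 1.0 - p1

def stepwise_is(theta_b, theta_e, actions, rewards, gamma):
    """Step-wise IS estimate of J(pi_{theta_e}) from a trajectory logged under pi_{theta_b}."""
    rho, total = 1.0, 0.0
    for t, (a, r) in enumerate(zip(actions, rewards)):
        rho *= pi(theta_e, a) / pi(theta_b, a)
        total += gamma**t * rho * r
    return total

theta, gamma, eps = 0.3, 0.9, 1e-5
actions = [1, 0, 1, 1]
rewards = [1.0, 0.5, -0.2, 2.0]

# Standard PG on this trajectory: sum_t (d/dtheta log pi(a_t)) * sum_{t'>=t} gamma^{t'} r_{t'}
scores = np.array([a - sigmoid(theta) for a in actions])   # d/dtheta log pi_theta(a)
disc_r = np.array([gamma**t * r for t, r in enumerate(rewards)])
reward_to_go = np.cumsum(disc_r[::-1])[::-1]
pg = float(np.dot(scores, reward_to_go))

# Finite difference of the step-wise IS estimate around Delta theta = 0
fd = (stepwise_is(theta, theta + eps, actions, rewards, gamma)
      - stepwise_is(theta, theta, actions, rewards, gamma)) / eps

print(pg, fd)   # the two numbers agree up to O(eps)
```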
Doubly-Robust Policy Gradient (DR-PG)

Definition: doubly-robust OPE estimator (unbiased) (Jiang and Li, 2016):

$$
\hat{J}(\pi_{\theta+\Delta\theta}) = \hat{V}^{\pi_{\theta+\Delta\theta}}_0 + \sum_{t=0}^{T} \gamma^t \prod_{t'=0}^{t} \frac{\pi^{t'}_{\theta+\Delta\theta}}{\pi^{t'}_{\theta}} \Big( r_t + \gamma \hat{V}^{\pi_{\theta+\Delta\theta}}_{t+1} - \hat{Q}^{\pi_{\theta+\Delta\theta}}_t \Big),
\quad \text{where } \hat{V}^{\pi_{\theta+\Delta\theta}} = \mathbb{E}_{a \sim \pi_{\theta+\Delta\theta}}\big[\hat{Q}^{\pi_{\theta+\Delta\theta}}\big].
$$

Theorem: given the DR-OPE estimator above, we can derive two unbiased policy gradient estimators.

• If $\hat{Q}^{\pi_{\theta+\Delta\theta}} = \hat{Q}^{\pi_\theta}$ for arbitrary $\Delta\theta$ [Traj-CV, (Cheng, Yan, and Boots, 2019)]:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \Big] + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$

• Else [DR-PG (Ours)]:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \Big] + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$

Remark 1: The definitions of $\nabla_\theta \hat{V}$ are different. In Traj-CV, $\nabla_\theta \hat{V} = \mathbb{E}_{\pi_\theta}\big[\hat{Q}^{\pi_\theta} \nabla_\theta \log \pi_\theta\big]$, while in DR-PG, $\nabla_\theta \hat{V} = \mathbb{E}_{\pi_\theta}\big[\hat{Q}^{\pi_\theta} \nabla_\theta \log \pi_\theta + \nabla_\theta \hat{Q}^{\pi_\theta}\big]$.

Remark 2: $\nabla_\theta \hat{Q}^{\pi_\theta}$ is not necessarily an exact gradient but just an approximation of $\nabla_\theta Q^{\pi_\theta}$.
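A minimal sketch of the DR-PG estimator on a single trajectory, assuming all per-step quantities are precomputed NumPy arrays; the interface and names below are assumptions of the sketch, not the authors' implementation.

```python
# Assumed per-step arrays for one trajectory of length T+1:
#   rewards[t]   : r_t
#   grad_logp[t] : nabla_theta log pi_theta(a_t|s_t), shape (T+1, d)
#   q_hat[t]     : \hat{Q}^{pi_theta}(s_t, a_t)
#   v_hat[t]     : \hat{V}^{pi_theta}(s_t) = E_{a ~ pi_theta}[\hat{Q}(s_t, a)]
#   grad_q[t]    : (approximate) nabla_theta \hat{Q}^{pi_theta}(s_t, a_t), shape (T+1, d)
#   grad_v[t]    : nabla_theta \hat{V}^{pi_theta}(s_t) in the DR-PG sense of Remark 1, shape (T+1, d)
import numpy as np

def dr_pg(rewards, grad_logp, q_hat, v_hat, grad_q, grad_v, gamma):
    T1 = len(rewards)                              # T + 1 steps, t = 0..T
    disc = gamma ** np.arange(T1)
    # sum_{t1 >= t} gamma^{t1} r_{t1}
    rtg = np.cumsum((disc * rewards)[::-1])[::-1]
    # sum_{t2 >= t+1} gamma^{t2} (V_hat_{t2} - Q_hat_{t2})
    vq = disc * (v_hat - q_hat)
    suffix = np.cumsum(vq[::-1])[::-1]             # sum_{t2 >= t}
    tail = np.concatenate([suffix[1:], [0.0]])     # shift to t2 >= t+1
    grad = np.zeros(grad_logp.shape[1])
    for t in range(T1):
        grad += grad_logp[t] * (rtg[t] + tail[t])
        grad += disc[t] * (grad_v[t] - grad_q[t] - q_hat[t] * grad_logp[t])
    return grad
```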
Special Cases of DR-PG

DR-PG:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \Big] + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$

↓ Use $\hat{Q}^{\pi'}$ invariant to $\pi'$

Traj-CV:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \Big[ \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \Big] + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$

↓ $\mathbb{E}\big[ \sum_{t_2=t+1}^{T} \gamma^{t_2} \big( \hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2} \big) \,\big|\, s_{t+1} \big] = 0$, so this term can be dropped

PG with state-action baselines:

$$
\sum_{t=0}^{T} \bigg\{ \nabla_\theta \log \pi^t_\theta \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \gamma^t \Big( \nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi^t_\theta \Big) \bigg\}.
$$
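As a usage note on the `dr_pg` sketch above (assumed to be in scope), the special cases on this slide correspond to zeroing out pieces of its inputs: Traj-CV passes $\nabla_\theta \hat{Q} = 0$ together with the Traj-CV definition of $\nabla_\theta \hat{V}$ (Remark 1), and the state-action-baseline form additionally zeroes the $\hat{V} - \hat{Q}$ correction tail by passing `v_hat = q_hat`. All arrays below are random placeholders, purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T1, d, gamma = 5, 3, 0.9
rewards   = rng.normal(size=T1)
grad_logp = rng.normal(size=(T1, d))
q_hat     = rng.normal(size=T1)
v_hat     = rng.normal(size=T1)
grad_q    = rng.normal(size=(T1, d))
grad_v    = rng.normal(size=(T1, d))       # DR-PG definition of grad V_hat (Remark 1)
grad_v_cv = rng.normal(size=(T1, d))       # Traj-CV definition: E_{a~pi}[Q_hat * grad log pi]

g_drpg = dr_pg(rewards, grad_logp, q_hat, v_hat, grad_q, grad_v, gamma)
g_trajcv = dr_pg(rewards, grad_logp, q_hat, v_hat,
                 np.zeros_like(grad_q), grad_v_cv, gamma)      # drop the grad Q_hat term
g_sa_baseline = dr_pg(rewards, grad_logp, q_hat, q_hat,        # v_hat = q_hat kills the tail
                      np.zeros_like(grad_q), grad_v_cv, gamma)
```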