Data-efficient Policy Evaluation through Behavior Policy Search
Josiah Hanna¹, Philip Thomas², Peter Stone¹, Scott Niekum¹
¹ University of Texas at Austin   ² University of Massachusetts, Amherst
August 8th, 2017
Policy Evaluation
Outline
1. Demonstrate that importance sampling for policy evaluation can outperform on-policy policy evaluation.
2. Show how to improve the behavior policy for importance-sampling policy evaluation.
3. Empirically evaluate (1) and (2).
Background
Finite-horizon MDP. The agent selects actions with a stochastic policy, π. The policy and environment together determine a distribution over trajectories, H: S_0, A_0, R_0, S_1, A_1, R_1, ..., S_L, A_L, R_L.
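As a worked equation (not on the slide, but the standard factorization for a finite-horizon MDP), the trajectory distribution splits into policy terms and environment terms; here d_0 denotes the initial-state distribution and P the transition function, both assumed notation:

```latex
% Probability of a trajectory H = (S_0, A_0, R_0, ..., S_L, A_L, R_L)
% generated by running policy \pi in the MDP:
\Pr(H \mid \pi)
  \;=\; d_0(S_0)\,
        \prod_{t=0}^{L} \pi(A_t \mid S_t)\,
        \prod_{t=0}^{L-1} P(S_{t+1} \mid S_t, A_t)
```

The environment terms d_0 and P appear identically for any policy, which is why the importance-sampling re-weighting factor introduced later involves only ratios of policy probabilities.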
Policy Evaluation
Policy performance:
$$\rho(\pi) := \mathbb{E}\!\left[\,\sum_{t=0}^{L} \gamma^t R_t \,\middle|\, H \sim \pi\right]$$
Given a target policy, π_e, estimate ρ(π_e).
Let π_e ≡ π_{θ_e}.
Monte Carlo Policy Evaluation
Given a dataset D of trajectories where ∀ H_i ∈ D, H_i ∼ π_e:
$$\mathrm{MC}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{H_i \in \mathcal{D}} \sum_{t=0}^{L} \gamma^t R_t^{(i)}$$
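A minimal sketch of this on-policy estimator, assuming each trajectory is stored as a list of (state, action, reward) tuples; the data layout is an assumption for illustration, not something specified on the slide.

```python
def monte_carlo_estimate(trajectories, gamma):
    """On-policy Monte Carlo estimate of rho(pi_e).

    `trajectories` is a list of trajectories sampled by running pi_e itself;
    each trajectory is a list of (state, action, reward) tuples.
    """
    total = 0.0
    for trajectory in trajectories:
        # Discounted return of this trajectory.
        total += sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
    return total / len(trajectories)
```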
[Two-action example: Action 1 yields +100, Action 2 yields +1.]
Target policy π_e samples the high-rewarding first action with probability 0.01.
Monte Carlo evaluation of π_e has high variance.
Importance sampling with a behavior policy that samples either action with equal probability gives a low-variance evaluation.
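A small numerical check of this claim, done in Python; the rewards (+100, +1), the target probabilities (0.01, 0.99), and the uniform behavior policy all come from the slide, and the variances are computed exactly rather than by simulation.

```python
# Two-action bandit from the slide: action 1 pays +100, action 2 pays +1.
p_e = [0.01, 0.99]      # target policy pi_e
p_b = [0.50, 0.50]      # uniform behavior policy
rewards = [100.0, 1.0]

rho = sum(p * r for p, r in zip(p_e, rewards))                          # 1.99

# Variance of the on-policy (Monte Carlo) return under pi_e.
var_mc = sum(p * r ** 2 for p, r in zip(p_e, rewards)) - rho ** 2       # ~97.03

# Variance of the importance-sampled return under the uniform behavior policy.
is_returns = [(pe / pb) * r for pe, pb, r in zip(p_e, p_b, rewards)]
var_is = sum(pb * x ** 2 for pb, x in zip(p_b, is_returns)) - rho ** 2  # ~0.0001

print(rho, var_mc, var_is)
```

The on-policy return has variance near 97, while the importance-sampled return under the uniform behavior policy has variance near 0.0001, even though both estimators have the same mean, 1.99.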
Importance-Sampling Policy Evaluation [1]
Given a dataset D of trajectories where ∀ H_i ∈ D, H_i is sampled from a behavior policy π_i:
$$\mathrm{IS}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{H_i \in \mathcal{D}} \underbrace{\prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_i(A_t \mid S_t)}}_{\text{re-weighting factor}} \sum_{t=0}^{L} \gamma^t R_t^{(i)}$$
For convenience:
$$\mathrm{IS}(H, \pi) := \prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi(A_t \mid S_t)} \sum_{t=0}^{L} \gamma^t R_t$$
[1] Precup, Sutton, and Singh (2000)
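A minimal sketch of the ordinary importance-sampling estimator above, assuming each trajectory records the behavior policy's action probability alongside the state, action, and reward at each step; `pi_e_prob` stands for π_e(a | s) and is a hypothetical callable, not part of the slide.

```python
def importance_sampling_estimate(trajectories, pi_e_prob, gamma):
    """Ordinary importance-sampling estimate of rho(pi_e).

    Each trajectory is a list of (state, action, reward, behavior_prob) tuples,
    where behavior_prob is pi_i(a | s) for the policy that generated the data.
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0          # product of per-step likelihood ratios
        ret = 0.0             # discounted return of the trajectory
        for t, (s, a, r, behavior_prob) in enumerate(trajectory):
            weight *= pi_e_prob(s, a) / behavior_prob
            ret += gamma ** t * r
        total += weight * ret
    return total / len(trajectories)
```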
The Optimal Behavior Policy
Importance sampling can achieve zero mean-squared-error policy evaluation with only a single trajectory!
We cannot analytically determine this policy:
- Requires ρ(π_e) be known!
- Requires the reward function be known.
- Requires deterministic transitions.
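A sketch of why such a policy exists, reduced to the one-step (bandit) case and assuming nonnegative rewards with ρ(π_e) > 0; the trajectory-level construction behind the slide additionally relies on the known-reward and deterministic-transition conditions listed above.

```latex
% One-step case: reward R(a) \ge 0 for each action a, and
% \rho(\pi_e) = \sum_a \pi_e(a) R(a) > 0.  Define the behavior policy
\pi^*(a) \;=\; \frac{\pi_e(a)\, R(a)}{\rho(\pi_e)} .
% For any action a with \pi^*(a) > 0, the importance-sampled return is
\mathrm{IS}(a, \pi^*) \;=\; \frac{\pi_e(a)}{\pi^*(a)}\, R(a) \;=\; \rho(\pi_e),
% i.e., every single sample equals exactly \rho(\pi_e), so the MSE is zero --
% but writing down \pi^* already requires \rho(\pi_e) and the reward function.
```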
Behavior Policy Search
Adapt the behavior policy towards the optimal behavior policy. At each iteration, i:
1. Choose behavior policy parameters, θ_i, based on all observed data D.
2. Sample m trajectories, H ∼ π_{θ_i}, and add them to the data set D.
3. Estimate ρ(π_e) with the trajectories in D.
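A minimal code skeleton of this loop, with `choose_behavior_parameters`, `sample_trajectories`, and `estimate_rho` as hypothetical helpers standing in for steps 1-3; none of these names appear on the slide.

```python
def behavior_policy_search(theta_e, n_iterations, m,
                           choose_behavior_parameters,
                           sample_trajectories,
                           estimate_rho):
    """Skeleton of the behavior policy search loop on the slide."""
    data = []                      # all trajectories observed so far
    theta = theta_e                # start from the target policy's parameters
    estimates = []
    for i in range(n_iterations):
        theta = choose_behavior_parameters(data, theta)   # step 1
        data.extend(sample_trajectories(theta, m))        # step 2
        estimates.append(estimate_rho(data))              # step 3
    return estimates
```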
Behavior Policy Gradient
Key idea: adapt the behavior policy parameters, θ, with gradient descent on the mean squared error of the importance-sampling estimator:
$$\theta_{i+1} = \theta_i - \alpha \frac{\partial}{\partial \theta} \mathrm{MSE}\big[\mathrm{IS}(H_i, \theta)\big]$$
MSE[IS(H, θ)] is not computable, but (∂/∂θ) MSE[IS(H, θ)] is computable.
Behavior Policy Gradient Theorem
Theorem:
$$\frac{\partial}{\partial \theta} \mathrm{MSE}\big(\mathrm{IS}(H, \theta)\big) = \mathbb{E}_{H \sim \pi_\theta}\!\left[ -\mathrm{IS}(H, \theta)^2 \sum_{t=0}^{L} \frac{\partial}{\partial \theta} \log \pi_\theta(A_t \mid S_t) \right]$$
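A minimal sample-based sketch of the update suggested by the theorem: average the quantity inside the expectation over trajectories sampled from π_θ, then take a gradient-descent step. It assumes θ is a NumPy parameter vector, each trajectory is a list of (state, action, reward) tuples, and `is_estimate(trajectory, theta)` and `log_prob_grad(theta, s, a)` are hypothetical helpers for IS(H, θ) and ∂/∂θ log π_θ(a | s).

```python
import numpy as np

def bpg_update(theta, trajectories, is_estimate, log_prob_grad, alpha):
    """One Behavior Policy Gradient step on the behavior policy parameters."""
    grad = np.zeros_like(theta)
    for trajectory in trajectories:
        # Score function of the trajectory: sum_t d/dtheta log pi_theta(A_t | S_t).
        score = sum(log_prob_grad(theta, s, a) for s, a, _ in trajectory)
        # Sample of the MSE gradient from the theorem: -IS(H, theta)^2 * score.
        grad += -(is_estimate(trajectory, theta) ** 2) * score
    grad /= len(trajectories)
    return theta - alpha * grad    # gradient descent on the MSE
```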
Empirical Results
[Figures: results on the Cartpole Swing-up and Acrobot domains.]
GridWorld Results
[Figures: results for a high-variance policy and a low-variance policy.]