Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator
Josiah Hanna and Peter Stone
Department of Computer Science, The University of Texas at Austin
[Motivation figure: the data cost of modern deep RL — 50 million games, millions of actions taken, 21 days of training, 1.5 years of compute.]
Can reinforcement learning be data-efficient enough for real-world applications?
Reinforcement Learning
Learn a policy that maps the world state to an action that maximizes long-term utility.
Reinforcement Learning

A policy maps each state to a distribution over actions: $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$.

Running example: with probability 0.15 the chosen action reaches the destination (+1); with probability 0.85 it crashes (−100).

The objective is the expected value of the policy:
\[
v(\pi_\theta) = \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) = \mathbb{E}\left[ Q^{\pi_\theta}(S, A) \right],
\]
where $\Pr(s \mid \pi_\theta)$ is unknown, $\pi_\theta(a \mid s)$ is known, and $Q^{\pi_\theta}(s, a)$ answers "how good is taking action $a$ in state $s$?"
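As a quick check of the definition, here is the expected value for the driving example above, treating it as a single decision state so that the outer sum over states drops out (a simplifying assumption for illustration):
\[
v(\pi_\theta) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) = 0.15 \cdot (+1) + 0.85 \cdot (-100) = -84.85 .
\]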
Policy Gradient Reinforcement Learning

Starting from $v(\pi_\theta) = \mathbb{E}\left[ Q^{\pi_\theta}(S, A) \right]$, the policy gradient is
\[
\nabla_\theta v(\pi_\theta) = \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)
= \mathbb{E}\left[ Q^{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \right],
\]
with $\Pr(s \mid \pi_\theta)$ unknown and $\pi_\theta(a \mid s)$ known. The Monte Carlo estimate from $m$ on-policy samples is
\[
\nabla_\theta v(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^{m} Q^{\pi_\theta}(S_i, A_i)\, \nabla_\theta \log \pi_\theta(A_i \mid S_i).
\]
Monte Carlo Policy Gradient
1. Execute the current policy for m steps.
2. Update the policy with the Monte Carlo policy gradient estimate (a sketch of this step follows below).
3. Throw away the observed data and repeat (on-policy).
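A minimal sketch of step 2, assuming a linear-softmax policy over discrete actions and Monte Carlo returns standing in for $Q^{\pi_\theta}$; the function names, the feature-based policy, and the use of NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities pi_theta(. | s) for a linear-softmax policy.

    theta: (n_actions, n_features) parameter matrix; s: feature vector."""
    logits = theta @ s
    exp = np.exp(logits - logits.max())   # subtract max for numerical stability
    return exp / exp.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) with respect to theta."""
    probs = softmax_policy(theta, s)
    grad = -np.outer(probs, s)            # -pi(a' | s) * s for every row a'
    grad[a] += s                          # +s for the action actually taken
    return grad

def monte_carlo_policy_gradient(theta, states, actions, returns):
    """Average of Q(S_i, A_i) * grad log pi_theta(A_i | S_i) over the m samples."""
    grad = np.zeros_like(theta)
    for s, a, q in zip(states, actions, returns):
        grad += q * grad_log_pi(theta, s, a)
    return grad / len(states)
```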
Sampling Error

For a finite amount of data, it may appear that the wrong policy generated the data. In the running example, the policy reaches the destination (+1) with probability 0.15 and crashes (−100) with probability 0.85, yet in a finite sample the observed proportions might be 0.1 and 0.9.
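A tiny simulation of this effect under the slide's numbers (two actions with true probabilities 0.15 and 0.85, a small on-policy batch); the code is illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.15, 0.85])   # pi_theta: reach destination vs. crash
m = 20                                # small on-policy batch size

actions = rng.choice(2, size=m, p=true_probs)
empirical = np.bincount(actions, minlength=2) / m
print("true:", true_probs, "empirical:", empirical)
# With finite m, the empirical proportions (e.g. 0.10 / 0.90) need not match 0.15 / 0.85.
```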
Correcting Sampling Error

Pretend the data was generated by the policy that most closely matches the observed data, i.e., the maximum-likelihood policy:
\[
\pi_\phi = \operatorname*{argmax}_{\phi'} \sum_{i=1}^{m} \log \pi_{\phi'}(a_i \mid s_i).
\]
Then apply an importance-sampling correction that shifts the weight on each state–action pair toward the policy we actually took actions with:
\[
\nabla_\theta v(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \frac{\pi_\theta(a_i \mid s_i)}{\pi_\phi(a_i \mid s_i)}\, Q^{\pi_\theta}(S_i, A_i)\, \nabla_\theta \log \pi_\theta(A_i \mid S_i).
\]
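A worked instance of the correction using the proportions from the sampling-error slide: for discrete states and actions, the maximum-likelihood policy $\pi_\phi$ is simply the empirical action frequencies, so the correction weights are
\[
\frac{\pi_\theta(a_1 \mid s)}{\pi_\phi(a_1 \mid s)} = \frac{0.15}{0.10} = 1.5,
\qquad
\frac{\pi_\theta(a_2 \mid s)}{\pi_\phi(a_2 \mid s)} = \frac{0.85}{0.90} \approx 0.94,
\]
so the under-sampled action is up-weighted and the over-sampled action is down-weighted, exactly as if the data had come from $\pi_\phi$ and we were importance sampling back to $\pi_\theta$.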
Is this method on-policy or off-policy?
On-policy: can only use data from the current policy.
Off-policy: can use data from any policy.
Our method pretends on-policy data is off-policy data and uses importance sampling to correct!
Sampling Error Corrected Policy Gradient
1. Execute the current policy for m steps.
2. Estimate the empirical policy with maximum likelihood estimation.
3. Update the policy with the Sampling Error Corrected (SEC) policy gradient estimate (steps 2–3 are sketched below).
4. Throw away the data and repeat (on-policy).
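A sketch of steps 2–3 for the discrete-state, discrete-action case, where the maximum-likelihood policy $\pi_\phi$ reduces to per-state empirical action frequencies. It reuses the `softmax_policy` and `grad_log_pi` sketches from the Monte Carlo slide; all names and the count-based MLE are illustrative assumptions rather than the paper's code.

```python
import numpy as np
from collections import defaultdict

def fit_empirical_policy(states, actions, n_actions):
    """Step 2: maximum-likelihood estimate of the data-generating policy.
    For discrete (hashable) states this is the per-state empirical action frequency."""
    counts = defaultdict(lambda: np.zeros(n_actions))
    for s, a in zip(states, actions):
        counts[s][a] += 1.0
    return {s: c / c.sum() for s, c in counts.items()}

def sec_policy_gradient(theta, states, actions, returns, pi_phi,
                        policy_probs, grad_log_pi):
    """Step 3: importance-weight each term by pi_theta(a|s) / pi_phi(a|s).

    policy_probs(theta, s) returns the action distribution of pi_theta at s;
    grad_log_pi(theta, s, a) returns the gradient of log pi_theta(a | s)."""
    grad = np.zeros_like(theta)
    for s, a, q in zip(states, actions, returns):
        w = policy_probs(theta, s)[a] / pi_phi[s][a]   # sampling-error correction
        grad += w * q * grad_log_pi(theta, s, a)
    return grad / len(states)
```

Each update would call `fit_empirical_policy` on the fresh batch before `sec_policy_gradient`, then discard the batch as in step 4.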
Empirical Results: GridWorld (discrete states and actions).
Empirical Results: Cartpole (continuous states, discrete actions).
Related Work
1. Expected SARSA (van Seijen et al. 2009).
2. Expected Policy Gradients (Ciosek and Whiteson 2018).
3. Estimated Propensity Scores (Hirano et al. 2003, Li et al. 2015).
4. Many related approaches outside of RL and bandits:
• Black-box importance sampling (Liu and Lee 2017), Bayesian Monte Carlo (Ghahramani and Rasmussen 2003).
1. Any Monte Carlo method will have sampling error with finite data.
2. Sampling error can slow down learning in policy gradient methods.
3. We introduced the sampling error corrected policy gradient estimator to address this problem.
4. A similar approach can be used for other Monte Carlo estimators.
• For example, on- and off-policy policy evaluation: Josiah Hanna, Scott Niekum, Peter Stone (to appear, ICML 2019).
Open Questions
1. Finite-sample bias / variance analysis.
2. Correcting sampling error in online RL methods.
Thank you! Questions? jphanna@cs.utexas.edu
This is not a blank slide.