Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator
Josiah Hanna and Peter Stone
Department of Computer Science, The University of Texas at Austin
[Motivation figure: the data cost of modern deep RL — 50 million games, millions of actions taken, 21 days of training, 1.5 years of compute.]
Can reinforcement learning be data-efficient enough for real-world applications?
Reinforcement Learning
Learn a policy that maps the world state to an action that maximizes long-term utility.
Reinforcement Learning

A policy maps each state to a distribution over actions: $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$.

Running example: with probability 0.15 the chosen action reaches the destination (+1); with probability 0.85 it crashes (−100).

The objective is the expected value of the policy:
\[
v(\pi_\theta) = \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) = \mathbb{E}\left[ Q^{\pi_\theta}(S, A) \right],
\]
where $\Pr(s \mid \pi_\theta)$ is unknown, $\pi_\theta(a \mid s)$ is known, and $Q^{\pi_\theta}(s, a)$ answers "how good is taking action $a$ in state $s$?"
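As a quick check of the definition, here is the expected value for the driving example above, treating it as a single decision state so that the outer sum over states drops out (a simplifying assumption for illustration):
\[
v(\pi_\theta) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) = 0.15 \cdot (+1) + 0.85 \cdot (-100) = -84.85 .
\]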
Policy Gradient Reinforcement Learning

Starting from $v(\pi_\theta) = \mathbb{E}\left[ Q^{\pi_\theta}(S, A) \right]$, the policy gradient is
\[
\nabla_\theta v(\pi_\theta) = \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)
= \mathbb{E}\left[ Q^{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \right],
\]
with $\Pr(s \mid \pi_\theta)$ unknown and $\pi_\theta(a \mid s)$ known. The Monte Carlo estimate from $m$ on-policy samples is
\[
\nabla_\theta v(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^{m} Q^{\pi_\theta}(S_i, A_i)\, \nabla_\theta \log \pi_\theta(A_i \mid S_i).
\]
Monte Carlo Policy Gradient
1. Execute the current policy for m steps.
2. Update the policy with the Monte Carlo policy gradient estimate (a sketch of this step follows below).
3. Throw away the observed data and repeat (on-policy).
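A minimal sketch of step 2, assuming a linear-softmax policy over discrete actions and Monte Carlo returns standing in for $Q^{\pi_\theta}$; the function names, the feature-based policy, and the use of NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities pi_theta(. | s) for a linear-softmax policy.

    theta: (n_actions, n_features) parameter matrix; s: feature vector."""
    logits = theta @ s
    exp = np.exp(logits - logits.max())   # subtract max for numerical stability
    return exp / exp.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) with respect to theta."""
    probs = softmax_policy(theta, s)
    grad = -np.outer(probs, s)            # -pi(a' | s) * s for every row a'
    grad[a] += s                          # +s for the action actually taken
    return grad

def monte_carlo_policy_gradient(theta, states, actions, returns):
    """Average of Q(S_i, A_i) * grad log pi_theta(A_i | S_i) over the m samples."""
    grad = np.zeros_like(theta)
    for s, a, q in zip(states, actions, returns):
        grad += q * grad_log_pi(theta, s, a)
    return grad / len(states)
```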
Sampling Error

For a finite amount of data, it may appear that the wrong policy generated the data. In the running example, the policy reaches the destination (+1) with probability 0.15 and crashes (−100) with probability 0.85, yet in a finite sample the observed proportions might be 0.1 and 0.9.
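A tiny simulation of this effect under the slide's numbers (two actions with true probabilities 0.15 and 0.85, a small on-policy batch); the code is illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.15, 0.85])   # pi_theta: reach destination vs. crash
m = 20                                # small on-policy batch size

actions = rng.choice(2, size=m, p=true_probs)
empirical = np.bincount(actions, minlength=2) / m
print("true:", true_probs, "empirical:", empirical)
# With finite m, the empirical proportions (e.g. 0.10 / 0.90) need not match 0.15 / 0.85.
```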
Correcting Sampling Error

Pretend the data was generated by the policy that most closely matches the observed data, i.e., the maximum-likelihood policy:
\[
\pi_\phi = \operatorname*{argmax}_{\phi'} \sum_{i=1}^{m} \log \pi_{\phi'}(a_i \mid s_i).
\]
Then apply an importance-sampling correction that shifts the weight on each state–action pair toward the policy we actually took actions with:
\[
\nabla_\theta v(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \frac{\pi_\theta(a_i \mid s_i)}{\pi_\phi(a_i \mid s_i)}\, Q^{\pi_\theta}(S_i, A_i)\, \nabla_\theta \log \pi_\theta(A_i \mid S_i).
\]
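A worked instance of the correction using the proportions from the sampling-error slide: for discrete states and actions, the maximum-likelihood policy $\pi_\phi$ is simply the empirical action frequencies, so the correction weights are
\[
\frac{\pi_\theta(a_1 \mid s)}{\pi_\phi(a_1 \mid s)} = \frac{0.15}{0.10} = 1.5,
\qquad
\frac{\pi_\theta(a_2 \mid s)}{\pi_\phi(a_2 \mid s)} = \frac{0.85}{0.90} \approx 0.94,
\]
so the under-sampled action is up-weighted and the over-sampled action is down-weighted, exactly as if the data had come from $\pi_\phi$ and we were importance sampling back to $\pi_\theta$.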
Is this method on-policy or off-policy?
On-policy: can only use data from the current policy.
Off-policy: can use data from any policy.
Our method pretends on-policy data is off-policy data and uses importance sampling to correct!
Sampling Error Corrected Policy Gradient
1. Execute the current policy for m steps.
2. Estimate the empirical policy with maximum likelihood estimation.
3. Update the policy with the Sampling Error Corrected (SEC) policy gradient estimate (steps 2–3 are sketched below).
4. Throw away the data and repeat (on-policy).
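A sketch of steps 2–3 for the discrete-state, discrete-action case, where the maximum-likelihood policy $\pi_\phi$ reduces to per-state empirical action frequencies. It reuses the `softmax_policy` and `grad_log_pi` sketches from the Monte Carlo slide; all names and the count-based MLE are illustrative assumptions rather than the paper's code.

```python
import numpy as np
from collections import defaultdict

def fit_empirical_policy(states, actions, n_actions):
    """Step 2: maximum-likelihood estimate of the data-generating policy.
    For discrete (hashable) states this is the per-state empirical action frequency."""
    counts = defaultdict(lambda: np.zeros(n_actions))
    for s, a in zip(states, actions):
        counts[s][a] += 1.0
    return {s: c / c.sum() for s, c in counts.items()}

def sec_policy_gradient(theta, states, actions, returns, pi_phi,
                        policy_probs, grad_log_pi):
    """Step 3: importance-weight each term by pi_theta(a|s) / pi_phi(a|s).

    policy_probs(theta, s) returns the action distribution of pi_theta at s;
    grad_log_pi(theta, s, a) returns the gradient of log pi_theta(a | s)."""
    grad = np.zeros_like(theta)
    for s, a, q in zip(states, actions, returns):
        w = policy_probs(theta, s)[a] / pi_phi[s][a]   # sampling-error correction
        grad += w * q * grad_log_pi(theta, s, a)
    return grad / len(states)
```

Each update would call `fit_empirical_policy` on the fresh batch before `sec_policy_gradient`, then discard the batch as in step 4.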
Empirical Results: GridWorld (discrete states and actions).
Empirical Results: Cartpole (continuous states, discrete actions).
Related Work
1. Expected SARSA (van Seijen et al. 2009).
2. Expected Policy Gradients (Ciosek and Whiteson 2018).
3. Estimated Propensity Scores (Hirano et al. 2003, Li et al. 2015).
4. Many related approaches outside of RL and bandits:
• Black-box importance sampling (Liu and Lee 2017), Bayesian Monte Carlo (Ghahramani and Rasmussen 2003).
1. Any Monte Carlo method will have sampling error with finite data.
2. Sampling error can slow down learning in policy gradient methods.
3. We introduced the sampling error corrected policy gradient estimator to address this problem.
4. A similar approach can be used for other Monte Carlo estimators.
• For example, on- and off-policy policy evaluation: Josiah Hanna, Scott Niekum, Peter Stone (to appear, ICML 2019).
Open Questions
1. Finite-sample bias / variance analysis.
2. Correcting sampling error in online RL methods.
Thank you! Questions? jphanna@cs.utexas.edu
This is not a blank slide.