  1. Safe Reinforcement Learning Philip S. Thomas Stanford CS234: Reinforcement Learning, Guest Lecture May 24, 2017

  2. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  3. What does it mean for a reinforcement learning algorithm to be safe?

  4. Changing the objective • [Figure: rewards along two example trajectories. Policy 1: +20, +20, -50, +0, +20, +20, +20. Policy 2: +0, +0, +0, +0, +0, +0, +20]

  5. Changing the objective • Policy 1: • Reward = 0 with probability 0.999999 • Reward = $10^9$ with probability $1 - 0.999999$ • Expected reward approximately 1000 • Policy 2: • Reward = 999 with probability 0.5 • Reward = 1000 with probability 0.5 • Expected reward 999.5
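
As a quick check of the arithmetic on this slide, the short Python sketch below (mine, not from the lecture) reproduces the two expected rewards and makes the contrast explicit: Policy 1 has the higher mean, yet it returns 0 on almost every run, while Policy 2 never returns less than 999.

```python
# Each policy is a list of (reward, probability) outcomes.
policy_1 = [(0.0, 0.999999), (1e9, 1 - 0.999999)]
policy_2 = [(999.0, 0.5), (1000.0, 0.5)]

def expected_reward(outcomes):
    return sum(r * p for r, p in outcomes)

print(expected_reward(policy_1))  # ~1000.0, but the reward is 0 with probability 0.999999
print(expected_reward(policy_2))  # 999.5, and the reward is never below 999
```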

  6. Another notion of safety

  7. Another notion of safety (Munos et al.)

  8. Another notion of safety

  9. The Problem • If you apply an existing method, do you have confidence that it will work?

  10. Reinforcement learning successes

  11. A property of many real applications • Deploying “bad” policies can be costly or dangerous.

  12. Deploying bad policies can be costly

  13. Deploying bad policies can be dangerous

  14. What property should a safe algorithm have? • Guaranteed to work on the first try • “I guarantee that with probability at least $1 - \epsilon$, I will not change your policy to one that is worse than the current policy.” • You get to choose $\epsilon$ • This guarantee is not contingent on the tuning of any hyperparameters

  15. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  16. Notation • Policy, $\pi$: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$ • History: $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$ • [Diagram: agent-environment loop; the agent observes state $s$ and reward $r$ from the environment and emits action $a$] • Historical data: $D = (H_1, H_2, \ldots, H_n)$ • Historical data is generated by a behavior policy, $\pi_b$ • Objective: $J(\pi) = \mathbf{E}\!\left[\sum_{t=1}^{L} \gamma^t R_t \,\middle|\, \pi\right]$
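
To keep the notation concrete, here is one possible Python representation of a history and of the discounted return whose expectation defines $J(\pi)$. The type and function names are illustrative choices, not the lecture's.

```python
from typing import List, Tuple

# One history H: a sequence of (state, action, reward) triples (s_t, a_t, r_t).
History = List[Tuple[int, int, float]]

def discounted_return(h: History, gamma: float) -> float:
    """Sum of gamma^t * r_t over the history; J(pi) is the expectation of this
    quantity when the history is generated by policy pi."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(h, start=1))

# Historical data D is then simply a list of histories: List[History].
```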

  17. Safe reinforcement learning algorithm • Reinforcement learning algorithm, $a$ • Historical data, $D$, which is a random variable • Policy produced by the algorithm, $a(D)$, which is a random variable • A safe reinforcement learning algorithm, $a$, satisfies: $\Pr\!\big(J(a(D)) \geq J(\pi_b)\big) \geq 1 - \epsilon$ or, in general: $\Pr\!\big(J(a(D)) \geq J_{\min}\big) \geq 1 - \epsilon$

  18. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  19. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$ • High-confidence off-policy policy evaluation (HCOPE) • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \epsilon$ confidence lower bound on $J(\pi_e)$ • Safe policy improvement (SPI) • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$

  20. Off-policy policy evaluation (OPE) • [Diagram: historical data, $D$, and a proposed policy, $\pi_e$, go in; an estimate of $J(\pi_e)$ comes out]

  21. Importance sampling (Intuition) • Reminder: • History, $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$ • Objective, $J(\pi_e) = \mathbf{E}\!\left[\sum_{t=1}^{L} \gamma^t R_t \,\middle|\, \pi_e\right]$ • Idea: reweight the discounted return of each history in $D$ by how likely that history is under the evaluation policy, $\pi_e$, relative to the behavior policy, $\pi_b$: $\frac{1}{n} \sum_{j=1}^{n} w_j \sum_{t=1}^{L} \gamma^t R_t^{(j)} \approx J(\pi_e)$, where $w_j = \frac{\Pr(H_j \mid \pi_e)}{\Pr(H_j \mid \pi_b)} = \prod_{t=1}^{L} \frac{\pi_e(a_t^{(j)} \mid s_t^{(j)})}{\pi_b(a_t^{(j)} \mid s_t^{(j)})}$ (the ratio of history probabilities; the unknown transition probabilities cancel, so only the two policies are needed)
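
A minimal sketch of the importance weight for a single history, assuming $\pi_e$ and $\pi_b$ are available as functions returning action probabilities (the function names are illustrative):

```python
def importance_weight(history, pi_e, pi_b):
    """Product over time steps of pi_e(a_t | s_t) / pi_b(a_t | s_t).

    history: list of (state, action, reward) triples.
    pi_e, pi_b: callables mapping (action, state) -> probability.
    The environment's transition probabilities appear in both the numerator and
    denominator of Pr(H | pi_e) / Pr(H | pi_b), so they cancel and are never needed.
    """
    w = 1.0
    for s, a, _ in history:
        w *= pi_e(a, s) / pi_b(a, s)
    return w
```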

  22. Importance sampling (History) • Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278 • Let $X = 0$ with probability $1 - 10^{-10}$ and $X = 10^{10}$ with probability $10^{-10}$ • $\mathbf{E}[X] = 1$ • A Monte Carlo estimate from $n \ll 10^{10}$ samples of $X$ is almost always zero • Idea: sample $X$ from some other distribution and use importance sampling to “correct” the estimate • Can produce lower-variance estimates • Josiah Hanna et al., ICML 2017 (to appear)
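
The Kahn and Marshall example is easy to simulate. The sketch below is my own, with an arbitrary proposal distribution that samples the rare outcome half the time; it shows the naive Monte Carlo estimate collapsing to zero while the importance-sampled estimate stays near $\mathbf{E}[X] = 1$.

```python
import random

def monte_carlo_estimate(n):
    """Sample X directly: X = 1e10 w.p. 1e-10, else 0. For n << 1e10 this is almost always 0."""
    return sum(1e10 if random.random() < 1e-10 else 0.0 for _ in range(n)) / n

def importance_sampling_estimate(n):
    """Sample from a proposal q that picks the rare outcome with probability 0.5,
    then reweight each sample by p(x) / q(x)."""
    total = 0.0
    for _ in range(n):
        if random.random() < 0.5:
            x, p, q = 1e10, 1e-10, 0.5      # rare outcome, now sampled often
        else:
            x, p, q = 0.0, 1 - 1e-10, 0.5   # common outcome
        total += (p / q) * x
    return total / n

print(monte_carlo_estimate(10_000))          # almost surely 0.0
print(importance_sampling_estimate(10_000))  # close to E[X] = 1
```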

  23. Importance sampling (History, continued) • Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann

  24. Importance sampling (Proof) • Estimate $\mathbf{E}_p[f(X)]$ given a sample of $X \sim q$ • Let $P = \operatorname{supp}(p)$, $Q = \operatorname{supp}(q)$, and $F = \operatorname{supp}(f)$ • Importance sampling estimate: $\frac{p(X)}{q(X)} f(X)$ • $\mathbf{E}_q\!\left[\frac{p(X)}{q(X)} f(X)\right] = \sum_{x \in Q} q(x)\,\frac{p(x)}{q(x)}\,f(x) = \sum_{x \in P \cap Q} p(x) f(x) = \sum_{x \in P} p(x) f(x) - \sum_{x \in P \setminus Q} p(x) f(x)$

  25. Importance sampling (Proof) • Assume $P \subseteq Q$ (the assumption can be relaxed to requiring only that $f(x) = 0$ wherever $p(x) > 0$ and $q(x) = 0$) • $\mathbf{E}_q\!\left[\frac{p(X)}{q(X)} f(X)\right] = \sum_{x \in P} p(x) f(x) - \sum_{x \in P \setminus Q} p(x) f(x) = \sum_{x \in P} p(x) f(x) = \mathbf{E}_p[f(X)]$ • Importance sampling is an unbiased estimator of $\mathbf{E}_p[f(X)]$

  26. Importance sampling (Proof) • Assume $f(x) \geq 0$ for all $x$ • $\mathbf{E}_q\!\left[\frac{p(X)}{q(X)} f(X)\right] = \sum_{x \in P} p(x) f(x) - \sum_{x \in P \setminus Q} p(x) f(x) \leq \sum_{x \in P} p(x) f(x) = \mathbf{E}_p[f(X)]$ • Importance sampling is a negatively biased estimator of $\mathbf{E}_p[f(X)]$
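
A tiny numeric check of the negative-bias claim, using made-up distributions where $\operatorname{supp}(p) \not\subseteq \operatorname{supp}(q)$ and $f \geq 0$: the importance-sampled expectation comes out strictly below the true value, because the mass of $p$ outside $\operatorname{supp}(q)$ is simply never seen.

```python
# p puts mass on {0, 1}; q only on {0}; f >= 0 everywhere.
p = {0: 0.5, 1: 0.5}
q = {0: 1.0}
f = lambda x: x + 1.0

true_value = sum(p[x] * f(x) for x in p)                      # E_p[f(X)] = 1.5
is_expectation = sum(q[x] * (p[x] / q[x]) * f(x) for x in q)  # E_q[(p/q) f] = 0.5

print(true_value, is_expectation)  # 1.5 0.5 -- the estimator is biased downward
```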

  27. Importance sampling (Reminder) • $\mathrm{IS}(D) = \frac{1}{n} \sum_{j=1}^{n} \left( \prod_{t=1}^{L} \frac{\pi_e(a_t^{(j)} \mid s_t^{(j)})}{\pi_b(a_t^{(j)} \mid s_t^{(j)})} \right) \sum_{t=1}^{L} \gamma^t R_t^{(j)}$ • $\mathbf{E}[\mathrm{IS}(D)] = J(\pi_e)$
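
Putting the pieces together, here is a minimal sketch of the IS estimator above, assuming histories are lists of (state, action, reward) triples and the two policies are callables; all names are illustrative.

```python
def is_estimate(histories, pi_e, pi_b, gamma):
    """IS(D): average over histories of (importance weight) x (discounted return).
    Unbiased for J(pi_e) whenever pi_b gives nonzero probability to every action
    that pi_e might take."""
    total = 0.0
    for h in histories:
        w, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(h, start=1):
            w *= pi_e(a, s) / pi_b(a, s)   # per-step likelihood ratio
            g += gamma ** t * r            # discounted return of this history
        total += w * g
    return total / len(histories)
```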

  28. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$ • High-confidence off-policy policy evaluation (HCOPE) • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \epsilon$ confidence lower bound on $J(\pi_e)$ • Safe policy improvement (SPI) • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$

  29. High-confidence off-policy policy evaluation (HCOPE) • [Diagram: historical data, $D$, a proposed policy, $\pi_e$, and a probability, $1 - \epsilon$, go in; a $1 - \epsilon$ confidence lower bound on $J(\pi_e)$ comes out]

  30. Hoeffding’s inequality • Let $X_1, \ldots, X_n$ be $n$ independent, identically distributed random variables such that $X_i \in [0, b]$ • Then with probability at least $1 - \epsilon$: $\mathbf{E}[X_j] \geq \frac{1}{n} \sum_{j=1}^{n} X_j - b \sqrt{\frac{\ln(1/\epsilon)}{2n}}$ • Apply this with $X_j = w_j \sum_{t=1}^{L} \gamma^t R_t^{(j)}$, the importance sampling estimate of $J(\pi_e)$ from the $j$-th history
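
The lower bound itself is a one-liner. This sketch assumes the per-history importance sampling estimates have already been computed and that $b$ is a known upper bound on them.

```python
import math

def hoeffding_lower_bound(xs, b, epsilon):
    """1 - epsilon confidence lower bound on E[X] for i.i.d. samples xs, each in [0, b]."""
    n = len(xs)
    return sum(xs) / n - b * math.sqrt(math.log(1.0 / epsilon) / (2.0 * n))
```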

  31. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$ • High-confidence off-policy policy evaluation (HCOPE) • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \epsilon$ confidence lower bound on $J(\pi_e)$ • Safe policy improvement (SPI) • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$

  32. Safe policy improvement (SPI) • [Diagram: historical data, $D$, and a probability, $1 - \epsilon$, go in; a new policy, $\pi$, or “No Solution Found” comes out]

  33. Safe policy improvement (SPI) • [Diagram: the historical data is split into a training set (20%), used to select a candidate policy, $\pi$, and a testing set (80%), used for the safety test: is the $1 - \epsilon$ confidence lower bound on $J(\pi)$ larger than $J(\pi_{\mathrm{cur}})$?]
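
A sketch of the whole SPI loop described on this slide, reusing importance sampling and Hoeffding's inequality. The `propose_candidate` routine (how the training set is turned into a candidate policy) is left abstract, and $b$ must upper-bound the per-history importance sampling estimates; both are my placeholders, not details from the slide.

```python
import math
import random

def per_history_is(h, pi_e, pi_b, gamma):
    """Importance sampling estimate of J(pi_e) from one history of (s, a, r) triples."""
    w, g = 1.0, 0.0
    for t, (s, a, r) in enumerate(h, start=1):
        w *= pi_e(a, s) / pi_b(a, s)
        g += gamma ** t * r
    return w * g

def hoeffding_lower_bound(xs, b, epsilon):
    """Same helper as in the Hoeffding sketch above."""
    n = len(xs)
    return sum(xs) / n - b * math.sqrt(math.log(1.0 / epsilon) / (2.0 * n))

def safe_policy_improvement(histories, pi_b, j_current, gamma, b, epsilon, propose_candidate):
    """Split the data, pick a candidate policy on 20%, and safety-test it on the other 80%."""
    random.shuffle(histories)
    k = int(0.2 * len(histories))
    train, test = histories[:k], histories[k:]

    pi_c = propose_candidate(train)                               # candidate selection
    estimates = [per_history_is(h, pi_c, pi_b, gamma) for h in test]

    if hoeffding_lower_bound(estimates, b, epsilon) > j_current:  # safety test
        return pi_c
    return None                                                   # "No Solution Found"
```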

  34. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$ • High-confidence off-policy policy evaluation (HCOPE) • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \epsilon$ confidence lower bound on $J(\pi_e)$ • Safe policy improvement (SPI) • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$ • WON’T WORK
