Exploration Conscious Reinforcement Learning Revisited
Lior Shani*, Yonathan Efroni*, Shie Mannor
Technion, Israel Institute of Technology


  1. Exploration Conscious Reinforcement Learning Revisited
     Lior Shani*, Yonathan Efroni*, Shie Mannor
     Technion, Israel Institute of Technology

  2. Why?
     • To learn a good policy, an RL agent must explore!
     • However, exploration can cause hazardous behavior during training.
     (Cartoon agent: "I LOVE ε-GREEDY")

  3. Why? (continued)
     • To learn a good policy, an RL agent must explore!
     • However, exploration can cause hazardous behavior during training.
     (Speech bubbles: "Damn you, Exploration!" / "I LOVE ε-GREEDY")

  4. Exploration Conscious Reinforcement Learning
     • Objective: find the optimal policy knowing that exploration might occur.
     • For example, ε-greedy exploration (β = ε):
       $\pi^{*}_{\beta} \in \operatorname{argmax}_{\pi \in \Pi} \; \mathbb{E}^{(1-\beta)\pi + \beta\pi_{1}} \left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right]$
       where $\pi_{1}$ is the fixed exploration policy (uniform for ε-greedy).

  5. Exploration Conscious Reinforcement Learning (continued)
     • Objective: find the optimal policy knowing that exploration might occur.
     • For example, ε-greedy exploration (β = ε):
       $\pi^{*}_{\beta} \in \operatorname{argmax}_{\pi \in \Pi} \; \mathbb{E}^{(1-\beta)\pi + \beta\pi_{1}} \left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right]$
     • Solving the Exploration-Conscious problem = solving an MDP.
     • We describe a bias-error sensitivity tradeoff in β.
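
A minimal sketch (Python; the helper name and the uniform choice of π₁ are assumptions, matching the ε-greedy example above) of the action rule this objective anticipates: keep the policy's chosen action with probability 1 − β, otherwise draw from the exploration policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_action(chosen_action: int, n_actions: int, beta: float) -> int:
    """Act with the mixture (1 - beta) * pi + beta * pi_1.

    Keeps pi's chosen action w.p. 1 - beta; otherwise draws from the
    exploration policy pi_1, taken here to be uniform as in epsilon-greedy.
    """
    if rng.random() < beta:
        return int(rng.integers(n_actions))  # exploratory draw from pi_1
    return chosen_action                     # the policy's own action
```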

  6. Exploration Conscious Reinforcement Learning
     • Objective: find the optimal policy knowing that exploration might occur.
     ("I'm Exploration Conscious")

  7. Fixed Exploration Schemes (e.g. ε-greedy)
     Step 1: Choose greedy action $a^{greedy}$.
     • $a^{greedy} \in \operatorname{argmax}_{a} Q^{\pi}_{\beta}(s, a)$

  8. Fixed Exploration Schemes (e.g. ε-greedy)
     Choose greedy action $a^{greedy}$ → Draw exploratory action $a^{act}$.
     • $a^{greedy} \in \operatorname{argmax}_{a} Q^{\pi}_{\beta}(s, a)$
     • For β-greedy: $a^{act} = a^{greedy}$ w.p. $1-\beta$; otherwise $a^{act} \sim \pi_{1}$.

  9. Fixed Exploration Schemes (e.g. ε-greedy)
     Choose greedy action $a^{greedy}$ → Draw exploratory action $a^{act}$ → Act $a^{act}$.
     • $a^{greedy} \in \operatorname{argmax}_{a} Q^{\pi}_{\beta}(s, a)$
     • For β-greedy: $a^{act} = a^{greedy}$ w.p. $1-\beta$; otherwise $a^{act} \sim \pi_{1}$.

  10. Fixed Exploration Schemes (e.g. ε-greedy)
      Choose greedy action $a^{greedy}$ → Draw exploratory action $a^{act}$ → Act → Receive $r, s'$.
      • $a^{greedy} \in \operatorname{argmax}_{a} Q^{\pi}_{\beta}(s, a)$
      • For β-greedy: $a^{act} = a^{greedy}$ w.p. $1-\beta$; otherwise $a^{act} \sim \pi_{1}$.

  11. Fixed Exploration Schemes (e.g. ε-greedy)
      Choose greedy action $a^{greedy}$ → Draw exploratory action $a^{act}$ → Act → Receive $r, s'$.
      • Normally used information: $(s, a^{act}, r, s')$

  12. Fixed Exploration Schemes (e.g. ε-greedy)
      Choose greedy action $a^{greedy}$ → Draw exploratory action $a^{act}$ → Act → Receive $r, s'$.
      • Normally used information: $(s, a^{act}, r, s')$
      • Using information about the exploration process: $(s, a^{greedy}, a^{act}, r, s')$ (sketched below)
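
To make the extra bookkeeping concrete, here is a hedged sketch of one interaction step that records $a^{greedy}$ alongside the usual transition. It continues the sketch above and assumes a tabular Q indexed as Q[state, action] and a Gym-style env.step returning (next_state, reward, done, info); all names are illustrative.

```python
def interaction_step(env, Q, s, beta):
    """One step of a fixed exploration scheme, keeping the greedy action."""
    a_greedy = int(np.argmax(Q[s]))                     # choose greedy action
    a_act = mixture_action(a_greedy, Q.shape[1], beta)  # draw exploratory action
    s_next, r, done, _ = env.step(a_act)                # act; receive r, s'
    # Exploration-conscious transition: (s, a_greedy, a_act, r, s')
    return s, a_greedy, a_act, r, s_next, done
```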

  13. Two Approaches: the Expected approach
      1. Update $Q^{\pi}_{\beta}(s_t, a^{act}_t)$.
      2. Expect that the agent might explore in the next state:
         $Q^{\pi}_{\beta}(s_t, a^{act}_t) \leftarrow Q^{\pi}_{\beta}(s_t, a^{act}_t) + \eta \left( r_t + \gamma \, \mathbb{E}_{a \sim (1-\beta)\pi + \beta\pi_{1}} \left[ Q^{\pi}_{\beta}(s_{t+1}, a) \right] - Q^{\pi}_{\beta}(s_t, a^{act}_t) \right)$

  14. Two Approaches: the Expected approach (continued)
      1. Update $Q^{\pi}_{\beta}(s_t, a^{act}_t)$.
      2. Expect that the agent might explore in the next state:
         $Q^{\pi}_{\beta}(s_t, a^{act}_t) \leftarrow Q^{\pi}_{\beta}(s_t, a^{act}_t) + \eta \left( r_t + \gamma \, \mathbb{E}_{a \sim (1-\beta)\pi + \beta\pi_{1}} \left[ Q^{\pi}_{\beta}(s_{t+1}, a) \right] - Q^{\pi}_{\beta}(s_t, a^{act}_t) \right)$
      • Calculating expectations can be hard.
      • Requires sampling in the continuous case!
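
A tabular sketch of the Expected update under the same assumptions (η and γ as step size and discount; uniform π₁, with π taken greedy with respect to the current Q): the bootstrap target averages next-state values under the β-mixture, in the spirit of Expected SARSA.

```python
def expected_update(Q, s, a_act, r, s_next, beta, gamma, eta):
    """Expected approach: update Q[s, a_act] toward the mixture target."""
    greedy_val = np.max(Q[s_next])    # next-state value under the greedy pi
    uniform_val = np.mean(Q[s_next])  # next-state value under uniform pi_1
    target = r + gamma * ((1.0 - beta) * greedy_val + beta * uniform_val)
    Q[s, a_act] += eta * (target - Q[s, a_act])
```

In a continuous action space this expectation has no such closed form, which is the sampling burden the slide points out.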

  15. Two Approaches: the Surrogate approach
      • Exploration is incorporated into the environment!
      1. Update $Q^{\pi}_{\beta}(s_t, a^{greedy}_t)$.
      2. The reward and next state $r_t, s_{t+1}$ are given by the acted action $a^{act}_t$:
         $Q^{\pi}_{\beta}(s_t, a^{greedy}_t) \leftarrow Q^{\pi}_{\beta}(s_t, a^{greedy}_t) + \eta \left( r_t + \gamma \, Q^{\pi}_{\beta}(s_{t+1}, a^{greedy}_{t+1}) - Q^{\pi}_{\beta}(s_t, a^{greedy}_t) \right)$

  16. Two Approaches: the Surrogate approach (continued)
      • Exploration is incorporated into the environment!
      1. Update $Q^{\pi}_{\beta}(s_t, a^{greedy}_t)$.
      2. The reward and next state $r_t, s_{t+1}$ are given by the acted action $a^{act}_t$:
         $Q^{\pi}_{\beta}(s_t, a^{greedy}_t) \leftarrow Q^{\pi}_{\beta}(s_t, a^{greedy}_t) + \eta \left( r_t + \gamma \, Q^{\pi}_{\beta}(s_{t+1}, a^{greedy}_{t+1}) - Q^{\pi}_{\beta}(s_t, a^{greedy}t) \right)$
      • NO NEED TO SAMPLE!
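
And a matching tabular sketch of the Surrogate update (same assumed names; the choice of the next greedy action as the bootstrap action follows the slide's $a^{greedy}_{t+1}$): the table is indexed by greedy actions while r and s' come from the acted action, so no expectation over π₁ is computed and nothing needs to be sampled.

```python
def surrogate_update(Q, s, a_greedy, r, s_next, a_greedy_next, gamma, eta):
    """Surrogate approach: exploration is folded into the environment.

    r and s_next were produced by the acted action a_act, but the update
    is keyed on the greedy actions, so the exploration policy never
    enters the target explicitly.
    """
    target = r + gamma * Q[s_next, a_greedy_next]
    Q[s, a_greedy] += eta * (target - Q[s, a_greedy])
```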

  17. Deep RL Experimental Results
      (Plots comparing performance during training and at evaluation.)

  18. Summary
      • We define Exploration Conscious RL and analyze its properties.
      • Exploration Conscious RL can improve performance in both the training and evaluation regimes.
      • Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily improve a variety of RL algorithms.

  19. Summary
      • We define Exploration Conscious RL and analyze its properties.
      • Exploration Conscious RL can improve performance in both the training and evaluation regimes.
      • Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily improve a variety of RL algorithms.
      SEE YOU AT POSTER #90
