Exploration Conscious Reinforcement Learning Revisited
Lior Shani*, Yonathan Efroni*, Shie Mannor
Technion – Israel Institute of Technology
Why?
• To learn a good policy, an RL agent must explore!
• However, it can cause hazardous behavior during training.
"Damn you, Exploration!" "I LOVE ε-GREEDY"
Shani, Efroni & Mannor, Exploration Conscious Reinforcement Learning Revisited, 12-Jun-19
Exploration Conscious Reinforcement Learning
• Objective: Find the optimal policy knowing that exploration might occur.
• For example, ε-greedy exploration (α = ε):

  π*_α ∈ argmax_{π ∈ Π} E^{(1−α)π + αν} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]

• Solving the exploration-conscious problem = solving an MDP.
• We describe a bias-error sensitivity tradeoff in α.
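Since the exploration-conscious problem is itself an MDP, it can be solved with standard dynamic programming. A minimal sketch (not from the talk: the toy transition tensor `P`, reward matrix `R`, and the choice of a uniform exploration policy ν are all assumptions for illustration) of tabular value iteration with the α-adjusted Bellman operator:

```python
import numpy as np

# Toy 2-state, 2-action MDP; P[s, a, s'] and R[s, a] are made up for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha = 0.9, 0.1          # discount factor; probability of exploring
n_states, n_actions = R.shape

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # Next-state value under the mixture (1 - alpha) * greedy + alpha * uniform
    v_next = (1 - alpha) * Q.max(axis=1) + alpha * Q.mean(axis=1)
    Q = R + gamma * P @ v_next   # alpha-adjusted Bellman operator

greedy_policy = Q.argmax(axis=1)
```

Because the α-adjusted operator is still a γ-contraction, the iteration converges to the optimal exploration-conscious Q-function.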
Fixed Exploration Schemes (e.g. ε-greedy)

Choose greedy action a_greedy → Draw exploratory action a_act → Act a_act → Receive r, s′

• a_greedy ∈ argmax_a Q*_α(s, a)
• For α-greedy: a_act ← a_greedy w.p. 1 − α, otherwise a_act ∼ ν.
• Normally used information: (s, a_act, r, s′)
• Using information about the exploration process: (s, a_greedy, a_act, r, s′)
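The interaction step above can be sketched as follows (assumed names and a tabular Q; the uniform draw over actions stands in for the exploration policy ν). Returning both actions is what lets a learner use the richer tuple (s, a_greedy, a_act, r, s′):

```python
import numpy as np

def draw_action(Q, s, alpha, rng):
    """One alpha-greedy interaction step: choose greedy, then maybe explore."""
    a_greedy = int(np.argmax(Q[s]))            # a_greedy in argmax_a Q(s, a)
    if rng.random() < alpha:                   # explore w.p. alpha ...
        a_act = int(rng.integers(Q.shape[1]))  # ... uniformly over actions
    else:                                      # otherwise act greedily
        a_act = a_greedy
    return a_greedy, a_act

rng = np.random.default_rng(0)
Q = np.array([[0.2, 1.0], [0.5, 0.1]])
a_greedy, a_act = draw_action(Q, s=0, alpha=0.1, rng=rng)
```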
Two Approaches – Expected Approach

1. Update Q_α(s_t, a_t^act).
2. Expect that the agent might explore in the next state:

  Q_α(s_t, a_t^act) += η ( r_t + γ E_{a ∼ (1−α)greedy + αν} [ Q_α(s_{t+1}, a) ] − Q_α(s_t, a_t^act) )

• Calculating expectations can be hard.
• Requires sampling in the continuous case!
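In the tabular case the expectation is cheap to compute in closed form, which the following sketch uses (assumed names; a uniform exploration policy stands in for ν). The update is applied at the acted action, and the bootstrap target averages over the next-action mixture:

```python
import numpy as np

def expected_update(Q, s, a_act, r, s_next, alpha, gamma, eta):
    """One expected-approach update at (s, a_act)."""
    # Expected next-state value under (1 - alpha) * greedy + alpha * uniform
    v_next = (1 - alpha) * Q[s_next].max() + alpha * Q[s_next].mean()
    Q[s, a_act] += eta * (r + gamma * v_next - Q[s, a_act])

Q = np.zeros((2, 2))
expected_update(Q, s=0, a_act=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9, eta=0.5)
```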
Two Approaches – Surrogate Approach

• Exploration is incorporated into the environment!

1. Update Q_α(s_t, a_t^greedy).
2. The reward and next state (r_t, s_{t+1}) are generated by the acted action a_t^act.

  Q_α(s_t, a_t^greedy) += η ( r_t + γ Q_α(s_{t+1}, a_{t+1}^greedy) − Q_α(s_t, a_t^greedy) )

• NO NEED TO SAMPLE!
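A sketch of the surrogate update (assumed names, tabular setting). The reward and next state come from the acted action, but the update sits at the greedy action and bootstraps from the next greedy action, so no expectation over the exploration distribution is needed:

```python
import numpy as np

def surrogate_update(Q, s, a_greedy, r, s_next, a_next_greedy, gamma, eta):
    """One surrogate-approach update at (s, a_greedy); r and s_next were
    generated by the (possibly exploratory) acted action."""
    target = r + gamma * Q[s_next, a_next_greedy]
    Q[s, a_greedy] += eta * (target - Q[s, a_greedy])

Q = np.zeros((2, 2))
surrogate_update(Q, s=0, a_greedy=0, r=1.0, s_next=1, a_next_greedy=1,
                 gamma=0.9, eta=0.5)
```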
Deep RL Experimental Results
[Figures: agent performance during training and at evaluation]
Summary
• We define Exploration Conscious RL and analyze its properties.
• Exploration Conscious RL can improve performance in both the training and evaluation regimes.
• Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily improve a variety of RL algorithms.

SEE YOU AT POSTER #90