Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
(Talk: Exploration–Exploitation in RL with a Misspecified State Space)
Ronan Fruit †, Matteo Pirotta ∗, Alessandro Lazaric ∗
† SequeL – INRIA Lille   ∗ FAIR – Facebook Paris
NeurIPS 2018, Montreal, December 5th
Misspecified states: an example — Breakout [Mnih et al., 2015]

Intuitive state space: the set of plausible configurations of the wall, the ball, and the paddle.
- The game starts in an initial state s1.
- Some plausible configurations are reached after playing for a while...
- ...but other plausible configurations are not reachable from s1: they can never be observed!

Misspecified state space = there exist states that are non-observable from the initial state and that are difficult to exclude explicitly from the state space.
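To make the definition concrete, here is a minimal sketch (not from the talk; all names are illustrative) of a declared state space that contains non-observable states. If the dynamics were known, the unreachable states could be pruned by a simple search, as below; the learner, however, only observes transitions online, which is exactly why such states are difficult to exclude explicitly.

```python
from collections import deque

# Declared state space: the designer lists all "plausible" states, but u1
# and u2 can never be observed starting from s1.
S = ["s1", "s2", "u1", "u2"]
A = ["a0", "a1"]
# Deterministic transitions (state, action) -> next state. No action taken
# from s1 or s2 ever leads to u1 or u2.
P = {
    ("s1", "a0"): "s1", ("s1", "a1"): "s2",
    ("s2", "a0"): "s1", ("s2", "a1"): "s2",
    ("u1", "a0"): "u2", ("u1", "a1"): "u1",
    ("u2", "a0"): "u1", ("u2", "a1"): "u2",
}

def reachable_from(s0):
    """States reachable from s0 under the (here, known) dynamics."""
    seen, queue = {s0}, deque([s0])
    while queue:
        s = queue.popleft()
        for a in A:
            nxt = P[(s, a)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable_from("s1")))  # ['s1', 's2']: u1 and u2 are misspecified
```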
Why is exploration more challenging with a misspecified state space?

All existing methods known to efficiently balance exploration and exploitation in RL with theoretical guarantees rely on the optimism-in-the-face-of-uncertainty principle. All such methods fail to learn when the state space is misspecified.

[Figure: Example 1 of Ortner [2008]. The only reachable state s has a self-loop action a1 with reward r1 = 1/2 and an action a0 with reward r0 = 0. The declared state space also contains a state s′ that is not reachable from s. Optimism assigns a0 a plausible transition from s to s′ (marked "?") and an optimistic reward r0+ = 1 = rmax in s′.]

Problem: the action played keeps changing, a0 half of the time and a1 the other half ⇒ linear regret!

Why not simply ignore s′? Because if s′ turns out to be reachable after all, ignoring it also yields linear regret.
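To see the failure concretely, here is a heavily simplified simulation sketch of an optimistic learner on this example (assumptions: a UCRL-style Hoeffding confidence radius on the unobserved transition, the average-reward criterion, and no episode schedule; this is not the exact algorithm from the talk). Because the plausible probability of reaching s′ never shrinks to zero, the optimistic gain of a0 stays at rmax = 1; in this simplification the learner plays a0 at every step, while the slide's UCRL run splits its plays between a0 and a1, but the conclusion is the same either way: per-step regret never vanishes, i.e., linear regret.

```python
import math

T = 10_000
r = {"a0": 0.0, "a1": 0.5}   # true mean rewards of the two self-loops on s
rmax = 1.0                   # optimistic reward granted in the phantom state s'
n = {"a0": 0, "a1": 0}       # visit counts for (s, a)
regret = 0.0                 # the optimal gain is r["a1"] = 1/2

for t in range(1, T + 1):
    # Hoeffding-style radius on p(s' | s, a0): it shrinks like sqrt(log t / n)
    # but never reaches 0, since reaching s' is never contradicted by data.
    p_escape = min(1.0, math.sqrt(2 * math.log(t + 1) / max(1, n["a0"])))
    # Average-reward optimism: any plausible p_escape > 0 lets the optimistic
    # policy eventually reach s' and collect rmax forever, so the optimistic
    # gain of a0 is rmax = 1 > 1/2 and the learner never commits to a1.
    gain = {"a0": rmax if p_escape > 0 else r["a0"], "a1": r["a1"]}
    a = max(gain, key=gain.get)
    n[a] += 1
    regret += r["a1"] - r[a]

print(n, f"regret = {regret:.0f} ~ T/2")  # regret grows linearly with T
```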