Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning
Alberto Maria Metelli, Flavio Mazzolini, Lorenzo Bisi, Luca Sabbioni, Marcello Restelli
Thirty-seventh International Conference on Machine Learning (ICML), July 2020
Motivations
Problem: How to select the control frequency for a system?
- Higher frequencies: more control opportunities.
- Lower frequencies: lower sample complexity.
Trade-off: control opportunities vs. sample complexity.
Research question: Can we exploit this trade-off to find an optimal control frequency?
Control Frequency and Action Persistence
Idea: persist each action for k steps.
- Continuous-time MDP M_0: control time-step 0, control frequency ∞.
- Time discretization → discrete-time MDP M_Δt: control time-step Δt, control frequency f = 1/Δt.
- Action persistence k → k-persistent MDP M_{kΔt}: control time-step kΔt, control frequency f/k.
Action persistence as a form of environment configurability (Metelli et al., 2018).
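As a concrete (unofficial) illustration of the idea, the sketch below wraps a generic discrete-time environment and persists each chosen action for k base steps, summing the γ-discounted rewards collected in between. The reset()/step() interface is an assumption made for the example, not something taken from the paper.

```python
class PersistedEnv:
    """Persist each action for k base steps of a discrete-time environment.

    Assumes `env` exposes reset() -> state and step(action) -> (state, reward, done);
    this simplified interface is a hypothetical stand-in for the usual RL APIs.
    """

    def __init__(self, env, k, gamma):
        self.env, self.k, self.gamma = env, k, gamma

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, discount, done, state = 0.0, 1.0, False, None
        for _ in range(self.k):
            state, reward, done = self.env.step(action)  # repeat the same action
            total_reward += discount * reward            # discounted sum over the k steps
            discount *= self.gamma
            if done:
                break
        # One wrapper step corresponds to k base steps, so an agent acting through
        # this wrapper should discount with gamma**k between its decisions.
        return state, total_reward, done
```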
Outline
1. Action persistence formalization
2. Performance loss due to persistence
3. Persistent Fitted Q-Iteration
(Slide graphics: a trajectory with persisted actions; a plot of (1 − γ^{k−1})/(1 − γ^k) as a function of k; the Persistent Fitted Q-Iteration scheme alternating the empirical operators T̂*, T̂^δ with the projection Π_F onto the function space F.)
No Action Persistence
MDP M = (S, A, P, R, γ) and policy π : S → P(A), Markovian and stationary (Puterman, 2014; Sutton and Barto, 2018).
At every time step t a fresh action is sampled: A_t ~ π(·|S_t).
(Figure: trajectory S_0, S_1, ..., S_6 with a new action A_t ~ π(·|S_t) drawn at each of t = 0, ..., 5.)
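For later contrast with the persistent case, a minimal sketch of this standard interaction, where a fresh action is sampled at every step; the env and policy interfaces are the same hypothetical ones assumed in the previous sketch.

```python
def rollout_return(env, policy, gamma, horizon):
    """Standard (non-persistent) interaction: A_t ~ pi(.|S_t) at every time step."""
    state = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)                 # a fresh action is sampled at every step
        state, reward, done = env.step(action)
        ret += discount * reward               # discounted return with factor gamma
        discount *= gamma
        if done:
            break
    return ret
```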
Action Persistence: Policy View
Keep M = (S, A, P, R, γ) and change the policy → k-persistent policy π_k:
π_{t,k}(a | h_t) = π(a | s_t)   if t mod k = 0,
π_{t,k}(a | h_t) = δ_{a_{t−1}}(a)   otherwise,
where h_t = (s_0, a_0, ..., s_{t−1}, a_{t−1}, s_t) is the history.
π_k is non-Markovian and non-stationary.
(Figure: with k = 3, A_0 ~ π(·|S_0) is persisted at t = 1, 2 and A_3 ~ π(·|S_3) is persisted at t = 4, 5.)
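A minimal sketch of the policy view: a wrapper that turns a Markovian stationary policy π into the k-persistent π_k, resampling only when t mod k = 0 and otherwise repeating the previous action. The callable base_policy interface is an assumption of the example, not the authors' code.

```python
class KPersistentPolicy:
    """Policy view: resample from pi only when t mod k == 0, otherwise repeat a_{t-1}."""

    def __init__(self, base_policy, k):
        self.base_policy = base_policy  # callable: state -> action sampled from pi(.|s)
        self.k = k
        self.t = 0
        self.last_action = None

    def reset(self):
        self.t, self.last_action = 0, None

    def act(self, state):
        if self.t % self.k == 0:
            self.last_action = self.base_policy(state)   # A_t ~ pi(.|S_t)
        # otherwise: Dirac on the previously chosen action (persistence)
        self.t += 1
        return self.last_action
```

With k = 1 the wrapper reduces to the base policy, recovering the standard non-persistent setting.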
Action Persistence: Environment View
Keep π and change the MDP → k-persistent MDP M_k = (S, A, P_k, R_k, γ^k), with
P_k(s'|s,a) = ((P^δ)^{k−1} P)(s'|s,a),
R_k(s'|s,a) = Σ_{i=0}^{k−1} γ^i ((P^δ)^i R)(s'|s,a),
and persistent state-action kernel P^δ(s', a'|s, a) = δ_{a'}(a) P(s'|s,a).
M_k has a smaller discount factor, γ^k.
(Figure: in M_k with k = 3, decisions occur only at t = 0 and t = 3: A_0 ~ π(·|S_0), A_3 ~ π(·|S_3).)
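For a finite MDP the k-persistent model can be built explicitly. The sketch below is an illustration under tabular assumptions (it uses the expected immediate reward r(s,a) rather than the full reward distribution R): it computes P_k = (P^δ)^{k−1} P and R_k(s,a) = Σ_{i=0}^{k−1} γ^i ((P^δ)^i r)(s,a) by freezing the action and powering its transition matrix.

```python
import numpy as np


def k_persistent_mdp(P, r, gamma, k):
    """Build the k-persistent MDP (P_k, R_k, gamma**k) of a finite MDP.

    P: array of shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    r: array of shape (S, A),    r[s, a]     = expected immediate reward
    Returns P_k of shape (S, A, S), R_k of shape (S, A), and the discount gamma**k.
    """
    n_states, n_actions, _ = P.shape
    P_k = np.empty_like(P)
    R_k = np.empty_like(r)
    for a in range(n_actions):
        M_a = P[:, a, :]                 # transition matrix with the action frozen
        M_pow = np.eye(n_states)         # (P^delta)^0 restricted to action a
        R_a = np.zeros(n_states)
        for i in range(k):
            R_a += (gamma ** i) * M_pow @ r[:, a]   # gamma^i (P^delta)^i r
            M_pow = M_pow @ M_a
        P_k[:, a, :] = M_pow             # M_a^k = ((P^delta)^{k-1} P) for action a
        R_k[:, a] = R_a
    return P_k, R_k, gamma ** k
```

Solving this model with discount γ^k corresponds to the environment view above: executing π in M_k is equivalent to executing the k-persistent policy π_k in M.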
Persistent Bellman Operators
MDP M (Bertsekas, 2005):
- Bellman optimal operator: (T* f)(s,a) = r(s,a) + γ ∫_S P(ds'|s,a) max_{a'∈A} f(s',a').
- T* is a γ-contraction in L∞-norm; Q* is its unique fixed point: T* Q* = Q*.
k-persistent MDP M_k:
- Persistence operator: (T^δ f)(s,a) = r(s,a) + γ ∫_S P(ds'|s,a) f(s',a).
- k-persistent Bellman operator: T*_k = (T^δ)^{k−1} T*.
- T*_k is a γ^k-contraction in L∞-norm; Q*_k is its unique fixed point: T*_k Q*_k = Q*_k.
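A tabular sketch of the two operators and of their composition T*_k = (T^δ)^{k−1} T*, using the same (S, A, S)-shaped arrays as in the previous sketch; this is illustrative code, not the authors' implementation.

```python
import numpy as np


def bellman_optimal(Q, P, r, gamma):
    """(T* f)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' f(s',a')."""
    return r + gamma * P @ Q.max(axis=1)


def bellman_persistence(Q, P, r, gamma):
    """(T^delta f)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) * f(s',a): the action persists."""
    # einsum keeps the acting action a as the evaluation action at the next state
    return r + gamma * np.einsum("sap,pa->sa", P, Q)


def k_persistent_bellman(Q, P, r, gamma, k):
    """T*_k = (T^delta)^{k-1} T*: apply T* once, then the persistence operator k-1 times."""
    Q = bellman_optimal(Q, P, r, gamma)
    for _ in range(k - 1):
        Q = bellman_persistence(Q, P, r, gamma)
    return Q
```

Since T*_k is a γ^k-contraction, iterating k_persistent_bellman from any initial Q converges to Q*_k.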