CS885 Reinforcement Learning, Module 2 (June 6, 2020)
Maximum Entropy Reinforcement Learning
University of Waterloo, CS885 Spring 2020, Pascal Poupart

References:
• Haarnoja, Tang et al. (2017). Reinforcement Learning with Deep Energy-Based Policies. ICML.
• Haarnoja, Zhou et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML.
Maximum Entropy RL
• Why do several implementations of important RL baselines (e.g., A2C, PPO) add an entropy regularizer?
• Why is maximizing entropy desirable in RL?
• What is the Soft Actor-Critic algorithm?
Reinforcement Learning: Deterministic vs. Stochastic Policies

Deterministic policies:
• There always exists an optimal deterministic policy
• Search space is smaller for deterministic than for stochastic policies
• Practitioners often prefer deterministic policies

Stochastic policies:
• Search space is continuous (helps with gradient descent)
• More robust (less likely to overfit)
• Naturally incorporate exploration
• Facilitate transfer learning
• Mitigate local optima
Encouraging Stochasticity

Standard MDP:
• States: $S$
• Actions: $A$
• Reward: $R(s,a)$
• Transition: $\Pr(s'|s,a)$
• Discount: $\gamma$

Soft MDP:
• States: $S$
• Actions: $A$
• Reward: $R(s,a) + \lambda H(\pi(\cdot|s))$
• Transition: $\Pr(s'|s,a)$
• Discount: $\gamma$
Entropy
• Entropy: a measure of uncertainty
  – Information theory: expected number of bits needed to communicate the result of a sample drawn from $q$
• $H(q) = -\sum_x q(x) \log q(x)$
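To make the two definitions above concrete, here is a minimal numpy sketch (not from the slides) of the entropy of a discrete distribution and of the entropy-augmented soft reward $R(s,a) + \lambda H(\pi(\cdot|s))$ from the previous slide; the function names are illustrative.

```python
import numpy as np

def entropy(p, eps=1e-12):
    # H(p) = -sum_x p(x) log p(x); eps guards against log(0)
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def soft_reward(r, policy_at_s, lam):
    # Soft MDP reward: R(s,a) + lambda * H(pi(.|s))
    return r + lam * entropy(policy_at_s)

# A uniform policy over 4 actions earns the largest entropy bonus
print(soft_reward(1.0, np.ones(4) / 4, lam=0.1))    # 1.0 + 0.1*log(4)
print(soft_reward(1.0, [1.0, 0.0, 0.0, 0.0], 0.1))  # 1.0 (deterministic policy, zero entropy)
```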
Optimal Policy
• Standard MDP:
  $\pi^* = \operatorname{argmax}_{\pi} \sum_{t=0}^{\infty} \gamma^t \, E_{s_t, a_t | \pi}[R(s_t, a_t)]$
• Soft MDP:
  $\pi^*_{soft} = \operatorname{argmax}_{\pi} \sum_{t=0}^{\infty} \gamma^t \, E_{s_t, a_t | \pi}[R(s_t, a_t) + \lambda H(\pi(\cdot|s_t))]$
  This is called the maximum entropy policy or the entropy-regularized policy.
Q-function
• Standard MDP:
  $Q^{\pi}(s_0, a_0) = R(s_0, a_0) + \sum_{t=1}^{\infty} \gamma^t \, E_{s_t, a_t | s_0, a_0, \pi}[R(s_t, a_t)]$
• Soft MDP:
  $Q^{\pi}_{soft}(s_0, a_0) = R(s_0, a_0) + \sum_{t=1}^{\infty} \gamma^t \, E_{s_t, a_t | s_0, a_0, \pi}[R(s_t, a_t) + \lambda H(\pi(\cdot|s_t))]$
NB: there is no entropy term with the first reward, since the first action $a_0$ is given rather than chosen according to $\pi$.
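As an illustration of the remark above, the following sketch (not from the slides; it assumes a sampled trajectory from a small discrete-action problem, and all names are hypothetical) estimates $Q^{\pi}_{soft}(s_0, a_0)$ by Monte Carlo, adding the entropy bonus only from $t = 1$ onward.

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def soft_return(rewards, policies, gamma, lam):
    """One-trajectory Monte Carlo estimate of Q_soft(s0, a0).

    rewards[t]  = R(s_t, a_t) along the sampled trajectory
    policies[t] = pi(.|s_t) as a probability vector over actions
    The entropy bonus is added only for t >= 1, since a_0 is given.
    """
    g = 0.0
    for t, r in enumerate(rewards):
        bonus = lam * entropy(policies[t]) if t >= 1 else 0.0
        g += gamma**t * (r + bonus)
    return g

# Toy example: 3-step trajectory under a uniform policy over 2 actions
print(soft_return([1.0, 1.0, 1.0], [np.ones(2) / 2] * 3, gamma=0.9, lam=0.1))
```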
Greedy Policy
• Standard MDP (deterministic policy):
  $\pi_{greedy}(s) = \operatorname{argmax}_a Q(s,a)$
• Soft MDP (stochastic policy):
  $\pi_{greedy}(\cdot|s) = \operatorname{argmax}_{\pi} \sum_a \pi(a|s) Q(s,a) + \lambda H(\pi(\cdot|s))$
  $= \frac{\exp(Q(s,\cdot)/\lambda)}{\sum_a \exp(Q(s,a)/\lambda)} = \operatorname{softmax}(Q(s,\cdot)/\lambda)$
• When $\lambda \to 0$, the softmax becomes the regular max.
Derivation
• Concave objective (can find global maximum):
  $J(\pi, Q) = \sum_a \pi(a|s) Q(s,a) + \lambda H(\pi(\cdot|s)) = \sum_a \pi(a|s) [Q(s,a) - \lambda \log \pi(a|s)]$
• Partial derivative:
  $\frac{\partial J}{\partial \pi(a|s)} = Q(s,a) - \lambda (\log \pi(a|s) + 1)$
• Setting the derivative to 0 and isolating $\pi(a|s)$ yields
  $\pi(a|s) = \exp(Q(s,a)/\lambda - 1) \propto \exp(Q(s,a)/\lambda)$
• Hence $\pi_{greedy}(\cdot|s) = \frac{\exp(Q(s,\cdot)/\lambda)}{\sum_a \exp(Q(s,a)/\lambda)} = \operatorname{softmax}(Q(s,\cdot)/\lambda)$
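A minimal numpy sketch of the resulting soft greedy policy (not part of the slides; subtracting the max before exponentiating is just a numerical stability trick):

```python
import numpy as np

def soft_greedy_policy(q_values, lam):
    # softmax(Q(s,.)/lambda), computed with the max subtracted for stability
    z = np.asarray(q_values, dtype=float) / lam
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 1.5])
for lam in [10.0, 1.0, 0.1, 0.01]:
    print(lam, soft_greedy_policy(q, lam))
# As lam -> 0 the distribution concentrates on argmax_a Q(s,a) (action 1 here);
# as lam grows it approaches the uniform distribution.
```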
Greedy Value Function
• What is the value function induced by the greedy policy?
• Standard MDP:
  $V(s) = \max_a Q(s,a)$
• Soft MDP:
  $V_{soft}(s) = \lambda H(\pi_{greedy}(\cdot|s)) + \sum_a \pi_{greedy}(a|s)\, Q_{soft}(s,a)$
  $= \lambda \log \sum_a \exp(Q_{soft}(s,a)/\lambda) = \widetilde{\max}^{\lambda}_{a}\, Q_{soft}(s,a)$   (a soft version of the max operator)
• When $\lambda \to 0$, $\widetilde{\max}^{\lambda}$ becomes the regular max.
Derivation
$V_{soft}(s) = \lambda H(\pi_{greedy}(\cdot|s)) + \sum_a \pi_{greedy}(a|s)\, Q_{soft}(s,a)$

Since $\pi_{greedy}(a|s) = \frac{\exp(Q_{soft}(s,a)/\lambda)}{\sum_{a'} \exp(Q_{soft}(s,a')/\lambda)}$, i.e., $Q_{soft}(s,a) = \lambda [\log \pi_{greedy}(a|s) + \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)]$:

$= \lambda H(\pi_{greedy}(\cdot|s)) + \sum_a \pi_{greedy}(a|s)\, \lambda [\log \pi_{greedy}(a|s) + \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)]$
$= \lambda H(\pi_{greedy}(\cdot|s)) + \lambda \sum_a \pi_{greedy}(a|s) \log \pi_{greedy}(a|s) + \lambda \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)$
$= \lambda H(\pi_{greedy}(\cdot|s)) - \lambda H(\pi_{greedy}(\cdot|s)) + \lambda \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)$
$= \lambda \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)$
$= \widetilde{\max}^{\lambda}_{a}\, Q_{soft}(s,a)$
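The identity above is easy to check numerically. A small sketch (not from the slides), assuming the softmax greedy policy and the log-sum-exp form of $V_{soft}$, with illustrative names:

```python
import numpy as np

def soft_greedy_policy(q, lam):
    z = q / lam
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def soft_value(q, lam):
    # V_soft(s) = lam * log sum_a exp(Q_soft(s,a)/lam), computed stably
    m = q.max()
    return m + lam * np.log(np.sum(np.exp((q - m) / lam)))

q = np.array([1.0, 2.0, 1.5])
lam = 0.5
pi = soft_greedy_policy(q, lam)
lhs = lam * (-np.sum(pi * np.log(pi))) + np.sum(pi * q)  # lam*H(pi) + E_pi[Q]
print(np.isclose(lhs, soft_value(q, lam)))  # True: both equal the soft value
print(soft_value(q, lam), q.max())          # soft value -> max_a Q(s,a) as lam -> 0
```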
Soft Q-Value Iteration

SoftQValueIteration(MDP, $\lambda$)
  Initialize $Q^0_{soft}$ to any Q-function
  $n \leftarrow 0$
  Repeat
    $Q^{n+1}_{soft}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, \widetilde{\max}^{\lambda}_{a'} Q^{n}_{soft}(s',a')$   $\forall s,a$
    $n \leftarrow n + 1$
  Until $\|Q^{n}_{soft} - Q^{n-1}_{soft}\|_{\infty} \le \epsilon$
  Extract policy: $\pi_{greedy}(\cdot|s) = \operatorname{softmax}(Q^{n}_{soft}(s,\cdot)/\lambda)$

Soft Bellman equation:
  $Q^*_{soft}(s,a) = R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, \widetilde{\max}^{\lambda}_{a'} Q^*_{soft}(s',a')$
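A possible tabular implementation of this algorithm, assuming a small MDP given by a reward matrix R[s,a] and a transition tensor P[s,a,s'] (these inputs and all names are assumptions for illustration, not part of the slides):

```python
import numpy as np

def soft_q_value_iteration(R, P, gamma, lam, eps=1e-8):
    """Tabular soft Q-value iteration.
    R: (S, A) rewards, P: (S, A, S) transition probabilities."""
    S, A = R.shape
    Q = np.zeros((S, A))
    while True:
        # soft max over next actions: lam * log sum_a' exp(Q(s',a')/lam), per state
        m = Q.max(axis=1, keepdims=True)
        V = (m + lam * np.log(np.exp((Q - m) / lam).sum(axis=1, keepdims=True))).ravel()
        Q_new = R + gamma * P @ V
        if np.max(np.abs(Q_new - Q)) <= eps:
            Q = Q_new
            break
        Q = Q_new
    # extract the greedy softmax policy
    z = Q / lam
    z -= z.max(axis=1, keepdims=True)
    pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return Q, pi

# Tiny random MDP
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, :] sums to 1
R = rng.uniform(size=(S, A))
Q, pi = soft_q_value_iteration(R, P, gamma=0.9, lam=0.1)
print(np.round(pi, 3))
```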
Soft Q-learning
• Q-learning based on Soft Q-Value Iteration
• Replace expectations by samples
• Represent the Q-function by a function approximator (e.g., a neural network)
• Do gradient updates based on temporal differences
Soft Q-learning (soft variant of DQN)

Initialize weights $\mathbf{w}$ and $\bar{\mathbf{w}}$ at random in $[-1, 1]$
Observe current state $s$
Loop
  Select action $a$ and execute it
  Receive immediate reward $r$
  Observe new state $s'$
  Add $(s, a, s', r)$ to experience buffer
  Sample mini-batch of experiences from buffer
  For each experience $(\hat{s}, \hat{a}, \hat{s}', \hat{r})$ in mini-batch:
    Gradient: $\frac{\partial Err}{\partial \mathbf{w}} = \left[ Q^{soft}_{\mathbf{w}}(\hat{s}, \hat{a}) - \hat{r} - \gamma\, \widetilde{\max}^{\lambda}_{\hat{a}'} Q^{soft}_{\bar{\mathbf{w}}}(\hat{s}', \hat{a}') \right] \frac{\partial Q^{soft}_{\mathbf{w}}(\hat{s}, \hat{a})}{\partial \mathbf{w}}$
  Update weights: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial Err}{\partial \mathbf{w}}$
  Update state: $s \leftarrow s'$
  Every $c$ steps, update target: $\bar{\mathbf{w}} \leftarrow \mathbf{w}$
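One way this update is commonly realized in code, sketched here with PyTorch (an assumption, not prescribed by the slides): the semi-gradient above corresponds to minimizing a squared TD error whose target uses the frozen network $\bar{\mathbf{w}}$ and the soft max computed with logsumexp. Network sizes, batch layout, and names below are illustrative.

```python
import torch
import torch.nn as nn

def soft_q_loss(q_net, target_net, batch, gamma, lam):
    """Soft Q-learning TD loss on a mini-batch.
    batch: dict with 's' (B, obs_dim), 'a' (B,), 'r' (B,), 's_next' (B, obs_dim)."""
    q_sa = q_net(batch['s']).gather(1, batch['a'].long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(batch['s_next'])                 # (B, num_actions)
        v_next = lam * torch.logsumexp(q_next / lam, dim=1)  # soft max over a'
        target = batch['r'] + gamma * v_next
    return nn.functional.mse_loss(q_sa, target)

# Toy usage with random data and a small MLP Q-network
obs_dim, num_actions, B = 3, 2, 8
q_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, num_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, num_actions))
target_net.load_state_dict(q_net.state_dict())
batch = {'s': torch.randn(B, obs_dim), 'a': torch.randint(num_actions, (B,)),
         'r': torch.randn(B), 's_next': torch.randn(B, obs_dim)}
loss = soft_q_loss(q_net, target_net, batch, gamma=0.99, lam=0.1)
loss.backward()  # an optimizer step on q_net's parameters would follow
```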
Soft Actor-Critic
• In practice, actor-critic techniques tend to perform better than Q-learning.
• Can we derive a soft actor-critic algorithm?
• Yes. Idea:
  – Critic: soft Q-function
  – Actor: (greedy) softmax policy
Soft Policy Iteration

SoftPolicyIteration(MDP, $\lambda$)
  Initialize $\pi_0$ to any policy
  $n \leftarrow 0$
  Repeat
    Policy evaluation: repeat until convergence
      $Q^{\pi_n}_{soft}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a) \left[ \sum_{a'} \pi_n(a'|s')\, Q^{\pi_n}_{soft}(s',a') + \lambda H(\pi_n(\cdot|s')) \right]$   $\forall s,a$
    Policy improvement:
      $\pi_{n+1}(a|s) \leftarrow \operatorname{softmax}(Q^{\pi_n}_{soft}(s,a)/\lambda) = \frac{\exp(Q^{\pi_n}_{soft}(s,a)/\lambda)}{\sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)}$   $\forall s,a$
    $n \leftarrow n + 1$
  Until $\|Q^{\pi_{n+1}}_{soft} - Q^{\pi_n}_{soft}\|_{\infty} \le \epsilon$
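A tabular sketch of this procedure under the same assumptions as the soft Q-value iteration example (reward matrix R[s,a], transition tensor P[s,a,s']; a fixed number of outer iterations stands in for the convergence test, and the names are illustrative):

```python
import numpy as np

def soft_policy_iteration(R, P, gamma, lam, n_iters=50, eval_eps=1e-8):
    """Tabular soft policy iteration. R: (S, A), P: (S, A, S)."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)   # start from the uniform policy
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # Policy evaluation: iterate the soft Bellman expectation backup to convergence
        while True:
            H = -np.sum(pi * np.log(pi + 1e-12), axis=1)  # entropy of pi(.|s')
            V = np.sum(pi * Q, axis=1) + lam * H          # E_pi[Q] + lam*H per state
            Q_new = R + gamma * P @ V
            if np.max(np.abs(Q_new - Q)) <= eval_eps:
                Q = Q_new
                break
            Q = Q_new
        # Policy improvement: pi_{n+1} = softmax(Q^{pi_n}_soft / lam)
        z = Q / lam
        z -= z.max(axis=1, keepdims=True)
        pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return Q, pi

rng = np.random.default_rng(1)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))
Q, pi = soft_policy_iteration(R, P, gamma=0.9, lam=0.1)
print(np.round(pi, 3))
```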
Policy Improvement

Theorem 1: Let $Q^{\pi_n}_{soft}(s,a)$ be the Q-function of $\pi_n$.
Let $\pi_{n+1}(a|s) = \operatorname{softmax}(Q^{\pi_n}_{soft}(s,a)/\lambda)$.
Then $Q^{\pi_{n+1}}_{soft}(s,a) \ge Q^{\pi_n}_{soft}(s,a)$   $\forall s,a$.

Proof: first show that
$\sum_a \pi_n(a|s)\, Q^{\pi_n}_{soft}(s,a) + \lambda H(\pi_n(\cdot|s)) \;\le\; \sum_a \pi_{n+1}(a|s)\, Q^{\pi_n}_{soft}(s,a) + \lambda H(\pi_{n+1}(\cdot|s))$,
then use this inequality to show that $Q^{\pi_{n+1}}_{soft}(s,a) \ge Q^{\pi_n}_{soft}(s,a)$   $\forall s,a$.
Inequality Derivation

$\sum_a \pi_n(a|s)\, Q^{\pi_n}_{soft}(s,a) + \lambda H(\pi_n(\cdot|s))$
$= \sum_a \pi_n(a|s) \left[ Q^{\pi_n}_{soft}(s,a) - \lambda \log \pi_n(a|s) \right]$
$= \sum_a \pi_n(a|s) \left[ \lambda \log \pi_{n+1}(a|s) + \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda) - \lambda \log \pi_n(a|s) \right]$
  (since $\pi_{n+1}(a|s) = \frac{\exp(Q^{\pi_n}_{soft}(s,a)/\lambda)}{\sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)}$)
$= \lambda \sum_a \pi_n(a|s) \left[ \log \frac{\pi_{n+1}(a|s)}{\pi_n(a|s)} + \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda) \right]$
$= -\lambda\, KL(\pi_n(\cdot|s)\,\|\,\pi_{n+1}(\cdot|s)) + \lambda \sum_a \pi_n(a|s) \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)$
$\le \lambda \sum_a \pi_n(a|s) \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)$   (since $KL \ge 0$)
$= \sum_a \pi_{n+1}(a|s)\, \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)$   (the log-sum-exp term does not depend on $a$)
$= \sum_a \pi_{n+1}(a|s) \left[ Q^{\pi_n}_{soft}(s,a) - \lambda \log \pi_{n+1}(a|s) \right]$
  (again since $\pi_{n+1}(a|s) = \frac{\exp(Q^{\pi_n}_{soft}(s,a)/\lambda)}{\sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)}$)
$= \sum_a \pi_{n+1}(a|s)\, Q^{\pi_n}_{soft}(s,a) + \lambda H(\pi_{n+1}(\cdot|s))$
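A quick numerical sanity check of this inequality (not part of the slides): for one state with random Q-values and an arbitrary current policy, the entropy-regularized objective cannot decrease when switching to the softmax policy.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5
q = rng.normal(size=5)            # Q^{pi_n}_soft(s, .) for one state
pi_n = rng.dirichlet(np.ones(5))  # arbitrary current policy pi_n(.|s)

z = q / lam
z -= z.max()
pi_next = np.exp(z) / np.exp(z).sum()  # pi_{n+1} = softmax(Q/lam)

def objective(pi, q, lam):
    # sum_a pi(a|s) Q(s,a) + lam * H(pi(.|s))
    return np.sum(pi * q) - lam * np.sum(pi * np.log(pi + 1e-12))

print(objective(pi_n, q, lam) <= objective(pi_next, q, lam) + 1e-12)  # True
```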