CS885 Reinforcement Learning, Module 2 (June 6, 2020)
Maximum Entropy Reinforcement Learning
University of Waterloo, CS885 Spring 2020, Pascal Poupart

References:
• Haarnoja, Tang et al. (2017). Reinforcement Learning with Deep Energy-Based Policies. ICML.
• Haarnoja, Zhou et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML.
Maximum Entropy RL
• Why do several implementations of important RL baselines (e.g., A2C, PPO) add an entropy regularizer?
• Why is maximizing entropy desirable in RL?
• What is the Soft Actor-Critic algorithm?
Reinforcement Learning: Deterministic vs. Stochastic Policies

Deterministic policies:
• There always exists an optimal deterministic policy
• Search space is smaller for deterministic than for stochastic policies
• Practitioners often prefer deterministic policies

Stochastic policies:
• Search space is continuous (helps with gradient descent)
• More robust (less likely to overfit)
• Naturally incorporate exploration
• Facilitate transfer learning
• Mitigate local optima
Encouraging Stochasticity

Standard MDP:
• States: $S$
• Actions: $A$
• Reward: $R(s,a)$
• Transition: $\Pr(s'|s,a)$
• Discount: $\gamma$

Soft MDP:
• States: $S$
• Actions: $A$
• Reward: $R(s,a) + \lambda H(\pi(\cdot|s))$
• Transition: $\Pr(s'|s,a)$
• Discount: $\gamma$
Entropy
• Entropy: a measure of uncertainty
  – Information theory: expected number of bits needed to communicate the result of a sample drawn from $q$
• $H(q) = -\sum_x q(x) \log q(x)$
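To make the two definitions above concrete, here is a minimal numpy sketch (not from the slides) of the entropy of a discrete distribution and of the entropy-augmented soft reward $R(s,a) + \lambda H(\pi(\cdot|s))$ from the previous slide; the function names are illustrative.

```python
import numpy as np

def entropy(p, eps=1e-12):
    # H(p) = -sum_x p(x) log p(x); eps guards against log(0)
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def soft_reward(r, policy_at_s, lam):
    # Soft MDP reward: R(s,a) + lambda * H(pi(.|s))
    return r + lam * entropy(policy_at_s)

# A uniform policy over 4 actions earns the largest entropy bonus
print(soft_reward(1.0, np.ones(4) / 4, lam=0.1))    # 1.0 + 0.1*log(4)
print(soft_reward(1.0, [1.0, 0.0, 0.0, 0.0], 0.1))  # 1.0 (deterministic policy, zero entropy)
```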
Optimal Policy
• Standard MDP:
  $\pi^* = \operatorname{argmax}_{\pi} \sum_{t=0}^{\infty} \gamma^t \, E_{s_t, a_t | \pi}[R(s_t, a_t)]$
• Soft MDP:
  $\pi^*_{soft} = \operatorname{argmax}_{\pi} \sum_{t=0}^{\infty} \gamma^t \, E_{s_t, a_t | \pi}[R(s_t, a_t) + \lambda H(\pi(\cdot|s_t))]$
  This is called the maximum entropy policy or the entropy-regularized policy.
Q-function
• Standard MDP:
  $Q^{\pi}(s_0, a_0) = R(s_0, a_0) + \sum_{t=1}^{\infty} \gamma^t \, E_{s_t, a_t | s_0, a_0, \pi}[R(s_t, a_t)]$
• Soft MDP:
  $Q^{\pi}_{soft}(s_0, a_0) = R(s_0, a_0) + \sum_{t=1}^{\infty} \gamma^t \, E_{s_t, a_t | s_0, a_0, \pi}[R(s_t, a_t) + \lambda H(\pi(\cdot|s_t))]$
NB: there is no entropy term with the first reward, since the first action $a_0$ is given rather than chosen according to $\pi$.
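As an illustration of the remark above, the following sketch (not from the slides; it assumes a sampled trajectory from a small discrete-action problem, and all names are hypothetical) estimates $Q^{\pi}_{soft}(s_0, a_0)$ by Monte Carlo, adding the entropy bonus only from $t = 1$ onward.

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def soft_return(rewards, policies, gamma, lam):
    """One-trajectory Monte Carlo estimate of Q_soft(s0, a0).

    rewards[t]  = R(s_t, a_t) along the sampled trajectory
    policies[t] = pi(.|s_t) as a probability vector over actions
    The entropy bonus is added only for t >= 1, since a_0 is given.
    """
    g = 0.0
    for t, r in enumerate(rewards):
        bonus = lam * entropy(policies[t]) if t >= 1 else 0.0
        g += gamma**t * (r + bonus)
    return g

# Toy example: 3-step trajectory under a uniform policy over 2 actions
print(soft_return([1.0, 1.0, 1.0], [np.ones(2) / 2] * 3, gamma=0.9, lam=0.1))
```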
Greedy Policy
• Standard MDP (deterministic policy):
  $\pi_{greedy}(s) = \operatorname{argmax}_a Q(s,a)$
• Soft MDP (stochastic policy):
  $\pi_{greedy}(\cdot|s) = \operatorname{argmax}_{\pi} \sum_a \pi(a|s) Q(s,a) + \lambda H(\pi(\cdot|s))$
  $= \frac{\exp(Q(s,\cdot)/\lambda)}{\sum_a \exp(Q(s,a)/\lambda)} = \operatorname{softmax}(Q(s,\cdot)/\lambda)$
• When $\lambda \to 0$, the softmax becomes the regular max.
Derivation
• Concave objective (can find global maximum):
  $J(\pi, Q) = \sum_a \pi(a|s) Q(s,a) + \lambda H(\pi(\cdot|s)) = \sum_a \pi(a|s) [Q(s,a) - \lambda \log \pi(a|s)]$
• Partial derivative:
  $\frac{\partial J}{\partial \pi(a|s)} = Q(s,a) - \lambda (\log \pi(a|s) + 1)$
• Setting the derivative to 0 and isolating $\pi(a|s)$ yields
  $\pi(a|s) = \exp(Q(s,a)/\lambda - 1) \propto \exp(Q(s,a)/\lambda)$
• Hence $\pi_{greedy}(\cdot|s) = \frac{\exp(Q(s,\cdot)/\lambda)}{\sum_a \exp(Q(s,a)/\lambda)} = \operatorname{softmax}(Q(s,\cdot)/\lambda)$
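A minimal numpy sketch of the resulting soft greedy policy (not part of the slides; subtracting the max before exponentiating is just a numerical stability trick):

```python
import numpy as np

def soft_greedy_policy(q_values, lam):
    # softmax(Q(s,.)/lambda), computed with the max subtracted for stability
    z = np.asarray(q_values, dtype=float) / lam
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 1.5])
for lam in [10.0, 1.0, 0.1, 0.01]:
    print(lam, soft_greedy_policy(q, lam))
# As lam -> 0 the distribution concentrates on argmax_a Q(s,a) (action 1 here);
# as lam grows it approaches the uniform distribution.
```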
Greedy Value Function
• What is the value function induced by the greedy policy?
• Standard MDP:
  $V(s) = \max_a Q(s,a)$
• Soft MDP:
  $V_{soft}(s) = \lambda H(\pi_{greedy}(\cdot|s)) + \sum_a \pi_{greedy}(a|s)\, Q_{soft}(s,a)$
  $= \lambda \log \sum_a \exp(Q_{soft}(s,a)/\lambda) = \widetilde{\max}^{\lambda}_{a}\, Q_{soft}(s,a)$   (a soft version of the max operator)
• When $\lambda \to 0$, $\widetilde{\max}^{\lambda}$ becomes the regular max.
Derivation
$V_{soft}(s) = \lambda H(\pi_{greedy}(\cdot|s)) + \sum_a \pi_{greedy}(a|s)\, Q_{soft}(s,a)$

Since $\pi_{greedy}(a|s) = \frac{\exp(Q_{soft}(s,a)/\lambda)}{\sum_{a'} \exp(Q_{soft}(s,a')/\lambda)}$, i.e., $Q_{soft}(s,a) = \lambda [\log \pi_{greedy}(a|s) + \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)]$:

$= \lambda H(\pi_{greedy}(\cdot|s)) + \sum_a \pi_{greedy}(a|s)\, \lambda [\log \pi_{greedy}(a|s) + \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)]$
$= \lambda H(\pi_{greedy}(\cdot|s)) + \lambda \sum_a \pi_{greedy}(a|s) \log \pi_{greedy}(a|s) + \lambda \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)$
$= \lambda H(\pi_{greedy}(\cdot|s)) - \lambda H(\pi_{greedy}(\cdot|s)) + \lambda \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)$
$= \lambda \log \sum_{a'} \exp(Q_{soft}(s,a')/\lambda)$
$= \widetilde{\max}^{\lambda}_{a}\, Q_{soft}(s,a)$
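The identity above is easy to check numerically. A small sketch (not from the slides), assuming the softmax greedy policy and the log-sum-exp form of $V_{soft}$, with illustrative names:

```python
import numpy as np

def soft_greedy_policy(q, lam):
    z = q / lam
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def soft_value(q, lam):
    # V_soft(s) = lam * log sum_a exp(Q_soft(s,a)/lam), computed stably
    m = q.max()
    return m + lam * np.log(np.sum(np.exp((q - m) / lam)))

q = np.array([1.0, 2.0, 1.5])
lam = 0.5
pi = soft_greedy_policy(q, lam)
lhs = lam * (-np.sum(pi * np.log(pi))) + np.sum(pi * q)  # lam*H(pi) + E_pi[Q]
print(np.isclose(lhs, soft_value(q, lam)))  # True: both equal the soft value
print(soft_value(q, lam), q.max())          # soft value -> max_a Q(s,a) as lam -> 0
```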
Soft Q-Value Iteration

SoftQValueIteration(MDP, $\lambda$)
  Initialize $Q^0_{soft}$ to any Q-function
  $n \leftarrow 0$
  Repeat
    $Q^{n+1}_{soft}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, \widetilde{\max}^{\lambda}_{a'} Q^{n}_{soft}(s',a')$   $\forall s,a$
    $n \leftarrow n + 1$
  Until $\|Q^{n}_{soft} - Q^{n-1}_{soft}\|_{\infty} \le \epsilon$
  Extract policy: $\pi_{greedy}(\cdot|s) = \operatorname{softmax}(Q^{n}_{soft}(s,\cdot)/\lambda)$

Soft Bellman equation:
  $Q^*_{soft}(s,a) = R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, \widetilde{\max}^{\lambda}_{a'} Q^*_{soft}(s',a')$
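A possible tabular implementation of this algorithm, assuming a small MDP given by a reward matrix R[s,a] and a transition tensor P[s,a,s'] (these inputs and all names are assumptions for illustration, not part of the slides):

```python
import numpy as np

def soft_q_value_iteration(R, P, gamma, lam, eps=1e-8):
    """Tabular soft Q-value iteration.
    R: (S, A) rewards, P: (S, A, S) transition probabilities."""
    S, A = R.shape
    Q = np.zeros((S, A))
    while True:
        # soft max over next actions: lam * log sum_a' exp(Q(s',a')/lam), per state
        m = Q.max(axis=1, keepdims=True)
        V = (m + lam * np.log(np.exp((Q - m) / lam).sum(axis=1, keepdims=True))).ravel()
        Q_new = R + gamma * P @ V
        if np.max(np.abs(Q_new - Q)) <= eps:
            Q = Q_new
            break
        Q = Q_new
    # extract the greedy softmax policy
    z = Q / lam
    z -= z.max(axis=1, keepdims=True)
    pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return Q, pi

# Tiny random MDP
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, :] sums to 1
R = rng.uniform(size=(S, A))
Q, pi = soft_q_value_iteration(R, P, gamma=0.9, lam=0.1)
print(np.round(pi, 3))
```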
Soft Q-learning
• Q-learning based on Soft Q-Value Iteration
• Replace expectations by samples
• Represent the Q-function by a function approximator (e.g., a neural network)
• Do gradient updates based on temporal differences
Soft Q-learning (soft variant of DQN)

Initialize weights $\mathbf{w}$ and $\bar{\mathbf{w}}$ at random in $[-1, 1]$
Observe current state $s$
Loop
  Select action $a$ and execute it
  Receive immediate reward $r$
  Observe new state $s'$
  Add $(s, a, s', r)$ to experience buffer
  Sample mini-batch of experiences from buffer
  For each experience $(\hat{s}, \hat{a}, \hat{s}', \hat{r})$ in mini-batch:
    Gradient: $\frac{\partial Err}{\partial \mathbf{w}} = \left[ Q^{soft}_{\mathbf{w}}(\hat{s}, \hat{a}) - \hat{r} - \gamma\, \widetilde{\max}^{\lambda}_{\hat{a}'} Q^{soft}_{\bar{\mathbf{w}}}(\hat{s}', \hat{a}') \right] \frac{\partial Q^{soft}_{\mathbf{w}}(\hat{s}, \hat{a})}{\partial \mathbf{w}}$
  Update weights: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial Err}{\partial \mathbf{w}}$
  Update state: $s \leftarrow s'$
  Every $c$ steps, update target: $\bar{\mathbf{w}} \leftarrow \mathbf{w}$
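One way this update is commonly realized in code, sketched here with PyTorch (an assumption, not prescribed by the slides): the semi-gradient above corresponds to minimizing a squared TD error whose target uses the frozen network $\bar{\mathbf{w}}$ and the soft max computed with logsumexp. Network sizes, batch layout, and names below are illustrative.

```python
import torch
import torch.nn as nn

def soft_q_loss(q_net, target_net, batch, gamma, lam):
    """Soft Q-learning TD loss on a mini-batch.
    batch: dict with 's' (B, obs_dim), 'a' (B,), 'r' (B,), 's_next' (B, obs_dim)."""
    q_sa = q_net(batch['s']).gather(1, batch['a'].long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(batch['s_next'])                 # (B, num_actions)
        v_next = lam * torch.logsumexp(q_next / lam, dim=1)  # soft max over a'
        target = batch['r'] + gamma * v_next
    return nn.functional.mse_loss(q_sa, target)

# Toy usage with random data and a small MLP Q-network
obs_dim, num_actions, B = 3, 2, 8
q_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, num_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, num_actions))
target_net.load_state_dict(q_net.state_dict())
batch = {'s': torch.randn(B, obs_dim), 'a': torch.randint(num_actions, (B,)),
         'r': torch.randn(B), 's_next': torch.randn(B, obs_dim)}
loss = soft_q_loss(q_net, target_net, batch, gamma=0.99, lam=0.1)
loss.backward()  # an optimizer step on q_net's parameters would follow
```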
Soft Actor-Critic
• In practice, actor-critic techniques tend to perform better than Q-learning.
• Can we derive a soft actor-critic algorithm?
• Yes. Idea:
  – Critic: soft Q-function
  – Actor: (greedy) softmax policy
Soft Policy Iteration

SoftPolicyIteration(MDP, $\lambda$)
  Initialize $\pi_0$ to any policy
  $n \leftarrow 0$
  Repeat
    Policy evaluation: repeat until convergence
      $Q^{\pi_n}_{soft}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a) \left[ \sum_{a'} \pi_n(a'|s')\, Q^{\pi_n}_{soft}(s',a') + \lambda H(\pi_n(\cdot|s')) \right]$   $\forall s,a$
    Policy improvement:
      $\pi_{n+1}(a|s) \leftarrow \operatorname{softmax}(Q^{\pi_n}_{soft}(s,a)/\lambda) = \frac{\exp(Q^{\pi_n}_{soft}(s,a)/\lambda)}{\sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)}$   $\forall s,a$
    $n \leftarrow n + 1$
  Until $\|Q^{\pi_{n+1}}_{soft} - Q^{\pi_n}_{soft}\|_{\infty} \le \epsilon$
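A tabular sketch of this procedure under the same assumptions as the soft Q-value iteration example (reward matrix R[s,a], transition tensor P[s,a,s']; a fixed number of outer iterations stands in for the convergence test, and the names are illustrative):

```python
import numpy as np

def soft_policy_iteration(R, P, gamma, lam, n_iters=50, eval_eps=1e-8):
    """Tabular soft policy iteration. R: (S, A), P: (S, A, S)."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)   # start from the uniform policy
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # Policy evaluation: iterate the soft Bellman expectation backup to convergence
        while True:
            H = -np.sum(pi * np.log(pi + 1e-12), axis=1)  # entropy of pi(.|s')
            V = np.sum(pi * Q, axis=1) + lam * H          # E_pi[Q] + lam*H per state
            Q_new = R + gamma * P @ V
            if np.max(np.abs(Q_new - Q)) <= eval_eps:
                Q = Q_new
                break
            Q = Q_new
        # Policy improvement: pi_{n+1} = softmax(Q^{pi_n}_soft / lam)
        z = Q / lam
        z -= z.max(axis=1, keepdims=True)
        pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return Q, pi

rng = np.random.default_rng(1)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))
Q, pi = soft_policy_iteration(R, P, gamma=0.9, lam=0.1)
print(np.round(pi, 3))
```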
Policy Improvement

Theorem 1: Let $Q^{\pi_n}_{soft}(s,a)$ be the Q-function of $\pi_n$.
Let $\pi_{n+1}(a|s) = \operatorname{softmax}(Q^{\pi_n}_{soft}(s,a)/\lambda)$.
Then $Q^{\pi_{n+1}}_{soft}(s,a) \ge Q^{\pi_n}_{soft}(s,a)$   $\forall s,a$.

Proof: first show that
$\sum_a \pi_n(a|s)\, Q^{\pi_n}_{soft}(s,a) + \lambda H(\pi_n(\cdot|s)) \;\le\; \sum_a \pi_{n+1}(a|s)\, Q^{\pi_n}_{soft}(s,a) + \lambda H(\pi_{n+1}(\cdot|s))$,
then use this inequality to show that $Q^{\pi_{n+1}}_{soft}(s,a) \ge Q^{\pi_n}_{soft}(s,a)$   $\forall s,a$.
Inequality Derivation

$\sum_a \pi_n(a|s)\, Q^{\pi_n}_{soft}(s,a) + \lambda H(\pi_n(\cdot|s))$
$= \sum_a \pi_n(a|s) \left[ Q^{\pi_n}_{soft}(s,a) - \lambda \log \pi_n(a|s) \right]$
$= \sum_a \pi_n(a|s) \left[ \lambda \log \pi_{n+1}(a|s) + \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda) - \lambda \log \pi_n(a|s) \right]$
  (since $\pi_{n+1}(a|s) = \frac{\exp(Q^{\pi_n}_{soft}(s,a)/\lambda)}{\sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)}$)
$= \lambda \sum_a \pi_n(a|s) \left[ \log \frac{\pi_{n+1}(a|s)}{\pi_n(a|s)} + \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda) \right]$
$= -\lambda\, KL(\pi_n(\cdot|s)\,\|\,\pi_{n+1}(\cdot|s)) + \lambda \sum_a \pi_n(a|s) \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)$
$\le \lambda \sum_a \pi_n(a|s) \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)$   (since $KL \ge 0$)
$= \sum_a \pi_{n+1}(a|s)\, \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)$   (the log-sum-exp term does not depend on $a$)
$= \sum_a \pi_{n+1}(a|s) \left[ Q^{\pi_n}_{soft}(s,a) - \lambda \log \pi_{n+1}(a|s) \right]$
  (again since $\pi_{n+1}(a|s) = \frac{\exp(Q^{\pi_n}_{soft}(s,a)/\lambda)}{\sum_{a'} \exp(Q^{\pi_n}_{soft}(s,a')/\lambda)}$)
$= \sum_a \pi_{n+1}(a|s)\, Q^{\pi_n}_{soft}(s,a) + \lambda H(\pi_{n+1}(\cdot|s))$
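A quick numerical sanity check of this inequality (not part of the slides): for one state with random Q-values and an arbitrary current policy, the entropy-regularized objective cannot decrease when switching to the softmax policy.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5
q = rng.normal(size=5)            # Q^{pi_n}_soft(s, .) for one state
pi_n = rng.dirichlet(np.ones(5))  # arbitrary current policy pi_n(.|s)

z = q / lam
z -= z.max()
pi_next = np.exp(z) / np.exp(z).sum()  # pi_{n+1} = softmax(Q/lam)

def objective(pi, q, lam):
    # sum_a pi(a|s) Q(s,a) + lam * H(pi(.|s))
    return np.sum(pi * q) - lam * np.sum(pi * np.log(pi + 1e-12))

print(objective(pi_n, q, lam) <= objective(pi_next, q, lam) + 1e-12)  # True
```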