Markov Decision Process and Reinforcement Learning
Zeqian (Chris) Li
Feb 28, 2019
Outline
1 Introduction
2 Markov decision process
3 Statistical mechanics of MDP
4 Reinforcement learning
5 Discussion
Introduction
Hungry rat experiment, Yale, 1948.
Modeling reinforcement: an agent-based model.
[Diagram: agent–environment loop. The agent draws an action $a$ from its policy $\pi(a|s)$; the environment returns the next state $s'$ via $p(s'|s,a)$ and a reward $r(s,a,s')$.]
- $s$: state; $a$: action; $r$: reward
- $p(s'|s,a)$: transition probability
- $r(s,a,s')$: reward model
- $\pi(a|s)$: policy
This is a dynamical process: $s_t, a_t, r_t;\; s_{t+1}, a_{t+1}, r_{t+1};\;\ldots$
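This loop can be written as a short simulation sketch. The `ToyEnv` class and its `step` method below are hypothetical stand-ins for $p(s'|s,a)$ and $r(s,a,s')$ (not any particular library's API), and the reward rule is made up for illustration.

```python
import random

class ToyEnv:
    """Hypothetical environment: plays the role of p(s'|s,a) and r(s,a,s')."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Sample the next state from p(s'|s,a) and the reward from r(s,a,s').
        next_state = random.choice([0, 1])
        reward = 1.0 if (action == 1 and next_state == 1) else 0.0
        self.state = next_state
        return next_state, reward

def policy(state):
    """A policy pi(a|s): here, uniformly random over two actions."""
    return random.choice([0, 1])

env = ToyEnv()
s = env.state
for t in range(5):              # the dynamical process s_t, a_t, r_t, s_{t+1}, ...
    a = policy(s)               # sample a_t ~ pi(a|s_t)
    s_next, r = env.step(a)     # environment returns s_{t+1} and r_t
    print(t, s, a, r)
    s = s_next
```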
Examples: Atari games
[Diagram: the same agent–environment loop.]
Atari games
- State: brick positions, paddle position, ball coordinates and velocity
- Action: controller/keyboard inputs
- Reward: game score
Examples: Go
[Diagram: the same agent–environment loop.]
Go
- State: positions of stones
- Action: next move
- Reward: advantage evaluation
Examples: robots
[Diagram: the same agent–environment loop.]
Robots (Boston Dynamics)
- State: positions, mass distribution, ...
- Action: adjusting forces on the feet
- Reward: chance of falling
Other examples
Example in physics?
Objective of reinforcement learning
[Diagram: the same agent–environment loop, generating the trajectory $s_t, a_t$.]
- $p(s'|s,a)$: transition probability
- $r(s,a,s')$: reward model
- $\pi(a|s)$: policy
Objective of reinforcement learning: find the optimal policy $\pi^*(a|s)$ that maximizes the expected discounted reward,
$$\pi^*(a|s) = \operatorname*{argmax}_\pi \mathbb{E}[V] = \operatorname*{argmax}_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right]$$
($\gamma$: discount factor, $0 \le \gamma < 1$).
Simplest example: one-armed bandits
[Diagram: a single state $0$ with two actions. Action $0$: with probability $1$, reward $r=0$. Action $1$: with probability $0.9$, reward $r=1$; with probability $0.1$, reward $r=0$.]
Optimal policy: $\pi^*(0|0) = 0$, $\pi^*(1|0) = 1$.
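A quick check, assuming the process always stays in the single state $0$ (the slide does not spell out the dynamics):
$$\mathbb{E}[r \mid a=1] = 0.9 \cdot 1 + 0.1 \cdot 0 = 0.9 > 0 = \mathbb{E}[r \mid a=0],$$
so the recursion $Q(0,a) = \mathbb{E}[r \mid a] + \gamma \max_{a'} Q(0,a')$ gives $Q(0,1) - Q(0,0) = 0.9$ and $Q(0,1) = 0.9/(1-\gamma)$; the greedy policy always pulls the arm, $\pi^*(1|0) = 1$.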
Markov decision process
[Diagram: the same agent–environment loop.]
Suppose that I have full knowledge of $p(s'|s,a)$ and $r(s,a,s')$. This is called a Markov decision process (MDP).
Objective of MDP: compute
$$\pi^*(a|s) = \operatorname*{argmax}_\pi \mathbb{E}[V] = \operatorname*{argmax}_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right]$$
This is a computing problem. No learning.
Quality function $Q(s,a)$
$$\pi^*(a|s) = \operatorname*{argmax}_\pi \mathbb{E}[V] = \operatorname*{argmax}_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right]$$
Define
$$Q(s,a) = \mathbb{E}_{\pi^*}\left[\sum_{t=0}^{\infty} \gamma^t r(t) \,\Big|\, s_0 = s,\ a_0 = a\right]$$
Given the initial state $s$ and the initial action $a$, $Q$ is the maximum expected future reward.
Recursive relationship:
$$Q(s,a) = \sum_{s'} p(s'|s,a)\left[r(s,a,s') + \gamma \max_{a'} Q(s',a')\right] = \mathbb{E}_{s'|s,a}\left[r(s,a,s') + \gamma \max_{a'} Q(s',a')\right]$$
Bellman equation
$$Q(s,a) = \mathbb{E}_{s'|s,a}\left[r(s,a,s') + \gamma \max_{a'} Q(s',a')\right]$$
Solve for $Q(s,a)$ (or $\phi(s)$) from the Bellman equation; the optimal policy is then given by (in the limit $\epsilon \to 0$):
$$\pi^*(a|s) = \begin{cases} 1, & a = a^*(s) = \operatorname*{argmax}_a Q(s,a) \\ 0, & \text{otherwise.} \end{cases}$$
"Curse of dimensionality."
Solve the Bellman equation: iterative method
$$Q_{i+1}(s,a) = \mathbb{E}_{s'|s,a}\left[r(s,a,s') + \gamma \max_{a'} Q_i(s',a')\right] = B[Q_i]$$
Start with $Q_0$ and update by $Q_{i+1} = B[Q_i]$. Convergence can be proved by calculating the Jacobian of $B$ near the fixed point.
Problem: only one entry (one $(s,a)$ pair) is updated at each iteration; convergence is too slow.
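A minimal sketch of this iteration on a small tabular MDP. The transition and reward arrays are made-up toy numbers, not from the slides; `R[s, a]` stands for the expected reward $\mathbb{E}_{s'}[r(s,a,s')]$.

```python
import numpy as np

# Toy tabular MDP: 2 states, 2 actions (made-up numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = p(s'|s,a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a] = E_{s'}[r(s,a,s')]
              [0.5, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for i in range(500):
    # Bellman backup: Q_{i+1}(s,a) = E_{s'}[ r + gamma * max_{a'} Q_i(s',a') ]
    Q_new = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_new - Q)) < 1e-10:  # stop near the fixed point
        break
    Q = Q_new

pi_star = Q.argmax(axis=1)                 # greedy optimal action a*(s)
print(Q, pi_star)
```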
Statistical mechanics of MDP
Setup: $s_t, a_t$; $p(s'|s,a)$, $r(s,a,s')$, $\pi(a|s)$.
Find $\pi^*(a|s) = \operatorname*{argmax}_\pi \mathbb{E}[V] = \operatorname*{argmax}_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right]$.
Define $\rho_t(s)$: probability of being in state $s$ at time $t$.
Chapman–Kolmogorov equation:
$$\rho_{t+1}(s') = \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \rho_t(s)$$
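The propagation of $\rho_t$ can be written in one line with the same toy arrays as in the value-iteration sketch; `Pi[s, a]` is a hypothetical stochastic policy, chosen only for illustration.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = p(s'|s,a)
              [[0.5, 0.5], [0.1, 0.9]]])
Pi = np.array([[0.5, 0.5],                 # Pi[s, a] = pi(a|s)
               [0.2, 0.8]])
rho = np.array([1.0, 0.0])                 # rho_0(s): start in state 0

for t in range(10):
    # rho_{t+1}(s') = sum_{s,a} p(s'|s,a) pi(a|s) rho_t(s)
    rho = np.einsum('sax,sa,s->x', P, Pi, rho)
print(rho)   # remains a normalized distribution over states
```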
$$V^\pi = \mathbb{E}_{\pi,\rho}[R] = \sum_{t=0}^{\infty} \gamma^t \sum_{s,a,s'} \rho_t(s)\, \pi(a|s)\, p(s'|s,a)\, r(s,a,s')$$
Let $\eta(s) \equiv \sum_{t=0}^{\infty} \gamma^t \rho_t(s)$, the average (discounted) residence time in $s$ before "death". Then
$$V^\pi = \sum_{s,a,s'} \eta(s)\, \pi(a|s)\, p(s'|s,a)\, r(s,a,s')$$
Constraints:
- $\eta(s)$ depends on $\pi$ (derived below):
$$\eta(s') = \rho_0(s') + \gamma \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \eta(s)$$
- $\sum_a \pi(a|s) = 1$
- introduce Lagrange multipliers
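One way to see the first constraint, using the Chapman–Kolmogorov equation above:
$$\eta(s') = \sum_{t=0}^{\infty} \gamma^t \rho_t(s') = \rho_0(s') + \sum_{t=1}^{\infty} \gamma^t \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \rho_{t-1}(s) = \rho_0(s') + \gamma \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \eta(s).$$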
$$F[\pi,\eta] = V[\pi,\eta] - \sum_{s'} \phi(s') \left[\eta(s') - \rho_0(s') - \gamma \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \eta(s)\right] - \sum_s \lambda(s) \left[\sum_a \pi(a|s) - 1\right]$$
Optimization: $\dfrac{\delta F}{\delta \pi(a|s)} = 0$, $\dfrac{\delta F}{\delta \eta(s)} = 0$.
Problem: $F$ is linear in $\pi$, so its derivative is constant and the extremum lies on the boundary, i.e. the optimal policy is deterministic (0 or 1).
Introduce a non-linearity: the entropy
$$H_s[\pi] = -\sum_a \pi(a|s) \log \pi(a|s)$$
(similar to regularization).
$$\begin{aligned}
F[\pi,\eta] = {}& \sum_{s,a,s'} \eta(s)\, \pi(a|s)\, p(s'|s,a)\, r(s,a,s') && (V) \\
& - \sum_{s'} \phi(s') \left[\eta(s') - \rho_0(s') - \gamma \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \eta(s)\right] && \text{(dynamical constraint)} \\
& - \sum_s \lambda(s) \left[\sum_a \pi(a|s) - 1\right] && \text{(normalization)} \\
& + \epsilon \sum_s \eta(s)\, H_s[\pi] && \text{(entropy)}
\end{aligned}$$
$$\frac{\delta F}{\delta \pi(a|s)} = 0, \qquad \frac{\delta F}{\delta \eta(s)} = 0.$$
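A sketch of the stationarity condition in $\pi$ (one route to the results on the next slide; the constant is absorbed into the multiplier $\lambda(s)$):
$$\frac{\delta F}{\delta \pi(a|s)} = \eta(s) \sum_{s'} p(s'|s,a)\left[r(s,a,s') + \gamma\, \phi(s')\right] - \lambda(s) - \epsilon\, \eta(s)\left[\log \pi(a|s) + 1\right] = 0,$$
so that, after normalization,
$$\pi(a|s) \propto \exp\!\left(\frac{1}{\epsilon}\, \mathbb{E}_{s'}\!\left[r(s,a,s') + \gamma\, \phi(s')\right]\right) \equiv \exp\!\left(\frac{Q(s,a)}{\epsilon}\right).$$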
Results
$$\pi^*(a|s) = \frac{\exp\left(Q(s,a)/\epsilon\right)}{\sum_b \exp\left(Q(s,b)/\epsilon\right)}$$
- Boltzmann distribution! $\epsilon$: temperature; $Q$: quality function, i.e. (minus) energy!
$$\begin{aligned}
Q(s,a) &= \sum_{s'} p(s'|s,a)\left[r(s,a,s') + \gamma \epsilon \log \sum_{a'} \exp\left(\frac{Q(s',a')}{\epsilon}\right)\right] \\
&= \mathbb{E}_{s'}\left[r(s,a,s') + \gamma \operatorname*{softmax}_{a';\epsilon} Q(s',a')\right] \\
&\xrightarrow{\epsilon \to 0} \mathbb{E}_{s'}\left[r(s,a,s') + \gamma \max_{a'} Q(s',a')\right]
\end{aligned}$$
Can show that
$$Q(s,a) = \mathbb{E}_{\pi^*}\left[\sum_t \gamma^t r(t) \,\Big|\, s_0 = s,\ a_0 = a\right]$$
$\phi(s)$: value function, i.e. (minus) free energy!
$$\phi(s) = \epsilon \log \sum_a \exp\left(\frac{Q(s,a)}{\epsilon}\right) = \operatorname*{softmax}_{a;\epsilon} Q(s,a) \xrightarrow{\epsilon \to 0} \max_a Q(s,a)$$
Iterative equation:
$$\phi(s) = \operatorname*{softmax}_{a;\epsilon} \mathbb{E}_{s'}\left[r(s,a,s') + \gamma \phi(s')\right] \xrightarrow{\epsilon \to 0} \max_a \mathbb{E}_{s'}\left[r(s,a,s') + \gamma \phi(s')\right]$$
Physical meaning of $\phi(s)$: the maximum expected future reward, given initial state $s$.
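The iterative equation for $\phi$ can be run with a log-sum-exp in place of the hard max. This is a sketch on the same made-up toy MDP as before, with $\epsilon$ as the temperature.

```python
import numpy as np
from scipy.special import logsumexp

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = p(s'|s,a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a] = E_{s'}[r(s,a,s')]
              [0.5, 2.0]])
gamma, eps = 0.9, 0.1                     # discount and temperature

phi = np.zeros(2)
for i in range(1000):
    Q = R + gamma * P @ phi               # Q(s,a) = E_{s'}[r + gamma * phi(s')]
    # phi(s) = eps * log sum_a exp(Q(s,a)/eps)   (the softmax_{a;eps} of the slides)
    phi_new = eps * logsumexp(Q / eps, axis=1)
    if np.max(np.abs(phi_new - phi)) < 1e-10:
        break
    phi = phi_new

pi_star = np.exp((Q - phi[:, None]) / eps)  # Boltzmann policy exp(Q/eps)/Z
print(phi, pi_star)
```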
Spectrum of reinforcement learning problems
[Diagram: a two-axis spectrum. Horizontal axis: knowledge about the environment, $p(s'|s,a)$, $r(s,a,s')$; vertical axis: accuracy of observation $y$ of the state.]
- Full observation, known environment: Markov decision process (MDP)
- Full observation, unknown environment: model-free reinforcement learning
- Partial observation, known environment: partially observable Markov decision process (POMDP)
- Partial observation, unknown environment: full RL (very hard)
MDP Bellman equation ($\epsilon > 0$):
$$Q(s,a) = \mathbb{E}_{s'|s,a}\left[r(s,a,s') + \gamma \operatorname*{softmax}_{a';\epsilon} Q(s',a')\right]$$
Reinforcement learning: we don't know $r(s,a,s')$ or $p(s'|s,a)$; we only have samples
$$(s_0, a_0, s_1; r_0),\ (s_1, a_1, s_2; r_1),\ \ldots,\ (s_t, a_t, s_{t+1}; r_t),\ \ldots$$
Rewrite the Bellman equation:
$$\mathbb{E}_{\text{samples of } (\cdot\,|s,a)}\left[r(s,a,\cdot) + \gamma \operatorname*{softmax}_{a';\epsilon} Q(\cdot, a') - Q(s,a)\right] = 0$$
RL algorithm: soft Q-learning
$$\hat{Q}_{t+1}(s,a) = \hat{Q}_t(s,a) + \alpha_t \left[r_t + \gamma \operatorname*{softmax}_{a';\epsilon} \hat{Q}_t(s_{t+1}, a') - \hat{Q}_t(s_t, a_t)\right] \delta_{s,s_t}\, \delta_{a,a_t}$$
(Update only if $s = s_t$, $a = a_t$; otherwise $\hat{Q}_{t+1}(s,a) = \hat{Q}_t(s,a)$.)
$$\hat{\pi}_{t+1}(a|s) = \frac{\exp\left(\hat{Q}_{t+1}(s,a)/\epsilon\right)}{\sum_b \exp\left(\hat{Q}_{t+1}(s,b)/\epsilon\right)}$$
Problem: only one entry (one $(s,a)$ pair) is updated at each iteration; convergence is too slow.
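A tabular sketch of this update rule. The dynamics `P`, `R` are the same made-up toy numbers as in the earlier sketches, used here only to generate sample transitions; `R[s, a]` is used as the sample reward for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, eps, alpha = 0.9, 0.1, 0.1

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # toy p(s'|s,a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])    # toy expected reward

def softmax_eps(q):
    """eps * log sum_a exp(q_a / eps) -- the 'softmax' of the slides."""
    m = q.max()
    return m + eps * np.log(np.sum(np.exp((q - m) / eps)))

Q = np.zeros((n_states, n_actions))
s = 0
for t in range(20000):
    # Sample a_t from the Boltzmann policy exp(Q/eps)/Z.
    logits = (Q[s] - Q[s].max()) / eps
    pi = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(n_actions, p=pi)
    # Environment step: sample s_{t+1} ~ p(.|s,a), observe reward r_t.
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    # Soft Q-learning update on the single visited entry (s_t, a_t).
    Q[s, a] += alpha * (r + gamma * softmax_eps(Q[s_next]) - Q[s, a])
    s = s_next

print(Q)
```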
Solution: parameterize $Q(s,a)$ by $Q(s,a;w)$ and update $w$ in each iteration.
Parameterize the function with a small number of parameters: a neural network.
Deep reinforcement learning:
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
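A minimal sketch of the parameterization idea, with a linear $Q(s,a;w)$ and a semi-gradient update (the DQN of Mnih et al. uses a deep network plus experience replay and a target network, all omitted here; the feature map below is hypothetical).

```python
import numpy as np

n_features, n_actions = 4, 2
gamma, eps, alpha = 0.9, 0.1, 0.01

w = np.zeros((n_actions, n_features))      # parameters of Q(s, a; w)

def features(s):
    """Hypothetical feature map for an integer state s (random but fixed)."""
    return np.random.default_rng(s).normal(size=n_features)

def q_values(s, w):
    return w @ features(s)                  # Q(s, a; w) = w_a . x(s)

def update(s, a, r, s_next, w):
    # Soft Bellman target: r + gamma * eps * log sum_a' exp(Q(s',a';w)/eps).
    q_next = q_values(s_next, w)
    m = q_next.max()
    target = r + gamma * (m + eps * np.log(np.sum(np.exp((q_next - m) / eps))))
    td_error = target - q_values(s, w)[a]
    w[a] += alpha * td_error * features(s)  # semi-gradient step on w
    return w

# Single illustrative update on a sample (s_t, a_t, r_t, s_{t+1}).
w = update(s=0, a=1, r=1.0, s_next=1, w=w)
print(w)
```

This replaces the single-entry tabular update: one transition now nudges all states that share features with $s_t$.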