Markov Decision Process and Reinforcement Learning
Zeqian (Chris) Li
Feb 28, 2019
Outline
1 Introduction
2 Markov decision process
3 Statistical mechanics of MDP
4 Reinforcement learning
5 Discussion
Introduction
Hungry rat experiment, Yale, 1948.
Modeling reinforcement: an agent-based model.
[Diagram: agent–environment loop. The agent draws an action $a$ from its policy $\pi(a|s)$; the environment returns the next state $s'$ via $p(s'|s,a)$ and a reward $r(s,a,s')$.]
- $s$: state; $a$: action; $r$: reward
- $p(s'|s,a)$: transition probability
- $r(s,a,s')$: reward model
- $\pi(a|s)$: policy
This is a dynamical process: $s_t, a_t, r_t;\; s_{t+1}, a_{t+1}, r_{t+1};\;\ldots$
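This loop can be written as a short simulation sketch. The `ToyEnv` class and its `step` method below are hypothetical stand-ins for $p(s'|s,a)$ and $r(s,a,s')$ (not any particular library's API), and the reward rule is made up for illustration.

```python
import random

class ToyEnv:
    """Hypothetical environment: plays the role of p(s'|s,a) and r(s,a,s')."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Sample the next state from p(s'|s,a) and the reward from r(s,a,s').
        next_state = random.choice([0, 1])
        reward = 1.0 if (action == 1 and next_state == 1) else 0.0
        self.state = next_state
        return next_state, reward

def policy(state):
    """A policy pi(a|s): here, uniformly random over two actions."""
    return random.choice([0, 1])

env = ToyEnv()
s = env.state
for t in range(5):              # the dynamical process s_t, a_t, r_t, s_{t+1}, ...
    a = policy(s)               # sample a_t ~ pi(a|s_t)
    s_next, r = env.step(a)     # environment returns s_{t+1} and r_t
    print(t, s, a, r)
    s = s_next
```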
Examples: Atari games
[Diagram: the same agent–environment loop.]
Atari games
- State: brick positions, paddle position, ball coordinates and velocity
- Action: controller/keyboard inputs
- Reward: game score
Examples: Go
[Diagram: the same agent–environment loop.]
Go
- State: positions of stones
- Action: next move
- Reward: advantage evaluation
Examples: robots
[Diagram: the same agent–environment loop.]
Robots (Boston Dynamics)
- State: positions, mass distribution, ...
- Action: adjusting forces on the feet
- Reward: chance of falling
Other examples
Example in physics?
Objective of reinforcement learning
[Diagram: the same agent–environment loop, generating the trajectory $s_t, a_t$.]
- $p(s'|s,a)$: transition probability
- $r(s,a,s')$: reward model
- $\pi(a|s)$: policy
Objective of reinforcement learning: find the optimal policy $\pi^*(a|s)$ that maximizes the expected discounted reward,
$$\pi^*(a|s) = \operatorname*{argmax}_\pi \mathbb{E}[V] = \operatorname*{argmax}_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right]$$
($\gamma$: discount factor, $0 \le \gamma < 1$).
Simplest example: one-armed bandits
[Diagram: a single state $0$ with two actions. Action $0$: with probability $1$, reward $r=0$. Action $1$: with probability $0.9$, reward $r=1$; with probability $0.1$, reward $r=0$.]
Optimal policy: $\pi^*(0|0) = 0$, $\pi^*(1|0) = 1$.
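A quick check, assuming the process always stays in the single state $0$ (the slide does not spell out the dynamics):
$$\mathbb{E}[r \mid a=1] = 0.9 \cdot 1 + 0.1 \cdot 0 = 0.9 > 0 = \mathbb{E}[r \mid a=0],$$
so the recursion $Q(0,a) = \mathbb{E}[r \mid a] + \gamma \max_{a'} Q(0,a')$ gives $Q(0,1) - Q(0,0) = 0.9$ and $Q(0,1) = 0.9/(1-\gamma)$; the greedy policy always pulls the arm, $\pi^*(1|0) = 1$.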
Markov decision process
[Diagram: the same agent–environment loop.]
Suppose that I have full knowledge of $p(s'|s,a)$ and $r(s,a,s')$. This is called a Markov decision process (MDP).
Objective of MDP: compute
$$\pi^*(a|s) = \operatorname*{argmax}_\pi \mathbb{E}[V] = \operatorname*{argmax}_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right]$$
This is a computing problem. No learning.
Quality function $Q(s,a)$
$$\pi^*(a|s) = \operatorname*{argmax}_\pi \mathbb{E}[V] = \operatorname*{argmax}_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right]$$
Define
$$Q(s,a) = \mathbb{E}_{\pi^*}\left[\sum_{t=0}^{\infty} \gamma^t r(t) \,\Big|\, s_0 = s,\ a_0 = a\right]$$
Given the initial state $s$ and the initial action $a$, $Q$ is the maximum expected future reward.
Recursive relationship:
$$Q(s,a) = \sum_{s'} p(s'|s,a)\left[r(s,a,s') + \gamma \max_{a'} Q(s',a')\right] = \mathbb{E}_{s'|s,a}\left[r(s,a,s') + \gamma \max_{a'} Q(s',a')\right]$$
Bellman equation
$$Q(s,a) = \mathbb{E}_{s'|s,a}\left[r(s,a,s') + \gamma \max_{a'} Q(s',a')\right]$$
Solve for $Q(s,a)$ (or $\phi(s)$) from the Bellman equation; the optimal policy is then given by (in the limit $\epsilon \to 0$):
$$\pi^*(a|s) = \begin{cases} 1, & a = a^*(s) = \operatorname*{argmax}_a Q(s,a) \\ 0, & \text{otherwise.} \end{cases}$$
"Curse of dimensionality."
Solve the Bellman equation: iterative method
$$Q_{i+1}(s,a) = \mathbb{E}_{s'|s,a}\left[r(s,a,s') + \gamma \max_{a'} Q_i(s',a')\right] = B[Q_i]$$
Start with $Q_0$ and update by $Q_{i+1} = B[Q_i]$. Convergence can be proved by calculating the Jacobian of $B$ near the fixed point.
Problem: only one entry (one $(s,a)$ pair) is updated at each iteration; convergence is too slow.
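A minimal sketch of this iteration on a small tabular MDP. The transition and reward arrays are made-up toy numbers, not from the slides; `R[s, a]` stands for the expected reward $\mathbb{E}_{s'}[r(s,a,s')]$.

```python
import numpy as np

# Toy tabular MDP: 2 states, 2 actions (made-up numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = p(s'|s,a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a] = E_{s'}[r(s,a,s')]
              [0.5, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for i in range(500):
    # Bellman backup: Q_{i+1}(s,a) = E_{s'}[ r + gamma * max_{a'} Q_i(s',a') ]
    Q_new = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_new - Q)) < 1e-10:  # stop near the fixed point
        break
    Q = Q_new

pi_star = Q.argmax(axis=1)                 # greedy optimal action a*(s)
print(Q, pi_star)
```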
Statistical mechanics of MDP
Setup: $s_t, a_t$; $p(s'|s,a)$, $r(s,a,s')$, $\pi(a|s)$.
Find $\pi^*(a|s) = \operatorname*{argmax}_\pi \mathbb{E}[V] = \operatorname*{argmax}_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right]$.
Define $\rho_t(s)$: probability of being in state $s$ at time $t$.
Chapman–Kolmogorov equation:
$$\rho_{t+1}(s') = \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \rho_t(s)$$
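The propagation of $\rho_t$ can be written in one line with the same toy arrays as in the value-iteration sketch; `Pi[s, a]` is a hypothetical stochastic policy, chosen only for illustration.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = p(s'|s,a)
              [[0.5, 0.5], [0.1, 0.9]]])
Pi = np.array([[0.5, 0.5],                 # Pi[s, a] = pi(a|s)
               [0.2, 0.8]])
rho = np.array([1.0, 0.0])                 # rho_0(s): start in state 0

for t in range(10):
    # rho_{t+1}(s') = sum_{s,a} p(s'|s,a) pi(a|s) rho_t(s)
    rho = np.einsum('sax,sa,s->x', P, Pi, rho)
print(rho)   # remains a normalized distribution over states
```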
$$V^\pi = \mathbb{E}_{\pi,\rho}[R] = \sum_{t=0}^{\infty} \gamma^t \sum_{s,a,s'} \rho_t(s)\, \pi(a|s)\, p(s'|s,a)\, r(s,a,s')$$
Let $\eta(s) \equiv \sum_{t=0}^{\infty} \gamma^t \rho_t(s)$, the average (discounted) residence time in $s$ before "death". Then
$$V^\pi = \sum_{s,a,s'} \eta(s)\, \pi(a|s)\, p(s'|s,a)\, r(s,a,s')$$
Constraints:
- $\eta(s)$ depends on $\pi$ (derived below):
$$\eta(s') = \rho_0(s') + \gamma \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \eta(s)$$
- $\sum_a \pi(a|s) = 1$
- introduce Lagrange multipliers
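One way to see the first constraint, using the Chapman–Kolmogorov equation above:
$$\eta(s') = \sum_{t=0}^{\infty} \gamma^t \rho_t(s') = \rho_0(s') + \sum_{t=1}^{\infty} \gamma^t \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \rho_{t-1}(s) = \rho_0(s') + \gamma \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \eta(s).$$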
$$F[\pi,\eta] = V[\pi,\eta] - \sum_{s'} \phi(s') \left[\eta(s') - \rho_0(s') - \gamma \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \eta(s)\right] - \sum_s \lambda(s) \left[\sum_a \pi(a|s) - 1\right]$$
Optimization: $\dfrac{\delta F}{\delta \pi(a|s)} = 0$, $\dfrac{\delta F}{\delta \eta(s)} = 0$.
Problem: $F$ is linear in $\pi$, so its derivative is constant and the extremum lies on the boundary, i.e. the optimal policy is deterministic (0 or 1).
Introduce a non-linearity: the entropy
$$H_s[\pi] = -\sum_a \pi(a|s) \log \pi(a|s)$$
(similar to regularization).
$$\begin{aligned}
F[\pi,\eta] = {}& \sum_{s,a,s'} \eta(s)\, \pi(a|s)\, p(s'|s,a)\, r(s,a,s') && (V) \\
& - \sum_{s'} \phi(s') \left[\eta(s') - \rho_0(s') - \gamma \sum_{s,a} p(s'|s,a)\, \pi(a|s)\, \eta(s)\right] && \text{(dynamical constraint)} \\
& - \sum_s \lambda(s) \left[\sum_a \pi(a|s) - 1\right] && \text{(normalization)} \\
& + \epsilon \sum_s \eta(s)\, H_s[\pi] && \text{(entropy)}
\end{aligned}$$
$$\frac{\delta F}{\delta \pi(a|s)} = 0, \qquad \frac{\delta F}{\delta \eta(s)} = 0.$$
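A sketch of the stationarity condition in $\pi$ (one route to the results on the next slide; the constant is absorbed into the multiplier $\lambda(s)$):
$$\frac{\delta F}{\delta \pi(a|s)} = \eta(s) \sum_{s'} p(s'|s,a)\left[r(s,a,s') + \gamma\, \phi(s')\right] - \lambda(s) - \epsilon\, \eta(s)\left[\log \pi(a|s) + 1\right] = 0,$$
so that, after normalization,
$$\pi(a|s) \propto \exp\!\left(\frac{1}{\epsilon}\, \mathbb{E}_{s'}\!\left[r(s,a,s') + \gamma\, \phi(s')\right]\right) \equiv \exp\!\left(\frac{Q(s,a)}{\epsilon}\right).$$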
Results
$$\pi^*(a|s) = \frac{\exp\left(Q(s,a)/\epsilon\right)}{\sum_b \exp\left(Q(s,b)/\epsilon\right)}$$
- Boltzmann distribution! $\epsilon$: temperature; $Q$: quality function, i.e. (minus) energy!
$$\begin{aligned}
Q(s,a) &= \sum_{s'} p(s'|s,a)\left[r(s,a,s') + \gamma \epsilon \log \sum_{a'} \exp\left(\frac{Q(s',a')}{\epsilon}\right)\right] \\
&= \mathbb{E}_{s'}\left[r(s,a,s') + \gamma \operatorname*{softmax}_{a';\epsilon} Q(s',a')\right] \\
&\xrightarrow{\epsilon \to 0} \mathbb{E}_{s'}\left[r(s,a,s') + \gamma \max_{a'} Q(s',a')\right]
\end{aligned}$$
Can show that
$$Q(s,a) = \mathbb{E}_{\pi^*}\left[\sum_t \gamma^t r(t) \,\Big|\, s_0 = s,\ a_0 = a\right]$$
$\phi(s)$: value function, i.e. (minus) free energy!
$$\phi(s) = \epsilon \log \sum_a \exp\left(\frac{Q(s,a)}{\epsilon}\right) = \operatorname*{softmax}_{a;\epsilon} Q(s,a) \xrightarrow{\epsilon \to 0} \max_a Q(s,a)$$
Iterative equation:
$$\phi(s) = \operatorname*{softmax}_{a;\epsilon} \mathbb{E}_{s'}\left[r(s,a,s') + \gamma \phi(s')\right] \xrightarrow{\epsilon \to 0} \max_a \mathbb{E}_{s'}\left[r(s,a,s') + \gamma \phi(s')\right]$$
Physical meaning of $\phi(s)$: the maximum expected future reward, given initial state $s$.
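The iterative equation for $\phi$ can be run with a log-sum-exp in place of the hard max. This is a sketch on the same made-up toy MDP as before, with $\epsilon$ as the temperature.

```python
import numpy as np
from scipy.special import logsumexp

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = p(s'|s,a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a] = E_{s'}[r(s,a,s')]
              [0.5, 2.0]])
gamma, eps = 0.9, 0.1                     # discount and temperature

phi = np.zeros(2)
for i in range(1000):
    Q = R + gamma * P @ phi               # Q(s,a) = E_{s'}[r + gamma * phi(s')]
    # phi(s) = eps * log sum_a exp(Q(s,a)/eps)   (the softmax_{a;eps} of the slides)
    phi_new = eps * logsumexp(Q / eps, axis=1)
    if np.max(np.abs(phi_new - phi)) < 1e-10:
        break
    phi = phi_new

pi_star = np.exp((Q - phi[:, None]) / eps)  # Boltzmann policy exp(Q/eps)/Z
print(phi, pi_star)
```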
Spectrum of reinforcement learning problems
[Diagram: a two-axis spectrum. Horizontal axis: knowledge about the environment, $p(s'|s,a)$, $r(s,a,s')$; vertical axis: accuracy of observation $y$ of the state.]
- Full observation, known environment: Markov decision process (MDP)
- Full observation, unknown environment: model-free reinforcement learning
- Partial observation, known environment: partially observable Markov decision process (POMDP)
- Partial observation, unknown environment: full RL (very hard)
MDP Bellman equation ($\epsilon > 0$):
$$Q(s,a) = \mathbb{E}_{s'|s,a}\left[r(s,a,s') + \gamma \operatorname*{softmax}_{a';\epsilon} Q(s',a')\right]$$
Reinforcement learning: we don't know $r(s,a,s')$ or $p(s'|s,a)$; we only have samples
$$(s_0, a_0, s_1; r_0),\ (s_1, a_1, s_2; r_1),\ \ldots,\ (s_t, a_t, s_{t+1}; r_t),\ \ldots$$
Rewrite the Bellman equation:
$$\mathbb{E}_{\text{samples of } (\cdot\,|s,a)}\left[r(s,a,\cdot) + \gamma \operatorname*{softmax}_{a';\epsilon} Q(\cdot, a') - Q(s,a)\right] = 0$$
RL algorithm: soft Q-learning
$$\hat{Q}_{t+1}(s,a) = \hat{Q}_t(s,a) + \alpha_t \left[r_t + \gamma \operatorname*{softmax}_{a';\epsilon} \hat{Q}_t(s_{t+1}, a') - \hat{Q}_t(s_t, a_t)\right] \delta_{s,s_t}\, \delta_{a,a_t}$$
(Update only if $s = s_t$, $a = a_t$; otherwise $\hat{Q}_{t+1}(s,a) = \hat{Q}_t(s,a)$.)
$$\hat{\pi}_{t+1}(a|s) = \frac{\exp\left(\hat{Q}_{t+1}(s,a)/\epsilon\right)}{\sum_b \exp\left(\hat{Q}_{t+1}(s,b)/\epsilon\right)}$$
Problem: only one entry (one $(s,a)$ pair) is updated at each iteration; convergence is too slow.
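A tabular sketch of this update rule. The dynamics `P`, `R` are the same made-up toy numbers as in the earlier sketches, used here only to generate sample transitions; `R[s, a]` is used as the sample reward for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, eps, alpha = 0.9, 0.1, 0.1

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # toy p(s'|s,a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])    # toy expected reward

def softmax_eps(q):
    """eps * log sum_a exp(q_a / eps) -- the 'softmax' of the slides."""
    m = q.max()
    return m + eps * np.log(np.sum(np.exp((q - m) / eps)))

Q = np.zeros((n_states, n_actions))
s = 0
for t in range(20000):
    # Sample a_t from the Boltzmann policy exp(Q/eps)/Z.
    logits = (Q[s] - Q[s].max()) / eps
    pi = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(n_actions, p=pi)
    # Environment step: sample s_{t+1} ~ p(.|s,a), observe reward r_t.
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    # Soft Q-learning update on the single visited entry (s_t, a_t).
    Q[s, a] += alpha * (r + gamma * softmax_eps(Q[s_next]) - Q[s, a])
    s = s_next

print(Q)
```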
Solution: parameterize $Q(s,a)$ by $Q(s,a;w)$ and update $w$ in each iteration.
Parameterize the function with a small number of parameters: a neural network.
Deep reinforcement learning:
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
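A minimal sketch of the parameterization idea, with a linear $Q(s,a;w)$ and a semi-gradient update (the DQN of Mnih et al. uses a deep network plus experience replay and a target network, all omitted here; the feature map below is hypothetical).

```python
import numpy as np

n_features, n_actions = 4, 2
gamma, eps, alpha = 0.9, 0.1, 0.01

w = np.zeros((n_actions, n_features))      # parameters of Q(s, a; w)

def features(s):
    """Hypothetical feature map for an integer state s (random but fixed)."""
    return np.random.default_rng(s).normal(size=n_features)

def q_values(s, w):
    return w @ features(s)                  # Q(s, a; w) = w_a . x(s)

def update(s, a, r, s_next, w):
    # Soft Bellman target: r + gamma * eps * log sum_a' exp(Q(s',a';w)/eps).
    q_next = q_values(s_next, w)
    m = q_next.max()
    target = r + gamma * (m + eps * np.log(np.sum(np.exp((q_next - m) / eps))))
    td_error = target - q_values(s, w)[a]
    w[a] += alpha * td_error * features(s)  # semi-gradient step on w
    return w

# Single illustrative update on a sample (s_t, a_t, r_t, s_{t+1}).
w = update(s=0, a=1, r=1.0, s_next=1, w=w)
print(w)
```

This replaces the single-entry tabular update: one transition now nudges all states that share features with $s_t$.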