  1. CS885 Reinforcement Learning
     Lecture 2a: May 4, 2018
     Intro to Markov decision processes
     Readings: [SutBar] Chap. 3, [Sze] Chap. 2, [RusNor] Sec. 17.1-17.2, 17.4, [Put] Chap. 2, 4, 5
     Pascal Poupart, University of Waterloo, CS885 Spring 2018

  2. Markov Decision Process
     • Markov process augmented with…
       – Actions, e.g., a_t
       – Rewards, e.g., r_t
     [Diagram: a chain of states s_0 → s_1 → s_2 → s_3 → s_4 with actions a_0 … a_3 and rewards r_0 … r_3 along the transitions]

  3. Current Assumptions
     • Uncertainty: stochastic process
     • Time: sequential process
     • Observability: fully observable states
     • No learning: complete model
     • Variable type: discrete (e.g., discrete states and actions)

  4. Rewards
     • Rewards: r_t ∈ ℜ
     • Reward function: R(s_t, a_t) = r_t, a mapping from state-action pairs to rewards
     • Common assumption: stationary reward function
       – R(s_t, a_t) is the same ∀t
     • Exception: terminal reward function often different
       – E.g., in a game: 0 reward at each turn and +1/−1 at the end for winning/losing
     • Goal: maximize the sum of rewards Σ_t R(s_t, a_t)

  5. Discounted/Average Rewards
     • If the process is infinite, isn't Σ_t R(s_t, a_t) infinite?
     • Solution 1: discounted rewards
       – Discount factor: 0 ≤ γ < 1
       – Finite utility: Σ_t γ^t R(s_t, a_t) is a geometric sum
       – γ induces an inflation rate of 1/γ − 1
       – Intuition: prefer utility sooner rather than later
     • Solution 2: average rewards
       – More complicated computationally
       – Beyond the scope of this course
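One way to see why discounting keeps the utility finite (a quick sketch; R_max denotes an assumed bound on |R(s_t, a_t)| and is not a symbol from the slides):

    \left| \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right|
      \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
      \;=\; \frac{R_{\max}}{1-\gamma}
      \qquad \text{for } 0 \le \gamma < 1.

So with γ = 0.9, for instance, the discounted sum can never exceed 10 · R_max.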

  6. Markov Decision Process
     • Definition
       – Set of states: S
       – Set of actions: A
       – Transition model: Pr(s_t | s_{t-1}, a_{t-1})
       – Reward model: R(s_t, a_t)
       – Discount factor: 0 ≤ γ ≤ 1
         • discounted: γ < 1; undiscounted: γ = 1
       – Horizon (i.e., # of time steps): h
         • finite horizon: h ∈ ℕ; infinite horizon: h = ∞
     • Goal: find optimal policy
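As a concrete way to hold these ingredients, one might store the transition and reward models as arrays; this is only a sketch, and the class and field names (MDP, P, R, …) are mine, not the course's:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MDP:
        """Finite MDP with |S| states and |A| actions, both indexed by integers."""
        P: np.ndarray            # transition model, shape (A, S, S): P[a, s, s'] = Pr(s' | s, a)
        R: np.ndarray            # reward model, shape (S, A): R[s, a]
        gamma: float             # discount factor, 0 <= gamma <= 1
        horizon: float = np.inf  # number of time steps: a finite integer or np.inf

        @property
        def n_states(self) -> int:
            return self.R.shape[0]

        @property
        def n_actions(self) -> int:
            return self.R.shape[1]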

  7. Inventory Management
     • Markov Decision Process
       – States: inventory levels
       – Actions: {doNothing, orderWidgets}
       – Transition model: stochastic demand
       – Reward model: Sales − Costs − Storage
       – Discount factor: 0.999
       – Horizon: ∞
     • Tradeoff: increasing supplies decreases the odds of missed sales, but increases storage costs
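A toy instantiation of this inventory problem, using the hypothetical MDP container sketched above; the capacity, prices, and demand distribution are invented for illustration, and only the structure (states = inventory levels, two actions, reward = sales minus costs minus storage) follows the slide:

    import numpy as np

    capacity = 3                          # hypothetical maximum inventory level
    S = capacity + 1                      # states are inventory levels 0..capacity
    A = 2                                 # actions: 0 = doNothing, 1 = orderWidgets
    demand_probs = np.array([0.5, 0.5])   # hypothetical demand of 0 or 1 widget per step
    price, order_cost, storage_cost = 2.0, 1.0, 0.1   # hypothetical economics

    P = np.zeros((A, S, S))
    R = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            stock = min(s + a, capacity)  # orderWidgets adds one widget (up to capacity)
            for d, pd in enumerate(demand_probs):
                sold = min(stock, d)
                s_next = stock - sold
                P[a, s, s_next] += pd
                # expected reward = sales - ordering cost - storage cost
                R[s, a] += pd * (price * sold - order_cost * a - storage_cost * s_next)

    inventory_mdp = MDP(P=P, R=R, gamma=0.999)   # infinite horizon by default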

  8. Policy
     • Choice of action at each time step
     • Formally:
       – Mapping from states to actions
       – i.e., π(s_t) = a_t
       – Assumption: fully observable states
         • allows a_t to be chosen based only on the current state s_t

  9. Policy Optimization
     • Policy evaluation:
       – Compute expected utility
         V^π(s_0) = Σ_{t=0}^{h} γ^t Σ_{s_t} Pr(s_t | s_0, π) R(s_t, π(s_t))
     • Optimal policy:
       – Policy π* with highest expected utility
         V^{π*}(s_0) ≥ V^π(s_0)  ∀π
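For the stationary, infinite-horizon case, the expected utility of a fixed policy can be obtained by solving the linear system V^π = R^π + γ P^π V^π; a minimal sketch, assuming the hypothetical MDP layout sketched earlier and γ < 1 (the function name is mine):

    import numpy as np

    def evaluate_policy(mdp, policy):
        """Expected discounted utility V^pi for a deterministic policy.

        policy: integer array of length |S|, policy[s] = action chosen in state s.
        """
        n = mdp.n_states
        P_pi = mdp.P[policy, np.arange(n), :]    # shape (S, S): Pr(s' | s, pi(s))
        R_pi = mdp.R[np.arange(n), policy]       # shape (S,):   R(s, pi(s))
        # Solve (I - gamma * P_pi) V = R_pi
        return np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, R_pi)

For a finite horizon, the same expectation can instead be accumulated by rolling the recursion backward for h steps.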

  10. Policy Optimization
     • Several classes of algorithms:
       – Value iteration
       – Policy iteration
       – Linear programming
       – Search techniques
     • Computation may be done
       – Offline: before the process starts
       – Online: as the process evolves

  11. Value Iteration
     • Performs dynamic programming
     • Optimizes decisions in reverse order
     [Diagram: the chain s_0 → s_1 → s_2 → s_3 → s_4 with actions a_0 … a_3 and rewards r_0 … r_3, processed backwards in time]

  12. Value Iteration
     • Value when no time left:
         V(s_h) = max_{a_h} R(s_h, a_h)
     • Value with one time step left:
         V(s_{h-1}) = max_{a_{h-1}} [ R(s_{h-1}, a_{h-1}) + γ Σ_{s_h} Pr(s_h | s_{h-1}, a_{h-1}) V(s_h) ]
     • Value with two time steps left:
         V(s_{h-2}) = max_{a_{h-2}} [ R(s_{h-2}, a_{h-2}) + γ Σ_{s_{h-1}} Pr(s_{h-1} | s_{h-2}, a_{h-2}) V(s_{h-1}) ]
     • …
     • Bellman's equation:
         V(s_t) = max_{a_t} [ R(s_t, a_t) + γ Σ_{s_{t+1}} Pr(s_{t+1} | s_t, a_t) V(s_{t+1}) ]
         a_t* = argmax_{a_t} [ R(s_t, a_t) + γ Σ_{s_{t+1}} Pr(s_{t+1} | s_t, a_t) V(s_{t+1}) ]
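These backups translate almost line by line into code; a sketch, again assuming the hypothetical array layout introduced earlier (my own illustration, not the course's reference implementation):

    import numpy as np

    def value_iteration(mdp, n_backups):
        """Dynamic programming in reverse order: apply n_backups Bellman backups."""
        V = mdp.R.max(axis=1)            # no time left: V(s_h) = max_a R(s_h, a_h)
        Q = mdp.R
        for _ in range(n_backups):
            # Q(s, a) = R(s, a) + gamma * sum_{s'} Pr(s' | s, a) * V(s')
            Q = mdp.R + mdp.gamma * np.einsum('ast,t->sa', mdp.P, V)
            V = Q.max(axis=1)            # Bellman backup
        return V, Q.argmax(axis=1)       # values and a greedy policy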

  13. A Markov Decision Process (γ = 0.9)
     • You own a company; in every state you must choose between Saving money (S) or Advertising (A)
     • States and rewards: Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10)
     [Diagram: the four states connected by S and A transitions with probabilities ½ or 1]

  14. Value iteration on this MDP (γ = 0.9; same diagram as the previous slide)

     Stage   V(PU)  π(PU)   V(PF)  π(PF)   V(RU)  π(RU)   V(RF)  π(RF)
     h        0.00   A,S     0.00   A,S    10.00   A,S    10.00   A,S
     h − 1    0.00   A,S     4.50    S     14.50    S     19.00    S
     h − 2    2.03    A      8.55    S     16.53    S     25.08    S
     h − 3    4.76    A     12.20    S     18.35    S     28.72    S
     h − 4    7.63    A     15.07    S     20.40    S     31.18    S
     h − 5   10.21    A     17.46    S     22.61    S     33.21    S
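The table can be reproduced with the value-iteration sketch from slide 12, using the transition model as I read it off the diagram; treat the probabilities below as my reconstruction of the figure rather than an authoritative transcription:

    import numpy as np

    # State order: PU, PF, RU, RF.  Action order: 0 = S (save), 1 = A (advertise).
    P_save = np.array([[1.0, 0.0, 0.0, 0.0],   # PU -S-> PU
                       [0.5, 0.0, 0.0, 0.5],   # PF -S-> PU or RF
                       [0.5, 0.0, 0.5, 0.0],   # RU -S-> PU or RU
                       [0.0, 0.0, 0.5, 0.5]])  # RF -S-> RU or RF
    P_adv  = np.array([[0.5, 0.5, 0.0, 0.0],   # PU -A-> PU or PF
                       [0.0, 1.0, 0.0, 0.0],   # PF -A-> PF
                       [0.5, 0.5, 0.0, 0.0],   # RU -A-> PU or PF
                       [0.0, 1.0, 0.0, 0.0]])  # RF -A-> PF
    R = np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [10.0, 10.0]])  # reward depends on the state only

    company = MDP(P=np.stack([P_save, P_adv]), R=R, gamma=0.9)
    for n in range(6):
        V, _ = value_iteration(company, n)
        print(n, np.round(V, 2))   # successive rows h, h-1, ..., h-5 of the table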

  15. Finite Horizon
     • When h is finite, the optimal policy is non-stationary
     • The best action may differ at each time step
     • Intuition: the best action varies with the amount of time left

  16. Infinite Horizon
     • When h is infinite, the optimal policy is stationary
     • Same best action at each time step
     • Intuition: the same (infinite) amount of time is left at each time step, hence the same best action
     • Problem: value iteration does an infinite number of iterations…

  17. Infinite Horizon
     • Assuming a discount factor γ, after k time steps rewards are scaled down by γ^k
     • For large enough k, rewards become insignificant since γ^k → 0
     • Solution:
       – pick a large enough k
       – run value iteration for k steps
       – execute the policy found at the k-th iteration
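A quick way to size k is to ask when γ^k drops below some tolerance ε, i.e., k ≥ log ε / log γ; the numbers below (γ = 0.99, ε = 0.01) are made up for illustration:

    import math

    gamma, eps = 0.99, 0.01                            # hypothetical values
    k = math.ceil(math.log(eps) / math.log(gamma))     # smallest k with gamma**k <= eps
    print(k)                                           # 459 backups for this gamma and eps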
