

  1. CS885 Reinforcement Learning
     Lecture 15c: June 20, 2018
     Semi-Markov Decision Processes [Put] Sec. 11.1-11.3
     University of Waterloo, CS885 Spring 2018, Pascal Poupart

  2. Hierarchical RL
     • Hierarchy of goals and actions in autonomous driving (top to bottom):
       Reach Destination
       → Reach A / Reach B / Reach C
       → Turn / Overtake / Stop / Park
       → Brake / Gas / Steering
     • Theory: Semi-Markov Decision Processes

  3. Semi-Markov Process
     • Definition
       – Set of states: S
       – Transition dynamics: Pr(s', τ | s) = Pr(s' | s) Pr(τ | s),
         where τ indicates the time to transition
     • Semi-Markovian:
       – Next state depends only on the current state
       – Time spent in each state varies
     [Figure: timeline visiting states s, s', s'', s''' with transition times τ, τ', τ'']
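
To make the factored dynamics Pr(s', τ | s) = Pr(s' | s) Pr(τ | s) concrete, here is a minimal simulation sketch; the two-state chain and its distributions are made-up examples, not from the lecture:

```python
# Minimal sketch: sampling a semi-Markov process whose dynamics factor as
# Pr(s', tau | s) = Pr(s' | s) * Pr(tau | s).  All numbers are illustrative.
import random

next_state_dist = {"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.6, "B": 0.4}}   # Pr(s' | s)
holding_time_dist = {"A": {1: 0.5, 2: 0.5}, "B": {1: 0.1, 3: 0.9}}         # Pr(tau | s)

def sample(dist):
    """Draw a key from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value  # guard against floating-point rounding

def simulate(s, n_jumps=5):
    """Return a list of (state, holding_time) pairs: the next state depends
    only on the current state, while the time spent in each state varies."""
    trajectory = []
    for _ in range(n_jumps):
        tau = sample(holding_time_dist[s])   # time spent in s
        trajectory.append((s, tau))
        s = sample(next_state_dist[s])       # next state
    return trajectory

print(simulate("A"))
```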

  4. Semi-Markov Decision Process
     • Definition
       – Set of states: S
       – Set of actions: A
       – Transition model: Pr(s', τ | s, a)
       – Reward model: R(s, a) = E[r | s, a]
       – Discount factor: 0 ≤ γ ≤ 1
         • discounted: γ < 1; undiscounted: γ = 1
       – Horizon (i.e., # of time steps): h
         • finite horizon: h ∈ ℕ; infinite horizon: h = ∞
     • Goal: find the optimal policy
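
As a sketch of how this tuple could be represented in code (the type and field names below are illustrative, not part of the lecture):

```python
# Sketch of the SMDP tuple (S, A, Pr, R, gamma, h); names are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class SMDP:
    states: List[str]                        # S
    actions: List[str]                       # A
    # transition(s, a) -> {(s', tau): prob}, i.e. Pr(s', tau | s, a)
    transition: Callable[[str, str], Dict[Tuple[str, int], float]]
    reward: Callable[[str, str], float]      # R(s, a) = E[r | s, a]
    gamma: float                             # discount factor, 0 <= gamma <= 1
    horizon: float                           # h; use math.inf for infinite horizon
```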

  5. Example from Queuing Theory
     • Consider a retail store with two queues:
       – Customer service queue
       – Cashier queue
     • Semi-Markov decision process
       – State: s = (n_1, n_2), where n_i = # of customers in queue i
       – Action: a ∈ {1, 2} (i.e., serve a customer in queue 1 or 2)
       – Transition model: distribution over arrival and service times for customers in each queue
       – Reward model: expected revenue of each serviced customer minus expected cost associated with waiting times
       – Discount factor: 0 ≤ γ < 1
       – Horizon (i.e., # of time steps): h = ∞
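
A possible simulator for one decision step of this example is sketched below; the arrival probabilities, service times, revenues, and waiting cost are made-up placeholders rather than values from the slides:

```python
# Sketch: one step of the two-queue retail-store SMDP.
# All numeric parameters below are assumed, purely for illustration.
import random

ARRIVAL_PROB = (0.3, 0.2)   # arrival probability per unit time, queues 1 and 2
SERVICE_TIME = (2, 3)       # time to serve one customer in each queue
REVENUE = (5.0, 8.0)        # revenue per serviced customer
WAIT_COST = 0.1             # cost per waiting customer per unit time

def step(state, action):
    """state = (n1, n2); action in {1, 2}. Returns (next_state, tau, reward)."""
    n = list(state)
    q = action - 1
    tau = SERVICE_TIME[q] if n[q] > 0 else 1   # serving (or idling) takes tau units
    reward = REVENUE[q] if n[q] > 0 else 0.0
    if n[q] > 0:
        n[q] -= 1                              # serviced customer leaves
    for _ in range(tau):                       # customers may arrive while we serve
        for i in (0, 1):
            if random.random() < ARRIVAL_PROB[i]:
                n[i] += 1
        reward -= WAIT_COST * (n[0] + n[1])    # waiting-time penalty
    return tuple(n), tau, reward

print(step((3, 1), action=1))
```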

  6. Value Function and Policy
     • Objective: V^π(s_{t_0}) = Σ_n γ^{t_n} R(s_{t_n}, π(s_{t_n}))
       – where t_n = τ_1 + τ_2 + ... + τ_n
       – Optimal policy: π* such that V^{π*}(s) ≥ V^π(s) ∀ s, π
     • Bellman's equation:
       V*(s) = max_a R(s, a) + Σ_{s', τ} Pr(s', τ | s, a) γ^τ V*(s')
     • Q-learning update:
       Q(s, a) ← Q(s, a) + α [r + γ^τ max_{a'} Q(s', a') − Q(s, a)]
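
The Q-learning update above maps directly to code; a minimal sketch (the learning rate, the tabular representation, and the example arguments are assumptions):

```python
# Sketch of the SMDP Q-learning update:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma**tau * max_a' Q(s',a') - Q(s,a))
from collections import defaultdict

def smdp_q_update(Q, s, a, r, s_next, tau, actions, alpha=0.1, gamma=0.95):
    """Q maps (state, action) -> value; tau is the observed transition time,
    so the future value is discounted by gamma**tau rather than gamma."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma ** tau * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q

Q = defaultdict(float)
Q = smdp_q_update(Q, s=(3, 1), a=1, r=4.2, s_next=(2, 2), tau=2, actions=(1, 2))
```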

  7. Option Framework
     • Semi-Markov decision process where actions are options (temporally extended sub-policies)
     • Let o be an option with sub-policy π and terminal states S_end.
       ∀ s_{t+τ} ∈ S_end:
       Pr(s_{t+τ}, τ | s_t, o) = Σ_{s_{t+1}, ..., s_{t+τ-1} ∉ S_end} Π_{k=1}^{τ} Pr(s_{t+k} | s_{t+k-1}, π(s_{t+k-1}))
       R(s_t, o, s_{t+τ}, τ) = R(s_t, π(s_t)) + γ Σ_{s_{t+1}} Pr(s_{t+1} | s_t, π(s_t)) R(s_{t+1}, π(s_{t+1}))
         + ... + γ^τ Σ_{s_{t+τ}} Pr(s_{t+τ} | s_{t+τ-1}, π(s_{t+τ-1})) R(s_{t+τ}, π(s_{t+τ}))
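
Operationally, these quantities correspond to running the option's sub-policy until it reaches a terminal state; the sketch below collects one sample of the landing state, the duration τ, and the per-step rewards (env.step, the sub-policy dictionary, and the terminal set are assumed interfaces, not from the slides):

```python
# Sketch: execute option o = (sub_policy, terminal_states) from state s.
# Each call yields one Monte-Carlo sample of the landing state s_{t+tau},
# the duration tau, and the rewards accumulated along the way.
def run_option(env, s, sub_policy, terminal_states):
    rewards, tau = [], 0
    while s not in terminal_states:
        a = sub_policy[s]            # the option follows its own policy pi
        s, r = env.step(s, a)        # assumed one-step environment interface
        rewards.append(r)
        tau += 1
    return s, tau, rewards
```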

  8. Option Framework
     • Bellman's equation:
       V*(s) = max_o Σ_{s', τ} Pr(s', τ | s, o) [R(s, o, s', τ) + γ^τ V*(s')]
     • Q-learning update:
       Q(s, o) ← Q(s, o) + α [r_τ + γ^τ max_{o'} Q(s', o') − Q(s, o)]
       where r_τ = Σ_{k=0}^{τ-1} γ^k r_{t+k} is the discounted reward accumulated while the option executes
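
Combining the two pieces, the option-level update uses the rewards collected while the option ran, discounted step by step; a minimal sketch (names and defaults are assumptions, and Q is a defaultdict(float) as in the earlier sketch):

```python
# Sketch of SMDP Q-learning over options:
# Q(s,o) <- Q(s,o) + alpha * (r_tau + gamma**tau * max_o' Q(s',o') - Q(s,o)),
# where r_tau = sum_k gamma**k * r_k is accumulated during the option.
def option_q_update(Q, s, o, rewards, s_next, options, alpha=0.1, gamma=0.95):
    tau = len(rewards)                                   # option duration
    r_tau = sum(gamma ** k * r for k, r in enumerate(rewards))
    best_next = max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += alpha * (r_tau + gamma ** tau * best_next - Q[(s, o)])
    return Q
```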
