CSCE 496/896 Lecture 7: Reinforcement Learning



  1. CSCE 496/896 Lecture 7: Reinforcement Learning
     Stephen Scott (adapted from Paul Quint)
     sscott@cse.unl.edu

  2. Introduction
     Consider learning to choose actions, e.g.:
     - A robot learning to dock on a battery charger
     - Learning to choose actions to optimize factory output
     - Learning to play Backgammon, chess, Go, etc.
     Note several problem characteristics:
     - Delayed reward (thus the problem of temporal credit assignment)
     - Opportunity for active exploration (versus exploitation of known good actions)
       ⇒ the learner has some influence over the training data it sees
     - Possibility that the state is only partially observable

  3. Example: TD-Gammon (Tesauro, 1995)
     Learn to play Backgammon.
     Immediate reward: +100 if win, -100 if lose, 0 for all other states.
     Trained by playing 1.5 million games against itself.
     Approximately equal to the best human player at that time.

  4. Outline
     - Markov decision processes
     - The agent's learning task
     - Q learning
     - Temporal difference learning
     - Deep Q learning
     - Example: Learning to play Atari

  5. Reinforcement Learning Problem
     [Figure: agent-environment loop. The agent observes the state and reward from the environment and chooses an action, generating a trajectory $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]
     Goal: learn to choose actions that maximize the discounted return
     $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 < \gamma < 1$
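To make the objective above concrete, here is a minimal Python sketch (not from the lecture; the reward sequence and $\gamma$ are invented) that computes a finite discounted return:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Hypothetical episode: no reward until the third step.
print(round(discounted_return([0, 0, 100], gamma=0.9), 2))  # 0 + 0.9*0 + 0.81*100 = 81.0
```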

  6. Markov Decision Processes
     Assume:
     - Finite set of states $S$, set of actions $A$
     - At each discrete time $t$, the agent observes state $s_t \in S$ and chooses action $a_t \in A$
     - It then receives immediate reward $r_t$, and the state changes to $s_{t+1}$
     Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
     - I.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
     - The functions $\delta$ and $r$ may be nondeterministic
     - The functions $\delta$ and $r$ are not necessarily known to the agent
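As a concrete instance of $\delta$ and $r$, the sketch below defines a tiny deterministic chain world A → B → C → G in Python. The world itself is invented for illustration (it mimics the grid world on the later slides, with reward 100 only for entering the absorbing goal G), and it is reused in the later sketches.

```python
# States A, B, C plus an absorbing goal G; actions move left or right along the chain.
delta = {("A", "right"): "B", ("B", "right"): "C", ("C", "right"): "G",
         ("A", "left"): "A", ("B", "left"): "A", ("C", "left"): "B",
         ("G", "left"): "G", ("G", "right"): "G"}
reward = {sa: 0 for sa in delta}
reward[("C", "right")] = 100             # only the transition into the goal is rewarded

def step(s, a):
    """Markov assumption: the next state and reward depend only on the current (s, a)."""
    return delta[(s, a)], reward[(s, a)]

print(step("C", "right"))  # ('G', 100)
```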

  7. Agent's Learning Task
     Execute actions in the environment, observe the results, and learn an action policy $\pi : S \to A$ that maximizes
     $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
     from any starting state in $S$. Here $0 \le \gamma < 1$ is the discount factor for future rewards.
     Note something new:
     - The target function is $\pi : S \to A$
     - But we have no training examples of the form $\langle s, a \rangle$
     - Training examples are of the form $\langle \langle s, a \rangle, r \rangle$
     - I.e., the learner is not told the best action; instead it is told the reward for executing action $a$ in state $s$

  8. Value Function
     First consider deterministic worlds.
     For each possible policy $\pi$ the agent might adopt, define the discounted cumulative reward as
     $V^\pi(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$,
     where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$, starting at state $s$.
     Restated, the task is to learn an optimal policy $\pi^*$:
     $\pi^* \equiv \operatorname{argmax}_\pi V^\pi(s),\ (\forall s)$
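In a deterministic world, $V^\pi(s)$ can be approximated by rolling the policy forward and summing discounted rewards. A minimal sketch on the invented chain world from the earlier example (the horizon cutoff is an assumption used to truncate the infinite sum):

```python
# Chain world A -> B -> C -> G (absorbing goal); entering G pays 100, all else 0.
delta = {("A", "right"): "B", ("B", "right"): "C", ("C", "right"): "G",
         ("A", "left"): "A", ("B", "left"): "A", ("C", "left"): "B",
         ("G", "left"): "G", ("G", "right"): "G"}
reward = {sa: 0 for sa in delta}
reward[("C", "right")] = 100

def V_pi(policy, s, gamma=0.9, horizon=100):
    """Approximate V^pi(s) = sum_i gamma^i * r_{t+i} by truncating the infinite sum."""
    total = 0.0
    for i in range(horizon):
        a = policy[s]
        total += (gamma ** i) * reward[(s, a)]
        s = delta[(s, a)]
    return total

always_right = {s: "right" for s in "ABCG"}
print([round(V_pi(always_right, s), 1) for s in "ABC"])  # [81.0, 90.0, 100.0]
```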

  9. Value Function (grid-world example)
     [Figure: a simple grid world with absorbing goal state $G$ and $\gamma = 0.9$. Panels show the immediate rewards $r(s,a)$ (100 for any action entering $G$, 0 otherwise), the corresponding $Q(s,a)$ values (e.g., 66, 72, 81, 90, 100), the $V^*(s)$ values (81, 90, 100), and one optimal policy.]

  10. What to Learn
     We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$).
     It could then do a one-step lookahead search to choose the best action from any state $s$, because
     $\pi^*(s) = \operatorname{argmax}_a \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$,
     i.e., choose the action that maximizes the immediate reward plus the discounted reward obtained by following the optimal strategy from then on.
     E.g., $V^*(\text{bottom center}) = 0 + \gamma \cdot 100 + \gamma^2 \cdot 0 + \gamma^3 \cdot 0 + \cdots = 90$
     A problem: this works well if the agent knows $\delta : S \times A \to S$ and $r : S \times A \to \mathbb{R}$, but when it doesn't, it cannot choose actions this way.
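Assuming the agent does know $\delta$ and $r$, the one-step lookahead above translates directly into code. This sketch uses the invented chain world and hand-computed $V^*$ values (81, 90, 100), mirroring the numbers on the grid-world slide:

```python
GAMMA = 0.9

# Chain world A -> B -> C -> G (absorbing goal); entering G pays 100, all else 0.
delta = {("A", "right"): "B", ("B", "right"): "C", ("C", "right"): "G",
         ("A", "left"): "A", ("B", "left"): "A", ("C", "left"): "B",
         ("G", "left"): "G", ("G", "right"): "G"}
reward = {sa: 0 for sa in delta}
reward[("C", "right")] = 100

# Hand-computed optimal values (the absorbed goal G has value 0).
V_star = {"A": 81, "B": 90, "C": 100, "G": 0}

def pi_star(s):
    """One-step lookahead: argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]."""
    return max(["left", "right"],
               key=lambda a: reward[(s, a)] + GAMMA * V_star[delta[(s, a)]])

print([pi_star(s) for s in "ABC"])  # ['right', 'right', 'right']
```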

  11. Q Function
     Define a new function very similar to $V^*$:
     $Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$
     I.e., $Q(s,a)$ is the total discounted reward if action $a$ is taken in state $s$ and optimal choices are made from then on.
     If the agent learns $Q$, it can choose the optimal action even without knowing $\delta$:
     $\pi^*(s) = \operatorname{argmax}_a \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right] = \operatorname{argmax}_a Q(s,a)$
     $Q$ is the evaluation function the agent will learn.
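By contrast, once $Q$ is known, action selection needs neither $\delta$ nor $r$. A minimal sketch with $Q$ values hand-computed for the same invented chain world:

```python
# Hand-computed Q values for the chain world (gamma = 0.9), e.g. Q(B, right) = 0 + 0.9 * V*(C) = 90.
Q = {("A", "right"): 81.0, ("A", "left"): 72.9,
     ("B", "right"): 90.0, ("B", "left"): 72.9,
     ("C", "right"): 100.0, ("C", "left"): 81.0}

def greedy_action(s, actions=("left", "right")):
    """pi*(s) = argmax_a Q(s, a): no knowledge of delta or r is needed."""
    return max(actions, key=lambda a: Q[(s, a)])

print([greedy_action(s) for s in "ABC"])  # ['right', 'right', 'right']
```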

  12. Training Rule to Learn Q
     Note that $Q$ and $V^*$ are closely related:
     $V^*(s) = \max_{a'} Q(s,a')$
     which allows us to write $Q$ recursively as
     $Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
     Let $\hat{Q}$ denote the learner's current approximation to $Q$; consider the training rule
     $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$,
     where $s'$ is the state resulting from applying action $a$ in state $s$.
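The training rule is a one-line table update. A minimal sketch, assuming $\hat{Q}$ is stored as a dict keyed by (state, action) and using invented example values:

```python
def q_update(Q, s, a, r, s_next, actions, gamma=0.9):
    """Apply Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a') in place."""
    Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

# Invented example: two states, two actions, current estimates as shown.
Q = {("s1", "right"): 0.0, ("s1", "left"): 0.0,
     ("s2", "right"): 100.0, ("s2", "left"): 66.0}
q_update(Q, "s1", "right", r=0, s_next="s2", actions=["left", "right"])
print(Q[("s1", "right")])  # 0 + 0.9 * max(66.0, 100.0) = 90.0
```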

  13. Q Learning for Deterministic Worlds
     For each $s, a$, initialize table entry $\hat{Q}(s,a) \leftarrow 0$.
     Observe current state $s$.
     Do forever:
     - Select an action $a$ (greedily or probabilistically) and execute it
     - Receive immediate reward $r$
     - Observe the new state $s'$
     - Update the table entry for $\hat{Q}(s,a)$: $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$
     - $s \leftarrow s'$
     Note that actions not taken and states not seen don't get explicit updates (might need to generalize).
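Putting the pieces together, here is a hedged sketch of the tabular algorithm on the invented chain world; the ε-greedy action selection and the fixed number of episodes are assumptions for illustration (the slide's pseudocode loops forever and leaves the selection strategy open):

```python
import random

GAMMA, EPSILON = 0.9, 0.2
ACTIONS = ["left", "right"]

# Deterministic chain world A -> B -> C -> G (absorbing goal); entering G pays 100.
delta = {("A", "right"): "B", ("B", "right"): "C", ("C", "right"): "G",
         ("A", "left"): "A", ("B", "left"): "A", ("C", "left"): "B",
         ("G", "left"): "G", ("G", "right"): "G"}
reward = {sa: 0 for sa in delta}
reward[("C", "right")] = 100

Q = {sa: 0.0 for sa in delta}               # initialize every table entry to 0

for episode in range(500):
    s = random.choice("ABC")                # start each episode in a non-goal state
    while s != "G":
        # Select an action (here epsilon-greedily) and execute it.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r = delta[(s, a)], reward[(s, a)]
        # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print({sa: round(q, 1) for sa, q in Q.items() if sa[0] != "G"})
# Should approach Q(A,right)=81, Q(B,right)=90, Q(C,right)=100 (and lower values for "left").
```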

  14. Updating $\hat{Q}$
     Example (grid world, $\gamma = 0.9$): taking action $a_{\text{right}}$ moves the agent from state $s_1$ to state $s_2$, so
     $\hat{Q}(s_1, a_{\text{right}}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{66, 81, 100\} = 90$
     Can show via induction on $n$ that if the rewards are non-negative and the $\hat{Q}$ values are initially 0, then
     $(\forall s, a, n)\ \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$
     and
     $(\forall s, a, n)\ 0 \le \hat{Q}_n(s,a) \le Q(s,a)$

  15. Updating $\hat{Q}$: Convergence
     $\hat{Q}$ converges to $Q$: consider the case of a deterministic world where each $\langle s, a \rangle$ is visited infinitely often.
     Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. We will show that during each full interval the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
     Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ the maximum error in $\hat{Q}_n$; i.e.,
     $\Delta_n = \max_{s,a} |\hat{Q}_n(s,a) - Q(s,a)|$
     Let $s' = \delta(s,a)$.

  16. Updating $\hat{Q}$: Convergence (cont'd)
     For any table entry $\hat{Q}_n(s,a)$ updated on iteration $n+1$, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is
     $|\hat{Q}_{n+1}(s,a) - Q(s,a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))|$
     $= \gamma\,|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')|$
     $\le \gamma \max_{a'} |\hat{Q}_n(s',a') - Q(s',a')|$   $(*)$
     $\le \gamma \max_{s'',a'} |\hat{Q}_n(s'',a') - Q(s'',a')|$   $(**)$
     $= \gamma \Delta_n$
     $(*)$ holds since $|\max_a f_1(a) - \max_a f_2(a)| \le \max_a |f_1(a) - f_2(a)|$
     $(**)$ holds since taking the max over all $s''$ (rather than the fixed $s'$) cannot decrease the value

  17. Updating $\hat{Q}$: Convergence (cont'd)
     Also, $\hat{Q}_0(s,a)$ and $Q(s,a)$ are both bounded for all $s, a$, so $\Delta_0$ is bounded.
     Thus after $k$ full intervals, the error is at most $\gamma^k \Delta_0$.
     Finally, each $\langle s, a \rangle$ is visited infinitely often, so the number of full intervals is infinite, and hence $\Delta_n \to 0$ as $n \to \infty$.
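The contraction argument can be checked numerically: if every $\langle s, a \rangle$ is updated once per sweep (one full interval), the maximum error should shrink by at least a factor of $\gamma$ per sweep. A sketch on the invented chain world, with the true $Q$ values hand-computed:

```python
GAMMA = 0.9

# Chain world A -> B -> C -> G (absorbing goal); entering G pays 100, all else 0.
delta = {("A", "right"): "B", ("B", "right"): "C", ("C", "right"): "G",
         ("A", "left"): "A", ("B", "left"): "A", ("C", "left"): "B"}
reward = {sa: 0 for sa in delta}
reward[("C", "right")] = 100

# True Q values for this world, computed by hand (the absorbing goal G has value 0).
Q_true = {("A", "right"): 81.0, ("A", "left"): 72.9,
          ("B", "right"): 90.0, ("B", "left"): 72.9,
          ("C", "right"): 100.0, ("C", "left"): 81.0}

Q_hat = {sa: 0.0 for sa in delta}

def max_error():
    """Delta_n = max over (s, a) of |Q-hat(s, a) - Q(s, a)|."""
    return max(abs(Q_hat[sa] - Q_true[sa]) for sa in delta)

for k in range(5):                          # each sweep updates every (s, a): a "full interval"
    before = max_error()
    for (s, a) in delta:
        s_next = delta[(s, a)]
        best_next = 0.0 if s_next == "G" else max(Q_hat[(s_next, a2)] for a2 in ("left", "right"))
        Q_hat[(s, a)] = reward[(s, a)] + GAMMA * best_next
    print(f"sweep {k}: error {before:.1f} -> {max_error():.1f}")
# Each sweep reduces the maximum error by at least a factor of gamma = 0.9.
```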
