
Lecture 7: Imitation Learning in Large State Spaces. Emma Brunskill, CS234 Reinforcement Learning, Winter 2020. With slides from Katerina Fragkiadaki and Pieter Abbeel.


  1. Lecture 7: Imitation Learning in Large State Spaces. Emma Brunskill, CS234 Reinforcement Learning, Winter 2020. With slides from Katerina Fragkiadaki and Pieter Abbeel.

  2. Refresh Your Knowledge 6. Experience replay in deep Q-learning (select all):
     1. Involves using a bank of prior (s, a, r, s') tuples and doing Q-learning updates using all the tuples in the bank
     2. Always uses the most recent history of tuples
     3. Reduces the data efficiency of DQN
     4. Increases the computational cost
     5. Not sure
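
A minimal sketch of the replay buffer the question refers to (not from the lecture; the class and method names are illustrative): a fixed-size bank of (s, a, r, s') tuples from which random minibatches are drawn for Q-learning updates, rather than always using only the most recent transitions.

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size bank of (s, a, r, s_next, done) tuples."""

        def __init__(self, capacity=100_000):
            # Oldest tuples are evicted once capacity is reached.
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            # Uniform random minibatch: breaks temporal correlation and
            # lets each tuple be reused across many updates.
            return random.sample(self.buffer, batch_size)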

  3. Deep RL. Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL. Some immediate improvements (many others!):
     - Double DQN (Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016)
     - Prioritized Replay (Prioritized Experience Replay, Schaul et al., ICLR 2016)
     - Dueling DQN (Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., ICML 2016; best paper)

  4. Class Structure. Last time: CNNs and deep reinforcement learning. This time: deep RL and imitation learning in large state spaces. Next time: policy search.

  5. Double DQN. Recall the maximization bias challenge: the max of the estimated state-action values can be a biased estimate of the true max. Double Q-learning addresses this.

  6. Recall: Double Q-Learning (this uses a lookup table representation for the state-action value).
     Initialize Q1(s, a) and Q2(s, a) for all s ∈ S, a ∈ A; set t = 0 and initial state s_t = s_0
     loop
       Select a_t using ε-greedy π(s) = arg max_a [Q1(s_t, a) + Q2(s_t, a)]
       Observe (r_t, s_{t+1})
       if (with 0.5 probability) then
         Q1(s_t, a_t) ← Q1(s_t, a_t) + α (r_t + γ Q1(s_{t+1}, arg max_{a'} Q2(s_{t+1}, a')) − Q1(s_t, a_t))
       else
         Q2(s_t, a_t) ← Q2(s_t, a_t) + α (r_t + γ Q2(s_{t+1}, arg max_{a'} Q1(s_{t+1}, a')) − Q2(s_t, a_t))
       end if
       t = t + 1
     end loop
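
A tabular sketch of the algorithm above (not from the slides), assuming a small environment with hypothetical env.reset() and env.step(a) methods that return integer states, a reward, and a done flag:

    import numpy as np

    def double_q_learning(env, n_states, n_actions, episodes=500,
                          alpha=0.1, gamma=0.99, eps=0.1):
        # Two independent tables; acting uses their sum, and each update
        # selects the argmax with one table but evaluates it with the other,
        # which reduces maximization bias.
        Q1 = np.zeros((n_states, n_actions))
        Q2 = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if np.random.rand() < eps:               # epsilon-greedy on Q1 + Q2
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q1[s] + Q2[s]))
                s_next, r, done = env.step(a)
                if np.random.rand() < 0.5:
                    a_star = int(np.argmax(Q2[s_next]))  # select action with Q2
                    target = r + gamma * Q1[s_next, a_star] * (not done)
                    Q1[s, a] += alpha * (target - Q1[s, a])
                else:
                    a_star = int(np.argmax(Q1[s_next]))  # select action with Q1
                    target = r + gamma * Q2[s_next, a_star] * (not done)
                    Q2[s, a] += alpha * (target - Q2[s, a])
                s = s_next
        return Q1, Q2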

  7. Deep RL (recap of the improvements listed on slide 3): Double DQN, Prioritized Replay, and Dueling DQN.

  8. Check Your Understanding: Mars Rover Model-Free Policy Evaluation.
     Figure: Mars Rover MDP with states s1, ..., s7; rewards R(s1) = +1, R(s7) = +10, and 0 for all other states.
     Setup: π(s) = a1 for all s, γ = 1. Any action from s1 or s7 terminates the episode.
     Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal).
     First-visit MC estimate of V for each state: [1 1 1 0 0 0 0].
     TD estimate of all states (initialized at 0) with α = 1: [1 0 0 0 0 0 0].
     Choose 2 "replay" backups to do. Which should we pick to get an estimate closest to the MC first-visit estimate?
     1. Doesn't matter, any will yield the same
     2. (s3, a1, 0, s2) then (s2, a1, 0, s1)
     3. (s2, a1, 0, s1) then (s3, a1, 0, s2)
     4. Not sure
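
A short script (not part of the slides) that reproduces the estimates above and tries the replay backups in both orders; indices 1 to 7 stand for s1 to s7:

    import numpy as np

    gamma, alpha = 1.0, 1.0
    # (state, reward received on leaving it), following the trajectory above.
    traj = [(3, 0), (2, 0), (2, 0), (1, 1)]
    next_states = [2, 2, 1, 0]            # 0 marks the terminal state

    # First-visit Monte Carlo: return from each state's first visit.
    V_mc, G = np.zeros(8), 0.0
    first_return = {}
    for s, r in reversed(traj):
        G = r + gamma * G
        first_return[s] = G               # overwriting keeps the earliest visit's return
    for s, G_s in first_return.items():
        V_mc[s] = G_s
    print("First-visit MC:", V_mc[1:])    # [1. 1. 1. 0. 0. 0. 0.]

    # One pass of TD(0) with V initialized to 0 and alpha = 1.
    V_td = np.zeros(8)
    for (s, r), s_next in zip(traj, next_states):
        V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])
    print("TD(0):", V_td[1:])             # [1. 0. 0. 0. 0. 0. 0.]

    # Replaying (s2, a1, 0, s1) and then (s3, a1, 0, s2) propagates the reward
    # back two steps and matches the MC estimate; the reverse order does not.
    for s, r, s_next in [(2, 0, 1), (3, 0, 2)]:
        V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])
    print("After replay:", V_td[1:])      # [1. 1. 1. 0. 0. 0. 0.]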

  9. Impact of Replay? In tabular TD learning, the order in which updates are replayed could help speed learning: repeating some updates seems to propagate information better than others. Are there systematic ways to prioritize updates?

  10. Potential Impact of Ordering Episodic Replay Updates. Figure: Schaul, Quan, Antonoglou, Silver, ICLR 2016. An oracle that picks the (s, a, r, s') tuple whose replay will minimize the global loss gives an exponential improvement in convergence (measured in the number of updates needed to converge). The oracle is not a practical method, but it illustrates the impact of ordering.

  11. Prioritized Experience Replay. Let i be the index of the i-th tuple of experience (s_i, a_i, r_i, s_{i+1}). Sample tuples for update using a priority function. The priority of tuple i is proportional to the DQN error:
     p_i = | r + γ max_{a'} Q(s_{i+1}, a'; w^-) − Q(s_i, a_i; w) |
     Update p_i after every update; p_i for new tuples is set to 0. One method (see the paper for details and an alternative) is proportional stochastic prioritization:
     P(i) = p_i^α / Σ_k p_k^α
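
A minimal sketch of proportional stochastic prioritization (a simple O(N) version, not the sum-tree implementation described in the paper); the small eps follows the paper's proportional variant and keeps zero-error tuples sampleable:

    import numpy as np

    def sample_prioritized(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
        """Return indices sampled with P(i) proportional to p_i ** alpha.

        alpha = 0 recovers uniform random sampling; alpha = 1 samples in
        direct proportion to the magnitude of the TD error.
        """
        p = (np.abs(td_errors) + eps) ** alpha
        probs = p / p.sum()
        idx = np.random.choice(len(p), size=batch_size, p=probs)
        return idx, probs

    # After each Q-learning update, refresh the priorities of the sampled
    # tuples with their new absolute TD errors before the next draw.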

  12. Exercise: Prioritized Replay. With the same priorities p_i and proportional stochastic prioritization P(i) = p_i^α / Σ_k p_k^α as on the previous slide, α = 0 yields what rule for selecting among existing tuples?
     1. Selects randomly
     2. Selects the one with the highest priority
     3. It depends on the priorities of the tuples
     4. Not sure

  13. Performance of Prioritized Replay vs. Double DQN. Figure: Schaul, Quan, Antonoglou, Silver, ICLR 2016.

  14. Deep RL (recap of the improvements listed on slide 3): Double DQN, Prioritized Replay, and Dueling DQN.

  15. Value & Advantage Function. Intuition: the features needed to accurately represent the value may be different from those needed to specify the relative merit of actions. For example, the game score may help accurately predict V(s), but it is not necessarily useful for indicating the relative action values Q(s, a1) vs. Q(s, a2). Advantage function (Baird 1993): A^π(s, a) = Q^π(s, a) − V^π(s).

  16. Dueling DQN.

  17. Check Understanding: Unique? Advantage function A^π(s, a) = Q^π(s, a) − V^π(s). For a given advantage function, is there a unique Q and V?
     1. Yes
     2. No
     3. Not sure

  18. Uniqueness. Advantage function A^π(s, a) = Q^π(s, a) − V^π(s). The decomposition is not unique.
     Option 1: force A(s, a) = 0 if a is the action taken:
     Q(s, a; w) = V(s; w) + ( A(s, a; w) − max_{a' ∈ A} A(s, a'; w) )
     Option 2: use the mean advantage as the baseline (more stable):
     Q(s, a; w) = V(s; w) + ( A(s, a; w) − (1/|A|) Σ_{a' ∈ A} A(s, a'; w) )
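
A sketch of the dueling head with the mean baseline (Option 2 above), written with PyTorch-style modules; layer sizes are illustrative:

    import torch.nn as nn

    class DuelingHead(nn.Module):
        """Splits shared features into V(s) and A(s, .), then recombines them."""

        def __init__(self, feature_dim, n_actions, hidden=128):
            super().__init__()
            self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))
            self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, n_actions))

        def forward(self, features):
            v = self.value(features)            # shape (batch, 1)
            a = self.advantage(features)        # shape (batch, n_actions)
            # Subtracting the mean advantage pins down the V/A decomposition
            # (Option 2), the more stable choice noted on the slide.
            return v + a - a.mean(dim=1, keepdim=True)   # Q(s, a; w)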

  19. Dueling DQN vs. Double DQN with Prioritized Replay. Figure: Wang et al., ICML 2016.
