
Lecture 7: Imitation Learning in Large State Spaces. Emma Brunskill, CS234 Reinforcement Learning, Winter 2020. With slides from Katerina Fragkiadaki and Pieter Abbeel.


  1. Lecture 7: Imitation Learning in Large State Spaces. Emma Brunskill, CS234 Reinforcement Learning, Winter 2020. With slides from Katerina Fragkiadaki and Pieter Abbeel.

  2. Refresh Your Knowledge 6. Experience replay in deep Q-learning (select all):
     1. Involves using a bank of prior (s, a, r, s') tuples and doing Q-learning updates using all the tuples in the bank
     2. Always uses the most recent history of tuples
     3. Reduces the data efficiency of DQN
     4. Increases the computational cost
     5. Not sure
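
A minimal sketch of the replay buffer the question refers to (not from the lecture; the class and method names are illustrative): a fixed-size bank of (s, a, r, s') tuples from which random minibatches are drawn for Q-learning updates, rather than always using only the most recent transitions.

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size bank of (s, a, r, s_next, done) tuples."""

        def __init__(self, capacity=100_000):
            # Oldest tuples are evicted once capacity is reached.
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            # Uniform random minibatch: breaks temporal correlation and
            # lets each tuple be reused across many updates.
            return random.sample(self.buffer, batch_size)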

  3. Deep RL. Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL. Some immediate improvements (many others!):
     - Double DQN (Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016)
     - Prioritized Replay (Prioritized Experience Replay, Schaul et al., ICLR 2016)
     - Dueling DQN (Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., ICML 2016; best paper)

  4. Class Structure. Last time: CNNs and deep reinforcement learning. This time: deep RL and imitation learning in large state spaces. Next time: policy search.

  5. Double DQN. Recall the maximization bias challenge: the max of the estimated state-action values can be a biased estimate of the true max. Double Q-learning addresses this.

  6. Recall: Double Q-Learning (this uses a lookup table representation for the state-action value).
     Initialize Q1(s, a) and Q2(s, a) for all s ∈ S, a ∈ A; set t = 0 and initial state s_t = s_0
     loop
       Select a_t using ε-greedy π(s) = arg max_a [Q1(s_t, a) + Q2(s_t, a)]
       Observe (r_t, s_{t+1})
       if (with 0.5 probability) then
         Q1(s_t, a_t) ← Q1(s_t, a_t) + α (r_t + γ Q1(s_{t+1}, arg max_{a'} Q2(s_{t+1}, a')) − Q1(s_t, a_t))
       else
         Q2(s_t, a_t) ← Q2(s_t, a_t) + α (r_t + γ Q2(s_{t+1}, arg max_{a'} Q1(s_{t+1}, a')) − Q2(s_t, a_t))
       end if
       t = t + 1
     end loop
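
A tabular sketch of the algorithm above (not from the slides), assuming a small environment with hypothetical env.reset() and env.step(a) methods that return integer states, a reward, and a done flag:

    import numpy as np

    def double_q_learning(env, n_states, n_actions, episodes=500,
                          alpha=0.1, gamma=0.99, eps=0.1):
        # Two independent tables; acting uses their sum, and each update
        # selects the argmax with one table but evaluates it with the other,
        # which reduces maximization bias.
        Q1 = np.zeros((n_states, n_actions))
        Q2 = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if np.random.rand() < eps:               # epsilon-greedy on Q1 + Q2
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q1[s] + Q2[s]))
                s_next, r, done = env.step(a)
                if np.random.rand() < 0.5:
                    a_star = int(np.argmax(Q2[s_next]))  # select action with Q2
                    target = r + gamma * Q1[s_next, a_star] * (not done)
                    Q1[s, a] += alpha * (target - Q1[s, a])
                else:
                    a_star = int(np.argmax(Q1[s_next]))  # select action with Q1
                    target = r + gamma * Q2[s_next, a_star] * (not done)
                    Q2[s, a] += alpha * (target - Q2[s, a])
                s = s_next
        return Q1, Q2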

  7. Deep RL (recap of the improvements listed on slide 3): Double DQN, Prioritized Replay, and Dueling DQN.

  8. Check Your Understanding: Mars Rover Model-Free Policy Evaluation.
     Figure: Mars Rover MDP with states s1, ..., s7; rewards R(s1) = +1, R(s7) = +10, and 0 for all other states.
     Setup: π(s) = a1 for all s, γ = 1. Any action from s1 or s7 terminates the episode.
     Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal).
     First-visit MC estimate of V for each state: [1 1 1 0 0 0 0].
     TD estimate of all states (initialized at 0) with α = 1: [1 0 0 0 0 0 0].
     Choose 2 "replay" backups to do. Which should we pick to get an estimate closest to the MC first-visit estimate?
     1. Doesn't matter, any will yield the same
     2. (s3, a1, 0, s2) then (s2, a1, 0, s1)
     3. (s2, a1, 0, s1) then (s3, a1, 0, s2)
     4. Not sure
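
A short script (not part of the slides) that reproduces the estimates above and tries the replay backups in both orders; indices 1 to 7 stand for s1 to s7:

    import numpy as np

    gamma, alpha = 1.0, 1.0
    # (state, reward received on leaving it), following the trajectory above.
    traj = [(3, 0), (2, 0), (2, 0), (1, 1)]
    next_states = [2, 2, 1, 0]            # 0 marks the terminal state

    # First-visit Monte Carlo: return from each state's first visit.
    V_mc, G = np.zeros(8), 0.0
    first_return = {}
    for s, r in reversed(traj):
        G = r + gamma * G
        first_return[s] = G               # overwriting keeps the earliest visit's return
    for s, G_s in first_return.items():
        V_mc[s] = G_s
    print("First-visit MC:", V_mc[1:])    # [1. 1. 1. 0. 0. 0. 0.]

    # One pass of TD(0) with V initialized to 0 and alpha = 1.
    V_td = np.zeros(8)
    for (s, r), s_next in zip(traj, next_states):
        V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])
    print("TD(0):", V_td[1:])             # [1. 0. 0. 0. 0. 0. 0.]

    # Replaying (s2, a1, 0, s1) and then (s3, a1, 0, s2) propagates the reward
    # back two steps and matches the MC estimate; the reverse order does not.
    for s, r, s_next in [(2, 0, 1), (3, 0, 2)]:
        V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])
    print("After replay:", V_td[1:])      # [1. 1. 1. 0. 0. 0. 0.]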

  9. Impact of Replay? In tabular TD learning, the order in which updates are replayed could help speed learning: repeating some updates seems to propagate information better than others. Are there systematic ways to prioritize updates?

  10. Potential Impact of Ordering Episodic Replay Updates. Figure: Schaul, Quan, Antonoglou, Silver, ICLR 2016. An oracle that picks the (s, a, r, s') tuple whose replay will minimize the global loss gives an exponential improvement in convergence (measured in the number of updates needed to converge). The oracle is not a practical method, but it illustrates the impact of ordering.

  11. Prioritized Experience Replay. Let i be the index of the i-th tuple of experience (s_i, a_i, r_i, s_{i+1}). Sample tuples for update using a priority function. The priority of tuple i is proportional to the DQN error:
     p_i = | r + γ max_{a'} Q(s_{i+1}, a'; w^-) − Q(s_i, a_i; w) |
     Update p_i after every update; p_i for new tuples is set to 0. One method (see the paper for details and an alternative) is proportional stochastic prioritization:
     P(i) = p_i^α / Σ_k p_k^α
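
A minimal sketch of proportional stochastic prioritization (a simple O(N) version, not the sum-tree implementation described in the paper); the small eps follows the paper's proportional variant and keeps zero-error tuples sampleable:

    import numpy as np

    def sample_prioritized(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
        """Return indices sampled with P(i) proportional to p_i ** alpha.

        alpha = 0 recovers uniform random sampling; alpha = 1 samples in
        direct proportion to the magnitude of the TD error.
        """
        p = (np.abs(td_errors) + eps) ** alpha
        probs = p / p.sum()
        idx = np.random.choice(len(p), size=batch_size, p=probs)
        return idx, probs

    # After each Q-learning update, refresh the priorities of the sampled
    # tuples with their new absolute TD errors before the next draw.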

  12. Exercise: Prioritized Replay. With the same priorities p_i and proportional stochastic prioritization P(i) = p_i^α / Σ_k p_k^α as on the previous slide, α = 0 yields what rule for selecting among existing tuples?
     1. Selects randomly
     2. Selects the one with the highest priority
     3. It depends on the priorities of the tuples
     4. Not sure

  13. Performance of Prioritized Replay vs. Double DQN. Figure: Schaul, Quan, Antonoglou, Silver, ICLR 2016.

  14. Deep RL (recap of the improvements listed on slide 3): Double DQN, Prioritized Replay, and Dueling DQN.

  15. Value & Advantage Function. Intuition: the features needed to accurately represent the value may be different from those needed to specify the relative merit of actions. For example, the game score may help accurately predict V(s), but it is not necessarily useful for indicating the relative action values Q(s, a1) vs. Q(s, a2). Advantage function (Baird 1993): A^π(s, a) = Q^π(s, a) − V^π(s).

  16. Dueling DQN.

  17. Check Understanding: Unique? Advantage function A^π(s, a) = Q^π(s, a) − V^π(s). For a given advantage function, is there a unique Q and V?
     1. Yes
     2. No
     3. Not sure

  18. Uniqueness. Advantage function A^π(s, a) = Q^π(s, a) − V^π(s). The decomposition is not unique.
     Option 1: force A(s, a) = 0 if a is the action taken:
     Q(s, a; w) = V(s; w) + ( A(s, a; w) − max_{a' ∈ A} A(s, a'; w) )
     Option 2: use the mean advantage as the baseline (more stable):
     Q(s, a; w) = V(s; w) + ( A(s, a; w) − (1/|A|) Σ_{a' ∈ A} A(s, a'; w) )
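
A sketch of the dueling head with the mean baseline (Option 2 above), written with PyTorch-style modules; layer sizes are illustrative:

    import torch.nn as nn

    class DuelingHead(nn.Module):
        """Splits shared features into V(s) and A(s, .), then recombines them."""

        def __init__(self, feature_dim, n_actions, hidden=128):
            super().__init__()
            self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))
            self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, n_actions))

        def forward(self, features):
            v = self.value(features)            # shape (batch, 1)
            a = self.advantage(features)        # shape (batch, n_actions)
            # Subtracting the mean advantage pins down the V/A decomposition
            # (Option 2), the more stable choice noted on the slide.
            return v + a - a.mean(dim=1, keepdim=True)   # Q(s, a; w)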

  19. Dueling DQN vs. Double DQN with Prioritized Replay. Figure: Wang et al., ICML 2016.
