What Can Learned Intrinsic Reward Capture?
Gao Chenxiao, LAMDA, Nanjing University
November 17, 2020
Table of Contents
- 1. Overview
- 2. Algorithm
- 3. Experiments and Analysis
Contributions
- Uses intrinsic reward to capture long-term knowledge about both exploration and
exploitation.
- Distinguishes the roles of the policy and the reward function in RL problems:
- The policy describes "how the agent should behave", while the reward function describes
"what the agent should strive to do".
- Knowledge about "what" takes effect on the agent's behavior indirectly and more slowly
(through planning and learning), but it generalizes better across different algorithms.
Related Work
- Reward shaping: aims to handcraft rewards that steer the agent toward known optimal behavior.
- Two main kinds of methods: task-dependent / task-independent.
- Reward learned from experience
- Optimal Reward Framework: introduced by Singh et al., 2009.
- Compared to LIRPG and AGILE, this agent is able to generalize to new agent-environment
interfaces and algorithms.
- Cognitive studies
- Humans use both a random exploration strategy and an information-seeking strategy when
facing uncertainty; however, the latter has not been fully explored in prior work.
Table of Contents
- 1. Overview
- 2. Algorithm
- 3. Experiments and Analysis
Terminology: MDP Part
- MDP = system dynamics + extrinsic reward
- Agent: a learning system interacting with an environment. At each time step the agent selects
an action $a_t$, receives an extrinsic reward $r_t$ defined by a task $T$, and transitions from
state $s_t$ to $s_{t+1}$.
- Policy: a mapping $\pi_\theta(a|s)$ from environment states to the agent's behavior,
parameterized by $\theta$.
- Episode: a finite sequence of agent-environment interactions until termination.
- Episodic return: $G^{\text{ep}} = \sum_{t=0}^{T^{\text{ep}}-1} \gamma^t r_{t+1}$
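As a quick concrete check of the return definition (a minimal sketch, not from the slides):

```python
def discounted_return(rewards, gamma=0.99):
    """G^ep = sum_t gamma^t * r_{t+1}, computed backwards over one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: reward only at the final step of a three-step episode
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # ~0.81 (= 0.9^2)
```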
Terminology: Intrinsic Part
- Lifetime: a finite sequence of agent-environment interactions until the end of training, as
defined by the agent designer. In this paper, a lifetime consists of a fixed number of episodes.
- Lifetime return: $G^{\text{life}} = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$
- Intrinsic reward: a reward function $r_\eta(\tau_t)$ parameterized by $\eta$, where
$\tau_t = (s_0, a_0, r_1, d_1, s_1, \ldots, r_t, d_t, s_t)$ is the lifetime history experienced by the agent.
- Lifetime value function: a value function $V_\phi(\tau_t)$ used to approximate the
accumulated lifetime return from the current history.
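Since $r_\eta$ conditions on the whole lifetime history, it is naturally implemented as a recurrent network. A minimal PyTorch sketch follows; the LSTM, layer sizes, and input encoding are illustrative assumptions, not necessarily the paper's exact architecture:

```python
import torch.nn as nn

class IntrinsicReward(nn.Module):
    """r_eta(tau_t): maps the lifetime history so far to a scalar reward."""

    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        # input_dim: size of one encoded history step, e.g. concat(s, one-hot a, r, d)
        self.rnn = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, history):
        # history: (batch, t, input_dim) tensor of encoded (s, a, r, d) tuples
        summary, _ = self.rnn(history)
        return self.head(summary[:, -1]).squeeze(-1)  # intrinsic reward at step t
```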
Terminology: The Optimal Reward Problem [1]
- Objective:
$$\text{maximize} \quad J(\eta) = \mathbb{E}_{\theta_0 \sim \Theta,\, T \sim p(T)} \left[ \mathbb{E}_{\tau \sim p_\eta(\tau|\theta_0)} \left[ G^{\text{life}} \right] \right]$$
- $\Theta$ and $p(T)$ are an initial policy distribution and a distribution over tasks, $\tau$ is the agent's history, and
$$p_\eta(\tau|\theta_0) = p(s_0) \prod_{t=0}^{T-1} \pi_{\theta_t}(a_t|s_t)\, p(d_{t+1}, r_{t+1}, s_{t+1} \mid s_t, a_t)$$
[1] Intrinsically motivated reinforcement learning: An evolutionary perspective
Algorithm: Overview
Parameters:
- $\theta \rightarrow \pi_\theta$ (policy)
- $\eta \rightarrow r_\eta$ (intrinsic reward)
- $\phi \rightarrow V_\phi$ (lifetime value function)
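To show how the three parameter sets interact over a lifetime, here is a hypothetical loop skeleton; every function below is a stub standing in for the updates detailed on the next slides, not the paper's implementation:

```python
import random

def sample_task():
    return random.choice(["task_a", "task_b"])   # placeholder task distribution p(T)

def run_episode(policy, task, r_eta):
    return [("s", "a", 0.0)]                     # placeholder episode history

def update_policy(policy, episode, r_eta):
    pass                                         # slide: Policy Update (theta)

def update_reward_and_value(lifetime):
    pass                                         # slides: Intrinsic Reward / Value Update (eta, phi)

def train_one_lifetime(r_eta, num_episodes=100):
    task, policy, lifetime = sample_task(), {"theta": 0.0}, []
    for _ in range(num_episodes):                # a lifetime = fixed number of episodes
        episode = run_episode(policy, task, r_eta)
        update_policy(policy, episode, r_eta)    # inner loop updates theta
        lifetime.extend(episode)
    update_reward_and_value(lifetime)            # outer loop updates eta and phi
```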
Algorithm: Policy Update
- Maximize the episodic cumulative intrinsic reward:
$$J_\eta(\theta) = \mathbb{E}_\theta \left[ \sum_{t=0}^{T^{\text{ep}}-1} \bar{\gamma}^t r_\eta(\tau_{t+1}) \right]$$
$$\nabla_\theta J_\eta(\theta) = \mathbb{E}_\theta \left[ G^{\text{ep}}_{\eta,t} \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \quad \text{(for each } t\text{)}$$
where $G^{\text{ep}}_{\eta,t} = \sum_{k=t}^{T^{\text{ep}}-1} \bar{\gamma}^{k-t} r_\eta(\tau_{k+1})$.
- Similar to REINFORCE.
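A minimal PyTorch sketch of this REINFORCE-style update; for clarity the intrinsic rewards are treated as fixed scalars here, even though the actual method keeps them differentiable with respect to $\eta$ for the meta-gradient:

```python
import torch

def policy_update(optimizer, log_probs, intrinsic_rewards, gamma_bar=0.9):
    """One REINFORCE step on the episodic intrinsic return.

    log_probs:         list of log pi_theta(a_t|s_t) tensors for one episode
    intrinsic_rewards: list of r_eta(tau_{t+1}) values, treated as fixed here
    """
    # reward-to-go: G^ep_{eta,t} = sum_{k>=t} gamma_bar^(k-t) * r_eta(tau_{k+1})
    returns, g = [], 0.0
    for r in reversed(intrinsic_rewards):
        g = r + gamma_bar * g
        returns.append(g)
    returns.reverse()

    # ascend the policy gradient: maximize sum_t G^ep_{eta,t} * log pi(a_t|s_t)
    loss = -torch.stack([lp * g for lp, g in zip(log_probs, returns)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```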
Algorithm: Intrinsic Reward Update
- Maximize the lifetime return:
$$J(\eta) = \mathbb{E}_{\theta_0 \sim \Theta,\, T \sim p(T)} \left[ \mathbb{E}_{\tau \sim p_\eta(\tau|\theta_0)} \left[ G^{\text{life}} \right] \right]$$
$$\nabla_\eta J(\eta) = \mathbb{E}_{\theta_t, T} \left[ G^{\text{life}}_t \nabla_{\theta_t} \log \pi_{\theta_t}(a_t|s_t) \nabla_\eta \theta_t \right]$$
- Computing the meta-gradient requires backpropagation through the entire lifetime. In practice
we truncate the meta-gradient after $N$ steps and use $G^{\text{life},\phi}_t$ to approximate $G^{\text{life}}_t$:
$$G^{\text{life},\phi}_t = \sum_{k=0}^{N-1} \gamma^k r_{t+k+1} + \gamma^N V_\phi(\tau_{t+N})$$
Algorithm: Intrinsic Reward Update (cont'd)
- The derivation is similar to that of the policy gradient theorem; the extra factor
$\nabla_\eta \theta_t$ arises because each policy update makes $\theta_t$ a function of $\eta$
through the intrinsic rewards, so the gradient chains backwards through the policy updates.
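As a toy illustration of this meta-gradient (not the paper's code: a hypothetical one-parameter policy, a single inner step, and made-up constants), the key point is that the inner policy update is kept differentiable with respect to $\eta$:

```python
import torch

eta = torch.tensor(0.5, requires_grad=True)      # intrinsic reward parameter
theta = torch.tensor(0.0, requires_grad=True)    # policy parameter

def log_pi(theta, a):
    # two-action softmax policy with logits (theta, -theta)
    return torch.log_softmax(torch.stack([theta, -theta]), dim=0)[a]

alpha, G_life = 0.1, 2.0                         # inner step size, lifetime return
r_intrinsic = eta * 1.0                          # r_eta depends on eta

# Inner update: REINFORCE on the intrinsic reward, differentiable w.r.t. eta
inner_obj = r_intrinsic * log_pi(theta, 0)
grad_theta = torch.autograd.grad(inner_obj, theta, create_graph=True)[0]
theta_new = theta + alpha * grad_theta           # theta_new is a function of eta

# Outer objective: lifetime return weighted log-prob under the *updated* policy
outer_obj = G_life * log_pi(theta_new, 1)
meta_grad = torch.autograd.grad(outer_obj, eta)[0]
print(meta_grad)  # d(outer)/d(eta), flowing through the policy update
```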
Algorithm: Lifetime Value Update
- Uses an n-step temporal-difference objective:
$$J(\phi) = \frac{1}{2} \left( G^{\text{life},\phi}_t - V_\phi(\tau_t) \right)^2$$
$$\nabla_\phi J(\phi) = -\left( G^{\text{life},\phi}_t - V_\phi(\tau_t) \right) \nabla_\phi V_\phi(\tau_t)$$
(a semi-gradient: the bootstrapped target $G^{\text{life},\phi}_t$ is treated as a constant)
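A minimal PyTorch sketch of this update, with a hypothetical `v_net` mapping an encoded history to a scalar value:

```python
import torch

def value_update(v_net, optimizer, tau_t, ext_rewards, tau_next, gamma=0.99):
    """n-step TD update for the lifetime value function V_phi.

    ext_rewards: extrinsic rewards r_{t+1} ... r_{t+N}
    tau_t, tau_next: encodings of the lifetime history at steps t and t+N
    """
    with torch.no_grad():                   # semi-gradient: target is a constant
        g = v_net(tau_next)                 # becomes gamma^N * V_phi(tau_{t+N})
        for r in reversed(ext_rewards):
            g = r + gamma * g               # builds sum_k gamma^k r_{t+k+1} + gamma^N V
    loss = (0.5 * (g - v_net(tau_t)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```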
Table of Contents
- 1. Overview
- 2. Algorithm
- 3. Experiments and Analysis
Empty Room: Exploring Uncertainty
- Blue squares: the hidden goal; yellow squares: the agent.
- While the goal has not been found, the intrinsic reward encourages the agent to visit
unknown locations; once the goal is found, it drives the agent to exploit that knowledge.
Random ABC: Exploring Uncertain Objects
- $r(A) \sim U[-1, 1]$, $r(B) \sim U[-0.5, 0]$, $r(C) \sim U[0, 0.5]$
- These results show that both avoidance of and curiosity about uncertain objects can
emerge in the learned intrinsic reward.
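For concreteness, a sketch of how one such task could be drawn (resampling once per lifetime is my reading of the setup, not stated on this slide):

```python
import numpy as np

def sample_abc_task(rng=None):
    """Draw one Random ABC task: each object's reward is sampled anew."""
    rng = np.random.default_rng() if rng is None else rng
    return {
        "A": rng.uniform(-1.0, 1.0),   # high variance: sometimes best, sometimes worst
        "B": rng.uniform(-0.5, 0.0),   # never positive: worth learning to avoid
        "C": rng.uniform(0.0, 0.5),    # never negative: a safe default
    }

print(sample_abc_task())  # e.g. {'A': 0.31, 'B': -0.12, 'C': 0.44}
```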
Key-Box: Exploring and Exploiting a Causal Relationship
- Based on Random ABC, but the agent must first collect the key.
- All algorithms except the learned intrinsic reward failed to capture that the key is necessary
to open any box, which demonstrates that the intrinsic reward can learn the relationships
between objects when the domain has this kind of invariant dynamics.
Non-stationary ABC: Dealing with Non-stationarity
- Based on Random ABC, but the rewards of A and C are swapped every 250 episodes.
- This experiment shows that the intrinsic reward can capture the regularly repeating
non-stationary pattern of the tasks.
Ablation
Two ablation studies:
- Replace the lifetime return objective $G^{\text{life}}$ with the episodic return $G^{\text{ep}}$.
- Restrict the input of the reward network to the current state instead of the lifetime history.
Back to the Title
What can learned intrinsic reward capture?
- Useful exploration strategies.
- Characteristics of the task: stochastic rewards, causal relationships, and
non-stationary patterns.
And why?
- It tells the agent "what to do" rather than "how to do it".
- The lifetime return exploits cross-episode knowledge more explicitly, e.g. the
comparison between goal states (Random ABC).
Intrinsic Reward for Meta RL
- Knowledge captured by the intrinsic reward is useful for training randomly initialized policies,
while other meta-RL algorithms such as MAML and RL2 are designed for fast adaptation to new tasks.
Intrinsic Reward for Meta RL (cont'd)
- ...but the learned intrinsic reward can provide more robustness, because it is model-agnostic
and agent-agnostic.