What Can Learned Intrinsic Reward Capture?
Gao Chenxiao, LAMDA, Nanjing University
November 17, 2020
Table of Contents
1. Overview
2. Algorithm
3. Experiments and Analysis
Contributions
• Use intrinsic reward to capture long-term knowledge about both exploration and exploitation.
• Distinguish the roles of the policy and the reward function in RL problems:
  • The policy describes "how the agent should behave", while the reward function describes "what the agent should strive to do".
  • Knowledge about "what" is indirect and slower to take effect on the agent's behavior (through planning and learning), but it generalizes better across different algorithms.
Related Work
• Reward shaping: aims to handcraft rewards that steer the agent toward known optimal behavior
  • Two main methods: task dependent / task independent
• Reward learned from experience
  • Optimal Reward Framework: introduced by Singh et al., 2009
  • Compared to LIRPG and AGILE, this agent is able to generalize to new agent-environment interfaces and algorithms.
• Cognitive studies
  • Humans use both a random exploration strategy and an information-seeking strategy when facing uncertainty; the latter, however, has not been fully discussed in prior work.
Table of Contents
1. Overview
2. Algorithm
3. Experiments and Analysis
Terminology: MDP Part
• MDP = system dynamics + extrinsic reward
• Agent: a learning system interacting with an environment. At each time step the agent selects an action $a_t$, receives an extrinsic reward $r_t$ defined by a task $T$, and transitions from state $s_t$ to $s_{t+1}$.
• Policy: a mapping $\pi_\theta(a \mid s)$ from the environment state to the agent's behavior, parameterized by $\theta$.
• Episode: a finite sequence of agent-environment interactions until the end.
• Episodic return: $G^{\text{ep}} = \sum_{t=0}^{T_{\text{ep}}-1} \gamma^t r_{t+1}$
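As a small worked example of the episodic return, here is a minimal Python sketch; the reward list and discount value are illustrative, not taken from the paper.

```python
def episodic_return(rewards, gamma):
    """Discounted episodic return G_ep = sum_t gamma^t * r_{t+1}.

    `rewards` is the list [r_1, ..., r_Tep] collected in one episode.
    """
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# e.g. a reward of 1 arriving on the third step with gamma = 0.9:
# episodic_return([0.0, 0.0, 1.0], gamma=0.9) == 0.81
```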
Terminology: Intrinsic Part
• Lifetime: a finite sequence of agent-environment interactions until the end of training, defined by the agent designer. In this paper, a lifetime consists of a fixed number of episodes.
• Lifetime return: $G^{\text{life}} = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$
• Intrinsic reward: a reward function $r_\eta(\tau_t)$ parameterised by $\eta$, where $\tau_t = (s_0, a_0, r_1, d_1, s_1, \ldots, r_t, d_t, s_t)$ is the lifetime history experienced by the agent.
• Lifetime value function: a value function $V_\phi(s)$ used to approximate the accumulated lifetime return of the states.
Terminology: The Optimal Reward Problem¹
• Objective: maximize $J(\eta) = \mathbb{E}_{\theta_0 \sim \Theta,\; T \sim p(T)}\big[\, \mathbb{E}_{\tau \sim p_\eta(\tau \mid \theta_0)}[G^{\text{life}}] \,\big]$
• $\Theta$ and $p(T)$ are an initial policy distribution and a distribution over tasks, $\tau$ is the agent's history, and $p_\eta(\tau \mid \theta_0) = p(s_0) \prod_{t=0}^{T-1} \pi_{\theta_t}(a_t \mid s_t)\, p(d_{t+1}, r_{t+1}, s_{t+1} \mid s_t, a_t)$.

¹ Intrinsically motivated reinforcement learning: An evolutionary perspective
Algorithm: Overview
Parameters (see the sketch below):
• $\theta \to \pi_\theta$ (policy)
• $\eta \to r_\eta$ (intrinsic reward)
• $\phi \to V_\phi$ (lifetime value function)
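To make the three parameter sets concrete, here is a minimal PyTorch sketch of one possible parameterisation. The layer sizes and the choice of an LSTM encoder over the lifetime history $\tau_t$ are assumptions for illustration, not the authors' exact architecture.

```python
import torch.nn as nn

class Policy(nn.Module):                       # theta -> pi_theta(a|s)
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, s):                      # returns action logits
        return self.net(s)

class LifetimeEncoder(nn.Module):
    """Recurrent encoder of the lifetime history tau_t = (s_0, a_0, r_1, d_1, ...)."""
    def __init__(self, step_dim, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(step_dim, hidden, batch_first=True)
    def forward(self, history):                # history: (batch, T, step_dim)
        out, _ = self.rnn(history)
        return out[:, -1]                      # encoding of tau_t

class IntrinsicReward(nn.Module):              # eta -> r_eta(tau_t)
    def __init__(self, step_dim, hidden=64):
        super().__init__()
        self.enc = LifetimeEncoder(step_dim, hidden)
        self.head = nn.Linear(hidden, 1)
    def forward(self, history):
        return self.head(self.enc(history)).squeeze(-1)

class LifetimeValue(nn.Module):                # phi -> V_phi(tau_t)
    def __init__(self, step_dim, hidden=64):
        super().__init__()
        self.enc = LifetimeEncoder(step_dim, hidden)
        self.head = nn.Linear(hidden, 1)
    def forward(self, history):
        return self.head(self.enc(history)).squeeze(-1)
```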
Algorithm: Policy Update
• Maximize the episodic cumulative intrinsic reward:
  $J_\eta(\theta) = \mathbb{E}_\theta\big[\sum_{t=0}^{T_{\text{ep}}-1} \bar\gamma^{\,t}\, r_\eta(\tau_{t+1})\big]$
  $\nabla_\theta J_\eta(\theta) = \mathbb{E}_\theta\big[\, G^{\text{ep}}_{\eta,t}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\big]$ (for each $t$), where $G^{\text{ep}}_{\eta,t} = \sum_{k=t}^{T_{\text{ep}}-1} \bar\gamma^{\,k-t}\, r_\eta(\tau_{k+1})$.
• Similar to REINFORCE (see the sketch below).
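A hedged sketch of this inner-loop update in the REINFORCE style, assuming the per-step intrinsic rewards $r_\eta(\tau_{t+1})$ and the log-probabilities $\log \pi_\theta(a_t \mid s_t)$ (as PyTorch tensors) have already been collected over one episode; variable names and the optimizer are illustrative.

```python
def policy_update(logps, intrinsic_rewards, optimizer, gamma_bar=0.99):
    """REINFORCE-style update of theta on the episodic *intrinsic* return.

    logps[t]             = log pi_theta(a_t | s_t)  (torch tensors carrying grad)
    intrinsic_rewards[t] = r_eta(tau_{t+1})         (treated as fixed numbers here;
                           the full algorithm keeps them differentiable so the
                           meta-gradient d(theta)/d(eta) can flow through this step)
    """
    T = len(intrinsic_rewards)
    returns, g = [0.0] * T, 0.0
    for t in reversed(range(T)):     # G_{eta,t} = sum_k gamma_bar^{k-t} r_eta(tau_{k+1})
        g = intrinsic_rewards[t] + gamma_bar * g
        returns[t] = g
    loss = -sum(g_t * lp for g_t, lp in zip(returns, logps))  # ascend J_eta(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```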
Algorithm: Intrinsic Reward Update
• Maximize the lifetime return:
  $J(\eta) = \mathbb{E}_{\theta_0 \sim \Theta,\; T \sim p(T)}\big[\, \mathbb{E}_{\tau \sim p_\eta(\tau \mid \theta_0)}[G^{\text{life}}] \,\big]$
  $\nabla_\eta J(\eta) = \mathbb{E}_{\theta_t, T}\big[\, G^{\text{life}}_t\, \nabla_{\theta_t} \log \pi_{\theta_t}(a_t \mid s_t)\, \nabla_\eta \theta_t \,\big]$
• Computing the meta-gradient requires backpropagation through the entire lifetime. In practice we truncate the meta-gradient after $N$ steps and use $G^{\text{life},\phi}_t$ to approximate $G^{\text{life}}_t$:
  $G^{\text{life},\phi}_t = \sum_{k=0}^{N-1} \gamma^k r_{t+k+1} + \gamma^N V_\phi(\tau_{t+N})$
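A minimal sketch of the truncated, bootstrapped lifetime return $G^{\text{life},\phi}_t$; it assumes extrinsic rewards indexed as `rewards[i]` $= r_i$, lifetime histories `tau[i]` $= \tau_i$, and a lifetime value network `V_phi`, all of which are illustrative names.

```python
def bootstrapped_lifetime_return(rewards, tau, V_phi, t, n, gamma=0.99):
    """G^{life,phi}_t = sum_{k=0}^{n-1} gamma^k r_{t+k+1} + gamma^n V_phi(tau_{t+n})."""
    g = 0.0
    for k in range(n):
        g += (gamma ** k) * rewards[t + k + 1]   # n-step extrinsic rewards
    g += (gamma ** n) * V_phi(tau[t + n])        # bootstrap with the lifetime value
    return g
```

In the full update, the meta-gradient $\nabla_\eta J(\eta)$ chains this return with $\nabla_\eta \theta_t$, obtained by backpropagating through the last $N$ inner-loop policy updates.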
Algorithm: Intrinsic Reward Update (cont'd)
• The derivation is similar to that of the policy gradient theorem.
Algorithm: Lifetime Value Update
• Using an $n$-step temporal-difference target:
  $J(\phi) = \tfrac{1}{2}\big(G^{\text{life},\phi}_t - V_\phi(\tau_t)\big)^2$
  $\nabla_\phi J(\phi) = -\big(G^{\text{life},\phi}_t - V_\phi(\tau_t)\big)\, \nabla_\phi V_\phi(\tau_t)$ (semi-gradient: the target is held fixed)
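A hedged sketch of the corresponding semi-gradient value update, reusing the bootstrapped return from the previous sketch; the optimizer and the assumption of a single (unbatched) target are illustrative.

```python
def lifetime_value_update(V_phi, tau_t, g_life_phi, optimizer):
    """Semi-gradient n-step TD update for the lifetime value function V_phi.

    Minimises J(phi) = 1/2 * (G^{life,phi}_t - V_phi(tau_t))^2
    with the target g_life_phi (a torch tensor) held fixed via detach().
    """
    target = g_life_phi.detach()
    loss = (0.5 * (target - V_phi(tau_t)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```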
Table of Contents
1. Overview
2. Algorithm
3. Experiments and Analysis
Empty Room: Exploring Uncertainty
• Blue squares: the hidden goal; yellow squares: the agent.
• While the goal has not been found, the intrinsic reward encourages the agent to visit unknown locations; after the goal is found, it makes the agent exploit that knowledge.
ABC: Exploring Uncertain Objects
• $r(A) \sim U[-1, 1]$, $r(B) \sim U[-0.5, 0]$, $r(C) \sim U[0, 0.5]$
• These results show that avoidance of and curiosity about uncertain objects can emerge in the intrinsic reward.
Key-Box: Exploring and Exploiting a Causal Relationship
• Based on Random ABC, but the agent must first collect the key.
• All algorithms except the learned intrinsic reward failed to capture that the key is necessary to open any box, which demonstrates that the intrinsic reward can learn the relationships between objects when the domain has this kind of invariant dynamics.
Non-stationary ABC: Dealing with Non-stationarity
• Based on Random ABC, but the rewards of A and C are swapped every 250 episodes.
• This experiment shows that the intrinsic reward can capture the regularly repeated non-stationary pattern of the tasks.
Ablation
Two ablation studies:
• Replace the lifetime return objective $G^{\text{life}}$ with the episodic return $G^{\text{ep}}$.
• Restrict the input of the reward network to the current state instead of the lifetime history.
Back to the Title
What can the learned intrinsic reward capture?
• Useful exploration strategies
• Characteristics of the task: stochastic rewards, causal relationships, and non-stationary patterns
Why?
• It tells the agent "what to do" rather than "how to do it".
• The lifetime return uses cross-episode knowledge more explicitly, e.g. comparing goal states across episodes (Random ABC).
Intrinsic Reward for Meta-RL
• The knowledge captured by the intrinsic reward is useful for training randomly initialised policies, whereas other meta-RL algorithms such as MAML and RL² are designed for fast adaptation to new tasks.
Intrinsic Reward for Meta-RL (cont'd)
• ...but it can provide more robustness because it is model-agnostic and agent-agnostic.
Thanks for listening!