Learning

The agent doesn't know µ a priori. Recall the incomputable Solomonoff model class:

    M(e_<t | a_<t) = Σ_{p : p(a_<t) = e_<t} 2^(-ℓ(p))

Approximation: introduce a finite model class M:

    ξ(e_t | æ_<t a_t) = Σ_{ν ∈ M} w_ν ν(e_t | æ_<t a_t)

Update the posterior w_ν with Bayes' rule:

    w_ν ← w_ν ν(e_t) / ξ(e_t)    ∀ν ∈ M

For very small M we can compute this exactly. Let's look at this with some toy examples.
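To make the recap concrete, here is a minimal Python sketch of the posterior update over a finite model class. The interface `nu.percept_prob(percept, history, action)` is an assumption for illustration, not the AIXIjs API.

```python
# Minimal sketch: Bayesian mixture over a finite model class.
# Assumed interface: nu.percept_prob(percept, history, action) = nu(e_t | ae_<t a_t).

def bayes_update(weights, models, action, percept, history):
    """One step of w_nu <- w_nu * nu(e_t | ae_<t a_t) / xi(e_t | ae_<t a_t)."""
    likelihoods = [nu.percept_prob(percept, history, action) for nu in models]
    xi = sum(w * l for w, l in zip(weights, likelihoods))  # mixture probability of e_t
    if xi == 0.0:
        return weights  # percept impossible under every model; leave weights as-is
    return [w * l / xi for w, l in zip(weights, likelihoods)]

# Uniform prior over a finite model class:
# weights = [1.0 / len(models)] * len(models)
```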
Gridworld example

Consider a class of gridworlds:
- The world is a procedurally generated N × N maze.
- The action space is A = {←, →, ↑, ↓, ∅}. The agent is a robot.
- The grey tiles are walls that yield −5 reward if hit.
- The white tiles are empty, but moving costs −1.
Gridworld example

- The orange circle looks like an empty tile, but randomly dispenses +100 reward each step with some fixed probability θ.
- The agent has O(N²) steps to live, e.g. 200 steps on a 10 × 10 grid.
- The observations consist of just four bits: O = B⁴.

This is a stochastic & partially observable environment with simple & easy-to-understand dynamics [3].
Simple model class

Let the agent know:
- Maze layout
- Dispenser probability θ
- Environment dynamics.

Let it be uncertain about where the only dispenser is:

    M = { Gridworld with dispenser at (x, y) : 1 ≤ x, y ≤ N }

- There are at most |M| ≤ N² 'legal' dispenser positions.
- Let the agent have a uniform prior w_ν = |M|^(-1) ∀ν ∈ M.
- Each ν is a complete gridworld simulator, and µ ∈ M.
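As a concrete sketch (not the AIXIjs implementation), the simple model class could be constructed as follows; `GridworldSim` and its constructor arguments are hypothetical names.

```python
# Hypothetical sketch: one complete gridworld simulator per candidate dispenser position.
N = 10
THETA = 0.75  # assumed dispenser payout probability, known to the agent

class GridworldSim:
    """Placeholder for a full gridworld simulator with the dispenser at (x, y)."""
    def __init__(self, maze, dispenser_pos, theta):
        self.maze, self.dispenser_pos, self.theta = maze, dispenser_pos, theta

maze = [[0] * N for _ in range(N)]  # assumed known layout (0 = empty, 1 = wall)
model_class = [GridworldSim(maze, (x, y), THETA)
               for x in range(N) for y in range(N)
               if maze[y][x] == 0]  # only 'legal' (non-wall) positions
prior = [1.0 / len(model_class)] * len(model_class)  # uniform prior over |M| <= N^2 models
```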
AIXIjs

Enough talk. Let's see an online web demo:

    aslanides.io/aixijs
Simple model class: what did we just see?

Let's visualize the agent's uncertainty as it learns.

Initially, the agent has a uniform prior, shown in green.
Simple model class

After exploring a little, the agent's beliefs have changed. Lighter green corresponds to less probability mass.
Simple model class

After discovering the dispenser, the agent's posterior concentrates on µ. This concentration is immediate: a global 'collapse'.
A more general model class

The previous model class was limited. Here's a more interesting one.

Model each tile independently with a categorical/Dirichlet distribution over tile types (wall, empty, dispenser):

    ρ(e_t | ...) = ∏_{s' ∈ ne(s_t)} Dirichlet(p | α_{s'})

The joint distribution factorizes over the grid. The agent learns about state dynamics only locally, rather than globally (see the sketch below).

Using this model, the agent is uncertain about:
- Maze layout
- Location, number, and payout probabilities θ_i of the dispenser(s).
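Here is a minimal sketch of the factorized tile model, assuming three tile types and a symmetric Dirichlet prior; the counts-based update is standard Dirichlet bookkeeping rather than code from AIXIjs.

```python
# Each tile keeps its own Dirichlet pseudo-counts over (wall, empty, dispenser).
TILE_TYPES = ("wall", "empty", "dispenser")

class TileBelief:
    def __init__(self, alpha=0.5):
        # Symmetric Dirichlet prior: the same pseudo-count for every tile type.
        self.counts = {t: alpha for t in TILE_TYPES}

    def observe(self, tile_type):
        """Bayesian update: observing a tile type increments its pseudo-count."""
        self.counts[tile_type] += 1.0

    def posterior_mean(self):
        total = sum(self.counts.values())
        return {t: c / total for t, c in self.counts.items()}

# The joint model factorizes over the grid: one independent belief per cell.
beliefs = [[TileBelief() for _ in range(10)] for _ in range(10)]
```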
A more general model class: what did we just see?

Let's visualize the agent's uncertainty as it learns.

Initially the agent knows nothing about the layout. There are two dispensers, visualized for our benefit.
A more general model class

Tiles that the agent knows are walls are blue. Purple tiles show the agent's belief about θ.
A more general model class

Note: the smaller dispenser has a lower θ than the larger one. The agent explores efficiently and learns quickly.
A more general model class

Even so, the agent settles for a locally optimal policy. Due to its short horizon m, it can't see the value in exploring further.
Exploration/exploitation trade-off

Here we see the classic exploration/exploitation dilemma. Bayesian agents are not immune to this!

Choices of:
- model class
- priors
- discount function
- planning horizon
are all significant!

Corollary: AIξ is not asymptotically optimal.
(Aside) An even more general model class

We've demonstrated Bayesian RL on gridworlds using very domain-oriented model classes. Is there something more general that is still tractable?

Yes! The Context-Tree Weighting (CTW) algorithm:
- A data compressor with good theoretical guarantees.
- Mixes over all Markov models of order < k (in bits).
- Automatically weights models by complexity (tree depth).
- Model updates in time linear in k.
- Based on the KT estimator (similar to a Beta distribution); see the sketch after this list.
- Can model any sequential density up to a given finite context/history length.
- Learns to play PacMan, Tic-Tac-Toe, Kuhn Poker, and Rock/Paper/Scissors tabula rasa [3].
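For intuition, here is a sketch of the Krichevsky-Trofimov (KT) estimator that CTW runs at each context-tree node; this is the standard KT predictor for binary sequences, shown outside the tree machinery.

```python
# KT estimator: a Beta(1/2, 1/2) sequential predictor for binary symbols.
class KTEstimator:
    def __init__(self):
        self.zeros = 0
        self.ones = 0

    def prob(self, bit):
        """Predictive probability of the next bit given the counts so far."""
        count = self.ones if bit == 1 else self.zeros
        return (count + 0.5) / (self.zeros + self.ones + 1.0)

    def update(self, bit):
        if bit == 1:
            self.ones += 1
        else:
            self.zeros += 1

# CTW keeps one such estimator per context (tree node) and mixes their
# predictions over all tree depths up to the maximum context length k.
```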
Break Time

Let's take a tea/coffee break! (See you again in 30 mins.)
Variants of AIξ

We'll discuss several variants of AIXI and their links with 'model-free'/'deep RL' algorithms:
- MDL Agent
- Thompson Sampling
- Knowledge-Seeking Agents
- BayesExp
MDL Agent

Minimum Description Length (MDL) principle: prefer simple models. Another take on the 'Occam principle':

    ρ = arg min_{ν ∈ M} [ K(ν) − λ Σ_{k=1}^t log ν(e_k | æ_<k a_k) ]

where the sum is the log-likelihood of the history under ν.

In deterministic environments: "use the simplest yet-unfalsified hypothesis".
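A schematic Python version of the selection rule, assuming each candidate model exposes a complexity estimate and a history log-likelihood; both method names (`complexity`, `log_prob`) are assumptions for illustration.

```python
def mdl_select(models, history, lam=1.0):
    """Pick the model minimizing complexity minus lam * log-likelihood.

    Assumed interface: nu.complexity() ~ K(nu) in bits, and
    nu.log_prob(history) = sum_k log nu(e_k | ae_<k a_k).
    """
    return min(models, key=lambda nu: nu.complexity() - lam * nu.log_prob(history))
```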
Thompson Sampling

Recall that the Bayes-optimal agent (AIξ) maximizes ξ-expected return:

    a_t^{AIξ} = arg max_a Q*_ξ(a | æ_<t)
              = arg max_a max_π E^π_ξ [ Σ_{k=t}^∞ γ_k r_k | æ_<t a ]

A related algorithm is Thompson sampling. Idea: instead of maximizing the ξ-expected return,
- maximize the ρ-expected return, with ρ drawn from the posterior w(· | æ_<t);
- resample ρ every 'effective horizon', given by the discount γ.
(A control-loop sketch follows below.)

Properties:
- Good regret guarantees in finite MDPs [1].
- Asymptotically optimal in general environments [2].

Intuition: it 'commits' the agent to a given belief/policy for a significant amount of time; this encourages 'deep' exploration.
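A sketch of the Thompson-sampling control loop over a finite model class, reusing `bayes_update` from the earlier sketch; the planner and environment interfaces (`plan_optimal_policy`, `env.step`) are hypothetical placeholders.

```python
import random

def thompson_sampling_agent(models, weights, env, plan_optimal_policy,
                            horizon, total_steps):
    """Sample rho ~ posterior, follow the rho-optimal policy for one effective
    horizon, update the posterior along the way, then resample."""
    history = []
    t = 0
    while t < total_steps:
        rho = random.choices(models, weights=weights, k=1)[0]  # rho ~ w(. | ae_<t)
        policy = plan_optimal_policy(rho, history)              # assumed planner
        for _ in range(horizon):                                # commit for ~1/(1 - gamma) steps
            action = policy(history)
            percept = env.step(action)                          # assumed environment interface
            weights = bayes_update(weights, models, action, percept, history)
            history.append((action, percept))
            t += 1
    return history
```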
Thompson Sampling

'Deep RL' version: Deep Exploration via Bootstrapped DQN [2]. Idea:
- Maintain an ensemble of value functions {Q_k(s, a)}.
- Train these with e.g. DQN, using the statistical bootstrap (see the sketch after this list).
- Thompson sampling: draw a Q-function at random each episode and follow its greedy policy.
- This exhibits much better exploration properties than many alternatives.
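A tabular caricature of the bootstrapped-ensemble idea (not the DQN implementation): K independent Q-tables, each trained on a bootstrap-masked stream of transitions, with one table sampled per episode and followed greedily.

```python
import random
from collections import defaultdict

K = 10                    # ensemble size (assumed value)
ALPHA, GAMMA = 0.1, 0.99  # learning rate and discount (assumed values)
ensemble = [defaultdict(float) for _ in range(K)]  # each head: Q[(state, action)]

def q_update(q, s, a, r, s_next, actions):
    """Standard tabular Q-learning backup on a single head."""
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])

def observe_transition(s, a, r, s_next, actions):
    """Bootstrap mask: each head sees a given transition with probability 1/2."""
    for q in ensemble:
        if random.random() < 0.5:
            q_update(q, s, a, r, s_next, actions)

def start_episode():
    """Thompson-style: pick one head and act greedily w.r.t. it for the whole episode."""
    return random.choice(ensemble)
```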
Knowledge-Seeking Agents

It has long been thought that some form of intrinsic motivation, surprise, or curiosity is necessary for effective exploration and learning [5].

Knowledge-seeking agents (KSA) take this to the extreme:
- Fully unsupervised (no extrinsic rewards)
- Utility function depends on the agent's beliefs about the world
- Exploration ≡ Exploitation

Two forms (both sketched below):
- Shannon KSA ("surprise"): U(e_t | æ_<t a_t) = − log ξ(e_t | æ_<t a_t)
- Kullback-Leibler KSA ("information gain"): U(e_t | æ_<t a_t) = Ent(w | æ_<t a_t) − Ent(w | æ_1:t)
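Both utilities can be computed directly from the mixture and the posterior weights; a sketch is below, where `xi_prob` is the mixture probability of the observed percept and `weights_before`/`weights_after` are the posterior weights before and after updating on it (names are illustrative).

```python
import math

def shannon_utility(xi_prob):
    """Surprise: -log xi(e_t | ae_<t a_t)."""
    return -math.log(xi_prob)

def entropy(weights):
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def kl_utility(weights_before, weights_after):
    """Information gain: posterior entropy before the percept minus after it."""
    return entropy(weights_before) - entropy(weights_after)
```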
Knowledge-Seeking Agents

Kullback-Leibler ("information-seeking") is superior to Shannon & Rényi ("entropy-seeking").
Knowledge-Seeking Agents

'Deep RL' version: Variational Information Maximization for Exploration (VIME) [1]. Idea:
- Learn a forward dynamics model in tandem with model-free RL.
- Use a variational approximation to compute the information gain in closed form.
- Use this as an 'exploration bonus', or intrinsic reward (see the sketch below).
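As a minimal sketch of the reward shaping (the information-gain term is computed variationally in the VIME paper; here it is just an assumed input, and `eta` is a trade-off coefficient):

```python
def shaped_reward(extrinsic_reward, info_gain, eta=0.1):
    """Total reward = task reward + eta * (approximate) information gain."""
    return extrinsic_reward + eta * info_gain
```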