AIXI: Universal Optimal Sequential Decision Making
Marcus Hutter (2005)
Reinforcement Learning
• State space S, action space A, policy π, reward r(s, a)
• Goal: find a policy that maximizes expected cumulative reward
• Challenge: the environment the RL agent interacts with is unknown
  • The agent must explore and approximate the environment
  • It is hard to balance exploration vs. exploitation
• AIXI: why approximate one environment? Consider them all!
Optimal Agents in Known Environments
• A, O, R = (action, observation, reward) spaces
• a_t = action at time t;  x_t = o_t r_t = percept at time t
• The agent follows a policy π : (A × O × R)* → A
• The environment reacts with μ : (A × O × R)* × A → O × R
Agent-Environment Visualization
Optimal Agents in Known Environments
• The performance of π in environment μ is its expected cumulative reward
  V_μ^π := E_μ^π [ Σ_{t=1}^{m} r_t ]
• If μ is the true environment, the optimal policy is
  π^μ := argmax_π V_μ^π
Definition of the Environment
• An environment ρ is a sequence of conditional probability functions {ρ_0, ρ_1, ρ_2, …} and is unknown to the agent
• Each element of the sequence satisfies the "chronological condition":
  ∀a_{1:t} ∀x_{<t} :  ρ_{t-1}(x_{<t} | a_{<t}) = Σ_{x_t} ρ_t(x_{1:t} | a_{1:t})
Definition of the Environment
∀a_{1:t} ∀x_{<t} :  ρ_{t-1}(x_{<t} | a_{<t}) = Σ_{x_t} ρ_t(x_{1:t} | a_{1:t})
• Left-hand side: conditioned on all actions up to t − 1
• Right-hand side: conditioned on all actions up to t
• The sum marginalizes ρ_t over the current observation-reward pair x_t (a small numerical check is sketched below)
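A minimal sketch, assuming a toy two-percept, two-action environment with an explicit probability table (all names and probabilities here are illustrative, not from the slides), showing what the chronological condition checks: summing ρ_t over the current percept must recover ρ_{t-1}.

```python
# Toy environment given as conditional probability tables, plus a numerical
# check that summing rho_t over the current percept x_t recovers rho_{t-1}.
from itertools import product

PERCEPTS = ["x0", "x1"]   # hypothetical observation-reward symbols
ACTIONS = ["a0", "a1"]

def rho(percepts, actions):
    """Toy P(x_{1:t} | a_{1:t}): each percept depends only on the latest action."""
    p = 1.0
    for x, a in zip(percepts, actions):
        p *= 0.8 if (x == "x0") == (a == "a0") else 0.2
    return p

def chronological_ok(t, tol=1e-12):
    """Check sum_{x_t} rho(x_{1:t} | a_{1:t}) == rho(x_{<t} | a_{<t}) for all histories."""
    for actions in product(ACTIONS, repeat=t):
        for past in product(PERCEPTS, repeat=t - 1):
            lhs = rho(list(past), list(actions[:-1]))
            rhs = sum(rho(list(past) + [x], list(actions)) for x in PERCEPTS)
            if abs(lhs - rhs) > tol:
                return False
    return True

print(chronological_ok(t=3))  # True for this toy environment
```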
Dealing with the Unknown Environment
• The idea is to maintain a mixture of environment models, where each model is assigned a weight representing the agent's confidence that it is the true environment
• As the agent gains experience, it updates the weights and thus its belief about the underlying environment
• Reminiscent of a Bayesian agent
Mixture Model
• M := {ρ_1, ρ_2, …} is a countable class of environments
• w_ρ^0 > 0 is the prior weight assigned to each ρ ∈ M, such that Σ_{ρ∈M} w_ρ^0 = 1
• ξ(x_{1:t} | a_{1:t}) = Σ_{ρ∈M} w_ρ^0 ρ(x_{1:t} | a_{1:t})  (a small weight-update sketch follows below)
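A minimal sketch of the Bayesian-mixture idea, assuming two made-up candidate environments over binary percepts (both environments and the percept stream are illustrative assumptions): each model's weight is rescaled by how well it predicted the latest percept, which is how the mixture ξ concentrates on good models.

```python
# Two candidate environments and a running Bayesian weight update.
def env_biased(x, history):
    """Candidate 1: percept '1' with probability 0.9, regardless of history."""
    return 0.9 if x == "1" else 0.1

def env_uniform(x, history):
    """Candidate 2: both percepts equally likely."""
    return 0.5

models = {"biased": env_biased, "uniform": env_uniform}
weights = {"biased": 0.5, "uniform": 0.5}   # prior weights, summing to 1

history = []
for percept in ["1", "1", "0", "1", "1"]:   # observed percept stream
    # posterior weight ∝ prior weight * likelihood of the new percept
    for name, model in models.items():
        weights[name] *= model(percept, history)
    total = sum(weights.values())
    weights = {name: w / total for name, w in weights.items()}
    history.append(percept)

print(weights)   # probability mass shifts toward the better-predicting environment
```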
Selecting a Universal Prior
• Occam's Razor: the simplest explanation is the most likely
• Formalized via Kolmogorov complexity
• The weights must still satisfy Σ_{ρ∈M} w_ρ^0 = 1 in the mixture ξ(x_{1:t} | a_{1:t}) = Σ_{ρ∈M} w_ρ^0 ρ(x_{1:t} | a_{1:t})
Kolmogorov Complexity
• The length of the shortest program on a universal Turing machine U that specifies an object
• In our setting: the shortest program that produces environment ρ
  K(ρ) := min_p { ℓ(p) : U(p) = ρ }
• Advantage: completely independent of prior assumptions
• Problem: incomputable due to the halting problem
  • A naïve search over all programs will include programs that loop forever
  • Berry-style paradox: "the first object not describable in fewer than N bits" is itself a description of fewer than N bits
Solomonoff Prior
• Key idea: weight each environment by 2^{-K(ρ)}, which decreases with its Kolmogorov complexity, giving a mixture over all computable environments
• Υ(π) := Σ_{μ∈M} 2^{-K(μ)} V_μ^π
• Υ(π) measures the agent's ability to perform across all possible environments
• Hutter describes Υ(π) as Universal Intelligence
AIXI
• Expectimax over the Solomonoff prior (the full expression is sketched below)
• M is the class of chronologically conditioned environments
• Converges to an agent acting with knowledge of the true environment
• Mathematically proven
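For reference, the expectimax the slide alludes to can be written in the standard form used by Hutter and by Veness et al., with notation following the earlier slides (horizon m, percepts x_t, Solomonoff weights 2^{-K(ρ)}):

```latex
a_t \;=\; \arg\max_{a_t} \sum_{x_t} \;\cdots\; \max_{a_{t+m}} \sum_{x_{t+m}}
\Big[\, r_t + \cdots + r_{t+m} \,\Big]
\sum_{\rho \in \mathcal{M}} 2^{-K(\rho)}\, \rho\!\left(x_{1:t+m} \mid a_{1:t+m}\right)
```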
Evaluation: Pros and Cons
• Pros:
  • Theoretically optimal decision making
  • Proven to converge to the optimal agent acting in the true environment
  • Universal: the prior is completely independent of the actual environment's behavior
  • "Reduces any conceptual AI problem to a computation problem"
• Cons:
  • Incomputable and intractable
  • Kolmogorov complexity cannot be computed
  • Reward function? It is unclear how to define a reward function that is also independent of the problem
Related Works: Approximations
• Work on AIXI focuses mainly on approximating the theoretical framework
• AIXItl
  • Marcus Hutter. Universal algorithmic intelligence: A mathematical top-down approach. In B. Goertzel and C. Pennachin, editors, Artificial General Intelligence, Cognitive Technologies, pages 227–290. Springer, Berlin, 2007. ISBN 3-540-23733-X. URL http://www.hutter1.net/ai/aixigentle.htm
  • Summary: a computable approximation of AIXI that performs at least as well as any other RL agent subject to the same time and space constraints
• MC-AIXI (next!)
  • Summary: a Monte Carlo approximation of AIXI
MC-AIXI-CTW
• "Monte Carlo AIXI with Context Tree Weighting"
• Veness et al., 2011
• Addresses the two main barriers to applying AIXI:
  1. The expectimax is intractable → estimate it using MCTS
  2. Kolmogorov complexity is incomputable → replace the universe of environments with a smaller model class that uses a surrogate for complexity
Part 1: MCTS
• ρUCT estimates the AIXI expectimax by adapting the classic selection-expansion-rollout-backpropagation MCTS algorithm (a node-selection sketch follows below)
• Decision node (circle):
  • Contains a history h and a value-function estimate V̂(h)
  • Has one child (a "chance node") per possible action
  • An action a is selected by the UCB action-selection policy, which balances exploration and exploitation
• Chance node (star):
  • Follows a decision node
  • Contains the history ha, an estimate of the future value V̂(ha), and the environment model ρ(· | ha), which returns a percept conditioned on the history
  • A new child of the chance node is added when a new percept is received
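A minimal sketch of the UCB-style action selection performed at a decision node; this is an illustrative assumption, not the paper's implementation, and the node class, exploration constant, and example values are made up.

```python
import math

class ChanceNode:
    def __init__(self):
        self.visits = 0       # N(ha): times this action was tried from h
        self.value = 0.0      # running estimate of future reward, V̂(ha)

def select_action(children, total_visits, C=2.0):
    """Pick the action maximizing V̂(ha) + C * sqrt(ln N(h) / N(ha))."""
    best_action, best_score = None, -math.inf
    for action, node in children.items():
        if node.visits == 0:
            return action                      # always try unvisited actions first
        score = node.value + C * math.sqrt(math.log(total_visits) / node.visits)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Tiny usage example with two hypothetical actions
children = {"left": ChanceNode(), "right": ChanceNode()}
children["left"].visits, children["left"].value = 10, 0.4
children["right"].visits, children["right"].value = 3, 0.5
print(select_action(children, total_visits=13))   # "right": higher UCB score
```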
Part 2: Approximating the Solomonoff Prior
• The Solomonoff prior Σ_ρ 2^{-K(ρ)} is incomputable
• Solution: replace the universe of environments with a smaller model class
• Variable-order Markov models
  • Predict the probability of the next observation from up to the last k observations (a frequency-count sketch follows below)
• Replace the full mixture over all environments with a mixture of Markov models
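A minimal sketch of the underlying idea, assuming a fixed order k and binary observations (the class name, smoothing, and example pattern are illustrative, not from the paper): predict the next bit from counts collected per context of the last k bits.

```python
from collections import defaultdict

class KthOrderMarkovPredictor:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: [0, 0])   # context -> [#zeros, #ones]

    def update(self, history, next_bit):
        context = tuple(history[-self.k:])
        self.counts[context][next_bit] += 1

    def prob_one(self, history):
        """Laplace-smoothed estimate of P(next bit = 1 | last k bits)."""
        zeros, ones = self.counts[tuple(history[-self.k:])]
        return (ones + 1) / (zeros + ones + 2)

# Usage: learn a simple alternating pattern 0,1,0,1,...
predictor = KthOrderMarkovPredictor(k=1)
bits = [0, 1] * 20
for i in range(1, len(bits)):
    predictor.update(bits[:i], bits[i])
print(predictor.prob_one([0]))   # close to 1: after a 0 the next bit is usually 1
```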
Prediction Suffix Tree
• A tree representation of a predictor over sequences of binary events
• Able to encode all variable-order Markov models up to depth D
• Represents a model space on the order of 2^(2^D)
Context Tree Weighting
• Provides a method to evaluate the mixture over all PSTs efficiently
• A naïve computation costs O(2^(2^D)); the CTW algorithm reduces it to O(D) per symbol
• Smaller trees represent simpler Markov models
• The prior probability under Occam's razor is given by the size of the tree: Γ_D(M) = # nodes in the PST
• Replace the Kolmogorov prior with the CTW prior (a sketch of the CTW recursion follows below)
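A rough sketch, under several simplifying assumptions (binary alphabet, no conditioning on actions, updates only once a full depth-D context is available), of the CTW recursion P_w = ½·P_kt + ½·P_w(child 0)·P_w(child 1) computed in log space with KT estimators at each node. This illustrates the mixing idea only; it is not the MC-AIXI-CTW implementation.

```python
import math
from collections import defaultdict

def _log_add(x, y):
    """log(exp(x) + exp(y)) computed stably."""
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

class Node:
    def __init__(self):
        self.a = 0            # number of 0s observed in this context
        self.b = 0            # number of 1s observed in this context
        self.log_kt = 0.0     # log KT-estimator probability of data seen here
        self.log_w = 0.0      # log CTW weighted probability

class CTW:
    def __init__(self, depth):
        self.depth = depth
        self.nodes = defaultdict(Node)   # keyed by context tuple, most recent bit first

    def _kt_increment(self, node, bit):
        # KT estimator: P(next = 1) = (b + 1/2) / (a + b + 1)
        num = (node.b if bit == 1 else node.a) + 0.5
        return math.log(num / (node.a + node.b + 1))

    def update(self, bit, history):
        context = tuple(reversed(history[-self.depth:]))
        path = [context[:d] for d in range(len(context) + 1)]   # root ... deepest node
        # update counts and KT probabilities along the context path
        for key in path:
            node = self.nodes[key]
            node.log_kt += self._kt_increment(node, bit)
            if bit == 1:
                node.b += 1
            else:
                node.a += 1
        # recompute weighted probabilities from the deepest node back to the root
        for key in reversed(path):
            node = self.nodes[key]
            if len(key) == self.depth:
                node.log_w = node.log_kt                 # leaf: no children to mix
            else:
                log_children = sum(self.nodes[key + (c,)].log_w
                                   for c in (0, 1) if key + (c,) in self.nodes)
                node.log_w = math.log(0.5) + _log_add(node.log_kt, log_children)

# Usage: feed an alternating sequence, starting once a full context exists
ctw = CTW(depth=2)
bits = [0, 1] * 50
for i in range(ctw.depth, len(bits)):
    ctw.update(bits[i], bits[:i])
root = ctw.nodes[()]
# geometric-mean per-bit probability under the CTW mixture (close to 1 here,
# since the alternating pattern is highly predictable)
print(math.exp(root.log_w / (len(bits) - ctw.depth)))
```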
Context Tree Weighting: Updated Formula
• Original intractable prior: ξ(x_{1:t} | a_{1:t}) = Σ_{ρ∈M} 2^{-K(ρ)} ρ(x_{1:t} | a_{1:t})
• MC-AIXI with CTW: ξ(x_{1:t} | a_{1:t}) = Σ_{M} 2^{-Γ_D(M)} Pr(x_{1:t} | M, a_{1:t}), summing over prediction suffix trees M of depth at most D
Algorithm Performance
Cheese Maze
• The agent must navigate to a piece of cheese
• −1 for entering an open cell
• −10 for hitting a wall
• +10 for finding the cheese
Partially Observable Pacman
• The agent is unaware of the monsters' locations and the maze layout
• It can only "smell" food and observe food in its direct line of sight
Performance on Cheese Maze
Performance on PO-Pacman
Related Work
• Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996.
• V. F. Farias, C. C. Moallemi, B. Van Roy, and T. Weissman. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441–2454, May 2010.
Timeline
• 1960s: Solomonoff Induction (Ray Solomonoff)
• 1963: Kolmogorov Complexity (Andrey Kolmogorov)
• 1995: Context Tree Weighting (Willems, Shtarkov & Tjalkens)
• 2005: AIXI (Marcus Hutter)
• 2006: MCTS, "Bandit based Monte-Carlo Planning" (Kocsis & Szepesvári)
• 2007: AIXItl (Marcus Hutter)
• 2010: MC-AIXI-CTW (Veness et al.)
MC-AIXI-CTW Playing Pac-Man
• jveness.info/publications/pacman_jair_2010.wmv