
AIXI: Universal Optimal Sequential Decision Making, Marcus Hutter (PowerPoint Presentation)



  1. AIXI: Universal Optimal Sequential Decision Making Marcus Hutter (2005)

  2. Reinforcement Learning • State space 𝒮, action space 𝒜, policy π, reward R(s, a) • Goal: find a policy that maximizes the expected cumulative reward • Challenge: the environment the agent interacts with is unknown • The agent must explore and approximate the environment • Balancing exploration and exploitation is hard • AIXI: why approximate one environment? Consider them all!

  3. Optimal Agents in Known Environments • 𝒜, 𝒪, ℛ = (action, observation, reward) spaces • a_t = action at time t, x_t = o_t r_t = percept at time t • Agent follows policy π : (𝒜 × 𝒪 × ℛ)* → 𝒜 • Environment reacts with μ : (𝒜 × 𝒪 × ℛ)* × 𝒜 → 𝒪 × ℛ
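The interaction protocol on slide 3 can be summarized as a simple loop. Below is a minimal Python sketch (the `policy` and `environment` callables are illustrative stand-ins, not part of Hutter's formalism): the policy maps the whole history to an action, and the environment maps the history plus the new action to an observation-reward percept.

```python
from typing import List, Tuple

Action, Observation, Reward = int, int, float


def run_interaction(policy, environment, horizon: int) -> float:
    """Run the agent-environment loop for `horizon` steps.

    `policy(history) -> action` plays the role of pi : (A x O x R)* -> A, and
    `environment(history, action) -> (observation, reward)` plays the role of
    mu : (A x O x R)* x A -> O x R.  Both see the entire interaction history.
    """
    history: List[Tuple[Action, Observation, Reward]] = []
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(history)                             # a_t = pi(history)
        observation, reward = environment(history, action)   # x_t = o_t r_t
        history.append((action, observation, reward))
        total_reward += reward
    return total_reward
```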

  4. Agent-Environment Visualization

  5. Optimal Agents in Known Environments • Performance of π is the expected cumulative reward V_μ^π := E_μ^π [ Σ_{t=1}^∞ r_t ] • If μ is the true environment, the optimal policy is π^μ := argmax_π V_μ^π

  6. Definition of the Environment • An environment ρ is a sequence of conditional probability functions {ρ_0, ρ_1, ρ_2, …} and is unknown to the agent • Each element of the sequence satisfies the "chronological condition": ∀a_{1:n} ∀x_{1:n−1} : ρ_{n−1}(x_{1:n−1} | a_{1:n−1}) = Σ_{x_n ∈ 𝒳} ρ_n(x_{1:n} | a_{1:n})

  7. Definition of the Environment ∀a_{1:n} ∀x_{1:n−1} : ρ_{n−1}(x_{1:n−1} | a_{1:n−1}) = Σ_{x_n ∈ 𝒳} ρ_n(x_{1:n} | a_{1:n}) • Left-hand side: conditioned on all actions up to n−1 • Right-hand side: conditioned on all actions up to n, with ρ_n marginalized over the current observation-reward pair x_n
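As a sanity check, the chronological condition can be verified numerically for any concrete model. A minimal sketch, assuming the model is given as a function `rho(percepts, actions)` returning the joint conditional probability of a percept sequence given an action sequence (an interface chosen here purely for illustration):

```python
def satisfies_chronological_condition(rho, percept_space, actions, percepts, tol=1e-9):
    """Check rho_{n-1}(x_{1:n-1} | a_{1:n-1}) == sum over x_n of rho_n(x_{1:n} | a_{1:n}).

    `actions` is a_{1:n}, `percepts` is x_{1:n-1}, and `percept_space` is the
    finite set of possible observation-reward pairs x_n can range over.
    """
    lhs = rho(percepts, actions[:-1])
    rhs = sum(rho(percepts + [x], actions) for x in percept_space)
    return abs(lhs - rhs) < tol
```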

  8. Dealing with the Unknown Environment • The idea is to maintain a mixture of environment models, where each model is assigned a weight representing the agent's confidence that it is the true environment • As the agent gains experience, it updates the weights, and thus its belief about the underlying environment • Reminiscent of a Bayesian agent

  9. Mixture Model • ℳ ≜ {ρ_1, ρ_2, …} is a countable class of environments • w_0^ρ > 0 is the weight assigned to each ρ ∈ ℳ, with Σ_{ρ∈ℳ} w_0^ρ = 1 • ξ(x_{1:n} | a_{1:n}) ≜ Σ_{ρ∈ℳ} w_0^ρ · ρ(x_{1:n} | a_{1:n})
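A minimal sketch of the mixture and of the Bayesian weight update it implies (the function names and list-based interface are illustrative, not from the paper): each environment in ℳ scores the observed percepts, and the weights are rescaled in proportion.

```python
def mixture_probability(weights, models, percepts, actions):
    """xi(x_{1:n} | a_{1:n}) = sum over rho of w_rho * rho(x_{1:n} | a_{1:n})."""
    return sum(w * rho(percepts, actions) for w, rho in zip(weights, models))


def posterior_weights(weights, models, percepts, actions):
    """Bayes rule: the updated weight of rho is proportional to w_rho * rho(x_{1:n} | a_{1:n})."""
    scored = [w * rho(percepts, actions) for w, rho in zip(weights, models)]
    total = sum(scored)
    return [s / total for s in scored]
```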

  10. Selecting a Universal Prior • Occam's Razor: the simplest explanation is the most likely • Formalized via Kolmogorov Complexity • ξ(x_{1:n} | a_{1:n}) ≜ Σ_{ρ∈ℳ} w_0^ρ · ρ(x_{1:n} | a_{1:n}), with Σ_{ρ∈ℳ} w_0^ρ = 1

  11. Kolmogorov Complexity • Length of the shortest program on a Universal Turing Machine that specifies an object • In our setting: the shortest program that produces environment ρ, K(ρ) := min_p { length(p) : U(p) = ρ } • Advantage: completely independent of prior assumptions • Problem: incomputable due to the halting problem • A naïve search over all programs would include some that loop forever • Related paradox: "the shortest object not describable in fewer than N bits" would itself be described in fewer than N bits

  12. Solomonoff Prior • Key idea: use 2^(−K(ρ)) (an exponentially decaying function of Kolmogorov Complexity) as the environmental prior to compute a mixture over all possible environments: Υ(π) = Σ_{μ∈ℳ} 2^(−K(μ)) · V_μ^π • Υ(π) measures the agent's ability to perform across all possible environments • Hutter describes Υ(π) as Universal Intelligence

  13. AIXI • Expectimax over the Solomonoff prior • ℳ is the class of chronologically conditioned environments • Converges to an agent acting with knowledge of the true environment • The convergence is mathematically proven
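AIXI itself is incomputable, but the finite-horizon expectimax it is built on can be sketched over a finite model class: alternate a maximization over actions with an expectation over percepts weighted by the mixture ξ from slide 9. The sketch below assumes finite action and percept sets and a `mixture(percepts, actions)` function as above; it illustrates the recursion only and is not Hutter's formal definition.

```python
def expectimax(history_a, history_x, mixture, action_space, percept_space, depth):
    """Value of the best action from the current history, looking `depth` steps ahead.

    history_a / history_x are the actions and (observation, reward) percepts so far;
    `mixture(percepts, actions)` returns xi(x_{1:n} | a_{1:n}).
    """
    if depth == 0:
        return 0.0
    base = mixture(history_x, history_a) if history_x else 1.0
    if base == 0.0:
        return 0.0                         # history has zero probability under xi
    best = float("-inf")
    for a in action_space:
        value = 0.0
        for x in percept_space:            # x = (observation, reward) tuple
            joint = mixture(history_x + [x], history_a + [a])
            prob = joint / base            # xi(x | history, a)
            if prob == 0.0:
                continue
            future = expectimax(history_a + [a], history_x + [x],
                                mixture, action_space, percept_space, depth - 1)
            value += prob * (x[1] + future)  # immediate reward plus value-to-go
        best = max(best, value)
    return best
```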

  14. Evaluation: Pros and Cons • Theoretically optimal decision making • Proven to converge to the optimal agent acting in the true environment • Universal: the prior is completely independent of the actual environment's behavior • "Reduces any conceptual AI problem to a computation problem" • Incomputable and intractable: Kolmogorov Complexity cannot be computed • Reward function? Unclear how to define a reward function that is also independent of the problem

  15. Related Works: Approximations • Work on AIXI focuses mainly on approximating the theoretical framework • AIXItl • Marcus Hutter. Universal algorithmic intelligence: A mathematical top-down approach. In B. Goertzel and C. Pennachin, editors, Artificial General Intelligence, Cognitive Technologies, pages 227-290. Springer, Berlin, 2007. ISBN 3-540-23733-X. URL http://www.hutter1.net/ai/aixigentle.htm • Summary: provides an approximate AIXI that outperforms any other RL agent with the same time and space constraints • MC-AIXI (next!) • Summary: Monte Carlo approximation of AIXI

  16. MC-AIXI-CTW • "Monte Carlo AIXI with Context Tree Weighting" • Veness et al., 2011 • Solves the two main barriers to applying AIXI: 1. Expectimax is intractable → estimate it using MCTS 2. Kolmogorov Complexity is incomputable → replace the universe of environments with a smaller model class that uses a surrogate for complexity

  17. Part 1: MCTS • ρUCT is used to estimate the AIXI expectimax by adapting the classic selection-expansion-rollout-backpropagation MCTS algorithm • Decision node (circle): contains a history h and a value-function estimate V̂(h); its children ("chance nodes") correspond to the possible actions; an action a is selected by the UCB action-selection policy, which balances exploration and exploitation • Chance node (star): follows a decision node; contains the history ha, an estimate of the future value V̂(ha), and the environment model ρ(· | ha), which returns a percept conditioned on the history; a new child of the chance node is added whenever a new percept is received
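A condensed sketch of the ρUCT structure described on this slide, assuming a generative environment model with a `sample(history, action) -> (percept, reward)` method (the interface is hypothetical; the real MC-AIXI-CTW implementation also updates and rewinds the CTW model during each simulation). Only UCB selection at decision nodes, sampling at chance nodes, and mean backups are shown.

```python
import math


class DecisionNode:
    """Circle node: holds a history implicitly, a value estimate, children keyed by action."""
    def __init__(self):
        self.visits, self.value, self.children = 0, 0.0, {}   # action -> ChanceNode


class ChanceNode:
    """Star node: history+action, a value estimate, children keyed by the sampled percept."""
    def __init__(self):
        self.visits, self.value, self.children = 0, 0.0, {}   # percept -> DecisionNode


def ucb_action(node, actions, c=1.0):
    """Pick the action maximising the UCB score; unvisited actions are tried first."""
    def score(a):
        child = node.children.get(a)
        if child is None or child.visits == 0:
            return float("inf")
        return child.value + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(actions, key=score)


def sample_episode(node, history, model, actions, depth):
    """One simulation: returns the sampled return from `history` onwards."""
    if depth == 0:
        return 0.0
    action = ucb_action(node, actions)
    chance = node.children.setdefault(action, ChanceNode())
    percept, reward = model.sample(history, action)     # draw from rho(. | ha); percept must be hashable
    child = chance.children.setdefault(percept, DecisionNode())
    ret = reward + sample_episode(child, history + [(action, percept)],
                                  model, actions, depth - 1)
    for n in (chance, node):                             # backpropagate running means
        n.visits += 1
        n.value += (ret - n.value) / n.visits
    return ret
```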

  18. Part 2: Approximating the Solomonoff Prior • The Solomonoff prior Σ_ρ 2^(−K(ρ)) is incomputable • Solution: replace it with a smaller class of environments • Variable-order Markov process: predicts the probability of the next observation from the last k observations • Replace the entire universe of environments with a mixture of Markov processes
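As a toy illustration of the substituted model class, here is a fixed-order-k Markov predictor over a binary percept stream using Laplace-smoothed counts (a special case of the variable-order models the slide refers to; the class and its interface are made up for this sketch):

```python
from collections import defaultdict


class KOrderMarkovPredictor:
    """Predicts the next bit from the last k bits using Laplace-smoothed counts."""

    def __init__(self, k: int):
        self.k = k
        self.counts = defaultdict(lambda: [0, 0])    # context -> [#zeros, #ones]

    def predict_one(self, history) -> float:
        """P(next bit = 1 | last k bits of history)."""
        context = tuple(history[-self.k:])
        zeros, ones = self.counts[context]
        return (ones + 1) / (zeros + ones + 2)       # Laplace smoothing

    def update(self, history, bit: int) -> None:
        """Record the observed bit for the context it followed."""
        context = tuple(history[-self.k:])
        self.counts[context][bit] += 1
```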

  19. Prediction Suffix Tree • Representation of a sequence of binary events • Can encode all variable-order Markov models up to depth D • Represents a space of roughly 2^(2^D) distinct models

  20. Context Tree Weighting • Provides a method to evaluate the mixture over PSTs in linear time • Naïvely this costs O(2^(2^D)); the CTW algorithm reduces it to O(D) per symbol • Smaller trees represent simpler Markov models • The prior probability of a model is evaluated under Occam's razor by the size of its tree: weight ∝ 2^(−Γ_D(M)), where Γ_D(M) is the description length of the PST M, growing with its number of nodes • Replace the Kolmogorov prior with this CTW prior
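Below is a compact sketch of the CTW computation for a plain binary sequence, using the Krichevsky-Trofimov (KT) estimator at each node and the standard weighting P_w = ½·P_KT + ½·P_w(child 0)·P_w(child 1). It is a batch version of the generic CTW scheme, simplified relative to the action-conditional context trees used in MC-AIXI-CTW.

```python
from math import exp, lgamma, log


def kt_log_prob(zeros: int, ones: int) -> float:
    """Log of the Krichevsky-Trofimov (Beta(1/2,1/2)) estimate for the given counts."""
    return (lgamma(zeros + 0.5) + lgamma(ones + 0.5)
            - 2 * lgamma(0.5) - lgamma(zeros + ones + 1.0))


def ctw_log_prob(bits, depth, context=()):
    """Log CTW probability of `bits` for a context tree of the given depth.

    `bits` is a list of 0/1 values.  The first `depth` bits serve only as
    initial context and are not themselves predicted (a common simplification).
    `context` identifies the current node by the bits immediately preceding a symbol.
    """
    k = len(context)
    # Bits whose k preceding symbols match this node's context.
    relevant = [bits[i] for i in range(depth, len(bits))
                if tuple(bits[i - k:i]) == context]
    zeros, ones = relevant.count(0), relevant.count(1)
    log_kt = kt_log_prob(zeros, ones)
    if k == depth:                        # leaf node: KT estimator only
        return log_kt
    # Internal node: mix the KT estimate with the product of the two children.
    log_split = (ctw_log_prob(bits, depth, (0,) + context)
                 + ctw_log_prob(bits, depth, (1,) + context))
    m = max(log_kt, log_split)
    return m + log(0.5 * exp(log_kt - m) + 0.5 * exp(log_split - m))
```

Per-symbol predictions then follow from ratios of successive sequence probabilities, e.g. P(next bit = 1 | past) = exp(ctw_log_prob(past + [1], D) − ctw_log_prob(past, D)), which is essentially the role the CTW mixture plays as ρ(· | ha) inside ρUCT.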

  21. Context Tree Weighting: Updated Formula • Original intractable prior • MC-AIXI with the CTW prior

  22. Algorithm Performance • Cheese Maze: the agent must navigate to a piece of cheese; -1 for entering an open cell; -10 for hitting a wall; +10 for finding the cheese • Partially Observable Pac-Man: the agent is unaware of the monsters' locations and of the maze; it can only "smell" food and observe food in its direct line of sight

  23. Performance on Cheese Maze

  24. Performance on PO-Pacman

  25. Related Work • Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996 ⟶ "Utile Suffix Memory" • V. F. Farias, C. C. Moallemi, B. Van Roy, and T. Weissman. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441-2454, May 2010 ⟶ "Active-LZ"

  26. Timeline • Kolmogorov Complexity (Andrey Kolmogorov, 1963) • Solomonoff Induction (Ray Solomonoff, 1960s) • Context Tree Weighting (Willems, Shtarkov & Tjalkens, 1995) • AIXI (Marcus Hutter, 2005) • MCTS, "Bandit based Monte-Carlo Planning" (Kocsis & Szepesvári, 2006) • AIXItl (Marcus Hutter, 2007) • MC-AIXI-CTW (Veness et al., 2010)

  27. MC-AIXI-CTW Playing Pac-Man • jveness.info/publications/pacman_jair_2010.wmv
