The Option-Critic Architecture
Pierre-Luc Bacon, Jean Harb, Doina Precup
Reasoning and Learning Lab, McGill University, Montreal, Canada
AAAI 2017
Intelligence: the ability to generalize and adapt efficiently to new and uncertain situations
• Having good representations is key
“[...] solving a problem simply means representing it so as to make the solution transparent.” — Simon, 1969
Reinforcement Learning: a general framework for AI
Equipped with a good state representation, RL has led to impressive results:
• Tesauro’s TD-Gammon (1995)
• Watson’s Daily-Double wagering in Jeopardy! (2013)
• Human-level video game play in the Atari games (2013)
• AlphaGo (2016)...
The ability to abstract knowledge temporally, over many different time scales, is still missing.
Temporal abstraction
• Higher-level steps: choosing the type of coffee maker, type of coffee beans
• Medium-level steps: grind the beans, measure the right quantity of water, boil the water
• Lower-level steps: wrist and arm movements while adding coffee to the filter, ...
Temporal abstraction in AI
A cornerstone of AI planning since the 1970s:
• Fikes et al. (1972), Newell (1972), Kuipers (1979), Korf (1985), Laird (1986), Iba (1989), Drescher (1991), etc.
It has been shown to:
• Generate shorter plans
• Reduce the complexity of choosing actions
• Provide robustness against model misspecification
• Improve exploration by taking shortcuts in the environment
Temporal abstraction in RL
Options (Sutton, Precup & Singh, 1999) can represent courses of action at variable time scales:
[Figure: a sample trajectory over time, shown both at the high level (options) and at the low level (primitive actions)]
Options framework
An option ω is a triple:
1. initiation set: I_ω
2. internal policy: π_ω
3. termination condition: β_ω
Example (robot navigation): if there is no obstacle in front (I_ω), go forward (π_ω) until you get too close to another object (β_ω).
We can derive a policy over options π_Ω that maximizes the expected discounted sum of rewards:
$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0, \omega_0 \right]$
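To make the triple concrete, below is a minimal Python sketch (not the authors' code) of an option and its call-and-return execution; the Option class, the gym-style env.step interface, and the discount factor are illustrative assumptions.

```python
# Minimal sketch of an option and call-and-return execution (illustrative only).
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation: Callable[[int], bool]     # I_omega: can the option start in state s?
    policy: Callable[[int], int]          # pi_omega: action to take in state s
    termination: Callable[[int], float]   # beta_omega: probability of terminating in s

def run_option(env, state, option, gamma=0.99):
    """Execute the option's internal policy until its termination condition fires."""
    assert option.initiation(state), "an option can only start inside its initiation set"
    total_reward, discount, done = 0.0, 1.0, False
    while not done:
        action = option.policy(state)
        state, reward, done, _ = env.step(action)   # gym-style interface (assumed)
        total_reward += discount * reward
        discount *= gamma
        if random.random() < option.termination(state):
            break
    return state, total_reward, discount, done
```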
Contribution of this work
The problem of constructing/discovering good options has been a challenge for more than 15 years. Option-critic is a scalable solution to this problem:
• Online, continual and model-free (but models can be used if desired)
• Requires no a priori domain knowledge, decomposition, or human intervention
• Learns within a single task, at least as fast as methods that do not use temporal abstraction
• Applies to general continuous state and action spaces
Actor-Critic Architecture (Sutton, 1984)
[Diagram: the actor (policy) takes the state s_t and emits the action a_t; the critic (value function) observes the reward r_t and sends a TD error to the actor as a policy-gradient signal; both interact with the environment]
• The policy (actor) is decoupled from its value function
• The critic provides feedback to improve the actor
• Learning is fully online
Option-Critic Architecture
[Diagram: the policy over options π_Ω selects the option ω_t; the options (π_ω, β_ω) emit actions a_t; the critic maintains Q_U and A_Ω and sends TD errors and gradients back; all interact with the environment through s_t and r_t]
• Parameterize the internal policies and termination conditions
• The policy over options is computed by a separate process
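As a rough sketch of how these pieces interact (an illustration, not the authors' implementation), one interaction step in a tabular setting might look as follows; taking the policy over options to be ε-greedy over the critic's option values Q_Ω is an assumption here, and the Option objects are the hypothetical ones from the earlier sketch.

```python
# Sketch of one option-critic interaction step in a tabular setting (illustrative).
import numpy as np

def epsilon_greedy_option(Q_Omega, state, epsilon=0.05):
    """Policy over options, computed from the critic's estimates Q_Omega[state, option]."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q_Omega.shape[1])
    return int(np.argmax(Q_Omega[state]))

def interaction_step(env, state, omega, Q_Omega, options, epsilon=0.05):
    """Act with the current option's policy; re-select an option if it terminates."""
    action = options[omega].policy(state)                          # pi_omega
    next_state, reward, done, _ = env.step(action)                 # gym-style interface (assumed)
    if np.random.rand() < options[omega].termination(next_state):  # beta_omega
        omega = epsilon_greedy_option(Q_Omega, next_state, epsilon)
    return next_state, omega, action, reward, done
```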
Main result: Gradient updates
• The gradient w.r.t. the internal policy parameters θ is given by:
  $\mathbb{E}\left[\frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a)\right]$
  This has the usual interpretation: take better primitives more often inside the option.
• The gradient w.r.t. the termination parameters ν is given by:
  $\mathbb{E}\left[-\frac{\partial \beta_{\omega,\nu}(s')}{\partial \nu}\, A_{\pi_\Omega}(s', \omega)\right]$
  where $A_{\pi_\Omega} = Q_{\pi_\Omega} - V_{\pi_\Omega}$ is the advantage function.
  This means that we want to lengthen options that have a large advantage.
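A minimal tabular sketch of these two updates, assuming softmax internal policies with parameters theta and sigmoid termination functions with parameters nu; the learning rate and the greedy approximation of V_Ω by max_ω Q_Ω(s', ω) are assumptions made for illustration.

```python
# Sketch of the intra-option policy and termination gradient updates (illustrative).
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def intra_option_policy_update(theta, s, omega, a, Q_U, lr=0.25):
    """theta[omega, s, :] parameterizes pi_{omega,theta}(.|s) via a softmax.
    Ascend along lr * d log pi(a|s) / d theta * Q_U(s, omega, a)."""
    probs = softmax(theta[omega, s])
    grad_log = -probs
    grad_log[a] += 1.0                            # gradient of log-softmax w.r.t. the logits
    theta[omega, s] += lr * grad_log * Q_U[s, omega, a]

def termination_update(nu, s_next, omega, Q_Omega, lr=0.25):
    """nu[omega, s'] parameterizes beta_{omega,nu}(s') via a sigmoid.
    Gradient ascent on the return gives nu -= lr * d beta / d nu * A(s', omega)."""
    beta = 1.0 / (1.0 + np.exp(-nu[omega, s_next]))
    advantage = Q_Omega[s_next, omega] - Q_Omega[s_next].max()   # V approximated greedily (assumption)
    nu[omega, s_next] -= lr * beta * (1.0 - beta) * advantage
```

With this update, a large advantage for the current option pushes β_ω down, which is exactly the "lengthen options that have a large advantage" behaviour described above.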
Results: Options transfer
[Figure: four-rooms gridworld showing the walls, the hallways, the initial goal, and the random goal location used after 1000 episodes]
Results: Options transfer
[Plot: steps per episode vs. episodes (0 to 2000) for SARSA(0) and AC-PG with primitive actions, and for option-critic (OC) with 4 and 8 options; the goal moves randomly after episode 1000]
• Learning in the first task is no slower than with primitive actions
• Once the goal is moved, learning is faster with the temporal abstractions discovered by option-critic
Results: Learned options are intuitive
[Figure: probability of terminating in each state of the four-rooms domain, shown separately for Option 1 through Option 4]
• Terminations are more likely near the hallways (although no pseudo-rewards are provided)
Results: Nonlinear function approximation
[Diagram: the last 4 frames pass through convolutional layers into a shared representation, which feeds the policy over options, the termination functions, and the internal policies]
Same architecture as DQN (Mnih et al., 2013) for the first 4 layers, but hybridized with the options and the policy over them.
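As an illustration of this hybrid architecture, here is a hypothetical PyTorch sketch of a shared convolutional trunk feeding three option-critic heads; the layer sizes and head shapes are assumptions and not necessarily the configuration used in the paper.

```python
# Sketch of a DQN-style trunk shared by the three option-critic heads (illustrative).
import torch
import torch.nn as nn

class OptionCriticNet(nn.Module):
    def __init__(self, num_actions, num_options):
        super().__init__()
        self.num_actions, self.num_options = num_actions, num_options
        self.trunk = nn.Sequential(                       # shared representation
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.q_omega = nn.Linear(512, num_options)                       # option values (policy over options)
        self.termination = nn.Linear(512, num_options)                   # beta_omega, one per option
        self.policy_logits = nn.Linear(512, num_options * num_actions)   # pi_omega, one per option

    def forward(self, frames):                            # frames: (batch, 4, 84, 84), last 4 frames
        features = self.trunk(frames / 255.0)
        q_omega = self.q_omega(features)
        betas = torch.sigmoid(self.termination(features))
        pi = torch.softmax(
            self.policy_logits(features).view(-1, self.num_options, self.num_actions), dim=-1)
        return q_omega, betas, pi
```

The policy over options can then be derived from q_omega (e.g., ε-greedy), while the termination and intra-option policy heads are trained with the gradients from the previous slide.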
Performance matching or better than DQN
[Plots: average score vs. training epoch (0 to 200) for option-critic and DQN on four Atari games: (a) Asterix, (b) Ms. Pacman, (c) Seaquest, (d) Zaxxon]
Interpretable and specialized options in Seaquest
[Figure: action trajectory over time, with option 1 shown in white and option 2 in black; transitions from option 1 to option 2 are marked]
• Option 1: downward shooting sequences
• Option 2: upward shooting sequences
Conclusion
Our results seem to be the first to be:
• fully end-to-end
• within a single task
• at a speed comparable to or better than methods using only primitive actions
Using ideas from policy gradient methods, option-critic:
• provides continual option construction
• can be used with nonlinear function approximators
• can easily incorporate regularizers or pseudo-rewards
Future work
• Learn initiation sets:
  ◦ Would require a new notion of stochastic initiation functions
• More empirical results!
Try our code: https://github.com/jeanharb/option_critic