Stochastic Optimal Control – part 4: research issues, robotics applications
Marc Toussaint
Machine Learning & Robotics Group – TU Berlin
mtoussai@cs.tu-berlin.de
ICML 2008, Helsinki
• challenges in stochastic optimal control
• probabilistic inference approaches to control
• robotics
• model learning
1/14
challenges in stochastic optimal control
• often said: “scale up”
• Efficient Application in Real Systems!
→ try to extract the fundamental problems
2/14
research issues 1/3: structured state
• notion of state (i.e., having one big state space)
  – curse of dimensionality
  – real systems are typically decomposed/modular/hierarchical/structured → exploit this!
• interesting lines of work
  – Carlos Guestrin (PhD thesis)
  – probabilistic inference methods! (in graphical models: belief propagation, etc.; a minimal sketch follows this slide)
  – probabilistic inference for computing optimal policies
3/14
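To make the "probabilistic inference methods" bullet concrete, here is a minimal sum-product (belief propagation) pass over a chain of discrete variables; the potentials and sizes are invented for illustration, not taken from the talk.

```python
# Minimal sketch: sum-product belief propagation on a chain of discrete
# variables. Potentials below are illustrative placeholders.
import numpy as np

def chain_marginals(unary, pairwise):
    """unary: list of (K,) potentials; pairwise: list of (K,K) potentials."""
    n = len(unary)
    fwd = [None] * n                       # forward messages (include unary)
    bwd = [None] * n                       # backward messages (exclude unary)
    fwd[0] = unary[0]
    for t in range(1, n):
        fwd[t] = unary[t] * (pairwise[t - 1].T @ fwd[t - 1])
        fwd[t] /= fwd[t].sum()             # normalize for numerical stability
    bwd[n - 1] = np.ones_like(unary[-1])
    for t in range(n - 2, -1, -1):
        bwd[t] = pairwise[t] @ (unary[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()
    marg = [f * b for f, b in zip(fwd, bwd)]
    return [m / m.sum() for m in marg]     # exact node marginals

# toy 3-node chain with K = 2 states per variable
unary = [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]
pairwise = [np.array([[0.8, 0.2], [0.2, 0.8]])] * 2
print(chain_marginals(unary, pairwise))
```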
research issues 2/3: learning
• learning: we want to learn models from experience
• interesting lines of work
  – ML for model learning in robotics
4/14
research issues 3/3: integration
• integration
  – complex systems (e.g., robots) collect state information from many different modalities (sensors)
  – many subsystems (e.g., vision, position, haptics)
  – delayed/partial information
  – integration is hard (a minimal fusion example follows this slide)
5/14
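As a minimal illustration of integration in the simplest possible case, the sketch below fuses two independent Gaussian estimates of the same scalar state (say, a position from vision and from odometry) by precision weighting; all numbers are invented.

```python
# Hedged sketch: fusing two noisy Gaussian estimates of one scalar state.
def fuse_gaussian(mu1, var1, mu2, var2):
    """Precision-weighted fusion of two independent Gaussian estimates."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    var = 1.0 / (w1 + w2)                  # fused variance always shrinks
    mu = var * (w1 * mu1 + w2 * mu2)       # precision-weighted mean
    return mu, var

# vision says 1.0 m (noisy), odometry says 1.2 m (more precise)
print(fuse_gaussian(1.0, 0.04, 1.2, 0.01))   # -> (1.16, 0.008)
```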
probabilistic inference approach
• general idea: decision making, motion control and planning can be viewed as a problem of inferring a posterior over unknown variables (actions, control signals, whole trajectories) conditioned on available information (targets, goals, constraints)
6/14
probabilistic inference approach
• given some model of the future:
  [figure: dynamic Bayesian network of an MDP — actions a_0, a_1, a_2, states x_0, x_1, x_2, rewards r_0, r_1, r_2, policy π]
  (here a Markov Decision Process with P(x_0), P(x'|a,x), P(r|a,x) given, and the policy π_{ax} = P(a|x) unknown)
• condition it on something you want to see in the future
• compute the posterior over actions/decisions to get there (toy example below)
• Toussaint & Storkey (ICML 2006): proof that maximization of expected future rewards reduces to a likelihood maximization problem (EM algorithm) [fwd-bwd video]
7/14
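A toy instance of the "condition on the future" idea, not from the slides: in a hypothetical 3-state MDP with a uniform policy prior, we condition on reaching a goal state at a fixed horizon and read off the posterior over the first action.

```python
# Invented toy MDP: 3 states, 2 actions; condition on x_T = goal and infer a_0.
import numpy as np

P = np.zeros((2, 3, 3))                     # P[a, x, x'] transition tables
P[0] = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]   # "advance"
P[1] = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]   # "dawdle"
prior_a = np.array([0.5, 0.5])              # uniform policy prior P(a)
T, goal, x0 = 4, 2, 0

def p_goal_given(x, t):
    """Backward message: P(x_T = goal | x_t = x) under the prior policy."""
    if t == T:
        return 1.0 if x == goal else 0.0
    return sum(prior_a[a] * sum(P[a, x, y] * p_goal_given(y, t + 1)
                                for y in range(3)) for a in range(2))

# posterior over the first action: P(a_0 | x_0, x_T = goal)
post = np.array([prior_a[a] * sum(P[a, x0, y] * p_goal_given(y, 1)
                                  for y in range(3)) for a in range(2)])
print(post / post.sum())                    # "advance" dominates
```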
probabilistic inference approach [details: Toussaint & Storkey, ICML 2006]
• problem: find a policy π that maximizes V^π = E{ Σ_{t=0}^∞ γ^t r_t ; π } with discount factor γ ∈ [0, 1)
• Theorem: maximizing the likelihood L^π = P(r̂ = 1; π) in the mixture of finite-time MDPs (with horizon prior P(T) = γ^T (1 − γ)) is equivalent to maximizing V^π = E{ Σ_{t=0}^∞ γ^t r_t ; π } in the original MDP
• problem of optimal policy → problem of likelihood maximization (EM algorithm) [demo]
8/14
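The theorem can be seen in a few lines, assuming the per-step reward is scaled to [0, 1] so that E{r_t} can be read as P(r̂ = 1 at time t); this is a compressed sketch of the ICML 2006 construction, not the full proof.

```latex
\begin{align*}
L^\pi &= P(\hat r = 1;\pi)
       = \sum_{T=0}^{\infty} P(T)\, P(\hat r = 1 \mid T;\pi)
       && \text{mixture over horizons } T\\
      &= \sum_{T=0}^{\infty} \gamma^T (1-\gamma)\, E\{r_T;\pi\}
       && \text{the finite-time MDP emits reward only at } t=T\\
      &= (1-\gamma)\, E\Big\{\textstyle\sum_{t=0}^{\infty} \gamma^t r_t;\pi\Big\}
       = (1-\gamma)\, V^\pi .
\end{align*}
```

Since (1 − γ) is a constant, any π that maximizes L^π also maximizes V^π, and EM on the mixture model becomes a policy optimizer.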
POMDP application
• in POMDPs the agent needs some kind of memory (a belief-update sketch follows this slide)
  [figure: DBN with memory/belief nodes b_0, b_1, b_2, observations y_0, y_1, y_2, actions a_0, a_1, a_2, hidden states x_0, x_1, x_2, rewards r_0, r_1, r_2]
• mazes: T-junctions, halls & corridors (379 locations, 1516 states)
(Toussaint, Harmeling & Storkey, 2006)
9/14
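The "memory" that the figure's b-nodes represent can, in the simplest reading, be a belief state maintained by a Bayes filter; the sketch below shows the standard update, with invented transition and observation matrices.

```python
# Minimal sketch of POMDP memory as a Bayes filter over the hidden state.
import numpy as np

def belief_update(b, a, y, T, O):
    """b'(x') ∝ O[y, x'] * sum_x T[a, x, x'] * b[x]."""
    b_pred = T[a].T @ b          # predict through the transition model
    b_new = O[y] * b_pred        # weight by the observation likelihood
    return b_new / b_new.sum()   # renormalize

# toy: 2 hidden states, 1 action, 2 observations (all values invented)
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # T[a, x, x']
O = np.array([[0.7, 0.2], [0.3, 0.8]])     # O[y, x']
b = np.array([0.5, 0.5])
print(belief_update(b, 0, 1, T, O))        # y = 1 shifts belief to state 1
```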
POMDP application
• UAI paper presented on Friday: Marc Toussaint, Laurent Charlin & Pascal Poupart: Hierarchical POMDP Controller Optimization by Likelihood Maximization
  [figure: DBN of the hierarchical finite-state controller, with controller-node layers N_0, N_1, N_2, states S, observations O and actions A]
  [table: benchmark results comparing HSVI2, the best results from the literature, and the ML approach (averaged over 10 runs) in controller size (nodes), runtime t(s) and value V, on the problems paint (|S|,|A|,|O| = 4, 4, 2), shuttle (8, 3, 5), 4x4 maze (16, 4, 2), chain-of-chains (10, 4, 1), handwashing (84, 7, 12) and cheese-taxi (33, 7, 10)]
10/14
robotic motion inference application
• four task variables (a toy composite cost is sketched after this slide):
  – position of right finger
  – collision with objects
  – balance
  – comfortableness
  [figure: cost (log scale) vs. computation time (sec) for bayes (repeats), bayes (fwd-bwd), gradient (direct), gradient (spline) and MAP]
(Toussaint & Goerick, IROS 2007)
11/14
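As a hedged illustration of what "task variables" mean (not the paper's actual implementation), the sketch below combines several task maps into one weighted cost over a joint configuration; the arm kinematics, targets and weights are invented.

```python
# Illustrative sketch: each task map phi_i sends the joint configuration q
# into a task space; deviations from targets are weighted and summed.
import numpy as np

def motion_cost(q, task_maps, targets, weights):
    """Sum of weighted squared task-space errors for one configuration."""
    return sum(w * np.sum((phi(q) - y) ** 2)
               for phi, y, w in zip(task_maps, targets, weights))

# toy 2-DOF planar arm: fingertip position task + "comfort" (stay near rest)
finger_pos = lambda q: np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                                 np.sin(q[0]) + np.sin(q[0] + q[1])])
comfort    = lambda q: q                     # penalize leaving q = 0
q = np.array([0.3, -0.2])
print(motion_cost(q, [finger_pos, comfort],
                  [np.array([1.5, 0.5]), np.zeros(2)], [1.0, 0.1]))
```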
on Asimo
• Toussaint, Gienger & Goerick (Humanoids 2007): Optimization of sequential attractor-based movement for compact behavior generation (a technique other than inference)
  [figure: three example movements — 3 s, 8 control points, both hands' position and attitude controlled; 3 s, 4 control points, left hand's position and attitude controlled; 4 s, 10 control points, both hands' position and attitude controlled]
12/14
model learning
• control of a dynamic robot system
  – dynamics: f : (x, ẋ, u) ↦ ẍ
  – learning the inverse model φ : (x, ẋ, ẍ*) ↦ u (a minimal regression sketch follows this slide)
  [learn] [pole]
  (methods: A. Moore, C. Atkeson, S. Schaal, S. Vijayakumar, et al.)
13/14
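A minimal sketch of learning the inverse model from logged transitions, with a plain linear least-squares fit standing in for the locally weighted and nonparametric regressors used in the cited work; the data is synthetic.

```python
# Hedged sketch: regress the control u on (x, xdot, desired xddot) from
# logged transitions; a linear model is a stand-in, the data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                 # features: (x, xdot, xddot*)
true_W = np.array([1.5, -0.4, 2.0])         # invented ground truth
u = X @ true_W + 0.01 * rng.normal(size=n)  # noisy logged controls

W, *_ = np.linalg.lstsq(X, u, rcond=None)   # fit phi: (x, xdot, xddot*) -> u

x_query = np.array([0.1, 0.0, 1.0])         # ask for acceleration 1.0
print("predicted control:", x_query @ W)
```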
conclusions
  [concept map: the core of optimal control (DP, Bellman, LQG, HJB) linking to RL (Value Iteration, TD, Q-learning, Bayesian RL, E^3) and to inference (MDPs as graphical models, path integrals, likelihood maximization, posterior trajectories/control, state estimation, sensor processing)]
• exciting potential for Machine Learning methods
  – structured state, abstraction, learning, integration
• an integrative view from the ML perspective is possible
14/14