Stochastic Optimal Control – part 2: discrete time, Markov Decision Processes, Reinforcement Learning
Marc Toussaint
Machine Learning & Robotics Group – TU Berlin
mtoussai@cs.tu-berlin.de
ICML 2008, Helsinki, July 5th, 2008

• Why stochasticity?
• Markov Decision Processes
• Bellman optimality equation, Dynamic Programming, Value Iteration
• Reinforcement Learning: learning from experience
Why consider stochasticity?

1) the system is inherently stochastic
2) the true system is actually deterministic, but
   a) the system is described at a level of abstraction/simplification, which makes the model approximate and stochastic
   b) sensors/observations are noisy – we never know the exact state
   c) we can handle only a part of the whole system – partial knowledge → uncertainty – decomposed planning; factored state representation

[figure: world interacting with agents 1 and 2]

• probabilities are a tool to represent information and uncertainty – there are many sources of uncertainty
Machine Learning models of stochastic processes

• Markov processes: defined by random variables x_0, x_1, … and transition probabilities P(x_{t+1} | x_t)
[figure: graphical model x_0 → x_1 → x_2]
• non-Markovian processes
  – higher-order Markov processes, autoregression models
  – structured models (hierarchical models, grammars, text models)
  – Gaussian processes (both discrete and continuous time)
  – etc.
• continuous-time processes
  – stochastic differential equations
Markov Decision Processes

• a Markov process on the random variables of states x_t, actions a_t, and rewards r_t
[figure: graphical model with policy π over actions a_0, a_1, a_2, states x_0, x_1, x_2, and rewards r_0, r_1, r_2]

  P(x_{t+1} | a_t, x_t)          transition probability   (1)
  P(r_t | a_t, x_t)              reward probability       (2)
  P(a_t | x_t) = π(a_t | x_t)    policy                   (3)

• we will assume stationarity, i.e. no explicit dependency on time
  – P(x' | a, x) and P(r | a, x) are invariable properties of the world
  – the policy π is a property of the agent
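As a concrete illustration (not part of the original slides), a stationary finite MDP can be stored as plain arrays; the toy numbers below are made up:

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP, purely for illustration.
n_states, n_actions = 2, 2

# P[a, x, x'] = P(x' | a, x): transition probabilities (each row over x' sums to 1)
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1],
        [0.1, 0.9]]
P[1] = [[0.2, 0.8],
        [0.8, 0.2]]

# R[a, x] = E[r | a, x]: expected immediate reward
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])

gamma = 0.9  # discount factor
```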
optimal policies

• value (expected discounted return) of policy π when started in x:
  V^π(x) = E{ r_0 + γ r_1 + γ² r_2 + ⋯ | x_0 = x; π }
  (cf. the cost function C(x_0, a_{0:T}) = φ(x_T) + Σ_{t=0}^{T−1} R(t, x_t, a_t) of part 1)
• optimal value function:
  V*(x) = max_π V^π(x)
• a policy π* is optimal iff ∀x: V^{π*}(x) = V*(x)
  (it simultaneously maximizes the value in all states)
• there always exists (at least one) optimal deterministic policy!
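To make the definition of V^π(x) concrete, one can estimate it by Monte Carlo rollouts. This is an illustrative sketch only (truncating the infinite discounted sum at a finite horizon), reusing the P, R array convention assumed above:

```python
import numpy as np

def sample_return(P, R, pi, x0, gamma=0.9, horizon=200, rng=None):
    """Sample one discounted return r_0 + gamma*r_1 + ... following the deterministic policy pi from x0."""
    if rng is None:
        rng = np.random.default_rng()
    n_states = P.shape[1]
    x, G, discount = x0, 0.0, 1.0
    for _ in range(horizon):                     # truncate the infinite sum; gamma^horizon is negligible
        a = pi[x]                                # pi maps state -> action
        G += discount * R[a, x]
        x = rng.choice(n_states, p=P[a, x])      # sample x' ~ P(x' | a, x)
        discount *= gamma
    return G

def mc_value(P, R, pi, x0, gamma=0.9, n_samples=1000):
    """Monte Carlo estimate of V^pi(x0) = E[discounted return | x_0 = x0; pi]."""
    return np.mean([sample_return(P, R, pi, x0, gamma) for _ in range(n_samples)])
```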
Bellman optimality equation

  V^π(x) = E{ r_0 + γ r_1 + γ² r_2 + ⋯ | x_0 = x; π }
         = E{ r_0 | x_0 = x; π } + γ E{ r_1 + γ r_2 + ⋯ | x_0 = x; π }
         = R(π(x), x) + γ Σ_{x'} P(x' | π(x), x) E{ r_1 + γ r_2 + ⋯ | x_1 = x'; π }
         = R(π(x), x) + γ Σ_{x'} P(x' | π(x), x) V^π(x')

• Bellman optimality equation
  V*(x) = max_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V*(x') ]
  π*(x) = argmax_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V*(x') ]
  (if π selected an action other than argmax_a[·], it would not be optimal: the policy π' which equals π everywhere except π'(x) = argmax_a[·] would be better)
• this is the principle of optimality in the stochastic case (related to Viterbi, max-product algorithm)
Dynamic Programming

• Bellman optimality equation
  V*(x) = max_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V*(x') ]
• Value Iteration (initialize V_0(x) = 0, iterate k = 0, 1, …):
  ∀x: V_{k+1}(x) = max_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V_k(x') ]
  – stopping criterion: max_x |V_{k+1}(x) − V_k(x)| ≤ ε
  (see the script for a proof of convergence)
• once it has converged, choose the policy (see the sketch below)
  π_k(x) = argmax_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V_k(x') ]
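A minimal sketch of tabular value iteration under the array convention assumed earlier (P[a, x, x'], R[a, x]); the einsum/argmax details are implementation choices, not from the slides:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Tabular value iteration.
    P[a, x, x'] = P(x'|a,x), R[a, x] = E[r|a,x] (shapes as in the MDP sketch above)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                          # V_0(x) = 0
    while True:
        # Q[a, x] = R(a, x) + gamma * sum_x' P(x'|a,x) V(x')
        Q = R + gamma * np.einsum('axy,y->ax', P, V)
        V_new = Q.max(axis=0)                       # V_{k+1}(x) = max_a Q[a, x]
        if np.max(np.abs(V_new - V)) <= eps:        # stopping criterion
            V = V_new
            break
        V = V_new
    pi = Q.argmax(axis=0)                           # greedy policy w.r.t. the converged value
    return V, pi
```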
maze example

• a typical example of a value function in navigation
[online demo – or switch to Terran Lane's lecture...]
comments

• Bellman's principle of optimality is the core of these methods
• it refers to the recursive reasoning about what makes a path optimal
  – the recursive property of the optimal value function
• related to Viterbi, max-product algorithm
Learning from experience

• Reinforcement Learning problem: the models P(x' | a, x) and P(r | a, x) are not known; only exploration is allowed
[figure: experience {x_t, a_t, r_t} feeds three routes towards the policy π – direct policy learning; TD learning / Q-learning of a value/Q-function V, Q followed by a policy update; and model search for an MDP model P, R followed by Dynamic Programming or policy optimization / EM]
Model learning

• trivial with a direct discrete representation: use the experience data to estimate the model (see the sketch below)
  P̂(x' | a, x) ∝ #(x' ← x | a)
  – for non-direct representations: Machine Learning methods
• use DP to compute the optimal policy for the estimated model
• Exploration-Exploitation is not a dilemma – possible solutions: E³ algorithm, Bayesian RL (see later)
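A possible count-based implementation of the estimate P̂(x' | a, x) ∝ #(x' ← x | a); the small pseudo-count prior is an added assumption (to avoid division by zero), not part of the slides:

```python
import numpy as np

def estimate_model(experience, n_states, n_actions, prior=1e-3):
    """Estimate P_hat(x'|a,x) and R_hat(a,x) from transition tuples (x, a, r, x')."""
    counts = np.full((n_actions, n_states, n_states), prior)  # pseudo-counts #(x' <- x | a)
    r_sum = np.zeros((n_actions, n_states))
    n_sa = np.full((n_actions, n_states), prior)
    for x, a, r, x_next in experience:
        counts[a, x, x_next] += 1.0
        r_sum[a, x] += r
        n_sa[a, x] += 1.0
    P_hat = counts / counts.sum(axis=2, keepdims=True)        # normalize over x'
    R_hat = r_sum / n_sa                                      # empirical mean reward per (a, x)
    return P_hat, R_hat
```

The estimated P_hat, R_hat can then be fed directly into the value_iteration sketch above to compute a policy for the estimated model.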
Temporal Difference

• recall Value Iteration:
  ∀x: V_{k+1}(x) = max_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V_k(x') ]
• Temporal Difference (TD) learning: given experience (x_t, a_t, r_t, x_{t+1}),
  V_new(x_t) = (1 − α) V_old(x_t) + α [ r_t + γ V_old(x_{t+1}) ]
             = V_old(x_t) + α [ r_t + γ V_old(x_{t+1}) − V_old(x_t) ]
  … this is a stochastic variant of Dynamic Programming
  → one can prove convergence with probability 1 (see Q-learning in the script)
• reinforcement:
  – more reward than expected (r_t + γ V_old(x_{t+1}) > V_old(x_t)) → increase V(x_t)
  – less reward than expected (r_t + γ V_old(x_{t+1}) < V_old(x_t)) → decrease V(x_t)
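In the tabular case the TD update is a one-liner; a sketch, assuming V is an array (or dict) indexed by states:

```python
def td_update(V, x, r, x_next, alpha=0.1, gamma=0.9):
    """One TD(0) update for a tabular value function V."""
    td_error = r + gamma * V[x_next] - V[x]   # positive if more reward than expected
    V[x] += alpha * td_error                  # V_new(x) = V_old(x) + alpha * td_error
    return V
```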
Q-learning

• Q-function of a policy π:
  Q^π(a, x) = E{ r_0 + γ r_1 + γ² r_2 + ⋯ | x_0 = x, a_0 = a; π }
• Bellman optimality equation for Q:
  Q*(a, x) = R(a, x) + γ Σ_{x'} P(x' | a, x) max_{a'} Q*(a', x')
• Q-Value Iteration:
  ∀a, x: Q_{k+1}(a, x) = R(a, x) + γ Σ_{x'} P(x' | a, x) max_{a'} Q_k(a', x')
• Q-learning update (converges with probability 1):
  Q_new(x_t, a_t) = (1 − α) Q_old(x_t, a_t) + α [ r_t + γ max_a Q_old(x_{t+1}, a) ]
• Q-learning is a stochastic approximation of Q-VI:
  Q-VI is deterministic: Q_{k+1} = T(Q_k)
  Q-learning is stochastic: Q_{k+1} = (1 − α) Q_k + α [ T(Q_k) + η_k ]
  η_k is zero-mean!
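A sketch of the tabular Q-learning update together with an ε-greedy action choice (the ε-greedy part anticipates the exploration remark on the next slide); here Q is assumed to be an array indexed as Q[x, a], whereas the slides write Q(a, x):

```python
import numpy as np

def epsilon_greedy(Q, x, eps=0.1, rng=None):
    """Pick argmax_a Q[x, a] with probability 1 - eps, a uniformly random action otherwise."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[x]))

def q_learning_update(Q, x, a, r, x_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for the experience (x, a, r, x_next)."""
    target = r + gamma * np.max(Q[x_next])            # r_t + gamma * max_a' Q_old(x_{t+1}, a')
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target  # stochastic approximation of Q-VI
    return Q
```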
Q-learning impact

• Q-learning (Watkins, 1989) is the first provably convergent direct adaptive optimal control algorithm
• great impact on the field of Reinforcement Learning
  – smaller representation than models
  – automatically focuses attention on where it is needed, i.e., no sweeps through the state space
  – though it does not solve the exploration-versus-exploitation issue
  – ε-greedy, optimistic initialization, etc.
Eligibility traces

• Temporal Difference:
  V_new(x_0) = V_old(x_0) + α [ r_0 + γ V_old(x_1) − V_old(x_0) ]
• longer reward sequence: r_0 r_1 r_2 r_3 r_4 r_5 r_6 r_7
  temporal credit assignment – think further backwards: receiving r_3 also tells us something about V(x_0)
  V_new(x_0) = V_old(x_0) + α [ r_0 + γ r_1 + γ² r_2 + γ³ V_old(x_3) − V_old(x_0) ]
• online implementation: remember where you have been recently (the "eligibility trace") and update those values as well (see the sketch below):
  e(x_t) ← e(x_t) + 1
  ∀x: V_new(x) = V_old(x) + α e(x) [ r_t + γ V_old(x_{t+1}) − V_old(x_t) ]
  ∀x: e(x) ← γλ e(x)
• core topic of the Sutton & Barto book – a great improvement
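A sketch of one online TD(λ) step with accumulating traces, assuming V and e are numpy arrays over states:

```python
import numpy as np

def td_lambda_update(V, e, x, r, x_next, alpha=0.1, gamma=0.9, lam=0.9):
    """One accumulating-trace TD(lambda) step over a tabular value function V and trace e."""
    e[x] += 1.0                                   # e(x_t) <- e(x_t) + 1
    td_error = r + gamma * V[x_next] - V[x]       # same TD error as in plain TD(0)
    V += alpha * e * td_error                     # all recently visited states share the update
    e *= gamma * lam                              # traces decay: e(x) <- gamma * lambda * e(x)
    return V, e
```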
comments

• again, Bellman's principle of optimality is the core of these methods:
  TD(λ), Q-learning, and eligibility traces are all methods that converge to a function obeying the Bellman optimality equation
E³: Explicit Explore or Exploit

• (John Langford)
• from the observed data, construct two MDPs:
  (1) MDP_known includes the sufficiently often visited states and executed actions, with (rather exact) estimates of P and R
      (a model which captures what you know)
  (2) MDP_unknown = MDP_known, except the reward is 1 for all actions which leave the known states and 0 otherwise
      (a model which captures the optimism of exploration)
• the algorithm (a rough code sketch follows below):
  (1) if the last x is not in Known: choose the least previously used action
  (2) else:
    (a) [seek exploration] if V_unknown > ε, act according to V_unknown until the state is unknown (or t mod T = 0), then goto (1)
    (b) [exploit] else act according to V_known
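A rough, non-authoritative sketch of the per-step decision rule above (a simplification, not the full E³ specification); the bookkeeping structures (known set, action counts) and the planning results V/π for the two MDPs are assumed to be maintained elsewhere, e.g. via the model-estimation and value-iteration sketches earlier:

```python
def e3_step(x, t, known, action_counts, V_unknown, pi_unknown, pi_known,
            eps=0.01, T=100):
    """Per-step E^3 decision rule (sketch).
    known: set of sufficiently-visited states
    action_counts[x]: dict action -> how often it was tried in x
    V_unknown / pi_unknown, pi_known: value and greedy policy from planning
    in MDP_unknown and MDP_known respectively."""
    if x not in known:
        # (1) balanced wandering: choose the least previously used action
        return min(action_counts[x], key=action_counts[x].get)
    if V_unknown[x] > eps and t % T != 0:
        # (2a) attempted exploration: the unknown part still promises value > eps
        return pi_unknown[x]
    # (2b) exploitation: act greedily in the known model
    return pi_known[x]
```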
E³ – theory

• for any (unknown) MDP:
  – the total number of actions and the computation time required by E³ are poly(|X|, |A|, T*, 1/ε, ln(1/δ))
  – performance guarantee: with probability at least (1 − δ), the expected return of E³ will exceed V* − ε
• details
  – actual return: (1/T) Σ_{t=1}^{T} r_t
  – let T* denote the (unknown) mixing time of the MDP
  – one key insight: even the optimal policy takes time O(T*) to achieve an actual return that is near-optimal
• a straightforward & intuitive approach!
  – the exploration-exploitation dilemma is not a dilemma!
  – cf. active learning, information seeking, curiosity, variance analysis