A Series of Lectures on Approximate Dynamic Programming
Dimitri P. Bertsekas
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
Lucca, Italy, June 2017
Second Lecture
APPROXIMATE DYNAMIC PROGRAMMING I
Outline
1. Review of the Exact DP Algorithm
2. Approximation in Value Space
3. Parametric Cost Approximation
4. Tail Problem Approximation
Recall the Basic Problem Structure for DP

Discrete-time system:

    x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, ..., N - 1

- x_k: state
- u_k: control, chosen from a constraint set U_k(x_k)
- w_k: disturbance; a random parameter with distribution P(w_k | x_k, u_k)

Optimization over feedback policies π = {μ_0, μ_1, ..., μ_{N-1}}, with u_k = μ_k(x_k) ∈ U_k(x_k).

Cost of a policy starting at initial state x_0:

    J_π(x_0) = E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(x_k), w_k) }

Optimal cost function: J*(x_0) = min_π J_π(x_0). (A toy coded instance of this problem structure follows below.)
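To make the notation concrete, here is a minimal sketch of how such a problem might be specified in code. All names (f, g, g_N, controls, w_dist) and the toy dynamics are illustrative assumptions, not part of the lecture.

```python
# A toy instance of the basic problem structure, with hypothetical names:
# states are integers, controls are -1/+1 steps, the disturbance is binary.
N = 10                          # horizon
states = range(-N, N + 1)       # state space

def controls(k, x):             # constraint set U_k(x_k)
    return (-1, +1)

def f(k, x, u, w):              # system equation x_{k+1} = f_k(x_k, u_k, w_k)
    return max(-N, min(N, x + u + w))   # clamped to stay in the state space

def g(k, x, u, w):              # stage cost g_k
    return x**2 + u**2

def g_N(x):                     # terminal cost
    return x**2

def w_dist(k, x, u):            # distribution P(w_k | x_k, u_k) as (value, prob) pairs
    return ((0, 0.5), (1, 0.5))
```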
Recall the Exact DP Algorithm

Computes, for all k and all states x_k, the quantity J_k(x_k): the optimal cost of the tail problem that starts at x_k.

Go backwards, k = N - 1, ..., 0, using

    J_N(x_N) = g_N(x_N)
    J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }

Notes:
- J_0(x_0) = J*(x_0): the cost generated at the last step is equal to the optimal cost.
- Let μ*_k(x_k) minimize the right side above for each x_k and k. Then the policy π* = {μ*_0, ..., μ*_{N-1}} is optimal.
- Potentially ENORMOUS computational requirements.
- IF we knew J_{k+1}, the computation of the minimizing u_k would be much simpler. (A coded sketch of the full recursion follows below.)
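A minimal sketch of the backward recursion for a finite-state, finite-control problem, assuming the hypothetical problem interface from the toy instance above (f, g, g_N, controls, w_dist).

```python
def exact_dp(states, controls, f, g, g_N, w_dist, N):
    """Backward DP: returns cost-to-go tables J[k][x] and an optimal policy mu[k][x]."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    J[N] = {x: g_N(x) for x in states}          # J_N(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):              # go backwards, k = N-1, ..., 0
        for x in states:
            best_cost, best_u = float('inf'), None
            for u in controls(k, x):
                # Expected stage cost plus cost-to-go of the tail problem
                q = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                        for w, p in w_dist(k, x, u))
                if q < best_cost:
                    best_cost, best_u = q, u
            J[k][x] = best_cost                 # J_k(x_k)
            mu[k][x] = best_u                   # mu*_k(x_k)
    return J, mu

J, mu = exact_dp(states, controls, f, g, g_N, w_dist, N)
print(J[0][0])   # J_0(x_0) = J*(x_0) at x_0 = 0
```

Note the slide's point about cost: the double loop over all states and controls at every stage is exactly where the computation becomes enormous for large state spaces.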
One-Step and Multistep Lookahead

One-step lookahead:
- Replace J_{k+1} by an approximation J̃_{k+1}.
- Apply the control ū_k that attains the minimum in

      min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) }

ℓ-step lookahead:
- At state x_k, solve the ℓ-step DP problem starting at x_k and using terminal cost J̃_{k+ℓ}.
- If {ū_k, μ_{k+1}, ..., μ_{k+ℓ-1}} is an optimal policy for the ℓ-step problem, apply the first control ū_k.

Notes:
- Other names used: rolling or receding horizon control.
- A key issue: how do we choose J̃_{k+ℓ}?
- Another issue: how do we deal with the minimization and the computation of E{·}?
- Implementation issues, e.g., the tradeoff between on-line and off-line computation.
- Performance issues, e.g., error bounds (we will not cover these).

A coded sketch of the one-step scheme follows below.
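A sketch of a one-step lookahead controller. The cost-to-go approximation J_tilde and the problem interface (f, g, w_dist, controls) are the same hypothetical ones used in the exact DP sketch; nothing here is a fixed API.

```python
def one_step_lookahead(k, x, controls, f, g, w_dist, J_tilde):
    """One-step lookahead: pick the control minimizing the approximate Q-factor.

    J_tilde(k + 1, x) stands in for the exact cost-to-go J_{k+1}(x).
    """
    def q_factor(u):
        # E{ g_k(x, u, w) + J~_{k+1}(f_k(x, u, w)) }
        return sum(p * (g(k, x, u, w) + J_tilde(k + 1, f(k, x, u, w)))
                   for w, p in w_dist(k, x, u))
    return min(controls(k, x), key=q_factor)
```

The ℓ-step version solves a short exact DP problem over stages k, ..., k + ℓ - 1 with J̃_{k+ℓ} as terminal cost, and applies only the first control of the resulting policy.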
A Summary of Approximation Possibilities in Value Space

At state x_k, the DP minimization over the first ℓ steps (which itself could be approximate), with a "future" approximation J̃_{k+ℓ}:

    min_{u_k, μ_{k+1}, ..., μ_{k+ℓ-1}} E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ-1} g_m(x_m, μ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }

Approximations of the first ℓ steps:
- Replace E{·} with nominal values (certainty equivalent control); see the sketch below
- Limited simulation (Monte Carlo tree search)

Computation of J̃_{k+ℓ}:
- Simple choices
- Parametric approximation
- Tail problem approximation
- Rollout
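A sketch of certainty equivalent control: the expectation over w is replaced by a single nominal disturbance value inside one-step lookahead. The helper w_nominal (e.g., returning the mean disturbance) is an assumption for illustration.

```python
def certainty_equivalent_control(k, x, controls, f, g, J_tilde, w_nominal):
    """Certainty equivalent control: replace E{.} over w with a nominal value.

    w_nominal(k, x, u) returns a typical disturbance (e.g., its mean);
    the stochastic problem is thus treated as a deterministic one.
    """
    def q_factor(u):
        w = w_nominal(k, x, u)
        return g(k, x, u, w) + J_tilde(k + 1, f(k, x, u, w))
    return min(controls(k, x), key=q_factor)
```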
A First-Order Division of Lookahead Choices

Long lookahead ℓ and simple choice of J̃_{k+ℓ}. Some examples:
- J̃_{k+ℓ}(x) ≡ 0 (or a constant)
- J̃_{k+ℓ}(x) = g_N(x)
- For problems with a "goal state," use a simple penalty (see the sketch below):

      J̃_{k+ℓ}(x) = 0 if x is a goal state,   J̃_{k+ℓ}(x) >> 1 if x is not a goal state

Long lookahead ⇒ a lot of DP computation, which often must be done off-line.

Short lookahead ℓ and sophisticated choice J̃_{k+ℓ} ≈ J_{k+ℓ}:
- The lookahead cost function approximates (to within a constant) the optimal cost-to-go produced by exact DP.
- We will next describe a variety of off-line and on-line approximation approaches.
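A minimal sketch of the goal-state penalty choice; the penalty value and the set-membership test are illustrative, and "large" must be judged relative to the problem's typical stage costs.

```python
GOAL_PENALTY = 1e6   # "large" relative to typical stage costs; problem-dependent

def goal_penalty_terminal_cost(x, goal_states):
    """Simple terminal cost J~_{k+l}: zero at a goal state, a large penalty otherwise."""
    return 0.0 if x in goal_states else GOAL_PENALTY
```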
Approximation in Value Space

[Diagram: lookahead minimization over the first ℓ steps, followed by a cost-to-go approximation of the "future"]

    min_{u_k, μ_{k+1}, ..., μ_{k+ℓ-1}} E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ-1} g_m(x_m, μ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }

Next: parametric approximation of J̃_{k+ℓ}.
Parametric Approximation: Approximation Architectures

We approximate J_k(x_k) with a function from an approximation architecture, i.e., a parametric class J̃_k(x_k, r_k), where r_k = (r_{1,k}, ..., r_{m,k}) is a vector of "tunable" scalar weights.

We use J̃_k in place of J_k (the optimal cost-to-go function) in a one-step or multistep lookahead scheme.

Role of r_k: by adjusting r_k we can change the "shape" of J̃_k so that it is "close" to the optimal J_k (at least to within a constant).

Two key issues:
- The choice of the parametric class J̃_k(x_k, r_k); there is a large variety (a simple example follows below).
- The method for tuning/adjusting the weights ("training" the architecture).
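One simple member of such a parametric class, for concreteness: a quadratic-in-the-state architecture. The quadratic form is just one illustrative choice; the weight layout r = (r0, r1, r2) is an assumption of this sketch.

```python
import numpy as np

def quadratic_architecture(x, r):
    """A simple parametric class: J~(x, r) = r0 + r1' x + x' R2 x,
    with tunable weights r = (r0, r1, R2)."""
    r0, r1, R2 = r
    x = np.asarray(x, dtype=float)
    return r0 + r1 @ x + x @ (R2 @ x)

# Hypothetical usage with a 2-dimensional state
x = np.array([1.0, 2.0])
r = (0.5, np.array([1.0, -1.0]), np.eye(2))
print(quadratic_architecture(x, r))   # 0.5 + (1 - 2) + (1 + 4) = 4.5
```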
Feature-Based Architectures

Feature extraction: a process that maps the state x_k into a vector φ_k(x_k) = (φ_{1,k}(x_k), ..., φ_{m,k}(x_k)), called the feature vector associated with x_k.

A feature-based cost approximator has the form

    J̃_k(x_k, r_k) = Ĵ_k(φ_k(x_k), r_k)

where r_k is a parameter vector and Ĵ_k is some function, linear or nonlinear in r_k.

With a well-chosen feature vector φ_k(x_k), a good approximation to the cost-to-go is often provided by linearly weighting the features, i.e.,

    J̃_k(x_k, r_k) = Ĵ_k(φ_k(x_k), r_k) = Σ_{i=1}^{m} r_{i,k} φ_{i,k}(x_k) = r_k' φ_k(x_k)

[Diagram: state x_k → feature extraction mapping → feature vector φ_k(x_k) → linear mapping → linear cost approximator r_k' φ_k(x_k)]

This can be viewed as approximation onto a subspace of basis functions of x_k defined by the features φ_{i,k}(x_k). A coded sketch follows below.
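A minimal sketch of the linear feature-based approximator r' φ(x). The concrete feature map is problem-specific; the polynomial features used in the usage example are a hypothetical illustration.

```python
import numpy as np

def linear_feature_cost(x, r, features):
    """Linear feature-based approximator: J~(x, r) = r' phi(x).

    features(x) -> 1-D numpy array phi(x), supplied by the user.
    """
    return np.dot(r, features(x))

# Example with hypothetical polynomial features of a scalar state
poly_features = lambda x: np.array([1.0, x, x**2])
r = np.array([0.5, -1.0, 2.0])
print(linear_feature_cost(3.0, r, poly_features))   # 0.5 - 3.0 + 18.0 = 15.5
```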
Feature-Based Architectures (Continued)

Any generic basis functions, such as classes of polynomials, wavelets, radial basis functions, etc., can serve as features. In some cases, problem-specific features can be "hand-crafted."

Computer chess example:

[Diagram: position → feature extraction → features (material balance, mobility, safety, etc.) → weighting of features → score; the position evaluator]

- Think of the state as the board position and the control as the move choice.
- Use a feature-based position evaluator assigning a score to each position.
- Most chess programs use a linear architecture with a "manual" choice of weights.
- Some programs choose the weights by a least squares fit using lots of grandmaster play examples.
An Example of Architecture Training: Sequential DP Approximation

A common way to train architectures J̃_k(x_k, r_k) in the context of DP:
- We start with J̃_N = g_N and sequentially train going backwards, until k = 1.
- Given a cost-to-go approximation J̃_{k+1}, we use one-step lookahead to construct a large number of state-cost pairs (x_k^s, β_k^s), s = 1, ..., q, where

      β_k^s = min_{u ∈ U_k(x_k^s)} E{ g_k(x_k^s, u, w_k) + J̃_{k+1}(f_k(x_k^s, u, w_k), r_{k+1}) },   s = 1, ..., q

- We "train" an architecture J̃_k on the training set (x_k^s, β_k^s), s = 1, ..., q.

Training by least squares/regression: we minimize over r_k

    Σ_{s=1}^{q} ( J̃_k(x_k^s, r_k) - β_k^s )² + γ ||r_k - r̄||²

where r̄ is an initial guess for r_k and γ > 0 is a regularization parameter.

- Special algorithms called incremental gradient methods are typically used for this. They take advantage of the large-sum structure of the cost function.
- For a linear architecture the training problem is a linear least squares problem; a closed-form sketch follows below.
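For the linear case, the regularized least squares problem can be solved in closed form via the normal equations. This is a sketch under assumed names (xs, betas, features, r_bar); real large-scale training would use incremental gradient methods instead, as the slide notes.

```python
import numpy as np

def fit_linear_architecture(xs, betas, features, r_bar, gamma=1e-3):
    """Regularized least squares fit of a linear architecture J~(x, r) = r' phi(x).

    Minimizes  sum_s (r' phi(x_s) - beta_s)^2 + gamma * ||r - r_bar||^2.
    xs: sample states x_k^s; betas: one-step lookahead target costs beta_k^s;
    features(x) -> 1-D feature vector phi(x); r_bar: initial guess for r.
    """
    Phi = np.array([features(x) for x in xs])   # q x m feature matrix
    beta = np.asarray(betas, dtype=float)
    m = Phi.shape[1]
    # Setting the gradient to zero gives the normal equations:
    #   (Phi' Phi + gamma I) r = Phi' beta + gamma r_bar
    A = Phi.T @ Phi + gamma * np.eye(m)
    b = Phi.T @ beta + gamma * np.asarray(r_bar, dtype=float)
    return np.linalg.solve(A, b)
```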
Neural Networks for Constructing Cost-to-Go Approximations J̃_k

Neural nets can be used in the preceding sequential DP approximation scheme: train the stage-k neural net using a training set generated with the stage-(k+1) neural net.

Two ways to view neural networks:
- As nonlinear approximation architectures.
- As linear architectures with automatically constructed features.

Focus on the typical stage k and drop the index k for convenience. Neural nets are approximation architectures of the form

    J̃(x, v, r) = Σ_{i=1}^{m} r_i φ_i(x, v) = r' φ(x, v)

involving two parameter vectors r and v with different roles:
- View φ(x, v) as a feature vector; view r as a vector of linear weighting parameters for φ(x, v).
- By training v jointly with r, we obtain automatically generated features! (A sketch follows below.)
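A minimal sketch of this view for a single-hidden-layer net. The rectifier nonlinearity and the shapes are illustrative assumptions; the point is only the structure J̃(x, v, r) = r' φ(x, v), with v parameterizing the feature construction.

```python
import numpy as np

def neural_net_cost(x, v, r):
    """Single-hidden-layer net viewed as J~(x, v, r) = r' phi(x, v).

    v = (W, b) parameterizes the internal features
    phi(x, v) = max(0, W x + b)  (rectifier, one common choice);
    r linearly weights the resulting features.
    """
    W, b = v
    phi = np.maximum(0.0, W @ np.asarray(x, dtype=float) + b)  # constructed features
    return np.dot(r, phi)

# Hypothetical dimensions: 2-dimensional state, 5 hidden units
rng = np.random.default_rng(0)
v = (rng.normal(size=(5, 2)), rng.normal(size=5))
r = rng.normal(size=5)
print(neural_net_cost([1.0, -2.0], v, r))
```

Training v jointly with r (e.g., by gradient methods on the least squares criterion of the previous slide) is what makes the features "automatically generated" rather than hand-crafted.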