

  1. A Series of Lectures on Approximate Dynamic Programming
Dimitri P. Bertsekas
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
Lucca, Italy, June 2017

  2. Our Aim
Discuss optimization by Dynamic Programming (DP) and the use of approximations.
Purpose: computational tractability in a broad variety of practical contexts.

  3. The Scope of these Lectures
After an introduction to exact DP, we will focus on approximate DP for optimal control under stochastic uncertainty.
The subject is broad, with a rich variety of theory/math, algorithms, and applications.
Applications come from a vast array of areas: control/robotics/planning, operations research, economics, artificial intelligence, and beyond ...
We will concentrate on control of discrete-time systems with a finite number of stages (a finite horizon), and the expected value criterion.
We will focus mostly on algorithms ... less on theory and modeling.
We will not cover:
- Infinite horizon problems
- Imperfect state information and minimax/game problems
- Simulation-based methods: reinforcement learning, neuro-dynamic programming (a series of video lectures on the latter can be found at the author's web site)
Reference: the lectures will follow Chapters 1 and 6 of the author's book "Dynamic Programming and Optimal Control," Vol. I, Athena Scientific, 2017.

  4. Lectures Plan
Exact DP:
- The basic problem formulation
- Some examples
- The DP algorithm for finite horizon problems with perfect state information
- Computational limitations; motivation for approximate DP
Approximate DP - I:
- Approximation in value space; limited lookahead
- Parametric cost approximation, including neural networks
- Q-factor approximation, model-free approximate DP
- Problem approximation
Approximate DP - II:
- Simulation-based on-line approximation; rollout and Monte Carlo tree search
- Applications in backgammon and AlphaGo
- Approximation in policy space

  5. First Lecture: EXACT DYNAMIC PROGRAMMING

  6. Outline
1. Basic Problem
2. Some Examples
3. The DP Algorithm
4. Approximation Ideas

  7. Basic Problem Structure for DP
Discrete-time system:
$$x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1$$
- $x_k$: state; summarizes past information that is relevant for future optimization at time $k$
- $u_k$: control; decision to be selected at time $k$ from a given set $U_k(x_k)$
- $w_k$: disturbance; random parameter with distribution $P(w_k \mid x_k, u_k)$
- For deterministic problems there is no $w_k$
Cost function that is additive over time:
$$E\Big\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \Big\}$$
Perfect state information: the control $u_k$ is applied with (exact) knowledge of the state $x_k$.
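A minimal Python sketch (not from the lecture) of how this problem structure might be represented; the class name FiniteHorizonProblem and its attributes are illustrative assumptions.

```python
class FiniteHorizonProblem:
    """Hypothetical container for a finite-horizon stochastic control problem:
    x_{k+1} = f_k(x_k, u_k, w_k), cost g_N(x_N) + sum_{k=0}^{N-1} g_k(x_k, u_k, w_k)."""

    def __init__(self, N, f, g, g_N, controls, sample_w):
        self.N = N                # horizon (number of stages)
        self.f = f                # f(k, x, u, w) -> next state x_{k+1}
        self.g = g                # g(k, x, u, w) -> stage cost g_k
        self.g_N = g_N            # g_N(x) -> terminal cost
        self.controls = controls  # controls(k, x) -> iterable over U_k(x_k)
        self.sample_w = sample_w  # sample_w(k, x, u) -> a random disturbance w_k
```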

  8. Optimization over Feedback Policies
[Figure: feedback loop; the controller applies $u_k = \mu_k(x_k)$ to the system $x_{k+1} = f_k(x_k, u_k, w_k)$, which is driven by the disturbance $w_k$]
Feedback policies: rules that specify the control to apply at each possible state $x_k$ that can occur.
Major distinction: we minimize over sequences of functions of the state, $\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ with $u_k = \mu_k(x_k) \in U_k(x_k)$ - not over sequences of controls $\{u_0, u_1, \ldots, u_{N-1}\}$.
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ starting at initial state $x_0$:
$$J_\pi(x_0) = E\Big\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), w_k\big) \Big\}$$
Optimal cost function: $J^*(x_0) = \min_\pi J_\pi(x_0)$
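To make the policy-versus-control-sequence distinction concrete, here is a hedged sketch of estimating $J_\pi(x_0)$ by Monte Carlo simulation; it reuses the hypothetical FiniteHorizonProblem container sketched above, and the function name is an assumption.

```python
def estimate_policy_cost(problem, x0, policy, num_runs=10000):
    """Monte Carlo estimate of J_pi(x0) for a feedback policy u_k = policy(k, x_k)."""
    total = 0.0
    for _ in range(num_runs):
        x, cost = x0, 0.0
        for k in range(problem.N):
            u = policy(k, x)               # u_k = mu_k(x_k): the control depends on the state
            w = problem.sample_w(k, x, u)  # random disturbance w_k
            cost += problem.g(k, x, u, w)  # accumulate the stage cost g_k
            x = problem.f(k, x, u, w)      # move to x_{k+1}
        total += cost + problem.g_N(x)     # add the terminal cost g_N(x_N)
    return total / num_runs
```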

  9. Scope of DP
Any optimization (deterministic, stochastic, minimax, etc.) involving a sequence of decisions fits the framework.
A continuous-state example: linear-quadratic optimal control
Linear discrete-time system:
$$x_{k+1} = A x_k + B u_k + w_k, \quad k = 0, \ldots, N-1$$
- $x_k \in \Re^n$: the state at time $k$
- $u_k \in \Re^m$: the control at time $k$ (no constraints in the classical version)
- $w_k \in \Re^n$: the disturbance at time $k$ ($w_0, \ldots, w_{N-1}$ are independent random variables with given distribution)
Quadratic cost function:
$$E\Big\{ x_N' Q x_N + \sum_{k=0}^{N-1} \big( x_k' Q x_k + u_k' R u_k \big) \Big\}$$
where $Q$ and $R$ are positive definite symmetric matrices.
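A small numerical sketch of such a linear-quadratic instance in Python/NumPy; the particular matrices A, B, Q, R, the horizon, and the Gaussian disturbance below are arbitrary illustrative choices, not values from the lecture.

```python
import numpy as np

# Illustrative 2-dimensional linear-quadratic instance (values chosen arbitrarily)
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)   # state cost weight (positive definite, symmetric)
R = np.eye(1)   # control cost weight (positive definite, symmetric)
N = 20          # horizon

def step(x, u, w):
    """Linear system dynamics: x_{k+1} = A x_k + B u_k + w_k."""
    return A @ x + B @ u + w

def stage_cost(x, u):
    """Quadratic stage cost: x_k' Q x_k + u_k' R u_k."""
    return float(x @ Q @ x + u @ R @ u)

# One simulated step under an arbitrary control, with a Gaussian disturbance
rng = np.random.default_rng(0)
x, u = np.array([1.0, 0.0]), np.array([-0.5])
x_next = step(x, u, 0.01 * rng.standard_normal(2))
print(stage_cost(x, u), x_next)
```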

  10. Discrete-State Deterministic Scheduling Example
[Figure: graph of partial schedules, from the empty schedule (initial state) through A, C, AB, AC, CA, CD to ABC, ACB, ACD, CAB, CAD, CDA, with transition costs on the arcs]
Find the optimal sequence of operations A, B, C, D (A must precede B and C must precede D).
DP problem formulation
- States: partial schedules; Controls: stage 0, 1, and 2 decisions
- DP idea: break down the problem into smaller pieces (tail subproblems)
- Start from the last decision and go backwards

  11. Scheduling Example Algorithm I
[Figure: the same partial-schedule graph, with the stage 2 subproblems highlighted and the optimal costs-to-go recorded at the stage 2 states]
Solve the stage 2 subproblems (using the terminal costs).
At each state of stage 2, we record the optimal cost-to-go and the optimal decision.

  12. Scheduling Example Algorithm II
[Figure: the same graph, with the stage 1 subproblems highlighted and the optimal costs-to-go recorded at the stage 1 states]
Solve the stage 1 subproblems (using the solution of the stage 2 subproblems).
At each state of stage 1, we record the optimal cost-to-go and the optimal decision.

  13. Scheduling Example Algorithm III
[Figure: the same graph, with the stage 0 subproblem (the entire problem) highlighted and the optimal cost recorded at the initial state]
Solve the stage 0 subproblem (using the solution of the stage 1 subproblems).
The stage 0 subproblem is the entire problem.
The optimal value of the stage 0 subproblem is the optimal cost $J^*(\text{initial state})$.
Construct the optimal sequence going forward.
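A compact Python sketch of this backward solution of tail subproblems for the scheduling example, written as a memoized recursion over partial schedules; the transition costs are placeholders, since the numbers on the slide's graph are not reproduced here, and the function names are illustrative.

```python
from functools import lru_cache

OPS = ("A", "B", "C", "D")
PRECEDENCE = {"B": "A", "D": "C"}   # A must precede B, C must precede D

def cost(schedule, op):
    """Cost of performing `op` after the given partial schedule.
    Placeholder value: the actual arc costs from the slide are not reproduced."""
    return 1.0

def feasible_ops(schedule):
    """Operations that may legally extend a partial schedule."""
    done = set(schedule)
    return [op for op in OPS
            if op not in done and (op not in PRECEDENCE or PRECEDENCE[op] in done)]

@lru_cache(maxsize=None)
def cost_to_go(schedule):
    """Optimal cost of completing `schedule` (the tail subproblem starting there)."""
    if len(schedule) == len(OPS):
        return 0.0   # a complete schedule incurs no further cost
    return min(cost(schedule, op) + cost_to_go(schedule + (op,))
               for op in feasible_ops(schedule))

# The optimal cost J*(initial state) is the cost-to-go of the empty schedule;
# the optimal sequence is then constructed going forward by re-taking the argmin.
print(cost_to_go(()))
```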

  14. Principle of Optimality
Let $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ be an optimal policy.
Consider the "tail subproblem" whereby we are at $x_k$ at time $k$ and wish to minimize the "cost-to-go" from time $k$ to time $N$,
$$E\Big\{ g_N(x_N) + \sum_{m=k}^{N-1} g_m\big(x_m, \mu_m(x_m), w_m\big) \Big\}$$
Consider the "tail" $\{\mu_k^*, \mu_{k+1}^*, \ldots, \mu_{N-1}^*\}$ of the optimal policy.
[Figure: the tail subproblem starting at $x_k$ on the time axis from $k$ to $N$]
THE TAIL OF AN OPTIMAL POLICY IS OPTIMAL FOR THE TAIL SUBPROBLEM
DP algorithm
- Start with the last tail (stage $N-1$) subproblems
- Solve the stage $k$ tail subproblems, using the optimal costs-to-go of the stage $k+1$ tail subproblems
- The optimal value of the stage 0 subproblem is the optimal cost $J^*(\text{initial state})$
- In the process, construct the optimal policy

  15. Formal Statement of the DP Algorithm
Computes, for all $k$ and states $x_k$, $J_k(x_k)$: the optimal cost of the tail problem that starts at $x_k$.
Go backwards, $k = N-1, \ldots, 0$, using
$$J_N(x_N) = g_N(x_N)$$
$$J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}$$
Interpretation: to solve a tail problem that starts at state $x_k$, minimize the ($k$th-stage cost + optimal cost of the tail problem that starts at state $x_{k+1}$).
Notes:
- $J_0(x_0) = J^*(x_0)$: the cost generated at the last step is equal to the optimal cost
- Let $\mu_k^*(x_k)$ minimize the right side above for each $x_k$ and $k$. Then the policy $\pi^* = \{\mu_0^*, \ldots, \mu_{N-1}^*\}$ is optimal
- Proof by induction
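A generic Python sketch of this backward recursion for a problem with finitely many states, controls, and disturbance values; all function and variable names here are assumptions made for illustration, not notation from the lecture.

```python
def dp_backward(N, states, controls, outcomes, terminal_cost):
    """Backward DP recursion:
        J_N(x) = g_N(x)
        J_k(x) = min over u in U_k(x) of E_{w_k}[ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) ]

    states(k)         -> iterable of all possible states x_k at stage k
    controls(k, x)    -> iterable over U_k(x)
    outcomes(k, x, u) -> list of (probability, stage_cost, next_state) triples,
                         i.e. the distribution over w_k expressed through g_k and f_k
    terminal_cost(x)  -> g_N(x)
    """
    J = {x: terminal_cost(x) for x in states(N)}        # J_N = g_N
    policy = {}
    for k in range(N - 1, -1, -1):
        J_next, J = J, {}
        for x in states(k):
            best_u, best_val = None, float("inf")
            for u in controls(k, x):
                # Expected (k-th stage cost + optimal cost-to-go from x_{k+1})
                val = sum(p * (g + J_next[x_next])
                          for p, g, x_next in outcomes(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            J[x] = best_val                              # J_k(x_k)
            policy[(k, x)] = best_u                      # mu*_k(x_k)
    return J, policy   # J holds J_0 over the stage-0 states; policy is optimal
```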

  16. Practical Difficulties of DP
The curse of dimensionality (too many values of $x_k$):
- In continuous-state problems: discretization is needed; the computation grows exponentially with the dimensions of the state and control spaces
- In naturally discrete/combinatorial problems: quick explosion of the number of states as the search space increases
- Length of the horizon (what if it is infinite?)
The curse of modeling; we may not know exactly $f_k$ and $P(w_k \mid x_k, u_k)$:
- It is often hard to construct an accurate math model of the problem
- Sometimes a simulator of the system is easier to construct than a model
The problem data may not be known well in advance:
- A family of problems may be addressed; the data of the problem to be solved is given with little advance notice
- The problem data may change as the system is controlled - need for on-line replanning and fast solution
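As a back-of-the-envelope illustration of the curse of dimensionality (the grid resolution and dimensions below are arbitrary choices), discretizing each of $d$ continuous state dimensions into $m$ levels yields $m^d$ states per stage:

```python
m = 100   # grid points per state dimension (arbitrary choice)
for d in (1, 2, 4, 8):
    # The tabulated cost-to-go J_k must be stored and minimized over m**d states
    print(f"dimension {d}: {float(m) ** d:.3e} states per stage")
```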
