
Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning). A. LAZARIC (SequeL Team, INRIA Lille). ENS Cachan, Master 2 MVA, MVA-RL Course.


1–3. Linear Fitted Q-iteration

Input: space F, number of iterations K, sampling distribution ρ, number of samples n; initial function Q̂_0 ∈ F.

For k = 1, ..., K:
1. Draw n samples $(x_i, a_i) \stackrel{\text{i.i.d.}}{\sim} \rho$
2. Sample $x'_i \sim p(\cdot \mid x_i, a_i)$
3. Compute $r_i = r(x_i, a_i)$
4. Compute $y_i = r_i + \gamma \max_a \hat{Q}_{k-1}(x'_i, a)$
5. Build the training set $\{((x_i, a_i), y_i)\}_{i=1}^n$
6. Solve the least-squares problem
   $\hat{\alpha}_k = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big(f_\alpha(x_i, a_i) - y_i\big)^2$
7. Return $\hat{Q}_k = f_{\hat{\alpha}_k}$ (truncation may be needed)

Return $\pi_K(\cdot) = \arg\max_a \hat{Q}_K(\cdot, a)$ (greedy policy).
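The loop above can be written down almost verbatim. The sketch below is a minimal illustration: the environment interface (sample_state_action, sample_next_state, reward), the feature map phi, and the zero initialization of Q̂_0 are placeholders assumed for the example, not prescribed by the slides.

```python
import numpy as np

def linear_fqi(phi, actions, sample_state_action, sample_next_state, reward,
               gamma, K, n, d, v_max):
    """Linear fitted Q-iteration sketch; phi(x, a) returns a d-dimensional feature vector."""
    alpha = np.zeros(d)                                            # hat{Q}_0 = 0 (in F)
    Q = lambda x, a: np.clip(phi(x, a) @ alpha, -v_max, v_max)     # truncated hat{Q}_{k-1}
    for _ in range(K):
        # steps 1-3: draw (x_i, a_i) ~ rho, then x'_i ~ p(.|x_i, a_i) and r_i = r(x_i, a_i)
        xs, acts = sample_state_action(n)
        xs_next = [sample_next_state(x, a) for x, a in zip(xs, acts)]
        rs = [reward(x, a) for x, a in zip(xs, acts)]
        # step 4: targets y_i = r_i + gamma * max_a hat{Q}_{k-1}(x'_i, a)
        ys = [r + gamma * max(Q(xn, a) for a in actions) for r, xn in zip(rs, xs_next)]
        # steps 5-7: least squares on the training set (truncation happens inside Q above)
        Phi = np.array([phi(x, a) for x, a in zip(xs, acts)])
        alpha, *_ = np.linalg.lstsq(Phi, np.array(ys), rcond=None)
    # greedy policy pi_K
    return lambda x: max(actions, key=lambda a: phi(x, a) @ alpha)
```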

4–5. Linear Fitted Q-iteration: Sampling

1. Draw n samples $(x_i, a_i) \stackrel{\text{i.i.d.}}{\sim} \rho$
2. Sample $x'_i \sim p(\cdot \mid x_i, a_i)$
3. Compute $r_i = r(x_i, a_i)$

- In practice the sampling can be done once, before running the algorithm
- The sampling distribution ρ should cover the state-action space in all relevant regions
- If it is not possible to choose ρ, a database of samples can be used

6–7. Linear Fitted Q-iteration: The Training Set

4. Compute $y_i = r_i + \gamma \max_a \hat{Q}_{k-1}(x'_i, a)$
5. Build the training set $\{((x_i, a_i), y_i)\}_{i=1}^n$

- Each $y_i$ is an unbiased sample of $\mathcal{T}\hat{Q}_{k-1}(x_i, a_i)$, since
  $\mathbb{E}[y_i \mid x_i, a_i] = \mathbb{E}\big[r_i + \gamma \max_a \hat{Q}_{k-1}(x'_i, a)\big] = r(x_i, a_i) + \gamma \int_X \max_a \hat{Q}_{k-1}(x', a)\, p(dx' \mid x_i, a_i) = \mathcal{T}\hat{Q}_{k-1}(x_i, a_i)$
- The problem "reduces" to standard regression
- The training set must be recomputed at each iteration

8–9. Linear Fitted Q-iteration: The Regression Problem

6. Solve the least-squares problem
   $\hat{\alpha}_k = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big(f_\alpha(x_i, a_i) - y_i\big)^2$
7. Return $\hat{Q}_k = f_{\hat{\alpha}_k}$ (truncation may be needed)

- Thanks to the linear space we can solve the problem in closed form:
  - Build the feature matrix $\Phi = [\phi(x_1, a_1)^\top; \ldots; \phi(x_n, a_n)^\top]$
  - Compute $\hat{\alpha}_k = (\Phi^\top \Phi)^{-1} \Phi^\top y$ (least-squares solution)
  - Truncate to $[-V_{\max}, V_{\max}]$, with $V_{\max} = R_{\max}/(1-\gamma)$
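A minimal numerical sketch of this closed-form step; the feature matrix and the targets are random placeholders, whereas a real run would build them from the samples of steps 1-5.

```python
import numpy as np

n, d = 200, 10
Phi = np.random.rand(n, d)          # row i is phi(x_i, a_i)^T
y = np.random.rand(n)               # targets y_i from step 4 (placeholders here)
R_max, gamma = 1.0, 0.95
V_max = R_max / (1 - gamma)

# least-squares solution (assumes Phi^T Phi is invertible; use lstsq/pinv otherwise)
alpha_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# truncation of hat{Q}_k to [-V_max, V_max]
Q_hat = np.clip(Phi @ alpha_hat, -V_max, V_max)
```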

10. Sketch of the Analysis

[Diagram: starting from Q̂_0, each iteration applies the Bellman operator to Q̂_{k-1} and approximates the result, incurring an error ε_k between $\mathcal{T}\hat{Q}_{k-1}$ and $\hat{Q}_k$; after K iterations the greedy policy π_K is returned and the final error is measured between $Q^{\pi_K}$ and $Q^*$.]

11–13. Theoretical Objectives

Objective: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution μ:
  $\|Q^* - Q^{\pi_K}\|_\mu \le\ ???$

Sub-Objective 1: derive an intermediate bound on the prediction error at any iteration k w.r.t. the sampling distribution ρ:
  $\|\mathcal{T}\hat{Q}_{k-1} - \hat{Q}_k\|_\rho \le\ ???$

Sub-Objective 2: analyze how the error at each iteration is propagated through iterations:
  $\|Q^* - Q^{\pi_K}\|_\mu \le \text{propagation}\big(\|\mathcal{T}\hat{Q}_{k-1} - \hat{Q}_k\|_\rho\big)$

14–18. The Sources of Error

- Desired solution: $Q_k = \mathcal{T}\hat{Q}_{k-1}$
- Best solution (w.r.t. the sampling distribution ρ): $f_{\alpha_k^*} = \arg\inf_{f_\alpha \in \mathcal{F}} \|f_\alpha - Q_k\|_\rho$
  ⇒ error from the approximation space F
- Returned solution: $\hat{\alpha}_k = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \big(f_\alpha(x_i, a_i) - y_i\big)^2$
  ⇒ error from the (random) samples

19. Per-Iteration Error

Theorem. At each iteration k, Linear-FQI returns an approximation Q̂_k such that (Sub-Objective 1)
  $\|Q_k - \hat{Q}_k\|_\rho \le 4\|Q_k - f_{\alpha_k^*}\|_\rho + O\Big((V_{\max} + L\|\alpha_k^*\|)\sqrt{\tfrac{\log(1/\delta)}{n}}\Big) + O\Big(V_{\max}\sqrt{\tfrac{d\log(n/\delta)}{n}}\Big)$
with probability $1-\delta$.

Tools: concentration-of-measure inequalities, covering numbers, linear algebra, union bounds, special tricks for linear spaces, ...

20–23. Per-Iteration Error: Remarks

$\|Q_k - \hat{Q}_k\|_\rho \le 4\|Q_k - f_{\alpha_k^*}\|_\rho + O\Big((V_{\max} + L\|\alpha_k^*\|)\sqrt{\tfrac{\log(1/\delta)}{n}}\Big) + O\Big(V_{\max}\sqrt{\tfrac{d\log(n/\delta)}{n}}\Big)$

On the approximation-error term $4\|Q_k - f_{\alpha_k^*}\|_\rho$:
- No algorithm can do better (up to the constant 4)
- It depends on the space F
- It changes with the iteration k

On the term $O\big((V_{\max} + L\|\alpha_k^*\|)\sqrt{\log(1/\delta)/n}\big)$:
- It vanishes as $O(n^{-1/2})$
- It depends on the features (L) and on the best solution ($\|\alpha_k^*\|$)

On the term $O\big(V_{\max}\sqrt{d\log(n/\delta)/n}\big)$:
- It vanishes as $O(n^{-1/2})$
- It depends on the dimensionality of the space (d) and the number of samples (n)

24–27. Error Propagation

Objective: bound $\|Q^* - Q^{\pi_K}\|_\mu$

- Problem 1: the test norm (w.r.t. μ) is different from the sampling norm (w.r.t. ρ)
- Problem 2: we have bounds for Q̂_k, not for the performance of the corresponding π_k
- Problem 3: we have bounds for one single iteration

28–29. Error Propagation

Let $P^\pi$ be the transition kernel for a fixed policy π.
- m-step (worst-case) concentration of the future state distribution:
  $c(m) = \sup_{\pi_1, \ldots, \pi_m} \Big\|\frac{d(\mu P^{\pi_1} \cdots P^{\pi_m})}{d\rho}\Big\|_\infty < \infty$
- Average (discounted) concentration:
  $C_{\mu,\rho} = (1-\gamma)^2 \sum_{m \ge 1} m\, \gamma^{m-1} c(m) < +\infty$
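For intuition, these coefficients are easy to compute numerically in a small finite MDP. The sketch below makes a simplifying assumption: it follows a single fixed policy, so it computes $\|d(\mu (P^\pi)^m)/d\rho\|_\infty$ rather than the supremum over all policy sequences; the 3-state kernel and the distributions are made up for the example.

```python
import numpy as np

def concentration_coefficients(P_pi, mu, rho, gamma, m_max=200):
    """c(m) = max_x (mu P_pi^m)(x) / rho(x) and the discounted sum C_{mu,rho}."""
    c = []
    d = mu.copy()
    for _ in range(m_max):
        d = d @ P_pi                       # state distribution after one more step
        c.append(np.max(d / rho))          # sup-norm of the density ratio
    c = np.array(c)
    m = np.arange(1, m_max + 1)
    C = (1 - gamma) ** 2 * np.sum(m * gamma ** (m - 1) * c)
    return c, C

P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.3, 0.0, 0.7]])          # illustrative 3-state kernel
mu = np.array([1.0, 0.0, 0.0])              # test distribution
rho = np.ones(3) / 3                        # uniform sampling distribution
c, C = concentration_coefficients(P_pi, mu, rho, gamma=0.9)
print(c[:3], C)
```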

30. Error Propagation

Remark: relationship to the top-Lyapunov exponent
  $L^+ = \sup_{\pi_1 \pi_2 \cdots} \limsup_{m \to \infty} \frac{1}{m} \log^+ \|\rho\, P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m}\|$
If $L^+ \le 0$ (stable system), then c(m) grows at most polynomially and $C_{\mu,\rho}$ is finite.

31. Error Propagation

Proposition. Let $\epsilon_k = Q_k - \hat{Q}_k$ be the propagation error at each iteration. Then after K iterations the performance loss of the greedy policy π_K satisfies
  $\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma}{(1-\gamma)^2}\Big]^2 C_{\mu,\rho}\, \max_k \|\epsilon_k\|_\rho^2 + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\, V_{\max}^2\Big)$

32–33. The Final Bound

Bringing everything together...
  $\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma}{(1-\gamma)^2}\Big]^2 C_{\mu,\rho}\, \max_k \|\epsilon_k\|_\rho^2 + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\, V_{\max}^2\Big)$
with, at each iteration,
  $\|\epsilon_k\|_\rho = \|Q_k - \hat{Q}_k\|_\rho \le 4\|Q_k - f_{\alpha_k^*}\|_\rho + O\Big((V_{\max} + L\|\alpha_k^*\|)\sqrt{\tfrac{\log(1/\delta)}{n}}\Big) + O\Big(V_{\max}\sqrt{\tfrac{d\log(n/\delta)}{n}}\Big)$

34. The Final Bound

Theorem (see e.g., Munos, 2003). LinearFQI with a space F of d features and n samples at each iteration returns a policy π_K after K iterations such that
  $\|Q^* - Q^{\pi_K}\|_\mu \le \frac{2\gamma}{(1-\gamma)^2}\sqrt{C_{\mu,\rho}}\Big[4\, d(\mathcal{F}, \mathcal{T}\mathcal{F}) + O\Big(V_{\max}\Big(1 + \frac{L}{\sqrt{\omega}}\Big)\sqrt{\frac{d \log(n/\delta)}{n}}\Big)\Big] + O\Big(\sqrt{\frac{\gamma^K}{(1-\gamma)^3}}\, V_{\max}\Big)$

35–41. The Final Bound: Remarks

- The propagation (and the change of norm from ρ to μ) makes the problem more complex ⇒ how do we choose the sampling distribution ρ?
- The approximation error is worse than in regression: it is the inherent Bellman error
  $\|Q_k - f_{\alpha_k^*}\|_\rho = \inf_{f \in \mathcal{F}} \|Q_k - f\|_\rho = \inf_{f \in \mathcal{F}} \|\mathcal{T}\hat{Q}_{k-1} - f\|_\rho \le \sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \|\mathcal{T}g - f\|_\rho = d(\mathcal{F}, \mathcal{T}\mathcal{F})$
  Question: how to design F to make it "compatible" with the Bellman operator?
- The dependency on γ is worse than at each single iteration ⇒ is it possible to avoid it?
- The error decreases exponentially in K, so roughly K ≈ log(1/ε)/(1−γ) iterations suffice to bring the last term below ε (a short derivation follows this list)
- ω is the smallest eigenvalue of the features' Gram matrix ⇒ design the features so as to be orthogonal w.r.t. ρ
- The asymptotic rate O(d/n) is the same as for regression
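A short derivation behind the remark on K, under the simplifying assumption that we only track the $\gamma^K$ factor and absorb the polynomial dependence on $V_{\max}$ and $1/(1-\gamma)$ into the target accuracy ε:

```latex
% How many iterations K make the gamma^K term smaller than epsilon?
\gamma^{K} \le \epsilon
\;\Longleftrightarrow\;
K \ge \frac{\log(1/\epsilon)}{\log(1/\gamma)} .
% Since \log(1/\gamma) \ge 1-\gamma for \gamma \in (0,1), it suffices to take
K = \left\lceil \frac{\log(1/\epsilon)}{1-\gamma} \right\rceil .
```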

42. Summary

[Schematic: what each ingredient contributes to the final performance]
- Approximation algorithm: per-iteration error $Q_k - \hat{Q}_k$
- Dynamic programming algorithm: error propagation
- Samples: sampling strategy/distribution ρ and number n
- Markov decision process: concentrability $C_{\mu,\rho}$ and range $V_{\max}$
- Approximation space: inherent Bellman error $d(\mathcal{F}, \mathcal{T}\mathcal{F})$, size d, features (ω)

43. Other Implementations

Replace the regression step with:
- K-nearest neighbors
- Regularized linear regression with L1 or L2 regularization
- Neural networks
- Support vector regression
- ...
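As a sketch of what swapping the regressor looks like in practice: any estimator exposing a fit/predict interface can play the role of F. The example below uses scikit-learn (an assumption; the slides do not name a library) with k-nearest neighbors and L2-regularized linear regression, on placeholder data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge

X = np.random.rand(200, 4)                 # placeholder state-action features
y = np.random.rand(200)                    # placeholder FQI targets y_i

# the regression step of fitted Q-iteration with two different function classes
q_knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
q_l2 = Ridge(alpha=1.0).fit(X, y)          # L2-regularized linear regression

print(q_knn.predict(X[:3]), q_l2.predict(X[:3]))
```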

44–48. Example: the Optimal Replacement Problem

- State: level of wear of an object (e.g., a car).
- Action: {(R)eplace, (K)eep}.
- Cost:
  - c(x, R) = C
  - c(x, K) = c(x), maintenance plus extra costs.
- Dynamics:
  - $p(\cdot \mid x, R) = \exp(\beta)$, with density $d(y) = \beta e^{-\beta y}\, \mathbb{I}\{y \ge 0\}$
  - $p(\cdot \mid x, K) = x + \exp(\beta)$, with density $d(y - x)$
- Problem: minimize the discounted expected cost over an infinite horizon.
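A minimal generative model for this problem is easy to write down. In the sketch below, the numerical values of β, C, γ and the maintenance cost c(x) are illustrative assumptions, not values given in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, C, gamma = 0.6, 30.0, 0.95            # illustrative parameters

def cost(x, a):
    return C if a == "R" else 4.0 * x       # assumed maintenance cost c(x) = 4x

def step(x, a):
    wear = rng.exponential(1.0 / beta)      # Exp(beta) increment (scale = 1/beta)
    return wear if a == "R" else x + wear   # replacing resets the wear

for a in ("R", "K"):                        # one sampled transition per action
    print(a, cost(3.0, a), step(3.0, a))
```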

49–52. Example: the Optimal Replacement Problem

Optimal value function:
  $V^*(x) = \min\Big\{ c(x) + \gamma \int_0^\infty d(y - x) V^*(y)\, dy,\; C + \gamma \int_0^\infty d(y) V^*(y)\, dy \Big\}$

Optimal policy: the action that attains the minimum.

[Figure: the optimal value function and the management cost as functions of the wear x ∈ [0, 10]; the optimal policy alternates between (K)eep and (R)eplace regions along the wear axis.]

Linear approximation space: $\mathcal{F} := \big\{ V_n(x) = \sum_{k=1}^{20} \alpha_k \cos\big(k\pi \tfrac{x}{x_{\max}}\big) \big\}$.
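The whole fitted value iteration loop for this example fits in a few lines. The sketch below is self-contained but reuses the same illustrative parameters as the sampler above; it also clips the wear at $x_{\max}$ so the cosine features stay on their natural domain and approximates the integrals by Monte Carlo, both of which are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, C, gamma, x_max, d = 0.6, 30.0, 0.95, 10.0, 20
c = lambda x: 4.0 * x                                 # assumed maintenance cost

def features(x):
    k = np.arange(1, d + 1)
    return np.cos(np.outer(np.atleast_1d(x), k) * np.pi / x_max)

grid = np.linspace(0.0, x_max, 100)                   # uniform grid of states
alpha = np.zeros(d)                                   # V_0 = 0
wear = rng.exponential(1.0 / beta, size=500)          # Exp(beta) increments for Monte Carlo

for _ in range(60):                                   # fitted value iteration sweeps
    V = lambda x: features(x) @ alpha
    v_replace = C + gamma * V(np.minimum(wear, x_max)).mean()
    targets = np.array([min(c(x) + gamma * V(np.minimum(x + wear, x_max)).mean(), v_replace)
                        for x in grid])               # (T V_n)(x_i) on the grid
    alpha, *_ = np.linalg.lstsq(features(grid), targets, rcond=None)

print(np.round(features(grid[::20]) @ alpha, 1))      # hat{V} at a few wear levels
```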

53. Example: the Optimal Replacement Problem

Collect N samples on a uniform grid.

[Figure: the N grid points with their target values marked as crosses.]

54. Example: the Optimal Replacement Problem

Figure: Left: the target values computed as $\{\mathcal{T}V_0(x_n)\}_{1 \le n \le N}$. Right: the approximation $V_1 \in \mathcal{F}$ of the target function $\mathcal{T}V_0$.

55. Example: the Optimal Replacement Problem

Figure: Left: the target values computed as $\{\mathcal{T}V_1(x_n)\}_{1 \le n \le N}$. Center: the approximation $V_2 \in \mathcal{F}$ of $\mathcal{T}V_1$. Right: the approximation $V_n \in \mathcal{F}$ after n iterations.

56. Example: the Optimal Replacement Problem: Simulation

57. Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning)

- Approximate Value Iteration
- Approximate Policy Iteration

58–59. Policy Iteration: the Idea

1. Let π_0 be any stationary policy
2. At each iteration k = 1, 2, ..., K:
   - Policy evaluation: given π_k, compute $V_k = V^{\pi_k}$
   - Policy improvement: compute the greedy policy
     $\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V^{\pi_k}(y) \Big]$
3. Return the last policy π_K

- Problem: how can we approximate $V^{\pi_k}$?
- Problem: if $V_k \ne V^{\pi_k}$, does (approximate) policy iteration still work?
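For reference, the exact (tabular) version of this loop for a small MDP with known transition probabilities and rewards is a few lines of code; the batch setting of the following slides replaces the evaluation step with an approximation. The array shapes below are assumptions made for the sketch.

```python
import numpy as np

def policy_iteration(P, r, gamma, K=50):
    """P: (S, A, S) transition probabilities, r: (S, A) rewards."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                       # pi_0: arbitrary stationary policy
    for _ in range(K):
        # policy evaluation: solve (I - gamma P^pi) V = r^pi
        P_pi = P[np.arange(S), pi]
        r_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # policy improvement: greedy policy w.r.t. V
        new_pi = (r + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_pi, pi):                # greedy policy stopped changing
            break
        pi = new_pi
    return pi, V
```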

60. Approximate Policy Iteration: Performance Loss

Problem: the algorithm is no longer guaranteed to converge.

[Figure: $\|V^* - V^{\pi_k}\|$ plotted against k oscillates within an asymptotic error band instead of converging to zero.]

Proposition. The asymptotic performance of the policies π_k generated by the API algorithm is related to the approximation error as:
  $\limsup_{k \to \infty} \underbrace{\|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \underbrace{\|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}$

61–62. Least-Squares Policy Iteration (LSPI)

LSPI uses
- a linear space to approximate value functions*:
  $\mathcal{F} = \big\{ f(x) = \sum_{j=1}^d \alpha_j \varphi_j(x),\ \alpha \in \mathbb{R}^d \big\}$
- the Least-Squares Temporal Difference (LSTD) algorithm for policy evaluation (a sketch of the full LSPI loop follows the LSTD derivation below).

*In practice we use approximations of action-value functions.

63. Least-Squares Temporal-Difference Learning (LSTD)

- $V^\pi$ may not belong to F: $V^\pi \notin \mathcal{F}$
- The best approximation of $V^\pi$ in F is $\Pi V^\pi = \arg\min_{f \in \mathcal{F}} \|V^\pi - f\|$ (Π is the projection onto F)

[Figure: $V^\pi$ lies outside the subspace F; $\Pi V^\pi$ is its projection onto F.]

64. Least-Squares Temporal-Difference Learning (LSTD)

- $V^\pi$ is the fixed point of $\mathcal{T}^\pi$: $V^\pi = \mathcal{T}^\pi V^\pi = r^\pi + \gamma P^\pi V^\pi$
- LSTD searches for the fixed point of $\Pi_{2,\rho}\, \mathcal{T}^\pi$, where $\Pi_{2,\rho}\, g = \arg\min_{f \in \mathcal{F}} \|g - f\|_{2,\rho}$
- When the fixed point of $\Pi_\rho \mathcal{T}^\pi$ exists, we call it the LSTD solution: $V_{TD} = \Pi_\rho \mathcal{T}^\pi V_{TD}$

[Figure: $\mathcal{T}^\pi$ maps $V_{TD}$ out of the subspace F; projecting back with $\Pi_\rho$ returns $V_{TD}$ itself.]

65–67. Least-Squares Temporal-Difference Learning (LSTD)

$V_{TD} = \Pi_\rho \mathcal{T}^\pi V_{TD}$

- The projection $\Pi_\rho$ is orthogonal in expectation w.r.t. the space F spanned by the features $\varphi_1, \ldots, \varphi_d$:
  $\mathbb{E}_{x \sim \rho}\big[ (\mathcal{T}^\pi V_{TD}(x) - V_{TD}(x))\, \varphi_i(x) \big] = 0, \quad \forall i \in [1, d]$, i.e. $\langle \mathcal{T}^\pi V_{TD} - V_{TD}, \varphi_i \rangle_\rho = 0$
- By definition of the Bellman operator:
  $\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \varphi_i \rangle_\rho = 0$, i.e. $\langle r^\pi, \varphi_i \rangle_\rho - \langle (I - \gamma P^\pi) V_{TD}, \varphi_i \rangle_\rho = 0$
- Since $V_{TD} \in \mathcal{F}$, there exists $\alpha_{TD}$ such that $V_{TD}(x) = \phi(x)^\top \alpha_{TD}$, hence
  $\langle r^\pi, \varphi_i \rangle_\rho - \sum_{j=1}^d \langle (I - \gamma P^\pi) \varphi_j, \varphi_i \rangle_\rho\, \alpha_{TD,j} = 0$

68. Least-Squares Temporal-Difference Learning (LSTD)

$V_{TD} = \Pi_\rho \mathcal{T}^\pi V_{TD}$
⇓
$\underbrace{\langle r^\pi, \varphi_i \rangle_\rho}_{b_i} - \sum_{j=1}^d \underbrace{\langle (I - \gamma P^\pi) \varphi_j, \varphi_i \rangle_\rho}_{A_{i,j}}\, \alpha_{TD,j} = 0$
⇓
$A\, \alpha_{TD} = b$
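In practice A and b are estimated from sample transitions gathered while following π. A minimal sample-based sketch, where the transition list and the feature map phi are assumed inputs:

```python
import numpy as np

def lstd(transitions, phi, d, gamma):
    """transitions: iterable of (x, r, x_next) observed under policy pi;
    phi: state -> feature vector in R^d."""
    A, b = np.zeros((d, d)), np.zeros(d)
    for x, r, x_next in transitions:
        f, f_next = phi(x), phi(x_next)
        A += np.outer(f, f - gamma * f_next)   # estimates <(I - gamma P^pi) phi_j, phi_i>_rho
        b += r * f                             # estimates <r^pi, phi_i>_rho
    return np.linalg.solve(A, b)               # alpha_TD solving A alpha_TD = b (A assumed invertible)
```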

69. Least-Squares Temporal-Difference Learning (LSTD)

- Problem: in general, $\Pi_\rho \mathcal{T}^\pi$ is not a contraction and does not have a fixed point.
- Solution: if ρ = ρ^π (the stationary distribution of π), then $\Pi_{\rho^\pi} \mathcal{T}^\pi$ has a unique fixed point.
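To connect back to slides 61–62, here is a compact sketch of the full LSPI loop, using the action-value variant of LSTD (often called LSTD-Q) for the evaluation step, in line with the footnote on action-value functions. The dataset of transitions (x, a, r, x'), the action set and the state-action feature map phi_q are assumptions of the sketch; the greedy policy is recomputed from the current weights at each iteration.

```python
import numpy as np

def lstd_q(transitions, policy, phi_q, d, gamma):
    """Evaluate `policy` from transitions (x, a, r, x_next) with features phi_q(x, a)."""
    A, b = np.zeros((d, d)), np.zeros(d)
    for x, a, r, x_next in transitions:
        f = phi_q(x, a)
        f_next = phi_q(x_next, policy(x_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)               # weights of hat{Q}^pi (A assumed invertible)

def lspi(transitions, actions, phi_q, d, gamma, K=20):
    alpha = np.zeros(d)
    policy = lambda x: max(actions, key=lambda a: phi_q(x, a) @ alpha)   # greedy w.r.t. alpha
    for _ in range(K):
        alpha = lstd_q(transitions, policy, phi_q, d, gamma)  # evaluate the current greedy policy
    return policy
```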
