  1. Sample Complexity of ADP Algorithms. A. LAZARIC (SequeL Team @INRIA-Lille), SequeL – INRIA Lille. ENS Cachan, Master 2 MVA, MVA-RL Course.

  2. Sources of Error
  ◮ Approximation error. If X is large or continuous, value functions V cannot be represented exactly ⇒ use an approximation space F.
  ◮ Estimation error. If the reward r and dynamics p are unknown, the Bellman operators T and T^π cannot be computed exactly ⇒ estimate the Bellman operators from samples.

  3. In This Lecture
  ◮ Infinite-horizon setting with discount γ
  ◮ Study the impact of the estimation error

  4. In This Lecture: Warning!!
  Problem: are these performance bounds accurate/useful?
  Answer: of course not! :)
  Reason: upper bounds, non-tight analysis, worst case.

  5. In This Lecture: Warning!!
  Chernoff-Hoeffding inequality: for i.i.d. random variables X_1, ..., X_n taking values in [a, b],
  $$\mathbb{P}\left( \left| \frac{1}{n} \sum_{t=1}^{n} X_t - \mathbb{E}[X_1] \right| > (b - a) \sqrt{\frac{\log(2/\delta)}{2n}} \right) \le \delta$$
  ⇒ worst case w.r.t. all the distributions bounded in [a, b], loose for other distributions.
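A minimal sketch (not from the slides) of how loose this worst-case width can be for a particular distribution: the values of n, delta, a, b, and the Bernoulli(0.1) example are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, a, b = 1000, 0.05, 0.0, 1.0

# Worst-case Hoeffding confidence width for the empirical mean of n samples in [a, b]
width = (b - a) * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

# Empirical deviation of the sample mean for a Bernoulli(0.1) distribution:
# far from the worst case (variance 0.09 << 1/4), so the bound is loose here.
runs = 10_000
deviations = np.abs(rng.binomial(1, 0.1, size=(runs, n)).mean(axis=1) - 0.1)
print(f"Hoeffding width: {width:.4f}")
print(f"Fraction of runs exceeding the width: {(deviations > width).mean():.4f}  (guaranteed <= {delta})")
```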

  6. In This Lecture: Warning!!
  Question: so why should we derive/study these bounds?
  Answer:
  ◮ General guarantees
  ◮ Rates of convergence (not always available in asymptotic analysis)
  ◮ Explicit dependency on the design parameters
  ◮ Explicit dependency on the problem parameters
  ◮ First guess on how to tune parameters
  ◮ Better understanding of the algorithms

  7. Outline
  ◮ Sample Complexity of LSTD
      The Algorithm
      LSTD and LSPI Error Bounds
  ◮ Sample Complexity of Fitted Q-iteration

  8. Outline (current section: Sample Complexity of LSTD – The Algorithm)

  9. Least-Squares Temporal-Difference Learning (LSTD)
  ◮ Linear function space F = { f : f(·) = Σ_{j=1}^d α_j ϕ_j(·) }
  ◮ V^π is the fixed point of T^π: V^π = T^π V^π
  ◮ V^π may not belong to F: V^π ∉ F
  ◮ The best approximation of V^π in F is Π V^π = arg min_{f∈F} ||V^π − f|| (Π is the projection onto F)
  (Diagram: V^π, its projection Π V^π, and the space F.)

  10. Least-Squares Temporal-Difference Learning (LSTD)
  ◮ LSTD searches for the fixed point of Π_? T^π instead (Π_? is the projection onto F w.r.t. the L_?-norm)
  ◮ Π_∞ T^π is a contraction in L_∞-norm
  ◮ The L_∞-projection is numerically expensive when the number of states is large or infinite
  ◮ LSTD searches for the fixed point of Π_{2,ρ} T^π, where Π_{2,ρ} g = arg min_{f∈F} ||g − f||_{2,ρ}
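A minimal sketch (my own illustration, not from the slides) of the weighted L2 projection Π_{2,ρ} onto the span of d features: the projection of g is Φα* with α* = (Φᵀ D Φ)⁻¹ Φᵀ D g, where D = diag(ρ). The 5-state toy values are illustrative assumptions.

```python
import numpy as np

def project_l2_rho(g, Phi, rho):
    """Project the vector g (one value per state) onto span(Phi) w.r.t. ||.||_{2,rho}."""
    D = np.diag(rho)
    gram = Phi.T @ D @ Phi              # (d x d) Gram matrix of the features under rho
    alpha = np.linalg.solve(gram, Phi.T @ D @ g)
    return Phi @ alpha                  # best approximation of g in the feature space

# Toy example: 5 states, 2 features, uniform weighting distribution rho.
Phi = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
rho = np.full(5, 0.2)
g = np.array([0.0, 1.2, 1.9, 3.1, 4.0])
print(project_l2_rho(g, Phi, rho))
```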

  11. Least-Squares Temporal-Difference Learning (LSTD)
  When the fixed point of Π_ρ T^π exists, we call it the LSTD solution: V_TD = Π_ρ T^π V_TD.
  (Diagram: V^π, T^π V_TD, and V_TD = Π_ρ T^π V_TD in the space F.)
  $$\langle T^\pi V_{TD} - V_{TD}, \varphi_i \rangle_\rho = 0, \qquad i = 1, \dots, d$$
  $$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \varphi_i \rangle_\rho = 0$$
  $$\underbrace{\langle r^\pi, \varphi_i \rangle_\rho}_{b_i} \; - \; \sum_{j=1}^{d} \underbrace{\langle \varphi_j - \gamma P^\pi \varphi_j, \varphi_i \rangle_\rho}_{A_{ij}} \, \alpha_{TD}(j) = 0 \quad \longrightarrow \quad A\,\alpha_{TD} = b
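A minimal sketch (my own illustration, not from the slides) of the model-based LSTD solution A α_TD = b for a finite Markov chain with known P^π, r^π, and weighting distribution ρ. The 3-state chain, features, and numerical values below are illustrative assumptions.

```python
import numpy as np

def lstd_model_based(P_pi, r_pi, Phi, rho, gamma):
    """Solve A alpha = b with A_ij = <phi_j - gamma P^pi phi_j, phi_i>_rho and b_i = <r^pi, phi_i>_rho."""
    D = np.diag(rho)
    A = Phi.T @ D @ (Phi - gamma * P_pi @ Phi)   # (d x d)
    b = Phi.T @ D @ r_pi                         # (d,)
    alpha = np.linalg.solve(A, b)
    return Phi @ alpha                           # V_TD evaluated at every state

# Toy 3-state chain under a fixed policy (transition matrix, rewards, features).
P_pi = np.array([[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]])
r_pi = np.array([0.0, 0.0, 1.0])
Phi = np.array([[1., 0.], [1., 1.], [1., 2.]])
rho = np.full(3, 1 / 3)
print(lstd_model_based(P_pi, r_pi, Phi, rho, gamma=0.95))
```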

  12. LSTD Algorithm
  ◮ In general, Π_ρ T^π is not a contraction and does not have a fixed point.
  ◮ If ρ = ρ^π, the stationary distribution of π, then Π_{ρ^π} T^π has a unique fixed point.
  Proposition (LSTD Performance)
  $$\| V^\pi - V_{TD} \|_{\rho^\pi} \le \frac{1}{\sqrt{1 - \gamma^2}} \inf_{V \in \mathcal{F}} \| V^\pi - V \|_{\rho^\pi}$$

  13. LSTD Algorithm
  Empirical LSTD
  ◮ We observe a trajectory (X_0, R_0, X_1, R_1, ..., X_N) where X_{t+1} ∼ P(· | X_t, π(X_t)) and R_t = r(X_t, π(X_t))
  ◮ We build estimators of the matrix A and vector b:
  $$\widehat{A}_{ij} = \frac{1}{N} \sum_{t=0}^{N-1} \varphi_i(X_t) \big( \varphi_j(X_t) - \gamma \varphi_j(X_{t+1}) \big), \qquad \widehat{b}_i = \frac{1}{N} \sum_{t=0}^{N-1} \varphi_i(X_t) R_t$$
  ◮ α̂_TD = Â⁻¹ b̂, and V̂_TD(·) = ϕ(·)^⊤ α̂_TD
  ◮ When n → ∞, then  → A and b̂ → b, and thus α̂_TD → α_TD and V̂_TD → V_TD.
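A minimal sketch (my own illustration, not from the slides) of empirical LSTD: estimate Â and b̂ from a single trajectory and solve for α̂_TD. The environment (a 5-state random walk with reward 1 in the last state) and the features are illustrative assumptions.

```python
import numpy as np

def lstd_empirical(states, rewards, next_states, feature_fn, gamma):
    """Empirical LSTD: solve A_hat alpha = b_hat built from N transitions (X_t, R_t, X_{t+1})."""
    Phi = np.array([feature_fn(x) for x in states])          # (N x d)
    Phi_next = np.array([feature_fn(x) for x in next_states])
    N = len(states)
    A_hat = Phi.T @ (Phi - gamma * Phi_next) / N              # (d x d)
    b_hat = Phi.T @ np.asarray(rewards) / N                   # (d,)
    return np.linalg.solve(A_hat, b_hat)                      # alpha_hat_TD

# Illustrative data: a 5-state random walk followed for N steps, reward 1 in state 4.
rng = np.random.default_rng(0)
feature_fn = lambda x: np.array([1.0, x / 4.0])
x, states, rewards, next_states = 0, [], [], []
for _ in range(5000):
    x_next = int(np.clip(x + rng.choice([-1, 1]), 0, 4))
    states.append(x); rewards.append(float(x == 4)); next_states.append(x_next)
    x = x_next
print("alpha_hat_TD:", lstd_empirical(states, rewards, next_states, feature_fn, gamma=0.9))
```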

  14. Outline (current section: Sample Complexity of LSTD – LSTD and LSPI Error Bounds)

  15. LSTD Error Bound
  When the Markov chain induced by the policy under evaluation π has a stationary distribution ρ^π (the Markov chain is ergodic, e.g. β-mixing), then:
  Theorem (LSTD Error Bound)
  Let V̂ be the truncated LSTD solution computed using n samples along a trajectory generated by following the policy π. Then, with probability 1 − δ,
  $$\| V^\pi - \widehat{V} \|_{\rho^\pi} \le \frac{c}{\sqrt{1 - \gamma^2}} \inf_{f \in \mathcal{F}} \| V^\pi - f \|_{\rho^\pi} + O\left( \sqrt{\frac{d \log(d/\delta)}{n\,\nu}} \right)$$
  ◮ n = number of samples, d = dimension of the linear function space F
  ◮ ν = smallest eigenvalue of the Gram matrix (∫ ϕ_i ϕ_j dρ^π)_{i,j} (assumption: the eigenvalues of the Gram matrix are strictly positive, which guarantees existence of the model-based LSTD solution)
  ◮ The β-mixing coefficients are hidden in the O(·) notation
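A minimal sketch (my own illustration, not from the slides) of the quantities driving the estimation-error term, sqrt(d log(d/δ) / (n ν)), computed from an empirical Gram matrix. The constant hidden in the O(·) and the β-mixing terms are unknown, so the printed value is only an order-of-magnitude indicator; the sampling distribution below is an illustrative assumption.

```python
import numpy as np

def estimation_error_scale(Phi_samples, delta):
    """Return nu (smallest Gram eigenvalue) and the leading estimation-error scale."""
    n, d = Phi_samples.shape
    gram = Phi_samples.T @ Phi_samples / n      # empirical version of (int phi_i phi_j d rho^pi)_{ij}
    nu = np.linalg.eigvalsh(gram).min()         # smallest eigenvalue, must stay bounded away from 0
    return nu, np.sqrt(d * np.log(d / delta) / (n * nu))

# Illustrative features sampled from some distribution playing the role of rho^pi.
rng = np.random.default_rng(0)
Phi_samples = rng.uniform(-1.0, 1.0, size=(10_000, 5))
nu, scale = estimation_error_scale(Phi_samples, delta=0.05)
print(f"nu = {nu:.3f}, estimation-error scale ~ {scale:.3f}")
```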

  16. LSTD Error Bound
  $$\| V^\pi - \widehat{V} \|_{\rho^\pi} \le \underbrace{\frac{c}{\sqrt{1 - \gamma^2}} \inf_{f \in \mathcal{F}} \| V^\pi - f \|_{\rho^\pi}}_{\text{approximation error}} + \underbrace{O\left( \sqrt{\frac{d \log(d/\delta)}{n\,\nu}} \right)}_{\text{estimation error}}$$
  ◮ Approximation error: depends on how well the function space F can approximate the value function V^π
  ◮ Estimation error: depends on the number of samples n, the dimension of the function space d, the smallest eigenvalue of the Gram matrix ν, and the mixing properties of the Markov chain (hidden in the O(·))

  17. LSPI Error Bound
  Theorem (LSPI Error Bound)
  Let V_{−1} ∈ F̃ be an arbitrary initial value function, V̂_0, ..., V̂_{K−1} be the sequence of truncated value functions generated by LSPI after K iterations, and π_K be the greedy policy w.r.t. V̂_{K−1}. Then, with probability 1 − δ,
  $$\| V^* - V^{\pi_K} \|_\mu \le \frac{4\gamma}{(1-\gamma)^2} \left\{ \sqrt{C\,C_{\mu,\rho}} \left[ c\,E_0(\mathcal{F}) + O\left( \sqrt{\frac{d \log(dK/\delta)}{n\,\nu_\rho}} \right) \right] + \gamma^{K-1} R_{\max} \right\}$$

  18. LSPI Error Bound
  Theorem (LSPI Error Bound): as stated on slide 17.
  ◮ Approximation error: E_0(F) = sup_{π ∈ G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}

  19. LSPI Error Bound
  Theorem (LSPI Error Bound): as stated on slide 17.
  ◮ Approximation error: E_0(F) = sup_{π ∈ G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}
  ◮ Estimation error: depends on n, d, ν_ρ, K

  20. LSPI Error Bound
  Theorem (LSPI Error Bound): as stated on slide 17.
  ◮ Approximation error: E_0(F) = sup_{π ∈ G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}
  ◮ Estimation error: depends on n, d, ν_ρ, K
  ◮ Initialization error: error due to the choice of the initial value function or initial policy, |V^* − V^{π_0}|
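A minimal sketch (my own illustration, not the slides' pseudocode) of the LSPI outer loop behind the theorem: each of the K iterations evaluates the current policy with empirical LSTD on state-action features (LSTD-Q) and then improves greedily. The environment, feature map, and regularization below are illustrative assumptions.

```python
import numpy as np

def lspi(transitions, phi, n_actions, gamma, K):
    """transitions: list of (s, a, r, s'); phi(s, a) returns a feature vector of dimension d."""
    d = phi(*transitions[0][:2]).shape[0]
    alpha = np.zeros(d)                                          # initial Q-function (alpha = 0)
    for _ in range(K):
        greedy = lambda s: max(range(n_actions), key=lambda a: phi(s, a) @ alpha)
        A_hat, b_hat = np.zeros((d, d)), np.zeros(d)
        for (s, a, r, s_next) in transitions:                    # LSTD-Q estimates of A and b
            f, f_next = phi(s, a), phi(s_next, greedy(s_next))
            A_hat += np.outer(f, f - gamma * f_next)
            b_hat += r * f
        alpha = np.linalg.solve(A_hat + 1e-6 * np.eye(d), b_hat)  # small ridge for numerical stability
    return alpha

# Toy 2-state, 2-action MDP with a fixed batch of transitions (illustrative values).
rng = np.random.default_rng(0)
phi = lambda s, a: np.array([1.0, s, a, s * a], dtype=float)
transitions = [(s, a, float(s == 1 and a == 1), int(rng.random() < 0.5 + 0.4 * a))
               for s in (0, 1) for a in (0, 1) for _ in range(200)]
print("final alpha:", lspi(transitions, phi, n_actions=2, gamma=0.9, K=5))
```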
