  1. Approximate Dynamic Programming
     A. LAZARIC (SequeL Team @ INRIA-Lille)
     ENS Cachan - Master 2 MVA
     SequeL - INRIA Lille, MVA-RL Course

  2. Value Iteration: the Idea
     1. Let $V_0$ be any vector in $\mathbb{R}^N$.
     2. At each iteration $k = 1, 2, \ldots, K$, compute $V_{k+1} = \mathcal{T} V_k$.
     3. Return the greedy policy
        $\pi_K(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_{y} p(y \mid x, a) V_K(y) \big]$.
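
     A minimal tabular sketch of this scheme (illustrative only, not the course's reference code; the arrays P[x, a, y] = p(y|x, a) and r[x, a] describing a finite MDP are hypothetical placeholders):

```python
import numpy as np

def value_iteration(P, r, gamma, K):
    """Apply the Bellman optimality operator T for K iterations on a finite MDP."""
    N, _, _ = P.shape
    V = np.zeros(N)                      # V_0: any vector in R^N
    for _ in range(K):
        # (T V)(x) = max_a [ r(x, a) + gamma * sum_y p(y|x, a) V(y) ]
        Q = r + gamma * P @ V            # shape (N, A)
        V = Q.max(axis=1)                # V_{k+1} = T V_k
    # Greedy policy with respect to V_K
    pi = (r + gamma * P @ V).argmax(axis=1)
    return V, pi
```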

  3. Value Iteration: the Guarantees
     ◮ From the fixed-point property of $\mathcal{T}$: $\lim_{k \to \infty} V_k = V^*$.
     ◮ From the contraction property of $\mathcal{T}$:
       $\| V_{k+1} - V^* \|_\infty \le \gamma^{k+1} \| V_0 - V^* \|_\infty \to 0$.
     Problem: what if $V_{k+1} \neq \mathcal{T} V_k$?
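
     As an illustrative instance of this rate (arbitrary numbers, not from the slides): with $\gamma = 0.9$, driving the error below 1% of its initial value requires
     $$\gamma^{k} \le 0.01 \iff k \ge \frac{\log 0.01}{\log 0.9} \approx 43.7,$$
     i.e. about 44 iterations; convergence is geometric but slows down as $\gamma \to 1$.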

  4. Policy Iteration: the Idea
     1. Let $\pi_0$ be any stationary policy.
     2. At each iteration $k = 1, 2, \ldots, K$:
        ◮ Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
        ◮ Policy improvement: compute the greedy policy
          $\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_{y} p(y \mid x, a) V^{\pi_k}(y) \big]$.
     3. Return the last policy $\pi_K$.
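
     A minimal tabular sketch of this loop, with the same hypothetical MDP arrays P[x, a, y] and r[x, a] as in the value-iteration sketch; exact policy evaluation is done by solving the linear system $(I - \gamma P^{\pi}) V = r^{\pi}$:

```python
import numpy as np

def policy_iteration(P, r, gamma, K):
    """Alternate exact policy evaluation and greedy policy improvement."""
    N, _, _ = P.shape
    pi = np.zeros(N, dtype=int)          # pi_0: any stationary policy
    for _ in range(K):
        # Policy evaluation: solve (I - gamma * P^pi) V = r^pi
        P_pi = P[np.arange(N), pi]       # shape (N, N)
        r_pi = r[np.arange(N), pi]       # shape (N,)
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy w.r.t. V^{pi_k}
        pi_new = (r + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):   # greedy policy unchanged: pi is optimal
            break
        pi = pi_new
    return pi
```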

  5. Policy Iteration: the Guarantees
     The policy iteration algorithm generates a sequence of policies with non-decreasing
     performance, $V^{\pi_{k+1}} \ge V^{\pi_k}$, and it converges to $\pi^*$ in a finite number of iterations.
     Problem: what if $V_k \neq V^{\pi_k}$?

  6. Sources of Error
     ◮ Approximation error. If $X$ is large or continuous, value functions $V$ cannot be
       represented exactly ⇒ use an approximation space $\mathcal{F}$.
     ◮ Estimation error. If the reward $r$ and dynamics $p$ are unknown, the Bellman operators
       $\mathcal{T}$ and $\mathcal{T}^\pi$ cannot be computed exactly ⇒ estimate the Bellman operators from samples.

  7. In This Lecture
     ◮ Infinite-horizon setting with discount factor $\gamma$.
     ◮ Study the impact of the approximation error.
     ◮ Study the impact of the estimation error in the next lecture.

  8. Outline
     ◮ Performance Loss
     ◮ Approximate Value Iteration
     ◮ Approximate Policy Iteration

  9. From Approximation Error to Performance Loss
     Question: if $V$ is an approximation of the optimal value function $V^*$ with an error
     $\text{error} = \| V - V^* \|$, how does it translate into the (loss of) performance of the greedy policy
     $\pi(x) \in \arg\max_{a \in A} \sum_{y} p(y \mid x, a) \big[ r(x, a, y) + \gamma V(y) \big]$,
     i.e. $\text{performance loss} = \| V^* - V^\pi \|$?

  10. From Approximation Error to Performance Loss
      Proposition
      Let $V \in \mathbb{R}^N$ be an approximation of $V^*$ and $\pi$ its corresponding greedy policy. Then
      $$\underbrace{\| V^* - V^\pi \|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{1-\gamma} \underbrace{\| V^* - V \|_\infty}_{\text{approx. error}}.$$
      Furthermore, there exists $\epsilon > 0$ such that if $\| V - V^* \|_\infty \le \epsilon$, then $\pi$ is optimal.
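
      As an illustrative instance of the bound (arbitrary numbers, not from the slides): with $\gamma = 0.9$ and $\| V^* - V \|_\infty = 0.1$,
      $$\| V^* - V^\pi \|_\infty \le \frac{2 \cdot 0.9}{1 - 0.9} \cdot 0.1 = 1.8,$$
      so the performance loss can be an order of magnitude larger than the approximation error when $\gamma$ is close to 1.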

  11. From Approximation Error to Performance Loss
      Proof.
      Using $\mathcal{T} V^* = V^*$, $\mathcal{T}^\pi V^\pi = V^\pi$, and $\mathcal{T}^\pi V = \mathcal{T} V$ (since $\pi$ is greedy w.r.t. $V$):
      $\| V^* - V^\pi \|_\infty \le \| \mathcal{T} V^* - \mathcal{T}^\pi V \|_\infty + \| \mathcal{T}^\pi V - \mathcal{T}^\pi V^\pi \|_\infty$
      $\le \| \mathcal{T} V^* - \mathcal{T} V \|_\infty + \gamma \| V - V^\pi \|_\infty$
      $\le \gamma \| V^* - V \|_\infty + \gamma \big( \| V - V^* \|_\infty + \| V^* - V^\pi \|_\infty \big)$,
      and rearranging yields $\| V^* - V^\pi \|_\infty \le \frac{2\gamma}{1-\gamma} \| V^* - V \|_\infty$. □

  12. From Approximation Error to Performance Loss
      Question: how do we compute $V$?
      Problem: unlike in standard approximation scenarios (see supervised learning), we only have
      limited access to the target function $V^*$.
      Objective: given an approximation space $\mathcal{F}$, compute an approximation $V$ which is as close
      as possible to the best approximation of $V^*$ in $\mathcal{F}$, i.e.
      $V \approx \arg\inf_{f \in \mathcal{F}} \| V^* - f \|$.

  13. Outline
      ◮ Performance Loss
      ◮ Approximate Value Iteration
      ◮ Approximate Policy Iteration

  14. Approximate Value Iteration: the Idea
      Let $\mathcal{A}$ be an approximation operator.
      1. Let $V_0$ be any vector in $\mathbb{R}^N$.
      2. At each iteration $k = 1, 2, \ldots, K$, compute $V_{k+1} = \mathcal{A} \mathcal{T} V_k$.
      3. Return the greedy policy
         $\pi_K(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_{y} p(y \mid x, a) V_K(y) \big]$.
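
      A schematic sketch of this loop on a toy finite MDP (the arrays P[x, a, y], r[x, a] and the state-feature matrix Phi[x, j] are hypothetical placeholders). Purely for illustration, the approximation operator $\mathcal{A}$ is taken to be an ordinary least-squares fit onto a linear feature space rather than the $L_\infty$ projection introduced on the next slide:

```python
import numpy as np

def approximate_value_iteration(P, r, Phi, gamma, K):
    """V_{k+1} = A T V_k, with A = least-squares projection onto span(Phi)."""
    V = np.zeros(P.shape[0])                              # V_0
    for _ in range(K):
        TV = (r + gamma * P @ V).max(axis=1)              # exact Bellman backup T V_k
        coef, *_ = np.linalg.lstsq(Phi, TV, rcond=None)   # approximation step
        V = Phi @ coef                                    # V_{k+1} = A T V_k
    return (r + gamma * P @ V).argmax(axis=1)             # greedy policy w.r.t. V_K
```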

  15. Approximate Value Iteration: the Idea
      Let $\mathcal{A} = \Pi_\infty$ be the projection operator in $L_\infty$-norm onto $\mathcal{F}$, which corresponds to
      $V_{k+1} = \Pi_\infty \mathcal{T} V_k = \arg\inf_{V \in \mathcal{F}} \| \mathcal{T} V_k - V \|_\infty$.

  16. Approximate Value Iteration: convergence
      Proposition
      The projection $\Pi_\infty$ is a non-expansion and the joint operator $\Pi_\infty \mathcal{T}$ is a contraction.
      Then there exists a unique fixed point $\tilde{V} = \Pi_\infty \mathcal{T} \tilde{V}$, which guarantees the
      convergence of AVI.
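
      A sketch of the argument behind the proposition (standard reasoning, not spelled out on the slide): for any $V_1, V_2$,
      $$\| \Pi_\infty \mathcal{T} V_1 - \Pi_\infty \mathcal{T} V_2 \|_\infty \le \| \mathcal{T} V_1 - \mathcal{T} V_2 \|_\infty \le \gamma \| V_1 - V_2 \|_\infty,$$
      where the first inequality uses the non-expansion of $\Pi_\infty$ and the second the contraction of $\mathcal{T}$; hence $\Pi_\infty \mathcal{T}$ is a $\gamma$-contraction and Banach's fixed-point theorem gives the unique fixed point $\tilde{V}$.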

  17. Approximate Value Iteration: performance loss
      Proposition (Bertsekas & Tsitsiklis, 1996)
      Let $V_K$ be the function returned by AVI after $K$ iterations and $\pi_K$ its corresponding greedy policy. Then
      $$\| V^* - V^{\pi_K} \|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \underbrace{\max_{0 \le k < K} \| \mathcal{T} V_k - \mathcal{A} \mathcal{T} V_k \|_\infty}_{\text{worst approx. error}} + \underbrace{\frac{2\gamma^{K+1}}{1-\gamma} \| V^* - V_0 \|_\infty}_{\text{initial error}}.$$

  18. Approximate Value Iteration: performance loss
      Proof.
      Let $\varepsilon = \max_{0 \le k < K} \| \mathcal{T} V_k - \mathcal{A} \mathcal{T} V_k \|_\infty$. For any $0 \le k < K$ we have
      $\| V^* - V_{k+1} \|_\infty \le \| \mathcal{T} V^* - \mathcal{T} V_k \|_\infty + \| \mathcal{T} V_k - V_{k+1} \|_\infty \le \gamma \| V^* - V_k \|_\infty + \varepsilon$,
      then
      $\| V^* - V_K \|_\infty \le (1 + \gamma + \cdots + \gamma^{K-1}) \varepsilon + \gamma^K \| V^* - V_0 \|_\infty \le \frac{1}{1-\gamma} \varepsilon + \gamma^K \| V^* - V_0 \|_\infty$.
      Since from Proposition 1 we have $\| V^* - V^{\pi_K} \|_\infty \le \frac{2\gamma}{1-\gamma} \| V^* - V_K \|_\infty$, we obtain
      $\| V^* - V^{\pi_K} \|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \varepsilon + \frac{2\gamma^{K+1}}{1-\gamma} \| V^* - V_0 \|_\infty$. □

  19. Fitted Q-iteration with linear approximation
      Assumption: access to a generative model: given a state $x$ and an action $a$, it returns a
      reward $r(x, a)$ and a next state $y \sim p(\cdot \mid x, a)$.
      Idea: work with $Q$-functions and linear spaces.
      ◮ $Q^*$ is the unique fixed point of $\mathcal{T}$ defined over $X \times A$ as
        $\mathcal{T} Q(x, a) = \sum_{y} p(y \mid x, a) \big[ r(x, a, y) + \gamma \max_{b} Q(y, b) \big]$.
      ◮ $\mathcal{F}$ is a space defined by $d$ features $\phi_1, \ldots, \phi_d : X \times A \to \mathbb{R}$ as
        $\mathcal{F} = \big\{ Q_\alpha(x, a) = \sum_{j=1}^{d} \alpha_j \phi_j(x, a),\ \alpha \in \mathbb{R}^d \big\}$.
      ⇒ At each iteration compute $Q_{k+1} = \Pi_\infty \mathcal{T} Q_k$.
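
      A small sketch of the linear parameterization above; the feature map phi is a hypothetical placeholder returning a vector in $\mathbb{R}^d$:

```python
import numpy as np

def q_alpha(phi, alpha, x, a):
    """Q_alpha(x, a) = sum_j alpha_j * phi_j(x, a), i.e. a dot product with the features."""
    return phi(x, a) @ alpha

def greedy_action(phi, alpha, x, actions):
    """arg max_a Q_alpha(x, a) over a finite action set; used for the greedy policy."""
    return max(actions, key=lambda a: q_alpha(phi, alpha, x, a))
```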

  20. Fitted Q-iteration with linear approximation
      ⇒ At each iteration compute $Q_{k+1} = \Pi_\infty \mathcal{T} Q_k$.
      Problems:
      ◮ the $\Pi_\infty$ operator cannot be computed efficiently;
      ◮ the Bellman operator $\mathcal{T}$ is often unknown.

  21. Fitted Q-iteration with linear approximation
      Problem: the $\Pi_\infty$ operator cannot be computed efficiently.
      Let $\mu$ be a distribution over $X$. We use a projection in $L_{2,\mu}$-norm onto the space $\mathcal{F}$:
      $Q_{k+1} = \arg\min_{Q \in \mathcal{F}} \| Q - \mathcal{T} Q_k \|_\mu^2$.

  22. Fitted Q-iteration with linear approximation
      Problem: the Bellman operator $\mathcal{T}$ is often unknown.
      1. Sample $n$ state-action pairs $(X_i, A_i)$ with $X_i \sim \mu$ and $A_i$ random.
      2. Simulate $Y_i \sim p(\cdot \mid X_i, A_i)$ and $R_i = r(X_i, A_i, Y_i)$ with the generative model.
      3. Estimate $\mathcal{T} Q_k(X_i, A_i)$ with $Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)$ (an unbiased
         estimate, since $\mathbb{E}[Z_i \mid X_i, A_i] = \mathcal{T} Q_k(X_i, A_i)$).

  23. Fitted Q-iteration with linear approximation
      At each iteration $k$ compute $Q_{k+1}$ as
      $$Q_{k+1} = \arg\min_{Q_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big( Q_\alpha(X_i, A_i) - Z_i \big)^2.$$
      ⇒ Since $Q_\alpha$ is a linear function of $\alpha$, this is a simple quadratic minimization problem
      with a closed-form solution.
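
      An end-to-end sketch of the resulting algorithm (slides 19-23). The helpers are assumptions standing in for the problem at hand: sample_state() draws $X_i \sim \mu$, step(x, a) plays the generative model and returns a next state and reward, and phi(x, a) returns the $d$-dimensional feature vector; the least-squares call implements the closed-form regression above:

```python
import numpy as np

def fitted_q_iteration(sample_state, step, phi, actions, d, gamma, n, K, seed=0):
    """Linear fitted Q-iteration: regress alpha on sampled Bellman targets at each iteration."""
    rng = np.random.default_rng(seed)
    alpha = np.zeros(d)                                   # Q_0 = 0
    for _ in range(K):
        # 1.-2. Sample state-actions and simulate one transition each with the generative model
        X = [sample_state() for _ in range(n)]
        A = [actions[rng.integers(len(actions))] for _ in range(n)]
        transitions = [step(x, a) for x, a in zip(X, A)]  # (Y_i, R_i) pairs
        # 3. Unbiased targets Z_i = R_i + gamma * max_a Q_k(Y_i, a)
        Z = np.array([r_i + gamma * max(phi(y_i, b) @ alpha for b in actions)
                      for (y_i, r_i) in transitions])
        # Quadratic minimization: alpha_{k+1} = arg min (1/n) sum_i (phi(X_i, A_i).alpha - Z_i)^2
        Phi = np.array([phi(x, a) for x, a in zip(X, A)])
        alpha, *_ = np.linalg.lstsq(Phi, Z, rcond=None)
    # Greedy policy w.r.t. the final Q_alpha
    return lambda x: max(actions, key=lambda a: phi(x, a) @ alpha)
```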

  24. Other implementations
      ◮ k-nearest neighbours
      ◮ Regularized linear regression with $L_1$ or $L_2$ regularization
      ◮ Neural networks
      ◮ Support vector machines
