

1. Some Theoretical Aspects of Reinforcement Learning
CS 285. Instructor: Aviral Kumar, UC Berkeley

2. What Will We Discuss Today?
A brief introduction to some theoretical aspects of RL: in particular, error/suboptimality analysis of RL algorithms, an understanding of regret, and function approximation.
• Notions of Convergence in RL, Assumptions and Preliminaries
• Optimization Error in RL and Analyses of Fitted Q-Iteration Algorithms
• Regret Analyses of RL Algorithms: An Introduction
• RL with Function Approximation: When can we still obtain convergent algorithms?
This is not at all an exhaustive coverage of topics in RL theory; check out the resources listed on the last slide of this lecture.

3. Metrics Used to Evaluate RL Methods
Sample complexity: typically used to measure how easy it is to infer the optimal policy assuming no exploration bottlenecks (e.g., in offline RL). How many transitions/episodes do I need to obtain a good policy?
If $N = O\big(|S|, |A|, \mathrm{poly}\big(\tfrac{1}{1-\gamma}\big)\big)$, then $\max_{s,a} |Q^\pi(s, a) - \hat{Q}^\pi(s, a)| \le \varepsilon$.
Regret: typically used to measure how good an exploration scheme is. For a sequence of policies $\pi_0, \pi_1, \pi_2, \cdots, \pi_N$,
$$\mathrm{Reg}(N) = \sum_{i=1}^{N} \Big( \mathbb{E}_{s_0 \sim \rho}[V^*(s_0)] - \mathbb{E}_{s_0 \sim \rho}[V^{\pi_i}(s_0)] \Big).$$
We would like $\mathrm{Reg}(N) = O(\sqrt{N})$.
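A minimal sketch of how this quantity would be tallied in code, assuming hypothetical evaluation helpers `optimal_value()` and `policy_value(pi)` that return $\mathbb{E}_{s_0 \sim \rho}[V^*(s_0)]$ and $\mathbb{E}_{s_0 \sim \rho}[V^{\pi_i}(s_0)]$ (neither is defined in the lecture):

```python
# Illustrative sketch: cumulative regret over a sequence of policies pi_1..pi_N.
# `optimal_value` and `policy_value` are hypothetical evaluation helpers.
def cumulative_regret(policies, optimal_value, policy_value):
    v_star = optimal_value()  # E_{s0 ~ rho}[V*(s0)]
    return sum(v_star - policy_value(pi) for pi in policies)
```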

4. Assumptions Used in RL Analyses
We can break an RL algorithm down into two parts:
- the exploration part
- the learning part: given data from the exploration policy, we should be able to learn from it
Can we analyze these separately? To remove the exploration aspect, perform the analysis under the "generative model" assumption: access to sampling $s' \sim P(\cdot \mid s, a)$ for any $(s, a)$.
Suppose we can query the true dynamics model of the MDP for each $(s, a)$ pair $N$ times and construct an empirical dynamics model (see the sketch below):
$$\hat{P}(s' \mid s, a) = \frac{\#(s, a, s')}{N}.$$
Goal: approximate the Q-function or the value function. How does the approximation error of this model translate to errors in the value function?
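A minimal sketch of this construction, assuming a hypothetical `sample_next_state(s, a)` that implements generative-model access to the true MDP:

```python
import numpy as np

def empirical_model(sample_next_state, num_states, num_actions, N):
    """Build P_hat(s'|s,a) = #(s,a,s') / N by querying the generative model
    N times for every (s, a) pair. `sample_next_state` is a hypothetical
    oracle returning s' ~ P(.|s,a)."""
    counts = np.zeros((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            for _ in range(N):
                counts[s, a, sample_next_state(s, a)] += 1
    return counts / N  # empirical transition probabilities
```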

5. Preliminaries: Concentration
Concentration inequalities say that the average over samples gets close to the true mean as the number of samples grows, and they quantify how fast. More complex variants exist; we will use this kind of bound to obtain a worst-case guarantee under the generative model.
Lemmas from: RL Theory Textbook (Draft). Agarwal, Jiang, Kakade, Sun. https://rltheorybook.github.io/
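The exact inequality is not reproduced in this transcript; for reference, a standard form of Hoeffding's inequality, which is the kind of concentration bound invoked on the later slides, states that for i.i.d. $X_1, \ldots, X_m \in [0, b]$, with probability at least $1 - \delta$,
$$\left| \frac{1}{m} \sum_{i=1}^{m} X_i - \mathbb{E}[X_1] \right| \;\le\; b \sqrt{\frac{\log(2/\delta)}{2m}}.$$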

6. Part 1: Sampling/Optimization Error in RL
Goal: how does error in training translate to error in the value function? We will analyze this optimization error in two settings:
(1) the generative model, and (2) fitted Q-iteration.
We want results of the form:
- if $\|\hat{P}(s' \mid s, a) - P(s' \mid s, a)\|_1 \le \varepsilon$ then $\|Q(s, a) - \hat{Q}(s, a)\|_1 \le \delta$
- if $\|Q(s, a) - \hat{T}Q(s, a)\|_\infty \le \varepsilon$ then $\|Q(s, a) - \hat{Q}(s, a)\|_\infty \le \delta$
Bellman operator:
$$TQ(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(s' \mid s, a)}\Big[\max_{a'} Q(s', a')\Big]$$
"Empirical" Bellman operator, constructed using transition samples observed by sampling the MDP:
$$\hat{T}Q(s, a) = \hat{r}(s, a) + \gamma \, \mathbb{E}_{s' \sim \hat{P}(s' \mid s, a)}\Big[\max_{a'} Q(s', a')\Big]$$
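As an illustration of the empirical backup, here is a sketch under the tabular assumption, with `P_hat` and `r_hat` estimated as described above:

```python
import numpy as np

def empirical_bellman_backup(Q, r_hat, P_hat, gamma):
    """Apply the empirical Bellman optimality operator T_hat once.
    Q: array [S, A]; r_hat: array [S, A]; P_hat: array [S, A, S]."""
    v_next = Q.max(axis=1)                 # max_a' Q(s', a') for every s'
    return r_hat + gamma * P_hat @ v_next  # (T_hat Q)(s, a)
```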

7. Sampling Error with Generative Model
1. Estimate the empirical model $\hat{P}(s' \mid s, a) = \frac{\#(s, a, s')}{N}$.
2. For a given policy $\pi$, plan under this dynamics model to obtain the Q-function $\hat{Q}^\pi$.
First step: bound the difference between the learned and true dynamics models. Using concentration inequalities, with probability at least $1 - \delta$ the empirical dynamics model and the actual dynamics model are close (here $m$ is the number of samples used to estimate $P(s' \mid s, a)$).
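Step 2 above ("plan under this dynamics model") amounts to solving the Bellman equation for $\pi$ under $\hat{P}$; a minimal tabular sketch (illustrative, not the lecture's code) is:

```python
import numpy as np

def q_pi_under_model(r, P_hat, pi, gamma):
    """Exact policy evaluation under the empirical model:
    Q^pi = (I - gamma * P_hat^pi)^{-1} r, written in (s, a) vector form.
    r: [S, A]; P_hat: [S, A, S]; pi: [S, A] (row-stochastic policy)."""
    S, A = r.shape
    # P^pi[(s,a), (s',a')] = P_hat(s'|s,a) * pi(a'|s')
    P_pi = (P_hat[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
    return q.reshape(S, A)
```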

8. Sampling Error with Generative Model
Second step: compute how the dynamics model affects the Q-function.
1. Express $Q^\pi$ in vector form.
2. Express the difference between the two vectors in a more closed-form version and obtain an expression containing $(\hat{P} - P)$.
Note that the Q-function depends on the dynamics model $P(s' \mid s, a)$ via a non-linear transformation.
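The actual expressions are not legible in this transcript; a standard way to carry out these two steps (an assumed reconstruction in the usual "simulation lemma" style, not necessarily the exact algebra on the slide) is:
$$Q^\pi = r + \gamma P^\pi Q^\pi \;\Rightarrow\; Q^\pi = (I - \gamma P^\pi)^{-1} r, \qquad \hat{Q}^\pi = (I - \gamma \hat{P}^\pi)^{-1} r,$$
so that
$$\hat{Q}^\pi - Q^\pi = (I - \gamma \hat{P}^\pi)^{-1}\big[(I - \gamma P^\pi) - (I - \gamma \hat{P}^\pi)\big] Q^\pi = \gamma \, (I - \gamma \hat{P}^\pi)^{-1} (\hat{P} - P) V^\pi,$$
which makes the dependence on $(\hat{P} - P)$ explicit.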

9. Sampling Error with Generative Model
Third step: understand how the error in the Q-function depends on the error in the model. Using the fact that $\|P^\pi\|_\infty \le 1$ and the triangle inequality, we obtain $\|w\|_\infty \le \|v\|_\infty / (1 - \gamma)$.
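The definitions of $w$ and $v$ are not recoverable from the transcript; under the natural reading $w = (I - \gamma P^\pi)^{-1} v$ (with $P^\pi$ any row-stochastic matrix, so $\|P^\pi\|_\infty \le 1$), the step is the Neumann-series bound
$$\|w\|_\infty = \Big\| \sum_{t=0}^{\infty} (\gamma P^\pi)^t v \Big\|_\infty \le \sum_{t=0}^{\infty} \gamma^t \, \|P^\pi\|_\infty^t \, \|v\|_\infty \le \frac{\|v\|_\infty}{1 - \gamma}.$$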

10. Sampling Error with Generative Model
Final step: completing the proof. Bound the max element of the product by the product of max elements, and assume $R_{\max} = 1$. Now use the previous relation to obtain
$$\|Q^\pi - \hat{Q}^\pi\|_\infty \le \frac{c\,\gamma}{(1-\gamma)^2} \sqrt{\frac{|S| \log(1/\delta)}{m}}.$$
If we want at most $\varepsilon$ error in $Q^\pi$, we can now compute the minimum number of samples $m$ needed for this.
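Solving the final bound for $m$ (a straightforward rearrangement, not written out on the slide):
$$\frac{c\,\gamma}{(1-\gamma)^2}\sqrt{\frac{|S| \log(1/\delta)}{m}} \le \varepsilon \;\iff\; m \ge \frac{c^2 \gamma^2 \, |S| \log(1/\delta)}{(1-\gamma)^4 \, \varepsilon^2}.$$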

11. Proof Takeaways and Summary
$$\|Q^\pi - \hat{Q}^\pi\|_\infty \le \frac{c\,\gamma}{(1-\gamma)^2} \sqrt{\frac{|S| \log(1/\delta)}{m}}$$
• A small error in estimating the dynamics model implies a small error in the Q-function.
• However, the error "compounds": note the $(1-\gamma)^2$ factor in the denominator of the bound.
• The more samples we collect, the better our estimate will be, but sadly samples aren't free!
How does optimization error manifest in model-free variants (e.g., fitted Q-iteration)?

12. Part 2: Optimization Error in FQI
Fitted Q-iteration runs a sequence of backups, starting from an initial Q-value $Q_0$, by minimizing a mean-squared error:
$$Q_{k+1} \leftarrow \arg\min_{Q} \|Q - \hat{T} Q_k\|_2^2$$
If we use $T$ instead of $\hat{T}$ and $\|Q_{k+1} - T Q_k\| = 0$, then FQI converges to the optimal Q-function $Q^*$. Which sources of error are we considering here?
- $\hat{T}$ is inexact: "sampling error" due to limited samples
- the Bellman errors $\|Q_{k+1} - T Q_k\|$ may not be 0
$$TQ(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(s' \mid s, a)}\Big[\max_{a'} Q(s', a')\Big]$$
$$\hat{T}Q(s, a) = \hat{r}(s, a) + \gamma \, \mathbb{E}_{s' \sim \hat{P}(s' \mid s, a)}\Big[\max_{a'} Q(s', a')\Big]$$
A tabular sketch of this loop is shown below.
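A minimal tabular sketch of fitted Q-iteration (illustrative only; in the tabular case the regression step is solved exactly, so only the sampling error in `T_hat` remains):

```python
import numpy as np

def fitted_q_iteration(r_hat, P_hat, gamma, num_iters, num_states, num_actions):
    """Run FQI with the empirical Bellman operator T_hat.
    In the tabular case, argmin_Q ||Q - T_hat Q_k||_2^2 is exactly Q_{k+1} = T_hat Q_k."""
    Q = np.zeros((num_states, num_actions))  # initial Q-value Q_0
    for _ in range(num_iters):
        target = r_hat + gamma * P_hat @ Q.max(axis=1)  # T_hat Q_k
        Q = target  # exact regression fit in the tabular setting
    return Q
```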

13. Optimization Error in Fitted Q-Iteration
First step: bound the difference between the empirical and actual Bellman backups. By the triangle inequality, bound each term separately:
$$|\hat{T}Q(s, a) - TQ(s, a)| \le |\hat{r}(s, a) - r(s, a)| + \gamma \left| \mathbb{E}_{s' \sim \hat{P}(\cdot \mid s, a)}\Big[\max_{a'} Q(s', a')\Big] - \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a'} Q(s', a')\Big] \right|$$
Concentration of the reward: directly apply Hoeffding's inequality, which bounds the reward term by $2 R_{\max} \sqrt{\frac{\log(1/\delta)}{2m}}$.
Concentration of the dynamics: write the second term in vector form,
$$\left| \sum_{s'} \big(\hat{P}(s' \mid s, a) - P(s' \mid s, a)\big) \max_{a'} Q(s', a') \right| \le \|\hat{P}(\cdot \mid s, a) - P(\cdot \mid s, a)\|_1 \, \|Q\|_\infty,$$
since a sum of products is at most the sum of products of absolute values, and Q-values are bounded by their $\infty$-norm.

14. Optimization Error in Fitted Q-Iteration
Combining the bounds on the previous slide and taking a max over $(s, a)$, we get:
$$\|\hat{T}Q - TQ\|_\infty \le 2 R_{\max} c_1 \sqrt{\frac{\log(|S||A|/\delta)}{m}} + c_2 \|Q\|_\infty \sqrt{\frac{|S| \log(1/\delta)}{m}}$$
Second step: how does the error in each fitting iteration affect optimality?
Let's say we incur error $\varepsilon_k$ in each fitting step of FQI, i.e., $\|Q_{k+1} - T Q_k\|_\infty \le \varepsilon_k$. Then what can we say about $\|Q_k - Q^*\|_\infty$?
$$\|Q_k - Q^*\|_\infty \le \|T Q_{k-1} + (Q_k - T Q_{k-1}) - T Q^*\|_\infty = \|(T Q_{k-1} - T Q^*) + (Q_k - T Q_{k-1})\|_\infty$$
$$\le \|T Q_{k-1} - T Q^*\|_\infty + \|Q_k - T Q_{k-1}\|_\infty \le \gamma \|Q_{k-1} - Q^*\|_\infty + \varepsilon_k$$

15. Optimization Error in Fitted Q-Iteration
$$\|Q_k - Q^*\|_\infty \le \gamma \|Q_{k-1} - Q^*\|_\infty + \varepsilon_k \le \gamma^2 \|Q_{k-2} - Q^*\|_\infty + \gamma \varepsilon_{k-1} + \varepsilon_k \le \cdots \le \gamma^k \|Q_0 - Q^*\|_\infty + \sum_{j} \gamma^j \varepsilon_{k-j}$$
The error from previous iterations "compounds", "propagates", etc. Now consider a large number of fitting iterations in FQI (so $k \to \infty$):
$$\lim_{k \to \infty} \|Q_k - Q^*\|_\infty \le 0 + \lim_{k \to \infty} \sum_{j} \gamma^j \varepsilon_{k-j} \le \Big( \sum_{j=0}^{\infty} \gamma^j \Big) \|\varepsilon\|_\infty = \frac{\|\varepsilon\|_\infty}{1 - \gamma}$$
We pay a price for each error term, and in the worst case the total error is scaled by the $1/(1-\gamma)$ factor.
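A quick numerical sanity check of this bound (illustrative only): run the discounted error sum with arbitrary per-iteration errors and compare against $\|\varepsilon\|_\infty / (1 - \gamma)$.

```python
import numpy as np

# Illustrative check: the accumulated error sum_j gamma^j * eps_{k-j}
# never exceeds ||eps||_inf / (1 - gamma).
gamma = 0.9
eps = np.random.uniform(0.0, 0.1, size=1000)   # arbitrary per-iteration errors
accumulated = sum(gamma**j * eps[-(j + 1)] for j in range(len(eps)))
print(accumulated, "<=", eps.max() / (1 - gamma))
```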

16. Optimization Error in Fitted Q-Iteration: Completing the Proof
So far, we have seen how per-iteration Bellman errors can accumulate into error against $Q^*$. What makes up the total Bellman error in each iteration?
- optimization error
- "sampling error" due to limited data
$$\varepsilon_k: \quad \|Q_k - T Q_{k-1}\|_\infty = \|Q_k - \hat{T} Q_{k-1} + \hat{T} Q_{k-1} - T Q_{k-1}\|_\infty \le \|Q_k - \hat{T} Q_{k-1}\|_\infty + \|\hat{T} Q_{k-1} - T Q_{k-1}\|_\infty$$
Optimization error: how easily we can minimize the Bellman error. Sampling error: depends on the number of times we see each $(s, a)$.
$$\lim_{k \to \infty} \|Q_k - Q^*\|_\infty \le \frac{1}{1 - \gamma} \max_k \|Q_k - T Q_{k-1}\|_\infty \le \cdots$$

17. Proof Takeaways and Summary
• Error compounds with FQI or DQN-style methods: this is especially a problem in offline RL settings, where the "sampling error" component is also quite high.
• A stringent requirement of these bounds is that they depend directly on the $\infty$-norm of the error in the Q-function: can we ever practically bound the error at the worst state-action pair? Mostly not, since we can't even enumerate the state or action space! Can we remove the dependency on the $\infty$-norm? Yes! We can derive similar results for other data distributions $\mu$ and $L_p$ norms:
$$\|Q_k - Q^*\|_{\mu} = \Big( \mathbb{E}_{s, a \sim \mu(s, a)}\big[ |Q_k(s, a) - Q^*(s, a)|^p \big] \Big)^{1/p}$$
• So far we've looked at the generative model setting, where we have oracle MDP access to compute an approximate dynamics model. What happens in the substantially harder setting without this access, where we need exploration strategies? Coming up next…
