Finite-Sample Analysis in Reinforcement Learning Mohammad Ghavamzadeh INRIA Lille – Nord Europe, Team SequeL
Outline
1. Introduction to RL and DP
2. Approximate Dynamic Programming (AVI & API)
3. How does Statistical Learning Theory come into the picture?
4. Error Propagation (AVI & API error propagation)
5. An AVI Algorithm (Fitted Q-Iteration): error at each iteration, final performance bound of FQI
6. An API Algorithm (Least-Squares Policy Iteration): error at each iteration (LSTD error), final performance bound of LSPI
7. Discussion
Sequential Decision-Making under Uncertainty
- Move around in the physical world (e.g. driving, navigation)
- Play and win a game
- Retrieve information over the web
- Medical diagnosis and treatment
- Maximize the throughput of a factory
- Optimize the performance of a rescue team
Reinforcement Learning (RL)
RL: a class of learning problems in which an agent interacts with a dynamic, stochastic, and incompletely known environment.
Goal: learn an action-selection strategy, or policy, to optimize some measure of its long-term performance.
Interaction: modeled as an MDP or a POMDP.
Markov Decision Process (MDP)
An MDP M is a tuple ⟨X, A, r, p, γ⟩:
- The state space X is a bounded closed subset of R^d.
- The set of actions A is finite (|A| < ∞).
- The reward function r : X × A → R is bounded by R_max.
- The transition model p(·|x, a) is a distribution over X.
- γ ∈ (0, 1) is a discount factor.
Policy: a mapping from states to actions, π(x) ∈ A.
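To make these objects concrete, here is a minimal sketch (not part of the original slides) of a finite MDP represented with NumPy arrays; the names P, R, and pi are hypothetical, and a small finite state space stands in for the continuous X of the definition.

```python
import numpy as np

# A small, randomly generated finite MDP, purely for illustration.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# P[x, a, y] = p(y | x, a): each (x, a) row is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# R[x, a] = r(x, a), bounded here by R_max = 1.
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

# A deterministic policy maps each state to an action: pi(x) ∈ A.
pi = rng.integers(n_actions, size=n_states)
```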
Value Function
For a policy π:
- Value function V^π : X → R,
  V^π(x) = E[ Σ_{t=0}^∞ γ^t r(X_t, π(X_t)) | X_0 = x ]
- Action-value function Q^π : X × A → R,
  Q^π(x, a) = E[ Σ_{t=0}^∞ γ^t r(X_t, A_t) | X_0 = x, A_0 = a ]
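As a sanity check on this definition, the discounted return can be estimated by Monte Carlo rollouts. The sketch below (an illustration, not from the slides) reuses the same kind of randomly generated finite MDP and truncates the infinite sum at a finite horizon.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(y | x, a)
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r(x, a)
pi = rng.integers(n_actions, size=n_states)                       # deterministic policy

def mc_value(x0, n_rollouts=2000, horizon=200):
    """Monte Carlo estimate of V^pi(x0): average truncated discounted return."""
    returns = []
    for _ in range(n_rollouts):
        x, g, disc = x0, 0.0, 1.0
        for _ in range(horizon):
            a = pi[x]
            g += disc * R[x, a]
            disc *= gamma
            x = rng.choice(n_states, p=P[x, a])
        returns.append(g)
    return float(np.mean(returns))

print(mc_value(x0=0))  # ≈ V^pi(0)
```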
Notation: Bellman Operator
Bellman operator for policy π, T^π : B_V(X; V_max) → B_V(X; V_max):
  (T^π V)(x) = r(x, π(x)) + γ ∫_X p(dy | x, π(x)) V(y)
V^π is the unique fixed point of the Bellman operator.
The action-value function Q^π is defined as
  Q^π(x, a) = r(x, a) + γ ∫_X p(dy | x, a) V^π(y)
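A minimal sketch of T^π on a finite MDP (same hypothetical arrays P, R, pi as above), checking numerically that V^π is its fixed point by solving the linear system (I − γ P^π) V = r^π:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(y | x, a)
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r(x, a)
pi = rng.integers(n_actions, size=n_states)

P_pi = P[np.arange(n_states), pi]   # P_pi[x, y] = p(y | x, pi(x))
r_pi = R[np.arange(n_states), pi]   # r_pi[x]    = r(x, pi(x))

def T_pi(V):
    """(T^pi V)(x) = r(x, pi(x)) + gamma * sum_y p(y | x, pi(x)) V(y)."""
    return r_pi + gamma * P_pi @ V

# V^pi is the unique fixed point: in the finite case it solves (I - gamma P_pi) V = r_pi.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
assert np.allclose(T_pi(V_pi), V_pi)

# Q^pi(x, a) = r(x, a) + gamma * sum_y p(y | x, a) V^pi(y)
Q_pi = R + gamma * P @ V_pi
```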
Optimal Value Function and Optimal Policy
Optimal value function: V*(x) = sup_π V^π(x), ∀x ∈ X
Optimal action-value function: Q*(x, a) = sup_π Q^π(x, a), ∀x ∈ X, ∀a ∈ A
A policy π is optimal if V^π(x) = V*(x), ∀x ∈ X.
Notation: Bellman Optimality Operator
Bellman optimality operator T : B_V(X; V_max) → B_V(X; V_max):
  (T V)(x) = max_{a∈A} [ r(x, a) + γ ∫_X p(dy | x, a) V(y) ]
V* is the unique fixed point of the Bellman optimality operator.
The optimal action-value function Q* is defined as
  Q*(x, a) = r(x, a) + γ ∫_X p(dy | x, a) V*(y)
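And a matching sketch of the optimality operator T on the same kind of finite MDP (hypothetical arrays P and R as before):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(y | x, a)
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r(x, a)

def T(V):
    """(T V)(x) = max_a [ r(x, a) + gamma * sum_y p(y | x, a) V(y) ]."""
    return np.max(R + gamma * P @ V, axis=1)

def Q_from_V(V):
    """Q(x, a) = r(x, a) + gamma * sum_y p(y | x, a) V(y); applied to V* this gives Q*."""
    return R + gamma * P @ V
```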
Properties of Bellman Operators
Monotonicity: for all V_1, V_2 ∈ B_V(X; V_max), if V_1 ≤ V_2 component-wise, then
  T^π V_1 ≤ T^π V_2 and T V_1 ≤ T V_2.
Max-norm contraction:
  ||T^π V_1 − T^π V_2||_∞ ≤ γ ||V_1 − V_2||_∞
  ||T V_1 − T V_2||_∞ ≤ γ ||V_1 − V_2||_∞
Dynamic Programming Algorithms: Value Iteration
- Start with an arbitrary action-value function Q_0.
- At each iteration k: Q_{k+1} = T Q_k.
Convergence: lim_{k→∞} Q_k = Q*, since
  ||Q* − Q_{k+1}||_∞ = ||T Q* − T Q_k||_∞ ≤ γ ||Q* − Q_k||_∞ ≤ γ^{k+1} ||Q* − Q_0||_∞ → 0 as k → ∞.
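A minimal value-iteration sketch over the action-value function, assuming the same randomly generated finite MDP as in the earlier snippets; the geometric contraction on the slide is what guarantees the loop converges.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(y | x, a)
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r(x, a)

def value_iteration(n_iters=200):
    """Exact value iteration: Q_{k+1} = T Q_k, starting from an arbitrary Q_0."""
    Q = np.zeros((n_states, n_actions))      # arbitrary Q_0
    for _ in range(n_iters):
        V = Q.max(axis=1)                    # greedy value at each state
        Q = R + gamma * P @ V                # one application of the optimality operator
    return Q

Q_star = value_iteration()
pi_star = Q_star.argmax(axis=1)              # greedy policy w.r.t. Q*
```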
Dynamic Programming Algorithms: Policy Iteration
- Start with an arbitrary policy π_0.
- At each iteration k:
  Policy evaluation: compute Q^{π_k}.
  Policy improvement: compute the greedy policy w.r.t. Q^{π_k}, π_{k+1}(x) = (G π_k)(x) = argmax_{a∈A} Q^{π_k}(x, a).
Convergence: PI generates a sequence of policies with increasing performance (V^{π_{k+1}} ≥ V^{π_k}) and stops after a finite number of iterations with the optimal policy π*, since
  V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}.
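The corresponding policy-iteration sketch, with exact policy evaluation done by solving the linear system from the Bellman operator slide (again on a hypothetical finite MDP):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(y | x, a)
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r(x, a)

def policy_iteration():
    """Exact PI: evaluate pi_k, then take the greedy policy w.r.t. Q^{pi_k}."""
    pi = np.zeros(n_states, dtype=int)       # arbitrary pi_0
    while True:
        # Policy evaluation: V^pi solves (I - gamma P_pi) V = r_pi.
        P_pi = P[np.arange(n_states), pi]
        r_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: pi_{k+1}(x) = argmax_a Q^{pi_k}(x, a).
        Q = R + gamma * P @ V
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):       # no change: pi is optimal (finite MDP)
            return pi, V
        pi = new_pi

pi_star, V_star = policy_iteration()
```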
Approximate Dynamic Programming
Approximate Dynamic Programming Algorithms: Value Iteration
- Start with an arbitrary action-value function Q_0.
- At each iteration k: Q_{k+1} = T Q_k.
What if Q_{k+1} ≈ T Q_k? Does the contraction ||Q* − Q_{k+1}|| ≤ γ ||Q* − Q_k|| still hold?
Approximate Dynamic Programming Algorithms: Policy Iteration
- Start with an arbitrary policy π_0.
- At each iteration k:
  Policy evaluation: compute Q^{π_k}.
  Policy improvement: compute the greedy policy w.r.t. Q^{π_k}, π_{k+1}(x) = (G π_k)(x) = argmax_{a∈A} Q^{π_k}(x, a).
What if we cannot compute Q^{π_k} exactly and compute Q̂^{π_k} ≈ Q^{π_k} instead? Then π_{k+1}(x) = argmax_{a∈A} Q̂^{π_k}(x, a) ≠ (G π_k)(x), and the improvement V^{π_{k+1}} ≥ V^{π_k} is no longer guaranteed.
Statistical Learning Theory in RL & ADP: Approximate Value Iteration (AVI)
Q_{k+1} ≈ T Q_k: find a function that best approximates T Q_k, i.e. Q = min_f ||f − T Q_k||_µ.
Only noisy observations T̂ Q_k of the target function T Q_k are available.
We minimize the empirical error, Q_{k+1} = Q̂ = min_f ||f − T̂ Q_k||_µ̂, with the target of minimizing the true error Q = min_f ||f − T Q_k||_µ.
Objective: ||Q̂ − T Q_k||_µ ≤ ||Q̂ − Q||_µ (estimation error) + ||Q − T Q_k||_µ (approximation error) should be small; this is a regression problem.
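A minimal sketch of this regression view for AVI, in the spirit of the Fitted Q-Iteration algorithm discussed later in the talk: a synthetic batch of transitions, a hypothetical linear feature map phi, and ordinary least squares as the regressor standing in for min_f ||f − T̂ Q_k||_µ̂. The features, dynamics, and sample sizes are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_samples, n_actions, d = 0.9, 1000, 3, 8

def phi(x, a):
    """Assumed feature map phi(x, a) ∈ R^d: a few state features plus an action indicator."""
    v = np.zeros(d)
    v[:4] = [1.0, x, x * x, np.sin(x)]
    v[4 + a] = 1.0                             # one-hot action block
    return v

# Hypothetical batch of transitions (x_i, a_i, r_i, x_i') drawn from a sampling distribution mu.
X  = rng.uniform(-1.0, 1.0, n_samples)
A  = rng.integers(n_actions, size=n_samples)
Rw = np.cos(X) - 0.1 * A
Xn = np.clip(X + rng.normal(0.0, 0.2, n_samples), -1.0, 1.0)

Phi = np.array([phi(x, a) for x, a in zip(X, A)])

def fqi_iteration(w_k):
    """One AVI/FQI step: regress the empirical targets y_i = r_i + gamma * max_a Q_k(x_i', a)."""
    Q_next = np.array([[phi(xn, a) @ w_k for a in range(n_actions)] for xn in Xn])
    y = Rw + gamma * Q_next.max(axis=1)        # noisy observations of (T Q_k)(x_i, a_i)
    # Least squares = minimizing the empirical error || f - T_hat Q_k ||_{mu_hat}.
    w_next, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w_next

w = np.zeros(d)                                # Q_0 = 0
for _ in range(20):
    w = fqi_iteration(w)                       # Q_{k+1}(x, a) ≈ phi(x, a) @ w
```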
Statistical Learning Theory in RL & ADP: Approximate Policy Iteration (API), policy evaluation
Find a function that best approximates Q^{π_k}, i.e. Q = min_f ||f − Q^{π_k}||_µ.
Only noisy observations Q̂^{π_k} of the target function Q^{π_k} are available.
We minimize the empirical error Q̂ = min_f ||f − Q̂^{π_k}||_µ̂, with the target of minimizing the true error Q = min_f ||f − Q^{π_k}||_µ.
Objective: ||Q̂ − Q^{π_k}||_µ ≤ ||Q̂ − Q||_µ (estimation error) + ||Q − Q^{π_k}||_µ (approximation error) should be small; again a regression problem.
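For the API side, one standard way to produce such a Q̂^{π_k} from samples is LSTD-Q, the evaluation step used by LSPI later in the talk. The sketch below uses the same kind of hypothetical linear features and synthetic batch as the previous snippet; the evaluated policy pi_k and the regularization term are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n_samples, n_actions, d = 0.9, 1000, 3, 8

def phi(x, a):
    """Assumed feature map phi(x, a) ∈ R^d, as in the previous sketch."""
    v = np.zeros(d)
    v[:4] = [1.0, x, x * x, np.sin(x)]
    v[4 + a] = 1.0
    return v

# Hypothetical batch (x_i, a_i, r_i, x_i') and a fixed policy pi_k to evaluate.
X  = rng.uniform(-1.0, 1.0, n_samples)
A  = rng.integers(n_actions, size=n_samples)
Rw = np.cos(X) - 0.1 * A
Xn = np.clip(X + rng.normal(0.0, 0.2, n_samples), -1.0, 1.0)
pi_k = lambda x: 0 if x < 0.0 else 1           # assumed policy, for illustration only

def lstd_q(reg=1e-3):
    """LSTD-Q: solve A w = b with A = sum_i phi_i (phi_i - gamma phi'_i)^T, b = sum_i r_i phi_i."""
    A_mat = reg * np.eye(d)                    # small ridge term for numerical stability
    b = np.zeros(d)
    for x, a, r, xn in zip(X, A, Rw, Xn):
        f  = phi(x, a)
        fn = phi(xn, pi_k(xn))                 # next features under the evaluated policy
        A_mat += np.outer(f, f - gamma * fn)
        b += r * f
    return np.linalg.solve(A_mat, b)

w_hat = lstd_q()                               # Q_hat^{pi_k}(x, a) ≈ phi(x, a) @ w_hat
```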