Non-Stationary Reinforcement Learning
Ruihao Zhu (MIT IDSS)
Joint work with Wang Chi Cheung (NUS) and David Simchi-Levi (MIT)
Epidemic Control

A DM (decision maker) iteratively:
1. Picks a measure to contain the virus.
2. Sees the corresponding outcome.
Goal: minimize the total number of infected cases.

Challenges:
◮ Uncertainty: the effectiveness of each measure is unknown.
◮ Bandit feedback: no feedback for un-chosen measures.
◮ Non-stationarity: the virus might mutate throughout.
Epidemic Control

The DM's action can have long-term impact.
◮ A quarantine lockdown stems the spread of the virus to other places, but also delays key supplies from getting in.
Model

Model epidemic control by a Markov decision process (MDP) (Nowzari et al. 15, Kiss et al. 17). For each time step t = 1, ..., T:
◮ Observe the current state s_t ∈ {1, 2}, and receive a reward; for example, r(1) = 1 and r(2) = 0.
◮ Pick an action a_t ∈ {B, G}, and transition to the next state s_{t+1} ∼ p_t(·|s_t, a_t) (unknown).
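As a concrete illustration, here is a minimal simulation sketch of this two-state, two-action MDP. The drifting transition probabilities inside `p_t` are hypothetical stand-ins chosen only for illustration, since the slides leave the actual kernel unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(s):
    # State 1 pays reward 1, state 2 pays reward 0 (as in the slide).
    return 1.0 if s == 1 else 0.0

def p_t(t, s, a):
    """Hypothetical time-varying kernel: returns Pr(next state = 1 | s, a).
    The slow sinusoidal drift in action 'B' is made up for illustration."""
    drift = 0.3 * np.sin(2 * np.pi * t / 500.0)
    if a == "B":
        base = 0.6 if s == 1 else 0.2
        return float(np.clip(base + drift, 0.05, 0.95))
    return 0.5 if s == 1 else 0.4   # action "G": kept stationary here

def step(t, s, a):
    """Sample the next state from p_t(.|s, a)."""
    return 1 if rng.random() < p_t(t, s, a) else 2

# Roll out an arbitrary (here: always-"B") policy for T steps.
T, s, total_reward = 1000, 1, 0.0
for t in range(1, T + 1):
    total_reward += reward(s)
    s = step(t, s, "B")
print("collected reward:", total_reward)
```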
Model cont'd

◮ Task: design a reward-maximizing policy π. For every time step t: π_t : {1, 2} → {B, G}.
◮ Dynamic regret (Besbes et al. 15):
  dym-reg_T = E[ Σ_{t=1}^T r(s_t(π*)) ] − E[ Σ_{t=1}^T r(s_t(π)) ],
  where the benchmark π* knows the p_t's.
◮ Variation budget: ‖p_1 − p_2‖ + ‖p_2 − p_3‖ + ... + ‖p_{T−1} − p_T‖ ≤ B_p.
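Both quantities can be written directly in code. The sketch below assumes access to reward trajectories of both policies and to the sequence of kernels (which the learner of course does not have); the names `rewards_opt`, `rewards_alg`, `kernels` and the choice of an L1 norm over all (s, a, s') entries are assumptions for illustration.

```python
import numpy as np

def dynamic_regret(rewards_opt, rewards_alg):
    """dym-reg_T = E[sum_t r(s_t(pi*))] - E[sum_t r(s_t(pi))], estimated here
    from reward trajectories of the clairvoyant benchmark and the learner."""
    return float(np.sum(rewards_opt) - np.sum(rewards_alg))

def variation_budget(kernels):
    """sum_t ||p_t - p_{t+1}||, measured with an L1 norm over all entries
    (the particular norm is an assumption)."""
    total = 0.0
    for p_cur, p_next in zip(kernels[:-1], kernels[1:]):
        total += np.abs(np.asarray(p_cur) - np.asarray(p_next)).sum()
    return total

# Tiny usage example with made-up numbers: kernels[t][s, a] = Pr(next = 1 | s, a).
kernels = [np.array([[0.6, 0.5], [0.2, 0.4]]) + 0.001 * t for t in range(100)]
print(variation_budget(kernels))                   # B_p for this toy drift
print(dynamic_regret([1, 1, 0, 1], [1, 0, 0, 0]))  # 3 - 1 = 2
```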
Diameter of an MDP cont'd

◮ If the DM leaves state 1, she has to come back to state 1 to collect samples.
◮ The longer it takes to commute between states, the harder the learning process.

Definition ((Jaksch et al. 10), informal)
Diameter = max{ E[min. time(1 → 2)], E[min. time(2 → 1)] }.

Example. Diameter = max{ 1/0.8, 1/0.1 } = 10.
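A minimal sketch of the diameter computation for this two-state MDP, under the simplifying assumption that the fastest way to the other state is to repeat the single best action, so the expected travel time is geometric, 1 / Pr(crossing). The 0.8 and 0.1 crossing probabilities come from the slide's example; the remaining dictionary entries are made up only so the example is complete.

```python
def two_state_diameter(p_cross):
    """p_cross[(s, a)] = Pr(moving to the *other* state | s, a).
    Diameter = max over ordered state pairs of the minimum expected travel
    time, where repeating the best single action gives a geometric time."""
    best_1_to_2 = max(p_cross[(1, a)] for a in ("B", "G"))
    best_2_to_1 = max(p_cross[(2, a)] for a in ("B", "G"))
    t_12 = float("inf") if best_1_to_2 == 0 else 1.0 / best_1_to_2
    t_21 = float("inf") if best_2_to_1 == 0 else 1.0 / best_2_to_1
    return max(t_12, t_21)

# The slide's example: best crossing probabilities 0.8 (1 -> 2) and 0.1 (2 -> 1).
example = {(1, "B"): 0.8, (1, "G"): 0.5, (2, "B"): 0.1, (2, "G"): 0.05}
print(two_state_diameter(example))  # max(1/0.8, 1/0.1) = 10.0
```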
Existing Works

                           Stationary   Non-stationary
  Multi-armed bandit       OFU*         Forgetting + OFU†
  Reinforcement learning   OFU‡         ? (Forgetting + OFU)

  * Auer et al. 03
  † Besbes et al. 14, Cheung et al. 19
  ‡ Jaksch et al. 10, Agrawal and Jia 20
UCB for Stationary RL

1. Suppose at time t,
   N_t(1, B) = 10: 5 × (1, B) → 1, 5 × (1, B) → 2
   N_t(2, B) = 10: 5 × (2, B) → 1, 5 × (2, B) → 2
   Empirical state transition distribution: p̂_t(·|1, B) = p̂_t(·|2, B) = (0.5, 0.5).
2. Confidence intervals:
   ‖p̂_t(·|1, B) − p(·|1, B)‖ ≤ c_t(1, B) := C/√10
   ‖p̂_t(·|2, B) − p(·|2, B)‖ ≤ c_t(2, B) := C/√10
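A sketch of steps 1 and 2: build the empirical transition distribution from counts and attach a confidence radius of the form C / √N_t(s, a). The constant C is left as a hypothetical parameter (in the analysis it hides log factors and the number of states).

```python
from collections import defaultdict
import math

class TransitionEstimator:
    def __init__(self, C=2.0):
        self.C = C                                           # confidence constant (assumed)
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}

    def update(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1

    def n(self, s, a):
        return sum(self.counts[(s, a)].values())

    def p_hat(self, s, a):
        """Empirical distribution over next states for (s, a)."""
        n = self.n(s, a)
        return {s2: c / n for s2, c in self.counts[(s, a)].items()} if n else {}

    def radius(self, s, a):
        """Confidence radius c_t(s, a) = C / sqrt(N_t(s, a))."""
        n = self.n(s, a)
        return float("inf") if n == 0 else self.C / math.sqrt(n)

# The slide's example: 10 samples of (1, B), half going to each state.
est = TransitionEstimator()
for _ in range(5):
    est.update(1, "B", 1)
    est.update(1, "B", 2)
print(est.p_hat(1, "B"))   # {1: 0.5, 2: 0.5}
print(est.radius(1, "B"))  # C / sqrt(10)
```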
UCB for Stationary RL

3. UCB of reward: find the p̊ that maximizes Pr(visiting state 1) within the confidence interval.
4. Execute the optimal policy w.r.t. the UCB until some termination criteria are met.
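For the two-state case, step 3 reduces to an inner maximization over the confidence ball: for each (s, a), shift as much probability mass as the radius allows toward the rewarding state 1. The sketch below is a simplified stand-in for the extended-value-iteration step used in UCRL2-style algorithms, not the speakers' exact procedure.

```python
def optimistic_kernel(p_hat, radius):
    """For each (s, a), move up to radius/2 of probability mass toward state 1
    (the reward-1 state); in a two-state distribution this keeps
    ||p_ring(.|s,a) - p_hat(.|s,a)||_1 <= radius."""
    p_ring = {}
    for (s, a), q1 in p_hat.items():      # q1 = p_hat(1 | s, a)
        bonus = min(1.0 - q1, radius[(s, a)] / 2.0)
        p_ring[(s, a)] = q1 + bonus       # optimistic Pr(next = 1 | s, a)
    return p_ring

# Example using the empirical estimates and radii from the previous sketch.
p_hat  = {(1, "B"): 0.5, (2, "B"): 0.5}
radius = {(1, "B"): 2.0 / 10**0.5, (2, "B"): 2.0 / 10**0.5}
p_ring = optimistic_kernel(p_hat, radius)
# Step 4: the optimistic policy plays, in each state, the action whose
# p_ring(1 | s, a) is largest, until the termination criterion
# (e.g. a visit-count doubling) is met.
print(p_ring)
```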
UCB for RL cont'd

Regret analysis:
◮ LCB of diameter: find the p̊ that maximizes Pr(commuting) within the confidence interval.
◮ Regret ∝ LCB × Σ_{(s,a)} c_t(s, a).
◮ Under stationarity, LCB of diameter ≤ Diameter(p).

Theorem
Denote D := Diameter(p); the regret of the UCB algorithm is O(D√T).

◮ Summary: UCB of reward + LCB of diameter ⇒ low regret.
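The LCB of the diameter can be sketched the same way as the optimistic kernel, except that mass is now shifted toward the state being traveled to, making commuting look as fast as the confidence ball allows. The numbers below are made up, and the geometric-travel-time simplification from the diameter sketch is reused.

```python
def lcb_diameter(p_hat_cross, radius):
    """p_hat_cross[(s, a)] = empirical Pr(moving to the other state | s, a).
    Within the confidence ball we may raise this by up to radius/2, which
    lower-bounds the expected commute time and hence the diameter."""
    def fastest(s):
        best = max(min(1.0, p_hat_cross[(s, a)] + radius[(s, a)] / 2.0)
                   for a in ("B", "G") if (s, a) in p_hat_cross)
        return float("inf") if best == 0 else 1.0 / best
    return max(fastest(1), fastest(2))

# Toy numbers: with few samples the radii are large, so the LCB of the
# diameter stays small -- and under stationarity it never exceeds Diameter(p).
p_hat_cross = {(1, "B"): 0.5, (2, "B"): 0.5}
radius      = {(1, "B"): 0.6, (2, "B"): 0.6}
print(lcb_diameter(p_hat_cross, radius))  # max(1/0.8, 1/0.8) = 1.25
```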
SWUCB for RL

According to (Cheung et al. 19):
◮ SWUCB for RL: UCB for RL using only the W most recent samples (see the sketch below).
◮ The perils of drift: under non-stationarity, LCB of diameter ≫ Diameter(p_s) for all s ∈ [T].
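The sliding-window modification only changes which samples enter the counts. A minimal sketch, assuming a window length W chosen as a hypothetical function of the variation budget and horizon:

```python
from collections import deque

class SlidingWindowCounts:
    """Keep only the W most recent (s, a, s') transitions; the empirical
    kernel and confidence radii are then built exactly as before, but from
    these windowed counts."""
    def __init__(self, W):
        self.W = W
        self.buffer = deque()

    def update(self, s, a, s_next):
        self.buffer.append((s, a, s_next))
        if len(self.buffer) > self.W:
            self.buffer.popleft()        # forget the oldest sample

    def counts(self, s, a):
        out = {}
        for (s_, a_, s2) in self.buffer:
            if (s_, a_) == (s, a):
                out[s2] = out.get(s2, 0) + 1
        return out

# Usage: W = 200 is a made-up choice; in theory W is tuned to B_p and T.
win = SlidingWindowCounts(W=200)
win.update(1, "B", 2)
print(win.counts(1, "B"))  # {2: 1}
```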
Perils of Non-Stationarity in RL

Non-stationarity: the DM faces a time-varying environment.
Bandit feedback: the DM is not seeing everything.

Collected data: {(1, B) → 1, (2, B) → 2}.
Empirical state transition p̂_t: both observed transitions are self-loops, i.e. p̂_t(1|1, B) = 1 and p̂_t(2|2, B) = 1.
Diameter explodes!
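Plugging the slide's two observations into the empirical kernel makes the problem explicit: both (s, B) pairs look like pure self-loops, so the estimated commute time blows up. A toy check, reusing the geometric-travel-time simplification from the diameter sketch; under drift, as more such stale samples accumulate in the window, the confidence ball also shrinks around this estimate, which is why even the LCB of the diameter becomes huge.

```python
def empirical_cross_prob(samples, s, a):
    """Fraction of observed (s, a) transitions that left state s."""
    hits = [s2 for (s_, a_, s2) in samples if (s_, a_) == (s, a)]
    return sum(s2 != s for s2 in hits) / len(hits) if hits else None

samples = [(1, "B", 1), (2, "B", 2)]           # the collected data on the slide
q12 = empirical_cross_prob(samples, 1, "B")    # 0.0 -> expected time 1 -> 2 is infinite
q21 = empirical_cross_prob(samples, 2, "B")    # 0.0 -> expected time 2 -> 1 is infinite
diameter_hat = max(
    float("inf") if q12 == 0 else 1.0 / q12,
    float("inf") if q21 == 0 else 1.0 / q21,
)
print(diameter_hat)  # inf: the estimated diameter explodes
```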