

  1. Reinforcement learning
  Fredrik D. Johansson, Clinical ML @ MIT
  6.S897/HST.956: Machine Learning for Healthcare, 2019

  2. Reminder: Causal effects
  ► Potential outcomes under treatment and control, Y(1), Y(0)
  ► Covariates X and treatment T
  ► Conditional average treatment effect (CATE): CATE(x) = E[Y(1) − Y(0) | X = x]

  3. Today: Treatment policies/regimes
  ► A policy π assigns treatments to patients (typically depending on their medical history/state)
  ► Example: For a patient with medical history x, π(x) = 1[CATE(x) > 0], i.e. "Treat if effect is positive"
  ► Today we focus on policies guided by clinical outcomes (as opposed to legislation, monetary cost or side-effects)
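The "treat if effect is positive" rule is easy to state in code. Below is a minimal sketch, assuming we already have per-patient CATE estimates from some effect estimator; the array `cate_hat` and the helper name `treat_if_positive` are illustrative, not from the slides.

```python
import numpy as np

def treat_if_positive(cate_estimates, threshold=0.0):
    """pi(x) = 1[CATE(x) > threshold]: treat exactly those patients whose
    estimated treatment effect is positive."""
    return (np.asarray(cate_estimates) > threshold).astype(int)

# Hypothetical per-patient CATE estimates, e.g. from two outcome regressions.
cate_hat = np.array([0.3, -0.1, 0.05, -0.4])
treat_if_positive(cate_hat)   # -> array([1, 0, 1, 0])
```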

  4. Example: Sepsis management
  ► Sepsis is a complication of an infection which can lead to massive organ failure and death
  ► One of the leading causes of death in the ICU
  ► The primary treatment target is the infection
  ► Other symptoms need management: breathing difficulties, low blood pressure, …

  5. Recall: Potential outcomes
  Septic patient with breathing difficulties
  1. Should the patient be put on mechanical ventilation?
  [Figure: blood oxygen over time; unobserved responses Y(0) and Y(1), observed decisions and response, treatment T, covariates X; decision points: Mechanical ventilation? Sedation? Vasopressors?]

  6. Today: Sequential decision making
  ► Many clinical decisions are made in sequence
  ► Choices early may rule out actions later
  ► Can we optimize the policy by which actions are made?
  [Figure: trajectory of states S_0, S_1, …, S_T, actions A_1, …, and rewards R_0, R_1, …, R_T over times t_0, t_1, …, t_T]

  7. Recall: Potential outcomes
  Septic patient with breathing difficulties
  1. Should the patient be put on mechanical ventilation?
  [Figure: unobserved responses, observed decisions and response over time; decision points: Mechanical ventilation? Sedation? Vasopressors?]

  8. Example: Sepsis management
  Septic patient with breathing difficulties
  2. Should the patient be sedated? (To alleviate discomfort due to mechanical ventilation)
  [Figure: unobserved responses, observed decisions and response over time; decision points: Mechanical ventilation? Sedation? Vasopressors?]

  9. Example: Sepsis management
  Septic patient with breathing difficulties
  3. Should we artificially raise blood pressure? (Which may have dropped due to sedation)
  [Figure: unobserved responses, observed decisions and response over time; decision points: Mechanical ventilation? Sedation? Vasopressors?]

  10. Example: Sepsis management
  Septic patient with breathing difficulties
  [Figure: observed decisions and response over time; decision points: Mechanical ventilation? Sedation? Vasopressors?]

  11. Finding optimal policies
  ► How can we treat patients so that their outcomes are as good as possible?
  ► What are good outcomes?
  ► Which policies should we consider?
  [Figure: outcome trajectory with decision points: Mechanical ventilation? Sedation? Vasopressors?]

  12. Success stories in popular press
  ► AlphaStar
  ► AlphaGo
  ► DQN Atari
  ► OpenAI Five

  13. Reinforcement learning
  ► Maximize reward!
  [Figure: game state S_0, possible actions A_0, next state S_1, reward R_1 (loss). Figure by Tim Wheeler, tim.hibal.org]

  14. Great! Now let's treat patients
  ► Patient state S_t at time t is like the game board
  ► Medical treatments A_t are like the actions
  ► Outcomes R_t are the rewards in the game
  ► What could possibly go wrong?
  [Figure: trajectory of states S_0, S_1, …, S_T, actions A_0, A_1, …, and rewards R_0, R_1, …, R_T over time]

  15. 1. Decision processes 2. Reinforcement learning 3. Learning from batch (off-policy) data 4. Reinforcement learning in healthcare

  16. Decision processes
  ► An agent repeatedly, at times t, takes actions A_t to receive rewards R_t from an environment, the state S_t of which is (partially) observed
  [Figure: agent-environment loop; the agent sends action A_t to the environment and receives reward R_t and state S_t]
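A rough sketch of this interaction loop, assuming a gym-style environment object with `reset()` and `step()` methods and a `policy` function mapping states to actions (both are hypothetical placeholders, not part of the lecture):

```python
def run_episode(env, policy, horizon=100):
    """Roll out one episode: at each time t the agent observes state S_t,
    takes action A_t = policy(S_t), and receives reward R_t and the next state."""
    trajectory = []
    state = env.reset()                              # initial state S_0
    for t in range(horizon):
        action = policy(state)                       # A_t
        next_state, reward, done = env.step(action)  # S_{t+1}, R_t
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                     # e.g. a terminal state was reached
            break
    return trajectory
```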

  17. Decision process: Mechanical ventilation
  ► Reward: R_t = R_t^(vent off) + R_t^(vitals) + R_t^(vent on)
  [Figure: agent-environment loop unrolled over a trajectory S_0, A_0, (S_1, R_1), A_1, (S_2, R_2), A_2, …, R_T; decision points: Mechanical ventilation? Sedation? Spontaneous breathing trial?]

  18. Decision process: Mechanical ventilation
  ► State S_t includes demographics, physiological measurements, ventilator settings, level of consciousness, dosage of sedatives, time to ventilation, number of intubations
  [Figure: example states S_0, S_1, S_2 along a trajectory]

  19. Decision process: Mechanical ventilation
  ► Actions A_t include intubation and extubation, as well as administration and dosages of sedatives
  [Figure: example actions A_0, A_1, A_2 along a trajectory]

  20. Decision processes
  ► A decision process specifies how states S_t, actions A_t, and rewards R_t are distributed: p(S_0, …, S_T, A_0, …, A_T, R_0, …, R_T)
  ► The agent interacts with the environment according to a behavior policy μ = p(A_t | ⋯)*
  * The ⋯ depends on the type of agent

  21. Markov Decision Processes
  ► Markov decision processes (MDPs) are a special case
  ► Markov transitions: p(S_t | S_0, …, S_{t−1}, A_0, …, A_{t−1}) = p(S_t | S_{t−1}, A_{t−1})
  ► Markov reward function: p(R_t | S_0, …, S_t, A_0, …, A_t) = p(R_t | S_t, A_t)
  ► Markov action policy: μ = p(A_t | S_0, …, S_t, A_0, …, A_{t−1}) = p(A_t | S_t)
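To make the Markov structure concrete, here is a toy tabular MDP sketch in which the next state depends only on the most recent state-action pair. The two-state "sick/healthy" example and its probabilities are invented for illustration, and the reward is simplified to depend on the state alone.

```python
import random

# p(S_t | S_{t-1}=s, A_{t-1}=a), stored as {(s, a): {next_state: probability}}
transitions = {
    ("sick", "treat"):    {"healthy": 0.7, "sick": 0.3},
    ("sick", "wait"):     {"healthy": 0.2, "sick": 0.8},
    ("healthy", "treat"): {"healthy": 0.9, "sick": 0.1},
    ("healthy", "wait"):  {"healthy": 0.8, "sick": 0.2},
}
reward = {"healthy": 1.0, "sick": 0.0}   # simplified Markov reward: depends on S_t only

def step(state, action):
    """Sample S_t ~ p(. | S_{t-1}, A_{t-1}) and return (S_t, R_t)."""
    dist = transitions[(state, action)]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, reward[next_state]
```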

  22. Markov assumption
  ► State transitions, actions and rewards depend only on the most recent state-action pair
  [Figure: graphical model of the trajectory S_0, A_0, R_0, …, S_T, R_T under the Markov assumption]

  23. Contextual bandits (special case)*
  ► States are independent: p(S_t | S_{t−1}, A_{t−1}) = p(S_t)
  ► Equivalent to the single-step case: potential outcomes!
  * The term "contextual bandits" has connotations of efficient exploration, which is not addressed here

  24. Contextual bandits & potential outcomes
  ► Think of each state S_i as an i.i.d. patient, the actions A_i as the treatment group indicators and the rewards R_i as the outcomes

  25. Goal of RL
  ► Like previously with causal effect estimation, we are interested in the effects of actions A_t on future rewards

  26. Value maximization
  ► The goal of most RL algorithms is to maximize the expected cumulative reward, the value V^π of its policy π:
  ► Return: G_t = Σ_{t′ ≥ t} R_{t′} (sum of future rewards)
  ► Value: V^π = E_{τ∼π}[G_0] (expected sum of rewards under policy π)
  ► The expectation is taken with respect to scenarios τ acted out according to the learned policy π
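A small helper for the return, computed backwards over a reward sequence. The slide uses the plain (undiscounted) sum; the optional discount factor `gamma` is an added generalization, not something introduced on the slide.

```python
def returns(rewards, gamma=1.0):
    """G_t = sum_{t' >= t} gamma**(t'-t) * R_{t'}; gamma=1.0 recovers the
    plain sum of future rewards used in the lecture."""
    G, out = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

returns([0.0, 0.0, 1.0])   # -> [1.0, 1.0, 1.0]
```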

  27. Example
  ► Let's say that we have data from a policy π
  ► Estimate the value by averaging per-patient returns: V̂^π ≈ (1/n) Σ_{i=1}^n G^(i)
  ► For each patient i, the return is the sum of their observed rewards, e.g. G^(1) = R_1^(1) + R_2^(1) + R_3^(1)
  [Figure: three example patients, each with actions a_1, a_2, a_3 and rewards R_1, R_2, R_3; each patient's return G is the sum of their three rewards]
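The averaging on this slide is a Monte Carlo estimate over observed trajectories. A minimal sketch, assuming each patient is represented by their list of observed rewards (the numbers below are made up, not the slide's):

```python
def value_estimate(reward_sequences):
    """V-hat ~= (1/n) * sum_i G^(i): average the per-patient returns observed
    under the policy that generated the data."""
    per_patient_returns = [sum(rs) for rs in reward_sequences]
    return sum(per_patient_returns) / len(per_patient_returns)

# Three hypothetical patients with three rewards each.
value_estimate([[0, 0, 1], [0, 1, 1], [0, 0, 0]])   # -> 1.0
```

Note that this only estimates the value of the policy that actually generated the data; evaluating a different policy from the same data is the off-policy problem covered in part 3 of the outline.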

  28. Robot in a room
  ► Stochastic actions: p(move up | A_t = "up") = 0.8; the available non-opposite moves have uniform probability
  ► Rewards: +1 at [4,3] (terminal state), −1 at [4,2] (terminal), −0.04 per step
  Slide from Peter Bodik
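The stochastic action model can be sampled directly. A sketch, assuming "non-opposite moves" means the two perpendicular directions, each taking half of the remaining 0.2 probability:

```python
import random

PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def sample_move(intended):
    """Intended move with probability 0.8, otherwise one of the two
    non-opposite (perpendicular) moves, chosen uniformly."""
    if random.random() < 0.8:
        return intended
    return random.choice(PERPENDICULAR[intended])
```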

  29. Robot in a room
  ► What is the optimal policy?
  ► Stochastic actions: p(move up | A_t = "up") = 0.8; the available non-opposite moves have uniform probability
  ► Rewards: +1 at [4,3] (terminal state), −1 at [4,2] (terminal), −0.04 per step
  [Figure: 4x3 grid with the optimal action in each non-terminal cell left as "?"]
  Slide from Peter Bodik

  30. Robot in a room
  ► The following is the optimal policy/trajectory under deterministic transitions
  ► Not achievable in our stochastic transition model
  [Figure: 4x3 grid with a single trajectory from Start to the +1 cell]
  Slide from Peter Bodik

  31. Robot in a room
  ► Optimal policy
  ► How can we learn this?
  [Figure: 4x3 grid showing the optimal action in every non-terminal cell]
  Slide from Peter Bodik

  32. 1. Decision processes 2. Reinforcement learning 3. Learning from batch (off-policy) data 4. Reinforcement learning in healthcare

  33. Paradigms*
  ► Model-based RL: model the transitions p(S_t | S_{t−1}, A_{t−1}); examples: G-computation, MDP estimation
  ► Value-based RL: model the value/return p(G_t | S_t, A_t); examples: Q-learning, G-estimation
  ► Policy-based RL: model the policy p(A_t | S_t); examples: REINFORCE, marginal structural models
  * We focus on off-policy RL here

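The paradigms above name Q-learning as the value-based example. As a reference point, here is the standard tabular Q-learning update (a generic textbook sketch, not code from the lecture); `alpha` is a learning rate and `gamma` a discount factor.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)                      # unseen state-action pairs start at 0
q_learning_update(Q, s="sick", a="treat", r=1.0, s_next="healthy",
                  actions=["treat", "wait"])
```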

  35. Dynamic programming
  ► Assume that we know how good a state-action pair is
  ► Q: Which end state is the best? A: [4,3]
  ► Q: What is the best way to get there? A: Only [3,1]
  [Figure: 4x3 grid highlighting cells [3,1] and [4,3]]
  Slide from Peter Bodik

  36. Dynamic programming
  ► [2,1] is slightly better than [3,2] because of the risk of transitioning to [4,2] from [3,2]
  ► Which is the best way to [2,1]?
  [Figure: 4x3 grid highlighting cells [2,1], [3,2] and [4,2]]
  Slide from Peter Bodik

  37. Dynamic programming
  ► The idea of dynamic programming for reinforcement learning is to recursively learn the best action/value in a previous state given the best action/value in future states
  Slide from Peter Bodik
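The recursion this slide describes is exactly value iteration. Below is a sketch on the 4x3 robot-in-a-room grid with the rewards from slide 28; the blocked cell at [2,2] and the perpendicular-slip reading of the stochastic moves are assumptions about details the slides do not spell out.

```python
# Value iteration: recursively compute the value of each state from the best
# value of its successor states (the dynamic-programming idea on this slide).
ROWS, COLS = 3, 4
TERMINAL = {(4, 3): +1.0, (4, 2): -1.0}      # terminal rewards from slide 28
WALL = (2, 2)                                # assumed blocked cell of the classic layout
STEP_REWARD = -0.04                          # per-step reward from slide 28
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

states = [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1) if (c, r) != WALL]

def move(state, direction):
    """Deterministic effect of one move; bumping into the wall or the edge keeps you in place."""
    dc, dr = MOVES[direction]
    nxt = (state[0] + dc, state[1] + dr)
    return nxt if nxt in states else state

def transition(state, action):
    """Stochastic model: intended move w.p. 0.8, each non-opposite move w.p. 0.1."""
    return [(0.8, move(state, action))] + [(0.1, move(state, p)) for p in PERP[action]]

V = {s: 0.0 for s in states}
for _ in range(100):                         # sweeps until (approximate) convergence
    for s in states:
        if s in TERMINAL:
            V[s] = TERMINAL[s]
            continue
        V[s] = max(sum(p * (STEP_REWARD + V[s2]) for p, s2 in transition(s, a))
                   for a in MOVES)

# Greedy policy: in each state, pick the action whose expected next value is highest.
policy = {s: max(MOVES, key=lambda a: sum(p * V[s2] for p, s2 in transition(s, a)))
          for s in states if s not in TERMINAL}
```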
