Reinforcement learning Fredrik D. Johansson Clinical ML @ MIT 6.S897/HST.956: Machine Learning for Healthcare, 2019
Reminder: Causal effects
› Potential outcomes under treatment and control, Y(1), Y(0)
› Covariates and treatment, X, T
› Conditional average treatment effect (CATE): CATE(x) = E[Y(1) − Y(0) | X = x]
Today: Treatment policies/regimes
› A policy π assigns treatments to patients (typically depending on their medical history/state)
› Example: For a patient with medical history x, π(x) = 1[CATE(x) > 0], i.e. "Treat if the effect is positive" (see the sketch below)
› Today we focus on policies guided by clinical outcomes (as opposed to legislation, monetary cost or side-effects)
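A minimal sketch of such a CATE-thresholding policy, assuming a fitted CATE estimator `cate_model` with a `predict` method (a hypothetical object, not something defined in the lecture):

```python
import numpy as np

def treatment_policy(x, cate_model):
    """Treat if the estimated effect is positive: pi(x) = 1[CATE(x) > 0].

    `cate_model` is any fitted estimator whose predict() returns an estimate
    of E[Y(1) - Y(0) | X = x]; it is a placeholder for illustration.
    """
    cate_hat = cate_model.predict(np.atleast_2d(x))
    return (cate_hat > 0).astype(int)   # 1 = treat, 0 = do not treat
```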
Example: Sepsis management
› Sepsis is a complication of an infection which can lead to massive organ failure and death
› One of the leading causes of death in the ICU
› The primary treatment target is the infection
› Other symptoms need management: breathing difficulties, low blood pressure, …
Recall: Potential outcomes
Septic patient with breathing difficulties
1. Should the patient be put on mechanical ventilation?
(Figure: blood oxygen over time, with unobserved potential responses Y(0), Y(1) and the observed decision & response. Timeline: Mechanical ventilation? Sedation? Vasopressors?)
Today: Sequential decision making
› Many clinical decisions are made in sequence
› Choices made early may rule out actions later
› Can we optimize the policy by which actions are made?
(Figure: timeline of states, actions, and rewards at times t_1, …, t_T)
Example: Sepsis management
Septic patient with breathing difficulties
2. Should the patient be sedated? (To alleviate discomfort due to mechanical ventilation)
Example: Sepsis management
Septic patient with breathing difficulties
3. Should we artificially raise blood pressure? (Which may have dropped due to sedation)
Finding optimal policies
› How can we treat patients so that their outcomes are as good as possible?
› What are good outcomes?
› Which policies should we consider?
(Figure: outcome over the treatment timeline: Mechanical ventilation? Sedation? Vasopressors?)
Success stories in popular press
› AlphaStar
› AlphaGo
› DQN Atari
› OpenAI Five
Reinforcement learning
› Maximize reward!
(Figure: game screen annotated with the game state, possible actions, the next state, and the reward/loss. Figure by Tim Wheeler, tim.hibal.org)
Great! Now let's treat patients
› Patient state at time t is like the game board
› Medical treatments A_t are like the actions
› Outcomes R_t are the rewards in the game
› What could possibly go wrong?
1. Decision processes
2. Reinforcement learning
3. Learning from batch (off-policy) data
4. Reinforcement learning in healthcare
Decision processes
› An agent repeatedly, at times t, takes actions A_t to receive rewards R_t from an environment, the state S_t of which is (partially) observed (a toy code sketch follows)
(Figure: agent-environment loop. Agent → Action A_t → Environment → State S_t, Reward R_t → Agent)
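As a rough illustration of this agent-environment loop, here is a toy sketch; the environment dynamics, state variables, and reward below are all invented for illustration:

```python
import random

class ToyEnvironment:
    """Toy environment: the agent only sees a (partially observed) state and a
    reward; the true dynamics are hidden inside step()."""

    def reset(self):
        self.t = 0
        return {"breathing_difficulty": 3}      # initial observed state S_1

    def step(self, action):
        self.t += 1
        severity = max(0, 3 - action - random.randint(0, 1))
        next_state = {"breathing_difficulty": severity}
        reward = -severity                      # reward R_t: fewer symptoms is better
        done = self.t >= 3                      # fixed horizon T = 3
        return next_state, reward, done

def behavior_policy(state):
    """mu(A_t | S_t): a placeholder stochastic policy (0 = wait, 1 = treat)."""
    return random.choice([0, 1])

env = ToyEnvironment()
state, done, total_reward = env.reset(), False, 0
while not done:
    action = behavior_policy(state)             # agent chooses A_t based on S_t
    state, reward, done = env.step(action)      # environment returns next state and R_t
    total_reward += reward
print(total_reward)
```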
Decision process: Mechanical ventilation
› Reward: R_t = R_t^vitals + R_t^vent-on + R_t^vent-off
(Figure: the agent-environment loop unrolled over the treatment timeline: Mechanical ventilation? Sedation? Spontaneous breathing trial?)
Decision process: Mechanical ventilation
› State S_t includes demographics, physiological measurements, ventilator settings, level of consciousness, dosage of sedatives, time to ventilation, number of intubations
Decision process: Mechanical ventilation
› Actions A_t include intubation and extubation, as well as administration and dosages of sedatives (see the toy reward sketch below)
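To make the reward decomposition concrete, here is a toy sketch; the feature names, thresholds, and weights are invented assumptions and are much simpler than the clinical definition behind this example:

```python
def ventilation_reward(vitals, on_ventilator):
    """Toy version of R_t = R_t^vitals + R_t^vent-on + R_t^vent-off.

    All numbers and feature names below are illustrative assumptions.
    """
    # Reward physiological stability (e.g. oxygen saturation in a normal range).
    r_vitals = 1.0 if 94 <= vitals.get("spo2", 0) <= 100 else -1.0
    # Penalize time spent on the ventilator; reward time spent off it.
    r_vent_on = -0.5 if on_ventilator else 0.0
    r_vent_off = 0.5 if not on_ventilator else 0.0
    return r_vitals + r_vent_on + r_vent_off

print(ventilation_reward({"spo2": 97}, on_ventilator=True))    # 0.5
print(ventilation_reward({"spo2": 88}, on_ventilator=False))   # -0.5
```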
Decision processes
› A decision process specifies how states S_t, actions A_t, and rewards R_t are distributed: p(S_1, …, S_T, A_1, …, A_T, R_1, …, R_T)
› The agent interacts with the environment according to a behavior policy μ = p(A_t | …)*
*The … depends on the type of agent
Markov Decision Processes
› Markov decision processes (MDPs) are a special case (sampling sketch below)
› Markov transitions: p(S_t | S_1, …, S_{t−1}, A_1, …, A_{t−1}) = p(S_t | S_{t−1}, A_{t−1})
› Markov reward function: p(R_t | S_1, …, S_t, A_1, …, A_t) = p(R_t | S_t, A_t)
› Markov action policy: μ = p(A_t | S_1, …, S_t, A_1, …, A_{t−1}) = p(A_t | S_t)
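A small sketch of what the Markov factorization buys us: to simulate a trajectory we only ever need p(S_t | S_{t−1}, A_{t−1}), E[R_t | S_t, A_t] and μ(A_t | S_t). The toy MDP below (3 states, 2 actions, made-up numbers) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# P[a, s, s'] = p(S_t = s' | S_{t-1} = s, A_{t-1} = a); each row sums to 1.
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
    [[0.4, 0.5, 0.1], [0.2, 0.3, 0.5], [0.1, 0.1, 0.8]],   # action 1
])
# R[s, a] = E[R_t | S_t = s, A_t = a]
R = np.array([[0.0, 1.0],
              [0.5, 0.0],
              [1.0, 2.0]])

def mu(s):
    """Markov behavior policy mu(A_t | S_t): here simply uniform over actions."""
    return int(rng.integers(2))

def sample_trajectory(T=5, s=0):
    """Sample (S_1, A_1, R_1, ..., S_T, A_T, R_T) using only the Markov pieces."""
    trajectory = []
    for t in range(T):
        a = mu(s)
        r = R[s, a]
        trajectory.append((s, a, r))
        s = int(rng.choice(3, p=P[a, s]))   # next state depends only on (s, a)
    return trajectory

print(sample_trajectory())
```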
Markov assumption
› State transitions, actions and rewards depend only on the most recent state-action pair
(Figure: graphical model over states S_1, …, S_T, actions A_1, …, A_T, and rewards R_1, …, R_T)
Contextual bandits (special case)*
› States are independent: p(S_t | S_{t−1}, A_{t−1}) = p(S_t)
› Equivalent to the single-step case: potential outcomes!
*The term "contextual bandits" has connotations of efficient exploration, which is not addressed here
Contextual bandits & potential outcomes
› Think of each state S_i as an i.i.d. patient, the actions A_i as the treatment group indicators, and R_i as the outcomes (sketch below)
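A short sketch of this correspondence on synthetic data; the data-generating process below (covariate-dependent potential outcomes with a constant +1 treatment effect) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

X = rng.normal(size=(n, 2))                 # i.i.d. "states": patient covariates
A = rng.integers(0, 2, size=n)              # actions: treatment group indicators
Y0 = X[:, 0] + rng.normal(size=n)           # potential outcome under control
Y1 = X[:, 0] + 1.0 + rng.normal(size=n)     # potential outcome under treatment
R = np.where(A == 1, Y1, Y0)                # observed reward = factual outcome Y(A)

# In bandit language, (X, A, R) is the logged data; the counterfactual reward
# (Y1 for untreated patients, Y0 for treated ones) is never observed.
print(R[:5])
```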
Goal of RL
› Like previously with causal effect estimation, we are interested in the effects of actions A_t on future rewards
Value maximization
› The goal of most RL algorithms is to maximize the expected cumulative reward, i.e. the value V^π of the policy π:
› Return: G_t = Σ_{t' ≥ t} R_{t'}   (sum of future rewards)
› Value: V^π = E_π[G_1]   (expected sum of rewards under policy π)
› The expectation is taken with respect to scenarios acted out according to the learned policy π
Example: Value
› Let's say that we have data from a behavior policy μ for n = 3 patients, each with three observed rewards
› Each patient's return is G^(i) = R_1^(i) + R_2^(i) + R_3^(i)
› The value is estimated as V̂^μ ≈ (1/n) Σ_{i=1}^{n} G^(i)   (code sketch below)
(Figure: three patient trajectories with binary rewards at each time step)
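A minimal sketch of this Monte Carlo value estimate; the 0/1 reward values below are illustrative rather than read off the slide:

```python
import numpy as np

# Rewards R_1, R_2, R_3 for n = 3 patients treated under the behavior policy mu.
rewards = np.array([
    [0, 1, 1],   # patient 1
    [1, 1, 0],   # patient 2
    [0, 0, 0],   # patient 3
])

returns = rewards.sum(axis=1)      # G^(i) = R_1^(i) + R_2^(i) + R_3^(i)
value_estimate = returns.mean()    # V_hat^mu ~ (1/n) * sum_i G^(i)
print(returns, value_estimate)     # [2 2 0] 1.333...
```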
Robot in a room
› Stochastic actions: p(move up | A = "up") = 0.8; the available non-opposite moves have uniform probability
› Rewards: +1 at [4,3] (terminal state), −1 at [4,2] (terminal), −0.04 per step
(Figure: grid world with the Start cell and the +1 / −1 terminal cells. Slide from Peter Bodik)
Robot in a room
› What is the optimal policy? (Same stochastic actions and rewards as above)
(Figure: the same grid with each non-terminal cell marked "?". Slide from Peter Bodik)
Robot in a room
› The following is the optimal policy/trajectory under deterministic transitions
› Not achievable in our stochastic transition model
(Slide from Peter Bodik)
Robot in a room
› Optimal policy
› How can we learn this?
(Slide from Peter Bodik)
1. Decision processes
2. Reinforcement learning
3. Learning from batch (off-policy) data
4. Reinforcement learning in healthcare
Paradigms*
› Model-based RL: transitions p(S_t | S_{t−1}, A_{t−1}). Methods: G-computation, MDP estimation
› Value-based RL: value/return p(G_t | S_t, A_t). Methods: Q-learning, G-estimation (a generic Q-learning sketch follows)
› Policy-based RL: policy p(A_t | S_t). Methods: REINFORCE, marginal structural models
*We focus on off-policy RL here
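As one concrete instance of the value-based column, here is a textbook tabular Q-learning sketch over a fixed batch of transitions; it is a generic illustration with arbitrary hyperparameters, not the specific method of any paper discussed in the course:

```python
import numpy as np

def q_learning(transitions, n_states, n_actions, gamma=0.99, alpha=0.1, sweeps=50):
    """Tabular Q-learning from a batch of (s, a, r, s_next) tuples.

    Repeatedly applies the update
        Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s_next, a') - Q(s, a)).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next in transitions:
            td_target = r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Tiny made-up batch: two states, two actions.
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0), (0, 0, 0.0, 1), (1, 1, 2.0, 0)]
Q = q_learning(batch, n_states=2, n_actions=2)
print(Q.argmax(axis=1))   # greedy action in each state
```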
Dynamic programming
› Assume that we know how good a state-action pair is
› Q: Which end state is the best? A: [4,3]
› Q: What is the best way to get there? A: Only [3,1]
(Slide from Peter Bodik)
Dynamic programming
› [2,1] is slightly better than [3,2] because of the risk of transitioning to [4,2] from [3,2]
› Which is the best way to [2,1]?
(Slide from Peter Bodik)
Dynamic programming
› The idea of dynamic programming for reinforcement learning is to recursively learn the best action/value in a previous state given the best action/value in future states (a code sketch follows)
(Slide from Peter Bodik)
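To tie the dynamic-programming idea back to the robot-in-a-room example, here is a value-iteration sketch. It assumes the 0.8 / 0.1 / 0.1 reading of the stochastic moves and ignores any blocked cell the original figure may contain, so it approximates the slides' grid world rather than reproducing it exactly:

```python
# 4x3 grid, cells indexed [column, row] as on the slides.
COLS, ROWS = 4, 3
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # +1 and -1 terminal states
STEP_REWARD, GAMMA = -0.04, 1.0

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def move(s, direction):
    c, r = s[0] + MOVES[direction][0], s[1] + MOVES[direction][1]
    return (c, r) if 1 <= c <= COLS and 1 <= r <= ROWS else s   # bump: stay put

def transition(s, a):
    """Intended move with prob 0.8, each perpendicular move with prob 0.1."""
    return [(0.8, move(s, a))] + [(0.1, move(s, p)) for p in PERP[a]]

states = [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1)]
V = {s: 0.0 for s in states}

# Value iteration: V(s) <- max_a sum_s' p(s' | s, a) * (r + gamma * V(s')).
for _ in range(100):
    V = {s: TERMINALS[s] if s in TERMINALS else
            max(sum(p * (STEP_REWARD + GAMMA * V[s2]) for p, s2 in transition(s, a))
                for a in MOVES)
         for s in states}

policy = {s: max(MOVES, key=lambda a: sum(p * V[s2] for p, s2 in transition(s, a)))
          for s in states if s not in TERMINALS}
print(policy[(1, 1)])   # greedy action in cell [1,1]
```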