Topics in Computational Sustainability, CS 325, Spring 2016
Making Choices: Sequential Decision Making

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Stochastic programming
[Figure: a decision among options {a(p_a), b(p_b), c(p_c)} feeds a probabilistic model of outcomes (e.g. dry vs. wet); the chosen option maximizes expected utility / minimizes expected cost.]
Problem Setup
[Figure: map of conserved parcels and available parcels, showing current and potential territories.]
Given a limited budget, what parcels should I conserve to maximize the expected number of occupied territories in 50 years?
Metapopulation = Cascade
• The metapopulation model can be viewed as a cascade in a layered graph representing territories over time.
[Figure: layered graph with one column of patch nodes (i, j, k, l, m) per time step; the target nodes are the territories at the final time step.]
Management Actions
• Conserving parcels adds nodes to the network to create new pathways for the cascade.
[Figure: the initial network, then the network after adding Parcel 1 and Parcel 2.]
Cascade Optimization Problem
Given:
• Patch network
  – Initially occupied territories
  – Colonization and extinction probabilities
• Management actions
  – Already-conserved parcels
  – List of available parcels and their costs
• Time horizon T
• Budget B
Find the set of parcels with total cost at most B that maximizes the expected number of occupied territories at time T (a simple heuristic sketch appears below). Can we make our decision adaptively?
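One simple way to get a feel for this problem (not the algorithm developed in this lecture) is to estimate the objective for a candidate parcel set by Monte Carlo simulation of the cascade, and then pick parcels greedily within the budget. The sketch below uses hypothetical data structures (parcel_territories, colonize, extinct, etc.); all names are illustrative assumptions.

```python
import random

def simulate_cascade(territories, occupied0, colonize, extinct, T):
    """One stochastic rollout of the metapopulation cascade.
    territories : set of territory ids currently available (conserved)
    occupied0   : territories occupied at time 0
    colonize    : colonize[i][j] = per-step prob. that occupied i colonizes empty j
    extinct     : extinct[i]     = per-step prob. that occupied i goes locally extinct
    Returns the number of occupied territories at time T."""
    occupied = set(occupied0) & territories
    for _ in range(T):
        nxt = {i for i in occupied if random.random() > extinct[i]}   # local survival
        for i in occupied:
            for j, p in colonize[i].items():                          # colonization attempts
                if j in territories and j not in occupied and random.random() < p:
                    nxt.add(j)
        occupied = nxt
    return len(occupied)

def expected_occupancy(chosen, parcel_territories, base_territories,
                       occupied0, colonize, extinct, T, n_samples=500):
    """Monte Carlo estimate of E[# occupied territories at time T]."""
    terr = set(base_territories)
    for parcel in chosen:
        terr |= parcel_territories[parcel]     # conserving a parcel adds its territories
    runs = (simulate_cascade(terr, occupied0, colonize, extinct, T)
            for _ in range(n_samples))
    return sum(runs) / n_samples

def greedy_selection(parcels, costs, budget, **model):
    """Greedy heuristic: repeatedly add the affordable parcel with the best
    marginal gain in expected occupancy per unit cost."""
    chosen, spent = set(), 0.0
    while True:
        base = expected_occupancy(chosen, **model)
        best, best_ratio = None, 0.0
        for p in set(parcels) - chosen:
            if spent + costs[p] > budget:
                continue
            gain = expected_occupancy(chosen | {p}, **model) - base
            if gain / costs[p] > best_ratio:
                best, best_ratio = p, gain / costs[p]
        if best is None:
            return chosen
        chosen.add(best)
        spent += costs[best]
```

The greedy cost-benefit rule is only a baseline: it is not guaranteed to find the optimal parcel set, but it illustrates the budgeted selection problem the slides set up.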
Sequential decision making
• We have a system that changes state over time
• We can (partially) control the system's state transitions by taking actions
• The problem gives an objective that specifies which states (or state sequences) are more or less preferred
• Problem: at each time step, select an action so as to optimize the overall (long-term) objective
  – Produce the most preferred sequences of states
Discounted Rewards/Costs
An assistant professor gets paid, say, 20K per year. How much, in total, will the A.P. earn in their life?
20 + 20 + 20 + 20 + 20 + … = infinity
What's wrong with this argument?
Discounted Rewards
"A reward (payment) in the future is not worth quite as much as a reward now."
– Because of the chance of obliteration
– Because of inflation
Example: Being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now. Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s Future Discounted Sum of Rewards?
Infinite Sum
Assuming a discount rate of 0.9, how much does the assistant professor get in total?
x = 20 + 0.9·20 + 0.9^2·20 + 0.9^3·20 + …
  = 20 + 0.9(20 + 0.9·20 + 0.9^2·20 + …)
  = 20 + 0.9x
so x = 20/0.1 = 200
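A quick numerical sanity check of this closed form (a sketch; truncating the infinite sum at 1000 terms is arbitrary):

```python
# Geometric-series check: 20 + 0.9*20 + 0.9^2*20 + ... should approach 20 / (1 - 0.9) = 200
total = sum(20 * 0.9 ** n for n in range(1000))  # truncated infinite sum
print(total)             # ~200.0
print(20 / (1 - 0.9))    # 200.0 (closed form)
```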
Discount Factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is
(reward now) + γ(reward in 1 time step) + γ^2(reward in 2 time steps) + γ^3(reward in 3 time steps) + … (an infinite sum)
Markov System: the Academic Life
[Figure: Markov chain with states A (Assistant Prof, reward 20), B (Associate Prof, reward 60), T (Tenured Prof, reward 100), S (On the Street, reward 10), D (Dead, reward 0); the arrows carry transition probabilities such as 0.6, 0.2, 0.7, 0.3.]
Define:
J_A = expected discounted future rewards starting in state A
J_B = expected discounted future rewards starting in state B
J_T = expected discounted future rewards starting in state T
J_S = expected discounted future rewards starting in state S
J_D = expected discounted future rewards starting in state D
How do we compute J_A, J_B, J_T, J_S, J_D?
Working Backwards
[Figure: the same chain with discount factor 0.9 and the expected discounted future rewards written next to each state: A (Assistant Prof, reward 20): J_A ≈ 151; B (Associate Prof, reward 60): J_B ≈ 247; T (Tenured Prof, reward 100): J_T ≈ 270; S (Out on the Street, reward 10): J_S ≈ 27; D (Dead, reward 0): J_D = 0. D is absorbing (self-loop probability 1.0).]
Reincarnation?
[Figure: the same chain with discount factor 0.9, except that the Dead state is no longer absorbing: from D the system returns to A (Assistant Prof) with probability 0.5 and stays in D with probability 0.5.]
System of Equations
J_A = 20 + 0.9(0.6 J_A + 0.2 J_B + 0.2 J_S)
J_B = 60 + 0.9(0.6 J_B + 0.2 J_S + 0.2 J_T)
J_S = 10 + 0.9(0.7 J_S + 0.3 J_D)
J_T = 100 + 0.9(0.7 J_T + 0.3 J_D)
J_D = 0 + 0.9(0.5 J_D + 0.5 J_A)
Solving a Markov System with Matrix Inversion • Upside: You get an exact answer • Downside: If you have 100,000 states you’re solving a 100,000 by 100,000 system of equations.
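As a concrete sketch, the reincarnation chain above can be solved by writing the system as J = r + γPJ, i.e. (I − γP)J = r, and solving with NumPy. The rewards, transition probabilities, and γ = 0.9 come from the previous slides; the state ordering and code details are my own.

```python
import numpy as np

# States in order: A (Assistant), B (Associate), T (Tenured), S (Street), D (Dead)
r = np.array([20.0, 60.0, 100.0, 10.0, 0.0])
P = np.array([
    [0.6, 0.2, 0.0, 0.2, 0.0],   # A -> A, B, S
    [0.0, 0.6, 0.2, 0.2, 0.0],   # B -> B, T, S
    [0.0, 0.0, 0.7, 0.0, 0.3],   # T -> T, D
    [0.0, 0.0, 0.0, 0.7, 0.3],   # S -> S, D
    [0.5, 0.0, 0.0, 0.0, 0.5],   # D -> A (reincarnation), D
])
gamma = 0.9

# Solve J = r + gamma * P @ J, i.e. (I - gamma * P) J = r
J = np.linalg.solve(np.eye(5) - gamma * P, r)
print(dict(zip("ABTSD", J.round(1))))
```

In practice one calls a linear solver rather than literally inverting the matrix, but the cost is the same order, which is why the 100,000-state case is painful.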
Value Iteration: another way to solve a Markov System
Define
J_1(S_i) = expected discounted sum of rewards over the next 1 time step
J_2(S_i) = expected discounted sum of rewards over the next 2 steps
J_3(S_i) = expected discounted sum of rewards over the next 3 steps
:
J_k(S_i) = expected discounted sum of rewards over the next k steps

J_1(S_i) = (what?)
J_2(S_i) = (what?)
J_{k+1}(S_i) = (what?)
Value Iteration: another way to solve a Markov System
Define
J_1(S_i) = expected discounted sum of rewards over the next 1 time step
J_2(S_i) = expected discounted sum of rewards over the next 2 steps
J_3(S_i) = expected discounted sum of rewards over the next 3 steps
:
J_k(S_i) = expected discounted sum of rewards over the next k steps
N = number of states

J_1(S_i) = r_i
J_2(S_i) = r_i + γ Σ_{j=1..N} p_ij J_1(S_j)
:
J_{k+1}(S_i) = r_i + γ Σ_{j=1..N} p_ij J_k(S_j)
Let's do Value Iteration
[Figure: three-state Markov system with γ = 0.5. SUN (reward +4) moves to SUN or WIND with probability 1/2 each; WIND (reward 0) moves to SUN or HAIL with probability 1/2 each; HAIL (reward -8) moves to WIND or HAIL with probability 1/2 each.]
Fill in J_k(SUN), J_k(WIND), J_k(HAIL) for k = 1, …, 5.
Let's do Value Iteration (γ = 0.5)
[Figure: the same SUN/WIND/HAIL system.]

k | J_k(SUN) | J_k(WIND) | J_k(HAIL)
1 |  4       |  0        |  -8
2 |  5       | -1        | -10
3 |  5       | -1.25     | -10.75
4 |  4.94    | -1.44     | -11
5 |  4.88    | -1.52     | -11.11
Value Iteration for solving Markov Systems
• Compute J_1(S_i) for each i
• Compute J_2(S_i) for each i
:
• Compute J_k(S_i) for each i
As k → ∞, J_k(S_i) → J*(S_i). When to stop? When
max_i |J_{k+1}(S_i) - J_k(S_i)| < ξ
This is faster than matrix inversion (N^3 style) if the transition matrix is sparse.
What if we have a way to interact with the Markov system?
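Here is a minimal Python sketch of value iteration with this stopping rule, applied to the SUN/WIND/HAIL example from the previous slides (γ = 0.5). The early iterates match the table above, and the values converge to roughly (4.8, -1.6, -11.2).

```python
import numpy as np

# SUN, WIND, HAIL example: rewards and transition matrix read off the figure
r = np.array([4.0, 0.0, -8.0])             # rewards for SUN, WIND, HAIL
P = np.array([
    [0.5, 0.5, 0.0],                       # SUN  -> SUN or WIND
    [0.5, 0.0, 0.5],                       # WIND -> SUN or HAIL
    [0.0, 0.5, 0.5],                       # HAIL -> WIND or HAIL
])
gamma, xi = 0.5, 1e-4

J = r.copy()                               # J_1(S_i) = r_i
k = 1
print(k, J.round(2))
while True:
    J_next = r + gamma * P @ J             # J_{k+1} = r + gamma * P J_k
    k += 1
    print(k, J_next.round(2))
    if np.max(np.abs(J_next - J)) < xi:    # stopping rule from the slide
        break
    J = J_next
```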
A Markov Decision Process
You run a startup company. In every state you must choose between Saving money (S) or Advertising (A). γ = 0.9.
[Figure: MDP with four states, Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10), with transition probabilities of 1 or 1/2 for each (state, action) pair.]
Markov Decision Processes
An MDP has…
• A set of states {S_1 ··· S_N}
• A set of actions {a_1 ··· a_M}
• A set of rewards {r_1 ··· r_N} (one for each state)
• A transition probability function P^k_ij = Prob(next state = S_j | current state = S_i and I use action a_k)
On each step:
0. Call the current state S_i
1. Receive reward r_i
2. Choose an action from {a_1 ··· a_M}
3. If you choose action a_k you'll move to state S_j with probability P^k_ij
4. All future rewards are discounted by γ
What's a solution to an MDP? A sequence of actions?
A Policy
A policy is a mapping from states to actions.
Examples (for the startup MDP):
Policy Number 1: PU → S, PF → A, RU → S, RF → A
Policy Number 2: PU → A, PF → A, RU → A, RF → A
[Figure: the Markov chain induced by following each policy.]
• How many possible policies are there in our example?
• Which of the above two policies is best? (A policy-evaluation sketch follows.)
• How do you compute the optimal policy?
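One way to compare the two policies numerically: fixing a policy turns the MDP back into a plain Markov system, which we already know how to solve. In the sketch below the transition probabilities are reconstructed from the startup-company figure (the Advertise transitions out of the rich states in particular are guesses), so treat them as assumptions.

```python
import numpy as np

STATES = ["PU", "PF", "RU", "RF"]
REWARD = np.array([0.0, 0.0, 10.0, 10.0])
GAMMA = 0.9

# P[action][i][j]: transition probabilities, reconstructed from the figure (assumptions)
P = {
    "S": np.array([[1.0, 0.0, 0.0, 0.0],     # PU, Save: stay PU
                   [0.5, 0.0, 0.0, 0.5],     # PF, Save: PU or RF
                   [0.5, 0.0, 0.5, 0.0],     # RU, Save: PU or RU
                   [0.0, 0.0, 0.5, 0.5]]),   # RF, Save: RU or RF
    "A": np.array([[0.5, 0.5, 0.0, 0.0],     # PU, Advertise: PU or PF
                   [0.0, 1.0, 0.0, 0.0],     # PF, Advertise: stay PF
                   [0.5, 0.5, 0.0, 0.0],     # RU, Advertise: PU or PF (assumed)
                   [0.0, 1.0, 0.0, 0.0]])    # RF, Advertise: back to PF (assumed)
}

def evaluate(policy):
    """Value of following a fixed policy: solve J = r + gamma * P_pi J."""
    P_pi = np.array([P[policy[s]][i] for i, s in enumerate(STATES)])
    return np.linalg.solve(np.eye(4) - GAMMA * P_pi, REWARD)

policy1 = {"PU": "S", "PF": "A", "RU": "S", "RF": "A"}
policy2 = {"PU": "A", "PF": "A", "RU": "A", "RF": "A"}
print(evaluate(policy1).round(2))
print(evaluate(policy2).round(2))
```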
Interesting Fact For every M.D.P. there exists an optimal policy. It’s a policy such that for every possible start state there is no better option than to follow the policy.
Computing the Optimal Policy Idea One: Run through all possible policies. Select the best. What’s the problem ??
Optimal Value Function
Define J*(S_i) = expected discounted future rewards, starting from state S_i, assuming we use the optimal policy.
[Figure: a three-state MDP with states S_1 (reward +0), S_2 (reward +3), S_3 (reward +2), and transition probabilities of 1, 1/2, or 1/3 on each arc.]
Question: What is an optimal policy for this MDP? (Assume γ = 0.9.)
What is J*(S_1)? What is J*(S_2)? What is J*(S_3)?
Computing the Optimal Value Function with Value Iteration
Define J_k(S_i) = maximum possible expected sum of discounted rewards I can get if I start at state S_i and I live for k time steps.
Note that J_1(S_i) = r_i
Let's compute J_k(S_i) for our example
Fill in J_k(PU), J_k(PF), J_k(RU), J_k(RF) for k = 1, …, 6.
k | J_k(PU) | J_k(PF) | J_k(RU) | J_k(RF)
1 | 0       | 0       | 10      | 10
2 | 0       | 4.5     | 14.5    | 19
3 | 2.03    | 8.55    | 16.52   | 25.08
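A sketch of value iteration for the MDP, using the update J_{k+1}(S_i) = max_a [ r_i + γ Σ_j P^a_ij J_k(S_j) ] and the same reconstructed (and partly assumed) transition model as in the policy-evaluation sketch earlier; the first three printed rows reproduce the table above.

```python
import numpy as np

REWARD = np.array([0.0, 0.0, 10.0, 10.0])   # PU, PF, RU, RF
GAMMA = 0.9

# P[a, i, j]: probability of moving i -> j under action a (0 = Save, 1 = Advertise);
# reconstructed from the startup-company figure, so treat as assumptions
P = np.array([
    [[1.0, 0.0, 0.0, 0.0],    # Save
     [0.5, 0.0, 0.0, 0.5],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.5]],
    [[0.5, 0.5, 0.0, 0.0],    # Advertise
     [0.0, 1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]],
])

J = REWARD.copy()                            # J_1(S_i) = r_i
print(1, J.round(2))
for k in range(2, 7):
    # J_{k+1}(S_i) = max over actions of [ r_i + gamma * sum_j P^a_ij J_k(S_j) ]
    J = (REWARD + GAMMA * (P @ J)).max(axis=0)
    print(k, J.round(2))
```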