Markov Decision Processes & Reinforcement Learning (Lecture 1)
Ron Parr
Duke University

The Winding Path to RL
• Decision Theory
  – Descriptive theory of optimal behavior
• Markov Decision Processes
  – Mathematical/algorithmic realization of Decision Theory
• Reinforcement Learning
  – Application of learning techniques to the challenges of MDPs with numerous or unknown parameters

Covered Today
• Decision Theory
• MDPs
• Algorithms for MDPs
  – Value Determination
  – Optimal Policy Selection
    • Value Iteration
    • Policy Iteration
    • Linear Programming

Decision Theory
What does it mean to make an optimal decision?
• Asked by economists to study consumer behavior
• Asked by MBAs to maximize profit
• Asked by leaders to allocate resources
• Asked in OR to maximize efficiency of operations
• Asked in AI to model intelligence
• Asked (sort of) by any intelligent person every day

Utility Functions
• A utility function is a mapping from world states to real numbers
• Also called a value function
• Rational or optimal behavior is typically viewed as maximizing expected utility:
  max_a Σ_s P(s|a) U(s)
  where a ranges over actions and s over states
  (a small numeric sketch of this rule appears below)

Are Utility Functions Natural?
• Some have argued that people don't really have utility functions
  – What is the utility of the current state?
  – What was your utility at 8:00pm last night?
• Utility elicitation is a difficult problem
• But it's easy to communicate preferences
• Given a plausible set of assumptions about preferences, a consistent utility function must exist
  (a more precise statement of this is a theorem)
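A minimal sketch (editor-added, not from the slides) of the expected-utility rule above. The action names, probabilities P(s|a), and utilities U(s) are made-up illustrative numbers.

```python
import numpy as np

# Hypothetical example: utilities of three world states and, for each action,
# the probability P(s | a) of ending up in each state.
U = np.array([0.0, 10.0, 25.0])            # U(s) for states s0, s1, s2
P = {
    "safe":  np.array([0.1, 0.8, 0.1]),    # P(s | a = "safe")
    "risky": np.array([0.6, 0.0, 0.4]),    # P(s | a = "risky")
}

# Expected utility of each action: sum_s P(s|a) * U(s)
eu = {a: float(p @ U) for a, p in P.items()}
best = max(eu, key=eu.get)                 # argmax_a of expected utility
print(eu, "->", best)                      # {'safe': 10.5, 'risky': 10.0} -> safe
```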
Playing a Game Show
• Assume a series of questions
  – Increasing difficulty
  – Increasing payoff
• Choice:
  – Accept accumulated earnings and quit
  – Continue and risk losing everything
• "Who wants to be a millionaire?"

Swept under the rug today…
• Utility of money (assumed 1:1)
• How to determine costs/utilities
• How to determine probabilities

State Representation (simplified game)
[Figure: chain of states Start → 1 correct → 2 correct → 3 correct, with question payoffs $100, $1,000, $10,000, $100,000. Accumulated winnings in these states are $0, $100, $1,100, $11,100, and $111,100 if all four are answered correctly; a wrong answer at any point leaves $0.]

Making Optimal Decisions
• Work backwards from future to present
• Consider the $100,000 question
  – Suppose P(correct) = 1/10
  – V(stop) = $11,100
  – V(continue) = 0.9 × $0 + 0.1 × $100K = $10K
• The optimal decision STOPS at the last step

Working Recursively
[Figure: the game-show chain with P(correct) = 9/10, 3/4, 1/2, 1/10; a wrong answer ends the game with $0; accumulated winnings at the decision states are $100, $1,100, $11,100. Working backwards gives V = $3,747, $4,163, $5,550, $11.1K.]
(a small backward-induction sketch in Python follows at the end of this page)

Decision Theory Summary
• Provides a theory of optimal decisions
• Principle of maximizing utility
• Easy for small, tree-structured spaces with
  – Known utilities
  – Known probabilities
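An editor-added sketch of the "work backwards" computation for the simplified game show. It uses the slide's own numbers, including the slide's choice of $100,000 as the payoff for continuing on the final question; that choice, and the exact state encoding, are assumptions about the slide's intent.

```python
# Backward induction for the simplified game show.
p_correct   = [9/10, 3/4, 1/2, 1/10]      # P(correct) for questions 1..4
accumulated = [0, 100, 1_100, 11_100]     # winnings held before each question
V = [0.0] * 5
V[4] = 100_000                            # slide's payoff for a correct final answer

for i in reversed(range(4)):              # work backwards from the last question
    stop = accumulated[i]                 # quit and keep current winnings
    cont = p_correct[i] * V[i + 1]        # a wrong answer loses everything ($0)
    V[i] = max(stop, cont)

print(V[:4])   # [3746.25, 4162.5, 5550.0, 11100] — slide rounds to $3,747 / $4,163 / $5,550 / $11.1K
```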
Covered Today
• Decision Theory
• MDPs
• Algorithms for MDPs
  – Value Determination
  – Optimal Policy Selection
    • Value Iteration
    • Policy Iteration
    • Linear Programming

Dealing with Loops
Suppose you can pay $1,000 (from any losing state) to play again.
[Figure: the game-show chain with P(correct) = 9/10, 3/4, 1/2, 1/10 and accumulated winnings $100, $1,100, $11,100; each losing state now has a $-1,000 transition back to the start instead of ending with $0.]

And the solution is…
• Without the replay option ("w/o cheat"): V = $3.7K, $4.1K, $5.6K, $11.1K
• With the pay-$1,000-to-replay option: V = $82.4K, $82.6K, $83.0K, $84.4K
• At each losing state the choice is Return to Start vs. Continue
• Is this optimal?

From Policies to Linear Systems
• Suppose we always pay until we win.
• What is the value of following this policy?
  V(s0) = 0.10(−1000 + V(s0)) + 0.90 V(s1)
  V(s1) = 0.25(−1000 + V(s0)) + 0.75 V(s2)
  V(s2) = 0.50(−1000 + V(s0)) + 0.50 V(s3)
  V(s3) = 0.90(−1000 + V(s0)) + 0.10(111,100)
• How do we find the optimal policy?
  (a numpy sketch that solves this system follows at the end of this page)

The MDP Framework
• State space: S
• Action space: A
• Transition function: P
• Reward function: R
• Discount factor: γ
• Policy: π(s) → a
• Objective: maximize expected, discounted return (decision-theoretic optimal behavior)

Applications of MDPs
• AI/Computer Science
  – Robotic control (Koenig & Simmons, Thrun et al., Kaelbling et al.)
  – Air Campaign Planning (Meuleau et al.)
  – Elevator Control (Barto & Crites)
  – Computation Scheduling (Zilberstein et al.)
  – Control and Automation (Moore et al.)
  – Spoken dialogue management (Singh et al.)
  – Cellular channel allocation (Singh & Bertsekas)
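An editor-added sketch that solves the four Bellman equations above for the "always pay $1,000 and keep playing" policy as a linear system V = A V + b, i.e. (I − A) V = b. The state names s0..s3 are as in the slide; the matrix construction is an assumption about how one might encode it.

```python
import numpy as np

# States s0..s3 = about to face question 1..4. A wrong answer sends you back
# to s0 after paying $1,000; a correct final answer is worth $111,100.
p_wrong = np.array([0.10, 0.25, 0.50, 0.90])

A = np.zeros((4, 4))
b = np.zeros(4)
for i in range(4):
    A[i, 0] = p_wrong[i]                    # wrong -> back to s0 ...
    b[i]    = p_wrong[i] * (-1_000)         # ... after paying the $1,000 penalty
    if i < 3:
        A[i, i + 1] = 1 - p_wrong[i]        # correct -> next question
    else:
        b[i] += (1 - p_wrong[i]) * 111_100  # correct on the last question

V = np.linalg.solve(np.eye(4) - A, b)
print(V.round())                            # ≈ [82470., 82581., 82952., 84433.]
```

These match the $82.4K / $82.6K / $83.0K / $84.4K values quoted on the "And the solution is…" slide.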
Applications of MDPs
• EE/Control
  – Missile defense (Bertsekas et al.)
  – Inventory management (Van Roy et al.)
  – Football play selection (Patek & Bertsekas)
• Agriculture
  – Herd management (Kristensen, Toft)

Applications of MDPs
• Economics/Operations Research
  – Fleet maintenance (Howard, Rust)
  – Road maintenance (Golabi et al.)
  – Packet retransmission (Feinberg et al.)
  – Nuclear plant management (Rothwell & Rust)

The Markov Assumption
• Let S_t be a random variable for the state at time t
• P(S_t | A_{t−1} S_{t−1}, …, A_0 S_0) = P(S_t | A_{t−1} S_{t−1})
• Markov is a special kind of conditional independence
• The future is independent of the past given the current state

Understanding Discounting
• Mathematical motivation
  – Keeps values bounded
  – What if I promise you $0.01 every day you visit me?
• Economic motivation
  – Discount comes from inflation
  – A promise of $1.00 in the future is worth $0.99 today
• Probability of dying
  – Suppose ε probability of dying at each decision interval
  – Transition w/prob ε to a state with value 0
  – Equivalent to a 1−ε discount factor
  (a small numeric check of this equivalence follows at the end of this page)

Discounting in Practice
• Often chosen unrealistically low
  – Faster convergence
  – Slightly myopic policies
• Can reformulate most algorithms for average reward
  – Mathematically uglier
  – Somewhat slower run time

Covered Today
• Decision Theory
• MDPs
• Algorithms for MDPs
  – Value Determination
  – Optimal Policy Selection
    • Value Iteration
    • Policy Iteration
    • Linear Programming
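A small editor-added check of the "probability of dying" view of discounting: an undiscounted agent that earns reward 1 per step but dies with probability ε each step has the same expected return as a γ = 1 − ε discounted agent that lives forever. The numbers ε = 0.05 and the horizon are illustrative assumptions.

```python
# Both sums approximate 1 / (1 - gamma) = 1 / eps = 20.
eps, gamma = 0.05, 0.95
steps = 10_000                    # long enough to approximate the infinite sum

survive  = sum((1 - eps) ** t for t in range(steps))  # expected reward before dying
discount = sum(gamma ** t for t in range(steps))      # discounted return, gamma = 1 - eps
print(survive, discount)          # both ≈ 20.0
```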
Value Determination
Determine the value of each state under policy π:
  V(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V(s')    (Bellman equation)
[Figure: from S1 (R = 1), the policy moves to S2 with probability 0.4 and to S3 with probability 0.6, so]
  V(s1) = 1 + γ(0.4 V(s2) + 0.6 V(s3))

Matrix Form
        [ P(s1|s1,π(s1))  P(s2|s1,π(s1))  P(s3|s1,π(s1)) ]
  P_π = [ P(s1|s2,π(s2))  P(s2|s2,π(s2))  P(s3|s2,π(s2)) ]
        [ P(s1|s3,π(s3))  P(s2|s3,π(s3))  P(s3|s3,π(s3)) ]

  V = γ P_π V + R
How do we solve this system?

Solving for Values
  V = γ P_π V + R
For moderate numbers of states we can solve this system exactly:
  V = (I − γ P_π)^{−1} R
Guaranteed invertible because γ P_π has spectral radius < 1.

Iteratively Solving for Values
  V = γ P_π V + R
For larger numbers of states we can solve this system indirectly:
  V_{i+1} = γ P_π V_i + R
Guaranteed convergent because γ P_π has spectral radius < 1.
(a numpy sketch of both solution methods follows at the end of this page)

Establishing Convergence
• Eigenvalue analysis
• Monotonicity
  – Assume all values start pessimistic
  – One value must always increase
  – Can never overestimate
• Contraction analysis…

Contraction Analysis
• Define the max norm: ||V||_∞ = max_i |V_i|
• Consider V1 and V2 with ||V1 − V2||_∞ = ε
• WLOG say V1 ≤ V2 + ε·1⃗   (1⃗ is the all-ones vector)
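An editor-added sketch of value determination for a fixed policy, showing both the exact solve and the iterative solve from the slides. The transition matrix P_π and reward vector R below are made-up illustrative numbers, not the example from the slides.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.0, 0.4, 0.6],      # row i = P(s' | s_i, pi(s_i))
                 [0.0, 1.0, 0.0],
                 [0.5, 0.0, 0.5]])
R = np.array([1.0, 0.0, 0.5])

# Exact solve:  V = (I - gamma * P_pi)^{-1} R
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R)

# Iterative solve:  V_{i+1} = gamma * P_pi V_i + R
V = np.zeros(3)
for _ in range(1000):
    V = gamma * P_pi @ V + R

print(V_exact, V)                      # the two agree up to numerical error
```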
Contraction Analysis Contd.
• At the next iteration for V2:
  V2' = R + γ P V2
• For V1:
  V1' = R + γ P V1 ≤ R + γ P (V2 + ε·1⃗) = R + γ P V2 + γ P ε·1⃗ = R + γ P V2 + γ ε·1⃗
  (distribute, and note P·1⃗ = 1⃗ since P is stochastic)
• Conclude:
  ||V2' − V1'||_∞ ≤ γε

Importance of Contraction
• Any two value functions get closer
• The true value function V* is a fixed point
• The max-norm distance from V* decreases exponentially quickly with iterations:
  ||V^0 − V*||_∞ = ε  →  ||V^(n) − V*||_∞ ≤ γ^n ε

Covered Today
• Decision Theory
• MDPs
• Algorithms for MDPs
  – Value Determination
  – Optimal Policy Selection
    • Value Iteration
    • Policy Iteration
    • Linear Programming

Finding Good Policies
Suppose an expert told you the "value" of each state:
  V(S1) = 10, V(S2) = 5
[Figure: Action 1 moves to S1 with probability 0.7 and to S2 with probability 0.3; Action 2 moves to S1 with probability 0.5 and to S2 with probability 0.5.]

Improving Policies
• How do we get the optimal policy?
• Need to ensure that we take the optimal action in every state:
  V(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]
• Decision-theoretic optimal choice given V

Value Iteration
We can't solve the system directly with a max in the equation.
Can we solve it by iteration?
  V_{i+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V_i(s') ]
• Called value iteration or simply successive approximation
• Same as value determination, but we can change actions
• Convergence:
  – Can't do eigenvalue analysis (not linear)
  – Still monotonic
  – Still a contraction in max norm (exercise)
  – Converges exponentially quickly
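An editor-added sketch of the value iteration update above. The two-state, two-action MDP is loosely modeled on the "Finding Good Policies" picture; the rewards and the choice of γ are assumptions, not from the slides.

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = P(s' | s, a); both states share the same action dynamics here
P = np.array([[[0.7, 0.3],     # action 1
               [0.7, 0.3]],
              [[0.5, 0.5],     # action 2
               [0.5, 0.5]]])
R = np.array([[1.0, 0.0],      # R[a, s], made-up rewards
              [0.5, 0.5]])

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V      # Q[a, s] = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    V = Q.max(axis=0)          # max over actions

policy = Q.argmax(axis=0)      # greedy policy w.r.t. the converged V
print(V, policy)
```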