Dynamic Programming and Reinforcement Learning
Daniel Russo
Columbia Business School, Decision, Risk, and Operations Division
Fall 2017
Supervised Machine Learning
Learning from datasets
A passive paradigm
Focus on pattern recognition
Reinforcement Learning
[Diagram: agent and environment interacting through actions, outcomes, and rewards]
Learning to attain a goal through interaction with a poorly understood environment.
Canonical (and toy) RL environments
Cart Pole
Mountain Car
Impressive new (and toy) RL environments
Atari from pixels
Challenges in Reinforcement Learning
Partial Feedback
◮ The data one gathers depends on the actions one takes.
Delayed Consequences
◮ Rather than maximizing the immediate benefit from the next interaction, one must consider the impact on future interactions.
Dream Application: Management of Chronic Diseases
Various researchers are working on mobile health interventions.
Dream Application: Intelligent Tutoring Systems
*Picture shamelessly lifted from a slide of Emma Brunskill’s.
Dream Application: Beyond Myopia in E-Commerce
Online marketplaces and web services have repeated interactions with users, but are designed to optimize the next interaction.
RL provides a framework for optimizing the cumulative value generated by such interactions.
How useful will this turn out to be?
Deep Reinforcement Learning
RL where function approximation is performed using a deep neural network, instead of using linear models, kernel methods, shallow neural networks, etc.
Justified excitement
◮ Hope is to enable direct training of control systems based on complex sensory inputs (e.g. visual or auditory sensors).
◮ DeepMind’s DQN learns to play Atari from pixels, without learning vision first.
Also a lot of less justified hype.
Warning
1. This is an advanced PhD course.
2. It will be primarily theoretical. We will prove theorems when we can. The emphasis will be on a precise understanding of why methods work and why they may fail completely in simple cases.
3. There are tons of engineering tricks to Deep RL. I won’t cover these.
My Goals
1. Encourage great students to do research in this area.
2. Provide a fun platform for introducing technical tools to operations PhD students.
◮ Dynamic programming, stochastic approximation, exploration algorithms and regret analysis.
3. Sharpen my own understanding.
Tentative Course Outline
1. Foundational Material on MDPs
2. Estimating Long Run Value
3. Exploration Algorithms
* Additional topics as time permits:
Policy gradients and actor-critic
Rollout and Monte-Carlo tree search
Markov Decision Processes: A warmup
On the whiteboard: shortest path in a directed graph.
Imagine that while traversing the shortest path, you discover one of the routes is closed. How should you adjust your path?
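As a concrete companion to the whiteboard warmup, here is a minimal sketch of shortest-path computation by repeated Bellman relaxation; the graph, node names, and arc lengths below are invented for illustration and are not from the lecture.

```python
# Shortest path to a terminal node via repeated Bellman relaxation.
# The graph, node names, and arc lengths are invented for illustration.
graph = {
    "A": {"B": 2, "C": 5},
    "B": {"C": 1, "D": 4},
    "C": {"D": 2},
    "D": {},  # destination node
}

def shortest_distances(graph, goal):
    """Shortest distance from every node to `goal` (assumes no negative cycles)."""
    dist = {node: float("inf") for node in graph}
    dist[goal] = 0.0
    # |V| - 1 rounds of relaxation suffice for convergence.
    for _ in range(len(graph) - 1):
        for node, arcs in graph.items():
            for succ, length in arcs.items():
                dist[node] = min(dist[node], length + dist[succ])
    return dist

print(shortest_distances(graph, "D"))  # {'A': 5.0, 'B': 3.0, 'C': 2.0, 'D': 0.0}
```

The re-planning question on the slide corresponds to deleting an arc from `graph` and rerunning the relaxation from the current node.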
Example: Inventory Control
Stochastic demand
Orders have lead time
Non-perishable inventory
Inventory holding costs
Finite selling season
Example: Inventory Control
Periods $k = 0, 1, 2, \ldots, N$
$x_k \in \{0, \ldots, 1000\}$ current inventory
$u_k \in \{0, \ldots, 1000 - x_k\}$ inventory order
$w_k \in \{0, 1, 2, \ldots\}$ demand (i.i.d. w/ known dist.)
Transition dynamics: $x_{k+1} = \lfloor x_k - w_k \rfloor^+ + u_k$, where $\lfloor z \rfloor^+ = \max(z, 0)$.
Cost function:
$g(x, u, w) = \underbrace{c_H x}_{\text{Holding cost}} + \underbrace{c_L \lfloor w - x \rfloor^+}_{\text{Lost sales}} + \underbrace{c_O(u)}_{\text{Order cost}}$
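A small sketch of the transition and cost primitives above, assuming the lost-sales reading $\lfloor z \rfloor^+ = \max(z, 0)$; the cost coefficients and demand range are made-up placeholders, not values from the lecture.

```python
import random

CAPACITY = 1000           # maximum inventory, matching the state space above
c_H, c_L = 1.0, 4.0       # hypothetical holding and lost-sales costs per unit

def order_cost(u):
    """Hypothetical linear ordering cost c_O(u)."""
    return 2.0 * u

def transition(x, u, w):
    """x_{k+1}: unmet demand is lost (inventory floored at 0), then order u arrives."""
    return max(x - w, 0) + u

def stage_cost(x, u, w):
    """g(x, u, w) = holding cost + lost-sales penalty + ordering cost."""
    return c_H * x + c_L * max(w - x, 0) + order_cost(u)

# One step with a random demand draw; the demand distribution is made up.
x = 50
u = 20                          # feasible since u <= CAPACITY - x
w = random.randint(0, 80)
print(stage_cost(x, u, w), transition(x, u, w))
```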
Example: Inventory Control
Objective:
$\min \; E\left[\sum_{k=0}^{N} g(x_k, u_k, w_k)\right]$
Minimize over what?
◮ Over fixed sequences of controls $u_0, u_1, \ldots$?
◮ No, over policies (adaptive ordering strategies).
Sequential decision making under uncertainty, where
◮ Decisions have delayed consequences.
◮ Relevant information is revealed during the decision process.
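To make "minimize over policies" concrete, here is a hedged sketch of one adaptive ordering strategy, an order-up-to (base-stock) rule; the target level S is an arbitrary illustrative parameter, not something derived in the lecture.

```python
def base_stock_policy(S):
    """Order-up-to rule: order just enough to raise inventory to the target S."""
    def mu(x):
        return max(S - x, 0)    # feasible whenever S <= 1000
    return mu

mu_k = base_stock_policy(S=100)
print(mu_k(30), mu_k(150))      # orders 70 and 0, respectively
```

Unlike a fixed sequence of controls, the order placed depends on the inventory actually observed in period k.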
Further Examples
Dynamic pricing (over a selling season)
Trade execution (with market impact)
Queuing admission control
Consumption-savings models in economics
Search models in economics
Timing of maintenance and repairs
Finite Horizon MDPs: formulation
A discrete time dynamic system
$x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N$
where
$x_k \in X_k$ state
$u_k \in U_k(x_k)$ control
$w_k$ (i.i.d. w/ known dist.)
Assume state and control spaces are finite.
The total cost incurred is
$\sum_{k=0}^{N} \underbrace{g_k(x_k, u_k, w_k)}_{\text{cost in period } k}$
Finite Horizon MDPs: formulation
A policy is a sequence $\pi = (\mu_0, \mu_1, \ldots, \mu_N)$ where $\mu_k : x_k \mapsto u_k \in U_k(x_k)$.
The expected cost of following $\pi$ from state $x_0$ is
$J_\pi(x_0) = E\left[\sum_{k=0}^{N} g_k(x_k, u_k, w_k)\right]$
where $x_{k+1} = f_k(x_k, u_k, w_k)$, $u_k = \mu_k(x_k)$, and $E[\cdot]$ is over the $w_k$'s.
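Given a fixed policy $\pi = (\mu_0, \ldots, \mu_N)$, $J_\pi(x_0)$ can be estimated by simulating the dynamics. The sketch below is a generic Monte Carlo evaluator; the primitives `f`, `g`, and the demand distribution in the example are hypothetical stand-ins for the slide's $f_k$, $g_k$, and $w_k$.

```python
import random

def evaluate_policy(policy, f, g, sample_w, x0, N, num_paths=10_000):
    """Monte Carlo estimate of J_pi(x0) = E[ sum_{k=0}^{N} g_k(x_k, u_k, w_k) ]."""
    total = 0.0
    for _ in range(num_paths):
        x, path_cost = x0, 0.0
        for k in range(N + 1):
            u = policy[k](x)        # u_k = mu_k(x_k)
            w = sample_w()          # i.i.d. disturbance draw
            path_cost += g(k, x, u, w)
            x = f(k, x, u, w)       # x_{k+1} = f_k(x_k, u_k, w_k)
        total += path_cost
    return total / num_paths

# Tiny self-contained example (all numbers hypothetical): an order-up-to-100
# policy in an inventory model with linear holding, lost-sales, and order costs.
N = 5
policy = [lambda x: max(100 - x, 0) for _ in range(N + 1)]
f = lambda k, x, u, w: max(x - w, 0) + u
g = lambda k, x, u, w: 1.0 * x + 4.0 * max(w - x, 0) + 2.0 * u
print(evaluate_policy(policy, f, g, lambda: random.randint(0, 80), x0=0, N=N))
```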
Finite Horizon MDPs: formulation
The optimal expected cost-to-go from $x_0$ is
$J^*(x_0) = \min_{\pi \in \Pi} J_\pi(x_0)$
where $\Pi$ consists of all feasible policies.
We will see the same policy $\pi^*$ is optimal for all initial states, so $J^*(x) = J_{\pi^*}(x)$ for all $x$.
Minor differences with Bertsekas Vol. I
Bertsekas:
Uses a special terminal cost function $g_N(x_N)$.
◮ Can always take $g_N(x, u, w)$ to be independent of $u, w$.
Lets the distribution of $w_k$ depend on $k$ and $x_k$.
◮ This can be embedded in the functions $f_k$, $g_k$.
Principle of Optimality
Regardless of the consequences of initial decisions, an optimal policy should be optimal in the sub-problem beginning in the current state and time period.
Sufficiency: Such policies exist and minimize total expected cost from any initial state.
Necessity: A policy that is optimal from some initial state must behave optimally in any subproblem that is reached with positive probability.
The Dynamic Programming Algorithm
Set
$J^*_N(x) = \min_{u \in U_N(x)} E[g_N(x, u, w)] \quad \forall x \in X_N$
For $k = N-1, N-2, \ldots, 0$, set
$J^*_k(x) = \min_{u \in U_k(x)} E\left[g_k(x, u, w) + J^*_{k+1}(f_k(x, u, w))\right] \quad \forall x \in X_k$
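Below is a sketch of this backward recursion on a generic finite-horizon problem, computing expectations exactly over a finite disturbance distribution and also recording a minimizing control $\mu^*_k(x)$ at each stage; the small inventory instance at the end is invented for illustration.

```python
def backward_induction(X, U, f, g, w_dist, N):
    """Finite-horizon DP: returns cost-to-go tables J[k][x] and policies mu[k][x].

    X[k]           -- iterable of states at stage k (length N + 1)
    U(k, x)        -- iterable of feasible controls at stage k in state x
    f(k, x, u, w)  -- dynamics; must land in X[k + 1]
    g(k, x, u, w)  -- stage cost
    w_dist         -- list of (w, probability) pairs for the i.i.d. disturbance
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N + 1)]

    def q_value(k, x, u):
        # Expected stage cost, plus expected cost-to-go at x_{k+1} when k < N.
        total = 0.0
        for w, p in w_dist:
            cost = g(k, x, u, w)
            if k < N:
                cost += J[k + 1][f(k, x, u, w)]
            total += p * cost
        return total

    for k in range(N, -1, -1):                 # k = N, N-1, ..., 0
        for x in X[k]:
            best_u = min(U(k, x), key=lambda u: q_value(k, x, u))
            mu[k][x] = best_u
            J[k][x] = q_value(k, x, best_u)
    return J, mu

# Invented example: inventory with capacity 3 over N = 2 periods.
N = 2
X = [range(4)] * (N + 1)
U = lambda k, x: range(4 - x)
f = lambda k, x, u, w: max(x - w, 0) + u
g = lambda k, x, u, w: 1.0 * x + 4.0 * max(w - x, 0) + 2.0 * u
w_dist = [(0, 0.3), (1, 0.4), (2, 0.3)]
J, mu = backward_induction(X, U, f, g, w_dist, N)
print(J[0][0], mu[0][0])   # optimal expected cost and first order when starting empty
```

Storing the minimizers `mu[k][x]` yields the single policy $\pi^*$ that is optimal from every initial state, as claimed earlier.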