  1. Lecture 16: MCTS. Emma Brunskill, CS234 Reinforcement Learning, Winter 2018. With many slides from or derived from David Silver.

  2. Class Structure. Last time: Batch RL. This time: MCTS. Next time: Human in the Loop RL.

  3. Table of Contents: 1. Introduction; 2. Model-Based Reinforcement Learning; 3. Simulation-Based Search; 4. Integrated Architectures.

  4. Model-Based Reinforcement Learning. Previous lectures: learn a value function or policy directly from experience. This lecture: learn a model directly from experience, and use planning to construct a value function or policy. Goal: integrate learning and planning into a single architecture.

  5. Model-Based and Model-Free RL. Model-Free RL: no model; learn a value function (and/or policy) from experience.

  6. Model-Based and Model-Free RL. Model-Free RL: no model; learn a value function (and/or policy) from experience. Model-Based RL: learn a model from experience; plan a value function (and/or policy) from the model.

  7. Model-Free RL (figure).

  8. Model-Based RL (figure).

  9. Table of Contents: 1. Introduction; 2. Model-Based Reinforcement Learning; 3. Simulation-Based Search; 4. Integrated Architectures.

  10. Model-Based RL (figure).

  11. Advantages of Model-Based RL. Advantages: can efficiently learn the model by supervised learning methods; can reason about model uncertainty (as in upper-confidence-bound methods for the exploration/exploitation trade-off). Disadvantage: first learn a model, then construct a value function, so there are two sources of approximation error.

  12. MDP Model Refresher. A model $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parametrized by $\eta$. We will assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known, so a model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ represents state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$: $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$, $R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$. Typically we assume conditional independence between state transitions and rewards: $\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\, \mathbb{P}[R_{t+1} \mid S_t, A_t]$.
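
As a concrete illustration (not from the lecture), here is a minimal sketch of a tabular model object with the factored form above. The class name, the array shapes, and the `step` interface are assumptions made for the sketch.

```python
import numpy as np

class ParametricModel:
    """Minimal tabular stand-in for M = <P_eta, R_eta>.

    P[s, a] is a distribution over next states and R[s, a] an expected
    reward; sampling the next state and reading the reward independently
    mirrors the conditional-independence factorization on the slide.
    """

    def __init__(self, P, R, rng=None):
        self.P = np.asarray(P)   # shape (|S|, |A|, |S|)
        self.R = np.asarray(R)   # shape (|S|, |A|)
        self.rng = rng or np.random.default_rng(0)

    def step(self, s, a):
        s_next = self.rng.choice(self.P.shape[0], p=self.P[s, a])
        return s_next, self.R[s, a]
```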

  13. Model Learning. Goal: estimate the model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$. This is a supervised learning problem: $S_1, A_1 \rightarrow R_2, S_2$; $S_2, A_2 \rightarrow R_3, S_3$; $\ldots$; $S_{T-1}, A_{T-1} \rightarrow R_T, S_T$. Learning $s, a \rightarrow r$ is a regression problem; learning $s, a \rightarrow s'$ is a density estimation problem. Pick a loss function (e.g. mean-squared error, KL divergence, ...) and find the parameters $\eta$ that minimize the empirical loss.
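
To make the supervised-learning view concrete, a hedged sketch: it splits a trajectory into $(s, a) \rightarrow (r, s')$ pairs and fits a linear expectation model by least squares (one of the model classes on the next slide). The helper names and the feature maps `phi` / `psi` are assumptions, not part of the lecture.

```python
import numpy as np

def trajectory_to_pairs(states, actions, rewards):
    """Split one trajectory S1, A1, R2, S2, ... into supervised pairs
    (S_t, A_t) -> (R_{t+1}, S_{t+1}), as on the slide."""
    return [((states[t], actions[t]), (rewards[t], states[t + 1]))
            for t in range(len(states) - 1)]

def fit_linear_models(pairs, phi, psi):
    """Least-squares fit of a linear expectation model:
    E[r | s, a] ~ w^T phi(s, a) and E[psi(s') | s, a] ~ W phi(s, a).
    `phi` (state-action features) and `psi` (next-state features) are
    assumed, user-supplied maps."""
    X = np.stack([phi(s, a) for (s, a), _ in pairs])          # (N, d)
    r = np.array([r_next for _, (r_next, _) in pairs])        # (N,)
    Y = np.stack([psi(s_next) for _, (_, s_next) in pairs])   # (N, k)
    w, *_ = np.linalg.lstsq(X, r, rcond=None)   # reward regression (MSE loss)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # expected next-state features
    return w, W
```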

  14. Examples of Models: Table Lookup Model, Linear Expectation Model, Linear Gaussian Model, Gaussian Process Model, Deep Belief Network Model, ...

  15. Table Lookup Model. The model is an explicit MDP, $\hat{\mathcal{P}}, \hat{\mathcal{R}}$. Count visits $N(s, a)$ to each state-action pair: $\hat{\mathcal{P}}^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')$ and $\hat{\mathcal{R}}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_t$. Alternatively: at each time step $t$, record the experience tuple $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$; to sample from the model, randomly pick a tuple matching $\langle s, a, \cdot, \cdot \rangle$.
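
A possible implementation of the count-based table lookup model above; the class name and interface are my own, not the lecture's.

```python
import numpy as np

class TableLookupModel:
    """Count-based MLE model matching the formulas on the slide."""

    def __init__(self, n_states, n_actions):
        self.N = np.zeros((n_states, n_actions))             # N(s, a)
        self.N_sas = np.zeros((n_states, n_actions, n_states))
        self.R_sum = np.zeros((n_states, n_actions))

    def update(self, s, a, r, s_next):
        self.N[s, a] += 1
        self.N_sas[s, a, s_next] += 1
        self.R_sum[s, a] += r

    def P_hat(self):
        # Empirical transition probabilities; unvisited (s, a) stay zero.
        return self.N_sas / np.maximum(self.N, 1)[:, :, None]

    def R_hat(self):
        # Empirical mean reward per state-action pair.
        return self.R_sum / np.maximum(self.N, 1)
```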

  16. AB Example. Two states A, B; no discounting; 8 episodes of experience. We have constructed a table lookup model from the experience. Recall: for a particular policy, TD with a tabular representation and infinite experience replay will converge to the same value as would be computed by constructing an MLE model and planning with it. Check Your Memory: will MC methods converge to the same solution?

  17. Planning with a Model. Given a model $\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$, solve the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ using your favourite planning algorithm: value iteration, policy iteration, tree search, ...
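
As one instance of "use your favourite planning algorithm", a small value-iteration sketch over the learned tabular model; the array-based interface follows the earlier sketches and is an assumption.

```python
import numpy as np

def value_iteration(P_hat, R_hat, gamma=0.95, tol=1e-8):
    """Plan in the learned MDP <S, A, P_eta, R_eta> by value iteration.

    P_hat: (|S|, |A|, |S|) transition probabilities, R_hat: (|S|, |A|).
    Returns the optimal values and a greedy policy for the learned model.
    """
    n_states, _, _ = P_hat.shape
    V = np.zeros(n_states)
    while True:
        Q = R_hat + gamma * (P_hat @ V)        # (|S|, |A|) Bellman backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```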

  18. Sample-Based Planning. A simple but powerful approach to planning: use the model only to generate samples. Sample experience from the model, $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$ and $R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$, and apply model-free RL to the samples, e.g. Monte-Carlo control, Sarsa, or Q-learning. Sample-based planning methods are often more data efficient.
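
A sketch of sample-based planning under the same assumptions as before: ordinary Q-learning run on transitions drawn from a learned model exposing `step(s, a) -> (s_next, r)` (an interface chosen for these sketches, not fixed by the lecture).

```python
import numpy as np

def q_learning_on_model(model, n_states, n_actions, start_state,
                        episodes=1000, horizon=50, gamma=0.95,
                        alpha=0.1, eps=0.1, rng=None):
    """Sample-based planning: generate experience from the (learned)
    model and apply model-free Q-learning to it."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            # epsilon-greedy behaviour in the simulated MDP
            a = rng.integers(n_actions) if rng.random() < eps else Q[s].argmax()
            s_next, r = model.step(s, a)
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```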

  19. Back to the AB Example. Construct a table-lookup model from real experience, then apply model-free RL to sampled experience. Real experience (8 episodes): A,0,B,0; B,1; B,1; B,1; B,1; B,1; B,1; B,0. Sampled experience (from the model): B,1; B,0; B,1; A,0,B,1; B,1; A,0,B,1; B,1; B,0. E.g. Monte-Carlo learning on the sampled experience gives V(A) = 1, V(B) = 0.75. Check Your Memory: what would MC on the original experience have converged to?
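
To check the numbers quoted on the slide, a small Monte-Carlo evaluation over the sampled episodes; the list literal is my hand transcription of the sampled experience above.

```python
# Each episode is a list of (state, reward) steps; no discounting.
sampled = [
    [("B", 1)], [("B", 0)], [("B", 1)],
    [("A", 0), ("B", 1)],
    [("B", 1)],
    [("A", 0), ("B", 1)],
    [("B", 1)], [("B", 0)],
]

returns = {"A": [], "B": []}
for episode in sampled:
    rewards = [r for _, r in episode]
    for i, (s, _) in enumerate(episode):
        returns[s].append(sum(rewards[i:]))   # return from first visit, gamma = 1

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)   # {'A': 1.0, 'B': 0.75}
```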

  20. Planning with an Inaccurate Model. Given an imperfect model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$, the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$, i.e. model-based RL is only as good as the estimated model. When the model is inaccurate, the planning process will compute a sub-optimal policy. Solution 1: when the model is wrong, use model-free RL. Solution 2: reason explicitly about model uncertainty (see the lectures on exploration/exploitation).

  21. Table of Contents: 1. Introduction; 2. Model-Based Reinforcement Learning; 3. Simulation-Based Search; 4. Integrated Architectures.

  22. Forward Search. Forward search algorithms select the best action by lookahead. They build a search tree with the current state $s_t$ at the root, using a model of the MDP to look ahead. There is no need to solve the whole MDP, just the sub-MDP starting from now.

  23. Simulation-Based Search. A forward search paradigm using sample-based planning: simulate episodes of experience from now with the model, and apply model-free RL to the simulated episodes.

  24. Simulation-Based Search (2). Simulate episodes of experience from now with the model: $\{S_t^k, A_t^k, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^K \sim \mathcal{M}_\nu$. Apply model-free RL to the simulated episodes: Monte-Carlo control → Monte-Carlo search; Sarsa → TD search.

  25. Simple Monte-Carlo Search. Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$: for each action $a \in \mathcal{A}$, simulate $K$ episodes from the current (real) state $s_t$, $\{s_t, a, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^K \sim \mathcal{M}_\nu, \pi$, and evaluate the action by its mean return (Monte-Carlo evaluation): $Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t \xrightarrow{P} q_\pi(s_t, a)$. Then select the current (real) action with maximum value: $a_t = \operatorname{argmax}_{a \in \mathcal{A}} Q(s_t, a)$.
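
A direct transcription of simple Monte-Carlo search into code; `model.step` and `simulation_policy` are assumed interfaces, and a fixed rollout horizon stands in for episode termination.

```python
import numpy as np

def simple_mc_search(model, s_t, actions, simulation_policy, K=100,
                     horizon=50, gamma=1.0, rng=None):
    """For each action, roll out K episodes from the current real state
    under the simulation policy, score the action by its mean return,
    and return the greedy action a_t = argmax_a Q(s_t, a)."""
    rng = rng or np.random.default_rng(0)
    Q = {}
    for a in actions:
        returns = []
        for _ in range(K):
            s, g = model.step(s_t, a)[0], 0.0
            s, r_first = model.step(s_t, a)          # first step uses action a
            g, discount = float(r_first), gamma
            for _ in range(horizon - 1):
                a_sim = simulation_policy(s)          # simulation policy pi
                s, r = model.step(s, a_sim)
                g += discount * r
                discount *= gamma
            returns.append(g)
        Q[a] = np.mean(returns)                       # Monte-Carlo evaluation
    return max(Q, key=Q.get), Q
```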

  26. Recall: Expectimax Tree. If we have an MDP model $\mathcal{M}_\nu$, we can compute optimal $q(s, a)$ values for the current state by constructing an expectimax tree. Limitation: the size of the tree scales as ?

  27. Recall: Expectimax Tree. If we have an MDP model $\mathcal{M}_\nu$, we can compute optimal $q(s, a)$ values for the current state by constructing an expectimax tree. Limitation: the size of the tree scales as $(|\mathcal{S}||\mathcal{A}|)^H$ for horizon $H$.
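
For intuition about the $(|\mathcal{S}||\mathcal{A}|)^H$ blow-up, a small recursive expectimax sketch that enumerates every action and every successor state at each level. The dictionary-based model access (`model.P[s][a]` mapping next states to probabilities, `model.R[s][a]` giving the expected reward) is an assumption for this sketch.

```python
def expectimax_q(model, s, a, depth, gamma=1.0):
    """Exact finite-horizon expectimax backup for q(s, a).

    Each level takes an expectation over successor states and a max over
    actions, so the tree has on the order of (|S||A|)^H nodes for
    horizon H -- the scaling quoted on the slide."""
    if depth == 0:
        return model.R[s][a]
    value = model.R[s][a]
    for s_next, p in model.P[s][a].items():            # expectation over S'
        best = max(expectimax_q(model, s_next, a2, depth - 1, gamma)
                   for a2 in model.P[s_next])           # max over A
        value += gamma * p * best
    return value
```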

  28. Monte-Carlo Tree Search (MCTS). Given a model $\mathcal{M}_\nu$, build a search tree rooted at the current state $s_t$, sampling actions and next states. Iteratively construct and update the tree by performing $K$ simulation episodes starting from the root state. After the search is finished, select the current (real) action with maximum value in the search tree: $a_t = \operatorname{argmax}_{a \in \mathcal{A}} Q(s_t, a)$.
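
The slide describes MCTS generically; one common concrete instantiation is UCT, sketched below under the same assumed `model.step(s, a) -> (s_next, r)` interface and hashable states. The UCB selection rule and uniform rollout policy are choices made for this sketch, not prescribed by the slide.

```python
import math
import random
from collections import defaultdict

def mcts(model, s_root, actions, n_simulations=1000, horizon=50,
         gamma=1.0, c_uct=1.4, rng=None):
    """Minimal UCT-style MCTS: grow a tree from s_root by simulation,
    back up sampled returns into Q(s, a), and return the greedy root action."""
    rng = rng or random.Random(0)
    N = defaultdict(int)        # visit counts N(s, a)
    Q = defaultdict(float)      # action-value estimates Q(s, a)
    Ns = defaultdict(int)       # state visit counts
    in_tree = {s_root}

    def rollout(s, depth):
        # Default (uniform random) policy beyond the tree frontier.
        g, discount = 0.0, 1.0
        for _ in range(depth):
            s, r = model.step(s, rng.choice(actions))
            g += discount * r
            discount *= gamma
        return g

    def simulate(s, depth):
        if depth == 0:
            return 0.0
        if s not in in_tree:                          # expansion + rollout
            in_tree.add(s)
            return rollout(s, depth)
        # Selection: UCB over actions at this tree node.
        a = max(actions, key=lambda a: Q[(s, a)] + c_uct *
                math.sqrt(math.log(Ns[s] + 1) / (N[(s, a)] + 1e-8)))
        s_next, r = model.step(s, a)
        g = r + gamma * simulate(s_next, depth - 1)
        # Backup: incremental mean of simulated returns.
        Ns[s] += 1
        N[(s, a)] += 1
        Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]
        return g

    for _ in range(n_simulations):
        simulate(s_root, horizon)
    return max(actions, key=lambda a: Q[(s_root, a)])  # a_t = argmax Q(s_t, a)
```

Because every simulation starts from the root and reuses the value estimates already in the tree, computation is progressively focused on the most promising branches rather than the full $(|\mathcal{S}||\mathcal{A}|)^H$ expectimax tree.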
