

  1. Lecture 14: MCTS. Emma Brunskill, CS234 Reinforcement Learning, Winter 2018. With many slides from or derived from David Silver.

  2. Class Structure. Last time: Batch RL. This time: MCTS. Next time: Human in the Loop RL.

  3. Table of Contents: 1. Introduction; 2. Model-Based Reinforcement Learning; 3. Simulation-Based Search; 4. Integrated Architectures.

  4. Model-Based Reinforcement Learning. Previous lectures: learn a value function or policy directly from experience. This lecture: learn a model directly from experience and use planning to construct a value function or policy. Integrate learning and planning into a single architecture.

  5. Model-Based and Model-Free RL. Model-Free RL: no model; learn a value function (and/or policy) from experience.

  6. Model-Based and Model-Free RL. Model-Free RL: no model; learn a value function (and/or policy) from experience. Model-Based RL: learn a model from experience; plan a value function (and/or policy) from the model.

  7. Model-Free RL (figure slide).

  8. Model-Based RL (figure slide).

  9. Table of Contents: 1. Introduction; 2. Model-Based Reinforcement Learning; 3. Simulation-Based Search; 4. Integrated Architectures.

  10. Model-Based RL (figure slide).

  11. Advantages of Model-Based RL. Advantages: can efficiently learn a model by supervised learning methods; can reason about model uncertainty (as in upper confidence bound methods for the exploration/exploitation trade-off). Disadvantage: first learn a model, then construct a value function, which gives two sources of approximation error.

  12. MDP Model Refresher. A model $M$ is a representation of an MDP $\langle S, A, P, R \rangle$, parametrized by $\eta$. We assume the state space $S$ and action space $A$ are known, so a model $M = \langle P_\eta, R_\eta \rangle$ represents state transitions $P_\eta \approx P$ and rewards $R_\eta \approx R$: $S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t)$ and $R_{t+1} = R_\eta(R_{t+1} \mid S_t, A_t)$. Typically we assume conditional independence between state transitions and rewards: $\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$.
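As a concrete illustration of the factorization $M = \langle P_\eta, R_\eta \rangle$, a learned model can be wrapped in a small sampling interface. This is a minimal sketch, not anything prescribed by the lecture; the class name, attribute shapes, and `sample` method are illustrative assumptions.

```python
import numpy as np

class ParametricModel:
    """Minimal sketch of a learned MDP model M_eta = <P_eta, R_eta>.

    transition_probs[s, a, s'] approximates P(s' | s, a) and
    expected_rewards[s, a] approximates E[R_{t+1} | s, a], matching the
    conditional-independence assumption on the slide.
    """

    def __init__(self, transition_probs, expected_rewards, rng=None):
        self.P = transition_probs      # shape: (num_states, num_actions, num_states)
        self.R = expected_rewards      # shape: (num_states, num_actions)
        self.rng = rng or np.random.default_rng()

    def sample(self, s, a):
        """Draw S_{t+1} ~ P_eta(. | s, a) and return the modeled reward R_eta(s, a)."""
        s_next = self.rng.choice(len(self.P[s, a]), p=self.P[s, a])
        return int(s_next), self.R[s, a]
```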

  13. Model Learning. Goal: estimate the model $M_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$. This is a supervised learning problem: $S_1, A_1 \to R_2, S_2$; $S_2, A_2 \to R_3, S_3$; $\ldots$; $S_{T-1}, A_{T-1} \to R_T, S_T$. Learning $s, a \to r$ is a regression problem; learning $s, a \to s'$ is a density estimation problem. Pick a loss function (e.g. mean-squared error, KL divergence, ...) and find parameters $\eta$ that minimize the empirical loss.
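To make the supervised-learning view concrete, the sketch below fits a linear expectation model (one of the model classes listed on the next slide) by minimizing mean-squared error. It assumes a feature map $\phi(s, a)$ has already been applied to the experience; the function name and argument layout are illustrative, not from the lecture.

```python
import numpy as np

def fit_linear_expectation_model(features, rewards, next_features):
    """Fit a linear expectation model by least squares (MSE loss).

    features:      (T, d) array of feature vectors phi(s_t, a_t)
    rewards:       (T,)   array of observed rewards r_{t+1}
    next_features: (T, d) array of feature vectors phi(s_{t+1})

    Returns eta = (w_r, W_p) such that r_hat = phi @ w_r (regression) and
    phi'_hat = phi @ W_p (expected next features), minimizing empirical
    squared loss.
    """
    w_r, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    W_p, *_ = np.linalg.lstsq(features, next_features, rcond=None)
    return w_r, W_p
```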

  14. Examples of Models: Table Lookup Model, Linear Expectation Model, Linear Gaussian Model, Gaussian Process Model, Deep Belief Network Model, ...

  15. Table Lookup Model. The model is an explicit MDP, $\hat{P}$, $\hat{R}$. Count visits $N(s, a)$ to each state-action pair: $\hat{P}^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')$ and $\hat{R}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_{t+1}$. Alternatively, at each time-step $t$, record the experience tuple $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$; to sample the model, randomly pick a tuple matching $\langle s, a, \cdot, \cdot \rangle$.
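A minimal sketch of these counting estimates, assuming episodes stored as lists of (state, action, reward, next state) tuples; the function and variable names are illustrative.

```python
from collections import defaultdict

def table_lookup_model(episodes):
    """Estimate P_hat(s' | s, a) and R_hat(s, a) by counting visits, as on the slide.

    episodes: list of episodes, each a list of (s, a, r, s_next) tuples.
    """
    visit_count = defaultdict(int)      # N(s, a)
    next_count = defaultdict(int)       # counts of (s, a, s')
    reward_sum = defaultdict(float)     # running sum of rewards observed from (s, a)

    for episode in episodes:
        for s, a, r, s_next in episode:
            visit_count[(s, a)] += 1
            next_count[(s, a, s_next)] += 1
            reward_sum[(s, a)] += r

    P_hat = {(s, a, s_next): c / visit_count[(s, a)]
             for (s, a, s_next), c in next_count.items()}
    R_hat = {sa: total / visit_count[sa] for sa, total in reward_sum.items()}
    return P_hat, R_hat
```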

  16. AB Example. Two states A, B; no discounting; 8 episodes of experience. We have constructed a table-lookup model from the experience. Recall: for a particular policy, TD with a tabular representation and infinite experience replay will converge to the same value as if we constructed the MLE model and planned with it. Check Your Memory: Will MC methods converge to the same solution?

  17. Planning with a Model. Given a model $M_\eta = \langle P_\eta, R_\eta \rangle$, solve the MDP $\langle S, A, P_\eta, R_\eta \rangle$ using your favourite planning algorithm: value iteration, policy iteration, tree search, ...
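For instance, value iteration on the estimated MDP could look like the following sketch. It assumes the model estimates are stored as dense arrays (shapes in the docstring); the discount factor and tolerance are illustrative choices, not values from the lecture.

```python
import numpy as np

def value_iteration(P_hat, R_hat, gamma=0.95, tol=1e-6):
    """Plan in the estimated MDP <S, A, P_eta, R_eta> by value iteration.

    P_hat: (num_states, num_actions, num_states) estimated transition probabilities
    R_hat: (num_states, num_actions) estimated expected rewards
    """
    num_states, num_actions, _ = P_hat.shape
    V = np.zeros(num_states)
    while True:
        # Q[s, a] = R_hat[s, a] + gamma * sum_s' P_hat[s, a, s'] * V[s']
        Q = R_hat + gamma * (P_hat @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # values and a greedy policy
        V = V_new
```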

  18. Sample-Based Planning. A simple but powerful approach to planning: use the model only to generate samples. Sample experience from the model, $S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t)$ and $R_{t+1} = R_\eta(R_{t+1} \mid S_t, A_t)$, and apply model-free RL to the samples, e.g. Monte-Carlo control, Sarsa, or Q-learning. Sample-based planning methods are often more efficient.
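A minimal sketch of sample-based planning with Q-learning, assuming a sampling interface like the ParametricModel sketched earlier; the hyperparameters and function name are illustrative, and terminal states are not handled (continuing-task simplification).

```python
import numpy as np

def q_planning(model, num_states, num_actions, start_state,
               steps=10_000, alpha=0.1, gamma=0.95, epsilon=0.1, rng=None):
    """Apply model-free Q-learning to experience sampled from the learned model."""
    rng = rng or np.random.default_rng()
    Q = np.zeros((num_states, num_actions))
    s = start_state
    for _ in range(steps):
        # epsilon-greedy action selection on simulated experience
        a = int(rng.integers(num_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r = model.sample(s, a)                     # S' ~ P_eta, R = R_eta
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```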

  19. Back to the AB Example. Construct a table-lookup model from real experience, then apply model-free RL to sampled experience. Real experience: A,0,B,0; B,1; B,1; B,1; B,1; B,1; B,1; B,0. Sampled experience: B,1; B,0; B,1; A,0,B,1; B,1; A,0,B,1; B,1; B,0. e.g. Monte-Carlo learning on the sampled experience: V(A) = 1, V(B) = 0.75. Check Your Memory: What would MC on the original experience have converged to?
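The Monte-Carlo estimates quoted above can be reproduced with a short first-visit MC evaluation over undiscounted episodes. This is a sketch under an assumed episode format (lists of (state, reward) pairs, e.g. [('A', 0), ('B', 0)] for the episode A,0,B,0); the function name is illustrative.

```python
from collections import defaultdict

def mc_evaluate(episodes):
    """First-visit Monte-Carlo state-value estimates for undiscounted episodes."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # walk backwards so G accumulates the return from each step onward;
        # repeated writes leave the return from the earliest (first) visit
        for state, reward in reversed(episode):
            G += reward
            first_visit_return[state] = G
        for state, G_state in first_visit_return.items():
            returns[state].append(G_state)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

Applied to the eight sampled episodes listed in the item above, this returns V(A) = 1 and V(B) = 0.75, matching the slide.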

  20. Planning with an Inaccurate Model. Given an imperfect model $\langle P_\eta, R_\eta \rangle \neq \langle P, R \rangle$, the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle S, A, P_\eta, R_\eta \rangle$; i.e. model-based RL is only as good as the estimated model. When the model is inaccurate, the planning process will compute a sub-optimal policy. Solution 1: when the model is wrong, use model-free RL. Solution 2: reason explicitly about model uncertainty (see the lectures on exploration/exploitation).

  21. Table of Contents: 1. Introduction; 2. Model-Based Reinforcement Learning; 3. Simulation-Based Search; 4. Integrated Architectures.

  22. Forward Search. Forward search algorithms select the best action by lookahead. They build a search tree with the current state $s_t$ at the root, using a model of the MDP to look ahead. There is no need to solve the whole MDP, just the sub-MDP starting from now.

  23. Simulation-Based Search. A forward-search paradigm using sample-based planning: simulate episodes of experience from now with the model, and apply model-free RL to the simulated episodes.

  24. Simulation-Based Search (2). Simulate episodes of experience from now with the model: $\{S_t^k, A_t^k, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^K \sim M_\nu$. Apply model-free RL to the simulated episodes: Monte-Carlo control → Monte-Carlo search; Sarsa → TD search.

  25. Simple Monte-Carlo Search. Given a model $M_\nu$ and a simulation policy $\pi$: for each action $a \in A$, simulate $K$ episodes from the current (real) state $s_t$: $\{s_t, a, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, \ldots, S_T^k\}_{k=1}^K \sim M_\nu, \pi$. Evaluate actions by mean return (Monte-Carlo evaluation): $Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t^k \xrightarrow{P} q_\pi(s_t, a)$. Select the current (real) action with maximum value: $a_t = \arg\max_{a \in A} Q(s_t, a)$.
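A minimal sketch of simple Monte-Carlo search as described above. It assumes a model with a `sample(s, a)` method, a terminal-state test, and a fixed simulation policy; all of these names, plus the rollout depth cap, are illustrative assumptions rather than lecture specifications.

```python
import numpy as np

def simple_mc_search(model, state, actions, simulation_policy, is_terminal,
                     K=50, max_depth=100, gamma=1.0, rng=None):
    """Evaluate each action by the mean return of K simulated episodes,
    then select the action with maximum value: a_t = argmax_a Q(s_t, a)."""
    rng = rng or np.random.default_rng()
    Q = {}
    for a in actions:
        returns = []
        for _ in range(K):
            s_next, r = model.sample(state, a)     # first step: the action being evaluated
            G, discount, s, depth = r, gamma, s_next, 0
            # roll out the simulation policy pi from the sampled next state
            while not is_terminal(s) and depth < max_depth:
                s, r = model.sample(s, simulation_policy(s))
                G += discount * r
                discount *= gamma
                depth += 1
            returns.append(G)
        Q[a] = float(np.mean(returns))             # Monte-Carlo evaluation of (s_t, a)
    return max(Q, key=Q.get)
```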

  26. Recall Expectimax Tree. If we have an MDP model $M_\nu$, we can compute optimal $q(s, a)$ values for the current state by constructing an expectimax tree. Limitation: the size of the tree scales exponentially with the lookahead horizon.

  27. Monte-Carlo Tree Search (MCTS). Given a model $M_\nu$, build a search tree rooted at the current state $s_t$, sampling actions and next states. Iteratively construct and update the tree by performing $K$ simulation episodes starting from the root state. After the search is finished, select the current (real) action with maximum value in the search tree: $a_t = \arg\max_{a \in A} Q(s_t, a)$.
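To make the loop concrete, here is a compact MCTS sketch over the assumed sampling model. It keeps Q/N statistics for every (state, action) it encounters and uses a UCB1-style selection rule, which are common choices rather than anything prescribed by this slide; standard MCTS additionally expands one new tree node per simulation and uses a separate rollout policy outside the tree. All names and hyperparameters are illustrative.

```python
import math
from collections import defaultdict

def mcts(model, root_state, actions, is_terminal, num_simulations=200,
         max_depth=50, gamma=1.0, c=1.4):
    """Compact MCTS sketch: run simulations from the root, update Q/N
    statistics along each simulated trajectory, and return the root action
    with maximum estimated value."""
    N = defaultdict(int)      # visit counts N(s, a)
    Q = defaultdict(float)    # value estimates Q(s, a)

    def select_action(s):
        # UCB1-style rule: try unvisited actions first, then trade off value vs. exploration
        total = sum(N[(s, a)] for a in actions) or 1
        def score(a):
            if N[(s, a)] == 0:
                return float('inf')
            return Q[(s, a)] + c * math.sqrt(math.log(total) / N[(s, a)])
        return max(actions, key=score)

    def simulate(s, depth):
        if depth >= max_depth or is_terminal(s):
            return 0.0
        a = select_action(s)
        s_next, r = model.sample(s, a)                 # sample next state and reward from the model
        G = r + gamma * simulate(s_next, depth + 1)
        N[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]       # incremental-mean backup of the return
        return G

    for _ in range(num_simulations):
        simulate(root_state, 0)
    return max(actions, key=lambda a: Q[(root_state, a)])
```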
