Metareasoning for Deliberation Time Distribution in the Prost Planner
Ferdinand Badenberg, University of Basel
Bachelor Thesis Presentation, 2017
Outline

1 Motivation
  - Why Metareasoning?
  - Metareasoning Problem
2 Approaches for Metareasoners
  - Hand Made Functions
  - Metareasoner of Lin et al.
  - Improvements for the Metareasoner
3 Results
  - Results for the Hand Made Functions
  - Results for the Formal Procedure
The Cycle

[Figure: the planner thinks, then acts on the environment; the environment returns a reward and the next state.]
Why Metareasoning?

- Optimise the policy in the given time
- Allocate time to think where it is needed
- Act if the decision is easy: one clear best action
- Think if the decision is difficult: multiple actions very close
Metareasoning Problem

Setting:
- Steps from a finite horizon MDP
- Rounds
- Limited time
- Anytime search algorithm

Metareasoner:
- Decides whether to think or to act
- Based on the current values of these factors
- Invoked after each thinking cycle of the algorithm
- Goal: only think when necessary
Hand Made Functions

Idea:
- Allocate a fixed amount of time to each step
- Think for as long as that time is left
- The state of the search algorithm is not considered

Functions tested:
1 Uniform (standard)
2 First
3 Linear
4 Hyperbolic
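The four distributions can be sketched as weight functions over the steps of a round. The exact shapes of First, Linear, and Hyperbolic below are illustrative assumptions, not the thesis's definitions:

```python
def allocate(total_time, num_steps, scheme="uniform"):
    """Split the deliberation time of one round across its steps.

    Assumed shapes: Uniform gives every step the same share, First
    front-loads the first step, Linear decreases linearly, Hyperbolic
    decreases like 1/step.
    """
    if scheme == "uniform":
        weights = [1.0] * num_steps
    elif scheme == "first":
        weights = [float(num_steps)] + [1.0] * (num_steps - 1)
    elif scheme == "linear":
        weights = [float(num_steps - i) for i in range(num_steps)]
    elif scheme == "hyperbolic":
        weights = [1.0 / (i + 1) for i in range(num_steps)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(weights)
    # normalise so the shares always add up to the full time budget
    return [total_time * w / total for w in weights]
```

Whatever the shape, the shares sum to the total budget; the schemes differ only in where the thinking happens within the round.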
Example Time Distribution

[Figure: example time distribution per step for the four functions.]
Formal Metareasoner of Lin et al.

- Idea: think if a change of policy is likely; act if it will stay the same
- Considers only the expected reward estimates (Q-values) of the search algorithm
- Act if Q_act ≥ Q_think
- How are Q_think and Q_act calculated?
Q_think and Q_act

Q_think:
- Expected reward of the policy after another thinking cycle
- Simplification: only the best action is relevant
- Estimate the probability that action a is the best after the next thinking cycle
- Estimate the expected reward given that action a is chosen
- Needed: the next Q-value of each action

Q_act:
- Intuitive idea: the Q-value of the current best action
- But: actually the average of the current and the next Q-value
Estimating the Next Q-values

- Idea: base the next change in Q-values on the previous change in Q-values
- Assumption: the next ∆Q-value is no larger than the previous one
- Draw a random ρ between 0 and 1
- Set ∆Q_next(a) = ∆Q_prev(a) · ρ for all actions a
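Putting the ρ-based estimate and the Q_act ≥ Q_think rule together, the decision can be sketched as follows. All names are assumptions, and the expectation over ρ is approximated here by plain sampling, a stand-in for the exact line segment computation used by the formal procedure:

```python
import random

def think_or_act(q, prev_delta, samples=1000):
    """Sketch of the think-or-act decision.

    q:          current Q-value estimate per action
    prev_delta: previous change of each action's Q-value
    """
    best = max(q, key=q.get)
    # Q_think: expected reward when committing after one more thinking cycle
    q_think = 0.0
    for _ in range(samples):
        rho = random.random()  # rho ~ U(0, 1), shared by all actions
        # assumption: the next change is no larger than the previous one
        next_q = {a: q[a] + prev_delta[a] * rho for a in q}
        q_think += max(next_q.values())
    q_think /= samples
    # Q_act: average of the best action's current Q-value and its
    # expected next Q-value (E[rho] = 1/2)
    q_act = (q[best] + (q[best] + 0.5 * prev_delta[best])) / 2
    return "act" if q_act >= q_think else "think"
```

If the Q-values are no longer changing (all previous deltas zero), Q_think collapses to the current best Q-value and the metareasoner acts; if a close runner-up recently improved a lot, Q_think exceeds Q_act and it keeps thinking.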
Line Segment Example: UCT, Q_think > Q_act

[Figure: line segment plot; scaled values of actions a1, a2, a3 over the unit interval.]
Line Segment Example: UCT, Q_think = Q_act

[Figure: line segment plot; scaled values of actions a1, a2, a3 over the unit interval.]
Improvement: Minimum Thinking Time

- Problem: the no-larger-change assumption often does not hold early on
- Improvement: think for at least T_min seconds
Improvement: C_think+

- Problem: the time left is not considered
- Improvement: subtract a cost of thinking C_think from Q_think
Improvement: C_think

- Problem: stopping with time left is useless
- Improvement: allow a negative C_think
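The three improvements fold into the decision rule as a minimal sketch (parameter names are assumptions):

```python
def should_act(q_act, q_think, time_thought, c_think=0.0, t_min=0.0):
    """Act/think decision with the improvements.

    t_min:   minimum thinking time per step, since the no-larger-change
             assumption is often violated early on
    c_think: cost of thinking subtracted from Q_think; a negative value
             penalises stopping while time is still left
    """
    if time_thought < t_min:
        return False  # keep thinking until the minimum time is reached
    return q_act >= q_think - c_think
```

With c_think > 0 the metareasoner acts earlier; with c_think < 0 it keeps thinking even when Q_act slightly exceeds Q_think.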
Results for the Hand Made Functions

Problem       Uniform  Hyperbolic  First  Linear
Wildfire         74        71        80      81
Triangle         72        65        72      75
Academic         37        37        34      45
Elevators        93        93        91      94
Tamarisk         93        92        91      94
Sysadmin         94        94        90      91
Recon            97        97        96      99
Game             97        93        94      93
Traffic          96        96        97      97
Crossing         87        89        91      99
Skill            91        91        88      93
Navigation       65        58        83      82
Total            83        82        84      86
Results for the Formal Procedure

Problem       Uniform  Lin et al.  Minimum  C_think+  C_think
Wildfire         60        90         86        95       68
Triangle         78        67         62        59       68
Academic         39        32         36        35       38
Elevators        98        71         83        83       97
Tamarisk         68        86         90        92       96
Sysadmin        100        36         67        74       82
Recon            56        75         75        97       98
Game             97        64         82        86       96
Traffic          85        90         87        98       99
Crossing         88        58         78        83       89
Skill            25        71         69        86      100
Navigation       82        26         25        28       83
Total            86        56         70        72       83
Summary

- The hand made functions performed very well
- The default metareasoner severely underestimates thinking
- The improvements proved to be very useful
Outlook

- More general hand made functions
- Improve the formal procedure:
  - Consider all previous ∆Q-values
  - Replace the random ρ
- More sophisticated cost of thinking: a combination of the two approaches
Questions?
Appendix: BRTDP vs UCT

BRTDP:
- Used in the original paper
- Cost setting
- Uses an upper bound on the actual Q-value
- Monotonically decreasing

UCT:
- Used by the Prost planner
- Reward setting
- No guarantees
Appendix: BRTDP vs UCT (Visualisation)

[Figure: visualisation of the BRTDP and UCT Q-value estimates.]
Appendix: Line Segment Example: BRTDP, Q_think < Q_act

[Figure: line segment plot; scaled values of actions a1, a2, a3 over the unit interval.]
Appendix: Wildfire, Time per Step

[Figure: thinking time spent per step on the Wildfire problem.]