CSE-571 AI-based Mobile Robotics
Active Sensing and Reinforcement Learning

Approximation of POMDPs: Active Localization
• Localization so far: passive integration of sensor information
[Figure: map of a 19 m × 26.5 m environment]

Active Localization: Idea
• Efficient, autonomous localization by active disambiguation

Actions
• Target point relative to robot
• Two-dimensional search space
• Choose action based on utility and cost
Utilities
• Given by change in uncertainty
• Uncertainty measured by entropy
  $H(X) = -\sum_x Bel(x) \log Bel(x)$
  $U(a) = H(X) - E_a[H(X)]$
  $E_a[H(X)] = -\sum_{z,x} p(z \mid x)\, Bel(x \mid a) \log \frac{p(z \mid x)\, Bel(x \mid a)}{p(z \mid a)}$

Costs: Occupancy Probabilities
• Costs are based on occupancy probabilities
  $p_{occ}(a) = \sum_x Bel(x)\, p_{occ}(f_a(x))$

Costs: Optimal Path
• Given by cost-optimal path to the target
• Cost-optimal path determined through value iteration
  $C(a) = p_{occ}(a) + \min_b [C(b)]$

Action Selection
• Choose action based on expected utility and costs
  $a^* = \arg\max_a \big( U(a) - C(a) \big)$
• Execution:
  – cost-optimal path
  – reactive collision avoidance
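The utility and cost terms above combine into a greedy action-selection step. Below is a minimal sketch assuming a discrete (grid) belief and a small set of candidate target points; the helper names predict, sensor_model, and path_cost are illustrative stand-ins for the motion model, the measurement model p(z|x), and the value-iteration path cost C(a), not part of the original system.

```python
import numpy as np

def entropy(p, axis=None):
    """H = -sum p log p, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    logp = np.log(p, out=np.zeros_like(p), where=p > 0)
    return -np.sum(p * logp, axis=axis)

def expected_entropy(bel_a, p_z_given_x):
    """E_a[H(X)] = -sum_{z,x} p(z|x) Bel(x|a) log[p(z|x) Bel(x|a) / p(z|a)]."""
    joint = p_z_given_x * bel_a                       # shape (n_measurements, n_states)
    p_z = joint.sum(axis=1, keepdims=True)            # p(z|a)
    post = np.divide(joint, p_z, out=np.zeros_like(joint), where=p_z > 0)
    return float(np.sum(p_z[:, 0] * entropy(post, axis=1)))

def select_action(bel, actions, predict, sensor_model, path_cost):
    """a* = argmax_a (U(a) - C(a)), with U(a) = H(X) - E_a[H(X)]."""
    h_now = entropy(bel)

    def score(a):
        bel_a = predict(bel, a)                       # Bel(x|a): belief after moving to target a
        utility = h_now - expected_entropy(bel_a, sensor_model(a))
        return utility - path_cost(bel, a)            # C(a): occupancy-weighted path cost

    return max(actions, key=score)
```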
Experimental Results
• Random navigation failed in 9 out of 10 test runs
• Active localization succeeded in all 20 test runs

RL for Active Sensing

Active Sensing
• Sensors have limited coverage & range
• Question: Where to move / point sensors?
• Typical scenario: uncertainty in only one type of state variable
  – Robot location [Fox et al., 98; Kroese & Bunschoten, 99; Roy & Thrun, 99]
  – Object / target location(s) [Denzler & Brown, 02; Kreuchner et al., 04; Chung et al., 04]
• Predominant approach: minimize expected uncertainty (entropy)

Active Sensing in Multi-State Domains
• Uncertainty in multiple, different state variables
  – RoboCup: robot & ball location, relative goal location, …
• Which uncertainties should be minimized?
• Importance of uncertainties changes over time:
  – Ball location has to be known very accurately before a kick.
  – Accuracy not important if the ball is on the other side of the field.
• Has to consider sequences of sensing actions!
• RoboCup: typically uses hand-coded strategies.
Converting Beliefs to Augmented States
• Belief → augmented state
• State variables
• Uncertainty variables
• Projected uncertainty (goal orientation)
[Figure (a)–(d): belief and its augmented-state representation]

Why Reinforcement Learning?
• Model-free approach
• No accurate model of the robot and the environment.
• Particularly difficult to assess how (projected) entropies evolve over time.
• Possible to simulate robot and noise in actions and observations.

Least-squares Policy Iteration
• Approximates Q-function by linear function of state features:
  $\hat{Q}(s, a; w) = \sum_{j=1}^{k} \phi_j(s, a)\, w_j$
• No discretization needed
• No iterative procedure needed for policy evaluation
• Off-policy: can re-use samples
[Lagoudakis and Parr '01, '03]
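As an illustration of the augmented-state idea, the following sketch compresses a belief into its most-likely values plus per-variable (projected) entropies. The dictionary-of-histograms belief representation and all names are assumptions made for this example, not the original implementation.

```python
import numpy as np

def marginal_entropy(weights):
    """Entropy of a discretized marginal, with 0 log 0 := 0."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    logp = np.log(p, out=np.zeros_like(p), where=p > 0)
    return float(-(p * logp).sum())

def augment(belief):
    """belief: dict mapping a variable name (e.g. 'ball_dist', 'goal_theta')
    to a histogram (bin_centers, bin_weights) over that variable.
    Returns the augmented state: most-likely value + entropy per variable."""
    augmented = {}
    for name, (centers, weights) in belief.items():
        w = np.asarray(weights, dtype=float)
        augmented[name] = centers[int(np.argmax(w))]   # most-likely value (state variable)
        augmented["H_" + name] = marginal_entropy(w)   # projected uncertainty (uncertainty variable)
    return augmented
```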
Least-squares Policy Iteration
  $\pi' \leftarrow \pi_0$
  Repeat
    $\pi \leftarrow \pi'$
    • Estimate Q-function from samples S:
      $w^{\pi} \leftarrow LSTD_Q(S, \phi, \gamma, \pi)$
      $\hat{Q}(s, a; w) = \sum_{j=1}^{k} \phi_j(s, a)\, w_j$
    • Update policy:
      $\pi'(s) = \arg\max_{a \in A} \hat{Q}(s, a; w)$
  Until $\pi \approx \pi'$

Application: Active Sensing for Goal Scoring
• Task: AIBO trying to score goals
• Sensing actions: looking at the ball, the goals, or the markers
• Fixed motion control policy: uses most likely states to dock the robot to the ball, then kicks the ball into the goal.
• Find sensing strategy that "best" supports the given control policy.
[Figure: field with robot, ball, goal, and marker]

Augmented State Space and Features
• State variables:
  – Distance to ball
  – Ball orientation
• Uncertainty variables:
  – Entropy of ball location
  – Entropy of robot location
  – Entropy of goal orientation
• Features:
  $\phi(s, a) = (d_b, \theta_b, H_b, H_r, H_{\theta_g}, 1)$
[Figure: robot, ball, and goal geometry]

Experiments
• Strategy learned from simulation
• Episode ends when the robot:
  – Scores (reward +5)
  – Misses (reward 1.5 – 0.1)
  – Loses track of the ball (reward -5)
  – Fails to dock / accidentally kicks the ball away (reward -5)
• Applied to real robot
• Compared with 2 hand-coded strategies:
  – Panning: robot periodically scans
  – Pointing: robot periodically looks up at markers/goals
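A compact version of this LSPI loop, with LSTD-Q as the policy-evaluation step, might look as follows. The sample format (s, a, r, s'), the ridge term, and the convergence test on the weights are assumptions; phi(s, a) would return a feature vector such as (d_b, θ_b, H_b, H_r, H_θg, 1).

```python
import numpy as np

def lstd_q(samples, phi, gamma, policy, k):
    """LSTD-Q: solve A w = b with
    A = sum_i phi(s_i, a_i) (phi(s_i, a_i) - gamma * phi(s'_i, pi(s'_i)))^T,
    b = sum_i r_i * phi(s_i, a_i)."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = np.asarray(phi(s, a), dtype=float)
        f_next = np.asarray(phi(s_next, policy(s_next)), dtype=float)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)   # small ridge term for stability

def lspi(samples, phi, gamma, actions, k, tol=1e-3, max_iter=50):
    """Alternate policy evaluation (LSTD-Q) and greedy improvement until the
    weights -- and hence the policy -- stop changing (pi ~ pi')."""
    def greedy_policy(w):
        return lambda s: max(actions, key=lambda a: float(np.dot(phi(s, a), w)))

    w = np.zeros(k)
    for _ in range(max_iter):
        # Off-policy: the same sample set S is re-used in every iteration.
        w_new = lstd_q(samples, phi, gamma, greedy_policy(w), k)
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w, greedy_policy(w)
```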
Rewards (simulation) / Success Ratio (simulation)
[Plots: average reward and success ratio vs. episodes (0–700) for the Learned, Pointing, and Panning strategies]

Learned Strategy
• Initially, robot learns to dock (only looks at ball)
• Then, robot learns to look at goal and markers
• Robot looks at ball when docking
• Briefly before docking, adjusts by looking at the goal
• Prefers looking at the goal instead of markers for location information

Results on Real Robots
• 45 episodes of goal kicking

              Goals   Misses   Avg. Miss Distance   Kick Failures
  Learned      31      10          6 ± 0.3 cm             4
  Pointing     22      19          9 ± 2.2 cm             4
  Panning      15      21         22 ± 9.4 cm             9
Adding Opponents
• Additional features: ball velocity, knowledge about other robots
[Figure: robot, opponent, ball, and goal]

Learning With Opponents
[Plot: lost-ball ratio vs. episodes (0–700) for learning with pre-trained data, learning from scratch, and the pre-trained strategy]
• Robot learned to look at the ball when an opponent is close to it, thereby avoiding losing track of it.

Summary
• Learned effective sensing strategies that make good trade-offs between uncertainties
• Results on a real robot show improvements over carefully tuned, hand-coded strategies
• Augmented MDP (with projections): good approximation for RL
• LSPI well suited for RL on augmented state spaces