
CSE-571 AI-based Mobile Robotics: Approximation of POMDPs, Active Localization, and Active Sensing with Reinforcement Learning



1. Approximation of POMDPs: Active Localization
   CSE-571 AI-based Mobile Robotics: Active Sensing and Reinforcement Learning
   Localization so far: passive integration of sensor information.

   Active Localization: Idea
   • Actions: target point relative to the robot
   • Two-dimensional search space
   • Choose the action based on its utility and cost
   • Goal: efficient, autonomous localization by active disambiguation
   [Figure: occupancy map of the 19 m × 26.5 m test environment]

2. Utilities
   • The utility of an action is given by the expected change in uncertainty.
   • Uncertainty is measured by the entropy of the belief:
     H(X) = −∑_x Bel(x) log Bel(x)
   • Utility of action a:
     U(a) = H(X) − E_a[H(X)]
     where the expected entropy after executing a and integrating measurement z is
     E_a[H(X)] = −∑_{z,x} p(z|x) Bel(x|a) log [ p(z|x) Bel(x|a) / p(z|a) ]

   Costs: Occupancy Probabilities
   • Costs are based on occupancy probabilities:
     p_occ(a) = ∑_x Bel(x) p_occ(f_a(x))

   Costs: Optimal Path
   • The cost of an action is given by the cost-optimal path to the target.
   • The cost-optimal path is determined through value iteration:
     C(a) = p_occ(a) + min_b C(b)

   Action Selection
   • Choose the action based on expected utility and cost:
     a* = argmax_a [ U(a) − C(a) ]
   • Execution: follow the cost-optimal path with reactive collision avoidance.
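A minimal sketch of how the utility and cost terms above could be combined for grid-based action selection. This is not from the slides: the motion update predict(), the sensor model p_z_given_x, and the path-cost function path_cost() are assumed to be supplied by the surrounding localization system.

```python
import numpy as np

def entropy(belief):
    """H(X) = -sum_x Bel(x) log Bel(x), skipping zero-probability states."""
    p = belief[belief > 0]
    return -np.sum(p * np.log(p))

def expected_entropy(belief_after_a, p_z_given_x):
    """E_a[H(X)]: expected entropy after executing action a and sensing z.

    belief_after_a : Bel(x | a), shape (n_states,)
    p_z_given_x    : sensor model p(z | x), shape (n_obs, n_states)
    """
    joint = p_z_given_x * belief_after_a              # p(z, x | a)
    p_z = joint.sum(axis=1, keepdims=True)            # p(z | a)
    nz = joint > 0                                    # avoid log(0)
    post = joint[nz] / np.broadcast_to(p_z, joint.shape)[nz]
    return -np.sum(joint[nz] * np.log(post))

def select_action(belief, actions, predict, p_z_given_x, path_cost):
    """a* = argmax_a [ U(a) - C(a) ], with U(a) = H(X) - E_a[H(X)]."""
    h_now = entropy(belief)
    best_action, best_score = None, -np.inf
    for a in actions:
        bel_a = predict(belief, a)                    # motion update: Bel(x | a)
        utility = h_now - expected_entropy(bel_a, p_z_given_x)
        score = utility - path_cost(a)                # C(a): cost-optimal path to target
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```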

3. Experimental Results
   • Random navigation failed in 9 out of 10 test runs.
   • Active localization succeeded in all 20 test runs.

   RL for Active Sensing

   Active Sensing
   • Sensors have limited coverage and range.
   • Question: Where to move / point the sensors?
   • Typical scenario: uncertainty in only one type of state variable
     • Robot location [Fox et al., 98; Kroese & Bunschoten, 99; Roy & Thrun, 99]
     • Object / target location(s) [Denzler & Brown, 02; Kreucher et al., 04; Chung et al., 04]
   • Predominant approach: minimize expected uncertainty (entropy).

   Active Sensing in Multi-State Domains
   • Uncertainty in multiple, different state variables
     • RoboCup: robot and ball location, relative goal location, …
   • Which uncertainties should be minimized?
   • The importance of the uncertainties changes over time:
     • The ball location has to be known very accurately before a kick.
     • Accuracy is not important if the ball is on the other side of the field.
   • Has to consider sequences of sensing actions!
   • RoboCup: typically uses hand-coded strategies.

4. Converting Beliefs to Augmented States
   • State variables plus uncertainty variables form the augmented state.
   [Figures: (a)–(d) belief, state variables, uncertainty variables, and the resulting augmented state; projected uncertainty of the goal orientation]

   Why Reinforcement Learning?
   • Model-free approach:
     • No accurate model of the robot and the environment.
     • Particularly difficult to assess how (projected) entropies evolve over time.
   • Possible to simulate the robot and the noise in actions and observations.

   Least-Squares Policy Iteration
   • Approximates the Q-function by a linear function of state features:
     Q̂(s, a; w) = ∑_{j=1}^{k} φ_j(s, a) w_j
   • No discretization needed.
   • No iterative procedure needed for policy evaluation.
   • Off-policy: can re-use samples. [Lagoudakis and Parr '01, '03]
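To illustrate the belief-to-augmented-state projection above, here is a small sketch, not taken from the slides, that turns a particle-filter belief into an augmented state of estimated state variables plus per-variable entropies. The particle layout, the use of 1-D marginal histograms, and the binning are assumptions made for this example.

```python
import numpy as np

def histogram_entropy(samples, bins, value_range):
    """Entropy of a 1-D marginal, estimated from a histogram over the samples."""
    hist, _ = np.histogram(samples, bins=bins, range=value_range)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def augment(particles):
    """Project a particle belief onto an augmented state
    (estimated state variables + entropies of the individual uncertainties).

    particles: array with hypothetical columns
               [ball_x, ball_y, robot_x, robot_y, goal_theta].
    """
    mean = particles.mean(axis=0)
    d_ball = np.hypot(mean[0] - mean[2], mean[1] - mean[3])        # distance to ball
    theta_ball = np.arctan2(mean[1] - mean[3], mean[0] - mean[2])  # ball orientation
    H_ball = histogram_entropy(particles[:, 0], bins=20, value_range=(-3.0, 3.0))
    H_robot = histogram_entropy(particles[:, 2], bins=20, value_range=(-3.0, 3.0))
    H_goal = histogram_entropy(particles[:, 4], bins=20, value_range=(-np.pi, np.pi))
    return np.array([d_ball, theta_ball, H_ball, H_robot, H_goal])
```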

5. Application: Active Sensing for Goal Scoring
   • Task: an AIBO trying to score goals.
   • Sensing actions: looking at the ball, the goals, or the markers.
   • Fixed motion control policy: uses the most likely states to dock the robot to the ball, then kicks the ball into the goal.
   • Find the sensing strategy that "best" supports the given control policy.
   [Figure: field with robot, ball, goal, and marker]

   Least-Squares Policy Iteration
   π' ← π_0
   Repeat
     π ← π'
     Estimate the Q-function from samples S:
       w ← LSTD_Q(S, φ, π),  with  Q̂(s, a; w) = ∑_{j=1}^{k} φ_j(s, a) w_j
     Update the policy:
       π'(s) = argmax_{a ∈ A} Q̂(s, a; w)
   Until (π ≈ π')

   Augmented State Space and Features
   • State variables: distance to the ball d_b, ball orientation θ_b
   • Uncertainty variables: entropy of the ball location H_b, entropy of the robot location H_r, entropy of the goal orientation H_g
   • Features: φ(s, a) combines d_b, θ_b, H_b, H_r, H_g, and a constant 1 for each sensing action a
   [Figure: robot, ball, and goal annotated with the state and uncertainty variables]

   Experiments
   • Strategy learned in simulation.
   • An episode ends when the robot:
     • scores (reward +5)
     • misses (reward 1.5 – 0.1)
     • loses track of the ball (reward −5)
     • fails to dock / accidentally kicks the ball away (reward −5)
   • Applied to the real robot.
   • Compared with two hand-coded strategies:
     • Panning: the robot periodically scans.
     • Pointing: the robot periodically looks up at the markers / goals.
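The LSPI loop shown above maps directly onto code. The sketch below follows the same Repeat/Until structure using LSTD-Q and a greedy policy update; the sample format (s, a, r, s'), the feature function phi, and the hyperparameters are assumptions, not the authors' implementation.

```python
import numpy as np

def lstd_q(samples, phi, policy, k, gamma=0.95, reg=1e-6):
    """LSTD-Q: estimate weights w by solving A w = b from samples (s, a, r, s')."""
    A = reg * np.eye(k)          # small regularizer keeps A invertible
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))     # next action chosen by the policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi, actions, k, gamma=0.95, max_iters=20, tol=1e-4):
    """Least-Squares Policy Iteration: alternate LSTD-Q and greedy improvement."""
    w = np.zeros(k)

    def greedy(s):
        # pi'(s) = argmax_a Q_hat(s, a; w) with the current weights w
        return max(actions, key=lambda a: phi(s, a) @ w)

    for _ in range(max_iters):
        w_new = lstd_q(samples, phi, greedy, k, gamma)
        converged = np.linalg.norm(w_new - w) < tol  # "until pi ~ pi'"
        w = w_new
        if converged:
            break
    return w, greedy
```

Because LSTD-Q is off-policy, the same sample set S can be re-used in every iteration, which is the property the slides highlight.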

6. Rewards and Success Ratio (simulation)
   [Plots: average reward and success ratio over 0–700 training episodes for the Learned, Pointing, and Panning strategies]

   Learned Strategy
   • Initially, the robot learns to dock (it only looks at the ball).
   • Then, the robot learns to look at the goal and the markers:
     • it looks at the ball while docking,
     • briefly before docking, it adjusts by looking at the goal,
     • it prefers looking at the goal instead of the markers for location information.

   Results on Real Robots
   • 45 episodes of goal kicking:

     Strategy   Goals   Misses   Avg. Miss Distance   Kick Failures
     Learned    31      10       6 ± 0.3 cm           4
     Pointing   22      19       9 ± 2.2 cm           4
     Panning    15      21       22 ± 9.4 cm          9

7. Adding Opponents
   • Additional features: ball velocity, knowledge about the other robots.
   • The robot learned to look at the ball when an opponent is close to it, thereby avoiding losing track of the ball.
   [Figure: field with robot, opponent, ball, and goal]

   Learning With Opponents
   [Plot: ratio of episodes in which the ball is lost over 0–700 training episodes, comparing learning with pre-trained data, learning from scratch, and the pre-trained strategy alone]

   Summary
   • Learned effective sensing strategies that make good trade-offs between uncertainties.
   • Results on a real robot show improvements over carefully tuned, hand-coded strategies.
   • The augmented MDP (with projections) is a good approximation for RL.
   • LSPI is well suited for RL on augmented state spaces.
