  1. Learning in Robotic Systems. Robotic Agents @ Allegheny College. Janyl Jumadinova, November 27, 2019

  2. Reinforcement Learning
     - Basic idea: receive feedback in the form of rewards.
     - Agent's utility is defined by the reward function.
     - Must (learn to) act so as to maximize expected rewards.

  3. Reinforcement Learning
     Agents can use:
     - Model-based learning: model the other agents and compute the optimal action based on this model and knowledge of the reward structure (the agent attempts to learn a model of its environment), or
     - Model-free learning: directly learn the expected utility (probability · payoff) of actions in a given state.

  4. Model-free reinforcement learning
     - Idea: learn how to act without explicitly learning the transition probabilities P(s' | s, a).
     - Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s:
       V(s) = max_a Q(s, a)
       Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a'} Q(s', a') - Q(s, a))
     - α (alpha), the learning rate: the extent to which the Q-values are updated in every iteration.
     - γ (gamma), the discount rate: how much importance we give to future rewards.
     - Selected action (policy): π(s) = argmax_a Q(s, a)
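A minimal sketch of this update rule in Python (the variable names, function name, and hyperparameter values here are illustrative, not from the slides; Q is a NumPy array indexed by state and action):

    import numpy as np

    n_states, n_actions = 500, 6          # Gym's Taxi sizes (see the later slides)
    Q = np.zeros((n_states, n_actions))   # Q-table, initialized to zero

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.6):
        """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
        td_target = r + gamma * np.max(Q[s_next])  # best utility of the resulting state
        Q[s, a] += alpha * (td_target - Q[s, a])   # update by a fraction alpha of the error
        return Q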

  5. Q-learning
     - At each step s, choose the action a which maximizes the function Q(s, a).
     - Q is the estimated utility function: it tells us how good an action is in a given state.
     - Q(s, a) = immediate reward for taking the action + best utility Q of the resulting state (discounted by γ).

  6. Gym’s Taxi Environment
     https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py
     - 5x5 grid = 25 possible taxi locations
     - Four locations for pick up and drop off: R (0,0): 0, G (0,4): 1, Y (4,0): 2, B (4,3): 3
     - The passenger can be at any of the four locations or inside the taxi: 5 passenger states
     - 5 x 5 x 5 x 4 = 500 possible states (the state space)
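As a sketch of how these 500 states are indexed (using env.encode from the taxi.py file linked above, and assuming the Gym API of this era, where the environment is named Taxi-v2; newer gym releases call it Taxi-v3):

    import gym

    env = gym.make("Taxi-v2").env   # .env removes the default step limit (see slide 12)
    # (taxi_row, taxi_col, passenger_location, destination) -> flat state index
    state = env.encode(3, 1, 2, 0)  # taxi at (3, 1), passenger at Y, destination R
    print(state)                    # 328, the state inspected on slide 10
    env.s = state                   # put the environment into this state
    env.render()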

  7. Gym’s Taxi Environment
     - Filled square: the taxi
     - Yellow square: taxi without a passenger
     - Green square: taxi with a passenger
     - Pipe (|): a wall
     - Blue letter: current passenger pick-up location
     - Purple letter: current destination

  8. Gym’s Taxi Environment
     - Six possible actions (the action space): 0 = south, 1 = north, 2 = east, 3 = west, 4 = pickup, 5 = dropoff
     - Penalty of -1 for hitting walls
     - Reward of +20 for a successful drop off
     - Reward (penalty) of -1 for every time-step taken
     - Reward (penalty) of -10 for wrong pick up and drop off actions
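For example, one interaction step with this action space (a sketch assuming the pre-0.26 Gym API, where reset returns a plain state and step returns a 4-tuple):

    import gym

    env = gym.make("Taxi-v2").env
    state = env.reset()                 # random initial state
    action = env.action_space.sample()  # one of the six actions, chosen at random
    next_state, reward, done, info = env.step(action)
    print(reward)                       # -1 for a move, -10 for an illegal pickup/dropoff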

  9. Gym’s Taxi Environment
     Problem statement (from the gym documentation): There are 4 locations (labeled by different letters). The task of the taxi robot is to pick up the passenger at one location and drop the passenger off at another. The taxi robot receives +20 points for a successful drop-off and loses 1 point for every time-step it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.

  10. The Reward Table
     - An initial reward table, called P, is created when the Taxi environment is initialized.
     - P is a matrix with rows = number of states and columns = number of actions, giving a states x actions matrix.
     - Each row is a dictionary with the structure {action: [(probability, nextstate, reward, done)]}.
     env.P[328]:
     {0: [(1.0, 428, -1, False)],
      1: [(1.0, 228, -1, False)],
      2: [(1.0, 348, -1, False)],
      3: [(1.0, 328, -1, False)],
      4: [(1.0, 328, -10, False)],
      5: [(1.0, 328, -10, False)]}

  11. Gym’s Taxi Environment
     - In this environment, the probability is always 1.0.
     - The nextstate is the state the taxi would be in if it took the action at this index of the dictionary.
     - All the movement actions have a -1 reward.
     - Each successful dropoff is the end of an episode: the done flag indicates when the taxi has dropped off a passenger at the right location.
     - Wall (|): the taxi can't pass through a wall; it remains in the same position if it tries to move through one.
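A minimal episode loop (sketch) driven by the done flag, with a random policy just for illustration:

    import gym

    env = gym.make("Taxi-v2").env
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()            # random policy, for illustration
        state, reward, done, info = env.step(action)  # done=True after a correct dropoff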

  12. Gym’s Taxi Environment
     - The current newest version of gym forcefully stops the environment after 200 steps (https://github.com/openai/gym/wiki/FAQ). To avoid this, use the unwrapped environment, e.g. env = gym.make("Taxi-v2").env
     - In state 328, the pickup/dropoff actions have a -10 reward.
     - If the taxi were in a state where it has the passenger on board and is on top of the right destination, we would see a reward of 20 at the dropoff action (5).

  13. Exploration vs. Exploitation
     - Exploration: change to a different random strategy.
     - Exploitation: keep selecting the best strategy so far.
     - epsilon: probability of selecting a random action instead of the 'optimal' action.
     - TODO 1: How do changes to epsilon influence the performance of reinforcement learning? (See the sketch below.)
     - TODO 2: What about alpha, gamma, episodes?
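A sketch of epsilon-greedy Q-learning on Taxi that ties these pieces together (the hyperparameter values are illustrative starting points for the TODO questions, not prescribed by the slides):

    import random
    import numpy as np
    import gym

    env = gym.make("Taxi-v2").env
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    alpha, gamma, epsilon = 0.1, 0.6, 0.1  # learning rate, discount rate, exploration rate
    episodes = 10000

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                action = env.action_space.sample()  # explore: random action
            else:
                action = int(np.argmax(Q[state]))   # exploit: best action so far
            next_state, reward, done, _ = env.step(action)
            # Q-learning update from slide 4
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state

Varying epsilon trades off how often the agent tries new actions against how often it uses what it already knows; varying alpha, gamma, and episodes changes how quickly and how reliably the Q-table converges.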
