
Decentralized Non-Communicating Multi-agent Collision Avoidance - PowerPoint PPT Presentation



  1. Decentralized Non-Communicating Multi-agent Collision Avoidance with Deep Reinforcement Learning. By Yu Fan Chen, Miao Liu, Michael Everett, and Jonathan P. How. Presenter: Jared Choi

  2. Motivation • Finding a path • Computationally expensive due to • Collision checking • Feasibility checking • Efficiency checking

  3. Motivation • Finding a path • Computationally expensive due to • Collision checking • Feasibility checking • Efficiency checking • Offline learning

  4. Background • A sequential decision making problem can be formulated as a Markov Decision Process (MDP) • M = <S, A, P, R, γ>

  5. Background • A sequential decision making problem can be formulated as a Markov Decision Process (MDP) • M = <S, A, P, R, γ> • S: state space • A: action space • P: state transition model • R: reward function • γ: discount factor
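
A rough sketch (not part of the original slides) of the MDP tuple as a small Python container; all names below are illustrative placeholders, not the paper's notation.

    # Minimal sketch of the MDP tuple M = <S, A, P, R, γ>; the concrete spaces and
    # models are described on the following slides.
    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class MDP:
        state_space: Any        # S: joint states of the two agents
        action_space: Any       # A: permissible velocity vectors
        transition: Callable    # P: state transition model (unknown in this setting)
        reward: Callable        # R: reward function
        gamma: float            # γ: discount factor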

  6. State Space (M = <S, A, P, R, γ>) • S (state space) • The system's state is constructed by concatenating the two agents' individual states • Observable state vector: position (x, y), velocity (x, y), radius • Unobservable state vector: goal position (x, y), preferred speed, heading angle

  7. State Space (M = <S, A, P, R, γ>) • S (state space) • The system's state is constructed by concatenating the two agents' individual states • Observable state vector: position (x, y), velocity (x, y), radius • Unobservable state vector: goal position (x, y), preferred speed, heading angle
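
A minimal sketch of how the joint state could be assembled by concatenation; the field ordering is an assumption for illustration, not taken from the slides.

    import numpy as np

    # Observable part of an agent's state: position, velocity, radius.
    def observable_state(px, py, vx, vy, radius):
        return np.array([px, py, vx, vy, radius])

    # Full state adds the unobservable part: goal position, preferred speed, heading.
    def full_state(px, py, vx, vy, radius, gx, gy, v_pref, heading):
        return np.array([px, py, vx, vy, radius, gx, gy, v_pref, heading])

    # The system's joint state: this agent's full state plus the other agent's
    # observable state.
    def joint_state(own_full, other_observable):
        return np.concatenate([own_full, other_observable])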

  8. Action Space (M = <S, A, P, R, γ>) • A (action space): set of permissible velocity vectors, a(s) = v
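
One common way to realize such an action space is to discretize speeds and headings into a finite set of velocity vectors; the grid below is purely illustrative and not necessarily the discretization used in the paper.

    import numpy as np

    # Illustrative action set: a few speeds up to the preferred speed, times a few
    # headings, plus the option to stop.
    def build_action_space(v_pref, n_speeds=5, n_headings=16):
        actions = [np.zeros(2)]
        for speed in np.linspace(v_pref / n_speeds, v_pref, n_speeds):
            for angle in np.linspace(0.0, 2.0 * np.pi, n_headings, endpoint=False):
                actions.append(np.array([speed * np.cos(angle), speed * np.sin(angle)]))
        return actions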

  9. State Transition Model (M = <S, A, P, R, γ>) • P (state transition model) • A probabilistic state transition model • Determined by the agents' kinematics • Unknown to us

  10. Reward Function (M = <S, A, P, R, γ>) • R (reward function) • Rewards the agent for reaching its goal • Penalizes the agent for getting too close to, or colliding with, the other agent
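
A hedged sketch of a reward with this shape; the constants and thresholds are placeholders, not the paper's values.

    import numpy as np

    # Reward reaching the goal, penalize collisions and near-misses.
    def reward(own_pos, own_radius, other_pos, other_radius, goal_pos,
               goal_tol=0.1, comfort_dist=0.2):
        gap = np.linalg.norm(own_pos - other_pos) - own_radius - other_radius
        if gap < 0:
            return -0.25                                       # collision
        if np.linalg.norm(own_pos - goal_pos) < goal_tol:
            return 1.0                                         # reached the goal
        if gap < comfort_dist:
            return -0.1 * (comfort_dist - gap) / comfort_dist  # uncomfortably close
        return 0.0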

  11. Discount Factor (M = <S, A, P, R, γ>) • γ: discount factor

  12. Value Function • The value of a state • The value depends on γ • γ close to 1: we care about our long-term reward • γ close to 0: we care only about our immediate reward
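
A toy computation (numbers made up for illustration) showing how γ trades off immediate against long-term reward.

    # Discounted return of a reward sequence for a given discount factor γ.
    def discounted_return(rewards, gamma):
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    rewards = [0.0, 0.0, 0.0, 1.0]             # the goal reward arrives 3 steps later
    print(discounted_return(rewards, 0.99))    # ~0.97: far-sighted, future reward still counts
    print(discounted_return(rewards, 0.1))     # 0.001: myopic, future reward is nearly ignored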

  13. Optimal Policy • The best trajectory at a given state

  14. Value Function and Optimal Policy From David Silver’s slide

  15. Value Function and Optimal Policy • Every state s has a value V(s) • Store it in a lookup table • In a grid world: 16 values • In motion planning: infinitely many values (because the state space is continuous) • Solution: approximate the value with a neural network
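
A minimal sketch of such a value-network approximator, assuming a small fully connected network; the layer sizes are placeholders, not the architecture from the paper.

    import torch
    import torch.nn as nn

    # Maps the continuous joint state to a scalar value estimate V(s),
    # replacing the lookup table.
    class ValueNetwork(nn.Module):
        def __init__(self, state_dim, hidden=100):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state):
            return self.net(state)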

  16. Value Function and Optimal Policy From David Silver’s slides

  17. Value Function and Optimal Policy

  18. Collision Avoidance Deep Reinforcement Learning 1. Train the value network using ORCA 2. Train again with deep reinforcement learning

  19. Collision Avoidance Deep Reinforcement Learning 1. Train the value network using ORCA • Why pre-train?

  20. Collision Avoidance Deep Reinforcement Learning 1. Train the value network using ORCA • Why pre-train? - Initializing the neural network is crucial to convergence - We want the network to output something reasonable

  21. Collision Avoidance Deep Reinforcement Learning 1. Train the value network using ORCA • Why pre-train? - Initializing the neural network is crucial to convergence - We want the network to output something reasonable • Generate 500 trajectories as a training set • Each trajectory contains 40 state-value pairs (20,000 pairs in total) • Back-propagate to minimize the loss function
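
A hypothetical sketch of the pre-training step, assuming a standard squared-error regression on the 20,000 ORCA-generated state-value pairs; the slide's exact loss function is not reproduced here.

    import torch
    import torch.nn as nn

    # `states` (20000 x state_dim) and `values` (20000 x 1) are assumed to hold the
    # state-value pairs extracted from the ORCA trajectories.
    def pretrain(value_net, states, values, epochs=50, lr=1e-2):
        optimizer = torch.optim.SGD(value_net.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = loss_fn(value_net(states), values)
            loss.backward()                 # back-propagate to minimize the loss
            optimizer.step()
        return value_net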

  22. Collision Avoidance Deep Reinforcement Learning 1. Train the value network using ORCA 2. Train again with deep reinforcement learning

  23.-39. Collision Avoidance Deep Reinforcement Learning: Train again with deep reinforcement learning (slides 23-39 step through the RL training loop in figures; slide 38 highlights the backpropagation step)
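
A very rough sketch of this refinement stage, assuming a standard value-learning loop with an experience buffer; run_episode and the one-step target below are hypothetical stand-ins for what the slides' figures show.

    import random
    import torch
    import torch.nn as nn

    # value_net is the pre-trained network; run_episode(value_net) is assumed to
    # simulate one episode and return a list of (state, reward, next_state) tuples.
    def rl_finetune(value_net, run_episode, gamma, episodes=1000, batch_size=64):
        optimizer = torch.optim.SGD(value_net.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
        buffer = []                                      # experience replay buffer
        for _ in range(episodes):
            buffer.extend(run_episode(value_net))
            batch = random.sample(buffer, min(batch_size, len(buffer)))
            states = torch.stack([s for s, _, _ in batch])
            rewards = torch.tensor([[float(r)] for _, r, _ in batch])
            next_states = torch.stack([s2 for _, _, s2 in batch])
            with torch.no_grad():
                targets = rewards + gamma * value_net(next_states)
            loss = loss_fn(value_net(states), targets)   # one-step value regression
            optimizer.zero_grad()
            loss.backward()                              # backpropagation update
            optimizer.step()
        return value_net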

  40.-42. Result (result figures)

  43. Q&A

  44. Quiz • Values are updated after each episode (T/F) • The value function needs to be trained with ORCA (T/F) • The ORCA path does not need to be optimal (T/F)
