

  1. Learning to Control Complex Human Motions Using Reinforcement Learning. Libin Liu (http://libliu.info), DeepMotion Inc (http://deepmotion.com)

  2. Physics-based Character Animation: a motion controller sends control signals to a physics engine that simulates the character, producing the animation. Examples: [Gang Beasts], [Totally Accurate Battle Simulator]

  3. Designing Controllers for Locomotion • Hand-crafted control policies [Hodgins et al. 1995] • Simulating abstract models (SIMBICON, IPM, ZMP, ...) [Yin et al. 2007, SIMBICON] • Optimization / policy search [Coros et al. 2010; Tan et al. 2014] • Reinforcement learning, actor-critic [Mordatch et al. 2010; Peng et al. 2017]

  4. Designing Controllers for Complex Motions

  5. Designing Controllers for Complex Motions: track a reference motion clip with a controller.

  6. Tracking Control for Complex Human Motion: the motion clip is converted into open-loop tracking control, augmented with a feedback control policy, and managed by a scheduler.

  7. Reinforcement Learning: guided policy learning produces the feedback policy; deep Q-learning trains the scheduler.

  8. Outline • Construct open-loop control with SAMCON (SAmpling-based Motion CONtrol) • Guided learning of linear feedback policies • Learning to schedule control fragments using deep Q-learning

  9. Tracking Control • PD servo: $\tau = k_p(\hat{q} - q) - k_d\,\dot{q}$
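A minimal sketch of the PD servo above in Python; the gain values and array shapes are assumptions for illustration, not values from the talk.

```python
import numpy as np

def pd_torque(q, q_dot, q_hat, kp=300.0, kd=30.0):
    """PD servo: tau = kp * (q_hat - q) - kd * q_dot, applied per joint DoF."""
    q, q_dot, q_hat = map(np.asarray, (q, q_dot, q_hat))
    return kp * (q_hat - q) - kd * q_dot
```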

  10. Mocap Clips as Tracking Target

  11. Correction with Sampling (figure: a sampled offset is added to the tracking target).

  12. SAMCON • SAmpling-based Motion CONtrol [Liu et al. 2010, 2015] • Converts a motion clip into an open-loop control trajectory • Samples stage by stage from start to end, using particle filtering / sequential Monte Carlo
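The particle-filter structure on this slide can be sketched as below. This is a simplified illustration, not the authors' implementation: `simulate`, `tracking_cost`, and all constants are placeholders, and the real method uses a full physics simulator and a more elaborate tracking cost.

```python
import numpy as np

def simulate(state, pd_target):
    """Placeholder physics step: advances the state toward the PD target."""
    return state + 0.5 * (pd_target - state)

def tracking_cost(state, ref_state):
    return np.linalg.norm(state - ref_state)

def samcon(start_state, ref_poses, n_samples=200, n_keep=20, noise=0.05):
    """Sample, simulate, and resample stage by stage; return the best action sequence."""
    states = [np.asarray(start_state, dtype=float).copy() for _ in range(n_keep)]
    actions = [[] for _ in range(n_keep)]
    for ref in ref_poses:                                    # one stage per reference pose
        candidates = []
        for s, acts in zip(states, actions):
            for _ in range(n_samples // n_keep):
                a = ref + noise * np.random.randn(*ref.shape)   # perturbed PD target
                s_next = simulate(s, a)
                candidates.append((tracking_cost(s_next, ref), s_next, acts + [a]))
        candidates.sort(key=lambda c: c[0])                  # resample: keep the best samples
        states = [c[1] for c in candidates[:n_keep]]
        actions = [c[2] for c in candidates[:n_keep]]
    return actions[0]                                        # open-loop control trajectory
```

Each kept sample seeds the next stage, mirroring the start-to-end chain of samples shown on the slide.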

  13. SAMCON (figure: state vs. time, showing the reference trajectory and sampled control offsets around it).

  14. Sampling & Simulation (figure): at each stage, offsets around the reference actions (PD-control targets) are sampled and simulated forward.

  15. Resampling (figure): simulated samples that track the reference trajectory well are kept; the rest are discarded.

  16. SAMCON Iterations (figure): sampling, simulation, and resampling proceed stage by stage along the reference trajectory.

  17. SAMCON Iterations (figure, continued).

  18. Constructed Open-loop Control Trajectory (figure): the surviving actions (PD-control targets) form the open-loop control trajectory.

  19. Control Reconstruction

  20. Linear Policy $\pi$: $\delta a = M\,\delta s + \hat{a}$, where $\delta a = a - \bar{a}$ is the offset added to the open-loop control trajectory and $\delta s = s - \bar{s}$ is the deviation of the simulated state from the reference.
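A sketch of applying such a linear feedback policy; the symbol names (`M`, `a_hat`) follow the reconstruction above and are assumptions rather than the slide's exact notation.

```python
import numpy as np

def feedback_action(s, s_ref, a_openloop, M, a_hat):
    """Return the PD target after adding the linear feedback correction."""
    delta_s = np.asarray(s) - np.asarray(s_ref)   # state deviation from the reference
    delta_a = M @ delta_s + a_hat                 # corrective action offset
    return np.asarray(a_openloop) + delta_a
```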

  21. For Complex Motions: uniform segmentation of the open-loop control into control fragments, each with a linear feedback policy.

  22. Control Fragment • A short control unit $\mathcal{D} = (\delta t, \hat{m}, \pi)$ • $\delta t \approx 0.1$ seconds long • Open-loop control segment $\hat{m}$ • Linear feedback policy $\pi$

  23. Controller • A chain of control fragments $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_L$
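Slides 22-23 can be summarized with a small data-structure sketch; the field names and the `simulate_fragment` helper are hypothetical stand-ins.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ControlFragment:
    duration: float        # delta_t, roughly 0.1 s of control
    m_hat: np.ndarray      # open-loop control segment (PD targets)
    M: np.ndarray          # linear feedback gain
    a_hat: np.ndarray      # affine feedback offset

def run_controller(fragments, state, simulate_fragment):
    """Execute a chain of control fragments in order."""
    for frag in fragments:
        state = simulate_fragment(state, frag)   # apply one fragment, advance the physics
    return state
```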

  24. Guided Learning of Control Policies: regression of a feedback policy from multiple open-loop solutions.

  25.-27. Guided Learning of Control Policies (build-up): instead of a single regression, guided learning alternates between collecting open-loop solutions and fitting the feedback policy.

  28. Example: Cyclical Motion (figure): the cycle is uniformly segmented into control fragments $\mathcal{D}_k = (\delta t, \hat{m}_k, \pi_k)$ that are executed repeatedly in sequence; SAMCON supplies the open-loop actions.

  29. Example: Cyclical Motion (figure): SAMCON runs yield, at each fragment boundary, pairs of simulated states and sampled actions that are used to fit each fragment's feedback policy.

  30. Policy Update (figure: samples collected for one control fragment).

  31. Policy Update (figure): a linear feedback policy is fit to the collected samples by regression.

  32. Guided Learning Iterations (figure): guided SAMCON generates new samples, which are fit by regression.

  33. Guided Learning Iterations (figure): the process repeats, alternating guided SAMCON and regression.
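The regression step on these slides can be sketched as a linear least-squares fit of each fragment's feedback policy from the (state deviation, action offset) pairs kept by guided SAMCON; this is an illustration under that assumption, not the paper's exact regressor.

```python
import numpy as np

def fit_linear_policy(delta_s, delta_a):
    """delta_s: (N, ds) state deviations; delta_a: (N, da) action offsets.
    Returns the gain M and offset a_hat of the policy delta_a = M @ delta_s + a_hat."""
    X = np.hstack([delta_s, np.ones((delta_s.shape[0], 1))])   # append 1 for the affine term
    W, *_ = np.linalg.lstsq(X, delta_a, rcond=None)
    return W[:-1].T, W[-1]                                      # M is (da x ds), a_hat is (da,)
```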

  34.

  35. Control Graph • A graph whose nodes are control fragments

  36. Control Graph • A graph whose nodes are control fragments • Converted from a motion graph
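A toy illustration of such a control graph; the fragment names and transitions are invented for the example and are not from the talk.

```python
# Nodes are control fragments; edges list the fragments that may follow.
control_graph = {
    "walk_0": ["walk_1"],
    "walk_1": ["walk_0", "turn_0"],   # branch point inherited from the motion graph
    "turn_0": ["walk_0"],
}

def next_fragments(graph, fragment_id):
    """Fragments that can legally be scheduled after the current one."""
    return graph.get(fragment_id, [])
```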

  37.

  38.

  39. Problem of Fixed Time-Indexed Tracking (figure): the simulation can drift away from the time-indexed reference and leave its basin of attraction.

  40. Scheduling (figure: reference, basin of attraction, and simulation, with the tracked fragment rescheduled).

  41. Scheduling: which control fragment should be tracked next?

  42. Deep Q-Learning: learns to perform good actions from raw image input with a deep convolutional network [Mnih et al. 2015, DQN].

  43. A Q-Network for Scheduling: a fully connected network with two hidden layers of 300 ReLUs ($\max(0, x)$) maps the state to Q-values.

  44. A Q-Network for Scheduling. Input: motion state, environmental state, and user command (18~25 DoFs).

  45. A Q-Network for Scheduling. Action set: the control fragments (39~146 actions).

  46. A Q-Network for Scheduling. Output: one Q-value per action.
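The architecture quoted on slides 43-46 (state in, two fully connected layers of 300 ReLUs, one Q-value per control fragment) can be sketched as a plain NumPy forward pass; the weight initialization and example dimensions are placeholders.

```python
import numpy as np

class QNetwork:
    def __init__(self, state_dim, n_fragments, hidden=300, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (state_dim, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, hidden));     self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0.0, 0.1, (hidden, n_fragments)); self.b3 = np.zeros(n_fragments)

    def q_values(self, state):
        h1 = np.maximum(0.0, state @ self.W1 + self.b1)   # ReLU: max(0, x)
        h2 = np.maximum(0.0, h1 @ self.W2 + self.b2)
        return h2 @ self.W3 + self.b3                      # one Q-value per control fragment

net = QNetwork(state_dim=24, n_fragments=60)               # e.g. 18~25 inputs, 39~146 actions
q = net.q_values(np.zeros(24))                             # Q-value per fragment for one state
```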

  47. Training Pipeline: exploration/exploitation in simulation produces rewards; transitions are stored in a replay buffer, and the Q-network is updated with batch SGD.
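A schematic of that pipeline, with `env_reset`, `env_step`, `select_action`, and `sgd_update` as hypothetical stand-ins for the simulation, the exploration strategy, and the Q-network update.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)

def train(env_reset, env_step, select_action, sgd_update, episodes=1000, batch_size=32):
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            action = select_action(state)                        # exploration / exploitation
            next_state, reward, done = env_step(state, action)
            replay_buffer.append((state, action, reward, next_state, done))
            if len(replay_buffer) >= batch_size:
                idx = random.sample(range(len(replay_buffer)), batch_size)
                sgd_update([replay_buffer[i] for i in idx])      # one batch-SGD step on the Q-network
            state = next_state
```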

  48. Reward Function: $R = R_\text{tracking} + R_\text{preference} + R_\text{feedback} + R_\text{task} + R_0$
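A sketch of composing such a reward. The weights, signs, and exact definitions of the individual terms are assumptions kept only to mirror the structure of the formula above; the out-of-sequence penalty follows slide 50, and the preference term is left as a placeholder since the slide does not define it.

```python
def reward(tracking_err, out_of_sequence, feedback_effort, task_err,
           r0=1.0, w_track=1.0, w_seq=0.5, w_fb=0.1, w_task=1.0):
    r_tracking   = -w_track * tracking_err - (w_seq if out_of_sequence else 0.0)
    r_preference = 0.0                       # placeholder for the action-preference term
    r_feedback   = -w_fb * feedback_effort
    r_task       = -w_task * task_err
    return r_tracking + r_preference + r_feedback + r_task + r0
```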

  49. Importance of the Reference Sequence (comparison: original action sequence enforced vs. not enforced).

  50. Tracking Penalty Term: out-of-sequence actions incur a penalty that in-sequence actions do not.

  51. Tracking Exploration Strategy: with probability $\epsilon_1$ select a random action; with probability $\epsilon_2$ select an in-sequence action.
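A sketch of this exploration rule; the probability names, the greedy fallback, and the integer action encoding are assumptions.

```python
import random

def select_action(q_values, in_sequence_action, eps_random=0.1, eps_sequence=0.2):
    u = random.random()
    if u < eps_random:
        return random.randrange(len(q_values))                   # random exploration
    if u < eps_random + eps_sequence:
        return in_sequence_action                                 # follow the reference sequence
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit: greedy on Q-values
```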

  52. Bongo Board Balancing Action Sequence

  53. Effect of Feedback Policy: open-loop control fragments vs. feedback-augmented fragments.

  54. Discover New Transitions

  55. Running

  56. Tripping

  57. Skateboarding

  58. Skateboarding

  59. Walking On A Ball

  60. Push-Recovery

  61. Conclusion. Motion clip → open-loop tracking control → feedback policy → control scheduler.
  References:
  Libin Liu, Michiel van de Panne, and Kangkang Yin. 2016. Guided Learning of Control Graphs for Physics-Based Characters. ACM Trans. Graph. 35, 3, Article 29 (May 2016), 14 pages.
  Libin Liu and Jessica Hodgins. 2017. Learning to Schedule Control Fragments for Physics-Based Characters Using Deep Q-Learning. ACM Trans. Graph. 36, 3, Article 29 (June 2017), 14 pages.

  62. Future Work • Statistical/generative models [Holden et al. 2017] • Control with raw simulation state and terrain information • Active human-object interaction [Peng et al. 2017, DeepLoco]: basketball, soccer, dancing, boxing, martial arts [Heess et al. 2017]

  63. Questions? Libin Liu (http://libliu.info), DeepMotion Inc (http://deepmotion.com)
