Learning to Control Complex Human Motions Using Reinforcement Learning
Libin Liu (http://libliu.info)
DeepMotion Inc (http://deepmotion.com)
Physics-Based Character Animation
Controller → motion control signal → physics engine → character animation
Examples: [Gang Beasts], [Totally Accurate Battle Simulator]
Designing Controllers for Locomotion
• Hand-crafted control policies [Hodgins et al. 1995]
• Simulating abstract models: SIMBICON, IPM, ZMP, ... [Yin et al. 2007]
• Optimization / policy search [Coros et al. 2010] [Tan et al. 2014]
• Reinforcement learning, actor-critic [Mordatch et al. 2010] [Peng et al. 2017]
Designing Controllers for Complex Motions
Designing Controllers for Complex Motions: tracking a motion clip with a controller
Tracking Control for Complex Human Motion
Motion clip → open-loop tracking control → feedback control policy → scheduler
Reinforcement Learning
Guided policy learning → feedback policy
Deep Q-learning → control scheduler
Outline
• Constructing open-loop control: SAMCON (Sampling-based Motion Control)
• Guided learning of linear feedback policies
• Learning to schedule control fragments using deep Q-learning
Tracking Control
• PD servo: τ = k_p (q̂ − q) − k_d q̇, where q̂ is the target pose from the clip
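A minimal sketch of the PD servo above, assuming per-joint gains k_p and k_d and a target pose taken from the motion clip; all names and numeric values below are illustrative:

```python
import numpy as np

def pd_servo(q_target, q, q_dot, kp, kd):
    """Per-joint PD servo: tau = kp * (q_target - q) - kd * q_dot."""
    return kp * (q_target - q) - kd * q_dot

# Example: three joints tracking a target pose from the clip
q        = np.array([0.10, -0.25, 0.40])   # current joint angles (rad)
q_dot    = np.array([0.50,  0.00, -0.30])  # current joint velocities (rad/s)
q_target = np.array([0.00, -0.20, 0.50])   # PD-control targets
tau = pd_servo(q_target, q, q_dot, kp=300.0, kd=30.0)
```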
Mocap Clips as Tracking Targets
Correction with Sampling
SAMCON: SAmpling-based Motion CONtrol [Liu et al. 2010, 2015]
• Motion clip → open-loop control trajectory
• Particle filtering / sequential Monte Carlo: samples propagated from Start 1, Start 2, ..., Start n toward the end of the clip
SAMCON (figure: reference state trajectory over time, divided into segments of length δt)
Sampling & Simulation (figure: actions, i.e. PD-control targets, sampled and simulated for each segment)
Resampling (figure: the best simulated states kept as start states for the next segment)
SAMCON Iterations (figure: the sampled actions and resulting state trajectory refined over successive passes)
Constructed Open-loop Control Trajectory (figure: the final action sequence and its simulated state trajectory tracking the reference)
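The sampling, simulation, and resampling steps above can be sketched in code. This is a simplified sketch of one SAMCON pass, assuming callables `simulate_fragment` (advances the physics for one segment) and `tracking_cost` (pose/velocity difference to the reference); the greedy keep-the-best selection here stands in for the cost-proportional resampling of the actual method:

```python
import numpy as np

def samcon_pass(s0, reference, n_samples, samples_per_state, sigma,
                simulate_fragment, tracking_cost, rng):
    """One SAMCON pass over a clip: for each reference segment, sample PD-control
    targets around the reference pose, simulate them, and keep the lowest-cost
    results as the start states of the next segment."""
    states  = [s0] * n_samples
    actions = [[] for _ in range(n_samples)]          # per-sample action history
    for ref_pose in reference:                        # one entry per ~0.1 s segment
        candidates = []
        for s, hist in zip(states, actions):
            for _ in range(samples_per_state):
                a = ref_pose + sigma * rng.standard_normal(ref_pose.shape)
                s_next = simulate_fragment(s, a)      # run the physics for one segment
                candidates.append((tracking_cost(s_next, ref_pose), s_next, hist + [a]))
        candidates.sort(key=lambda c: c[0])           # keep the best n_samples
        states  = [c[1] for c in candidates[:n_samples]]
        actions = [c[2] for c in candidates[:n_samples]]
    return actions[0]                                 # best open-loop action sequence
```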
Control Reconstruction
Linear Policy
π: a = â + M δs, where δs = s − s̄
Open-loop control trajectory â_t → feedback-corrected targets ã_t → simulation
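A sketch of evaluating this linear policy for one fragment, assuming it stores the reference state s̄, the open-loop action â, and the gain matrix M; the dimensions below are illustrative:

```python
import numpy as np

def feedback_action(s, s_bar, a_hat, M):
    """Linear feedback around the open-loop trajectory:
    delta_s = s - s_bar,  a = a_hat + M @ delta_s."""
    return a_hat + M @ (s - s_bar)

# Example with illustrative dimensions: 18-D state, 10-D PD-target action.
s_bar = np.zeros(18); a_hat = np.zeros(10); M = np.zeros((10, 18))
a = feedback_action(np.random.randn(18), s_bar, a_hat, M)  # equals a_hat when M = 0
```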
For Complex Motions
Uniform segmentation → control fragments, each with its own linear feedback policy
Control Fragment F: (δt, m̂, π)
• A short control unit, δt ≈ 0.1 seconds long
• m̂: open-loop control segment
• π: linear feedback policy
Controller
• A chain of control fragments F_1, F_2, ..., F_L
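One possible way to represent a control fragment and a controller in code, following the (δt, m̂, π) structure above; the field names are illustrative:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ControlFragment:
    """A short (~0.1 s) control unit: (dt, m_hat, pi) in the slides' notation."""
    dt: float            # fragment duration, about 0.1 s
    m_hat: np.ndarray    # open-loop PD-control targets for this fragment
    s_bar: np.ndarray    # reference state used by the linear feedback policy
    M: np.ndarray        # feedback gain matrix of the linear policy

    def action(self, s: np.ndarray) -> np.ndarray:
        # pi: linear feedback around the open-loop targets
        return self.m_hat + self.M @ (s - self.s_bar)

# A controller for a clip is a chain of fragments executed in order:
# controller = [fragment_1, fragment_2, ..., fragment_L]; at runtime each
# fragment's action(s) is sent to the PD servo for dt seconds.
```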
Guided Learning of Control Policies
Multiple open-loop solutions → regression → feedback policy
Guided Learning of Control Policies
Multiple open-loop solutions → guided learning → feedback policy
Example: Cyclical Motion (figure: a cycle of control fragments F_1, F_2, F_3, F_4; SAMCON generates state-action samples (s, a) for each fragment)
Policy Update (figure: state-action samples collected for a fragment)
Policy Update: regression fits the linear feedback policy to those samples
Guided Learning Iterations: guided SAMCON → regression
Guided Learning Iterations: guided SAMCON → regression, repeated
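A schematic of one guided-learning iteration, assuming a `guided_samcon` routine that explores with the current policies and returns the accepted (state, action) samples per fragment; the regression here is a plain ridge fit of the gain matrix, used as a stand-in for the regression step:

```python
import numpy as np

def guided_learning_iteration(fragments, guided_samcon):
    """One guided-learning iteration (sketch): collect samples with guided
    SAMCON, then refit each fragment's linear feedback policy by regression."""
    samples = guided_samcon(fragments)                 # {fragment index: (S, A)}
    for k, frag in enumerate(fragments):
        S, A = samples[k]                              # S: (N, ds), A: (N, da)
        frag.s_bar = S.mean(axis=0)
        frag.m_hat = A.mean(axis=0)
        dS, dA = S - frag.s_bar, A - frag.m_hat
        ridge = 1e-3 * np.eye(dS.shape[1])             # regularize the regression
        frag.M = np.linalg.solve(dS.T @ dS + ridge, dS.T @ dA).T   # dA ≈ dS @ M.T
    return fragments
```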
Control Graph
• A graph whose nodes are control fragments
Control Graph
• A graph whose nodes are control fragments
• Converted from a motion graph
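A toy sketch of how such a control graph could be stored; the fragment indices and transitions below are hypothetical and only illustrate the adjacency structure inherited from the motion graph:

```python
# Hypothetical control graph: node = control fragment index,
# edge = allowed transition (mirroring the source motion graph).
control_graph = {
    0: [1],       # walk cycle, first half -> second half
    1: [0, 2],    # second half -> repeat walk, or branch into a turn
    2: [3],       # turn, first half -> second half
    3: [0],       # turn, second half -> back to walking
}

def successors(fragment_id: int):
    """Control fragments that may be scheduled after `fragment_id`."""
    return control_graph.get(fragment_id, [])
```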
Problem of Fixed Time-Indexed Tracking (figure: reference, basin of attraction, simulation)
Scheduling (figure: reference, basin of attraction, simulation)
Scheduling: which control fragment should be executed next?
Deep Q-Learning [Mnih et al. 2015, DQN]
• Learns to perform good actions from raw image input using a deep convolutional network
A Q-Network for Scheduling (figure: fully connected network, state → 300 ReLUs → 300 ReLUs → Q-values)
• Input: motion state, environmental state, user command (18 to 25 DoFs)
• Action set: control fragments (39 to 146 actions)
• Output: one Q-value per control fragment
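A sketch of the scheduler Q-network matching the figure (two fully connected layers of 300 ReLUs, one Q-value output per control fragment), written here as a plain NumPy forward pass; the initialization scheme and default sizes are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class SchedulerQNet:
    """Fully connected Q-network: state -> 300 ReLUs -> 300 ReLUs -> Q-values,
    one output per control fragment in the action set."""
    def __init__(self, state_dim=25, n_actions=146, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((state_dim, 300)) * 0.05; self.b1 = np.zeros(300)
        self.W2 = rng.standard_normal((300, 300)) * 0.05;       self.b2 = np.zeros(300)
        self.W3 = rng.standard_normal((300, n_actions)) * 0.05; self.b3 = np.zeros(n_actions)

    def q_values(self, s):
        h1 = relu(s @ self.W1 + self.b1)
        h2 = relu(h1 @ self.W2 + self.b2)
        return h2 @ self.W3 + self.b3      # Q(s, a) for every control fragment

# Greedy scheduling: pick the fragment with the highest Q-value.
# q = net.q_values(state); next_fragment = int(np.argmax(q))
```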
Training Pipeline
Exploration / exploitation → simulation → reward → replay buffer → batch SGD
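A schematic of that loop, with `env`, `q_net`, `q_target`, and `sgd_step` standing in for the simulation, the Q-network, its target copy, and the optimizer step; none of these names or hyperparameter values come from the original system:

```python
import random
import numpy as np

def train_scheduler(env, q_net, q_target, sgd_step, n_steps,
                    gamma=0.95, batch_size=32, buffer_cap=50_000, epsilon=0.1):
    """Schematic deep Q-learning loop: act epsilon-greedily, store transitions
    in a replay buffer, and regress Q(s, a) toward one-step targets with SGD."""
    buffer, s = [], env.reset()
    for _ in range(n_steps):
        if random.random() < epsilon:
            a = random.randrange(env.n_actions)          # exploration (env.n_actions assumed)
        else:
            a = int(np.argmax(q_net.q_values(s)))        # exploitation
        s_next, r, done = env.step(a)                    # simulate one control fragment
        buffer.append((s, a, r, s_next, done))
        if len(buffer) > buffer_cap:
            buffer.pop(0)
        s = env.reset() if done else s_next
        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            targets = [ri if di else ri + gamma * np.max(q_target.q_values(sn))
                       for (_, _, ri, sn, di) in batch]
            sgd_step(q_net, batch, targets)              # fit Q(s_i, a_i) to the targets
```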
Reward Function: r = E_tracking + E_preference + E_feedback + E_task + r_0
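A trivially small sketch of this reward as the sum shown above; whether each E term acts as a bonus or a penalty, and the value of r_0, are assumptions here:

```python
def scheduler_reward(e_tracking, e_preference, e_feedback, e_task, r0=0.5):
    """Total reward r = E_tracking + E_preference + E_feedback + E_task + r_0;
    the E terms are assumed to be non-positive penalties and r_0 a constant
    bonus (the value 0.5 is illustrative)."""
    return e_tracking + e_preference + e_feedback + e_task + r0
```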
Importance of the Reference Sequence: original sequence enforced vs. not enforced
Tracking Penalty Term: in-sequence actions incur no penalty; out-of-sequence actions are penalized
Tracking Exploration Strategy
• With probability ε₁, select a random action
• With probability ε₂, select the in-sequence action
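A sketch of the action-selection rule with this tracking exploration; the probability names and default values are illustrative:

```python
import random
import numpy as np

def select_action(q_values, in_sequence_action, n_actions,
                  eps_random=0.1, eps_track=0.2):
    """Tracking exploration: with probability eps_random take a random fragment,
    with probability eps_track take the in-sequence fragment, otherwise act
    greedily on the Q-values."""
    u = random.random()
    if u < eps_random:
        return random.randrange(n_actions)
    if u < eps_random + eps_track:
        return in_sequence_action
    return int(np.argmax(q_values))
```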
Bongo Board Balancing Action Sequence
Effect of Feedback Policy: open-loop control fragments vs. feedback-augmented fragments
Discover New Transitions
Running
Tripping
Skateboarding
Skateboarding
Walking On A Ball
Push-Recovery
Conclusion
Motion clip → open-loop tracking control → feedback policy → control scheduler

Libin Liu, Michiel van de Panne, and KangKang Yin. 2016. Guided Learning of Control Graphs for Physics-Based Characters. ACM Trans. Graph. 35, 3, Article 29 (May 2016), 14 pages.
Libin Liu and Jessica Hodgins. 2017. Learning to Schedule Control Fragments for Physics-Based Characters Using Deep Q-Learning. ACM Trans. Graph. 36, 3, Article 29 (June 2017), 14 pages.
Future Work
• Statistical/generative models [Holden et al. 2017]
• Control with raw simulation state and terrain information
• Active human-object interaction [Peng et al. 2017, DeepLoco]: basketball, soccer
• Dancing, boxing, martial arts [Heess et al. 2017]
Questions? Libin Liu http://libliu.info DeepMotion Inc http://deepmotion.com