Inferring Human Interaction from Motion Trajectories
Tianmin Shu (1), Yujia Peng (2), Lifeng Fan (1), Hongjing Lu (2), Song-Chun Zhu (1)
University of California, Los Angeles, USA
(1) Department of Statistics (2) Department of Psychology
People are adept at inferring social interactions from highly simplified stimuli. Heider and Simmel (1944)
• Later studies showed that the perception of human-like interactions relies on critical low-level motion cues, e.g., speed and motion direction (Dittrich & Lea, 1994; Scholl & Tremoulet, 2000; Tremoulet & Feldman, 2000, 2006; Gao, Newman, & Scholl, 2009; Gao, McCarthy, & Scholl, 2010, ...). Chasing vs. Stalking [Gao & Scholl, 2011]
Real-life Stimuli [Choi et al., 2009] [Shu et al., 2015]
Tracking human trajectories and labeling group human interactions [Shu et al., CVPR 2015]
Experiment 1 • Participants: 33 participants from the UCLA subject pool • Stimuli: 24 interactive actions and 24 non-interactive actions (duration 15-33 s); 4 of the 48 actions were used as practice • Task: judging whether the two agents are interacting at each moment [Figure: interactive and non-interactive instances, with the two agents labeled 1 and 2]
Interactive Example
Non-interactive Example
Human Experiment Results (N = 33) [Figure: interactiveness judgments over video frames for interactive action 4 and non-interactive action 40]
Computational Model • Previous studies have developed Bayesian models to reason about the intentions of agents moving in maze-like environments (Baker, Goodman, & Tenenbaum, 2008; Baker, Saxe, & Tenenbaum, 2009; Ullman et al., 2009; Baker, 2012, ...) [Ullman et al., 2009] [Baker, 2012] • In the current study, we do not try to explicitly infer intention. Instead, we ask whether the model can capture statistical regularities (e.g., motion patterns) that can signal human interaction.
Computational Model • Y: interaction labels (1: interactive, 0: non-interactive) • S: latent sub-interactions • Γ: input layer, motion trajectories of the two agents
1. Conditional interactive fields (CIFs) Interactivity can be represented by latent motion fields, each capturing the relative motion between the two agents, modeled as a linear dynamic system.
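The linear dynamic system on the original slide does not survive the text extraction. As a minimal sketch of the idea (the parameterization with A_s, b_s, and Gaussian noise Q_s is an assumption for illustration, not necessarily the paper's exact form), a CIF can be read as a field mapping the current relative position of the two agents to a mean relative displacement:

import numpy as np

def step_relative_motion(x_rel, A_s, b_s, Q_s, rng):
    # One step of a linear dynamic system for sub-interaction s:
    # the CIF gives the mean relative displacement at the current
    # relative position, plus Gaussian motion noise.
    mean_dx = A_s @ x_rel + b_s
    dx = rng.multivariate_normal(mean_dx, Q_s)
    return x_rel + dx

rng = np.random.default_rng(0)
# A rotation-like field, loosely resembling an "orbiting" CIF.
A_orbit = np.array([[0.0, -0.1],
                    [0.1,  0.0]])
b_orbit = np.zeros(2)
Q_orbit = 0.01 * np.eye(2)

x = np.array([1.0, 0.0])   # relative position of agent 2 w.r.t. agent 1
for _ in range(5):
    x = step_relative_motion(x, A_orbit, b_orbit, Q_orbit, rng)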
An example CIF: Orbiting • Arrows: represent the mean relative motion at different locations • Intensities of the arrows: the relative spatial density which increases from light to dark
2. Temporal parsing by latent sub-interactions Latent motion fields can vary over time, which enables the model to characterize the behavioral change of the agents.
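For concreteness, a temporal parse can be represented as a sequence of segments, each labeled with one sub-interaction (the frame numbers and CIF indices below are purely hypothetical):

from dataclasses import dataclass

@dataclass
class SubInteraction:
    start: int    # first frame of the segment
    end: int      # one past the last frame
    cif_id: int   # index of the latent sub-interaction (CIF)

# A hypothetical parse of a 300-frame clip into three sub-interactions,
# e.g. approaching -> walking together -> leaving.
parse = [
    SubInteraction(0, 120, cif_id=3),
    SubInteraction(120, 240, cif_id=0),
    SubInteraction(240, 300, cif_id=7),
]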
A Simple View of Our Model Interaction ≈ Fields + Procedure
Formulation Given the input of motion trajectories Γ, the model infers the posterior distribution of the latent variables S and Y. • Γ: input motion trajectories • S: latent variables, sub-interactions • Y: interaction labels (1: interactive, 0: non-interactive)
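The equations on the formulation slides do not survive the text extraction. One standard way to write such a posterior, assuming a Markov-style factorization over time (an assumption, not necessarily the authors' exact formulation), is:

P(S, Y \mid \Gamma) \;\propto\; P(Y)\,\prod_{t=1}^{T} P(s_t \mid s_{t-1}, Y)\, P(\Gamma_t \mid s_t)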
Learning • Gibbs sampling
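A minimal sketch of one Gibbs sweep over frame-level sub-interaction labels (the interface cifs[k].loglik and the transition matrix trans are assumptions for illustration; this is not the authors' actual sampler):

import numpy as np

def gibbs_sweep(labels, traj, cifs, trans, rng):
    # Resample each frame's sub-interaction label from its full conditional:
    # data likelihood under each CIF times the transition terms to its neighbors.
    T, K = len(labels), len(cifs)
    for t in range(T):
        logp = np.zeros(K)
        for k in range(K):
            logp[k] = cifs[k].loglik(traj[t])                  # data term
            if t > 0:
                logp[k] += np.log(trans[labels[t - 1], k])     # transition in
            if t < T - 1:
                logp[k] += np.log(trans[k, labels[t + 1]])     # transition out
        p = np.exp(logp - logp.max())                          # normalize stably
        labels[t] = rng.choice(K, p=p / p.sum())
    return labels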
Inference The model infers the current status of the latent variables: infer s_t under the assumption of interaction (i.e., y_t = 1), then compute the posterior probability of y_t = 1 given s_t ∈ S.
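A rough sketch of this frame-level decision (the log-likelihood values and the 0.5 prior below are assumptions; in the model these terms would come from the learned CIFs and a non-interactive baseline):

import numpy as np

def interactiveness_posterior(loglik_interactive, loglik_noninteractive, prior_y=0.5):
    # Posterior probability that the two agents are interacting at time t,
    # comparing the best explanation under y_t = 1 (some CIF s_t) against
    # a generic non-interactive motion model (y_t = 0).
    log_p1 = np.log(prior_y) + loglik_interactive
    log_p0 = np.log(1.0 - prior_y) + loglik_noninteractive
    m = max(log_p0, log_p1)
    return np.exp(log_p1 - m) / (np.exp(log_p0 - m) + np.exp(log_p1 - m))

# If the current frame fits its best CIF much better than the baseline,
# the posterior of y_t = 1 approaches 1.
print(interactiveness_posterior(-2.0, -6.0))   # ~0.98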
Prediction • Predict/synthesize s_{t+1} given y_{t+1} and all previous s • y_{t+1} = 1: interactive trajectories • y_{t+1} = 0: non-interactive trajectories • Predict/synthesize x_{t+1} given y_t and s_t
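A minimal sketch of the forward synthesis loop, reusing the CIF dynamics from the earlier sketch (the attribute names A, b, Q and the transition matrix trans are assumptions, not the authors' code):

import numpy as np

def synthesize(x0, s0, cifs, trans, T, rng):
    # Roll the model forward: sample the next sub-interaction from the
    # transition matrix, then sample the next relative position from that
    # CIF's linear dynamics.
    xs, x, s = [x0], x0, s0
    for _ in range(T):
        s = rng.choice(len(cifs), p=trans[s])
        mean_dx = cifs[s].A @ x + cifs[s].b
        x = x + rng.multivariate_normal(mean_dx, cifs[s].Q)
        xs.append(x)
    return np.array(xs)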
Training Data 1. UCLA aerial event dataset (Shu et al., 2015) http://www.stat.ucla.edu/~tianmin.shu/AerialVideo/AerialVideo.html • 131 training instances (excluding trajectories used in the stimuli) • 22 validation instances from the 44 stimuli • 22 testing instances from the remaining stimuli 2. A second dataset created from the original Heider-Simmel animation (i.e., two triangles and one circle) • 27 training instances
A few predominant CIFs: 1. approaching • Arrows: represent the mean relative motion at different locations • Intensities of the arrows: the relative spatial density which increases from light to dark
A few predominant CIFs: 2. Passing by (upper part) and following (lower part)
A few predominant CIFs: 3. Leaving/avoiding
Illustration of the top five most frequent CIFs learned from the training data, with their frequencies.
Temporal Parsing of Fields in Heider-Simmel Animations
Interactiveness Inference in Aerial Videos
Experiment 1 Results Comparison of online predictions by our full model (|S| = 15, orange) and humans (blue) over time (in seconds) on testing videos. [Figure: interactive ratings vs. time (s)] Trained on Heider-Simmel stimuli, tested on aerial video stimuli: r = 0.640 and RMSE = 0.227.
Test the Model Trained from Aerial Videos on Heider-Simmel Stimuli
Experiment 2 • We used the model trained on aerial videos to synthesize new videos. • The model generated 10 interactive animations (y = 1) and 10 non-interactive animations (y = 0), shown at 5x speed together with the model's predicted interactiveness.
Experiment 2 Results N = 17 • The interactiveness between the two agents in the synthesized videos was judged accurately by human observers. • The model effectively captured the visual features that signal potential interactivity between agents.
Conclusion • Decontextualized animations based on real-life videos enable human perception of social interactions. • The hierarchical model can learn the statistical regularities of common sub-interactions and accounts for human judgments of interactiveness. • Results suggest that human interactions can be decomposed into sub-interactions such as approaching, walking in parallel, or orbiting.
For more details, please visit our website http://www.stat.ucla.edu/~tianmin.shu/HeiderSimmel/CogSci17/