Structured Deep Learning of Human Motion. Christian Wolf, Fabien Baradel, Natalia Neverova, Julien Mille, Graham W. Taylor, Greg Mori
Deep Learning of Human Motion: gesture recognition, recognition of individual activities & interactions, recognition of group activities, pose estimation.
[Neverova, Wolf, Taylor, Nebout, CVIU 2017]
Combining real and simulated data: joint positions (NYU Dataset) and synthetic data (part segmentation). Natalia Neverova (PhD @ LIRIS, INSA-Lyon; now at Facebook), Christian Wolf (LIRIS), Graham W. Taylor (University of Guelph, Canada), Florian Nebout (Awabot)
Semantic Segmentation with GridNetworks. [Fourure, Emonet, Fromont, Muselet, Trémeau, Wolf, BMVC 2017]
Activity recognition
- Unconstrained internet/YouTube videos, no acquisition. E.g. YouTube-8M dataset: 7M videos, 4716 classes, ~3.4 labels per video, >1 PB of data.
- Videos with human activities, from YouTube, no acquisition. E.g. ActivityNet/Kinetics dataset: ~300k videos, 400 classes.
- Human activities shot with depth sensors: acquisition is time consuming! E.g. NTU RGB-D dataset, MSR dataset, ChaLearn/Montalbano dataset, etc.
Deep Learning (Global) (mostly after 2012). Deep learning is mostly based on global models. [Ji et al., ICML 2010] [Baccouche, Mamalet, Wolf, Garcia, Baskurt, HBU 2011] [Baccouche, Mamalet, Wolf, Garcia, Baskurt, BMVC 2012] [Carreira and Zisserman, CVPR 2017]
The role of articulated pose: reading vs. writing.
The role of articulated pose: appearance is helpful (reading vs. writing). [Neverova, Wolf, Taylor, Nebout, PAMI 2016] [Baradel, Wolf, Mille, Taylor, BMVC 2018]
Context: we need to pay attention to places which are not always determined by pose.
Context: frame from the NTU RGB-D Dataset.
Local representations (before 2012): images, objects and activities have often been represented as collections of local features, e.g. through DPMs (local appearance + deformation). [Felzenszwalb et al., PAMI 2010]
Structured Deep Learning: combining deep learning (representation learning) with structured and semi-structured models for visual recognition (activities, gestures, objects), capturing local context, global context, and complex relationships.
Human attention: gaze patterns. [Johansson, Holsanova, Dewhurst, Holmqvist, 2012]
Timeline: local representations (before 2012), deep learning with global models (mostly after 2012), deep learning with attention maps (~2016): hard attention [Mnih et al., NIPS 2014], soft attention in feature maps [Sharma et al., ICLR 2016], attention on joints [Song et al., AAAI 2017].
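As a concrete illustration of the soft-attention idea, here is a minimal NumPy sketch (names and shapes are illustrative, not taken from the cited papers): one score per spatial location of a feature map is turned into a weight map by a softmax, and the attended feature is the weighted sum over the map.

```python
import numpy as np

def soft_attention(features, scores):
    """Soft attention over a CNN feature map.

    features: (H, W, C) feature map
    scores:   (H, W) unnormalized attention scores (e.g. from a small MLP)
    Returns the attended (C,) feature vector and the (H, W) weight map.
    """
    flat = scores.reshape(-1)
    flat = flat - flat.max()                         # numerical stability
    weights = np.exp(flat) / np.exp(flat).sum()      # softmax over locations
    weights = weights.reshape(scores.shape)          # (H, W), sums to 1
    attended = (features * weights[..., None]).sum(axis=(0, 1))
    return attended, weights

# Toy usage: attend over a random 7x7x64 feature map.
rng = np.random.default_rng(0)
v, w = soft_attention(rng.normal(size=(7, 7, 64)), rng.normal(size=(7, 7)))
```

Because every step is differentiable, the scoring network can be trained end-to-end with the recognition loss, which is what makes soft attention attractive compared to hard (sampled) attention.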
Objective: fully trainable high-capacity local representations. 1. Learn where to attend; 2. learn how to track attended points; 3. learn how to recognize the activity from a local distributed representation. [Baradel, Wolf, Mille, Taylor, CVPR 2018]
Attention in feature space: a 3D global model (Inflated ResNet-50) maps the RGB input video to a feature space, over time. [Baradel, Wolf, Mille, Taylor, CVPR 2018]
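The "inflated" in Inflated ResNet-50 refers to initializing a 3D convolutional network from a pretrained 2D one. A minimal sketch of the standard inflation trick (the function name is ours): repeat each 2D kernel along a new temporal axis and rescale by the temporal size, so that the 3D network initially reproduces the 2D activations on a video of identical frames.

```python
import numpy as np

def inflate_kernel(kernel2d, t):
    """Inflate a 2D conv kernel (kH, kW, Cin, Cout) into a 3D kernel
    (t, kH, kW, Cin, Cout) by repeating it t times over time and
    rescaling by 1/t, preserving activations on constant-in-time input."""
    return np.repeat(kernel2d[None], t, axis=0) / t

# Toy usage: inflate a 3x3 kernel to a 3x3x3 spatio-temporal kernel.
k2 = np.ones((3, 3, 1, 1))
k3 = inflate_kernel(k2, 3)
```

The 1/t rescaling is what guarantees that summing the inflated kernel over the temporal axis recovers the original 2D kernel.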
Unconstrained differentiable attention over time: a "differentiable crop" (Spatial Transformer Network) conditioned on the frame context and on the hidden state from the recurrent recognizers (workers). [Baradel, Wolf, Mille, Taylor, CVPR 2018]
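A "differentiable crop" can be sketched as bilinear sampling on a grid whose position and scale are smooth functions of the predicted glimpse parameters; this is the mechanism behind Spatial Transformer Networks. The code below is a simplified NumPy illustration (no learned localization network; names are ours).

```python
import numpy as np

def bilinear_sample(fmap, ys, xs):
    """Bilinear interpolation of fmap (H, W, C) at real-valued coords."""
    H, W, _ = fmap.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    dy, dx = ys - y0, xs - x0
    return ((1 - dy)[..., None] * (1 - dx)[..., None] * fmap[y0, x0]
            + (1 - dy)[..., None] * dx[..., None] * fmap[y0, x0 + 1]
            + dy[..., None] * (1 - dx)[..., None] * fmap[y0 + 1, x0]
            + dy[..., None] * dx[..., None] * fmap[y0 + 1, x0 + 1])

def differentiable_crop(fmap, center_y, center_x, scale, out_size=3):
    """Crop an (out_size x out_size) glimpse around (center_y, center_x).

    The sampling grid is a smooth function of the glimpse parameters,
    so gradients can flow back into the network that predicts them.
    """
    lin = np.linspace(-1.0, 1.0, out_size)
    gy, gx = np.meshgrid(center_y + scale * lin, center_x + scale * lin,
                         indexing="ij")
    return bilinear_sample(fmap, gy, gx)
```

In the actual model the glimpse parameters come from a network conditioned on frame context and worker states; here they are simply passed in as floats.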
Distributed recognition: a 3D global model (Inflated ResNet-50) processes the RGB input video over time; a spatial attention process feeds workers r1, r2, r3, which perform distributed tracking/recognition with unconstrained attention in feature space.
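A deliberately simplified toy version of the distributed-recognition idea (not the exact CVPR 2018 architecture; all names and the accumulation rule are illustrative): each glimpse feature is soft-assigned to a set of workers according to its similarity with a per-worker key, each worker accumulates its own state, and the final prediction averages the workers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def distribute_glimpses(glimpses, n_workers=3, n_classes=4, seed=0):
    """Toy distributed recognition over a sequence of glimpse features.

    glimpses: (T, D) one feature vector per glimpse.
    Returns class probabilities of shape (n_classes,).
    """
    rng = np.random.default_rng(seed)
    dim = glimpses.shape[1]
    keys = rng.normal(size=(n_workers, dim))        # worker "interests"
    W_out = rng.normal(size=(n_workers, dim, n_classes))
    states = np.zeros((n_workers, dim))
    for g in glimpses:                              # one glimpse at a time
        assign = softmax(keys @ g)                  # soft assignment
        states += assign[:, None] * g[None, :]      # per-worker accumulation
    logits = np.einsum('wd,wdc->wc', states, W_out)
    return softmax(logits.mean(axis=0))             # average the workers
```

The real model replaces the running sum with recurrent workers (GRUs) and learns the assignment end-to-end; the sketch only shows the soft-assignment-plus-averaging structure.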
Results
State-of-the-art comparison: dynamic visual attention over an unstructured glimpse cloud over time (CNN + recognition). SOTA results on two datasets, NTU and N-UCLA; larger difference between glimpse clouds and the global model on N-UCLA. [Baradel, Wolf, Mille, Taylor, CVPR 2018] [Baradel, Wolf, Mille, Taylor, under review]
Results: ablation study. [Baradel, Wolf, Mille, Taylor, CVPR 2018] [Baradel, Wolf, Mille, Taylor, under review]
Pose-conditioned attention. [Baradel, Wolf, Mille, Taylor, BMVC 2018]
AI vs. NI: 2014 Nobel Prize in Medicine (head direction cells, border cells).
AI vs. NI: in 2018, discovery of the same cells in neural networks trained on similar tasks. [Cueva, Wei, ICLR 2018]
AI vs. NI: emergence of the different types of cells, in the same order. [Cueva, Wei, ICLR 2018]
Reasoning: what happened?
Human psychology: Daniel Kahneman (Nobel Prize in 2002); book: "Thinking, Fast and Slow".
Cognitive tasks: 24 × 17 = ?
Two systems.
System 1:
- continuously monitors the environment (and the mind)
- no specific attention
- continuously generates assessments/judgments without effort, even with little data; jumps to conclusions
- prone to errors; no capabilities for statistics
System 2:
- receives questions or generates them
- directs attention and searches memory to find answers
- requires (possibly a lot of) effort
- more reliable
Where is ML today? Claim: AI requires a combination of
- extraction of high-level information from high-dimensional input (visual, audio, language): machine learning
- high-level reasoning: compare, assess, focus attention, perform logical deductions
Roadmap: estimating semantics from low-level information (vision & learning); estimating causal relationships from data; reasoning: logic + statistics.
Object-level Visual Reasoning [Baradel, Neverova, Wolf, Mille, Mori, ECCV 2018]. Fabien Baradel (PhD @ LIRIS, INSA-Lyon), Christian Wolf (LIRIS, INSA-Lyon; INRIA Chroma), Julien Mille (LI, INSA VdL), Greg Mori (Simon Fraser University, Canada), Natalia Neverova (Facebook AI Research, Paris)
Object-level Visual Reasoning [Baradel, Neverova, Wolf, Mille, Mori, ECCV 2018]
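The object-level reasoning can be illustrated with a relation-network-style sketch (a strong simplification of the ECCV 2018 model; the weights and names here are illustrative): every ordered pair of object features passes through a shared layer, the pair embeddings are summed, and a head produces class logits. The sum makes the result invariant to the ordering of the detected objects.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def object_relation_reasoning(objects, W_pair, W_head):
    """Relation-network-style reasoning over detected objects.

    objects: (N, D) one feature vector per detected object
    W_pair:  (2*D, E) shared weights applied to every ordered pair
    W_head:  (E, n_classes) classification head
    """
    N, D = objects.shape
    # all ordered pairs, concatenated: (N*N, 2*D)
    pairs = np.concatenate(
        [np.repeat(objects, N, axis=0), np.tile(objects, (N, 1))], axis=1)
    pair_emb = relu(pairs @ W_pair)    # shared weights across pairs
    pooled = pair_emb.sum(axis=0)      # order-invariant aggregation
    return pooled @ W_head             # class logits
```

In the full model the pairs additionally span time (objects in different frames), so the network can reason about how object relations evolve during the activity.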
Learned interactions. Class: person-book interaction.
Failure cases
Results: Something-Something dataset, VLOG dataset, EPIC Kitchens dataset.
Conclusion
- We propose models which recognize activities from (a) a cloud of unconstrained feature points and (b) interactions between spatially well-defined objects.
- Visual spatial attention is useful and competitive compared to pose.
- State-of-the-art performance on 5 datasets (NTU RGB-D, Northwestern-UCLA, VLOG, Something-Something, EPIC Kitchens).
- Reasoning is a key component of human cognition, and also important for AI systems.