

  1. Structured Deep Learning of Human Motion. Christian Wolf, Fabien Baradel, Natalia Neverova, Julien Mille, Graham W. Taylor, Greg Mori

  2. Deep Learning of Human Motion: gesture recognition; recognition of individual activities & interactions; recognition of group activities; pose estimation

  3. [Neverova, Wolf, Taylor, Nebout, CVIU 2017]

  4. Combining real and simulated data: joint positions (NYU dataset) + synthetic data (part segmentation). Natalia Neverova (PhD @ LIRIS, INSA-Lyon; now at Facebook), Christian Wolf (LIRIS), Graham W. Taylor (University of Guelph, Canada), Florian Nebout (Awabot)

  5. Semantic Segmentation with GridNetworks [Fourure, Emonet, Fromont, Muselet, Trémeau, Wolf, BMVC 2017]. Damien Fourure, E. Fromont, R. Emonet, A. Trémeau, D. Muselet, C. Wolf

  6. Activity recognition, three data regimes: (a) unconstrained internet/YouTube videos, no acquisition needed, e.g. the YouTube-8M dataset: 7M videos, 4716 classes, ~3.4 labels per video, >1 PB of data; (b) videos with human activities from YouTube, no acquisition, e.g. ActivityNet/Kinetics: ~300k videos, 400 classes; (c) human activities shot with depth sensors, where acquisition is time consuming! E.g. NTU RGB-D, MSR, ChaLearn/Montalbano, etc.

  7. Deep Learning (Global) (mostly after 2012). Deep learning is mostly based on global models. [Ji et al., ICML 2010] [Baccouche, Mamalet, Wolf, Garcia, Baskurt, HBU 2011] [Baccouche, Mamalet, Wolf, Garcia, Baskurt, BMVC 2012] [Carreira and Zisserman, CVPR 2017]

  8. The role of articulated pose. (Example frames: Reading vs. Writing)

  9. The role of articulated pose: appearance is helpful. (Example frames: Reading vs. Writing) [Neverova, Wolf, Taylor, Nebout, PAMI 2016] [Baradel, Wolf, Mille, Taylor, BMVC 2018]

  10. Context. We need to put attention on places that are not always determined by pose

  11. Context. We need to put attention on places that are not always determined by pose

  12. Context. Frame from the NTU RGB-D dataset

  13. Local representations (before 2012). Images, objects and activities have often been represented as collections of local features, e.g. through DPMs, which score local appearance together with a deformation cost. [Felzenszwalb et al., PAMI 2010]
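For reference, the "local appearance + deformation" trade-off mentioned on this slide has a standard form; a sketch in my own notation, following Felzenszwalb et al.:

```latex
% Score of a part configuration: root and part filter responses
% (local appearance) minus deformation costs, plus a bias.
% F_i: filters, phi: features of image H at location p_i,
% psi: displacement features of part i relative to the root.
\[
s(p_0,\dots,p_n) \;=\; \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
\;-\; \sum_{i=1}^{n} d_i \cdot \psi(p_i, p_0) \;+\; b
\]
```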

  14. Structured Deep Learning. Deep learning contributes representation learning and visual recognition (activities, gestures, objects) with local and global context; structured learning contributes complex relationships and structured / semi-structured models. Structured deep learning combines the two.

  15. Human attention: gaze patterns [Johansson, Holsanova, Dewhurst, Holmqvist, 2012]

  16. Timeline: local representations (before 2012) → global deep learning (mostly after 2012) → deep learning with attention maps (~2016). Hard attention [Mnih et al., NIPS 2015]; soft attention in feature maps [Sharma et al., ICLR 2016]; attention on joints [Song et al., AAAI 2016]
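To make "soft attention in feature maps" concrete, here is a minimal PyTorch sketch in the spirit of Sharma et al.; the module name, dimensions, and the conditioning on a recurrent hidden state are assumptions of this sketch, not code from any cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftSpatialAttention(nn.Module):
    """Soft attention over a CxHxW feature map, conditioned on a
    recurrent hidden state, producing a weighted-average glimpse."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, feats, h):
        # feats: (B, C, H, W), h: (B, hidden_dim)
        B, C, H, W = feats.shape
        f = feats.flatten(2).transpose(1, 2)          # (B, H*W, C)
        hx = h.unsqueeze(1).expand(-1, H * W, -1)     # (B, H*W, hidden)
        a = self.score(torch.cat([f, hx], dim=-1))    # (B, H*W, 1) scores
        a = F.softmax(a, dim=1)                       # attention weights
        return (a * f).sum(dim=1)                     # (B, C) glimpse vector
```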

  17. Objective: fully trainable high-capacity local representations. 1. Learn where to attend. 2. Learn how to track attended points. 3. Learn how to recognize the activity from a local distributed representation. [Baradel, Wolf, Mille, Taylor, CVPR 2018]

  18. Attention in feature space. A global model (3D Inflated ResNet-50) maps the RGB input video into a spatio-temporal feature space. [Baradel, Wolf, Mille, Taylor, CVPR 2018]
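A minimal sketch of the inflation trick behind "inflated" 3D networks such as the Inflated ResNet-50: pretrained 2D kernels are repeated along time and rescaled, so the 3D network starts from 2D ImageNet weights and initially behaves like the 2D network on static clips. The function name and the temporal kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_k: int = 3) -> nn.Conv3d:
    """Inflate a pretrained 2D convolution into a 3D one: repeat the
    2D kernel along time and divide by time_k to preserve responses."""
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_k, kh, kw),
        stride=(1, *conv2d.stride),
        padding=(time_k // 2, *conv2d.padding),
        groups=conv2d.groups,
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_k, 1, 1) / time_k
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```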

  19. Unconstrained differentiable attention. A "differentiable crop" (Spatial Transformer Network) extracts a glimpse from the feature map at each time step, conditioned on the hidden state of the recurrent recognizers (workers) and on the frame context. [Baradel, Wolf, Mille, Taylor, CVPR 2018]
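The "differentiable crop" can be sketched with the standard Spatial Transformer operations in PyTorch (affine grid + bilinear sampling); gradients flow back into the predicted glimpse position and scale. Names and the [-1, 1] coordinate convention are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def differentiable_crop(feats, center, scale, out_size=7):
    """Differentiable crop of a feature map around a predicted point.
    feats:  (B, C, H, W) feature map
    center: (B, 2) glimpse center (x, y) in [-1, 1] coordinates
    scale:  (B, 1) glimpse size as a fraction of the full map"""
    B = feats.size(0)
    theta = feats.new_zeros(B, 2, 3)     # affine params of the crop
    theta[:, 0, 0] = scale[:, 0]         # x zoom
    theta[:, 1, 1] = scale[:, 0]         # y zoom
    theta[:, :, 2] = center              # translation
    grid = F.affine_grid(theta, (B, feats.size(1), out_size, out_size),
                         align_corners=False)
    return F.grid_sample(feats, grid, align_corners=False)
```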

  20. Distributed recognition. The RGB input video passes through the global model (3D Inflated ResNet-50); a spatial attention process selects unconstrained glimpses in feature space, and a set of recurrent workers (r1, r2, r3) performs distributed tracking/recognition over them.
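A hedged sketch of the distributed workers: several recurrent recognizers, each integrating its own glimpse sequence over time, with predictions averaged. The actual CVPR 2018 model assigns glimpses to workers with a learned soft assignment; here the assignment is assumed given, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GlimpseWorkers(nn.Module):
    """Several recurrent 'workers'; each consumes its own glimpse
    sequence, and the class predictions are averaged."""
    def __init__(self, glimpse_dim, hidden_dim, num_classes, num_workers=3):
        super().__init__()
        self.workers = nn.ModuleList(
            [nn.GRU(glimpse_dim, hidden_dim, batch_first=True)
             for _ in range(num_workers)])
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, glimpses):
        # glimpses: (B, T, num_workers, glimpse_dim), one glimpse
        # per worker and time step (assignment assumed given)
        logits = []
        for i, gru in enumerate(self.workers):
            _, h = gru(glimpses[:, :, i, :])   # h: (1, B, hidden)
            logits.append(self.classifier(h[-1]))
        return torch.stack(logits).mean(0)     # averaged predictions
```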

  21. Results

  22. State-of-the-art comparison. Dynamic visual attention over an unstructured glimpse cloud, followed by CNN recognition. SOTA results on two datasets, NTU and N-UCLA; the gap between glimpse clouds and the global model is larger on N-UCLA. [Baradel, Wolf, Mille, Taylor, CVPR 2018] [Baradel, Wolf, Mille, Taylor, under review]

  23. Results: ablation study. [Baradel, Wolf, Mille, Taylor, CVPR 2018] [Baradel, Wolf, Mille, Taylor, under review]

  24. Pose-conditioned attention [Baradel, Wolf, Mille, Taylor, BMVC 2018]

  25. AI vs. NI. 2014 Nobel Prize in Medicine: head-direction cells, border cells

  26. AI vs. NI. 2014 Nobel Prize in Medicine

  27. AI vs. NI. 2018: discovery of the same cells in neural networks trained on similar tasks. [Cueva, Wei, ICLR 2018]

  28. AI vs. NI. Emergence of the different types of cells in the same order. [Cueva, Wei, ICLR 2018]

  29. Reasoning: what happened?

  30. Human psychology - Daniel Kahneman (Nobel Prize in 2002) - Book: "Thinking, Fast and Slow"

  31. Cognitive tasks. 24*17 = ?

  32. Two systems. System 1: continuously monitors the environment (and the mind); no specific attention; continuously generates assessments/judgments without effort, even with little data; jumps to conclusions; prone to errors; no capability for statistics. System 2: receives questions or generates them; directs attention and searches memory to find answers; requires (possibly a lot of) effort; more reliable.

  33. Where is ML today? Claim: AI requires a combination of (a) extraction of high-level information from high-dimensional input (visual, audio, language): machine learning; and (b) high-level reasoning: compare, assess, focus attention, perform logical deductions. Roadmap: estimating semantics from low-level information (vision & learning); estimating causal relationships from data; reasoning: logic + statistics.

  34. Object-level Visual Reasoning [Baradel, Neverova, Wolf, Mille, Mori, ECCV 2018]. Fabien Baradel (PhD @ LIRIS, INSA-Lyon), Christian Wolf (INRIA, Chroma), Julien Mille (LI, INSA VdL), Greg Mori (Simon Fraser University, Canada), Natalia Neverova (Facebook AI Research, Paris)

  35. Object-level Visual Reasoning [Baradel, Neverova, Wolf, Mille, Mori, ECCV 2018]

  36. Object-level Visual Reasoning [Baradel, Neverova, Wolf, Mille, Mori, ECCV 2018]
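Object-level reasoning can be sketched as pairwise relational reasoning over detected-object features, relation-network style. The real ECCV 2018 model also reasons over object pairs across time; this minimal single-frame sketch omits that, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ObjectRelationHead(nn.Module):
    """Pairwise reasoning over object features: an MLP g scores every
    ordered object pair, relations are summed, then classified by f."""
    def __init__(self, obj_dim, hidden_dim, num_classes):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.f = nn.Linear(hidden_dim, num_classes)

    def forward(self, objs):
        # objs: (B, N, obj_dim) feature vectors of N objects in a frame
        B, N, D = objs.shape
        oi = objs.unsqueeze(2).expand(B, N, N, D)
        oj = objs.unsqueeze(1).expand(B, N, N, D)
        pairs = torch.cat([oi, oj], dim=-1)   # all ordered object pairs
        rel = self.g(pairs).sum(dim=(1, 2))   # aggregate pair relations
        return self.f(rel)                    # interaction-class logits
```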

  37. Learned interactions. Class: person-book interaction

  38. Failure cases

  39. Results: Something-Something, VLOG and EPIC Kitchens datasets

  40. Conclusion - We propose models that recognize activities from (a) a cloud of unconstrained feature points and (b) interactions between spatially well-defined objects - Visual spatial attention is useful and competitive compared to pose - State-of-the-art performance on 5 datasets (NTU RGB-D, Northwestern-UCLA, VLOG, Something-Something, EPIC Kitchens) - Reasoning is a key component of human cognition, and also important for AI systems
