Gesture Recognition with 3D CNNs
Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, Jan Kautz
April 4-7, 2016 | Silicon Valley (slides dated 4/6/2016)


  1. GESTURE RECOGNITION WITH 3D CNNS: Pavlo Molchanov (4/6/2016), Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, Jan Kautz. April 4-7, 2016 | Silicon Valley.

  2. AGENDA: Motivation. Problem statement. Selecting the best classifier. Online gesture detection and classification. Demos.

  3. MOTIVATION

  4. GESTURE IS A NATURAL FORM OF COMMUNICATION (image: photo.elsoar.com)

  5. SAFE INTERFACES (image: bmw.com)

  6. NEED FOR VIDEO RELAY SERVICES (image: http://relayservice.gov.au/)

  7. GAMING (image: leapmotion)

  8. PROBLEM STATEMENT

  9. PROBLEM STATEMENT: No special devices. Single commodity sensor: Kinect v1, SoftKinetic. Gesture recognition, skeleton tracking, gaze estimation, head tracking.

  10. PROBLEM STATEMENT: Understanding gesture concepts. We do: classifier -> "thumb up"; classifier -> "wave hand". We don't: hand model fitting and tracking. Image: http://www.virtualrealityreviewer.com/leap-motion-enters-vr-new-software-product-accessory-preview-what%C2%B9s-next/

  11. PROBLEM STATEMENT: Understanding gesture concepts. We do: classifier -> "thumb up"; classifier -> ?; classifier -> "wave hand". We don't: hand model fitting and tracking. Image: http://www.virtualrealityreviewer.com/leap-motion-enters-vr-new-software-product-accessory-preview-what%C2%B9s-next/

  12. SELECTING THE BEST CLASSIFIER

  13. SELECTING THE BEST CLASSIFIER: VIVA Challenge 2015, organized by UCLA. 19 classes, 8 subjects, driver and passenger. RGB + depth from Microsoft Kinect. 885 gestures in total.

  14. SELECTING THE BEST CLASSIFIER: VIVA Challenge 2015 (details as on slide 13). Gesture example: slide 2 fingers left.

  15. SELECTING THE BEST CLASSIFIER: VIVA Challenge 2015 (details as on slide 13). Gesture example: zoom out.

  16. SELECTING THE BEST CLASSIFIER: VIVA Challenge 2015 (details as on slide 13). Gesture example: rotate CCW.

  17. SELECTING THE BEST CLASSIFIER: 3D convolutional neural network. Diagram: RGB + depth input -> four stages of 3D convolution and max-pooling -> ReLU -> ReLU -> softmax -> prediction.
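The building blocks named on slide 17 can be illustrated with a deliberately naive sketch. It assumes a single-channel clip stored as nested lists (time x height x width), a made-up temporal-difference kernel, and a cubic pooling window; none of these shapes or values come from the talk, and a real network would use a GPU library rather than Python loops.

```python
# Sketch only: naive 3D convolution, ReLU and 3D max-pooling on nested lists.
# All shapes and kernel values are illustrative assumptions.

def conv3d(volume, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation of a T x H x W volume."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def relu3d(volume):
    """Element-wise max(0, v)."""
    return [[[max(0.0, v) for v in row] for row in plane] for plane in volume]


def maxpool3d(volume, size=2):
    """Non-overlapping max-pooling with a cubic window; any remainder is dropped."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    out = []
    for t in range(0, T - size + 1, size):
        plane = []
        for y in range(0, H - size + 1, size):
            row = []
            for x in range(0, W - size + 1, size):
                row.append(max(volume[t + dt][y + dy][x + dx]
                               for dt in range(size)
                               for dy in range(size)
                               for dx in range(size)))
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 4 frames of 5x5 pixels whose intensity grows by 1 per frame.
clip = [[[float(t) for _ in range(5)] for _ in range(5)] for t in range(4)]
# Hypothetical 2x1x1 kernel that responds to change between consecutive frames.
kernel = [[[-1.0]], [[1.0]]]
features = maxpool3d(relu3d(conv3d(clip, kernel)), size=2)
```

Because the kernel spans the time axis as well as space, the feature map encodes motion between frames, which is what separates a 3D CNN from a per-frame 2D CNN.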

  18. SEGMENTED GESTURE CLASSIFICATION: Training. Diagram: depth + RGB -> 3D CNN -> error -> back-propagation update.

  19. SELECTING THE BEST CLASSIFIER: First result (classification accuracy, higher is better). Testing set: HON4D [1] 58.7%, HOG [2] 64.5%, 3D-CNN 48.3%. Training set: 99.9%. [1] Oreifej and Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences. CVPR, 2013. [2] Ohn-Bar and Trivedi. IEEE Trans. on Intelligent Transportation Systems, 2014.

  20. SELECTING THE BEST CLASSIFIER: VIVA: 885 examples. ImageNet: 1.5 M examples. Recent success in deep learning has benefited from large data.

  21. SELECTING THE BEST CLASSIFIER: Training. Diagram: depth + RGB -> 3D CNN -> error -> back-propagation update.

  22. SELECTING THE BEST CLASSIFIER: Training with data augmentation. Diagram: depth + RGB -> data augmentation -> 3D CNN -> error -> back-propagation update.

  23. SELECTING THE BEST CLASSIFIER: Data augmentation: spatial geometric transformations, temporal augmentation, generating new training data. Figure: original vs. augmented.

  29. SELECTING THE BEST CLASSIFIER: Data augmentation: spatial geometric transformations, temporal augmentation, generating new training data.

  30. SELECTING THE BEST CLASSIFIER: Data augmentation: spatial geometric transformations, temporal augmentation, generating new training data. Example: flip.
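The two augmentation families listed on these slides can be sketched as follows. The clip layout (a list of frames, each a list of pixel rows), the 50% flip probability, and the +/-25% duration range are hypothetical choices for illustration; the slides do not give the actual parameters.

```python
import random


def flip_horizontal(clip):
    """Mirror every frame left-to-right (a spatial geometric transformation)."""
    return [[row[::-1] for row in frame] for frame in clip]


def resample_temporal(clip, num_frames):
    """Nearest-neighbour resampling along time: stretches or compresses a gesture."""
    n = len(clip)
    return [clip[min(n - 1, int(i * n / num_frames))] for i in range(num_frames)]


def augment(clip, rng):
    """Return one randomly perturbed copy of a clip."""
    out = clip
    if rng.random() < 0.5:  # assumed flip probability
        out = flip_horizontal(out)
    # Assumed range: vary the duration by up to +/-25 %.
    target = max(1, round(len(clip) * rng.uniform(0.75, 1.25)))
    return resample_temporal(out, target)


# Toy clip: 4 frames of 1x3 pixels.
toy_clip = [[[t, t + 1, t + 2]] for t in range(4)]
new_examples = [augment(toy_clip, random.Random(seed)) for seed in range(5)]
```

One caution when flipping: a mirrored directional gesture (for example "slide left") shows the opposite gesture, so its label has to change with it.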

  31. SELECTING THE BEST CLASSIFIER: VIVA: 885 examples. Augmented: 0.3 M examples.

  32. SELECTING THE BEST CLASSIFIER: Official challenge results (classification accuracy in %, higher is better). NVIDIA (3D-CNN), no data augmentation: 48.3. HOG+HOG2: 64.5. HON4D: 58.7. Dense Trajectories: 54. HOG3D: 44.6. Harris-3.5D: 36.4.

  33. SELECTING THE BEST CLASSIFIER: Official challenge results (classification accuracy in %, higher is better). NVIDIA (3D-CNN): 48.3 without data augmentation, 77.5 with data augmentation. HOG+HOG2: 64.5. HON4D: 58.7. Dense Trajectories: 54. HOG3D: 44.6. Harris-3.5D: 36.4.

  34. SELECTING THE BEST CLASSIFIER: Speed (FPS, higher is better). NVIDIA (3D-CNN): 110, GPU +250, cuDNNv4 +400. Other methods (CPU): HOG+HOG2 50, HON4D 25, Dense Trajectories 18, HOG3D 3, Harris-3.5D 0.2.

  35. SEGMENTED GESTURE CLASSIFICATION: Timeline: start of the gesture, end of the gesture, then the classification decision. A decision made after the gesture ends introduces latency.

  36. ONLINE GESTURE DETECTION AND CLASSIFICATION

  37. ONLINE GESTURE CLASSIFICATION: Timeline: start of the gesture, classification decision, end of the gesture. A decision made before the gesture ends improves feedback and user experience.

  38. ONLINE GESTURE CLASSIFICATION: R3DCNN. Forward recurrence only. Connectionist Temporal Classification (CTC) for detection and classification; CTC is used for training only. 109M parameters. Diagram: video in clips of 8 frames -> 3D CNN (local motion descriptor) -> RNN (global motion descriptor) -> softmax, repeated for each clip.
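The clip-by-clip data flow on slide 38 can be sketched as a loop that carries a recurrent state forward in time. Everything below is a stand-in: `encode_clip`, `rnn_step` and `readout` are hypothetical callables replacing the 3D CNN, the RNN and the output layer, the frames are plain numbers, and non-overlapping clips are an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clips_of(frames, clip_len=8):
    """Yield consecutive non-overlapping clips; an incomplete tail is held back."""
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        yield frames[start:start + clip_len]


def run_online(frames, encode_clip, rnn_step, readout, hidden):
    """Emit one class distribution per clip, using only past and current clips."""
    outputs = []
    for clip in clips_of(frames):
        feature = encode_clip(clip)         # stands in for the 3D CNN
        hidden = rnn_step(hidden, feature)  # forward recurrence only
        outputs.append(softmax(readout(hidden)))
    return outputs


# Toy run: 20 scalar "frames", two classes, made-up stand-in functions.
frames = [float(i % 4) for i in range(20)]
outputs = run_online(
    frames,
    encode_clip=lambda clip: sum(clip) / len(clip),
    rnn_step=lambda h, f: math.tanh(0.5 * h + f),
    readout=lambda h: [h, -h],
    hidden=0.0,
)
```

Because the state only moves forward, a prediction is available after every 8-frame clip rather than after the whole gesture, which is what allows a decision before the gesture ends.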

  39. ONLINE GESTURE CLASSIFICATION: Training loss function. Labeling dynamic gestures is difficult; labeling per frame is ambiguous. Input: video frames. Labels: one per frame. Loss function: per-frame negative log-likelihood.
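A per-frame negative log-likelihood can be written in a few lines. This is a generic sketch, with averaging over frames as an assumed normalization; it is not necessarily the exact loss used in the talk.

```python
import math


def per_frame_nll(frame_probs, frame_labels):
    """Mean negative log-likelihood of the labelled class over all frames.

    frame_probs[t][k] is the predicted probability of class k at frame t;
    frame_labels[t] is the class index assigned to frame t.
    """
    total = 0.0
    for probs, label in zip(frame_probs, frame_labels):
        total -= math.log(probs[label])
    return total / len(frame_labels)


# Two frames, two classes (hypothetical values).
loss = per_frame_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

The loss needs one label for every frame, so it depends on knowing exactly where a gesture starts and ends, which is the ambiguity this slide points out.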

  40. ONLINE GESTURE CLASSIFICATION: Training loss function. Sequence-based training is the solution. Input: video. Sequence: nothing, slide right, nothing, slide left, nothing. Loss function: Connectionist Temporal Classification (CTC) by A. Graves et al.
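The idea behind CTC is to sum the probability of every per-step labeling that collapses to the target sequence, so no per-frame alignment is needed. Below is a minimal sketch of the standard forward recursion, checked against brute-force enumeration. Treating "nothing" as the CTC blank (index 0) and the toy probabilities are assumptions of this example, not details from the slides.

```python
import math
from itertools import product

BLANK = 0  # assumed convention: index 0 is the blank / "nothing" symbol


def ctc_collapse(path):
    """Merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_probability(probs, target):
    """P(target | probs), summed over all alignments by the forward recursion.

    probs[t][k] is the probability of symbol k at step t; target has no blanks.
    """
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]  # blank-interleaved target, length 2 * len(target) + 1
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][BLANK]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for i, s in enumerate(ext):
            total = alpha[i]
            if i >= 1:
                total += alpha[i - 1]
            # Skipping the blank between two symbols is allowed only if they differ.
            if i >= 2 and s != BLANK and s != ext[i - 2]:
                total += alpha[i - 2]
            new[i] = total * probs[t][s]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def ctc_loss(probs, target):
    """Negative log of the total alignment probability."""
    return -math.log(ctc_probability(probs, target))


def brute_force(probs, target):
    """Reference check: enumerate every path and sum those that collapse to target."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if ctc_collapse(path) == list(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


# Toy example: 4 steps, symbols 0 = nothing, 1 = slide right, 2 = slide left.
toy_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.1, 0.4], [0.3, 0.1, 0.6]]
toy_loss = ctc_loss(toy_probs, [1, 2])
```

The recursion costs time proportional to the number of steps times the target length, whereas the brute-force check grows exponentially with the number of steps and is only practical for toy inputs.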

  41. ONLINE GESTURE CLASSIFICATION: Italian sign language recognition. ChaLearn 2014 challenge. RGB-D videos of 20 Italian sign language gestures. 13K gestures. 20 subjects.

  42. ONLINE GESTURE CLASSIFICATION: Italian sign language recognition. Classification accuracy (%): Pigou et al.* 97.2, 3D-CNN 97.4, 3D-CNN with CTC 98.2. 35% improvement in accuracy, by seeing only 41% of the gesture. *L. Pigou et al. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video.

  43. ONLINE GESTURE CLASSIFICATION: Italian sign language recognition. 35% improvement in accuracy, by seeing only 41% of the gesture. No pre- or post-processing.
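The slides do not say how the 35% figure is computed. One reading that roughly reproduces it is a relative reduction in error rate, assuming 97.2 is the Pigou et al. accuracy and 98.2 is the accuracy of the 3D-CNN trained with CTC (this pairing of chart values with labels is itself an assumption).

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors that are removed; accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Assumed pairing: 97.2 = baseline, 98.2 = proposed method.
reduction = relative_error_reduction(97.2, 98.2)
print(f"{reduction:.1%}")  # about 35.7%
```

Under this reading the error rate drops from 2.8% to 1.8%, that is, about one third of the remaining mistakes are removed.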

  44. ONLINE GESTURE CLASSIFICATION: Car interfaces. In-house database. Media player, navigation, phone. 20 subjects, 25 gestures. More information at CVPR2016.

  45. ONLINE GESTURE CLASSIFICATION: Car interfaces. In-house database. Media player, navigation, phone. 20 subjects, 25 gestures. More information at CVPR2016. Chart: Human 88, Ours 84, C3D 79, iDT 73, SNV 71, Two-stream CNN 66, HOG+HOG2 37.

  46. ONLINE GESTURE CLASSIFICATION: Latency is critical. Suitability of hardware for inference, compared for CPU and GPU on image classification and on video classification.

  47. ONLINE GESTURE CLASSIFICATION: Scalability. NVIDIA TX1, for embedded solutions. A credit-card-sized GPU in your pocket. Our R3DCNN takes only 30% of the GPU.

  48. CONTRIBUTIONS: Data augmentation helps deep learning a lot. R3DCNNs are the best for sign language and gesture recognition. CTC helps a lot for video sequence learning. Scalable enough to run on NVIDIA TX1.

  49. Deep Learning. Data Augmentation. CTC.

  50. THANK YOU. Join the NVIDIA Developer Program at developer.nvidia.com/join
