April 4-7, 2016 | Silicon Valley
GESTURE RECOGNITION WITH 3D CNNS
Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, Jan Kautz
4/6/2016
AGENDA: Motivation; Problem statement; Selecting the best classifier; Online gesture detection and classification; Demos
MOTIVATION
GESTURES ARE A NATURAL FORM OF COMMUNICATION photo.elsoar.com
SAFE INTERFACES @ bmw.com
THE NEED FOR VIDEO RELAY SERVICES @ http://relayservice.gov.au/
GAMING @ leapmotion
PROBLEM STATEMENT
PROBLEM STATEMENT: No special devices. Single commodity sensor: • Kinectv1 • SoftKinetic. Tasks listed: gesture recognition, skeleton tracking, gaze estimation, head tracking.
PROBLEM STATEMENT: Understanding gesture concepts. We do: classifier -> "Thumb up"; classifier -> "Wave hand". We don't: hand model fitting and tracking. *http://www.virtualrealityreviewer.com/leap-motion-enters-vr-new-software-product-accessory-preview-what%C2%B9s-next/
SELECTING THE BEST CLASSIFIER
SELECTING THE BEST CLASSIFIER: VIVA Challenge 2015, organized by UCLA. 19 classes, 8 subjects. Driver and passenger. RGB + depth from Microsoft Kinect. 885 gestures in total. Gesture examples: slide 2 fingers left, zoom out, rotate CCW.
SELECTING THE BEST CLASSIFIER: 3D Convolutional Neural Network. [Figure labels: RGB and depth input; four stages of 3D convolution and max-pooling; ReLU (twice); softmax; prediction]
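To make the building blocks named on this slide concrete, here is a minimal pure-Python sketch of one 3D convolution followed by max-pooling on a tiny single-channel clip. It is an illustration only: the function names, the single channel, the "valid" convolution with stride 1, the cubic pooling window, and the placement of the ReLU are all assumptions of this sketch, not details taken from the talk.

```python
from itertools import product

def conv3d(volume, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation of a T x H x W volume."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over the time, height, and width extent of the kernel.
                for dt, dy, dx in product(range(kt), range(kh), range(kw)):
                    acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

def relu3d(volume):
    """Element-wise max(0, v)."""
    return [[[max(0.0, v) for v in row] for row in plane] for plane in volume]

def max_pool3d(volume, size=2):
    """Non-overlapping max-pooling with a cubic window; trailing remainders are dropped."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    return [[[max(volume[t + dt][y + dy][x + dx]
                  for dt, dy, dx in product(range(size), repeat=3))
              for x in range(0, W - size + 1, size)]
             for y in range(0, H - size + 1, size)]
            for t in range(0, T - size + 1, size)]

# A 5-frame clip of 5 x 5 frames and a 2 x 2 x 2 kernel, both filled with ones.
clip = [[[1.0] * 5 for _ in range(5)] for _ in range(5)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
features = max_pool3d(relu3d(conv3d(clip, kernel)))
```

The point of the sketch is that the kernel slides along time as well as along height and width, so each output value summarizes a small spatio-temporal neighborhood of the clip.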
SEGMENTED GESTURE CLASSIFICATION: Training. [Figure: RGB and depth -> 3D CNN -> error -> back-propagation update]
SELECTING THE BEST CLASSIFIER: First result (classification accuracy, higher is better)
Testing set: HON4D [1] 58.7%; HOG [2] 64.5%; 3D-CNN 48.3%
Training set: 99.9%
[1] Oreifej and Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences, CVPR, 2013.
[2] Ohn-Bar and Trivedi, IEEE Trans. on Intelligent Transportation Systems, 2014.
SELECTING THE BEST CLASSIFIER: Recent success in deep learning has benefited from large datasets. ImageNet: 1.5 M examples. VIVA: 885 examples.
SELECTING THE BEST CLASSIFIER: Training with data augmentation. [Figure: RGB and depth -> data augmentation -> 3D CNN -> error -> back-propagation update]
SELECTING THE BEST CLASSIFIER: Data augmentation generates new training data through spatial geometric transformations and temporal augmentation. Example shown: flip. [Figure: original and augmented examples]
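A small sketch of how spatial and temporal augmentation of a clip might look in code. This is not the authors' pipeline: the helper names, the nearest-neighbour resampling, the 50% flip probability, and the 0.8 to 1.2 speed range are hypothetical choices made for illustration.

```python
import random

def flip_horizontal(clip):
    """Mirror every frame left-to-right (a clip is a list of frames, a frame a list of rows)."""
    return [[row[::-1] for row in frame] for frame in clip]

def temporal_resample(clip, scale):
    """Speed the gesture up or slow it down by nearest-neighbour resampling of
    the frame index around the clip centre; the number of frames is unchanged."""
    n = len(clip)
    center = (n - 1) / 2.0
    out = []
    for i in range(n):
        src = round(center + (i - center) * scale)
        src = min(max(src, 0), n - 1)  # clamp to valid frame indices
        out.append(clip[src])
    return out

def augment(clip, rng):
    """Randomly flip, then randomly rescale time (ranges are assumptions)."""
    if rng.random() < 0.5:
        clip = flip_horizontal(clip)
    return temporal_resample(clip, rng.uniform(0.8, 1.2))

# Four 1 x 3 frames; frame i holds the values i, i + 1, i + 2.
clip = [[[i, i + 1, i + 2]] for i in range(4)]
new_clip = augment(clip, random.Random(0))
```

Note that mirroring a direction-sensitive gesture (for example, a slide to the left) produces a clip of the opposite gesture, so the label may need to change along with the pixels.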
SELECTING THE BEST CLASSIFIER: VIVA: 885 examples. Augmented: 0.3 M examples.
SELECTING THE BEST CLASSIFIER: Official challenge results (classification accuracy in %, higher is better)
NVIDIA (3D-CNN), with data augmentation: 77.5
NVIDIA (3D-CNN), no data augmentation: 48.3
HOG+HOG2: 64.5
HON4D: 58.7
Dense Trajectories: 54
HOG3D: 44.6
Harris-3.5D: 36.4
SELECTING THE BEST CLASSIFIER: Speed (FPS, higher is better)
NVIDIA (3D-CNN): 110, GPU +250, cuDNNv4 +400
HOG+HOG2: 50
HON4D: 25
Dense Trajectories: 18
HOG3D: 3
Harris-3.5D: 0.2
Legend also includes: CPU
SEGMENTED GESTURE CLASSIFICATION: The classification decision is made after the gesture ends, which introduces latency. [Figure: timeline marking the start of the gesture, the end of the gesture, and the classification decision]
ONLINE GESTURE DETECTION AND CLASSIFICATION
ONLINE GESTURE CLASSIFICATION: Making the decision before the gesture ends improves feedback and user experience. [Figure: timeline marking the start of the gesture, the end of the gesture, and the classification decision]
ONLINE GESTURE CLASSIFICATION: R3DCNN. Forward recurrence only. Connectionist Temporal Classification (CTC), for training only. Detection and classification. 109M parameters. [Figure: video in 8-frame clips -> 3D CNN (local motion descriptor) -> RNN (global motion descriptor) -> softmax per clip; other label: Video server]
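The "forward recurrence only" idea can be sketched as a plain recurrent step applied clip by clip, so that a class distribution is available as soon as each clip has been seen. Everything concrete here is an assumption for illustration: the tanh nonlinearity, the absence of bias terms, the tiny weight matrices, and the function names. The clip features stand in for the 3D CNN output.

```python
import math

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def rnn_step(x, h, Wx, Wh, Wo):
    """One forward step: new hidden state and class probabilities for this clip."""
    pre = [a + b for a, b in zip(matvec(Wx, x), matvec(Wh, h))]
    h_new = [math.tanh(p) for p in pre]
    return h_new, softmax(matvec(Wo, h_new))

def run_online(clip_features, Wx, Wh, Wo):
    """Process clips in arrival order; output t depends only on clips 0..t."""
    h = [0.0] * len(Wh)
    outputs = []
    for x in clip_features:
        h, probs = rnn_step(x, h, Wx, Wh, Wo)
        outputs.append(probs)
    return outputs

# Hypothetical 2-dimensional clip features, 2 hidden units, 3 classes.
Wx = [[1.0, 0.0], [0.0, 1.0]]
Wh = [[0.5, 0.0], [0.0, 0.5]]
Wo = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs_per_clip = run_online([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], Wx, Wh, Wo)
```

Because nothing in the loop looks ahead, a prediction for the current clip never waits for future frames, which is what makes an early decision possible.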
ONLINE GESTURE CLASSIFICATION: Training loss function. Labeling dynamic gestures is difficult, and per-frame labeling is ambiguous. Loss function: per-frame negative log-likelihood. [Figure: input frames with per-frame labels]
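For reference, a per-frame negative log-likelihood can be written in a few lines. This sketch assumes one label index per frame and averages over frames; whether the loss is summed or averaged is not stated on the slide, and the function name is hypothetical.

```python
import math

def per_frame_nll(frame_probs, frame_labels):
    """Mean negative log-likelihood of the labelled class over all frames.

    frame_probs[t][k] is the predicted probability of class k at frame t;
    frame_labels[t] is the index of the class assigned to frame t.
    """
    if len(frame_probs) != len(frame_labels):
        raise ValueError("need exactly one label per frame")
    total = 0.0
    for probs, label in zip(frame_probs, frame_labels):
        total -= math.log(probs[label])
    return total / len(frame_labels)

loss = per_frame_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

The dependence on `frame_labels` is the weak point the slide describes: every frame needs a label, including frames near the gesture boundaries where the correct label is ambiguous.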
ONLINE GESTURE CLASSIFICATION: Training loss function. Sequence-based training is the solution. Input: nothing – slide right – nothing – slide left – nothing. Sequence: Loss function: Connectionist Temporal Classification (CTC) by A. Graves et al.
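A compact sketch of the two ideas behind CTC: a collapse rule that maps a per-step labelling to a label sequence, and a forward recursion that sums the probability of every per-step labelling that collapses to the target. The class indices (0 for "nothing", 1 for "slide right", 2 for "slide left"), the use of "nothing" as the CTC blank, and the function names are assumptions of this sketch. The training loss would be the negative logarithm of the returned probability.

```python
BLANK = 0  # index treated as "nothing" / CTC blank (assumption of this sketch)

def collapse(path):
    """CTC mapping: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

def ctc_probability(probs, target):
    """P(target | input), summed over all alignments with the forward recursion.

    probs[t][k] is the probability of class k at step t.
    """
    ext = [BLANK]  # target with blanks before, between, and after the labels
    for s in target:
        ext += [s, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][BLANK]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for i, s in enumerate(ext):
            a = alpha[i]                      # stay on the same symbol
            if i >= 1:
                a += alpha[i - 1]             # advance by one position
            if i >= 2 and s != BLANK and s != ext[i - 2]:
                a += alpha[i - 2]             # skip the blank between two different labels
            new[i] = a * probs[t][s]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)

# nothing, slide right, slide right, nothing, slide left, nothing
sequence = collapse([0, 1, 1, 0, 2, 0])  # -> [1, 2]
```

With this formulation only the ordered list of gestures in a video is needed for training; the exact frame at which each gesture starts or ends does not have to be labelled.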
ONLINE GESTURE CLASSIFICATION: Italian sign language recognition. ChaLearn 2014 challenge. RGB-D videos of 20 Italian sign language gestures. 13K gestures, 20 subjects.
ONLINE GESTURE CLASSIFICATION: Italian sign language recognition, classification accuracy (%)
Pigou et al.*: 97.2
3D-CNN: 97.4
3D-CNN CTC: 98.2
35% improvement in accuracy. By seeing only 41% of the gesture. No pre- or post-processing.
*L. Pigou et al. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video
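One way to read the "35% improvement" figure is as a relative reduction in classification error rather than a gain in accuracy points. The sketch below checks that reading under the assumption that 97.2 is the baseline accuracy and 98.2 is the accuracy of the proposed model; the slide text does not state this mapping explicitly, and the function name is hypothetical.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors that the new model removes.

    Both accuracies are given in percent.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err

# Errors of 2.8% and 1.8% give a reduction of about 0.357, roughly 35%.
reduction = relative_error_reduction(97.2, 98.2)
```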
ONLINE GESTURE CLASSIFICATION: Car interfaces. In-house database. Media player, navigation, phone. 20 subjects, 25 gestures. More information at CVPR2016.
Results: Human 88; Ours 84; C3D 79; iDT 73; SNV 71; Two stream CNN 66; HOG+HOG2 37
ONLINE GESTURE CLASSIFICATION: Latency is critical. Suitability of hardware for inference. [Figure: CPU and GPU compared for image classification and for video classification]
ONLINE GESTURE CLASSIFICATION: Scalability. NVIDIA TX1, for embedded solutions: a credit-card-sized GPU in your pocket. Our R3DCNN takes only 30% of the GPU.
CONTRIBUTIONS: Data augmentation helps deep learning considerably. R3DCNNs performed best on the sign language and gesture recognition benchmarks shown. CTC helps considerably with video sequence learning. The approach is scalable enough to run on NVIDIA TX1.
Deep Learning, Data Augmentation, CTC
THANK YOU. JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join