Multimodal 2DCNN action recognition from RGB-D Data with Video Summarization
Vicent Roig Ripoll
Master in Artificial Intelligence (UPC, UB, URV)
Master’s Thesis
Advisor: Sergio Escalera Guerrero
Co-advisor: Maryam Asadi-Aghbolaghi
October, 2017

Overview
1. Introduction
2. Related Work
3. Video Summarization
4. Proposed Method
5. Experimental results
6. Conclusions
7. References

Introduction
Motivation:
- Human action recognition research area:
  - large intra-class variations
  - low video resolution
  - high dimension of video data
- Kinect → multimodal data access
- Hand-crafted features vs automatic feature learning
Goals:
- Analyse multimodal data benefits in deep learning
- To this end, 2DCNN is extended to multimodal (MM2DCNN)
- Evaluation of video summarization impact in action recognition

Hand-crafted Features
Approaches to cope with temporal information:
1. Treat videos as spatio-temporal volumes
2. Flow-based features, which explicitly deal with motion
3. Trajectory-based approaches, where motion is implicitly modelled
- Histograms of Oriented Gradients (HOG) → HOG3D
- Scale-Invariant Feature Transform (SIFT) → 3D-SIFT
- Histogram of Normals (HON) → HON4D
- Dense Trajectories (DT & iDT)

Optical Flow
For a given time t and pixel (x, y)_t:
$(x, y)_{t+1} = (x, y)_t + d_t(x, y)$
where d_t(x, y) is the displacement of pixel (x, y) between frames t and t+1.
Applications:
- Trajectory construction
- Descriptors: HOF, MBH
- Deep learning → CNN input
Figure: Optical flow field vectors (green vectors with red end points)
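
The trajectory-construction use of the update rule above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the flow fields are assumed to be already computed and stored as nested lists indexed `[row][column]`, and the function name `track` and the nearest-pixel sampling are assumptions made for the example.

```python
def track(start, flows):
    """Follow one point through consecutive flow fields.

    Each step applies (x, y)_{t+1} = (x, y)_t + d_t(x, y).
    """
    x, y = start
    trajectory = [(x, y)]
    for flow in flows:
        height, width = len(flow), len(flow[0])
        # Sample the displacement at the nearest pixel, clamped to the image.
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        dx, dy = flow[row][col]
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory


# Hypothetical 4x4 flow field: every pixel moves 1 px right and 0.5 px down.
uniform_flow = [[(1.0, 0.5)] * 4 for _ in range(4)]
print(track((0, 0), [uniform_flow, uniform_flow]))
# [(0, 0), (1.0, 0.5), (2.0, 1.0)]
```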

Scene Flow
For a given time t and point (x, y, z)_t:
$(x, y, z)_{t+1} = (x, y, z)_t + d_t(x, y, z)$
where d_t(x, y, z) is the 3D displacement of the point between frames t and t+1.
Applications:
- 3D trajectory construction
- Deep learning → CNN input
Advantages over optical flow:
- Real world motion units
- Z-axis motion
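
To make the "real world motion units" point concrete, here is a small sketch that turns two corresponding depth pixels into 3D points and takes their difference as the scene flow vector. It assumes a pinhole camera model; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the pixel correspondence are hypothetical values chosen for illustration, and in practice the correspondence would come from an optical flow or scene flow estimator.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth in metres to (x, y, z)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def scene_flow(point_t, point_t1):
    """3D displacement d_t such that point_t1 = point_t + d_t."""
    return tuple(b - a for a, b in zip(point_t, point_t1))


# Hypothetical intrinsics and one tracked pixel in two consecutive frames.
fx = fy = 500.0
cx, cy = 320.0, 240.0
p_t = backproject(320, 240, 2.0, fx, fy, cx, cy)   # on the optical axis, 2.0 m away
p_t1 = backproject(370, 240, 1.5, fx, fy, cx, cy)  # moved right and 0.5 m closer
d = scene_flow(p_t, p_t1)
print(d)  # roughly (0.15, 0.0, -0.5): metres along x, y and z
```

The negative z component is the motion towards the camera, which a 2D optical flow vector cannot express.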

Deep Learning - Two-stream Convolutional Neural Network
2DCNN performs the recognition by processing two different streams, spatial and temporal, and combining both by late fusion.
Figure: Two-stream architecture for video classification

Video Summarization
Video summarization extracts a few video frames (keyframes) that jointly maximize the information retained from the original video.
Figure: Video summarization overview

Video Summarization - Techniques
- Sequential Distortion Minimization (SeDiM) [Panagiotakis2013]: Selects frames so that the distortion between the original video and the synopsis is minimized. Does not guarantee global minima of distortion.
- Absolute Histogram Difference (Hdiff) [CV2015]: Simple summarization technique based on the absolute difference of histograms of consecutive frames.
- Time Equidistant Algorithm (TEA): Keeps keyframes at equal intervals in duration.
- Content Equidistant Algorithm (CEA) [4783025]: Based on the iso-content principle. Estimates keyframes that are equidistant in video content.
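
The difference between time-equidistant and content-equidistant selection can be sketched as below. This is an illustrative reading of the two ideas, not the cited algorithms: taking the centre frame of each time segment, using per-frame descriptor vectors, and measuring content change with an L1 distance are all assumptions of the example.

```python
import bisect


def tea_keyframes(n_frames, k):
    """Time equidistant: centre frame of each of k equal-length time segments."""
    return [int((i + 0.5) * n_frames / k) for i in range(k)]


def cea_keyframes(descriptors, k):
    """Content equidistant: frames at equal steps of accumulated content change."""
    # cum[i] = total content change from frame 0 up to frame i (L1 distance).
    cum = [0.0]
    for prev, cur in zip(descriptors, descriptors[1:]):
        cum.append(cum[-1] + sum(abs(a - b) for a, b in zip(prev, cur)))
    total = cum[-1]
    if total == 0:  # static video: fall back to time spacing
        return tea_keyframes(len(descriptors), k)
    targets = [(i + 0.5) * total / k for i in range(k)]
    # First frame whose accumulated change reaches each target.
    return [bisect.bisect_left(cum, t) for t in targets]


# Hypothetical 1-D descriptors: all the change happens around frames 3 and 4.
descriptors = [[0], [0], [0], [10], [20], [20], [20]]
print(tea_keyframes(len(descriptors), 2))  # [1, 5]
print(cea_keyframes(descriptors, 2))       # [3, 4]
```

In this toy sequence the time-based choice lands on two static frames, while the content-based choice lands on the frames where the descriptor actually changes.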

SeDiM - Architecture
Figure: Schemes for (a) the original version (original steps) and (b) our proposal (modified version)

SeDiM - Examples
Figure: k = 5 keyframes on Montalbano RGB samples. 1st row: vattene, 2nd: seipazzo, 3rd: combinato, 4th: ok

Hdiff - Examples
Figure: k = 5 keyframes on Montalbano RGB samples. 1st row: vattene, 2nd: seipazzo, 3rd: combinato, 4th: ok
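
A minimal sketch of histogram-difference keyframe selection follows. The slides only state that Hdiff relies on the absolute difference of histograms of consecutive frames; the selection rule used here (keep the k frames with the largest difference from their predecessor), the 8-bin grey-level histogram and the function names are assumptions for illustration.

```python
def histogram(frame, bins=8, max_value=256):
    """Grey-level histogram of a frame given as a nested list of ints in [0, 255]."""
    counts = [0] * bins
    for row in frame:
        for value in row:
            counts[value * bins // max_value] += 1
    return counts


def hdiff_keyframes(frames, k, bins=8):
    """Indices of the k frames that differ most from the frame before them."""
    hists = [histogram(f, bins) for f in frames]
    # scores[i - 1] = absolute histogram difference between frame i - 1 and frame i.
    scores = [
        sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i]))
        for i in range(1, len(hists))
    ]
    ranked = sorted(range(1, len(frames)), key=lambda i: scores[i - 1], reverse=True)
    return sorted(ranked[:k])


def flat(value):
    """Hypothetical 2x2 frame filled with a single grey level."""
    return [[value, value], [value, value]]


frames = [flat(0), flat(0), flat(200), flat(200), flat(100)]
print(hdiff_keyframes(frames, 2))  # [2, 4]
```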

TEA - Examples
Figure: k = 5 keyframes on Montalbano RGB samples. 1st row: vattene, 2nd: seipazzo, 3rd: combinato, 4th: ok

CEA - Examples
Figure: k = 5 keyframes on Montalbano RGB samples. 1st row: vattene, 2nd: seipazzo, 3rd: combinato, 4th: ok

Proposed Method
1. Data Pre-processing
   1. RGB-D registration
   2. Depth denoising
2. Video Summarization strategies
   1. RGB: Ordered sequences of k = 14 RGB videos
   2. Depth: Ordered sequences of k = 14 Depth videos
   3. RGB-D: Combination of k = 7 RGB and depth summaries
3. Multi-Modal 2D CNN
   - Extend VGG-16 2DCNN by adding a scene flow stream
   - Base models are UCF101 (temporal and spatial)
   - Scene flow stream is to be fine-tuned from the RGB model of the same dataset
   - Weighted average fusion

RGB-D Alignment
- Some datasets are not properly aligned
- RGB-D registration uses the intrinsic (focal length and distortion model) and extrinsic (translation and rotation) camera parameters to warp the colour image to fit the depth map
Figure: IsoGD RGB and depth frame superpositions
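
The role of the intrinsic and extrinsic parameters can be illustrated by mapping a single depth pixel into the colour image, which tells the warp where to read the colour for that depth pixel. This is a sketch under simplifying assumptions: a pinhole model with no lens distortion, and hypothetical calibration values (`depth_K`, `colour_K`, `R`, `t`) that do not come from any of the datasets.

```python
def depth_pixel_to_colour_pixel(u, v, z, depth_K, colour_K, R, t):
    """Find where depth pixel (u, v) with depth z (metres) falls in the colour image.

    depth_K and colour_K are (fx, fy, cx, cy) tuples; R is a 3x3 rotation
    (nested lists) and t a 3-vector, both mapping depth-camera coordinates
    to colour-camera coordinates.
    """
    fx_d, fy_d, cx_d, cy_d = depth_K
    fx_c, fy_c, cx_c, cy_c = colour_K
    # 1. Back-project the depth pixel to a 3D point (intrinsics of the depth camera).
    p = ((u - cx_d) * z / fx_d, (v - cy_d) * z / fy_d, z)
    # 2. Move the point into the colour camera frame (extrinsics).
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    # 3. Project onto the colour image plane (intrinsics of the colour camera).
    return (fx_c * q[0] / q[2] + cx_c, fy_c * q[1] / q[2] + cy_c)


# Hypothetical calibration: identical intrinsics, cameras 2.5 cm apart along x.
K = (500.0, 500.0, 320.0, 240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uc, vc = depth_pixel_to_colour_pixel(320, 240, 1.0, K, K, identity, (0.025, 0.0, 0.0))
print(uc, vc)  # approximately 332.5 240.0
```

Repeating this lookup for every depth pixel yields a colour image resampled onto the depth map grid.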

Hybrid Median Filter
Figure: HMF workflow
Figure: 5x5 HMF shapes
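
Since the workflow figure is not reproduced here, the following sketch shows the commonly used formulation of a hybrid median filter with a 5x5 window: the output is the median of three values, namely the median of the "+"-shaped neighbourhood, the median of the "x"-shaped neighbourhood, and the centre pixel. Whether the thesis uses exactly this variant is an assumption, as is the choice to leave border pixels unchanged.

```python
from statistics import median


def hybrid_median_filter(img, radius=2):
    """Hybrid median filter on a nested-list image; radius=2 gives a 5x5 window."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]  # border pixels are copied unchanged
    offsets = range(-radius, radius + 1)
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            # "+" shape: the centre column plus the centre row (centre counted once).
            plus = [img[y + d][x] for d in offsets]
            plus += [img[y][x + d] for d in offsets if d != 0]
            # "x" shape: both diagonals (centre counted once).
            cross = [img[y + d][x + d] for d in offsets]
            cross += [img[y + d][x - d] for d in offsets if d != 0]
            out[y][x] = median([median(plus), median(cross), img[y][x]])
    return out


# Hypothetical 5x5 depth patch with one impulse-noise pixel in the centre.
patch = [[10] * 5 for _ in range(5)]
patch[2][2] = 255
filtered = hybrid_median_filter(patch)
print(filtered[2][2])  # 10
```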

Denoising (1)
Figure: (a) Original, (b) Inpaint, (c) Inpaint + HMF

Denoising (2)
Figure: 1st row: Inpainting + HMF, 2nd row: Superposition before registration, 3rd row: Superposition after registration

Late Fusion
A weighted sum is used to fuse the class scores of each modality. Given N modalities, each sample has N score arrays of size K (the number of classes), and the final scores are:
$S_f = \sum_{i=1}^{N} w_i S_i$
where the weights w_i are to be optimized.
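
The weighted sum above, together with one possible way of choosing the weights, can be sketched as follows. The slides do not say how the weights are optimized, so the exhaustive grid search over weights that sum to 1, evaluated on validation accuracy, is an assumption; the score values and function names are hypothetical.

```python
from itertools import product


def fuse(scores, weights):
    """S_f = sum_i w_i * S_i for one sample; scores holds N lists of K class scores."""
    n_classes = len(scores[0])
    return [sum(w * s[c] for w, s in zip(weights, scores)) for c in range(n_classes)]


def predict(scores, weights):
    """Index of the class with the highest fused score."""
    fused = fuse(scores, weights)
    return max(range(len(fused)), key=fused.__getitem__)


def grid_search_weights(val_scores, val_labels, steps=10):
    """Try every weight vector on a grid with sum 1; keep the most accurate one."""
    n_modalities = len(val_scores[0])
    best_weights, best_acc = None, -1.0
    for head in product(range(steps + 1), repeat=n_modalities - 1):
        if sum(head) > steps:
            continue
        weights = [h / steps for h in head] + [(steps - sum(head)) / steps]
        hits = sum(predict(s, weights) == y for s, y in zip(val_scores, val_labels))
        acc = hits / len(val_labels)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# One hypothetical sample with three modalities and two classes.
sample = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
print(predict(sample, [0.2, 0.3, 0.5]))  # 1

# Hypothetical validation set with two modalities; the first one is more reliable.
val_scores = [[[0.9, 0.1], [0.4, 0.6]], [[0.2, 0.8], [0.6, 0.4]]]
val_labels = [0, 1]
weights, accuracy = grid_search_weights(val_scores, val_labels)
print(accuracy)  # 1.0
```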

MSR Daily Activity 3D
Characteristics:
- Action recognition
- 16 classes
- 10 subjects
- 320 samples
Evaluation:
- 25% Train
- 25% Validation
- 50% Test

Montalbano V2
Characteristics:
- Gesture recognition
- 20 classes
- 27 subjects
- 940 samples
- 13858 gestures
Evaluation:
- 1-470 Train
- 471-700 Validation
- 701-940 Test

Isolated Gesture Dataset (IsoGD)
Characteristics:
- Gesture recognition
- 249 classes
- 17 subjects
- 47933 gestures
Evaluation:
- 35878 Train
- 5784 Validation
- 6271 Test

Evaluation on MSR Daily
Figure: results for SeDiM, TEA, Hdiff and CEA
W = [0.2, 0.3, 0.5]

Evaluation on Montalbano V2
Figure: results for SeDiM, TEA, CEA and Hdiff
W = [0.65, 0.15, 0.2]

Evaluation on IsoGD
Figure: results for SeDiM, TEA, Hdiff and CEA
W = [0.2, 0.8]

Comparison - MSR Daily
Method         Accuracy (%)
EigenJoints    58.10
MovingPose     73.80
HON4D          80.00
SSTKDes        85.00
ActionLet      85.75
MMDT           78.13
MM2DCNN        68.50
Table: Performance comparison with state-of-the-art methods on MSR Daily

Comparison - Montalbano V2
Method                Accuracy (%)
Rank pooling          75.30
AdaBoost, HoG         83.40
Temp Conv + LSTM      94.49
Dense Trajectories    83.50
MMDT                  85.66
MM2DCNN               97.74
Table: Performance comparison with state-of-the-art methods on Montalbano

Comparison - IsoGD
Method           Accuracy (%)
NTUST            20.33
MFSK             24.19
MFSK+DeepID      23.67
XJTUfx           43.92
XDETVP-TRIMPS    50.93
TARDIS           40.15
ICT NHCI         46.80
AMRL             55.57
2SCVN-3DDSN      67.19
MM2DCNN          46.63
Table: Performance comparison with state-of-the-art methods on IsoGD
ref: http://chalearnlap.cvc.uab.es
