Structured Deep Learning for Video Analysis Fabien Baradel PhD - PowerPoint PPT Presentation

Structured Deep Learning for Video Analysis
Fabien Baradel, PhD Candidate
Advisors: Christian Wolf & Julien Mille
June 29, 2020
fabienbaradel.github.io

What is video understanding?
Human actions: sitting on the floor
Entity-level interactions: grabbing a silencer
Temporal reasoning / causality: the baby starts crying because of the silencer

Why video understanding?
Indexing, recommendation, human-robot interactions, retrieval, analysis
[TTNET, CVPRW’20]

A video understanding task: Action Recognition
Video → CV system → "walking"
Classification task; pre-defined labels; similar to object recognition
[HMDB, Kuehne, ICCV’11]

Human Pose
Walking: pose is enough
Chopping: what is being chopped?
[Johansson, 1973]

Context & Appearance
Swimming? Handshaking

Action Recognition: Recent works
Two-stream: video → appearance CNN and motion CNN (2D kernels) → label [Two-stream, Simonyan, NIPS’14]
I3D: video → 3D CNN (2D inflated kernels) → label [I3D, Carreira et al, CVPR’17]
Limitations: biased towards context; lack of explainability. Human pose? Objects? Scene?
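
The "2D inflated kernel" idea on this slide can be illustrated with a minimal sketch. It assumes the usual inflation recipe (repeat the 2D kernel along a new time axis and divide by the temporal size) and uses plain nested lists; `inflate_kernel` and `conv_response` are hypothetical helper names, not code from the talk or from any library.

```python
def inflate_kernel(kernel_2d, time_size):
    """Repeat a 2D kernel along a new time axis and rescale by 1/time_size."""
    return [
        [[w / time_size for w in row] for row in kernel_2d]
        for _ in range(time_size)
    ]


def conv_response(kernel, patch):
    """Sum of elementwise products between nested lists of identical shape."""
    if isinstance(kernel, (int, float)):
        return kernel * patch
    return sum(conv_response(k, p) for k, p in zip(kernel, patch))


kernel_2d = [[1.0, 0.0], [0.0, -1.0]]
frame = [[3.0, 1.0], [2.0, 5.0]]

kernel_3d = inflate_kernel(kernel_2d, time_size=4)
static_clip = [frame] * 4  # a "boring" video: the same frame repeated 4 times

# On a static clip the inflated 3D kernel gives the same response as the 2D one.
print(conv_response(kernel_2d, frame), conv_response(kernel_3d, static_clip))
```

The rescaling is what lets a 3D network start from 2D image-pretrained weights: on a video made of identical frames, the inflated filter responds exactly as the original 2D filter does on a single frame.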

Structured Deep Learning

Outline
Visual Attention: « Glimpse Clouds », F. Baradel, C. Wolf, J. Mille, G. Taylor, CVPR'18
Entity-level interactions: « Object level Reasoning », F. Baradel, N. Neverova, C. Wolf, J. Mille, G. Mori, ECCV'18
Reasoning: « Counterfactual learning », F. Baradel, N. Neverova, J. Mille, G. Mori, C. Wolf, ICLR’20 (spotlight)
Graham W. Taylor (University of Guelph, Vector Institute); Christian Wolf (INSA Lyon - LIRIS); Julien Mille (INSA CVL - LI Tours)

Visual Attention
What is happening? Winter activities
[Yarbus, 1976] [Roger et al, 2012]

Visual Attention
What is Charlie doing? Walking
[Yarbus, 1976] [Roger et al, 2012]

Action Recognition: Baseline
Video → R3D → feature map → pooling → classifier → label (feature-map dimensions: T, H, W, D)
Limitations: What about fine-grained human actions? How to focus on relevant parts of the video?

Glimpse Clouds: Method
Components: video → R3D → context and local features; RNN over time steps t = 1, …, T; external memory; soft assignment of local features to Worker 1 … Worker N (N concepts); one label per worker
Losses: cross-entropy loss; maximize distance between glimpses; good human pose estimation
[Recurrent Visual Attention, Mnih et al, NIPS’14] [3D Resnet, Hara et al, CVPR’18]

Glimpse Clouds: State-of-the-art results
Accuracy on NTU-RGB+D:
Method                   Modality         CS     CV
Ensemble TS-LSTM         skeleton         74.6   81.3
View invariant           skeleton         80.0   87.2
Hands-Attention (ours)   skeleton + RGB   84.8   90.6
Glimpse Clouds (ours)    RGB              88.4   93.2
Accuracy on Northwestern-UCLA:
Method                   Modality         V1
Enhanced viz.            skeleton         86.1
Ensemble TS-LSTM         skeleton         89.2
NKTM                     RGB              75.8
Glimpse Clouds (ours)    RGB              90.1
fabienbaradel/glimpse_clouds
[Pose-driven hands-attention, Baradel et al, BMVC’18]

Glimpse Clouds: Ablation study
Impact of the attention mechanism
[Bar chart: accuracy of the Global model vs. Glimpse Clouds on NTU and NUCLA; annotated gains of +1.9 and +4.4]
Resolution matters; local fine-grained features

Glimpse Clouds: Visualization
Raw video vs. attended regions
Worker 1 → ~Hands; Worker 2 → ~Heads; Worker 3 → ~Legs
Note: argmax shown for the feature-to-worker association

Outline
Visual Attention: unstructured local features… Incorporate structure from images? Leverage visual entities interactions?
Entity-level interactions: « Object level Reasoning », F. Baradel, N. Neverova, C. Wolf, J. Mille, G. Mori, ECCV'18
Christian Wolf (INSA Lyon - LIRIS); Julien Mille (INSA CVL - LI Tours); Natalia Neverova (Facebook); Greg Mori (SFU)

Object-level Reasoning
Frames over time → action
Often possible to infer what happened from a few frames

Object-level Reasoning
Frames over time → action
Often possible to infer what happened from a few frames
Visual entities interactions
[Mask-RCNN, He et al, ICCV’17]

Image as a set of objects
RGB → Mask-RCNN → set of objects
Attributes: appearance (CNN features); shape (pixel location); semantic (COCO class)

Object Relation Network
Frame $t'$ has object set $O_{t'}$ (bed 1.00); frame $t$ has object set $O_t$ (bed 1.00, person 0.89)
Graph creation over $\mathrm{bed}_{t'}$, $\mathrm{bed}_t$, $\mathrm{person}_t$ → $g_t$
How? Structure? Clique size? Type of interactions?
[GCN, Kipf et al, ICLR’17] [Graph Networks, Battaglia et al, arXiv’18]

Object Relation Network
Frame $t'$ has object set $O_{t'}$ (bed 1.00); frame $t$ has object set $O_t$ (bed 1.00, person 0.89)
Example: $g_t = f(\mathrm{bed}_{t'}, \mathrm{person}_t) + f(\mathrm{bed}_{t'}, \mathrm{bed}_t)$
General form: $g_t = \sum_{o' \in O_{t'}} \sum_{o \in O_t} f(o', o)$
$f$ is a shared MLP: data efficient; invariant to the number of objects
Inter-frame object relations; semantic meaning; clique of size 2
[GCN, Kipf et al, ICLR’17] [Graph Networks, Battaglia et al, arXiv’18]
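
The pairwise sum on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: object features are short lists of floats, the shared MLP has fixed random weights instead of learned ones, and `make_shared_mlp` and `relation_feature` are hypothetical names that do not come from the fabienbaradel/object_level_visual_reasoning code.

```python
import random


def make_shared_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Build a tiny two-layer MLP f with fixed random weights (illustrative only)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1.0, 1.0) for _ in range(2 * in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def f(obj_prev, obj_curr):
        x = list(obj_prev) + list(obj_curr)  # concatenate the pair of object features
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]  # ReLU
        return [sum(w * v for w, v in zip(row, hidden)) for row in w2]

    return f


def relation_feature(objects_prev, objects_curr, f, out_dim):
    """Sum f over all inter-frame object pairs (cliques of size 2)."""
    g = [0.0] * out_dim
    for o_prev in objects_prev:
        for o_curr in objects_curr:
            g = [a + b for a, b in zip(g, f(o_prev, o_curr))]
    return g


f = make_shared_mlp(in_dim=3, hidden_dim=8, out_dim=4)
frame_prev = [[0.2, 0.1, 0.9]]                   # e.g. one "bed" detection
frame_curr = [[0.2, 0.1, 0.8], [0.7, 0.5, 0.3]]  # e.g. "bed" and "person"
g_t = relation_feature(frame_prev, frame_curr, f, out_dim=4)
print(len(g_t))  # 4, whatever the number of detected objects
```

Because the same function is applied to every pair and the results are summed, the output size does not depend on how many objects were detected, and the order of the detections does not matter.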

Object Relation Network
Detections over time: bed 1.00, bed 0.99, bed 1.00, bed 0.69, bed 1.00, bed 0.98; person 0.89, person 0.26, person 0.60, person 0.26, person 0.31
Inter-frame, object-level representations $g_t$ → RNN → video representation $r$ → linear → label
Loss: cross-entropy

Object Relation Network: State-of-the-art
Accuracy on Something-Something:
Method                    Acc.
C3D                       21.50
I3D                       27.63
Multiscale TRN            33.60
Object Relation Network   35.97
Mean Average Precision on VLOG:
Method                    mAP
Resnet50                  40.5
I3D                       39.7
Object Relation Network   44.7
Verb accuracy on EPIC Kitchens:
Method                    Acc.
Resnet18                  32.05
Resnet3D-18               34.20
Object Relation Network   40.89
Object masks detected by Mask-RCNN
fabienbaradel/object_level_visual_reasoning

Object Relation Network: Ablation study
Impact of the object relation network
[Bar chart: R3D vs. ORN on Something-Something, VLOG and EPIC; annotated gains of +4.8, +6.7 and +2.3]
Detection performance; high resolution

Object Relation Network: Learned relations
Interactions: human-laptop interactions
Spurious correlations: co-occurrences

Outline
Visual Attention; Entity-level interactions: structure matters… Can we go one step further? Beyond supervised learning? Learning underlying latent concepts?
Reasoning: « Counterfactual learning », F. Baradel, N. Neverova, J. Mille, G. Mori, C. Wolf, ICLR’20 (spotlight)
Christian Wolf (INSA Lyon - LIRIS); Julien Mille (INSA CVL - LI Tours); Natalia Neverova (Facebook); Greg Mori (SFU)

Reasoning & Causation
Latent concepts; understanding of complex relationships; cause-effect
What would have happened if? (counterfactual statement)

Counterfactual: Future forecasting
A: initial state → B: outcome
C: modified initial state → D: counterfactual outcome
Masses, frictions, gravity

Counterfactual: Future forecasting
Feedforward: initial state $A$ ($X_0$) → outcome $B$ ($X_{1:T}$), with confounder $U$
Counterfactual: given initial state $A$ and outcome $B$ (confounder $U$), apply $do(X_0 = C)$ to obtain the modified initial state $C$ and predict the counterfactual outcome $D$
[Algorithmization of counterfactuals, Pearl, arXiv’18]

CoPhy benchmark
Large-scale datasets: 250k examples ((A,B), (C,D)); 7 million frames
Supervision of the do-operator
Confounders are necessary for future prediction
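
Why confounders are necessary can be shown with a deliberately tiny toy problem. This is not the CoPhy setup: it assumes a single block sliding on a flat surface, with the friction coefficient as the only hidden confounder, and all function names are hypothetical. A is the initial speed, B the observed stopping distance, C a modified initial speed, and D the counterfactual stopping distance.

```python
G = 9.81  # gravitational acceleration in m/s^2


def stopping_distance(speed, friction):
    """Distance travelled by a sliding block before friction stops it."""
    return speed ** 2 / (2.0 * friction * G)


def estimate_friction(speed_a, distance_b):
    """Recover the hidden friction coefficient from the observed pair (A, B)."""
    return speed_a ** 2 / (2.0 * distance_b * G)


def counterfactual_distance(speed_a, distance_b, speed_c):
    """Predict D for the modified initial state C, reusing the confounder seen in (A, B)."""
    friction = estimate_friction(speed_a, distance_b)
    return stopping_distance(speed_c, friction)


hidden_friction = 0.3                         # never given to the predictor
a = 2.0                                       # A: initial speed
b = stopping_distance(a, hidden_friction)     # B: observed outcome
c = 3.0                                       # C: modified initial speed
d_pred = counterfactual_distance(a, b, c)     # D: counterfactual outcome
d_true = stopping_distance(c, hidden_friction)
print(abs(d_pred - d_true) < 1e-9)
```

A predictor that sees only C cannot tell a slippery surface from a rough one, so the same C is compatible with many outcomes. Observing (A, B) first is what pins down the confounder.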

CoPhyNet: Unsupervised confounder estimation
A and B (frames over time) → GCN per frame → RNN over time → confounder estimate U
[GCN, Kipf et al, ICLR’17]

CoPhyNet: Unsupervised confounder estimation
A and B (frames over time) → recurrent GCN → confounder estimate U
[GCN, Kipf et al, ICLR’17]

CoPhyNet: Trajectory prediction
C combined with U → recurrent GCN applied at t = 1, t = 2, …, t = T → D
[Perceived-causality, Gerstenberg et al, ACCSS’15]

Human study
Human non-CF: given C, predict D. Human CF: given (A, B) and C, predict D.
[Bar chart: 2D pixel error for each block (bottom, middle, top, average) for Human non-CF, Human CF and CoPhyNet]

CoPhyNet: Results
MSE on 3D positions (average over time), CoPhy benchmark:
Train → Test                      Copy C   Copy B   IN      NPE     CoPhyNet   IN Sup.
3 → 3                             0.470    0.601    0.318   0.331   0.294      0.296
3 → 3* (unseen confounders)       0.365    0.592    0.298   0.319   0.289      0.282
3 → 4 (unseen number of blocks)   0.754    0.846    0.524   0.523   0.482      0.467
Copy C and Copy B are copying baselines; IN and NPE are feedforward models; IN Sup. is a soft upper bound (marked "NOT COMPARABLE!")
fabienbaradel/cophy
[Interaction Network, Battaglia et al, NIPS’16] [Neural Physics Engine, Chang et al, ICLR’17]
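
The metric and the two copying baselines in this table can be sketched as follows. Several things here are assumptions rather than details from the slides: trajectories are stored as `[time][block][xyz]` nested lists, the error is the squared 3D distance averaged over blocks and time steps, Copy C predicts that the blocks never leave configuration C, and Copy B reuses the observed outcome B as the prediction for D. The function names are hypothetical and unrelated to the fabienbaradel/cophy code.

```python
def mse_3d(pred, target):
    """Squared 3D position error, averaged over blocks and time steps."""
    total, count = 0.0, 0
    for pred_t, target_t in zip(pred, target):
        for p, q in zip(pred_t, target_t):
            total += sum((a - b) ** 2 for a, b in zip(p, q))
            count += 1
    return total / count


def copy_c(initial_c, num_steps):
    """Copy C baseline: assume nothing moves after the modified initial state."""
    return [initial_c for _ in range(num_steps)]


def copy_b(outcome_b):
    """Copy B baseline: reuse the observed outcome as the counterfactual one."""
    return [list(step) for step in outcome_b]


# Two blocks, two future time steps (toy numbers, not benchmark data).
c = [[0, 0, 0], [0, 0, 1]]
b = [[[0, 0, 0], [0, 0, 1]], [[0, 0, 0], [1, 0, 1]]]
d = [[[0, 0, 0], [0, 0, 1]], [[1, 0, 0], [0, 0, 1]]]

print(mse_3d(copy_c(c, num_steps=2), d))  # 0.25
print(mse_3d(copy_b(b), d))               # 0.5
```

Both baselines ignore the effect of the intervention, which is why any learned model is expected to beat them.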

Conclusion
Visual Attention: « Glimpse Clouds », F. Baradel, C. Wolf, J. Mille, G. Taylor, CVPR'18. Focus on important parts; automatic selection; distributed recognition.
Entity-level interactions: « Object level Reasoning », F. Baradel, N. Neverova, C. Wolf, J. Mille, G. Mori, ECCV'18. Object-centric modeling; intra-time interactions; learned relations.
Reasoning: « Counterfactual learning », F. Baradel, N. Neverova, J. Mille, G. Mori, C. Wolf, ICLR’20 (spotlight). Unsupervised latent discovery; future trajectory; new task in visual space.
