Structured Deep Learning for Video Analysis
Fabien Baradel, PhD Candidate
Advisors: Christian Wolf & Julien Mille
June 29th, 2020
fabienbaradel.github.io

What is video understanding?
Human actions: sitting on the floor
Entity-level interactions: grabbing a silencer
Temporal reasoning / causality: the baby starts crying because of the silencer

Why video understanding?
Indexing, recommendation, retrieval, analysis, human-robot interactions
[TTNET, CVPRW’20]

A video understanding task: Action Recognition
Video → CV system → label (e.g. "walking")
Classification task
Pre-defined labels
Similar to object recognition
[HMDB, Kuehne, ICCV’11]

Human Pose
Walking: pose is enough
Chopping: what is being chopped?
[Johansson, 1973]

Context & Appearance
Swimming?
Handshaking

Action Recognition: Recent works
Two-stream: video → appearance CNN and motion CNN → label [Two-stream, Simonyan, NIPS’14]
I3D: video → 3D CNN whose 3D kernels are inflated from 2D kernels → label [I3D, Carreira et al, CVPR’17]
Limitations: biased towards context; lack of explainability
Human pose? Objects? Scene?

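The "2D inflated kernel" idea can be illustrated with a small sketch. It assumes the usual inflation recipe (repeat the 2D weights along a new time axis and divide by the temporal depth); the helper names and the toy filter values are hypothetical and are not taken from the slides. On a clip made of one repeated frame, the inflated 3D filter should give the same response as the original 2D filter.

```python
def inflate_kernel(kernel_2d, depth):
    """Stack a 2D kernel `depth` times along a new time axis, scaled by 1/depth."""
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def response_2d(kernel_2d, patch_2d):
    """Dot product between a 2D kernel and an image patch of the same size."""
    return sum(
        w * x
        for k_row, p_row in zip(kernel_2d, patch_2d)
        for w, x in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Dot product between a 3D kernel and a clip (one 2D patch per time step)."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


# Hypothetical toy values: a Sobel-like 2D filter and a 3x3 image patch.
kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0], [4.0, 1.0, 0.0], [5.0, 1.0, 2.0]]
static_clip = [patch] * 4  # the same frame repeated four times

kernel_3d = inflate_kernel(kernel, depth=4)
r2d = response_2d(kernel, patch)
r3d = response_3d(kernel_3d, static_clip)
```

Here `r2d` and `r3d` coincide, which is why 2D weights pretrained on images are a reasonable starting point for a 3D network.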
Structured Deep Learning

Outline
Visual Attention: « Glimpse Clouds », F. Baradel, C. Wolf, J. Mille, G. Taylor, CVPR'18
Entity-level interactions: « Object level Reasoning », F. Baradel, N. Neverova, C. Wolf, J. Mille, G. Mori, ECCV'18
Reasoning: « Counterfactual learning », F. Baradel, N. Neverova, J. Mille, G. Mori, C. Wolf, ICLR’20 (spotlight)
Collaborators: Graham W. Taylor (University of Guelph, Vector Institute), Christian Wolf (INSA Lyon - LIRIS), Julien Mille (INSA CVL - LI Tours)

Visual Attention
What is happening? Winter activities
[Yarbus, 1976] [Roger et al, 2012]

Visual Attention
What is Charlie doing? Walking
[Yarbus, 1976] [Roger et al, 2012]

Action Recognition: Baseline
Video → R3D → feature map ($T \times H \times W \times D$) → pooling → classifier → label
Limitations:
What about fine-grained human actions?
How to focus on relevant parts of the video?

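A minimal sketch of the pooling-then-classify baseline, assuming a feature map of shape T x H x W x D stored as nested lists, global average pooling, and a linear layer followed by a softmax. The function names, the toy feature map and the identity weights are illustrative assumptions, not the actual model.

```python
import math


def global_average_pool(feature_map):
    """Average a T x H x W x D feature map (nested lists) over T, H and W."""
    cells = [cell for frame in feature_map for row in frame for cell in row]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def linear_softmax(features, weights, biases):
    """Linear layer followed by a softmax over the pre-defined labels."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy feature map with T=2, H=1, W=2, D=2.
feature_map = [
    [[[1.0, 0.0], [3.0, 0.0]]],
    [[[0.0, 2.0], [0.0, 2.0]]],
]
pooled = global_average_pool(feature_map)  # a single D-dimensional vector
probs = linear_softmax(pooled, weights=[[1.0, 0.0], [0.0, 1.0]], biases=[0.0, 0.0])
```

After pooling, the classifier no longer knows where or when each feature occurred, which is one way to read the two limitations listed on the slide.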
Glimpse Clouds: Method
[Diagram: video → R3D → context features; an RNN selects local features (glimpses) at each time step t = 1 … T; local features go through a soft assignment, with an external memory, to Worker 1 … Worker N (N concepts); the workers predict the label]
Losses: cross-entropy loss; maximize distance between glimpses; good human pose estimation
[Recurrent Visual Attention, Mnih et al, NIPS’14] [3D Resnet, Hara et al, CVPR’18]

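One way to picture the soft assignment of a local feature to the workers is a softmax over negative squared distances to a per-worker key. This is only an illustrative assumption: the slide does not state the similarity function, and `worker_keys`, `temperature` and the feature values below are hypothetical.

```python
import math


def soft_assign(feature, worker_keys, temperature=1.0):
    """Softmax over negative squared distances between a feature and each worker key."""
    scores = [-sum((f - k) ** 2 for f, k in zip(feature, key)) / temperature
              for key in worker_keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


worker_keys = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]  # hypothetical per-worker memory keys
glimpse_feature = [3.5, 0.5]                        # hypothetical local feature

weights = soft_assign(glimpse_feature, worker_keys)
# Hard choice (argmax), useful only for displaying which worker dominates.
hard_choice = max(range(len(weights)), key=weights.__getitem__)
```

The weights sum to one, so every worker receives a share of each glimpse; taking the argmax is a convenient way to visualize the dominant worker.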
Glimpse Clouds: State-of-the-art results

Accuracy on NTU-RGB+D
Method                   Modality         CS     CV
Ensemble TS-LSTM         skeleton         74.6   81.3
View invariant           skeleton         80.0   87.2
Hands-Attention (ours)   skeleton + RGB   84.8   90.6
Glimpse Clouds (ours)    RGB              88.4   93.2

Accuracy on Northwestern-UCLA
Method                   Modality         V1
Enhanced viz.            skeleton         86.1
Ensemble TS-LSTM         skeleton         89.2
NKTM                     RGB              75.8
Glimpse Clouds (ours)    RGB              90.1

Code: fabienbaradel/glimpse_clouds
[Pose-driven hands-attention, Baradel et al, BMVC’18]

Glimpse Clouds: Ablation study
Impact of the attention mechanism
[Bar chart: accuracy of the global model vs. Glimpse Clouds on NTU and NUCLA, with gains of +1.9 and +4.4]
Resolution matters
Local fine-grained features

Glimpse Clouds: Visualization
Raw video and attended regions
Worker 1 → ~Hands
Worker 2 → ~Heads
Worker 3 → ~Legs
Note: argmax shown for the feature-to-worker association

Outline
Unstructured local features… Incorporate structure from images? Leverage interactions between visual entities?
Visual Attention → Entity-level interactions
« Object level Reasoning », F. Baradel, N. Neverova, C. Wolf, J. Mille, G. Mori, ECCV'18
Collaborators: Christian Wolf (INSA Lyon - LIRIS), Julien Mille (INSA CVL - LI Tours), Natalia Neverova (Facebook), Greg Mori (SFU)

Object-level Reasoning
[Figure: a few frames over time and the corresponding action]
It is often possible to infer what happened from a few frames
Interactions between visual entities
[Mask-RCNN, He et al, ICCV’17]

Image as a set of objects
RGB → Mask-RCNN → set of objects
Attributes of each object:
appearance: CNN features
shape: pixel location
semantic: COCO class

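The per-object attributes can be sketched as a small record type. This is an illustrative data structure only: the field names are hypothetical, and a bounding box stands in for the pixel-level location that a detector such as Mask-RCNN would provide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    appearance: List[float]         # CNN features pooled over the object region
    box: Tuple[int, int, int, int]  # pixel location as (x_min, y_min, x_max, y_max); assumption
    coco_class: str                 # semantic label
    score: float                    # detection confidence


# Hypothetical detections for one frame.
frame_objects = [
    DetectedObject([0.2, 0.7], (10, 40, 200, 180), "bed", 1.00),
    DetectedObject([0.9, 0.1], (60, 20, 120, 150), "person", 0.89),
]

# A frame is then simply a variable-length set of such records.
confident = [o.coco_class for o in frame_objects if o.score >= 0.5]
```

Representing a frame as a list of records makes it natural to reason over pairs of objects rather than over a dense pixel grid.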
Object Relation Network
Frames $t'$ and $t$ with object sets $O_{t'}$ and $O_t$ (detections: bed 1.00, bed 1.00, person 0.89)
Graph creation over the nodes $person_t$, $bed_{t'}$, $bed_t$
How? Structure? Clique size? Type of interactions?
[GCN, Kipf et al, ICLR’17] [Graph Networks, Battaglia et al, arXiv’18]

Object Relation Network
Frames $t'$ and $t$ with object sets $O_{t'}$ and $O_t$ (detections: bed 1.00, bed 1.00, person 0.89)
Graph creation over the nodes $person_t$, $bed_{t'}$, $bed_t$
Inter-frame object relations with a clique of size 2 and a semantic meaning; for this example:
$f(bed_{t'}, person_t) + f(bed_{t'}, bed_t)$
In general, summing over all pairs of objects taken from the two frames:
$\sum_{j \in O_{t'}} \sum_{k \in O_t} f(o_j^{t'}, o_k^{t})$
$f$ is a shared MLP: data efficient and invariant to the number of objects
[GCN, Kipf et al, ICLR’17] [Graph Networks, Battaglia et al, arXiv’18]

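The pairwise sum over inter-frame object pairs can be sketched as follows. The `relation` function is a fixed stand-in for the shared MLP (an assumption made so the example runs without a learning framework), and the two-dimensional object features are hypothetical.

```python
def relation(o_prev, o_curr):
    """Stand-in for the shared MLP f: a fixed function of the concatenated pair."""
    pair = o_prev + o_curr
    return [max(0.0, sum(pair)), max(0.0, pair[0] - pair[-1])]


def inter_frame_relations(objects_prev, objects_curr):
    """Sum f(o_j, o_k) over every pair (object in frame t', object in frame t)."""
    total = [0.0, 0.0]
    for o_prev in objects_prev:
        for o_curr in objects_curr:
            total = [a + b for a, b in zip(total, relation(o_prev, o_curr))]
    return total


# Hypothetical object features.
bed_prev = [1.0, 0.0]     # bed detected in frame t'
bed_curr = [1.0, 0.5]     # bed detected in frame t
person_curr = [0.0, 2.0]  # person detected in frame t

g = inter_frame_relations([bed_prev], [bed_curr, person_curr])
g_swapped = inter_frame_relations([bed_prev], [person_curr, bed_curr])
```

Because the same function is applied to every pair and the results are summed, the output has a fixed size whatever the number of detected objects, and it does not depend on the order in which they are listed.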
Object Relation Network
[Diagram: object detections over time (bed and person, with confidence scores) → inter-frame object-level representations → RNN → video representation $r$ → linear → label; loss: cross-entropy]

Object Relation Network: State-of-the-art

Accuracy on Something-Something
Method                    Acc.
C3D                       21.50
I3D                       27.63
Multiscale TRN            33.60
Object Relation Network   35.97

Mean Average Precision on VLOG
Method                    mAP
Resnet50                  40.5
I3D                       39.7
Object Relation Network   44.7

Verb accuracy on EPIC Kitchens
Method                    Acc.
Resnet18                  32.05
Resnet3D-18               34.20
Object Relation Network   40.89

Code: fabienbaradel/object_level_visual_reasoning
+ Object masks detected by Mask-RCNN

Object Relation Network: Ablation study
Impact of the object relation network
[Bar chart: R3D vs. ORN; gains of +2.3 on Something-Something, +4.8 on VLOG and +6.7 on EPIC]
Detection performance
High resolution

Object Relation Network: Interactions
Spurious correlations
Co-occurrences
Learned relations
Human-laptop interactions

Outline
Structure matters… Can we go one step further? Beyond supervised learning? Learning underlying latent concepts?
Visual Attention → Entity-level interactions → Reasoning
« Counterfactual learning », F. Baradel, N. Neverova, J. Mille, G. Mori, C. Wolf, ICLR’20 (spotlight)
Collaborators: Christian Wolf (INSA Lyon - LIRIS), Julien Mille (INSA CVL - LI Tours), Natalia Neverova (Facebook), Greg Mori (SFU)

Reasoning & Causation
Latent concepts
Understanding of complex relationships
Cause-effect
What would have happened if? (counterfactual statement)

Counterfactual Future forecasting
A: initial state → B: outcome
C: modified initial state → D: counterfactual outcome
Confounders: masses, friction, gravity

Counterfactual Future forecasting
Feedforward: initial state $X_0$ (A) → outcome $X_{1:T}$ (B), both influenced by the confounder $U$
Counterfactual: given the observed initial state and outcome (A, B), apply the intervention $do(X_0 = C)$ (modified initial state) and predict the counterfactual outcome D under the same confounder $U$
[Algorithmization of counterfactuals, Pearl, arXiv’18]

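A toy example may help to see why the confounder matters. It assumes a single block sliding on a flat surface, with the friction coefficient playing the role of the hidden confounder U; this is an illustration of the counterfactual setting, not the CoPhy setup or the CoPhyNet model, and all names and values are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2


def stopping_distance(speed, friction):
    """Distance a sliding block travels before friction stops it."""
    return speed ** 2 / (2.0 * friction * G)


def infer_friction(speed, distance):
    """Recover the hidden friction coefficient from one observed (initial state, outcome) pair."""
    return speed ** 2 / (2.0 * distance * G)


true_friction = 0.4  # confounder U, never shown to the predictor
speed_a = 2.0        # A: observed initial state
distance_b = stopping_distance(speed_a, true_friction)  # B: observed outcome

speed_c = 3.0        # C: modified initial state, i.e. do(X_0 = C)
friction_hat = infer_friction(speed_a, distance_b)      # estimate U from (A, B)
distance_d = stopping_distance(speed_c, friction_hat)   # D: counterfactual outcome
```

Without the observation (A, B), the friction coefficient would be unknown and the outcome D could not be predicted from C alone.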
CoPhy benchmark
Large-scale datasets: 250k examples ((A,B), (C,D)), 7 million frames
Supervision of the do-operator
Confounders are necessary for future prediction

CoPhyNet: Unsupervised confounder estimation
[Diagram: the observed sequence from A to B is processed over time by a GCN at each frame followed by recurrent units (RNN), which output the confounder estimate U]
[GCN, Kipf et al, ICLR’17]

CoPhyNet: Trajectory prediction
[Diagram: starting from C and combined with U, a recurrent GCN is unrolled over time (t = 1, t = 2, …, t = T) to predict the trajectory leading to D]
[Perceived-causality, Gerstenberg et al, ACCSS’15]

Human study
Human non-CF: given C, predict the outcome
Human CF: given (A, B) and C, predict the outcome
[Bar chart: 2D pixel error for each block (bottom, middle, top, average) for Human non-CF, Human CF and CoPhyNet]

CoPhyNet: Results
MSE on 3D positions (average over time), CoPhy benchmark

Train → Test                      Copy C   Copy B   IN      NPE     CoPhyNet   IN Sup.
3 → 3                             0.470    0.601    0.318   0.331   0.294      0.296
3 → 3* (unseen confounders)       0.365    0.592    0.298   0.319   0.289      0.282
3 → 4 (unseen number of blocks)   0.754    0.846    0.524   0.523   0.482      0.467

Column groups: copying baselines (Copy C, Copy B); feedforward models (IN, NPE); soft upper bound (IN Sup., NOT COMPARABLE!)
Code: fabienbaradel/cophy
[Interaction Network, Battaglia et al, NIPS’16] [Neural Physics Engine, Chang et al, ICLR’17]

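The metric and the two copying baselines can be sketched as below. The exact averaging used on the benchmark is not spelled out on the slide, so this sketch assumes a plain mean over time steps, blocks and coordinates; it also assumes that Copy C repeats the modified initial state at every time step and that Copy B reuses the observed outcome. The trajectories are hypothetical.

```python
def mse_3d(predicted, target):
    """Mean squared error over 3D positions, averaged over time steps, blocks and coordinates."""
    errors = [
        (p - t) ** 2
        for pred_step, target_step in zip(predicted, target)
        for pred_obj, target_obj in zip(pred_step, target_step)
        for p, t in zip(pred_obj, target_obj)
    ]
    return sum(errors) / len(errors)


# Trajectories are lists over time of lists over blocks of (x, y, z) positions.
target_d = [[(0.0, 0.0, 1.0)], [(1.0, 0.0, 1.0)]]   # hypothetical ground-truth outcome D
state_c = [(0.0, 0.0, 1.0)]                         # hypothetical modified initial state C
outcome_b = [[(0.0, 0.0, 1.0)], [(0.0, 2.0, 1.0)]]  # hypothetical observed outcome B

copy_c = [state_c for _ in target_d]      # Copy C: assume nothing moves
err_copy_c = mse_3d(copy_c, target_d)
err_copy_b = mse_3d(outcome_b, target_d)  # Copy B: reuse the observed outcome
```

Both baselines ignore the physics entirely, which is why they serve as a floor against which learned predictors are compared.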
Conclusion
Visual Attention: « Glimpse Clouds », F. Baradel, C. Wolf, J. Mille, G. Taylor, CVPR'18
Focus on important parts; automatic selection; distributed recognition
Entity-level interactions: « Object level Reasoning », F. Baradel, N. Neverova, C. Wolf, J. Mille, G. Mori, ECCV'18
Object-centric modeling; intra-time interactions; learned relations
Reasoning: « Counterfactual learning », F. Baradel, N. Neverova, J. Mille, G. Mori, C. Wolf, ICLR’20 (spotlight)
Unsupervised latent discovery; future trajectory; new task in visual space