Semantic Spaces for Zero-Shot Behaviour Analysis • Xun Xu • Computer Vision and Interactive Media Lab, NUS Singapore
Collaborators • Prof. Shaogang Gong • Dr. Timothy Hospedales
Outline • Background • Transductive Zero-Shot Action Recognition • Multi-Task Zero-Shot Embedding • Zero-Shot Crowd Analysis
Video Behaviour: Defined as Visually Distinguishable Activities • Human Actions • Crowd Behaviour
Human Actions • Individual or multiple interactive human activities • Soomro, et al. “UCF101: A Dataset of 101 human actions classes from videos in the wild.” 2012
Human Actions Tasks • Action Recognition [Figure: example classes Eye Makeup, Rafting, Swimming, Fencing, Diving, Archery]
Human Actions Tasks • Action Detection (Retrieval): given the query “Swimming”, return ranked videos [Figure: retrieved videos ordered from higher to lower ranking]
Crowd Behaviour • A group of people acting collectively • Shao, J., et al. “Deeply learned attributes for crowded scene understanding.” CVPR 2015
Crowd Behaviour Tasks • Crowd Behaviour Profiling
Crowd Behaviour Tasks • Crowd Anomaly Detection • Hassner, T., et al. “Violent flows: Real-time detection of violent crowd behavior.” CVPR 2012
Potential Applications • Human Computer Interaction • Surveillance • Video Sharing
Outline • Background • Transductive Zero-Shot Action Recognition • Multi-Task Zero-Shot Embedding • Zero-Shot Crowd Analysis
Motivation • Ever-increasing number of categories for action recognition • 2004: KTH, 6 classes • 2005: Weizmann, 9 classes • 2010: Olympic Sports, 16 classes • 2011: HMDB51, 51 classes • 2012: UCF101, 101 classes • 2015: 203 classes
Motivation • Ever-increasing number of categories (from 6 classes in KTH, 2004, to 203 classes in 2015) • Limitations: training data is expensive to collect; annotating video is costly
Zero-Shot Learning (ZSL) • Can we use videos from known classes to help predict videos from unknown classes? [Figure: known and unknown classes; examples Shot-Put, Hammer Throw, Discus Throw]
Attribute Semantic Space • Attribute Based [Figure: classes Hammer Throw and Discus Throw linked to the attributes Throw Away, Outdoor, Turn Around, Ball, Bend]
Attribute Semantic Space • Attribute Based [Figure: classes Hammer Throw, Discus Throw and Shot-put linked to the attributes Throw Away, Outdoor, Turn Around, Ball, Bend; label: Known a priori]
Attribute Semantic Space • Attribute Based [Figure: a test video placed among the classes Hammer Throw, Discus Throw and Shot-put and the attributes Throw Away, Outdoor, Turn Around, Ball, Bend]
Attribute Semantic Space • Attribute Based • Limitations: ontological problem; manually labelling attributes is costly for videos; incompatible with other attribute sets [Figure: classes Discus Throw, Hammer Throw and Shot-put linked to the attributes Throw Away, Outdoor, Turn Around, Ball, Bend]
Word-Vector Semantic Space • Feature Space X is mapped to Word-Vector Space Z by $z = f(x)$ • Discus Throw = [0.2 0.5 0.1 …] • Hammer Throw = [0.1 0.6 0.1 …]
Word-Vector Semantic Space • Feature Space X and Word-Vector Space Z • Discus Throw = [0.2 0.5 0.1 …] • ShotPut = [0.3 0.4 0.2 …] • Hammer Throw = [0.1 0.6 0.1 …]
Semantic Word-Vector • Skip-gram model predicts adjacent words: $\max_{\{z\}} \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(z_{t+j} \mid z_t)$, where $p(z_i \mid z_j) = \frac{\exp(z_i^T z_j)}{\sum_{i'} \exp(z_{i'}^T z_j)}$ • Result of this optimization: vec(“ball”) = [-0.004 0.01 0.01 -0.03 0.05]; vec(“sword”) = [0.16 0.06 0.09 -0.06 -0.002]; vec(“archery”) = [0.02 0.01 0.02 -0.03 -0.03]; vec(“boxing”) = [-0.08 -0.01 0.15 -0.01 0.09] • Mikolov, T., et al. “Distributed representations of words and phrases and their compositionality.” NIPS 2013 • Pennington, J., et al. “GloVe: Global vectors for word representation.” EMNLP 2014
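The softmax term of the skip-gram objective can be illustrated with a few lines of NumPy. This is only a sketch: it treats the four 5-D example vectors from the slide as the whole vocabulary and uses a single set of vectors, as the formula on the slide does. The function name `context_probabilities` is hypothetical, and practical word2vec training uses approximations of this softmax that are not shown here.

```python
import numpy as np

# Toy vocabulary: the four 5-D example vectors listed on the slide.
vocab = ["ball", "sword", "archery", "boxing"]
Z = np.array([
    [-0.004, 0.01, 0.01, -0.03, 0.05],
    [0.16, 0.06, 0.09, -0.06, -0.002],
    [0.02, 0.01, 0.02, -0.03, -0.03],
    [-0.08, -0.01, 0.15, -0.01, 0.09],
])

def context_probabilities(Z, j):
    """Return p(z_i | z_j) for every word i: a softmax over dot products with z_j."""
    scores = Z @ Z[j]
    scores = scores - scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Distribution over the toy vocabulary given the word "sword".
probs = context_probabilities(Z, vocab.index("sword"))
```

Training adjusts the vectors so that words which actually occur within the context window receive high probability under this distribution.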
Benefits • Geometrically meaningful word-vector space [Figure: “Run” and “Walk”, and “cat” and “dog”, lie closer together; “ship” lies far away]
Benefits • Unsupervised Semantic Space
Benefits • Wide coverage of words • vec(“Apple”) = [0.2 0.3 0.1 …] • vec(“Bear”) = [0.1 0.9 0.1 …] • vec(“Car”) = [0.6 0.2 0.4 …] • vec(“Desk”) = [0.2 0.8 0.4 …] • vec(“Fish”) = [0.5 0.2 0.3 …] • …
Benefits • Uniform across datasets • Dataset 1: Discus Throw = [0.2 0.5 …], Hammer Throw = [0.1 0.2 …] • Dataset 2: Discus Throw = [0.2 0.5 …], Hammer Throw = [0.1 0.2 …]
Challenges • Domain Shift [Figure: Feature Space X and Semantic Vector Space Y; classes shown: Discus Throw, Hammer Throw, Sword Exercise, Play Guitar]
Challenges • Domain Shift [Figure: Feature Space X and Semantic Vector Space Y; classes shown: Discus Throw, Hammer Throw, Sword Exercise, Play Guitar; label: Confusion]
Our Solution • Xu, X., et al. “Transductive Zero-Shot Action Recognition by Word-Vector Embedding.” IJCV 2017
Low-Level Visual Feature • Improved Trajectory Feature for x • Wang, H. and Schmid, C. “Action recognition with improved trajectories.” ICCV 2013
Combining Multiple Words • A phrase vector is constructed from the vectors of its single words (additive composition) • vec(“Apply Eye Makeup”) = vec(“Apply”) + vec(“Eye”) + vec(“Makeup”) • vec(“Brushing Teeth”) = vec(“Brushing”) + vec(“Teeth”) • vec(“Playing Guitar”) = vec(“Playing”) + vec(“Guitar”)
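Additive composition amounts to summing the word vectors of a class name. The sketch below assumes a plain dictionary of word vectors; the helper name `phrase_vector` and the 3-D toy values are hypothetical and stand in for vectors from a trained model.

```python
import numpy as np

def phrase_vector(phrase, word_vectors):
    """Sum the vectors of the individual words in a multi-word class name."""
    words = phrase.lower().split()
    return np.sum([word_vectors[w] for w in words], axis=0)

# Hypothetical toy vectors, not taken from a trained model.
word_vectors = {
    "brushing": np.array([0.1, 0.3, -0.2]),
    "teeth": np.array([0.0, -0.1, 0.4]),
}

v = phrase_vector("Brushing Teeth", word_vectors)
```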
Visual to Semantic Mapping by Regularized Linear Regression • Multi-dimensional regularized linear regression: $\min_W \sum_{i=1}^{N} \|z_i - W x_i\|_2^2 + \lambda \|W\|_2^2$ ($\lambda$: regularization weight) • W maps the feature space to the semantic space: $x_1 \to z_1$, $x_2 \to z_2$, … • x is N-dimensional, z is D-dimensional
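A regularized least-squares mapping of this form has a closed-form solution, sketched below in NumPy. The assumptions here are not from the slides: samples are stored as columns, the regularizer is the squared Frobenius norm of W, and the names `fit_ridge_mapping` and `lam` are hypothetical.

```python
import numpy as np

def fit_ridge_mapping(X, Z, lam=1.0):
    """Solve min_W sum_i ||z_i - W x_i||^2 + lam * ||W||^2 in closed form.

    X: (d_x, n) visual features, one column per video.
    Z: (d_z, n) semantic vectors, one column per video.
    Returns W with shape (d_z, d_x).
    """
    d_x = X.shape[0]
    A = X @ X.T + lam * np.eye(d_x)   # symmetric, positive definite for lam > 0
    B = Z @ X.T
    # W A = B with A symmetric, so W = solve(A, B^T)^T.
    return np.linalg.solve(A, B.T).T

# Synthetic check: recover a known linear map from noise-free data.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
W_true = rng.normal(size=(3, 6))
Z = W_true @ X
W = fit_ridge_mapping(X, Z, lam=1e-8)
```

With noise-free data and a very small `lam`, the recovered `W` should match `W_true` closely; larger values of `lam` shrink the mapping toward zero.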
Domain Shift – Semi-Supervised (Manifold Regularized) Regression • Semi-supervised regression tackles domain shift by taking the test data distribution into consideration • $X^{trg}_{tr}$: target train data; $X^{trg}_{te}$: target test data • A KNN graph over the train and test data in feature space, $[X^{trg}_{tr}; X^{trg}_{te}]$, models the manifold • Manifold regularizer: $\sum_{ij} w_{ij} \|f(x_i) - f(x_j)\|_2^2$ for $x \in [X_{tr}; X_{te}]$, where $w_{ij}$ is the graph weight
Domain Shift – Semi-Supervised (Manifold Regularized) Regression • Semi-supervised regression tackles domain shift by taking the test data distribution into consideration • $X^{trg}_{tr}$: target train data; $X^{trg}_{te}$: target test data • A KNN graph models the manifold • $\min_W \sum_{i=1}^{N} \|z_i - W x_i\|_2^2 + \lambda \|W\|_2^2 + \gamma \sum_{ij} w_{ij} \|W x_i - W x_j\|_2^2$ ($\lambda$, $\gamma$: trade-off weights; $w_{ij}$: KNN graph weight)
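For a linear map, a manifold-regularized objective of this kind also has a closed-form solution, because the graph term can be written with the graph Laplacian L = D − A. The sketch below assumes a binary, symmetrized kNN graph, samples stored as columns, and Frobenius-norm regularization; `knn_graph_laplacian`, `fit_manifold_mapping`, `lam` and `gamma` are hypothetical names, and the graph weighting used in the original work may differ.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Unnormalized Laplacian L = D - A of a symmetric binary kNN graph.

    X: (d, n) features, one column per sample.
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # squared distances
    np.fill_diagonal(dist, np.inf)                       # exclude self-matches
    idx = np.argsort(dist, axis=1)[:, :k]                # k nearest per sample
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    A = np.maximum(A, A.T)                               # symmetrize
    return np.diag(A.sum(axis=1)) - A

def fit_manifold_mapping(X_tr, Z_tr, X_te, lam=1.0, gamma=1.0, k=5):
    """Ridge regression on labelled data plus a graph smoothness term on all data.

    Uses sum_ij w_ij ||W x_i - W x_j||^2 = 2 * trace(W X L X^T W^T),
    so the stationarity condition is
    W (X_tr X_tr^T + lam I + 2 gamma X L X^T) = Z_tr X_tr^T.
    """
    X_all = np.hstack([X_tr, X_te])        # unlabelled test data joins the graph
    L = knn_graph_laplacian(X_all, k)
    d = X_tr.shape[0]
    A = X_tr @ X_tr.T + lam * np.eye(d) + 2.0 * gamma * (X_all @ L @ X_all.T)
    B = Z_tr @ X_tr.T
    return np.linalg.solve(A, B.T).T

# Synthetic data: the test features are shifted relative to the training features.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(6, 40))
X_te = rng.normal(size=(6, 30)) + 0.5
Z_tr = rng.normal(size=(3, 40))
W = fit_manifold_mapping(X_tr, Z_tr, X_te, lam=1.0, gamma=0.1, k=5)
```

Setting `gamma=0` removes the graph term and reduces the solution to plain regularized linear regression on the training data.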
Our Solution • Additional datasets are available
Data Augmentation • Use more training data from an auxiliary dataset to help learn a better regression • $X^{trg}_{tr}$: target dataset train data (e.g. HMDB51); $X^{aux}$: auxiliary dataset data (e.g. UCF101); $X^{trg}_{te}$: target test data • Augmented train and test data in feature space: $[X^{trg}_{tr}; X^{aux}]$ and $X^{trg}_{te}$ • More data is used in order to learn a more robust regressor
Semantic Word Vector Approach
Zero-Shot Recognition by Nearest Neighbor • Nearest-neighbor search in word-vector space predicts the category of test data: a test video instance, mapped by W, is assigned the category name at minimal distance [Figure: category names HulaHoop, Fencing, Basketball, Diving, Kayaking, Rafting, TaiChi and a test data point]
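Nearest-neighbor prediction in the semantic space can be sketched as follows. Euclidean distance is an assumption made for this example, since the slide only states "minimal distance"; the function name `predict_classes` and the 2-D toy prototypes are hypothetical.

```python
import numpy as np

def predict_classes(Z_test, prototypes):
    """Assign each projected test vector to the nearest class-name vector.

    Z_test: (n, D) test videos already mapped into the semantic space.
    prototypes: dict mapping class name -> (D,) word vector.
    """
    names = list(prototypes)
    P = np.stack([prototypes[name] for name in names])              # (C, D)
    dists = np.linalg.norm(Z_test[:, None, :] - P[None, :, :], axis=2)
    return [names[i] for i in dists.argmin(axis=1)]

# Hypothetical 2-D prototypes and projected test points.
prototypes = {"Diving": np.array([1.0, 0.0]), "Rafting": np.array([0.0, 1.0])}
Z_test = np.array([[0.9, 0.2], [0.1, 0.7]])
labels = predict_classes(Z_test, prototypes)
```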
Domain Shift – Self-Training • Self-training is applied to tackle domain shift • Test video instance: $z_{te} = f(x_{te})$ • Category name: $Z(\text{“Taichi”}) = g(\text{“Taichi”})$ • Adapted prototype: $Z^*(\text{“Taichi”}) = \frac{1}{K} \sum_{z_{te} \in NN(Z(\text{“Taichi”}), K)} z_{te}$, where $NN(Z_{proto}, K)$ is the KNN function • 4-NN example: $Z^*(\text{“Taichi”}) = \frac{1}{4}(z_5 + z_6 + z_7 + z_8)$
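The self-training step replaces each class prototype with the mean of its K nearest projected test points. A minimal sketch, assuming Euclidean distance and a hypothetical helper name `self_train_prototype`:

```python
import numpy as np

def self_train_prototype(prototype, Z_test, k=4):
    """Return the mean of the k test projections nearest to the prototype."""
    dists = np.linalg.norm(Z_test - prototype, axis=1)
    nn_idx = np.argsort(dists)[:k]
    return Z_test[nn_idx].mean(axis=0)

# Hypothetical 2-D example: four test points near the prototype, two far away.
prototype = np.array([0.0, 0.0])
Z_test = np.array([
    [1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0],
    [9.0, 9.0], [10.0, 10.0],
])
adapted = self_train_prototype(prototype, Z_test, k=4)
```

The adapted prototype moves toward the region where the test data actually lie, which is the intended correction for domain shift.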