Learning Semantic Visual Codebook for Action Recognition by Embedding into Concept Space
Behrouz Saghafi
Using Spatio-temporal Features
• Action recognition based on silhouettes or optical flow encounters difficulties with non-uniform backgrounds, severe camera jitter, and noise.
• Local spatio-temporal features are fast and easy to extract, and reliable.
Bag of Words model
• The raw features are clustered based on their appearance rather than their semantic relations.
• By utilizing the semantics, the recognition accuracy will improve.
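The classic appearance-based BoW pipeline can be sketched as below. All shapes (descriptor dimension, vocabulary size, counts) are hypothetical placeholders, not values from the slides.

```python
# Minimal sketch of the classic appearance-only BoW pipeline.
# Descriptor dimension (64), vocabulary size (20), and data are hypothetical.
import numpy as np
from scipy.cluster.vq import kmeans2, vq

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 64))   # local spatio-temporal descriptors
K = 20                                     # vocabulary size

# Cluster descriptors purely by appearance to form the visual vocabulary.
codebook, _ = kmeans2(descriptors, K, seed=0)

# Represent one video as a normalized histogram of visual-word occurrences.
video_desc = rng.normal(size=(80, 64))     # descriptors from one query video
words, _ = vq(video_desc, codebook)        # assign each descriptor to a word
hist = np.bincount(words, minlength=K).astype(float)
hist /= hist.sum()                         # BoW histogram fed to a classifier
```

Because clustering here uses only descriptor appearance, two visually different words with the same semantics land in different bins — the limitation the proposed method addresses.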
Incorporating Semantics into BoW model (Related work)
Generative methods
• Build a model for each category and fit the query to one of the models in an unsupervised framework, e.g. probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA).
• Their unsupervised nature limits their performance.
• The number of topics must equal the number of categories, which limits their efficiency.
Discriminative methods
• Try to construct a semantic vocabulary and use it with a classifier.
• Liu and Shah (CVPR 2008): maximize the mutual information between visual words and videos. > The formed clusters do not necessarily represent topics or synonymous words.
• Liu et al. (CVPR 2009): use Diffusion Maps (DM) to construct a semantic visual vocabulary. > Measuring semantic distance by connectivity is not appropriate in the presence of polysemy.
Embedding into Concept Space (Proposed)
We propose a framework for constructing a semantic visual vocabulary by computing a rich semantic space (concept space). The concept space is computed by latent semantic models or Canonical Correlation Analysis. The visual words are embedded into the concept space to form meaningful clusters representing semantic topics; consequently, the resulting histograms are more discriminative.
• As opposed to generative methods, which do not use category labels, our method trains a classifier on the training histograms.
• The number of topics can exceed the number of categories, unlike in the unsupervised framework, which allows a more detailed analysis.
• By using pLSA to construct the concept space, the problem of polysemy is handled.
Overview of the proposed framework
• Constructing the semantic visual vocabulary
• Training steps of the proposed method
Latent Semantic Analysis (LSA) (1)
• Latent Semantic Analysis (LSA), originally used in text-mining applications, factorizes the word-video co-occurrence matrix into linear subspaces of words and videos: A = U Σ Vᵀ, where A is the N × M word-video co-occurrence matrix (N words, M videos), its rows are word vectors, and its columns are video vectors.
• The word vectors reveal the semantic relations of words, since semantically synonymous words occur in similar documents.
Latent Semantic Analysis (LSA) (2)
• The word vectors are sparse, so their correlation may not be representative of their semantic relations. Therefore, we need to find a reduced-dimensional space.
• Rank-L optimal representation: A ≈ U_L Σ_L V_Lᵀ, where A is N × M (words × videos), U_L is N × L (words × topics), Σ_L is L × L (topics × topics), and V_Lᵀ is L × M (topics × videos).
• The correlation of words based on word vectors: A Aᵀ ≈ (U_L Σ_L)(U_L Σ_L)ᵀ, so the rows of U_L Σ_L are a good representation of the rows of A (the words), in the sense that they approximate the correlation between words.
Embedding into concept space using LSA
• Row i of the N × L matrix U_L Σ_L is the representation of word i in the L-dimensional concept space.
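The LSA embedding can be sketched in a few lines of numpy via truncated SVD. The co-occurrence matrix and all dimensions here are hypothetical toy values.

```python
# Sketch: embedding visual words into the LSA concept space via truncated SVD.
# The word-video co-occurrence matrix A and the sizes N, M, L are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
N, M, L = 100, 40, 10                         # words, videos, topics
A = rng.poisson(1.0, size=(N, M)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
word_embed = U[:, :L] * s[:L]                 # rows = words in L-dim concept space

# (word_embed @ word_embed.T) is the best rank-L approximation of A @ A.T,
# so distances between rows approximate the semantic correlation of words.
```

Clustering the rows of `word_embed` (instead of raw descriptors) then yields the semantic visual vocabulary.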
Probabilistic Latent Semantic Analysis (pLSA) (1)
• Graphical model: d → z → w, where w is the observed word and z the latent topic.
• Parameters: the topic distributions per document P(z|d) and the word distributions per topic P(w|z).
Probabilistic Latent Semantic Analysis (pLSA) (2)
• The co-occurrence counts n(d, w) are known; P(z|d) and P(w|z) are unknown.
• Likelihood: L = Σ_d Σ_w n(d, w) log Σ_z P(w|z) P(z|d).
Probabilistic Latent Semantic Analysis (pLSA) (3)
• Maximum likelihood by EM:
E-step: P(z|d, w) = P(w|z) P(z|d) / Σ_z′ P(w|z′) P(z′|d)
M-step: P(w|z) ∝ Σ_d n(d, w) P(z|d, w);  P(z|d) ∝ Σ_w n(d, w) P(z|d, w)
Embedding into concept space using pLSA
• Each word i is mapped to an L-dimensional vector in the concept space, with one coordinate per latent topic.
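A minimal numpy sketch of pLSA training by EM on a toy count matrix. The data and sizes are hypothetical, and the final embedding choice — representing word i by P(z|w_i), assuming a uniform document prior — is one natural option assumed here, not necessarily the exact formula from the slides.

```python
# Sketch: pLSA via EM on a hypothetical word-video count matrix n,
# then embedding each word as its topic posterior P(z|w).
import numpy as np

rng = np.random.default_rng(2)
N, M, L = 50, 20, 5                                  # words, videos, topics
n = rng.poisson(1.0, size=(N, M)).astype(float)

p_w_z = rng.random((N, L)); p_w_z /= p_w_z.sum(0)    # P(w|z), columns sum to 1
p_z_d = rng.random((L, M)); p_z_d /= p_z_d.sum(0)    # P(z|d), columns sum to 1

for _ in range(50):
    # E-step: P(z|d,w) ∝ P(w|z) P(z|d), normalized over topics
    post = p_w_z[:, :, None] * p_z_d[None, :, :]     # shape N x L x M
    post /= post.sum(1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from expected counts
    nz = n[:, None, :] * post
    p_w_z = nz.sum(2); p_w_z /= p_w_z.sum(0)
    p_z_d = nz.sum(0); p_z_d /= p_z_d.sum(0)

# Assumed embedding: word i -> P(z|w_i), via Bayes' rule with uniform P(d).
p_z = p_z_d.sum(1) / M                               # P(z)
word_embed = p_w_z * p_z                             # ∝ P(z|w), shape N x L
word_embed /= word_embed.sum(1, keepdims=True) + 1e-12
```

Because a polysemous word can place mass on several topics, this embedding keeps its multiple senses, unlike a hard appearance cluster.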
Using LSA vs pLSA (1)
• pLSA can handle polysemy.
– Polysemes are words which have more than one meaning.
Using LSA vs pLSA (2)
• LSA can perform faster:
Mean training time (having the initial vocabulary): LSA 62 sec, pLSA 4261 sec
Mean testing time (having learned the concept space): LSA 0.54 sec, pLSA 0.71 sec
Canonical Correlation Analysis (CCA) (1)
• Given a pair of vector sets, CCA finds a direction for each set such that the projections of the vectors onto these directions have maximal correlation.
Canonical Correlation Analysis (CCA) (2)
• The canonical directions w_x, w_y maximize ρ = corr(w_xᵀ x, w_yᵀ y) = (w_xᵀ C_xy w_y) / sqrt((w_xᵀ C_xx w_x)(w_yᵀ C_yy w_y)), where C_xx, C_yy are the within-set covariances and C_xy the cross-covariance.
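The standard CCA solution can be sketched with whitening plus an SVD. The two views below are hypothetical toy data sharing a latent signal, not the features from the slides.

```python
# Generic CCA sketch (hypothetical two-view toy data): find one direction per
# view whose projections are maximally correlated, via whitening + SVD.
import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.normal(size=(n, 1))                      # shared latent signal
X = np.hstack([z + 0.1 * rng.normal(size=(n, 1)),
               rng.normal(size=(n, 2))])         # view 1, 3-dim
Y = np.hstack([z + 0.1 * rng.normal(size=(n, 1)),
               rng.normal(size=(n, 1))])         # view 2, 2-dim

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Cxx = Xc.T @ Xc / n + 1e-6 * np.eye(3)           # regularized covariances
Cyy = Yc.T @ Yc / n + 1e-6 * np.eye(2)
Cxy = Xc.T @ Yc / n

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

# Singular values of the whitened cross-covariance = canonical correlations.
W = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
U, s, Vt = np.linalg.svd(W)
wx = inv_sqrt(Cxx) @ U[:, 0]                     # first canonical direction (view 1)
wy = inv_sqrt(Cyy) @ Vt[0]                       # first canonical direction (view 2)
rho = s[0]                                       # top canonical correlation
```

Projecting onto further singular vectors gives the remaining canonical directions, which together span the concept space.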
Embedding into concept space using CCA
• The raw feature representation is noisy; in the CCA-projected semantic representation, the noise covariance is reduced.
Constructing the semantic visual vocabulary using CCA
Local Feature Extractor
Performance of the proposed method (Latent Semantic Space) on the KTH dataset with different numbers of topics (pLSA vs. LSA)
Comparison of results with the classic framework for different sizes of vocabulary (KTH)
Confusion matrix for the best result achieved using Latent Semantic Space (KTH)
• Best recognition accuracy: 93.94% by pLSA with L = 50, Kf = 400.
Effect of changing the vocabulary size (CCA Space)
Confusion matrix for the best result on the KTH dataset using CCA
• Best recognition accuracy: 93.39% with Kf = 700.
Comparison with reported results on KTH dataset