Efficient 2D Viewpoint Combination for Human Action Recognition
Multi-view Action Recognition
• Video captures only a 2-dimensional projection, while actions truly occur in 3-dimensional world space.
• The subject may be occluded by an object or by itself (self-occlusion).
• Researchers have used multiple cameras to obtain a 3D representation of the subject (a visual hull).
Drawbacks of Visual Hulls
• A sufficient number of views is required to build a reliable visual hull.
• In carving a visual hull, some information is lost: the visual hull is only an approximation of the true 3D model.
Proposed method (1)
• We propose to extract features from each viewpoint separately and combine them efficiently, so that useful information is reinforced and redundant features are attenuated.
• We extract local features from each view; these are easy to extract and do not require segmentation.
• As opposed to Peng and Qian, who used HMMs, we use a simple bag-of-words (BOW) model, which is orderless, much easier to train and test, and usable with classifiers such as SVM.
• Instead of extracting many heterogeneous features, we focus on computing different models using different codebooks and kernel functions and combining them efficiently.
Proposed method (2)
• Multi-class recognition is done using a 1-vs-1 scheme instead of 1-vs-all, for higher precision and the ability to add a category without retraining the whole system.
• We model the same video with different histograms obtained from two local features and two vocabularies.
• Distances between histograms are measured using the HIK (Histogram Intersection Kernel) as well as an RBF (Radial Basis Function) kernel with the chi-square distance.
• We use an efficient interleaved optimization strategy to learn the optimal weights for the multiple kernels. The learned weights score each kernel by its ability to discriminate between two given categories.
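The 1-vs-1 scheme trains one binary classifier per pair of classes and predicts by majority vote. A minimal sketch of that voting step, with toy 1D decision functions standing in for the actual per-pair SVMs (class names, centers, and `pairwise_clf` are illustrative, not from the paper):

```python
from itertools import combinations

def one_vs_one_predict(x, classes, pairwise_clf):
    """Majority-vote prediction from binary 1-vs-1 classifiers.

    pairwise_clf maps a class pair (a, b) to a decision function
    returning >= 0 for class a and < 0 for class b.
    """
    votes = dict.fromkeys(classes, 0)
    for a, b in combinations(classes, 2):
        winner = a if pairwise_clf[(a, b)](x) >= 0 else b
        votes[winner] += 1
    return max(votes, key=votes.get)

# Toy example: 1D points, each class centered at its index.
classes = ["walk", "wave", "kick"]
centers = {c: float(i) for i, c in enumerate(classes)}
pairwise_clf = {
    (a, b): (lambda x, a=a, b=b: abs(x - centers[b]) - abs(x - centers[a]))
    for a, b in combinations(classes, 2)
}

print(one_vs_one_predict(0.1, classes, pairwise_clf))  # walk
```

Note the property the slide relies on: adding a new category only requires training the binary classifiers that involve that category; the existing pairwise classifiers are untouched.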
Some viewpoints are more discriminative for certain pairs of actions than for others.
Feature Types
• Separable Linear Filters: apply a Gaussian filter in the spatial domain and a quadrature pair of Gabor filters in the temporal dimension. Proposed by Dollar et al.
• Space-Time Corner Detector: an extension of the Harris corner detector, proposed by Laptev and Lindeberg.
Codebook sizes
• After extracting features, we use them to obtain a codebook for each view by applying k-means with Euclidean distance.
• We use two codebooks, one of size V and the other of size 2V.
• According to Gehler and Nowozin, adding any kernel, even an uninformative, non-discriminative one, to kernel-weight optimization methods will not reduce classification performance. In particular, when the added feature (kernel) is discriminative, classification performance increases.
• Using two different vocabulary sizes lets us model the actions at two different scales of detail.
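The codebook step above can be sketched as follows. This is a minimal, self-contained k-means plus bag-of-words quantization in NumPy (the descriptor dimensions, data, and V=8 are made up for illustration; a real pipeline would use the extracted spatio-temporal features):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means with Euclidean distance (illustrative, not production)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest codeword
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bow_histogram(desc, centers):
    """Quantize one video's descriptors against a codebook; L1-normalize."""
    d = np.linalg.norm(desc[:, None, :] - centers[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Two codebooks of sizes V and 2V model the same video at two scales of detail.
V = 8
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 16))   # stand-in for local descriptors
book_small = kmeans(features, V)
book_large = kmeans(features, 2 * V)
video = rng.normal(size=(50, 16))
h1, h2 = bow_histogram(video, book_small), bow_histogram(video, book_large)
print(h1.shape, h2.shape)  # (8,) (16,)
```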
Kernel Types
• Histogram Intersection Kernel (HIK)
• Radial Basis Function (RBF) kernel with chi-square distance
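The two kernels can be written down directly. A small sketch, assuming L1-normalized histograms; note that conventions for the chi-square distance vary by a constant factor, so the 1/2 here is one common choice rather than necessarily the paper's:

```python
import numpy as np

def hik(h1, h2):
    """Histogram Intersection Kernel: sum of element-wise minima."""
    return np.minimum(h1, h2).sum()

def chi2_rbf(h1, h2, gamma=1.0, eps=1e-10):
    """RBF kernel using the chi-square distance between histograms."""
    chi2 = 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return np.exp(-gamma * chi2)

a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.5, 0.3])
print(round(hik(a, b), 6))   # 0.7
print(chi2_rbf(a, a))        # 1.0 (identical histograms)
```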
Learning an efficient combination of kernels (1)
• The HIK and RBF kernels computed from the different histograms need to be combined efficiently to obtain an optimized final kernel.
• The final kernel is used with an SVM to classify the actions. The binary SVM classifier takes the form $f(x) = \operatorname{sign}\big( \sum_i \alpha_i y_i \sum_m \beta_m k_m(x_i, x) + b \big)$.
• $\beta_m$ is the kernel weight, which changes (scales) the influence of the kernel space associated with $k_m$ and, consequently, of the corresponding histogram space.
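The combination itself is a weighted sum of precomputed Gram matrices. A minimal sketch with toy random-feature kernels (the weights `beta` are placeholders for values an MKL solver would learn); the point it illustrates is that a non-negative combination of valid (PSD) kernels remains a valid SVM kernel:

```python
import numpy as np

def combine_kernels(kernels, beta):
    """Weighted sum of precomputed Gram matrices: K = sum_m beta_m * K_m."""
    K = np.zeros_like(kernels[0])
    for b, Km in zip(beta, kernels):
        K += b * Km
    return K

# Toy Gram matrices: each X @ X.T is symmetric positive semi-definite.
rng = np.random.default_rng(0)
kernels = []
for _ in range(3):
    X = rng.normal(size=(5, 3))
    kernels.append(X @ X.T)

beta = np.array([0.5, 0.3, 0.2])  # non-negative weights (would be learned)
K = combine_kernels(kernels, beta)

# Symmetric and PSD, hence usable as a precomputed SVM kernel.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-9)  # True True
```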
Learning an efficient combination of kernels (2)
• We use a 1-vs-1 classification scheme and choose different weights for each binary classifier, since some of the histograms (feature spaces) may be discriminative in differentiating one pair of classes but uninformative for another pair.
• For every combination of local feature and codebook size, we incorporate only one instance of the HIK kernel and four instances of the RBF kernel with different bandwidths.
• Experimental results show that sparse methods do not perform much better than baseline methods using average weights, so we use a non-sparse, general ℓp-norm multiple kernel learning algorithm in which no feature is removed: all features participate, with different contributions. The value of p is selected empirically. Newton descent is used for optimization because it converges faster than cutting-plane methods.
MKL (sparse and non-sparse)
• The ℓp-norm refers to the norm used by the regularizer of the learning objective: ℓ1 regularization yields sparse kernel weights, while p > 1 yields non-sparse weights.
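For reference, one common formulation of ℓp-norm MKL (not necessarily the exact objective used in this work) constrains the kernel weights β through their ℓp-norm:

```latex
\min_{\beta \ge 0,\ \|\beta\|_p \le 1}\;
\min_{w,\,b}\;
\frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|^2}{\beta_m}
+ C\sum_{i=1}^{n}\ell\!\left(y_i\Big(\sum_{m=1}^{M}\langle w_m,\phi_m(x_i)\rangle + b\Big)\right),
\qquad
\|\beta\|_p = \Big(\sum_{m=1}^{M}\beta_m^{\,p}\Big)^{1/p}.
```

With p = 1 the constraint promotes sparse weights (some kernels dropped); with p > 1 all kernels keep a nonzero contribution, matching the non-sparse behavior described above.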
IXMAS dataset: 11 actions, 10 subjects, 5 views.
Views in IXMAS dataset
Confusion matrix for the best result achieved on IXMAS (recognition accuracy: 95.8%)
Accuracy for each view (camera) of IXMAS
Best accuracy for combination of views in IXMAS
Performance of each feature type
Performance of using more codebooks
Performance of each kernel type and combination of them
Comparison of different fusion methods
Comparison of Recognition Accuracy on the IXMAS dataset (methods grouped as multi-view, visual hull, and single view)