Chapter 8

Action Recognition

Yu Kong and Yun Fu
B. Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, United States
Department of Electrical and Computer Engineering and College of Computer and Information Science (Affiliated), Northeastern University, Boston, MA, United States

Abstract

Appearance and pose of human actions vary significantly across views, making multiview action recognition a very challenging task. To address this problem, it is crucial to learn view-invariant features from multiview action data that are robust to view variations. In this chapter, we first describe an approach that uses deep learning networks to learn view-specific and view-shared features: the former capture unique dynamics of each view, and the latter encode common patterns across views. We describe a sample-affinity matrix (SAM) that accurately balances information transfer among the multiple views of each sample while limiting information transfer across different samples. SAM enables us to learn more informative shared features that are robust to view variations. Furthermore, we encourage incoherence between the two types of features in order to reduce information redundancy and increase their discriminability. The discriminative power of the learned features is further enhanced by encouraging features of the same category to be geometrically closer. Robust view-invariant features are finally learned by stacking several layers of features. This approach is evaluated on two multiview datasets, and shows superior performance over state-of-the-art approaches.

This chapter further introduces a hybrid convolutional-recursive neural network (HCRNN) to learn compositional features for action recognition from depth cameras. HCRNN captures rich motion, structure, and context information, covering both local adjacent body parts and long-range body-part relations. HCRNN extracts robust features by stacking two layers, a 3D convolutional neural network (3D-CNN) and a 3D recursive neural network (3D-RNN). We use 3D-CNN to capture motion and structure information in local neighborhoods, and 3D-RNN to compose high-order features for action recognition. Multiple 3D-RNNs are employed to improve the discriminability and robustness of the learned features. We organize the two components as a deep model and train HCRNN in an unsupervised fashion without network fine-tuning. Results on two RGB-D action datasets show that our method achieves state-of-the-art performance.

Keywords

Deep learning; Action recognition; Multiview learning; RGB-D action

8.1 Deeply Learned View-Invariant Features for Cross-View Action Recognition

8.1.1 Introduction

Human action data are ubiquitous and are of interest to the machine learning [1,2] and computer vision communities [3,4]. In general, action data can be captured from multiple views, such as multiple sensor views and various camera views; see Fig. 8.1. Classifying such action data in the cross-view scenario is challenging since the raw data are captured by various sensor devices at different physical locations, and may look completely different. For example, in Fig. 8.1B, an action observed from the side view and the same action observed from the top view are visually different. Thus, features extracted from one view are less discriminative for classifying actions in another view.

Figure 8.1 Examples of multiview scenarios: (A) multisensor-view, where multiple sensors (orange rectangles) are attached to torso, arms and legs, and human action data are recorded by these sensors; (B) multicamera-view, where human actions are recorded by multiple cameras at various viewpoints. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this chapter.)

A line of work has studied building view-invariant representations for action recognition [5–9], where an action video is converted to a time series of frames. The approaches in [5,6] take advantage of a so-called self-similarity matrix (SSM) descriptor to summarize actions in multiple views and have demonstrated robustness in cross-view scenarios. Information shared between views is learned and transferred to each of the views in [7–9]. These methods assume that samples in different views contribute equally to the shared features. However, this assumption may not hold, as the cues in one view might be dramatically different from those in other views (e.g., the top view in Fig. 8.1B) and should contribute less to the shared features than other views. Furthermore, these methods do not constrain information sharing between action categories. This may produce similar features for videos that belong to different classes but are captured from the same view, which would confuse classifiers.

We describe a deep network that classifies cross-view actions using learned view-invariant features. A sample-affinity matrix (SAM) is introduced to measure the similarities between video samples in different camera views. This allows us to accurately balance information transfer between views and facilitates learning more informative shared features for cross-view action classification. The SAM structure also controls information transfer among samples in different classes, which enables us to extract distinctive features for each class. Besides the shared features, private features are also learned to capture motion information that exists exclusively in each view and cannot be modeled by the shared features. We learn discriminative view-invariant information from the shared and private features separately by encouraging incoherence between them. Label information and the stacking of multiple feature layers are used to further boost the performance of the network. This feature learning problem is formulated in a marginalized autoencoder framework (see Fig. 8.2) [10], developed particularly for learning view-invariant features. Specifically, cross-view shared features are summarized by one autoencoder, and the private features of each view are learned using a group of autoencoders. We obtain incoherence between the two types of features by encouraging orthogonality between the mapping matrices of the two categories of autoencoders. A Laplacian graph is built to encourage samples of the same action category to have similar shared and private features. We stack multiple layers of features and learn them in a layer-wise fashion. We evaluate our approach on two multiview datasets, and show that it significantly outperforms state-of-the-art approaches.

Figure 8.2 Overview of the method described for cross-view action recognition.

8.1.2 Related Work

The aim of multiview learning methods is to find mutual agreement between two distinct views of data. Researchers have made several attempts to learn more expressive and discriminative features from low-level observations [11–13,2,14–16]. The cotraining approach [17] finds consistent relationships between pairs of data points across different views by training a separate learner for each view. Canonical correlation analysis (CCA) was used in [18] to learn a common space between multiple views. Wang et al. [19] proposed a method that learns two projection matrices to map multimodal data onto a common feature space, in which cross-modal data matching can be performed. The incomplete-view problem was discussed in [20], where the authors assumed that the different views are generated from a shared subspace. A generalized multiview analysis (GMA) method was introduced in [21] and shown to be a supervised extension of CCA. Liu et al. [13] took advantage of matrix factorization in multiview clustering. Their method pushes the clustering structures learned from multiple views toward a common consensus. A collective matrix factorization (CMF) method was explored in [12], which captures correlations among relational feature matrices. Ding et al. [16] proposed a low-rank constrained matrix factorization model, which works well in the multiview learning scenario even when the view information of the test data is unknown.

View-invariant action recognition methods are designed to predict action labels given multiview samples. As the viewpoint changes, large within-class pose and appearance variations appear. Previous studies have focused on designing view-invariant features that are robust to viewpoint variations. The method in [22] applies local partitioning and hierarchical classification to 3D Histogram of Oriented Gradients (HOG) descriptors computed over sequences of images. In SSM-based approaches [5,23], a frame-wise similarity matrix is computed for each video, and view-invariant descriptors are extracted within log-polar blocks on the matrix. Sharing knowledge among views was investigated in [24,25,8,7,9,26–28]. Specifically, the MRM-Lasso method in [9] captures latent correlations across different views by learning a low-rank matrix consisting of pattern-specific weights. Transferable dictionary pairs were created in [8,7], which encourage a shared sparse feature space. A bipartite graph was exploited in [25] to combine two view-dependent vocabularies into visual-word clusters called bilingual words, bridging the semantic gap across view-dependent vocabularies.

8.1.3 Deeply Learned View-Invariant Features

The goal of this work is to extract view-invariant features that allow us to train a classification model on one (or multiple) view(s) and test it on another view.

8.1.3.1 Sample-Affinity Matrix (SAM)

We introduce SAM to measure the similarity among pairs of video samples in multiple views. Suppose that we are given training videos of V views, $\{X^v, y^v\}_{v=1}^V$. The data of the $v$th view $X^v$ consist of N action videos, $X^v = [x_1^v, \dots, x_N^v] \in \mathbb{R}^{d \times N}$, with corresponding labels $y^v = [y_1^v, \dots, y_N^v]$. SAM $Z \in \mathbb{R}^{VN \times VN}$ is a block-diagonal matrix

$$Z = \mathrm{diag}(Z_1, \dots, Z_N), \qquad Z_i = \begin{pmatrix} 0 & z_i^{12} & \cdots & z_i^{1V} \\ z_i^{21} & 0 & \cdots & z_i^{2V} \\ \vdots & \vdots & \ddots & \vdots \\ z_i^{V1} & z_i^{V2} & \cdots & 0 \end{pmatrix},$$

where $\mathrm{diag}(\cdot)$ creates a block-diagonal matrix, and $z_i^{uv}$ measures the similarity between views $u$ and $v$ of the $i$th sample, computed by $z_i^{uv} = \exp(-\|x_i^v - x_i^u\|^2 / (2c))$ with bandwidth parameter $c$.

Essentially, SAM Z captures within-class between-view information and between-class within-view information. Block $Z_i$ in Z characterizes appearance variations in different views within one class, which explains how an action varies as the view changes. Such information makes it possible to transfer information among views and build robust cross-view features. Additionally, since the off-diagonal blocks of SAM Z are zeros, information sharing among classes in the same view is restricted. Consequently, features from different classes but the same view are encouraged to be distinct. This enables us to differentiate action categories that appear similar in some views.
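As a concrete illustration, the following NumPy sketch builds the block-diagonal SAM Z from per-view feature matrices. The sample-major block ordering and the helper name are our assumptions, not fixed by the text.

```python
import numpy as np

def sample_affinity_matrix(X_views, c=1.0):
    """Build the block-diagonal SAM Z from a list of V view matrices.
    X_views: list of V arrays, each d x N (the same sample order in every view).
    Returns Z of shape (V*N, V*N) with one V x V block per sample."""
    V, N = len(X_views), X_views[0].shape[1]
    Z = np.zeros((V * N, V * N))
    for i in range(N):
        Zi = np.zeros((V, V))
        for u in range(V):
            for v in range(V):
                if u != v:
                    diff = X_views[v][:, i] - X_views[u][:, i]
                    Zi[u, v] = np.exp(-np.dot(diff, diff) / (2.0 * c))  # z_i^{uv}
        Z[i * V:(i + 1) * V, i * V:(i + 1) * V] = Zi   # diagonal block for sample i
    return Z
```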

8.1.3.2 Preliminaries on Autoencoders

Our approach is based on a popular deep learning model, the autoencoder (AE) [29,10,30]. An AE maps the raw input x to hidden units h using an "encoder" $f_1(\cdot)$, $h = f_1(x)$, and then maps the hidden units to outputs using a "decoder" $f_2(\cdot)$, $o = f_2(h)$. The objective of learning an AE is to encourage similar or identical input–output pairs, i.e., the reconstruction loss is minimized after decoding, $\min \sum_{i=1}^{N} \|x_i - f_2(f_1(x_i))\|^2$. Here, N is the number of training samples. In this way, the neurons in the hidden layer are good representations of the inputs, as the reconstruction process captures the intrinsic structure of the input data.

As opposed to the two-level encoding and decoding in an AE, the marginalized stacked denoising autoencoder (mSDA) [10] reconstructs corrupted inputs with a single mapping W, $\min \sum_{i=1}^{N} \|x_i - W\tilde{x}_i\|^2$, where $\tilde{x}_i$ is the corrupted version of $x_i$ obtained by setting each feature to 0 with probability p. mSDA performs m passes over the training set, each time with different corruptions; this essentially performs a dropout regularization on the mSDA [31]. By letting $m \to \infty$, mSDA effectively computes the transformation matrix W that is robust to noise using infinitely many copies of noisy data. mSDA is stackable and can be calculated in closed form.
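For concreteness, a single mSDA-style layer can be computed in closed form roughly as below (a NumPy sketch; it omits the constant bias feature of the original mSDA formulation, and the small ridge term is our addition for numerical stability only).

```python
import numpy as np

def msda_layer(X, p=0.5):
    """Closed-form marginalized denoising autoencoder (single layer, no bias feature).
    X: d x N data matrix; p: feature corruption (dropout) probability.
    Returns W that reconstructs X from infinitely many corrupted copies."""
    d, _ = X.shape
    S = X @ X.T                       # scatter matrix
    q = np.full(d, 1.0 - p)           # survival probability of each feature
    # E[X_tilde X_tilde^T]: off-diagonal entries scaled by q_i q_j, diagonal by q_i
    EQ = S * np.outer(q, q)
    np.fill_diagonal(EQ, q * np.diag(S))
    # E[X X_tilde^T]: columns (corrupted side) scaled by q_j
    EP = S * q[np.newaxis, :]
    return EP @ np.linalg.inv(EQ + 1e-5 * np.eye(d))   # ridge added for stability
```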

8.1.3.3 Single-Layer Feature Learning

The single-layer feature learner described in this subsection builds on mSDA. We attempt to learn both discriminative shared features between multiple views and private features particularly owned by one view for cross-view action classification. Considering large motion variations in different views, we incorporate SAM Z in learning shared features to balance information transfer between views so as to build more robust features.

We use the following objective function to learn shared features and private features:

$$\min_{W,\{G^v\}} \mathcal{Q}, \qquad \mathcal{Q} = \|W\tilde{X} - XZ\|_F^2 + \sum_v \Big[\alpha\|G^v\tilde{X}^v - X^v\|_F^2 + \beta\|W^T G^v\|_F^2 + \gamma\,\mathrm{Tr}\big(P^v X^v L X^{vT} P^{vT}\big)\Big], \tag{8.1}$$

where W is the mapping matrix for learning shared features, $\{G^v\}_{v=1}^V$ is a group of mapping matrices for learning the private features of each view, and $P^v = (W; G^v)$. The objective function contains four terms: $\psi = \|W\tilde{X} - XZ\|_F^2$ learns the shared features between views, essentially reconstructing the action data of one view from the data of all views; $\phi^v = \|G^v\tilde{X}^v - X^v\|_F^2$ learns view-specific private features that are complementary to the shared features; $r_1^v = \|W^T G^v\|_F^2$ and $r_2^v = \mathrm{Tr}(P^v X^v L X^{vT} P^{vT})$ are model regularizers. Here, $r_1^v$ reduces the redundancy between the two mapping matrices, and $r_2^v$ encourages the shared and private features of the same class and the same view to be similar, while $\alpha, \beta, \gamma$ are parameters balancing the importance of these components. Further details about these terms are discussed in the following.

Note that in cross-view action recognition, data from all the views are available in training to learn shared and private features. Data from some views are unavailable only in testing.

Shared Features. Humans can perceive an action from one view and envision what the action would look like from other views. This is possibly because we have seen similar actions from multiple views before. It inspires us to reconstruct the action data of one view (the target view) using the action data from all the views (the source views). In this way, information shared between views can be summarized and transferred to the target view.

We define the discrepancy between the data of the vth target view and the data of all the V source views as

$$\psi = \sum_{i=1}^{N}\sum_{v=1}^{V} \Big\|W\tilde{x}_i^v - \sum_{u} x_i^u z_i^{uv}\Big\|^2 = \|W\tilde{X} - XZ\|_F^2, \tag{8.2}$$

where $z_i^{uv}$ is a weight measuring the contribution of the $u$th-view data in the reconstruction of the sample $x_i^v$ of the $v$th view, $W \in \mathbb{R}^{d \times d}$ is a single linear mapping applied to the corrupted inputs $\tilde{x}_i^v$ of all the views, and $Z \in \mathbb{R}^{VN \times VN}$ is the sample-affinity matrix encoding all the weights $\{z_i^{uv}\}$. Matrices $X, \tilde{X} \in \mathbb{R}^{d \times VN}$ denote the input training matrix and the corresponding corrupted version of X, respectively [10]. The corruption essentially performs a dropout regularization on the model [31].

The SAM Z here allows us to precisely balance information transfer among views and helps learn more discriminative shared features. Instead of using equal weights [7,8], we reconstruct the ith training sample of the vth view from the samples of all V views with different contributions. As shown in Fig. 8.3, a sample from the side view (source 1) is more similar to another side-view sample (the target view) than to a top-view sample (source 2). Thus, more weight should be given to source 1 in order to learn more descriptive shared features for the target view. Note that SAM Z limits information sharing across samples (its off-diagonal blocks are zeros), since sharing across samples does not capture view-invariant information for cross-view action recognition.

Figure 8.3 Learning shared features using weighted samples.

Private Features. Besides the information shared across views, there is still some remaining discriminative information that exists exclusively in each view. In order to utilize such information and make it robust to viewpoint variations, we adopt the robust feature learning in [10], and learn view-specific private features for the samples of the vth view using a mapping matrix $G^v \in \mathbb{R}^{d \times d}$,

$$\phi^v = \sum_{i=1}^{N} \|G^v\tilde{x}_i^v - x_i^v\|^2 = \|G^v\tilde{X}^v - X^v\|_F^2. \tag{8.3}$$

Here, $\tilde{X}^v$ is the corrupted version of the feature matrix $X^v$ of the vth view. We learn V mapping matrices $\{G^v\}_{v=1}^V$ given the corresponding inputs of the different views.

It should be noted that using Eq. (8.3) may also capture some redundant shared information from the vth view. In this work, we reduce such redundancy by encouraging incoherence between the view-shared mapping matrix W and the view-specific mapping matrix $G^v$,

$$r_1^v = \|W^T G^v\|_F^2. \tag{8.4}$$

The incoherence between W and $\{G^v\}$ enables our approach to independently exploit the discriminative information contained in the view-specific features and the view-shared features.

Label Information. Large motion and posture variations may appear in action data captured from various views. Therefore, the shared and private features extracted using Eqs. (8.2) and (8.3) may not be discriminative enough for classifying actions with large variations. To address this issue, we enforce the shared and private features of the same class and the same view to be similar. A within-class within-view variance is defined to regularize the learning of the view-shared mapping matrix W and the view-specific mapping matrix $G^v$ as

$$r_2^v = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a(i,j)\big[\|Wx_i^v - Wx_j^v\|^2 + \|G^v x_i^v - G^v x_j^v\|^2\big] = \mathrm{Tr}\big(WX^vLX^{vT}W^T\big) + \mathrm{Tr}\big(G^vX^vLX^{vT}G^{vT}\big) = \mathrm{Tr}\big(P^vX^vLX^{vT}P^{vT}\big). \tag{8.5}$$

Here, $L \in \mathbb{R}^{N \times N}$ is the label-view Laplacian matrix, $L = D - A$, where D is the diagonal degree matrix with $D(i,i) = \sum_{j=1}^{N} a(i,j)$, and A is the adjacency matrix that represents the label relationships of the training videos. The $(i,j)$th element $a(i,j)$ of A is 1 if $y_i = y_j$ and 0 otherwise.

Note that we do not explicitly require features from different views of the same class to be similar, since this idea is already implicitly used in Eq. (8.2). In learning the shared features, the features of the same sample from multiple views are mapped to a new space by the mapping matrix W, so the projected features of one sample can be well represented by the features from all the views of that sample. Therefore, the discrepancy among views is minimized, and the within-class cross-view variance is not needed in Eq. (8.5).
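For reference, the label Laplacian $L = D - A$ used in Eq. (8.5) can be built as follows (a NumPy sketch assuming integer class labels):

```python
import numpy as np

def label_laplacian(labels):
    """Label Laplacian L = D - A, where A(i, j) = 1 iff samples i and j
    share the same action label (including a(i, i) = 1)."""
    y = np.asarray(labels)
    A = (y[:, None] == y[None, :]).astype(float)   # adjacency from label equality
    D = np.diag(A.sum(axis=1))                     # diagonal degree matrix
    return D - A
```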

Discussion. Using the label information in Eq. (8.5) yields a supervised approach. We can also obtain an unsupervised variant by setting $\gamma = 0$. We refer to the unsupervised approach as Ours-1 and the supervised approach as Ours-2 in the following discussions.

8.1.3.4 Learning

We develop a coordinate descent algorithm to solve the optimization problem in Eq. (8.1) and optimize the parameters W and $\{G^v\}_{v=1}^V$. More specifically, in each step one parameter matrix is updated by fixing the others, computing the derivative of $\mathcal{Q}$ w.r.t. that parameter, and setting it to 0.

Update W. Parameters $\{G^v\}_{v=1}^V$ are fixed when updating W, which is obtained by setting the derivative $\partial\mathcal{Q}/\partial W = 0$, yielding

$$W = \Big[\sum_v\big(\beta G^vG^{vT} + \gamma X^vLX^{vT} + I\big)\Big]^{-1}\big(XZ\tilde{X}^T\big)\big[\tilde{X}\tilde{X}^T + I\big]^{-1}. \tag{8.6}$$

It should be noted that $XZ\tilde{X}^T$ and $\tilde{X}\tilde{X}^T$ are computed by repeating the corruption $m \to \infty$ times. By the weak law of large numbers [10], $XZ\tilde{X}^T$ and $\tilde{X}\tilde{X}^T$ can be computed by their expectations $\mathbb{E}_p[XZ\tilde{X}^T]$ and $\mathbb{E}_p[\tilde{X}\tilde{X}^T]$ with corruption probability p, respectively.

Update $G^v$. Fixing W and $\{G^u\}_{u=1, u \neq v}^{V}$, parameter $G^v$ is updated by setting the derivative $\partial\mathcal{Q}/\partial G^v = 0$, giving

$$G^v = \big(\beta WW^T + \gamma X^vLX^{vT} + I\big)^{-1}\big(\alpha X^v\tilde{X}^{vT}\big)\big[\alpha\tilde{X}^v\tilde{X}^{vT} + I\big]^{-1}. \tag{8.7}$$

Similar to the procedure of updating W, $X^v\tilde{X}^{vT}$ and $\tilde{X}^v\tilde{X}^{vT}$ are computed by their expectations with corruption probability p.

Convergence. Our learning algorithm iteratively updates W and $\{G^v\}_{v=1}^V$. The problem in Eq. (8.1) can be divided into $V+1$ subproblems, each of which is convex with respect to one variable. Therefore, by solving the subproblems alternately, the learning algorithm is guaranteed to find an optimal solution to each subproblem, and finally converges to a local solution.
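A minimal sketch of the two closed-form updates in Eqs. (8.6) and (8.7) is given below (NumPy; `Xt` and `Xvt` stand for the expected corrupted matrices $\tilde{X}$ and $\tilde{X}^v$, and the helper names are ours).

```python
import numpy as np

def update_W(X, Xt, Z, Xv_list, G_list, L, beta, gamma):
    """One coordinate-descent step for W (Eq. (8.6)), with {G^v} held fixed.
    X: d x VN training matrix, Xt: its (expected) corrupted version,
    Xv_list / G_list: per-view data matrices and private mappings, L: label Laplacian."""
    d = X.shape[0]
    A = sum(beta * G @ G.T + gamma * Xv @ L @ Xv.T + np.eye(d)
            for G, Xv in zip(G_list, Xv_list))
    B = X @ Z @ Xt.T
    C = Xt @ Xt.T + np.eye(d)
    return np.linalg.inv(A) @ B @ np.linalg.inv(C)

def update_Gv(W, Xv, Xvt, L, alpha, beta, gamma):
    """One coordinate-descent step for a single G^v (Eq. (8.7)), with W held fixed."""
    d = Xv.shape[0]
    A = beta * W @ W.T + gamma * Xv @ L @ Xv.T + np.eye(d)
    B = alpha * Xv @ Xvt.T
    C = alpha * Xvt @ Xvt.T + np.eye(d)
    return np.linalg.inv(A) @ B @ np.linalg.inv(C)
```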

8.1.3.5 Deep Architecture

Inspired by the deep architecture in [10,32], we also design a deep model by stacking multiple layers of the feature learner described in Sect. 8.1.3.3. A nonlinear feature mapping is performed layer by layer. More specifically, a nonlinear squashing function $\sigma(\cdot)$ is applied to the output of one layer, $H^w = \sigma(WX)$ and $H^{g_v} = \sigma(G^v X^v)$, resulting in a series of hidden feature matrices.

A layer-wise training scheme is used in this work to train the networks $\{W_k\}_{k=1}^K$, $\{G_k^v\}_{k=1,v=1}^{K,V}$ with K layers. Specifically, the outputs of the kth layer, $H_k^w$ and $H_k^{g_v}$, are used as the input to the $(k+1)$th layer. The mapping matrices $W_{k+1}$ and $\{G_{k+1}^v\}_{v=1}^V$ are then trained on these inputs. For the first layer, the inputs $H_0^w$ and $H_0^{g_v}$ are the raw features X and $X^v$, respectively. More details are shown in Algorithm 8.1.

Algorithm 8.1 Learning deep sequential context networks.
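The layer-wise stacking can be sketched roughly as follows (Python; the `fit_layer` routine, which is assumed to solve Eq. (8.1) for one layer, and the tanh squashing function are our stand-ins, and the actual Algorithm 8.1 may differ in details).

```python
import numpy as np

def stack_layers(X, Xv_list, K, fit_layer, squash=np.tanh):
    """Layer-wise training sketch: each layer learns (W, {G^v}) on the previous
    layer's squashed outputs. `fit_layer` is a hypothetical solver for Eq. (8.1)
    that returns (W, G_list) given shared and per-view inputs."""
    Hw, Hg = X, list(Xv_list)
    params = []
    for _ in range(K):
        W, G_list = fit_layer(Hw, Hg)
        params.append((W, G_list))
        Hw = squash(W @ Hw)                                    # shared features -> next layer
        Hg = [squash(G @ H) for G, H in zip(G_list, Hg)]       # private features -> next layer
    return params, Hw, Hg
```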

8.1.4 Experiments

We evaluate the Ours-1 and Ours-2 approaches on two multiview datasets: the multiview IXMAS dataset [33] and the Daily and Sports Activities (DSA) dataset [1], both of which have been widely used in [1,24,25,8,7,9].

We consider two cross-view classification scenarios in this work, many-to-one and one-to-one. The former trains on $V-1$ views and tests on the remaining view, while the latter trains on one view and tests on the other views. For the vth view that is used for testing, we simply set the corresponding $X^v$ to 0 in Eq. (8.1) during training. The intersection kernel support vector machine (IKSVM) with parameter $C = 1$ is adopted as the classifier. Unless specified otherwise, the default parameters are $\alpha = 1, \beta = 1, \gamma = 0, K = 1, p = 0$ for the Ours-1 approach, and $\alpha = 1, \beta = 1, \gamma = 1, K = 1, p = 0$ for the Ours-2 approach. The default number of layers is set to 1 for efficiency.

IXMAS is a multicamera-view video dataset, where each view corresponds to a camera view (see Fig. 8.4B). The IXMAS dataset consists of 12 actions performed by 10 actors. An action was recorded by 4 side view cameras and 1 top view camera. Each actor repeated one action 3 times.

Figure 8.4 Examples of multi-view problem settings: (A) multiple sensor views in the Daily and Sports Activities (DSA) dataset, and (B) multiple camera views in the IXMAS.

We adopt the bag-of-words model in [34]. An action video is described by a set of detected local spatiotemporal trajectory-based and global frame-based descriptors [35]. A k-means clustering method is employed to quantize these descriptors and build so-called video words. Consequently, a video can be represented by a histogram of the video words detected in the video, which is essentially a feature vector. An action captured by V camera views is represented by V feature vectors, each of which is a feature representation for one camera view.
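A minimal sketch of this bag-of-words encoding is given below (scikit-learn k-means; the per-video descriptor arrays, the vocabulary size `n_words`, and the histogram normalization are illustrative assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_encode(train_descriptors, video_descriptors, n_words=2000):
    """Quantize local descriptors with k-means ('video words') and represent
    each video as a normalized histogram of word occurrences.
    train_descriptors: (num_descriptors, dim) array used to build the vocabulary.
    video_descriptors: list of per-video (n_i, dim) descriptor arrays."""
    kmeans = KMeans(n_clusters=n_words, n_init=4).fit(train_descriptors)
    hists = []
    for desc in video_descriptors:
        words = kmeans.predict(desc)
        h = np.bincount(words, minlength=n_words).astype(float)
        hists.append(h / max(h.sum(), 1.0))       # histogram of video words
    return np.vstack(hists)
```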

DSA is a multisensor-view dataset comprising 19 daily and sports activities (e.g., sitting, playing basketball, and running on a treadmill at a speed of 8 km/h), each performed by 8 subjects in their own style for 5 minutes. Five Xsens MTx sensor units are placed on the torso, arms, and legs (Fig. 8.4A), resulting in a 5-view data representation. The sensor units are calibrated to acquire data at a 25 Hz sampling frequency. The 5-min signals are divided into 5-second segments, so that 480 (= 60 segments × 8 subjects) signal segments are obtained for each activity. One 5-second segment is used as an action time series in this work.

We follow [1] to preprocess the raw action data in a 5-s window, and represent the data as a 234-dimensional feature vector. Specifically, the raw action data are represented as a $125 \times 9$ matrix, where 125 is the number of sampling points (125 = 25 Hz × 5 s), and 9 is the number of values (the x, y, z acceleration, the x, y, z rate of turn, and the x, y, z Earth's magnetic field) obtained from one sensor. We first compute the minimum and maximum values, the mean, skewness, and kurtosis of the data matrix, and concatenate the results into a 45-dimensional (5 features × 9 axes) feature vector. Then we compute the discrete Fourier transform of the raw data matrix and select the 5 largest Fourier peaks, yielding another 45-dimensional (5 peaks × 9 axes) feature vector. The 45 frequency values corresponding to these Fourier peaks are also extracted, resulting in a further 45-dimensional (5 frequencies × 9 axes) vector. Afterwards, 11 autocorrelation samples are computed for each of the 9 axes, resulting in 99-dimensional (11 samples × 9 axes) features. The three types of features are concatenated into a 234-dimensional feature vector, representing the human motion captured by one sensor in a 5-second window. A human action captured by V sensors is represented by V feature vectors, each of which corresponds to a sensor view.
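The 234-dimensional descriptor can be reproduced roughly as follows (NumPy/SciPy sketch; details such as peak ordering and autocorrelation normalization are our assumptions, since the text does not fix them).

```python
import numpy as np
from scipy.stats import skew, kurtosis

def dsa_segment_features(S):
    """234-D feature for one 5-s segment S of shape (125, 9): 125 samples x 9 axes."""
    stats = np.concatenate([S.min(axis=0), S.max(axis=0), S.mean(axis=0),
                            skew(S, axis=0), kurtosis(S, axis=0)])        # 5 stats x 9 axes = 45
    spec = np.abs(np.fft.rfft(S, axis=0))
    freqs = np.fft.rfftfreq(S.shape[0], d=1.0 / 25)                       # 25 Hz sampling
    peaks, peak_freqs, acorr = [], [], []
    for a in range(S.shape[1]):
        idx = np.argsort(spec[:, a])[-5:]                                 # 5 largest Fourier peaks
        peaks.append(spec[idx, a])
        peak_freqs.append(freqs[idx])
        x = S[:, a] - S[:, a].mean()
        full = np.correlate(x, x, mode='full')[len(x) - 1:]
        acorr.append(full[:11])                                           # 11 autocorrelation samples
    return np.concatenate([stats, np.ravel(peaks),
                           np.ravel(peak_freqs), np.ravel(acorr)])        # 45 + 45 + 45 + 99 = 234
```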

8.1.4.1 IXMAS Dataset

Dense trajectory and histogram of oriented optical flow features [35] are extracted from the videos. A dictionary of size 2000 is built for each type of feature using k-means. We use the bag-of-words model to encode these features and represent each video as a feature vector.

We adopt the same leave-one-action-class-out training scheme as in [25,7,8] for fair comparison. Each time, one action class is used for testing. In order to evaluate the effectiveness of the information transfer in our approaches, all the videos of this action class are excluded from the feature learning procedure, including k-means and our approaches. Note that these videos can be seen when training the action classifiers. We evaluate both the unsupervised approach (Ours-1) and the supervised approach (Ours-2).

One-to-One Cross-view Action Recognition

This experiment trains on data from one camera view (the training view) and tests on data from another view (the test view). We only use the learned shared features and discard the private features in this experiment, as the private features learned on one view do not capture much information about the other view.

We compare Ours-2 approach with [36,7,8] and report recognition results in Table 8.1. Ours-2 achieves the best performance in 18 out of 20 combinations, significantly better than all the compared approaches. It should be noted that Ours-2 achieves 100% in 16 cases, demonstrating the effectiveness of the learned shared features. Thanks to the abundant discriminative information from the learned shared features and label information, our approach is robust to viewpoint variations and can achieve high performance in cross-view recognition.

Table 8.1

One-to-one cross-view recognition results of various supervised approaches on IXMAS dataset. Each row corresponds to a training view (from view C0 to view C4) and each column is a test view (also from view C0 to view C4). The results in brackets are the recognition accuracies of [36,7,8] and our supervised approach, respectively

C0 C1 C2 C3 C4
C0 NA (79,98.8,98.5,100) (79,99.1,99.7,99.7) (68,99.4,99.7,100) (76,92.7,99.7,100)
C1 (72,98.8,100,100) NA (74,99.7,97.0,99.7) (70,92.7,89.7,100) (66,90.6,100,99.7)
C2 (71,99.4,99.1,100) (82,96.4,99.3,100) NA (76,97.3,100,100) (72,95.5,99.7,100)
C3 (75,98.2,90.0,100) (75,97.6,99.7,100) (73,99.7,98.2,99.4) NA (76,90.0,96.4,100)
C4 (80,85.8,99.7,100) (77,81.5,98.3,100) (73,93.3,97.0,100) (72,83.9,98.9,100) NA
Ave. (74,95.5,97.2,100) (77,93.6,98.3,100) (76,98.0,98.7,99.7) (73,93.3,97.0,100) (72,92.4,98.9,99.9)


We also compare the Ours-1 approach with [25,7,8,24,37], and report comparison results in Table 8.2. Our approach achieves the best performance in 19 out of 20 combinations. In some cases, our approach outperforms the comparison approaches by a large margin, for example, C4→C0 (C4 is the training view and C0 is the test view), C4→C1, and C1→C3. The overall performance of Ours-1 is slightly worse than that of Ours-2 due to the removal of the label information.

Table 8.2

One-to-one cross-view recognition results of various unsupervised approaches on IXMAS dataset. Each row corresponds to a training view (from view C0 to view C4) and each column is a test view (also from view C0 to view C4). The results in brackets are the recognition accuracies of [25,7,8,24,37] and our unsupervised approach, respectively

C0 C1 C2 C3 C4
C0 NA (79.9,96.7,99.1,92.7,94.8,99.7) (76.8,97.9,90.9,84.2,69.1,99.7) (76.8,97.6,88.7,83.9,83.9,98.9) (74.8,84.9,95.5,44.2,39.1,99.4)
C1 (81.2,97.3,97.8,95.5,90.6,100) NA (75.8,96.4,91.2,77.6,79.7,99.7) (78.0,89.7,78.4,86.1,79.1,99.4) (70.4,81.2,88.4,40.9,30.6,99.7)
C2 (79.6,92.1,99.4,82.4,72.1,100) (76.6,89.7,97.6,79.4,86.1,99.7) NA (79.8,94.9,91.2,85.8,77.3,100) (72.8,89.1,100,71.5,62.7,99.7)
C3 (73.0,97.0,87.6,82.4,82.4,100) (74.1,94.2,98.2,80.9,79.7,100) (74.0,96.7,99.4,82.7,70.9,100) NA (66.9,83.9,95.4,44.2,37.9,100)
C4 (82.0,83.0,87.3,57.1,48.8,99.7) (68.3,70.6,87.8,48.5,40.9,100) (74.0,89.7,92.1,78.8,70.3,100) (71.1,83.7,90.0,51.2,49.4,100) NA
Ave (79.0,94.4,93.0,79.4,74.5,99.9) (74.7,87.8,95.6,75.4,75.4,99.9) (75.2,95.1,93.4,80.8,72.5,99.9) (76.4,91.2,87.1,76.8,72.4,99.9) (71.2,84.8,95.1,50.2,42.6,99.7)


Many-to-One Cross-view Action Recognition

In this experiment, one view is used as the test view and all the other views are used as training views. Our approaches use both the learned shared and private features in this experiment.

Our unsupervised (Ours-1) and supervised (Ours-2) approaches are compared with existing approaches [38,5,22,25,7,8,6]. The importance of SAM Z in Eq. (8.2), the incoherence in Eq. (8.4), and the private features in Ours-2 model are also evaluated.

Table 8.3 shows that our supervised approach (Ours-2) achieves an impressive 100% recognition accuracy in all the 5 cases, and Ours-1 achieves an overall accuracy of 99.8%. Ours-1 and Ours-2 achieve superior overall performance over all the other comparison approaches, demonstrating the benefit of using both shared and private features in this work. Our approaches use the sample-affinity matrix to measure the similarities between video samples across camera views. Consequently, the learned shared features accurately characterize the commonness across views. In addition, the redundancy is reduced between shared and private features, making the learned private features more informative for classification. Although the two methods in [8] exploit private features as well, they do not measure different contributions of samples in learning the shared dictionary, making the shared information less discriminative.

Table 8.3

Many-to-one cross-view action recognition results on IXMAS dataset. Each column corresponds to a test view

Methods C0 C1 C2 C3 C4
Junejo et al. [5] 74.8 74.5 74.8 70.6 61.2
Liu and Shah [38] 76.7 73.3 72.0 73.0 N/A
Weinland et al. [22] 86.7 89.9 86.4 87.6 66.4
Liu et al. [25] 86.6 81.1 80.1 83.6 82.8
Zheng et al. [7] 98.5 99.1 99.1 100 90.3
Zheng and Jiang [8]-1 97.0 99.7 97.2 98.0 97.3
Zheng and Jiang [8]-2 99.7 99.7 98.8 99.4 99.1
Yan et al. [6] 91.2 87.7 82.1 81.5 79.1
No-SAM 95.3 93.9 95.3 93.1 94.7
No-private 98.6 98.1 98.3 99.4 100
No-incoherence 98.3 97.5 98.9 98.1 100
Ours-1 (unsupervised) 100 99.7 100 100 99.4
Ours-2 (supervised) 100 100 100 100 100


Ours-2 outperforms the No-SAM approach, suggesting the effectiveness of SAM Z. Without SAM Z, No-SAM treats samples across views equally, and thus cannot accurately weigh the importance of samples in different views. The importance of the private features can be clearly seen from the performance gap between Ours-2 and the No-private approach. Without private features, the No-private approach only uses shared features for classification, which are not discriminative enough if some informative motion patterns exist exclusively in one view and cannot be shared across views. The performance gap between Ours-2 and the No-incoherence method suggests the benefit of encouraging the incoherence in Eq. (8.4), which reduces the redundancy between shared and private features and helps extract discriminative information from each of them. Ours-2 slightly outperforms Ours-1 in this experiment, indicating the effectiveness of using the label information in Eq. (8.5).

8.1.4.2 Daily and Sports Activities Data Set

Many-to-One Cross-view Action Classification

In this experiment, data from 4 sensors are used for training (36,480 time series) and the data from the remaining 1 sensor (9,120 time series) are used for testing. This process is repeated 5 times and the average results are reported.

Our unsupervised (Ours-1) and supervised (Ours-2) approaches are compared with mSDA [10], DRRL [39], and IKSVM. The importance of SAM Z in Eq. (8.2), the incoherence in Eq. (8.4), and the private features in the Ours-2 model are also evaluated. We remove Z in Eq. (8.2) and the incoherence component in Eq. (8.4) from the supervised model, obtaining the "No-SAM" and "No-incoherence" models, respectively. We also remove the learning of the parameters $\{G^v\}_{v=1}^V$ from the supervised model, obtaining the "No-private" model. Comparison results are shown in Table 8.4.

Table 8.4

Many-to-one cross-view action classification results on DSA dataset. Each column corresponds to a test view. V0–V4 are sensor views on torso, arms, and legs

Methods Overall V0 V1 V2 V3 V4
IKSVM 54.6 36.5 53.4 63.4 60.1 59.7
DRRL [39] 55.4 35.5 56.7 62.1 61.7 60.9
mSDA [10] 56.1 34.4 57.7 62.8 61.5 64.1
No-SAM 55.4 35.1 57.0 60.7 62.2 62.2
No-private 55.4 35.1 57.0 60.7 62.2 62.1
No-incoherence 55.4 35.1 56.9 60.7 62.2 62.2
Ours-1 57.1 35.7 57.4 64.4 64.2 63.9
Ours-2 58.0 36.1 58.9 65.8 64.2 65.2


Ours-2 achieves superior performance over all the other comparison methods in all 5 cases, with an overall recognition accuracy of 58.0%. Ours-2 outperforms Ours-1 by 0.9% in overall classification accuracy due to the use of label information. Note that cross-view classification on the DSA dataset is challenging because the sensors on different body parts are only weakly correlated. The sensor on the torso (V0) has the weakest correlations with the other four sensors on the arms and legs. Therefore, all approaches achieve their lowest performance on V0 compared with sensors V1–V4. Ours-1 and Ours-2 achieve superior overall performance over the comparison approaches IKSVM and mSDA due to the use of both shared and private features. IKSVM and mSDA do not discover shared and private features, and thus cannot use the correlations between views and the exclusive information in each view for classification. To better balance the information transfer between views, Ours-1 and Ours-2 use the sample-affinity matrix to measure the similarities between samples across views; thus, the learned shared features accurately characterize the commonness across views. Although the overall improvement of Ours-1 and Ours-2 over mSDA is 1% and 1.9%, respectively, Ours-1 and Ours-2 correctly classify 456 and 866 more sequences than mSDA in this experiment.

The performance gap between Ours-2 and the No-SAM approach suggests the effectiveness of SAM Z. Without SAM Z, No-SAM treats samples across views equally, and thus cannot accurately weigh the importance of samples in different views. Ours-2 outperforms the No-private approach, showing the importance of the private features in learning discriminative features for multiview classification. Without private features, the No-private approach only uses shared features for classification, which are not discriminative enough if some informative motion patterns exist exclusively in one view and cannot be shared across views. Ours-2 achieves superior performance over the No-incoherence method, indicating the benefit of encouraging the incoherence in Eq. (8.4), which reduces the redundancy between shared and private features and helps extract discriminative information from each of them. Ours-2 slightly outperforms Ours-1, indicating the effectiveness of using the label information in Eq. (8.5).

8.2 Hybrid Neural Network for Action Recognition from Depth Cameras

8.2.1 Introduction

Using depth cameras for action recognition has received increasing interest in the computer vision community due to the recent advent of the cost-effective Kinect. Depth sensors provide several advantages over typical visible-light cameras. First, 3D structural information can be easily captured, which helps simplify intra-class motion variations. Second, depth information provides useful cues for background subtraction and occlusion detection. Third, depth data are generally not affected by lighting variations, and thus provide robust information under different lighting conditions.

Unfortunately, improving recognition performance with depth data is not an easy task. One reason is that depth data are noisy and may have spatial and temporal discontinuities where undefined depth points exist. Existing methods resort to mining discriminative actionlets from noisy data [40], exploiting a sampling scheme [41], or developing depth spatiotemporal interest point detectors [42,43] in order to overcome the problem of noisy depth data. However, these methods directly use low-level features, which may not be expressive enough for discriminating depth videos. Another problem is that depth information alone is not discriminative enough, as most body parts in different actions have similar depth values. It is desirable to extract useful information from depth data, e.g., surface normals in 4D space [44] and 3D human shapes [45], and then use additional cues effectively to improve performance, e.g., joint data [40,46]. It should be noted that most existing approaches for depth action videos only capture low-order context, such as hand–arm and foot–leg. High-order context such as head–arm–leg and torso–arm–leg is not considered. In addition, all these methods depend on hand-crafted, problem-dependent features, whose importance for the recognition task is rarely known. This is a notable limitation since intra-class action data are generally highly varied, while inter-class action data often appear similar.

In this chapter, we describe a hybrid convolutional-recursive neural network (HCRNN), a cascade of a 3D convolutional neural network (3D-CNN) and a 3D recursive neural network (3D-RNN), to learn high-order, compositional features for recognizing RGB-D action videos. The hierarchical nature of HCRNN helps us abstract low-level features into powerful features for action recognition. HCRNN models the relationships between local neighboring body parts and allows body parts to deform in actions. This makes our model robust to pose variations and geometry changes in intra-class RGB-D action data. In addition, HCRNN captures high-order body-part context information in RGB-D action data, which is particularly important for learning actions with large pose variations [47,48]. A new 3D convolution is performed on spatiotemporal 3D patches, thereby capturing rich motion information in adjacent frames and reducing noise. We organize all the components of HCRNN into different layers, and train HCRNN in an unsupervised fashion without network fine-tuning. More importantly, we demonstrate that high-quality features can be learned by 3D-RNNs even with random weights.

The goal of HCRNN is to learn discriminative features from RGB-D videos. As illustrated in the flowchart in Fig. 8.5, HCRNN starts with raw RGB-D videos and first extracts features from the RGB and depth modalities separately. The two modalities, RGB and depth data, are then fed into 3D convolutional neural networks (3D-CNN) and convolved with K filters each. 3D-CNN outputs translationally invariant low-level features, a matrix of filter responses. These features are then given to the 3D-RNN to learn compositional high-order features. To improve feature discriminability, multiple 3D-RNNs are jointly employed to learn the features. The feature vectors learned by all the 3D-RNNs from all the modalities are combined into a single feature vector, which is the action representation for the input RGB-D video. A softmax classifier is applied to recognize the RGB-D action video.

Figure 8.5 Architecture of our hybrid convolutional-recursive neural network (HCRNN) model. Given an RGB-D video, HCRNN learns a discriminative feature vector from both RGB and depth data. We use 3D-CNN to learn features of local neighboring body parts, and 3D-RNN to learn compositional features hierarchically.

8.2.2 Related Work

RGB-D Action Recognition. In depth videos, depth images generally contain undefined depth points, causing spatial and temporal discontinuities. This is an obstacle to using the informative depth data. For example, popular spatiotemporal interest point (STIP) detectors [34,49] cannot be applied to depth videos directly, as they falsely fire on those discontinuous black regions [44]. To overcome this problem, counterparts of these detectors for depth videos have been proposed in [42,43]. Depth STIP [42], a filtering method, was introduced to detect interest points from RGB-D videos with noise reduction.

To obtain useful information from noisy depth videos, [40] proposed to select informative joints that are most relevant to the recognition task. Consequently, an action can be represented by subsets of joints (actionlets) and learned with a multiple-kernel SVM, where each kernel corresponds to an actionlet. In [44], the histogram of oriented 4D surface normals (HON4D) is computed to effectively exploit the changing geometric structure of actions in depth videos. Li et al. [45] projected depth maps onto 2D planes and sampled a set of points along the contours of the projections. The points are then clustered to obtain salient postures. A GMM is further used to model the postures, and an action graph is applied for inference. Holistic features [14] and human pose (joint) information are also used for action recognition from RGB-D videos [46,50,51].

The Hollywood 3D dataset, a new 3D action dataset, was released in [43] and evaluated using both conventional STIP detectors and their extensions for depth videos. Results show that the new STIP detectors for depth videos can effectively take advantage of depth data and suppress false detections caused by the spatial and temporal discontinuities in depth images.

Applications Using Deep Models. In recent years, feature learning using deep models has been successfully applied to object recognition [52–54] and detection [55,56], scene understanding [57,58], face recognition, and action recognition [59–61].

Feature learning methods for object recognition are generally composed of a filter bank, a nonlinear transformation, and some pooling layers. To evaluate their influence, [52] built several hierarchies from different combinations of those components, and reported their performance on object recognition and handwritten digit recognition datasets. The 3D object recognition task was also addressed in [53,54]. In [55], the mutual visibility relationship in pedestrian detection is modeled by summarizing human body part interactions in a hierarchical way.

Deep models have achieved promising results in conventional action recognition tasks. A convolutional neural network [59] was applied to extract features from both the spatial and temporal dimensions by performing 3D convolutions. An unsupervised gated RBM model [61] was proposed for action recognition. Le et al. [60] combined independent subspace analysis with deep learning techniques to build features robust to local translation. All these methods are designed for color videos. In this chapter, we introduce a deep architecture for recognizing actions from RGB-D videos.

8.2.3 Hybrid Convolutional-Recursive Neural Networks

We describe the hybrid convolutional-recursive neural network (HCRNN) for learning high-order compositional features for action recognition from depth cameras. The HCRNN consists of two components, the 3D-CNN model and the 3D-RNN model. The 3D-CNN model is utilized to generate low-level, translationally invariant features, and the 3D-RNN model is employed to compose high-order features that can be used to classify actions. The architecture, a cascade of 3D-CNN and 3D-RNN, is shown in Fig. 8.5.

8.2.3.1 Architecture Overview

Our method takes an RGB-D video v as input and outputs the corresponding action label. HCRNN is employed to find a transform h that maps the RGB-D video into a feature vector x, $x = h(v)$. The feature vector x is then fed into a classifier to obtain the action label y. We treat an RGB-D video as multichannel data, and extract gray, gradient, optical flow, and depth channels from the RGB and depth modalities. HCRNN is applied to each of these channels.

3D-CNN. The lower part of our HCRNN is a 3D-CNN model that extracts features hierarchically. The 3D-CNN (Fig. 8.6) in this work has five stages: 3D convolution (Sect. 8.2.3.2), absolute rectification, local normalization, average pooling, and subsampling. We sample N 3D patches of size $(s_r, s_c, s_t)$ (height, width, frames) from each channel with stride $s_p$. The 3D-CNN g takes these patches as input, and outputs a K-dimensional vector u for each patch, $g: \mathbb{R}^S \to \mathbb{R}^K$. Here, K is the number of learned filters and $S = s_r \times s_c \times s_t$.

Figure 8.6 Graphical illustration of 3D-CNN. Given a 3D video patch p, 3D-CNN g performs five stages of computations: 3D convolution, rectification, local normalization, average pooling and subsampling. Then the K-dimensional feature vector u is obtained, u = g(p).

Rich motion and geometry change information is captured by the 3D convolution in 3D-CNN. Each video of size $d_I$ (height, width, frames) is convolved with K filters of size $d_P$, resulting in K filter responses of dimensionality N (N is the number of patches extracted from one channel of a video). Then absolute rectification is performed, which applies the absolute value function to all components of the filter responses. This step is followed by local contrast normalization (LCN). The LCN module performs local subtractive and divisive normalization, enforcing a sort of local competition between adjacent features in a feature map and between features at the same spatiotemporal location in different feature maps. To improve the robustness of the features to small distortions, we add average pooling and subsampling modules to 3D-CNN. Patch features whose locations fall within a small spatiotemporal neighborhood are averaged and pooled to generate one parent feature of dimensionality K.

In this chapter, we augment the gray and depth feature maps (channels) with gradient-x, gradient-y, optflow-x, and optflow-y channels as in [59]. The gradient-x and gradient-y feature maps are computed by taking the gradient along the horizontal and vertical directions, respectively. The optflow-x and optflow-y feature maps are computed by running an optical flow algorithm and separating the horizontal and vertical components of the flow field.
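A rough sketch of building these six channels from gray and depth frames is shown below (NumPy gradients, and OpenCV's Farneback flow as a stand-in for whichever optical flow algorithm is used; 8-bit grayscale frames are assumed).

```python
import numpy as np
import cv2

def build_channels(gray_frames, depth_frames):
    """Per-frame channels: gray, depth, gradient-x/y (from gray), and
    optical-flow-x/y between consecutive gray frames (one fewer flow frame)."""
    channels = {'gray': [], 'depth': [], 'gx': [], 'gy': [], 'fx': [], 'fy': []}
    for t, g in enumerate(gray_frames):
        channels['gray'].append(g)
        channels['depth'].append(depth_frames[t])
        gy, gx = np.gradient(g.astype(np.float32))          # vertical, horizontal gradients
        channels['gx'].append(gx)
        channels['gy'].append(gy)
        if t > 0:
            flow = cv2.calcOpticalFlowFarneback(gray_frames[t - 1], g, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            channels['fx'].append(flow[..., 0])              # horizontal flow
            channels['fy'].append(flow[..., 1])              # vertical flow
    return {k: np.stack(v) for k, v in channels.items()}
```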

3D-RNN. The 3D-RNN model hierarchically learns compositional features. A graphical illustration of 3D-RNN is shown in Fig. 8.7. It merges a spatiotemporal block of patch feature vectors and generates a parent feature vector. The input to 3D-RNN is a $K \times M$ matrix, where M is the number of feature vectors generated by 3D-CNN ($M \leq N$ since we apply subsampling in 3D-CNN). The output of 3D-RNN is a K-dimensional vector, which is the feature for one channel of the video. We adopt a tree-structured 3D-RNN with J layers. At each layer, child vectors whose spatiotemporal locations are within a 3D block are merged into one parent vector. This procedure continues until one parent vector is generated at the top layer. We concatenate the feature vectors of all the channels generated by 3D-RNN, deriving a KC-dimensional vector as the action feature given C channels of data.

Figure 8.7 Graphical illustration of 3D-RNN. u1,…,u8 are a block of patch features generated by 3D-CNN. 3D-RNN takes these features as inputs and produces a parent feature vector q. 3D-RNN recursively merges adjacent feature vectors, and generates feature vector x for one channel of data.

8.2.3.2 3D Convolutional Neural Networks

We use 3D-CNN to extract features from RGB-D action videos. 2D-CNNs have been successfully used on 2D images, for example, in object recognition [52–54] and scene understanding [57,58]. In these methods, 2D convolutions are adopted to extract features from a local neighborhood on the feature map; an additive bias is then applied and a sigmoid function is used for feature mapping. However, in action recognition a 3D convolution is desired, as successive frames of a video encode rich motion information. In this work, we develop a new 3D convolution operation for RGB-D videos.

The 3D convolution in the 3D-CNN is achieved by convolving a filter on 3D patches extracted from RGB-D videos. Consequently, local spatiotemporal motion and structure information can be well captured in the convolution layer. It also captures motion and structure relationships of body parts in a local neighborhood, e.g., arm–hand and leg–foot. Suppose p is a 3D spatiotemporal patch randomly extracted from a video. We apply a nonlinear mapping to map p onto the feature map in the next layer,

$$g_k(p) = \max\big(\bar{d} - \|p - z_k\|^2,\ 0\big), \tag{8.8}$$

where $\bar{d} = \frac{1}{K}\sum_{k'=1}^{K} \|p - z_{k'}\|^2$ is the average distance of the patch p to all the filters, $z_k$ is the kth filter, and K is the number of learned filters. Note that the convolution in Eq. (8.8) is different from [59,62] but philosophically similar: $\bar{d}$ can be considered the bias, and $\|p - z_k\|^2$ plays the role of the convolution operation, measuring the similarity between filter $z_k$ and patch p. The $\max(\cdot)$ function is a nonlinear mapping, analogous to the sigmoid function. The filters in Eq. (8.8) are easily trained in an unsupervised fashion (Sect. 8.2.3.5) by running k-means [63].
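In code, the filter responses of Eq. (8.8) for a batch of flattened patches might be computed as below (a NumPy sketch; patches are assumed to be pre-normalized and whitened as in Sect. 8.2.3.5).

```python
import numpy as np

def conv3d_responses(patches, filters):
    """Responses of Eq. (8.8) for N flattened 3D patches (N x S) against
    K learned filters (K x S): g_k(p) = max(d_bar - ||p - z_k||^2, 0)."""
    # squared distances between every patch and every filter: N x K
    d2 = ((patches[:, None, :] - filters[None, :, :]) ** 2).sum(axis=-1)
    d_bar = d2.mean(axis=1, keepdims=True)          # average distance per patch
    return np.maximum(d_bar - d2, 0.0)              # N x K nonnegative responses
```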

After the 3D convolution, rectification, local contrast normalization, and average downsampling are applied as in object recognition, but they are performed on 3D patches.

3D-CNN generates a list of K-dimensional vectors $u_i^{rct} \in U$ ($i = 1, \dots, M$), where r, c, and t are the locations of the vector in the row, column, and temporal dimensions, respectively, and U is the set of vectors generated by 3D-CNN. Each vector in U is the feature for one patch. All these patch features are then given as inputs to 3D-RNN to compose high-order features.

Our 3D-CNN extracts discriminative motion and geometric-change information from RGB-D data. It also captures relationships between human body parts in a local neighborhood, and allows body parts to be deformable in actions. Therefore, the learned features are robust to pose variations in RGB-D data.

8.2.3.3 3D Recursive Neural Networks

The idea of recursive neural networks is to learn hierarchical features by applying the same neural network recursively in a tree structure. In our case, 3D-RNN can be regarded as combining convolution and pooling over 3D patches into one efficient, hierarchical operation.

We use a balanced fixed-tree structure for 3D-RNN. Compared with previous RNN approaches, the fixed tree structure offers high-speed operation and makes it possible to exploit parallelization. In our tree-structured 3D-RNN, each leaf node is a K-dimensional vector, which is an output of the 3D-CNN. At each layer, the 3D-RNN merges adjacent vectors into one vector. As this process repeats, high-order relationships and long-range dependencies of body parts are encoded in the learned feature.

3D-RNN takes the list of K-dimensional vectors $u_i^{rct} \in U$ ($i = 1, \dots, M$) generated by 3D-CNN as input, and recursively merges blocks of vectors into parent vectors $q \in \mathbb{R}^K$. We define a 3D block of size $b_r \times b_c \times b_t$ as a list of adjacent vectors to be merged. For example, if $b_r = b_c = b_t = 2$, then $B = 8$ vectors are merged. We define the merging function as

$$q = f\big(W[u_1; \dots; u_B]\big). \tag{8.9}$$

Here, W is a parameter matrix of size $K \times BK$ ($B = b_r \times b_c \times b_t$), and $f(\cdot)$ is a nonlinear function (e.g., $\tanh(\cdot)$). The bias term is omitted here as it does not affect performance.

3D-RNN is a tree with multiple layers, where the jth layer composes high-order features over the $(j-1)$th layer. In the jth layer, each block of vectors in the $(j-1)$th layer is merged into one parent vector using the same weight W in Eq. (8.9). This process is repeated until only one parent vector x remains. Fig. 8.7 shows an example of a pooled CNN output of size $K \times 2 \times 2 \times 2$ and an RNN tree structure with blocks of 8 children, $u_1, \dots, u_8$.
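A sketch of one fixed-tree 3D-RNN pass with 2×2×2 blocks is given below (NumPy; it assumes the pooled CNN output dimensions are divisible by the block size at every level, and uses tanh as f).

```python
import numpy as np

def rnn3d_forward(U, W, block=(2, 2, 2)):
    """One 3D-RNN with weight W (K x B*K): recursively merge each br x bc x bt
    block of child vectors into a parent via q = tanh(W [u1; ...; uB]).
    U has shape (K, R, C, T); merging repeats until a single K-vector remains."""
    K, R, C, T = U.shape
    br, bc, bt = block
    while (R, C, T) != (1, 1, 1):
        parents = np.zeros((K, R // br, C // bc, T // bt))
        for r in range(0, R, br):
            for c in range(0, C, bc):
                for t in range(0, T, bt):
                    blk = U[:, r:r + br, c:c + bc, t:t + bt].reshape(K, -1)  # K x B children
                    children = blk.T.reshape(-1)                             # [u1; ...; uB]
                    parents[:, r // br, c // bc, t // bt] = np.tanh(W @ children)
        U = parents
        K, R, C, T = U.shape
    return U.reshape(-1)   # K-dimensional feature for one channel
```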

We apply 3D-RNN to the C channels of data separately, obtaining C parent vectors $x^c$, $c = 1, \dots, C$. Each parent vector $x^c$ is a K-dimensional vector computed from one channel of an RGB-D video. The vectors from all the channels are then concatenated into a long vector that encodes rich motion and structure information for the RGB-D video. Finally, this feature is fed into a softmax classifier for action classification.

The feature learned by 3D-RNN captures high-order relationships of body parts and encodes long-range dependencies of body parts. Therefore, human actions can be well represented and can be classified accurately.

8.2.3.4 Multiple 3D-RNNs

3D-RNN abstracts high-order features using the same weight W in a recursive way. The weight W, which is generated randomly, encodes which child vectors contribute more to the parent vector for the classification task; a single random W may not capture this importance accurately.

This problem can be alleviated by using multiple 3D-RNNs. Similar to [54], we use multiple 3D-RNNs with different random weights. Consequently, different notions of importance among adjacent vectors can be captured by the different weights, and higher-quality feature vectors can be produced. We concatenate the vectors generated by the multiple 3D-RNNs and feed them into the softmax classifier.

8.2.3.5 Model Learning

Unsupervised Learning of 3D-CNN Filters. CNN models can be learned using supervised or unsupervised approaches [64,59]. Since the convolution operates on millions of 3D patches, using backpropagation and fine-tuning the entire network may not be practical or efficient. Instead, we train our 3D-CNN model using an unsupervised approach.

Inspired by [63], we learn the 3D-CNN filters in an unsupervised way by clustering random 3D patches. We treat the multichannel data (gray, gradient, optical flow, and depth) as separate feature maps, and randomly extract spatiotemporal 3D patches from each channel. The extracted 3D patches are then normalized and whitened. Finally, these patches are clustered into K cluster centers $z_k$, $k = 1, \dots, K$, which are used as the filters in the 3D convolution (Eq. (8.8)).
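A minimal version of this filter-learning step is sketched below (NumPy/scikit-learn; the normalization and ZCA-whitening constants follow common practice for k-means feature learning rather than values given in the text).

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_3d_filters(patches, K=256, eps=0.1):
    """Unsupervised filter learning: normalize and whiten random 3D patches
    (flattened to rows of shape (n_patches, S)), then cluster with k-means;
    the K centroids serve as the filters z_k used in Eq. (8.8)."""
    P = patches - patches.mean(axis=1, keepdims=True)        # per-patch mean removal
    P = P / np.sqrt(P.var(axis=1, keepdims=True) + 10.0)     # per-patch contrast normalization
    cov = np.cov(P, rowvar=False)
    d, E = np.linalg.eigh(cov)
    whiten = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T       # ZCA whitening transform
    Pw = P @ whiten
    return KMeans(n_clusters=K, n_init=4).fit(Pw).cluster_centers_
```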

Random Weights for 3D-RNN. Recent work [54] shows that RNNs with random weights can also generate features with high discriminability. We follow [54] to learn 3D-RNNs with random weights W. We show that learning RNNs with random weights provides an efficient, yet powerful model for action recognition from depth camera.

8.2.3.6 Classification

As described in Sect. 8.2.3.4, the features generated by the multiple 3D-RNNs are concatenated to produce the feature vector x for the depth video. We train a multiclass softmax classifier to classify the depth action x,

$$f(x, y) = \frac{\exp(\theta_y^T x)}{\sum_{l \in Y} \exp(\theta_l^T x)}, \tag{8.10}$$

where $\theta_y$ is the parameter vector for class y. Prediction is performed by taking the argmax over classes, $y^\ast = \arg\max_l f(x, l)$. The multiclass cross-entropy loss is used to learn the model parameters θ for all classes, which are optimized with a limited-memory variable-metric (BFGS) method.
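A compact softmax classifier trained with a quasi-Newton optimizer, consistent with the description above, might look like the following (SciPy's L-BFGS-B as a stand-in for the exact optimizer; the small ℓ2 regularizer is our addition).

```python
import numpy as np
from scipy.optimize import minimize

def train_softmax(X, y, n_classes, reg=1e-4):
    """Multiclass softmax on concatenated 3D-RNN features.
    X: N x D feature matrix, y: integer labels in [0, n_classes)."""
    N, D = X.shape

    def loss_grad(theta_flat):
        Theta = theta_flat.reshape(n_classes, D)
        logits = X @ Theta.T
        logits -= logits.max(axis=1, keepdims=True)          # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        loss = -np.log(P[np.arange(N), y]).mean() + 0.5 * reg * (Theta ** 2).sum()
        G = P.copy()
        G[np.arange(N), y] -= 1.0                             # softmax gradient
        grad = G.T @ X / N + reg * Theta
        return loss, grad.ravel()

    res = minimize(loss_grad, np.zeros(n_classes * D), jac=True, method='L-BFGS-B')
    return res.x.reshape(n_classes, D)   # prediction: argmax over X @ Theta.T
```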

8.2.4 Experiments

We evaluate our HCRNN model on two popular 3D action datasets, MSR-Gesture3D Dataset [65] and MSR-Action3D dataset [40]. Example frames of these datasets are shown in Fig. 8.8. We use gray, depth, gradient-x, gradient-y, optflow-x, and optflow-y feature maps for all the datasets.

Figure 8.8 Example frames from three RGB-D action datasets.

8.2.4.1 MSR-Gesture3D Dataset

MSR-Gesture3D is a hand gesture dataset containing 336 depth sequences captured by a depth camera. There are 12 categories of hand gestures in the dataset: "bathroom", "blue", "finish", "green", "hungry", "milk", "past", "pig", "store", "where", "j", and "z". This is a challenging dataset due to self-occlusion and the visual similarity among gestures. Our HCRNN takes an input video of size $80 \times 80 \times 18$. The number of filters in 3D-CNN is set to 256 and the number of 3D-RNNs is set to 16. The kernel (filter) size in 3D-CNN is $6 \times 6 \times 4$, and the receptive field size in 3D-RNN is $2 \times 2 \times 2$. As in [65], only depth frames are used in the experiments. Leave-one-out cross validation is employed in the evaluation.

Fig. 8.9 shows the confusion matrix of HCRNN on the MSR-Gesture3D dataset. Our method achieves 93.75% accuracy in classifying hand gestures. It confuses some examples between “ASL Past” and “ASL Store”, between “ASL Finish” and “ASL Past”, and between “ASL Blue” and “ASL J” owing to their visual similarity. As Fig. 8.9 shows, our method nevertheless recognizes most visually similar hand gestures, since HCRNN discovers discriminative features and abstracts expressive high-order features for the task. The remaining confusions in Fig. 8.9 are mainly caused by self-occlusion and intra-class motion variations.

Image
Figure 8.9 Confusion matrix of HCRNN on the MSR-Gesture3D dataset. Our method achieves 93.75% recognition accuracy.

We compare our HCRNN with [44,66,65,67] on the MSR-Gesture3D dataset. The methods in [44,66,65] are specifically designed for depth sequences, while [67] proposed the HOG3D descriptor, originally designed for color sequences. In contrast to these hand-crafted features, HCRNN learns its features from data. The results in Table 8.5 show that our method outperforms all the comparison methods. The learned features better capture intra-class variations and inter-class similarities, and thus achieve better performance. In addition, HCRNN encodes high-order context information of body parts and allows the parts to deform. These two benefits further improve the expressiveness of the learned features.

Table 8.5

The performance of our HCRNN model on the MSR-Gesture3D dataset compared with previous methods
Method Accuracy (%)
Oreifej et al. [44] 92.45
Yang et al. [66] 89.20
Jiang et al. [65] 88.50
Klaser et al. [67] 85.23
HCRNN 93.75

8.2.4.2 MSR-Action3D Dataset

The MSR-Action3D dataset [40] consists of 20 classes of human actions: “bend”, “draw circle”, “draw tick”, “draw x”, “forward kick”, “forward punch”, “golf swing”, “hammer”, “hand catch”, “hand clap”, “high arm wave”, “high throw”, “horizontal arm wave”, “jogging”, “pick up & throw”, “side boxing”, “side kick”, “tennis serve”, “tennis swing”, and “two hand wave”. The dataset contains a total of 567 depth videos captured with a depth camera.

In this dataset, the background has been preprocessed to remove the discontinuities caused by undefined depth regions. The dataset is nevertheless challenging because many actions are visually very similar. We adopt the same training/testing split as in [44], i.e., the videos of the first five subjects (295 videos) are used for training and the remaining 272 videos for testing; a sketch of this split follows. Our HCRNN takes an input video of size 120×160×30. The number of filters in the 3D-CNN is set to 256, and the number of 3D-RNNs to 32. The kernel (filter) size in the 3D-CNN is 6×6×4, and the receptive field size in the 3D-RNN is 2×2×2.
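For clarity, the cross-subject split can be written as a simple mask over per-video subject IDs. The subject numbering and metadata layout below are assumptions about how the dataset annotations are stored, not details from the chapter.

```python
import numpy as np

def cross_subject_split(subject_ids, train_subjects=(1, 2, 3, 4, 5)):
    """Return (train_indices, test_indices): videos of the first five subjects
    are used for training, the remaining videos for testing."""
    subject_ids = np.asarray(subject_ids)
    train_mask = np.isin(subject_ids, train_subjects)
    return np.where(train_mask)[0], np.where(~train_mask)[0]
```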

The confusion matrix of HCRNN is displayed in Fig. 8.10A. Our method achieves 90.07% recognition accuracy on the MSR-Action3D dataset. Confusions mostly occur between visually similar actions, e.g., “horizontal arm wave” and “hand clap”, “hammer” and “tennis serve”, and “draw x” and “draw tick”. The filters learned for this experiment are illustrated in Fig. 8.10B. Our filter learning method discovers a variety of representative patterns, which accurately describe local 3D patches.

Image
Figure 8.10 (A) Confusion matrix and (B) learned 3D filters of HCRNN on the MSR-Action3D dataset. Our method achieves 90.07% recognition accuracy.

We compare with methods specifically designed for depth sequences [40,65,66,68], as well as conventional action recognition methods that use spatiotemporal interest point detectors [34,49,67]. The results in Table 8.6A show that our method outperforms all the comparison methods. It achieves 90.07% recognition accuracy, higher than the state-of-the-art methods [42,14]. It should be noted that our method does not rely on a skeleton tracker, yet it outperforms the skeleton-based method [40]. Table 8.6B shows the recognition accuracy of HCRNN with different numbers of 3D-RNNs. HCRNN performs worst with n_r = 1 and improves steadily as more 3D-RNNs are added, reaching its best accuracy at n_r = 32. Beyond 32 3D-RNNs, the performance degrades due to overfitting.

Table 8.6

The performance of our HCRNN model on the MSR-Action3D dataset

(A) Comparison with previous methods
Method Accuracy (%)
RGGP [14] 89.30
Xia and Aggarwal [42] 89.30
Oreifej et al. [44] 88.89
Jiang et al. [40] 88.20
Jiang et al. [65] 86.50
Yang et al. [66] 85.52
Klaser et al. [67] 81.43
Vieira et al. [68] 78.20
Dollar [34] 72.40
Laptev [49] 69.57
HCRNN 90.07

(B) Accuracy with different numbers of 3D-RNNs
Number of 3D-RNNs Accuracy (%)
1 40.44
2 57.72
4 63.24
8 73.90
16 83.09
32 90.07
64 80.88
128 68.01

8.3 Summary

This chapter studies the problem of action recognition using two different types of data: multiview data and RGB-D data. In the first scenario, action data are captured by multiple cameras, so the appearance of the human subject differs significantly from one camera view to another, making action recognition more challenging. To address this problem, we have presented feature learning approaches that learn view-invariant features. These approaches utilize both shared and private features to accurately characterize human actions under large viewpoint and appearance variations. The sample-affinity matrix is introduced to compute sample similarities across views; it is embedded in the learning of shared features to accurately weigh the contribution of each sample and to balance information transfer. Extensive experiments on the IXMAS and DSA datasets show that our approaches outperform state-of-the-art methods in cross-view action classification.

Actions can also be captured by RGB-D sensors such as the Kinect, since they are cost-effective. Action data captured by a Kinect sensor have multiple channels, including RGB, depth, and skeleton. However, it is challenging to use all of them for recognition because they lie in different feature spaces. To address this problem, the hybrid convolutional-recursive neural network (HCRNN) is proposed for action recognition from RGB-D cameras. The architecture consists of a 3D-CNN layer and a 3D-RNN layer. The 3D-CNN layer learns low-level, translation-invariant features, which are then given as input to the 3D-RNN. The 3D-RNN combines convolution and pooling into an efficient, hierarchical operation and learns high-order compositional features. Results on two datasets show that the proposed method achieves state-of-the-art performance.

References

[1] K. Altun, B. Barshan, O. Tunçel, Comparative study on classifying human activities with miniature inertial and magnetic sensors, Pattern Recognition 2010;43(10):3605–3620.

[2] J. Grabocka, A. Nanopoulos, L. Schmidt-Thieme, Classification of sparse time series via supervised matrix factorization, AAAI. 2012.

[3] Y. Kong, Y. Fu, Bilinear heterogeneous information machine for RGB-D action recognition, IEEE conference on computer vision and pattern recognition. 2015.

[4] Y. Kong, Y. Fu, Max-margin action prediction machine, IEEE Transactions on Pattern Analysis and Machine Intelligence 2016;38(9):1844–1858.

[5] I. Junejo, E. Dexter, I. Laptev, P. Perez, Cross-view action recognition from temporal self-similarities, ECCV. 2008.

[6] Y. Yan, E. Ricci, R. Subramanian, G. Liu, N. Sebe, Multitask linear discriminant analysis for view invariant action recognition, IEEE Transactions on Image Processing 2014;23(12):5599–5611.

[7] J. Zheng, Z. Jiang, P.J. Philips, R. Chellappa, Cross-view action recognition via a transferable dictionary pair, BMVC. 2012.

[8] J. Zheng, Z. Jiang, Learning view-invariant sparse representation for cross-view action recognition, ICCV. 2013.

[9] W. Yang, Y. Gao, Y. Shi, L. Cao, MRM-Lasso: a sparse multiview feature selection method via low-rank analysis, IEEE Transactions on Neural Networks and Learning Systems 2015;26(11):2801–2815.

[10] M. Chen, Z. Xu, K.Q. Weinberger, F. Sha, Marginalized denoising autoencoders for domain adaptation, ICML. 2012.

[11] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, CVPR. 2014.

[12] A.P. Singh, G.J. Gordon, Relational learning via collective matrix factorization, KDD. 2008.

[13] J. Liu, C. Wang, J. Gao, J. Han, Multi-view clustering via joint nonnegative matrix factorization, SDM. 2013.

[14] L. Liu, L. Shao, Learning discriminative representations from rgb-d video data, IJCAI. 2013.

[15] A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, IJCV 2008;73(3):243–272.

[16] Z. Ding, Y. Fu, Low-rank common subspace for multi-view learning, IEEE international conference on data mining. IEEE; 2014:110–119.

[17] A. Kumar, H. Daume, A co-training approach for multi-view spectral clustering, ICML. 2011.

[18] W. Zhang, K. Zhang, P. Gu, X. Xue, Multi-view embedding learning for incompletely labeled data, IJCAI. 2013.

[19] K. Wang, R. He, W. Wang, L. Wang, T. Tan, Learning coupled feature spaces for cross-modal matching, ICCV. 2013.

[20] C. Xu, D. Tao, C. Xu, Multi-view learning with incomplete views, IEEE Transactions on Image Processing 2015;24(12).

[21] A. Sharma, A. Kumar, H. Daume, D.W. Jacobs, Generalized multiview analysis: a discriminative latent space, CVPR. 2012.

[22] D. Weinland, M. Özuysal, P. Fua, Making action recognition robust to occlusions and viewpoint changes, ECCV. 2010.

[23] I.N. Junejo, E. Dexter, I. Laptev, P. Pérez, View-independent action recognition from temporal self-similarities, IEEE Transactions on Pattern Analysis and Machine Intelligence 2011;33(1):172–185.

[24] H. Rahmani, A. Mian, Learning a non-linear knowledge transfer model for cross-view action recognition, CVPR. 2015.

[25] J. Liu, M. Shah, B. Kuipers, S. Savarese, Cross-view action recognition via view knowledge transfer, CVPR. 2011.

[26] B. Li, O.I. Campus, M. Sznaier, Cross-view activity recognition using Hankelets, CVPR. 2012.

[27] R. Li, T. Zickler, Discriminative virtual views for cross-view action recognition, CVPR. 2012.

[28] Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, C. Shi, Cross-view action recognition via continuous virtual path, CVPR. 2013.

[29] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 2006;313(5786):504–507.

[30] J. Li, T. Zhang, W. Luo, J. Yang, X. Yuan, J. Zhang, Sparseness analysis in the pretraining of deep neural networks, IEEE Transactions on Neural Networks and Learning Systems 2016 10.1109/TNNLS.2016.2541681.

[31] M. Chen, K. Weinberger, F. Sha, Y. Bengio, Marginalized denoising auto-encoders for nonlinear representations, ICML. 2014.

[32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion, JMLR 2010;11:3371–3408.

[33] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding 2006;104(2–3).

[34] P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, VS-PETS. 2005.

[35] H. Wang, A. Kläser, C. Schmid, C.L. Liu, Dense trajectories and motion boundary descriptors for action recognition, IJCV 2013;103(1):60–79.

[36] A. Farhadi, M.K. Tabrizi, I. Endres, D.A. Forsyth, A latent model of discriminative aspect, ICCV. 2009.

[37] A. Gupta, J. Martinez, J.J. Little, R.J. Woodham, 3d pose from motion for cross-view action recognition via non-linear circulant temporal encoding, CVPR. 2014.

[38] J. Liu, M. Shah, Learning human actions via information maximization, CVPR. 2008.

[39] Y. Kong, Y. Fu, Discriminative relational representation learning for rgb-d action recognition, IEEE Transactions on Image Processing 2016;25(6).

[40] J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, CVPR. 2012.

[41] Y. Wang, G. Mori, A discriminative latent model of object classes and attributes, ECCV. 2010.

[42] L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, CVPR. 2013.

[43] S. Hadfield, R. Bowden, Hollywood 3d: recognizing actions in 3d natural scenes, CVPR. 2013.

[44] O. Oreifej, Z. Liu, HON4D: histogram of oriented 4D normals for activity recognition from depth sequences, CVPR. 2013:716–723.

[45] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3d points, CVPR workshop. 2010.

[46] H. Rahmani, A. Mahmood, A. Mian, D. Huynh, Real time action recognition using histograms of depth gradients and random decision forests, WACV. 2013.

[47] Y. Kong, Y. Fu, Y. Jia, Learning human interaction by interactive phrases, ECCV. 2012.

[48] T. Lan, Y. Wang, W. Yang, S.N. Robinovitch, G. Mori, Discriminative latent models for recognizing contextual group activities, PAMI 2012;34(8):1549–1562.

[49] I. Laptev, On space–time interest points, IJCV 2005;64(2):107–123.

[50] H.S. Koppula, A. Saxena, Learning spatio-temporal structure from RGB-D videos for human activity detection and anticipation, ICML. 2013.

[51] J. Luo, W. Wang, H. Qi, Group sparsity and geometry constrained dictionary learning for action recognition from depth maps, ICCV. 2013.

[52] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-stage architecture for object recognition? ICCV. 2009.

[53] V. Nair, G.E. Hinton, 3d object recognition with deep belief nets, NIPS. 2009.

[54] R. Socher, B. Huval, B. Bhat, C.D. Manning, A.Y. Ng, Convolutional-recursive deep learning for 3d object classification, NIPS. 2012.

[55] W. Ouyang, Modeling mutual visibility relationship in pedestrian detection, CVPR. 2013.

[56] C. Szegedy, A. Toshev, D. Erhan, Deep neural networks for object detection, NIPS. 2013.

[57] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, PAMI 2013.

[58] R. Socher, C.C.Y. Lim, A.Y. Ng, C.D. Manning, Parsing natural scenes and natural language with recursive neural networks, ICML. 2011.

[59] S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 2013;35(1):221–231.

[60] Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, CVPR. 2011.

[61] G.W. Taylor, R. Fergus, Y. LeCun, C. Bregler, Convolutional learning of spatio-temporal features, ECCV. 2010.

[62] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE. 1998.

[63] A. Coates, H. Lee, A.Y. Ng, An analysis of single-layer networks in unsupervised feature learning, AISTATS. 2011.

[64] M. Ranzato, F.J. Huang, Y.L. Boureau, Y. LeCun, Unsupervised learning of invariant feature hierarchies with applications to object recognition, CVPR. 2007.

[65] J. Wang, Z. Liu, J. Chorowski, Z. Chen, Y. Wu, Robust 3d action recognition with random occupancy patterns, ECCV. 2012.

[66] X. Yang, C. Zhang, Y. Tian, Recognizing actions using depth motion maps-based histograms of oriented gradients, ACM multimedia. 2012 978-1-4503-1089-5 10.1145/2393347.2396382.

[67] A. Klaser, M. Marszalek, C. Schmid, A spatio-temporal descriptor based on 3d-gradients, BMVC. 2008.

[68] A.W. Vieira, E.R. Nascimento, G.L. Oliveira, Z. Liu, M.F.M. Campos, STOP: space–time occupancy patterns for 3D action recognition from depth map sequences, 17th Iberoamerican congress on pattern recognition (CIARP). 2012.


1  “©2017 IEEE. Reprinted, with permission, from Yu Kong, Zhengming Ding, Jun Li, and Yun Fu. “Deeply learned view-invariant features for cross-view action recognition.” IEEE Transactions on Image Processing (2017).”
