Chapter 8

Action Recognition

Yu Kong and Yun Fu
B. Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, United States
Department of Electrical and Computer Engineering and College of Computer and Information Science (Affiliated), Northeastern University, Boston, MA, United States

Abstract

Appearance and pose of human actions vary significantly across views, making multiview action recognition a very challenging task. To address this problem, it is crucial to learn view-invariant features from multiview action data that are robust to view variations. In this chapter, we first describe an approach that uses deep learning networks to learn view-specific and view-shared features: the former capture unique dynamics of each view, and the latter encode common patterns across views. We describe a sample-affinity matrix (SAM) that accurately balances information transfer among the multiple views of each sample while limiting information transfer across different samples. SAM enables us to learn more informative shared features that are robust to view variations. Furthermore, we encourage incoherence between the two types of features in order to reduce information redundancy and increase their discriminability. The discriminative power of the learned features is further enhanced by encouraging features of the same category to be geometrically closer. Robust view-invariant features are finally learned by stacking several layers of features. This approach is evaluated on two multiview datasets, and shows superior performance over state-of-the-art approaches.

This chapter further introduces a hybrid convolutional-recursive neural network (HCRNN) to learn compositional features for action recognition from depth cameras. HCRNN captures rich motion, structure, and context information, covering both local adjacent body parts and long-range body-part relations. HCRNN extracts robust features by stacking two layers, a 3D convolutional neural network (3D-CNN) and a 3D recursive neural network (3D-RNN). We use 3D-CNN to capture motion and structure information in local neighborhoods, and 3D-RNN to compose high-order features for action recognition. Multiple 3D-RNNs are employed to improve the discriminability and robustness of the learned features. We organize the two components as a deep model and train HCRNN in an unsupervised fashion without network fine-tuning. Results on two RGB-D action datasets show that our method achieves state-of-the-art performance.

Keywords

Deep learning; Action recognition; Multiview learning; RGB-D action

8.1 Deeply Learned View-Invariant Features for Cross-View Action Recognition

8.1.1 Introduction

Human action data are ubiquitous and are of interest to the machine learning [1,2] and computer vision communities [3,4]. In general, action data can be captured from multiple views, such as multiple sensor views and various camera views; see Fig. 8.1. Classifying such action data in the cross-view scenario is challenging since the raw data are captured by various sensor devices at different physical locations, and may look completely different. For example, in Fig. 8.1B, an action observed from the side view and the same action observed from the top view are visually different. Thus, features extracted from one view are less discriminative for classifying actions in another view.

Figure 8.1 Examples of multiview scenarios: (A) multisensor-view, where multiple sensors (orange rectangles) are attached to torso, arms and legs, and human action data are recorded by these sensors; (B) multicamera-view, where human actions are recorded by multiple cameras at various viewpoints. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this chapter.)

A line of work has studied building view-invariant representations for action recognition [5–9], where an action video is converted to a time series of frames. The approaches in [5,6] take advantage of a so-called self-similarity matrix (SSM) descriptor to summarize actions in multiple views and have demonstrated robustness in cross-view scenarios. Information shared between views is learned and transferred to each of the views in [7–9]. These methods assume that samples in different views contribute equally to the shared features. However, this assumption may not hold, as the cues in one view might be dramatically different from those in other views (e.g., the top view in Fig. 8.1B) and should contribute less to the shared features than other views. Furthermore, these methods do not constrain information sharing between action categories. This may produce similar features for videos that belong to different classes but are captured from the same view, which would confuse classifiers.

We describe a deep network that classifies cross-view actions using learned view-invariant features. A sample-affinity matrix (SAM) is introduced to measure the similarities between video samples in different camera views. This allows us to accurately balance information transfer between views and facilitates learning more informative shared features for cross-view action classification. The SAM structure also controls information transfer among samples in different classes, which enables us to extract distinctive features for each class. Besides the shared features, private features are also learned to capture motion information that exists exclusively in each view and cannot be modeled by the shared features. We learn discriminative view-invariant information from the shared and private features separately by encouraging incoherence between them. Label information and the stacking of multiple feature layers are used to further boost the performance of the network. This feature learning problem is formulated in a marginalized autoencoder framework (see Fig. 8.2) [10], developed particularly for learning view-invariant features. Specifically, cross-view shared features are summarized by one autoencoder, and the private features of each view are learned using a group of autoencoders. We obtain incoherence between the two types of features by encouraging orthogonality between the mapping matrices of the two categories of autoencoders. A Laplacian graph is built to encourage samples of the same action category to have similar shared and private features. We stack multiple layers of features and learn them in a layer-wise fashion. We evaluate our approach on two multiview datasets, and show that it significantly outperforms state-of-the-art approaches.

Figure 8.2 Overview of the method described for cross-view action recognition.

8.1.2 Related Work

The aim of multiview learning methods is to find mutual agreement between two distinct views of data. Researchers have made several attempts to learn more expressive and discriminative features from low-level observations [11–13,2,14–16]. The cotraining approach [17] finds consistent relationships between pairs of data points across different views by training a separate learner for each view. Canonical correlation analysis (CCA) was used in [18] to learn a common space between multiple views. Wang et al. [19] proposed a method that learns two projection matrices to map multimodal data onto a common feature space, in which cross-modal data matching can be performed. The incomplete-view problem was discussed in [20], where the authors assumed that the different views are generated from a shared subspace. A generalized multiview analysis (GMA) method was introduced in [21] and shown to be a supervised extension of CCA. Liu et al. [13] took advantage of matrix factorization in multiview clustering. Their method pushes the clustering structures learned from multiple views toward a common consensus. A collective matrix factorization (CMF) method was explored in [12], which captures correlations among relational feature matrices. Ding et al. [16] proposed a low-rank constrained matrix factorization model, which works well in the multiview learning scenario even when the view information of the test data is unknown.

View-invariant action recognition methods are designed to predict action labels given multiview samples. As the viewpoint changes, large within-class pose and appearance variations appear. Previous studies have focused on designing view-invariant features that are robust to viewpoint variations. The method in [22] applies local partitioning and hierarchical classification to 3D Histogram of Oriented Gradients (HOG) descriptors computed over sequences of images. In SSM-based approaches [5,23], a frame-wise similarity matrix is computed for each video, and view-invariant descriptors are extracted within log-polar blocks on the matrix. Sharing knowledge among views was investigated in [24,25,8,7,9,26–28]. Specifically, the MRM-Lasso method in [9] captures latent correlations across different views by learning a low-rank matrix consisting of pattern-specific weights. Transferable dictionary pairs were created in [8,7], which encourage a shared sparse feature space. A bipartite graph was exploited in [25] to combine two view-dependent vocabularies into visual-word clusters called bilingual words, bridging the semantic gap across view-dependent vocabularies.

8.1.3 Deeply Learned View-Invariant Features

The goal of this work is to extract view-invariant features that allow us to train a classification model on one (or multiple) view(s) and test it on another view.

8.1.3.1 Sample-Affinity Matrix (SAM)

We introduce SAM to measure the similarity among pairs of video samples in multiple views. Suppose that we are given training videos of V views, $\{X^v, y^v\}_{v=1}^V$. The data of the $v$th view $X^v$ consist of N action videos, $X^v = [x_1^v, \dots, x_N^v] \in \mathbb{R}^{d \times N}$, with corresponding labels $y^v = [y_1^v, \dots, y_N^v]$. SAM $Z \in \mathbb{R}^{VN \times VN}$ is a block-diagonal matrix

$$Z = \mathrm{diag}(Z_1, \dots, Z_N), \qquad Z_i = \begin{pmatrix} 0 & z_i^{12} & \cdots & z_i^{1V} \\ z_i^{21} & 0 & \cdots & z_i^{2V} \\ \vdots & \vdots & \ddots & \vdots \\ z_i^{V1} & z_i^{V2} & \cdots & 0 \end{pmatrix},$$

where $\mathrm{diag}(\cdot)$ creates a block-diagonal matrix, and $z_i^{uv}$ measures the similarity between views $u$ and $v$ of the $i$th sample, computed by $z_i^{uv} = \exp(-\|x_i^v - x_i^u\|^2 / (2c))$ with bandwidth parameter $c$.

Essentially, SAM Z captures within-class between-view information and between-class within-view information. Block $Z_i$ in Z characterizes appearance variations in different views within one class, which explains how an action varies as the view changes. Such information makes it possible to transfer information among views and build robust cross-view features. Additionally, since the off-diagonal blocks of SAM Z are zeros, information sharing among classes in the same view is restricted. Consequently, features from different classes but the same view are encouraged to be distinct. This enables us to differentiate action categories that appear similar in some views.
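As a concrete illustration, the following NumPy sketch builds the block-diagonal SAM Z from per-view feature matrices. The sample-major block ordering and the helper name are our assumptions, not fixed by the text.

```python
import numpy as np

def sample_affinity_matrix(X_views, c=1.0):
    """Build the block-diagonal SAM Z from a list of V view matrices.
    X_views: list of V arrays, each d x N (the same sample order in every view).
    Returns Z of shape (V*N, V*N) with one V x V block per sample."""
    V, N = len(X_views), X_views[0].shape[1]
    Z = np.zeros((V * N, V * N))
    for i in range(N):
        Zi = np.zeros((V, V))
        for u in range(V):
            for v in range(V):
                if u != v:
                    diff = X_views[v][:, i] - X_views[u][:, i]
                    Zi[u, v] = np.exp(-np.dot(diff, diff) / (2.0 * c))  # z_i^{uv}
        Z[i * V:(i + 1) * V, i * V:(i + 1) * V] = Zi   # diagonal block for sample i
    return Z
```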

8.1.3.2 Preliminaries on Autoencoders

Our approach is based on a popular deep learning model, the autoencoder (AE) [29,10,30]. An AE maps the raw input x to hidden units h using an "encoder" $f_1(\cdot)$, $h = f_1(x)$, and then maps the hidden units to outputs using a "decoder" $f_2(\cdot)$, $o = f_2(h)$. The objective of learning an AE is to encourage similar or identical input–output pairs, i.e., the reconstruction loss is minimized after decoding, $\min \sum_{i=1}^{N} \|x_i - f_2(f_1(x_i))\|^2$. Here, N is the number of training samples. In this way, the neurons in the hidden layer are good representations of the inputs, as the reconstruction process captures the intrinsic structure of the input data.

As opposed to the two-level encoding and decoding in an AE, the marginalized stacked denoising autoencoder (mSDA) [10] reconstructs corrupted inputs with a single mapping W, $\min \sum_{i=1}^{N} \|x_i - W\tilde{x}_i\|^2$, where $\tilde{x}_i$ is the corrupted version of $x_i$ obtained by setting each feature to 0 with probability p. mSDA performs m passes over the training set, each time with different corruptions; this essentially performs a dropout regularization on the mSDA [31]. By letting $m \to \infty$, mSDA effectively computes the transformation matrix W that is robust to noise using infinitely many copies of noisy data. mSDA is stackable and can be calculated in closed form.
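For concreteness, a single mSDA-style layer can be computed in closed form roughly as below (a NumPy sketch; it omits the constant bias feature of the original mSDA formulation, and the small ridge term is our addition for numerical stability only).

```python
import numpy as np

def msda_layer(X, p=0.5):
    """Closed-form marginalized denoising autoencoder (single layer, no bias feature).
    X: d x N data matrix; p: feature corruption (dropout) probability.
    Returns W that reconstructs X from infinitely many corrupted copies."""
    d, _ = X.shape
    S = X @ X.T                       # scatter matrix
    q = np.full(d, 1.0 - p)           # survival probability of each feature
    # E[X_tilde X_tilde^T]: off-diagonal entries scaled by q_i q_j, diagonal by q_i
    EQ = S * np.outer(q, q)
    np.fill_diagonal(EQ, q * np.diag(S))
    # E[X X_tilde^T]: columns (corrupted side) scaled by q_j
    EP = S * q[np.newaxis, :]
    return EP @ np.linalg.inv(EQ + 1e-5 * np.eye(d))   # ridge added for stability
```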

8.1.3.3 Single-Layer Feature Learning

The single-layer feature learner described in this subsection builds on mSDA. We attempt to learn both discriminative shared features between multiple views and private features particularly owned by one view for cross-view action classification. Considering large motion variations in different views, we incorporate SAM Z in learning shared features to balance information transfer between views so as to build more robust features.

We use the following objective function to learn shared features and private features:

$$\min_{W,\{G^v\}} \mathcal{Q}, \qquad \mathcal{Q} = \|W\tilde{X} - XZ\|_F^2 + \sum_v \Big[\alpha\|G^v\tilde{X}^v - X^v\|_F^2 + \beta\|W^T G^v\|_F^2 + \gamma\,\mathrm{Tr}\big(P^v X^v L X^{vT} P^{vT}\big)\Big], \tag{8.1}$$

where W is the mapping matrix for learning shared features, $\{G^v\}_{v=1}^V$ is a group of mapping matrices for learning the private features of each view, and $P^v = (W; G^v)$. The objective function contains four terms: $\psi = \|W\tilde{X} - XZ\|_F^2$ learns the shared features between views, essentially reconstructing the action data of one view from the data of all views; $\phi^v = \|G^v\tilde{X}^v - X^v\|_F^2$ learns view-specific private features that are complementary to the shared features; $r_1^v = \|W^T G^v\|_F^2$ and $r_2^v = \mathrm{Tr}(P^v X^v L X^{vT} P^{vT})$ are model regularizers. Here, $r_1^v$ reduces the redundancy between the two mapping matrices, and $r_2^v$ encourages the shared and private features of the same class and the same view to be similar, while $\alpha, \beta, \gamma$ are parameters balancing the importance of these components. Further details about these terms are discussed in the following.

Note that in cross-view action recognition, data from all the views are available in training to learn shared and private features. Data from some views are unavailable only in testing.

Shared Features. Humans can perceive an action from one view and envision what the action would look like from other views. This is possibly because we have seen similar actions from multiple views before. It inspires us to reconstruct the action data of one view (the target view) using the action data from all the views (the source views). In this way, information shared between views can be summarized and transferred to the target view.

We define the discrepancy between the data of the vth target view and the data of all the V source views as

$$\psi = \sum_{i=1}^{N}\sum_{v=1}^{V} \Big\|W\tilde{x}_i^v - \sum_{u} x_i^u z_i^{uv}\Big\|^2 = \|W\tilde{X} - XZ\|_F^2, \tag{8.2}$$

where $z_i^{uv}$ is a weight measuring the contribution of the $u$th-view data in the reconstruction of the sample $x_i^v$ of the $v$th view, $W \in \mathbb{R}^{d \times d}$ is a single linear mapping applied to the corrupted inputs $\tilde{x}_i^v$ of all the views, and $Z \in \mathbb{R}^{VN \times VN}$ is the sample-affinity matrix encoding all the weights $\{z_i^{uv}\}$. Matrices $X, \tilde{X} \in \mathbb{R}^{d \times VN}$ denote the input training matrix and the corresponding corrupted version of X, respectively [10]. The corruption essentially performs a dropout regularization on the model [31].

The SAM Z here allows us to precisely balance information transfer among views and helps learn more discriminative shared features. Instead of using equal weights [7,8], we reconstruct the ith training sample of the vth view from the samples of all V views with different contributions. As shown in Fig. 8.3, a sample from the side view (source 1) is more similar to another side-view sample (the target view) than to a top-view sample (source 2). Thus, more weight should be given to source 1 in order to learn more descriptive shared features for the target view. Note that SAM Z limits information sharing across samples (its off-diagonal blocks are zeros), since sharing across samples does not capture view-invariant information for cross-view action recognition.

Figure 8.3 Learning shared features using weighted samples.

Private Features. Besides the information shared across views, there is still some remaining discriminative information that exists exclusively in each view. In order to utilize such information and make it robust to viewpoint variations, we adopt the robust feature learning in [10], and learn view-specific private features for the samples of the vth view using a mapping matrix $G^v \in \mathbb{R}^{d \times d}$,

$$\phi^v = \sum_{i=1}^{N} \|G^v\tilde{x}_i^v - x_i^v\|^2 = \|G^v\tilde{X}^v - X^v\|_F^2. \tag{8.3}$$

Here, $\tilde{X}^v$ is the corrupted version of the feature matrix $X^v$ of the vth view. We learn V mapping matrices $\{G^v\}_{v=1}^V$ given the corresponding inputs of the different views.

It should be noted that using Eq. (8.3) may also capture some redundant shared information from the vth view. In this work, we reduce such redundancy by encouraging incoherence between the view-shared mapping matrix W and the view-specific mapping matrix $G^v$,

$$r_1^v = \|W^T G^v\|_F^2. \tag{8.4}$$

The incoherence between W and $\{G^v\}$ enables our approach to independently exploit the discriminative information contained in the view-specific features and the view-shared features.

Label Information. Large motion and posture variations may appear in action data captured from various views. Therefore, the shared and private features extracted using Eqs. (8.2) and (8.3) may not be discriminative enough for classifying actions with large variations. To address this issue, we enforce the shared and private features of the same class and the same view to be similar. A within-class within-view variance is defined to regularize the learning of the view-shared mapping matrix W and the view-specific mapping matrix $G^v$ as

$$r_2^v = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a(i,j)\big[\|Wx_i^v - Wx_j^v\|^2 + \|G^v x_i^v - G^v x_j^v\|^2\big] = \mathrm{Tr}\big(WX^vLX^{vT}W^T\big) + \mathrm{Tr}\big(G^vX^vLX^{vT}G^{vT}\big) = \mathrm{Tr}\big(P^vX^vLX^{vT}P^{vT}\big). \tag{8.5}$$

Here, $L \in \mathbb{R}^{N \times N}$ is the label-view Laplacian matrix, $L = D - A$, where D is the diagonal degree matrix with $D(i,i) = \sum_{j=1}^{N} a(i,j)$, and A is the adjacency matrix that represents the label relationships of the training videos. The $(i,j)$th element $a(i,j)$ of A is 1 if $y_i = y_j$ and 0 otherwise.

Note that we do not explicitly require features from different views of the same class to be similar, since this idea is already implicitly used in Eq. (8.2). In learning the shared features, the features of the same sample from multiple views are mapped to a new space by the mapping matrix W, so the projected features of one sample can be well represented by the features from all the views of that sample. Therefore, the discrepancy among views is minimized, and the within-class cross-view variance is not needed in Eq. (8.5).
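For reference, the label Laplacian $L = D - A$ used in Eq. (8.5) can be built as follows (a NumPy sketch assuming integer class labels):

```python
import numpy as np

def label_laplacian(labels):
    """Label Laplacian L = D - A, where A(i, j) = 1 iff samples i and j
    share the same action label (including a(i, i) = 1)."""
    y = np.asarray(labels)
    A = (y[:, None] == y[None, :]).astype(float)   # adjacency from label equality
    D = np.diag(A.sum(axis=1))                     # diagonal degree matrix
    return D - A
```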

Discussion. Using the label information in Eq. (8.5) yields a supervised approach. We can also obtain an unsupervised variant by setting $\gamma = 0$. We refer to the unsupervised approach as Ours-1 and the supervised approach as Ours-2 in the following discussions.

8.1.3.4 Learning

We develop a coordinate descent algorithm to solve the optimization problem in Eq. (8.1) and optimize the parameters W and $\{G^v\}_{v=1}^V$. More specifically, in each step one parameter matrix is updated by fixing the others, computing the derivative of $\mathcal{Q}$ w.r.t. that parameter, and setting it to 0.

Update W. Parameters $\{G^v\}_{v=1}^V$ are fixed when updating W, which is obtained by setting the derivative $\partial\mathcal{Q}/\partial W = 0$, yielding

$$W = \Big[\sum_v\big(\beta G^vG^{vT} + \gamma X^vLX^{vT} + I\big)\Big]^{-1}\big(XZ\tilde{X}^T\big)\big[\tilde{X}\tilde{X}^T + I\big]^{-1}. \tag{8.6}$$

It should be noted that $XZ\tilde{X}^T$ and $\tilde{X}\tilde{X}^T$ are computed by repeating the corruption $m \to \infty$ times. By the weak law of large numbers [10], $XZ\tilde{X}^T$ and $\tilde{X}\tilde{X}^T$ can be computed by their expectations $\mathbb{E}_p[XZ\tilde{X}^T]$ and $\mathbb{E}_p[\tilde{X}\tilde{X}^T]$ with corruption probability p, respectively.

Update $G^v$. Fixing W and $\{G^u\}_{u=1, u \neq v}^{V}$, parameter $G^v$ is updated by setting the derivative $\partial\mathcal{Q}/\partial G^v = 0$, giving

$$G^v = \big(\beta WW^T + \gamma X^vLX^{vT} + I\big)^{-1}\big(\alpha X^v\tilde{X}^{vT}\big)\big[\alpha\tilde{X}^v\tilde{X}^{vT} + I\big]^{-1}. \tag{8.7}$$

Similar to the procedure of updating W, $X^v\tilde{X}^{vT}$ and $\tilde{X}^v\tilde{X}^{vT}$ are computed by their expectations with corruption probability p.

Convergence. Our learning algorithm iteratively updates W and $\{G^v\}_{v=1}^V$. The problem in Eq. (8.1) can be divided into $V+1$ subproblems, each of which is convex with respect to one variable. Therefore, by solving the subproblems alternately, the learning algorithm is guaranteed to find an optimal solution to each subproblem, and finally converges to a local solution.
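A minimal sketch of the two closed-form updates in Eqs. (8.6) and (8.7) is given below (NumPy; `Xt` and `Xvt` stand for the expected corrupted matrices $\tilde{X}$ and $\tilde{X}^v$, and the helper names are ours).

```python
import numpy as np

def update_W(X, Xt, Z, Xv_list, G_list, L, beta, gamma):
    """One coordinate-descent step for W (Eq. (8.6)), with {G^v} held fixed.
    X: d x VN training matrix, Xt: its (expected) corrupted version,
    Xv_list / G_list: per-view data matrices and private mappings, L: label Laplacian."""
    d = X.shape[0]
    A = sum(beta * G @ G.T + gamma * Xv @ L @ Xv.T + np.eye(d)
            for G, Xv in zip(G_list, Xv_list))
    B = X @ Z @ Xt.T
    C = Xt @ Xt.T + np.eye(d)
    return np.linalg.inv(A) @ B @ np.linalg.inv(C)

def update_Gv(W, Xv, Xvt, L, alpha, beta, gamma):
    """One coordinate-descent step for a single G^v (Eq. (8.7)), with W held fixed."""
    d = Xv.shape[0]
    A = beta * W @ W.T + gamma * Xv @ L @ Xv.T + np.eye(d)
    B = alpha * Xv @ Xvt.T
    C = alpha * Xvt @ Xvt.T + np.eye(d)
    return np.linalg.inv(A) @ B @ np.linalg.inv(C)
```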

8.1.3.5 Deep Architecture

Inspired by the deep architecture in [10,32], we also design a deep model by stacking multiple layers of the feature learner described in Sect. 8.1.3.3. A nonlinear feature mapping is performed layer by layer. More specifically, a nonlinear squashing function $\sigma(\cdot)$ is applied to the output of one layer, $H^w = \sigma(WX)$ and $H^{g_v} = \sigma(G^v X^v)$, resulting in a series of hidden feature matrices.

A layer-wise training scheme is used in this work to train the networks $\{W_k\}_{k=1}^K$, $\{G_k^v\}_{k=1,v=1}^{K,V}$ with K layers. Specifically, the outputs of the kth layer, $H_k^w$ and $H_k^{g_v}$, are used as the input to the $(k+1)$th layer. The mapping matrices $W_{k+1}$ and $\{G_{k+1}^v\}_{v=1}^V$ are then trained on these inputs. For the first layer, the inputs $H_0^w$ and $H_0^{g_v}$ are the raw features X and $X^v$, respectively. More details are shown in Algorithm 8.1.

Algorithm 8.1 Learning deep sequential context networks.
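The layer-wise stacking can be sketched roughly as follows (Python; the `fit_layer` routine, which is assumed to solve Eq. (8.1) for one layer, and the tanh squashing function are our stand-ins, and the actual Algorithm 8.1 may differ in details).

```python
import numpy as np

def stack_layers(X, Xv_list, K, fit_layer, squash=np.tanh):
    """Layer-wise training sketch: each layer learns (W, {G^v}) on the previous
    layer's squashed outputs. `fit_layer` is a hypothetical solver for Eq. (8.1)
    that returns (W, G_list) given shared and per-view inputs."""
    Hw, Hg = X, list(Xv_list)
    params = []
    for _ in range(K):
        W, G_list = fit_layer(Hw, Hg)
        params.append((W, G_list))
        Hw = squash(W @ Hw)                                    # shared features -> next layer
        Hg = [squash(G @ H) for G, H in zip(G_list, Hg)]       # private features -> next layer
    return params, Hw, Hg
```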

8.1.4 Experiments

We evaluate the Ours-1 and Ours-2 approaches on two multiview datasets: the multiview IXMAS dataset [33] and the Daily and Sports Activities (DSA) dataset [1], both of which have been widely used in [1,24,25,8,7,9].

We consider two cross-view classification scenarios in this work, many-to-one and one-to-one. The former trains on $V-1$ views and tests on the remaining view, while the latter trains on one view and tests on the other views. For the vth view that is used for testing, we simply set the corresponding $X^v$ to 0 in Eq. (8.1) during training. The intersection kernel support vector machine (IKSVM) with parameter $C = 1$ is adopted as the classifier. Unless specified otherwise, the default parameters are $\alpha = 1, \beta = 1, \gamma = 0, K = 1, p = 0$ for the Ours-1 approach, and $\alpha = 1, \beta = 1, \gamma = 1, K = 1, p = 0$ for the Ours-2 approach. The default number of layers is set to 1 for efficiency.

IXMAS is a multicamera-view video dataset, where each view corresponds to a camera view (see Fig. 8.4B). The IXMAS dataset consists of 12 actions performed by 10 actors. An action was recorded by 4 side view cameras and 1 top view camera. Each actor repeated one action 3 times.

Figure 8.4 Examples of multi-view problem settings: (A) multiple sensor views in the Daily and Sports Activities (DSA) dataset, and (B) multiple camera views in the IXMAS.

We adopt the bag-of-words model in [34]. An action video is described by a set of detected local spatiotemporal trajectory-based and global frame-based descriptors [35]. A k-means clustering method is employed to quantize these descriptors and build so-called video words. Consequently, a video can be represented by a histogram of the video words detected in the video, which is essentially a feature vector. An action captured by V camera views is represented by V feature vectors, each of which is a feature representation for one camera view.
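A minimal sketch of this bag-of-words encoding is given below (scikit-learn k-means; the per-video descriptor arrays, the vocabulary size `n_words`, and the histogram normalization are illustrative assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_encode(train_descriptors, video_descriptors, n_words=2000):
    """Quantize local descriptors with k-means ('video words') and represent
    each video as a normalized histogram of word occurrences.
    train_descriptors: (num_descriptors, dim) array used to build the vocabulary.
    video_descriptors: list of per-video (n_i, dim) descriptor arrays."""
    kmeans = KMeans(n_clusters=n_words, n_init=4).fit(train_descriptors)
    hists = []
    for desc in video_descriptors:
        words = kmeans.predict(desc)
        h = np.bincount(words, minlength=n_words).astype(float)
        hists.append(h / max(h.sum(), 1.0))       # histogram of video words
    return np.vstack(hists)
```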

DSA is a multisensor-view dataset comprising 19 daily and sports activities (e.g., sitting, playing basketball, and running on a treadmill at a speed of 8 km/h), each performed by 8 subjects in their own style for 5 minutes. Five Xsens MTx sensor units are placed on the torso, arms, and legs (Fig. 8.4A), resulting in a 5-view data representation. The sensor units are calibrated to acquire data at a 25 Hz sampling frequency. The 5-min signals are divided into 5-second segments, so that 480 (= 60 segments × 8 subjects) signal segments are obtained for each activity. One 5-second segment is used as an action time series in this work.

We follow [1] to preprocess the raw action data in a 5-s window, and represent the data as a 234-dimensional feature vector. Specifically, the raw action data are represented as a $125 \times 9$ matrix, where 125 is the number of sampling points (125 = 25 Hz × 5 s), and 9 is the number of values (the x, y, z acceleration, the x, y, z rate of turn, and the x, y, z Earth's magnetic field) obtained from one sensor. We first compute the minimum and maximum values, the mean, skewness, and kurtosis of the data matrix, and concatenate the results into a 45-dimensional (5 features × 9 axes) feature vector. Then we compute the discrete Fourier transform of the raw data matrix and select the 5 largest Fourier peaks, yielding another 45-dimensional (5 peaks × 9 axes) feature vector. The 45 frequency values corresponding to these Fourier peaks are also extracted, resulting in a further 45-dimensional (5 frequencies × 9 axes) vector. Afterwards, 11 autocorrelation samples are computed for each of the 9 axes, resulting in 99-dimensional (11 samples × 9 axes) features. The three types of features are concatenated into a 234-dimensional feature vector, representing the human motion captured by one sensor in a 5-second window. A human action captured by V sensors is represented by V feature vectors, each of which corresponds to a sensor view.
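The 234-dimensional descriptor can be reproduced roughly as follows (NumPy/SciPy sketch; details such as peak ordering and autocorrelation normalization are our assumptions, since the text does not fix them).

```python
import numpy as np
from scipy.stats import skew, kurtosis

def dsa_segment_features(S):
    """234-D feature for one 5-s segment S of shape (125, 9): 125 samples x 9 axes."""
    stats = np.concatenate([S.min(axis=0), S.max(axis=0), S.mean(axis=0),
                            skew(S, axis=0), kurtosis(S, axis=0)])        # 5 stats x 9 axes = 45
    spec = np.abs(np.fft.rfft(S, axis=0))
    freqs = np.fft.rfftfreq(S.shape[0], d=1.0 / 25)                       # 25 Hz sampling
    peaks, peak_freqs, acorr = [], [], []
    for a in range(S.shape[1]):
        idx = np.argsort(spec[:, a])[-5:]                                 # 5 largest Fourier peaks
        peaks.append(spec[idx, a])
        peak_freqs.append(freqs[idx])
        x = S[:, a] - S[:, a].mean()
        full = np.correlate(x, x, mode='full')[len(x) - 1:]
        acorr.append(full[:11])                                           # 11 autocorrelation samples
    return np.concatenate([stats, np.ravel(peaks),
                           np.ravel(peak_freqs), np.ravel(acorr)])        # 45 + 45 + 45 + 99 = 234
```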

8.1.4.1 IXMAS Dataset

Dense trajectory and histogram of oriented optical flow features [35] are extracted from the videos. A dictionary of size 2000 is built for each type of feature using k-means. We use the bag-of-words model to encode these features and represent each video as a feature vector.

We adopt the same leave-one-action-class-out training scheme as in [25,7,8] for fair comparison. Each time, one action class is used for testing. In order to evaluate the effectiveness of the information transfer in our approaches, all the videos of this action class are excluded from the feature learning procedure, including k-means and our approaches. Note that these videos can be seen when training the action classifiers. We evaluate both the unsupervised approach (Ours-1) and the supervised approach (Ours-2).

One-to-One Cross-view Action Recognition

This experiment trains on data from one camera view (the training view) and tests on data from another view (the test view). We only use the learned shared features and discard the private features in this experiment, as the private features learned on one view do not capture much information about the other view.

We compare Ours-2 approach with [36,7,8] and report recognition results in Table 8.1. Ours-2 achieves the best performance in 18 out of 20 combinations, significantly better than all the compared approaches. It should be noted that Ours-2 achieves 100% in 16 cases, demonstrating the effectiveness of the learned shared features. Thanks to the abundant discriminative information from the learned shared features and label information, our approach is robust to viewpoint variations and can achieve high performance in cross-view recognition.

Table 8.1

One-to-one cross-view recognition results of various supervised approaches on IXMAS dataset. Each row corresponds to a training view (from view C0 to view C4) and each column is a test view (also from view C0 to view C4). The results in brackets are the recognition accuracies of [36,7,8] and our supervised approach, respectively

C0 C1 C2 C3 C4
C0 NA (79,98.8,98.5,100) (79,99.1,99.7,99.7) (68,99.4,99.7,100) (76,92.7,99.7,100)
C1 (72,98.8,100,100) NA (74,99.7,97.0,99.7) (70,92.7,89.7,100) (66,90.6,100,99.7)
C2 (71,99.4,99.1,100) (82,96.4,99.3,100) NA (76,97.3,100,100) (72,95.5,99.7,100)
C3 (75,98.2,90.0,100) (75,97.6,99.7,100) (73,99.7,98.2,99.4) NA (76,90.0,96.4,100)
C4 (80,85.8,99.7,100) (77,81.5,98.3,100) (73,93.3,97.0,100) (72,83.9,98.9,100) NA
Ave. (74,95.5,97.2,100) (77,93.6,98.3,100) (76,98.0,98.7,99.7) (73,93.3,97.0,100) (72,92.4,98.9,99.9)


We also compare the Ours-1 approach with [25,7,8,24,37], and report comparison results in Table 8.2. Our approach achieves the best performance in 19 out of 20 combinations. In some cases, our approach outperforms the comparison approaches by a large margin, for example, C4→C0 (C4 is the training view and C0 is the test view), C4→C1, and C1→C3. The overall performance of Ours-1 is slightly worse than that of Ours-2 due to the removal of the label information.

Table 8.2

One-to-one cross-view recognition results of various unsupervised approaches on IXMAS dataset. Each row corresponds to a training view (from view C0 to view C4) and each column is a test view (also from view C0 to view C4). The results in brackets are the recognition accuracies of [25,7,8,24,37] and our unsupervised approach, respectively

C0 C1 C2 C3 C4
C0 NA (79.9,96.7,99.1,92.7,94.8,99.7) (76.8,97.9,90.9,84.2,69.1,99.7) (76.8,97.6,88.7,83.9,83.9,98.9) (74.8,84.9,95.5,44.2,39.1,99.4)
C1 (81.2,97.3,97.8,95.5,90.6,100) NA (75.8,96.4,91.2,77.6,79.7,99.7) (78.0,89.7,78.4,86.1,79.1,99.4) (70.4,81.2,88.4,40.9,30.6,99.7)
C2 (79.6,92.1,99.4,82.4,72.1,100) (76.6,89.7,97.6,79.4,86.1,99.7) NA (79.8,94.9,91.2,85.8,77.3,100) (72.8,89.1,100,71.5,62.7,99.7)
C3 (73.0,97.0,87.6,82.4,82.4,100) (74.1,94.2,98.2,80.9,79.7,100) (74.0,96.7,99.4,82.7,70.9,100) NA (66.9,83.9,95.4,44.2,37.9,100)
C4 (82.0,83.0,87.3,57.1,48.8,99.7) (68.3,70.6,87.8,48.5,40.9,100) (74.0,89.7,92.1,78.8,70.3,100) (71.1,83.7,90.0,51.2,49.4,100) NA
Ave (79.0,94.4,93.0,79.4,74.5,99.9) (74.7,87.8,95.6,75.4,75.4,99.9) (75.2,95.1,93.4,80.8,72.5,99.9) (76.4,91.2,87.1,76.8,72.4,99.9) (71.2,84.8,95.1,50.2,42.6,99.7)


Many-to-One Cross-view Action Recognition

In this experiment, one view is used as the test view and all the other views are used as training views. Our approaches use both the learned shared and private features in this experiment.

Our unsupervised (Ours-1) and supervised (Ours-2) approaches are compared with existing approaches [38,5,22,25,7,8,6]. The importance of SAM Z in Eq. (8.2), the incoherence in Eq. (8.4), and the private features in Ours-2 model are also evaluated.

Table 8.3 shows that our supervised approach (Ours-2) achieves an impressive 100% recognition accuracy in all the 5 cases, and Ours-1 achieves an overall accuracy of 99.8%. Ours-1 and Ours-2 achieve superior overall performance over all the other comparison approaches, demonstrating the benefit of using both shared and private features in this work. Our approaches use the sample-affinity matrix to measure the similarities between video samples across camera views. Consequently, the learned shared features accurately characterize the commonness across views. In addition, the redundancy is reduced between shared and private features, making the learned private features more informative for classification. Although the two methods in [8] exploit private features as well, they do not measure different contributions of samples in learning the shared dictionary, making the shared information less discriminative.

Table 8.3

Many-to-one cross-view action recognition results on IXMAS dataset. Each column corresponds to a test view

Methods C0 C1 C2 C3 C4
Junejo et al. [5] 74.8 74.5 74.8 70.6 61.2
Liu and Shah [38] 76.7 73.3 72.0 73.0 N/A
Weinland et al. [22] 86.7 89.9 86.4 87.6 66.4
Liu et al. [25] 86.6 81.1 80.1 83.6 82.8
Zheng et al. [7] 98.5 99.1 99.1 100 90.3
Zheng and Jiang [8]-1 97.0 99.7 97.2 98.0 97.3
Zheng and Jiang [8]-2 99.7 99.7 98.8 99.4 99.1
Yan et al. [6] 91.2 87.7 82.1 81.5 79.1
No-SAM 95.3 93.9 95.3 93.1 94.7
No-private 98.6 98.1 98.3 99.4 100
No-incoherence 98.3 97.5 98.9 98.1 100
Ours-1 (unsupervised) 100 99.7 100 100 99.4
Ours-2 (supervised) 100 100 100 100 100


Ours-2 outperforms the No-SAM approach, suggesting the effectiveness of SAM Z. Without SAM Z, No-SAM treats samples across views equally, and thus cannot accurately weigh the importance of samples in different views. The importance of the private features can be clearly seen from the performance gap between Ours-2 and the No-private approach. Without private features, the No-private approach only uses shared features for classification, which are not discriminative enough if some informative motion patterns exist exclusively in one view and cannot be shared across views. The performance gap between Ours-2 and the No-incoherence method suggests the benefit of encouraging the incoherence in Eq. (8.4), which reduces the redundancy between shared and private features and helps extract discriminative information from each of them. Ours-2 slightly outperforms Ours-1 in this experiment, indicating the effectiveness of using the label information in Eq. (8.5).

8.1.4.2 Daily and Sports Activities Data Set

Many-to-One Cross-view Action Classification

In this experiment, data from 4 sensors are used for training (36,480 time series) and the data from the remaining 1 sensor (9,120 time series) are used for testing. This process is repeated 5 times and the average results are reported.

Our unsupervised (Ours-1) and supervised (Ours-2) approaches are compared with mSDA [10], DRRL [39], and IKSVM. The importance of SAM Z in Eq. (8.2), the incoherence in Eq. (8.4), and the private features in the Ours-2 model are also evaluated. We remove Z in Eq. (8.2) and the incoherence component in Eq. (8.4) from the supervised model, obtaining the "No-SAM" and "No-incoherence" models, respectively. We also remove the learning of the parameters $\{G^v\}_{v=1}^V$ from the supervised model, obtaining the "No-private" model. Comparison results are shown in Table 8.4.

Table 8.4

Many-to-one cross-view action classification results on DSA dataset. Each column corresponds to a test view. V0–V4 are sensor views on torso, arms, and legs

Methods Overall V0 V1 V2 V3 V4
IKSVM 54.6 36.5 53.4 63.4 60.1 59.7
DRRL [39] 55.4 35.5 56.7 62.1 61.7 60.9
mSDA [10] 56.1 34.4 57.7 62.8 61.5 64.1
No-SAM 55.4 35.1 57.0 60.7 62.2 62.2
No-private 55.4 35.1 57.0 60.7 62.2 62.1
No-incoherence 55.4 35.1 56.9 60.7 62.2 62.2
Ours-1 57.1 35.7 57.4 64.4 64.2 63.9
Ours-2 58.0 36.1 58.9 65.8 64.2 65.2


Ours-2 achieves superior performance over all the other comparison methods in all 5 cases, with an overall recognition accuracy of 58.0%. Ours-2 outperforms Ours-1 by 0.9% in overall classification accuracy due to the use of label information. Note that cross-view classification on the DSA dataset is challenging because the sensors on different body parts are only weakly correlated. The sensor on the torso (V0) has the weakest correlations with the other four sensors on the arms and legs. Therefore, all approaches achieve their lowest performance on V0 compared with sensors V1–V4. Ours-1 and Ours-2 achieve superior overall performance over the comparison approaches IKSVM and mSDA due to the use of both shared and private features. IKSVM and mSDA do not discover shared and private features, and thus cannot use the correlations between views and the exclusive information in each view for classification. To better balance the information transfer between views, Ours-1 and Ours-2 use the sample-affinity matrix to measure the similarities between samples across views; thus, the learned shared features accurately characterize the commonness across views. Although the overall improvement of Ours-1 and Ours-2 over mSDA is 1% and 1.9%, respectively, Ours-1 and Ours-2 correctly classify 456 and 866 more sequences than mSDA in this experiment.

The performance gap between Ours-2 and the No-SAM approach suggests the effectiveness of SAM Z. Without SAM Z, No-SAM treats samples across views equally, and thus cannot accurately weigh the importance of samples in different views. Ours-2 outperforms the No-private approach, showing the importance of the private features in learning discriminative features for multiview classification. Without private features, the No-private approach only uses shared features for classification, which are not discriminative enough if some informative motion patterns exist exclusively in one view and cannot be shared across views. Ours-2 achieves superior performance over the No-incoherence method, indicating the benefit of encouraging the incoherence in Eq. (8.4), which reduces the redundancy between shared and private features and helps extract discriminative information from each of them. Ours-2 slightly outperforms Ours-1, indicating the effectiveness of using the label information in Eq. (8.5).

8.2 Hybrid Neural Network for Action Recognition from Depth Cameras

8.2.1 Introduction

Using depth cameras for action recognition has received increasing interest in the computer vision community due to the recent advent of the cost-effective Kinect. Depth sensors provide several advantages over typical visible-light cameras. First, 3D structural information can be easily captured, which helps simplify intra-class motion variations. Second, depth information provides useful cues for background subtraction and occlusion detection. Third, depth data are generally not affected by lighting variations, and thus provide robust information under different lighting conditions.

Unfortunately, improving recognition performance with depth data is not an easy task. One reason is that depth data are noisy and may have spatial and temporal discontinuities where undefined depth points exist. Existing methods resort to mining discriminative actionlets from noisy data [40], exploiting a sampling scheme [41], or developing depth spatiotemporal interest point detectors [42,43] in order to overcome the problem of noisy depth data. However, these methods directly use low-level features, which may not be expressive enough for discriminating depth videos. Another problem is that depth information alone is not discriminative enough, as most body parts in different actions have similar depth values. It is desirable to extract useful information from depth data, e.g., surface normals in 4D space [44] and 3D human shapes [45], and then use additional cues effectively to improve performance, e.g., joint data [40,46]. It should be noted that most existing approaches for depth action videos only capture low-order context, such as hand–arm and foot–leg. High-order context such as head–arm–leg and torso–arm–leg is not considered. In addition, all these methods depend on hand-crafted, problem-dependent features, whose importance for the recognition task is rarely known. This is a notable limitation since intra-class action data are generally highly varied, while inter-class action data often appear similar.

In this chapter, we describe a hybrid convolutional-recursive neural network (HCRNN), a cascade of a 3D convolutional neural network (3D-CNN) and a 3D recursive neural network (3D-RNN), to learn high-order, compositional features for recognizing RGB-D action videos. The hierarchical nature of HCRNN helps us abstract low-level features into powerful features for action recognition. HCRNN models the relationships between local neighboring body parts and allows body parts to deform in actions. This makes our model robust to pose variations and geometry changes in intra-class RGB-D action data. In addition, HCRNN captures high-order body-part context information in RGB-D action data, which is particularly important for learning actions with large pose variations [47,48]. A new 3D convolution is performed on spatiotemporal 3D patches, thereby capturing rich motion information in adjacent frames and reducing noise. We organize all the components of HCRNN into different layers, and train HCRNN in an unsupervised fashion without network fine-tuning. More importantly, we demonstrate that high-quality features can be learned by 3D-RNNs even with random weights.

The goal of HCRNN is to learn discriminative features from RGB-D videos. As illustrated in the flowchart in Fig. 8.5, HCRNN starts with raw RGB-D videos and first extracts features from the RGB and depth modalities separately. The two modalities, RGB and depth data, are then fed into 3D convolutional neural networks (3D-CNN) and convolved with K filters each. 3D-CNN outputs translationally invariant low-level features, a matrix of filter responses. These features are then given to the 3D-RNN to learn compositional high-order features. To improve feature discriminability, multiple 3D-RNNs are jointly employed to learn the features. The feature vectors learned by all the 3D-RNNs from all the modalities are combined into a single feature vector, which is the action representation for the input RGB-D video. A softmax classifier is applied to recognize the RGB-D action video.

Figure 8.5 Architecture of our hybrid convolutional-recursive neural network (HCRNN) model. Given an RGB-D video, HCRNN learns a discriminative feature vector from both RGB and depth data. We use 3D-CNN to learn features of local neighboring body parts, and 3D-RNN to learn compositional features hierarchically.

8.2.2 Related Work

RGB-D Action Recognition. In depth videos, depth images generally contain undefined depth points, causing spatial and temporal discontinuities. This is an obstacle to using the informative depth data. For example, popular spatiotemporal interest point (STIP) detectors [34,49] cannot be applied to depth videos directly, as they falsely fire on those discontinuous black regions [44]. To overcome this problem, counterparts of these detectors for depth videos have been proposed in [42,43]. Depth STIP [42], a filtering method, was introduced to detect interest points from RGB-D videos with noise reduction.

To obtain useful information from noisy depth videos, [40] proposed to select informative joints that are most relevant to the recognition task. Consequently, an action can be represented by subsets of joints (actionlets) and learned with a multiple-kernel SVM, where each kernel corresponds to an actionlet. In [44], the histogram of oriented 4D surface normals (HON4D) is computed to effectively exploit the changing geometric structure of actions in depth videos. Li et al. [45] projected depth maps onto 2D planes and sampled a set of points along the contours of the projections. The points are then clustered to obtain salient postures. A GMM is further used to model the postures, and an action graph is applied for inference. Holistic features [14] and human pose (joint) information are also used for action recognition from RGB-D videos [46,50,51].

The Hollywood 3D dataset, a new 3D action dataset, was released in [43] and evaluated using both conventional STIP detectors and their extensions for depth videos. Results show that the new STIP detectors for depth videos can effectively take advantage of depth data and suppress false detections caused by the spatial and temporal discontinuities in depth images.

Applications Using Deep Models. In recent years, feature learning using deep models has been successfully applied to object recognition [52–54] and detection [55,56], scene understanding [57,58], face recognition, and action recognition [59–61].

Feature learning methods for object recognition are generally composed of a filter bank, a nonlinear transformation, and some pooling layers. To evaluate their influence, [52] built several hierarchies from different combinations of those components, and reported their performance on object recognition and handwritten digit recognition datasets. The 3D object recognition task was also addressed in [53,54]. In [55], the mutual visibility relationship in pedestrian detection is modeled by summarizing human body part interactions in a hierarchical way.

Deep models have achieved promising results in conventional action recognition tasks. A convolutional neural network [59] was applied to extract features from both the spatial and temporal dimensions by performing 3D convolutions. An unsupervised gated RBM model [61] was proposed for action recognition. Le et al. [60] combined independent subspace analysis with deep learning techniques to build features robust to local translation. All these methods are designed for color videos. In this chapter, we introduce a deep architecture for recognizing actions from RGB-D videos.

8.2.3 Hybrid Convolutional-Recursive Neural Networks

We describe the hybrid convolutional-recursive neural network (HCRNN) for learning high-order compositional features for action recognition from depth cameras. The HCRNN consists of two components, the 3D-CNN model and the 3D-RNN model. The 3D-CNN model is utilized to generate low-level, translationally invariant features, and the 3D-RNN model is employed to compose high-order features that can be used to classify actions. The architecture, a cascade of 3D-CNN and 3D-RNN, is shown in Fig. 8.5.

8.2.3.1 Architecture Overview

Our method takes an RGB-D video v as input and outputs the corresponding action label. HCRNN is employed to find a transform h that maps the RGB-D video into a feature vector x, $x = h(v)$. The feature vector x is then fed into a classifier to obtain the action label y. We treat an RGB-D video as multichannel data, and extract gray, gradient, optical flow, and depth channels from the RGB and depth modalities. HCRNN is applied to each of these channels.

3D-CNN. The lower part of our HCRNN is a 3D-CNN model that extracts features hierarchically. The 3D-CNN (Fig. 8.6) in this work has five stages: 3D convolution (Sect. 8.2.3.2), absolute rectification, local normalization, average pooling, and subsampling. We sample N 3D patches of size $(s_r, s_c, s_t)$ (height, width, frames) from each channel with stride $s_p$. The 3D-CNN g takes these patches as input, and outputs a K-dimensional vector u for each patch, $g: \mathbb{R}^S \to \mathbb{R}^K$. Here, K is the number of learned filters and $S = s_r \times s_c \times s_t$.

Figure 8.6 Graphical illustration of 3D-CNN. Given a 3D video patch p, 3D-CNN g performs five stages of computations: 3D convolution, rectification, local normalization, average pooling and subsampling. Then the K-dimensional feature vector u is obtained, u = g(p).

Rich motion and geometry change information is captured by the 3D convolution in 3D-CNN. Each video of size $d_I$ (height, width, frames) is convolved with K filters of size $d_P$, resulting in K filter responses of dimensionality N (N is the number of patches extracted from one channel of a video). Then absolute rectification is performed, which applies the absolute value function to all components of the filter responses. This step is followed by local contrast normalization (LCN). The LCN module performs local subtractive and divisive normalization, enforcing a sort of local competition between adjacent features in a feature map and between features at the same spatiotemporal location in different feature maps. To improve the robustness of the features to small distortions, we add average pooling and subsampling modules to 3D-CNN. Patch features whose locations fall within a small spatiotemporal neighborhood are averaged and pooled to generate one parent feature of dimensionality K.

In this chapter, we augment the gray and depth feature maps (channels) with gradient-x, gradient-y, optflow-x, and optflow-y channels as in [59]. The gradient-x and gradient-y feature maps are computed by taking the gradient along the horizontal and vertical directions, respectively. The optflow-x and optflow-y feature maps are computed by running an optical flow algorithm and separating the horizontal and vertical components of the flow field.
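A rough sketch of building these six channels from gray and depth frames is shown below (NumPy gradients, and OpenCV's Farneback flow as a stand-in for whichever optical flow algorithm is used; 8-bit grayscale frames are assumed).

```python
import numpy as np
import cv2

def build_channels(gray_frames, depth_frames):
    """Per-frame channels: gray, depth, gradient-x/y (from gray), and
    optical-flow-x/y between consecutive gray frames (one fewer flow frame)."""
    channels = {'gray': [], 'depth': [], 'gx': [], 'gy': [], 'fx': [], 'fy': []}
    for t, g in enumerate(gray_frames):
        channels['gray'].append(g)
        channels['depth'].append(depth_frames[t])
        gy, gx = np.gradient(g.astype(np.float32))          # vertical, horizontal gradients
        channels['gx'].append(gx)
        channels['gy'].append(gy)
        if t > 0:
            flow = cv2.calcOpticalFlowFarneback(gray_frames[t - 1], g, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            channels['fx'].append(flow[..., 0])              # horizontal flow
            channels['fy'].append(flow[..., 1])              # vertical flow
    return {k: np.stack(v) for k, v in channels.items()}
```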

3D-RNN. The 3D-RNN model hierarchically learns compositional features. A graphical illustration of 3D-RNN is shown in Fig. 8.7. It merges a spatiotemporal block of patch feature vectors and generates a parent feature vector. The input to 3D-RNN is a $K \times M$ matrix, where M is the number of feature vectors generated by 3D-CNN ($M \leq N$ since we apply subsampling in 3D-CNN). The output of 3D-RNN is a K-dimensional vector, which is the feature for one channel of the video. We adopt a tree-structured 3D-RNN with J layers. At each layer, child vectors whose spatiotemporal locations are within a 3D block are merged into one parent vector. This procedure continues until one parent vector is generated at the top layer. We concatenate the feature vectors of all the channels generated by 3D-RNN, deriving a KC-dimensional vector as the action feature given C channels of data.

Figure 8.7 Graphical illustration of 3D-RNN. u1,…,u8 are a block of patch features generated by 3D-CNN. 3D-RNN takes these features as inputs and produces a parent feature vector q. 3D-RNN recursively merges adjacent feature vectors, and generates feature vector x for one channel of data.

8.2.3.2 3D Convolutional Neural Networks

We use 3D-CNN to extract features from RGB-D action videos. 2D-CNNs have been successfully used on 2D images, for example, in object recognition [52–54] and scene understanding [57,58]. In these methods, 2D convolutions are adopted to extract features from a local neighborhood on the feature map; an additive bias is then applied and a sigmoid function is used for feature mapping. However, in action recognition a 3D convolution is desired, as successive frames of a video encode rich motion information. In this work, we develop a new 3D convolution operation for RGB-D videos.

The 3D convolution in the 3D-CNN is achieved by convolving a filter on 3D patches extracted from RGB-D videos. Consequently, local spatiotemporal motion and structure information can be well captured in the convolution layer. It also captures motion and structure relationships of body parts in a local neighborhood, e.g., arm–hand and leg–foot. Suppose p is a 3D spatiotemporal patch randomly extracted from a video. We apply a nonlinear mapping to map p onto the feature map in the next layer,

$$g_k(p) = \max\big(\bar{d} - \|p - z_k\|^2,\ 0\big), \tag{8.8}$$

where $\bar{d} = \frac{1}{K}\sum_{k'=1}^{K} \|p - z_{k'}\|^2$ is the average distance of the patch p to all the filters, $z_k$ is the kth filter, and K is the number of learned filters. Note that the convolution in Eq. (8.8) is different from [59,62] but philosophically similar: $\bar{d}$ can be considered the bias, and $\|p - z_k\|^2$ plays the role of the convolution operation, measuring the similarity between filter $z_k$ and patch p. The $\max(\cdot)$ function is a nonlinear mapping, analogous to the sigmoid function. The filters in Eq. (8.8) are easily trained in an unsupervised fashion (Sect. 8.2.3.5) by running k-means [63].
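In code, the filter responses of Eq. (8.8) for a batch of flattened patches might be computed as below (a NumPy sketch; patches are assumed to be pre-normalized and whitened as in Sect. 8.2.3.5).

```python
import numpy as np

def conv3d_responses(patches, filters):
    """Responses of Eq. (8.8) for N flattened 3D patches (N x S) against
    K learned filters (K x S): g_k(p) = max(d_bar - ||p - z_k||^2, 0)."""
    # squared distances between every patch and every filter: N x K
    d2 = ((patches[:, None, :] - filters[None, :, :]) ** 2).sum(axis=-1)
    d_bar = d2.mean(axis=1, keepdims=True)          # average distance per patch
    return np.maximum(d_bar - d2, 0.0)              # N x K nonnegative responses
```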

After the 3D convolution, rectification, local contrast normalization, and average downsampling are applied as in object recognition, but they are performed on 3D patches.

3D-CNN generates a list of K-dimensional vectors $u_i^{rct} \in U$ ($i = 1, \dots, M$), where r, c, and t are the locations of the vector in the row, column, and temporal dimensions, respectively, and U is the set of vectors generated by 3D-CNN. Each vector in U is the feature for one patch. All these patch features are then given as inputs to 3D-RNN to compose high-order features.

Our 3D-CNN extracts discriminative motion and geometric-change information from RGB-D data. It also captures relationships between human body parts in a local neighborhood, and allows body parts to be deformable in actions. Therefore, the learned features are robust to pose variations in RGB-D data.

8.2.3.3 3D Recursive Neural Networks

The idea of recursive neural networks is to learn hierarchical features by applying the same neural network recursively in a tree structure. In our case, 3D-RNN can be regarded as combining convolution and pooling over 3D patches into one efficient, hierarchical operation.

We use a balanced fixed-tree structure for 3D-RNN. Compared with previous RNN approaches, the fixed tree structure offers high-speed operation and makes it possible to exploit parallelization. In our tree-structured 3D-RNN, each leaf node is a K-dimensional vector, which is an output of the 3D-CNN. At each layer, the 3D-RNN merges adjacent vectors into one vector. As this process repeats, high-order relationships and long-range dependencies of body parts are encoded in the learned feature.

3D-RNN takes the list of K-dimensional vectors $u_i^{rct} \in U$ ($i = 1, \dots, M$) generated by 3D-CNN as input, and recursively merges blocks of vectors into parent vectors $q \in \mathbb{R}^K$. We define a 3D block of size $b_r \times b_c \times b_t$ as a list of adjacent vectors to be merged. For example, if $b_r = b_c = b_t = 2$, then $B = 8$ vectors are merged. We define the merging function as

$$q = f\big(W[u_1; \dots; u_B]\big). \tag{8.9}$$

Here, W is a parameter matrix of size $K \times BK$ ($B = b_r \times b_c \times b_t$), and $f(\cdot)$ is a nonlinear function (e.g., $\tanh(\cdot)$). The bias term is omitted here as it does not affect performance.

3D-RNN is a tree with multiple layers, where the jth layer composes high-order features over the $(j-1)$th layer. In the jth layer, each block of vectors in the $(j-1)$th layer is merged into one parent vector using the same weight W in Eq. (8.9). This process is repeated until only one parent vector x remains. Fig. 8.7 shows an example of a pooled CNN output of size $K \times 2 \times 2 \times 2$ and an RNN tree structure with blocks of 8 children, $u_1, \dots, u_8$.
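A sketch of one fixed-tree 3D-RNN pass with 2×2×2 blocks is given below (NumPy; it assumes the pooled CNN output dimensions are divisible by the block size at every level, and uses tanh as f).

```python
import numpy as np

def rnn3d_forward(U, W, block=(2, 2, 2)):
    """One 3D-RNN with weight W (K x B*K): recursively merge each br x bc x bt
    block of child vectors into a parent via q = tanh(W [u1; ...; uB]).
    U has shape (K, R, C, T); merging repeats until a single K-vector remains."""
    K, R, C, T = U.shape
    br, bc, bt = block
    while (R, C, T) != (1, 1, 1):
        parents = np.zeros((K, R // br, C // bc, T // bt))
        for r in range(0, R, br):
            for c in range(0, C, bc):
                for t in range(0, T, bt):
                    blk = U[:, r:r + br, c:c + bc, t:t + bt].reshape(K, -1)  # K x B children
                    children = blk.T.reshape(-1)                             # [u1; ...; uB]
                    parents[:, r // br, c // bc, t // bt] = np.tanh(W @ children)
        U = parents
        K, R, C, T = U.shape
    return U.reshape(-1)   # K-dimensional feature for one channel
```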

We apply 3D-RNN to the C channels of data separately, obtaining C parent vectors $x^c$, $c = 1, \dots, C$. Each parent vector $x^c$ is a K-dimensional vector computed from one channel of an RGB-D video. The vectors from all the channels are then concatenated into a long vector that encodes rich motion and structure information for the RGB-D video. Finally, this feature is fed into a softmax classifier for action classification.

The feature learned by 3D-RNN captures high-order relationships of body parts and encodes long-range dependencies of body parts. Therefore, human actions can be well represented and can be classified accurately.

8.2.3.4 Multiple 3D-RNNs

3D-RNN abstracts high-order features using the same weight W in a recursive way. The weight W, which is generated randomly, encodes which child vectors contribute more to the parent vector for the classification task; a single random W may not capture this importance accurately.

This problem can be alleviated by using multiple 3D-RNNs. Similar to [54], we use multiple 3D-RNNs with different random weights. Consequently, different notions of importance among adjacent vectors can be captured by the different weights, and higher-quality feature vectors can be produced. We concatenate the vectors generated by the multiple 3D-RNNs and feed them into the softmax classifier.

8.2.3.5 Model Learning

Unsupervised Learning of 3D-CNN Filters. CNN models can be learned using supervised or unsupervised approaches [64,59]. Since the convolution operates on millions of 3D patches, using backpropagation and fine-tuning the entire network may not be practical or efficient. Instead, we train our 3D-CNN model using an unsupervised approach.

Inspired by [63], we learn the 3D-CNN filters in an unsupervised way by clustering random 3D patches. We treat the multichannel data (gray, gradient, optical flow, and depth) as separate feature maps, and randomly extract spatiotemporal 3D patches from each channel. The extracted 3D patches are then normalized and whitened. Finally, these patches are clustered into K cluster centers $z_k$, $k = 1, \dots, K$, which are used as the filters in the 3D convolution (Eq. (8.8)).
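A minimal version of this filter-learning step is sketched below (NumPy/scikit-learn; the normalization and ZCA-whitening constants follow common practice for k-means feature learning rather than values given in the text).

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_3d_filters(patches, K=256, eps=0.1):
    """Unsupervised filter learning: normalize and whiten random 3D patches
    (flattened to rows of shape (n_patches, S)), then cluster with k-means;
    the K centroids serve as the filters z_k used in Eq. (8.8)."""
    P = patches - patches.mean(axis=1, keepdims=True)        # per-patch mean removal
    P = P / np.sqrt(P.var(axis=1, keepdims=True) + 10.0)     # per-patch contrast normalization
    cov = np.cov(P, rowvar=False)
    d, E = np.linalg.eigh(cov)
    whiten = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T       # ZCA whitening transform
    Pw = P @ whiten
    return KMeans(n_clusters=K, n_init=4).fit(Pw).cluster_centers_
```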

Random Weights for 3D-RNN. Recent work [54] shows that RNNs with random weights can also generate features with high discriminability. We follow [54] to learn 3D-RNNs with random weights W. We show that learning RNNs with random weights provides an efficient, yet powerful model for action recognition from depth camera.

8.2.3.6 Classification

As described in Sect. 8.2.3.4, the features generated by the multiple 3D-RNNs are concatenated to produce the feature vector x for the depth video. We train a multiclass softmax classifier to classify the depth action x,

$$f(x, y) = \frac{\exp(\theta_y^T x)}{\sum_{l \in Y} \exp(\theta_l^T x)}, \tag{8.10}$$

where $\theta_y$ is the parameter vector for class y. Prediction is performed by taking the argmax over classes, $y^\ast = \arg\max_l f(x, l)$. The multiclass cross-entropy loss is used to learn the model parameters θ for all classes, which are optimized with a limited-memory variable-metric (BFGS) method.
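A compact softmax classifier trained with a quasi-Newton optimizer, consistent with the description above, might look like the following (SciPy's L-BFGS-B as a stand-in for the exact optimizer; the small ℓ2 regularizer is our addition).

```python
import numpy as np
from scipy.optimize import minimize

def train_softmax(X, y, n_classes, reg=1e-4):
    """Multiclass softmax on concatenated 3D-RNN features.
    X: N x D feature matrix, y: integer labels in [0, n_classes)."""
    N, D = X.shape

    def loss_grad(theta_flat):
        Theta = theta_flat.reshape(n_classes, D)
        logits = X @ Theta.T
        logits -= logits.max(axis=1, keepdims=True)          # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        loss = -np.log(P[np.arange(N), y]).mean() + 0.5 * reg * (Theta ** 2).sum()
        G = P.copy()
        G[np.arange(N), y] -= 1.0                             # softmax gradient
        grad = G.T @ X / N + reg * Theta
        return loss, grad.ravel()

    res = minimize(loss_grad, np.zeros(n_classes * D), jac=True, method='L-BFGS-B')
    return res.x.reshape(n_classes, D)   # prediction: argmax over X @ Theta.T
```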

8.2.4 Experiments

We evaluate our HCRNN model on two popular 3D action datasets, MSR-Gesture3D Dataset [65] and MSR-Action3D dataset [40]. Example frames of these datasets are shown in Fig. 8.8. We use gray, depth, gradient-x, gradient-y, optflow-x, and optflow-y feature maps for all the datasets.

Figure 8.8 Example frames from three RGB-D action datasets.

8.2.4.1 MSR-Gesture3D Dataset

MSR-Gesture3D is a hand gesture dataset containing 336 depth sequences captured by a depth camera. There are 12 categories of hand gestures in the dataset: "bathroom", "blue", "finish", "green", "hungry", "milk", "past", "pig", "store", "where", "j", and "z". This is a challenging dataset due to self-occlusion and the visual similarity among gestures. Our HCRNN takes an input video of size $80 \times 80 \times 18$. The number of filters in 3D-CNN is set to 256 and the number of 3D-RNNs is set to 16. The kernel (filter) size in 3D-CNN is $6 \times 6 \times 4$, and the receptive field size in 3D-RNN is $2 \times 2 \times 2$. As in [65], only depth frames are used in the experiments. Leave-one-out cross validation is employed in the evaluation.

Fig. 8.9 shows the confusion matrix of HCRNN on the MSR-Gesture3D dataset. Our method achieves 93.75% accuracy in classifying hand gestures. It confuses some examples between “ASL Past” and “ASL Store”, between “ASL Finish” and “ASL Past”, and between “ASL Blue” and “ASL J” owing to their visual similarity. As Fig. 8.9 shows, our method nevertheless recognizes most visually similar hand gestures, since HCRNN discovers discriminative features and abstracts expressive high-order features for the task. The remaining confusions in Fig. 8.9 are mainly caused by self-occlusion and intra-class motion variations.

Image
Figure 8.9 Confusion matrix of HCRNN on the MSR-Gesture3D dataset. Our method achieves 93.75% recognition accuracy.

We compare our HCRNN with [44,66,65,67] on the MSR-Gesture3D dataset. The methods in [44,66,65] are specifically designed for depth sequences, while [67] proposed the HOG3D descriptor, originally designed for color sequences. In contrast to these hand-crafted features, HCRNN learns its features from data. The results in Table 8.5 show that our method outperforms all the comparison methods. The learned features better capture intra-class variations and inter-class similarities, and thus achieve better performance. In addition, HCRNN encodes high-order context information of body parts and allows the parts to deform. These two benefits further improve the expressiveness of the learned features.

Table 8.5

The performance of our HCRNN model on the MSR-Gesture3D dataset compared with previous methods
Method Accuracy (%)
Oreifej et al. [44] 92.45
Yang et al. [66] 89.20
Jiang et al. [65] 88.50
Klaser et al. [67] 85.23
HCRNN 93.75

8.2.4.2 MSR-Action3D Dataset

The MSR-Action3D dataset [40] consists of 20 classes of human actions: “bend”, “draw circle”, “draw tick”, “draw x”, “forward kick”, “forward punch”, “golf swing”, “hammer”, “hand catch”, “hand clap”, “high arm wave”, “high throw”, “horizontal arm wave”, “jogging”, “pick up & throw”, “side boxing”, “side kick”, “tennis serve”, “tennis swing”, and “two hand wave”. The dataset contains a total of 567 depth videos captured with a depth camera.

In this dataset, the background has been preprocessed to remove the discontinuities caused by undefined depth regions. The dataset is nevertheless challenging because many actions are visually very similar. We adopt the same training/testing split as in [44], i.e., the videos of the first five subjects (295 videos) are used for training and the remaining 272 videos for testing; a sketch of this split follows. Our HCRNN takes an input video of size 120×160×30. The number of filters in the 3D-CNN is set to 256, and the number of 3D-RNNs to 32. The kernel (filter) size in the 3D-CNN is 6×6×4, and the receptive field size in the 3D-RNN is 2×2×2.
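For clarity, the cross-subject split can be written as a simple mask over per-video subject IDs. The subject numbering and metadata layout below are assumptions about how the dataset annotations are stored, not details from the chapter.

```python
import numpy as np

def cross_subject_split(subject_ids, train_subjects=(1, 2, 3, 4, 5)):
    """Return (train_indices, test_indices): videos of the first five subjects
    are used for training, the remaining videos for testing."""
    subject_ids = np.asarray(subject_ids)
    train_mask = np.isin(subject_ids, train_subjects)
    return np.where(train_mask)[0], np.where(~train_mask)[0]
```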

The confusion matrix of HCRNN is displayed in Fig. 8.10A. Our method achieves 90.07% recognition accuracy on the MSR-Action3D dataset. Confusions mostly occur between visually similar actions, e.g., “horizontal arm wave” and “hand clap”, “hammer” and “tennis serve”, and “draw x” and “draw tick”. The filters learned for this experiment are illustrated in Fig. 8.10B. Our filter learning method discovers a variety of representative patterns, which accurately describe local 3D patches.

Image
Figure 8.10 (A) Confusion matrix and (B) learned 3D filters of HCRNN on the MSR-Action3D dataset. Our method achieves 90.07% recognition accuracy.

We compare with methods specifically designed for depth sequences [40,65,66,68], as well as conventional action recognition methods that use spatiotemporal interest point detectors [34,49,67]. The results in Table 8.6A show that our method outperforms all the comparison methods. It achieves 90.07% recognition accuracy, higher than the state-of-the-art methods [42,14]. It should be noted that our method does not rely on a skeleton tracker, yet it outperforms the skeleton-based method [40]. Table 8.6B shows the recognition accuracy of HCRNN with different numbers of 3D-RNNs. HCRNN performs worst with n_r = 1 and improves steadily as more 3D-RNNs are added, reaching its best accuracy at n_r = 32. Beyond 32 3D-RNNs, the performance degrades due to overfitting.

Table 8.6

The performance of our HCRNN model on the MSR-Action3D dataset

(A) Comparison with previous methods
Method Accuracy (%)
RGGP [14] 89.30
Xia and Aggarwal [42] 89.30
Oreifej et al. [44] 88.89
Jiang et al. [40] 88.20
Jiang et al. [65] 86.50
Yang et al. [66] 85.52
Klaser et al. [67] 81.43
Vieira et al. [68] 78.20
Dollar [34] 72.40
Laptev [49] 69.57
HCRNN 90.07

(B) Accuracy with different numbers of 3D-RNNs
Number of 3D-RNNs Accuracy (%)
1 40.44
2 57.72
4 63.24
8 73.90
16 83.09
32 90.07
64 80.88
128 68.01

8.3 Summary

This chapter studies the problem of action recognition using two different types of data: multiview data and RGB-D data. In the first scenario, action data are captured by multiple cameras, so the appearance of the human subject differs significantly from one camera view to another, making action recognition more challenging. To address this problem, we have presented feature learning approaches that learn view-invariant features. These approaches utilize both shared and private features to accurately characterize human actions under large viewpoint and appearance variations. The sample-affinity matrix is introduced to compute sample similarities across views; it is embedded in the learning of shared features to accurately weigh the contribution of each sample and to balance information transfer. Extensive experiments on the IXMAS and DSA datasets show that our approaches outperform state-of-the-art methods in cross-view action classification.

Actions can also be captured by RGB-D sensors such as the Kinect, since they are cost-effective. Action data captured by a Kinect sensor have multiple channels, including RGB, depth, and skeleton. However, it is challenging to use all of them for recognition because they lie in different feature spaces. To address this problem, the hybrid convolutional-recursive neural network (HCRNN) is proposed for action recognition from RGB-D cameras. The architecture consists of a 3D-CNN layer and a 3D-RNN layer. The 3D-CNN layer learns low-level, translation-invariant features, which are then given as input to the 3D-RNN. The 3D-RNN combines convolution and pooling into an efficient, hierarchical operation and learns high-order compositional features. Results on two datasets show that the proposed method achieves state-of-the-art performance.

References

[1] K. Altun, B. Barshan, O. Tunçel, Comparative study on classifying human activities with miniature inertial and magnetic sensors, Pattern Recognition 2010;43(10):3605–3620.

[2] J. Grabocka, A. Nanopoulos, L. Schmidt-Thieme, Classification of sparse time series via supervised matrix factorization, AAAI. 2012.

[3] Y. Kong, Y. Fu, Bilinear heterogeneous information machine for RGB-D action recognition, IEEE conference on computer vision and pattern recognition. 2015.

[4] Y. Kong, Y. Fu, Max-margin action prediction machine, IEEE Transactions on Pattern Analysis and Machine Intelligence 2016;38(9):1844–1858.

[5] I. Junejo, E. Dexter, I. Laptev, P. Perez, Cross-view action recognition from temporal self-similarities, ECCV. 2008.

[6] Y. Yan, E. Ricci, R. Subramanian, G. Liu, N. Sebe, Multitask linear discriminant analysis for view invariant action recognition, IEEE Transactions on Image Processing 2014;23(12):5599–5611.

[7] J. Zheng, Z. Jiang, P.J. Philips, R. Chellappa, Cross-view action recognition via a transferable dictionary pair, BMVC. 2012.

[8] J. Zheng, Z. Jiang, Learning view-invariant sparse representation for cross-view action recognition, ICCV. 2013.

[9] W. Yang, Y. Gao, Y. Shi, L. Cao, MRM-Lasso: a sparse multiview feature selection method via low-rank analysis, IEEE Transactions on Neural Networks and Learning Systems 2015;26(11):2801–2815.

[10] M. Chen, Z. Xu, K.Q. Weinberger, F. Sha, Marginalized denoising autoencoders for domain adaptation, ICML. 2012.

[11] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, CVPR. 2014.

[12] A.P. Singh, G.J. Gordon, Relational learning via collective matrix factorization, KDD. 2008.

[13] J. Liu, C. Wang, J. Gao, J. Han, Multi-view clustering via joint nonnegative matrix factorization, SDM. 2013.

[14] L. Liu, L. Shao, Learning discriminative representations from rgb-d video data, IJCAI. 2013.

[15] A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, IJCV 2008;73(3):243–272.

[16] Z. Ding, Y. Fu, Low-rank common subspace for multi-view learning, IEEE international conference on data mining. IEEE; 2014:110–119.

[17] A. Kumar, H. Daume, A co-training approach for multi-view spectral clustering, ICML. 2011.

[18] W. Zhang, K. Zhang, P. Gu, X. Xue, Multi-view embedding learning for incompletely labeled data, IJCAI. 2013.

[19] K. Wang, R. He, W. Wang, L. Wang, T. Tan, Learning coupled feature spaces for cross-modal matching, ICCV. 2013.

[20] C. Xu, D. Tao, C. Xu, Multi-view learning with incomplete views, IEEE Transactions on Image Processing 2015;24(12).

[21] A. Sharma, A. Kumar, H. Daume, D.W. Jacobs, Generalized multiview analysis: a discriminative latent space, CVPR. 2012.

[22] D. Weinland, M. Özuysal, P. Fua, Making action recognition robust to occlusions and viewpoint changes, ECCV. 2010.

[23] I.N. Junejo, E. Dexter, I. Laptev, P. Pérez, View-independent action recognition from temporal self-similarities, IEEE Transactions on Pattern Analysis and Machine Intelligence 2011;33(1):172–185.

[24] H. Rahmani, A. Mian, Learning a non-linear knowledge transfer model for cross-view action recognition, CVPR. 2015.

[25] J. Liu, M. Shah, B. Kuipers, S. Savarese, Cross-view action recognition via view knowledge transfer, CVPR. 2011.

[26] B. Li, O.I. Campus, M. Sznaier, Cross-view activity recognition using Hankelets, CVPR. 2012.

[27] R. Li, T. Zickler, Discriminative virtual views for cross-view action recognition, CVPR. 2012.

[28] Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, C. Shi, Cross-view action recognition via continuous virtual path, CVPR. 2013.

[29] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 2006;313(5786):504–507.

[30] J. Li, T. Zhang, W. Luo, J. Yang, X. Yuan, J. Zhang, Sparseness analysis in the pretraining of deep neural networks, IEEE Transactions on Neural Networks and Learning Systems 2016 10.1109/TNNLS.2016.2541681.

[31] M. Chen, K. Weinberger, F. Sha, Y. Bengio, Marginalized denoising auto-encoders for nonlinear representations, ICML. 2014.

[32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion, JMLR 2010;11:3371–3408.

[33] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding 2006;104(2–3).

[34] P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, VS-PETS. 2005.

[35] H. Wang, A. Kläser, C. Schmid, C.L. Liu, Dense trajectories and motion boundary descriptors for action recognition, IJCV 2013;103(1):60–79.

[36] A. Farhadi, M.K. Tabrizi, I. Endres, D.A. Forsyth, A latent model of discriminative aspect, ICCV. 2009.

[37] A. Gupta, J. Martinez, J.J. Little, R.J. Woodham, 3d pose from motion for cross-view action recognition via non-linear circulant temporal encoding, CVPR. 2014.

[38] J. Liu, M. Shah, Learning human actions via information maximization, CVPR. 2008.

[39] Y. Kong, Y. Fu, Discriminative relational representation learning for rgb-d action recognition, IEEE Transactions on Image Processing 2016;25(6).

[40] J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, CVPR. 2012.

[41] Y. Wang, G. Mori, A discriminative latent model of object classes and attributes, ECCV. 2010.

[42] L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, CVPR. 2013.

[43] S. Hadfield, R. Bowden, Hollywood 3d: recognizing actions in 3d natural scenes, CVPR. 2013.

[44] O. Oreifej, Z. Liu, HON4D: histogram of oriented 4D normals for activity recognition from depth sequences, CVPR. 2013:716–723.

[45] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3d points, CVPR workshop. 2010.

[46] H. Rahmani, A. Mahmood, A. Mian, D. Huynh, Real time action recognition using histograms of depth gradients and random decision forests, WACV. 2013.

[47] Y. Kong, Y. Fu, Y. Jia, Learning human interaction by interactive phrases, ECCV. 2012.

[48] T. Lan, Y. Wang, W. Yang, S.N. Robinovitch, G. Mori, Discriminative latent models for recognizing contextual group activities, PAMI 2012;34(8):1549–1562.

[49] I. Laptev, On space–time interest points, IJCV 2005;64(2):107–123.

[50] H.S. Koppula, A. Saxena, Learning spatio-temporal structure from RGB-D videos for human activity detection and anticipation, ICML. 2013.

[51] J. Luo, W. Wang, H. Qi, Group sparsity and geometry constrained dictionary learning for action recognition from depth maps, ICCV. 2013.

[52] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-stage architecture for object recognition? ICCV. 2009.

[53] V. Nair, G.E. Hinton, 3d object recognition with deep belief nets, NIPS. 2009.

[54] R. Socher, B. Huval, B. Bhat, C.D. Manning, A.Y. Ng, Convolutional-recursive deep learning for 3d object classification, NIPS. 2012.

[55] W. Ouyang, Modeling mutual visibility relationship in pedestrian detection, CVPR. 2013.

[56] C. Szegedy, A. Toshev, D. Erhan, Deep neural networks for object detection, NIPS. 2013.

[57] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, PAMI 2013.

[58] R. Socher, C.C.Y. Lim, A.Y. Ng, C.D. Manning, Parsing natural scenes and natural language with recursive neural networks, ICML. 2011.

[59] S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 2013;35(1):221–231.

[60] Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, CVPR. 2011.

[61] G.W. Taylor, R. Fergus, Y. LeCun, C. Bregler, Convolutional learning of spatio-temporal features, ECCV. 2010.

[62] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE. 1998.

[63] A. Coates, H. Lee, A.Y. Ng, An analysis of single-layer networks in unsupervised feature learning, AISTATS. 2011.

[64] M. Ranzato, F.J. Huang, Y.L. Boureau, Y. LeCun, Unsupervised learning of invariant feature hierarchies with applications to object recognition, CVPR. 2007.

[65] J. Wang, Z. Liu, J. Chorowski, Z. Chen, Y. Wu, Robust 3d action recognition with random occupancy patterns, ECCV. 2012.

[66] X. Yang, C. Zhang, Y. Tian, Recognizing actions using depth motion maps-based histograms of oriented gradients, ACM multimedia. 2012 978-1-4503-1089-5 10.1145/2393347.2396382.

[67] A. Klaser, M. Marszalek, C. Schmid, A spatio-temporal descriptor based on 3d-gradients, BMVC. 2008.

[68] A.W. Vieira, E.R. Nascimento, G.L. Oliveira, Z. Liu, M.F.M. Campos, STOP: space–time occupancy patterns for 3D action recognition from depth map sequences, 17th Iberoamerican congress on pattern recognition (CIARP). 2012.


1  “©2017 IEEE. Reprinted, with permission, from Yu Kong, Zhengming Ding, Jun Li, and Yun Fu. “Deeply learned view-invariant features for cross-view action recognition.” IEEE Transactions on Image Processing (2017).”
