Some typical examples of activity analysis are:

(1)Virtual fencing: Any monitoring system has a monitoring range. By setting up an early warning on its border, a virtual fence is established around the monitored region, and an intrusion across it triggers the system to start an analysis procedure. For example, high-resolution pan–tilt–zoom (PTZ) cameras can be used to capture the details of the intrusion and to compile statistics on the number of intrusions.

(2)Speed profiling: Virtual fencing uses only position information, whereas tracking also provides dynamic information for speed-based early warning, such as detecting speeding or road blockage.

(3)Path classification: Speed profiling uses only the current tracking data, whereas the activity path (AP) also provides historical information with which to predict and interpret incoming data. The behavior of an emerging target can be described by the maximum a posteriori (MAP) path:

l* = arg max_k p(l_k | G) = arg max_k p(G | l_k) p(l_k)    (8.23)

This helps determine which activity best explains the new path data. Since the path prior distribution p(l_k) can be estimated from the training set, the problem reduces to maximum likelihood estimation with an HMM.

(4)Abnormality detection: Detecting abnormal events is often an important task of a monitoring system. Since activity paths capture the characteristics of typical activities, an abnormality is detected when a new path differs sufficiently from the normal ones. Abnormal patterns can be detected with intelligent thresholding:

p(l* | G) < L_l    (8.24)

that is, an abnormality is declared when even the path l* that best resembles the new trajectory G scores below the threshold L_l.

(5)Online activity analysis: Online analysis, identification, and evaluation of activities is more powerful and useful than merely describing an activity by its path. An online system must quickly infer the current behavior from incomplete data (often on the basis of a graph model). Two examples are path prediction and tracking-anomaly detection. Path prediction uses the tracking data collected so far to predict future behavior and refines the prediction as more data arrive. Forecasting activities from an incomplete trajectory can be expressed as

l^ = arg max_j p(l_j | W_t G_{t+k})    (8.25)

where W_t is a window function and G_{t+k} is the trajectory up to the current time plus k predicted future tracking states. Tracking-anomaly detection aims to flag an abnormal event as soon as it occurs, rather than labeling the entire track as anomalous afterwards. This can be achieved by using W_t G_{t+k} instead of G in eq. (8.24). The window function W_t is not necessarily the same as in prediction, and the threshold may need to be adjusted according to the amount of data. A sketch covering eqs. (8.23)–(8.25) is given after Figure 8.8.

(6)Object-interaction characterization: Even higher-level analysis is expected to explain the interactions among different objects. As with abnormal events, strictly defining object interaction is very difficult; under different circumstances, different objects have different types of interactions. Take the risk of a car collision as an example. Every car has its spatial dimensions, which can be regarded as its personal space. When the car is moving, a minimum safety distance (minimum safety zone) must be added around this personal space, so the spatial–temporal personal space changes with the movement: the faster the speed, the larger the minimum safety distance (particularly in the direction of travel). A schematic diagram is shown in Figure 8.8, where the personal space is represented by a circle and the safety zone changes with the velocity (both magnitude and direction). If the safety zones of two cars intersect, a collision is possible; this assessment can in turn help with route planning (see the second sketch after Figure 8.8).

Figure 8.8: Use paths for collision assessment.
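
The following minimal Python sketch illustrates eqs. (8.23)–(8.25). It assumes, hypothetically, that each path class comes with a log-likelihood function (e.g., the scoring method of a trained per-class HMM) and a prior estimated from the training set; the window function W_t is realized here as simple truncation to the most recent states, which is only one of many possible choices.

```python
import numpy as np

def classify_path(G, loglik_fns, log_priors, log_threshold):
    """MAP path classification, eq. (8.23), plus the anomaly test of eq. (8.24).

    G             : observed trajectory, array of shape (T, 2)
    loglik_fns    : loglik_fns[k](G) returns log p(G | l_k)
                    (e.g., the score() of a trained per-class HMM)
    log_priors    : log p(l_k), estimated from the training set
    log_threshold : log of the threshold L_l in eq. (8.24)
    """
    log_post = np.array([f(G) for f in loglik_fns]) + np.asarray(log_priors)
    k_star = int(np.argmax(log_post))        # l* = arg max_k p(G | l_k) p(l_k)
    is_abnormal = log_post[k_star] < log_threshold       # eq. (8.24)
    return k_star, is_abnormal

def online_classify(G_partial, loglik_fns, log_priors, window):
    """Windowed online scoring in the spirit of eq. (8.25): only the most
    recent `window` states (observed, or appended predictions) are scored."""
    Gw = G_partial[-window:]                 # W_t applied as plain truncation
    scores = [f(Gw) + lp for f, lp in zip(loglik_fns, log_priors)]
    return int(np.argmax(scores))
```

The collision-assessment idea of Figure 8.8 can be sketched as follows; the disc-shaped zones, the growth factor, and the forward offset are illustrative assumptions, not the book's exact construction.

```python
import numpy as np

def safety_zone(center, velocity, base_radius, alpha=0.5):
    """Speed-dependent safety zone: a disc grown with speed and shifted
    toward the direction of travel (a simple stand-in for Figure 8.8)."""
    speed = np.linalg.norm(velocity)
    radius = base_radius + alpha * speed     # faster -> larger safety margin
    offset = 0.5 * np.asarray(velocity)      # extra margin ahead of the car
    return np.asarray(center) + offset, radius

def may_collide(c1, v1, r1, c2, v2, r2):
    p1, s1 = safety_zone(c1, v1, r1)
    p2, s2 = safety_zone(c2, v2, r2)
    return np.linalg.norm(p1 - p2) < s1 + s2  # intersecting zones -> risk

print(may_collide([0, 0], [2, 0], 1.0, [5, 0], [-2, 0], 1.0))  # True (head-on)
print(may_collide([0, 0], [2, 0], 1.0, [0, 10], [2, 0], 1.0))  # False (far apart)
```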

Finally, it should be noted that simple activities can be analyzed using only speed and object position, whereas more complex activities require additional measurements, such as adding trajectory curvature to detect odd walking patterns. To provide more comprehensive coverage of activities and behaviors, networks of multiple cameras are often required. Activity trajectories may also come from the interconnected parts of an object (such as a human body), in which case the activity definitions should be made with respect to a set of trajectories.

8.4Action Classification and Recognition

Vision-based human action recognition is the process of labeling an image sequence (video) with action labels. Once a representation of the observed image sequence or video has been obtained, this process can be cast as a classification problem.

8.4.1Action Classification

Many techniques for action classification have been proposed (Poppe, 2010).

8.4.1.1Direct Classification

Direct methods pay no special attention to the time domain, even when video is used. They either gather all observed information (from all frames) into a single representation, or recognize and classify actions separately for each frame.

In many cases, a high-dimensional image representation is required, so a large amount of computation is inevitable.

In addition, the representation may be contaminated by noise. A compact, robust feature representation in a low-dimensional space is therefore required for classification. Either linear or nonlinear dimensionality-reduction methods can be used: PCA is a typical linear approach, while locally linear embedding (LLE) is a typical nonlinear method; both are sketched below.
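
A minimal sketch of both reductions using scikit-learn (an assumed third-party dependency); the synthetic 50-D "frame features" lying on a low-dimensional manifold are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA                    # linear reduction
from sklearn.manifold import LocallyLinearEmbedding      # nonlinear reduction

# Toy data: 200 "frame features" on a 1-D manifold embedded in 50-D space.
rng = np.random.default_rng(0)
t = rng.uniform(0, 3 * np.pi, 200)
X = np.column_stack([np.sin(t), np.cos(t)]) @ rng.normal(size=(2, 50))
X += 0.01 * rng.normal(size=X.shape)                     # measurement noise

X_pca = PCA(n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(X)
print(X_pca.shape, X_lle.shape)                          # (200, 2) (200, 2)
```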

The classifiers used for direct classification also vary. Discriminative classifiers, such as the typical SVM, are concerned with distinguishing between categories rather than with modeling each category. In the boosting framework, a series of weak classifiers (each often using only a 1-D representation) is combined into a strong classifier; besides AdaBoost, LPBoost yields sparse coefficients and converges quickly. A boosting sketch follows.
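
A minimal boosting sketch with scikit-learn, using depth-1 decision stumps as the weak classifiers (each stump consults a single feature dimension); the data and parameters are illustrative. Note that recent scikit-learn versions name the weak-learner argument `estimator` (older versions used `base_estimator`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=40, random_state=0)

# Each weak learner is a depth-1 stump, i.e., a 1-D decision on one feature.
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```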

8.4.1.2Temporal State Models

Generative models try to learn the joint distribution of observations and actions, modeling each action class in full (with all its variations). Discriminative models try to learn the probability of the action classes conditioned on the observations; they are concerned not with modeling each category but with the differences between classes.

The most typical generative model is the hidden Markov model (HMM), in which the hidden states correspond to the steps of an action. The HMM models state-transition probabilities and observation probabilities under two independence assumptions: a state transition depends only on the previous state, and an observation depends only on the current state. Variants of the HMM include the maximum entropy Markov model (MEMM), the factored-state hierarchical hidden Markov model (FS-HHMM), and the hierarchical variable transition hidden Markov model (HVT-HMM). A classification sketch follows.
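
A minimal sketch of HMM-based action classification, assuming the third-party hmmlearn package: one generative HMM is trained per action class on toy 2-D feature sequences, and a test sequence is assigned to the class whose model scores it highest.

```python
import numpy as np
from hmmlearn import hmm          # third-party: pip install hmmlearn

rng = np.random.default_rng(1)
# Toy 2-D feature sequences for two hypothetical classes, "walk" and "wave".
X_walk = np.vstack([rng.normal([i % 5, 0.0], 0.3, (1, 2)) for i in range(60)])
X_wave = np.vstack([rng.normal([0.0, np.sin(i / 3.0)], 0.3, (1, 2))
                    for i in range(60)])

models = {}
for name, X in [("walk", X_walk), ("wave", X_wave)]:
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                        n_iter=25, random_state=0)
    m.fit(X)                      # one class-conditional HMM per action
    models[name] = m

test = X_wave[:30]                # classify by maximum log-likelihood
print(max(models, key=lambda k: models[k].score(test)))   # likely "wave"
```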

The discriminative group, on the other hand, tries to model the conditional distribution given the observations, combining multiple observations to distinguish between action classes. Such models are advantageous for distinguishing related actions. The conditional random field (CRF) is a typical discriminative model; its refinements include factorial CRFs (FCRFs) and other extensions.

8.4.1.3Action Detection

Methods based on action detection model explicitly neither the object representation in the image nor the action itself. They correlate the observed sequence with labeled video sequences to directly detect (predefined) actions. For example, video clips can be described as bags of words coded at different time scales, where each word corresponds to the gradient orientation of a local patch. Patches with little temporal variation can be ignored, so that the representation focuses on the motion regions.

When the motion is periodic (such as a person walking or running), the action is cyclic. In this case, temporal segmentation can be performed by analyzing a self-similarity matrix. Alternatively, markers can be attached to the moving subject, and the self-similarity matrix can be built by tracking the markers and using an affine-invariant distance function. The self-similarity matrix can then be transformed into the frequency domain, where the spectral peak corresponds to the frequency of the movement (e.g., the gait cycle can be computed to distinguish a walking person from a running one). The structure of the matrix can be analyzed to determine the type of action; a sketch follows.
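
A minimal numpy sketch of period estimation from a self-similarity matrix; the synthetic period-25 "gait" features stand in for real per-frame descriptors.

```python
import numpy as np

def self_similarity(frames):
    """S[i, j] = Euclidean distance between the features of frames i and j."""
    F = frames.reshape(len(frames), -1).astype(float)
    return np.sqrt(((F[:, None, :] - F[None, :, :]) ** 2).sum(-1))

# Synthetic gait-like loop: 2-D features tracing a circle every 25 frames.
t = np.arange(200)
theta = 2 * np.pi * t / 25
frames = np.column_stack([np.sin(theta), np.cos(theta)])

S = self_similarity(frames)
row = S[0] - S[0].mean()               # one row of S oscillates with the gait
spectrum = np.abs(np.fft.rfft(row))
freq = np.fft.rfftfreq(len(row))
period = 1.0 / freq[1:][np.argmax(spectrum[1:])]   # skip the DC bin
print(round(period))                   # ~25 frames per cycle
```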

The main methods for action representation and description can be classified into two groups: appearance-based and body model-based.

In appearance-based approaches, descriptions of the foreground, contour, optical flow, etc. are used directly. In body-model-based methods, a body model is used to represent the structural features of the human body; for example, actions are represented by sequences of joint positions. Whichever kind of method is used, detecting the human body, and detecting and tracking some of its important parts (such as the head, hands, and feet), plays an important role.

Example 8.2 Action recognition database

Some sample pictures of actions in the Weizmann action recognition database are shown in Figure 8.9 (Blank, 2005). From top to bottom, each row shows one action: jumping jack (jack), lateral movement (side), bending (bend), walking (walk), running (run), waving one hand (wave1), waving both hands (wave2), forward hop (skip), two-feet jump (jump), and jumping in place (pjump).

8.4.2Action Recognition

The representation and recognition of actions and activities is a long-standing but still immature field. Many action-recognition methods have been developed; they depend on the purpose of the research and on the application domain (Moeslund, 2006). Monitoring systems consider human activities and human interactions, whereas for scene interpretation the representation can be independent of the objects producing the motion.

8.4.2.1Holistic Recognition

Holistic recognition places the emphasis on recognizing the whole body, or each individual part, of a single person. For example, information on the structure and dynamics of the whole body may be used to identify a walking person or a walking gait. Most techniques are based on the human silhouette or outline and make little distinction among the various parts of the body. For example, one body-based identification technique uses human silhouettes, samples their outlines uniformly, and then performs a PCA decomposition; to compute correlation in the spatial–temporal domain, the respective trajectories can be compared in eigen-space. Dynamic information, on the other hand, can be used not only to recognize identity but also to determine what the person is doing. Body-part-based action recognition makes use of the positions and dynamics of the body parts.

Figure 8.9: Example images in Weizmann action recognition database.

8.4.2.2Posture Modeling

Recognition of human actions is closely related to body posture estimation. Body posture can be divided into action posture and orientational posture: the former corresponds to the action behavior at a given moment, while the latter corresponds to the orientation of the human body in 3-D space.

The main methods for posture representation and description can be classified into three groups: appearance-based, body model-based, and 3-D reconstruction-based.

(1)Appearance-based methods: These do not directly model the physical structure of the body but use color, texture, contour, and other information for body posture analysis. Since only the appearance information in 2-D images is used, estimating the human pose is difficult.

(2)Body-model-based methods: A graph model, or a 2-D or 3-D model of the human body, is first used to model the body; the posture is then estimated by analyzing these parameterized body models. Methods of this kind typically place high demands on image resolution and object-detection precision.

(3)3-D reconstruction-based methods: Multiple cameras at different locations first acquire 2-D images of the moving objects; 3-D moving objects are then reconstructed by matching corresponding points; finally, the camera parameters and imaging equations are used to estimate the body posture in 3-D space.

Posture modeling can be based on spatial–temporal points of interest (see Section 8.2). If the Harris corner detector is used, the resulting spatial–temporal interest points are concentrated in regions with sudden changes of motion. Such points are few and sparse, so important motion information in the video may be lost and detection may fail. Alternatively, dense spatial–temporal interest points can be extracted from the motion intensity (convolving the image with a Gaussian filter in the spatial domain and a Gabor filter in the temporal domain) so as to capture the changes of motion more completely. After extracting the spatial–temporal interest points, a descriptor is built for each point, and each posture is then modeled. One particular method first extracts spatial–temporal feature points from a training database as low-level features, one posture corresponding to one set of feature points; an unsupervised classification method then clusters the samples into typical postures; finally, each typical posture category is modeled by a Gaussian mixture model fitted with the EM algorithm. A sketch of the dense detector follows.
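
A minimal sketch of the dense detector (spatial Gaussian smoothing plus a quadrature pair of temporal Gabor filters, in the style of periodic-motion detectors); the filter parameters and the flickering-blob test clip are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dense_response(video, sigma=2.0, tau=1.5, omega=0.6):
    """Gaussian filter in space, quadrature Gabor pair in time; the response
    is large wherever the intensity varies periodically with time."""
    V = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    g = np.exp(-t**2 / (2 * tau**2))
    h_ev = g * np.cos(2 * np.pi * omega * t)       # even temporal Gabor
    h_od = g * np.sin(2 * np.pi * omega * t)       # odd temporal Gabor
    ev = np.apply_along_axis(lambda s: np.convolve(s, h_ev, "same"), 0, V)
    od = np.apply_along_axis(lambda s: np.convolve(s, h_od, "same"), 0, V)
    return ev**2 + od**2

# Toy clip: a flickering blob on a nearly static background.
rng = np.random.default_rng(0)
video = rng.normal(0, 0.01, size=(40, 32, 32))
video[:, 14:18, 14:18] += np.sin(1.2 * np.arange(40))[:, None, None]

R = dense_response(video)
peaks = (R == maximum_filter(R, size=5)) & (R > 0.5 * R.max())
print(np.argwhere(peaks)[:5])          # (t, y, x) points near the moving blob
```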

A recent trend in pose estimation for natural scenes is to detect the posture in a single frame, which overcomes the problems caused by unstructured scenes viewed with a single camera. For example, on the basis of robust part detection and probabilistic combination of parts, good 2-D posture estimates have already been obtained in complex movies.

8.4.2.3Activity Reconstruction

An action results in a change of posture. If each stationary body posture is defined as a state, then a sequence of actions (an activity) can be built by a single traversal through the corresponding states, with the help of the state-space method (also known as the probabilistic network method). From such a sequence of actions, the body action and posture can be recovered.

Building on posture estimation, significant progress has also been achieved in the automatic reconstruction of human activity from video. The original model-based analysis–synthesis approach searches the posture space effectively by means of multiview video capture. Many current methods focus more on capturing the overall body movement than on reconstructing details very precisely.

Single-view human activity reconstruction has also progressed considerably on the basis of statistical sampling techniques. Current work is more concerned with how to use learned models to constrain activity-based reconstruction. Studies show that a strong prior model is helpful for tracking specific activities in a single view.

8.4.2.4Interactive Activities

Interactive activities are relatively complex. Two main types can be distinguished: interaction between a person and the environment, and interaction among different persons.

(1)Interaction between a person and the environment: The person is the initiator of the activity, such as picking a book up from a table or driving a car on the road. This can be referred to as single-person activity.

(2)Interaction among different persons: Several persons interact with one another. This often refers to the exchange or contact behavior of two (or more) persons, and it can be seen as a combination of several single (atomic) activities synchronized in spatial–temporal space.

A single activity can be described by means of probabilistic graphical models. The probabilistic graphical model is a powerful tool for modeling continuous dynamic feature sequences and has a relatively mature theoretical basis. Its disadvantage is that the topology of the model depends on the structural information of the activity itself, so complex interactions require a lot of training data to learn the model topology. To combine a number of single activities, the statistical relational learning (SRL) approach can be used. SRL is a machine-learning method that integrates relational/logical representations, probabilistic reasoning, and data mining in order to obtain a comprehensive likelihood model of relational data.

8.4.2.5Group Activity

Quantitative changes may cause qualitative changes: a large number of objects involved in an activity poses new problems and requires new solutions. For example, the motion analysis of object groups concentrates mainly on pedestrian flows, traffic flows, and the dense groups of organisms found in nature. The research goals are the representation and description of object groups, the analysis of their motion features, and the boundary constraints on them. Here, attention to the unique behavior of each individual is weakened; the concern is rather with abstract descriptions of individuals that serve to describe the activity of the entire collection. For example, some studies draw on macroscopic kinematics to explore the movement of particle streams and establish a kinetic theory of particle flux. On this basis, semantic analysis of the aggregation, dissipation, splitting, and merging of object groups becomes indispensable for capturing the tendency and situation of the whole scene.

Example 8.3 Counting the number of people

In many public places, such as squares and stadium entrances, counting the number of people is required. Figure 8.10 shows such a scene. Although the scene contains many people in different poses, the concern here is the number of people (passing) within a specific range (the area surrounded by the frame) (Jia, 2009).

8.4.2.6Scene Interpretation

Unlike the recognition of objects in a scene, scene interpretation aims to comprehend the meaning of the entire image rather than to verify a particular person or object. In practice, many methods recognize activity by considering only the images acquired by the camera and observing the motion of objects in them, without determining the objects' identities. This strategy is effective when an object is small enough to be represented as a point in 2-D space.

For example, one detection system for abnormal conditions comprises the following modules. First, the position, speed, size, and binary silhouette of the 2-D objects are extracted and vector-quantized to generate a codebook of prototypes. To capture their temporal relationships, co-occurrence statistics are used to produce a co-occurrence matrix. By iteratively defining probability functions over the prototypes in the two codebooks, a binary tree structure can be determined, in which the leaf nodes correspond to the co-occurrence probability distributions in the matrix. The higher-level nodes correspond to simple scene events (such as the movements of pedestrians or cars) and are used to further interpret the scene.

Figure 8.10: Counting the number of people in flow monitoring.

8.5Modeling Activity and Behavior

A general action/activity recognition system should comprise several steps, leading from an image sequence to a high-level interpretation (Turaga, 2008):

(1)Capturing input video or image sequence.

(2)Extracting concise low-level image features.

(3)Describing middle-level actions on the basis of the low-level features.

(4)Interpreting the image with high-level semantics, starting from the basic actions.

Generally, a practical activity-recognition system has a hierarchical structure. At the low level are modules for foreground–background separation and for object detection and tracking. At the middle level, the main module performs action recognition. At the high level, the most important module is the inference engine, which encodes the semantics of the activity in terms of lower-level actions or action primitives and then understands the entire activity with the aid of learning.

From an abstract point of view, activities are at a higher level than actions. From a technical point of view, actions and activities are often modeled and recognized with different techniques. A categorization scheme is shown in Figure 8.11 (Turaga, 2008).

8.5.1Modeling Action

The methods for action recognition can be divided into three groups: nonparametric modeling, volumetric modeling, and (temporal) parametric modeling. Nonparametric methods extract a set of features from each video frame and match these features against stored templates. Volumetric methods do not extract features frame by frame but instead view the video as a 3-D volume of pixel intensities, extending standard 2-D image features (such as scale-space extrema and spatial filter responses) to 3-D. Temporal parametric methods focus on modeling the temporal dynamics of the motion, estimating specific parameters from a training set of actions.

8.5.1.1Nonparametric Modeling

Typical methods include the use of 2-D templates, the use of 3-D object models, manifold learning, and so on.

Figure 8.11: Classification of approaches for action and activity recognition.

Using 2-D Template

Such methods involve the following steps. First, motion detection and object tracking are performed in the scene. After tracking, a sequence of object-centered crops is established, and scale variations are compensated by size normalization. For a given movement, a periodicity index is computed; if the movement is highly periodic, action recognition is performed. For recognition, the periodic sequence is divided into individual periods using the period estimate. The average period is divided into several time segments, and for each spatial point in every segment the optical-flow features are computed. The flow features of each segment are averaged into a single frame, and the average-flow frames over one activity period constitute the template for each action class.

A typical approach is to build a temporal template as the action model. First, background subtraction is performed, and the foreground blobs extracted from a sequence are combined into a single still image. There are two ways of combining them. One assigns the same weight to all frames in the sequence; the resulting combination is called the motion energy image (MEI). The other assigns different weights to the frames, for example, larger weights to newer frames and smaller weights to older ones; the resulting representation is called the motion history image (MHI). For a given action, the combined images form a template, and recognition proceeds by computing invariant moments. A sketch follows.
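
A minimal numpy sketch of MEI/MHI construction from frame differencing; the sliding-square test sequence, the difference threshold, and the decay scheme are illustrative simplifications.

```python
import numpy as np

def mei_mhi(frames, thresh=30, tau=None):
    """Motion energy image (equal weights) and motion history image
    (recency-weighted), accumulated from frame differences."""
    tau = tau or (len(frames) - 1)
    mei = np.zeros(frames[0].shape, bool)
    mhi = np.zeros(frames[0].shape, float)
    for prev, cur in zip(frames, frames[1:]):
        motion = np.abs(cur.astype(int) - prev.astype(int)) > thresh
        mei |= motion                         # MEI: anywhere motion occurred
        mhi = np.where(motion, tau, np.maximum(mhi - 1, 0))   # MHI: decays
    return mei.astype(float), mhi / tau

# Toy sequence: a bright square sliding to the right.
frames = []
for i in range(10):
    f = np.zeros((64, 64), np.uint8)
    f[20:30, 5 + 4 * i: 15 + 4 * i] = 255
    frames.append(f)

mei, mhi = mei_mhi(frames)
print(mhi[25, 10] < mhi[25, 45])   # True: newer motion has larger MHI values
# Hu-style invariant moments of mei/mhi would then feed the classifier.
```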

Using 3-D Object Model

A 3-D object model is one built for spatial–temporal objects; examples include the generalized cylinder model and the stacked 2-D contour model. A stack of 2-D contours contains both the motion and the shape information of the object, from which geometric features of the object surface, such as peaks, pits, valleys, and ridges, can be extracted. If the 2-D contours are replaced by the foreground blobs from background subtraction, a binary spatial–temporal volume is obtained.

Manifold Learning

Many action-recognition tasks involve data in high-dimensional spaces. The feature space becomes exponentially sparser as the dimension grows, so building an effective model requires a large number of samples. By learning the manifold on which the data lie, the intrinsic dimension of the data can be determined; the number of degrees of freedom in this intrinsic dimension is small, which helps in designing efficient models in a low-dimensional space. The simplest dimensionality-reduction method is principal component analysis (PCA), which assumes the data lie in a linear subspace. In practice, except in special circumstances, the data do not lie in a linear subspace, so techniques that learn the intrinsic geometry of the manifold from a large number of samples are needed. Nonlinear dimensionality-reduction techniques represent data points according to their proximity to one another on the nonlinear manifold. Typical methods include locally linear embedding (LLE) and Laplacian eigenmaps.

8.5.1.2Volumetric Modeling

Typical methods include spatial–temporal filtering, the use of parts of 3-D space (e.g., spatial–temporal points of interest), subvolume matching, and tensor-based methods.

Spatial–temporal Filtering

Spatial–temporal filtering extends spatial filtering: a bank of spatial–temporal filters is applied to the space–time volume of the video, and specific features are derived from the filter-bank responses. It is hypothesized that the spatial–temporal properties of cells in the visual cortex can be described by the structure of suitable spatial–temporal filters, such as oriented Gaussian kernels and their derivatives, or oriented Gabor filter banks. For example, a video clip can be considered a space–time volume in XYT space. For each voxel (x, y, t), a local appearance model can be computed with Gabor filter banks at different orientations and spatial scales and a single temporal scale. Using the average spatial probability of each pixel in a frame, the action can be recognized. Because the analysis is conducted at a single temporal scale, this method cannot be used when the frame rate changes. To handle this, locally normalized histograms of spatial–temporal gradients can be extracted at several temporal scales, and the histograms of the input video and of stored sample videos are then matched using the χ² distance. Another method uses a Gaussian kernel for spatial filtering and a Gaussian derivative for temporal filtering; the responses are thresholded and accumulated into a histogram. This method provides simple and effective features for far-field (non-close-up) video.

With efficient convolution, filtering methods are easy and fast to implement. However, in most applications the filter bandwidth is not known in advance, so large filter banks at multiple spatial and temporal scales are needed to capture actions effectively. Since the output response of each filter has the same dimensionality as the input data, the use of large filter banks at multiple spatial and temporal scales is subject to certain practical restrictions.

Using Parts of 3-D Space

A video can be seen as a collection of many local parts, each with a specific movement pattern. A typical approach is to use the spatial–temporal points of interest described in Section 8.1. Besides the Harris interest-point detector, the spatial–temporal gradients extracted from the training set can also be used for clustering. In addition, a bag-of-words model can be used to represent the action, obtained by extracting spatial–temporal interest points and clustering their features.

Because interest points are local in nature, long-term correlations are ignored. To solve this problem, a correlogram can be used: the video is seen as a series of sets, each comprising the parts within a small sliding time window. This method does not directly build a global geometric model of the local parts but treats them as a bag of features. Different actions can contain similar spatial–temporal components yet have different geometric relationships among them. Incorporating the global geometry into the part-based video representation yields a constellation of parts; when the number of parts is large, this model becomes quite complex. The constellation model and the bag-of-features model can also be combined into a hierarchical structure: at the top, the constellation model has only a small number of components, and each component contains a bag of features at the bottom. Such a hybrid combines the advantages of the two models.

In most part-based methods, part detection is based on linear operations, such as filtering and spatial–temporal gradients, so the descriptors are sensitive to appearance changes, noise, occlusion, and so on. On the other hand, because of their localized nature, these methods are relatively robust to nonstationary backgrounds.

Subvolume Matching

Subvolume matching means matching subvolumes between the video and a template. For example, actions and templates can be matched from the viewpoint that spatial and temporal motion are correlated. The main difference between this approach and the part-based approach is that it does not extract action descriptors at the extrema of scale space but checks the similarity between two spatial–temporal patches. However, computing the correlation over the whole video volume can be time-consuming. One way to address this is to extend the fast Haar features (box features), which have been very successful in object detection, to 3-D. The 3-D Haar features are the outputs of a 3-D filter bank whose filter coefficients are +1 and −1; the filter outputs can be combined by boosting to obtain robust performance. Another method views the video volume as a set of arbitrarily shaped subvolumes, each a homogeneous region obtained by clustering pixels that are close in appearance and space. The given video is thus divided into many subvolumes, or supervoxels, and action templates are matched by searching for the minimal set of regions that maximizes the overlap between the subvolumes and the set of templates. A sketch of 3-D box features follows.
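
A minimal sketch of 3-D box features computed in O(1) per box via an integral volume (the direct 3-D analogue of the 2-D integral image); the feature layout chosen here (+1 on the first temporal half, −1 on the second) is just one illustrative configuration.

```python
import numpy as np

def integral_volume(V):
    """Zero-padded 3-D integral volume: iv[t, y, x] = V[:t, :y, :x].sum()."""
    iv = V.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def box_sum(iv, t0, t1, y0, y1, x0, x1):
    """Sum of V over the half-open box [t0,t1) x [y0,y1) x [x0,x1),
    by 8-corner inclusion-exclusion on the integral volume."""
    return (  iv[t1, y1, x1] - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0] - iv[t0, y0, x0])

rng = np.random.default_rng(0)
V = rng.random((16, 24, 24))                  # a small video volume
iv = integral_volume(V)

# One 3-D Haar-like feature: +1 weight early in time, -1 weight late.
feat = box_sum(iv, 0, 8, 4, 12, 4, 12) - box_sum(iv, 8, 16, 4, 12, 4, 12)
assert np.isclose(feat, V[0:8, 4:12, 4:12].sum() - V[8:16, 4:12, 4:12].sum())
print(feat)
```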

The advantage of subvolume matching is its relative robustness to noise and occlusion; if optical-flow features are incorporated, it is also relatively robust to appearance changes. Its disadvantage is that it is more easily affected by changes in the background.

Tensor-based Methods

A tensor is the multidimensional extension of a matrix. A 3-D spatial–temporal volume can naturally be viewed as a tensor with three independent dimensions. For example, human action, human identity, and joint trajectory can be taken as the three independent dimensions of a tensor. By decomposing the full data tensor into its dominant modes (an extension of PCA), the labels for the movement and for the identity of the actor can be extracted. Of course, the three dimensions of the tensor can also be taken directly as the three space–time dimensions, that is, (x, y, t).

Tensor-based methods provide a direct way to match whole videos, without the middle-level representations used by the previous approaches. In addition, other types of features (such as optical flow and spatial–temporal filter responses) are easily incorporated by increasing the number of tensor dimensions. A sketch of the basic machinery follows.
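
A minimal numpy sketch of the basic tensor machinery: mode-n unfolding and one SVD per unfolding (the core of an HOSVD-style decomposition); the rank-1 synthetic "video" is illustrative.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move axis `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

# A rank-1 synthetic video tensor: one spatial pattern modulated over time.
rng = np.random.default_rng(0)
time_sig = np.sin(np.linspace(0, 4 * np.pi, 30))
pattern = rng.random((16, 16))
V = time_sig[:, None, None] * pattern[None, :, :]      # shape (t, y, x)

# Dominant mode along each axis, HOSVD-style: one SVD per unfolding.
for mode, name in enumerate(["time", "y", "x"]):
    s = np.linalg.svd(unfold(V, mode), compute_uv=False)
    print(name, "energy in first mode:", s[0]**2 / (s**2).sum())   # ~1.0 here
```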

8.5.1.3Parametric Modeling

The first two modeling approaches are more suitable for simple actions; the parametric approach described below is better suited to complex movements extended across the time domain, such as ballet steps in a video or the complex gestures of instrumentalists. Typical methods include the hidden Markov model (HMM), linear dynamic systems, and nonlinear dynamic systems.

Hidden Markov Model

The hidden Markov model (HMM) is a typical state-space model. It is very effective for modeling time-series data and has good generalization and discrimination properties, so it is suitable for applications requiring recursive probability estimation. In constructing a discrete HMM, the state space is taken as a finite set of discrete points, and the evolution over time is modeled as probabilistic transitions from one state to another. The three key problems of the HMM are inference, decoding, and learning. The HMM was first used to recognize tennis strokes (shots), such as the backhand, backhand volley, forehand, forehand volley, and smash; the background-subtracted image models were converted into HMMs corresponding to the particular classes. HMMs can also be used to model time-dependent actions (e.g., gait).

A single HMM can be used to model the action of a single person. For multi-person actions or interactions, pairs of coupled HMMs can represent the alternating actions. Domain knowledge can also be incorporated into the HMM construction, or the HMM can be combined with object detection to exploit the relation between the actions and the objects acted upon. For example, a priori knowledge of state durations can be incorporated into the HMM framework; the resulting model is called a semi-hidden Markov model (semi-HMM). If a discrete label used for high-level behavioral modeling is attached to the state space, a mixed-state HMM is obtained, which can be used to model nonstationary behavior.

Linear Dynamic System

The linear dynamic system (LDS) is more general than the HMM: its state space is not constrained to be a finite set of symbols but can take continuous values in a k-dimensional space, where k is the dimension of the state space. The simplest LDS is the time-invariant first-order Gauss–Markov process:

x(t) = A x(t − 1) + w(t),  w ~ N(0, P)    (8.26)

y(t) = C x(t) + v(t),  v ~ N(0, Q)    (8.27)

where x ∈ R^d is the d-D state vector, y ∈ R^n is the n-D observation vector, d ≪ n, and w and v are the process noise and observation noise, respectively; both are Gaussian with zero mean and covariance matrices P and Q. The LDS can be viewed as a generalization of the HMM, with a Gaussian observation model, to a continuous state space. It is better suited to high-dimensional time-series data but is still not suitable for nonstationary actions. A simulation sketch follows.
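
A minimal simulation of eqs. (8.26) and (8.27) in numpy; the rotation dynamics, dimensions, and noise levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T = 2, 5, 100                          # state dim d << observation dim n
A = np.array([[0.99, -0.10],
              [0.10,  0.99]])                # slow rotation: smooth dynamics
C = rng.normal(size=(n, d))                  # observation (lifting) matrix
P, Q = 0.01 * np.eye(d), 0.05 * np.eye(n)    # process / observation noise cov.

x = np.zeros((T, d)); y = np.zeros((T, n))
x[0] = [1.0, 0.0]
for t in range(1, T):
    x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(d), P)  # eq. (8.26)
    y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(n), Q)      # eq. (8.27)
print(y.shape)    # (100, 5): high-dimensional observations of a 2-D state
```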

Nonlinear Dynamic System

Consider the following series of actions: a person first bends to pick up an item, then walks to a table and puts the item on it, and finally sits down on a chair. This is a series of short segments, each of which can be modeled by an LDS; the whole process can then be seen as a number of transitions between the different LDSs. The most general form of the time-varying LDS is

x(t) = A(t) x(t − 1) + w(t),  w ~ N(0, P)    (8.28)

y(t) = C(t) x(t) + v(t),  v ~ N(0, Q)    (8.29)

Compared with eq. (8.26) and eq. (8.27), both A and C may now change over time. To handle such complex dynamics, the common approach is to use switching linear dynamic systems (SLDS), also called jump linear systems (JLS). An SLDS comprises a set of linear dynamic systems and a switching function; the switching function changes the model parameters by switching between the models. To recognize complex movements, a multilayer method with several levels of abstraction can be used: at the lowest level is the series of input images; one level up are regions (called blobs) of coherent motion; another level up, the trajectories of the different blobs are grouped over time; at the highest level, an HMM represents the complex behavior.

Although SLDSs have more powerful modeling and description capabilities than HMMs and LDSs, learning and inference in an SLDS are much more complex, so approximations are generally required. In practice, determining the appropriate number of switching states is difficult, and it often requires a lot of training data or complex manual tuning.

8.5.2Activity Modeling and Recognition

Compared with actions, activities not only last longer but also often involve many participants in most applications of interest, such as monitoring and content-based indexing. The participants interact not only with each other but also with the contextual entities affected. To model complex scenarios, high-level representation and reasoning about the intrinsic structure and semantics of behavior are necessary (Zheng, 2012).

The methods for activity modeling and recognition can be divided into three groups: those based on graphical models, those based on syntax, and those based on knowledge.

8.5.2.1Graphical Models

Commonly used graphical models include the following:

Belief Network

The Bayesian network is a simple type of belief network. It encodes a set of random variables with local conditional probability densities (CPDs) and then encodes the complex conditional dependencies among them. Dynamic belief networks (DBNs, also known as dynamic Bayesian networks) can be seen as an extension of simple Bayesian networks obtained by incorporating time-dependent random variables. In contrast to the traditional HMM, which can encode only one hidden variable, a DBN can encode complex conditional dependencies among a number of random variables.

Interactions between two persons, such as pointing, pushing, declining, and hugging, can be modeled by a two-step process: first, Bayesian networks are used for pose estimation; then the evolution of posture over time is modeled with a DBN. Action recognition can be based on contextual information provided by other objects in the scene, while the interactions between persons and objects can be interpreted with a Bayesian network.

When the dependencies of many random variables must be considered, the DBN is more versatile than the HMM. However, the temporal model in a DBN is, as in the HMM, Markovian, so the basic DBN model can only deal with sequential actions. With the development of graphical-model learning and inference, DBNs can model structured behavior. However, learning the local CPDs of a large network often requires a lot of training data or complicated manual tuning by experts; these two points place certain limits on the use of DBNs in large-scale environments.

Petri Net

A Petri net is a mathematical tool for describing the links between conditions and events. It is particularly suitable for modeling and visualizing behaviors such as sequencing, concurrency, synchronization, and resource sharing. A Petri net is a bipartite graph containing two kinds of nodes, places and transitions, where a place represents the state of an entity and a transition represents a change of entity state. Consider the example of representing a car-pickup activity with a probabilistic Petri net, as shown in Figure 8.12. In the figure, the places are labeled p1, p2, p3, p4, p5 and the transitions are labeled t1, t2, t3, t4, t5, t6. In this Petri net, p1 and p3 are the start nodes and p5 is the end node. After a car enters the scene, a token is placed at place p1. A transition is then enabled, but it fires only once the associated condition (i.e., the car has stopped in a nearby parking space) is met; at that point, the token at p1 is removed and a token is placed at p2. Similarly, when a person enters the parking lot, a token is placed at p3; the corresponding transition fires once the person has walked to the parked car, and the token is then removed from p3 and placed at p4.

Figure 8.12: A probabilistic Petri net representing a car-pickup activity.

Now a token is present at each of the input places of transition t6, so when the relevant condition (the car leaving the parking space) is satisfied, the transition can fire. Once the car has left, t6 fires, the tokens are removed, and one token is placed at the final place p5. In this example, sequencing, concurrency, and synchronization all occur; a sketch of such a token simulation follows.
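
A minimal place/transition simulation in Python; the place and transition names loosely follow the car-pickup example of Figure 8.12 and are illustrative (a PPN would additionally attach firing weights to the transitions).

```python
class PetriNet:
    """Minimal Petri net: places hold tokens; a transition fires when all of
    its input places hold a token, consuming them and producing new ones."""
    def __init__(self, transitions):
        self.transitions = transitions       # name -> (input places, outputs)
        self.marking = {}                    # place -> token count

    def add_token(self, place):
        self.marking[place] = self.marking.get(place, 0) + 1

    def enabled(self, name):
        ins, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in ins)

    def fire(self, name):                    # call when the condition is observed
        assert self.enabled(name), f"{name} not enabled"
        ins, outs = self.transitions[name]
        for p in ins:
            self.marking[p] -= 1
        for p in outs:
            self.add_token(p)

net = PetriNet({
    "car_stops":    (["p1"], ["p2"]),        # car entered, then parked
    "person_walks": (["p3"], ["p4"]),        # person entered, reached the car
    "car_leaves":   (["p2", "p4"], ["p5"]),  # synchronizes the two threads
})
net.add_token("p1"); net.add_token("p3")     # concurrent start of both threads
net.fire("car_stops"); net.fire("person_walks")
net.fire("car_leaves")                       # enabled only after both threads
print(net.marking.get("p5"))                 # 1: pickup activity recognized
```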

Petri nets have been used to develop high-level interpretation systems for image sequences. In these systems, the structure of the Petri net must be determined in advance, which is very complex work for a large network representing a complex activity. By automating the mapping of a set of logical and spatial–temporal operations onto the graph structure, this task can be semi-automated. With such a method, interactive tools for querying video surveillance can be developed by mapping user queries onto Petri nets. However, this method is based on deterministic Petri nets, so it cannot handle the uncertainty in the low-level modules (such as object detection and object tracking).

Furthermore, real human activities never match a rigorous model exactly; the model must be allowed to deviate somewhat from the expected sequence while penalizing significant deviations. For this reason, the concept of the probabilistic Petri net (PPN) was proposed. In a PPN, each transition is associated with a weight, and the weights record the probability of the transition firing. By using skip transitions and giving them low (penalty) probabilities, robustness to missed observations in the input stream can be achieved. In addition, the uncertainty of object identification and the uncertainty of the unfolding activity can be effectively incorporated into the tokens of the Petri net.

Although the Petri net is a relatively intuitive tool for describing complex activities, it has the disadvantage that the model structure must be described manually; learning the structure from training data has not been formally considered.

Other Graph Models

Considering the shortcomings of the DBN, particularly its restrictions on the ordering of events, a number of other graph models have been proposed. Within the DBN framework, several special graph models have been built that can model complex temporal relations such as sequentiality, duration, parallelism, and synchronization. A typical example, using the past–now–future (PNF) structure, can model complex temporal orderings. In addition, a network can be used to represent an activity in terms of partially ordered intervals. This method treats a temporally extended activity as a series of event labels; with specific constraints related to context and activity, the label sequences are found to possess a partial ordering. For example, one must first open the mailbox and then view the messages. With these constraints, the activity model can be considered a set of subsequences that represent ordering constraints of different lengths.

8.5.2.2Syntax Methods

The two key ingredients of syntactic methods are grammars and rules; these methods are realized mainly by means of suitably designed grammars and rules.

Grammar

A grammar uses a set of production (generative) rules to describe the structure under consideration, and can thereby model body motion and multi-person interaction. As in language modeling, the production rules specify how sentences (activities) are built from words (action primitives) and how to recognize sentences (videos) that satisfy the rules of a given grammar (activity model).

Early grammars for visual activity recognition were used to recognize object-disassembly tasks and did not yet include a probability model. Subsequently, the context-free grammar (CFG) was applied to the modeling and recognition of human motion and multi-person interaction, using a hierarchical process: the lower layer combines HMMs and Bayesian networks, while the high-level interactions are modeled with the CFG. The CFG approach has a strong theoretical foundation for modeling structured processes. In constructing the model, one need only enumerate the primitives to be detected and define the high-level event production rules; once the CFG rules are set up, existing parsing algorithms can be used.

Stochastic Grammar

The stochastic context-free grammar (SCFG) has been used to model the semantics of activities. The algorithms that detect low-level primitives are usually probabilistic in nature; the SCFG, a probabilistic extension of the CFG, is therefore better suited to integration with actual visual models. SCFGs can model the semantics of activities whose structure is assumed known, with HMMs used to detect the low-level primitives. The production rules of the grammar are augmented with probabilities, and skip transitions are introduced; this improves robustness both to insertion errors in the input stream and to failures in the lower-level modules. SCFGs have also been used to model multitask activities (including multiple independent threads of execution and intermittently interacting activities). A small SCFG sketch follows.
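
A minimal SCFG sketch using the third-party NLTK package; the event primitives, rules, and probabilities for this toy "car pickup" activity are invented for illustration.

```python
import nltk   # third-party: pip install nltk

# Toy stochastic grammar over detected event primitives; the probabilities
# (in brackets) must sum to 1 for every left-hand side.
grammar = nltk.PCFG.fromstring("""
    PICKUP -> CARPHASE PERSONPHASE LEAVE [1.0]
    CARPHASE -> 'car_enter' 'car_stop' [0.9]
    CARPHASE -> 'car_enter' [0.1]
    PERSONPHASE -> 'person_enter' 'person_reach_car' [1.0]
    LEAVE -> 'car_leave' [1.0]
""")

parser = nltk.ViterbiParser(grammar)       # most-probable-parse algorithm
events = ["car_enter", "car_stop",
          "person_enter", "person_reach_car", "car_leave"]
for tree in parser.parse(events):
    print(tree.prob())                     # how well the grammar explains it
    tree.pretty_print()
```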

In many situations, it is necessary to associate additional attributes or features with the event primitives. For example, the exact location at which an event primitive occurs may be very important for describing the event, yet it may not be recorded in the event-primitive set. In such cases, an attribute grammar has stronger descriptive power than a traditional grammar.

Example 8.4 Attribute grammar example

As shown in Figure 8.13, production rules and event primitives such as “appear”, “disappear”, “move close”, and “move away” are used to describe the activity. The event primitives are further associated with attributes, such as the location (loc) at which an event appears or disappears, the class of an object (class), the identity of the related entity (idr), and others.

Although the SCFG is more robust than the CFG to missed detections in the input stream, it has the same limitations as the CFG for temporal modeling.

8.5.2.3Methods Based on Knowledge and Logic

Knowledge is closely related to logic.

Logic-based Approach

Logical rules are suitable for representing input knowledge and the results of high-level reasoning, so they can be used to describe general domain knowledge for activity explanation. Logic-based approaches rely on strict logical rules that describe this domain knowledge in order to describe activities. Logical rules are useful for expressing domain knowledge supplied by the user, and for presenting the results of high-level reasoning, in an intuitive, human-readable form.

A declarative model describes all the expected activities in terms of the structure of the scene and its events. The activity model includes the interactions between the objects in the scene. A hierarchical structure can be used to recognize a series of actions performed by an agent, with the symbolic action descriptors extracted from low-level features by some intermediate level. This approach takes into account that an activity is composed of several action threads, each of which can be modeled as a stochastic finite-state automaton, with the constraints between different threads propagated in a temporal logic network. In a logic-programming system used to represent and recognize high-level activities, the event primitives are first detected by low-level modules; a high-level inference engine based on Prolog then recognizes the activities represented by logical rules over the event primitives. Such methods do not directly address the uncertainty in observing the input stream. To do so, the logical model can be combined with a probabilistic model: the logical rules are expressed in first-order predicate logic, each rule is associated with a weight indicating its reliability, and further inference is performed by means of a Markov logic network.

Figure 8.13: An attribute grammar example for boarding.

Ontological Approach

An ontology can standardize the definition of activities, increase interoperability among different systems, and make it easier to compare the performance of various behavior-assessment methods. Typical examples include the analysis of real social interactions in nursing rooms, the classification of conference videos, the modeling of interactive operations in banks, and others.

Internationally, the video event challenge workshops (VECW) have been organized since 2003 to pool a variety of expertise in building domain ontologies based on common knowledge. The workshops have defined six domains of video surveillance:

(1)Perimeter and internal security;

(2)Monitoring of railway crossings;

(3)Visual banking supervision;

(4)Visual monitoring of subways;

(5)Warehouse security;

(6)Airport parking apron safety.

The workshops have also guided the development of two formal languages: the video event representation language (VERL), which supports the ontological representation of complex events composed of simpler subevents, and the video event markup language (VEML), which is used to annotate VERL events in video.

Example 8.5 Ontology Example

Figure 8.14 gives an example of an ontology-based concept description for a car-cruising activity. The ontology counts the number of times a car circles the parking-lot roadway without stopping; when this number exceeds a threshold, a cruising activity is detected.

Figure 8.14: An example for describing the ontology of a car cruising in the parking lot.

8.6Problems and Questions

8-1In English, many words have meanings similar or related to behavior, including action, activity, conduct, and event. Survey the literature, analyze and discuss their precise meanings, and compare their similarities and differences.

8-2Analyze the basis for placing “activities”, “events”, and “actions” at the same level versus at three different levels. Discuss in which situations they need to be distinguished and in which they do not.

8-3*Extending 2-D spatial POI algorithms to 3-D spatial–temporal POIs encounters the problem of anisotropy (the resolutions in the three directions differ). Discuss which methods could solve this problem.

8-4Find two video clips that differ significantly in the number of moving objects, the range/speed of motion, or the direction of motion; detect the spatial–temporal points of interest in each, and analyze the distributions of the detected points.

8-5Which application scenarios would be suitable for each of the three trajectory and path detection methods?

8-6Compare the distance measure defined by eq. (8.20) with the Hausdorff distance and the average-based modified Hausdorff distance (see Section 4.2.1). What is the relationship among them?

8-7What are the distinguishing characteristics of the two path-modeling approaches in Figure 8.7? Which applications are suitable for each of them?

8-8Provide some other examples of automatic activity analysis, and discuss which types of moving-target information are used.

8-9Section 8.4.1 introduces techniques for direct action classification. In what sense are the other two types of classification techniques “indirect”?

8-10Give two actions that require modeling with linear dynamic systems and with nonlinear dynamic systems, respectively. Decompose these actions and indicate their characteristics or features.

8-11Referring to Example 8.4, use production rules and event primitives to describe another activity.

8-12*Provide a line-by-line explanation of the ontology example in Figure 8.14 (equivalent to annotating the program), then reuse the ontology concepts to give a similar description of another activity.

8.7Further Reading

1.Spatial–Temporal Technology

References for the 168 papers surveyed in Figure 8.1 can be found in Zhang (2006, 2007, 2008a, 2009a, 2010, 2011, 2012a, 2013a, 2014, 2015a, 2016, 2017).

2.Spatial–Temporal Interest Points

Further information on the application of spatial–temporal interest points can be found in Bregonzio (2009). An example of action recognition in video can be found in Kara (2011).

3.Dynamic Trajectory Learning and Analysis

Work on the use of an on-board system for learning and detecting multiple paths can be found in Sivaraman (2011).

An example of using trajectories for action recognition can be found in Chen (2016a).

4.Action Classification and Recognition

Action classification and recognition techniques are supported by the extraction of underlying image features. For a survey combining static features (based on shape, edges, or silhouettes), dynamic features (based on optical flow or motion information), and spatial–temporal features (based on volume data in image sequences), see Weinland (2011).

Work on action recognition that combines human posture and context information in static images can be found in Zheng (2012).

A deep learning study combining different features for action recognition can be found in Chen (2016b).

5.Modeling Activity and Behavior

A technique based on learning to identify specific actions and activities can be found in Tran (2008).

The use of action units to recognize expressions (Zhu, 2011b) can be of some help for action and activity modeling.

The understanding of activities and behaviors involves cognitive computation on visual information; see Luo (2010).

Some websites for activity-recognition research are listed in Forsyth (2012).
