8 Spatial–Temporal Behavior Understanding

An important task of image understanding is to interpret the images obtained from a scene, so as to express the meaning of the scene and to guide action. For this purpose, it is necessary to determine which objects are in the scene, and how their positions, attitudes, moving speeds, and spatial relationships change over time. In short, the goal is to grasp the actions of objects in space and time, to determine the purpose of those actions, and thus to understand the semantics of the information they convey.

Image-/video-based automatic understanding of object (human and/or organism) behavior is a very challenging research topic. It includes acquiring objective information from image sequences, processing and analyzing the relevant visual contents (representation and description), and interpreting the image/video information. On this basis, it is further required to learn and recognize the behavior of objects in the scene.

The above work covers a great span and includes a number of research topics. Recently, action detection and recognition have attracted much attention, and the related research has made significant progress. Compared with low-level image processing and middle-level image analysis, high-level behavior identification and interpretation (associated with semantics and intelligence) have just started, with few research results so far. In fact, many definitions of the concepts are still not very clear, and many techniques continue to evolve and be updated.

The sections of this chapter are arranged as follows:

Section 8.1 introduces and overviews the definitions, developments, and different layers of spatial–temporal technology.

Section 8.2 introduces the detection of key points (space–time points of interest) that reflect the collection and change of motion information in the space–time domain.

Section 8.3 discusses the dynamic trajectories and activity paths formed by connecting the points of interest. The learning and analysis of dynamic trajectories and activity paths help to grasp the state of the scene in order to further characterize the scene properties.

Section 8.4 describes several kinds of techniques for action classification and recognition, which are still areas of ongoing research.

Section 8.5 describes the classification of techniques for modeling and recognizing actions and activities, as well as a variety of typical methods.

8.1 Spatial–Temporal Technology

Spatial–temporal technologies are aimed at understanding spatial–temporal behavior, a relatively new subject for the research community. Much of the present work starts at different levels; some general situations are described below.

Figure 8.1: Statistics of the numbers of publications for spatial–temporal technology in the last 12 years.

8.1.1 New Domain

The annual survey series of yearly bibliographies on image engineering, mentioned in Chapter 1, started in 1996 (for the publications of 1995) and has been carried out for 22 consecutive years (Zhang, 2017). When the series entered its second decade (with the literature statistics of 2005), as new hot spots appeared in image engineering research and applications, a new subcategory (C5), spatial–temporal technology (including 3-D motion analysis, gesture and posture detection, object tracking, behavior judgment, and understanding), was added to the image-understanding category (C) (Zhang, 2006). The emphasis here is the comprehensive utilization of the various kinds of information carried by images/videos in order to interpret the dynamics of the scene and the objects inside it.

In the past 12 years, the number of publications belonging to subcategory C5 in the annual survey series has reached a total of 168. Their distribution across the years is shown by the bars in Figure 8.1, in which a third-order polynomial curve fitted to the number of publications per year is also drawn to show the trend. Overall, this is still a relatively new field of research, so its development is not yet very stable.

8.1.2 Multiple Layers

Currently, the research targets of spatial–temporal technology are mainly moving people or things, and the changes of objects (particularly human beings) in the scene. According to the abstraction level of representation and description, multiple layers can be distinguished from bottom to top:

(1) Action primitive: It refers to the atomic building unit of an action, and generally corresponds to the motion information of the scene in a short interval of time.

(2) Action: An ordered combination of a series of action primitives produced by a subject/initiator, which has a specific meaning. Generally, an action represents a motion pattern of one person and lasts only a few seconds. The results of human actions often lead to changes in body posture.

(3) Activity: It refers to a series of actions produced by a subject/initiator. These actions are combined (mainly emphasizing their logical combination) to complete a job or reach a certain goal. An activity is relatively large-scale motion and generally depends on the environment and human interactions. An activity usually represents complex actions of more than one person (with possible interactions) and often lasts for a long period of time.

(4) Event: It refers to certain activities occurring under special circumstances (particular position, time, environment, etc.). Usually, the activity is performed by multiple subjects/initiators (group activity) and/or involves interaction with the external world. The detection of specific events is often associated with abnormal activities.

(5) Behavior: It emphasizes that the subject/initiator (mainly a human being), dominated by ideological motives, changes actions, performs sustained activities, and describes events in a specific environment/context.

In the following, the sport of table tennis is taken as an example to give some typical pictures at all the above layers, as shown in Figure 8.2. A player's positioning, swing, and so on can be seen as typical action primitives, as shown in Figure 8.2(a). A player completing a serve (including dropping the ball, winding up, flicking the wrist, hitting, etc.) or returning the ball (including positioning, stretching the arm, angling the paddle, driving the ball, etc.) are typical actions, as shown in Figure 8.2(b). The whole process in which a player goes around the barrier to retrieve the ball is often seen as an activity; two players hitting the ball back and forth in order to win points is also a typical scene of activity, as shown in Figure 8.2(c). The competition between two or several sport teams is generally seen as an event, and awarding prizes to the players after the game, which leads to the ceremony shown in Figure 8.2(d), is also a typical event. After winning, a player clenching a fist in self-motivation can be regarded as an action, but it is more often seen as a behavior of the player. In addition, when players produce a good exchange, the applauding, shouting, and cheering of the audience are attributed to the behavior of the audience, as shown in Figure 8.2(e).

Figure 8.2: Several pictures of table tennis game.

It should be noted that the concepts of the last three layers are often not strictly distinguished and are used in many studies interchangeably. For example, an activity may be called an event when it refers to some unusual activities (such as a dispute between two persons, or an elderly person falling during a walk); an activity may be called a behavior when the emphasis is mainly on the meaning of the activity, or on its nature (such as calling shoplifting actions or wall-climbing burglary activities theft). In the following discussion, unless special emphasis is made, the term (generalized) activity will be used uniformly to represent the last three layers.

Research in spatial–temporal technology has proceeded from points (points of interest) to curves (trajectory of an object ≈ multiple points), to surfaces (allocation of an activity ≈ compound curves), and to volumes (variation of behaviors ≈ stack of surfaces).

8.2 Spatial–Temporal Interesting Points

Changes of a scene usually come from the motion of objects, especially accelerated motion. Accelerated motion of local structures in video images corresponds to objects with accelerated motion in the scene; such structures appear at image locations with unconventional values. It is expected that these positions (image points) carry information about the object movement in the physical world and about the forces changing object structures in the scene, so they are helpful in understanding the scene.

In spatial–temporal scenes, the detection of points of interest (POI) has expanded from space to space–time (Laptev, 2005).

8.2.1 Detection of Spatial Points of Interest

In the image space, image modeling can be performed by using the linear scale-space representation $L^{sp}: \mathbb{R}^2 \times \mathbb{R}_+ \to \mathbb{R}$ of an image $f^{sp}: \mathbb{R}^2 \to \mathbb{R}$:

$$L^{sp}(x, y; \sigma_l^2) = g^{sp}(x, y; \sigma_l^2) * f^{sp}(x, y) \tag{8.1}$$

That is, $f^{sp}$ is convolved with a Gaussian kernel of variance $\sigma_l^2$:

$$g^{sp}(x, y; \sigma_l^2) = \frac{1}{2\pi\sigma_l^2} \exp\left[-\frac{x^2 + y^2}{2\sigma_l^2}\right] \tag{8.2}$$

The idea behind the typical Harris detector of points of interest is to determine the spatial locations in $f^{sp}(x, y)$ with significant property changes in both the horizontal and vertical directions. For a given observation scale $\sigma_l^2$, these points can be computed with the help of the second-order moment matrix, obtained by summation in a Gaussian window with integration variance $\sigma_i^2$:

$$\mu^{sp}(\cdot; \sigma_l^2, \sigma_i^2) = g^{sp}(\cdot; \sigma_i^2) * \left\{\left[\nabla L^{sp}(\cdot; \sigma_l^2)\right]\left[\nabla L^{sp}(\cdot; \sigma_l^2)\right]^{\mathrm{T}}\right\} = g^{sp}(\cdot; \sigma_i^2) * \begin{bmatrix} (L_x^{sp})^2 & L_x^{sp} L_y^{sp} \\ L_x^{sp} L_y^{sp} & (L_y^{sp})^2 \end{bmatrix} \tag{8.3}$$

where $L_x^{sp}$ and $L_y^{sp}$ are Gaussian derivatives computed as $L_x^{sp} = \partial_x \left[ g^{sp}(\cdot; \sigma_l^2) * f^{sp}(\cdot) \right]$ and $L_y^{sp} = \partial_y \left[ g^{sp}(\cdot; \sigma_l^2) * f^{sp}(\cdot) \right]$ at the local scale $\sigma_l^2$. This second-moment descriptor can be viewed as the covariance matrix of the orientation distribution in a local neighborhood of a 2-D image. Therefore, the eigenvalues $\lambda_1$ and $\lambda_2$ ($\lambda_1 \le \lambda_2$) of $\mu^{sp}$ constitute descriptors of the changes of $f^{sp}$ along the two directions. If both $\lambda_1$ and $\lambda_2$ are very large, there is a point of interest. To detect such points, the positive maxima of a corner function are needed:

$$H^{sp} = \det(\mu^{sp}) - k \cdot \mathrm{trace}^2(\mu^{sp}) = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 \tag{8.4}$$

At a point of interest, the eigenvalue ratio $a = \lambda_2/\lambda_1$ should be large. According to eq. (8.4), for positive local maxima of $H^{sp}$, $a$ must satisfy $k \le a/(1 + a)^2$. So, if $k = 0.25$ is set, the maxima of $H^{sp}$ will correspond to ideally isotropic points of interest (in this case $a = 1$, namely $\lambda_1 = \lambda_2$). Smaller values of $k$ (corresponding to larger values of $a$) allow the detection of sharper points of interest. The value commonly used in the literature is $k = 0.04$, corresponding to the detection of points of interest with $a < 23$.
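As a concrete illustration, the following is a minimal Python sketch of the Harris response of eq. (8.4), built from Gaussian derivative filters. The function name `harris_response`, the default parameter values, and the convention of passing scales as standard deviations are illustrative assumptions, not part of the original formulation; points of interest would then be taken as the positive local maxima of the returned map.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(f, sigma_l=1.0, s=2.0, k=0.04):
    """Harris response of eq. (8.4): H = det(mu) - k * trace(mu)^2."""
    f = f.astype(np.float64)
    # Gaussian derivatives L_x, L_y at the local scale; axes are (y, x)
    Lx = gaussian_filter(f, sigma_l, order=(0, 1))
    Ly = gaussian_filter(f, sigma_l, order=(1, 0))
    # Entries of the second-moment matrix mu (eq. 8.3), averaged in a
    # Gaussian window at the integration scale sigma_i = s * sigma_l
    sigma_i = s * sigma_l
    Sxx = gaussian_filter(Lx * Lx, sigma_i)
    Sxy = gaussian_filter(Lx * Ly, sigma_i)
    Syy = gaussian_filter(Ly * Ly, sigma_i)
    # Corner function: positive local maxima mark points of interest
    return (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
```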

8.2.2 Detection of Spatial–Temporal Points of Interest

The above computation of points of interest in the spatial domain can be extended to the spatial–temporal domain for detecting spatial points of interest at particular instants in time. This trend started about ten years ago (Laptev, 2005). The detection of spatial–temporal points of interest is essential for extracting low-level motion features, and no background modeling is required.

The detection of such points of interest is to find positions where both the spatial and temporal changes have large values. One method first convolves the given video with a 3-D Gaussian kernel at different spatial and temporal scales. Then, the spatial–temporal gradients are calculated at each level of the scale-space representation and are collected from the neighborhoods of various points to produce a stable estimate of the spatial–temporal second-moment matrix. The local features can finally be extracted from this matrix.

Example 8.1 Examples of spatial–temporal points of interest

Figure 8.3 shows a fragment of a table tennis player's swinging and batting, from which several spatial–temporal points of interest have been extracted. The density of the spatial–temporal points of interest along the time axis is related to the frequency of actions, while their spatial positions correspond to the trajectory of motion and the range of the action.

To model a spatial–temporal image sequence, start from $f: \mathbb{R}^2 \times \mathbb{R} \to \mathbb{R}$ and convolve it with an anisotropic Gaussian kernel (with independent spatial variance $\sigma_l^2$ and temporal variance $\tau_l^2$) to form its linear scale-space representation $L: \mathbb{R}^2 \times \mathbb{R} \to \mathbb{R}$:

$$L(\cdot; \sigma_l^2, \tau_l^2) = g(\cdot; \sigma_l^2, \tau_l^2) * f(\cdot) \tag{8.5}$$

where the space–time separable Gaussian kernel is

$$g(x, y, t; \sigma_l^2, \tau_l^2) = \frac{1}{\sqrt{(2\pi)^3 \sigma_l^4 \tau_l^2}} \exp\left[-\frac{x^2 + y^2}{2\sigma_l^2} - \frac{t^2}{2\tau_l^2}\right] \tag{8.6}$$

Using a separate scale parameter for the time domain is critical, because the temporal and spatial extents of events are generally independent. In addition, the events detected by the point-of-interest operator depend on the observation scales in both space and time, so the scale parameters $\sigma_l^2$ and $\tau_l^2$ require separate treatment.

Figure 8.3: An example of spatial–temporal points of interest.

Similar to the spatial domain, the spatial–temporal second-order moment matrix is a 3 × 3 matrix composed of first-order spatial and temporal derivatives, convolved with a Gaussian weighting function $g(\cdot; \sigma_i^2, \tau_i^2)$:

$$\mu = g(\cdot; \sigma_i^2, \tau_i^2) * \begin{bmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{bmatrix} \tag{8.7}$$

wherein the integration scales $\sigma_i^2$ and $\tau_i^2$ are linked to the local scales $\sigma_l^2$ and $\tau_l^2$ by $\sigma_i^2 = s\sigma_l^2$ and $\tau_i^2 = s\tau_l^2$, respectively. The first-order derivatives are

$$L_x(\cdot; \sigma_l^2, \tau_l^2) = \partial_x (g * f), \quad L_y(\cdot; \sigma_l^2, \tau_l^2) = \partial_y (g * f), \quad L_t(\cdot; \sigma_l^2, \tau_l^2) = \partial_t (g * f) \tag{8.8}$$

To detect a point of interest, a search is conducted for $\mu$ with significant eigenvalues $\lambda_1$, $\lambda_2$, and $\lambda_3$. This can be accomplished by extending eq. (8.4) to the spatial–temporal domain, through a combination of the determinant and the trace of $\mu$:

$$H = \det(\mu) - k \cdot \mathrm{trace}^3(\mu) = \lambda_1 \lambda_2 \lambda_3 - k(\lambda_1 + \lambda_2 + \lambda_3)^3 \tag{8.9}$$

To show that the positive local extrema of $H$ correspond to points having large $\lambda_1$, $\lambda_2$, and $\lambda_3$ ($\lambda_1 \le \lambda_2 \le \lambda_3$), define the ratios $a = \lambda_2/\lambda_1$ and $b = \lambda_3/\lambda_1$ and rewrite $H$ as

$$H = \lambda_1^3 \left[ ab - k(1 + a + b)^3 \right] \tag{8.10}$$

For $H \ge 0$, it is required that $k \le ab/(1 + a + b)^3$, and $k$ attains its maximum possible value $k = 1/27$ at $a = b = 1$. For a sufficiently large value of $k$, the positive local extrema of $H$ will correspond to points with great changes along both the spatial and temporal directions. In particular, if $a$ and $b$ are both allowed a maximum of 23, as in the spatial case, then the value of $k$ in eq. (8.9) becomes $k \approx 0.005$. Therefore, the spatial–temporal points of interest in $f$ can be obtained by detecting the positive local spatial–temporal maxima of $H$.
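The following is a hedged Python sketch of this spatial–temporal extension: it builds the second-moment matrix of eq. (8.7) from the Gaussian derivatives of eq. (8.8) and evaluates the response of eq. (8.9). The function names, the default scales (passed as standard deviations), and the simple local-maximum search are all illustrative choices, not prescriptions from the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def st_harris_response(video, sigma_l=2.0, tau_l=1.5, s=2.0, k=0.005):
    """Spatial-temporal response of eq. (8.9) on a (T, H, W) volume:
    H = det(mu) - k * trace(mu)^3, with mu from eqs. (8.7)-(8.8)."""
    v = video.astype(np.float64)
    sig = (tau_l, sigma_l, sigma_l)            # axes are (t, y, x)
    # First-order Gaussian derivatives (eq. 8.8)
    Lt = gaussian_filter(v, sig, order=(1, 0, 0))
    Ly = gaussian_filter(v, sig, order=(0, 1, 0))
    Lx = gaussian_filter(v, sig, order=(0, 0, 1))
    # Second-moment matrix (eq. 8.7), smoothed at integration scales s * local
    isig = tuple(s * c for c in sig)
    grads = (Lx, Ly, Lt)
    M = np.empty(v.shape + (3, 3))
    for i in range(3):
        for j in range(i, 3):
            M[..., i, j] = M[..., j, i] = gaussian_filter(grads[i] * grads[j], isig)
    return np.linalg.det(M) - k * np.trace(M, axis1=-2, axis2=-1) ** 3

def st_points_of_interest(H, size=5):
    """Positive local maxima of H, returned as (t, y, x) coordinates."""
    peaks = (H == maximum_filter(H, size=size)) & (H > 0)
    return np.argwhere(peaks)
```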

8.3 Dynamic Trajectory Learning and Analysis

Dynamic trajectory learning and analysis attempt to provide certainty for monitoring the state of a scene by understanding and characterizing the change of each target's position and the results of its movement (Morris, 2008). A flowchart for dynamic trajectory learning and analysis in video is shown in Figure 8.4. First, the target is detected (e.g., pedestrian detection from a moving car, see Jia (2007)) and tracked; then, the scene model is automatically constructed from the obtained trajectories; finally, the model is used to monitor the situation and provide labels for the activities.

Figure 8.4: Flowchart for dynamic trajectory learning and analysis.

In scene modeling, the points of interest (POI) are first determined within an image area and considered as the locations where some events happen; then, in the learning step, the activities are defined along activity paths (AP). A path characterizes how a target moves between points of interest. Models constructed this way can be called POI/AP models.

The main tasks in POI/AP learning include:

(1) Activity learning: It is conducted by comparing trajectories. Though the lengths of trajectories may differ, the key issue is to maintain the intuitive cognition of similarity.

(2) Adaptation: It studies the techniques for managing the POI/AP model. These techniques must be able to adapt to new activities online, to remove discontinued activities, and to validate the model.

(3) Feature selection: It is the determination of the correct level of kinetic expression for a specific task. For example, using only the spatial information can verify which road a car has passed, but to determine the cause of an accident, the speed information of the car is also required.

8.3.1 Automatic Scene Modeling

Modeling a scene automatically by means of dynamic trajectories includes the following three tasks (Makris, 2005):

8.3.1.1 Object Tracking

Identity maintenance must be achieved for each observable object in every frame. For example, tracking an object through T frames of video generates a series of inferred tracking states:

$$S_T = \{s_1, s_2, \ldots, s_T\} \tag{8.11}$$

where $s_t$ may describe object characteristics such as location, speed, appearance, shape, and the like. This trajectory information constitutes the cornerstone for further analysis. Through careful analysis of this information, it is possible to identify and understand different activities.

8.3.1.2 Detection of Points of Interest

The first task in image scene modeling is to figure out the regions of interest. In a topographic map for object tracking, these regions correspond to the nodes of the graph. The two types of nodes most often considered are in/out regions and stop regions. In a classroom, for example, the former correspond to the doors of the classroom while the latter corresponds to the podium.

In/out regions are the locations where objects enter or leave the field of view, or where tracked objects appear or disappear. These regions are often modeled by means of a 2-D Gaussian mixture model $Z = \sum_{l=1}^{W} w_l N(\mu_l, \Sigma_l)$ with $W$ components, whose parameters can be estimated with the EM algorithm. For entry regions, the data comprise the places determined by the first tracking states, while for exit regions, the data comprise the places determined by the last tracking states. Genuine regions can be distinguished from noise by using a density criterion; the mixture density of component $l$ is defined as

$$d_l = \frac{w_l}{\pi \left| \Sigma_l \right|} > L_d \tag{8.12}$$

It measures the degree of compactness of a mixture component. The threshold

$$L_d = \frac{w}{\pi |C|} \tag{8.13}$$

indicates the average density of a single cluster. Here, $0 < w < 1$ is a user-defined weight, and $C$ is the covariance matrix of all the points in the region dataset. A compact mixture indicates a correct region, while a loose mixture indicates tracking noise resulting from interruptions of tracking.
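A minimal sketch of this in/out-region procedure might look as follows, using scikit-learn's EM-based GaussianMixture. Reading $|\cdot|$ in eqs. (8.12)–(8.13) as a determinant, as well as the helper name and default parameters, are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def in_out_regions(points, W=4, w=0.5):
    """Model entry (or exit) locations with a W-component 2-D Gaussian
    mixture fitted by EM, then keep components whose density d_l
    (eq. 8.12) exceeds the threshold L_d (eq. 8.13)."""
    gmm = GaussianMixture(n_components=W).fit(points)
    C = np.cov(points.T)                        # covariance of all points
    L_d = w / (np.pi * np.linalg.det(C))        # eq. (8.13), |C| read as det
    keep = []
    for l, (w_l, S_l) in enumerate(zip(gmm.weights_, gmm.covariances_)):
        d_l = w_l / (np.pi * np.linalg.det(S_l))    # eq. (8.12)
        if d_l > L_d:                           # compact -> genuine region
            keep.append(l)
    return gmm, keep
```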

Stop regions come from landmark points in the scene, that is, locations where objects tend to remain fixed for some period of time. A stop region can be determined by two methods with different criteria:

(1) The speed of the tracking point remains below a certain predetermined threshold value in this region;

(2) All the tracking points remain inside a ring of limited radius for at least a certain period of time.

By defining a radius and a time constant, the second method can ensure that the object indeed remains within a specific range, while the first method may include objects moving at very low speed. For the analysis of activities, not only must the locations be accurately determined, but the time spent in each stop region should also be grasped.

Figure 8.5: Scheme for trajectory and path detection.

8.3.1.3 Activity Path Finding

To understand behavior, the activity paths (AP) need to be determined. They can be obtained by using the POI to filter out false alarms and interrupted-tracking noise from the training set, keeping only the paths that start after entering a region and end before leaving one. An activity of interest should be defined between two end points (points of interest).

To distinguish between time-varying action objects (such as a person walking or running along a pedestrian walkway), dynamic information varying with time needs to be added to the path. Figure 8.5 shows the three basic structures of path-finding algorithms; their main differences lie in the type of input (motion vectors, trajectories, or video clips) and in the way motion is abstracted.

In Figure 8.5(a), the input is a single trajectory point at time t, and the points in the path are implicitly ordered. In Figure 8.5(b), a full trajectory is used as the input of the learning algorithm to directly establish an output path. In Figure 8.5(c), the decomposition of the path following the video timing is depicted: video clips are broken down into a set of motion words to describe the activities, or in other words, a video clip is annotated with the labels of certain activities according to the appearance of motion words.

8.3.2 Activity Path Learning

As the activity path provides the information for object motion, an original trajectory can be represented by a sequence of dynamic measurements. For example, a common representation of a trajectory is a motion sequence:

$$G_T = \{g_1, g_2, \ldots, g_T\} \tag{8.14}$$

where the motion vector is

$$g_t = [x_t, y_t, v_{xt}, v_{yt}, a_{xt}, a_{yt}]^{\mathrm{T}} \tag{8.15}$$

It represents the dynamic parameters obtained from object tracking at time t, that is, the position $[x_t, y_t]^{\mathrm{T}}$, the speed $[v_{xt}, v_{yt}]^{\mathrm{T}}$, and the acceleration $[a_{xt}, a_{yt}]^{\mathrm{T}}$.
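Given only tracked positions, the velocity and acceleration entries of eq. (8.15) can be approximated by finite differences. The sketch below is one such construction; the helper name and the use of `np.gradient` are illustrative assumptions.

```python
import numpy as np

def motion_vectors(xy, dt=1.0):
    """Stack g_t = [x, y, vx, vy, ax, ay]^T (eq. 8.15) for all t,
    estimating velocity and acceleration by finite differences."""
    xy = np.asarray(xy, dtype=np.float64)      # (T, 2) tracked positions
    v = np.gradient(xy, dt, axis=0)            # velocity estimate
    a = np.gradient(v, dt, axis=0)             # acceleration estimate
    return np.hstack([xy, v, a])               # (T, 6); row t holds g_t
```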

Figure 8.6: Steps for path learning.

It is possible to learn the AP in an unsupervised way, using only the trajectories. The basic process, shown in Figure 8.6, includes three steps. Although the figure shows three separate sequential steps, they are often combined. In the following, some detailed explanations of the three steps are provided.

8.3.2.1 Preprocessing of Trajectory

Most of the work in path learning is to obtain trajectories suitable for clustering. The main difficulty comes from the time-varying characteristics of tracking, which lead to trajectories of inconsistent length. Steps must be taken to safeguard meaningful comparison between inputs of different sizes. In addition, the trajectory representation used for clustering should preserve the intuitive similarity of the original trajectories.

Preprocessing of trajectories consists of two tasks: normalization, to ensure that all the paths have the same length, and dimension reduction, to map the paths to a new low-dimensional space in order to perform more robust clustering.

(1) The purpose of normalization is to ensure that all the trajectories have the same length Ta. Two simple techniques are zero filling and extension. Zero filling adds items of value zero to the end of a short trajectory. Extension prolongs the last part of the original trajectory until the required length is achieved. Both are likely to expand the trajectory space to a very large one. Besides examining the training set to determine the trajectory length Ta, it is also possible to make use of a priori knowledge for resampling and smoothing. Resampling combined with interpolation can ensure that all trajectories have the same length Ta (see the sketch after this list). Smoothing can be used to eliminate noise, and the smoothed trajectory can be interpolated and sampled to a fixed length.

(2) Dimensionality reduction maps the trajectories to a new low-dimensional space, so that more robust clustering methods can be used. This can be accomplished by assuming a trajectory model and determining the parameters that best describe it. Commonly used techniques include vector quantization, polynomial fitting, multiresolution decomposition, hidden Markov models, subspace methods, spectral methods, and kernel methods.
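As an illustration of the resampling option in item (1), the following sketch normalizes a trajectory to a fixed length Ta by arc-length parameterization and linear interpolation; the helper name and the choice of arc-length (rather than time) parameterization are assumptions.

```python
import numpy as np

def resample_trajectory(xy, Ta=64):
    """Normalize a trajectory to length Ta by arc-length parameterization
    and linear interpolation (assumes the trajectory is not a single
    stationary point)."""
    xy = np.asarray(xy, dtype=np.float64)
    seg = np.linalg.norm(np.diff(xy, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])     # cumulative arc length
    s_new = np.linspace(0.0, s[-1], Ta)
    return np.stack([np.interp(s_new, s, xy[:, 0]),
                     np.interp(s_new, s, xy[:, 1])], axis=1)   # (Ta, 2)
```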

Vector quantization can be achieved by limiting the number of unique trajectories. If the dynamics of the trajectory are ignored and only the spatial coordinates are considered, the trajectory can be seen as a simple 2-D curve, and each coordinate can be approximated by a least-mean-square polynomial of order m (each $w_k$ is a weight coefficient):

$$x(t) = \sum_{k=0}^{m} w_k t^k \tag{8.16}$$
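For instance, the weights $w_k$ of eq. (8.16) can be estimated by least squares with NumPy's polynomial fitting; the coordinate values below are synthetic, purely for illustration.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 100)        # normalized time stamps (synthetic)
x = 0.3 + 1.2 * t - 0.8 * t ** 2      # toy x(t) coordinate of a trajectory
m = 3                                 # polynomial order
w = np.polyfit(t, x, m)               # weight coefficients w_k (eq. 8.16)
x_hat = np.polyval(w, t)              # reduced-dimension reconstruction
```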

In the spectral method, a similarity matrix S can be built for the training set, where each element $s_{ij}$ represents the similarity between trajectory i and trajectory j. In addition, a normalized Laplacian matrix L is built:

$$L = D^{-1/2} S D^{-1/2} \tag{8.17}$$

where D is a diagonal matrix whose i-th diagonal element is the sum of the i-th row of S. By decomposing L, the largest K eigenvalues can be determined. If the corresponding eigenvectors are collected into a new matrix, then each row of this matrix corresponds to a trajectory transformed into spectral space, and the spectral trajectories can be clustered with the k-means method.
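A compact sketch of this spectral procedure, assuming a precomputed similarity matrix S, might be as follows; the row normalization before k-means is a common stabilization step added here, not mandated by the text.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_trajectory_clusters(S, K):
    """Eq. (8.17): L = D^{-1/2} S D^{-1/2}; embed each trajectory with
    the top-K eigenvectors and cluster the rows with k-means."""
    d = S.sum(axis=1)                          # degree of each trajectory
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ S @ D_inv_sqrt
    vals, vecs = eigh(L)                       # eigenvalues in ascending order
    U = vecs[:, -K:]                           # top-K eigenvectors as columns
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # row normalization
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
```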

8.3.2.2 Trajectory Clustering

Clustering is a commonly used machine-learning technique for determining the structure of unlabeled data. While observing the scene, motion trajectories can be collected and combined into similar categories. In order to produce meaningful clusters, the trajectory-clustering process needs to consider three issues: the definition of a distance (i.e., similarity) measure; the strategy for cluster updating; and cluster validation.

1. Distance/similarity measure:

Clustering technology depends on the definition of a distance (similarity) measure. As mentioned above, a major problem in trajectory clustering is that different trajectories generated by the same activity may have different lengths. To solve this problem, either some preprocessing method can be used, or a distance measure insensitive to length can be defined. If two trajectories Gi and Gj have the same length, the Euclidean distance applies:

$$d_E(G_i, G_j) = \sqrt{(G_i - G_j)^{\mathrm{T}} (G_i - G_j)} \tag{8.18}$$

If the two trajectories Gi and Gj have different lengths m and n (m > n), an improvement that keeps the Euclidean comparison meaningful across different dimensions is to compare the two trajectory vectors point by point and reuse the last point $g_{j,n}$ of the shorter trajectory to accumulate the remaining distortion:

$$d_{ij} = \frac{1}{m} \left\{ \sum_{k=1}^{n} d_E(g_{i,k}, g_{j,k}) + \sum_{k=1}^{m-n} d_E(g_{i,n+k}, g_{j,n}) \right\} \tag{8.19}$$

The Euclidean distance is relatively simple, but it performs badly in the presence of time offsets, because only aligned sequences can match; here, the Hausdorff distance can be considered. Alternatively, there is a distance measure that does not rely on the complete trajectory (and disregards outliers). Suppose the lengths of trajectories $G_i = \{g_{i,k}\}$ and $G_j = \{g_{j,l}\}$ are $T_i$ and $T_j$, respectively; then

$$D_0(G_i, G_j) = \frac{1}{T_i} \sum_{k=1}^{T_i} d_0(g_{i,k}, G_j) \tag{8.20}$$

where

$$d_0(g_{i,k}, G_j) = \min_l \left[ \frac{d_E(g_{i,k}, g_{j,l})}{Z_l} \right], \quad l \in \{(1-\delta)k, \ldots, (1+\delta)k\} \tag{8.21}$$

where $Z_l$ is a normalization constant given by the variance at point l. $D_0(G_i, G_j)$ is used to compare a trajectory with the existing clusters; if two raw trajectories are to be compared, $Z_l = 1$ can be used. The distance measure thus defined is the average normalized distance from each point to its best matching point, where the best match is searched in a sliding window around the corresponding index, with a width determined by $2\delta$.
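The two distance measures above can be sketched as follows; the boundary handling of the window in eq. (8.21) and the default value of δ are illustrative assumptions of this sketch.

```python
import numpy as np

def d_unequal(Gi, Gj):
    """Eq. (8.19): compare trajectories of lengths m and n (m > n),
    reusing the shorter one's last point for the unmatched tail."""
    if len(Gi) < len(Gj):
        Gi, Gj = Gj, Gi
    m, n = len(Gi), len(Gj)
    d = np.linalg.norm(Gi[:n] - Gj, axis=1).sum()        # matched part
    d += np.linalg.norm(Gi[n:] - Gj[-1], axis=1).sum()   # tail vs g_{j,n}
    return d / m

def d_window(Gi, Gj, delta=0.2, Z=None):
    """Eqs. (8.20)-(8.21): average of each point's best match within the
    window {(1-delta)k, ..., (1+delta)k}; Z holds per-point variances
    (Z=None gives Z_l = 1, as when comparing two raw trajectories)."""
    Ti, Tj = len(Gi), len(Gj)
    total = 0.0
    for k in range(Ti):
        lo = min(int((1 - delta) * k), Tj - 1)
        hi = min(int((1 + delta) * k) + 1, Tj)
        hi = max(hi, lo + 1)                   # keep the window nonempty
        dists = np.linalg.norm(Gj[lo:hi] - Gi[k], axis=1)
        if Z is not None:
            dists = dists / Z[lo:hi]
        total += dists.min()
    return total / Ti
```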

2. Clustering process and verification:

The preprocessed trajectories can be combined using unsupervised learning techniques, breaking the trajectory space into perceptually similar clusters (such as roads). There are several approaches to clustering learning, such as iterative optimization, online adaptation, hierarchical approaches, neural networks, and co-occurrence decomposition.

The paths learned with a clustering algorithm need further validation, because the true number of categories is not known. Most clustering algorithms require an initial choice of the desired number of classes K, but this choice is often incorrect. To this end, clustering can be conducted for different values of K, and the K corresponding to the best result is taken as the real number of clusters. A tightness and separation criterion (TSC) can be used here as the judgment criterion; it compares the distances between trajectories in the same cluster with the distances between trajectories in different clusters. Given a training set $D_T = \{G_1, \ldots, G_M\}$, then

$$\mathrm{TSC}(K) = \frac{\frac{1}{M} \sum_{j=1}^{K} \sum_{i=1}^{M} f_{ij}^2 \, d_E^2(G_i, c_j)}{\min_{i \ne j} d_E^2(c_i, c_j)} \tag{8.22}$$

where $f_{ij}$ is the fuzzy membership of trajectory $G_i$ in cluster $C_j$ (whose prototype is $c_j$).
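A direct transcription of eq. (8.22) for flattened, equal-length trajectory vectors might look like this; the array layout and the convention that a smaller TSC value indicates tighter, better-separated clusters (and thus a better K) are assumptions of the sketch.

```python
import numpy as np

def tsc(G, centers, F):
    """Eq. (8.22): tightness (membership-weighted squared distances to
    cluster prototypes) over separation (smallest squared distance
    between two prototypes).

    G: (M, d) trajectory vectors; centers: (K, d) prototypes c_j;
    F: (M, K) fuzzy memberships f_ij.
    """
    M, K = F.shape
    d2 = ((G[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (M, K)
    tightness = (F ** 2 * d2).sum() / M
    c2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    separation = c2[~np.eye(K, dtype=bool)].min()
    return tightness / separation

# Model selection: evaluate tsc for candidate K values and keep the K
# with the smallest score.
```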

8.3.2.3 Path Modeling

After trajectory clustering, a graph model can be built from the obtained paths for effective reasoning. A path model is a compact representation of a cluster. Path modeling can be conducted in two ways, as shown in Figure 8.7. One considers the complete path by using the cluster center and an envelope (to indicate the range of the path), as shown in Figure 8.7(a): the path from end to end has not only the average center line but also envelopes on both sides indicating the path range, and along the path there may be some intermediate states providing the measurement sequence. The other decomposes the whole path into a number of subpaths (using a tree structure), as in Figure 8.7(b): the path is represented as a tree of subpaths, and the probability of taking a subpath is marked on the arc pointing from the current node to the next node.

Figure 8.7: Two ways for path modeling.

8.3.3 Automatic Activity Analysis

Once the scene model is established, the activities and behaviors of the objects can be analyzed. For instance, a basic function of surveillance video is to validate events of interest. In general, whether an event is interesting can only be determined under specific circumstances. For example, a parking management system will focus on whether there is space to park, while in a smart meeting room the concern would be the communication among participants. In addition to simply identifying particular behaviors, all atypical events need to be checked. By observing a scene for a long time, the system can analyze a series of activities and can learn to find what the events of interest are.
