5.1. Introduction

The design and implementation of a complete artificial vision system is a daunting challenge. The computer vision community has made significant progress in many areas, but the ultimate goal is still far off. A key component of a general computer vision system is a computational framework that can address a wide range of problems in a unified way. We have developed such a framework over the past several years [40], which is the basis of the augmented framework presented in this chapter. It is based on a data representation formalism that uses tensors and an information propagation mechanism termed tensor voting.

Most computer vision problems are inverse problems, because the imaging process maps 3D "world" features onto 2D arrays of pixels. The loss of information due to the reduction in dimensionality and the nonlinearity of the imaging process, combined with unavoidable sources of noise, such as limited sensor resolution, quantization of continuous information into pixels, and photometric and projective distortion, guarantee that computer vision algorithms operate on data sets that include numerous outliers and are severely corrupted by noise. Many scene configurations could have generated a given image (see Figure 5.1). An automated computer vision system must impose constraints or prior knowledge about the appearance of familiar objects to select the most likely scene interpretation. In this chapter, we present a methodology for the perceptual organization of tokens. The tokens represent the position of elements such as points and curvels, and can also convey other information, such as curve or surface orientation. The organization is achieved by enforcing constraints, as suggested by Gestalt psychology, within the proposed computational framework. All processing is strictly bottom-up without any feedback loops or top-down stages.

Figure 5.1. An infinite set of scene configurations can produce the same image. Most people, however, would perceive a flat triangle in the above example.


5.1.1. Motivation

The motivation for designing and implementing the tensor voting framework came from the observation that even though many computer vision problems are similar, different algorithms were applied to each of them. In most cases, we start from one or more images and attempt to infer descriptions of the depicted scenes in terms of appearance, shape, motion, and so on. The type of primitives, the constraints, and the desired solution may differ; therefore, a general framework must be flexible enough to adjust to different settings and incorporate different constraints. The primitives used in computer vision problems span a wide range of features, including differential image intensity properties, such as edges and corners; elementary structures, such as elementary curves and surface patches; motion vectors; reflectance properties; and texture descriptors. The most generally applicable constraint is the "matter is cohesive" constraint proposed by Marr [39], which states that objects in real scenes are continuous almost everywhere. Other powerful constraints exist, but are usually applicable to specific problems where certain assumptions hold. The desired type of scene descriptions is a large issue, some aspects of which are addressed in the next subsection.

Despite the apparent differences among computer vision problems, the majority of them can be posed as perceptual organization problems. In the early 1900s, the Gestalt psychologists [70] formulated a set of principles which guide perceptual organization in the human visual system. These principles include similarity, proximity, good continuation, simplicity, closure, colinearity, and co-curvilinearity, which cooperate, or sometimes compete, to produce salient structures formed by the observed tokens (Figure 5.2). We claim that computer vision problems can be addressed within a Gestalt framework, where the primitives are grouped according to the above criteria to give rise to salient structures. For instance, descriptions in terms of shape can be generated by grouping tokens according to similarity, proximity, and good continuation. In other applications, salient groups might appear due to similarity in other properties, such as motion, texture, or surface normal.

Figure 5.2. Simple examples of the Gestalt principles. In (a) the dots are grouped in four groups according to similarity. In (b) the darker dots are grouped in pairs, as are the lighter ones. In (c) the most likely grouping is A to B, and not A to C, due to the smooth continuation of curve tangent from A to B. In (d), the factor of closure generates the perception of an ellipse and a diamond.


The unifying theme in computer vision problems, from a perceptual organization point of view, is the search for salient structures that arise due to nonaccidental alignment of the input tokens and therefore must bear semantic importance. The term perceptual saliency indicates the quality of features to be important, stand out conspicuously, be prominent, and attract our attention. In the remainder of this chapter, we demonstrate how a number of computer vision problems can be formulated as perceptual organization problems and how the tensor voting framework provides the machinery to address these problems in a unified way with minor problem-specific adaptations.

5.1.2. Desirable descriptions

An alternative approach to ours, and one that has been widely used, is the formulation of computer vision problems within an optimization framework. Objective functions can be set up according to the requirements and constraints of the problem at hand by imposing penalties to primitives that deviate from the desired models. Local penalties are aggregated into an energy or cost function, which is then optimized using an appropriate method. Due to the inverse nature of computer vision, optimization methods usually need simplifying assumptions to reduce the complexity of the objective functions, as well as careful initialization to avoid convergence at local optima.

The largest difference, however, between optimization methods and the approach proposed here is that they arrive at global solutions, while we claim that local descriptions are more appropriate. Under a global, model-based approach, we cannot discriminate between local model misfits and noise. Furthermore, global descriptions are restricted to features that can be expressed in parametric form. A local description is more general in the sense that it can encode any smooth feature, with a finite number of discontinuities, as a set of tokens. A hierarchy of more abstract descriptions can be derived from the low-level, local one if desired, while the opposite is not always feasible.

A second requirement on the descriptions we wish to infer is that they should be in terms of layers. Psychological evidence suggests that human perception also supports a layered description of the world. For instance, the illusory contour that clearly appears in Figure 5.3 can be explained by a scene interpretation that consists of a circle on top of a set of lines. Even though we do not aim to emulate the human visual system, this evidence suggests that a layered representation is beneficial for a general computer vision system.

Figure 5.3. Illusory contour and its interpretation in terms of layers.


Finally, we choose an object-based representation over a viewpoint-based one. With this choice, we avoid introducing unnecessary viewpoint-dependent elements in intermediate stages of the computation. For instance, if uniqueness along the lines of sight is a reasonable or desired assumption for a given problem, it can be enforced at later stages, thus avoiding the enforcement of constraints too early, when they might be detrimental to the solution.

5.1.3. Our approach

In this section, we briefly review the tensor voting framework, which was originally presented in [40], including extensions that have been developed in the past few years. The two main aspects of the framework are the representation by tensors and the information propagation mechanism by tensor voting. Its purpose is to serve as a computational mechanism for perceptual grouping of oriented and unoriented tokens generated based on image or other primitives. It has mainly been applied to midlevel vision problems, but it is suitable for any problem, of any dimensionality, that can be formulated as a perceptual organization problem. The novelty of our approach is that there is no objective function that is explicitly defined and optimized according to global criteria. Instead, tensor voting is performed locally, and the saliency of perceptual structures is estimated as a function of the support tokens receive from their neighbors. Tokens with compatible orientations that can form salient structures reinforce each other. The support of a token for its neighbors is expressed by votes that are cast according to the Gestalt principles of proximity, colinearity and co-curvilinearity.

Data representation

The representation of a token originally consisted of a symmetric second-order tensor that encodes perceptual saliency. The tensor essentially indicates the saliency of each type of perceptual structure the token belongs to and its preferred normal or tangent orientations. Tensors were first used as a signal processing tool for computer vision applications by Granlund and Knutsson [12] and Westin [71]. Our use of tensors differs in that our representation is not signal based, but rather symbolic, where hypotheses for the presence of a perceptual structure at a given location are represented as tokens with associated second-order tensors that encode the most likely type of structure and its preferred tangent or normal orientations. The power of this representation lies in that all types of saliency are encoded by the same tensor.

The second-order symmetric tensors fulfill the requirements set in the previous paragraphs, since they are well suited as descriptors of local structure. As opposed to scalar saliency values, the representation by tensors is richer in information, since it also encodes orientation hypotheses at each location, thus allowing the application of Gestalt principles such as good continuation, colinearity, and co-curvilinearity. Even though this is also feasible with vectors, as in the work of Guy and Medioni [15, 16], the tensors can simultaneously encode all potential types of structures, such as surfaces, curves and regions, as well as the uncertainty of the information.

A representation scheme sufficient for our purposes must be able to encode both smooth perceptual structures and discontinuities, which come in two types: orientation and structural discontinuities. The former occur at locations where a perceptual structure is continuous, but its orientation is not, or where multiple salient structures, such as curves or surfaces, meet. Curve orientation discontinuities occur at locations where multiple curve segments intersect, while surface orientation discontinuities occur where multiple surface patches intersect. In other words, whereas there is only one orientation associated with a location within a smooth curve segment, surface patch, or region boundary, there are multiple orientations associated with locations where a discontinuity occurs. Hence, the desired data representation is one that can encode more than one orientation at a given location. It turns out that a second-order symmetric tensor possesses precisely this property, as shown in Sections 5.3 and 5.4.
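As a minimal illustration of this property, the following Python sketch (our own, not part of the framework's published implementation; the function names and the 2D "normal-space" encoding convention are assumptions for this example) shows how summing second-order symmetric tensors preserves a single shared orientation as a dominant "stick" component, while conflicting orientations accumulate into an orientation-free "ball" component:

```python
import math

def tensor_from_tangent(tx, ty):
    """Encode a 2D curve element as a second-order symmetric tensor.

    A curvel with unit normal n = (nx, ny) is represented by the outer
    product n n^T, a rank-1 "stick" tensor with eigenvalues 1 and 0.
    """
    nx, ny = -ty, tx  # normal = tangent rotated by 90 degrees
    return (nx * nx, nx * ny, ny * ny)  # (Txx, Txy, Tyy)

def add_tensors(a, b):
    return tuple(x + y for x, y in zip(a, b))

def decompose(t):
    """Return (stick, ball) saliencies, i.e. (l1 - l2, l2), using the
    closed-form eigenvalues of a 2x2 symmetric tensor."""
    txx, txy, tyy = t
    mean = 0.5 * (txx + tyy)
    d = math.sqrt((0.5 * (txx - tyy)) ** 2 + txy ** 2)
    return 2.0 * d, mean - d  # (l1 - l2, l2)

# Two curvels with the same tangent reinforce the stick component ...
t_aligned = add_tensors(tensor_from_tangent(1.0, 0.0),
                        tensor_from_tangent(1.0, 0.0))
# ... while two orthogonal tangents sum to a pure ball: the tensor still
# records that two incompatible orientations coexist at that location.
t_mixed = add_tensors(tensor_from_tangent(1.0, 0.0),
                      tensor_from_tangent(0.0, 1.0))

print(decompose(t_aligned))  # (2.0, 0.0): salient oriented structure
print(decompose(t_mixed))    # (0.0, 1.0): orientation discontinuity/junction
```

The key point of the sketch is that no information is lost at a junction: the ball component retains the evidence that multiple orientations meet there, which a single orientation vector could not express.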

The second type of discontinuities are structural discontinuities. These occur at locations such as the endpoints B and C of Figure 5.8(a). They are first-order discontinuities, since the perceptual structure is no longer continuous there. Nonaccidental terminations of perceptual structures carry significant weight and should be explicitly detected and represented. The second-order symmetric tensors fail to describe structural discontinuities because they cannot capture first-order properties, and the second-order properties of the structure remain invariant at its boundaries. To address this shortcoming of the framework, as it was published in [40], vectors (first-order tensors) were introduced [69]. More specifically, besides the second-order tensor, each token is associated with a polarity vector that encodes the likelihood of the token being a termination of a perceptual structure. Polarity vectors are sensitive to first-order properties, such as the distribution of neighboring tokens around a given token. Structure terminations can be detected based on their essential property that all their neighbors, at least locally, lie within a single half-space.
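This half-space property can be sketched in a few lines of Python (a simplified illustration under our own definitions; in the actual framework, polarity is accumulated from first-order votes rather than computed directly from neighbor positions):

```python
import math

def polarity(token, neighbors):
    """Resultant of unit vectors from `token` toward each neighbor.

    An interior token of a curve has neighbors on both sides, so the
    contributions largely cancel; an endpoint's neighbors all lie in one
    half-space, leaving a large resultant pointing into the structure.
    """
    px = py = 0.0
    for (x, y) in neighbors:
        dx, dy = x - token[0], y - token[1]
        d = math.hypot(dx, dy)
        if d > 0.0:
            px += dx / d
            py += dy / d
    return px, py

# Tokens sampled along a straight curve segment.
curve = [(float(i), 0.0) for i in range(5)]

# Interior token (2, 0): contributions from both sides cancel.
p_int = polarity(curve[2], curve[:2] + curve[3:])
# Endpoint (0, 0): all neighbors lie on the same side.
p_end = polarity(curve[0], curve[1:])

print(p_int)  # (0.0, 0.0): low polarity, interior point
print(p_end)  # (4.0, 0.0): high polarity, pointing into the curve
```

The magnitude of the resultant serves as a termination saliency, and its direction indicates where the interior of the structure lies.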

Tensor Voting

The way to encode primitives in the representation scheme is demonstrated in the following sections. The starting point in our attempt to infer salient structures and scene descriptions is a set of unoriented or oriented tokens. Unoriented tokens express the hypothesis that a perceptual structure of a yet unknown type exists at the token's location. Oriented tokens can be elementary curves (curvels), elementary surfaces (surfels), and so on. They express the hypothesis that a perceptual structure with the given orientation goes through the location of the token. The question is, how should these hypotheses be combined in order to derive the saliency of each token?

We propose to do this by tensor voting, a method of information propagation in which tokens convey their orientation preferences to their neighbors in the form of votes. First- and second-order votes, which are respectively vectors and second-order tensors, are cast from token to token. The second-order vote is a second-order tensor that encodes the orientation, in terms of normals and tangents, that the receiver would have if the voter and receiver were part of the same smooth perceptual structure. The first-order vote is a vector that points toward the voter, and thus the interior of the structure, under the hypothesis that the receiver is a boundary of that structure.

Each vote is an estimate of the orientation or termination of a perceptual structure consisting of just two tokens: the voter and the receiver. In essence, simple, smooth perceptual structures are fitted between the two locations to generate the orientation estimates at the receiver. Following the Gestalt principle of proximity, the strength of the votes attenuates with distance, reducing the influence of distant tokens and the possible interference from unrelated ones. The strength of the vote also decreases with increased curvature of the hypothesized structure, making straight continuations preferable to curved ones, in accordance with the principles of smooth continuation and simplicity.
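A single 2D second-order (stick) vote under these principles can be sketched as follows. The decay function exp(-(s² + cκ²)/σ²), with s the arc length and κ the curvature of the circular arc joining voter and receiver, follows the form used in the framework; the parameter values, the 45-degree cutoff, and the function names are assumptions of this sketch:

```python
import math

def stick_vote(voter, normal_angle, receiver, sigma=5.0, c=1.0):
    """Second-order stick vote cast from `voter` to `receiver`.

    The two tokens are joined by a circular arc tangent to the voter's
    orientation; the vote strength decays with the arc length s and the
    curvature kappa: DF = exp(-(s^2 + c * kappa^2) / sigma^2).
    Returns the vote as a symmetric 2x2 tensor (Txx, Txy, Tyy).
    """
    dx, dy = receiver[0] - voter[0], receiver[1] - voter[1]
    l = math.hypot(dx, dy)
    if l == 0.0:
        return (0.0, 0.0, 0.0)
    # angle between the voter's (undirected) tangent and the join direction
    theta = math.atan2(dy, dx) - (normal_angle - math.pi / 2)
    theta = math.atan2(math.sin(theta), math.cos(theta))  # wrap to [-pi, pi]
    if theta > math.pi / 2:     # fold: the tangent has no preferred direction
        theta -= math.pi
    elif theta < -math.pi / 2:
        theta += math.pi
    if abs(theta) > math.pi / 4:
        return (0.0, 0.0, 0.0)  # beyond 45 degrees: no smooth continuation
    s = l if theta == 0.0 else theta * l / math.sin(theta)  # arc length
    kappa = 2.0 * math.sin(theta) / l                       # arc curvature
    w = math.exp(-(s * s + c * kappa * kappa) / (sigma * sigma))
    # receiver's most likely normal: voter's normal rotated by 2 * theta
    n = normal_angle + 2.0 * theta
    nx, ny = math.cos(n), math.sin(n)
    return (w * nx * nx, w * nx * ny, w * ny * ny)

# A voter at the origin with a horizontal tangent (normal pointing up)
# supports a collinear receiver more strongly than one off the tangent line.
v_inline = stick_vote((0.0, 0.0), math.pi / 2, (3.0, 0.0))
v_curved = stick_vote((0.0, 0.0), math.pi / 2, (3.0, 2.0))
print(v_inline[0] + v_inline[2])  # vote strength ~ 0.70
print(v_curved[0] + v_curved[2])  # weaker, ~ 0.55
```

The trace of the returned tensor is the vote strength; accumulating such tensors at each token and analyzing the eigensystem of the sum yields the saliency estimates described above.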

A large number of first- and second-order votes are accumulated at each location. By analyzing the consistency of the orientation estimates and the amount of support a token receives, we can determine the type of structure that exists at the location and its saliency. The aggregation of support via tensor voting is a generalization of the Hough transform [23] and was first proposed by Guy and Medioni [15, 16] using a vector-based scheme.

5.1.4. Chapter overview

This chapter is organized as follows:

- Section 5.2 is a review of related work in perceptual organization.

- Section 5.3 presents the tensor voting framework in 2D.

- Section 5.4 shows how it can be generalized to 3D.

- Section 5.5 presents the ND version of the framework and a computational complexity analysis.

- Section 5.6 addresses how computer vision problems can be posed within the framework and presents our results on stereo and motion analysis.

- Section 5.7 concludes the chapter after pointing out some issues that still remain open.
