8.4. Representation and Indexing

We first discuss the ultimate form of spatial data: grouping the data into object silhouettes, clusters of points, or point sets. In the subsections that follow, we leave the spatial domain to condense the pictorial information into feature values.

8.4.1. Grouping data

In content-based image retrieval, the image is often divided into parts before features are computed from each part. Partitioning the image aims at obtaining more selective features: computing features over selected pixels trades spatial specificity against the larger information content of features computed over the undivided image. We distinguish the following partitionings:

  • When searching for an object, it would be most advantageous to do a complete object segmentation first:

    Strong segmentation is a division of the image data into regions in such a way that region T contains the pixels of the silhouette of object O in the real world and nothing else, specified by T = O.

It should be noted immediately that object segmentation for broad domains of general images is not likely to succeed, with a possible exception for sophisticated techniques in very narrow domains.

  • The difficulty of achieving strong segmentation may be circumvented by weak segmentation where grouping is based on data-driven properties:

    Weak segmentation is a grouping of the image data in conspicuous regions T internally homogeneous according to some criterion, hopefully with T ⊂ O.

The criterion is satisfied if region T is within the bounds of object O, but there is no guarantee that the region covers all of the object's area. When the image contains two nearly identical objects close to each other, the weak segmentation algorithm may falsely observe just one patch. Fortunately, in content-based retrieval, this type of error is rarely obstructive for the goal. In [125], the homogeneity criterion is implemented by requesting that colors be spatially coherent vectors in a region. Color is the criterion in [49, 126]. In [16, 114], the homogeneity criterion is based on color and texture. The limit case of weak segmentation is a set of isolated points [143, 59]. No homogeneity criterion is needed then, but the effectiveness of the isolated points rests on the quality of their selection. When occlusion is present in the image, weak segmentation is the best one can hope for. Weak segmentation is used in many retrieval systems either as a goal in its own right or as a preprocessing stage for data-driven, model-based object segmentation.

  • When the object has a (nearly) fixed shape, like a traffic light or an eye, we call it a sign:

    Localizing signs is finding an object with a fixed shape and semantic meaning, with T = x_center.

Signs are helpful in content-based retrieval, as they deliver an immediate and unique semantic interpretation.

  • The weakest form of grouping is partitioning:

    A partitioning is a division of the data array regardless of the data, symbolized by T ≠ O.

The area T may be the entire image or a conventional partitioning, such as the central part of the image against the upper, right, left, and lower parts [75]. The feasibility of fixed partitioning derives from the fact that images are created in accordance with certain canons or normative rules, such as placing the horizon about two-thirds up in the picture or keeping the main subject in the central area. These rules are often violated, but the violation itself carries semantic significance. Another possible partitioning is to divide the image into tiles of equal size and summarize the dominant feature values in each tile [129], as sketched below.
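
To make the tiling option concrete, here is a minimal sketch of fixed partitioning with per-tile dominant values, assuming a grayscale image held in a NumPy array; the 4 × 4 grid, the 16-bin quantization, and the choice of the most frequent bin as "dominant" are illustrative rather than taken from [129]:

```python
import numpy as np

def tile_dominant_values(image, tiles=(4, 4), bins=16):
    """Divide an image into a fixed grid of tiles and summarize each
    tile by its dominant (most frequent) quantized gray value."""
    h, w = image.shape
    th, tw = h // tiles[0], w // tiles[1]
    summary = np.empty(tiles, dtype=int)
    for i in range(tiles[0]):
        for j in range(tiles[1]):
            tile = image[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            hist, _ = np.histogram(tile, bins=bins, range=(0, 256))
            summary[i, j] = hist.argmax()   # index of the dominant bin
    return summary

# Example: a random 8-bit image summarized by a 4 x 4 grid of dominant bins.
img = np.random.randint(0, 256, size=(128, 128))
print(tile_dominant_values(img))
```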

8.4.2. Feature accumulation

In the computational process given in Figure 8.3, features are calculated next. The general class of accumulating features aggregates the spatial information of a partitioning irrespective of the image data. A special type of accumulative features are the global features, which are calculated from the entire image. F_j (see Figure 8.2) is a set of accumulative features, possibly ordered in a histogram, and is part of the feature space F. T_j is the partitioning over which the value of F_j is computed; in the case of global features, T_j is void and the features are computed over the entire image, while otherwise T_j represents a fixed tiling of the image. The operator h may hold relative weights, for example, to compute transform coefficients.

A simple but very effective approach to accumulating features is the histogram: the set of feature values F(m) ordered by histogram index m.

One of the earlier approaches to color-based image matching, using the colors at pixels directly as indices, was proposed by Swain and Ballard [162]. If the RGB or normalized color distributions of two images are globally similar, the matching rate is high. The work by Swain and Ballard has had an enormous impact on color-based histogram matching, resulting in many histogram variations.
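
Swain and Ballard's matching measure is commonly described as histogram intersection; the sketch below assumes an 8-bit RGB image in a NumPy array and an 8 × 8 × 8 binning, both of which are illustrative choices:

```python
import numpy as np

def color_histogram(image, bins=8):
    """3D RGB histogram; image is an (H, W, 3) uint8 array."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist

def histogram_intersection(query_hist, model_hist):
    """Swain-Ballard intersection: the fraction of the model histogram
    also present in the query; 1.0 means a perfect match."""
    return np.minimum(query_hist, model_hist).sum() / model_hist.sum()
```

Because the measure only counts how much of the model histogram is matched, unrelated background in the query does not penalize the score, which is part of why the method is robust in practice.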

For example, the QBIC system [42] allows for a user-defined computation of the histogram through the variable k, denoting the number of bins of the histogram. Then, for each of the 3 × k cells, the average modified Munsell color is computed together with the five most frequently occurring colors. Using a standard clustering algorithm, they obtain k supercells, resulting in the partitioning of the color system.

In [58], various color-invariant features are selected to construct color pattern-cards. First, histograms are created in a standard way. Because the color distributions of histograms depend on the scale of the recorded object (e.g., the object-to-camera distance), they define color pattern-cards as thresholded histograms. In this way, color pattern-cards are scale-independent, indicating whether or not a particular color model value is substantially present in an image. Matching measures are defined expressing the similarity between color pattern-cards, robust to a substantial amount of object occlusion and clutter. Based on the color pattern-cards and matching functions, a hashing scheme is presented, offering run-time image retrieval independent of the number of images in the image database.
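
The idea of a thresholded histogram can be sketched as follows; the threshold fraction and the matching function are illustrative assumptions, not the specific choices of [58]:

```python
import numpy as np

def pattern_card(hist, fraction=0.01):
    """Binary 'pattern card': marks which color bins are substantially
    present (above a fraction of the total mass), which makes the
    description independent of object scale."""
    return hist > fraction * hist.sum()

def card_match(query_card, model_card):
    """Fraction of the model card's colors also present in the query;
    tolerant of occlusion, which removes colors but rarely adds them."""
    return (query_card & model_card).sum() / max(model_card.sum(), 1)
```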

In the ImageRover system, proposed by [147], the L*u*v* color space is used; each color axis is split into four equally sized bins, resulting in a total of 64 bins. Further, [37] uses the L*a*b* system to compute the average color and the covariance matrix of each of the color channels. [158] uses the HSV color space to obtain a partition into 144 bins, giving more emphasis to hue H than to value V and saturation S. Further, [4] also focuses on the HSV color space to extract regions of dominant colors. To obtain colors that are perceptually the same but still distinctive, [165] proposes to partition the RGB color space into 220 subspaces. [36] computes the average color describing a cell of a 4 × 4 grid, which is superimposed on the image. [149] uses the L*a*b* color space because it consists of perceptually uniform colors, which better matches the human perception of color. [65] roughly partitions the Munsell color space into 11 color zones. Similar partitionings are proposed by [29] and [24].
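
As an illustration of hue-weighted quantization in the spirit of [158], the following sketch builds a 144-bin HSV histogram with 16 hue, 3 saturation, and 3 value bins; the exact split used in [158] may differ:

```python
import numpy as np

def hsv_histogram(hsv, hue_bins=16, sat_bins=3, val_bins=3):
    """Quantize HSV with more bins for hue than for saturation and
    value (16 * 3 * 3 = 144 bins), then count pixels per bin.
    hsv: (H, W, 3) float array with all channels scaled to [0, 1]."""
    h = np.minimum((hsv[..., 0] * hue_bins).astype(int), hue_bins - 1)
    s = np.minimum((hsv[..., 1] * sat_bins).astype(int), sat_bins - 1)
    v = np.minimum((hsv[..., 2] * val_bins).astype(int), val_bins - 1)
    index = (h * sat_bins + s) * val_bins + v          # flat bin index
    return np.bincount(index.ravel(),
                       minlength=hue_bins * sat_bins * val_bins)
```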

Another approach, proposed by [161], is the cumulative color histogram, which generates denser vectors and thereby copes with coarsely quantized color spaces. [186] proposes a variation of the cumulative histogram by applying cumulative histograms to each subspace.
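
A minimal sketch of the cumulative histogram, showing why it tolerates coarse quantization: mass leaking into a neighboring bin changes a plain histogram a lot, but a cumulative one only a little:

```python
import numpy as np

def cumulative_histogram(hist):
    """Cumulative histogram: H(m) = sum of h(k) for all k <= m. Nearby
    bins share mass, so a small color shift no longer moves a pixel's
    entire contribution to a disjoint bin."""
    return np.cumsum(hist)

# Two narrow distributions one bin apart: disjoint as plain histograms,
# but close as cumulative histograms under an L1 distance.
a = np.array([0, 10, 0, 0], dtype=float)
b = np.array([0, 0, 10, 0], dtype=float)
plain = np.abs(a - b).sum()                        # 20.0
cumul = np.abs(np.cumsum(a) - np.cumsum(b)).sum()  # 10.0
```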

Other approaches are based on the computation of moments of each color channel. For example, [6] represents color regions by the first three moments of the color models in the HSV space. Instead of constructing histograms from color invariants, [73, 45] propose the computation of illumination-invariant moments from color histograms. In a similar way, [153] computes the color features from small object regions instead of the entire object.
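
A sketch of the three-moment representation; [6] computes the moments in HSV space over color regions, whereas this toy version simply takes whatever three-channel array it is given:

```python
import numpy as np

def color_moments(image):
    """First three moments (mean, standard deviation, skewness) of
    each color channel; image is an (H, W, 3) array."""
    pixels = image.reshape(-1, 3).astype(float)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])   # 9-element feature vector
```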

[85] proposes the use of an integrated wavelet decomposition, where the color features consist of wavelet coefficients together with their energy distribution among channels and quantization layers. Similar approaches based on wavelets are proposed by [175, 101].

All of this argues in favor of histograms. When very large data sets are at stake, however, plain histogram comparison saturates in discrimination. For a 64-bin histogram, experiments show that under reasonable conditions the discriminatory power among images is limited to 25,000 images [160]. To keep up performance, in [125], a joint histogram is used, providing discrimination among 250,000 images in their database and rendering 80% recall among the best 10 for two shots from the same scene using simple features. Other joint histograms add local texture or local shape [68], directed edges [87], and local higher-order structures [47].
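
A minimal sketch of a joint histogram, pairing quantized intensity with quantized gradient magnitude; the auxiliary property and the bin counts are illustrative assumptions, not the specific features of [125] or [68]:

```python
import numpy as np

def joint_histogram(gray, color_bins=8, grad_bins=4):
    """Joint histogram over (intensity, gradient magnitude): each pixel
    votes in a 2D bin, so two images must agree on the co-occurrence of
    color and local structure, not on color alone."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    c = np.minimum((gray / 256.0 * color_bins).astype(int), color_bins - 1)
    g = np.minimum((mag / (mag.max() + 1e-9) * grad_bins).astype(int),
                   grad_bins - 1)
    hist = np.zeros((color_bins, grad_bins))
    np.add.at(hist, (c.ravel(), g.ravel()), 1)
    return hist / hist.sum()
```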

Another alternative is to add a dimension representing the local distance. This is the correlogram [80], defined as a 3D histogram where the colors of any pair are along the first and second dimension, and the spatial distance between them is along the third. The autocorrelogram, defining the distances between pixels of identical colors, is found on the diagonal of the correlogram. A more general version is the geometric histogram [1], with the normal histogram, the correlogram, and several alternatives as special cases. It also includes the histogram of triangular pixel values, reported to outperform all of the above, as it contains more information.
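
A deliberately simplified autocorrelogram sketch: it samples only horizontal and vertical pixel pairs at each distance rather than the full distance-d neighborhood of [80], but it shows the construction of the color × distance table:

```python
import numpy as np

def autocorrelogram(labels, n_colors, distances=(1, 3, 5)):
    """Autocorrelogram: for each quantized color c and distance d, the
    probability that a pixel at distance d from a pixel of color c also
    has color c. 'labels' is a 2D array of color indices."""
    result = np.zeros((n_colors, len(distances)))
    for k, d in enumerate(distances):
        # Compare each pixel with its vertically and horizontally
        # d-shifted neighbor as a cheap stand-in for the full ring.
        for shifted in (labels[d:, :], labels[:, d:]):
            base = labels[:shifted.shape[0], :shifted.shape[1]]
            same = (base == shifted)
            for c in range(n_colors):
                mask = (base == c)
                if mask.any():
                    result[c, k] += (same & mask).sum() / mask.sum()
    return result / 2.0   # average over the two shift directions
```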

A different view on accumulative features is to demand that all information (or all relevant information) in the image is preserved in the feature values. When the bit content of the features is less than that of the original image, this boils down to compression transforms. Many compression transforms are known, but the quest is for transforms simultaneously suited as retrieval features. As proper querying for similarity is based on a suitable distance function between images, the transform has to be applied on a metric space. In addition, the components of the transform have to correspond to semantically meaningful characteristics of the image. Finally, the transform should admit indexing in compressed form, yielding a considerable computational advantage over having to decompress the image first. [144] is just one of many approaches in which the cosine-based JPEG coding scheme is used for image retrieval. The JPEG transform fulfills the first and third requirements but fails on the lack of semantics. In the MPEG standard, the possibility to include semantic descriptors in the compression transform is introduced [27]. For an overview of feature indexes in the compressed domain, see [108]. In [96], a wavelet packet is applied to texture images and, for each packet, entropy and energy measures are determined and collected in a feature vector. In [83], vector quantization is applied in the space of coefficients to reduce its dimensionality. This approach was extended to incorporate the metric of the color space in [141]. In [86], a wavelet transform is applied independently to the three channels of a color image, and only the sign of the most significant coefficients is retained. In [3], a scheme is offered for a broad spectrum of invariant descriptors, suitable for application to Fourier, wavelet, and spline representations, and for geometry and color alike.
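
As an illustration of the sign-only wavelet signature of [86], here is a one-level Haar sketch on a single channel; the single decomposition level and the budget of retained coefficients are assumptions made for brevity, and [86] applies the transform per color channel:

```python
import numpy as np

def haar_level(a):
    """One 2D Haar analysis step: approximation and detail bands."""
    a = a[:a.shape[0] // 2 * 2, :a.shape[1] // 2 * 2].astype(float)
    lo = (a[::2] + a[1::2]) / 2            # pairwise rows: average
    hi = (a[::2] - a[1::2]) / 2            # pairwise rows: difference
    ll = (lo[:, ::2] + lo[:, 1::2]) / 2    # then pairwise columns
    lh = (lo[:, ::2] - lo[:, 1::2]) / 2
    hl = (hi[:, ::2] + hi[:, 1::2]) / 2
    hh = (hi[:, ::2] - hi[:, 1::2]) / 2
    return ll, (lh, hl, hh)

def sign_signature(channel, keep=60):
    """Keep only the sign of the largest-magnitude detail coefficients,
    in the spirit of [86]; everything else is discarded."""
    _, details = haar_level(channel)
    coeffs = np.concatenate([d.ravel() for d in details])
    top = np.argsort(np.abs(coeffs))[-keep:]
    sig = np.zeros_like(coeffs, dtype=np.int8)
    sig[top] = np.sign(coeffs[top]).astype(np.int8)
    return sig
```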

8.4.3. Feature accumulation and image partitioning

The lack of spatial information in methods based on feature accumulation may yield lower retrieval accuracy. For general image databases, manual segmentation is not feasible due to the sensory gap. A simple approach is to divide images into smaller subimages and then index these. This is known as fixed partitioning. Other systems use a segmentation scheme prior to the actual image search to partition each image into regions. Nearly all region-based partitioning schemes use some kind of weak segmentation, decomposing the image into coherent regions rather than complete objects (strong segmentation).

Fixed Partitioning

The simplest way is to use a fixed image decomposition in which an image is partitioned into equally sized segments. The disadvantage of a fixed partitioning is that the blocks usually do not correspond to the visual content of an image. For example, [65] splits an image into nine equally sized subimages, where each subregion is represented by a color histogram. [67] segments the image by a quadtree, and [99] uses a multiresolution representation of each image. [36] also uses a 4 × 4 grid to segment the image. [148] partitions images into three layers, where the first layer is the whole image, the second layer is a 3 × 3 grid, and the third layer is a 5 × 5 grid. A similar approach is proposed by [107], where three levels of a quadtree are used to segment the images. [37] proposes the use of interhierarchical distances, measuring the difference between the color vectors of a region and its subsegments. [20] uses an augmented color histogram, capturing the spatial information between pixels together with the color distribution. In [59], the aim is to combine color and shape invariants for indexing and retrieving images. Color-invariant edges are derived, from which shape-invariant features are computed. Then, computational methods are described to combine the color and shape invariants into a unified, high-dimensional histogram for discriminatory object retrieval. [81] proposes the use of color correlograms for image retrieval. Color correlograms integrate the spatial information of colors by expressing the probability that a pixel of color c_i lies at a certain distance from a pixel of color c_j. It is shown that color correlograms are robust to a change in background, occlusion, and scale (camera zoom). [23] introduces the spatial chromatic histogram, where for every pixel the percentage of pixels having the same color is computed, and the spatial information is encoded by the barycenter of the spatial distribution and the corresponding deviation; a sketch follows below.
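
A sketch of the spatial chromatic histogram idea: per quantized color, the pixel fraction, the barycenter, and the spatial deviation; details such as the normalization are assumptions, not necessarily the choices of [23]:

```python
import numpy as np

def spatial_chromatic_histogram(labels, n_colors):
    """For each quantized color: its pixel fraction, the barycenter of
    its spatial distribution (in normalized coordinates), and the
    standard deviation of pixel positions around that barycenter."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ys, xs = ys / (h - 1), xs / (w - 1)
    features = np.zeros((n_colors, 4))   # fraction, y, x, deviation
    for c in range(n_colors):
        mask = labels == c
        n = mask.sum()
        if n == 0:
            continue
        cy, cx = ys[mask].mean(), xs[mask].mean()
        dev = np.sqrt(((ys[mask] - cy) ** 2 + (xs[mask] - cx) ** 2).mean())
        features[c] = (n / labels.size, cy, cx, dev)
    return features
```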

Region-Based Partitioning

Segmentation is a computational method to assess the set of points in an image that represents one object in the scene. As discussed before, many different computational techniques exist, none of which is capable of handling any reasonable set of real-world images. However, in this case, weak segmentation may be sufficient to recognize an object in a scene. Therefore, in [12], an image representation is proposed providing a transformation from the raw pixel data to a small set of image regions that are coherent in color and texture space. This so-called Blobworld representation is based on segmentation using the expectation-maximization algorithm on combined color and texture features (a simplified sketch follows this paragraph). In the Picasso system [13], a competitive learning clustering algorithm is used to obtain a multiresolution representation of color regions. In this way, colors are represented in the L*u*v* space through a set of 128 reference colors obtained by the clustering algorithm. [63] proposes a method based on matching feature distributions derived from color ratio gradients. To cope with object cluttering, region-based texture segmentation is applied to the target images prior to the actual image retrieval process. [26] segments the image first into homogeneous regions by split and merge using a color distribution homogeneity condition. Then, histogram intersection is used to express the degree of similarity between pairs of image regions.
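
A minimal weak-segmentation sketch; note that it substitutes plain k-means on color alone for the expectation-maximization on combined color and texture features used in Blobworld [12]:

```python
import numpy as np
from sklearn.cluster import KMeans

def weak_segmentation(image, n_regions=5):
    """Group pixels into coherent color clusters: a weak segmentation
    into blobby regions, not a segmentation into complete objects."""
    h, w, _ = image.shape
    features = image.reshape(-1, 3).astype(float)
    labels = KMeans(n_clusters=n_regions, n_init=10).fit_predict(features)
    return labels.reshape(h, w)   # per-pixel region index
```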

8.4.4. Salient features

As the information of the image is condensed into just a limited number of feature values, the information should be selected with precision for greatest saliency and proven robustness. That is why saliency in [103] is defined as the special points that survive longest when gradually blurring the image in scale space. Also in [137], lifetime is an important selection criterion for salient points, in addition to wiggliness, spatial width, and phase congruency. To enhance the quality of salient descriptions, in [170], invariant and salient features of local patches are considered. In each case, the image is summarized in a list of conspicuous points. In [143], salient and invariant transitions in gray-value images are recorded. Similarly, in [59, 54], photometric invariance is the leading principle in summarizing the image in salient transitions. Salient feature calculations lead to sets of regions or points with known location and feature values capturing their salience.

In [16], first the most conspicuous homogeneous regions in the image are derived and mapped into feature space. Then, expectation-maximization [35] is used to determine the parameters of a mixture of Gaussians to model the distribution of points in the feature space. The means and covariance matrices of these Gaussians, projected on the image plane, are represented as ellipsoids characterized by their center x, their area, eccentricity, and direction. The average values of the color and texture descriptions inside the ellipse are also stored.
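
Deriving the ellipse parameters from one fitted Gaussian component is a small exercise in eigenanalysis; the sketch below assumes a 2D mean and covariance already projected on the image plane, and the one-sigma contour is an arbitrary convention, not necessarily that of [16]:

```python
import numpy as np

def gaussian_to_ellipse(mean, cov):
    """Summarize a 2D Gaussian (one mixture component projected on the
    image plane) as an ellipse: center, area, eccentricity, direction."""
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    minor, major = np.sqrt(eigvals)             # one-sigma axis lengths
    area = np.pi * major * minor
    eccentricity = np.sqrt(1.0 - (minor / major) ** 2)
    direction = np.arctan2(eigvecs[1, 1], eigvecs[0, 1])  # major axis
    return {"center": mean, "area": area,
            "eccentricity": eccentricity, "direction": direction}
```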

Various color image segmentation methods have been proposed that account for the image formation process; see for instance the work collected by Wolff, Shafer, and Healey [181]. [150] presents the dichromatic reflection model, a physical model of reflection which states that two distinct types of reflection—surface and body reflection—occur and that each type can be decomposed into a relative spectral distribution and a geometric scale factor. [93] developed a color segmentation algorithm based on the dichromatic reflection model. The method is based on evaluating characteristic shapes of clusters in red-green-blue (RGB) space followed by segmentation independent of the object's geometry, illumination, and highlights. To achieve robust image segmentation, however, surface patches of objects in view must have a rather broad distribution of surface normals, which may not hold for objects in general. [10] developed a similar image segmentation method using the H-S color space instead of the RGB color space. [73] proposed a method to segment images on the basis of normalized color. However, [92] showed that normalized color and hue are singular at some RGB values and unstable at many others.
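The instability noted by [92] is easy to demonstrate: near black, a change of a single sensor count moves the normalized color by a large amount. A minimal sketch (the epsilon guard is an assumption to avoid the singularity at R + G + B = 0):

```python
import numpy as np

def normalized_rgb(rgb, eps=1e-9):
    """Normalized color (r, g, b) = (R, G, B) / (R + G + B). Singular
    where R + G + B = 0 and unstable for dark pixels, where sensor
    noise of a few counts swings the result wildly [92]."""
    s = rgb.sum(axis=-1, keepdims=True).astype(float)
    return rgb / np.maximum(s, eps)

dark = np.array([2.0, 1.0, 1.0])            # nearly black pixel
noisy = dark + np.array([1.0, -1.0, 0.0])   # one-count sensor noise
print(normalized_rgb(dark), normalized_rgb(noisy))  # large relative change
```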

8.4.5. Shape and object features

The theoretically best way to enhance object-specific information contained in images is by segmenting the object in the image. But, as discussed above, the brittleness of segmentation algorithms prevents the use of automatic segmentation in broad domains. And, in many cases, it is not necessary to know exactly where an object is in the image as long as one can identify the presence of the object by its unique characteristics. When the domain is narrow, a tailored segmentation algorithm is more likely to be needed and, fortunately, also more likely to be feasible.

The object's internal features are largely identical to the accumulative features, now computed over the object area. They need no further discussion here.

An extensive comparison of shape features for retrieval can be found in [113], which evaluates many features on a 500-element trademark data set. Straightforward features of general applicability include Fourier features and moment invariants, computed this time over the object, as well as sets of consecutive boundary segments and encodings of contour shape [40].
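
As one common construction of Fourier features over a contour (not necessarily the variant evaluated in [113]), the following sketch computes descriptors that are insensitive to translation, scale, rotation, and the choice of starting point:

```python
import numpy as np

def fourier_descriptors(contour, n_keep=10):
    """Fourier descriptors of a closed contour given as an (N, 2) array
    of boundary points. Dropping F[0] removes translation; dividing by
    |F[1]| removes scale; taking magnitudes discards the phase that
    encodes rotation and the starting point of the traversal."""
    z = contour[:, 0] + 1j * contour[:, 1]   # boundary as complex signal
    F = np.fft.fft(z)
    F = F / np.abs(F[1])                     # scale normalization
    return np.abs(F[1:n_keep + 1])           # low-order magnitudes
```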

For retrieval, we need a shape representation that allows a robust measurement of distances in the presence of considerable deformations. Many sophisticated models widely used in computer vision often prove too brittle for image retrieval. On the other hand, the (interactive) use of retrieval makes some mismatch acceptable; therefore, precision can be traded for robustness and computational efficiency.

More sophisticated methods include elastic matching and multiresolution representation of shapes. In elastic deformation of image portions [34, 123] or modal matching techniques [145], image patches are deformed to minimize a cost functional that depends on a weighted sum of the mismatch of the two patches and the deformation energy. The complexity of the optimization problem depends on the number of points on the contour. Hence, the optimization is computationally expensive, and this, in spite of the greater precision of these methods, has limited their adoption in image databases.
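
A toy version of such a cost functional, assuming the two contours are already sampled into rough point-wise correspondence; real elastic matching [34, 123, 145] uses richer deformation models and image-based mismatch terms, but the structure, a data term plus a deformation-energy term minimized iteratively, is the same:

```python
import numpy as np

def elastic_match(P, Q, lam=1.0, steps=200, lr=0.05):
    """Deform contour P toward contour Q (both (N, 2) arrays) by
    gradient descent on: mismatch ||P + D - Q||^2 plus a cyclic
    first-difference deformation energy lam * sum ||D_i - D_{i-1}||^2."""
    D = np.zeros_like(P, dtype=float)          # per-point displacement
    for _ in range(steps):
        mismatch_grad = 2 * (P + D - Q)
        smooth_grad = 2 * lam * (2 * D - np.roll(D, 1, axis=0)
                                 - np.roll(D, -1, axis=0))
        D -= lr * (mismatch_grad + smooth_grad)
    cost = (((P + D - Q) ** 2).sum()
            + lam * ((D - np.roll(D, 1, axis=0)) ** 2).sum())
    return D, cost   # the final residual cost serves as a match score
```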

Multiscale models of contours have been studied as a representation for image databases in [118]. Contours were extracted from images and progressively smoothed by dividing them into regions of constant sign of the second derivative and progressively reducing the number of such regions. At the final step, every contour is reduced to an ellipsoid, which can be characterized by some of the features in [47]. A different view on multiresolution shape is offered in [98], where the contour is sampled by a polygon and then simplified by removing points, selected on perceptual grounds, until a coarse polygon survives. When computational efficiency is at stake, an approach for the description of object boundaries is found in [189], where an ordered set of critical points on the boundary is derived from curvature extrema. Such sets of selected and ordered contour points are stored in [112] relative to the basis spanned by an arbitrary pair of the points. All point pairs are used as a basis to make the redundant representation geometrically invariant, a technique similar to [182] for unordered point sets.
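
A sketch of point-removal simplification; the triangle-area criterion used here stands in for the perceptual criteria of [98]:

```python
import numpy as np

def simplify_polygon(points, n_keep):
    """Iteratively remove the vertex whose deletion changes the shape
    least, measured by the area of the triangle it forms with its two
    neighbors on the (cyclic) contour."""
    pts = [np.asarray(p, dtype=float) for p in points]
    while len(pts) > n_keep:
        areas = []
        for i in range(len(pts)):
            a, b, c = pts[i - 1], pts[i], pts[(i + 1) % len(pts)]
            ab, ac = b - a, c - a
            areas.append(abs(ab[0] * ac[1] - ab[1] * ac[0]) / 2.0)
        pts.pop(int(np.argmin(areas)))   # drop the least salient vertex
    return np.array(pts)
```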

For retrieval of objects in 2D images of the 3D worlds, a viewpoint-invariant description of the contour is important. A good review of global shape invariants is given in [138].

8.4.6. Structure and layout

When feature calculations are available for different entities in the image, they may be stored with a relationship between them. Such a structural feature set may contain feature values plus spatial relationships, a hierarchically ordered set of feature values, or relationships between point sets or object sets. Structural and layout feature descriptions are captured in a graph, hierarchy, or any other ordered set of feature values and their relationships.

To that end, in [111, 49], layout descriptions of an object are discussed in the form of a graph of relations between blobs. A similar layout description of an image in terms of a graph representing the spatial relations between the objects of interest is used in [128] for the description of medical images. In [51], a graph is formed of topological relationships of homogeneous RGB regions. When the selected features and the topological relationships are viewpoint-invariant, the description is viewpoint-invariant, but the RGB representation as used in the paper suits that purpose only to a limited degree. The systems in [78, 157] study spatial relationships between regions, each characterized by location, size, and features. In the latter system, matching is based on the 2D string representation introduced by Chang [17], sketched below. For a narrow domain, in [128, 132], the relevant elements of a medical X-ray image are characterized separately and joined together in a graph that encodes their spatial relations.
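
A much-reduced sketch of Chang's 2D strings [17], keeping only the "<" (precedes) relation along each axis; the full formalism also encodes coincidence and adjacency. The example layout and symbol names are, of course, hypothetical:

```python
def two_d_string(regions):
    """2D string in the spirit of Chang [17]: order region symbols by
    their centroid along x and along y. 'regions' maps a symbol to a
    centroid (x, y), with y increasing downward. Image matching then
    reduces to (sub)sequence matching on the two strings."""
    by_x = sorted(regions, key=lambda s: regions[s][0])
    by_y = sorted(regions, key=lambda s: regions[s][1])
    return "<".join(by_x), "<".join(by_y)

# Example: sky above a tree standing to the left of a house.
layout = {"tree": (20, 60), "house": (70, 65), "sky": (45, 10)}
print(two_d_string(layout))   # ('tree<sky<house', 'sky<tree<house')
```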

Starting from a shape description, the authors in [98] decompose an object into its main components, making the matching between images of the same object easier. Automatic identification of salient regions in the image based on nonparametric clustering followed by decomposition of the shapes found into limbs is explored in [50].

8.4.7. Discussion

General content-based retrieval systems have dealt with the brittleness of segmentation in a few ways. First, a weaker version of segmentation has been introduced in content-based retrieval. In weak segmentation, the result is a region homogeneous by some criterion, but not necessarily covering the complete object silhouette. It results in a fuzzy, blobby description of objects rather than a precise segmentation. Salient features of the weak segments capture the essential information of the object in a nutshell. The extreme form of weak segmentation is the selection of a salient point set as the ultimately efficient data reduction in the representation of an object, very much like the focus-of-attention algorithms of an earlier age. Only points on the interior of the object can be used to identify the object; conspicuous points at the borders of objects have to be ignored. Little work has been done on how to make this selection. Weak segmentation and salient features are a typical innovation of content-based retrieval. It is expected that salience will receive much attention in the further expansion of the field, especially when computational considerations gain in importance.

The alternative is to do no segmentation at all. Content-based retrieval has gained from the use of accumulative features, computed over the global image or partitionings thereof disregarding the content, the most notable being the histogram. Where most attention has gone to color histograms, histograms of local geometric properties and texture are following. To compensate for the complete loss of spatial information, recently the geometric histogram was defined with an additional dimension for the spatial layout of pixel properties. As it is a superset of the histogram, an improved discriminability for large data sets is anticipated. Accumulative features calculated from the central part of a photograph may be very effective in telling pictures apart by topic, but the center does not always reveal the subject. Likewise, features calculated from the top part of a picture may be effective in telling indoor scenes from outdoor scenes, but again, this holds only to a limited degree. A danger of accumulative features is their inability to discriminate among different entities and semantic meanings in the image. More work on semantics-driven groupings will increase the power of accumulative descriptors to capture the content of the image.

Structural descriptions match well with weak segmentation, salient regions, and weak semantics. We have to be certain that the structure is within one object and not an accidental combination of patches that have no meaning in the object world. The same brittleness of strong segmentation lurks here. We expect a sharp increase in the research of local, partial, or fuzzy structural descriptors for the purpose of content-based retrieval, especially of broad domains.
