Figure 5.15: Local structure and region adjacency graph.

This is a nonlinear relaxation problem. Substituting eq. (5.20) into eq. (5.21), a global optimization function is obtained:

$$F = \sum_{k=1}^{T}\sum_{i=1}^{N} P(q_i = w_k) \sum_{l=1}^{T}\sum_{j=1}^{N} r(q_i = w_k,\, q_j = w_l)\, P(q_j = w_l) \qquad (5.23)$$

The constraints on the solution are

$$\sum_{k=1}^{T} P(q_i = w_k) = 1 \quad \forall i, \qquad P(q_i = w_k) \ge 0 \quad \forall i, k \qquad (5.24)$$

In the concrete use of the probabilistic relaxation labeling method, the conditional probabilities of the tags for all objects are first determined; then the following two steps are iterated:

(1) Computing, according to eq. (5.23), the objective function representing the quality of the scene labeling;

(2) Updating the tag probabilities to increase the value of the objective function (i.e., to improve the quality of the scene labeling).

Once the objective function value is maximized, the best tags are obtained. A minimal sketch of this loop follows.
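
The sketch below illustrates the two-step iteration only in outline; the compatibility array r, the initial probabilities P, and the simple multiplicative update are all hypothetical stand-ins for whatever the concrete method prescribes:

```python
import numpy as np

def relaxation_labeling(P, r, iterations=50, step=0.1):
    """Iteratively update label probabilities P (N objects x T labels)
    using compatibilities r (N x T x N x T); cf. eqs. (5.23)-(5.24)."""
    N, T = P.shape
    objective = 0.0
    for _ in range(iterations):
        # Support of each (object, label) pair: compatibilities weighted
        # by the current label probabilities of all other objects.
        support = np.einsum('ikjl,jl->ik', r, P)
        objective = np.sum(P * support)          # objective of eq. (5.23)
        # Step (2): raise probabilities of well-supported labels ...
        P = P * (1.0 + step * support)
        P = np.clip(P, 0.0, None)
        # ... and renormalize to keep the constraints of eq. (5.24).
        P /= P.sum(axis=1, keepdims=True)
    return P, objective

# Hypothetical toy problem: 3 objects, 2 labels.
rng = np.random.default_rng(0)
P0 = rng.dirichlet(np.ones(2), size=3)           # valid initial probabilities
r = rng.uniform(0, 1, size=(3, 2, 3, 2))         # hypothetical compatibilities
P_final, F = relaxation_labeling(P0, r)
best_tags = P_final.argmax(axis=1)               # best tag per object
```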

5.5 Scene Classification

Scene classification determines the specific regions in an image according to the principles of visual perceptual organization and gives a conceptual explanation of the scene. Its concrete goal is to automatically classify and label images according to a given set of semantic classes, so as to provide effective context information for object identification and scene content interpretation.

Scene classification differs from object recognition, which requires full knowledge of the object. In many cases, the scene must be classified before full object information is available (in some cases, low-level information such as color and texture is already sufficient for categorization). Referring to human visual cognitive processes, a coarse classification and recognition often already meets the requirements of scene classification, namely establishing links between low-level features and high-level cognition, and determining and interpreting the semantic category of the scene.

Scene classification has a guiding role for object recognition. Most objects appear only in specific scenes, so a correct judgment of the global scene provides a reasonable context-constraint mechanism for local image analysis.

5.5.1 Bag of Words/Bag of Features Models

The bag of words model is derived from natural language processing. After its introduction into the image field, it is often referred to as the bag of features model: features belonging to the same object class are collected to form a package (bag), which is where the name comes from, Sivic (2003). The model usually takes a directed graph structure. In an undirected graph the nodes are linked by probabilistic constraint relationships, while in a directed graph they are linked by causal relationships; an undirected graph can be regarded as a special kind of directed graph, a symmetric directed graph. In the bag of features model, the conditional independence between the image and the visual vocabulary is the theoretical basis of the model, but the model carries no strict geometric information about the object components.

The original bag of words model considers only the co-occurrence relationships between the features corresponding to words and the logical relationships of topics, while ignoring the spatial relationships among features. In the image field, however, not only the image features themselves but also their spatial distributions are significant. In recent years, many feature descriptors (e.g., SIFT) with relatively high dimensionality have appeared. They give a comprehensive and explicit representation of the key points in the image and of their characteristics in small surrounding areas (different from corner points, which represent only location information and carry their own characteristics implicitly), and they also distinguish a key point and its surrounding area from other key points and areas. In addition, these feature descriptors can overlap each other in image space, so the relationships among features are better preserved. Using such features enhances the ability to describe the spatial distribution of image features.

Representing and describing the scene with the bag of features model requires extracting local region description features from the scene. These features can be called visual vocabulary. If a scene is decomposed, some basic components are obtained. By analogy with documents, a book is made up of many words; returning to the image field, an image of a scene can be considered as composed of many visual words. From a cognitive point of view, each visual word corresponds to a feature in the image (more precisely, a feature describing a local scene characteristic) and is a basic unit reflecting the image content or the meaning of the scene. Building a collection of visual vocabulary (a dictionary) may include the following steps:

(1) Extracting features;

(2) Learning the visual vocabulary;

(3) Quantizing features with the visual vocabulary;

(4) Representing images by the frequencies of visual words.

Figure 5.16: The process of obtaining local region description features in image.

A specific example is shown in Figure 5.16. First, regions (the neighborhoods of key points) in the image are detected, and different kinds of regions are divided and extracted, as shown in Figure 5.16(a), where the regions are squares for simplicity. Then, for each region, a feature vector is computed to represent the region, as shown in Figure 5.16(b). Next, the feature vectors are quantized into visual words and a codebook is built, as shown in Figure 5.16(c). Finally, the appearance frequency of each word in each image is counted (a few examples using histograms are shown in Figure 5.16(e) ~ Figure 5.16(f)), and these frequencies are combined to give the representation of the image. A minimal code sketch of these steps follows.
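
The following sketch assumes the local descriptors have already been extracted (random vectors stand in for real 128-D SIFT descriptors here); it quantizes them with k-means and builds per-image word histograms:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical stand-in for descriptors extracted from 4 training images,
# e.g., 128-D SIFT vectors (step (b) in Figure 5.16).
descriptors_per_image = [rng.normal(size=(50, 128)) for _ in range(4)]

# Learn the codebook by clustering all descriptors (step (c)).
codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack(descriptors_per_image))

# Represent each image by its normalized visual-word frequencies.
histograms = []
for desc in descriptors_per_image:
    words = codebook.predict(desc)                     # quantize to words
    hist, _ = np.histogram(words, bins=np.arange(17))  # count frequencies
    histograms.append(hist / hist.sum())               # normalized histogram
```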

If the image is divided into a number of subregions and each subregion is assigned a semantic concept, each subregion can be taken as a visual unit with its own independent semantic meaning. Because similar scenes should have similar distributions of concepts, a scene can be classified into a specific semantic category according to the regional distribution of semantic concepts. If semantic concepts and visual vocabulary can be linked, scene classification can then be performed by means of a word-based representation and description model.

Visual vocabulary can be used either to represent objects directly or to represent only middle-level concepts in the neighborhoods of key points. The former requires detecting or segmenting the objects in the scene and performing scene classification via object classification; for example, once sky is detected, the image should be an outdoor scene. The latter does not require direct segmentation of objects but identifies the scene tags by using local descriptors obtained from training. There are three general steps (a code sketch follows the list):

(1) Feature point detection: Commonly used methods include the image grid method and the Gaussian difference method. The former divides the image into a mesh and takes the cell centers as feature points. The latter uses the DoG operator to detect local feature points of interest, such as corners.

(2) Feature representation and description: The characteristics of the feature points themselves and of their neighborhoods should both be captured. In recent years, the SIFT operator has often been used, in which feature point detection and feature representation and description are actually combined.

(3) Dictionary generation: The local descriptions are clustered (e.g., using the k-means clustering method), and the cluster centers are taken to build the dictionary.
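
A sketch of these three steps, assuming OpenCV with the SIFT implementation is available ('scene.jpg', the grid step, and the dictionary size are all hypothetical choices):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical input image
sift = cv2.SIFT_create()

# (1a) Image grid method: take grid-cell centers as feature points.
step = 16
grid_kp = [cv2.KeyPoint(float(x), float(y), step)
           for y in range(step // 2, img.shape[0], step)
           for x in range(step // 2, img.shape[1], step)]
_, grid_desc = sift.compute(img, grid_kp)        # (2) describe each point

# (1b) Gaussian difference (DoG) method: SIFT's own interest-point detector.
dog_kp, dog_desc = sift.detectAndCompute(img, None)

# (3) Generate the dictionary by clustering the local descriptors.
dictionary = KMeans(n_clusters=200, n_init=10).fit(dog_desc).cluster_centers_
```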

Example 5.2 Dictionary of visual vocabulary

An example of constructing a dictionary of visual vocabulary is given in Figure 5.17. In practice, the choice of the local region can make use of the SIFT local descriptor; the selected local region is then a circular area centered at a key point. These local regions have the same characteristics, as shown in Figure 5.17(a). The constructed dictionary of visual vocabulary is shown in Figure 5.17(b), where each subimage represents a single basic visual word (a cluster of key features) and can be represented by a vector, as shown in Figure 5.17(c). With the dictionary, the original image can be represented by a combination of visual words, and the frequencies of the visual words used reflect the image characteristics.

In a practical application, the image is first expressed by visual vocabulary with the help of a feature detection operator and a feature descriptor; then parameter estimation and probabilistic reasoning are carried out to obtain the iterative formulas for the parameters and the results of the probability analysis; finally, the trained models are analyzed to gain understanding.

The most commonly used models are Bayesian-related models, such as the typical probabilistic latent semantic analysis (pLSA) model and the latent Dirichlet allocation (LDA) model. Within the framework of the bag of features model, the image is seen as a text, a topic discovered from the image is seen as an object class (such as teachers, sports grounds), and a scene comprising multiple objects is seen as a probabilistic model with a mixed group of topics; the classification into semantic categories can then be made by analyzing the topic distribution of the scene.

Figure 5.17: Obtaining a visual vocabulary with the SIFT local descriptor.

5.5.2 pLSA Model

The pLSA model is derived from probabilistic latent semantic indexing and is a graph model for solving object and scene classification, Sivic (2005). The pLSA model originated in the learning of natural language and text, so its original definitions adopt concepts from text, but it is also very easy to extend to the image field (in particular by means of the bag of features model).

5.5.2.1 Model Description

Suppose there is a set of images T = {t_i}, i = 1, ..., N, where N is the total number of images; T contains visual words from the set of words, the dictionary (visual vocabulary) S = {s_j}, j = 1, ..., M, where M is the total number of words. The properties of the image set T can be described by an N × M co-occurrence matrix P, in which each element p_ij = p(t_i, s_j) represents the frequency of word s_j appearing in image t_i. In practice this matrix is sparse.

The pLSA model uses a latent variable model to describe the data in the co-occurrence matrix. It associates each observation (word s_j appearing in image t_i) with a latent variable (called the topic variable) z ∈ Z = {z_k}, k = 1, ..., K. Let p(t_i) represent the probability that a word appears in image t_i, p(z_k|t_i) the probability that topic z_k appears in image t_i (the distribution of the image in topic space), and p(s_j|z_k) the probability that word s_j appears under a particular topic z_k (the distribution of the topic over the dictionary); then by selecting an image t_i with probability p(t_i) and a topic with probability p(z_k|t_i), a word s_j can be generated with probability p(s_j|z_k). The conditional probability model based on the co-occurrence of topics and words can be defined as

$$p(s_j|t_i) = \sum_{k=1}^{K} p(s_j|z_k)\, p(z_k|t_i) \qquad (5.25)$$

That is, each word in every image may be formed by mixing the K latent topic distributions p(s_j|z_k) with coefficients p(z_k|t_i). Thus, the element of the co-occurrence matrix P is

$$p(t_i, s_j) = p(t_i)\, p(s_j|t_i) \qquad (5.26)$$

A graph representation of the pLSA model is depicted in Figure 5.18, in which the boxes represent sets (the large box represents the image set; the small box represents the repeated selection of topics and words); the arrows represent the dependencies between nodes; the nodes are random variables: the left observation node t (shaded) corresponds to the image, the right observation node s (shaded) corresponds to the visual vocabulary described by descriptors, and the intermediate node z is a latent (unobserved) node, which represents the object category, namely the topic. The model establishes the probabilistic mapping among the topic z, the image t, and the visual vocabulary s, and selects the category corresponding to the maximum a posteriori estimate as the final classification result.

Figure 5.18: pLSA schematic model.

The objective of the pLSA model is to search for the vocabulary distribution p(s_j|z_k) under each specific topic z_k and for the mixing proportions p(z_k|t_i) corresponding to each specific image, thereby obtaining the vocabulary distribution p(s_j|t_i) of the specific image. Equation (5.25) represents every image as a convex combination of K topic vectors, which can be illustrated by matrix operations, as shown in Figure 5.19: each column of the left matrix represents the visual words in a given image, each column of the middle matrix represents the visual words in a given topic, and each column of the right matrix represents the topics in a given image (object class).
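
The factorization of Figure 5.19 can be written directly in matrix form. The toy numpy sketch below (all matrices are hypothetical random data) checks that mixing K topic columns reproduces valid word distributions, as in eq. (5.25):

```python
import numpy as np

M, K, N = 6, 2, 4                        # words, topics, images
rng = np.random.default_rng(2)
word_given_topic = rng.dirichlet(np.ones(M), size=K).T   # M x K, columns p(s|z_k)
topic_given_image = rng.dirichlet(np.ones(K), size=N).T  # K x N, columns p(z|t_i)

# Eq. (5.25) as a matrix product: each column of word_given_image is a
# convex combination of the K topic vectors.
word_given_image = word_given_topic @ topic_given_image  # M x N
assert np.allclose(word_given_image.sum(axis=0), 1.0)    # valid distributions
```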

5.5.2.2 Model Calculations

It is required to determine the topic vectors common to all images and the mixing coefficients specific to each image, with the aim of giving high probability to the words appearing in an image, so that the category with the maximum a posteriori probability can be selected as the final object category. This can be done by optimizing the parameters of the following objective function, i.e., by maximum likelihood estimation:

$$L = \prod_{j=1}^{M}\prod_{i=1}^{N} p(s_j|t_i)^{p(s_j, t_i)} \qquad (5.27)$$

The maximum likelihood estimate of the latent variable model can be computed with the expectation-maximization (EM) algorithm, which consists of two steps. The E step is an expectation step, in which the posterior probabilities of the latent variables are calculated on the basis of the current parameter estimates. The M step is a maximization step, in which the expected complete-data likelihood obtained from the E step is maximized.

Figure 5.19: Co-occurrence matrix decomposition.

The E step can be expressed (using the Bayes formula) as:

$$p(z_k|t_i, s_j) = \frac{p(s_j|z_k)\, p(z_k|t_i)}{\sum_{l=1}^{K} p(s_j|z_l)\, p(z_l|t_i)} \qquad (5.28)$$

The M step has the iterative formula (the mixing coefficients p(z_k|t_i) are re-estimated analogously in the same step):

$$p(s_j|z_k) = \frac{\sum_{i=1}^{N} p(s_j, t_i)\, p(z_k|t_i, s_j)}{\sum_{m=1}^{M}\sum_{i=1}^{N} p(s_m, t_i)\, p(z_k|t_i, s_m)} \qquad (5.29)$$

The E step and the M step alternate until the termination condition is satisfied. Finally, the category judgment can be carried out with the following formula:

$$z^* = \arg\max_{z}\, \{p(z|t)\} \qquad (5.30)$$
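
A compact sketch of this EM loop on a random co-occurrence matrix (a minimal implementation of eqs. (5.28)-(5.30) under the standard pLSA updates; not the exact formulation of the cited work):

```python
import numpy as np

def plsa_em(P_ts, K, iterations=100, seed=0):
    """Fit pLSA to an N x M co-occurrence matrix P_ts with entries p(t_i, s_j)."""
    N, M = P_ts.shape
    rng = np.random.default_rng(seed)
    p_s_z = rng.dirichlet(np.ones(M), size=K)     # K x M, rows are p(s_j|z_k)
    p_z_t = rng.dirichlet(np.ones(K), size=N)     # N x K, rows are p(z_k|t_i)
    for _ in range(iterations):
        # E step, eq. (5.28): posterior p(z_k|t_i, s_j), shape N x M x K.
        joint = p_z_t[:, None, :] * p_s_z.T[None, :, :]
        post = joint / joint.sum(axis=2, keepdims=True)
        # M step, eq. (5.29): re-estimate both conditional distributions.
        weighted = P_ts[:, :, None] * post        # p(t_i, s_j) p(z_k|t_i, s_j)
        p_s_z = (weighted.sum(axis=0) / weighted.sum(axis=(0, 1))).T
        p_z_t = weighted.sum(axis=1)
        p_z_t /= p_z_t.sum(axis=1, keepdims=True)
    return p_s_z, p_z_t

# Toy data, then the category judgment of eq. (5.30) for each image.
P_ts = np.random.default_rng(1).random((5, 20)); P_ts /= P_ts.sum()
p_s_z, p_z_t = plsa_em(P_ts, K=3)
z_star = p_z_t.argmax(axis=1)
```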

Example 5.3 Expectation-maximization algorithm

The expectation-maximization algorithm is a statistical computing algorithm that searches for maximum likelihood or maximum a posteriori estimates of the parameters of a statistical probability model that depends on unobservable latent variables. It is an iterative technique for estimating unknown variables from the portion of the variables that is observed. The algorithm has two alternating iterative calculation steps, illustrated by the sketch after this list:

(1) Expectation (E step): using the existing estimates of the latent variables, calculate the expected value of the likelihood;

(2) Maximization (M step): on the basis of the expected likelihood from the E step, estimate the parameter values that maximize it; the parameter values thus obtained are used in the next E step.
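
For a concrete feeling of the two alternating steps, here is a minimal EM fit of a two-component 1-D Gaussian mixture (unit variances fixed for brevity) on synthetic data; this is a generic textbook illustration, not tied to any specific model in this section:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])
mu, pi = np.array([-1.0, 1.0]), np.array([0.5, 0.5])  # initial guesses

for _ in range(50):
    # E step: posterior responsibility of each component for each point.
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate parameters from the expected assignments.
    pi = resp.mean(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # approaches the true means (-2, 3)
```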

5.5.2.3 Model Application Examples

Consider an emotion-based task of semantic image classification, Li (2010). An image includes not only visual scene information but also a variety of emotional semantic information. In other words, besides expressing the scenery, state, and environment of the objective world, an image can also evoke a strong emotional response. Different emotional adjectives can be used to express different categories of emotion. One sentiment classification framework divides all emotions into 10 categories: five positive categories (amusement, contentment, excitement, awe, and undifferentiated positive) and five negative categories (anger, sadness, disgust, fear, and undifferentiated negative). The international community has established the International Affective Picture System (IAPS) database, Lang (1997), containing a total of 1,182 color pictures with very rich object categories. Some example pictures belonging to the above 10 emotion categories are shown in Figure 5.20: Figure 5.20(a) to Figure 5.20(e) correspond to the five positive emotions, and Figure 5.20(f) to Figure 5.20(j) correspond to the five negative emotions.

In image classification based on emotional semantic information, the images are the pictures from the database, the words are selected from an emotional vocabulary, and the topics correspond to latent emotional semantic factors (an intermediate layer of semantic concepts between low-level image features and high-level emotional categories). First, the low-level image features obtained with the SIFT operator are clustered by the k-means algorithm to form the emotional dictionary. Then, the pLSA model is used to learn the latent emotional semantic factors, so as to obtain the probability distribution p(s_j|z_k) of every latent emotional semantic factor over the emotional words and the probability distribution p(z_k|t_i) of every picture over the latent emotional semantic factors. Finally, a support vector machine (SVM) is used to train the emotional image classifier, which is then applied to classify pictures into the different emotion categories.
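
The final stage of this pipeline might look as follows; p_z_t and the labels y are random stand-ins for the per-picture topic distributions from a fitted pLSA model and the 10 emotion categories:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical inputs: p_z_t holds p(z_k|t_i) from a fitted pLSA model
# (the per-picture distribution over latent emotional semantic factors);
# y holds the emotion-category labels of the training pictures.
rng = np.random.default_rng(4)
p_z_t = rng.dirichlet(np.ones(20), size=800)   # 800 pictures, z = 20 factors
y = rng.integers(0, 10, size=800)              # stand-in labels, 10 categories

clf = SVC(kernel='rbf')                        # train the emotional classifier
clf.fit(p_z_t, y)
predicted = clf.predict(p_z_t[:5])             # classify pictures
```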

Example 5.4 Classification test and result

Some experimental results of emotion classification using the above method are shown in Table 5.1, where 70% of the pictures of each emotion class are taken as the training set and the remaining 30% as the test set. The training and testing processes are repeated 10 times, and the table shows the average correct classification rate over the 10 categories (%). The number of emotional words s was selected from 200 to 800 in steps of 100, and the number of latent emotional semantic factors z ranges from 10 to 70 in steps of 10.

Table 5.1 shows the effects of different numbers of latent emotional semantic factors and of emotional words on image classification. When the number of latent emotional semantic factors is fixed, the classification performance first improves gradually and then declines as the number of emotional words increases, with the best result around s = 500. Similarly, when the number of emotional words is fixed, the classification performance also first improves and then declines as the number of latent emotional semantic factors increases, with the best result around z = 20. Therefore, selecting s = 500 and z = 20 achieves the best classification result. A sketch of such a parameter sweep follows.
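
The sweep itself can be sketched as below; evaluate(s, z) is a purely hypothetical stand-in for "build a dictionary of size s, fit pLSA with z factors, train and test the SVM, and return the average correct classification rate":

```python
def evaluate(s, z):
    # Hypothetical placeholder: run the full dictionary/pLSA/SVM pipeline
    # for this (s, z) setting and return the average classification rate.
    return 0.0

best_s, best_z = max(((s, z) for s in range(200, 801, 100)
                             for z in range(10, 71, 10)),
                     key=lambda p: evaluate(*p))
```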

5.5.3 LDA Model

The LDA model is a generative probabilistic model for collections. It can be seen as formed by adding a hyperparameter layer to the pLSA model and building a probability distribution for the latent variable z, Blei (2003).

Figure 5.20: Example pictures of 10 kinds of emotion categories.

Table 5.1: Classification examples

5.5.3.1 Basic LDA Model

The LDA model can be illustrated by Figure 5.21, where the boxes represent sets (the large box represents the image set, with M the number of images; the small box represents the repeated selection of topics and words in an image, with N the number of words in the image; it is generally assumed that N is independent of q and z). Figure 5.21(a) shows the basic LDA model. The leftmost latent node, a, corresponds to the Dirichlet prior parameter of the topic distribution of every image. The second latent node from the left, q, represents the topic distribution of an image (q_i is the topic distribution of image i); q is also called the mixing probability parameter. The third latent node from the left, z, is the topic node; z_ij represents the topic of word j in image i. The fourth node from the left is the only observation node (shaded): s is the observation variable, and s_ij represents the jth word in image i. The rightmost node, b, holds the multinomial distribution parameters of the topic-word layer, namely the Dirichlet prior parameter of the word distribution of every topic.

As seen above, the basic LDA model is a three-layer Bayesian model, in which a and b are hyperparameters at the image-set level, q belongs to the image layer, and z and s belong to the visual vocabulary layer.

The LDA model includes K latent topics z = {z_1, z_2, ..., z_K}; every word in an image is produced by a corresponding topic, and each image is composed of the K topics according to the specific probabilities q. The model parameter N obeys a Poisson distribution; q obeys a Dirichlet distribution, that is, q ~ Dirichlet(a), where a is the prior of the Dirichlet distribution (the topic distributions of images fit a Dirichlet probability distribution). Each word is a term of the dictionary (visual vocabulary); it can be represented by a vector, of dimension equal to the dictionary size, with one component being 1 and all others 0. A word s_j is selected from the dictionary with probability p(s_j|q, b).

Figure 5.21: LDA schematic model.

Solving the LDA model involves two processes: variational approximate inference (a Gibbs sampler can also be used, see Griffiths (2004)) and parameter learning. Variational inference refers to determining the topic mixing probability q and the probability of every word being produced by topic z, given the hyperparameters a and b as well as the observed variables s, that is,

$$p(q, z|s, a, b) = \frac{p(q, z, s|a, b)}{p(s|a, b)} = \frac{p(q|a)\prod_{i=1}^{N} p(z_i|q)\, p(s_i|z_i, b)}{\int p(q|a)\left[\prod_{i=1}^{N}\sum_{z_i} p(z_i|q)\, p(s_i|z_i, b)\right] dq} \qquad (5.31)$$

where the denominator p(s|a, b) is the likelihood function of the words.

Because of the coupling between q and b, p(s|a, b) cannot be calculated directly. As seen from the LDA graph model, the coupling is induced by the conditional relationships among q, z, and s. Therefore, by deleting the connections between q and z as well as the observation node s, a simplified model is obtained, as shown in Figure 5.21(b), and an approximate distribution p′(q, z|r, f) of p(q, z|s, a, b) is obtained as

$$p'(q, z|r, f) = p(q|r)\prod_{i=1}^{N} p(z_i|f_i) \qquad (5.32)$$

where r is the Dirichlet distribution parameter of q, and f is the multinomial distribution parameter of z.

Further, taking the logarithm of p(s|a, b):

$$\log p(s|a, b) = L(r, f; a, b) + \mathrm{KL}\left[\,p'(q, z|r, f)\,\|\,p(q, z|s, a, b)\,\right] \qquad (5.33)$$

The second term on the right-hand side is the KL divergence between the approximating distribution p′ and the LDA posterior p. The smaller the KL divergence, the closer p′ approximates p. Minimizing the KL divergence is realized by maximizing the lower bound L(r, f; a, b) of the likelihood function and solving for the model parameters r and f. Once r and f are known, q and z can be obtained through sampling.
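
For reference, the lower bound being maximized is the usual evidence lower bound; written in this section's notation (a standard identity, not stated explicitly in the text):

$$L(r, f; a, b) = \mathbb{E}_{p'}\left[\log p(q, z, s|a, b)\right] - \mathbb{E}_{p'}\left[\log p'(q, z|r, f)\right] \le \log p(s|a, b)$$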

The parameter learning process determines the hyperparameters a and b, given the set of observed variables S = {s_1, s_2, ..., s_M}. This can be achieved by variational EM iteration: the above variational inference algorithm is used in the E step to calculate the variational parameters r and f for each image, while in the M step the variational parameters collected from all images are used to compute the partial derivatives with respect to the hyperparameters a and b, maximizing the lower bound L(r, f; a, b) of the likelihood function; finally, the estimates of the hyperparameters are obtained.

In practice, the basic LDA model is generally extended to the smoothed LDA model to get better results (overcoming sparsity problems on large data sets). The smoothed LDA model is shown in Figure 5.21(c), where K represents the number of topics in the model, and f corresponds to a K × V Markov matrix (V is the dimension of the word vector), each row of which represents the word distribution of a topic.

Here the image is represented as a random mixture of latent topics, each topic being characterized by a distribution over words. For each image i in the image collection, the generation process of LDA is as follows (see the sketch after the list):

(1) Selecting q_i to satisfy the Dirichlet distribution, q_i ~ Dirichlet(a), where i ∈ {1, ..., M} and Dirichlet(a) is the Dirichlet distribution with parameter a;

(2) Selecting f_k to satisfy the Dirichlet distribution, f_k ~ Dirichlet(b), where k ∈ {1, ..., K} and Dirichlet(b) is the Dirichlet distribution with parameter b;

(3) For each word s_ij, where j ∈ {1, ..., N_i}, selecting a topic z_ij ~ Multinomial(q_i) and then a word s_ij ~ Multinomial(f_{z_ij}), where Multinomial denotes the multinomial distribution.
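
A minimal numpy sketch of this three-step generative process (all sizes and prior values are hypothetical choices, and N is kept fixed across images for brevity):

```python
import numpy as np

rng = np.random.default_rng(5)
M, K, V, N = 4, 3, 50, 60            # images, topics, dictionary size, words/image
a, b = np.full(K, 0.5), np.full(V, 0.1)   # hypothetical Dirichlet priors

phi = rng.dirichlet(b, size=K)       # step (2): word distribution of each topic
images = []
for i in range(M):
    theta = rng.dirichlet(a)         # step (1): topic mixture q_i of image i
    z = rng.choice(K, size=N, p=theta)                  # step (3): topic per word
    s = np.array([rng.choice(V, p=phi[k]) for k in z])  # word drawn from its topic
    images.append(s)
```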

5.5.3.2 SLDA Model

To further improve the classification performance of the LDA model, category information can be introduced, giving a supervised LDA model, the SLDA model, Wang (2009), whose graph model is shown in Figure 5.22. Figure 5.22(a) is the standard SLDA model; the meaning of each node in the upper portion is the same as in Figure 5.21(a), while in the lower portion a category tag node l related to topic z has been added. The tag l corresponding to topic z ∈ Z = {z_k}, k = 1, ..., K, can be predicted with the parameter h of a softmax classifier. The inference on topic z in the SLDA model is influenced by the tag l, which makes the hyperparameter d of the word-topic distribution more suitable for classification tasks (it can also be used for labeling).
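
In the softmax classifier mentioned here, the tag probability is typically computed from the empirical topic frequencies of the image; written in this section's notation (a standard form; see Wang (2009) for the exact model):

$$p(l|z_{1:N}, h) = \frac{\exp\left(h_l^{\mathrm{T}} \bar{z}\right)}{\sum_{l'} \exp\left(h_{l'}^{\mathrm{T}} \bar{z}\right)}, \qquad \bar{z} = \frac{1}{N}\sum_{i=1}^{N} z_i$$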

The likelihood function of an image in the SLDA model is

$$p(s, l|a, b, h) = \int p(q|a)\left[\sum_{z}\prod_{i=1}^{N} p(z_i|q)\, p(s_i|z_i, b)\right] p(l|z_{1:N}, h)\, dq \qquad (5.34)$$

where h is the control parameter of the category tag l. The parameter learning process determines the hyperparameters a, b, and h, given the set of observed variables S = {s_1, s_2, ..., s_M} and the category information {l_i}, i = 1, ..., M; this can be achieved with the variational EM algorithm. The variational inference of SLDA determines the topic mixing probability q of an image, the topic probability z of every word, and the image category tag l, given the hyperparameters a, b, and h as well as the observed variable s. Compared with the LDA model, the variational EM algorithm and the variational inference method for SLDA are more complex, Wang (2009).

Figure 5.22: SLDA schematic model.

The simplified SLDA model is shown in Figure 5.22(b); it is the same as the simplified LDA model in Figure 5.21(b).

5.6 Problems and Questions

5-1 Figure Problem 5-1 shows the membership functions of the (gray scale) dark fuzzy set D, the (gray scale) bright fuzzy set B, and the (gray scale) medium fuzzy set M. It can be seen that a given gray scale g can correspond to several different membership function values; try to explain this.

Figure Problem 5-1

5-2* In Problem 5-1, if LM(x) = 2/3 and LB(x) = 1/3, what is the gray level g?

5-3 As shown in Figure Problem 5-1, what are the values of LD(x), LM(x), and LB(x) when g = 180? And when g = 18?

5-4 Prove that in fuzzy logic, the AND of a set with its complement is not an empty set.

5-5 The semantic-based region growing method can also be applied to the semantic segmentation and interpretation of images, but the use of genetic algorithms for semantic segmentation and interpretation is equivalent to a combination of image splitting and merging. Try to explain why this can be said. What advantages does this approach bring?

5-6 If, in Example 5.1, eq. (5.14) is used instead of eq. (5.13) to calculate the circularity of the region marked as sphere and the confidence of the location of the sphere region within the grassland region, how do the steps and results of the random variation change?

5-7 If the initial code strings obtained from the random region marking in Figure 5.8(a) are BLLLB and BBLBL, try to draw the two corresponding sets of segmentation hypothesis graphs. In addition, what will the subsequent processes of replication, crossover, and mutation look like?

5-8 There should be a series of intermediate steps between Figure 5.13(c) and Figure 5.13(d); supplement them. It is required to determine only one target at a time and to list the properties used.

5-9 Compare the differences between the image grid method and the Gaussian difference method in obtaining local feature descriptions of an image, and the approaches they use. What effects will they have on the obtained description vectors and on the final classification results?

5-10 Given some indoor scene images, such as classrooms, meeting rooms, offices, exhibition rooms, gymnasiums, libraries, and so on, try to use the pLSA model to classify these images. Show some examples with images, words, and topics (text and graphs), and describe the workflow.

5-11* For the basic LDA model shown in Figure 5.21(a), illustrate the nodes of the graph with concepts and examples from the image domain.

5-12 Perform the same scene classification tasks with the same requirements as in Problem 5-10, but this time using the SLDA model.

5.7 Further Reading

1. Overview of Scene Understanding

There are several books on scene semantic interpretation providing information on different research directions, for example, Luo (2005) and Gao (2009).

A work on using Bayesian networks for image semantic interpretation can be found in Luo (2005).

2. Fuzzy Reasoning

More on fuzzy sets and fuzzy operations can be seen in the contents of specialized books, such as Cox (1994) and Zhangzl (2010).

The application of fuzzy reasoning in image pattern recognition can be seen in Duda (2001), Theodoridis (2009).

3. Image Interpretation with Genetic Algorithms

More information on the content of genetic algorithms can be found in special books, such as Mitchell (1996), Han (2010).

Genetic algorithm can also be used in combination with fuzzy set method for segmentation based on statistical properties of pixel points and their spatial neighborhood information as in Xue (2000).

4. Labeling of Objects in Scene

An early overview of relaxation labeling of objects in a scene can be found in Kittler (1985).

A method of associating probability for automatic image tagging can be found in Xu (2008).

5. Scene Classification

In feature-based scene classification, the construction of the visual dictionary is an important task. Two examples of how to optimize dictionary learning can be found in Duan (2012) and Liu (2012a).

An example of combining image classification and retrieval is shown in Zhang (2008c).

A semi-supervised image classification based on linear programming for bootstrapping can be seen in Li (2011).

A work on image classification using the identification of sparse coding can be found in Liu (2012).
