Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 2

Bi-Level Sparse Coding: A Hyperspectral Image Classification Example^⁎

Zhangyang Wang Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States

Abstract

We present a semisupervised method for single pixel classification of hyperspectral images. The proposed method is designed to address the special problematic characteristics of hyperspectral images, namely, high dimensionality of hyperspectral pixels, lack of labeled samples, and spatial variability of spectral signatures. To alleviate these problems, the proposed method features the following components. First, being a semisupervised approach, it exploits the wealth of unlabeled samples in the image by evaluating the confidence probability of the predicted labels, for each unlabeled sample. Second, we propose to jointly optimize the classifier parameters and the dictionary atoms by a task-driven formulation, in order to ensure that the learned features (sparse codes) are optimal for the trained classifier. Finally, it incorporates spatial information through adding a Laplacian smoothness regularization to the output of the classifier, rather than the sparse codes, making the spatial constraint more flexible. The proposed method is compared to a few comparable methods for classification of several popular datasets, and it produces significantly better classification results.

Keywords

Hyperspectral image classification; Semisupervised learning; Task-driven dictionary learning; Sparse coding; Spatial Laplacian regularization; Bi-level optimization

2.1 Introduction

The spectral information contained in hyperspectral imagery allows characterization, identification, and classification of land-covers with improved accuracy and robustness. However, several critical issues should be addressed in the classification of hyperspectral data, among which are the following [1,2,38]: (1) small amount of available labeled data; (2) high dimensionality of each spectral sample; (3) spatial variability of spectral signatures; (4) high cost of sample labeling. In particular, the large number of spectral channels and small number of labeled training samples pose the problem of the curse of dimensionality and as a consequence resulting in the risk of overfitting the training data. For these reasons, desirable properties of a hyperspectral image classifier should be its ability to produce accurate land-cover maps when working within a high-dimensional feature space, low-sized training datasets, and high levels of spatial spectral signature variability.

Many supervised and unsupervised classifiers have been developed to tackle the hyperspectral data classification problem [3]. Classical supervised methods, such as artificial neural networks [4,5] and support vector machines (SVMs) [6–9], were readily revealed to be inefficient when dealing with a high number of spectral bands and lack of labeled data. In [10], SVM was regularized with an unnormalized graph Laplacian, thus leading to the Laplacian SVM (LapSVM) that adopts the manifold assumption for semisupervised classification. Another framework based on neural network was presented in [11]. It consists of adding a flexible embedding regularizer to the loss function used for training neural networks, and leads to improvements in both classification accuracy and scalability on several hyperspectral image classification problems. In recent years, kernel-based methods have often been adopted for hyperspectral image classification [12–15]. They are certainly able to handle efficiently the high-dimensional input feature space and deal with the noisy samples in a robust way [16]. More recently, sparse representation has been increasingly popular for image classification. The sparse representation-based classification (SRC) [17] is mainly based on the observation that despite the high dimensionality of natural signals, signals belonging to the same class usually lie in a low-dimensional subspace. In [18], an SRC-based algorithm for hyperspectral classification was presented, that utilizes the sparsity of the input sample with respect to a given overcomplete training dictionary. It is based on a sparsity model where a test spectral pixel is approximately represented by a few training samples (atoms) among the entire atoms from a dictionary. The weights associated with the atoms are called the sparse code. The class label of the test pixel is then determined by the characteristics of the recovered sparse code. Experimental results show remarkable improvements in discriminative effects. However, the main difficulty with all supervised methods is that the learning process heavily depends on the quality of the training dataset. Even worse, labeled hyperspectral training samples are only available in a very limited number due to the cost of sample labeling. On the other hand, unsupervised methods are not sensitive to the number of labeled samples since they operate on the whole dataset, but the relationships between clusters and class labels are not ensured [19]. Moreover, typically in hyperspectral classification, a preliminary feature selection/extraction step is undertaken to reduce the high input space dimensionality, which is time-consuming, scenario-dependent, and needs prior knowledge.

As a trade-off, semisupervised classification methods become a natural alternative to yield better performance. In semisupervised learning literature, the algorithms are provided with some available supervised information in the form of labeled data in addition to the wealth of unlabeled data. Such a framework has recently attracted a considerable amount of research in remote sensing, such as the Laplacian SVM (LapSVM) [9,10], transductive SVM [20], biased-SVM [21] and graph-based methods [22]. Even though the above mentioned algorithms exhibit good performance in classifying hyperspectral images, most of them are based on the assumption that spectrally similar instances should share the same label. However in practice, we may have very different spectra corresponding to the same material, which sometimes makes the above strict assumption no longer valid. Moreover, in most recent hyperspectral classification approaches [23,24], the spatial information is exploited together with the spectral features, encouraging pixels in the local neighborhood to have similar labels. The spatial smoothness assumption holds well in the homogenous regions of hyperspectral images. However, conventional approaches often fail to capture the spatial variability of spectral signatures, e.g., on the border of regions belonging to different classes.

In this chapter, we introduce a hyperspectral image classification method, tackling the problems imposed by the special characteristics of hyperspectral images, namely, high-input dimension of pixels, low number of labeled samples, and spatial variability of the spectral signatures. To this end, the proposed method has the following characteristics and technical contributions:

• Semisupervised. Extending the task-driven dictionary learning formulation in [25] to the semisupervised framework for hyperspectral classification, the huge number of unlabeled samples in the image are exploited together with a limited amount of labeled samples to improve the classification performance in a task-driven setting.
• Joint optimization of feature extraction and classification. Almost all prior research on hyperspectral classifier design can be viewed as the combinations of two independent parts, extraction of features and a training procedure for designing the classifier. Although, in some prior work raw spectral pixels are used directly, it is widely recognized that features extracted from the input pixels, such as the sparse code, often promote a more discriminative and robust classification [17]. However, to consider the two stages separately typically leads to a suboptimal performance, because the extracted features are not optimized for the best performance of the following classification step. We jointly optimize the classifier parameters and dictionary atoms. This is different from the classical data-driven feature extraction approach [18] that only tries to reconstruct the training samples well. Our joint task-driven formulation ensures that the learned sparse code features are optimal for the classifier.
• Incorporation of spatial information. We incorporate spatial information by adding a spatial Laplacian regularization [9] to the probabilistic outputs of the classifier, i.e., the likelihood of the predicted labels. This is more flexible than the popular “naive” Laplacian smoothness constraint that simply enforces all pixels in a local window to have similar learned features.

A novel formulation of bi-level optimization is designed to meet our requirements [26,27], which is solved by a stochastic gradient descent algorithm [28]. The proposed method is then evaluated on three popular datasets and we see an impressive improvement in performance on all of them. Even for quite ill-posed classification problems, i.e., very small number of high dimensional labeled samples, the proposed method gains a remarkable and stable improvement in performance over comparable methods.

The rest of this chapter is organized as follows. Section 2.2 manifests a step-by-step construction of our formulation in details, followed by the optimization algorithm to solve it. Section 2.3 discusses the classification results of the proposed method in comparison to several other competitive methods, with a wide range of available labeled samples. It also investigates the influences of both the unlabeled samples and dictionary atoms on the classifier's performance, as well as the discriminability of the obtained dictionary. Section 2.4 includes some concluding remarks and indications for the future work.

2.2 Formulation and Algorithm

2.2.1 Notations

Consider a hyperspectral image $X \in R^{m \times n}$ of n pixels, each of which consists of an m-dimensional spectral vector. Let $X = [x_{1}, x_{2}, \dots, x_{n}]$ denote the pixel set in a hyperspectral image, with each spectral pixel $x_{i} \in R^{m \times 1}$ , $i = 1, 2, \dots, n$ . For all the corresponding labels $y = [y_{1}, y_{2}, \dots, y_{n}]$ , we assume l labels $[y_{1}, y_{2}, \dots, y_{l}]$ are known, constituting a labeled training set $X_{l} = [x_{1}, x_{2}, \dots, x_{l}]$ , while making $X_{u} = [x_{l + 1}, x_{l + 2}, \dots, x_{n}]$ the unlabeled training set with $u = n - l$ . We assume that the number of labeled samples is uniformly selected for each class. This means that for a K-class classification, each class has $l_{c} = \frac{l}{K}$ labeled samples.

Without loss of generality, we let all $y_{i} \in {- 1, 1}$ to focus on discussing a binary classification. However, the proposed classifier can be naturally extended to a multiclass case, by either replacing the binary classifier with the multiclass classifier (e.g., soft-max classifier [30]), or adopting the well-known one-versus-one or one-versus-all strategy.

Our goal is to jointly learn a dictionary D consisting of a set of basis for extracting the sparse code (feature vector), and the classification parameter w for a binary classifier applied to the extracted feature vector, while guaranteeing them to be optimal to each other.

2.2.2 Joint Feature Extraction and Classification

2.2.2.1 Sparse Coding for Feature Extraction

In [18], the authors suggest that the spectral signatures of pixels belonging to the same class are assumed to approximately lie in a low-dimensional subspace. Pixel can be compactly represented by only a few sparse coefficients (sparse code). We adopt the sparse code as the input features, since extensive literature has examined the outstanding effect of SRC for a more discriminative and robust classification [17].

We assume that all the data samples $X = [x_{1}, x_{2}, \dots, x_{n}]$ , $x_{i} \in R^{m \times 1}$ , $i = 1, 2, \dots, n$ , are encoded into their corresponding sparse codes $A = [a_{1}, a_{2}, \dots, a_{n}]$ , $a_{i} \in R^{p \times 1}$ , $i = 1, 2, \dots, n$ , using a learned dictionary $D = [d_{1}, d_{2}, \dots, d_{p}]$ , where $d_{i} \in R^{m \times 1}$ , $i = 1, 2, \dots, p$ are the learned atoms. It should be noted that the initial dictionary is generated by assigning equal number of atoms to each class. This means that for a K-class classification, there are $p_{c} = \frac{p}{K}$ atoms assigned to each class in a dictionary consisting of p atoms.

The sparse representation is obtained by the following convex optimization:

$\begin{matrix} A = \arg \min_{A} \frac{1}{2} | | X - DA | |_{F}^{2} + λ_{1} \sum_{i} | | a_{i} | |_{1} + λ_{2} | | A | |_{F}^{2}, \end{matrix}$

(2.1)

or rewritten in a separate form for each $x_{i}$ as

$\begin{matrix} a_{i} = \arg \min_{a_{i}} \frac{1}{2} | | x_{i} - {Da}_{i} | |_{2}^{2} + λ_{1} | | a_{i} | |_{1} + λ_{2} | | a_{i} | |_{2}^{2} . \end{matrix}$

(2.2)

Note $λ_{2} > 0$ is necessary for proving the differentiability of the objective function (see [2.1] in Appendix). However, setting $λ_{2} = 0$ proves to work well in practice [25].

Obviously, the effect of sparse coding (2.1) largely depends on the quality of dictionary D. The authors in [18] suggest to construct the dictionary by directly selecting atoms from the training samples. More sophisticated methods are widely used in SRC literature, discussing on how to learn a more compact and effective dictionary from a given training dataset, e.g., the K-SVD algorithm [31].

We recognize that many structured sparsity constraints (priors) [18,32] can also be considered for dictionary learning. They usually exploit the correlations among the neighboring pixels or their features. For example, the SRC dictionary has an inherent group-structured property since it is composed of several class-wise subdictionaries, i.e., the atoms belonging to the same class are grouped together to form a subdictionary. Therefore, it would be reasonable to enforce each pixel to be compactly represented by groups of atoms instead of individual ones. This could be accomplished by encouraging coefficients of only certain groups to be active, like the group lasso [33]. While the performance may be improved by enforcing structured sparsity priors, the algorithm will be considerably more complicated. Therefore, we do not take into account any structured sparsity prior here, and leave them for our future study.

2.2.2.2 Task-Driven Functions for Classification

Classical loss functions in SRC are often defined by the reconstruction error of data samples [18,39]. The performances of such learned classifiers highly hinge on the quality of the input features, which is only suboptimal without the joint optimization with classifier parameters. In [34], the authors study a straightforward joint representation and classification framework, by adding a penalty term to the classification error in addition to the reconstruction error. The authors of [35,36] propose to enhance the dictionary's representative and discriminative power by integrating both the discriminative sparse-code error and the classification error into a single objective function. The approach jointly learns a single dictionary and a predictive linear classifier. However, being a semisupervised method, the unlabeled data does not contribute much to promoting the discriminative effect in [36], as only the reconstruction error is considered on the unlabeled set except for an “expansion” strategy applied to a small set of highly-confident unlabeled samples.

In order to obtain an optimal classifier with regard to the input feature, we exploit a task-driven formulation which aims to minimize a classification-oriented loss [25]. We incorporate the sparse codes $a_{i}$ , which are dependent on the atoms of the dictionary D that are to be learned, into the training of the classifier parameter w. The logistic loss is used in the objective function for the classifier. We recognize that the proposed formulation can be easily extended to other classifiers, e.g., SVM. The loss function for the labeled samples is directly defined by the logistic loss

$\begin{matrix} L (A, w, x_{i}, y_{i}) = \sum_{i = 1}^{l} \log (1 + e^{- y_{i} w^{T} a_{i}}) . \end{matrix}$

(2.3)

For unlabeled samples, the label of each $x_{i}$ is unknown. We propose to introduce the predicted confidence probability $p_{i j}$ that sample $x_{i}$ has label $y_{j}$ ( $y_{j} = 1$ or −1), which is naturally set as the likelihood of the logistic regression

$\begin{matrix} p_{i j} = p (y_{j} | {w, a}_{i}, x_{i}) = \frac{1}{1 + e^{- y_{j} w^{T} a_{i}}}, y_{j} = 1 or - 1 . \end{matrix}$

(2.4)

The loss function for the unlabeled samples then turns into an entropy-like form

$\begin{matrix} U (A, w, x_{i}) = \sum_{i = l + 1}^{l + u} \sum_{y_{j}} p_{i j} L (a_{i}, w, x_{i}, y_{j}), \end{matrix}$

(2.5)

which can be viewed as a weighted sum of loss under different classification outputs $y_{j}$ .

Furthermore, we can similarly define $p_{i j}$ for the labeled sample $x_{i}$ , that is 1 when $y_{j}$ is the given correct label $y_{i}$ and 0 elsewhere. The joint loss functions for all the training samples can thus be written into a unified form

$\begin{matrix} T (A, w) = \sum_{i = 1}^{l + u} \sum_{y_{j}} p_{i j} L (a_{i}, w, x_{i}, y_{j}) . \end{matrix}$

(2.6)

A semisupervised task-driven formulation has also been proposed in [25]. However, it is posed as a naive combination of supervised and unsupervised steps. The unlabeled data are only used to minimize the reconstruction loss, without contributing to promoting the discriminative effect. In contrast, our formulation (2.6) clearly distinguishes itself by assigning an adaptive confidence weight (2.4) to each unlabeled sample, and minimizes a classification-oriented loss over both labeled and unlabeled samples. By doing so, unlabeled samples also contribute to improving the discriminability of learned features and classifier, jointly with the labeled samples, rather than only optimized for reconstruction loss.

2.2.2.3 Spatial Laplacian Regularization

We first introduce the weighting matrix G, where $G_{i k}$ characterizes the similarity between a pair of pixels $x_{i}$ and $x_{k}$ . We define $G_{i k}$ in the form of shift-invariant bilateral Gaussian filtering [37] (with controlling parameters $σ_{d}$ and $σ_{s}$ )

$\begin{matrix} G_{i k} = \exp (- \frac{d (x_{i}, x_{k})}{2 σ_{d}^{2}}) \cdot \exp (- \frac{| | x_{i} - x_{k} | |_{2}^{2}}{2 σ_{s}^{2}}), \end{matrix}$

(2.7)

which measures both the spatial Euclidean distance ( $d (x_{i}, x_{k})$ ) and the spectral similarity between an arbitrary pair of pixels in a hyperspectral image. Larger $G_{i k}$ represents higher similarity and vice versa. Further, rather than simply enforcing pixels within a local window to share the same label, $G_{i k}$ is defined over the whole image and encourages both spatially neighboring and spectrally similar pixels to have similar classification outputs. It makes our spatial constraints much more flexible and effective. Using the above similarity weights, we define the spatial Laplacian regularization function

$\begin{matrix} S (A, w) = \sum_{i = 1}^{l + u} \sum_{y_{j}} \sum_{k}^{l + u} G_{i k} | | p_{i j} - p_{k j} | |_{2}^{2} . \end{matrix}$

(2.8)

2.2.3 Bi-level Optimization Formulation

Finally, the objective cost function for the joint minimization formulation can be expressed by the following bi-level optimization (the quadratic term of w is to avoid overfitting)

$\begin{matrix} \min_{D, w} T (A, w) + S (A, w) + \frac{λ}{2} | | w | |_{2}^{2} \\ s . t . A = \arg \min_{A} \frac{1}{2} | | X - DA | |_{F}^{2} + λ_{1} \sum_{i} | | a_{i} | |_{1} + λ_{2} | | A | |_{F}^{2} . \end{matrix}$

(2.9)

Bi-level optimization [26] has been investigated in both theory and application sides. In [27], the authors propose a general bi-level sparse coding model for learning dictionaries across coupled signal spaces. Another similar formulation has been studied in [25] for general regression tasks.

In the testing stage, each test sample is first represented by solving (2.2) over the learned D. The resulting sparse coefficients are fed to the trained logistic classifier with the previously learned w. The test sample is classified into the class of the highest output probability (2.4).

2.2.4 Algorithm

Built on the similar methodologies of [25] and [27], we solve (2.9) using a projected first order stochastic gradient descent (SGD) algorithm, whose detailed steps are outlined in Algorithm 2.1. At a high level overview, it consists of an outer stochastic gradient descent loop that incrementally samples the training data. It uses each sample to approximate gradients with respect to the classifier parameter w and the dictionary D, which are then used to update them. Next, we briefly explain a few key technical points of the Algorithm 2.1.

Algorithm 2.1 Stochastic gradient descent algorithm for solving (2.9).

2.2.4.1 Stochastic Gradient Descent

The stochastic gradient descent (SGD) algorithm [28] is an iterative, “online” approach for optimizing an objective function, based on a sequence of approximate gradients obtained by randomly sampling from the training data set. In the simplest case, SGD estimates the objective function gradient on the basis of a single randomly selected example $x_{t}$

$\begin{matrix} w_{t + 1} = w_{t} - ρ_{t} \nabla_{w} F (x_{t}, w_{t}), \end{matrix}$

(2.10)

where F is a loss function, w is a weight being optimized and $ρ_{t}$ is a step size known as the“learning rate”. The stochastic process ${w_{t}, t = 1, \dots}$ depends upon the sequence of randomly selected examples $x_{t}$ from the training data. It thus optimizes the empirical cost, which is hoped to be a good proxy for the expected cost.

Following the derivations in [25], we can show that the objective function in (2.9), denoted as $B (A, w)$ for simplicity, is differentiable on $D \times w$ , and that

$\begin{matrix} \nabla_{w} B (A, w) = E_{x, y} [\nabla_{w} T (A, w) + \nabla_{w} S (A, w) + λ w], \\ \nabla_{D} B (A, w) = E_{x, y} [- D β^{⁎} A^{T} + (X_{t} - D A) {β^{⁎}}^{T}], \end{matrix}$

(2.11)

where $β^{⁎}$ is a vector defined by the following property:

$\begin{matrix} β_{S^{C}}^{⁎} = 0, β_{S}^{⁎} = {(D_{S}^{T} D_{S} + λ_{2} I)}^{- 1} \nabla_{A_{S}} [T (A, w) + S (A, w)], \end{matrix}$

(2.12)

and S are the indices of the nonzero coefficients of A. The proof of the above equations is given in the Appendix.

2.2.4.2 Sparse Reconstruction

The most computationally intensive step in Algorithm 2.1 is solving the sparse coding (step 3). We adopt the feature-sign algorithm [39] for efficiently solving the exact solution to the sparse coding problem.

Remark on SGD convergence and sampling strategy. As a typical case in machine learning, we use SGD in a setting where it is not guaranteed to converge in theory, but behaves well in practice, as shown in our experiments. (The convergence proof of SGD [29] for nonconvex problems indeed assumes three times differentiable cost functions.)

SGD algorithms are typically designed to minimize functions whose gradients have the form of an expectation. While an i.i.d. (independent and identically distributed) sampling process is required, it cannot be computed in a batch mode. In our algorithm, instead of sampling one per iteration, we adopt a mini-batch strategy by drawing more samples at a time. Authors in [25] further pointed out that solving multiple elastic-net problems with the same dictionary D can be accelerated by the precomputation of the matrix $D^{T} D$ . In practice, we draw a set of 200 samples in each iteration, which produces steadily good results in all our experiments under universal settings.

Strictly speaking, drawing samples from the distribution of training data should be made i.i.d. (step 2 in Algorithm 2.1). However, this is practically difficult since the distribution itself is typically unknown. As an approximation, samples are instead drawn by iterating over random permutations of the training set [29].

2.3 Experiments

In this section, we evaluate the proposed method on three popular datasets, and compare it with some related approaches in the literature, including:

• Laplacian Support Vector Machine (LapSVM) [9,10], which is a semisupervised extension of the SVM and applies the spatial manifold assumption to SVM. The classification is directly executed on raw pixels without any feature extraction, which follows the original setting in [10].
• Semisupervised Classification (SSC) approach [40] that employs a modified clustering assumption.
• Semisupervised hyperspectral image segmentation that adopts Multinomial Logistic Regression with Active Learning (MLR-AL) [41].

Regarding parameter choices of the three methods, we try our best to follow the settings in their original papers. For LapSVM, the regularization parameters $γ_{1}, γ_{2}$ are selected from $[10^{- 5}, 10^{5}]$ according to a five-fold cross-validation procedure. In SSC, the width parameter of Gaussian function is tuned using a five-fold cross-validation procedure. The parameter setting in MLR-AL follows that of the original paper [41].

Besides the above mentioned three algorithms, we also include the following algorithms in the comparison, in order to illustrate the merits of both joint optimization and spatial Laplacian regularization on the classifier outputs:

• Non-joint optimization of feature extraction and classification (Non-Joint). It refers to conducting the following two stages of the optimization sequentially:

$\begin{matrix} 1. Feature extraction: \\ A = \arg \min_{A} \frac{1}{2} | | X - DA | |_{F}^{2} + λ_{1} \sum_{i} | | a_{i} | |_{1} + λ_{2} | | A | |_{F}^{2} . \\ 2. Learning a classifier: \\ \min_{w} T (A, w) + \frac{λ}{2} | | w | |_{2}^{2} . \end{matrix}$

(2.13)

The training of D is independent of the learning of the classifier parameter w. This is different from the joint optimization of the dictionary and classifier as is done in (2.9) by the task-driven formulation.
• Non-joint optimization of feature extraction and classification, with spatial Laplacian regularization (Non-Joint + Laplacian). It is the same as the Non-Joint method except for adding a spatial Laplacian regularization term $S (A, w)$ to the second subproblem:

(2.14)

• The proposed joint method without spatial Laplacian regularization (Joint), which is done by dropping the $S (A, W)$ term in (2.9)

$\begin{matrix} \min_{D, W} T (A, w) + \frac{λ}{2} | | w | |_{2}^{2} \\ s . t . A = \arg \min_{A} \frac{1}{2} | | X - DA | |_{F}^{2} + λ_{1} \sum_{i} | | a_{i} | |_{1} + λ_{2} | | A | |_{F}^{2} . \end{matrix}$

(2.15)

• The proposed joint method with spatial Laplacian regularization (Joint + Laplacian), by minimizing our proposed bi-level formulation (2.9).

Parameter setting. For the proposed method, the regularization parameter λ in (2.9) is fixed to be $10^{- 2}$ , and $λ_{2}$ in (2.2) is set to 0 to exploit sparsity. The elastic-net parameter $λ_{1}$ in (2.2) is generated by a cross-validation procedure, which is similar to the one in [25]. The values of $λ_{1}$ are 0.225, 0.25, and 0.15 for the three experiments in Sections 2.3.1–2.3.3, respectively; $σ_{d}$ and $σ_{s}$ in (2.7) are fixed as 3 and 3000, respectively. The learning rate ρ is set as 1, and maximum number ITER is set as 1000 for all. Although, we have verified that these choices of parameters work well in extensive experiments, we recognize that a finer tuning of them may further improve the performance.

In particular, we would like to mention the initializations of D and w. For the two non-joint methods, D is initialized by solving the first subproblem (feature extraction) in (2.13) or (2.14). In this subproblem, for each class, we initialize its subdictionary atoms randomly. We then employ several iterations of K-SVD using only the available labeled data for that class, and finally combine all the output class-wise sub-dictionaries into a single initial dictionary D. Next, we solve A based on D, and continue to feed A into the second subproblems (learning a classifier) in (2.13) and (2.14) for good initializations of w, for Non-Joint and Non-Joint + Laplacian, respectively. For the two joint methods, we use the results of Non-Joint, and Non-Joint + Laplacian, to initialize D and w of Joint and Joint + Laplacian, respectively.

The one-versus-all strategy is adopted for addressing multiclass problems, which means that we train K different binary logistic classifiers with K corresponding dictionaries for a K-class problem. For each test sample, the classifier with the maximum score will provide the class label. When the class number is large, this one-versus-all approach has proven to be more scalable than learning a single large dictionary with a multi-class loss [25], while providing very good results.

For the two joint methods, we assign only five dictionary atoms per class to initialize the dictionary, which means for a K-class problem we have $p_{c} = 5$ and $p = 5 K$ for the total dictionary size. For the two non-joint methods, 50 dictionary atoms ( $p_{c} = 50$ ) are assigned per class in the first subproblems of (2.13) and (2.14), respectively. The reason why we use the term “atoms per class” are two-fold: (1) we initialize our dictionary by first applying KSVD to each class to obtain a class-wise subdictionary. This helps to improve the class discriminability of the learned dictionary more than just applying KSVD to the whole data. Therefore, we need to specify how many atoms are assigned per class in the initialization stage. Note when Algorithm 2.1 starts, the atoms become all entangled, and further it is impossible to identify how many (and which) atoms are representing a specific class in the final learned dictionary; (2) As each dataset has a different number of classes, and empirically, more classes demand more dictionary atoms to represent. Note, however, if we assign atoms in proportion to the number of samples per class, some minor classes will tend to be severely underrepresented. In all experiments here, we by default use all the unlabeled pixels (denoted as “ALL”) from each hyperspectral image for semi-supervised training.

2.3.1 Classification Performance on AVIRIS Indiana Pines Data

The AVIRIS sensor generates 220 bands across the spectral range from 0.2 to 2.4 μm. In the experiment, the number of bands is reduced to 200 by removing 20 water absorption bands. The AVIRIS Indiana Pines hyperspectral image has the spatial resolution of 20 m and $145 \times 145$ pixels. It contains 16 classes, most of which are different types of crops (e.g., corns, soybeans, and wheats). The ground-truth classification map is shown in Fig. 2.1A.

Figure 2.1 Classification maps for the AVIRIS Indian Pines scene using different methods with 10 labeled samples per class.

Table 2.1 evaluates the influence of the number of labeled samples per class $l_{c}$ on the classification of AVIRIS Indiana Pines data, with $l_{c}$ varying from 2 to 10. The dictionary consists of only $p = 80$ atoms to represent all the 16 classes for the joint methods, and $p = 800$ for the non-joint methods. The bold value in each column indicates the best result among all the seven methods. As can be seen from the table, the classification results improve for all the algorithms with the increase in the number of labeled samples. The last two methods, i.e., “Joint” and “Joint + Laplacian”, outperform the other five methods in terms of overall accuracy (OA) significantly. It is also observed that the “Joint + Laplacian” method obtains further improvement over the “Joint” method, showing the advantage of spatial Laplacian regularization. Amazingly, we notice that even when there are as few as three samples per class, the OA of the proposed method (“Joint + Laplacian”) is still higher than 80%.

Table 2.1

Overall classification results (%) for the AVIRIS Indiana Pines data with different numbers of labeled samples per class (u = ALL, λ = 10⁻², λ₁ = 0.225, λ₂ = 0, ρ = 1, σ_d = 3, σ_s = 3000)

$l_{c}$	2	3	4	5	6	7	8	9	10
LapSVM [9]	57.80	61.32	63.1	66.39	68.27	69.00	70.15	70.04	70.73
SSC [40]	44.61	56.98	58.27	60.56	60.79	64.19	66.81	69.40	70.50
MLR+AL [41]	52.34	56.16	59.21	61.47	65.16	69.21	72.14	73.89	74.43
Non-Joint (p_c = 50)	63.72	69.21	71.87	76.88	79.04	81.81	85.23	87.77	88.54
Non-Joint + Laplacian (p_c = 50)	66.89	72.37	75.33	78.78	81.21	84.98	87.25	88.61	89.88
Joint (p_c = 5)	69.81	76.03	80.42	82.91	84.81	85.76	86.95	87.54	89.31
Joint + Laplacian (p_c = 5)	76.55	80.63	84.28	86.33	88.27	90.68	91.87	92.53	93.11

Fig. 2.1 demonstrates the classification maps obtained by all the methods when 10 labeled samples are used per class. The proposed method, either with or without spatial regularization, obtains much less misclassifications compared with the other methods. What is more, the homogenous areas in (H) are significantly better preserved than that in (G), which again confirms the effectiveness of the spatial Laplacian regularization on the output of the classifier. Fig. 2.2 visually demonstrates that along with the increase of $l_{c}$ , the classification results gradually improve, and both the regional and scattered misclassifications are reduced dramatically.

Figure 2.2 Classification maps for the AVIRIS Indian Pines scene using the proposed Joint + Laplacian method with different numbers of labeled samples per class.

2.3.2 Classification Performance on AVIRIS Salinas Data

This dataset is collected over the Valley of Salinas, Southern California, in 1998. This hyperspectral image is of size $217 \times 512$ , with 16 different classes of objects in total. In our experiment, a nine-class subset is considered, including vegetables, bare soils, and vineyard fields. The ground-truth classification map is shown in Fig. 2.3A. As AVIRIS Salinas is recognized to be easier for classification than AVIRIS Indian Pines, all methods obtain high OAs as listed in Table 2.2, while the “Joint + Laplacian” method marginally stands out. When we turn to Fig. 2.3B–H for the comparison in classification maps, however, the “Joint + Laplacian” method is visually much superior in reducing scattered misclassifications.

Figure 2.3 Classification maps for the AVIRIS Salinas scene using different methods with 10 labeled samples per class.

Table 2.2

Overall classification results (%) for the AVIRIS Salinas data with different numbers of labeled samples per class (u = ALL, λ = 10⁻², λ₁ = 0.25, λ₂ = 0, ρ = 1, σ_d = 3, σ_s = 3000)

$l_{c}$	2	3	4	5	6	7	8	9	10
LapSVM [9]	90.77	91.53	92.95	93.50	94.77	95.08	96.05	97.17	98.00
SSC [40]	59.47	61.84	64.90	67.19	71.04	73.04	72.81	73.51	75.36
MLR+AL [41]	78.98	82.32	84.31	86.27	85.86	89.41	92.27	93.78	95.66
Non-Joint (p_c = 50)	85.88	87.21	89.29	90.76	91.42	92.87	93.95	94.78	95.26
Non-Joint + Laplacian (p_c = 50)	87.67	89.28	91.54	92.67	93.93	95.28	96.79	97.83	98.08
Joint (p_c = 5)	89.71	90.03	91.42	92.12	93.25	94.54	96.05	97.45	98.90
Joint + Laplacian (p_c = 5)	90.65	91.59	92.28	93.63	95.22	96.58	97.81	98.53	99.40

2.3.3 Classification Performance on University of Pavia Data

The ROSIS sensor collected this data during a flight campaign over the Pavia district in northern Italy. A total of 103 spectral bands were used for data acquisition in this dataset, comprising $610 \times 610$ pixel images with a geometric resolution of 1.3 m. A few samples contain no information and were discarded before the classification. The ground-truth data shows a total of nine distinct classes, and has been portrayed visually in Fig. 2.4A. Similar conclusions can be attained from both Table 2.3 and Fig. 2.4 that once again verify the merits of both the joint optimization framework and spatial regularization.

Figure 2.4 Classification maps for the University of Pavia scene using different methods with 10 labeled samples per class.

Table 2.3

Overall classification results (%) for the university of Pavia data with different numbers of labeled samples per class (u = ALL, λ = 10⁻², λ₁ = 0.15, λ₂ = 0, ρ = 1, σ_d = 3, σ_s = 3000)

$l_{c}$	2	3	4	5	6	7	8	9	10
LapSVM [9]	64.77	67.83	69.25	71.05	72.97	74.38	76.75	78.17	79.88
SSC [40]	69.54	72.84	74.69	76.21	77.24	78.43	79.81	80.25	80.95
MLR+AL [41]	76.27	78.66	79.30	80.22	81.36	82.41	83.27	84.78	85.53
Non-Joint (p_c = 50)	74.21	75.27	76.22	76.83	78.24	79.51	79.67	80.83	81.26
Non-Joint + Laplacian (p_c = 50)	79.23	80.26	82.58	84.07	86.21	86.88	87.56	88.23	88.78
Joint (p_c = 5)	74.21	76.73	79.24	80.82	82.35	84.54	86.97	87.27	88.08
Joint + Laplacian (p_c = 5)	78.56	80.29	82.84	83.76	85.12	87.58	88.33	89.52	90.41

2.4 Conclusion

In this chapter, we develop a semisupervised hyperspectral image classification method based on task-driven dictionary learning and spatial Laplacian regularization on the output of the logistic regression classifier. We jointly optimize both the dictionary for feature extraction and the associated classifier parameter, while both the spectral and the spatial information are explored to improve the classification accuracy. Experimental results verify the superior performance of our proposed method on three popular datasets, both quantitatively and qualitatively. A good and stable accuracy is produced in even quite ill-posed problem settings (high dimensional spaces with small number of labeled samples). In the future, we would like to explore the applications of the proposed method to general image classification and segmentation problems.

2.5 Appendix

Denote $X \in X$ , $y \in Y$ and $D \in D$ . Let the objective function $B (A, w)$ in (2.9) be denoted as B for short. The differentiability of B with respect to w is easy to show, using only the compactness of $X$ and $Y$ , as well as the fact that B is twice differentiable.

We will therefore focus on showing that B is differentiable with respect to D, which is more difficult since A, and thus $a_{i}$ , is not differentiable everywhere. Without loss of generality, we use a vector a instead of A for simplifying the derivations hereinafter. In some cases, we may equivalently express a as $a (D, w)$ in order to emphasize the functional dependence.

We recall the following theorem [2.1] that is proved in [42]:

Theorem 2.1

Regularity of the elastic net solution

Consider the formulation in (2.1). Assume $λ_{2} > 0$ , and that both $X$ and $Y$ are compact. Then,

• a is uniformly Lipschitz on $X \times D$ .
• Let $D \in D$ , σ be a positive scalar and s be a vector in ${- 1, 0, 1}^{p}$ . Define $K_{s} (D, σ)$ as the set of vectors x satisfying for all j in ${1, \dots, p}$ ,

$\begin{matrix} | d_{j}^{T} (x - Da) - λ_{2} a [j] | ⩽ λ_{1} - σ if s [j] = 0, \\ s [j] a [j] ⩾ σ if s [j] \neq 0 . \end{matrix}$

(2.16)

Then there exists $κ > 0$ independent of $s, D$ and σ so that for all $x \in K_{s} (D, σ)$ , the function a is twice continuously differentiable on $B_{κ σ} (x) \times B_{κ σ} (D)$ , where $B_{κ σ} (x)$ and $B_{κ σ} (D)$ denote the open balls of radius κσ respectively centered on x and D.

Built on [2.1] and given a small perturbation $E \in R^{m \times p}$ , it follows that

$\begin{matrix} B (a (D + E), w) - B (a (D), w) = \\ \nabla_{z} B_{w}^{T} (a (D + E) - a (D)) + O (| | E | |_{F}^{2}), \end{matrix}$

(2.17)

where the term $O (| | E | |_{F}^{2})$ is based on the fact that $a (D, x)$ is uniformly Lipschitz and $X \times D$ is compact. It is then possible to show that

$\begin{matrix} B (a (D + E), w) - B (a (D), w) = \\ T r (E^{T} g (a (D + E), w)) + O (| | E | |_{F}^{2}) \end{matrix}$

(2.18)

where g has the form given in (2.11). This shows that f is differentiable on $D$ , and its gradient with respect to D is g.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 2: Bi-Level Sparse Coding: A Hyperspectral Image Classification Example

Create new playlist

Sign In

Sign Up