Chapter 2

Bi-Level Sparse Coding: A Hyperspectral Image Classification Example

Zhangyang Wang    Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States

Abstract

We present a semisupervised method for single pixel classification of hyperspectral images. The proposed method is designed to address the special problematic characteristics of hyperspectral images, namely, high dimensionality of hyperspectral pixels, lack of labeled samples, and spatial variability of spectral signatures. To alleviate these problems, the proposed method features the following components. First, being a semisupervised approach, it exploits the wealth of unlabeled samples in the image by evaluating the confidence probability of the predicted labels, for each unlabeled sample. Second, we propose to jointly optimize the classifier parameters and the dictionary atoms by a task-driven formulation, in order to ensure that the learned features (sparse codes) are optimal for the trained classifier. Finally, it incorporates spatial information through adding a Laplacian smoothness regularization to the output of the classifier, rather than the sparse codes, making the spatial constraint more flexible. The proposed method is compared to a few comparable methods for classification of several popular datasets, and it produces significantly better classification results.

Keywords

Hyperspectral image classification; Semisupervised learning; Task-driven dictionary learning; Sparse coding; Spatial Laplacian regularization; Bi-level optimization

2.1 Introduction

The spectral information contained in hyperspectral imagery allows characterization, identification, and classification of land-covers with improved accuracy and robustness. However, several critical issues should be addressed in the classification of hyperspectral data, among which are the following [1,2,38]: (1) small amount of available labeled data; (2) high dimensionality of each spectral sample; (3) spatial variability of spectral signatures; (4) high cost of sample labeling. In particular, the large number of spectral channels and small number of labeled training samples pose the problem of the curse of dimensionality and as a consequence resulting in the risk of overfitting the training data. For these reasons, desirable properties of a hyperspectral image classifier should be its ability to produce accurate land-cover maps when working within a high-dimensional feature space, low-sized training datasets, and high levels of spatial spectral signature variability.

Many supervised and unsupervised classifiers have been developed to tackle the hyperspectral data classification problem [3]. Classical supervised methods, such as artificial neural networks [4,5] and support vector machines (SVMs) [69], were readily revealed to be inefficient when dealing with a high number of spectral bands and lack of labeled data. In [10], SVM was regularized with an unnormalized graph Laplacian, thus leading to the Laplacian SVM (LapSVM) that adopts the manifold assumption for semisupervised classification. Another framework based on neural network was presented in [11]. It consists of adding a flexible embedding regularizer to the loss function used for training neural networks, and leads to improvements in both classification accuracy and scalability on several hyperspectral image classification problems. In recent years, kernel-based methods have often been adopted for hyperspectral image classification [1215]. They are certainly able to handle efficiently the high-dimensional input feature space and deal with the noisy samples in a robust way [16]. More recently, sparse representation has been increasingly popular for image classification. The sparse representation-based classification (SRC) [17] is mainly based on the observation that despite the high dimensionality of natural signals, signals belonging to the same class usually lie in a low-dimensional subspace. In [18], an SRC-based algorithm for hyperspectral classification was presented, that utilizes the sparsity of the input sample with respect to a given overcomplete training dictionary. It is based on a sparsity model where a test spectral pixel is approximately represented by a few training samples (atoms) among the entire atoms from a dictionary. The weights associated with the atoms are called the sparse code. The class label of the test pixel is then determined by the characteristics of the recovered sparse code. Experimental results show remarkable improvements in discriminative effects. However, the main difficulty with all supervised methods is that the learning process heavily depends on the quality of the training dataset. Even worse, labeled hyperspectral training samples are only available in a very limited number due to the cost of sample labeling. On the other hand, unsupervised methods are not sensitive to the number of labeled samples since they operate on the whole dataset, but the relationships between clusters and class labels are not ensured [19]. Moreover, typically in hyperspectral classification, a preliminary feature selection/extraction step is undertaken to reduce the high input space dimensionality, which is time-consuming, scenario-dependent, and needs prior knowledge.

As a trade-off, semisupervised classification methods become a natural alternative to yield better performance. In semisupervised learning literature, the algorithms are provided with some available supervised information in the form of labeled data in addition to the wealth of unlabeled data. Such a framework has recently attracted a considerable amount of research in remote sensing, such as the Laplacian SVM (LapSVM) [9,10], transductive SVM [20], biased-SVM [21] and graph-based methods [22]. Even though the above mentioned algorithms exhibit good performance in classifying hyperspectral images, most of them are based on the assumption that spectrally similar instances should share the same label. However in practice, we may have very different spectra corresponding to the same material, which sometimes makes the above strict assumption no longer valid. Moreover, in most recent hyperspectral classification approaches [23,24], the spatial information is exploited together with the spectral features, encouraging pixels in the local neighborhood to have similar labels. The spatial smoothness assumption holds well in the homogenous regions of hyperspectral images. However, conventional approaches often fail to capture the spatial variability of spectral signatures, e.g., on the border of regions belonging to different classes.

In this chapter, we introduce a hyperspectral image classification method, tackling the problems imposed by the special characteristics of hyperspectral images, namely, high-input dimension of pixels, low number of labeled samples, and spatial variability of the spectral signatures. To this end, the proposed method has the following characteristics and technical contributions:

  • •  Semisupervised. Extending the task-driven dictionary learning formulation in [25] to the semisupervised framework for hyperspectral classification, the huge number of unlabeled samples in the image are exploited together with a limited amount of labeled samples to improve the classification performance in a task-driven setting.
  • •  Joint optimization of feature extraction and classification. Almost all prior research on hyperspectral classifier design can be viewed as the combinations of two independent parts, extraction of features and a training procedure for designing the classifier. Although, in some prior work raw spectral pixels are used directly, it is widely recognized that features extracted from the input pixels, such as the sparse code, often promote a more discriminative and robust classification [17]. However, to consider the two stages separately typically leads to a suboptimal performance, because the extracted features are not optimized for the best performance of the following classification step. We jointly optimize the classifier parameters and dictionary atoms. This is different from the classical data-driven feature extraction approach [18] that only tries to reconstruct the training samples well. Our joint task-driven formulation ensures that the learned sparse code features are optimal for the classifier.
  • •  Incorporation of spatial information. We incorporate spatial information by adding a spatial Laplacian regularization [9] to the probabilistic outputs of the classifier, i.e., the likelihood of the predicted labels. This is more flexible than the popular “naive” Laplacian smoothness constraint that simply enforces all pixels in a local window to have similar learned features.

A novel formulation of bi-level optimization is designed to meet our requirements [26,27], which is solved by a stochastic gradient descent algorithm [28]. The proposed method is then evaluated on three popular datasets and we see an impressive improvement in performance on all of them. Even for quite ill-posed classification problems, i.e., very small number of high dimensional labeled samples, the proposed method gains a remarkable and stable improvement in performance over comparable methods.

The rest of this chapter is organized as follows. Section 2.2 manifests a step-by-step construction of our formulation in details, followed by the optimization algorithm to solve it. Section 2.3 discusses the classification results of the proposed method in comparison to several other competitive methods, with a wide range of available labeled samples. It also investigates the influences of both the unlabeled samples and dictionary atoms on the classifier's performance, as well as the discriminability of the obtained dictionary. Section 2.4 includes some concluding remarks and indications for the future work.

2.2 Formulation and Algorithm

2.2.1 Notations

Consider a hyperspectral image XRm×nImage of n pixels, each of which consists of an m-dimensional spectral vector. Let X=[x1,x2,,xn]Image denote the pixel set in a hyperspectral image, with each spectral pixel xiRm×1Image, i=1,2,,nImage. For all the corresponding labels y=[y1,y2,,yn]Image, we assume l labels [y1,y2,,yl]Image are known, constituting a labeled training set Xl=[x1,x2,,xl]Image, while making Xu=[xl+1,xl+2,,xn]Image the unlabeled training set with u=nlImage. We assume that the number of labeled samples is uniformly selected for each class. This means that for a K-class classification, each class has lc=lKImage labeled samples.

Without loss of generality, we let all yi{1,1}Image to focus on discussing a binary classification. However, the proposed classifier can be naturally extended to a multiclass case, by either replacing the binary classifier with the multiclass classifier (e.g., soft-max classifier [30]), or adopting the well-known one-versus-one or one-versus-all strategy.

Our goal is to jointly learn a dictionary D consisting of a set of basis for extracting the sparse code (feature vector), and the classification parameter w for a binary classifier applied to the extracted feature vector, while guaranteeing them to be optimal to each other.

2.2.2 Joint Feature Extraction and Classification

2.2.2.1 Sparse Coding for Feature Extraction

In [18], the authors suggest that the spectral signatures of pixels belonging to the same class are assumed to approximately lie in a low-dimensional subspace. Pixel can be compactly represented by only a few sparse coefficients (sparse code). We adopt the sparse code as the input features, since extensive literature has examined the outstanding effect of SRC for a more discriminative and robust classification [17].

We assume that all the data samples X=[x1,x2,,xn]Image, xiRm×1Image, i=1,2,,nImage, are encoded into their corresponding sparse codes A=[a1,a2,,an]Image, aiRp×1Image, i=1,2,,nImage, using a learned dictionary D=[d1,d2,,dp]Image, where diRm×1Image, i=1,2,,pImage are the learned atoms. It should be noted that the initial dictionary is generated by assigning equal number of atoms to each class. This means that for a K-class classification, there are pc=pKImage atoms assigned to each class in a dictionary consisting of p atoms.

The sparse representation is obtained by the following convex optimization:

A=argminA12||XDA||2F+λ1i||ai||1+λ2||A||2F,

Image (2.1)

or rewritten in a separate form for each xiImage as

ai=argminai12||xiDai||22+λ1||ai||1+λ2||ai||22.

Image (2.2)

Note λ2>0Image is necessary for proving the differentiability of the objective function (see [2.1] in Appendix). However, setting λ2=0Image proves to work well in practice [25].

Obviously, the effect of sparse coding (2.1) largely depends on the quality of dictionary D. The authors in [18] suggest to construct the dictionary by directly selecting atoms from the training samples. More sophisticated methods are widely used in SRC literature, discussing on how to learn a more compact and effective dictionary from a given training dataset, e.g., the K-SVD algorithm [31].

We recognize that many structured sparsity constraints (priors) [18,32] can also be considered for dictionary learning. They usually exploit the correlations among the neighboring pixels or their features. For example, the SRC dictionary has an inherent group-structured property since it is composed of several class-wise subdictionaries, i.e., the atoms belonging to the same class are grouped together to form a subdictionary. Therefore, it would be reasonable to enforce each pixel to be compactly represented by groups of atoms instead of individual ones. This could be accomplished by encouraging coefficients of only certain groups to be active, like the group lasso [33]. While the performance may be improved by enforcing structured sparsity priors, the algorithm will be considerably more complicated. Therefore, we do not take into account any structured sparsity prior here, and leave them for our future study.

2.2.2.2 Task-Driven Functions for Classification

Classical loss functions in SRC are often defined by the reconstruction error of data samples [18,39]. The performances of such learned classifiers highly hinge on the quality of the input features, which is only suboptimal without the joint optimization with classifier parameters. In [34], the authors study a straightforward joint representation and classification framework, by adding a penalty term to the classification error in addition to the reconstruction error. The authors of [35,36] propose to enhance the dictionary's representative and discriminative power by integrating both the discriminative sparse-code error and the classification error into a single objective function. The approach jointly learns a single dictionary and a predictive linear classifier. However, being a semisupervised method, the unlabeled data does not contribute much to promoting the discriminative effect in [36], as only the reconstruction error is considered on the unlabeled set except for an “expansion” strategy applied to a small set of highly-confident unlabeled samples.

In order to obtain an optimal classifier with regard to the input feature, we exploit a task-driven formulation which aims to minimize a classification-oriented loss [25]. We incorporate the sparse codes aiImage, which are dependent on the atoms of the dictionary D that are to be learned, into the training of the classifier parameter w. The logistic loss is used in the objective function for the classifier. We recognize that the proposed formulation can be easily extended to other classifiers, e.g., SVM. The loss function for the labeled samples is directly defined by the logistic loss

L(A,w,xi,yi)=li=1log(1+eyiwTai).

Image (2.3)

For unlabeled samples, the label of each xiImage is unknown. We propose to introduce the predicted confidence probability pijImage that sample xiImage has label yjImage (yj=1Image or −1), which is naturally set as the likelihood of the logistic regression

pij=p(yj|w,ai,xi)=11+eyjwTai,yj=1or1.

Image (2.4)

The loss function for the unlabeled samples then turns into an entropy-like form

U(A,w,xi)=l+ui=l+1yjpijL(ai,w,xi,yj),

Image (2.5)

which can be viewed as a weighted sum of loss under different classification outputs yjImage.

Furthermore, we can similarly define pijImage for the labeled sample xiImage, that is 1 when yjImage is the given correct label yiImage and 0 elsewhere. The joint loss functions for all the training samples can thus be written into a unified form

T(A,w)=l+ui=1yjpijL(ai,w,xi,yj).

Image (2.6)

A semisupervised task-driven formulation has also been proposed in [25]. However, it is posed as a naive combination of supervised and unsupervised steps. The unlabeled data are only used to minimize the reconstruction loss, without contributing to promoting the discriminative effect. In contrast, our formulation (2.6) clearly distinguishes itself by assigning an adaptive confidence weight (2.4) to each unlabeled sample, and minimizes a classification-oriented loss over both labeled and unlabeled samples. By doing so, unlabeled samples also contribute to improving the discriminability of learned features and classifier, jointly with the labeled samples, rather than only optimized for reconstruction loss.

2.2.2.3 Spatial Laplacian Regularization

We first introduce the weighting matrix G, where GikImage characterizes the similarity between a pair of pixels xiImage and xkImage. We define GikImage in the form of shift-invariant bilateral Gaussian filtering [37] (with controlling parameters σdImage and σsImage)

Gik=exp(d(xi,xk)2σ2d)exp(||xixk||222σ2s),

Image (2.7)

which measures both the spatial Euclidean distance (d(xi,xk)Image) and the spectral similarity between an arbitrary pair of pixels in a hyperspectral image. Larger GikImage represents higher similarity and vice versa. Further, rather than simply enforcing pixels within a local window to share the same label, GikImage is defined over the whole image and encourages both spatially neighboring and spectrally similar pixels to have similar classification outputs. It makes our spatial constraints much more flexible and effective. Using the above similarity weights, we define the spatial Laplacian regularization function

S(A,w)=l+ui=1yjl+ukGik||pijpkj||22.

Image (2.8)

2.2.3 Bi-level Optimization Formulation

Finally, the objective cost function for the joint minimization formulation can be expressed by the following bi-level optimization (the quadratic term of w is to avoid overfitting)

minD,wT(A,w)+S(A,w)+λ2||w||22s.t.A=argminA12||XDA||2F+λ1i||ai||1+λ2||A||2F.

Image (2.9)

Bi-level optimization [26] has been investigated in both theory and application sides. In [27], the authors propose a general bi-level sparse coding model for learning dictionaries across coupled signal spaces. Another similar formulation has been studied in [25] for general regression tasks.

In the testing stage, each test sample is first represented by solving (2.2) over the learned D. The resulting sparse coefficients are fed to the trained logistic classifier with the previously learned w. The test sample is classified into the class of the highest output probability (2.4).

2.2.4 Algorithm

Built on the similar methodologies of [25] and [27], we solve (2.9) using a projected first order stochastic gradient descent (SGD) algorithm, whose detailed steps are outlined in Algorithm 2.1. At a high level overview, it consists of an outer stochastic gradient descent loop that incrementally samples the training data. It uses each sample to approximate gradients with respect to the classifier parameter w and the dictionary D, which are then used to update them. Next, we briefly explain a few key technical points of the Algorithm 2.1.

Image
Algorithm 2.1 Stochastic gradient descent algorithm for solving (2.9).

2.2.4.1 Stochastic Gradient Descent

The stochastic gradient descent (SGD) algorithm [28] is an iterative, “online” approach for optimizing an objective function, based on a sequence of approximate gradients obtained by randomly sampling from the training data set. In the simplest case, SGD estimates the objective function gradient on the basis of a single randomly selected example xtImage

wt+1=wtρtwF(xt,wt),

Image (2.10)

where F is a loss function, w is a weight being optimized and ρtImage is a step size known as the“learning rate”. The stochastic process {wt,t=1,}Image depends upon the sequence of randomly selected examples xtImage from the training data. It thus optimizes the empirical cost, which is hoped to be a good proxy for the expected cost.

Following the derivations in [25], we can show that the objective function in (2.9), denoted as B(A,w)Image for simplicity, is differentiable on D×wImage, and that

wB(A,w)=Ex,y[wT(A,w)+wS(A,w)+λw],DB(A,w)=Ex,y[DβAT+(XtDA)βT],

Image (2.11)

where βImage is a vector defined by the following property:

βSC=0,βS=(DTSDS+λ2I)1AS[T(A,w)+S(A,w)],

Image (2.12)

and S are the indices of the nonzero coefficients of A. The proof of the above equations is given in the Appendix.

2.2.4.2 Sparse Reconstruction

The most computationally intensive step in Algorithm 2.1 is solving the sparse coding (step 3). We adopt the feature-sign algorithm [39] for efficiently solving the exact solution to the sparse coding problem.

Remark on SGD convergence and sampling strategy. As a typical case in machine learning, we use SGD in a setting where it is not guaranteed to converge in theory, but behaves well in practice, as shown in our experiments. (The convergence proof of SGD [29] for nonconvex problems indeed assumes three times differentiable cost functions.)

SGD algorithms are typically designed to minimize functions whose gradients have the form of an expectation. While an i.i.d. (independent and identically distributed) sampling process is required, it cannot be computed in a batch mode. In our algorithm, instead of sampling one per iteration, we adopt a mini-batch strategy by drawing more samples at a time. Authors in [25] further pointed out that solving multiple elastic-net problems with the same dictionary D can be accelerated by the precomputation of the matrix DTDImage. In practice, we draw a set of 200 samples in each iteration, which produces steadily good results in all our experiments under universal settings.

Strictly speaking, drawing samples from the distribution of training data should be made i.i.d. (step 2 in Algorithm 2.1). However, this is practically difficult since the distribution itself is typically unknown. As an approximation, samples are instead drawn by iterating over random permutations of the training set [29].

2.3 Experiments

In this section, we evaluate the proposed method on three popular datasets, and compare it with some related approaches in the literature, including:

  • •  Laplacian Support Vector Machine (LapSVM) [9,10], which is a semisupervised extension of the SVM and applies the spatial manifold assumption to SVM. The classification is directly executed on raw pixels without any feature extraction, which follows the original setting in [10].
  • •  Semisupervised Classification (SSC) approach [40] that employs a modified clustering assumption.
  • •  Semisupervised hyperspectral image segmentation that adopts Multinomial Logistic Regression with Active Learning (MLR-AL) [41].

Regarding parameter choices of the three methods, we try our best to follow the settings in their original papers. For LapSVM, the regularization parameters γ1,γ2Image are selected from [105,105]Image according to a five-fold cross-validation procedure. In SSC, the width parameter of Gaussian function is tuned using a five-fold cross-validation procedure. The parameter setting in MLR-AL follows that of the original paper [41].

Besides the above mentioned three algorithms, we also include the following algorithms in the comparison, in order to illustrate the merits of both joint optimization and spatial Laplacian regularization on the classifier outputs:

  • •  Non-joint optimization of feature extraction and classification (Non-Joint). It refers to conducting the following two stages of the optimization sequentially:

1. Feature extraction:A=argminA12||XDA||2F+λ1i||ai||1+λ2||A||2F.2. Learning a classifier:minwT(A,w)+λ2||w||22.

Image (2.13)

  • The training of D is independent of the learning of the classifier parameter w. This is different from the joint optimization of the dictionary and classifier as is done in (2.9) by the task-driven formulation.
  • •  Non-joint optimization of feature extraction and classification, with spatial Laplacian regularization (Non-Joint + Laplacian). It is the same as the Non-Joint method except for adding a spatial Laplacian regularization term S(A,w)Image to the second subproblem:

1. Feature extraction:A=argminA12||XDA||2F+λ1i||ai||1+λ2||A||2F.2. Learning a classifier:minwT(A,w)+S(A,w)+λ2||w||22.

Image (2.14)

  • •  The proposed joint method without spatial Laplacian regularization (Joint), which is done by dropping the S(A,W)Image term in (2.9)

minD,WT(A,w)+λ2||w||22s.t.A=argminA12||XDA||2F+λ1i||ai||1+λ2||A||2F.

Image (2.15)

  • •  The proposed joint method with spatial Laplacian regularization (Joint + Laplacian), by minimizing our proposed bi-level formulation (2.9).

Parameter setting. For the proposed method, the regularization parameter λ in (2.9) is fixed to be 102Image, and λ2Image in (2.2) is set to 0 to exploit sparsity. The elastic-net parameter λ1Image in (2.2) is generated by a cross-validation procedure, which is similar to the one in [25]. The values of λ1Image are 0.225, 0.25, and 0.15 for the three experiments in Sections 2.3.12.3.3, respectively; σdImage and σsImage in (2.7) are fixed as 3 and 3000, respectively. The learning rate ρ is set as 1, and maximum number ITER is set as 1000 for all. Although, we have verified that these choices of parameters work well in extensive experiments, we recognize that a finer tuning of them may further improve the performance.

In particular, we would like to mention the initializations of D and w. For the two non-joint methods, D is initialized by solving the first subproblem (feature extraction) in (2.13) or (2.14). In this subproblem, for each class, we initialize its subdictionary atoms randomly. We then employ several iterations of K-SVD using only the available labeled data for that class, and finally combine all the output class-wise sub-dictionaries into a single initial dictionary D. Next, we solve A based on D, and continue to feed A into the second subproblems (learning a classifier) in (2.13) and (2.14) for good initializations of w, for Non-Joint and Non-Joint + Laplacian, respectively. For the two joint methods, we use the results of Non-Joint, and Non-Joint + Laplacian, to initialize D and w of Joint and Joint + Laplacian, respectively.

The one-versus-all strategy is adopted for addressing multiclass problems, which means that we train K different binary logistic classifiers with K corresponding dictionaries for a K-class problem. For each test sample, the classifier with the maximum score will provide the class label. When the class number is large, this one-versus-all approach has proven to be more scalable than learning a single large dictionary with a multi-class loss [25], while providing very good results.

For the two joint methods, we assign only five dictionary atoms per class to initialize the dictionary, which means for a K-class problem we have pc=5Image and p=5KImage for the total dictionary size. For the two non-joint methods, 50 dictionary atoms (pc=50Image) are assigned per class in the first subproblems of (2.13) and (2.14), respectively. The reason why we use the term “atoms per class” are two-fold: (1) we initialize our dictionary by first applying KSVD to each class to obtain a class-wise subdictionary. This helps to improve the class discriminability of the learned dictionary more than just applying KSVD to the whole data. Therefore, we need to specify how many atoms are assigned per class in the initialization stage. Note when Algorithm 2.1 starts, the atoms become all entangled, and further it is impossible to identify how many (and which) atoms are representing a specific class in the final learned dictionary; (2) As each dataset has a different number of classes, and empirically, more classes demand more dictionary atoms to represent. Note, however, if we assign atoms in proportion to the number of samples per class, some minor classes will tend to be severely underrepresented. In all experiments here, we by default use all the unlabeled pixels (denoted as “ALL”) from each hyperspectral image for semi-supervised training.

2.3.1 Classification Performance on AVIRIS Indiana Pines Data

The AVIRIS sensor generates 220 bands across the spectral range from 0.2 to 2.4 μm. In the experiment, the number of bands is reduced to 200 by removing 20 water absorption bands. The AVIRIS Indiana Pines hyperspectral image has the spatial resolution of 20 m and 145×145Image pixels. It contains 16 classes, most of which are different types of crops (e.g., corns, soybeans, and wheats). The ground-truth classification map is shown in Fig. 2.1A.

Image
Figure 2.1 Classification maps for the AVIRIS Indian Pines scene using different methods with 10 labeled samples per class.

Table 2.1 evaluates the influence of the number of labeled samples per class lcImage on the classification of AVIRIS Indiana Pines data, with lcImage varying from 2 to 10. The dictionary consists of only p=80Image atoms to represent all the 16 classes for the joint methods, and p=800Image for the non-joint methods. The bold value in each column indicates the best result among all the seven methods. As can be seen from the table, the classification results improve for all the algorithms with the increase in the number of labeled samples. The last two methods, i.e., “Joint” and “Joint + Laplacian”, outperform the other five methods in terms of overall accuracy (OA) significantly. It is also observed that the “Joint + Laplacian” method obtains further improvement over the “Joint” method, showing the advantage of spatial Laplacian regularization. Amazingly, we notice that even when there are as few as three samples per class, the OA of the proposed method (“Joint + Laplacian”) is still higher than 80%.

Table 2.1

Overall classification results (%) for the AVIRIS Indiana Pines data with different numbers of labeled samples per class (u = ALL, λ = 10−2, λ1 = 0.225, λ2 = 0, ρ = 1, σd = 3, σs = 3000)

lc Image 2 3 4 5 6 7 8 9 10
LapSVM [9] 57.80 61.32 63.1 66.39 68.27 69.00 70.15 70.04 70.73
SSC [40] 44.61 56.98 58.27 60.56 60.79 64.19 66.81 69.40 70.50
MLR+AL [41] 52.34 56.16 59.21 61.47 65.16 69.21 72.14 73.89 74.43
Non-Joint (pc = 50) 63.72 69.21 71.87 76.88 79.04 81.81 85.23 87.77 88.54
Non-Joint + Laplacian (pc = 50) 66.89 72.37 75.33 78.78 81.21 84.98 87.25 88.61 89.88
Joint (pc = 5) 69.81 76.03 80.42 82.91 84.81 85.76 86.95 87.54 89.31
Joint + Laplacian (pc = 5) 76.55 80.63 84.28 86.33 88.27 90.68 91.87 92.53 93.11

Image

Fig. 2.1 demonstrates the classification maps obtained by all the methods when 10 labeled samples are used per class. The proposed method, either with or without spatial regularization, obtains much less misclassifications compared with the other methods. What is more, the homogenous areas in (H) are significantly better preserved than that in (G), which again confirms the effectiveness of the spatial Laplacian regularization on the output of the classifier. Fig. 2.2 visually demonstrates that along with the increase of lcImage, the classification results gradually improve, and both the regional and scattered misclassifications are reduced dramatically.

Image
Figure 2.2 Classification maps for the AVIRIS Indian Pines scene using the proposed Joint + Laplacian method with different numbers of labeled samples per class.

2.3.2 Classification Performance on AVIRIS Salinas Data

This dataset is collected over the Valley of Salinas, Southern California, in 1998. This hyperspectral image is of size 217×512Image, with 16 different classes of objects in total. In our experiment, a nine-class subset is considered, including vegetables, bare soils, and vineyard fields. The ground-truth classification map is shown in Fig. 2.3A. As AVIRIS Salinas is recognized to be easier for classification than AVIRIS Indian Pines, all methods obtain high OAs as listed in Table 2.2, while the “Joint + Laplacian” method marginally stands out. When we turn to Fig. 2.3B–H for the comparison in classification maps, however, the “Joint + Laplacian” method is visually much superior in reducing scattered misclassifications.

Image
Figure 2.3 Classification maps for the AVIRIS Salinas scene using different methods with 10 labeled samples per class.

Table 2.2

Overall classification results (%) for the AVIRIS Salinas data with different numbers of labeled samples per class (u = ALL, λ = 10−2, λ1 = 0.25, λ2 = 0, ρ = 1, σd = 3, σs = 3000)

lc Image 2 3 4 5 6 7 8 9 10
LapSVM [9] 90.77 91.53 92.95 93.50 94.77 95.08 96.05 97.17 98.00
SSC [40] 59.47 61.84 64.90 67.19 71.04 73.04 72.81 73.51 75.36
MLR+AL [41] 78.98 82.32 84.31 86.27 85.86 89.41 92.27 93.78 95.66
Non-Joint (pc = 50) 85.88 87.21 89.29 90.76 91.42 92.87 93.95 94.78 95.26
Non-Joint + Laplacian (pc = 50) 87.67 89.28 91.54 92.67 93.93 95.28 96.79 97.83 98.08
Joint (pc = 5) 89.71 90.03 91.42 92.12 93.25 94.54 96.05 97.45 98.90
Joint + Laplacian (pc = 5) 90.65 91.59 92.28 93.63 95.22 96.58 97.81 98.53 99.40

Image

2.3.3 Classification Performance on University of Pavia Data

The ROSIS sensor collected this data during a flight campaign over the Pavia district in northern Italy. A total of 103 spectral bands were used for data acquisition in this dataset, comprising 610×610Image pixel images with a geometric resolution of 1.3 m. A few samples contain no information and were discarded before the classification. The ground-truth data shows a total of nine distinct classes, and has been portrayed visually in Fig. 2.4A. Similar conclusions can be attained from both Table 2.3 and Fig. 2.4 that once again verify the merits of both the joint optimization framework and spatial regularization.

Image
Figure 2.4 Classification maps for the University of Pavia scene using different methods with 10 labeled samples per class.

Table 2.3

Overall classification results (%) for the university of Pavia data with different numbers of labeled samples per class (u = ALL, λ = 10−2, λ1 = 0.15, λ2 = 0, ρ = 1, σd = 3, σs = 3000)

lc Image 2 3 4 5 6 7 8 9 10
LapSVM [9] 64.77 67.83 69.25 71.05 72.97 74.38 76.75 78.17 79.88
SSC [40] 69.54 72.84 74.69 76.21 77.24 78.43 79.81 80.25 80.95
MLR+AL [41] 76.27 78.66 79.30 80.22 81.36 82.41 83.27 84.78 85.53
Non-Joint (pc = 50) 74.21 75.27 76.22 76.83 78.24 79.51 79.67 80.83 81.26
Non-Joint + Laplacian (pc = 50) 79.23 80.26 82.58 84.07 86.21 86.88 87.56 88.23 88.78
Joint (pc = 5) 74.21 76.73 79.24 80.82 82.35 84.54 86.97 87.27 88.08
Joint + Laplacian (pc = 5) 78.56 80.29 82.84 83.76 85.12 87.58 88.33 89.52 90.41

Image

2.4 Conclusion

In this chapter, we develop a semisupervised hyperspectral image classification method based on task-driven dictionary learning and spatial Laplacian regularization on the output of the logistic regression classifier. We jointly optimize both the dictionary for feature extraction and the associated classifier parameter, while both the spectral and the spatial information are explored to improve the classification accuracy. Experimental results verify the superior performance of our proposed method on three popular datasets, both quantitatively and qualitatively. A good and stable accuracy is produced in even quite ill-posed problem settings (high dimensional spaces with small number of labeled samples). In the future, we would like to explore the applications of the proposed method to general image classification and segmentation problems.

2.5 Appendix

Denote XXImage, yYImage and DDImage. Let the objective function B(A,w)Image in (2.9) be denoted as B for short. The differentiability of B with respect to w is easy to show, using only the compactness of XImage and YImage, as well as the fact that B is twice differentiable.

We will therefore focus on showing that B is differentiable with respect to D, which is more difficult since A, and thus aiImage, is not differentiable everywhere. Without loss of generality, we use a vector a instead of A for simplifying the derivations hereinafter. In some cases, we may equivalently express a as a(D,w)Image in order to emphasize the functional dependence.

We recall the following theorem [2.1] that is proved in [42]:

Theorem 2.1

Regularity of the elastic net solution

Consider the formulation in (2.1). Assume λ2>0Image, and that both XImage and YImage are compact. Then,

  • •  a is uniformly Lipschitz on X×DImage.
  • •  Let DDImage, σ be a positive scalar and s be a vector in {1,0,1}pImage. Define Ks(D,σ)Image as the set of vectors x satisfying for all j in {1,,p}Image,

|dTj(xDa)λ2a[j]|λ1σifs[j]=0,s[j]a[j]σifs[j]0.

Image (2.16)

  • Then there exists κ>0Image independent of s,DImage and σ so that for all xKs(D,σ)Image, the function a is twice continuously differentiable on Bκσ(x)×Bκσ(D)Image, where Bκσ(x)Image and Bκσ(D)Image denote the open balls of radius κσ respectively centered on x and D.

Built on [2.1] and given a small perturbation ERm×pImage, it follows that

B(a(D+E),w)B(a(D),w)=zBTw(a(D+E)a(D))+O(||E||2F),

Image (2.17)

where the term O(||E||2F)Image is based on the fact that a(D,x)Image is uniformly Lipschitz and X×DImage is compact. It is then possible to show that

B(a(D+E),w)B(a(D),w)=Tr(ETg(a(D+E),w))+O(||E||2F)

Image (2.18)

where g has the form given in (2.11). This shows that f is differentiable on DImage, and its gradient with respect to D is g.

References

[1] P.H. Swain, S.M. Davis, Fundamentals of pattern recognition in remote sensing, Remote sensing: the quantitative approach. New York: McGraw-Hill; 1978.

[2] A. Plaza, J. Benediktsson, J. Boardman, J. Brazile, L. Bruzzone, G. Camps-Valls, J. Chanussot, M. Fauvel, P. Gamba, A. Gualtieri, M. Marconcini, J. Tiltoni, G. Trianni, Recent advances in techniques for hyperspectral image processing, Remote Sensing of Environment Sept. 2009;113(s1):s110–s122.

[3] J.A. Richards, X. Jia, Remote sensing digital image analysis: an introduction. Berlin, Germany: Springer-Verlag; 1999.

[4] H. Bischof, A. Leona, Finding optimal neural networks for land use classification, IEEE Transactions on Geoscience and Remote Sensing Jan. 1998;36(1):337–341.

[5] H. Yang, F. van der Meer, W. Bakker, Z.J. Tan, A back-propagation neural network for mineralogical mapping from AVIRIS data, International Journal on Remote Sensing Jan. 1999;20(1):97–110.

[6] N. Cristianini, J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press; 2000.

[7] B. Scholkopf, A. Smola, Learning with kernels: support vector machines, regularization, optimization and beyond. Cambridge, MA: MIT Press; 2002.

[8] F. Bovolo, L. Bruzzone, L. Carline, A novel technique for subpixel image classification based on support vector machine, IEEE Transactions on Image Processing Nov. 2010;19(11):2983–2999.

[9] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Maching Learning Research Nov. 2006;7:2399–2434.

[10] L. Gomez-Chova, G. Camps-Valls, J. Munoz-Mari, J. Calpe, Semi-supervised image classification with Laplacian support vector machines, IEEE Geoscience and Remote Sensing Letters Jul. 2008;5(3):336–340.

[11] F. Ratle, G. Camps-Valls, J. Weston, Semi-supervised neural networks for efficient hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing May. 2010;48(5):2271–2282.

[12] G. Camps-Valls, L. Gomez-Chova, J. Munoz-Marı, J. Vila-Frances, J. Calpe-Maravilla, Composite kernels for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters Jan. 2006;3(1):93–97.

[13] C. Huang, L.S. Davis, J.R.G. Townshend, An assessment of support vector machines for land cover classification, International Journal on Remote Sensing Feb. 2002;23(4):725–749.

[14] G. Camps-Valls, L. Gomez-Chova, J. Calpe, E. Soria, J.D. Martín, L. Alonso, J. Moreno, Robust support vector method for hyperspectral data classification and knowledge discovery, IEEE Transactions on Geoscience and Remote Sensing Jul. 2004;42(7):1530–1542.

[15] G. Camps-Valls, L. Bruzzone, Kernel-based methods for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing Jun. 2005;43(6):1351–1362.

[16] J. Shawe-Taylor, N. Cristianini, Kernel methods for pattern analysis. Cambridge, UK: Cambridge University Press; 2004.

[17] J. Wright, A. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence Feb. 2009;31(2):210–227.

[18] Y. Chen, N.M. Nasrabadi, T.D. Tran, Hyperspectral image classification using dictionary-based sparse representation, IEEE Transactions on Geoscience and Remote Sensing 2011;49(10):3973–3985.

[19] S.K. Bidhendi, A.S. Shirazi, N. Fotoohi, M.M. Ebadzadeh, Material classification of hyperspectral images using unsupervised fuzzy clustering methods, Proceedings of third international IEEE conference on signal-image technologies and internet-based system (SITIS). 2007:619–623.

[20] L. Bruzzone, M. Chi, M. Marconcini, A novel transductive SVM for semisupervised classification of remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 2006;44(11):3363–3373.

[21] M.M. Jordi, B. Francesca, G.C. Luis, B. Lorenzo, C.V. Gustavo, Semisupervised one-class support vector machines for classification of remote sensing data, IEEE Transactions on Geoscience and Remote Sensing 2010;48(8):3188–3197.

[22] Y.F. Gu, K. Feng, L1Image-graph semisupervised learning for hyperspectral image classification, Proceedings of IEEE international conference on geoscience and remote sensing symposium (IGARSS). 2012:1401–1404.

[23] J. Li, J. Bioucas-Dias, A. Plaza, Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields, IEEE Transactions on Geoscience and Remote Sensing 2012;50(3):809–823.

[24] A. Mathur, G.M. Foody, Multiclass and binary SVM classification: implications for training and classification users, IEEE Geoscience and Remote Sensing Letters Apr. 2008;5(2):241–245.

[25] J. Mairal, F. Bach, J. Ponce, Task-driven dictionary learning, IEEE Transactions on Pattern Analysis and Machine Intelligence Mar. 2012;34(2):791–804.

[26] B. Colson, P. Marcotte, G. Savard, An overview of bilevel optimization, Annals of Operations Research Sept. 2007;153(1):235–256.

[27] J. Yang, Z. Wang, Z. Lin, X. Shu, T. Huang, Bilevel sparse coding for coupled feature spaces, Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2012:2360–2367.

[28] H.J. Kushner, G.G. Yin, Stochastic approximation and recursive algorithms with applications. Second edition Springer; 2003.

[29] Léon Bottou, Online algorithms and stochastic approximations, Online learning and neural networks. Cambridge University Press; 1998.

[30] K. Duan, S. Keerthi, W. Chu, S. Shevade, A. Poo, Multi-category classification by soft-max combination of binary classifiers, Lecture Notes in Computer Science 2003;2709:125–134.

[31] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing Nov. 2006;54(11):4311–4322.

[32] X. Sun, Q. Qu, N.M. Nasrabadi, T.D. Tran, Structured priors for sparse-representation-based hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 2014;11(4):1235–1239.

[33] N. Simon, J. Friedman, T. Hastie, R. Tibshirani, A sparse group Lasso, Journal of Computational and Graphical Statistics 2013;22(2):231–245.

[34] D. Pham, S. Venkatesh, Joint learning and dictionary construction for pattern recognition, Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). 2008:1–8.

[35] Z. Jiang, Z. Lin, L. Davis, Learning a discriminative dictionary for sparse coding via label consistent K-SVD, Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). 2011:1697–1704.

[36] G. Zhang, Z. Jiang, L. Davis, Online semi-supervised discriminative dictionary learning for sparse representation, Proceedings of IEEE Asian conference on computer vision (ACCV). 2012:259–273.

[37] C. Tomasi, R. Manduchi, Bilateral filtering for gray and color images, Proceedings of the IEEE international conference on computer vision (ICCV). 1998:839–846.

[38] Qi Wang, Zhaotie Meng, Xuelong Li, Locality adaptive discriminant analysis for spectral-spatial classification of hyperspectral images, IEEE Geoscience and Remote Sensing Letters 2017;14(11):2077–2081.

[39] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, Advances in neural information processing systems (NIPS). 2007:801–808.

[40] Y. Wang, S. Chen, Z. Zhou, New semi-supervised classification method based on modified clustering assumption, IEEE Transactions on Neural Network May 2012;23(5):689–702.

[41] J. Li, J. Bioucas-Dias, A. Plaza, Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning, IEEE Transactions on Geoscience and Remote Sensing Nov. 2010;48(11):4085–4098.

[42] J. Marial, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, Proceeding of international conference on machine learning (ICML). 2009:689–696.


 “©2015 IEEE. Reprinted, with permission, from Wang, Zhangyang, Nasrabadi, Nasser, and Huang, Thomas S. “Semi-supervised Hyperspectral Classification using Task-driven Dictionary Learning with Regularization.” IEEE Transactions on Geosciences and Remote Sensing (2015).”

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.137.220.92