Chapter 7

Dimensionality Reduction

Shuyang Wang, Zhangyang Wang, Yun Fu
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, United States
Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States
Department of Electrical and Computer Engineering and College of Computer and Information Science (Affiliated), Northeastern University, Boston, MA, United States

Abstract

Learning good representations for images has long been an active topic in the machine learning and pattern recognition fields. Among the numerous algorithms, dictionary learning is a well-known strategy for effective feature extraction. Recently, more discriminative sub-dictionaries have been built by Fisher discriminative dictionary learning with specific class labels. Different types of constraints, such as sparsity, low-rankness, and locality, are also exploited to make use of global and local information. On the other hand, as the basic building block of deep structures, the auto-encoder has demonstrated promising performance in extracting new feature representations.

Keywords

Marginalized denoising auto-encoder; Locality constraint; Dictionary learning

7.1 Marginalized Denoising Dictionary Learning with Locality Constraint1

7.1.1 Introduction

Learning good representations for images has long been an active topic in the machine learning and pattern recognition fields. Among the numerous algorithms, dictionary learning is a well-known strategy for effective feature extraction. Recently, more discriminative sub-dictionaries have been built by Fisher discriminative dictionary learning with specific class labels. Different types of constraints, such as sparsity, low-rankness, and locality, are also exploited to make use of global and local information. On the other hand, as the basic building block of a deep structure, the auto-encoder has demonstrated promising performance in extracting new feature representations. In this section, a unified feature learning framework that incorporates the marginalized denoising auto-encoder into a locality-constrained dictionary learning scheme, named Marginalized Denoising Dictionary Learning (MDDL), will be introduced [1]. Overall, the introduced method deploys a low-rank constraint on each sub-dictionary and a locality constraint, instead of sparsity, on the coefficients, in order to learn more concise and pure feature spaces while inheriting the discrimination from sub-dictionary learning. Experimental results listed in this section demonstrate the effectiveness and efficiency of MDDL in comparison with several state-of-the-art methods.

Sparse representation has experienced rapid growth in both theory and application in recent research, leading to interesting results in image classification [2–4], speech denoising [5], and bioinformatics [6], among others. For each input signal, the key idea is to find a linear combination of atoms from a given over-complete dictionary D as a new representation. Therefore, sparse representation is capable of revealing the underlying structure of high-dimensional data. Among the wide range of problems sparse representation has been applied to, in this section we focus on image classification, which demands a discriminative representation.

The generative and discriminative ability of the dictionary is clearly a major factor in sparse representation. Directly using the original training samples as the dictionary, as in [7], raises the problem that the noise and ambiguity contained in the training set can prevent a test sample from being faithfully represented. In addition, the discriminative information hidden in the training samples is ignored by this strategy. The aforementioned problem can be addressed through dictionary learning, which aims to learn a proper dictionary from the original training samples. After the training process is finished, a newly arriving signal can be well represented by the set of bases learned from the original training set. Solutions to problems like face recognition have been dramatically improved with a well-learned dictionary. In order to represent test samples distinctively, considerable research effort has been devoted to seeking a well-adapted dictionary. Recently, a discriminative constraint was added to the K-SVD-based dictionary learning model [8], in which the classification error was considered in order to gain discriminability [9]. In Jiang et al.'s work, the discriminative ability of the dictionary is enforced by associating label information with each dictionary atom [10]. In order to learn a structured dictionary, the Fisher criterion was introduced to learn a set of sub-dictionaries with specific class labels [11]. The above algorithms are designed to deal with clean signals or signals corrupted only by small noise. When the training samples are corrupted by large noise, the learned dictionary will include the corruptions, resulting in a failure to represent the test samples.

7.1.2 Related Works

In this section, we mainly discuss two lines of related works, namely dictionary learning and auto-encoder.

7.1.2.1 Dictionary Learning

Recent research on dictionary learning has demonstrated that a well-learned dictionary greatly boosts performance by yielding a better representation in human action recognition [21], scene categorization [22], and image colorization [23].

A sparse representation based classification (SRC) method for robust face recognition was proposed by Wright et al. [7]. Suppose we have c classes of training samples $X = [X_1, X_2, \ldots, X_c]$, where $X_i \in \mathbb{R}^{d \times n_i}$ contains the training samples of the ith class, with dimension d and $n_i$ samples. The SRC procedure classifies a given test sample $x_{\text{test}}$ in two phases. First, in the coding phase, we solve the following $\ell_1$-norm minimization problem:

$$\bar{a} = \arg\min_{a} \|x_{\text{test}} - X a\|_2^2 + \lambda \|a\|_1, \tag{7.1}$$

in which the $\ell_1$-norm is used as the convex envelope of the $\ell_0$-norm, so as to avoid an NP-hard problem while still promoting sparsity. The classification is then done via

$$\operatorname{identity}(x_{\text{test}}) = \arg\min_{i} \varepsilon_i, \tag{7.2}$$

where $\varepsilon_i = \|x_{\text{test}} - X_i \bar{a}_i\|_2$ and $\bar{a}_i$ is the coefficient sub-vector associated with class i. SRC classifies a test image by picking the class with the smallest reconstruction error $\varepsilon_i$.
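To make the two-phase procedure concrete, the following sketch (an illustrative NumPy implementation, not the authors' code) solves the coding step of Eq. (7.1) with a basic iterative shrinkage (ISTA) loop and then assigns the label by the smallest class-wise residual as in Eq. (7.2); the step size, iteration count, and λ value are hypothetical choices.

```python
import numpy as np

def ista_l1(X, x_test, lam=0.1, n_iter=200):
    """Approximately solve min_a 0.5*||x_test - X a||_2^2 + lam*||a||_1 with ISTA."""
    a = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ a - x_test)               # gradient of the quadratic term
        a = a - step * grad
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)   # soft-thresholding
    return a

def src_classify(X, labels, x_test, lam=0.1):
    """SRC: code x_test over all training samples, pick the class with the smallest residual."""
    a = ista_l1(X, x_test, lam)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(x_test - X[:, labels == c] @ a[labels == c]) for c in classes]
    return classes[int(np.argmin(residuals))]
```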

Instead of directly using the training set itself as the dictionary, several algorithms and regularizations have been introduced into the dictionary learning framework to learn a compact dictionary with more representational power. In FDDL [11], a set of class-specific sub-dictionaries, whose atoms correspond to the class labels, is updated iteratively based on the Fisher discrimination criterion to include discriminative information. Jiang et al. presented the Label Consistent K-SVD (LC-KSVD) algorithm to make a learned dictionary more discriminative for sparse coding [10]. These methods have shown that a structured dictionary can dramatically improve classification performance. However, their performance drops considerably if the training data is heavily corrupted.

The recently introduced low-rank dictionary learning [12] aims to uncover the global structure by grouping similar samples into one cluster, and has been successfully applied to many applications, e.g., object detection [13], multi-view learning [14], unsupervised subspace segmentation [15], and 3D visual recovery [16]. The goal is to recover a low-rank matrix from corrupted original input data. That is, if a given data matrix X is corrupted by a sparse matrix E while the samples share a similar pattern, then the sparse noise matrix can be separated out to recover X via rank minimization. Following this line of research, many works have integrated low-rank regularization into sparse representation to separate sparse noise from the input signals while simultaneously optimizing the dictionary atoms, so as to faithfully reconstruct the denoised data. Moreover, a low-rank dictionary handles noisy data well by adding an error term with a suitable norm, e.g., the $\ell_1$-norm or $\ell_{2,1}$-norm.

Applying augmented Lagrange multipliers (ALM), Lin [24] proposed Robust PCA, with which the corrupted data in a single subspace can be recovered by solving a matrix rank minimization problem. In [12], Liu et al. proposed converting the task of face image clustering into a subspace segmentation problem with the assumption that face images from different individuals lie in different, nearly independent subspaces.

7.1.2.2 Auto-encoder

Most recently, deep learning has attracted a lot of interest when looking for better feature extraction. In this direction, auto-encoder [18] is one of the most popular building blocks to form a lite-version deep learning framework. The auto-encoder has drawn increasing attention in the feature learning area and has been considered as a simulation of the way the human visual system processes imagery. The auto-encoder architecture explicitly involves two modules, an encoder and a decoder. The encoder outputs a group of hidden representation units, which is realized by a linear deterministic mapping with a weight matrix and a nonlinear transformation employing a logistic sigmoid. The decoder reconstructs the input data based on the received sparse hidden representation. The aforementioned dictionary learning model can be formalized as a decoder module.

As a typical single hidden layer neural network with identical input and target, the auto-encoder [29] aims to discover data's intrinsic structure by encouraging the output to be as similar to the target as possible. Essentially, the neurons in the hidden layer can be seen as a good representation since they are able to reconstruct the input data. To encourage structured feature learning, further constraints have been imposed on the parameters during training.

Moreover, there is a well-known trick of the trade for dealing with noisy data: manually injecting noise into the training samples and thereby learning with artificially corrupted data. Denoising auto-encoders (DAEs) [30,18], trained with artificially corrupted data as input, have been successfully applied to a wide range of machine learning tasks by learning a new denoising representation. During training, a DAE reconstructs the original clean version of the input data from a partially corrupted input generated with a pre-specified corrupting distribution. This process learns a robust representation that tolerates certain distortions in the input data.

On this basis, stacked denoising auto-encoders (SDAs) [19] have been successfully used to learn new representations and attained record accuracy on standard benchmarks for domain adaptation. However, SDAs have two crucial limitations: (i) high computational cost, and (ii) lack of scalability to high-dimensional features. To address these two problems, the authors of [20] proposed marginalized SDAs (mSDA). Different from SDAs, mSDA marginalizes out the noise, so the parameters can be computed in closed form rather than with stochastic gradient descent or other iterative optimization algorithms. Consequently, mSDA speeds up SDAs by about two orders of magnitude.

Wang et al. [28] explored the effectiveness of a locality constraint on two different types of feature learning techniques, i.e., dictionary learning and the auto-encoder. A locality-constrained collaborative auto-encoder (LCAE) was proposed to extract features with local information and thereby enhance classification ability. In order to introduce locality into the coding procedure, the input data is first reconstructed by the LLC coding criterion and then used as the target of the auto-encoder. That is, the target $\hat{x}$ is replaced by a locality-constrained reconstruction obtained as follows:

$$\min_{C} \sum_{i=1}^{N} \|x_i - D c_i\|^2 + \lambda \|l_i \odot c_i\|^2 \quad \text{s.t. } \mathbf{1}^{T} c_i = 1, \ \forall i, \tag{7.3}$$

where the dictionary D is initialized by PCA on the input training matrix X. The proposed LCAE can be trained using the backpropagation algorithm, which updates W and b by backpropagating the reconstruction error gradient from the output layer toward the locality-coded target layer.

In this chapter, the introduced model jointly learns the auto-encoder and the dictionary to benefit from both techniques. To keep the model fast, a lite version of the auto-encoder, i.e., the marginalized denoising auto-encoder [20], is adopted, which has shown appealing performance and efficiency. Several benchmarks are used to evaluate the introduced algorithm, and the experimental results show better performance compared with state-of-the-art methods.

7.1.3 Marginalized Denoising Dictionary Learning with Locality Constraint

In this section, we first revisit locality-constrained dictionary learning and the marginalized denoising auto-encoder. Then we introduce marginalized denoising dictionary learning with a locality constraint, along with an efficient solution for optimizing the proposed algorithm. The overall framework is illustrated in Fig. 7.1.

Figure 7.1 Illustration of MDDL. A marginalized denoising auto-encoder is adopted in the dictionary learning (DL) scheme. The weights in the auto-encoder and the sub-dictionaries in DL are trained jointly. Each learned sub-dictionary is of low rank, which narrows the negative effect of noise contained in the training samples. For the marginalized denoising auto-encoder, noise is manually added to the input.

7.1.3.1 Preliminaries and Motivations

In this section, we introduce a feature learning model that unifies the marginalized denoising auto-encoder and locality-constrained dictionary learning (MDDL) to simultaneously benefit from their merits [1]. Specifically, dictionary learning handles corrupted data in the sample space, while the marginalized auto-encoder attempts to deal with noisy data in the feature space. Thus, the algorithm aims to work well with data corrupted in both the sample and feature spaces by integrating dictionary learning and the auto-encoder into a unified framework. The key points of the introduced framework are as follows:

  •  The MDDL model seeks a transformation matrix that filters out the noise inside the data with a marginalized denoising auto-encoder, which avoids forward and backward propagation and thereby works both efficiently and effectively.
  •  With the transformed data, the model builds a set of supervised sub-dictionaries with a locality constraint. In this way, the sub-dictionaries are discriminative for each class, which makes the new representation preserve the manifold structure while uncovering the global structure.
  •  The marginalized denoising transformation and the locality-constrained dictionary are jointly learned in a unified framework. In this way, the model integrates the auto-encoder and dictionary learning to produce features with denoising ability and discriminative information.

Consider a matrix $X = [x_1, x_2, \ldots, x_n]$ containing n samples. A low-rank representation tries to segment these samples by minimizing the following objective function:

$$\arg\min_{Z} \|Z\|_* \quad \text{s.t. } X = D Z, \tag{7.4}$$

where $\|\cdot\|_*$ denotes the nuclear norm and Z is the coefficient matrix. In the subspace segmentation problem, a low-rank approximation is enforced by minimizing the error between X and its low-rank representation (with $D = X$). By applying low-rank regularization in dictionary updating, the DLRD [25] algorithm achieved impressive results, especially when corruption exists. Jiang et al. proposed a sparse and dense hybrid representation based on a supervised low-rank dictionary decomposition to learn a class-specific dictionary and erase non-class-specific information [26]. Furthermore, supervised information has been well utilized to seek a more discriminative dictionary [11,17]. In the introduced model, supervised information is also adopted to learn multiple sub-dictionaries, so that samples from the same class are drawn from one low-dimensional subspace.

The above-mentioned sparse representation based methods treat each sample as an independent sparse linear combination; this assumption fails to exploit the spatial consistency between neighboring samples. Recent research efforts have produced more promising results on classification tasks by using the idea of locality [27]. The method named Local Coordinate Coding (LCC), which explicitly encourages the coding to rely on local structure, has been presented as a modification of sparse coding. In [27] the authors also theoretically proved that locality is more essential than sparsity under certain assumptions. Inspired by the above learning techniques, Wang et al. proposed Locality-Constrained Low-Rank Dictionary Learning (LC-LRD) to enhance the identification capability by using geometric structure information [28].

7.1.3.2 LC-LRD Revisited

Given a set of training data $X = [X_1, X_2, \ldots, X_c] \in \mathbb{R}^{d \times n}$, where d is the feature dimensionality, n is the total number of training samples, c is the number of classes, and $X_i \in \mathbb{R}^{d \times n_i}$ contains the $n_i$ samples of class i, the goal of dictionary learning is to learn a dictionary $D \in \mathbb{R}^{d \times m}$ with m atoms which yields a sparse representation matrix $A \in \mathbb{R}^{m \times n}$ of X for future classification tasks. Then we can write $X = DA + E$, where E is the sparse noise matrix. Rather than learning the dictionary as a whole from all the training samples, we learn a sub-dictionary $D_i$ for the ith class separately. Then A and D can be written as $A = [A_1, A_2, \ldots, A_c]$ and $D = [D_1, D_2, \ldots, D_c]$, where $A_i$ is the sub-matrix containing the coefficients of $X_i$ over D.

In [28], Wang et al. have proposed the following LC-LRD model for each sub-dictionary:

$$\min_{D_i, A_i, E_i} R(D_i, A_i) + \alpha \|D_i\|_* + \beta \|E_i\|_1 + \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot a_{i,k}\|^2, \quad \text{s.t. } X_i = D A_i + E_i, \tag{7.5}$$

where $R(D_i, A_i)$ is the Fisher discriminant regularization on each sub-dictionary, $\|D_i\|_*$ is the nuclear norm enforcing the low-rank property, and $\|l_{i,k} \odot a_{i,k}\|^2$ is a locality constraint replacing sparsity on the coding coefficient matrix; $a_{i,k}$ denotes the kth column of $A_i$, i.e., the coefficient vector of the kth sample in class i. This model consists of the following modules: discriminative sub-dictionaries, a low-rank regularization term, and the locality constraint on the coding coefficients.

Sub-dictionary $D_i$ should be endowed with the discriminative power to represent samples from the ith class well. Mathematically, the coding coefficients of $X_i$ over D can be written as $A_i = [A_i^1; A_i^2; \ldots; A_i^c]$, where $A_i^j$ is the coefficient matrix of $X_i$ over $D_j$. The discriminative power of $D_i$ comes from two aspects. First, $X_i$ should be well represented by $D_i$ but not by $D_j$, $j \neq i$; therefore, we minimize $\|X_i - D_i A_i^i - E_i\|_F^2$, where $E_i$ is the residual. Meanwhile, $D_i$ should not be good at representing samples from other classes, that is, each $A_j^i$ ($j \neq i$), the coefficient matrix of $X_j$ over $D_i$, should have nearly zero entries so that $\|D_i A_j^i\|_F^2$ is as small as possible. Thus we denote the discriminative fidelity term for sub-dictionary $D_i$ as follows:

$$R(D_i, A_i) = \|X_i - D_i A^i_i - E_i\|_F^2 + \sum_{j=1, j \neq i}^{c} \|D_i A^i_j\|_F^2. \tag{7.6}$$

In the task of image classification, the within-class samples are linearly correlated and lie in a low-dimensional manifold. Therefore, we want to find the dictionary with the most concise atoms by minimizing the rank of $D_i$, which suggests replacing it by its convex surrogate $\|D_i\|_*$ [31], where $\|\cdot\|_*$ denotes the nuclear norm of a matrix (i.e., the sum of its singular values).

In addition, a locality constraint is deployed on the coefficient matrix instead of a sparsity constraint. As suggested by LCC [32], locality is more essential than sparsity under certain assumptions, since locality must lead to sparsity but not necessarily vice versa. Specifically, the locality constraint uses the following criterion:

$$\min_{A} \sum_{i=1}^{n} \|l_i \odot a_i\|^2, \quad \text{s.t. } \mathbf{1}^{T} a_i = 1, \ \forall i, \tag{7.7}$$

where ⊙ denotes element-wise multiplication, and $l_i \in \mathbb{R}^m$ is the locality adaptor that gives each basis vector a degree of freedom proportional to its similarity to the input sample $x_i$. Specifically,

$$l_i = \exp\!\left(\frac{\operatorname{dist}(x_i, D)}{\delta}\right), \tag{7.8}$$

where $\operatorname{dist}(x_i, D) = [\operatorname{dist}(x_i, d_1), \operatorname{dist}(x_i, d_2), \ldots, \operatorname{dist}(x_i, d_m)]^T$, $\operatorname{dist}(x_i, d_j)$ is the Euclidean distance between sample $x_i$ and the jth dictionary atom $d_j$, and δ controls the bandwidth of the distribution.
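As an illustration, the snippet below computes the locality adaptor of Eq. (7.8) and then solves a single-sample locality-constrained code combining the reconstruction term of Eq. (7.3) with the sum-to-one constraint, by writing out the KKT system directly. This is only a sketch under our own variable names; the LLC implementation referenced later in this section [27] instead uses a k-nearest-neighbor approximation.

```python
import numpy as np

def locality_adaptor(x, D, delta=1.0):
    """Eq. (7.8): l_j = exp(dist(x, d_j) / delta) for every dictionary atom d_j."""
    dist = np.linalg.norm(D - x[:, None], axis=0)
    return np.exp(dist / delta)

def locality_constrained_code(x, D, lam=1.0, delta=1.0):
    """Solve min_c ||x - D c||^2 + lam*||l ⊙ c||^2  s.t.  1^T c = 1 for one sample x."""
    l = locality_adaptor(x, D, delta)
    m = D.shape[1]
    M = D.T @ D + lam * np.diag(l ** 2)       # Hessian of the unconstrained objective
    Minv_rhs = np.linalg.solve(M, D.T @ x)
    Minv_one = np.linalg.solve(M, np.ones(m))
    gamma = (np.ones(m) @ Minv_rhs - 1.0) / (np.ones(m) @ Minv_one)  # multiplier for 1^T c = 1
    return Minv_rhs - gamma * Minv_one
```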

Generally speaking, LC-LRD is based on the following three observations: (i) locality is more essential than sparsity for ensuring that similar samples obtain similar representations; (ii) each sub-dictionary should have discriminative ability, which is introduced by the discriminative term; (iii) a low-rank constraint is imposed on each sub-dictionary to separate noise from the samples and discover the latent structure.

7.1.3.3 Marginalized Denoising Auto-encoder (mDA)

Consider a vector input $x \in \mathbb{R}^d$, with d the dimensionality of the visual descriptor. The auto-encoder involves two important transformations, which can be considered encoder and decoder processes, namely “input→hidden units” and “hidden units→output”, given by

$$h = \sigma(W x + b_h), \qquad \hat{x} = \sigma(W^T h + b_o), \tag{7.9}$$

where $h \in \mathbb{R}^z$ is the hidden representation, and $\hat{x} \in \mathbb{R}^d$ is interpreted as a reconstruction of the input x. The parameter set includes a weight matrix $W \in \mathbb{R}^{z \times d}$ and two offset vectors $b_h \in \mathbb{R}^z$ and $b_o \in \mathbb{R}^d$ for the hidden units and the output, respectively; σ is a nonlinear mapping such as the sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$. In general, an auto-encoder is a neural network with a single hidden layer and with identical input and target, meaning that the auto-encoder encourages the output to be as similar to the target as possible, namely,

$$\min_{W, b_h, b_o} L(x) = \min_{W, b_h, b_o} \frac{1}{2n} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|_2^2, \tag{7.10}$$

where n is the number of images, $x_i$ is the target, and $\hat{x}_i$ is the reconstruction of the input. In this way, the neurons in the hidden layer can be seen as a good representation of the input, since they are able to reconstruct the data.
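For concreteness, here is a minimal NumPy sketch of the encoder/decoder mapping in Eq. (7.9) and the reconstruction loss in Eq. (7.10); the dimensions and random initialization below are arbitrary placeholders, and no training loop is shown.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def autoencoder_forward(X, W, b_h, b_o):
    """Eq. (7.9): encode each column of X into hidden units h, then reconstruct it."""
    H = sigmoid(W @ X + b_h[:, None])          # z x n hidden representation
    X_hat = sigmoid(W.T @ H + b_o[:, None])    # d x n reconstruction
    return H, X_hat

def reconstruction_loss(X, X_hat):
    """Eq. (7.10): average squared reconstruction error over the n samples."""
    return 0.5 / X.shape[1] * np.sum((X - X_hat) ** 2)

# Toy usage: d = 64 input dimensions, z = 16 hidden units, n = 100 samples.
rng = np.random.default_rng(0)
X = rng.random((64, 100))
W = 0.1 * rng.standard_normal((16, 64))
H, X_hat = autoencoder_forward(X, W, np.zeros(16), np.zeros(64))
print(reconstruction_loss(X, X_hat))
```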

Since an auto-encoder deploys nonlinear functions, it takes considerable time to train, especially when the dimensionality of the data is very high. Recently, the marginalized denoising auto-encoder (mDA) [20] was developed to address data reconstruction in a linear fashion, achieving performance comparable to the original auto-encoder. The general idea of mDA is to learn a single linear transformation matrix W that reconstructs the data, by minimizing the squared reconstruction loss

$$\frac{1}{2n} \sum_{i=1}^{n} \|x_i - W \tilde{x}_i\|^2, \tag{7.11}$$

where $\tilde{x}_i$ is a corrupted version of $x_i$. The solution of the above objective depends on which features of each input happen to be randomly corrupted. To lower this variance, the marginalized denoising auto-encoder minimizes the overall squared loss over t different corrupted versions,

$$\frac{1}{2tn} \sum_{j=1}^{t} \sum_{i=1}^{n} \|x_i - W \tilde{x}_{i,(j)}\|^2, \tag{7.12}$$

where $\tilde{x}_{i,(j)}$ denotes the jth corrupted version of the original input $x_i$. Define $X = [x_1, \ldots, x_n]$, its t-times repeated version $\bar{X} = [X, \ldots, X]$, and the corresponding t-times differently corrupted version $\tilde{X} = [\tilde{X}_{(1)}, \ldots, \tilde{X}_{(t)}]$, where $\tilde{X}_{(i)}$ denotes the ith corrupted version of X. In this way, Eq. (7.12) can be reformulated as

$$\frac{1}{2tn} \|\bar{X} - W \tilde{X}\|_F^2, \tag{7.13}$$

which has the well-known closed-form solution of ordinary least squares. When $t \to \infty$, the solution can be computed from expectations, using the weak law of large numbers [20].
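The least-squares solution of Eq. (7.13) can be approximated by explicitly generating t corrupted copies, as the sketch below does with masking (dropout) noise; the corruption type, probability, and the small ridge term added for numerical stability are our own assumptions. As t grows, the matrices involved converge to their expectations, which is exactly what mDA [20] computes analytically instead of sampling.

```python
import numpy as np

def mda_transform(X, p=0.3, t=10, ridge=1e-5, seed=None):
    """Approximate the closed-form W of Eq. (7.13) with t corrupted copies of X (d x n)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    X_bar = np.tile(X, (1, t))                      # t-times repeated clean data
    mask = rng.random((d, n * t)) > p               # keep each feature with probability 1 - p
    X_tilde = X_bar * mask                          # corrupted copies
    P = X_bar @ X_tilde.T                           # -> E[X_bar X_tilde^T] as t grows
    Q = X_tilde @ X_tilde.T + ridge * np.eye(d)     # -> E[X_tilde X_tilde^T], regularized
    return np.linalg.solve(Q.T, P.T).T              # W = P Q^{-1}, the ordinary least-squares fit
```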

7.1.3.4 Proposed MDDL Model

The previous discussion on mDA shows that, with a linear transformation matrix, mDA can be implemented in several lines of Matlab code and works very efficiently. The learned transformation matrix can reconstruct the data well while filtering out the noise.

Inspired by this, the proposed model aims at jointly learning a dictionary and a marginalized denoising transformation matrix in a unified framework. We formulate the objective function as

$$\min_{D_i, A_i, E_i, W} F(D_i, A_i, E_i) + \|\bar{X} - W \tilde{X}\|_F^2, \quad \text{s.t. } W X_i = D A_i + E_i, \tag{7.14}$$

where $F(D_i, A_i, E_i) = R(D_i, A_i) + \alpha \|D_i\|_* + \beta_1 \|E_i\|_1 + \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot a_{i,k}\|^2$ is the locality-constrained dictionary learning part from Eq. (7.5), and $R(D_i, A_i) = \|W X_i - D_i A^i_i - E_i\|_F^2 + \sum_{j=1, j\neq i}^{c} \|D_i A^i_j\|_F^2$ is the discriminative term from Eq. (7.6), now applied to the transformed data; α, $\beta_1$, and λ are trade-off parameters.

Discussion. The proposed marginalized denoising dictionary learning (MDDL) aims to learn a more discriminative dictionary on the transformed data. Since the marginalized denoising regularizer generates a better transformation matrix for handling corrupted data, the dictionary can be learned on denoised, clean data. The framework unifies the marginalized denoising auto-encoder and locality-constrained dictionary learning. Generally, dictionary learning seeks a well-represented basis in order to obtain more discriminative coefficients for the original data; therefore, dictionary learning can handle noisy data to some extent, while the denoising auto-encoder has demonstrated its power in many applications. To this end, the joint learning scheme can benefit from both the marginalized denoising auto-encoder and locality-constrained dictionary learning.

7.1.3.5 Optimization

The proposed objective function in Eq. (7.14) can be optimized by dividing it into two sub-problems: first, updating each coefficient matrix $A_i$ ($i = 1, 2, \ldots, c$) and W one by one, with the dictionary D and all other $A_j$ ($j \neq i$) fixed, and then assembling them into the coding coefficient matrix A; second, updating $D_i$ with the other variables fixed. These two steps are repeated iteratively to obtain the discriminative low-rank sub-dictionaries D, the marginalized denoising transformation W, and the locality-constrained coefficients A. One subtlety arises in the second sub-problem, though. Recall that in Eq. (7.6) the coefficients $A^i_i$ of $X_i$ over $D_i$ are updated to minimize the term $\|X_i - D_i A^i_i - E_i\|_F^2$. Therefore, when we update $D_i$ in the second sub-problem, the related variable $A^i_i$ is also updated.

Sub-problem I. Assume that the structured dictionary D is given; the coefficient matrices $A_i$ ($i = 1, 2, \ldots, c$) are updated one by one, and the original objective function in Eq. (7.14) reduces to the following locality-constrained coding problem for the coefficients of each class and for W:

$$\min_{A_i, E_i, W} \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot a_{i,k}\|^2 + \beta_1 \|E_i\|_1 + \|\bar{X} - W \tilde{X}\|_F^2, \quad \text{s.t. } W X_i = D A_i + E_i, \tag{7.15}$$

which can be solved by the Augmented Lagrange Multiplier (ALM) method [33]. We transform Eq. (7.15) into its augmented Lagrangian as follows:

$$\min_{A_i, E_i, W, T_1} \sum_{i=1}^{c} \Big( \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot a_{i,k}\|^2 + \beta_1 \|E_i\|_1 + \langle T_1, W X_i - D A_i - E_i \rangle + \frac{\mu}{2} \|W X_i - D A_i - E_i\|_F^2 \Big) + \|\bar{X} - W \tilde{X}\|_F^2, \tag{7.16}$$

where $T_1$ is the Lagrange multiplier and μ is a positive penalty parameter. Different from traditional locality-constrained linear coding (LLC) [27], MDDL adds an error term that can handle large noise in the samples. In the following, we perform iterative optimization over $A_i$, $E_i$, and W.

Updating $A_i$:

$$A_i = \arg\min_{A_i} \frac{\mu}{2} \|Z_i - D A_i\|_F^2 + \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot a_{i,k}\|^2,$$
$$A_i = \operatorname{LLC}(Z_i, D, \lambda, \delta), \tag{7.17}$$

where $Z_i = W X_i - E_i + \frac{T_1}{\mu}$ and $l_{i,k} = \exp(\operatorname{dist}(z_{i,k}, D)/\delta)$. The function $\operatorname{LLC}(\cdot)$ is a locality-constrained linear coding function2 [27].

Updating $E_i$:

$$E_i = \arg\min_{E_i} \frac{\beta_1}{\mu} \|E_i\|_1 + \frac{1}{2} \Big\|E_i - \Big(W X_i - D A_i + \frac{T_1}{\mu}\Big)\Big\|_F^2, \tag{7.18}$$

which can be solved by the shrinkage operator [34].
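The shrinkage operator referenced here is element-wise soft-thresholding; a one-line sketch and how it would be applied to Eq. (7.18) under the notation above:

```python
import numpy as np

def shrink(M, tau):
    """Soft-thresholding: minimizer of tau*||E||_1 + 0.5*||E - M||_F^2, applied element-wise."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

# For Eq. (7.18):  E_i = shrink(W @ X_i - D @ A_i + T1 / mu, beta1 / mu)
```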

Updating W:

$$\begin{aligned} W &= \arg\min_{W} \sum_{i=1}^{c} \frac{\mu}{2} \Big\|W X_i - D A_i - E_i + \frac{T_1}{\mu}\Big\|_F^2 + \|\bar{X} - W \tilde{X}\|_F^2 \\ &= \arg\min_{W} \frac{\mu}{2} \|W X - DA\|_F^2 + \|\bar{X} - W \tilde{X}\|_F^2, \end{aligned} \tag{7.19}$$

where $X = [X_1, \ldots, X_c]$ and $DA = [D A_1 + E_1 - \frac{T_{1,1}}{\mu}, \ldots, D A_c + E_c - \frac{T_{1,c}}{\mu}]$. Equation (7.19) has the well-known closed-form solution

$$W = (\mu\, DA\, X^T + 2 \bar{X} \tilde{X}^T)(\mu X X^T + 2 \tilde{X} \tilde{X}^T)^{-1}, \tag{7.20}$$

where $\bar{X}$ is the t-times repeated version of X and $\tilde{X}$ consists of its t corrupted versions. We define $P = \mu\, DA\, X^T + 2\bar{X}\tilde{X}^T$ and $Q = \mu X X^T + 2\tilde{X}\tilde{X}^T$, and we would like the repetition number t to go to ∞, so that the denoising transformation W is effectively learned from infinitely many copies of noisy data. In practice we cannot generate $\tilde{X}$ with infinitely many corrupted versions; however, P and Q converge to their expectations as t becomes very large. We can therefore use the expected values of P and Q and calculate the corresponding mapping W as

$$\begin{aligned} W &= \mathbb{E}[P]\, \mathbb{E}[Q]^{-1} \\ &= \mathbb{E}\big[\mu\, DA\, X^T + 2 \bar{X} \tilde{X}^T\big]\, \mathbb{E}\big[\mu X X^T + 2 \tilde{X} \tilde{X}^T\big]^{-1} \\ &= \big(\mu\, DA\, X^T + 2 \mathbb{E}[\bar{X} \tilde{X}^T]\big)\big(\mu X X^T + 2 \mathbb{E}[\tilde{X} \tilde{X}^T]\big)^{-1}, \end{aligned} \tag{7.21}$$

where DA and μ are treated as constants when optimizing W. The expectations $\mathbb{E}[\bar{X}\tilde{X}^T]$ and $\mathbb{E}[\tilde{X}\tilde{X}^T]$ are easy to compute following mDA [20].
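Putting the pieces together, the W update of Eq. (7.21) is a single linear solve once the mDA expectations are available; the sketch below assumes they have been precomputed (analytically as in [20], or by sampling as in the earlier sketch), and EXXt, EXtXt are our own placeholder names.

```python
import numpy as np

def update_W(DA, X, EXXt, EXtXt, mu):
    """Eq. (7.21): W = (mu*DA*X^T + 2*E[X_bar X_tilde^T]) (mu*X*X^T + 2*E[X_tilde X_tilde^T])^{-1}."""
    P = mu * DA @ X.T + 2.0 * EXXt
    Q = mu * X @ X.T + 2.0 * EXtXt
    return np.linalg.solve(Q.T, P.T).T      # P Q^{-1} without forming the inverse explicitly
```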

Sub-problem II. For sub-dictionary updating, MDDL uses the same method as in [25]. In the second sub-problem, when $A_i$ is fixed, the sub-dictionaries $D_i$ ($i = 1, 2, \ldots, c$) are updated one by one. The objective function in Eq. (7.14) is converted into the following problem:

$$\min_{D_i, E_i, A^i_i} \sum_{j=1, j \neq i}^{c} \|D_i A^i_j\|_F^2 + \alpha \|D_i\|_* + \beta_2 \|E_i\|_1 + \lambda \sum_{k=1}^{n_i} \|l^i_{i,k} \odot a^i_{i,k}\|^2, \quad \text{s.t. } W X_i = D_i A^i_i + E_i, \tag{7.22}$$

where $a^i_{i,k}$ is the kth column of $A^i_i$, i.e., the coefficient vector of the kth sample of class i over $D_i$. Problem (7.22) can be solved using the Augmented Lagrange Multiplier method [33] by introducing a relaxation variable J:

$$\begin{aligned} \min_{D_i, E_i, A^i_i}\ & \lambda \sum_{k=1}^{n_i} \|l^i_{i,k} \odot a^i_{i,k}\|^2 + \sum_{j=1, j \neq i}^{c} \|D_i A^i_j\|_F^2 + \alpha \|J\|_* + \beta_2 \|E_i\|_1 \\ & + \langle T_2, W X_i - D_i A^i_i - E_i \rangle + \langle T_3, D_i - J \rangle + \frac{\mu}{2}\big(\|W X_i - D_i A^i_i - E_i\|_F^2 + \|D_i - J\|_F^2\big), \end{aligned} \tag{7.23}$$

where $T_2$ and $T_3$ are Lagrange multipliers, and μ is a positive penalty parameter. In the following, we describe the iterative optimization of $D_i$ and $A^i_i$.

Updating $A^i_i$:

Similarly to Eq. (7.17), the solution for $A^i_i$ is as follows:

$$A^i_i = \operatorname{LLC}\!\Big(W X_i - E_i + \frac{T_2}{\mu},\, D_i,\, \lambda,\, \delta\Big), \tag{7.24}$$

where $\operatorname{LLC}(\cdot)$ is a locality-constrained linear coding function [27].

Updating J and $D_i$:

Here we convert Eq. (7.23) into a problem over J and $D_i$:

$$\min_{J, D_i} \sum_{j=1, j \neq i}^{c} \|D_i A^i_j\|_F^2 + \alpha \|J\|_* + \langle T_2, W X_i - D_i A^i_i - E_i \rangle + \langle T_3, D_i - J \rangle + \frac{\mu}{2}\big(\|D_i - J\|_F^2 + \|W X_i - D_i A^i_i - E_i\|_F^2\big), \tag{7.25}$$

where $J = \arg\min_J \alpha\|J\|_* + \langle T_3, D_i - J\rangle + \frac{\mu}{2}\|D_i - J\|_F^2$, and the solution for $D_i$ is

$$D_i = \Big(J + W X_i A^{i\,T}_i - E_i A^{i\,T}_i + \frac{T_2 A^{i\,T}_i - T_3}{\mu}\Big)\big(I + A^i_i A^{i\,T}_i + V\big)^{-1}, \quad \text{where } V = \frac{2}{\mu} \sum_{j=1, j \neq i}^{c} A^i_j A^{i\,T}_j; \tag{7.26}$$

Updating $E_i$:

$$E_i = \arg\min_{E_i} \beta_2 \|E_i\|_1 + \langle T_2, W X_i - D_i A^i_i - E_i \rangle + \frac{\mu}{2} \|W X_i - D_i A^i_i - E_i\|_F^2, \tag{7.27}$$

which can be solved by the shrinkage operator [34]. The detailed optimization procedure for the proposed model is summarized in Algorithm 7.1.

Algorithm 7.1 Optimization for MDDL.

7.1.3.6 Classification Based on MDDL

MDDL uses a linear classifier for classification. After the dictionary is learned, the locality-constrained coefficients A of the training data X and $A_{\text{test}}$ of the test data $X_{\text{test}}$ are calculated. The representation $a_i$ of test sample i is the ith column of $A_{\text{test}}$. The multivariate ridge regression model [35] is used to obtain a linear classifier $\hat{P}$:

$$\hat{P} = \arg\min_{P} \|L - P A\|_F^2 + \gamma \|P\|_F^2, \tag{7.28}$$

where L is the class label matrix. This yields $\hat{P} = L A^T (A A^T + \gamma I)^{-1}$. When the test coefficients $A_{\text{test}}$ come in, we first compute $\hat{P} A_{\text{test}}$; the label of sample i is then given by the position of the largest value in its label vector, that is, $\text{label} = \arg\max_{\text{label}} (\hat{P} a_i)$.
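As a sketch, the ridge-regression classifier of Eq. (7.28) and the label assignment rule can be written in a few lines; here L is assumed to be a one-hot label matrix (#classes × n) and A the m × n coefficient matrix.

```python
import numpy as np

def train_linear_classifier(A, L, gamma=1.0):
    """Eq. (7.28): P_hat = L A^T (A A^T + gamma I)^{-1}."""
    m = A.shape[0]
    return L @ A.T @ np.linalg.inv(A @ A.T + gamma * np.eye(m))

def predict_labels(P_hat, A_test):
    """Assign each test sample the class whose score in P_hat @ a_i is largest."""
    return np.argmax(P_hat @ A_test, axis=0)
```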

7.1.4 Experiments

7.1.4.1 Experimental Settings

To verify the effectiveness and generality of the introduced MDDL, this section reports experiments conducted on various visual classification applications. The method is tested on five datasets, including three face datasets (ORL [36], Extended YaleB [37], and CMU PIE [38]), one object categorization dataset (COIL-100 [39]), and one digit recognition dataset (MNIST [40]).

We show the experiments in comparison with LDA [41], linear regression classification (LRC) [42], and several recent dictionary learning based classification methods, i.e., FDDL [11], DLRD [25], D2L2R2 [43], DPL [44], and LC-LRD [28]. Moreover, to verify the advantage of joint learning, a simple combination framework is used as a baseline, named AE+DL, which first uses a traditional SAE to learn a new representation and then feeds it into the LC-LRD framework [28].

Parameter selection. The number of atoms in a sub-dictionary, denoted $m_i$, is one of the most important parameters in most dictionary learning algorithms. We conduct an experiment on Extended YaleB with different numbers of dictionary atoms $m_i$ and analyze the effect on the performance of the MDDL model and the other competitors. Fig. 7.2 shows that all compared methods obtain better performance with a larger dictionary size. In the experiments, the number of dictionary columns per class is fixed to the training size for the ORL, Extended YaleB, AR, and COIL-100 datasets, while it is fixed to 30 for the CMU PIE and MNIST datasets. All dictionaries are initialized with PCA on the input data.

Figure 7.2 The recognition rates of six DL based methods versus the number of dictionary atoms, with 20 training samples per class, on the Extended YaleB dataset.

There are five parameters in Algorithm 7.1: α, λ, and δ, along with $\beta_1$ and $\beta_2$ as the two error term parameters for updating the coefficients and the sub-dictionaries, respectively. These five are associated with the dictionary learning part of MDDL and are chosen by 5-fold cross-validation. Experiments show that $\beta_1$ and $\beta_2$ play more important roles than the other parameters; therefore, the others are set to α = 1, λ = 1, and δ = 1. For Extended YaleB, $\beta_1$ = 15, $\beta_2$ = 100; for ORL, $\beta_1$ = 5, $\beta_2$ = 50; for CMU PIE, $\beta_1$ = 5, $\beta_2$ = 1.5; for COIL-100, $\beta_1$ = 3, $\beta_2$ = 150; for MNIST, $\beta_1$ = 2.5, $\beta_2$ = 2.5.

7.1.4.2 Face Recognition

ORL Face Database. The ORL dataset contains 400 images of 40 individuals, with 10 images per subject under varying pose and illumination. The subjects are in frontal, upright posture against a dark, uniform background. The images are resized to 32×32, converted to grayscale, and normalized, and the pixels are concatenated to form a vector. Each image is manually corrupted by a randomly located, unrelated block image. Fig. 7.3B shows four example images with increasing block corruption. For each subject, 5 samples are selected for training and the rest for testing, and the experiment is repeated on 10 random splits for evaluation. Furthermore, SIFT and Gabor filter features are extracted to evaluate the generality of MDDL.

Figure 7.3 Example images from three datasets: (A) images from the MNIST digit dataset with additive white Gaussian noise at 30, 20, 10, 5, and 1 dB SNR; (B) ORL with 10%, 20%, 30% block occlusion; (C) Extended YaleB with 10%, 15%, 20%, 25% random pixel corruption.

We illustrate the recognition rates under different percentages of occlusion in Table 7.1. From the table, we can observe two phenomena: first, the locality-constrained dictionary learning based methods achieve the best results in all settings; second, the MDDL model performs best when the data is clean; however, as the percentage of occlusion increases, MDDL falls behind LC-LRD. This makes sense because in this experiment occlusion noise is added to the images, while the denoising auto-encoder module in MDDL is designed to tackle Gaussian noise. In conclusion, MDDL achieves top results in the no-occlusion situation because of the locality term, while under larger occlusion the low-rank term outweighs the DAE.

Table 7.1

Average recognition rate (%) of different algorithms on the ORL dataset for various occlusion percentages (%). Red denotes the best results, while blue denotes the second best. (For interpretation of the colors in the tables, the reader is referred to the web version of this chapter)

Occlusion LDA [41] LRC [42] FDDL [11] DLRD [25] D2L2R2 [43]
0 92.5±1.8 91.8±1.6 96.0±1.2 93.5±1.5 93.9±1.8
0 (SIFT) 95.8±1.3 92.9±2.1 95.2±1.3 93.7±1.3 93.9±1.5
0 (Gabor) 89.0±3.3 93.4±1.7 96.0±1.3 96.3±1.3 96.6±1.3
10 71.7±3.2 82.2±2.2 86.6±1.9 91.3±1.9 91.0±1.9
20 54.3±2.0 71.3±2.8 75.3±3.4 82.8±3.0 82.8±3.3
30 40.5±3.7 63.7±3.1 63.8±2.7 78.9±3.1 78.8±3.3
40 25.7±2.5 48.0±3.0 48.1±2.4 67.3±3.2 67.4±3.4
50 20.7±3.0 40.9±3.7 36.7±1.2 58.6±3.1 Image


Occlusion DPL [44] LC-LRD [28] AE+DL [1] MDDL [1]
0 94.1±1.8 Image 96.2±1.2 Image
0 (SIFT) 95.3±1.4 93.5±2.6 Image Image
0 (Gabor) 97.0±1.4 94.6±1.8 Image Image
10 84.5±2.7 Image 91.5±1.2 Image
20 71.2±1.7 Image 83.6±1.8 Image
30 59.8±3.9 Image Image 78.0±2.4
40 43.0±2.9 Image 67.3±2.6 Image
50 32.2±3.5 Image 58.7±3.2 58.5±3.0


The effects of the two error term parameters, $\beta_1$ and $\beta_2$, are shown in Fig. 7.4. From the six sub-figures under increasing percentages of corrupted pixels, the parameter $\beta_1$ in the coefficient updating procedure has the larger impact. As more occlusion is applied, the best result appears at a smaller $\beta_1$, which means the error term plays a more important role when noise exists.

Figure 7.4 MDDL's performance (A)–(F) on ORL dataset under increasing percentage of corrupted pixels versus different parameters. As more occlusion is applied, the best result appears when the parameter β1 is smaller, which means the error term plays a more important role when noise exists.

The results show that MDDL introduces significant improvement on some datasets, and for some other datasets, its significance increases along with the noise level in the input data.

Extended YaleB Dataset. The Extended Yale Face Database B contains 2414 frontal-face images of 38 human subjects captured under various laboratory-controlled illumination conditions. The images are cropped to 32×32. Two experiments are deployed on this dataset. First, random subsets with p (= 5, 10, ..., 40) labeled images per individual are chosen to form the training set, and the rest of the dataset is used for testing; for each given p, there are 10 random splits. Second, a certain percentage of randomly selected pixels in the images is replaced with the pixel value 255 (shown in Fig. 7.3C); we then randomly take 30 images as training samples, with the rest reserved for testing, and the experiment is repeated 10 times. The experimental results are given in Tables 7.2 and 7.3, respectively.

Table 7.2

Average recognition rate (%) of different algorithms on the Extended YaleB dataset with different numbers of training samples per class. Red denotes the best results, while blue denotes the second best

Training 5 10 20 30 40
LDA [41] 74.1±1.5 86.7±0.9 90.6±1.1 86.8±0.9 95.3±0.8
LRC [42] 60.2±2.0 83.00±0.8 91.8±1.0 94.6±0.6 96.1±0.6
FDDL [11] 77.8±1.3 91.2±0.9 96.2±0.7 97.9±0.3 98.8±0.5
DLRD [25] 76.2±1.2 89.9±0.9 96.0±0.8 97.9±0.5 98.8±0.4
D2L2R2[43] 76.0±1.2 89.6±0.9 96.0±0.9 97.9±0.4 98.1±0.4
DPL [44] 75.2±1.9 89.3±0.6 95.7±0.9 97.8±0.4 98.7±0.4
LC-LRD [28] 78.6±1.2 92.1±0.9 Image Image Image
AE+DL [1] Image Image 96.6±0.9 98.6±0.5 99.2±0.4
MDDL [1] Image Image Image Image Image


Table 7.3

Average recognition rate (%) of different algorithms on the Extended YaleB dataset with various corruption percentages (%). Red denotes the best results, while blue denotes the second best

Corruption 0 5 10 15 20
LDA [41] 86.8±0.9 29.0±0.8 18.5±1.2 13.6±0.5 11.3±0.5
LRC [42] 94.6±0.6 80.5±1.1 67.6±1.3 56.8±1.2 47.2±1.6
FDDL [11] 97.9±0.4 63.6±0.9 44.7±1.2 32.7±1.0 25.3±0.4
DLRD [25] 97.9±0.5 91.8±1.1 85.8±1.5 80.9±1.4 73.6±1.6
D2L2R2[43] 97.8±0.4 91.9±1.1 85.7±1.5 80.5±1.6 73.6±1.5
DPL [44] 97.8±0.4 78.3±1.2 64.6±1.1 53.8±0.9 44.9±1.4
LC-LRD [28] Image Image 87.0±0.9 Image 74.1±1.0
AE+DL [1] 98.6±0.5 93.2±0.7 Image 81.6±0.9 Image
MDDL [1] Image Image Image Image Image


We can observe from Table 7.2 that, across the different training sizes, the three locality-constrained methods (LC-LRD, AE+DL, MDDL) achieve the best accuracy, and the proposed MDDL performs best. MDDL's robustness to noise is demonstrated in Table 7.3: as the percentage of corruption increases, MDDL still produces the best recognition results. The performance of LDA, as well as LRC, FDDL, and DPL, drops rapidly with larger corruption, while LC-LRD, MDDL, D2L2R2, and DLRD still obtain much better recognition accuracy. This demonstrates the effectiveness of the low-rank regularization and the error term when noise exists. LC-LRD and the simple AE+DL baseline are evenly matched across the different situations, while MDDL consistently performs best.

CMU PIE Dataset. The CMU PIE dataset contains a total of 41,368 face images of 68 people, each with 13 different poses, 4 different expressions, and 43 different illumination conditions. Two experiments are deployed on two subsets of CMU PIE. First, five near-frontal poses (C05, C07, C09, C27, C29) are selected as the first subset, and all the images under the different illuminations and expressions (11,554 samples in total) are used. Thus, there are about 170 images per person, each normalized to 32×32 pixels; 60 images per person are selected for training. Second, a relatively large-scale subset is built by choosing more poses, containing 24,245 samples in total. Overall, there are around 360 images per person, each normalized to 32×32 pixels. The training set is constructed by randomly selecting 200 images per person, while the rest is used for evaluation. Table 7.4 shows that MDDL achieves good results and outperforms the compared methods.

Table 7.4

Classification error rates (%) on CMU PIE dataset. Five poses (C05, C07, C09, C27, C29) are selected as near frontal poses
Methods CMU (near frontal poses) CMU (all poses)
LRC [42] 4.12 9.65
FDDL [11] 3.30 11.20
DLRD [25] 3.33 10.64
D2L2R2[43] 3.29 10.14
DPL [44] 3.47 9.30
LC-LRD [28] 3.01 8.98
MDDL [1] 2.74 7.64

7.1.4.3 Object Recognition

COIL-100 Dataset. In this section, MDDL is evaluated on object categorization using the COIL-100 dataset. The training set is constructed by randomly selecting 10 images per object, and the testing set contains the rest of the images. This random selection is repeated 10 times, and the average results of all compared methods are reported. To evaluate the scalability of the different methods, the experiment uses images of 20, 40, 60, 80, and 100 objects from the dataset separately. Table 7.5 shows the average recognition rates with standard deviations of all compared methods. The results show that MDDL works well not only for face recognition but also for object categorization.

Table 7.5

Average recognition rate (%) with standard deviations of different algorithms on the COIL-100 dataset with different numbers of classes. Red denotes the best results, while blue denotes the second best

Class No. 20 40 60 80 100
LDA [41] 81.9±1.2 76.7±0.3 66.2±1.0 59.2±0.7 52.5±0.5
LRC [42] 90.7±0.7 89.0±0.5 86.6±0.4 85.1±0.3 83.2±0.6
FDDL [11] 85.7±0.8 82.1±0.4 77.2±0.7 74.8±0.6 73.6±0.6
DLRD [25] 88.6±1.0 86.4±0.5 83.5±0.1 81.5±0.5 79.9±0.6
D2L2R2[43] 91.0±0.4 88.3±0.4 86.4±0.5 84.7±0.5 83.1±0.4
DPL [44] 87.6±1.3 85.1±0.2 81.2±0.2 78.8±0.9 76.3±0.9
LC-LRD [28] Image Image 87.1±0.7 Image Image
AE+DL [1] 91.3±0.5 89.1±0.7 Image 85.1±0.5 84.1±0.4
MDDL [1] Image Image Image Image Image


7.1.4.4 Digits Recognition

MNIST Dataset. This section tests MDDL on a subset of the MNIST handwritten digit dataset downloaded from the CAD website, which includes the first 2000 training images and the first 2000 test images, with each digit image of size 16×16. This experimental setting follows [43], and the experiments give consistent results. The recognition rates and training/testing times of the different algorithms on MNIST are summarized in Table 7.6. MDDL achieves the highest accuracy among its competitors. Compared with LC-LRD, MDDL costs only slightly more computation time, thanks to the inexpensive update of the marginalized auto-encoder.

Table 7.6

Average recognition rate (%) & running time (s) on MNIST dataset

Methods Accuracy Training time Testing time
LDA [41] 77.45 0.164 0.545
LRC [42] 82.70 227.192
FDDL [11] 85.35 240.116 97.841
DLRD [25] 86.05 156.575 48.373
D2L2R2[43] 84.65 203.375 48.154
DPL [44] 84.65 1.773 0.847
LC-LRD [28] 88.25 80.581 48.970
AE+DL [1] 87.95 176.525 49.230
MDDL [1] 89.75 81.042 49.781


Another experiment is conducted on this dataset to evaluate the effect of the denoising auto-encoder. All training and testing images in MNIST are corrupted with additive white Gaussian noise with signal-to-noise ratio (SNR) from 50 to 1 dB (shown in Fig. 7.3A). Fig. 7.5 illustrates the recognition rate curves on the 8 noised versions of the dataset. The X-axis shows the corruption level used in the input reconstruction process of the DAE, where a value close to 1 means more noise is added and 0 means no DAE is involved. From the figure we observe that, as more noise is added to the dataset (from 50 down to 1 dB), the highest recognition rate appears at a larger noise parameter (from roughly 0.004 for 50 dB to roughly 0.1 for 1 dB). In other words, the denoising auto-encoder plays a more important role when the dataset contains more noise.

Figure 7.5 The performance on the MNIST dataset under different SNR noise levels. As the SNR gets lower, the best result appears at a larger noise level in the reconstruction process, which means the DAE plays a more important role as the noise on MNIST becomes heavier.

To verify whether the improvement from MDDL is statistically significant, a significance test (t-test) is further conducted for the results shown in Fig. 7.6. A significance level of 0.05 is used; that is, when the p-value is less than 0.05, the performance difference between two methods is statistically significant. The p-values of MDDL versus the other competitors are plotted in Fig. 7.6. Since we apply $-\log(p)$ processing, the comparison shows that MDDL outperforms the others significantly if the values are greater than $-\log(0.05)$. The results show that MDDL introduces a significant improvement on the COIL-100 dataset, and for the Extended YaleB dataset the significance increases along with the noise level in the input data.

Figure 7.6 p-values of the t-test between MDDL and the other methods on Extended YaleB (upper figure, with 0% to 20% corruption) and COIL-100 (lower figure, with 20–100 classes). The values are pre-processed with $-\log(p)$, so a larger value in the figure means greater significance of MDDL compared with the others.

7.1.5 Conclusion

In this section, we introduced an efficient marginalized denoising dictionary learning (MDDL) framework with a locality constraint. The proposed algorithm was designed to take advantage of two feature learning schemes, dictionary learning and the auto-encoder. Specifically, MDDL adopted a lite version of the auto-encoder to seek a denoising transformation matrix; dictionary learning with a locality constraint was then built on the transformed data. These two strategies were iteratively optimized so that a marginalized denoising transformation and a locality-constrained dictionary were jointly learned. Experiments on several image datasets (face, object, and digit) demonstrated the superiority of the proposed algorithm compared with existing dictionary learning algorithms.

7.1.6 Future Works

In this chapter, we described the MDDL model, which seeks a transformation matrix that filters out the noise inside the data with a marginalized denoising auto-encoder. However, one problem arises with the auto-encoder architecture, which forces the network to learn an approximation to the identity by encouraging the output to be as similar to the input as possible. As a result, the learned high-level features may blindly compress not only the discriminative information but also many redundancies or even noise in the data. In the subsequent training procedure, it is unreasonable to endow such task-irrelevant units with discriminability.

In the future, we plan to explore the auto-encoder's high-level features further to separate the discriminative units from the task-irrelevant ones. By concentrating more task-relevant information in the hidden units, we hope the dictionary learning module can exploit more discriminative information and learn a more compact and pure dictionary. Furthermore, we plan to explore multi-view data classification, where the auto-encoder could serve as a domain adaptation module that learns a latent feature space for cross-domain datasets. We believe that the low-rank and locality terms could also play an important role in cross-domain applications.

7.2 Learning a Deep $\ell_\infty$ Encoder for Hashing3

We investigate the $\ell_\infty$-constrained representation, which demonstrates robustness to quantization errors, utilizing the tool of deep learning. Based on the Alternating Direction Method of Multipliers (ADMM), we formulate the original convex minimization problem as a feed-forward neural network, named the Deep $\ell_\infty$ Encoder, by introducing a novel Bounded Linear Unit (BLU) neuron and modeling the Lagrange multipliers as network biases. Such a structural prior acts as an effective network regularization and facilitates model initialization. We then investigate the effective use of the proposed model in the application of hashing by coupling the proposed encoders under a supervised pairwise loss, developing a Deep $\ell_\infty$ Siamese Network which can be optimized end to end. Extensive experiments demonstrate the impressive performance of the proposed model. We also provide an in-depth analysis of its behavior against the competitors.

7.2.1 Introduction

7.2.1.1 Problem Definition and Background

While $\ell_0$ and $\ell_1$ regularizations have been well known and successfully applied in sparse signal approximation, utilizing the $\ell_\infty$ norm to regularize signal representations has been less explored. In this section, we are particularly interested in the following $\ell_\infty$-constrained least squares problem:

$$\min_{x} \|D x - y\|_2^2 \quad \text{s.t. } \|x\|_\infty \le \lambda, \tag{7.29}$$

where $y \in \mathbb{R}^{n \times 1}$ denotes the input signal, $D \in \mathbb{R}^{n \times N}$ the (overcomplete) basis (often called frame or dictionary) with $N < n$, and $x \in \mathbb{R}^{N \times 1}$ the learned representation. Further, the maximum absolute magnitude of x is bounded by a positive constant λ, so that each entry of x has the smallest dynamic range [45]. As a result, model (7.29) tends to spread the information of y approximately evenly among the coefficients of x. Thus, x is called “democratic” [46] or “anti-sparse” [47], as all of its entries are of approximately the same importance.

In practice, x usually has most entries reaching the same absolute maximum magnitude [46], therefore resembling an antipodal signal in an N-dimensional Hamming space. Furthermore, the solution x to (7.29) withstands errors in a very powerful way: the representation error gets bounded by the average, rather than the sum, of the errors in the coefficients. These errors may be of arbitrary nature, including distortion (e.g., quantization) and losses (e.g., transmission failure). This property was quantitatively established in Section II.C of [45]:

Theorem 7.1

Assume $\|x\|_2 < 1$ without loss of generality, and that each coefficient of x is quantized separately by performing uniform scalar quantization of the dynamic range $[-\lambda, \lambda]$ with L levels. The overall quantization error of x from (7.29) is bounded by $\frac{\lambda \sqrt{N}}{L}$. In comparison, a least squares solution $x_{LS}$, obtained by minimizing $\|D x_{LS} - y\|_2^2$ without any constraint, would only give the bound $\frac{\sqrt{n}}{L}$.

In the case of $N \ll n$, the above yields great robustness of the solution to (7.29) with respect to noise, in particular quantization errors. Also note that its error bound does not grow with the input dimensionality n, a highly desirable stability property for high-dimensional data. Therefore, (7.29) appears to be favorable for applications such as vector quantization, hashing, and approximate nearest neighbor search.

In this section, we investigate (7.29) in the context of deep learning. Based on the Alternating Direction Method of Multipliers (ADMM) algorithm, we formulate (7.29) as a feed-forward neural network [48], called the Deep $\ell_\infty$ Encoder, by introducing a novel Bounded Linear Unit (BLU) neuron and modeling the Lagrange multipliers as network biases. The major technical merit to be presented is how a specific optimization model (7.29) can be translated into the design of a task-specific deep model which displays the desired quantization-robust property. We then study its application in hashing by developing a Deep $\ell_\infty$ Siamese Network that couples the proposed encoders under a supervised pairwise loss and can be optimized end to end. Impressive performance is observed in our experiments.

7.2.1.2 Related Work

Similar to the case of $\ell_0$/$\ell_1$ sparse approximation problems, solving (7.29) and its variants (e.g., [46]) relies on iterative solutions. The authors of [49] proposed an active set strategy similar to that of [50]. In [51], the authors investigated a primal–dual path-following interior-point method. Albeit effective, these iterative approximation algorithms suffer from their inherently sequential structure, as well as data-dependent complexity and latency, which often constitute a major bottleneck in computational efficiency. In addition, the joint optimization of the (unsupervised) feature learning and the supervised steps has to rely on solving complex bi-level optimization problems [52]. Further, to effectively represent datasets of growing sizes, larger dictionaries D are usually needed; since the inference complexity of those iterative algorithms increases more than linearly with the dictionary size [53], their scalability is limited. Last but not least, while the hyperparameter λ sometimes has a physical interpretation, e.g., for signal quantization and compression, it remains unclear how to set or adjust it in many application cases.

Deep learning has recently attracted great attention [54]. The advantages of deep learning lie in its composition using multiple nonlinear transformations to yield more abstract and descriptive embedding representations. The feed-forward networks could be naturally tuned jointly with task-driven loss functions [55]. With the aid of gradient descent, it also scales linearly in time and space with the number of training samples.

There has been a blooming interest in bridging “shallow” optimization and deep learning models. In [48], a feed-forward neural network, named LISTA, was proposed to efficiently approximate sparse codes, whose hyperparameters were learned from general regression. In [56], the authors leveraged a similar idea on fast trainable regressors and constructed feed-forward network approximations of learned sparse models. This was later extended in [57] to develop a principled process for learned deterministic fixed-complexity pursuits, for sparse and low-rank models. Lately, [55] proposed Deep $\ell_0$ Encoders, modeling $\ell_0$ sparse approximation as feed-forward neural networks. The authors extended the strategy to the graph-regularized $\ell_1$ approximation in [58] and to the dual sparsity model in [59]. Despite the above progress, to the best of our knowledge, few efforts have been made beyond sparse approximation (e.g., $\ell_0$/$\ell_1$) models.

7.2.2 ADMM Algorithm

ADMM has been popular for its remarkable effectiveness in minimizing objectives with linearly separable structures [53]. We first introduce an auxiliary variable $z \in \mathbb{R}^{N \times 1}$ and rewrite (7.29) as

$$\min_{x, z} \frac{1}{2} \|D x - y\|_2^2 \quad \text{s.t. } \|z\|_\infty \le \lambda,\ z - x = 0. \tag{7.30}$$

The augmented Lagrangian function of (7.30) is

$$\frac{1}{2} \|D x - y\|_2^2 + p^T (z - x) + \frac{\beta}{2} \|z - x\|_2^2 + \Phi_\lambda(z). \tag{7.31}$$

Here $p \in \mathbb{R}^{N \times 1}$ is the Lagrange multiplier attached to the equality constraint, β is a positive constant (with a default value of 0.6), and $\Phi_\lambda(z)$ is the indicator function which goes to infinity when $\|z\|_\infty > \lambda$ and is 0 otherwise. ADMM minimizes (7.31) with respect to x and z in an alternating-direction manner and updates p accordingly. It guarantees global convergence to the optimal solution of (7.29). Starting from any initialization of x, z, and p, ADMM iteratively solves ($t = 0, 1, 2, \ldots$ denotes the iteration number):

$$(x\text{-update}) \quad \min_{x_{t+1}} \frac{1}{2} \|D x - y\|_2^2 - p_t^T x + \frac{\beta}{2} \|z_t - x\|_2^2, \tag{7.32}$$

$$(z\text{-update}) \quad \min_{z_{t+1}} \frac{\beta}{2} \Big\|z - \Big(x_{t+1} - \frac{p_t}{\beta}\Big)\Big\|_2^2 + \Phi_\lambda(z), \tag{7.33}$$

$$(p\text{-update}) \quad p_{t+1} = p_t + \beta (z_{t+1} - x_{t+1}). \tag{7.34}$$

Furthermore, both (7.32) and (7.33) enjoy closed-form solutions:

$$x_{t+1} = (D^T D + \beta I)^{-1} (D^T y + \beta z_t + p_t), \tag{7.35}$$

$$z_{t+1} = \min\big(\max\big(x_{t+1} - \tfrac{p_t}{\beta},\, -\lambda\big),\, \lambda\big). \tag{7.36}$$

The above algorithm can be categorized as a primal–dual scheme. However, discussing the ADMM algorithm in more detail is beyond the focus of this section. Instead, the purpose of deriving (7.30)–(7.36) is to prepare for the design of the task-specific deep architecture presented below.

7.2.3 Deep $\ell_\infty$ Encoder

We first substitute (7.35) into (7.36), in order to derive an update form explicitly dependent on only z and p:

$$z_{t+1} = B_\lambda\Big((D^T D + \beta I)^{-1} (D^T y + \beta z_t + p_t) - \frac{p_t}{\beta}\Big), \tag{7.37}$$

where $B_\lambda$ is defined as a box-constrained element-wise operator (u denotes a vector and $u_i$ its ith element):

$$[B_\lambda(u)]_i = \min(\max(u_i, -\lambda), \lambda). \tag{7.38}$$

Equation (7.37) could be alternatively rewritten as

z_{t+1} = B_\lambda(W y + S z_t + b_t), \quad \text{where} \quad W = (D^T D + \beta I)^{-1} D^T, \ \ S = \beta (D^T D + \beta I)^{-1}, \ \ b_t = \left[(D^T D + \beta I)^{-1} - \tfrac{1}{\beta} I\right] p_t, \qquad (7.39)

and expressed as the block diagram in Fig. 7.7, which outlines a recurrent structure for solving (7.29). Note that in (7.39), while W and S are pre-computed hyperparameters shared across iterations, b_t remains a variable dependent on p_t, and has to be updated throughout the iterations, too (b_t's update block is omitted in Fig. 7.7).

Figure 7.7 The block diagram of solving (7.29).

By time-unfolding and truncating Fig. 7.7 to a fixed number of K iterations (K = 2 by default),4 we obtain a feed-forward network structure, shown in Fig. 7.8 and named the Deep ℓ∞ Encoder. Since the threshold λ is less straightforward to update, we repeat the same trick as in [55] and rewrite (7.38) as [B_λ(u)]_i = λ_i B_1(u_i/λ_i). The original operator is thus decomposed into two linear diagonal scaling layers plus a unit-threshold neuron; we call the latter a Bounded Linear Unit (BLU). The hyperparameters W, S_k and b_k (k = 1, 2), as well as λ, are all to be learned from data by backpropagation. Although the equations in (7.39) do not directly apply to solving the deep ℓ∞ encoder, they can serve as high-quality initializations.

Figure 7.8 Deep ℓ∞ encoder with two time-unfolded stages.
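For illustration, the following PyTorch sketch assembles a two-stage unfolded encoder in the spirit of Fig. 7.8, with BLU realized through the decomposition [B_λ(u)]_i = λ_i B_1(u_i/λ_i). The class names, the handling of the first stage (assuming z_0 = 0), and the positivity parameterization of λ are our assumptions, not the exact wiring of the original implementation.

```python
import torch
import torch.nn as nn

class BLU(nn.Module):
    """Bounded Linear Unit, written as [B_lambda(u)]_i = lambda_i * B_1(u_i / lambda_i):
    a learnable per-entry scale wrapped around a unit-threshold clamp."""
    def __init__(self, dim):
        super().__init__()
        self.log_lam = nn.Parameter(torch.zeros(dim))  # keep lambda_i > 0 via exp

    def forward(self, u):
        lam = torch.exp(self.log_lam)
        return lam * torch.clamp(u / lam, min=-1.0, max=1.0)

class DeepLinfEncoder(nn.Module):
    """Two-stage time-unfolded sketch of z_{t+1} = B_lambda(W y + S z_t + b_t), cf. (7.39)."""
    def __init__(self, n, N, K=2):
        super().__init__()
        self.W = nn.Linear(n, N, bias=False)                                     # W in R^{N x n}
        self.S = nn.ModuleList([nn.Linear(N, N, bias=True) for _ in range(K)])   # untied S_k and b_k
        self.blu = nn.ModuleList([BLU(N) for _ in range(K + 1)])

    def forward(self, y):
        Wy = self.W(y)
        z = self.blu[0](Wy)                   # first stage, assuming z_0 = 0
        for k, Sk in enumerate(self.S):
            z = self.blu[k + 1](Wy + Sk(z))   # unfolded stages k = 1, ..., K
        return z

# Hypothetical usage: 384-dim GIST descriptors mapped to 64-dim representations.
enc = DeepLinfEncoder(n=384, N=64)
codes = enc(torch.randn(8, 384))
print(codes.shape)  # torch.Size([8, 64])
```

Untying S_k and b_k per stage anticipates the implementation choice discussed in Sect. 7.2.5.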

It is crucial to notice the modeling of the Lagrange multipliers ptImage as the biases, and to incorporate their updates into network learning. This provides important clues on how to relate deep networks to a larger class of optimization models, whose solutions rely on dual domain methods.

Comparing BLU with existing neurons. As shown in Fig. 7.9E, BLU tends to suppress large entries while not penalizing small ones, resulting in dense, nearly antipodal representations. A first look at the plot of BLU is reminiscent of the tanh neuron (Fig. 7.9A). In fact, with its output range of [−1, 1] and a slope of 1 at the origin, tanh can be viewed as a smoothed, differentiable approximation of BLU.

Figure 7.9 A comparison among existing neurons and BLU.

We further compare BLU with other popular and recently proposed neurons: the Rectified Linear Unit (ReLU) [54], the Soft-tHresholding Linear Unit (SHeLU) [58], and the Hard thrEsholding Linear Unit (HELU) [55], as depicted in Fig. 7.9B–D, respectively. Contrary to BLU and tanh, they all introduce sparsity in the outputs, and thus prove successful and outperform tanh in classification and recognition tasks. Interestingly, HELU behaves as the rival of BLU: it does not penalize large entries but suppresses small ones down to zero.

7.2.4 Deep ℓ∞ Siamese Network for Hashing

Rather than solving (7.29) first and then training the encoder as general regression, as [48] did, we instead concatenate encoder(s) with a task-driven loss, and optimize the pipeline from end to end. In this section, we focus on discussing its application in hashing, although the proposed model is not limited to one specific application.

Background. With the ever-growing volume of large-scale image data on the Web, much attention has been devoted to nearest-neighbor search via hashing methods [60]. For big-data applications, compact bitwise representations improve efficiency in both storage and search speed. The state-of-the-art approach, learning-based hashing, learns similarity-preserving hash functions to encode input data into binary codes. While earlier methods, such as locality-sensitive hashing (LSH) [60], iterative quantization (ITQ) [61] and spectral hashing (SH) [62], do not refer to any supervised information, it has lately been discovered that involving data similarities/dissimilarities in training benefits the performance [63,64].

Prior Work. Traditional hashing pipelines first represent each input image as a (hand-crafted) visual descriptor, followed by separate projection and quantization steps to encode it into a binary code. The authors of [65] first applied the Siamese network architecture [66] to hashing: two input patterns are fed into two parameter-sharing “encoder” columns, and a pairwise similarity/dissimilarity loss between their outputs is minimized using pairwise labels. The authors further enforced a sparsity prior on the hash codes in [67], by substituting a pair of LISTA-type encoders [48] for the pair of generic feed-forward encoders in [65], while [68,69] utilized tailored convolutional networks with the aid of pairwise labels. The authors of [70] further introduced a triplet loss, with a divide-and-encode strategy applied to reduce hash code redundancy. Note that for the final quantization step, [67] relied on an extra hidden layer of tanh neurons to approximate binary codes, while [70] exploited a piecewise linear, discontinuous threshold function.

Our Approach. In view of its robustness to quantization noise, as well as BLU's property as a natural approximation of binarization, we construct a Siamese network as in [65], and adopt a pair of parameter-sharing deep ℓ∞ encoders as the two columns. The resulting architecture, named the Deep ℓ∞ Siamese Network, is illustrated in Fig. 7.10. Assume y and y^+ make a similar pair while y and y^− make a dissimilar pair, and denote by x(y) the output representation for input y. The two coupled encoders are then optimized under the following pairwise loss (the constant m represents the margin between dissimilar pairs):

L_p := \frac{1}{2}\|x(y) - x(y^+)\|^2 + \frac{1}{2}\left(\max\big(0,\ m - \|x(y) - x(y^-)\|\big)\right)^2. \qquad (7.40)

The representation is learned to make similar pairs as close as possible and dissimilar pairs at least a distance m apart. In this section, we follow [65] and use a default m = 5 for all experiments.
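As a concrete reference, here is a minimal PyTorch sketch of the pairwise loss (7.40), assuming batches of already-encoded similar and dissimilar pairs. The function and tensor names are ours, the sum of the two terms follows the contrastive loss of [65,66], and m defaults to 5 as in the experiments.

```python
import torch

def pairwise_loss(x, x_pos, x_neg, m=5.0):
    """Pairwise loss of (7.40): pull similar pairs together and push
    dissimilar pairs at least a distance m apart.
    x, x_pos, x_neg: (batch, N) outputs of the parameter-sharing encoders."""
    d_pos = torch.norm(x - x_pos, dim=1)   # distances of similar pairs
    d_neg = torch.norm(x - x_neg, dim=1)   # distances of dissimilar pairs
    loss = 0.5 * d_pos.pow(2) + 0.5 * torch.clamp(m - d_neg, min=0.0).pow(2)
    return loss.mean()

# Hypothetical usage with the encoder sketched earlier.
x     = torch.randn(8, 64)
x_pos = x + 0.1 * torch.randn(8, 64)
x_neg = torch.randn(8, 64)
print(pairwise_loss(x, x_pos, x_neg).item())
```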

Figure 7.10 Deep ℓ∞ Siamese Network, built by coupling two parameter-sharing encoders, followed by a pairwise loss (7.40).

Once a deep ℓ∞ Siamese network is learned, we apply its encoder part (i.e., a deep ℓ∞ encoder) to a new input. The computation is extremely efficient, involving only a few matrix multiplications and element-wise thresholding operations, with a total complexity of O(nN + 2N²). One can obtain an N-bit binary code by simply quantizing the output.
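As a minimal sketch of this last step, the nearly antipodal output can be binarized by thresholding each entry at zero; the function name below is ours.

```python
import torch

def to_binary_code(z):
    """Quantize nearly antipodal encoder outputs z of shape (batch, N)
    into N-bit 0/1 codes by thresholding each entry at zero."""
    return (z > 0).to(torch.uint8)

bits = to_binary_code(torch.randn(8, 64))
print(bits.shape, bits.dtype)  # torch.Size([8, 64]) torch.uint8
```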

7.2.5 Experiments in Image Hashing

Implementation. The proposed networks are implemented with the CUDA ConvNet package [54]. We use a constant learning rate of 0.01 with no momentum, and a batch size of 128. Different from prior findings such as in [55,58], we discover that untying the values of S_1, b_1 and S_2, b_2 boosts performance more than sharing them. This is not only because more free parameters enable a larger learning capacity, but also due to the important fact that p_t (and thus b_k) is in essence not shared across iterations, as in (7.39) and Fig. 7.8.

While many neural networks are trained well with random initializations, it has been observed that poor initializations can still hamper the effectiveness of first-order methods [71]. On the other hand, it is much easier to initialize our proposed models in the right regime. We first estimate the dictionary D using the standard K-SVD algorithm [72], and then inexactly solve (7.29) for up to K (K = 2) iterations via the ADMM algorithm in Sect. 7.2.2, recording the values of the Lagrange multiplier p_t at each iteration. Benefiting from the analytical correspondences in (7.39), it is then straightforward to obtain high-quality initializations for W, S_k and b_k (k = 1, 2). As a result, we achieve a steadily decreasing curve of training errors, without performing common tricks such as annealing the learning rate, which are found to be indispensable if random initialization is applied.
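A minimal NumPy sketch of this initialization, assuming D has already been learned by K-SVD and the multipliers p_t have been recorded while running the ADMM sketch above; the function and variable names are ours, and the correspondences follow (7.39), with untied per-stage copies of S reflecting the untying discussed above.

```python
import numpy as np

def init_from_admm(D, p_list, beta=0.6):
    """Map a K-SVD dictionary D and the recorded ADMM multipliers p_t (one per
    unfolded stage) to initial values of W, S_k, b_k via the relations in (7.39)."""
    N = D.shape[1]
    A_inv = np.linalg.inv(D.T @ D + beta * np.eye(N))
    W = A_inv @ D.T                                        # W   = (D^T D + beta I)^{-1} D^T
    S = beta * A_inv                                       # S   = beta (D^T D + beta I)^{-1}
    b = [(A_inv - np.eye(N) / beta) @ p for p in p_list]   # b_t = [(D^T D + beta I)^{-1} - (1/beta) I] p_t
    S_list = [S.copy() for _ in p_list]                    # untied copies, one S_k per stage
    return W, S_list, b
```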

Datasets. The CIFAR10 dataset [73] contains 60K labeled images of 10 different classes. The images are represented using 384-dimensional GIST descriptors [74]. Following the classical setting in [67], we use a training set of 200 images for each class and a disjoint query set of 100 images per class. The remaining 59K images are treated as the database.

NUS-WIDE [75] is a dataset containing 270K annotated images from Flickr. Every image is associated with one or more of 81 different concepts, and is described using a 500-dimensional bag-of-features vector. In training and evaluation, we follow the protocol of [76]: two images are considered neighbors if they share at least one common concept (only the 21 most frequent concepts are considered). We use 100K pairs of images for training, and a query set of 100 images per concept for testing.

Comparison Methods. We compare the proposed deep ℓ∞ Siamese network to six state-of-the-art hashing methods:

  •  Four representative “shallow” hashing methods: kernelized supervised hashing (KSH) [64], anchor graph hashing (AGH) [76] (we compare with its two alternative forms, AGH1 and AGH2; see the original paper), parameter-sensitive hashing (PSH) [77], and LDA Hash (LH) [78].5
  •  Two latest “deep” hashing methods: neural-network hashing (NNH) [65] and sparse neural-network hashing (SNNH) [67].

Comparing the two “deep” competitors to the deep ℓ∞ Siamese network, the only difference among the three lies in the type of encoder adopted in their twin columns, as listed in Table 7.7. We reimplement the encoder parts of NNH and SNNH with three hidden layers (i.e., two unfolded stages for LISTA), so that all three deep hashing models have the same depth.6 Recalling that the input y ∈ R^n and the hash code x ∈ R^N, we immediately see from (7.39) that W ∈ R^{N×n}, S_k ∈ R^{N×N}, and b_k ∈ R^N. We carefully ensure that both NNH and SNNH have all their weight layers of the same dimensionality as ours7 for a fair comparison.

Table 7.7

Comparison of NNH, SNNH, and the proposed deep Siamese network

Method   | Encoder type | Neuron type | Structural prior on hashing codes
NNH      | generic      | tanh        | /
SNNH     | LISTA        | SHeLU       | sparse
Proposed | deep ℓ∞      | BLU         | nearly antipodal & quantization-robust

We adopt the following classical criteria for evaluation: (i) precision and recall (PR) for different Hamming radii, and the F1 score as their harmonic mean; (ii) mean average precision (mAP) [79]. For NUS-WIDE, as computing mAP is slow over this large dataset, we follow the convention of [67] and compute the mean precision (MP) of the top-5K returned neighbors (MP@5K), as well as report the mAP of the top-10 results (mAP@10).
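For reference, the following simplified NumPy sketch computes precision, recall, and F1 within a given Hamming radius over 0/1 codes (mAP is omitted for brevity). It aggregates over all query–database pairs rather than averaging per query, and all names are ours.

```python
import numpy as np

def prf_within_radius(query_codes, db_codes, relevance, radius=2):
    """Aggregate precision, recall and F1 of retrieval within a Hamming radius.
    query_codes: (Q, N) 0/1 array; db_codes: (M, N) 0/1 array;
    relevance:   (Q, M) boolean ground-truth neighbor matrix."""
    # Hamming distances between every query and every database code.
    dist = (query_codes[:, None, :] != db_codes[None, :, :]).sum(axis=2)
    retrieved = dist <= radius
    tp = np.logical_and(retrieved, relevance).sum()
    precision = tp / max(retrieved.sum(), 1)
    recall = tp / max(relevance.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```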

We have not compared with convolutional-network-based hashing methods [68–70], since it is difficult to ensure that their models have the same parameter capacity as our fully-connected model in controlled experiments. We also do not include triplet-loss-based methods, e.g., [70], in the comparison, because they would require three parallel encoder columns.

Results and Analysis. The performance of the different methods on the two datasets is compared in Tables 7.8 and 7.9. Our proposed method ranks top in almost all cases in terms of mAP/MP and precision. Even under a Hamming radius of 0, our precision is as high as 33.30% (N = 64) on CIFAR10, and 89.49% (N = 256) on NUS-WIDE. The proposed method also remains the second best in most cases in terms of recall, inferior only to SNNH. In particular, when the hashing code dimensionality is low, e.g., N = 48 for CIFAR10, the proposed method outperforms all other competitors by a significant margin. This demonstrates the competitiveness of the proposed method in generating both compact and accurate hashing codes, achieving more precise retrieval results at lower computation and storage costs.

Table 7.8

Performance (%) of different hashing methods on the CIFAR10 dataset, with different code lengths N

Method   | N  | mAP   | Prec. (radius ⩽2) | Recall (radius ⩽2) | F1 (radius ⩽2) | Prec. (radius =0) | Recall (radius =0) | F1 (radius =0)
KSH      | 48 | 31.10 | 18.22 | 0.86 | 1.64 | 5.39 | 5.6×10−2 | 0.11
KSH      | 64 | 32.49 | 10.86 | 0.13 | 0.26 | 2.49 | 9.6×10−3 | 1.9×10−2
AGH1     | 48 | 14.55 | 15.95 | 2.8×10−2 | 5.6×10−2 | 4.88 | 2.2×10−3 | 4.4×10−3
AGH1     | 64 | 14.22 | 6.50 | 4.1×10−3 | 8.1×10−3 | 3.06 | 1.2×10−3 | 2.4×10−3
AGH2     | 48 | 15.34 | 17.43 | 7.1×10−2 | 3.6×10−2 | 5.44 | 3.5×10−3 | 6.9×10−3
AGH2     | 64 | 14.99 | 7.63 | 7.2×10−3 | 1.4×10−2 | 3.61 | 1.4×10−3 | 2.7×10−3
PSH      | 48 | 15.78 | 9.92 | 6.6×10−3 | 1.3×10−2 | 0.30 | 5.1×10−5 | 1.0×10−4
PSH      | 64 | 17.18 | 1.52 | 3.0×10−4 | 6.1×10−4 | 1.0×10−3 | 1.69×10−5 | 3.3×10−5
LH       | 48 | 13.13 | 3.0×10−3 | 1.0×10−4 | 5.1×10−5 | 1.0×10−3 | 1.7×10−5 | 3.4×10−5
LH       | 64 | 13.07 | 1.0×10−3 | 1.7×10−5 | 3.3×10−5 | 0.00 | 0.00 | 0.00
NNH      | 48 | 31.21 | 34.87 | 1.81 | 3.44 | 10.02 | 9.4×10−2 | 0.19
NNH      | 64 | 35.24 | 23.23 | 0.29 | 0.57 | 5.89 | 1.4×10−2 | 2.8×10−2
SNNH     | 48 | 26.67 | 32.03 | 12.10 | 17.56 | 19.95 | 0.96 | 1.83
SNNH     | 64 | 27.25 | 30.01 | 36.68 | 33.01 | 30.25 | 9.8 | 14.90
Proposed | 48 | 31.48 | 36.89 | 12.47 | 18.41 | 24.82 | 0.94 | 1.82
Proposed | 64 | 36.76 | 38.67 | 30.28 | 33.96 | 33.30 | 8.9 | 14.05

Table 7.9

Performance (%) of different hashing methods on the NUS-WIDE dataset, with different code lengths N

Method   | N   | mAP@10 | MP@5K | Prec. (radius ⩽2) | Recall (radius ⩽2) | F1 (radius ⩽2) | Prec. (radius =0) | Recall (radius =0) | F1 (radius =0)
KSH      | 64  | 72.85 | 42.74 | 83.80 | 6.1×10−3 | 1.2×10−2 | 84.21 | 1.7×10−3 | 3.3×10−3
KSH      | 256 | 73.73 | 45.35 | 84.24 | 1.4×10−3 | 2.9×10−3 | 84.24 | 1.4×10−3 | 2.9×10−3
AGH1     | 64  | 69.48 | 47.28 | 69.43 | 0.11 | 0.22 | 73.35 | 3.9×10−2 | 7.9×10−2
AGH1     | 256 | 73.86 | 46.68 | 75.90 | 1.5×10−2 | 2.9×10−2 | 81.64 | 3.6×10−3 | 7.1×10−3
AGH2     | 64  | 68.90 | 47.27 | 68.73 | 0.14 | 0.28 | 72.82 | 5.2×10−2 | 0.10
AGH2     | 256 | 73.00 | 47.65 | 74.90 | 5.3×10−2 | 0.11 | 80.45 | 1.1×10−2 | 2.2×10−2
PSH      | 64  | 72.17 | 44.79 | 60.06 | 0.12 | 0.24 | 81.73 | 1.1×10−2 | 2.2×10−2
PSH      | 256 | 73.52 | 47.13 | 84.18 | 1.8×10−3 | 3.5×10−3 | 84.24 | 1.5×10−3 | 2.9×10−3
LH       | 64  | 71.33 | 41.69 | 84.26 | 1.4×10−3 | 2.9×10−3 | 84.24 | 1.4×10−3 | 2.9×10−3
LH       | 256 | 70.73 | 39.02 | 84.24 | 1.4×10−3 | 2.9×10−3 | 84.24 | 1.4×10−3 | 2.9×10−3
NNH      | 64  | 76.39 | 59.76 | 75.51 | 1.59 | 3.11 | 81.24 | 0.10 | 0.20
NNH      | 256 | 78.31 | 61.21 | 83.46 | 5.8×10−2 | 0.11 | 83.94 | 4.9×10−3 | 9.8×10−3
SNNH     | 64  | 74.87 | 56.82 | 72.32 | 1.99 | 3.87 | 81.98 | 0.37 | 0.73
SNNH     | 256 | 74.73 | 59.37 | 80.98 | 0.10 | 0.19 | 82.85 | 0.98 | 1.94
Proposed | 64  | 79.89 | 63.04 | 79.95 | 1.72 | 3.38 | 86.23 | 0.30 | 0.60
Proposed | 256 | 80.02 | 65.62 | 84.63 | 7.2×10−2 | 0.15 | 89.49 | 0.57 | 1.13

The next observation is that, compared to the strongest competitor, SNNH, the recall rates of our method seem less compelling. We plot the precision and recall curves of the three best performers (NNH, SNNH, and the deep ℓ∞ Siamese network) with regard to the bit length N of the hashing codes, within a Hamming radius of 2. Fig. 7.12 demonstrates that our method consistently outperforms both SNNH and NNH in precision. On the other hand, SNNH gains advantages in recall over the proposed method, although the margin appears to vanish as N grows.

Although this seems to be a reasonable performance tradeoff, we are curious about the behavior difference between SNNH and the proposed method. Recall that they differ only in the encoder architecture, i.e., one uses LISTA while the other uses the deep ℓ∞ encoder. We thus plot the learned representations and binary hashing codes of one CIFAR10 image, obtained by NNH, SNNH, and the proposed method, in Fig. 7.11. Comparing the three pairs, one can see that the quantization from (A) to (B) (and from (C) to (D)) suffers visible distortion and information loss. In contrast, the output of the deep ℓ∞ encoder has a much smaller quantization error, as it naturally resembles an antipodal signal. Therefore, it suffers minimal information loss during the quantization step.

Figure 7.11 The learned representations and binary hashing codes of one test image from CIFAR10 through: (A)–(B) NNH; (C)–(D) SNNH; (E)–(F) proposed.
Figure 7.12 The comparison of three deep hashing methods on NUS-WIDE: (A) precision curve; (B) recall curve, both w.r.t. the hashing code length N, within the Hamming radius of 2.

In view of these observations, we draw the following conclusions about the different behaviors of SNNH and the deep ℓ∞ encoder:

  •  Both the deep ℓ∞ encoder and SNNH outperform NNH by introducing structure into the binary hashing codes.
  •  The deep ℓ∞ encoder generates nearly antipodal outputs that are robust to quantization errors. Therefore, it excels in preserving information through both hierarchical information extraction and quantization. This explains why our method reaches the highest precision, and performs especially well when N is small.
  •  SNNH exploits sparsity as a prior on hashing codes. It confines and shrinks the solution space, as many small entries in the SNNH outputs are suppressed down to zero. This is also evidenced by Table 2 in [67], i.e., the number of unique hashing codes in the SNNH results is one order of magnitude smaller than that of NNH.
  •  The sparsity prior improves the recall rate, since the obtained hashing codes can be clustered more compactly in high-dimensional space, with lower intra-cluster variations. But it also runs the risk of losing too much information during the hierarchical sparsifying process. In that case, the inter-cluster variations might also be compromised, which causes the decrease in precision.

Further, it seems that the sparsity and ℓ∞ structure priors could be complementary. We will explore this as future work.

7.2.6 Conclusion

This section investigates how to import the quantization-robust property of an ℓ∞-constrained minimization model into a specially designed deep model. This is done by first deriving an ADMM algorithm, which is then reformulated as a feed-forward neural network. We introduce a Siamese architecture concatenated with a pairwise loss for the application of hashing. We analyze in depth the performance and behavior of the proposed model against its competitors, and hope it will evoke more interest from the community.

References

[1] S. Wang, Z. Ding, Y. Fu, Marginalized denoising dictionary learning with locality constraint, IEEE Transactions on Image Processing 2017.

[2] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, CVPR. IEEE; 2009:1794–1801.

[3] F. Rodriguez, G. Sapiro, Sparse representations for image classification: learning discriminative and reconstructive non-parametric dictionaries. [tech. rep., DTIC Document] 2008.

[4] E. Elhamifar, R. Vidal, Robust classification using structured sparse representation, CVPR. IEEE; 2011:1873–1879.

[5] M.G. Jafari, M.D. Plumbley, Fast dictionary learning for sparse representations of speech signals, IEEE Journal of Selected Topics in Signal Processing 2011;5(5):1025–1031.

[6] R. Pique-Regi, J. Monso-Varona, A. Ortega, R.C. Seeger, T.J. Triche, S. Asgharzadeh, Sparse representation and Bayesian detection of genome copy number alterations from microarray data, Bioinformatics 2008;24(3):309–318.

[7] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, TPAMI 2009;31(2):210–227.

[8] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, TSP 2006;54(11):4311–4322.

[9] Q. Zhang, B. Li, Discriminative K-SVD for dictionary learning in face recognition, CVPR. IEEE; 2010:2691–2698.

[10] Z. Jiang, Z. Lin, L.S. Davis, Learning a discriminative dictionary for sparse coding via label consistent K-SVD, CVPR. IEEE; 2011:1697–1704.

[11] M. Yang, D. Zhang, X. Feng, Fisher discrimination dictionary learning for sparse representation, ICCV. IEEE; 2011:543–550.

[12] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, ICML. 2010:663–670.

[13] X. Shen, Y. Wu, A unified approach to salient object detection via low rank matrix recovery, CVPR. IEEE; 2012:853–860.

[14] Z. Ding, Y. Fu, Low-rank common subspace for multi-view learning, ICDM. IEEE; 2014:110–119.

[15] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, TPAMI 2013;35(1):171–184.

[16] C. Zhang, J. Liu, Q. Tian, C. Xu, H. Lu, S. Ma, Image classification by non-negative sparse coding, low-rank and sparse decomposition, CVPR. IEEE; 2011:1673–1680.

[17] D. Zhang, P. Liu, K. Zhang, H. Zhang, Q. Wang, X. Jinga, Class relatedness oriented-discriminative dictionary learning for multiclass image classification, Pattern Recognition 2015.

[18] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning 2009;2(1):1–127.

[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion, JMLR 2010;11:3371–3408.

[20] M. Chen, Z. Xu, K. Weinberger, F. Sha, Marginalized denoising autoencoders for domain adaptation, arXiv preprint arXiv:1206.4683; 2012.

[21] Y. Chen, X. Guo, Learning non-negative locality-constrained linear coding for human action recognition, VCIP. IEEE; 2013:1–6.

[22] A. Shabou, H. LeBorgne, Locality-constrained and spatially regularized coding for scene categorization, CVPR. IEEE; 2012:3618–3625.

[23] Y. Liang, M. Song, J. Bu, C. Chen, Colorization for gray scale facial image by locality-constrained linear coding, Journal of Signal Processing Systems 2014;74(1):59–67.

[24] Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, arXiv preprint arXiv:1009.5055; 2010.

[25] L. Ma, C. Wang, B. Xiao, W. Zhou, Sparse representation for face recognition based on discriminative low-rank dictionary learning, CVPR. IEEE; 2012:2586–2593.

[26] X. Jiang, J. Lai, Sparse and dense hybrid representation via dictionary decomposition for face recognition, TPAMI 2015;37(5):1067–1079.

[27] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, CVPR. IEEE; 2010:3360–3367.

[28] S. Wang, Y. Fu, Locality-constrained discriminative learning and coding, CVPRW. 2015:17–24.

[29] Y.-L. Boureau, Y. LeCun, et al., Sparse feature learning for deep belief networks, NIPS. 2008:1185–1192.

[30] P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, ICML. ACM; 2008:1096–1103.

[31] E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? Journal of the ACM (JACM) 2011;58(3):11.

[32] K. Yu, T. Zhang, Y. Gong, Nonlinear learning using local coordinate coding, NIPS. 2009:2223–2231.

[33] D.P. Bertsekas, Constrained optimization and Lagrange multiplier methods. Computer Science and Applied Mathematics. Boston: Academic Press; 1982;vol. 1.

[34] J. Yang, W. Yin, Y. Zhang, Y. Wang, A fast algorithm for edge-preserving variational multichannel image restoration, SIAM Journal on Imaging Sciences 2009;2(2):569–592.

[35] G. Zhang, Z. Jiang, L.S. Davis, Online semi-supervised discriminative dictionary learning for sparse representation, ACCV. Springer; 2013:259–273.

[36] F.S. Samaria, A.C. Harter, Parameterisation of a stochastic model for human face identification, Proceedings of the second IEEE workshop on applications of computer vision. IEEE; 1994:138–142.

[37] K.C. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, TPAMI 2005;27(5):684–698.

[38] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression database, TPAMI 2003;25(12):1615–1618.

[39] S. Nayar, S.A. Nene, H. Murase, Columbia object image library (COIL 100). [Tech Rep CUCS-006-96] Department of Comp Science, Columbia University; 1996.

[40] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 1998;86(11):2278–2324.

[41] P.N. Belhumeur, J.P. Hespanha, D. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, TPAMI 1997;19(7):711–720.

[42] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, TPAMI 2010;32(11):2106–2112.

[43] L. Li, S. Li, Y. Fu, Learning low-rank and discriminative dictionary for image classification, IVC 2014.

[44] S. Gu, L. Zhang, W. Zuo, X. Feng, Projective dictionary pair learning for pattern classification, NIPS. 2014:793–801.

[45] Y. Lyubarskii, R. Vershynin, Uncertainty principles and vector quantization, IEEE Transactions on Information Theory 2010.

[46] C. Studer, T. Goldstein, W. Yin, R.G. Baraniuk, Democratic representations, arXiv preprint arXiv:1401.3420; 2014.

[47] J.J. Fuchs, Spread representations, ASILOMAR. IEEE; 2011:814–817.

[48] K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, ICML. 2010:399–406.

[49] P.B. Stark, R.L. Parker, Bounded-variable least-squares: an algorithm and applications, Computational Statistics 1995;10:129–141.

[50] C.L. Lawson, R.J. Hanson, Solving least squares problems, vol. 161. SIAM; 1974.

[51] M. Adlers, Sparse least squares problems with box constraints. Citeseer; 1998.

[52] Z. Wang, Y. Yang, S. Chang, J. Li, S. Fong, T.S. Huang, A joint optimization framework of sparse coding and discriminative clustering, IJCAI. 2015.

[53] D.P. Bertsekas, Nonlinear programming. Belmont: Athena Scientific; 1999.

[54] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS. 2012.

[55] Z. Wang, Q. Ling, T. Huang, Learning deep l0 encoders, AAAI 2016.

[56] P. Sprechmann, R. Litman, T.B. Yakar, A.M. Bronstein, G. Sapiro, Supervised sparse analysis and synthesis operators, NIPS. 2013:908–916.

[57] P. Sprechmann, A. Bronstein, G. Sapiro, Learning efficient sparse and low rank models, TPAMI 2015.

[58] Z. Wang, S. Chang, J. Zhou, M. Wang, T.S. Huang, Learning a task-specific deep architecture for clustering, SDM 2016.

[59] Z. Wang, S. Chang, D. Liu, Q. Ling, T.S. Huang, D3: deep dual-domain based fast restoration of jpeg-compressed images, IEEE CVPR. 2016.

[60] A. Gionis, P. Indyk, R. Motwani, et al., Similarity search in high dimensions via hashing, VLDB, vol. 99. 1999:518–529.

[61] Y. Gong, S. Lazebnik, Iterative quantization: a procrustean approach to learning binary codes, CVPR. IEEE; 2011.

[62] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, NIPS. 2009.

[63] B. Kulis, T. Darrell, Learning to hash with binary reconstructive embeddings, NIPS. 2009:1042–1050.

[64] W. Liu, J. Wang, R. Ji, Y.G. Jiang, S.F. Chang, Supervised hashing with kernels, CVPR. IEEE; 2012:2074–2081.

[65] J. Masci, D. Migliore, M.M. Bronstein, J. Schmidhuber, Descriptor learning for omnidirectional image matching, Registration and recognition in images and videos. Springer; 2014:49–62.

[66] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, CVPR. IEEE; 2006.

[67] J. Masci, A.M. Bronstein, M.M. Bronstein, P. Sprechmann, G. Sapiro, Sparse similarity-preserving hashing, arXiv preprint arXiv:1312.5479; 2013.

[68] R. Xia, Y. Pan, H. Lai, C. Liu, S. Yan, Supervised hashing for image retrieval via image representation learning, AAAI. 2014.

[69] W.J. Li, S. Wang, W.C. Kang, Feature learning based deep supervised hashing with pairwise labels, arXiv:1511.03855; 2015.

[70] H. Lai, Y. Pan, Y. Liu, S. Yan, Simultaneous feature learning and hash coding with deep neural networks, CVPR 2015.

[71] I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, ICML. 2013:1139–1147.

[72] M. Elad, M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries, TIP 2006;15(12):3736–3745.

[73] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, 2009.

[74] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, IJCV 2001.

[75] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, NUS-WIDE: a real-world web image database from National University of Singapore, ACM CIVR. ACM; 2009:48.

[76] W. Liu, J. Wang, S. Kumar, S.F. Chang, Hashing with graphs, ICML. 2011.

[77] G. Shakhnarovich, P. Viola, T. Darrell, Fast pose estimation with parameter-sensitive hashing, ICCV. IEEE; 2003.

[78] C. Strecha, A.M. Bronstein, M.M. Bronstein, P. Fua, LDAHash: improved matching with smaller descriptors, TPAMI 2012;34(1):66–78.

[79] H. Müller, W. Müller, D.M. Squire, S. Marchand-Maillet, T. Pun, Performance evaluation in content-based image retrieval: overview and proposals, PRL 2001.


1  ©2017 IEEE. Reprinted, with permission, from Wang, Shuyang, Zhengming Ding, and Yun Fu, “Marginalized denoising dictionary learning with locality constraint,” IEEE Transactions on Image Processing 27.1 (2018): 500–510.

2  Z_i, D, λ and σ are set as the input of the function LLC [27], and the code can be downloaded from http://www.ifp.illinois.edu/~jyang29/LLC.htm.

3  Reprinted, with permission, from Wang, Zhangyang, Yang, Yingzhen, Chang, Shiyu, Ling, Qing, and Huang, Thomas S., “Learning a deep ℓ∞ encoder for hashing,” IJCAI (2016).

4  We tested larger K values (3 or 4). In several cases they did bring performance improvements, but added complexity, too.

5  Most of the results are collected from the comparison experiments in [67], under the same settings.

6  The performance is thus improved over that reported in their original papers using two hidden layers, although with extra complexity.

7  Both the deep ℓ∞ encoder and the LISTA network introduce the diagonal layers, while the generic feed-forward networks do not. Besides, neither LISTA nor the generic feed-forward networks contain layer-wise biases. Yet, since either a diagonal layer or a bias contains only N free parameters, the total amount is negligible.
