Chapter 4

Single Image Super-Resolution: From Sparse Coding to Deep Learning

Ding Liu, Thomas S. Huang
Beckman Institute for Advanced Science and Technology, Urbana, IL, United States
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL, United States

Abstract

Recently, deep learning has been successfully applied in numerous areas of computer vision, including low-level image restoration problems. For single image super-resolution (SR), an ill-posed problem that aims to recover a high-resolution image from its low-resolution observation, a number of models based on deep neural networks have been proposed and have obtained performance that surpasses all previous handcrafted models. To regularize the solution of the problem, existing methods focus on exploiting good priors of natural images, such as sparse representation, or on learning the priors directly from a large data set with models such as deep neural networks. In this chapter, we argue that domain expertise represented by the conventional sparse coding model is still valuable, and it can be combined with the key ingredients of deep learning to achieve further improvements. We demonstrate that a sparse coding model particularly designed for super-resolution can be incarnated as a neural network with the merit of end-to-end optimization over training data. The network has a cascaded structure which boosts the SR performance for both fixed and incremental scaling factors. The interpretation of the network based on sparse representation leads to much more efficient and effective training, as well as a reduced model size. Moreover, we develop training and testing schemes that can be extended for robust handling of images with additional degradation such as noise and blurring. A subjective assessment is conducted and analyzed in order to thoroughly evaluate various SR techniques. Our proposed model is tested on several benchmark datasets, and it significantly outperforms existing state-of-the-art methods for various scaling factors both quantitatively and qualitatively.

Furthermore, we propose the method of learning a mixture of SR inference modules in a unified framework to tackle the problem of single image SR. Specifically, a number of SR inference modules specialized in different image local patterns are first independently applied on the LR image to obtain various HR estimates, and the resultant HR estimates are adaptively aggregated to form the final HR image. By selecting neural networks as the SR inference module, the whole procedure can be incorporated into a unified network and optimized jointly. Extensive experiments are conducted to investigate the relation between restoration performance and various network architecture designs. Compared with other recent image SR approaches, this method consistently achieves superior restoration results on a wide range of images while allowing more flexible design choices.

Keywords

Sparse coding; Image super-resolution; Deep learning; Neural network

4.1 Robust Single Image Super-Resolution via Deep Networks with Sparse Prior1

4.1.1 Introduction

Single image super-resolution (SR) aims at obtaining a high-resolution (HR) image from a low-resolution (LR) input image by inferring all the missing high frequency contents. With the known variables in LR images greatly outnumbered by the unknowns in HR images, SR is a highly ill-posed problem and the current techniques are far from being satisfactory for many real applications [1,2], such as surveillance, medical imaging and consumer photo editing [3].

To regularize the solution of SR, people have exploited various priors of natural images. Analytical priors, such as bicubic interpolation, work well for smooth regions; while image models based on statistics of edges [4] and gradients [5] can recover sharper structures. Sparse priors are utilized in the patch-based methods [6–8]. HR patch candidates are recovered from similar examples in the LR image itself at different locations and across different scales [9,10].

More recently, inspired by the success achieved by deep learning [11] in other computer vision tasks, people began to use neural networks with deep architecture for image SR. Multiple layers of collaborative auto-encoders are stacked together in [12,13] for robust matching of self-similar patches. Deep convolutional neural networks (CNN) [14] and deconvolutional networks [15] are designed to directly learn the nonlinear mapping from the LR space to the HR space in a way similar to coupled sparse coding [7]. As these deep networks allow end-to-end training of all the model components between the LR input and the HR output, significant improvements have been observed over their shallow counterparts.

The networks in [12,14] are built with generic architectures, which means all their knowledge about SR is learned from training data. On the other hand, people's domain expertise for the SR problem, such as natural image prior and image degradation model, is largely ignored in deep learning based approaches. It is then worthwhile to investigate whether domain expertise can be used to design better deep model architectures, or whether deep learning can be leveraged to improve the quality of handcrafted models.

In this section, we extend the conventional sparse coding model [6] using several key ideas from deep learning, and show that domain expertise is complementary to large learning capacity in further improving SR performance. First, based on the learned iterative shrinkage and thresholding algorithm (LISTA) [16], we implement a feed-forward neural network in which each layer strictly corresponds to one step in the processing flow of sparse coding based image SR. In this way, the sparse representation prior is effectively encoded in our network structure; at the same time, all the components of sparse coding can be trained jointly through back-propagation. This simple model, which is named sparse coding based network (SCN), achieves notable improvement over the generic CNN model [14] in terms of both recovery accuracy and human perception, and yet has a compact model size. Moreover, with the correct understanding of each layer's physical meaning, we have a more principled way to initialize the parameters of SCN, which helps to improve optimization speed and quality.

A single network is only able to perform image SR by a particular scaling factor. In [14], different networks are trained for different scaling factors. In this section, we propose a cascade of multiple SCNs to achieve SR for arbitrary factors. This approach, motivated by the self-similarity based SR approach [9], not only increases the scaling flexibility of our model, but also reduces artifacts for large scaling factors. Moreover, inspired by the multi-pass scheme of image denoising [17], we demonstrate that the SR results can be further enhanced by cascading multiple SCNs for SR of a fixed scaling factor. A cascade of SCNs (CSCN) can also benefit from the end-to-end training of deep network with a specially designed multi-scale cost function.

In practical SR scenarios, the real LR measurements usually suffer from various types of corruption, such as noise and blurring. Sometimes the degradation process is even too complicated or unclear. We propose several schemes using our SCN to robustly handle such practical SR cases. When the degradation mechanism is unknown, we fine-tune the generic SCN with the requirement of only a small amount of real training data and manage to adapt our model to the new scenario. When the forward model for LR generation is clear, we propose an iterative SR scheme incorporating SCN with additional regularization based on priors from the degradation mechanism.

Subjective assessment is important to the SR technology because commercial products equipped with such technology are usually evaluated subjectively by the end users. In order to thoroughly compare our model with other prevailing SR methods, we conduct a systematic subjective evaluation of these methods, in which the assessment results are statistically analyzed and one score is given for each method.

In the following, we first review the literature in Sect. 4.1.2 and introduce the SCN model in Sect. 4.1.3. Then the cascade scheme of SCN models is detailed in Sect. 4.1.4. The method for robustly handling images with additional degradation such as noise and blurring is discussed in Sect. 4.1.5. Implementation details are provided in Sect. 4.1.6. Extensive experimental results are reported in Sect. 4.1.7, and the subjective evaluation is described in Sect. 4.1.8. Finally, conclusion and future work are presented in Sect. 4.1.9.

4.1.2 Related Work

Single image SR is the task of recovering an HR image from only one LR observation. A comprehensive review can be found in [18]. Generally, existing methods can be classified into three categories: interpolation based [19], image statistics based [4,20], and example based methods [6,21].

Interpolation based methods include bilinear, bicubic and Lanczos filtering [19], which usually run very fast because of their low algorithmic complexity. However, their simplicity prevents them from modeling the complex mapping between the LR feature space and the corresponding HR feature space, producing overly smooth and unsatisfactory regions.

Image statistics based methods utilize statistical edge information to reconstruct HR images [4,20]. They rely on priors of edge statistics in images but tend to lose high-frequency detail, especially in the case of large upscaling factors.

The current most popular and successful approaches are built on example based learning techniques, which aim to learn the correspondence between the LR feature space and the HR feature space through a large number of representative exemplar pairs. The pioneering work in this area includes [22].

Depending on the origin of the exemplar pairs, these methods can be further categorized into three classes: self-example based methods [9,10], external-example based methods [6,21], and methods combining both [23]. Self-example based methods exploit only the single input LR image as reference, and extract exemplar pairs merely from the LR image across different scales to predict the HR image. Such methods usually work well on images containing repetitive patterns or textures, but they lack the richness of image structures outside the input image and thus fail to generate satisfactory predictions for images of other classes. Huang et al. [24] extend this idea by building self-dictionaries for handling geometric transformations.

External-example based methods first utilize the exemplar pairs extracted from a large external dataset, in order to learn the universal image characteristics between the LR feature space and HR feature space, and then apply the learned mapping for SR. Usually, representative patches from external datasets are compactly embodied in pre-trained dictionaries. One representative approach is the sparse coding based method [6,7]. For example, in [7] two coupled dictionaries are trained for the LR feature space and HR patch feature space, respectively, such that the LR patch over LR dictionary and its corresponding HR patch over HR dictionary share the same sparse representation. Although it is able to capture the universal LR–HR correspondence from external datasets and recover fine details and sharpened edges, it suffers from the high computational cost when solving complicated nonlinear optimization problems.

Timofte et al. [21,25] propose a neighbor embedding approach for SR, and formulate the problem as a least squares optimization with $\ell_2$-norm regularization, which drastically reduces the computational complexity compared with [6,7]. Neighbor embedding approaches approximate HR patches as a weighted average of similar training patches in a low dimensional manifold.

Random forests are built for SR without dictionary learning in [26,27]. Such approaches achieve fast inference but usually suffer from huge model sizes.

4.1.3 Sparse Coding Based Network for Image SR

In this section, we first introduce the background of sparse coding for image SR and its network implementation. Then we illustrate the design of our proposed sparse coding based network and its advantage over previous models.

4.1.3.1 Image SR Using Sparse Coding

The sparse representation based SR method [6] models the transform from each local patch $\mathbf{y} \in \mathbb{R}^{m_y}$ in the bicubic-upscaled LR image to the corresponding patch $\mathbf{x} \in \mathbb{R}^{m_x}$ in the HR image. The dimension $m_y$ is not necessarily the same as $m_x$ when image features other than raw pixel are used to represent patch $\mathbf{y}$. It is assumed that the LR (HR) patch $\mathbf{y}$ ($\mathbf{x}$) can be represented with respect to an overcomplete dictionary $D_y$ ($D_x$) using some sparse linear coefficients $\boldsymbol{\alpha}_y$ ($\boldsymbol{\alpha}_x$) $\in \mathbb{R}^n$, which are known as the sparse code. Since the degradation process from $\mathbf{x}$ to $\mathbf{y}$ is nearly linear, the patch pair can share the same sparse code $\boldsymbol{\alpha}_y = \boldsymbol{\alpha}_x = \boldsymbol{\alpha}$ if the dictionaries $D_y$ and $D_x$ are defined properly. Therefore, for an input LR patch $\mathbf{y}$, the HR patch can be recovered as

$$\mathbf{x} = D_x \boldsymbol{\alpha}, \quad \text{s.t.} \quad \boldsymbol{\alpha} = \arg\min_{\mathbf{z}} \|\mathbf{y} - D_y \mathbf{z}\|_2^2 + \lambda \|\mathbf{z}\|_1, \qquad (4.1)$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm, which is convex and sparsity-inducing, and $\lambda$ is a regularization coefficient.
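To make the patch recovery in (4.1) concrete, the following sketch solves the sparse coding step with plain ISTA and then applies the HR dictionary. All sizes, the value of $\lambda$, and the random dictionaries are illustrative placeholders, not the learned settings of [6].

```python
import numpy as np

def ista(y, Dy, lam=0.1, n_iter=100):
    """Approximately solve alpha = argmin_z ||y - Dy z||_2^2 + lam ||z||_1 with ISTA."""
    t = 0.5 / np.linalg.norm(Dy, 2) ** 2       # step size = 1 / Lipschitz constant of the gradient
    z = np.zeros(Dy.shape[1])
    for _ in range(n_iter):
        z = z - t * 2 * Dy.T @ (Dy @ z - y)    # gradient step on the quadratic term
        z = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)   # soft thresholding (prox of lam*||.||_1)
    return z

# Illustrative sizes: m_y-dim LR feature, m_x-dim HR patch, n dictionary atoms.
m_y, m_x, n = 100, 25, 128
rng = np.random.default_rng(0)
Dy = rng.standard_normal((m_y, n)); Dy /= np.linalg.norm(Dy, axis=0)
Dx = rng.standard_normal((m_x, n)); Dx /= np.linalg.norm(Dx, axis=0)

y = rng.standard_normal(m_y)                   # LR patch feature
alpha = ista(y, Dy)                            # shared sparse code
x = Dx @ alpha                                 # recovered HR patch, as in Eq. (4.1)
```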

In order to learn the dictionary pair $(D_y, D_x)$, the goal is to minimize the recovery error of $\mathbf{x}$ and $\mathbf{y}$, and thus the loss function $\mathcal{L}$ in [7] is defined as

$$\mathcal{L} = \frac{1}{2}\left(\gamma \|\mathbf{x} - D_x \mathbf{z}\|_2^2 + (1-\gamma)\|\mathbf{y} - D_y \mathbf{z}\|_2^2\right), \qquad (4.2)$$

where $\gamma$ ($0 < \gamma \le 1$) balances the two reconstruction errors. Then the optimal dictionary pair $\{D_x, D_y\}$ can be found by minimizing the empirical expectation of (4.2) over all the training LR/HR pairs,

$$\begin{aligned}
\min_{D_x, D_y} \quad & \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(D_x, D_y, \mathbf{x}_i, \mathbf{y}_i) \\
\text{s.t.} \quad & \mathbf{z}_i = \arg\min_{\boldsymbol{\alpha}} \|\mathbf{y}_i - D_y \boldsymbol{\alpha}\|_2^2 + \lambda\|\boldsymbol{\alpha}\|_1, \quad i = 1, 2, \ldots, N, \\
& \|D_x(:, k)\|_2 \le 1, \quad \|D_y(:, k)\|_2 \le 1, \quad k = 1, 2, \ldots, K. \qquad (4.3)
\end{aligned}$$

Since the objective function in (4.2) is highly nonconvex, the dictionary pair $(D_y, D_x)$ is usually learned alternately while keeping one of them fixed [7]. The authors of [28,23] also incorporated patch-level self-similarity with dictionary pair learning.
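As a loose illustration of how (4.3) can be optimized in practice, one common simplification is to stack the weighted HR and LR features so that a single off-the-shelf dictionary learning routine enforces a shared sparse code; the helper below follows that idea and is only a sketch, with placeholder sizes and without the refinements used in [7].

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_coupled_dictionaries(X, Y, n_atoms=128, gamma=0.5, lam=0.1):
    """X: (N, m_x) HR patches, Y: (N, m_y) LR features; rows are paired samples."""
    stacked = np.hstack([np.sqrt(gamma) * X, np.sqrt(1.0 - gamma) * Y])
    learner = DictionaryLearning(n_components=n_atoms, alpha=lam, max_iter=20)
    learner.fit(stacked)                       # joint dictionary over stacked features
    D = learner.components_.T                  # shape (m_x + m_y, n_atoms)
    m_x = X.shape[1]
    Dx = D[:m_x] / np.sqrt(gamma)              # unscale to recover the HR dictionary
    Dy = D[m_x:] / np.sqrt(1.0 - gamma)        # and the LR dictionary
    return Dx, Dy
```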

4.1.3.2 Network Implementation of Sparse Coding

There is an intimate connection between sparse coding and neural network, which has been well studied in [29,16]. A feed-forward neural network as illustrated in Fig. 4.1 is proposed in [16] to efficiently approximate the sparse code $\boldsymbol{\alpha}$ of input signal $\mathbf{y}$ as it would be obtained by solving (4.1) for a given dictionary $D_y$. The network has a finite number of recurrent stages, each of which updates the intermediate sparse code according to

$$\mathbf{z}^{k+1} = h_{\theta}\!\left(W\mathbf{y} + S\mathbf{z}^{k}\right), \qquad (4.4)$$

where $h_{\theta}$ is an element-wise shrinkage function defined as $[h_{\theta}(\mathbf{a})]_i = \mathrm{sign}(a_i)\,(|a_i| - \theta_i)_+$ with positive thresholds $\theta$.

Figure 4.1 An LISTA network [16] with 2 time-unfolded recurrent stages, whose output α is an approximation of the sparse code of input signal y. The linear weights W, S and the shrinkage thresholds θ are learned from data.

Different from the iterative shrinkage and thresholding algorithm (ISTA) [30,31], which finds an analytical relationship between network parameters (weights $W$, $S$ and thresholds $\theta$) and sparse coding parameters ($D_y$ and $\lambda$), the authors of [16] learn all the network parameters from training data using a back-propagation algorithm called learned ISTA (LISTA). In this way, a good approximation of the underlying sparse code can be obtained within a fixed number of recurrent stages.
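A minimal numpy sketch of the LISTA forward pass in Eq. (4.4) is given below: a fixed number of time-unfolded stages with learned $W$, $S$ and thresholds $\theta$. The random parameters here merely stand in for values that would be learned by back-propagation.

```python
import numpy as np

def shrink(a, theta):
    """Element-wise shrinkage [h_theta(a)]_i = sign(a_i) * (|a_i| - theta_i)_+ ."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def lista_forward(y, W, S, theta, n_stages=2):
    z = shrink(W @ y, theta)            # initial estimate
    for _ in range(n_stages):           # time-unfolded recurrent stages, Eq. (4.4)
        z = shrink(W @ y + S @ z, theta)
    return z                            # approximation of the sparse code alpha

m_y, n = 100, 128
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((n, m_y))
S = 0.01 * rng.standard_normal((n, n))
theta = np.full(n, 0.05)
alpha = lista_forward(rng.standard_normal(m_y), W, S, theta)
```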

4.1.3.3 Network Architecture of SCN

Given the fact that sparse coding can be effectively implemented with an LISTA network, it is straightforward to build a multi-layer neural network that mimics the processing flow of the sparse coding based SR method [6]. Similar to most patch-based SR methods, our sparse coding based network (SCN) takes the bicubic-upscaled LR image $I_y$ as input, and outputs the full HR image $I_x$. Fig. 4.2 shows the main network structure, and each layer is described in the following.

Figure 4.2 (Top left) The proposed SCN model with a patch extraction layer H, an LISTA subnetwork for sparse coding (with k recurrent stages denoted by the dashed box), an HR patch recovery layer Dx, and a patch combination layer G. (Top right) A neuron with an adjustable threshold decomposed into two linear scaling layers and a unit-threshold neuron. (Bottom) The SCN reorganized with unit-threshold neurons and adjacent linear layers merged together in the gray boxes.

The input image $I_y$ first goes through a convolutional layer H which extracts features for each LR patch. There are $m_y$ filters of spatial size $s_y \times s_y$ in this layer, so that our input patch size is $s_y \times s_y$ and its feature representation $\mathbf{y}$ has $m_y$ dimensions.

Each LR patch $\mathbf{y}$ is then fed into an LISTA network with $k$ recurrent stages to obtain its sparse code $\boldsymbol{\alpha} \in \mathbb{R}^n$. Each stage of LISTA consists of two linear layers parameterized by $W \in \mathbb{R}^{n \times m_y}$ and $S \in \mathbb{R}^{n \times n}$, and a nonlinear neuron layer with activation function $h_{\theta}$. The activation thresholds $\theta \in \mathbb{R}^n$ are also to be updated during training, which complicates the learning algorithm. To restrict all the tunable parameters to the linear layers, we use a simple trick to rewrite the activation function as

$$[h_{\theta}(\mathbf{a})]_i = \mathrm{sign}(a_i)\,\theta_i \left(|a_i|/\theta_i - 1\right)_+ = \theta_i\, h_1(a_i/\theta_i). \qquad (4.5)$$

Equation (4.5) indicates that the original neuron with an adjustable threshold can be decomposed into two linear scaling layers and a unit-threshold neuron, as shown in Fig. 4.2(top-right). The weights of the two scaling layers are diagonal matrices defined by θ and its element-wise reciprocal, respectively.

The sparse code $\boldsymbol{\alpha}$ is then multiplied with the HR dictionary $D_x \in \mathbb{R}^{m_x \times n}$ in the next linear layer, reconstructing the HR patch $\mathbf{x}$ of size $s_x \times s_x = m_x$.

In the final layer G, all the recovered patches are put back to the corresponding positions in the HR image $I_x$. This is realized via a convolutional filter of $m_x$ channels with spatial size $s_g \times s_g$. The size $s_g$ is determined as the number of neighboring patches that overlap with the same pixel in each spatial direction. The filter will assign appropriate weights to the overlapped recoveries from different patches and take their weighted average as the final prediction in $I_x$.

As illustrated in Fig. 4.2(bottom), after some simple reorganizations of the layer connections, the network described above has some adjacent linear layers which can be merged into a single layer. This helps to reduce the computation load and redundant parameters in the network. The layers H and G are not merged because we apply additional nonlinear normalization operations on patches y and x, which will be detailed in Sect. 4.1.6.

Thus, there are a total of 5 trainable layers in our network: 2 convolutional layers H and G, and 3 linear layers shown as gray boxes in Fig. 4.2. The $k$ recurrent layers share the same weights and are therefore conceptually regarded as one. Note that all the linear layers are actually implemented as convolutional layers applied on each patch with a filter spatial size of $1 \times 1$, a structure similar to the network in network [32]. Also note that all these layers have only weights but no biases (zero biases).
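To make the layer structure above concrete, the following PyTorch sketch assembles the five trainable layers: the patch extraction convolution H, the LISTA stages realized with 1×1 convolutions and the unit-threshold shrinkage of Eq. (4.5), the HR recovery layer based on $D_x$, and the patch combination convolution G. It is an architectural illustration with default sizes from Sect. 4.1.6 and same-size padding for simplicity, not the trained model.

```python
import torch
import torch.nn as nn

class SCN(nn.Module):
    def __init__(self, m_y=100, n=128, m_x=25, s_y=9, s_g=5, k=1):
        super().__init__()
        self.H = nn.Conv2d(1, m_y, s_y, padding=s_y // 2, bias=False)   # LR feature extraction
        self.W = nn.Conv2d(m_y, n, 1, bias=False)                       # W in Eq. (4.4)
        self.S = nn.Conv2d(n, n, 1, bias=False)                         # S in Eq. (4.4)
        self.theta = nn.Parameter(torch.full((1, n, 1, 1), 0.05))       # adjustable thresholds
        self.Dx = nn.Conv2d(n, m_x, 1, bias=False)                      # HR dictionary layer
        self.G = nn.Conv2d(m_x, 1, s_g, padding=s_g // 2, bias=False)   # patch combination
        self.k = k

    def shrink(self, a):
        # Eq. (4.5): scale by 1/theta, apply unit-threshold shrinkage, scale back by theta.
        return self.theta * torch.sign(a) * torch.clamp(torch.abs(a / self.theta) - 1.0, min=0.0)

    def forward(self, I_y):                    # I_y: bicubic-upscaled LR image
        y = self.H(I_y)
        z = self.shrink(self.W(y))
        for _ in range(self.k):                # recurrent LISTA stages with shared weights
            z = self.shrink(self.W(y) + self.S(z))
        return self.G(self.Dx(z))              # recovered HR image I_x

I_x = SCN()(torch.randn(1, 1, 56, 56))         # toy forward pass
```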

Mean squared error (MSE) is employed as the cost function to train the network, and our optimization objective can be expressed as

$$\min_{\Theta} \sum_i \left\| \mathrm{SCN}\!\left(I_y^{(i)}; \Theta\right) - I_x^{(i)} \right\|_2^2, \qquad (4.6)$$

where $I_y^{(i)}$ and $I_x^{(i)}$ form the $i$th pair of LR/HR training data, and $\mathrm{SCN}(I_y; \Theta)$ denotes the HR image predicted for $I_y$ using the SCN model with parameter set $\Theta$. All the parameters are optimized through the standard back-propagation algorithm. Although it is possible to use other cost terms that are more correlated with human visual perception than MSE, our experimental results show that simply minimizing MSE leads to improvement in subjective quality.

4.1.3.4 Advantages over Previous Models

The construction of our SCN follows exactly each step in the sparse coding based SR method [6]. If the network parameters are set according to the dictionaries learned in [6], it can reproduce almost the same results. However, after training, SCN learns a more complex regression function and can no longer be converted to an equivalent sparse coding model. The advantage of SCN comes from its ability to jointly optimize all the layer parameters from end to end; while in [6] some variables are manually designed and some are optimized individually by fixing all the others.

Technically, our network is also a CNN and it has similar layers as the CNN model proposed in [14] for patch extraction and reconstruction. The key difference is that we have an LISTA subnetwork specifically designed to enforce sparse representation prior; while in [14] a generic rectified linear unit (ReLU) [33] is used for nonlinear mapping. Since SCN is designed based on our domain knowledge in sparse coding, we are able to obtain a better interpretation of the filter responses and have a better way to initialize the filter parameters in training. We will see in the experiments that all these contribute to better SR results, faster training speed and smaller model size than a vanilla CNN.

4.1.4 Network Cascade for Scalable SR

In this section, we investigate two different network cascade techniques in order to fully exploit our SCN model in SR applications.

4.1.4.1 Network Cascade for SR of a Fixed Scaling Factor

First, we observe that the SR results can be further improved by cascading multiple SCNs trained for the same objective in (4.6), which is inspired by the multi-pass scheme in [17]. The only difference for training these SCNs is to replace the bicubic interpolated input by its latest HR estimate, while the target output remains the same.

The first SCN serves as a function approximator modeling the nonlinear mapping from the bicubic-upscaled image to the ground-truth image. Each following SCN serves as another function approximator, with its starting point changed to a better estimate: the output of the previous SCN.

In other words, the cascade of SCNs as a whole can be considered as a new, deeper network with more powerful learning capability, which is able to better approximate the mapping from the LR inputs to their HR counterparts, and these SCNs can be trained jointly to pursue even better SR performance.

4.1.4.2 Network Cascade for Scalable SR

Like most SR models learned from external training examples, the SCN discussed previously can only upscale images by a fixed factor. A separate model needs to be trained for each scaling factor to achieve the best performance, which limits the flexibility and scalability in practical use. One way to overcome this difficulty is to repeatedly enlarge the image by a fixed scale until the resulting HR image reaches a desired size. This practice is commonly adopted in the self-similarity based methods [9,10,12], but is not so popular in other cases for the fear of error accumulation during repetitive upscaling.

In our case, however, it is observed that a cascade of SCNs trained for small scaling factors can generate even better SR results than a single SCN trained for a large scaling factor, especially when the target scaling factor is large (greater than 2). This is illustrated by the example in Fig. 4.3. Here an input image is magnified 4 times in two ways: with a single SCN×4 model through the processing flow (A) → (B) → (D); and with a cascade of two SCN×2 models through (A) → (C) → (E). It can be seen that the input to the second cascaded SCN×2 in (C) is already sharper and contains fewer artifacts than the bicubic×4 input to the single SCN×4 in (B), which naturally leads to a better final result in (E) than in (D).

Figure 4.3 SR results for the “Lena” image upscaled 4 times. (A) → (B) → (D) represents the processing flow with a single SCN×4 model. (A) → (C) → (E) represents the processing flow with two cascaded SCN×2 models. PSNR is given in parentheses.

To get a better understanding of the above observation, we can draw a loose analogy between the SR process and a communication system. Bicubic interpolation is like a noisy channel through which an image is “transmitted” from the LR domain to HR domain. And our SCN model (or any SR algorithm) behaves as a receiver which recovers clean signals from noisy observations. A cascade of SCNs is then like a set of relay stations that enhance signal-to-noise ratio before the signal becomes too weak for further transmission. Therefore, cascading will work only when each SCN can restore enough useful information to compensate for the new artifacts it introduces as well as the magnified artifacts from previous stages.

4.1.4.3 Training Cascade of Networks

Taking into account the two aforementioned cascade techniques, we can consider the cascade of all SCNs as a deeper network (CSCN), in which the final output of the consecutive SCNs sharing the same ground truth is connected to the input of the next SCN with bicubic interpolation in between. To construct the cascade, besides stacking several SCNs trained individually with respect to (4.6), we can also optimize all of them jointly as shown in Fig. 4.4. Without loss of generality, we assume that each stage in Sect. 4.1.4.2 has the same scaling factor $s$. Let $\hat{I}_{j,k}$ ($j>0$, $k>0$) denote the output image of the $j$th SCN in the $k$th stage, upscaled by a total factor of $s^k$. In the same stage, each output of the SCNs is compared with the associated ground-truth image $I_k$ according to the MSE cost, leading to a multi-scale objective function:

$$\min_{\{\Theta_{j,k}\}} \sum_i \sum_{j,k} \left\| \mathrm{SCN}\!\left(\hat{I}_{j-1,k}^{(i)}; \Theta_{j,k}\right) - I_k^{(i)} \right\|_2^2, \qquad (4.7)$$

where $i$ denotes the data index, and $j$, $k$ denote the SCN index. For simplicity of notation, $\hat{I}_{0,k}$ specially denotes the bicubic interpolated image of the final output of the $(k-1)$th stage, upscaled by a total factor of $s^{k-1}$. This multi-scale objective function makes full use of the supervision information in all scales, sharing a similar idea as heterogeneous networks [34]. All the layer parameters $\{\Theta_{j,k}\}$ in (4.7) could be optimized from end to end by back-propagation. The SCNs sharing the same training objective can be trained simultaneously, taking advantage of the merit of deep learning. For the SCNs with different training objectives, we use a greedy algorithm to train them sequentially from the beginning of the cascade, so that we do not need to handle the gradient of the bicubic layers. Applying back-propagation through a bicubic layer or its trainable surrogate will be considered in future work.

Figure 4.4 Training cascade of SCNs with multi-scale objectives.
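A schematic of the multi-scale objective in Eq. (4.7) is sketched below, assuming a cascade of ×2 SCN modules with per-stage ground-truth images supplied in a list. The bicubic step between stages uses PyTorch interpolation as a stand-in; this only illustrates how the per-SCN MSE terms are accumulated, not the greedy training schedule.

```python
import torch.nn.functional as F

def bicubic_x2(img):
    # bicubic interpolation between cascade stages
    return F.interpolate(img, scale_factor=2, mode='bicubic', align_corners=False)

def cascade_loss(stages, I_lr, I_gt):
    """stages[k]: list of SCNs sharing the stage-k target; I_gt[k]: ground truth at scale s^(k+1)."""
    loss, current = 0.0, I_lr
    for k, scns in enumerate(stages):
        current = bicubic_x2(current)              # \hat{I}_{0,k}: bicubic input to stage k
        for scn in scns:                           # SCNs compared against the same ground truth
            current = scn(current)
            loss = loss + F.mse_loss(current, I_gt[k])
    return loss                                    # back-propagate to train all {Theta_{j,k}}
```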

4.1.5 Robust SR for Real Scenarios

Most recent SR works generate the LR images for both training and testing by downscaling HR images using bicubic interpolation [6,21]. However, this assumption on the forward model may not always hold in practice. For example, real LR measurements are usually blurred, or corrupted with noise. Sometimes, the LR generation mechanism may be complicated, or even unknown. We now investigate a practical SR problem, and propose two approaches to handle such non-ideal LR measurements, using the generic SCN. In the case that the underlying mechanism of the real LR generation is unclear or complicated, we propose a data-driven approach by fine-tuning the learned generic SCN with a limited number of real LR measurements, as well as their corresponding HR counterparts. On the other hand, if the real training samples are unavailable but the LR generation mechanism is clear, we formulate this inverse problem as a regularized HR image reconstruction problem which can be solved using iterative methods. The proposed methods demonstrate the robustness of our SCN model in different SR scenarios. In the following, we elaborate the details of these two approaches.

4.1.5.1 Data-Driven SR by Fine-Tuning

Deep learning models can be efficiently transferred from one task to another by reusing the intermediate representation in the original neural network [35]. This method has been proven successful on a number of high-level vision tasks, even if there is a limited amount of training data in the new task [36].

The success of super-resolution algorithms usually highly depends on the accuracy of the model of the imaging process. When the underlying mechanism of the generation of LR images is not clear, we can take advantage of the aforementioned merit of deep learning models by learning our model in a data-driven manner, to adapt it for a particular task. Specifically, we start training from the generic SCN model while using a very limited amount of training data from a new SR scenario, and manage to adapt it to the new SR scenario and obtain promising results. In this way, it is demonstrated that the SCN has a strong capability of learning complex mappings between the non-ideal LR measurements and their HR counterparts, as well as high flexibility in adapting to various SR tasks.

4.1.5.2 Iterative SR with Regularization

The second approach considers the case that the mechanism of generating the real LR images is relatively simple and clear, indicating that training data is always available if we synthesize LR images with the known degradation process. We propose an iterative SR scheme which incorporates the generic SCN model with additional regularization based on task-related priors (e.g., the known kernel for deblurring, or the data sparsity for denoising). In this section, we specifically discuss handling blurred and noisy LR measurements in detail as examples, though the iterative SR methods can be generalized to other practical imaging models.

Blurry Image Upscaling

The real LR images can be generated with various types of blurring. Directly applying the generic SCN model is obviously not optimal. Instead, with the known blurring kernel, we propose to estimate the regularized version of the HR image $\hat{I}_x$ based on the directly upscaled image $\tilde{I}_x$ by the learned SCN as follows:

$$\hat{I}_x = \arg\min_{I} \|I - \tilde{I}_x\|_2, \quad \text{s.t.} \quad DBI = I_{y_0}, \qquad (4.8)$$

where $I_{y_0}$ is the original blurred LR input, and the operators $B$ and $D$ denote blurring and subsampling, respectively. Similar to the previous work [6], we use back-projection to iteratively estimate the regularized HR input on which our model can perform better. Specifically, given the regularized estimate $\hat{I}_x^{i-1}$ at iteration $i-1$, we estimate a less blurred LR image $I_y^{i-1}$ by downsampling $\hat{I}_x^{i-1}$ using bicubic interpolation. The image $\tilde{I}_x^{i}$ upscaled from it by the learned SCN serves as the regularizer for the $i$th iteration as follows:

$$\hat{I}_x^{i} = \arg\min_{I} \|I - \tilde{I}_x^{i}\|_2^2 + \|DBI - I_{y_0}\|_2^2. \qquad (4.9)$$

Here we use a penalty method to form an unconstrained problem. The upscaled HR image $\tilde{I}_x^{i}$ can be computed as $\mathrm{SCN}(I_y^{i-1}, \Theta)$. The same process is repeated until convergence. We have applied the proposed iterative scheme to LR images generated from Gaussian blurring and subsampling as an example. The empirical performance is illustrated in Sect. 4.1.7.
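The back-projection style loop in Eqs. (4.8)–(4.9) can be sketched as below. Here scn_upscale (bicubic interpolation followed by the SCN), blur, downsample and upsample are assumed helpers; the inner minimization of Eq. (4.9) is approximated by a few gradient steps, with upsample(blur(...)) acting as a rough adjoint of the subsampling-and-blurring operator.

```python
def iterative_blurry_sr(I_y0, scn_upscale, blur, downsample, upsample,
                        n_outer=5, n_inner=20, step=0.2):
    I_hat = scn_upscale(I_y0)                     # initial HR estimate
    for _ in range(n_outer):
        I_y = downsample(I_hat)                   # less blurred LR estimate I_y^{i-1}
        I_tilde = scn_upscale(I_y)                # SCN regularizer for this iteration
        I = I_tilde.copy()
        for _ in range(n_inner):                  # approximate minimization of Eq. (4.9)
            residual = downsample(blur(I)) - I_y0                          # DBI - I_{y_0}
            grad = 2.0 * (I - I_tilde) + 2.0 * upsample(blur(residual))    # rough adjoint of DB
            I = I - step * grad
        I_hat = I
    return I_hat
```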

Noisy Image Upscaling

Noise is a ubiquitous cause of corruption in image acquisition. State-of-the-art image denoising methods usually adopt priors such as patch similarity [37], patch sparsity [38,17], or both [39], as a regularizer in image restoration. In this section, we propose a regularized noisy image upscaling scheme, for specifically handling noisy LR images, in order to obtain improved SR quality. Though any denoising algorithm can be used in our proposed scheme, here we apply spatial similarity combined with transform domain image patch group-sparsity as our regularizer [39], to form the regularized iterative SR problem as an example.

Similar to the blurry image upscaling scheme above, we iteratively estimate a less noisy HR image from the denoised LR image. Given the denoised LR estimate $\hat{I}_y^{i-1}$ at iteration $i-1$, we directly upscale it, using the learned generic SCN, to obtain the HR image $\hat{I}_x^{i-1}$. It is then downsampled using bicubic interpolation, to generate the LR image $\tilde{I}_y^{i}$, which is used in the fidelity term in the $i$th iteration of LR image denoising. The same process is repeated until convergence. The iterative LR image denoising problem is formulated as follows:

$$\left\{\hat{I}_y^{i}, \{\hat{\boldsymbol{\alpha}}_j\}\right\} = \arg\min_{I, \{\boldsymbol{\alpha}_j\}} \left\|I - \tilde{I}_y^{i}\right\|_2^2 + \sum_{j=1}^{N} \left\{ \left\|W_{3D}\, G_j I - \boldsymbol{\alpha}_j\right\|_2^2 + \tau \left\|\boldsymbol{\alpha}_j\right\|_0 \right\}, \qquad (4.10)$$

where the operator $G_j$ generates the 3D vectorized tensor which groups the $j$th overlapping patch from the LR image $I$ together with the spatially similar patches within its neighborhood by block matching [39]. The codes $\{\boldsymbol{\alpha}_j\}$ of the patch groups in the domain of the 3D sparsifying transform $W_{3D}$ are sparse, which is enforced by the $\ell_0$ norm penalty [40]. The weight $\tau$ controls the sparsity level, which normally depends on the remaining noise level in $\tilde{I}_y^{i}$ [41,40].

In (4.10), we use the patch group sparsity as our denoising regularizer. The 3D sparsifying transform $W_{3D}$ can be one of the commonly used analytical transforms, such as the discrete cosine transform (DCT) or wavelets. The state-of-the-art BM3D denoising algorithm [39] is based on such an approach, further improved by more sophisticated engineering stages. In order to achieve the best practical SR quality, we demonstrate the empirical performance using BM3D as the regularizer in Sect. 4.1.7. Additionally, our proposed iterative method is a general practical SR framework which is not dedicated to SCN; one can conveniently extend it to other SR methods that generate $\tilde{I}_y^{i}$ in the $i$th iteration. A performance comparison of these methods is given in Sect. 4.1.7.
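A high-level sketch of the noisy-image loop around Eq. (4.10) is shown below, with an off-the-shelf BM3D call standing in for the patch-group-sparsity regularization, as in the experiments. Here scn_upscale, bicubic_downscale and bm3d_denoise are assumed helpers, and the decreasing noise-level schedule is one simple choice for the remaining noise in $\tilde{I}_y^{i}$.

```python
def iterative_bm3d_scn(I_y_noisy, scn_upscale, bicubic_downscale, bm3d_denoise,
                       sigma=10 / 255.0, n_iter=3):
    I_y_hat = I_y_noisy                           # \hat{I}_y^0: the noisy LR observation
    for i in range(1, n_iter + 1):
        I_x_hat = scn_upscale(I_y_hat)            # HR estimate from the current LR estimate
        I_y_tilde = bicubic_downscale(I_x_hat)    # fidelity reference \tilde{I}_y^i in Eq. (4.10)
        # One denoising pass approximates solving Eq. (4.10); the assumed
        # remaining noise level shrinks as the iterations proceed.
        I_y_hat = bm3d_denoise(I_y_tilde, sigma / i)
    return scn_upscale(I_y_hat)                   # final HR output
```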

4.1.6 Implementation Details

We determine the number of nodes in each layer of our SCN mainly according to the corresponding settings used in sparse coding [7]. Unless otherwise stated, we use input LR patch size $s_y = 9$, LR feature dimension $m_y = 100$, dictionary size $n = 128$, output HR patch size $s_x = 5$, and patch aggregation filter size $s_g = 5$. All the convolution layers have a stride of 1. Each LR patch $\mathbf{y}$ is normalized by its mean and variance, and the same mean and variance are used to restore the final HR patch $\mathbf{x}$. We crop $56 \times 56$ regions from each image to obtain fixed-sized input samples to the network, which produces outputs of size $44 \times 44$.

To reduce the number of parameters, we implement the LR patch extraction layer H as the combination of two layers: the first layer has 4 trainable filters, each of which is shifted to 25 fixed positions by the second layer. Similarly, the patch combination layer G is also split into a fixed layer which aligns pixels in overlapping patches and a trainable layer whose weights are used to combine overlapping pixels. In this way, the number of parameters in these two layers is reduced by more than an order of magnitude, and there is no observable loss in performance.

We employ a standard stochastic gradient descent algorithm to train our networks with a mini-batch size of 64. Based on the understanding of each layer's role in sparse coding, we use Haar-like gradient filters to initialize layer H, and use uniform weights to initialize layer G. All the remaining three linear layers are related to the dictionary pair $(D_x, D_y)$ in sparse coding. To initialize them, we first randomly set $D_x$ and $D_y$ with Gaussian noise, and then find the corresponding layer weights as in ISTA [30]:

$$w_1 = C D_y^{T}, \quad w_2 = I - D_y^{T} D_y, \quad w_3 = \frac{1}{CL} D_x, \qquad (4.11)$$

where $w_1$, $w_2$ and $w_3$ denote the weights of the three subsequent layers after layer H; $L$ is the upper bound on the largest eigenvalue of $D_y^{T} D_y$, and $C$ is the threshold value before normalization. We empirically set $L = C = 5$.
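A small numpy sketch of this initialization follows: $D_x$ and $D_y$ are drawn from Gaussian noise with unit-norm atoms, and the three linear layers after H are set according to Eq. (4.11), with the empirical values $L = C = 5$ from the text; the sizes are the defaults of Sect. 4.1.6.

```python
import numpy as np

m_y, m_x, n = 100, 25, 128        # LR feature dim, HR patch dim, dictionary size
L, C = 5.0, 5.0
rng = np.random.default_rng(0)
Dy = rng.standard_normal((m_y, n)); Dy /= np.linalg.norm(Dy, axis=0)
Dx = rng.standard_normal((m_x, n)); Dx /= np.linalg.norm(Dx, axis=0)

w1 = C * Dy.T                     # first linear layer after H, Eq. (4.11)
w2 = np.eye(n) - Dy.T @ Dy        # recurrent (LISTA) layer
w3 = (1.0 / (C * L)) * Dx         # HR recovery layer
```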

The proposed models are all trained using the CUDA ConvNet package [11] on a workstation with 12 Intel Xeon 2.67 GHz CPUs and 1 GTX680 GPU. Training an SCN usually takes less than one day. Note that this package is customized for classification networks, and its efficiency can be further optimized for our SCN model.

In testing, to make the entire image covered by output samples, we crop input samples with overlap and extend the boundary of original image by reflection. Note we shave the image border in the same way as [14] for objective evaluations to ensure fair comparison. Only the luminance channel is processed with our method, and bicubic interpolation is applied to the chrominance channels, as their high frequency components are less noticeable to human eyes. To achieve arbitrary scaling factors using CSCN, we upscale an image by a factor of 2 repeatedly until it is at least as large as the desired size. Then a bicubic interpolation is used to downscale it to the target resolution if necessary.
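This repeated-×2 strategy can be sketched as follows, with scn_x2 (one ×2 super-resolution pass) and bicubic_resize as assumed helpers; the number of ×2 passes is the smallest one that reaches or exceeds the target size.

```python
import math

def upscale_arbitrary(I_lr, target_h, target_w, scn_x2, bicubic_resize):
    h, w = I_lr.shape[:2]
    n_steps = max(1, math.ceil(math.log2(max(target_h / h, target_w / w))))
    I = I_lr
    for _ in range(n_steps):
        I = scn_x2(I)                                    # repeated x2 super-resolution
    return bicubic_resize(I, (target_h, target_w))       # downscale to the exact target size
```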

When reporting our best results in Sect. 4.1.7.2, we also use the multi-view testing strategy commonly employed in image classification. For patch-based image SR, multi-view testing is implicitly used when predictions from multiple overlapping patches are averaged. Here, besides sampling overlapping patches, we also add more views by flipping and transposing the patch. Such a strategy is found to improve SR performance for general algorithms at the cost of extra computation.
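The flip-and-transpose multi-view testing can be sketched as below: the eight geometric views of the LR image are super-resolved independently, mapped back to the original orientation, and averaged. Here sr_model is any single-image SR function; this is an illustration of the strategy rather than the exact evaluation code.

```python
import numpy as np

def multiview_sr(I_lr, sr_model):
    outputs = []
    for transpose in (False, True):
        view = I_lr.T if transpose else I_lr
        for flip_ud in (False, True):
            for flip_lr in (False, True):
                v = view[::-1] if flip_ud else view
                v = v[:, ::-1] if flip_lr else v
                out = sr_model(np.ascontiguousarray(v))
                out = out[:, ::-1] if flip_lr else out   # undo the transforms on the output
                out = out[::-1] if flip_ud else out
                outputs.append(out.T if transpose else out)
    return np.mean(outputs, axis=0)                      # average the eight views
```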

4.1.7 Experiments

We evaluate and compare the performance of our models using the same data and protocols as in [21], which are commonly adopted in the SR literature. All our models are learned from a training set with 91 images, and tested on Set5 [42], Set14 [43] and BSD100 [44], which contain 5, 14 and 100 images, respectively. We have also trained on other, larger data sets, and observe only marginal performance change (around 0.1 dB). The original images are downsized by bicubic interpolation to generate LR–HR image pairs for both training and evaluation. The training data are augmented with translation, rotation and scaling.

4.1.7.1 Algorithm Analysis

We first visualize the four filters learned in the first layer H in Fig. 4.5. The filter patterns do not change much from the initial first- and second-order gradient operators. Some additional small coefficients are introduced in a highly structured form that captures richer high frequency details.

Figure 4.5 The four learned filters in the first layer H.

The performance of several networks during training is measured on Set5 in Fig. 4.6. Our SCN improves significantly over sparse coding (SC) [7], as it leverages data more effectively with end-to-end training. The SCN initialized according to (4.11) converges faster and better than the same model with random initialization, which indicates that the understanding of SCN based on sparse coding can help its optimization. We also train a CNN model [14] of the same size as SCN, but find its convergence speed much slower. It is reported in [14] that training a CNN takes $8 \times 10^8$ back-propagations (equivalent to $12.5 \times 10^6$ mini-batches here). To achieve the same performance as CNN, our SCN requires less than 1% of the back-propagations.

Figure 4.6 The PSNR change for ×2 SR on Set5 during training using different methods: SCN; SCN with random initialization; CNN. The horizontal dashed lines show the benchmarks of bicubic interpolation and sparse coding (SC).

The network size of SCN is mainly determined by the dictionary size $n$. Besides the default value $n = 128$, we have tried other sizes and plot their performance versus the number of network parameters in Fig. 4.7. The PSNR of SCN does not drop much as $n$ decreases from 128 to 64, but the model size and computation time can be reduced significantly, as shown in Table 4.1. Fig. 4.7 also shows the performance of CNN with various sizes. Our smallest SCN achieves higher PSNR than the largest model (CNN-L) in [45] while using only about 20% of the parameters.

Figure 4.7 PSNR for ×2 SR on Set5 using SCN and CNN with various network sizes.

Table 4.1

Time consumption for SCN to upscale the “baby” image from 256 × 256 to 512 × 512 using different dictionary size n

n 64 96 128 256 512
time (s) 0.159 0.192 0.230 0.445 1.214


Different numbers of recurrent stages $k$ have been tested for SCN, and we find increasing $k$ from 1 to 3 only improves performance by less than 0.1 dB. As a tradeoff between speed and accuracy, we use $k = 1$ throughout the section.

In Table 4.2, different network structures with cascade for scalable SR in Sect. 4.1.4.2 (in each row) are compared at different scaling factors (in each column). SCN×a denotes the model trained with fixed scaling factor a without any cascade technique. For a fixed a, we use SCN×a as a basic module and apply it one or more times to super-resolve images for different upscaling factors, which is shown in each row of Table 4.2. It is observed that SCN×2 can perform as well as the scale-specific model for small scaling factor (1.5), and much better for large scaling factors (3 and 4). Note that the cascade of SCN×1.5 does not lead to good results since artifacts quickly get amplified through many repetitive upscalings. Therefore, we use SCN×2 as the default building block for CSCN, and drop the notation ×2 when there is no ambiguity. The last row in Table 4.2 shows that a CSCN trained using the multi-scale objective in (4.7) can further improve the SR results for scaling factors 3 and 4, as the second SCN in the cascade is trained to be robust to the artifacts generated by the first one.

Table 4.2

PSNR of different network cascading schemes on Set5, evaluated for different scaling factors in each column

scaling factor ×1.5 ×2 ×3 ×4
SCN×1.5 40.14 36.41 30.33 29.02
SCN×2 40.15 36.93 32.99 30.70
SCN×3 39.88 36.76 32.87 30.63
SCN×4 39.69 36.54 32.76 30.55
CSCN 40.15 36.93 33.10 30.86


As shown in [45], the amount of training data plays an important role in deep learning. In order to evaluate the effect of the amount of data on training CSCN, we change the training set from a relatively small set of 91 images (Set91) [21] to two other sets: the 199 out of 200 training images2 in the BSD500 dataset (BSD200) [44], and a subset of 7500 images from the ILSVRC2013 dataset [71]. A model of exactly the same architecture without any cascade is trained on each data set, and another 100 images from the ILSVRC2013 dataset are included as an additional test set. From Table 4.3, we can observe that the CSCN trained on BSD200 consistently outperforms its counterpart trained on Set91 by around 0.1 dB on all test data. However, the performance of the model trained on ILSVRC2013 is only slightly different from that of the one trained on BSD200, which shows the saturation of performance as the amount of training data increases. The inferior quality of images in ILSVRC2013 may be a hurdle to further improving the performance. Therefore, our method is robust to the choice of training data and benefits only marginally from a larger set of training images.

Table 4.3

Effect of various training sets on the PSNR of ×2 upscaling with single view SCN

Training Set \ Test Set    Set5     Set14    BSD100   ILSVRC (100)
Set91                      36.93    32.56    31.40    32.13
BSD200                     36.97    32.69    31.55    32.27
ILSVRC (7.5k)              36.84    32.67    31.51    32.31

4.1.7.2 Comparison with State-of-the-Art

We compare the proposed CSCN with other recent SR methods on all the images in Set5, Set14 and BSD100 for different scaling factors. Table 4.4 shows the PSNR and structural similarity (SSIM) [46] for adjusted anchored neighborhood regression (A+) [25], CNN [14], CNN trained with larger model size and much more data (CNN-L) [45], the proposed CSCN, and CSCN with our multi-view testing (CSCN-MV). We do not list other methods [7,21,43,47,24] whose performance is worse than A+ or CNN-L.

Table 4.4

PSNR (SSIM) comparison on three test data sets among different methods. The best and second best performance for each setting are highlighted in color in the web version of this chapter. The performance gain of our best model over all the others' best is shown in the last row.

Data Set         Set5                          Set14                         BSD100
Upscaling        ×2       ×3       ×4          ×2       ×3       ×4          ×2       ×3       ×4
A+ [25]          36.55    32.59    30.29       32.28    29.13    27.33       30.78    28.18    26.77
                 (0.9544) (0.9088) (0.8603)    (0.9056) (0.8188) (0.7491)    (0.8773) (0.7808) (0.7085)
CNN [14]         36.34    32.39    30.09       32.18    29.00    27.20       31.11    28.20    26.70
                 (0.9521) (0.9033) (0.8530)    (0.9039) (0.8145) (0.7413)    (0.8835) (0.7794) (0.7018)
CNN-L [45]       36.66    32.75    30.49       32.45    29.30    27.50       31.36    28.41    26.90
                 (0.9542) (0.9090) (0.8628)    (0.9067) (0.8215) (0.7513)    (0.8879) (0.7863) (0.7103)
CSCN             –        –        –           –        –        –           –        –        –
CSCN-MV          –        –        –           –        –        –           –        –        –
Our              0.55     0.59     0.65        0.35     0.27     0.31        0.24     0.19     0.24
Improvement      (0.0029) (0.0083) (0.0161)    (0.0034) (0.0048) (0.0106)    (0.0036) (0.0042) (0.0088)

It can be seen from Table 4.4 that CSCN performs consistently better than all previous methods in both PSNR and SSIM, and with multi-view testing the results can be further improved. CNN-L improves over CNN by increasing the model parameters and training data. However, it is still not as good as CSCN, which is trained with a much smaller model size and on a much smaller data set. Clearly, the better model structure of CSCN makes it less dependent on model capacity and training data in improving performance. Our models are generally more advantageous for large scaling factors due to the cascade structure. A larger performance gain is observed on Set5 than on the other two test sets because Set5 has statistics more similar to the training set.

The visual qualities of the SR results generated by sparse coding (SC) [7], CNN and CSCN are compared in Fig. 4.8. Our approach produces image patterns with sharper boundaries and richer textures, and is free of the ringing artifacts observable in the other two methods.

Figure 4.8 SR results given by SC [7] (first row), CNN [14] (second row) and our CSCN (third row). Images from left to right: the “monarch” image upscaled by ×3; the “zebra” image upscaled by ×3; the “comic” image upscaled by ×3.

Fig. 4.9 shows the SR results on the “chip” image compared among more methods including the self-example based method (SE) [10] and the deep network cascade (DNC) [12]. SE and DNC can generate very sharp edges on this image, but also introduce artifacts and blurs on corners and fine structures due to the lack of self-similar patches. On the contrary, the CSCN method recovers all the structures of the characters without any distortion.

Figure 4.9 The “chip” image upscaled by ×4 using different methods.

4.1.7.3 Robustness to Real SR Scenarios

We evaluate the performance of the proposed practical SR methods in Sect. 4.1.5, by providing the empirical results of several experiments for the two aforementioned approaches.

Data-Driven SR by Fine-Tuning

The proposed method in Sect. 4.1.5.1 is data-driven, and thus the generic SCN can be easily adapted for a particular task with a small amount of training samples. We demonstrate the performance of this method in the application of enlarging low-DPI scanned document images with heavy noise. We first obtain several pairs of LR and HR images by scanning a document under two settings of 150 and 300 DPI. Then we fine-tune our generic CSCN model using only one pair of scanned images for a few iterations. Fig. 4.11 shows the upscaled results for the 150 DPI scanned image. As shown by the SR results in Fig. 4.11, the CSCN before adaptation is very sensitive to LR measurement corruption, so the enlarged texts in (B) are much more corrupted than they are in the nearest neighbor upscaled image (A). However, the adapted CSCN model removes almost all the artifacts and restores clear texts in (C), which is promising for practical applications such as quality enhancement of online scanned books and restoration of legacy documents.

Regularized Iterative SR

We now show experimental results of practical SR for blurred and noisy LR images, using the proposed regularized iterative methods in Sect. 4.1.5.2. We first compare the SR performance on blurry images using the proposed method in Sect. 4.1.5.2 with several other recent methods [50,48,49], using the same test images and settings. All these methods are designed for blurry LR input, while our model is trained on sharp LR input. As shown in Table 4.5, our model achieves much better results than the competitors. Note the speed of our model is also much faster than the conventional sparse coding based methods.

Table 4.5

PSNR of ×3 upscaling on LR images with different blurring kernels

Kernel        Gaussian σ = 1.0                  Gaussian σ = 1.6
Method        CSR [48]   NLM [49]   SCN         CSR [48]   GSC [50]   SCN
Butterfly     27.87      26.93      28.70       28.19      25.48      29.03
Parrots       30.17      29.93      30.75       30.68      29.20      30.83
Parthenon     26.89      27.06      27.23       26.44      –          27.40
Bike          24.41      24.38      24.81       24.72      23.78      25.11
Flower        29.14      28.86      29.50       29.54      28.30      29.78
Girl          33.59      33.44      33.57       33.68      33.13      33.65
Hat           31.09      30.81      31.32       31.33      30.29      31.62
Leaves        26.99      26.47      27.45       27.60      24.78      27.87
Plants        33.92      33.27      34.35       34.00      32.33      34.53
Raccoon       29.09      28.99      29.29       28.81      –          29.16
Average       29.32      29.26      29.65       29.63      28.25      29.90

To test the performance of upscaling noisy LR images, we simulate additive Gaussian noise for the LR input images at 4 different noise levels ($\sigma = 5, 10, 15, 20$). We compare the practical SR results on Set5 obtained from the following algorithms: directly using SCN, our proposed iterative SCN method using BM3D as the denoising regularizer (iterative BM3D-SCN), and fine-tuning SCN with additional noisy training pairs. Note that knowing the underlying corruption model of the real LR images (e.g., the noise distribution or blurring kernel), one can always synthesize realistic training pairs for fine-tuning the generic SCN. In other words, whenever the iterative SR method is feasible, one can alternatively apply our proposed data-driven method. However, the converse is not true. Therefore, knowledge of the corruption model of real measurements can be considered a stronger assumption than the availability of real training image pairs. Correspondingly, the SR performances of these two methods are evaluated when both can be applied. We also provide the results of methods directly using another generic SR model, CNN-L [45], and the similar iterative SR method involving CNN-L (iterative BM3D-CNN-L).

The practical SR results are listed in Table 4.6. We observe improved PSNR with our proposed regularized iterative SR method over all noise levels. The proposed iterative BM3D-SCN achieves much higher PSNR than directly using SCN. The performance gap (in terms of SR PSNR) between iterative BM3D-SCN and direct SCN becomes larger as the noise level increases. A similar observation can be made when comparing iterative BM3D-CNN-L and direct CNN-L. Compared to the method of fine-tuning SCN, the iterative BM3D-SCN method demonstrates better empirical performance, with a 0.3 dB improvement on average. The iterative BM3D-CNN-L method provides results comparable to the iterative BM3D-SCN method, which demonstrates that our proposed regularized iterative SCN scheme can be easily extended to other SR methods and is able to effectively handle noisy LR measurements.

Table 4.6

PSNR values for ×2 upscaling noisy LR images in Set5 by directly using SCN (Direct SCN), directly using CNN-L (Direct CNN-L), SCN after fine-tuning on new noisy training data (Fine-tuning SCN), the iterative method of BM3D & SCN (Iterative BM3D-SCN), and the iterative method of BM3D & CNN-L (Iterative BM3D-CNN-L)

σ 5 10 15 20
Direct SCN 30.23 25.11 21.81 19.45
Direct CNN-L 30.47 25.32 21.91 19.46
Fine-tuning SCN 33.03 31.00 29.46 28.44
Iterative BM3D-SCN 33.51 31.22 29.65 28.61
Iterative BM3D-CNN-L 33.42 31.16 29.62 28.59


An example of upscaling noisy LR images using the aforementioned methods is shown in Fig. 4.10. Both fine-tuning SCN and iterative BM3D-SCN are able to significantly suppress the additive noise, while many artifacts induced by noise are observed in the SR result of direct SCN. It is notable that the fine-tuning SCN method performs better at recovering textures, while the iterative BM3D-SCN method is preferable in smooth regions.

Figure 4.10 The “building” image corrupted by additive Gaussian noise of σ = 10 and then upscaled by ×2 using different methods.
Figure 4.11 Low-DPI scanned document upscaled by ×4 using different methods.

4.1.8 Subjective Evaluation

In addition to quantitative evaluation, subjective perception is an important metric for evaluating SR techniques intended for commercial use. In order to more thoroughly compare various SR methods and quantify the subjective perception, we utilize an online platform for subjective evaluation of SR results from several methods [23], including bicubic, SC [7], SE [10], self-example regression (SER) [51], CNN [14] and CSCN. Each participant is invited to conduct several pair-wise comparisons of SR results from different methods. The two methods whose SR results are displayed in each pair are randomly selected. Ground-truth HR images are also included when they are available as references. For each pair, the participant needs to select the one with better perceptual quality. A snapshot of our evaluation webpage3 is shown in Fig. 4.12.

Figure 4.12 The user interface of a web-based image quality evaluation, where two images are displayed side by side and local details can be magnified by moving mouse over the corresponding region.

Specifically, there are SR results over 6 images with different scaling factors: “kid”×4, “chip”×4, “statue”×4, “lion”×3, “temple”×3 and “train”×3. The images are shown in Fig. 4.13. All the visual comparison results are then summarized into a $7 \times 7$ winning matrix for 7 methods (including the ground truth). A Bradley–Terry model [52] is fitted to these results, and the subjective score of each method is estimated according to this model. In the Bradley–Terry model, the probability that an object $X$ is favored over $Y$ is assumed to be

$$p(X \succ Y) = \frac{e^{s_X}}{e^{s_X} + e^{s_Y}} = \frac{1}{1 + e^{s_Y - s_X}}, \qquad (4.12)$$

where $s_X$ and $s_Y$ are the subjective scores of $X$ and $Y$. The scores $\mathbf{s}$ for all the objects can be jointly estimated by maximizing the log-likelihood of the pairwise comparison observations:

$$\max_{\mathbf{s}} \sum_{i,j} w_{ij} \log\!\left(\frac{1}{1 + e^{s_j - s_i}}\right), \qquad (4.13)$$

where $w_{ij}$ is the $(i,j)$th element in the winning matrix $W$, i.e., the number of times that method $i$ is favored over method $j$. We use the Newton–Raphson method to solve Eq. (4.13), and set the score of the ground truth to 1 to avoid scale ambiguity.
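As an illustration of the score estimation, the sketch below maximizes the log-likelihood in Eq. (4.13) by simple gradient ascent instead of Newton–Raphson, pinning the ground-truth score to 1 to remove the scale ambiguity; W is the 7×7 winning matrix, and the step size and iteration count are arbitrary choices.

```python
import numpy as np

def bradley_terry_scores(W, gt_index=0, n_iter=5000, lr=1e-3):
    n = W.shape[0]
    s = np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(s[None, :] - s[:, None]))   # p[i, j]: prob. that i beats j
        # gradient of the log-likelihood in Eq. (4.13) with respect to the scores
        grad = np.sum(W * (1.0 - p), axis=1) - np.sum(W.T * (1.0 - p.T), axis=1)
        s += lr * grad
        s[gt_index] = 1.0                                   # pin the ground-truth score
    return s
```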

Figure 4.13 The 6 images used in subjective evaluation.

Now we describe the detailed experiment results. We have a total of 270 participants giving 720 pairwise comparisons over six images with different scaling factors, which are shown in Fig. 4.13. Not every participant completed all the comparisons but their partial responses are still useful.

Fig. 4.14 shows the estimated scores for the six SR methods in our evaluation, with the score of the ground truth normalized to 1. As expected, all the SR methods have much lower scores than the ground truth, showing the great challenge of the SR problem. The bicubic interpolation is significantly worse than the other SR methods. The proposed CSCN method outperforms other previous state-of-the-art methods by a large margin, demonstrating its superior visual quality. It should be noted that the visual difference between some image pairs is very subtle. Nevertheless, the human subjects are able to perceive such differences when seeing the two images side by side, and therefore make consistent ratings. The CNN model is less competitive in the subjective evaluation than it is in the PSNR comparison. This indicates that the visually appealing image appearance produced by CSCN should be attributed to the regularization from sparse representation, which cannot be easily learned by merely minimizing reconstruction error as in CNN.

Figure 4.14 Subjective SR quality scores for different methods including bicubic, SC [7], SE [10], SER [51], CNN [14] and the proposed CSCN. The score for ground-truth result is 1.

4.1.9 Conclusion and Future Work

We propose a new approach for image SR by combining the strengths of sparse representation and deep networks, and make considerable improvement over existing deep and shallow SR models both quantitatively and qualitatively. Besides producing outstanding SR results, the domain knowledge in the form of sparse coding also benefits training speed and model compactness. Furthermore, we investigate the cascade of networks for both fixed and incremental scaling factors so as to enhance SR performance. In addition, robustness to real SR scenarios is discussed for handling non-ideal LR measurements. More generally, our observation is in line with other recent extensions of CNNs that incorporate better domain knowledge for different tasks.

In future work, we will apply the SCN model to other problems where sparse coding can be useful. The interaction between deep networks for low- and high-level vision tasks, such as in [53], will also be explored. Another interesting direction is video super-resolution [54], the task of inferring a high-resolution video sequence from a low-resolution one. This problem has drawn growing attention in both the research community and industry recently. From the research perspective, it is challenging because video signals vary in both the temporal and spatial dimensions. Meanwhile, with the prevalence of high-definition (HD) displays such as HDTV in the market, there is an increasing need for converting low quality video sequences to high definition so that they can be played on HD displays in a visually pleasing manner.

There are two types of relation that can be utilized for video SR: the intra-frame spatial relation and the inter-frame temporal relation. Neural network based models have successfully demonstrated a strong capability of modeling the spatial relation. Compared with the intra-frame spatial relation, the inter-frame temporal relation is more important for video SR, as research on vision systems suggests that the human visual system is more sensitive to motion [55]. Thus it is essential for a video SR algorithm to capture and model the effect of motion information on visual perception. Sparse priors have been shown to be useful for video SR [56]. In the future, we will explore employing sparse coding domain knowledge in deep network models to exploit the temporal relation among consecutive LR video frames.

4.2 Learning a Mixture of Deep Networks for Single Image Super-Resolution4

4.2.1 Introduction

The main difficulty of single image SR resides in the loss of much information in the degradation process. Since the known variables from the LR image are usually greatly outnumbered by the unknowns in the HR image, the problem is highly ill-posed.

A large number of single image SR methods have been proposed in the literature, including interpolation based methods [57], edge model based methods [4] and example based methods [58,9,6,21,45,24]. Since the former two usually suffer a sharp drop in restoration performance at large upscaling factors, example based methods have recently drawn great attention from the community. They usually learn the mapping from LR images to HR images in a patch-by-patch manner, with the help of sparse representation [6,23], random forests [26] and so on. The neighbor embedding methods [58,21] and the neural network based methods [45] are two representatives of this category.

Neighbor embedding, proposed in [58,42], estimates an HR patch as a weighted average of local neighbors using the same weights as in the LR feature space, based on the assumption that LR/HR patch pairs share similar local geometry on low-dimensional nonlinear manifolds. The coding coefficients are first acquired by representing each LR patch as a weighted average of its local neighbors, and the HR counterpart is then estimated by multiplying the coding coefficients with the corresponding training HR patches. Anchored neighborhood regression (ANR) is utilized in [21] to improve the neighbor embedding methods; it partitions the feature space into a number of clusters using the learned dictionary atoms as a set of anchor points and then learns a regressor for each cluster of patches. This approach was shown in [21] to be superior to its global regression counterpart. Other variants of learning a mixture of SR regressors can be found in [25,59,60].

Recently, neural network based models have demonstrated a strong capability for single image SR [12,45,61], due to their large model capacity and an end-to-end learning strategy that dispenses with hand-crafted features.

In this section, we propose a method that combines the merits of the neighbor embedding methods and the neural network based methods by learning a mixture of neural networks for single image SR. The entire image signal space can be partitioned into several subspaces, and we dedicate one SR module to the image signals in each subspace; their synergy captures the complex relation between the LR image signal and its HR counterpart better than a single generic model. To take advantage of the end-to-end learning strategy of neural network based methods, we choose neural networks as the SR inference modules, incorporate them into one unified network, and design a branch of the network that predicts pixel-level weights for the HR estimates from each SR module before they are adaptively aggregated to form the final HR image.

A systematic analysis of different network architectures is conducted through extensive experiments, focusing on the relation between SR performance and architectural choices, and it demonstrates the benefit of utilizing a mixture of SR models. Our proposed approach is contrasted with other currently popular approaches on a large number of test images, and it consistently achieves state-of-the-art performance along with more flexibility in model design.

The section is organized as follows. The proposed method is introduced and explained in detail in Sect. 4.2.2. Implementation details are provided in Sect. 4.2.3. Section 4.2.4 describes our experimental results, in which we thoroughly analyze different network architectures and compare the performance of our method with other current SR methods both quantitatively and qualitatively. Finally, in Sect. 4.2.5, we conclude the section and discuss future work. A more detailed version of this work can be found in [62].

4.2.2 The Proposed Method

First we give an overview of our method. The LR image serves as the input to our method. There are several SR inference modules $\{B_i\}_{i=1}^{N}$ in our method. Each of them, $B_i$, is dedicated to inferring a certain class of image patches, and is applied to the LR input image to predict an HR estimate. We also devise an adaptive weight module, T, to adaptively combine the HR estimates from the SR inference modules at the pixel level. When we select neural networks as the SR inference modules, all the components can be incorporated into a unified neural network and learned jointly. The final estimated HR image is adaptively aggregated from the estimates of all SR inference modules. With the multi-branch design of our network, the super-resolution performance is improved compared with its single-branch counterpart, as will be shown in Sect. 4.2.4. An overview of our method is shown in Fig. 4.15.

Figure 4.15 An overview of our proposed method. It consists of a number of SR inference modules and an adaptive weight module. Each SR inference module is dedicated to inferring a certain class of image local patterns, and is independently applied on the LR image to predict one HR estimate. These estimates are adaptively combined using pixel-wise aggregation weights from the adaptive weight module in order to form the final HR image.

Now we will introduce the network architecture in detail.

SR Inference Module. Taking the LR image as input, each SR inference module is designed to better capture the complex relation between a certain class of LR image signals and its HR counterpart, while predicting an HR estimate. For the sake of inference accuracy, we choose as the SR inference module a recent sparse coding based network (SCN) [61], which implicitly incorporates the sparse prior into a neural network by employing the learned iterative shrinkage and thresholding algorithm (LISTA), and closely mimics the sparse coding based image SR method [7]. The architecture of SCN is shown in Fig. 4.2. Note that the design of the SR inference module is not limited to SCN, and any other neural network based SR model, e.g., SRCNN [45], can work as the SR inference module as well. The output of $B_i$ serves as an estimate of the final HR frame.
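To make the LISTA ingredient concrete, the following is a minimal PyTorch sketch of one unfolded learned ISTA stage of the kind SCN builds on. The layer sizes, the number of unfolded iterations and the threshold initialization are illustrative assumptions rather than the exact SCN configuration; see Fig. 4.2 and [61] for the actual architecture.

```python
import torch
import torch.nn as nn

class LISTABlock(nn.Module):
    """Sketch of a learned ISTA (LISTA) stage: a feed-forward encoder, a
    mutual-inhibition matrix and learnable soft-thresholds, unfolded for a
    fixed number of iterations. All sizes here are assumptions."""
    def __init__(self, input_dim=100, dict_size=128, n_iters=2):
        super().__init__()
        self.W = nn.Linear(input_dim, dict_size, bias=False)       # encoder
        self.S = nn.Linear(dict_size, dict_size, bias=False)       # inhibition matrix
        self.theta = nn.Parameter(torch.full((dict_size,), 0.1))   # learned thresholds
        self.n_iters = n_iters

    def soft_threshold(self, x):
        return torch.sign(x) * torch.clamp(torch.abs(x) - self.theta, min=0.0)

    def forward(self, y):
        # y: a batch of vectorized LR patch features, shape (batch, input_dim)
        b = self.W(y)
        z = self.soft_threshold(b)
        for _ in range(self.n_iters):        # unfolded ISTA iterations
            z = self.soft_threshold(b + self.S(z))
        return z                             # approximately sparse codes
```

An HR patch estimate is then obtained by multiplying the sparse codes with a learned HR dictionary, as in the sparse coding based SR method [7].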

Adaptive Weight Module. The goal of this module is to model the selectivity of the HR estimates from every SR inference module. We propose assigning pixel-wise aggregation weights to each HR estimate, and again the design of this module is open to any operation in the field of neural networks. Taking into account computation cost and efficiency, we utilize only three convolutional layers for this module, and ReLU is applied to the filter responses to introduce nonlinearity. This module finally outputs the pixel-level weight maps for all the HR estimates.

Aggregation. Each SR inference module's output is multiplied pixel-wise with its corresponding weight map from the adaptive weight module, and these products are then summed up to form the final estimated HR frame. If we use $y$ to denote the LR input image, a function $W(y;\theta_w)$ with parameters $\theta_w$ to represent the behavior of the adaptive weight module, and a function $F_{B_i}(y;\theta_{B_i})$ with parameters $\theta_{B_i}$ to represent the output of SR inference module $B_i$, the final estimated HR image $F(y;\Theta)$ can be expressed as

$$F(y;\Theta) = \sum_{i=1}^{N} W_i(y;\theta_w) \odot F_{B_i}(y;\theta_{B_i}), \tag{4.14}$$

where ⊙ denotes the point-wise multiplication.
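A minimal PyTorch sketch of how Eq. (4.14) can be realized as a single network is given below. The branch constructor `make_branch` stands in for any SR inference module (SCN in our case), the filter counts of the weight module follow the description in Sect. 4.2.3, and everything else (names, padding, the assumption that the LR input is already interpolated to the HR grid) is illustrative.

```python
import torch
import torch.nn as nn

class MixtureSRNet(nn.Module):
    """Sketch of the unified network: N SR inference branches plus an adaptive
    weight module whose N pixel-wise weight maps aggregate the HR estimates
    as in Eq. (4.14)."""
    def __init__(self, make_branch, n_branches=4):
        super().__init__()
        self.branches = nn.ModuleList([make_branch() for _ in range(n_branches)])
        self.weight_module = nn.Sequential(   # three conv layers, ReLU on the first two
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, n_branches, kernel_size=5, padding=2),
        )

    def forward(self, y):
        # y: LR luminance input, shape (batch, 1, H, W)
        estimates = torch.stack([b(y) for b in self.branches], dim=1)  # (B, N, 1, H, W)
        weights = self.weight_module(y).unsqueeze(2)                   # (B, N, 1, H, W)
        return (weights * estimates).sum(dim=1)                        # Eq. (4.14)
```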

In training, our model tries to minimize the loss between the target HR frame and the predicted output, as

$$\min_{\Theta} \sum_{j} \left\| F(y_j;\Theta) - x_j \right\|_2^2, \tag{4.15}$$

where $F(y;\Theta)$ represents the output of our model, $x_j$ is the $j$th HR image and $y_j$ is the corresponding LR image; $\Theta$ is the set of all parameters in our model.

If we plug Eq. (4.14) into Eq. (4.15), then the cost function can be expanded as:

$$\min_{\theta_w,\,\{\theta_{B_i}\}_{i=1}^{N}} \sum_{j} \Big\| \sum_{i=1}^{N} W_i(y_j;\theta_w) \odot F_{B_i}(y_j;\theta_{B_i}) - x_j \Big\|_2^2. \tag{4.16}$$
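Since all branches and the adaptive weight module live in one network, Eq. (4.16) can be minimized jointly with ordinary stochastic gradient descent. The sketch below shows one such update step under that assumption; the optimizer settings anticipate the implementation details in Sect. 4.2.3, while the function name and the example learning rate are hypothetical.

```python
import torch

def train_step(model, optimizer, lr_batch, hr_batch):
    """One joint update of all branch and weight-module parameters from the
    squared reconstruction error of Eq. (4.15)/(4.16)."""
    optimizer.zero_grad()
    pred = model(lr_batch)                    # aggregated HR estimate F(y; Theta)
    loss = torch.sum((pred - hr_batch) ** 2)  # squared l2 loss
    loss.backward()
    optimizer.step()
    return loss.item()

# usage sketch: SGD with momentum 0.9 and batch size 64, as in Sect. 4.2.3
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
```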

4.2.3 Implementation Details

We conduct experiments following the protocols in [21]. Different learning based methods use different training data in the literature. We choose the 91 images proposed in [6] to be consistent with [25,26,61]. These training data are augmented with translation, rotation and scaling, providing approximately 8 million training samples of 56×56 pixels.

Our model is tested on three benchmark data sets, which are Set5 [42], Set14 [43] and BSD100 [44]. The ground-truth images are downscaled by bicubic interpolation to generate LR–HR image pairs for both training and testing.

Following the convention in [21,61], we convert each color image into the YCbCr colorspace and only process the luminance channel with our model; bicubic interpolation is applied to the chrominance channels, because the human visual system is more sensitive to details in intensity than in color.
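The sketch below illustrates this colorspace convention with PIL and NumPy. Here `sr_luminance` is a placeholder for the trained model applied to the Y channel and is assumed to return an HR-sized array in [0, 1]; it is not part of the chapter's code.

```python
import numpy as np
from PIL import Image

def super_resolve_color(img_path, sr_luminance, scale):
    """Super-resolve only the luminance channel; upscale chrominance with
    bicubic interpolation (sketch; `sr_luminance` is a hypothetical stand-in
    for the trained network)."""
    lr = Image.open(img_path).convert("YCbCr")
    y, cb, cr = lr.split()
    hr_size = (lr.width * scale, lr.height * scale)

    y_hr = sr_luminance(np.asarray(y, dtype=np.float32) / 255.0)    # network on Y
    y_hr = Image.fromarray(np.uint8(np.clip(y_hr, 0.0, 1.0) * 255.0))
    cb_hr = cb.resize(hr_size, Image.BICUBIC)                       # bicubic on Cb
    cr_hr = cr.resize(hr_size, Image.BICUBIC)                       # bicubic on Cr

    return Image.merge("YCbCr", (y_hr, cb_hr, cr_hr)).convert("RGB")
```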

Each SR inference module adopts the network architecture of SCN, while the filters of all three convolutional layers in the adaptive weight module have a spatial size of 5×5, and the numbers of filters in the three layers are set to 32, 16 and N, the number of SR inference modules.

Our network is trained on a machine with 12 Intel Xeon 2.67 GHz CPUs and one Nvidia TITAN X GPU. For the adaptive weight module, we employ a constant learning rate of $10^{-5}$ and initialize the weights from a Gaussian distribution, while we keep the learning rate and initialization method of [61] for the SR inference modules. Standard gradient descent is employed to train our network with a batch size of 64 and a momentum of 0.9.

We train our model for the upscaling factor of 2. For larger upscaling factors, we adopt the model cascade technique in [61] and apply the ×2 model several times until the resulting image is at least as large as the desired size; if necessary, it is then downsized to the target resolution via bicubic interpolation.
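A small sketch of this cascading rule is given below; `sr_x2` is assumed to wrap the trained ×2 model and to double the resolution of a PIL image.

```python
import math
from PIL import Image

def cascade_sr(lr_img, sr_x2, target_scale):
    """Apply the x2 model repeatedly until the image is at least as large as
    the target, then downsize to the exact resolution with bicubic
    interpolation (sketch; `sr_x2` is a hypothetical wrapper)."""
    target = (round(lr_img.width * target_scale), round(lr_img.height * target_scale))
    n_passes = max(math.ceil(math.log2(target_scale)), 1)  # e.g. x3 -> two x2 passes
    img = lr_img
    for _ in range(n_passes):
        img = sr_x2(img)
    if img.size != target:
        img = img.resize(target, Image.BICUBIC)             # bicubic downscale
    return img
```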

4.2.4 Experimental Results

In this section, we first analyze the architecture of our proposed model and then compare it with several other recent SR methods. Finally, we provide a runtime analysis of our approach and other competing methods.

4.2.4.1 Network Architecture Analysis

In this section we investigate the relation between the number of SR inference modules and SR performance. For a fair comparison, we increase the number of inference modules while decreasing the capacity of each module, so that the total model capacity stays approximately constant. Since the chosen SR inference module, SCN [61], closely mimics the sparse coding based SR method, we can reduce the capacity of each inference module by decreasing the embedded dictionary size n (i.e., the number of filters in SCN) for sparse representation. We compare the following cases:

  •  one inference module with n=128, which is equivalent to the structure of SCN in [61], denoted as SCN (n=128). Note that there is no need to include the adaptive weight module in this case.
  •  two inference modules with n=64, denoted as MSCN-2 (n=64).
  •  four inference modules with n=32, denoted as MSCN-4 (n=32).

The average Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM) [46] are measured to quantitatively compare the SR performance of these models over Set5, Set14 and BSD100 for various upscaling factors (×2, ×3, ×4), and the results are displayed in Table 4.7.

Table 4.7

PSNR (in dB) and SSIM comparisons on Set5, Set14 and BSD100 for ×2, ×3 and ×4 upscaling factors among various network architectures. The best and the second best performance for each setting are highlighted

Benchmark SCN (n=128) MSCN-2 (n=64) MSCN-4 (n=32)
Set5 ×2 36.93 / 0.9552 Image / Image Image / Image
×3 33.10 / Image Image / Image Image / 0.9130
×4 30.86 / Image Image / 0.8709 Image / Image
Set14 ×2 32.56 / 0.9069 Image / Image Image / Image
×3 29.41 / 0.8235 Image / Image Image / Image
×4 27.64 / 0.7578 Image / Image Image / Image
BSD100 ×2 31.40 / 0.8884 Image / Image Image / Image
×3 28.50 / 0.7885 Image / Image Image / Image
×4 27.03 / 0.7161 Image / Image Image / Image


It can be observed that MSCN-2 (n=64) usually outperforms the original SCN network, i.e., SCN (n=128), and that MSCN-4 (n=32) achieves the best SR performance, improving marginally over MSCN-2 (n=64). This demonstrates the effectiveness of our approach, namely that each SR inference module is able to super-resolve its own class of image signals better than a single generic inference model.

To further analyze the adaptive weight module, we select several input images, namely butterfly, zebra and barbara, and visualize the four weight maps, one for each SR inference module in the network. Moreover, we record the index of the maximum weight across all weight maps at every pixel and generate a max label map. These results are displayed in Fig. 4.16.

Figure 4.16 Weight maps for the HR estimate from every SR inference module in MSCN-4 are given in the first four rows. The map (max label map) which records the index of the maximum weight across all weight maps at every pixel is shown in the last row. Images from left to right: the butterfly image upscaled by ×2; the zebra image upscaled by ×2; the barbara image upscaled by ×2.

From these visualizations it can be seen that weight map 4 shows high responses in many uniform regions and thus mainly contributes to the low frequency regions of the HR predictions. In contrast, weight maps 1, 2 and 3 have large responses in regions with various edges and textures, and restore the high frequency details of the HR predictions. These weight maps reveal that the sub-networks work in a complementary manner to construct the final HR predictions. In the max label map, similar structures and patterns of images usually share the same label, indicating that similar textures and patterns tend to be super-resolved by the same inference module.
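The max label map in the last row of Fig. 4.16 can be produced directly from the weight maps, as in the small sketch below (the function name is hypothetical).

```python
import numpy as np

def max_label_map(weight_maps):
    """Given the N pixel-wise weight maps from the adaptive weight module,
    stacked into an array of shape (N, H, W), return the (H, W) map that
    records at every pixel the index of the branch with the largest weight."""
    return np.argmax(np.asarray(weight_maps), axis=0)
```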

4.2.4.2 Comparison with State-of-the-Art

We conduct experiments on all the images in Set5, Set14 and BSD100 for different upscaling factors (×2, ×3 and ×4) to quantitatively and qualitatively compare our approach with a number of state-of-the-art image SR methods. Table 4.8 shows the PSNR and SSIM for adjusted anchored neighborhood regression (A+) [25], SRCNN [45], RFL [26], SelfEx [24] and our proposed model, MSCN-4 (n=128), which consists of four SCN modules with n=128. The single generic SCN without multi-view testing in [61], i.e., SCN (n=128), is also included for comparison as the baseline. Note that all the methods use the same 91 images [6] for training except SRCNN [45], which uses 395,909 images from ImageNet as training data.

Table 4.8

PSNR (SSIM) comparison on three test data sets for various upscaling factors among different methods. The best and the second best performance are highlighted. The performance gain of our best model over all the other models' best is shown in the last row

Data Set Set5 Set14 BSD100
Upscaling ×2 ×3 ×4 ×2 ×3 ×4 ×2 ×3 ×4
A+ [25] 36.55 32.59 30.29 32.28 29.13 27.33 31.21 28.29 26.82
(0.9544) (0.9088) (0.8603) (0.9056) (0.8188) (0.7491) (0.8863) (0.7835) (0.7087)
SRCNN [45] 36.66 32.75 30.49 32.45 29.30 27.50 31.36 28.41 26.90
(0.9542) (0.9090) (0.8628) (0.9067) (0.8215) (0.7513) (0.8879) (0.7863) (0.7103)
RFL [26] 36.54 32.43 30.14 32.26 29.05 27.24 31.16 28.22 26.75
(0.9537) (0.9057) (0.8548) (0.9040) (0.8164) (0.7451) (0.8840) (0.7806) (0.7054)
SelfEx [24] 36.49 32.58 30.31 32.22 29.16 27.40 31.18 28.29 26.84
(0.9537) (0.9093) (0.8619) (0.9034) (0.8196) (0.7518) (0.8855) (0.7840) (0.7106)
SCN [61] Image Image Image Image Image Image Image Image Image
Image Image Image Image Image Image Image Image Image
MSCN-4 Image Image Image Image Image Image Image Image Image
Image Image Image Image Image Image Image Image Image
Our 0.23 0.23 0.22 0.29 0.24 0.23 0.25 0.16 0.16
Improvement (0.0013) (0.0011) (0.0008) (0.0010) (0.0034) (0.0046) (0.0044) (0.0056) (0.0068)


It can be observed that our proposed model achieves the best SR performance consistently over the three data sets for various upscaling factors. It outperforms SCN (n=128), which obtains the second best results, by about 0.2 dB across all the data sets, owing to the power of multiple inference modules.

We compare the visual quality of the SR results of various methods in Fig. 4.17. The region inside the bounding box is zoomed in and shown for the sake of visual comparison. Our proposed model MSCN-4 (n=128) is able to recover sharper edges and generates fewer artifacts in its SR results.

Figure 4.17 Visual comparisons of SR results among different methods. From left to right: the ppt3 image upscaled by ×3; the 102061 image upscaled by ×3; the butterfly image upscaled by ×4.

4.2.4.3 Runtime Analysis

Besides SR performance, inference time is an important factor for SR algorithms. In this section we analyze the relation between the SR performance and the inference time of our approach. Specifically, we measure the average inference time of different network structures in our method for upscaling factor ×2 on Set14. The inference time versus the PSNR value is displayed in Fig. 4.18, where several other current SR methods [24,26,45,25] are included as references (the inference time of SRCNN is measured with the publicly available, slower CPU implementation). Generally, the more modules our network has, the more inference time is needed and the better the SR results are. By adjusting the number of SR inference modules in our network, we can trade off SR performance against computational complexity. Notably, even our slowest network is still superior in terms of inference time to the other previous SR methods.

Figure 4.18 The average PSNR and the average inference time for upscaling factor ×2 on Set14, compared among different network structures of our method and other SR methods. The inference time of SRCNN is measured with the publicly available, slower CPU implementation.

4.2.5 Conclusion and Future Work

In this section, we propose to jointly learn a mixture of deep networks for single image super-resolution, each of which serves as an SR inference module handling a certain class of image signals. An adaptive weight module is designed to predict pixel-level aggregation weights for the HR estimates. Various network architectures are analyzed in terms of SR performance and inference time, which validates the effectiveness of our model design. Extensive experiments demonstrate that our proposed model achieves outstanding SR performance along with more flexibility of design.

Recent SR approaches increase the network depth in order to boost SR accuracy [63–67]. Kim et al. [63] proposed a very deep CNN with a residual architecture to achieve outstanding SR performance, which utilizes broader contextual information with larger model capacity. Another network was designed by Kim et al. [64], which has a recursive architecture with skip connections for image SR to boost performance while exploiting only a small number of model parameters. Tai et al. [65] observed that many residual SR learning algorithms are based on either global or local residual learning, which is insufficient for very deep models; instead, they proposed a model that applies both global and local residual learning while remaining parameter efficient via recursive learning. More recently, Tong et al. [67] proposed using Densely Connected Networks (DenseNet) [68] instead of ResNet [69] as the building block for image SR. Besides developing deeper networks, we show that increasing the number of parallel branches inside the network can achieve the same goal.

In the future, this approach to image super-resolution will be explored to facilitate other high-level vision tasks [53]. While visual recognition research has made tremendous progress in recent years, most models are trained, applied and evaluated on high-quality (HQ) visual data, such as the LFW [70] and ImageNet [71] benchmarks. However, in many emerging applications such as autonomous driving, intelligent video surveillance and robotics, the performance of visual sensing and analytics can be severely degraded by various corruptions in complex unconstrained scenarios, such as limited resolution. Therefore, image super-resolution may provide one solution for feature enhancement to improve the performance of high-level vision tasks [72].

References

[1] S. Baker, T. Kanade, Limits on super-resolution and how to break them, IEEE TPAMI 2002;24(9):1167–1183.

[2] Z. Lin, H.Y. Shum, Fundamental limits of reconstruction-based superresolution algorithms under local translation, IEEE TPAMI 2004;26(1):83–97.

[3] S.C. Park, M.K. Park, M.G. Kang, Super-resolution image reconstruction: a technical overview, IEEE Signal Processing Magazine 2003;20(3):21–36.

[4] R. Fattal, Image upsampling via imposed edge statistics, ACM transactions on graphics (TOG), vol. 26. ACM; 2007:95.

[5] H.A. Aly, E. Dubois, Image up-sampling using total-variation regularization with a new observation model, IEEE TIP 2005;14(10):1647–1659.

[6] J. Yang, J. Wright, T.S. Huang, Y. Ma, Image super-resolution via sparse representation, IEEE TIP 2010;19(11):2861–2873.

[7] J. Yang, Z. Wang, Z. Lin, S. Cohen, T. Huang, Coupled dictionary training for image super-resolution, IEEE TIP 2012;21(8):3467–3478.

[8] X. Gao, K. Zhang, D. Tao, X. Li, Image super-resolution with sparse neighbor embedding, IEEE TIP 2012;21(7):3194–3205.

[9] D. Glasner, S. Bagon, M. Irani, Super-resolution from a single image, ICCV. IEEE; 2009:349–356.

[10] G. Freedman, R. Fattal, Image and video upscaling from local self-examples, ACM Transactions on Graphics 2011;30(2):12.

[11] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS. 2012:1097–1105.

[12] Z. Cui, H. Chang, S. Shan, B. Zhong, X. Chen, Deep network cascade for image super-resolution, ECCV. Springer; 2014:49–64.

[13] Z. Wang, Y. Yang, Z. Wang, S. Chang, W. Han, J. Yang, et al., Self-tuned deep super resolution, CVPR workshops. 2015:1–8.

[14] C. Dong, C.C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, ECCV. Springer; 2014:184–199.

[15] C. Osendorfer, H. Soyer, P. van der Smagt, Image super-resolution with fast approximate convolutional sparse coding, Neural information processing. Springer; 2014:250–257.

[16] K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, ICML. 2010:399–406.

[17] B. Wen, S. Ravishankar, Y. Bresler, Structured overcomplete sparsifying transform learning with convergence guarantees and applications, IJCV 2015;114(2):137–167.

[18] C.Y. Yang, C. Ma, M.H. Yang, Single-image super-resolution: a benchmark, ECCV. 2014:372–386.

[19] C.E. Duchon, Lanczos filtering in one and two dimensions, Journal of Applied Meteorology 1979;18(8):1016–1022.

[20] J. Sun, Z. Xu, H.Y. Shum, Gradient profile prior and its applications in image super-resolution and enhancement, IEEE Transactions on Image Processing 2011;20(6):1529–1542.

[21] R. Timofte, V. De, L.V. Gool, Anchored neighborhood regression for fast example-based super-resolution, ICCV. IEEE; 2013:1920–1927.

[22] W.T. Freeman, T.R. Jones, E.C. Pasztor, Example-based super-resolution, IEEE Computer Graphics and Applications 2002;22(2):56–65.

[23] Z. Wang, Y. Yang, Z. Wang, S. Chang, T.S. Huang, Learning super-resolution jointly from external and internal examples, IEEE TIP 2015;24(11):4359–4371.

[24] J.B. Huang, A. Singh, N. Ahuja, Single image super-resolution from transformed self-exemplars, CVPR. IEEE; 2015:5197–5206.

[25] R. Timofte, V. De Smet, L. Van Gool, A+: adjusted anchored neighborhood regression for fast super-resolution, ACCV. Springer; 2014:111–126.

[26] S. Schulter, C. Leistner, H. Bischof, Fast and accurate image upscaling with super-resolution forests, CVPR. 2015:3791–3799.

[27] J. Salvador, E. Pérez-Pellitero, Naive Bayes super-resolution forest, Proceedings of the IEEE international conference on computer vision. 2015:325–333.

[28] Z. Wang, Z. Wang, S. Chang, J. Yang, T. Huang, A joint perspective towards image super-resolution: unifying external- and self-examples, Applications of computer vision (WACV), 2014 IEEE winter conference on. IEEE; 2014:596–603.

[29] K. Kavukcuoglu, M. Ranzato, Y. LeCun, Fast inference in sparse coding algorithms with applications to object recognition, arXiv preprint arXiv:1010.3467; 2010.

[30] I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Communications on Pure and Applied Mathematics 2004;57(11):1413–1457.

[31] C.J. Rozell, D.H. Johnson, R.G. Baraniuk, B.A. Olshausen, Sparse coding via thresholding and local competition in neural circuits, Neural Computation 2008;20(10):2526–2563.

[32] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint arXiv:1312.4400; 2013.

[33] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, ICML. 2010:807–814.

[34] S. Chang, W. Han, J. Tang, G.J. Qi, C.C. Aggarwal, T.S. Huang, Heterogeneous network embedding via deep architectures, ACM SIGKDD. ACM; 2015.

[35] Q.V. Le, Building high-level features using large scale unsupervised learning, ICASSP. 2013:8595–8598.

[36] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, CVPR. 2014:1717–1724.

[37] A. Buades, B. Coll, J.M. Morel, A non-local algorithm for image denoising, CVPR. 2005.

[38] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE TSP 2006;54(11):4311–4322.

[39] K. Dabov, A. Foi, V. Katkovnik, K. Egiazarian, Image denoising by sparse 3D transform-domain collaborative filtering, IEEE TIP 2007;16(8):2080–2095.

[40] B. Wen, S. Ravishankar, Y. Bresler, Video denoising by online 3D sparsifying transform learning, ICIP. 2015.

[41] S. Ravishankar, B. Wen, Y. Bresler, Online sparsifying transform learning – part I: algorithms, IEEE Journal of Selected Topics in Signal Processing 2015;9(4):625–636.

[42] M. Bevilacqua, A. Roumy, C. Guillemot, M.L.A. Morel, Low-complexity single-image super-resolution based on nonnegative neighbor embedding, BMVC. BMVA Press; 2012.

[43] R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-representations, Curves and surfaces. Springer; 2012:711–730.

[44] D. Martin, C. Fowlkes, D. Tal, J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, ICCV, vol. 2. IEEE; 2001:416–423.

[45] C. Dong, C.C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, TPAMI 2015.

[46] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE TIP 2004;13(4):600–612.

[47] K.I. Kim, Y. Kwon, Single-image super-resolution using sparse regression and natural image prior, IEEE TPAMI 2010;32(6):1127–1133.

[48] W. Dong, L. Zhang, G. Shi, Centralized sparse representation for image restoration, ICCV. 2011:1259–1266.

[49] K. Zhang, X. Gao, D. Tao, X. Li, Single image super-resolution with non-local means and steering kernel regression, IEEE TIP 2012;21(11):4544–4556.

[50] X. Lu, H. Yuan, P. Yan, Y. Yuan, X. Li, Geometry constrained sparse coding for single image super-resolution, CVPR. 2012:1648–1655.

[51] J. Yang, Z. Lin, S. Cohen, Fast image super-resolution based on in-place example regression, CVPR. 2013:1059–1066.

[52] R.A. Bradley, M.E. Terry, Rank analysis of incomplete block designs: I. The method of paired comparisons, Biometrika 1952:324–345.

[53] Z. Wang, S. Chang, Y. Yang, D. Liu, T.S. Huang, Studying very low resolution recognition using deep networks, CVPR. IEEE; 2016:4792–4800.

[54] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, et al., Learning temporal dynamics for video super-resolution: a deep learning approach, IEEE Transactions on Image Processing 2018;27(7):3432–3445.

[55] M. Dorr, T. Martinetz, K.R. Gegenfurtner, E. Barth, Variability of eye movements when viewing dynamic natural scenes, Journal of Vision 2010;10(10):28.

[56] Q. Dai, S. Yoo, A. Kappeler, A.K. Katsaggelos, Sparse representation-based multiple frame video super-resolution, IEEE Transactions on Image Processing 2017;26(2):765–781.

[57] B.S. Morse, D. Schwartzwald, Image magnification using level-set reconstruction, CVPR 2001, vol. 1. IEEE; 2001.

[58] H. Chang, D.Y. Yeung, Y. Xiong, Super-resolution through neighbor embedding, CVPR, vol. 1. IEEE; 2004:275–282.

[59] D. Dai, R. Timofte, L. Van Gool, Jointly optimized regressors for image super-resolution, Eurographics, vol. 7. 2015:8.

[60] R. Timofte, R. Rothe, L. Van Gool, Seven ways to improve example-based single image super resolution, CVPR. IEEE; 2016.

[61] Z. Wang, D. Liu, J. Yang, W. Han, T. Huang, Deep networks for image super-resolution with sparse prior, ICCV. 2015:370–378.

[62] D. Liu, Z. Wang, N. Nasrabadi, T. Huang, Learning a mixture of deep networks for single image super-resolution, ACCV. Springer; 2016:145–156.

[63] J. Kim, J.K. Lee, K.M. Lee, Accurate image super-resolution using very deep convolutional networks, CVPR. IEEE; 2016.

[64] J. Kim, J.K. Lee, K.M. Lee, Deeply-recursive convolutional network for image super-resolution, CVPR. 2016.

[65] Y. Tai, J. Yang, X. Liu, Image super-resolution via deep recursive residual network, CVPR. 2017.

[66] Y. Fan, H. Shi, J. Yu, D. Liu, W. Han, H. Yu, et al., Balanced two-stage residual networks for image super-resolution, Computer vision and pattern recognition workshops (CVPRW), 2017 IEEE conference on. IEEE; 2017:1157–1164.

[67] T. Tong, G. Li, X. Liu, Q. Gao, Image super-resolution using dense skip connections, ICCV. 2017.

[68] G. Huang, Z. Liu, K.Q. Weinberger, L. van der Maaten, Densely connected convolutional networks, Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[69] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition. 2016:770–778.

[70] G.B. Huang, M. Mattar, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments, Workshop on faces in real-life images: detection, alignment, and recognition. 2008.

[71] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, Computer vision and pattern recognition, 2009. CVPR 2009. IEEE conference on. IEEE; 2009:248–255.

[72] D. Liu, B. Cheng, Z. Wang, H. Zhang, T.S. Huang, Enhance visual recognition under adverse conditions via deep networks, arXiv preprint arXiv:1712.07732; 2017.


1  “©2017 IEEE. Reprinted, with permission, from Liu, Ding, Wang, Zhaowen, Wen, Bihan, Yang, Jianchao, Han, Wei, and Huang, Thomas S. “Robust single image super-resolution via deep networks with sparse prior.” IEEE Transactions on Image Processing 25.7 (2016): 3194–3207.”

2  “Since one out of 200 training images coincides with one image in Set5, we exclude it from our training set.”

3  www.ifp.illinois.edu/~wang308/survey.

4  “©2017 Springer. Reprinted, with permission, from Liu, Ding, Wang, Zhaowen, Nasrabadi, Nasser, and Huang, Thomas S. “Learning a mixture of deep networks for single image super-resolution.” Asian Conference on Computer Vision (2016).”
